MLT encoding definitions

Plain

No compression technique is applied on the data. Depending on the data type the data are stored in the following format:

Boolean:
Integer: Little Endian byte order
Long: Little Endian byte order
Float: IEEE754 floating point numbers in Little Endian byte order
Double: IEEE754 floating point numbers in Little Endian byte order
String: length and data stream

Boolean-RLE

This encoding is used for compressing boolean columns. Least-significant bit numbering (bit-endianness) is used as bit order. Have a look at the ORC specification for implementation details.

Byte-RLE

This encoding is used for compressing byte streams such as the GeometryType stream of the Geometry column. Have a look at the ORC specification for implementation details.

Dictionary Encoding

Dictionary encoding is used to compactly represent repeated values and can be applied to String and Geometry columns. In addition to the actual distinct values stored in the dictionary stream, a separate data stream for the indices into the dictionary is used.

String Dictionary Encoding

The dictionary stream contains distinct UTF-8 encoded string values. The data stream contains the indices for the dictionary encoded as UInt32. A dictionary encoded nullable string column contains of the following streams and order: Present, Length, Dictionary, Data.

All streams can be further compressed by recursively applying lightweight encodings:

Present Stream: Boolean-RLE
Length and Data Stream: see Integer encodings
Dictionary: see FSST Dictionary

FSST Dictionary Encoding

Dictionary encoding needs fully repeating strings to reduce size. However, the attributes of geospatial data often contains strings which share a common prefix but are not completely equal such as the localized country names. FSST replaces frequently occurring substrings while maintaining support for efficient scans and random look-ups. It is used in MLT to further compress the UTF-8 encoded dictionary values. A FSST Dictionary encoded nullable string column contains of the following streams and order: Present, SymbolLength, SymbolTable, String Length, Dictionary (Compressed Corpus) For implementation details see the following paper

Available implementations:

C++: https://github.com/cwida/fsst
Java: Work in Progress
Js/WebAssembly decoder: Work in Progress (simple to implement)

Shared Dictionary Encoding

A Shared dictionary encoding can be used, to share a common dictionary between the columns. In addition, nested fields can use a shared dictionary encoding, to share a common dictionary between the fields. This can be applied for example on localized values of the name:* columns of an OSM dataset that can be identical across fields. If a shared dictionary encoding is used for nested fields, all fields that use the shared dictionary must be grouped one behind the other in the file and prefixed with the dictionary.

Shared dictionary encoded nullable string columns consists of the following streams and order: Length, Dictionary, Present1, Data1, Present2, Data2

Shared FSST Dictionary encoded nullable string columns consists of the following streams and order: SymbolLength, SymbolTable, String Length, Dictionary (Compressed Corpus), Present1, Data1, Present2, Data2

Vertex Dictionary Encoding

Uses an additional VertexOffsets stream for storing the indices to the coordinates of the vertices in the VertexBuffer stream. The vertices in the VertexBuffer are sorted on a Hilbert-Curve and delta encoded in combination with a null suppression technique.

Morton Vertex Dictionary Encoding

The coordinates of the VertexBuffer are transformed to a 1D integer by using the Morton Code. Then the data are sorted according to the morton code and further compressed by applying an Integer compression technique.

Integer Encodings

In general most data in MLT are stored in the form of arrays of integers. Using efficient and fast algorithms for compressing arrays of integers is therefore crucial.

Logical Level Technique

Integers are encoded based on the two lightweight compression techniques delta and rle as well as a combination of both schemes called delta-rle.

Delta

compute the differences between consecutive elements of a sequence e.g. x2 - x1, x3 - x2, … Is always used in combination with a physical level technique to reduce the number of bits needed to store the delta values.

RLE

see https://en.wikipedia.org/wiki/Run-length_encoding for a basic explanation All runs and values are stored in two separate buffers. If the values are unsigned integers ZigZag encoding is applied to the values buffer. Both buffers are then further compressed using a null suppression technique.

Combining delta and rle (Delta-RLE)

Delta compress the values and then apply RLE encoding. This technique is efficient for ascending sequences like Ids.

Physical Level Technique (Null Suppression)

ZigZag encoding

For encoding unsigned integers, ZigZag encoding is used for both of the following null suppression techniques. ZigZag encoding uses the least significant bit for encoding the sign.

Varint encoding

Byte-aligned null suppression technique try to compress an integer using a minimal number of bytes. For implementation details see Protobuf .

SIMD-FastPFOR

Bit-aligned null suppression technique that compress an integer compress an integer using a minimal number of bits by using a patched approach to store exceptions (outliers) in a separate section to keep the overall bit width small.

see https://arxiv.org/pdf/1209.2137.pdf and https://ayende.com/blog/199524-C/integer-compression-the-fastpfor-code

Available implementations:

C++: https://github.com/lemire/FastPFor
Java: https://github.com/lemire/JavaFastPFOR
C#: https://github.com/Genbox/CSharpFastPFOR
Js/WebAssembly: Work in Progress (higher implementation complexity)

Float encodings

Adaptive Lossless floating-Point Compression (ALP)

see https://dl.acm.org/doi/pdf/10.1145/3626717 for a detailed explanation. ALP is a lossless floating-point compression technique that is used to compress the float values in MLT.

Available implementations:

C++: https://github.com/duckdb/duckdb/tree/16830ded9b4882b90f35c2d7e2d740547ae7d3fc/src/include/duckdb/storage/compression/alp
Java: Work in Progress
JS/WebAssembly: Work in Progress