polars_row/
lib.rs

1//! Row format as defined in `arrow-rs`.
2//! This currently partially implements that format only for needed types.
3//! For completeness sake the format as defined by `arrow-rs` is as followed:
4//! Converts [`ArrayRef`] columns into a [row-oriented](self) format.
5//!
6//! ## Overview
7//!
8//! The row format is a variable length byte sequence created by
9//! concatenating the encoded form of each column. The encoding for
10//! each column depends on its datatype (and sort options).
11//!
12//! The encoding is carefully designed in such a way that escaping is
13//! unnecessary: it is never ambiguous as to whether a byte is part of
14//! a sentinel (e.g. null) or a value.
15//!
16//! ## Unsigned Integer Encoding
17//!
18//! A null integer is encoded as a `0_u8`, followed by a zero-ed number of bytes corresponding
19//! to the integer's length.
20//!
21//! A valid integer is encoded as `1_u8`, followed by the big-endian representation of the
22//! integer.
23//!
24//! ```text
25//!               ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
26//!    3          │03│00│00│00│      │01│00│00│00│03│
27//!               └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
28//!               ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
29//!   258         │02│01│00│00│      │01│00│00│01│02│
30//!               └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
31//!               ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
32//!  23423        │7F│5B│00│00│      │01│00│00│5B│7F│
33//!               └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
34//!               ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
35//!  NULL         │??│??│??│??│      │00│00│00│00│00│
36//!               └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
37//!
38//!              32-bit (4 bytes)        Row Format
39//!  Value        Little Endian
40//! ```
41//!
42//! ## Signed Integer Encoding
43//!
44//! Signed integers have their most significant sign bit flipped, and are then encoded in the
45//! same manner as an unsigned integer.
46//!
47//! ```text
48//!        ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┬──┐
49//!     5  │05│00│00│00│       │05│00│00│80│       │01│80│00│00│05│
50//!        └──┴──┴──┴──┘       └──┴──┴──┴──┘       └──┴──┴──┴──┴──┘
51//!        ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┬──┐
52//!    -5  │FB│FF│FF│FF│       │FB│FF│FF│7F│       │01│7F│FF│FF│FB│
53//!        └──┴──┴──┴──┘       └──┴──┴──┴──┘       └──┴──┴──┴──┴──┘
54//!
55//!  Value  32-bit (4 bytes)    High bit flipped      Row Format
56//!          Little Endian
57//! ```
58//!
59//! ## Float Encoding
60//!
61//! Floats are converted from IEEE 754 representation to a signed integer representation
62//! by flipping all bar the sign bit if they are negative after normalizing nans
63//! and signed zeros to a canonical representation.
64//!
65//! They are then encoded in the same manner as a signed integer.
66//!
67//! ## Fixed Length Bytes Encoding
68//!
69//! Fixed length bytes are encoded in the same fashion as primitive types above.
70//!
71//! For a fixed length array of length `n`:
72//!
73//! A null is encoded as `0_u8` null sentinel followed by `n` `0_u8` bytes
74//!
75//! A valid value is encoded as `1_u8` followed by the value bytes
76//!
77//! ## Variable Length Bytes (including Strings) Encoding
78//!
79//! A null is encoded as a `0_u8`.
80//!
81//! An empty byte array is encoded as `1_u8`.
82//!
83//! A non-null, non-empty byte array is encoded as `2_u8` followed by the byte array
84//! encoded using a block based scheme described below.
85//!
86//! The byte array is broken up into 32-byte blocks, each block is written in turn
87//! to the output, followed by `0xFF_u8`. The final block is padded to 32-bytes
88//! with `0_u8` and written to the output, followed by the un-padded length in bytes
89//! of this final block as a `u8`.
90//!
91//! Note the following example encodings use a block size of 4 bytes,
92//! as opposed to 32 bytes for brevity:
93//!
94//! ```text
95//!                       ┌───┬───┬───┬───┬───┬───┐
96//!  "MEEP"               │02 │'M'│'E'│'E'│'P'│04 │
97//!                       └───┴───┴───┴───┴───┴───┘
98//!
99//!                       ┌───┐
100//!  ""                   │01 |
101//!                       └───┘
102//!
103//!  NULL                 ┌───┐
104//!                       │00 │
105//!                       └───┘
106//!
107//! "Defenestration"      ┌───┬───┬───┬───┬───┬───┐
108//!                       │02 │'D'│'e'│'f'│'e'│FF │
109//!                       └───┼───┼───┼───┼───┼───┤
110//!                           │'n'│'e'│'s'│'t'│FF │
111//!                           ├───┼───┼───┼───┼───┤
112//!                           │'r'│'a'│'t'│'r'│FF │
113//!                           ├───┼───┼───┼───┼───┤
114//!                           │'a'│'t'│'i'│'o'│FF │
115//!                           ├───┼───┼───┼───┼───┤
116//!                           │'n'│00 │00 │00 │01 │
117//!                           └───┴───┴───┴───┴───┘
118//! ```
119//!
120//! This approach is loosely inspired by [COBS] encoding, and chosen over more traditional
121//! [byte stuffing] as it is more amenable to vectorisation, in particular AVX-256.
122//!
123//! For the unordered row encoding we use a simpler scheme, we prepend the length
124//! encoded as 4 bytes followed by the raw data, with nulls being marked with a
125//! length of u32::MAX.
126//!
127//! ## Dictionary Encoding
128//!
129//! [`RowsEncoded`] needs to support converting dictionary encoded arrays with unsorted, and
130//! potentially distinct dictionaries. One simple mechanism to avoid this would be to reverse
131//! the dictionary encoding, and encode the array values directly, however, this would lose
132//! the benefits of dictionary encoding to reduce memory and CPU consumption.
133//!
134//! As such the [`RowsEncoded`] creates an order-preserving mapping
135//! for each dictionary encoded column, which allows new dictionary
136//! values to be added whilst preserving the sort order.
137//!
138//! A null dictionary value is encoded as `0_u8`.
139//!
140//! A non-null dictionary value is encoded as `1_u8` followed by a null-terminated byte array
141//! key determined by the order-preserving dictionary encoding
142//!
143//! ```text
144//! ┌──────────┐                 ┌─────┐
145//! │  "Bar"   │ ───────────────▶│ 01  │
146//! └──────────┘                 └─────┘
147//! ┌──────────┐                 ┌─────┬─────┐
148//! │"Fabulous"│ ───────────────▶│ 01  │ 02  │
149//! └──────────┘                 └─────┴─────┘
150//! ┌──────────┐                 ┌─────┐
151//! │  "Soup"  │ ───────────────▶│ 05  │
152//! └──────────┘                 └─────┘
153//! ┌──────────┐                 ┌─────┐
154//! │   "ZZ"   │ ───────────────▶│ 07  │
155//! └──────────┘                 └─────┘
156//!
157//! Example Order Preserving Mapping
158//! ```
159//! Using the map above, the corresponding row format will be
160//!
161//! ```text
162//!                           ┌─────┬─────┬─────┬─────┐
163//!    "Fabulous"             │ 01  │ 03  │ 05  │ 00  │
164//!                           └─────┴─────┴─────┴─────┘
165//!
166//!                           ┌─────┬─────┬─────┐
167//!    "ZZ"                   │ 01  │ 07  │ 00  │
168//!                           └─────┴─────┴─────┘
169//!
170//!                           ┌─────┐
171//!     NULL                  │ 00  │
172//!                           └─────┘
173//!
174//!      Input                  Row Format
175//! ```
176//!
177//! ## Struct Encoding
178//!
179//! A null is encoded as a `0_u8`.
180//!
181//! A valid value is encoded as `1_u8` followed by the row encoding of each child.
182//!
183//! This encoding effectively flattens the schema in a depth-first fashion.
184//!
185//! For example
186//!
187//! ```text
188//! ┌───────┬────────────────────────┬───────┐
189//! │ Int32 │ Struct[Int32, Float32] │ Int32 │
190//! └───────┴────────────────────────┴───────┘
191//! ```
192//!
193//! Is encoded as
194//!
195//! ```text
196//! ┌───────┬───────────────┬───────┬─────────┬───────┐
197//! │ Int32 │ Null Sentinel │ Int32 │ Float32 │ Int32 │
198//! └───────┴───────────────┴───────┴─────────┴───────┘
199//! ```
200//!
201//! ## List Encoding
202//!
203//! Lists are encoded by first encoding all child elements to the row format.
204//!
205//! A "canonical byte array" is then constructed by concatenating the row
206//! encodings of all their elements into a single binary array, followed
207//! by the lengths of each encoded row, and the number of elements, encoded
208//! as big endian `u32`.
209//!
210//! This canonical byte array is then encoded using the variable length byte
211//! encoding described above.
212//!
213//! _The lengths are not strictly necessary but greatly simplify decode, they
214//! may be removed in a future iteration_.
215//!
216//! For example given:
217//!
218//! ```text
219//! [1_u8, 2_u8, 3_u8]
220//! [1_u8, null]
221//! []
222//! null
223//! ```
224//!
225//! The elements would be converted to:
226//!
227//! ```text
228//!     ┌──┬──┐     ┌──┬──┐     ┌──┬──┐     ┌──┬──┐        ┌──┬──┐
229//!  1  │01│01│  2  │01│02│  3  │01│03│  1  │01│01│  null  │00│00│
230//!     └──┴──┘     └──┴──┘     └──┴──┘     └──┴──┘        └──┴──┘
231//!```
232//!
233//! Which would be grouped into the following canonical byte arrays:
234//!
235//! ```text
236//!                         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
237//!  [1_u8, 2_u8, 3_u8]     │01│01│01│02│01│03│00│00│00│02│00│00│00│02│00│00│00│02│00│00│00│03│
238//!                         └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
239//!                          └──── rows ────┘   └───────── row lengths ─────────┘  └─ count ─┘
240//!
241//!                         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
242//!  [1_u8, null]           │01│01│00│00│00│00│00│02│00│00│00│02│00│00│00│02│
243//!                         └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
244//!```
245//!
246//! With `[]` represented by an empty byte array, and `null` a null byte array.
247//!
248//! These byte arrays will then be encoded using the variable length byte encoding
249//! described above.
250//!
251//! # Ordering
252//!
253//! ## Float Ordering
254//!
255//! Floats are totally ordered just like in the rest of Polars,
256//! -inf < neg < -0.0 = 0.0 < pos < inf < nan, with all nans being equal.
257//!
258//! ## Null Ordering
259//!
260//! The encoding described above will order nulls first, this can be inverted by representing
261//! nulls as `0xFF_u8` instead of `0_u8`
262//!
263//! ## Reverse Column Ordering
264//!
265//! The order of a given column can be reversed by negating the encoded bytes of non-null values
266//!
267//! [COBS]: https://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing
268//! [byte stuffing]: https://en.wikipedia.org/wiki/High-Level_Data_Link_Control#Asynchronous_framing
269
270extern crate core;
271
272pub mod decode;
273pub mod encode;
274pub(crate) mod fixed;
275mod row;
276mod utils;
277pub(crate) mod variable;
278mod widths;
279
280use arrow::array::*;
281pub type ArrayRef = Box<dyn Array>;
282
283pub use encode::{
284    convert_columns, convert_columns_amortized, convert_columns_amortized_no_order,
285    convert_columns_no_order,
286};
287pub use row::{RowEncodingCategoricalContext, RowEncodingContext, RowEncodingOptions, RowsEncoded};