polars_row/lib.rs
1//! Row format as defined in `arrow-rs`.
2//! This currently partially implements that format only for needed types.
3//! For completeness sake the format as defined by `arrow-rs` is as followed:
4//! Converts [`ArrayRef`] columns into a [row-oriented](self) format.
5//!
6//! ## Overview
7//!
8//! The row format is a variable length byte sequence created by
9//! concatenating the encoded form of each column. The encoding for
10//! each column depends on its datatype (and sort options).
11//!
12//! The encoding is carefully designed in such a way that escaping is
13//! unnecessary: it is never ambiguous as to whether a byte is part of
14//! a sentinel (e.g. null) or a value.
15//!
16//! ## Unsigned Integer Encoding
17//!
18//! A null integer is encoded as a `0_u8`, followed by a zero-ed number of bytes corresponding
19//! to the integer's length.
20//!
21//! A valid integer is encoded as `1_u8`, followed by the big-endian representation of the
22//! integer.
23//!
24//! ```text
25//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
26//! 3 │03│00│00│00│ │01│00│00│00│03│
27//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
28//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
29//! 258 │02│01│00│00│ │01│00│00│01│02│
30//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
31//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
32//! 23423 │7F│5B│00│00│ │01│00│00│5B│7F│
33//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
34//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
35//! NULL │??│??│??│??│ │00│00│00│00│00│
36//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
37//!
38//! 32-bit (4 bytes) Row Format
39//! Value Little Endian
40//! ```
41//!
42//! ## Signed Integer Encoding
43//!
44//! Signed integers have their most significant sign bit flipped, and are then encoded in the
45//! same manner as an unsigned integer.
46//!
47//! ```text
48//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
49//! 5 │05│00│00│00│ │05│00│00│80│ │01│80│00│00│05│
50//! └──┴──┴──┴──┘ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
51//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
52//! -5 │FB│FF│FF│FF│ │FB│FF│FF│7F│ │01│7F│FF│FF│FB│
53//! └──┴──┴──┴──┘ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
54//!
55//! Value 32-bit (4 bytes) High bit flipped Row Format
56//! Little Endian
57//! ```
58//!
59//! ## Float Encoding
60//!
61//! Floats are converted from IEEE 754 representation to a signed integer representation
62//! by flipping all bar the sign bit if they are negative after normalizing nans
63//! and signed zeros to a canonical representation.
64//!
65//! They are then encoded in the same manner as a signed integer.
66//!
67//! ## Fixed Length Bytes Encoding
68//!
69//! Fixed length bytes are encoded in the same fashion as primitive types above.
70//!
71//! For a fixed length array of length `n`:
72//!
73//! A null is encoded as `0_u8` null sentinel followed by `n` `0_u8` bytes
74//!
75//! A valid value is encoded as `1_u8` followed by the value bytes
76//!
77//! ## Variable Length Bytes (including Strings) Encoding
78//!
79//! A null is encoded as a `0_u8`.
80//!
81//! An empty byte array is encoded as `1_u8`.
82//!
83//! A non-null, non-empty byte array is encoded as `2_u8` followed by the byte array
84//! encoded using a block based scheme described below.
85//!
86//! The byte array is broken up into 32-byte blocks, each block is written in turn
87//! to the output, followed by `0xFF_u8`. The final block is padded to 32-bytes
88//! with `0_u8` and written to the output, followed by the un-padded length in bytes
89//! of this final block as a `u8`.
90//!
91//! Note the following example encodings use a block size of 4 bytes,
92//! as opposed to 32 bytes for brevity:
93//!
94//! ```text
95//! ┌───┬───┬───┬───┬───┬───┐
96//! "MEEP" │02 │'M'│'E'│'E'│'P'│04 │
97//! └───┴───┴───┴───┴───┴───┘
98//!
99//! ┌───┐
100//! "" │01 |
101//! └───┘
102//!
103//! NULL ┌───┐
104//! │00 │
105//! └───┘
106//!
107//! "Defenestration" ┌───┬───┬───┬───┬───┬───┐
108//! │02 │'D'│'e'│'f'│'e'│FF │
109//! └───┼───┼───┼───┼───┼───┤
110//! │'n'│'e'│'s'│'t'│FF │
111//! ├───┼───┼───┼───┼───┤
112//! │'r'│'a'│'t'│'r'│FF │
113//! ├───┼───┼───┼───┼───┤
114//! │'a'│'t'│'i'│'o'│FF │
115//! ├───┼───┼───┼───┼───┤
116//! │'n'│00 │00 │00 │01 │
117//! └───┴───┴───┴───┴───┘
118//! ```
119//!
120//! This approach is loosely inspired by [COBS] encoding, and chosen over more traditional
121//! [byte stuffing] as it is more amenable to vectorisation, in particular AVX-256.
122//!
123//! For the unordered row encoding we use a simpler scheme, we prepend the length
124//! encoded as 4 bytes followed by the raw data, with nulls being marked with a
125//! length of u32::MAX.
126//!
127//! ## Dictionary Encoding
128//!
129//! [`RowsEncoded`] needs to support converting dictionary encoded arrays with unsorted, and
130//! potentially distinct dictionaries. One simple mechanism to avoid this would be to reverse
131//! the dictionary encoding, and encode the array values directly, however, this would lose
132//! the benefits of dictionary encoding to reduce memory and CPU consumption.
133//!
134//! As such the [`RowsEncoded`] creates an order-preserving mapping
135//! for each dictionary encoded column, which allows new dictionary
136//! values to be added whilst preserving the sort order.
137//!
138//! A null dictionary value is encoded as `0_u8`.
139//!
140//! A non-null dictionary value is encoded as `1_u8` followed by a null-terminated byte array
141//! key determined by the order-preserving dictionary encoding
142//!
143//! ```text
144//! ┌──────────┐ ┌─────┐
145//! │ "Bar" │ ───────────────▶│ 01 │
146//! └──────────┘ └─────┘
147//! ┌──────────┐ ┌─────┬─────┐
148//! │"Fabulous"│ ───────────────▶│ 01 │ 02 │
149//! └──────────┘ └─────┴─────┘
150//! ┌──────────┐ ┌─────┐
151//! │ "Soup" │ ───────────────▶│ 05 │
152//! └──────────┘ └─────┘
153//! ┌──────────┐ ┌─────┐
154//! │ "ZZ" │ ───────────────▶│ 07 │
155//! └──────────┘ └─────┘
156//!
157//! Example Order Preserving Mapping
158//! ```
159//! Using the map above, the corresponding row format will be
160//!
161//! ```text
162//! ┌─────┬─────┬─────┬─────┐
163//! "Fabulous" │ 01 │ 03 │ 05 │ 00 │
164//! └─────┴─────┴─────┴─────┘
165//!
166//! ┌─────┬─────┬─────┐
167//! "ZZ" │ 01 │ 07 │ 00 │
168//! └─────┴─────┴─────┘
169//!
170//! ┌─────┐
171//! NULL │ 00 │
172//! └─────┘
173//!
174//! Input Row Format
175//! ```
176//!
177//! ## Struct Encoding
178//!
179//! A null is encoded as a `0_u8`.
180//!
181//! A valid value is encoded as `1_u8` followed by the row encoding of each child.
182//!
183//! This encoding effectively flattens the schema in a depth-first fashion.
184//!
185//! For example
186//!
187//! ```text
188//! ┌───────┬────────────────────────┬───────┐
189//! │ Int32 │ Struct[Int32, Float32] │ Int32 │
190//! └───────┴────────────────────────┴───────┘
191//! ```
192//!
193//! Is encoded as
194//!
195//! ```text
196//! ┌───────┬───────────────┬───────┬─────────┬───────┐
197//! │ Int32 │ Null Sentinel │ Int32 │ Float32 │ Int32 │
198//! └───────┴───────────────┴───────┴─────────┴───────┘
199//! ```
200//!
201//! ## List Encoding
202//!
203//! Lists are encoded by first encoding all child elements to the row format.
204//!
205//! A "canonical byte array" is then constructed by concatenating the row
206//! encodings of all their elements into a single binary array, followed
207//! by the lengths of each encoded row, and the number of elements, encoded
208//! as big endian `u32`.
209//!
210//! This canonical byte array is then encoded using the variable length byte
211//! encoding described above.
212//!
213//! _The lengths are not strictly necessary but greatly simplify decode, they
214//! may be removed in a future iteration_.
215//!
216//! For example given:
217//!
218//! ```text
219//! [1_u8, 2_u8, 3_u8]
220//! [1_u8, null]
221//! []
222//! null
223//! ```
224//!
225//! The elements would be converted to:
226//!
227//! ```text
228//! ┌──┬──┐ ┌──┬──┐ ┌──┬──┐ ┌──┬──┐ ┌──┬──┐
229//! 1 │01│01│ 2 │01│02│ 3 │01│03│ 1 │01│01│ null │00│00│
230//! └──┴──┘ └──┴──┘ └──┴──┘ └──┴──┘ └──┴──┘
231//!```
232//!
233//! Which would be grouped into the following canonical byte arrays:
234//!
235//! ```text
236//! ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
237//! [1_u8, 2_u8, 3_u8] │01│01│01│02│01│03│00│00│00│02│00│00│00│02│00│00│00│02│00│00│00│03│
238//! └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
239//! └──── rows ────┘ └───────── row lengths ─────────┘ └─ count ─┘
240//!
241//! ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
242//! [1_u8, null] │01│01│00│00│00│00│00│02│00│00│00│02│00│00│00│02│
243//! └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
244//!```
245//!
246//! With `[]` represented by an empty byte array, and `null` a null byte array.
247//!
248//! These byte arrays will then be encoded using the variable length byte encoding
249//! described above.
250//!
251//! # Ordering
252//!
253//! ## Float Ordering
254//!
255//! Floats are totally ordered just like in the rest of Polars,
256//! -inf < neg < -0.0 = 0.0 < pos < inf < nan, with all nans being equal.
257//!
258//! ## Null Ordering
259//!
260//! The encoding described above will order nulls first, this can be inverted by representing
261//! nulls as `0xFF_u8` instead of `0_u8`
262//!
263//! ## Reverse Column Ordering
264//!
265//! The order of a given column can be reversed by negating the encoded bytes of non-null values
266//!
267//! [COBS]: https://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing
268//! [byte stuffing]: https://en.wikipedia.org/wiki/High-Level_Data_Link_Control#Asynchronous_framing
269
270extern crate core;
271
272pub mod decode;
273pub mod encode;
274pub(crate) mod fixed;
275mod row;
276mod utils;
277pub(crate) mod variable;
278mod widths;
279
280use arrow::array::*;
281pub type ArrayRef = Box<dyn Array>;
282
283pub use encode::{
284 convert_columns, convert_columns_amortized, convert_columns_amortized_no_order,
285 convert_columns_no_order,
286};
287pub use row::{RowEncodingCategoricalContext, RowEncodingContext, RowEncodingOptions, RowsEncoded};