Metadata Store
The metadata store is the catalog that maps each logical column to its physical page chain. It no longer tracks MVCC timestamps or per-range buckets. Instead, we maintain an append-only list of the latest page versions plus a prefix-sum array that lets us binary-search by logical row id.
Key Types
| Type | Purpose |
|---|---|
PageDescriptor | Plain data describing the current on-disk page (id, path, offset, alloc_len, actual_len, entry_count) |
ColumnChain | In-memory append-only list of descriptors and a matching prefix array |
TableMetaStore | Owns the HashMap<column, ColumnChain> and a HashMap<page_id, PageDescriptor> |
PageDirectory | Thread-safe façade exposing latest, locate_row, locate_range, range, lookup, and registration helpers |
PendingPage | Batch-friendly request used by the writer to publish multiple descriptors at once |
Column Chains
Each column gets a ColumnChain:
ColumnChain {
pages: [PageDescriptor { entry_count: 3, … },
PageDescriptor { entry_count: 8, … }, …],
prefix_entries: [3, 11, …] // cumulative row counts
}
pagesis append-only; the newest descriptor is always at the end.prefix_entries[i]stores the cumulative row count up to and includingpages[i]. We update the tail using a SIMD-friendly helper whenever we replace the last descriptor.- Binary-searching the prefix array gives us
O(log n)lookups forlocate_rowand the bounds needed forlocate_range.
PageDescriptor Fields
| Field | Description |
|---|---|
id | Internally generated 64-bit hex string |
disk_path | Full path to the page file returned by the allocator |
offset | 4 KiB-aligned file offset |
alloc_len | Bytes reserved on disk (multiple of 4 KiB) |
actual_len | Bytes of real payload stored in that allocation |
entry_count | Logical row count contained in the page |
stats | Optional per-column min/max/null-count metadata used by the executor to skip pages |
Both lengths are stored so reads can issue a single direct I/O request for the aligned region (alloc_len) and then trim the tail back to actual_len.
Registration APIs
Most callers use the façade in PageDirectory:
pub fn latest(&self, column: &str) -> Option<PageDescriptor>;
pub fn locate_row(&self, column: &str, row: u64) -> Option<RowLocation>;
pub fn locate_range(&self, column: &str, start: u64, end: u64) -> Vec<PageSlice>;
pub fn range(&self, column: &str, l: u64, r: u64, _ts: u64) -> Vec<PageDescriptor>;
pub fn lookup(&self, id: &str) -> Option<PageDescriptor>;
pub fn register_page(&self, column: &str, path: String, offset: u64) -> Option<PageDescriptor>;
pub fn register_page_with_sizes(&self,
column: &str,
path: String,
offset: u64,
alloc_len: u64,
actual_len: u64,
entry_count: u64,
) -> Option<PageDescriptor>;
pub fn update_latest_entry_count(&self,
table: &str,
column: &str,
entry_count: u64,
) -> Result<(), CatalogError>;
register_page_with_sizes is the workhorse used by the writer. It expects both lengths so the metadata store can keep the prefix sums and on-disk footprint in sync. update_latest_entry_count lets higher-level components (sorted inserts, compaction rewrites) adjust the tail descriptor when an in-place mutation changes the logical row count without rewriting the physical page. Both functions accept entry_count == 0, allowing callers to register an empty page and populate it lazily.
Lookup Helpers
latest(column)– returns the newest descriptor by taking the tail of the column chain.locate_row(column, row)– binary-searches the prefix array, returns the descriptor plus the row offset inside that page.locate_range(column, start, end)– finds the first and last page touched by the requested span and produces aVec<PageSlice>describing the per-page offsets we should read.range(column, l, r, _ts)– convenience wrapper that drops the offsets and returns just the descriptors; the timestamp parameter is currently ignored (all callers see the latest data).
Batch Publication
Writers publish new descriptors by building a Vec<PendingPage> and calling PageDirectory::register_batch. Each PendingPage includes:
PendingPage {
column: String,
disk_path: String,
offset: u64,
alloc_len: u64,
actual_len: u64,
entry_count: u64,
replace_last: bool,
stats: Option<ColumnStats>,
}
replace_last allows the writer to atomically swap in a freshly written page (for example, after rewriting an existing range) without rebuilding the entire chain.
Concurrency
- Reads acquire a shared lock on the column chain, clone the descriptors they need, and drop the lock immediately.
- Writes grab a single
Mutexin the writer, stage all allocations, and then callregister_batch, which takes theTableMetaStorewrite lock once to append or replace descriptors. The critical section is small: just pointer swaps and prefix updates.
Why This Design
- Prefix sums let us answer “which page contains row X?” in logarithmic time.
- Keeping both
alloc_lenandactual_lenmakes the writer and reader agree on how much direct I/O to perform while keeping tail padding minimal. - Dropping MVCC timestamps simplifies the fast path; if we need multi-version readers later, we can layer that on top of the existing column chains without reintroducing nested vectors.