Satori Storage Architecture Overview

This document walks through the current storage engine blueprint implemented in src/. It assumes no prior familiarity with the codebase and covers the core data structures, caches, metadata, and execution flows that already exist in the repository.

High-Level Topology

At runtime, main.rs wires together the storage layers into the following pipeline:

┌─────────────────────────────────────────────────────────────┐
│                         TableMetaStore                      │
│  - column ranges + MVCC versions                            │
│  - owns Arc<PageMetadata> entries                           │
└──────────────┬──────────────────────────────────────────────┘
               │ Arc<PageMetadata> (id, disk path, offset)
               ▼
┌─────────────────────────────────────────────────────────────┐
│                       PageHandler                           │
│  ┌────────────────────────┬───────────────────────────────┐ │
│  │ Uncompressed Page Cache│   Compressed Page Cache       │ │
│  │  (UPC / hot Pages)     │   (CPC / cold blobs)          │ │
│  │  RwLock<LruCache>      │   RwLock<LruCache>            │ │
│  └──────────────┬─────────┴──────────────┬────────────────┘ │
│                 │                        │                  │
│             UPC hit?                 CPC hit?               │
│                 │                        │                  │
│                 ▼                        ▼                  │
│           Arc<PageCacheEntry>      Arc<PageCacheEntry>      │
│                 │                        │                  │
│                 └─────────────┬──────────┘                  │
│                               ▼                             │
│                         Compressor                          │
│                       (lz4_flex + bincode)                  │
│                               │                             │
│                               ▼                             │
│                             PageIO                          │
│                     (64B metadata + blob on disk)           │
└───────────────────────────────┬─────────────────────────────┘
                                │
                                ▼
                        Persistent Storage

All public operations (ops_handler.rs) rely on TableMetaStore to map logical column ranges to physical pages, then ask PageHandler to materialize the requested pages through the cache strata down to disk.

Data Model

Entry (src/entry.rs)

  • Entry represents a single logical row fragment belonging to a column.
  • Metadata fields (prefix_meta / suffix_meta) are placeholders for future per-entry bookkeeping.
  • current_epoch_millis() timestamps are reused for both cache LRU and MVCC bookkeeping.

Page (src/page.rs)

  • A Page groups an ordered list of Entry instances plus optional page-level metadata.
  • Pages serialize via Serde, making them compatible with bincode and LZ4 compression.
  • Page::add_entry currently appends in-memory; flushing to disk is deferred to cache eviction and IO paths.

Page
 ├─ page_metadata : String (placeholder)
 └─ entries       : Vec<Entry>
        ├─ prefix_meta : String
        ├─ data        : String
        └─ suffix_meta : String
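
As a rough sketch, these shapes translate into Rust along the following lines; the serde derives reflect the bincode/LZ4 path described later, while the exact field types and derives are assumptions rather than a copy of entry.rs and page.rs.

use serde::{Deserialize, Serialize};

// Approximate shapes of Entry and Page; the real structs may carry extra
// fields, derives, or timestamp bookkeeping (current_epoch_millis).
#[derive(Serialize, Deserialize, Clone)]
struct Entry {
    prefix_meta: String, // placeholder for future per-entry bookkeeping
    data: String,
    suffix_meta: String, // placeholder
}

#[derive(Serialize, Deserialize, Clone)]
struct Page {
    page_metadata: String, // placeholder page-level metadata
    entries: Vec<Entry>,
}

impl Page {
    // In-memory append only; flushing is deferred to eviction / IO paths.
    fn add_entry(&mut self, entry: Entry) {
        self.entries.push(entry);
    }
}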

Metadata Catalog (src/metadata_store.rs)

TableMetaStore is the authoritative mapping from logical column ranges to physical pages and their on-disk locations. It has two maps:

col_data  : HashMap<ColumnName, Arc<RwLock<Vec<TableMetaStoreEntry>>>>
page_data : HashMap<PageId, Arc<PageMetadata>>

  • PageMetadata holds immutable page IDs, disk paths, and file offsets.
  • TableMetaStoreEntry spans a contiguous (start_idx, end_idx) range and stores the MVCC history for that slice.
  • MVCCKeeperEntry captures a particular version (page_id, commit_time, locked_by placeholder).
  • Range lookups use binary search over the (start_idx, end_idx) boundaries, then pick the newest version whose commit_time is ≤ the supplied upper bound.
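
In Rust terms, the catalog looks roughly like the sketch below; integer widths, timestamp types, and the locked_by representation are assumptions, not lifted from metadata_store.rs.

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Approximate catalog shapes (metadata_store.rs).
struct PageMetadata {
    id: String,
    disk_path: String,
    offset: u64,
}

struct MVCCKeeperEntry {
    page_id: String,
    commit_time: u128,         // epoch millis
    locked_by: Option<String>, // placeholder in the current code
}

struct TableMetaStoreEntry {
    start_idx: u64,
    end_idx: u64,
    page_metas: Vec<MVCCKeeperEntry>, // versions, oldest first
}

struct TableMetaStore {
    col_data: HashMap<String, Arc<RwLock<Vec<TableMetaStoreEntry>>>>,
    page_data: HashMap<String, Arc<PageMetadata>>,
}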

Metadata Lifecycles

  • Adding a new page:
    1. add_new_page_meta fabricates a PageMetadata (ID TODO: currently stubbed as "1111111").
    2. add_new_page_to_col appends the MVCC entry to the column’s newest TableMetaStoreEntry.
  • Lookup helpers:
    • get_latest_page_meta(column) returns the head of the most recent range.
    • get_ranged_pages_meta(column, l, r, ts) returns Arc<PageMetadata> handles for each qualifying range.

Locks are held briefly: read locks cover the binary search, after which the code drops guards and clones Arc<PageMetadata> values to unblock writers quickly.
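
The sketch below captures the intent of the ranged lookup: a read lock scoped to the binary search over the range boundaries, with matching Arc<PageMetadata> handles cloned out cheaply so no heavy work happens under the lock. The partition_point-based search and the exact signature are illustrative, not the literal get_ranged_pages_meta code.

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Minimal shapes repeated here so the sketch compiles on its own.
struct PageMetadata { id: String, disk_path: String, offset: u64 }
struct MVCCKeeperEntry { page_id: String, commit_time: u128 }
struct TableMetaStoreEntry { start_idx: u64, end_idx: u64, page_metas: Vec<MVCCKeeperEntry> }

fn get_ranged_pages_meta(
    ranges: &RwLock<Vec<TableMetaStoreEntry>>,
    page_data: &HashMap<String, Arc<PageMetadata>>,
    l: u64,
    r: u64,
    ts: u128,
) -> Vec<Arc<PageMetadata>> {
    // Read lock covers only the search over (start_idx, end_idx) boundaries.
    let guard = ranges.read().unwrap();
    let first = guard.partition_point(|e| e.end_idx <= l); // binary search
    let mut out = Vec::new();
    for entry in guard[first..].iter().take_while(|e| e.start_idx <= r) {
        // Newest version whose commit_time is <= the supplied upper bound.
        if let Some(v) = entry.page_metas.iter().rev().find(|v| v.commit_time <= ts) {
            if let Some(meta) = page_data.get(&v.page_id) {
                out.push(Arc::clone(meta)); // cheap handle clone
            }
        }
    }
    drop(guard); // guard released; writers are no longer blocked
    out
}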

Metadata Diagram

col_data["users.age"]  ───────────┐
                                 ▼
Arc<RwLock<Vec<TableMetaStoreEntry>>>  (per-column ranges)
                                 │
                                 ├─ TableMetaStoreEntry #0
                                 │    start_idx = 0
                                 │    end_idx   = 1024
                                 │    page_metas:
                                 │       [ MVCCKeeperEntry(page_id="p1", commit=100),
                                 │         MVCCKeeperEntry(page_id="p2", commit=120) ]
                                 │
                                 └─ TableMetaStoreEntry #1
                                      start_idx = 1024
                                      end_idx   = 2048
                                      page_metas:
                                         [ MVCCKeeperEntry(page_id="p9", commit=180) ]

page_data:
  "p1" -> Arc<PageMetadata { id="p1", disk_path="/data/...", offset=... }>
  "p2" -> Arc<PageMetadata { ... }>
  "p9" -> Arc<PageMetadata { ... }>

Cache Hierarchy (src/page_cache.rs)

There are two symmetric caches layered by temperature:

  1. Uncompressed Page Cache (UPC) stores hot Page structs for CPU-side mutations and reads.
  2. Compressed Page Cache (CPC) stores compressed byte blobs mirroring disk layout, ready to be decompressed back into the UPC.

Each cache is a PageCache<T> with:

  • store: HashMap<PageId, PageCacheEntry<T>> for fast lookups.
  • lru_queue: BTreeSet<(used_time, PageId)> implementing an LRU eviction policy capped at LRUsize = 10.
  • PageCacheEntry<T> wraps the payload in an Arc so multiple consumers can reuse a single cached page without copying.
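
A condensed, single-cache model of that layout is sketched below; timestamp handling and the real page_cache.rs API will differ, and eviction here simply drops the entry, which is exactly the gap the TODOs listed next describe.

use std::collections::{BTreeSet, HashMap};
use std::sync::Arc;

const LRU_SIZE: usize = 10; // mirrors LRUsize in page_cache.rs

// Simplified single-cache model: HashMap for lookups, BTreeSet ordered by
// (used_time, id) for LRU eviction. The real cache sits behind an RwLock.
struct PageCacheEntry<T> {
    payload: Arc<T>,
    used_time: u128, // epoch millis at last access
}

struct PageCache<T> {
    store: HashMap<String, PageCacheEntry<T>>,
    lru_queue: BTreeSet<(u128, String)>,
}

impl<T> PageCache<T> {
    fn insert(&mut self, id: String, payload: Arc<T>, now: u128) {
        if let Some(old) = self.store.get(&id) {
            // Re-insert of an existing page: drop its stale LRU slot.
            self.lru_queue.remove(&(old.used_time, id.clone()));
        } else if self.store.len() >= LRU_SIZE {
            // Evict the least recently used entry once the cap is reached.
            if let Some((t, victim)) = self.lru_queue.iter().next().cloned() {
                self.lru_queue.remove(&(t, victim.clone()));
                self.store.remove(&victim);
                // TODO in the real code: promote the evicted page instead of
                // dropping it (UPC -> compress into CPC, CPC -> flush to disk).
            }
        }
        self.lru_queue.insert((now, id.clone()));
        self.store.insert(id, PageCacheEntry { payload, used_time: now });
    }

    fn get(&mut self, id: &str, now: u128) -> Option<Arc<T>> {
        let entry = self.store.get_mut(id)?;
        // Touch: move the entry to the most-recent end of the LRU order.
        self.lru_queue.remove(&(entry.used_time, id.to_string()));
        entry.used_time = now;
        self.lru_queue.insert((now, id.to_string()));
        Some(Arc::clone(&entry.payload))
    }
}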

Planned improvements (documented in comments but not yet implemented):

  • Drop handlers for cache entries should trigger cross-cache promotion (UPC eviction → compress into CPC; CPC eviction → flush to disk). These are currently TODOs.

Cache Flow

Request page "p42"
    │
    ├─ UPC (HashMap) hit? ──► return Arc<PageCacheEntryUncompressed>
    │
    ├─ CPC hit? ────────────► decompress, insert into UPC,
    │                         return Arc<PageCacheEntryUncompressed>
    │
    └─ Fetch from disk
            │
            ├─ PageIO::read_from_path (path, offset)
            ├─ insert compressed bytes into CPC
            └─ decompress + seed UPC

Compressor (src/compressor.rs)

  • compress: Serializes Page via bincode, compresses with lz4_flex, and returns a PageCacheEntryCompressed.
  • decompress: Reverses the process, producing a PageCacheEntryUncompressed.
  • Accepts/returns Arc<PageCacheEntry*> so the compressor can operate without extra cloning.
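
Stripped of the PageCacheEntry wrappers, the round trip reduces to the sketch below; it assumes the bincode 1.x serialize/deserialize API and the size-prepended lz4_flex helpers, and collapses error handling into expect calls.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Page {
    page_metadata: String,
    entries: Vec<String>, // simplified; the real type holds Entry values
}

// Page -> bincode bytes -> size-prepended LZ4 block.
fn compress(page: &Page) -> Vec<u8> {
    let encoded = bincode::serialize(page).expect("bincode serialize");
    lz4_flex::compress_prepend_size(&encoded)
}

// Size-prepended LZ4 block -> bincode bytes -> Page.
fn decompress(blob: &[u8]) -> Page {
    let encoded = lz4_flex::decompress_size_prepended(blob).expect("lz4 decompress");
    bincode::deserialize(&encoded).expect("bincode deserialize")
}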

Persistent Storage Layout (src/page_io.rs)

Pages on disk are stored as a 64-byte metadata prefix followed by the compressed payload:

┌───────────────────────────────┬──────────────────────────────────┐
│ 64-byte Metadata (rkyv)       │ LZ4 Compressed Page bytes        │
│ - read_size : usize           │ (output of Compressor::compress) │
└───────────────────────────────┴──────────────────────────────────┘

  • read_from_path(path, offset):
    1. Seeks to offset, reads the 64-byte prefix.
    2. Uses rkyv::archived_root for zero-copy access to metadata.
    3. Reads the exact number of bytes for the compressed page and returns PageCacheEntryCompressed.
  • write_to_path(path, offset, data):
    1. Serializes the metadata structure with rkyv.
    2. Pads the prefix to exactly 64 bytes.
    3. Writes prefix + page bytes in a single call, then sync_all() to persist.

NOTE: The file is created/truncated with File::create, which discards existing contents on every write; future work needs to open without truncation before seeking to append at non-zero offsets.
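
The read/write shape can be modeled as follows. To keep the sketch dependency-free, the 64-byte prefix stores read_size as little-endian bytes plus zero padding in place of the rkyv-archived metadata, and the file is opened with OpenOptions rather than File::create to sidestep the truncation caveat above.

use std::fs::OpenOptions;
use std::io::{Read, Seek, SeekFrom, Write};

const META_LEN: usize = 64; // fixed-size metadata prefix on disk

// Write: 64-byte prefix (here: read_size as u64 LE + zero padding) followed
// by the compressed page bytes, then fsync.
fn write_to_path(path: &str, offset: u64, blob: &[u8]) -> std::io::Result<()> {
    let mut prefix = [0u8; META_LEN];
    prefix[..8].copy_from_slice(&(blob.len() as u64).to_le_bytes());

    let mut file = OpenOptions::new().create(true).write(true).open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut record = Vec::with_capacity(META_LEN + blob.len());
    record.extend_from_slice(&prefix);
    record.extend_from_slice(blob);
    file.write_all(&record)?; // single write of prefix + payload
    file.sync_all()
}

// Read: seek to the offset, read the prefix, then read exactly read_size bytes.
fn read_from_path(path: &str, offset: u64) -> std::io::Result<Vec<u8>> {
    let mut file = OpenOptions::new().read(true).open(path)?;
    file.seek(SeekFrom::Start(offset))?;

    let mut prefix = [0u8; META_LEN];
    file.read_exact(&mut prefix)?;
    let read_size = u64::from_le_bytes(prefix[..8].try_into().unwrap()) as usize;

    let mut blob = vec![0u8; read_size];
    file.read_exact(&mut blob)?;
    Ok(blob)
}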

PageHandler (src/page_handler.rs)

PageHandler orchestrates the cache hierarchy and IO. It exposes:

  • get_page(PageMetadata) → single-page access with UPC → CPC → disk fallback.
  • get_pages(Vec<PageMetadata>) → batch access preserving original order and minimizing lock contention.
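
Conceptually, the single-page path reduces to the cascade sketched below; the caches are pared down to plain maps and the disk/decompression steps are stubbed so the snippet stands alone, rather than mirroring page_handler.rs exactly.

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type PageId = String;
struct Page { entries: Vec<String> } // stand-in for the real Page

// Pared-down caches: the real ones are PageCache<T> behind RwLocks with LRU
// bookkeeping; here they are plain maps so the cascade stays visible.
struct Caches {
    upc: RwLock<HashMap<PageId, Arc<Page>>>,    // hot, decoded pages
    cpc: RwLock<HashMap<PageId, Arc<Vec<u8>>>>, // cold, compressed blobs
}

fn get_page(caches: &Caches, id: &PageId) -> Arc<Page> {
    // 1. Hot path: uncompressed cache hit.
    if let Some(page) = caches.upc.read().unwrap().get(id) {
        return Arc::clone(page);
    }
    // 2. Warm path: compressed cache; clone the Arc out so the read lock is
    //    released before decompression.
    let blob = caches.cpc.read().unwrap().get(id).cloned();
    let blob = match blob {
        Some(b) => b,
        None => {
            // 3. Cold path: PageIO read in the real code, then seed the CPC.
            let b = Arc::new(read_from_disk(id));
            caches.cpc.write().unwrap().insert(id.clone(), Arc::clone(&b));
            b
        }
    };
    let page = Arc::new(decompress(&blob));
    caches.upc.write().unwrap().insert(id.clone(), Arc::clone(&page));
    page
}

fn read_from_disk(_id: &PageId) -> Vec<u8> { Vec::new() }            // stub
fn decompress(_blob: &[u8]) -> Page { Page { entries: Vec::new() } } // stub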

Batch Retrieval Flow

Input: [PageMetadata{id="p1"}, PageMetadata{id="p7"}, ...]

1. UPC sweep (read lock):
     collect hits in request order, track missing IDs in `meta_map`.

2. CPC sweep (read lock):
     collect compressed hits; remove from `meta_map`.

3. Decompress CPC hits:
     for each (id, Arc<PageCacheEntryCompressed>):
         decompress → PageCacheEntryUncompressed
         insert into UPC (write lock)

4. Remaining misses:
     for each metadata still in `meta_map`:
         PageIO::read_from_path → insert into CPC
         decompress → insert into UPC

5. Final UPC pass (read lock):
     gather pages in original order for any IDs not emitted in step 1.

Locks are held only around direct cache map access; decompression happens outside of locks to prevent blocking other threads.
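
The order-preserving sweep and the lock scoping reduce to roughly the following skeleton; the CPC sweep and LRU bookkeeping are folded into a stubbed cold path so the snippet stands alone.

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type PageId = String;
struct Page; // stand-in

// Order-preserving batch lookup: one read-locked UPC sweep records hits by
// position, misses are filled afterwards, and results come back in request
// order.
fn get_pages(upc: &RwLock<HashMap<PageId, Arc<Page>>>, ids: &[PageId]) -> Vec<Arc<Page>> {
    // Step 1: UPC sweep under a single short-lived read lock.
    let mut slots: Vec<Option<Arc<Page>>> = {
        let guard = upc.read().unwrap();
        ids.iter().map(|id| guard.get(id).cloned()).collect()
    }; // read lock dropped here; decompression/IO below runs lock-free

    // Steps 2-4: fill the misses (CPC hit, else disk), seeding the UPC.
    for (slot, id) in slots.iter_mut().zip(ids) {
        if slot.is_none() {
            let page = Arc::new(load_via_cpc_or_disk(id));
            upc.write().unwrap().insert(id.clone(), Arc::clone(&page));
            *slot = Some(page);
        }
    }

    // Step 5: emit in the original request order.
    slots.into_iter().map(|slot| slot.unwrap()).collect()
}

fn load_via_cpc_or_disk(_id: &PageId) -> Page { Page } // stub for the cold path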

Operation Layer (src/ops_handler.rs)

External API surface for column operations:

  • upsert_data_into_column(meta_store, page_handler, col, data)
    • Reads latest page metadata via TableMetaStore.
    • Gets the page through PageHandler.
    • Clones, mutates (Page::add_entry), and reinserts into UPC.
    • TODOs remain for updating metadata ranges and persisting dirty pages.
  • update_column_entry(meta_store, page_handler, col, data, row)
    • Similar to upsert but updates a specific row index if it exists.
  • range_scan_column_entry(meta_store, page_handler, col, l_row, r_row, ts)
    • Fetches MVCC-aware page metadata for the range.
    • Calls PageHandler::get_pages to materialize entries.
    • Returns concatenated Entry vectors (no row slicing yet).
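
Schematically, the upsert path looks like the sketch below; the collaborators are stubbed out and the signatures are illustrative, not the exact ones in ops_handler.rs.

use std::sync::Arc;

// Stand-in shapes so the sketch compiles on its own; the real types live in
// page.rs / metadata_store.rs / page_handler.rs.
#[derive(Clone)]
struct Page { entries: Vec<String> }
impl Page {
    fn add_entry(&mut self, data: String) { self.entries.push(data); }
}
struct PageMetadata { id: String }

fn get_latest_page_meta(_col: &str) -> PageMetadata { PageMetadata { id: "p1".into() } }  // stub
fn get_page(_meta: &PageMetadata) -> Arc<Page> { Arc::new(Page { entries: Vec::new() }) } // stub
fn put_back_into_upc(_meta: &PageMetadata, _page: Arc<Page>) {}                           // stub

// Schematic upsert: metadata lookup -> page materialization -> clone ->
// in-memory mutation -> reinsert into the UPC.
fn upsert_data_into_column(col: &str, data: String) {
    let meta = get_latest_page_meta(col);
    let page = get_page(&meta);
    let mut updated = (*page).clone(); // clone the cached Page, not the Arc
    updated.add_entry(data);
    put_back_into_upc(&meta, Arc::new(updated));
    // TODO (as noted in the repo): update metadata ranges and persist dirty pages.
}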

Scheduler Stub (src/scheduler.rs)

Contains a minimal thread pool (ThreadPool) backed by a channel and worker threads. It is currently unused but illustrates the intended direction for coordinating WAL/ops/background work.
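
The sketch below reproduces that channel-plus-workers pattern in miniature; the Job alias and the shutdown-on-disconnect behavior are illustrative and may not match scheduler.rs exactly.

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

// Minimal channel-backed pool: a sender handing boxed closures to workers
// that share the receiving end behind a Mutex.
struct ThreadPool {
    sender: mpsc::Sender<Job>,
    workers: Vec<thread::JoinHandle<()>>, // kept so a later shutdown could join
}

impl ThreadPool {
    fn new(size: usize) -> ThreadPool {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers: Vec<thread::JoinHandle<()>> = (0..size)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // Lock only long enough to receive the next job.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        ThreadPool { sender, workers }
    }

    fn execute<F: FnOnce() + Send + 'static>(&self, f: F) {
        self.sender.send(Box::new(f)).expect("worker channel closed");
    }
}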

WAL & Latest Modules

  • src/wal.rs and src/latest.rs are placeholders containing design notes for future write-ahead logging and in-place “latest view” storage. They document intent but contribute no runtime behavior yet.

Putting It All Together

Current execution path for a write request:

Client upsert
   │
   ▼
ops_handler::upsert_data_into_column
   │ read lock
   ▼
TableMetaStore::get_latest_page_meta
   │
   ▼
PageHandler::get_page
   ├─ UPC? -> return Arc (hot path)
   ├─ CPC? -> decompress + seed UPC
   └─ disk -> PageIO::read + CPC insert + decompress + seed UPC
   │
   ▼
Clone Arc<PageCacheEntryUncompressed>
   │ mutate Page entries
   ▼
UPC write lock
   │ add updated page back into cache
   ▼
Return success (metadata update TODO)

Current execution path for a range scan:

Client range query
   │
   ▼
ops_handler::range_scan_column_entry
   │ read lock
   ▼
TableMetaStore::get_ranged_pages_meta
   │
   ▼
PageHandler::get_pages (batch)
   ├─ UPC sweep -> immediate hits
   ├─ CPC sweep -> decompress + UPC
   └─ Disk fetch -> CPC insert -> decompress + UPC
   │
   ▼
Gather Pages in order
   │
   ▼
Collect Entry vectors (no row slicing yet)

Open TODOs Highlighted in Code

  • Real page ID generation and persistence of metadata updates.
  • UPC eviction should compress into CPC; CPC eviction should flush to disk.
  • get_page_path_and_offset should guard against missing IDs (currently unwraps).
  • WAL integration, scheduling, and compaction logic are future work but already have design notes.
  • Page::add_entry needs a disk persistence hook once the eviction path is implemented.

This blueprint should give new contributors a map of the current system and the intended direction. Use it alongside inline comments in src/ to trace the exact code paths.