Storage Layer
The storage layer manages disk allocation, file organization, and page I/O operations. It consists of the DirectBlockAllocator for space management and PageIO for reading/writing compressed pages to disk.
DirectBlockAllocator (src/writer/allocator.rs)
The allocator manages disk space in 256 KiB blocks with 4K alignment for O_DIRECT compatibility.
Constants
const ALIGN_4K: u64 = 4096;
const BLOCK_SIZE: u64 = 262_144; // 256 KiB
const FILE_MAX_BYTES: u64 = 4_294_967_296; // 4 GiB
const DATA_DIR: &str = "storage";
File State
struct FileState {
    file_id: u32,
    offset: u64,
    file: File,
    max_size: u64,
}
Files follow the naming pattern: storage/data.{file_id:05} (data.00000, data.00001, …)
Allocation
pub struct PageAllocation {
    pub file_id: u32,
    pub path: String,
    pub offset: u64,
    pub actual_len: u64,
    pub alloc_len: u64,
}
Allocation Strategy:
compute_alloc_len(actual_len):
    full_blocks = actual_len / 256K
    tail = actual_len % 256K
    alloc_len = (full_blocks * 256K) + round_up_4k(tail)

round_up_4k(len):
    (len + 4095) & !4095
Example:
actual_len = 270,000 bytes
full_blocks = 270,000 / 262,144 = 1 block
tail = 270,000 % 262,144 = 7,856 bytes
tail_aligned = round_up_4k(7,856) = 8,192 bytes (2 pages)
alloc_len = (1 * 262,144) + 8,192 = 270,336 bytes
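In code, the sizing rule is just integer arithmetic. A minimal sketch (constant and function names mirror the pseudocode above; the actual allocator may organize this differently):

```rust
const ALIGN_4K: u64 = 4096;
const BLOCK_SIZE: u64 = 262_144; // 256 KiB

/// Round a length up to the next 4 KiB boundary.
fn round_up_4k(len: u64) -> u64 {
    (len + (ALIGN_4K - 1)) & !(ALIGN_4K - 1)
}

/// Whole 256 KiB blocks plus a 4 KiB-aligned tail.
fn compute_alloc_len(actual_len: u64) -> u64 {
    let full_blocks = actual_len / BLOCK_SIZE;
    let tail = actual_len % BLOCK_SIZE;
    full_blocks * BLOCK_SIZE + round_up_4k(tail)
}

fn main() {
    // Worked example from the text: 270,000 bytes -> 270,336 bytes.
    assert_eq!(round_up_4k(7_856), 8_192);
    assert_eq!(compute_alloc_len(270_000), 270_336);
}
```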
File Rotation
When the current file would exceed 4 GiB, the allocator rotates to a new file:
if offset + size > FILE_MAX_BYTES:
    file_id += 1
    offset = 0
    file = open(data.{file_id:05})
File Rotation Example:
Current: data.00000 at offset 4,294,900,000 (near 4 GiB limit)
Request: allocate 100,000 bytes
Action: Create data.00001, allocate at offset 0
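A minimal sketch of the rotation check, assuming a simplified `FileState` that drops the `max_size` field and the O_DIRECT open flags:

```rust
use std::fs::{File, OpenOptions};
use std::io;

const FILE_MAX_BYTES: u64 = 4_294_967_296; // 4 GiB
const DATA_DIR: &str = "storage";

struct FileState {
    file_id: u32,
    offset: u64,
    file: File,
}

impl FileState {
    /// Open (or create) the data file for `file_id`, e.g. storage/data.00001.
    fn open(file_id: u32) -> io::Result<Self> {
        std::fs::create_dir_all(DATA_DIR)?;
        let path = format!("{}/data.{:05}", DATA_DIR, file_id);
        let file = OpenOptions::new().create(true).read(true).write(true).open(path)?;
        Ok(Self { file_id, offset: 0, file })
    }

    /// Rotate to a fresh file when the next allocation would cross 4 GiB.
    fn rotate_if_needed(&mut self, alloc_len: u64) -> io::Result<()> {
        if self.offset + alloc_len > FILE_MAX_BYTES {
            *self = Self::open(self.file_id + 1)?;
        }
        Ok(())
    }
}
```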
O_DIRECT Support (Linux only)
On Linux, files are opened with O_DIRECT to bypass the kernel page cache:
#[cfg(target_os = "linux")]
opts.custom_flags(libc::O_DIRECT);
Requirements:
- All I/O must be 4K-aligned (offset and size)
- Buffers must be page-aligned in memory

Benefit: bypassing the kernel page cache avoids double-buffering, which reduces memory pressure.
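For illustration, opening such a file from Rust might look like the following; `open_data_file` is a hypothetical helper, and the `libc` crate is assumed for the `O_DIRECT` constant:

```rust
use std::fs::{File, OpenOptions};
use std::io;

/// Open a data file, requesting O_DIRECT on Linux so reads and writes
/// bypass the page cache. Offsets, sizes, and buffer addresses must then
/// be 4 KiB-aligned.
fn open_data_file(path: &str) -> io::Result<File> {
    let mut opts = OpenOptions::new();
    opts.create(true).read(true).write(true);

    #[cfg(target_os = "linux")]
    {
        use std::os::unix::fs::OpenOptionsExt;
        opts.custom_flags(libc::O_DIRECT);
    }

    opts.open(path)
}
```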
PageIO (src/page_handler/page_io.rs)
PageIO handles reading and writing compressed pages to disk with a 64-byte metadata prefix.
API
read_from_path(path, offset) -> PageCacheEntryCompressed
read_batch_from_path(path, offsets: &[u64]) -> Vec<PageCacheEntryCompressed>
write_to_path(path, offset, data: Vec<u8>)
write_batch_to_path(path, writes: &[(u64, Vec<u8>)])
On-Disk Format
Each page on disk consists of a fixed 64-byte metadata header followed by the compressed page data:
Offset: N
╔═══════════════════════════════════════════╗
║ 64B Metadata (rkyv serialized) ║
║ • read_size : usize ║
╚═══════════════════════════════════════════╝
Offset: N + 64
╔═══════════════════════════════════════════╗
║ Compressed Page Bytes (LZ4 + bincode) ║
╚═══════════════════════════════════════════╝
Metadata Structure:
struct PageMetadata {
    read_size: usize, // Size of compressed payload
}
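The exact byte layout of the prefix depends on the rkyv version in use, so the sketch below substitutes a little-endian `u64` padded to 64 bytes; `encode_prefix` and `decode_prefix` are hypothetical helpers meant only to illustrate the fixed-header framing:

```rust
const PREFIX_META_SIZE: usize = 64;

/// Build the fixed 64-byte prefix. The real code serializes PageMetadata
/// with rkyv; a little-endian u64 is used here purely to show the
/// "fixed header + payload" layout.
fn encode_prefix(read_size: usize) -> [u8; PREFIX_META_SIZE] {
    let mut prefix = [0u8; PREFIX_META_SIZE];
    prefix[..8].copy_from_slice(&(read_size as u64).to_le_bytes());
    prefix
}

/// Recover the payload length from a 64-byte prefix.
fn decode_prefix(prefix: &[u8; PREFIX_META_SIZE]) -> usize {
    u64::from_le_bytes(prefix[..8].try_into().unwrap()) as usize
}

fn main() {
    let prefix = encode_prefix(1_234);
    assert_eq!(decode_prefix(&prefix), 1_234);
}
```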
Read Flow (Single Page)
read_from_path(path, offset)
│
├─▶ seek to offset
│
├─▶ read 64 bytes → deserialize metadata (rkyv)
│ └─▶ extract read_size
│
├─▶ read read_size bytes → compressed_data
│
└─▶ return PageCacheEntryCompressed { page: compressed_data }
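A minimal stand-in for this flow using plain `std::io`; it reuses the little-endian framing from the previous sketch (the real code deserializes rkyv metadata) and returns the raw compressed bytes instead of a `PageCacheEntryCompressed`:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

const PREFIX_META_SIZE: usize = 64;

/// Read one compressed page: 64-byte metadata prefix, then the payload.
fn read_page(path: &str, offset: u64) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;

    // Fixed-size metadata prefix (stand-in decode for the rkyv metadata).
    let mut prefix = [0u8; PREFIX_META_SIZE];
    file.read_exact(&mut prefix)?;
    let read_size = u64::from_le_bytes(prefix[..8].try_into().unwrap()) as usize;

    // The compressed payload immediately follows the prefix.
    let mut payload = vec![0u8; read_size];
    file.read_exact(&mut payload)?;
    Ok(payload)
}
```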
Batch Read (Linux: io_uring)
On Linux, batch reads use io_uring for parallel I/O:
offsets[]
│
├─▶ Phase 1: Queue all metadata reads (64B each)
│ └─▶ submit io_uring Read opcodes
│
├─▶ Wait for CQEs → parse read_size from each
│
├─▶ Phase 2: Allocate payload buffers based on read_size
│
├─▶ Phase 3: Queue all payload reads (offset + 64, size = read_size)
│ └─▶ submit io_uring Read opcodes
│
└─▶ Wait for CQEs → return Vec<PageCacheEntryCompressed>
Performance:
- All metadata reads are issued in parallel
- All payload reads are issued in parallel
- No fsync is needed on the read path
- Approaches ~N× the throughput of sequential reads for N pages when I/O-bound
Non-Linux: Falls back to sequential reads using standard file I/O.
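As a rough sketch of phase 1, the following queues every 64-byte metadata read through the `io-uring` crate (an assumption about the binding in use) and collects the payload sizes; phases 2 and 3 repeat the same pattern at `offset + 64`. `read_metadata_batch` is a hypothetical helper with simplified error handling and the little-endian stand-in decode:

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

const PREFIX_META_SIZE: usize = 64;

/// Phase 1 of the batch read: one 64-byte metadata read per page,
/// all submitted at once, then payload sizes parsed from the buffers.
fn read_metadata_batch(file: &File, offsets: &[u64]) -> io::Result<Vec<usize>> {
    let mut ring = IoUring::new(offsets.len().max(1).next_power_of_two() as u32)?;
    let fd = types::Fd(file.as_raw_fd());

    // One stable 64-byte buffer per queued read; buffers must stay alive
    // and un-moved until the kernel completes the requests.
    let mut bufs = vec![[0u8; PREFIX_META_SIZE]; offsets.len()];

    for (i, (&offset, buf)) in offsets.iter().zip(bufs.iter_mut()).enumerate() {
        let sqe = opcode::Read::new(fd, buf.as_mut_ptr(), PREFIX_META_SIZE as u32)
            .offset(offset)
            .build()
            .user_data(i as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // Submit everything in one batch and wait for all completions.
    ring.submit_and_wait(offsets.len())?;

    let mut sizes = vec![0usize; offsets.len()];
    for cqe in ring.completion() {
        assert!(cqe.result() >= 0, "metadata read failed: {}", cqe.result());
        let i = cqe.user_data() as usize;
        // Stand-in for the rkyv deserialization of read_size.
        sizes[i] = u64::from_le_bytes(bufs[i][..8].try_into().unwrap()) as usize;
    }
    Ok(sizes)
}
```

Because every request is in flight at once, latency is bounded by the slowest read rather than the sum of all reads.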
Write Flow (Single Page)
write_to_path(path, offset, data)
│
├─▶ Serialize metadata { read_size: data.len() } with rkyv
│ └─▶ pad to 64 bytes
│
├─▶ Open file (with O_CREAT)
│
├─▶ Write metadata at offset
│
├─▶ Write compressed data at offset + 64
│
└─▶ fsync()
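A compact sketch of this flow in plain `std::io`, again with the little-endian stand-in for the rkyv metadata; `write_page` is a hypothetical counterpart to `write_to_path`:

```rust
use std::fs::OpenOptions;
use std::io::{self, Seek, SeekFrom, Write};

const PREFIX_META_SIZE: usize = 64;

/// Write one compressed page: 64-byte prefix at `offset`, payload at
/// `offset + 64`, then fsync.
fn write_page(path: &str, offset: u64, data: &[u8]) -> io::Result<()> {
    // Stand-in for the rkyv-serialized metadata, padded to 64 bytes.
    let mut prefix = [0u8; PREFIX_META_SIZE];
    prefix[..8].copy_from_slice(&(data.len() as u64).to_le_bytes());

    // O_CREAT semantics: create the file if it does not exist yet.
    let mut file = OpenOptions::new().create(true).write(true).open(path)?;

    file.seek(SeekFrom::Start(offset))?;
    file.write_all(&prefix)?;
    file.write_all(data)?;

    // Durability: flush kernel buffers before reporting success.
    file.sync_all()
}
```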
Batch Write (Linux: io_uring)
On Linux, batch writes use io_uring for parallel I/O:
writes: &[(offset, data)]
│
├─▶ Step 1: Prepare all buffers
│ │
│ ├─▶ For each write:
│ │ ├─▶ serialize metadata (rkyv)
│ │ ├─▶ pad to 64B
│ │ ├─▶ combine metadata + compressed data
│ │ └─▶ store in stable buffer
│ │
│ └─▶ Buffers must remain valid until io_uring completes
│
├─▶ Step 2: Queue all Write opcodes
│ └─▶ submit to io_uring ring buffer
│
├─▶ Step 3: Wait for all completions (CQEs)
│ └─▶ all writes complete in parallel
│
└─▶ Step 4: Single Fsync opcode
└─▶ durability guarantee
Performance:
- All writes submitted in one batch
- Kernel parallelizes I/O operations
- Single fsync for all writes
- Reduces context switches
Non-Linux: Falls back to sequential writes using standard file I/O.
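A hedged sketch of the batch path with the `io-uring` crate (again an assumption about the binding): build one stable buffer per page, submit every `Write` opcode in a single batch, wait for the completions, then issue one `Fsync`:

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

const PREFIX_META_SIZE: usize = 64;

fn write_batch(file: &File, writes: &[(u64, Vec<u8>)]) -> io::Result<()> {
    let mut ring = IoUring::new((writes.len() + 1).next_power_of_two() as u32)?;
    let fd = types::Fd(file.as_raw_fd());

    // Step 1: one stable buffer per page: 64-byte prefix followed by the
    // compressed payload (LE u64 stands in for the rkyv metadata).
    let bufs: Vec<Vec<u8>> = writes
        .iter()
        .map(|(_, data)| {
            let mut buf = vec![0u8; PREFIX_META_SIZE + data.len()];
            buf[..8].copy_from_slice(&(data.len() as u64).to_le_bytes());
            buf[PREFIX_META_SIZE..].copy_from_slice(data);
            buf
        })
        .collect();

    // Step 2: queue every Write opcode and submit them as one batch.
    for (i, ((offset, _), buf)) in writes.iter().zip(bufs.iter()).enumerate() {
        let sqe = opcode::Write::new(fd, buf.as_ptr(), buf.len() as u32)
            .offset(*offset)
            .build()
            .user_data(i as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // Step 3: wait for every write completion.
    ring.submit_and_wait(writes.len())?;
    for cqe in ring.completion() {
        assert!(cqe.result() >= 0, "write failed: {}", cqe.result());
    }

    // Step 4: a single Fsync opcode makes all of the writes durable.
    let fsync = opcode::Fsync::new(fd).build();
    unsafe { ring.submission().push(&fsync).expect("submission queue full") };
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("missing fsync completion");
    assert!(cqe.result() >= 0, "fsync failed: {}", cqe.result());
    Ok(())
}
```

Submitting the fsync only after the write completions mirrors steps 3-4 above and sidesteps having to link submission entries for ordering.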
Constants
const PREFIX_META_SIZE: usize = 64;
Integration
Writer → Allocator → PageIO
Writer::persist_allocation()
│
├─▶ DirectBlockAllocator::allocate(actual_len)
│ └─▶ returns PageAllocation { path, offset, alloc_len }
│
├─▶ Zero-pad buffer to alloc_len
│
└─▶ PageIO::write_to_path(path, offset, padded_buffer)
└─▶ Persists to disk
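Hypothetical glue for this path, reusing `compute_alloc_len` and `write_page` from the earlier sketches; the real `Writer::persist_allocation` is not reproduced here:

```rust
/// Persist one compressed page: size the allocation, zero-pad the buffer
/// to the 4 KiB-aligned length, and hand it to the page writer.
fn persist_allocation(path: &str, offset: u64, mut compressed: Vec<u8>) -> std::io::Result<()> {
    // Aligned length from the allocator's sizing rule (sketch above).
    let alloc_len = compute_alloc_len(compressed.len() as u64) as usize;
    compressed.resize(alloc_len, 0); // zero-pad to alloc_len

    // Prefix + payload + fsync (sketch above).
    write_page(path, offset, &compressed)
}
```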
PageHandler → PageIO
PageHandler::get_page()
│
├─▶ Cache miss
│
└─▶ PageFetcher::fetch_and_insert(descriptor)
│
└─▶ PageIO::read_from_path(descriptor.disk_path, descriptor.offset)
└─▶ Returns compressed page
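A toy cache-miss path keyed by `(disk_path, offset)`, with `read_page` from the earlier sketch standing in for `PageIO::read_from_path`; the real `PageHandler` and `PageFetcher` types are not shown in this section:

```rust
use std::collections::HashMap;
use std::io;

/// In-memory cache of compressed pages keyed by (disk_path, offset).
struct PageCache {
    pages: HashMap<(String, u64), Vec<u8>>,
}

impl PageCache {
    fn get_page(&mut self, path: &str, offset: u64) -> io::Result<&[u8]> {
        let key = (path.to_string(), offset);
        if !self.pages.contains_key(&key) {
            // Cache miss: fetch the compressed page from disk and insert it.
            let compressed = read_page(path, offset)?;
            self.pages.insert(key.clone(), compressed);
        }
        Ok(self.pages[&key].as_slice())
    }
}
```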
Platform Differences
| Feature | Linux | Windows/macOS |
|---|---|---|
| I/O Engine | io_uring | Standard file I/O |
| O_DIRECT | Yes | No |
| Batch Operations | Parallel | Sequential |
| Performance | ~N× speedup | Baseline |
Error Handling
- Allocation failures return `None` (out of space, I/O errors)
- Read/write failures propagate `io::Error`
- File rotation creates parent directories automatically
- Missing files are created on first write
Module Locations
- Allocator: `src/writer/allocator.rs`
- PageIO: `src/page_handler/page_io.rs`
- Tests: `src/writer/allocator.rs` (unit tests for alignment and rotation)