Storage Algorithm
Harper's storage algorithm is the foundation of all database functionality. It is built on top of RocksDB (the default) or LMDB (legacy), both high-performance key-value stores, and extends them with automatic indexing, query-language-agnostic data access, and ACID compliance.
New installations use RocksDB. LMDB databases created by earlier versions remain supported and are loaded automatically when detected.
Query Language Agnostic
Harper's storage layer is decoupled from any specific query language. Data inserted via NoSQL operations can be read via SQL, REST, or the Resource API — all accessing the same underlying storage. This architecture allows Harper to add new query interfaces without changing how data is stored.
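The decoupling can be pictured with a toy model in which one write path and two read paths share a single store. This is only an illustrative sketch: `STORE`, `nosql_insert`, `rest_get`, and `sql_select_where` are hypothetical names, not Harper APIs, and the sample records are made up.

```python
from typing import Any, Dict, List, Optional

# Hypothetical shared storage: one dict per table, keyed by primary key.
STORE: Dict[str, Dict[Any, dict]] = {"dog": {}}

def nosql_insert(table: str, records: List[dict]) -> None:
    """Write path modeled on a NoSQL-style insert operation."""
    for record in records:
        STORE[table][record["id"]] = record

def rest_get(path: str) -> Optional[dict]:
    """Read path modeled on a REST-style GET of /<table>/<id>."""
    _, table, key = path.split("/")
    return STORE[table].get(int(key))

def sql_select_where(table: str, attribute: str, value: Any) -> List[dict]:
    """Read path modeled on SELECT * FROM <table> WHERE <attribute> = <value>."""
    return [r for r in STORE[table].values() if r.get(attribute) == value]

# Data written through one interface is visible through the others.
nosql_insert("dog", [{"id": 1, "name": "Penny"}, {"id": 2, "name": "Kato"}])
by_rest = rest_get("/dog/1")
by_sql = sql_select_where("dog", "name", "Kato")
```

Because both read functions consult the same `STORE`, adding a further read interface in this model would not require touching how records are written.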
ACID Compliance
Harper provides full ACID compliance on each node:
- Atomicity: All writes in a transaction either fully commit or fully roll back
- Consistency: Each transaction moves data from one valid state to another
- Isolation: Reads use snapshots and do not block writes; writes do not block reads
- Durability: RocksDB commits are persisted via its Write-Ahead Log (WAL); LMDB uses memory-mapped file writes
Harper uses application-level locking to serialize schema changes and table creation, ensuring write ordering without deadlocks.
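As a rough illustration of atomicity and snapshot reads, the following toy model buffers writes in a transaction and applies them together only if no error occurs. It is a sketch under simplifying assumptions (single process, in-memory dict, hypothetical class names `ToyStore` and `ToyTransaction`) and does not reflect how Harper, RocksDB, or LMDB implement transactions.

```python
import copy

class ToyStore:
    """In-memory stand-in for a table; not Harper's implementation."""

    def __init__(self):
        self.committed = {}

    def snapshot(self):
        # Readers get a point-in-time copy, so later commits do not affect them.
        return copy.deepcopy(self.committed)

    def transaction(self):
        return ToyTransaction(self)


class ToyTransaction:
    def __init__(self, store):
        self.store = store
        self.pending = {}

    def put(self, key, record):
        if "id" not in record:
            raise ValueError("record needs an id")
        self.pending[key] = record  # buffered, not yet visible to readers

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Commit: all buffered writes become visible together.
            self.store.committed.update(self.pending)
        # On error the buffered writes are dropped (rollback).
        return False


store = ToyStore()
with store.transaction() as txn:
    txn.put(1, {"id": 1})

before = store.snapshot()
try:
    with store.transaction() as txn:
        txn.put(2, {"id": 2})
        txn.put(3, {})  # invalid record: the whole transaction rolls back
except ValueError:
    pass
```

After the failed transaction, neither record 2 nor record 3 is present, and the earlier snapshot is unchanged.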
Universally Indexed
Changed in: v4.3.0 — Storage performance improvements including better free-space management
For dynamic schema tables, all top-level attributes are automatically indexed immediately upon ingestion — Harper reflexively creates the attribute and its index as new data arrives. For schema-defined tables, indexes are created for all attributes marked with @indexed.
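The reflexive behavior for dynamic schema tables can be sketched as follows. This is a simplified model, not Harper's code: `DynamicTable` is a hypothetical class, only scalar attribute values are handled, and the sample records are invented for illustration.

```python
from collections import defaultdict

class DynamicTable:
    """Toy dynamic-schema table that indexes every top-level attribute."""

    def __init__(self):
        self.primary = {}
        # attribute name -> attribute value -> set of primary keys
        self.indexes = defaultdict(lambda: defaultdict(set))

    def ingest(self, record):
        pk = record["id"]
        self.primary[pk] = record
        for attribute, value in record.items():
            if attribute == "id":
                continue
            # First sight of an attribute creates its index as a side effect.
            self.indexes[attribute][value].add(pk)


table = DynamicTable()
table.ingest({"id": 1, "name": "Penny"})
table.ingest({"id": 2, "name": "Kato", "breed": "Lab"})  # "breed" is new here
```

The second record introduces `breed`, and its index exists immediately afterwards without any schema change.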
Indexes are type-agnostic, ordering values as follows:
- Booleans
- Numbers (ordered numerically)
- Strings (ordered lexically)
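One way to picture this ordering is a sort key that ranks the type first and the value second. The function below is an illustrative assumption, not Harper's key encoding; in particular it orders strings by Unicode code point, which is how Python compares strings.

```python
def index_sort_key(value):
    """Rank by type (booleans, numbers, strings), then by value within the type."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return (0, value)
    if isinstance(value, (int, float)):
        return (1, value)
    if isinstance(value, str):
        return (2, value)
    raise TypeError(f"unsupported index value: {value!r}")


values = ["A", 25, -1, True, "B", False, 3.5]
ordered = sorted(values, key=index_sort_key)
```

Because the type rank is compared first, values of different types are never compared with each other directly.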
Storage Layout
Each Harper database corresponds to a separate storage environment:
- RocksDB (default): a directory on disk containing all stores for that database
- LMDB (legacy): a single .mdb file containing all sub-databases for that database
Within each database, a table is represented by multiple key-value stores:
- Primary store (tableName/): stores the full record for each primary key
- Secondary index stores (tableName/attributeName): one store per indexed attribute, mapping attribute values to primary keys
- Metadata store (__internal_dbis__): tracks table and attribute definitions for the database
All stores for a given database reside within the same RocksDB directory (or LMDB environment file), so cross-table operations within a database share the same underlying I/O path.
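The naming scheme above can be made concrete with a small helper that lists the stores a database would contain. The helper names `store_names` and `database_stores` are hypothetical, and the listing order is an arbitrary choice for the example.

```python
def store_names(table, indexed_attributes):
    """Primary store plus one secondary index store per indexed attribute."""
    return [f"{table}/"] + [f"{table}/{attr}" for attr in indexed_attributes]


def database_stores(tables):
    """All stores in one database, given {table name: [indexed attributes]}."""
    names = ["__internal_dbis__"]  # metadata store named in the layout above
    for table, attributes in tables.items():
        names.extend(store_names(table, attributes))
    return names


stores = database_stores({"MyTable": ["field1", "field2"]})
```

For a table with two indexed attributes this yields four stores: the metadata store, the primary store, and two secondary index stores.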
Compression
Changed in: v4.3.0 — Compression is now enabled by default for all records over 4KB
Harper compresses record data automatically for records over 4KB. Compression settings can be configured in the storage configuration. Note that compression settings cannot be changed on existing databases without creating a new compacted copy — see Compaction.
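A size-threshold policy like this can be sketched in a few lines. Everything specific here is an assumption made for illustration: the source does not say which codec or record encoding Harper uses, so the sketch substitutes JSON and zlib, takes 4KB to mean 4096 bytes, and uses a made-up one-byte flag to mark compressed records.

```python
import json
import zlib

COMPRESSION_THRESHOLD = 4 * 1024  # assumption: "4KB" means 4096 bytes

def encode_record(record):
    """Serialize a record, compressing it only when it exceeds the threshold."""
    raw = json.dumps(record).encode("utf-8")
    if len(raw) > COMPRESSION_THRESHOLD:
        return b"\x01" + zlib.compress(raw)  # flag byte 1: compressed
    return b"\x00" + raw                     # flag byte 0: stored as-is

def decode_record(blob):
    """Reverse encode_record using the leading flag byte."""
    flag, payload = blob[:1], blob[1:]
    raw = zlib.decompress(payload) if flag == b"\x01" else payload
    return json.loads(raw)


small = {"id": 1, "note": "hi"}
large = {"id": 2, "note": "x" * 10000}
small_blob = encode_record(small)
large_blob = encode_record(large)
```

Small records skip compression entirely, which avoids paying CPU cost where there is little space to gain.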
Performance Characteristics
Harper inherits strong performance properties from its storage engines:
RocksDB (default):
- LSM-tree writes: Optimized for write-heavy workloads via log-structured merge trees
- Block cache: Configurable in-memory block cache (defaults to 25% of available system memory)
- WAL durability: Write-Ahead Log provides crash recovery without sacrificing throughput
- Compression: Native support for multiple compression algorithms per level
LMDB (legacy):
- Memory-mapped I/O: Data is accessed via memory mapping, enabling fast reads without data duplication between disk and memory
- Buffer cache integration: Fully exploits the OS buffer cache for reduced I/O
- Zero-copy reads: Readers access data directly from the memory map without copying
- Deadlock-free writes: Full serialization of writers guarantees write ordering without deadlocks
Indexing Example
Given a table with records like this:
┌────┬────────┬────────┐
│ id │ field1 │ field2 │
├────┼────────┼────────┤
│ 1  │ A      │ X      │
│ 2  │ 25     │ X      │
│ 3  │ -1     │ Y      │
│ 4  │ A      │        │
│ 5  │ true   │ 2      │
└────┴────────┴────────┘
Harper maintains three separate key-value stores for that table, all within the same database:
Database (RocksDB directory or LMDB environment)
│
├── primary store: "MyTable/"
│   ┌─────┬──────────────────────────────────────┐
│   │ Key │ Value (full record)                  │
│   ├─────┼──────────────────────────────────────┤
│   │ 1   │ { id:1, field1:"A", field2:"X" }     │
│   │ 2   │ { id:2, field1:25, field2:"X" }      │
│   │ 3   │ { id:3, field1:-1, field2:"Y" }      │
│   │ 4   │ { id:4, field1:"A" }                 │
│   │ 5   │ { id:5, field1:true, field2:2 }      │
│   └─────┴──────────────────────────────────────┘
│
├── secondary index: "MyTable/field1"     secondary index: "MyTable/field2"
│   ┌────────┬───────┐                    ┌────────┬───────┐
│   │ Key    │ Value │                    │ Key    │ Value │
│   ├────────┼───────┤                    ├────────┼───────┤
│   │ true   │ 5     │                    │ 2      │ 5     │
│   │ -1     │ 3     │                    │ X      │ 1     │
│   │ 25     │ 2     │                    │ X      │ 2     │
│   │ A      │ 1     │                    │ Y      │ 3     │
│   │ A      │ 4     │                    └────────┴───────┘
│   └────────┴───────┘
Secondary indexes store the attribute value as the key and the record's primary key (id) as the value. To resolve a query result, Harper looks up the matching ids in the secondary index, then fetches the full records from the primary store.
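The two-step lookup can be sketched with the example table from this section. This is a toy model with plain Python containers and a hypothetical `find` helper; it scans the index linearly and ignores ordering, which a real index would exploit.

```python
records = [
    {"id": 1, "field1": "A", "field2": "X"},
    {"id": 2, "field1": 25, "field2": "X"},
    {"id": 3, "field1": -1, "field2": "Y"},
    {"id": 4, "field1": "A"},
    {"id": 5, "field1": True, "field2": 2},
]

# Primary store: primary key -> full record.
primary = {r["id"]: r for r in records}

# Secondary indexes: attribute -> list of (attribute value, primary key) entries.
secondary = {}
for r in records:
    for attribute, value in r.items():
        if attribute != "id":
            secondary.setdefault(attribute, []).append((value, r["id"]))


def find(attribute, value):
    """Step 1: collect matching ids from the index. Step 2: fetch full records."""
    ids = [
        pk
        for v, pk in secondary.get(attribute, [])
        # The type check keeps True from matching 1, which Python treats as equal.
        if type(v) is type(value) and v == value
    ]
    return [primary[pk] for pk in ids]


matches = find("field1", "A")
```

Here `find("field1", "A")` first resolves to ids 1 and 4 in the `field1` index, then returns those two full records from the primary store.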
Indexes are ordered — booleans first, then numbers (numerically), then strings (lexically) — enabling efficient range queries across all types.
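To see why an ordered index makes range queries cheap, the sketch below keeps the `field1` entries from the example sorted and uses binary search to find the slice that falls inside a range. The `sort_key` and `range_ids` helpers are hypothetical and assume integer primary keys; they model the stated ordering and are not Harper's actual key encoding.

```python
import bisect

def sort_key(value):
    """Booleans rank first, then numbers, then strings."""
    if isinstance(value, bool):  # checked before int: bool subclasses int
        return (0, value)
    if isinstance(value, (int, float)):
        return (1, value)
    return (2, value)  # strings


# (sort key, primary key) entries for field1 in the example table, kept sorted.
field1_index = sorted(
    (sort_key(v), pk)
    for v, pk in [("A", 1), (25, 2), (-1, 3), ("A", 4), (True, 5)]
)


def range_ids(index, low, high):
    """Primary keys of entries whose value v satisfies low <= v <= high."""
    # A 1-tuple sorts before every (key, pk) entry that has the same key.
    start = bisect.bisect_left(index, (sort_key(low),))
    # Infinity sorts after every integer primary key with the same key.
    end = bisect.bisect_right(index, (sort_key(high), float("inf")))
    return [pk for _, pk in index[start:end]]


numeric_range = range_ids(field1_index, -10, 100)
```

Both bounds are located by binary search, so the cost of finding the range grows with the logarithm of the index size rather than with the number of entries.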
Related Documentation
- Schema — Defining indexed attributes and vector indexes
- Compaction — Reclaiming free space and applying new storage configuration to existing databases
- Configuration — Storage configuration options (compression, memory maps, blob paths)