Storage Algorithm
Harper's storage algorithm is the foundation of all database functionality. It is built on top of LMDB (Lightning Memory-Mapped Database), a high-performance key-value store, and extends it with automatic indexing, query-language-agnostic data access, and ACID compliance.
Query Language Agnostic
Harper's storage layer is decoupled from any specific query language. Data inserted via NoSQL operations can be read via SQL, REST, or the Resource API — all accessing the same underlying storage. This architecture allows Harper to add new query interfaces without changing how data is stored.
ACID Compliance
Harper provides full ACID compliance on each node using Multi-Version Concurrency Control (MVCC) through LMDB:
- Atomicity: All writes in a transaction either fully commit or fully roll back
- Consistency: Each transaction moves data from one valid state to another
- Isolation: Readers and writers operate independently — readers do not block writers and writers do not block readers
- Durability: Committed transactions are persisted to disk
Each Harper table has a single writer process, eliminating deadlocks and ensuring writes are executed in the order received. Multiple reader processes can operate concurrently for high-throughput reads.
Universally Indexed
Changed in: v4.3.0 — Storage performance improvements including better free-space management
All top-level attributes are automatically indexed immediately upon ingestion. For dynamic schema tables, Harper reflexively creates the attribute and its index as new data arrives. For schema-defined tables, indexes are created for all attributes marked with @indexed.
Indexes are type-agnostic, ordering values as follows:
- Booleans
- Numbers (ordered numerically)
- Strings (ordered lexically)
LMDB Storage Layout
Within the LMDB implementation, table records are grouped into a single LMDB environment file. Each attribute index is stored as a sub-database (dbi) within that environment. This means each attribute has its own dedicated index structure inside the same database file.
Compression
Changed in: v4.3.0 — Compression is now enabled by default for all records over 4KB
Harper compresses record data automatically for records over 4KB. Compression settings can be configured in the storage configuration. Note that compression settings cannot be changed on existing databases without creating a new compacted copy — see Compaction.
Performance Characteristics
Harper inherits the following performance properties from LMDB:
- Memory-mapped I/O: Data is accessed via memory mapping, enabling fast reads without data duplication between disk and memory
- Buffer cache integration: Fully exploits the OS buffer cache for reduced I/O
- CPU cache optimization: Built to maximize data locality within CPU caches
- Deadlock-free writes: Full serialization of writers guarantees write ordering without deadlocks
- Zero-copy reads: Readers access data directly from the memory map without copying
Indexing Example
Given a table with records like this:
┌────┬────────┬────────┐
│ id │ field1 │ field2 │
├────┼────────┼────────┤
│ 1 │ A │ X │
│ 2 │ 25 │ X │
│ 3 │ -1 │ Y │
│ 4 │ A │ │
│ 5 │ true │ 2 │
└────┴────────┴────────┘
Harper maintains three separate LMDB sub-databases for that table:
Table (LMDB environment file)
│
├── primary index: id
│ ┌─────┬──────────────────────────────────────┐
│ │ Key │ Value (full record) │
│ ├─────┼──────────────────────────────────────┤
│ │ 1 │ { id:1, field1:"A", field2:"X" } │
│ │ 2 │ { id:2, field1:25, field2:"X" } │
│ │ 3 │ { id:3, field1:-1, field2:"Y" } │
│ │ 4 │ { id:4, field1:"A" } │
│ │ 5 │ { id:5, field1:true, field2:2 } │
│ └─────┴──────────────────────────────────────┘
│
├── secondary index: field1 secondary index: field2
│ ┌────────┬───────┐ ┌────────┬───────┐
│ │ Key │ Value │ │ Key │ Value │
│ ├────────┼───────┤ ├────────┼───────┤
│ │ -1 │ 3 │ │ 2 │ 5 │
│ │ 25 │ 2 │ │ X │ 1 │
│ │ A │ 1 │ │ X │ 2 │
│ │ A │ 4 │ │ Y │ 3 │
│ │ true │ 5 │ └────────┴───────┘
│ └────────┴───────┘
Secondary indexes store the attribute value as the key and the record's primary key (id) as the value. To resolve a query result, Harper looks up the matching ids in the secondary index, then fetches the full records from the primary index.
Indexes are ordered — booleans first, then numbers (numerically), then strings (lexically) — enabling efficient range queries across all types.
Related Documentation
- Schema — Defining indexed attributes and vector indexes
- Compaction — Reclaiming free space and applying new storage configuration to existing databases
- Configuration — Storage configuration options (compression, memory maps, blob paths)