Sharding
Added in: v4.4.0 (provisional)
Changed in: v4.5.0 — expanded sharding functionality: Harper now honors write requests with residency information that will not be stored on the local node, and nodes can be declaratively configured as part of a shard.
Harper's replication system supports sharding — storing different data across different subsets of nodes — while still allowing data to be accessed from any node in the cluster. This enables horizontal scalability for storage and write performance, while maintaining optimal data locality and consistency.
When sharding is configured, requests for records that don't reside on the handling node are automatically forwarded to the appropriate node transparently. Clients do not need to know where data is stored.
By default (without sharding), Harper replicates all data to all nodes.
Approaches to Sharding
There are two main approaches:
Dynamic sharding — the location (residency) of records is determined dynamically based on where the record was written, the record's data, or a custom function. Records can be relocated dynamically based on where they are accessed. Residency information is specific to each record.
Static sharding — each node is assigned to a specific numbered shard, and each record is replicated to the nodes in that shard based on the primary key, regardless of where the data was written or accessed. More predictable than dynamic sharding: data location is always determinable from the primary key.
Dynamic Sharding
Replication Count
The simplest way to limit replication is to configure a replication count. Set replicateTo in the replication section of harperdb-config.yaml to specify how many additional nodes data should be replicated to:
replication:
replicateTo: 2
This ensures each record is stored on three nodes total (the node that first stored it, plus two others).
Replication Control via REST Header
With the REST interface, you can specify replication targets and confirmation requirements per request using the X-Replicate-To header:
PUT /MyTable/3
X-Replicate-To: 2;confirm=1
2— replicate to two additional nodesconfirm=1— wait for confirmation from one additional node before responding
Specify exact destination nodes by hostname:
PUT /MyTable/3
X-Replicate-To: node1,node2
The confirm parameter can be combined with explicit node lists.
Replication Control via Operations API
Specify replicateTo and replicatedConfirmation in the operation body:
{
"operation": "update",
"schema": "dev",
"table": "MyTable",
"hashValues": [3],
"record": {
"name": "John Doe"
},
"replicateTo": 2,
"replicatedConfirmation": 1
}
Or specify explicit nodes:
{
// ...
"replicateTo": ["node-1", "node-2"],
// ...
}
Programmatic Replication Control
Set replicateTo and replicatedConfirmation programmatically in a resource method:
class MyTable extends tables.MyTable {
put(record) {
const context = this.getContext();
context.replicateTo = 2; // or an array of node names
context.replicatedConfirmation = 1;
return super.put(record);
}
}
Static Sharding
Basic Static Shard Configuration
Assign a node to a numbered shard in harperdb-config.yaml:
replication:
shard: 1
Or assign shards per route:
replication:
routes:
- hostname: node1
shard: 1
- hostname: node2
shard: 2
Or dynamically via the operations API by including shard in an add_node or set_node operation:
{
"operation": "add_node",
"hostname": "node1",
"shard": 1
}
Once shards are configured, use setResidency or setResidencyById (described below) to assign records to specific shards.
Custom Sharding
By Record Content (setResidency)
Define a custom residency function that is called with the full record. Return an array of node hostnames or a shard number.
With this approach, record metadata (including residency information) and indexed properties are replicated to all nodes, but the full record is only stored on the specified nodes.
Return node hostnames:
MyTable.setResidency((record) => {
return record.id % 2 === 0 ? ['node1'] : ['node2'];
});
Return a shard number (replicates to all nodes in that shard):
MyTable.setResidency((record) => {
return record.id % 2 === 0 ? 1 : 2;
});
By Primary Key Only (setResidencyById)
Define a residency function based solely on the primary key. Records (including metadata) are only replicated to the specified nodes — metadata does not need to be replicated everywhere, which allows data to be retrieved without needing access to record data or metadata on the requesting node.
Return a shard number:
MyTable.setResidencyById((id) => {
return id % 2 === 0 ? 1 : 2;
});
Return node hostnames:
MyTable.setResidencyById((id) => {
return id % 2 === 0 ? ['node1'] : ['node2'];
});
Disabling Cross-Node Access
By default, sharding allows data stored on specific nodes to be accessed from any node — requests are forwarded transparently. To disable this and only return data if it is stored on the local node, set replicateFrom to false.
Via the operations API:
{
"operation": "search_by_id",
"table": "MyTable",
"ids": [3],
"replicateFrom": false
}
Via the REST API:
GET /MyTable/3
X-Replicate-From: none
See Also
- Replication Overview — How Harper's replication system works
- Clustering Operations — Operations API for managing cluster nodes