Data Loader
Added in: v4.6.0
The Data Loader is a built-in component that loads data from JSON or YAML files into Harper tables as part of component deployment. It is designed for seeding tables with initial records — configuration data, reference data, default users, or other records that should exist when a component is first deployed or updated.
Configuration
In your component's config.yaml, use the dataLoader key to specify the data files to load:
dataLoader:
  files: 'data/*.json'
dataLoader is an Extension and supports the standard files configuration option, including glob patterns.
Data File Format
Each data file loads records into a single table. The file specifies the target database, table, and an array of records.
JSON Example
{
  "database": "myapp",
  "table": "users",
  "records": [
    {
      "id": 1,
      "username": "admin",
      "email": "admin@example.com",
      "role": "administrator"
    },
    {
      "id": 2,
      "username": "user1",
      "email": "user1@example.com",
      "role": "standard"
    }
  ]
}
YAML Example
database: myapp
table: settings
records:
  - id: 1
    setting_name: app_name
    setting_value: My Application
  - id: 2
    setting_name: version
    setting_value: '1.0.0'
One table per file. To load data into multiple tables, create a separate file for each table.
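Before deployment, the structure described above can be checked with a short script. The following is a minimal Python sketch and is not part of Harper: `check_data_file` is a hypothetical helper name, and it assumes that `database`, `table`, and `records` are all required, which Harper's own validation may not enforce in exactly this way.

```python
import json

def check_data_file(doc):
    """Return a list of problems found in a parsed data file (empty if none).

    Assumption: 'database' and 'table' are non-empty strings and 'records'
    is an array of objects, as the format description suggests.
    """
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    problems = []
    for key in ("database", "table"):
        value = doc.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"'{key}' must be a non-empty string")
    records = doc.get("records")
    if not isinstance(records, list):
        problems.append("'records' must be an array")
        return problems
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"records[{i}] must be an object")
    return problems

# Usage: parse a JSON data file, then check its shape.
sample = json.loads(
    '{"database": "myapp", "table": "users", "records": [{"id": 1}]}'
)
print(check_data_file(sample))  # []
```

The function returns messages rather than raising, so a pre-deploy script can report every problem in a file in one pass.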
File Patterns
The files option accepts a single path, a list of paths, or a glob pattern:
# Single file
dataLoader:
  files: 'data/seed-data.json'

# Multiple specific files
dataLoader:
  files:
    - 'data/users.json'
    - 'data/settings.yaml'
    - 'data/initial-products.json'

# Glob pattern
dataLoader:
  files: 'data/**/*.{json,yaml,yml}'
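To preview which files a pattern such as `data/**/*.{json,yaml,yml}` is likely to pick up, a rough local check can help. The Python sketch below is an approximation only. It assumes conventional glob semantics (`**` matches any directory depth, including none, and `{a,b}` is alternation); Harper's actual matcher is not described here and may differ in edge cases. Python's `glob` module has no brace expansion, so the sketch filters on file suffix instead. `matches_pattern` and `preview` are hypothetical names.

```python
from pathlib import Path, PurePosixPath

DATA_SUFFIXES = {".json", ".yaml", ".yml"}

def matches_pattern(rel_path: str, root: str = "data") -> bool:
    """Approximate 'data/**/*.{json,yaml,yml}' for a component-relative path."""
    p = PurePosixPath(rel_path)
    # The path must sit somewhere under the root directory and have a data suffix.
    return len(p.parts) >= 2 and p.parts[0] == root and p.suffix in DATA_SUFFIXES

def preview(component_dir: str) -> list[str]:
    """List files under component_dir that the approximate pattern would select."""
    base = Path(component_dir)
    rels = (f.relative_to(base).as_posix() for f in base.rglob("*") if f.is_file())
    return sorted(r for r in rels if matches_pattern(r))
```

Calling `preview("my-component")` from the parent directory would list the candidate data files, which is a quick way to spot a stray file before it is deployed.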
Loading Behavior
The Data Loader runs on every full system start and every component deployment — this includes fresh installs, restarts of the Harper process, and redeployments of the component. It does not re-run on individual thread restarts within a running Harper process.
Because the Data Loader runs on every startup and deployment, change detection is central to how it works safely. On each run:
- All specified data files are read (JSON or YAML)
- Each file is validated to reference a single table
- Records are inserted or updated based on content hash comparison:
  - New records are inserted if they don't exist
  - Existing records are updated only if the data file content has changed
  - Records created outside the Data Loader (via Operations API, REST, etc.) are never overwritten
  - Records modified by users after being loaded are preserved and not overwritten
  - Extra fields added by users to data-loaded records are preserved during updates
- SHA-256 content hashes are stored in the hdb_dataloader_hash system table to track which records have been loaded and detect changes
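To make the idea of a content hash concrete, here is a minimal Python sketch of hashing a record with SHA-256. The exact serialization Harper hashes is not specified in this document, so the canonical JSON form used here (sorted keys, compact separators) is an assumption, and `content_hash` is a hypothetical name.

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """SHA-256 hex digest of a record in a canonical JSON form.

    Assumption: keys are sorted and whitespace removed so that two records
    with the same content always hash the same, regardless of key order.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = content_hash({"id": 1, "username": "admin"})
b = content_hash({"username": "admin", "id": 1})
print(a == b)  # True
```

Canonicalizing before hashing matters: without it, reordering keys in a data file would look like a content change and trigger unnecessary writes.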
Change Detection
| Scenario | Behavior |
|---|---|
| New record | Inserted; content hash stored |
| Unchanged record | Skipped (no writes) |
| Changed data file | Updated via patch, preserving any extra fields |
| Record created by user (not data loader) | Never overwritten |
| Record modified by user after load | Preserved, not overwritten |
| Extra fields added by user to a data-loaded record | Preserved during updates |
This design makes data files safe to redeploy repeatedly — across deployments, node scaling, and system restarts — without losing manual modifications or causing unnecessary writes.
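The scenarios in the table can be modeled as a small decision function. The Python sketch below uses in-memory dictionaries in place of a Harper table and the hash table; it is an illustration of the documented behavior, not Harper's implementation. In particular, how Harper detects a user modification is not stated, so the sketch assumes the hash entry also remembers which fields the loader wrote. All names (`plan_action`, `load_record`, the `fields` entry) are hypothetical.

```python
import hashlib
import json

def content_hash(record):
    # Canonical JSON (sorted keys) so key order does not affect the hash.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def plan_action(file_record, existing, stored):
    """Decide what to do with one record from a data file.

    existing: the current row in the table, or None
    stored:   {"hash": ..., "fields": [...]} from the last load, or None
    """
    if existing is None:
        # Sketch assumption: a missing row is always treated as new.
        return "insert"
    if stored is None:
        return "skip: created outside the loader"
    if stored["hash"] == content_hash(file_record):
        return "skip: unchanged"
    # Compare only the fields the loader wrote; extra user fields are ignored.
    loaded_view = {k: existing[k] for k in stored["fields"] if k in existing}
    if content_hash(loaded_view) != stored["hash"]:
        return "skip: modified by user"
    return "patch"

def load_record(table, hashes, file_record, key="id"):
    pk = file_record[key]
    action = plan_action(file_record, table.get(pk), hashes.get(pk))
    if action in ("insert", "patch"):
        # A patch merges file fields over the row, keeping extra fields.
        table[pk] = {**table.get(pk, {}), **file_record}
        hashes[pk] = {
            "hash": content_hash(file_record),
            "fields": sorted(file_record),
        }
    return action
```

Loading the same record twice returns `insert` and then `skip: unchanged`. Adding an unrelated field to the row and then changing the file yields `patch`, with the added field intact.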
Best Practices
Define schemas first. While the Data Loader can infer schemas from the records it loads, it is strongly recommended to define table schemas explicitly using the graphqlSchema component before loading data. This ensures proper types, constraints, and relationships.
One table per file. Each data file must target a single table. Organize files accordingly.
Idempotent data. Design files to be safe to load multiple times without creating duplicate or conflicting records.
Version control. Include data files in version control for consistency across deployments and environments.
Environment-specific data. Consider using different data files for different environments (development, staging, production) to avoid loading inappropriate records.
Validate before deploying. Ensure data files are valid JSON or YAML and match your table schemas before deployment to catch type mismatches early.
No sensitive data. Do not include passwords, API keys, or secrets directly in data files. Use environment variables or secure configuration management instead.
Example Component Structure
A common production use case is shipping reference data — lookup tables like countries and regions — as part of a component. The records are version-controlled alongside the code, consistent across every environment, and the data loader keeps them in sync on every deployment without touching any user-modified fields.
my-component/
├── config.yaml
├── schemas.graphql
├── roles.yaml
└── data/
    ├── countries.json   # ISO country codes (reference data, ships with component)
    └── regions.json     # region/subdivision codes
config.yaml:
graphqlSchema:
  files: 'schemas.graphql'
roles:
  files: 'roles.yaml'
dataLoader:
  files: 'data/*.json'
rest: true
schemas.graphql:
type Country @table(database: "myapp") @export {
  id: ID @primaryKey # ISO 3166-1 alpha-2, e.g. "US"
  name: String @indexed
  region: String @indexed
}

type Region @table(database: "myapp") @export {
  id: ID @primaryKey # ISO 3166-2, e.g. "US-CA"
  name: String @indexed
  countryId: ID @indexed
  country: Country @relationship(from: countryId)
}
data/countries.json:
{
  "database": "myapp",
  "table": "Country",
  "records": [
    { "id": "US", "name": "United States", "region": "Americas" },
    { "id": "GB", "name": "United Kingdom", "region": "Europe" },
    { "id": "DE", "name": "Germany", "region": "Europe" }
    // ... all ~250 ISO countries
  ]
}
data/regions.json:
{
  "database": "myapp",
  "table": "Region",
  "records": [
    { "id": "US-CA", "name": "California", "countryId": "US" },
    { "id": "US-NY", "name": "New York", "countryId": "US" },
    { "id": "GB-ENG", "name": "England", "countryId": "GB" }
    // ...
  ]
}
Because the data loader uses content hashing, adding new countries or correcting a name in the file will update only the changed records on the next deployment — existing records that haven't changed are skipped entirely.
Related Documentation
- Schema — Defining table structure before loading data
- Jobs — Bulk data operations via the Operations API (CSV/JSON import from file, URL, or S3)
- Components — Extension and plugin system that the data loader is built on