Understanding database backup and restore
This topic explains how Infrahub's database backup and restore system works, the architectural decisions behind it, and the various approaches available for protecting your data. Understanding these concepts helps you make informed decisions about your backup strategy and troubleshoot issues when they arise.
Overview
Infrahub's backup system is designed around three core principles:
- Completeness: Capturing all data necessary for full system recovery
- Consistency: Ensuring data integrity across distributed components
- Flexibility: Supporting various deployment scenarios and recovery needs
The backup process involves more than just the Neo4j graph database: it encompasses the entire data ecosystem that Infrahub relies on to function correctly.
Backup architecture
Components requiring backup
Infrahub's data is distributed across multiple systems, each serving a specific purpose:
Neo4j graph database

The core of Infrahub's data model, storing all infrastructure relationships, schemas, and configuration data. This uses Neo4j's transactional graph database engine, which maintains ACID compliance and supports point-in-time recovery through transaction logs.
Artifact storage

Transforms, queries, and other generated artifacts are stored separately from the graph database. This can be either an S3-compatible object store or local filesystem, depending on your deployment configuration. The separation allows for efficient storage of large files without impacting database performance.
Task management database
Prefect's PostgreSQL database contains task execution history, logs, and workflow state. While not critical for data recovery, this information is valuable for auditing and troubleshooting past operations. The infrahubops tool backs this up using pg_dump with the custom format (-Fc) for efficient compression and restoration.
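For reference, an equivalent manual dump looks something like the following; the Compose service name `task-manager-db` and the `prefect` user and database names are assumptions that you should match to your deployment:

```bash
# Dump the Prefect PostgreSQL database in custom format (-Fc), which
# compresses the output and supports selective restore via pg_restore.
docker compose exec task-manager-db \
  pg_dump -Fc -U prefect -d prefect -f /tmp/prefect.dump

# Copy the dump out of the container into the backup staging area.
docker compose cp task-manager-db:/tmp/prefect.dump ./prefect.dump
```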
Why multiple components?
This distributed architecture might seem complex, but it serves important purposes:
- Performance optimization: Artifacts don't belong in a graph database
- Scalability: Each component can be scaled independently
- Flexibility: Different storage backends for different data types
- Cost efficiency: Use appropriate storage tiers for different data
Backup strategies
Full vs incremental backups
Neo4j supports both full and incremental backups, each with distinct characteristics:
Full backups create a complete copy of the database at a specific point in time. They're self-contained and straightforward to restore but require more storage space and time to complete.
Incremental backups only capture changes since the last backup. Neo4j tracks transaction IDs to ensure no data loss even during active database use. The backup tool automatically determines whether to perform a full or incremental backup based on existing backup files in the target directory.
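As a rough illustration, the two modes map onto Neo4j 5's `neo4j-admin database backup` command (Enterprise edition); the service name `database` and the `/backups` path are assumptions:

```bash
# Full backup: a self-contained copy of the "neo4j" database.
docker compose exec database \
  neo4j-admin database backup --type=full --to-path=/backups neo4j

# Incremental backup: only transactions applied since the most recent
# backup found under --to-path.
docker compose exec database \
  neo4j-admin database backup --type=diff --to-path=/backups neo4j

# --type=auto (the default) chooses full or diff automatically, which is
# the behavior described above.
```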
Online vs offline backups
Online backups (the default approach) allow the database to remain operational during the backup process. Neo4j uses a checkpoint mechanism to ensure consistency:
- A checkpoint is triggered to flush pending transactions
- Data files are copied while tracking new transactions
- Transaction logs ensure no data loss between checkpoint and completion
Offline backups require stopping the database but guarantee a perfectly consistent snapshot. These are typically only necessary for major migrations or when changing database versions.
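If you do need an offline snapshot, a minimal sketch (Neo4j 5 syntax, with an assumed `database` service and `/dumps` volume) looks like this:

```bash
# Stop the database so its data files are quiescent.
docker compose stop database

# Run neo4j-admin in a throwaway container that mounts the same volumes;
# "dump" requires the database to be offline.
docker compose run --rm database \
  neo4j-admin database dump --to-path=/dumps neo4j

# Bring the database back online.
docker compose start database
```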
The backup process explained
How Neo4j ensures consistency
Neo4j's backup mechanism uses several techniques to maintain consistency:
- Transaction logs: Every change is written to a transaction log before being applied to the data files
- Checkpoints: Periodic checkpoints ensure transaction logs are applied to data files
- Backup coordination: The backup process coordinates with the checkpoint mechanism to capture a consistent view
When you initiate a backup with infrahubops:
Check Running Tasks → Create Temp Directory → Backup Neo4j → Backup PostgreSQL → Calculate Checksums → Create Tarball → Cleanup
The tool ensures no tasks are running before starting (unless --force is used), preventing potential data inconsistencies.
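The staging and packaging steps can be pictured with a sketch like the one below; the file names and layout are illustrative rather than the tool's actual internals, and the Neo4j and PostgreSQL backup steps themselves are shown elsewhere in this topic:

```bash
set -euo pipefail
STAMP=$(date +%Y%m%d_%H%M%S)
WORKDIR=$(mktemp -d)                      # temporary staging directory
mkdir -p "$WORKDIR/backup/database"

# ... copy the neo4j-*.backup files and prefect.dump into "$WORKDIR/backup" ...

# Record a SHA256 checksum for every staged file.
CHECKSUMS=$(cd "$WORKDIR" && find backup -type f -print0 | xargs -0 -r sha256sum)
printf '%s\n' "$CHECKSUMS" > "$WORKDIR/backup/checksums.txt"

# Package everything into one timestamped tarball, then clean up.
tar -czf "infrahub_backup_${STAMP}.tar.gz" -C "$WORKDIR" backup
rm -rf "$WORKDIR"
```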
Helper container approach
The infrahubops tool uses Docker Compose commands to execute operations directly on running containers, providing several benefits (a short sketch follows this list):
- No additional containers: Unlike the old utility, infrahubops doesn't create helper containers
- Direct execution: Uses `docker compose exec` to run commands in existing containers
- Simplified networking: No need to manage separate Docker networks
- Project awareness: Automatically detects and targets the correct Docker Compose project
- Integrity verification: Calculates and validates SHA256 checksums for all backed-up files
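In practice this boils down to standard Compose subcommands; the project name `infrahub`, service name `database`, and path below are assumptions:

```bash
# List the Compose projects visible on this host; infrahubops targets
# the project running the Infrahub services.
docker compose ls

# Execute a command inside an already-running container of that project;
# no helper container or extra network is involved.
docker compose -p infrahub exec database ls /backups
```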
Restore considerations
Data consistency during restore
Restoring a database is more disruptive than backing it up because it requires the steps below (a minimal sketch follows the list):
- Stopping the target database to prevent concurrent modifications
- Clearing existing data to avoid conflicts
- Restoring data files from the backup
- Replaying transaction logs to reach the backup point
- Recreating metadata including users, roles, and permissions
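A single-node restore, reduced to its essentials (Neo4j 5 syntax; the service name, credentials, and paths are assumptions):

```bash
# Take the target database offline (the DBMS itself keeps running).
docker compose exec database cypher-shell -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "STOP DATABASE neo4j;"

# Replace the data files with the contents of the backup.
docker compose exec database \
  neo4j-admin database restore --from-path=/backups/neo4j --overwrite-destination=true neo4j

# Bring the database back online; Neo4j replays any transaction logs
# bundled with the backup on startup.
docker compose exec database cypher-shell -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "START DATABASE neo4j;"
```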
The system database challenge
Neo4j maintains a special system database containing:
- User accounts and authentication data
- Role definitions and permissions
- Database metadata and configuration
While the backup tool captures the system database, restoring it requires special consideration (see the example after this list):
- In standalone deployments, the system database is typically not restored to preserve existing configurations
- In cluster deployments, system database restoration requires cluster-wide coordination
- For disaster recovery scenarios, system database restoration may be necessary but requires additional steps
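For standalone deployments, the usual pattern is to leave the system database alone and recreate any accounts by hand. `CREATE USER` is standard Cypher administration syntax, while the user name and service name below are just examples:

```bash
# Recreate an application account instead of restoring the system database.
docker compose exec database cypher-shell -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "CREATE USER backup_reader IF NOT EXISTS SET PASSWORD 'change-me' CHANGE REQUIRED;"
```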
Cluster restore complexity
Restoring a Neo4j cluster involves additional challenges:
Cluster topology preservation

The backup contains data but not cluster roles (leader/follower relationships). After restoration, you must:
- Identify a seed instance with the restored data
- Use `CREATE DATABASE ... OPTIONS { existingData: 'use' }` to register the data
- Allow the cluster to replicate data to other nodes
Consistency across nodes

All nodes must be synchronized to prevent split-brain scenarios. The restoration process typically involves the following steps, pulled together in the sketch after this list:
- Dropping the database cluster-wide
- Restoring to a single node
- Recreating the database from that seed node
- Monitoring replication to ensure all nodes synchronize
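Put together, a hedged cluster-restore sequence looks like this (Neo4j 5 Enterprise syntax; the addresses, credentials, and paths are assumptions, and some versions also require an explicit seed-instance option on `CREATE DATABASE`):

```bash
# 1. Drop the database cluster-wide (run against any member).
cypher-shell -a neo4j://cluster-node-1:7687 -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "DROP DATABASE neo4j IF EXISTS;"

# 2. Restore the backup on one node only; that node becomes the seed.
neo4j-admin database restore --from-path=/backups/neo4j neo4j

# 3. Recreate the database from the seed's existing data files.
cypher-shell -a neo4j://cluster-node-1:7687 -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "CREATE DATABASE neo4j OPTIONS { existingData: 'use' };"

# 4. Poll until every allocation reports an "online" status.
cypher-shell -a neo4j://cluster-node-1:7687 -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "SHOW DATABASE neo4j;"
```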
Alternative approaches
Direct filesystem copies
While possible, directly copying Neo4j data files has significant limitations:
- Requires complete database shutdown
- No transaction log coordination
- Risk of incomplete or corrupted copies
- No automatic metadata handling
Logical exports
Using Cypher queries to export and import data (a sketch follows this list):
- Advantages: Human-readable, version-independent, selective export
- Disadvantages: Slower, no transaction consistency, requires custom scripting
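As one example of the scripting involved, the APOC plugin (if installed, and with file export enabled) can emit the whole graph as replayable Cypher statements:

```bash
# Requires apoc.export.file.enabled=true in the Neo4j configuration.
# Writes plain Cypher CREATE statements to the server's import directory.
docker compose exec database cypher-shell -u neo4j -p "$NEO4J_PASSWORD" \
  "CALL apoc.export.cypher.all('export.cypher', {format: 'plain'});"
```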
Continuous replication
Setting up read replicas for backup purposes:
- Advantages: Near-zero RPO (Recovery Point Objective), instant failover capability
- Disadvantages: Requires additional infrastructure, ongoing synchronization overhead
Implementation details
Backup file structure
The infrahubops tool creates a tarball with the following structure:
```
infrahub_backup_YYYYMMDD_HHMMSS.tar.gz
└── backup/
    ├── backup_information.json   # Metadata and checksums
    ├── database/                 # Neo4j backup files
    │   └── neo4j-*.backup
    └── prefect.dump              # PostgreSQL dump
```
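You can inspect an archive without fully extracting it; the timestamp below is a placeholder:

```bash
# List the archive's contents.
tar -tzf infrahub_backup_20240101_120000.tar.gz

# Pull out only the metadata file to review checksums and backup details.
tar -xzf infrahub_backup_20240101_120000.tar.gz backup/backup_information.json
cat backup/backup_information.json
```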
Task safety mechanism
Before creating a backup, the tool checks for running tasks using an embedded Python script that queries the Infrahub API. This prevents backing up data in an inconsistent state. The --force flag bypasses this check but should be used with caution.
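The equivalent manual check would be a query against Infrahub's GraphQL endpoint; the query shape below is an assumption for illustration, not the tool's actual request:

```bash
# Hypothetical pre-backup task check against the Infrahub API; the field
# names in the query are assumed, not taken from the real schema.
curl -s http://localhost:8000/graphql \
  -H "Content-Type: application/json" \
  -H "X-INFRAHUB-KEY: $INFRAHUB_API_TOKEN" \
  -d '{"query": "query { InfrahubTask(state: \"RUNNING\") { count } }"}'
```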
Checksum validation
Every file in the backup is protected by SHA256 checksums stored in backup_information.json. During restoration, these checksums are verified to ensure data integrity. Any mismatch will abort the restoration process.
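If you ever need to re-check integrity by hand from an extracted archive, something like the following works, assuming `backup_information.json` maps relative file paths to hex digests under a `checksums` key (the exact schema is an assumption):

```bash
# Rebuild "digest  path" lines from the metadata and feed them to
# sha256sum; a non-zero exit status signals a mismatch.
cd backup
jq -r '.checksums | to_entries[] | "\(.value)  \(.key)"' backup_information.json \
  | sha256sum -c -
```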
Further reading
- How to backup and restore Infrahub - Practical step-by-step instructions
- infrahub-ops-cli source code - Implementation details and latest features