Understanding database backup and restore
This topic explains how Infrahub's database backup and restore system works, the architectural decisions behind it, and the various approaches available for protecting your data. Understanding these concepts helps you make informed decisions about your backup strategy and troubleshoot issues when they arise.
Overview
Infrahub's backup system is designed around three core principles:
- Completeness: Capturing all data necessary for full system recovery
- Consistency: Ensuring data integrity across distributed components
- Flexibility: Supporting various deployment scenarios and recovery needs
The backup process involves more than just the Neo4j graph database: it encompasses the entire data ecosystem that Infrahub relies on to function correctly.
Backup architecture
Components requiring backup
Infrahub's data is distributed across multiple systems, each serving a specific purpose:
Neo4j graph database

The core of Infrahub's data model, storing all infrastructure relationships, schemas, and configuration data. This uses Neo4j's transactional graph database engine, which maintains ACID compliance and supports point-in-time recovery through transaction logs.
Artifact storage

Transforms, queries, and other generated artifacts are stored separately from the graph database. This can be either an S3-compatible object store or local filesystem, depending on your deployment configuration. The separation allows for efficient storage of large files without impacting database performance.
Task management database
Prefect's PostgreSQL database contains task execution history, logs, and workflow state. While not critical for data recovery, this information is valuable for auditing and troubleshooting past operations. The infrahubops tool backs this up using pg_dump with the custom format (-Fc) for efficient compression and restoration.
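For reference, an equivalent manual dump looks something like the following; the Compose service name `task-manager-db` and the `prefect` user and database names are assumptions that you should match to your deployment:

```bash
# Dump the Prefect PostgreSQL database in custom format (-Fc), which
# compresses the output and supports selective restore via pg_restore.
docker compose exec task-manager-db \
  pg_dump -Fc -U prefect -d prefect -f /tmp/prefect.dump

# Copy the dump out of the container into the backup staging area.
docker compose cp task-manager-db:/tmp/prefect.dump ./prefect.dump
```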
Why multiple components?
This distributed architecture might seem complex, but it serves important purposes:
- Performance optimization: Artifacts don't belong in a graph database
- Scalability: Each component can be scaled independently
- Flexibility: Different storage backends for different data types
- Cost efficiency: Use appropriate storage tiers for different data
Backup strategies
Full vs incremental backups
Neo4j supports both full and incremental backups, each with distinct characteristics:
Full backups create a complete copy of the database at a specific point in time. They're self-contained and straightforward to restore but require more storage space and time to complete.
Incremental backups only capture changes since the last backup. Neo4j tracks transaction IDs to ensure no data loss even during active database use. The backup tool automatically determines whether to perform a full or incremental backup based on existing backup files in the target directory.
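As a rough illustration, the two modes map onto Neo4j 5's `neo4j-admin database backup` command (Enterprise edition); the service name `database` and the `/backups` path are assumptions:

```bash
# Full backup: a self-contained copy of the "neo4j" database.
docker compose exec database \
  neo4j-admin database backup --type=full --to-path=/backups neo4j

# Incremental backup: only transactions applied since the most recent
# backup found under --to-path.
docker compose exec database \
  neo4j-admin database backup --type=diff --to-path=/backups neo4j

# --type=auto (the default) chooses full or diff automatically, which is
# the behavior described above.
```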
Online vs offline backups
Online backups (the default approach) allow the database to remain operational during the backup process. Neo4j uses a checkpoint mechanism to ensure consistency:
- A checkpoint is triggered to flush pending transactions
- Data files are copied while tracking new transactions
- Transaction logs ensure no data loss between checkpoint and completion
Offline backups require stopping the database but guarantee a perfectly consistent snapshot. These are typically only necessary for major migrations or when changing database versions.
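If you do need an offline snapshot, a minimal sketch (Neo4j 5 syntax, with an assumed `database` service and `/dumps` volume) looks like this:

```bash
# Stop the database so its data files are quiescent.
docker compose stop database

# Run neo4j-admin in a throwaway container that mounts the same volumes;
# "dump" requires the database to be offline.
docker compose run --rm database \
  neo4j-admin database dump --to-path=/dumps neo4j

# Bring the database back online.
docker compose start database
```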
The backup process explained
How Neo4j ensures consistency
Neo4j's backup mechanism uses several techniques to maintain consistency:
- Transaction logs: Every change is written to a transaction log before being applied to the data files
- Checkpoints: Periodic checkpoints ensure transaction logs are applied to data files
- Backup coordination: The backup process coordinates with the checkpoint mechanism to capture a consistent view
When you initiate a backup with infrahubops:
Check Running Tasks → Create Temp Directory → Backup Neo4j → Backup PostgreSQL → Calculate Checksums → Create Tarball → Cleanup
The tool ensures no tasks are running before starting (unless --force is used), preventing potential data inconsistencies.
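The staging and packaging steps can be pictured with a sketch like the one below; the file names and layout are illustrative rather than the tool's actual internals, and the Neo4j and PostgreSQL backup steps themselves are shown elsewhere in this topic:

```bash
set -euo pipefail
STAMP=$(date +%Y%m%d_%H%M%S)
WORKDIR=$(mktemp -d)                      # temporary staging directory
mkdir -p "$WORKDIR/backup/database"

# ... copy the neo4j-*.backup files and prefect.dump into "$WORKDIR/backup" ...

# Record a SHA256 checksum for every staged file.
CHECKSUMS=$(cd "$WORKDIR" && find backup -type f -print0 | xargs -0 -r sha256sum)
printf '%s\n' "$CHECKSUMS" > "$WORKDIR/backup/checksums.txt"

# Package everything into one timestamped tarball, then clean up.
tar -czf "infrahub_backup_${STAMP}.tar.gz" -C "$WORKDIR" backup
rm -rf "$WORKDIR"
```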
Helper container approach
The infrahubops tool uses Docker Compose commands to execute operations directly on running containers, providing several benefits (a short sketch follows this list):
- No additional containers: Unlike the old utility, infrahubops doesn't create helper containers
- Direct execution: Uses `docker compose exec` to run commands in existing containers
- Simplified networking: No need to manage separate Docker networks
- Project awareness: Automatically detects and targets the correct Docker Compose project
- Integrity verification: Calculates and validates SHA256 checksums for all backed-up files
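In practice this boils down to standard Compose subcommands; the project name `infrahub`, service name `database`, and path below are assumptions:

```bash
# List the Compose projects visible on this host; infrahubops targets
# the project running the Infrahub services.
docker compose ls

# Execute a command inside an already-running container of that project;
# no helper container or extra network is involved.
docker compose -p infrahub exec database ls /backups
```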
Restore considerations
Data consistency during restore
Restoring a database is more disruptive than backing it up because it requires the steps below (a minimal sketch follows the list):
- Stopping the target database to prevent concurrent modifications
- Clearing existing data to avoid conflicts
- Restoring data files from the backup
- Replaying transaction logs to reach the backup point
- Recreating metadata including users, roles, and permissions
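A single-node restore, reduced to its essentials (Neo4j 5 syntax; the service name, credentials, and paths are assumptions):

```bash
# Take the target database offline (the DBMS itself keeps running).
docker compose exec database cypher-shell -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "STOP DATABASE neo4j;"

# Replace the data files with the contents of the backup.
docker compose exec database \
  neo4j-admin database restore --from-path=/backups/neo4j --overwrite-destination=true neo4j

# Bring the database back online; Neo4j replays any transaction logs
# bundled with the backup on startup.
docker compose exec database cypher-shell -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "START DATABASE neo4j;"
```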
The system database challenge
Neo4j maintains a special system database containing:
- User accounts and authentication data
- Role definitions and permissions
- Database metadata and configuration
While the backup tool captures the system database, restoring it requires special consideration (see the example after this list):
- In standalone deployments, the system database is typically not restored to preserve existing configurations
- In cluster deployments, system database restoration requires cluster-wide coordination
- For disaster recovery scenarios, system database restoration may be necessary but requires additional steps
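For standalone deployments, the usual pattern is to leave the system database alone and recreate any accounts by hand. `CREATE USER` is standard Cypher administration syntax, while the user name and service name below are just examples:

```bash
# Recreate an application account instead of restoring the system database.
docker compose exec database cypher-shell -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "CREATE USER backup_reader IF NOT EXISTS SET PASSWORD 'change-me' CHANGE REQUIRED;"
```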
Cluster restore complexity
Restoring a Neo4j cluster involves additional challenges:
Cluster topology preservation

The backup contains data but not cluster roles (leader/follower relationships). After restoration, you must:
- Identify a seed instance with the restored data
- Use `CREATE DATABASE ... OPTIONS { existingData: 'use' }` to register the data
- Allow the cluster to replicate data to other nodes
Consistency across nodes

All nodes must be synchronized to prevent split-brain scenarios. The restoration process typically involves the following steps, pulled together in the sketch after this list:
- Dropping the database cluster-wide
- Restoring to a single node
- Recreating the database from that seed node
- Monitoring replication to ensure all nodes synchronize
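Put together, a hedged cluster-restore sequence looks like this (Neo4j 5 Enterprise syntax; the addresses, credentials, and paths are assumptions, and some versions also require an explicit seed-instance option on `CREATE DATABASE`):

```bash
# 1. Drop the database cluster-wide (run against any member).
cypher-shell -a neo4j://cluster-node-1:7687 -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "DROP DATABASE neo4j IF EXISTS;"

# 2. Restore the backup on one node only; that node becomes the seed.
neo4j-admin database restore --from-path=/backups/neo4j neo4j

# 3. Recreate the database from the seed's existing data files.
cypher-shell -a neo4j://cluster-node-1:7687 -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "CREATE DATABASE neo4j OPTIONS { existingData: 'use' };"

# 4. Poll until every allocation reports an "online" status.
cypher-shell -a neo4j://cluster-node-1:7687 -d system -u neo4j -p "$NEO4J_PASSWORD" \
  "SHOW DATABASE neo4j;"
```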
Alternative approaches
Direct filesystem copies
While possible, directly copying Neo4j data files has significant limitations:
- Requires complete database shutdown
- No transaction log coordination
- Risk of incomplete or corrupted copies
- No automatic metadata handling
Logical exports
Using Cypher queries to export and import data (a sketch follows this list):
- Advantages: Human-readable, version-independent, selective export
- Disadvantages: Slower, no transaction consistency, requires custom scripting
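As one example of the scripting involved, the APOC plugin (if installed, and with file export enabled) can emit the whole graph as replayable Cypher statements:

```bash
# Requires apoc.export.file.enabled=true in the Neo4j configuration.
# Writes plain Cypher CREATE statements to the server's import directory.
docker compose exec database cypher-shell -u neo4j -p "$NEO4J_PASSWORD" \
  "CALL apoc.export.cypher.all('export.cypher', {format: 'plain'});"
```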
Continuous replication
Setting up read replicas for backup purposes:
- Advantages: Near-zero RPO (Recovery Point Objective), instant failover capability
- Disadvantages: Requires additional infrastructure, ongoing synchronization overhead
Implementation details
Backup file structure
The infrahubops tool creates a tarball with the following structure:
```
infrahub_backup_YYYYMMDD_HHMMSS.tar.gz
└── backup/
    ├── backup_information.json   # Metadata and checksums
    ├── database/                 # Neo4j backup files
    │   └── neo4j-*.backup
    └── prefect.dump              # PostgreSQL dump
```
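You can inspect an archive without fully extracting it; the timestamp below is a placeholder:

```bash
# List the archive's contents.
tar -tzf infrahub_backup_20240101_120000.tar.gz

# Pull out only the metadata file to review checksums and backup details.
tar -xzf infrahub_backup_20240101_120000.tar.gz backup/backup_information.json
cat backup/backup_information.json
```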
Task safety mechanism
Before creating a backup, the tool checks for running tasks using an embedded Python script that queries the Infrahub API. This prevents backing up data in an inconsistent state. The --force flag bypasses this check but should be used with caution.
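The equivalent manual check would be a query against Infrahub's GraphQL endpoint; the query shape below is an assumption for illustration, not the tool's actual request:

```bash
# Hypothetical pre-backup task check against the Infrahub API; the field
# names in the query are assumed, not taken from the real schema.
curl -s http://localhost:8000/graphql \
  -H "Content-Type: application/json" \
  -H "X-INFRAHUB-KEY: $INFRAHUB_API_TOKEN" \
  -d '{"query": "query { InfrahubTask(state: \"RUNNING\") { count } }"}'
```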
Checksum validation
Every file in the backup is protected by SHA256 checksums stored in backup_information.json. During restoration, these checksums are verified to ensure data integrity. Any mismatch will abort the restoration process.
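If you ever need to re-check integrity by hand from an extracted archive, something like the following works, assuming `backup_information.json` maps relative file paths to hex digests under a `checksums` key (the exact schema is an assumption):

```bash
# Rebuild "digest  path" lines from the metadata and feed them to
# sha256sum; a non-zero exit status signals a mismatch.
cd backup
jq -r '.checksums | to_entries[] | "\(.value)  \(.key)"' backup_information.json \
  | sha256sum -c -
```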
Further reading
- How to backup and restore Infrahub - Practical step-by-step instructions
- infrahub-ops-cli source code - Implementation details and latest features