Snapshots Recovery
Instead of initializing a node using a Postgres dump, it's possible to configure a node to recover from a protocol-level snapshot. This process is much faster and requires much less storage. Note that without pruning enabled, the node state will continuously grow.
How it works
A snapshot is a point-in-time capture of the VM state at the end of a certain L1 batch. Snapshots are created for the latest L1 batches periodically (roughly twice a day) and are stored in a public GCS bucket.
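For illustration, the snapshot objects in the bucket can be inspected anonymously via the public GCS JSON API. This is a minimal sketch, not part of the node itself; it assumes the testnet bucket name from the Configuration section below and that the bucket permits anonymous listing.

```python
import requests

# Bucket name taken from the Configuration section below (assumption:
# the object store base URL is also the GCS bucket name and allows anonymous listing).
BUCKET = "testnet.external-node-snapshots.onvia.org"

# Public GCS JSON API endpoint for listing objects in a bucket.
url = f"https://storage.googleapis.com/storage/v1/b/{BUCKET}/o"

resp = requests.get(url, params={"maxResults": 20}, timeout=10)
resp.raise_for_status()

for obj in resp.json().get("items", []):
    # Object names and sizes hint at which L1 batches have snapshots available.
    print(obj["name"], obj.get("size"))
```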
Recovery from a snapshot consists of several parts.
Postgres recovery is the initial stage, during which the node API is not functioning. This stage is expected to take about 2 minutes.
Merkle tree recovery starts once Postgres is fully recovered and takes about 1 minute. Ordinarily, Merkle tree recovery is a blocker for node synchronization; i.e., the node will not process blocks newer than the snapshot block until the Merkle tree is recovered.
Recovering the RocksDB-based VM state cache runs concurrently with Merkle tree recovery and also depends on Postgres recovery. It takes about 1 minute. Unlike Merkle tree recovery, the VM state cache is not necessary for node operation (the node will get the state from Postgres if the cache is absent), although it considerably speeds up VM execution.
After Postgres recovery is completed, the node becomes operational, providing the Web3 API etc. It still needs some time to catch up executing blocks produced after the snapshot (i.e., roughly several minutes' worth of blocks / transactions). In total, the recovery process and catch-up should take roughly 5–6 minutes.
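One way to observe the catch-up phase is to compare the recovered node's latest block with the main node's. A minimal sketch, assuming the node's HTTP JSON-RPC API is exposed on localhost:3060 (adjust to your configured HTTP port) and using a hypothetical main-node RPC URL:

```python
import time
import requests

# Assumptions: the recovered node serves HTTP JSON-RPC on localhost:3060,
# and MAIN_NODE_URL is a hypothetical upstream RPC URL (substitute your own).
LOCAL_NODE_URL = "http://localhost:3060"
MAIN_NODE_URL = "https://example-main-node-rpc.invalid"

def block_number(url: str) -> int:
    payload = {"jsonrpc": "2.0", "id": 1, "method": "eth_blockNumber", "params": []}
    return int(requests.post(url, json=payload, timeout=10).json()["result"], 16)

# Poll until the local node is within a few blocks of the main node.
while True:
    local, remote = block_number(LOCAL_NODE_URL), block_number(MAIN_NODE_URL)
    print(f"local={local} remote={remote} lag={remote - local}")
    if remote - local < 10:
        break
    time.sleep(15)
```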
Current limitations
Nodes recovered from a snapshot don't have any historical data from before the recovery, and there is currently no way to backfill it. E.g., if a node has recovered from a snapshot for L1 batch 500,000, it will not have data for L1 batches 499,999, 499,998, etc. The relevant Web3 methods, such as eth_getBlockByNumber, will return an error mentioning the first locally retained block or L1 batch if queried for this missing data. The same error messages are used for pruning because, logically, recovering from a snapshot is equivalent to pruning node storage to the snapshot L1 batch.
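As an illustration of this behavior, requesting a block older than the snapshot returns an error object instead of block data. A sketch under the same local-RPC assumption as above; the exact error message and code depend on the node version:

```python
import requests

LOCAL_NODE_URL = "http://localhost:3060"  # assumption: node's HTTP JSON-RPC port

# Request a block that precedes the snapshot L1 batch (the block number is illustrative).
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "eth_getBlockByNumber",
    "params": ["0x1", False],
}
response = requests.post(LOCAL_NODE_URL, json=payload, timeout=10).json()

if "error" in response:
    # Expected on a snapshot-recovered (or pruned) node: the error message
    # references the first locally retained block / L1 batch.
    print("error:", response["error"]["message"])
else:
    print("block:", response["result"])
```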
Configuration
To enable snapshot recovery on testnet, set the following environment variables for a node before starting it for the first time:
EN_SNAPSHOTS_RECOVERY_ENABLED: 'true'
EN_SNAPSHOTS_OBJECT_STORE_BUCKET_BASE_URL: 'testnet.external-node-snapshots.onvia.org'
EN_SNAPSHOTS_OBJECT_STORE_MODE: 'GCSAnonymousReadOnly'

For working examples of fully configured nodes recovering from snapshots, see the Docker Compose examples and Quick Start.
If a node is already recovered (regardless of whether from a snapshot or from a Postgres dump), setting these env variables will have no effect; the node will never reset its state.
Monitoring recovery
Snapshot recovery information is logged with the following targets:
Recovery orchestration: zksync_external_node::init
Postgres recovery: zksync_snapshots_applier
Merkle tree recovery: zksync_metadata_calculator::recovery, zksync_merkle_tree::recovery
An example of snapshot recovery logs during the first node start:
(Obviously, timestamps and numbers in the logs will differ.)
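If the node's logs are captured to a file or piped from the container, the recovery-related targets listed above can be filtered out for a quick progress check. A minimal sketch; the log file path is hypothetical, so substitute your actual log destination:

```python
# Filter recovery-related log lines by target name.
TARGETS = (
    "zksync_external_node::init",
    "zksync_snapshots_applier",
    "zksync_metadata_calculator::recovery",
    "zksync_merkle_tree::recovery",
)

# "external-node.log" is a hypothetical path; adjust to where your logs are written.
with open("external-node.log", encoding="utf-8") as log:
    for line in log:
        if any(target in line for target in TARGETS):
            print(line, end="")
```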
Recovery logic also exports some metrics, the most important of which are as follows:
Metric | Type | Labels | Description
snapshots_applier_storage_logs_chunks_left_to_process | Gauge | - | Number of storage log chunks left to process during Postgres recovery
db_pruner_pruning_chunk_duration_seconds | Histogram | prune_type | Latency of a single pruning iteration
merkle_tree_pruning_deleted_stale_key_versions | Gauge | bound | Versions (= L1 batches) pruned from the Merkle tree
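These metrics can be read from the node's Prometheus exporter. A sketch assuming metrics are exposed on localhost:3322 and served at the conventional /metrics path; both the port and the path are assumptions, so adjust them to your setup:

```python
import requests

# Assumption: the node exports Prometheus metrics on localhost:3322 at /metrics;
# change the URL to match your configured Prometheus port.
METRICS_URL = "http://localhost:3322/metrics"

text = requests.get(METRICS_URL, timeout=10).text

for line in text.splitlines():
    # Print the recovery gauge to track how many storage log chunks remain.
    if line.startswith("snapshots_applier_storage_logs_chunks_left_to_process"):
        print(line)
```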