Snapshots Recovery

Instead of initializing a node using a Postgres dump, it's possible to configure a node to recover from a protocol-level snapshot. This process is much faster and requires much less storage. Note that without pruning enabled, the node state will continuously grow.

How it works

A snapshot is a point-in-time copy of the VM state at the end of a certain L1 batch. Snapshots are created for the latest L1 batches periodically (roughly twice a day) and are stored in a public GCS bucket.

Recovery from a snapshot consists of several parts.

  • Postgres recovery is the initial stage. The node API is not functioning during this stage. The stage is expected to take about 2 minutes.

  • Merkle tree recovery starts once Postgres is fully recovered. It can take about 1 minute. Ordinarily, Merkle tree recovery is a blocker for node synchronization; i.e., the node will not process blocks newer than the snapshot block until the Merkle tree is recovered.

  • Recovery of the RocksDB-based VM state cache runs concurrently with Merkle tree recovery and also depends on Postgres recovery. It takes about 1 minute. Unlike Merkle tree recovery, the VM state cache is not necessary for node operation (the node will read the state from Postgres if the cache is absent), although it considerably speeds up VM execution.

After Postgres recovery is completed, the node becomes operational, providing the Web3 API etc. It still needs some time to catch up executing blocks after the snapshot (i.e., roughly several minutes' worth of blocks / transactions). In total, the recovery process and catch-up should take roughly 5–6 minutes.

Current limitations

Nodes recovered from a snapshot don't have any historical data from before the recovery, and there is currently no way to back-fill this historical data. E.g., if a node was recovered from a snapshot for L1 batch 500,000, it will not have data for L1 batches 499,999, 499,998, etc. The relevant Web3 methods, such as eth_getBlockByNumber, will return an error mentioning the first locally retained block or L1 batch when queried for this missing data. The same error messages are used for pruning because, logically, recovering from a snapshot is equivalent to pruning node storage to the snapshot L1 batch.
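As a minimal sketch of what this means for clients: a query for a block below the first locally retained block must go elsewhere (e.g., to an archive node). The function, URLs, and block numbers below are illustrative assumptions, not part of the node's API.

```python
def pick_node(block_number: int, first_retained_block: int,
              pruned_node_url: str, archive_node_url: str) -> str:
    """Route a historical query: blocks older than the first locally retained
    block are unavailable on a snapshot-recovered (or pruned) node, so they
    must be served by an archive node instead. All parameters are
    hypothetical values a client would discover at runtime."""
    if block_number < first_retained_block:
        return archive_node_url
    return pruned_node_url

# Example: a node recovered from a snapshot for L1 batch 500,000.
FIRST_RETAINED = 500_000
print(pick_node(499_999, FIRST_RETAINED, "http://pruned:3060", "http://archive:3060"))
# -> http://archive:3060
print(pick_node(500_000, FIRST_RETAINED, "http://pruned:3060", "http://archive:3060"))
# -> http://pruned:3060
```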

Configuration

To enable snapshot recovery on testnet, you need to set environment variables for a node before starting it for the first time:

EN_SNAPSHOTS_RECOVERY_ENABLED: 'true'
EN_SNAPSHOTS_OBJECT_STORE_BUCKET_BASE_URL: 'testnet.external-node-snapshots.onvia.org'
EN_SNAPSHOTS_OBJECT_STORE_MODE: 'GCSAnonymousReadOnly'
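As a sketch, these variables could be supplied to a Dockerized node via a Compose service definition; the service name and image below are illustrative placeholders, not the official ones (see the Docker Compose examples for authoritative configs):

```yaml
services:
  external-node:
    image: example/external-node:latest  # illustrative image name
    environment:
      EN_SNAPSHOTS_RECOVERY_ENABLED: 'true'
      EN_SNAPSHOTS_OBJECT_STORE_BUCKET_BASE_URL: 'testnet.external-node-snapshots.onvia.org'
      EN_SNAPSHOTS_OBJECT_STORE_MODE: 'GCSAnonymousReadOnly'
```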

For working examples of fully configured nodes recovering from snapshots, see the Docker Compose examples and Quick Start.

If a node is already recovered (whether from a snapshot or from a Postgres dump), setting these env variables will have no effect; the node will never reset its state.

Monitoring recovery

Snapshot recovery information is logged with the following targets:

  • Recovery orchestration: zksync_external_node::init

  • Postgres recovery: zksync_snapshots_applier

  • Merkle tree recovery: zksync_metadata_calculator::recovery, zksync_merkle_tree::recovery
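To focus logging on these targets, one could raise their verbosity selectively, assuming the node honors the standard RUST_LOG filtering convention of the Rust tracing/env-logger ecosystem (the exact levels shown here are an illustration):

```shell
# Verbose logs for recovery-related targets only, info elsewhere:
RUST_LOG=zksync_external_node::init=debug,zksync_snapshots_applier=debug,zksync_metadata_calculator::recovery=debug,info
```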

An example of snapshot recovery logs during the first node start:

(Obviously, timestamps and numbers in the logs will differ.)

Recovery logic also exports some metrics, the main ones being:

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| snapshots_applier_storage_logs_chunks_left_to_process | Gauge | - | Number of storage log chunks left to process during Postgres recovery |
| db_pruner_pruning_chunk_duration_seconds | Histogram | prune_type | Latency of a single pruning iteration |
| merkle_tree_pruning_deleted_stale_key_versions | Gauge | bound | Versions (= L1 batches) pruned from the Merkle tree |
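Assuming these metrics are scraped by Prometheus, recovery progress could be watched with queries along these lines (the job label is an assumption of your scrape config; the _count suffix is the standard Prometheus histogram counter):

```promql
# Postgres recovery progress: this gauge should fall to 0 once all storage
# log chunks have been applied.
snapshots_applier_storage_logs_chunks_left_to_process{job="external-node"}

# Approximate rate of pruning iterations completing, per prune_type:
rate(db_pruner_pruning_chunk_duration_seconds_count[5m])
```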
