HyperLogLog-Based Tombstone Garbage Collection for Distributed Systems
Abstract
When synchronizing records in a distributed network, deletion presents a fundamental challenge. Nodes must maintain "tombstone" records to prevent deleted data from being resurrected by offline nodes. This paper presents a HyperLogLog-based approach to tombstone garbage collection that uses probabilistic cardinality estimation to detect when tombstones have reached sufficient distribution.
We compare this approach against traditional methods—time-based garbage collection and causal stability detection—analyzing trade-offs in memory, coordination requirements, and failure tolerance.
1. Introduction
Distributed systems face an inherent tension between data consistency and storage efficiency when handling deletions. Traditional tombstone-based approaches guarantee correctness but impose unbounded storage growth. Several approaches have been proposed to address tombstone accumulation:
Time-based Garbage Collection: Sets a fixed time-to-live (TTL) for tombstones, after which they are automatically deleted [1]. While storage-efficient, this risks data resurrection if offline nodes reconnect after the GC window.
Causal Stability Detection: Prunes tombstones when the system can prove all nodes have observed the deletion [2]. Implementations vary from vector clocks (tracking operation ordering) to explicit node ID sets (tracking membership). This adds metadata overhead but provides strong guarantees when network conditions allow reliable tracking.
Consensus-based Garbage Collection: Uses coordination protocols to agree on when tombstones can be safely deleted [3]. Provides strong guarantees but requires synchronization, which may be impractical in partition-prone or high-latency networks.
This paper introduces a HyperLogLog-based approach [4] that approximates causal stability detection using probabilistic cardinality estimation. Instead of tracking exact node sets or vector clocks, it uses constant-size HyperLogLog structures to estimate propagation. This trades exactness for dramatic reductions in memory and bandwidth at scale.
2. Core Algorithm
2.1 How It Works
```mermaid
sequenceDiagram
    participant A as Node A
    participant B as Node B (offline)
    participant C as Node C
    Note over A,C: Phase 1: Record Propagation
    A->>C: record + recordHLL
    C->>A: update recordHLL
    Note over B: B receives record before going offline
    Note over A,C: Phase 2: Tombstone Propagation
    A->>A: Create tombstone, delete record
    A->>C: tombstone + tombstoneHLL + recordHLL
    C->>C: Delete record, update HLLs
    Note over A,C: Phase 3: Keeper Election
    Note over A,C: estimate(tombstoneHLL) >= estimate(recordHLL)
    Note over A,C: Nodes elect minimal keepers
    Note over A,C: Phase 4: B Reconnects
    B->>B: Comes back online
    C->>B: tombstone propagates
    B->>B: Deletes stale record
```
Phase 1: Records propagate through gossip. Each node adds itself to the record's HyperLogLog.
Phase 2: When deletion occurs, the deleting node creates a tombstone containing a copy of the record's HLL as the "target count." The tombstone propagates to online nodes.
Phase 3: When a node's tombstone HLL estimate reaches or exceeds the target, it may become a "keeper." Keepers step down when they encounter another keeper with a higher estimate (using node ID as tie-breaker).
Phase 4: Offline nodes receive the tombstone upon reconnection and delete their stale records.
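The receiving side of Phases 2 and 3 can be sketched as follows. The helper names (`onTombstone`, `estimate`, `merge`) are illustrative, not from the simulation, and a plain `Set` stands in for a real HyperLogLog, so estimates here are exact rather than approximate:

```typescript
// A Set stands in for a HyperLogLog: same merge/estimate shape, exact counts.
type HLL = Set<string>;
const estimate = (h: HLL): number => h.size;
const merge = (a: HLL, b: HLL): HLL => new Set([...a, ...b]);

interface TombstoneState {
  id: string;
  recordHLL: HLL;    // target: estimated record distribution
  tombstoneHLL: HLL; // progress: estimated tombstone distribution
}

// On receiving a tombstone: merge incoming state with local state, record
// that this node has now observed the deletion, then check the keeper
// condition from Phase 3.
const onTombstone = (
  nodeId: string,
  local: TombstoneState | undefined,
  incoming: TombstoneState
): TombstoneState & { isKeeper: boolean } => {
  const recordHLL = merge(local?.recordHLL ?? new Set<string>(), incoming.recordHLL);
  const tombstoneHLL = merge(local?.tombstoneHLL ?? new Set<string>(), incoming.tombstoneHLL);
  tombstoneHLL.add(nodeId); // this node has now seen the tombstone
  const isKeeper = estimate(tombstoneHLL) >= estimate(recordHLL);
  return { id: incoming.id, recordHLL, tombstoneHLL, isKeeper };
};
```

Because merging only adds elements, repeated gossip rounds can only move a node toward, never away from, the keeper threshold.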
2.2 Data Model
```typescript
interface DataRecord<Data> {
  id: string;
  data: Data;
  recordHLL: HyperLogLog; // Tracks nodes that received record
}

interface Tombstone {
  id: string;
  recordHLL: HyperLogLog;    // Target: estimated record distribution
  tombstoneHLL: HyperLogLog; // Progress: estimated tombstone distribution
}
```
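A deleting node might construct a tombstone from these interfaces roughly as below. This is a sketch: `createTombstone` is a hypothetical helper, and a `Set` stands in for the HLL type so the snippet is self-contained:

```typescript
// Set stand-in for HyperLogLog, to keep the sketch self-contained.
type HLL = Set<string>;

interface DataRecord<D> {
  id: string;
  data: D;
  recordHLL: HLL;
}

interface Tombstone {
  id: string;
  recordHLL: HLL;
  tombstoneHLL: HLL;
}

// Phase 2: the deleting node snapshots the record's HLL as the target
// and seeds the progress HLL with itself.
const createTombstone = <D>(rec: DataRecord<D>, deleterId: string): Tombstone => ({
  id: rec.id,
  recordHLL: new Set(rec.recordHLL),  // frozen target distribution
  tombstoneHLL: new Set([deleterId]), // progress starts with the deleter only
});
```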
2.3 Keeper Election
```typescript
const shouldStepDown = (
  myEstimate: number,     // My tombstone HLL estimate
  theirEstimate: number,  // Incoming tombstone HLL estimate
  targetEstimate: number, // Record HLL estimate (threshold)
  myNodeId: string,
  theirNodeId: string
): boolean => {
  const iAmKeeper = myEstimate >= targetEstimate;
  const theyAreKeeper = theirEstimate >= targetEstimate;
  if (!iAmKeeper || !theyAreKeeper) return false;
  // Step down if they have higher estimate
  if (theirEstimate > myEstimate) return true;
  // Tie-breaker: higher node ID steps down
  if (theirEstimate === myEstimate && myNodeId > theirNodeId) return true;
  return false;
};
```
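A few worked calls make the election behavior concrete (the function is repeated here so the example is self-contained; the node IDs are made up):

```typescript
const shouldStepDown = (
  myEstimate: number,
  theirEstimate: number,
  targetEstimate: number,
  myNodeId: string,
  theirNodeId: string
): boolean => {
  const iAmKeeper = myEstimate >= targetEstimate;
  const theyAreKeeper = theirEstimate >= targetEstimate;
  if (!iAmKeeper || !theyAreKeeper) return false;
  if (theirEstimate > myEstimate) return true;
  if (theirEstimate === myEstimate && myNodeId > theirNodeId) return true;
  return false;
};

// Both keepers, peer has the higher estimate: I yield.
shouldStepDown(98, 100, 97, "node-a", "node-b"); // true
// Tied estimates: the lexicographically higher node ID yields.
shouldStepDown(100, 100, 97, "node-b", "node-a"); // true
// I have not reached the target yet, so I was never a keeper: no change.
shouldStepDown(90, 100, 97, "node-a", "node-b"); // false
```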
3. Design Rationale
3.1 Why Propagate the Record HLL with Tombstones?
Without a shared target, each node would compare against its own local record HLL, leading to premature garbage collection. By propagating the record HLL with the tombstone and always keeping the highest estimate encountered, all nodes converge on a safe target.
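This ratcheting behavior falls out of HLL merging: a merge takes the maximum of each register, so the merged estimate can never drop below either input. A minimal sketch, again with a `Set` standing in for the HLL:

```typescript
// Set stand-in: union plays the role of the register-wise max merge,
// so the estimate is monotonically non-decreasing.
type HLL = Set<string>;
const estimate = (h: HLL): number => h.size;
const merge = (a: HLL, b: HLL): HLL => new Set([...a, ...b]);

let localTarget: HLL = new Set(["a", "b"]);           // local view: 2 nodes saw the record
const incomingTarget: HLL = new Set(["b", "c", "d"]); // peer's view: 3 nodes
localTarget = merge(localTarget, incomingTarget);     // union of views: 4 nodes, never fewer
```

Every node that relays the tombstone performs this merge, so all nodes converge on the highest (safest) target seen anywhere in the network.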
3.2 Why Dynamic Keeper Election?
A fixed originator-as-keeper design creates a single point of failure. If the originator goes offline, tombstone propagation halts and records may resurrect when stale nodes reconnect.
Dynamic election allows any node to become a keeper when it detects tombstoneEstimate >= recordEstimate. This ensures tombstone propagation continues regardless of which specific node initiated the deletion.
3.3 Why Keeper Step-Down?
Without step-down logic, every node eventually becomes a keeper. Step-down creates convergence toward a minimal keeper set:
```mermaid
graph TD
    subgraph Keeper Convergence Over Time
        T0["t=0: 0 keepers"]
        T1["t=1: 5 keepers<br/>(first nodes to detect threshold)"]
        T2["t=2: 3 keepers<br/>(2 stepped down after seeing higher estimates)"]
        T3["t=3: 1 keeper<br/>(converged to single keeper)"]
    end
    T0 --> T1 --> T2 --> T3
```
3.4 Why Node ID Tie-Breaker?
When HLL estimates converge (all nodes have similar values), no node can have a strictly higher estimate. The lexicographic node ID comparison ensures deterministic convergence to a single keeper.
3.5 Why Forward on Step-Down?
With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all reachable nodes, creating a cascade effect that rapidly eliminates redundant keepers.
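A sketch of that cascade, where `Peer` and `send` are assumed abstractions over the gossip layer rather than names from the simulation:

```typescript
interface Tombstone { id: string }
interface Peer {
  id: string;
  send: (msg: Tombstone) => void;
}

// On stepping down, eagerly forward the winning tombstone to every reachable
// peer. Each recipient re-runs keeper election on receipt, so redundant
// keepers discover the higher estimate in one hop instead of waiting for
// the next gossip round.
const stepDownAndForward = (winning: Tombstone, peers: Peer[]): void => {
  for (const peer of peers) {
    peer.send(winning);
  }
};
```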
4. Comparison with Alternative Approaches
4.1 Time-based GC vs HyperLogLog
| Aspect | Time-based GC | HyperLogLog |
|---|---|---|
| Safety | Unsafe if nodes offline > TTL | Safe (waits for propagation) |
| Configuration | Requires tuning TTL | Self-adapting |
| Metadata overhead | None | ~2 KB per tombstone |
| Coordination | None | None |
When to use Time-based: Networks with predictable uptime and short offline periods.
When to use HyperLogLog: Networks with unpredictable offline durations or partition-prone environments.
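For contrast, time-based GC reduces to a one-line filter with no per-tombstone metadata; the field names here are illustrative:

```typescript
interface TimedTombstone {
  id: string;
  deletedAtMs: number; // wall-clock time of deletion
}

// Drop any tombstone older than the TTL, regardless of whether it has
// propagated -- this is exactly the resurrection risk noted above.
const gcByTtl = (
  tombs: TimedTombstone[],
  ttlMs: number,
  nowMs: number
): TimedTombstone[] => tombs.filter((t) => nowMs - t.deletedAtMs < ttlMs);
```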
4.2 Causal Stability Detection vs HyperLogLog
Traditional causal stability uses vector clocks or explicit node sets to track exactly which nodes have observed operations.
| Aspect | Causal Stability | HyperLogLog |
|---|---|---|
| Tracking | Exact (vector clocks/sets) | Approximate (~3% error) |
| Memory per tombstone | O(n) – grows with nodes | O(1) – constant ~2 KB |
| Bandwidth per message | O(n) – grows with nodes | O(1) – constant ~2 KB |
| Debugging | Can enumerate missing nodes | Only aggregate estimates |
| Implementation | Simple data structures | Requires HLL library |
Memory comparison:
| Network Size | Causal Stability | HyperLogLog | HLL Advantage |
|---|---|---|---|
| 10 nodes | ~400 bytes | ~2 KB | 0.2x (worse) |
| 50 nodes | ~2 KB | ~2 KB | 1x (equal) |
| 100 nodes | ~4 KB | ~2 KB | 2x better |
| 1,000 nodes | ~40 KB | ~2 KB | 20x better |
| 10,000 nodes | ~400 KB | ~2 KB | 200x better |
When to use Causal Stability: Networks < 50 nodes, or when exact tracking is required for auditing.
When to use HyperLogLog: Networks > 100 nodes, bandwidth-constrained environments, or unpredictably growing networks.
4.3 Consensus-based GC vs HyperLogLog
| Aspect | Consensus-based | HyperLogLog |
|---|---|---|
| Coordination | Required (Raft/Paxos) | None (local decisions) |
| Partition tolerance | Blocks during partitions | Continues independently |
| Guarantees | Strong consistency | Eventual consistency |
| Latency | Round-trip consensus | Local estimation |
When to use Consensus: Systems already using consensus (e.g., Raft-replicated databases).
When to use HyperLogLog: Partition-tolerant systems, high-latency networks, or systems without existing consensus infrastructure.
4.4 Summary
```mermaid
graph TD
    A[Choose GC Approach] --> B{Network Size?}
    B -->|< 50 nodes| C[Causal Stability<br/>Exact tracking]
    B -->|> 100 nodes| D[HyperLogLog<br/>Probabilistic tracking]
    B -->|50-100 nodes| E{Priority?}
    E -->|Exactness| C
    E -->|Scalability| D
    A --> F{Already have consensus?}
    F -->|Yes| G[Consensus-based GC<br/>Strong guarantees]
    F -->|No| H{Predictable uptime?}
    H -->|Yes, short outages| I[Time-based GC<br/>Simple TTL]
    H -->|No, long/variable outages| D
```
5. Simulation Results
We implemented the HyperLogLog approach in a discrete-event simulation with 50 trials per scenario across various failure modes (node offline, network partition, concurrent deleters, origin node failure).
5.1 Resource Usage
| Network Size | Memory per Tombstone | Bandwidth per Gossip |
|---|---|---|
| 20 nodes | ~2 KB | ~2 KB |
| 500 nodes | ~2 KB | ~2 KB |
| 10,000 nodes | ~2 KB | ~2 KB |
Constant resource usage regardless of network size.
6. Limitations
6.1 Estimation Error
HyperLogLog provides ~3% error at precision 10 (1024 registers). This can cause:
- Premature keeper election: Rare, handled conservatively by erring toward retaining tombstones
- Delayed keeper convergence: Minor efficiency impact
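The ~3% figure follows directly from the standard HyperLogLog error bound of 1.04/√m, where m = 2^p is the register count:

```typescript
// Standard error of a HyperLogLog estimate at a given precision p,
// per Flajolet et al. [4]: 1.04 / sqrt(2^p).
const hllStandardError = (precision: number): number =>
  1.04 / Math.sqrt(2 ** precision);

hllStandardError(10); // 1.04 / 32, roughly 0.0325 -- the ~3% quoted above
```

Raising the precision to 14 (16,384 registers, ~16 KB) would cut the error to about 0.8% at the cost of the constant-size advantage shrinking for small networks.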
6.2 No Node Enumeration
Cannot identify which specific nodes are missing the tombstone. For debugging or auditing, causal stability with explicit tracking is preferable.
6.3 Message Ordering
If a tombstone arrives before the record (due to message reordering), the node ignores the tombstone. If the record subsequently arrives, the node accepts it. However, keepers will eventually propagate the tombstone to this node, so the record is eventually deleted—just not immediately.
7. Conclusion
HyperLogLog-based tombstone garbage collection provides a scalable alternative to traditional approaches:
| Approach | Best For |
|---|---|
| Time-based GC | Predictable networks with short outages |
| Causal Stability | Small networks (< 50 nodes) requiring exact tracking |
| Consensus-based GC | Systems with existing consensus infrastructure |
| HyperLogLog | Large networks (> 100 nodes), partition-prone environments |
The HyperLogLog approach trades exactness (~3% error) for dramatic scalability—constant memory and bandwidth regardless of network size. For networks that may grow unpredictably, this provides a robust foundation without configuration changes.
References
A working simulation is available at simulations/hyperloglog-tombstone/simulation.ts.
[1] Ladin, R., et al. (1992). "Providing high availability using lazy replication." ACM TOCS, 10(4). https://doi.org/10.1145/138873.138877
[2] Baquero, C., et al. (2017). "Pure Operation-Based Replicated Data Types." arXiv:1710.04469. https://arxiv.org/abs/1710.04469
[3] Bauwens, J., & De Meuter, W. (2020). "Memory Efficient CRDTs in Dynamic Environments." PaPoC '20. https://doi.org/10.1145/3380787.3393682
[4] Flajolet, P., et al. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." Discrete Mathematics and Theoretical Computer Science. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf