# HyperLogLog-Based Tombstone Garbage Collection for Distributed Systems

## Abstract

When synchronizing records in a distributed network, deletion presents a fundamental challenge. Nodes must maintain "tombstone" records to prevent deleted data from being resurrected by offline nodes. This paper presents a **HyperLogLog-based approach** to tombstone garbage collection that uses probabilistic cardinality estimation to detect when tombstones have reached sufficient distribution. We compare this approach against traditional methods (time-based garbage collection, causal stability detection, and consensus-based collection), analyzing trade-offs in memory, coordination requirements, and failure tolerance.

## 1. Introduction

Distributed systems face an inherent tension between data consistency and storage efficiency when handling deletions. Traditional tombstone-based approaches guarantee correctness but impose unbounded storage growth. Several approaches have been proposed to address tombstone accumulation:

**Time-based Garbage Collection**: Sets a fixed time-to-live (TTL) for tombstones, after which they are automatically deleted[^2]. While storage-efficient, this risks data resurrection if offline nodes reconnect after the GC window.

**Causal Stability Detection**: Prunes tombstones when the system can prove all nodes have observed the deletion[^5]. Implementations vary from vector clocks (tracking operation ordering) to explicit node ID sets (tracking membership). This adds metadata overhead but provides strong guarantees when network conditions allow reliable tracking.

**Consensus-based Garbage Collection**: Uses coordination protocols to agree on when tombstones can be safely deleted[^6]. This provides strong guarantees but requires synchronization, which may be impractical in partition-prone or high-latency networks.

This paper introduces a **HyperLogLog-based approach**[^1] that approximates causal stability detection using probabilistic cardinality estimation.
Instead of tracking exact node sets or vector clocks, it uses constant-size HyperLogLog structures to estimate propagation. This trades exactness for dramatic reductions in memory and bandwidth at scale.

[^1]: Flajolet, P., et al. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." *Discrete Mathematics and Theoretical Computer Science*. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
[^2]: Ladin, R., et al. (1992). "Providing high availability using lazy replication." *ACM TOCS*, 10(4). https://doi.org/10.1145/138873.138877
[^5]: Baquero, C., et al. (2017). "Pure Operation-Based Replicated Data Types." *arXiv:1710.04469*. https://arxiv.org/abs/1710.04469
[^6]: Bauwens, J., & De Meuter, W. (2020). "Memory Efficient CRDTs in Dynamic Environments." *PaPoC '20*. https://doi.org/10.1145/3380787.3393682

## 2. Core Algorithm

### 2.1 How It Works

```mermaid
sequenceDiagram
    participant A as Node A
    participant B as Node B (offline)
    participant C as Node C
    Note over A,C: Phase 1: Record Propagation
    A->>C: record + recordHLL
    C->>A: update recordHLL
    Note over B: B receives record before going offline
    Note over A,C: Phase 2: Tombstone Propagation
    A->>A: Create tombstone, delete record
    A->>C: tombstone + tombstoneHLL + recordHLL
    C->>C: Delete record, update HLLs
    Note over A,C: Phase 3: Keeper Election
    Note over A,C: estimate(tombstoneHLL) >= estimate(recordHLL)
    Note over A,C: Nodes elect minimal keepers
    Note over A,C: Phase 4: B Reconnects
    B->>B: Comes back online
    C->>B: tombstone propagates
    B->>B: Deletes stale record
```

**Phase 1**: Records propagate through gossip. Each node adds itself to the record's HyperLogLog.

**Phase 2**: When deletion occurs, the deleting node creates a tombstone containing a copy of the record's HLL as the "target count." The tombstone propagates to online nodes.

**Phase 3**: When a node's tombstone HLL estimate reaches or exceeds the target, it may become a "keeper."
Keepers step down when they encounter another keeper with a higher estimate (using node ID as a tie-breaker).

**Phase 4**: Offline nodes receive the tombstone upon reconnection and delete their stale records.

### 2.2 Data Model

```ts
interface DataRecord {
  id: string;
  data: Data;
  recordHLL: HyperLogLog; // Tracks nodes that received the record
}

interface Tombstone {
  id: string;
  recordHLL: HyperLogLog;    // Target: estimated record distribution
  tombstoneHLL: HyperLogLog; // Progress: estimated tombstone distribution
}
```

### 2.3 Keeper Election

```ts
const shouldStepDown = (
  myEstimate: number,     // My tombstone HLL estimate
  theirEstimate: number,  // Incoming tombstone HLL estimate
  targetEstimate: number, // Record HLL estimate (threshold)
  myNodeId: string,
  theirNodeId: string
): boolean => {
  const iAmKeeper = myEstimate >= targetEstimate;
  const theyAreKeeper = theirEstimate >= targetEstimate;
  if (!iAmKeeper || !theyAreKeeper) return false;

  // Step down if they have a higher estimate
  if (theirEstimate > myEstimate) return true;

  // Tie-breaker: the higher node ID steps down
  if (theirEstimate === myEstimate && myNodeId > theirNodeId) return true;

  return false;
};
```

## 3. Design Rationale

### 3.1 Why Propagate the Record HLL with Tombstones?

Without a shared target, each node would compare against its own local record HLL, leading to premature garbage collection. By propagating the record HLL with the tombstone and always keeping the highest estimate encountered, all nodes converge on a safe target.

### 3.2 Why Dynamic Keeper Election?

A fixed originator-as-keeper design creates a single point of failure. If the originator goes offline, tombstone propagation halts and records may resurrect when stale nodes reconnect. Dynamic election allows any node to become a keeper when it detects `tombstoneEstimate >= recordEstimate`. This ensures tombstone propagation continues regardless of which node initiated the deletion.

### 3.3 Why Keeper Step-Down?
Without step-down logic, every node eventually becomes a keeper. Step-down creates convergence toward a minimal keeper set:

```mermaid
graph TD
    subgraph "Keeper Convergence Over Time"
        T0["t=0: 0 keepers"]
        T1["t=1: 5 keepers<br/>(first nodes to detect threshold)"]
        T2["t=2: 3 keepers<br/>(2 stepped down after seeing higher estimates)"]
        T3["t=3: 1 keeper<br/>(converged to single keeper)"]
    end
    T0 --> T1 --> T2 --> T3
```

### 3.4 Why a Node ID Tie-Breaker?

When HLL estimates converge (all nodes hold similar values), no node can have a strictly higher estimate. Lexicographic comparison of node IDs then ensures deterministic convergence to a single keeper.

### 3.5 Why Forward on Step-Down?

With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all reachable nodes, creating a cascade that rapidly eliminates redundant keepers.

## 4. Comparison with Alternative Approaches

### 4.1 Time-based GC vs HyperLogLog

| Aspect | Time-based GC | HyperLogLog |
|--------|---------------|-------------|
| Safety | Unsafe if nodes offline > TTL | Safe (waits for propagation) |
| Configuration | Requires tuning TTL | Self-adapting |
| Metadata overhead | None | ~2 KB per tombstone |
| Coordination | None | None |

**When to use Time-based**: Networks with predictable uptime and short offline periods.

**When to use HyperLogLog**: Networks with unpredictable offline durations or partition-prone environments.

### 4.2 Causal Stability Detection vs HyperLogLog

Traditional causal stability uses vector clocks or explicit node sets to track exactly which nodes have observed operations.
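For concreteness, the constant-size structure assumed on the HyperLogLog side of this comparison can be sketched as below. This is a minimal, illustrative implementation (not the paper's code): precision `p = 10` gives 1024 one-byte registers, and the hash function (FNV-1a with a murmur-style finalizer) is a stand-in for whatever production hash a real deployment would use.

```ts
// Minimal HyperLogLog sketch (illustrative only). p = 10 gives 1024
// one-byte registers (~1 KB of state) and roughly 3% standard error.
class HyperLogLog {
  private readonly p = 10;
  private readonly m = 1 << this.p; // 1024 registers
  private registers = new Uint8Array(this.m);

  // FNV-1a with a murmur-style finalizer; stands in for a production hash.
  private hash(value: string): number {
    let h = 0x811c9dc5;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 0x01000193);
    }
    h ^= h >>> 16;
    h = Math.imul(h, 0x85ebca6b);
    h ^= h >>> 13;
    h = Math.imul(h, 0xc2b2ae35);
    h ^= h >>> 16;
    return h >>> 0;
  }

  // A node "adds itself" by hashing its ID into one register.
  add(nodeId: string): void {
    const h = this.hash(nodeId);
    const idx = h >>> (32 - this.p);  // first p bits select a register
    const rest = (h << this.p) >>> 0; // remaining bits
    const rank = rest === 0 ? 32 - this.p + 1 : Math.clz32(rest) + 1;
    if (rank > this.registers[idx]) this.registers[idx] = rank;
  }

  // Register-wise max: merging two sketches estimates the UNION of node
  // sets, which lets gossip combine views without exchanging node lists.
  merge(other: HyperLogLog): void {
    for (let i = 0; i < this.m; i++) {
      if (other.registers[i] > this.registers[i]) {
        this.registers[i] = other.registers[i];
      }
    }
  }

  estimate(): number {
    let sum = 0;
    let zeros = 0;
    for (let i = 0; i < this.m; i++) {
      sum += Math.pow(2, -this.registers[i]);
      if (this.registers[i] === 0) zeros++;
    }
    const alpha = 0.7213 / (1 + 1.079 / this.m);
    let e = (alpha * this.m * this.m) / sum;
    // Small-range correction (linear counting) from the HLL paper.
    if (e <= 2.5 * this.m && zeros > 0) e = this.m * Math.log(this.m / zeros);
    return e;
  }
}
```

With `p = 10`, the sketch stays ~1 KB whether it has absorbed ten node IDs or ten thousand; a tombstone carrying both a record HLL and a tombstone HLL would then account for the ~2 KB figure used in the tables that follow.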
| Aspect | Causal Stability | HyperLogLog |
|--------|------------------|-------------|
| Tracking | Exact (vector clocks/sets) | Approximate (~3% error) |
| Memory per tombstone | O(n) – grows with nodes | O(1) – constant ~2 KB |
| Bandwidth per message | O(n) – grows with nodes | O(1) – constant ~2 KB |
| Debugging | Can enumerate missing nodes | Only aggregate estimates |
| Implementation | Simple data structures | Requires HLL library |

**Memory comparison:**

| Network Size | Causal Stability | HyperLogLog | HLL Advantage |
|--------------|------------------|-------------|---------------|
| 10 nodes | ~400 bytes | ~2 KB | 0.2x (worse) |
| 50 nodes | ~2 KB | ~2 KB | 1x (equal) |
| 100 nodes | ~4 KB | ~2 KB | 2x better |
| 1,000 nodes | ~40 KB | ~2 KB | 20x better |
| 10,000 nodes | ~400 KB | ~2 KB | 200x better |

**When to use Causal Stability**: Networks under 50 nodes, or when exact tracking is required for auditing.

**When to use HyperLogLog**: Networks over 100 nodes, bandwidth-constrained environments, or unpredictably growing networks.

### 4.3 Consensus-based GC vs HyperLogLog

| Aspect | Consensus-based | HyperLogLog |
|--------|-----------------|-------------|
| Coordination | Required (Raft/Paxos) | None (local decisions) |
| Partition tolerance | Blocks during partitions | Continues independently |
| Guarantees | Strong consistency | Eventual consistency |
| Latency | Round-trip consensus | Local estimation |

**When to use Consensus**: Systems already using consensus (e.g., Raft-replicated databases).

**When to use HyperLogLog**: Partition-tolerant systems, high-latency networks, or systems without existing consensus infrastructure.

### 4.4 Summary

```mermaid
graph TD
    A[Choose GC Approach] --> B{Network size?}
    B -->|"< 50 nodes"| C["Causal Stability<br/>Exact tracking"]
    B -->|"> 100 nodes"| D["HyperLogLog<br/>Probabilistic tracking"]
    B -->|"50-100 nodes"| E{Priority?}
    E -->|Exactness| C
    E -->|Scalability| D
    A --> F{Already have consensus?}
    F -->|Yes| G["Consensus-based GC<br/>Strong guarantees"]
    F -->|No| H{Predictable uptime?}
    H -->|"Yes, short outages"| I["Time-based GC<br/>Simple TTL"]
    H -->|"No, long/variable outages"| D
```

## 5. Simulation Results

We implemented the HyperLogLog approach in a discrete-event simulation with 50 trials per scenario across various failure modes (node offline, network partition, concurrent deleters, origin node failure).

### 5.1 Resource Usage

| Network Size | Memory per Tombstone | Bandwidth per Gossip |
|--------------|----------------------|----------------------|
| 20 nodes | ~2 KB | ~2 KB |
| 500 nodes | ~2 KB | ~2 KB |
| 10,000 nodes | ~2 KB | ~2 KB |

Resource usage stays constant regardless of network size.

## 6. Limitations

### 6.1 Estimation Error

HyperLogLog provides ~3% standard error at precision 10 (1024 registers). This can cause:

- **Premature keeper election**: Rare, and handled conservatively by erring toward retaining tombstones
- **Delayed keeper convergence**: Minor efficiency impact

### 6.2 No Node Enumeration

The system cannot identify which specific nodes are missing the tombstone. For debugging or auditing, causal stability with explicit tracking is preferable.

### 6.3 Message Ordering

If a tombstone arrives before the record (due to message reordering), the node ignores the tombstone. If the record subsequently arrives, the node accepts it. However, keepers will eventually propagate the tombstone to this node, so the record is eventually deleted, just not immediately.

## 7. Conclusion

HyperLogLog-based tombstone garbage collection provides a scalable alternative to traditional approaches:

| Approach | Best For |
|----------|----------|
| **Time-based GC** | Predictable networks with short outages |
| **Causal Stability** | Small networks (< 50 nodes) requiring exact tracking |
| **Consensus-based GC** | Systems with existing consensus infrastructure |
| **HyperLogLog** | Large networks (> 100 nodes), partition-prone environments |

The HyperLogLog approach trades exactness (~3% error) for dramatic scalability: constant memory and bandwidth regardless of network size.
For networks that may grow unpredictably, this provides a robust foundation without configuration changes.

## References

A working simulation is available at [simulations/hyperloglog-tombstone/simulation.ts](/simulations/hyperloglog-tombstone/simulation.ts).

#algorithm #computer #distributed #peer_to_peer
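As a closing illustration, the step-down rule from Section 2.3 can be exercised directly. In this sketch, `shouldStepDown` is the function given earlier (lightly condensed); the target estimate and the three candidate keepers are hypothetical values chosen to show both step-down paths.

```ts
// Walkthrough of the keeper step-down rule from Section 2.3.
const shouldStepDown = (
  myEstimate: number,     // My tombstone HLL estimate
  theirEstimate: number,  // Incoming tombstone HLL estimate
  targetEstimate: number, // Record HLL estimate (threshold)
  myNodeId: string,
  theirNodeId: string
): boolean => {
  const iAmKeeper = myEstimate >= targetEstimate;
  const theyAreKeeper = theirEstimate >= targetEstimate;
  if (!iAmKeeper || !theyAreKeeper) return false;
  if (theirEstimate > myEstimate) return true;           // higher estimate wins
  return theirEstimate === myEstimate && myNodeId > theirNodeId; // ID tie-break
};

// Three candidate keepers whose tombstone estimates crossed a target of 100
// (hypothetical values).
const target = 100;
const keepers = [
  { id: "node-a", estimate: 101 },
  { id: "node-b", estimate: 103 },
  { id: "node-c", estimate: 103 },
];

// One full round of pairwise gossip: a keeper that should step down against
// any peer leaves the keeper set.
const remaining = keepers.filter((me) =>
  keepers.every(
    (them) =>
      them.id === me.id ||
      !shouldStepDown(me.estimate, them.estimate, target, me.id, them.id)
  )
);
// node-a steps down (lower estimate); node-c steps down (tie, higher node ID)
// → remaining holds only node-b
```

This is the convergence sketched in Section 3.3: lower-estimate keepers drop out first, and the ID tie-breaker resolves the final tie deterministically.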