
HyperLogLog-Based Tombstone Garbage Collection for Distributed Systems

Abstract

When synchronizing records in a distributed network, deletion presents a fundamental challenge. Nodes must maintain "tombstone" records to prevent deleted data from being resurrected by offline nodes. This paper presents a HyperLogLog-based approach to tombstone garbage collection that uses probabilistic cardinality estimation to detect when tombstones have reached sufficient distribution.

We compare this approach against traditional methods—time-based garbage collection and causal stability detection—analyzing trade-offs in memory, coordination requirements, and failure tolerance.

1. Introduction

Distributed systems face an inherent tension between data consistency and storage efficiency when handling deletions. Traditional tombstone-based approaches guarantee correctness but impose unbounded storage growth. Several approaches have been proposed to address tombstone accumulation:

Time-based Garbage Collection: Sets a fixed time-to-live (TTL) for tombstones, after which they are automatically deleted [1]. While storage-efficient, this risks data resurrection if offline nodes reconnect after the GC window.

Causal Stability Detection: Prunes tombstones when the system can prove all nodes have observed the deletion [2]. Implementations range from vector clocks (tracking operation ordering) to explicit node ID sets (tracking membership). This adds metadata overhead but provides strong guarantees when network conditions allow reliable tracking.

Consensus-based Garbage Collection: Uses coordination protocols to agree on when tombstones can be safely deleted [3]. Provides strong guarantees but requires synchronization, which may be impractical in partition-prone or high-latency networks.

This paper introduces a HyperLogLog-based approach [4] that approximates causal stability detection using probabilistic cardinality estimation. Instead of tracking exact node sets or vector clocks, it uses constant-size HyperLogLog structures to estimate propagation. This trades exactness for dramatic reductions in memory and bandwidth at scale.

2. Core Algorithm

2.1 How It Works

```mermaid
sequenceDiagram
    participant A as Node A
    participant B as Node B (offline)
    participant C as Node C

    Note over A,C: Phase 1: Record Propagation
    A->>C: record + recordHLL
    C->>A: update recordHLL
    Note over B: B receives record before going offline

    Note over A,C: Phase 2: Tombstone Propagation
    A->>A: Create tombstone, delete record
    A->>C: tombstone + tombstoneHLL + recordHLL
    C->>C: Delete record, update HLLs

    Note over A,C: Phase 3: Keeper Election
    Note over A,C: estimate(tombstoneHLL) >= estimate(recordHLL)
    Note over A,C: Nodes elect minimal keepers

    Note over A,C: Phase 4: B Reconnects
    B->>B: Comes back online
    C->>B: tombstone propagates
    B->>B: Deletes stale record
```

Phase 1: Records propagate through gossip. Each node adds itself to the record's HyperLogLog.

Phase 2: When deletion occurs, the deleting node creates a tombstone containing a copy of the record's HLL as the "target count." The tombstone propagates to online nodes.

Phase 3: When a node's tombstone HLL estimate reaches or exceeds the target, it may become a "keeper." Keepers step down when they encounter another keeper with a higher estimate (using node ID as tie-breaker).

Phase 4: Offline nodes receive the tombstone upon reconnection and delete their stale records.

2.2 Data Model

```typescript
interface DataRecord<Data> {
  id: string;
  data: Data;
  recordHLL: HyperLogLog;  // Tracks nodes that received record
}

interface Tombstone {
  id: string;
  recordHLL: HyperLogLog;     // Target: estimated record distribution
  tombstoneHLL: HyperLogLog;  // Progress: estimated tombstone distribution
}
```
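
The recordHLL and tombstoneHLL fields above assume a sketch supporting add, merge, and estimate. A minimal HyperLogLog sketch in TypeScript is given below; it is illustrative only (a real deployment should use a tested HLL library), and the hash construction is an assumption, not the simulation's:

```typescript
// Minimal HyperLogLog sketch for node-ID tracking. Illustrative only.
// Precision 10 gives 1024 single-byte registers and ~3% standard error.
class HyperLogLog {
  readonly registers: Uint8Array;

  constructor(readonly precision: number = 10) {
    this.registers = new Uint8Array(1 << precision); // 2^p registers
  }

  // 32-bit FNV-1a hash with a murmur3-style avalanche finalizer.
  private hash(value: string): number {
    let h = 0x811c9dc5;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 0x01000193);
    }
    h ^= h >>> 16;
    h = Math.imul(h, 0x85ebca6b);
    h ^= h >>> 13;
    h = Math.imul(h, 0xc2b2ae35);
    h ^= h >>> 16;
    return h >>> 0;
  }

  // Record that a node has observed this record/tombstone.
  add(nodeId: string): void {
    const h = this.hash(nodeId);
    const index = h >>> (32 - this.precision);  // top p bits pick a register
    const rest = (h << this.precision) >>> 0;   // remaining 32 - p bits
    const rank = rest === 0 ? 33 - this.precision : Math.clz32(rest) + 1;
    if (rank > this.registers[index]) this.registers[index] = rank;
  }

  // Union of two sketches: element-wise max of registers.
  merge(other: HyperLogLog): void {
    for (let i = 0; i < this.registers.length; i++) {
      this.registers[i] = Math.max(this.registers[i], other.registers[i]);
    }
  }

  // Approximate number of distinct node IDs added.
  estimate(): number {
    const m = this.registers.length;
    let sum = 0;
    let zeros = 0;
    for (const r of this.registers) {
      sum += Math.pow(2, -r);
      if (r === 0) zeros++;
    }
    const alpha = 0.7213 / (1 + 1.079 / m);
    const raw = (alpha * m * m) / sum;
    // Small-range correction: fall back to linear counting.
    return raw <= 2.5 * m && zeros > 0 ? m * Math.log(m / zeros) : raw;
  }
}
```

Because merge is an element-wise max, it is commutative, associative, and idempotent, which is what makes gossiping these sketches safe: receiving the same HLL twice, or in any order, converges to the same state.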

2.3 Keeper Election

```typescript
const shouldStepDown = (
  myEstimate: number,      // My tombstone HLL estimate
  theirEstimate: number,   // Incoming tombstone HLL estimate
  targetEstimate: number,  // Record HLL estimate (threshold)
  myNodeId: string,
  theirNodeId: string
): boolean => {
  const iAmKeeper = myEstimate >= targetEstimate;
  const theyAreKeeper = theirEstimate >= targetEstimate;

  if (!iAmKeeper || !theyAreKeeper) return false;

  // Step down if they have higher estimate
  if (theirEstimate > myEstimate) return true;

  // Tie-breaker: higher node ID steps down
  if (theirEstimate === myEstimate && myNodeId > theirNodeId) return true;

  return false;
};
```

3. Design Rationale

3.1 Why Propagate the Record HLL with Tombstones?

Without a shared target, each node would compare against its own local record HLL, leading to premature garbage collection. By propagating the record HLL with the tombstone and always keeping the highest estimate encountered, all nodes converge on a safe target.
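
This merge-on-receive rule can be sketched as follows, with an HLL reduced to its raw register array to keep the example self-contained (the type and function names are assumptions, not the simulation's API):

```typescript
// An HLL reduced to its register array; merging is element-wise max, so a
// node's target (recordHLL) can only grow toward the highest estimate seen.
type Registers = number[];

const mergeRegisters = (a: Registers, b: Registers): Registers =>
  a.map((reg, i) => Math.max(reg, b[i]));

interface Tombstone {
  id: string;
  recordHLL: Registers;     // target: estimated record distribution
  tombstoneHLL: Registers;  // progress: estimated tombstone distribution
}

// On receiving a tombstone we already hold, fold in both sketches.
const onTombstoneReceived = (local: Tombstone, incoming: Tombstone): Tombstone => ({
  id: local.id,
  recordHLL: mergeRegisters(local.recordHLL, incoming.recordHLL),
  tombstoneHLL: mergeRegisters(local.tombstoneHLL, incoming.tombstoneHLL),
});
```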

3.2 Why Dynamic Keeper Election?

A fixed originator-as-keeper design creates a single point of failure. If the originator goes offline, tombstone propagation halts and deleted records may be resurrected when nodes holding stale copies reconnect.

Dynamic election allows any node to become a keeper when it detects tombstoneEstimate >= recordEstimate. This ensures tombstone propagation continues regardless of which specific node initiated the deletion.

3.3 Why Keeper Step-Down?

Without step-down logic, every node eventually becomes a keeper. Step-down creates convergence toward a minimal keeper set:

```mermaid
graph TD
    subgraph "Keeper Convergence Over Time"
        T0["t=0: 0 keepers"]
        T1["t=1: 5 keepers<br/>(first nodes to detect threshold)"]
        T2["t=2: 3 keepers<br/>(2 stepped down after seeing higher estimates)"]
        T3["t=3: 1 keeper<br/>(converged to single keeper)"]
    end
    T0 --> T1 --> T2 --> T3
```

3.4 Why Node ID Tie-Breaker?

When HLL estimates converge (all nodes have similar values), no node can have a strictly higher estimate. The lexicographic node ID comparison ensures deterministic convergence to a single keeper.

3.5 Why Forward on Step-Down?

With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all reachable nodes, creating a cascade effect that rapidly eliminates redundant keepers.

4. Comparison with Alternative Approaches

4.1 Time-based GC vs HyperLogLog

| Aspect | Time-based GC | HyperLogLog |
|---|---|---|
| Safety | Unsafe if nodes offline > TTL | Safe (waits for propagation) |
| Configuration | Requires tuning TTL | Self-adapting |
| Metadata overhead | None | ~2 KB per tombstone |
| Coordination | None | None |

When to use Time-based: Networks with predictable uptime and short offline periods.

When to use HyperLogLog: Networks with unpredictable offline durations or partition-prone environments.

4.2 Causal Stability Detection vs HyperLogLog

Traditional causal stability uses vector clocks or explicit node sets to track exactly which nodes have observed operations.

| Aspect | Causal Stability | HyperLogLog |
|---|---|---|
| Tracking | Exact (vector clocks/sets) | Approximate (~3% error) |
| Memory per tombstone | O(n), grows with nodes | O(1), constant ~2 KB |
| Bandwidth per message | O(n), grows with nodes | O(1), constant ~2 KB |
| Debugging | Can enumerate missing nodes | Only aggregate estimates |
| Implementation | Simple data structures | Requires HLL library |

Memory comparison:

| Network Size | Causal Stability | HyperLogLog | HLL Advantage |
|---|---|---|---|
| 10 nodes | ~400 bytes | ~2 KB | 0.2x (worse) |
| 50 nodes | ~2 KB | ~2 KB | 1x (equal) |
| 100 nodes | ~4 KB | ~2 KB | 2x (better) |
| 1,000 nodes | ~40 KB | ~2 KB | 20x (better) |
| 10,000 nodes | ~400 KB | ~2 KB | 200x (better) |
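
A back-of-envelope check of these figures, assuming ~40 bytes of per-node tracking metadata for causal stability and a fixed ~2 KB sketch for HyperLogLog (both figures are rough assumptions, not measurements):

```typescript
// Rough per-tombstone metadata cost as a function of network size.
const causalStabilityBytes = (nodes: number): number => nodes * 40; // O(n)
const hyperLogLogBytes = (_nodes: number): number => 2048;          // O(1)

// Ratio of the two: how many times smaller the HLL sketch is.
const hllAdvantage = (nodes: number): number =>
  causalStabilityBytes(nodes) / hyperLogLogBytes(nodes);
```

The crossover sits near 50 nodes (50 × 40 bytes = 2 KB), which is why the table shows parity there.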

When to use Causal Stability: Networks < 50 nodes, or when exact tracking is required for auditing.

When to use HyperLogLog: Networks > 100 nodes, bandwidth-constrained environments, or unpredictably growing networks.

4.3 Consensus-based GC vs HyperLogLog

| Aspect | Consensus-based | HyperLogLog |
|---|---|---|
| Coordination | Required (Raft/Paxos) | None (local decisions) |
| Partition tolerance | Blocks during partitions | Continues independently |
| Guarantees | Strong consistency | Eventual consistency |
| Latency | Round-trip consensus | Local estimation |

When to use Consensus: Systems already using consensus (e.g., Raft-replicated databases).

When to use HyperLogLog: Partition-tolerant systems, high-latency networks, or systems without existing consensus infrastructure.

4.4 Summary

```mermaid
graph TD
    A[Choose GC Approach] --> B{Network Size?}
    B -->|"< 50 nodes"| C[Causal Stability<br/>Exact tracking]
    B -->|"> 100 nodes"| D[HyperLogLog<br/>Probabilistic tracking]
    B -->|"50-100 nodes"| E{Priority?}
    E -->|Exactness| C
    E -->|Scalability| D

    A --> F{Already have consensus?}
    F -->|Yes| G[Consensus-based GC<br/>Strong guarantees]
    F -->|No| H{Predictable uptime?}
    H -->|"Yes, short outages"| I[Time-based GC<br/>Simple TTL]
    H -->|"No, long/variable outages"| D
```

5. Simulation Results

We implemented the HyperLogLog approach in a discrete-event simulation with 50 trials per scenario across various failure modes (node offline, network partition, concurrent deleters, origin node failure).

5.1 Resource Usage

| Network Size | Memory per Tombstone | Bandwidth per Gossip |
|---|---|---|
| 20 nodes | ~2 KB | ~2 KB |
| 500 nodes | ~2 KB | ~2 KB |
| 10,000 nodes | ~2 KB | ~2 KB |

Constant resource usage regardless of network size.

6. Limitations

6.1 Estimation Error

HyperLogLog provides ~3% error at precision 10 (1024 registers). This can cause:

  • Premature keeper election: Rare, handled conservatively by erring toward retaining tombstones
  • Delayed keeper convergence: Minor efficiency impact
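
The ~3% figure comes directly from HyperLogLog's standard error bound of 1.04/√m [4], where m is the register count:

```typescript
// Standard relative error of a HyperLogLog sketch with 2^precision registers.
const precision = 10;
const m = 1 << precision;               // 1024 registers
const stdError = 1.04 / Math.sqrt(m);   // ≈ 0.0325, i.e. ~3% relative error
```

Raising the precision to 12 (4096 registers, ~4 KB) would halve the error to ~1.6% at double the memory cost, so the precision parameter is the direct knob for this trade-off.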

6.2 No Node Enumeration

Cannot identify which specific nodes are missing the tombstone. For debugging or auditing, causal stability with explicit tracking is preferable.

6.3 Message Ordering

If a tombstone arrives before the record (due to message reordering), the node ignores the tombstone. If the record subsequently arrives, the node accepts it. However, keepers will eventually propagate the tombstone to this node, so the record is eventually deleted—just not immediately.
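
A sketch of this ordering rule (the store layout and function names are assumptions, not the simulation's API):

```typescript
// A node's local state: live records and retained tombstones, keyed by id.
interface Store {
  records: Map<string, unknown>;
  tombstones: Map<string, unknown>;
}

// A tombstone for a record the node has never seen is dropped rather than
// buffered, relying on keepers to re-deliver it on a later gossip round.
const applyTombstone = (store: Store, id: string, tombstone: unknown): boolean => {
  if (!store.records.has(id) && !store.tombstones.has(id)) {
    return false; // unknown record: ignore; a keeper will resend later
  }
  store.records.delete(id);        // delete the stale record, if present
  store.tombstones.set(id, tombstone);
  return true;
};
```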

7. Conclusion

HyperLogLog-based tombstone garbage collection provides a scalable alternative to traditional approaches:

| Approach | Best For |
|---|---|
| Time-based GC | Predictable networks with short outages |
| Causal Stability | Small networks (< 50 nodes) requiring exact tracking |
| Consensus-based GC | Systems with existing consensus infrastructure |
| HyperLogLog | Large networks (> 100 nodes), partition-prone environments |

The HyperLogLog approach trades exactness (~3% error) for dramatic scalability—constant memory and bandwidth regardless of network size. For networks that may grow unpredictably, this provides a robust foundation without configuration changes.

References

A working simulation is available at simulations/hyperloglog-tombstone/simulation.ts.


1. Ladin, R., et al. (1992). "Providing high availability using lazy replication." ACM Transactions on Computer Systems, 10(4). https://doi.org/10.1145/138873.138877
2. Baquero, C., et al. (2017). "Pure Operation-Based Replicated Data Types." arXiv:1710.04469. https://arxiv.org/abs/1710.04469
3. Bauwens, J., & De Meuter, W. (2020). "Memory Efficient CRDTs in Dynamic Environments." PaPoC '20. https://doi.org/10.1145/3380787.3393682
4. Flajolet, P., et al. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." Discrete Mathematics and Theoretical Computer Science. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf