volpe/posts/drafts/hyper-logLog-tombstone-garbage-collection.md


Abstract

When synchronizing records across a distributed network, deletion poses a problem. If each node were to simply delete its own copy of a record, it is highly likely that other nodes would resynchronize the original data after the deletion, reverting the change. This can happen because events do not occur simultaneously across nodes, or because a node that was temporarily disconnected from the network reconnects with an outdated state. The traditional solution to this problem is to create a "tombstone" record that is kept around after the deletion, recording that we once had this data but it has since been deleted and should not be recreated.

While this approach works, it requires every node in the network to indefinitely keep an ever-growing collection of tombstone records. Generally, after an arbitrarily long time it can be assumed that a tombstone is safe to clear, since no rogue nodes should remain that still hold the original data.

The methodology in this paper revolves around using the HyperLogLog (HLL) algorithm to estimate how many nodes have received a record, and comparing that against a matching estimate of how many tombstones have been created. This prunes the number of tombstones that exist within any given network down to a much smaller set, making it possible to extend the time that at least one tombstone stays alive in the network while still reducing the storage overhead.

Core Concept

sequenceDiagram
    participant A as Node A
    participant B as Node B
    participant C as Node C
    
    Note over A,C: Record propagation phase
    A->>B: record + recordHLL
    B->>C: record + recordHLL
    
    Note over A,C: Tombstone propagation phase  
    A->>A: Create tombstone with frozenRecordHLL
    A->>B: tombstone + tombstoneHLL + frozenRecordHLL
    B->>C: tombstone + tombstoneHLL + frozenRecordHLL
    
    Note over A,C: GC phase (after convergence)
    C->>C: tombstoneCount >= frozenRecordCount, become keeper
    B->>B: sees higher estimate, step down and GC

Data Model

Records and tombstones are separate entities:

interface Record<Data> {
  readonly id: string;
  readonly data: Data;
  readonly recordHLL: HLL;  // Tracks nodes with this record
}

interface Tombstone {
  readonly id: string;
  readonly recordHLL: HLL;     // Snapshot of record distribution (updated during propagation)
  readonly tombstoneHLL: HLL;  // Tracks nodes with the tombstone
  readonly isKeeper: boolean;  // This node continues propagating
}

A node stores records and tombstones in separate maps. Tombstones reference records by ID.
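The snippets throughout assume a small HLL API: createHLL, hllClone, hllAdd, hllMerge, and hllEstimate. The sketch below shows one minimal shape those helpers could take, using a 32-bit FNV-1a hash and the standard linear-counting small-range correction; it is illustrative only, not the simulation's actual implementation.

```typescript
// Minimal HyperLogLog sketch (illustrative). Assumes a 32-bit FNV-1a hash
// as a stand-in for a stronger hash function.
type HLL = { readonly p: number; registers: number[] };

const createHLL = (p: number = 10): HLL => ({
  p,
  registers: new Array(1 << p).fill(0),
});

const hllClone = (hll: HLL): HLL => ({ p: hll.p, registers: [...hll.registers] });

// FNV-1a 32-bit string hash.
const hash32 = (s: string): number => {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
};

// Record the rank (position of the first 1-bit) of the item's hash in the
// register selected by the first p bits.
const hllAdd = (hll: HLL, item: string): HLL => {
  const h = hash32(item);
  const idx = h >>> (32 - hll.p);
  // Fill the shifted-in low bits with 1s so the rank is capped at 32 - p + 1.
  const rest = ((h << hll.p) | ((1 << hll.p) - 1)) >>> 0;
  const rank = Math.clz32(rest) + 1;
  hll.registers[idx] = Math.max(hll.registers[idx], rank);
  return hll;
};

// Union of two sketches: pointwise register maximum.
const hllMerge = (a: HLL, b: HLL): HLL => ({
  p: a.p,
  registers: a.registers.map((r, i) => Math.max(r, b.registers[i])),
});

const hllEstimate = (hll: HLL): number => {
  const m = 1 << hll.p;
  const sum = hll.registers.reduce((acc, r) => acc + 2 ** -r, 0);
  const alpha = 0.7213 / (1 + 1.079 / m);
  const raw = (alpha * m * m) / sum;
  const zeros = hll.registers.filter((r) => r === 0).length;
  // Small-range correction: fall back to linear counting.
  return raw <= 2.5 * m && zeros > 0 ? m * Math.log(m / zeros) : raw;
};
```

The pointwise-max merge is what makes HLLs suitable here: merges are commutative, associative, and idempotent, so gossiping the same sketch repeatedly never inflates the estimate.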

Algorithm

1. Record Distribution

When a node creates or receives a record, it adds itself to the record's HLL:

const receiveNewRecord = <Data>(incoming: Record<Data>, nodeId: string): Record<Data> => ({
  ...incoming,
  recordHLL: hllAdd(hllClone(incoming.recordHLL), nodeId),
});

2. Tombstone Creation

When deleting a record, create a tombstone with the frozen recordHLL:

const createTombstone = <Data>(record: Record<Data>, nodeId: string): Tombstone => ({
  id: record.id,
  recordHLL: hllClone(record.recordHLL),  // Snapshot (updated during propagation)
  tombstoneHLL: hllAdd(createHLL(), nodeId),
  isKeeper: false,
});

3. Tombstone Propagation

flowchart TD
    A[Receive tombstone message] --> B{Have this record?}
    B -->|No| C[Ignore - don't accept tombstones for unknown records]
    B -->|Yes| D[Merge HLLs]
    D --> E{tombstoneCount >= frozenRecordCount?}
    E -->|No| F[Store updated record]
    E -->|Yes| G{Already a keeper?}
    G -->|No| H[Become keeper]
    G -->|Yes| I{Incoming estimate > my previous?}
    I -->|Yes| J[Step down and GC]
    I -->|No| K[Stay as keeper]

4. Garbage Collection Logic

const checkGCStatus = (
  tombstone: Tombstone,
  incomingTombstoneEstimate: number | null,
  myPreviousTombstoneEstimate: number,
  myNodeId: string,
  senderNodeId: string | null
): { shouldGC: boolean; becomeKeeper: boolean; stepDownAsKeeper: boolean } => {
  const targetCount = hllEstimate(tombstone.recordHLL);
  const tombstoneCount = hllEstimate(tombstone.tombstoneHLL);

  if (tombstone.isKeeper) {
    // Step down if incoming estimate is higher
    if (incomingTombstoneEstimate !== null && 
        incomingTombstoneEstimate >= targetCount) {
      if (myPreviousTombstoneEstimate < incomingTombstoneEstimate) {
        return { shouldGC: true, becomeKeeper: false, stepDownAsKeeper: true };
      }
      // Tie-breaker: higher node ID steps down when estimates are equal
      if (myPreviousTombstoneEstimate === incomingTombstoneEstimate &&
          senderNodeId !== null && myNodeId > senderNodeId) {
        return { shouldGC: true, becomeKeeper: false, stepDownAsKeeper: true };
      }
    }
    return { shouldGC: false, becomeKeeper: false, stepDownAsKeeper: false };
  }

  // Become keeper when threshold reached
  if (tombstoneCount >= targetCount) {
    return { shouldGC: false, becomeKeeper: true, stepDownAsKeeper: false };
  }

  return { shouldGC: false, becomeKeeper: false, stepDownAsKeeper: false };
};

5. Forward on Step-Down

When a keeper steps down, it immediately forwards the incoming tombstone to all connected peers. This creates a cascading effect that rapidly eliminates redundant keepers:

const forwardTombstoneToAllPeers = (
  network: NetworkState,
  forwardingNodeId: string,
  tombstone: Tombstone,
  excludePeerId?: string
): NetworkState => {
  const forwardingNode = network.nodes.get(forwardingNodeId);
  if (!forwardingNode) return network;
  
  for (const peerId of forwardingNode.peerIds) {
    if (peerId === excludePeerId) continue;
    
    const peer = network.nodes.get(peerId);
    if (!peer || !peer.records.has(tombstone.id)) continue;
    
    const updatedPeer = receiveTombstone(peer, tombstone, forwardingNodeId);
    network.nodes.set(peerId, updatedPeer);
    
    // If this peer also stepped down, recursively forward
    if (!updatedPeer.tombstones.has(tombstone.id) && 
        peer.tombstones.has(tombstone.id)) {
      forwardTombstoneToAllPeers(network, peerId, tombstone, forwardingNodeId);
    }
  }
  
  return network;
};

Design Decisions

Why Freeze the Record HLL?

Without a frozen snapshot, each node compares against its own local recordHLL estimate. The problem:

graph LR
    subgraph Without Frozen HLL
        A1[Node A: recordHLL=2] -->|gossip| B1[Node B: recordHLL=2]
        B1 --> C1[Both think 2 nodes have record]
        C1 --> D1[tombstoneHLL reaches 2]
        D1 --> E1[GC triggers - but Node C still has record!]
    end

The frozen HLL captures the record count at tombstone creation time and propagates with the tombstone. All nodes compare against the same target.
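The keepHigherEstimate helper used later in receiveTombstone embodies this design: rather than union-merging the two snapshots, it keeps whichever one claims the larger distribution, so the GC target can only ratchet upward. A sketch, with minimal HLL stand-ins so the snippet runs on its own (the post's real hllEstimate is assumed elsewhere):

```typescript
// Stand-in HLL shape and estimator for illustration only; any estimator that
// is monotone under register max-merge behaves the same way here.
type HLL = { readonly registers: number[] };
const hllEstimate = (hll: HLL): number =>
  hll.registers.reduce((acc, r) => acc + r, 0);

// Keep whichever snapshot claims the larger record distribution, so the
// comparison target never drops below any peer's view.
const keepHigherEstimate = (a: HLL, b: HLL): HLL =>
  hllEstimate(a) >= hllEstimate(b) ? a : b;
```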

Why Dynamic Keeper Election?

Fixed originator as keeper creates a single point of failure. If the originator goes offline, no one propagates the tombstone.

Dynamic election means any node can become a keeper when it detects tombstoneCount >= frozenRecordCount. Multiple keepers provide redundancy.

Why Keeper Step-Down?

Without step-down, every node eventually becomes a keeper (since they all eventually see the threshold condition). This means no one ever GCs.

Step-down creates convergence:

graph TD
    subgraph Keeper Convergence Over Time
        T0[t=0: 0 keepers]
        T1[t=1: 5 keepers - first nodes to detect threshold]
        T2[t=2: 3 keepers - 2 stepped down after seeing higher estimates]
        T3[t=3: 1 keeper - most informed node remains]
    end
    T0 --> T1 --> T2 --> T3

Why Node ID Tie-Breaker?

When HLL estimates converge (all nodes have similar tombstoneHLL values), no one can have a strictly higher estimate. Without a tie-breaker, keepers with equal estimates would never step down.

The lexicographic node ID comparison ensures deterministic convergence: when two keepers with equal estimates communicate, the one with the higher node ID steps down.
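Distilled out of checkGCStatus, the step-down decision between two keepers reduces to a small pure function (a simplification for illustration; the real code also checks the frozen record threshold):

```typescript
// Step down when the peer's estimate is strictly higher, or on an exact tie
// when my node id sorts lexicographically after the sender's.
const shouldStepDown = (
  myId: string,
  senderId: string,
  myEstimate: number,
  incomingEstimate: number
): boolean =>
  incomingEstimate > myEstimate ||
  (incomingEstimate === myEstimate && myId > senderId);
```

Because exactly one side of any tied pair satisfies the id comparison, every keeper-to-keeper exchange removes at least one keeper, guaranteeing convergence toward a single survivor.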

Why Forward on Step-Down?

Without forwarding, keepers only step down when randomly selected for gossip. With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all neighbors, creating a cascade effect that rapidly eliminates redundant keepers.

Complete Receive Handlers

interface NodeState<Data> {
  readonly id: string;
  readonly records: ReadonlyMap<string, Record<Data>>;
  readonly tombstones: ReadonlyMap<string, Tombstone>;
}

const receiveRecord = <Data>(
  node: NodeState<Data>,
  incoming: Record<Data>
): NodeState<Data> => {
  const existing = node.records.get(incoming.id);
  
  const updatedRecord: Record<Data> = existing
    ? { ...existing, recordHLL: hllMerge(existing.recordHLL, incoming.recordHLL) }
    : incoming;
  
  const newRecords = new Map(node.records);
  newRecords.set(incoming.id, updatedRecord);
  return { ...node, records: newRecords };
};

const receiveTombstone = <Data>(
  node: NodeState<Data>,
  incoming: Tombstone,
  senderNodeId: string | null = null
): NodeState<Data> => {
  // Don't accept tombstones for unknown records
  if (!node.records.has(incoming.id)) {
    return node;
  }

  const existing = node.tombstones.get(incoming.id);
  const previousEstimate = existing ? hllEstimate(existing.tombstoneHLL) : 0;

  let updatedTombstone: Tombstone = existing
    ? {
        ...existing,
        tombstoneHLL: hllMerge(existing.tombstoneHLL, incoming.tombstoneHLL),
        recordHLL: keepHigherEstimate(existing.recordHLL, incoming.recordHLL),
      }
    : incoming;

  const gcStatus = checkGCStatus(
    updatedTombstone,
    hllEstimate(incoming.tombstoneHLL),
    previousEstimate,
    node.id,
    senderNodeId
  );

  if (gcStatus.stepDownAsKeeper) {
    // GC both record and tombstone
    return deleteRecordAndTombstone(node, incoming.id);
  }

  if (gcStatus.becomeKeeper) {
    updatedTombstone = { ...updatedTombstone, isKeeper: true };
  }

  const newTombstones = new Map(node.tombstones);
  newTombstones.set(incoming.id, updatedTombstone);
  return { ...node, tombstones: newTombstones };
};
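The deleteRecordAndTombstone helper called on step-down simply drops both entries from the node's immutable maps. A sketch, with a simplified stand-in NodeState so the snippet is self-contained:

```typescript
// Simplified stand-in for the post's NodeState (the real maps hold
// Record<Data> and Tombstone values).
interface NodeState<Data> {
  readonly records: ReadonlyMap<string, Data>;
  readonly tombstones: ReadonlyMap<string, unknown>;
}

// Remove both the record and its tombstone, returning a new node state
// without mutating the input.
const deleteRecordAndTombstone = <Data>(
  node: NodeState<Data>,
  id: string
): NodeState<Data> => {
  const records = new Map(node.records);
  const tombstones = new Map(node.tombstones);
  records.delete(id);
  tombstones.delete(id);
  return { ...node, records, tombstones };
};
```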

Trade-offs

| Aspect | Impact |
| --- | --- |
| Memory | ~1KB per tombstone (frozen HLL at precision 10) |
| Bandwidth | HLLs transmitted with each gossip message |
| Latency | GC delayed until keeper convergence |
| Consistency | Eventual; temporary resurrection events possible |
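The ~1KB memory figure and the accuracy of the estimates both follow directly from standard HLL properties at precision 10; a quick back-of-envelope check:

```typescript
// Standard HLL sizing/accuracy arithmetic for precision p = 10.
const p = 10;
const m = 1 << p;                      // 2^10 = 1024 registers
const bytesPerHLL = m;                 // one byte per register suffices (max rank is 23 with 32-bit hashes)
const stdError = 1.04 / Math.sqrt(m);  // ≈ 0.0325, i.e. ~3.25% relative error on estimates
```

A ~3% error margin is comfortably tight for the threshold comparison here, since the tombstoneHLL only needs to catch up to the frozen recordHLL, not match an exact count.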

Properties

  • Safety: 100% - tombstones never prematurely deleted
  • Liveness: Keepers step down, enabling eventual GC
  • Fault tolerance: No single point of failure
  • Convergence: Keeper count decreases over time

Simulation Results

A working simulation is available at simulations/hyperloglog-tombstone/simulation.ts.

| Test | Nodes | Records Deleted (rounds) | Tombstones Remaining |
| --- | --- | --- | --- |
| Single Node Deletion (50 trials) | 750 | 11 | 118 (~16%) |
| Early Tombstone | 20 | 10 | 2 (10%) |
| Bridged Network (2 clusters) | 30 | 10 | 3 (10%) |
| Concurrent Tombstones (3 deleters) | 20 | 10 | 3 (15%) |
| Network Partition and Heal | 20 | 10 | 2 (10%) |
| Sparse Network (15% connectivity) | 500 | 13 | 108 (~22%) |

Key findings from simulation:

  • Records are consistently deleted within 10-13 gossip rounds
  • Tombstones converge to 10-22% of nodes remaining as keepers after 100 additional rounds
  • Bridged and partitioned networks converge to ~1 keeper per cluster
  • Higher connectivity leads to faster keeper convergence