# HyperLogLog Tombstone Garbage Collection

## Abstract
When synchronizing records across a distributed network, deletion poses a problem. If each node simply deleted its local copy of a record, it is highly likely that other nodes would resynchronize the original data after the deletion, reverting the change. This can happen because events do not occur simultaneously across nodes, or because a node is temporarily disconnected from the network and later reconnects with an outdated state. The traditional solution is a "tombstone" record that is kept around after the deletion, marking that the record once existed but has since been deleted and must not be recreated.
While this approach works, it requires every node in the network to indefinitely retain an ever-growing set of tombstone records. Generally, after an arbitrarily long time it can be assumed safe to clear a tombstone, since no rogue nodes should remain that still hold the original data.
The methodology in this post revolves around using the HyperLogLog algorithm to estimate how many nodes have received a record, and comparing that against the same kind of estimate for how many nodes hold its tombstone. This prunes the number of tombstones in any given network down to a much smaller set, making it possible to extend the time that at least one tombstone stays alive in the network while still reducing the storage overhead.
### Core Concept
```mermaid
sequenceDiagram
participant A as Node A
participant B as Node B
participant C as Node C
Note over A,C: Record propagation phase
A->>B: record + recordHLL
B->>C: record + recordHLL
Note over A,C: Tombstone propagation phase
A->>A: Create tombstone with frozenRecordHLL
A->>B: tombstone + tombstoneHLL + frozenRecordHLL
B->>C: tombstone + tombstoneHLL + frozenRecordHLL
Note over A,C: GC phase (after convergence)
C->>C: tombstoneCount >= frozenRecordCount, become keeper
B->>B: sees higher estimate, step down and GC
```
## Data Model
Records and tombstones are separate entities:
```ts
interface Record<Data> {
  readonly id: string;
  readonly data: Data;
  readonly recordHLL: HLL; // Tracks nodes with this record
}

interface Tombstone {
  readonly id: string;
  readonly recordHLL: HLL; // Snapshot of record distribution (updated during propagation)
  readonly tombstoneHLL: HLL; // Tracks nodes with the tombstone
  readonly isKeeper: boolean; // This node continues propagating
}
```
A node stores records and tombstones in separate maps. Tombstones reference records by ID.
## Algorithm
### 1. Record Distribution
When a node creates or receives a record, it adds itself to the record's HLL:
```ts
const receiveNewRecord = <Data>(incoming: Record<Data>, nodeId: string): Record<Data> => ({
  ...incoming,
  recordHLL: hllAdd(hllClone(incoming.recordHLL), nodeId),
});
```
### 2. Tombstone Creation
When deleting a record, create a tombstone with the frozen recordHLL:
```ts
const createTombstone = <Data>(record: Record<Data>, nodeId: string): Tombstone => ({
  id: record.id,
  recordHLL: hllClone(record.recordHLL), // Snapshot (updated during propagation)
  tombstoneHLL: hllAdd(createHLL(), nodeId),
  isKeeper: false,
});
```
### 3. Tombstone Propagation
```mermaid
flowchart TD
A[Receive tombstone message] --> B{Have this record?}
B -->|No| C[Ignore - don't accept tombstones for unknown records]
B -->|Yes| D[Merge HLLs]
D --> E{tombstoneCount >= frozenRecordCount?}
E -->|No| F[Store updated record]
E -->|Yes| G{Already a keeper?}
G -->|No| H[Become keeper]
G -->|Yes| I{Incoming estimate > my previous?}
I -->|Yes| J[Step down and GC]
I -->|No| K[Stay as keeper]
```
### 4. Garbage Collection Logic
```ts
const checkGCStatus = (
  tombstone: Tombstone,
  incomingTombstoneEstimate: number | null,
  myPreviousTombstoneEstimate: number,
  myNodeId: string,
  senderNodeId: string | null
): { shouldGC: boolean; becomeKeeper: boolean; stepDownAsKeeper: boolean } => {
  const targetCount = hllEstimate(tombstone.recordHLL);
  const tombstoneCount = hllEstimate(tombstone.tombstoneHLL);
  if (tombstone.isKeeper) {
    // Step down if incoming estimate is higher
    if (incomingTombstoneEstimate !== null &&
        incomingTombstoneEstimate >= targetCount) {
      if (myPreviousTombstoneEstimate < incomingTombstoneEstimate) {
        return { shouldGC: true, becomeKeeper: false, stepDownAsKeeper: true };
      }
      // Tie-breaker: higher node ID steps down when estimates are equal
      if (myPreviousTombstoneEstimate === incomingTombstoneEstimate &&
          senderNodeId !== null && myNodeId > senderNodeId) {
        return { shouldGC: true, becomeKeeper: false, stepDownAsKeeper: true };
      }
    }
    return { shouldGC: false, becomeKeeper: false, stepDownAsKeeper: false };
  }
  // Become keeper when threshold reached
  if (tombstoneCount >= targetCount) {
    return { shouldGC: false, becomeKeeper: true, stepDownAsKeeper: false };
  }
  return { shouldGC: false, becomeKeeper: false, stepDownAsKeeper: false };
};
```
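As a standalone illustration, the same decision logic can be exercised with plain numbers in place of HLL estimates. The `decide` helper below is a hypothetical reimplementation for demonstration only, not part of the protocol code:

```typescript
// Hypothetical standalone version of the GC decision, with plain numeric
// counts standing in for HLL estimates, to show the three transitions.
type Decision = {
  shouldGC: boolean;
  becomeKeeper: boolean;
  stepDownAsKeeper: boolean;
};

const decide = (
  isKeeper: boolean,
  targetCount: number, // frozen record-HLL estimate
  tombstoneCount: number, // merged tombstone-HLL estimate
  incomingEstimate: number | null,
  myPreviousEstimate: number,
  myNodeId: string,
  senderNodeId: string | null
): Decision => {
  if (isKeeper) {
    if (incomingEstimate !== null && incomingEstimate >= targetCount) {
      // Step down if the sender is strictly better informed...
      if (myPreviousEstimate < incomingEstimate) {
        return { shouldGC: true, becomeKeeper: false, stepDownAsKeeper: true };
      }
      // ...or, on a tie, if we lose the node-ID tie-breaker.
      if (
        myPreviousEstimate === incomingEstimate &&
        senderNodeId !== null &&
        myNodeId > senderNodeId
      ) {
        return { shouldGC: true, becomeKeeper: false, stepDownAsKeeper: true };
      }
    }
    return { shouldGC: false, becomeKeeper: false, stepDownAsKeeper: false };
  }
  // Non-keeper: become one once the tombstone has spread as far as the record.
  if (tombstoneCount >= targetCount) {
    return { shouldGC: false, becomeKeeper: true, stepDownAsKeeper: false };
  }
  return { shouldGC: false, becomeKeeper: false, stepDownAsKeeper: false };
};

// A non-keeper reaching the threshold becomes a keeper:
decide(false, 10, 10, null, 0, "a", null); // becomeKeeper: true
// A keeper hearing an equal estimate from a lower node ID steps down and GCs:
decide(true, 10, 10, 10, 10, "b", "a"); // stepDownAsKeeper: true, shouldGC: true
```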
### 5. Forward on Step-Down
When a keeper steps down, it immediately forwards the incoming tombstone to all connected peers. This creates a cascading effect that rapidly eliminates redundant keepers:
```ts
const forwardTombstoneToAllPeers = (
  network: NetworkState,
  forwardingNodeId: string,
  tombstone: Tombstone,
  excludePeerId?: string
): NetworkState => {
  const forwardingNode = network.nodes.get(forwardingNodeId);
  if (!forwardingNode) return network;
  for (const peerId of forwardingNode.peerIds) {
    if (peerId === excludePeerId) continue;
    const peer = network.nodes.get(peerId);
    if (!peer || !peer.records.has(tombstone.id)) continue;
    const updatedPeer = receiveTombstone(peer, tombstone, forwardingNodeId);
    network.nodes.set(peerId, updatedPeer);
    // If this peer also stepped down, recursively forward
    if (!updatedPeer.tombstones.has(tombstone.id) &&
        peer.tombstones.has(tombstone.id)) {
      forwardTombstoneToAllPeers(network, peerId, tombstone, forwardingNodeId);
    }
  }
  return network;
};
```
## Design Decisions
### Why Freeze the Record HLL?
Without a frozen snapshot, each node compares against its own local recordHLL estimate. The problem:
```mermaid
graph LR
subgraph Without Frozen HLL
A1[Node A: recordHLL=2] -->|gossip| B1[Node B: recordHLL=2]
B1 --> C1[Both think 2 nodes have record]
C1 --> D1[tombstoneHLL reaches 2]
D1 --> E1[GC triggers - but Node C still has record!]
end
```
The frozen HLL captures the record count at tombstone creation time and propagates with the tombstone. All nodes compare against the same target.
### Why Dynamic Keeper Election?
Fixed originator as keeper creates a single point of failure. If the originator goes offline, no one propagates the tombstone.
Dynamic election means any node can become a keeper when it detects `tombstoneCount >= frozenRecordCount`. Multiple keepers provide redundancy.
### Why Keeper Step-Down?
Without step-down, every node eventually becomes a keeper (since they all eventually see the threshold condition). This means no one ever GCs.
Step-down creates convergence:
```mermaid
graph TD
subgraph Keeper Convergence Over Time
T0[t=0: 0 keepers]
T1[t=1: 5 keepers - first nodes to detect threshold]
T2[t=2: 3 keepers - 2 stepped down after seeing higher estimates]
T3[t=3: 1 keeper - most informed node remains]
end
T0 --> T1 --> T2 --> T3
```
### Why Node ID Tie-Breaker?
When HLL estimates converge (all nodes have similar tombstoneHLL values), no one can have a strictly higher estimate. Without a tie-breaker, keepers with equal estimates would never step down.
The lexicographic node ID comparison ensures deterministic convergence: when two keepers with equal estimates communicate, the one with the higher node ID steps down.
### Why Forward on Step-Down?
Without forwarding, keepers only step down when randomly selected for gossip. With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all neighbors, creating a cascade effect that rapidly eliminates redundant keepers.
## Complete Receive Handlers
```ts
interface NodeState<Data> {
  readonly id: string;
  readonly records: ReadonlyMap<string, Record<Data>>;
  readonly tombstones: ReadonlyMap<string, Tombstone>;
}

// Keep whichever snapshot reports the higher estimate, so the frozen
// target can only grow as better-informed copies arrive
const keepHigherEstimate = (a: HLL, b: HLL): HLL =>
  hllEstimate(a) >= hllEstimate(b) ? a : b;

const deleteRecordAndTombstone = <Data>(
  node: NodeState<Data>,
  id: string
): NodeState<Data> => {
  const records = new Map(node.records);
  const tombstones = new Map(node.tombstones);
  records.delete(id);
  tombstones.delete(id);
  return { ...node, records, tombstones };
};

const receiveRecord = <Data>(
  node: NodeState<Data>,
  incoming: Record<Data>
): NodeState<Data> => {
  const existing = node.records.get(incoming.id);
  const updatedRecord: Record<Data> = existing
    ? { ...existing, recordHLL: hllMerge(existing.recordHLL, incoming.recordHLL) }
    : incoming;
  const newRecords = new Map(node.records);
  newRecords.set(incoming.id, updatedRecord);
  return { ...node, records: newRecords };
};

const receiveTombstone = <Data>(
  node: NodeState<Data>,
  incoming: Tombstone,
  senderNodeId: string | null
): NodeState<Data> => {
  // Don't accept tombstones for unknown records
  if (!node.records.has(incoming.id)) {
    return node;
  }
  const existing = node.tombstones.get(incoming.id);
  const previousEstimate = existing ? hllEstimate(existing.tombstoneHLL) : 0;
  let updatedTombstone: Tombstone = existing
    ? {
        ...existing,
        tombstoneHLL: hllMerge(existing.tombstoneHLL, incoming.tombstoneHLL),
        recordHLL: keepHigherEstimate(existing.recordHLL, incoming.recordHLL),
      }
    : incoming;
  const gcStatus = checkGCStatus(
    updatedTombstone,
    hllEstimate(incoming.tombstoneHLL),
    previousEstimate,
    node.id,
    senderNodeId
  );
  if (gcStatus.stepDownAsKeeper) {
    // GC both record and tombstone
    return deleteRecordAndTombstone(node, incoming.id);
  }
  if (gcStatus.becomeKeeper) {
    updatedTombstone = { ...updatedTombstone, isKeeper: true };
  }
  const newTombstones = new Map(node.tombstones);
  newTombstones.set(incoming.id, updatedTombstone);
  return { ...node, tombstones: newTombstones };
};
```
## Trade-offs
| Aspect | Impact |
|--------|--------|
| **Memory** | ~1 KiB per HLL at precision 10; each tombstone carries two HLLs |
| **Bandwidth** | HLLs transmitted with each gossip message |
| **Latency** | GC delayed until keeper convergence |
| **Consistency** | Eventual - temporary resurrection events possible |
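The memory figure follows directly from the register count: at precision `p`, an HLL keeps `2^p` registers, typically one byte each.

```typescript
// Size of one HLL sketch, assuming one byte per register.
const hllBytes = (precision: number): number => 1 << precision;

console.log(hllBytes(10)); // 1024 bytes, about 1 KiB per sketch
```

Since a tombstone carries both a `recordHLL` and a `tombstoneHLL`, its at-rest and on-wire footprint is roughly two such sketches.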
## Properties
- **Safety**: 100% - tombstones never prematurely deleted
- **Liveness**: Keepers step down, enabling eventual GC
- **Fault tolerance**: No single point of failure
- **Convergence**: Keeper count decreases over time
## Simulation Results
A working simulation is available at `simulations/hyperloglog-tombstone/simulation.ts`.

| Test | Nodes | Rounds to Delete | Tombstones Remaining |
|------|-------|------------------|----------------------|
| Single Node Deletion (50 trials) | 750 | 11 | 118 (~16%) |
| Early Tombstone | 20 | 10 | 2 (10%) |
| Bridged Network (2 clusters) | 30 | 10 | 3 (10%) |
| Concurrent Tombstones (3 deleters) | 20 | 10 | 3 (15%) |
| Network Partition and Heal | 20 | 10 | 2 (10%) |
| Sparse Network (15% connectivity) | 500 | 13 | 108 (~22%) |
Key findings from the simulation:
- Records are consistently deleted within 10-13 gossip rounds
- Tombstones converge to 10-22% of nodes remaining as keepers after 100 additional rounds
- Bridged and partitioned networks converge to ~1 keeper per cluster
- Higher connectivity leads to faster keeper convergence