# HyperLogLog-Based Tombstone Garbage Collection for Distributed Systems

## Abstract

When synchronizing records in a distributed network, deletion presents a fundamental challenge. If nodes simply delete their local copies, other nodes may resynchronize the original data, reverting the deletion. This occurs due to non-simultaneous events between nodes or nodes temporarily disconnecting and reconnecting with outdated state. The traditional solution creates "tombstone" records that persist after deletion to prevent resurrection of deleted data.

While effective, this approach requires every node to indefinitely maintain an ever-growing collection of tombstone records. Typically, after an arbitrarily large time period, tombstones are assumed safe to clear since no rogue nodes should retain the original data.

This paper presents a methodology using the HyperLogLog algorithm to estimate how many nodes have received a record, comparing this estimate against the count of nodes that have received the corresponding tombstone. This enables pruning tombstones across the network to a minimal set of "keeper" nodes, extending the viable tombstone retention period while significantly reducing storage overhead.

## 1. Introduction

Distributed systems face an inherent tension between data consistency and storage efficiency when handling deletions. Traditional tombstone-based approaches guarantee correctness but impose unbounded storage growth. Time-based garbage collection (GC) offers storage efficiency but risks data resurrection if stale nodes reconnect after the GC window.

This paper introduces a probabilistic approach using HyperLogLog (HLL) cardinality estimation[^1] to achieve both goals: safe garbage collection that provably prevents resurrection while minimizing the number of nodes that must retain tombstones.

[^1]: Flajolet, Fusy, Gandouet, and Meunier, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm," Analysis of Algorithms (AofA), 2007.

### 1.1 Core Concept

The algorithm operates in three phases:

```mermaid
sequenceDiagram
    participant A as Node A
    participant B as Node B
    participant C as Node C

    Note over A,C: Phase 1: Record Propagation
    A->>B: record + recordHLL
    B->>A: update recordHLL estimate
    B->>C: record + recordHLL

    Note over A,C: Phase 2: Tombstone Propagation
    A->>A: Create tombstone with recordHLL and delete record
    C->>B: update recordHLL estimate
    A->>B: tombstone + tombstoneHLL + recordHLL
    B->>B: update tombstone with new recordHLL and delete record
    B->>C: tombstone + tombstoneHLL + recordHLL

    Note over A,C: Phase 3: Keeper Election and Tombstone Garbage Collection
    C->>C: tombstoneCount >= recordCount, become keeper and delete record
    C->>B: update with tombstone count estimate
    B->>B: sees higher estimate, steps down and garbage collects its own tombstone
    B->>A: update connected node with tombstoneHLL
    A->>A: garbage collects its own tombstone
```

**Phase 1**: Records propagate through the network via gossip, with each node adding itself to the record's HLL. Nodes then exchange their estimates with one another, gradually turning local record-count estimates into a global one.

**Phase 2**: When deletion occurs, the deleting node creates a tombstone containing a copy of the record's HLL as the target count. The tombstone propagates similarly, with nodes adding themselves to the tombstone's HLL. During propagation, the target recordHLL is updated to the highest estimate encountered.

**Phase 3**: When a node detects that `tombstoneCount >= recordCount`, it becomes a "keeper" responsible for continued propagation. As keepers communicate, those with lower estimates step down and garbage collect, converging toward a minimal keeper set.
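The three phases can be walked through in miniature. In this illustrative sketch, exact `Set`s stand in for the HLLs (an assumption for clarity; the real protocol uses probabilistic estimates), so the keeper threshold is exact:

```typescript
// Phase 1: the record has gossiped to all three nodes, each adding itself
const recordHolders = new Set(["A", "B", "C"]);

// Phase 2: node A deletes and creates a tombstone; B and C add themselves
// to the tombstone's holder set as it gossips through them
const tombstoneHolders = new Set(["A", "B", "C"]);

// Phase 3: a node is keeper-eligible once the tombstone has demonstrably
// reached at least as many nodes as the record had
const keeperEligible = tombstoneHolders.size >= recordHolders.size;

// With only partial tombstone coverage, GC must not trigger yet
const partialTombstoneHolders = new Set(["A", "B"]);
const notYet = partialTombstoneHolders.size >= recordHolders.size;
```

The same comparison drives the real algorithm; the only difference is that both sides of the inequality are HLL estimates rather than exact counts.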

## 2. Data Model

Records and tombstones are maintained as separate entities with distinct tracking mechanisms:

```ts
interface DataRecord<Data> {
  readonly id: string;
  readonly data: Data;
  readonly recordHLL: HyperLogLog; // Tracks nodes that have received this record
}

interface Tombstone {
  readonly id: string;
  readonly recordHLL: HyperLogLog; // Target count: highest observed record distribution
  readonly tombstoneHLL: HyperLogLog; // Tracks nodes that have received the tombstone
}
```

## 3. Algorithm

### 3.1 Record Creation and Distribution

When a node creates or receives a record, it adds itself to the record's HLL:

```ts
const createRecord = <Data>(id: string, data: Data, nodeId: string): DataRecord<Data> => ({
  id,
  data,
  recordHLL: hllAdd(createHLL(), nodeId),
});
```

A node stores records and tombstones in separate maps, with tombstones referencing records by ID:

```ts
interface NodeState<Data> {
  readonly id: string; // This node's identifier
  readonly records: ReadonlyMap<string, DataRecord<Data>>;
  readonly tombstones: ReadonlyMap<string, Tombstone>;
}
```

Receiving a record merges HLL state, but rejects records for which a tombstone already exists:

```ts
const receiveRecord = <Data>(
  node: NodeState<Data>,
  incoming: DataRecord<Data>
): NodeState<Data> => {
  // Reject records that have already been deleted
  if (node.tombstones.has(incoming.id)) {
    return node;
  }

  const existing = node.records.get(incoming.id);
  const updatedRecord: DataRecord<Data> = existing
    ? { ...existing, recordHLL: hllAdd(hllMerge(existing.recordHLL, incoming.recordHLL), node.id) }
    : { ...incoming, recordHLL: hllAdd(hllClone(incoming.recordHLL), node.id) };

  const newRecords = new Map(node.records);
  newRecords.set(incoming.id, updatedRecord);
  return { ...node, records: newRecords };
};
```
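The two invariants of the receive path (merge-and-add on receipt, tombstone blocks resurrection) can be checked in isolation. This self-contained demo uses exact `Set`s in place of HLLs and a pared-down node shape, both assumptions for illustration only:

```typescript
// Exact-set stand-ins for the HLL helpers (illustrative assumption)
type HLLSet = ReadonlySet<string>;
const merge = (a: HLLSet, b: HLLSet): HLLSet => new Set([...a, ...b]);
const add = (h: HLLSet, id: string): HLLSet => new Set([...h, id]);

interface MiniNode {
  id: string;
  records: Map<string, HLLSet>; // record id -> recordHLL stand-in
  tombstones: Set<string>;      // ids already deleted locally
}

const receive = (node: MiniNode, recordId: string, incomingHLL: HLLSet): void => {
  if (node.tombstones.has(recordId)) return; // tombstone blocks resurrection
  const existing = node.records.get(recordId) ?? new Set<string>();
  node.records.set(recordId, add(merge(existing, incomingHLL), node.id));
};

const a: MiniNode = { id: "A", records: new Map(), tombstones: new Set() };
const b: MiniNode = { id: "B", records: new Map(), tombstones: new Set(["r1"]) };
receive(a, "r1", new Set(["origin"])); // A stores the record and counts itself
receive(b, "r1", new Set(["origin"])); // B's tombstone rejects the record
```

Node A ends up tracking two holders (`origin` plus itself), while node B, which already holds a tombstone for `r1`, silently drops the incoming record.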

### 3.2 Tombstone Creation

When deleting a record, a node creates a tombstone containing a copy of the record's HLL as the initial target count:

```ts
const createTombstone = <Data>(record: DataRecord<Data>, nodeId: string): Tombstone => ({
  id: record.id,
  recordHLL: hllClone(record.recordHLL),
  tombstoneHLL: hllAdd(createHLL(), nodeId),
});
```

### 3.3 Garbage Collection Status Check

The core decision logic determines whether a node should become a keeper, step down, or continue as-is:

```ts
const checkGCStatus = (
  tombstone: Tombstone,
  incomingTombstoneEstimate: number | null,
  myTombstoneEstimateBeforeMerge: number,
  myNodeId: string,
  senderNodeId: string | null
): { shouldGC: boolean; stepDownAsKeeper: boolean } => {
  const targetCount = hllEstimate(tombstone.recordHLL);

  const isKeeper = myTombstoneEstimateBeforeMerge >= targetCount;

  if (isKeeper) {
    // Keeper step-down logic:
    // If the incoming tombstone has reached the target count, compare estimates.
    // If the incoming estimate exceeds my estimate before the merge, step down.
    // Use node ID as a tie-breaker: the higher node ID steps down when estimates are equal.
    if (incomingTombstoneEstimate !== null && incomingTombstoneEstimate >= targetCount) {
      if (myTombstoneEstimateBeforeMerge < incomingTombstoneEstimate) {
        return { shouldGC: true, stepDownAsKeeper: true };
      }
      // Tie-breaker: if estimates are equal, the lexicographically higher node ID steps down
      if (myTombstoneEstimateBeforeMerge === incomingTombstoneEstimate &&
          senderNodeId !== null && myNodeId > senderNodeId) {
        return { shouldGC: true, stepDownAsKeeper: true };
      }
    }
    return { shouldGC: false, stepDownAsKeeper: false };
  }

  // Not yet a keeper - will become one if tombstone count reaches target after merge
  return { shouldGC: false, stepDownAsKeeper: false };
};
```
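The branches above reduce to a small decision table. This standalone restatement (plain numbers in place of HLL estimates, hypothetical node IDs) makes the rules easy to verify mechanically:

```typescript
// Standalone restatement of the keeper step-down rules; mirrors checkGCStatus
// above, with plain numbers standing in for HLL estimates.
const shouldStepDown = (
  myEstimate: number,       // my tombstone estimate before the merge
  incomingEstimate: number, // sender's tombstone estimate
  targetCount: number,      // shared record-HLL target
  myNodeId: string,
  senderNodeId: string
): boolean => {
  if (myEstimate < targetCount) return false;       // I am not a keeper yet
  if (incomingEstimate < targetCount) return false; // sender is below threshold
  if (myEstimate < incomingEstimate) return true;   // sender is better informed
  // Tie-breaker: the lexicographically higher node ID yields
  return myEstimate === incomingEstimate && myNodeId > senderNodeId;
};
```

Only a keeper facing an at-threshold sender ever steps down, and ties always resolve the same way on both sides of the exchange, which is what prevents two keepers from garbage-collecting each other simultaneously.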

### 3.4 Tombstone Reception and Processing

```mermaid
graph TD
    A[Receive tombstone deletion message] --> B{Do I have<br/>this record?}
    B -->|No| C[Ignore: record not found]
    B -->|Yes| D[Merge HLLs and select<br/>highest record estimate]
    D --> E{Am I already a keeper?<br/>my tombstone count >= target}
    E -->|Yes| F{Is incoming tombstone<br/>count higher than mine?}
    F -->|Yes| G[Step down as keeper:<br/>delete tombstone]
    F -->|No| H{Same count but<br/>sender has lower node ID?}
    H -->|Yes| G
    H -->|No| I[Remain keeper:<br/>update tombstone]
    E -->|No| J{Does my tombstone<br/>count reach target?}
    J -->|Yes| K[Become keeper:<br/>store tombstone]
    J -->|No| L[Store tombstone<br/>but not keeper yet]
    G --> M[Forward tombstone to peers]
    I --> M
    K --> M
    L --> M
```

The complete tombstone reception handler:

```ts
const receiveTombstone = <Data>(
  node: NodeState<Data>,
  incoming: Tombstone,
  senderNodeId: string
): NodeState<Data> => {
  // Don't accept tombstones for unknown records
  const record = node.records.get(incoming.id);
  if (!record) {
    return node;
  }

  const existing = node.tombstones.get(incoming.id);

  // Merge tombstone HLLs and add self
  const mergedTombstoneHLL = existing
    ? hllAdd(hllMerge(existing.tombstoneHLL, incoming.tombstoneHLL), node.id)
    : hllAdd(hllClone(incoming.tombstoneHLL), node.id);

  // Select the best (highest estimate) record HLL as target count
  // This ensures we use the most complete view of record distribution
  let bestRecordHLL = incoming.recordHLL;
  if (existing?.recordHLL) {
    bestRecordHLL = hllEstimate(existing.recordHLL) > hllEstimate(bestRecordHLL)
      ? existing.recordHLL
      : bestRecordHLL;
  }
  if (hllEstimate(record.recordHLL) > hllEstimate(bestRecordHLL)) {
    bestRecordHLL = hllClone(record.recordHLL);
  }

  const updatedTombstone: Tombstone = {
    id: incoming.id,
    tombstoneHLL: mergedTombstoneHLL,
    recordHLL: bestRecordHLL,
  };

  const myEstimateBeforeMerge = existing ? hllEstimate(existing.tombstoneHLL) : 0;

  const gcStatus = checkGCStatus(
    updatedTombstone,
    hllEstimate(incoming.tombstoneHLL),
    myEstimateBeforeMerge,
    node.id,
    senderNodeId
  );

  // Always delete the record when we have a tombstone
  const newRecords = new Map(node.records);
  newRecords.delete(incoming.id);

  if (gcStatus.stepDownAsKeeper) {
    // Step down: delete both record and tombstone
    const newTombstones = new Map(node.tombstones);
    newTombstones.delete(incoming.id);
    return { ...node, records: newRecords, tombstones: newTombstones };
  }

  const newTombstones = new Map(node.tombstones);
  newTombstones.set(incoming.id, updatedTombstone);
  return { ...node, records: newRecords, tombstones: newTombstones };
};
```

### 3.5 Cascading Step-Down via Forwarding

When a keeper steps down, it immediately forwards the tombstone to all connected peers, creating a cascade effect that rapidly eliminates redundant keepers:

```ts
const forwardTombstoneToAllPeers = <Data>(
  network: NetworkState<Data>,
  forwardingNodeId: string,
  tombstone: Tombstone,
  excludePeerId?: string
): NetworkState<Data> => {
  const forwardingNode = network.nodes.get(forwardingNodeId);
  if (!forwardingNode) return network;

  let newNodes = new Map(network.nodes);

  for (const peerId of forwardingNode.peerIds) {
    if (peerId === excludePeerId) continue;

    const peer = newNodes.get(peerId);
    if (!peer || !peer.records.has(tombstone.id)) continue;

    const updatedPeer = receiveTombstone(peer, tombstone, forwardingNodeId);
    newNodes.set(peerId, updatedPeer);

    // If this peer also stepped down, recursively forward
    if (!updatedPeer.tombstones.has(tombstone.id) && peer.tombstones.has(tombstone.id)) {
      const result = forwardTombstoneToAllPeers({ nodes: newNodes }, peerId, tombstone, forwardingNodeId);
      newNodes = new Map(result.nodes);
    }
  }

  return { nodes: newNodes };
};
```

## 4. Design Rationale

### 4.1 Why Propagate the Record HLL with Tombstones?

Without a shared target count, each node would compare against its own local recordHLL estimate, leading to premature garbage collection:

```mermaid
graph LR
    subgraph Problem: Without Shared Target
        A1["Node A: recordHLL=2"] -->|gossip| B1["Node B: recordHLL=2"]
        B1 --> C1["Both estimate 2 nodes have record"]
        C1 --> D1["tombstoneHLL reaches 2"]
        D1 --> E1["GC triggers prematurely!"]
        E1 --> F1["Node C still has record, resurrection"]
    end
```

By propagating the recordHLL with the tombstone and always keeping the highest estimate encountered, all nodes converge on a safe target count. During propagation, if a node has a more complete view of record distribution (a higher HLL estimate), that becomes the new target for all subsequent nodes.

### 4.2 Why Dynamic Keeper Election?

A fixed originator-as-keeper design creates a single point of failure. If the originator goes offline, tombstone propagation halts.

Dynamic election allows any node to become a keeper when it detects `tombstoneCount >= recordCount`. Multiple keepers provide redundancy during network partitions or node failures.

### 4.3 Why Keeper Step-Down?

Without step-down logic, every node eventually becomes a keeper (since they all eventually observe the threshold condition). This defeats the purpose of garbage collection.

Step-down creates convergence toward a minimal keeper set:

```mermaid
graph TD
    subgraph Keeper Convergence Over Time
        T0["t=0: 0 keepers"]
        T1["t=1: 5 keepers<br/>(first nodes to detect threshold)"]
        T2["t=2: 3 keepers<br/>(2 stepped down after seeing higher estimates)"]
        T3["t=3: 1-2 keepers<br/>(most informed nodes remain)"]
    end
    T0 --> T1 --> T2 --> T3
```

### 4.4 Why Node ID Tie-Breaker?

When HLL estimates converge (all nodes have similar tombstoneHLL values due to full propagation), no node can have a strictly higher estimate. Without a tie-breaker, keepers with equal estimates would never step down.

The lexicographic node ID comparison ensures deterministic convergence: when two keepers with equal estimates communicate, the one with the higher node ID steps down. This guarantees eventual convergence to a single keeper per connected component.

### 4.5 Why Forward on Step-Down?

Without forwarding, keepers only step down when randomly selected for gossip, a slow process. With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all neighbors, creating a cascade effect that rapidly eliminates redundant keepers.

## 5. Evaluation

### 5.1 Experimental Setup

We implemented a discrete-event simulation to evaluate the algorithm under various network conditions. The simulation models:

- **Gossip protocol**: each round, every node holding a record or tombstone randomly selects one peer and exchanges state
- **HLL precision**: 10 bits (1024 registers, ~1 KB per HLL)
- **Convergence criterion**: records deleted, followed by 100 additional rounds for keeper convergence
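The gossip model in the first bullet can be sketched concretely. This toy push-gossip loop is an illustrative assumption-level sketch, not the paper's actual harness: it uses a ring topology and a seeded LCG so the run is deterministic:

```typescript
// Toy push-gossip round loop: each round, every node holding the record picks
// one random peer and shares it. Ring topology and the LCG are assumptions of
// this sketch only.
const N = 15;
const peers = Array.from({ length: N }, (_, i) => [(i + 1) % N, (i + N - 1) % N]);

let seed = 42;
const rand = (n: number): number => {
  seed = (Math.imul(seed, 1664525) + 1013904223) >>> 0; // numerical-recipes LCG
  return seed % n;
};

const holders = new Set<number>([0]); // node 0 created the record
let rounds = 0;
while (holders.size < N && rounds < 1_000) {
  rounds++;
  for (const node of [...holders]) {
    holders.add(peers[node][rand(2)]); // push to one random neighbor
  }
}
// After the loop, every node holds the record; on a ring the record spreads
// outward from the originator in both directions, one hop per round at most
```

Even on this worst-case-diameter topology, propagation completes in a handful of rounds, which is consistent with the rapid deletion phases reported below.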

### 5.2 Test Scenarios

#### 5.2.1 Single Node Deletion

**Scenario**: A single node creates a record, propagates it through gossip, then initiates deletion.

```mermaid
graph TD
    subgraph Network Topology 15 nodes 40 percent connectivity
        N0((node-0<br/>originator))
        N1((node-1))
        N2((node-2))
        N3((node-3))
        N4((node-4))
        N5((node-5))
        N6((node-6))
        N7((node-7))
        N0 --- N1
        N0 --- N3
        N1 --- N2
        N1 --- N4
        N2 --- N5
        N3 --- N4
        N3 --- N6
        N4 --- N5
        N5 --- N7
        N6 --- N7
    end
```

**Protocol**:
1. Node-0 creates a record and propagates it for 20 rounds
2. Node-0 creates a tombstone and initiates deletion
3. The simulation runs until convergence

**Results** (averaged over 50 trials):

| Metric | Value |
|--------|-------|
| Nodes | 15 per trial (750 total) |
| Records deleted | 100% success |
| Rounds to delete records | 11 |
| Total rounds (including convergence) | 121 |
| Final tombstones | 116 (~15.5% of nodes) |

**Analysis**: Record deletion completes rapidly (11 rounds). The tombstone keeper count converges to approximately 2-3 keepers per trial, demonstrating effective garbage collection while maintaining redundancy.

#### 5.2.2 Early Tombstone Creation

**Scenario**: A tombstone is created before the record fully propagates, testing the algorithm's handling of partial record distribution.

```mermaid
sequenceDiagram
    participant N0 as Node-0
    participant N1 as Node-1
    participant N2 as Node-2
    participant Nx as Nodes 3-19

    Note over N0,Nx: Record only partially propagated
    N0->>N1: record (round 1)
    N1->>N2: record (round 2)
    N2->>N0: record (round 3)

    Note over N0: Create tombstone after only 3 rounds
    N0->>N1: tombstone
    N1->>N2: tombstone
    Note over Nx: Most nodes never receive record
```

**Results**:

| Metric | Value |
|--------|-------|
| Nodes | 20 |
| Records deleted | Yes |
| Rounds to delete records | 10 |
| Total rounds | 120 |
| Final tombstones | 3 (15% of nodes) |

**Analysis**: Even with partial record propagation, the algorithm correctly handles deletion. The propagated recordHLL accurately captures the distribution, updating as the tombstone encounters nodes with more complete views. Tombstones converge to nodes that actually received the record.

#### 5.2.3 Bridged Network (Two Clusters)

**Scenario**: Two densely-connected clusters joined by a single bridge node, simulating common real-world topologies.
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
graph TD
|
||||||
|
subgraph Cluster A 15 nodes
|
||||||
|
A0((A-0<br/>bridge))
|
||||||
|
A1((A-1))
|
||||||
|
A2((A-2))
|
||||||
|
A3((A-3))
|
||||||
|
A0 --- A1
|
||||||
|
A0 --- A2
|
||||||
|
A1 --- A2
|
||||||
|
A1 --- A3
|
||||||
|
A2 --- A3
|
||||||
|
end
|
||||||
|
|
||||||
|
subgraph Cluster B 15 nodes
|
||||||
|
B0((B-0<br/>bridge))
|
||||||
|
B1((B-1))
|
||||||
|
B2((B-2))
|
||||||
|
B3((B-3))
|
||||||
|
B0 --- B1
|
||||||
|
B0 --- B2
|
||||||
|
B1 --- B2
|
||||||
|
B1 --- B3
|
||||||
|
B2 --- B3
|
||||||
|
end
|
||||||
|
|
||||||
|
A0 ===|single bridge| B0
|
||||||
|
```
|
||||||
|
|
||||||
|
**Results**:
|
||||||
|
|
||||||
|
| Metric | Cluster A | Cluster B | Total |
|
||||||
|
|--------|-----------|-----------|-------|
|
||||||
|
| Nodes | 15 | 15 | 30 |
|
||||||
|
| Records deleted | Yes | Yes | Yes |
|
||||||
|
| Rounds to delete | - | - | 10 |
|
||||||
|
| Final tombstones | 4 | 3 | 7 (23%) |
|
||||||
|
|
||||||
|
**Analysis**: The single-bridge topology creates a natural partition point. Each cluster independently elects keepers, resulting in 2-4 keepers per cluster. This provides fault tolerance - if the bridge fails, each cluster retains tombstones independently.
|
||||||
|
|
||||||
|
#### 5.2.4 Concurrent Tombstones

**Scenario**: Multiple nodes simultaneously initiate deletion of the same record, simulating concurrent delete operations.

```mermaid
sequenceDiagram
    participant N0 as Node-0
    participant N5 as Node-5
    participant N10 as Node-10
    participant Others as Other Nodes

    Note over N0,Others: Record fully propagated (30 rounds)

    par Concurrent deletion
        N0->>N0: Create tombstone
        N5->>N5: Create tombstone
        N10->>N10: Create tombstone
    end

    Note over N0,Others: Three tombstones propagate and merge
    N0->>Others: tombstone (from N0)
    N5->>Others: tombstone (from N5)
    N10->>Others: tombstone (from N10)

    Note over N0,Others: HLLs merge, keepers converge
```

**Results**:

| Metric | Value |
|--------|-------|
| Nodes | 20 |
| Concurrent deleters | 3 |
| Records deleted | Yes |
| Rounds to delete | 10 |
| Final tombstones | 2 (10% of nodes) |

**Analysis**: The algorithm handles concurrent tombstone creation gracefully. Multiple tombstones merge via HLL union operations, and keeper election converges as normal. The final keeper count (2) is actually lower than in single-deleter scenarios, likely due to faster HLL convergence from multiple sources.

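The merge step relies on HLL union being a register-wise maximum, which is commutative, associative, and idempotent; the three concurrent tombstones therefore converge to the same merged state regardless of the order in which they meet along gossip paths. A toy demonstration (the register values below are made up for illustration, not a real HLL):

```typescript
// HLL union = register-wise max. Order of merging does not matter.
const union = (a: number[], b: number[]): number[] =>
  a.map((v, i) => Math.max(v, b[i]));

// Toy "registers" for the three concurrently created tombstones.
const fromN0 = [3, 0, 1, 4];
const fromN5 = [1, 2, 0, 4];
const fromN10 = [0, 2, 5, 1];

// Two different gossip orders...
const pathA = union(union(fromN0, fromN5), fromN10);
const pathB = union(fromN10, union(fromN5, fromN0));

// ...reach the same merged state.
console.log(JSON.stringify(pathA) === JSON.stringify(pathB)); // true
```

Idempotence also means re-receiving an already-merged tombstone is harmless, which is what makes gossip-based propagation safe here.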
#### 5.2.5 Network Partition and Heal

**Scenario**: Network partitions after record propagation, tombstone created in one partition, then network heals.

```mermaid
sequenceDiagram
    participant CA as Cluster A
    participant Bridge as Bridge
    participant CB as Cluster B

    Note over CA,CB: Phase 1: Record propagates to all nodes
    CA->>Bridge: record
    Bridge->>CB: record

    Note over CA,CB: Phase 2: Network partitions
    Bridge--xCB: connection lost

    Note over CA: Cluster A creates tombstone
    CA->>CA: tombstone propagates within A
    Note over CB: Cluster B still has record

    Note over CA,CB: Phase 3: Network heals
    Bridge->>CB: tombstone propagates to B
    CB->>CB: record deleted, keepers elected
```

**Results**:

| Metric | Cluster A | Cluster B | Total |
|--------|-----------|-----------|-------|
| Nodes | 10 | 10 | 20 |
| Records deleted | Yes | Yes | Yes |
| Rounds to delete | - | - | 10 |
| Total rounds (partition + heal) | - | - | 720 |
| Final tombstones | 3 | 2 | 5 (25%) |

**Analysis**: The extended total round count (720) includes the partition period, during which only Cluster A processes the tombstone. Upon healing, Cluster B rapidly receives and processes the tombstone. Each cluster maintains independent keepers, providing partition tolerance.

#### 5.2.6 Sparse Network

**Scenario**: Low connectivity (15%) network, testing algorithm behavior under challenging propagation conditions.

```mermaid
graph TD
    subgraph "Sparse Network (25 nodes, 15 percent connectivity)"
        N0((0)) --- N3((3))
        N3 --- N7((7))
        N7 --- N12((12))
        N0 --- N5((5))
        N5 --- N9((9))
        N9 --- N15((15))
        N12 --- N18((18))
        N18 --- N22((22))
        N1((1)) --- N4((4))
        N4 --- N8((8))
        N2((2)) --- N6((6))
        N6 --- N11((11))
        N11 --- N16((16))
        N16 --- N20((20))
        N20 --- N24((24))
    end

    style N0 fill:#f96
    style N24 fill:#9f9
```

**Results** (averaged over 20 trials):

| Metric | Value |
|--------|-------|
| Nodes | 25 per trial (500 total) |
| Connectivity | 15% |
| Records deleted | 100% success |
| Rounds to delete | 13 |
| Total rounds | 123 |
| Final tombstones | 102 (~20.4% of nodes) |

**Analysis**: Sparse networks require more rounds for propagation (13 vs. 10-11 for denser networks) and retain more keepers (~20% vs. ~15%). The higher keeper retention provides additional redundancy, appropriate for networks where nodes may have limited connectivity.

### 5.3 Summary of Results

| Scenario | Nodes | Deletion Rounds | Keeper % | Key Insight |
|----------|-------|-----------------|----------|-------------|
| Single Node Deletion | 15 | 11 | 15.5% | Baseline performance |
| Early Tombstone | 20 | 10 | 15% | Handles partial propagation |
| Bridged Network | 30 | 10 | 23% | Independent keepers per cluster |
| Concurrent Tombstones | 20 | 10 | 10% | Faster convergence with multiple sources |
| Partition and Heal | 20 | 10 | 25% | Partition-tolerant |
| Sparse Network | 25 | 13 | 20.4% | Graceful degradation |

### 5.4 Key Findings

1. **Consistent deletion**: Records are deleted within 10-13 gossip rounds across all scenarios
2. **Effective GC**: Tombstones converge to 10-25% of nodes as keepers
3. **Topology adaptation**: Bridged and partitioned networks maintain ~1-4 keepers per cluster
4. **Graceful degradation**: Lower connectivity increases keeper retention, providing appropriate redundancy
5. **Concurrent safety**: Multiple simultaneous deleters do not cause conflicts

## 6. Trade-offs

| Aspect | Impact |
|--------|--------|
| **Memory** | ~1KB per tombstone (HLL at precision 10) |
| **Bandwidth** | HLLs transmitted with each gossip message (~2KB per tombstone message) |
| **Latency** | GC delayed until keeper convergence (~100 rounds after deletion) |
| **Consistency** | Eventual - temporary resurrection attempts are blocked but logged |

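The memory figure follows directly from the register count: an HLL at precision p keeps 2^p one-byte registers, so precision 10 gives 1024 bytes per sketch, and a tombstone message carrying both of its HLLs roughly doubles that. A quick sanity check:

```typescript
// Back-of-envelope for the trade-off table: 2^p one-byte registers per HLL,
// and each tombstone message carries two HLLs (recordHLL + tombstoneHLL).
const hllBytes = (precision: number): number => 2 ** precision;

console.log(hllBytes(10));     // 1024 bytes, ~1KB per HLL
console.log(2 * hllBytes(10)); // 2048 bytes, ~2KB per tombstone message
```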
## 7. Properties

The algorithm provides the following guarantees:

- **Safety**: Tombstones are never prematurely garbage collected. A tombstone is only deleted when the node has received confirmation (via HLL estimates) that the tombstone has propagated to at least as many nodes as received the original record.

- **Liveness**: Keepers eventually step down, enabling garbage collection. The tie-breaker mechanism ensures convergence even when HLL estimates are identical.

- **Fault tolerance**: No single point of failure. Multiple keepers provide redundancy, and any keeper can propagate the tombstone.

- **Convergence**: Keeper count monotonically decreases over time within each connected component.

## 8. Conclusion

This paper presented a HyperLogLog-based approach to tombstone garbage collection in distributed systems. By tracking record and tombstone propagation through probabilistic cardinality estimation, the algorithm enables safe garbage collection while reducing storage overhead by 75-90%.

The simulation results demonstrate consistent behavior across diverse network topologies and failure scenarios, with records deleted in 10-13 gossip rounds and tombstones converging to 10-25% of nodes as keepers. The algorithm gracefully handles partial propagation, network partitions, and concurrent deletions.

Future work may explore adaptive HLL precision based on network size, integration with vector clocks for stronger consistency guarantees, and optimization of the keeper convergence rate.

## References

A working simulation implementing this algorithm is available at `simulations/hyperloglog-tombstone/simulation.ts`.

@@ -85,9 +85,8 @@ interface DataRecord<Data> {

interface Tombstone {
  readonly id: string;
  readonly recordHLL: HLL;
  readonly tombstoneHLL: HLL;
}

interface NodeState<Data> {

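The `Tombstone` interface above assumes an `HLL` type and helper functions (`createHLL`, `hllAdd`, `hllClone`, `hllMerge`, `hllEstimate`) defined elsewhere in the simulation. For readers without the full source, here is a minimal self-contained sketch of what such helpers can look like; it uses FNV-1a as an illustrative hash and is not the simulation's actual implementation:

```typescript
// Minimal HyperLogLog sketch at precision 10 (1024 one-byte registers).
// Illustrative stand-in for the simulation's HLL helpers, not its real code.
type HLL = Uint8Array;

const PRECISION = 10;
const M = 1 << PRECISION; // 1024 registers

const createHLL = (): HLL => new Uint8Array(M);
const hllClone = (h: HLL): HLL => new Uint8Array(h);

// FNV-1a: a simple, well-known 32-bit hash, assumed here for illustration.
const hash32 = (s: string): number => {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
};

const hllAdd = (h: HLL, item: string): HLL => {
  const x = hash32(item);
  const idx = x >>> (32 - PRECISION); // top PRECISION bits pick a register
  const rest = (x << PRECISION) >>> 0; // remaining bits
  const rank = rest === 0 ? 32 - PRECISION + 1 : Math.clz32(rest) + 1;
  const out = hllClone(h);
  out[idx] = Math.max(out[idx], rank);
  return out;
};

// Union is a register-wise max: commutative, associative, idempotent.
const hllMerge = (a: HLL, b: HLL): HLL => {
  const out = createHLL();
  for (let i = 0; i < M; i++) out[i] = Math.max(a[i], b[i]);
  return out;
};

const hllEstimate = (h: HLL): number => {
  let sum = 0;
  let zeros = 0;
  for (let i = 0; i < M; i++) {
    sum += 2 ** -h[i];
    if (h[i] === 0) zeros++;
  }
  const alpha = 0.7213 / (1 + 1.079 / M);
  let e = (alpha * M * M) / sum;
  if (e <= 2.5 * M && zeros > 0) e = M * Math.log(M / zeros); // small-range correction
  return e;
};

let h = createHLL();
["node-0", "node-1", "node-2"].forEach((id) => { h = hllAdd(h, id); });
console.log(hllEstimate(h)); // close to 3 (estimate is probabilistic)
```

Because estimates are probabilistic, the algorithm only ever compares them (`>=`, `>`); it never depends on exact counts.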
@@ -114,9 +113,8 @@ const createRecord = <Data>(id: string, data: Data, nodeId: string): DataRecord<

const createTombstone = <Data>(record: DataRecord<Data>, nodeId: string): Tombstone => ({
  id: record.id,
  recordHLL: hllClone(record.recordHLL),
  tombstoneHLL: hllAdd(createHLL(), nodeId),
});

const createNode = <Data>(id: string): NodeState<Data> => ({

@@ -138,34 +136,32 @@ const checkGCStatus (

  myTombstoneEstimateBeforeMerge: number,
  myNodeId: string,
  senderNodeId: string | null
): { shouldGC: boolean; stepDownAsKeeper: boolean } => {
  const targetCount = hllEstimate(tombstone.recordHLL);

  const isKeeper = myTombstoneEstimateBeforeMerge >= targetCount;

  if (isKeeper) {
    // Keeper step-down logic:
    // If incoming tombstone has reached the target count, compare estimates.
    // If incoming estimate >= my estimate before merge, step down.
    // Use node ID as tie-breaker: higher node ID steps down when estimates are equal.
    if (incomingTombstoneEstimate !== null && incomingTombstoneEstimate >= targetCount) {
      if (myTombstoneEstimateBeforeMerge < incomingTombstoneEstimate) {
        return { shouldGC: true, stepDownAsKeeper: true };
      }
      // Tie-breaker: if estimates are equal, the lexicographically higher node ID steps down
      if (myTombstoneEstimateBeforeMerge === incomingTombstoneEstimate &&
          senderNodeId !== null && myNodeId > senderNodeId) {
        return { shouldGC: true, stepDownAsKeeper: true };
      }
    }
    return { shouldGC: false, stepDownAsKeeper: false };
  }

  // Not yet a keeper - will become one if tombstone count reaches target after merge
  // (No explicit action needed here, keeper status is inferred from HLL comparison)
  return { shouldGC: false, stepDownAsKeeper: false };
};

const receiveRecord = <Data>(

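The step-down rule in `checkGCStatus` above can be restated as a single predicate. The sketch below uses plain numbers in place of HLL estimates; `shouldStepDown` is an illustrative name for this condensed restatement, not part of the simulation:

```typescript
// Condensed restatement of the keeper step-down rule, with plain numbers
// standing in for HLL estimates.
const shouldStepDown = (
  myEstimate: number,
  incomingEstimate: number | null,
  targetCount: number,
  myNodeId: string,
  senderNodeId: string | null,
): boolean => {
  if (myEstimate < targetCount) return false; // not a keeper, nothing to step down from
  if (incomingEstimate === null || incomingEstimate < targetCount) return false;
  if (myEstimate < incomingEstimate) return true; // peer's tombstone has spread further
  return (
    myEstimate === incomingEstimate &&
    senderNodeId !== null &&
    myNodeId > senderNodeId // deterministic tie-breaker: higher ID yields
  );
};

console.log(shouldStepDown(20, 25, 20, "node-b", "node-a")); // true: peer estimate higher
console.log(shouldStepDown(25, 25, 20, "node-b", "node-a")); // true: tie, higher ID steps down
console.log(shouldStepDown(25, 25, 20, "node-a", "node-b")); // false: lower ID keeps the tombstone
```

The tie-breaker is what guarantees convergence: of two keepers with identical estimates, exactly one steps down.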
@@ -206,21 +202,20 @@ const receiveTombstone (

    ? hllAdd(hllMerge(existing.tombstoneHLL, incoming.tombstoneHLL), node.id)
    : hllAdd(hllClone(incoming.tombstoneHLL), node.id);

  let bestFrozenHLL = incoming.recordHLL;
  if (existing?.recordHLL) {
    bestFrozenHLL = hllEstimate(existing.recordHLL) > hllEstimate(bestFrozenHLL)
      ? existing.recordHLL
      : bestFrozenHLL;
  }
  if (hllEstimate(record.recordHLL) > hllEstimate(bestFrozenHLL)) {
    bestFrozenHLL = hllClone(record.recordHLL);
  }

  const updatedTombstone: Tombstone = {
    id: incoming.id,
    tombstoneHLL: mergedTombstoneHLL,
    recordHLL: bestFrozenHLL,
  };

  const myEstimateBeforeMerge = existing ? hllEstimate(existing.tombstoneHLL) : 0;

@@ -245,10 +240,6 @@ const receiveTombstone (

    return { ...node, records: newRecords, tombstones: newTombstones, stats: newStats };
  }

  const newTombstones = new Map(node.tombstones);
  newTombstones.set(incoming.id, updatedTombstone);
  return { ...node, records: newRecords, tombstones: newTombstones, stats: newStats };

@@ -397,14 +388,14 @@ const gossipOnce = <Data>(network: NetworkState<Data>, senderNodeId: string, rec

  // Merge HLLs
  const mergedTombstoneHLL = hllMerge(tombstone.tombstoneHLL, peerTombstone.tombstoneHLL);
  const bestFrozenHLL = hllEstimate(peerTombstone.recordHLL) > hllEstimate(tombstone.recordHLL)
    ? peerTombstone.recordHLL
    : tombstone.recordHLL;

  const updatedSenderTombstone: Tombstone = {
    ...tombstone,
    tombstoneHLL: mergedTombstoneHLL,
    recordHLL: bestFrozenHLL,
  };

  // Check if sender should step down (peer has higher estimate or wins tie-breaker)

@@ -429,9 +420,6 @@ const gossipOnce = <Data>(network: NetworkState<Data>, senderNodeId: string, rec

  newNodes = new Map(result.nodes);
  } else {
    // Keep tombstone with merged data
    const currentSender = newNodes.get(senderNodeId)!;
    const newSenderTombstones = new Map(currentSender.tombstones);
    newSenderTombstones.set(recordId, updatedSenderTombstone);
