## Abstract

When synchronizing records in a distributed network, deletion presents a fundamental challenge. Nodes must maintain "tombstone" records to prevent deleted data from being resurrected by offline nodes. This paper presents a **HyperLogLog-based approach** to tombstone garbage collection that uses probabilistic cardinality estimation to detect when tombstones have reached sufficient distribution.

We compare this approach against traditional methods—time-based garbage collection and causal stability detection—analyzing trade-offs in memory, coordination requirements, and failure tolerance.
## 1. Introduction

Distributed systems face an inherent tension between data consistency and storage efficiency when handling deletions. Traditional tombstone-based approaches guarantee correctness but impose unbounded storage growth. Several approaches have been proposed to address tombstone accumulation:

**Time-based Garbage Collection**: Sets a fixed time-to-live (TTL) for tombstones, after which they are automatically deleted[^2]. While storage-efficient, this risks data resurrection if offline nodes reconnect after the GC window.

**Causal Stability Detection**: Prunes tombstones when the system can prove all nodes have observed the deletion[^5]. Implementations vary from vector clocks (tracking operation ordering) to explicit node ID sets (tracking membership). This adds metadata overhead but provides strong guarantees when network conditions allow reliable tracking.

**Consensus-based Garbage Collection**: Uses coordination protocols to agree on when tombstones can be safely deleted[^6]. Provides strong guarantees but requires synchronization, which may be impractical in partition-prone or high-latency networks.

This paper introduces a **HyperLogLog-based approach**[^1] that approximates causal stability detection using probabilistic cardinality estimation. Instead of tracking exact node sets or vector clocks, it uses constant-size HyperLogLog structures to estimate propagation. This trades exactness for dramatic reductions in memory and bandwidth at scale.

[^1]: Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." *Discrete Mathematics and Theoretical Computer Science*, AH, 137-156. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

[^2]: Ladin, R., Liskov, B., Shrira, L., & Ghemawat, S. (1992). "Providing high availability using lazy replication." *ACM Transactions on Computer Systems*, 10(4), 360-391. https://doi.org/10.1145/138873.138877

[^5]: Baquero, C., Almeida, P. S., & Shoker, A. (2017). "Pure Operation-Based Replicated Data Types." *arXiv:1710.04469*. https://arxiv.org/abs/1710.04469

[^6]: Bauwens, J., & De Meuter, W. (2020). "Memory Efficient CRDTs in Dynamic Environments." *Proceedings of the 7th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '20)*. https://doi.org/10.1145/3380787.3393682

## 2. Core Algorithm

### 2.1 How It Works

The algorithm operates in four phases:

```mermaid
sequenceDiagram
  participant A as Node A
  participant B as Node B
  participant C as Node C
  Note over A,C: Phase 1: Record Propagation
  A->>C: record + recordHLL
  C->>A: update recordHLL
  Note over B: B receives record before going offline
  Note over A,C: Phase 2: Tombstone Propagation
  A->>A: Create tombstone, delete record
  A->>C: tombstone + tombstoneHLL + recordHLL
  C->>C: Delete record, update HLLs
  Note over A,C: Phase 3: Keeper Election
  Note over A,C: estimate(tombstoneHLL) >= estimate(recordHLL)
  Note over A,C: Nodes elect minimal keepers
  Note over A,C: Phase 4: B Reconnects
  B->>B: Comes back online
  C->>B: tombstone propagates
  B->>B: Deletes stale record
```

**Phase 1**: Records propagate through gossip. Each node adds itself to the record's HyperLogLog.

**Phase 2**: When deletion occurs, the deleting node creates a tombstone containing a copy of the record's HLL as the "target count." The tombstone propagates to online nodes.

**Phase 3**: When a node's tombstone HLL estimate reaches or exceeds the target, it may become a "keeper." Keepers step down when they encounter another keeper with a higher estimate (using node ID as tie-breaker).

**Phase 4**: Offline nodes receive the tombstone upon reconnection and delete their stale records.

### 2.2 Data Model

Records and tombstones are maintained as separate entities with distinct tracking mechanisms:
```ts
interface DataRecord<Data> {
  id: string;
  data: Data;
  recordHLL: HyperLogLog; // Tracks nodes that received the record
}

interface Tombstone {
  id: string;
  recordHLL: HyperLogLog;    // Target: estimated record distribution
  tombstoneHLL: HyperLogLog; // Progress: estimated tombstone distribution
}
```
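The `HyperLogLog` type is left abstract here. As a point of reference, the helper operations this design assumes (`createHLL`, `hllAdd`, `hllMerge`, `hllEstimate`) could be sketched as follows — illustrative only, using a simple 32-bit FNV-1a hash and the standard small-range correction, not a production-grade implementation:

```typescript
// Minimal HyperLogLog sketch at precision p = 10 (1024 registers, ~3% error).
const P = 10;
const M = 1 << P; // 1024 registers

type HyperLogLog = Uint8Array;

// FNV-1a 32-bit string hash (sufficient for a sketch, not for production).
const hash32 = (s: string): number => {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
};

const createHLL = (): HyperLogLog => new Uint8Array(M);

// Treat HLLs as immutable values, matching the record/tombstone model above.
const hllAdd = (hll: HyperLogLog, nodeId: string): HyperLogLog => {
  const h = hash32(nodeId);
  const idx = h >>> (32 - P);  // first p bits select a register
  const rest = (h << P) >>> 0; // remaining bits
  const rank = rest === 0 ? 32 - P + 1 : Math.clz32(rest) + 1;
  const out = new Uint8Array(hll);
  out[idx] = Math.max(out[idx], rank);
  return out;
};

const hllMerge = (a: HyperLogLog, b: HyperLogLog): HyperLogLog => {
  const out = new Uint8Array(M);
  for (let i = 0; i < M; i++) out[i] = Math.max(a[i], b[i]);
  return out;
};

const hllEstimate = (hll: HyperLogLog): number => {
  const alpha = 0.7213 / (1 + 1.079 / M);
  let sum = 0;
  let zeros = 0;
  for (let i = 0; i < M; i++) {
    sum += Math.pow(2, -hll[i]);
    if (hll[i] === 0) zeros++;
  }
  const raw = (alpha * M * M) / sum;
  // Linear-counting correction for small cardinalities
  return raw <= 2.5 * M && zeros > 0 ? M * Math.log(M / zeros) : raw;
};
```

At small counts the linear-counting branch dominates, so an HLL holding three distinct node IDs estimates close to 3, and merging is monotone: a merged HLL never reports fewer nodes than either input.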

### 2.3 Keeper Election
|
||||||
|
|
||||||
### 3.1 Record Creation and Distribution
|
|
||||||
|
|
||||||
When a node creates or receives a record, it adds itself to the record's HLL:
|
|
||||||
|
|
||||||
```ts
|
```ts
|
||||||
const createRecord = <Data>(id: string, data: Data, nodeId: string): DataRecord<Data> => ({
|
const shouldStepDown = (
|
||||||
id,
|
myEstimate: number, // My tombstone HLL estimate
|
||||||
data,
|
theirEstimate: number, // Incoming tombstone HLL estimate
|
||||||
recordHLL: hllAdd(createHLL(), nodeId),
|
targetEstimate: number, // Record HLL estimate (threshold)
|
||||||
});
|
|
||||||
|
|
||||||
const receiveRecord = <Data>(
|
|
||||||
node: NodeState<Data>,
|
|
||||||
incoming: DataRecord<Data>
|
|
||||||
): NodeState<Data> => {
|
|
||||||
// Reject records that have already been deleted
|
|
||||||
if (node.tombstones.has(incoming.id)) {
|
|
||||||
return node;
|
|
||||||
}
|
|
||||||
|
|
||||||
const existing = node.records.get(incoming.id);
|
|
||||||
const updatedRecord: DataRecord<Data> = existing
|
|
||||||
? { ...existing, recordHLL: hllAdd(hllMerge(existing.recordHLL, incoming.recordHLL), node.id) }
|
|
||||||
: { ...incoming, recordHLL: hllAdd(hllClone(incoming.recordHLL), node.id) };
|
|
||||||
|
|
||||||
const newRecords = new Map(node.records);
|
|
||||||
newRecords.set(incoming.id, updatedRecord);
|
|
||||||
return { ...node, records: newRecords };
|
|
||||||
};
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.2 Tombstone Creation
|
|
||||||
|
|
||||||
When deleting a record, a node creates a tombstone containing a copy of the record's HLL as the initial target count:
|
|
||||||
|
|
||||||
```ts
|
|
||||||
const createTombstone = <Data>(record: DataRecord<Data>, nodeId: string): Tombstone => ({
|
|
||||||
id: record.id,
|
|
||||||
recordHLL: hllClone(record.recordHLL),
|
|
||||||
tombstoneHLL: hllAdd(createHLL(), nodeId),
|
|
||||||
});
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.3 Garbage Collection Status Check
|
|
||||||
|
|
||||||
The core decision logic determines whether a node should become a keeper, step down, or continue as-is:
|
|
||||||
|
|
||||||
```ts
|
|
||||||
const checkGCStatus = (
|
|
||||||
tombstone: Tombstone,
|
|
||||||
incomingTombstoneEstimate: number | null,
|
|
||||||
myTombstoneEstimateBeforeMerge: number,
|
|
||||||
myNodeId: string,
|
myNodeId: string,
|
||||||
senderNodeId: string | null
|
theirNodeId: string
|
||||||
): { shouldGC: boolean; stepDownAsKeeper: boolean } => {
|
): boolean => {
|
||||||
const targetCount = hllEstimate(tombstone.recordHLL);
|
const iAmKeeper = myEstimate >= targetEstimate;
|
||||||
|
const theyAreKeeper = theirEstimate >= targetEstimate;
|
||||||
|
|
||||||
const isKeeper = myTombstoneEstimateBeforeMerge >= targetCount;
|
if (!iAmKeeper || !theyAreKeeper) return false;
|
||||||
|
|
||||||
if (isKeeper) {
|
// Step down if they have higher estimate
|
||||||
// Keeper step-down logic:
|
if (theirEstimate > myEstimate) return true;
|
||||||
// If incoming tombstone has reached the target count, compare estimates.
|
|
||||||
// If incoming estimate >= my estimate before merge, step down.
|
|
||||||
// Use node ID as tie-breaker: higher node ID steps down when estimates are equal.
|
|
||||||
if (incomingTombstoneEstimate !== null && incomingTombstoneEstimate >= targetCount) {
|
|
||||||
if (myTombstoneEstimateBeforeMerge < incomingTombstoneEstimate) {
|
|
||||||
return { shouldGC: true, stepDownAsKeeper: true };
|
|
||||||
}
|
|
||||||
// Tie-breaker: if estimates are equal, the lexicographically higher node ID steps down
|
|
||||||
if (myTombstoneEstimateBeforeMerge === incomingTombstoneEstimate &&
|
|
||||||
senderNodeId !== null && myNodeId > senderNodeId) {
|
|
||||||
return { shouldGC: true, stepDownAsKeeper: true };
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return { shouldGC: false, stepDownAsKeeper: false };
|
|
||||||
}
|
|
||||||
|
|
||||||
// Not yet a keeper - will become one if tombstone count reaches target after merge
|
// Tie-breaker: higher node ID steps down
|
||||||
return { shouldGC: false, stepDownAsKeeper: false };
|
if (theirEstimate === myEstimate && myNodeId > theirNodeId) return true;
|
||||||
|
|
||||||
|
return false;
|
||||||
};
|
};
|
||||||
```
|
```
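To illustrate the convergence this rule produces, here is a small self-contained demo; `shouldStepDown` is restated so the snippet runs on its own, and the three-keeper setup is hypothetical:

```typescript
// Demo: three keepers with equal estimates converge to a single keeper.
// shouldStepDown is restated from above so the snippet is self-contained.
const shouldStepDown = (
  myEstimate: number, theirEstimate: number, targetEstimate: number,
  myNodeId: string, theirNodeId: string
): boolean => {
  if (myEstimate < targetEstimate || theirEstimate < targetEstimate) return false;
  if (theirEstimate > myEstimate) return true;
  return theirEstimate === myEstimate && myNodeId > theirNodeId;
};

const target = 100;
// Three keepers whose tombstone estimates have all reached the target
let keepers = [
  { id: "node-a", estimate: 100 },
  { id: "node-b", estimate: 100 },
  { id: "node-c", estimate: 100 },
];

// Full pairwise gossip: each keeper steps down if any peer beats it
// (higher estimate, or equal estimate with a lower node ID).
keepers = keepers.filter(me =>
  !keepers.some(them =>
    them.id !== me.id &&
    shouldStepDown(me.estimate, them.estimate, target, me.id, them.id)
  )
);

console.log(keepers.map(k => k.id)); // ["node-a"] — the lowest node ID wins ties
```

Because the tie-breaker is deterministic, repeated gossip always shrinks the keeper set toward exactly one keeper per connected component.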

## 3. Design Rationale

### 3.1 Why Propagate the Record HLL with Tombstones?

Without a shared target, each node would compare against its own local record HLL, leading to premature garbage collection. By propagating the record HLL with the tombstone and always keeping the highest estimate encountered, all nodes converge on a safe target.
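A sketch of the target-selection rule on receipt of a tombstone (a hypothetical shape: plain numeric estimates stand in for the full HLL structures):

```typescript
// On receiving a tombstone, adopt the highest record estimate seen so far.
// The shared target is monotone: it can only grow toward the true distribution.
interface TombstoneView {
  targetEstimate: number; // estimate(recordHLL)
}

const mergeTarget = (
  mine: TombstoneView | undefined, // my stored tombstone, if any
  incoming: TombstoneView
): TombstoneView => ({
  targetEstimate: Math.max(mine ? mine.targetEstimate : 0, incoming.targetEstimate),
});

console.log(mergeTarget({ targetEstimate: 42 }, { targetEstimate: 57 })); // { targetEstimate: 57 }
```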

### 3.2 Why Dynamic Keeper Election?

A fixed originator-as-keeper design creates a single point of failure. If the originator goes offline, tombstone propagation halts and records may resurrect when stale nodes reconnect.

Dynamic election allows any node to become a keeper when it detects `tombstoneEstimate >= recordEstimate`. This ensures tombstone propagation continues regardless of which specific node initiated the deletion.

### 3.3 Why Keeper Step-Down?

Without step-down logic, every node eventually becomes a keeper. Step-down creates convergence toward a minimal keeper set:

```mermaid
graph TD
subgraph Keeper Convergence Over Time
T0["t=0: 0 keepers"]
T1["t=1: 5 keepers<br/>(first nodes to detect threshold)"]
T2["t=2: 3 keepers<br/>(2 stepped down after seeing higher estimates)"]
T3["t=3: 1 keeper<br/>(converged to single keeper)"]
end
T0 --> T1 --> T2 --> T3
```

### 3.4 Why Node ID Tie-Breaker?

When HLL estimates converge (all nodes have similar values), no node can have a strictly higher estimate. The lexicographic node ID comparison ensures deterministic convergence to a single keeper.

### 3.5 Why Forward on Step-Down?

Without forwarding, keepers only step down when randomly selected for gossip, a slow process. With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all reachable nodes, creating a cascade effect that rapidly eliminates redundant keepers.
## 4. Comparison with Alternative Approaches

### 4.1 Time-based GC vs HyperLogLog

| Aspect | Time-based GC | HyperLogLog |
|--------|---------------|-------------|
| Safety | Unsafe if nodes offline > TTL | Safe (waits for propagation) |
| Configuration | Requires tuning TTL | Self-adapting |
| Metadata overhead | None | ~2 KB per tombstone |
| Coordination | None | None |

**When to use Time-based**: Networks with predictable uptime and short offline periods.

**When to use HyperLogLog**: Networks with unpredictable offline durations or partition-prone environments.

### 4.2 Causal Stability Detection vs HyperLogLog

Traditional causal stability uses vector clocks or explicit node sets to track exactly which nodes have observed operations.

| Aspect | Causal Stability | HyperLogLog |
|--------|------------------|-------------|
| Tracking | Exact (vector clocks/sets) | Approximate (~3% error) |
| Memory per tombstone | O(n) – grows with nodes | O(1) – constant ~2 KB |
| Bandwidth per message | O(n) – grows with nodes | O(1) – constant ~2 KB |
| Debugging | Can enumerate missing nodes | Only aggregate estimates |
| Implementation | Simple data structures | Requires HLL library |

**Memory comparison:**

| Network Size | Causal Stability | HyperLogLog | HLL Advantage |
|--------------|------------------|-------------|---------------|
| 10 nodes | ~400 bytes | ~2 KB | 0.2x (worse) |
| 50 nodes | ~2 KB | ~2 KB | 1x (equal) |
| 100 nodes | ~4 KB | ~2 KB | 2x better |
| 1,000 nodes | ~40 KB | ~2 KB | 20x better |
| 10,000 nodes | ~400 KB | ~2 KB | 200x better |
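The table's figures follow from simple arithmetic; the per-node constant (~40 bytes per explicit node entry) and the fixed 2 KB HLL size are assumptions for illustration:

```typescript
// Back-of-envelope sizes behind the memory comparison table (assumed constants).
const causalStabilityBytes = (nodes: number): number => nodes * 40; // ~40 B per tracked node
const hllBytes = 2 * 1024; // constant ~2 KB, independent of network size

console.log(causalStabilityBytes(10));    // 400 bytes
console.log(causalStabilityBytes(10000)); // 400000 bytes (~400 KB)
console.log(causalStabilityBytes(10000) / hllBytes); // ~195, i.e. roughly 200x
```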

**When to use Causal Stability**: Networks < 50 nodes, or when exact tracking is required for auditing.

**When to use HyperLogLog**: Networks > 100 nodes, bandwidth-constrained environments, or unpredictably growing networks.

### 4.3 Consensus-based GC vs HyperLogLog

| Aspect | Consensus-based | HyperLogLog |
|--------|-----------------|-------------|
| Coordination | Required (Raft/Paxos) | None (local decisions) |
| Partition tolerance | Blocks during partitions | Continues independently |
| Guarantees | Strong consistency | Eventual consistency |
| Latency | Round-trip consensus | Local estimation |

**When to use Consensus**: Systems already using consensus (e.g., Raft-replicated databases).

**When to use HyperLogLog**: Partition-tolerant systems, high-latency networks, or systems without existing consensus infrastructure.

### 4.4 Summary

```mermaid
graph TD
A[Choose GC Approach] --> B{Network Size?}
B -->|< 50 nodes| C[Causal Stability<br/>Exact tracking]
B -->|> 100 nodes| D[HyperLogLog<br/>Probabilistic tracking]
B -->|50-100 nodes| E{Priority?}
E -->|Exactness| C
E -->|Scalability| D
A --> F{Already have consensus?}
F -->|Yes| G[Consensus-based GC<br/>Strong guarantees]
F -->|No| H{Predictable uptime?}
H -->|Yes, short outages| I[Time-based GC<br/>Simple TTL]
H -->|No, long/variable outages| D
```

## 5. Simulation Results

We implemented the HyperLogLog approach in a discrete-event simulation with 50 trials per scenario across various failure modes (node offline, network partition, concurrent deleters, origin node failure).
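The simulation harness is not reproduced here, but its core gossip step can be sketched as follows (a hypothetical, simplified shape: a plain numeric estimate stands in for the tombstone HLL, and `max` models the monotone HLL merge):

```typescript
// One simplified gossip round: each online node exchanges its tombstone
// estimate with one random reachable peer; both keep the max, mirroring
// the monotonicity of an HLL merge. Offline nodes are skipped entirely.
type SimNode = { id: string; online: boolean; tombstoneEst: number };

const gossipRound = (nodes: SimNode[], rand: () => number): void => {
  for (const n of nodes) {
    if (!n.online) continue;
    const peers = nodes.filter(p => p.online && p.id !== n.id);
    if (peers.length === 0) continue;
    const peer = peers[Math.floor(rand() * peers.length)];
    const merged = Math.max(n.tombstoneEst, peer.tombstoneEst);
    n.tombstoneEst = merged;
    peer.tombstoneEst = merged;
  }
};

const nodes: SimNode[] = [
  { id: "a", online: true, tombstoneEst: 3 },
  { id: "b", online: true, tombstoneEst: 1 },
  { id: "c", online: false, tombstoneEst: 0 }, // offline: untouched until it rejoins
];

// Simple deterministic PRNG so trials are reproducible
let seed = 1;
const rand = () => (seed = (seed * 16807) % 2147483647) / 2147483647;

for (let round = 0; round < 5; round++) gossipRound(nodes, rand);
console.log(nodes.map(n => n.tombstoneEst)); // online nodes converge; offline stays stale
```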

### 5.1 Resource Usage
|
||||||
|
|
||||||
#### 5.2.3 Multiple Nodes Offline
|
| Network Size | Memory per Tombstone | Bandwidth per Gossip |
|
||||||
|
|--------------|---------------------|----------------------|
|
||||||
|
| 20 nodes | ~2 KB | ~2 KB |
|
||||||
|
| 500 nodes | ~2 KB | ~2 KB |
|
||||||
|
| 10,000 nodes | ~2 KB | ~2 KB |
|
||||||
|
|
||||||
**Scenario**: Four nodes go offline before tombstone creation, then reconnect one by one.
|
Constant resource usage regardless of network size.
|
||||||
|
|
||||||
**Results** (averaged over 50 trials):
|
## 6. Limitations
|
||||||
|
|
||||||
| Metric | Value |
|
### 6.1 Estimation Error
|
||||||
|--------|-------|
|
|
||||||
| Nodes | 20 per trial (1000 total) |
|
|
||||||
| Offline nodes | 4 |
|
|
||||||
| Records deleted | 100% success |
|
|
||||||
| Rounds to delete records | 120 |
|
|
||||||
| Total rounds | 230 |
|
|
||||||
| Final tombstones | 50 (1 per trial) |
|
|
||||||
|
|
||||||
**Analysis**: Even with 20% of nodes offline, the algorithm successfully propagates tombstones to all nodes once they reconnect. The staggered reconnection does not cause issues.
|
HyperLogLog provides ~3% error at precision 10 (1024 registers). This can cause:
|
||||||
|
- **Premature keeper election**: Rare, handled conservatively by erring toward retaining tombstones
|
||||||
|
- **Delayed keeper convergence**: Minor efficiency impact
|
||||||
|
|
||||||
#### 5.2.4 Network Partition and Heal
|
### 6.2 No Node Enumeration
|
||||||
|
|
||||||
**Scenario**: Network splits into two equal partitions after record propagation. Tombstone created in partition A. After 80 rounds, partition heals.
|
Cannot identify which specific nodes are missing the tombstone. For debugging or auditing, causal stability with explicit tracking is preferable.
|
||||||
|
|
||||||
```mermaid
|
### 6.3 Message Ordering
|
||||||
sequenceDiagram
|
|
||||||
participant A as Partition A (10 nodes)
|
|
||||||
participant B as Partition B (10 nodes)
|
|
||||||
|
|
||||||
Note over A,B: Record propagated to all 20 nodes
|
If a tombstone arrives before the record (due to message reordering), the node ignores the tombstone. If the record subsequently arrives, the node accepts it. However, keepers will eventually propagate the tombstone to this node, so the record is eventually deleted—just not immediately.
|
||||||
|
|
||||||
Note over A,B: Network partitions
|
## 7. Conclusion
|
||||||
A--xB: partition
|
|
||||||
|
|
||||||
Note over A: Tombstone created in partition A
|
HyperLogLog-based tombstone garbage collection provides a scalable alternative to traditional approaches:
|
||||||
A->>A: tombstone propagates within A
|
|
||||||
Note over B: Partition B still has record
|
|
||||||
|
|
||||||
Note over A,B: Partition heals after 80 rounds
|
| Approach | Best For |
|
||||||
A->>B: tombstone propagates to B
|
|----------|----------|
|
||||||
B->>B: records deleted, keepers converge
|
| **Time-based GC** | Predictable networks with short outages |
|
||||||
```
|
| **Causal Stability** | Small networks (< 50 nodes) requiring exact tracking |
|
||||||
|
| **Consensus-based GC** | Systems with existing consensus infrastructure |
|
||||||
|
| **HyperLogLog** | Large networks (> 100 nodes), partition-prone environments |
|
||||||
|
|
||||||
**Results** (averaged over 50 trials):
|
The HyperLogLog approach trades exactness (~3% error) for dramatic scalability—constant memory and bandwidth regardless of network size. For networks that may grow unpredictably, this provides a robust foundation without configuration changes.
|
||||||
|
|
||||||
| Metric | Partition A | Partition B | Total |
|--------|-------------|-------------|-------|
| Nodes | 10 per trial (500 total) | 10 per trial (500 total) | 20 per trial (1000 total) |
| Records deleted | 100% success | 100% success | 100% success |
| Rounds to delete | - | - | 90 |
| Final tombstones | 50 (1 per trial) | 0 | 50 |

**Analysis**: During the partition, only partition A processes the tombstone. Upon healing, partition B rapidly receives the tombstone and deletes its records. The final keeper distribution shows all keepers in partition A (where the tombstone originated), demonstrating efficient consolidation.

#### 5.2.5 Cluster Separation

**Scenario**: A 5-node cluster becomes isolated from the main 20-node network. The tombstone is created in the main network, the cluster stays isolated for 150 rounds, then rejoins.

**Results** (averaged over 50 trials):

| Metric | Main (20 nodes) | Isolated (5 nodes) | Total |
|--------|-----------------|--------------------|-------|
| Nodes | 20 per trial (1000 total) | 5 per trial (250 total) | 25 per trial (1250 total) |
| Records deleted | 100% success | 100% success | 100% success |
| Rounds to delete | - | - | 160 |
| Final tombstones | 50 (1 per trial) | 0 | 50 |

**Analysis**: Extended isolation (150 rounds) does not prevent successful deletion. When the isolated cluster rejoins, it receives the tombstone and deletes stale records. Keepers consolidate to the main partition.

#### 5.2.6 Concurrent Tombstones

**Scenario**: Three nodes simultaneously initiate deletion of the same record.

**Results** (averaged over 50 trials):

| Metric | Value |
|--------|-------|
| Nodes | 20 per trial (1000 total) |
| Concurrent deleters | 3 |
| Records deleted | 100% success |
| Rounds to delete | 10 |
| Final tombstones | 50 (1 per trial) |

**Analysis**: The algorithm handles concurrent tombstone creation gracefully. Multiple tombstones merge via HLL union operations, and keeper election converges to a single keeper as normal.
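
The union step that makes concurrent deleters safe is the standard HLL merge: a register-wise maximum over two sketches of the same precision. A minimal sketch, assuming a plain register-array representation (the `Hll` shape here is illustrative, not the simulation's actual type):

```typescript
// Illustrative HLL union: combining two sketches of the same precision by
// taking the register-wise maximum yields the sketch of the set union.
interface Hll {
  readonly precision: number;      // p; register count m = 2^p
  readonly registers: Uint8Array;  // m one-byte registers
}

function unionHll(a: Hll, b: Hll): Hll {
  if (a.precision !== b.precision) {
    throw new Error("cannot merge sketches of different precision");
  }
  const merged = new Uint8Array(a.registers.length);
  for (let i = 0; i < merged.length; i++) {
    merged[i] = Math.max(a.registers[i], b.registers[i]);
  }
  return { precision: a.precision, registers: merged };
}
```

Because register-wise max is commutative, associative, and idempotent, tombstones created concurrently can be merged in any order, any number of times, and still converge to the same sketch.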

#### 5.2.7 Staggered Node Recovery

**Scenario**: Four nodes go offline. After tombstone creation, nodes reconnect at staggered intervals (20 rounds apart).

**Results** (averaged over 50 trials):

| Metric | Value |
|--------|-------|
| Nodes | 20 per trial (1000 total) |
| Offline nodes | 4 |
| Records deleted | 100% success |
| Rounds to delete records | 130 |
| Total rounds | 240 |
| Final tombstones | 50 (1 per trial) |

**Analysis**: Staggered recovery is handled correctly. Each reconnecting node receives the tombstone from online nodes and deletes its stale record.

#### 5.2.8 Origin Node Goes Offline

**Scenario**: The node that creates the tombstone immediately goes offline. Other nodes must propagate the tombstone without the originator.

**Results** (averaged over 50 trials):

| Metric | Value |
|--------|-------|
| Nodes | 15 per trial (750 total) |
| Records deleted | 100% success |
| Rounds to delete | 0 (immediate after gossip starts) |
| Total rounds | 170 (includes originator offline period) |
| Final tombstones | 51 (~1 per trial) |

**Analysis**: The algorithm continues to function even when the tombstone originator immediately goes offline. Other nodes that received the tombstone before the originator went offline continue propagation. This demonstrates the fault-tolerant design.

#### 5.2.9 Flapping Node

**Scenario**: One node repeatedly toggles between online and offline states (every 5 rounds) during tombstone propagation.

**Results** (averaged over 50 trials):

| Metric | Value |
|--------|-------|
| Nodes | 15 per trial (750 total) |
| Records deleted | 100% success |
| Rounds to delete | 8 |
| Total rounds | 123 |
| Final tombstones | 50 (1 per trial) |

**Analysis**: Flapping nodes do not disrupt the algorithm. When online, the node participates in gossip; when offline, it is simply skipped. The tombstone eventually reaches the flapping node during one of its online periods.

#### 5.2.10 Partition During Keeper Election

**Scenario**: The network partitions after the tombstone has started propagating but before keeper election completes. Each partition independently runs keeper election, then the partitions heal.

**Results** (averaged over 50 trials):

| Metric | Value |
|--------|-------|
| Nodes | 20 per trial (1000 total) |
| Records deleted | 100% success |
| Rounds to delete | 115 |
| Total rounds | 225 |
| Final tombstones | 50 (1 per trial) |

**Analysis**: Even when keeper election happens independently in separate partitions, the algorithm correctly converges to a single keeper after partition healing. This demonstrates robustness to mid-operation partitions.

### 5.3 Summary of Results

All results are averaged over 50 independent trials per scenario.

| Scenario | Nodes | Deletion Rounds | Final Keepers | Key Insight |
|----------|-------|-----------------|---------------|-------------|
| Single Node Deletion | 15 | 10 | 1 | Baseline: optimal convergence |
| Node Offline | 15 | 60 | 1 | Handles offline nodes |
| Multiple Nodes Offline | 20 | 120 | 1 | Scales with more offline nodes |
| Network Partition | 20 | 90 | 1 | Partition-tolerant |
| Cluster Separation | 25 | 160 | 1 | Extended isolation handled |
| Concurrent Tombstones | 20 | 10 | 1 | Multiple deleters merge correctly |
| Staggered Recovery | 20 | 130 | 1 | Handles asynchronous reconnection |
| Origin Node Offline | 15 | 0 | 1 | Fault-tolerant originator |
| Flapping Node | 15 | 8 | 1 | Intermittent connectivity handled |
| Partition During GC | 20 | 115 | 1 | Mid-operation partition safe |

**Statistical Observations** (across 500 total trials):

- **100% deletion success rate**: All 500 trials successfully deleted records
- **Optimal keeper convergence**: 1 keeper per trial in all scenarios (compared to 10-25% in sparse network models)
- **Fully connected advantage**: The network model enables rapid propagation and optimal keeper consolidation

### 5.4 Key Findings

Based on 500 total trials across 10 scenarios:

1. **Reliable deletion**: 100% success rate across all trials. The fully connected model enables faster propagation than sparse peer-based networks.

2. **Optimal garbage collection**: In a fully connected network, tombstones converge to exactly 1 keeper per tombstone. This is optimal: the minimum required to prevent resurrection from offline/partitioned nodes.

3. **Offline node handling**: Nodes that go offline retain their records, but receive tombstones upon reconnection. The algorithm correctly handles:
   - Single node offline
   - Multiple nodes offline simultaneously
   - Staggered reconnection
   - Flapping (intermittent) connectivity

4. **Partition tolerance**: Network partitions do not cause correctness issues:
   - Tombstones propagate within each partition independently
   - Upon healing, cross-partition propagation completes deletion
   - Keepers consolidate across the healed network

5. **Fault-tolerant originator**: If the node that creates a tombstone immediately goes offline, other nodes continue propagation. No single point of failure.

6. **Concurrent safety**: Multiple simultaneous deleters correctly merge their tombstones and converge to a single keeper.

## 6. Limitations and Edge Cases

### 6.1 Message Ordering Issues

Despite the algorithm's robustness to offline nodes and partitions, certain message ordering scenarios can still cause tombstone-related issues:

**Late Record Arrival**: If a node receives a tombstone before ever receiving the original record, it ignores the tombstone (since it has no record to delete). If the record subsequently arrives via a delayed message path, the node will accept the record as new data: a resurrection.

```mermaid
sequenceDiagram
    participant A as Node A
    participant B as Node B
    participant C as Node C

    Note over A: Creates record
    A->>B: record (delayed in transit)
    A->>A: Deletes record, creates tombstone
    A->>C: tombstone
    C->>C: Ignores tombstone (no record)
    B->>C: record arrives (delayed)
    C->>C: Accepts record as new!
    Note over C: Record resurrected
```

**Mitigation**: Nodes can maintain a "seen tombstones" set for recently observed tombstone IDs, rejecting records matching those IDs. This adds memory overhead but prevents the most common ordering issues. Even without this cache, keepers eventually re-propagate the tombstone to the affected node, so the resurrected record is eventually deleted, just not immediately.
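
One way to realize this mitigation is a small bounded cache of recently seen tombstone IDs, consulted before accepting any record. A sketch; the `SeenTombstones` name, FIFO eviction, and capacity handling are illustrative assumptions, not the simulation's API:

```typescript
// Sketch of the "seen tombstones" mitigation: remember recently observed
// tombstone IDs and reject late-arriving records that match them.
class SeenTombstones {
  private readonly ids = new Set<string>();
  private readonly order: string[] = []; // insertion order, for eviction

  constructor(private readonly capacity: number) {}

  record(tombstoneId: string): void {
    if (this.ids.has(tombstoneId)) return;
    this.ids.add(tombstoneId);
    this.order.push(tombstoneId);
    if (this.order.length > this.capacity) {
      this.ids.delete(this.order.shift()!); // evict the oldest entry
    }
  }

  // Called before accepting an incoming record.
  shouldRejectRecord(recordId: string): boolean {
    return this.ids.has(recordId);
  }
}
```

The bound keeps the memory overhead fixed; a record whose tombstone has already been evicted can still be resurrected briefly, which the keeper mechanism then corrects.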

**Concurrent Create-Delete**: If one node creates a record while another node simultaneously creates a tombstone for that same ID (e.g., re-using an ID, or in a create-delete-recreate scenario), the outcome depends on message ordering and may result in either the record or the tombstone "winning" non-deterministically.

**Mitigation**: Use globally unique, never-reused record IDs (e.g., UUIDs or content-addressed hashes) to prevent ID collisions between creates and deletes.
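
In a Node.js runtime this mitigation is essentially one call (assuming Node 14.17+, where `randomUUID` is available in the `crypto` module):

```typescript
import { randomUUID } from "node:crypto";

// Globally unique, never-reused record IDs (RFC 4122 version 4) prevent
// ID collisions between concurrent creates and deletes.
function newRecordId(): string {
  return randomUUID();
}
```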

### 6.2 HLL Estimation Errors

HyperLogLog provides probabilistic estimates with a standard error of approximately 1.04/√m, where m is the number of registers. At precision 10 (1024 registers), this is ~3.25% error. In practice:

- A record distributed to 100 nodes might show an HLL estimate of 97-103
- A tombstone distributed to 100 nodes might show an estimate of 96-104

This can cause:

- **Premature keeper election**: A node might become a keeper before the tombstone has truly reached all record holders
- **Delayed keeper convergence**: Nodes might remain keepers longer due to estimate fluctuations

The algorithm handles this conservatively: tombstones are only garbage collected when the tombstone estimate reaches or exceeds the record estimate, erring on the side of retaining tombstones.
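
That conservative rule reduces to a single comparison of the two estimates. The sketch below also shows an optional safety margin derived from the standard error; the margin parameter is an assumption of this sketch, not part of the algorithm as stated:

```typescript
// Conservative GC check: only drop a tombstone once its estimated reach
// meets or exceeds the record's estimated reach. An optional margin of
// k standard errors (1.04 / sqrt(m)) absorbs HLL estimation noise.
function canGarbageCollect(
  recordEstimate: number,    // HLL estimate of nodes that held the record
  tombstoneEstimate: number, // HLL estimate of nodes that hold the tombstone
  registers = 1024,          // m; precision 10
  marginMultiplier = 1,      // how many standard errors to demand (assumed)
): boolean {
  const stdError = 1.04 / Math.sqrt(registers); // ~3.25% at precision 10
  const required = recordEstimate * (1 + marginMultiplier * stdError);
  return tombstoneEstimate >= required;
}
```

With `marginMultiplier = 0` this is exactly the "reaches or exceeds" rule from the text; a positive margin trades later GC for extra protection against premature keeper election.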

## 7. Comparison with Alternative Approaches

### 7.1 Explicit Node List Approach

An alternative to HLL-based tracking is maintaining explicit sets of node IDs:

```ts
interface ExplicitTombstone {
  readonly id: string;
  readonly recordReceivers: Set<string>;    // Nodes that received the record
  readonly tombstoneReceivers: Set<string>; // Nodes that received the tombstone
}
```

**Advantages of Explicit Lists**:

- Exact counts, no estimation error
- Deterministic keeper election
- Can identify specific nodes that haven't received tombstones

**Disadvantages**:

- Memory grows linearly with node count
- Merge operations require set union (O(n) time and space)
- Network bandwidth increases with node count
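
The O(n) merge cost is visible directly in code: unioning two explicit tombstones copies every node ID. A minimal sketch against the `ExplicitTombstone` shape above, restated here so the snippet is self-contained:

```typescript
interface ExplicitTombstone {
  readonly id: string;
  readonly recordReceivers: Set<string>;    // Nodes that received the record
  readonly tombstoneReceivers: Set<string>; // Nodes that received the tombstone
}

// Merging explicit tombstones is a set union: O(n) in the number of node
// IDs tracked, and the result keeps growing with network size.
function mergeExplicit(a: ExplicitTombstone, b: ExplicitTombstone): ExplicitTombstone {
  return {
    id: a.id,
    recordReceivers: new Set([...a.recordReceivers, ...b.recordReceivers]),
    tombstoneReceivers: new Set([...a.tombstoneReceivers, ...b.tombstoneReceivers]),
  };
}
```

Contrast this with the HLL union, which touches a fixed 1024 registers no matter how many nodes either side has seen.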

### 7.2 Memory Comparison

| Network Size | Explicit List (per tombstone) | HLL (per tombstone) | HLL Advantage |
|--------------|-------------------------------|---------------------|---------------|
| 10 nodes | ~200 bytes | ~2 KB | 0.1x (worse) |
| 50 nodes | ~1 KB | ~2 KB | 0.5x (worse) |
| 100 nodes | ~2 KB | ~2 KB | 1x (equal) |
| 500 nodes | ~10 KB | ~2 KB | 5x better |
| 1,000 nodes | ~20 KB | ~2 KB | 10x better |
| 10,000 nodes | ~200 KB | ~2 KB | 100x better |

*Assumptions: Node IDs are 20-byte strings (e.g., hex-encoded 80-bit identifiers). HLL uses precision 10 (1024 1-byte registers × 2 HLLs = 2 KB).*

**Crossover Point**: HLL becomes more memory-efficient than explicit lists at approximately 100 nodes (assuming 20-byte node IDs). For smaller networks, explicit lists are more efficient and provide exact counts.
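
The crossover point follows from simple arithmetic: an explicit list costs roughly 20 bytes per tracked node, while the paired HLLs cost a fixed ~2 KB. A quick sketch of that arithmetic under the table's assumptions:

```typescript
// Per-tombstone memory under each approach, using the assumptions above:
// 20-byte node IDs for explicit lists, and two precision-10 HLLs
// (2 × 1024 one-byte registers) for the probabilistic variant.
const ID_BYTES = 20;
const HLL_BYTES = 2 * 1024;

// Explicit tracking grows linearly with the number of nodes tracked.
const explicitBytes = (nodes: number): number => nodes * ID_BYTES;

// HLL tracking is constant regardless of how many nodes were tracked.
const hllBytes = (): number => HLL_BYTES;

// Smallest network size at which the HLL is no larger than the list
// (about 100 nodes, matching the crossover point above).
const crossoverNodes = Math.ceil(HLL_BYTES / ID_BYTES);
```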

### 7.3 Bandwidth Comparison

Each gossip message transmits tombstone data. For a tombstone that has propagated to N nodes:

| Network Size | Explicit List (per message) | HLL (per message) |
|--------------|-----------------------------|-------------------|
| 10 nodes | ~400 bytes | ~2 KB |
| 100 nodes | ~4 KB | ~2 KB |
| 1,000 nodes | ~40 KB | ~2 KB |

HLL provides constant-size messages regardless of how many nodes have received the tombstone, making it significantly more bandwidth-efficient in large networks.

### 7.4 When to Use Each Approach

**Use Explicit Node Lists when**:

- The network has fewer than ~100 nodes
- Exact tracking is required for auditing or debugging
- Node IDs are very short (reducing per-node overhead)
- Memory and bandwidth are not constrained

**Use HLL-based Tracking when**:

- The network has more than ~100 nodes
- Bandwidth is constrained (e.g., mobile networks, cross-region links)
- Approximate counts are acceptable
- Network size may grow unpredictably

### 7.5 Hybrid Approach

A practical implementation might use:

- Explicit lists for small tombstones (< 50 nodes)
- Automatic promotion to HLL when the set size exceeds a threshold

This captures the best of both approaches at the cost of some implementation complexity.
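
The promotion idea can be sketched as a small wrapper that tracks IDs exactly until a threshold, then replays them into a sketch. The `CardinalitySketch` interface and `HybridTracker` name are illustrative assumptions; a real implementation would inject an actual HLL here:

```typescript
// Hybrid tracker: exact Set below a threshold, probabilistic sketch above.
// `CardinalitySketch` abstracts the HLL so the promotion mechanics stay
// independent of any particular HLL implementation.
interface CardinalitySketch {
  add(id: string): void;
  estimate(): number;
}

class HybridTracker {
  private exact: Set<string> | null = new Set();
  private sketch: CardinalitySketch | null = null;

  constructor(
    private readonly threshold: number,
    private readonly makeSketch: () => CardinalitySketch,
  ) {}

  add(nodeId: string): void {
    if (this.exact !== null) {
      this.exact.add(nodeId);
      if (this.exact.size > this.threshold) {
        // Promote: replay every known ID into the sketch, drop the set.
        this.sketch = this.makeSketch();
        for (const id of this.exact) this.sketch.add(id);
        this.exact = null;
      }
      return;
    }
    this.sketch!.add(nodeId);
  }

  count(): number {
    return this.exact !== null ? this.exact.size : this.sketch!.estimate();
  }

  isExact(): boolean {
    return this.exact !== null;
  }
}
```

Note that promotion is one-way: once the exact set is dropped, the individual node IDs cannot be recovered, which is exactly the enumeration limitation discussed earlier.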

## 8. Trade-offs Summary

| Aspect | Impact |
|--------|--------|
| **Memory** | ~2 KB per tombstone (HLL at precision 10), constant regardless of network size |
| **Bandwidth** | ~2 KB per gossip message, constant regardless of propagation extent |
| **Latency** | GC delayed until keeper convergence (~100 rounds after deletion) |
| **Consistency** | Eventual: temporary resurrection attempts are blocked but logged |
| **Accuracy** | ~3% estimation error at precision 10; conservative handling prevents premature GC |
| **Ordering** | Susceptible to late-arriving records; mitigated by tombstone ID caching |

## 9. Properties

The algorithm provides the following guarantees:

- **Safety**: Tombstones are never prematurely garbage collected. A tombstone is only deleted when the node has received confirmation (via HLL estimates) that the tombstone has propagated to at least as many nodes as received the original record.
- **Liveness**: Keepers eventually step down, enabling garbage collection. The tie-breaker mechanism ensures convergence even when HLL estimates are identical.
- **Fault tolerance**: No single point of failure. Any online node can propagate tombstones. Offline nodes receive tombstones upon reconnection.
- **Partition tolerance**: Each partition independently maintains tombstones. Upon healing, tombstones propagate across the healed partition boundary.
- **Convergence**: In a fully connected network, the keeper count converges to exactly 1.

## 10. Conclusion

This paper presented a HyperLogLog-based approach to tombstone garbage collection in distributed systems with a fully connected network model. By tracking record and tombstone propagation through probabilistic cardinality estimation, the algorithm reduces the number of nodes maintaining tombstones to a single keeper per tombstone.

The simulation results, based on 500 trials across 10 scenarios, demonstrate consistent behavior across diverse failure scenarios. Records are deleted within 10-160 gossip rounds depending on offline/partition duration, and tombstones converge to exactly 1 keeper. The algorithm correctly handles:

- Individual nodes going offline and reconnecting
- Multiple nodes offline simultaneously
- Network partitions of various durations
- Concurrent deletion by multiple nodes
- Origin node failure
- Flapping connectivity

### Comparison with Explicit Node Tracking

An alternative approach tracks the exact set of nodes that have received records and tombstones, rather than using probabilistic estimation. This explicit tracking provides perfect accuracy but at significant cost in larger networks:

| Network Size | Explicit List | HLL | Winner |
|--------------|---------------|-----|--------|
| < 100 nodes | < 2 KB | ~2 KB | Explicit (exact counts) |
| 100 nodes | ~2 KB | ~2 KB | Equal |
| 1,000 nodes | ~20 KB | ~2 KB | HLL (10x smaller) |
| 10,000 nodes | ~200 KB | ~2 KB | HLL (100x smaller) |

For small networks (< 100 nodes), explicit node tracking is preferable: it provides exact counts, enables deterministic keeper election, and uses comparable or less memory. For large networks, HLL's constant-size data structures provide substantial memory and bandwidth savings.

### Storage Trade-offs

Each HLL-based tombstone requires approximately 2 KB (two HLL structures at precision 10), compared to ~64-100 bytes for traditional simple tombstones that lack propagation tracking. This means the algorithm trades per-tombstone storage overhead for reduced tombstone distribution. The approach is most beneficial when:

- The network has more than ~100 nodes (where HLL outperforms explicit lists)
- Traditional tombstones are large (e.g., containing vector clocks, content hashes, or audit metadata)
- The primary concern is reducing the number of nodes participating in tombstone maintenance
- Network partitions and offline nodes are common failure modes
- Bandwidth is constrained (HLL messages are constant-size regardless of propagation)

For smaller networks, the explicit node list approach (Section 7.1) provides a simpler and more precise alternative with comparable resource usage.

### Future Work

Future work may explore:

- Adaptive HLL precision based on network size
- Hybrid approaches that start with explicit lists and promote to HLL at a threshold
- Integration with vector clocks for stronger consistency guarantees
- Optimization of the keeper convergence rate in partially connected networks

## References

A working simulation implementing this algorithm is available at [simulations/hyperloglog-tombstone/simulation.ts](/simulations/hyperloglog-tombstone/simulation.ts).