Compare commits
2 commits
abd9e6a4bb
...
c807832724
| Author | SHA1 | Date | |
|---|---|---|---|
| c807832724 | |||
| 4f2b965bbd |
2 changed files with 649 additions and 1420 deletions
|
|
@ -2,308 +2,121 @@
|
|||
|
||||
## Abstract
|
||||
|
||||
When synchronizing records in a distributed network, deletion presents a fundamental challenge. If nodes simply delete their local copies, other nodes may resynchronize the original data, reverting the deletion. This occurs due to non-simultaneous events between nodes or nodes temporarily disconnecting and reconnecting with outdated state. The traditional solution creates "tombstone" records that persist after deletion to prevent resurrection of deleted data.
|
||||
When synchronizing records in a distributed network, deletion presents a fundamental challenge. Nodes must maintain "tombstone" records to prevent deleted data from being resurrected by offline nodes. This paper presents a **HyperLogLog-based approach** to tombstone garbage collection that uses probabilistic cardinality estimation to detect when tombstones have reached sufficient distribution.
|
||||
|
||||
While effective, this approach requires every node to indefinitely maintain an ever-growing collection of tombstone records. Typically, after an arbitrarily large time period, tombstones are assumed safe to clear since no rogue nodes should retain the original data.
|
||||
|
||||
This paper presents a methodology using the HyperLogLog algorithm to estimate how many nodes have received a record, comparing this estimate against the count of nodes that have received the corresponding tombstone. This enables pruning tombstones across the network to a minimal set of "keeper" nodes (typically 10-25% of participating nodes), reducing the distributed maintenance burden while maintaining safety guarantees.
|
||||
We compare this approach against traditional methods—time-based garbage collection and causal stability detection—analyzing trade-offs in memory, coordination requirements, and failure tolerance.
|
||||
|
||||
## 1. Introduction
|
||||
|
||||
Distributed systems face an inherent tension between data consistency and storage efficiency when handling deletions. Traditional tombstone-based approaches guarantee correctness but impose unbounded storage growth. Several approaches have been proposed to address tombstone accumulation:
|
||||
|
||||
**Time-based Garbage Collection**: The simplest approach sets a fixed time-to-live (TTL) for tombstones, after which they are automatically deleted[^2]. While storage-efficient, this risks data resurrection if stale nodes reconnect after the GC window. Systems like Apache Cassandra use this approach with configurable `gc_grace_seconds`[^3].
|
||||
**Time-based Garbage Collection**: Sets a fixed time-to-live (TTL) for tombstones, after which they are automatically deleted[^2]. While storage-efficient, this risks data resurrection if offline nodes reconnect after the GC window.
|
||||
|
||||
**CRDT Tombstone Pruning**: Conflict-free Replicated Data Types (CRDTs) like OR-Sets accumulate tombstones proportional to the number of unique deleters[^4]. Various pruning strategies have been proposed, including causal stability detection[^5] and garbage collection through consensus[^6], but these typically require additional coordination or strong assumptions about network connectivity.
|
||||
**Causal Stability Detection**: Prunes tombstones when the system can prove all nodes have observed the deletion[^5]. Implementations vary from vector clocks (tracking operation ordering) to explicit node ID sets (tracking membership). This adds metadata overhead but provides strong guarantees when network conditions allow reliable tracking.
|
||||
|
||||
This paper introduces a novel probabilistic approach using HyperLogLog (HLL) cardinality estimation[^1] that complements these existing techniques. Rather than replacing tombstones entirely, it minimizes the number of nodes that must retain them typically reducing keeper nodes to 10-25% of the network while maintaining safety guarantees against data resurrection.
|
||||
**Consensus-based Garbage Collection**: Uses coordination protocols to agree on when tombstones can be safely deleted[^6]. Provides strong guarantees but requires synchronization, which may be impractical in partition-prone or high-latency networks.
|
||||
|
||||
[^1]: Flajolet, P., Fusy, <20>., Gandouet, O., & Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." *Discrete Mathematics and Theoretical Computer Science*, AH, 137-156. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
|
||||
[^2]: Ladin, R., Liskov, B., Shrira, L., & Ghemawat, S. (1992). "Providing high availability using lazy replication." *ACM Transactions on Computer Systems*, 10(4), 360-391. https://doi.org/10.1145/138873.138877
|
||||
[^3]: Apache Cassandra Documentation. "Configuring compaction: gc_grace_seconds." https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/index.html
|
||||
[^4]: Shapiro, M., Pregui<75>a, N., Baquero, C., & Zawirski, M. (2011). "A comprehensive study of Convergent and Commutative Replicated Data Types." *INRIA Research Report RR-7506*. https://hal.inria.fr/inria-00555588
|
||||
[^5]: Baquero, C., Almeida, P. S., & Shoker, A. (2017). "Pure Operation-Based Replicated Data Types." *arXiv:1710.04469*. https://arxiv.org/abs/1710.04469
|
||||
[^6]: Bauwens, J., & De Meuter, W. (2020). "Memory Efficient CRDTs in Dynamic Environments." *Proceedings of the 7th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '20)*. https://doi.org/10.1145/3380787.3393682
|
||||
This paper introduces a **HyperLogLog-based approach**[^1] that approximates causal stability detection using probabilistic cardinality estimation. Instead of tracking exact node sets or vector clocks, it uses constant-size HyperLogLog structures to estimate propagation. This trades exactness for dramatic reductions in memory and bandwidth at scale.
|
||||
|
||||
### 1.1 Core Concept
|
||||
[^1]: Flajolet, P., et al. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." *Discrete Mathematics and Theoretical Computer Science*. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
|
||||
[^2]: Ladin, R., et al. (1992). "Providing high availability using lazy replication." *ACM TOCS*, 10(4). https://doi.org/10.1145/138873.138877
|
||||
[^5]: Baquero, C., et al. (2017). "Pure Operation-Based Replicated Data Types." *arXiv:1710.04469*. https://arxiv.org/abs/1710.04469
|
||||
[^6]: Bauwens, J., & De Meuter, W. (2020). "Memory Efficient CRDTs in Dynamic Environments." *PaPoC '20*. https://doi.org/10.1145/3380787.3393682
|
||||
|
||||
The algorithm operates in three phases:
|
||||
## 2. Core Algorithm
|
||||
|
||||
### 2.1 How It Works
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant A as Node A
|
||||
participant B as Node B
|
||||
participant B as Node B (offline)
|
||||
participant C as Node C
|
||||
|
||||
Note over A,C: Phase 1: Record Propagation
|
||||
A->>B: record + recordHLL
|
||||
B->>A: update recordHLL estimate
|
||||
B->>C: record + recordHLL
|
||||
A->>C: record + recordHLL
|
||||
C->>A: update recordHLL
|
||||
Note over B: B receives record before going offline
|
||||
|
||||
Note over A,C: Phase 2: Tombstone Propagation
|
||||
A->>A: Create tombstone with recordHLL and delete record
|
||||
C->>B: update recordHLL estimate
|
||||
A->>B: tombstone + tombstoneHLL + recordHLL
|
||||
B->>B: tombstone updated with new recordHLL and delete record
|
||||
B->>C: tombstone + tombstoneHLL + recordHLL
|
||||
A->>A: Create tombstone, delete record
|
||||
A->>C: tombstone + tombstoneHLL + recordHLL
|
||||
C->>C: Delete record, update HLLs
|
||||
|
||||
Note over A,C: Phase 3: Keeper Election and tombstone garbage collection
|
||||
C->>C: tombstoneCount >= recordCount, become keeper and deletes record
|
||||
C->>B: updates with node tombstone count estimate
|
||||
B->>B: sees higher estimate, step down and garbage collects its own tombstone record
|
||||
B->>A: update connected node with tombstoneHLL
|
||||
A->>A: garbage collects its own tombstone record
|
||||
Note over A,C: Phase 3: Keeper Election
|
||||
Note over A,C: estimate(tombstoneHLL) >= estimate(recordHLL)
|
||||
Note over A,C: Nodes elect minimal keepers
|
||||
|
||||
Note over A,C: Phase 4: B Reconnects
|
||||
B->>B: Comes back online
|
||||
C->>B: tombstone propagates
|
||||
B->>B: Deletes stale record
|
||||
```
|
||||
|
||||
**Phase 1**: Records propagate through the network via gossip, with each node adding itself to the record's HLL. Nodes then talk between themselves to slowly turn local estimates for the records count into global ones.
|
||||
**Phase 1**: Records propagate through gossip. Each node adds itself to the record's HyperLogLog.
|
||||
|
||||
**Phase 2**: When deletion occurs, the deleting node creates a tombstone containing a copy of the record's HLL as the target count. The tombstone propagates similarly, with nodes adding themselves to the tombstone's HLL. During propagation, the target recordHLL is updated to the highest estimate encountered.
|
||||
**Phase 2**: When deletion occurs, the deleting node creates a tombstone containing a copy of the record's HLL as the "target count." The tombstone propagates to online nodes.
|
||||
|
||||
**Phase 3**: When a node detects that `tombstoneCount >= recordCount`, it becomes a "keeper" responsible for continued propagation. As keepers communicate, those with lower estimates step down and garbage collect, converging toward a minimal keeper set.
|
||||
**Phase 3**: When a node's tombstone HLL estimate reaches or exceeds the target, it may become a "keeper." Keepers step down when they encounter another keeper with a higher estimate (using node ID as tie-breaker).
|
||||
|
||||
## 2. Data Model
|
||||
**Phase 4**: Offline nodes receive the tombstone upon reconnection and delete their stale records.
|
||||
|
||||
Records and tombstones are maintained as separate entities with distinct tracking mechanisms:
|
||||
### 2.2 Data Model
|
||||
|
||||
```ts
|
||||
interface DataRecord<Data> {
|
||||
readonly id: string;
|
||||
readonly data: Data;
|
||||
readonly recordHLL: HyperLogLog; // Tracks nodes that have received this record
|
||||
id: string;
|
||||
data: Data;
|
||||
recordHLL: HyperLogLog; // Tracks nodes that received record
|
||||
}
|
||||
|
||||
interface Tombstone {
|
||||
readonly id: string;
|
||||
readonly recordHLL: HyperLogLog; // Target count: highest observed record distribution
|
||||
readonly tombstoneHLL: HyperLogLog; // Tracks nodes that have received the tombstone
|
||||
id: string;
|
||||
recordHLL: HyperLogLog; // Target: estimated record distribution
|
||||
tombstoneHLL: HyperLogLog; // Progress: estimated tombstone distribution
|
||||
}
|
||||
```
|
||||
|
||||
## 3. Algorithm
|
||||
|
||||
### 3.1 Record Creation and Distribution
|
||||
|
||||
When a node creates or receives a record, it adds itself to the record's HLL:
|
||||
### 2.3 Keeper Election
|
||||
|
||||
```ts
|
||||
const createRecord = <Data>(id: string, data: Data, nodeId: string): DataRecord<Data> => ({
|
||||
id,
|
||||
data,
|
||||
recordHLL: hllAdd(createHLL(), nodeId),
|
||||
});
|
||||
|
||||
const receiveRecord = <Data>(
|
||||
node: NodeState<Data>,
|
||||
incoming: DataRecord<Data>
|
||||
): NodeState<Data> => {
|
||||
// Reject records that have already been deleted
|
||||
if (node.tombstones.has(incoming.id)) {
|
||||
return node;
|
||||
}
|
||||
|
||||
const existing = node.records.get(incoming.id);
|
||||
const updatedRecord: DataRecord<Data> = existing
|
||||
? { ...existing, recordHLL: hllAdd(hllMerge(existing.recordHLL, incoming.recordHLL), node.id) }
|
||||
: { ...incoming, recordHLL: hllAdd(hllClone(incoming.recordHLL), node.id) };
|
||||
|
||||
const newRecords = new Map(node.records);
|
||||
newRecords.set(incoming.id, updatedRecord);
|
||||
return { ...node, records: newRecords };
|
||||
};
|
||||
```
|
||||
|
||||
### 3.2 Tombstone Creation
|
||||
|
||||
When deleting a record, a node creates a tombstone containing a copy of the record's HLL as the initial target count:
|
||||
|
||||
```ts
|
||||
const createTombstone = <Data>(record: DataRecord<Data>, nodeId: string): Tombstone => ({
|
||||
id: record.id,
|
||||
recordHLL: hllClone(record.recordHLL),
|
||||
tombstoneHLL: hllAdd(createHLL(), nodeId),
|
||||
});
|
||||
```
|
||||
|
||||
### 3.3 Garbage Collection Status Check
|
||||
|
||||
The core decision logic determines whether a node should become a keeper, step down, or continue as-is:
|
||||
|
||||
```ts
|
||||
const checkGCStatus = (
|
||||
tombstone: Tombstone,
|
||||
incomingTombstoneEstimate: number | null,
|
||||
myTombstoneEstimateBeforeMerge: number,
|
||||
const shouldStepDown = (
|
||||
myEstimate: number, // My tombstone HLL estimate
|
||||
theirEstimate: number, // Incoming tombstone HLL estimate
|
||||
targetEstimate: number, // Record HLL estimate (threshold)
|
||||
myNodeId: string,
|
||||
senderNodeId: string | null
|
||||
): { shouldGC: boolean; stepDownAsKeeper: boolean } => {
|
||||
const targetCount = hllEstimate(tombstone.recordHLL);
|
||||
|
||||
const isKeeper = myTombstoneEstimateBeforeMerge >= targetCount;
|
||||
|
||||
if (isKeeper) {
|
||||
// Keeper step-down logic:
|
||||
// If incoming tombstone has reached the target count, compare estimates.
|
||||
// If incoming estimate >= my estimate before merge, step down.
|
||||
// Use node ID as tie-breaker: higher node ID steps down when estimates are equal.
|
||||
if (incomingTombstoneEstimate !== null && incomingTombstoneEstimate >= targetCount) {
|
||||
if (myTombstoneEstimateBeforeMerge < incomingTombstoneEstimate) {
|
||||
return { shouldGC: true, stepDownAsKeeper: true };
|
||||
}
|
||||
// Tie-breaker: if estimates are equal, the lexicographically higher node ID steps down
|
||||
if (myTombstoneEstimateBeforeMerge === incomingTombstoneEstimate &&
|
||||
senderNodeId !== null && myNodeId > senderNodeId) {
|
||||
return { shouldGC: true, stepDownAsKeeper: true };
|
||||
}
|
||||
}
|
||||
return { shouldGC: false, stepDownAsKeeper: false };
|
||||
}
|
||||
|
||||
// Not yet a keeper - will become one if tombstone count reaches target after merge
|
||||
return { shouldGC: false, stepDownAsKeeper: false };
|
||||
theirNodeId: string
|
||||
): boolean => {
|
||||
const iAmKeeper = myEstimate >= targetEstimate;
|
||||
const theyAreKeeper = theirEstimate >= targetEstimate;
|
||||
|
||||
if (!iAmKeeper || !theyAreKeeper) return false;
|
||||
|
||||
// Step down if they have higher estimate
|
||||
if (theirEstimate > myEstimate) return true;
|
||||
|
||||
// Tie-breaker: higher node ID steps down
|
||||
if (theirEstimate === myEstimate && myNodeId > theirNodeId) return true;
|
||||
|
||||
return false;
|
||||
};
|
||||
```
|
||||
|
||||
### 3.4 Tombstone Reception and Processing
|
||||
## 3. Design Rationale
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
A[Receive tombstone deletion message] --> B{Do I have<br/>this record?}
|
||||
B -->|No| C[Ignore: record not found]
|
||||
B -->|Yes| D[Merge HLLs and select<br/>highest record estimate]
|
||||
D --> E{Am I already a keeper?<br/>my tombstone count >= target}
|
||||
E -->|Yes| F{Is incoming tombstone<br/>count higher than mine?}
|
||||
F -->|Yes| G[Step down as keeper:<br/>delete tombstone]
|
||||
F -->|No| H{Same count but<br/>sender has lower node ID?}
|
||||
H -->|Yes| G
|
||||
H -->|No| I[Remain keeper:<br/>update tombstone]
|
||||
E -->|No| J{Does my tombstone<br/>count reach target?}
|
||||
J -->|Yes| K[Become keeper:<br/>store tombstone]
|
||||
J -->|No| L[Store tombstone<br/>but not keeper yet]
|
||||
G --> M[Forward tombstone to peers]
|
||||
I --> M
|
||||
K --> M
|
||||
L --> M
|
||||
```
|
||||
### 3.1 Why Propagate the Record HLL with Tombstones?
|
||||
|
||||
The complete tombstone reception handler:
|
||||
Without a shared target, each node would compare against its own local record HLL, leading to premature garbage collection. By propagating the record HLL with the tombstone and always keeping the highest estimate encountered, all nodes converge on a safe target.
|
||||
|
||||
```ts
|
||||
const receiveTombstone = <Data>(
|
||||
node: NodeState<Data>,
|
||||
incoming: Tombstone,
|
||||
senderNodeId: string
|
||||
): NodeState<Data> => {
|
||||
// Don't accept tombstones for unknown records
|
||||
const record = node.records.get(incoming.id);
|
||||
if (!record) {
|
||||
return node;
|
||||
}
|
||||
|
||||
const existing = node.tombstones.get(incoming.id);
|
||||
|
||||
// Merge tombstone HLLs and add self
|
||||
const mergedTombstoneHLL = existing
|
||||
? hllAdd(hllMerge(existing.tombstoneHLL, incoming.tombstoneHLL), node.id)
|
||||
: hllAdd(hllClone(incoming.tombstoneHLL), node.id);
|
||||
|
||||
// Select the best (highest estimate) record HLL as target count
|
||||
// This ensures we use the most complete view of record distribution
|
||||
let bestRecordHLL = incoming.recordHLL;
|
||||
if (existing?.recordHLL) {
|
||||
bestRecordHLL = hllEstimate(existing.recordHLL) > hllEstimate(bestRecordHLL)
|
||||
? existing.recordHLL
|
||||
: bestRecordHLL;
|
||||
}
|
||||
if (hllEstimate(record.recordHLL) > hllEstimate(bestRecordHLL)) {
|
||||
bestRecordHLL = hllClone(record.recordHLL);
|
||||
}
|
||||
|
||||
const updatedTombstone: Tombstone = {
|
||||
id: incoming.id,
|
||||
tombstoneHLL: mergedTombstoneHLL,
|
||||
recordHLL: bestRecordHLL,
|
||||
};
|
||||
|
||||
const myEstimateBeforeMerge = existing ? hllEstimate(existing.tombstoneHLL) : 0;
|
||||
|
||||
const gcStatus = checkGCStatus(
|
||||
updatedTombstone,
|
||||
hllEstimate(incoming.tombstoneHLL),
|
||||
myEstimateBeforeMerge,
|
||||
node.id,
|
||||
senderNodeId
|
||||
);
|
||||
|
||||
// Always delete the record when we have a tombstone
|
||||
const newRecords = new Map(node.records);
|
||||
newRecords.delete(incoming.id);
|
||||
|
||||
if (gcStatus.stepDownAsKeeper) {
|
||||
// Step down: delete both record and tombstone
|
||||
const newTombstones = new Map(node.tombstones);
|
||||
newTombstones.delete(incoming.id);
|
||||
return { ...node, records: newRecords, tombstones: newTombstones };
|
||||
}
|
||||
|
||||
const newTombstones = new Map(node.tombstones);
|
||||
newTombstones.set(incoming.id, updatedTombstone);
|
||||
return { ...node, records: newRecords, tombstones: newTombstones };
|
||||
};
|
||||
```
|
||||
|
||||
### 3.5 Cascading Step-Down via Forwarding
|
||||
|
||||
When a keeper steps down, it immediately forwards the tombstone to all connected peers, creating a cascade effect that rapidly eliminates redundant keepers:
|
||||
|
||||
```ts
|
||||
const forwardTombstoneToAllPeers = <Data>(
|
||||
network: NetworkState<Data>,
|
||||
forwardingNodeId: string,
|
||||
tombstone: Tombstone,
|
||||
excludePeerId?: string
|
||||
): NetworkState<Data> => {
|
||||
const forwardingNode = network.nodes.get(forwardingNodeId);
|
||||
if (!forwardingNode) return network;
|
||||
|
||||
let newNodes = new Map(network.nodes);
|
||||
|
||||
for (const peerId of forwardingNode.peerIds) {
|
||||
if (peerId === excludePeerId) continue;
|
||||
|
||||
const peer = newNodes.get(peerId);
|
||||
if (!peer || !peer.records.has(tombstone.id)) continue;
|
||||
|
||||
const updatedPeer = receiveTombstone(peer, tombstone, forwardingNodeId);
|
||||
newNodes.set(peerId, updatedPeer);
|
||||
|
||||
// If this peer also stepped down, recursively forward
|
||||
if (!updatedPeer.tombstones.has(tombstone.id) && peer.tombstones.has(tombstone.id)) {
|
||||
const result = forwardTombstoneToAllPeers({ nodes: newNodes }, peerId, tombstone, forwardingNodeId);
|
||||
newNodes = new Map(result.nodes);
|
||||
}
|
||||
}
|
||||
|
||||
return { nodes: newNodes };
|
||||
};
|
||||
```
|
||||
|
||||
## 4. Design Rationale
|
||||
|
||||
### 4.1 Why Propagate the Record HLL with Tombstones?
|
||||
|
||||
Without a shared target count, each node would compare against its own local recordHLL estimate, leading to premature garbage collection. By propagating the recordHLL with the tombstone and always keeping the highest estimate encountered, all nodes converge on a safe target count. During propagation, if a node has a more complete view of record distribution (higher HLL estimate), that becomes the new target for all subsequent nodes.
|
||||
|
||||
### 4.2 Why Dynamic Keeper Election?
|
||||
### 3.2 Why Dynamic Keeper Election?
|
||||
|
||||
A fixed originator-as-keeper design creates a single point of failure. If the originator goes offline, tombstone propagation halts and records may resurrect when stale nodes reconnect.
|
||||
|
||||
Dynamic election allows any node to become a keeper when it detects `tombstoneCount >= recordCount`. This ensures tombstone propagation continues regardless of which specific node initiated the deletion.
|
||||
Dynamic election allows any node to become a keeper when it detects `tombstoneEstimate >= recordEstimate`. This ensures tombstone propagation continues regardless of which specific node initiated the deletion.
|
||||
|
||||
### 4.3 Why Keeper Step-Down?
|
||||
### 3.3 Why Keeper Step-Down?
|
||||
|
||||
Without step-down logic, every node eventually becomes a keeper (since they all eventually observe the threshold condition). This defeats the purpose of garbage collection.
|
||||
|
||||
Step-down creates convergence toward a minimal keeper set:
|
||||
Without step-down logic, every node eventually becomes a keeper. Step-down creates convergence toward a minimal keeper set:
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
|
|
@ -311,505 +124,134 @@ subgraph Keeper Convergence Over Time
|
|||
T0["t=0: 0 keepers"]
|
||||
T1["t=1: 5 keepers<br/>(first nodes to detect threshold)"]
|
||||
T2["t=2: 3 keepers<br/>(2 stepped down after seeing higher estimates)"]
|
||||
T3["t=3: 1-2 keepers<br/>(most informed nodes remain)"]
|
||||
T3["t=3: 1 keeper<br/>(converged to single keeper)"]
|
||||
end
|
||||
T0 --> T1 --> T2 --> T3
|
||||
```
|
||||
|
||||
### 4.4 Why Node ID Tie-Breaker?
|
||||
### 3.4 Why Node ID Tie-Breaker?
|
||||
|
||||
When HLL estimates converge (all nodes have similar tombstoneHLL values due to full propagation), no node can have a strictly higher estimate. Without a tie-breaker, keepers with equal estimates would never step down.
|
||||
When HLL estimates converge (all nodes have similar values), no node can have a strictly higher estimate. The lexicographic node ID comparison ensures deterministic convergence to a single keeper.
|
||||
|
||||
The lexicographic node ID comparison ensures deterministic convergence: when two keepers with equal estimates communicate, the one with the higher node ID steps down. This guarantees eventual convergence to a single keeper per connected component.
|
||||
### 3.5 Why Forward on Step-Down?
|
||||
|
||||
### 4.5 Why Forward on Step-Down?
|
||||
With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all reachable nodes, creating a cascade effect that rapidly eliminates redundant keepers.
|
||||
|
||||
Without forwarding, keepers only step down when randomly selected for gossip - a slow process. With aggressive forwarding, a stepping-down keeper immediately propagates the "winning" tombstone to all neighbors, creating a cascade effect that rapidly eliminates redundant keepers.
|
||||
## 4. Comparison with Alternative Approaches
|
||||
|
||||
## 5. Evaluation
|
||||
### 4.1 Time-based GC vs HyperLogLog
|
||||
|
||||
### 5.1 Experimental Setup
|
||||
| Aspect | Time-based GC | HyperLogLog |
|
||||
|--------|---------------|-------------|
|
||||
| Safety | Unsafe if nodes offline > TTL | Safe (waits for propagation) |
|
||||
| Configuration | Requires tuning TTL | Self-adapting |
|
||||
| Metadata overhead | None | ~2 KB per tombstone |
|
||||
| Coordination | None | None |
|
||||
|
||||
We implemented a discrete-event simulation to evaluate the algorithm under various network conditions. Each test scenario was executed 50 times to obtain statistically reliable averages. The simulation models:
|
||||
**When to use Time-based**: Networks with predictable uptime and short offline periods.
|
||||
|
||||
- **Gossip protocol**: Each round, every node with a record or tombstone randomly selects one peer and exchanges state
|
||||
- **HLL precision**: 10 bits (1024 registers, ~1KB per HLL)
|
||||
- **Convergence criteria**: Records deleted, followed by 100 additional rounds for keeper convergence
|
||||
- **Trials**: 50 independent runs per scenario, with results averaged
|
||||
**When to use HyperLogLog**: Networks with unpredictable offline durations or partition-prone environments.
|
||||
|
||||
### 5.2 Test Scenarios
|
||||
### 4.2 Causal Stability Detection vs HyperLogLog
|
||||
|
||||
#### 5.2.1 Single Node Deletion
|
||||
Traditional causal stability uses vector clocks or explicit node sets to track exactly which nodes have observed operations.
|
||||
|
||||
**Scenario**: A single node creates a record, propagates it through gossip, then initiates deletion.
|
||||
| Aspect | Causal Stability | HyperLogLog |
|
||||
|--------|------------------|-------------|
|
||||
| Tracking | Exact (vector clocks/sets) | Approximate (~3% error) |
|
||||
| Memory per tombstone | O(n) – grows with nodes | O(1) – constant ~2 KB |
|
||||
| Bandwidth per message | O(n) – grows with nodes | O(1) – constant ~2 KB |
|
||||
| Debugging | Can enumerate missing nodes | Only aggregate estimates |
|
||||
| Implementation | Simple data structures | Requires HLL library |
|
||||
|
||||
**Memory comparison:**
|
||||
|
||||
| Network Size | Causal Stability | HyperLogLog | HLL Advantage |
|
||||
|--------------|------------------|-------------|---------------|
|
||||
| 10 nodes | ~400 bytes | ~2 KB | 0.2x (worse) |
|
||||
| 50 nodes | ~2 KB | ~2 KB | 1x (equal) |
|
||||
| 100 nodes | ~4 KB | ~2 KB | 2x better |
|
||||
| 1,000 nodes | ~40 KB | ~2 KB | 20x better |
|
||||
| 10,000 nodes | ~400 KB | ~2 KB | 200x better |
|
||||
|
||||
**When to use Causal Stability**: Networks < 50 nodes, or when exact tracking is required for auditing.
|
||||
|
||||
**When to use HyperLogLog**: Networks > 100 nodes, bandwidth-constrained environments, or unpredictably growing networks.
|
||||
|
||||
### 4.3 Consensus-based GC vs HyperLogLog
|
||||
|
||||
| Aspect | Consensus-based | HyperLogLog |
|
||||
|--------|-----------------|-------------|
|
||||
| Coordination | Required (Raft/Paxos) | None (local decisions) |
|
||||
| Partition tolerance | Blocks during partitions | Continues independently |
|
||||
| Guarantees | Strong consistency | Eventual consistency |
|
||||
| Latency | Round-trip consensus | Local estimation |
|
||||
|
||||
**When to use Consensus**: Systems already using consensus (e.g., Raft-replicated databases).
|
||||
|
||||
**When to use HyperLogLog**: Partition-tolerant systems, high-latency networks, or systems without existing consensus infrastructure.
|
||||
|
||||
### 4.4 Summary
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph Network Topology 15 nodes 40 percent connectivity
|
||||
N0((node-0<br/>originator))
|
||||
N1((node-1))
|
||||
N2((node-2))
|
||||
N3((node-3))
|
||||
N4((node-4))
|
||||
N5((node-5))
|
||||
N6((node-6))
|
||||
N7((node-7))
|
||||
N0 --- N1
|
||||
N0 --- N3
|
||||
N1 --- N2
|
||||
N1 --- N4
|
||||
N2 --- N5
|
||||
N3 --- N4
|
||||
N3 --- N6
|
||||
N4 --- N5
|
||||
N5 --- N7
|
||||
N6 --- N7
|
||||
end
|
||||
A[Choose GC Approach] --> B{Network Size?}
|
||||
B -->|< 50 nodes| C[Causal Stability<br/>Exact tracking]
|
||||
B -->|> 100 nodes| D[HyperLogLog<br/>Probabilistic tracking]
|
||||
B -->|50-100 nodes| E{Priority?}
|
||||
E -->|Exactness| C
|
||||
E -->|Scalability| D
|
||||
|
||||
A --> F{Already have consensus?}
|
||||
F -->|Yes| G[Consensus-based GC<br/>Strong guarantees]
|
||||
F -->|No| H{Predictable uptime?}
|
||||
H -->|Yes, short outages| I[Time-based GC<br/>Simple TTL]
|
||||
H -->|No, long/variable outages| D
|
||||
```
|
||||
|
||||
**Protocol**:
|
||||
1. Node-0 creates record and propagates for 20 rounds
|
||||
2. Node-0 creates tombstone and initiates deletion
|
||||
3. Simulation runs until convergence
|
||||
## 5. Simulation Results
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
We implemented the HyperLogLog approach in a discrete-event simulation with 50 trials per scenario across various failure modes (node offline, network partition, concurrent deleters, origin node failure).
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Nodes | 15 per trial (750 total) |
|
||||
| Records deleted | 100% success |
|
||||
| Rounds to delete records | 10 |
|
||||
| Total rounds (including convergence) | 120 |
|
||||
| Final tombstones | 115 (~15.3% of nodes) |
|
||||
### 5.1 Resource Usage
|
||||
|
||||
**Analysis**: Record deletion completes rapidly (10 rounds). Tombstone keeper count converges to approximately 2-3 keepers per trial, demonstrating effective garbage collection while maintaining redundancy.
|
||||
| Network Size | Memory per Tombstone | Bandwidth per Gossip |
|
||||
|--------------|---------------------|----------------------|
|
||||
| 20 nodes | ~2 KB | ~2 KB |
|
||||
| 500 nodes | ~2 KB | ~2 KB |
|
||||
| 10,000 nodes | ~2 KB | ~2 KB |
|
||||
|
||||
#### 5.2.2 Early Tombstone Creation
|
||||
Constant resource usage regardless of network size.
|
||||
|
||||
**Scenario**: Tombstone created before record fully propagates, testing the algorithm's handling of partial record distribution.
|
||||
## 6. Limitations
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant N0 as Node-0
|
||||
participant N1 as Node-1
|
||||
participant N2 as Node-2
|
||||
participant Nx as Nodes 3-19
|
||||
### 6.1 Estimation Error
|
||||
|
||||
Note over N0,Nx: Record only partially propagated
|
||||
N0->>N1: record (round 1)
|
||||
N1->>N2: record (round 2)
|
||||
N2->>N0: record (round 3)
|
||||
HyperLogLog provides ~3% error at precision 10 (1024 registers). This can cause:
|
||||
- **Premature keeper election**: Rare, handled conservatively by erring toward retaining tombstones
|
||||
- **Delayed keeper convergence**: Minor efficiency impact
|
||||
|
||||
Note over N0: Create tombstone after only 3 rounds
|
||||
N0->>N1: tombstone
|
||||
N1->>N2: tombstone
|
||||
Note over Nx: Most nodes never receive record
|
||||
```
|
||||
### 6.2 No Node Enumeration
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
Cannot identify which specific nodes are missing the tombstone. For debugging or auditing, causal stability with explicit tracking is preferable.
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Nodes | 20 per trial (1000 total) |
|
||||
| Records deleted | 100% success |
|
||||
| Rounds to delete records | 10 |
|
||||
| Total rounds | 120 |
|
||||
| Final tombstones | 124 (~12.4% of nodes) |
|
||||
### 6.3 Message Ordering
|
||||
|
||||
**Analysis**: Even with partial record propagation, the algorithm correctly handles deletion. The propagated recordHLL accurately captures the distribution, updating as the tombstone encounters nodes with more complete views. Tombstones converge to nodes that actually received the record.
|
||||
If a tombstone arrives before the record (due to message reordering), the node ignores the tombstone. If the record subsequently arrives, the node accepts it. However, keepers will eventually propagate the tombstone to this node, so the record is eventually deleted—just not immediately.
|
||||
|
||||
#### 5.2.3 Bridged Network (Two Clusters)
|
||||
## 7. Conclusion
|
||||
|
||||
**Scenario**: Two densely-connected clusters joined by a single bridge node, simulating common real-world topologies.
|
||||
HyperLogLog-based tombstone garbage collection provides a scalable alternative to traditional approaches:
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph Cluster A 15 nodes
|
||||
A0((A-0<br/>bridge))
|
||||
A1((A-1))
|
||||
A2((A-2))
|
||||
A3((A-3))
|
||||
A0 --- A1
|
||||
A0 --- A2
|
||||
A1 --- A2
|
||||
A1 --- A3
|
||||
A2 --- A3
|
||||
end
|
||||
|
||||
subgraph Cluster B 15 nodes
|
||||
B0((B-0<br/>bridge))
|
||||
B1((B-1))
|
||||
B2((B-2))
|
||||
B3((B-3))
|
||||
B0 --- B1
|
||||
B0 --- B2
|
||||
B1 --- B2
|
||||
B1 --- B3
|
||||
B2 --- B3
|
||||
end
|
||||
|
||||
A0 ===|single bridge| B0
|
||||
```
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
|
||||
| Metric | Cluster A | Cluster B | Total |
|
||||
|--------|-----------|-----------|-------|
|
||||
| Nodes | 15 per trial (750 total) | 15 per trial (750 total) | 30 per trial (1500 total) |
|
||||
| Records deleted | 100% success | 100% success | 100% success |
|
||||
| Rounds to delete | - | - | 17 |
|
||||
| Final tombstones | 137 (~18.3%) | 92 (~12.3%) | 229 (~15.3%) |
|
||||
|
||||
**Analysis**: The single-bridge topology creates a natural partition point. Each cluster independently elects keepers, with cluster A (containing the originator) retaining slightly more keepers. This provides fault tolerance - if the bridge fails, each cluster retains tombstones independently.
|
||||
|
||||
#### 5.2.4 Concurrent Tombstones
|
||||
|
||||
**Scenario**: Multiple nodes simultaneously initiate deletion of the same record, simulating concurrent delete operations.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant N0 as Node-0
|
||||
participant N5 as Node-5
|
||||
participant N10 as Node-10
|
||||
participant Others as Other Nodes
|
||||
|
||||
Note over N0,Others: Record fully propagated (30 rounds)
|
||||
|
||||
par Concurrent deletion
|
||||
N0->>N0: Create tombstone
|
||||
N5->>N5: Create tombstone
|
||||
N10->>N10: Create tombstone
|
||||
end
|
||||
|
||||
Note over N0,Others: Three tombstones propagate and merge
|
||||
N0->>Others: tombstone (from N0)
|
||||
N5->>Others: tombstone (from N5)
|
||||
N10->>Others: tombstone (from N10)
|
||||
|
||||
Note over N0,Others: HLLs merge, keepers converge
|
||||
```
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Nodes | 20 per trial (1000 total) |
|
||||
| Concurrent deleters | 3 |
|
||||
| Records deleted | 100% success |
|
||||
| Rounds to delete | 10 |
|
||||
| Final tombstones | 131 (~13.1% of nodes) |
|
||||
|
||||
**Analysis**: The algorithm handles concurrent tombstone creation gracefully. Multiple tombstones merge via HLL union operations, and keeper election converges as normal. The keeper percentage is slightly lower than single-deleter baseline (~13% vs ~15%), likely due to faster HLL convergence from multiple sources.
|
||||
|
||||
#### 5.2.5 Network Partition and Heal
|
||||
|
||||
**Scenario**: Network partitions after record propagation, tombstone created in one partition, then network heals.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant CA as Cluster A
|
||||
participant Bridge as Bridge
|
||||
participant CB as Cluster B
|
||||
|
||||
Note over CA,CB: Phase 1: Record propagates to all nodes
|
||||
CA->>Bridge: record
|
||||
Bridge->>CB: record
|
||||
|
||||
Note over CA,CB: Phase 2: Network partitions
|
||||
Bridge--xCB: connection lost
|
||||
|
||||
Note over CA: Cluster A creates tombstone
|
||||
CA->>CA: tombstone propagates within A
|
||||
Note over CB: Cluster B still has record
|
||||
|
||||
Note over CA,CB: Phase 3: Network heals
|
||||
Bridge->>CB: tombstone propagates to B
|
||||
CB->>CB: record deleted, keepers elected
|
||||
```
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
|
||||
| Metric | Cluster A | Cluster B | Total |
|
||||
|--------|-----------|-----------|-------|
|
||||
| Nodes | 10 per trial (500 total) | 10 per trial (500 total) | 20 per trial (1000 total) |
|
||||
| Records deleted | 100% success | 100% success | 100% success |
|
||||
| Rounds to delete | - | - | 16 |
|
||||
| Total rounds (partition + heal) | - | - | 717 |
|
||||
| Final tombstones | 104 (~20.8%) | 52 (~10.4%) | 156 (~15.6%) |
|
||||
|
||||
**Analysis**: The extended total rounds (717) includes the partition period where only Cluster A processes the tombstone. Cluster A retains more keepers (~21%) since it processes the tombstone during partition without cross-cluster communication. Upon healing, Cluster B rapidly receives the tombstone and converges to fewer keepers (~10%). Each cluster maintains independent keepers, providing partition tolerance.
|
||||
#### 5.2.6 Dynamic Topology
|
||||
|
||||
**Scenario**: Network connections randomly change during both tombstone propagation and garbage collection phases, simulating real-world network churn where peer relationships are not static.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant N0 as Node-0
|
||||
participant N1 as Node-1
|
||||
participant N2 as Node-2
|
||||
participant N3 as Node-3
|
||||
|
||||
Note over N0,N3: Initial topology established
|
||||
N0->>N1: connected
|
||||
N1->>N2: connected
|
||||
N2->>N3: connected
|
||||
|
||||
Note over N0,N3: Tombstone propagation begins
|
||||
N0->>N1: tombstone
|
||||
|
||||
Note over N0,N3: Topology change: N1-N2 disconnects, N0-N3 connects
|
||||
N1--xN2: disconnected
|
||||
N0->>N3: new connection
|
||||
|
||||
Note over N0,N3: Propagation continues on new topology
|
||||
N0->>N3: tombstone via new path
|
||||
N3->>N2: tombstone
|
||||
|
||||
Note over N0,N3: Topology continues changing during GC convergence
|
||||
```
|
||||
|
||||
**Protocol**:
|
||||
1. Create 20-node network with 30% initial connectivity
|
||||
2. Propagate record for 10 rounds
|
||||
3. Create tombstone and begin propagation
|
||||
4. Every 5 rounds, randomly add/remove 1-5 connections (continues during GC phase)
|
||||
5. Run until convergence
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Nodes | 20 per trial (1000 total) |
|
||||
| Records deleted | 100% success |
|
||||
| Rounds to delete records | 10 |
|
||||
| Total rounds | 115 |
|
||||
| Final tombstones | 126 (~12.6% of nodes) |
|
||||
|
||||
**Analysis**: Despite continuous topology changes throughout both deletion and garbage collection phases, the algorithm maintains correct behavior. The dynamic nature of connections does not prevent tombstone propagation or keeper convergence. Keeper percentage is actually lower than static networks (~12.6% vs ~15%), suggesting that network dynamism may improve keeper consolidation.
|
||||
|
||||
#### 5.2.7 Node Churn
|
||||
|
||||
**Scenario**: Nodes randomly join and leave the network during both tombstone propagation and garbage collection phases, simulating peer-to-peer network dynamics.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant N0 as Node-0 (stable)
|
||||
participant N5 as Node-5
|
||||
participant Nnew as New Node
|
||||
participant Network as Network
|
||||
|
||||
Note over N0,Network: Record propagated, tombstone created
|
||||
N0->>N5: tombstone
|
||||
|
||||
Note over N0,Network: Node-5 leaves network
|
||||
N5--xNetwork: disconnected & removed
|
||||
|
||||
Note over N0,Network: New node joins
|
||||
Nnew->>Network: joins with 2-4 connections
|
||||
|
||||
Note over N0,Network: Tombstone continues propagating
|
||||
N0->>Nnew: tombstone (new node has no record)
|
||||
Note over Nnew: Ignores tombstone (no matching record)
|
||||
|
||||
Note over N0,Network: Churn continues during GC convergence
|
||||
```
|
||||
|
||||
**Protocol**:
|
||||
1. Create 20-node network with 40% connectivity
|
||||
2. Propagate record for 15 rounds
|
||||
3. Create tombstone and begin propagation
|
||||
4. Every 10 rounds: remove 1-2 random nodes, add 1-2 new nodes (continues during GC phase)
|
||||
5. New nodes connect to 2-4 random existing nodes
|
||||
6. Run until convergence
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Initial nodes | 20 per trial (1000 total) |
|
||||
| Records deleted | 100% success |
|
||||
| Rounds to delete records | 9 |
|
||||
| Total rounds | 114 |
|
||||
| Final tombstones | 84 (~8.4% of nodes) |
|
||||
|
||||
**Analysis**: Node churn actually accelerates deletion (9 rounds vs. typical 10) because departing nodes that held records effectively "delete" them. New nodes that never received the original record correctly ignore tombstones. The keeper percentage (~8.4%) is notably lower than static networks, as some keepers may depart during the GC phase and remaining keepers consolidate more aggressively when the network topology continues to evolve.
|
||||
|
||||
#### 5.2.8 Random Configuration Changes
|
||||
|
||||
**Scenario**: Mixed workload with simultaneous record additions, connection changes, and disconnections during both tombstone propagation and garbage collection phases.
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph "Configuration Changes During Propagation and GC"
|
||||
A[Tombstone Created] --> B{Every 8 rounds}
|
||||
B --> C[30%: Add new unrelated record]
|
||||
B --> D[30%: Add new peer connection]
|
||||
B --> E[40%: Remove peer connection]
|
||||
C --> F[Continue propagation/GC]
|
||||
D --> F
|
||||
E --> F
|
||||
F --> B
|
||||
end
|
||||
```
|
||||
|
||||
**Protocol**:
|
||||
1. Create 20-node network with 40% connectivity
|
||||
2. Propagate primary record for 15 rounds
|
||||
3. Create tombstone for primary record
|
||||
4. Every 8 rounds, apply 1-4 random changes (continues during GC phase):
|
||||
- 30% chance: Add unrelated record to random node
|
||||
- 30% chance: Add new peer connection
|
||||
- 40% chance: Remove existing peer connection
|
||||
5. Run until convergence
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Nodes | 20 per trial (1000 total) |
|
||||
| Records deleted | 100% success |
|
||||
| Rounds to delete records | 9 |
|
||||
| Total rounds | 114 |
|
||||
| Final tombstones | 135 (~13.5% of nodes) |
|
||||
|
||||
**Analysis**: The algorithm remains stable under mixed workload conditions throughout both deletion and garbage collection phases. Unrelated records do not interfere with tombstone propagation. Connection changes create alternative propagation paths. The low keeper percentage (~13.5%) suggests that network dynamism may actually improve keeper convergence by creating more diverse communication patterns.
|
||||
|
||||
#### 5.2.9 Sparse Network
|
||||
|
||||
**Scenario**: Low connectivity (15%) network, testing algorithm behavior under challenging propagation conditions.
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph Sparse Network 25 nodes 15 percent connectivity
|
||||
N0((0)) --- N3((3))
|
||||
N0((0)) --- N5((5))
|
||||
N1((1)) --- N4((4))
|
||||
N1((1)) --- N6((6))
|
||||
N2((2)) --- N6((6))
|
||||
N2((2)) --- N10((10))
|
||||
N3((3)) --- N7((7))
|
||||
N4((4)) --- N8((8))
|
||||
N5((5)) --- N9((9))
|
||||
N6((6)) --- N11((11))
|
||||
N7((7)) --- N12((12))
|
||||
N8((8)) --- N13((13))
|
||||
N9((9)) --- N14((14))
|
||||
N9((9)) --- N15((15))
|
||||
N10((10)) --- N14((14))
|
||||
N11((11)) --- N16((16))
|
||||
N12((12)) --- N17((17))
|
||||
N12((12)) --- N18((18))
|
||||
N13((13)) --- N17((17))
|
||||
N14((14)) --- N19((19))
|
||||
N15((15)) --- N19((19))
|
||||
N15((15)) --- N20((20))
|
||||
N16((16)) --- N20((20))
|
||||
N17((17)) --- N21((21))
|
||||
N18((18)) --- N22((22))
|
||||
N19((19)) --- N23((23))
|
||||
N20((20)) --- N24((24))
|
||||
N21((21)) --- N23((23))
|
||||
N22((22)) --- N24((24))
|
||||
end
|
||||
|
||||
style N0 fill:#f96
|
||||
style N24 fill:#9f9
|
||||
```
|
||||
|
||||
**Results** (averaged over 50 trials):
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Nodes | 25 per trial (1250 total) |
|
||||
| Connectivity | 15% |
|
||||
| Records deleted | 100% success |
|
||||
| Rounds to delete | 12 |
|
||||
| Total rounds | 122 |
|
||||
| Final tombstones | 255 (~20.4% of nodes) |
|
||||
|
||||
**Analysis**: Sparse networks require more rounds for propagation (12 vs. 9-10 for denser networks) and retain more keepers (~20% vs. ~15%). The higher keeper retention provides additional redundancy appropriate for networks where nodes may have limited connectivity.
|
||||
|
||||
### 5.3 Summary of Results
|
||||
|
||||
All results are averaged over 50 independent trials per scenario.
|
||||
|
||||
| Scenario | Nodes | Deletion Rounds | Keeper % | Key Insight |
|
||||
|----------|-------|-----------------|----------|-------------|
|
||||
| Single Node Deletion | 15 | 10 | 15.2% | Baseline performance |
|
||||
| Early Tombstone | 20 | 10 | 12.4% | Handles partial propagation |
|
||||
| Bridged Network | 30 | 17 | 15.3% | Independent keepers per cluster |
|
||||
| Concurrent Tombstones | 20 | 10 | 13.1% | Faster convergence with multiple sources |
|
||||
| Partition and Heal | 20 | 16 | 15.6% | Partition-tolerant |
|
||||
| Dynamic Topology | 20 | 10 | 13.1% | Robust to continuous connection changes |
|
||||
| Node Churn | 20 | 9 | 8.8% | Lowest keeper retention due to departing keepers |
|
||||
| Random Config Changes | 20 | 10 | 13.6% | Stable under continuous mixed workload |
|
||||
| Sparse Network | 25 | 11 | 22.8% | Higher redundancy for limited connectivity |
|
||||
|
||||
**Statistical Observations** (across 450 total trials):
|
||||
- **100% deletion success rate**: All 450 trials successfully deleted records
|
||||
- **Deletion speed**: Mean 10.8 rounds (σ ≈ 2.5), range 9-17 rounds
|
||||
- **Keeper retention**: Mean 14.1% (σ ≈ 4.2%), range 8.8-22.8%
|
||||
- **Dynamic scenarios outperform static**: Network dynamism reduces keeper % by 10-42% relative to baseline
|
||||
|
||||
### 5.4 Key Findings
|
||||
|
||||
Based on 450 total trials across 9 scenarios:
|
||||
|
||||
1. **Reliable deletion**: 100% success rate across all trials. Records are deleted within 9-17 gossip rounds, with most scenarios completing in 10 rounds. Bridged networks require more rounds (17) due to single-bridge bottleneck.
|
||||
|
||||
2. **Effective garbage collection**: Tombstones converge to 8.8-22.8% of nodes as keepers. The median keeper retention is ~13%, representing an 85-90% reduction in tombstone storage distribution compared to full replication.
|
||||
|
||||
3. **Dynamic networks improve convergence**: Counter-intuitively, network dynamism improves keeper consolidation:
|
||||
- Node churn: 8.8% keepers (42% reduction vs baseline)
|
||||
- Dynamic topology: 13.1% keepers (14% reduction vs baseline)
|
||||
- Random config changes: 13.6% keepers (11% reduction vs baseline)
|
||||
|
||||
This occurs because dynamic networks create more diverse communication patterns and departing keepers accelerate consolidation.
|
||||
|
||||
4. **Topology-aware keeper distribution**:
|
||||
- Bridged networks maintain independent keepers per cluster (18.3% in origin cluster vs 12.3% in remote cluster)
|
||||
- Partitioned networks show asymmetric distribution (20.8% in partition with tombstone origin vs 10.4% in healing partition)
|
||||
|
||||
5. **Graceful degradation under adversity**:
|
||||
- Sparse networks (15% connectivity) retain more keepers (22.8%) for appropriate redundancy
|
||||
- Partial propagation scenarios still achieve 12.4% keeper retention
|
||||
|
||||
6. **Concurrent safety**: Multiple simultaneous deleters (3 nodes) do not cause conflicts and achieve 13.1% keeper retention, comparable to single-deleter scenarios.
|
||||
|
||||
## 6. Trade-offs
|
||||
|
||||
| Aspect | Impact |
|
||||
|--------|--------|
|
||||
| **Memory** | ~1KB per tombstone (HLL at precision 10) |
|
||||
| **Bandwidth** | HLLs transmitted with each gossip message (~2KB per tombstone message) |
|
||||
| **Latency** | GC delayed until keeper convergence (~100 rounds after deletion) |
|
||||
| **Consistency** | Eventual - temporary resurrection attempts are blocked but logged |
|
||||
|
||||
## 7. Properties
|
||||
|
||||
The algorithm provides the following guarantees:
|
||||
|
||||
- **Safety**: Tombstones are never prematurely garbage collected. A tombstone is only deleted when the node has received confirmation (via HLL estimates) that the tombstone has propagated to at least as many nodes as received the original record.
|
||||
|
||||
- **Liveness**: Keepers eventually step down, enabling garbage collection. The tie-breaker mechanism ensures convergence even when HLL estimates are identical.
|
||||
|
||||
- **Fault tolerance**: No single point of failure. Multiple keepers provide redundancy, and any keeper can propagate the tombstone.
|
||||
|
||||
- **Convergence**: Keeper count monotonically decreases over time within each connected component.
|
||||
|
||||
## 8. Conclusion
|
||||
|
||||
This paper presented a HyperLogLog-based approach to tombstone garbage collection in distributed systems. By tracking record and tombstone propagation through probabilistic cardinality estimation, the algorithm reduces the number of nodes maintaining tombstones to 10-25% of the network (the "keeper" nodes).
|
||||
|
||||
**Storage Trade-offs**: Each HLL-based tombstone requires approximately 2KB (two HLL structures at precision 10), compared to ~64-100 bytes for traditional simple tombstones. This means the algorithm trades per-tombstone storage overhead for reduced tombstone distribution. The approach is most beneficial when:
|
||||
- Traditional tombstones are large (e.g., containing vector clocks, content hashes, or audit metadata)
|
||||
- The primary concern is reducing the number of nodes participating in tombstone maintenance
|
||||
|
||||
The simulation results, based on 450 trials across 9 scenarios, demonstrate consistent behavior across diverse network topologies and failure scenarios. Records are deleted within 9-17 gossip rounds (mean: 10.8), and tombstones converge to 8.8-22.8% of nodes as keepers (mean: 14.1%). Notably, dynamic network conditions actually improve keeper consolidation rather than hindering it. The algorithm gracefully handles partial propagation, network partitions, concurrent deletions, and continuous topology changes.
|
||||
|
||||
Future work may explore adaptive HLL precision based on network size, integration with vector clocks for stronger consistency guarantees, and optimization of the keeper convergence rate.
|
||||
| Approach | Best For |
|
||||
|----------|----------|
|
||||
| **Time-based GC** | Predictable networks with short outages |
|
||||
| **Causal Stability** | Small networks (< 50 nodes) requiring exact tracking |
|
||||
| **Consensus-based GC** | Systems with existing consensus infrastructure |
|
||||
| **HyperLogLog** | Large networks (> 100 nodes), partition-prone environments |
|
||||
|
||||
The HyperLogLog approach trades exactness (~3% error) for dramatic scalability—constant memory and bandwidth regardless of network size. For networks that may grow unpredictably, this provides a robust foundation without configuration changes.
|
||||
|
||||
## References
|
||||
|
||||
A working simulation implementing this algorithm is available at [simulations/hyperloglog-tombstone/simulation.ts](/simulations/hyperloglog-tombstone/simulation.ts).
|
||||
A working simulation is available at [simulations/hyperloglog-tombstone/simulation.ts](/simulations/hyperloglog-tombstone/simulation.ts).
|
||||
File diff suppressed because it is too large
Load diff
Loading…
Add table
Add a link
Reference in a new issue