under #rust #setsum

setsum: a checksum for unordered sets

This post is motivated by a single problem: I want to efficiently maintain a checksum over an entire key-value store such that, no matter what operations happen, the checksum represents the data the key-value store holds.

Specifically, I want:

  • An end-to-end checksum over unordered sets.
  • Support for set-union and set-difference operations.
  • Insertion does not look like removal.

Concretely, the type of the structure that I want is (in pseudo-rust):

trait Setsum {
    fn insert(&mut self, item: &[u8]);
    fn remove(&mut self, item: &[u8]);
    fn digest(&self) -> [u8; SETSUM_BYTES];
    fn hexdigest(&self) -> String;
}

This interface would allow insertion and removal of elements in any order. What’s not captured in this type is that the digest should be the same for the same set of items, no matter how many times an item is added and subsequently removed. For example, the following sets are equivalent:

let mut lhs = Setsum::default();
lhs.insert(&[b'A']);
lhs.insert(&[b'A']);
lhs.insert(&[b'A']);
lhs.insert(&[b'B']);
lhs.insert(&[b'C']);
lhs.remove(&[b'A']);
lhs.remove(&[b'A']);

let mut rhs = Setsum::default();
rhs.insert(&[b'C']);
rhs.insert(&[b'B']);
rhs.insert(&[b'A']);

assert_eq!(lhs.hexdigest(), rhs.hexdigest());

Because the setsum is order-agnostic, it becomes possible to divide-and-conquer the data set.

let mut lhs = Setsum::default();
lhs.insert(&[b'A']);
lhs.insert(&[b'B']);
lhs.insert(&[b'C']);

let mut rhs = Setsum::default();
rhs.insert(&[b'X']);
rhs.insert(&[b'Y']);
rhs.insert(&[b'Z']);

let both = lhs + rhs;

Finally, it’s possible to represent “patches” to a set that remove some elements and add others. When applied to an existing setsum, a patch will produce the correct outcome.

let mut base = Setsum::default();
base.insert(&[b'1']);
base.insert(&[b'2']);
base.insert(&[b'3']);

let mut patch = Setsum::default();
patch.remove(&[b'1']);
patch.insert(&[b'4']);

let result = base + patch;
// result is the same as setsum(2, 3, 4)

I’ll leave it as an exercise to the reader and a future blog post to see how exactly this works for an LSM-tree. If you’re curious, poke at lsmtk.