!pip install --upgrade pip --quiet
!pip3 install nbinteract --quiet 
!pip3 install scipy --quiet
!pip3 install scikit-learn --quiet
from demo_utils import *
import nbinteract as nbi
import_all()

Learned Count-Min Structures

By Markie Wagner, Ankush Swarnakar, Hashem Elezabi, and Elena Berman

It’s easy to walk away from this paper and think: “Huh, ML can replace anything.” But think more critically about the fundamental ideas that the authors introduced. While ML is ubiquitous throughout their work, two tangible takeaways are:

1. Use data probability distributions to optimize data structures.
2. Use recursive models and auxiliary structures to optimize data structures.

Let’s explore these ideas a bit further. The original paper focuses mainly on indexing structures, like B-Trees, Hash Maps, and Bloom Filters. Since we plan to investigate the effects of probability distributions on these structures, let’s take a deep dive into a probabilistic structure instead: the Count-Min Sketch.
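Before diving in, here is a minimal Count-Min Sketch in plain Python so the later optimizations have something concrete to point at. This is a simplified sketch, not the implementation used by the demo in demo_utils; the width and depth come from the standard (eps, delta) error bounds, and each row gets its own hash seed.

import math
import random

class CountMinSketch:
    """Minimal Count-Min Sketch. estimate() never underestimates, and with
    probability at least 1 - delta it overestimates by at most eps * N,
    where N is the total count inserted."""
    def __init__(self, eps=0.01, delta=0.01):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.seeds = [random.randrange(2**31) for _ in range(self.depth)]

    def _index(self, item, row):
        # One independently seeded hash function per row.
        return hash((self.seeds[row], item)) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Every row can only overcount due to collisions, so take the minimum.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

A query like cms.estimate(x) costs only depth table lookups, and the whole structure is just width × depth counters, which is what makes it attractive for counting in streams.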

Begin Demo

Count-Min Sketch estimates across distributions

The plots below show a scatterplot of errors for individual data points (left), the distribution of the data (center), and the distribution of errors (right). Again, we can see that the Count-Min Sketch gives us decent estimates; none of the errors exceed the threshold, which is excellent! Interestingly, though, switching to a normal distribution almost halved our average error, from 20.154 to 11.4482, while our max error rose by 12.

opts1 = {
    'title': "Scatterplot of Errors",
    'ylabel': "Error",
    'xlabel': "Data Point",
    'ylim': (0, 500)}

nbi.scatter(generate_hist_sample, get_y_errors,
            n=(1, 2000),
            eps=(0.01, 1, 0.01),
            delta=(0.01, 1, 0.01),
            distribution={'normal': 'normal', 'zipf': 'zipf', 'uniform': 'uniform', 'lognorm': 'lognorm'},
            options=opts1)
opts2 = {
    'title': "Distribution of Data",
    'ylabel': "Count",
    'xlabel': "Data Point",}

nbi.hist(generate_hist_sample,
         n=(1, 10000),
         distribution={'normal': 'normal', 'zipf': 'zipf', 'uniform': 'uniform', 'lognorm': 'lognorm'},
         options=opts2)
opts3 = {
    'title': "Distribution of Errors",
    'ylabel': "Count",
    'xlabel': "Error Magnitude",}
nbi.hist(get_data_for_hist_errors,
         n=(1, 2000),
         eps=(0.01, 1, 0.01),
         delta=(0.01, 1, 0.01),
         distribution={'normal': 'normal', 'zipf': 'zipf', 'uniform': 'uniform', 'lognorm': 'lognorm'},
         options=opts3)

How does data spread affect our errors?

Let’s take this a step further and scope out the exact effect of the standard deviation (a mathematical proxy for the “spread” of the data) on the errors. We’ll sample n = 1000 integers from a normal distribution with mean μ = 0 and vary the standard deviation between 1 and 100 to understand how spreading out the distribution affects the errors.
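The interactive plots below are driven by helpers in demo_utils (get_sample_sd, get_y_errors_sd, and friends). A rough, non-interactive equivalent of the same experiment, reusing the CountMinSketch class sketched above and with a made-up helper name, might look like this:

import numpy as np
from collections import Counter

def errors_for_sd(sd, n=1000, eps=0.01, delta=0.01, seed=0):
    """Sample n integers from N(0, sd), sketch them, and return the
    per-item estimation errors (estimate - true count)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0, sd, n).astype(int).tolist()
    true_counts = Counter(data)
    cms = CountMinSketch(eps=eps, delta=delta)
    for x in data:
        cms.update(x)
    return [cms.estimate(x) - c for x, c in true_counts.items()]

for sd in (1, 10, 100):
    errs = errors_for_sd(sd)
    print(f"sd={sd}: mean error={np.mean(errs):.2f}, max error={max(errs)}")

A larger standard deviation spreads the same n samples over more distinct values, so more items collide in the fixed-width table and the errors grow.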

opts4 = {
    'title': "Scatterplot of Errors",
    'ylabel': "Error",
    'xlabel': "Data Point",
    'ylim': (0, 500),}
print("Vary the standard distribution to see how the errors change!")

nbi.scatter(get_sample_sd, get_y_errors_sd,
            n=(1, 2000),
            eps=(0.01, 1, 0.01),
            delta=(0.01, 1, 0.01),
            sd=(0.01, 1000, 10),
            options=opts4)
opts5 = {
    'title': "Distribution of errors",
    'ylabel': "Count",
    'xlabel': "Error Magnitude",}
print("Vary the standard distribution to see how error disribution change!")
nbi.hist(get_data_for_hist_errors_sd,
         n=(1, 2000),
         eps=(0.01, 1, 0.01),
         delta=(0.01, 1, 0.01),
         sd=(0.01, 1000, 10),
         options=opts5)

Optimization #1: The Learned Count-Min Sketch

One approach is simply to treat the heavy hitters and the non-heavy hitters separately. This is where we can motivate our data structure design with two ideas from the original Learned Index Structures paper: recursive models and auxiliary structures.
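Concretely, the learned version keeps an exact (auxiliary) table for items a model flags as heavy hitters and routes everything else to an ordinary sketch. Here is a minimal sketch of that idea, again building on the CountMinSketch class from earlier; the oracle is just a callable we pass in, standing in for the learned classifier.

class LearnedCountMinSketch:
    """Items the oracle flags as heavy hitters get exact counts in a dict
    (the auxiliary structure); everything else goes to a regular sketch."""
    def __init__(self, is_heavy, eps=0.01, delta=0.01):
        self.is_heavy = is_heavy                 # callable: item -> bool
        self.exact = {}                          # exact counts for heavy hitters
        self.cms = CountMinSketch(eps, delta)    # fallback for everything else

    def update(self, item, count=1):
        if self.is_heavy(item):
            self.exact[item] = self.exact.get(item, 0) + count
        else:
            self.cms.update(item, count)

    def estimate(self, item):
        if self.is_heavy(item):
            return self.exact.get(item, 0)
        return self.cms.estimate(item)

Because heavy hitters are counted exactly, they contribute zero error, and keeping their large counts out of the shared table also reduces collision error for everything else.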

opts = {
    'title': "Count-Min Sketch",
    'ylabel': "Frequency",
    'xlabel': "Error",}
print("Vary the standard distribution to see how error disribution change!")
nbi.hist(opt_1_normal, sd=(0, 1000, 10),  options=opts)
opts = {
    'title': "Learned Count-Min Sketch",
    'ylabel': "Frequency",
    'xlabel': "Error",}
print("Vary the standard distribution to see how error disribution change!")
nbi.hist(opt_1_learned, sd=(0, 1000, 10),  options=opts)

Optimization #2: The Rules-Based Count-Min Sketch

Let’s take it even a step further. Why do we need machine learning anyway? What if we know the data distribution so well that we know exactly which elements are going to be heavy hitters?

Suppose you’re Jeff Dean on a regular Tuesday afternoon and you think to yourself: “I wonder what the frequencies of different search queries on Google will be for 2020!” Maybe you have data from 2019 on the most frequent search terms. You know the 2020 data is likely to be close to that of 2019 (barring a pandemic, among other things :/), so you treat the most frequent queries from 2019 as heavy hitters. Instead of learning what the heavy hitters are, we use a rules-based approach.

How do we simulate this concept without using ML? Fundamentally, any rules-based approach will give a yes/no answer to whether a given element is a heavy hitter. To emulate this, we can construct a set of data points that will be treated as heavy hitters based on a “rule.” Our rule is: if an element is within the top p proportion of counts in the data set, it’s a heavy hitter. We can thus iterate through our data set, add all heavy hitters to a set, and then feed that set into our sketch constructor. Now, instead of inferring whether an element is a heavy hitter, we simply check for its presence in our set, as in the sketch below.
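Here is a minimal version of that rule, reusing LearnedCountMinSketch from above: the “model” is just membership in a precomputed set, and a Zipf-like sample stands in for last year’s query log.

from collections import Counter
import numpy as np

def top_p_heavy_hitters(data, p=0.01):
    """Rule: the top p fraction of distinct items by count are heavy hitters."""
    counts = Counter(data)
    k = max(1, int(p * len(counts)))
    return {item for item, _ in counts.most_common(k)}

# Stand-in for historical data: a heavy-tailed (Zipf-like) sample.
rng = np.random.default_rng(0)
history = rng.zipf(1.5, 10_000).tolist()

heavy = top_p_heavy_hitters(history, p=0.01)
rules_cms = LearnedCountMinSketch(is_heavy=lambda item: item in heavy)
for item in history:     # in practice this would be the new (2020) stream
    rules_cms.update(item)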

opts = {
    'title': "Rule Based Count-Min Sketch",
    'ylabel': "Frequency",
    'xlabel': "Error",}
print("Vary the standard distribution to see how error disribution change!")
nbi.hist(opt_2, sd=(0, 1000, 10),  options=opts)