Bloom Filters


Github Repo


Bloom Filters are cool

This semester at Carnegie Mellon I took a course on database systems: 15-445. In one of the lectures we briefly glanced over bloom filters and their functionality in speeding up queries. When I watched the lecture, I remembered how I used bloom filters in previous CS courses and decided it would be interesting to further research and implement my own version. In this article, I will document the basics on bloom filters, some of the interesting finds I’ve made in regard to bloom filters, my C++ implementation, and other findings.


How Bloom Filters Work and Why They Exist

The bloom filter is an interesting yet simple data structure. At its core, a bloom filter is simply a bit vector.

Bloom Filter
A simple example of an 8 bit bloom filter

Bloom filters are used to determine set membership. Using a bit vector for set membership means that this data structure is simple to interface with and very space efficient. Additionally, memory accesses to the data structure utilize temporal locality by nature.

The trade off to fast lookups/adding is the possibility of false positives.

Add Function

When we add to the filter, we simply pass in our value / string through an efficient non cryptographic hash function and then mod our hashed result by the length (in bits) of our bloom filter. This will give us an index in which we will set the indexed bit to 1.

Bloom Filter
Add String
Lookup Function

Lookup works the same way. We pass our value through a hash function and then mod our hashed result by the length (in bits) of our bloom filter. This will give us an index. We then check the indexed bit.

If the bit is 0, we know for sure that our value does not exist in the set.

If the bit is 1, there is a possibility our value exists in the set.

Bloom Filter
Lookup - “Goran” may exist in the set
Bloom Filter
Lookup - “Jon” does not exist in the set

My C++ Implementation

Github Repo

bloom.h:
#include <mutex>
#include <stdint.h>
#include <stdlib.h>
#include <bitset>
#include "cityhash/city.h"

We use google’s cityhash library as it is a fast noncryptographic hashing function

/** @brief used to interface with cityhash types */
typedef uint64_t uint64;

/** @brief used to divide by 64 via logical shift */
#define DIV_64_SHIFT 6

Global Variables

namespace std {

class BloomFilter {

public:

  /**
   * @brief Constructor for BloomFilter type
   *
   * zeroes filter and associated filter metadata
   *
   * @param[in] filter_size - dynamically defines size of filter using
   * filter_size*64 bits.
   * filter_size must be < 67108864 //potentially 1048576
   * 
   * @return BloomFilter instance
   */
  BloomFilter(int filter_size);

  /**
   * @brief Destructor for BloomFilter type
   *
   * frees dynamically allocated memory for filter
   */
  ~BloomFilter();

  /**
   * @brief lookup a string in the bloom filter
   *
   * calls cityhash64() and checks the calculated filter_index from the hash
   * result
   *
   * @param[in] str
   * 
   * @return bool
   * true -> str may exist in the set
   * false -> str does not exist in the set
   */
  bool lookup(string str);

  /**
   * @brief Adds a string to the bloom filter
   *
   * calls cityhash64() and sets the calculated filter_index to 1 from the hash
   * result
   *
   * @param[in] str
   */
  void add_string(string str);

  /**
   * @brief filter_count
   * 
   * multiple calls to add_string with the same argument still increment this
   * counter
   * 
   * @return call count of add_string
   */
  size_t filter_count();

  /**
   * @brief creates filter string
   *
   * @return string representing the entire bloom filter in binary
   */
  string expose_filter();

  /**
   * @brief filter_ratio
   * 
   * @return bits filled with 1
   */
  size_t filter_ratio();

Library Functions

protected:

  /** @brief counts instances of add_string() calls */
  size_t filter_count_;

  /** @brief dynamically allocated uint64 array as bloom filter */
  uint64 *filter_;

  /** @brief filter_size represents length of filter array in elements */
  int filter_size_;

  /** @brief filter_bits represents length of filter in bits */
  int filter_bits_;
};
} // namespace std

Protected Variables Used

bloom.cpp functions:
BloomFilter::BloomFilter(int filter_size) {

  //dynamically allocated uint64 array as bloom filter
  filter_ = new uint64[filter_size];

  //clearing filter
  for(int i = 0; i < filter_size;i++) {
    filter_[i] = 0;
  }

  filter_count_ = 0;
  filter_size_ = filter_size;
  filter_bits_ = filter_size << DIV_64_SHIFT;
}

BloomFilter::~BloomFilter() {
  delete[] filter_;
}

Constructor and Destructor: We use an array of uint64’s to represent the bloom filter. This is the most space efficient and fastest data structure for our implementation.

void BloomFilter::add_string(string str) {
  uint64 hashed_val = CityHash64(str.data(), str.length());
  //bit_on represents which bit index should be set to 1
  uint64 bit_on = hashed_val % filter_bits_;
  //calculates which array element in filter_ holds the bit_on index
  uint64 filter_index = bit_on >> DIV_64_SHIFT;
  //1 gets left shifted to the corresponding index
  //bit_on - (filter_index << DIV_64_SHIFT) localizes the bit to the array index
  filter_[filter_index] |= 1 << (bit_on - (filter_index << DIV_64_SHIFT));
  filter_count_ += 1;
}

Add String Function: Note how we calculate the filter index. Instead of dividing we use a shift by 6 which is equivalent to dividng by 64. While the compiler should recognize this optimization, it is easy for a programmer to understand this computation. Thus we manually implement the optimization. Additionally, we use bit-wise operations to figure out the index within our bloom filter uint64 array that we must target.

bool BloomFilter::lookup(string str) {
  uint64 hashed_val = CityHash64(str.data(), str.length());
  //bit_on represents which bit index should be set to 1
  uint64 bit_on = hashed_val % filter_bits_;
  //calculates which array element in filter_ holds the bit_on index
  uint64 filter_index = bit_on >> DIV_64_SHIFT;
  //1 gets left shifted to the corresponding index
  //bit_on - (filter_index << DIV_64_SHIFT) localizes the bit to the array index
  return (filter_[filter_index] &
          (1 << (bit_on - (filter_index << DIV_64_SHIFT)))) != 0;
}

Lookup String Function: This function also uses similar bit-wise operation optimizations that add_string uses

size_t BloomFilter::filter_count() { return filter_count_; }

string BloomFilter::expose_filter() {
  std::string s = std::bitset<64>(filter_[0]).to_string();
  for (int i = 1; i < filter_size_; i++) {
    s.append(std::bitset<64>(filter_[i]).to_string());
  }
  return s;
}

size_t filter_ratio() {
  return 0;
}

Heauristic Functions