Author Topic: [C] Implementation of a Bloom Filter  (Read 1531 times)

0 Members and 1 Guest are viewing this topic.

Offline 0poitr

  • Peasant
  • *
  • Posts: 149
  • Cookies: 64
    • View Profile
[C] Implementation of a Bloom Filter
« on: October 20, 2013, 09:48:11 am »
Hello ez! It's been long since I last posted. Been busy with petty stuff. Missed you people!

Today I'll post my implementation of a bloom filter. Bloom filters are a very widely used data structure in real-life applications (http://en.wikipedia.org/wiki/Bloom_filter#Examples).
In context of computer science, bloom filters are one sort of probabilistic data structure. It's mostly used to lookup if an element resides in the set of data currently in the data structure, but since it's probabilistic, there can be false positives. The benefit? Savings in storage space and lookup time.

In more details, a bloom filter is implemented as a bit array. For each element in a data set, k different hashes are generated and the corresponding bits of the bit array are set. For example, if the word "dad" generates the hashes 17, 15, 97  then 17th 15th and 97th bits of the array will be set. For a lookup, the same number of hashes are generated from the search string and if all the corresponding bits in the bitmap are set, the word _may_ be in the set. 'may' because some other word might have generated a subset of hashes as the search string. Hence, false positves might be reported. But since, we never clear an already set bit, false negetives can't occur.

Now, how do we deal with the false negetives? I'll direct you to this: http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives
It has a better explanation that I can.

About the implementation, I calculate the optimal number of bits from the formuae

m being the number of bits, n the number of elements and p is the probability of false +ves.

I used Murmurhash to compute a 128 bit hash for each word, split the hash into 4 hashes of 32bits each, take the hash modulo size_of_bitmap and set the corresponding bits, continuing this way untill i have bits_per_words number of hashes generated and the bits set.

The lookup is simple. Generate the same hashes and check if all are set.

I know there's some redundant code but too lazy to organize.
All suggestions for optimizations, ideas and other improvements are welcome. If you have questions regarding the code / bloom filters in general please ask.

Compiling: there's a make.sh file in the zip. Extract the zip and execute it. You'll need gcc and g++ to compile the code.
Running: ./blm <wordlist>

The program simply creates a filter for the wordlist in memory and drops you to a prompt. You can enter arbitrary words and check if they belong to the set. You can check the correctness of the lookup by looking into the file itself and searching for the word. It's just a demonstration but you could use the code to more meaningful work.

Notes: I discard words that contain non alphabetic charaters on purpose. You can remove the checks in read_file().
make.sh builds the elf with debug symbols and no optimization flags.
You can increse accuracy of the lookup specifying a lower value for 'p' where req_bits is calculated in main().

Thanks for reading! and have fun with the code. 8)
Let me know how many false +ves you got if you're using a different wordlist (it must be newline separated).
« Last Edit: October 21, 2013, 06:31:15 am by 0poitr »
Imagination is the first step towards Creation.

Offline Deque

  • P.I.N.N.
  • Global Moderator
  • Overlord
  • *
  • Posts: 1203
  • Cookies: 518
  • Programmer, Malware Analyst
    • View Profile
Re: [C] Implementation of a Bloom Filter
« Reply #1 on: October 20, 2013, 09:26:40 pm »
Will try it. Until then +1 for seeing again some high quality content from you.
I had heard about bloom filters in university, but never tried to implement them.

Offline 0poitr

  • Peasant
  • *
  • Posts: 149
  • Cookies: 64
    • View Profile
Re: [C] Implementation of a Bloom Filter
« Reply #2 on: October 21, 2013, 06:33:06 am »
Will try it. Until then +1 for seeing again some high quality content from you.
I had heard about bloom filters in university, but never tried to implement them.
Thanks mate! You've done some releases meanwhile, I see. 'll be checking them out. 8)
Imagination is the first step towards Creation.