Entries Tagged 'Tech' ↓

Visualizing Cuckoo Hashing

Cuckoo hashing is a simple and elegant method for collision-resolution in hash tables. The resulting dictionary data structure is easy to implement, quite efficient in practice and it is also interesting theoretically (there are several interesting open questions about its properties).

I made a small visualization of how Cuckoo hashing works. The demo uses JavaScript and HTML5 Canvas, so it should work in most browsers. Let me know if you find it useful or if you have any feedback.

The demo is loosely inspired by Jason Davies’ Bloom filter visualization. Here is my old post about Bloom filters.

Random wiki image wallpaper

Here’s a small idea I wrote about before: change the desktop wallpaper automatically using random images from Wikipedia. Why Wikipedia and not some other source? Because these images are generally more interesting, more diverse and better quality than any other collection I know.

Click this link a couple of times for some examples of random wiki images:

Finally I managed to hack it together and here it is for you to download. I only tried it on Linux but probably it can be ran with small modifications on a Mac as well. If you manage to put together something like this on Windows let me know in the comments. Here are the steps:

1. What we need

  • feh – this is a nice, lightweight image viewer ( https://wiki.archlinux.org/index.php/Feh )
  • convert, mogrify – these are some utilities from the imagemagick set of tools (easily available on most distributions)
  • wget – for fetching files from the internet
  • sed, awk, grep, xargs – you have these already

If you don’t have the first three, on Ubuntu all you need to do is:

sudo apt-get install feh
sudo apt-get install imagemagick
sudo apt-get install wget

2. The main thing

Make a folder where you put the files needed for this wallpaper-changing-business. For me that is ~/random_wallpaper

Create the file run.sh in this newly created folder and copy the code below into it. The code does the following: it fetches the URL of a random image from Wikipedia, it fetches the image itself, makes an attempt to convert it to a decent size (change resolution in the code if necessary), makes an attempt to add the title of the image to the image, sets it as wallpaper, waits for an hour, removes the image and repeats. Additionally, it saves filenames and timestamps in an archive.txt file, so that if you see something interesting, you can look it up later. Feel free to tweak it to your needs, for example, the 60 minutes repeat period in the end could be made 20s or whatever else.

while true; do
	# Remove old file and get new random image
	cat filename.txt | xargs rm
	echo -n "http://upload" > url.txt
	wget -q -O - "$@" http://commons.wikimedia.org/wiki/Special:Random/File | grep -m 1 "href=\"//upload" | sed -e 's/.*href\=\"\/\/upload//g' | sed -e 's/\".*//g' >> url.txt
	url=`cat url.txt`
	wget -O temp_image.tmp "$url"
	echo -n "temp_image." > filename.txt
	cat url.txt | sed -e 's/.*\///g' | sed -e 's/.*\.//g' >> filename.txt
	filename=`cat filename.txt`
	mv temp_image.tmp "$filename"

	# Get filename of image for display
	echo $url | awk -F"/" '{print $NF}' | awk -F. '{print $1}' > urldisplay.txt
	urldisplay=`cat urldisplay.txt`

	# Resize image and add title (filename)
	mogrify -resize 1024x768\> "$filename"  
	convert -pointsize 20 -fill red -draw "text 10,30 "$urldisplay""  "$filename" "$filename"

	# Set as wallpaper
	feh --bg-center "$filename"

	# Leave some traces in the logfile
	echo -n `date` >> archive.txt
	echo -n "    " >> archive.txt
	echo $url >> archive.txt

	# Go to sleep
	sleep 60m

Make this file executable:

chmod +x run.sh

3. Wrapping it up

All we need now is a way of running our script. If you want to try it out just once, or you want to use it as a slideshow on a birthday party, you can simply run it from the terminal:

./run.sh &

If you want to use it all the time, it should be called at each startup. This can be achieved in different ways, depending a bit on the setup used.
I run Fluxbox as a window manager, which is closely related to Blackbox or Openbox.
Here I add towards the end of my startup file the following:

~/random_wallpaper/run.sh &

If you use Gnome, you could add the same line to your ~/.xinitrc file. The Feh documentation says that to make it play nicely with Feh, you should disable Nautilus’ control of the desktop:

gconftool-2 --set /apps/nautilus/preferences/show_desktop --type boolean false

An alternative way of calling run.sh would be to set up a periodic job using cron. In this case the loop (the while- and the done- lines) are not needed.

Let me know in the comments how it worked out, or if you know a simpler way of setting it all up.

phpBB recent posts widget

On a language website that I maintain I recently switched to a full-fledged phpBB forum from a previous solution that included a comment box on every page, provided by the online service Disqus.

phpBB is the classical choice for forum software, it is open source, based on php (as the name suggests), well supported, with a lot of information scattered around on the internet, with many useful extensions and a bewildering amount of parameters to be configured.

Here are some of the pros/cons of phpBB that I noticed in comparison to the previous set-up I used:
+ More control over the content, the comments are now stored on my server
+ Better organization of topics, possibility to move messages from one thread to another, the accumulated threads serve as some sort of knowledge-base
+ More fine-grained user permissions and moderating options. While this is useful, I would be happier with some sane defaults, as it takes quite a lot of time to play around with all these settings, to little visible benefit to the users.
+ Better integration: I can customize phpBB to closely match the looks of my site, more so than with the widgets I used previously
+ Users can create user accounts within my site
+ Since phpBB has been around for ages, most people are comfortable using it
A lot of time wasted setting up, configuring, customizing, upgrading, etc. the software
A lot of time wasted fighting spam. While there are many half-solutions available, in this regard running your own forum is clearly inferior to a centralized service like Disqus
The look and feel of the forum is slightly heavy-weight and outdated (not too web2.0-y). This is not a big problem, as it is familiar even to non-technical people.
Higher threshold for commenting. Although I tried to keep registration requirements, rules, captcha’s to a minimum, the number of messages written is somewhat smaller than in the previous system. However, the posts are longer, usually more substantial and better organized, so this is actually a +.

All in all I am quite happy with phpBB, even though it is not nearly as minimalistic as I would prefer it to be. One thing, however, that I was still missing from the Disqus days is a recent posts widget. This would be embedded on a different page than the actual forum (for example on the front page) and it would show the most recent 5-10 posts that were written in the forum. As I found no obvious ready-made solution, I wrote my own, which was not very difficult, as the database structure of the forums is quite straightforward.

You can download it below, with some explanation on how to install it. Let me know if you find it useful, or if you make any improvements to it. I spent about as much time putting it together, as it took me to write this post, so it is really nothing fancy. It is released under the WTFPL license, which means that you can … well, you know.


Put the following line in the web page where you want the recent comments to appear:

<div id="recent-widget"></div>

And somewhere later in the same file (preferably just before the closing </body> tag):

<script type="text/javascript">
httpRequest("recent-widget.php", showrecent);
function showrecent(WIDGET){
 d = document.getElementById('recent-widget');
 d.innerHTML = WIDGET;

function httpRequest(url, callback) {
  var httpObj = false;
  if (typeof XMLHttpRequest != 'undefined') {
    httpObj = new XMLHttpRequest();
  } else if (window.ActiveXObject) {
      httpObj = new ActiveXObject('Msxml2.XMLHTTP');
    } catch(e) {
        httpObj = new ActiveXObject('iMicrosoft.XMLHTTP');
      } catch(e) {}
  if (!httpObj) return;
  httpObj.onreadystatechange = function() {
    if (httpObj.readyState == 4) { // when request is complete
  httpObj.open('GET', url, true);

This will load the recent-widget.php file that writes the actual widget and put the output into the HTML file. Upload this on your server, so that it is accessible from your main page. The only advantage to this javascript-loader is that you can easily display the same widget on many different HTML pages. Otherwise you could simply put the stuff from recent-widget.php into your own php files.

Here is the recent-widget.php

Before you upload it, you have to modify some things in it that are about the access to the forum database and links to the forum itself and you can modify other things, like the actual layout, colors used, number of posts, etc. See the TODO comments in the file for more information. This assumes that you are running phpBB with MySQL, but if you run some other database, such as SQLite, it should still be easy to modify. Obviously, this whole widget is nothing too complicated, so even if you never programmed in php, or seen an SQL query before, it is still a decent way to learn if you tweak it to work.

Note: this widget is provided without any guarantee – unfortunately, due to lack of time, I am unable to provide support for it. You might find some useful information in the comments below.

Sketching data structures

“Sketching” data structures store a summary of a data set in situations where the whole data would be prohibitively costly to store (at least in a fast-access place like the memory as opposed to the hard disk). Variants of trees, hash tables, etc. are not sketching structures, they just facilitate access to the data, but they still store the data itself. However, the concept of hashing is closely related to most sketching ideas as we will see.

The main feature of sketching data structures is that they can answer certain questions about the data extremely efficiently, at the price of the occasional error. The best part is that the probability of an error can be quantified and the programmer can trade off the expected error rate with the amount of resources (storage, time) afforded. At the limit of this trade-off (when no error is allowed) these sketching structures collapse into traditional data structures.

Sketching data structures are somewhat counter-intuitive, but they can be useful in many real applications. I look at two such structures mostly for my own benefit: As I try to understand them, I write down my notes. Perhaps someone else will find them useful. Links to further information can be found in the end. Leave comments if you know of other sketching data structures that you found useful or if you have some favorite elegant and unusual data structure.

1. Bloom filter

Suppose we have to store a set of values (A) that come from a “universe” of possible values (U). Examples: IP addresses, words, names of people, etc. Then we need to check whether a new item x is a member of set A or not. For example, we might check if a word is spelled correctly by looking it up in the dictionary, or we can verify whether an IP address is banned by looking it up in our black list.

We could achieve this by storing the whole set A in our favorite data structure. Alternatively, we could just store a binary array, with one bit for each possible element in U. For example, to quickly check if a number is prime or not, we could precompute an array of bits for all numbers from 0 to the maximum value we need:

Prime = 001101010001010001010001...

To check whether a number is prime, we look at the corresponding bit and we are done. This is a dummy example, but it is already obvious that in most cases the range of possible values is too large to make this practical. The number of all possible strings of length 5, containing just letters from the English alphabet is 26^5 = 11,881,376 and in most real problems the universe U is much larger than that.

The magic of the Bloom filter allows us to get away with much less storage at the price of an occasional mistake. This mistake can only be a false positive, the Bloom filter might say that x is in A when in fact it is not. On the other hand, when it says that x is not in A, this is always true, in other words false negatives are impossible. In some applications (like the spell-checker), this is acceptable if false positives are not too frequent. In other applications (like the IP blacklist), misses are more common and in the case of a hit, we can verify the answer by reading the actual data from the more costly storage. In this case the Bloom filter can act as an efficiency layer in front of a more costly storage structure. If false positives can be tolerated, the Bloom filter can be used by itself.

The way it works is really simple: we use a binary array of size n, as in the prime numbers example, that is initialized with 0s. In this case however, n is much smaller than the total number of elements in U. For each element to be added to A, we compute k different hash values (using k independent hash functions) with results between 1 and n. We set all these locations h1, h2, …, hk (the indexes returned by the hash functions) in the binary array to 1. To check if y is in A, we compute the hash values h1(y), …, hk(y) and check the corresponding locations in the array. If at least one of them is 0, the element is missing. If all fields are 1, we can say that the element is present with a certainty that depends on n (the size of the array), k (the number of hashes) and the number of elements inserted. Note that n and k can be set beforehand by the programmer.

The source of this uncertainty is that hash values can collide. This becomes more of a problem as the array is filling up. If the array were full, the answer to all queries would be a yes. In this simple variant, deleting an element is not possible: we cannot just set the corresponding fields to 0, as this might interfere with other elements that were stored. There are many variants of Bloom filters, some allowing deletion and some allowing the storage of a few bits of data as well. For these and for some rigorous analysis, as well as some implementation tricks, see the links below.

A quick dummy example is a name database. Suppose we want to store female names and reject male names. We use two hash functions that return a number from 1 to 10 for any string.

Initial configuration: 0000000000
Insert("Sally")      : 0100000001
   # h1("Sally") = 2, h2("Sally") = 10
Insert("Jane")       : 1110000001
   # h1("Jane") = 1, h2("Jane") = 3
Insert("Mary")       : 1110100001
   # h1("Mary") = 5, h2("Mary") = 2 [collision]

   # bits 2 and 10 are set,
   # return HIT
   # h1("John") = 10 set, but h2("John") = 4 not set
   # return MISS
   # h1("Bob") = 5 set, h2("Bob") = 1 set
   # return HIT (false positive)

Wikipedia: Bloom filter (and variants)

Description and Ruby implementation (Ilya Grigorik)

Bloom filter – the math (Pei Cao)

The original paper (Bloom, 1970)

Tech talk: Bloom filter and variants (Ely Porat)

The math and intuition behind Bloom filters (Michael Nielsen)

Interactive demo of Bloom filters (Jason Davies)

2. Count-Min sketch

The Count-Min (CM) sketch is less known than the Bloom filter, but it is somewhat similar (especially to the counting variants of the Bloom filter). The problem here is to store a numerical value associated with each element, say the number of occurrences of the element in a stream (for example when counting accesses from different IP addresses to a server). Surprisingly, this can be done using less space than the number of elements, with the trade-off that the result can be slightly off sometimes, but mostly on the small values. Again, the parameters of the data structure can be chosen such as to obtain a desired accuracy.

CM works as follows: we have k different hash functions and k different tables which are indexed by the outputs of these functions (note that the Bloom filter can be implemented in this way as well). The fields in the tables are now integer values. Initially we have all fields set to 0 (all unseen elements have count 0). When we increase the count of an element, we increment all the corresponding k fields in the different tables (given by the hash values of the element). If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k elements.

To obtain the count of an element, we take the minimum of the k fields that correspond to that element (as given by the hashes). This makes intuitive sense. Out of the k values, probably some have been incremented on other elements also (if there were collisions on the hash values). However, if not all k fields have been returned by the hash functions on other elements, the minimum will give the correct value. See illustration for an example on counting hits from IP addresses:

In this example the scenario could be that we want to notice if an IP address is responsible for a lot of traffic (to further investigate if there is a problem or some kind of attack). The CM structure allows us to do this without storing a record for each address. When we increment the fields corresponding to an address, simultaneously we check if the minimum is above some threshold and we do some costly operation if it is (which might be a false alert). On the other hand, the real count can never be larger than the reported number, so if the minimum is a small number, we don’t have to do anything (this holds for the presented simple variant that does not allow decreases). As the example shows, CM sketch is most useful for detecting “heavy hitters” in a stream.

It is interesting to note that if we take the CM data structure and make the counters such that they saturate at 1, we obtain the Bloom filter.

For further study, analysis of the data structure and variants, proper choice of parameters, see the following links:

CM sketch page

Original paper (Cormode, Muthukrishnan, 2005)

Lecture slides (Helsinki Univ. of Tech)

Sketch library C++

C and Java implementations

What is your favorite counter-intuitive data structure or algorithm?