Wednesday, August 3, 2011

Given a file containing 4,300,000,000 integers, how can you find one that appears at least twice?

A Theoretical Approach From Programming Pearls

Binary search finds an element that occurs at least twice by recursively searching the subinterval that contains more than half of the integers. My original solution did not guarantee that the number of integers is halved in each iteration, so the worst case run time of its log2 n passes was proportional to n·log n. Jim Saxe reduced that to linear time by observing that the search can avoid carrying too many duplicates.

When his search knows that a duplicate must be in a current range of m integers, it will only store m+1 integers on its current work tape; if more integers would have gone on the tape, his program discards them. Although his method frequently ignores input variables, its strategy is conservative enough to ensure that it finds at least one duplicate.

The algorithm that Bentley is talking about works by repeatedly halving the candidate range in which the duplicate element must lie. Initially this range is 0..2^32-1. On each pass it throws away the half of the range that contains fewer data values, and it also throws away the data lying in that half. Eventually the range shrinks to a single value, which must be a duplicate because the remaining data still has at least two elements. The problem that Bentley notes is that the data may still have 4 billion elements at this stage! The final improvement he mentions is that you can throw away data _inside_ the candidate range as long as you keep enough of it to guarantee that at least one duplicate remains: the amount you need to keep is just the size of the current candidate range plus 1!
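
Here is a minimal sketch of that range-halving search in C, assuming the integers sit in a raw binary file of unsigned 32-bit values (the file name, the uint32_t layout, and the function name are mine, for illustration). This plain version re-reads the file on every pass, about 32 passes in total, rather than using Saxe's work-tape trick of discarding surplus data:

#include <stdio.h>
#include <stdint.h>

// Range-halving search over the *values*, not the file positions. Each pass
// counts how many input values fall in the lower half of the candidate range;
// by the pigeonhole principle the overfull half must contain a duplicate.
uint32_t find_duplicate(const char *path)
{
    uint64_t lo = 0, hi = 0xFFFFFFFFull;      // candidate value range [lo, hi]
    FILE *f = fopen(path, "rb");              // error handling omitted
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        uint64_t below = 0;                   // how many values land in [lo, mid]
        uint32_t v;
        rewind(f);
        while (fread(&v, sizeof v, 1, f) == 1)
            if (v >= lo && v <= mid)
                below++;
        if (below > mid - lo + 1)             // lower half is overfull
            hi = mid;
        else                                  // so the upper half must be
            lo = mid + 1;
    }
    fclose(f);
    return (uint32_t)lo;                      // lo == hi: a duplicated value
}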


Another Similar Approach I Found

Create a bit array of length 2^32 bits (initialized to zero); that is 512 MB and will fit into RAM on any modern machine.

Start reading the file int by int and check the bit whose index equals the value of the int: if the bit is already set, you have found a duplicate; if it is zero, set it to one and proceed with the next int from the file.
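
A rough sketch of that bit-array scan in C (the file name and the raw uint32_t layout are my assumptions; error handling is omitted):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    uint8_t *seen = calloc(1u << 29, 1);       // 2^32 bits = 2^29 bytes = 512 MB, all zero
    FILE *f = fopen("ints.bin", "rb");         // assumed: raw 32-bit ints, one after another
    uint32_t v;
    while (fread(&v, sizeof v, 1, f) == 1) {
        if (seen[v >> 3] & (1u << (v & 7))) {  // bit already set: v is a duplicate
            printf("duplicate: %u\n", v);
            break;
        }
        seen[v >> 3] |= 1u << (v & 7);         // mark v as seen
    }
    fclose(f);
    free(seen);
    return 0;
}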

The trick is to find a suitable data structure and algorithm. In this case everything fits into RAM with a suitable data structure, and a simple, efficient algorithm can be used.
If the numbers are int64 you need to find a suitable sorting strategy or make multiple passes, depending on how much additional storage you have available.

The Pigeonhole Principle -- if you have N pigeons in M pigeonholes and N > M, then some hole contains at least 2 pigeons. The 32-bit integers are our 2^32 pigeonholes, and the 4.3 billion numbers in our file are the pigeons. Since 4.3x10^9 > 2^32 = 4,294,967,296, we know there are duplicates.

You can apply this principle to test whether a duplicate lies in a given subrange of the numbers, at the cost of reading the whole file but without loading more than a little of it into RAM at a time -- just count how many times you see a number in your test range and compare that to the number of distinct integer values the range can hold. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:

long long pigeons = 0;                           // int could overflow: the file holds 4.3 billion values
long long pigeonholes = 2000000 - 1000000 + 1;   // include both fenceposts
uint32_t n;
while (fread(&n, sizeof n, 1, f) == 1) {         // f: an open FILE* of raw 32-bit ints
    if (n >= 1000000 && n <= 2000000) pigeons++;
}
if (pigeons > pigeonholes) {
    // one of the duplicates is between 1,000,000 and 2,000,000
    // try again with a narrower range
}

Picking how big a range (or how many ranges) to check versus how many times you want to read 16 GB of data is up to you :)
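
For instance, here is one hedged sketch of that trade-off (the 1,024-way split, the file name, and the raw uint32_t layout are arbitrary choices of mine): count 1,024 ranges of 2^22 values each in a single pass, then drill into any range whose count exceeds its size with a narrower pass or a small bit array:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t count[1024] = {0};                // one counter per range of 2^22 values
    FILE *f = fopen("ints.bin", "rb");         // assumed: raw 32-bit ints
    uint32_t v;
    while (fread(&v, sizeof v, 1, f) == 1)
        count[v >> 22]++;                      // top 10 bits pick the range
    fclose(f);
    for (int i = 0; i < 1024; i++)
        if (count[i] > (1u << 22)) {           // more pigeons than holes: duplicate inside
            uint32_t lo = (uint32_t)i << 22;
            printf("a duplicate lies in [%u, %u]\n", lo, lo + ((1u << 22) - 1));
        }
    return 0;
}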

As far as a general algorithm category goes, this is a combinatorics (math about counting) problem.
