Wednesday, June 22, 2011

Detect 1% Duplicate Entries in a File of 13 Million Entries Efficiently.

We have a text file with 13 million entries, all numbers in the range from 1,000,000 to 100,000,000, with one number per line. The file is unordered and about 1% of the entries are duplicates. Remove the duplicates.

Since we have a huge amount of data, the obvious idea is to use a bit array (bitset) in which the i-th bit records whether the value i has already been seen. Because values can be as large as 100,000,000, the array needs about 100,000,000 bits, which is roughly 12.5 million bytes (around 12 * 2^20 bytes, i.e. ~12 MB) and fits comfortably in memory.
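A minimal sketch of sizing and allocating such a bit array (the names MAX_VALUE and bits are only illustrative, not from the original post):

#include <vector>

int main()
{
    const long MAX_VALUE = 100000000;              // largest possible entry
    const long NUM_BYTES = MAX_VALUE / 8 + 1;      // ~12.5 million bytes, ~12 MB
    std::vector<unsigned char> bits(NUM_BYTES, 0); // one bit per possible value, all cleared
    return 0;
}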

Before starting to solve the question, I suggest checking the basics of bit operations with the snippet below:

#include <cstdio>

int main()
{
    int i = 0;
    int n = 18;
    n |= 1 << 3;                 // set the i-th bit (i = 3)
    printf(" %d ", n);
    n &= ~(1 << 3);              // clear the i-th bit (i = 3)
    printf(" %d ", n);
    n ^= 1 << 1;                 // toggle the i-th bit (i = 1)
    printf(" %d ", n);
    i = (n & (1 << 4)) ? 1 : 0;  // check whether the i-th bit (i = 4) is set
    printf(" %d ", i);
    return 0;
}


Data Structure: Bit Array

Algorithm:
1. Read the numbers from the file one by one; for each number, set the corresponding bit in the bit array.
2. If the bit corresponding to a number is already set, the number is a duplicate, so skip it instead of writing it to the output.
(A complete loop implementing these steps is sketched after the setBit function below.)

Since we are using a bit array here, we represent the set of bits with an array of bytes. The index of the byte associated with bit n can be calculated using integer division:

n / CHAR_BIT

where n is the index of the given bit and CHAR_BIT (defined in <climits>, typically 8) gives the number of bits in a C char.

"this will give position of byte element in array elements"

The index of the bit within that byte can be calculated via a modulo operation:

n % CHAR_BIT

"so this will give us position of bit inside byte."

With that, we can write the following method to detect the 1% duplicates:

bool setBit(unsigned char a[], int i)
{
    int index  = i / 8;            // which byte of the array (8 = CHAR_BIT on typical platforms)
    int offset = i % 8;            // which bit inside that byte
    if (a[index] & (1 << offset))  // check whether the bit is already set
        return false;              // i-th bit already set: the number is a duplicate
    a[index] |= 1 << offset;       // first occurrence: mark the bit
    return true;
}
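Here is a sketch of how setBit could be driven to produce the de-duplicated file. The file names, the bit-array size, and the allocation details are assumptions for illustration, not part of the original post:

#include <cstdio>
#include <cstdlib>

bool setBit(unsigned char a[], int i);         // as defined above

int main()
{
    const long MAX_VALUE = 100000000;          // largest possible entry
    const long NUM_BYTES = MAX_VALUE / 8 + 1;  // ~12.5 MB bit array, cleared by calloc
    unsigned char *bits = (unsigned char *)calloc(NUM_BYTES, 1);

    FILE *in  = fopen("input.txt",  "r");      // 13 million numbers, one per line (assumed name)
    FILE *out = fopen("output.txt", "w");      // result with duplicates removed (assumed name)
    int n;
    while (fscanf(in, "%d", &n) == 1)
    {
        if (setBit(bits, n))                   // first time we have seen n
            fprintf(out, "%d\n", n);           // keep it; later duplicates are skipped
    }
    fclose(in);
    fclose(out);
    free(bits);
    return 0;
}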

Time Complexity: O(N), where N is the number of entries in the file.
Feel free to comment or optimize the solution.
