Sunday, October 9, 2011

How Google Won User's Heart :: Google Auto-Suggestion/Auto-Completion Algorithm Exposed

How Auto Suggestion Works e.g. How Google Won User's Heart
You've seen search engines suggest queries when you begin typing the first few letters of your search string. This is being done by Duck Duck Go as well as Google (to name a few). This is typically done by maintaining a list of past queries and/or important strings that the search engine thinks are worthy of being suggested to a user that is trying to find something similar. These suggestions are effective only if the search engine spits them out very fast since these should show up on the screen before the user has finished typing what he/she wanted to type. Hence the speed with which these suggestions are made is very critical to the usefulness of this feature.

Let us consider a situation (and a possible way of approaching this problem) in which when a user enters the first few letters of a search query, he/she is presented with some suggestions that have as their prefix, the string that the user has typed. Furthermore, these suggestions should be ordered by some score that is associated with each such suggestion.

Approach-1:

Our first attempt at solving this would probably involve keeping the initial list of suggestions sorted in lexicographic order so that a simple binary search can give us the 2 ends of the list of strings that serve as candidate suggestions. These are all the strings that have the user's search query as a prefix. We now need to sort all these candidates by their associated score in non-increasing order and return the first 6 (say). We will always return a very small subset (say 6) of the candidates because it is not feasible to show all candidates since the user's screen is of bounded size and we don't want to overload the user with too many options. The user will get better suggestions as he/she types in more letters into the query input box.

We immediately notice that if the candidate list (for small query prefixes say of length 3) is large (a few thousand), then we will be spending a lot of time sorting these candidates by their associated score. The cost of sorting is O(n log n) since the candidate list may be as large as the original list in the worst case. Hence, this is the total cost of the approch. Apache's solr uses this approach. Even if we keep the scores bound within a certain range and use bucket sort, the cost is still going to be O(n). We should definitely try to do better than this.


Approach-2:

One way of speeding things up is to use a Trie and store (pointers or references to) the top 6 suggestions at or below that node in the node itself. This idea is mentioned here. This results in O(m) query time, where m is the length of the prefix (or user's search query).

However, this results in too much wasted space because:
  1. Tries are wasteful of space and
  2. You need to store (pointers or references to) 6 suggestions at each node which results in a lot of redundancy of data

We can mitigate (1) by using Radix(or Patricia) Trees instead of Tries.


Approach-3:

There are also other approaches to auto-completion such as prefix expansion that are using by systems such asredis. However, these approaches use up memory proportional to the square of the size of each suggestion (string). The easy way to get around this is to store all the suggestions as a linear string (buffer) and represent each suggestion as an (index,offset) pair into that buffer. For example, suppose you have the strings:
  1. hello world
  2. hell breaks lose
  3. whiteboard
  4. blackboard

Then your buffer would look like this:
hello worldhell breaks losewhiteboardblackboard
And the 4 strings are actually represented as:
(0,11), (11,16), (27,10), (37,10)

Similarly, each prefix of the suggestion "whiteboard" is:
  1. w => (27,1)
  2. wh => (27,2)
  3. whi => (27,3)
  4. whit => (27,4)
  5. white => (27,5)
  6. whiteb => (27,6)
  7. and so on...

Approach-4:

We can do better by using Segment (or Interval) Trees. The idea is to keep the suggestions sorted (as in approach-1), but have an additional data structure called a Segment Tree which allows us to perform range searches very quickly. You can query a range of elements in Segment Tree very efficiently. Typically queries such as min, max, sum, etc... on ranges in a segment tree can be answered in O(log n) where n is the number of leaf nodes in the Segment Tree. So, once we have the 2 ends of the candidate list, we perform a range search to get the element with the highest score in the list of candidates. Once we get this node, we insert this range (with the maximum score in that range as the key) into the priority queue. The top element in the queue is popped and split at the location where the element with the highest score occurs and the scores for the 2 resulting ranges are computed and pushed back into the priority queue. This continues till we have popped 6 elements from the priority queue. It is easy to see that we will have never considered more than 2k ranges (here k = 6).

Hence, the complexity of the whole process is the sum of:
  1. The complexity for the range calculation: O(log n) (omitting prefix match cost) and
  2. The complexity for a range search on a Segment Tree performed 2k times: O(2k log n) (since the candidate list can be at most 'n' in length)

Giving a total complexity of:
O(log n) + O(2k log n)
which is: O(k log n)

Some Useful Links
http://en.wikipedia.org/wiki/Autocomplete
http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/
http://stevedaskam.wordpress.com/2009/05/17/autocomplete-data-structures/
http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html
http://stackoverflow.com/questions/73095/how-to-implement-a-simple-auto-complete-functionality
http://code.google.com/p/autocomplete-server/
http://suggesttree.sourceforge.net/
http://www.quora.com/What-is-the-best-open-source-solution-for-implementing-fast-auto-complete
http://www.phpfreaks.com/forums/index.php?topic=240881.0
http://dhruvbird.blogspot.com/2010/09/very-fast-approach-to-search.html

Friday, September 16, 2011

Wednesday, September 14, 2011

Given two words of equal length that are in a dictionary, write a method to transform one word into another word by changing only one letter at a time. The new word you get in each step must be in the dictionary


Numbers are randomly generated and passed to a method. Write a program to find and maintain the median value as new values are generated.


Given an Array of Strings . Write a program to find the longest word made of other words efficiently ?


Given two arrays [5 6 2 8 1] or [4 7 9 2 4]. you have to find possible pairs where sum of numbers in both arrays of a pair is equal....eg for the first case it will be [5 6] and [2 8 1] efficiently .


Use Google Chrome For Better View of Contents on Blog :)

Dear Readers,
                              I am getting regular suggestion & useful comments from many geeks to re-arrange the question according to type, i will try to do this asap i get time as i have to edit HTML Code & also some  people are saying that contents are good but unable to view them. for this problem try Google Chrome Until I change the GUI & edit Html Code.

Thanks  For Giving Suggestion to Solving Problems Efficiently 
Happy Coding
Shashank !!!

Monday, September 5, 2011

Find the smallest number expressible as the sum of cubes of two numbers in two different ways. Indeed, 10^3 + 9^3 =12^3+1^3=1729 , Do It Efficiently


Problem:
A mathematician G. H. Hardy was on his way to visit his collaborator S. Ramanujan an who was in the hospital. Hardy remarked to Ramanujan that he traveled in taxi cab number 1729 which seemed a dull one and he hoped it was not a bad omen. To this, Ramanujan replied that 1729 was a very interesting number-it was the smallest number expressible as the sum of cubes of two numbers in two different ways. Indeed,
103 + 93 =123 + 1 3 二1729.

Hint: This problem is very similar to another very popular problem that is asked in interviews. You are given a  η ×η matrix in which both rows and columns are sorted in ascending order and you are supposed
to find a given number in the matrix.


Algorithm:
In this case, we are essentially looking for an implicit matrix A such that A(i,j) =i^3十j^3. In our case, the matrix will have n^1/3 rows and columns and this matrix of size n^1/3 * n^1/3 we try to search k=i^3+j^3 isn't it  ans algorithms for searching for a number in such a matrix that are linear in the number of rows.

One approach is to start by comparing x to An,1. If  x = An ,l , stop.Otherwise, there are two cases:
- x < A1,n in which case x is less than element at Column n at 1,n position. which mean we can escape whole row itself . we decrement column pointer to left.e.g. previous column.
- x > A1,n , in which case if x > a[1][n] is greater than it will be greater then all elements in Row 1.. we increment the row pointer to next row.


Efficient Solution:
Here is the Pseudo Code For The Same.
int IsNumberEqualstoSumeTwoCubes(int n)
{
       int row=Ceil(Pow(n,1/3));// Pow Has Complexity of O(logn) Ceil(1729)=577
       int column=row;
       int i=0;int j=column;
      while(i<row && j>=0)
     {
               int m=i*i*i+j*j*j;
               if(m==n)
                   return 1; // print i^3 and j^3 these are two such numbers
              else if ( m < n)
               i++;
              else
               j--;
      }
return 0; //such number DNE
}

Complexity Analysis:
In either case, we have a matrix with η fewer elements to search. In each iteration, we remove a row or a column, which means we inspect 2η-1 elements.


We claim that our algorithm that solves the matrix search problem will have to compare x with each of the 2n-l elements shown (i.e the diagonal elements (in case of column element matching )and  the elements immediately below them (in case of row elements matching )and this is obvious just draw a diagram in which both row & column are in sorted order & try to run the above algorithm ).
Call these elemmts the # elements..
Comparing x with any other elements in matrix does not eliminate the any of the # elements.we can easily Proof this by Contradiction:Suppose X algorithm doesn't compare x with  any of the above # elements.
then we could make that element x ( instead of x-1 got to prev. column  or x+1 go to next row.).hence we will get wrong result.

Above Algorithm Really Requires Deep Understanding of Young_tableau :)

Time Complexity O(row+column)=O(N^1/3+N^1/3)=O(2*N^1/3)=O(N^1/3)
Space Complexity O(1)



Friday, September 2, 2011

Get Rid Off Suffix Tree !!!!! An Online Compilation of Helpful Tutorial

Excellent Study Material For Understanding Algorithms & implementing Suffix Tree

6.http://apps.topcoder.com/forums/?module=RevisionHistory&messageID=1171511
7.http://www.stanford.edu/class/cs97si/suffix-array.pdf

Application of Suffix Tree

1.Longest repeated substring problem:
The longest repeated substring problem is finding the longest substring of a string that occurs at least twice. This problem can be solved in linear time and space by building a suffix tree for the string, and finding the deepest internal node in the tree. The string spelled by the edges from the root to such a node is a longest repeated substring. The problem of finding the longest substring with at least k occurrences can be found by first preprocessing the tree to count the number of leaf descendants for each internal node, and then finding the deepest node with at least k descendants.

By from Sartaj Sahni

Find the longest substring of S that appears at least m > 1 times. This query can be answered in O(|S|) time in the following way:
(a) Traverse the suffix tree labeling the branch nodes with the sum of the label lengths from the root and also with the number of information nodes in the subtrie. (b) Traverse the suffix tree visiting branch nodes with information node count >= m. Determine the visited branch node with longest label length.
Note that step (a) needs to be done only once. Following this, we can do step (b) for as many values of m as is desired. Also, note that when m = 2 we can avoid determining the number of information nodes in subtries. In a compressed trie, every subtrie rooted at a branch node has at least two information nodes in it.

Note that step (a) needs to be done only once. Following this, we can do step (b) for as many values of m as is desired. Also, note that when m = 2 we can avoid determining the number of information nodes in subtries. In a compressed trie, every subtrie rooted at a branch node has at least two information nodes in it.
2. Pattern Searching/Sub-String Searching 
Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).
3. Longest Repeated Substring 
4. Longest Common Subsequence 
5. Longest Palindrome
http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf





Add a special ``end of string'' character, e.g. `$', to txt[1..n] and build a suffix tree; the longest repeated substring oftxt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root, i.e., `issi' in the case of `mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.
The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 andtxt2: Each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1and txt2 represents the longest common substring.
Equivalently, one can build a (basic) suffix tree for the string txt1$txt2#, where `$' is a special terminator for txt1 and `#' is a special terminator for txt2. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it.
(Try it using the HTML FORM above.)


Note that the `longest common substring problem' is different to the `longest common subsequence problem' which is closely related to the `edit-distance problem': An instance of a subsequence can have gaps where it appears in txt1 and in txt2, but an instance of a substring cannot have gaps.
A palindrome is a string, P, such that P=reverse(P). e.g. `abba'=reverse(`abba'). e.g. `ississi' is the longest palindrome in `mississippi'. The longest palindrome of txt[1..n] can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)# or by building the generalized suffix tree for txt and reverse(txt).
(Try it.)



Others 
--


Some Awesome Algorithms/Codes For Suffix Tree Implementation