Speeding up VLOOKUP (and MATCH) formulas by orders of magnitude!

Written by Peter Albert on . Posted in Data handling, Expert, Modeling

Executive summary

By default, you’re using FALSE as the last parameter of a VLOOKUP or MATCH formula to make sure that the proper result is returned. This approach is really slow on larger data sets. Alternatively, sorting the data and using TRUE will increase the overall search speed by a very large factor!!!

This post will explain how and when to use this technique – and what to watch out for.

Quick reference

  1. Sort your data ascending according to your key column (the first column of your VLOOKUP range)
  2. Replace your =VLOOKUP(A1,Sheet2!$A:$B,2,0) with =VLOOKUP(A1,Sheet2!$A:$B,2) (you can omit the TRUEas last parameter as this is the default)
  3. Watchout: In case you are not sure if A1 is contained in your list, use this formula: =IF(VLOOKUP(A1,Sheet2!$A:$B,1)=A1,VLOOKUP(A1,Sheet2!$A:$B,2),NA())

Detailed explanation

Imagine the old days when you would need to search for a phone number of a friend in the directory. Given the last name, you’d approach this task by going to the section with the matching first letter of the last name. There you’d again skip a few pages till the second letter matches – and then you’d probably start looking for the full match until you find your friend.

Now think of a typical VLOOKUP formula, say =VLOOKUP(A1,Sheet2!$A:$B,2,FALSE). When Excel is searching for the content of the cell A1 in column A of Sheet2, because of the FALSE as last parameter, Excel will actually check for the it cell by cell, starting from the top! If you want to find a match in a dataset with 100,000 elements, on average this searching would therefore involve 50,000 comparisons (which are computationally expensive if you’re searching for texts and not for numbers).

What most users don’t know is that Excel actually also supports another approach, similar to the intuitive “find a number in the directory” approach mentioned above. When replacing the last parameter with a TRUE(or leaving it blank, as this is the default), Excel will approach the search as follows:

  1. Assume the list to be searched is sorted ascending
  2. Start in the middle (e.g. in row 50,000) – if the value you’re searching for is larger than the this element, take the “upper” half of the elements, else the “lower” half
  3. Of this new half (e.g. 1-50,000), take again the middle (25,000), do the comparison and determine the quarter to search in
  4. Continue narrowing down the range until only one cell is left and return this cell

Using this approach (it’s called binary search in Computer Science) is really efficient. In the example with 100,000 elements, it would take a maximum of 17 comparisons until the match is found. And checking for one million elements is a mere 20 comparison in the worst case!!!

The method is incredibly fast, but has two severe down-sides: it assumes the list is sorted – and it will always return a result, even if there is no exact match!

If the data is not sorted, this will result in completely wrong results, as the element in the middle does not say anything about the other elements. Therefore, always make sure to sort always your data in ascending order !

The second problem can be handled with a small trick: You first use VLOOKUP to return the first column of your target table. If the list contains the element you were looking for, this will simply return you the original item. However, in case the element is not in the list, this will return some other element (the next closest value in fact) – and you therefore know that the element is missing and can handle it accordingly. Therefore, use this formula: =IF(VLOOKUP(A1,Sheet2!$A:$B,1)=A1,VLOOKUP(A1,Sheet2!$A:$B,2),NA()).

You might complain that this is again inefficient, as it’ll search for A1 twice – and I’ll agree in principle. But 16 searches times two is still much lower than 50,000, so I actually wouldn’t mind in most cases! And if you really want to optimize, then check out the INDEX/MATCH technique and extract the MATCH part to a helper column. Then your formula would be =IF(INDEX(Sheet2!$A:$A,HelperCellWithMatchResult)=A1,INDEX(Sheet2!$B:$B,HelperCellWithMatchResult),NA())

As small side remark: the binary search works with both, numbers and texts as keys.

So what are you waiting for, literally? Using only this technique and the INDEX/MATCH replacement, I recently managed to reduce calculation time of a larger model from 17 minutes to a mere 5 seconds!

Tags: , , ,

Trackback from your site.

Peter Albert

Worked 5 years as Consultant and Project Manager at McKinsey & Company and Bain & Company. Now builds Excel, other models and web applications for consultancies and other clients - and provides Excel training to consultants at all skill levels.

Comments (1)

  • CB

    |

    Thank you sir… vlookuping 425,000 rows vs a 27,000 row spreadsheet and finally got fed up and searched for a solution to speed things up… This worked very well! Turned a 7 minute process into a 2 second process.

    Reply

Leave a comment