The Quagmire Challenge: Extracting Text Strings of Uneven Length (April 2017)
The Question
Most of us are comfortable using text functions rather than Text to Columns to parse and clean up the text strings that are often dumped from a central G/L or database or system. We can use FIND and other functions to extract what we need but very often these functions work best from the left side of the text string. A recent client had this issue and it forms the basis for our current Quagmire challenge.
If you click here you can download the spreadsheet with the sample data for the problem.
Pull the numbers from the righthand side of the string. The strings are of uneven length as are the numbers. The only constants are that the delimiting character (the “|”) is always the same and the number always appears at the end of the string.
The Solution
For each of these solutions we assume that the string is in cell B2.
Method 1 – Find Numbers
This is the simplest method and we liked it for its transparency. Build ten formulas – one for each number – like the following:
=IFERROR(FIND(“1″,B2),”NA”)
The first number must occupy the closest cell to the left so using a MIN function on the 10 previous functions will find the location of the first number. Once we know that location we are done since the number is always followed by ” |” at the end of the string. Thus to pull out the number we use:
=MID(B2,location of first number,LEN(B2)-location of first number-1)
To ensure that the extracted number can be used as a number you could use the INT function, multiply the result by 1 or subtract 0.
CAVEAT – while we like this solution, it has a problem that may arise. In our example we did not have numbers in the text strings until the final number but there will be a problem if there was a text string:
Operations | Technical Support | Router 2 Maintenance | US Dept 342 | 2342034 |
The formula will pick the “2” following router and will pull the wrong text. The solution shown in Method 2 avoids this problem.
Method 1A – Array Formulas
Before moving on to Method 2 we should mention that many folks sent us some very clever solutions that employed the power of Excel’s ability to perform calculations on arrays of numbers. These “array formulas” can be used to search for all ten numbers in a single formula. Array formulas can be identified by the appearance of “{ }” brackets around or in the formula. They are often referred to as CSE functions for the CTRL + SHIFT + ENTER key stroke that is needed to enter them properly.
Array functions are too complex to do a proper discussion of in this article but we normally do not use them despite their power. For anyone who has taken our modelling course, you know that the number one attribute of a good Excel model is transferability – the ability for a new model owner to quickly understand and take over operation of someone else’s financial model. The concern we have is best expressed by this cautionary note on Microsoft’s office support website:
“Other users of your workbook might not understand your formulas. In practice, array formulas are generally not explained in a worksheet. Therefore, if other people need to modify your workbooks, you should either avoid array formulas or make sure those people know about any array formulas and understand how to change them, if they need to.”
Despite our misgivings we wanted to be fair to our contestants since we did not stipulate our usual reluctance to use array formulas. One version of the solution that many people submitted, employed the following function combination as part of the solution in Method 1:
=MIN(FIND({1,2,3,4,5,6,7,8,9,0},B2&”0123456789″))
This formula does not require CTRL+ SHIFT + ENTER to work and it repetitively searches for the location of the 10 numbers in the string in one formula. Why the “&”0123456789” at the end of the formula? Well this is a little trick instead of using an IFERROR function. By adding all the numbers to the string, the FIND function never returns an error. Since they are added at the end of the string, the additional numbers will never be selected by the MIN function.
Method 2 – Substitution
This is our preferred solution since it has no complicated formulas and bases the result on location rather than just finding the first occurrence of a number. In our case, we want the location that is situated between the last and second to last delimiter – the “|”.
Step 1 – how many delimiters do we have? The trick with this is to count the current string length and then remove the delimiters and compare the length. The difference must be the number of delimiters.
=LEN(B2)-LEN(SUBSTITUTE(B2,”|”,””))
The SUBSTITUTE function is going to get used twice. This first time we removed the delimiter and replaced them with nothing – “” – which made the text string shorter by the number of delimiters. Let’s assume the answer is six. That means we need to find the number after the fifth, second to last, delimiter.
Step 2 – create a unique identifier for the fifth delimiter so that it can be located using a FIND function.
=SUBSTITUTE(B2,”|”,”~”,5)
Of course, we would use the formula that returns the number 5 rather than hardcoded number 5 above. The interesting thing about SUBSTITUTE is that it is one of the few text functions that lets you choose where to operate. By changing the fifth delimiter to the unique character “~” (assuming that the “~” never appears in your text string) we can now use FIND to locate it since it will uniquely identify the start of the number sequence in the text string.
=FIND(“~”,CELL CONTAINING THE SUBSTITUTE FORMULA)
This locates the delimiter and the number will be two spaces to the right (location +2). Now that the number location is located we can use a MID function similar to the one used in Method 1 to extract the number.