Why is Fuzzy Matching Important?
As Product Data Analysts, we are checking our product titles and descriptions against our structured fields, looking to ensure the values are equivalent. But we can't rely on them being identical. If we flag every row where they are not identical, we will overwhelm the person validating data with a blizzard of mismatches that aren't really errors. On the other hand, if we don't do the check at all, we risk confusing potential customers, or worse yet, shipping them something they didn’t intend to buy.
Computers and humans have different data preferences. Structured data (data fields that contain a single value such as brand, color, size, weight, wattage) is perfect for computers - searching, filtering, and selecting - because the data is in formats that are far easier for computers to understand. Unstructured data in prose form is much more accessible to humans. We read titles, descriptions and bullet points when we are deciding if a product is what we are looking for.
Computers are naturally good at exact matching where two values are either equal or not equal. But they are not as good as humans are at understanding if values are similar or equivalent. That's where a technique called "Fuzzy Matching" comes in.
Structured and unstructured data fields must never contradict each other, but that doesn't mean they have to match exactly. The Product Title must always agree with both the structured data and the description.
For example, if I search for a men's T-shirt, I expect the computer to find shirts that are designed to be worn by men. If I clicked on a Product Title from the search results and the description said the shirt was for women, I would be confused. I wouldn't buy the product because I couldn't be sure which was correct: the Product Title or the description. But if the description contained "T-shirt for men", "men's T-shirt", or "T-shirt, man", I would have no problem recognizing that the structured data means the same thing as the product I intend to buy.
Similarly, if I searched for an Apple Watch and the product's brand field said "Apple Inc." or "Apple Incorporated", I would be confident I found what I was looking for. If it said "Epple", I wouldn't be sure. It might be a typo, or it might be a cheap knock-off brand. If it said "Casio", I'd know it wasn't what I was looking for and that the search wasn't doing a good job.
Without fuzzy matching, we would have no way to automatically validate our product feed and present searchers with exactly what they are looking for.
Fuzzy Matching: Strengths and Weaknesses of Common Techniques
1. Phonetic Matching (SoundEx or Metaphone)
- Is massively scalable. It is the only type of fuzzy matching where you can completely pre-compute an index, which is why it is commonly used in databases and search engines.
- Works well when two words are spelled differently but still sound the same.
- Is very flexible across many types of misspellings.
- Can be too fuzzy, matching words that are substantively different.
- May miss words that use a rare pronunciation of a particular letter combination (like "ph" vs "f").
- Completely ignores numbers, punctuation, and case.
- Can have issues with transpositions of consonants.
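To make the technique concrete, here is a minimal sketch of the classic American Soundex algorithm (Metaphone is more involved; in practice you would likely reach for a tested library implementation rather than this sketch):

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {c: d for d, group in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    digits = []
    prev = codes.get(word[0], 0)
    for c in word[1:]:
        d = codes.get(c, 0)
        if d and d != prev:       # skip repeats of the same code
            digits.append(str(d))
        if c not in "HW":         # H and W do not separate identical codes
            prev = d
    return (word[0] + "".join(digits) + "000")[:4]
```

"Robert" and "Rupert" both collapse to "R163", which is the whole point of the technique. It also shows the "ph" weakness mentioned above: because the first letter is kept verbatim, "phone" and "fone" get different codes even though they sound alike.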
2. Edit Distance Matching (Damerau-Levenshtein)
This measures exactly how many edits (insertions, deletions, replacements, or transpositions) have to be made to turn one string into another.
- Is a highly accurate fuzzy matching comparison
- It is slow when trying to match each item in one list against an item in another list (because the number of computations scales with the square of the size of the lists)
- It doesn't prioritize based on the nature of changes. Changing "-Test." to "Test" is the same edit distance as changing "Tint" to "Test".
- It has problems with short acronyms with and without periods ("C.A.T." vs "CAT")
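A straightforward dynamic-programming sketch of the restricted (optimal string alignment) variant of Damerau-Levenshtein:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted (optimal string alignment) Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

This makes the weaknesses above easy to verify: "-Test." vs "Test" and "Tint" vs "Test" both come out at distance 2, and "C.A.T." vs "CAT" costs 3 edits despite being the same acronym.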
3. Character NGrams-Jaccard Technique
This compares the unique sequential combinations of letters (n-grams) that appear in one word against those in the other. It considers repeating patterns of various lengths to be much more similar than other fuzzy matching techniques do ("ABCABC" and "ABCABCABCABC" are identical for an n-gram length <= 3). This can be good or bad depending on your needs.
- Faster than Levenshtein because you can compute the NGrams once per word, and just compare sets for each possible match
- It isn't scalable in the way that phonetic algorithms are, because you must run the comparison between each pair of words
- Weaker than Levenshtein overall, and especially weak at comparing acronyms with and without periods
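A minimal sketch of the n-gram/Jaccard comparison (lowercasing and the handling of words shorter than n are my assumptions; real implementations vary):

```python
def ngrams(word: str, n: int = 3) -> set:
    """Set of unique character n-grams; a word shorter than n is its own gram."""
    word = word.lower()
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets: |intersection| / |union|."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)
```

This reproduces the repeating-pattern behavior above: "ABCABC" and "ABCABCABCABC" share exactly the same trigram set {"abc", "bca", "cab"}, so their similarity is 1.0.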
4. Starts-with/Ends-with Matching
This considers two words similar when one of them starts or ends with the other.
- Helpful in dealing with compound words that may be treated differently within the same dataset. For instance, brands and trademarks are especially likely to be written with no space between the words, with a dash between the words, and with a space between the words. "Hash" and "Brown" are both fuzzily related to "Hashbrown"
- This technique has problems with short words that are also common prefixes and suffixes ("In", "site" are both found at the start or end of many words, but are not good fuzzy matches).
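The prefix/suffix problem suggests a minimum-length guard. A sketch, where the threshold of 4 characters is an illustrative assumption, not a tuned value:

```python
def affix_match(a: str, b: str, min_len: int = 4) -> bool:
    """True if one word starts or ends with the other.
    min_len guards against short common affixes like "in" or "site"
    matching far too many words."""
    a, b = a.lower(), b.lower()
    short, long_ = sorted((a, b), key=len)
    if len(short) < min_len:
        return False
    return long_.startswith(short) or long_.endswith(short)
```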
5. Acronyms & Abbreviations Matching
This checks against a fixed list of known words and their abbreviations.
- Better than the above-mentioned fuzzy matching techniques, which all have trouble seeing that abbreviations and acronyms are equivalent to the words they represent ("Co." means the same thing as "Company")
- Requires an outside source of knowledge about which abbreviations/acronyms go with each word
- Short abbreviations can mean different things depending on context, leading to false positive matches
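The lookup itself is simple; the hard part is the curated list. The table below is a tiny illustrative stand-in for the outside source of knowledge mentioned above:

```python
# Tiny illustrative table; a real system loads a curated abbreviation list.
ABBREVIATIONS = {
    "co.": "company", "co": "company",
    "inc.": "incorporated", "inc": "incorporated",
    "corp.": "corporation", "corp": "corporation",
}

def expand(token: str) -> str:
    """Replace a known abbreviation with its full word; pass others through."""
    token = token.lower()
    return ABBREVIATIONS.get(token, token)

def abbrev_equal(a: str, b: str) -> bool:
    """Two tokens match if they expand to the same word."""
    return expand(a) == expand(b)
```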
6. Semantic Meaning with Word Vectors Technique
This finds words that are fuzzily related because they appear in the same contexts as the word we are trying to match.
- Is able to find words that mean the same thing ("dim blue" vs "dark blue") in a way that no other fuzzy matching technique would discover
- Creating an appropriate Word Vector model for a dataset, especially a small one, is not an exact science. This can end up suggesting matches that are more coincidental than actually semantically identical in meaning.
- If you create a Word Vector model based on a generic body of text, it won't perform well on words that are specific to product data (brands, trademarks).
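Once words have vectors, "fuzzily related" usually means high cosine similarity. The vectors below are toy three-dimensional values I made up for illustration; a trained model (word2vec, GloVe, fastText) produces hundreds of dimensions learned from text:

```python
import math

# Toy vectors for illustration only, NOT from a trained model.
VECTORS = {
    "dim":  [0.9, 0.1, 0.2],
    "dark": [0.8, 0.2, 0.1],
    "blue": [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    mag = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / mag
```

With these made-up values, "dim" scores closer to "dark" than to "blue", which is the behavior a well-trained model would show for genuinely related words.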
Since each technique has different strengths and weaknesses, it is best to create a combined ensemble fuzzy matching algorithm that can flexibly find equivalent values in product data. My preferred combination is 1 through 4 for general use in product titles, and 1 through 5 for checking brands (since the list of common company abbreviations is a manageable size).
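As a self-contained sketch of how an ensemble check can be wired together (this is not the exact combination described above: it uses difflib's SequenceMatcher as a stand-in for the edit-distance component, and the thresholds are illustrative assumptions):

```python
import difflib

def fuzzy_match(a: str, b: str,
                ratio_threshold: float = 0.8,
                min_affix_len: int = 4) -> bool:
    """Flag a pair as equivalent if any technique in the ensemble fires."""
    a_l, b_l = a.lower(), b.lower()
    if a_l == b_l:
        return True
    # Edit-style similarity (stand-in for Damerau-Levenshtein).
    if difflib.SequenceMatcher(None, a_l, b_l).ratio() >= ratio_threshold:
        return True
    # Starts-with/ends-with check, guarded against short affixes.
    short, long_ = sorted((a_l, b_l), key=len)
    return len(short) >= min_affix_len and (
        long_.startswith(short) or long_.endswith(short))
```

Each technique becomes one clause, and a pair only needs to satisfy one of them; in a production validator you would typically also log which clause fired, so reviewers can see why a pair was accepted.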
Other applications for Fuzzy Matching in Product Data
Aside from data consistency checks to prevent confusion, we also use fuzzy matching to:
- Standardize structured values to be the same format.
- Find corrections for typos and abbreviations.
- Match search terms to product titles/descriptions for SQR analysis and performance optimization
- Automatically identify field names in data feeds