When faced with a catalog of many thousands of products, trying to fix every problem by hand can be time-consuming and frustrating. When the same kinds of problems show up across many product titles, you may be tempted to look for a faster solution. If your catalog is in a spreadsheet, you may do a search and replace-all. If you’re using a PIM, you may write a custom rule to substitute the problem text with the correction. This approach has four problems.
- The first problem is knowing exactly which problems need fixing. You have to manually discover every typo, every incorrect abbreviation, and every pattern of improper punctuation in every product title. This is hard enough with a catalog that doesn't change often. If you have high product turnover, it becomes burdensome, because you must search for new problems every time the catalog is updated. Flexible matching with regular expressions can help with finding variations on simple patterns. For example, "50% off", "55%off", "60% OFF", and "75%-Off" could all be discovered with a single flexible pattern. But this kind of matching won't work for all kinds of product title problems.
- The second problem is that you have to do many replacements, or create and maintain many different rules. Keeping a rule for every typo you discover becomes unwieldy and still won't catch new typos. Even if your PIM had a rule to automatically replace typos with the most likely English word, you would still have problems with brands and trademarks that aren't in a standard dictionary.
- The third problem is that solving one problem sometimes causes others. For example, removing a promotional word with a rule may leave two consecutive commas in the fixed title, creating a new punctuation problem. Rules therefore have to be applied in a particular order, and sometimes rechecked after another rule has been applied.
- The fourth and hardest problem is context. Sometimes what is a problem in one product title is acceptable, or even required, in another. Imagine you have created rules to standardize colors for your Google Shopping feed so that "maroon", "scarlet", and "crimson" are all displayed as "red". This works until you discover that your products from the brand "Scarlet" are now wrong, or that you are advertising the DVD for "Crimson Tide" as "Red Tide". Or suppose you want to list your products on Amazon Marketplace, whose guidelines state that you should spell out "9-inch" instead of using a double quote, as in 9". A replace-all on double quotes will quickly mess up titles that legitimately contain quoted phrases, such as those for T-shirts. Even something as simple as getting capitalization right is non-trivial when you consider MPNs, trademarks, abbreviations, acronyms, units of measurement, and prepositions.
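To make three of these pitfalls concrete, here is a small Python sketch. The pattern, the rule table, and the sample titles are all made up for illustration; they are not taken from any particular PIM or spreadsheet tool. It shows one flexible pattern catching the discount variants, a removal rule leaving stray commas behind, and a color rule mangling a movie title.

```python
import re

# Hypothetical pattern: digits, "%", optional spaces or hyphens, then "off" in any case.
PROMO = re.compile(r"\b\d{1,3}\s*%[\s-]*off\b", re.IGNORECASE)

samples = ["50% off", "55%off", "60% OFF", "75%-Off"]
promo_hits = [bool(PROMO.search(s)) for s in samples]  # every variant matches

# Side effect: removing the promotional text leaves two commas in a row.
title = "Leather Wallet, 50% off, Brown"
after_removal = PROMO.sub("", title)  # "Leather Wallet, , Brown"

# A second rule has to run afterwards to collapse the repeated commas.
cleaned = re.sub(r"\s*,(\s*,)+", ",", after_removal)  # "Leather Wallet, Brown"

# Context problem: a naive color rule has no idea what a word means in a title.
COLOR_RULES = {"maroon": "red", "scarlet": "red", "crimson": "red"}
COLOR = re.compile(r"\b(" + "|".join(COLOR_RULES) + r")\b", re.IGNORECASE)

def normalize_colors(text):
    """Replace color synonyms, keeping a leading capital if the original had one."""
    def swap(match):
        word = match.group(0)
        new = COLOR_RULES[word.lower()]
        return new.capitalize() if word[0].isupper() else new
    return COLOR.sub(swap, text)

print(normalize_colors("Maroon Cotton Scarf"))  # intended: "Red Cotton Scarf"
print(normalize_colors("Crimson Tide (DVD)"))   # unintended: "Red Tide (DVD)"
```

Note that the comma cleanup only works if it runs after the promo removal, which is exactly the ordering dependency described in the third problem.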
What's a better way to fix product titles?
In our product title quality analysis, we use a variety of fuzzy matching techniques to find typos, promotional text, incorrect punctuation, and similar problems. We analyze titles contextually to try to identify brands and trademarks, which helps remove false positives. And we analyze semantically within product categories, because different kinds of products have different requirements and expectations.
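As a rough illustration of why brand and trademark awareness reduces false positives, here is a simplified Python sketch. It is not the analysis described above: the protected-phrase list, the color table, and the function name are hypothetical, and a plain lookup list stands in for real contextual detection.

```python
import re

# Hypothetical data: color synonyms to normalize, and phrases that must stay untouched.
COLOR_SYNONYMS = {"maroon": "red", "scarlet": "red", "crimson": "red"}
PROTECTED_PHRASES = ["Crimson Tide", "Scarlet"]  # e.g. a film title and a brand name

COLOR = re.compile(r"\b(" + "|".join(COLOR_SYNONYMS) + r")\b", re.IGNORECASE)

def normalize_colors_safely(title, protected=PROTECTED_PHRASES):
    """Normalize color words, skipping any that fall inside a protected phrase."""
    # Character spans covered by protected phrases (case-sensitive on purpose).
    spans = [m.span() for p in protected
             for m in re.finditer(re.escape(p), title)]

    def swap(match):
        inside = any(start <= match.start() and match.end() <= end
                     for start, end in spans)
        if inside:
            return match.group(0)  # leave brands and trademarks alone
        word = match.group(0)
        new = COLOR_SYNONYMS[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return COLOR.sub(swap, title)

print(normalize_colors_safely("Crimson Tide (DVD)"))     # unchanged
print(normalize_colors_safely("Scarlet Running Shoes"))  # unchanged
print(normalize_colors_safely("scarlet wool scarf"))     # "red wool scarf"
```

The sketch still has an obvious blind spot: a title that begins with the color word "Scarlet" is treated as the brand. A static list cannot settle that ambiguity, which is why category and context matter.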