Semantic Product Matching

30.01.24 by Mundher Al-Shabi


With the growing popularity of online shopping, e-commerce, and instant grocery delivery, there is an increasing need for an algorithm that can identify products in our inventory that are similar to those offered by competitors.

This technology not only enables us to develop more informed pricing strategies but also helps us understand the differences in product variety between us and our competitors. Additionally, it provides a valuable tool for identifying any duplicate items within our own product range, ensuring a more diverse and efficient assortment for our customers.

In this article, we will explore our method for solving the product-matching problem using only product titles. However, it’s worth noting that this technique can also be applied to images and enhanced with additional product attributes like price and size. 

Essentially, our goal is to take a given product title p and find a corresponding product title within an unordered set S = {t_1, t_2, …, t_n}. This set S may contain our own product titles, which is useful for finding duplicates in our assortment, or competitor product titles, which is particularly useful for assortment gap analysis.

Lexical Matching (First Approach)

Lexical matching first breaks down the titles p and t into words (a bag of words). We then count the words that overlap between the two titles and normalize by the total number of distinct words across both. This similarity metric is known as Intersection over Union (the Jaccard index).
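As a minimal sketch of this metric (lower-casing and whitespace tokenization are simplifying assumptions; a production tokenizer would also handle punctuation):

```python
def iou_similarity(title_a: str, title_b: str) -> float:
    """Intersection over Union (Jaccard index) of the two titles' word sets."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# Only "pepsi" overlaps, out of three distinct words -> 1/3
print(iou_similarity("Pepsi 1000ml", "Pepsi 1L"))
```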

This method can be further refined by incorporating Term Frequency (TF) and Inverse Document Frequency (IDF), as is done in the BM25 similarity score; a short sketch follows the list below. The key advantages of this technique include:

  • Efficiency with Inverted Index: We can employ an inverted index for rapid searching of matching words. This makes the process highly efficient, especially when dealing with a large set, S.
  • Availability of Tools: There are several libraries and tools, such as Lucene, that facilitate the implementation of this method.
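As an illustrative sketch of BM25 scoring over a set of titles, here is one using the open-source rank_bm25 package (the package choice and whitespace tokenization are assumptions; a Lucene-based setup exposes the same idea):

```python
from rank_bm25 import BM25Okapi

corpus = ["Pepsi 1L", "CocaCola 250ml", "Marigold HL Chocolate Milk 200ml"]
tokenized_corpus = [title.lower().split() for title in corpus]

# The index precomputes term statistics (TF/IDF), enabling fast scoring.
bm25 = BM25Okapi(tokenized_corpus)

query = "pepsi 1000ml".split()
print(bm25.get_scores(query))              # one BM25 score per corpus title
print(bm25.get_top_n(query, corpus, n=2))  # the k best lexical candidates
```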

However, it’s important to note a fundamental limitation of Lexical Matching: it assumes that for words in p and t to match, they must be identical. This assumption isn’t always accurate, as the same product can be described using different terms. Here are some examples to illustrate this point:

  • Unit and size: “Pepsi 1000ml” vs. “Pepsi 1L”
  • Misspelling: “Coca-Cola 250ml” vs. “CocaCola 250ml”
  • Missing words: “Marigold HL Chocolate Flavoured Milk 200 ml” vs. “Marigold HL Chocolate Milk 200ml”

This limitation points to the necessity of integrating more sophisticated language understanding methods that can recognize synonyms, paraphrases, and contextually similar phrases, thereby enhancing the accuracy of product matching in diverse and complex retail environments.

Semantic Encoder (Second Approach)

The Semantic Encoder is a model designed to transform text into embeddings that encapsulate the deeper semantic meaning of the text. Essentially, it goes beyond mere surface-level analysis and delves into the underlying context and nuances conveyed by the words. For instance, when we calculate the Euclidean distance between the embeddings of two different texts that convey the same or very similar meanings, this distance should be notably small.

For our specific needs, we utilized the SBERT (Sentence-BERT) library, which comprises a pre-trained Transformers-based Large Language Model (LLM). This model has been further fine-tuned using a Siamese Network architecture. To tailor the model more closely to our requirements, we conducted this fine-tuning process using our own internal dataset. This dataset is composed of pairs of product titles, each pair labelled as either ‘matched’ or ‘not-matched’.
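A minimal fine-tuning sketch with the sentence-transformers library, assuming a contrastive (Siamese) objective over labelled pairs; the checkpoint name and the example pairs are illustrative assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a pre-trained SBERT checkpoint (the name is an assumption).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pairs of product titles labelled matched (1) / not-matched (0);
# these examples are made up for illustration.
train_examples = [
    InputExample(texts=["Pepsi 1000ml", "Pepsi 1L"], label=1),
    InputExample(texts=["Pepsi 1L", "CocaCola 250ml"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# The contrastive (Siamese) objective pulls matched pairs together
# and pushes non-matched pairs apart in embedding space.
train_loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```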

The advantage of a Semantic Encoder over traditional lexical matching methods is that it captures the meaning of the entire title rather than just looking at word overlap. This means it can detect semantic similarity even when the titles use different words. For example, a Semantic Encoder can understand that “fast USB charger” and “quick charging USB adapter” are semantically similar, even though they share few words in common.
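As a sketch, scoring that pair with an encoder might look like this (the checkpoint name is an assumption; cosine similarity is shown, which on normalized embeddings ranks pairs the same way as Euclidean distance):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed checkpoint
emb_a = model.encode("fast USB charger", convert_to_tensor=True)
emb_b = model.encode("quick charging USB adapter", convert_to_tensor=True)

# High score despite almost no word overlap between the two titles.
print(util.cos_sim(emb_a, emb_b).item())
```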

However, the Semantic Encoder approach has some limitations:

  • Contextual limitation: since each title is encoded independently, these models can miss the nuanced interplay between two pieces of text (such as differences in unit or size).
  • Missing important keywords: because the embedding has a limited size, the encoder focuses on a high-level semantic representation rather than on specific keywords. Yet certain keywords, such as brand names, are crucial for matching and may be lost in the representation.

Retrieval-Rerank (Third Approach)

The Retrieval-Rerank approach is a well-established method to enhance accuracy in information retrieval tasks, effectively balancing speed and precision. This method involves two distinct stages: an initial fast but less precise retrieval phase, followed by a slower, more accurate reranking phase. The primary objective of the first stage is to narrow down the field of potential matches, given that the second stage is more time-consuming.

First Stage: Retrieval

In the retrieval stage, the goal is to quickly generate a list of k candidates, or potential matches. For this purpose, we can use either the first or the second approach. In our specific case, we opted for the first approach, Lexical Matching, driven by its cost-effectiveness and efficiency: it is well-suited to rapidly processing a large number of documents to extract a manageable subset of candidates.

Second Stage: Reranking

For the reranking stage, we employed a transformer-based cross-encoder model. Unlike an encoder-only model, which processes inputs independently and generates embeddings for each, the cross-encoder examines pairs of inputs together. This joint processing allows the model to capture the interactions between the texts, leading to a significantly higher accuracy. 

This feature is particularly valuable for tasks that demand an in-depth understanding of the relationships between texts, such as discerning subtle nuances or contextual similarities. However, the cross-encoder’s increased computational intensity is precisely why it’s reserved for the second stage, where it operates on a reduced set of candidates identified in the first stage.
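As a sketch of the reranking step with the sentence-transformers CrossEncoder class (the checkpoint name and the candidate titles are assumptions; in practice the model would be the one fine-tuned on our labelled pairs):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint

query = "Pepsi 1000ml"
candidates = ["Pepsi 1L", "Pepsi Max 1.5L", "CocaCola 250ml"]  # from stage one

# Each (query, candidate) pair is processed jointly by the model.
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0])  # best match after reranking
```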

By combining these two stages, the Retrieval-Rerank approach strikes a balance between operational efficiency and accuracy. The first stage efficiently filters the vast pool of data, while the second stage applies a more rigorous, detailed analysis to the narrowed-down set, ensuring high-quality final results without the prohibitive computational cost of applying the cross-encoder to the entire dataset.
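Putting the two stages together, a compact end-to-end sketch might look like this (it reuses the assumed rank_bm25 and CrossEncoder components from the earlier sketches; k is the tunable candidate budget for the first stage):

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

def match_title(query: str, corpus: list[str], k: int = 50) -> str:
    """Stage 1: cheap lexical retrieval of k candidates.
    Stage 2: accurate cross-encoder reranking of those candidates only.
    (In practice the index and model are built once, not per call.)"""
    bm25 = BM25Okapi([t.lower().split() for t in corpus])
    candidates = bm25.get_top_n(query.lower().split(), corpus, n=min(k, len(corpus)))
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed
    scores = reranker.predict([(query, c) for c in candidates])
    return max(zip(candidates, scores), key=lambda pair: pair[1])[0]
```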

Figure 1: Retrieval-Rerank architecture. Note that the dotted lines exist only during training, while the solid lines exist during both training and inference.

Hard Negative Sampling

Hard negative sampling is a concept that revolves around the selection of negative samples that are challenging for the model to correctly identify, hence the term “hard” negatives. This technique is often used to improve the performance of models by ensuring they learn to distinguish between truly similar and dissimilar cases.

To effectively identify hard negative samples, we employed the encoder-based approach as our primary tool. This method involves finding pairs of data that are not matches (according to the labels), yet possess embeddings that are surprisingly close to each other, surpassing a predefined similarity threshold. These pairs are considered ‘hard negatives’ because, despite their lack of a direct match, their embeddings suggest a high degree of similarity, making them challenging for the model to classify accurately. Once these hard negative samples were identified, we used them to fine-tune our cross-encoder models.
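A minimal sketch of this mining step with the encoder from earlier (the checkpoint name, the threshold value, and the labelled pairs are illustrative assumptions):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed checkpoint

# Labelled pairs of (title_a, title_b, is_match); examples are made up.
pairs = [
    ("Pepsi 1L", "Pepsi Max 1L", 0),
    ("Pepsi 1L", "Marigold HL Chocolate Milk 200ml", 0),
    ("Pepsi 1000ml", "Pepsi 1L", 1),
]
THRESHOLD = 0.8  # assumed cut-off; tuned on validation data in practice

hard_negatives = []
for a, b, is_match in pairs:
    if is_match:
        continue  # only non-matching pairs can be hard negatives
    sim = util.cos_sim(model.encode(a, convert_to_tensor=True),
                       model.encode(b, convert_to_tensor=True)).item()
    if sim > THRESHOLD:
        # Not a match by label, yet the embeddings are very close:
        # a hard negative for cross-encoder fine-tuning (label 0).
        hard_negatives.append((a, b))
```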

Conclusion

Our method, which focuses on product titles, employs a Retrieval-Rerank approach for efficient and accurate matching: fast Lexical Matching in the retrieval stage, followed by a cross-encoder in the reranking stage. This strategy, augmented by hard negative sampling, enables us to effectively identify similar products, manage assortment gaps, and maintain a diverse product range for our customers.
