Product Matching Across E-Commerce Sites: Algorithms That Actually Work
A technical deep-dive into fuzzy matching, AI-powered semantic matching, and vector embeddings for cross-site product comparison in e-commerce.
Scraping competitor prices is the easy part. The hard part is figuring out which of your products corresponds to which of theirs.
"Premium Glass Jar 4oz Clear" from your catalog might be the same as "4 oz Clear Glass Container - Premium Quality" from a competitor. Or it might not — maybe theirs has a different closure type. Getting this right is the foundation of useful competitive intelligence.
Here's a technical look at the algorithms that solve this problem, their tradeoffs, and how to combine them for reliable results.
The Challenge
Product matching across e-commerce sites is hard because:
- Naming conventions vary wildly. "1oz Mylar Bag" vs "Mylar Pouch 1 Ounce" vs "1-oz Flat Pouch, Mylar" — all the same product, all described differently.
- Attributes are embedded in titles. Size, color, material, and closure type are mashed into the product name rather than structured as separate fields. Extracting and comparing them requires parsing.
- Partial matches matter. Two products might be 80% similar — same material, same size, but different closure type. Whether that's a "match" depends on your business context.
- Scale compounds the problem. With 500 brand products and 5,000 competitor products, there are 2.5 million potential pairs to evaluate. Brute-force comparison doesn't work.
Algorithm 1: Fuzzy Text Matching
Fuzzy matching compares product name strings using text similarity metrics. The core idea is to measure how similar two strings are after normalizing for common differences like word order and formatting.
How It Works
- "Premium Glass Jar 4oz Clear" vs "4 oz Clear Glass Container Premium"
- Despite different word order and formatting, a fuzzy matcher recognizes these are highly similar
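A minimal token-sort sketch of this idea in pure Python. The `normalize` and `fuzzy_score` helpers are illustrative, not any particular library's API; production systems often use a dedicated fuzzy-matching library instead:

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, and sort tokens so word order is ignored
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))

def fuzzy_score(a: str, b: str) -> float:
    # Similarity in [0, 1] between the two normalized names
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# The reordered-but-equivalent pair scores much higher than an unrelated pair
same = fuzzy_score("Premium Glass Jar 4oz Clear",
                   "4 oz Clear Glass Container Premium")
different = fuzzy_score("Premium Glass Jar 4oz Clear",
                        "Heavy Duty Zip Tie 100 Pack")
```

Token sorting handles the word-order problem; it does nothing for synonyms or abbreviations, which is exactly the gap the next two algorithms fill.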
Strengths
- Fast. Processes thousands of comparisons per second with no external calls
- No external dependencies. Runs locally, no API calls
- Predictable. Same inputs always produce the same score
- Good for obvious matches. Products with similar names score high
Weaknesses
- Semantic blind spots. Doesn't know that "CR" means "child-resistant" or that "1oz" and "1 ounce" are the same
- Noise sensitivity. Marketing language ("Best Seller!", "NEW!") in titles reduces match accuracy
- No attribute awareness. Treats all words equally — can't distinguish size from color from material
When to Use
Fuzzy matching is a great first pass. With the right threshold, you'll catch the straightforward matches with high confidence. Products below the threshold need a smarter approach.
Algorithm 2: AI-Powered Semantic Matching
Language models understand product semantics. They know that "4oz" and "4 ounce" are equivalent, that "mylar" and "metalized polyester" refer to the same material, and that "pop top" is a closure type.
How It Works
The product names (plus any available attributes or prices) are sent to a language model, which judges whether two listings describe the same physical product and returns a verdict with a confidence score and its reasoning.
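A minimal sketch of the prompt-and-parse flow. The prompt wording, JSON schema, and `build_match_prompt`/`parse_verdict` helpers are all illustrative assumptions; the actual model call (OpenAI-style or otherwise) is omitted:

```python
import json

def build_match_prompt(brand_product: str, competitor_product: str) -> str:
    # Ask the model for a structured verdict so the reply is machine-readable
    return (
        "Are these two e-commerce listings the same physical product?\n"
        f"Product A: {brand_product}\n"
        f"Product B: {competitor_product}\n"
        'Reply with JSON only: {"match": true|false, "confidence": 0.0-1.0, "reason": "..."}'
    )

def parse_verdict(reply: str) -> dict:
    # Parse the model's JSON reply; treat malformed output as "no match"
    try:
        verdict = json.loads(reply)
        return {
            "match": bool(verdict.get("match")),
            "confidence": float(verdict.get("confidence", 0.0)),
            "reason": str(verdict.get("reason", "")),
        }
    except (ValueError, TypeError, AttributeError):
        return {"match": False, "confidence": 0.0, "reason": "unparseable reply"}
```

Defaulting malformed replies to "no match" is a deliberate choice: given the hallucination risk noted below, a false negative that falls through to human review is cheaper than a false positive.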
Strengths
- Semantic understanding. Handles synonyms, abbreviations, and domain knowledge
- Context-aware. Can use industry profile (categories, common terms) to improve accuracy
- Explains reasoning. The model can articulate why two products match or don't match
- Handles ambiguity. Can flag "possible matches" for human review
Weaknesses
- Cost. Each matching call costs money (though modern AI models are increasingly affordable)
- Latency. API calls add latency compared to local computation
- Non-deterministic. The same inputs might produce slightly different results across runs
- Hallucination risk. The model might confidently match products that aren't actually the same
When to Use
AI matching is ideal as a second pass after fuzzy matching. Run fuzzy first to catch the easy matches, then send the remaining unmatched products to the LLM for semantic analysis.
Algorithm 3: Vector Embeddings
Vector embeddings represent product names as high-dimensional numerical vectors. Similar products have vectors that are close together in embedding space, regardless of how differently they're worded.
How It Works
Each product name is embedded once; matching then becomes a nearest-neighbor search in the resulting vector space.
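The core comparison is cosine similarity between vectors. A sketch with toy 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and the numbers here are made up for illustration):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": related products end up pointing in similar directions
jar = [0.9, 0.1, 0.2]
container = [0.85, 0.15, 0.25]
zip_tie = [0.1, 0.9, 0.3]
```

The jar/container pair scores far higher than jar/zip-tie, even though "jar" and "container" share no characters — that semantic similarity comes from the embedding model, not from string comparison.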
Strengths
- Scales efficiently. Embedding generation is a one-time cost per product. Similarity search is fast with proper indexing
- Language-agnostic similarity. Captures semantic meaning without explicit rules
- Incrementally updateable. New products get embedded once and are immediately searchable
Weaknesses
- Black box. Hard to explain why two products matched or didn't
- Requires infrastructure. Needs a database with vector search capabilities
- Embedding quality varies. General-purpose embeddings may not capture domain-specific nuances (e.g., packaging terminology)
When to Use
Vector search works well as a candidate retrieval step. Find the nearest neighbors for each product, then use fuzzy matching or AI to confirm the actual match.
The Hybrid Approach
The most reliable strategy combines all three algorithms:
Pass 1: Candidate Retrieval with Embeddings
Generate embeddings for all products. For each brand product, retrieve the most similar competitor products by vector similarity. This reduces the search space from thousands to a manageable candidate set.
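A sketch of the retrieval step, assuming embeddings are already computed and unit-normalized (so cosine similarity reduces to a dot product); the `top_k_candidates` helper is hypothetical, and in practice a vector database would do this with an index rather than a scan:

```python
import heapq

def top_k_candidates(query_vec: list, competitors: list, k: int = 10) -> list:
    # competitors: list of (name, unit_vector) pairs.
    # With unit-normalized embeddings, cosine similarity is just the dot product.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return heapq.nlargest(k, competitors, key=lambda item: dot(query_vec, item[1]))
```

With k = 10, each of 500 brand products now has 10 candidates instead of 5,000 — the 2.5 million pairs from earlier shrink to 5,000 pairs for the later passes.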
Pass 2: Fuzzy Scoring
Run fuzzy matching on all candidate pairs. High-scoring products are accepted as matches. Mid-range scores go to the AI pass for deeper analysis.
Pass 3: AI Confirmation
Send ambiguous candidates to an AI model for semantic evaluation. The model provides a confidence score and reasoning for each potential match.
Pass 4: Human Review
Products that the AI is uncertain about get flagged for manual review. This is typically a small percentage of the total — a manageable workload.
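Tying passes 2 through 4 together, the routing decision for each candidate pair can be sketched as follows (both thresholds are illustrative; tune them against a labeled sample of pairs from your own catalog):

```python
def route_pair(fuzzy_score: float, accept: float = 0.85, review: float = 0.60) -> str:
    # High scores are accepted outright, mid-range scores go to the AI pass,
    # and low scores are rejected without spending an API call
    if fuzzy_score >= accept:
        return "auto-accept"
    if fuzzy_score >= review:
        return "ai-review"
    return "reject"
```

The key property is that the expensive, non-deterministic AI pass only ever sees the ambiguous middle band, which keeps both cost and hallucination exposure bounded.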
Practical Considerations
Price Ratio Guards
If your product costs $5 and the potential match costs $500, they're probably not the same product regardless of name similarity. Apply a price ratio guard to reject matches where the prices are wildly different.
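A sketch of such a guard; the 3x default ratio is an illustrative assumption, not a recommendation:

```python
def price_ratio_ok(price_a: float, price_b: float, max_ratio: float = 3.0) -> bool:
    # Reject pairs whose prices differ by more than max_ratio (e.g. $5 vs $500).
    # Non-positive prices usually indicate missing data, so reject those too.
    if price_a <= 0 or price_b <= 0:
        return False
    return max(price_a, price_b) / min(price_a, price_b) <= max_ratio
```

Using a ratio rather than an absolute difference makes the guard work equally well for $5 jars and $500 equipment.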
Stale Product Detection
Products whose data hasn't been refreshed by a recent scrape should be flagged. They might be discontinued or out of stock, which makes any match against them unreliable.
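A minimal staleness check; the 14-day default is an illustrative assumption and should track your actual scrape cadence:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_scraped: datetime, max_age_days: int = 14) -> bool:
    # Flag products whose data hasn't been refreshed within the window
    return datetime.now(timezone.utc) - last_scraped > timedelta(days=max_age_days)
```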
Confidence Tracking
Track the confidence distribution of your matches over time. If average confidence is dropping, it might indicate that competitors are changing their naming conventions or that your product catalog has shifted.
Implementation in VantageDash
VantageDash implements all three algorithms. The Comparison page shows matched products with confidence scores, and you can run fuzzy matching, AI matching, or hybrid matching from the dashboard. Product embeddings are stored in our database with vector search support, enabling fast similarity lookups across thousands of products.
Match results include confidence scores, reasoning from the AI model, and price-per-unit comparisons to help you make informed pricing decisions.