Guides
Menu Deduplication Pipeline
Menu data from multiple sources (POS systems, aggregators, manual entry) inevitably contains duplicates. "Chicken Biryani", "Chiken Biryani", and "Murgh Biryani" are all the same dish. This guide walks through the full dedup pipeline.
How it works
- Load your menu data (CSV, database export, POS feed)
- Send all items to POST /dedup
- The API clusters duplicates and picks a canonical name per cluster
- Update your database with canonical references
Full example
import requests
import csv
API_KEY = "YOUR_KEY"
BASE = "https://embed.statode.com"
headers = {"X-API-Key": API_KEY, "Content-Type": "application/json"}
# Load menu from CSV
with open("menu.csv") as f:
    items = [row["item_name"] for row in csv.DictReader(f)]
# Deduplicate
resp = requests.post(f"{BASE}/dedup", headers=headers, json={"items": items})
data = resp.json()
print(f"Found {len(data['clusters'])} duplicate groups")
print(f"{data['duplicate_items']} excess items to remove")
for cluster in data["clusters"]:
    print(f"\nCanonical: {cluster['canonical']}")
    print(f"  Duplicates: {', '.join(m for m in cluster['members'] if m != cluster['canonical'])}")
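Step 4 of the pipeline, updating your database with canonical references, depends on your schema, but the core is a duplicate-to-canonical lookup built from the clusters. The sketch below assumes a hypothetical SQLite menu_items table with name and canonical_name columns; adapt the UPDATE to your own storage.

```python
import sqlite3

def canonical_map(clusters):
    """Build a {member name -> canonical name} lookup from /dedup clusters."""
    mapping = {}
    for cluster in clusters:
        for member in cluster["members"]:
            mapping[member] = cluster["canonical"]
    return mapping

def apply_canonical_names(db_path, clusters):
    # Hypothetical schema: menu_items(name TEXT, canonical_name TEXT).
    mapping = canonical_map(clusters)
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        for name, canonical in mapping.items():
            conn.execute(
                "UPDATE menu_items SET canonical_name = ? WHERE name = ?",
                (canonical, name),
            )
    conn.close()
```

Storing the canonical name in a separate column (rather than overwriting the original) preserves the raw source text, which you will want when re-running dedup after the next import.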
Handling large menus
The /dedup endpoint accepts up to 2,000 items per request. For larger menus, chunk your data:
def dedup_chunked(items, chunk_size=2000):
    all_clusters = []
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        resp = requests.post(f"{BASE}/dedup", headers=headers, json={"items": chunk})
        all_clusters.extend(resp.json()["clusters"])
    return all_clusters
Note that cross-chunk duplicates won't be caught. If you have more than 2,000 items, consider running /match on suspected pairs across chunk boundaries.
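One way to reconcile chunk boundaries is to compare the canonical names that came out of different chunks and send candidate pairs to /match. The payload and response fields below (item_a, item_b, is_match) are assumptions about the /match schema, not confirmed by this guide; check the endpoint reference before relying on them.

```python
import requests
from itertools import product

API_KEY = "YOUR_KEY"
BASE = "https://embed.statode.com"
headers = {"X-API-Key": API_KEY, "Content-Type": "application/json"}

def boundary_pairs(clusters_a, clusters_b):
    """All canonical-name pairs spanning two chunks' cluster lists."""
    canon_a = [c["canonical"] for c in clusters_a]
    canon_b = [c["canonical"] for c in clusters_b]
    return list(product(canon_a, canon_b))

def cross_chunk_matches(clusters_a, clusters_b):
    # "item_a"/"item_b" in the request and "is_match" in the response are
    # assumed field names -- verify against the /match documentation.
    matches = []
    for a, b in boundary_pairs(clusters_a, clusters_b):
        resp = requests.post(f"{BASE}/match", headers=headers,
                             json={"item_a": a, "item_b": b})
        if resp.json().get("is_match"):
            matches.append((a, b))
    return matches
```

Comparing every cross-chunk pair is quadratic in menu size, so in practice filter boundary_pairs down to suspected pairs first (for example, names that share a token) before calling /match.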
Threshold tuning
The default threshold works well for most menus. You can adjust it with the cosine_threshold parameter:
- 0.80 - Aggressive dedup. Catches more duplicates but may merge similar-but-different items (e.g., "Latte" and "Mocha").
- 0.85 - Default. Good balance of precision and recall.
- 0.90 - Conservative. Only merges near-identical items. Use this if false merges are costly.
resp = requests.post(f"{BASE}/dedup", headers=headers,
                     json={"items": items, "cosine_threshold": 0.80})
Tips
- Send raw menu text as-is. The API handles noise stripping and spelling normalization internally.
- The canonical name in each cluster is the cleanest, most complete form. Use it as your display name.
- Run dedup after every menu import, not just once. New data sources introduce new duplicates.
- Dietary conflicts (e.g., "Chicken Burger" vs "Veg Burger") are never merged, regardless of threshold.