Built-in Preprocessing
Every text input to the dish-embed API is automatically cleaned before it is embedded, so you can send raw menu data as-is with no preprocessing on your end.
What it does
Noise stripping
Real menu data from POS systems, aggregator feeds, and manual entry contains junk:
| Input | After stripping |
|---|---|
| **NEW** 50% OFF Chicken Biryani | Chicken Biryani |
| Masala Dosa (Serves 2) - $12.99 | Masala Dosa |
| ***BESTSELLER*** Butter Naan [V] | Butter Naan |
| Pepperoni Pizza 12" Large | Pepperoni Pizza |
Stripped elements: prices, serving sizes, promotional markers, formatting characters, size indicators, dietary tags in brackets.
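To make the stripped categories concrete, here is a minimal Python sketch that reproduces the four table rows above with regular expressions. It is an illustration only: `strip_noise` and `NOISE_PATTERNS` are hypothetical names, and the patterns are assumptions written for these examples, not the rules the service actually applies.

```python
import re

# Hypothetical patterns written for this sketch; the service's real rules may differ.
NOISE_PATTERNS = [
    re.compile(r"\*+[^*]*\*+"),                              # promotional markers: **NEW**, ***BESTSELLER***
    re.compile(r"\b\d+%\s*off\b", re.IGNORECASE),            # discounts: 50% OFF
    re.compile(r"\((?:serves|for)\s+\d+\)", re.IGNORECASE),  # serving sizes: (Serves 2)
    re.compile(r"\$\d+(?:\.\d{1,2})?"),                      # prices: $12.99
    re.compile(r"\[[^\]]*\]"),                               # bracketed dietary tags: [V]
    re.compile(r'\b\d+"(?:\s*(?:small|medium|large)\b)?', re.IGNORECASE),  # sizes: 12" Large
]


def strip_noise(name: str) -> str:
    """Remove each noise pattern, then tidy leftover whitespace and separators."""
    for pattern in NOISE_PATTERNS:
        name = pattern.sub(" ", name)
    name = re.sub(r"\s+", " ", name)  # collapse runs of whitespace
    return name.strip(" -")           # drop dangling spaces and hyphens


if __name__ == "__main__":
    for raw in [
        "**NEW** 50% OFF Chicken Biryani",
        "Masala Dosa (Serves 2) - $12.99",
        "***BESTSELLER*** Butter Naan [V]",
        'Pepperoni Pizza 12" Large',
    ]:
        print(repr(raw), "->", repr(strip_noise(raw)))
```

You do not need to run anything like this yourself. The sketch only shows the kind of transformation to expect when comparing raw and cleaned names.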
Spelling normalization
Common misspellings and transliterations are normalized to canonical English forms:
| Input | Normalized |
|---|---|
| Panner Tikka | Paneer Tikka |
| Chiken Biryani | Chicken Biryani |
| Murgh Makhani | Chicken Makhani |
| Jhinga Curry | Prawns Curry |
This means "Panner Tikka" and "Paneer Tikka" produce identical embeddings without any work on your part.
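Conceptually, this kind of normalization behaves like a lookup against a table of known food-domain corrections. The Python sketch below assumes a word-level dictionary seeded only with the four rows above. `CORRECTIONS` and `normalize_spelling` are hypothetical names, and the real service may match phrases or use a different mechanism entirely.

```python
# Hypothetical correction table, seeded only with the examples on this page.
CORRECTIONS = {
    "panner": "paneer",
    "chiken": "chicken",
    "murgh": "chicken",
    "jhinga": "prawns",
}


def normalize_spelling(name: str) -> str:
    """Replace known misspellings and transliterations word by word."""
    words = []
    for word in name.split():
        fixed = CORRECTIONS.get(word.lower())
        if fixed is None:
            words.append(word)                # unknown word: leave untouched
        elif word[:1].isupper():
            words.append(fixed.capitalize())  # keep a leading capital
        else:
            words.append(fixed)
    return " ".join(words)


if __name__ == "__main__":
    for raw in ["Panner Tikka", "Chiken Biryani", "Murgh Makhani", "Jhinga Curry"]:
        print(raw, "->", normalize_spelling(raw))
```

Words that are not in the table pass through unchanged, which mirrors the documented behavior that only known corrections are applied.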
Cross-lingual handling
For items in non-Latin scripts (Japanese, Korean, Arabic, Chinese, Thai, etc.), the original script is preserved as-is. Normalization rules only apply to Latin-script text.
This means:
- "ラーメン" (ramen in Japanese) is kept unchanged
- "Ramen Noodle Soup" gets standard normalization
- Both produce similar embeddings because the model is multilingual
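The script check described above can be approximated with Python's standard `unicodedata` module. This is a sketch under assumptions: `is_latin_script` and `normalize_if_latin` are hypothetical helpers, and treating mixed-script text as non-Latin is a choice made for this example, since this page does not specify how the service handles mixed input.

```python
import unicodedata
from typing import Callable


def is_latin_script(text: str) -> bool:
    """Return True if the text has letters and every letter is a Latin letter."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    # Unicode names of Latin letters contain "LATIN", e.g. "LATIN SMALL LETTER A".
    return all("LATIN" in unicodedata.name(ch, "") for ch in letters)


def normalize_if_latin(text: str, normalizer: Callable[[str], str]) -> str:
    """Apply the normalizer only to Latin-script text; keep other scripts as-is."""
    return normalizer(text) if is_latin_script(text) else text


if __name__ == "__main__":
    # str.upper stands in for a real normalizer so the effect is easy to see.
    for raw in ["ラーメン", "Ramen Noodle Soup"]:
        print(raw, "->", normalize_if_latin(raw, str.upper))
```

The first input is returned unchanged because its characters are Katakana, while the second is passed to the normalizer.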
What it does NOT do
- It does not change the meaning of items. "Chicken" stays "Chicken", never becomes "Poultry".
- It does not remove legitimate modifiers. "Spicy Chicken Wings" stays as "Spicy Chicken Wings".
- It does not translate between languages, beyond the known Latin-script transliterations shown above (such as "Murgh" to "Chicken"). Cross-lingual matching is handled by the model itself.
- It does not correct all possible misspellings. Only known food-domain corrections are applied.
Why this matters
Without preprocessing, these would produce different embeddings:
"**NEW** Chiken Biryani (Serves 2)"
"Chicken Biryani"
With preprocessing, both are normalized to "Chicken Biryani" before embedding and therefore produce identical vectors. This improves dedup recall and search accuracy.
Preprocessing in responses
API responses return both the original and preprocessed text where relevant:
    {
      "query": "**POPULAR** chiken biryani half",
      "query_preprocessed": "chicken biryani"
    }
This lets you verify what the model actually processed, which is useful for debugging unexpected results.
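When debugging, it can help to compare the two fields programmatically. The Python sketch below assumes only the `query` and `query_preprocessed` fields shown in the example response. The helper name `preprocessing_changes` and the token-level comparison are illustrative choices, not part of the API.

```python
import json


def preprocessing_changes(response_body: str) -> dict:
    """Summarize which whitespace-separated tokens preprocessing removed or introduced."""
    payload = json.loads(response_body)
    original = payload["query"]
    processed = payload["query_preprocessed"]

    original_tokens = original.lower().split()
    processed_tokens = processed.lower().split()

    return {
        "changed": original != processed,
        # Tokens present in the raw query but absent after preprocessing.
        "removed": [t for t in original_tokens if t not in processed_tokens],
        # Tokens that appear only after preprocessing (for example, corrected spellings).
        "added": [t for t in processed_tokens if t not in original_tokens],
    }


if __name__ == "__main__":
    body = json.dumps({
        "query": "**POPULAR** chiken biryani half",
        "query_preprocessed": "chicken biryani",
    })
    print(preprocessing_changes(body))
```

For the example response, the summary lists the promotional marker, the misspelled word, and the size word as removed, and the corrected spelling as added.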