Built-in Preprocessing

Every text input to the dish-embed API is cleaned automatically before it is embedded, so you can send raw menu data as-is without preprocessing it yourself.

What it does

Noise stripping

Real menu data from POS systems, aggregator feeds, and manual entry contains junk:

Input                                 After stripping
**NEW** 50% OFF Chicken Biryani       Chicken Biryani
Masala Dosa (Serves 2) - $12.99       Masala Dosa
***BESTSELLER*** Butter Naan [V]      Butter Naan
Pepperoni Pizza 12" Large             Pepperoni Pizza

Stripped elements: prices, serving sizes, promotional markers, formatting characters, size indicators, dietary tags in brackets.
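
To make the stripped categories concrete, here is a minimal local sketch of this kind of cleanup. It is not the service's implementation: the function name `strip_noise` and every regular expression below are assumptions chosen to reproduce the four example rows, and the real rules are likely broader.

```python
import re

# Illustrative patterns only (assumed, not the service's actual rules).
_NOISE_PATTERNS = [
    r"\*+[^*]*\*+",                              # promotional markers: **NEW**, ***BESTSELLER***
    r"\b\d+\s*%\s*off\b",                        # discount text: "50% OFF"
    r"\(\s*serves\s+\d+\s*\)",                   # serving sizes: "(Serves 2)"
    r"\[[^\]]*\]",                               # bracketed dietary tags: "[V]"
    r"[$€£]\s*\d+(?:\.\d+)?",                    # prices: "$12.99"
    r'\b\d+(?:\.\d+)?\s*(?:"|inch(?:es)?\b)',    # size indicators: 12"
    r"\b(?:small|medium|large)\b",               # size words: "Large"
]
_NOISE_RE = re.compile("|".join(_NOISE_PATTERNS), flags=re.IGNORECASE)


def strip_noise(text: str) -> str:
    """Remove noise tokens, then tidy leftover whitespace and separators."""
    cleaned = _NOISE_RE.sub(" ", text)
    cleaned = re.sub(r"\s+", " ", cleaned)   # collapse runs of whitespace
    return cleaned.strip(" -")               # drop dangling spaces and dashes


if __name__ == "__main__":
    for raw in [
        "**NEW** 50% OFF Chicken Biryani",
        "Masala Dosa (Serves 2) - $12.99",
        "***BESTSELLER*** Butter Naan [V]",
        'Pepperoni Pizza 12" Large',
    ]:
        print(f"{raw!r} -> {strip_noise(raw)!r}")
```

A sketch like this is mainly useful for reasoning about what the API is likely to keep or drop; you do not need to run anything similar before calling the API.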

Spelling normalization

Common misspellings and transliterations are normalized to canonical English forms:

Input               Normalized
Panner Tikka        Paneer Tikka
Chiken Biryani      Chicken Biryani
Murgh Makhani       Chicken Makhani
Jhinga Curry        Prawns Curry

This means "Panner Tikka" and "Paneer Tikka" produce identical embeddings without any work on your part.
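
One way to picture this step is as a token-level lookup against a list of known variants. The sketch below assumes exactly that: the table `CANONICAL_TOKENS` and the function `normalize_spelling` are hypothetical names, the table holds only the four examples above, and the service may well use a different mechanism.

```python
# Hypothetical lookup table built from the four examples above.
CANONICAL_TOKENS = {
    "panner": "paneer",
    "chiken": "chicken",
    "murgh": "chicken",
    "jhinga": "prawns",
}


def _fix_token(token: str) -> str:
    """Return the canonical form of a known variant, keeping initial capitalization."""
    fixed = CANONICAL_TOKENS.get(token.lower())
    if fixed is None:
        return token  # unknown tokens are left untouched
    return fixed.capitalize() if token[:1].isupper() else fixed


def normalize_spelling(name: str) -> str:
    """Normalize known food-domain variants token by token."""
    return " ".join(_fix_token(tok) for tok in name.split())


if __name__ == "__main__":
    for raw in ["Panner Tikka", "Chiken Biryani", "Murgh Makhani", "Jhinga Curry"]:
        print(f"{raw!r} -> {normalize_spelling(raw)!r}")
```

Because both spellings map to the same string before embedding, the resulting vectors are identical rather than merely close.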

Cross-lingual handling

For items in non-Latin scripts (Japanese, Korean, Arabic, Chinese, Thai, etc.), the original script is preserved as-is. Normalization rules only apply to Latin-script text.

This means:

  • "ラーメン" (ramen in Japanese) is kept unchanged
  • "Ramen Noodle Soup" gets standard normalization
  • Both produce similar embeddings because the model is multilingual
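
The script rule can be sketched as a simple gate in front of the normalization step. The helper names `is_latin_script` and `apply_if_latin` are assumptions for illustration, and so is the treatment of mixed-script input (left unchanged here), which the rule above does not specify.

```python
import unicodedata


def is_latin_script(text: str) -> bool:
    """True when every alphabetic character in the text is a Latin letter."""
    letters = [ch for ch in text if ch.isalpha()]
    # Unicode names of Latin letters contain "LATIN", e.g. "LATIN SMALL LETTER A".
    return all("LATIN" in unicodedata.name(ch, "") for ch in letters)


def apply_if_latin(text: str, normalize) -> str:
    """Run `normalize` on Latin-script text; pass other scripts through unchanged.

    Assumption: text mixing Latin and non-Latin letters is also passed through.
    """
    return normalize(text) if is_latin_script(text) else text


if __name__ == "__main__":
    # str.title stands in for the real normalization rules in this demo.
    print(apply_if_latin("ラーメン", str.title))
    print(apply_if_latin("ramen noodle soup", str.title))
```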

What it does NOT do

  • It does not change the meaning of items. "Chicken" stays "Chicken", never becomes "Poultry".
  • It does not remove legitimate modifiers. "Spicy Chicken Wings" stays as "Spicy Chicken Wings".
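  • It does not translate text between languages. Non-Latin-script input is passed through unchanged, and only the known Latin-script transliterations shown above (for example "Murgh" to "Chicken") are mapped to English. Cross-lingual matching is handled by the model itself.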
  • It does not correct all possible misspellings. Only known food-domain corrections are applied.

Why this matters

Without preprocessing, these would produce different embeddings:

"**NEW** Chiken Biryani (Serves 2)"
"Chicken Biryani"

With preprocessing, both are normalized to "Chicken Biryani" before embedding, so they produce identical vectors. Duplicate listings that differ only in noise or spelling therefore match exactly, which improves deduplication recall and search accuracy.

Preprocessing in responses

API responses return both the original and preprocessed text where relevant:

{
  "query": "**POPULAR** chiken biryani half",
  "query_preprocessed": "chicken biryani"
}

This lets you verify exactly what text the model processed, which is useful when debugging unexpected results.
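
When debugging, it can help to see which input tokens did not survive preprocessing. The following sketch relies only on the `query` and `query_preprocessed` fields shown in the sample response; the helper name `changed_tokens` is hypothetical, and the comparison is a plain case-insensitive token check rather than anything the API provides.

```python
import json


def changed_tokens(response: dict) -> list:
    """List tokens from `query` that are absent from `query_preprocessed`.

    These are tokens that preprocessing either stripped or rewrote.
    """
    original = response["query"]
    processed = response.get("query_preprocessed", original)
    kept = {tok.lower() for tok in processed.split()}
    return [tok for tok in original.split() if tok.lower() not in kept]


if __name__ == "__main__":
    sample = json.loads(
        '{"query": "**POPULAR** chiken biryani half",'
        ' "query_preprocessed": "chicken biryani"}'
    )
    print(changed_tokens(sample))
```

For the sample response this reports the promotional marker, the misspelled token, and the portion word, which matches the difference between the two strings.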