Built-in Preprocessing

Every text input to the dish-embed API is automatically cleaned before it is embedded. Send raw menu data as-is; you do not need to preprocess it on your end.

What it does

Noise stripping

Real menu data from POS systems, aggregator feeds, and manual entry contains junk:

Input	After stripping
`NEW 50% OFF Chicken Biryani`	`Chicken Biryani`
`Masala Dosa (Serves 2) - $12.99`	`Masala Dosa`
`*BESTSELLER* Butter Naan [V]`	`Butter Naan`
`Pepperoni Pizza 12" Large`	`Pepperoni Pizza`

Stripped elements: prices, serving sizes, promotional markers, formatting characters, size indicators, dietary tags in brackets.
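To make the categories above concrete, here is a minimal sketch of what regex-based noise stripping could look like. It is not the service's implementation: the pattern list, the `strip_noise` name, and the set of promotional and size words are all assumptions chosen only to reproduce the examples in the table.

```python
import re

# Hypothetical patterns, one per category of stripped element.
# These are illustrative guesses, not the rules the API applies.
_NOISE_PATTERNS = [re.compile(p) for p in (
    r"\[[^\]]*\]",                                   # dietary tags in brackets: [V]
    r"\([^)]*\)",                                    # parenthesised notes: (Serves 2)
    r"[$€£₹]\s*\d+(?:[.,]\d+)?",                     # prices: $12.99
    r"\b\d+\s*%\s*(?i:off)\b",                       # discounts: 50% OFF
    r"\b(?:NEW|BESTSELLER|POPULAR)\b",               # upper-case promotional markers
    r'\b\d+\s*(?:"|(?i:inch(?:es)?)\b)',             # sizes in inches: 12"
    r"(?i:\b(?:small|medium|large|half|full)\b)",    # size words
    r"[*_#~]+",                                      # formatting characters
)]


def strip_noise(text: str) -> str:
    """Remove noise tokens, then tidy whitespace and dangling separators."""
    for pattern in _NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip(" -:|,")
```

Promotional markers are matched in upper case only in this sketch, so a dish name such as "New York Cheesecake" is left alone. Whether the API makes the same distinction is not stated here.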

Spelling normalization

Common misspellings and transliterations are normalized to canonical English forms:

Input	Normalized
Panner Tikka	Paneer Tikka
Chiken Biryani	Chicken Biryani
Murgh Makhani	Chicken Makhani
Jhinga Curry	Prawns Curry

This means "Panner Tikka" and "Paneer Tikka" produce identical embeddings without any work on your part.
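One simple way to picture this step is a word-level lookup table. The sketch below assumes a small hypothetical dictionary, `_CORRECTIONS`, seeded only with the four examples above; the real correction list and matching logic are not documented here and are likely more involved.

```python
import re

# Hypothetical correction table built from the examples in this section.
_CORRECTIONS = {
    "panner": "Paneer",
    "chiken": "Chicken",
    "murgh": "Chicken",
    "jhinga": "Prawns",
}

_WORD = re.compile(r"[A-Za-z]+")


def normalize_spelling(text: str) -> str:
    """Replace known misspellings and transliterations word by word."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        replacement = _CORRECTIONS.get(word.lower())
        if replacement is None:
            return word  # unknown words pass through untouched
        # Keep all-lowercase input lowercase; otherwise use the canonical casing.
        return replacement.lower() if word.islower() else replacement

    return _WORD.sub(fix, text)
```

Because only words present in the table are touched, "Spicy Chicken Wings" passes through unchanged, and a misspelling that is not in the table is left as it was entered.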

Cross-lingual handling

For items in non-Latin scripts (Japanese, Korean, Arabic, Chinese, Thai, etc.), the original script is preserved as-is. Normalization rules only apply to Latin-script text.

This means:

"ラーメン" (ramen in Japanese) is kept unchanged
"Ramen Noodle Soup" gets standard normalization
Both produce similar embeddings because the model is multilingual
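The script check that gates normalization can be sketched with the standard `unicodedata` module. The helper names `is_latin_script` and `preprocess` are hypothetical, and treating mixed-script text as non-Latin is an assumption of this sketch; how the API handles mixed input is not described here.

```python
import unicodedata
from typing import Callable


def is_latin_script(text: str) -> bool:
    """Return True if the text has letters and all of them are Latin."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


def preprocess(text: str, normalize: Callable[[str], str]) -> str:
    """Apply `normalize` to Latin-script text; leave other scripts untouched."""
    return normalize(text) if is_latin_script(text) else text
```

With this gate in place, "ラーメン" is returned exactly as given, while "Ramen Noodle Soup" is handed to whatever normalization function is supplied.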

What it does NOT do

It does not change the meaning of items. "Chicken" stays "Chicken", never becomes "Poultry".
It does not remove legitimate modifiers. "Spicy Chicken Wings" stays as "Spicy Chicken Wings".
It does not translate between languages or scripts. Apart from the known Latin-script transliterations shown above, cross-lingual matching is handled by the model itself.
It does not correct all possible misspellings. Only known food-domain corrections are applied.

Why this matters

Without preprocessing, these would produce different embeddings:

"**NEW** Chiken Biryani (Serves 2)"
"Chicken Biryani"

With preprocessing, both are normalized to "Chicken Biryani" before embedding and therefore produce identical vectors. This improves deduplication recall and search accuracy without any effort on your part.

Preprocessing in responses

API responses return both the original and preprocessed text where relevant:

{
  "query": "**POPULAR** chiken biryani half",
  "query_preprocessed": "chicken biryani"
}

This lets you verify what the model actually processed, which is useful when debugging unexpected results.
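As a debugging aid, a client can compare the two fields and log any query that preprocessing altered. The sketch below relies only on the `query` and `query_preprocessed` fields shown above; the function name `preprocessing_change` is hypothetical, and no other part of the response shape is assumed.

```python
import json
from typing import Optional, Tuple


def preprocessing_change(response_body: str) -> Optional[Tuple[str, str]]:
    """Return (original, preprocessed) if they differ, otherwise None."""
    data = json.loads(response_body)
    original = data.get("query")
    cleaned = data.get("query_preprocessed")
    # A missing field or an unchanged query means there is nothing to report.
    if original is None or cleaned is None or original == cleaned:
        return None
    return original, cleaned


body = '{"query": "**POPULAR** chiken biryani half", "query_preprocessed": "chicken biryani"}'
change = preprocessing_change(body)
if change is not None:
    print(f"preprocessing rewrote {change[0]!r} as {change[1]!r}")
```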
