Concepts

How Food Embeddings Work

What are embeddings?

An embedding is a numeric vector (a list of numbers) that captures the meaning of a piece of text. Similar meanings produce similar vectors. You can measure how similar two items are by comparing their vectors using cosine similarity.

  • 1.0 = identical meaning
  • 0.8+ = very similar (likely the same dish)
  • 0.5-0.7 = related (same category or cuisine)
  • Below 0.3 = unrelated
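Cosine similarity is simple to compute yourself. A minimal sketch in plain Python (in practice you would use NumPy or your vector database's built-in similarity; the toy 3-dimensional vectors below are illustrative, real embeddings have 128-384 dimensions):

```python
def cosine_similarity(a, b):
    # Dot product of the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms (lengths) of each vector.
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

a = [0.2, 0.9, 0.1]
b = [0.1, 0.8, 0.3]
print(cosine_similarity(a, a))  # identical vectors score ~1.0
print(cosine_similarity(a, b))  # similar vectors score close to 1.0
```

Because embedding vectors are typically unit-length, cosine similarity reduces to a dot product, which is why vector databases can compute it quickly at scale.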

Why food needs specialized embeddings

General-purpose embedding models (the kind you'd use for document search or chatbot retrieval) fail on food data in specific ways:

Transliteration blindness

"Murgh" is Hindi for chicken. "Murgh Makhani" and "Butter Chicken" are the same dish. Generic models treat "Murgh" as an unknown token and produce low similarity scores. dish-embed maps transliterations correctly because it was trained on real multilingual menu data.

Noise sensitivity

Real menu data looks like this:

**NEW** 50% OFF Chicken Biryani (Serves 2) [Non-Veg]

A generic model embeds all that noise as part of the meaning. dish-embed strips it before embedding, so this matches "Chicken Biryani" with high confidence.

Cross-lingual understanding

"Pollo Asado" (Spanish), "Grilled Chicken" (English), "Murgh Tandoori" (Hindi) are all grilled chicken preparations. dish-embed produces similar embeddings across 100+ languages because it's built on a multilingual base model and fine-tuned on cross-lingual food pairs.

Dietary signal preservation

Generic models don't know that "Paneer Tikka" is vegetarian and "Chicken Tikka" is not. They see high text overlap and produce a high similarity score. dish-embed understands that swapping the protein changes the fundamental nature of a dish, so dietary distinctions are preserved in the embedding space.

The model

dish-embed is fine-tuned from BAAI/bge-m3, a 568M parameter multilingual model supporting 100+ languages. It's trained on hundreds of thousands of curated food item pairs from real restaurant menus worldwide, covering Indian, East Asian, Southeast Asian, Middle Eastern, European, Latin American, and American cuisines.

The training process teaches the model food-specific knowledge:

  • Which items are the same dish under different names
  • Which items are related but distinct (Butter Chicken vs Dal Makhani)
  • Cross-lingual equivalences
  • Cuisine and category relationships

Using embeddings directly

If you want to store embeddings in your own vector database for custom search or clustering:

import requests

# BASE and headers (including your API key) are assumed to be defined
# as in the other examples in these docs.
resp = requests.post(f"{BASE}/embed", headers=headers,
    json={"items": ["Chicken Biryani", "Murgh Biryani", "Veg Pulao"], "dimension": 384})
resp.raise_for_status()

embeddings = resp.json()["embeddings"]
# Each embedding is a list of 384 floats.
# Store in Pinecone, Weaviate, pgvector, FAISS, etc.
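For prototyping or small catalogs, you may not need a vector database at all: a plain dictionary plus a linear scan works. A sketch, using hypothetical 4-dimensional toy vectors in place of real model output (real dish-embed vectors have 128, 256, or 384 dimensions):

```python
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Hypothetical embeddings keyed by item name (not real model output).
index = {
    "Chicken Biryani": [0.8, 0.5, 0.1, 0.2],
    "Murgh Biryani":   [0.8, 0.5, 0.2, 0.2],
    "Veg Pulao":       [0.1, 0.6, 0.8, 0.1],
}

def nearest(query_vec, index):
    # Linear scan: return the stored item with the highest cosine similarity.
    return max(index, key=lambda name: cosine_similarity(query_vec, index[name]))

print(nearest([0.8, 0.5, 0.2, 0.2], index))  # -> "Murgh Biryani"
```

A linear scan is O(n) per query, which is fine up to tens of thousands of items; beyond that, an approximate-nearest-neighbor index (FAISS, pgvector's HNSW, etc.) pays for itself.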

You can choose your embedding dimension (128, 256, or 384) depending on your quality and storage requirements. See Matryoshka Dimensions for trade-offs.
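With Matryoshka-style embeddings, the shorter vector is a prefix of the longer one, so you can also truncate locally rather than re-requesting a smaller dimension — assuming, as sketched below, that you renormalize the prefix to unit length so cosine similarity stays meaningful:

```python
def truncate(embedding, dim):
    # Keep the first `dim` values of a Matryoshka embedding, then rescale
    # to unit length so cosine similarities remain comparable.
    prefix = embedding[:dim]
    norm = sum(x * x for x in prefix) ** 0.5
    return [x / norm for x in prefix]

vec384 = [0.05] * 384          # stand-in for a real 384-d embedding
vec128 = truncate(vec384, 128)
print(len(vec128))             # 128
```

Truncating from 384 to 128 dimensions cuts storage by two-thirds at a modest cost in retrieval quality; see Matryoshka Dimensions for the measured trade-offs.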