Gemini Embedding 2 is a multimodal embedding model developed by Google that embeds text, images, video, audio, and documents into a single shared vector space.
Unlike conventional embedding models that handle only text, the defining feature of this model is its ability to map five media types into a single semantic space. For example, an audio clip of an abnormal factory sound and a text document describing the corresponding equipment's troubleshooting procedure can land close together in vector space, enabling cross-modal search within a single model. In RAG pipelines where non-text knowledge needs to be searchable, this significantly reduces the overhead of maintaining a separate model for each modality.

The input window is 8,192 tokens, allowing for larger chunk sizes. Output dimensionality goes up to 3,072, but thanks to the Matryoshka architecture it can be reduced to 1,536 (balanced) or 768 (optimized for low-latency search). Task-optimization parameters are also available, letting the geometric properties of the vectors be tuned for use cases such as retrieval and classification. With native support for over 100 languages, the model is well suited to multilingual RAG and cross-lingual search.

Official integrations with LangChain, LlamaIndex, Weaviate, Qdrant, and ChromaDB are provided, so the model slots into existing vector database infrastructure. Pricing is $0.25 per 1 million tokens, with a free tier available.

Migrating from the earlier text-embedding-004 is a simple model-ID swap on the surface, but because the vector spaces differ, existing indexes must be rebuilt. Fully leveraging multimodal input also calls for careful design: deciding at what granularity images and audio enter the index, and balancing search accuracy against storage cost.
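The Matryoshka reduction mentioned above can be sketched in plain Python: keep the first k components of a full-size vector and L2-renormalize, and the result is still usable for cosine similarity. The random vectors below are stand-ins for real 3,072-dimensional model output; no SDK calls or model IDs are assumed.

```python
import math
import random

def truncate_embedding(vec, dim):
    """Matryoshka-style reduction: keep the first `dim` components
    and L2-renormalize so cosine similarity remains meaningful."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    # Vectors are assumed to be unit-length after renormalization.
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
# Stand-ins for full 3,072-dimensional embeddings returned by the API.
full_a = [random.gauss(0, 1) for _ in range(3072)]

a768 = truncate_embedding(full_a, 768)
print(len(a768))                        # 768
print(abs(cosine(a768, a768) - 1.0) < 1e-9)  # True: renormalized to unit length
```

Storing the 768-dimensional prefix quarters index size and speeds up search at some cost in accuracy, which is exactly the trade-off the balanced and low-latency settings expose.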


Embedding is a technique that transforms unstructured data such as text, images, and audio into fixed-length numerical vectors while preserving semantic relationships.
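"Preserving semantic relationships" means that related inputs end up with nearby vectors, typically measured by cosine similarity. A toy sketch with hand-made 3-dimensional vectors (illustrative values, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" chosen by hand for illustration only.
vec_cat    = [0.90, 0.10, 0.20]
vec_kitten = [0.85, 0.15, 0.25]
vec_car    = [0.10, 0.90, 0.30]

# Semantically close inputs score higher than unrelated ones.
print(cosine_similarity(vec_cat, vec_kitten) > cosine_similarity(vec_cat, vec_car))  # True
```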

A vector database stores text, images, and other data as numerical vectors (embeddings) and provides fast search based on semantic similarity.
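At its core, the search a vector database performs can be reduced to a brute-force nearest-neighbor scan over stored (id, vector) pairs. A minimal in-memory sketch (real systems add approximate indexes such as HNSW; the document IDs here are hypothetical):

```python
import math

class TinyVectorStore:
    """Minimal in-memory vector store: holds (doc_id, vector) pairs and
    returns the ids most similar to a query vector via brute-force scan."""

    def __init__(self):
        self.items = []  # list of (doc_id, vector)

    def add(self, doc_id, vector):
        self.items.append((doc_id, vector))

    def search(self, query, k=1):
        def sim(v):
            dot = sum(x * y for x, y in zip(query, v))
            nq = math.sqrt(sum(x * x for x in query))
            nv = math.sqrt(sum(x * x for x in v))
            return dot / (nq * nv)
        ranked = sorted(self.items, key=lambda item: sim(item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

store = TinyVectorStore()
store.add("doc-pump-manual", [0.9, 0.1])  # hypothetical document embeddings
store.add("doc-hr-policy",   [0.1, 0.9])
print(store.search([0.8, 0.2], k=1))  # ['doc-pump-manual']
```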

Hybrid search is a technique that combines keyword-based full-text search (such as BM25) with vector search (semantic search), leveraging the strengths of both to improve retrieval accuracy.
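One common way to combine the two rankings is reciprocal rank fusion (RRF), which needs only each document's rank in each list, not comparable scores. A sketch with hypothetical result lists:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked result lists (e.g. BM25 and vector search)
    into one ranking: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc-a", "doc-b", "doc-c"]  # keyword (lexical) ranking
vector_hits = ["doc-b", "doc-d", "doc-a"]  # semantic ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

doc-b wins because it ranks well in both lists, which is the behavior hybrid search is after: rewarding documents that both match the query terms and are semantically close.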

Local LLM / SLM Deployment Comparison — AI Utilization Without Cloud API Dependency

An open-weight model is a language model whose trained weights (parameters) are publicly released and can be freely downloaded for use in inference and fine-tuning.