Gemini Embedding 2 is a multimodal embedding model developed by Google that maps text, images, video, audio, and documents into a single shared vector space.
Unlike conventional embedding models that handle only text, the defining feature of this model is its ability to map 5 types of media into a single semantic space. For example, an audio clip of an abnormal factory sound and a text document describing the corresponding equipment troubleshooting procedure can be placed in close proximity in vector space — enabling cross-modal search within a single model. In RAG pipelines where non-text knowledge needs to be searchable, this significantly reduces the overhead of preparing separate models for each modality.
The input window is 8,192 tokens, allowing for larger chunk sizes. Output dimensionality goes up to 3,072, but thanks to the Matryoshka-style representation, vectors can be truncated to 1,536 (balanced) or 768 (optimized for low-latency search) with little loss in quality. Task-type parameters are also available, adjusting the geometric properties of the output vectors for use cases such as retrieval and classification.
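The Matryoshka-style truncation described above can be sketched in a few lines: keep the leading dimensions and re-normalize. The 3,072-dim vector below is synthetic stand-in data; with the real API you would typically request the output dimensionality directly instead of truncating client-side.

```python
import math
import random

def truncate_matryoshka(vec, dim):
    """Keep the leading `dim` dimensions and L2-normalize the result."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Synthetic stand-in for a 3,072-dim embedding.
random.seed(0)
full = [random.gauss(0, 1) for _ in range(3072)]

balanced = truncate_matryoshka(full, 1536)  # balanced
fast = truncate_matryoshka(full, 768)       # low-latency search

print(len(balanced), len(fast))  # 1536 768
```

Because Matryoshka training packs the most informative components into the leading dimensions, the truncated vectors remain usable for similarity search at a fraction of the storage cost.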
With native support for over 100 languages, the model is well-suited for multilingual RAG and cross-lingual search. Official integrations with LangChain, LlamaIndex, Weaviate, Qdrant, and ChromaDB are provided, enabling seamless incorporation into existing vector database infrastructure.
Pricing is $0.25 per 1 million tokens, with a free tier available. Migrating from the older text-embedding-004 is as simple as swapping the model ID, but because the vector spaces differ, existing indexes must be rebuilt. When fully leveraging multimodal input, careful design is required — including decisions on the granularity at which images and audio are included in the index, and balancing search accuracy against storage costs.
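At $0.25 per million tokens, the cost of the index rebuild a migration requires can be estimated up front. The corpus size below is a hypothetical example, not a figure from the source:

```python
PRICE_PER_MILLION_TOKENS = 0.25  # USD, per the pricing above

def embedding_cost(total_tokens: int) -> float:
    """Estimated USD cost for embedding `total_tokens` tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Hypothetical corpus: 200k chunks averaging 500 tokens each.
tokens = 200_000 * 500
print(f"${embedding_cost(tokens):.2f}")  # $25.00
```

Running the same arithmetic against your own corpus makes it easy to weigh a full reindex against keeping the old model in place.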


Embedding is a technique that transforms unstructured data such as text, images, and audio into fixed-length numerical vectors while preserving semantic relationships.
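"Preserving semantic relationships" means, concretely, that similar inputs map to nearby vectors, most often measured by cosine similarity. The 3-dimensional vectors below are toy values for illustration only:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": related concepts point in similar directions.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.2, 0.05]
invoice = [0.0, 0.1, 0.95]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```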

A vector database stores text, images, and other data as numerical vectors (embeddings) and provides fast search based on semantic similarity.
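Stripped to its essentials, a vector database stores (id, vector) pairs and ranks them by similarity to a query vector. The brute-force sketch below illustrates the idea; production systems replace the linear scan with approximate nearest-neighbor indexes (e.g. HNSW) to scale. All names here are illustrative.

```python
import math

class TinyVectorStore:
    """Brute-force in-memory vector store; real databases use ANN indexes."""

    def __init__(self):
        self.items = []  # list of (doc_id, vector)

    def add(self, doc_id, vector):
        self.items.append((doc_id, vector))

    def search(self, query, k=3):
        """Return the ids of the k vectors most similar to `query`."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self.items, key=lambda it: cos(query, it[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

store = TinyVectorStore()
store.add("doc-a", [1.0, 0.0])
store.add("doc-b", [0.7, 0.7])
store.add("doc-c", [0.0, 1.0])
print(store.search([0.9, 0.1], k=2))  # ['doc-a', 'doc-b']
```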

Hybrid search is a technique that combines keyword-based full-text search (such as BM25) with vector search (semantic search), leveraging the strengths of both to improve retrieval accuracy.
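One common way to combine the two result lists is Reciprocal Rank Fusion (RRF), which needs only the rank positions, not the raw scores. The ranked lists below are illustrative:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # keyword (BM25) ranking
vector_hits = ["doc1", "doc4", "doc3"]  # semantic (vector) ranking

print(rrf_fuse([bm25_hits, vector_hits]))  # ['doc1', 'doc3', 'doc4', 'doc7']
```

A document that ranks moderately well in both lists (doc1 here) beats one that appears in only one list, which is exactly the behavior hybrid search is after.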


Byte Pair Encoding (BPE) is an algorithm that merges frequently co-occurring character sequences and splits text into subword units. It directly affects the input/output cost and processing speed of LLMs; for low-resource languages, an insufficient dedicated vocabulary forces decomposition down to the byte level.
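One merge step of this algorithm can be sketched as follows: count adjacent symbol pairs, then replace every occurrence of the most frequent pair with a single merged token. Real tokenizers repeat this until a target vocabulary size is reached.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)  # one of the most frequent adjacent pairs
tokens = merge_pair(tokens, pair)
print(pair, tokens)
```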