Embedding is a technique that transforms unstructured data such as text, images, and audio into fixed-length numerical vectors while preserving semantic relationships.
A computer cannot determine from raw strings that "apple" and "orange" are similar. Embedding solves this problem. When "apple" is converted into a vector like [0.23, -0.41, 0.87, ...] with hundreds of dimensions, the vector for "orange" lands close to it while the vector for "automobile" lands far away, as measured by a metric such as cosine similarity. Semantic closeness becomes numerical closeness.
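To make "numerical closeness" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are hand-picked for illustration and are not the output of any real model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors (an assumption for illustration, not model output).
vectors = {
    "apple": [0.9, 0.8, 0.1],
    "orange": [0.8, 0.9, 0.2],
    "automobile": [0.1, 0.2, 0.9],
}

sim_fruit = cosine_similarity(vectors["apple"], vectors["orange"])
sim_car = cosine_similarity(vectors["apple"], vectors["automobile"])

print(f"apple vs orange:     {sim_fruit:.2f}")
print(f"apple vs automobile: {sim_car:.2f}")
print(sim_fruit > sim_car)  # True
```

The absolute scores matter less than their ordering: a search system only needs the related item to score higher than the unrelated one.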
Embeddings play a core role inside LLMs as well. Input text is first tokenized, and each token is converted into an embedding vector. The Transformer processes this sequence of vectors to generate output.
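The tokenize-then-look-up step can be sketched as follows. Everything here is a simplification: the vocabulary, the whitespace tokenizer, and the random table values are hypothetical stand-ins, whereas a real LLM uses a subword tokenizer and a table learned during training.

```python
import random

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"<unk>": 0, "i": 1, "like": 2, "apple": 3, "orange": 4}
EMBED_DIM = 4

rng = random.Random(0)
# Embedding table: one row of EMBED_DIM floats per token id.
# Real models learn these values; here they are random placeholders.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

def tokenize(text):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def embed(token_ids):
    """Look up one vector per token id."""
    return [embedding_table[i] for i in token_ids]

token_ids = tokenize("I like apple")
sequence = embed(token_ids)

print(token_ids)                        # [1, 2, 3]
print(len(sequence), len(sequence[0]))  # 3 4
```

The resulting list of vectors, one per token, is the sequence that the Transformer layers then process.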
In practice, sentence-level embeddings are used most frequently. Models such as OpenAI's text-embedding-3-small and Cohere's embed-v4 convert an entire sentence into a single vector. Storing these vectors in a vector database enables semantic search and provides the retrieval layer for retrieval-augmented generation (RAG).
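As a rough sketch of what the retrieval layer does, the class below stores vectors in memory and returns the nearest ones by cosine similarity. `TinyVectorStore` is a hypothetical name, the search is brute force (real vector databases typically use approximate indexes), and the vectors are hand-picked stand-ins for what an embedding model would return.

```python
import math

class TinyVectorStore:
    """Brute-force in-memory store, for illustration only."""

    def __init__(self):
        self._items = []  # list of (text, unit-length vector)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, text, vector):
        self._items.append((text, self._normalize(vector)))

    def search(self, query_vector, k=2):
        """Return the k stored texts most similar to the query vector."""
        q = self._normalize(query_vector)
        # For unit vectors, the dot product equals cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), text)
            for text, vec in self._items
        ]
        scored.sort(reverse=True)
        return [(text, score) for score, text in scored[:k]]

store = TinyVectorStore()
# Hand-picked vectors standing in for embedding-model output.
store.add("How to reset your password", [0.9, 0.1, 0.0])
store.add("Refund policy for annual plans", [0.1, 0.9, 0.1])
store.add("Recovering a locked account", [0.8, 0.2, 0.1])

# Pretend this is the embedding of "I forgot my login credentials".
results = store.search([0.85, 0.15, 0.05], k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

In a RAG pipeline, the texts returned by `search` would be passed to the LLM as context alongside the user's question.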
When selecting a model, the key criteria are dimensionality, supported languages, and cost. For languages such as Japanese or Thai, it is important to benchmark the accuracy of candidate multilingual models beforehand.
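Dimensionality feeds directly into storage cost. The back-of-the-envelope sketch below assumes each value is stored as a 4-byte float and ignores index overhead and metadata; the helper name and the dimension counts are illustrative choices, not recommendations.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector payload size; ignores index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value

# One million documents at a few illustrative dimension counts.
for dims in (256, 1536, 3072):
    size_gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {size_gb:.2f} GB")
```

Under these assumptions, doubling the dimension count doubles the raw storage, so a smaller vector is worth considering when benchmark accuracy is comparable.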


Gemini Embedding 2 is a multimodal embedding model developed by Google that maps text, images, video, audio, and documents into a single shared vector space.

A vector database stores text, images, and other data as numerical vectors (embeddings) and provides fast search based on semantic similarity.

A tokenization algorithm that repeatedly merges frequently occurring character sequences and thereby splits text into subword units. It directly affects the input/output cost and processing speed of LLMs; for low-resource languages, the vocabulary contains too few dedicated tokens, so text falls back to byte-level decomposition.
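The merge-based procedure described above resembles byte-pair-encoding-style tokenizers. Assuming that family, a minimal sketch of learning and applying merges might look like this; the function names and the tiny corpus are hypothetical, and production tokenizers add byte-level handling and many optimizations.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters, with its corpus count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, count in corpus.items():
            new_corpus[merge_pair(symbols, best)] += count
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Split a word into subword units by applying merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_merges(["low", "low", "lower", "lowest", "newest", "newest"], 2)
print(merges)                   # [('l', 'o'), ('lo', 'w')]
print(encode("lowly", merges))  # ['low', 'l', 'y']
```

Frequent sequences such as "low" collapse into a single token, while rarer characters stay as separate units. This is why text in a language with little vocabulary coverage tends to consume more tokens.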


Hybrid search is a technique that combines keyword-based full-text search (such as BM25) with vector search (semantic search), leveraging the strengths of both to improve retrieval accuracy.
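One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only each document's rank in each list. The sketch below is one possible fusion method, not the only one; the document ids and result lists are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; a higher fused score ranks earlier."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists, best match first.
keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from embedding similarity

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, which is how the fusion rewards agreement between keyword matching and semantic similarity.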