Embedding

Embedding is a technique that transforms unstructured data such as text, images, and audio into fixed-length numerical vectors while preserving semantic relationships.

Raw strings alone give a computer no way to tell that "apple" and "orange" are similar. Embeddings solve this problem. When "apple" is converted into a vector such as [0.23, -0.41, 0.87, ...] with hundreds of dimensions, the vector for "orange" lies close to it while the vector for "automobile" lies far away. Semantic closeness becomes numerical closeness, typically measured with cosine similarity or Euclidean distance.
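To make "numerical closeness" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional values are invented for illustration (real embeddings have hundreds of dimensions and come from a trained model), and cosine similarity is one common choice of measure, not the only one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, made up for this example only.
apple = [0.23, -0.41, 0.87]
orange = [0.30, -0.35, 0.80]
automobile = [-0.75, 0.60, 0.05]

sim_fruit = cosine_similarity(apple, orange)    # close to 1: similar meaning
sim_car = cosine_similarity(apple, automobile)  # negative: unrelated meaning

print(f"apple vs orange:     {sim_fruit:.3f}")
print(f"apple vs automobile: {sim_car:.3f}")
```

With these made-up numbers, the fruit pair scores far higher than the fruit/vehicle pair, which is the property a trained embedding model is meant to provide.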

Embeddings play a core role inside LLMs as well. Input text is first tokenized, and each token is converted into an embedding vector. The Transformer processes this sequence of vectors to generate output.
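The tokenize-then-look-up step can be sketched as follows. Everything here is a simplifying assumption: the five-word vocabulary, the whitespace tokenizer (production LLMs use subword tokenizers), the 4-dimensional vectors, and the random table standing in for learned weights.

```python
import random

# Hypothetical vocabulary mapping each known word to a token ID.
vocab = {"<unk>": 0, "the": 1, "apple": 2, "is": 3, "red": 4}
EMBED_DIM = 4  # toy size; real models use hundreds or thousands of dimensions

# Embedding table: one vector per token ID. Random values stand in for
# weights that a real model would learn during training.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]

def tokenize(text):
    """Split on whitespace and map each word to its ID (unknown words -> <unk>)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def embed(token_ids):
    """Look up one embedding vector per token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

token_ids = tokenize("The apple is red")
vectors = embed(token_ids)  # this sequence is what the Transformer would consume

print(token_ids)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The lookup itself is just indexing into a table; the meaning lives in the learned values, which this sketch does not attempt to reproduce.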

In practice, sentence-level embeddings are used most frequently. Models such as OpenAI's text-embedding-3-small and Cohere's embed-v4 convert an entire sentence into a single vector. Storing these vectors in a vector database enables semantic search and provides the retrieval layer for RAG (retrieval-augmented generation).
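The retrieval step can be sketched with a tiny in-memory index and a brute-force scan. The document texts and vectors below are hypothetical placeholders: in a real system the vectors would come from an embedding model, and a vector database would replace the Python list and typically use approximate nearest-neighbor search.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical (text, vector) pairs standing in for a vector database.
index = [
    ("How to store fresh fruit", [0.9, 0.1, 0.0]),
    ("Car maintenance checklist", [0.0, 0.2, 0.9]),
    ("Recipes using citrus", [0.8, 0.3, 0.1]),
]

def search(query_vector, index, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical embedding of a query such as "keeping oranges fresh".
query_vector = [0.85, 0.2, 0.05]
results = search(query_vector, index)

for score, text in results:
    print(f"{score:.3f}  {text}")
```

In a RAG pipeline, the texts returned by a search like this would be passed to the LLM as context alongside the user's question.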

When selecting a model, dimensionality, supported languages, and cost are the key criteria. For languages such as Japanese or Thai, it is important to benchmark the accuracy of candidate multilingual models on representative data before committing to one.
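One simple way to run such a benchmark is to measure top-1 retrieval accuracy on a small labeled set: for each query, check whether the most similar document is the one a human marked as correct. The sketch below assumes the vectors have already been produced by the model under test; the function name top1_accuracy and all data are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top1_accuracy(query_vectors, doc_vectors, expected_doc_ids):
    """Fraction of queries whose nearest document is the labeled one."""
    hits = 0
    for query, expected in zip(query_vectors, expected_doc_ids):
        best = max(
            range(len(doc_vectors)),
            key=lambda i: cosine_similarity(query, doc_vectors[i]),
        )
        if best == expected:
            hits += 1
    return hits / len(query_vectors)

# Hypothetical vectors; in a real benchmark these come from embedding
# labeled queries and documents in the target language with each candidate model.
doc_vectors = [[1.0, 0.0], [0.0, 1.0]]
query_vectors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
expected_doc_ids = [0, 1, 1]  # index of the correct document for each query

accuracy = top1_accuracy(query_vectors, doc_vectors, expected_doc_ids)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Running the same labeled set through each candidate model gives a like-for-like number to weigh against dimensionality and cost.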