The page has been translated by Gen AI.

Elasticsearch Vector Search Guide

Overview

Elasticsearch is a NoSQL-style distributed search engine designed to provide fast, near-real-time search. It can search large volumes of unstructured data and supports full-text search. These characteristics also make it well suited to searching text with vectors.

Vector search is a search in which documents and queries are represented as vectors rather than as plain text. It uses machine learning (ML) to convert content such as text, images, audio, or video into vector representations and finds similar data with a nearest neighbor algorithm.

For example, unlike traditional search that looks for matching keywords, vector search can find, among videos with descriptions, the ones most similar to what the user means. Likewise, the simplest way to find similar documents is to count how many words the input and a document have in common, but documents that do not match exactly can still be similar or related, so these need to be considered as well.

Vector embeddings make it possible to find synonyms and related words, because similarity is expressed as distance in the embedding space.
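
As a rough illustration of distance in an embedding space, the following Python sketch compares a few hand-made 3-dimensional vectors. The words, the vector values, and the `cosine_similarity` helper are hypothetical and chosen only for illustration; they do not come from a real embedding model, which would produce vectors with far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumed values, not model output).
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

sim_synonym = cosine_similarity(embeddings["car"], embeddings["automobile"])
sim_unrelated = cosine_similarity(embeddings["car"], embeddings["banana"])

print(f"car vs automobile: {sim_synonym:.3f}")
print(f"car vs banana:     {sim_unrelated:.3f}")
```

The two words that share no characters still score as close neighbors because their vectors point in nearly the same direction, which is the property keyword matching lacks.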

Note
On Samsung Cloud Platform, Elasticsearch vector search is available only in the Enterprise version. It is not available in the OSS version because X-Pack is not supported there.

Vector (Text) Embedding

Embedding is the process of converting human language or images, which computers cannot interpret directly because they operate only on 0s and 1s, into vectors that machines can process.

There are several types of embedding techniques. The ones most relevant to Elasticsearch are word embedding and sentence embedding.

  • Word-level embedding models include word2vec and GloVe.
  • Sentence-level embedding models include ELMo and BERT.

Word-level embedding is the simplest form of embedding and makes similarity calculation and visualization possible. Its biggest weakness is that it has difficulty distinguishing homonyms.

To compensate for this, sentence-level embedding includes the contextual meaning of the sentence in the embedding, which brings the benefits of transfer learning and good performance.

In Elasticsearch, text (word/sentence) embeddings are stored in fields of the dense_vector type and used for similarity search.

Preparing for Vector Search

Elasticsearch has supported a high-dimensional vector field type since version 7.0, and vectors can be used for document scoring since version 7.3.

Elasticsearch currently supports the dense_vector field type, which has the following characteristics:

  1. In word embedding models, one word is represented as one dense vector.
  2. It is mainly used for kNN (k-nearest neighbor) search.
  3. It does not support aggregation or sorting.
  4. It can be used to rank documents in script_score queries.
  5. The number of dimensions (dims) of the field cannot exceed 2048.

Creating a Vector Field in an Index

To use vector search in Elasticsearch, you need to create a vector field in an index first.

PUT position-sample
{
  "mappings": {
    "properties": {
      "position-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": false
      },
      "size": {
        "type": "long"
      }
    }
  }
}
Code Block. Index Vector Field Creation Example

In the example above, an index named “position-sample” is created with two fields: “size” and “position-vector”.

  • The “size” field is of type “long”.
  • The “position-vector” field is a 3-dimensional “dense_vector” type.
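
Because the mapping fixes the vector length at 3, it can help to check documents on the client side before sending them. The following Python sketch does that against the mapping above; the `check_vector_dims` helper is a hypothetical name introduced here, not part of any Elasticsearch client.

```python
# Field definitions copied from the "position-sample" mapping above.
mapping_properties = {
    "position-vector": {"type": "dense_vector", "dims": 3, "index": False},
    "size": {"type": "long"},
}

def check_vector_dims(document, properties):
    """Return a list of problems found in the document's dense_vector fields."""
    problems = []
    for field, spec in properties.items():
        # Only dense_vector fields that are present in the document are checked.
        if spec.get("type") != "dense_vector" or field not in document:
            continue
        actual = len(document[field])
        if actual != spec["dims"]:
            problems.append(f"{field}: expected {spec['dims']} values, got {actual}")
    return problems

good_doc = {"position-vector": [13, 67, 23.0], "size": 2048}
bad_doc = {"position-vector": [13, 67], "size": 2048}  # one value short

good_problems = check_vector_dims(good_doc, mapping_properties)
bad_problems = check_vector_dims(bad_doc, mapping_properties)
print(good_problems)
print(bad_problems)
```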

Indexing Documents with Vectors

To index documents with vectors, you need to include the vector field in the request.

POST position-sample/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "position-vector": [-84.095, 435.33, 951.2], "size": 1024 }
{ "index": { "_id": "2" } }
{ "position-vector": [219, 234, -52.0], "size": 512 }
{ "index": { "_id": "3" } }
{ "position-vector": [13, 67, 23.0], "size": 2048 }
Code Block. Document Indexing Example
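
The bulk request body is newline-delimited JSON: one action line followed by one document line per document, with a newline after the last line. The following Python sketch builds such a body from the sample documents; the `build_bulk_body` helper is a hypothetical name used for illustration, and the sketch only produces the text without sending it to a cluster.

```python
import json

def build_bulk_body(docs):
    """Build a newline-delimited JSON body for the _bulk API from {id: source} pairs."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"

# Sample documents from the indexing example above.
docs = {
    "1": {"position-vector": [-84.095, 435.33, 951.2], "size": 1024},
    "2": {"position-vector": [219, 234, -52.0], "size": 512},
    "3": {"position-vector": [13, 67, 23.0], "size": 2048},
}

body = build_bulk_body(docs)
print(body, end="")
```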

Running Vector Search

Elasticsearch supports two methods for vector search.

  • Exact kNN

    • A similarity metric (such as Euclidean distance or cosine similarity) is computed in the vector space between the query vector and the document vectors.
    • From these measurements, you can find the true “K nearest neighbors” of the query vector.
  • aNN (approximate kNN)

    • Embedding models for text or images typically create high-dimensional vectors with dozens to thousands of elements.
    • Finding the nearest neighbors among such high-dimensional vectors is expensive, so an efficient method is needed.
      • There are several methods, but it is common to sacrifice some accuracy to improve speed.
      • This means that aNN algorithms do not always return the true K nearest vectors.
      • Nevertheless, they are widely used because they keep good performance and efficiency while scaling to large data sets.
    • Elasticsearch uses the HNSW algorithm for aNN search.
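
To make the exact kNN idea concrete, the following Python sketch measures the Euclidean distance from a query vector to every stored vector and keeps the K closest. It uses the sample vectors from the indexing example; the `exact_knn` helper and the query vector are assumptions made for illustration, and the sketch is not how Elasticsearch is implemented internally.

```python
import math

def exact_knn(query, vectors, k):
    """Return the ids of the k vectors closest to query by Euclidean distance."""
    # Every stored vector is compared with the query, so the result is exact.
    distances = [(math.dist(query, vec), doc_id) for doc_id, vec in vectors.items()]
    distances.sort()
    return [doc_id for _, doc_id in distances[:k]]

# Sample vectors from the "position-sample" index.
vectors = {
    "1": [-84.095, 435.33, 951.2],
    "2": [219, 234, -52.0],
    "3": [13, 67, 23.0],
}

# Assumed query vector: the origin of the 3-dimensional space.
nearest = exact_knn([0.0, 0.0, 0.0], vectors, k=2)
print(nearest)
```

Because every vector is visited, the cost grows linearly with the number of documents. That scan is the work aNN algorithms avoid, at the price of occasionally missing a true neighbor.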

Exact kNN search uses the script_score query together with a vector function.

POST position-sample/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "range" : {
              "size" : {
                "gte": 100
              }
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'position-vector') + 1.0",
        "params": {
          "queryVector": [30.9, -908, 523]
        }
      }
    }
  }
}
Code Block. Vector Search Example Using Script_Score Query

In this example, the “script_score” query scores each document by its cosine similarity to the query vector. The “range” filter limits scoring to documents whose “size” is at least 100.

  • The “cosineSimilarity” function calculates the cosine similarity between the document vector and the query vector, and 1.0 is added to the result so that the score is never negative.
  • When you run this search query, Elasticsearch returns a list of documents sorted by their similarity to the query vector.
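
The scoring in this query can be reproduced offline. The following Python sketch applies the same filter and the same formula (cosine similarity plus 1.0) to the three sample documents; the `cosine_similarity` helper is written here for illustration, and small floating-point differences from the scores Elasticsearch reports are to be expected.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Sample documents from the "position-sample" index.
docs = {
    "1": {"position-vector": [-84.095, 435.33, 951.2], "size": 1024},
    "2": {"position-vector": [219, 234, -52.0], "size": 512},
    "3": {"position-vector": [13, 67, 23.0], "size": 2048},
}
query_vector = [30.9, -908, 523]

scores = {
    doc_id: cosine_similarity(query_vector, doc["position-vector"]) + 1.0
    for doc_id, doc in docs.items()
    if doc["size"] >= 100  # same condition as the range filter in the query
}

# Highest score first, as in a search response.
ranked = sorted(scores, key=scores.get, reverse=True)
for doc_id in ranked:
    print(doc_id, round(scores[doc_id], 4))
```

With the sample data, document 1 ranks first, followed by document 3 and then document 2. Documents 2 and 3 have a negative cosine similarity to the query vector, which is why the added 1.0 is needed to keep their scores above zero.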

aNN search uses the knn option, which searches one or more dense_vector fields whose index parameter is set to true.

PUT picture-index
{
  "mappings": {
    "properties": {
      "picture-vector": {
        "type": "dense_vector",
        "dims": 5,
        "index": true,
        "similarity": "l2_norm"
      },
      "name-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      },
      "name": {
        "type": "text"
      },
      "file-type": {
        "type": "keyword"
      }
    }
  }
}
Code Block. Example of Creating aNN Index Vector Field

In this example, an index called “picture-index” is created with a total of 4 fields.

  • The “name” field is of type “text”, which means it is analyzed for text search.
  • The “file-type” field is of type “keyword”.
  • The “picture-vector” field is 5-dimensional and the “name-vector” field is 3-dimensional. Both are indexed and use “l2_norm” similarity.

Example Code

POST picture-index/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "picture-vector": [7, 10, 4, 17, 81], "name-vector": [1, 1, 9], "name": "beautiful sun", "file-type": "png" }
{ "index": { "_id": "2" } }
{ "picture-vector": [20, 12, 13, 2, 5], "name-vector": [18, 5, 27], "name": "blue sky", "file-type": "jpg" }
{ "index": { "_id": "3" } }
{ "picture-vector": [15, 10, 2, 47, 17], "name-vector": [9, 5, 40], "name": "wave", "file-type": "png" }
Code Block. aNN Document Indexing Example
POST picture-index/_search
{
  "knn": {
    "field": "picture-vector",
    "query_vector": [1, 29, 60, 14, 54],
    "k": 5,
    "num_candidates": 100
  },
  "fields": [ "name", "file-type" ]
}
Code Block. aNN Query-based KNN Vector Search Example
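
To see which order this search is expected to return, the distances can be checked offline. The following Python sketch computes the squared Euclidean distance from the query vector to each “picture-vector” and converts it to a score with 1 / (1 + distance²), which is how Elasticsearch documents the score for “l2_norm” similarity. The `l2_norm_score` helper is a name introduced here for illustration, and since aNN search is approximate, the actual response is not guaranteed to match an exact calculation on larger data sets.

```python
def l2_norm_score(query, vector):
    """Score for l2_norm similarity: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)

# "picture-vector" values from the sample documents in "picture-index".
picture_vectors = {
    "1": [7, 10, 4, 17, 81],
    "2": [20, 12, 13, 2, 5],
    "3": [15, 10, 2, 47, 17],
}
query_vector = [1, 29, 60, 14, 54]
k = 5

scores = {doc_id: l2_norm_score(query_vector, vec) for doc_id, vec in picture_vectors.items()}

# Closest vector (highest score) first, limited to k results.
ranked = sorted(scores, key=scores.get, reverse=True)[:k]
for doc_id in ranked:
    print(doc_id, scores[doc_id])
```

Because the index holds only three documents, a request with k set to 5 returns at most three hits. In this calculation the order is document 1, then 2, then 3.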