The page has been translated by Gen AI.

Elasticsearch Vector Search Usage

Overview

Elasticsearch has NoSQL characteristics and is designed to provide fast, real-time search through distributed processing. In particular, it can search large volumes of unstructured data and supports full-text search. These features are also a significant advantage when text must be searched using vectors.

Vector Search refers to a type of search where documents or queries are represented as vectors rather than plain text. Vector Search uses machine learning (ML) to convert content such as images, audio, or video into vector representations and employs nearest neighbor algorithms to find similar data.

For example, unlike traditional keyword-matching search, vector search can find, among videos that have descriptions, the video whose description is most similar to what the user entered. The simplest way to search for similar documents is to count how many of the entered words match words in the document. In practice, however, data can be similar or related even when the words do not match exactly, so such data should also be taken into account.

A vector embedding captures meaning, which makes it possible to search by synonyms or related terms, and it expresses similarity as distance in the embedding space.
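
The contrast between counting matching words and measuring distance in an embedding space can be sketched in a few lines of Python. This is only an illustration and not part of Elasticsearch: the three-dimensional vectors in `toy_embeddings` are hand-made, hypothetical values chosen so that related phrases lie close together, whereas real vectors would come from an embedding model.

```python
import math

def keyword_overlap(query, document):
    """Count the distinct words shared by the query and the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this sketch only.
toy_embeddings = {
    "cheap car": [0.90, 0.10, 0.20],
    "affordable automobile": [0.85, 0.15, 0.25],
    "fresh fruit": [0.05, 0.90, 0.10],
}

query = "cheap car"
for text, vector in toy_embeddings.items():
    if text == query:
        continue
    shared = keyword_overlap(query, text)
    similarity = cosine_similarity(toy_embeddings[query], vector)
    print(f"{text!r}: shared words={shared}, cosine similarity={similarity:.3f}")
```

With these made-up vectors, "affordable automobile" shares no word with the query but is still much closer to it than "fresh fruit", which is the behavior that keyword counting alone cannot provide.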

Reference

In Samsung Cloud Platform, Elasticsearch vector search is available only in the Enterprise version. It cannot be used in the OSS version because X-Pack is not supported there.

Vector (text) embedding

Embedding refers to the process, or the result, of converting human language or images into vectors, that is, sequences of numbers that machines can process. This conversion is needed because a computer that works only with 0s and 1s cannot grasp the meaning of language or images directly.

There are various types of embedding techniques, and Elasticsearch is most closely related to word/sentence embeddings.

Word-level embeddings include the word2vec and GloVe models.
Sentence-level embeddings include ELMo and BERT, among others.

Word-level embedding can be considered the simplest form of embedding, and it enables similarity calculation and visualization. However, its biggest weakness is that it has difficulty distinguishing homonyms.

To address this, sentence-level embedding encodes the contextual meaning of the entire sentence, rather than of individual words, which results in better performance and brings the advantages of transfer learning.

In Elasticsearch, similarity search uses text (word/sentence) embeddings stored in fields of the dense_vector type.
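
The homonym problem can be seen in a small Python sketch. Everything here is a stated assumption: the two-dimensional `word_vectors` are invented values, and `sentence_embedding` simply averages word vectors, which is a crude stand-in and not how ELMo or BERT work. It only shows that a static word lookup cannot reflect context, while a vector built from the whole sentence can.

```python
# Hypothetical 2-dimensional word vectors, invented for this sketch.
word_vectors = {
    "river": [0.9, 0.1],
    "bank": [0.5, 0.5],
    "loan": [0.1, 0.9],
}

def word_embedding(word):
    """Static lookup: the vector never depends on the surrounding words."""
    return word_vectors[word]

def sentence_embedding(sentence):
    """Mean of the word vectors: a crude stand-in for a sentence encoder."""
    vectors = [word_vectors[word] for word in sentence.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# "bank" receives the identical vector in both contexts.
bank_in_river_sentence = word_embedding("bank")
bank_in_loan_sentence = word_embedding("bank")
print("same word vector:", bank_in_river_sentence == bank_in_loan_sentence)

# The sentence vectors differ because the context differs.
river_bank = sentence_embedding("river bank")
bank_loan = sentence_embedding("bank loan")
print("river bank ->", river_bank)
print("bank loan  ->", bank_loan)
```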

Preparing for vector search

Elasticsearch has supported a high-dimensional vector type since version 7.0, and since version 7.3 vectors can also be used for document scoring.

Elasticsearch currently supports a field type called dense_vector, with the following characteristics.

In a word embedding model, each word is represented as a dense vector.
It is mainly used for kNN (k-nearest neighbor) search.
Aggregations and sorting are not supported.
It is used to rank documents in a script_score query.
The number of dimensions (dims) of a field cannot exceed 2048.

Create a vector field in the Index

To use vector search in Elasticsearch, you must first create a vector field in the index.

PUT position-sample
{
  "mappings": {
    "properties": {
      "position-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": false
      },
      "size": {
        "type": "long"
      }
    }
  }
}

Code block. Index Vector Field creation example

In the above example, we create an index called “position-sample” with two fields, “size” and “position-vector”.

The “size” field is of type “long”.
The “position-vector” field is a 3-dimensional “dense_vector” type.

Document indexing using vectors

To index a document with a vector, you must include the vector field in the request.

POST position-sample/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "position-vector": [-84.095, 435.33, 951.2], "size": 1024 }
{ "index": { "_id": "2" } }
{ "position-vector": [219, 234, -52.0], "size": 512 }
{ "index": { "_id": "3" } }
{ "position-vector": [0.59, -98.2, 24.8], "size": 2048 }

Code block. Document indexing example
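
When documents come from application code, the bulk request body can be generated instead of written by hand. The following Python sketch builds the same newline-delimited body as the example and checks each vector against the mapped number of dimensions before anything is sent. The helper name `build_bulk_body` and the constant `MAX_DIMS` are hypothetical; the limit of 2048 is the one stated in this document and may differ between Elasticsearch versions. Sending the body to the cluster is deliberately left out.

```python
import json

MAX_DIMS = 2048  # limit stated in this document; may differ by version

def build_bulk_body(documents, vector_field, dims):
    """Return an NDJSON string: one action line and one source line per document."""
    if dims > MAX_DIMS:
        raise ValueError(f"dims {dims} exceeds the limit of {MAX_DIMS}")
    lines = []
    for doc_id, source in documents.items():
        vector = source[vector_field]
        if len(vector) != dims:
            raise ValueError(
                f"document {doc_id}: expected {dims} dimensions, got {len(vector)}"
            )
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # A bulk request body must end with a newline character.
    return "\n".join(lines) + "\n"

documents = {
    "1": {"position-vector": [-84.095, 435.33, 951.2], "size": 1024},
    "2": {"position-vector": [219, 234, -52.0], "size": 512},
    "3": {"position-vector": [0.59, -98.2, 24.8], "size": 2048},
}

body = build_bulk_body(documents, "position-vector", dims=3)
print(body)
```

Validating the vector length on the client side is a design choice of this sketch: it reports a mismatch with the mapping before the request is sent rather than relying on the error returned for the rejected document.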

Execute vector search

Elasticsearch supports two methods for vector search.

Exact kNN
- Measures similarity to a given point in the vector space with a metric such as Euclidean distance or cosine similarity.
- Compares the query vector with every candidate vector, so it finds the true K “nearest neighbors” of the query vector.
aNN (approximate kNN)
- Typically, embedding models for text or images generate high-dimensional vectors composed of as few as 100 to as many as over 1,000 elements.
- Finding the nearest neighbor in such high-dimensional vectors is very difficult, so we need to consider the most efficient method.
  - There are various methods, but it is common to sacrifice accuracy to improve speed.
  - This means that the aNN algorithm does not always guarantee the true K nearest vector search results.
  - Nevertheless, it is widely used because it can scale to large data sets while maintaining good performance and efficient execution.
- Elasticsearch uses the HNSW algorithm for aNN search.
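
The trade-off between the two methods can be illustrated with a short Python sketch. Note the assumptions: the data is synthetic random vectors, and `sampled_knn` is a deliberately crude approximation that searches only a random subset of the vectors. It is not HNSW and not what Elasticsearch does; it only shows that looking at fewer candidates is faster but may miss some of the true nearest neighbors.

```python
import math
import random

def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k):
    """Compare the query with every vector and keep the ids of the k closest."""
    return sorted(vectors, key=lambda vid: euclidean(query, vectors[vid]))[:k]

def sampled_knn(query, vectors, k, sample_size, rng):
    """Crude approximation: run the exact search on a random subset only."""
    candidates = rng.sample(sorted(vectors), sample_size)
    subset = {vid: vectors[vid] for vid in candidates}
    return exact_knn(query, subset, k)

rng = random.Random(42)
# Synthetic data: 1,000 random 8-dimensional vectors keyed by an integer id.
vectors = {i: [rng.random() for _ in range(8)] for i in range(1000)}
query = [rng.random() for _ in range(8)]

k = 10
exact = exact_knn(query, vectors, k)
approx = sampled_knn(query, vectors, k, sample_size=200, rng=rng)

# Recall: the fraction of the true neighbors that the approximation found.
recall = len(set(exact) & set(approx)) / k
print(f"recall of the sampled search: {recall:.2f}")
```

The sampled search computes one fifth of the distances, and a recall below 1.0 means it returned some vectors that are not among the true K nearest. Graph-based algorithms such as HNSW aim for a much better balance between speed and recall than this random sampling.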

Exact kNN vector search

Exact kNN search uses a script_score query together with a vector function.

POST position-sample/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "range" : {
              "size" : {
                "gte": 100
              }
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'position-vector') + 1.0",
        "params": {
          "queryVector": [30.9, -908, 523]
        }
      }
    }
  }
}

Code block. Example of vector search using a script_score query

In the above example, we use the “script_score” query to score each document based on cosine similarity with the query vector. The inner “bool” filter first restricts the candidates to documents whose “size” is at least 100.

The “cosineSimilarity” function calculates the cosine similarity between a document vector and a query vector, and 1.0 is added to the result so that the score is never negative.
When you execute this search query, Elasticsearch returns a list of documents sorted by similarity to the query vector.
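
To see what order to expect, the scoring can be reproduced outside Elasticsearch. The Python sketch below applies the same filter and the same “cosine similarity + 1.0” formula to the three sample documents. It is an illustration under the assumption that plain double-precision arithmetic is close enough; scores returned by Elasticsearch may differ in the last digits.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The three documents indexed into position-sample.
documents = {
    "1": {"position-vector": [-84.095, 435.33, 951.2], "size": 1024},
    "2": {"position-vector": [219, 234, -52.0], "size": 512},
    "3": {"position-vector": [0.59, -98.2, 24.8], "size": 2048},
}
query_vector = [30.9, -908, 523]

scores = {
    doc_id: cosine_similarity(query_vector, doc["position-vector"]) + 1.0
    for doc_id, doc in documents.items()
    if doc["size"] >= 100  # same condition as the range filter
}

# Highest score first, as in a search response.
ranking = sorted(scores, key=scores.get, reverse=True)
for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 4))
```

All three documents pass the filter. Document 3 points in nearly the same direction as the query vector and ranks first, followed by document 1, while document 2 points away from the query and receives the lowest score.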

Using aNN for vector search

Use the knn option to search one or more dense_vector fields whose index parameter is set to true.

PUT picture-index
{
  "mappings": {
    "properties": {
      "picture-vector": {
        "type": "dense_vector",
        "dims": 5,
        "index": true,
        "similarity": "l2_norm"
      },
      "name-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      },
      "name": {
        "type": "text"
      },
      "file-type": {
        "type": "keyword"
      }
    }
  }
}

Code block. aNN Index Vector Field creation example

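In the above example, we create an index called “picture-index” with a total of 4 fields.

The “name” field is of type “text”, which means it is analyzed for full-text search.
The “file-type” field is of type “keyword”.
In contrast, the “picture-vector” field is a 5-dimensional “dense_vector” and the “name-vector” field is a 3-dimensional “dense_vector”; both are indexed with “l2_norm” similarity.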

Example code

POST picture-index/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "picture-vector": [7, 10, 4, 17, 81], "name-vector": [1, 1, 9], "name": "beautiful sun", "file-type": "png" }
{ "index": { "_id": "2" } }
{ "picture-vector": [20, 12, 13, 2, 5], "name-vector": [18, 5, 27], "name": "blue sky", "file-type": "jpg" }
{ "index": { "_id": "3" } }
{ "picture-vector": [15, 10, 2, 47, 17], "name-vector": [9, 5, 40], "name": "wave", "file-type": "png" }

Code block. aNN Document Indexing example

POST picture-index/_search
{
  "knn": {
    "field": "picture-vector",
    "query_vector": [1, 29, 60, 14, 54],
    "k": 5,
    "num_candidates": 100
  },
  "fields": [ "name", "file-type" ]
}

Code block. Example of kNN vector search using an aNN query
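
The result of this query can be checked by hand because the index holds only three documents. The Python sketch below computes the squared Euclidean distance from the query vector to each “picture-vector” by brute force and orders the documents from nearest to farthest. Two assumptions apply: on an index this small the approximate search is expected to return the same neighbors as the brute-force calculation, and the score formula 1 / (1 + squared distance) is taken from the Elasticsearch documentation for “l2_norm” similarity, not from this page.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# picture-vector values of the three documents in picture-index.
pictures = {
    "1": [7, 10, 4, 17, 81],
    "2": [20, 12, 13, 2, 5],
    "3": [15, 10, 2, 47, 17],
}
query_vector = [1, 29, 60, 14, 54]
k = 5  # larger than the number of documents, so all three are returned

distances = {doc_id: squared_l2(query_vector, vec) for doc_id, vec in pictures.items()}
nearest = sorted(distances, key=distances.get)[:k]

for doc_id in nearest:
    # Assumed l2_norm scoring: a smaller distance gives a higher score.
    score = 1 / (1 + distances[doc_id])
    print(doc_id, distances[doc_id], f"{score:.6f}")
```

The squared distances are 4271, 5404, and 6379 for documents 1, 2, and 3, so the expected order is “beautiful sun”, “blue sky”, then “wave”. Because “k” is 5 and only three documents exist, the response contains all of them.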