The page has been translated by Gen AI.

PostgreSQL pgvector Usage vector

PostgreSQL pgvector Usage vector

Overview

In this document, we will discuss how to use the pgvector extension in the Samsung Cloud Platform PostgreSQL(DBaaS), EPAS(DBaaS) products. In PostgreSQL, you can use additional features not provided by PostgreSQL through the extension mechanism. The pgvector extension adds open-source vector similarity search to PostgreSQL. In this document, we explore how to configure the pgvector extension, and how to create, store, and query vectors. The current latest version is v0.8.0, and more detailed usage can be found in the official documentation (https://github.com/pgvector/pgvector/blob/master/README.md).

concept

Vector similarity(Vector similarity)

Vector similarity is a method used to measure how similar two items are by representing them as a series of numeric vectors. Vector similarity is typically calculated using distance metrics such as cosine similarity (cosine similarity) or Euclidean distance (Euclidean distance). Euclidean distance measures the straight-line distance between two vectors, whereas cosine similarity measures the cosine value of the angle between two vectors. Similarity values generally range between 0 and 1, and higher values indicate greater similarity. It is high. Vector similarity is widely used in various applications such as recommendation systems, text classification, image recognition, and clustering.

Embedding(Embedding)

Embedding is a technique that evaluates the relevance of text, images, video, or other types of information. Through this evaluation, machine learning models can efficiently identify relationships and similarities between data, enabling them to recognize patterns and make accurate predictions.

Using the pgvector extension

Check whether pgvector extension is available

First, verify whether the pgvector extension is installed and available.

SELECT * FROM pg_catalog.pg_available_extensions where name=pgvector;

If it is not displayed in the list, the pgvector extension package is not installed on the DB server, so please request package installation through the ‘1:1 inquiry’ of Samsung Cloud Platform Conolse and then proceed.

Enable pgvector extension

To enable the pgvector extension, execute the CREATE EXTENSION statement to load the package into the DB.

CREATE EXTENSION pgvector;

How to use the pgvector extension

Data Storage

  • Create a table with a column of vector type

    CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
    
  • Add a column of vector type to the existing table

    ALTER TABLE items ADD COLUMN embedding vector(3);
    
  • Insert vector data

    INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
    
  • vector data Upsert

    INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]') ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
    
  • vector data Update

    UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;
    
  • Delete vector data

    DELETE FROM items WHERE id = 1;
    

Data Query

  • Nearest neighbor search for a specific vector value (Euclidean distance)

    SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
    
  • Nearest neighbor search for a specific row (Euclidean distance)

    SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
    
  • Retrieve rows within a specific distance of a given vector value (Euclidean distance)

    SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
    

pgvector provides three operators that can be used for similarity calculations.

operatorExplanation
<->Euclidean distance
<#>negative inner product
<=>cosine distance
Table. Operators provided by pgvector

  • Euclidean distance query

    SELECT embedding <-> '[3,1,2]' AS distance FROM items;
    
  • Dot product lookup (multiply by -1 because the dot product returns a negative value)

    SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
    
  • Cosine similarity lookup (1 – using cosine distance)

    SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
    
  • Vector average query

    SELECT AVG(embedding) FROM items;
    
  • Query vector group average

    SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;
    

Data Indexing

You can add an index to use Approximate Nearest Neighbor search for performance. Unlike a regular index, the search results may change after adding the index. However, you can index vectors up to a maximum of 2,000 dimensions.

  • Create L2 distance index

    CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
    
  • Create internal index

    CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);
    
  • Create cosine distance index

    CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
    

Data Filtering

You can create and utilize an index on the columns used in the WHERE clause for condition searches.

  • Create index for conditional search column

    CREATE INDEX ON items (category_id);
    
  • Query using conditional search columns

    SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
    
  • Create partial index(https://www.postgresql.org/docs/current/indexes-partial.html)

    CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100) WHERE (category_id = 123);
    

You can also split the dataset using partitioning (https://www.postgresql.org/docs/current/ddl-partitioning.html)

  • Create partition table (LIST partition)

    CREATE TABLE items (embedding vector(3), category_id int) PARTITION BY LIST(category_id);
    

You can perform hybrid search by using it together with PostgreSQL’s full-text search.(https://www.postgresql.org/docs/current/textsearch-intro.html)

SELECT id, content FROM items, plainto_tsquery('hello search') query WHERE textsearch @@ query ORDER BY ts_rank_cd(textsearch, query) DESC LIMIT 5;