The page has been translated by Gen AI.

PostgreSQL pgvector User Guide

Overview

This document will cover the method of using the pgvector extension in Samsung Cloud Platform PostgreSQL (DBaaS) and EPAS (DBaaS) products. In PostgreSQL, the extension feature allows for the use of additional functions not provided by PostgreSQL. The pgvector extension adds open-source vector similarity search to PostgreSQL. This document will explain how to set up the pgvector extension, create and store vectors, and query them. The current latest version is v0.8.0, and more detailed usage can be found in the official documentation (https://github.com/pgvector/pgvector/blob/master/README.md).

Concepts

Vector Similarity

Vector similarity is a method used to measure how similar two items are by representing them as a series of numbers, or vectors. Vector similarity is typically calculated using distance metrics such as cosine similarity or Euclidean distance. Euclidean distance measures the straight-line distance between two vectors, while cosine similarity measures the cosine of the angle between two vectors. The similarity value typically ranges from 0 to 1, with higher values indicating higher similarity. Vector similarity is widely used in various applications such as recommendation systems, text classification, image recognition, and clustering.

Embedding

Embedding is a technique used to evaluate the relevance of text, images, videos, or other types of information. By evaluating the information, machine learning models can efficiently identify relationships and similarities between data, recognize patterns, and make accurate predictions.

Using the pgvector Extension

Checking pgvector Extension Availability

First, check if the pgvector extension is installed and available for use.

SELECT * FROM pg_catalog.pg_available_extensions where name='pgvector';

If it is not listed, the pgvector extension package is not installed on the DB server. Please request package installation through the Samsung Cloud Platform Console’s ‘1:1 Inquiry’ and proceed.

Enabling the pgvector Extension

To enable the pgvector extension, execute the CREATE EXTENSION statement to load the package into the DB.

CREATE EXTENSION pgvector;

Using the pgvector Extension

Data Storage

Create a table with a vector-type column

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Add a vector-type column to an existing table

ALTER TABLE items ADD COLUMN embedding vector(3);

Insert vector data

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Upsert vector data

INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]') ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update vector data

UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;

Delete vector data
```
DELETE FROM items WHERE id = 1;
```

Data Querying

Specific vector value Nearest neighbor search (Euclidean distance)

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Specific row Nearest neighbor search (Euclidean distance)

SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;

Row retrieval within a specific distance from a specific vector value (Euclidean distance)
```
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
```

Pgvector provides three operators that can be used for similarity calculations.

Operator	Description
<->	Euclidean distance
<#>	Negative inner product
<=>	Cosine distance

Table. Operators provided by pgvector

Euclidean distance query

SELECT embedding <-> '[3,1,2]' AS distance FROM items;

Inner product query (inner product returns a negative value, so multiply by -1)
```
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
```

Cosine similarity query (use 1 - cosine distance)

SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;

Vector average query
```
SELECT AVG(embedding) FROM items;
```

Vector group average query

SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;

Data Indexing

For performance, an index can be added to use Approximate nearest neighbor search. Unlike regular indexes, the search results may differ after adding an index. However, indexing is possible up to 2,000-dimensional vectors.

Create L2 distance index

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

Create inner product index

CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);

Create cosine distance index

CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Data Filtering

An index can be created for the columns used in the WHERE clause condition search and utilized.

Create index for condition search column
```
CREATE INDEX ON items (category_id);
```

Utilize condition search column query

SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Create partial index (https://www.postgresql.org/docs/current/indexes-partial.html)

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100) WHERE (category_id = 123);

You can also use partitioning to divide the dataset (https://www.postgresql.org/docs/current/ddl-partitioning.html)

Partition Table Creation (LIST Partition)

CREATE TABLE items (embedding vector(3), category_id int) PARTITION BY LIST(category_id);

Hybrid Search

PostgreSQL’s full-text search can be used together to perform hybrid searches. (https://www.postgresql.org/docs/current/textsearch-intro.html)

SELECT id, content FROM items, plainto_tsquery('hello search') query WHERE textsearch @@ query ORDER BY ts_rank_cd(textsearch, query) DESC LIMIT 5;