PostgreSQL pgvector User Guide
PostgreSQL pgvector User Guide
Overview
This document will cover the method of using the pgvector extension in Samsung Cloud Platform PostgreSQL (DBaaS) and EPAS (DBaaS) products. In PostgreSQL, the extension feature allows for the use of additional functions not provided by PostgreSQL. The pgvector extension adds open-source vector similarity search to PostgreSQL. This document will explain how to set up the pgvector extension, create and store vectors, and query them. The current latest version is v0.8.0, and more detailed usage can be found in the official documentation (https://github.com/pgvector/pgvector/blob/master/README.md).
Concepts
Vector Similarity
Vector similarity is a method used to measure how similar two items are by representing them as a series of numbers, or vectors. Vector similarity is typically calculated using distance metrics such as cosine similarity or Euclidean distance. Euclidean distance measures the straight-line distance between two vectors, while cosine similarity measures the cosine of the angle between two vectors. The similarity value typically ranges from 0 to 1, with higher values indicating higher similarity. Vector similarity is widely used in various applications such as recommendation systems, text classification, image recognition, and clustering.
Embedding
Embedding is a technique used to evaluate the relevance of text, images, videos, or other types of information. By evaluating the information, machine learning models can efficiently identify relationships and similarities between data, recognize patterns, and make accurate predictions.
Using the pgvector Extension
Checking pgvector Extension Availability
First, check if the pgvector extension is installed and available for use.
SELECT * FROM pg_catalog.pg_available_extensions where name='pgvector';
If it is not listed, the pgvector extension package is not installed on the DB server. Please request package installation through the Samsung Cloud Platform Console’s ‘1:1 Inquiry’ and proceed.
Enabling the pgvector Extension
To enable the pgvector extension, execute the CREATE EXTENSION statement to load the package into the DB.
CREATE EXTENSION pgvector;
Using the pgvector Extension
Data Storage
Create a table with a vector-type column
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));Add a vector-type column to an existing table
ALTER TABLE items ADD COLUMN embedding vector(3);Insert vector data
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');Upsert vector data
INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]') ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;Update vector data
UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;Delete vector data
DELETE FROM items WHERE id = 1;
Data Querying
Specific vector value Nearest neighbor search (Euclidean distance)
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;Specific row Nearest neighbor search (Euclidean distance)
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;Row retrieval within a specific distance from a specific vector value (Euclidean distance)
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
Pgvector provides three operators that can be used for similarity calculations.
| Operator | Description |
|---|---|
| <-> | Euclidean distance |
| <#> | Negative inner product |
| <=> | Cosine distance |
Euclidean distance query
SELECT embedding <-> '[3,1,2]' AS distance FROM items;Inner product query (inner product returns a negative value, so multiply by -1)
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;Cosine similarity query (use 1 - cosine distance)
SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;Vector average query
SELECT AVG(embedding) FROM items;Vector group average query
SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;
Data Indexing
For performance, an index can be added to use Approximate nearest neighbor search. Unlike regular indexes, the search results may differ after adding an index. However, indexing is possible up to 2,000-dimensional vectors.
Create L2 distance index
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);Create inner product index
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);Create cosine distance index
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Data Filtering
An index can be created for the columns used in the WHERE clause condition search and utilized.
Create index for condition search column
CREATE INDEX ON items (category_id);Utilize condition search column query
SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;Create partial index (https://www.postgresql.org/docs/current/indexes-partial.html)
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100) WHERE (category_id = 123);
You can also use partitioning to divide the dataset (https://www.postgresql.org/docs/current/ddl-partitioning.html)
Partition Table Creation (LIST Partition)
CREATE TABLE items (embedding vector(3), category_id int) PARTITION BY LIST(category_id);
Hybrid Search
PostgreSQL’s full-text search can be used together to perform hybrid searches. (https://www.postgresql.org/docs/current/textsearch-intro.html)
SELECT id, content FROM items, plainto_tsquery('hello search') query WHERE textsearch @@ query ORDER BY ts_rank_cd(textsearch, query) DESC LIMIT 5;