PostgreSQL pgvector Usage vector
PostgreSQL pgvector Usage vector
Overview
In this document, we will discuss how to use the pgvector extension in the Samsung Cloud Platform PostgreSQL(DBaaS), EPAS(DBaaS) products. In PostgreSQL, you can use additional features not provided by PostgreSQL through the extension mechanism. The pgvector extension adds open-source vector similarity search to PostgreSQL. In this document, we explore how to configure the pgvector extension, and how to create, store, and query vectors. The current latest version is v0.8.0, and more detailed usage can be found in the official documentation (https://github.com/pgvector/pgvector/blob/master/README.md).
concept
Vector similarity(Vector similarity)
Vector similarity is a method used to measure how similar two items are by representing them as a series of numeric vectors. Vector similarity is typically calculated using distance metrics such as cosine similarity (cosine similarity) or Euclidean distance (Euclidean distance). Euclidean distance measures the straight-line distance between two vectors, whereas cosine similarity measures the cosine value of the angle between two vectors. Similarity values generally range between 0 and 1, and higher values indicate greater similarity. It is high. Vector similarity is widely used in various applications such as recommendation systems, text classification, image recognition, and clustering.
Embedding(Embedding)
Embedding is a technique that evaluates the relevance of text, images, video, or other types of information. Through this evaluation, machine learning models can efficiently identify relationships and similarities between data, enabling them to recognize patterns and make accurate predictions.
Using the pgvector extension
Check whether pgvector extension is available
First, verify whether the pgvector extension is installed and available.
SELECT * FROM pg_catalog.pg_available_extensions where name=’pgvector’;
If it is not displayed in the list, the pgvector extension package is not installed on the DB server, so please request package installation through the ‘1:1 inquiry’ of Samsung Cloud Platform Conolse and then proceed.
Enable pgvector extension
To enable the pgvector extension, execute the CREATE EXTENSION statement to load the package into the DB.
CREATE EXTENSION pgvector;
How to use the pgvector extension
Data Storage
Create a table with a column of vector type
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));Add a column of vector type to the existing table
ALTER TABLE items ADD COLUMN embedding vector(3);Insert vector data
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');vector data Upsert
INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]') ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;vector data Update
UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;Delete vector data
DELETE FROM items WHERE id = 1;
Data Query
Nearest neighbor search for a specific vector value (Euclidean distance)
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;Nearest neighbor search for a specific row (Euclidean distance)
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;Retrieve rows within a specific distance of a given vector value (Euclidean distance)
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
pgvector provides three operators that can be used for similarity calculations.
| operator | Explanation |
|---|---|
| <-> | Euclidean distance |
| <#> | negative inner product |
| <=> | cosine distance |
Euclidean distance query
SELECT embedding <-> '[3,1,2]' AS distance FROM items;Dot product lookup (multiply by -1 because the dot product returns a negative value)
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;Cosine similarity lookup (1 – using cosine distance)
SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;Vector average query
SELECT AVG(embedding) FROM items;Query vector group average
SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;
Data Indexing
You can add an index to use Approximate Nearest Neighbor search for performance. Unlike a regular index, the search results may change after adding the index. However, you can index vectors up to a maximum of 2,000 dimensions.
Create L2 distance index
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);Create internal index
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);Create cosine distance index
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Data Filtering
You can create and utilize an index on the columns used in the WHERE clause for condition searches.
Create index for conditional search column
CREATE INDEX ON items (category_id);Query using conditional search columns
SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;Create partial index(https://www.postgresql.org/docs/current/indexes-partial.html)
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100) WHERE (category_id = 123);
You can also split the dataset using partitioning (https://www.postgresql.org/docs/current/ddl-partitioning.html)
Create partition table (LIST partition)
CREATE TABLE items (embedding vector(3), category_id int) PARTITION BY LIST(category_id);
Hybrid Search
You can perform hybrid search by using it together with PostgreSQL’s full-text search.(https://www.postgresql.org/docs/current/textsearch-intro.html)
SELECT id, content FROM items, plainto_tsquery('hello search') query WHERE textsearch @@ query ORDER BY ts_rank_cd(textsearch, query) DESC LIMIT 5;