#include <IndexIVF_HNSW.h>

Inheritance diagram for ivfhnsw::IndexIVF_HNSW:

Public Types
typedef uint32_t	idx_t
	all indices are this type

Public Member Functions
	IndexIVF_HNSW (size_t dim, size_t ncentroids, size_t bytes_per_code, size_t nbits_per_idx)

void	build_quantizer (const char path_data, const char path_info, const char *path_edges, size_t M=16, size_t efConstruction=500)

void	assign (size_t n, const float x, idx_t labels, size_t k=1)

virtual void	search (size_t k, const float x, float distances, long *labels)

virtual void	add_batch (size_t n, const float x, const idx_t xids, const idx_t *precomputed_idx=nullptr)

virtual void	train_pq (size_t n, const float *x)

virtual void	write (const char *path)
	Write index to the path.

virtual void	read (const char *path)
	Read index from the path.

void	compute_centroid_norms ()
	Compute norms of the HNSW vertices.

void	rotate_quantizer ()
	For correct search using OPQ encoding rotate points in the coarse quantizer.

Public Attributes
size_t	d
	Vector dimension.

size_t	nc
	Number of centroids.

size_t	code_size
	Code size per vector in bytes.

hnswlib::HierarchicalNSW *	quantizer
	Quantizer that maps vectors to inverted lists (HNSW [Y.Malkov])

faiss::ProductQuantizer *	pq
	Produces the residual codes.

faiss::ProductQuantizer *	norm_pq
	Produces the norm codes of reconstructed base vectors.

faiss::LinearTransform *	opq_matrix
	Rotation matrix for OPQ encoding.

bool	do_opq
	Turn on/off OPQ encoding.

size_t	nprobe
	Number of probes at search time.

size_t	max_codes
	Max number of codes to visit to do a query.

std::vector< std::vector< idx_t > >	ids
	Inverted lists for indexes.

std::vector< std::vector < uint8_t > >	codes
	PQ codes of residuals.

std::vector< std::vector < uint8_t > >	norm_codes
	PQ codes of norms of reconstructed base vectors.

Protected Member Functions
float	pq_L2sqr (const uint8_t *code)
	L2 sqr distance function for PQ codes.

Protected Attributes
std::vector< float >	norms
	L2 square norms of reconstructed base vectors.

std::vector< float >	centroid_norms
	L2 square norms of coarse centroids.

std::vector< float >	precomputed_table
	Size pq.M * pq.ksub.

Detailed Description

Index based on a inverted file (IVF) with Product Quantizer encoding.

In the inverted file, the quantizer (an HNSW instance) provides a quantization index for each vector to be added. The quantization index maps to a list (aka inverted list or posting list), where the id of the vector is then stored.

At search time, the vector to be searched is also quantized, and only the list corresponding to the quantization index is searched. This speeds up the search by making it non-exhaustive. This can be relaxed using multi-probe search: a few (nprobe) quantization indices are selected and several inverted lists are visited.

Supports HNSW quantizer construction, PQ training, adding vertices, serialization and searching.

Each residual vector is encoded as a product quantizer code.

Currently only asymmetric queries are supported: database-to-database queries are not implemented.

Member Function Documentation

void ivfhnsw::IndexIVF_HNSW::add_batch	(	size_t	n,
		const float *	x,
		const idx_t *	xids,
		const idx_t *	precomputed_idx = `nullptr`
	)

virtual

Add n vectors of dimension d to the index.

Parameters

n	number of base vectors in a batch
x	base vectors to add, size n * d
xids	ids to store for the vectors (size n)
precomputed_idx	if non-null, assigned idxs to store for the vectors (size n)

void ivfhnsw::IndexIVF_HNSW::assign	(	size_t	n,
		const float *	x,
		idx_t *	labels,
		size_t	k = `1`
	)

Return the indices of the k HNSW vertices closest to the query x.

Parameters

n	number of input vectors
x	query vectors, size n * d
labels	output labels of the nearest neighbours, size n * k
k	number of the closest HNSW vertices to the query x

void ivfhnsw::IndexIVF_HNSW::build_quantizer	(	const char *	path_data,
		const char *	path_info,
		const char *	path_edges,
		size_t	M = `16`,
		size_t	efConstruction = `500`
	)

Construct from stretch or load the existing quantizer (HNSW) instance

if all files exist, quantizer will be loaded, else HNSW will be constructed

Parameters

path_data	path to input vectors
path_info	path to parameters for HNSW
path_edges	path to edges for HNSW
M	min number of edges per point, default: 16
efConstruction	max number of candidate vertices in queue to observe, default: 500

There has been removed parallel HNSW construction in order to make internal centroid ids equal to external ones. Construction time is still acceptable: ~5 minutes for 1 million 96-d vectors on Intel Xeon E5-2650 V2 2.60GHz.

void ivfhnsw::IndexIVF_HNSW::search	(	size_t	k,
		const float *	x,
		float *	distances,
		long *	labels
	)

virtual

Query n vectors of dimension d to the index.

Return at most k vectors. If there are not enough results for a query, the result array is padded with -1s.

Parameters

k	number of the closest vertices to search
x	query vectors, size n * d
distances	output pairwise distances, size n * k
labels	output labels of the nearest neighbours, size n * k

Search procedure

During IVF-HNSW-PQ search we compute

d = || x - y_C - y_R ||^2

where x is the query vector, y_C the coarse centroid, y_R the refined PQ centroid. The expression can be decomposed as:

d = || x - y_C ||^2 - || y_C ||^2 + || y_C + y_R ||^2 - 2 * (x|y_R)

term 1 term 2 term 3

We use the following decomposition:

term 1 is the distance to the coarse centroid, that is computed during the 1st stage search in the HNSW graph, minus the norm of the coarse centroid
term 2 is the L2 norm of the reconstructed base point, that is computed at construction time, quantized using separately trained product quantizer for such norms and stored along with the residual PQ codes.
term 3 is the classical non-residual distance table.

Norms of centroids are precomputed and saved without compression, as their memory consumption is negligible. If it is necessary, the norms can be added to the term 3 and compressed to byte together. We do not think that it will lead to considerable decrease in accuracy.

Since y_R defined by a product quantizer, it is split across sub-vectors and stored separately for each subvector.

Reimplemented in ivfhnsw::IndexIVF_HNSW_Grouping.

void ivfhnsw::IndexIVF_HNSW::train_pq	(	size_t	n,
		const float *	x
	)

virtual

Train product quantizers

Parameters

n	number of training vectors of dimension d
x	learn vectors, size n * d

Reimplemented in ivfhnsw::IndexIVF_HNSW_Grouping.

The documentation for this struct was generated from the following files:

IndexIVF_HNSW.h
IndexIVF_HNSW.cpp

Public Types

Public Member Functions

Public Attributes

Protected Member Functions

Protected Attributes

Detailed Description

Member Function Documentation