https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html
# Apache Solr Reference Guide
Solr Reference Guide 9.0
Dense Vector Search
Solr's Dense Vector Search adds support for indexing and searching
dense numerical vectors.
Deep learning can be used to produce a vector representation of both
the query and the documents in a corpus of information.
These neural network-based techniques are usually referred to as
neural search, an industry derivation from the academic field of
Neural Information Retrieval.
Important Concepts
Dense Vector Representation
A traditional tokenized inverted index can be considered to model
text as a "sparse" vector, in which each term in the corpus
corresponds to one vector dimension. In such a model, the number of
dimensions is generally quite high (corresponding to the term
dictionary cardinality), and the vector for any given document
contains mostly zeros (hence it is sparse, as only a handful of terms
that exist in the overall index will be present in any given
document).
Dense vector representation contrasts with term-based sparse vector
representation in that it distills approximate semantic meaning into
a fixed (and limited) number of dimensions.
The number of dimensions in this approach is generally much lower
than the sparse case, and the vector for any given document is dense,
as most of its dimensions are populated by non-zero values.
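The contrast can be made concrete with a small sketch. It assumes a toy
eight-term dictionary and whitespace tokenization (both hypothetical, and
far simpler than a real analyzer) and builds the term-count vector that an
inverted index implicitly models; most of its dimensions stay at zero. The
dense vector in main is the example vector used later on this page.

```java
import java.util.Arrays;
import java.util.List;

final class SparseVersusDenseSketch {

    // One dimension per dictionary term; the value is the term count in the text.
    // Whitespace tokenization is an assumption made only for this illustration.
    static int[] sparseVector(List<String> dictionary, String text) {
        int[] vector = new int[dictionary.size()];
        for (String token : text.toLowerCase().split("\\s+")) {
            int dimension = dictionary.indexOf(token);
            if (dimension >= 0) {
                vector[dimension]++;
            }
        }
        return vector;
    }

    // Number of populated (non-zero) dimensions.
    static long nonZero(int[] vector) {
        return Arrays.stream(vector).filter(x -> x != 0).count();
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of(
            "apache", "dense", "graph", "index", "search", "solr", "vector", "zookeeper");
        int[] sparse = sparseVector(dictionary, "dense vector search");
        System.out.println(Arrays.toString(sparse) + " non-zero: " + nonZero(sparse));

        // A dense vector has few dimensions, and most of them are populated.
        float[] dense = {1.0f, 2.5f, 3.7f, 4.1f};
        System.out.println(Arrays.toString(dense));
    }
}
```

With a realistic term dictionary the sparse vector would have many
thousands of dimensions, while the dense vector keeps a fixed, small size.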
In contrast to the sparse approach (for which tokenizers are used to
generate sparse vectors directly from text input) the task of
generating vectors must be handled in application logic external to
Apache Solr.
There may be cases where it makes sense to directly search data that
natively exists as a vector (e.g., scientific data); but in a text
search context, it is likely that users will leverage deep learning
models such as BERT to encode textual information as dense vectors,
supplying the resulting vectors to Apache Solr explicitly at index
and query time.
For additional information, refer to the blog posts listed under
Additional Resources.
Dense Retrieval
Given a dense vector v that models the information need, the easiest
approach for providing dense vector retrieval would be to calculate
the distance (euclidean, dot product, etc.) between v and each vector
d that represents a document in the corpus of information.
This approach is quite expensive, since its cost grows linearly with
the number of indexed vectors, so many approximate strategies are
currently under active research.
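The exhaustive approach described above can be sketched as follows. This
is a minimal illustration, not how Apache Lucene or Apache Solr execute
the search: it assumes the document vectors fit in memory as a float[][]
and uses squared Euclidean distance, and the class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ExactKnnSketch {

    // Squared Euclidean distance between two vectors of equal dimension.
    static double squaredEuclidean(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the positions of the k documents closest to the query.
    // Every document is compared with the query, so the cost is
    // proportional to (number of documents * vector dimension).
    static List<Integer> nearest(float[] query, float[][] docs, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingDouble(i -> squaredEuclidean(query, docs[i])));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        float[][] docs = {
            {1.0f, 2.5f, 3.7f, 4.1f},
            {1.5f, 5.5f, 6.7f, 65.1f}
        };
        float[] query = {1.0f, 2.0f, 3.0f, 4.0f};
        System.out.println(nearest(query, docs, 1)); // prints [0]
    }
}
```

Because every query touches every vector, this is practical only for
small collections, which is the motivation for approximate strategies.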
The strategy implemented in Apache Lucene and used by Apache Solr is
based on Hierarchical Navigable Small World (HNSW) graphs.
It provides efficient approximate nearest neighbor search for
high-dimensional vectors.
See Approximate nearest neighbor algorithm based on navigable small
world graphs [2014] and Efficient and robust approximate nearest
neighbor search using Hierarchical Navigable Small World graphs
[2018] for details.
Index Time
This is the Apache Solr field type designed to support dense vector
search:
DenseVectorField
The dense vector field supports indexing and searching dense vectors
of float elements.
For example:
[1.0, 2.5, 3.7, 4.1]
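Here's how DenseVectorField should be configured in the schema (the
type name knn_vector and the field name vector are illustrative):
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
The field type accepts the following parameters: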
vectorDimension
Required Default: none
The dimension of the dense vector to pass in.
Accepted values: Any integer <= 1024.
similarityFunction
Optional Default: euclidean
Vector similarity function, used at search time to return the top K
vectors most similar to a target vector.
Accepted values: euclidean, dot_product or cosine.
+ euclidean: Euclidean distance.
+ dot_product: Dot product.
This similarity is intended as an optimized way to perform cosine
similarity. To use it, all vectors must be of unit length, including
both document and query vectors. Using dot product with vectors that
are not of unit length can result in errors or poor search results.
+ cosine: Cosine similarity.
The preferred way to perform cosine similarity is to normalize all
vectors to unit length and use dot_product instead. Use this function
only if you need to preserve the original vectors and cannot
normalize them in advance.
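The relationship between dot_product and cosine can be checked with a
short sketch: once both vectors are scaled to unit length, their dot
product equals the cosine similarity of the original vectors. The helper
names below are hypothetical and the normalization would run in your own
application code before vectors are sent to Solr.

```java
final class SimilaritySketch {

    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Cosine similarity computed on the original (non-normalized) vectors.
    static double cosine(float[] a, float[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Returns a copy of the vector scaled to unit length.
    static float[] normalize(float[] v) {
        double norm = Math.sqrt(dot(v, v));
        if (norm == 0.0) {
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / norm);
        }
        return unit;
    }

    public static void main(String[] args) {
        float[] doc = {1.0f, 2.5f, 3.7f, 4.1f};
        float[] query = {1.0f, 2.0f, 3.0f, 4.0f};
        // The two values agree up to float rounding.
        System.out.println(cosine(doc, query));
        System.out.println(dot(normalize(doc), normalize(query)));
    }
}
```

Normalizing once, before indexing and before querying, avoids recomputing
the vector lengths on every comparison.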
To use the following advanced parameters, which customize the codec
format and the hyper-parameters of the HNSW algorithm, make sure the
schema codec factory is set in solrconfig.xml:
<config>
<codecFactory class="solr.SchemaCodecFactory"/>
...
DenseVectorField can be configured with the following advanced codec
hyper-parameters:
codecFormat
Optional Default: Lucene90HnswVectorsFormat
(advanced) Specifies the knn codec implementation to use
Accepted values: Lucene90HnswVectorsFormat.
Please note that the codecFormat accepted values may change in future
releases.
Lucene index back-compatibility is only supported for the default
codec. If you choose to customize the codecFormat in your schema,
upgrading to a future version of Solr may require you to either
switch back to the default codec and optimize your index to rewrite
it into the default codec before upgrading, or re-build your entire
index from scratch after upgrading.
hnswMaxConnections
Optional Default: 16
(advanced) This parameter is specific to the
Lucene90HnswVectorsFormat codec format.
It controls how many of the nearest neighbor candidates are
connected to the new node.
It has the same meaning as M from the 2018 paper.
Accepted values: Any integer.
hnswBeamWidth
Optional Default: 100
(advanced) This parameter is specific to the
Lucene90HnswVectorsFormat codec format.
It is the number of nearest neighbor candidates to track while
searching the graph for each newly inserted node.
It has the same meaning as efConstruction from the 2018 paper.
Accepted values: Any integer.
DenseVectorField supports the attributes: indexed, stored.
Multivalued fields are currently not supported.
Here's how a DenseVectorField should be indexed:
JSON
[{ "id": "1",
"vector": [1.0, 2.5, 3.7, 4.1]
},
{ "id": "2",
"vector": [1.5, 5.5, 6.7, 65.1]
}
]
XML
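<add>
  <doc>
    <field name="id">1</field>
    <field name="vector">1.0</field>
    <field name="vector">2.5</field>
    <field name="vector">3.7</field>
    <field name="vector">4.1</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="vector">1.5</field>
    <field name="vector">5.5</field>
    <field name="vector">6.7</field>
    <field name="vector">65.1</field>
  </doc>
</add>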
SolrJ
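import java.util.Arrays;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

// getSolrClient() stands in for however your application obtains a SolrClient.
final SolrClient client = getSolrClient();
// Each vector is passed as a list of floats, one element per dimension.
final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));
final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));
client.add(Arrays.asList(d1, d2));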
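Embedding models typically return primitive float arrays, while the SolrJ
example above passes a list of floats. The following sketch shows one
possible conversion helper that also checks the array length against the
vectorDimension configured in the schema before the document is sent. The
class and method names are hypothetical; the check is a client-side
convenience and is not part of SolrJ.

```java
import java.util.ArrayList;
import java.util.List;

final class VectorFieldValueSketch {

    // Converts a float[] into a List<Float> suitable for SolrInputDocument.setField,
    // rejecting vectors whose length differs from the schema's vectorDimension.
    static List<Float> toFieldValue(float[] vector, int vectorDimension) {
        if (vector.length != vectorDimension) {
            throw new IllegalArgumentException(
                "expected " + vectorDimension + " dimensions but got " + vector.length);
        }
        List<Float> values = new ArrayList<>(vector.length);
        for (float component : vector) {
            values.add(component);
        }
        return values;
    }

    public static void main(String[] args) {
        float[] embedding = {1.0f, 2.5f, 3.7f, 4.1f};
        System.out.println(toFieldValue(embedding, 4)); // prints [1.0, 2.5, 3.7, 4.1]
    }
}
```

The returned list could then be passed as the value in a call such as
d1.setField("vector", ...).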
Query Time
This is the Apache Solr query approach designed to support dense
vector search:
knn Query Parser
The knn (k-nearest neighbors) query parser finds the k documents
nearest to the target vector, according to the dense vectors indexed
in the given field.
The score for a retrieved document is the approximate distance to the
target vector (defined by the similarityFunction configured at
indexing time).
It takes the following parameters:
f
Required Default: none
The DenseVectorField to search in.
topK
Optional Default: 10
How many k-nearest results to return.
Here's how to run a KNN search:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
The search results retrieved are the k-nearest to the input vector
[1.0, 2.0, 3.0, 4.0], ranked by the similarityFunction configured at
indexing time.
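When the target vector comes from application code, the query string
shown above has to be assembled from a float array. The following is a
minimal sketch of such a helper; the class and method names are
hypothetical, and it assumes the field name contains no characters that
need escaping inside local params.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class KnnQuerySketch {

    // Builds the value of the q (or fq) parameter for the knn query parser.
    static String knnQuery(String field, int topK, float[] vector) {
        StringBuilder query = new StringBuilder();
        query.append("{!knn f=").append(field).append(" topK=").append(topK).append("}[");
        for (int i = 0; i < vector.length; i++) {
            if (i > 0) {
                query.append(", ");
            }
            query.append(vector[i]);
        }
        return query.append(']').toString();
    }

    public static void main(String[] args) {
        String q = knnQuery("vector", 10, new float[] {1.0f, 2.0f, 3.0f, 4.0f});
        System.out.println(q); // prints {!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
        // The value must be URL-encoded when it is placed directly in a request URL.
        System.out.println("q=" + URLEncoder.encode(q, StandardCharsets.UTF_8));
    }
}
```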
Usage with Filter Queries
The knn query parser can be used in filter queries:
&q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
The knn query parser can be used with filter queries:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)
When using knn in these scenarios make sure you have a clear
understanding of how filter queries work in Apache Solr:
The Ranked List of document IDs resulting from the main query q is
intersected with the set of document IDs deriving from each filter
query fq.
e.g.
Ranked List from q=[ID1, ID4, ID2, ID10] intersected with Set from
fq={ID3, ID2, ID9, ID4} = [ID4, ID2]
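The intersection in the example above can be reproduced with a few lines
of code. This is only a model of the behavior described here, not Solr's
internal implementation, and the names are hypothetical: the ranked list
keeps its order, and documents missing from the filter set are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class FilterIntersectionSketch {

    // Keeps the documents of the ranked list that also appear in the filter set,
    // preserving the ranking produced by the main query.
    static List<String> applyFilter(List<String> rankedFromQ, Set<String> setFromFq) {
        List<String> kept = new ArrayList<>();
        for (String id : rankedFromQ) {
            if (setFromFq.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("ID1", "ID4", "ID2", "ID10");
        Set<String> filter = Set.of("ID3", "ID2", "ID9", "ID4");
        System.out.println(applyFilter(ranked, filter)); // prints [ID4, ID2]
    }
}
```

Under this model, a knn main query with topK=10 combined with a
restrictive fq may return fewer than 10 documents, because the filter is
applied to the k nearest results rather than to the whole index.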
Usage as Re-Ranking Query
The knn query parser can be used to rerank first pass query results:
&q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
When using knn in re-ranking, pay attention to the topK parameter.
The second pass score (deriving from knn) is calculated only if the
document d from the first pass is within the k-nearest neighbors (in
the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index
anyway, which is a current limitation.
The final ranked list of results will have the first pass score (main
query q) added to the second pass score (the approximated
similarityFunction distance to the target vector to search)
multiplied by a multiplicative factor (reRankWeight).
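The score combination described above can be modeled as follows. This is
a sketch of the arithmetic only, with hypothetical names and made-up
scores: a document keeps its first pass score unless it is among the
k-nearest neighbors, in which case the knn score multiplied by
reRankWeight is added.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class KnnRerankSketch {

    // knnScore is null when the document is not within the k-nearest neighbors.
    static double combinedScore(double firstPassScore, Double knnScore, double reRankWeight) {
        return knnScore == null ? firstPassScore : firstPassScore + reRankWeight * knnScore;
    }

    // Orders the first pass documents by combined score, highest first.
    static List<String> rerank(Map<String, Double> firstPass,
                               Map<String, Double> knnTopK,
                               double reRankWeight) {
        List<String> ids = new ArrayList<>(firstPass.keySet());
        ids.sort(Comparator.comparingDouble(
            (String id) -> combinedScore(firstPass.get(id), knnTopK.get(id), reRankWeight))
            .reversed());
        return ids;
    }

    public static void main(String[] args) {
        // Illustrative scores, not real Solr output.
        Map<String, Double> firstPass = Map.of("ID3", 0.4, "ID4", 0.3, "ID9", 0.2, "ID2", 0.1);
        // Only ID2 and ID4 fall within the k-nearest neighbors of the target vector.
        Map<String, Double> knnTopK = Map.of("ID2", 0.9, "ID4", 0.5);
        System.out.println(rerank(firstPass, knnTopK, 1.0)); // prints [ID2, ID4, ID3, ID9]
    }
}
```

If topK is too small, few first pass documents receive a second pass
score and the re-ranking has little effect on the final order.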
Details about using the ReRank Query Parser can be found in the
Query Re-Ranking section.
Additional Resources
* Blog: https://sease.io/2022/01/apache-solr-neural-search.html
* Blog: https://sease.io/2022/01/apache-solr-neural-search-knn-benchmark.html
(c) Apache Software Foundation. All rights reserved.