https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html # Apache Solr Reference Guide 9.0
Dense Vector Search

Solr's Dense Vector Search adds support for indexing and searching dense numerical vectors.

Deep learning can be used to produce a vector representation of both the query and the documents in a corpus of information.

These neural network-based techniques are usually referred to as neural search, an industry derivation from the academic field of Neural Information Retrieval.

Important Concepts

Dense Vector Representation

A traditional tokenized inverted index can be considered to model text as a "sparse" vector, in which each term in the corpus corresponds to one vector dimension. In such a model, the number of dimensions is generally quite high (corresponding to the term dictionary cardinality), and the vector for any given document contains mostly zeros (hence it is sparse, as only a handful of the terms that exist in the overall index will be present in any given document).

Dense vector representation contrasts with term-based sparse vector representation in that it distills approximate semantic meaning into a fixed (and limited) number of dimensions.

The number of dimensions in this approach is generally much lower than in the sparse case, and the vector for any given document is dense, as most of its dimensions are populated by non-zero values.

In contrast to the sparse approach (for which tokenizers are used to generate sparse vectors directly from text input), the task of generating vectors must be handled in application logic external to Apache Solr.
There may be cases where it makes sense to directly search data that natively exists as a vector (e.g., scientific data); but in a text search context, it is likely that users will leverage deep learning models such as BERT to encode textual information as dense vectors, supplying the resulting vectors to Apache Solr explicitly at index and query time.

For additional information you can refer to this blog post.

Dense Retrieval

Given a dense vector v that models the information need, the easiest approach for providing dense vector retrieval would be to calculate the distance (Euclidean, dot product, etc.) between v and each vector d that represents a document in the corpus of information.

This approach is quite expensive, so many approximate strategies are currently under active research.

The strategy implemented in Apache Lucene and used by Apache Solr is based on the Navigable Small-World graph.

It provides efficient approximate nearest neighbor search for high-dimensional vectors.

See Approximate nearest neighbor algorithm based on navigable small world graphs [2014] and Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs [2018] for details.

Index Time

This is the Apache Solr field type designed to support dense vector search:

DenseVectorField

The dense vector field makes it possible to index and search dense vectors of float elements.

For example:

    [1.0, 2.5, 3.7, 4.1]

Here's how DenseVectorField should be configured in the schema:

vectorDimension
    Required | Default: none
    The dimension of the dense vector to pass in.
    Accepted values: Any integer <= 1024.

similarityFunction
    Optional | Default: euclidean
    Vector similarity function; used in search to return the top K most similar vectors to a target vector.
    Accepted values: euclidean, dot_product or cosine.
    + euclidean: Euclidean distance.
    + dot_product: Dot product. This similarity is intended as an optimized way to perform cosine similarity.
      In order to use it, all vectors must be of unit length, including both document and query vectors. Using dot product with vectors that are not unit length can result in errors or poor search results.
    + cosine: Cosine similarity. The preferred way to perform cosine similarity is to normalize all vectors to unit length and use dot_product instead. You should only use this function if you need to preserve the original vectors and cannot normalize them in advance.

To use the following advanced parameters, which customise the codec format and the hyper-parameters of the HNSW algorithm, make sure you set this configuration in solrconfig.xml:

...

Here's how DenseVectorField can be configured with the advanced codec hyper-parameters:

codecFormat
    Optional | Default: Lucene90HnswVectorsFormat
    (advanced) Specifies the knn codec implementation to use.
    Accepted values: Lucene90HnswVectorsFormat.

Please note that the codecFormat accepted values may change in future releases.

Lucene index back-compatibility is only supported for the default codec. If you choose to customize the codecFormat in your schema, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.

hnswMaxConnections
    Optional | Default: 16
    (advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format.
    Controls how many of the nearest neighbor candidates are connected to the new node.
    It has the same meaning as M from the 2018 paper.
    Accepted values: Any integer.

hnswBeamWidth
    Optional | Default: 100
    (advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format.
    It is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node.
    It has the same meaning as efConstruction from the 2018 paper.
    Accepted values: Any integer.
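The unit-length requirement for dot_product, noted in the similarityFunction description, has to be met by the application that produces the vectors, since Solr receives them as-is. The following is a minimal sketch of that pre-processing step in plain Java, not part of Solr or SolrJ; the class name UnitVectors and its methods are hypothetical names chosen for illustration. It scales a vector to unit length and shows that the dot product of two normalized vectors agrees with the cosine similarity of the originals.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical, not a Solr API): unit-length normalization. */
final class UnitVectors {

    /** Returns a new array holding v scaled to Euclidean length 1. */
    static float[] normalize(float[] v) {
        double sumOfSquares = 0.0;
        for (float x : v) {
            sumOfSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            // A zero vector has no direction, so it cannot be normalized.
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / norm);
        }
        return unit;
    }

    /** Dot product of two vectors of equal dimension. */
    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    /** Cosine similarity computed directly on the original (non-normalized) vectors. */
    static double cosine(float[] a, float[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static void main(String[] args) {
        float[] doc = {1.0f, 2.5f, 3.7f, 4.1f};
        float[] query = {1.0f, 2.0f, 3.0f, 4.0f};
        System.out.println("normalized doc: " + Arrays.toString(normalize(doc)));
        // The two values below agree up to floating-point rounding.
        System.out.println("cosine(doc, query): " + cosine(doc, query));
        System.out.println("dot(unit doc, unit query): " + dot(normalize(doc), normalize(query)));
    }
}
```

In such a setup the same normalization would be applied both to the vectors sent at index time and to the target vector sent at query time.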
DenseVectorField supports the attributes: indexed, stored.

Currently, multivalued fields are not supported.

Here's how a DenseVectorField should be indexed:

JSON

    [{ "id": "1",
       "vector": [1.0, 2.5, 3.7, 4.1]
     },
     { "id": "2",
       "vector": [1.5, 5.5, 6.7, 65.1]
     }
    ]

XML

    <add>
      <doc>
        <field name="id">1</field>
        <field name="vector">1.0</field>
        <field name="vector">2.5</field>
        <field name="vector">3.7</field>
        <field name="vector">4.1</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="vector">1.5</field>
        <field name="vector">5.5</field>
        <field name="vector">6.7</field>
        <field name="vector">65.1</field>
      </doc>
    </add>

SolrJ

    import java.util.Arrays;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // getSolrClient() is assumed to return a SolrClient configured for the target collection.
    final SolrClient client = getSolrClient();

    // Each vector is passed as a list of floats matching the field's vectorDimension.
    final SolrInputDocument d1 = new SolrInputDocument();
    d1.setField("id", "1");
    d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));

    final SolrInputDocument d2 = new SolrInputDocument();
    d2.setField("id", "2");
    d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));

    client.add(Arrays.asList(d1, d2));

Query Time

This is the Apache Solr query approach designed to support dense vector search:

knn Query Parser

The knn (k-nearest neighbors) query parser finds the k-nearest documents to the target vector according to the indexed dense vectors in the given field.

The score for a retrieved document is the approximate distance to the target vector (defined by the similarityFunction configured at indexing time).

It takes the following parameters:

f
    Required | Default: none
    The DenseVectorField to search in.

topK
    Optional | Default: 10
    How many k-nearest results to return.

Here's how to run a KNN search:

    &q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The search results retrieved are the k-nearest to the input vector [1.0, 2.0, 3.0, 4.0], ranked by the similarityFunction configured at indexing time.

Usage with Filter Queries

The knn query parser can be used in filter queries:

    &q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The knn query parser can be used with filter queries:

    &q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)

When using knn in these scenarios, make sure you have a clear understanding of how filter queries work in Apache Solr: the ranked list of document IDs resulting from the main query q is intersected with the set of document IDs deriving from each filter query fq.

e.g.
Ranked List from q=[ID1, ID4, ID2, ID10] intersected with Set from fq={ID3, ID2, ID9, ID4} = [ID4, ID2]

Usage as Re-Ranking Query

The knn query parser can be used to rerank first pass query results:

    &q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

When using knn in re-ranking, pay attention to the topK parameter.

The second pass score (deriving from knn) is calculated only if the document d from the first pass is within the k-nearest neighbors (in the whole index) of the target vector to search.

This means the second pass knn is executed on the whole index anyway, which is a current limitation.

The final ranked list of results will have the first pass score (main query q) added to the second pass score (the approximated similarityFunction distance to the target vector to search) multiplied by a multiplicative factor (reRankWeight).

Details about using the ReRank Query Parser can be found in the Query Re-Ranking section.

Additional Resources

* Blog: https://sease.io/2022/01/apache-solr-neural-search.html
* Blog: https://sease.io/2022/01/apache-solr-neural-search-knn-benchmark.html

(c) Apache Software Foundation. All rights reserved.
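As a companion to the Usage with Filter Queries and Usage as Re-Ranking Query sections, the following is a minimal sketch in plain Java of the two combination rules described there: intersecting a ranked list with a filter set, and adding a weighted second pass score only for documents that fall within the knn top K. It only models the arithmetic as the text describes it and is not Solr's implementation; the class KnnCombinationSketch, its method names, and the scores in main are hypothetical values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative model (hypothetical, not a Solr API) of filtering and re-ranking with knn. */
final class KnnCombinationSketch {

    /** Keeps the order of the ranked list from q, dropping IDs that are absent from the fq set. */
    static List<String> applyFilter(List<String> rankedFromQ, Set<String> fromFq) {
        List<String> kept = new ArrayList<>();
        for (String id : rankedFromQ) {
            if (fromFq.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }

    /**
     * Combines scores as firstPass + reRankWeight * knnScore, but only for documents
     * that appear in the knn top K; other documents keep their first pass score.
     * Returns document IDs ordered by descending combined score.
     */
    static List<String> rerank(Map<String, Double> firstPass,
                               Map<String, Double> knnTopK,
                               double reRankWeight) {
        Map<String, Double> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : firstPass.entrySet()) {
            double score = e.getValue();
            Double knnScore = knnTopK.get(e.getKey());
            if (knnScore != null) {
                score += reRankWeight * knnScore;
            }
            combined.put(e.getKey(), score);
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(combined.entrySet());
        // The sort is stable, so ties keep their first pass order.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) {
            ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Same IDs as the filter query example in the text.
        List<String> ranked = Arrays.asList("ID1", "ID4", "ID2", "ID10");
        Set<String> filter = new HashSet<>(Arrays.asList("ID3", "ID2", "ID9", "ID4"));
        System.out.println(applyFilter(ranked, filter));

        // Hypothetical scores: only documents 2 and 4 are inside the knn top K.
        Map<String, Double> firstPass = new LinkedHashMap<>();
        firstPass.put("3", 1.0);
        firstPass.put("4", 1.0);
        firstPass.put("9", 1.0);
        firstPass.put("2", 1.0);
        Map<String, Double> knnTopK = new LinkedHashMap<>();
        knnTopK.put("2", 0.9);
        knnTopK.put("4", 0.5);
        System.out.println(rerank(firstPass, knnTopK, 1.0));
    }
}
```

The sketch makes the role of topK visible: a first pass document that is not among the k-nearest neighbors of the target vector receives no second pass contribution, so a small topK can leave most first pass results unchanged.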