Approximate Nearest Neighbor Oh Yeah (Annoy)

May 25, 2023 | 11 min read | Frank Liu

Introduction

Welcome back to Vector Database 101. In the previous tutorial, we took a deep dive into Hierarchical Navigable Small Worlds (HNSW). HNSW is a graph-based indexing algorithm that is one of the most popular indexing strategies used in vector databases today.

In this tutorial, we'll switch gears and talk about tree-based vector indexes. Specifically, we'll talk about Approximate Nearest Neighbor Oh Yeah (Annoy), an algorithm that uses a forest of trees to conduct nearest neighbor search. For those familiar with random forests or gradient-boosted decision trees, Annoy can seem like a natural extension of these algorithms, only for nearest neighbor search rather than machine learning. As with our HNSW tutorial, we'll first walk through how Annoy works at a high level before developing our own simple Python implementation.

Annoy basics

While HNSW is built upon connected graphs and skip lists, Annoy uses binary search trees as its core data structure. The key idea behind Annoy (and other tree-based indexes) is to repeatedly partition our vector space and search only a subset of the partitions for nearest neighbors. If this sounds like IVF, you're right; the idea is the same, but the execution is slightly different.

[Figure: Annoy, visualized (from https://github.com/spotify/annoy).]

The best way to understand Annoy is to visualize how a single tree is built. However, remember that high-dimensional hyperspaces behave very differently from 2D/3D Euclidean spaces from an intuitive perspective, so the images below are only for reference.

Let's start with indexing. For Annoy, this is a recursive process where the maximum size of the call stack is the depth of the tree. In the first iteration, two random dataset vectors, a and b, are selected, and the full hyperspace is split along a hyperplane equidistant from both a and b. Vectors in the "left" half of the hyperspace get assigned to the left subtree, while vectors in the "right" half of the subspace get assigned to the right subtree. Note that this can be done without actually computing the hyperplane itself - for every dataset vector, we only need to determine whether a (left) or b (right) is closer.
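To make that last point concrete, here's a minimal sketch (an illustration, not part of the original implementation) showing that comparing distances to a and b gives the same left/right assignment as explicitly checking which side of the equidistant hyperplane a vector falls on; the hyperplane passes through the midpoint (a + b) / 2 with normal b - a:

import numpy as np

np.random.seed(0)
a, b = np.random.normal(size=(2, 128))  # two randomly selected reference vectors
x = np.random.normal(size=(128,))       # an arbitrary dataset vector

# criterion 1: is `x` closer to `a` than to `b`?
closer_to_a = np.linalg.norm(x - a) < np.linalg.norm(x - b)

# criterion 2: does `x` fall on `a`'s side of the equidistant hyperplane?
left_of_hyperplane = np.dot(x - (a + b) / 2, b - a) < 0

assert closer_to_a == left_of_hyperplane
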
[Figure: After the first, second, and Nth iterations, respectively.]

The second iteration repeats this process for both the left and right subtrees generated by the first iteration, resulting in a tree with a depth of two and four leaf nodes. This continues for the third, fourth, and subsequent iterations until a leaf node contains no more than a pre-defined number of elements K. In the original Annoy implementation, K is a value that the user can set.

With the index fully built, we can now move on to querying. Given some query vector q, we can search by traversing the tree. Each intermediate node is split by a hyperplane, and we can determine which side of the hyperplane the query vector falls on by computing its distance to the left and right reference vectors. We repeat this process until we reach a leaf node containing an array of, at most, K vectors. We can then rank and return these vectors to the user.

Implementing Annoy

Now that we know how Annoy works, let's start with the implementation. As usual, we'll first create a dataset of (128-dimensional) vectors:

>>> import numpy as np
>>> dataset = np.random.normal(size=(1000, 128))

Let's first define a Node class containing left and right subtrees:

from typing import List, Optional

import numpy as np


class Node(object):
    """Initialize with a set of vectors, then call `split()`.
    """

    def __init__(self, ref: np.ndarray, vecs: List[np.ndarray]):
        self._ref = ref
        self._vecs = vecs
        self._left = None
        self._right = None

    @property
    def ref(self) -> Optional[np.ndarray]:
        """Reference point in n-d hyperspace. Evaluates to `False` if root node.
        """
        return self._ref

    @property
    def vecs(self) -> List[np.ndarray]:
        """Vectors for this leaf node. Evaluates to `False` if not a leaf.
        """
        return self._vecs

    @property
    def left(self) -> Optional[object]:
        """Left node.
        """
        return self._left

    @property
    def right(self) -> Optional[object]:
        """Right node.
        """
        return self._right

The vecs variable contains a list of all vectors within the node. If the length of this list is less than some value K, the vectors will remain as-is; otherwise, they will get propagated down to left and right, with vecs[0] and vecs[1] serving as the two randomly selected vectors used to define the splitting hyperplane.

Let's now move on to indexing. First, recall that every node in the tree is split by a hyperplane orthogonal to the line connecting two randomly selected dataset vectors. Conveniently, we can determine which side of the hyperplane a query vector lies on simply by comparing distances. As usual, we'll use NumPy's vectorized math for this:

def _is_query_in_left_half(q, node):
    # returns `True` if the query vector resides in the left half
    dist_l = np.linalg.norm(q - node.vecs[0])
    dist_r = np.linalg.norm(q - node.vecs[1])
    return dist_l < dist_r
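As a quick, hypothetical sanity check (not part of the original walkthrough), we can exercise this helper on a toy 2-D node whose first two vectors act as the left and right reference points:

# toy node: vecs[0] is the "left" reference, vecs[1] is the "right" reference
toy = Node(None, [np.array([0.0, 0.0]), np.array([10.0, 10.0])])

print(_is_query_in_left_half(np.array([1.0, 1.0]), toy))  # True - closer to vecs[0]
print(_is_query_in_left_half(np.array([9.0, 9.0]), toy))  # False - closer to vecs[1]
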
Now let's move on to building the actual tree.

def split_node(node, K: int, imb: float) -> bool:

    # stopping condition: maximum # of vectors for a leaf node
    if len(node._vecs) <= K:
        return False

    # continue for a maximum of 5 iterations
    for n in range(5):
        left_vecs = []
        right_vecs = []

        # take two random indexes and set as left and right halves
        left_ref = node._vecs.pop(np.random.randint(len(node._vecs)))
        right_ref = node._vecs.pop(np.random.randint(len(node._vecs)))

        # split vectors into halves
        for vec in node._vecs:
            dist_l = np.linalg.norm(vec - left_ref)
            dist_r = np.linalg.norm(vec - right_ref)
            if dist_l < dist_r:
                left_vecs.append(vec)
            else:
                right_vecs.append(vec)

        # check to make sure that the tree is mostly balanced
        r = len(left_vecs) / len(node._vecs)
        if r < imb and r > (1 - imb):
            node._left = Node(left_ref, left_vecs)
            node._right = Node(right_ref, right_vecs)
            return True

        # redo tree build process if imbalance is high
        node._vecs.append(left_ref)
        node._vecs.append(right_ref)

    return False


def _build_tree(node, K: int, imb: float):
    """Recurses on left and right halves to build a tree.
    """
    split_node(node, K=K, imb=imb)
    if node.left:
        _build_tree(node.left, K=K, imb=imb)
    if node.right:
        _build_tree(node.right, K=K, imb=imb)

This is a denser code block, so let's walk through it step-by-step. First, given an already-initialized Node, we randomly select two of its vectors as references and split the remaining vectors into left and right halves using the same distance comparison we defined earlier. Note that we've added an imb parameter to maintain tree balance - if one side of the split contains more than 95% of the subvectors, we redo the split process. With node splitting in place, the _build_tree function recursively calls itself on all nodes. Leaf nodes are defined as those which contain no more than K subvectors.

Great, so we've built a binary tree that lets us significantly reduce the scope of our search. Now let's implement querying as well. Querying is pretty straightforward; we traverse the tree, continuously moving along the left or right branches until we've arrived at the leaf node we're interested in:

def _query_linear(vecs: List[np.ndarray], q: np.ndarray, k: int) -> List[np.ndarray]:
    return sorted(vecs, key=lambda v: np.linalg.norm(q - v))[:k]


def query_tree(root: Node, q: np.ndarray, k: int) -> List[np.ndarray]:
    """Queries a single tree.
    """
    while root.left and root.right:
        dist_l = np.linalg.norm(q - root.left.ref)
        dist_r = np.linalg.norm(q - root.right.ref)
        root = root.left if dist_l < dist_r else root.right

    # brute-force search the nearest neighbors within the leaf
    return _query_linear(root.vecs, q, k)

This chunk of code greedily traverses the tree, descending into a single leaf node and brute-force searching only its contents.
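As a quick illustration (a hypothetical run, not from the original article), we can build one tree over the random dataset from earlier and query it; only one leaf node is ever examined:

# build a single tree over a fresh list of the dataset's row vectors
root = Node(None, [vec for vec in dataset])
_build_tree(root, K=64, imb=0.95)

query = np.random.normal(size=(128,))
neighbors = query_tree(root, query, k=5)
print(len(neighbors))  # up to 5 vectors, all drawn from a single leaf node
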
However, recall that we're often interested in finding multiple nearest neighbors, and that some of those nearest neighbors may live in other leaf nodes as well. So how can we solve these issues?

Run, forest, run

(Yes, I do realize that the main character's name is spelled "Forrest" in the American classic.)

In a previous tutorial on IVF, recall that we often expanded our search beyond the Voronoi cell closest to the query vector. The reason is cell edges - if a query vector is close to a cell edge, it's very likely that some of its nearest neighbors lie in a neighboring cell. These "edges" are much more common in high-dimensional spaces, so a large-ish value of nprobe is often used when high recall is needed.

We face the same problem with tree-based indexes - some of our nearest neighbors may lie outside the nearest leaf node/polygon. Annoy solves this by 1) allowing searches on both sides of a split and 2) creating a forest of trees.

Let's first expand on our implementation from the previous section to search both sides of a split:

def _select_nearby(node: Node, q: np.ndarray, thresh: float = 0):
    """Functions identically to _is_query_in_left_half, but can return both halves.
    """
    if not node.left or not node.right:
        return ()
    dist_l = np.linalg.norm(q - node.left.ref)
    dist_r = np.linalg.norm(q - node.right.ref)
    if np.abs(dist_l - dist_r) < thresh:
        return (node.left, node.right)
    if dist_l < dist_r:
        return (node.left,)
    return (node.right,)


def _query_tree(root: Node, q: np.ndarray, k: int) -> List[np.ndarray]:
    """This replaces the `query_tree` function above.
    """
    pq = [root]
    nns = []
    while pq:
        node = pq.pop(0)
        nearby = _select_nearby(node, q, thresh=0.05)

        # if `_select_nearby` does not return either node, then we are at a leaf
        if nearby:
            pq.extend(nearby)
        else:
            nns.extend(node.vecs)

    # brute-force search the nearest neighbors
    return _query_linear(nns, q, k)


def query_forest(forest: List[Node], q: np.ndarray, k: int = 10):
    nns = {}
    for root in forest:
        # merge `nns` with each tree's query result; byte keys deduplicate
        # identical vectors returned by multiple trees
        for vec in _query_tree(root, q, k):
            nns[vec.tobytes()] = vec
    return _query_linear(list(nns.values()), q, k)

Next, we'll add a function to allow us to build the full index as a forest of trees:

def build_forest(vecs: List[np.ndarray], N: int = 32, K: int = 64, imb: float = 0.95) -> List[Node]:
    """Builds a forest of `N` trees.
    """
    forest = []
    for _ in range(N):
        # copy the list so each tree indexes the full dataset
        # (splitting pops reference vectors from the node's list)
        root = Node(None, list(vecs))
        _build_tree(root, K, imb)
        forest.append(root)
    return forest
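Here's a hypothetical end-to-end sanity check (not from the original article; the forest size N=8 is an arbitrary choice for the sketch) that builds a small forest over the random dataset and compares the approximate results against an exact brute-force scan:

forest = build_forest([vec for vec in dataset], N=8, K=64, imb=0.95)

query = np.random.normal(size=(128,))
approx = query_forest(forest, query, k=10)
exact = _query_linear([vec for vec in dataset], query, k=10)

# recall@10: how many of the true nearest neighbors the forest recovered
exact_keys = {vec.tobytes() for vec in exact}
recall = sum(vec.tobytes() in exact_keys for vec in approx) / len(exact)
print(f"recall@10: {recall:.2f}")
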
""" return self._right def split(self, K: int, imb: float) -> bool: # stopping condition: maximum # of vectors for a leaf node if len(self._vecs) <= K: return False # continue for a maximum of 5 iterations for n in range(5): left_vecs = [] right_vecs = [] # take two random indexes and set as left and right halves left_ref = self._vecs.pop(np.random.randint(len(self._vecs))) right_ref = self._vecs.pop(np.random.randint(len(self._vecs))) # split vectors into halves for vec in self._vecs: dist_l = np.linalg.norm(vec - left_ref) dist_r = np.linalg.norm(vec - right_ref) if dist_l < dist_r: left_vecs.append(vec) else: right_vecs.append(vec) # check to make sure that the tree is mostly balanced r = len(left_vecs) / len(self._vecs) if r < imb and r > (1 - imb): self._left = Node(left_ref, left_vecs) self._right = Node(right_ref, right_vecs) return True # redo tree build process if imbalance is high self._vecs.append(left_ref) self._vecs.append(right_ref) return False def _select_nearby(node: Node, q: np.ndarray, thresh: int = 0): """Functions identically to _is_query_in_left_half, but can return both. """ if not node.left or not node.right: return () dist_l = np.linalg.norm(q - node.left.ref) dist_r = np.linalg.norm(q - node.right.ref) if np.abs(dist_l - dist_r) < thresh: return (node.left, node.right) if dist_l < dist_r: return (node.left,) return (node.right,) def _build_tree(node, K: int, imb: float): """Recurses on left and right halves to build a tree. """ node.split(K=K, imb=imb) if node.left: _build_tree(node.left, K=K, imb=imb) if node.right: _build_tree(node.right, K=K, imb=imb) def build_forest(vecs: List[np.ndarray], N: int = 32, K: int = 64, imb: float = 0.95) -> List[Node]: """Builds a forest of `N` trees. """ forest = [] for _ in range(N): root = Node(None, vecs) _build_tree(root, K, imb) forest.append(root) return forest def _query_linear(vecs: List[np.ndarray], q: np.ndarray, k: int) -> List[np.ndarray]: return sorted(vecs, key=lambda v: np.linalg.norm(q-v))[:k] def _query_tree(root: Node, q: np.ndarray, k: int) -> List[np.ndarray]: """Queries a single tree. """ pq = [root] nns = [] while pq: node = pq.pop(0) nearby = _select_nearby(node, q, thresh=0.05) # if `_select_nearby` does not return either node, then we are at a leaf if nearby: pq.extend(nearby) else: nns.extend(node.vecs) # brute-force search the nearest neighbors return _query_linear(nns, q, k) def query_forest(forest: List[Node], q, k: int = 10): nns = set() for root in forest: # merge `nns` with query result res = _query_tree(root, q, k) nns.update(res) return _query_linear(nns, q, k) And that's it for Annoy! Wrapping up In this tutorial, we did a deep dive into Annoy, a tree-based indexing strategy with a playful name. There are better languages than Python for implementing vector search data structures due to interpreter overhead. Still, we use as much numpy-based array math. We can do many optimizations to prevent copying memory back and forth, but I'll leave them as an exercise for the reader. In the following tutorial, we'll continue our deep dive into indexing strategies with a rundown of the Vamana algorithm - also known more commonly as DiskANN - a unique graph-based indexing algorithm that is tailored specifically towards querying directly from solid state hard drives. All code for this tutorial is freely available on my Github. Take another look at the Vector Database 101 courses 1. Introduction to Unstructured Data 2. What is a Vector Database? 3. 
Take another look at the Vector Database 101 courses:

1. Introduction to Unstructured Data
2. What is a Vector Database?
3. Comparing Vector Databases, Vector Search Libraries, and Vector Search Plugins
4. Introduction to Milvus
5. Milvus Quickstart
6. Introduction to Vector Similarity Search
7. Vector Index Basics and the Inverted File Index
8. Scalar Quantization and Product Quantization
9. Hierarchical Navigable Small Worlds (HNSW)
10. Approximate Nearest Neighbor Oh Yeah (ANNOY)
11. Choosing the Right Vector Index for Your Project
12. DiskANN and the Vamana Algorithm