Codebases are uniquely hard to search semantically

August 15, 2024
Daksh Gupta

I'm Daksh, one of the co-founders of Greptile. We're building AI that understands codebases, which you can query using an API. To do this, we have to, depending on the task, provide an LLM with snippets from the codebase, ideally the fewest snippets that give it sufficient information to respond to the query.

[Figure: semantic search illustration]

We found out that this problem is harder than it looks. Before we get into all the ways we tried to make codebase semantic search work, it's interesting to see why it's different from semantically searching a book.

Semantically searching a book

First, we index the corpus we are trying to search over:

1. Split it up into units (or "chunks"), each one ideally about a distinct topic. Splitting by paragraph is a good approximation of this.

2. Generate a semantic vector embedding for each chunk. Semantic vector embeddings capture the meaning of a piece of text as an n-dimensional vector. The algorithm that creates them is quite fascinating and is explained very well in chapter 6 of Speech and Language Processing, available for free online. When two pieces of text are semantically similar, their vectors are too, which can be quantified by their cosine similarity, i.e., the dot product of the normalized vectors. You could say, therefore, that the semantic similarity between two pieces of text t1 and t2 is:

similarity(t1, t2) = v1 · v2

where v1 and v2 are the normalized semantic embedding vectors for t1 and t2.

[Figure: indexing]

To retrieve, we start by generating a semantic vector embedding for the query, in this case "Word storing frequently accessed key-value pairs". We then compare that against our database of vectors and find the one(s) that match the closest, i.e., have the highest dot product and therefore the highest similarity.

[Figure: retrieval]
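To make that pipeline concrete, here is a minimal sketch of paragraph-level indexing and retrieval. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model as stand-ins for illustration; any embedding model with a similar interface would work the same way.

```python
# Minimal sketch of the chunk -> embed -> retrieve pipeline described above.
# Assumes sentence-transformers and all-MiniLM-L6-v2 purely for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def index(corpus: str) -> tuple[list[str], np.ndarray]:
    # 1. Chunk: split by paragraph as an approximation of one topic per chunk.
    chunks = [p.strip() for p in corpus.split("\n\n") if p.strip()]
    # 2. Embed: with normalized vectors, dot product equals cosine similarity.
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    # Embed the query and rank chunks by dot product (highest = most similar).
    q = model.encode([query], normalize_embeddings=True)[0]
    ranked = np.argsort(vectors @ q)[::-1]
    return [chunks[i] for i in ranked[:k]]
```

Because the vectors are normalized at indexing time, the ranking step is a single matrix-vector product, which is what makes this approach so cheap at query time.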
In theory, this should work for codebases too. Surely you could split up a codebase into files or functions as the "chunks", embed them, and do a similar semantic similarity-based search. We tried this, and got fairly poor results even with simple queries like "Session management code", which should have retrieved the files associated with session management, but instead naively picked up files that merely mentioned "management" or "session", since those files were the most semantically similar.

Code and natural language are not semantically similar

Here I have a natural language query, with which I want to search a codebase and find the code for HFT fraud detection:

```python
query = "Get the code that detects potential fraud in HFT transactions"
```

This is the code I am hoping the search will return:

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Any, Dict, List, Tuple

def analyze_high_frequency_transactions(
    transactions: List[Dict[str, Any]],
    time_window: timedelta,
    amount_threshold: float,
    frequency_threshold: int,
) -> Tuple[List[Dict[str, Any]], Dict[str, List[datetime]]]:
    def is_suspicious(times: List[datetime]) -> bool:
        # An account is suspicious if frequency_threshold large transactions
        # all fall within a single time_window.
        if len(times) < frequency_threshold:
            return False
        times.sort()
        for i in range(len(times) - frequency_threshold + 1):
            if times[i + frequency_threshold - 1] - times[i] <= time_window:
                return True
        return False

    suspicious_transactions = []
    account_timestamps = defaultdict(list)

    for transaction in transactions:
        account = transaction['account_id']
        amount = transaction['amount']
        timestamp = datetime.fromisoformat(transaction['timestamp'])

        # Only transactions above the amount threshold count toward frequency.
        if amount >= amount_threshold:
            account_timestamps[account].append(timestamp)
            if is_suspicious(account_timestamps[account]):
                suspicious_transactions.append(transaction)

    flagged_accounts = {
        account: timestamps
        for account, timestamps in account_timestamps.items()
        if is_suspicious(timestamps)
    }

    return suspicious_transactions, flagged_accounts
```

Here is an English description of what the code does:

This function analyzes a list of financial transactions to identify potentially suspicious high-frequency trading patterns. It flags transactions and accounts that meet specific criteria:

1. Transactions with amounts above a certain threshold.
2. Accounts with a high frequency of such transactions within a specified time window.

The function takes four parameters:

- A list of transaction dictionaries
- A time window for analysis
- An amount threshold for transactions
- A frequency threshold for the number of transactions

It returns two items:

1. A list of suspicious transactions
2. A dictionary of flagged accounts with their transaction timestamps

The function uses nested helper functions and defaultdict for efficient processing. It's designed to handle large datasets and could be part of a larger financial monitoring system.

I computed the semantic similarity between the query and the code, and then between the query and the natural language description of the code. Here is what I found:

Comparison               Similarity
Query and code           0.7280
Query and description    0.8152

The similarity between the query and the description is meaningfully higher (about 12%) than the similarity between the query and the code. This suggests that searching over a natural language summary of the codebase should yield better results than searching over the code itself.
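For reference, a comparison like this takes only a few lines to run. Here is a sketch using the same illustrative sentence-transformers setup as earlier, with hypothetical file names standing in for the snippet and its description; absolute numbers vary by embedding model, but the gap tends to point the same way.

```python
# Sketch: compare query-vs-code and query-vs-description similarity.
# File names are hypothetical placeholders for the snippet and description above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Get the code that detects potential fraud in HFT transactions"
code = open("analyze_hft.py").read()          # the function above
description = open("analyze_hft.txt").read()  # its English description

# Normalized embeddings, so a dot product is the cosine similarity.
q, c, d = model.encode([query, code, description], normalize_embeddings=True)
print("query vs code:       ", float(q @ c))
print("query vs description:", float(q @ d))
```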
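If descriptions match queries better than code does, one indexing strategy follows naturally: have an LLM describe each code chunk in plain English, embed the description, and keep a pointer back to the underlying code. Here is a minimal sketch assuming the openai package; the model name and prompt are illustrative, not a description of Greptile's internals.

```python
# Sketch: embed natural language descriptions of code instead of the code
# itself, keeping each description paired with the snippet it describes.
# Assumes the openai package; model name and prompt are illustrative only.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def describe(code: str) -> str:
    # Ask an LLM for an English summary of the snippet.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Describe what this code does in plain English:\n\n" + code,
        }],
    )
    return response.choices[0].message.content

def index_snippets(snippets: list[str]) -> list[tuple[str, np.ndarray]]:
    # Search happens over the description vectors, but results map back to code.
    descriptions = [describe(s) for s in snippets]
    vectors = embedder.encode(descriptions, normalize_embeddings=True)
    return list(zip(snippets, vectors))
```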
Chunking and noise

Another aspect is the signal-to-noise ratio of the chunk that was embedded. In the indexing step, if you chunk the codebase by file (every file being one vector), you might include a lot of code that isn't relevant to the query. Here we examine three scenarios and measure the semantic similarity each one has with the query "Get the code that detects potential fraud in HFT transactions":

1. A full file of random code
2. A full file of random code with the correct function buried in the middle
3. Just the correct function

Scenario                            Similarity
Query and full file                 0.718032
Query and full file with function   0.739421
Query and just the function         0.768347

Interestingly, adding noise dramatically reduces the semantic similarity: the buried function scores closer to pure noise than to the correct function on its own.

Summary

* Semantic search on codebases works better if you first translate the code to natural language, before generating embedding vectors.
* It also works better if you chunk more "tightly", on a per-function level rather than a per-file level, because noise severely degrades retrieval quality.
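To make the second point concrete, here is one way to chunk Python source at the function level, using the standard library's ast module. This is a sketch rather than our indexing code; a multi-language codebase would need language-aware parsers to do the same thing.

```python
# Sketch: per-function chunking for Python source using the standard
# library's ast module, producing one chunk per function or method.
import ast

def function_chunks(source: str) -> list[str]:
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Recover the exact source text of this definition
            # (decorators are not included in the segment).
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

Each chunk can then be described and embedded exactly as in the earlier sketches, which combines both of the recommendations above.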