https://opensourceconnections.com/blog/2019/05/29/falsehoods-programmers-believe-about-search/

Falsehoods Programmers Believe About Search

Max Irwin
May 29, 2019
Category: Community

As much as anyone I'm a fan of resurrecting trends and memes and pretending it's cool. In that vein dear friends, I've exhumed the venerable "Falsehoods Programmers Believe" party from 4 years ago to bring you one about, no less, Search. Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. The list stands at a total of 105 falsehoods. I couldn't mash up the ole 99-problems meme with this to cull 6 unworthy items, because they are all worthy.
I will leave you with that brief introduction and, of course, the list:

* Search engines work like databases
* Search can be considered an additional feature just like any other
* Search can be added as a well-performing feature to your existing product quickly
* Search can be added as a well-performing feature to your existing product with reasonable effort
* Choosing the correct search engine is easy and you will always be happy with your decision
* Once set up, search will work the same way forever
* Once set up, search will work the same way for a while
* Once set up, search will work the same way for the next week
* The default search engine settings will deliver a good search experience
* Customers know what they are looking for
* Customers who know what they are looking for will search for it in the way you expect
* Customers who don't know what they are looking for will search accordingly
* A customer using the same query twice expects the same results for both searches
* Customers only search for a few terms
* Customers only search for fewer than some set number of terms
* Customers never copy and paste a whole document into a search bar
* Customers balance quotes and parentheses
* Customers that don't balance quotes or parentheses don't expect phrasing or grouping
* You can pass the customer query directly into your search engine
* You can write a query parser that will always parse the query successfully
* You will never have to return a query parse error to the customer
* When you find the boolean operator 'OR', you always know it doesn't mean Oregon
* Customers notice their own misspellings
* Customers don't expect your search to correct misspellings
* It is possible to create a list of all misspellings
* It is possible to create an algorithm to handle all misspellings
* A misspelled word is never the same as another correctly spelled word
* All customers expect spelling correction to work the same
* All customers want their misspellings corrected
* A search should always return results, no matter how absurd
* If you don't have any results to show, customers won't mind
* When the perfect results are shown to the customer, they will notice it
* You don't need to monitor search queries, results, and clicks
* Customers won't get nervous that you are logging their search activity
* Search queries are not affected by GDPR
* Looking at the data, it is always possible to tell whether a customer found what they were looking for
* Customers will click on what they are looking for when they've found it
* You can build a search that works like Google
* You can build a search that works like Google sometimes
* You should use Google as a target for your search
* Customers don't mind if your search doesn't work like Google
* Customers don't expect your search to work like Google
* Customers won't compare you to Google
* A bad search, no matter how minor nor how rare, will never reflect poorly on your product
* Since Google doesn't use facets, customers don't need them
* Facet hit counts are always correct
* Facets have no impact on performance
* You can just cache queries to get performant facets
* Personalized search is easy
* Learning to rank is easy and just requires a plugin
* You have enough data for learning-to-rank
* Over time, you can curate enough data for learning-to-rank
* You don't need to spend lots of time formatting content for it to work well in your search engine
* Text extraction engines will always produce text that doesn't need to be post-processed
* All your markup will be stripped as you expect it to be
* Content is well formed
* Content is mostly well formed
* Content is predictably well formed
* Content, sourced from a database and templates, is formed the same
* Content teams treat search as their top priority
* Manually changing content to improve search is easy
* Improving content can be automated with reasonable effort
* Queries for 'C programming' and 'C++ programming' will produce different results
* Queries for '401k' and '401(k)' will produce the same results
* Tokenization as it works out of the box is right for your content and queries
* Tokenization can be changed to meet the needs of your entire corpus and all queries
* Tokenization can be changed to meet the needs of most of your corpus and most queries
* Tokenization can be conditional
* You should roll your own tokenizer
* You will never have a debate about tokenization
* Using regular expressions for tokenization is a good idea
* Regular expressions have minimal performance impact
* You will never have a debate about regular expressions
* You should remove stop words
* You should not remove stop words
* You know what the list of stop words should be
* Stop words will never change
* When you find the stop word 'in', you know it doesn't mean Indiana
* It's easy to make certain things case sensitive
* Case sensitivity is a good idea
* Synonyms are easy
* Synonyms will improve recall in the way you want
* Synonyms have the same relevance in all documents
* Synonyms for abbreviations and acronyms always work as you expect
* Synonyms can be extracted from your corpus with natural language processing
* Using Word2Vec for synonyms is a good idea
* Stemming will solve your recall problems
* Lemmatization will solve your recall problems
* Lemmatization dictionaries are static
* Languages don't change
* Natural language processing (NLP) tools work perfectly
* Incorporating NLP into your analysis pipeline is straightforward
* Search queries are complete sentences and can be accurately tagged with parts of speech
* Showing a list of search suggestions is easy
* Suggestions should just use the out of the box search engine suggestions
* Suggestions should incorporate customer query logs
* Customers would never type anything offensive into your search bar
* Customers would never try to hack you through your search bar
* Customers don't need highlighting to find what they've searched for
* Default highlighters are good enough for all your content and queries
* Making a custom highlighter isn't too difficult. It's just matching strings, right?
* Making a custom highlighter that is better than the default version will take less than a year
* Turning on caching will solve your performance issues
* Customers don't expect near real time updates
* A 30-second commit time is short enough for everyone

Keen to avoid believing falsehoods about search? Let us help!
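
Two of the tokenization falsehoods in the list ('C programming' vs. 'C++ programming', and '401k' vs. '401(k)') are easy to check by hand. The sketch below is only an illustration: `naive_tokenize` is a hypothetical helper, not the tokenizer of any particular search engine, and it assumes a simple "lowercase, then split on non-word characters" rule. Real analyzers differ in the details, but many defaults drop punctuation in a broadly similar way.

```python
import re


def naive_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase, then keep runs of letters, digits, or underscores."""
    return re.findall(r"\w+", text.lower())


# The four queries named in the list of falsehoods.
for query in ["C programming", "C++ programming", "401k", "401(k)"]:
    print(f"{query!r:20} -> {naive_tokenize(query)}")

# '+' is not a word character, so both C queries collapse to the same tokens.
same_c = naive_tokenize("C programming") == naive_tokenize("C++ programming")
print(same_c)  # True

# The parentheses split '401(k)' into two tokens, while '401k' stays as one.
same_401k = naive_tokenize("401k") == naive_tokenize("401(k)")
print(same_401k)  # False
```

Under this assumed rule, the two queries expected to differ end up identical, and the two expected to match end up different, which is the opposite of what a customer would want in both cases.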