hngopher.com

       [HN Gopher] Lucene: The Good Parts (2015)
       ___________________________________________________________________
        
       Lucene: The Good Parts (2015)
        
       Author : todsacerdoti
       Score  : 71 points
       Date   : 2022-03-29 11:31 UTC (1 day ago)
        
 (HTM) web link (blog.parse.ly)
 (TXT) w3m dump (blog.parse.ly)
        
       | orf wrote:
       | > SQL was not then, and is still not now, a very good blob or
       | document storage system. Yet, there seemed to be no alternative
       | to SQL for durability, short of relying directly upon the
       | filesystem.
       | 
       | Yep, because it's a query language.
        
       | madmax108 wrote:
       | Lucene was my introduction to concepts in document search such as
       | TF-IDF and attribute-based information retrieval working at an
       | eCommerce company where these problems were our bread and butter,
       | but while it was incredibly good at what it did, the concepts it
       | used were so 'low-level' that I always felt like higher level
       | "wrappers" around Lucene such as Solr/Elasticsearch were so much
       | easier to get started with and scaled up (and in many ways, more
       | idiot-proof), even as someone who was not a novice to the field.
       | 
       | Lucene in Action was an incredible book though, esp. given the
       | time when it came out, and somehow has remained quite relevant
       | through the years too! Strong recommend!
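
For readers unfamiliar with the TF-IDF weighting mentioned in the comment above, here is a minimal sketch. It uses the textbook form (raw term frequency times log(N/df)) and is not Lucene's actual scoring formula. The function name `tf_idf` and the sample documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: whitespace tokenization and weight = tf * log(N / df),
    the textbook variant, not what Lucene computes internally.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical product titles, in the spirit of an eCommerce catalog.
docs = ["red shoes", "red hat", "blue shoes red"]
weights = tf_idf(docs)
```

A term that appears in every document (here, "red") gets weight zero, while a term unique to one document (here, "hat") gets the highest weight, which is the intuition behind ranking by rare, distinguishing terms.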
        
       | dang wrote:
       | Related:
       | 
       |  _Lucene: The Good Parts_ -
       | https://news.ycombinator.com/item?id=9667378 - June 2015 (14
       | comments)
       | 
       |  _Lucene: The Good Parts_ -
       | https://news.ycombinator.com/item?id=9198092 - March 2015 (16
       | comments)
        
       | blakesterz wrote:
       | "In 2004, Solr was created by Yonik Seeley at CNET Networks as an
       | in-house project to add search capability for the company
       | website."
       | 
       | I have no idea how I never knew that CNET created Solr! (Solr
       | uses Lucene)
        
         | pyuser583 wrote:
         | I didn't know CNET created any internal tech.
        
         | notepalf wrote:
         | Solr is a great tool; we use it at my job to index documents
         | and have never had any problems with it. We chose it over
         | Elasticsearch because it seemed simpler to set up and
         | administer.
        
         | tomwheeler wrote:
         | About ten years after creating Solr, Yonik Seeley joined
         | Cloudera to work on integration between Apache Hadoop and Solr.
         | 
         | There's an interesting connection here: Doug Cutting, perhaps
         | best known as the creator of Hadoop, was the Chief Architect at
         | Cloudera. Most people recognize Doug as the creator of Apache
         | Hadoop, but he also created Lucene. In fact, Hadoop originated
         | from a Lucene subproject called Nutch, which aimed to build a
         | scalable web crawler.
        
           | ideonode wrote:
           | Wasn't the inspiration for Hadoop not just a web crawler use-
           | case, but also Google's famous MapReduce paper?
        
             | tomwheeler wrote:
             | Yes and no. The goal of the Nutch project was simply to
             | create a web crawler, but it hit some scalability limits.
             | Since Google had recently published two papers (MapReduce
             | and the Google File System) that were quite relevant to scaling
             | data processing and storage for a web crawler, Doug and
             | Mike created an open source implementation of those ideas
             | and redesigned the web crawler to use it.
             | 
             | The technology had many applications beyond a web crawler,
             | of course, but that was the original use case.
        
       | weeksie wrote:
       | Memory lane. Way back in the mists of time, circa 2004 or 2005,
       | I wanted to learn about search indexing. I was a pre-Rails Ruby
       | head and translated a large portion of Lucene's index code into
       | Ruby as a learning exercise. The result was abysmal and somewhat
       | wonky, but I did learn a bunch, both about Lucene and Ruby's FFI.
       | 
       | Reading and translating code is such a great way to internalize a
       | concept that you're unfamiliar with, while getting a glimpse into
       | someone else's mental model.
       | 
       | Lots of people tell you to read code, but it's hard to overstate
       | the power of filtering a codebase through your brain and out your
       | fingertips.
        
         | rjbwork wrote:
         | Neat! A bit over a decade ago, I had a similar experience with
         | Lucene (not that I implemented it from scratch, but certainly
         | used it in a fairly unorthodox, for the time, manner). I had to
         | implement some search stuff and Elasticsearch was still in its
         | infancy so was not necessarily the "obviously right choice" as
         | it has been recently for this kind of document search job.
         | 
         | I implemented a multi-tenant search engine on top of Lucene
         | using C# and Azure Blob Storage under the direction of my
         | manager at the time. This was pretty cool, because I had
         | learned about TF-IDF and search technologies in school, so it
         | was nice to be using some of that knowledge. And there were a
         | lot of problems to solve with regard to locking, index update
         | coordination, etc. that, as we know, ES takes care
         | of for us today. Anyway, the project was a success, and
         | launched, and backed a couple of products for a couple of years
         | until it was decommissioned due to outside forces basically
         | making it irrelevant.
         | 
         | That knowledge and experience seems to have ultimately led me
         | down the path to becoming the resident ES "expert" at my
         | current position.
        
       | dagenix wrote:
       | > This meant Lucene was less concerned with things like MVCC,
       | ACID, and 3-NF, and was instead concerned with much more
       | practical concerns, like how to build a fast and humane interface
       | for unstructured data.
       | 
       | I absolutely hate this attitude. Different use cases have
       | different requirements. The author here appears to be dismissing
       | any use case different than their own as not practical.
        
         | rjbwork wrote:
         | Totally agree. I have a hard time taking seriously a
         | perspective on the merits of a data technology that is so
         | dismissive towards concerns like MVCC, ACID, and normal forms.
         | These have been foundational to data technologies for nearly 50
         | years at this point for a reason. To discard them as
         | "impractical" indicates to me a severe immaturity of
         | perspective.
        
         | nemo44x wrote:
         | You can use Lucene and implement these types of features on top
         | of it if they're important to you. I think what the author was
         | trying to say was that the Lucene contributors decided to focus
         | on a certain thing and leave other implementation features up
         | to people using the library.
         | 
         | Lucene gives you a lot of levers to control how it works and if
         | you want to build a distributed, MVCC, ACID compliant datastore
         | on top of it, you can. It's just not a concern of the library.
        
       ___________________________________________________________________
       (page generated 2022-03-30 23:02 UTC)