[HN Gopher] Lucene: The Good Parts (2015)
___________________________________________________________________
Lucene: The Good Parts (2015)
Author : todsacerdoti
Score : 71 points
Date : 2022-03-29 11:31 UTC (1 day ago)
(HTM) web link (blog.parse.ly)
(TXT) w3m dump (blog.parse.ly)
| orf wrote:
| > SQL was not then, and is still not now, a very good blob or
| document storage system. Yet, there seemed to be no alternative
| to SQL for durability, short of relying directly upon the
| filesystem.
|
| Yep, because it's a query language.
| madmax108 wrote:
| Lucene was my introduction to document search concepts such as
| TF-IDF and attribute-based information retrieval. I was working at
| an eCommerce company where these problems were our bread and
| butter. Lucene was incredibly good at what it did, but the concepts
| it used were so 'low-level' that I always felt higher-level
| "wrappers" around it, such as Solr and Elasticsearch, were much
| easier to get started with and to scale up (and in many ways more
| idiot-proof), even for someone who was not a novice in the field.
|
| Lucene in Action was an incredible book though, esp. given the
| time when it came out, and somehow has remained quite relevant
| through the years too! Strong recommend!
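|
| For readers unfamiliar with the TF-IDF weighting mentioned above,
| here is a minimal sketch of the textbook formulation in Python. It
| is an illustration only, not Lucene's actual scoring code; the
| function name tf_idf, the whitespace tokenizer, and the sample
| documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            # Term frequency (normalized by document length) times
            # inverse document frequency (rarer terms weigh more).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical product titles, in the spirit of an eCommerce catalog.
docs = ["red running shoes", "red leather wallet", "blue running shorts"]
scores = tf_idf(docs)
```

| In this sketch a term that appears in only one document, such as
| "shoes", receives a higher weight than one shared across several
| documents, such as "red", which is the intuition behind using
| TF-IDF for ranking.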
| dang wrote:
| Related:
|
| _Lucene: The Good Parts_ -
| https://news.ycombinator.com/item?id=9667378 - June 2015 (14
| comments)
|
| _Lucene: The Good Parts_ -
| https://news.ycombinator.com/item?id=9198092 - March 2015 (16
| comments)
| blakesterz wrote:
| "In 2004, Solr was created by Yonik Seeley at CNET Networks as an
| in-house project to add search capability for the company
| website."
|
| I have no idea how I never knew that CNET created Solr! (Solr
| uses Lucene)
| pyuser583 wrote:
| I didn't know CNET created any internal tech.
| notepalf wrote:
| Solr is a great tool. We use it at my job to index documents
| and have never had any problem with it. We chose it over
| Elasticsearch because it seems simpler to set up and administer.
| tomwheeler wrote:
| About ten years after creating Solr, Yonik Seeley joined
| Cloudera to work on integration between Apache Hadoop and Solr.
|
| There's an interesting connection here: Doug Cutting, best known
| as the creator of Apache Hadoop, was the Chief Architect at
| Cloudera, and he also created Lucene. In fact, Hadoop originated
| from a Lucene subproject called Nutch, which aimed to build a
| scalable web crawler.
| ideonode wrote:
| Wasn't the inspiration for Hadoop not just a web crawler use
| case, but also Google's famous MapReduce paper?
| tomwheeler wrote:
| Yes and no. The goal of the Nutch project was simply to
| create a web crawler, but it hit some scalability limits.
| Since Google had recently published two papers (MapReduce
| and Google Filesystem) that were quite relevant to scaling
| data processing and storage for a web crawler, Doug and
| Mike created an open source implementation of those ideas
| and redesigned the web crawler to use it.
|
| The technology had many applications beyond a web crawler,
| of course, but that was the original use case.
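|
| To make the MapReduce idea concrete, here is a minimal
| single-process sketch of its three phases (map, shuffle, reduce)
| applied to counting words across crawled pages. It is not Hadoop's
| API; the function names and the sample pages are assumptions made
| for illustration, and a real system would distribute each phase
| across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word on a page."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(pages):
    """Run the three phases in order over a {doc_id: text} mapping."""
    pairs = [
        pair
        for doc_id, text in pages.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical crawled pages keyed by URL.
pages = {
    "example.org/a": "the web the crawler",
    "example.org/b": "web",
}
counts = word_count(pages)
```

| Because each map call looks at one page and each reduce call looks
| at one key, both phases can in principle run in parallel, which is
| the property that makes the model attractive for scaling a crawler.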
| weeksie wrote:
| Memory lane. Way back in the mists of time, circa 2004 or 2005, I
| wanted to learn about search indexing. I was a pre-Rails Ruby
| head and translated a large portion of Lucene's index code into
| Ruby as a learning exercise. The result was abysmal and somewhat
| wonky, but I did learn a bunch, both about Lucene and about Ruby's
| FFI.
|
| Reading and translating code is such a great way to internalize a
| concept that you're unfamiliar with, while getting a glimpse into
| someone else's mental model.
|
| Lots of people tell you to read code, but it's hard to overstate
| the power of filtering a codebase through your brain and out your
| fingertips.
| rjbwork wrote:
| Neat! A bit over a decade ago, I had a similar experience with
| Lucene. I didn't implement it from scratch, but I certainly used
| it in a manner that was fairly unorthodox for the time. I had to
| implement some search stuff, and Elasticsearch was still in its
| infancy, so it was not necessarily the "obviously right choice"
| that it has become for this kind of document search job.
|
| I implemented a multi-tenant search engine on top of Lucene
| using C# and Azure Blob Storage under the direction of my
| manager at the time. This was pretty cool, because I had
| learned about TF-IDF and search technologies in school, so it
| was nice to be using some of that knowledge. There were also a
| lot of problems to solve with regard to locking, index update
| coordination, etc. that, as we know, ES takes care of for us
| today. Anyway, the project was a success, and
| launched, and backed a couple of products for a couple of years
| until it was decommissioned due to outside forces basically
| making it irrelevant.
|
| That knowledge and experience seems to have ultimately led me
| down the path to becoming the resident ES "expert" at my
| current position.
| dagenix wrote:
| > This meant Lucene was less concerned with things like MVCC,
| ACID, and 3-NF, and was instead concerned with much more
| practical concerns, like how to build a fast and humane interface
| for unstructured data.
|
| I absolutely hate this attitude. Different use cases have
| different requirements. The author here appears to be dismissing
| any use case different than their own as not practical.
| rjbwork wrote:
| Totally agree. I have a hard time taking seriously a
| perspective on the merits of a data technology that is so
| dismissive towards concerns like MVCC, ACID, and normal forms.
| These have been foundational to data technologies for nearly 50
| years at this point for a reason. To discard them as
| "impractical" indicates to me a severe immaturity of
| perspective.
| nemo44x wrote:
| You can use Lucene and implement these types of features on top
| of it if they're important to you. I think what the author was
| trying to say was that the Lucene contributors decided to focus
| on a certain thing and leave other implementation features up to
| the people using the library.
|
| Lucene gives you a lot of levers to control how it works, and if
| you want to build a distributed, MVCC, ACID-compliant datastore
| on top of it, you can. It's just not a concern of the library.
___________________________________________________________________
(page generated 2022-03-30 23:02 UTC)