[HN Gopher] Building a Better Search Engine for Semantic Scholar...
       ___________________________________________________________________
        
       Building a Better Search Engine for Semantic Scholar (2020)
        
       Author : boyter
       Score  : 50 points
        Date   : 2021-08-28 09:49 UTC (1 day ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | 1787 wrote:
       | I was one of the people who built the "plugin based ranking
       | model" at Semantic Scholar as an intern several years ago; it's a
       | rare treat to get a blog post about what happens after you leave
       | and they need to replace your system.
       | 
       | Although actually I feel validated in one respect - it sounds
       | like it took a huge amount of data cleaning and human validation
        | to use the click log data for training. At the time we got many
       | questions about why we didn't "just use it". Although now that
       | I'm at a company with several orders of magnitude more traffic I
       | am a believer in the almighty click log.
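        | 
        | For anyone curious what "just using it" would involve, here is
        | a minimal sketch of turning raw clicks into learning-to-rank
        | training data. The field names and the label heuristic are
        | illustrative assumptions, not the actual Semantic Scholar
        | pipeline:
        | 
        |     import pandas as pd
        | 
        |     # One row per impression: which document was shown for
        |     # which query, at what position, and whether it was clicked.
        |     logs = pd.read_csv("click_log.csv")
        | 
        |     # Basic cleaning: drop duplicate impressions and results
        |     # shown below the first page, which are mostly noise.
        |     logs = logs.drop_duplicates(subset=["query_id", "doc_id"])
        |     logs = logs[logs["position"] <= 10]
        | 
        |     # Naive label: clicked means relevant. Correcting this for
        |     # position bias and filtering junk queries is where the
        |     # cleaning and human-validation effort goes.
        |     logs["label"] = logs["clicked"].astype(int)
        | 
        |     # Group sizes (results per query) for a LambdaMART-style
        |     # ranker such as LightGBM.
        |     groups = logs.groupby("query_id").size().to_numpy()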
        
       | potamic wrote:
       | > It's true that deep learning has the potential to provide
       | better performance, but the model twiddling, slow training
       | (compared to LightGBM), and slower inference are all points
       | against it.
       | 
       | Transformer models appear to be the current state of the art when
       | it comes to semantic analysis. Google already seems to be using
        | BERT for search. Does anyone know what the performance of such
        | models is like, and how Google runs them at scale?
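        | 
        | For context, a LightGBM learning-to-rank model of the kind the
        | article refers to is only a few lines and trains quickly on
        | CPU, which is the trade-off being pointed at. A toy sketch with
        | made-up features and random labels, not the article's actual
        | setup:
        | 
        |     import lightgbm as lgb
        |     import numpy as np
        | 
        |     # Per (query, document) features, e.g. text-match scores,
        |     # recency, citation counts.
        |     X = np.random.rand(1000, 5)
        |     y = np.random.randint(0, 2, size=1000)  # relevance labels
        |     group = [10] * 100                      # 100 queries, 10 candidates each
        | 
        |     train = lgb.Dataset(X, label=y, group=group)
        |     params = {"objective": "lambdarank", "metric": "ndcg",
        |               "learning_rate": 0.05}
        |     model = lgb.train(params, train, num_boost_round=200)
        | 
        |     # At query time: score one query's candidates and re-rank.
        |     scores = model.predict(X[:10])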
        
         | tinyhouse wrote:
         | Funny. I was searching hotel reviews today and Google doesn't
         | even know that "wifi" and "wi-fi" are the same thing. (I'm sure
         | the Google reviews use a different algorithm than the main
         | Google site; but it was still funny to see with all the NLP
         | stuff they are doing)
        
         | lumost wrote:
          | As with many organizations, there is often a disconnect
          | between Google research and Google production.
          | 
          | It's likely the models running in Google search are BERT-
          | inspired, or were compared against BERT-derived models.
        
         | vinni2 wrote:
          | It depends on what the BERT model was pre-trained on. If it
          | was pre-trained on something like the Google News collection,
          | as many Transformer models are, fine-tuning it on scholarly
          | data won't perform so well due to vocabulary mismatch.
          | Besides, as the author mentions, models like LightGBM are
          | efficient to train and perform reasonably well.
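          | 
          | The vocabulary mismatch is easy to see by comparing
          | tokenizers. A rough illustration (model names are the public
          | Hugging Face identifiers; exact token splits will vary):
          | 
          |     from transformers import AutoTokenizer
          | 
          |     general = AutoTokenizer.from_pretrained("bert-base-uncased")
          |     scientific = AutoTokenizer.from_pretrained(
          |         "allenai/scibert_scivocab_uncased")
          | 
          |     phrase = "transcriptomic profiling of glioblastoma"
          |     # A general-domain vocabulary tends to shred scientific
          |     # terms into many ##-prefixed subwords; SciBERT's
          |     # science-trained vocabulary keeps more of them intact.
          |     print(general.tokenize(phrase))
          |     print(scientific.tokenize(phrase))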
        
           | spratzt wrote:
           | I agree with this. In my experience, LightGBM is often good
           | enough. The additional improvement provided by neural net
           | models is not worth the substantial increase in development
           | time.
        
           | dunefox wrote:
            | > pre-trained on something like the Google News
            | collection, as many Transformer models are
            | 
            | BERT, for example, was trained on Wikipedia and the
            | BookCorpus. Do you have a source for this claim?
        
       ___________________________________________________________________
       (page generated 2021-08-29 23:01 UTC)