[HN Gopher] Building a Better Search Engine for Semantic Scholar...
___________________________________________________________________
Building a Better Search Engine for Semantic Scholar (2020)
Author : boyter
Score : 50 points
Date : 2021-08-28 09:49 UTC (1 day ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| 1787 wrote:
| I was one of the people who built the "plugin based ranking
| model" at Semantic Scholar as an intern several years ago; it's a
| rare treat to get a blog post about what happens after you leave
| and they need to replace your system.
|
| Although actually I feel validated in one respect - it sounds
| like it took a huge amount of data cleaning and human validation
| to use the click log data for training. At the time we got many
| questions about why we didn't "just use it". Now that I'm at a
| company with several orders of magnitude more traffic, I am a
| believer in the almighty click log.
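|
| To make that concrete, the sort of thing the cleaning produces
| is pairwise preferences, e.g. "the clicked result beats the
| results that were shown above it but skipped". A rough sketch in
| Python with made-up field names, not the actual Semantic Scholar
| pipeline:
|
|     # Hypothetical log schema: each row has the query, the result
|     # list as displayed, and the 0-based rank of the click.
|     def pairs_from_click_log(rows):
|         pairs = []
|         for row in rows:
|             clicked_rank = row["clicked_rank"]
|             results = row["results"]
|             for rank in range(clicked_rank):   # skipped above the click
|                 pairs.append(
|                     (row["query"], results[clicked_rank], results[rank])
|                 )
|         return pairs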
| potamic wrote:
| > It's true that deep learning has the potential to provide
| better performance, but the model twiddling, slow training
| (compared to LightGBM), and slower inference are all points
| against it.
|
| Transformer models appear to be the current state of the art when
| it comes to semantic analysis. Google already seems to be using
| BERT for search. Does anyone know what the performance of such
| models looks like, and how Google runs them at scale?
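|
| (For context, the pattern usually described for running this at
| scale is retrieve-then-rerank: a cheap first stage such as BM25
| narrows each query to a few hundred candidates, and the
| transformer only scores those. A rough sketch using the
| sentence-transformers CrossEncoder API; the model name is just a
| placeholder:
|
|     from sentence_transformers import CrossEncoder
|
|     query = "transformer models for document ranking"
|     # First stage: a cheap retriever (e.g. BM25) returns candidates.
|     candidates = ["title + abstract of paper 1",
|                   "title + abstract of paper 2"]
|
|     # Second stage: a BERT-style cross-encoder reranks only those.
|     reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
|     scores = reranker.predict([(query, t) for t in candidates])
|     reranked = [t for _, t in sorted(zip(scores, candidates),
|                                      reverse=True)]
|
| That keeps the per-query transformer cost roughly constant no
| matter how large the index is.)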
| tinyhouse wrote:
| Funny. I was searching hotel reviews today and Google doesn't
| even know that "wifi" and "wi-fi" are the same thing. (I'm sure
| Google reviews uses a different algorithm than the main Google
| search; but it was still funny to see given all the NLP work
| they are doing.)
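|
| The usual fix is term normalization or a synonym list at index
| and query time; a toy sketch of the idea, not how Google actually
| handles it:
|
|     import re
|
|     def normalize(term: str) -> str:
|         # strip whitespace/hyphens so "Wi-Fi" and "wifi" collide
|         return re.sub(r"[\s\-]", "", term.lower())
|
|     assert normalize("Wi-Fi") == normalize("wifi") == "wifi"
|
| Production engines do this with analyzer chains rather than ad
| hoc regexes, but the effect is the same.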
| lumost wrote:
| As at many organizations, there is often a disconnect between
| Google research and production.
|
| It's likely the models running in Google search are BERT-inspired,
| or were compared against BERT-derived models.
| vinni2 wrote:
| It depends on what the BERT model was pre-trained on. If it was
| pre-trained on Google News Collection like most Transformer models
| are, fine-tuning on something like scholarly data won't perform
| so well due to vocabulary mismatch. Besides, as the author
| mentions, models like LightGBM are efficient to train and
| perform reasonably well.
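|
| For reference, the post's approach is a LightGBM ranking model
| over hand-built features; a minimal sketch of that kind of setup
| (lambdarank objective, synthetic data, a made-up feature count):
|
|     import lightgbm as lgb
|     import numpy as np
|
|     # Toy data: 20 queries x 50 candidates, 10 features per pair
|     # (text-match scores, citation counts, recency, ...).
|     rng = np.random.default_rng(0)
|     X = rng.random((1000, 10))
|     y = rng.integers(0, 4, size=1000)   # graded relevance labels
|     group = [50] * 20                   # candidates per query
|
|     ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200)
|     ranker.fit(X, y, group=group)
|     scores = ranker.predict(X[:50])     # score one query's candidates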
| spratzt wrote:
| I agree with this. In my experience, LightGBM is often good
| enough. The additional improvement provided by neural net
| models is not worth the substantial increase in development
| time.
| dunefox wrote:
| > pre-trained on Google News Collection like most Transformer
| models are
|
| BERT, for example, was trained on English Wikipedia and the
| BooksCorpus.
| Do you have a source for this claim?
___________________________________________________________________
(page generated 2021-08-29 23:01 UTC)