[HN Gopher] Google Algorithm Leaked
       ___________________________________________________________________
        
       Google Algorithm Leaked
        
       Author : certifiedloud
       Score  : 44 points
       Date   : 2024-05-29 17:25 UTC (5 hours ago)
        
 (HTM) web link (www.seroundtable.com)
 (TXT) w3m dump (www.seroundtable.com)
        
       | advisedwang wrote:
       | It's not clear to me whether the leak is actually for Google
       | Search or one of the products around search that isn't "Search",
       | like Document Warehouse [1]. Is there anything definitive one way
       | or the other in all this? Nobody seems to even be questioning this.
       | 
       | [1] https://cloud.google.com/document-warehouse/docs/overview
        
         | 9dev wrote:
         | If you read the original publication on this[1], they mention
         | there's a stray commit publishing the internal variant of the
         | SDK intended for the actual Google warehouse database. So the
         | code bases probably live close enough together for someone to
         | accidentally pass the wrong folder name or something.
         | 
         | This has been fixed, but the commit and all its changes are
         | out there--and tragically, published alongside a copy of the
         | Apache 2.0 license (intended for the document warehouse API
         | SDK), which officially sanctioned freely copying and using the
         | code. So there is really nothing Google can do about it.
         | 
         | [1] https://ipullrank.com/google-algo-leak
        
       | atonse wrote:
       | This looks like it's written in Elixir (the docs are using
       | ExDoc, Elixir's documentation tool).
       | 
       | This can't possibly be the actual search index rules (which is
       | probably code that's decades old, my guess is either in Python or
       | Java?) - unless they rewrote all of it in the past few years?
       | 
       | Can anyone else confirm this?
        
         | 9dev wrote:
         | It's not. Google uses a content warehouse database internally
         | that holds all stored web page content, and to access this vast
         | database, they have an API. The code discovered here is a
         | generated SDK for Elixir for this content warehouse API.
         | 
         | Apparently, Google had a now-deprecated product (who would have
         | guessed that? Consider me shocked!) that provided customers
         | with a trimmed-down version of this database for their own
         | purposes, but mistakenly published the internal SDK code to
         | GitHub instead of the code intended for Google Cloud customers.
         | 
         | So while this doesn't directly show the search index source
         | code, it describes the data schema of the index in great
         | detail, so there are at least some interesting educated guesses
         | on the workings of the actual index to draw from it.
        
       | ChrisArchitect wrote:
       | [dupe]
       | 
       | Some more discussion:
       | https://news.ycombinator.com/item?id=40496967
        
       ___________________________________________________________________
       (page generated 2024-05-29 23:02 UTC)