[HN Gopher] Show HN: Open-source text-to-geolocation models
       ___________________________________________________________________
        
       Show HN: Open-source text-to-geolocation models
        
       Yachay is an open-source community that works with the most
       accurate text-to-geolocation models on the market right now
        
       Author : yachayai
       Score  : 36 points
       Date   : 2022-11-21 15:10 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | neoncontrails wrote:
       | This is _really_ cool. Early in the pandemic I released a local
       | news aggregation tool that aimed to aggregate COVID-related
       | content and score it for relevance using an ensemble of ML
       | classification models, including one that would attempt to infer
        | an article's geographic coordinates. Accuracy peaked at around
        | 70-80%, which was just not quite high enough for this use case.
       | With a large enough dataset of geotagged documents I'm pretty
       | sure we could've improved that by another 10-15% which would've
       | likely been "good enough" for our purposes. But one of the
       | surprising things I took away from the project was that there's
       | not a well-defined label for this category of classification
        | problems, and as a result there are few datasets or benchmarks
        | to encourage progress.
        
       | tomthe wrote:
       | There are no weights and no data, only some code to create a
        | PyTorch character-based network and train it. Will you provide
        | weights or data in the future? Do you have any benchmark
        | against Nominatim or Google Maps?
       | 
       | I think something like this (but with more substance) could be
       | helpful for some people, especially in the social sciences.
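A benchmark of the kind tomthe asks about would presumably report prediction error in kilometres against a geotagged test set. The repo itself does not specify an evaluation metric, but a minimal, stdlib-only sketch of the haversine (great-circle) distance commonly used for this might look like:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def median_error_km(predictions, truths):
    """Median error over paired lists of predicted and true (lat, lon)."""
    errs = sorted(haversine_km(*p, *t) for p, t in zip(predictions, truths))
    n = len(errs)
    return errs[n // 2] if n % 2 else (errs[n // 2 - 1] + errs[n // 2]) / 2
```

Median error is often preferred over the mean here because a single prediction on the wrong continent would otherwise dominate the score.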
        
         | rmbyrro wrote:
         | Yea, I was expecting a general-purpose model or dataset to
         | train a model. The idea is great, but - as it currently stands
         | - of no use to most people.
        
       | DOsinga wrote:
        | This does look interesting, but as other comments have pointed
        | out, without data or weights it's not clear how well this works.
        | The training notebook seems to suggest it is not actually
        | improving all that much on the training data.
        
       | TuringNYC wrote:
        | Has anyone got this working? Curious if someone could PR a
        | dependencies file that can be used to run this.
        
       | JimDabell wrote:
        | Depending upon your use case, you can get pretty good results
        | by using spaCy for named entity recognition, then matching
        | against the titles of Wikipedia articles that have coordinates.
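A rough sketch of the lookup half of that approach. The gazetteer below is a tiny hypothetical sample; in practice it would be built from the coordinate data in a Wikipedia dump, and the entity strings would be the GPE/LOC spans spaCy extracted:

```python
# Hypothetical mini-gazetteer: Wikipedia article title -> (lat, lon).
# A real one would be extracted from a Wikipedia dump's coordinates.
GAZETTEER = {
    "New York City": (40.7128, -74.0060),
    "Paris": (48.8566, 2.3522),
    "London": (51.5074, -0.1278),
}

def locate(entities):
    """Map NER-extracted place strings to coordinates by exact title match.

    Returns a list of (entity, (lat, lon)) for the titles that matched;
    unmatched entities are silently dropped.
    """
    return [(e, GAZETTEER[e]) for e in entities if e in GAZETTEER]
```

Exact title matching is the weak point, which is what the replies below go on to discuss.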
        
         | rmbyrro wrote:
          | Tried this in the past; it's too limited... There are too
          | many ways certain locations can be referred to. Take: New
          | York City, NYC, NY, New York, NYCity, and so on...
        
           | JimDabell wrote:
           | Wikipedia handles "New York City" and "NYC" as intended. "NY"
           | and "New York" are ambiguous to both machines and humans (are
           | you referring to the city or the state?) and if you have a
           | resolution strategy for this then Wikipedia gives you the
           | options to disambiguate. I've never seen "NYCity" used by
           | anybody.
        
             | rmbyrro wrote:
             | If you start processing web articles on the scale of
             | millions you'll be surprised by how creative people can be.
             | Not talking about tweets, just news and blog articles.
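The variant-name problem debated above is usually absorbed with an alias table that folds surface forms into one canonical title before the gazetteer lookup. A minimal sketch (the alias list here is illustrative, not exhaustive, and a real one would be seeded from Wikipedia redirect titles as JimDabell suggests):

```python
# Illustrative alias table folding surface forms into a canonical title.
# Keys are lowercased; a production table would come from redirect data.
ALIASES = {
    "nyc": "New York City",
    "new york city": "New York City",
    "ny city": "New York City",
    "nycity": "New York City",
}

def canonicalize(name):
    """Return the canonical title for a place name, or the name unchanged."""
    return ALIASES.get(name.strip().lower(), name)
```

Genuinely ambiguous forms like "NY" still need a separate disambiguation strategy; an alias table only helps when the surface form maps to a single place.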
        
       | rmbyrro wrote:
        | This would have been tremendously useful in a project I worked
        | on a few years ago.
       | 
        | It's a genuinely difficult task to parse text at large scale
        | with accurate geographical tagging.
        
       ___________________________________________________________________
       (page generated 2022-11-21 23:02 UTC)