[HN Gopher] Show HN: Open-source text-to-geolocation models
___________________________________________________________________
Show HN: Open-source text-to-geolocation models
Yachay is an open-source community working on what it describes as
the most accurate text-to-geolocation models currently available
Author : yachayai
Score : 36 points
Date : 2022-11-21 15:10 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| neoncontrails wrote:
| This is _really_ cool. Early in the pandemic I released a local
| news aggregation tool that collected COVID-related content and
| scored it for relevance using an ensemble of ML classification
| models, including one that would attempt to infer an article's
| geographic coordinates. Accuracy peaked at about 70-80%, which
| was just not quite high enough for this use case. With a large
| enough dataset of geotagged documents I'm pretty sure we
| could've improved that by another 10-15%, which would've likely
| been "good enough" for our purposes. But one of the surprising
| things I took away from the project was that there's no
| well-defined label for this category of classification
| problems, and as a result there are few datasets or benchmarks
| to encourage progress.
| tomthe wrote:
| There are no weights and no data, only some code to create a
| PyTorch character-based network and train it. Will you provide
| weights or data in the future? Do you have any benchmark
| against Nominatim or Google Maps?
|
| I think something like this (but with more substance) could be
| helpful for some people, especially in the social sciences.
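|
| For anyone unfamiliar, "character-based network" here means
| roughly a model of this shape. A toy sketch (the layer names
| and sizes are my guesses, not the repo's actual code):
|
|     import torch
|     import torch.nn as nn
|
|     class CharGeoModel(nn.Module):
|         """Maps a string of character ids to (lat, lon)."""
|         def __init__(self, vocab_size=128, embed_dim=64,
|                      hidden_dim=128):
|             super().__init__()
|             self.embed = nn.Embedding(vocab_size, embed_dim)
|             self.rnn = nn.GRU(embed_dim, hidden_dim,
|                               batch_first=True)
|             self.head = nn.Linear(hidden_dim, 2)  # (lat, lon)
|
|         def forward(self, char_ids):
|             x = self.embed(char_ids)   # (batch, seq, embed)
|             _, h = self.rnn(x)         # h: (1, batch, hidden)
|             return self.head(h.squeeze(0))  # (batch, 2)
|
|     # Toy usage: encode text as byte ids, predict coordinates.
|     model = CharGeoModel()
|     text = "Greetings from Lisbon!"
|     ids = torch.tensor([[min(ord(c), 127) for c in text]])
|     print(model(ids))  # untrained output, shape (1, 2)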
| rmbyrro wrote:
| Yea, I was expecting a general-purpose model or dataset to
| train a model. The idea is great, but - as it currently stands
| - of no use to most people.
| DOsinga wrote:
| This does look interesting, but as other comments have pointed
| out, without data or weights it's not clear how well this
| works. The training notebook seems to suggest it is not
| actually improving all that much on the training data.
| TuringNYC wrote:
| Has anyone gotten this working? Curious whether someone could
| PR a dependencies file that can be used to run this.
| JimDabell wrote:
| Depending on your use case, you can get pretty good results by
| using spaCy for named entity recognition, then matching against
| the titles of Wikipedia articles that have coordinates.
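|
| A minimal sketch of that pipeline (the tiny GAZETTEER dict is a
| stand-in; in practice you'd build it from a Wikipedia dump of
| articles that carry coordinates):
|
|     import spacy
|
|     # Stand-in gazetteer: Wikipedia title -> (lat, lon).
|     GAZETTEER = {
|         "New York City": (40.7128, -74.0060),
|         "NYC": (40.7128, -74.0060),
|         "London": (51.5074, -0.1278),
|     }
|
|     # Requires: python -m spacy download en_core_web_sm
|     nlp = spacy.load("en_core_web_sm")
|
|     def locate(text):
|         """Return (mention, coords) pairs found in the text."""
|         doc = nlp(text)
|         return [(ent.text, GAZETTEER[ent.text])
|                 for ent in doc.ents
|                 # GPE = cities/states/countries, LOC = other
|                 if ent.label_ in ("GPE", "LOC")
|                 and ent.text in GAZETTEER]
|
|     print(locate("The mayor of NYC spoke in London today."))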
| rmbyrro wrote:
| Tried this in the past; it's too limited... There are too many
| ways certain locations can be referred to. Take: New York City,
| NYC, NY, New York, NYCity, and so on...
| JimDabell wrote:
| Wikipedia handles "New York City" and "NYC" as intended. "NY"
| and "New York" are ambiguous to both machines and humans (are
| you referring to the city or the state?). If you have a
| resolution strategy for this, Wikipedia gives you the options
| to disambiguate. I've never seen "NYCity" used by anybody.
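|
| To make that concrete, a toy resolution strategy (the candidate
| table is a stand-in; real options would come from Wikipedia's
| disambiguation pages):
|
|     # Ambiguous mention -> [(title, (lat, lon), population)].
|     CANDIDATES = {
|         "New York": [
|             ("New York City", (40.7128, -74.0060), 8_468_000),
|             ("New York (state)", (43.0, -75.0), 19_678_000),
|         ],
|     }
|
|     def resolve(mention):
|         """Naive strategy: pick the most populous candidate."""
|         options = CANDIDATES.get(mention)
|         if not options:
|             return None
|         return max(options, key=lambda c: c[2])
|
|     print(resolve("New York"))  # -> the state, by this prior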
| rmbyrro wrote:
| If you start processing web articles on the scale of millions,
| you'll be surprised by how creative people can be. Not talking
| about tweets, just news and blog articles.
| rmbyrro wrote:
| This would have been tremendously useful in a project I worked
| on a few years ago.
|
| It's a genuinely difficult task to parse text at large scale
| with accurate geographic tagging.
___________________________________________________________________
(page generated 2022-11-21 23:02 UTC)