[HN Gopher] OpenRefine
___________________________________________________________________
OpenRefine
Author : rbanffy
Score : 227 points
Date : 2023-10-21 21:51 UTC (1 day ago)
(HTM) web link (openrefine.org)
(TXT) w3m dump (openrefine.org)
| xnx wrote:
| Cool tool that I've been looking for an excuse to use more. Glad
| to see that it has continued to update and improve after evolving
| from Freebase Gridworks to Google Refine to OpenRefine.
| cdcarter wrote:
| Four or five years ago, this was a tool I was using almost every
| day for work. Doing data consolidation and migrations for small
| nonprofits, we were faced with so many loosely structured Excel
| sheets and CSV exports from various mailing programs. OpenRefine
| was absolutely instrumental in cleaning up lots of disparate data
| when the data sources were too many and too variable to make a
| scripted solution valuable. Glad to see it lives on.
| yawnxyz wrote:
| What tool did you move on to using instead? This tool seems
| super powerful!
| a5seo wrote:
| Can't speak for OP but I moved to Exploratory.io. And the
| beauty of it is, it's a GUI for R so you can export your
| transformation steps to R if needed.
| layman51 wrote:
| I have been working at a nonprofit and have only recently
| started using this for cleaning up Excel or CSV files that we
| want to import. I am not as familiar with doing this with code,
| but I love that this tool gives me the steps I have taken in
| case I ever want to audit the changes I made to the data. The
| one disadvantage I see is that it seems like it's only for a
| single user and it might be burdensome to collaborate since you
| have to share the project file.
|
| I'm still excited to learn more about OpenRefine, but I guess
| maybe something like Google Colab might be better in terms of
| sharing and having direct access to our G Drives.
| wg0 wrote:
| This used to be Google Refine at some point. It seems it still
| uses GWT (Google Web Toolkit) which was an amazing idea for its
| time.
|
| Rewrite it in Rust+SQLite+Tauri+Typescript+Svelte?
| jillesvangurp wrote:
| You could rewrite it but it doesn't really solve a problem this
| thing has. Web assembly of course is an opportunity to bring in
| a lot of existing data processing frameworks from e.g. the
| python, julia, r, etc. worlds and run them in a browser. Refine
| did a lot of its processing with a Java-based server approach.
| The goal should be to reuse, not to reinvent if you take on a
| project like this. ChatGPT integration is a no-brainer for
| this stuff these days. It excels at cleaning things up and
| figuring out unstructured/half structured data.
|
| The spreadsheet UI is super useful and something that
| non-technical people are much more comfortable dealing with. I've
| used Google sheets as an interface to business people over the
| years. Whether it is categorizations, place descriptions,
| addresses, etc. just put it in a spreadsheet.
|
| Instead of building complicated UIs and tools, you just build a
| csv/tsv importer and let people do their thing in a
| spreadsheet, export, validate, import. Once you get it in
| people's heads that the column names are off limits for
| editing, they kind of get it. The nice thing about this stuff
| is that it is low tech, easy, and effective. And easy to
| explain to an intern, product owner, or other person that needs
| to sit down and do the monkey work.
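|
| A minimal sketch of the "validate" step in that export, validate,
| import loop, assuming a made-up three-column schema. The column
| names and the validate_csv helper are hypothetical, not from any
| real importer. It rejects a file whose header was edited and
| reports empty cells by line number.

```python
import csv
import io

# Hypothetical schema: the header row that people must not edit.
EXPECTED_COLUMNS = ["name", "category", "city"]


def validate_csv(text, expected=EXPECTED_COLUMNS):
    """Return a list of problems found; an empty list means the file is OK."""
    reader = csv.DictReader(io.StringIO(text))
    problems = []

    # The header must match exactly, including order and spelling.
    if reader.fieldnames != expected:
        problems.append(
            f"header mismatch: expected {expected}, got {reader.fieldnames}"
        )
        return problems

    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in expected:
            # Short rows yield None for missing cells, so guard with "or".
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col!r}")
    return problems
```

| Failing fast on the header keeps the error message short for the
| person doing the editing: they see one clear problem instead of a
| wall of per-row errors caused by a renamed column.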
|
| Refine takes this to the next level. You can take any old data
| in tabular format and cluster it phonetically, by minor spelling
| differences, or by other criteria, bulk edit some rows, and
| export it. It's also easy to enrich things via some rest API or
| run some simple scripts. But even just the bulk editing and
| grouping is super useful. We used it when it was still Google
| Refine more than 12 years ago to clean up tens of thousands of
| POIs. Typically we'd be grouping things on e.g. the city name
| and find that there would be a few spelling variations of
| things like München, Munchen, Muenchen, Munich, etc. Toss in a
| few UTF-8 encoding issues where the ü got garbled and it's a
| perfect tool for cleaning that up.
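|
| A rough sketch of the key-collision idea behind this kind of
| clustering, assuming a simple "fingerprint" key (lowercase, strip
| accents and punctuation, sort the tokens). This is an illustration
| written for this thread, not OpenRefine's actual code, and the
| function names are made up.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key so near-duplicates collide."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Remove punctuation, then sort and de-duplicate the tokens.
    cleaned = re.sub(r"[^\w\s]", "", unaccented)
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only real clusters."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    # A cluster is only interesting if it holds more than one spelling.
    return [group for group in groups.values() if len(set(group)) > 1]
```

| With this key, "München", "Munchen" and "munchen " land in one
| cluster. "Muenchen" and "Munich" do not, so variants like those
| would still need a fuzzier method (phonetic or edit distance) or a
| manual merge.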
|
| Tens of thousands of records is potentially a lot of work but
| still tiny data. We had a machine learning team that used
| machine learning as the hammer for the proverbial nail. Google
| Refine achieved more in 1 afternoon than that team did trying
| to machine learn their way out of that mess in half a year.
| 5ersi wrote:
| It seems to be pure JS with jQuery:
| https://github.com/OpenRefine/OpenRefine/blob/master/main/we...
|
| It is SSR with Velocity templates.
| hodanli wrote:
| this is my go-to tool for text unification and database
| normalization.
| alflervag wrote:
| I'm surprised to hear you using this for database
| normalization. Could you expand on how OpenRefine is helpful
| here?
| datadrivenangel wrote:
| Tools like this allow for easy transformations and data wrangling
| while also keeping project history in a way that helps preserve
| lineage. Very valuable.
| codetrotter wrote:
| This is awesome!
|
| I had a look at how to use this. The video I watched is a couple
| years old, but probably mostly relevant still.
| https://youtu.be/nORS7STbLyk
|
| The thing that really resonates with me here is the way they use
| faceting to find bad data.
|
| When I write pipelines on the command line, I sometimes find it
| necessary to filter and select data in various ways. Because of
| this I end up rerunning cli pipelines multiple times sometimes.
| If instead I dump it to csv I could see myself using OpenRefine
| with its faceting to pick out the relevant data for processing.
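|
| For comparison, a minimal sketch of what a text facet boils down
| to, assuming the rows are already loaded as dicts (for example via
| csv.DictReader). The helper names text_facet and select are
| hypothetical.

```python
from collections import Counter


def text_facet(rows, column):
    """Count the distinct values in one column, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


def select(rows, column, wanted):
    """Keep only the rows whose value in `column` is in `wanted`."""
    return [row for row in rows if row[column] in wanted]
```

| Rare values at the bottom of the facet are usually where the bad
| data hides, and select() then pulls out just those rows for
| fixing.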
| hermitcrab wrote:
| Seems focussed on dealing with a single dataset.
|
| -There doesn't seem to be a canvas for visualizing the flow
| (graph) of transformations.
|
| -It doesn't seem to support operations on 2 datasets, such as
| Join.
|
| Isn't that rather limiting? Or have I missed something?
| saadullahsaeed wrote:
| Not the most straightforward way to do it but it does support
| joining across two datasets. You can reference and add columns
| from a different "project".
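|
| Conceptually this is a lookup join. Inside OpenRefine it is done
| with an expression (the cross function, if I remember right; check
| the docs for the exact syntax). The sketch below shows the same
| idea in plain Python, with hypothetical names throughout.

```python
def add_column_from(rows, other_rows, key, other_key, other_column, new_column):
    """Add `new_column` to each row by looking up `key` in another dataset."""
    # Index the other dataset by its key column; the first match wins.
    index = {}
    for other in other_rows:
        index.setdefault(other[other_key], other)

    result = []
    for row in rows:
        match = index.get(row[key])
        # Rows without a match get None rather than being dropped.
        value = match[other_column] if match else None
        result.append({**row, new_column: value})
    return result
```

| This behaves like a left join that keeps every row of the first
| dataset and takes at most one match from the second.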
| nologic01 wrote:
| It is a powerful tool but like other tools in this space (think
| e.g., pandas) it needs some serious getting used to; it is not
| the most intuitive user interface.
|
| What was disappointing last time I used it is arguably not a
| problem of openrefine at all: the connectivity and response of
| wikidata queries was very slow. But that combination of local
| data harmonization with an open and globally available reference
| is super-important. I hope it somehow receives more attention and
| traction.
| pbronez wrote:
| Hmm seems like improving wikidata's API performance would be a
| good use of Wikimedia's war chest.
| zozbot234 wrote:
| Their main blocker is lack of a high-performance SPARQL
| backend, for complex queries where ordinary text-based search
| is not effective.
| https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...
| has more information, though it's not
| fully up to date (Qlever especially has had significant
| updates since).
| tkt wrote:
| I used to teach OpenRefine as a part of Data Carpentry workshops,
| and more than once, I heard people say it changed their lives.
|
| This is that lesson for getting started with OpenRefine.
| http://datacarpentry.org/OpenRefine-ecology-lesson/
___________________________________________________________________
(page generated 2023-10-23 09:02 UTC)