[HN Gopher] OpenRefine
       ___________________________________________________________________
        
       OpenRefine
        
       Author : rbanffy
       Score  : 227 points
       Date   : 2023-10-21 21:51 UTC (1 day ago)
        
 (HTM) web link (openrefine.org)
 (TXT) w3m dump (openrefine.org)
        
       | xnx wrote:
       | Cool tool that I've been looking for an excuse to use more. Glad
        | to see that it has continued to update and improve after evolving
       | from Freebase Gridworks to Google Refine to OpenRefine.
        
       | cdcarter wrote:
       | Four or five years ago, this was a tool I was using almost every
       | day for work. Doing data consolidation and migrations for small
       | nonprofits, we were faced with so many loosely structured excel
       | sheets and CSV exports from various mailing programs. OpenRefine
       | was absolutely instrumental in cleaning up lots of disparate data
       | when the data sources were too many and too variable to make a
       | scripted solution valuable. Glad to see it lives on.
        
         | yawnxyz wrote:
         | What tool did you move on to using instead? This tool seems
         | super powerful!
        
           | a5seo wrote:
           | Can't speak for OP but I moved to Exploratory.io. And the
           | beauty of it is, it's a GUI for R so you can export your
           | transformation steps to R if needed.
        
         | layman51 wrote:
         | I have been working at a nonprofit and have only recently
         | started using this for cleaning up Excel or CSV files that we
         | want to import. I am not as familiar with doing this with code,
         | but I love that this tool gives me the steps I have taken in
         | case I ever want to audit the changes I made to the data. The
         | one disadvantage I see is that it seems like it's only for a
         | single user and it might be burdensome to collaborate since you
         | have to share the project file.
         | 
         | I'm still excited to learn more about OpenRefine, but I guess
         | maybe something like Google Colab might be better in terms of
         | sharing and having direct access to our G Drives.
        
       | wg0 wrote:
       | This used to be Google Refine at some point. It seems it still
        | uses GWT (Google Web Toolkit), which was an amazing idea for its
       | time.
       | 
       | Rewrite it in Rust+SQLite+Tauri+Typescript+Svelte?
        
         | jillesvangurp wrote:
          | You could rewrite it, but that doesn't really solve a problem
          | this thing has. WebAssembly of course is an opportunity to bring
          | in a lot of existing data processing frameworks from e.g. the
          | Python, Julia, R, etc. worlds and run them in a browser. Refine
          | did a lot of its processing with a Java-based server approach.
          | The goal should be to reuse, not to reinvent, if you take on a
          | project like this. ChatGPT integration is a no-brainer for this
          | stuff these days. It excels at cleaning things up and figuring
          | out unstructured/half-structured data.
         | 
          | The spreadsheet UI is super useful and something that non-
          | technical people are much more comfortable dealing with. I've
          | used Google Sheets as an interface to business people over the
         | years. Whether it is categorizations, place descriptions,
         | addresses, etc. just put it in a spreadsheet.
         | 
         | Instead of building complicated UIs and tools, you just build a
         | csv/tsv importer and let people do their thing in a
         | spreadsheet, export, validate, import. Once you get it in
         | people's heads that the column names are off limits for
         | editing, they kind of get it. The nice thing about this stuff
         | is that it is low tech, easy, and effective. And easy to
         | explain to an intern, product owner, or other person that needs
         | to sit down and do the monkey work.
         | 
         | Refine takes this to the next level. You can take any old data
          | in tabular format and cluster it phonetically, by minor spelling
          | differences, or by other criteria, bulk edit some rows, and
         | export it. It's also easy to enrich things via some rest API or
         | run some simple scripts. But even just the bulk editing and
         | grouping is super useful. We used it when it was still Google
         | Refine more than 12 years ago to clean up tens of thousands of
         | POIs. Typically we'd be grouping things on e.g. the city name
         | and find that there would be a few spelling variations of
          | things like München, Munchen, Muenchen, Munich, etc. Toss in a
          | few UTF-8 encoding issues where the ü got garbled and it's a
         | perfect tool for cleaning that up.
         | 
         | Tens of thousands of records is potentially a lot of work but
         | still tiny data. We had a machine learning team that used
         | machine learning as the hammer for the proverbial nail. Google
         | Refine achieved more in 1 afternoon than that team did trying
         | to machine learn their way out of that mess in half a year.
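
        A rough illustration of the clustering workflow described above (not
        code from the thread): a minimal Python sketch of a fingerprint-style
        key-collision keyer, loosely modelled on OpenRefine's clustering
        approach, run over invented city-name variants.

            import string
            import unicodedata
            from collections import defaultdict

            def fingerprint(value: str) -> str:
                """Normalise a value into a clustering key: strip accents and
                punctuation, lowercase, then sort and dedupe the tokens."""
                value = unicodedata.normalize("NFKD", value)
                value = value.encode("ascii", "ignore").decode("ascii")
                value = value.lower().translate(
                    str.maketrans("", "", string.punctuation))
                return " ".join(sorted(set(value.split())))

            # Invented sample values standing in for the POI data described.
            cities = ["München", "Munchen", " munchen ", "Muenchen", "Munich"]
            clusters = defaultdict(list)
            for city in cities:
                clusters[fingerprint(city)].append(city)

            for key, variants in clusters.items():
                print(key, "->", variants)
            # "München", "Munchen" and " munchen " collapse into one cluster;
            # "Muenchen" and "Munich" would need a fuzzier (phonetic) method.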
        
         | 5ersi wrote:
         | It seems to be pure JS with jQuery:
         | https://github.com/OpenRefine/OpenRefine/blob/master/main/we...
         | 
         | It is SSR with Velocity templates.
        
       | hodanli wrote:
       | this is my go-to tool for text unification and database
       | normalization.
        
         | alflervag wrote:
          | I'm surprised to hear you're using this for database
         | normalization. Could you expand on how OpenRefine is helpful
         | here?
        
       | datadrivenangel wrote:
       | Tools like this allow for easy transformations and data wrangling
       | while also keeping project history in a way that helps preserve
       | lineage. Very valuable.
        
       | codetrotter wrote:
       | This is awesome!
       | 
       | I had a look at how to use this. The video I watched is a couple
       | years old, but probably mostly relevant still.
       | https://youtu.be/nORS7STbLyk
       | 
       | The thing that really resonates with me here is the way they use
       | faceting to find bad data.
       | 
        | When I write pipelines on the command line, I sometimes find it
        | necessary to filter and select data in various ways, so I end up
        | rerunning CLI pipelines multiple times. If instead I dumped the
        | data to CSV, I could see myself using OpenRefine with its faceting
        | to pick out the relevant data for processing.
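
        A rough stand-in for the text-facet idea described above (not from the
        comment; the file name and column name are invented): counting the
        distinct values in one CSV column already surfaces the outliers a
        facet would show.

            import csv
            from collections import Counter

            # Hypothetical export from a CLI pipeline; "city" is an assumed
            # column name.
            with open("records.csv", newline="", encoding="utf-8") as f:
                counts = Counter(row["city"] for row in csv.DictReader(f))

            # Rare values at the bottom are often the typos and bad rows that
            # would otherwise take several pipeline reruns to track down.
            for value, count in counts.most_common():
                print(f"{count:6d}  {value}")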
        
       | hermitcrab wrote:
       | Seems focussed on dealing with a single dataset.
       | 
       | -There doesn't seem to be a canvas for visualizing the flow
       | (graph) of transformations.
       | 
        | -It doesn't seem to support operations on 2 datasets, such as
       | Join.
       | 
       | Isn't that rather limiting? Or have I missed something?
        
         | saadullahsaeed wrote:
          | Not the most straightforward way to do it, but it does support
          | joining across two datasets. You can reference and add columns
          | from a different "project".
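
        Outside OpenRefine, the "reference another dataset and add its
        columns" idea can be sketched with a plain lookup; the file and column
        names below are invented for the example.

            import csv

            # Hypothetical second dataset playing the role of the other
            # "project".
            with open("countries.csv", newline="", encoding="utf-8") as f:
                population = {row["country"]: row["population"]
                              for row in csv.DictReader(f)}

            # Add a column to the first dataset by looking up a shared key.
            with open("cities.csv", newline="", encoding="utf-8") as f:
                cities = list(csv.DictReader(f))
            for row in cities:
                row["population"] = population.get(row["country"], "")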
        
       | nologic01 wrote:
        | It is a powerful tool, but like other tools in this space (think,
        | e.g., pandas) it takes some serious getting used to; it does not
        | have the most intuitive user interface.
       | 
        | What was disappointing last time I used it is arguably not a
        | problem of OpenRefine at all: the connectivity and response of
        | Wikidata queries were very slow. But that combination of local
        | data harmonization with an open and globally available reference
       | is super-important. I hope it somehow receives more attention and
       | traction.
        
         | pbronez wrote:
          | Hmm, seems like improving Wikidata's API performance would be a
         | good use of Wikimedia's war chest.
        
           | zozbot234 wrote:
            | Their main blocker is the lack of a high-performance SPARQL
            | backend for complex queries where ordinary text-based search is
            | not effective.
            | https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...
            | has more information, though it's not fully up to date (QLever
            | especially has had significant updates since).
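
        For context on the query service mentioned above, a minimal example of
        hitting the public SPARQL endpoint at https://query.wikidata.org/sparql
        from Python; the specific query (items that are a kind of city with
        the German label "München") is only an illustration.

            import json
            import urllib.parse
            import urllib.request

            query = """
            SELECT ?city WHERE {
              ?city wdt:P31/wdt:P279* wd:Q515 .   # instance of (a subclass of) city
              ?city rdfs:label "München"@de .
            }
            LIMIT 5
            """
            url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
                {"query": query, "format": "json"})
            req = urllib.request.Request(
                url, headers={"User-Agent": "hn-openrefine-example/0.1"})
            with urllib.request.urlopen(req) as resp:
                for row in json.load(resp)["results"]["bindings"]:
                    print(row["city"]["value"])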
        
       | tkt wrote:
       | I used to teach OpenRefine as a part of Data Carpentry workshops,
       | and more than once, I heard people say it changed their lives.
       | 
       | This is that lesson for getting started with OpenRefine.
       | http://datacarpentry.org/OpenRefine-ecology-lesson/
        
       ___________________________________________________________________
       (page generated 2023-10-23 09:02 UTC)