[HN Gopher] Bad Data and Data Engineering: Dissecting Google Play
Music Takeout Data
       ___________________________________________________________________
        
       Bad Data and Data Engineering: Dissecting Google Play Music Takeout
       Data
        
       Author : otter-in-a-suit
       Score  : 49 points
        Date   : 2021-12-25 22:17 UTC (1 day ago)
        
 (HTM) web link (chollinger.com)
 (TXT) w3m dump (chollinger.com)
        
       | wodenokoto wrote:
       | > The script should be decently self-explanatory [...] Please
       | note that this is all single-threaded, which I don't recommend -
       | with nohup and the like, you can trivially parallelize this.
       | 
        | How do you parallelize a loop in bash without getting all the
        | echoes intertwined and jumbled together?
        
         | karlding wrote:
          | In general, you can partition the loop and delegate the
          | chunks to "workers", with each worker piping its output to
          | its own file. This avoids the need for mutual exclusion
          | around a shared output stream. If you need a combined view
          | of the logs, run them through a log aggregator afterwards.
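          | 
          | For instance, a minimal sketch of the same partitioning
          | idea in Python (the item list and file names are made up):
          | 
          |     # Partition work across N workers; each one writes to
          |     # its own log file, so no locking is needed around a
          |     # shared output stream.
          |     from multiprocessing import Pool
          | 
          |     def worker(args):
          |         idx, chunk = args
          |         # One log per worker avoids interleaved output.
          |         with open(f"worker_{idx}.log", "w") as log:
          |             for item in chunk:
          |                 log.write(f"processed {item}\n")
          | 
          |     if __name__ == "__main__":
          |         items = list(range(100))
          |         n = 4
          |         chunks = [(i, items[i::n]) for i in range(n)]
          |         with Pool(n) as pool:
          |             pool.map(worker, chunks)
          | 
          | Aggregating afterwards is then just a matter of
          | concatenating the worker_*.log files.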
        
       | progbits wrote:
        | So are the mp3 files not the same as what the author uploaded?
        | I could imagine weird organization for tracks from the
        | service, but for self-uploaded data I would be surprised if
        | they didn't just give them back unchanged.
        | 
        | The article never mentions how this showed up in the GPM app
        | itself, which feels like a gap.
        | 
        | Otherwise a nice article, but it reminds me why I long ago
        | gave up on media metadata organization. So much work, so much
        | mess...
        
         | kiloDalton wrote:
          | In the case of lossless files, the takeout files are
          | emphatically not the same files that were uploaded. Google
          | Music would allow a user to upload lossless FLAC files, but
          | internally it converted them to 320 kbps MP3 files. So GPM
          | certainly transcoded a portion of uploaded files. I'm not
          | sure to what extent it left files alone if they met Google's
          | formatting specifications. Perhaps someone else knows.
        
           | jeffbee wrote:
            | I don't think they left much alone. One of my biggest
            | problems with GPM was that my uploads would seemingly get
            | de-duplicated against some other record that wasn't
            | exactly the same, like a reissue or a remaster of the
            | same record that sounded noticeably different. Sometimes
            | an album I uploaded would gain a mysterious bonus track.
            | They also at some point hosed up the whole system in such
            | a way that many of my records contained every track
            | twice, which meant I had to make playlists out of my old
            | albums just to remove the even-numbered tracks and make
            | them listenable again.
           | 
            | If you do a takeout from YTM, it says your music files
            | are "Your originally uploaded audio file", which is nice.
            | Since music in YTM may have been migrated from GPM, that
            | seems to imply that GPM retained the originals.
           | 
           | When they shut down GPM I migrated to YTM, which doesn't seem
           | to have these specific catalog problems. I also just re-
           | organized my local copy of my FLACs using MusicBrainz Picard.
           | Unlike this author I no longer have the giant wall of CDs!
        
         | randomifcpfan wrote:
          | IIRC, GPM stored user uploads as MP3. If you uploaded a
          | non-MP3 file, it was transcoded into an MP3 during upload.
          | It is this file that GPM takeout provides.
         | 
         | Separate from that, GPM matched your uploaded MP3 file against
         | the service music corpus, and if there was a match, the service
         | streamed the canonical version. Originally the streaming
         | service used 320 kbps MP3, but later the service switched to
         | 256 kbps AAC. GPM takeout does not provide the canonical
         | version.
        
       | faizshah wrote:
        | Great post. For this pipeline I would probably have used a
        | makefile for the batch stage instead of airflow, just to keep
        | it simple. I would also make my sink a SQLite database so
        | that you can easily search through it with a web interface
        | using datasette.
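        | 
        | A minimal sketch of that sink, assuming the cleaned records
        | already sit in a pandas DataFrame (file and table names are
        | made up):
        | 
        |     # Land the cleaned records in a SQLite file so a tool
        |     # like datasette can serve them over a web UI.
        |     import sqlite3
        |     import pandas as pd
        | 
        |     df = pd.DataFrame([{"artist": "Queen", "title": "'39"}])
        |     conn = sqlite3.connect("takeout.db")
        |     df.to_sql("tracks", conn, if_exists="replace",
        |               index=False)
        |     conn.close()
        | 
        | Then `datasette takeout.db` serves the browsable interface.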
       | 
        | For the places where bash was used I would just use python,
        | and for any cli tools you want to call, subprocess. It's much
        | simpler, and I can run the scripts in a repl and execute
        | cells in Jupyter or just normal pycharm, so it's quick and
        | interactive.
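        | 
        | Something like this (the command is just an example):
        | 
        |     # Call a CLI tool from python and keep the result as a
        |     # string, instead of gluing pipes together in bash.
        |     import subprocess
        | 
        |     def run(cmd):
        |         # check=True raises on a non-zero exit code;
        |         # text=True decodes stdout into a str.
        |         result = subprocess.run(
        |             cmd, check=True, capture_output=True, text=True
        |         )
        |         return result.stdout
        | 
        |     print(run(["ls", "-la"]))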
       | 
        | Love that you included something on building a data
        | dictionary; I am honestly guilty of not including a good data
        | dictionary for the source data in the past. I would just
        | leave the output of df.describe() or df.info() at the top of
        | the jupyter notebook where you restructure the source data
        | before processing it. I now think you should build a data
        | dictionary for both the source data and the final data and
        | save it as a CSV, as that's more maintainable, or at least
        | leave a comment in your script.
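        | 
        | A sketch of generating one, assuming the source data is
        | already in a DataFrame (the file names are placeholders):
        | 
        |     # Persist a minimal data dictionary next to the data:
        |     # one row per column with dtype, null count and a sample.
        |     import pandas as pd
        | 
        |     df = pd.read_csv("source.csv")
        |     data_dict = pd.DataFrame({
        |         "column": df.columns,
        |         "dtype": df.dtypes.astype(str).values,
        |         "non_null": df.notna().sum().values,
        |         "example": df.iloc[0].astype(str).values,
        |     })
        |     data_dict.to_csv("data_dictionary.csv", index=False)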
       | 
        | Otherwise everything else is pretty similar to what I would
        | do. I just went to my google takeout, and apparently all my
        | google play data and songs are gone, so I guess I can't try
        | this myself...
        
         | wodenokoto wrote:
          | My first thought was also "why not SQLite?", but the author
          | says he already has a MariaDB instance running. So: use the
          | tools you know.
          | 
          | I guess it is the same for make vs airflow. I had no idea
          | they could be used interchangeably for single-machine
          | workloads.
          | 
          | While I've seen datasette mentioned in a lot of places, I
          | still don't really know what it is, but if it makes
          | exploring SQLite databases easy, I should give it a try!
        
           | faizshah wrote:
            | The makefile data pipeline is definitely an underrated
            | technique; a couple of great HN comments on it:
            | 
            | - https://news.ycombinator.com/item?id=22283368
            | 
            | - https://news.ycombinator.com/item?id=18896204
            | 
            | I personally learned it from bioinformaticians; there's
            | great coverage of this and other command-line data skills
            | in this book:
            | https://www.oreilly.com/library/view/bioinformatics-
            | data-ski...
           | 
           | The SQLite, pandas, bash, make stack for quick data science
           | projects is a great and maintainable one that doesn't require
           | too much specialized knowledge.
        
             | diarrhea wrote:
              | Great read on Jupyter Notebooks. I've always had a
              | strong dislike for them as well - they just feel fuzzy,
              | dirty, often in a weird state - and the user put it
              | into words nicely. I might just be bad with Jupyter,
              | but I've tried to mold these notebooks into saner forms
              | (modularization, keeping state, deterministic cell
              | order/content/execution) and come up empty.
             | 
              | At some point you've got to ask whether it isn't the
              | tool's fault. Modularization especially is infuriating
              | -- it seems there is no reliable way to rerun `import
              | myownmodule` after working on the latter. It won't
              | detect changed code. A _kernel restart_ is necessary
              | (`reload` and friends didn't help), losing all state.
              | It's punishment for saner software engineering
              | practices.
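              | 
              | For reference, the usual workarounds look roughly like
              | this (`myownmodule` is a stand-in for the module under
              | development):
              | 
              |     # Explicit reload: re-executes the module, but it
              |     # does NOT rebind names imported elsewhere via
              |     # `from myownmodule import thing`, which is often
              |     # why it appears not to work.
              |     import importlib
              |     import myownmodule
              | 
              |     importlib.reload(myownmodule)
              | 
              |     # In a notebook, IPython's autoreload extension:
              |     #   %load_ext autoreload
              |     #   %autoreload 2
              |     # re-imports changed modules before each cell run.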
        
               | faizshah wrote:
                | You are absolutely right; I have taken to referring
                | to that as "notebook code."
                | 
                | I love jupyter because it's a great interactive
                | programming environment that speeds up building
                | scripts. But jupyter notebooks always end up
                | being... notebooks. They are records of past work and
                | thoughts that I had, but I wouldn't present my paper
                | notes to a coworker as finished documentation. I
                | always end up extracting those jupyter snippets into
                | runnable scripts, and then I end up doing a lot of
                | the work all over again as I parameterize things.
               | 
               | I have taken to using pycharm scientific mode and adding
               | "cells" into my scripts:
               | https://blog.jetbrains.com/pycharm/2018/04/pycharm-
               | scientifi...
               | 
               | You get the interactivity of jupyter with the cells, the
               | scientific view for plots and data, but you're also
               | writing a real script so you can still maintain some
               | sanity in your code and good git history (and hopefully
               | add some tests).
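                | 
                | Concretely, the cell markers are just comments, so
                | the file stays a plain script (file and column names
                | here are made up):
                | 
                |     #%% load
                |     import pandas as pd
                |     df = pd.read_csv("tracks.csv")
                | 
                |     #%% explore
                |     print(df.describe())
                | 
                |     #%% clean
                |     df = df.dropna(subset=["artist"])
                |     df.to_csv("tracks_clean.csv", index=False)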
               | 
                | Then I combine these scripts together with a makefile
                | and voila: quick, easy and maintainable scripts. If
                | you just stick to make, python, pandas and bash, any
                | programmer can modify your pipeline without needing
                | specialist skills. I have written scrapers + ETL in
                | <1.5 hours (a journalist needed some data quickly)
                | with this method.
                | 
                | God, I wish I still worked in data science. I'm
                | currently a front-end dev and don't get to do any of
                | this at work. It's such fun and offers so much room
                | for creativity and problem solving.
        
               | wodenokoto wrote:
                | I think notebooks are great for building
                | presentations / reports of a data exploration, but I
                | agree that they shouldn't represent programs.
                | 
                | RStudio won't export an R Markdown notebook to PDF
                | without rerunning all cells sequentially. It's great
                | for keeping the code "in check", but for things that
                | take hours to run, it can be quite annoying.
                | 
                | I do think most of the problems with notebooks go
                | away if you somehow force sequential running before
                | presenting them to other people.
               | 
                | Another really cool take on notebooks is Julia's
                | Pluto notebooks. Like a spreadsheet, there is no
                | fixed sequence that cells are run in; dependent cells
                | are updated automatically. It's kinda hard to
                | explain, but the JuliaCon presentation on them is
                | absolutely wonderful:
                | https://www.youtube.com/watch?v=IAF8DjrQSSk
        
       ___________________________________________________________________
       (page generated 2021-12-26 23:01 UTC)