[HN Gopher] Bad Data and Data Engineering: Dissecting Google Play Music Takeout Data
___________________________________________________________________
Bad Data and Data Engineering: Dissecting Google Play Music Takeout
Data
Author : otter-in-a-suit
Score : 49 points
Date : 2021-12-25 22:17 UTC (1 day ago)
(HTM) web link (chollinger.com)
(TXT) w3m dump (chollinger.com)
| wodenokoto wrote:
| > The script should be decently self-explanatory [...] Please
| note that this is all single-threaded, which I don't recommend -
| with nohup and the like, you can trivially parallelize this.
|
| How do you parallelize a loop in bash without getting all the
| echoes intertwined and jumbled together?
| karlding wrote:
| In general, you can partition the loop to delegate to "workers"
| and have each instance pipe the output to different files, each
| corresponding to an output stream. This avoids the need for
| mutual exclusion around your output streams. If you need to
| aggregate logs then run some log aggregator.
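| Sketching that idea in Python rather than bash for a self-
| contained example (the worker count, file names, and the
| "work" itself are all made up here): each worker gets a
| round-robin slice of the inputs and writes to its own log
| file, so no two workers ever interleave output.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(worker_id, items):
    """Process one slice of the inputs, logging to a private file."""
    log_path = f"worker-{worker_id}.log"
    with open(log_path, "w") as log:
        for item in items:
            # stand-in for real work (e.g. tagging or transcoding a file)
            log.write(f"worker {worker_id}: processed {item}\n")
    return log_path

items = [f"track-{i}.mp3" for i in range(10)]
n_workers = 3
# round-robin partition: worker w takes items w, w+3, w+6, ...
slices = [(w, items[w::n_workers]) for w in range(n_workers)]
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    log_files = list(pool.map(lambda s: worker(*s), slices))
for path in log_files:              # aggregate once all workers finish
    with open(path) as f:
        print(f.read(), end="")
```

| Afterwards the per-worker logs can be concatenated in any
| order; each one is internally ordered and never interleaved.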
| progbits wrote:
| So are the mp3 files not the same as what the author uploaded? I
| could imagine weird organization for tracks from the service but
| for self-uploaded data I would be surprised if they didn't just
| give them back the same.
|
| The article never mentioned how this showed up in the GPM app
| itself which feels lacking.
|
| Otherwise a nice article but it reminds me why I long ago gave up
| on media metadata organization. So much work, so much mess...
| kiloDalton wrote:
| In the case of lossless files, the takeout files are
| emphatically not the same files that were uploaded. Google
| Music would allow a user to upload lossless FLAC files, but
| internally it converted them to 320 kbps MP3 files. So, GPM
| certainly transcoded a portion of uploaded files. I'm not sure
| to what extent it left files alone if they met Google's
| formatting specifications. Perhaps someone else knows.
| jeffbee wrote:
| I don't think they did very much leaving things alone. One of
| my biggest problems with GPM was that my uploads would
| seemingly get de-duplicated alongside some other record that
| wasn't exactly the same, like a reissue or a remaster of the
| same record that sounded noticeably different. Sometimes an
| album I uploaded would gain a mysterious bonus track. They
| also at some point hosed up the whole system in such a way
| that many of my records contained every track twice, which
| meant I had to make playlists out of my old albums just to
| remove the even-numbered tracks and make it listenable again.
|
| If you takeout from YTM it says your music files are "Your
| originally uploaded audio file" which is nice. Since music in
| YTM may have been migrated from GPM, that seems to imply that
| GPM retained the originals.
|
| When they shut down GPM I migrated to YTM, which doesn't seem
| to have these specific catalog problems. I also just re-
| organized my local copy of my FLACs using MusicBrainz Picard.
| Unlike this author I no longer have the giant wall of CDs!
| randomifcpfan wrote:
| IIRC, GPM stored user uploads as MP3. If you uploaded a non-MP3
| file, it was transcoded into an MP3 during upload. It is this
| file that GPM takeout provides.
|
| Separate from that, GPM matched your uploaded MP3 file against
| the service music corpus, and if there was a match, the service
| streamed the canonical version. Originally the streaming
| service used 320 kbps MP3, but later the service switched to
| 256 kbps AAC. GPM takeout does not provide the canonical
| version.
| faizshah wrote:
| Great post, for this pipeline I would have probably used a
| makefile for the batch pipeline instead of airflow just to keep
| it simple. I would also make my sink a SQLite database so that
| you can easily search through it with a web interface using
| datasette.
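| A minimal sketch of the SQLite-as-sink idea (the `tracks`
| table and its columns are invented for illustration); once the
| .db file exists, running `datasette tracks.db` serves a
| browsable web UI over it:

```python
import sqlite3

# Invented schema: one row per track parsed out of the takeout data.
conn = sqlite3.connect("tracks.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS tracks (
        title  TEXT,
        artist TEXT,
        album  TEXT,
        path   TEXT
    )
""")
rows = [
    ("Song A", "Artist 1", "Album X", "/music/a.mp3"),
    ("Song B", "Artist 2", "Album Y", "/music/b.mp3"),
]
conn.executemany("INSERT INTO tracks VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Ad-hoc search, the kind of query datasette exposes through its UI:
hits = conn.execute(
    "SELECT title FROM tracks WHERE artist = ?", ("Artist 1",)
).fetchall()
print(hits)  # [('Song A',)]
conn.close()
```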
|
| For the places where bash was used I would just use Python, and
| for any CLI tools you want to call I'd just use subprocess.
| It's much simpler, and I can run the scripts in a REPL and
| execute cells in Jupyter or plain PyCharm, so it's quick and
| interactive.
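| The subprocess pattern for shelling out looks like this (the
| command here is a harmless stand-in for whatever CLI tool the
| pipeline would actually call):

```python
import subprocess

# Stand-in command; swap in the real CLI tool and its arguments.
result = subprocess.run(
    ["python3", "--version"],
    capture_output=True,  # collect stdout/stderr instead of inheriting
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on non-zero exit
)
print(result.stdout.strip())
```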
|
| Love that you included something on building a data dictionary;
| I'm honestly guilty of not including a good data dictionary
| for the source data in the past. I would just leave the output
| of df.describe() or df.info() at the top of the Jupyter
| notebook where you restructure the source data before
| processing it. I now think you should include a data dictionary
| of the source data and the final data, saved as a CSV, since
| it's more maintainable, or at least leave a comment in your
| script.
|
| Otherwise everything else is pretty similar to what I would do.
| I just went to my Google takeout, and apparently all my Google
| Play data and songs are gone, so I guess I can't try this
| myself...
| wodenokoto wrote:
| My first thought was also "why not SQLite?", but the author
| says he already has a MariaDB running. So, using the tools you
| know.
|
| I guess it is the same for make vs airflow. I had no idea they
| could be used interchangeably for single machine workloads.
|
| While I've seen datasette mentioned a lot of places, I still
| don't really know what it is, but if it makes exploring sqlite
| databases easy, I should give it a try!
| faizshah wrote:
| The makefile data pipeline is definitely an underrated
| technique. A couple of great HN comments on it:
|
| - https://news.ycombinator.com/item?id=22283368
|
| - https://news.ycombinator.com/item?id=18896204
|
| I personally learned it from bioinformaticians; there's great
| coverage of this and other command-line data skills in this
| book: https://www.oreilly.com/library/view/bioinformatics-data-ski...
|
| The SQLite, pandas, bash, make stack for quick data science
| projects is a great and maintainable one that doesn't require
| too much specialized knowledge.
| diarrhea wrote:
| Great read on Jupyter Notebooks. I've always had a strong
| dislike for them as well - they just feel fuzzy, dirty, often
| in a weird state - and the author put it into words nicely. I
| might just be bad with Jupyter, but I've tried to mold these
| notebooks into saner forms (modularization, keeping state,
| deterministic cell order/content/execution) and came up empty.
|
| At some point you've got to ask whether it's not the tool's
| fault. Modularization especially is infuriating - there seems
| to be no reliable way to rerun `import myownmodule` after
| working on the latter. It won't detect changed code. A _kernel
| restart_ is necessary (`reload` and friends didn't help),
| losing all state. It's punishment for saner software
| engineering practices.
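| (For reference: `importlib.reload` does re-execute a module's
| source, but it never rebinds names pulled in via `from
| mymodule import thing`, which is one common reason it appears
| to do nothing; IPython's `%autoreload 2` magic exists to paper
| over this. A self-contained demonstration with a throwaway
| module written to disk:)

```python
import importlib
import pathlib
import sys

sys.dont_write_bytecode = True        # skip .pyc caching for this demo
mod_path = pathlib.Path("scratch_mod.py")
mod_path.write_text("VALUE = 1\n")
sys.path.insert(0, ".")

import scratch_mod
from scratch_mod import VALUE         # snapshot of the current binding

mod_path.write_text("VALUE = 2  # edited on disk\n")
importlib.invalidate_caches()
importlib.reload(scratch_mod)

print(scratch_mod.VALUE)  # 2: the module attribute was refreshed
print(VALUE)              # 1: the `from ... import` name was not
```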
| faizshah wrote:
| You are absolutely right; I have taken to referring to that
| as "notebook code."
|
| I love Jupyter because it's a great interactive programming
| environment that speeds up building scripts. But Jupyter
| notebooks always end up being... notebooks. They are records
| of past work and thoughts I had, but I wouldn't present my
| paper notes to a coworker as finished documentation. I always
| end up extracting those Jupyter snippets into runnable
| scripts, and then I end up doing a lot of the work all over
| again as I parameterize things.
|
| I have taken to using pycharm scientific mode and adding
| "cells" into my scripts:
| https://blog.jetbrains.com/pycharm/2018/04/pycharm-
| scientifi...
|
| You get the interactivity of Jupyter with the cells, plus the
| scientific view for plots and data, but you're also writing a
| real script, so you can still maintain some sanity in your
| code and a good git history (and hopefully add some tests).
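| The cell markers are just `# %%` comments, so the file remains
| an ordinary runnable Python script (VS Code and Jupytext
| recognize the same convention):

```python
# %% Load
# Each "# %%" comment starts a cell that PyCharm's scientific
# mode can execute interactively; run as a plain script, the
# file simply executes top to bottom.
data = [3, 1, 2]

# %% Transform
data = sorted(data)

# %% Report
print(data)  # [1, 2, 3]
```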
|
| Then I combine these scripts together in a makefile and voila:
| quick, easy, and maintainable pipelines. If you just stick to
| make, Python, pandas, and bash, any programmer can modify your
| pipeline without needing specialist skills. I have written
| scrapers + ETL in under 1.5 hours (a journalist needed some
| data quickly) with this method.
|
| God, I wish I still worked in data science; I'm currently a
| front-end dev and don't get to do any of this at work. It's
| such fun and offers so much room for creativity and problem
| solving.
| wodenokoto wrote:
| I think notebooks are great for building presentations /
| reports of a data exploration, but I agree that they
| shouldn't represent programs.
|
| RStudio won't export an R Markdown notebook to PDF without
| rerunning all cells sequentially. It's great for keeping the
| code "in check", but for things that take hours to run, it can
| be quite annoying.
|
| I do think most of the problems with notebooks go away if you
| somehow force sequential execution before presenting them to
| other people.
|
| Another really cool take on notebooks is Julia's Pluto
| Notebooks. Like a spreadsheet, there is no sequence that cells
| are run in; everything is updated simultaneously. It's kinda
| hard to explain, but the JuliaCon presentation on these is
| absolutely wonderful:
| https://www.youtube.com/watch?v=IAF8DjrQSSk
___________________________________________________________________
(page generated 2021-12-26 23:01 UTC)