[HN Gopher] Spotify Optimized the Largest Dataflow Job Ever for ...
       ___________________________________________________________________
        
       Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020
        
       Author : SirOibaf
       Score  : 187 points
       Date   : 2021-02-12 09:44 UTC (13 hours ago)
        
 (HTM) web link (engineering.atspotify.com)
 (TXT) w3m dump (engineering.atspotify.com)
        
       | shaicoleman wrote:
       | My off-topic rant: I really wish Spotify would focus on
       | improving the core player experience. It has barely seen any
       | improvements in years.
       | 
       | * Not overwrite/delete my listening history every time I switch
       | devices
       | 
       | * Allow tabs, or some way to resume what I've been listening to
       | in different contexts
       | 
       | * Option to open only one instance, instead of having multiple
       | instances that mess with each other
       | 
       | * Playing local files crashes/not working on Linux
       | 
       | * Change playback speed, not just for podcasts
       | 
       | * Jump back/forward, not just for podcasts
       | 
       | * Have some visibility when the song was last played / play count
       | 
       | * Liked songs not always appearing in search results
       | 
       | * Sorting search results not working
       | 
       | * Add basic functionality to the dbus interface (e.g. seeking)
       | 
       | * Ability to report songs (e.g. wrong titles/badly split
       | tracks/etc.)
        
         | bootlooped wrote:
         | Almost all of these strike me as only benefiting a very small
         | sliver of users, like well under 1%. An infinitesimal portion
         | of Spotify listeners even know what dbus is. How much engineer
         | time is it worth to improve something like that?
        
         | krakmh wrote:
         | You are absolutely right. Even more, people are using it for
         | free using mods like these
         | https://bestforandroid.com/apk/spotify-premium-mod-apk/
        
           | yaqoob wrote:
           | Yup, well said.
        
         | TheRealSteel wrote:
         | I have been begging them for years to add a 'resume playlist'
         | feature but they won't do it.
        
       | spotyoufi462881 wrote:
       | Anyone wanna know how much Spotify wants to know about you?
       | 
       | https://twitter.com/steipete/status/1025024813889478656
        
         | Jonnax wrote:
         | What's scary about this?
         | 
         | They're complying with GDPR. Isn't that a good thing?
         | 
         | The "scary" thing in that tweet is that they store the
         | manufacturer of their bluetooth headphones?
        
           | spotyoufi462881 wrote:
           | That's not what one could call complying. He had to follow
           | them like a dog for a long while just to get his rightful
           | data.
        
             | Jonnax wrote:
             | That was from 2018. Almost 3 years ago.
        
         | yaqoob wrote:
         | Woo. That's incredible. Seriously!
        
         | npteljes wrote:
         | I'm already thinking that every service hoovers as much up as
         | they can. Nice to see proof that it is actually happening!
        
         | float4 wrote:
         | I wonder whether they use the bluetooth device logging purely
         | to develop their social graph. A person streaming music to
         | their own bluetooth headphone using a friend's computer and
         | Spotify account can be detected this way. I can't really come
         | up with another purpose for it.
         | 
         | Other than that, I'm not surprised by what they log. Virtually
         | every company stores search queries, oauth grants, play
         | history, ad interactions etc. Doesn't make it right of course.
        
       | max_streese wrote:
       | Hi not sure if I am just completely off here but I am wondering
       | how this relates or compares to processing things with Kafka and
       | Kafka Streams?
       | 
       | If I am reading things correctly with Kafka the workflow
       | equivalent to what's written in the article would be to have your
       | producer produce via hash-based-round-robin (the default
       | partitioning algorithm) based on the key you are interested in
       | into some topic and then your consumer would just read it and
       | your data would already be sorted for the given keys (because
       | within a partition Kafka has sorting guarantees) and also be co-
       | partitioned correctly if you need to read some other topic in
       | with the same number of partitions and the same logical keys
       | produced via the same algorithm. No?
        
         | jsjsbdkj wrote:
         | This is the most basic pattern for distributed joins - you hash
         | on the join key in both tables and shuffle data based on hash
         | ranges. In some systems like Redshift you can designate the key
         | for distribution so that "related" records are already co-
         | located on a single shard.
         | 
         | > your data would already be sorted for the given keys (because
         | within a partition Kafka has sorting guarantees)
         | 
         | It's been a while since I used Kafka but I don't remember
         | "sorting guarantees". Consumers see events "in order" based on
         | when they were produced, because each partition is a queue.
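
The hash-and-shuffle join pattern described above can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka's or Redshift's actual implementation: the names `partition_for`, `shuffle`, and `partitioned_join` are hypothetical, and CRC32 stands in for whatever hash a real system uses. It also shows that a partition keeps records in arrival order rather than sorting them by key.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; both datasets must use the same count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(records, num_partitions: int = NUM_PARTITIONS):
    """Group (key, value) records by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def partitioned_join(left, right, num_partitions: int = NUM_PARTITIONS):
    """Inner-join two datasets one partition at a time."""
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    joined = []
    for p in range(num_partitions):
        # Build a lookup table from the right side of this partition only.
        lookup = defaultdict(list)
        for key, value in right_parts.get(p, []):
            lookup[key].append(value)
        # Matching keys are guaranteed to live in the same partition index.
        for key, value in left_parts.get(p, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


events = [("bob", "play"), ("alice", "skip"), ("bob", "pause")]
users = [("alice", "SE"), ("bob", "US")]
result = partitioned_join(events, users)

# Records for one key stay in the order they arrived, not in sorted order.
bob_events = [r for r in shuffle(events)[partition_for("bob")] if r[0] == "bob"]
```

If both datasets were already written with the same hash and the same partition count, the `shuffle` step is what gets skipped: each worker can join its own partition pair directly.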
        
           | max_streese wrote:
           | Yes I guess my point is when using Kafka in combination with
           | Kafka Streams and you produce things partitioned in a way
           | that you need them for consumption then you do not need to do
           | any shuffling in the instance where you want to join because
           | data is already partitioned correctly.
        
             | dvdhnt wrote:
             | You seem to know what you're talking about. Any
             | recommendations on learning resources for this type of
             | flow? Or really understanding which platform works best in
             | each situation?
             | 
             | I'm learning proper data flow in real time as I look to
             | transition ETL of product data into Postgres to a more
             | applicable system.
             | 
             | Finding the right learning resources is difficult! Cheers.
        
       | CyberRabbi wrote:
       | "How <moderately large startup> did <some obscure tech thing>"
       | 
       | <... low substance and poorly written synopsis of the obscure
       | tech thing ...>
       | 
       | "Enjoy work like this? Want to make a big impact as a lowly
       | programmer who will never start their own company? Well at
       | <moderately large startup>, lowly programmers who will never
       | start their own companies have the freedom to make a big impact."
        
         | rrdharan wrote:
         | Spotify is no longer a startup.
        
           | CyberRabbi wrote:
           | Good point. That changes the substance of the post.
        
       | gripfx wrote:
       | Yet, I, as a user still cannot see my playcounts via app or API.
        
         | marliechiller wrote:
         | this is a massive gripe of mine. I remember being able to
         | toggle so many stats in itunes back in the day which gave me
         | such great insight into my listening habits and other cool
         | tidbits. Spotify, for all its ease of listening has removed
         | some of the magic there
        
         | utucuro wrote:
         | That might be explained as guarding their information, yet I
         | still have trouble believing that they are unable to
         | distinguish two artists with the same name from each other.
         | Each week, my Release Radar playlist is basically invaded by 0
         | listener rappers who "feat" defunct 70s rock bands who do not
         | manage their artist pages... Response from Spotify: please
         | report them individually, from the desktop app.
        
           | sbarre wrote:
           | That's some interesting product hacking in a way.
           | 
           | I've run into artist naming collisions enough to know it's a
           | thing but I've never seen anyone (ab)use it intentionally..
           | 
           | Neat, but still annoying I'm sure. =)
        
           | thejoeflow wrote:
           | This. They don't even do an exact string match of the artist
           | name, for reasons I can't fathom. I could make a new
           | artist account named "drake" (lowercase d) today and probably
           | show up in a million Release Radar playlists by next week.
        
         | asutekku wrote:
         | This is one of the reasons why I still keep using lastfm. Stats
         | in Spotify are next to useless and even then it would not track
         | my local listening.
        
         | montag wrote:
         | You can, to some extent, using the play history API.
         | https://output.jsbin.com/ribat
        
       | greatthx wrote:
       | Great! Now will they stop scanning my entire hard drive with
       | their desktop app? Also stop opening sockets directly to
       | advertiser IP address. And stop paying off data thieves instead
       | of disclosing to their users that their passwords were leaked.
       | And to stop being sellouts too!
        
         | jeofken wrote:
         | Do you have links to further readings about this?
        
       | meibo wrote:
       | Wrapped works so well because it panders to you. Everyone likes
       | to be acknowledged for listening to the weird indie band they
       | discovered earlier this year.
       | 
       | I enjoy it anyways, and Spotify is still a great service for now
       | - I wonder if it'll meet the same fate as Netflix at some point,
       | with publishing houses going for their own streaming services
       | instead.
        
         | pradn wrote:
         | If anything the top-5 format doesn't capture the long tail you
         | might be listening to.
        
         | onion2k wrote:
         | _I wonder if it'll meet the same fate as Netflix at some
         | point, with publishing houses going for their own streaming
         | services instead._
         | 
         | I don't think so because the way that people consume music and
         | the way they consume films and television are _very_ different.
         | With a film you might block out a few hours to watch that
         | specific provider. With music you're more likely to want to
         | interleave content from several providers at the same time (eg
         | a playlist). Unless all the providers are available on the same
         | platform it wouldn't work well.
        
           | Waterluvian wrote:
           | Speaking of interleaving, I wish Spotify understood how to
           | interact with Concept Albums. Changing to the middle of The
           | Wall for one song is jarring and I usually skip it.
           | 
           | And then I worry skipping it is going to train the algorithm
           | that I don't like Pink Floyd.
        
             | adhoc_slime wrote:
             | What does this mean? I don't use spotify. Are you saying
             | when you go to listen to an album it will put a random song
             | into the queue?
        
               | meibo wrote:
               | Only if you listen on shuffle (like every music app I
               | guess) or in their AI lists. Would be nice if the AI ones
               | would respect that for sure, was wondering why that's not
               | a thing since like 2016.
        
               | Waterluvian wrote:
               | I'll be listening to a theme or genre of music and it
               | adds songs that don't really work outside the context of
               | their album.
        
               | umanwizard wrote:
               | Spotify allows both: listening to random songs (either
               | based on a particular genre, seeded from an existing
               | playlist, or generated from the user's listening habits),
               | or choosing specific songs or albums to listen to.
               | 
               | They probably meant listening in random mode, and Spotify
               | randomly choosing a track that doesn't make much sense
               | outside the context of its album.
        
             | 2038AD wrote:
             | I never understood why for albums like this they even
             | bother splitting up the tracks.
        
           | Allower wrote:
           | No.. that's how I watch TV and movies too. What person do you
           | know says "I only want to watch Paramount shit!"?
        
           | hnlmorg wrote:
           | A considerable amount of music is distributed by a small
           | subset of providers. So in that regard it's not that
           | different to the TV / film situation and you could
           | theoretically still interleave different artists.
           | 
           | There's also a common use case where people will just play a
           | specific artist for an hour. Or even an album.
           | 
           | Frankly, I hope services like Spotify don't disappear. It's a
           | great loss to consumers just how fragmented video streaming
           | services have become. I'd hate to see the same happen to
           | music as well.
        
             | alvarlagerlof wrote:
             | I don't think that behavior is all too common. I see people
             | listening to a very wide range of artists and almost never
             | reach for stuff outside of it around here.
        
               | hnlmorg wrote:
               | I use music streaming services to play specific albums
               | (as I tend to play older artists who created albums
               | designed to be played as a whole entity rather than
               | singles)
               | 
               | My wife uses them to shuffle singles by specific artists
               | (she's more into pop music).
               | 
               | I'd wager if my wife and I both coincidentally follow the
               | same pattern despite doing so for different reasons, that
               | it's then likely a more common pattern than first
               | assumed. Please also bear in mind that I'm not suggesting
               | our use cases are how the majority of people consume
               | music, but I'd be surprised if it was small enough to be
               | a rounding error.
        
         | merlinscholz wrote:
         | [deleted]
        
           | underwater wrote:
           | Do you have anything more terse? A 30-minute, ad-filled
           | video isn't exactly a light read.
        
       | matsemann wrote:
       | I came here expecting to read about the tech in the article or
       | how others do big data processing stuff. Instead I get off topic
       | Spotify rants... Did you read the article, or just see Spotify in
       | the headline and decide your gripes are therefore relevant?
        
         | johncena33 wrote:
         | This has become a huge problem on HN lately. Lots of
         | discussions are nothing but complaining. Now the technical
         | discussions are starting to get infested with off-topic
         | whining. The mods don't do anything about off-topic rants. If
         | you point it out you'll get downvoted [1][2][3].
         | 
         | [1] https://news.ycombinator.com/item?id=25839399 [2]
         | https://news.ycombinator.com/item?id=25064636 [3]
         | https://news.ycombinator.com/item?id=24699908
        
           | adventured wrote:
           | > The mods don't do anything about off-topic rants.
           | 
           | Mod. There is a question of how much one moderator can do
           | against the tide. HN really needs a couple of full time paid
           | moderators, with their salaries covered by the zillion dollar
           | YC bank account.
        
           | pedroaraujo wrote:
           | Absolutely, there has been a dramatic shift in the type of
           | people who visit HN in the past years.
           | 
           | I used to think that Reddit was bad in this regard but to be
           | honest it mostly affects the big subreddits, the niche and
           | small ones still have a high quality community. HN became
           | pretty much like the biggest subs on Reddit.
        
             | chishaku wrote:
             | Back in my day...
             | 
             | This comment thread is in its own category of low quality
             | discussion.
             | 
             | Negativity bias prevents you from seeing that 95% of the
             | homepage right now is technical/nerdy with a lot of high
             | quality corresponding discussion.
             | 
             | When political/social issues hit the homepage, they often
             | slide off quickly if the corresponding discussion is of low
             | quality (has many downvoted comments).
             | 
             | HN is certainly not perfect but just focusing on the parts
             | you don't like prevents you from seeing the bigger picture.
        
         | harryf wrote:
         | Agreed, so I'll bite...
         | 
         | > The intuition is that for datasets commonly and frequently
         | joined on a known key, e.g., user events with user metadata on
         | a user ID, we can write them in bucket files with records
         | bucketed and sorted by that key. By knowing which files contain
         | a subset of keys and in what order, shuffle becomes a matter of
         | merge-sorting values from matching bucket files, completely
         | eliminating costly disk and network I/O of moving key-value
         | pairs around.
         | 
         | I'm actually surprised that this should be regarded as "novel"
         | in data science.
         | 
         | It reminds me of something in Eric Raymond's "The Art of Unix
         | Programming" (I don't have time to find the link right now)
         | where it discussed an approach from the earlier days of Linux
         | filesystems where you had a limit on the number of inodes that
         | could exist in a single directory and corresponding
         | performance. The workaround was to create a subdirectory
         | structure to store files based on the filename. But then you
         | tended to get many files starting with the same characters all
         | in the same directories. What turned out to be a better way to
         | distribute the files evenly in the directory structure was to
         | take the first and _last_ character of the file name and use
         | those to create the subdirectories. This way you were more
         | likely to spread the files evenly across the structure.
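
The first-and-last-character trick can be illustrated with a small sketch. The function names and the sample file names here are hypothetical, and names are assumed to be at least two characters long. It compares how many subdirectories each scheme ends up using for a batch of names that share a common prefix.

```python
from collections import Counter
from pathlib import PurePosixPath


def shard_by_prefix(filename: str) -> PurePosixPath:
    """Place a file under <first char>/<second char>/."""
    return PurePosixPath(filename[0], filename[1]) / filename


def shard_by_first_and_last(filename: str) -> PurePosixPath:
    """Place a file under <first char>/<last char>/."""
    return PurePosixPath(filename[0], filename[-1]) / filename


# Hypothetical batch of names that all start with the same characters.
names = [f"report-{i:04d}" for i in range(100)]

prefix_dirs = Counter(str(shard_by_prefix(n).parent) for n in names)
first_last_dirs = Counter(str(shard_by_first_and_last(n).parent) for n in names)
```

With these names the prefix scheme puts everything in one directory, while the first-and-last scheme spreads the files over ten. The benefit disappears when every name ends in the same character, for example a shared file extension, so a hash of the whole name is the more general choice.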
        
           | nerpderp82 wrote:
           | I read it and I am like ... we already do this. This is
           | common and obvious. Maybe I am missing something.
           | 
           | Before worker nodes had as much memory as they have now,
           | almost everything needed to use small buffers and spill to
           | disk. BDB (Berkeley DB) was an extremely common tool for
           | doing out-of-core data operations. Because the ETL tools I
           | was writing needed to run on machines with 512MB of RAM, they
           | required out-of-core algorithms. We easily had jobs
           | processing 10-20GB with only 512MB of RAM.
           | 
           | I am sure I am missing something, reading the paper now.
           | 
           | http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...
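
The small-buffer, spill-to-disk style described above is usually built on an external merge sort. The following is a minimal sketch, assuming the input is an iterable of strings without embedded newlines. `external_sort` and `max_in_memory` are hypothetical names, and this is not the BDB-based tooling the comment refers to.

```python
import heapq
import tempfile
from itertools import islice


def external_sort(lines, max_in_memory=1000):
    """Sort strings while holding at most max_in_memory of them at once."""
    runs = []
    iterator = iter(lines)
    while True:
        # Sort one bounded chunk in memory, then spill it to a temp file.
        chunk = sorted(islice(iterator, max_in_memory))
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        run.writelines(line + "\n" for line in chunk)
        run.seek(0)
        runs.append(run)
    try:
        # Strip newlines before comparing so the merge order matches sorted().
        streams = [(line.rstrip("\n") for line in run) for run in runs]
        # heapq.merge pulls lazily, keeping only one pending line per run.
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


data = [f"user-{i:03d}" for i in (5, 3, 9, 1, 7, 2, 8)]
sorted_data = list(external_sort(data, max_in_memory=3))
```

Memory use is bounded by the chunk size plus one buffered line per run, which is how a 10-20GB input can pass through a machine with far less RAM.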
        
           | trhway wrote:
           | > The intuition is that for datasets commonly and frequently
           | joined on a known key, e.g., user events with user metadata
           | on a user ID, we can write them in bucket files with records
           | bucketed and sorted by that key.
           | 
           | Index organized tables in Oracle, clustered tables in Mssql.
           | "Intuition" in modern big data world :)
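
The quoted idea, writing both datasets bucketed and sorted by the join key so the join becomes a merge of matching bucket files, can be sketched as follows. This is an in-memory illustration under stated assumptions, not Spotify's implementation: lists stand in for bucket files, CRC32 stands in for the real bucketing hash, and `write_buckets`, `merge_join`, and `smb_join` are hypothetical names.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def write_buckets(records, num_buckets):
    """Bucket (key, value) records by key hash and sort each bucket by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % num_buckets
        buckets[index].append((key, value))
    return [sorted(bucket, key=itemgetter(0)) for bucket in buckets]


def merge_join(left, right):
    """Inner-join two key-sorted lists in a single forward pass."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lg = next(left_groups, None)
    rg = next(right_groups, None)
    while lg is not None and rg is not None:
        if lg[0] < rg[0]:
            lg = next(left_groups, None)
        elif lg[0] > rg[0]:
            rg = next(right_groups, None)
        else:
            # Same key on both sides: emit every left/right combination.
            right_values = [value for _, value in rg[1]]
            for _, left_value in lg[1]:
                for right_value in right_values:
                    yield (lg[0], left_value, right_value)
            lg = next(left_groups, None)
            rg = next(right_groups, None)


def smb_join(left_buckets, right_buckets):
    """Join bucket i of one dataset with bucket i of the other; no shuffle."""
    for left, right in zip(left_buckets, right_buckets):
        yield from merge_join(left, right)


events = [("u2", "play"), ("u1", "skip"), ("u2", "pause"), ("u3", "play")]
users = [("u1", "SE"), ("u2", "US"), ("u4", "DE")]
joined = sorted(smb_join(write_buckets(events, 4), write_buckets(users, 4)))
```

The sorting cost is paid once at write time. Every later join on the same key only reads matching buckets sequentially, which is the same trade-off a clustered or index-organized table makes.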
        
           | jpitz wrote:
           | It shouldn't be novel. I dive into this topic when I
           | interview data engineers.
        
           | mandis wrote:
           | >It reminds me of something in Eric Raymond's "The Art of Unix
           | Programming" (I don't have time to find the link right now)
           | where it discussed an approach from the earlier days of Linux
           | filesystems where you had a limit on the number of inodes
           | that could exist in a single directory and corresponding
           | performance. The workaround was to create a subdirectory
           | structure to store files based on the filename. But then you
           | tended to get many files starting with the same characters
           | all in the same directories. What turned out to be a better
           | way to distribute the files evenly in the directory structure
           | was to take the first and _last_ character of the file name
           | and use those to create the subdirectories. This way you were
           | more likely to spread the files evenly across the structure.
           | 
           | Interesting. I have been pondering filesystem performance
           | and inode limits in servers/home servers for a long time.
           | This seems like useful information.
        
             | harryf wrote:
             | I think the bit I was remembering was the Terminfo Case
             | Study on page 149 of the Art of Unix Programming -
             | https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62...
             | - read it a long time ago though and re-reading now, it's not
             | exactly what I was remembering but that's memory for you...
        
         | jlouis wrote:
         | The article isn't easy to read unless you have some knowledge
         | up front about the used technologies and what the problem is.
         | 
         | This definitely drives people to comment on other things.
         | 
         | My gut feeling screams they made a problem themselves in the
         | first place which they then "solved". Similar to a "solution
         | running around looking for a problem" type of deal.
        
         | chishaku wrote:
         | You discover that people have different thoughts than you and
         | decided your gripe about this is therefore relevant?
         | 
         | Upvote, downvote or move on.
        
           | matsemann wrote:
           | I agree that my post is just additional noise. But it's noise
           | not disturbing a good signal.
           | 
           | Let me just point out that I think discussing sides of what's
           | linked can be interesting and relevant. In this instance I
           | think discussing privacy around the amount of data Spotify
           | stores is a relevant subdiscussion worth exploring. But
           | complaints about their UI don't feel very relevant. Etc.
           | 
           | Now that's only my opinion. What made me write my comment was
           | that there was literally no one discussing the concepts of the
           | article in the first 15 or so comments. Which I found a bit
           | disappointing as I thought the tech is interesting.
        
             | chishaku wrote:
             | > But it's noise not disturbing a good signal.
             | 
             | The signal will always be weak early on.
             | 
             | Look at the comments now and it's clear that the
             | upvote/downvote mechanisms are working sufficiently to
             | address your concern.
             | 
             | This should not be surprising. It's typically faster to not
             | read the article and spew superficial off-topic comments
             | than to write on-topic, substantive, and technical
             | comments.
        
               | matsemann wrote:
               | Good point, although I felt it was more than normal in
               | this case. Having my meta comment on top is also no good,
               | hopefully it can be demoted somehow. Will try to flag
               | this thread.
        
       | iamacyborg wrote:
       | And they're still less useful than the data last.fm makes
       | available to you.
        
         | dewey wrote:
         | Useful how? What actionable insights do you get from Last.fm?
         | It's vanity metrics optimized for sharing on social networks
         | and there's nothing wrong with it.
        
           | iamacyborg wrote:
           | Well, for one it gives me the yearly summary once the year is
           | actually over.
           | 
           | It's not particularly actionable to know you listen to more
           | music on a Saturday than a Sunday, but it is mildly
           | interesting for those who are curious about such things.
        
         | sbarre wrote:
         | Wrapped is a marketing product, not a data product.
         | 
         | Spotify itself is not a data product.
         | 
         | The first rule of avoiding disappointment is managing your own
         | expectations to align with reality.
         | 
         | That said, you can export more granular data Spotify has on you
         | from your settings page[0] if you want to do your own deeper
         | analysis on your trends and usage, etc..
         | 
         | 0: https://www.spotify.com/account/privacy/
        
       | paulsutter wrote:
       | Largest data flow job ever? I'm sure Google would beg to differ.
       | At Quantcast we process 50PB every day, and that's nothing
       | compared to real scale like Google.
       | 
       | And merge joins from sorted data? Joins have been done that way
       | since the punched card days on mainframes (and by any scaled data
       | system)
        
         | enigmo wrote:
         | Misleading headline, the first sentence is "how Spotify
         | optimized and sped up elements from *our largest Dataflow
         | job*". Surely it's not the largest ever run, even on Dataflow.
        
         | sbarre wrote:
         | Surely you read the article before posting, right?
         | 
         | From literally the first sentence:
         | 
         | > from our largest Dataflow job
        
           | ascar wrote:
           | Surely you have read the guidelines before posting, right?
           | 
           | > Please don't comment on whether someone read an article.
           | "Did you even read the article? It mentions that" can be
           | shortened to "The article mentions that."
           | 
           | https://news.ycombinator.com/newsguidelines.html
           | 
           | The hackernews title _and_ the article title say "the".
           | Criticizing this clickbait is more than warranted.
        
             | sbarre wrote:
             | Fair point, I could have worded it better.
             | 
             | But I think it's a reach to call this "clickbait".
             | 
             | Article titles are shortened all the time, and you can't
             | expect them to have all the context in the title.
             | 
             | However, one should reasonably expect people participating
             | in a discussion about an article (particularly when posting
             | criticism) to have actually read it.
             | 
             | Complaining about something that is provably false in the
             | first sentence of the article is the bigger sin here, is it
             | not?
        
         | davweb wrote:
         | It's the largest Dataflow[1] job ever, with a capital D, not "data
         | flow".
         | 
         | [1]: https://cloud.google.com/dataflow
        
       | gabagool wrote:
       | Regarding the title, how do we know this is THE largest dataflow
       | job? The article body itself only makes mention that this is
       | THEIR largest dataflow job. The post doesn't make any
       | quantifiable claims one could use to support it either; this is
       | all I found:
       | 
       | > "We estimate around a 50% decrease in Dataflow costs this year
       | compared to previous years' Bigtable-based approach.
       | Additionally, we avoided scaling the Bigtable cluster up two to
       | three times its normal capacity (up to around 1,500 nodes at
       | peak"
       | 
       | The official Spotify Engineering Tweet similarly only makes
       | mention that this is Spotify's largest dataflow job ever:
       | https://twitter.com/SpotifyEng/status/1359887825047613442.
       | 
       | I'm fairly sure a similar accidental unsourced exaggeration was
       | made last year.
       | 
       | Maybe the title should be Spotify Optimized Their Largest
       | Dataflow Job Ever For Wrapped 2020?
        
       | fooblat wrote:
       | When this report came out it was the straw that broke the camel's
       | back for me in terms of my data privacy. Most people seem to have
       | found Wrapped 2020 entertaining but I found it creepy.
       | 
       | I miss being able to do something simple like listen to music or
       | watch a movie without all my actions being recorded and saved. So
       | I'm back to buying physical media and DRM free downloads.
       | 
       | I'm convinced that it is now important to hold on to older
       | appliances that work without internet access or data collection.
       | This, plus right to repair, gives me hope for the future.
        
         | superbcarrot wrote:
         | I haven't used Spotify in over a year so I don't know about
         | this. What in Wrapped 2020 was so different from previous years
         | that made it creepy for you?
        
         | josteink wrote:
         | > I'm convinced that it is now important to hold on to older
         | appliances that work without internet access or data collection
         | 
         | Or run modern, up to date FOSS equivalents on machines you
         | control.
         | 
         | I've migrated more and more services like that and I'm slowly
         | but surely building my own "cloud".
        
           | npteljes wrote:
           | I understand why people don't constantly think about their
           | digital footprint, but it's not that different from when they
           | order the same thing at a stall for a month, and the clerk
           | begins to ask "Do you want your usual?".
        
             | spotyoufi462881 wrote:
             | But the digital footprint is infinitely copyable, permanent
             | etc
        
         | diggan wrote:
         | Seems strange that "Wrapped 2020" was the straw for you. Ever
         | since launch, the top reasons for starting to use Spotify have
         | been availability (across devices) and the provided
         | radio/discover playlists that automatically find music based
         | on your previous history. This has always been a core
         | proposition of Spotify, and not something that happened now.
         | 
         | Then it's not always good, right or even close sometimes, but
         | it's not like it's a hidden feature.
        
           | fooblat wrote:
           | I joined spotify in 2011 and the core offer was simply access
           | to all music across your devices. However, that's not really
           | relevant.
           | 
           | 2020 was the year I really started to question why I was
           | taking such care with my data in some ways but not others.
           | The Wrapped 2020 was a bright reminder to me that if I want
           | to take my privacy seriously I need to look at everything in
           | my life that collects data. Simple as that.
           | 
           | I don't think spotify is wrong or evil to collect the data. I
           | actually think it is a great product. I have just decided
           | that I want to leave as little data around about my daily
           | activities as possible.
           | 
           | As it happens I don't use the Discovery features very much
           | and it turns out there are still some enjoyable FM stations
           | where I live. When I want background music I turn on the
           | radio. When I want something more specific I play an album or
           | playlist from my collection.
        
           | Nullabillity wrote:
           | Discover Weekly (the first algorithmic playlist) launched in
           | 2015, Spotify itself launched in 2008. It did have radio from
           | the start, but that was (at least sold as) a fairly generic
           | genre-and-decade-based affair, rather than something
           | personalized.
        
             | 88 wrote:
             | Spotify acquired the vast majority of their current users
             | since 2015.
        
             | diggan wrote:
             | I do know the history of Spotify, became a user first in
             | 2009 (back when it was invite only) as my family got a
             | subscription together with our broadband connection with
             | Bredbandsbolaget in Sweden. The radio was always a feature
             | they advertised in order to find similar music to the one
             | you like, although you're correct that Discover Weekly et
             | al wasn't available until later. I guess you could argue
             | that because the radio was initially just for artist pages
             | then playlists, it was kind of manual and not automatic
             | like today's playlists that they generate.
             | 
             | Although it is strange to leave Spotify in 2020 citing
             | concerns about Spotify using your data, as it has been
             | going on since at least 2015, possibly earlier.
        
         | 88 wrote:
         | What's your concern, that your listening history will be linked
         | to your personal identity and somehow used against you?
         | 
         | If that's the case, why not just sign up using a single-use
         | email address, pay via gift cards purchased with cash, and if
         | you're really concerned, use a VPN?
        
           | spotyoufi462881 wrote:
           | This is the relentless back-and-forth that made someone
           | realise how much Spotify loves him.
           | 
           | https://twitter.com/steipete/status/1025024813889478656
        
             | gcbirzan wrote:
             | Ironically, I've been trying to get that data and cannot. I
             | got a 400KiB zip file with like 10 JSONs inside, only
             | relevant ones being listening history (ts and ms played per
             | song). Support suggested there might be more data, they
             | even re-did the export, but nothing.
        
             | 88 wrote:
             | Why create a throwaway account just to post this?
             | 
             | I don't think it's a surprise to anyone that Spotify
             | collects telemetry data on its users.
             | 
             | The main reason I (and many others) use Spotify is that
             | they use this data to recommend new music that I will
             | like.
        
               | spotyoufi462881 wrote:
               | Creating throwaways to post is a feature of HN. I didn't
               | abuse it. I don't have a HN account.
               | 
               | It's not meant as slander. It's just to show others 'here
               | is something that's true that you might not have thought
               | of or have thought of but haven't thought of it in all
               | its 250 MB glory. Decide for yourself'.
               | 
               | Folks who wanna try something different might like 'Radio
               | Paradise'. It's human curated music. It has wide platform
               | support [1] but also regular stream URLs [2].
               | 
               | [1] https://radioparadise.com/listen/options
               | [2] https://radioparadise.com/listen/stream-links
        
           | Nextgrid wrote:
           | Because your music taste is essentially a unique fingerprint
           | that could be used to track you in the future even across
           | accounts or platforms, and the only way to escape would be to
           | literally give up on all your favorite songs and listen to
           | something totally unrelated.
        
             | 88 wrote:
             | So what? Your voice is a unique fingerprint. Your writing
             | style is a unique fingerprint. Your gait is a unique
             | fingerprint. Who you are friends with is a unique
             | fingerprint.
             | 
             | Why should I be worried about a company knowing what music
             | I listen to?
        
           | avh02 wrote:
           | Because going through all that just to listen to music is
           | bonkers
        
             | marliechiller wrote:
             | but also far less effort and money than buying physical
             | copies of music
        
       | BorisTheBrave wrote:
       | I'm having a hard time understanding this article. It seems to be
       | a bit too low level on the specifics of Beam for general
       | consumption.
       | 
       | From what I understand, Spark has the same feature built in. If
       | the planner knows that the source data is partitioned and/or
       | sorted appropriately, it can skip shuffling/sorting it, instead
       | having each executor directly request the one file it needs.
       | 
       | It's a nice optimization, but it's not game changing. You often
       | end up having to shuffle anyway, as you are joining on a
       | different key, or for performance reasons you need more executors
       | than the set amount of partitions, or the shuffle needed to write
       | the data doesn't justify the savings on the readers.
       | 
       | Maybe it's better with their additional optimizations? Spark does
       | not do those, mostly.
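
The optimization discussed above (joining pre-partitioned, pre-sorted data without a shuffle) can be sketched roughly as follows. This is a minimal in-memory illustration of a sort-merge bucket join, not Spotify's, Beam's, or Spark's implementation; the function names (`bucket_id`, `write_sorted_buckets`, `merge_join`, `smb_join`) and the bucket count are hypothetical, and real systems store each bucket as a file on distributed storage.

```python
from zlib import crc32


def bucket_id(key, num_buckets):
    # Stable hash, so both datasets assign the same key to the same bucket.
    return crc32(str(key).encode("utf-8")) % num_buckets


def write_sorted_buckets(records, num_buckets):
    """Partition (key, value) records into buckets, each sorted by key.

    This is the one-time cost paid when the data is written.
    """
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
    return buckets


def merge_join(left, right):
    """Inner-join two key-sorted lists of (key, value) in one linear pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Join bucket N of one dataset with bucket N of the other.

    No shuffle is needed: matching keys can only live in the same bucket.
    """
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both datasets must use the same bucket count")
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join(lb, rb))
    return result
```

Each pair of buckets can be joined by a separate worker that reads only its own two inputs, which is why readers can skip the shuffle. The trade-off raised in the comment still applies: the data has to be bucketed and sorted on the join key when it is written.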
        
         | enigmo wrote:
         | Hive also has had this optimization for as long as I can
         | remember. As others have noted it's not particularly new or
         | novel, it's just not part of the Beam SDK.
        
         | texasbigdata wrote:
         | 50% cost reduction though
        
       | blt wrote:
       | This article fails to make a clean problem statement for the
       | general audience. It jumps right into jargon and names from some
       | framework/library. It reads like it was an internal report from a
       | programmer to their team, and someone decided to make it public
       | with no changes.
        
       | staticelf wrote:
       | Spotify is a company that feels like they want to be a "big tech
       | company" when in reality they do not need to. All they need to do
       | is provide a great service with as much music as possible.
        
         | sbarre wrote:
         | At some point they need to try to differentiate themselves from
         | Amazon or Apple, who could easily out-spend them on licensing
         | music catalogues.
         | 
         | I think Spotify's "moat" is largely the analysis they've done
         | on all the listening data they have, and their ability to
         | provide that to the music industry as a product.
         | 
         | They share rudimentary summaries with customers via stuff like
         | Wrapped but I have to imagine they have much more detailed and
         | robust data products for the industry...
         | 
         | If they're just a giant Dropbox for MP3s with a music player
         | sitting on top, they don't really stand a chance...
        
         | matthewmacleod wrote:
         | That's true - it's kind of like how Google keep pretending to
         | be a big tech company when all they really need to do is
         | provide a great service with as many search results as
         | possible.
        
         | onion2k wrote:
         | _All they need to do is provide a great service with as much
         | music as possible._
         | 
         | I wonder if adding 'as much music as possible' would drive any
         | growth without the AI music discovery stuff. There has to be a
         | diminishing return to adding new songs - if Spotify adds an
         | artist that only a few thousand people have heard of then
         | that's only going to attract a few thousand new customers _at
         | most_. No one else is going to listen to that artist unless
         | Spotify recommends the songs to people who might like them.
        
         | danielscrubs wrote:
         | Wouldn't surprise me if they shuffle more data than Twitter. Is
         | it because it's not a Silicon Valley darling that it gets all
         | the snide remarks?
        
         | throwaway3699 wrote:
         | They clearly value the free+recommendations+ads stack to drive
         | their business more than adding basic features.
        
         | horseRad wrote:
         | I for one love discover weekly and hope they will continue
         | working on improving it. I consider it a core feature :)
        
           | Hacman123 wrote:
           | Agree 100%. I always find great songs there. A cool feature
           | to discover new music. But I guess we are not that
           | sophisticated. Predicting which songs I'm gonna like is
           | probably not that difficult.
        
           | staticelf wrote:
           | Sure, that is a great feature but that doesn't really need
           | that much data besides listen data that they need to save
           | anyway (I am assuming).
        
         | bjohnson225 wrote:
         | I'm not sure I get the point in relation to the article. They
         | have a huge amount of data, so they need to handle it in a "big
         | tech" way & it's not like they're doing this stuff just because
         | they can.
         | 
         | The Spotify end of year wrap is amazing marketing for them that
         | people absolutely love and share widely.
        
       | mouldysammich wrote:
       | listenbrainz has gotten pretty good in the last while for keeping
       | track of your music stats. It's got a nice weekly stats page etc.
       | It's also not ad-laden like last.fm
        
       | a254613e wrote:
       | Perhaps a bit off-topic. But a lot of users (myself included)
       | reported wildly inaccurate data in the spotify wrapped this year
       | with seemingly no explanation, no shared accounts, no re-used
       | passwords, no weird listening history, etc.
       | 
       | I wonder if some of the data in the "We worked with the
       | maintainer of these data sets to convert a year's worth of data
       | to SMB format." step got corrupted or just wrongly
       | converted/lost.
       | 
       | I'm not sure how else to explain that I have to google artists in my
       | top 10 because I never heard of them.
        
       | soheilpro wrote:
       | Sorry if it's off-topic, but if anyone's interested, I'm
       | launching volt.fm next week.
       | 
       | It connects to your Spotify account and generates a nice public
       | page with your stats (top artists, top tracks), playlists,
       | etc.
       | 
       | You can reserve your username now: https://volt.fm
        
         | pohy wrote:
         | Hi, how is this different/better than last.fm?
        
           | soheilpro wrote:
           | It helps you promote your playlists and discover new ones.
        
             | saberience wrote:
             | Why would I want to promote my playlists?
        
       | person_of_color wrote:
       | Could be done with a bash script...
        
         | rovr138 wrote:
         | And paper and pencil
         | 
         | What's your point?
        
       | sailfast wrote:
       | For those that would like to dig more into what SMB is - here's
       | the link to the paper from the article:
       | http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...
        
       | henron wrote:
       | This technique of using distributed storage for large joins
       | instead of shuffling between compute nodes also helps make your
       | job robust to spot instance kills. Until disaggregated shuffle
       | services are widely adopted, it can be really handy.
        
       | polskibus wrote:
       | I wonder if they could publish dollar cost of that job before and
       | after the optimization, as provided by GCP billing. I know it
       | could be a bit unfair (some costs may be static, regardless of
       | job size, etc.) but it would improve decision making for others
       | if discussions of public cloud usage optimizations also include
       | the cost.
        
         | sbarre wrote:
         | They do refer to savings in percentages.. I feel like giving
         | away actual dollar costs would potentially break contractual
         | agreements (because I doubt Spotify pays public list price),
         | and potentially give away competitive information about how
         | much data they have on their customers etc..
         | 
         | I agree with you that it would be interesting to know, I just
         | don't think it's realistic for them to release that
         | information.
        
       | [deleted]
        
       | shoulderfake wrote:
       | Whatever they're doing with data means nothing when their client
       | apps are absolute dogshit.
        
       | saberience wrote:
       | One thing I can never understand about Spotify is that despite
       | its insane budget and huge number of employees/talent, they still
       | can't create better personalized playlists than either Pandora OR
       | last.fm.
       | 
       | To this day when I want a recommended playlist based on my
       | taste/history, I always use last.fm because it's just plain
       | better. Why? The "Discover" etc playlists on Spotify are just
       | crap.
        
         | H8crilA wrote:
         | Remember the Netflix Challenge fiasco? It's pretty clear that
         | this problem is either ridiculously hard, or it is
         | fundamentally "unsolvable" (because, perhaps, it depends on
         | highly variable mood).
        
           | [deleted]
        
         | feintruled wrote:
         | Really? I find my 'Discover Weekly' playlist to be astounding.
         | I have to confess for a long time I thought it was a human
         | curated playlist from someone who just happened to have
         | _exactly_ the same tastes as me.
        
           | adamhp wrote:
           | I have to second this. I can't list the number of artists I
           | have discovered over the last few years using Discover Weekly
           | alone. It's really incredible.
        
             | iamacyborg wrote:
             | It's been the opposite for me. It regularly surfaces songs
             | I've listened to in the past from artists I listen to
             | frequently. As a music discovery algo Spotify has been
             | nothing but poor in my opinion.
        
       | philip1209 wrote:
       | I'd be curious how this compares in load to Google's internal
       | applications. I'm also curious what share of Google's
       | infrastructure capacity goes to Google vs. GCE - has combined GCE usage
       | even passed the compute needs of Google internally yet?
        
         | jeffbee wrote:
         | The only even remotely concrete information in this post is
         | that their input was 1PB, and they typically have 500 bigtable
         | tablet servers. In 2008, Google said they processed 20PB per
         | day through mapreduce jobs. For the last ten years the only
         | thing they've said about the size of their public web index is
         | that it is over 100PB.
        
       ___________________________________________________________________
       (page generated 2021-02-12 23:01 UTC)