[HN Gopher] Spotify Optimized the Largest Dataflow Job Ever for ...
___________________________________________________________________
Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020
Author : SirOibaf
Score : 187 points
Date : 2021-02-12 09:44 UTC (13 hours ago)
(HTM) web link (engineering.atspotify.com)
(TXT) w3m dump (engineering.atspotify.com)
| shaicoleman wrote:
| My off-topic rant: I really wish Spotify would focus on
| improving the core player experience. It has barely seen any
| improvements in years.
|
| * Not overwrite/delete my listening history every time I switch
| devices
|
| * Allow tabs, or some way to resume what I've been listening to
| in different contexts
|
| * Option to open only one instance, instead of having multiple
| instances that mess with each other
|
| * Playing local files crashes/not working on Linux
|
| * Change playback speed, not just for podcasts
|
| * Jump back/forward, not just for podcasts
|
| * Have some visibility when the song was last played / play count
|
| * Liked songs not always appearing in search results
|
| * Sorting search results not working
|
| * Add basic functionality to the dbus interface (e.g. seeking)
|
| * Ability to report songs (e.g. wrong titles/badly split
| tracks/etc.)
| bootlooped wrote:
| Almost all of these strike me as only benefiting a very small
| sliver of users, like well under 1%. An infinitesimal portion
| of Spotify listeners even know what dbus is. How much engineer
| time is it worth to improve something like that?
| [deleted]
| krakmh wrote:
| You are absolutely right. Even more, people are using it for
| free using mods like these
| https://bestforandroid.com/apk/spotify-premium-mod-apk/
| yaqoob wrote:
| Yup, well said.
| TheRealSteel wrote:
| I have been begging them for years to add a 'resume playlist'
| feature but they won't do it.
| spotyoufi462881 wrote:
| Anyone wanna know how much Spotify wants to know about you?
|
| https://twitter.com/steipete/status/1025024813889478656
| Jonnax wrote:
| What's scary about this?
|
| They're complying with GDPR. Isn't that a good thing?
|
| The "scary" thing in that tweet is that they store the
| manufacturer of their bluetooth headphones?
| spotyoufi462881 wrote:
| That's not what one could call complying. He had to follow
| them like a dog for a long while just to get his rightful
| data.
| Jonnax wrote:
| That was from 2018. Almost 3 years ago.
| yaqoob wrote:
| Woo. That's incredible. Seriously!
| npteljes wrote:
| I'm already thinking that every service hoovers as much up as
| they can. Nice to see proof that it is actually happening!
| float4 wrote:
| I wonder whether they use the bluetooth device logging purely
| to develop their social graph. A person streaming music to
| their own bluetooth headphone using a friend's computer and
| Spotify account can be detected this way. I can't really come
| up with another purpose for it.
|
| Other than that, I'm not surprised by what they log. Virtually
| every company stores search queries, oauth grants, play
| history, ad interactions etc. Doesn't make it right of course.
| max_streese wrote:
| Hi not sure if I am just completely off here but I am wondering
| how this relates or compares to processing things with Kafka and
| Kafka Streams?
|
| If I am reading things correctly with Kafka the workflow
| equivalent to what's written in the article would be to have your
| producer produce via hash-based-round-robin (the default
| partitioning algorithm) based on the key you are interested in
| into some topic and then your consumer would just read it and
| your data would already be sorted for the given keys (because
| within a partition Kafka has sorting guarantees) and also be co-
| partitioned correctly if you need to read some other topic in
| with the same number of partitions and the same logical keys
| produced via the same algorithm. No?
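|
| A minimal sketch of the co-partitioning idea described above,
| assuming two hypothetical topics with the same partition count and
| the same key hash. crc32 is only a stand-in here (Kafka's default
| partitioner uses a murmur2 hash of the key bytes), and the topic
| and record names are invented for illustration.
|

```python
import zlib

NUM_PARTITIONS = 4  # assumption: both hypothetical topics share this count


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash. Kafka's default partitioner uses murmur2 on the key bytes;
    # crc32 is used here only because it ships with the standard library.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(records, num_partitions):
    """Append (key, value) records to per-partition lists, in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def partitions_holding(topic, key):
    """Return the set of partition indexes that contain the given key."""
    return {i for i, part in enumerate(topic) if any(k == key for k, _ in part)}


events = [("user-2", "play"), ("user-1", "skip"), ("user-2", "pause")]
metadata = [("user-1", "SE"), ("user-2", "US")]

event_topic = produce(events, NUM_PARTITIONS)
metadata_topic = produce(metadata, NUM_PARTITIONS)

for key in ("user-1", "user-2"):
    # Same key, same hash, same partition count: same partition index in both topics.
    print(key, partitions_holding(event_topic, key), partitions_holding(metadata_topic, key))
```

| Within a partition the records for a key stay in the order they
| were appended, which is arrival order rather than key-sorted order.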
| jsjsbdkj wrote:
| This is the most basic pattern for distributed joins - you hash
| on the join key in both tables and shuffle data based on hash
| ranges. In some systems like Redshift you can designate the key
| for distribution so that "related" records are already co-
| located on a single shard.
|
| > your data would already be sorted for the given keys (because
| within a partition Kafka has sorting guarantees)
|
| It's been a while since I used Kafka but I don't remember
| "sorting guarantees". Consumers see events "in order" based on
| when they were produced, because each partition is a queue.
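|
| The hash-and-shuffle join pattern mentioned above can be sketched
| roughly like this. It is a toy, single-process illustration: the
| shard count, the crc32 hash and the table contents are all
| assumptions, and it assigns shards by hash modulo rather than by
| hash ranges.
|

```python
import zlib
from collections import defaultdict


def shard_of(key: str, num_shards: int) -> int:
    # Any deterministic hash works as long as both tables use the same one.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shuffle(rows, num_shards):
    """Route each (key, value) row to the shard chosen by the hash of its key."""
    shards = [[] for _ in range(num_shards)]
    for key, value in rows:
        shards[shard_of(key, num_shards)].append((key, value))
    return shards


def local_hash_join(left_rows, right_rows):
    """Join rows that already live on the same shard using an in-memory index."""
    index = defaultdict(list)
    for key, value in right_rows:
        index[key].append(value)
    return [
        (key, lval, rval)
        for key, lval in left_rows
        for rval in index.get(key, [])
    ]


def distributed_join(left, right, num_shards=3):
    """Shuffle both sides on the join key, then join shard by shard."""
    out = []
    for lshard, rshard in zip(shuffle(left, num_shards), shuffle(right, num_shards)):
        out.extend(local_hash_join(lshard, rshard))
    return out


plays = [("u1", "songA"), ("u2", "songB"), ("u1", "songC")]
users = [("u1", "SE"), ("u2", "US")]
print(sorted(distributed_join(plays, users)))
```

| If a table is already stored distributed on the join key, the
| shuffle step for that side is the part that can be skipped.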
| max_streese wrote:
| Yes I guess my point is when using Kafka in combination with
| Kafka Streams and you produce things partitioned in a way
| that you need them for consumption then you do not need to do
| any shuffling in the instance where you want to join because
| data is already partitioned correctly.
| dvdhnt wrote:
| You seem to know what you're talking about. Any
| recommendations on learning resources for this type of
| flow? Or really understanding which platform works best in
| each situation?
|
| I'm learning proper data flow in real time as I look to
| transition ETL of product data into Postgres to a more
| applicable system.
|
| Finding the right learning resources is difficult! Cheers.
| CyberRabbi wrote:
| "How <moderately large startup> did <some obscure tech thing>"
|
| <... low substance and poorly written synopsis of the obscure
| tech thing ...>
|
| "Enjoy work like this? Want to make a big impact as a lowly
| programmer who will never start their own company? Well at
| <moderately large startup>, lowly programmers who will never
| start their own companies have the freedom to make a big impact."
| rrdharan wrote:
| Spotify is no longer a startup.
| CyberRabbi wrote:
| Good point. That changes the substance of the post.
| gripfx wrote:
| Yet, I, as a user still cannot see my playcounts via app or API.
| marliechiller wrote:
| this is a massive gripe of mine. I remember being able to
| toggle so many stats in itunes back in the day which gave me
| such great insight into my listening habits and other cool
| tidbits. Spotify, for all its ease of listening has removed
| some of the magic there
| [deleted]
| utucuro wrote:
| That might be explained as guarding their information, yet I
| still have trouble believing that they are unable to
| distinguish two artists with the same name from each other.
| Each week, my Release Radar playlist is basically invaded by 0
| listener rappers who "feat" defunct 70s rock bands who do not
| manage their artist pages... Response from Spotify: please
| report them individually, from the desktop app.
| sbarre wrote:
| That's some interesting product hacking in a way.
|
| I've run into artist naming collisions enough to know it's a
| thing but I've never seen anyone (ab)use it intentionally..
|
| Neat, but still annoying I'm sure. =)
| thejoeflow wrote:
| this. They don't even do an exact string match of the artist
| name, for reasons I can't fathom. I could make a new
| artist account named "drake" (lowercase d) today and probably
| show up in a million Release Radar playlists by next week.
| asutekku wrote:
| This is one of the reasons why I still keep using lastfm. Stats
| in Spotify are next to useless and even then it would not track
| my local listening.
| montag wrote:
| You can, to some extent, using the play history API.
| https://output.jsbin.com/ribat
| greatthx wrote:
| Great! Now will they stop scanning my entire hard drive with
| their desktop app? Also stop opening sockets directly to
| advertiser IP address. And stop paying off data thieves instead
| of disclosing to their users that their passwords were leaked.
| And to stop being sellouts too!
| jeofken wrote:
| Do you have links to further readings about this?
| meibo wrote:
| Wrapped works so well because it panders to you. Everyone likes
| to be acknowledged for listening to the weird indie band they
| discovered earlier this year.
|
| I enjoy it anyways, and Spotify is still a great service for now
| - I wonder if it'll meet the same fate as Netflix at some point,
| with publishing houses going for their own streaming services
| instead.
| pradn wrote:
| If anything the top-5 format doesn't capture the long tail you
| might be listening to.
| onion2k wrote:
| _I wonder if it'll meet the same fate as Netflix at some
| point, with publishing houses going for their own streaming
| services instead._
|
| I don't think so because the way that people consume music and
| the way they consume films and television are _very_ different.
| With a film you might block out a few hours to watch that
| specific provider. With music you're more likely to want to
| interleave content from several providers at the same time (eg
| a playlist). Unless all the providers are available on the same
| platform it wouldn't work well.
| Waterluvian wrote:
| Speaking of interleaving, I wish Spotify understood how to
| interact with Concept Albums. Changing to the middle of The
| Wall for one song is jarring and I usually skip it.
|
| And then I worry skipping it is going to train the algorithm
| that I don't like Pink Floyd.
| adhoc_slime wrote:
| What does this mean? I don't use spotify. Are you saying
| when you go to listen to an album it will put a random song
| into the queue?
| meibo wrote:
| Only if you listen on shuffle (like every music app I
| guess) or in their AI lists. Would be nice if the AI ones
| would respect that for sure, was wondering why that's not
| a thing since like 2016.
| Waterluvian wrote:
| I'll be listening to a theme or genre of music and it
| adds songs that don't really work outside the context of
| their album.
| umanwizard wrote:
| Spotify allows both: listening to random songs (either
| based on a particular genre, seeded from an existing
| playlist, or generated from the user's listening habits),
| or choosing specific songs or albums to listen to.
|
| They probably meant listening in random mode, and Spotify
| randomly choosing a track that doesn't make much sense
| outside the context of its album.
| 2038AD wrote:
| I never understood why for albums like this they even
| bother splitting up the tracks.
| Allower wrote:
| No.. that's how I watch TV and movies too. What person do you
| know says "I only want to watch Paramount shit!"?
| hnlmorg wrote:
| A considerable amount of music is distributed by a small
| subset of providers. So in that regard it's not that
| different to the TV / film situation and you could
| theoretically still interleave different artists.
|
| There's also a common use case where people will just play a
| specific artist for an hour. Or even an album.
|
| Frankly, I hope services like Spotify don't disappear. It's a
| great loss to consumers just how fragmented video streaming
| services have become. I'd hate to see the same happen to
| music as well.
| alvarlagerlof wrote:
| I don't think that behavior is all too common. I see people
| listening to a very wide range of artists and almost never
| reach for stuff outside of it around here.
| hnlmorg wrote:
| I use music streaming services to play specific albums
| (as I tend to play older artists who created albums
| designed to be played as a whole entity rather than
| singles)
|
| My wife uses them to shuffle singles by specific artists
| (she's more into pop music).
|
| I'd wager if my wife and I both coincidentally follow the
| same pattern despite doing so for different reasons, that
| it's then likely a more common pattern than first
| assumed. Please also bear in mind that I'm not suggesting
| our use cases are how the majority of people consume
| music, but I'd be surprised if it was small enough to be
| a rounding error.
| merlinscholz wrote:
| [deleted]
| underwater wrote:
| Do you have anything more terse? A 30-minute, ad-filled
| video isn't exactly a light read.
| matsemann wrote:
| I came here expecting to read about the tech in the article or
| how others do big data processing stuff. Instead I get off topic
| Spotify rants.. Did you read the article or just see Spotify in
| the headline and decide your gripes therefore are relevant?
| johncena33 wrote:
| This has become a huge problem on HN lately. Lots of
| discussions are nothing but complaining. Now the technical
| discussions are starting to get infested with off-topic
| whining. The mods don't do anything about off-topic rants. If
| you point it out you'll get downvoted [1][2][3].
|
| [1] https://news.ycombinator.com/item?id=25839399
| [2] https://news.ycombinator.com/item?id=25064636
| [3] https://news.ycombinator.com/item?id=24699908
| adventured wrote:
| > The mods don't do anything about off-topic rants.
|
| Mod. There is a question of how much one moderator can do
| against the tide. HN really needs a couple of full time paid
| moderators, with their salaries covered by the zillion dollar
| YC bank account.
| [deleted]
| pedroaraujo wrote:
| Absolutely, there has been a dramatic shift in the type of
| people who visit HN in the past years.
|
| I used to think that Reddit was bad in this regard but to be
| honest it mostly affects the big subreddits, the niche and
| small ones still have a high quality community. HN became
| pretty much like the biggest subs on Reddit.
| chishaku wrote:
| Back in my day...
|
| This comment thread is in its own category of low quality
| discussion.
|
| Negativity bias prevents you from seeing that 95% of the
| homepage right now is technical/nerdy with a lot of high
| quality corresponding discussion.
|
| When political/social issues hit the homepage, they often
| slide off quickly if the corresponding discussion is of low
| quality (has many downvoted comments).
|
| HN is certainly not perfect but just focusing on the parts
| you don't like prevents you from seeing the bigger picture.
| harryf wrote:
| Agreed, so I'll bite...
|
| > The intuition is that for datasets commonly and frequently
| joined on a known key, e.g., user events with user metadata on
| a user ID, we can write them in bucket files with records
| bucketed and sorted by that key. By knowing which files contain
| a subset of keys and in what order, shuffle becomes a matter of
| merge-sorting values from matching bucket files, completely
| eliminating costly disk and network I/O of moving key-value
| pairs around.
|
| I'm actually surprised that this should be regarded as "novel"
| in data science.
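|
| The quoted idea (bucket files sorted by key, joined by merging
| matching buckets) can be sketched as a plain merge join. This is
| only an illustration of the technique, not Spotify's or Beam's
| implementation; the bucket contents are hypothetical and kept in
| memory instead of in files.
|

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lkey, lval, rval))
            i, j = i_end, j_end
    return out


# Hypothetical pre-written buckets: bucket b of each dataset holds the same
# subset of keys, and each bucket is sorted by key.
event_buckets = [
    [("u2", "play"), ("u2", "skip"), ("u4", "play")],
    [("u1", "play"), ("u3", "pause")],
]
metadata_buckets = [
    [("u2", "US"), ("u4", "DE")],
    [("u1", "SE"), ("u3", "BR")],
]

joined = []
for events, metadata in zip(event_buckets, metadata_buckets):
    # No shuffle: matching buckets are simply read side by side.
    joined.extend(merge_join(events, metadata))
print(joined)
```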
|
| It reminds me of something in Eric Raymond's "The Art of Unix
| Programming" (I don't have time to find the link right now)
| where it discussed an approach from the earlier days of Linux
| filesystems where you had a limit on the number of inodes that
| could exist in a single directory and corresponding
| performance. The workaround was to create a subdirectory
| structure to store files based on the filename. But then you
| tended to get many files starting with the same characters all
| in the same directories. What turned out to be a better way to
| distribute the files evenly in the directory structure was to
| take the first and _last_ character of the file name and use
| those to create the subdirectories. This way you were more
| likely to spread the files evenly across the structure.
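|
| A small sketch of the directory scheme as recalled above, comparing
| subdirectories built from the first two characters against ones
| built from the first and last character. The file names are made up
| and assumed to be at least two characters long.
|

```python
from collections import Counter


def shard_by_prefix(name: str) -> str:
    # Subdirectory chosen from the first two characters of the file name.
    return f"{name[0]}/{name[1]}"


def shard_by_first_and_last(name: str) -> str:
    # Subdirectory chosen from the first and the last character instead.
    return f"{name[0]}/{name[-1]}"


# Hypothetical file names sharing a common prefix: report0 .. report9
names = [f"report{i}" for i in range(10)]

prefix_counts = Counter(shard_by_prefix(n) for n in names)
first_last_counts = Counter(shard_by_first_and_last(n) for n in names)

print("by prefix:        ", dict(prefix_counts))
print("by first and last:", dict(first_last_counts))
```

| With a shared prefix, every file lands in the same prefix-based
| directory, while the first-and-last scheme spreads them out.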
| nerpderp82 wrote:
| I read it and I am like ... we already do this. This is
| common and obvious. Maybe I am missing something.
|
| Before worker nodes had as much memory as they have now,
| almost everything needed to use small buffers and spill to
| disk. BDB (Berkeley DB) was an extremely common tool for
| doing out of core data operations. Because the ETL tools I
| was writing needed to run on machines with 512MB of ram, it
| required out of core algorithms. We easily had jobs
| processing 10-20GB with only 512M of ram.
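|
| For anyone unfamiliar with the out-of-core approach, a rough sketch
| of an external sort: sort small chunks in memory, spill each to a
| temporary file, then stream a k-way merge over the spilled runs.
| The chunk size and the integer-per-line file format are assumptions
| made for this example; it does not use BDB.
|

```python
import heapq
import tempfile


def external_sort(values, chunk_size):
    """Yield ints in sorted order, buffering at most chunk_size of them at a time."""
    runs = []   # temporary files, each holding one sorted run
    chunk = []  # in-memory buffer for the current run

    def spill():
        # Sort the buffered chunk and write it out as one sorted run.
        chunk.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in chunk)
        f.seek(0)
        runs.append(f)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    try:
        # Each run is read back lazily; heapq.merge keeps one value per run in memory.
        streams = [(int(line) for line in f) for f in runs]
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()


print(list(external_sort([5, 3, 9, 1, 4, 8, 2], chunk_size=3)))
```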
|
| I am sure I am missing something, reading the paper now.
|
| http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...
| trhway wrote:
| > The intuition is that for datasets commonly and frequently
| joined on a known key, e.g., user events with user metadata
| on a user ID, we can write them in bucket files with records
| bucketed and sorted by that key.
|
| Index organized tables in Oracle, clustered tables in Mssql.
| "Intuition" in modern big data world :)
| jpitz wrote:
| It shouldn't be novel. I dive into this topic when I
| interview data engineers.
| mandis wrote:
| >It reminds me of something in Eric Raymond's "The Art of Unix
| Programming" (I don't have time to find the link right now)
| where it discussed an approach from the earlier days of Linux
| filesystems where you had a limit on the number of inodes
| that could exist in a single directory and corresponding
| performance. The workaround was to create a subdirectory
| structure to store files based on the filename. But then you
| tended to get many files starting with the same characters
| all in the same directories. What turned out to be a better
| way to distribute the files evenly in the directory structure
| was to take the first and _last_ character of the file name
| and use those to create the subdirectories. This way you were
| more likely to spread the files evenly across the structure.
|
| Interesting. I have been pondering filesystem
| performance and inode limits in servers/home-servers for a
| long time. This seems like useful information.
| harryf wrote:
| I think the bit I was remembering was the Terminfo Case
| Study on page 149 of the Art of Unix Programming -
| https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62... -
| read it a long time ago though and re-reading now, it's not
| exactly what I was remembering but that's memory for you...
| jlouis wrote:
| The article isn't easy to read unless you have some knowledge
| up front about the used technologies and what the problem is.
|
| This definitely drives people to comment on other things.
|
| My gut feeling screams they made a problem themselves in the
| first place which they then "solved". Similar to a "solution
| running around looking for a problem" type of deal.
| chishaku wrote:
| You discover that people have different thoughts than you and
| decide your gripe about this is therefore relevant?
|
| Upvote, downvote or move on.
| matsemann wrote:
| I agree that my post is just additional noise. But it's noise
| not disturbing a good signal.
|
| Let me just point out that I think discussing sides of what's
| linked can be interesting and relevant. In this instance I
| think discussing privacy around the amount of data Spotify
| stores is a relevant subdiscussion worth exploring. But
| complaints about their UI don't feel very relevant. Etc.
|
| Now that's only my opinion. What made me write my comment was
| that there was literally no one discussing the concepts of the
| article in the first 15 or so comments. Which I found a bit
| disappointing as I thought the tech is interesting.
| chishaku wrote:
| > But it's noise not disturbing a good signal.
|
| The signal will always be weak early on.
|
| Look at the comments now and it's clear that the
| upvote/downvote mechanisms are working sufficiently to
| address your concern.
|
| This should not be surprising. It's typically faster to not
| read the article and spew superficial off-topic comments
| than to write on-topic, substantive, and technical
| comments.
| matsemann wrote:
| Good point, although I felt it was more than normal in
| this case. Having my meta comment on top is also no good,
| hopefully it can be demoted somehow. Will try to flag
| this thread.
| iamacyborg wrote:
| And they're still less useful than the data last.fm makes
| available to you.
| dewey wrote:
| Useful how? What actionable insights do you get from Last.fm?
| It's vanity metrics optimized for sharing on social networks
| and there's nothing wrong with it.
| iamacyborg wrote:
| Well, for one it gives me the yearly summary once the year is
| actually over.
|
| It's not particularly actionable to know you listen to more
| music on a Saturday than a Sunday, but it is mildly
| interesting for those who are curious about such things.
| sbarre wrote:
| Wrapped is a marketing product, not a data product.
|
| Spotify itself is not a data product.
|
| The first rule of avoiding disappointment is managing your own
| expectations to align with reality.
|
| That said, you can export more granular data Spotify has on you
| from your settings page[0] if you want to do your own deeper
| analysis on your trends and usage, etc..
|
| 0: https://www.spotify.com/account/privacy/
| paulsutter wrote:
| Largest data flow job ever? I'm sure Google would beg to differ.
| At Quantcast we process 50PB every day, and that's nothing
| compared to real scale like Google.
|
| And merge joins from sorted data? Joins have been done that way
| since the punched card days on mainframes (and by any scaled data
| system)
| enigmo wrote:
| Misleading headline, the first sentence is "how Spotify
| optimized and sped up elements from *our largest Dataflow
| job*". Surely it's not the largest ever run, even on Dataflow.
| sbarre wrote:
| Surely you read the article before posting, right?
|
| From literally the first sentence:
|
| > from our largest Dataflow job
| ascar wrote:
| Surely you have read the guidelines before posting, right?
|
| > Please don't comment on whether someone read an article.
| "Did you even read the article? It mentions that" can be
| shortened to "The article mentions that."
|
| https://news.ycombinator.com/newsguidelines.html
|
| The hackernews title _and_ the article title say "the".
| Criticizing this clickbait is more than warranted.
| sbarre wrote:
| Fair point, I could have worded it better.
|
| But I think it's a reach to call this "clickbait".
|
| Article titles are shortened all the time, and you can't
| expect them to have all the context in the title.
|
| However, one should reasonably expect people participating
| in a discussion about an article (particularly when posting
| criticism) to have actually read it.
|
| Complaining about something that is provably false in the
| first sentence of the article is the bigger sin here, is it
| not?
| davweb wrote:
| It's the largest Dataflow[1] job ever, with a capital D, not "data
| flow".
|
| [1]: https://cloud.google.com/dataflow
| [deleted]
| gabagool wrote:
| Regarding the title, how do we know this is THE largest dataflow
| job? The article body itself only makes mention that this is
| THEIR largest dataflow job. The post doesn't make any
| quantifiable claims one could use to support it either; this is
| all I found:
|
| > "We estimate around a 50% decrease in Dataflow costs this year
| compared to previous years' Bigtable-based approach.
| Additionally, we avoided scaling the Bigtable cluster up two to
| three times its normal capacity (up to around 1,500 nodes at
| peak"
|
| The official Spotify Engineering Tweet similarly only makes
| mention that this is Spotify's largest dataflow job ever:
| https://twitter.com/SpotifyEng/status/1359887825047613442.
|
| I'm fairly sure a similar accidental unsourced exaggeration was
| made last year.
|
| Maybe the title should be Spotify Optimized Their Largest
| Dataflow Job Ever For Wrapped 2020?
| fooblat wrote:
| When this report came out it was the straw that broke the camel's
| back for me in terms of my data privacy. Most people seem to have
| found Wrapped 2020 entertaining but I found it creepy.
|
| I miss being able to do something simple like listen to music or
| watch a movie without all my actions being recorded and saved. So
| I'm back to buying physical media and DRM free downloads.
|
| I'm convinced that it is now important to hold on to older
| appliances that work without internet access or data collection.
| This, plus right to repair, gives me hope for the future.
| superbcarrot wrote:
| I haven't used Spotify in over a year so I don't know about
| this. What in Wrapped 2020 was so different from previous years
| that made it creepy for you?
| josteink wrote:
| > I'm convinced that it is now important to hold on to older
| appliances that work without internet access or data collection
|
| Or run modern, up to date FOSS equivalents on machines you
| control.
|
| I've migrated more and more services like that and I'm slowly
| but surely building my own "cloud".
| npteljes wrote:
| I understand why people don't constantly think about their
| digital footprint, but it's not that different from when they
| order the same thing at a stall for a month, and the clerk
| begins to ask "Do you want your usual?".
| spotyoufi462881 wrote:
| But the digital footprint is infinitely copyable, permanent
| etc
| diggan wrote:
| Seems strange that "Wrapped 2020" was the straw for you. Ever
| since launch, the top reasons for starting to use Spotify have
| been availability (across devices) and the provided
| radio/discover playlists that automatically find music based
| on your previous history. This has always been a core
| proposition of Spotify, and not something that happened now.
|
| Then it's not always good, right or even close sometimes, but
| it's not like it's a hidden feature.
| fooblat wrote:
| I joined spotify in 2011 and the core offer was simply access
| to all music across your devices. However, that's not really
| relevant.
|
| 2020 was the year I really started to question why I was
| taking such care with my data in some ways but not others.
| The Wrapped 2020 was a bright reminder to me that if I want
| to take my privacy seriously I need to look at everything in
| my life that collects data. Simple as that.
|
| I don't think spotify is wrong or evil to collect the data. I
| actually think it is a great product. I have just decided
| that I want to leave as little data around about my daily
| activities as possible.
|
| As it happens I don't use the Discovery features very much
| and it turns out there are still some enjoyable FM stations
| where I live. When I want background music I turn on the
| radio. When I want something more specific I play an album or
| playlist from my collection.
| Nullabillity wrote:
| Discover Weekly (the first algorithmic playlist) launched in
| 2015, Spotify itself launched in 2008. It did have radio from
| the start, but that was (at least sold as) a fairly generic
| genre-and-decade-based affair, rather than something
| personalized.
| 88 wrote:
| Spotify acquired the vast majority of their current users
| since 2015.
| diggan wrote:
| I do know the history of Spotify, became a user first in
| 2009 (back when it was invite only) as my family got a
| subscription together with our broadband connection with
| Bredbandsbolaget in Sweden. The radio was always a feature
| they advertised in order to find similar music to the one
| you like, although you're correct that Discover Weekly et
| al wasn't available until later. I guess you could argue
| that because the radio was initially just for artist pages
| then playlists, it was kind of manual and not automatic
| like today's playlists that they generate.
|
| Although it is strange to leave Spotify in 2020 citing
| concerns about Spotify using your data, as it has been
| going on from at least 2015, possibly earlier.
| 88 wrote:
| What's your concern, that your listening history will be linked
| to your personal identity and somehow used against you?
|
| If that's the case, why not just sign up using a single-use
| email address, pay via gift cards purchased with cash, and if
| you're really concerned, use a VPN?
| spotyoufi462881 wrote:
| Relentless back-and-forth is what it took for someone to realise
| how much Spotify loves him.
|
| https://twitter.com/steipete/status/1025024813889478656
| gcbirzan wrote:
| Ironically, I've been trying to get that data and cannot. I
| got a 400KiB zip file with like 10 JSONs inside, only
| relevant ones being listening history (ts and ms played per
| song). Support suggested there might be more data, they
| even re-did the export, but nothing.
| 88 wrote:
| Why create a throwaway account just to post this?
|
| I don't think it's a surprise to anyone that Spotify
| collects telemetry data on its users.
|
| The main reason I (and many others) use Spotify is
| because they use this data to recommend new music that I
| will like.
| spotyoufi462881 wrote:
| Creating throwaways to post is a feature of HN. I didn't
| abuse it. I don't have a HN account.
|
| It's not meant as slander. It's just to show others 'here
| is something that's true that you might not have thought
| of or have thought of but haven't thought of it in all
| its 250 MB glory. Decide for yourself'.
|
| Folks who wanna try something different might like 'Radio
| Paradise'. It's human curated music. It has wide platform
| support [1] but also regular stream URLs [2]
|
| [1] https://radioparadise.com/listen/options
| [2] https://radioparadise.com/listen/stream-links
| Nextgrid wrote:
| Because your music taste is essentially a unique fingerprint
| that could be used to track you in the future even across
| accounts or platforms, and the only way to escape would be to
| literally give up on all your favorite songs and listen to
| something totally unrelated.
| 88 wrote:
| So what? Your voice is a unique fingerprint. Your writing
| style is a unique fingerprint. Your gait is a unique
| fingerprint. Who you are friends with is a unique
| fingerprint.
|
| Why should I be worried about a company knowing what music
| I listen to?
| avh02 wrote:
| Because going through all that just to listen to music is
| bonkers
| marliechiller wrote:
| but also far less effort and money than buying physical
| copies of music
| BorisTheBrave wrote:
| I'm having a hard time understanding this article. It seems to be
| a bit too low level on the specifics of Beam for general
| consumption.
|
| From what I understand, Spark has the same feature built in. If
| the planner knows that the source data is partitioned and/or
| sorted appropriately, it can skip shuffling/sorting it, instead
| having each executor directly request the one file it needs.
|
| It's a nice optimization, but it's not game changing. You often
| end up having to shuffle anyway, as you are joining on a
| different key, or for performance reasons you need more executors
| than the set amount of partitions, or the shuffle needed to write
| the data doesn't justify the savings on the readers.
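|
| To make the trade-off above concrete, here is a toy decision
| function for when a join can skip the shuffle and the sort. It is
| not Spark's actual planner logic; the Layout class, its fields and
| the column names are hypothetical, and real engines apply more
| rules than this.
|

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of how a dataset is laid out on disk."""
    bucket_key: Optional[str]  # None means the data is not bucketed at all
    num_buckets: int
    sorted_by_key: bool


def plan_join(left: Layout, right: Layout, join_key: str) -> dict:
    """Decide whether a join on join_key needs a shuffle and/or a sort."""
    co_bucketed = (
        left.bucket_key == join_key
        and right.bucket_key == join_key
        and left.num_buckets == right.num_buckets
    )
    return {
        # Joining on a different key than the bucketing key forces a shuffle.
        "shuffle": not co_bucketed,
        # The sort can only be skipped if both sides are already sorted by the key.
        "sort": not (co_bucketed and left.sorted_by_key and right.sorted_by_key),
    }


events = Layout(bucket_key="user_id", num_buckets=64, sorted_by_key=True)
users = Layout(bucket_key="user_id", num_buckets=64, sorted_by_key=True)

print(plan_join(events, users, "user_id"))   # layout matches the join key
print(plan_join(events, users, "track_id"))  # join on a different key
```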
|
| Maybe it's better with their additional optimizations? Spark does
| not do those, mostly.
| enigmo wrote:
| Hive also has had this optimization for as long as I can
| remember. As others have noted it's not particularly new or
| novel, it's just not part of the Beam SDK.
| texasbigdata wrote:
| 50% cost reduction though
| blt wrote:
| This article fails to make a clean problem statement for the
| general audience. It jumps right into jargon and names from some
| framework/library. It reads like it was an internal report from a
| programmer to their team, and someone decided to make it public
| with no changes.
| staticelf wrote:
| Spotify is a company that feels like they want to be a "big tech
| company" when in reality they do not need to. All they need to do
| is provide a great service with as much music as possible.
| sbarre wrote:
| At some point they need to try to differentiate themselves from
| Amazon or Apple, who could easily out-spend them on licensing
| music catalogues.
|
| I think Spotify's "moat" is largely the analysis they've done
| on all the listening data they have, and their ability to
| provide that to the music industry as a product.
|
| They share rudimentary summaries with customers via stuff like
| Wrapped but I have to imagine they have much more detailed and
| robust data products for the industry...
|
| If they're just a giant Dropbox for MP3s with a music player
| sitting on top, they don't really stand a chance...
| matthewmacleod wrote:
| That's true - it's kind of like how Google keep pretending to
| be a big tech company when all they really need to do is
| provide a great service with as many search results as
| possible.
| onion2k wrote:
| _All they need to do is provide a great service with as much
| music as possible._
|
| I wonder if adding 'as much music as possible' would drive any
| growth without the AI music discovery stuff. There has to be a
| diminishing return to adding new songs - if Spotify adds an
| artist that only a few thousand people have heard of then
| that's only going to attract a few thousand new customers _at
| most_. No one else is going to listen to that artist unless
| Spotify recommends the songs to people who might like them.
| danielscrubs wrote:
| Wouldn't surprise me if they shuffle more data than Twitter. Is
| it because it's not a Silicon Valley darling that it gets all
| the snide remarks?
| throwaway3699 wrote:
| They clearly value the free+recommendations+ads stack to drive
| their business more than adding basic features.
| horseRad wrote:
| I for one love discover weekly and hope they will continue
| working on improving it. I consider it a core feature :)
| Hacman123 wrote:
| Agree 100%. I always find great songs there. A cool feature
| to discover new music. But I guess we are not that
| sophisticated. Predicting which songs I'm gonna like is
| probably not that difficult.
| staticelf wrote:
| Sure, that is a great feature but that doesn't really need
| that much data besides listen data that they need to save
| anyway (I am assuming).
| bjohnson225 wrote:
| I'm not sure I get the point in relation to the article. They
| have a huge amount of data, so they need to handle it in a "big
| tech" way & it's not like they're doing this stuff just because
| they can.
|
| The Spotify end of year wrap is amazing marketing for them that
| people absolutely love and share widely.
| mouldysammich wrote:
| listenbrainz has gotten pretty good in the last while for keeping
| track of your music stats. It's got a nice weekly stats page etc.
| It's also not ad laden like last.fm
| a254613e wrote:
| Perhaps a bit off-topic. But a lot of users (myself included)
| reported wildly inaccurate data in the Spotify Wrapped this year
| with seemingly no explanation, no shared accounts, no re-used
| passwords, no weird listening history, etc.
|
| I wonder if some of the data in the "We worked with the
| maintainer of these data sets to convert a year's worth of data
| to SMB format." step got corrupted or just wrongly
| converted/lost.
|
| I'm not sure how else to explain that I have to google artists in my
| top 10 because I never heard of them.
| soheilpro wrote:
| Sorry if it's off-topic, but if anyone's interested, I'm
| launching volt.fm next week.
|
| It connects to your Spotify account and generates a nice public
| page with your stats (top artists, top tracks), playlists,
| etc.
|
| You can reserve your username now: https://volt.fm
| pohy wrote:
| Hi, how is this different/better than last.fm?
| soheilpro wrote:
| It helps you promote your playlists and discover new ones.
| saberience wrote:
| Why would I want to promote my playlists?
| person_of_color wrote:
| Could be done with a bash script...
| rovr138 wrote:
| And paper and pencil
|
| What's your point?
| sailfast wrote:
| For those that would like to dig more into what SMB is - here's
| the link to the paper from the article:
| http://kth.diva-portal.org/smash/get/diva2:1334587/FULLTEXT0...
| henron wrote:
| This technique of using distributed storage for large joins
| instead of shuffling between compute nodes also helps make your
| job robust to spot instance kills. Until disaggregated shuffle
| services are widely adopted, it can be really handy.
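
For readers unfamiliar with the technique discussed here, the following is a minimal single-process sketch of a sort-merge bucket (SMB) join, not Spotify's or Beam's implementation. It assumes both datasets were written ahead of time into the same number of buckets using the same hash of the join key, with each bucket sorted by key. The names `NUM_BUCKETS`, `bucket_of`, `write_bucketed`, `merge_join` and `smb_join` are hypothetical, and in-memory lists stand in for files on distributed storage.

```python
import zlib
from collections import defaultdict
from operator import itemgetter

NUM_BUCKETS = 4  # assumption: both datasets agree on this number


def bucket_of(key: str) -> int:
    """Deterministic bucket assignment (crc32 chosen only for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def write_bucketed(records):
    """One-time write step: hash (key, value) pairs into buckets, sort each by key."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[bucket_of(key)].append((key, value))
    return {b: sorted(rows, key=itemgetter(0)) for b, rows in buckets.items()}


def merge_join(left, right):
    """Inner-join two key-sorted lists with a single forward pass over each."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    yield lk, lv, rv
            i, j = i_end, j_end


def smb_join(left_buckets, right_buckets):
    """Join bucket b of one dataset only with bucket b of the other; no shuffle."""
    out = []
    for b in range(NUM_BUCKETS):
        out.extend(merge_join(left_buckets.get(b, []), right_buckets.get(b, [])))
    return out


plays = [("u1", "songA"), ("u2", "songB"), ("u1", "songC"), ("u3", "songD")]
users = [("u1", "SE"), ("u2", "US"), ("u4", "DE")]
joined = sorted(smb_join(write_bucketed(plays), write_bucketed(users)))
```

Because each bucket pair is joined independently from data already at rest, a worker that is killed mid-join would, under these assumptions, only need to redo its own bucket rather than trigger a re-shuffle.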
| polskibus wrote:
| I wonder if they could publish dollar cost of that job before and
| after the optimization, as provided by GCP billing. I know it
| could be a bit unfair (some costs may be static, regardless of
| job size, etc.) but it would improve decision making for others
| if discussions of public cloud usage optimizations also include
| the cost.
| sbarre wrote:
| They do refer to savings in percentages.. I feel like giving
| away actual dollar costs would potentially break contractual
| agreements (because I doubt Spotify pays public list price),
| and potentially give away competitive information about how
| much data they have on their customers etc..
|
| I agree with you that it would be interesting to know, I just
| don't think it's realistic for them to release that
| information.
| [deleted]
| shoulderfake wrote:
| Whatever they're doing with data means nothing when their client
| apps are absolute dogshit.
| saberience wrote:
| One thing I can never understand about Spotify is that despite
| its insane budget, huge number of employees/talent, they still
| can't create better personalized playlists than either Pandora OR
| last.fm.
|
| To this day when I want a recommended playlist based on my
| taste/history, I always use last.fm because it's just plain
| better. Why? The "Discover" etc playlists on Spotify are just
| crap.
| H8crilA wrote:
| Remember the Netflix Challenge fiasco? It's pretty clear that
| this problem is either ridiculously hard, or it is
| fundamentally "unsolvable" (because, perhaps, it depends on
| highly variable mood).
| [deleted]
| feintruled wrote:
| Really? I find my 'Discover Weekly' playlist to be astounding.
| I have to confess for a long time I thought it was a human
| curated playlist from someone who just happened to have
| _exactly_ the same tastes as me.
| adamhp wrote:
| I have to second this. I can't list the number of artists I
| have discovered over the last few years using Discover Weekly
| alone. It's really incredible.
| iamacyborg wrote:
| It's been the opposite for me. It regularly surfaces songs
| I've listened to in the past from artists I listen to
| frequently. As a music discovery algo Spotify has been
| nothing but poor in my opinion.
| philip1209 wrote:
| I'd be curious how this compares in load to Google's internal
| applications. I'm also curious how much of the capacity of Google's
| infrastructure goes to Google vs. GCE - has combined GCE usage
| even passed the compute needs of Google internally yet?
| jeffbee wrote:
| The only even remotely concrete information in this post is
| that their input was 1PB, and they typically have 500 Bigtable
| tablet servers. In 2008, Google said they processed 20PB per
| day through mapreduce jobs. For the last ten years the only
| thing they've said about the size of their public web index is
| that it is over 100PB.
___________________________________________________________________
(page generated 2021-02-12 23:01 UTC)