[HN Gopher] Yandex open-sources its exabyte-scale big data platform
___________________________________________________________________
Yandex open-sources its exabyte-scale big data platform
Author : xpl
Score : 230 points
Date : 2023-03-22 13:57 UTC (9 hours ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| VWWHFSfQ wrote:
| It's my understanding that the original Yandex founders and tech
| left Russia several years ago. Is this new technology? Or is it
| stuff being extricated from the (now) Kremlin-controlled
| codebase?
| xpl wrote:
| It isn't new; the article states that it originated back
| in 2006. It's been in development since then.
| magundu wrote:
| Is it an alternative to Snowflake?
| alephone1 wrote:
| We provide YQL for running large-scale OLAP SQL queries. In
| this regard YT can be compared to Snowflake. However, we
| primarily target on-prem deployment, while Snowflake runs in
| AWS/GCP/Azure, and queries are performed over data that sits in
| S3.
| rch wrote:
| I work at Cloudera and I see it as roughly analogous to CDP,
| but I didn't see anything about hybrid on-prem/cloud
| deployments.
| arivkin wrote:
| It's true. We are very close to CDP and Apache Hadoop. We are
| on-prem only right now. It's not a secret that we
| didn't need a cloud deployment at Yandex. But we see and
| understand the demand for it. And we have a plan :)
| [deleted]
| karsinkk wrote:
| This looks very impressive! As another commenter echoed, the code
| base is ~5 million lines of C++ code, but almost no comments at
| all. Unless the documentation is excellent, maintenance/open
| source work is going to be difficult.
| xpl wrote:
| The docs, for reference: https://ytsaurus.tech/docs/en/
|
| P.S. I wonder if LLMs could be used to generate docs and
| comments for big hairy codebases. Seems that the current
| generation of LLMs lacks the context to do it, but maybe it's "just
| one or two more papers down the line"®...
| sciencesama wrote:
| Who cares?
| konart wrote:
| Same people who use Clickhouse or YDB at least.
| cynicalsecurity wrote:
| [flagged]
| bertil wrote:
| I'm all for holding companies that have supported dangerous
| regimes to account. However, when it comes to data management,
| totalitarian regimes rarely indicate inadequate implementation.
| IBM's role in Germany in the 40s was horrific, but it proved
| their ideas of tabulations and files were promising. Just like
| with rocketry, there were many valuable things to learn that
| defined the rest of the XXth century.
|
| The FSB likely has a lot of crimes to atone for. Still, suppose
| one of their specialists publishes something on data management
| or how to manage hundreds of sock-puppet social media accounts.
| In that case, I'd be tempted to listen and learn from a likely
| expert--unless you suspect that they think this article is not
| sincere and meant as a distraction from actual good practices.
|
| Similarly, the CIA has done very problematic things, but the
| people who worked in the disguise department have a creative
| take on changing your appearance. I'm unsure when I would have
| to do that, but I'm always curious about how data is stored
| efficiently. And yes, like the FSB, the NSA has opinions about
| that, and those are typically well-informed.
|
| Was their practice constitutional? Seemingly not, IANAL. But do
| they have good insights into caching video files at scale?
| Definitely.
| medo-bear wrote:
| > IBM's role in Germany in the 40s was horrific, but it
| proved their ideas of tabulations and files were promising
|
| source: https://en.m.wikipedia.org/wiki/IBM_and_the_Holocaust
| jones6ofMont wrote:
| [flagged]
| stef25 wrote:
| > remember FSB is NOT KGB because Russia now is a democratic
| country and not a communist USSR
|
| Secret services in a democratic country being better / worse
| than their communist counterparts aside, "democracy" is a bit
| of a stretch when describing Russia.
| Idiot_in_Vain wrote:
| Russia now is a democratic country???
|
| Only a complete idiot would think that.
| newaccount2023 wrote:
| [flagged]
| DanTheManPR wrote:
| As opposed to the IT department of the NSA?
| throwaw12 wrote:
| brainwashing at its peak.
|
| When USA does things, it is for the good of society, democracy.
| When Russia does things, it is hurting people, bad for society.
|
| Come on buddy, time to wake up and understand that every country
| does things for its own good, and if whatever your media tells you
| about Russia is bad, it is because they're applying three-letter
| agency brainwashing methods to you.
|
| The code is open source; if you read the code you will not fall
| under Russian propaganda.
| garbagecoder wrote:
| nice whataboutism.
| throwaw12 wrote:
| this discussion will not add any value to the tech
| community of HN. My main concern with the parent comment is that
| it adds politics to the discussion of a recently open-sourced
| piece of tech, instead of focusing on its capabilities,
| trade-offs and shortcomings and using it as a chance to learn
| the constraints of another company's systems (in this case
| Yandex's)
|
| it's always whataboutism when it doesn't fit the narrative.
| why haven't you asked the parent comment the same question
| when they tried to politicise a technical product discussion?
| medo-bear wrote:
| [flagged]
| garbagecoder wrote:
| [flagged]
| medo-bear wrote:
| there is no need to slander me. im definitely not paid by
| anyone to give my opinions
|
| imperialism is imperialism. it should be denounced
|
| if you dont think there is a US imperialist element
| involved in Ukraine you need to inform yourself a bit
| more. you can start with reading about North Stream 2
|
| before the war in Ukraine i used to think that, in
| comparison to the US, at least Russia bombs countries
| without calling it freedom [1]. Since Russia started its
| "special operation" against Ukraine i now think that at
| best Putin is trolling the US, and at worst he is trying
| to be the US
|
| [1] https://en.m.wikipedia.org/wiki/Operation_Enduring_Freedom
| newaccount2023 wrote:
| [flagged]
| garbagecoder wrote:
| look at all of these throw away accounts making bad
| arguments.
|
| Yes, we look at everyone's record. Both records.
| kofejnik wrote:
| [flagged]
| bigbillheck wrote:
| Got some news for you, buddy.
| omgtehlion wrote:
| Oh come on already...
|
| Of course FSB has a backdoor to *data* collected by Yandex, but
| the code itself (as well as coders) have nothing to do with any
| three-letter-agency within Russia.
| [deleted]
| screamingninja wrote:
| > the code itself (as well as coders) have nothing to do with
| any three-letter-agency within Russia.
|
| citation needed
| eddsh1994 wrote:
| Why don't you read the code? With the number of people
| commenting on this, if there was something to hide they'd
| have probably caught it already, and if not, surely by next
| week.
| dna_polymerase wrote:
| Let's keep in mind, the only companies that were ever found
| to collude with three-letter agencies were the big players
| in the US. But for some reason, Yandex needs to prove
| something here...
| ivan_gammel wrote:
| Maybe this is exactly the reason why it's worth checking? They
| _open-sourced_ it.
| Aldipower wrote:
| Yeah, like Apache Mesos is running the NSA data center in
| Utah. One of the biggest in the world.
| mdaniel wrote:
| Do you happen to have any link where I can read more about
| this?
| g9yuayon wrote:
| The github repo has 5 million lines of C++ code (headers
| included), 1.6 million lines of C code, and even nearly 1 million
| lines of Scala + Java code. We'd need some serious docs to adopt
| this technology.
|
| The most interesting part of YT is Cypress. I'm particularly
| interested in how they make their master cluster horizontally
| scalable.
| karsinkk wrote:
| I was going over some of the code in the core folder for
| concurrency, threading and compression, and what surprised me is
| that there are absolutely no comments whatsoever. Agree that
| unless there's excellent documentation, open source maintenance
| might be challenging.
|
| Having said that, this definitely does look to be an impressive
| feat of engineering!
| gritukan wrote:
| Historically, the master server of YTsaurus was a single RSM
| (replicated state machine) that contained all the meta-
| information about the cluster. This included the tree of the
| distributed filesystem, transactions, information about users
| and tables, placement of chunks, and much more.
|
| However, this approach proved to be non-scalable as the memory
| amount and throughput of the master server soon became
| insufficient. To address this issue, we implemented Multicell
| technology. With Multicell, there are multiple RSMs called
| secondary masters that store information about chunks of the
| tables and their placement. The primary master still stores
| information about the distributed filesystem and transactions
| but remains a single, non-sharded RSM.
|
| After a few years, the masters became overloaded again, and we
| implemented Portals. With Portals, one can select a subtree of
| Cypress and place it in one of the secondary masters. This
| technology is used nowadays, and home directories of some
| active users are hosted on secondary masters.
|
| However, we anticipate that this approach will also become
| insufficient in a few years. Therefore, we are currently
| working on a new technology called Sequoia, which stores
| information about the Cypress tree shape in horizontally
| scalable dynamic tables.
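|
| A minimal sketch of the Portals idea described above, assuming
| that routing a path to a master cell can be pictured as a
| longest-prefix lookup. The class, cell names and paths are
| hypothetical and are not taken from the YTsaurus code base.

```python
# Toy model only: names and paths are hypothetical, not YTsaurus APIs.
from typing import Dict

PRIMARY = "primary"


class PortalRouter:
    """Maps a Cypress-like path to the master cell that would own it."""

    def __init__(self) -> None:
        # subtree root -> name of the secondary master cell hosting it
        self._portals: Dict[str, str] = {}

    def add_portal(self, subtree_root: str, cell: str) -> None:
        self._portals[subtree_root.rstrip("/")] = cell

    def resolve(self, path: str) -> str:
        # Walk from the full path up towards the root; the deepest
        # portal found on the way owns the path.
        node = path.rstrip("/")
        while node:
            if node in self._portals:
                return self._portals[node]
            node = node.rpartition("/")[0]
        # No portal on the way up: the primary master owns the path.
        return PRIMARY


router = PortalRouter()
router.add_portal("//home/alice", "secondary-1")
alice_cell = router.resolve("//home/alice/tables/t1")
bob_cell = router.resolve("//home/bob/tables/t1")
print(alice_cell, bob_cell)
```

| In this toy model only the subtree under the portal moves to a
| secondary cell; every other path still resolves to the primary
| master, which matches the description of hosting some active
| users' home directories on secondary masters.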
|
| It is hard to describe all aspects of master server internals
| in one comment. Therefore, feel free to join our chat at
| t.me/ytsaurus for further discussion!
| matrix_overload wrote:
| Looks like a reasonable response to a recent leak [0]
|
| [0] https://news.ycombinator.com/item?id=34525936
| galkk wrote:
| There is a conspiracy theory that the recent open-sourcing of Yandex
| tech, and to some extent even the leak, is preparation for a global
| Yandex exodus from Russia.
| slt2021 wrote:
| exodus has already happened
| speed_spread wrote:
| You can take Yandex out of Russia... but you can not take
| Russia out of Yandex. Admittedly, the same could probably be
| said about any other $big_search:$nuclear_power pair. It's just
| that the other companies either can't or have no reason for
| exile.
|
| And so I suspect Yandex leaving its home turf would
| essentially be a covert invasion of wherever they'd swarm to.
| Somehow, money tells me UK would be a likely target. Source:
| worked briefly for an exiled Russian company. Would not repeat.
| ClumsyPilot wrote:
| > a covert invasion of wherever they'd swarm to.
|
| > an exiled Russian company.
|
| Either you are exiled or you are invading, how can you be
| both?
| helge9210 wrote:
| For example, you wholeheartedly support the policy, but are
| not comfortable living under the sanctions.
| ClumsyPilot wrote:
| My understanding of Exile is that you are rejected by the
| society you are exiled from, due to your actions or
| beliefs. Like profound disagreement with the policy.
|
| I have certainly seen the kind of people you talk about,
| they support (or at least used to) Putin, but don't want
| their kids to live in Russia. I would call them more like
| emigrants of convenience.
| speed_spread wrote:
| Deception. Moving a piece on a game board should have more
| than one effect. One should preferably make it so that the
| most obvious effect (exile) is not the one that's actually
| the most valuable in the long term (taking root in adverse
| country).
|
| Bonus "inception" points if you can make the adversary
| believe that you did it because they forced you to
| (sanctions).
| panki27 wrote:
| Wasn't this "open sourced" in their recent sourcecode leak
| already?
| omgtehlion wrote:
| No, it was not, see
| https://arseniyshestakov.com/2023/01/26/yandex-services-sour...
| perryh2 wrote:
| [flagged]
| TylerLives wrote:
| Does anyone know how they used the slurs?
| marwis wrote:
| Looking at the OP comment before it was flagged, someone just
| did `s/slave/n****r/g` on the codebase.
|
| BTW It's pretty ridiculous that American censorship makes it
| impossible to even call out and criticize racism like in the
| case of OP.
| slt2021 wrote:
| banlist of words to filter out from search results, something
| like that
| metafates wrote:
| It was used for filter lists in the search engine. I think you
| understand that Google is using the exact same technique. It
| just hasn't been leaked
| maxdo wrote:
| It's a problem across all of Russia. My Asian girlfriend from China
| was denied entry to a restaurant in Moscow because of her
| origins...
|
| It's absolutely legal to restrict your rental apartment by
| nationality. On Yandex you can see ads like "will lease my house
| only to Russians". It's everywhere.
|
| If you ride a taxi in Russia, offensive words about other
| nationalities/ethnicities are everywhere.
|
| The problem is so big that even non-Russian/non-Slavic people
| use them to offend each other. I've been in a taxi when two
| people of Middle Eastern descent started screaming "churka" at
| each other, even though both of them could fit this definition.
|
| I personally try not to invest into anything Russian, simply
| because that society is very, very sick and they are very far
| away from even recognizing this problem. They think it's
| their strength...
| trallnag wrote:
| What were you doing in Russia if it's so bad?
| IYasha wrote:
| Rollin', Hatin' )
| IYasha wrote:
| [flagged]
| ClumsyPilot wrote:
| > It's absolutely legal to restrict your rental apartment by
| nationality.
|
| I believe it is still illegal 'hypothetically', just like
| hypothetically Russia has elections.
|
| > They think it's their strength
|
| Yeah, the delusion is serious
| orbital-decay wrote:
| It's illegal but the rental contract is private, so it can
| be denied without explanation; good luck suing and proving
| that you're being discriminated against.
|
| Discrimination in public ads and places like
| restaurants is strictly illegal, though essentially not
| enforced unless you sue, for a multitude of
| reasons: racism and lax anti-discrimination policies in
| particular, and also a million others (the entire rental
| housing market is almost unregulated, for one). Lately,
| most popular ad sites have implemented their own discrimination
| bans; however, that still doesn't guarantee that the landlord
| won't be an asshole.
| medo-bear wrote:
| > It's illegal but the rental contract is private, so it
| can be denied without explanation; good luck suing and
| proving that you're being discriminated against.
|
| How is this different than in other places in Europe?
| orbital-decay wrote:
| When worded like this, it's not that different. It's the
| little details that add up, depending on the actual
| country. You're much less likely to be discriminated
| against in UK or Germany than in places like Bulgaria,
| Ukraine, or Russia. Due to both the attitude and
| enforcement. The rental market in Germany seems over-
| regulated, but my black friend of Ethiopian descent (he's
| Russian, born and raised) had no problem finding a place
| to live there, while in Russia he's been overtly or
| silently rejected by landlords so often that he had to
| rent an apartment from me for a year despite it
| being far away from his work.
| TechBro8615 wrote:
| > I've been in a taxi when two people of Middle Eastern descent
| started screaming "churka" at each other
|
| Another way of interpreting this sort of culture is that they
| know not to take things too personally and that intent behind
| words is more important than the words themselves. It's a
| kind of liberating way of interacting with each other, and
| not uncommon to see in environments like fraternities or
| sports teams. Some people will call that culture
| exclusionary, but I might call those people neurotic.
|
| To put it another way, who are you to be offended on behalf
| of the people calling each other churka? Clearly they're not
| offended by it, so shouldn't you let them have their fun?
| 2-718-281-828 wrote:
| [dead]
| konart wrote:
| > My Asian girlfriend from China was denied entry to a
| restaurant in Moscow because of her origins...
|
| While I'm not questioning your statement - this sounds very
| questionable, considering the number of Chinese tourists in
| Moscow every year.
|
| > It's absolutely legal to restrict your rental apartment by
| nationality
|
| > I personally try not to invest into anything Russian,
| simply because that society is very, very sick
|
| Do you as a gaijin invest in anything Japanese in this case?
| It is not uncommon to see a "gaijins are not allowed" sign in
| Japan either.
|
| While it is true that Russia has (and will probably have for
| a long time) some racism problems (just like a superiority
| complex) - I wouldn't say it's as bad as you think it is
| (especially compared to the 90s, for example)
| trololo01001 wrote:
| Oh yeah, russians are committing mass rapes and
| indiscriminate civilian killings in Ukraine and this is
| just a tiny issue that you can easily ignore.
| IYasha wrote:
| Funny to see 0 of 50 comments on a HN tech article being tech-
| related...
| hotstickyballs wrote:
| It's not technically interesting since a lot of big data
| solutions already solve this problem so I guess only the
| geopolitics are left.
| xpl wrote:
| I doubt the "a lot" part. Also, "already solve" does not mean
| "solve better" or even "good enough".
|
| It would be very interesting to see some in-depth comparisons
| with already-existing open source technology (like Hadoop,
| Hive, Iceberg, ZooKeeper) to get a sense of when and where YT
| could be more effective.
| hamilyon2 wrote:
| I am obviously biased, but, yes, technically it is very, very
| interesting. Distributed transactions, Cypress ("kiparis"), YQL are
| interesting. Another aspect: the only other open alternatives
| are Hadoop, HBase and Hive. They don't compete with YT on
| usability and developer experience. YT is much more
| polished, despite historical quirks.
| l0b0 wrote:
| Agreed, but this is presumably also an absolutely massive
| project hardly anyone here has even used before. So it's not
| surprising that there are no big tech insights on the day of
| the release. An `scc` printout might be interesting, but any
| in-depth analysis is going to take a long time.
| arivkin wrote:
| There are some developers from YT here who can help you and
| answer technical questions. Also, YT is a huge and old
| project. I believe you can find a lot of ex-Yandex people
| who worked with YT.
| booi wrote:
| I think what you're seeing is a resurgence of tech ethics where
| the source and support of a project is as important as the
| technology itself.
| gre wrote:
| github link https://github.com/YTsaurus/YTsaurus
| ushakov wrote:
| How many companies exist that will utilise its full capabilities?
| My bet would be 50-100
| arivkin wrote:
| Why do you think so? How many companies use Apache Hadoop?
| xpl wrote:
| My guess is it could be successfully used for relatively small
| (terabyte-scale) workloads.
|
| It's the same as when people use k8s not utilizing its full
| capabilities, only to be able to massively scale up when
| needed.
| RobotToaster wrote:
| Most people don't utilise the full capabilities of the tools
| they use.
|
| Sure it would be overkill for a lot of applications, but so is
| redis, react.js, etc.
| gritukan wrote:
| Hello! I work at YT and would like to answer a question that was
| asked in a flagged thread about the comparison between YT and
| Hive and Zookeeper.
|
| Both Cypress and Zookeeper are fault-tolerant distributed
| hierarchical filesystems that can be used for distributed
| coordination, but Cypress has much richer functionality.
|
| Recall that Zookeeper's data model is just a tree consisting of
| homogeneous nodes that can be either ephemeral or persistent,
| along with a set of sessions that control the lifetime of
| ephemeral nodes. This simple model makes it possible to implement
| multiple primitives of distributed synchronization, such as leader
| election, exactly-once queue processing, or two-phase commits.
| However, it is not always easy to integrate Zookeeper with third-
| party systems. For example, if you want to elect a leader via
| Zookeeper and use it to insert data into a database, it is
| mandatory that the instance remains the leader during the commit
| into the database, which is not easy to implement without races
| or some additional assumptions. In YTsaurus, transactions
| permeate our entire system. You can start a transaction and
| acquire an exclusive lock at some Cypress node (which is a way to
| make a leader election), and after that, the transaction becomes
| the leader lease. You can then modify Cypress, run MapReduce or
| YQL operations using the transaction as a prerequisite, lock some
| files and tables in the same transaction, and do many other
| things. Currently, we are working on the ability to use Cypress
| locks as prerequisites for dynamic table commits. There are many
| other features in Cypress that are not implemented in Zookeeper,
| such as symlinks, automatic expiration of unused nodes, and many
| others. Moreover, Cypress can be sharded using Portals about
| which I wrote in a previous comment, so this filesystem is
| scalable unlike Zookeeper. Even without sharding, a single primary
| master of YTsaurus can hold tens of gigabytes of Cypress metadata,
| while Zookeeper state size is limited to hundreds of
| megabytes according to the etcd vs Zookeeper comparison [1].
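|
| The "transaction as leader lease" idea above can be sketched with
| a toy in-memory model. Everything here (ToyCoordinator, its
| methods, the paths) is a hypothetical illustration and not the
| YTsaurus client API; it only shows why a write that carries a
| prerequisite transaction cannot be applied by a deposed leader.

```python
# Toy in-memory model; names are hypothetical, not the YTsaurus API.
import itertools
from typing import Dict, Set


class ToyCoordinator:
    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._alive: Set[int] = set()
        self._locks: Dict[str, int] = {}  # node path -> owning transaction id
        self.store: Dict[str, str] = {}

    def start_transaction(self) -> int:
        tx = next(self._ids)
        self._alive.add(tx)
        return tx

    def abort(self, tx: int) -> None:
        # Aborting (or expiring) a transaction releases all of its locks.
        self._alive.discard(tx)
        self._locks = {p: t for p, t in self._locks.items() if t != tx}

    def lock_exclusive(self, tx: int, path: str) -> bool:
        # Only a live transaction can take a lock; the first one wins.
        if tx not in self._alive:
            return False
        return self._locks.setdefault(path, tx) == tx

    def write(self, key: str, value: str, prerequisite_tx: int) -> bool:
        # The write is applied only while the prerequisite transaction
        # is still alive, so a leader that lost its lease is rejected.
        if prerequisite_tx not in self._alive:
            return False
        self.store[key] = value
        return True


coord = ToyCoordinator()
tx_a = coord.start_transaction()
tx_b = coord.start_transaction()

a_is_leader = coord.lock_exclusive(tx_a, "//election/leader")
b_is_leader = coord.lock_exclusive(tx_b, "//election/leader")
wrote = coord.write("config", "from-a", prerequisite_tx=tx_a)

coord.abort(tx_a)  # the leader loses its lease
stale = coord.write("config", "late-write", prerequisite_tx=tx_a)
b_takes_over = coord.lock_exclusive(tx_b, "//election/leader")
print(a_is_leader, b_is_leader, wrote, stale, b_takes_over)
```

| The point of the sketch is that the liveness check and the write
| happen in one place, inside the coordinating system, rather than
| as a separate "am I still the leader?" check in the client.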
|
| One major disadvantage of Cypress compared to Zookeeper is the
| lack of watches, so all change tracking has to be done via short
| polling. The good news is that Cypress is well-optimized for read
| queries with the possibility to read from followers and from
| multiple threads, so this is not a big problem. In the meantime,
| we are considering the possibility of adding some kind of watches
| to Cypress.
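|
| Change tracking via short polling, as described above, can be
| sketched as follows. The poll_changes helper and its parameters
| are hypothetical; read stands in for whatever call fetches the
| current value or revision of a node.

```python
# Generic polling sketch; `read` is a stand-in for any "get node" call.
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def poll_changes(
    read: Callable[[], T],
    interval: float,
    max_polls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield a value each time read() differs from the previous poll."""
    last: object = object()  # sentinel that never equals a real value
    for _ in range(max_polls):
        current = read()
        if current != last:
            yield current
            last = current
        sleep(interval)


# Simulated sequence of values seen on successive polls.
values = iter(["v1", "v1", "v2", "v2", "v3"])
changes = list(
    poll_changes(lambda: next(values), interval=0.0, max_polls=5,
                 sleep=lambda _: None)
)
print(changes)
```

| Unlike a watch, a poller only sees the state at poll time, so
| several changes between two polls collapse into one; the polling
| interval is a trade-off between read load and detection latency.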
|
| The big difference between Cypress and Zookeeper is the
| replicated state machine implementation. With all due respect to
| Zookeeper developers, Zookeeper was implemented over 15 years ago
| when the world of distributed algorithms was different. Today we
| see that ZAB (the consensus algorithm used in Zookeeper) has some
| shortcomings in failover speed and stability. There are multiple
| reports of Zookeeper being unstable under heavy load. In
| YTsaurus, we use an in-house library called Hydra for RSM
| implementation. It implements our own consensus algorithm, very
| similar to Raft, which has proven itself to be both efficient and
| fault-tolerant. We use Hydra for master servers, clock servers, and
| tablet cells (RSMs that store data in dynamic tables). I even had
| an idea to implement a Zookeeper API using Hydra both to simplify
| migration to YTsaurus and check Hydra performance and correctness
| via multiple tests implemented for Zookeeper (Jepsen, for
| instance), but did not have enough time to finish this project.
|
| This comment is already quite long, so I will write about the YT
| vs Hive comparison in another comment later on.
|
| [1] -- https://etcd.io/docs/v3.3/learning/why/
| garbagecoder wrote:
| Didn't this just leak?
| xpl wrote:
| AFAIK it wasn't in the leak (that "YT" platform specifically).
| Also, open-sourcing proprietary projects of this scale is a
| pretty hard job and it can't be done quickly -- seems they had
| been doing it for a long time, starting long before the leak.
| arivkin wrote:
| That's true. We had been preparing to open YT for a long
| time when we found out about the leak. We even joked internally
| that they couldn't even do it properly and we still have to
| open our part of the code ourselves :)
|
| It's truly hard work, because we were very tightly tied to
| the Yandex infrastructure and we had to learn how to deploy
| in k8s from scratch. Also, you need a new brand and you need to
| clean irrelevant things out of your documentation... All this
| takes months.
| reisse wrote:
| There is an interesting take that the Russian part of the Yandex
| group open-sources as much as possible, in order for the overseas
| companies of the group to leverage the technologies without legal
| or financial ties with Russia.
|
| For me this seems very plausible, as over the last year they first
| did everything to distance themselves from anything related to
| politics (e.g. they sold their news and their blogging platform to
| the basically state-owned VK), and then to separate the Russian and
| overseas businesses as much as possible.
| tiffanyh wrote:
| Yandex has definitely made it extremely easy to leverage their
| tech by choosing the Apache 2.0 license.
| drewda wrote:
| It does look like Clickhouse transmogrified from a Russia-based
| Yandex open-source project into a San Francisco-based VC-backed
| C-corp incorporated in Delaware just in the nick of time.
|
| (Not suggesting this is wrong. I'm just offering an
| interpretation based on skimming public blog posts over the
| past ~2 years.)
| 0xDEF wrote:
| Yandex is incorporated in the Netherlands and the founders live
| in Israel.
|
| I think they will create several spin-off open source companies
| (like ClickHouse Inc.) outside Russia to continue doing B2B
| business with the outside world.
| konart wrote:
| They did some time ago actually.
|
| Like https://nebius.com/about for example.
|
| Not to mention that Yandex has been operating in countries
| outside of Russia under a different name from the beginning
| (like https://en.wikipedia.org/wiki/Yango_(ride_sharing))
| deepsun wrote:
| It doesn't matter where the company is incorporated -- when
| the Kremlin wants access to the data, Yandex cannot say no.
| One cannot operate in Russia and not play by the Kremlin's rules,
| especially media companies.
| malaya_zemlya wrote:
| Unfortunately, only one founder, Arkady Volozh, is still
| alive. Ilya Segalovich died 10 years ago.
| slt2021 wrote:
| russian Yandex no longer belongs to original Yandex
| founders/owners - it belongs to and is controlled by kremlin.
|
| makes sense that original engineers/founders create their own
| stuff via opensourcing their original work
| reisse wrote:
| Then why is the original Yandex founder under EU sanctions?
|
| Spoiler: because until June 22 (effectively until the
| sanctions hit) he was the Yandex CEO and owner of 8% of shares
| (45% of voting shares).
|
| There is no "almighty Kremlin" that owns everything. There
| is, however, a set of rules you must comply with if you want to
| do multibillion-dollar business in Russia. You either bend,
| or sell your business to more compliant oligarchs. Durov
| chose the latter, Volozh chose the former.
| the_mitsuhiko wrote:
| Would be a clever move, but I doubt that the non-Russian Yandex
| still has much say in Russian Yandex. The latter seems very close
| to the Kremlin now.
___________________________________________________________________
(page generated 2023-03-22 23:02 UTC)