[HN Gopher] High-Performance Graph Databases
___________________________________________________________________
High-Performance Graph Databases
Author : belter
Score : 129 points
Date : 2023-05-21 15:39 UTC (7 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| shrubble wrote:
| What would be a use of 100k cores for a graph database?
| henrydark wrote:
| 100 use cases for 1k cores
| jandrewrogers wrote:
| Aggregate effective memory bandwidth if you use the CPU cache
| well. Graph databases are not compute intensive, nor are they
| particularly large data models, but they are extremely memory
| I/O intensive. The classic model for graph-oriented HPC was
| vast numbers of weak cores and barrel processors for this
| reason, though the utility of specialized hardware
| architectures has been greatly reduced by better software
| architecture for this purpose over time.
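| A toy sketch of the access pattern being described: in a BFS over an
| adjacency list, the per-vertex work is trivial, but every hop is a
| dependent, effectively random memory access, which is why traversal
| is bounded by memory latency/bandwidth rather than compute.
| (Illustrative Python only, not from the paper.)

```python
from collections import deque

def bfs(adj, start):
    """Breadth-first traversal over an adjacency list.

    The arithmetic per vertex is trivial; the cost is in the
    dependent, effectively random lookups into `adj`, which is
    why graph traversal is memory-I/O bound at scale.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:  # random access into the graph's memory
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

# A toy graph: edges 0-1, 0-2, 1-3
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
```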
| ocrow wrote:
| A social media service for hundreds of millions of users?
| threeseed wrote:
| Fraud analytics.
|
| You're looking for patterns across large numbers of entities
| and relationships.
|
| And ideally you want this all done in real-time so you can stop
| transactions before they are approved.
| belter wrote:
| "...we harness established practices from the HPC landscape to
| build a system that outperforms all past GDBs presented in the
| literature by orders of magnitude, for both OLTP and OLAP
| workloads. For this, we first identify and crystallize
| performance-critical building blocks in the GDB design, and
| abstract them into a portable and programmable API specification,
| called the Graph Database Interface (GDI), inspired by the best
| practices of MPI. We then use GDI to design a GDB for
| distributed-memory RDMA architectures. Our implementation
| harnesses one-sided RDMA communication and collective operations,
| and it offers architecture-independent theoretical performance
| guarantees. The resulting design achieves extreme scales of more
| than a hundred thousand cores..."
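| The one-sided RDMA model the abstract refers to lets a reader pull
| bytes out of a remote node's registered memory region without the
| remote CPU executing any receive code. A pure-Python analogy of that
| access pattern (the class and method names here are made up for
| illustration; this is not the paper's GDI API):

```python
class Node:
    """Toy stand-in for a node exposing a registered memory window."""

    def __init__(self, size):
        self.window = bytearray(size)  # "registered" memory region

    def rdma_get(self, offset, length):
        # One-sided read: in real RDMA the owner's CPU is not
        # involved; here we just slice the exposed buffer.
        return bytes(self.window[offset:offset + length])

    def rdma_put(self, offset, data):
        # One-sided write into the exposed region.
        self.window[offset:offset + len(data)] = data

# A "remote" node stores a tiny adjacency record; a peer fetches it
# directly, without the owner running any handler.
remote = Node(64)
remote.rdma_put(0, b"\x03\x00\x01\x02")  # degree=3, neighbors 0,1,2
record = remote.rdma_get(0, 4)
```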
| [deleted]
| jandrewrogers wrote:
| I am a little confused by the purpose of this paper. The
| architecture described is roughly how graph databases have always
| been implemented on HPC systems. The main contribution seems to
| be that they put a lot of polish on what were admittedly
| prototype-ish implementations historically? I was hoping for some
| interesting approaches to some of the fundamental computer
| science problems that cause scaling issues when graphs become
| large but this is more of a standard "throw hardware at it"
| solution (which has significant limitations).
| lmeyerov wrote:
| Agreed. They brush aside the years of high performance
| computing graph implementations, eg, cuStinger. If you look at
| systems like neo4j's GDS, and at how other multipurpose systems
| are moving toward views/projections for accelerated compute, that
| approach has been enabling far more performance without dying
| under complexity. Benchmarking perf without that kind of
| comparison is weird, I'm surprised reviewers allowed that. (The
| work may still be good.. just you can't know without that.)
| mathisfun123 wrote:
| I love this - "it's just engineering" say the people that think
| the hardest part about building a spaceship is having the big
| idea to build the spaceship. Note the "I love this" was
| sarcasm.
| jandrewrogers wrote:
| My point was that the architecture they are using has already
| been done multiple times for graph databases using things
| like RDMA (which has existed in HPC for ages), that is a
| known quantity. It was less "it's just engineering" and more
| I've seen similar implementations for a long time so what
| makes this different? I am interested in this space in part
| because I spent much of my time in HPC working on graph
| databases.
| eternalban wrote:
| You should quickly skim sections marked with a (comically
| fat) "!" symbol. These indicate their "key design choices
| and insights" in the design space.
| LukeEF wrote:
| I am not sure of the exact statistic, but something like 95% of
| all production databases are less than 10GB. There seems to be a
| 'FAANG hacker' fascination with 'extreme-scale' which probably
| comes from seeing the challenges faced by the handful of
| organizations working at that level. Much of the time most graph
| database users want (as in why are they there) a DB that allows
| you to flexibly model your data and run complex queries. They
| probably also want some sort of interoperability. If you can do
| that well for 10GB, that is holy grail enough. We certainly found
| this while developing the graph database TerminusDB [1]: most users
| have smaller production DBs, make only light use of bells-and-whistles
| features, and really want things like easy schema evolution.
|
| [1] https://github.com/terminusdb/terminusdb
| deegles wrote:
| What are the databases with easy schema evolution?
| swader999 wrote:
| I get that angle but I also see orgs capturing too much data.
| What's the use case for it? Not sure, but "if we ever do need
| it, we'll have it" is the typical answer.
| fnord77 wrote:
| really? I don't quite believe that. We're a tiny company with
| maybe 70 customers and our DB is roughly 11TB.
| cubefox wrote:
| That seems a lot. What type of data?
| [deleted]
| paulddraper wrote:
| Assuming 1kb per "record" that's 150 million records per
| customer.
|
| Definitely a data heavy product, wherever it is that you're
| offering.
|
| (Unless you keep large blobs in the DB. But database scale
| has more to do with records than raw storage.)
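| A quick check of that arithmetic (assuming the "11Tb" above means
| 11 terabytes, and ~1 kB per record as the parent estimates):

```python
# Back-of-envelope check of the estimate above.
total_bytes = 11e12        # 11 TB
bytes_per_record = 1e3     # ~1 kB per record (assumed)
customers = 70

records_per_customer = total_bytes / bytes_per_record / customers
# ~1.57e8, i.e. roughly 150 million records per customer
```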
| belter wrote:
| Interesting that you mention the value 10 GB, as it is the size
| of a DynamoDB partition or an AWS Aurora cell...
| im_down_w_otp wrote:
| I think I kind of agree with this.
|
| One of the simpler supported backends for our Modality product
| (https://auxon.io/products/modality) is built using SQLite. Its
| data model is a special case of a DAG for modeling big piles of
| causally correlated events from piles and piles of distributed
| components in "system of systems" use cases. In practice the
| scaling limiter is almost always how efficiently the traces &
| telemetry can be exfiltrated from the systems under
| test/observation, long before how fast the ingest path can
| actually record things becomes a problem.
|
| That said, I do love me some RDMA action. 10 years ago I was
| fiddling with getting Erlang clustering working via RDMA on a
| little 5 node Infiniband cluster. To mixed results.
| parentheses wrote:
| I agree with your sentiment but I suppose you're considering
| the wrong statistics. Instead you should consider:
|
| - how many jobs have interviews that necessitate knowing how to
| handle extreme scale
|
| - the proportion of jobs (not companies) requiring extreme scale
| (the fact that non-extreme scales are the long tail doesn't mean
| it's a fat tail)
|
| - the proportion of buyers/potential users that walk away over
| the inability to handle extreme scale
|
| ... and more sarcastically
|
| - the proportion of articles about extreme scale
|
| - the proportion of repos about extreme scale
| mumblemumble wrote:
| Only one anecdote, but a while after starting my current job I
| found out that directly questioning, during a technical
| interview, the extent to which scale-out was actually needed to
| solve a problem is the thing that made me stand out from the
| rest of the crowd, and landed me the job. Being
| able to constructively challenge assumptions is an incredibly
| valuable job skill, and good managers know that.
| zimpenfish wrote:
| Counter-anecdote: directly questioning "scale-out
| fantasies" has contributed to my early departure from a
| handful of jobs and contracts. One place was obsessed with
| getting everything into AWS auto-scaling groups when the
| problem was actually that they were running on MySQL with a
| godawful schema, dumbass session management, and horrific
| queries that we weren't allowed to fix because they were
| "migrating to node microservices anyway" (pretty sure that
| still hasn't happened years later.)
|
| > Being able to constructively challenge assumptions is an
| incredibly valuable job skill
|
| I would agree but ...
|
| > good managers
|
| ... are few and far between.
| hobs wrote:
| The best people challenge bad assumptions, and the worst bosses
| get mad.
|
| Had one boss get mad that I reduced the database
| footprint by 94% - why? Because he wrote the initial
| implementation and refused to believe that his baby,
| which cost so much space because of how awesome it was,
| could fit into 5GB.
|
| But challenging the status quo has gotten me to where I
| am, so I won't stop anytime soon :)
| threeseed wrote:
| This research paper is talking about performance whilst you're
| talking about scalability.
|
| Those are related but are distinct from each other.
|
| And sure about 95% of companies would have their needs met with
| a simpler system but that does leave a lot of companies who
| will not. And for those of us in say finance doing
| customer/fraud analytics I would welcome all the performance I
| can get.
| loeg wrote:
| > This research paper is talking about performance whilst
| you're talking about scalability. Those are related but are
| distinct from each other.
|
| The paper has "Scale to Hundreds of Thousands of Cores" in
| the title. I have not yet read the paper but it seems
| unlikely it doesn't talk about scalability.
| threeseed wrote:
| I was referring to scalability in the sense of the size of
| the data being stored.
|
| You can have slow queries with 10GB of data just like you
| can have fast queries with 10PB of data.
| adgjlsfhk1 wrote:
| If your data is small enough to easily fit in ram, you
| kind of can't have that slow a query on it (or at least
| you no longer are talking about a database problem).
| mumblemumble wrote:
| I'm guessing that, when the paper's author mentioned
| "hundreds of thousands of cores", they didn't have 10GB
| of data in mind. That works out to less than a typical L1
| cache's worth of data per core.
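| The arithmetic behind that estimate (assuming 10 GB of data spread
| over a few hundred thousand cores):

```python
# 10 GB over "hundreds of thousands" of cores leaves only tens of
# kilobytes per core -- on the order of an L1 data cache, which is
# commonly 32-64 KB.
data_bytes = 10e9
per_core_100k = data_bytes / 100_000 / 1e3  # KB per core at 100k cores
per_core_300k = data_bytes / 300_000 / 1e3  # KB per core at 300k cores
# 100 KB and ~33 KB respectively
```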
| rocqua wrote:
| This isn't a graph database like neo4j. This is a graph
| database like I hoped neo4j would be. It's not about having an
| easier time working with schemas. It's about analyzing graphs
| that are too big to fit in RAM: transaction analysis for banks,
| traffic analysis of roads, failure resilience of utility
| networks, etc.
|
| In these kinds of workloads you quickly run into performance
| bottlenecks. Even in-memory analyses need care to avoid
| pointer-chasing slowdowns.
|
| I do still hope this is fast on, say, a single-CPU 32-core 64GB
| system with an SSD. But if this takes a cluster to be useful,
| then I will still love it.
| bubblethink wrote:
| >There seems to be a 'FAANG hacker' fascination
|
| Yeah, but the hacker fascination is what drives progress. You
| could have made the same type of argument about ML, and we
| would have been content with MNIST.
| RhodesianHunter wrote:
| But the 5% of places where that kind of scale is needed are the
| ones paying the top 1% salary band, so this is the content
| distributed systems engineers like to read about and work on.
| lasfter wrote:
| Maciej Besta, the first author of this paper, is a machine.
|
| Aside from coordinating big groups to write tons of papers, he
| does a bunch of impressive wilderness exploration. I recommend
| checking out his website, there's some stunning photos:
|
| https://people.inf.ethz.ch/bestam/expeditions.html
| osigurdson wrote:
| Stating that someone "is a machine" is somewhat ambiguous these
| days as it is increasingly possible that they literally are
| one.
| EGreg wrote:
| Wasn't TJ Holowaychuk confirmed to be an AI, or a group of
| people like Bourbaki? He never showed up at any events :)
|
| https://qr.ae/pyvfKK
| gumby wrote:
| I also misinterpreted that sentence!
| mathisfun123 wrote:
| Torsten is the PI of that group, not Maciej, so you're half
| right. But since Maciej is first author on this one he probably
| did most of the technical work (and who cares about
| "coordination").
| uptownfunk wrote:
| What are these used for in the "real" world?
| perfect_wave wrote:
| A few that I've encountered:
|
| - Feature generation for machine learning model training
| (particularly popular with fraud detection at financial
| institutions)
|
| - social networks (think LinkedIn or Facebook)
|
| - supply chain analysis and optimization
|
| - healthcare patient data analysis (looking at similar patients
| to recommend treatments or do large-scale analysis)
|
| - user identification (eg taking lots of data points and tying
| them to a specific user). There's a more specific name for this
| I can't remember off the top of my head.
| z3t4 wrote:
| I once was obsessed at finding the holy Grail of optimization and
| scaling, but then came to the conclusion that every system is
| optimized and scaled differently. Different systems have
| different bottlenecks.
| slively wrote:
| So true: rarely is anything the "best" or "better"; instead,
| each thing is a bucket of trade-offs we choose from. Maybe
| frustrating, but also what makes engineering genuinely
| interesting.
| uptownfunk wrote:
| This sounds similar to the no free lunch theorem in AI
| jjgreen wrote:
| Huh? NFL is a theorem on optimisation, not AI
| cubefox wrote:
| It's originally a theorem about supervised learning, which
| turned out to be the prototypical application (via CNNs) of
| deep learning. Deep learning is now mostly synonymous with
| AI. Though I'm not aware of any NFL theorem for
| reinforcement learning or unsupervised learning.
| mumblemumble wrote:
| Given that the most well-known use case for optimization
| algorithms is training machine learning models, it seems to
| me like a perfectly reasonable, if buzzword-oriented, thing
| to say.
___________________________________________________________________
(page generated 2023-05-21 23:00 UTC)