[HN Gopher] The Future Is Big Graphs: A Community View on Graph ...
___________________________________________________________________
The Future Is Big Graphs: A Community View on Graph Processing
Systems
Author : Anon84
Score : 130 points
Date : 2021-09-12 10:45 UTC (12 hours ago)
(HTM) web link (cacm.acm.org)
(TXT) w3m dump (cacm.acm.org)
| ThePhysicist wrote:
| I think graph processing has great potential for use cases where
| the data model is heterogeneous and the structure of a problem
| really can be modeled as a graph problem. I think the tooling and
| the technologies are still not nearly as mature as e.g.
| relational databases and most developers have very little
| exposure to graph models, so adoption is still quite low, but
| will probably rise as the cloud providers offer hosted graph
| databases. In 2015 I was quite excited for TitanDB and the
| TinkerPop stack but I think the acquisition of the team behind it
| (Aurelius) by DataStax kind of sucked the air out of the
| community edition. As far as I understand Amazon is using many
| ideas and concepts from TitanDB in their Neptune graph database,
| but unfortunately they don't seem to care to open-source it. I
| don't think graph databases & graph processing will become
| mainstream though, as they're just not a good fit for most
| problems.
| ta988 wrote:
| Neptune sucked the air out of blazegraph as well... It is a
| constant problem. Graph databases will only ever become more
| mainstream when we get a real postgresql-level database in terms
| of reliability, maintenance and trust in its evolution.
| zozbot234 wrote:
| PostgreSQL is a fine graph database. It would be nice if it
| supported the graph-like query language that's now being
| standardized as an extension to SQL, but other than that it
| works quite well.
| lumost wrote:
| Unfortunately graph DBs are a solution in search of a problem.
| The performance of a graph traversal in all current
| implementations will be at best equivalent to a well-crafted
| join, at worst much slower.
|
| I'm unclear on where truly heterogeneous data problems would
| emerge outside of Q&A applications and potentially CRM. Usually
| the work to acquire a dataset is so large that there is limited
| reason not to properly model the data.
| amw-zero wrote:
| > The performance of a graph traversal in all current
| implementations will be at best equivalent to a well-crafted
| join, at worst much slower.
|
| A join must be implemented with a loop / scan to find the
| relevant target rows by foreign key - This is why foreign
| keys pretty much have to be indexed, to give the scan any
| hope of acceptable performance. So, remember, a join in a
| relational database generally means a join PLUS an index
| lookup.
|
| If it's not clear, indexes are not free - they have storage
| cost as well as imposing a cost on writes to the table.
| Indexes also make lookup very very fast, but still not free.
| On a large table, you may be incurring the cost of 20-30
| comparisons to find the desired rows (note that this is not a
| made up number. Assuming a btree index which reduces lookups
| to O(log_2(n)), you will need 20 comparisons to find a record
| in a table with 1,000,000 rows, and 26 comparisons to find a
| record in a table with 100,000,000 rows.)
|
| Graph databases (at least neo4j) always talk about index-free
| adjacency. Now you hopefully understand what this means. The
| edge between two nodes in a graph database is stored directly
| with no need for an index to find the destination node - the
| traversal is index-free. This can be a tremendous advantage,
| and that's also not made-up. You can subtract the cost of
| index lookups that I gave earlier.
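|
| A minimal sketch of the two lookup styles above, assuming a toy
| in-memory model (bisect standing in for a btree index, plain object
| references standing in for index-free adjacency):
|
|     import bisect, math
|
|     # Indexed join: find target rows by foreign key via a sorted
|     # index (a stand-in for a btree); cost ~ log2(n) comparisons.
|     fk_index = sorted(range(1_000_000))
|     pos = bisect.bisect_left(fk_index, 421_337)
|     print(math.ceil(math.log2(len(fk_index))))   # -> 20 comparisons
|
|     # Index-free adjacency: each node keeps direct references to
|     # its neighbours, so a traversal step is a pointer chase.
|     class Node:
|         def __init__(self, name):
|             self.name = name
|             self.out_edges = []               # direct Node references
|
|     a, b = Node("A"), Node("B")
|     a.out_edges.append(b)
|     neighbour = a.out_edges[0]                # no index lookup involved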
| jandrewrogers wrote:
| These methods only work for very small graphs though.
| Secondary indexing structures are inefficient and don't
| scale at all regardless of the implementation details,
| hence why even non-graph systems don't use them at large
| scales on single machines. The scale threshold where using
| indexes is worse than the alternatives is low. Direct
| representations without indexes (like your Neo4j example)
| are even worse due to the latency introduced for each step
| by very poor locality. There is no possible version of
| these methods that will work well; this would have been a
| solved problem 30 years ago otherwise.
|
| Latency is the performance killer for graph traversals, so
| locality is everything, and preserving locality across
| recursive traversals is the critical computer science trick
| that is often missing. The computational cost of a
| comparison is irrelevant to the performance of graph
| traversal.
|
| The way most "pure" graph systems implement graph
| traversals is unusably slow outside of trivial cases
| because they mitigate latency poorly. In practical graph
| systems, common traversals are typically materialized ahead
| of time so that a graph traversal operation is not required
| to satisfy queries, which defeats the point in many cases.
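|
| A minimal sketch of the "materialize common traversals ahead of
| time" idea, assuming a toy adjacency dict: the 2-hop neighbourhood
| is maintained at write time so the query path never traverses.
|
|     from collections import defaultdict
|
|     adjacency = defaultdict(set)   # node -> direct neighbours
|     two_hop = defaultdict(set)     # node -> materialized 2-hop set
|
|     def add_edge(a, b):
|         adjacency[a].add(b)
|         # Anything that reaches `a` in one hop now reaches b in two.
|         for src, nbrs in adjacency.items():
|             if a in nbrs:
|                 two_hop[src].add(b)
|         two_hop[a].update(adjacency[b])
|
|     add_edge("alice", "bob")
|     add_edge("bob", "carol")
|     print(two_hop["alice"])        # {'carol'}, no traversal at read time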
| rajman187 wrote:
| Very well summarized, particularly about the point of
| locality.
|
| In terms of the intractability of large graph problems, here's
| an example of a scalable solution that provides
| incredible performance (measured in terms of response
| times, queries per second, etc) by Facebook to represent
| the social graph:
| https://www.usenix.org/system/files/conference/atc13/atc13-b...
| lmeyerov wrote:
| Some slightly different views based on our experiences with
| startups/techco's/enterprises/govs. A lot of what's there is true,
| but at the same time, we've been seeing major appetite in 2 areas
| that lead to quite different directions from the article.
|
| To be clear, I do think the CACM authors are right about a lot of
| what is commercially + academically popular in big graph land
| today. Ex: Powering better knowledge graphs, especially for NLP
| neural network empowered use cases (BERT/GPT). Ex: Enriching
| inputs into traditional neural network inference calls, like a
| user 360 before making a recommendation. I expect graph DBs to
| keep improving to help with these.
|
| At the same time, working with teams from first-principles to
| help skate to where the puck is going, we've come to 2 overall
| different priorities where those ideas aren't the primary
| technology enablers for big graph use cases, in terms of either
| engineering or big $$$$. Broadly, it's useful to think through
| what Data Analysts are doing and what Data Scientists are
| doing:
|
| 1. Graph BI: Similar to Tableau, Looker, and more autoML-y
| successors, there are a LOT of analysts who are trying to better
| understand all that data in their data warehouses. Operational
| data in Splunk/ELK, transactions & business data in SalesForce
| CRM / RedShift transactions / etc. Graph questions are powerful
| here. Graphs are all about relationships like user/cohort
| behavior, so this opens up a lot of questions/answers for analysts,
| especially if made accessible. However, from a data engineering
| perspective, copying that data into graph DBs as a second system
| of record is historically a "now you have 2 problems" thing, and
| part of why most data lake projects fail. So we've been focusing
| on building graph query pushdown + virtual fabric style projects
| that expose graph thinking over your existing data infra. The
| discussion in the CACM article on dynamic graphs touches on this,
| but that's only one piece of that problem. Getting analyst graph
| insights over warehouses as ~easy as it is today for basic time
| questions on events requires a lot more -- surprisingly, so much
| more (we suspect) that it may commoditize the particular choice of
| graph compute engine (vs say
| connectors/virtualization/UI/automation/AI.)
|
| 2. Graph AI: Traditional neural networks have been increasingly
| dominating due to strong ability to combine categorical &
| continuous data around entities of interest (think computer
| vision realizing some pixels are a cat)... but struggled to
| leverage relationship structure: imagine a communication or
| transaction between people/devices/companies with rich histories
| & communities. Being able to fuse those matters a LOT in
| commercial contexts, so we'd been watching graph neural networks
| for a while. Over the last ~2 years, with the mounting focus by
| the AI community, key problems around performance, heterogeneous
| data, etc., have been getting solved, so GNNs are starting to
| show up in all sorts of places now. Interestingly, most companies
| using them don't plug graph databases or knowledge graphs or even
| traditional graph compute engines into them. If you look at how
| pytorch, dgl, etc. are being built, they largely ignore what
| traditional graph infra vendors focus on as 1-3 orders of
| magnitude too slow and, overall, the wrong tool for the job:
| a different problem space.
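|
| A minimal, hand-rolled sketch of the core GNN idea (nothing like
| DGL's or PyTorch's actual APIs): each node's feature vector is
| updated with an aggregate of its neighbours' features, fusing
| per-entity data with relationship structure in one step.
|
|     graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}   # adjacency
|     feats = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
|
|     def message_passing_step(graph, feats, self_w=0.5, nbr_w=0.5):
|         """One round of mean aggregation (no learned weights here)."""
|         out = {}
|         for node, nbrs in graph.items():
|             agg = [sum(feats[n][i] for n in nbrs) / len(nbrs)
|                    for i in range(len(feats[node]))]
|             out[node] = [self_w * s + nbr_w * m
|                          for s, m in zip(feats[node], agg)]
|         return out
|
|     print(message_passing_step(graph, feats))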
|
| In both cases, teams building/using graph DB or graph compute
| engines, like the CACM authors, get enlisted to help as natural
| stakeholders for pieces of the problem. Likewise, they naturally
| want to help + grow their scope. Unfortunately, from a
| perspective of "you're lucky to do even 1-3 things well", they've
| been struggling to provide the needed end-to-end solutions. There
| are plenty of critical & fascinating scale, design, + use case
| challenges to tackle far beyond the core db/compute engine
| perspective here.
|
| Anyways, this is an awesome time for this stuff so I encourage
| getting involved! We're actively hiring for help creating these
| new categories (all levels: fullstack, UI, BI, data, AI) +
| launching projects with lighthouse design partners, so if either
| is relevant, please reach out! Likewise, traditional graph
| DB/compute is important too (see: CACM article), and it's been a no-
| brainer for attracting most of the VC attention for the last few
| years due to being a traditional DB play, and we've enjoyed
| partnering with friends at them, so I'd recommend looking at
| those companies as well.
| wirthjason wrote:
| Anyone have an idea how a graph database like Neo4J is
| implemented? I imagine it as a unique query language (cypher)
| sitting on top of a traditional relational database.
| rajman187 wrote:
| Neo4J uses index-free adjacency which means each node is
| storing the address of where its edges point to.
|
| This raises an interesting question about "big" graphs and
| distributed systems---storing the address in memory across
| machines. So Neo4J, last I looked into it, is very fast for
| reads using a master-slave setup but may not scale well beyond
| a single machine.
|
| Other approaches such as RedisGraph use OpenBLAS, but again
| you're limited to a matrix representation.
|
| And yet others, like TitanDB (bought by DataStax, with the open-
| source community fork becoming JanusGraph) use graph abstractions
| that sit on top of distributed systems (HBase, Cassandra,
| Dynamo, etc) and these rely on adjacency lists and large
| indices. So the idea of "big graph" has been around for at
| least 6 years in the NoSQL world, but indeed everything is not
| a graph problem, tempting as that may be to claim.
|
| Good article on native graph representation:
| https://dzone.com/articles/letter-regarding-native-graph
| samsquire wrote:
| From Neo4j themselves - edges are referenced by index position
| in a file.
|
| https://neo4j.com/developer/kb/understanding-data-on-disk/
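|
| A minimal sketch of that layout (a toy approximation, not Neo4j's
| actual record format), assuming fixed-size records so a record id
| doubles as an offset into the store file:
|
|     import struct
|
|     REL_FORMAT = "<qq"   # toy relationship record: (from_id, to_id)
|     REL_SIZE = struct.calcsize(REL_FORMAT)
|
|     def write_rel(buf, rel_id, from_id, to_id):
|         struct.pack_into(REL_FORMAT, buf, rel_id * REL_SIZE,
|                          from_id, to_id)
|
|     def read_rel(buf, rel_id):
|         return struct.unpack_from(REL_FORMAT, buf, rel_id * REL_SIZE)
|
|     store = bytearray(REL_SIZE * 100)   # stand-in for the store file
|     write_rel(store, 7, 42, 99)
|     print(read_rel(store, 7))           # (42, 99), found by arithmetic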
| tokamak-teapot wrote:
| Labelled property graph:
| https://neo4j.com/blog/rdf-triple-store-vs-labeled-property-...
| heinrichhartman wrote:
| Neo4J is a native database built on the JVM. It does not use
| any underlying relational DBs. IIRC, graph links are
| implemented as direct references to other node objects (if they
| are in memory), which makes traversal way faster than what you
| would get with relational DBs.
| thanatos519 wrote:
| RedisGraph uses OpenBLAS sparse matrices which deliver
| astounding performance.
|
| It's not as feature-complete as Neo4J but it's a promising
| start.
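|
| A minimal sketch of the sparse-matrix view of a graph (scipy here
| purely for illustration; RedisGraph's internals differ): following
| edges becomes a matrix-vector product.
|
|     import numpy as np
|     from scipy.sparse import csr_matrix
|
|     # Adjacency matrix for edges 0->1, 0->2, 1->2 over 3 nodes.
|     rows, cols = np.array([0, 0, 1]), np.array([1, 2, 2])
|     A = csr_matrix((np.ones(3), (rows, cols)), shape=(3, 3))
|
|     # One hop forward from node 0: multiply, then read the nonzeros.
|     frontier = np.array([1.0, 0.0, 0.0])
|     print((A.T @ frontier).nonzero())   # nodes 1 and 2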
| [deleted]
| malthejorgensen wrote:
| I'd imagine it's more like an adjacency list structure with
| various indexes (similar to a regular relational DB) to allow
| lookups based on node properties.
| zomglings wrote:
| I have spent the last 7 years building ontologies (knowledge
| bases), small and large. Like with most tech articles, this one
| is pushing a very specific point of view that is relevant to
| megacorporations but certainly not to most companies that would
| benefit from the deployment of graph-structured knowledge bases.
|
| Unless a company has tons of extra cash to burn using
| technologies like neo4j, Spark, etc., they are better off just
| building an ontology on top of their existing datastores.
|
| The work on the semantic web in the 2000s added a lot of valuable
| ideas and I see a tendency to downplay these ideas in favor of
| new technologies. Storing relationships as RDF triples and
| building indices that are well-suited to your queries will be
| good enough for the majority of people who need an ontology.
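|
| A minimal sketch of that approach, assuming SQLite stands in for
| the existing datastore: a plain triples table plus whichever
| composite indices match the queries you actually run.
|
|     import sqlite3
|
|     db = sqlite3.connect(":memory:")
|     db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
|     db.execute("CREATE INDEX spo ON triples (s, p, o)")
|     db.execute("CREATE INDEX pos ON triples (p, o, s)")
|
|     db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
|         ("book:1", "author", "Jane Doe"),
|         ("Jane Doe", "employed_by", "Acme"),
|     ])
|
|     # "Who is employed by Acme?" -- served by the (p, o, s) index.
|     print(db.execute(
|         "SELECT s FROM triples WHERE p = ? AND o = ?",
|         ("employed_by", "Acme")).fetchall())   # [('Jane Doe',)]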
|
| I have found that graph-based systems don't model time well. When
| you are building a knowledge base, you often need to understand
| how knowledge has changed over time. You may want to relate facts
| across time or understand what the state of knowledge was within
| a given period of time. This usually takes extra work.
|
| Another thing that I have found these graph-based systems to do
| poorly is the management of storage classes for different types
| of data. If you are building a graph knowledge base, not all the
| information in your knowledge base will be consumed in the same
| way. If your knowledge base is sufficiently large, you will want
| to store less frequently accessed data on cheaper storage (e.g.
| HDD or even object storage like S3) and hotter data on more
| expensive storage (SSD or even maybe keep it in memory). I have
| always had to write this code by myself and haven't found a tool
| that helps to manage this very important concern in ontology
| building.
|
| That turned into a bit of a ramble, but damn ontologies are cool.
| :)
| jcims wrote:
| >The work on the semantic web in the 2000s added a lot of
| valuable ideas and I see a tendency to downplay these ideas in
| favor of new technologies. Storing relationships as RDF triples
| and building indices that are well-suited to your queries will
| be good enough for the majority of people who need an ontology.
|
| Is this basically the idea of using RDF to create explicit
| joins between datasets in other systems? E.g. each tuple has a
| foreign key reference to a row/doc that represents that node in
| the graph?
|
| >That turned into a bit of a ramble, but damn ontologies are
| cool. :)
|
| This is a major hurdle in infosec as enterprises adopt cloud
| products. The traditional ontologies around the risk profiles
| and state of operating systems and networks have been quite
| stable over the past 10-20 years and leverage things like
| DMTF's CIM and IETF's SNMP. However, what I'm finding is that
| 'cloud' products and services surface entirely new namespaces
| and concepts in nearly every instance. Sometimes, as in the
| case of AWS EC2, it's pretty simple to map these back to an
| existing schema, but in other cases like say AWS Chime or Glue,
| it gets really tricky. It feels like I'm looking for a system
| that can rapidly ingest the service's native information model
| then adapt it with overlays that transform or aggregate various
| services in a way that allows consistent data modelling.
|
| Any suggestions there? :)
| zomglings wrote:
| re: RDF for joins. That's exactly right. You can keep things
| as light or as heavy as you need them to be regarding
| when/how/whether you validate foreign key constraints,
| whether you use a schema registry, etc. The simplicity of the
| format makes it really easy to build around.
|
| My only suggestion re: infosec ontologies is to not push too
| much functionality into the ontology all at once. I have had
| ontology projects end in spectacular failure, and all those
| failures have resulted from putting too much responsibility
| on the ontology from the very beginning.
|
| The ontology should be as dumb as possible to start with. In
| your case, the approach I would suggest is:
|
| 1. Ontology stores mapping between cloud provider resource
| schemas to your internal schema. These mappings can be
| represented in multiple ways - as a URL to a service which
| implements the mapping or as a JSON object in some DSL which
| defines the transformations or even as the code/binary that
| should be run to perform the transformation.
|
| 2. From the ontology, it should be easy to query/update the
| following things: current cloud provider schema for a given
| resource, internal schema corresponding to a cloud resource,
| and current mapping to the internal schema, the number of
| failed transformations from cloud schema to internal schema
| in the past {1 hour | 1 day | 1 week | ...}.
|
| 3. At first, you build the transformations from the cloud
| provider schemas to internal schema by hand.
|
| 4. A worker which alerts you when the cloud provider schemas
| change (should be easy to build on top of your ontology after
| step 2 - check the number of failures in transformation).
|
| 5. Now write a worker which statically analyzes the terraform
| code base to see when they changed the code related to a
| particular cloud API you are working with. Represent this in
| the ontology (e.g. "AWS Chime" -> "terraform:vblah.blah.blah"
| -> "<json resource definition>").
|
| 6. Write a worker which monitors updates to "terraform:*"
| edges and alerts you to update the mapping given
| the new JSON resource definition.
|
| 7. For some services, it will be easy to automate the
| transformation update. For these things, create a new
| ontology worker.
|
| I have found that it's better to let workers run idempotently
| on cronjobs/systemd timers than to implement event-driven
| workers.
|
| The whole thing will probably take a few months, but you will
| have production-ready code at the end of each step and start
| benefitting from the existence of the ontology right away.
|
| (Sounds like you are already at step 3 or 4?)
| jcims wrote:
| This is amazing, thank you! Most of this is a mental
| experiment at this point, but I might take some of this to
| start building it. Appreciate the note!
| leeuw01 wrote:
| Any word from the wise for those who want to get into
| ontologies?
|
| Furthermore, do you know about any good resources regarding the
| process of capturing information into models? I've found books
| about solutions like Ecore & OWL, but I fail to find books
| about the practice of modelling in general.
| zomglings wrote:
| I'm not so wise, my friend, but I do have some tips from hard
| experience:
|
| 1. Most people don't understand the importance of ontologies.
| It is an uphill battle to convince them. Build your
| ontologies quietly. They are a superpower.
|
| 2. A real ontology must extend far beyond the confines of the
| data store. You must think of it as also including the
| workers that bring data into the ontology, the workers which
| verify the consistency of the ontology, and the query engines
| through which users get data out of the ontology.
|
| 3. Start simple. Your ontologies should be as dumb as
| possible to start out with but they should support extension.
|
| 4. Learn one datastore really well and build your ontologies
| there. Chances are you don't need to know anything about OWL
| or neo4j or Spark or anything hyped. I use Postgres.
|
| 5. Idempotency is very important. This means, for example,
| that if you run a data collection worker twice with the same
| arguments, it should not create duplicate data in your
| ontology.
|
| 6. Build for a specific use case. Don't try to build a
| general purpose ontology. The kind of data you will be
| storing and the access patterns will determine a lot about
| the structure of your ontology and you should be accounting
| for these things.
|
| 7. Avoid event-driven architectures (e.g. IFTTT, AWS Lambda,
| etc.). Put your faith in idempotence and cron (actually, I
| use systemd but you get the picture); a sketch of this
| follows below.
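|
| A minimal sketch of tips 5 and 7, assuming SQLite (table and key
| names are made up): the worker upserts on a natural key, so
| re-running it from cron with the same arguments never duplicates
| data.
|
|     import sqlite3
|
|     db = sqlite3.connect(":memory:")
|     db.execute("""CREATE TABLE facts (
|         subject TEXT, predicate TEXT, obj TEXT,
|         PRIMARY KEY (subject, predicate))""")
|
|     def collect_worker(db, subject, predicate, obj):
|         """Safe on every cron tick: same arguments, same single row."""
|         db.execute(
|             "INSERT INTO facts VALUES (?, ?, ?) "
|             "ON CONFLICT (subject, predicate) "
|             "DO UPDATE SET obj = excluded.obj",
|             (subject, predicate, obj))
|
|     for _ in range(3):   # three cron ticks, still one row
|         collect_worker(db, "aws:chime", "schema_version", "2021-09-01")
|     print(db.execute("SELECT COUNT(*) FROM facts").fetchone())   # (1,)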
|
| Do you have an idea what kind of information you would want
| to build an ontology for?
| ctvo wrote:
| > If your knowledge base is sufficiently large, you will want
| to store less frequently accessed data on cheaper storage (e.g.
| HDD or even object storage like S3) and hotter data on more
| expensive storage (SSD or even maybe keep it in memory). I have
| always had to write this code by myself and haven't found a
| tool that helps to manage this very importance concern in
| ontology building.
|
| This is a general concern for a lot of areas, including
| observability. Lots of data, need to query it, need to store it
| cheaply with a time component to help differentiate.
|
| Take a look at Thanos and Grafana Labs Cortex. It's a mostly
| generalized architecture.
| zomglings wrote:
| Thank you. Didn't know about Thanos.
| amw-zero wrote:
| I am honestly intrigued any time I hear anyone mention RDF. You
| have had success using it? It wasn't exactly clear how you used
| it from what you said.
| zomglings wrote:
| I only use RDF triples as a storage format for edges. I don't
| normally take advantage of triple stores (e.g. apache Jena),
| sparql query engines, etc.
|
| It's just a simple format and it's easy to build different
| types of query/analysis capabilities around.
|
| The only thing that's a little awkward is applying types to
| items in an rdf triple (in a way that's computationally
| cheap). Typing is a general problem, though.
| LukeEF wrote:
| I strongly agree with this - RDF is fine and there are some
| great concepts in the SemWeb world (IRIs for example), but
| OWL and SPARQL are confusing and can be off-putting,
| especially when it is all rolled together. The latest
| iteration of our triple store actually moves radically
| towards linked documents or a 'document graph' to make it
| more understandable while keeping RDF at the base.
| pbourke wrote:
| Can you express graph attributes such as edge weight in
| RDF?
|
| I am familiar with RDF triples such as "Book", author,
| "Jane Doe". How would you express "10 minute phone call at
| time t between persons A and B"?
| mhitza wrote:
| Not previous commenter, but as with other problems in
| computing, I find that you need to add a level of
| indirection. "CallSessionX" ->
| participant -> "person A" "CallSessionX" ->
| participant -> "person B" "CallSessionX" ->
| duration -> "10 minute" "CallSessionX" -> at_time
| -> t
|
| Haven't used SPARQL or Cypher in many years (the latter used in
| Neo4j), but with the latter you could do a query along the
| lines of:
|
|     "person A" <- participant <- call_session -> participant -> "person B"
|
| To find calls between person A and person B, then you can
| add a variable binding on the exact node you're
| interested in (eg call_session) to extract/aggregate
| attributes.
| zomglings wrote:
| You could do it, but a triple wouldn't be the best choice
| of data structure. I would just add a new numerical
| column called "weight" to each relation in your RDF
| store:
|
|     "Harry Potter and the Philosopher's Stone","author","Jane Doe",0.005
|     "Harry Potter and the Philosopher's Stone","author","J.K. Rowling",1.0
|     ...
|
| You could easily export RDF triples from this if you
| needed to using rules like "The author for each book is
| determined by the '<book>-author-<name>' relation of
| highest weight".
|
| Edit: Sorry I didn't answer your question about "10
| minute phone call at time t between person A and person
| B".
|
| The best way I can think of to model this with triples
| is:
|
|     "call:<call_id>","started_at",t
|     "call:<call_id>","ended_at",t + <duration>
|     "person A","participant","call:<call_id>"
|     "person B","participant","call:<call_id>"
|
| But this may be overly complicated depending on how you
| want to query the data and what you want to do with it.
|
| Edit 2: mhitza already answered it better!
| https://news.ycombinator.com/item?id=28503097
| FpUser wrote:
| I've used RDF triples for some specific part of a bigger
| business data model. It was implemented on a regular database
| and used to encode things like "Joe Hates Propaganda" which
| is exactly what was needed and hence was a success (meaning
| that it had worked as intended).
| danieljh wrote:
| The Future is Tiny Graphs! [1]
|
| In the style of "Scalability! At What COST?" I've started a side-
| project bringing succinct data structures (rank/select [2],
| elias-fano coding) and modern instruction sets (popcount, pdep)
| to graphs.
|
| For some reason the article doesn't mention the field of graph
| compression and I think it shows: for example most if not all
| popular open source routing engines don't do anything regarding
| graph compression. It's time to change that! Reach out to me
| (preferably on the GitHub project) if you want to try some ideas :)
|
| [1] https://tinygraph.org [2]
| http://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf
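|
| A minimal sketch of the flavour of trick involved (offsets plus
| delta + varint coding over sorted neighbour lists; real succinct
| structures like Elias-Fano and rank/select bitvectors go further):
|
|     def varint(n):
|         """Unsigned LEB128-style varint."""
|         out = bytearray()
|         while True:
|             byte = n & 0x7F
|             n >>= 7
|             out.append(byte | (0x80 if n else 0))
|             if not n:
|                 return bytes(out)
|
|     def compress(neighbour_lists):
|         """Delta-encode each sorted list, varint-pack the deltas."""
|         blob, offsets = bytearray(), []
|         for nbrs in neighbour_lists:
|             offsets.append(len(blob))
|             prev = 0
|             for v in sorted(nbrs):
|                 blob += varint(v - prev)   # small deltas, small codes
|                 prev = v
|         offsets.append(len(blob))
|         return bytes(blob), offsets
|
|     blob, offsets = compress([[2, 5, 9], [1], [0, 1, 5, 6, 7]])
|     print(len(blob), offsets)              # 9 bytes, [0, 3, 4, 9]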
| LukeEF wrote:
| We built TerminusDB[1], our OSS knowledge graph, using succinct
| data structures. As we have an immutable delta-encoding approach
| to allow domain teams to build and curate data products that
| they then share, this has been a very fruitful combination. We
| wrote a technical white paper all about it! [2]
|
| [1] https://github.com/terminusdb/terminusdb [2]
| https://github.com/terminusdb/terminusdb/blob/dev/docs/white...
| [deleted]
| triska wrote:
| Very interesting! I especially like the expressed interest in
| _logic-based_ approaches for reasoning about graphs. Quoting from
| the article:
|
| _" Logic-based and declarative formalisms. Logic provides a
| unifying formalism for expressing queries, optimizations,
| integrity constraints, and integration rules. Starting from
| Codd's seminal insight relating logical formulae to relational
| queries, many first order (FO) logic fragments have been used to
| formally define query languages with desirable properties such as
| decidable evaluation. Graph query languages are essentially a
| syntactic variant of FO augmented with recursive capabilities.
| [...] The influence of logic is pivotal not only to database
| languages, but also as a foundation for combining logical
| reasoning with statistical learning in AI. Logical reasoning
| derives categorical notions about a piece of data by logical
| deduction."_
|
| This interest could be good news for vendors of logic programming
| languages such as Prolog and its syntactic subset Datalog. These
| languages enable very convenient reasoning about graphs and allow
| both storing and querying the database with the same simple
| language elements that can themselves also be analyzed and
| reasoned about with the same formalism.
|
| Imagine how convenient it would be to have a large graph database
| represented as a set of Prolog clauses that can be queried, run
| and analyzed with an efficient Prolog system! The data itself,
| queries, and any desired transformation and augmentation could be
| uniformly expressed in this way.
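|
| A minimal sketch of the kind of recursive rule involved, written as
| a naive bottom-up (Datalog-style) fixpoint in Python rather than in
| Prolog itself:
|
|     # Facts: edge/2.
|     edges = {("a", "b"), ("b", "c"), ("c", "d")}
|
|     def reachable(edges):
|         """reachable(X, Y) :- edge(X, Y).
|            reachable(X, Z) :- edge(X, Y), reachable(Y, Z)."""
|         facts = set(edges)
|         while True:
|             new = {(x, z) for (x, y) in edges
|                           for (y2, z) in facts if y == y2}
|             if new <= facts:
|                 return facts
|             facts |= new
|
|     print(sorted(reachable(edges)))
|     # [('a','b'), ('a','c'), ('a','d'), ('b','c'), ('b','d'), ('c','d')]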
| beaconstudios wrote:
| well if you think about it, a triple store is basically a
| database of propositional logic statements.
| LukeEF wrote:
| I know at least one vendor of a prolog based graph database [1]
| that hopes the interest continues to grow.
|
| [1] https://github.com/terminusdb/terminusdb
| sfvisser wrote:
| This has been the semantic web dream for decades now.
| triska wrote:
| Yes indeed, although the semantic web chose a much more
| convoluted set of technologies without the syntactic and
| conceptual coherence and simplicity of actual logic
| programming languages, thus making reasoning about everything
| extremely complex and inconvenient in comparison. I think
| this is a severe handicap when using these technologies in
| practice.
|
| All automated reasoning is by necessity exclusively
| _syntactic_ in nature since the meaning of processed symbols
| is inaccessible to a machine, therefore a careful
| consideration of syntactic aspects is very important to
| ensure easy analysis and meta-interpretation of logic
| programs and queries.
|
| As I see it, a good formalism for knowledge representation
| makes it easy to reason about the knowledge and also about
| all programs and queries that themselves reason about the
| knowledge, enabling easy meta-interpretation and coherent
| extension languages.
| molham wrote:
| We usually talk about several paradigms for data management and
| programming: navigational (e.g. CODASYL, OO databases/languages,
| most graph DBs), tensor-based (e.g. MOLAP, Matlab, TensorFlow,
| Pytorch), Map-Reduce (e.g. Hadoop and Spark before they started
| to disown their MR heritage). Semistructured (e.g. JSON) and
| dataframes (e.g. Pandas) are also important abstractions. These
| abstractions can be captured in the relational paradigm -- in
| other words we can think of them as special cases and represent
| them as views on relational models.
|
| It is clear from the last 50 years of our industry that the
| relational paradigm (eventually) always wins. It wins because it
| separates the _what_ from the _how_ and automates away a bunch of
| tricky programming that we have to do with the more imperative
| paradigms.
|
| -- the OLTP workload was originally supported by navigational
| systems and then switched to relational systems mostly based on
| 3rd normal form schemas
|
| -- the OLAP workload was originally supported by multi-dimensional
| array (tensor) systems, then switched to relational systems based
| on star and snowflake schemas
|
| -- the Big Data workload was originally supported by map-reduce
| systems, then switched to relational systems based on
| star/snowflake + JSON schemas.
|
| I believe that the Graph (via navigational), Reasoning (via
| procedural), and ML (via dataframe and tensor) workloads will
| move to relational systems based on graph normal form (GNF)
| schemas. Those relational systems won't be built like the SQL-
| based systems of today. They will need new relational language
| extensions, new semantic optimizers, new optimal join algorithms,
| materialized views that are maintained efficiently, new data
| structures, etc. We are working on such a system at RelationalAI.
| We're in limited preview now but hope to have an open preview by
| early next year.
| leetrout wrote:
| A previous coworker now works at Katana Graph and I've been
| keeping an eye on what they claim to be doing. Sounds very
| interesting.
|
| https://katanagraph.com/
| chubot wrote:
| It feels like this should have a citation to _Scalability! At
| What COST?_
|
| https://www.usenix.org/system/files/conference/hotos15/hotos...
|
| which evaluates distributed graph processing frameworks (circa
| 2015) on various problems and shows that they can be slower than
| a single threaded program on a laptop.
|
| They have a data set with 105 M nodes and 3.7B edges that fits in
| 14 GB, which all fits in memory (at least 128-256 GB of memory
| should be taken for granted in these types of problems). Maybe
| with a clever encoding you can store the entire social network of
| the world, or at least a FB-sized one. (Storing nodes as integers
| is efficient and useful for many graph algorithms. You can later
| "join" with the ancillary data using a fast non-graph distributed
| processing framework).
|
| This article mentions the challenges but doesn't mention the
| solution of simply using a single machine for big graphs:
|
| _Even seemingly minute HPAD variations, for example the graph's
| degree distribution, can have significant performance
| implications._
|
| _Graph processing systems rely on complex runtimes that combine
| software and hardware platforms. It can be a daunting task to
| capture system-under-test performance--including parallelism,
| distribution, streaming vs. batch operation_
|
| Anyone who has done programming with say MapReduce knows that it
| can be very elegant and fun, but also limiting. This is even more
| true of graph processing frameworks, as the article explains.
|
| If you simply put everything in memory and write native code
| (which isn't hard at all for batch jobs), then you have so much
| more flexibility with ANY algorithm, and especially graph
| algorithms. This type of programming is also fun! You don't have
| to fit your problem into the framework and switch frameworks when
| it doesn't fit (or more likely just accept 10-1000x performance
| decrease when it doesn't fit).
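|
| A minimal sketch of the single-machine approach, assuming node ids
| are already integers and the edge list fits in memory: connected
| components via union-find, single-threaded, no framework.
|
|     def connected_components(num_nodes, edges):
|         parent = list(range(num_nodes))
|
|         def find(x):
|             while parent[x] != x:
|                 parent[x] = parent[parent[x]]   # path halving
|                 x = parent[x]
|             return x
|
|         for a, b in edges:                      # one pass over edges
|             ra, rb = find(a), find(b)
|             if ra != rb:
|                 parent[ra] = rb
|
|         return [find(x) for x in range(num_nodes)]
|
|     print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
|     # [2, 2, 2, 4, 4] -- two components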
|
| ---
|
| Abstract:
|
| _We offer a new metric for big data platforms, COST, or the
| Configuration that Outperforms a Single Thread. The COST of a
| given platform for a given problem is the hardware configuration
| required before the platform outperforms a competent single-
| threaded implementation. COST weighs a system's scalability
| against the overheads introduced by the system, and indicates the
| actual performance gains of the system, without rewarding systems
| that bring substantial but parallelizable overheads._
| christkv wrote:
| Distributing a graph is a really hard problem. Does anyone know
| what the state of this is?
| topicseed wrote:
| Dgraph and ArangoDB seem to be doing it well.
| junon wrote:
| We ended up ditching Arango for a new project, despite it
| having decent enough graph handling.
|
| This is because, unlike many other DBs (namely Mongo, which
| we're trying desperately to move away from), you can't spin
| up a new instance and allow the application to idempotently
| initialize it.
|
| The multi-model engine of Arango seems great, but the DX
| surrounding its use is a bit subpar, and would greatly shift
| a lot of burden to the environment config, which only adds to
| potential security vulnerabilities.
|
| This is only made worse by the fact it has a severely limited
| granularity with security permissions.
| ta988 wrote:
| Security is another big issue with the open graph
| databases. Some closed ones have ok mechanisms. But we are
| not there yet.
| rrwright wrote:
| I'm the creator of thatDot Connect [1], a distributed
| asynchronous streaming graph interpreter, where we've been
| working on exactly this problem for about 7 years now. Scalable
| distributed graph processing has so many applications! We were
| fortunate enough to be funded by DARPA for a while working on
| automated APT detection for cyber-security. Some of our recent
| work has shown linear scalability of this approach on large
| clusters at high throughput. For example, we got to half a
| million complex graph events ingested per second on a combined
| read/write workload and could have just kept scaling. We're in
| the process of writing up this work and sharing all the details.
|
| [1] https://www.thatdot.com/technology/
| amw-zero wrote:
| I don't have a number of course, but I believe that almost all
| relational db models are more naturally modeled as graphs. In
| fact, all normalized relational db models that use foreign keys
| are just implementing a graph - foreign keys represent edges
| between nodes.
|
| has-many, has-one, and many-to-many are all graph relationships.
| The relational model + foreign keys is just one way to encode a
| graph.
| pphysch wrote:
| Yeah, tables/rows are basically vertices with 1 label & many
| properties*, and FKs are labeled edges without properties. But,
| you can easily implement edge properties with many-to-many
| tables, at the cost of an extra JOIN. In practice, this is a
| useful and efficient species of graph.
|
| I think many people get baited by Neo4j's marketing without
| really understanding relational models and RDBMS (I sure did).
|
| * With the (often true) assumption that many vertices will
| follow the same normalized schema.
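|
| A minimal sketch of that encoding, assuming SQLite: vertices as
| rows, edges as a many-to-many table carrying the edge properties,
| neighbours via one JOIN per hop.
|
|     import sqlite3
|
|     db = sqlite3.connect(":memory:")
|     db.executescript("""
|         CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
|         -- The join table is the edge; its columns are the
|         -- edge properties.
|         CREATE TABLE follows (src INTEGER REFERENCES person(id),
|                               dst INTEGER REFERENCES person(id),
|                               since TEXT);
|     """)
|     db.executemany("INSERT INTO person VALUES (?, ?)",
|                    [(1, "alice"), (2, "bob")])
|     db.execute("INSERT INTO follows VALUES (1, 2, '2020-01-01')")
|
|     # "Who does alice follow, and since when?"
|     print(db.execute("""
|         SELECT p.name, f.since
|         FROM follows f JOIN person p ON p.id = f.dst
|         WHERE f.src = 1""").fetchall())   # [('bob', '2020-01-01')]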
| bawolff wrote:
| Graphs are the least restricting data model. I think everything
| is easier to model as a graph, because it's just things
| (vertices) and how they are related (edges). You mostly get to
| skip the modelling step.
|
| The benefit of modelling as a relational db is that it forces
| you to put the data in a form that's easy to work with from a
| computer perspective.
| Anon84 wrote:
| If you're interested in learning more about graphs, you might
| find my "Graphs for Data Science" substack:
| https://graphs4sci.substack.com/ where I work through using graph
| perspectives to analyze real world datasets.
| graphviz wrote:
| Someone on the internet is wrong! Or maybe just partially
| correct.
|
| This is very interesting work. There is some selling going on,
| and not enough discussion of the issues that drove the database
| community toward relational databases and away from "network"
| databases in the 1970s.
|
| If you think of visualization systems as a microcosm of more
| general computing, notice people need more than graphs - they
| need all kinds of tables, charts, maps, and other things. Same
| with "graph databases".
|
| This area does exist because the database community underinvested
| in computing on graphs and networks, the same as it did with
| usability for a long time. It is still catching up but there is a
| lot of great work.
|
| Another problem is helping users to understand the cost of what
| they are trying to compute. Select and join? I can estimate that.
| Find betweenness centrality of a large network? Find a tour that
| visits all the cycles of a network exactly once? All the edges?
| All nodes? Oops. What about graph object types? Is it OK to mix
| them?
|
| Sometimes I wonder arguments that "graph databases are [more]
| efficient for linked structures." After 20 or 30 years of intense
| R&D, relational databases already seem pretty efficient for
| computing with relationships.
| _wldu wrote:
| SQL struggles with hierarchical data (trees), which are graphs.
| pphysch wrote:
| In practice most hierarchies are shallow (or poorly designed).
|
| Country, Province/State, City, Neighborhood/ZIP, Building,
| Room. 5 jumps to go from planetary to personal. 5 JOINs is
| not unreasonable.
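|
| A minimal sketch of handling arbitrary-depth hierarchies in SQL
| with a recursive CTE (SQLite here, with a hypothetical place
| table):
|
|     import sqlite3
|
|     db = sqlite3.connect(":memory:")
|     db.execute("CREATE TABLE place (id INTEGER PRIMARY KEY,"
|                " name TEXT, parent_id INTEGER)")
|     db.executemany("INSERT INTO place VALUES (?, ?, ?)", [
|         (1, "Country", None), (2, "Province", 1),
|         (3, "City", 2), (4, "Building", 3),
|     ])
|
|     # All ancestors of "Building", however deep the tree is.
|     print(db.execute("""
|         WITH RECURSIVE ancestors(id, name, parent_id) AS (
|             SELECT id, name, parent_id FROM place
|             WHERE name = 'Building'
|             UNION ALL
|             SELECT p.id, p.name, p.parent_id
|             FROM place p JOIN ancestors a ON p.id = a.parent_id
|         )
|         SELECT name FROM ancestors""").fetchall())
|     # [('Building',), ('City',), ('Province',), ('Country',)]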
| [deleted]
___________________________________________________________________
(page generated 2021-09-12 23:01 UTC)