[HN Gopher] The Future Is Big Graphs: A Community View on Graph ...
       ___________________________________________________________________
        
       The Future Is Big Graphs: A Community View on Graph Processing
       Systems
        
       Author : Anon84
       Score  : 130 points
       Date   : 2021-09-12 10:45 UTC (12 hours ago)
        
 (HTM) web link (cacm.acm.org)
 (TXT) w3m dump (cacm.acm.org)
        
       | ThePhysicist wrote:
       | I think graph processing has great potential for use cases where
       | the data model is heterogeneous and the structure of a problem
       | really can be modeled as a graph problem. I think the tooling and
       | the technologies are still not nearly as mature as e.g.
       | relational databases and most developers have very little
       | exposure to graph models, so adoption is still quite low, but
       | will probably rise as the cloud providers offer hosted graph
       | databases. In 2015 I was quite excited for TitanDB and the
       | TinkerPop stack but I think the acquisition of the team behind it
       | (Aurelius) by DataStax kind of sucked the air out of the
       | community edition. As far as I understand Amazon is using many
       | ideas and concepts from TitanDB in their Neptune graph database,
       | but unfortunately they don't seem to care to open-source it. I
       | don't think graph databases & graph processing will become
        | mainstream though, as they're just not a good fit for most
       | problems.
        
         | ta988 wrote:
         | Neptune sucked the air out of blazegraph as well... It is a
         | constant problem. Graph databases will only ever become more
          | mainstream when we get a real postgresql-level database in terms
          | of reliability, maintenance, and trust in its evolution.
        
           | zozbot234 wrote:
           | PostgreSQL is a fine graph database. It would be nice if it
           | supported the graph-like query language that's now being
           | standardized as an extension to SQL, but other than that it
           | works quite well.
        
         | lumost wrote:
         | Unfortunately graph DBs are a solution in search of a problem.
          | The performance of a graph traversal in all current
          | implementations will be at best equivalent to a well-crafted
          | join, and at worst much slower.
         | 
         | I'm unclear on where truly heterogeneous data problems would
         | emerge outside of Q&A applications and potentially CRM. Usually
         | the work to acquire a dataset is so large that there is limited
         | reason not to properly model the data.
        
           | amw-zero wrote:
            | > The performance of a graph traversal in all current
            | implementations will be at best equivalent to a well-crafted
            | join, and at worst much slower.
           | 
           | A join must be implemented with a loop / scan to find the
           | relevant target rows by foreign key - This is why foreign
           | keys pretty much have to be indexed, to give the scan any
           | hope of acceptable performance. So, remember, a join in a
           | relational database generally means a join PLUS an index
           | lookup.
           | 
           | If it's not clear, indexes are not free - they have storage
           | cost as well as imposing a cost on writes to the table.
            | Indexes do make lookups very fast, but still not free. On a
            | large table, you may be incurring the cost of 20-30
            | comparisons to find the desired rows (not a made-up number:
            | assuming a btree index which reduces lookups to O(log_2(n)),
            | you need about 20 comparisons to find a record in a table
            | with 1,000,000 rows, and about 27 in a table with
            | 100,000,000 rows).
           | 
           | Graph databases (at least neo4j) always talk about index-free
           | adjacency. Now you hopefully understand what this means. The
           | edge between two nodes in a graph database is stored directly
           | with no need for an index to find the destination node - the
           | traversal is index-free. This can be a tremendous advantage,
            | and that's also not made up. You can subtract the cost of the
            | index lookups that I gave earlier.
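            | 
            | To make the arithmetic concrete, a rough sketch of both ideas
            | (illustrative only; real B-trees have high fan-out and real
            | engines differ in their constants):
            | 
            |     import math
            | 
            |     # the model above: a btree lookup costs about
            |     # ceil(log2(n)) key comparisons
            |     def btree_comparisons(row_count: int) -> int:
            |         return math.ceil(math.log2(row_count))
            | 
            |     print(btree_comparisons(1_000_000))     # -> 20
            |     print(btree_comparisons(100_000_000))   # -> 27
            | 
            |     # index-free adjacency: the edge stores a direct reference
            |     # to its destination, so a hop is one dereference
            |     class Node:
            |         def __init__(self, name):
            |             self.name = name
            |             self.out_edges = []   # direct references to Nodes
            | 
            |     a, b = Node("a"), Node("b")
            |     a.out_edges.append(b)   # traversing a -> b: no lookup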
        
             | jandrewrogers wrote:
             | These methods only work for very small graphs though.
             | Secondary indexing structures are inefficient and don't
              | scale at all regardless of the implementation details, which
              | is why even non-graph systems don't use them at large scales
              | on single machines. The scale threshold where using indexes
              | is worse than the alternatives is low. Direct representations
              | without indexes (like your Neo4j example) are even worse due
              | to the latency introduced for each step
             | by very poor locality. There is no possible version of
             | these methods that will work well; this would have been a
             | solved problem 30 years ago otherwise.
             | 
             | Latency is the performance killer for graph traversals, so
             | locality is everything, and preserving locality across
             | recursive traversals is the critical computer science trick
             | that is often missing. The computational cost of a
             | comparison is irrelevant to the performance of graph
             | traversal.
             | 
             | The way most "pure" graph systems implement graph
             | traversals is unusably slow outside of trivial cases
             | because they mitigate latency poorly. In practical graph
             | systems, common traversals are typically materialized ahead
             | of time so that a graph traversal operation is not required
             | to satisfy queries, which defeats the point in many cases.
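              | 
              | One way to make the locality point concrete: a compressed
              | sparse row (CSR) layout keeps each vertex's neighbours in
              | one contiguous slice, so a traversal step is a sequential
              | scan rather than a pointer chase (a minimal sketch, not how
              | any particular engine lays out its data):
              | 
              |     import numpy as np
              | 
              |     # neighbours of vertex v live in the contiguous slice
              |     # adj[offsets[v]:offsets[v + 1]]
              |     offsets = np.array([0, 2, 3, 5, 6])     # 4 vertices
              |     adj     = np.array([1, 2, 2, 0, 3, 0])  # neighbour lists
              | 
              |     def neighbours(v):
              |         return adj[offsets[v]:offsets[v + 1]]
              | 
              |     print(neighbours(2))   # -> [0 3]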
        
               | rajman187 wrote:
               | Very well summarized, particularly about the point of
               | locality.
               | 
                | In terms of the intractability of large graph problems,
                | here's an example of a scalable solution by Facebook to
                | represent the social graph, which provides incredible
                | performance (measured in terms of response times, queries
                | per second, etc.): https://www.usenix.org/system/files/conf
               | erence/atc13/atc13-b...
        
       | lmeyerov wrote:
       | Some slightly different views based on our experiences with
        | startups/techco's/enterprises/govs. A lot of what's there is
        | true, but at the same time, we've been seeing major appetite in 2
        | areas that lead in quite different directions from the article.
       | 
       | To be clear, I do think the CACM authors are right about a lot of
       | what is commercially + academically popular in big graph land
       | today. Ex: Powering better knowledge graphs, especially for NLP
       | neural network empowered use cases (BERT/GPT). Ex: Enriching
       | inputs into traditional neural network inference calls, like a
       | user 360 before making a recommendation. I expect graph DBs to
        | keep improving to help with these.
       | 
       | At the same time, working with teams from first-principles to
       | help skate to where the puck is going, we've come to 2 overall
       | different priorities where those ideas aren't the primary
       | technology enablers for big graph use cases in terms of
       | engineering + big $$$$ enablers. Broadly, it's useful to think
        | through what Data Analysts are doing and what Data Scientists are
       | doing:
       | 
       | 1. Graph BI: Similar to Tableau, Looker, and more autoML-y
       | successors, there are a LOT of analysts who are trying to better
       | understand all that data in their data warehouses. Operational
       | data in Splunk/ELK, transactions & business data in SalesForce
       | CRM / RedShift transactions / etc. Graph questions are powerful
       | here. Graphs are all about relationships like user/cohort
        | behavior, so they open up a lot of questions/answers for analysts,
       | especially if made accessible. However, from a data engineering
       | perspective, copying that data into graph DBs as a second system
       | of record is historically a "now you have 2 problems" thing, and
       | part of why most data lake projects fail. So we've been focusing
       | on building graph query pushdown + virtual fabric style projects
       | that expose graph thinking over your existing data infra. The
       | discussion in the CACM article on dynamic graphs touches on this,
        | but that's only one piece of that problem. Getting analyst graph
        | insights over warehouses to be roughly as easy as basic time
        | questions on events are today requires a lot more. Surprisingly,
        | we suspect so much more that it may commoditize the particular
        | choice of graph compute engine (vs. say
        | connectors/virtualization/UI/automation/AI).
       | 
       | 2. Graph AI: Traditional neural networks have been increasingly
       | dominating due to strong ability to combine categorical &
       | continuous data around entities of interest (think computer
       | vision realizing some pixels are a cat)... but struggled to
       | leverage relationship structure: imagine a communication or
       | transaction between people/devices/companies with rich histories
        | & communities. Being able to fuse those matters a LOT in
        | commercial contexts, so we'd been watching graph neural networks
        | for a while. Over the last ~2 years, with the mounting focus by
        | the AI community, key problems around performance, heterogeneous
        | data, etc., have been getting solved, so GNNs are starting to
       | show up in all sorts of places now. Interestingly, most companies
       | using them don't plug graph databases or knowledge graphs or even
       | traditional graph compute engines into them. If you look at how
        | pytorch, dgl, etc. are being built, they largely ignore what
        | traditional graph infra vendors focus on as 1-3 orders of
        | magnitude too slow and, overall, the wrong tool for the job: a
        | different problem space.
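        | 
        | To make that concrete: in those libraries a graph is usually just
        | a couple of integer tensors plus feature matrices, fed straight
        | from a warehouse or feature store rather than a graph DB (a toy
        | sketch with made-up shapes, not any particular pipeline):
        | 
        |     import torch
        | 
        |     # a graph as GNN tooling typically sees it: plain tensors
        |     edge_index = torch.tensor([[0, 1, 2, 3],   # source ids
        |                                [1, 2, 3, 0]])  # destination ids
        |     x = torch.randn(4, 16)   # per-node features, pre-encoded
        | 
        |     # one round of mean-style message passing, the core primitive
        |     src, dst = edge_index
        |     agg = torch.zeros_like(x).index_add_(0, dst, x[src])
        |     deg = torch.zeros(x.size(0)).index_add_(
        |         0, dst, torch.ones(src.size(0)))
        |     h = agg / deg.clamp(min=1).unsqueeze(1)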
       | 
       | In both cases, teams building/using graph DB or graph compute
       | engines, like the CACM authors, get enlisted to help as natural
       | stakeholders for pieces of the problem. Likewise, they naturally
       | want to help + grow their scope. Unfortunately, from a
       | perspective of "you're lucky to do even 1-3 things well", they've
       | been struggling to provide the needed end-to-end solutions. There
       | are plenty of critical & fascinating scale, design, + use case
       | challenges to tackle far beyond the core db/compute engine
       | perspective here.
       | 
       | Anyways, this is an awesome time for this stuff so I encourage
       | getting involved! We're actively hiring for help creating these
       | new categories (all levels: fullstack, UI, BI, data, AI) +
       | launching projects with lighthouse design partners, so if either
        | is relevant, please reach out! Likewise, traditional graph
        | DB/compute is important too (see: CACM article), and it's been a
        | no-brainer for attracting most of the VC attention for the last
        | few years due to being a traditional DB play. We've enjoyed
        | partnering with friends at those companies, so I'd recommend
        | looking at them as well.
        
       | wirthjason wrote:
       | Anyone have an idea how a graph database like Neo4J is
       | implemented? I imagine it as a unique query language (cypher)
       | sitting on top of a traditional relational database.
        
         | rajman187 wrote:
         | Neo4J uses index-free adjacency which means each node is
         | storing the address of where its edges point to.
         | 
         | This raises an interesting question about "big" graphs and
         | distributed systems---storing the address in memory across
         | machines. So Neo4J, last I looked into it, is very fast for
         | reads using a master-slave setup but may not scale well beyond
         | a single machine.
         | 
          | Other approaches such as RedisGraph use GraphBLAS sparse
          | matrices, but again you're limited to a matrix representation.
         | 
          | And yet others, like TitanDB (bought by DataStax; the open-
          | source community forked it into JanusGraph) use graph
          | abstractions that sit on top of distributed systems (HBase,
          | Cassandra,
         | Dynamo, etc) and these rely on adjacency lists and large
         | indices. So the idea of "big graph" has been around for at
         | least 6 years in the NoSQL world, but indeed everything is not
         | a graph problem, tempting as that may be to claim.
         | 
         | Good article on native graph representation:
         | https://dzone.com/articles/letter-regarding-native-graph
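          | 
          | For the matrix-representation point: in that model a traversal
          | hop is just a sparse matrix-vector product (a rough sketch
          | using SciPy, not how RedisGraph/GraphBLAS is implemented):
          | 
          |     import numpy as np
          |     from scipy.sparse import csr_matrix
          | 
          |     # A[i, j] = 1 iff there is an edge i -> j
          |     A = csr_matrix(np.array([[0, 1, 1, 0],
          |                              [0, 0, 1, 0],
          |                              [0, 0, 0, 1],
          |                              [1, 0, 0, 0]]))
          | 
          |     frontier = np.array([1, 0, 0, 0])   # start at node 0
          |     one_hop = A.T @ frontier            # neighbours of frontier
          |     two_hop = A.T @ (one_hop > 0).astype(int)
          |     print(one_hop, two_hop)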
        
         | samsquire wrote:
         | From Neo4j themselves - edges are referenced by index position
         | in a file.
         | 
         | https://neo4j.com/developer/kb/understanding-data-on-disk/
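          | 
          | Roughly: fixed-size records addressed by position, where a node
          | record points at its first relationship record and relationship
          | records chain onwards, so no index is needed to walk edges (a
          | simplified sketch of the idea, not Neo4j's actual format):
          | 
          |     # node record -> position of its first relationship record;
          |     # relationship records chain via "next"
          |     nodes = [{"first_rel": 0},     # node 0
          |              {"first_rel": 2},     # node 1
          |              {"first_rel": None}]  # node 2
          |     rels = [{"target": 1, "next": 1},     # rel 0: 0 -> 1
          |             {"target": 2, "next": None},  # rel 1: 0 -> 2
          |             {"target": 2, "next": None}]  # rel 2: 1 -> 2
          | 
          |     def neighbours(node_id):
          |         rel_id = nodes[node_id]["first_rel"]
          |         while rel_id is not None:
          |             yield rels[rel_id]["target"]
          |             rel_id = rels[rel_id]["next"]
          | 
          |     print(list(neighbours(0)))   # -> [1, 2]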
        
         | tokamak-teapot wrote:
         | Labelled property graph: https://neo4j.com/blog/rdf-triple-
         | store-vs-labeled-property-...
        
         | heinrichhartman wrote:
          | Neo4J is a native database built on the JVM. It does not use
          | any underlying relational DBs. IIRC, graph links are
          | implemented as direct references to other node objects (if they
          | are in memory), which makes traversal way faster than what you
          | would get with relational DBs.
        
         | thanatos519 wrote:
          | RedisGraph uses GraphBLAS sparse matrices, which deliver
          | astounding performance.
         | 
         | It's not as feature-complete as Neo4J but it's a promising
         | start.
        
         | [deleted]
        
         | malthejorgensen wrote:
          | I'd imagine it's more like an adjacency list structure with
          | various indexes (similar to a regular relational DB) to allow
          | lookups based on node properties.
        
       | zomglings wrote:
       | I have spent the last 7 years building ontologies (knowledge
       | bases), small and large. Like with most tech articles, this one
       | is pushing a very specific point of view that is relevant to
       | megacorporations but certainly not to most companies that would
       | benefit from the deployment of graph-structured knowledge bases.
       | 
       | Unless a company has tons of extra cash to burn using
       | technologies like neo4j, Spark, etc., they are better off just
       | building an ontology on top of their existing datastores.
       | 
       | The work on the semantic web in the 2000s added a lot of valuable
       | ideas and I see a tendency to downplay these ideas in favor of
       | new technologies. Storing relationships as RDF triples and
       | building indices that are well-suited to your queries will be
       | good enough for the majority of people who need an ontology.
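        | 
        | Concretely, that can be as simple as a plain triple list plus
        | whichever indexes match your access patterns (a minimal sketch;
        | the "acme:" identifiers are made up):
        | 
        |     from collections import defaultdict
        | 
        |     # edges as RDF-style triples: (subject, predicate, object)
        |     triples = [
        |         ("acme:order/1", "placed_by", "acme:user/7"),
        |         ("acme:user/7",  "lives_in",  "acme:city/berlin"),
        |     ]
        | 
        |     # build only the indexes your queries actually need
        |     by_subject, by_predicate = defaultdict(list), defaultdict(list)
        |     for s, p, o in triples:
        |         by_subject[s].append((p, o))
        |         by_predicate[p].append((s, o))
        | 
        |     print(by_subject["acme:user/7"])
        |     # -> [('lives_in', 'acme:city/berlin')]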
       | 
       | I have found that graph-based systems don't model time well. When
       | you are building a knowledge base, you often need to understand
       | how knowledge has changed over time. You may want to relate facts
       | across time or understand what the state of knowledge was within
       | a given period of time. This usually takes extra work.
       | 
       | Another thing that I have found these graph-based systems to do
       | poorly is the management of storage classes for different types
       | of data. If you are building a graph knowledge base, not all the
       | information in your knowledge base will be consumed in the same
       | way. If your knowledge base is sufficiently large, you will want
       | to store less frequently accessed data on cheaper storage (e.g.
       | HDD or even object storage like S3) and hotter data on more
        | expensive storage (SSD, or maybe even keep it in memory). I have
        | always had to write this code myself and haven't found a tool
        | that helps to manage this very important concern in ontology
        | building.
       | 
       | That turned into a bit of a ramble, but damn ontologies are cool.
       | :)
        
         | jcims wrote:
         | >The work on the semantic web in the 2000s added a lot of
         | valuable ideas and I see a tendency to downplay these ideas in
         | favor of new technologies. Storing relationships as RDF triples
         | and building indices that are well-suited to your queries will
         | be good enough for the majority of people who need an ontology.
         | 
         | Is this basically the idea of using RDF to create explicit
         | joins between datasets in other systems? E.g. each tuple has a
         | foreign key reference to a row/doc that represents that node in
         | the graph?
         | 
         | >That turned into a bit of a ramble, but damn ontologies are
         | cool. :)
         | 
         | This is a major hurdle in infosec as enterprises adopt cloud
         | products. The traditional ontologies around the risk profiles
         | and state of operating systems and networks have been quite
         | stable over the past 10-20 years and leverage things like
         | DMTF's CIM and IETF's SNMP. However, what I'm finding is that
         | 'cloud' products and services surface entirely new namespaces
         | and concepts in nearly every instance. Sometimes, as in the
         | case of AWS EC2, it's pretty simple to map these back to an
         | existing schema, but in other cases like say AWS Chime or Glue,
         | it gets really tricky. It feels like I'm looking for a system
         | that can rapidly ingest the service's native information model
         | then adapt it with overlays that transform or aggregate various
         | services in a way that allows consistent data modelling.
         | 
         | Any suggestions there? :)
        
           | zomglings wrote:
           | re: RDF for joins. That's exactly right. You can keep things
           | as light or as heavy as you need them to be regarding
           | when/how/whether you validate foreign key constraints,
           | whether you use a schema registry, etc. The simplicity of the
           | format makes it really easy to build around.
           | 
           | My only suggestion re: infosec ontologies is to not push too
           | much functionality into the ontology all at once. I have had
           | ontology projects end in spectacular failure, and all those
           | failures have resulted from putting too much responsibility
           | on the ontology from the very beginning.
           | 
           | The ontology should be as dumb as possible to start with. In
           | your case, the approach I would suggest is:
           | 
            | 1. The ontology stores mappings from cloud provider resource
            | schemas to your internal schema (see the sketch after this
            | list). These mappings can be represented in multiple ways - as
            | a URL to a service which implements the mapping, as a JSON
            | object in some DSL which defines the transformations, or even
            | as the code/binary that should be run to perform the
            | transformation.
           | 
            | 2. From the ontology, it should be easy to query/update the
            | following things: the current cloud provider schema for a
            | given resource, the internal schema corresponding to a cloud
            | resource, the current mapping to the internal schema, and the
            | number of failed transformations from cloud schema to internal
            | schema in the past {1 hour | 1 day | 1 week | ...}.
           | 
           | 3. At first, you build the transformations from the cloud
           | provider schemas to internal schema by hand.
           | 
            | 4. Add a worker which alerts you when the cloud provider
            | schemas change (should be easy to build on top of your
            | ontology after step 2 - check the number of transformation
            | failures).
           | 
            | 5. Now write a worker which statically analyzes the terraform
            | code base to see when the code related to a particular cloud
            | API you are working with has changed. Represent this in the
            | ontology (e.g. "AWS Chime" - "terraform:vblah.blah.blah" ->
            | "<json resource definition>").
           | 
            | 6. Write a worker which monitors updates to "terraform:*"
            | edges and alerts you to update the mapping given the new JSON
            | resource definition.
           | 
           | 7. For some services, it will be easy to automate the
           | transformation update. For these things, create a new
           | ontology worker.
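            | 
            | As a minimal sketch of what the step-1 mapping registry can
            | look like as plain triples (names and URLs here are entirely
            | made up):
            | 
            |     # hypothetical mapping registry: (subject, predicate, object)
            |     mappings = [
            |         ("aws:chime", "maps_to", "internal:conferencing"),
            |         ("aws:chime", "mapping_impl",
            |          "https://transforms.internal.example/chime.json"),
            |         ("aws:glue", "maps_to", "internal:etl_job"),
            |         ("terraform:vblah.blah.blah", "defines", "aws:chime"),
            |     ]
            | 
            |     def query(triples, s=None, p=None, o=None):
            |         # tiny pattern match; None acts as a wildcard
            |         return [t for t in triples
            |                 if (s is None or t[0] == s)
            |                 and (p is None or t[1] == p)
            |                 and (o is None or t[2] == o)]
            | 
            |     # e.g. everything we currently know about AWS Chime
            |     print(query(mappings, s="aws:chime"))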
           | 
           | I have found that it's better to let workers run idempotently
           | on cronjobs/systemd timers than to implement event-driven
           | workers.
           | 
           | The whole thing will probably take a few months, but you will
           | have production-ready code at the end of each step and start
           | benefitting from the existence of the ontology right away.
           | 
           | (Sounds like you are already at step 3 or 4?)
        
             | jcims wrote:
             | This is amazing, thank you! Most of this is a mental
             | experiment at this point, but I might take some of this to
             | start building it. Appreciate the note!
        
         | leeuw01 wrote:
         | Any word from the wise for those who want to get into
         | ontologies?
         | 
         | Furthermore, do you know about any good resources regarding the
         | process of capturing information into models? I've found books
         | about solutions like Ecore & OWL, but I fail to find books
         | about the practice of modelling in general.
        
           | zomglings wrote:
           | I'm not so wise, my friend, but I do have some tips from hard
           | experience:
           | 
           | 1. Most people don't understand the importance of ontologies.
           | It is an uphill battle to convince them. Build your
           | ontologies quietly. They are a superpower.
           | 
           | 2. A real ontology must extend far beyond the confines of the
           | data store. You must think of it as also including the
           | workers that bring data into the ontology, the workers which
           | verify the consistency of the ontology, and the query engines
           | through which users get data out of the ontology.
           | 
           | 3. Start simple. Your ontologies should be as dumb as
           | possible to start out with but they should support extension.
           | 
           | 4. Learn one datastore really well and build your ontologies
           | there. Chances are you don't need to know anything about OWL
           | or neo4j or Spark or anything hyped. I use Postgres.
           | 
           | 5. Idempotency is very important. This means, for example,
           | that if you run a data collection worker twice with the same
           | arguments, it should not create duplicate data in your
           | ontology.
           | 
           | 6. Build for a specific use case. Don't try to build a
           | general purpose ontology. The kind of data you will be
           | storing and the access patterns will determine a lot about
           | the structure of your ontology and you should be accounting
           | for these things.
           | 
            | 7. Avoid event-driven architectures (e.g. IFTTT, AWS Lambda,
            | etc.). Put your faith in idempotence and cron (actually, I
            | use systemd, but you get the picture).
           | 
           | Do you have an idea what kind of information you would want
           | to build an ontology for?
        
         | ctvo wrote:
         | > If your knowledge base is sufficiently large, you will want
         | to store less frequently accessed data on cheaper storage (e.g.
         | HDD or even object storage like S3) and hotter data on more
         | expensive storage (SSD or even maybe keep it in memory). I have
         | always had to write this code by myself and haven't found a
         | tool that helps to manage this very importance concern in
         | ontology building.
         | 
         | This is a general concern for a lot of areas, including
         | observability. Lots of data, need to query it, need to store it
         | cheaply with a time component to help differentiate.
         | 
          | Take a look at Thanos and Grafana Labs' Cortex. It's a mostly
          | generalized architecture.
        
           | zomglings wrote:
           | Thank you. Didn't know about Thanos.
        
         | amw-zero wrote:
          | I am honestly intrigued any time I hear anyone mention RDF.
          | Have you had success using it? It wasn't exactly clear from
          | what you said how you used it.
        
           | zomglings wrote:
           | I only use RDF triples as a storage format for edges. I don't
           | normally take advantage of triple stores (e.g. apache Jena),
           | sparql query engines, etc.
           | 
           | It's just a simple format and it's easy to build different
           | types of query/analysis capabilities around.
           | 
            | The only thing that's a little awkward is applying types to
            | items in an RDF triple (in a way that's computationally
            | cheap). Typing is a general problem, though.
        
             | LukeEF wrote:
             | I strongly agree with this - RDF is fine and there are some
             | great concepts in the SemWeb world (IRIs for example), but
             | OWL and SPARQL are confusing and can be off-putting,
             | especially when it is all rolled together. The latest
             | iteration of our triple store actually moves radically
             | towards linked-document or a 'document graph' to make it
             | more understandable while keeping RDF at the base.
        
             | pbourke wrote:
             | Can you express graph attributes such as edge weight in
             | RDF?
             | 
             | I am familiar with RDF triples such as "Book", author,
             | "Jane Doe". How would you express "10 minute phone call at
             | time t between persons A and B"?
        
               | mhitza wrote:
               | Not previous commenter, but as with other problems in
               | computing, I find that you need to add a level of
                | indirection.
                | 
                |     "CallSessionX" -> participant -> "person A"
                |     "CallSessionX" -> participant -> "person B"
                |     "CallSessionX" -> duration -> "10 minute"
                |     "CallSessionX" -> at_time -> t
                | 
                | Haven't used SPARQL or Cypher (the latter is used in
                | Neo4j) in many years, but with Cypher you could do a query
                | along the lines of
                | 
                |     "person A" <- participant <- call_session -> participant -> "person B"
                | 
                | to find calls between person A and person B. Then you can
                | add a variable binding on the exact node you're interested
                | in (e.g. call_session) to extract/aggregate attributes.
        
               | zomglings wrote:
               | You could do it, but a triple wouldn't be the best choice
               | of data structure. I would just add a new numerical
               | column called "weight" to each relation in your RDF
                | store:
                | 
                |     "Harry Potter and the Philosopher's Stone","author","Jane Doe",0.005
                |     "Harry Potter and the Philosopher's Stone","author","J.K. Rowling",1.0
                |     ...
               | 
                | You could easily export RDF triples from this if you
                | needed to, using rules like "The author for each book is
                | determined by the '<book>-author-<name>' relation of
                | highest weight".
               | 
               | Edit: Sorry I didn't answer your question about "10
               | minute phone call at time t between person A and person
               | B".
               | 
               | The best way I can think of to model this with triples
                | is:
                | 
                |     "call:<call_id>","started_at",t
                |     "call:<call_id>","ended_at",t + <duration>
                |     "person A","participant","call:<call_id>"
                |     "person B","participant","call:<call_id>"
               | 
               | But this may be overly complicated depending on how you
               | want to query the data and what you want to do with it.
               | 
               | Edit 2: mhitza already answered it better!
               | https://news.ycombinator.com/item?id=28503097
        
           | FpUser wrote:
           | I've used RDF triples for some specific part of a bigger
           | business data model. It was implemented on a regular database
           | and used to encode things like "Joe Hates Propaganda" which
           | is exactly what was needed and hence was a success (meaning
           | that it had worked as intended).
        
       | danieljh wrote:
       | The Future is Tiny Graphs! [1]
       | 
       | In the style of "Scalability! At What COST?" I've started a side-
       | project bringing succinct data structures (rank/select [2],
       | elias-fano coding) and modern instruction sets (popcount, pdep)
       | to graphs.
       | 
       | For some reason the article doesn't mention the field of graph
       | compression and I think it shows: for example most if not all
       | popular open source routing engines don't do anything regarding
        | graph compression. It's time to change that! Reach out to me
        | (preferably on the GitHub project) if you want to try some ideas
        | :)
       | 
       | [1] https://tinygraph.org [2]
       | http://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf
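        | 
        | For a taste of what rank/select means, a deliberately naive
        | sketch (real succinct structures answer both in O(1) with
        | precomputed blocks plus the popcount instruction; this only
        | shows the semantics):
        | 
        |     # rank(i): number of set bits in bits[:i]
        |     # select(k): position of the k-th set bit (1-indexed)
        |     bits = [1, 0, 1, 1, 0, 0, 1, 0]
        | 
        |     def rank(i):
        |         return sum(bits[:i])
        | 
        |     def select(k):
        |         seen = 0
        |         for pos, b in enumerate(bits):
        |             seen += b
        |             if seen == k:
        |                 return pos
        |         raise ValueError("fewer than k set bits")
        | 
        |     print(rank(4), select(3))   # -> 3 3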
        
         | LukeEF wrote:
         | We built TerminusDB[1], our OSS knowledge graph, using succinct
          | data structures. As we have an immutable delta-encoding
          | approach that allows domain teams to build and curate data
          | products that they then share, this has been a very fruitful
          | combination. We wrote a technical white paper all about it! [2]
         | 
         | [1] https://github.com/terminusdb/terminusdb [2]
         | https://github.com/terminusdb/terminusdb/blob/dev/docs/white...
        
       | [deleted]
        
       | triska wrote:
       | Very interesting! I especially like the expressed interest in
       | _logic-based_ approaches for reasoning about graphs. Quoting from
       | the article:
       | 
       |  _" Logic-based and declarative formalisms. Logic provides a
       | unifying formalism for expressing queries, optimizations,
       | integrity constraints, and integration rules. Starting from
       | Codd's seminal insight relating logical formulae to relational
       | queries, many first order (FO) logic fragments have been used to
       | formally define query languages with desirable properties such as
       | decidable evaluation. Graph query languages are essentially a
       | syntactic variant of FO augmented with recursive capabilities.
       | [...] The influence of logic is pivotal not only to database
       | languages, but also as a foundation for combining logical
       | reasoning with statistical learning in AI. Logical reasoning
       | derives categorical notions about a piece of data by logical
       | deduction."_
       | 
       | This interest could be good news for vendors of logic programming
       | languages such as Prolog and its syntactic subset Datalog. These
       | languages enable very convenient reasoning about graphs and allow
       | both storing and querying the database with the same simple
       | language elements that can themselves also be analyzed and
       | reasoned about with the same formalism.
       | 
       | Imagine how convenient it would be to have a large graph database
       | represented as a set of Prolog clauses that can be queried, run
       | and analyzed with an efficient Prolog system! The data itself,
       | queries, and any desired transformation and augmentation could be
       | uniformly expressed in this way.
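        | 
        | For the flavour of the recursive-rule style, here is the classic
        | reachability example (path(X,Y) :- edge(X,Y). and path(X,Z) :-
        | edge(X,Y), path(Y,Z).) transliterated into a naive Python
        | fixpoint, purely as an illustration:
        | 
        |     edges = {("a", "b"), ("b", "c"), ("c", "d")}
        | 
        |     def reachable(edge_facts):
        |         path = set(edge_facts)
        |         while True:
        |             new = {(x, z)
        |                    for (x, y) in edge_facts
        |                    for (y2, z) in path if y == y2} - path
        |             if not new:
        |                 return path
        |             path |= new
        | 
        |     print(sorted(reachable(edges)))
        |     # -> every ordered pair connected by a directed path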
        
         | beaconstudios wrote:
         | well if you think about it, a triple store is basically a
         | database of propositional logic statements.
        
         | LukeEF wrote:
         | I know at least one vendor of a prolog based graph database [1]
         | that hopes the interest continues to grow.
         | 
         | [1] https://github.com/terminusdb/terminusdb
        
         | sfvisser wrote:
         | This has been the semantic web dream for decades now.
        
           | triska wrote:
           | Yes indeed, although the semantic web chose a much more
           | convoluted set of technologies without the syntactic and
           | conceptual coherence and simplicity of actual logic
           | programming languages, thus making reasoning about everything
           | extremely complex and inconvenient in comparison. I think
           | this is a severe handicap when using these technologies in
           | practice.
           | 
           | All automated reasoning is by necessity exclusively
           | _syntactic_ in nature since the meaning of processed symbols
           | is inaccessible to a machine, therefore a careful
           | consideration of syntactic aspects is very important to
           | ensure easy analysis and meta-interpretation of logic
           | programs and queries.
           | 
           | As I see it, a good formalism for knowledge representation
           | makes it easy to reason about the knowledge and also about
           | all programs and queries that themselves reason about the
           | knowledge, enabling easy meta-interpretation and coherent
           | extension languages.
        
       | molham wrote:
       | We usually talk about several paradigms for data management and
       | programming: navigational (e.g. CODASYL, OO databases/languages,
       | most graph DBs), tensor-based (e.g. MOLAP, Matlab, TensorFlow,
       | Pytorch), Map-Reduce (e.g. Hadoop and Spark before they started
       | to disown their MR heritage). Semistructured (e.g. JSON) and
       | dataframes (e.g. Pandas) are also important abstractions. These
       | abstractions can be captured in the relational paradigm -- in
       | other words we can think of them as special cases and represent
       | them as views on relational models.
       | 
       | It is clear from the last 50 years of our industry that the
       | relational paradigm (eventually) always wins. It wins because it
       | separates the _what_ from the _how_ and automates away a bunch of
       | tricky programming that we have to do with the more imperative
        | paradigms:
        | 
        | -- the OLTP workload was originally supported by navigational
        | systems and then switched to relational systems mostly based on
        | 3rd normal form schemas
        | 
        | -- the OLAP workload was originally supported by
        | multi-dimensional array (tensor) systems and then switched to
        | relational systems based on star and snowflake schemas
        | 
        | -- the Big Data workload was originally supported by map-reduce
        | systems and then switched to relational systems based on
        | star/snowflake + JSON schemas.
       | 
       | I believe that the Graph (via navigational), Reasoning (via
       | procedural), and ML (via dataframe and tensor) workloads will
       | move to relational systems based on graph normal form (GNF)
       | schemas. Those relational systems won't be built like the SQL-
       | based systems of today. They will need new relational language
       | extensions, new semantic optimizers, new optimal join algorithms,
       | materialized views that are maintained efficiently, new data
        | structures, etc. We are working on such a system at RelationalAI.
       | We're in limited preview now but hope to have an open preview by
       | early next year.
        
       | leetrout wrote:
       | A previous coworker now works at Katana Graph and I've been
       | keeping an eye on what they claim to be doing. Sounds very
       | interesting.
       | 
       | https://katanagraph.com/
        
       | chubot wrote:
       | It feels like this should have a citation to _Scalability! At
       | What COST?_
       | 
       | https://www.usenix.org/system/files/conference/hotos15/hotos...
       | 
       | which evaluates distributed graph processing frameworks (circa
       | 2015) on various problems and shows that they can be slower than
       | a single threaded program on a laptop.
       | 
       | They have a data set with 105 M nodes and 3.7B edges that fits in
       | 14 GB, which all fits in memory (at least 128-256 GB of memory
       | should be taken for granted in these types of problems). Maybe
       | with a clever encoding you can store the entire social network of
       | the world, or at least a FB-sized one. (Storing nodes as integers
       | is efficient and useful for many graph algorithms. You can later
       | "join" with the ancillary data using a fast non-graph distributed
       | processing framework).
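        | 
        | To illustrate the "nodes as integers, everything in memory"
        | style (a toy single-threaded sketch with a made-up edge list,
        | not the paper's implementation):
        | 
        |     import numpy as np
        | 
        |     # edge list with node ids already mapped to dense integers
        |     src = np.array([0, 0, 1, 2, 3])
        |     dst = np.array([1, 2, 2, 0, 0])
        |     n = 4
        | 
        |     out_deg = np.bincount(src, minlength=n)
        |     rank = np.full(n, 1.0 / n)
        |     d = 0.85
        |     for _ in range(20):   # single-threaded PageRank iterations
        |         contrib = rank[src] / out_deg[src]
        |         nxt = np.zeros(n)
        |         np.add.at(nxt, dst, contrib)
        |         rank = (1 - d) / n + d * nxt
        |     print(rank)   # assumes every node has an out-edge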
       | 
       | This article mentions the challenges but doesn't mention the
       | solution of simply using a single machine for big graphs:
       | 
       |  _Even seemingly minute HPAD variations, for example the graph 's
       | degree distribution, can have significant performance
       | implications._
       | 
       |  _Graph processing systems rely on complex runtimes that combine
       | software and hardware platforms. It can be a daunting task to
       | capture system-under-test performance--including parallelism,
       | distribution, streaming vs. batch operation_
       | 
       | Anyone who has done programming with say MapReduce knows that it
       | can be very elegant and fun, but also limiting. This is even more
       | true of graph processing frameworks, as the article explains.
       | 
       | If you simply put everything in memory and write native code
       | (which isn't hard at all for batch jobs), then you have so much
       | more flexibility with ANY algorithm, and especially graph
       | algorithms. This type of programming is also fun! You don't have
       | to fit your problem into the framework and switch frameworks when
       | it doesn't fit (or more likely just accept 10-1000x performance
       | decrease when it doesn't fit).
       | 
       | ---
       | 
       | Abstract:
       | 
       |  _We offer a new metric for big data platforms, COST, or the
       | Configuration that Outperforms a Single Thread. The COST of a
       | given platform for a given problem is the hardware configuration
       | required before the platform outperforms a competent single-
       | threaded implementation. COST weighs a system's scalability
       | against the overheads introduced by the system, and indicates the
       | actual performance gains of the system, without rewarding systems
       | that bring substantial but parallelizable overheads._
        
       | christkv wrote:
        | Distributing a graph is a really hard problem. Does anyone know
        | what the state of this is?
        
         | topicseed wrote:
          | Dgraph and ArangoDB seem to be doing it well.
        
           | junon wrote:
           | We ended up ditching Arango for a new project, despite it
           | having decent enough graph handling.
           | 
           | This is because, unlike many other DBs (namely Mongo, which
           | we're trying desperately to move away from), you can't spin
           | up a new instance and allow the application to idempotently
           | initialize it.
           | 
           | The multi-modal engine of Arango seems great, but the DX
           | surrounding its use is a bit subpar, and would greatly shift
           | a lot of burden to the environment config, which only adds to
           | potential security vulnerabilities.
           | 
           | This is only made worse by the fact it has a severely limited
           | granularity with security permissions.
        
             | ta988 wrote:
             | Security is another big issue with the open graph
             | databases. Some closed ones have ok mechanisms. But we are
             | not there yet.
        
       | rrwright wrote:
       | I'm the creator of thatDot Connect [1], a distributed
       | asynchronous streaming graph interpreter, where we've been
       | working on exactly this problem for about 7 years now. Scalable
       | distributed graph processing has so many applications! We were
       | fortunate enough to be funded by DARPA for a while working on
       | automated APT detection for cyber-security. Some of our recent
       | work has shown linear scalability of this approach on large
       | clusters at high throughput. For example, we got to half a
       | million complex graph events ingested per second on a combined
       | read/write workload and could have just kept scaling. We're in
       | the process of writing up this work and sharing all the details.
       | 
       | [1] https://www.thatdot.com/technology/
        
       | amw-zero wrote:
       | I don't have a number of course, but I believe that almost all
       | relational db models are more naturally modeled as graphs. In
       | fact, all normalized relational db models that use foreign keys
       | are just implementing a graph - foreign keys represent edges
       | between nodes.
       | 
       | has-many, has-one, and many-to-many are all graph relationships.
       | The relational model + foreign keys is just one way to encode a
       | graph.
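        | 
        | A trivial illustration of that encoding (made-up tables): rows
        | become nodes and each foreign key value becomes an edge:
        | 
        |     # made-up relational data: users(id), orders(id, user_id)
        |     users  = [{"id": 1}, {"id": 2}]
        |     orders = [{"id": 10, "user_id": 1},
        |               {"id": 11, "user_id": 1},
        |               {"id": 12, "user_id": 2}]
        | 
        |     # the same data read as a graph: the FK column is an edge list
        |     nodes = ([("user", u["id"]) for u in users]
        |              + [("order", o["id"]) for o in orders])
        |     edges = [(("order", o["id"]), "placed_by",
        |               ("user", o["user_id"])) for o in orders]
        | 
        |     print(edges[0])  # (('order', 10), 'placed_by', ('user', 1))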
        
         | pphysch wrote:
         | Yeah, tables/rows are basically vertices with 1 label & many
         | properties*, and FKs are labeled edges without properties. But,
         | you can easily implement edge properties with many-to-many
         | tables, at the cost of an extra JOIN. In practice, this is a
         | useful and efficient species of graph.
         | 
         | I think many people get baited by Neo4j's marketing without
         | really understanding relational models and RDBMS (I sure did).
         | 
         | * With the (often true) assumption that many vertices will
         | follow the same normalized schema.
        
         | bawolff wrote:
         | Graphs are the least restricting data model. I think everything
          | is easier to model as a graph, because it's just things
          | (vertices) and how they are related (edges). You mostly get to
          | skip the modelling step.
         | 
         | The benefit of modelling as a relational db is that it forces
         | you to put the data in a form that's easy to work with from a
         | computer perspective.
        
       | Anon84 wrote:
        | If you're interested in learning more about graphs, you might
        | find my "Graphs for Data Science" substack useful:
        | https://graphs4sci.substack.com/ where I work through using graph
        | perspectives to analyze real-world datasets.
        
       | graphviz wrote:
       | Someone on the internet is wrong! Or maybe just partially
       | correct.
       | 
        | This is very interesting work. There is some selling going on,
        | and not enough discussion of the issues that drove the database
        | community toward relational databases and away from "network"
        | databases in the 1970s.
       | 
       | If you think of visualization systems as a microcosm of more
       | general computing, notice people need more than graphs - they
       | need all kinds of tables, charts, maps, and other things. Same
       | with "graph databases".
       | 
       | This area does exist because the database community underinvested
       | in computing on graphs and networks, the same as it did with
       | usability for a long time. It is still catching up but there is a
       | lot of great work.
       | 
       | Another problem is helping users to understand the cost of what
       | they are trying to compute. Select and join? I can estimate that.
       | Find betweenness centrality of a large network? Find a tour that
       | visits all the cycles of a network exactly once? All the edges?
       | All nodes? Oops. What about graph object types? Is it OK to mix
       | them?
       | 
        | Sometimes I wonder about arguments that "graph databases are
        | [more] efficient for linked structures." After 20 or 30 years of
        | intense
       | R&D, relational databases already seem pretty efficient for
       | computing with relationships.
        
         | _wldu wrote:
         | SQL struggles with hierarchical data (trees), which are graphs.
        
           | pphysch wrote:
            | In practice most hierarchies are shallow (or poorly designed).
           | 
           | Country, Province/State, City, Neighborhood/ZIP, Building,
           | Room. 5 jumps to go from planetary to personal. 5 JOINs is
           | not unreasonable.
        
             | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-09-12 23:01 UTC)