[HN Gopher] Super-Structured Data: Rethinking the Schema
       ___________________________________________________________________
        
       Super-Structured Data: Rethinking the Schema
        
       Author : mccanne
       Score  : 84 points
       Date   : 2022-05-17 14:19 UTC (8 hours ago)
        
 (HTM) web link (www.brimdata.io)
 (TXT) w3m dump (www.brimdata.io)
        
       | loquisgon wrote:
| I got interested when JSON and relational were contrasted as
| points on a spectrum, so I read the whole thing. I got lost and
| disheartened when the new terminology was introduced, starting
| with the "super-structured" name, and it went completely downhill
| with the other z names. Maybe it's just me, and maybe it's like
| quantum mechanics or any other innovation where the new names
| don't make sense and feel ugly.
        
       | bthomas wrote:
       | I didn't follow this part:
       | 
       | > EdgeDB is essentially a new data silo whose type system cannot
       | be used to serialize data external to the system.
       | 
       | I think this implies that serializing external data to zson is
       | easier than writing an INSERT into edgedb, but not sure why that
       | would be.
        
       | SPBS wrote:
       | This is a data serialization format, not a replacement for
       | storing your business data. Your business data _needs_ to have
       | the same schema enforced everywhere, otherwise how are you going
       | to reconcile your user data now and your user data 5 months ago
       | if their schemas are radically different?
        
       | kmerroll wrote:
| Interesting discussion, but it's buried in a lot of legacy
| thinking about schemas, and personally I don't find a Yet-
| Another-Schema-Abstraction (YASA)(tm) layer very compelling when
| better solutions in functional programming and semantic
| ontologies are far ahead in this area.
| 
| I suggest looking into JSON-LD, which was intended to solve many
| of the type and validation use cases around schemas.
        
         | abraxaz wrote:
| To pile on a bit here: JSON-LD is based on RDF, which is an
| abstract syntax for data as semantic triples (i.e. RDF
| statements). There is also RDF*, still in development, which
| extends this basic data model to make statements about
| statements.
| 
| RDF has concrete syntaxes, one of them being JSON-LD, and it
| can be used to model relational databases fairly well with
| R2RML (https://www.w3.org/TR/r2rml/), which essentially turns
| relational databases into a concrete syntax for RDF.
| 
| schema.org is also based on RDF and is essentially an ontology
| (one of many) that can be used for RDF and non-RDF data - mainly
| because almost all data can be represented as RDF, so non-RDF
| data is just data that does not have a formal mapping to RDF
| yet.
| 
| Ontologies are a concept used frequently in RDF but rarely
| outside of it; they are quite important for federated or
| distributed knowledge, or for descriptions of entities. An
| ontology focuses heavily on modelling properties rather than
| objects, so whenever a property occurs it can be understood
| within the context of an ontology.
| 
| An example is the birth date of a person
| (https://schema.org/birthDate).
         | 
         | When I get a semantic triple:
         | 
         | <example:JohnSmith> <https://schema.org/birthDate>
         | "2000-01-01"^^<https://schema.org/Date>
         | 
| This tells me that the entity identified by the IRI
| <example:JohnSmith> is a person and that their birth date is
| 2000-01-01. However, I don't expect to get all other
| descriptions of this person at the same time; I won't
| necessarily get their <https://schema.org/nationality>, for
| example, even though that is a property of a
| <https://schema.org/Person> defined by schema.org.
         | 
         | I can also combine https://schema.org/ based descriptions with
         | other descriptions, and these descriptions can be merged from
         | multiple sources and then queried together using SPARQL.
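          | 
          | For concreteness, a rough Python sketch (assuming rdflib is
          | installed; the example.org IRI here is made up) that loads
          | one such triple and runs a SPARQL query over it:
          | 
          |     from rdflib import Graph
          | 
          |     data = """
          |     @prefix schema: <https://schema.org/> .
          |     <http://example.org/JohnSmith>
          |         schema:birthDate "2000-01-01"^^schema:Date .
          |     """
          | 
          |     g = Graph()
          |     g.parse(data=data, format="turtle")
          | 
          |     q = """
          |     PREFIX schema: <https://schema.org/>
          |     SELECT ?person ?birthDate
          |     WHERE { ?person schema:birthDate ?birthDate }
          |     """
          |     for row in g.query(q):
          |         print(row.person, row.birthDate)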
        
         | ducharmdev wrote:
         | You're right, I think Yet-Another-Schema-Solution (YASS)(tm)
         | would be much more compelling!
         | 
         | (Please forgive me)
        
         | vlmutolo wrote:
         | What are the "better solutions in functional programming and
         | semantic ontologies"? What would I Google?
        
       | CharlesW wrote:
       | The words "anarchy" and "authoritarianism" seem unnecessarily
       | emotional and pejorative, and because of their semantic baggage I
       | personally wouldn't use them in a professional situation. The
       | author counts on the emotional color of those words to attempt an
       | argument that both are somehow bad.
       | 
       | Instead of those words I'd suggest something like "schema on
       | write" vs. "schema on read", or "persisted structured" vs.
       | "persisted unstructured". "Document" vs. "relational" doesn't
       | quite capture it, since unstructured data can have late-binding
       | relations applied at read time, and structured data doesn't have
       | to be relational.
       | 
       | And of course, modern relational databases can store unstructured
       | data as easily as structured data.
        
       | munro wrote:
       | I love the idea of getting rid of tables, when developing
       | application code I'm often thinking in terms of Maps/Sets/Lists--
       | I wish I could just take that code and make it persistent.
       | PRIMARY KEY is really like a map. Also I wish I had transactional
       | memory in my application. Not sure what the future looks like,
       | but I am loving all this development in the database space.
        
       | natemcintosh wrote:
       | So it sounds like one of the advantages of the Zed ecosystem is
       | that its data can go into three file formats (zson, zng, zst),
        | each designed for a specific use case, and you can convert
        | between them easily and without loss.
       | 
       | And it seems like the newer "zed lake" format is like a large
        | blob managed by a server. Can you also convert data between the
        | file formats and the lake format? What is the lake's main use
        | case?
        
       | simonw wrote:
       | > The idea here is that instead of manually creating schemas,
       | what if the schemas were automatically created for you? When
       | something doesn't fit in a table, how about automatically adding
       | columns for the missing fields?
       | 
       | I've been experimenting with this approach against SQLite for a
       | few years now, and I really like it.
       | 
        | My sqlite-utils package does exactly this. Try running this on
        | the command line:
        | 
        |     brew install sqlite-utils
        |     echo '[
        |       {"id": 1, "name": "Cleo"},
        |       {"id": 2, "name": "Azy", "age": 1.5}
        |     ]' | sqlite-utils insert /tmp/demo.db creatures - --pk id
        |     sqlite-utils schema /tmp/demo.db
        | 
        | It outputs the generated schema:
        | 
        |     CREATE TABLE [creatures] (
        |       [id] INTEGER PRIMARY KEY,
        |       [name] TEXT,
        |       [age] FLOAT
        |     );
       | 
       | When you insert more data you can use the --alter flag to have it
       | automatically create any missing columns.
       | 
        | Full documentation here:
        | https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
        | 
        | It's also available as a Python library:
        | https://sqlite-utils.datasette.io/en/stable/python-api.html
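        | 
        | For the Python library, a rough equivalent of the CLI example
        | above looks like this (a sketch, assuming sqlite-utils is
        | installed):
        | 
        |     import sqlite_utils
        | 
        |     db = sqlite_utils.Database("/tmp/demo.db")
        |     # First insert creates the table and infers column types
        |     db["creatures"].insert({"id": 1, "name": "Cleo"}, pk="id")
        |     # alter=True adds the missing "age" column automatically
        |     db["creatures"].insert(
        |         {"id": 2, "name": "Azy", "age": 1.5}, alter=True
        |     )
        |     print(db["creatures"].schema)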
        
       | feoren wrote:
       | Wow, what a waste of time. I've been doing it correctly for so
       | long that I forget that virtually everyone else on the planet has
       | no idea how to build a good data model. What pisses me off is
       | that I actually have the right answer on how to avoid all of this
       | pain, but if I typed it out here I'd either waste my time and get
       | ignored or (much, much less likely) get my idea poached. It takes
       | hours to fully communicate anyway. What do you do when you know
       | you're sitting on an approach & tech that could revolutionize the
       | X-hundred-billion-dollar data management industry but you can
       | barely even get your own fucking employer to take you seriously?
       | 
       | Anyway this article is crap and gets everything wrong, just like
       | all of you do. Whatever, nothing to see here I guess.
        
       | thinkharderdev wrote:
       | Arrow has union types (as well as structs and dictionary types).
        | Parquet doesn't, but I think it has an intentionally shallow
        | type system to allow flexibility in encoding. Basically everything is
       | either a numeric or binary and the logical type for binary
       | columns is defined in metadata. So you can use, for instance,
       | Arrow as the encoding.
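        | 
        | For instance, a quick pyarrow sketch (field names made up) of
        | a schema that nests a union, a struct, and a dictionary type:
        | 
        |     import pyarrow as pa
        | 
        |     # Union of int64 and string, dense encoding
        |     union_t = pa.union(
        |         [pa.field("i", pa.int64()), pa.field("s", pa.string())],
        |         mode="dense",
        |     )
        |     dict_t = pa.dictionary(pa.int32(), pa.string())
        |     schema = pa.schema([
        |         pa.field("rec", pa.struct([
        |             pa.field("a", union_t),
        |             pa.field("tag", dict_t),
        |         ])),
        |     ])
        |     print(schema)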
        
       | ccleve wrote:
       | tldr; Don't use relational tables or unstructured document
       | databases. Instead use structured types. The "schema" here is
       | ultimately a collection of independent objects / classes with
       | well-defined fields.
       | 
       | Ok, fine. But I'm not sure how this helps if you have six
       | different systems with six different definitions of a customer,
       | and more importantly, different relationships between customers
       | and other objects like orders or transactions or locations or
       | communications.
       | 
       | I don't see their approach as ground-breaking, but it is
       | definitely worthy of discussion.
        
         | abraxaz wrote:
         | > Ok, fine. But I'm not sure how this helps if you have six
         | different systems with six different definitions of a customer,
         | and more importantly, different relationships between customers
         | and other objects like orders or transactions or locations or
         | communications.
         | 
         | If you have this problem, consider giving RDF a look - you can
          | fairly easily use RDF-based technologies to map the data in
          | these systems onto a common model. Some examples of tools that
          | may be useful here are https://www.w3.org/TR/r2rml/ and
          | https://github.com/ontop/ontop - you can also use JSON-LD to
         | convert most JSON data to RDF. For more info ask in
         | https://gitter.im/linkeddata/chat
        
         | HelloNurse wrote:
         | It helps if this machinery can reject data and thus perform
         | validation. Since recursive construction of union types (valid
         | records can look like this, or also like that...) is trivial, a
         | programmer somewhere has to draw the line between "loosen the
         | schema to allow this record" and "reject this record to enforce
         | the schema".
        
           | mccanne wrote:
           | Author here. Agreed! Validation is important. While I didn't
            | make this point in the article, our thinking is that schema
            | validation does not require the serialization format to use
            | schemas as its building block: you can always implement
            | schema (or type) validation (and versioning) on top of
            | super-structured data (as can also be done with document
            | databases).
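            | 
            | As a toy illustration (just a Python sketch, not Zed
            | itself): infer a value's deep type, then validate by
            | checking it against an allowed set of types:
            | 
            |     def deep_type(v):
            |         # Recursively describe a JSON-like value
            |         if isinstance(v, dict):
            |             fields = tuple(
            |                 (k, deep_type(x)) for k, x in v.items()
            |             )
            |             return ("record", fields)
            |         if isinstance(v, list):
            |             elems = {deep_type(x) for x in v}
            |             return ("array", tuple(sorted(elems, key=repr)))
            |         return type(v).__name__
            | 
            |     # The "schema" is just a set of accepted deep types
            |     allowed = {deep_type({"s": "foo", "a": [1, "bar"]})}
            | 
            |     def validate(v):
            |         return deep_type(v) in allowed
            | 
            |     print(validate({"s": "x", "a": [2, "y"]}))  # True
            |     print(validate({"s": 1}))                   # False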
        
             | cmollis wrote:
              | This is a major hassle when converting from Avro (from
              | Kafka, which uses a schema registry, so schemas are not
              | shipped with the Avro data) and storing in Parquet, which
              | requires a schema in the file but lets you 'upgrade' it
              | with another schema when reading it. It would be great to
              | have a binary protocol-like format (schema-less Avro) and
              | a schema-less columnar storage format... which I guess is
              | what these guys are doing.
        
       | [deleted]
        
       | difflens wrote:
       | Perhaps I don't understand their use case fully, but it seems to
       | me that every schema can be defined as a child protobuf message,
       | and each child can then be added to a oneof field of a parent
       | protobuf message. This way, you get the strict/optional type
       | checks that are required, and the efficiency and ecosystem around
       | protobufs.
        
         | mccanne wrote:
         | Author here. This totally makes sense. The challenge here is
         | you need to store the type definitions somewhere (e.g., in the
         | .proto files) and any system that processes protocol buffers
         | needs to know which proto to apply to which messages. The theme
         | of super-structured data is that this type structure should be
         | native to the serialized data and our premise is this leads to
         | better DX (though Zed is early and the jury is out). Perhaps
         | flexbuffers is a closer analogy, which I should have mentioned
         | in the article.
        
       | mamcx wrote:
       | Note: The relational model (even SQL) is THIS.
       | 
       | Despite the claims, SQL is NOT "schema-fixed".
       | 
       | You can 100% create new schemas, alter them and modify them.
       | 
        | What actually happens is that if you have a CENTRAL repository
        | of data (aka "source of truth"), then you bet you wanna "freeze"
        | your schemas (because it is like an API, where you need to
        | fulfill contracts).
       | 
       | --
       | 
        | SQL's big limitation is its lack of composability, and the
        | biggest reason "NoSQL" works is this: a JSON value is
        | composable, while "stringy" SQL is not. If SQL were really
        | built around "relations, tuples" like this (stealing from my
        | project, TablaM):
        | 
        |     [Customer id:i32, name:Str; 1, "Jhon"]
        | 
        | then developers would have less reason to go elsewhere.
        
         | flappyeagle wrote:
         | why hasn't someone built a composable flavor of SQL? it seems
         | like a burning need
        
           | pjungwir wrote:
           | This is what Tutorial D is, but it's never been widely
           | adopted.
        
         | zmgsabst wrote:
         | I'm not sure what you mean by "composable" here -- could you
         | elaborate?
        
           | mamcx wrote:
           | Composable is the ability to define things in the small and
           | combine with confidence.
           | 
            | SQL does not allow this:
            | 
            |     by_id := WHERE id = $1
            |     SELECT * | by_id
        
         | mccanne wrote:
         | Author here. All good points. Yes, you can build a super-
         | structured type system on top of tables. EdgeDB does this well.
          | And you can put JSON into relational columns. Then you might
          | ask what the "type" of that column is. Well, if you want deep
          | types, the type varies from row to row as the JSON values
          | vary, and you have to walk the JSON to determine the type.
          | SQL implementations are beginning to try to deal with
         | maybe we should think differently about the problem and build
         | tables on top of types as a special case of a type system. This
         | also gives a very nice way to get data into and out of systems
         | without having to go through the messiness of ODBC and special
         | casing tables vs tuples vs scalars etc.
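          | 
          | A small sketch of that "walk the JSON" problem with a JSON
          | column, using Python's sqlite3 (assumes the bundled SQLite
          | has the JSON1 functions):
          | 
          |     import sqlite3
          | 
          |     con = sqlite3.connect(":memory:")
          |     con.execute(
          |         "CREATE TABLE t (id INTEGER PRIMARY KEY, doc TEXT)"
          |     )
          |     rows = [
          |         (1, '{"a": 1}'),
          |         (2, '{"a": "one"}'),
          |         (3, '{"a": [1, "x"]}'),
          |     ]
          |     con.executemany(
          |         "INSERT INTO t (id, doc) VALUES (?, ?)", rows
          |     )
          |     # The declared column type is just TEXT; the type of
          |     # doc.a differs per row and is only discovered by
          |     # inspecting each value
          |     q = "SELECT id, json_type(doc, '$.a') FROM t"
          |     for row in con.execute(q):
          |         print(row)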
        
           | cryptonector wrote:
           | Normalize to the max then denormalize till you achieve the
           | performance trade-offs you want. That's the rule in
           | relational schema design.
           | 
           | Adding JSON traversal operators and functions helps a lot
           | when you end up denormalizing bits of the schema. It's not
           | hard.
        
           | mamcx wrote:
            | This is true and is a limitation of SQL (not of the
            | relational model per se), and it is also part of the problem
            | that SQL is not composable (so you don't have a way to nest
            | table definitions).
        
           | vosper wrote:
           | You mentioned EdgeDB in the blog post, too, but I just think
           | you and them are dealing with different problems.
           | 
           | My understanding of EdgeDB is they're mostly trying to make
           | correct data-modeling simpler and more intuitive; to let
           | people model relations in the same way they speak and think
           | about it, rather than having to map to SQL concepts like join
           | tables. I rather like what they're going for, though I
           | haven't used it.
           | 
           | EdgeDB seems to be mostly for business logic and OLTP.
           | They're not trying to deal with arbitrary incoming data that
           | might be outside of the control of the ingestion system. You
           | wouldn't even have an ingestion system with EdgeDB.
        
       | troelsSteegin wrote:
       | It looks like the use case is specifying types for dataflow
       | operators (aka endpoints for dataflow pipes) [0] and I surmise
       | composition should be super easy. I was surprised not to see any
       | mention of XML or XML Schema as prior art, especially with their
       | discussion of schema registries. Edit: Oh, the point of reference
       | is Kafka [1]
       | 
       | [0] https://zed.brimdata.io/docs/language/overview/ [1]
       | https://docs.confluent.io/platform/current/schema-registry/i...
        
         | hbarka wrote:
         | I also thought about XML. It has the Document Object Model
         | (DOM), the structure which describes the data.
        
       | anentropic wrote:
       | The first few sections of this post nearly lost me, waffling on
       | about NoSQL vs whatever.
       | 
       | Eventually we get to the meat:
       | 
       | > _For example, the JSON value_
       | {"s":"foo","a":[1,"bar"]}
       | 
        | > _would traditionally be called "schema-less" and in fact is
        | said to have the vague type "object" in the world of JavaScript
        | or "dict" in the world of Python. However, the super-structured
       | interpretation of this value's type is instead:_
       | 
       | > _type record with field s of type string and field a of type
       | array of type union of types integer and string_
       | 
       | > _We call the former style of typing a "shallow" type system and
       | the latter style of typing a "deep" type system. The hierarchy of
       | a shallow-typed value must be traversed to determine its
       | structure whereas the structure of a deeply-typed value is
       | determined directly from its type._
       | 
       | This is a bit confusing, since JSON data commonly has an implicit
       | schema, or "deep type system" as this post calls it, and if you
       | consume data in any statically-typed language you will
       | materialise the implicit "deep" types in your host language.
       | 
       | So it seems that ZSON is sort of like a TypeScript-ified version
       | of JSON, where the implicit types are made explicit.
       | 
        | It seems the point is not to have an external schema that
        | documents must comply with, so I guess at the end of the day it
        | has a similar aim to other "self-describing" message formats
        | like https://amzn.github.io/ion-docs/ ? i.e. each message has
        | its own schema.
       | 
       | So the interesting part is perhaps the new data tools to work
       | with large collections of self-describing messages?
        
         | vosper wrote:
         | > The first few sections of this post nearly lost me, waffling
         | on about NoSQL vs whatever.
         | 
         | Since the author of the blog post is here, I'll just jump in to
          | agree with this part: there is a _lot_ of unnecessary
          | background text before we get to the meat of it. I don't
          | think people
         | need a history lesson on NoSQL and SQL, and IMO the
         | "authoritarianism" metaphor is a stretch, and that word has
         | pretty negative connotations.
         | 
         | I think there's some value in setting the scene, but I think
         | you will lose readers before they get to the much more
         | interesting content further down. I recommend revising it to be
         | a lot shorter.
        
       ___________________________________________________________________
       (page generated 2022-05-17 23:01 UTC)