[HN Gopher] Super-Structured Data: Rethinking the Schema
___________________________________________________________________
Super-Structured Data: Rethinking the Schema
Author : mccanne
Score : 84 points
Date : 2022-05-17 14:19 UTC (8 hours ago)
(HTM) web link (www.brimdata.io)
(TXT) w3m dump (www.brimdata.io)
| loquisgon wrote:
| I got interested when JSON and relational were contrasted as
| different points on a spectrum, so I read the whole thing. I
| got lost and disheartened when the new terminology was
| introduced, starting with the "super-structured" name, and it
| went completely downhill with the other z names. Maybe it's
| just me, and maybe it is like quantum mechanics and any other
| innovation, where the new names don't make sense and feel ugly.
| bthomas wrote:
| I didn't follow this part:
|
| > EdgeDB is essentially a new data silo whose type system cannot
| be used to serialize data external to the system.
|
| I think this implies that serializing external data to zson is
| easier than writing an INSERT into EdgeDB, but I'm not sure why
| that would be.
| SPBS wrote:
| This is a data serialization format, not a replacement for
| storing your business data. Your business data _needs_ to have
| the same schema enforced everywhere, otherwise how are you going
| to reconcile your user data now and your user data 5 months ago
| if their schemas are radically different?
| kmerroll wrote:
| Interesting discussion, but it's buried in a lot of legacy
| thinking about schemas, and personally I don't find a
| Yet-Another-Schema-Abstraction (YASA)(tm) layer very compelling
| when better solutions in functional programming and semantic
| ontologies are far ahead in this area.
|
| I suggest looking into JSON-LD, which was intended to solve
| many of the type and validation use cases related to schemas.
| abraxaz wrote:
| To pile on a bit here: JSON-LD is based on RDF, an abstract
| syntax for data as semantic triples (i.e. RDF statements).
| There is also RDF*, currently in development, which extends
| this basic data model to make statements about statements.
|
| RDF has concrete syntaxes, one of them being JSON-LD, and it
| can be used to model relational databases fairly well with
| R2RML (https://www.w3.org/TR/r2rml/), which essentially turns
| relational databases into a concrete syntax for RDF.
|
| schema.org is also based on RDF, and is essentially an ontology
| (one of many) that can be used for RDF and non-RDF data, mainly
| because almost all data can be represented as RDF, so non-RDF
| data is just data that does not have a formal mapping to RDF
| yet.
|
| Ontologies are a concept used frequently in RDF but rarely
| outside of it. They are quite important for federated or
| distributed knowledge, or for descriptions of entities. The
| approach focuses heavily on modelling properties instead of
| modelling objects, so whenever a property occurs it can be
| understood within the context of an ontology.
|
| An example is the birth date of a person
| (https://schema.org/birthDate).
|
| When I get a semantic triple:
|
| <example:JohnSmith> <https://schema.org/birthDate>
| "2000-01-01"^^<https://schema.org/Date>
|
| This tells me that the entity identified by the IRI
| <example:JohnSmith> is a person, and that their birth date is
| 2000-01-01. However, I don't expect to get all other
| descriptions of this person at the same time; I won't
| necessarily get their <https://schema.org/nationality>, for
| example, even though this is a property of a
| <https://schema.org/Person> defined by schema.org.
|
| I can also combine descriptions based on https://schema.org/
| with other descriptions, and these descriptions can be merged
| from multiple sources and then queried together using SPARQL.
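| For instance, the same statement in JSON-LD (a sketch; the
| example: prefix stands in for a real IRI, and here the Person
| type is stated explicitly):
|
|     {
|       "@context": "https://schema.org",
|       "@id": "example:JohnSmith",
|       "@type": "Person",
|       "birthDate": "2000-01-01"
|     }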
| ducharmdev wrote:
| You're right, I think Yet-Another-Schema-Solution (YASS)(tm)
| would be much more compelling!
|
| (Please forgive me)
| vlmutolo wrote:
| What are the "better solutions in functional programming and
| semantic ontologies"? What would I Google?
| CharlesW wrote:
| The words "anarchy" and "authoritarianism" seem unnecessarily
| emotional and pejorative, and because of their semantic baggage I
| personally wouldn't use them in a professional situation. The
| author counts on the emotional color of those words to attempt an
| argument that both are somehow bad.
|
| Instead of those words I'd suggest something like "schema on
| write" vs. "schema on read", or "persisted structured" vs.
| "persisted unstructured". "Document" vs. "relational" doesn't
| quite capture it, since unstructured data can have late-binding
| relations applied at read time, and structured data doesn't have
| to be relational.
|
| And of course, modern relational databases can store unstructured
| data as easily as structured data.
| munro wrote:
| I love the idea of getting rid of tables. When developing
| application code I'm often thinking in terms of
| Maps/Sets/Lists; I wish I could just take that code and make it
| persistent. A PRIMARY KEY is really like a map. I also wish I
| had transactional memory in my application. Not sure what the
| future looks like, but I am loving all this development in the
| database space.
| natemcintosh wrote:
| So it sounds like one of the advantages of the Zed ecosystem is
| that its data can go into three file formats (zson, zng, zst),
| each designed for a specific use case, and be converted between
| them easily and without loss.
|
| And it seems like the newer "zed lake" format is like a large
| blob managed by a server. Can you also convert data between the
| file formats and the lake format? What is the lake's main use
| case?
| simonw wrote:
| > The idea here is that instead of manually creating schemas,
| what if the schemas were automatically created for you? When
| something doesn't fit in a table, how about automatically adding
| columns for the missing fields?
|
| I've been experimenting with this approach against SQLite for a
| few years now, and I really like it.
|
| My sqlite-utils package does exactly this. Try running this on
| the command line:
|
|     brew install sqlite-utils
|     echo '[
|       {"id": 1, "name": "Cleo"},
|       {"id": 2, "name": "Azy", "age": 1.5}
|     ]' | sqlite-utils insert /tmp/demo.db creatures - --pk id
|     sqlite-utils schema /tmp/demo.db
|
| It outputs the generated schema:
|
|     CREATE TABLE [creatures] (
|        [id] INTEGER PRIMARY KEY,
|        [name] TEXT,
|        [age] FLOAT
|     );
|
| When you insert more data you can use the --alter flag to have it
| automatically create any missing columns.
|
| Full documentation here:
| https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
|
| It's also available as a Python library:
| https://sqlite-utils.datasette.io/en/stable/python-api.html
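| A rough equivalent with the Python library (a sketch; the
| "weight" field here is made up, just to show alter-style column
| creation):
|
|     import sqlite_utils
|
|     db = sqlite_utils.Database("/tmp/demo.db")
|     # alter=True adds any missing columns automatically
|     db["creatures"].insert(
|         {"id": 3, "name": "Bandit", "weight": 2.5},
|         pk="id",
|         alter=True,
|     )
|     # Prints the CREATE TABLE statement, now with [weight]
|     print(db["creatures"].schema)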
| feoren wrote:
| Wow, what a waste of time. I've been doing it correctly for so
| long that I forget that virtually everyone else on the planet has
| no idea how to build a good data model. What pisses me off is
| that I actually have the right answer on how to avoid all of this
| pain, but if I typed it out here I'd either waste my time and get
| ignored or (much, much less likely) get my idea poached. It takes
| hours to fully communicate anyway. What do you do when you know
| you're sitting on an approach & tech that could revolutionize the
| X-hundred-billion-dollar data management industry but you can
| barely even get your own fucking employer to take you seriously?
|
| Anyway this article is crap and gets everything wrong, just like
| all of you do. Whatever, nothing to see here I guess.
| thinkharderdev wrote:
| Arrow has union types (as well as structs and dictionary
| types). Parquet doesn't, but I think it has an intentionally
| shallow type system to allow flexibility in encoding: basically
| everything is either numeric or binary, and the logical type
| for binary columns is defined in metadata. So you can use, for
| instance, Arrow as the encoding.
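| For instance, a quick pyarrow sketch of an Arrow union type
| (the field names are arbitrary):
|
|     import pyarrow as pa
|
|     # A dense union that can hold either an int64 or a string,
|     # roughly the deep type of a JSON array like [1, "bar"]
|     u = pa.dense_union([pa.field("i", pa.int64()),
|                         pa.field("s", pa.string())])
|     print(u)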
| ccleve wrote:
| tldr; Don't use relational tables or unstructured document
| databases. Instead use structured types. The "schema" here is
| ultimately a collection of independent objects / classes with
| well-defined fields.
|
| Ok, fine. But I'm not sure how this helps if you have six
| different systems with six different definitions of a customer,
| and more importantly, different relationships between customers
| and other objects like orders or transactions or locations or
| communications.
|
| I don't see their approach as ground-breaking, but it is
| definitely worthy of discussion.
| abraxaz wrote:
| > Ok, fine. But I'm not sure how this helps if you have six
| different systems with six different definitions of a customer,
| and more importantly, different relationships between customers
| and other objects like orders or transactions or locations or
| communications.
|
| If you have this problem, consider giving RDF a look - you can
| fairly easily use RDF-based technologies to map the data in
| these systems onto a common model. Some examples of tools that
| may be useful here are https://www.w3.org/TR/r2rml/ and
| https://github.com/ontop/ontop - you can also use JSON-LD to
| convert most JSON data to RDF. For more info, ask in
| https://gitter.im/linkeddata/chat
| HelloNurse wrote:
| It helps if this machinery can reject data and thus perform
| validation. Since recursive construction of union types (valid
| records can look like this, or also like that...) is trivial, a
| programmer somewhere has to draw the line between "loosen the
| schema to allow this record" and "reject this record to enforce
| the schema".
| mccanne wrote:
| Author here. Agreed! Validation is important. While I didn't
| make this point in the article, our thinking is that schema
| validation does not require the serialization format to use
| schemas as its building block: you can always implement schema
| (or type) validation (and versioning) on top of
| super-structured data (as can also be done with document
| databases).
| cmollis wrote:
| This is a major hassle when converting from Avro (from Kafka,
| which uses a schema registry, so schemas are not shipped with
| the Avro data) and storing in Parquet, which requires a schema
| in the file, though you can 'upgrade' it with another schema
| when reading it. It would be great to have a binary
| protocol-like format (schema-less Avro) and a schema-less
| columnar storage format... which I guess is what these guys
| are doing.
| [deleted]
| difflens wrote:
| Perhaps I don't understand their use case fully, but it seems to
| me that every schema can be defined as a child protobuf message,
| and each child can then be added to a oneof field of a parent
| protobuf message. This way, you get the strict/optional type
| checks that are required, and the efficiency and ecosystem around
| protobufs.
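| For example (a hypothetical sketch; the message and field names
| are invented):
|
|     syntax = "proto3";
|
|     message Customer { int64 id = 1; string name = 2; }
|     message Order    { int64 id = 1; double total = 2; }
|
|     // Every record is exactly one of the known schemas.
|     message Record {
|       oneof payload {
|         Customer customer = 1;
|         Order order = 2;
|       }
|     }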
| mccanne wrote:
| Author here. This totally makes sense. The challenge here is
| you need to store the type definitions somewhere (e.g., in the
| .proto files) and any system that processes protocol buffers
| needs to know which proto to apply to which messages. The theme
| of super-structured data is that this type structure should be
| native to the serialized data and our premise is this leads to
| better DX (though Zed is early and the jury is out). Perhaps
| flexbuffers is a closer analogy, which I should have mentioned
| in the article.
| mamcx wrote:
| Note: The relational model (even SQL) is THIS.
|
| Despite the claims, SQL is NOT "schema-fixed".
|
| You can 100% create new schemas, alter them, and modify them.
|
| What actually happens is that if you have a CENTRAL repository
| of data (aka a "source of truth"), then you bet you want to
| "freeze" your schemas (because it is like an API, where you
| need to fulfill contracts).
|
| --
|
| SQL's limitations come from its lack of composability; the
| biggest reason "NoSQL" works is this: JSON is composable, while
| "stringy" SQL is not. If SQL were really built around relations
| and tuples, like (stealing from my project, TablaM):
|
|     [Customer id:i32, name:Str; 1, "Jhon"]
|
| then developers would have less reason to go elsewhere.
| flappyeagle wrote:
| why hasn't someone built a composable flavor of SQL? it seems
| like a burning need
| pjungwir wrote:
| This is what Tutorial D is, but it's never been widely
| adopted.
| zmgsabst wrote:
| I'm not sure what you mean by "composable" here -- could you
| elaborate?
| mamcx wrote:
| Composability is the ability to define things in the small and
| combine them with confidence.
|
| SQL does not allow this:
|
|     by_id := WHERE id = $1
|     SELECT * | by_id
| mccanne wrote:
| Author here. All good points. Yes, you can build a
| super-structured type system on top of tables. EdgeDB does this
| well. And you can put JSON into relational columns. Then you
| might ask what the "type" of that column is. Well, if you want
| deep types, the type varies from row to row as the JSON values
| vary, and you have to walk the JSON to determine the type. SQL
| implementations are beginning to deal with this mess by adding
| layers on top of tables. We're saying: maybe we should think
| differently about the problem and build tables on top of types,
| as a special case of a type system. This also gives a very nice
| way to get data into and out of systems without having to go
| through the messiness of ODBC and special-casing tables vs
| tuples vs scalars, etc.
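| A toy example of the problem (generic SQL; the table and column
| names are made up):
|
|     CREATE TABLE events (id INT, payload JSON);
|     -- The declared column type is just "JSON"; the deep type
|     -- differs per row and is found only by walking each value.
|     INSERT INTO events VALUES
|       (1, '{"s":"foo"}'),
|       (2, '{"s":"foo","a":[1,"bar"]}');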
| cryptonector wrote:
| Normalize to the max then denormalize till you achieve the
| performance trade-offs you want. That's the rule in
| relational schema design.
|
| Adding JSON traversal operators and functions helps a lot
| when you end up denormalizing bits of the schema. It's not
| hard.
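| For example, with Postgres's JSON operators (the table and
| column names here are made up):
|
|     -- ->> extracts a field as text; -> keeps it as JSON
|     SELECT payload->>'name' AS name
|     FROM events
|     WHERE payload->'address'->>'city' = 'Oslo';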
| mamcx wrote:
| This is true and is a limitation of SQL (not of the relational
| model per se), and it is also part of the problem that SQL is
| not composable (so you don't have a way to nest table
| definitions).
| vosper wrote:
| You mentioned EdgeDB in the blog post, too, but I just think
| you and them are dealing with different problems.
|
| My understanding of EdgeDB is they're mostly trying to make
| correct data-modeling simpler and more intuitive; to let
| people model relations in the same way they speak and think
| about it, rather than having to map to SQL concepts like join
| tables. I rather like what they're going for, though I
| haven't used it.
|
| EdgeDB seems to be mostly for business logic and OLTP.
| They're not trying to deal with arbitrary incoming data that
| might be outside of the control of the ingestion system. You
| wouldn't even have an ingestion system with EdgeDB.
| troelsSteegin wrote:
| It looks like the use case is specifying types for dataflow
| operators (aka endpoints for dataflow pipes) [0] and I surmise
| composition should be super easy. I was surprised not to see any
| mention of XML or XML Schema as prior art, especially with their
| discussion of schema registries. Edit: Oh, the point of reference
| is Kafka [1]
|
| [0] https://zed.brimdata.io/docs/language/overview/
| [1] https://docs.confluent.io/platform/current/schema-registry/i...
| hbarka wrote:
| I also thought about XML. It has the Document Object Model
| (DOM), the structure which describes the data.
| anentropic wrote:
| The first few sections of this post nearly lost me, waffling on
| about NoSQL vs whatever.
|
| Eventually we get to the meat:
|
| > _For example, the JSON value_
|
|     {"s":"foo","a":[1,"bar"]}
|
| > _would traditionally be called "schema-less" and in fact is
| said to have the vague type "object" in the world of JavaScript
| or "dict" in the world of Python. However, the super-structured
| interpretation of this value's type is instead:_
|
| > _type record with field s of type string and field a of type
| array of type union of types integer and string_
|
| > _We call the former style of typing a "shallow" type system and
| the latter style of typing a "deep" type system. The hierarchy of
| a shallow-typed value must be traversed to determine its
| structure whereas the structure of a deeply-typed value is
| determined directly from its type._
|
| This is a bit confusing, since JSON data commonly has an implicit
| schema, or "deep type system" as this post calls it, and if you
| consume data in any statically-typed language you will
| materialise the implicit "deep" types in your host language.
|
| So it seems that ZSON is sort of like a TypeScript-ified version
| of JSON, where the implicit types are made explicit.
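| For example, the implicit deep type of the article's value,
| written as a TypeScript type (just to illustrate the idea, not
| actual ZSON syntax):
|
|     // The "deep type" of {"s":"foo","a":[1,"bar"]}
|     type Rec = {
|       s: string;
|       a: (number | string)[];
|     };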
|
| It seems the point is not to have an external schema that
| documents must comply with, so I guess at the end of the day it
| has a similar aim to other "self-describing" message formats
| like https://amzn.github.io/ion-docs/ ? i.e. each message has
| its own schema.
|
| So the interesting part is perhaps the new data tools to work
| with large collections of self-describing messages?
| vosper wrote:
| > The first few sections of this post nearly lost me, waffling
| on about NoSQL vs whatever.
|
| Since the author of the blog post is here, I'll just jump in to
| agree with this part: there is a _lot_ of unnecessary
| background text before we get to the meat of it. I don't think
| people need a history lesson on NoSQL and SQL, and IMO the
| "authoritarianism" metaphor is a stretch, and that word has
| pretty negative connotations.
|
| I think there's some value in setting the scene, but I think
| you will lose readers before they get to the much more
| interesting content further down. I recommend revising it to be
| a lot shorter.
___________________________________________________________________
(page generated 2022-05-17 23:01 UTC)