[HN Gopher] The Protobuf Language Specification
___________________________________________________________________
The Protobuf Language Specification
Author : akshayshah
Score : 125 points
Date : 2022-09-12 16:40 UTC (6 hours ago)
(HTM) web link (buf.build)
(TXT) w3m dump (buf.build)
| rigelbm wrote:
| Echoing some of the sentiment here: although this was certainly a
| great effort and the result is awesome, that is NOT The Protobuf
| Language Specification, for as long as the maintainers of the
| Protobuf (protoc) project don't agree to follow it.
|
| This is certainly The Buf Language Specification, which is useful
| in itself. Specs are like contracts. If I were to build a tool to
| be compatible with Buf, I would definitely aim it to work with
| this spec.
|
| The problem is that the Protobuf project simply didn't sign this
| contract. Whatever it says is, sorry for the choice of word, a
| bit pointless if I'm trying to build a tool compatible with
| Protobuf, especially around forward compatibility.
|
| The industry does need better tooling around protobuf/efficient
| RPC, and being dependent on a single company (i.e. Google) is
| definitely not healthy. I hope you guys succeed in what you are
| trying to do.
| mook wrote:
| That's basically the equivalent of RubySpec -- reverse
| engineered from MRI (the original Ruby implementation) for use
| by Rubinius. It was adopted by the other alternative Ruby
| implementations too.
|
| It looks like the original is now gone, but a fork has taken
| over. Looking at some comments, fighting to get MRI to adopt it
| may have burnt out the people behind it.
| rigelbm wrote:
| Actually, I thought about it twice and I retract what I said
| about this specification being pointless for building tools
| compatible with protobuf. Reasons:
|
| * The language itself is unlikely to change much given it's
| been public for so long. A non-official spec that captures the
| current implementation is probably going to survive for some
| time.
|
| * There's no official spec (which I would prefer) for me to
| base my tool on. This spec is about my only choice. The more
| tools targeting this spec, the harder it would be for Google to
| break compatibility with it, reinforcing my previous point.
|
| I will keep the parent comment for context, and I don't retract
| the fact that I think the title is misleading. Otherwise, great
| work!!
| staticassertion wrote:
| Agreed. If IDEs and alternative compilers are all building
| off of the spec because it's the path of least resistance,
| and there are no bugs _for a while_, the de facto standard
| impl is going to face serious scrutiny for parting from it.
|
| And, as you said, proto isn't in a great position to be
| making crazy changes anyway.
| jhumphries131 wrote:
| Our aim is to make the spec accurately match Google's reference
| compiler -- for as long as that is the source of truth, which
| is hopefully not forever :)
|
| Even for those not using Buf, we expect this documentation to
| be of interest to the community as it describes a large number
| of facets of protoc that were previously undocumented (and
| required examining the source for protoc or playing around with
| test source code to see what it expects and what descriptors it
| generates).
|
| If issues are found with this spec, it is true that we'll most
| likely have to revise the spec to match the compiler, not the
| other way around. But no software is perfect: some variations
| will be due to bugs in protoc, which can be fixed in the
| compiler to properly match the spec. Over time, we'd love to
| see an outcome where a formal specification is the source of
| truth.
|
| For now, our commitment is to make (and keep) this spec as
| accurate as possible to describe the Protobuf language, not
| some Buf dialect.
| marsven_422 wrote:
| cpurdy wrote:
| oh .. cool .. pricing for protobuf
| habitue wrote:
| > we are standing on the shoulders of giants, those who have
| built and battle-tested it, and brought it to its current mature
| state
|
| I would rewrite this maybe to:
|
| > we are making Google's internal problems into everyone's
| problems
|
| There are benefits to an IDL in the abstract, but an IDL for
| everyone should be built with the benefit of hindsight looking at
| the lessons of protobuf, ion, thrift, etc. Not just baking
| Google's internal backwards compatibility obligations into a
| formal spec everyone should follow.
|
| I think any time google takes an internal tool and flips the
| "open source" bit on it, it turns out to be a bad match for the
| rest of the world. When they instead take the time to build a new
| system that learns from the internal tool, like Kubernetes
| learned from Borg, I think the end result is significantly more
| valuable.
| orf wrote:
| I quite like Protobuf definitions. I find them very easy to
| read and I love the fact I can distribute them to a bunch of
| different languages via a library.
|
| Are these Google's internal problems? Or, what google-internal
| problems do protobufs solve that nobody else needs to care
| about?
|
| Edit: to your edit, I find it hard to see a different way to do
| things.
| advisedwang wrote:
| Is there a license on this spec?
| akshayshah wrote:
| Apache 2.0:
| https://github.com/bufbuild/protobuf.com/blob/main/LICENSE
| mmastrac wrote:
| (removing my unfair characterization)
| Master_Odin wrote:
| A large corporate sponsor that has done a terrible job of
| shepherding the protocol, maintaining docs, etc. I'm all for a
| community push to divorce the protobuf
| specification/implementation from Google and to have it be much
| more community maintained, as it's clear that Google doesn't
| seem to care to.
| jhumphries131 wrote:
| Google actually requested that the community contribute to a
| real specification.
| https://github.com/protocolbuffers/protobuf/issues/6188#issu...
|
| So we've taken the initiative. No rent-seeking.
| peteradio wrote:
| Never having had the opportunity to put protobuf into action but
| having some interest I've had these questions:
|
| 1) What would you say is the best use case?
|
| 2) And what unfortunate misuse cases have you come across?
| jhumphries131 wrote:
| The key "killer" use case is for describing RPC schemas. By
| describing the schema in an IDL, you can then generate client
| stubs and server interfaces in a variety of implementation
| languages, allowing interop between heterogeneous clients and
| servers.
|
| They are also useful for describing domain models. This isn't
| surprising since domain models usually find their way into RPC
| schemas (since RPCs will often query or define model data). But
| they can also be used in other cases, such as for persistence
| and structured logging.
|
| Some misuse I have seen involves trying to make a protobuf
| model the _only_ representation of a domain model: it is almost
| inevitable that a physical model (a representation of your data
| in a SQL database, for example) will need to vary from a
| logical model, and trying to make a single protobuf
| representation serve double duty can be a source of problems.
| Making protobuf schemas conform to physical storage constraints
| often makes for worse abstraction. The model becomes
| denormalized (which can make constraints and relationships harder
| to model/enforce), and can even leak details that consumers
| shouldn't know about or care about.
|
| Another misuse is using it for data that never leaves a process
| -- a program's private, internal state. If a data structure
| never needs to be serialized (to persist or send to another
| process), then you're better off using native data structures
| in the implementation language (which generally have far
| greater flexibility/expressibility as well as better
| performance).
| shaftway wrote:
| I do most of my work with protos in Java, and it's nice to have
| a schema like this that will build a bunch of immutable POJOs
| with builder classes and enough infrastructure to be able to do
| some interesting reflection-style stuff on top of the
| serialization / deserialization.
|
| The wide variety of client languages is really nice. I'm fairly
| certain that I can parse a proto in any language I'm ever going
| to use.
|
| The binary wire format is fairly straightforward, and is pretty
| tight without using compression. Fields are byte-aligned, and
| if you wanted to generate a binary proto message by
| concatenating a few things together it isn't very hard. And
| then you can use your proto definitions in whatever language
| you want to parse it. You can even parse your proto definition
| into a proto (Google provides the proto proto definitions, I
| think I got that grammar right) and write tools that generate
| whatever code you want easily.
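|
| To make the "concatenating a few things together" point concrete,
| here is a minimal Python sketch that hand-assembles a two-field
| message. The schema it assumes (an integer field numbered 1 and a
| string field numbered 2) is hypothetical, as are the helper names;
| only the varint and tag layout come from the wire format docs.

```python
VARINT, LEN = 0, 2  # wire types: varint, length-delimited


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field's tag is the varint of (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


def varint_field(field_number: int, value: int) -> bytes:
    return encode_tag(field_number, VARINT) + encode_varint(value)


def bytes_field(field_number: int, payload: bytes) -> bytes:
    return encode_tag(field_number, LEN) + encode_varint(len(payload)) + payload


# Hypothetical schema: field 1 is an integer, field 2 is a string.
message = varint_field(1, 150) + bytes_field(2, "testing".encode("utf-8"))
print(message.hex())  # 089601120774657374696e67
```

| A runtime with a matching schema should accept these bytes as-is;
| the sketch skips signed integers, fixed-width types, and nested
| messages.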
|
| I think the text format is under-utilized. It's my go-to for
| configuration files. You create the proto definitions with
| whatever structures you want to structure your config settings,
| and then use the official parser to parse a text file. It
| supports comments (I'm throwing shade at you, JSON), and is
| simpler than YAML, while adding that structure. You can also
| use command line tools to validate the file as a pre-commit, or
| translate it into a binary format if you don't want to rely on
| the text format.
| IshKebab wrote:
| A good reference. I don't think it was really needed in the same
| way that e.g. a JSON or C++ spec was, since the language is so
| simple there's not much room for ambiguity. Definitely nice to
| have anyway.
| jhumphries131 wrote:
| It is a simple language, being just an IDL (no expressions, no
| logic, no state, no memory model, etc).
|
| However, you might be surprised about the room for ambiguity.
| For example, there are several mistakes in the grammars on the
| official developer site. And there is no place on the developer
| site that, for example, clearly explains how option names are
| formulated and interpreted. Even the way that relative
| references are resolved is not thoroughly described; it is
| probably the most complicated part of the spec because
| coherence/consistency wasn't keenly considered when the
| reference implementation in protoc was devised
| (https://www.protobuf.com/docs/language-spec#relative-
| referen...).
|
| So if you wanted to write a tool for the language, one that
| could correctly parse and understand a source file, without the
| details in this spec that tool will almost certainly be
| incorrect and disagree with how protoc parses and understands
| the same source.
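|
| As an illustration of why relative references are easy to get
| wrong, here is a simplified Python sketch of a C++-style scope
| search. It assumes (my reading of protoc's behavior, not a quote
| from the spec) that scopes are searched innermost to outermost for
| the first name component only, and that the search commits to the
| first match. The symbol names and the resolve helper are
| hypothetical; the linked spec section is the authority.

```python
def resolve(name, scope, symbols):
    """Resolve `name` as seen from `scope` against fully-qualified `symbols`.

    Simplified sketch: returns the fully-qualified name, or None on failure.
    """
    if name.startswith("."):
        # A leading dot means the name is already fully qualified.
        fq = name[1:]
        return fq if fq in symbols else None

    first, _, rest = name.partition(".")
    parts = scope.split(".") if scope else []
    while True:
        candidate = ".".join(parts + [first])
        if candidate in symbols:
            # Commit to this match: the remainder must resolve inside it,
            # and outer scopes are not consulted again.
            full = candidate + ("." + rest if rest else "")
            return full if full in symbols else None
        if not parts:
            return None
        parts.pop()  # move one scope outward


# Hypothetical symbol table: packages foo and foo.bar, plus a message
# named "foo" declared inside package foo.bar.
symbols = {
    "foo", "foo.Other",
    "foo.bar", "foo.bar.Outer", "foo.bar.Outer.Inner", "foo.bar.foo",
}
here = "foo.bar.Outer"

print(resolve("Inner", here, symbols))       # foo.bar.Outer.Inner
print(resolve("Other", here, symbols))       # foo.Other
print(resolve("foo.Other", here, symbols))   # None: "foo" matched foo.bar.foo
print(resolve(".foo.Other", here, symbols))  # foo.Other
```

| The third lookup is the surprising one: the inner "foo" shadows the
| package, so the reference fails unless it is written with a leading
| dot.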
| IshKebab wrote:
| It does make me hope that some protobuf libraries will
| integrate their own compilers. The official one can be a bit
| of a pain to install. This will definitely make that easier!
| jupp0r wrote:
| > Protobuf is the most stable and widely adopted IDL today
|
| I've run into this misconception so many times over the last
| decade. Protobuf is much less than an IDL (intentionally so).
| It's used to describe the data of an interface but is completely
| unopinionated about all other aspects of an IDL. gRPC is a great
| example of how to use Protobuf in an IDL, but it could be used
| for other categories of interfaces (object-oriented, etc.). People
| treating Protobuf as an IDL make the mistake of concentrating too
| much on the data format and not enough on (imho) more important
| aspects, like pre- and postconditions, that make an interface
| an interface.
| CobrastanJorji wrote:
| This is interesting. So one company invents and maintains a
| compiler, then a different company found that documentation to be
| insufficient (big surprise), so wrote out a lengthy standard that
| conformed to what the first company's compiler happened to do?
| Seems very useful, but also seems risky and hard to maintain.
| What happens when Google tightens a constraint or adds a new
| feature next week?
| Master_Odin wrote:
| I think the hope is that this could be a situation like
| CommonMark and Markdown, where Google's implementation will
| continue to exist, but that the community just totally moves
| over to this new specification / tooling, and that anytime
| someone says "protobuf", they don't even necessarily realize
| that Google has a thing, they just know this specification.
| numbsafari wrote:
| ... except for whenever you've got to communicate or exist
| inside the Google ecosystem. Now you've got "protobuf as per
| Google" and "protobuf as per internet randos" with twice the
| dependency graph.
|
| It's unfortunate, and not surprising, that Google hasn't made
| a protobuf spec. I realize there are a ton of protobuf fans
| out there but, personally, it just feels like a massive,
| awkwardly maintained mess that I'm forced to live with for
| (mostly) their benefit.
| smcl wrote:
| Trouble is that I don't think everyone's moved on with
| Markdown, I still encounter different dialects that are
| subtly different. Even within one suite of tools from a
| single vendor ... [angry stares in the direction of
| Atlassian]
| eklitzke wrote:
| Generally new features in protobuf are just new features, so if
| a new feature is added then the worst case is the documentation
| will be out of date/incomplete for a short period of time.
|
| For the most part the "constraint tightening" thing doesn't
| affect the language specification, at least in my experience.
| For example, there have been some changes in protobuf that
| affect things like serialization order. A change like that can
| break brittle tests that do things like checking that a
| function produces some exact serialized string/byte sequence,
| but they don't affect the semantics of the language.
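|
| A small Python sketch of why byte-for-byte assertions are brittle:
| the same two fields written in a different order produce different
| bytes but decode to the same content. The decoder is a hypothetical
| helper that handles only varint and length-delimited fields, and
| the field numbers are made up for the example.

```python
def read_varint(buf: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(buf: bytes) -> dict:
    """Return {field_number: [values]} without needing a schema."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.setdefault(field_number, []).append(value)
    return fields


# Field 1 = 150 and field 2 = "testing", serialized in both orders.
one_order = bytes.fromhex("08 96 01 12 07 74 65 73 74 69 6e 67")
other_order = bytes.fromhex("12 07 74 65 73 74 69 6e 67 08 96 01")

assert one_order != other_order  # an exact-bytes test would fail here
assert decode_fields(one_order) == decode_fields(other_order)
```

| Comparing decoded messages rather than serialized bytes keeps a
| test stable when a library changes its serialization order.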
| jhumphries131 wrote:
| We keep an eye on the protobuf repo, so if a change is made we
| can both incorporate it into our products (https://buf.build)
| and into this spec.
|
| A great outcome for the ecosystem would be that Google chooses
| to engage with the community before making language changes.
| And the best possible outcome would be that the authority is
| eventually inverted: a specification doc becomes the definitive
| source of truth on the language and `protoc` is updated to
| conform to it, instead of vice versa.
| mmastrac wrote:
| (deleted)
| CobrastanJorji wrote:
| Ahhh, is that the play? That makes a lot of sense.
| jhumphries131 wrote:
| Matt, that is not our intention. But we are trying to build a
| business around making schema-driven APIs easy, and protobuf
| is at the core of our current products. So we are trying to
| improve the ecosystem around protobuf, and a critical aspect
| of that in our estimation is having a spec.
|
| While `protoc` remains the source of truth, this spec
| captures the syntax and rules accepted and enforced by
| `protoc` in a far more detailed way than the official
| developer site.
| mmastrac wrote:
| Ah, hey Josh. I didn't realize you were involved in this
| effort. FWIW, seeing your name associated with this
| definitely gives it a bit more authority than I had
| originally assumed.
| kentonv wrote:
| If it helps, I (original author of proto2) have been
| advising Buf and like what they're doing. (Disclosure: I
| also invested a small amount.)
|
| Buf is founded by engineers who spent a LOT of time
| working with Protobufs outside of Google. I was always
| the one saying "please don't write your own .proto
| parser" but I am convinced Buf actually knows what they
| are doing here and probably have all the details right.
|
| Our industry has a whole lot of tooling and
| infrastructure built around JSON, and almost every piece
| of it could be way better if it were operating with well-
| defined types instead, in the same way that TypeScript
| tooling benefits vs. JavaScript. Google has had an all-
| protobuf ecosystem internally for a long time, but much
| of it will never be released publicly. So that leaves
| someone like Buf to really build it out. I'm pretty
| interested to see where they're able to take it.
| wrs wrote:
| This is a similar situation to Ruby and Rubinius (an
| alternative implementation of Ruby). Because there was no Ruby
| specification other than the original MRI implementation, the
| Rubinius project (a new alternative implementation) created
| their own test suite to codify expected Ruby behavior. However,
| the MRI developers didn't use it, and the behavior diverged.
|
| The original creator gave up on the idea [0] but it was
| immediately taken over by others and is still maintained [1].
| In case of conflict, though, Matz (lead developer on MRI), not
| the specification, is the source of truth [2].
|
| [0] https://github.com/rubinius/rubinius-website-
| archive/blob/87...
|
| [1] https://github.com/ruby/spec
|
| [2] http://ruby.github.io/rubyspec.github.io/bugs_found/
| kyrra wrote:
| Googler, opinions are my own. I don't work on protobuf at all,
| just use it all the time (like most Googlers)
|
| I haven't dug into this in great detail yet, but the hard thing
| about the proto "spec" is that there isn't one, and protoc lets
| you do all kinds of crazy things that are really hard to model in
| languages like ANTLR. There were some poor choices in protoc
| dating back to when proto1 was first created that have been
| carried forward. Having 20 years of proto definitions lets people
| come up with some crazy use cases.
|
| Definitely interesting for this company to create an EBNF
| definition for protobuf.
| jjtheblunt wrote:
| if comfortable answering, why protobuf instead of gob?
| jhumphries131 wrote:
| Gob is Go-specific. Our mission is to make schema-driven APIs
| easy, regardless of what language you use. Protobuf already
| has official support for nearly a dozen languages, and
| unofficial support for many more. Protobuf also has a
| compiler with a plugin model, which facilitates supporting
| even more in the future.
|
| Furthermore, Protobuf is an IDL, not a full-blown programming
| language. This makes it ideal for this use case, for
| describing APIs and data structures.
|
| Gob-encoded data structures are described with Go. While Go
| is great for writing server-side business logic, it is not as
| well-suited as a description language for data that you need
| to share with non-Go systems.
| whacker wrote:
| proto predates gob by quite a bit. gob was introduced with
| golang, and it's not really used anywhere else.
| erik_seaberg wrote:
| My takeaway from Java serialization was that a schema-
| driven encoding that's supported in many languages is a lot
| more useful.
| jhumphries131 wrote:
| It's possible that the internal version of protoc is very
| different from the open-source version. (I know there are
| numerous differences, but not sure how pervasive they are in
| the parser.)
|
| The open-source version has a hand-written tokenizer and
| recursive descent parser that is not too difficult to translate
| to EBNF. You'll notice that the section on numeric literals is
| a little wonky, because the tokenizer does a check that is hard
| to describe in EBNF. But it isn't too bad.
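|
| For readers who have not seen the pattern, here is a toy Python
| sketch of a hand-written tokenizer plus recursive descent parser,
| with the matching EBNF in a comment. The grammar is a tiny,
| hypothetical subset invented for illustration; it is not the
| grammar from the spec and not how protoc is structured internally.

```python
import re

# Toy grammar (EBNF), a hypothetical subset of the real language:
#   message = "message" ident "{" { field } "}" ;
#   field   = ident ident "=" int ";" ;
TOKEN = re.compile(r"\s*(?:([A-Za-z_][A-Za-z0-9_]*)|([0-9]+)|([{}=;]))")


def tokenize(src):
    """Split source text into (kind, text) tokens."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        ident, number, punct = m.groups()
        if ident is not None:
            tokens.append(("ident", ident))
        elif number is not None:
            tokens.append(("int", number))
        else:
            tokens.append(("punct", punct))
        pos = m.end()
    return tokens


class Parser:
    """One method per grammar production, each consuming tokens in order."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", "")

    def expect(self, kind, text=None):
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def parse_message(self):
        self.expect("ident", "message")
        name = self.expect("ident")
        self.expect("punct", "{")
        fields = []
        while self.peek() != ("punct", "}"):
            fields.append(self.parse_field())
        self.expect("punct", "}")
        return {"name": name, "fields": fields}

    def parse_field(self):
        type_name = self.expect("ident")
        field_name = self.expect("ident")
        self.expect("punct", "=")
        number = int(self.expect("int"))
        self.expect("punct", ";")
        return {"type": type_name, "name": field_name, "number": number}


def parse(src):
    return Parser(tokenize(src)).parse_message()


print(parse("message Person { string name = 1; int32 id = 2; }"))
```

| Each EBNF production maps onto one parse method, which is why a
| parser written this way translates to a grammar fairly directly.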
|
| Also, some of the constraints of the language are in prose in
| this spec because they are easier to enforce using a semantic
| validation pass, instead of trying to model purely with a CFG.
| (Optionality of the colon in the text format, used in message
| literals, comes to mind.)
|
| There are some things that technically _could_ be handled in
| the grammar, but they would make the grammar much more
| cumbersome to read and understand. So those things are also
| extracted into prose.
|
| > Definitely interesting for this company to create an EBNF
| definition for protobuf.
|
| For what it's worth, Google has also published an EBNF
| definition (the subject blog post contains links to those
| specs). But they are incomplete and not entirely accurate,
| which is a non-trivial part of what led us to writing and
| publishing this spec.
| kyrra wrote:
| One place protoc doesn't align well is the descriptor object.
| https://developers.google.com/protocol-
| buffers/docs/referenc...
|
| Comment placement is basically allowed anywhere by protoc,
| but how to get those comments within a Descriptor object for
| a proto is not well defined (there are places where you can
| put comments that are not available within Descriptor). It
| provides leading/trailing comments, but there are many other
| cases that are missed today (like comments embedded within a
| list of items in an array). Maybe this is a mismatch between
| what protoc allows and what Descriptor presents, but it's
| definitely annoying.
| jeffparsons wrote:
| > As of today, Protobuf is now a fully-defined language:
|
| (Etc.)
|
| I'm not sure what this is meant to achieve in reality. There is
| still only one implementation that defines what the language
| actually is, and that is Google's protoc.
|
| In my experience working with and writing alternative parsers
| for the '.proto' language, I've found that time and again
| Google's documentation for the format is either woefully vague,
| or directly contradicts the actual implementation. I don't see
| the value in a third party "spec" if what I have to do in
| practice will always be "whatever Google did in protoc".
| returningfory2 wrote:
| > There is still only one implementation that defines what the
| language actually is, and that is Google's protoc.
|
| From the article, it seems this is not true anymore?
|
| > We've built the [new Buf proto] compiler within the buf CLI
| to accurately match protoc.
| jhumphries131 wrote:
| Our intent with the compiler in Buf is to match protoc as
| perfectly as we can. We want to instill maximum confidence in
| our users that Buf is a trustworthy tool and a suitable
| replacement for protoc. And to do that, we need the behavior
| to match.
|
| But we do hope that eventually the _official_ definition of
| the language will be a proper specification, not a particular
| implementation. (And maybe this document could be the start
| of that shift.)
|
| So while there are multiple implementations (Buf,
| Square/Wire, probably others), the protoc implementation is
| canon.
| overboard2 wrote:
| Have you thought of creating your own version of the
| protobuf language, sort of like GNU C? You could have an
| optional flag to enable it, which would allow you to create
| your own official specification.
| jhumphries131 wrote:
| The intent of this spec is to actually put "whatever Google did
| in protoc" into a readable format, so you don't have to read
| the C++ code. The official docs fall short on providing many of
| the details that are included.
| morelisp wrote:
| Protobuf isn't too complicated, I've found the wire format
| docs to be some of the best among the avro/msgpack/thrift/etc
| competitors.
|
| Maybe you mean something besides the wire format. In that
| case, good luck, because that shit ain't protobuf.
| akshayshah wrote:
| The wire format is fairly straightforward if you've seen a
| few binary encodings. The language used to write the
| schemas isn't quite as simple and regular as you might
| hope, though.
|
| > Maybe you mean something besides the wire format. In that
| case, good luck, because that shit ain't protobuf.
|
| Naming's hard :) Being really pedantic, I think even Google
| calls the schema description language "Protocol Buffers"
| and uses phrases like "the Protobuf binary format" or "the
| Protocol Buffer wire format" to refer to the wire format.
| Colloquially, it's never confused me to just use "Protobuf"
| for both.
| morelisp wrote:
| Except I've written thousands of lines of protobuf format
| handling that never, or only extremely distantly, touch a
| schema file. But there's no reason you'd ever do the
| inverse, pushing protobuf schemas around with no intent
| to handle the wire format. As an abstract data definition
| format it's exceptionally poor, it only makes sense if
| you also want to use the wire format (which is... better
| than poor, especially as the commodity ones go).
| akshayshah wrote:
| > But there's no reason you'd ever do the inverse,
| pushing protobuf schemas around with no intent to handle
| the wire format.
|
| You could be writing a linter, a formatter, an
| implementation of the Language Server Protocol, a
| compiler that's not protoc, a way to apply semantic
| patches to large numbers of Protobuf schemas, or any
| number of other useful tools. There's clearly at least
| some demand for tools like this - partial implementations
| of most of these exist, often with some corporate
| backing.
|
| Unless you're implementing a Protobuf runtime
| (google.golang.org/protobuf in Go, upb for Python, etc.),
| your experience seems unusual to me - most developers
| I've encountered read and write the wire format using one
| of the existing runtimes.
|
| That said, it does sound like a lot of fun - especially
| if it's in lisp!
| lhorie wrote:
| That sounds great, but what's the governance story? Are the
| authors of the spec document committing to keeping the
| document up to date here henceforth? Are the protoc
| maintainers committing to have these folks involved in
| project direction decisions?
| jhumphries131 wrote:
| As of right now, the former. That is currently required for
| our tools to remain functioning (https://buf.build).
| If/when changes are made to the language, we update our
| tools (and this spec) to continue to be accurate.
| jeffbee wrote:
| Out of curiosity: why write proto language implementations
| rather than protoc plugins?
| jhumphries131 wrote:
| A great question: We do plan to add content about the plugin
| protocol to this site. While documentation for plugins is
| light, it is easier to find and to get a working plugin than
| it is to find the information needed, for example, to write a
| tool that performs static analysis on protobuf source
| files.
|
| The biggest omission in the existing docs was the
| specification of the language.
|
| Plugins generally require a library for a particular
| implementation language, so content we write would likely
| have to focus on a single language and library (at least to
| start). Whereas a spec is more broadly useful, regardless of
| what implementation language one is using with Protobuf.
| geraldcombs wrote:
| So that you can analyze and troubleshoot protobuf network
| traffic? Wireshark has a protobuf parser that integrates with
| our dissection API:
|
| https://gitlab.com/wireshark/wireshark/-/blob/master/epan/pr.
| ..
| jeffbee wrote:
| Hrmm, I don't see why you would need to think about proto
| files to do this. You can dissect protocol messages on the
| wire using the descriptor. In fact, I would say that would
| be a good improvement to the code you just showed me.
| jen20 wrote:
| (I don't work at Buf, but happen to be able to answer this) -
| the post at [1] describes the rationale for wanting something
| different than the Google compiler.
|
| To my mind I'd rather have something written in Go that I can
| pull in and version using `go.mod` instead of having to
| special case a single tool, as well.
|
| [1]: https://docs.buf.build/reference/internal-compiler
| season2episode3 wrote:
| It appears Google has released a spec as of 11 days ago:
| https://github.com/protocolbuffers/protobuf/issues/6188#issu...
| bufbuild wrote:
| That is for the text format, which is a serialized
| representation of Protobuf data. As they specify in the linked
| doc, it is not the format for the actual language:
|
| > This format is distinct from the format of text within a
| .proto schema.
| silasdavis wrote:
| Golang is an example of a language defined by a spec not an
| implementation? Discuss.
| wrs wrote:
| I see your point, but the implementers of Go do pay a lot more
| attention than most "defined by an implementation" languages to
| specifying what they're doing before they do it. And if you
| find a difference between the specification and the
| implementation, generally the specification will prevail.
| jvolkman wrote:
| This seems like a great resource. Kudos.
|
| > But most of them are based on the incomplete specs from
| Google's developer site. None of them can correctly predict what
| source files protoc will actually accept or reject 100% of the
| time.
|
| I'd like to think I got pretty close with the plugin that now
| ships with IntelliJ. It even supports the 65-bit integer literal
| [1] that protoc happens to accept for proto2-style float and
| double default values.
|
| With this as a starting point, it'd be nice to fix some of the
| peculiarities that arise from "implementation as spec", such as
| that literal value, and the fact that colon optionality in text
| format is based on value type, not syntax.
|
| 1: https://github.com/jvolkman/intellij-protobuf-
| editor/blob/6e...
| miohtama wrote:
| Out of curiosity, why does Protobuf allow a negative 64-bit
| value in the first place? No CPU architecture supports such a
| thing, as far as I know.
| kentonv wrote:
| It's been 15 years since I wrote this so I don't remember
| exactly, but I think it's just an implementation quirk. See:
|
| https://github.com/protocolbuffers/protobuf/blob/main/src/go.
| ..
|
| The parser consumes a "-", then consumes a number. The number
| can be any 64-bit unsigned integer. It is then converted to a
| double. Finally, if a "-" was seen earlier, it is negated. So
| by accident, it ends up allowing the range of a 65-bit signed
| integer.
| jvolkman wrote:
| It's just a long-standing quirk in the parser. It parses the
| numeric part as an unsigned 64-bit number, then applies the
| sign afterwards. And the result can be approximately stuffed
| into a floating point value.
|
| The behavior for integer fields is different; compilation
| will fail with an out of range error.
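|
| The quirk described above can be modeled in a few lines of Python.
| This is a sketch of the described behavior only, not protoc's code:
| the function names are hypothetical, only decimal integer literals
| are handled, and raising on magnitudes above the unsigned 64-bit
| range is an assumption of the sketch.

```python
UINT64_MAX = 2**64 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def parse_float_default(literal: str) -> float:
    """Model an integer literal used as a float/double default value."""
    negative = literal.startswith("-")
    digits = literal[1:] if negative else literal
    magnitude = int(digits)  # parsed as an unsigned 64-bit number
    if magnitude > UINT64_MAX:
        raise ValueError("integer literal out of range")  # sketch assumption
    value = float(magnitude)  # converted to a double (may round)
    return -value if negative else value  # sign applied last


def parse_int64_default(literal: str) -> int:
    """Model the same literal used as an int64 default: range-checked."""
    value = int(literal)
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError("integer out of range")
    return value


# The magnitude fits in 64 unsigned bits, so with the sign the accepted
# range is effectively that of a 65-bit signed integer.
print(parse_float_default("-18446744073709551615"))  # -1.8446744073709552e+19

try:
    parse_int64_default("-18446744073709551615")
except ValueError as err:
    print("int64 default rejected:", err)
```

| Note that the double cannot hold 2^64 - 1 exactly, so the value
| rounds to 2^64, which is the "approximately stuffed" part.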
___________________________________________________________________
(page generated 2022-09-12 23:00 UTC)