[HN Gopher] Compile-time JSON deserialization in C++
___________________________________________________________________
Compile-time JSON deserialization in C++
Author : dctwin
Score : 98 points
Date : 2024-07-06 09:35 UTC (4 days ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| dctwin wrote:
| Hello! I wrote this short blog post about using pattern-matching-
| like template metaprogramming to deserialize JSON at build time -
| please let me know what you think (especially if you see
| improvements)
| whizzter wrote:
| It's cute and neat to be able to do it 100% constexpr, however
|   as you mention the indexers feel a tad inelegant.
|
| I've written 2 iterations of a reflection library where you
| needed to annotate structs slightly with an ugly macro but once
|   done you could just do:
|
|   Message msg;
|   if (parse_json(str, msg)) { ..process msg struct.. }
|
| The previous iterations were for C++11 and C++17 but it seems
| that with C++20 features you don't even seem to need the macro
|   ugliness so I personally think libraries need to move in the
| direction of plain old structs.
| cobbal wrote:
| small note: "JSON in its pure form anarchic" is missing a verb
| dctwin wrote:
| Thank you!
| worstspotgain wrote:
| As someone who used to have to do this sort of compile-time
| stuff with previous versions of the standard, I'm jealous of
| how much more can be done now that I don't have to.
|
| If you're looking for an interesting follow-up project, here's
| something I had to do once that's now become much easier:
| compute a compile-time hash of the compilation for the current
| translation unit, e.g. __BASE_FILE__ hashed together with
| __TIMESTAMP__ or the equivalents for each platform.
|
| This allows you to dynamically invalidate on-disk caches and
| trigger new-build tripwires based on ongoing revisions.
| Development and release builds are handled identically: if
| source file X handles a cache and X was recompiled, discard the
| cache.
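A minimal sketch of the per-translation-unit build hash worstspotgain
describes. It assumes FNV-1a as the hash (the comment names none) and
uses __FILE__ as a portable stand-in for __BASE_FILE__, which is a
GCC/Clang extension. __TIMESTAMP__ is likewise a widely supported
extension rather than standard C++. The names fnv1a, kBuildId and
cache_is_stale are hypothetical.

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a (64-bit): simple enough to evaluate in a constant expression.
// The second parameter lets one hash be chained into the next.
constexpr std::uint64_t fnv1a(std::string_view text,
                              std::uint64_t hash = 14695981039346656037ull) {
    for (char c : text) {
        hash ^= static_cast<unsigned char>(c);
        hash *= 1099511628211ull;
    }
    return hash;
}

// Hypothetical build id: the file name hashed together with the
// compiler-provided timestamp string (an extension, see above).
inline constexpr std::uint64_t kBuildId = fnv1a(__TIMESTAMP__, fnv1a(__FILE__));

// Hypothetical check: a cache written by an older build carries a
// different id and should be discarded.
constexpr bool cache_is_stale(std::uint64_t stored_id) {
    return stored_id != kBuildId;
}
```

A cache file could store kBuildId in its header. With GCC, Clang and
MSVC, __TIMESTAMP__ reflects the last modification time of the source
file, so the id changes once the file that owns the cache is edited and
rebuilt.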
| dctwin wrote:
| Thanks for the idea! Yes constexpr std::vector feels like
| cheating
| emmanueloga_ wrote:
| The concept reminds me of F#'s "Type Providers" [1].
|
| In terms of the implementation ... I feel like C++ is best when
| used in an "orthodox style" and minimizing the use of templates
| as much as possible.
|
| --
|
|   1: https://learn.microsoft.com/en-us/dotnet/fsharp/tutorials/ty...
| goodcanadian wrote:
| My experience with template meta-programming: it is hardly
| ever useful, but on those rare occasions when it is, it is
| magical!
| svalorzen wrote:
| I recently actually tried to do a very similar thing, although
|   a bit tighter in scope. What stopped me was that actually
| deserializing floating points cannot currently be done at
| compile time; the only utility available to do so is
| `from_chars` and it is only constexpr for ints.
|
| I did not see any mention of this in the post; so are you
| actually simply extracting the string versions of the numbers,
| without verifying nor deserializing them?
| dctwin wrote:
| I was able to do the primitive
|
| long double result = 0.0;
|
| while (...) { if (json[head] == '.') ...
| result *= 10; result += json[head] - '0';
|
| }
|
| in a constexpr function with no problem :)
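dctwin's fragment could be fleshed out roughly as follows. This is a
sketch under assumptions: how '.' and the sign are handled and the final
scaling step are guesses at the elided parts, parse_decimal is a
hypothetical name, and there is no exponent support or input validation.

```cpp
#include <cstddef>
#include <string_view>

// Digit-by-digit parse of [-]digits[.digits], usable in constant expressions.
constexpr long double parse_decimal(std::string_view json) {
    std::size_t head = 0;
    bool negative = false;
    if (head < json.size() && json[head] == '-') {
        negative = true;
        ++head;
    }
    long double result = 0.0L;
    long double scale = 1.0L;   // 10^(number of digits after the '.')
    bool seen_point = false;
    for (; head < json.size(); ++head) {
        const char c = json[head];
        if (c == '.' && !seen_point) {
            seen_point = true;
            continue;
        }
        if (c < '0' || c > '9') break;   // stop at the first non-digit
        result *= 10;
        result += c - '0';
        if (seen_point) scale *= 10;
    }
    result /= scale;
    return negative ? -result : result;
}

// Values that are exactly representable in binary round-trip exactly.
static_assert(parse_decimal("2.5") == 2.5L, "exactly representable input");
```

Inputs such as "0.1" have no exact binary representation, so the result
for those depends on rounding.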
| kookybakker wrote:
| How about floats in scientific notation?
| dctwin wrote:
| You can do something similar, no? std::pow is not
| constexpr (most float stuff is not, presumably due to
| floating point state) but you can implement 10^x anyway
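One way to read "implement 10^x anyway" is a plain loop, sketched below.
power_of_ten and scale_by_exponent are hypothetical names, and the
exponent syntax accepted ("e5", "E-3", "e+2") is an assumption based on
the JSON number grammar.

```cpp
#include <cstddef>
#include <string_view>

// 10^exponent by repeated multiplication; std::pow is not constexpr.
constexpr long double power_of_ten(unsigned exponent) {
    long double result = 1.0L;
    for (unsigned i = 0; i < exponent; ++i) result *= 10.0L;
    return result;
}

// Applies an exponent suffix such as "e2" or "E-3" to an already parsed
// mantissa. Negative exponents divide rather than multiply by a reciprocal,
// which avoids introducing the inexact value 0.1.
constexpr long double scale_by_exponent(long double mantissa,
                                        std::string_view exponent_text) {
    std::size_t i = 0;
    if (i < exponent_text.size() &&
        (exponent_text[i] == 'e' || exponent_text[i] == 'E')) {
        ++i;
    }
    bool negative = false;
    if (i < exponent_text.size() &&
        (exponent_text[i] == '+' || exponent_text[i] == '-')) {
        negative = exponent_text[i] == '-';
        ++i;
    }
    unsigned exponent = 0;
    for (; i < exponent_text.size() &&
           exponent_text[i] >= '0' && exponent_text[i] <= '9'; ++i) {
        exponent = exponent * 10 + static_cast<unsigned>(exponent_text[i] - '0');
    }
    return negative ? mantissa / power_of_ten(exponent)
                    : mantissa * power_of_ten(exponent);
}

static_assert(scale_by_exponent(1.5L, "e2") == 150.0L, "1.5e2");
```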
| svalorzen wrote:
| The problem with this is that it will not actually parse
| double in IEEE 754, as you will accumulate inaccuracies at
| every step of the loop. In theory, when parsing a float,
| you are supposed to return the floating point that is
| closest to the original string number. Your code will not
| do that. Even if you accept the inaccuracy, if you for some
| reason load the JSON using a runtime library, you'll get
|       different numbers and consequently different results that depend
|       on those numbers. For my use-case this was not acceptable
|       unfortunately.
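For short inputs the rounding concern can be sidestepped. A common
technique (not something taken from the article) is to collect the
digits in an integer and divide once: if the digits fit in 53 bits and
there are at most 22 fractional digits, both operands are exact doubles,
so a single IEEE 754 division gives the correctly rounded result. The
sketch below assumes that approach; ParsedDouble and
parse_double_fast_path are hypothetical names, and signs, exponents and
validation are left out.

```cpp
#include <cstdint>
#include <string_view>

struct ParsedDouble {
    double value;
    bool correctly_rounded;  // false when the input is outside the exact path
};

constexpr ParsedDouble parse_double_fast_path(std::string_view text) {
    std::uint64_t mantissa = 0;  // all digits, decimal point removed
    int fraction_digits = 0;     // how many of those digits followed the '.'
    bool seen_point = false;
    for (char c : text) {
        if (c == '.' && !seen_point) {
            seen_point = true;
            continue;
        }
        if (c < '0' || c > '9') break;
        mantissa = mantissa * 10 + static_cast<std::uint64_t>(c - '0');
        if (seen_point) ++fraction_digits;
        // Past 2^53 or 10^22 the operands are no longer exact doubles,
        // so this sketch gives up instead of returning a rounded guess.
        if (mantissa > (std::uint64_t{1} << 53) || fraction_digits > 22) {
            return {0.0, false};
        }
    }
    double scale = 1.0;
    for (int i = 0; i < fraction_digits; ++i) scale *= 10.0;  // exact up to 10^22
    return {static_cast<double>(mantissa) / scale, true};
}

// Matches what the compiler produces for the same literal.
static_assert(parse_double_fast_path("0.1").value == 0.1, "fast path");
```

Inputs outside these limits need a full correctly rounding algorithm,
which is what svalorzen's objection is about.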
| dctwin wrote:
| Yes, very true. I noticed that even already at 3dp the
| floats start to compare unequal. The long double helped
|         but it's not really a fix.
|
| I googled and found two examples of constexpr float
| parsing repositories, but from the sounds of things, you
| understand this problem better than I and will have seen
| them already
| nikeee wrote:
| Could this be leveraged to emit a parser that is specialized for
| the provided type that can be used at runtime? Afaik .NET does
| something like that using code generators.
|
| The advantage being that the parser is tailored to the specific
| type that is deserialized and it writes directly to the struct's
| fields instead of going through some dictionary.
| leni536 wrote:
| This lib does something like that:
|
| https://github.com/beached/daw_json_link
| nikeee wrote:
| It seems that you have to maintain hand-coded mappings for
| each type. Maybe this could be solved by using C++23's
| compile-time reflections.
| dctwin wrote:
| Yes, the nonconstexpr version does just that, unless I
| misunderstood your question. See also boost::spirit for a 'big'
| version of this
| nikki93 wrote:
| I use this static reflection hack in C++ --
| https://godbolt.org/z/enh8za4ja
|
| You do have to tag struct fields with a macro, but you can attach
| constexpr-visitable attributes. There's also a static limit to how
| many reflectable fields you can have, all reflectable fields need
| to be at the front of the struct, and the struct needs to be an
| aggregate.
| gpderetta wrote:
| that forEachProp function... it brings back nightmares of when,
| before variadics, we used to macro generate up-to N-arity
|   functions (with all the const/non-const permutations).
|
| Now I use the same trick in our code base to generically hash
| aggregates, but I limit it to 4 fields for sanity.
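A rough sketch of the arity-dispatch trick gpderetta mentions, capped at
four fields. It assumes C++17 structured bindings and that the caller
passes the exact field count; for_each_field, hash_fields and Vec3 are
hypothetical names and are not taken from the linked godbolt example.
The FNV-style mixing step is also just an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Visits every field of an aggregate that has exactly N fields (N <= 4).
// One branch per arity: the same idea as macro-generated N-arity overloads.
template <std::size_t N, class T, class F>
constexpr void for_each_field(const T& obj, F&& visit) {
    static_assert(N >= 1 && N <= 4, "this sketch supports 1 to 4 fields");
    if constexpr (N == 1) {
        const auto& [a] = obj;
        visit(a);
    } else if constexpr (N == 2) {
        const auto& [a, b] = obj;
        visit(a); visit(b);
    } else if constexpr (N == 3) {
        const auto& [a, b, c] = obj;
        visit(a); visit(b); visit(c);
    } else {
        const auto& [a, b, c, d] = obj;
        visit(a); visit(b); visit(c); visit(d);
    }
}

// Generic hash over integral fields, mixing each one FNV-1a style.
template <std::size_t N, class T>
constexpr std::uint64_t hash_fields(const T& obj) {
    std::uint64_t h = 14695981039346656037ull;
    for_each_field<N>(obj, [&h](const auto& field) {
        h ^= static_cast<std::uint64_t>(field);
        h *= 1099511628211ull;
    });
    return h;
}

struct Vec3 { int x; int y; int z; };  // example aggregate

static_assert(hash_fields<3>(Vec3{1, 2, 3}) == hash_fields<3>(Vec3{1, 2, 3}),
              "equal aggregates hash equally");
```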
| asguy wrote:
| Holy crap; that's pretty epic. Did you come up with that
| yourself?
| forrestthewoods wrote:
| I think the value of compile-time JSON deserialization is... well
| I was going to say zero but really it's negative. It's a cute
| trick, but please don't ever do this in a real project.
| dctwin wrote:
| Despite my writing the article I agree
| delfinom wrote:
| You clearly never wrote a Fizz Buzz enterprise grade
| application ;)
| beached_whale wrote:
|   So I have owned a library for 6 years or so that does constexpr
| JSON to data structures, JSON Link. There are a few benefits
| and in the near future with #embed it gets even better. The big
| benefit is that we can now get earlier errors and do testing in
| constexpr that gives more guarantees around the areas of core
| UB and in most implementations they add constexpr checked
| preconditions on the std library too. But, just because it is
| marked constexpr, doesn't mean it will be run at compile time.
|   This also limits the shenanigans that the library dev can do
| to get potential perf and work around design limitations.
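The point that constexpr alone does not force compile-time evaluation
can be illustrated with a small sketch. count_digits is a made-up
helper, not part of JSON Link.

```cpp
#include <cstddef>
#include <string_view>

// May run at compile time or at run time, depending on how it is called.
constexpr std::size_t count_digits(std::string_view json) {
    std::size_t digits = 0;
    for (char c : json) {
        if (c >= '0' && c <= '9') ++digits;
    }
    return digits;
}

// Forced constant evaluation: a constexpr variable must be initialized by a
// constant expression, and static_assert requires one as well. Undefined
// behavior such as an out-of-bounds read is rejected here by the compiler.
inline constexpr std::size_t kDigitsAtCompileTime = count_digits("[1, 22, 333]");
static_assert(kDigitsAtCompileTime == 6, "checked during compilation");

// Same function, but the argument is only known at run time, so the call is
// an ordinary run-time call and gets none of those checks.
inline std::size_t count_digits_at_runtime(std::string_view json) {
    return count_digits(json);
}
```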
|
| In JSON Link's case, since it was using C++17 at the time, it
| forced me to think around the problem of allocation and who
| does it. The library does not allocate, but potentially the
| data structures being deserialized to will. In C++20 you can
| get limited constexpr allocations but they are good for things
| like stacks and eliminating the fixed buffers many devs have
|   used in the past; which is a good thing on its own but isn't
|   really allowing one to parse to a vector at compile time (as in
| OP's example) for things that persist.
|
| Where this will get really interesting, though, is when #embed
| is in the major compilers. It's mostly there in clang, with gcc
|   on the way I believe. It will open the door for DSLs and
|   compile time configs in human readable formats or interop with
|   other tools (maybe GUI designers).
|
| As for OP's library, I am not a fan of the json_value like
| library approach that treats JSON as a thing to care about when
|   it is usually just an implementation detail on the way to one's
|   business objects.
|
| TL;DR The big benefit though, is the ability to reason about
| the quality of the code in the library and have stronger
| testing.
| dctwin wrote:
| You're right to point out that this is really 'first class
| JSON', rather than the Pydantic/Jackson type thing where the
| json barely exists and is immediately transformed into your
| models and classes.
|
| Thanks for reading the article though, that's cool. I am a
| daw_json_link fan
| beached_whale wrote:
| Build up the test suite inside static asserts, or a macro
| that lets you switch. It will be really nice when you
| update right and your IDE will tell you before you even hit
| compile because clangd found an issue.
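The "macro that lets you switch" could look something like this sketch.
JSON_CHECK, JSON_TESTS_AT_RUNTIME, parse_digit and run_parser_tests are
hypothetical names chosen for illustration.

```cpp
#include <cassert>

// With JSON_TESTS_AT_RUNTIME defined the checks become ordinary asserts
// (handy under a debugger); otherwise they are verified by the compiler.
#if defined(JSON_TESTS_AT_RUNTIME)
#  define JSON_CHECK(expr) assert(expr)
#else
#  define JSON_CHECK(expr) static_assert(expr, #expr)
#endif

// Tiny stand-in for a piece of parser logic under test.
constexpr int parse_digit(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// Both forms of the macro are valid inside a function body.
inline void run_parser_tests() {
    JSON_CHECK(parse_digit('7') == 7);
    JSON_CHECK(parse_digit('x') == -1);
}
```

In the default mode a failing check stops the build, and tooling such as
clangd can flag it in the editor.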
| pragma_x wrote:
| Since you've done this for real in a library, I have to ask:
| how would you decide to use a compile-time template solution
| like this versus a code generator or some other "outboard"
| tool to generate code?
|
| I'm curious since I've gone back and forth on this in my own
| career. Both approaches come with their own pros and cons,
|     but each gets us to the same place.
| stephc_int13 wrote:
| I am afraid of the compile-time cost.
|
| For this kind of thing I tend to prefer using a simpler program
| (written in anything you like) to generate C or C++ instead of
| having the compiler do the same thing much more slowly.
|
| Meta programming can be good, but it is even better done with an
| actual meta program, IMO.
| adolph wrote:
| Brings to mind the old story about a JSON DSL
|
| https://thedailywtf.com/articles/the-inner-json-effect
| threatripper wrote:
| Is this real? It can't be real. Nobody can be this stupid.
| But then again it takes a special kind of person who doesn't
| understand satire to actually do something like that.
|     Somebody about whom they would say "we trained him wrong on
| purpose as a kind of a joke".
| bruce511 wrote:
| I'm _pretty_ sure it's satire, but the fact that you and I
| can't say for sure is perhaps illustrative of the failure.
|
| I've encountered this pattern several times over my career.
| Some very smart programmer decides that for "reasons" the
| standard way to do something is "bad". (Usually
| "performance" or "bloat" are words bandied around.) They
| then happily architect a new system to replace the "old
| thing". Of course the new thing is completely undocumented
| (because genius programmers don't waste their time writing
| docs).
|
| If you're _lucky_ the programmer then spends his whole
| career there maintaining the thing. If you're lucky the
| whole thing becomes obsolete and discarded before he
| retires. Hint: You're not lucky.
|
| So what you are left with is this big ball of smoosh, with
| no documentation, that no-one can figure-out, much less
| understand. Oh he designed this before multi-core
|       processors were a thing? Before we switched to a preemptive
| threaded OS? Well no, none of the code is thread-safe, and
| he's left the company so we need someone to "just update
| it".
|
| There are reasons standard libraries exist. There are
| usually reasons they're a bit slower than hand-coding
|       specific cases in assembler. There are reasons why they are
|       "bloated" with support for lots of edge-cases (like
|       comments).
|
| When some really smart person starts talking about how it's
| all rubbish, be afraid. Be very afraid.
| sumtechguy wrote:
| > There are reasons standard libraries exist
|
| That right there. Before there is a standard lib for
| something if there are N people coding something up there
| could be N! ways to do something.
|
| If you do not know about a standard lib or it doesn't
| exist there will be some _wild_ code written.
|
| It is when that standard library shows up you should at
| least consider just throwing your bespoke code away. Not
| always but should at least be considered. I personally
| have replaced thousands of lines of code and modules I
| wrote just by switching them to some existing library.
| The upside is if that standard lib does not do what I
| want I have enough knowledge to either bend it around so
| it does or I can fix it up (or put my bespoke code back).
| I know I am not that smart, but I know enough that my
| code is probably brittle and probably should be thrown
| away.
|
| Also watch out for some 'standard libs'. Some of them are
| little more than someone's hobby project and have all the
| exact same issues you are trying to avoid. One project I
| worked on some guy had written a grid control. He was
| charging something like 10k a year to use it. But it was
| just one guy and I quote "i just touch it once or twice a
|         year and drink margaritas on the beach". It was a
|         bug-riddled mess we spent a not insignificant amount of
| time fixing. We bought another one for a onetime fee of
| 500 bucks and it was wildly faster and more importantly
| had near zero bugs and a turn around time of 1-2 days if
| we found one.
| _nalply wrote:
| It's the inner platform effect. When I was young I fell
| into the same trap. I invented a flexible database schema
| where I put each field into a database row with some
| metadata describing the field. But that's nonsense. Just
| use what the database provides.
|
| There's a Wikipedia page about it:
| https://en.wikipedia.org/wiki/Inner-platform_effect
|
| A variant of it is: Any sufficiently complicated program
| contains a slow and buggy implementation of half of Lisp.
|       That's Greenspun's tenth rule:
| https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule
|
|       To put it bluntly and a bit ironically, this applies to the
|       kernel as well: eBPF. That shouldn't be understood to mean
|       that eBPF is not well thought out!
| https://en.wikipedia.org/wiki/EBPF
| Joker_vD wrote:
| > flexible database schema where I put each field into a
| database row with some metadata describing the field.
|
| I imagine everyone has invented this scheme at one point
| or another. It's so obvious, when you think about it!
| 1f60c wrote:
| It has to be satire because of Tom's complete overreaction
| and the fact that comments are actually one of the easiest
| things to handle when building a lexer (usually, you just
| discard them). Eval'ing them makes no sense.
|
| That said, I suppose stranger things have happened.
| cdirkx wrote:
| Nah I've seen this happen IRL. In this system
| "configuration" was read out of tables in a word document,
| processed via XSLT transformations and eventually it would
| spit out a huuuuge single C# document (recent
| "improvement", before that it was some obscure licenced
| language). Builds happened overnight because they took so
| long, and there was no way to test something locally.
|
| The "advantage" of this system was that there was no need
| for programmers, as there was "no code", just
|       configuration! This was supposed to allow "domain
| experts" without programming knowledge to work with the
| system. However a month long training by the creator of the
| system was still required, as he had to explain which of
| the 7 boolean types you should use if you wanted to add a
| new column 0.o (for those who want to know, there was
| true/false, 0/1, yes/no, true/false/unknown, true/false
| rendered as a toggle, true/false rendered as a checkbox...)
| spacechild1 wrote:
| > In this system "configuration" was read out of tables
| in a word document, processed via XSLT transformations
| and eventually it would spit out a huuuuge single C#
| document
|
| This is hilarious! It takes a special kind of ignorance
| to come up with a solution like this.
| chipdart wrote:
| > Is this real? It can't be real. Nobody can be this
| stupid.
|
| Having worked in an org with an official in-house genius
| who was terribly tight with a tech-illiterate leadership
| and faked his way into his status, I can't really tell.
| Throwing people under the bus, blaming the world around
| them for problems created by your brittle code, shunning
| best practices in favor of finger-pointing... This happens
|       in small shops more often than we'd like to believe.
|
| As the saying goes, truth is stranger than fiction. Because
| fiction is expected to make sense.
| kazinator wrote:
|   > _generate C or C++ instead of having the compiler do the same
|   thing much more slowly_
|
| That's a wild-assed guess. A JSON decoder right in the compiler
| could easily be faster than generation involving extra tool
| invocations and multiple passes.
|
| Also, if you use ten code generators for ten different features
| in a pipeline instead of ten compile-time things built into the
|   language, will _that_ still be faster? What if most files use
|   just one or two features? You have to pass them through
| all the generators just in case; each generator decides whether
| the file contains anything that it knows how to expand.
| jayd16 wrote:
| I'm not taking sides but I don't think a code-gen tool
| necessitates re-scanning the entire codebase every compile.
| gRPC would be a good example.
| kazinator wrote:
| Well not every compile. Obviously, incremental compiles
| (thanks to a tool like make) notice that the generated code
| is still newer than the inputs.
|
| Obviously, you have files that are not generated. They
| don't need any gen tool.
|
| That's a disadvantage. If you want to start using JSON at
| compile time in a file, and the technology for that is a
| code generator, you have to move that file to a different
| category, perhaps by changing its suffix, and possibly
| indicate it somewhere in the build system as one of the
| sources needing the json generator. Whereas if it's in the
| language, you just do it in your .cpp file and that's it.
|
| Token based macro preprocessors and code generators are
| simply not defensible in the face of structural macro
| systems and compile-time evaluation. They are just
| something you use when you don't have the latter. You can
| use code generators and preprocessors with languages that
| don't have anything built in, and which are resistant to
| change (will not support any decent metaprogramming in the
| foreseeable future).
| Joker_vD wrote:
| > A JSON decoder right in the compiler could easily be faster
| than generation involving extra tool invocations and multiple
| passes.
|
| It also can easily be slower: C++ templates are not exactly
| known for their blazingly fast compilation speed. Besides,
| the program they encode in this case is effectively being
| interpreted by the C++ compiler which, I suppose, is not
| really optimized for that: it's still mostly oriented around
| emitting optimized machine code.
| pjc50 wrote:
| > You have to pass them through all the generators just in
| case; each generator decides whether the file contains
| anything that it knows how to expand.
|
| The C# approach for this is that code generators operate as
| compiler plugins (and therefore also _IDE_ plugins, so if you
| report an error from the code generator it goes with all the
| other compile errors). There is a two-pass approach where
| your plugin gets to scan the syntax tree quickly for "might
| be relevant" and then another go later; the first pass is
| cached.
|
| A limitation of the plugin approach is that your codegen code
| itself has to be in a separate project that gets compiled
| first.
|
| An argument in favor of separate-codegen is that if it breaks
| you can inspect the intermediate code, and indeed do things
| like breakpoints, logging and inspection in the code
| generator itself. The C++ approach seems like it might be
| hard to debug in some situations.
| dctwin wrote:
| Yes, I agree. I don't see much practical use in this. I was
|   just surprised how (relatively) straightforward this is to do,
| and thought it was more cool than useful
| silon42 wrote:
| Often I also find the opposite problem ... sure, you can do
| some stuff in (c++) metaprogramming, but can you (at compile
| time) generate a JSON/XML/YAML file that can be fed to some
| other part of the system?
| dctwin wrote:
| The opposite 'toString' problem seems harder - I didn't
| try, but it should be possible now that std::string is
| constexpr.
|
| I don't think you could parse it with, say, a class that
| has a std::string member (because of the transience
| restriction), but perhaps you can use lambdas that capture
| that string by reference, and call each other as
| appropriate?
|
| As for exporting that as some sort of compiler artefact for
| use elsewhere, I am not sure how you would do that...
| chipdart wrote:
| > Yes, I agree. I don't see much practical use in this.
|
| Me too. The best example I can come up with is loading test
| data in automated tests, but even then I wouldn't use this
| sort of approach.
| chipdart wrote:
| > I am afraid of the compile-time cost.
|
| Even though compilation time is the bane of C++, I think this
| concern regarding this specific usage is grossly overblown. I'm
| going to tell you why.
|
| With incremental builds you only rebuild whatever has changed
| in your project. Embedding JSON documents in a C++ app is the
|   kind of thing that is rarely touched, especially if all your
| requirements are met by serializing docs at compile time. This
| means that this deserialization will only be rarely rebuilt,
| and only under two scenarios: full rebuild, and touching the
| file.
|
| As far as full rebuilds go, there is no scenario where
| deserializing JSON represents a relevant task in your build
| tree.
|
| As for touching the file, if for some weird and unbelievable
| reason the build step for the JSON deserialization component is
| deemed too computationally expensive, it's trivial to move this
| specific component into a subproject that's built
| independently. This means that the full cost of an incremental
| build boils down to a) rebuilding your tiny JSON
| deserialization subproject, b) linking. Step a) runs happily in
|   parallel with any other build task, thus its impact is
| meaningless.
|
| To read more on the topic, google for "horizontal
| architecture", a concept popularized by the book "Large-Scale
| C++: Process and Architecture, Volume 1" By John Lakos.
|
| Mountain out of a molehill.
| OskarS wrote:
| There is another scenario where this is an issue: if this
| code ends up in a header which is included in a lot of
| places. You might say "that's dumb, don't do that", but there
| is a real tendency in C++ for things to migrate into headers
| (because they're templates, because you want them to be
| aggressively inlined, for convenience, whatever), and then
| headers get included into other headers, then without knowing
| it you suddenly have disastrous compile times.
|
| Like, for this particular example, you might start out with a
|     header that looks like:
|
|     SomeData get_data_from_json(std::string_view json);
|
| with nothing else in it, everything else in a .cpp file.
|
| Then somebody comes around and says "we'd like to reuse the
| parsing logic to get SomeOtherData as well" and your nice,
|     one-line header becomes
|
|     template<typename Ret>
|     Ret get_data_from_json(std::string_view json) {
|         // .. a gazillion lines of template-heavy code
|     }
|
| which ends up without someone noticing it in
| "CommonUtils.hpp", and now your compiler wants to curl up in
| a ball and cry every time you build.
|
| It takes more discipline than you think across a team to
| prevent this from happening, mostly because a lot of people
| don't take "this takes too long to compile" as a serious
| complaint if it involves any kind of other trade-off.
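One way to keep the template while keeping its body out of the header is
explicit instantiation, sketched below in a single file with comments
marking the header/.cpp split. This is an assumption about how the
problem could be handled, not something OskarS proposes; the struct
members and count_char are invented so the example has something to
compute.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>

// ---- would live in the header: declarations only ----
struct SomeData      { std::size_t object_count; };
struct SomeOtherData { std::size_t array_count; };

template <typename Ret>
Ret get_data_from_json(std::string_view json);

// Tells every includer that these instantiations are provided elsewhere,
// so the expensive body is not needed (or compiled) in each includer.
extern template SomeData get_data_from_json<SomeData>(std::string_view);
extern template SomeOtherData get_data_from_json<SomeOtherData>(std::string_view);

// ---- would live in one .cpp file ----
constexpr std::size_t count_char(std::string_view text, char wanted) {
    std::size_t n = 0;
    for (char c : text) {
        if (c == wanted) ++n;
    }
    return n;
}

template <typename Ret>
Ret get_data_from_json(std::string_view json) {
    // Stand-in for the "gazillion lines of template-heavy code".
    if constexpr (std::is_same_v<Ret, SomeData>) {
        return Ret{count_char(json, '{')};
    } else {
        return Ret{count_char(json, '[')};
    }
}

// Explicit instantiation definitions: the only place the body is compiled.
template SomeData get_data_from_json<SomeData>(std::string_view);
template SomeOtherData get_data_from_json<SomeOtherData>(std::string_view);
```

The trade-off is that every supported Ret type has to be listed by hand.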
| chipdart wrote:
| > There is another scenario where this is an issue: if this
| code ends up in a header which is included in a lot of
| places.
|
|       This is in and of itself a sign that your project is
| fundamentally broken, but this is already covered by
| scenario b) incremental builds.
|
|       Even if for some reason you resist the urge to follow
|       best practices and avoid creating your own problems, there are
| a myriad of techniques to limit the impact of touching a
| single file in your builds. Using a facade class to move
| your serialized JSON to an implementation detail of a class
| is perhaps the lowest effort one, but the textbook example
| would be something like a PIMPL.
|
|       The main problem with the build time of C++ projects is
| not the build times per se but clueless developers, who are
| oblivious to the problem domain, fumbling basic things and
| ending up creating their own problems. Once one of them
|       stops to ask themselves why the project is taking so much
| time to build, more often than not you find yourself a few
| commits away from dropping build times to a fraction of the
| cost. Even onboarding something like ccache requires no
| more than setting an environment variable.
| actionfromafar wrote:
| Fundamentally broken, or waiting for modules to become a
| thing? I tried to use https://github.com/mjspncr/lzz3 for
| a few years but it became impractical to me to fiddle
| with tooling.
|
| _a_ : You don't have source file and header file, you
| put everything in one file and _lzz_ sorts it out during
| build.
| anothername12 wrote:
| Not a C++ user, but is this the same as #. reader macro in Common
| Lisp?
| kazinator wrote:
|   Also, this:
|
|   (defmacro macro-time (&rest forms)
|     `(quote ,(eval `(progn ,@forms))))
|
| _forms_ are evaluated at macro-expansion-time, and their result
| is quoted, and substituted for the (macro-time ...) invocation.
|
| For instance, if we have a _snarf-file_ function which reads a
| text file and returns the contents as a string, we can do:
|
|   (macro-time (snarf-file "foo.txt"))
|
| and we now have the contents of foo.txt as a string literal.
| heisig wrote:
|   Yes, the #. reader macro is one of the ways you can achieve
| this in Common Lisp. Using the reader macro is also way more
| efficient because you don't awkwardly use your compiler as an
| interpreter for a weird subset of your actual language - you
| simply call to compiled code.
|
| Seeing Greenspun's tenth rule [1] in action again and again is
| one of the weird things we Common Lisp programmers have to
| endure. I wish we would have more discussions on how to improve
| Lisp even further instead of trying to 'fix' C or C++ for the
| umpteenth time.
|
| [1] https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule
| ykonstant wrote:
| >I wish we would have more discussions on how to improve Lisp
| even further instead of trying to 'fix' C or C++ for the
| umpteenth time.
|
| I agree one million percent; projects like SBCL are great,
| but my impression is that there are tons of improvements to
| be had in producing optimized code for modern processors
| (cache friendliness, SIMD, etc), GPU programming etc. I asked
| about efforts in those directions here and there, but did not
| get very clear answers.
| abbeyj wrote:
| Could you use something like `template <StringLiteral str>
| constexpr inline Key<str> key;`? Then you could write
| `key<"myKey">` instead of `Key<"myKey">{}`, saving you from
| needing the `{}` each time.
| dctwin wrote:
| Hm - so this would instantiate a variable for each key in the
| class namespace? I admit I haven't seen anything like this but
| sounds very interesting
| actionfromafar wrote:
| Where are the functional language programmers so I can hold their
| beer?
| fsloth wrote:
| What a beautiful example of abuse of C++ templates. I love it.
|
| But please don't do this in production.
|
| Whatever you need to do, use C++ templates as the last resort
| because you've figured out all other approaches suck even more.
| Maintaining template heavy code is absolutely horrible and
| wasteful (and if it's C++ production code we measure its
| lifetime in decades). And no, there is no way "to do it correctly
| so it doesn't suck".
|
| Templates belong to the lowest abstraction levels - as stl mostly
| does. Anything more prevalent is an abomination.
|
| If the schema is fixed, have types with the data and if you have
| default data, provide it using initializer lists.
|
| Ie. have a struct or structs with explicit serializeToJson and
| deserializeFromJson functions.
|
| It's faster to write than figuring out the correct template
| gymnastics and about 100x easier to maintain and extend.
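For comparison, the explicit approach fsloth recommends might look like
the sketch below. WindowConfig and its fields are a hypothetical schema.
The sscanf-based reader is an assumption made to keep the example short:
it only accepts the exact layout the writer produces and does not check
the closing brace.

```cpp
#include <cassert>
#include <cstdio>
#include <optional>
#include <string>

// Plain struct for a fixed schema; the member initializers are the defaults.
struct WindowConfig {
    int width = 800;
    int height = 600;
};

inline std::string serializeToJson(const WindowConfig& cfg) {
    return "{\"width\":" + std::to_string(cfg.width) +
           ",\"height\":" + std::to_string(cfg.height) + "}";
}

// Returns std::nullopt unless both fields were read.
inline std::optional<WindowConfig> deserializeFromJson(const std::string& json) {
    WindowConfig cfg;
    const int matched = std::sscanf(json.c_str(),
                                    "{\"width\":%d,\"height\":%d}",
                                    &cfg.width, &cfg.height);
    if (matched != 2) return std::nullopt;
    return cfg;
}
```

A real project would more likely hand the parsing to an existing JSON
library and keep only the field mapping in these two functions.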
___________________________________________________________________
(page generated 2024-07-10 23:02 UTC)