[HN Gopher] Building a high performance JSON parser
       ___________________________________________________________________
        
       Building a high performance JSON parser
        
       Author : davecheney
       Score  : 349 points
       Date   : 2023-11-05 13:10 UTC (9 hours ago)
        
 (HTM) web link (dave.cheney.net)
 (TXT) w3m dump (dave.cheney.net)
        
       | denysvitali wrote:
       | Also interesting: https://youtu.be/a7VBbbcmxyQ
        
       | eatonphil wrote:
        | The walkthrough is very nice: how to do this if you're going to
       | do it.
       | 
       | If you're going for pure performance in a production environment
       | you might take a look at Daniel Lemire's work:
       | https://github.com/simdjson/simdjson. Or the MinIO port of it to
       | Go: https://github.com/minio/simdjson-go.
        
         | vjerancrnjak wrote:
         | If your JSON always looks the same you can also do better than
         | general JSON parsers.
        
           | lylejantzi3rd wrote:
           | Andreas Fredriksson demonstrates exactly that in this video:
           | https://vimeo.com/644068002
        
             | ykonstant wrote:
             | Very enjoyable video!
        
             | kaladin_1 wrote:
             | I really enjoyed this video even though he lost me with the
             | SIMD code.
        
           | diarrhea wrote:
           | I wonder: can fast, special-case JSON parsers be dynamically
           | autogenerated from JSON Schemas?
           | 
           | Perhaps some macro-ridden Rust monstrosity that spits out
           | specialised parsers at compile time, dynamically...
        
             | minhazm wrote:
             | For json schema specifically there are some tools like go-
             | jsonschema[1] but I've never used them personally. But you
              | can use something like ffjson[2] in Go to generate a static
             | serialize/deserialize function based on a struct
             | definition.
             | 
             | [1] https://github.com/omissis/go-jsonschema [2]
             | https://github.com/pquerna/ffjson
        
               | atombender wrote:
               | Hey, go-jsonschema is my project. (Someone else just took
               | over maintaining it, though.) It still relies on the
                | standard Go parser; all it does is generate structs with
               | the right types and tags.
        
             | galangalalgol wrote:
             | Doesn't the serde crate's json support do precisely this?
             | It generates structs that have optional in all the right
             | places and with all the right types anyway. Seems like the
              | LLVM optimiser can probably do something useful with that
              | even if the serde feature isn't using a priori knowledge
              | from the schema.
        
             | dleeftink wrote:
             | Somewhat tangentially related, Fabian Iwand posted this
             | regex prefix tree visualiser/generator last week [0], which
              | may offer some inspiration for prototyping auto-generated
             | schemas.
        
               | atombender wrote:
               | You forgot to include the link?
        
             | PartiallyTyped wrote:
              | Pydantic does that to some extent, I think.
        
             | marginalia_nu wrote:
             | A fundamental problem with JSON parsing is that it has
              | variable-length fields that don't encode their length; in a
              | streaming scenario you basically need to keep resizing your
              | buffer until the data fits. If the data is on disk and not
             | streaming you may get away with reading ahead to find the
             | end of the field first, but that's also not particularly
             | fast.
             | 
             | Schemas can't fix that.
        
             | mhh__ wrote:
              | It's relatively common in D applications to use the compile-
              | time capabilities to generate a parser at compile time.
        
           | loeg wrote:
           | You might also move to something other than JSON if parsing
           | it is a significant part of your workload.
        
             | haswell wrote:
             | Most of the times I've had to deal with JSON performance
             | issues, it involved a 3rd party API and JSON was the only
             | option.
             | 
             | If you're building something net-new and know you'll have
             | these problems out the gate, something other than JSON
             | might be feasible, but the moment some other system not in
             | the closed loop needs to work with the data, you're back to
             | JSON and any associated perf issues.
        
         | fooster wrote:
         | Last time I compared the performance of various json parsers
         | the simd one turned out to be disappointingly slow.
        
         | Thaxll wrote:
          | The fastest JSON lib in Go is the one done by the company
          | behind TikTok.
        
           | ken47 wrote:
           | Fastest at what?
        
             | cannonpalms wrote:
             | > For all sizes of json and all scenarios of usage, Sonic
             | performs best.
             | 
             | The repository has benchmarks
        
               | mananaysiempre wrote:
               | I'm not seeing simdjson in them though? I must be missing
               | something because the Go port of it is explicitly
               | mentioned in the motivation[1] (not the real thing,
               | though).
               | 
               | [1] https://github.com/bytedance/sonic/blob/main/docs/INT
               | RODUCTI...
        
           | rockinghigh wrote:
           | https://github.com/bytedance/sonic
        
           | pizzafeelsright wrote:
            | Excellent threat vector.
        
         | lionkor wrote:
         | simdjson has not been the fastest for a long long time
        
           | jzwinck wrote:
           | What is faster? According to
           | https://github.com/kostya/benchmarks#json nothing is.
        
       | rexfuzzle wrote:
       | Great to see a shout out to Phil Pearl! Also worth looking at
       | https://github.com/bytedance/sonic
        
       | Galanwe wrote:
       | "Json" and "Go" seem antithetical in the same sentence as "high
       | performance" to me. As long as we are talking about _absolute
       | performance_.
       | 
       | When talking about _relative performance_, as in _compared to
       | what can be done with a somewhat similar stack_, I feel like it
        | deserves a mention of what the stack is; otherwise there is
       | no reference point.
        
         | bsdnoob wrote:
          | Did you even open the article? The following is literally in
          | the first paragraph:
         | 
         | > This package offers the same high level json.Decoder API but
         | higher throughput and reduced allocations
        
           | jcelerier wrote:
           | How does that contradict what the parent poster says? I think
           | it's very weird to call something "high performance" when it
            | looks like it's maybe 15-20% of the performance of simdjson
            | in C++. This is not "going from normal performance to high
            | performance", this is going from "very subpar" to "subpar".
        
             | willsmith72 wrote:
             | Ok but how many teams are building web APIs in C++?
        
               | TheCleric wrote:
               | I worked with a guy who did this. It was fast, but boy
               | howdy was it not simple.
        
             | lolinder wrote:
             | Par is different for different stacks. It's reasonable for
             | someone to treat their standard library's JSON parser as
             | "par", given that that's the parser that most of their
             | peers will be using, even if there are faster options that
             | are commonly used in other stacks.
        
             | Thaxll wrote:
              | Because even in C++ people don't use simdjson; most projects
              | use RapidJSON, which Go is on par with.
        
         | SmoothBrain12 wrote:
         | Exactly, it's not too hard to implement in C. The one I made
          | never copied data; instead it saved the pointer/length to the
          | data. The user only had to memory-map the file (or equivalent)
          | and pass that data into the parser. The only memory allocation
          | was for the JSON nodes.
         | 
         | This way they only paid the parsing tax (decoding doubles,
         | etc..) if the user used that data.
         | 
         | You hit the nail on the head
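          | 
          | Roughly the shape of it, sketched here in Go rather than C and
          | with made-up names: a node just records where a value lives,
          | and decoding is deferred until a field is actually read.
          | 
          |     package main
          | 
          |     import (
          |         "fmt"
          |         "strconv"
          |     )
          | 
          |     // node points into the input buffer; nothing is copied
          |     // or decoded up front.
          |     type node struct {
          |         kind       byte // '{', '[', '"', 't', 'f', 'n', '0'
          |         start, end int  // byte offsets into the input
          |     }
          | 
          |     // Float pays the "parsing tax" lazily, only for fields
          |     // the caller actually reads.
          |     func (n node) Float(input []byte) (float64, error) {
          |         s := string(input[n.start:n.end])
          |         return strconv.ParseFloat(s, 64)
          |     }
          | 
          |     func main() {
          |         input := []byte(`{"price": 9.99}`)
          |         n := node{kind: '0', start: 10, end: 14}
          |         v, _ := n.Float(input)
          |         fmt.Println(v) // 9.99
          |     }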
        
           | compressedgas wrote:
           | Like how https://github.com/zserge/jsmn works. I thought it
            | would be neat to have such a parser for
           | https://github.com/vshymanskyy/muon
        
           | mr_mitm wrote:
           | Did you publish the code somewhere? I'd be interested in
           | reading it.
        
           | eska wrote:
           | I also really like this paradigm. It's just that in old
           | crusty null-terminated C style this is really awkward because
           | the input data must be copied or modified. But it's not an
            | issue when using slices (length and pointer). Unfortunately
            | most of the C standard library and many operating system APIs
            | expect null-terminated strings.
           | 
           | I've seen this referred to as a pull parser in a Rust
           | library? (https://github.com/raphlinus/pulldown-cmark)
        
           | Aurornis wrote:
           | The linked article is a GopherCon talk.
           | 
           | The first line of the article explains the context of the
           | talk:
           | 
           | > This talk is a case study of designing an efficient Go
           | package.
           | 
           | The target audience and context are clearly Go developers.
           | Some of these comments are focusing too much on the headline
           | without addressing the actual article.
        
           | cryo wrote:
           | I've made a JSON parser which works like this too. No dynamic
           | memory allocations, similar to the JSMN parser but stricter
           | to the specification.
           | 
           | Always nice to be in control over memory :)
           | 
           | https://sr.ht/~cryo/cj
        
           | hgs3 wrote:
           | Yup and if your implementation uses a hashmap for object key
           | -> value lookup, then I recommend allocating the hashmap
            | _after_ parsing the object, not during, to avoid continually
            | resizing the hashmap. You can implement this by using an
            | intrusive linked list to track your key/value JSON nodes
           | until the time comes to allocate the hashmap. Basically when
           | parsing an object 1. use a counter 'N' to track the number of
           | keys, 2. link the JSON nodes representing key/value pairs
           | into an intrusive linked list, 3. after parsing the object
           | use 'N' to allocate a perfectly sized hashmap in one go. You
           | can then iterate over the linked list of JSON key/value pair
           | nodes adding them to the hashmap. You can use this same trick
           | when parsing JSON arrays to avoid continually resizing a
           | backing array. Alternatively, never allocate a backing array
           | and instead use the linked list to implement an iterator.
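            | 
            | A rough Go sketch of the idea (names invented): members are
            | linked while parsing, and the map is sized exactly once at
            | the closing brace.
            | 
            |     package main
            | 
            |     import "fmt"
            | 
            |     // member is one key/value pair, linked intrusively so no
            |     // map or backing array exists while parsing the object.
            |     type member struct {
            |         key   string
            |         value any
            |         next  *member
            |     }
            | 
            |     // buildMap runs after the closing '}'; n is the key count
            |     // gathered while parsing, so the map is allocated once.
            |     func buildMap(head *member, n int) map[string]any {
            |         m := make(map[string]any, n)
            |         for e := head; e != nil; e = e.next {
            |             m[e.key] = e.value
            |         }
            |         return m
            |     }
            | 
            |     func main() {
            |         // pretend the parser just produced these two members
            |         b := &member{key: "b", value: 2}
            |         a := &member{key: "a", value: 1, next: b}
            |         fmt.Println(buildMap(a, 2)) // map[a:1 b:2]
            |     }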
        
           | duped wrote:
           | > The user only had to Memory Map the file (or equivalent)
           | 
           | Having done this myself, it's a massive cheat code because
           | your bottleneck is almost always i/o and memory mapped i/o is
           | orders of magnitude faster than sequential calls to read().
           | 
           | But that said it's not always appropriate. You can have
           | gigabytes of JSON to parse, and the JSON might be available
           | over the network, and your service might be running on a
           | small node with limited memory. Memory mapping here adds
           | quite a lot of latency and cost to the system. A very fast
           | streaming JSON decoder is the move here.
        
             | vlovich123 wrote:
             | > memory mapped i/o is orders of magnitude faster than
             | sequential calls to read()
             | 
             | That's not something I've generally seen. Any source for
             | this claim?
             | 
             | > You can have gigabytes of JSON to parse, and the JSON
             | might be available over the network, and your service might
             | be running on a small node with limited memory. Memory
             | mapping here adds quite a lot of latency and cost to the
             | system
             | 
             | Why does mmap add latency? I would think that mmap adds
             | more latency for small documents because the cost of doing
             | the mmap is high (cross CPU TLB shoot down to modify the
             | page table) and there's no chance to amortize. Relatedly,
             | there's minimal to no relation between SAX vs DOM style
             | parsing and mmap - you can use either with mmap. If you're
             | not aware, you do have some knobs with mmap to hint to the
             | OS how it's going to be used although it's very unwieldy to
             | configure it to work well.
        
               | duped wrote:
               | Experience? Last time I made that optimization it was
               | 100x faster, ballpark. I don't feel like benchmarking it
               | right now, try yourself.
               | 
               | The latency comes from the fact you need to have the
               | whole file. The use case I'm talking about is a JSON
               | document you need to pull off the network because it
               | doesn't exist on disk, might not fit there, and might not
               | fit in memory.
        
               | vlovich123 wrote:
               | > Experience? Last time I made that optimization it was
               | 100x faster, ballpark. I don't feel like benchmarking it
               | right now, try yourself.
               | 
               | I have. Many times. There's definitely not a 100x
               | difference given that normal file I/O can easily saturate
               | NVMe throughput. I'm sure it's possible to build a repro
               | showing a 100x difference, but you have to be doing
               | something intentionally to cause that (e.g. using a very
               | small read buffer so that you're doing enough syscalls
               | that it shows up in a profile).
               | 
               | > The latency comes from the fact you need to have the
               | whole file
               | 
               | That's a whole other matter. But again, if you're pulling
               | it off the network, you usually can't mmap it anyway
               | unless you're using a remote-mounted filesystem (which
               | will add more overhead than mmap vs buffered I/O).
        
               | duped wrote:
               | I think you misunderstood my point, which was to
               | highlight exactly when mmap won't work....
        
               | ben-schaaf wrote:
               | In my experience mmap is at best 50% faster compared to
               | good pread usage on Linux and MacOS.
        
         | Aurornis wrote:
         | > "Json" and "Go" seem antithetical in the same sentence as
         | "high performance" to me
         | 
         | Absolute highest performance is rarely the highest priority in
         | designing a system.
         | 
         | Of course we could design a hyper-optimized, application
         | specific payload format and code the deserializer in assembly
         | and the performance would be great, but it wouldn't be useful
         | outside of very specific circumstances.
         | 
         | In most real world projects, performance of Go and JSON is fine
          | and allows for rapid development, easy implementation, and
         | flexibility if anything changes.
         | 
         | I don't think it's valid to criticize someone for optimizing
         | within their use case.
         | 
         | > I feel like it should deserve a mention of what the stack is,
         | otherwise there is no reference point.
         | 
         | The article clearly mentions that this is a GopherCon talk in
         | the header. It was posted on Dave Cheney's website, a well-
         | known Go figure.
         | 
         | It's clearly in the context of Go web services, so I don't
         | understand your criticisms. The context is clear from the
         | article.
         | 
         | The _very first line_ of the article explains the context:
         | 
         | > This talk is a case study of designing an efficient Go
         | package.
        
         | riku_iki wrote:
         | > "Json" and "Go" seem antithetical in the same sentence as
         | "high performance" to me. As long as we are talking about
         | _absolute performance_.
         | 
         | Even just "Json" is problematic here as wire protocol for
         | absolute performance no matter what will be programming
         | language.
        
           | 3cats-in-a-coat wrote:
            | I think people say that as they give disproportionate weight
           | to the fact it's text-based, while ignoring how astoundingly
           | simple and linear it is to write and read.
           | 
           | The only way to nudge the needle is to start exchanging
            | direct memory dumps, which is what Protobuf and the like do.
           | But this is clearly only for very specific use.
        
             | riku_iki wrote:
             | > while ignoring how astoundingly simple and linear it is
             | to write and read.
             | 
              | The code may be simple, but you have lots of performance
              | penalties: resolving field keys, and constructing
              | complicated data structures through memory allocations,
              | which is expensive.
             | 
              | > to start exchanging direct memory dumps, which is what
              | Protobuf and the like do
              | 
              | Protobuf actually does parsing; it is just a binary
              | format. What you are describing is more like FlatBuffers.
             | 
             | > But this is clearly only for very specific use.
             | 
              | Yes, that specific use is high-performance computation :)
        
         | crazygringo wrote:
         | JSON is often the format of an externally provided data source,
         | and you don't have a choice.
         | 
         | And whatever language you're writing in, you usually want to do
         | what you can to maximize performance. If your JSON input is 500
         | bytes it probably doesn't matter, but if you're intaking a 5 MB
         | JSON file then you can definitely be sure the performance does.
         | 
         | What more do you need to know about "the stack" in this case?
         | It's whenever you need to ingest large amounts of JSON in Go.
         | Not sure what could be clearer.
        
         | 3cats-in-a-coat wrote:
         | JSON is probably the fastest serialization format to produce
         | and parse, which is also safe for public use, compared to
         | binary formats which often have fragile, highly specific and
         | vulnerable encoding as they're directly plopped into memory and
         | used as-is (i.e. they're not parsed at all, it's just two
         | computers exchanging memory dumps).
         | 
         | Compare it with XML for example, which is a nightmare of
         | complexity if you actually follow the spec and not just make
         | something XML-like.
         | 
         | We have some formats which try to walk the boundary between
         | safe/universal and fast like ASN.1 but those are obscure at
         | best.
        
           | alpaca128 wrote:
           | I prefer msgpack if the data contains a lot of numeric
           | values. Representing numbers as strings like in JSON can blow
           | up the size and msgpack is usually just as simple to use.
        
         | tptacek wrote:
         | This is an article about optimizing a JSON parser in Go.
         | 
         | As always, try to remember that people usually aren't writing
         | (or posting their talks) specifically for an HN audience.
         | Cheney clearly has an audience of Go programmers; that's the
         | space he operates in. He's not going to title his post, which
         | he didn't make for HN, just to avoid a language war on the
         | threads here.
         | 
         | It's our responsibility to avoid the unproductive language war
         | threads, not this author's.
        
         | hoosieree wrote:
          | Even sticking within the confines of JSON there's low-hanging
          | fruit, e.g. if your data is typically like:
         | [{"k1": true, "k2": [2,3,4]}, {"k1": false, "k2": []}, ...]
         | 
         | You can amortize the overhead of the keys by turning this from
         | an array of structs (AoS) into a struct of arrays (SoA):
         | {"k1": [true, false, ...], "k2": [[2,3,4], [], ...]}
         | 
         | Then you only have to read "k1" and "k2" once, instead of _once
         | per record_. Presumably there will be the odd record that
         | contains something like { "k3": 0} but you can use mini batches
         | of SoA and tune their size according to your desired
         | latency/throughput tradeoff.
         | 
         | Or if your data is 99.999% of the time just pairs of k1 and k2,
          | turn them into tuples:
          | {"k1k2": [true,[2,3,4],false,[], ...]}
         | 
         | And then 0.001% of the time you send a lone k3 message:
         | {"k3": 2}
         | 
         | Even if your endpoints can't change their schema, you can still
         | trade latency for throughput by doing the SoA conversion,
         | transmitting, then converting back to AoS at the receiver.
         | Maybe worthwhile if you have to forward the message many times
         | but only decode it once.
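          | 
          | If the shape really is fixed, the conversion is mechanical; a
          | rough Go sketch (type names invented):
          | 
          |     package main
          | 
          |     import (
          |         "encoding/json"
          |         "fmt"
          |     )
          | 
          |     // rec is the array-of-structs shape: keys repeat per record.
          |     type rec struct {
          |         K1 bool  `json:"k1"`
          |         K2 []int `json:"k2"`
          |     }
          | 
          |     // soa is the struct-of-arrays shape: each key appears once.
          |     type soa struct {
          |         K1 []bool  `json:"k1"`
          |         K2 [][]int `json:"k2"`
          |     }
          | 
          |     func toSoA(in []rec) soa {
          |         var out soa
          |         for _, r := range in {
          |             out.K1 = append(out.K1, r.K1)
          |             out.K2 = append(out.K2, r.K2)
          |         }
          |         return out
          |     }
          | 
          |     func main() {
          |         in := []rec{{true, []int{2, 3, 4}}, {false, []int{}}}
          |         b, _ := json.Marshal(toSoA(in))
          |         fmt.Println(string(b))
          |         // {"k1":[true,false],"k2":[[2,3,4],[]]}
          |     }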
        
         | kunley wrote:
         | "Antiethical".
         | 
          | Is it true that mindlessly bashing Go has become some kind of
          | new religion in some circles?
        
       | kevingadd wrote:
       | I'm surprised there's no way to say 'I really mean it, inline
       | this function' for the stuff that didn't inline because it was
       | too big.
       | 
       | The baseline whitespace count/search operation seems like it
       | would be MUCH faster if you vectorized it with SIMD, but I can
       | understand that being out of scope for the author.
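        | 
        | For reference, the scalar shape of that baseline is presumably
        | something like the snippet below (JSON whitespace is space,
        | tab, CR, LF); a SIMD version would classify 16-64 bytes per
        | instruction instead of one at a time.
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     // whitespace marks the four JSON whitespace bytes.
        |     var whitespace = [256]bool{
        |         ' ': true, '\t': true, '\r': true, '\n': true,
        |     }
        | 
        |     // countWhitespace is a scalar baseline: each byte costs a
        |     // load, a table lookup and a branch.
        |     func countWhitespace(buf []byte) int {
        |         n := 0
        |         for _, c := range buf {
        |             if whitespace[c] {
        |                 n++
        |             }
        |         }
        |         return n
        |     }
        | 
        |     func main() {
        |         fmt.Println(countWhitespace([]byte("{ \"a\": 1 }\n")))
        |         // prints 4
        |     }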
        
         | mgaunard wrote:
         | Of course you can force-inline.
        
           | cbarrick wrote:
           | Obviously you can manually inline functions. That's what
           | happened in the article.
           | 
           | The comment is about having a directive or annotation to make
           | the compiler inline the function for you, which Go does not
            | have. IMO, the pre-inline code was cleaner. It's a
           | shame that the compiler could not optimize it.
           | 
           | There was once a proposal for this, but it's really against
           | Go's design as a language.
           | 
           | https://github.com/golang/go/issues/21536
        
             | mgaunard wrote:
             | You can in any systems programming language.
             | 
             | Go is mostly a toy language for cloud people.
        
               | cbarrick wrote:
               | > toy language
               | 
               | You may be surprised to hear that Go is used in a ton of
               | large scale critical systems.
        
               | mgaunard wrote:
               | I don't consider cloud technology a critical system.
        
       | peterohler wrote:
       | You might want to take a look at https://github.com/ohler55/ojg.
       | It takes a different approach with a single pass parser. There
       | are some performance benchmarks included on the README.md landing
       | page.
        
       | arun-mani-j wrote:
        | I remember reading an SO question which asked for a C library to
       | parse JSON. A comment was like - C developers won't use a library
       | for JSON, they will write one themselves.
       | 
       | I don't know how "true" that comment is but I thought I should
       | try to write a parser myself to get a feel :D
       | 
       | So I wrote one, in Python - https://arunmani.in/articles/silly-
       | json-parser/
       | 
       | It was a delightful experience though, writing and testing to
        | break your own code with a wide variety of inputs. :)
        
         | xoac wrote:
         | Good for you but what does this have to do with the article?
        
         | janmo wrote:
         | I wrote a small JSON parser in C myself which I called jsoncut.
          | It just cuts out a certain part of a JSON file. I deal with
          | large JSON files, but only want to extract and parse certain
          | parts of them. All libraries I tried parse everything, use a lot
         | of RAM and are slow.
         | 
         | Link here, if interested to have a look:
         | https://github.com/rgex/jsoncut
        
           | vlovich123 wrote:
           | The words you're looking for are SAX-like JSON parser or
            | streaming JSON parser. I don't know if there are any command-
            | line tools like the one you wrote that use it, though, to
            | provide a jq-like interface.
        
             | janmo wrote:
             | I tried JQ and other command line tools, all were extremely
             | slow and seemed to always parse the entire file.
             | 
             | My parser just reads the file byte by byte until it finds
             | the target, then outputs the content. When that's done it
             | stops reading the file, meaning that it can be extremely
             | fast when the targeted information is at the beginning of
             | the JSON file.
        
               | vlovich123 wrote:
               | You're still describing a SAX parser (i.e. streaming). jq
               | doesn't use a SAX parser because it's a multi-pass
               | document editor at its core, hence why I said "jq-like"
               | in terms of supporting a similar syntax for single-pass
               | queries. If you used RapidJSON's SAX parser in the body
               | of your custom code (returning false once you found what
               | you're looking for), I'm pretty sure it would
               | significantly outperform your custom hand-rolled code. Of
               | course, your custom code is very small with no external
               | dependencies and presumably fast enough, so tradeoffs.
        
         | masklinn wrote:
         | > I remember reading a SO question which asks for a C library
         | to parse JSON. A comment was like - C developers won't use a
         | library for JSON, they will write one themselves.
         | 
         | > I don't know how "true" that comment is
         | 
         | Either way it's a good way to get a pair of quadratic loops in
         | your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-
         | loading-times...
        
         | jjice wrote:
          | I guess there are only so many ways to write a JSON parser,
          | because one I wrote on a train in Python looks very similar!
         | 
          | I thought it would be nice and simple, but it was still even
          | simpler than I expected. It's a fantastic spec if you need to
         | throw one together yourself, without massive performance
         | considerations.
        
       | visarga wrote:
        | Nowadays I am more interested in a "forgiving" JSON/YAML parser
        | that would recover from LLM errors. Is there such a thing?
        
         | kevingadd wrote:
         | If the LLM did such a bad job that the syntax is wrong, do you
         | really trust the data inside?
         | 
         | Forgiving parsers/lexers are common in language compilers for
         | languages like rust or C# or typescript, you may want to
         | investigate typescript in particular since it's applicable to
         | JSON syntax. Maybe you could repurpose their parser.
        
         | RichieAHB wrote:
         | I feel like trying to infer valid JSON from invalid JSON is a
         | recipe for garbage. You'd probably be better off doing a second
         | pass with the "JSON" through the LLM but, as the sibling
         | commenter said, at this point even the good JSON may be garbage
         | ...
        
         | _dain_ wrote:
         | halloween was last week
        
         | explaininjs wrote:
         | Perhaps not quite what you're asking for, but along the same
         | lines there's this "Incomplete JSON" parser, which takes a
         | string of JSON as it's coming out of an LLM and parses it into
         | as much data as it can get. Useful for building streaming UI's,
         | for instance it is used on https://rexipie.com quite
         | extensively.
         | 
         | https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...
         | 
          | Some example test cases:
          | 
          |     { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
          |     { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },
          |     { input: "[{},", output: [{}] },
          |     { input: "[{},1", output: [{}, 1] },
          |     { input: '[{},"', output: [{}, ""] },
          |     { input: '[{},"abc', output: [{}, "abc"] },
         | 
         | Work could be done to optimize it, for instance add streaming
          | support. But the cycles consumed either way are minimal for
          | LLM-output-length-constrained JSON.
         | 
         | Fun fact: as best I can tell, GPT-4 is entirely unable to
         | synthesize code to accomplish this task. Perhaps that will
         | change as this implementation is made public, I do not know.
        
         | gurrasson wrote:
         | The jsonrepair tool https://github.com/josdejong/jsonrepair
         | might interest you. It's tailored to fix JSON strings.
         | 
         | I've been looking into something similar for handling partial
          | JSONs, where you only have the first n chars of a JSON. This is
          | common with LLMs with streamed outputs aimed at reducing
          | latency. If one knows the JSON schema ahead of time, one can start
         | processing these first fields before the remaining data has
         | fully loaded. If you have to wait for the whole thing to load
         | there is little point in streaming.
         | 
         | Was looking for a library that could do this parsing.
        
           | explaininjs wrote:
           | See my sibling comment :)
        
       | mannyv wrote:
       | These are always interesting to read because you get to see
       | runtime quirks. I'm surprised there was so much function call
       | overhead, for example. And it's interesting you can bypass range
        | checking.
       | 
       | The most important thing, though, is the process: measure then
       | optimize.
        
       | isuckatcoding wrote:
       | This is fantastically useful.
       | 
       | Funny enough I stumbled upon your article just yesterday through
        | Google search.
        
       | mgaunard wrote:
       | "It's unrealistic to expect to have the entire input in memory"
       | -- wrong for most applications
        
         | isuckatcoding wrote:
         | Yes but for applications where you need to do ETL style
         | transformations on large datasets, streaming is an immensely
         | useful strategy.
         | 
          | Sure, you could argue Go isn't the right tool for the job, but I
         | don't see why it can't be done with the right optimizations
         | like this effort.
        
           | dh2022 wrote:
           | If performance is important why would you keep large datasets
           | in JSON format?
        
             | querulous wrote:
             | sometimes it's not your data
        
             | isuckatcoding wrote:
             | Usually because the downstream service or store needs it
        
             | Maxion wrote:
             | Because you work at or for some bureaucratic MegaCorp, that
             | does weird things with no real logic behind it other than
             | clueless Dilbert managers making decisions based on
             | LinkedIn blogs. Alternatively desperate IT consultants
             | trying to get something to work with too low of a budget
             | and/or no access to do things the right way.
             | 
             | Be glad you have JSON to parse, and not EDI, some custom
              | delimited data format (with no or old documentation) - or
             | _shudders_ you work in the airline industry with SABRE.
        
         | capableweb wrote:
         | https://yourdatafitsinram.net/
        
         | mannyv wrote:
         | If you're building a library you either need to explicitly call
         | out your limits or do streaming.
         | 
          | I've pumped gigs of JSON data, so a streaming parser is
         | appreciated. Plus streaming shows the author is better at
         | engineering and is aware of the various use cases.
         | 
         | Memory is not cheap or free except in theory.
        
           | jjeaff wrote:
           | I guess it's all relative. Memory is significantly cheaper if
           | you get it anywhere but on loan from a cloud provider.
        
             | mannyv wrote:
             | RAM is always expensive no matter where you get it from.
             | 
             | Would you rather do two hours of work or force thousands of
             | people to buy more RAM because your library is a memory
             | hog?
             | 
              | And on embedded systems RAM is at a premium. More RAM =
              | more cost.
        
         | e12e wrote:
         | If you can live with "fits on disk" mmap() is a viable option?
         | Unless you truly need streaming (early handling of early data,
         | like a stream of transactions/operations from a single JSON
         | file?)
        
           | mannyv wrote:
            | In general, JSON comes over the network, so mmap won't really
           | work unless you save to a file. But then you'll run out of
           | disk space.
           | 
           | I mean, you have a 1k, 2k, 4k buffer. Why use more, because
           | it's too much work?
        
         | ahoka wrote:
         | Most applications read JSONs from networks, where you have a
         | stream. Buffering and fiddling with the whole request in memory
         | increases latency by a lot, even if your JSON is smallish.
        
           | mgaunard wrote:
           | On a carefully built WebSocket server you would ensure your
           | WebSocket messages all fit within a single MTU.
        
           | Rapzid wrote:
           | Most( _most_ ) JSON payloads are probably much smaller than
           | many buffer sizes so just end up all in memory anyway.
        
       | jensneuse wrote:
       | I've taken a very similar approach and built a GraphQL tokenizer
       | and parser (amongst many other things) that's also zero memory
       | allocations and quite fast. In case you'd like to check out the
       | code: https://github.com/wundergraph/graphql-go-tools
        
         | markl42 wrote:
         | How big of an issue is this for GQL servers where all queries
         | are known ahead of time (allowlist) - i.e. you can
          | cache/memoize the AST parsing and this is only a perf issue
          | for a few minutes after the container starts up?
         | 
         | Or does this bite us in other ways too?
        
           | jensneuse wrote:
            | I've been building GraphQL API gateways / routers for 5+
            | years now. It would be nice if trusted documents or
            | persisted operations were the default, but the reality is
            | that a lot of people want to open up their GraphQL to the
            | public. For that reason we've built a fast parser,
            | validator, normalizer and many
           | other things to support these use cases.
        
       | nwpierce wrote:
       | Writing a json parser is definitely an educational experience. I
       | wrote one this summer for my own purposes that is decently fast:
       | https://github.com/nwpierce/jsb
        
       | lamontcg wrote:
       | Wish I wasn't 4 or 5 uncompleted projects deep right now and had
       | the time to rewrite a monkey parser using all these tricks.
        
       | jchw wrote:
       | Looks pretty good! Even though I've written far too many JSON
       | parsers already in my career, it's really nice to have a
       | reference for how to think about making a reasonable, fast JSON
       | parser, going through each step individually.
       | 
       | That said, I will say one thing: you don't _really_ need to have
       | an explicit tokenizer for JSON. You can get rid of the concept of
       | tokens and integrate parsing and tokenization _entirely_. This is
        | what I usually do since it makes everything simpler. This is a
        | lot harder to do with the rest of ECMAScript, since there you
        | wind up needing look-ahead (sometimes arbitrarily large look-
        | ahead... consider arrow functions: the syntax is mostly a
        | subset of the grammar of a parenthesized expression. Comma is
        | an operator, and for default values, equals is an operator. It
        | isn't until the => does or does not appear that you know for
        | sure!)
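        | 
        | For JSON you can dispatch straight off the next byte; here's a
        | toy sketch of the fused approach (handling only numbers and
        | arrays, and ignoring truncated input, to keep it short):
        | 
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "strconv"
        |     )
        | 
        |     // parser fuses tokenizing and parsing: the next byte itself
        |     // is the "token"; there is no separate token type.
        |     type parser struct {
        |         buf []byte
        |         pos int
        |     }
        | 
        |     func (p *parser) skipSpace() {
        |         for p.pos < len(p.buf) {
        |             switch p.buf[p.pos] {
        |             case ' ', '\t', '\r', '\n':
        |                 p.pos++
        |             default:
        |                 return
        |             }
        |         }
        |     }
        | 
        |     func (p *parser) value() (any, error) {
        |         p.skipSpace()
        |         if p.buf[p.pos] == '[' {
        |             p.pos++ // consume '['
        |             out := []any{}
        |             for {
        |                 p.skipSpace()
        |                 if p.buf[p.pos] == ']' {
        |                     p.pos++
        |                     return out, nil
        |                 }
        |                 v, err := p.value()
        |                 if err != nil {
        |                     return nil, err
        |                 }
        |                 out = append(out, v)
        |                 p.skipSpace()
        |                 if p.buf[p.pos] == ',' {
        |                     p.pos++
        |                 }
        |             }
        |         }
        |         // number: scan to a delimiter (toy version), then convert
        |         start := p.pos
        |         for p.pos < len(p.buf) && p.buf[p.pos] != ',' &&
        |             p.buf[p.pos] != ']' && p.buf[p.pos] != ' ' {
        |             p.pos++
        |         }
        |         f, err := strconv.ParseFloat(string(p.buf[start:p.pos]), 64)
        |         return f, err
        |     }
        | 
        |     func main() {
        |         p := &parser{buf: []byte("[1, 2, [3]]")}
        |         v, err := p.value()
        |         fmt.Println(v, err) // [1 2 [3]] <nil>
        |     }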
        
         | coldtea wrote:
         | What line of work are you in that you've "written far too many
         | JSON parsers already" in your career?!!!
        
           | craigching wrote:
           | Probably anywhere that requires parsing large JSON documents.
           | Off the shelf JSON parsers are notoriously slow on large JSON
           | documents.
        
             | ahoka wrote:
             | Not necessarily, for example Newtonsoft is fine with
              | multiple hundreds of megabytes if you use it correctly. But
             | of course depends on how large we are talking about.
        
           | jchw wrote:
           | Reasons differ. C++ is a really hard place to be. It's gotten
           | better, but if you can't tolerate exceptions, need code that
            | is as-obviously-memory-safe-as-possible, or need to parse
            | incrementally (think SAX style), off-the-shelf options like
           | jsoncpp may not fit the bill.
           | 
           | Handling large documents is indeed another big one. It _sort-
           | of_ fits in the same category as being able to parse
           | incrementally. That said, Go has a JSON scanner you can sort
            | of use for incremental parsing, but in practice I've found
           | it to be a lot slower, so for large documents it's a problem.
           | 
           | I've done a couple in hobby projects too. One time I did a
           | partial one in Win32-style C89 because I wanted one that
           | didn't depend on libc.
        
           | lgas wrote:
           | Someone misunderstood the JSONParserFactory somewhere along
           | the line.
        
           | marcosdumay wrote:
           | I've seen "somebody doesn't agree with the standard and we
           | must support it" way too many times, and I've written JSON
           | parsers because of this. (And, of course, it's easy to get
           | some difference with the JSON standard.)
           | 
           | I've had problems with handling streams like the OP on
            | basically every programming language and data-encoding
           | language pair that I've tried. It looks like nobody ever
            | thinks about it (I do use chunking any time I can, but
            | sometimes you can't).
           | 
           | There are probably lots and lots of reasons to write your own
           | parser.
        
             | jbiggley wrote:
             | This reminds me of my favourite quote about standards.
             | 
             | >The wonderful thing about standards is that there are so
             | many of them to choose from.
             | 
             | And, keeping with the theme, this quote may be from Grace
             | Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.
        
       | evmar wrote:
       | In n2[1] I needed a fast tokenizer and had the same "garbage
       | factory" problem, which is basically that there's a set of
       | constant tokens (like json.Delim in this post) and then strings
       | which cause allocations.
       | 
       | I came up with what I think is a kind of neat solution, which is
       | that the tokenizer is generic over some T and takes a function
       | from byteslice to T and uses T in place of the strings. This way,
       | when the caller has some more efficient representation available
       | (like one that allocates less) it can provide one, but I can
       | still unit test the tokenizer with the identity function for
       | convenience.
       | 
       | In a sense this is like fusing the tokenizer with the parser at
       | build time, but the generic allows layering the tokenizer such
       | that it doesn't know about the parser's representation.
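        | 
        | Roughly the shape of it (n2 itself isn't Go, so this is just a
        | sketch in Go with made-up names): the caller supplies the
        | byteslice-to-T conversion, so the tokenizer never decides how
        | to allocate.
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     // Tokenizer is generic over the identifier representation T;
        |     // intern maps a raw byte slice to whatever T the parser wants.
        |     type Tokenizer[T any] struct {
        |         intern func([]byte) T
        |     }
        | 
        |     // Ident hands raw bytes to the caller-supplied function
        |     // instead of always allocating a string.
        |     func (t *Tokenizer[T]) Ident(raw []byte) T {
        |         return t.intern(raw)
        |     }
        | 
        |     func main() {
        |         // unit tests can use the identity-ish conversion...
        |         plain := &Tokenizer[string]{
        |             intern: func(b []byte) string { return string(b) },
        |         }
        |         fmt.Println(plain.Ident([]byte("build.ninja")))
        | 
        |         // ...while the real parser plugs in an interned id.
        |         ids := map[string]int{}
        |         fast := &Tokenizer[int]{intern: func(b []byte) int {
        |             id, ok := ids[string(b)]
        |             if !ok {
        |                 id = len(ids)
        |                 ids[string(b)] = id
        |             }
        |             return id
        |         }}
        |         fmt.Println(fast.Ident([]byte("build.ninja"))) // 0
        |     }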
       | 
       | [1] https://github.com/evmar/n2
        
       | suzzer99 wrote:
       | Can someone explain to me why JSON can't have comments or
       | trailing commas? I really hope the performance gains are worth
       | it, because I've lost 100s of man-hours to those things, and had
       | to resort to stuff like this in package.json:
       | "IMPORTANT: do not run the scripts below this line, they are for
       | CICD only": true,
        
         | coldtea wrote:
          | It can't have comments because it didn't originally have
          | comments, so now it's too late. And it originally didn't have
          | comments because Douglas Crockford thought they could be abused
         | for parsing instructions.
         | 
         | As for not having trailing commas, it's probably a less
         | intentional bad design choice.
         | 
         | That said, if you want commas and comments, and control the
         | parsers that will be used for your JSON, then use JSONC (JSON
         | with comments). VSCode for example does that for its JSON
         | configuration.
        
           | explaininjs wrote:
           | JSONC also supports trailing commas. It is, in effect, "JSON
           | with no downsides".
           | 
            | TOML/YAML always drive me batty with all their obscure
           | special syntax. Whereas it's almost impossible to look at a
           | formatted blob of JSON and not have a very solid
           | understanding of what it represents.
           | 
            | The one thing I _might_ add is multiline strings with `'s,
           | but even that is probably more trouble than it's worth, as
           | you immediately start going down the path of "well let's also
           | have syntax to strip the indentation from those strings,
           | maybe we should add new syntax to support raw strings, ..."
        
           | tubthumper8 wrote:
           | Does JSONC have a specification or formal definition? People
           | have suggested[1] using JSON5[2] instead for that reason
           | 
           | [1] https://github.com/microsoft/vscode/issues/100688
           | 
           | [2] https://spec.json5.org/
        
             | mananaysiempre wrote:
             | Unfortunately, JSON5 says keys can be ES5
             | IdentifierName[1]s, which means you must carry around
             | Unicode tables. This makes it a non-option for small
             | devices, for example. (I mean, not really, you technically
             | _could_ fit the necessary data and code in low single-digit
             | kilobytes, but it feels stupid that you have to. Or you
             | could just not do that but then it's no longer JSON5 and
             | what was the point of having a spec again?)
             | 
             | [1] https://es5.github.io/x7.html#x7.6
        
           | mananaysiempre wrote:
           | Amusingly, it originally _did_ have comments. Removing
           | comments was the one change Crockford ever made to the
           | spec[1].
           | 
           | [1] https://web.archive.org/web/20150105080225/https://plus.g
           | oog... (thank you Internet Archive for making Google's social
           | network somewhat accessible and less than useless)
        
         | shepherdjerred wrote:
         | I don't know the historic reason why it wasn't included in the
         | original spec, but at this point it doesn't matter. JSON is
         | entrenched and not going to change.
         | 
         | If you want comments, you can always use jsonc.
        
         | semiquaver wrote:
         | It's not that it "can't", more like it "doesn't". Douglas
         | Crockford prioritized simplicity when specifying JSON. Its BNF
         | grammar famously fits on one side of a business card.
         | 
         | Other flavors of JSON that include support for comments and
         | trailing commas exist, but they are reasonably called by
         | different names. One of these is YAML (mostly a superset of
         | JSON). To some extent the difficulties with YAML (like unquoted
         | 'no' being a synonym for false) have vindicated Crockford's
         | priorities.
        
       | forrestthewoods wrote:
       | > Any (useful) JSON decoder code cannot go faster that this.
       | 
       | That line feels like a troll. Cunningham's Law in action.
       | 
       | You can definitely go faster than 2 Gb/sec. In a word, SIMD.
        
         | shoo wrote:
         | we could re-frame by distinguishing problem statements from
         | implementations
         | 
         | Problem A: read a stream of bytes, parse it as JSON
         | 
         | Problem B: read a stream of bytes, count how many bytes match a
         | JSON whitespace character
         | 
         | Problem B should require fewer resources* to solve than problem
         | A. So in that sense problem B is a relaxation of problem A, and
         | a highly efficient implementation of problem B should be able
         | to process bytes much more efficiently than an "optimal"
         | implementation of problem A.
         | 
         | So in this sense, we can probably all agree with the author
         | that counting whitespace bytes is an easier problem than the
         | full parsing problem.
         | 
         | We're agreed that the author's implementation (half a page of
          | Go code that fits on a talk slide) to solve problem B isn't the
         | most efficient way to solve problem B.
         | 
         | I remember reading somewhere the advice that to set a really
         | solid target for benchmarking, you should avoid measuring the
         | performance of implementations and instead try to estimate a
         | theoretical upper bound on performance, based on say a
         | simplified model of how the hardware works and a simplification
         | of the problem -- that hopefully still captures the essence of
         | what the bottleneck is. Then you can compare any implementation
         | to that (unreachable) theoretical upper bound, to get more of
         | an idea of how much performance is still left on the table.
         | 
         | * for reasonably boring choices of target platform, e.g. amd64
         | + ram, not some hypothetical hardware platform with
         | surprisingly fast dedicated support for JSON parsing and bad
         | support for anything else.
        
       | ncruces wrote:
       | It's possible to improve over the standard library with better
       | API design, but it's not really possible to do a fully streaming
       | parser that doesn't half fill structures before finding an error
       | and bailing out in the middle, which is another explicit design
       | constraint for the standard library.
        
       | hintymad wrote:
       | How is this compared to Daniel Lemire's simdjson?
       | https://github.com/simdjson/simdjson
        
       | 1vuio0pswjnm7 wrote:
       | "But there is a better trick that we can use that is more space
       | efficient than this table, and is sometimes called a computed
       | goto."
       | 
       | From 1989:
       | 
       | https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
       | 
       | "Indirection in the Goto field is a more powerful version of the
       | computed Goto which appears in some languages. It allows a
       | program to quickly perform a multi-way control branch based on an
       | item of data."
        
       | wood_spirit wrote:
        | My own lessons from writing fast JSON parsers have a lot of
        | language-specific things, but here are some generalisations:
       | 
       | Avoid heap allocations in tokenising. Have a tokeniser that is a
       | function that returns a stack-allocated struct or an int64 token
        | that is a packed field describing the start, length, type etc.
        | of the token.
       | 
       | Avoid heap allocations in parsing: support a getString(key
        | String) type interface for clients that want to chop up a buffer.
       | 
        | For deserialising to objects where you know the fields at compile
        | time, generally generate a switch on key length before
       | comparing string values.
       | 
        | My experience in data pipelines that process lots of JSON is that
        | the choice of JSON library can be a 3-10x performance difference
        | and that all the main parsers want to allocate objects.
       | 
        | If the classes you are serialising or deserialising are known at
        | compile time then Jackson in Java does a good job, but you can
        | get a 2x boost with careful coding and profiling.
       | 
        | Whereas if you are parsing arbitrary JSON, all the mainstream
        | parsers want to do lots of allocations that a more intrusive
        | parser you write yourself can avoid, and you can make
       | massive performance wins if you are processing thousands or
       | millions of objects per second.
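        | 
        | A rough sketch of the packed-token idea in Go (bit widths
        | chosen arbitrarily here):
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     // token packs kind, start offset and length into one
        |     // uint64, so the tokeniser allocates nothing per token:
        |     // 8 bits kind, 28 bits start, 28 bits length.
        |     type token uint64
        | 
        |     func pack(kind byte, start, length int) token {
        |         return token(uint64(kind)<<56 |
        |             uint64(start)<<28 | uint64(length))
        |     }
        | 
        |     func (t token) kind() byte { return byte(t >> 56) }
        |     func (t token) start() int {
        |         return int(t>>28) & (1<<28 - 1)
        |     }
        |     func (t token) length() int {
        |         return int(t) & (1<<28 - 1)
        |     }
        | 
        |     func main() {
        |         input := []byte(`{"name":"go"}`)
        |         t := pack('"', 9, 2) // the string value "go"
        |         s, n := t.start(), t.length()
        |         fmt.Println(string(input[s : s+n])) // go
        |     }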
        
       | wslh wrote:
       | I remember this JSON benchmark page from RapidJSON [1].
       | 
       | [1] https://rapidjson.org/md_doc_performance.html
        
       ___________________________________________________________________
       (page generated 2023-11-05 23:00 UTC)