[HN Gopher] Analyzing multi-gigabyte JSON files locally
___________________________________________________________________
Analyzing multi-gigabyte JSON files locally
Author : bubblehack3r
Score : 99 points
Date : 2023-03-18 15:41 UTC (7 hours ago)
(HTM) web link (thenybble.de)
(TXT) w3m dump (thenybble.de)
| jahewson wrote:
| I had to parse a database backup from Firebase, which was,
| remarkably, a 300GB JSON file. The database is a tree rooted at a
| single object, which means that any tool that attempts to stream
| individual objects always wanted to buffer this single 300GB root
| object. It wasn't enough to strip off the root either, as the
| really big records were arrays a couple of levels down, with a
| few different formats depending on the schema. For added fun our
| data included some JSON serialised inside strings too.
|
| This was a few years ago and I threw every tool and language I
| could at it, but they were either far too slow or buffered
| records larger than memory, even the fancy C++ SIMD parsers did
| this. I eventually got something working in Go and it was
| impressively fast and ran on my MacBook, but we never ended up
| using it as another engineer just wrote a script that read the
| entire database from the Firebase API record-by-record throttled
| over several days, lol.
| simonw wrote:
| I've used ijson in Python for this kind of thing in the past,
| it's pretty effective: https://pypi.org/project/ijson/
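| A minimal example of the pattern, assuming the file is one big
| top-level array (the file and key names here are made up):
    import ijson

    with open("events.json", "rb") as f:
        # yields one object at a time instead of loading the file
        for obj in ijson.items(f, "item"):
            if obj.get("user") == "alice":
                print(obj)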
| kosherhurricane wrote:
| What I would have done is first create a map of the file, just
| the keys and shapes, without the data. That way I can traverse
| the file. And then mmap the file to traverse and read the data.
|
| A couple of dozen lines of code would do it.
| zerop wrote:
| The other day I discovered DuckDB on HN, which allows firing
| SQL at JSON. But I am not sure whether it can handle this much
| data.
| version_five wrote:
| For a hacky solution, I've often just used grep, tr, awk, etc. If
| it's a well structured file and all the records are the same or
| similar enough, it's often possible to grep your way into getting
| the thing you want on each line, and then use awk or sed to parse
| out the data. Obviously lots of ways this can break down, but 9GB
| is nothing if you can make it work with these tools. I have found
| jq much slower.
| philwelch wrote:
| Yeah, if the JSON is relatively flat, converting to TSV makes
| the data fairly trivial to consume using awk and other classic
| command line tools. I did a lot of this when a past employer
| decided they couldn't afford Splunk.
| jeffbee wrote:
| One thing that will greatly help with `jq` is rebuilding it so it
| suits your machine. The package of jq that comes with Debian or
| Ubuntu Linux is garbage that targets k8-generic (on the x86_64
| variant), is built with debug assertions, and uses the GNU system
| allocator which is the worst allocator on the market. Rebuilding
| it targeting your platform, without assertions, and with tcmalloc
| makes it twice as fast in many cases.
|
| On this 988MB dataset I happen to have at hand, compare Ubuntu jq
| with my local build, with hot caches on an Intel Core i5-1240P.
    time parallel -n 100 /usr/bin/jq -rf ../program.jq ::: *
      -> 1.843s
    time parallel -n 100 ~/bin/jq -rf ../program.jq ::: *
      -> 1.121s
|
| I know it stinks of Gentoo, but if you have any performance
| requirements at all, you can help yourself by rebuilding the
| relevant packages. Never use the upstream mysql, postgres, redis,
| jq, ripgrep, etc etc.
| jeffbee wrote:
| I guess another interesting fact worth mentioning here is the
| "efficiency" cores on a modern Intel CPU are every bit as good
| as the performance cores for this purpose. The 8C/8T Atom side
| of the i5-1240P has the same throughput as the 4C/8T Core side
| for this workload. I get 1.79s using CPUs 0-7 and 1.82s on CPUs
| 8-15.
| berkle4455 wrote:
| Just use clickhouse-local or duckdb. Handling data measured in
| terabytes is easy.
| Animats wrote:
| Rust's serde-json will iterate over a file of JSON without
| difficulty, and will write one from an iterative process without
| building it all in memory. I routinely create and read multi-
| gigabyte JSON files. They're debug dumps of the scene my
| metaverse viewer is looking at.
|
| Streaming from large files was routine for XML, but for some
| reason, JSON users don't seem to work with streams much.
| ginko wrote:
| This is something I did recently. We have this binary format we
| use for content traces. You can dump it to JSON, but that turns
| a ~10GB file into a ~100GB one. I needed to check some aspects of
| this
| with Python, so I used ijson[1] to parse the JSON without having
| to keep it in memory.
|
| The nice thing is that our dumping tool can also output JSON to
| STDOUT so you don't even need to dump the JSON representation to
| the hard disk. Just open the tool in a subprocess and pipe the
| output to the ijson parser. Pretty handy.
|
| [1] https://pypi.org/project/ijson/
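| Roughly like this (the dumper's name, flags, and JSON layout are
| placeholders, not our real tool):
    import subprocess
    import ijson
    # hypothetical dumper that writes the JSON document to stdout
    proc = subprocess.Popen(["tracedump", "--json", "trace.bin"],
                            stdout=subprocess.PIPE)
    # stream records from the pipe instead of a ~100GB file on disk
    for record in ijson.items(proc.stdout, "records.item"):
        pass  # inspect the record here
    proc.wait()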
| maCDzP wrote:
| I like SQLite and JSON columns. I wonder how fast it would be
| if you saved the whole JSON file in one record and then queried
| it with SQLite. I bet it's fast.
|
| You could probably use that one record to then build tables in
| SQLite that you can query.
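| Something along these lines, if your SQLite build has the JSON1
| functions compiled in (table, path, and field names invented):
    import sqlite3
    con = sqlite3.connect("data.db")
    con.execute("CREATE TABLE IF NOT EXISTS docs (body TEXT)")
    with open("big.json") as f:
        con.execute("INSERT INTO docs VALUES (?)", (f.read(),))
    con.commit()
    # json_each explodes a nested array; json_extract pulls fields
    rows = con.execute("""
        SELECT json_extract(value, '$.name')
        FROM docs, json_each(docs.body, '$.items')
        WHERE json_extract(value, '$.score') > 10
    """).fetchall()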
| funstuff007 wrote:
| Anyone who's generating multi-GB JSON files on purpose has some
| explaining to do.
| ghshephard wrote:
| Logs. jsonl is a popular streaming format.
| funstuff007 wrote:
| I guess, but you can grep JSONL just like you can a regular
| log file. As such, you don't need any sophisticated tools as
| discussed in this article.
|
| > 2. Each Line is a Valid JSON Value
|
| > 3. Line Separator is '\n'
|
| https://jsonlines.org/
| ghshephard wrote:
| Yes - 100%. I spend _hours_ a day blasting through line-
| delimited JSON, and I _always_ pre-filter with egrep, only
| moving to things like jq once the log size has (hopefully)
| been dramatically reduced.
|
| Also - with line-delimited JSON - you can just grab the first
| 10,000 or so lines and tweak your query on those before
| throwing it against the full log structure.
|
| With that said - this entire thread has been gold - lots of
| useful strategies for working with large json files.
| 19h wrote:
| To analyze and process the pushshift Reddit comment & submission
| archives we used Rust with simd-json and currently get to around
| 1 - 2GB/s (that's including the decompression of the zstd
| stream). Still takes a load of time when the decompressed files
| are 300GB+.
|
| Weirdly enough, we ended up networking a bunch of Apple silicon
| MacBooks together, as the Ryzen 32C servers didn't even come
| close to matching their performance :/
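| Nowhere near that throughput, but for anyone who wants the same
| decompress-as-you-parse shape in Python, a sketch (the dumps are
| one JSON object per line; the filename is just an example):
    import io
    import json
    import zstandard
    # pushshift archives need a large decompression window
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open("RC_2023-01.zst", "rb") as fh:
        reader = io.TextIOWrapper(dctx.stream_reader(fh),
                                  encoding="utf-8")
        for line in reader:           # one comment per line
            comment = json.loads(line)
            # ...analyze the comment here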
| cube2222 wrote:
| OctoSQL[0] or DuckDB[1] will most likely be much simpler, while
| going through 10 GB of JSON in a couple seconds at most.
|
| Disclaimer: author of OctoSQL
|
| [0]: https://github.com/cube2222/octosql
|
| [1]: https://duckdb.org/
| hamilyon2 wrote:
| ClickHouse is by far the best way to analyze a 10GB JSON file.
|
| The latest batch of features adds near-native JSON support.
| Coupled with the ability to add extracted columns, it makes the
| whole process easy. It is fast, you can use familiar SQL syntax,
| and you are not constrained by RAM limits.
|
| It is a bit harder if you want to iteratively process the file
| line by line or use advanced SQL. And you have the one-time cost
| of writing a schema. Apart from that, I can't think of any
| downsides.
|
| Edit: clarify a bit
| fdajojiocsjo wrote:
| [dead]
| tylerhannan wrote:
| There was an interesting article on this recently...
|
| https://news.ycombinator.com/item?id=31004563
|
| It prompted quite a bit of conversation and discussion and, in
| the end, an updated benchmark across a variety of tools:
| https://colab.research.google.com/github/dcmoura/spyql/blob/...
| conveniently right at the 10GB dataset size.
| mastax wrote:
| Dask looks really cool, I hope I remember it exists next time I
| need it.
|
| I've been pretty baffled, and disappointed, by how bad Python is
| at parallel processing. Yeah, yeah, I know: The GIL. But so much
| time and effort has been spent engineering around every other
| flaw in Python and yet this part is still so bad. I've tried
| every "easy to use" parallelism library that gets recommended and
| none of them has satisfied. Always: "couldn't pickle this
| function" or spawning loads of processes that use up all my RAM
| for no visible reason but don't use any CPU or make any
| indication of progress. I'm sure I'm missing something, I'm not a
| Python guy. But every other language I've used has an easy to use
| stateless parallel map that hasn't given me any trouble.
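| For reference, the least painful pattern I've been pointed at is
| a plain multiprocessing Pool with a module-level function (so it
| can be pickled) and a lazy imap; a sketch, not something I've
| battle-tested:
    import json
    from multiprocessing import Pool
    def handle(line):          # top-level so it can be pickled
        return json.loads(line).get("status")
    if __name__ == "__main__":
        with open("big.jsonl") as f, Pool() as pool:
            results = pool.imap_unordered(handle, f, chunksize=256)
            for r in results:  # lazy, so memory stays bounded
                pass           # aggregate r here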
| isoprophlex wrote:
| I've been seeing python at least once every week for a looooong
| time. Years. A decade maybe. You are not missing anything. It's
| a big steamy pile of horse manure.
| isoprophlex wrote:
| Nice writeup, but is jq & GNU parallel or a notebook full of
| python spaghetti the best (least complex) tool for the job?
|
| DuckDB might be nice here, too. See
| https://duckdb.org/2023/03/03/json.html
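| Something along these lines works in recent DuckDB versions (the
| column names here are invented):
    import duckdb
    # read_json_auto infers the schema from the file itself
    df = duckdb.query("""
        SELECT user_id, count(*) AS n
        FROM read_json_auto('big.json')
        GROUP BY user_id
        ORDER BY n DESC
    """).df()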
| samwillis wrote:
| DuckDB is awesome. As a comparison, I have a dataset that
| starts life as a 35gb set of json files. Imported into Postgres
| it's ~6gb, and a key query I run takes 3 min 33 seconds.
|
| Imported into DuckDB (still about ~6gb for all columns), the
| same SQL query takes 1.1 seconds!
|
| The key thing is that the columns (for all rows) the query
| scans total only about 100mb, so DuckDB has a lot less to scan.
| But on top of that, its vectorised query execution is
| incredibly quick.
|
| https://mobile.twitter.com/samwillis/status/1633213350002798...
| pletnes wrote:
| I found that exporting big tables as a bunch of parquet files
| is faster and uses less memory than duckdb's internal format.
| pletnes wrote:
| Duckdb is fantastic. Doesn't need a schema, either.
| zeitlupe wrote:
| Spark is my favorite tool for dealing with JSON. It can read as
| many JSON files as you want - in any format, located in any
| (even nested) folder structure - offers parallelization, and is
| great for flattening structs. I've never run into memory issues
| (or never run out of workarounds) so far.
| pidge wrote:
| Yeah, given that everything is now multi-core, it makes sense
| to use a natively parallel tool for anything compute-bound. And
| Spark will happily run locally and (unlike previous big data
| paradigms) doesn't require excessive mental contortions.
|
| Of course while you're at it, you should probably just convert
| all your JSON into Parquet to speed up successive queries...
| cpuguy83 wrote:
| Jq does support slurp mode so you should be able to do this using
| that... granted I've never attempted this and the syntax is very
| different.
|
| --- edit ---
|
| I used the wrong term, the correct term is streaming mode.
| Groxx wrote:
| It does work, but it is a _huge_ headache to use, in part
| because the documentation around it is nowhere near enough to
| understand how to use it.
|
| If I used it regularly I'd probably develop a feel for it and
| be much faster - it is _reasonable_, just abnormal and
| extremely low level, and much harder to use with other jq
| stuff. But I almost always start looking for alternatives well
| before I reach that point.
| hprotagonist wrote:
| i would seriously consider sqlite-utils here.
|
| https://sqlite-utils.datasette.io/en/stable/cli.html
| qbasic_forever wrote:
| Was going to post the same thing. I suspect converting the
| dataset to a SQLite db would be infinitely faster and more
| productive than pecking away at it with pandas and such.
| sgarland wrote:
| At this size, I doubt it. While SQLite can read JSON if
| compiled with support for it, it stores it as TEXT. The only
| native indexing possible for that that I'm aware of is full-
| text search, and I suspect the cardinality of JSON characters
| would make that inefficient. Not to mention that the author
| stated they didn't have enough memory to store the entire
| file, so with a DB you'd be reading from disk.
|
| MySQL or Postgres with their native JSON datatypes _might_ be
| faster, but you still have to load it in, and
| storing/indexing it in either of those is [0] its own [1]
| special nightmare full of footguns.
|
| Having done similar text manipulation and searches with giant
| CSV files, parallel and xsv [2] is the way to go.
|
| [0]: https://dev.mysql.com/doc/refman/8.0/en/json.html
|
| [1]: https://www.postgresql.org/docs/current/datatype-
| json.html
|
| [2]: https://github.com/BurntSushi/xsv
| simonw wrote:
| The trick sqlite-utils uses is to automatically create a
| SQLite table schema that matches the shape of the JSON
| object - specifically if you give it a JSON array of
| objects.
|
| It doesn't handle streaming JSON out of the box though, so
| you'd need to write some custom code on top of something
| like ijson to avoid loading the entire JSON file into
| memory first.
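| A sketch of that glue (assuming a top-level JSON array; the
| file and table names are made up):
    import ijson
    from sqlite_utils import Database
    db = Database("data.db")
    def rows():
        with open("big.json", "rb") as f:
            # use_float avoids Decimal values SQLite can't bind
            yield from ijson.items(f, "item", use_float=True)
    # insert_all consumes the generator in batches, so the whole
    # file never has to be in memory at once
    db["records"].insert_all(rows(), batch_size=1000)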
| philwelch wrote:
| SQLite is great for datasets that fit comfortably into
| memory, but otherwise it starts to struggle.
| simonw wrote:
| That's not accurate: SQLite works great with data that's
| too big to fit in memory provided you create indexes and
| write queries that can take advantage of them - just like
| any other relational database (MySQL/PostgreSQL etc).
| fbdab103 wrote:
| Come again? It is a database. Its entire raison d'etre is
| managing bigger-than-memory data. On the SQLite
| forum[0], Dr. Hipp indicated that a customer approached them
| who was using a ~40 TB SQLite database. Now, you
| could argue it is less good at that scale, particularly
| analytical workloads, but SQLite has an enormous amount of
| headroom that you are unlikely to encounter.
|
| [0]: https://sqlite.org/forum/info/6e13e1911f1fe39c
| hprotagonist wrote:
| happily, i have multiple gigabytes of memory ...
| philwelch wrote:
| Sure, but a 40 GB SQLite database on a machine with 16 GB
| of RAM is not gonna be happy
| qbasic_forever wrote:
| You're not going to do better with pandas or similar
| tools. If it can't fit in memory, it's going to be
| painful. SQLite is the least painful in my experience,
| and it sets you up for working with the data in a proper
| DB like postgres or similar for when you get fed up with
| the memory constraints.
| philwelch wrote:
| I wouldn't use pandas in that situation either.
| kosherhurricane wrote:
| In the original article, they were trying to read a 10GB
| file into 30GB RAM.
| Groxx wrote:
| tbh my usual strategy is to drop into a real programming language
| and use whatever JSON stream parsing exists there, and dump the
| contents into a half-parsed file that can be split with `split`.
| Then you can use "normal" tools on one of those pieces for fast
| iteration, and simply `cat * | ...` for the final slow run on all
| the data.
|
| Go is quite good for this, as it's extremely permissive about
| errors and structure, has very good performance, and comes with a
| streaming parser in the standard library. It's pretty easy to be
| finished after only a couple minutes, and you'll be bottlenecked
| on I/O unless you did something truly horrific.
|
| And when jq isn't enough because you need to do joins or
| something, shove it into SQLite. Add an index or three. It'll
| massively outperform almost anything else unless you need rich
| text content searches (and even then, a fulltext index might be
| just as good), and it's plenty happy with a terabyte of data.
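| In Python rather than Go, the same split-friendly dump is only a
| few lines (file names and the top-level-array assumption are
| mine):
    import json
    import ijson
    with open("big.json", "rb") as src, \
         open("records.jsonl", "w") as dst:
        for obj in ijson.items(src, "item", use_float=True):
            dst.write(json.dumps(obj) + "\n")
    # afterwards: split -l 1000000 records.jsonl chunk_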
| rvanlaar wrote:
| Recently had 28GB json of IOT data with no guarantees on the data
| structure inside.
|
| Used simdjson [1] together with python bindings [2]. Achieved
| massive speedups for analyzing the data. Before it was in the
| order of minutes, then it became fast enough to not leave my
| desk. Reading from disk became the bottleneck, not cpu power and
| memory.
|
| [1] https://github.com/simdjson/simdjson [2]
| https://pysimdjson.tkte.ch/
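| For reference, usage is close to the stdlib json module (the
| file and field names below are made up):
    import simdjson
    parser = simdjson.Parser()
    # parses the whole file with SIMD acceleration
    doc = parser.load("iot.json")
    for reading in doc["readings"]:
        if reading["temperature"] > 90:
            pass  # flag the anomaly here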
| jmmv wrote:
| Some random comments:
|
| * A few GBs of data isn't really that much. Even /considering/
| the use of cloud services just for this sounds crazy to me... but
| I'm sure there are people out there that believe it's the only
| way to do this (not the author, fortunately).
|
| * "You might find out that the data doesn't fit into RAM (which
| it well might, JSON is a human-readable format after all)" -- if
| I'm reading this right, the author is saying that the parsed data
| takes _more_ space than the JSON version? JSON is a text format
| and interning it into proper data structures is likely going to
| take _less_ space, not more.
|
| * "When you're ~trial-and-error~iteratively building jq commands
| as I do, you'll quickly grow tired of having to wait about a
| minute for your command to succeed" -- well, change your workflow
| then. When tackling new queries, it's usually a good idea to
| reduce the data set. Operate on a few records until you have the
| right query so that you can iterate as fast as possible. Only
| once you are confident with the query, run it on the full data.
|
| * Importing the data into a SQLite database may be better overall
| for exploration. Again, JSON is slow to operate on because it's
| text. Pay the cost of parsing only once.
|
| * Or write a custom little program that streams data from the
| JSON file without buffering it all in memory. JSON parsing
| libraries are plentiful so this should not take a lot of code in
| your favorite language.
| vlovich123 wrote:
| > JSON is a text format and interning it into proper data
| structures is likely going to take _less_ space, not more.
|
| ~~Yeah but parsing it can require ~2x the RAM available and
| push you into swap / make it not possible.~~
|
| > Or write a custom little program that streams data from the
| JSON file without buffering it all in memory. JSON parsing
| libraries are plentiful so this should not take a lot of code
| in your favorite language.
|
| What is the state of SAX JSON parsing? I used yajl a long time
| ago but not sure if that's still the state of the art (and it's
| C interface was not the easiest to work with).
|
| EDIT: Actually, I think the reason is that you typically will
| have pointers (8 bytes) in place of 2 byte demarcations in the
| text version (eg "", {}, [] become pointers). It's very hard to
| avoid that (maybe impossible? Not sure) and no surprise that
| Python has a problem with this.
| saidinesh5 wrote:
| There are libraries like ijson. At my last job, I wrote a
| wrapper around the C++ library rapidjson to parse
| streaming JSON. It worked quite well.
|
| Here's the write up about it:
| https://dinesh.cloud/2022/streaming-json-for-fun-and-profit/
| And here's the code: https://github.com/multiversal-
| ventures/json-buffet
|
| The API isn't the best. I'd have preferred an iterator based
| solution as opposed to this callback based one. But we worked
| with what rapidjson gave us for the proof of concept. The
| reason for this specific implementation was we wanted to
| build an index to query the server directly about its huge
| JSON files (compressed size of 20+GB per file) using HTTP
| range queries.
| bastawhiz wrote:
| > JSON is a text format and interning it into proper data
| structures is likely going to take _less_ space, not more.
|
| If you're parsing to structs, yes. Otherwise, no. Each object
| key is going to be a short string, which is going to have some
| amount of overhead. You're probably storing the objects as hash
| tables, which will necessarily be larger than the two bytes
| needed to represent them as text (and probably far more than
| you expect, so they have enough free space for there to be
| sufficiently few hash collisions).
|
| JSON numbers are also 64-bit floats, which will almost
| universally take up more bytes per number than their serialized
| format for most JSON data.
| vlovich123 wrote:
| I think even structs have this problem because typically you
| heap allocate all the structs/arrays. You could try to arena
| allocate contiguous objects in place, but that sounds hard
| enough that I doubt that anyone bothers. Using a SAX parser
| is almost certainly the tool you want to use.
| matja wrote:
| > JSON numbers are also 64-bit floats
|
| In common implementations they are, but RFC 8259 and ECMA-404
| do not specify the range, precision, or underlying
| implementation for the storage of numbers in JSON.
|
| An implementation that guarantees interoperability between all
| implementations of JSON would use an arbitrary-precision number
| format, but they seldom do.
|
| No idea what ISO/IEC 21778:2017 says because it's not free.
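| For what it's worth, some parsers let you opt into arbitrary
| precision; for example, Python's stdlib parser:
    import json
    from decimal import Decimal
    # keep full precision instead of rounding to a 64-bit float
    data = json.loads('{"price": 0.1234567890123456789}',
                      parse_float=Decimal)
    print(repr(data["price"]))
    # Decimal('0.1234567890123456789')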
| bastawhiz wrote:
| I mean, I suppose you could use a numeric format that takes
| more than 64 bits, but you certainly couldn't use one that
| uses fewer. You have no way to know whether a number
| _should be_ an int or a float. And since JavaScript (the JS
| in JSON) encodes with 64 bits, you'd be losing precision
| if you choose to use any less. The point is that an integer
| written as one byte in the JSON _will_ take up more space
| (at least seven bytes more) when decoded, all the way up to
| numbers that take seven characters.
| gpderetta wrote:
| (subsets of )ASN.1, LEB128, and of course UTF-8 are all
| examples of encoding binary integers with variable
| length.
| bastawhiz wrote:
| Sure, and JSON is as well. 0 takes three fewer bytes to
| encode than 1234. But if I parse JSON and my numbers come
| back as a type backed by UTF-8, I'm going to be left
| confused and angry. Nobody expects their JSON parser to
| give them back numeric values that their CPU can't
| natively perform operations on.
| zigzag312 wrote:
| > if I'm reading this right, the author is saying that the
| parsed data takes _more_ space than the JSON version? JSON is a
| text format and interning it into proper data structures is
| likely going to take _less_ space, not more.
|
| UTF-8 JSON strings will get converted to UTF-16 strings in some
| languages, doubling the size of strings in memory compared to
| the size on disk.
| taeric wrote:
| It still surprises me how many have the intuition that loading
| the data will take more space than the file.
|
| Even more annoying when it is. (Compressed or binary formats
| notwithstanding.)
| vlovich123 wrote:
| Isn't that almost always true unless the format is
| specifically designed to take the same amount on disk and in
| RAM? If that's the case I think typically it means the file
| on disk is larger than it needs to be. That's because you
| have to convert the file to some kind of in-memory
| representation that's easier to work with to manipulate it
| and what not which requires structure that isn't necessary at
| rest.
|
| Can you give an example where the file on disk is going to be
| larger than what it is in memory? Provided you're not just
| reading it and working with it as an opaque binary blob.
| hnlmorg wrote:
| XML would probably fall into that category.
|
| But I do agree with your point more generally speaking.
| ot wrote:
| > Or write a custom little program that streams data from the
| JSON file without buffering it all in memory. JSON parsing
| libraries are plentiful so this should not take a lot of code
| in your favorite language.
|
| Several years ago I wrote a paper [1] on representing the parse
| tree of a JSON document in a tiny fraction of the JSON size
| itself, using succinct data structures. The representation
| could be built with a single pass of the JSON, and basically
| constant additional memory.
|
| The idea was to pre-process the JSON and then save the parse
| tree, so it could be kept in memory over several passes of the
| JSON data (which may not fit in memory), avoiding having to
| re-do the parsing work on each pass.
|
| I don't think I've seen this idea used anywhere, but I still
| wonder if it could have applications :)
|
| [1]
| http://groups.di.unipi.it/~ottavian/files/semi_index_cikm.pd...
| saidinesh5 wrote:
| > * "You might find out that the data doesn't fit into RAM
| (which it well might, JSON is a human-readable format after
| all)" -- if I'm reading this right, the author is saying that
| the parsed data takes _more_ space than the JSON version? JSON
| is a text format and interning it into proper data structures
| is likely going to take _less_ space, not more.
|
| Not to mention, even when using bad data structures (eg.
| hashmap of hashmaps..), One can just add a large enough
| swapfile and brute force their way through it no?
| [deleted]
| the_duke wrote:
| > JSON is a text format and interning it into proper data
| structures is likely going to take _less_ space, not more.
|
| That depends a lot on the language and the json library.
|
| Let's take `{"ab": 22}` as an example.
|
| That's 10 bytes.
|
| In a language like Rust and using the serde library, this could
| be deserialized directly into a struct with one integer, let's
| pick a u32. So that would only be four bytes.
|
| But if it was deserialized to serde's dynamic Value type, this
| would be: a HashMap<String, u32>, which has a constant size of
| 48 bytes, plus an allocation of I don't know how much (first
| allocation will cover more than one entry), plus 16 bytes
| overhead for the string, plus 2 bytes for the actual string
| contents, plus the 4 bytes for the u32. So that's already over
| ~90 bytes, a lot more than the JSON.
|
| Dynamic languages like Python also have a lot of overhead for
| all the objects.
|
| Keys can of course be interned, but not that many default JSON
| parser libraries do that afaik.
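| You can see the Python overhead directly with sys.getsizeof
| (shallow sizes only; real totals are higher once you follow the
| pointers):
    import sys
    obj = {"ab": 22}
    print(sys.getsizeof(obj))   # dict header plus hash table
    print(sys.getsizeof("ab"))  # a full str object
    print(sys.getsizeof(22))    # a full int object
    # versus 10 bytes for the text '{"ab": 22}'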
| btown wrote:
| I feel like many of the same principles from V8 (the engine
| for Chrome/Node.js) hidden classes would apply here. It's not
| just that you're interning keys; you're actually creating a
| set of optimized data classes that are transitioned to when
| you encounter a key that wasn't seen before. In such a mode,
| you're not even storing the keys, just a pointer to the type
| information tree. Only when you have a sub-object whose keys
| are truly unpredictable do you go into "dictionary mode (aka
| 'slow mode')" - which I think says it all!
|
| https://v8.dev/docs/hidden-classes
| DonHopkins wrote:
| The key names of all the identically shaped objects are
| repeated and can all be interned into sharing the same
| strings, saving a huge amount of memory. That's why CSV files
| can be much more compact than the equivalent JSON files: they
| only list the keys once in the first row.
| mindslight wrote:
| "ab" is 4 bytes, while a 64 bit pointer to an interned
| string is 8 bytes. It would seem that the savings of CSV
| would be better analogized to static typing - the type is
| defined once, then each record only contains data.
|
| I had the same intuition as the original comment. But no,
| the relative sizes of data formats aren't that
| straightforward. One could intern symbols to say 16 bit
| values, or one could infer structural types and compress
| the data that way. But those are both creating additional
| assumptions and processing that likely aren't done by
| commonly available tools.
| rootw0rm wrote:
| I'm using bincode to serialize btrees and the files are
| smaller than I expected... it's pretty cool. It must be storing
| my u32s as u16s or u8s sometimes.
| pkaye wrote:
| How about something like a Go interface which is essentially
| two pointers and the data itself.
| Groxx wrote:
| That's the same thing they're describing, and equivalent to
| what Go does for map[string]int64 (plus or minus minor
| details about how strings are optimized).
|
| The map has its two slices (pointer + length + capacity =
| 8*3 times two slices) and the string needs a separate
| allocation somewhere because it's essentially a pointer to
| a slice of bytes. All of which is true for almost all
| reasonably efficiency-focused languages, Rust and Go
| included - it's just how you make a compact hashmap.
| Yoric wrote:
| Does Go perform interning of strings?
| Groxx wrote:
| Compile-time strings, I think so? In that all identical
| strings simply point to the same data. Otherwise no.
| closeparen wrote:
| I would read "proper data structure" charitably, as a struct
| or record type, maybe a tuple. And if it must be a hashmap,
| then at least using symbols and not strings for keys.
| Nihilartikel wrote:
| If you're doing interactive analysis, converting the json to
| parquet is a great first step.. After that duckdb or spark are a
| good way to go. I only fall back to spark if some aggregations
| are too big to fit in RAM. Spark spills to disk and subdivides
| the physical plans better in my experience..
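| For example, if the data is (or has first been converted to)
| line-delimited JSON, pandas can do the conversion in chunks
| (filenames invented; to_parquet needs pyarrow installed):
    import pandas as pd
    chunks = pd.read_json("big.jsonl", lines=True,
                          chunksize=500_000)
    for i, chunk in enumerate(chunks):
        chunk.to_parquet(f"part-{i:04d}.parquet")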
| lmeyerov wrote:
| Yep!
|
| We do the switch to Parquet and then, as they say, use Dask so
| we can stick with Python for the interesting bits, as SQL is
| relatively anti-productive there.
|
| Interestingly, most of the Dask work can actually be dask_cudf
| and cudf nowadays: Dask/pandas on a GPU, so you can stay on the
| same computer, with no need for distributed computing, even with
| TBs of JSON.
| pradeepchhetri wrote:
| Well, if you need to convert JSON to Parquet to do anything
| fast, then what is the point? You will end up spending so much
| on the conversion itself that the benefit is cancelled out by
| the cost of the extra storage (since you now have both the JSON
| and the Parquet files). The whole point is to do fast operations
| on the JSON itself. Try out clickhouse/clickhouse-local.
| closeparen wrote:
| If you're doing interactive analysis, generally you're going
| to have multiple queries, so it can be worthwhile to pay the
| conversion cost once upfront. You don't necessarily retain
| the JSON form, or at least not for as long.
| [deleted]
| thakoppno wrote:
| Would sampling the JSON down to 20MB and running jq
| experimentally until one has found an adequate solution be a
| decent alternative approach?
|
| It depends on the dataset one supposes.
| epalm wrote:
| Yeah, I do this when querying sql databases. I limit the data
| to some small/local range, iteratively work on the query, and
| when I'm happy with the local results, I remove the filter and
| get the big results.
| ddulaney wrote:
| I really like using line-delimited JSON [0] for stuff like this.
| If you're looking at a multi-GB JSON file, it's often made of a
| large number of individual objects (e.g. semi-structured JSON log
| data or transaction records).
|
| If you can get to a point where each line is a reasonably-sized
| JSON value, a lot of things get way easier. jq will be streaming
| by default. You can use traditional Unixy tools (grep, sed, etc.)
| in the normal way because it's just lines of text. And you can
| jump to any point in the file, skip forward to the next line
| boundary, and know that you're not in the middle of a record.
|
| The company I work for added line-delimited JSON output to lots
| of our internal tools, and working with anything else feels
| painful now. It scales up really well -- I've been able to do
| things like process full days of OPRA reporting data in a bash
| script.
|
| [0]: https://jsonlines.org/
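| That last property is what makes it easy to parallelize by byte
| range; a sketch of the resync trick (the helper is mine, not
| from our tools):
    import json
    def records_from(path, offset):
        # start at an arbitrary byte offset, skip the partial
        # line, then yield whole records
        with open(path, "rb") as f:
            f.seek(offset)
            if offset:
                f.readline()
            for line in f:
                yield json.loads(line)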
___________________________________________________________________
(page generated 2023-03-18 23:00 UTC)