[HN Gopher] Analyzing multi-gigabyte JSON files locally
___________________________________________________________________
Analyzing multi-gigabyte JSON files locally
Author : bubblehack3r
Score : 99 points
Date : 2023-03-18 15:41 UTC (7 hours ago)
(HTM) web link (thenybble.de)
(TXT) w3m dump (thenybble.de)
| jahewson wrote:
| I had to parse a database backup from Firebase, which was,
| remarkably, a 300GB JSON file. The database is a tree rooted at a
| single object, which means that any tool that attempts to stream
| individual objects always wanted to buffer this single 300GB root
| object. It wasn't enough to strip off the root either, as the
| really big records were arrays a couple of levels down, with a
| few different formats depending on the schema. For added fun our
| data included some JSON serialised inside strings too.
|
| This was a few years ago and I threw every tool and language I
| could at it, but they were either far too slow or buffered
| records larger than memory, even the fancy C++ SIMD parsers did
| this. I eventually got something working in Go and it was
| impressively fast and ran on my MacBook, but we never ended up
| using it as another engineer just wrote a script that read the
| entire database from the Firebase API record-by-record throttled
| over several days, lol.
| simonw wrote:
| I've used ijson in Python for this kind of thing in the past,
| it's pretty effective: https://pypi.org/project/ijson/
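| A minimal example of the pattern, assuming the file is one big
| top-level array (the file and key names here are made up):
    import ijson

    with open("events.json", "rb") as f:
        # yields one object at a time instead of loading the file
        for obj in ijson.items(f, "item"):
            if obj.get("user") == "alice":
                print(obj)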
| kosherhurricane wrote:
| What I would have done is first create a map of the file, just
| the keys and shapes, without the data. That way I can traverse
| the file. And then mmap the file to traverse and read the data.
|
| A couple of dozen lines of code would do it.
| zerop wrote:
| The other day I discovered DuckDB on HN, which allows firing
| SQL at JSON. But I am not sure whether it can handle this much
| data.
| version_five wrote:
| For a hacky solution, I've often just used grep, tr, awk, etc. If
| it's a well structured file and all the records are the same or
| similar enough, it's often possible to grep your way into getting
| the thing you want on each line, and then use awk or sed to parse
| out the data. Obviously lots of ways this can break down, but 9GB
| is nothing if you can make it work with these tools. I have found
| jq much slower.
| philwelch wrote:
| Yeah, if the JSON is relatively flat, converting to TSV makes
| the data fairly trivial to consume using awk and other classic
| command line tools. I did a lot of this when a past employer
| decided they couldn't afford Splunk.
| jeffbee wrote:
| One thing that will greatly help with `jq` is rebuilding it so it
| suits your machine. The package of jq that comes with Debian or
| Ubuntu Linux is garbage that targets k8-generic (on the x86_64
| variant), is built with debug assertions, and uses the GNU system
| allocator which is the worst allocator on the market. Rebuilding
| it targeting your platform, without assertions, and with tcmalloc
| makes it twice as fast in many cases.
|
| On this 988MB dataset I happen to have at hand, compare Ubuntu jq
| with my local build, with hot caches on an Intel Core i5-1240P.
    time parallel -n 100 /usr/bin/jq -rf ../program.jq ::: *
      -> 1.843s
    time parallel -n 100 ~/bin/jq -rf ../program.jq ::: *
      -> 1.121s
|
| I know it stinks of Gentoo, but if you have any performance
| requirements at all, you can help yourself by rebuilding the
| relevant packages. Never use the upstream mysql, postgres, redis,
| jq, ripgrep, etc etc.
| jeffbee wrote:
| I guess another interesting fact worth mentioning here is the
| "efficiency" cores on a modern Intel CPU are every bit as good
| as the performance cores for this purpose. The 8C/8T Atom side
| of the i5-1240P has the same throughput as the 4C/8T Core side
| for this workload. I get 1.79s using CPUs 0-7 and 1.82s on CPUs
| 8-15.
| berkle4455 wrote:
| Just use clickhouse-local or duckdb. Handling data measured in
| terabytes is easy.
| Animats wrote:
| Rust's serde-json will iterate over a file of JSON without
| difficulty, and will write one from an iterative process without
| building it all in memory. I routinely create and read multi-
| gigabyte JSON files. They're debug dumps of the scene my
| metaverse viewer is looking at.
|
| Streaming from large files was routine for XML, but for some
| reason, JSON users don't seem to work with streams much.
| ginko wrote:
| This is something I did recently. We have this binary format we
| use for content traces. You can dump it to JSON, but that turns
| a ~10GB file into a ~100GB one. I needed to check some aspects of
| this
| with Python, so I used ijson[1] to parse the JSON without having
| to keep it in memory.
|
| The nice thing is that our dumping tool can also output JSON to
| STDOUT so you don't even need to dump the JSON representation to
| the hard disk. Just open the tool in a subprocess and pipe the
| output to the ijson parser. Pretty handy.
|
| [1] https://pypi.org/project/ijson/
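| Roughly like this (the dumper's name, flags, and JSON layout are
| placeholders, not our real tool):
    import subprocess
    import ijson
    # hypothetical dumper that writes the JSON document to stdout
    proc = subprocess.Popen(["tracedump", "--json", "trace.bin"],
                            stdout=subprocess.PIPE)
    # stream records from the pipe instead of a ~100GB file on disk
    for record in ijson.items(proc.stdout, "records.item"):
        pass  # inspect the record here
    proc.wait()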
| maCDzP wrote:
| I like SQLite and JSON columns. I wonder how fast it would be
| if you saved the whole JSON file in one record and then queried
| it with SQLite. I bet it's fast.
|
| You could probably use that one record to then build tables in
| SQLite that you can query.
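| Something along these lines, if your SQLite build has the JSON1
| functions compiled in (table, path, and field names invented):
    import sqlite3
    con = sqlite3.connect("data.db")
    con.execute("CREATE TABLE IF NOT EXISTS docs (body TEXT)")
    with open("big.json") as f:
        con.execute("INSERT INTO docs VALUES (?)", (f.read(),))
    con.commit()
    # json_each explodes a nested array; json_extract pulls fields
    rows = con.execute("""
        SELECT json_extract(value, '$.name')
        FROM docs, json_each(docs.body, '$.items')
        WHERE json_extract(value, '$.score') > 10
    """).fetchall()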
| funstuff007 wrote:
| Anyone who's generating multi-GB JSON files on purpose has some
| explaining to do.
| ghshephard wrote:
| Logs. jsonl is a popular streaming format.
| funstuff007 wrote:
| I guess, but you can grep JSONL just like you can a regular
| log file. As such, you don't need any sophisticated tools as
| discussed in this article.
|
| > 2. Each Line is a Valid JSON Value
|
| > 3. Line Separator is '\n'
|
| https://jsonlines.org/
| ghshephard wrote:
| Yes - 100%. I spend _hours_ a day blasting through line-
| delimited JSON, and I _always_ pre-filter with egrep, only
| moving to things like jq once the log size has (hopefully)
| been dramatically reduced.
|
| Also - with line-delimited JSON - you can just grab the first
| 10,000 or so lines and tweak your query on those before
| throwing it against the full log structure.
|
| With that said - this entire thread has been gold - lots of
| useful strategies for working with large json files.
| 19h wrote:
| To analyze and process the pushshift Reddit comment & submission
| archives we used Rust with simd-json and currently get to around
| 1 - 2GB/s (that's including the decompression of the zstd
| stream). Still takes a load of time when the decompressed files
| are 300GB+.
|
| Weirdly enough, we ended up networking a bunch of Apple silicon
| MacBooks together, as the Ryzen 32C servers didn't even come
| close to matching their performance :/
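| Nowhere near that throughput, but for anyone who wants the same
| decompress-as-you-parse shape in Python, a sketch (the dumps are
| one JSON object per line; the filename is just an example):
    import io
    import json
    import zstandard
    # pushshift archives need a large decompression window
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open("RC_2023-01.zst", "rb") as fh:
        reader = io.TextIOWrapper(dctx.stream_reader(fh),
                                  encoding="utf-8")
        for line in reader:           # one comment per line
            comment = json.loads(line)
            # ...analyze the comment here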
| cube2222 wrote:
| OctoSQL[0] or DuckDB[1] will most likely be much simpler, while
| going through 10 GB of JSON in a couple seconds at most.
|
| Disclaimer: author of OctoSQL
|
| [0]: https://github.com/cube2222/octosql
|
| [1]: https://duckdb.org/
| hamilyon2 wrote:
| ClickHouse is by far the best way to analyze a 10GB JSON file.
|
| The latest batch of features adds near-native JSON support.
| Coupled with the ability to add extracted columns, it makes the
| whole process easy. It is fast, you can use familiar SQL syntax,
| and you are not constrained by RAM limits.
|
| It is a bit harder if you want to iteratively process the file
| line by line or use advanced SQL. And you have the one-time cost
| of writing a schema. Apart from that, I can't think of any
| downsides.
|
| Edit: clarify a bit
| fdajojiocsjo wrote:
| [dead]
| tylerhannan wrote:
| There was an interesting article on this recently...
|
| https://news.ycombinator.com/item?id=31004563
|
| It prompted quite a bit of conversation and discussion and, in
| the end, an updated benchmark across a variety of tools:
| https://colab.research.google.com/github/dcmoura/spyql/blob/...
| conveniently right at the 10GB dataset size.
| mastax wrote:
| Dask looks really cool, I hope I remember it exists next time I
| need it.
|
| I've been pretty baffled, and disappointed, by how bad Python is
| at parallel processing. Yeah, yeah, I know: The GIL. But so much
| time and effort has been spent engineering around every other
| flaw in Python and yet this part is still so bad. I've tried
| every "easy to use" parallelism library that gets recommended and
| none of them has satisfied. Always: "couldn't pickle this
| function" or spawning loads of processes that use up all my RAM
| for no visible reason but don't use any CPU or make any
| indication of progress. I'm sure I'm missing something, I'm not a
| Python guy. But every other language I've used has an easy to use
| stateless parallel map that hasn't given me any trouble.
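| For reference, the least painful pattern I've been pointed at is
| a plain multiprocessing Pool with a module-level function (so it
| can be pickled) and a lazy imap; a sketch, not something I've
| battle-tested:
    import json
    from multiprocessing import Pool
    def handle(line):          # top-level so it can be pickled
        return json.loads(line).get("status")
    if __name__ == "__main__":
        with open("big.jsonl") as f, Pool() as pool:
            results = pool.imap_unordered(handle, f, chunksize=256)
            for r in results:  # lazy, so memory stays bounded
                pass           # aggregate r here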
| isoprophlex wrote:
| I've been seeing python at least once every week for a looooong
| time. Years. A decade maybe. You are not missing anything. It's
| a big steamy pile of horse manure.
| isoprophlex wrote:
| Nice writeup, but is jq & GNU parallel or a notebook full of
| python spaghetti the best (least complex) tool for the job?
|
| DuckDB might be nice here, too. See
| https://duckdb.org/2023/03/03/json.html
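| Something along these lines works in recent DuckDB versions (the
| column names here are invented):
    import duckdb
    # read_json_auto infers the schema from the file itself
    df = duckdb.query("""
        SELECT user_id, count(*) AS n
        FROM read_json_auto('big.json')
        GROUP BY user_id
        ORDER BY n DESC
    """).df()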
| samwillis wrote:
| DuckDB is awesome. As a comparison, I have a dataset that
| starts life as a 35gb set of json files. Imported into Postgres
| it's ~6gb, and a key query I run takes 3 min 33 seconds.
|
| Imported into DuckDB (still about ~6gb for all columns), the
| same SQL query takes 1.1 seconds!
|
| The key thing is that the columns (for all rows) the query
| scans total only about 100mb, so DuckDB has a lot less to scan.
| But on top of that, its vectorised query execution is
| incredibly quick.
|
| https://mobile.twitter.com/samwillis/status/1633213350002798...
| pletnes wrote:
| I found that exporting big tables as a bunch of parquet files
| is faster and uses less memory than duckdb's internal format.
| pletnes wrote:
| Duckdb is fantastic. Doesn't need a schema, either.
| zeitlupe wrote:
| Spark is my favorite tool for dealing with JSON. It can read as
| many JSON files as you want - in any format, located in any
| (even nested) folder structure - offers parallelization, and is
| great for flattening structs. I've never run into memory issues
| (or never run out of workarounds) so far.
| pidge wrote:
| Yeah, given that everything is now multi-core, it makes sense
| to use a natively parallel tool for anything compute-bound. And
| Spark will happily run locally and (unlike previous big data
| paradigms) doesn't require excessive mental contortions.
|
| Of course while you're at it, you should probably just convert
| all your JSON into Parquet to speed up successive queries...
| cpuguy83 wrote:
| Jq does support slurp mode so you should be able to do this using
| that... granted I've never attempted this and the syntax is very
| different.
|
| --- edit ---
|
| I used the wrong term, the correct term is streaming mode.
| Groxx wrote:
| It does work, but it is a _huge_ headache to use, in part
| because the documentation around it is nowhere near enough to
| understand how to use it.
|
| If I used it regularly I'd probably develop a feel for it and
| be much faster - it is _reasonable_, just abnormal and
| extremely low level, and much harder to use with other jq
| stuff. But I almost always start looking for alternatives well
| before I reach that point.
| hprotagonist wrote:
| i would seriously consider sqlite-utils here.
|
| https://sqlite-utils.datasette.io/en/stable/cli.html
| qbasic_forever wrote:
| Was going to post the same thing. I suspect converting the
| dataset to a SQLite db would be infinitely faster and more
| productive than pecking away at it with pandas and such.
| sgarland wrote:
| At this size, I doubt it. While SQLite can read JSON if
| compiled with support for it, it stores it as TEXT. The only
| native indexing possible for that that I'm aware of is full-
| text search, and I suspect the cardinality of JSON characters
| would make that inefficient. Not to mention that the author
| stated they didn't have enough memory to store the entire
| file, so with a DB you'd be reading from disk.
|
| MySQL or Postgres with their native JSON datatypes _might_ be
| faster, but you still have to load it in, and
| storing/indexing it in either of those is [0] its own [1]
| special nightmare full of footguns.
|
| Having done similar text manipulation and searches with giant
| CSV files, parallel and xsv [2] is the way to go.
|
| [0]: https://dev.mysql.com/doc/refman/8.0/en/json.html
|
| [1]: https://www.postgresql.org/docs/current/datatype-
| json.html
|
| [2]: https://github.com/BurntSushi/xsv
| simonw wrote:
| The trick sqlite-utils uses is to automatically create a
| SQLite table schema that matches the shape of the JSON
| object - specifically if you give it a JSON array of
| objects.
|
| It doesn't handle streaming JSON out of the box though, so
| you'd need to write some custom code on top of something
| like ijson to avoid loading the entire JSON file into
| memory first.
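| A sketch of that glue (assuming a top-level JSON array; the
| file and table names are made up):
    import ijson
    from sqlite_utils import Database
    db = Database("data.db")
    def rows():
        with open("big.json", "rb") as f:
            # use_float avoids Decimal values SQLite can't bind
            yield from ijson.items(f, "item", use_float=True)
    # insert_all consumes the generator in batches, so the whole
    # file never has to be in memory at once
    db["records"].insert_all(rows(), batch_size=1000)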
| philwelch wrote:
| SQLite is great for datasets that fit comfortably into
| memory, but otherwise it starts to struggle.
| simonw wrote:
| That's not accurate: SQLite works great with data that's
| too big to fit in memory provided you create indexes and
| write queries that can take advantage of them - just like
| any other relational database (MySQL/PostgreSQL etc).
| fbdab103 wrote:
| Come again? It is a database. Its entire raison d'etre is
| managing bigger-than-memory data. On the SQLite
| forum[0], Dr. Hipp indicated that a customer approached them
| who was using a ~40 TB SQLite database. Now, you
| could argue it is less good at that scale, particularly
| analytical workloads, but SQLite has an enormous amount of
| headroom that you are unlikely to encounter.
|
| [0]: https://sqlite.org/forum/info/6e13e1911f1fe39c
| hprotagonist wrote:
| happily, i have multiple gigabytes of memory ...
| philwelch wrote:
| Sure, but a 40 GB SQLite database on a machine with 16 GB
| of RAM is not gonna be happy
| qbasic_forever wrote:
| You're not going to do better with pandas or similar
| tools. If it can't fit in memory, it's going to be
| painful. SQLite is the least painful in my experience,
| and it sets you up for working with the data in a proper
| DB like postgres or similar for when you get fed up with
| the memory constraints.
| philwelch wrote:
| I wouldn't use pandas in that situation either.
| kosherhurricane wrote:
| In the original article, they were trying to read a 10GB
| file into 30GB RAM.
| Groxx wrote:
| tbh my usual strategy is to drop into a real programming language
| and use whatever JSON stream parsing exists there, and dump the
| contents into a half-parsed file that can be split with `split`.
| Then you can use "normal" tools on one of those pieces for fast
| iteration, and simply `cat * | ...` for the final slow run on all
| the data.
|
| Go is quite good for this, as it's extremely permissive about
| errors and structure, has very good performance, and comes with a
| streaming parser in the standard library. It's pretty easy to be
| finished after only a couple minutes, and you'll be bottlenecked
| on I/O unless you did something truly horrific.
|
| And when jq isn't enough because you need to do joins or
| something, shove it into SQLite. Add an index or three. It'll
| massively outperform almost anything else unless you need rich
| text content searches (and even then, a fulltext index might be
| just as good), and it's plenty happy with a terabyte of data.
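| In Python rather than Go, the same split-friendly dump is only a
| few lines (file names and the top-level-array assumption are
| mine):
    import json
    import ijson
    with open("big.json", "rb") as src, \
         open("records.jsonl", "w") as dst:
        for obj in ijson.items(src, "item", use_float=True):
            dst.write(json.dumps(obj) + "\n")
    # afterwards: split -l 1000000 records.jsonl chunk_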
| rvanlaar wrote:
| Recently had 28GB json of IOT data with no guarantees on the data
| structure inside.
|
| Used simdjson [1] together with python bindings [2]. Achieved
| massive speedups for analyzing the data. Before it was in the
| order of minutes, then it became fast enough to not leave my
| desk. Reading from disk became the bottleneck, not cpu power and
| memory.
|
| [1] https://github.com/simdjson/simdjson [2]
| https://pysimdjson.tkte.ch/
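| For reference, usage is close to the stdlib json module (the
| file and field names below are made up):
    import simdjson
    parser = simdjson.Parser()
    # parses the whole file with SIMD acceleration
    doc = parser.load("iot.json")
    for reading in doc["readings"]:
        if reading["temperature"] > 90:
            pass  # flag the anomaly here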
| jmmv wrote:
| Some random comments:
|
| * A few GBs of data isn't really that much. Even /considering/
| the use of cloud services just for this sounds crazy to me... but
| I'm sure there are people out there that believe it's the only
| way to do this (not the author, fortunately).
|
| * "You might find out that the data doesn't fit into RAM (which
| it well might, JSON is a human-readable format after all)" -- if
| I'm reading this right, the author is saying that the parsed data
| takes _more_ space than the JSON version? JSON is a text format
| and interning it into proper data structures is likely going to
| take _less_ space, not more.
|
| * "When you're ~trial-and-error~iteratively building jq commands
| as I do, you'll quickly grow tired of having to wait about a
| minute for your command to succeed" -- well, change your workflow
| then. When tackling new queries, it's usually a good idea to
| reduce the data set. Operate on a few records until you have the
| right query so that you can iterate as fast as possible. Only
| once you are confident with the query, run it on the full data.
|
| * Importing the data into a SQLite database may be better overall
| for exploration. Again, JSON is slow to operate on because it's
| text. Pay the cost of parsing only once.
|
| * Or write a custom little program that streams data from the
| JSON file without buffering it all in memory. JSON parsing
| libraries are plentiful so this should not take a lot of code in
| your favorite language.
| vlovich123 wrote:
| > JSON is a text format and interning it into proper data
| structures is likely going to take _less_ space, not more.
|
| ~~Yeah but parsing it can require ~2x the RAM available and
| push you into swap / make it not possible.~~
|
| > Or write a custom little program that streams data from the
| JSON file without buffering it all in memory. JSON parsing
| libraries are plentiful so this should not take a lot of code
| in your favorite language.
|
| What is the state of SAX JSON parsing? I used yajl a long time
| ago but not sure if that's still the state of the art (and it's
| C interface was not the easiest to work with).
|
| EDIT: Actually, I think the reason is that you typically will
| have pointers (8 bytes) in place of 2 byte demarcations in the
| text version (eg "", {}, [] become pointers). It's very hard to
| avoid that (maybe impossible? Not sure) and no surprise that
| Python has a problem with this.
| saidinesh5 wrote:
| There are libraries like ijson. At my last job, I wrote a
| wrapper around the C++ library rapidjson to parse
| streaming JSON. It worked quite well.
|
| Here's the write up about it:
| https://dinesh.cloud/2022/streaming-json-for-fun-and-profit/
| And here's the code: https://github.com/multiversal-
| ventures/json-buffet
|
| The API isn't the best. I'd have preferred an iterator based
| solution as opposed to this callback based one. But we worked
| with what rapidjson gave us for the proof of concept. The
| reason for this specific implementation was we wanted to
| build an index to query the server directly about its huge
| JSON files (compressed size of 20+GB per file) using HTTP
| range queries.
| bastawhiz wrote:
| > JSON is a text format and interning it into proper data
| structures is likely going to take _less_ space, not more.
|
| If you're parsing to structs, yes. Otherwise, no. Each object
| key is going to be a short string, which is going to have some
| amount of overhead. You're probably storing the objects as hash
| tables, which will necessarily be larger than the two bytes
| needed to represent them as text (and probably far more than
| you expect, so they have enough free space for there to be
| sufficiently few hash collisions).
|
| JSON numbers are also 64-bit floats, which will almost
| universally take up more bytes per number than their serialized
| format for most JSON data.
| vlovich123 wrote:
| I think even structs have this problem because typically you
| heap allocate all the structs/arrays. You could try to arena
| allocate contiguous objects in place, but that sounds hard
| enough that I doubt that anyone bothers. Using a SAX parser
| is almost certainly the tool you want to use.
| matja wrote:
| > JSON numbers are also 64-bit floats
|
| In common implementations they are, but RFC 8259 and ECMA-404
| do not specify the range, precision, or underlying
| implementation for the storage of numbers in JSON.
|
| An implementation that guarantees interoperability between all
| implementations of JSON would use an arbitrary-precision number
| format, but they seldom do.
|
| No idea what ISO/IEC 21778:2017 says because it's not free.
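| For what it's worth, some parsers let you opt into arbitrary
| precision; for example, Python's stdlib parser:
    import json
    from decimal import Decimal
    # keep full precision instead of rounding to a 64-bit float
    data = json.loads('{"price": 0.1234567890123456789}',
                      parse_float=Decimal)
    print(repr(data["price"]))
    # Decimal('0.1234567890123456789')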
| bastawhiz wrote:
| I mean, I suppose you could use a numeric format that takes
| more than 64 bits, but you certainly couldn't use one that
| uses fewer. You have no way to know whether a number
| _should be_ an int or a float. And since JavaScript (the JS
| in JSON) encodes with 64 bits, you'd be losing precision
| if you choose to use any less. The point is that an integer
| written as one byte in the JSON _will_ take up more space
| (at least seven bytes more) when decoded, all the way up to
| numbers that take seven characters.
| gpderetta wrote:
| (subsets of )ASN.1, LEB128, and of course UTF-8 are all
| examples of encoding binary integers with variable
| length.
| bastawhiz wrote:
| Sure, and JSON is as well. 0 takes three fewer bytes to
| encode than 1234. But if I parse JSON and my numbers come
| back as a type backed by UTF-8, I'm going to be left
| confused and angry. Nobody expects their JSON parser to
| give them back numeric values that their CPU can't
| natively perform operations on.
| zigzag312 wrote:
| > if I'm reading this right, the author is saying that the
| parsed data takes _more_ space than the JSON version? JSON is a
| text format and interning it into proper data structures is
| likely going to take _less_ space, not more.
|
| UTF-8 JSON strings will get converted to UTF-16 strings in some
| languages, doubling the size of strings in memory compared to
| the size on disk.
| taeric wrote:
| It still surprises me how many have the intuition that loading
| the data will take more space than the file.
|
| Even more annoying when it is. (Compressed or binary formats
| notwithstanding.)
| vlovich123 wrote:
| Isn't that almost always true unless the format is
| specifically designed to take the same amount on disk and in
| RAM? If that's the case I think typically it means the file
| on disk is larger than it needs to be. That's because you
| have to convert the file to some kind of in-memory
| representation that's easier to work with to manipulate it
| and what not which requires structure that isn't necessary at
| rest.
|
| Can you give an example where the file on disk is going to be
| larger than what it is in memory? Provided you're not just
| reading it and working with it as an opaque binary blob.
| hnlmorg wrote:
| XML would probably fall into that category.
|
| But I do agree with your point more generally speaking.
| ot wrote:
| > Or write a custom little program that streams data from the
| JSON file without buffering it all in memory. JSON parsing
| libraries are plentiful so this should not take a lot of code
| in your favorite language.
|
| Several years ago I wrote a paper [1] on representing the parse
| tree of a JSON document in a tiny fraction of the JSON size
| itself, using succinct data structures. The representation
| could be built with a single pass of the JSON, and basically
| constant additional memory.
|
| The idea was to pre-process the JSON and then save the parse
| tree, so it could be kept in memory over several passes of the
| JSON data (which may not fit in memory), avoiding having to
| re-do the parsing work on each pass.
|
| I don't think I've seen this idea used anywhere, but I still
| wonder if it could have applications :)
|
| [1]
| http://groups.di.unipi.it/~ottavian/files/semi_index_cikm.pd...
| saidinesh5 wrote:
| > * "You might find out that the data doesn't fit into RAM
| (which it well might, JSON is a human-readable format after
| all)" -- if I'm reading this right, the author is saying that
| the parsed data takes _more_ space than the JSON version? JSON
| is a text format and interning it into proper data structures
| is likely going to take _less_ space, not more.
|
| Not to mention, even when using bad data structures (eg.
| hashmap of hashmaps..), One can just add a large enough
| swapfile and brute force their way through it no?
| [deleted]
| the_duke wrote:
| > JSON is a text format and interning it into proper data
| structures is likely going to take _less_ space, not more.
|
| That depends a lot on the language and the json library.
|
| Let's take `{"ab": 22}` as an example.
|
| That's 10 bytes.
|
| In a language like Rust and using the serde library, this could
| be deserialized directly into a struct with one integer, let's
| pick a u32. So that would only be four bytes.
|
| But if it was deserialized to serde's dynamic Value type, this
| would be: a HashMap<String, u32>, which has a constant size of
| 48 bytes, plus an allocation of I don't know how much (first
| allocation will cover more than one entry), plus 16 bytes
| overhead for the string, plus 2 bytes for the actual string
| contents, plus the 4 bytes for the u32. So that's already over
| ~90 bytes, a lot more than the JSON.
|
| Dynamic languages like Python also have a lot of overhead for
| all the objects.
|
| Keys can of course be interned, but not that many default JSON
| parser libraries do that afaik.
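| You can see the Python overhead directly with sys.getsizeof
| (shallow sizes only; real totals are higher once you follow the
| pointers):
    import sys
    obj = {"ab": 22}
    print(sys.getsizeof(obj))   # dict header plus hash table
    print(sys.getsizeof("ab"))  # a full str object
    print(sys.getsizeof(22))    # a full int object
    # versus 10 bytes for the text '{"ab": 22}'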
| btown wrote:
| I feel like many of the same principles from V8 (the engine
| for Chrome/Node.js) hidden classes would apply here. It's not
| just that you're interning keys; you're actually creating a
| set of optimized data classes that are transitioned to when
| you encounter a key that wasn't seen before. In such a mode,
| you're not even storing the keys, just a pointer to the type
| information tree. Only when you have a sub-object whose keys
| are truly unpredictable do you go into "dictionary mode (aka
| 'slow mode')" - which I think says it all!
|
| https://v8.dev/docs/hidden-classes
| DonHopkins wrote:
| The key names of all the identically shaped objects are
| repeated and can all be interned into sharing the same
| strings, saving a huge amount of memory. That's why CSV files
| can be much more compact than the equivalent JSON files: they
| only list the keys once in the first row.
| mindslight wrote:
| "ab" is 4 bytes, while a 64 bit pointer to an interned
| string is 8 bytes. It would seem that the savings of CSV
| would be better analogized to static typing - the type is
| defined once, then each record only contains data.
|
| I had the same intuition as the original comment. But no,
| the relative sizes of data formats aren't that
| straightforward. One could intern symbols to say 16 bit
| values, or one could infer structural types and compress
| the data that way. But those are both creating additional
| assumptions and processing that likely aren't done by
| commonly available tools.
| rootw0rm wrote:
| I'm using bincode to serialize btrees and the files are
| smaller than I expected... it's pretty cool. It must be storing
| my u32s as u16s or u8s sometimes.
| pkaye wrote:
| How about something like a Go interface which is essentially
| two pointers and the data itself.
| Groxx wrote:
| That's the same thing they're describing, and equivalent to
| what Go does for map[string]int64 (plus or minus minor
| details about how strings are optimized).
|
| The map has its two slices (pointer + length + capacity =
| 8*3 times two slices) and the string needs a separate
| allocation somewhere because it's essentially a pointer to
| a slice of bytes. All of which is true for almost all
| reasonably efficiency-focused languages, Rust and Go
| included - it's just how you make a compact hashmap.
| Yoric wrote:
| Does Go perform interning of strings?
| Groxx wrote:
| Compile-time strings, I think so? In that all identical
| strings simply point to the same data. Otherwise no.
| closeparen wrote:
| I would read "proper data structure" charitably, as a struct
| or record type, maybe a tuple. And if it must be a hashmap,
| then at least using symbols and not strings for keys.
| Nihilartikel wrote:
| If you're doing interactive analysis, converting the json to
| parquet is a great first step.. After that duckdb or spark are a
| good way to go. I only fall back to spark if some aggregations
| are too big to fit in RAM. Spark spills to disk and subdivides
| the physical plans better in my experience..
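| For example, if the data is (or has first been converted to)
| line-delimited JSON, pandas can do the conversion in chunks
| (filenames invented; to_parquet needs pyarrow installed):
    import pandas as pd
    chunks = pd.read_json("big.jsonl", lines=True,
                          chunksize=500_000)
    for i, chunk in enumerate(chunks):
        chunk.to_parquet(f"part-{i:04d}.parquet")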
| lmeyerov wrote:
| Yep!
|
| We do the switch to Parquet and then, as they say, use Dask so
| we can stick with Python for the interesting bits, as SQL is
| relatively anti-productive there.
|
| Interestingly, most of the Dask work can actually be dask_cudf
| and cudf nowadays: Dask/pandas on a GPU, so you can stay on the
| same computer, with no need for distributed computing, even with
| TBs of JSON.
| pradeepchhetri wrote:
| Well, if you need to convert JSON to Parquet to do anything
| fast, then what is the point? You will end up spending so much
| on the conversion itself that the benefit is cancelled out by
| the cost of the extra storage (since you now have both the JSON
| and the Parquet files). The whole point is to do fast operations
| on the JSON itself. Try out clickhouse/clickhouse-local.
| closeparen wrote:
| If you're doing interactive analysis, generally you're going
| to have multiple queries, so it can be worthwhile to pay the
| conversion cost once upfront. You don't necessarily retain
| the JSON form, or at least not for as long.
| [deleted]
| thakoppno wrote:
| Would sampling the JSON down to 20MB and running jq
| experimentally until one has found an adequate solution be a
| decent alternative approach?
|
| It depends on the dataset one supposes.
| epalm wrote:
| Yeah, I do this when querying sql databases. I limit the data
| to some small/local range, iteratively work on the query, and
| when I'm happy with the local results, I remove the filter and
| get the big results.
| ddulaney wrote:
| I really like using line-delimited JSON [0] for stuff like this.
| If you're looking at a multi-GB JSON file, it's often made of a
| large number of individual objects (e.g. semi-structured JSON log
| data or transaction records).
|
| If you can get to a point where each line is a reasonably-sized
| JSON value, a lot of things get way easier. jq will be streaming
| by default. You can use traditional Unixy tools (grep, sed, etc.)
| in the normal way because it's just lines of text. And you can
| jump to any point in the file, skip forward to the next line
| boundary, and know that you're not in the middle of a record.
|
| The company I work for added line-delimited JSON output to lots
| of our internal tools, and working with anything else feels
| painful now. It scales up really well -- I've been able to do
| things like process full days of OPRA reporting data in a bash
| script.
|
| [0]: https://jsonlines.org/
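| That last property is what makes it easy to parallelize by byte
| range; a sketch of the resync trick (the helper is mine, not
| from our tools):
    import json
    def records_from(path, offset):
        # start at an arbitrary byte offset, skip the partial
        # line, then yield whole records
        with open(path, "rb") as f:
            f.seek(offset)
            if offset:
                f.readline()
            for line in f:
                yield json.loads(line)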
___________________________________________________________________
(page generated 2023-03-18 23:00 UTC)