[HN Gopher] A SQLite extension for reading large files line-by-line
___________________________________________________________________
A SQLite extension for reading large files line-by-line
Author : polyrand
Score : 104 points
Date : 2022-07-30 16:45 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cube2222 wrote:
| Hey!
|
| OctoSQL[0] author here, this is really impressive! I like this
| much more than the approach taken by other sqlite-based tools
| which first load stuff into SQLite and then let you query it.
|
 | On the other hand, it does have the drawback that it doesn't
 | automatically infer the schema of the input JSON, so you still
 | have to manually parse the raw lines. Maybe it would be possible
 | to surmount this by exposing a JSON-dedicated file-reading
 | function which also does schema inference (I'm not knowledgeable
 | about SQLite internals)?
|
| One piece of feedback is with regard to the benchmarks: I think
| it would be worth it to add additional benchmarks which work on
| slightly more complex datasets than the one used here. I did a
| comparison of this vs OctoSQL on the Brazil dataset, and - as
| expected - sqlite-lines wiped the floor with it. However, then I
| ran the following queries on a slightly more complex dataset (the
| Amazon review dataset in this case, from SPyQL's benchmark
 | notebook[1]):
 |
 |       ~> time OCTOSQL_NO_TELEMETRY=1 ./octosql \
 |            "SELECT COUNT(*), AVG(overall)
 |             FROM books.json
 |             WHERE reviewerName = 'James'"
 |       +-------+-------------------+
 |       | count |    avg_overall    |
 |       +-------+-------------------+
 |       | 3010  | 4.402325581395349 |
 |       +-------+-------------------+
 |       real    0m49.805s
 |       user    0m32.169s
 |       sys     0m9.163s
 |
 |       ~> time ./lines0-linux-amd64-sqlite3 :memory: \
 |            "SELECT COUNT(*), AVG(json_extract(line, '$.overall'))
 |             FROM lines_read('books.json')
 |             WHERE json_extract(line, '$.reviewerName') = 'James'"
 |       3010|4.40232558139535
 |       real    1m47.933s
 |       user    1m27.024s
 |       sys     0m11.559s
|
| and as you can see, the results go in a very different direction.
|
| But anyhow, congrats on the project, and I'm pumped to see what
| you come up with next!
|
| [0]: https://github.com/cube2222/octosql
|
| [1]:
| https://github.com/dcmoura/spyql/blob/master/notebooks/json_...
| alexgarcia-xyz wrote:
| Hey thanks for sharing!
|
 | Re inferring the schema of input JSON: That would be slick!
 | Though SQLite does have some limitations here with table-valued
 | functions vs. virtual tables. I won't go into the specifics, but
 | something like this isn't possible in SQLite:
 |
 |       select name, age from lines_json_read('students.json');
 |
 | The "name" and "age" dynamic columns aren't possible when using
 | the "table function" syntax, but something like this is
 | possible using traditional "virtual table" syntax:
 |
 |       create virtual table students using
 |         lines_json_read(filename="students.json");
 |       select name, age from students;
|
 | It's a small difference, but definitely possible! Parsing JSON
 | in C is tricky, though, so I'd definitely accept contributions
 | that figure it out.
|
 | And re benchmarks - thanks for sharing! Yeah, they're pretty
 | basic, so I'd love to add more complex ones. With the
 | books.json example, I think what's happening is that SQLite's
 | JSON functions re-parse the JSON in each json_extract call - so
 | the JSON is parsed twice for every row in that query. I also
 | suspect that the long strings in "reviewText" might slow things
 | down, but I can't be sure. Once I get some free time I'll add
 | OctoSQL to the benchmark suite, along with this new books
 | dataset.
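The double-parse cost described above can be sidestepped in stock SQLite: `json_extract` accepts multiple paths in one call and returns the values packed into a JSON array from a single parse. A minimal sketch using Python's bundled SQLite - the `lines_read` extension isn't loaded here, and the JSON literal is a stand-in for one line of books.json:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One json_extract call with two paths parses the JSON once and
# returns both values as a JSON array.
row = con.execute(
    """
    select json_extract('{"overall": 5, "reviewerName": "James"}',
                        '$.overall', '$.reviewerName')
    """
).fetchone()

print(row[0])  # [5,"James"]
```

A query could then filter and aggregate on the unpacked array elements, paying the parse cost only once per row.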
| tstack wrote:
 | Another option would be to create the virtual table used for
 | the table-valued function using a schema to define the
 | columns:
 |
 |       CREATE VIRTUAL TABLE students_json_read USING
 |         lines_json_read(schema="students-schema.json");
 |
 | Then, that one table could be used for multiple files:
 |
 |       SELECT * FROM students_json_read('school1-students.json');
 |       SELECT * FROM students_json_read('school2-students.json');
| simonw wrote:
 | I love this example from the docs:
 |
 |       select
 |         line -> '$.id' as id,
 |         line -> '$.name' as name
 |       from lines_read("my-file.ndjson");
|
| This is using the new -> JSON operator which was added to SQLite
| a few months ago.
|
| The lines_read() thing there is a table-valued function, which
| means it returns a virtual table that you can query. This is a
| streaming operation which means it's safe to run it against a
| 100GB+ file without worrying about it sucking all of the data
| into memory at once.
|
 | Where this gets really useful is when you combine it with a
 | create table statement:
 |
 |       create table names as
 |       select
 |         line -> '$.id' as id,
 |         line -> '$.name' as name
 |       from lines_read("my-file.ndjson");
|
| This should efficiently create a table with that exact subset of
| the data pulled from the newline-delimited JSON file.
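Without the extension loaded, the streaming shape of that `create table ... as select` can be approximated in plain Python by feeding `executemany` a generator, so only one NDJSON line is decoded at a time. `io.StringIO` stands in for a real file handle, and the two-record payload is made up for the demo:

```python
import io
import json
import sqlite3

# Stand-in for open("my-file.ndjson") - any iterable of lines works
ndjson = io.StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')

con = sqlite3.connect(":memory:")
con.execute("create table names(id integer, name text)")

# executemany consumes the generator lazily, so the whole file is
# never materialized in memory at once.
con.executemany(
    "insert into names values (:id, :name)",
    (json.loads(line) for line in ndjson),
)

rows = con.execute("select id, name from names order by id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b')]
```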
| jokoon wrote:
| I've read that spatialite is not well maintained, is that true?
 | bicijay wrote:
 | That's interesting - is there any way to add "custom delimiter
 | templates" (or an alternative tool)? I have some files where each
 | line contains different kinds of information, separated by their
 | index within the row: for example, the customer name runs from
 | character index 3 to character index 10.
| alexgarcia-xyz wrote:
 | Hmm, so you have lines that have fields in fixed indexed
 | positions? Maybe something like this:
 |
 |       $ cat hn.txt
 |       11alex    a
 |       12brian   b
 |       12craig   c
 |       $ cat test.sql
 |       .header on
 |       .mode box
 |       .load dist/lines0
 |       select substr(line, 1, 2) as id,
 |         substr(line, 3, 7) as name,
 |         substr(line, 11, 2) as age
 |       from lines_read('hn.txt')
 |       $ sqlite3x :memory: '.read test.sql'
 |       +----+---------+-----+
 |       | id |  name   | age |
 |       +----+---------+-----+
 |       | 11 | alex    | a   |
 |       | 12 | brian   | b   |
 |       | 12 | craig   | c   |
 |       +----+---------+-----+
 |
 | Unfortunately SQLite's substr function is a bit awkward, but if
 | each field has constant indices it may work.
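The same `substr` approach works in any SQLite build; here's a sketch using Python's `sqlite3`, with a plain table standing in for `lines_read` since the extension isn't loaded:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table raw(line text)")

# Fixed-width records: chars 1-2 are the id, 3-9 the name, 11 the age
con.executemany(
    "insert into raw values (?)",
    [("11alex    a",), ("12brian   b",), ("12craig   c",)],
)

# substr(line, start, length) slices each field out of the line;
# trim strips the fixed-width padding spaces.
rows = con.execute(
    """
    select substr(line, 1, 2) as id,
           trim(substr(line, 3, 7)) as name,
           trim(substr(line, 11, 2)) as age
    from raw
    """
).fetchall()

print(rows)  # [('11', 'alex', 'a'), ('12', 'brian', 'b'), ('12', 'craig', 'c')]
```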
| wodenokoto wrote:
 | These, I believe, are what are called fixed-width files.
|
| If you use pandas, there is a reader for those `pd.read_fwf()`
|
 | I'm sure many more languages support this; pandas is just the
 | one where I had a need.
| bicijay wrote:
 | Exactly that - all the banks here in Brazil export
 | transactions in this kind of format (it's called a CNAB file
 | over here, and they're probably using COBOL systems). There
 | are different kinds of CNAB files: CNAB 200, where lines have
 | 200 characters, CNAB 400...
 |
 | Inside those files you have multiple chunks, and every line
 | contains information about a transaction. So if you want the
 | transaction amount, you would look at all characters between
 | character n10 and n20, for example. (Those positions are fixed
 | and follow the bank's template.)
 |
 | We had success parsing it with Python/Node, but I think
 | spinning up a SQLite instance, reading straight from the
 | file, and doing some calculations would be much faster.
| qbasic_forever wrote:
| Very cool looking! I love using SQL for ad-hoc data analysis, my
| brain just groks the language so much more easily than other data
| query languages like python + pandas.
| randyrand wrote:
| nice simple & reliable approach to a hard perf problem.
| freecodyx wrote:
 | I like the fact that the author is extensively documenting the
 | source code, following the SQLite spirit.
| alexgarcia-xyz wrote:
 | Hey, author here, happy to answer any questions! Also check out
 | this notebook for a deeper dive into sqlite-lines, along with a
 | slick WASM demonstration and more thoughts on the codebase itself
 | https://observablehq.com/@asg017/introducing-sqlite-lines
|
| I really dig SQLite, and I believe SQLite extensions will push it
| to another level. I rarely reach for Pandas or other
| "traditional" tools and query languages, and instead opt for
| plain ol' SQLite and other extensions. As a shameless plug, I
| recently started a blog series on SQLite and related tools and
| extensions if you want to learn more! Next week I'll be
| publishing more SQLite extensions for parsing HTML + making HTTP
| requests https://observablehq.com/@asg017/a-new-sqlite-blog-
| series
|
| A few other SQLite extensions:
|
| - xlite, for reading Excel files, in Rust
| https://github.com/x2bool/xlite
|
| - sqlean, several small SQLite extensions in C
| https://github.com/nalgeon/sqlean
|
| - mergestat, several SQLite extensions for developers (mainly
| Github's API) in Go https://github.com/mergestat/mergestat
| simonw wrote:
| I'm so interested to hear about the HTTP requests stuff.
|
| I've played a tiny bit with SQLite Python extensions that
| perform HTTP requests, but I'm very unsure of the implications
| of this kind of thing.
|
| What happens to the rest of the SQLite query engine if your
| custom function hits a page like https://httpbin.org/delay/10
| with a deliberate 10 second delay before the request returns?
|
| Maybe this is fine - your transaction takes 10 seconds longer
| but everything else continues to work as normal. I've not done
| the work to figure that out.
| alexgarcia-xyz wrote:
 | Yup, if it hits https://httpbin.org/delay/10 then the entire
 | script will stall for 10 seconds (unless you've set a custom,
 | lower timeout). Which means you can only make HTTP requests
 | serially, which sucks, but it's much simpler than coordinating
 | with other tools.
|
| It'll come out in a week, but lmk if you want to play with it
| beforehand!
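The serial behavior is easy to demonstrate with a stand-in: register a slow Python function via `create_function` and call it from a query. A `time.sleep` substitutes for the HTTP round-trip, and `http_get` is a made-up name for this demo, not the extension's actual API:

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")

def slow_get(url):
    time.sleep(0.1)  # stand-in for a blocking HTTP request
    return "body of " + url

con.create_function("http_get", 1, slow_get)

start = time.monotonic()
# SQLite invokes the function once per row, one after another
rows = con.execute(
    "select http_get('https://example.com/' || n)"
    " from (select 1 as n union all select 2)"
).fetchall()
elapsed = time.monotonic() - start

# Two rows at 0.1s each means the query takes at least 0.2s total
print(len(rows), elapsed >= 0.2)
```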
| simonw wrote:
| How will it affect other SQLite connections against the
| same database?
|
| My guess is they'll be OK, but presumably any transactions
| will lock tables for the duration?
| alexgarcia-xyz wrote:
 | I think you're right, though I'm not an expert with multiple
 | connections. I think if the DB is in WAL mode and you're only
 | issuing read-only requests, then there will be no lock and
 | long HTTP requests won't lock anything up. But if you issue a
 | write request that locks tables, then other write requests
 | will be blocked behind the long requests.
 |
 | If you're in rollback-journal mode, though, there's not much
 | you can do. Please correct me if I'm wrong!
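The WAL behavior described above can be checked with two plain `sqlite3` connections: while one connection holds an open write transaction, a second connection in WAL mode still reads the last committed snapshot. The file path and table are made up for the demo:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Writer connection with manual transaction control (autocommit mode)
w = sqlite3.connect(path, isolation_level=None)
w.execute("pragma journal_mode=wal")
w.execute("create table t(x)")
w.execute("insert into t values (1)")

# Open a write transaction and add a row without committing
w.execute("begin immediate")
w.execute("insert into t values (2)")

# A second connection is not blocked: it reads the committed snapshot
r = sqlite3.connect(path, isolation_level=None)
count = r.execute("select count(*) from t").fetchone()[0]
print(count)  # 1 - the uncommitted row is invisible to the reader

w.execute("commit")
```

In rollback-journal mode the reader could instead hit SQLITE_BUSY once the writer needs an exclusive lock, which is the contrast the comment above is getting at.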
| simonw wrote:
 | How difficult do you think it would be to compile SpatiaLite (a
 | big hairy C codebase) to WASM and load it as a dynamic module,
 | like you've done with sqlite-lines?
| alexgarcia-xyz wrote:
| This has been done by other people already!
|
| - https://github.com/jvail/spl.js
|
| - https://observablehq.com/@visionscarto/hello-spl-js
| simonw wrote:
| Oh wow! I wonder how hard it would be to load that module
| into https://github.com/simonw/datasette-lite
|
| Hah, turns out I have an issue for that already, which I'd
| forgotten about: https://github.com/simonw/datasette-
| lite/issues/30
___________________________________________________________________
(page generated 2022-07-30 23:00 UTC)