[HN Gopher] Inserting One Billion Rows in SQLite Under a Minute
       ___________________________________________________________________
        
       Inserting One Billion Rows in SQLite Under a Minute
        
       Author : todsacerdoti
       Score  : 397 points
       Date   : 2021-07-18 13:01 UTC (9 hours ago)
        
 (HTM) web link (avi.im)
 (TXT) w3m dump (avi.im)
        
       | einpoklum wrote:
       | > Recently, I ran into a situation where I needed a test database
       | with lots of rows and needed it fast. So I did what any
       | programmer would do: wrote a Python script to generate the DB.
       | 
       | The author must have a rather low opinion of programmers...
       | generating data with a script is good for 1,000 rows. Maybe
       | 10,000 rows. 100,000 is probably stretching it. Beyond that the
       | slowness will be meaningful.
       | 
       | Anyway, if the "test database" data is given in advance, you want
       | an in-memory DBMS which can load data from files. If it isn't -
       | use an in-memory DBMS which supports generated columns, or apply
       | your random-number-generating function on a trivial column.
       | 
       | (MonetDB standalone/embedded would be a decent option; maybe the
       | columnar storage mechanism in MariaDB? And of course there are
       | commercial offerings.)
       | 
       | > Unfortunately, it was slow. Really slow. So I did what any
       | programmer would do: went down the rabbit hole of learning more
       | about SQLite...
       | 
       | That's the wrong approach. SQLite isn't called "lite" for
       | nothing. It's not what you would use for bulk operations on large
       | tables.
       | 
       | Read the SQLite FAQ: https://www.sqlite.org/whentouse.html
       | 
        | It oversells itself a bit, but still says its (only) advantages
       | for data analysis are: "easier to install and use and the
       | resulting database is a single file that can be written to a USB
       | memory stick or emailed."
        
         | geofft wrote:
         | I suspect these opinions would be less likely to get downvoted
         | if you could back them up with data, specifically:
         | 
         | - Assuming you do have a test database in advance that's
         | represented in a file, but for some reason not a SQLite file or
         | other database, how long does it take to load 100M rows in a
         | database of your choice, and how does that compare to the
         | author's results?
         | 
         | - The author does not in fact have a test database in advance,
          | and is generating random data, which I _think_ is what you're
         | saying by "apply your random-number-generating function on a
         | trivial column" - how long would your approach take to generate
         | 100M rows?
         | 
         | - If you generated a database of about the same size in a
         | database of your choice, how do bulk operations on large tables
         | perform in comparison to SQLite? (My understanding is that it's
         | "lite" because it's not client/server, not because its
         | performance is poor.)
         | 
         | I don't think any of these experiments would be terribly hard
         | to run.
        
       | siscia wrote:
       | Creator of RediSQL / zeeSQL (https://zeesql.com/)
       | 
        | Insertion performance on a single table is very, very hard to
        | optimize.
        | 
        | A single process doing the loop is your best bet.
        | 
        | I would just increase the batch size, which is the most
        | influential factor.
        | 
        | Then another point... When you do batches, you do:
        | 
        |     BEGIN TRANSACTION;
        |     for i in range(1, 50):
        |         execute_stmt
        |     COMMIT;
       | 
       | You do not create a long list of parameters.
       | 
       | https://github.com/avinassh/fast-sqlite3-inserts/blob/master...
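        | 
        | For illustration, a minimal sketch of that pattern with Python's
        | standard sqlite3 module (not the repo's code; the table name and
        | batch size are made up):
        | 
        |     import sqlite3, random
        | 
        |     conn = sqlite3.connect("test.db")
        |     conn.execute(
        |         "CREATE TABLE IF NOT EXISTS user(area INT, age INT, active INT)")
        | 
        |     batch = [(random.randint(0, 999999),
        |               random.choice([5, 10, 15]),
        |               random.randint(0, 1)) for _ in range(50_000)]
        | 
        |     with conn:  # one transaction per batch
        |         # executemany re-runs a single prepared statement per row
        |         conn.executemany("INSERT INTO user VALUES (?, ?, ?)", batch)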
       | 
       | ;)
        
         | avinassh wrote:
         | That's a great point, let me try it!
         | 
         | > You do not create a long list of parameters.
         | 
         | I have done much worse by trying to insert a really long string
         | of 100K params
        
         | siscia wrote:
         | Even though merging strings seems faster...
        
           | [deleted]
        
       | avinassh wrote:
        | Hey all, author here. I didn't expect this to reach the front
        | page of HN. I came here to submit it and it was here already! I
        | am looking for more ideas to experiment with. Here's one idea
        | which someone from another forum gave me: exploring recursive
        | queries and using SQLite's random functions to do the
        | insertions.
        | 
        | Another crazy idea is to learn the SQLite file format and just
        | write the pages to disk.
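        | 
        | A rough, untested sketch of the recursive-query idea (column
        | names taken from the article's schema; scale the bound up as
        | needed):
        | 
        |     import sqlite3
        | 
        |     conn = sqlite3.connect("test.db")
        |     conn.execute(
        |         "CREATE TABLE IF NOT EXISTS user(area INT, age INT, active INT)")
        |     with conn:
        |         conn.execute("""
        |             WITH RECURSIVE seq(n) AS (
        |                 SELECT 1
        |                 UNION ALL
        |                 SELECT n + 1 FROM seq WHERE n < 1000000
        |             )
        |             INSERT INTO user (area, age, active)
        |             SELECT abs(random()) % 1000000,
        |                    (abs(random()) % 3 + 1) * 5,
        |                    abs(random()) % 2
        |             FROM seq""")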
        
         | Twirrim wrote:
         | I gave the suggestion on Twitter, but probably worth mentioning
         | here too as people here might not be aware of it.
         | 
         | https://www.flamingspork.com/projects/libeatmydata/
         | 
          | libeatmydata shouldn't be used in production environments,
          | generally speaking, as it biases for speed over safety (lib-
          | eat-my-data) by disabling fsync and associated calls for the
          | process running under it. Disabling those calls results in
          | less I/O pressure, but comes with the risk that the program
          | thinks it has written safely and durably when that may not be
          | true. It essentially stops programs that are written to be
          | crash proof from actually being crash proof. Under the
          | circumstances you're operating in here, you almost certainly
          | don't care.
         | 
         | I've used this when reloading database replicas from a dump
         | from master before, as it drastically speeds up operations
         | there.
        
           | the8472 wrote:
           | Overlayfs offers libeatmydata functionality in recent linux
           | kernels without having to intercept libc, so it should work
           | for things that use raw syscalls too.
        
           | nh2 wrote:
            | Aside: SQLite has native functionality to do what
            | libeatmydata does (disable fsync):
            | 
            |     pragma synchronous = off
           | 
           | https://www.sqlite.org/pragma.html#pragma_synchronous
        
         | anigbrowl wrote:
          | It's a great article! Always nice to see clear, well-written
          | guides that save people from reinventing the wheel.
        
         | justinclift wrote:
         | You mention wanting to do a Go version. Not sure if it's
         | useful, but this is a SQLite "bulk data generation" util I
         | threw together ages ago in Go:
         | 
         | https://github.com/sqlitebrowser/sqlitedatagen
         | 
         | There's some initial work to parallelise it with goroutines
         | here:
         | 
         | https://github.com/sqlitebrowser/sqlitedatagen/blob/multi_ro...
         | 
         | Didn't go very far down that track though, as the mostly single
         | threaded nature of writes in SQLite seemed to prevent that from
         | really achieving much. Well, I _think_ that's why it didn't
         | really help. ;)
        
           | mathnode wrote:
           | Very nice! I shall give that a try. I always get excited by
           | SQLite and go projects.
           | 
            | I maintain a similar Go tool for work, which I use to stuff
            | around 1TB into MariaDB at a time:
           | 
           | https://github.com/rcbensley/brimming
        
         | mst wrote:
          | I'm curious how you'd fare if you pregenerated the data, with
          | periodic COMMITs, and then just slung it at the sqlite3 binary
          | over STDIN to see how it handled that.
         | 
         | Certainly if I was trying to do the same thing in Pg my first
         | thought would be "batched COPY commands".
        
         | [deleted]
        
         | eatonphil wrote:
         | I am also interested in writing the SQLite or PostgreSQL file
         | format straight to disk as a faster way to do ETL. I'd be
         | curious to hear if anyone has actually tried this. I'd be as
         | curious if you end up trying this too.
        
           | avinassh wrote:
            | Yes, I plan to try this next and measure the performance.
            | Others also suggested a similar idea to try.
            | 
            | > I am also interested in writing the SQLite or PostgreSQL
            | file format straight to disk as a faster way to do ETL.
            | 
            | What exactly are you trying to do here?
        
           | [deleted]
        
           | tyingq wrote:
           | Sqlite has a btree module/api, though there's a lot of
           | "TODO:" notes in the document:
           | https://sqlite.org/btreemodule.html
        
             | avinassh wrote:
              | TIL, this is great! I think this could make things even
              | easier than writing the raw pages.
        
           | oldtablet wrote:
           | I really like sqlite but I'm curious how fast it would be to
           | write the data to parquet files and read it with duckdb. I've
           | done something similar before and in my case duckdb was
           | faster than sqlite.
        
         | Fiahil wrote:
         | I did similar Rust/Python record generation experiments, and I
         | can relate to these numbers.
         | 
          | However, I think your Rust threaded trial might be a little bit
          | off in subtle ways. I would expect it to perform several times
          | better than single-threaded Rust and async Rust (async is
          | generally slower on these workloads, but still faster than
          | Python).
         | 
          | Edit: After reading your Rust code, you might have room for
          | some improvements:
          | 
          | - don't rely on random, simply cycle through your values
          | 
          | - pre-allocate your vecs with `with_capacity`
          | 
          | - don't use channels, prefer deques...
          | 
          | - ...or even better, don't use synchronization primitives and
          | open one connection per thread (not sure if it will work with
          | sqlite?)
        
       | mongol wrote:
       | 1 billion rows, 8GB RAM. I think there will be some limitation
       | from available RAM as well.
        
       | asicsp wrote:
       | > _The larger DB required more than 30 minutes to generate. So I
       | spent 30+ hours to reduce the 30 mins of running time :p_
       | 
       | Worth it.
       | 
       | > _I could just run s the script again._
       | 
       | Found a typo.
        
         | ericlewis wrote:
         | Cool
        
       | pksebben wrote:
       | I'm pretty new to this sort of optimization stuff, so forgive me
       | if this is ridiculous for any reason, but I'm wondering about two
       | things reading this:
       | 
        | 1. is it useful / feasible to set a sort of "lower bound" to
       | these optimizations by profiling the raw write time on a set of a
       | certain size?
       | 
        | 2. assuming that writes are ultimately the limiting factor, could
        | you gain performance by calculating the minimum batch size at
        | which its write time intersects the in-memory calculation time,
        | and as soon as you hit that size, pop a thread to write that
        | batch to disk - then return some async signal when the write is
        | done to trigger the next batch of writes? is it possible to
        | "stitch together" these batches post-write?
       | 
       | edit: mobile formatting is le suck
        
       | Groxx wrote:
       | Yep - I discovered most of this a while back when I needed to do
       | a bunch of exploring of ~200m rows of data (like 50GB). I chucked
       | it into SQLite and was done with everything in a day, it was
       | absolutely wonderful. This post is a pretty good summary of what
       | to change and why.
       | 
       | Prepared statements should be considered the norm for SQLite -
       | they have pretty major performance benefits, and any decent ORM
       | or query engine can probably do it for you implicitly or with a
       | single flag, so it's practically free for many applications. I'm
       | always surprised how often I see applications or benchmarks or
       | "how to be fast with SQLite" not using them, so I'm definitely
       | glad to see them covered here :)
        
       | ThePadawan wrote:
       | There are some cool ideas in there!
       | 
       | I wonder how much I could speed up my test suites (that don't
       | rely on transaction rollbacks) by disabling journalling. Those
       | milliseconds per query add up to seconds per tests and minutes
       | per PR.
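        | 
        | For a test suite, that might look something like this (a sketch
        | with Python's sqlite3; note that journal_mode = OFF also
        | disables ROLLBACK):
        | 
        |     import sqlite3
        | 
        |     conn = sqlite3.connect("testsuite.db")
        |     # Throwaway test data: trade durability for speed.
        |     conn.execute("PRAGMA journal_mode = OFF")
        |     conn.execute("PRAGMA synchronous = OFF")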
        
       | jonnycomputer wrote:
       | >The generated data would be random with following constraints:
       | The area column would hold six digits area code (any six digits
       | would do, no validation). The age would be any of 5, 10, or 15.
       | The active column is either 0 or 1.
       | 
        | That is about 6 million combinations (1,000,000 area codes x 3
        | ages x 2 active flags). This is not such a big space that it
        | would be impractical to precompute all possible combinations. I
        | wonder if hashing into such a precomputed table might help.
        | Might not, if the overhead of maintaining the lookup table is
        | too high.
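        | 
        | A quick Python sketch of the precomputation idea (whether the
        | lookup actually beats generating values on the fly is the open
        | question):
        | 
        |     import itertools, random
        | 
        |     # ~6M (area, age, active) tuples; several hundred MB as a list.
        |     combos = list(itertools.product(range(1_000_000),
        |                                     (5, 10, 15), (0, 1)))
        |     row = random.choice(combos)  # O(1) pick per insert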
        
       | nzealand wrote:
       | If you insert 4 rows into a table, and join it on itself without
       | any conditional criteria, it will result in a cartesian product
       | of 16 rows (Select * from a, a)
       | 
       | Do that cartesian join three more times, and you have over 4
       | billion results.
       | 
       | Then you simply need to use the random function in conjunction
       | with division, rounding and case statements to get the desired
       | random numbers.
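        | 
        | A small Python/SQL sketch of that approach, using the article's
        | column constraints (a five-way cross join of a 4-row seed table
        | gives 4^5 = 1024 rows; add more copies of `seed` to grow it):
        | 
        |     import sqlite3
        | 
        |     conn = sqlite3.connect(":memory:")
        |     conn.executescript("""
        |         CREATE TABLE seed(x);
        |         INSERT INTO seed VALUES (0), (1), (2), (3);
        |         CREATE TABLE user(area INT, age INT, active INT);
        |         INSERT INTO user
        |         SELECT abs(random()) % 1000000,      -- area
        |                (abs(random()) % 3 + 1) * 5,  -- age: 5, 10 or 15
        |                abs(random()) % 2             -- active: 0 or 1
        |         FROM seed a, seed b, seed c, seed d, seed e;
        |     """)
        |     print(conn.execute("SELECT count(*) FROM user").fetchone())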
        
       | notatelloshill wrote:
        | An in-memory database is not a database... you might as well
        | just create a billion-object array and have your Java process
        | constantly running to keep that data accessible.
        
       | orf wrote:
       | How long does it take when using the native CSV import features?
       | 
       | https://www.sqlite.org/csv.html
        
         | motoboi wrote:
         | It's fast. But then we need a "1 billion CSV lines" article.
        
           | orf wrote:
           | Which I expect is significantly faster to generate. You could
           | probably combine it with a pipe to avoid all disk IO.
        
         | tanin wrote:
         | I built https://superintendent.app and experimented a lot with
         | this feature.
         | 
          | I can import a 1GB CSV file in 10 seconds on my MacBook. This
          | queries from a virtual table and puts it into an actual table.
        
       | roman-holovin wrote:
        |     INSERT INTO user (area, age, active)
        |     SELECT abs(random()) % 1000000,
        |            (abs(random()) % 3 + 1) * 5,
        |            abs(random()) % 2
        |     FROM generate_series(1, 100000000, 1)
        | 
        | Faster by 10% than the author's fastest implementation on my
        | machine - 19 seconds against 21 for 'threaded_batched'.
        
         | gfodor wrote:
         | I have to assume this was rejected as a valid approach, but if
         | not, this whole thread gets 10x more interesting than it
         | already was. I hope the author responds.
        
           | roman-holovin wrote:
            | Well, even if you just insert zeros instead of random values,
            | it takes 9 seconds on my computer to insert 100M rows, so
            | even that does not reach 1B rows per minute.
            | 
            | And I think INSERT INTO ... SELECT is the fastest way to bulk
            | insert data into sqlite.
            | 
            | Also, I have tried the carray sqlite feature, which allows
            | sharing memory with sqlite, together with a recursive CTE to
            | query it, but it is slower. Though, you can pass values
            | you've generated from Rust instead of using random().
        
       | arusahni wrote:
       | [ @dang, please delete if/when you get the chance, thank you ]
        
         | kangalioo wrote:
         | Did you mean to post this to
         | https://news.ycombinator.com/item?id=27872596 ?
        
           | arusahni wrote:
           | Yes. Not sure how I got mixed up.
        
       | lmz wrote:
       | I wonder if defining a virtual table
       | (https://www.sqlite.org/vtab.html) and just doing
       | 
       | INSERT INTO t (...) SELECT ... from virtual_table
       | 
       | would be any faster.
        
         | rubyfan wrote:
          | From my armchair here I thought something similar. I think I
          | heard 1) that inserting into the same table has some speed
          | limitations and 2) one of the author's ideas was to spin up 4
          | processes...
          | 
          | I wonder if you could insert into 10 different tables from 10
          | threads or processes, then
          | 
          |     insert into single_table (...) select (...) union.
          | 
          | No idea if insert-select does the trick or not, but you're
          | almost at the point of partitioning here. If the application
          | called for it, you could do some sort of poor man's partition
          | in order to write all the data and then push some complexity
          | into the read.
          | 
          | I also wondered what impact, if any, the ID primary key had on
          | inserting at this level of frequency.
         | 
         | /armchair
        
       | bob1029 wrote:
       | You can go even faster if you can organize your problem such that
       | you simply start from a copy of a template SQLite database each
       | time (it is just a file/byte sequence).
       | 
       | We do SQL evaluation for a lot of business logic throughout our
       | product, and we have found that starting from a template database
       | (i.e. one with the schema predefined and canonical values
       | populated) can save a lot of time when working in tight loops.
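        | 
        | A minimal sketch of the template trick (Python here just for
        | brevity; the file copy is the whole idea):
        | 
        |     import shutil, sqlite3
        | 
        |     # Build the template once: schema plus any canonical rows.
        |     tmpl = sqlite3.connect("template.db")
        |     tmpl.execute(
        |         "CREATE TABLE IF NOT EXISTS user(area INT, age INT, active INT)")
        |     tmpl.commit()
        |     tmpl.close()
        | 
        |     # Each run starts from a byte-for-byte copy instead of DDL.
        |     shutil.copyfile("template.db", "run.db")
        |     conn = sqlite3.connect("run.db")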
        
         | jonnycomputer wrote:
          | Are you suggesting having a template and then updating rows
          | with random values where necessary?
        
           | nezirus wrote:
            | You just do raw byte copies from a sample DB, no SQL or
            | "inserts" or anything similar. Imagine the test database
            | consists of 3 parts (all raw bytes):
            | 
            | ### PROLOGUE ### Sample row ### EPILOGUE
            | 
            | You copy & write the prologue, then write 1B sample rows (you
            | can optimize this at will, large writes, etc).
            | 
            | Copy & write the epilogue and fsync the data. You probably
            | need to modify some metadata, but that should be a few writes
            | at most.
            | 
            | That should be as good as it gets, provided your IO is
            | optimal.
        
             | tyingq wrote:
             | Do you mean crafting all the various database page btree
             | structures and entries yourself? I'd be concerned about
             | subtle bugs.
        
               | [deleted]
        
               | ndepoel wrote:
               | An SQLite database is just a file. You can build the
               | empty database with schema and base values ahead of time,
               | save it to a file (or an in-memory byte buffer) and then
               | every time you want to create a new database, you just
               | copy that file. No need to do any expensive
               | initialization queries that way. If raw high-speed
               | throughput is needed, skipping that step can make a
               | significant difference.
        
               | tyingq wrote:
               | Yes, that approach makes sense. I thought what I was
               | replying to was suggesting writing b-tree pages
               | themselves, outside of sqlite, for the new data.
        
               | [deleted]
        
             | eismcc wrote:
              | This idea is mentioned as one of the future work items at
              | the bottom of the article.
        
           | bob1029 wrote:
            | Yes. Start from a known-good database and then update or
            | insert the needed deltas. You could even have _more_ than you
            | need in the template file and truncate/drop what is not
            | relevant for the current context. Depending on the situation,
            | it might be faster to start from a database containing all of
            | the possible things.
        
             | gunapologist99 wrote:
             | This makes sense. Just do a copy or copy-on-write to an
             | existing database file; you could even have the actual
             | bytes of an empty (or pre-DDL'ed) sqlite file in memory in
             | your app, rather than needing to do a disk copy.
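              | 
              | For example, a sketch with Python's sqlite3, which exposes
              | SQLite's backup API (Python 3.7+):
              | 
              |     import sqlite3
              | 
              |     # Template kept entirely in memory, pre-DDL'ed...
              |     tmpl = sqlite3.connect(":memory:")
              |     tmpl.execute(
              |         "CREATE TABLE user(area INT, age INT, active INT)")
              | 
              |     # ...cloned into a fresh on-disk database when needed.
              |     dest = sqlite3.connect("fresh.db")
              |     tmpl.backup(dest)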
        
       | perlgeek wrote:
       | I wonder if something like https://paulbradley.org/sqlite-test-
       | data/ could be used to generate the test data in sqlite3
       | directly, and if it would be any faster.
       | 
        | (It would likely be faster for other DB engines, because there is
        | network overhead in the communication between the program and the
        | DB; no such thing for sqlite.)
        
       | chovybizzass wrote:
       | Need it done in Deno, fren.
        
       | kebman wrote:
       | Mini summary: Do not use this in production. xD
        
       | svdr wrote:
       | I thought one billion was 1000M (and not 100M)?
        
         | xeromal wrote:
         | Yeah, that's my interpretation of a billion too. I vaguely
         | recall that India or Britain interprets a billion differently
         | though. Maybe that's what they're thinking?
        
           | mongol wrote:
           | In Swedish, and I think in German and other languages too, 1
           | billion is 1e12. 1e9 is called milliard / miljard in Swedish.
        
             | scns wrote:
             | Yes, in Germany too.
        
               | raarts wrote:
               | And in the Netherlands.
        
               | mobilio wrote:
               | Also in Bulgaria
        
           | atleta wrote:
            | I think it's the same in UK English as well. But in some
            | (most?) European languages billion actually means 1,000,000
            | million (so a thousand times _more_). And we use "milliard"
            | for 1000 million.
        
             | diehunde wrote:
             | https://en.wikipedia.org/wiki/Long_and_short_scale for
             | anyone that's interested in the differences
        
             | xeromal wrote:
             | It looks like you're right. Just looked it up. A billion is
             | 1million^2 in its etymology so English speakers are the odd
             | one out.
        
               | res0nat0r wrote:
               | He hasn't reached the goal yet, and is currently doing
               | 100 million in ~34 seconds.
               | 
               | > Looking forward to discussions and/or collaborations
               | with curious souls in my quest to generate a billion
               | record SQLite DB quickly. If this sounds interesting to
               | you, reach out to me on Twitter or submit a PR.
        
           | motogpjimbo wrote:
           | Just about everyone in the UK defines a billion to be 10^9.
           | I've heard about the 10^12 definition but never encountered
           | anyone who uses it - I think it must have been an older usage
           | that fell out of favour.
        
             | jtvjan wrote:
             | In 1974, the UK government abandoned the long scale
             | (million, milliard, billion, billiard) in favor of the
             | short scale (million, billion, trillion, quadrillion) used
             | in the US. The long scale is still used in languages like
             | Dutch, German, and French.
        
         | kylec wrote:
         | This is an important comment. It's not a typo in the article,
         | even the source code only does 100M rows, not 1B. The author
         | definitely does not hit the target of 1B rows in a minute.
        
         | gunapologist99 wrote:
          | Exactly.
          | 
          |     thousand = 1000
          |     million  = 1000 * thousand (or 1000^2)
          |     billion  = 1000 * million  (or 1000^3)
          |     trillion = 1000 * billion  (or 1000^4)
         | 
         | (not to discount regional differences)
         | 
         | https://www.brainyquote.com/quotes/everett_dirksen_201172
        
         | throwaway210222 wrote:
         | 1,000,000 million. In old money.
        
           | thanksforfish wrote:
           | Ref: https://en.m.wikipedia.org/wiki/Billion
        
         | chrisseaton wrote:
         | The author doesn't say a billion is 100 million. They say
         | they'd like to be able to insert a billion, and say they're
         | able to insert 100 million. It's not a contradiction.
        
           | minitoar wrote:
           | Ok but that is a rather clickbaity title. The title makes it
           | sound like they are successfully doing that.
        
             | infogulch wrote:
             | Perhaps " _Towards_ Inserting 1B Rows in SQLite in Under a
             | Minute " would be a better title.
        
             | avinassh wrote:
                | Hey, sorry for the misleading title. I started with 'Fast
                | SQLite Inserts' and it went through many iterations. With
                | the title, I wanted to convey that my goal is to insert 1
                | billion rows in under a minute on my machine. I thought
                | the current title was fine, since I got LGTM on earlier
                | drafts. The detail about "on my machine" is also
                | important, since mine is a two-year-old laptop and all
                | the measurements are done on it.
                | 
                | Also, I got other feedback that the title should indicate
                | that it is a test database and emphasise that it is not
                | durable.
                | 
                | I am wondering about the right way to convey all of this
                | in the title while also keeping it short.
        
               | newman314 wrote:
               | Attempting 1 Billion Row Inserts in under 1 Minute
        
               | dstroot wrote:
               | Achieving 100m SQLite inserts in under a minute.
        
               | minitoar wrote:
               | Thanks for the explanation. I think it only needs a minor
               | tweak. Maybe prefix with "trying to" or something like
               | that. I am empathetic to the challenge of naming things
               | concisely and accurately.
        
               | scns wrote:
               | I hate hyperbole but i think the title is fine.
        
             | dclowd9901 wrote:
              | I agree. While the title reflects the eventual goal of the
              | effort, the goal has yet to be achieved (and may or may not
              | be achievable at all). I think it's a bit irresponsible to
              | use a title like that for a post that has not achieved what
              | the title describes.
        
               | sbarre wrote:
               | Perhaps "Working towards 1B rows in under a minute" would
               | have been better.
        
           | fridif wrote:
           | "USA LANDS MAN ON MARS BEFORE SOVIET UNION"
           | 
           | "WELL ACTUALLY IT WAS THE MOON BUT YOU GET THE IDEA"
        
             | chrisseaton wrote:
             | There's no need to be snarky. I didn't write the title, I'm
             | just explaining what the author means.
        
               | fridif wrote:
               | I'm just memeing what it sounded like in my head, nothing
               | personal
        
       | mpweiher wrote:
       | Interesting!
       | 
       | I was actually just working on SQLite speed for Objective-S
       | (http://objective.st), partly as a driver for getting some of the
       | more glaring inefficiencies out.
       | 
       | Using a "to do list" schema, I currently get the 100M rows out in
       | 56 seconds, which is around half the speed of the Rust example
       | given here, but 3 times the speed of PyPy and almost 10x faster
       | than Python.
       | 
       | This is from an interpreted script that not only does the inserts
       | and creates the objects to insert in the first place, but also
       | defines the actual class.
       | 
       | The lower-level code is written in Objective-C, like the rest of
       | Objective-S.
       | 
        | Class definition:
        | 
        |     class Task {
        |         var <int> id.
        |         var <bool> done.
        |         var <NSString> title.
        |         -description { "<Task: title: {this:title} done: {this:done}>". }
        |         +sqlForCreate {
        |             '( [id] INTEGER PRIMARY KEY, [title] NVARCHAR(220) NOT NULL, [done] INTEGER );'.
        |         }
        |     }.
        | 
        | Code to insert a computed array of tasks 10 times:
        | 
        |     1 to:10 do: {
        |         this:tasksTable insert:taskList.
        |     }.
        
       | polyrand wrote:
        | Really good article, I've experimented with this kind of workload
        | in SQLite a few times and found it insightful. One note though,
        | using:
        | 
        |     pragma temp_store = memory;
        | 
        | only affects temporary tables and indices, not the main database
        | itself. [0]
        | 
        | [0] https://sqlite.org/pragma.html#pragma_temp_store
        
       | mimir wrote:
       | Database optimization posts are always interesting, but it's
       | really hard to do any apples to apples comparison. Your
       | performance is going to depend mostly on your hardware, internal
       | database settings and tunings, and OS level tunings. I'm glad
       | this one included some insight into the SQLite settings disabled,
       | but there's always going to be too many factors to easily compare
       | this to your own setup.
       | 
        | For most SQL systems, the fastest way to do inserts is always
        | just going to be batched inserts. There are maybe some extra
        | tricks to reduce network costs/optimize batches [0], but at its
        | core you are still essentially inserting into the table through
        | the normal insert path. You can basically then only try to reduce
        | the amount of work done on the DB side per insert, or optimize
        | your OS for your workload.
       | 
       | Some other DB systems (more common in NoSQL) let you actually do
       | real bulk loads [1] where you are writing direct(ish) database
       | files and actually bypassing much of the normal write path.
       | 
       | [0] https://dev.mysql.com/doc/refman/5.7/en/insert-
       | optimization.... [1] https://blog.cloudera.com/how-to-use-hbase-
       | bulk-loading-and-...
        
         | vajrabum wrote:
          | Oracle, DB2, MySQL, SQL Server, and PostgreSQL all support bulk
          | insert. Two obvious use cases are QA databases and the L part
          | of ETL, which pretty much require it.
        
       | MobiusHorizons wrote:
       | I would recommend looking into dtrace or other observability
       | tools to figure out what to optimize next. So far you have
       | basically had to guess where the slowness is and optimize that.
       | 
        | Lastly, how long does it take to make a copy of an already
        | prepared 1B-row SQLite file? That seems easier than generating a
        | new one.
        
       | Seattle3503 wrote:
       | I've used the other pragma, but does
       | 
        |     PRAGMA locking_mode = EXCLUSIVE;
       | 
       | improve performance much?
        
       | ajnin wrote:
        | Here's my solution: generate the 1B-record .sqlite file once
        | beforehand, then before running each test, copy it over the test
        | DB file ;)
        
       | mirekrusin wrote:
        | If it's a single, in-memory table, is there really a need to use
        | a database? Won't language-provided data structures suffice?
        
         | jonnycomputer wrote:
         | It is a good question, but I'm guessing that the whole reason
         | for creating the database in the first place is to test
         | interactions with the database, in which case you'd have to
         | mock up all the interactions being tested (selects, deletes,
         | joins, etc.).
        
         | einpoklum wrote:
         | Yes, if you want to run non-trivial queries on that data.
         | 
         | Although, frankly, SQLite would not be my choice.
        
           | adamnemecek wrote:
           | What would be your choice?
        
             | einpoklum wrote:
              | If you don't do any joins, ClickHouse is worth a try.
             | Otherwise, MonetDB (or even MonetDB embedded) if you want
             | FOSS. Commercial offerings - Actian Vector, or maybe
             | HyperDB. The big corporate ones are probably pretty good,
             | but it's always difficult to tell because there's this
             | taboo on any proper benchmarking against them.
             | 
             | If you just want to browse what other people are using, and
             | not go by the recommendations of a random commenter, try:
             | 
             | https://db-engines.com/en/ranking
             | 
             | but note that's a joint ranking both for transaction-
             | focused and analytics-focused DBMSes.
             | 
             | If it's time series data there are some more specialized
             | offerings and I'm (even) less of an expert there.
        
           | xxs wrote:
            | I can't think of a single case where an in-memory database is
            | a good option, aside from playing with SQL.
            | 
            | For whatever queries you might have, a hashmap/tree/skiplist,
            | etc. would be a lot better.
        
             | tgv wrote:
             | For something like redis, but with more structure.
             | 
             | But a pretty good use case (IMO) is testing. If you want to
             | do an integration test with an SQL database, and you want
             | to test large numbers, this might be a good fully
             | functional stub to run locally.
        
               | xxs wrote:
               | >For something like redis,
               | 
                | Redis is in pretty much the same category. Testing is
                | sort of a valid case, if you are committed to writing
                | pure SQL with minimal use of any dialect specifics (but
                | even 'create table' syntax might be different). There is
                | close to no replacement for running on the real thing
                | when it comes to databases.
                | 
                | Many databases have docker images nowadays, so it isn't
                | hard to run them locally. Likely at some point you'd want
                | to optimize the SQL itself, rendering the tests
                | incompatible.
        
             | manigandham wrote:
             | What do you think a relational database does to provide you
             | the querying flexibility? It builds the same exact data
             | structures, but does it automatically and dynamically,
             | while offering much more usability.
             | 
             | Most attempts to query using raw data structures just means
             | you end up rebuilding a (very poor) relational database
             | with none of the features.
        
             | adrianN wrote:
             | If the set of queries you want to run is fixed you can
             | probably beat SQLite with a lot of work, if the queries are
             | dynamic I doubt that you can do much better than SQLite
             | without herculean efforts. A in-memory database is thus a
             | good option if you either don't care too much about the
             | runtime of your queries and you want to save a bunch of
             | development time, or if you don't know enough about your
             | workload to be able to beat a general purpose solution.
        
               | xxs wrote:
               | >you can =probably= beat SQLite with =a lot of= work,
               | 
                | Actually, I am beyond certain. When it's all about
                | memory, no database comes even remotely close to properly
                | picked data structures + object layout.
                | 
                | If I need a transaction log + persistence, databases have
                | a decent application.
                | 
                | In more than 20 years, I have never had a case of: yay, I
                | can use relational structures in memory because I don't
                | know what I am going to do with the data.
        
             | einpoklum wrote:
              | It's very likely it wouldn't. (Decent) analytical DBMSes
              | have a whole lot up their sleeves which just choosing one
              | plain-vanilla data structure for your table doesn't even
              | approach in terms of performance. The benefit may be well
              | upwards of 10x in many real-world scenarios.
              | 
              | Of course, if you get one single query which you know in
              | advance, carefully build a data structure to cater just to
              | that, and you know your data structures beyond just the
              | basics - then, yes, a DBMS would be overkill. But it still
              | won't be a walk in the park.
        
             | [deleted]
        
             | faizshah wrote:
              | Sqlite provides a full text search module (FTS5) which
              | provides a lot of nice search features. I have used this
              | numerous times to build serverless APIs for large static
              | datasets for frontend analytical experiences like data
              | tables, without setting up elasticsearch. That's one use
              | case.
              | 
              | Another is that they support the closures.c extension,
              | which is very nice for rapid queries on tree-structured
              | data locally in memory. The JSON1 extension is also nice
              | for rapidly querying/reshaping deeply nested json data.
              | There's also spellfix1, which can provide fuzzy-matching
              | capabilities. If you need any of these with low latency
              | and in memory, it's a great choice.
              | 
              | Sqlite is great for rapidly building low latency static
              | data serving services for frontend experiences. Something
              | I'm exploring now is combining sqlite with streamlit to
              | rapidly build data exploration UIs.
              | 
              | Like how many times have you wanted to quickly add fuzzy
              | matching or full text search to some program? You use
              | fuzzywuzzy but it's pretty slow; sqlite provides performant
              | implementations of this stuff that are super simple to set
              | up.
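              | 
              | A tiny FTS5 sketch in Python (assuming the bundled sqlite3
              | was built with FTS5, which it usually is):
              | 
              |     import sqlite3
              | 
              |     conn = sqlite3.connect(":memory:")
              |     conn.executescript("""
              |         CREATE VIRTUAL TABLE docs USING fts5(title, body);
              |         INSERT INTO docs VALUES
              |           ('sqlite inserts', 'batched statements, one txn'),
              |           ('csv import', 'loading a large csv quickly');
              |     """)
              |     hits = conn.execute(
              |         "SELECT title FROM docs WHERE docs MATCH 'batched'"
              |     ).fetchall()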
        
             | _delirium wrote:
             | For the case where I want the ability to run one-off
             | analytics queries, having access to the usual set of
             | join/filter/summarize/etc. operations is a lot more
             | convenient and less error-prone than having to manually
             | write them. But dplyr [1] is my go-to rather than SQLite
             | personally for in-memory data sizes.
             | 
             | [1] https://dplyr.tidyverse.org/
        
         | lsuresh wrote:
         | There are some domains like building cluster managers where
         | having a relational view of the cluster state and being able to
         | query/manipulate it declaratively has significant perks. See
         | for example:
         | https://www.usenix.org/system/files/osdi20-suresh.pdf
         | (disclaimer: I'm an author)
        
       | pachico wrote:
        | I've been dealing with lots of data and SQLite and my takeaways
        | are:
        | 
        | - the language has very little to do with it, since the
        | bottleneck will most likely be the way you insert data
        | 
        | - indeed, prepared statements are useful, but the performance
        | didn't change much once I used long transactions and committed
        | every few thousand rows
        | 
        | - having lots of rows in your table is good, but certain queries,
        | like aggregation over many rows, are not what SQLite is great at
        | 
        | - ClickHouse can easily ingest that and more on a laptop without
        | even any scripting language.
        
       | bilekas wrote:
       | > Threads / async may not be faster always
       | 
       | This is actually a little bit mind blowing for me. I'm gonna go
       | and play with this. But what a cool read!
        
       | elephantum wrote:
        | I bet that this is mostly a random.randint benchmark and not
        | SQLite.
        | 
        | Also interesting whether a batched numpy version would compare
        | better to Rust.
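        | 
        | A rough sketch of what a batched numpy version might look like
        | (numpy assumed; generation is vectorised, the inserts still go
        | through executemany):
        | 
        |     import sqlite3
        |     import numpy as np
        | 
        |     n = 1_000_000
        |     rows = zip(np.random.randint(0, 1_000_000, n).tolist(),
        |                np.random.choice([5, 10, 15], n).tolist(),
        |                np.random.randint(0, 2, n).tolist())
        | 
        |     conn = sqlite3.connect("test.db")
        |     conn.execute(
        |         "CREATE TABLE IF NOT EXISTS user(area INT, age INT, active INT)")
        |     with conn:
        |         conn.executemany("INSERT INTO user VALUES (?, ?, ?)", rows)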
        
         | elephantum wrote:
         | Correction: random.randint + string interpolation
        
       | IfOnlyYouKnew wrote:
       | I ran this and got the following results:
        | 
        |     /fast-sqlite3-inserts (master)> time make busy-rust
        |     Sun Jul 18 17:04:59 UTC 2021 [RUST] busy.rs (100_000_000) iterations
        | 
        |     real    0m9.816s
        |     user    0m9.380s
        |     sys     0m0.433s
        |     ________________________________________________________
        |     Executed in    9.92 secs    fish    external
        |     usr time    9.43 secs    0.20 millis    9.43 secs
        |     sys time    0.47 secs    1.07 millis    0.47 secs
        | 
        |     fast-sqlite3-inserts (master)> time make busy-rust-thread
        |     Sun Jul 18 17:04:48 UTC 2021 [RUST] threaded_busy.rs (100_000_000) iterations
        | 
        |     real    0m2.104s
        |     user    0m13.640s
        |     sys     0m0.724s
        |     ________________________________________________________
        |     Executed in    2.33 secs    fish    external
        |     usr time   13.68 secs    0.20 millis   13.68 secs
        |     sys time    0.78 secs    1.18 millis    0.78 secs
       | 
       | I'm probably doing something wrong. Or I'm getting the pace
       | needed for the billion?
       | 
       | This is on a M1 MacBook Air.
        
       | rmetzler wrote:
       | Is PRAGMA temp_store = MEMORY the same as putting the file on a
       | TempFS?
        
         | elteto wrote:
         | Maybe, maybe not. I imagine writing directly to memory will
         | always be faster than writing to a filesystem, even if it's
         | ultimately backed by RAM. Might be a negligible difference
         | though.
        
       ___________________________________________________________________
       (page generated 2021-07-18 23:00 UTC)