[HN Gopher] Inserting One Billion Rows in SQLite Under a Minute
___________________________________________________________________
Inserting One Billion Rows in SQLite Under a Minute
Author : todsacerdoti
Score : 397 points
Date : 2021-07-18 13:01 UTC (9 hours ago)
(HTM) web link (avi.im)
(TXT) w3m dump (avi.im)
| einpoklum wrote:
| > Recently, I ran into a situation where I needed a test database
| with lots of rows and needed it fast. So I did what any
| programmer would do: wrote a Python script to generate the DB.
|
| The author must have a rather low opinion of programmers...
| generating data with a script is good for 1,000 rows. Maybe
| 10,000 rows. 100,000 is probably stretching it. Beyond that the
| slowness will be meaningful.
|
| Anyway, if the "test database" data is given in advance, you want
| an in-memory DBMS which can load data from files. If it isn't -
| use an in-memory DBMS which supports generated columns, or apply
| your random-number-generating function on a trivial column.
|
| (MonetDB standalone/embedded would be a decent option; maybe the
| columnar storage mechanism in MariaDB? And of course there are
| commercial offerings.)
|
| > Unfortunately, it was slow. Really slow. So I did what any
| programmer would do: went down the rabbit hole of learning more
| about SQLite...
|
| That's the wrong approach. SQLite isn't called "lite" for
| nothing. It's not what you would use for bulk operations on large
| tables.
|
| Read the SQLite FAQ: https://www.sqlite.org/whentouse.html
|
  | It oversells itself a bit, but still says its (only) advantages
| for data analysis are: "easier to install and use and the
| resulting database is a single file that can be written to a USB
| memory stick or emailed."
| geofft wrote:
| I suspect these opinions would be less likely to get downvoted
| if you could back them up with data, specifically:
|
| - Assuming you do have a test database in advance that's
| represented in a file, but for some reason not a SQLite file or
| other database, how long does it take to load 100M rows in a
| database of your choice, and how does that compare to the
| author's results?
|
| - The author does not in fact have a test database in advance,
    | and is generating random data, which I _think_ is what you're
| saying by "apply your random-number-generating function on a
| trivial column" - how long would your approach take to generate
| 100M rows?
|
| - If you generated a database of about the same size in a
| database of your choice, how do bulk operations on large tables
| perform in comparison to SQLite? (My understanding is that it's
| "lite" because it's not client/server, not because its
| performance is poor.)
|
| I don't think any of these experiments would be terribly hard
| to run.
| siscia wrote:
| Creator of RediSQL / zeeSQL (https://zeesql.com/)
|
  | Insertion performance on a single table is very, very hard to
  | optimize.
  |
  | A single process looping over the inserts is your best bet.
  |
  | I would just increase the batch size, which is the most
  | influential factor.
|
  | Then another point... When you do batches, you do:
  |
  |     BEGIN TRANSACTION;
  |     for i in range(1, 50):
  |         execute_stmt
  |     COMMIT;
|
| You do not create a long list of parameters.
|
| https://github.com/avinassh/fast-sqlite3-inserts/blob/master...
|
| ;)
| avinassh wrote:
| That's a great point, let me try it!
|
| > You do not create a long list of parameters.
|
| I have done much worse by trying to insert a really long string
| of 100K params
| siscia wrote:
| Even though merging strings seems faster...
| [deleted]
| avinassh wrote:
  | Hey all, author here. I didn't expect this to reach the front
  | page of HN. I came here to submit it and it was here already! I
  | am looking for more ideas to experiment with. Here's one idea
  | someone from another forum gave me: exploring recursive queries
  | and using SQLite's random functions to do the insertions.
|
| Another crazy idea is to learn about SQLite file format and just
| write the pages to disk.
| Twirrim wrote:
| I gave the suggestion on Twitter, but probably worth mentioning
| here too as people here might not be aware of it.
|
| https://www.flamingspork.com/projects/libeatmydata/
|
    | libeatmydata shouldn't be used in production environments,
    | generally speaking, as it biases for speed over safety (lib-
    | eat-my-data) by disabling fsync and associated calls for the
    | process running under it. Disabling those calls results in
    | less I/O pressure, but comes with the risk that the program
    | thinks it has written safely and durably when that may not be
    | true. It essentially stops programs that are written to be
    | crash proof from actually being crash proof, which under the
    | circumstances here you almost certainly don't care about.
|
| I've used this when reloading database replicas from a dump
| from master before, as it drastically speeds up operations
| there.
| the8472 wrote:
| Overlayfs offers libeatmydata functionality in recent linux
| kernels without having to intercept libc, so it should work
| for things that use raw syscalls too.
| nh2 wrote:
        | Aside: SQLite has native functionality to do what
        | libeatmydata does (disable fsync):
        |
        |     PRAGMA synchronous = off;
|
| https://www.sqlite.org/pragma.html#pragma_synchronous
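        |
        | A quick Python sketch of these durability trade-off
        | pragmas (only sensible for throwaway test databases):
        |
        |     import sqlite3
        |
        |     conn = sqlite3.connect("test.db")
        |     # trade durability for speed: no fsync, no journal
        |     conn.execute("PRAGMA synchronous = OFF")
        |     conn.execute("PRAGMA journal_mode = OFF")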
| anigbrowl wrote:
| It's a great article! Always nice to see clear well-written
| guides that save people time from reinventing the wheel.
| justinclift wrote:
| You mention wanting to do a Go version. Not sure if it's
| useful, but this is a SQLite "bulk data generation" util I
| threw together ages ago in Go:
|
| https://github.com/sqlitebrowser/sqlitedatagen
|
| There's some initial work to parallelise it with goroutines
| here:
|
| https://github.com/sqlitebrowser/sqlitedatagen/blob/multi_ro...
|
| Didn't go very far down that track though, as the mostly single
| threaded nature of writes in SQLite seemed to prevent that from
| really achieving much. Well, I _think_ that's why it didn't
| really help. ;)
| mathnode wrote:
| Very nice! I shall give that a try. I always get excited by
| SQLite and go projects.
|
      | I maintain a similar Go tool for work, which I use to stuff
      | around 1TB into MariaDB at a time:
|
| https://github.com/rcbensley/brimming
| mst wrote:
    | I'm curious how you'd fare if you pregenerated the data, with
    | periodic COMMITs, and then just slung it at the sqlite3 binary
    | over STDIN to see how it handled that.
|
| Certainly if I was trying to do the same thing in Pg my first
| thought would be "batched COPY commands".
| [deleted]
| eatonphil wrote:
| I am also interested in writing the SQLite or PostgreSQL file
| format straight to disk as a faster way to do ETL. I'd be
| curious to hear if anyone has actually tried this. I'd be as
| curious if you end up trying this too.
| avinassh wrote:
      | Yes, I plan to try this next and measure the performance.
      | Others have also suggested a similar idea.
|
| > I am also interested in writing the SQLite or PostgreSQL
| file format straight to disk as a faster way to do ETL.
|
      | what exactly are you trying to do here?
| [deleted]
| tyingq wrote:
      | SQLite has a btree module/API, though there are a lot of
      | "TODO:" notes in the document:
| https://sqlite.org/btreemodule.html
| avinassh wrote:
        | TIL, this is great! I think this could make things even
        | easier than writing the raw pages directly.
| oldtablet wrote:
| I really like sqlite but I'm curious how fast it would be to
| write the data to parquet files and read it with duckdb. I've
| done something similar before and in my case duckdb was
| faster than sqlite.
| Fiahil wrote:
| I did similar Rust/Python record generation experiments, and I
| can relate to these numbers.
|
    | However, I think your Rust threaded trial might be a little bit
    | off in subtle ways. I would expect it to perform several times
    | better than single-threaded Rust and async Rust (async is
    | generally slower on these workloads, but still faster than
    | Python).
|
    | Edit: After reading your Rust code, you might have room for
    | some improvements:
|
| - don't rely on random, simply cycle through your values
|
    | - pre-allocate your vecs with with_capacity()
|
    | - don't use channels, prefer deques..
|
| - ..or even better, don't use synchronization primitives and
| open one connection per thread (not sure if it will work with
| sqlite?)
| mongol wrote:
| 1 billion rows, 8GB RAM. I think there will be some limitation
| from available RAM as well.
| asicsp wrote:
| > _The larger DB required more than 30 minutes to generate. So I
| spent 30+ hours to reduce the 30 mins of running time :p_
|
| Worth it.
|
| > _I could just run s the script again._
|
| Found a typo.
| ericlewis wrote:
| Cool
| pksebben wrote:
| I'm pretty new to this sort of optimization stuff, so forgive me
| if this is ridiculous for any reason, but I'm wondering about two
| things reading this:
|
  | 1. is it useful / feasible to set a sort of "lower bound" for
  | these optimizations by profiling the raw write time on a set of
  | a certain size?
  |
  | 2. assuming that writes are ultimately the limiting factor,
  | could you gain performance by calculating the batch size at
  | which write time intersects in-memory generation time, and, as
  | soon as you hit that size, popping a thread to write that batch
  | to disk, then returning some async signal when the write is done
  | to trigger the next batch of writes? is it possible to "stitch
  | together" these batches post-write?
|
| edit: mobile formatting is le suck
| Groxx wrote:
| Yep - I discovered most of this a while back when I needed to do
| a bunch of exploring of ~200m rows of data (like 50GB). I chucked
| it into SQLite and was done with everything in a day, it was
| absolutely wonderful. This post is a pretty good summary of what
| to change and why.
|
| Prepared statements should be considered the norm for SQLite -
| they have pretty major performance benefits, and any decent ORM
| or query engine can probably do it for you implicitly or with a
| single flag, so it's practically free for many applications. I'm
| always surprised how often I see applications or benchmarks or
| "how to be fast with SQLite" not using them, so I'm definitely
| glad to see them covered here :)
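  |
  | In Python's sqlite3, for instance, statement reuse largely comes
  | for free via executemany and the module's statement cache; a
  | rough sketch (the table and rows are made up):
  |
  |     import sqlite3
  |
  |     conn = sqlite3.connect("test.db")
  |     conn.execute("CREATE TABLE IF NOT EXISTS user"
  |                  " (area TEXT, age INTEGER, active INTEGER)")
  |
  |     rows = [("123456", 5, 1), ("654321", 10, 0)]
  |
  |     # one prepared statement, bound repeatedly for every row,
  |     # inside a single implicit transaction
  |     with conn:
  |         conn.executemany(
  |             "INSERT INTO user (area, age, active)"
  |             " VALUES (?, ?, ?)", rows)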
| ThePadawan wrote:
| There are some cool ideas in there!
|
| I wonder how much I could speed up my test suites (that don't
| rely on transaction rollbacks) by disabling journalling. Those
  | milliseconds per query add up to seconds per test and minutes
  | per PR.
| jonnycomputer wrote:
| >The generated data would be random with following constraints:
| The area column would hold six digits area code (any six digits
| would do, no validation). The age would be any of 5, 10, or 15.
| The active column is either 0 or 1.
|
| That is about 6 million combinations. This is not such a big
| space that it would be impractical to precompute all possible
  | combinations. I wonder if hashing into such a precomputed table
  | might help. It might not, if the overhead of maintaining the
  | lookup table is too high.
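  |
  | A rough Python sketch of that precomputation (the full lookup
  | table costs a few hundred MB of RAM, which may or may not be
  | acceptable):
  |
  |     import itertools
  |     import random
  |
  |     # all valid (area, age, active) combinations:
  |     # 1,000,000 * 3 * 2 = 6,000,000 rows
  |     combos = list(itertools.product(
  |         ["{:06d}".format(i) for i in range(1_000_000)],
  |         (5, 10, 15),
  |         (0, 1)))
  |
  |     # picking a row is now one random index, not three draws
  |     row = combos[random.randrange(len(combos))]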
| nzealand wrote:
| If you insert 4 rows into a table, and join it on itself without
| any conditional criteria, it will result in a cartesian product
| of 16 rows (Select * from a, a)
|
| Do that cartesian join three more times, and you have over 4
| billion results.
|
| Then you simply need to use the random function in conjunction
| with division, rounding and case statements to get the desired
| random numbers.
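  |
  | A sketch of that approach, run from Python here; it assumes the
  | article's user table already exists, and 16 copies of the 4-row
  | table give 4^16 (about 4.3 billion) candidate rows:
  |
  |     import sqlite3
  |
  |     conn = sqlite3.connect("test.db")
  |     conn.executescript("""
  |     CREATE TABLE IF NOT EXISTS seed(x INTEGER);
  |     INSERT INTO seed VALUES (0), (1), (2), (3);  -- 4 rows
  |
  |     INSERT INTO user (area, age, active)
  |     SELECT abs(random()) % 1000000,
  |            (abs(random()) % 3 + 1) * 5,
  |            abs(random()) % 2
  |     FROM seed a1, seed a2, seed a3, seed a4,
  |          seed a5, seed a6, seed a7, seed a8,
  |          seed a9, seed a10, seed a11, seed a12,
  |          seed a13, seed a14, seed a15, seed a16
  |     LIMIT 1000000000;
  |     """)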
| notatelloshill wrote:
  | an in-memory database is not a database ... you might as well
  | just create a billion-object array and keep your Java process
  | constantly running to have that data accessible.
| orf wrote:
| How long does it take when using the native CSV import features?
|
| https://www.sqlite.org/csv.html
| motoboi wrote:
| It's fast. But then we need a "1 billion CSV lines" article.
| orf wrote:
| Which I expect is significantly faster to generate. You could
| probably combine it with a pipe to avoid all disk IO.
| tanin wrote:
| I built https://superintendent.app and experimented a lot with
| this feature.
|
    | I can import a 1GB CSV file in 10 seconds on my MacBook. This
    | queries from a virtual table and puts it into an actual table.
| roman-holovin wrote:
  |     INSERT INTO user (area, age, active)
  |     SELECT abs(random()) % 1000000,
  |            (abs(random()) % 3 + 1) * 5,
  |            abs(random()) % 2
  |     FROM generate_series(1, 100000000, 1)
  |
  | About 10% faster than the author's fastest implementation on my
  | machine: 19 seconds against 21 for 'threaded_batched'.
| gfodor wrote:
| I have to assume this was rejected as a valid approach, but if
| not, this whole thread gets 10x more interesting than it
| already was. I hope the author responds.
| roman-holovin wrote:
      | Well, even if you just insert zeros instead of random values,
      | it takes 9 seconds on my computer to insert 100M rows, so
      | even that is not 1B rows per minute.
      |
      | And I think INSERT INTO ... SELECT is the fastest way to bulk
      | insert data into sqlite.
      |
      | Also, I have tried to use the carray sqlite feature, which
      | allows sharing memory with sqlite, and a recursive CTE to
      | query it, but it is slower. Though, you can pass values
      | you've generated from Rust instead of using random().
| arusahni wrote:
| [ @dang, please delete if/when you get the chance, thank you ]
| kangalioo wrote:
| Did you mean to post this to
| https://news.ycombinator.com/item?id=27872596 ?
| arusahni wrote:
| Yes. Not sure how I got mixed up.
| lmz wrote:
| I wonder if defining a virtual table
| (https://www.sqlite.org/vtab.html) and just doing
|
| INSERT INTO t (...) SELECT ... from virtual_table
|
| would be any faster.
| rubyfan wrote:
    | From my armchair here I thought something similar: I think I
    | heard 1) that inserting into the same table has some speed
    | limitations and 2) that one of the author's ideas was to spin
    | up 4 processes...
    |
    | I wonder if you could insert into 10 different tables from 10
    | threads or processes, then
    |
    |     insert into single_table (...) select (...) union.
    |
    | No idea if insert-select does the trick or not, but you're
    | almost at the point of partitioning here. If the application
    | called for it you could do some sort of poor man's partition
    | in order to write all the data, and then push some complexity
    | into the read.
    |
    | I also wondered what impact, if any, the ID primary key had on
    | inserts at this level of frequency.
|
| /armchair
| bob1029 wrote:
| You can go even faster if you can organize your problem such that
| you simply start from a copy of a template SQLite database each
| time (it is just a file/byte sequence).
|
| We do SQL evaluation for a lot of business logic throughout our
| product, and we have found that starting from a template database
| (i.e. one with the schema predefined and canonical values
| populated) can save a lot of time when working in tight loops.
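  |
  | A small Python sketch of that template approach (the file names
  | and schema here are made up):
  |
  |     import shutil
  |     import sqlite3
  |
  |     # build the template once: schema plus canonical rows
  |     tmpl = sqlite3.connect("template.db")
  |     tmpl.execute("CREATE TABLE IF NOT EXISTS user"
  |                  " (area TEXT, age INTEGER, active INTEGER)")
  |     tmpl.commit()
  |     tmpl.close()
  |
  |     # each run starts from a cheap file copy instead of
  |     # re-running the DDL and canonical inserts
  |     shutil.copyfile("template.db", "run.db")
  |     conn = sqlite3.connect("run.db")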
| jonnycomputer wrote:
    | Are you suggesting having a template and then updating rows
    | with random values where necessary?
| nezirus wrote:
      | You just do raw byte copies from a sample DB, no SQL or
      | "inserts" or anything similar. Imagine the test database
      | consists of 3 parts (all raw bytes):
      |
      |     ### PROLOGUE
      |     ### Sample row
      |     ### EPILOGUE
      |
      | You copy & write the prologue, then write 1B sample rows
      | (you can optimize this at will, large writes, etc.)
      |
      | Copy & write the epilogue and fsync the data. You probably
      | need to modify some metadata, but that should be a few
      | writes at most.
      |
      | That should be as good as it gets, provided your IO is
      | optimal.
| tyingq wrote:
| Do you mean crafting all the various database page btree
| structures and entries yourself? I'd be concerned about
| subtle bugs.
| [deleted]
| ndepoel wrote:
| An SQLite database is just a file. You can build the
| empty database with schema and base values ahead of time,
| save it to a file (or an in-memory byte buffer) and then
| every time you want to create a new database, you just
| copy that file. No need to do any expensive
| initialization queries that way. If raw high-speed
| throughput is needed, skipping that step can make a
| significant difference.
| tyingq wrote:
| Yes, that approach makes sense. I thought what I was
| replying to was suggesting writing b-tree pages
| themselves, outside of sqlite, for the new data.
| [deleted]
| eismcc wrote:
      | This idea is mentioned as one of the future work items at
      | the bottom.
| bob1029 wrote:
      | Yes. Start from a known-good database and then update or
      | insert the needed deltas. You could even have _more_ than
      | you need in the template file and truncate/drop what is not
      | relevant for the current context. Depending on the
      | situation, it might be faster to start from a database
      | containing all of the possible things.
| gunapologist99 wrote:
        | This makes sense. Just do a copy or copy-on-write of an
        | existing database file; you could even keep the actual
        | bytes of an empty (or pre-DDL'ed) sqlite file in memory in
        | your app, rather than needing to do a disk copy.
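        |
        | A rough Python sketch of that in-memory variant
        | (Connection.serialize()/deserialize() need Python 3.11+
        | and a SQLite built with the deserialize API):
        |
        |     import sqlite3
        |
        |     # build the pre-DDL'ed DB once, keep its raw bytes
        |     tmpl = sqlite3.connect(":memory:")
        |     tmpl.execute("CREATE TABLE user (area TEXT,"
        |                  " age INTEGER, active INTEGER)")
        |     tmpl.commit()
        |     template_bytes = tmpl.serialize()
        |
        |     # stamp out fresh databases from those bytes,
        |     # no disk copy involved
        |     conn = sqlite3.connect(":memory:")
        |     conn.deserialize(template_bytes)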
| perlgeek wrote:
| I wonder if something like https://paulbradley.org/sqlite-test-
| data/ could be used to generate the test data in sqlite3
| directly, and if it would be any faster.
|
| (It would likely be faster for other DB engines, because there is
| network overhead in the communication between the program and the
  | DB; no such thing for sqlite.)
| chovybizzass wrote:
| Need it done in Deno, fren.
| kebman wrote:
| Mini summary: Do not use this in production. xD
| svdr wrote:
| I thought one billion was 1000M (and not 100M)?
| xeromal wrote:
| Yeah, that's my interpretation of a billion too. I vaguely
| recall that India or Britain interprets a billion differently
| though. Maybe that's what they're thinking?
| mongol wrote:
| In Swedish, and I think in German and other languages too, 1
| billion is 1e12. 1e9 is called milliard / miljard in Swedish.
| scns wrote:
| Yes, in Germany too.
| raarts wrote:
| And in the Netherlands.
| mobilio wrote:
| Also in Bulgaria
| atleta wrote:
| I think it's the same in UK English as well. But in some
| (most?) European languages billion actually means 1 000 000
      | million (so a thousand times _more_). And we use "milliard"
| for 1000 million.
| diehunde wrote:
| https://en.wikipedia.org/wiki/Long_and_short_scale for
| anyone that's interested in the differences
| xeromal wrote:
        | It looks like you're right. Just looked it up. A billion is
        | 1 million^2 in its etymology, so English speakers are the
        | odd ones out.
| res0nat0r wrote:
| He hasn't reached the goal yet, and is currently doing
| 100 million in ~34 seconds.
|
| > Looking forward to discussions and/or collaborations
| with curious souls in my quest to generate a billion
| record SQLite DB quickly. If this sounds interesting to
| you, reach out to me on Twitter or submit a PR.
| motogpjimbo wrote:
| Just about everyone in the UK defines a billion to be 10^9.
| I've heard about the 10^12 definition but never encountered
| anyone who uses it - I think it must have been an older usage
| that fell out of favour.
| jtvjan wrote:
| In 1974, the UK government abandoned the long scale
| (million, milliard, billion, billiard) in favor of the
| short scale (million, billion, trillion, quadrillion) used
| in the US. The long scale is still used in languages like
| Dutch, German, and French.
| kylec wrote:
    | This is an important comment. It's not a typo in the article;
    | even the source code only does 100M rows, not 1B. The author
    | definitely does not hit the target of 1B rows in a minute.
| gunapologist99 wrote:
      | Exactly.
      |
      |     thousand = 1000
      |     million  = 1000 * thousand (or 1000^2)
      |     billion  = 1000 * million  (or 1000^3)
      |     trillion = 1000 * billion  (or 1000^4)
|
| (not to discount regional differences)
|
| https://www.brainyquote.com/quotes/everett_dirksen_201172
| throwaway210222 wrote:
| 1,000,000 million. In old money.
| thanksforfish wrote:
| Ref: https://en.m.wikipedia.org/wiki/Billion
| chrisseaton wrote:
| The author doesn't say a billion is 100 million. They say
| they'd like to be able to insert a billion, and say they're
| able to insert 100 million. It's not a contradiction.
| minitoar wrote:
| Ok but that is a rather clickbaity title. The title makes it
| sound like they are successfully doing that.
| infogulch wrote:
| Perhaps " _Towards_ Inserting 1B Rows in SQLite in Under a
| Minute " would be a better title.
| avinassh wrote:
          | Hey, sorry for the misleading title. I started with
          | 'Fast SQLite Inserts' and it went through many
          | iterations. With the title, I wanted to convey that I
          | want to insert 1 billion rows in under a minute on my
          | machine. I thought the current title was fine, since I
          | got LGTM on earlier drafts. The 'on my machine' detail
          | is also important, since mine is a two-year-old laptop
          | and all the measurements were done on it.
          |
          | I also got feedback that the title should indicate that
          | it is a test database and emphasise that it is not
          | durable.
          |
          | I am wondering how to convey all of this in the title
          | while also keeping it short.
| newman314 wrote:
| Attempting 1 Billion Row Inserts in under 1 Minute
| dstroot wrote:
| Achieving 100m SQLite inserts in under a minute.
| minitoar wrote:
| Thanks for the explanation. I think it only needs a minor
| tweak. Maybe prefix with "trying to" or something like
| that. I am empathetic to the challenge of naming things
| concisely and accurately.
| scns wrote:
            | I hate hyperbole but I think the title is fine.
| dclowd9901 wrote:
| I agree. While the title reflects the eventual goal of the
| effort, the goal has yet to be achieved (and may or may not
        | be achievable at all). I think it's a bit irresponsible to
        | use a title like that for a post that hasn't achieved what
        | the title describes.
| sbarre wrote:
| Perhaps "Working towards 1B rows in under a minute" would
| have been better.
| fridif wrote:
| "USA LANDS MAN ON MARS BEFORE SOVIET UNION"
|
| "WELL ACTUALLY IT WAS THE MOON BUT YOU GET THE IDEA"
| chrisseaton wrote:
| There's no need to be snarky. I didn't write the title, I'm
| just explaining what the author means.
| fridif wrote:
| I'm just memeing what it sounded like in my head, nothing
| personal
| mpweiher wrote:
| Interesting!
|
| I was actually just working on SQLite speed for Objective-S
| (http://objective.st), partly as a driver for getting some of the
| more glaring inefficiencies out.
|
| Using a "to do list" schema, I currently get the 100M rows out in
| 56 seconds, which is around half the speed of the Rust example
| given here, but 3 times the speed of PyPy and almost 10x faster
| than Python.
|
| This is from an interpreted script that not only does the inserts
| and creates the objects to insert in the first place, but also
| defines the actual class.
|
| The lower-level code is written in Objective-C, like the rest of
| Objective-S.
|
  | Class definition:
  |
  |     class Task {
  |         var <int> id.
  |         var <bool> done.
  |         var <NSString> title.
  |         -description {
  |             "<Task: title: {this:title} done: {this:done}>".
  |         }
  |         +sqlForCreate {
  |             '( [id] INTEGER PRIMARY KEY, [title] NVARCHAR(220) NOT NULL, [done] INTEGER );'.
  |         }
  |     }.
|
  | Code to insert a computed array of tasks 10 times:
  |
  |     1 to:10 do: { this:tasksTable insert:taskList. }.
| polyrand wrote:
  | Really good article. I've experimented with this kind of
  | workload in SQLite a few times and found it insightful. One note
  | though: using
  |
  |     pragma temp_store = memory;
  |
  | only affects temporary tables and indices, not the main database
  | itself [0]
|
| [0] https://sqlite.org/pragma.html#pragma_temp_store
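  |
  | If you want the whole database in RAM rather than just the temp
  | objects, the usual route is an in-memory database; a tiny
  | sketch:
  |
  |     import sqlite3
  |
  |     # the main database lives entirely in RAM; temp_store then
  |     # only affects temporary tables and indices on top of it
  |     conn = sqlite3.connect(":memory:")
  |     conn.execute("PRAGMA temp_store = MEMORY")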
| mimir wrote:
| Database optimization posts are always interesting, but it's
| really hard to do any apples to apples comparison. Your
| performance is going to depend mostly on your hardware, internal
| database settings and tunings, and OS level tunings. I'm glad
| this one included some insight into the SQLite settings disabled,
| but there's always going to be too many factors to easily compare
| this to your own setup.
|
| For most SQL systems, the fastest way to do inserts is always
  | just going to be batched inserts. There are maybe some extra
  | tricks to reduce network costs/optimize batches [0], but at its
  | core you
| are still essentially inserting into the table through the normal
| insert path. You can basically then only try and reduce the
| amount of work done on the DB side per insert, or optimize your
| OS for your workload.
|
| Some other DB systems (more common in NoSQL) let you actually do
| real bulk loads [1] where you are writing direct(ish) database
| files and actually bypassing much of the normal write path.
|
| [0] https://dev.mysql.com/doc/refman/5.7/en/insert-
| optimization.... [1] https://blog.cloudera.com/how-to-use-hbase-
| bulk-loading-and-...
| vajrabum wrote:
| Oracle, DB2, MySQL, SQL Server, and PostgreSQL all support bulk
    | insert. Two obvious use cases are QA databases and the L part
    | of ETL, which pretty much require it.
| MobiusHorizons wrote:
| I would recommend looking into dtrace or other observability
| tools to figure out what to optimize next. So far you have
| basically had to guess where the slowness is and optimize that.
|
| Lastly, how long does it take to make a copy of an already
  | prepared 1B-row SQLite file? That seems easier than generating a
| new one.
| Seattle3503 wrote:
| I've used the other pragma, but does
|
| PRAGMA locking_mode = EXCLUSIVE;
|
| improve performance much?
| ajnin wrote:
  | Here's my solution: generate the 1B-record .sqlite file once
  | beforehand, then before running each test, copy it over the test
  | DB file ;)
| mirekrusin wrote:
  | If it's a single, in-memory table, is there really a need to use
  | a database? Won't language-provided data structures suffice?
| jonnycomputer wrote:
| It is a good question, but I'm guessing that the whole reason
| for creating the database in the first place is to test
| interactions with the database, in which case you'd have to
| mock up all the interactions being tested (selects, deletes,
| joins, etc.).
| einpoklum wrote:
| Yes, if you want to run non-trivial queries on that data.
|
| Although, frankly, SQLite would not be my choice.
| adamnemecek wrote:
| What would be your choice?
| einpoklum wrote:
        | If you don't do any joins, ClickHouse is worth a try.
| Otherwise, MonetDB (or even MonetDB embedded) if you want
| FOSS. Commercial offerings - Actian Vector, or maybe
| HyperDB. The big corporate ones are probably pretty good,
| but it's always difficult to tell because there's this
| taboo on any proper benchmarking against them.
|
| If you just want to browse what other people are using, and
| not go by the recommendations of a random commenter, try:
|
| https://db-engines.com/en/ranking
|
| but note that's a joint ranking both for transaction-
| focused and analytics-focused DBMSes.
|
| If it's time series data there are some more specialized
| offerings and I'm (even) less of an expert there.
| xxs wrote:
    | I can't think of a single case where an in-memory database is
    | a good option, aside from playing w/ SQL.
    |
    | Whatever the queries might be, a hashmap/tree/skiplist, etc.
    | would be a lot better.
| tgv wrote:
| For something like redis, but with more structure.
|
| But a pretty good use case (IMO) is testing. If you want to
| do an integration test with an SQL database, and you want
| to test large numbers, this might be a good fully
| functional stub to run locally.
| xxs wrote:
        | > For something like redis,
        |
        | Redis is in pretty much the same category. Testing is
        | sort of a valid case, if you are committed to writing
        | pure SQL with minimal use of any dialect specifics (but
        | even 'create table' syntax might be different). There is
        | close to no replacement for running on the real thing
        | when it comes to databases.
        |
        | Many databases have docker images nowadays, so it ain't
        | hard to run them locally. Likely at some point you'd
        | want to optimize the SQL itself, rendering the tests
        | incompatible.
| manigandham wrote:
| What do you think a relational database does to provide you
| the querying flexibility? It builds the same exact data
| structures, but does it automatically and dynamically,
| while offering much more usability.
|
      | Most attempts to query using raw data structures just mean
      | you end up rebuilding a (very poor) relational database
      | with none of the features.
| adrianN wrote:
      | If the set of queries you want to run is fixed, you can
      | probably beat SQLite with a lot of work; if the queries are
      | dynamic, I doubt that you can do much better than SQLite
      | without herculean efforts. An in-memory database is thus a
      | good option if you either don't care too much about the
      | runtime of your queries and want to save a bunch of
      | development time, or if you don't know enough about your
      | workload to be able to beat a general-purpose solution.
| xxs wrote:
        | > you can =probably= beat SQLite with =a lot of= work,
        |
        | Actually, I am beyond certain. When it's all in memory,
        | no database comes even remotely close to properly picked
        | data structures + a proper object layout.
        |
        | If I need a transaction log + persistence, databases have
        | a decent application.
        |
        | In more than 20y, I have never had a case of: yay, I can
        | use relational structures in memory b/c I don't know what
        | I am going to do with the data.
| einpoklum wrote:
| It's very likely it wouldn't. (Decent) analytical DBMSes
| have a whole lot up their sleeves which just choosing one
| plain-vanilla data structure for your table doesn't even
| approach in terms of performance. The benefit may be well
        | upwards of 10x in many real-world scenarios.
|
| Of course, if you know you get one single query which you
| know in advance, carefully build a data structure to cater
        | just to that, and you know your data structures beyond
        | just the basics - then, yes, a DBMS would be overkill. But
| it still won't be a walk in the park.
| [deleted]
| faizshah wrote:
| Sqlite provides a full text search module (FTS5) which
| provides a lot of nice search features. I have used this
| numerous times to build serverless apis for large static
| datasets for frontend analytical experiences like data
| tables without setting up elasticsearch. Thats one use
| case.
|
| Another is they support this closures.c extension which is
| very nice for rapid queries on tree structured data locally
| in memory. The JSON1 extension is also nice for rapidly
| querying/reshaping deeply nested json data. Theres also
| spellfix1 that can provide fuzzing capabilities. If you
| need any of these with low latency and in memory its a
| great choice.
|
| Sqlite is great for rapidly building low latency static
| data serving services for frontend experiences. Something
| I'm exploring now is combining sqlite with streamlit to
| rapidly build data exploration UIs.
|
| Like how many times have you wanted to quickly add fuzzy
| matching or full text search to some program? You use
| fuzzywuzzy but its pretty slow, sqlite provides performant
| implementations of this stuff thats super simple to set up.
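    |
    | A minimal FTS5 sketch in Python (assuming the SQLite build
    | ships FTS5, as most do; the table and sample rows are made
    | up):
    |
    |     import sqlite3
    |
    |     conn = sqlite3.connect(":memory:")
    |     conn.execute(
    |         "CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    |     conn.execute("INSERT INTO docs VALUES (?, ?)",
    |                  ("fast inserts",
    |                   "batching and pragmas make sqlite fast"))
    |     conn.commit()
    |
    |     # MATCH hits the full-text index; bm25() ranks results
    |     for (title,) in conn.execute(
    |             "SELECT title FROM docs WHERE docs MATCH ?"
    |             " ORDER BY bm25(docs)", ("pragmas",)):
    |         print(title)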
| _delirium wrote:
| For the case where I want the ability to run one-off
| analytics queries, having access to the usual set of
| join/filter/summarize/etc. operations is a lot more
| convenient and less error-prone than having to manually
| write them. But dplyr [1] is my go-to rather than SQLite
| personally for in-memory data sizes.
|
| [1] https://dplyr.tidyverse.org/
| lsuresh wrote:
| There are some domains like building cluster managers where
| having a relational view of the cluster state and being able to
| query/manipulate it declaratively has significant perks. See
| for example:
| https://www.usenix.org/system/files/osdi20-suresh.pdf
| (disclaimer: I'm an author)
| pachico wrote:
  | I've been dealing with lots of data and SQLite, and my takeaways
  | are:
  |
  | - the language has very little to do with it, since the
  | bottleneck will most likely be the way you insert data
  |
  | - prepared statements are indeed useful, but the performance
  | didn't change much once I used long transactions and committed
  | every few thousand rows
  |
  | - having lots of rows in your table is fine, but certain
  | queries, like aggregation over many rows, are not what SQLite is
  | great at.
  |
  | - ClickHouse can easily ingest that and more on a laptop without
  | even any scripting language.
| bilekas wrote:
| > Threads / async may not be faster always
|
| This is actually a little bit mind blowing for me. I'm gonna go
| and play with this. But what a cool read!
| elephantum wrote:
  | I bet that this is mostly a random.randint benchmark and not
  | SQLite.
  |
  | Also interesting whether a batched numpy version would compare
  | better to Rust.
| elephantum wrote:
| Correction: random.randint + string interpolation
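    |
    | A rough way to test that bet: time the value generation
    | alone, with no SQLite involved (a sketch using the article's
    | 100M-row run and value ranges):
    |
    |     import random
    |     import time
    |
    |     start = time.perf_counter()
    |     for _ in range(100_000_000):
    |         area = "{:06d}".format(random.randint(0, 999_999))
    |         age = random.randint(1, 3) * 5
    |         active = random.randint(0, 1)
    |     print("generation only:",
    |           time.perf_counter() - start, "seconds")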
| IfOnlyYouKnew wrote:
| I ran this and got the following results:
  |
  |     /fast-sqlite3-inserts (master)> time make busy-rust
  |     Sun Jul 18 17:04:59 UTC 2021 [RUST] busy.rs (100_000_000) iterations
  |     real    0m9.816s
  |     user    0m9.380s
  |     sys     0m0.433s
  |     ________________________________________________________
  |     Executed in    9.92 secs   fish           external
  |        usr time    9.43 secs    0.20 millis    9.43 secs
  |        sys time    0.47 secs    1.07 millis    0.47 secs
  |
  |     fast-sqlite3-inserts (master)> time make busy-rust-thread
  |     Sun Jul 18 17:04:48 UTC 2021 [RUST] threaded_busy.rs (100_000_000) iterations
  |     real    0m2.104s
  |     user    0m13.640s
  |     sys     0m0.724s
  |     ________________________________________________________
  |     Executed in    2.33 secs   fish           external
  |        usr time   13.68 secs    0.20 millis   13.68 secs
  |        sys time    0.78 secs    1.18 millis    0.78 secs
|
| I'm probably doing something wrong. Or I'm getting the pace
| needed for the billion?
|
| This is on a M1 MacBook Air.
| rmetzler wrote:
  | Is PRAGMA temp_store = MEMORY the same as putting the file on a
  | tmpfs?
| elteto wrote:
| Maybe, maybe not. I imagine writing directly to memory will
| always be faster than writing to a filesystem, even if it's
| ultimately backed by RAM. Might be a negligible difference
| though.
___________________________________________________________________
(page generated 2021-07-18 23:00 UTC)