[HN Gopher] In-memory vs. disk-based DB: Why do you need a large...
___________________________________________________________________
In-memory vs. disk-based DB: Why do you need a larger than memory
architecture?
Author : taubek
Score : 47 points
Date : 2023-09-02 17:33 UTC (5 hours ago)
(HTM) web link (memgraph.com)
(TXT) w3m dump (memgraph.com)
| moomoo11 wrote:
| I use LevelDB for a cache on low-cost resources, i.e. 1 vCPU,
| 0.5 GB RAM, 5-10 GB SSD.
|
| Gives me great performance (especially paired with Go), and I'm
| able to deliver 6-10k rps for a few dollars a month.
| cartweheel wrote:
| They've misunderstood what an in-memory database is, though. The
| distinction only makes sense from a historical perspective, with
| on-disk formats and associated algorithms that worked well for
| execution from spinning iron. In fact, neo4j is also an in-memory
| database, since it will slow to a complete crawl on an HDD. The
| fact that it supports swapping pages in and out of disk is more
| akin to virtual memory than to any specific support for executing
| queries efficiently from disk.
| hakunin wrote:
| Not 100% what the article is about, just a short story.
|
| At one of my old jobs we had megabytes of infrequently accessed
| static key-value data. If we simply loaded it into a const (i.e.
| a hash table), it would blow up RAM so much that we would need to
| upgrade to bigger VPSes. If we put it in a database, it would be
| annoying to keep those tables up to date and track changes in
| them.
|
| I figured this was one of those in-between use cases, where the
| best solution is zero-RAM lookups from SSD. In my case, I wrote a
| little Ruby library[1] that arranges data in equal-sized cells in
| a file and performs binary searches via `pread`. This was perfect
| for us: we kept the data in our repos, sacrificed no RAM at
| runtime, SSD lookups were fast enough, and we didn't have to
| support a more elaborate db.
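|
| The core of the lookup is roughly this (a simplified Python
| sketch of the idea, not the actual library; the cell and key
| sizes are made-up examples):
|
|     import os
|
|     CELL_SIZE = 64   # hypothetical fixed cell width (key + value, padded)
|     KEY_SIZE = 16    # hypothetical fixed key width
|
|     def lookup(path, key):
|         # Binary search over fixed-size cells; each probe reads one
|         # cell via pread, so nothing beyond those few bytes is pulled
|         # into the process's own memory. (os.pread is Unix-only.)
|         fd = os.open(path, os.O_RDONLY)
|         try:
|             cells = os.fstat(fd).st_size // CELL_SIZE
|             lo, hi = 0, cells - 1
|             while lo <= hi:
|                 mid = (lo + hi) // 2
|                 cell = os.pread(fd, CELL_SIZE, mid * CELL_SIZE)
|                 cell_key = cell[:KEY_SIZE].rstrip(b"\0")
|                 if cell_key == key:
|                     return cell[KEY_SIZE:].rstrip(b"\0")
|                 elif cell_key < key:
|                     lo = mid + 1
|                 else:
|                     hi = mid - 1
|             return None
|         finally:
|             os.close(fd)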
|
| [1]: https://github.com/maxim/wordmap
| teaearlgraycold wrote:
| Why not commit a sqlite db?
| hakunin wrote:
| I considered that, but couldn't find a way to precisely control
| the amount of RAM used when reading from it.
| jerrygenser wrote:
| I have a similar use case and I have had success reading from
| sqlite files embedded in my repo
| boywitharupee wrote:
| Basically, your solution ended up residing on-disk, but the
| data was fragmented in such a way that efficient lookup was
| possible? Did this copy from kernel space to userspace, also?
| hakunin wrote:
| >the data was fragmented in such a way that efficient lookup
| was possible
|
| Yeah, there's a build step that sorts and arranges data into
| "cells", making binary search possible.
|
| Not sure about kernel/user space. I'm just calling `pread`
| from ruby, so only a few bytes are loaded per lookup.
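|
| Conceptually it's something like this (again just an illustrative
| Python sketch, not the real build step; it reuses the arbitrary
| cell layout from my sketch above):
|
|     CELL_SIZE = 64   # hypothetical fixed cell width
|     KEY_SIZE = 16    # hypothetical fixed key width
|
|     def build(path, pairs):
|         # Sort by key and pad every record to CELL_SIZE bytes, so cell
|         # offsets are predictable and binary search with pread works.
|         with open(path, "wb") as f:
|             for key, value in sorted(pairs):
|                 assert len(key) <= KEY_SIZE
|                 assert len(value) <= CELL_SIZE - KEY_SIZE
|                 cell = key.ljust(KEY_SIZE, b"\0") + value
|                 f.write(cell.ljust(CELL_SIZE, b"\0"))
|
|     build("data.bin", [(b"apple", b"1"), (b"banana", b"2")])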
| rubiquity wrote:
| > sacrificed no RAM at runtime
|
| Those lookups eventually made their way into the kernel's page
| cache.
| [deleted]
| hakunin wrote:
| Are you saying that the file would eventually end up entirely in
| the page cache? Even if true, I imagine it still wouldn't result
| in OOM-killing the server/worker daemons on account of this
| data?
| rubiquity wrote:
| Depending on the file size and available memory, yes. In the
| event memory needs to be reclaimed, Linux's memory management
| system will free from the page cache first if processes need
| more anonymous memory.
| CraigJPerry wrote:
| I think I'm missing something here: could you not just mmap() the
| file and let the kernel take care of memory pressure for you?
| lisper wrote:
| Sure -- until your system crashes and the data is left in an
| inconsistent state, because some of it was written to non-
| volatile storage and some of it wasn't, and your OS had no
| concept of transactional consistency -- mmap is a leaky
| abstraction.
| posnet wrote:
| Not a problem here. Read the top level comment.
|
| "static key-value data. ... a const (i.e. a hash table)"
| sschnei8 wrote:
| https://db.cs.cmu.edu/mmap-cidr2022/
| rubiquity wrote:
| Good paper and overall the right opinion, but not very relevant
| to a relatively simple system like the one OP described, since
| they weren't dealing with safety, transactions, etc.
| hakunin wrote:
| I'm no systems programmer, but I remember researching the mmap
| approach. It's been 3 years, so I'm not sure what stopped me, but
| something didn't feel right. Perhaps it was the lack of control
| over how memory is used: I couldn't clearly see how much it would
| use, and didn't want any fluctuations.
|
| Edit: oh, and I think I did come across some article like the one
| linked in the neighbor comment. It's starting to come back.
| andersa wrote:
| > Fetching data from disk is something that everyone strives to
| avoid since it takes approximately 10x more time than using it
| from main memory.
|
| Did they mean 10 _thousand_ times? Or is the in-memory version
| that inefficient?
| Darych wrote:
| Based on this resource
| https://gist.github.com/hellerbarde/2843375 a 1MB sequential read
| from SSD is just 4x slower than the same read from main memory
| (~1 ms vs ~250 us in those numbers). For random reads main memory
| is much faster, of course. I believe the author meant some
| average value.
| pavlov wrote:
| Yes, even with SSDs it seems like 10x is very optimistic. It
| should be several orders of magnitude.
| Moto7451 wrote:
| I think they're playing it safe with their points of comparison.
| DDR5 supports 64,000 MB/s per channel, and x4 PCIe 5.0 NVMe SSDs
| support 10,000 MB/s. Depending on how many memory channels and
| what RAID you use, I think a 10x improvement over high-
| performance storage is unimpeachable. Memory latency being better
| than SSD latency will really benefit memory depending on
| workload, but I don't think you can just throw one number out
| there to represent that.
|
| Now if you're comparing to spinning rust, memory is definitely
| going to blow it away, but commodity hardware isn't running tens
| or hundreds of TBs of memory. Memory-to-SSD comparisons seem
| right.
| semi-extrinsic wrote:
| For example, the latest AMD Genoa has 12 memory channels per
| socket; at dual socket and with enough DIMMs that's a 75x speed
| advantage even if you compare with a RAID0 of high-performance
| NVMe.
| wtallis wrote:
| > even if you compare with RAID0 of high perfomance NVMe.
|
| Do you mean a RAID0 of just two or four NVMe SSDs? It's
| absolutely ridiculous to count aggregate DRAM bandwidth
| across two CPU sockets and not do the same for PCIe lanes.
| A fair comparison is that Genoa has about twice the DRAM
| bandwidth as it has PCIe bandwidth, though in a fully-
| loaded database server some portion of the PCIe bandwidth
| will be used for networking rather than storage.
| zamadatix wrote:
| FWIW I think you have the aggregation backwards: dual-socket
| Genoa would have 24 channels of DRAM but sacrifice some of the
| PCIe lanes for the interconnect. Your numbers work out right
| though, as single socket is actually about 1:1 in RAM vs PCIe
| bandwidth, so dual socket would still come out roughly 2:1.
|
| I think 10x is a fair rough number though, depending on your
| access pattern.
| wtallis wrote:
| Dual-socket Genoa would be 24 channels at DDR5-4800 (38.4 GB/s)
| for a total of ~921.6 GB/s. Typical PCIe configurations are 64 or
| 80 lanes per socket for dual-socket, so 128 or 160 lanes total;
| at PCIe 5.0 speed that's ~504 GB/s for 128-lane or ~630 GB/s for
| 160-lane configurations.
|
| Single-socket Genoa would be 12 channels of DRAM (~460.8 GB/s)
| and 128 lanes of PCIe 5.0 (~504 GB/s), but none of the previous
| comments were specifically about single-socket Genoa, and I
| wasn't going to silently switch from considering dual-socket in
| one sentence to single-socket in the next.
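|
| For anyone who wants to sanity-check the arithmetic, a throwaway
| Python snippet (the per-channel and per-lane rates are the same
| approximate figures as above):
|
|     ddr5_4800_per_channel = 38.4       # GB/s
|     pcie5_per_lane = 504 / 128         # ~3.94 GB/s usable per lane
|
|     print(24 * ddr5_4800_per_channel)  # dual-socket DRAM:  ~921.6 GB/s
|     print(12 * ddr5_4800_per_channel)  # single-socket DRAM: ~460.8 GB/s
|     print(128 * pcie5_per_lane)        # 128 PCIe 5.0 lanes: ~504 GB/s
|     print(160 * pcie5_per_lane)        # 160 PCIe 5.0 lanes: ~630 GB/s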
| zamadatix wrote:
| Ah yes, I forgot about the 48-lane two-socket interconnect mode,
| which does still allow you to aggregate some additional lanes.
| convolvatron wrote:
| It may also be limited by IOPS. You _can_ really just work with
| latency, but you need to consider the pipeline depth.
| [deleted]
| paulddraper wrote:
| 10x is reasonable-ish for bandwidth, not for latency.
| mbuda wrote:
| I guess when you compare the pure performance of the hardware,
| 10x is very optimistic from the perspective of disk. Probably the
| author based this number on some specific application/database
| context or bias in the measurements. But yeah, the pure hardware
| difference might be hugely different compared to that number.
| jandrewrogers wrote:
| Recent database engine designs tend to be bandwidth bound. The
| difference in bandwidth between memory and modern storage
| hardware is much smaller than you might expect. Really taking
| advantage of that storage bandwidth is more difficult than with
| memory, as it requires some top-notch scheduler design.
| refset wrote:
| In a similar vein, this was news to me recently:
|
| > here's a chart comparing the throughputs of typical memory,
| I/O and networking technologies used in servers in 2020
| against those technologies in 2023
|
| > Everything got faster, but the relative ratios also
| completely flipped
|
| > memory located remotely across a network link can now be
| accessed with no penalty in throughput
|
| The graphs demonstrate it very clearly:
| https://blog.enfabrica.net/the-next-step-in-high-performance...
| pclmulqdq wrote:
| 10,000x is for spinning rust. With SSDs on NVMe it's about 50x,
| so pick 10x or 100x to be your number.
|
| This should have been a revolution in DB design, IMO, but it
| kind of hasn't been.
| bullen wrote:
| The whole point of a DB is to remember the data after power
| cycle.
|
| So whichever way you turn it, both solutions go to disk?
| CyberDildonics wrote:
| _The whole point of a DB is to remember the data after power
| cycle._
|
| You are thinking of a file system.
| DaiPlusPlus wrote:
| Databases don't have to use a file-system - look at old Big
| Iron systems where the database _is_ the file-system.
|
| In the context of Linux/Unix/etc, the file-system is just
| another API for the OS - consider /dev/null or /proc - those
| are certainly in the file-system but they aren't tied to
| persistent storage.
| CyberDildonics wrote:
| _Databases don't have to use a file-system_
|
| I didn't say they did.
|
| _consider /dev/null or /proc - those are certainly in the
| file-system but they aren't tied to persistent storage._
|
| I'm not sure what point you're trying to make here.
| avianlyric wrote:
| Sure, but querying that data is also important, which means the
| question of "do you optimise for in-memory queries, or on-disk
| queries" matters.
|
| Simply saying you want the data to be durable isn't that
| interesting or hard to achieve; there are plenty of ways of
| achieving durable storage. The hard part is doing durable storage
| while also solving problems like query speed and concurrency
| control.
| Guvante wrote:
| It depends on what you mean by durable. Claiming that persistence
| to disk is trivial compared to returning values is a stretch,
| IMHO. Modern DBs still have bugs where the data returned is
| different from what was permanently persisted, leading to
| inconsistency errors.
|
| And that is ignoring performance queries that don't require any
| guarantees.
___________________________________________________________________
(page generated 2023-09-02 23:00 UTC)