[HN Gopher] Does my data fit in RAM?
___________________________________________________________________
Does my data fit in RAM?
Author : louwrentius
Score : 92 points
Date : 2022-08-02 19:49 UTC (3 hours ago)
(HTM) web link (yourdatafitsinram.net)
(TXT) w3m dump (yourdatafitsinram.net)
| nsxwolf wrote:
| 64 TiB fits in a Dell Poweredge R840 with 6TB max RAM... how
| exactly?
| louwrentius wrote:
| Please take a look further down the list. For your use case, a
| Power System E980 may just be enough or too small. :-)
| mciancia wrote:
| Maybe someone mixed up regular dram and optane dimms?
| staticassertion wrote:
| mine does not
| louwrentius wrote:
| The original site made by lukegb inspired me because of its down-
| to-earth simplicity. Scaling vertically is often so much easier
| and better in so many dimensions than creating a complex
| distributed computing setup.
|
| This is why I recreated the site when it went down quite a while
| ago.
|
| The recent article "Use One Big Server"[0] inspired me to
| (re)submit this website to HN because it addresses the same
| topic. I like this new article so much because in this day and
| age of the cloud, people tend to forget how insanely fast and
| powerful modern servers have become.
|
| And if you don't have the budget for new equipment, the second-
| hand stuff from a few years back is still beyond amazing and the
| prices are very reasonable compared to cloud cost. Sure, running
| bare metal co-located somewhere has its own cost, but it's not
| that big of a deal and many issues can be dealt with using
| 'remote hands' services.
|
| To be fair, the article admits that in the end it's really about
| your organisation's specific circumstances and thus your
| requirements. Physical servers and/or vertical scaling may not
| (always) be the right answer. That said, do yourself a favour,
| and do take this option seriously and at least consider it. You
| can even do an experiment: buy some second-hand gear just to gain
| some experience with hardware if you don't have it already and do
| a trial in a co-location.
|
| Now that we are talking, yourdatafitsinram.net runs on a
| Raspberry Pi 4 which in turn is running on solar power.[1] (The
| blog and this site are both running on the same host)
|
| [0]: https://news.ycombinator.com/item?id=32319147
|
| [1]: https://louwrentius.com/this-blog-is-now-running-on-solar-
| po...
| karamanolev wrote:
| > many issues can be dealt with using 'remote hands' services.
|
| I have a few second-hand HP/Dell/Supermicro systems running
| colocated. I find that for all software issues, remote
| management / IPMI / KVM over IP is perfectly sufficient. Remote
| hands are needed only for actual hardware issues, most of which
| are "replace this component with an identical one". Usually HDDs,
| if you're running those. Overall, I'm quite happy with the
| setup and it's very high on the value/$ spectrum.
| louwrentius wrote:
| Yes, I bet a lot of people aren't even aware of the IPMI/KVM-
| over-IP capabilities that servers have had for decades, which
| make hardware management (manual or automated!) much easier.
|
| Remote hands is for the inevitable hardware failure (Disk,
| PSU, Fan) or human error (you locked yourself out somehow
| remotely from IPMI).
|
| P.S. I have an HP Proliant DL380 G8 with 128 GB of memory and
| 20 physical cores as a lab system for playing with many
| virtual machines. I turn it on and off on demand using IPMI.
| toast0 wrote:
| IPMI is nice, although the older you go, the more
| _particular_ it gets. I had professional experience with the
| SuperMicro Xeon E5-2600 series v1-4, and recently started
| renting a previous-generation server[1] and it's worse than
| the ones I used before. It's still serviceable, though; I'm
| not sure if it's using a dedicated LAN, because the KVM and
| the SOL drop out when the OS starts or ends; it'll come back,
| but you miss early boot messages.
|
| It's definitely worth the effort to script starting the KVM,
| and maybe even the SOL. If you've got a bunch of servers, you
| should script the power management as well; if nothing else,
| you want to rate-limit power commands across your fleet to
| prevent accidental mass restarts. Intentional mass restarts
| can probably happen through the OS, so 1 power command per
| second across your fleet is probably fine. (You can always
| hack out the rate limit if you're really sure.)
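|
| As a rough illustration (not toast0's actual setup), a rate-limited
| power-cycle loop can be as small as a subprocess call to ipmitool
| with a sleep in between; the hostnames, username and the one-second
| interval below are placeholders:
|
|     import subprocess
|     import time
|
|     # Placeholder BMC hostnames; the password comes from the
|     # IPMI_PASSWORD environment variable via ipmitool's -E flag.
|     HOSTS = ["bmc-01.example.net", "bmc-02.example.net"]
|
|     def power_cycle(host):
|         subprocess.run(
|             ["ipmitool", "-I", "lanplus", "-H", host,
|              "-U", "admin", "-E", "chassis", "power", "cycle"],
|             check=True,
|         )
|
|     for host in HOSTS:
|         power_cycle(host)
|         time.sleep(1)  # at most one power command per second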
|
| [1] I don't need a whole server, but for $30/month when I
| wanted to leave my VPS behind for a few reasons anyway...
| baisq wrote:
| Why is table.html loaded as an external resource instead of being
| in index.html proper?
| [deleted]
| louwrentius wrote:
| I can't remember why I did that, probably to keep the data
| separate from the rest of the code.
| game-of-throws wrote:
| I just confirmed that 640 KB fits in RAM. That's enough for me.
| rmetzler wrote:
| 640K ought to be enough for anybody.
| mech422 wrote:
| Thanks Bill!
| tester756 wrote:
| >Gates himself has strenuously denied making the comment. In
| a newspaper column that he wrote in the mid-1990s, Gates
| responded to a student's question about the quote: "I've said
| some stupid things and some wrong things, but not that. No
| one involved in computers would ever say that a certain
| amount of memory is enough for all time." Later in the
| column, he added, "I keep bumping into that silly quotation
| attributed to me that says 640K of memory is enough. There's
| never a citation; the quotation just floats like a rumor,
| repeated again and again."
| vlunkr wrote:
| Amazing. This has been the solution to postgres issues for me.
| Just add enough memory that everything, or at least everything
| that is accessed frequently can fit in RAM. Suddenly everything
| is cached and fast.
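|
| A quick sanity check that the hot set really is cached is the
| buffer cache hit ratio; a minimal sketch using psycopg2 (the DSN
| is a placeholder, and the ratio ignores the OS page cache):
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
|     cur = conn.cursor()
|     cur.execute("""
|         SELECT round(100.0 * sum(blks_hit)
|                / nullif(sum(blks_hit) + sum(blks_read), 0), 2)
|         FROM pg_stat_database
|     """)
|     print(cur.fetchone()[0], "% of reads from shared_buffers")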
| hyperman1 wrote:
| Funny, it lets you click to negative amounts of RAM. My -1 PiB
| fits in RAM, so having it as a unit is not useless. (It also
| accepts fractions but not octal)
| antisthenes wrote:
| If you're wondering, the cutoff is 64 TiB.
|
| That's the amount of RAM on an IBM Power E980 System.
| baisq wrote:
| How much does that cost?
| edmundsauto wrote:
| $8.5 million according to a sibling comment.
| dang wrote:
| Related:
|
| _Does my data fit in RAM?_ -
| https://news.ycombinator.com/item?id=22309883 - Feb 2020 (162
| comments)
| Cwizard wrote:
| Anyone have any recommendations for a SQL engine that works on
| in-memory data and has a simple/monolithic architecture? Our data
| is about 50-100gb (uncompressed) and thus easily fits into
| memory. I am sure we could do our processing using something like
| polars or pandas in memory quite quickly but we prefer a SQL
| interface. Using postgres is still quite slow even when it has
| more than enough memory available compared to something like
| duckdb. Duckdb has other limitations however. I've been eying
| MemSQL but that also seems to be targeted more towards multi
| machine deployments.
| chaxor wrote:
| SQLite is almost always the answer
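|
| As a sketch of that route (table and column names invented): load
| the data into an in-memory SQLite database once and then query it
| with plain SQL:
|
|     import sqlite3
|     import pandas as pd
|
|     df = pd.read_parquet("events.parquet")  # hypothetical source
|
|     con = sqlite3.connect(":memory:")       # database lives in RAM
|     df.to_sql("events", con, index=False)
|
|     rows = con.execute(
|         "SELECT user_id, count(*) FROM events GROUP BY user_id"
|     ).fetchall()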
| mritchie712 wrote:
| what limit are you hitting with duckdb?
| giraffe_lady wrote:
| sqlite?
| somekyle wrote:
| Is the point of this that you can do large-scale data processing
| without the overhead of distribution if you're willing to pay for
| the kind of hardware that can give you fast random access to all
| of it?
| nattaylor wrote:
| Yes, take a look at the "inspired by" tweet [0]
|
| [0] https://twitter.com/garybernhardt/status/600783770925420546
| civilized wrote:
| Has anyone tried firing up Pandas or something to load a multi-TB
| table? Would be interested to see if you run into some hidden
| snags.
| jdeaton wrote:
| I've done this though the data in the table was split across
| DataFrames in many concurrent processes.
| https://stackoverflow.com/questions/49438954/python-shared-m...
| itamarst wrote:
| There's just a huge amount of waste in many cases which is very
| easy to fix. For example, if we have a list of fractions
| (0.0-1.0):
|
| * Python list of N Python floats: 32xN bytes (approximate, the
| Python float is 24 bytes + 8-byte pointer for each item in the
| list)
|
| * NumPy array of N double floats: 8xN bytes
|
| * Hey, we don't need that much precision, let's use 32-bit floats
| in NumPy: 4xN
|
| * Actually, values of 0-100 are good enough, let's just use uint8
| in NumPy and divide by 100 if necessary to get the fraction: N
| bytes
|
| And now we're down to 3% of original memory usage, and quite
| possibly with no meaningful impact on the application.
|
| (See e.g. https://pythonspeed.com/articles/python-integers-
| memory/ and https://pythonspeed.com/articles/pandas-reduce-
| memory-lossy/ for longer prose versions that approximate the
| above.)
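|
| A small sketch of that progression, measuring each step (the 0-100
| rescaling is lossy, so it only applies when that precision loss is
| acceptable):
|
|     import sys
|     import numpy as np
|
|     n = 1_000_000
|     fractions = [i / n for i in range(n)]
|
|     # Python list: an 8-byte pointer per item plus a 24-byte
|     # float object for each value.
|     list_bytes = sys.getsizeof(fractions) + sum(
|         sys.getsizeof(x) for x in fractions)
|
|     as_f64 = np.array(fractions, dtype=np.float64)   # 8 bytes/value
|     as_f32 = as_f64.astype(np.float32)               # 4 bytes/value
|     as_u8 = np.round(as_f64 * 100).astype(np.uint8)  # 1 byte/value
|
|     print(list_bytes, as_f64.nbytes, as_f32.nbytes, as_u8.nbytes)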
| deckard1 wrote:
| interesting. Python doesn't use tagged pointers? I would think
| most dynamic languages would store immediate char/float/int in
| a single tagged 32-bit/64-bit word. That's some crazy overhead.
| acdha wrote:
| This has been talked about for years but I believe it's still
| complicated by C API compatibility. The most recent
| discussion I see is here:
|
| https://github.com/faster-cpython/ideas/discussions/138
|
| Victor Stinner's experiment showed some performance
| regressions, too:
|
| https://github.com/vstinner/cpython/pull/6#issuecomment-6561.
| ..
| nneonneo wrote:
| Absolutely everything in CPython is a PyObject, and that
| can't be changed without breaking the C API. A PyObject
| contains (among other things) a type pointer, a reference
| count, and a data field; none of these things can be changed
| without (again) breaking the C API.
|
| There have definitely been attempts to modernize; the HPy
| project (https://hpyproject.org/), for instance, moves
| towards a handle-oriented API that keeps implementation
| details private and thus enables certain optimizations.
| [deleted]
| BLanen wrote:
| You're describing operations done on data in memory to save
| memory. That list of fractions still needs to be in memory at
| some point. And if you're batching, this whole discussion goes
| out of the window.
| rcoveson wrote:
| Why would the whole original dataset need to be in memory all
| at once to operate on it value-by-value and put it into an
| array?
| BLanen wrote:
| If the whole original dataset doesn't need to be in memory
| all at once, there isn't even an issue to begin with.
| saltcured wrote:
| I think the point is that you can use a streaming IO
| approach to transcode or load data into the compact
| representation in memory, which is then used by whatever
| algorithm actually needs the in-memory access. You don't
| have to naively load the entire serialization from disk
| into memory.
|
| This is one reason projects like Twitter popularized
| serializations like json-stream in the past, to make it
| even easier to incrementally load a large file with basic
| software. Formats like TSV and CSV are also trivially
| easy to load with streaming IO.
|
| I think the mark of good data formats and libraries is
| that they allow for this. They should not force an in-
| memory all or nothing approach, even if applications may
| want to put all their data in memory. If for no other
| reason, the application developer should be allowed to
| commit most of the system RAM to their actual data, not
| the temporary buffers needed during the IO process.
|
| If I want to push a machine to its limits on some large
| data, I do not want to be limited to 1/2, 1/3 or worse of
| the machine size because some IO library developers have
| all read an article like this and think "my data fits in
| RAM"! It's not "your data" nor your RAM when you are
| writing a library. If a user's actual end data might just
| barely fit in RAM, it will certainly fail if the deep
| call-stack of typical data analysis tools is cavalier
| about allocating additional whole-dataset copies during
| some synchronous load step...
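|
| A minimal sketch of that streaming-load pattern (file name and
| layout invented): decode one value at a time and only ever hold
| the compact array:
|
|     import csv
|     import numpy as np
|
|     def uint8_fractions(path):
|         # Yield one compact value at a time; neither the whole
|         # text file nor a full list of Python objects is ever
|         # held in memory.
|         with open(path, newline="") as f:
|             for row in csv.reader(f):
|                 yield int(float(row[0]) * 100)
|
|     data = np.fromiter(uint8_fractions("fractions.csv"),
|                        dtype=np.uint8)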
| adamsmith143 wrote:
| Ok now I have 100s of columns. I should do this for every
| single one in every single dataset I have?
| staticassertion wrote:
| Yes?
| itamarst wrote:
| It takes like 5 minutes, and once you are in the habit it's
| something you do automatically as you write the code and so
| it doesn't actually cost you extra time.
|
| Efficient representation should be something you build into
| your data model, it will save you time in the long run.
|
| (Also if you have 100s of columns you're hopefully already
| benefiting from something like NumPy or Arrow or whatever, so
| you're already doing better than you could be... )
| maerF0x0 wrote:
| > It takes like 5 minutes, and once you are in the habit
| it's something you do automatically as you write the code
| and so it doesn't actually cost you extra time.
|
| This is the argument I've been having my whole career with
| people who claim the better way is "too hard and too slow"
| .
|
| I'm like "gee, funny how the thing you do the most often
| you're fastest at... could it be that you'd be just as fast
| at a better thing if you did it more than never?" .
| dahfizz wrote:
| Hey, programmer time is expensive. It is our duty to
| always do the easiest, most wasteful thing. /s
| maerF0x0 wrote:
| Future me's time is free to today me. :wink:
| chaps wrote:
| Hah, I'd love to work with the datasets you work with if it
| takes five minutes to do this. Or maybe you're just
| suggesting it takes five minutes to write out "TEXT" for
| each column type?
|
| The data I work with is messy, from hand written notes,
| multiple sources, millions of rows, etc etc. A single point
| that's written as "one" instead of 1 makes your whole idea
| fall on its face.
| itamarst wrote:
| For pile-of-strings data, there are still things you can
| do. E.g. in Pandas, if there are a small number of
| different values, switch to categoricals
| (https://pythonspeed.com/articles/pandas-load-less-data/
| item 3). And there's a new column type for strings that
| uses less memory
| (https://pythonspeed.com/articles/pandas-string-dtype-
| memory/).
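|
| For instance (file and column names invented), the conversion and
| the before/after measurement are one-liners in pandas:
|
|     import pandas as pd
|
|     df = pd.read_csv("notes.csv")  # hypothetical messy input
|
|     before = df["source"].memory_usage(deep=True)
|     df["source"] = df["source"].astype("category")  # few values
|     after = df["source"].memory_usage(deep=True)
|     print(before, after)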
| bee_rider wrote:
| Is enough data generated from handwritten notes that the
| memory cost is a serious problem? I was under the
| impression that hundreds of books worth of text fit in a
| gigabyte.
| dvfjsdhgfv wrote:
| You'll need to decide on a case by case basis. Many datasets
| I work with are being generated by machines, come from
| network cards etc. - these are quite consistent. Occasionally
| I deal with datasets prepared by humans and these are mediocre
| at best, and in these cases I spend a lot of time cleaning
| them up. Once that's done, I can clearly see whether some
| columns can be stored in a more efficient way or not. If the
| dataset is large, I do it, because it gives me extra freedom
| if I can fit everything in RAM. If it's small, I don't
| bother, my time is more expensive than potential gains.
| [deleted]
| Goz3rr wrote:
| Am I the only one here using Chrome or is everyone else just
| ignoring the table being broken? The author used an <object> tag
| which just results in Chrome displaying "This plugin is not
| supported". I'm unsure why they didn't just use an iframe
| instead.
| louwrentius wrote:
| I can only state for myself that on my Mac running chrome, the
| site works OK. I don't get any plugin messages.
| AdamJacobMuller wrote:
| https://www.redbooks.ibm.com/redpapers/pdfs/redp5510.pdf
|
| I want one of these.
|
| a system with 1TB of ram is 133k, 8.5mil for a system with 64TB
| of ram?
| chaxor wrote:
| Absolutely not. You can purchase a system with 1TB of ram, and
| some decent CPUs etc for ~25k. My lab just did this. That's far
| overpriced. 133k is closer to what you would spend if you used
| the machine with 1tb "in the cloud".
| didgetmaster wrote:
| I still remember the first advertisement I saw for 1TB of disk
| space. I think it was about 1997 and about the biggest
| individual drive you could buy was 2GB. The system was the size
| of a couple of server racks and they put 500 of those disks in
| it. It cost over $1M for the whole system.
| nimish wrote:
| That's insanely overpriced. A 128gb lrdimm is $1000. So a tb on
| a commodity 8 mem slot board would be 8k plus a few thousand
| for the cpu and chassis.
| sophacles wrote:
| I find it mind boggling that one can purchase a server with more
| RAM than the sum of all working storage media in my house.
| boredumb wrote:
| would be neat if I could do say, 6gb, and see the machines that
| are closest in size instead of only the upper limit
| bob1029 wrote:
| This kind of realization that "yes, it probably will" has
| recently inspired me to hand-build various database engines
| wherein the entire working set lives in memory. I do realize
| others have worked on this idea too, but I always wanted to play
| with it myself.
|
| My most recent prototypes use a hybrid mechanism that
| dramatically increases the supported working set size. Any
| property larger than a specific cutoff would be a separate read
| operation to the durable log. For these properties, only the
| log's 64-bit offset is stored in memory. There is an alternative
| heuristic that allows for the developer to add attributes which
| signify if properties are to be maintained in-memory or permitted
| to be secondary lookups.
|
| As a consequence, that 2TB worth of ram can properly track
| hundreds or even thousands of TB worth of effective data.
|
| If you are using modern NVMe storage, those reads to disk are
| stupid-fast in the worst case. There's still a really good chance
| you will get a hit in the IO cache if your application isn't
| ridiculous and has some predictable access patterns.
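|
| A toy sketch of that layout (names invented, no durability or
| concurrency handling): small values stay inline in the in-memory
| index, large ones are replaced by an offset into the log:
|
|     import os
|
|     CUTOFF = 256            # bytes; larger values stay on disk
|     LOG_PATH = "data.log"   # hypothetical durable log
|     index = {}              # key -> ("inline", bytes)
|                             #     or ("offset", position, length)
|
|     def put(key, value: bytes):
|         if len(value) <= CUTOFF:
|             index[key] = ("inline", value)
|             return
|         with open(LOG_PATH, "ab") as log:
|             pos = log.seek(0, os.SEEK_END)
|             log.write(value)
|         index[key] = ("offset", pos, len(value))
|
|     def get(key) -> bytes:
|         entry = index[key]
|         if entry[0] == "inline":
|             return entry[1]
|         _, pos, length = entry
|         with open(LOG_PATH, "rb") as log:
|             log.seek(pos)
|             return log.read(length)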
| saltcured wrote:
| I don't mean to discourage personal exploration in any way, but
| when doing this sort of thing it can also be illuminating to
| consider the null hypothesis... what happens if you let the
| conventional software use a similarly enlarged RAM budget or
| fast storage?
|
| SQLite or PostgreSQL can be given some configuration/hints to
| be more aggressive about using RAM while still having their
| built-in capability to spill to storage rather than hit a hard
| limit. Or on Linux (at least), just allowing the OS page cache
| to sprawl over a large RAM system may make the IO so fast that
| the database doesn't need to worry about special RAM usage. For
| PostgreSQL, this can just be hints to the optimizer to adjust
| the cost model and consider random access to be cheaper when
| comparing possible query plans.
|
| Once you do some sanity check benchmarks of different systems
| like that, you might find different bottlenecks than expected,
| and this might highlight new performance optimization quests
| you hadn't even considered before. :-)
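|
| For PostgreSQL that can be a handful of settings (shared_buffers,
| effective_cache_size, random_page_cost); on the SQLite side, a
| sketch of the equivalent hints (the sizes below are placeholders,
| not recommendations):
|
|     import sqlite3
|
|     con = sqlite3.connect("big.db")  # hypothetical database file
|     # Negative cache_size means KiB: roughly 8 GB of page cache.
|     con.execute("PRAGMA cache_size = -8000000")
|     # Memory-map the file; capped by the build's compile-time limit.
|     con.execute("PRAGMA mmap_size = 1073741824")
|     con.execute("PRAGMA temp_store = MEMORY")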
| none_to_remain wrote:
| Several years ago, my job at the time got a dev and a prod
| server with a terabyte of RAM. I liked the dev server because
| a few times I
| found myself thinking "this would be easy to debug if I had an
| insane amount of RAM" and then I would remember I did.
| ailef wrote:
| Basically everything fits in RAM up to 24TB.
| donkarma wrote:
| 64TB because of the mainframe
| jhbadger wrote:
| I was disappointed that the page didn't start offering vintage
| computers for very small datasets given that it has bytes and
| kilobytes as options ("your data is too large for a VIC-20, but
| a Commodore 64 should handle it")
| louwrentius wrote:
| That is actually a funny idea, I didn't think about that. I
| only revived and refreshed what somebody else came up with
| and made before me.
| louwrentius wrote:
| Extra anecdote:
|
| Around 2000, a guy told me he was asked to support very
| significant performance issues with a server running a critical
| application. He quickly figured out that the server ran out of
| memory. Option one was to rewrite the application to use less
| memory. He chose option two: increase the server's memory, going
| from 64 MB to 128 MB (yes, MB).
|
| At that time, 128 MB was an ungodly amount of memory and memory
| was very expensive. But it was still cheaper to just throw RAM at
| the problem than to spend many hours rewriting the application.
| z3t4 wrote:
| Your data might even fit in the CPU L3 cache... But most likely
| you want your data to be persistent. But how often do you
| actually "pull the plug" on your servers!? And what happens when
| SSDs are fast enough? Will we see a whole new architecture
| where the working memory is integrated into the CPU and the main
| memory is persistent?
| mnd999 wrote:
| That was the promise of optane. Unfortunately nobody bought it.
| tester756 wrote:
| What do you mean by "nobody"?
|
| Significant % of top 500 fortune used it
| bee_rider wrote:
| On one hand it would be cool to have some persistence in the
| CPU. On the other -- imagine if rebooting a computer didn't
| make the problems all go away. What a nightmare.
| rob_c wrote:
| marcinzm wrote:
| We went with this approach. Pandas hit GIL limits which made it
| too slow. Then we moved to Dask and hit GIL limits on the
| scheduler process. Then we moved to Spark and hit JVM GC
| slowdowns on the amount of allocated memory. Then we burned it
| all down and became hermits in the woods.
| [deleted]
| mritchie712 wrote:
| Did you consider Clickhouse? Joins are slow, but if your data
| is in a single table, it works really well.
| marcinzm wrote:
| We were trying to keep everything on one machine in (mostly)
| memory for simplicity. Once you open up the pandoras box of
| distributed compute there's a lot of options including other
| ways of running Spark. But yes, in retrospect, we should have
| opened that box first.
| anko wrote:
| I have solved a similar problem in a similar way and I've
| found polars <https://www.pola.rs/> to solve this quite
| well without needing clickhouse. It has a Python library
| but does most processing in Rust, across multiple cores.
| I've used it for data sets up to about 20GB no worries, but
| my computer's RAM became the issue, not polars itself.
| marcinzm wrote:
| We were using 500+gb of memory at peak and were expecting
| that to grow. If I remember correctly, we didn't go with
| Polars because we needed to run custom apply functions on
| DataFrames. Polars had them, but the function took a tuple
| (not a DF or dict), which with 20+ columns makes for really
| error-prone code. Dask and Spark both
| supported a batch transform operation so the function
| took a Pandas Dataframe as input and output.
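|
| For reference, the Dask version of that batch-transform shape (the
| dataset path and columns are invented) passes a whole pandas
| DataFrame per partition:
|
|     import dask.dataframe as dd
|     import pandas as pd
|
|     def enrich(pdf: pd.DataFrame) -> pd.DataFrame:
|         # Arbitrary per-batch logic against named columns.
|         pdf["ratio"] = pdf["clicks"] / pdf["views"]
|         return pdf
|
|     ddf = dd.read_parquet("events/*.parquet")  # hypothetical data
|     result = ddf.map_partitions(enrich).compute()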
| mumblemumble wrote:
| I have decided that all solutions to questions of scale fall
| into one of two general categories. Either you can spend all
| your money on computers, or you can spend all your money on
| C/C++/Rust/Cython/Fortran/whatever developers.
|
| There's one severely under-appreciated factor that favors the
| first option: computers are commodities that can be acquired
| very quickly. Almost instantaneously if you're in the cloud.
| Skilled lower-level programmers are very definitely not
| commodities, and growing your pool of them can easily take
| months or years.
| jbverschoor wrote:
| Buying hardware won't give you the same performance benefits
| as a better implementation/architecture.
|
| And if the problem is big enough, buying hardware will cause
| operational problems, so you'll need more people. And most
| likely you're not gonna wanna spend on people, so you get a
| bunch of people who won't fix the problem, but buy more
| hardware.
| mumblemumble wrote:
| Ayup.
|
| And yet, people still regularly choose to go down a path
| that leads there. Because business decisions are about
| satisficing, not optimizing. So "I'm 90% sure I will be
| able to cope with problems of this type but it might cost
| as much as $10,000,000" is often favored above, "I am 75%
| sure I might be able to solve problems of this type for no
| more than $500,000," when the hypothetical downside of not
| solving it is, "We might go out of business."
| marcinzm wrote:
| >And if the problem is big enough, buying hardware will
| cause operational problems, so you'll need more people. And
| most likely you're not gonna wanna spend on people, so you
| get a bunch of people who won't fix the problem, but buy
| more hardware.
|
| That's why people love the cloud.
___________________________________________________________________
(page generated 2022-08-02 23:01 UTC)