[HN Gopher] The RAM Myth
       ___________________________________________________________________
        
       The RAM Myth
        
       Author : signa11
       Score  : 174 points
        Date   : 2024-12-18 22:43 UTC (1 day ago)
        
 (HTM) web link (purplesyringa.moe)
 (TXT) w3m dump (purplesyringa.moe)
        
       | CountHackulus wrote:
       | Actual benchmarks and graphs. Thank you so much for that.
        
       | Const-me wrote:
        | On my 3-year-old laptop, system memory (dual-channel
        | DDR4-3200) delivers about 50 GB / second. You measured 50M
        | elements per second, which at 8 bytes per element translates
        | to 400 MB / second. If your hardware is similar, your
        | implementation reached less than 1% of the theoretical memory
        | bandwidth.
       | 
       | When your data is indeed big and the performance actually
       | matters, consider doing something completely different.
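        | 
        | Back-of-the-envelope on that figure (a sketch, assuming two
        | standard 64-bit DDR4-3200 channels):
        | 
        |     channels, bus_bytes, transfers = 2, 8, 3200e6
        |     peak = channels * bus_bytes * transfers   # 51.2 GB/s
        |     measured = 50e6 * 8                       # 0.4 GB/s
        |     print(f"{measured / peak:.1%}")           # ~0.8%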
        
         | sgarland wrote:
         | In the real world, your application is running in a container
         | among hundreds or thousands of other containers. The system's
         | resources are also probably being managed by a hypervisor. The
         | CPU is shared among N tenants _and_ is overcommitted. It's not
         | that much to ask to optimize where you can, when it isn't
         | unreasonably difficult.
         | 
         | More importantly, this attitude is precisely why software sucks
         | today. "[CPU, Memory, Disk] is cheap, engineering time isn't."
         | Fuck that. Bad engineering time is _incredibly_ expensive. This
         | is an excuse to not spend time learning the ins and outs of
         | your language, and your hardware.
         | 
         | It also frustrates me to no end that people are so deeply
         | incurious. This seems to only happen in tech, which is baffling
         | considering how incredibly broad and deep the industry is.
         | Instead, everyone clamors to learn the new Javascript
         | abstraction that lets them get even further away from the
         | reality of what the magic boxes are doing.
        
           | kmarc wrote:
           | Agree with the sentiment. However it's hard to stay curious,
           | even harder to stay up-to-date.
           | 
           | I liked fiddling with storage for a while, got really into
           | it, deepened my knowledge about it. A couple years later I
           | realized everything else (networking, architectures,
            | languages) developed so much, most of my (non-basic) knowledge
           | was obsolete. Picking up where I left off with all
           | technologies is incredibly hard and caused fatigue.
           | 
            | Now I'm at a point where I have the feeling I don't know
            | anything about anything. It's factually not true, but my
            | gut tells me this. Were I younger, this would trigger a
            | lot of anxiety. Thankfully I can handle this by now.
        
             | sgarland wrote:
             | That's understandable. I'm into databases (both
             | professionally and personally), so I get to touch a wide
             | variety of things. Following a query the whole way is
             | pretty neat. Typed query --> client --> wire protocol -->
             | network --> server --> DB frontend --> DB planner --> DB
             | storage engine --> OS --> Memory / Disk. `blktrace` is
             | super interesting to watch commands being sent to the disk.
        
               | buran77 wrote:
               | When you are deeply involved in tech both personally and
               | professionally you are probably very passionate about
                | this and it makes sense that you'd only look closely at this
               | field and think "people are so deeply incurious. This
               | seems to only happen in tech".
               | 
                | Tech is also one of the (if not _the_) most dynamic and
                | fast-evolving fields a normal person will ever touch.
               | Curiosity in tech can drain every single bit of free time
               | and energy you have and you will hardly keep up with the
               | progress, maybe barely scratch the surface. But people's
               | available free time and energy wanes and curiosity is a
               | collateral victim.
               | 
               | I've painfully gone through the entire cycle of this,
               | including the bit of resurgence later on when you have a
               | combination of free time but less energy. What I can say
               | is that this absolutely _does not_ happen just in tech.
               | If anything tech is flooded with people with more
               | curiosity than almost any other field.
        
               | sgarland wrote:
               | > When you are deeply involved in tech both personally
               | and professionally you are probably very passionate about
               | this and makes sense that you'd only look closely at this
               | field and think "people are so deeply incurious. This
               | seems to only happen in tech".
               | 
               | Good point. I commented in a sibling post to the same
               | effect.
               | 
               | I've certainly felt the personal strain of time sink and
               | procrastination in my homelab. It's currently running
               | k3os, which has been dead for about four years now,
               | because everything I want is still running, and I never
                | seem to have the motivation on the weekend to yell at my
               | laptop when I could be playing board games.
               | 
               | > including the bit of resurgence later on when you have
               | a combination of free time but less energy.
               | 
               | I'm guessing that will happen in another decade or so,
               | when my kids are grown.
        
           | akira2501 wrote:
           | > "[CPU, Memory, Disk] is cheap, engineering time isn't."
           | Fuck that.
           | 
            | It is. It's absurdly cheap. I always check the amount of
           | time it would take for me to make a performance improvement
           | against the runtime costs of my functions. It's rarely worth
           | the extra effort.
           | 
           | Seriously, until you get into the millions of records per
           | second level, you're almost never benefited. You may make
           | your function 2x faster, at a cost of additional complexity,
           | but you never run it enough in a year for it to pay itself
           | back.
           | 
           | > Bad engineering time is _incredibly_ expensive.
           | 
           | Engineering time is expensive. Period. It speaks to the need
           | to minimize it.
           | 
           | > This is an excuse to not spend time learning the ins and
           | outs of your language, and your hardware.
           | 
           | All of which will change in a few years, which is fine, if
           | you're also committing to keeping _all that code_ up to date
           | right along with it. Otherwise you end up with an obscure
           | mess that you have to unwind 5 years of context to understand
           | and fix again.
           | 
           | Complexity and available mental contexts are forgotten costs.
           | If your language even has that many "ins and outs" to begin
           | with you may want to reconsider that.
        
             | sgarland wrote:
             | > You may make your function 2x faster, at a cost of
             | additional complexity, but you never run it enough in a
             | year for it to pay itself back.
             | 
             | I'm not talking about increased complexity, I'm talking
             | about extremely basic things that take zero extra time,
             | like using the correct data structure. For example, in
              | Python:
              | 
              |     In [8]: a = array("i", (x for x in range(1_000_000)))
              |        ...: l = [x for x in range(1_000_000)]
              |        ...: d = deque(l)
              |        ...: for x in (a, l, d):
              |        ...:     print(f"{sys.getsizeof(x) / 2**20} MiB")
              |     3.902385711669922 MiB
              |     8.057334899902344 MiB
              |     7.868537902832031 MiB
             | 
             | Very similar structures, with very different memory
             | requirements and access speeds. I can count on one hand
             | with no fingers the number of times I've seen an array
             | used.
             | 
             | Or knowing that `random.randint` is remarkably slow
             | compared to `random.random()`, which can matter in a hot
              | loop:
              | 
              |     In [10]: %timeit math.floor(random.random() * 1_000_000)
              |     31.9 ns +- 0.138 ns per loop (mean +- std. dev. of
              |     7 runs, 10,000,000 loops each)
              | 
              |     In [11]: %timeit random.randint(0, 1_000_000)
              |     163 ns +- 0.653 ns per loop (mean +- std. dev. of
              |     7 runs, 10,000,000 loops each)
             | 
             | > All of which will change in a few years, which is fine,
             | if you're also committing to keeping _all that code_ up to
             | date right along with it.
             | 
             | With the exception of list comprehension over large ranges
             | slowing down from 3.11 --> now, I don't think there's been
             | much in Python that's become dramatically worse such that
             | you would need to refactor it later (I gather the
             | Javascript community does this ritual every quarter or so).
             | Anything being deprecated has years of warning.
        
               | akira2501 wrote:
               | > which can matter in a hot loop:
               | 
               | 163ns - 31.9ns == 131.1ns
               | 
               | This will need to happen 7.6 million times to save me 1
               | CPU second. On AWS lambda with 1GB of memory this will
               | cost you a whopping: $0.0000166667.
               | 
               | The point is, you're not even wrong, but there are
               | vanishingly few cases where it would actually matter to
               | the bottom line in practice. You're taking an absolutist
               | point of view to a discipline which thoroughly rejects
               | it.
               | 
               | This is what I love about the cloud. It forces you to
               | confront what your efforts are actually worth by placing
               | a specific value on all of these commodities. In my
               | experience they're often worth very little given that
               | none of us have the scale of problems where this would
               | show actual returns.
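                | 
                | Checking that arithmetic (using the Lambda price per
                | second quoted above):
                | 
                |     saved_ns = 163 - 31.9     # ns saved per call
                |     calls = 1e9 / saved_ns    # calls per CPU-second saved
                |     print(f"{calls / 1e6:.1f}M calls")       # ~7.6M
                |     print(f"${0.0000166667:.10f} per CPU-second")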
        
               | norir wrote:
               | Sure, but the cumulative effects of pervasive mediocre to
               | bad decisions do add up. And it isn't just about cloud
               | compute cost. Your own time is stolen by the slow ci jobs
               | that you inevitably get stuck waiting for. For me, I
               | prioritize my own personal happiness in my work and this
               | mindset taken too far makes me unhappy.
        
               | zrm wrote:
               | Reaching the scale where it shows actual returns isn't
               | all that difficult. You need it to happen 7.6 million
               | times to save 1 CPU second, but each CPU core can execute
               | it nearly that many times every second.
               | 
               | Probably you don't leave it generating only random
               | numbers all day, but suppose you do generate a good few,
               | so that it's 1% of your total compute budget, and you
               | have only a modest load, using on average four CPU cores
               | at any given time. Then saving that amount of computation
               | will have saved you something like $15/year in compute,
               | recurring. Which isn't actually that bad a return for ten
               | seconds worth of choosing the right function.
               | 
               | There are also a lot of even fairly small entities for
               | which four cores is peanuts and they're running a hundred
               | or a thousand at once, which quickly turns up the price.
               | 
               | And even the things with internet scale aren't all that
               | rare. Suppose you're making a contribution to the
               | mainline Linux kernel. It will run on billions of
               | devices, possibly for decades. Even if it doesn't run
               | very often, that's still a lot of cycles, and some of the
               | kernel code _does_ run very often. Likewise code in
               | popular web browsers, javascript on popular websites or
               | in popular libraries, etc.
               | 
               | You don't have to work for Google to make a contribution
               | to zlib and that kind of stuff has the weight of the
               | world on it.
        
               | Aeolun wrote:
               | You are saying that the potential gains are less than an
                | order of magnitude. That makes them a pretty hard sell in
               | most instances.
        
               | masklinn wrote:
               | > Very similar structures, with very different memory
               | requirements and access speeds. I can count on one hand
               | with no fingers the number of times I've seen an array
               | used.
               | 
               | That is obvious when you actually check the access speed
               | of arrays and find out it is about half that of lists on
               | small integers (under 256), and worse on non-small
               | integers. That is literally the opposite trade off of
               | what you want in 99.99% of cases.
               | 
               | Deques are even less of a consideration, they're unrolled
               | linked lists so random access is impossible and iteration
               | is slower, you use a deque when you need _a deque_ (or at
               | least a fifo), aka when you need to routinely manipulate
               | the head of the collection.
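                | 
                | Easy to check (a minimal sketch; exact numbers vary by
                | machine and CPython version):
                | 
                |     import timeit
                |     from array import array
                | 
                |     l = list(range(1_000_000))
                |     a = array("i", l)
                | 
                |     # Summing the array must box each raw int into a
                |     # Python object on access; the list already holds
                |     # boxed ints.
                |     print("list :", timeit.timeit(lambda: sum(l), number=20))
                |     print("array:", timeit.timeit(lambda: sum(a), number=20))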
        
               | sgarland wrote:
               | It depends on your constraints. If you're limited by RAM,
               | arrays make a lot of sense for certain applications. If
               | you need Python's buffer protocol, again, they make a lot
               | of sense.
               | 
               | As to deques, yes, they have specific uses, and being
               | slightly smaller isn't usually a selling point for them.
               | My point was that I have seen many cases where an
               | incorrect data structure was used, because a list or dict
               | was "good enough." And sure, they generally are, but if
               | the language ships with other options, why wouldn't you
               | explore those?
        
               | saagarjha wrote:
               | Ok, but neither of these matter until you know they
               | matter. Seriously. Like, yes, it's nice they exist and
               | that they are available for when you want them, but I
               | would generally advise people to use a list or
               | random.randint, if only because I value their confidence
               | with them over the 2x performance win, because most
               | workloads are not simply just a single array or random
               | number generator loop. And, to be clear, I work on
               | performance professionally: most of my job is not making
               | things as fast as possible, but considering the tradeoffs
               | that go into writing that code. I understand your example
               | as showing off an interesting performance story but in
               | the real world most workloads are more complex than what
               | can be solved with using a rare but drop-in API.
        
               | IanCal wrote:
               | Yes but if you want to do things in a less obvious way
               | you should be aware of the downsides, such as bias in
               | your random numbers. Also making sure you watch out for
               | off by one errors.
               | 
                | Stolen the number to show this off well from a bug report
                | somewhere:
                | 
                |     from collections import Counter
                |     from math import floor
                |     from random import randint, random
                | 
                |     random_counter = Counter()
                |     for i in range(10_000_000):
                |         result = floor(random() * 6755399441055744) % 3
                |         random_counter[result] += 1
                |     print("floor method", random_counter.most_common(3))
                | 
                |     randint_counter = Counter()
                |     for i in range(10_000_000):
                |         result = randint(0, 6755399441055743) % 3
                |         randint_counter[result] += 1
                |     print("randint method", randint_counter.most_common(3))
                | 
                | Result:
                | 
                |     floor method [(1, 3751972), (0, 3333444), (2, 2914584)]
                |     randint method [(1, 3334223), (2, 3333273), (0, 3332504)]
               | 
               | https://bugs.python.org/issue9025
        
               | sgarland wrote:
                | Have you run this in any modern version of Python? It's
               | been fixed for a long time.
        
               | IanCal wrote:
               | 3.10 so I redid it on 3.13.1, same results.
        
               | sgarland wrote:
               | Ugh, I was checking `randrange` (as the bug mentions),
               | not `random`. I stand corrected.
        
               | IanCal wrote:
               | Ah yeah sorry I should have mentioned it wasn't the same,
               | I used it as it has a nice number that shows the bias to
               | a pretty extreme degree.
        
               | rcxdude wrote:
               | Python is 100% the wrong language to worry about this in.
               | If your hot loops are in python and you care about
               | performance, you should be rewriting them in another
               | language.
        
               | sgarland wrote:
               | Agreed; I used it partially because TFA used it to
               | demonstrate ideas, and partially because I'm very
               | familiar with it.
               | 
               | But you're correct, of course. When I need something to
               | go faster in Python, I write it in C. If it's more than a
               | small section, then a full rewrite is reasonable.
        
               | spookie wrote:
               | Even if so, their point still stands. It's a tiny change
               | that grants huge speedups.
        
               | Cthulhu_ wrote:
               | I was curious why randint was slower than random and
               | found a good SO answer [0] (it's like chatgpt by humans
               | for you youngsters out there), the gist of it is that
               | `random()` calls a C function directly (which I presume
               | goes straight to a syscall), whereas `randint` is
               | implemented in Python and has a load of preconditions /
               | defensive programming (which I presume is executed at
                | every invocation, so several potential branches etc. it has
               | to check every time).
               | 
               | Of course, if it's not in a tight loop and you need a
               | whole integer between a range then it's the most
               | developer-ergonomic way to get it. If you do need more
               | performance but want to keep the ergonomics, writing your
               | own random-int-between-two-numbers is fairly
               | straightforward but it'll take some time.
               | 
               | [0] https://stackoverflow.com/a/58126026/204840
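                | 
                | For instance (a hypothetical helper, not a stdlib API;
                | mind the bias caveat demonstrated upthread before using
                | the float path on very large ranges):
                | 
                |     import math, random
                | 
                |     def fast_randint(lo, hi):
                |         # Skips randint()'s argument checking; slightly
                |         # biased once the span approaches 2**53.
                |         return lo + math.floor(random.random() * (hi - lo + 1))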
        
             | jjav wrote:
             | > You may make your function 2x faster, at a cost of
             | additional complexity, but you never run it enough in a
             | year for it to pay itself back.
             | 
             | "You" is both singular and plural, which is often the
             | problem with this thinking.
             | 
             | Is it worth spending a month of engineering time to make a
             | page load in 50ms instead of 2s? Seems like a lot of
             | engineering time for a noticeable but somewhat minor
             | improvement.
             | 
             | But now, what if you have a million users who do this
             | operation 100x/day? _Absolutely_ worth it!
             | 
             | For example, I sure wish atlassian would spend a tiny bit
              | of effort on making Jira faster. Even if it is 1 second
             | per ticket, since I'm viewing 100+ tickets per day that
              | adds up. And with many hundreds of us at the company
              | doing the same thing, it really adds up.
        
               | ddtaylor wrote:
               | > 50ms instead of 2s
               | 
                | In the past I believe Google was very adamant that page
               | load time perception was very important to other metrics.
        
               | nottorp wrote:
               | Most of the time they just move the expensive processing
               | to the user's browser so they don't have to pay for it :)
        
               | sfn42 wrote:
               | You're probably not going to achieve that with the kind
               | of optimization described in this article though.
        
               | xlii wrote:
                | Nit: 50ms vs 2000ms is a 40x speed increase, i.e. ~1.5
                | orders of magnitude.
               | 
                | I still keep in mind the words of my database
                | optimization lecturer, who said that in his experience
                | optimizations below 1 OOM aren't worth it and most
                | "good ones" are 3+.
               | 
               | > Absolutely worth it!
               | 
                | A far-reaching assumption. Even the biggest companies have
               | limited resources (even if vast). Would you rather
               | improve load times by 2x (from 500ms to 250ms) or improve
               | checkout reliability from 99% to 99.5%? And there is much
               | more to consider on some levels (e.g. planning for
               | thermal efficiency is fun).
               | 
               | Software development is always a game of choice.
        
               | sgarland wrote:
                | > I still keep in mind the words of my database
                | optimization lecturer, who said that in his experience
                | optimizations below 1 OOM aren't worth it and most
                | "good ones" are 3+
               | 
               | Like everything, it depends. Is the query gating a lot of
               | other things, especially things that can be done in
               | parallel? Shaving 10 ms off might very well be
               | meaningful. Is it a large OLAP query, and the owning team
               | has SLAs that depend on it? Going from 60 --> 55 minutes
               | might actually matter.
               | 
               | The two biggest performance-related issues with RDBMS
               | that I deal with, aside from indexing choices, are over-
               | selection (why on earth do ORMs default to SELECT * ?),
               | and poor JOIN strategies. Admittedly the latter is often
               | a result of poor schema design, but for example, the
               | existence of semi/anti-joins seems to be uncommon
               | knowledge.
        
               | xlii wrote:
               | If SLAs are involved then I'd argue it's not about
               | optimization but about business goals, which
               | unsurprisingly take precedence.
               | 
               | But there is another case that is very similar: threshold
                | passing (or, as I like to call it, waterfalling). Small
                | inefficiencies add up, and at some point the small
                | slowdowns reach critical mass and some significant
                | system breaks everything else.
               | 
                | When a system was designed by competent engineers, huge
                | optimizations aren't easy, so it's shaving a couple millis
                | here and a couple millis there. But, as in the first case,
               | I'd categorize it as a performance failure.
        
               | Cthulhu_ wrote:
               | As always, https://xkcd.com/1205/ is a great reference to
               | keep in mind.
               | 
               | That said, most developers (I assume, by most I mean
               | myself) never work with code where the work they do has
               | any serious impact on performance; it's layers on top of
               | code that should be fast, but the work I do is
               | implementing business logic and screens, in which case
               | readability vastly trumps performance.
               | 
               | I mean right now I'm writing a line of code that can be
               | represented as a nested ternary or a function call with
               | some more written-out if/elses. The ternary outperforms
               | the function call 2:1 or more, but either one can be
               | executed hundreds of times a second BUT will only be
               | executed once or twice per page view. It's not worth the
               | tradeoff, even if this line of code will be executed
               | hundreds of millions of times overall.
        
               | binary132 wrote:
               | Jira is incredibly slow, almost unusable. So awful.
        
               | pphysch wrote:
               | More features = more customers = more revenue
               | 
               | More data collection = more revenue
               | 
               | More crap layered on = everything is slower
               | 
               | Everything sucks = more support demand = supply price
               | leverage = more revenue
               | 
               | Enterprise software is necessarily slow and complicated,
               | and not for your benefit.
        
               | binary132 wrote:
               | I'm not quite following your point. It sounds like you're
               | agreeing that Jira sucks?
        
             | Const-me wrote:
             | > until you get into the millions of records per second
             | level, you're almost never benefited
             | 
             | Yeah, but the software landscape is very diverse.
             | 
              | In my job (CAM/CAE) I often handle data structures with
              | gigabytes of data. Worse, unlike e.g. multimedia
              | frameworks, many algorithms operating on these numbers
              | are global, i.e. they can't be represented as a pipeline
              | which splits data into independent chunks and processes
              | them sequentially.
              | 
              | Making performance-critical functions twice as fast might
              | save hours of time for a single end user in a single day.
        
           | tonyarkles wrote:
           | > In the real world, your application is running in a
           | container among hundreds or thousands of other containers
           | 
           | I mean, that's an engineering decision too. In my day job
           | we're capturing, pre-processing, running inference on, and
           | post-processing about 500Mpx/s worth of live image data at
           | about 80ms/frame end-to-end at the edge. The processor SoM
           | costs about $3000/unit and uses about 50W running flat out.
           | The retail cost of our overall product is two orders of
           | magnitude more than what the processor is worth but it incurs
           | zero recurring costs for us.
           | 
           | Edit: and it's got 64GB of Unified RAM that I've got all to
           | myself :)
        
             | sgarland wrote:
             | I was wondering if someone from a different sub-industry
             | would disagree here :D
             | 
             | That sounds like a very interesting job, with quite
             | different requirements and constraints from what I'm used
             | to. One day, I'll get a job where application latency is
             | critical, and optimizations matter deeply. Undoubtedly I'll
             | find something else to be upset about, but at least it'll
             | be a new complaint.
        
               | tonyarkles wrote:
               | > Undoubtedly I'll find something else to be upset about
               | 
               | Vendor SDKs. You'll be upset about that I guarantee :)
        
           | ddtaylor wrote:
           | I can somewhat echo some of the statements here and provide
           | my own experience that is similar.
           | 
           | I spend a decent amount of time writing decent C++ code. My
           | friends in other parts of the industry are writing certainly
           | better C++ code than me because they are doing it in
           | environments that are more constricted in various ways. In
           | either case, I do spend my time catching up a bit and would
           | consider myself a competent C++21 programmer in some ways.
           | 
           | My experience and my conversations lead me to understand
           | there is so much left on the table with even the most basic
           | implementations. When I implement it correctly in C++ we get
           | close to some of the theoretical limits for the hardware for
           | some workloads, compared to something that is literally 1% as
           | fast running in NodeJS.
           | 
            | With that said, for so many situations I cannot justify the
           | time and complexity to use C++ for many projects. At least
           | for the stage most projects are in. In theory this
           | optimization can happen later, but it never really does
           | because the horizontal (or sometimes even vertical) scaling
           | kicks in and we're all just willing to throw a few more
           | dollars at the problem instead of re-engineering it. Sure,
           | some of the really big companies like Netflix find a decent
           | reason from time to time to invest the engineering time
           | squeeze out those numbers, but it's becoming the exception
           | and not the rule.
        
             | Thorrez wrote:
             | >C++21
             | 
             | There's C++20 and C++23. No C++21.
        
               | ddtaylor wrote:
               | Sorry, I meant to type C++11
        
             | spookie wrote:
             | I've also found that not having some optimization mindset
              | from the get-go limits your product to achieving only a
              | local maximum of performance in the end. It might not
              | even be a good local maximum.
             | 
             | It's best to have at least some optimization considerations
             | from the start. I'm saying some because too much is a
             | problem.
        
           | dakiol wrote:
           | It's hard to learn the ins and outs of dozens of programming
           | languages. One doesn't usually just use one or two PL over
           | their entire career. I have worked professionally with at
           | least PHP, Java, Python, JS, Go, and Ruby. That without
           | taking into account the respective libraries and frameworks
           | (and without taking into account as well the myriad of other
           | components like dbs, web servers, caches, etc.)
           | 
            | It sounds like an excuse, I know. The truth is I just don't
           | have that much time.
        
           | christophilus wrote:
           | > people are so deeply incurious. This seems to only happen
           | in tech
           | 
           | It happens everywhere. If anything, techies are more curious
           | than the average Joe. How many fellow programmers can you
           | nerd-snipe with a comment that makes them say, "Well, that's
           | strange..."
        
           | duckmysick wrote:
           | > It also frustrates me to no end that people are so deeply
           | incurious. This seems to only happen in tech
           | 
           | This doesn't match my observations. In many fields, training
           | is limited and hides the details. We train workers to repeat
           | specific tasks and they excel at them. But they don't have
           | conceptual understanding. Any explanation of why things are
           | done the way they are is surface-level. You can see it when a
            | procedure fails for some reason. The way to deal with it is to
           | 1) do basic checks 2) try again and if it still fails 3)
           | delegate to someone else. Nobody is going to troubleshoot or
           | optimize tasks unless that's their main job.
           | 
           | It happens in construction, in the kitchens, on the assembly
           | lines, and in the offices. It happens because it gets the job
           | done.
        
             | sgarland wrote:
             | You're right. My wording was flat-out wrong, and I
             | apologize. A more accurate sentiment (IMO) would be "that
             | this happens in tech is baffling, considering..."
        
             | marcosdumay wrote:
             | Yes, but in most fields, conceptual understanding doesn't
             | matter for most tasks.
             | 
             | That's the problem with software. A brick-layer doesn't
             | need to understand structural stability, but in software
             | every task is structure validation, and people expect
             | brick-layers to be productive.
        
               | phatskat wrote:
               | > A brick-layer doesn't need to understand structural
               | stability
               | 
               | Maybe a "junior" brick layer doesn't need to, as much as
               | a junior engineer doesn't need to understand the ins and
               | outs of their language. But a senior brick layer, or an
               | architect, needs to understand more of the details so
               | that they can set out the plan for the junior.
        
             | hinkley wrote:
             | There are a lot of developers who learn how to do a task
             | and never question whether the task could be automated,
             | either away or to reduce errors. They just do the same
             | mundane task four hours a week in perpetuity.
             | 
             | That's the part that frustrates me. Your job is to automate
             | things. So why aren't you automating things?
        
           | bulatb wrote:
           | Some apple-pickers think the point of picking apples is to
           | prove how good you are at climbing trees. They'll never not
           | be mad at pluckers who just pluck the nearest apple, and the
           | people who reward them for the apples, and the world that
            | doesn't understand the pluckers picked their apples _wrong_,
           | and didn't even earn them, and they're not even real apples.
        
             | pif wrote:
             | Sir, your comment is poetry. I commend you.
        
             | sgarland wrote:
             | It's one thing if you know the optimal (or at least
             | approaching it) solution, and deliberately use something
             | else for other reasons. At least thought went into it.
             | 
             | I'm not upset at this simply for purity's sake; it directly
             | impacts me in my day-to-day job. A hundred devs
             | individually decide that good enough is good enough
             | (deliberately or otherwise), and suddenly my infrastructure
             | is dealing with traffic and query patterns it shouldn't
             | have to, and then I have to explain what network bandwidth
             | limits are, and why it's absurd that we hit them in the
             | first place.
             | 
             | In general, what I'm railing against is the continued push
             | towards simplification and abstraction, while not requiring
             | the understanding of what lies beneath. The hyperscalers
             | certainly aren't trying to foster that attitude in-house -
             | someone has to build the abstractions in the first place,
             | after all, and keep datacenters humming. Yet they're
             | happily shoveling it out, because it's a win-win for them:
             | fewer people have the skills necessary to compete (or even
             | to simply not use their services), and more people are able
             | to use their services.
        
               | jebarker wrote:
               | > it directly impacts me in my day-to-day job
               | 
               | It directly impacts anyone that uses software everyday.
               | Most people don't seem to understand and/or care how
               | poorly the software they use runs.
        
               | yetihehe wrote:
               | Most people don't care about their job, they just want to
               | put in minimum viable effort, get paid and go home to
               | their family. Typically they do this because putting more
               | effort didn't bring them anything (not even some
               | recognition of efforts) or even backfired. It's
               | applicable to all jobs, not only in tech.
        
               | jebarker wrote:
               | I get that. My comment has nothing to do with people's
               | pride in their work. I was commenting that there's not
               | enough user demand for better preforming software because
               | users don't seem to mind the software hellscape we
               | currently live in.
        
               | didgetmaster wrote:
               | It's like the contractor (or subcontractor) who built the
               | house you live in. They don't have to pay your monthly
               | heating or cooling bills, so they often do whatever is
               | fastest and cheapest for them.
               | 
               | Who cares if they didn't insulate a wall correctly or put
               | an inefficient system in? If they can get it by the
               | inspector, they will do it. If they can save themselves
               | $100, that is all that matters to them. Who cares if it
               | adds $10 to each month's utility bills for the next 30
               | years? They got their money and are now long gone.
        
               | immibis wrote:
               | To get a certain behaviour, incentivize that behaviour.
        
               | sgarland wrote:
               | This is what the mythical Full Stack and DevOps movements
               | were supposed to do. They did not.
        
             | jebarker wrote:
             | The part missing from this analogy is that the low-hanging
             | fruit pickers are slowly killing the trees.
        
               | throwaway519 wrote:
               | Why are fruit pickers hanging from trees at all?
        
               | blipvert wrote:
               | Just get a panking pole already!
        
             | mrkeen wrote:
             | Climbing a ladder takes more time, so in order to get the
             | apples to market faster, the apple tree owners keep the
             | pickers focused on only picking the apples within arm's
             | reach.
             | 
             | The owners also disallow the use of ladders because the
             | pool of candidates to hire remains bigger.
             | 
             | And the highest 90% of apples remain unpicked.
        
               | dsr_ wrote:
               | Having recently been to an apple orchard...
               | 
               | - Apple trees are deliberately bred to be short and wide.
               | 
               | - An apple-picking-stick has a sort of basket on the end,
               | which allows an apple to be selected and pulled off from
               | 1-2 meters away.
               | 
               | What lessons can be learned?
               | 
               | - Using the right tools improves your yield, safely.
               | 
               | - Having the data in a convenient place means you can use
               | it more readily.
        
             | binary132 wrote:
             | Somehow, the trees keep getting exponentially bigger, and
             | yet the pies we're getting are actually kinda just getting
             | smaller and worse. Maybe just picking the nearest apples
             | isn't working out as well as they say it is.
        
             | foobarchu wrote:
             | Maybe a better analogy would be farming.
             | 
             | A farmer who understands the value of crop rotation is
             | completely right to be upset when other farmers are
             | monocropping and killing the soil, or when they squander
             | the local aquifer. It's directly impacting the
             | sustainability of their industry.
        
           | killingtime74 wrote:
           | Life is very different for many people and I think we just
           | need to build empathy for people who treat a job as just a
           | job. If they deliver and are not unkind about it there's
           | nothing wrong about not going above and beyond the bare
           | minimum, which is what they are paid for.
        
             | sgarland wrote:
             | As I commented above, a large part of my umbrage stems from
             | the impact these decisions have on my job. I dislike having
             | to work harder because others didn't want to go above the
             | bare minimum.
             | 
             | This isn't unique to any one company, nor my personal
             | experience. At my last job, my team was initially human
             | triaging practically all alerts, because we had the rare
             | ability to read logs. I spoke to someone at Slack once who
             | was stuck doing the same. That's an absurdly low expected
             | level of competence.
        
               | badpun wrote:
               | You don't _have_ to work harder, as evidenced by those
               | people who do the bare minimum. You just care about your
               | work more than people who pay you for it (or the manager
               | hired to manage you), which is the cause of your
               | frustration here IMO.
        
           | jrochkind1 wrote:
           | > It also frustrates me to no end that people are so deeply
           | incurious. This seems to only happen in tech,
           | 
           | I mean software engineering and computer science from the
           | start is _built on abstraction_ , it makes sense that it
           | attracts people who find taking the abstraction as reality
           | and ignoring the underlying "real" layers to be attractive.
           | 
           | (Also it's not "real" until you get down to... voltages?
           | Electrons? I don't know... the whole thing about our whole
           | endeavor being built of abstraction is there are so many
           | layers. Nobody thinks everyone has to routinely go ALL THE
           | WAY down, but I get your point that being more curious than
           | many tend to be about a couple layers down is helpful, with
           | how far down a couple layers is depending on the nature of
           | your work. I'm not saying it's not helpful, I'm suggesting
           | possibly why -- because we select for people who swim happily
           | in abstractions, who love em even. And the success of all of
           | CS and software engineering is based on the ability for it to
            | _often work_. Nobody really has to be curious about voltages
           | when writing a python script to read and write from storage)
        
           | crabbone wrote:
           | Nope. If a program is meant to run on the entire CPU, and if
           | it's meant to use all the memory, then that's how it will
           | run. This isn't a question of modernity or availability of
           | tools or programmers' salary.
           | 
           | Or, to put it differently, using more or less memory / CPU
           | isn't really an indication of anything about a program.
           | Sometimes you mean to use all that's available, other times
           | you mean to use as little as you can. There's no way to tell
           | which one should be done w/o knowing your goals.
           | 
           | For example, the entire free memory on Linux is a free real
           | estate for filesystem caching. So that memory is not wasted,
           | if the system can help it. So, having your entire memory used
           | (to support filesystem cache) wouldn't be really a
           | performance problem (quite the contrary, it would look like
           | the system is doing a good job at caching whatever reads your
           | applications are making).
        
           | tharkun__ wrote:
            |     In the real world, your application is running in a
            |     container among hundreds or thousands of other
            |     containers. The system's resources are also probably
            |     being managed by a hypervisor. The CPU is shared among
            |     N tenants _and_ is overcommitted.
            | 
            | When I read this, I thought the rest of your post would go
            | entirely differently. As in, I immediately agreed with you,
            | only for you to "turn this 180" (according to how I think
            | about this at least :) )
            | 
            |     It's not that much to ask to optimize where you can,
            |     when it isn't unreasonably difficult.
           | 
           | You take the above to mean we should optimize such that L2
           | cache is used as per the post as much as possible. Optimize
           | the wazoo out of things.
           | 
           | But how does that even help in any way, when the CPU this
           | runs on is like you said shared among N tenants and your
            | carefully optimized L2 cache access is still going to miss
           | because another tenant got "your" CPU in between?
           | 
           | If you're paying for bare metal i.e. have the whole instance
           | for yourself, by all means, if your domain actually requires
           | you to use the system in such a way (things like high
           | frequency trading come to mind), then optimize like that!
           | 
           | If you're running on seventeen levels of hypervisors that
           | destroy any careful optimization in a random fashion anyway,
           | then what's the point even? (non rhetorical question!)
        
             | kbelder wrote:
             | It's a version of the tragedy of the commons. Yes, your
             | software bloat isn't the main factor keeping you slow, and
             | cleaning it up might be barely perceptible.
             | 
             | But... all the other software makes the same decision.
             | Then, suddenly, everything on the computer is running at
             | 10% efficiency.
             | 
             | In a complex, multitasking environment, keeping your code
             | clean benefits other tasks as much or more than your own.
             | It should be considered a responsibility.
        
               | tharkun__ wrote:
               | I would agree with the sentiment but not necessarily the
               | conclusion.
               | 
               | Like, don't use (the equivalent of) Stooge Sort (for your
               | particular situation).
               | 
               | But unless you are in a very particular situation, it
               | should be OK for everyone to "just use your language's
               | built-in sort function" (hopefully that's a thing and you
               | don't use something where you need to roll your own even
               | in 2024) assuming that it uses a sane default like
               | quicksort or merge sort that will work perfectly fine for
               | most regular common situations.
               | 
               | Another example might be to not stupidly build slow
               | algorithms saying "this won't see much data anyhow" (yes,
               | I've seen that in PRs unfortunately) when a very simple
               | hash lookup will ensure it's fast all the time. But it
               | should be totally fine to assume that your language's
               | hash table implementation is fine for common situations
               | and you don't need to optimize anything unless you're a
               | very special case.
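                | 
                | The hash-lookup point, sketched with made-up names (any
                | language with a built-in hash set works the same way):
                | 
                |     def slow_filter(records, allowed_list):
                |         # O(n*m): list membership rescans allowed_list
                |         # for every record.
                |         return [r for r in records if r in allowed_list]
                | 
                |     def fast_filter(records, allowed_list):
                |         # O(n): build the set once, then one hash
                |         # lookup per record.
                |         allowed = set(allowed_list)
                |         return [r for r in records if r in allowed]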
        
           | TacticalCoder wrote:
           | > In the real world, your application is running in a
           | container among hundreds or thousands of other containers.
           | The system's resources are also probably being managed by a
           | hypervisor. The CPU is shared among N tenants _and_ is
           | overcommitted. It's not that much to ask to optimize where
           | you can, when it isn't unreasonably difficult.
           | 
           | And the real irony is that those programmers not giving a
            | crap about performance and justifying it are resting on the
           | shoulders of giants who went out of their way to have these
           | kernels, hypervisors, containers, passthroughs, memory usage,
           | etc. be as performant as they can.
           | 
           | Same for security.
           | 
           | Then you get the "I'm not in the business of writing fast
           | code" and "I'm not a security company, I'm selling X and Y".
           | 
           | But they _all_ , on a daily basis, use actual proper software
           | written by people who know about these things.
        
         | purplesyringa wrote:
         | That's good food for thought. My hardware is quite old, but I
         | agree that the numbers seem somewhat suspicious.
         | 
         | I believe that RAM can only be RMW'd in cache lines, so
         | modifying just 8 bytes still requires 64 bytes to be
         | transmitted. I'm assuming the 50 GB/s throughput is half-
         | duplex, and 25 GB/s over 64 bytes is ~400 Melems/s, somewhat
         | closer to my result.
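          | 
          | A sketch of that arithmetic, assuming every 8-byte write
          | dirties its own 64-byte line:
          | 
          |     half_duplex = 50e9 / 2           # bytes/s each direction
          |     print(half_duplex / 64 / 1e6)    # ~390 Melems/s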
         | 
         | I tried using non-temporal stores in the straightforward
         | algorithm, but to my surprise, this led to a significant
         | decrease of performance across all input lengths.
         | 
         | > When your data is indeed big and the performance actually
         | matters, consider doing something completely different.
         | 
         | I'm not sure what you mean by this. Scaling across machines is
         | just ignoring the problem. What do you mean by "something
         | completely different"?
        
           | saagarjha wrote:
           | Non-temporal stores will not help you if all you do are those
           | accesses. The point of using them is that you don't want them
           | pulled into the caches and pushing out everything else.
        
           | bobmcnamara wrote:
           | > I believe that RAM can only be RMW'd in cache lines, so
           | modifying just 8 bytes still requires 64 bytes to be
           | transmitted.
           | 
           | Ages ago I worked with several different memory controllers,
           | and it depends on the memory controller, cache, and MMU
           | configuration.
           | 
           | Plenty of systems do require cacheline updates, then
           | modifying 8B requires reading one or two cachelines, updating
           | them, and writing them out eventually.
           | 
           | Some caches track cacheline validity with a bit per byte.
            | This enables a CPU to store a byte to memory without
           | fetching the cacheline. The cache may then try to burst read
           | that line from the memory controller, but if it doesn't get
           | around to it before deciding to flush that cacheline, it may
           | issue a single byte write to the controller. The controller
            | can then issue a masked DRAM write to the memory, which will
           | update only certain bytes in the DRAM column. However, this
           | still takes about as long as sending the full cacheline but
           | it offloads the read-modify.
           | 
           | Validity per byte is also useful to implement hit under miss.
           | 
           | I bet on newer, bigger systems tricks like this are less
           | useful since the memory busses are faster and wider today.
        
         | wkat4242 wrote:
         | And my GPU delivers 1TB/s. Massive difference <3
         | 
         | I wish we could get those kind of speeds on system RAM.
        
           | winter_blue wrote:
           | To/from what sort of devices could the GPU read or write at
           | 1TB/sec, besides main memory?
           | 
           | The fastest consumer SSDs top out at several GB/sec (I guess
           | with massive hardware RAID they could be faster, but not sure
           | if they'd be 1TB/sec fast).
           | 
           | Even a network adapter that does 10 Gbit/sec is only recently
           | becoming slightly more common for the average consumer. Not
           | sure if any consumer adapters in the 10 or 1Tbit/sec range
           | exist at all.
        
             | chasd00 wrote:
             | > Not sure if any consumer adapters in the 10 or 1Tbit/sec
             | range exist at all.
             | 
             | Further, what exactly is a consumer going to plug a
             | 1Tbit/sec adapter into? Your little home ATT fiber
             | connection isn't going to come close to using that
             | available bandwidth.
        
               | simoncion wrote:
               | > Further, what exactly is a consumer going to plug a
               | 1Tbit/sec adapter into?
               | 
               | Another similarly-equipped machine on the LAN, and the
               | switch(es) between them.
        
             | semi-extrinsic wrote:
             | Consumer? Even in the "if-you-need-to-ask-what-it-costs-
              | you-can't-afford-it" world of frontier HPC systems, we're
              | only getting teasers of 0.8 Tbit/sec NICs this year.
             | 
             | As you say, only the GPU, maybe RAM, and the NIC will be
             | able to churn data at these speeds. There is a reason why
             | Mellanox (Nvidia) has developed GPUDirect RDMA, so the GPU
             | and NIC can talk directly to each other.
             | 
             | https://www.servethehome.com/this-is-the-next-gen-nvidia-
             | con...
        
             | simoncion wrote:
             | > ...besides main memory?
             | 
             | Main memory _is_ an important thing to have fast, though.
             | The faster (and lower-wallclock-latency) it is, the less
             | time your system spends waiting around when it needs to
              | swap things in and out of it. It's my understanding that
             | programs that need to be fast (like many video games) take
             | pains to preemptively load data into RAM from disk, and
             | (when appropriate for the program) from main RAM into VRAM.
             | If main RAM's transfer speed was equal to or greater than
              | VRAM's, and its access latency was a small fraction of a
             | frame render time, (presumably) some of that preloading
             | complexity could go away.
             | 
             | > I guess with massive hardware RAID they could be
             | faster...
             | 
             | This section of the comment is for folks who haven't been
             | paying attention to how fast storage has gotten: It's
             | nowhere near 1TB per second, but...
             | 
             | I have four 4TB SATA-attached Crucial MX500s set up in LVM2
             | RAID 0. This array is a bit faster than a 10gbit link.
             | (That is, I get 1.5GByte/s transfer rate off of the thing.)
             | Even a single non-garbage U.2-attached (or (barf)
             | M.2-attached) device can saturate a 10Gbit link.
        
             | Cthulhu_ wrote:
             | I'm reading the latest SSDs do 14 GB/s of sequential
             | reading and/or 15 million IOPS; transfer speed wise that's
             | close to the highest end DDR3 (2007) and the lowest end
              | DDR4 (2014). SSDs are not memory-speed fast yet (especially
              | not for random access), but they're definitely getting
              | closer.
        
             | Aurornis wrote:
             | > To/from what sort of devices could the GPU read or write
             | at 1TB/sec, besides main memory?
             | 
             | The 1TB/sec is between the GPU and the GPU memory, which is
             | the bottleneck.
             | 
             | You don't need that much bandwidth for loading inputs and
             | outputting results, just for random access during compute.
        
           | ein0p wrote:
           | It actually almost never does. To see that you'd need to
            | benchmark. It's pretty difficult to get good utilization on a
            | GPU on either the compute or the memory bandwidth side. A lot
            | of kernels
           | irretrievably fuck up both. You need long, coalesced
           | reads/writes, and judicious use of the memory hierarchy, or
           | else everything gets very slow very quickly.
        
           | saagarjha wrote:
           | I mean they can only do that if you have hundreds of threads
           | all making coalesced writes. Typical CPU workloads look
           | nothing like that; if you pointer chase on the GPU you are
           | going to get absurdly bad performance.
        
         | tc4v wrote:
         | cache misses are slow because of latency, not because of
         | throughput.
        
           | geysersam wrote:
            | Isn't the point of the person you replied to that the article
            | author wasn't able to eliminate latency? If they had, they'd
            | be constrained by throughput, but they aren't.
        
         | toast0 wrote:
         | > In my 3 years old laptop, system memory (dual channel
         | DDR4-3200) delivers about 50 GB / second.
         | 
         | That's almost certainly for (mostly) sequential access.
         | 
         | When you just want a couple bytes here and there, and access
         | isn't pipelined and prefetch doesn't accelerate your use case,
         | the real world bandwidth is going to be significantly less.
        
           | Const-me wrote:
           | The input data is sequential.
           | 
           | I don't understand Rust but if the code is doing what's
           | written there, "simple multiplicative hash and perform a
           | simple analysis on the buckets - say, compute the sum of
           | minimums among buckets" it's possible to improve
           | substantially.
           | 
           | They don't need to move these megabytes of elements between
           | collections. I would split the input across a few CPU cores,
            | compute per-bucket minimums in parallel on each core; when
            | all have completed, aggregate the minimums across cores, then
            | compute the sum of the results.
           | 
           | Multiplicative hash and the final sum should be possible to
           | vectorize on most computers. Updating per-bucket minimum is
           | probably impossible to vectorize (technically AVX512 set has
           | the required instructions but I'm not sure these are
            | efficient), but there are far fewer buckets than input
           | numbers, which means arrays with per-bucket minimums are
           | likely to fit in caches.
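            | 
            | The plan, sketched in the article's Rust (an untested
            | illustration only; assumes u64 inputs, 256 buckets, a
            | Fibonacci-style multiplicative hash, and scoped threads for
            | the per-core pass):
            | 
            |     use std::thread;
            | 
            |     const N_BUCKETS: usize = 256;
            | 
            |     // High bits of a multiplicative hash pick the bucket.
            |     fn bucket_of(x: u64) -> usize {
            |         (x.wrapping_mul(0x9E3779B97F4A7C15) >> 56) as usize
            |     }
            | 
            |     fn sum_of_bucket_minimums(data: &[u64]) -> u64 {
            |         let n_threads = thread::available_parallelism()
            |             .map(|n| n.get())
            |             .unwrap_or(1);
            |         let chunk_len = data.len().div_ceil(n_threads).max(1);
            | 
            |         // Per-core pass: each thread folds its chunk into a
            |         // private table of per-bucket minimums.
            |         let per_thread: Vec<[u64; N_BUCKETS]> =
            |             thread::scope(|s| {
            |                 let handles: Vec<_> = data
            |                     .chunks(chunk_len)
            |                     .map(|chunk| s.spawn(move || {
            |                         let mut mins = [u64::MAX; N_BUCKETS];
            |                         for &x in chunk {
            |                             let b = bucket_of(x);
            |                             mins[b] = mins[b].min(x);
            |                         }
            |                         mins
            |                     }))
            |                     .collect();
            |                 handles.into_iter()
            |                     .map(|h| h.join().unwrap())
            |                     .collect()
            |             });
            | 
            |         // Aggregate across cores, then sum non-empty buckets.
            |         let mut global = [u64::MAX; N_BUCKETS];
            |         for mins in &per_thread {
            |             for b in 0..N_BUCKETS {
            |                 global[b] = global[b].min(mins[b]);
            |             }
            |         }
            |         global.iter().copied()
            |             .filter(|&m| m != u64::MAX)
            |             .sum()
            |     }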
        
       | kazinator wrote:
       | > _Cache is seen as an optimization for small data: if it fits in
       | L2, it's going to be processed faster_
       | 
       | Nobody worth their salt believes just this and nothing else.
       | 
       | Yes, if the data fits entirely into a given cache, that's a nice
       | case that's easy to reason about. No matter what access pattern
       | is applied to the data, it doesn't matter because it's in the
       | cache.
       | 
       | Hopefully everyone working with caches understands that they
       | provide a potential speedup when _not_ everything fits into the
       | cache, and that this depends on the pattern of access (mainly,
       | does it exhibit  "locality"). Moreover, this case is extremely
       | important.
       | 
       | The article gives an example of exactly that: improving the
       | locality of access.
       | 
       | If you don't know this, you don't know one of the first facts
       | about caching.
       | 
       | There is something else to know about: you can't tell by size
       | alone whether a given data set will fit into a cache. The problem
       | is that caches are not always fully associative. In a set
       | associative cache, a given block of data cannot be stored in any
       | cache line: it is assigned to a small set of possible cache
       | lines. Then within a set, the cache lines are dynamically
        | allocated and tagged. A given working set which appears to
       | be just smaller than the cache might be arranged in such a poor
       | way in memory that it doesn't map to all of the cache's sets. And
       | so, it actually does not fit into the cache.
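        | 
        | A toy illustration of the set mapping (assuming a hypothetical
        | 32 KiB, 8-way L1 with 64-byte lines, i.e. 64 sets; real
        | geometries vary by CPU):
        | 
        |     // Bits 6..12 of the address select the set in this
        |     // hypothetical cache geometry.
        |     fn l1_set(addr: usize) -> usize {
        |         (addr >> 6) % 64
        |     }
        | 
        |     fn main() {
        |         // Data spaced 4096 bytes apart aliases into one set:
        |         // only 8 such lines are cacheable at once, despite
        |         // 32 KiB of total capacity.
        |         for i in 0..4 {
        |             let addr = i * 4096;
        |             println!("addr {:#07x} -> set {}", addr, l1_set(addr));
        |         }
        |     }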
        
         | purplesyringa wrote:
          | That's true, perhaps my wording is off. I believe the devil
          | is in the details. Sure, knowing that better access patterns
         | result in better performance is common. But the fact that the
         | access pattern can be optimized when the problem is literally
         | "access RAM at these random locations" is counterintuitive,
         | IMO.
        
           | ryao wrote:
           | When you have locality, prefetch can mask the latency of
           | getting the next object, regardless of whether everything
           | fits in cache.
        
         | zahlman wrote:
         | >Nobody worth their salt believes just this and nothing
         | else.... and that this depends on the pattern of access
         | (mainly, does it exhibit "locality").... If you don't know
         | this, you don't know one of the first facts about caching.
         | 
         | Not necessarily specific to this issue, but I've found that
         | surprisingly many people out there are not "worth their salt"
         | in areas where you'd really expect them to be.
        
           | HelloNurse wrote:
           | Assuming incompetence cannot be a general strategy, but there
           | are many surprising ways to get jobs and pass exams.
        
       | ggm wrote:
       | Exemplar code in python, shows the benefit in rust. o---kaaaaay.
       | 
       | So the outcome is "for this compiled rust, around 1m records it
       | gets better"
       | 
       | But you didn't actually prove a general case, you proved "for a
       | good optimising rust compiler" didn't you?
       | 
       | Maybe I'm over-thinking it. Maybe this does just become simple,
       | working-set-locality stuff.
       | 
       | I could take the lesson: for less than 10m things, on modern
       | hardware with more than 4 cores and more than 4GB stop worrying
       | and just code in the idiom which works for you.
        
         | purplesyringa wrote:
         | Uh? The post very clearly says: "I'm using Python as
         | pseudocode; pretend I used your favorite low-level language".
         | The goal is to show what's possible when you do your best, of
         | course I'm not going to use Python for anything beyond
         | demonstrating ideas.
         | 
         | > But you didn't actually prove a general case, you proved "for
         | a good optimising rust compiler" didn't you?
         | 
         | Again, that was never my goal. I chose Rust because I've
         | stumbled upon this problem while working on a Rust library. I
         | could've chosen C++, or C, or maybe even Go -- the result
         | would've been the same, and I checked codegen to make sure of
         | that.
         | 
         | > I could take the lesson: for less than 10m things, on modern
         | hardware with more than 4 cores and more than 4GB stop worrying
         | and just code in the idiom which works for you.
         | 
         | The number of cores and RAM capacity has nothing to do with
         | this. It's all about how well data fits in cache, and "less
         | than 10m things" are likely to fit in L3 anyway. If your main
         | takeaway from "here's how to process large data" was "I don't
         | need to worry about this for small data", well, I don't know
         | what to say.
        
           | ggm wrote:
           | Large and Small are so contextual. I'm processing 350m
           | events/day in 24h splits, and I managed to stop worrying
           | about locality of reference because I'm the sole occupant of
           | the machine. When I did worry about it, I found radix tree,
           | awk hash and perl/python hash/dict pretty much all occupied
           | much the same space and time but a tuned C implementation got
            | 2-3x faster than any of them. Somebody else pointed out that
            | keeping most of this memory-resident would be faster still,
            | but you then have to work to process 24h of things against a
           | single memory instance. Which means buying into IPC to get
           | the data "into" that memory.
           | 
            | It interested me that you didn't show the idea in Rust. That
            | was the only point I was making: Python as pseudocode for
            | thinking things through in documents is fine with me.
           | 
           | But overall, I liked your outcome. I just think it's
           | important to remember large and small are very contextual.
           | Your large case looks to me to be >5m things and for an awful
           | lot of people doing stupid things, 5m is bigger than they'll
           | ever see. If the target was only people who routinely deal in
           | hundreds of millions of things, then sure.
        
           | ryao wrote:
           | > The number of cores and RAM capacity has nothing to do with
           | this. It's all about how well data fits in cache, and "less
           | than 10m things" are likely to fit in L3 anyway.
           | 
           | What matters is locality since that allows prefetch to mask
           | latency. If you have this, then you are in a good place even
           | if your data does not fit in the L3 cache. What you did
           | demonstrates the benefits that locality gives from the effect
           | on prefetch. Fitting in L3 cache helps, but not as much as
           | prefetch does. If you do not believe me, test a random access
           | pattern on things in L3 cache vs a sequential access pattern.
           | The sequential access pattern will win every time, because L3
           | cache is relatively slow and prefetch masks that latency.
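            | 
            | A quick sketch of that experiment (untested; ~8 MiB of
            | u64s, which fits many L3 caches, with the index order
            | shuffled by a tiny xorshift so no crates are needed):
            | 
            |     use std::time::Instant;
            | 
            |     // Minimal xorshift64 RNG, enough for a shuffle.
            |     fn xorshift(s: &mut u64) -> u64 {
            |         *s ^= *s << 13;
            |         *s ^= *s >> 7;
            |         *s ^= *s << 17;
            |         *s
            |     }
            | 
            |     fn time_sum(data: &[u64], idx: &[usize]) -> u64 {
            |         let start = Instant::now();
            |         let mut sum = 0u64;
            |         for &i in idx {
            |             sum = sum.wrapping_add(data[i]);
            |         }
            |         println!("{:?}", start.elapsed());
            |         sum
            |     }
            | 
            |     fn main() {
            |         let n = 1 << 20; // 8 MiB of u64
            |         let data: Vec<u64> = (0..n as u64).collect();
            | 
            |         // Same indices, two orders: sequential, then a
            |         // Fisher-Yates shuffle driven by xorshift.
            |         let seq: Vec<usize> = (0..n).collect();
            |         let mut rnd = seq.clone();
            |         let mut state = 0x243F6A8885A308D3u64;
            |         for i in (1..n).rev() {
            |             let j = (xorshift(&mut state) as usize) % (i + 1);
            |             rnd.swap(i, j);
            |         }
            | 
            |         let a = time_sum(&data, &seq); // prefetch-friendly
            |         let b = time_sum(&data, &rnd); // latency-bound
            |         assert_eq!(a, b); // same work, different order
            |     }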
           | 
           | I have seen options for disabling prefetch and cache as BIOS
           | options (although I only remember the option for disabling
           | cache in ancient systems). If you could get one, you could do
           | some experiments to see which will matter more.
        
       | quotemstr wrote:
       | I was expecting something about NUMA performance, temporal memory
       | access instructions, shared versus global versus register memory
       | on GPUs, SRAM, and so on. There's an article about all these
       | things waiting to be written. This article is instead about
       | memory-hierarchy-aware access pattern optimization, which is
       | important, but not the whole story.
        
       | shmerl wrote:
       | _> The RAM myth is a belief that modern computer memory resembles
       | perfect random-access memory. Cache is seen as an optimization
       | for small data_
       | 
       | The post doesn't seem to answer what the memory actually
       | resembles. If it's not resembling a random access memory, then
       | what is it resembling?
        
         | zahlman wrote:
         | It resembles a hierarchy wherein a small fraction of memory - a
         | "cache" region - can be accessed much faster than the rest.
         | With careful planning, the programmer can increase the odds
         | that the necessary information for the next step of an
         | algorithm, at any given point, is within that small region.
         | (This is a simplification; there are actually several layers of
         | cache between a CPU and the most general part of the RAM.)
        
         | rcxdude wrote:
         | It resembles sequential access memory with relatively fast
         | seeks and a cache (ok, multiple layers of caches). Reading
         | sequential, predictable addresses gives you much more
         | throughput than random access, and reading a value that was
         | recently accessed (or adjacent to such) is much lower latency
         | than something that was not. There's further wrinkles in
         | multicore systems as well, because then accesses to memory
         | recently written to by another core can be slower again.
        
         | dare944 wrote:
         | I can only conclude the word myth was chosen for attention.
         | Modern CPU memory systems (with or without their caches)
         | certainly _resemble_ idealized RAMs.
        
       | awanderingmind wrote:
       | This was a great read, thanks. OP, readers might benefit from
       | having it explicitly mentioned that while the pseudocode is in
       | Python, actual Python code will likely not benefit from such an
       | approach because of how memory is fragmented in the standard
       | Python implementation - e.g. this discussion:
       | https://stackoverflow.com/questions/49786441/python-get-proc...
       | 
       | I am tempted to actually test this (i.e. see if there's a speedup
       | in Python), but I don't have the time right now.
        
         | zahlman wrote:
         | I should have looked at the comments before I started writing
         | the code. I'd have replied to you otherwise :) (see
         | https://news.ycombinator.com/item?id=42459055 .)
        
           | awanderingmind wrote:
           | Haha thanks
        
       | zahlman wrote:
       | It's a little surprising that this works at all, since the
       | partitioning step in the radix sort is itself the same kind of
       | sharding operation. But I guess that's because it allows for
       | working on smaller pieces at a time that fit in whatever cache
       | they need to.
       | 
       | > Python can't really reserve space for lists, but pretend
       | `reserve` did that anyway.
       | 
       | FWIW, you can pre-fill a list with e.g. `None` values, and then
        | _replacing_ those values won't cause resizes or relocations (of
        | the _list_'s memory - the elements are still indirected). But of
       | course you'd then need to keep track of the count of "real"
       | elements yourself.
       | 
       | But of course, tricks like this are going to be counterproductive
       | in Python anyway, because of all the indirection inherent in the
       | system. They'll also get worse if you actually do have objects
       | (unless perhaps they're statically, constantly sized, and in a
       | language that can avoid indirection in that case) with a "group
       | ID" attribute, rather than integers to which you can apply some
       | hash function. Some test results, trying to approximate the Rust
       | code in Python but with such objects:
       | 
       | https://gist.github.com/zahlman/c1d2e98eac57cbb853ce2af515fe...
       | 
       | And as I expected, the results on my (admittedly underpowered)
       | machine are terrible (output is time in seconds):
        |     $ ./shard.py
        |     Naive: 1.1765959519980242
        |     Presized: 2.254509582002356
        |     Library sort / groupby: 6.990680840001005
        |     Simple radixing first: 3.571575194997422
       | 
       | It's important to understand the domain. For Pythonistas, simple
       | really is better than complex. And identifying the bottlenecks
       | and moving them to a faster language is also important (Numpy
       | isn't successful by accident). The naive result here is still, if
       | I'm understanding the graphs right, dozens of times worse than
       | what any of the Rust code achieves.
       | 
       | (Edit: merely "several" times worse. I forgot that I told
       | `timeit` to run 10 iterations over the same input, so each test
       | processes ~10M elements.)
        
         | purplesyringa wrote:
         | > It's a little surprising that this works at all, since the
         | partitioning step in the radix sort is itself the same kind of
         | sharding operation.
         | 
         | The key property here is the number of groups. When sharding
         | data to n groups, only n locations have to be stored in cache
         | -- the tails of the groups. In my radix sort implementation,
         | this is just 256 locations, which works well with cache.
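          | 
          | For concreteness, a stripped-down version of one such pass
          | (simplified, not the exact library code; u64 keys bucketed
          | by their top byte):
          | 
          |     // The only hot state is 256 write cursors ("tails"),
          |     // which fit comfortably in cache.
          |     fn partition_by_top_byte(elements: &[u64]) -> Vec<u64> {
          |         let mut counts = [0usize; 256];
          |         for &x in elements {
          |             counts[(x >> 56) as usize] += 1;
          |         }
          | 
          |         // Prefix sums turn counts into per-bucket offsets.
          |         let mut tails = [0usize; 256];
          |         let mut acc = 0;
          |         for b in 0..256 {
          |             tails[b] = acc;
          |             acc += counts[b];
          |         }
          | 
          |         // Scatter: every write appends to one of 256
          |         // sequential output streams.
          |         let mut out = vec![0u64; elements.len()];
          |         for &x in elements {
          |             let b = (x >> 56) as usize;
          |             out[tails[b]] = x;
          |             tails[b] += 1;
          |         }
          |         out
          |     }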
        
       | xigency wrote:
       | There are actually some algorithms specifically designed to
       | optimize usage of cache resources without knowing the specific
       | features of the cache.
       | 
       | "Cache Oblivious algorithms" https://en.wikipedia.org/wiki/Cache-
       | oblivious_algorithm
        
       | lifeisstillgood wrote:
       | Anaesthetists are required to undergo days of retraining per year
       | because the field, and the safety numbers, keep moving.
       | 
       | To do this researchers publish findings and professionals
       | aggregate those into useful training materials
       | 
       | Who / where is the useful training materials for software devs?
       | It really cannot be just blogs.
       | 
       | That's a VC investment that has ROI across the industry - tell
       | a16z how to measure that and we are quids in
        
         | tialaramex wrote:
         | Anaesthetists also undergo many _years_ of training _prior_ to
          | taking post. Typically they've got, say, 10 years of training
         | before they actually do the job, much of it specialised to this
         | job in particular (and the rest in general or intensive care
         | medicine)
         | 
         | If you're lucky a Software Engineer has a three year degree in
         | CS, and probably only a semester at best was studying "Software
         | Engineering" and even that might focus on something you don't
         | care about, such as formal methods.
         | 
         | It is _entirely possible_ that your junior engineers have never
         | maintained a sizeable codebase for more than a few weeks, have
         | never co-operated on software with more than a handful of other
         | people, and have never used most of the common software
         | libraries you use every day, regardless of whether these are
         | in-house (so how could they) or widely available.
         | 
         | For example maybe you do lots of web apps fronting SQL Server
         | databases in C#. Your new hire has six months of C#, they half-
         | remember a course doing SQL on an Oracle database, and all
         | their web knowledge is in Javascript. Do they know version
         | control? Kinda. Have they used a test framework before? Well,
         | they did in Java but never in C#.
         | 
         | The "All Your Tests Are Terrible" talk begins by pointing out
          | that probably they're strictly wrong because you _don't have
         | any fucking tests_. All of the rest of the talk is about the
         | hideous tests that you're unhappy to find because you forget
         | that somebody could just not have bothered to write any tests
         | at all.
        
           | globnomulous wrote:
           | Hey, I resent and agree with this!
        
           | simoncion wrote:
           | > All of the rest of the talk is about the hideous tests that
           | you're unhappy to find because you forget that somebody could
           | just not have bothered to write any tests at all.
           | 
           | At many times in my professional "career", I've found myself
           | wishing that whoever wrote the test I'm staring at just
           | hadn't bothered. "Tests that say and/or look like they test
           | one thing, but test something entirely different." [0] and
           | "Tests that actually test nothing at all and never fail when
           | the code they claim to test is changed out from under them."
           | are two of the many categories of tests I wish folks would
           | just never have wasted their time writing.
           | 
           | [0] To be clear, I usually find tests of this type to claim
           | to be testing something useful, but actually be testing
           | something that's not worth testing.
        
         | simoncion wrote:
         | > Who / where is the useful training materials for software
         | devs? It really cannot be just blogs.
         | 
         | Why can't it be blogs, books, and word-of-mouth?
         | 
         | If you're a programmer (as your HN profile suggests that you
         | are), you _already_ know how little formal training we receive.
        
           | lifeisstillgood wrote:
           | I think I am saying we should fund more training materials
           | and spread the word on where they are to be found.
           | 
           | Maybe
           | 
           | But something !
        
         | mrkeen wrote:
         | "nulls should not be in your language" is my version of
         | "doctors should wash their hands"
         | 
         | "Immutability-first, with mutation-as-a-special-case" is my
         | "accountants should not use erasers"
         | 
         | "make illegal states unrepresentable" is my "hospitals should
         | sterilise their operating rooms and equipment"
         | 
         | As a field, we are very far from reaching broad agreement on
         | the left side (which I consider the basics). So what would the
         | training materials teach?
        
           | feoren wrote:
           | Lately, "nulls should not be in your language" sounds like
           | "reality should be simpler". Queue angry fist-shaking at the
           | complexities of real life and a wistful longing for a
           | logical, functional, mathematical world. In reality,
           | sometimes out of 8 questions, you only know the answer to 7,
           | and the missing answer could be any of the 8. So you can:
           | 
           | 1. Prevent the user from doing anything until they go find
           | that 8th answer, which cascades into your tool not helping
           | anyone until they have a complete, validated, full,
           | historical data set.
           | 
           | 2. Subdivide those 8 questions into having independent
           | existence, which cascades into your entire application being
           | a giant key/value store, and all the difficulties of null are
           | simply replaced with the difficulties of "this key/value pair
           | does not exist yet".
           | 
           | 3. Add a sentinel value for "I don't know this yet", while
           | feeling good about yourself because that sentinel value is
           | not _technically_ null, while introducing all the same exact
           | issues that null causes, plus a bunch of infrastructure you
           | have to write yourself. Basically: reimplement null, but
           | worse.
           | 
           | 4. Make the answers nullable.
           | 
           | It'd be like claiming that an IDE should simply not allow
           | syntax errors to be entered at all. Errors would be
           | completely impossible! Except at some point your user needs
            | to actually _write_ the thing, and you've just abandoned the
           | idea of helping them. So they write it in some other editor,
           | and then paste it into your IDE. Or you embrace the fact that
           | incomplete work is OK.
           | 
           | Yes, nulls are generally way over-used. But disallowing them
           | entirely is a fool's errand. Source: I've been that fool.
           | 
           | In before: "You should just be using Maybe<> everywhere" --
           | "Maybe" is just another word for "nullable". The only
           | difference is the level of support you get from the compiler
           | / type-checker. So that argument is "the compiler should help
           | you get null right", which I completely agree with! There's
           | still work to be done getting type checkers better at null.
           | But that's a far cry from "null should not be in your
           | language" and equating the use of nulls to a doctor not
           | washing his hands.
        
             | int_19h wrote:
             | Of course nobody is seriously arguing about ditching the
             | notion of a sentinel value that indicates absence of value.
             | The question, rather, is how to best handle that case.
             | 
             | > In before: "You should just be using Maybe<> everywhere"
             | -- "Maybe" is just another word for "nullable". The only
             | difference is the level of support you get from the
             | compiler / type-checker.
             | 
             | The difference is that in languages with null
             | references/pointers, all references/pointers are
             | _implicitly_ nullable - that is, it is indeed like  "using
             | Maybe<> everywhere". But when you have proper option types,
              | you use Maybe<> not _everywhere_, but only where you know
             | nulls can actually occur.
        
               | feoren wrote:
               | > The difference is ... [whether] all references/pointers
               | are implicitly nullable [or not]
               | 
               | Right so we're narrowing down the actual problem here,
               | which revolves around implicit vs. explicit, compiler
               | support vs. on-your-own, static vs. dynamic typing. I
               | agree with you that I strongly prefer explicit over
               | implicit nullability, assuming it fits within the idioms
               | of the language. But this is a stronger condition than
               | static typing, so to believe "implicit nulls" is as
                | irresponsible as a _doctor not washing their hands_,
               | you* have to believe that everyone using dynamic typing
               | is even more irresponsible than that. There's a place for
               | dynamically typed languages, there's a place for implicit
               | nulls, there's a place for strict type- and null-
               | checking, there's a place for mathematically proving code
               | correct. Engineering is always about tradeoffs -- it's
               | way too simplistic to just say "null bad".
               | 
                | * I know it wasn't literally _you_ arguing that; I'm
               | still partly addressing mrkeen here.
        
       | zozbot234 wrote:
       | Tl;dr: cache is the new RAM; RAM is the new disk; disk is the new
       | tape.
        
       | antirez wrote:
       | Using Python pseudo code in this context is hardly
       | understandable.
        
         | purplesyringa wrote:
         | What concise and readable language would you suggest instead?
        
           | antirez wrote:
           | That's not the problem at hand. Python is good for
           | pseudocode. But not if you want to talk about cache misses
           | because in the pseudocode written with higher level languages
           | a lot of details on how memory is accessed are opaque.
        
             | purplesyringa wrote:
             | Again, what would you suggest instead? If you can't guess
             | that `list` is supposed to represent an array of
             | consecutive elements, I have trouble thinking of a language
             | that'll make that clear without being exceedingly verbose
             | for no reason.
        
               | antirez wrote:
               | A bit later, in the article, you'll see that memory
               | patterns in allocating the arrays have a role. A role
               | that was hidden initially.
               | 
               | > Again, what would you suggest instead?
               | 
               | The answer is inside you. You have only to search for it.
                | Or, if you really want to extort the obvious from me: any low
               | level language (even not implementing every detail but
               | calling imaginary functions whose name suggest what they
               | are doing). This exercise will show you for instance that
                | you'll have to immediately choose whether to
               | append_to_dynamic_array() or add_to_linked_list().
        
               | purplesyringa wrote:
               | > This exercise will show you for instance that you'll
                | have to immediately choose whether to
               | append_to_dynamic_array() or add_to_linked_list().
               | 
                | Linked lists, _that's_ what you're worried about?
               | `[...]` is not a linked list in Python, in fact I don't
               | know any imperative language where it's something other
               | than a dynamic array/vector. I can only assume someone
               | who doesn't understand or intuit this is arguing in bad
               | faith, especially when taking your attitude into account.
               | What did I do to deserve being talked down to?
               | 
               | > any low level language
               | 
               | Like this?
                |     std::vector<std::vector<element_t>> groups(n_groups);
                |     for (auto&& element : elements) {
                |         groups[element.group].push_back(std::move(element));
                |     }
                | 
                |     std::sort(elements.begin(), elements.end(),
                |               [](const auto& a, const auto& b) {
                |         return a.group < b.group;
                |     });
                | 
                |     std::vector<std::vector<element_t>> groups;
                |     for (auto group_elements : group_by(std::move(elements),
                |             [](const auto& element) { return element.group; })) {
                |         groups.push_back(std::move(group_elements));
                |     }
               | 
               | Is it worth it? If you don't know `list` is a dynamic
               | array in Python, how will you know `std::vector` is a
               | dynamic array in C++? Not to mention the syntax is
               | terrible. C would be just as bad. Using Rust would get
               | people angry about Rust evangelism. The only winning move
               | is not to play.
        
               | antirez wrote:
                |     // Create n empty arrays.
                |     DynamicArray* groups[n_groups];
                |     for (int i = 0; i < n_groups; i++) {
                |         groups[i] = create_empty_array();
                |     }
                | 
                |     // For each element in elements[], append it to its
                |     // group's array.
                |     for (int i = 0; i < n_elements; i++) {
                |         append_to_dynamic_array(groups[elements[i].group],
                |                                 elements[i].value);
                |     }
        
               | purplesyringa wrote:
               | antirez: You already got it wrong. You've got a double
               | indirection, with `groups[i]` pointing to a heap-
               | allocated instance of `DynamicArray`, which supposedly
               | stores the actual heap pointer.
               | 
               | It's not about the language being low-level or high-
               | level. It's about understanding the basics of memory
               | layout. If the pseudocode being in Python is an obstacle
               | for you, the problem is not in Python, but in your (lack
               | of) intuition.
        
               | antirez wrote:
                | I wrote the first random semantics that came to mind, in
               | the pseudocode in C above. The fact you believe it is not
               | the right one _is a proof that in C I can express the
                | exact semantics_, and in Python I can't (because
               | everything is hidden inside how the Python interpreter
               | can or can not implement a given thing).
               | 
               | So, modify my C code to match what you have in mind, if
               | you wish. But the point is that low level code will
               | specify in a much clearer way what's going on with
               | memory, and in a piece of code that is about memory
               | layout / access, that's a good idea. Otherwise I consider
               | Python a good language for pseudocode. Just: not in this
               | case.
               | 
                | Have a nice day, I'll stop replying since it would be
               | useless to continue. P.S. I didn't downvote you.
        
               | binary132 wrote:
               | If only there were a better syntax for abstractions over
               | machine code. :)
        
           | sebstefan wrote:
           | For an article about caching and performance C is the lingua
           | franca
        
           | Cthulhu_ wrote:
            | C was suggested, but I find Go to be good for this purpose
            | too: you get some of the low-level memory access (stack
            | vs heap stuff) but don't need to worry about memory
           | allocation or whatnot. Great for showing the differences in
           | algorithms.
        
       | veltas wrote:
       | > The only way to prevent cache misses is to make the memory
       | accesses more ordered.
       | 
       | Or prefetch. Not enough people know about prefetch.
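        | 
        | For example, on x86-64 (a sketch only; assumes an
        | index-chasing loop where future addresses are known a few
        | iterations ahead, and the lookahead distance is a guess to
        | tune per CPU):
        | 
        |     #[cfg(target_arch = "x86_64")]
        |     fn sum_with_prefetch(data: &[u64], idx: &[usize]) -> u64 {
        |         use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        |         const AHEAD: usize = 8; // lookahead distance
        |         let mut sum = 0u64;
        |         for i in 0..idx.len() {
        |             if i + AHEAD < idx.len() {
        |                 // Start fetching a future element now so its
        |                 // memory latency overlaps with current work.
        |                 unsafe {
        |                     let p = data.as_ptr().add(idx[i + AHEAD]);
        |                     _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
        |                 }
        |             }
        |             sum = sum.wrapping_add(data[idx[i]]);
        |         }
        |         sum
        |     }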
        
         | adrian_b wrote:
         | All modern CPUs have powerful hardware prefetchers, which are
         | insufficiently documented.
         | 
         | On any modern CPU (i.e. from the last 20 years), explicit
         | prefetching is very seldom useful.
         | 
         | What is important is to organize your memory accesses in such
         | orders so that they will occur in one of the patterns that
         | triggers the hardware prefetcher, which will then take care to
         | provide the data on time.
         | 
         | The patterns recognized by the hardware prefetchers vary from
         | CPU to CPU, but all of them include the accesses where the
         | addresses are in arithmetic progressions, going either forwards
         | or backwards, so usually all array operations will be
         | accelerated automatically by the hardware prefetchers.
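          | 
          | For instance, every walk below is an arithmetic progression
          | that a typical prefetcher recognizes (a toy illustration;
          | stride limits differ per CPU, and very large strides can
          | defeat the prefetcher):
          | 
          |     fn sum_with_stride(data: &[u64], stride: usize) -> u64 {
          |         let mut sum = 0u64;
          |         let mut i = 0;
          |         while i < data.len() {
          |             sum = sum.wrapping_add(data[i]);
          |             i += stride; // constant stride: prefetchable
          |         }
          |         sum
          |     }
          | 
          |     fn main() {
          |         let data: Vec<u64> = (0..1u64 << 22).collect();
          |         let forward = sum_with_stride(&data, 1);
          |         let strided = sum_with_stride(&data, 8);
          |         // A backward walk is also an arithmetic progression.
          |         let backward = data.iter().rev()
          |             .fold(0u64, |a, &x| a.wrapping_add(x));
          |         println!("{forward} {strided} {backward}");
          |     }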
        
       | bjornsing wrote:
       | I sometimes get the feeling it would be better to just start
       | treating RAM as secondary storage, with read/write access through
       | a (hardware implemented) io_uring style API. Is there anything
       | along those lines out there?
        
         | rwmj wrote:
         | CXL RAM is something like this on the physical side. It's RAM
         | over a PCIe connection, and PCIe is basically a very fast,
         | serial, point to point network. However as far as I'm aware the
         | "API" to software makes it looks just like regular, local RAM.
         | 
         | https://en.wikipedia.org/wiki/Compute_Express_Link
        
       | froh wrote:
        | I'd love to understand the differences between the "devices"
       | named A, Y, M in the performance measurement, referring to "(A,
       | Y, M indicate different devices)"
       | 
       | any pointers appreciated
        
       | Kon-Peki wrote:
       | Radix sorting is great, as this article points out. But I
       | followed the links and didn't see any mention of the most obvious
       | optimization that makes it even better on _modern_ systems.
       | 
       | If you do MSB for the first pass, then the subsequent passes
       | within each bucket are completely independent :) So... Choose a
       | large radix and do one MSB pass, then parallelize the LSB
       | processing of each bucket. This is fantastic on a GPU. You might
       | think that you would want many thousands of kernels running
       | simultaneously, but in actuality each SM (and the Apple
       | equivalent) has a thread-local cache, so you really only want a
       | couple hundred simultaneous kernels at the most. I've written
        | CUDA and Metal versions and they are _fast_. And you don't even
       | have to be very good at GPU programming to get there. It's very
       | basic. AoCP has the pseudocode.
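        | 
        | A CPU-flavored sketch of the same structure (illustrative
        | only, not the CUDA/Metal code: one MSB counting-sort pass,
        | then each bucket sorted independently in parallel; a real
        | version would use a pool rather than a thread per bucket):
        | 
        |     use std::thread;
        | 
        |     fn parallel_msb_sort(mut data: Vec<u64>) -> Vec<u64> {
        |         // MSB pass: counting sort on the top byte.
        |         let mut counts = [0usize; 256];
        |         for &x in &data {
        |             counts[(x >> 56) as usize] += 1;
        |         }
        |         let mut tails = [0usize; 256];
        |         let mut acc = 0;
        |         for b in 0..256 {
        |             tails[b] = acc;
        |             acc += counts[b];
        |         }
        |         let mut out = vec![0u64; data.len()];
        |         for &x in &data {
        |             let b = (x >> 56) as usize;
        |             out[tails[b]] = x;
        |             tails[b] += 1;
        |         }
        |         data = out;
        | 
        |         // Buckets are contiguous and independent: no element
        |         // crosses a boundary, so they can be sorted in
        |         // parallel.
        |         thread::scope(|s| {
        |             let mut rest: &mut [u64] = &mut data;
        |             for b in 0..256 {
        |                 let (bucket, tail) = std::mem::take(&mut rest)
        |                     .split_at_mut(counts[b]);
        |                 rest = tail;
        |                 if bucket.len() > 1 {
        |                     s.spawn(move || bucket.sort_unstable());
        |                 }
        |             }
        |         });
        |         data
        |     }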
        
       | mpweiher wrote:
       | Not sure what "Myth" the author thinks they are exposing.
       | 
       | Jim Gray wrote "RAM is the new disk" at least in 2006, probably
       | earlier, so 20 years ago. And people have been saying "RAM is the
       | new tape" for quite some time now as well.
        
       | exabrial wrote:
        | RAM Myth #2: Free memory is a good thing. A billion Windows "power"
       | users have etched this into canon.
        
         | verall wrote:
         | Windows starts doing awful things under slight memory pressure
         | like 80% usage.
         | 
         | There are cascading failures like high ram usage ->
         | swap+compression -> power+thermal limits -> desktop
         | responsiveness is unreasonably bad (i.e. >5 minutes to open
         | task manager to kill a single firefox.exe using >10G memory
         | caused by a single misbehaving javascript among many tabs)
         | 
         | This is an incredibly common scenario for laptops that cost
         | less than $1000.
         | 
         | Free memory means things will probably not go to shit in the
         | next 10 minutes.
         | 
         | Linux is a whole different story with different sharp edges.
        
       | 0x1ceb00da wrote:
       | Tried running the code. Crashes with 'attempt to add with
       | overflow'
        
       ___________________________________________________________________
       (page generated 2024-12-19 23:02 UTC)