[HN Gopher] The RAM Myth
___________________________________________________________________
The RAM Myth
Author : signa11
Score : 174 points
Date : 2024-12-18 22:43 UTC (1 day ago)
(HTM) web link (purplesyringa.moe)
(TXT) w3m dump (purplesyringa.moe)
| CountHackulus wrote:
| Actual benchmarks and graphs. Thank you so much for that.
| Const-me wrote:
| On my 3-year-old laptop, system memory (dual channel DDR4-3200)
| delivers about 50 GB / second. You measured 50M elements per
| second, which at 8 bytes per element translates to 400 MB /
| second. If your hardware is similar, your implementation
| reached less than 1% of the theoretical bandwidth.
|
| When your data is indeed big and the performance actually
| matters, consider doing something completely different.
| sgarland wrote:
| In the real world, your application is running in a container
| among hundreds or thousands of other containers. The system's
| resources are also probably being managed by a hypervisor. The
| CPU is shared among N tenants _and_ is overcommitted. It's not
| that much to ask to optimize where you can, when it isn't
| unreasonably difficult.
|
| More importantly, this attitude is precisely why software sucks
| today. "[CPU, Memory, Disk] is cheap, engineering time isn't."
| Fuck that. Bad engineering time is _incredibly_ expensive. This
| is an excuse to not spend time learning the ins and outs of
| your language, and your hardware.
|
| It also frustrates me to no end that people are so deeply
| incurious. This seems to only happen in tech, which is baffling
| considering how incredibly broad and deep the industry is.
| Instead, everyone clamors to learn the new Javascript
| abstraction that lets them get even further away from the
| reality of what the magic boxes are doing.
| kmarc wrote:
| Agree with the sentiment. However it's hard to stay curious,
| even harder to stay up-to-date.
|
| I liked fiddling with storage for a while, got really into
| it, deepened my knowledge about it. A couple years later I
| realized everything else (networking, architectures,
| languages) developed so much, most of my (non-basic) knowledge
| was obsolete. Picking up where I left off with all
| technologies is incredibly hard and caused fatigue.
|
| Now I'm at a point where I have the feeling I don't know
| anything about anything. It's factually not true, but my gut
| tells me so. Were I younger, this would trigger a lot of
| anxiety. Thankfully I can handle this by now.
| sgarland wrote:
| That's understandable. I'm into databases (both
| professionally and personally), so I get to touch a wide
| variety of things. Following a query the whole way is
| pretty neat. Typed query --> client --> wire protocol -->
| network --> server --> DB frontend --> DB planner --> DB
| storage engine --> OS --> Memory / Disk. `blktrace` is
| super interesting to watch commands being sent to the disk.
| buran77 wrote:
| When you are deeply involved in tech both personally and
| professionally, you are probably very passionate about
| it, and it makes sense that you'd only look closely at this
| field and think "people are so deeply incurious. This
| seems to only happen in tech".
|
| Tech is also one of the (if not _the_ ) most dynamic and
| fast evolving field a normal person will ever touch.
| Curiosity in tech can drain every single bit of free time
| and energy you have and you will hardly keep up with the
| progress, maybe barely scratch the surface. But people's
| available free time and energy wanes and curiosity is a
| collateral victim.
|
| I've painfully gone through the entire cycle of this,
| including the bit of resurgence later on when you have a
| combination of free time but less energy. What I can say
| is that this absolutely _does not_ happen just in tech.
| If anything tech is flooded with people with more
| curiosity than almost any other field.
| sgarland wrote:
| > When you are deeply involved in tech both personally
| and professionally, you are probably very passionate about
| it, and it makes sense that you'd only look closely at this
| field and think "people are so deeply incurious. This
| seems to only happen in tech".
|
| Good point. I commented in a sibling post to the same
| effect.
|
| I've certainly felt the personal strain of time sink and
| procrastination in my homelab. It's currently running
| k3os, which has been dead for about four years now,
| because everything I want is still running, and I never
| seem to have the motivation on the weekend to yell at my
| laptop when I could be playing board games.
|
| > including the bit of resurgence later on when you have
| a combination of free time but less energy.
|
| I'm guessing that will happen in another decade or so,
| when my kids are grown.
| akira2501 wrote:
| > "[CPU, Memory, Disk] is cheap, engineering time isn't."
| Fuck that.
|
| It is. It's absurdly cheap. I ensure I check the amount of
| time it would take for me to make a performance improvement
| against the runtime costs of my functions. It's rarely worth
| the extra effort.
|
| Seriously, until you get into the millions of records per
| second level, you're almost never benefited. You may make
| your function 2x faster, at a cost of additional complexity,
| but you never run it enough in a year for it to pay itself
| back.
|
| > Bad engineering time is _incredibly_ expensive.
|
| Engineering time is expensive. Period. It speaks to the need
| to minimize it.
|
| > This is an excuse to not spend time learning the ins and
| outs of your language, and your hardware.
|
| All of which will change in a few years, which is fine, if
| you're also committing to keeping _all that code_ up to date
| right along with it. Otherwise you end up with an obscure
| mess that you have to unwind 5 years of context to understand
| and fix again.
|
| Complexity and available mental contexts are forgotten costs.
| If your language even has that many "ins and outs" to begin
| with you may want to reconsider that.
| sgarland wrote:
| > You may make your function 2x faster, at a cost of
| additional complexity, but you never run it enough in a
| year for it to pay itself back.
|
| I'm not talking about increased complexity, I'm talking
| about extremely basic things that take zero extra time,
| like using the correct data structure. For example, in
| Python:
|
|     In [8]: a = array("i", (x for x in range(1_000_000)))
|        ...: l = [x for x in range(1_000_000)]
|        ...: d = deque(l)
|        ...: for x in (a, l, d):
|        ...:     print(f"{sys.getsizeof(x) / 2**20} MiB")
|        ...:
|     3.902385711669922 MiB
|     8.057334899902344 MiB
|     7.868537902832031 MiB
|
| Very similar structures, with very different memory
| requirements and access speeds. I can count on one hand
| with no fingers the number of times I've seen an array
| used.
|
| Or knowing that `random.randint` is remarkably slow
| compared to `random.random()`, which can matter in a hot
| loop:
|
|     In [10]: %timeit math.floor(random.random() * 1_000_000)
|     31.9 ns +- 0.138 ns per loop (mean +- std. dev. of 7 runs,
|     10,000,000 loops each)
|
|     In [11]: %timeit random.randint(0, 1_000_000)
|     163 ns +- 0.653 ns per loop (mean +- std. dev. of 7 runs,
|     10,000,000 loops each)
|
| > All of which will change in a few years, which is fine,
| if you're also committing to keeping _all that code_ up to
| date right along with it.
|
| With the exception of list comprehension over large ranges
| slowing down from 3.11 --> now, I don't think there's been
| much in Python that's become dramatically worse such that
| you would need to refactor it later (I gather the
| Javascript community does this ritual every quarter or so).
| Anything being deprecated has years of warning.
| akira2501 wrote:
| > which can matter in a hot loop:
|
| 163ns - 31.9ns == 131.1ns
|
| This will need to happen 7.6 million times to save me 1
| CPU second. On AWS lambda with 1GB of memory this will
| cost you a whopping: $0.0000166667.
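|
| Back-of-envelope for the curious (the $0.0000166667 figure is
| Lambda's published per-GB-second price):
|
|     saved = 163e-9 - 31.9e-9      # 131.1 ns saved per call
|     calls_per_cpu_second = 1 / saved
|     print(round(calls_per_cpu_second))   # ~7.6 million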
|
| The point is, you're not even wrong, but there are
| vanishingly few cases where it would actually matter to
| the bottom line in practice. You're taking an absolutist
| point of view to a discipline which thoroughly rejects
| it.
|
| This is what I love about the cloud. It forces you to
| confront what your efforts are actually worth by placing
| a specific value on all of these commodities. In my
| experience they're often worth very little given that
| none of us have the scale of problems where this would
| show actual returns.
| norir wrote:
| Sure, but the cumulative effects of pervasive mediocre to
| bad decisions do add up. And it isn't just about cloud
| compute cost. Your own time is stolen by the slow ci jobs
| that you inevitably get stuck waiting for. For me, I
| prioritize my own personal happiness in my work and this
| mindset taken too far makes me unhappy.
| zrm wrote:
| Reaching the scale where it shows actual returns isn't
| all that difficult. You need it to happen 7.6 million
| times to save 1 CPU second, but each CPU core can execute
| it nearly that many times every second.
|
| Probably you don't leave it generating only random
| numbers all day, but suppose you do generate a good few,
| so that it's 1% of your total compute budget, and you
| have only a modest load, using on average four CPU cores
| at any given time. Then saving that amount of computation
| will have saved you something like $15/year in compute,
| recurring. Which isn't actually that bad a return for ten
| seconds worth of choosing the right function.
|
| There are also a lot of even fairly small entities for
| which four cores is peanuts and they're running a hundred
| or a thousand at once, which quickly turns up the price.
|
| And even the things with internet scale aren't all that
| rare. Suppose you're making a contribution to the
| mainline Linux kernel. It will run on billions of
| devices, possibly for decades. Even if it doesn't run
| very often, that's still a lot of cycles, and some of the
| kernel code _does_ run very often. Likewise code in
| popular web browsers, javascript on popular websites or
| in popular libraries, etc.
|
| You don't have to work for Google to make a contribution
| to zlib and that kind of stuff has the weight of the
| world on it.
| Aeolun wrote:
| You are saying that the potential gains are less than an
| order of magnitude. That makes them a pretty hard sell in
| most instances.
| masklinn wrote:
| > Very similar structures, with very different memory
| requirements and access speeds. I can count on one hand
| with no fingers the number of times I've seen an array
| used.
|
| That is obvious when you actually check the access speed
| of arrays and find out it is about half that of lists on
| small integers (under 256), and worse on non-small
| integers. That is literally the opposite trade off of
| what you want in 99.99% of cases.
|
| Deques are even less of a consideration, they're unrolled
| linked lists so random access is impossible and iteration
| is slower, you use a deque when you need _a deque_ (or at
| least a fifo), aka when you need to routinely manipulate
| the head of the collection.
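|
| Easy to check yourself (a sketch; exact numbers depend on the
| machine and CPython version):
|
|     import timeit
|     from array import array
|
|     l = list(range(1_000_000))
|     a = array("i", l)   # same values, packed as C ints
|
|     # Iterating the array has to box every element back into a
|     # PyObject, which is why it loses to the list despite the
|     # smaller memory footprint.
|     print("list :", timeit.timeit(lambda: sum(l), number=100))
|     print("array:", timeit.timeit(lambda: sum(a), number=100))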
| sgarland wrote:
| It depends on your constraints. If you're limited by RAM,
| arrays make a lot of sense for certain applications. If
| you need Python's buffer protocol, again, they make a lot
| of sense.
|
| As to deques, yes, they have specific uses, and being
| slightly smaller isn't usually a selling point for them.
| My point was that I have seen many cases where an
| incorrect data structure was used, because a list or dict
| was "good enough." And sure, they generally are, but if
| the language ships with other options, why wouldn't you
| explore those?
| saagarjha wrote:
| Ok, but neither of these matter until you know they
| matter. Seriously. Like, yes, it's nice they exist and
| that they are available for when you want them, but I
| would generally advise people to use a list or
| random.randint, if only because I value their confidence
| with them over the 2x performance win, because most
| workloads are not simply just a single array or random
| number generator loop. And, to be clear, I work on
| performance professionally: most of my job is not making
| things as fast as possible, but considering the tradeoffs
| that go into writing that code. I understand your example
| as showing off an interesting performance story but in
| the real world most workloads are more complex than what
| can be solved with using a rare but drop-in API.
| IanCal wrote:
| Yes but if you want to do things in a less obvious way
| you should be aware of the downsides, such as bias in
| your random numbers. Also making sure you watch out for
| off by one errors.
|
| Stolen the number to show this off well from a bug report
| somewhere:
|
|     random_counter = Counter()
|     for i in range(10_000_000):
|         result = floor(random() * 6755399441055744) % 3
|         random_counter[result] += 1
|     print("floor method", random_counter.most_common(3))
|
|     randint_counter = Counter()
|     for i in range(10_000_000):
|         result = randint(0, 6755399441055743) % 3
|         randint_counter[result] += 1
|     print("randint method", randint_counter.most_common(3))
|
| Result:
|
|     floor method [(1, 3751972), (0, 3333444), (2, 2914584)]
|     randint method [(1, 3334223), (2, 3333273), (0, 3332504)]
|
| https://bugs.python.org/issue9025
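|
| For completeness: `randint`/`randrange` avoid this bias by
| rejection sampling on raw bits. A minimal sketch of the same
| idea (the helper name is mine):
|
|     from random import getrandbits
|
|     def unbiased_below(n: int) -> int:
|         # Draw n.bit_length()-bit integers until one lands in
|         # [0, n); each value is then exactly equally likely.
|         r = getrandbits(n.bit_length())
|         while r >= n:
|             r = getrandbits(n.bit_length())
|         return r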
| sgarland wrote:
| Have you run this in any modern version of Python? It's
| been fixed for a long time.
| IanCal wrote:
| 3.10 so I redid it on 3.13.1, same results.
| sgarland wrote:
| Ugh, I was checking `randrange` (as the bug mentions),
| not `random`. I stand corrected.
| IanCal wrote:
| Ah yeah sorry I should have mentioned it wasn't the same,
| I used it as it has a nice number that shows the bias to
| a pretty extreme degree.
| rcxdude wrote:
| Python is 100% the wrong language to worry about this in.
| If your hot loops are in python and you care about
| performance, you should be rewriting them in another
| language.
| sgarland wrote:
| Agreed; I used it partially because TFA used it to
| demonstrate ideas, and partially because I'm very
| familiar with it.
|
| But you're correct, of course. When I need something to
| go faster in Python, I write it in C. If it's more than a
| small section, then a full rewrite is reasonable.
| spookie wrote:
| Even if so, their point still stands. It's a tiny change
| that grants huge speedups.
| Cthulhu_ wrote:
| I was curious why randint was slower than random and
| found a good SO answer [0] (it's like chatgpt by humans,
| for you youngsters out there). The gist of it is that
| `random()` calls a C function directly (CPython's Mersenne
| Twister core, with no Python-level overhead), whereas
| `randint` is implemented in Python and has a load of
| preconditions / defensive programming executed at every
| invocation, so seven potential branches etc. it has to
| check every time.
|
| Of course, if it's not in a tight loop and you need a
| whole integer between a range then it's the most
| developer-ergonomic way to get it. If you do need more
| performance but want to keep the ergonomics, writing your
| own random-int-between-two-numbers is fairly
| straightforward but it'll take some time.
|
| [0] https://stackoverflow.com/a/58126026/204840
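|
| Presumably something like this (an untested sketch; note it
| inherits the slight bias discussed above whenever the range
| isn't a power of two):
|
|     import math
|     import random
|
|     def fast_randint(lo: int, hi: int) -> int:
|         # Skips randint's argument validation: one C-backed
|         # random() call plus some cheap arithmetic.
|         return lo + math.floor(random.random() * (hi - lo + 1))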
| jjav wrote:
| > You may make your function 2x faster, at a cost of
| additional complexity, but you never run it enough in a
| year for it to pay itself back.
|
| "You" is both singular and plural, which is often the
| problem with this thinking.
|
| Is it worth spending a month of engineering time to make a
| page load in 50ms instead of 2s? Seems like a lot of
| engineering time for a noticeable but somewhat minor
| improvement.
|
| But now, what if you have a million users who do this
| operation 100x/day? _Absolutely_ worth it!
|
| For example, I sure wish atlassian would put a tiny bit
| of effort into making jira faster. Even if it is 1 second
| per ticket, since I'm viewing 100+ tickets per day that
| adds up. And there are many hundreds of us at the company
| doing the same thing, so it really adds up.
| ddtaylor wrote:
| > 50ms instead of 2s
|
| In the past I believe Google was very adamant that page
| load time perception was very important to other metrics.
| nottorp wrote:
| Most of the time they just move the expensive processing
| to the user's browser so they don't have to pay for it :)
| sfn42 wrote:
| You're probably not going to achieve that with the kind
| of optimization described in this article though.
| xlii wrote:
| Nit: 50ms vs 2000ms is 40x speed increase, i.e. ~1.5
| order of magnitude.
|
| I still keep in mind the words of my database optimization
| lecturer, who said that in his experience optimizations
| below 1 OOM aren't worth it and most "good ones" are 3+
|
| > Absolutely worth it!
|
| Long reaching assumption. Even the biggest companies have
| limited resources (even if vast). Would you rather
| improve load times by 2x (from 500ms to 250ms) or improve
| checkout reliability from 99% to 99.5%? And there is much
| more to consider on some levels (e.g. planning for
| thermal efficiency is fun).
|
| Software development is always a game of choice.
| sgarland wrote:
| > I still keep in mind the words of my database optimization
| lecturer, who said that in his experience optimizations below
| 1 OOM aren't worth it and most "good ones" are 3+
|
| Like everything, it depends. Is the query gating a lot of
| other things, especially things that can be done in
| parallel? Shaving 10 ms off might very well be
| meaningful. Is it a large OLAP query, and the owning team
| has SLAs that depend on it? Going from 60 --> 55 minutes
| might actually matter.
|
| The two biggest performance-related issues with RDBMS
| that I deal with, aside from indexing choices, are over-
| selection (why on earth do ORMs default to SELECT * ?),
| and poor JOIN strategies. Admittedly the latter is often
| a result of poor schema design, but for example, the
| existence of semi/anti-joins seems to be uncommon
| knowledge.
| xlii wrote:
| If SLAs are involved then I'd argue it's not about
| optimization but about business goals, which
| unsurprisingly take precedence.
|
| But there is another case that is very similar: threshold
| passing (or, as I like to call it, waterfalling). Small
| inefficiencies add up, and at some point the small slowdowns
| reach critical mass and some significant system breaks
| everything else.
|
| When a system was designed by competent engineers, huge
| optimizations aren't easy, so it's shaving a couple millis
| here and a couple millis there. But, as in the first case,
| I'd categorize it as a performance failure.
| Cthulhu_ wrote:
| As always, https://xkcd.com/1205/ is a great reference to
| keep in mind.
|
| That said, most developers (I assume, by most I mean
| myself) never work with code where the work they do has
| any serious impact on performance; it's layers on top of
| code that should be fast, but the work I do is
| implementing business logic and screens, in which case
| readability vastly trumps performance.
|
| I mean right now I'm writing a line of code that can be
| represented as a nested ternary or a function call with
| some more written-out if/elses. The ternary outperforms
| the function call 2:1 or more, but either one can be
| executed hundreds of times a second BUT will only be
| executed once or twice per page view. It's not worth the
| tradeoff, even if this line of code will be executed
| hundreds of millions of times overall.
| binary132 wrote:
| Jira is incredibly slow, almost unusable. So awful.
| pphysch wrote:
| More features = more customers = more revenue
|
| More data collection = more revenue
|
| More crap layered on = everything is slower
|
| Everything sucks = more support demand = supply price
| leverage = more revenue
|
| Enterprise software is necessarily slow and complicated,
| and not for your benefit.
| binary132 wrote:
| I'm not quite following your point. It sounds like you're
| agreeing that Jira sucks?
| Const-me wrote:
| > until you get into the millions of records per second
| level, you're almost never benefited
|
| Yeah, but the software landscape is very diverse.
|
| On my job (CAM/CAE) I often handle data structures with
| gigabytes of data. Worse, unlike e.g. multimedia frameworks
| many algorithms operating on these numbers are global i.e.
| can't be represented as a pipeline which splits data into
| independent chunks and processes them sequentially.
|
| Making performance critical functions twice as fast might
| save hours of time for a single end user in a single day.
| tonyarkles wrote:
| > In the real world, your application is running in a
| container among hundreds or thousands of other containers
|
| I mean, that's an engineering decision too. In my day job
| we're capturing, pre-processing, running inference on, and
| post-processing about 500Mpx/s worth of live image data at
| about 80ms/frame end-to-end at the edge. The processor SoM
| costs about $3000/unit and uses about 50W running flat out.
| The retail cost of our overall product is two orders of
| magnitude more than what the processor is worth but it incurs
| zero recurring costs for us.
|
| Edit: and it's got 64GB of Unified RAM that I've got all to
| myself :)
| sgarland wrote:
| I was wondering if someone from a different sub-industry
| would disagree here :D
|
| That sounds like a very interesting job, with quite
| different requirements and constraints from what I'm used
| to. One day, I'll get a job where application latency is
| critical, and optimizations matter deeply. Undoubtedly I'll
| find something else to be upset about, but at least it'll
| be a new complaint.
| tonyarkles wrote:
| > Undoubtedly I'll find something else to be upset about
|
| Vendor SDKs. You'll be upset about that I guarantee :)
| ddtaylor wrote:
| I can somewhat echo some of the statements here and provide
| my own experience that is similar.
|
| I spend a decent amount of time writing decent C++ code. My
| friends in other parts of the industry are writing certainly
| better C++ code than me because they are doing it in
| environments that are more constricted in various ways. In
| either case, I do spend my time catching up a bit and would
| consider myself a competent C++21 programmer in some ways.
|
| My experience and my conversations lead me to understand
| there is so much left on the table with even the most basic
| implementations. When I implement it correctly in C++ we get
| close to some of the theoretical limits for the hardware for
| some workloads, compared to something that is literally 1% as
| fast running in NodeJS.
|
| With that said, for so many situations I cannot justify the
| time and complexity to use C++ for many projects. At least
| for the stage most projects are in. In theory this
| optimization can happen later, but it never really does
| because the horizontal (or sometimes even vertical) scaling
| kicks in and we're all just willing to throw a few more
| dollars at the problem instead of re-engineering it. Sure,
| some of the really big companies like Netflix find a decent
| reason from time to time to invest the engineering time
| squeeze out those numbers, but it's becoming the exception
| and not the rule.
| Thorrez wrote:
| >C++21
|
| There's C++20 and C++23. No C++21.
| ddtaylor wrote:
| Sorry, I meant to type C++11
| spookie wrote:
| I've also found that not having some optimization mindset
| from the get-go limits your product to achieving only a local
| maximum of performance in the end. It might not even be a
| good local maximum.
|
| It's best to have at least some optimization considerations
| from the start. I'm saying some because too much is a
| problem.
| dakiol wrote:
| It's hard to learn the ins and outs of dozens of programming
| languages. One doesn't usually just use one or two PL over
| their entire career. I have worked professionally with at
| least PHP, Java, Python, JS, Go, and Ruby. That without
| taking into account the respective libraries and frameworks
| (and without taking into account as well the myriad of other
| components like dbs, web servers, caches, etc.)
|
| It sounds like an excuse, I know. The truth is I just don't
| have that much time.
| christophilus wrote:
| > people are so deeply incurious. This seems to only happen
| in tech
|
| It happens everywhere. If anything, techies are more curious
| than the average Joe. How many fellow programmers can you
| nerd-snipe with a comment that makes them say, "Well, that's
| strange..."
| duckmysick wrote:
| > It also frustrates me to no end that people are so deeply
| incurious. This seems to only happen in tech
|
| This doesn't match my observations. In many fields, training
| is limited and hides the details. We train workers to repeat
| specific tasks and they excel at them. But they don't have
| conceptual understanding. Any explanation of why things are
| done the way they are is surface-level. You can see it when a
| procedure fails for some reason. The way to deal with it is to
| 1) do basic checks, 2) try again, and if it still fails, 3)
| delegate to someone else. Nobody is going to troubleshoot or
| optimize tasks unless that's their main job.
|
| It happens in construction, in the kitchens, on the assembly
| lines, and in the offices. It happens because it gets the job
| done.
| sgarland wrote:
| You're right. My wording was flat-out wrong, and I
| apologize. A more accurate sentiment (IMO) would be "that
| this happens in tech is baffling, considering..."
| marcosdumay wrote:
| Yes, but in most fields, conceptual understanding doesn't
| matter for most tasks.
|
| That's the problem with software. A brick-layer doesn't
| need to understand structural stability, but in software
| every task is structure validation, and people expect
| brick-layers to be productive.
| phatskat wrote:
| > A brick-layer doesn't need to understand structural
| stability
|
| Maybe a "junior" brick layer doesn't need to, as much as
| a junior engineer doesn't need to understand the ins and
| outs of their language. But a senior brick layer, or an
| architect, needs to understand more of the details so
| that they can set out the plan for the junior.
| hinkley wrote:
| There are a lot of developers who learn how to do a task
| and never question whether the task could be automated,
| either away or to reduce errors. They just do the same
| mundane task four hours a week in perpetuity.
|
| That's the part that frustrates me. Your job is to automate
| things. So why aren't you automating things?
| bulatb wrote:
| Some apple-pickers think the point of picking apples is to
| prove how good you are at climbing trees. They'll never not
| be mad at pluckers who just pluck the nearest apple, and the
| people who reward them for the apples, and the world that
| doesn't understand the pluckers picked their apples _wrong_ ,
| and didn't even earn them, and they're not even real apples.
| pif wrote:
| Sir, your comment is poetry. I commend you.
| sgarland wrote:
| It's one thing if you know the optimal (or at least
| approaching it) solution, and deliberately use something
| else for other reasons. At least thought went into it.
|
| I'm not upset at this simply for purity's sake; it directly
| impacts me in my day-to-day job. A hundred devs
| individually decide that good enough is good enough
| (deliberately or otherwise), and suddenly my infrastructure
| is dealing with traffic and query patterns it shouldn't
| have to, and then I have to explain what network bandwidth
| limits are, and why it's absurd that we hit them in the
| first place.
|
| In general, what I'm railing against is the continued push
| towards simplification and abstraction, while not requiring
| the understanding of what lies beneath. The hyperscalers
| certainly aren't trying to foster that attitude in-house -
| someone has to build the abstractions in the first place,
| after all, and keep datacenters humming. Yet they're
| happily shoveling it out, because it's a win-win for them:
| fewer people have the skills necessary to compete (or even
| to simply not use their services), and more people are able
| to use their services.
| jebarker wrote:
| > it directly impacts me in my day-to-day job
|
| It directly impacts anyone that uses software everyday.
| Most people don't seem to understand and/or care how
| poorly the software they use runs.
| yetihehe wrote:
| Most people don't care about their job, they just want to
| put in minimum viable effort, get paid and go home to
| their family. Typically they do this because putting more
| effort didn't bring them anything (not even some
| recognition of efforts) or even backfired. It's
| applicable to all jobs, not only in tech.
| jebarker wrote:
| I get that. My comment has nothing to do with people's
| pride in their work. I was commenting that there's not
| enough user demand for better preforming software because
| users don't seem to mind the software hellscape we
| currently live in.
| didgetmaster wrote:
| It's like the contractor (or subcontractor) who built the
| house you live in. They don't have to pay your monthly
| heating or cooling bills, so they often do whatever is
| fastest and cheapest for them.
|
| Who cares if they didn't insulate a wall correctly or put
| an inefficient system in? If they can get it by the
| inspector, they will do it. If they can save themselves
| $100, that is all that matters to them. Who cares if it
| adds $10 to each month's utility bills for the next 30
| years? They got their money and are now long gone.
| immibis wrote:
| To get a certain behaviour, incentivize that behaviour.
| sgarland wrote:
| This is what the mythical Full Stack and DevOps movements
| were supposed to do. They did not.
| jebarker wrote:
| The part missing from this analogy is that the low-hanging
| fruit pickers are slowly killing the trees.
| throwaway519 wrote:
| Why are fruit pickers hanging from trees at all?
| blipvert wrote:
| Just get a panking pole already!
| mrkeen wrote:
| Climbing a ladder takes more time, so in order to get the
| apples to market faster, the apple tree owners keep the
| pickers focused on only picking the apples within arm's
| reach.
|
| The owners also disallow the use of ladders because the
| pool of candidates to hire remains bigger.
|
| And the highest 90% of apples remain unpicked.
| dsr_ wrote:
| Having recently been to an apple orchard...
|
| - Apple trees are deliberately bred to be short and wide.
|
| - An apple-picking-stick has a sort of basket on the end,
| which allows an apple to be selected and pulled off from
| 1-2 meters away.
|
| What lessons can be learned?
|
| - Using the right tools improves your yield, safely.
|
| - Having the data in a convenient place means you can use
| it more readily.
| binary132 wrote:
| Somehow, the trees keep getting exponentially bigger, and
| yet the pies we're getting are actually kinda just getting
| smaller and worse. Maybe just picking the nearest apples
| isn't working out as well as they say it is.
| foobarchu wrote:
| Maybe a better analogy would be farming.
|
| A farmer who understands the value of crop rotation is
| completely right to be upset when other farmers are
| monocropping and killing the soil, or when they squander
| the local aquifer. It's directly impacting the
| sustainability of their industry.
| killingtime74 wrote:
| Life is very different for many people and I think we just
| need to build empathy for people who treat a job as just a
| job. If they deliver and are not unkind about it there's
| nothing wrong about not going above and beyond the bare
| minimum, which is what they are paid for.
| sgarland wrote:
| As I commented above, a large part of my umbrage stems from
| the impact these decisions have on my job. I dislike having
| to work harder because others didn't want to go above the
| bare minimum.
|
| This isn't unique to any one company, nor my personal
| experience. At my last job, my team was initially human
| triaging practically all alerts, because we had the rare
| ability to read logs. I spoke to someone at Slack once who
| was stuck doing the same. That's an absurdly low expected
| level of competence.
| badpun wrote:
| You don't _have_ to work harder, as evidenced by those
| people who do the bare minimum. You just care about your
| work more than people who pay you for it (or the manager
| hired to manage you), which is the cause of your
| frustration here IMO.
| jrochkind1 wrote:
| > It also frustrates me to no end that people are so deeply
| incurious. This seems to only happen in tech,
|
| I mean software engineering and computer science from the
| start are _built on abstraction_; it makes sense that they
| attract people who find taking the abstraction as reality
| and ignoring the underlying "real" layers to be attractive.
|
| (Also it's not "real" until you get down to... voltages?
| Electrons? I don't know... the whole thing about our whole
| endeavor being built of abstraction is there are so many
| layers. Nobody thinks everyone has to routinely go ALL THE
| WAY down, but I get your point that being more curious than
| many tend to be about a couple layers down is helpful, with
| how far down a couple layers is depending on the nature of
| your work. I'm not saying it's not helpful, I'm suggesting
| possibly why -- because we select for people who swim happily
| in abstractions, who love em even. And the success of all of
| CS and software engineering is based on the ability for it to
| _often work_. Nobody really has to be curious about voltages
| when writing a python script to read and write from storage)
| crabbone wrote:
| Nope. If a program is meant to run on the entire CPU, and if
| it's meant to use all the memory, then that's how it will
| run. This isn't a question of modernity or availability of
| tools or programmers' salary.
|
| Or, to put it differently, using more or less memory / CPU
| isn't really an indication of anything about a program.
| Sometimes you mean to use all that's available, other times
| you mean to use as little as you can. There's no way to tell
| which one should be done w/o knowing your goals.
|
| For example, the entire free memory on Linux is a free real
| estate for filesystem caching. So that memory is not wasted,
| if the system can help it. So, having your entire memory used
| (to support filesystem cache) wouldn't be really a
| performance problem (quite the contrary, it would look like
| the system is doing a good job at caching whatever reads your
| applications are making).
| tharkun__ wrote:
| > In the real world, your application is running in a container
| among hundreds or thousands of other containers. The system's
| resources are also probably being managed by a hypervisor.
| The CPU is shared among N tenants _and_ is overcommitted.
|
| When I read this, I thought the rest of your post would go
| entirely differently. As in, I immediately agreed with you,
| only for you to "turn this 180" (according to how I think
| about this at least :) )
|
| > It's not that much to ask to optimize where you can, when it
| isn't unreasonably difficult.
|
| You take the above to mean we should optimize such that L2
| cache is used as per the post as much as possible. Optimize
| the wazoo out of things.
|
| But how does that even help in any way, when the CPU this
| runs on is like you said shared among N tenants and your
| carefully optimized L2 cache access is still going to be miss
| because another tenant got "your" CPU in between?
|
| If you're paying for bare metal i.e. have the whole instance
| for yourself, by all means, if your domain actually requires
| you to use the system in such a way (things like high
| frequency trading come to mind), then optimize like that!
|
| If you're running on seventeen levels of hypervisors that
| destroy any careful optimization in a random fashion anyway,
| then what's the point even? (non rhetorical question!)
| kbelder wrote:
| It's a version of the tragedy of the commons. Yes, your
| software bloat isn't the main factor keeping you slow, and
| cleaning it up might be barely perceptible.
|
| But... all the other software makes the same decision.
| Then, suddenly, everything on the computer is running at
| 10% efficiency.
|
| In a complex, multitasking environment, keeping your code
| clean benefits other tasks as much or more than your own.
| It should be considered a responsibility.
| tharkun__ wrote:
| I would agree with the sentiment but not necessarily the
| conclusion.
|
| Like, don't use (the equivalent of) Stooge Sort (for your
| particular situation).
|
| But unless you are in a very particular situation, it
| should be OK for everyone to "just use your language's
| built-in sort function" (hopefully that's a thing and you
| don't use something where you need to roll your own even
| in 2024) assuming that it uses a sane default like
| quicksort or merge sort that will work perfectly fine for
| most regular common situations.
|
| Another example might be to not stupidly build slow
| algorithms saying "this won't see much data anyhow" (yes,
| I've seen that in PRs unfortunately) when a very simple
| hash lookup will ensure it's fast all the time. But it
| should be totally fine to assume that your language's
| hash table implementation is fine for common situations
| and you don't need to optimize anything unless you're a
| very special case.
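|
| The hash lookup point in a nutshell (a sketch):
|
|     known_ids = range(1_000_000)   # stand-in for real data
|     queries = [5, 999_999, -3]
|
|     ids = set(known_ids)                      # build once: O(n)
|     hits = [q for q in queries if q in ids]   # O(1) average each
|
| versus `q in some_list`, which scans the whole list per query
| and quietly turns the loop quadratic.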
| TacticalCoder wrote:
| > In the real world, your application is running in a
| container among hundreds or thousands of other containers.
| The system's resources are also probably being managed by a
| hypervisor. The CPU is shared among N tenants _and_ is
| overcommitted. It's not that much to ask to optimize where
| you can, when it isn't unreasonably difficult.
|
| And the real irony is that those programmers not giving a
| crap about performances and justifying it are resting on the
| shoulders of giants who went out of their way to have these
| kernels, hypervisors, containers, passthroughs, memory usage,
| etc. be as performant as they can.
|
| Same for security.
|
| Then you get the "I'm not in the business of writing fast
| code" and "I'm not a security company, I'm selling X and Y".
|
| But they _all_ , on a daily basis, use actual proper software
| written by people who know about these things.
| purplesyringa wrote:
| That's good food for thought. My hardware is quite old, but I
| agree that the numbers seem somewhat suspicious.
|
| I believe that RAM can only be RMW'd in cache lines, so
| modifying just 8 bytes still requires 64 bytes to be
| transmitted. I'm assuming the 50 GB/s throughput is half-
| duplex, and 25 GB/s over 64 bytes is ~400 Melems/s, somewhat
| closer to my result.
|
| I tried using non-temporal stores in the straightforward
| algorithm, but to my surprise, this led to a significant
| decrease of performance across all input lengths.
|
| > When your data is indeed big and the performance actually
| matters, consider doing something completely different.
|
| I'm not sure what you mean by this. Scaling across machines is
| just ignoring the problem. What do you mean by "something
| completely different"?
| saagarjha wrote:
| Non-temporal stores will not help you if all you do are those
| accesses. The point of using them is that you don't want them
| pulled into the caches and pushing out everything else.
| bobmcnamara wrote:
| > I believe that RAM can only be RMW'd in cache lines, so
| modifying just 8 bytes still requires 64 bytes to be
| transmitted.
|
| Ages ago I worked with several different memory controllers,
| and it depends on the memory controller, cache, and MMU
| configuration.
|
| Plenty of systems do require cacheline updates, then
| modifying 8B requires reading one or two cachelines, updating
| them, and writing them out eventually.
|
| Some caches track cacheline validity with a bit per byte.
| This enables a CPU to write a single byte out to memory
| without fetching the cacheline. The cache may then try to
| burst read that line from the memory controller, but if it
| doesn't get around to it before deciding to flush that
| cacheline, it may issue a single byte write to the controller.
| The controller can then issue a masked DRAM write to the
| memory, which will update only certain bytes in the DRAM
| column. However, this still takes about as long as sending the
| full cacheline, but it offloads the read-modify.
|
| Validity per byte is also useful to implement hit under miss.
|
| I bet on newer, bigger systems tricks like this are less
| useful since the memory busses are faster and wider today.
| wkat4242 wrote:
| And my GPU delivers 1TB/s. Massive difference <3
|
| I wish we could get those kind of speeds on system RAM.
| winter_blue wrote:
| To/from what sort of devices could the GPU read or write at
| 1TB/sec, besides main memory?
|
| The fastest consumer SSDs top out at several GB/sec (I guess
| with massive hardware RAID they could be faster, but not sure
| if they'd be 1TB/sec fast).
|
| Even a network adapter that does 10 Gbit/sec is only recently
| becoming slightly more common for the average consumer. Not
| sure if any consumer adapters in the 10 or 1Tbit/sec range
| exist at all.
| chasd00 wrote:
| > Not sure if any consumer adapters in the 10 or 1Tbit/sec
| range exist at all.
|
| Further, what exactly is a consumer going to plug a
| 1Tbit/sec adapter into? Your little home ATT fiber
| connection isn't going to come close to using that
| available bandwidth.
| simoncion wrote:
| > Further, what exactly is a consumer going to plug a
| 1Tbit/sec adapter into?
|
| Another similarly-equipped machine on the LAN, and the
| switch(es) between them.
| semi-extrinsic wrote:
| Consumer? Even in the "if-you-need-to-ask-what-it-costs-
| you-can't-afford-it" world of frontier HPC systems, we're
| only getting teasers of 0.8 Tbit/sec NICs this year.
|
| As you say, only the GPU, maybe RAM, and the NIC will be
| able to churn data at these speeds. There is a reason why
| Mellanox (Nvidia) has developed GPUDirect RDMA, so the GPU
| and NIC can talk directly to each other.
|
| https://www.servethehome.com/this-is-the-next-gen-nvidia-
| con...
| simoncion wrote:
| > ...besides main memory?
|
| Main memory _is_ an important thing to have fast, though.
| The faster (and lower-wallclock-latency) it is, the less
| time your system spends waiting around when it needs to
| swap things in and out of it. It's my understanding that
| programs that need to be fast (like many video games) take
| pains to preemptively load data into RAM from disk, and
| (when appropriate for the program) from main RAM into VRAM.
| If main RAM's transfer speed was equal to or greater than
| VRAM's, and its access latency was a small fraction of a
| frame render time, (presumably) some of that preloading
| complexity could go away.
|
| > I guess with massive hardware RAID they could be
| faster...
|
| This section of the comment is for folks who haven't been
| paying attention to how fast storage has gotten: It's
| nowhere near 1TB per second, but...
|
| I have four 4TB SATA-attached Crucial MX500s set up in LVM2
| RAID 0. This array is a bit faster than a 10gbit link.
| (That is, I get 1.5GByte/s transfer rate off of the thing.)
| Even a single non-garbage U.2-attached (or (barf)
| M.2-attached) device can saturate a 10Gbit link.
| Cthulhu_ wrote:
| I'm reading the latest SSDs do 14 GB/s of sequential
| reading and/or 15 million IOPS; transfer speed wise that's
| close to the highest end DDR3 (2007) and the lowest end
| DDR4 (2014). SSDs are definitely not memory speed fast yet
| (definitely not for random access) but definitely getting
| closer.
| Aurornis wrote:
| > To/from what sort of devices could the GPU read or write
| at 1TB/sec, besides main memory?
|
| The 1TB/sec is between the GPU and the GPU memory, which is
| the bottleneck.
|
| You don't need that much bandwidth for loading inputs and
| outputting results, just for random access during compute.
| ein0p wrote:
| It actually almost never does. To see that you'd need to
| benchmark. It's pretty difficult get good utilization on GPU
| on either compute or memory bandwidth side. A lot of kernels
| irretrievably fuck up both. You need long, coalesced
| reads/writes, and judicious use of the memory hierarchy, or
| else everything gets very slow very quickly.
| saagarjha wrote:
| I mean they can only do that if you have hundreds of threads
| all making coalesced writes. Typical CPU workloads look
| nothing like that; if you pointer chase on the GPU you are
| going to get absurdly bad performance.
| tc4v wrote:
| cache misses are slow because of latency, not because of
| throughput.
| geysersam wrote:
| Isn't the point of the person you replied to that the article
| author wasn't able to eliminate latency? If they had been,
| they'd be constrained by throughput, but they are not.
| toast0 wrote:
| > In my 3 years old laptop, system memory (dual channel
| DDR4-3200) delivers about 50 GB / second.
|
| That's almost certainly for (mostly) sequential access.
|
| When you just want a couple bytes here and there, and access
| isn't pipelined and prefetch doesn't accelerate your use case,
| the real world bandwidth is going to be significantly less.
| Const-me wrote:
| The input data is sequential.
|
| I don't understand Rust but if the code is doing what's
| written there, "simple multiplicative hash and perform a
| simple analysis on the buckets - say, compute the sum of
| minimums among buckets" it's possible to improve
| substantially.
|
| They don't need to move these megabytes of elements between
| collections. I would split the input across a few CPU cores,
| compute per-bucket minimums in parallel on each core, when
| all completed aggregate minimums across all cores, then
| compute sum of the results.
|
| Multiplicative hash and the final sum should be possible to
| vectorize on most computers. Updating per-bucket minimum is
| probably impossible to vectorize (technically AVX512 set has
| the required instructions but I'm not sure these are
| efficient) but there're much fewer buckets than input
| numbers, which means arrays with per-bucket minimums are
| likely to fit in caches.
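|
| A sketch of that shape of solution (Python as pseudocode, as
| in TFA; the multiplier and bucket count are illustrative):
|
|     from concurrent.futures import ProcessPoolExecutor
|
|     BUCKETS = 256
|     INF = float("inf")
|
|     def bucket_of(x):
|         # Toy multiplicative hash: top 8 bits of a 64-bit product.
|         return ((x * 0x9E3779B97F4A7C15) % (1 << 64)) >> 56
|
|     def chunk_minimums(chunk):
|         # One linear pass per core; no elements are moved between
|         # collections, and the 256 minimums stay hot in cache.
|         mins = [INF] * BUCKETS
|         for x in chunk:
|             b = bucket_of(x)
|             if x < mins[b]:
|                 mins[b] = x
|         return mins
|
|     def sum_of_bucket_minimums(data, cores=4):
|         n = len(data)
|         chunks = [data[i * n // cores:(i + 1) * n // cores]
|                   for i in range(cores)]
|         with ProcessPoolExecutor(cores) as pool:
|             per_core = list(pool.map(chunk_minimums, chunks))
|         # Aggregate minimums across cores, then sum non-empty buckets.
|         merged = (min(ms) for ms in zip(*per_core))
|         return sum(m for m in merged if m != INF)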
| kazinator wrote:
| > _Cache is seen as an optimization for small data: if it fits in
| L2, it's going to be processed faster_
|
| Nobody worth their salt believes just this and nothing else.
|
| Yes, if the data fits entirely into a given cache, that's a nice
| case that's easy to reason about. No matter what access pattern
| is applied to the data, it doesn't matter because it's in the
| cache.
|
| Hopefully everyone working with caches understands that they
| provide a potential speedup when _not_ everything fits into the
| cache, and that this depends on the pattern of access (mainly,
| does it exhibit "locality"). Moreover, this case is extremely
| important.
|
| The article gives an example of exactly that: improving the
| locality of access.
|
| If you don't know this, you don't know one of the first facts
| about caching.
|
| There is something else to know about: you can't tell by size
| alone whether a given data set will fit into a cache. The problem
| is that caches are not always fully associative. In a set
| associative cache, a given block of data cannot be stored in any
| cache line: it is assigned to a small set of possible cache
| lines. Then within a set, the cache lines are dynamically
| allocated and tagged. A given working set which appears to
| be just smaller than the cache might be arranged in such a
| poor way in memory that it doesn't map to all of the cache's
| sets. And so, it actually does not fit into the cache.
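|
| To make the set-associativity point concrete, a toy model of a
| 64-set, 8-way cache with 64-byte lines (parameters are
| illustrative):
|
|     LINE = 64   # bytes per cache line
|     SETS = 64   # number of sets
|     WAYS = 8    # lines per set -> 32 KiB total capacity
|
|     def set_index(addr: int) -> int:
|         return (addr // LINE) % SETS
|
|     # 32 KiB laid out contiguously touches all 64 sets:
|     seq = range(0, 32 * 1024, LINE)
|     print(len({set_index(a) for a in seq}))      # 64
|
|     # The same amount of data at a stride of SETS * LINE bytes
|     # all maps to one set of 8 lines, so it cannot all be cached:
|     strided = range(0, 512 * SETS * LINE, SETS * LINE)
|     print(len({set_index(a) for a in strided}))  # 1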
| purplesyringa wrote:
| That's true, perhaps my wording is off. I believe the devil
| is in the details. Sure, knowing that better access patterns
| result in better performance is common. But the fact that the
| access pattern can be optimized when the problem is literally
| "access RAM at these random locations" is counterintuitive,
| IMO.
| ryao wrote:
| When you have locality, prefetch can mask the latency of
| getting the next object, regardless of whether everything
| fits in cache.
| zahlman wrote:
| >Nobody worth their salt believes just this and nothing
| else.... and that this depends on the pattern of access
| (mainly, does it exhibit "locality").... If you don't know
| this, you don't know one of the first facts about caching.
|
| Not necessarily specific to this issue, but I've found that
| surprisingly many people out there are not "worth their salt"
| in areas where you'd really expect them to be.
| HelloNurse wrote:
| Assuming incompetence cannot be a general strategy, but there
| are many surprising ways to get jobs and pass exams.
| ggm wrote:
| Exemplar code in python, shows the benefit in rust. o---kaaaaay.
|
| So the outcome is "for this compiled rust, around 1m records it
| gets better"
|
| But you didn't actually prove a general case, you proved "for a
| good optimising rust compiler" didn't you?
|
| Maybe I'm over-thinking it. Maybe this does just become simple,
| working-set-locality stuff.
|
| I could take the lesson: for less than 10m things, on modern
| hardware with more than 4 cores and more than 4GB stop worrying
| and just code in the idiom which works for you.
| purplesyringa wrote:
| Uh? The post very clearly says: "I'm using Python as
| pseudocode; pretend I used your favorite low-level language".
| The goal is to show what's possible when you do your best, of
| course I'm not going to use Python for anything beyond
| demonstrating ideas.
|
| > But you didn't actually prove a general case, you proved "for
| a good optimising rust compiler" didn't you?
|
| Again, that was never my goal. I chose Rust because I've
| stumbled upon this problem while working on a Rust library. I
| could've chosen C++, or C, or maybe even Go -- the result
| would've been the same, and I checked codegen to make sure of
| that.
|
| > I could take the lesson: for less than 10m things, on modern
| hardware with more than 4 cores and more than 4GB stop worrying
| and just code in the idiom which works for you.
|
| The number of cores and RAM capacity has nothing to do with
| this. It's all about how well data fits in cache, and "less
| than 10m things" are likely to fit in L3 anyway. If your main
| takeaway from "here's how to process large data" was "I don't
| need to worry about this for small data", well, I don't know
| what to say.
| ggm wrote:
| Large and Small are so contextual. I'm processing 350m
| events/day in 24h splits, and I managed to stop worrying
| about locality of reference because I'm the sole occupant of
| the machine. When I did worry about it, I found radix tree,
| awk hash and perl/python hash/dict pretty much all occupied
| much the same space and time but a tuned C implementation got
| 2-3x faster than any of them. Somebody else pointed out
| memory resident for most of this would be faster still but
| you have to then work to process 24h of things against a
| single memory instance. Which means buying into IPC to get
| the data "into" that memory.
|
| It interested me you didn't show the idea in rust. That was
| the only point I was making: Python as pseudocode to think
| things in documents is fine with me.
|
| But overall, I liked your outcome. I just think it's
| important to remember large and small are very contextual.
| Your large case looks to me to be >5m things and for an awful
| lot of people doing stupid things, 5m is bigger than they'll
| ever see. If the target was only people who routinely deal in
| hundreds of millions of things, then sure.
| ryao wrote:
| > The number of cores and RAM capacity has nothing to do with
| this. It's all about how well data fits in cache, and "less
| than 10m things" are likely to fit in L3 anyway.
|
| What matters is locality since that allows prefetch to mask
| latency. If you have this, then you are in a good place even
| if your data does not fit in the L3 cache. What you did
| demonstrates the benefits that locality gives from the effect
| on prefetch. Fitting in L3 cache helps, but not as much as
| prefetch does. If you do not believe me, test a random access
| pattern on things in L3 cache vs a sequential access pattern.
| The sequential access pattern will win every time, because L3
| cache is relatively slow and prefetch masks that latency.
|
| I have seen options for disabling prefetch and cache as BIOS
| options (although I only remember the option for disabling
| cache in ancient systems). If you could get one, you could do
| some experiments to see which will matter more.
| quotemstr wrote:
| I was expecting something about NUMA performance, temporal memory
| access instructions, shared versus global versus register memory
| on GPUs, SRAM, and so on. There's an article about all these
| things waiting to be written. This article is instead about
| memory-hierarchy-aware access pattern optimization, which is
| important, but not the whole story.
| shmerl wrote:
| _> The RAM myth is a belief that modern computer memory resembles
| perfect random-access memory. Cache is seen as an optimization
| for small data_
|
| The post doesn't seem to answer what the memory actually
| resembles. If it's not resembling a random access memory, then
| what is it resembling?
| zahlman wrote:
| It resembles a hierarchy wherein a small fraction of memory - a
| "cache" region - can be accessed much faster than the rest.
| With careful planning, the programmer can increase the odds
| that the necessary information for the next step of an
| algorithm, at any given point, is within that small region.
| (This is a simplification; there are actually several layers of
| cache between a CPU and the most general part of the RAM.)
| rcxdude wrote:
| It resembles sequential access memory with relatively fast
| seeks and a cache (ok, multiple layers of caches). Reading
| sequential, predictable addresses gives you much more
| throughput than random access, and reading a value that was
| recently accessed (or adjacent to such) is much lower latency
| than something that was not. There's further wrinkles in
| multicore systems as well, because then accesses to memory
| recently written to by another core can be slower again.
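|
| The effect is easy to observe even from Python, with numpy
| doing the actual memory traffic (a sketch; the array is sized
| to exceed a typical L3):
|
|     import timeit
|     import numpy as np
|
|     data = np.arange(10_000_000, dtype=np.uint64)   # ~80 MB
|     idx_seq = np.arange(len(data))
|     idx_rnd = np.random.permutation(len(data))
|
|     # Identical work; only the order of accesses differs.
|     print("seq:", timeit.timeit(lambda: data[idx_seq].sum(), number=5))
|     print("rnd:", timeit.timeit(lambda: data[idx_rnd].sum(), number=5))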
| dare944 wrote:
| I can only conclude the word myth was chosen for attention.
| Modern CPU memory systems (with or without their caches)
| certainly _resemble_ idealized RAMs.
| awanderingmind wrote:
| This was a great read, thanks. OP, readers might benefit from
| having it explicitly mentioned that while the pseudocode is in
| Python, actual Python code will likely not benefit from such an
| approach because of how memory is fragmented in the standard
| Python implementation - e.g. this discussion:
| https://stackoverflow.com/questions/49786441/python-get-proc...
|
| I am tempted to actually test this (i.e. see if there's a speedup
| in Python), but I don't have the time right now.
| zahlman wrote:
| I should have looked at the comments before I started writing
| the code. I'd have replied to you otherwise :) (see
| https://news.ycombinator.com/item?id=42459055 .)
| awanderingmind wrote:
| Haha thanks
| zahlman wrote:
| It's a little surprising that this works at all, since the
| partitioning step in the radix sort is itself the same kind of
| sharding operation. But I guess that's because it allows for
| working on smaller pieces at a time that fit in whatever cache
| they need to.
|
| > Python can't really reserve space for lists, but pretend
| `reserve` did that anyway.
|
| FWIW, you can pre-fill a list with e.g. `None` values, and then
| _replacing_ those values won 't cause resizes or relocations (of
| the _list_ 's memory - the elements are still indirected). But of
| course you'd then need to keep track of the count of "real"
| elements yourself.
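|
| A sketch of that trick:
|
|     source = range(1_000_000)   # stand-in for the real input
|     buf = [None] * 1_000_000    # one allocation up front
|     count = 0
|     for value in source:
|         buf[count] = value      # replaced in place; no resizes
|         count += 1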
|
| But of course, tricks like this are going to be counterproductive
| in Python anyway, because of all the indirection inherent in the
| system. They'll also get worse if you actually do have objects
| (unless perhaps they're statically, constantly sized, and in a
| language that can avoid indirection in that case) with a "group
| ID" attribute, rather than integers to which you can apply some
| hash function. Some test results, trying to approximate the Rust
| code in Python but with such objects:
|
| https://gist.github.com/zahlman/c1d2e98eac57cbb853ce2af515fe...
|
| And as I expected, the results on my (admittedly underpowered)
| machine are terrible (output is time in seconds):
    $ ./shard.py
    Naive: 1.1765959519980242
    Presized: 2.254509582002356
    Library sort / groupby: 6.990680840001005
    Simple radixing first: 3.571575194997422
|
| It's important to understand the domain. For Pythonistas, simple
| really is better than complex. And identifying the bottlenecks
| and moving them to a faster language is also important (Numpy
| isn't successful by accident). The naive result here is still, if
| I'm understanding the graphs right, dozens of times worse than
| what any of the Rust code achieves.
|
| (Edit: merely "several" times worse. I forgot that I told
| `timeit` to run 10 iterations over the same input, so each test
| processes ~10M elements.)
| purplesyringa wrote:
| > It's a little surprising that this works at all, since the
| partitioning step in the radix sort is itself the same kind of
| sharding operation.
|
| The key property here is the number of groups. When sharding
| data to n groups, only n locations have to be stored in cache
| -- the tails of the groups. In my radix sort implementation,
| this is just 256 locations, which works well with cache.
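|
| One pass of that partitioning, as a sketch in the article's
| pseudocode style (sharding integers by their low byte):
|
|     # Only the 256 bucket tails are "hot" during the loop, so
|     # they all stay resident in cache.
|     def radix_pass(elements):
|         buckets = [[] for _ in range(256)]
|         for x in elements:
|             buckets[x & 0xFF].append(x)
|         return buckets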
| xigency wrote:
| There are actually some algorithms specifically designed to
| optimize usage of cache resources without knowing the specific
| features of the cache.
|
| "Cache Oblivious algorithms" https://en.wikipedia.org/wiki/Cache-
| oblivious_algorithm
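|
| The flavor of these, as a minimal sketch (in-place transpose of
| an n x n matrix, assuming n is a power of two): recurse until
| the block fits in whatever cache level exists, with no cache
| sizes anywhere in the code.
|
|     # Transposes the size x size block at (r, c) against its
|     # mirror block at (c, r); when r == c it sits on the diagonal.
|     def transpose(m, r, c, size):
|         if size <= 16:  # base case: fits in any plausible L1
|             for i in range(size):
|                 for j in range(i + 1 if r == c else 0, size):
|                     m[r + i][c + j], m[c + j][r + i] = \
|                         m[c + j][r + i], m[r + i][c + j]
|             return
|         h = size // 2
|         transpose(m, r, c, h)          # top-left
|         transpose(m, r + h, c + h, h)  # bottom-right
|         transpose(m, r, c + h, h)      # off-diagonal pair
|         if r != c:
|             transpose(m, r + h, c, h)  # only needed off-diagonal
|
| Calling transpose(m, 0, 0, len(m)) transposes the whole matrix.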
| lifeisstillgood wrote:
| Anaesthetists are required to undergo days of retraining per year
| because the field, and the safety numbers, keep moving.
|
| To do this, researchers publish findings and professionals
| aggregate those into useful training materials.
|
| Who / where are the useful training materials for software devs?
| It really cannot be just blogs.
|
| That's a VC investment that has ROI across the industry - tell
| a16z how to measure that and we are quids in
| tialaramex wrote:
| Anaesthetists also undergo many _years_ of training _prior_ to
| taking post. Typically they've got, say, 10 years of training
| before they actually do the job, much of it specialised to this
| job in particular (and the rest in general or intensive care
| medicine).
|
| If you're lucky a Software Engineer has a three year degree in
| CS, and probably only a semester at best was studying "Software
| Engineering" and even that might focus on something you don't
| care about, such as formal methods.
|
| It is _entirely possible_ that your junior engineers have never
| maintained a sizeable codebase for more than a few weeks, have
| never co-operated on software with more than a handful of other
| people, and have never used most of the common software
| libraries you use every day, regardless of whether these are
| in-house (so how could they) or widely available.
|
| For example maybe you do lots of web apps fronting SQL Server
| databases in C#. Your new hire has six months of C#, they half-
| remember a course doing SQL on an Oracle database, and all
| their web knowledge is in Javascript. Do they know version
| control? Kinda. Have they used a test framework before? Well,
| they did in Java but never in C#.
|
| The "All Your Tests Are Terrible" talk begins by pointing out
| that probably they're strictly wrong because you _don't have
| any fucking tests_. All of the rest of the talk is about the
| hideous tests that you're unhappy to find because you forget
| that somebody could just not have bothered to write any tests
| at all.
| globnomulous wrote:
| Hey, I resent and agree with this!
| simoncion wrote:
| > All of the rest of the talk is about the hideous tests that
| you're unhappy to find because you forget that somebody could
| just not have bothered to write any tests at all.
|
| At many times in my professional "career", I've found myself
| wishing that whoever wrote the test I'm staring at just
| hadn't bothered. "Tests that say and/or look like they test
| one thing, but test something entirely different." [0] and
| "Tests that actually test nothing at all and never fail when
| the code they claim to test is changed out from under them."
| are two of the many categories of tests I wish folks would
| just never have wasted their time writing.
|
| [0] To be clear, I usually find tests of this type to claim
| to be testing something useful, but actually be testing
| something that's not worth testing.
| simoncion wrote:
| > Who / where are the useful training materials for software
| devs? It really cannot be just blogs.
|
| Why can't it be blogs, books, and word-of-mouth?
|
| If you're a programmer (as your HN profile suggests that you
| are), you _already_ know how little formal training we receive.
| lifeisstillgood wrote:
| I think I am saying we should fund more training materials
| and spread the word on where they are to be found.
|
| Maybe.
|
| But something!
| mrkeen wrote:
| "nulls should not be in your language" is my version of
| "doctors should wash their hands"
|
| "Immutability-first, with mutation-as-a-special-case" is my
| "accountants should not use erasers"
|
| "make illegal states unrepresentable" is my "hospitals should
| sterilise their operating rooms and equipment"
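|
| That last one, as a minimal sketch (a hypothetical
| connection-state example):
|
|     # Each state carries only the data valid for it, so
|     # "connected but no socket" cannot even be expressed.
|     from dataclasses import dataclass
|     from typing import Union
|
|     @dataclass
|     class Disconnected:
|         pass
|
|     @dataclass
|     class Connected:
|         socket_fd: int  # only exists while connected
|
|     @dataclass
|     class Failed:
|         error: str      # only exists after a failure
|
|     ConnectionState = Union[Disconnected, Connected, Failed]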
|
| As a field, we are very far from reaching broad agreement on
| the left side (which I consider the basics). So what would the
| training materials teach?
| feoren wrote:
| Lately, "nulls should not be in your language" sounds like
| "reality should be simpler". Queue angry fist-shaking at the
| complexities of real life and a wistful longing for a
| logical, functional, mathematical world. In reality,
| sometimes out of 8 questions, you only know the answer to 7,
| and the missing answer could be any of the 8. So you can:
|
| 1. Prevent the user from doing anything until they go find
| that 8th answer, which cascades into your tool not helping
| anyone until they have a complete, validated, full,
| historical data set.
|
| 2. Subdivide those 8 questions into having independent
| existence, which cascades into your entire application being
| a giant key/value store, and all the difficulties of null are
| simply replaced with the difficulties of "this key/value pair
| does not exist yet".
|
| 3. Add a sentinel value for "I don't know this yet", while
| feeling good about yourself because that sentinel value is
| not _technically_ null, while introducing all the same exact
| issues that null causes, plus a bunch of infrastructure you
| have to write yourself. Basically: reimplement null, but
| worse.
|
| 4. Make the answers nullable.
|
| It'd be like claiming that an IDE should simply not allow
| syntax errors to be entered at all. Errors would be
| completely impossible! Except at some point your user needs
| to actually _write_ the thing, and you've just abandoned the
| idea of helping them. So they write it in some other editor,
| and then paste it into your IDE. Or you embrace the fact that
| incomplete work is OK.
|
| Yes, nulls are generally way over-used. But disallowing them
| entirely is a fool's errand. Source: I've been that fool.
|
| In before: "You should just be using Maybe<> everywhere" --
| "Maybe" is just another word for "nullable". The only
| difference is the level of support you get from the compiler
| / type-checker. So that argument is "the compiler should help
| you get null right", which I completely agree with! There's
| still work to be done getting type checkers better at null.
| But that's a far cry from "null should not be in your
| language" and equating the use of nulls to a doctor not
| washing his hands.
| int_19h wrote:
| Of course nobody is seriously arguing for ditching the
| notion of a sentinel value that indicates absence of a value.
| The question, rather, is how best to handle that case.
|
| > In before: "You should just be using Maybe<> everywhere"
| -- "Maybe" is just another word for "nullable". The only
| difference is the level of support you get from the
| compiler / type-checker.
|
| The difference is that in languages with null
| references/pointers, all references/pointers are
| _implicitly_ nullable - that is, it is indeed like "using
| Maybe<> everywhere". But when you have proper option types,
| you use Maybe<> not _everywhere_, but only where you know
| nulls can actually occur.
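|
| In Python's gradual typing the same distinction looks roughly
| like this (a sketch; a checker such as mypy is assumed):
|
|     from typing import Optional
|
|     def find_role(name: str) -> Optional[str]:  # absence opt-in
|         return {"ada": "admin"}.get(name)       # may be None
|
|     def greet(role: str) -> str:                # ...never here
|         return "hello, " + role
|
|     role = find_role("ada")
|     if role is not None:       # the checker insists on this test
|         print(greet(role))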
| feoren wrote:
| > The difference is ... [whether] all references/pointers
| are implicitly nullable [or not]
|
| Right, so we're narrowing down the actual problem here,
| which revolves around implicit vs. explicit, compiler
| support vs. on-your-own, static vs. dynamic typing. I
| agree with you that I strongly prefer explicit over
| implicit nullability, assuming it fits within the idioms
| of the language. But this is a stronger condition than
| static typing, so to believe "implicit nulls" is as
| irresponsible as a _doctor not washing their hands_,
| you* have to believe that everyone using dynamic typing
| is even more irresponsible than that. There's a place for
| dynamically typed languages, there's a place for implicit
| nulls, there's a place for strict type- and null-
| checking, there's a place for mathematically proving code
| correct. Engineering is always about tradeoffs -- it's
| way too simplistic to just say "null bad".
|
| * I know it wasn't literally _you_ arguing that; I'm
| still partly addressing mrkeen here.
| zozbot234 wrote:
| Tl;dr: cache is the new RAM; RAM is the new disk; disk is the new
| tape.
| antirez wrote:
| Using Python pseudocode in this context is hardly
| understandable.
| purplesyringa wrote:
| What concise and readable language would you suggest instead?
| antirez wrote:
| That's not the problem at hand. Python is good for
| pseudocode. But not if you want to talk about cache misses,
| because in pseudocode written in higher-level languages a
| lot of details of how memory is accessed are opaque.
| purplesyringa wrote:
| Again, what would you suggest instead? If you can't guess
| that `list` is supposed to represent an array of
| consecutive elements, I have trouble thinking of a language
| that'll make that clear without being exceedingly verbose
| for no reason.
| antirez wrote:
| A bit later in the article, you'll see that the memory
| patterns in allocating the arrays play a role. A role
| that was hidden initially.
|
| > Again, what would you suggest instead?
|
| The answer is inside you. You have only to search for it.
| Or, if you really want to extract the obvious from me: any
| low-level language (even without implementing every detail,
| but calling imaginary functions whose names suggest what
| they are doing). This exercise will show you, for instance,
| that you'll have to immediately choose whether to
| append_to_dynamic_array() or add_to_linked_list().
| purplesyringa wrote:
| > This exercise will show you, for instance, that you'll
| have to immediately choose whether to
| append_to_dynamic_array() or add_to_linked_list().
|
| Linked lists, _that's_ what you're worried about?
| `[...]` is not a linked list in Python, in fact I don't
| know any imperative language where it's something other
| than a dynamic array/vector. I can only assume someone
| who doesn't understand or intuit this is arguing in bad
| faith, especially when taking your attitude into account.
| What did I do to deserve being talked down to?
|
| > any low level language
|
| Like this?
|     std::vector<std::vector<element_t>> groups(n_groups);
|     for (auto&& element : elements) {
|         groups[element.group].push_back(std::move(element));
|     }
|
|     std::sort(elements.begin(), elements.end(),
|               [](const auto& a, const auto& b) {
|         return a.group < b.group;
|     });
|     std::vector<std::vector<element_t>> groups;
|     for (auto group_elements : group_by(std::move(elements),
|              [](const auto& element) { return element.group; })) {
|         groups.push_back(std::move(group_elements));
|     }
|
| Is it worth it? If you don't know `list` is a dynamic
| array in Python, how will you know `std::vector` is a
| dynamic array in C++? Not to mention the syntax is
| terrible. C would be just as bad. Using Rust would get
| people angry about Rust evangelism. The only winning move
| is not to play.
| antirez wrote:
|     // Create n empty arrays.
|     DynamicArray *groups[n_groups];
|     for (int i = 0; i < n_groups; i++) {
|         groups[i] = create_empty_array();
|     }
|
|     // For each element in elements[], append it to its
|     // group's array.
|     for (int i = 0; i < n_elements; i++) {
|         append_to_dynamic_array(groups[elements[i].group],
|                                 elements[i].value);
|     }
| purplesyringa wrote:
| antirez: You already got it wrong. You've got a double
| indirection, with `groups[i]` pointing to a heap-
| allocated instance of `DynamicArray`, which supposedly
| stores the actual heap pointer.
|
| It's not about the language being low-level or high-
| level. It's about understanding the basics of memory
| layout. If the pseudocode being in Python is an obstacle
| for you, the problem is not in Python, but in your (lack
| of) intuition.
| antirez wrote:
| I wrote the first random semantics that came to mind in
| the C pseudocode above. The fact that you believe it is
| not the right one _is proof that in C I can express the
| exact semantics_, and in Python I can't (because
| everything is hidden inside how the Python interpreter
| may or may not implement a given thing).
|
| So, modify my C code to match what you have in mind, if
| you wish. But the point is that low level code will
| specify in a much clearer way what's going on with
| memory, and in a piece of code that is about memory
| layout / access, that's a good idea. Otherwise I consider
| Python a good language for pseudocode. Just: not in this
| case.
|
| Have a nice day, I'll stop replying since it would be
| useless to continue. P.S. I didn't downvote you.
| binary132 wrote:
| If only there were a better syntax for abstractions over
| machine code. :)
| sebstefan wrote:
| For an article about caching and performance, C is the lingua
| franca.
| Cthulhu_ wrote:
| C was suggested, but I find Go to be good for this
| purpose too: you get some of the low-level memory access
| (stack vs heap stuff) but don't need to worry about memory
| allocation or whatnot. Great for showing the differences in
| algorithms.
| veltas wrote:
| > The only way to prevent cache misses is to make the memory
| accesses more ordered.
|
| Or prefetch. Not enough people know about prefetch.
| adrian_b wrote:
| All modern CPUs have powerful hardware prefetchers, which are
| insufficiently documented.
|
| On any modern CPU (i.e. from the last 20 years), explicit
| prefetching is very seldom useful.
|
| What is important is to organize your memory accesses so
| that they occur in one of the patterns that trigger the
| hardware prefetcher, which will then take care of providing
| the data on time.
|
| The patterns recognized by the hardware prefetchers vary from
| CPU to CPU, but all of them include accesses where the
| addresses form arithmetic progressions, going either forwards
| or backwards, so usually all array operations will be
| accelerated automatically by the hardware prefetchers.
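|
| One practical consequence, as a sketch (assuming numpy):
| walking a 2-D array along its memory layout produces exactly
| the small-stride progression every prefetcher recognizes,
| while walking the other axis produces huge strides that may
| defeat it.
|
|     import time
|     import numpy as np
|
|     a = np.zeros((4096, 4096), dtype=np.int64)  # row-major
|
|     t0 = time.perf_counter()
|     s_rows = sum(row.sum() for row in a)  # consecutive addresses
|     t1 = time.perf_counter()
|     s_cols = sum(a[:, j].sum() for j in range(a.shape[1]))
|     t2 = time.perf_counter()
|
|     print(f"rows: {t1 - t0:.3f}s, columns: {t2 - t1:.3f}s")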
| bjornsing wrote:
| I sometimes get the feeling it would be better to just start
| treating RAM as secondary storage, with read/write access through
| a (hardware implemented) io_uring style API. Is there anything
| along those lines out there?
| rwmj wrote:
| CXL RAM is something like this on the physical side. It's RAM
| over a PCIe connection, and PCIe is basically a very fast,
| serial, point to point network. However as far as I'm aware the
| "API" to software makes it looks just like regular, local RAM.
|
| https://en.wikipedia.org/wiki/Compute_Express_Link
| froh wrote:
| I'd love to understand the differences between the "devices"
| named A, Y, M in the performance measurement, referring to "(A,
| Y, M indicate different devices)"
|
| Any pointers appreciated.
| Kon-Peki wrote:
| Radix sorting is great, as this article points out. But I
| followed the links and didn't see any mention of the most obvious
| optimization that makes it even better on _modern_ systems.
|
| If you do MSB for the first pass, then the subsequent passes
| within each bucket are completely independent :) So... Choose a
| large radix and do one MSB pass, then parallelize the LSB
| processing of each bucket. This is fantastic on a GPU. You might
| think that you would want many thousands of kernels running
| simultaneously, but in actuality each SM (and the Apple
| equivalent) has a thread-local cache, so you really only want a
| couple hundred simultaneous kernels at the most. I've written
| CUDA and Metal versions and they are _fast_. And you don't even
| have to be very good at GPU programming to get there. It's very
| basic. AoCP has the pseudocode.
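|
| On a CPU the same idea might be sketched like this (a rough
| sketch for 32-bit non-negative keys; sorted() stands in for
| the per-bucket LSB passes, and a __main__ guard is needed
| when run as a script):
|
|     from concurrent.futures import ProcessPoolExecutor
|
|     def msb_bucket(elements, shift=24):
|         # One MSB pass: 256 buckets, independent from here on.
|         buckets = [[] for _ in range(256)]
|         for x in elements:
|             buckets[(x >> shift) & 0xFF].append(x)
|         return buckets
|
|     def parallel_radix_sort(elements):
|         buckets = msb_bucket(elements)
|         with ProcessPoolExecutor() as pool:
|             # Bucket order is already the final output order.
|             sorted_buckets = pool.map(sorted, buckets)
|         out = []
|         for b in sorted_buckets:
|             out.extend(b)
|         return out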
| mpweiher wrote:
| Not sure what "Myth" the author thinks they are exposing.
|
| Jim Gray wrote "RAM is the new disk" at least as early as
| 2006, so almost 20 years ago. And people have been saying
| "RAM is the new tape" for quite some time now as well.
| exabrial wrote:
| RAM Myth #2: Free memory is a good thing. A billion Windows
| "power" users have etched this into canon.
| verall wrote:
| Windows starts doing awful things under slight memory
| pressure, like 80% usage.
|
| There are cascading failures like high RAM usage ->
| swap+compression -> power+thermal limits -> desktop
| responsiveness is unreasonably bad (e.g. >5 minutes to open
| task manager to kill a single firefox.exe using >10G memory,
| caused by a single misbehaving javascript among many tabs).
|
| This is an incredibly common scenario for laptops that cost
| less than $1000.
|
| Free memory means things will probably not go to shit in the
| next 10 minutes.
|
| Linux is a whole different story with different sharp edges.
| 0x1ceb00da wrote:
| Tried running the code. Crashes with 'attempt to add with
| overflow'.
___________________________________________________________________
(page generated 2024-12-19 23:02 UTC)