[HN Gopher] I/O is no longer the bottleneck
       ___________________________________________________________________
        
       I/O is no longer the bottleneck
        
       Author : benhoyt
       Score  : 341 points
       Date   : 2022-11-26 09:24 UTC (13 hours ago)
        
 (HTM) web link (benhoyt.com)
 (TXT) w3m dump (benhoyt.com)
        
       | btbuildem wrote:
       | As it often goes with these types of interview questions, there's
       | a lot of context missing. What is the goal? Do we want readable
       | code? Do we want fast code? Are we constrained somehow? It seems
       | here the author doesn't really know, but kudos to them for
       | examining their assumptions.
       | 
       | As a side note, a trie would be a neat data structure to use in a
       | solution for this toy problem.
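        | 
        |   For what it's worth, a minimal Python sketch of a trie-based
        |   counter (hypothetical, not from the article; a plain dict
        |   counter is almost certainly faster in practice):
        | 
        |     import sys
        | 
        |     def count_words_trie(text):
        |         # Each node maps letter -> child node; the count for a
        |         # word ending at a node lives under the key "$".
        |         root = {}
        |         for word in text.lower().split():
        |             node = root
        |             for ch in word:
        |                 node = node.setdefault(ch, {})
        |             node["$"] = node.get("$", 0) + 1
        |         return root
        | 
        |     def dump(node, prefix=""):
        |         # Walk the trie, yielding (word, count) pairs.
        |         if "$" in node:
        |             yield prefix, node["$"]
        |         for ch, child in node.items():
        |             if ch != "$":
        |                 yield from dump(child, prefix + ch)
        | 
        |     counts = sorted(dump(count_words_trie(sys.stdin.read())),
        |                     key=lambda kv: kv[1], reverse=True)
        |     for word, n in counts:
        |         print(word, n)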
        
       | furstenheim wrote:
       | It would be nice to see this benchmark in Node.js with streams
        
       | SleepyMyroslav wrote:
       | the state of benchmarking by normal IT people is tragic. If one
       | checks out his 'optimization problem statement' article [1] they
       | can find:
       | 
       | >ASCII: it's okay to only support ASCII
       | 
       | >Threading: it should run in a single thread on a single machine
       | 
       | >Stdlib: only use the language's standard library functions.
       | 
       | This is truly 1978 all over again. No flame graphs, no hardware
       | counters, no bottleneck analysis. Using these 'optimizations' for
       | job interviews is questionable at best.
       | 
       | [1] https://benhoyt.com/writings/count-words/
        
         | chakkepolja wrote:
         | Still better than obscure question from page 20 of Leetcode.
        
       | mastazi wrote:
       | Networking is I/O: API calls, database access, etc. - it's not
       | just disk access. The article is deriving a generalised statement
       | based on a very specific use case.
        
       | MrLeap wrote:
       | Interesting! This made me wonder -- would this kind of
       | optimization be recognized and rewarded in colossal scale
       | organizations?
       | 
       | I've seen comments about Google multiple times here where people
        | say you won't be getting promotions unless you're shipping new
        | things -- maintaining the old won't do it.
       | 
       | But if you get to something core enough, it seems like the
       | numbers would be pretty tangible and easy to point to during perf
       | review time?
       | 
       | "Found a smoother way to sort numbers that reduced the
       | "whirrrrrr" noise our disks made. It turns out this reduces disk
       | failure rates by 1%, arrested nanoscale structural damage to the
       | buildings our servers are in, allowed a reduction in necessary
        | PPE, elongated depreciation offsets and other things -- this one
       | line of code has saved Google a billion dollars. That's why my
       | compensation should be increased to include allowing me to fall
       | limply into the arms of another and be carried, drooling, into
       | the office, where others will dress me"
       | 
       | In this hypothetical scenario, would a Googler be told "Your
       | request has been approved, it may take one or two payment periods
       | before your new benefits break into your apartment" or "No, you
       | need to ship another chat program before you're eligible for
       | that."?
        
         | johnfn wrote:
         | > Found a smoother way to sort numbers that reduced the
         | "whirrrrrr" noise our disks made. It turns out this reduces
         | disk failure rates by 1%, arrested nanoscale structural damage
         | to the buildings our servers are in, allowed a reduction in
          | necessary PPE, elongated depreciation offsets and other things
         | -- this one line of code has saved Google a billion dollars
         | 
         | Hah! I mean, if you can truly prove a business benefit by
         | improving performance, I'm sure that you'd have a good shot at
          | a promotion. The thing is, it's actually quite difficult to do
          | so, and in the likely case you cannot, it just looks like you
          | squandered a bunch of time for no reason.
        
           | anthlax wrote:
           | Metrics are religiously collected and any sizable performance
           | improvement will have a clear impact on one metric or another
           | no?
        
         | TeMPOraL wrote:
         | "Promotion? You must be joking. Your little stunt caused a
         | revenue spike that pushed us over the threshold in $country,
         | prompting $regulatory-agency to look into starting antitrust
         | proceedings. Mitigating this will require an army of lawyers,
         | that will cost approximately a third of the $1B you 'saved' us.
         | Additionally, we will now have to create another throwaway chat
         | app, for which we'll allocate another third of a billion in the
         | budget. The final third... will go to executive bonuses,
         | obviously.
         | 
         | You are hereby placed on a Performance Improvement Plan,
         | starting tomorrow. On the off chance you'll come out of the
         | other end still employed, keep in mind that your manager isn't
         | being stupid by forbidding such 'optimizations', they're just
         | following orders."
        
         | PaulKeeble wrote:
          | They'd likely have become one of the 10k getting fired for
          | "underperforming". Company management in basically all
          | businesses doesn't like know-it-alls who do the opposite of
          | what they're told, regardless of the outcome.
        
           | plonk wrote:
           | Google's core services wouldn't work so reliably if they
           | didn't value optimization and engineering. I don't work
           | there, but I'm pretty sure that the SREs and the developers
           | behind Search and Maps don't get fired based on how many
           | products they launched.
        
           | vitiral wrote:
           | Is there a valid source for this number? All the articles I
           | see seem to quote each other.
        
           | kevinventullo wrote:
           | You don't know what you're talking about.
        
         | xyzzy_plugh wrote:
         | > This made me wonder -- would this kind of optimization be
         | recognized and rewarded in colossal scale organizations?
         | 
         | It depends if this kind of optimization is valuable to the
         | organization. Often times it's not. Spending money and time to
         | save money and time is often viewed as less efficient than
         | generating more revenue.
        
         | hansvm wrote:
         | Yeah, when I was there I saw plenty of <1% optimizations saving
         | REDACTED gobs of money, and people were rewarded for it. I
         | don't think it's applicable to most teams though.
         | 
         | Imagine a foo/bar/widget app that only serves 20B people
         | (obvious exaggeration to illustrate the point) and is only
         | necessary up to a few hundred times per day. You can handle
          | that sort of traffic on a laptop behind my home router and
          | still have enough chutzpah left to stream Netflix. I mean, you
          | are
         | Google, and you need to do something better than that [0], but
         | the hardware for your project is going to be negligible
         | compared to other concerns unless you're doing FHE or video
         | transcoding or something extraordinarily expensive.
         | 
         | Walk that backward to, how many teams have 20B users or are
         | doing extraordinarily expensive things? I don't have any clue,
         | but when you look at public examples of cheap things that never
         | got much traction and probably had a suite of engineers [1],
         | I'd imagine it's not everyone in any case. You're probably
         | mostly looking at people with enough seniority to be able to
         | choose to work on core code affecting most services.
         | 
         | [0] https://www.youtube.com/watch?v=3t6L-FlfeaI
         | 
         | [1] https://killedbygoogle.com/
        
           | ilyt wrote:
            | Google seems to have a problem of "you won't get a
            | promotion/raise if you don't work on something new", so they
            | are not interested in services that just work and provide a
            | tidy little constant revenue stream
        
         | evmar wrote:
         | [former Googler]
         | 
         | Yes, I occasionally saw people get highlighted for making
         | optimizations like "this saves 1% in [some important service]".
         | When you're running millions of machines, 1% is a lot of
         | machines. However, it's also likely the case that the easy 1%s
         | have already been found...
        
         | kevinventullo wrote:
         | Pay increases and promotions are for the most part a little
         | more formulaic and upper-bounded than what you describe, but
         | generally speaking if you can prove you saved the company $XM
         | you will be rewarded (from firsthand experience, this is also
         | true at Meta).
        
       | joshspankit wrote:
       | In the last few years I've been quietly moving database "rows"
       | from databases to the disk.
       | 
       | Back in the day accessing data from MySQL was actually _slower_
       | than current SSD speeds. And now you can get all sorts of
       | benefits for free: hard link deduplication, versioning, live
       | backup, easy usage of GNU tools...
       | 
       | I don't discuss this with certain types of colleagues, but the
       | results are excellent.
        
       | ThinkBeat wrote:
        | Is there any hardware accelerator / coprocessor for the PC that
        | will read a file into RAM autonomously, mainframe-style, and
        | notify the OS when the file is fully loaded into memory?
        | (bypassing the CPU entirely)
       | 
       | Leaving the CPU to bother with other things during that time.
        
         | jbverschoor wrote:
         | Isn't this what a DMA controller does?
        
         | imtringued wrote:
         | That is how I/O works already.
         | 
         | The problem is dealing with an asynchronous filesystem API
         | provided by the kernel.
        
         | pca006132 wrote:
         | You mean DMA?
        
         | 323 wrote:
         | This is already how stuff works.
         | 
         | There are even OS APIs for this - DirectStorage on Windows.
        
         | jeroenhd wrote:
         | Do we need a coprocessor for that? Any protocol using DMA
         | already does exactly this through interrupts or other message
         | passing systems.
         | 
         | DMA capable controllers are everywhere, I don't think you'll
         | find any storage controllers in your modern computer that don't
         | do this.
         | 
         | Of course DMA only operates on byte ranges and not on files,
         | but adding file system parsers to drives and disk controllers
         | sounds like a way to introduce awful bugs and corruption.
         | Assuming the OS keeps the necessary file system structures
         | cached in RAM, plain old DMA should be good enough here.
        
           | bogomipz wrote:
           | >"Any protocol using DMA already does exactly this through
           | interrupts or other message passing systems."
           | 
           | Interesting, I'm only familiar with the classic interrupts
           | approach. What are some of the other common message passing
           | systems used in DMA?
        
             | wtallis wrote:
             | NVMe has memory-mapped doorbell registers associated with
             | each submission or completion queue, which are ring buffers
             | typically allocated in host RAM and accessed by the SSD
             | using DMA. Generating interrupts when a new entry is posted
             | to a completion queue is optional. The host system can
             | instead simply poll the completion queue to check whether
             | any new IO operations have been completed. Depending on how
             | busy the drive is vs how busy the CPU is, forgoing
             | interrupts in favor of polling may reduce latency (fewer
             | context switches when handling a new completion), but can
             | also burn a lot of CPU time.
        
           | wtallis wrote:
           | > Of course DMA only operates on byte ranges and not on
           | files, but adding file system parsers to drives and disk
           | controllers sounds like a way to introduce awful bugs and
           | corruption.
           | 
           | There's an interesting intermediate solution in the form of
           | SSDs that provide a key-value interface instead of a linear
           | block device. That gives the SSD more explicit knowledge
           | about which chunks of data should be kept contiguous and can
           | be expected to have the same lifetime.
        
       | mastax wrote:
       | I was recently working on parsing 100K CSV files and inserting
       | them into a database. The files have a non-column-oriented header
       | and other idiosyncrasies so they can't be directly imported
       | easily. They're stored on an HDD so my first instinct was to
       | focus on I/O: read the whole file into memory as an async
       | operation so that there are fewer larger IOs to help the HDD and
       | so that other parser tasks can do work while waiting for the read
       | to complete. I used a pretty featureful C# CSV parsing library
       | which did pretty well on benchmarks [0] (CsvHelper) so I wasn't
       | really worried about that part.
       | 
       | But that intuition was completely wrong. The 100K CSV files only
       | add up to about 2GB. Despite being many small files reading
       | through them all is pretty fast the first time, even on Windows,
       | and then they're in the cache and you can ripgrep through them
       | all almost instantaneously. The pretty fast parser library is
       | fast because it uses runtime code generation for the specific
       | object type that is being deserialized. The overhead of
        | allocating a bunch of complex parser and type-converter
        | objects, doing reflection on the parsed types, and generating
        | code for a parser means that, for parsing lots of tiny files,
        | it's really slow.
       | 
       | I had to stop worrying about it because 2 minutes is fast enough
       | for a batch import process but it bothers me still.
       | 
       | Edit: CsvHelper doesn't have APIs to reuse parser objects. I
       | tested patching in a ConcurrentDictionary to cache the generated
       | code and it massively sped up the import. But again it was fast
       | enough and I couldn't let myself get nerd sniped.
       | 
       | Edit2: the import process would run in production on a server
       | with low average load, 256GB RAM, and ZFS with zstd compression.
       | So the CSV files will live permanently in the page cache and ZFS
       | ARC. The import will probably run a few dozen times a day to
       | catch changes. IO is really not going to be the problem. In fact,
       | it would probably speed things up to switch to synchronous reads
       | and remove all the async overhead. Oh well.
       | 
       | [0]: https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-
       | pa...
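        | 
        | Not C#/CsvHelper, but a rough Python analogue of the fix
        | described above -- cache the per-schema setup so it happens once
        | instead of once per tiny file (schema and files are made up):
        | 
        |     import csv, io
        |     from functools import lru_cache
        | 
        |     # Hypothetical schema; the real files have a non-columnar
        |     # header and other quirks this sketch ignores.
        |     SCHEMAS = {"measurement": {"ts": str, "value": float,
        |                                "sensor": int}}
        | 
        |     @lru_cache(maxsize=None)
        |     def make_row_converter(schema_name):
        |         # Expensive-ish setup, done once per schema -- the moral
        |         # equivalent of caching CsvHelper's generated readers in
        |         # a ConcurrentDictionary.
        |         schema = SCHEMAS[schema_name]
        |         fields, casts = list(schema), list(schema.values())
        |         def convert(row):
        |             return {f: c(v) for f, c, v in
        |                     zip(fields, casts, row)}
        |         return convert
        | 
        |     def parse_file(text, schema_name="measurement"):
        |         convert = make_row_converter(schema_name)  # cached
        |         rows = csv.reader(io.StringIO(text))
        |         next(rows)  # skip the idiosyncratic header
        |         return [convert(r) for r in rows]
        | 
        |     print(parse_file("ts,value,sensor\n2022-11-26,1.5,7\n"))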
        
       | notacoward wrote:
       | Interesting observation, but I think the author crosses a bridge
       | too far here.
       | 
       | > If you're processing "big data", disk I/O probably isn't the
       | bottleneck.
       | 
       | If it fits on a single machine, it is _by definition_ not big
        | data. When you're dealing with really big data, it's likely
       | coming from another machine, or more likely a cluster of them.
       | Networks can also be pretty fast, but there will still be some
       | delay associated with that plus the I/O (which might well be on
       | spinning rust instead of flash) on the other end. Big data
       | requires parallelism to cover those latencies. _Requires._ It
        | might be true that I/O is no longer likely to be the bottleneck
       | for a single-threaded program, but leave "big data" out of it
       | because in that world I/O really is still a - if not the - major
       | limiter.
        
         | kragen wrote:
         | the author is evidently pretty unskilled and unaware of it in a
         | lot of ways, this being one of them
        
       | visarga wrote:
       | Loading 1GB json files is still slow.
        
         | mrkeen wrote:
         | If you specify the type of the data, not just the size, then
         | yep: it wasn't really about the number of bytes read, but what
         | the CPU had to do to process it.
        
         | omgtehlion wrote:
         | Not if you use SAX-like (or "evented") parser
        
         | flohofwoe wrote:
         | That's usually not because of IO but because of dumb memory
         | management in the JSON parser implementation. Some JSON parsers
         | build a huge tree out of individually allocated items for each
         | JSON property, or in higher level languages, build an
         | object/property tree out of the JSON structure. That's
         | guaranteed to be slow for large trees.
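          | 
          | A rough Python illustration of the difference (assuming the
          | data is, or can be converted to, one JSON record per line,
          | which the original 1GB file may well not be):
          | 
          |     import json
          | 
          |     def load_all(path):
          |         # Builds one giant object tree: every dict, list and
          |         # string in the file becomes a separate Python object.
          |         with open(path) as f:
          |             return json.load(f)
          | 
          |     def stream_records(path):
          |         # Parses one small record at a time; peak memory stays
          |         # around the size of a single record.
          |         with open(path) as f:
          |             for line in f:
          |                 yield json.loads(line)
          | 
          |     # e.g.:
          |     # total = sum(r["value"]
          |     #             for r in stream_records("big.ndjson"))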
        
         | krapp wrote:
         | I don't want to be that guy but having 1GB json files to begin
         | with seems like the bigger problem.
        
       | therealbilly wrote:
       | Does any of this really matter?
        
       | daviddever23box wrote:
       | The problem here, as with most interview problems, is that it is
        | wholly dissociated from its context; memory constraints, disk I/O,
       | and file size are non-trivial considerations. Shame on people for
       | stare-at-one's-shoes thinking.
        
         | javajosh wrote:
         | People can make reasonable assumptions for interview problems.
          | In this case, assuming the code runs on a modern developer
          | laptop with an SSD and lots of RAM, using some modern OS, is a
          | fine assumption.
         | 
         | If someone asks you a physics question about a car being
         | dropped into the ocean, generally you don't have to worry about
         | the make and model of the car.
        
           | LastTrain wrote:
            | But if you read the article, you would see that you'd run
            | afoul of this interviewer doing that, because he has
           | utterly convinced himself that I/O is no longer the
           | bottleneck and coming to the same conclusion is your
           | unspecified task in the interview.
        
       | nkozyra wrote:
       | > Some candidates say that sorting is going to be the bottleneck,
       | because it's O(N log N) rather than the input processing, which
       | is O(N). However, it's easy to forget we're dealing with two
       | different N's: the total number of words in the file, and the
       | number of unique words.
       | 
       | I don't see how that changes anything. There's a reason we use
       | Big O rather than other notations. Their answer would still be
       | correct.
        
         | burlesona wrote:
          | No. Big O tells us how the compute time grows as the input
          | size increases for a given _algorithm_. It does not say
          | anything about the performance of a specific _program_ on a
          | specific input.
         | 
         | The author's point is that if we have two different inputs A
         | and B, and A is sufficiently smaller than B, then the runtime
         | of B will dominate the program. This would be true even if we
         | have to operate on A with a very slow algorithm.
         | 
         | For example suppose you have some extreme scenario where you
         | have to do three coloring on A and then find the minimum value
         | of B. The runtime would be O(2^A) + O(B). Just plug in numbers
         | and you can see that if B > 2^A then B takes longer even though
         | that algorithm is much faster. If you suppose B is always much
         | larger than 2^A, then B is your bottleneck.
        
         | bufferoverflow wrote:
         | It really depends on the number of unique words in the input
         | file. If it's all unique words, sorting can become the
         | bottleneck.
        
         | [deleted]
        
         | kragen wrote:
         | input processing is o(n), sorting the unique words is o(m log
         | m)
         | 
         | whether o(m log m) is bigger or smaller than o(n) depends on
         | the relationship between n and m
         | 
         | in the case used by another poster in this thread,
         | concatenating some large number of copies of the bible, m is
         | constant, so o(m log m) is o(1)
         | 
         | in another possible case where the corpus is generated
         | uniformly randomly over all possible words of, say, 20 letters,
         | though in theory o(m log m) is o(1), over a practical range it
         | would be o(n log n)
         | 
         | more typically m is proportional to some power of n depending
         | on the language; for english the power is 'typically between
         | 0.4 and 0.6'
         | 
          | as it turns out o(√n log √n) is less than o(n)
         | 
         | a totally different issue is that big-o notation doesn't always
         | narrow down where the bottleneck is even if you calculate it
         | correctly because n is never arbitrarily large and so the
         | bottleneck depends on the constant factors
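          | 
          | a quick numeric check of the √n claim (made-up n, exponents as
          | above):
          | 
          |     import math
          | 
          |     n = 100_000_000              # total words (made up)
          |     for p in (0.4, 0.5, 0.6):    # m ~ n**p
          |         m = n ** p
          |         print(p, f"{m * math.log2(m):,.0f}", "vs", f"{n:,}")
          | 
          | even at p = 0.6 the m log m term is around a million, far below
          | n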
        
           | nkozyra wrote:
           | > a totally different issue is that big-o notation doesn't
           | always narrow down where the bottleneck is even if you
           | calculate it correctly because n is never arbitrarily large
           | and so the bottleneck depends on the constant factors
           | 
           | Yeah I think this is part of my point, in that theoretical
           | best, average and worst case scenarios can be useful in
           | isolation but rarely tell the whole story.
           | 
            | In the worst case, every word is unique, the number of
            | unique words equals the total word count, and it's a standard
            | sorting problem.
        
       | mrkeen wrote:
       | A few ballpark numbers I encountered:
       | 
       | Sequentially reading a file on a spinny laptop disk was about
       | 80-100 MB/s. On an SSD that went up to 400-500 MB/s for me.
       | 
       | That's the sequential case! What about random access? I tried an
       | experiment where I memory mapped a large file and started
       | updating bytes at random. I could get the rate down to
       | kilobytes/sec.
       | 
        | Even though we've all heard that SSDs don't pay as much of a
        | penalty for random access as spinny disks, it's still a huge
       | penalty. Sequential spinny disk access is faster than SSD random
       | access.
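        | 
        | A rough Python sketch of that kind of experiment (sizes made up,
        | and the file needs to be bigger than RAM for the page cache not
        | to hide the cost):
        | 
        |     import mmap, os, random, time
        | 
        |     PATH, SIZE = "bigfile.bin", 1 << 30    # 1 GiB scratch file
        |     with open(PATH, "wb") as f:
        |         f.truncate(SIZE)
        | 
        |     with open(PATH, "r+b") as f:
        |         mm = mmap.mmap(f.fileno(), SIZE)
        |         random.seed(0)
        |         n, t0 = 200_000, time.perf_counter()
        |         for _ in range(n):
        |             i = random.randrange(SIZE)
        |             mm[i:i + 1] = b"\xff"          # dirty a random page
        |         mm.flush()
        |         dt = time.perf_counter() - t0
        |         print(f"{n / dt:,.0f} random byte writes/s")
        | 
        |     os.remove(PATH)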
        
         | GrayShade wrote:
         | > Sequential spinny disk access is faster than SSD random
         | access.
         | 
         | It is, but on both kind of drives you'll want to dispatch at
         | least a couple of requests at once to get better performance.
         | In the memory-mapped case, that means using multiple threads.
         | 
         | In addition, you might also want to call madvise(MADV_RANDOM)
         | on the mapping.
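          | 
          | In Python terms (mmap.madvise exists since 3.8; MADV_RANDOM is
          | a Linux flag):
          | 
          |     import mmap
          | 
          |     with open("bigfile.bin", "r+b") as f:
          |         mm = mmap.mmap(f.fileno(), 0)
          |         # Hint that access will be random, so the kernel skips
          |         # readahead for this mapping.
          |         mm.madvise(mmap.MADV_RANDOM)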
        
           | mrkeen wrote:
           | I didn't think there'd be that much discussion on my point
            | (seq hdd > rand ssd).
           | 
           | > In the memory-mapped case, that means using multiple
           | threads.
           | 
           | My gut tells me I'd lose more to contention/false-sharing
           | than I'd gain through multithreading - but I haven't done the
           | experiment.
        
         | koolba wrote:
         | > Sequential spinny disk access is faster than SSD random
         | access.
         | 
         | No it's not. At least not with modern SSDs or NVMe storage.
         | 
         | Even at 100 MB/s, a spinning disk in _sequential_ mode is doing
         | 100 x 1024  / 4 = 25,600 IOPS (assuming a standard 4K per
         | operation).
         | 
         | Even consumer grade NVMe hardware gets 5-10x of that for random
         | workloads.
        
           | kragen wrote:
           | it is true that spinning rust is slower than most ssds
           | including nvme ssds even in sequential mode
           | 
           | however, a spinning disk doing a sequential access is not
           | doing 25600 iops
           | 
           | if the sequential access lasts 10 seconds it is doing 0.1
           | iops
        
           | [deleted]
        
           | mrkeen wrote:
           | > Even consumer grade NVMe hardware gets 5-10x of that for
           | random workloads.
           | 
           | Cool, lots of IOPS!
           | 
           | But like I said, I got it down to kilobytes/sec.
        
             | justsomehnguy wrote:
             | Because you did it the most inefficient way.
             | 
             | This [0] comment is totally on point.
             | 
              | Also note that a consumer SSD can be made with even a
              | single flash chip. More performant ones are made of a bunch
              | of chips internally (essentially a RAID0 with some magic)
              | so they can do parallel operations if the data resides on
              | different flash blocks. Still, if your thread is only doing
              | one operation at a time with blocks < flash rewrite block
              | size you will hit write amplification anyway.
             | 
             | I think if you do the same test but without a memory mapped
             | file (ie let the OS and disk subsystem do their thing) you
             | will get much more _speed_.
             | 
             | [0] https://news.ycombinator.com/item?id=33751973
        
             | koolba wrote:
             | On what hardware and how much parallelization? The max IOPS
             | numbers are only when you're saturating the command queue.
             | 
             | It's a throughput number, not a single operation,
             | completion, followed by the next one.
        
         | wtallis wrote:
         | > I tried an experiment where I memory mapped a large file and
         | started updating bytes at random. I could get the rate down to
         | kilobytes/sec.
         | 
         | Memory-mapped IO means you're only giving the SSD one request
         | to work on at a time, because a thread can only page fault on
         | one page at a time. An SSD can only reach its peak random IO
         | throughput if you give it lots of requests to work on in
         | parallel. Additionally, your test was probably doing small
         | writes with all volatile caching disallowed, forcing the
         | (presumably consumer rather than enterprise) SSD to perform
         | read-modify-write cycles not just of the 4kB virtual memory
         | pages the OS works with, but also the larger native flash
         | memory page size (commonly 16kB). If you'd been testing only
         | read performance, or permitted a normal degree of write
         | caching, you would have seen far higher performance.
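          | 
          | A sketch of the 'lots of requests in parallel' idea in Python
          | (hypothetical file and sizes): a pool of threads issuing
          | os.pread keeps many reads in flight, where a memory-mapped
          | loop faults on one page at a time.
          | 
          |     import os, random
          |     from concurrent.futures import ThreadPoolExecutor
          | 
          |     PATH, BLOCK, N = "bigfile.bin", 4096, 100_000
          |     size = os.path.getsize(PATH)
          |     fd = os.open(PATH, os.O_RDONLY)
          |     random.seed(0)
          |     offsets = [random.randrange(size // BLOCK) * BLOCK
          |                for _ in range(N)]
          | 
          |     def read_at(off):
          |         return len(os.pread(fd, BLOCK, off))  # thread-safe
          | 
          |     # 32 workers keep the SSD's queue reasonably full; each
          |     # blocked read releases the GIL, so requests overlap.
          |     with ThreadPoolExecutor(max_workers=32) as pool:
          |         total = sum(pool.map(read_at, offsets))
          | 
          |     print(total, "bytes read")
          |     os.close(fd)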
        
         | d_tr wrote:
         | True. And main memory access can easily become slower than SSD
         | sequential access if you do random byte accesses and your
         | working set is larger than the CPU's caches or TLBs.
        
       | Sirupsen wrote:
       | Yes, sequential I/O bandwidth is closing the gap to memory. [1]
       | The I/O pattern to watch out for, and the biggest reason why e.g.
       | databases do careful caching to memory, is that _random_ I/O is
       | still dreadfully slow. I/O bandwidth is brilliant, but latency is
       | still disappointing compared to memory. Not to mention, in
       | typical Cloud workloads, IOPS are far more expensive than memory.
       | 
       | [1]: https://github.com/sirupsen/napkin-math
        
         | BulgarianIdiot wrote:
         | > Yes, sequential I/O bandwidth is closing the gap to memory.
         | 
          | Hilariously, meanwhile RAM has become significantly slower
          | relative to CPU performance, i.e. you spend a disproportionate
          | amount of time reading and writing memory, so even though RAM
          | is faster, the CPU is way faster.
         | 
         | Which means I/O remains a bottleneck...
        
         | jonstewart wrote:
          | Random I/O with NVMe is still slower than sequential I/O, but
          | the gap between the two has narrowed considerably, and random
          | I/O performance is quite high in historical/comparative
          | absolute terms. To get
         | close to peak random I/O limits, you need to dispatch a lot of
         | commands in parallel--that's an I/O idiom that doesn't really
         | exist in high level languages, and I think that's where a lot
         | of the challenge is.
        
           | crote wrote:
           | The problem is that a lot of workloads using random I/O have
           | dependencies between the different I/O operations. A database
           | traversing a tree-like index cannot issue the next read until
           | the previous one has finished. You are limited by latency,
           | which for NVMe is still orders of magnitude worse than
           | memory.
        
             | the8472 wrote:
             | Those are rarely the slow ones though. Lots of software
             | simply has not been written to keep IO queues full. They
             | read a bit of data, process it, read the next bit and so
             | on. On a single thread. This makes all kinds of IO
             | (including network) way slower than it has to be.
             | 
             | For example a tree-index can be parallelized by walking
             | down different branches. On top of that one can issue a
             | prefetch for the next node (on each branch) while
             | processing the current ones.
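              | 
              | A sketch of that overlap pattern (read_node and process are
              | hypothetical): kick off the next read before processing the
              | block you already have.
              | 
              |     from concurrent.futures import ThreadPoolExecutor
              | 
              |     def walk(nodes, read_node, process):
              |         # read_node(n) does blocking I/O, process(d) the
              |         # CPU work; the two overlap via the pool.
              |         with ThreadPoolExecutor(max_workers=2) as pool:
              |             fut = pool.submit(read_node, nodes[0])
              |             for nxt in nodes[1:]:
              |                 data = fut.result()
              |                 # prefetch the next node's read
              |                 fut = pool.submit(read_node, nxt)
              |                 process(data)
              |             process(fut.result())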
        
               | ilyt wrote:
               | Yup, a lot of software is (was?) written with assumptions
                | that mattered with spinning rust. And even if the author
                | didn't intend to, serial code generates serial, dependent
               | IO.
        
         | ShredKazoo wrote:
         | I thought for SSDs it didn't matter whether data was adjacent
         | on disk?
        
           | crote wrote:
           | Well, yes and no.
           | 
           | With spinning rust you have to wait for the sector you want
            | to read to rotate underneath the read head. For a fast 10,000
           | RPM drive, a single rotation takes 6 milliseconds. This means
           | that for random access the average latency is going to be 3
           | milliseconds - and even that's ignoring the need to move the
           | read head between different tracks! Sequential data doesn't
           | suffer from this, because it'll be passing underneath the
           | read head in the exact order you want - you can even take the
           | track switching time into account to make this even better.
           | 
           | SSDs have a different problem. Due to the way NAND is
           | physically constructed it is only possible to read a single
           | page at a time, and accessing a single page has a latency of
           | a few nanoseconds. This immediately places a lower limit on
           | the random read access time. However, SSDs allow you to send
           | read commands which span many pages, allowing the SSD to
           | reorder the reads in the most optimal way, and do multiple
           | reads in parallel. This means that you only have to pay the
           | random access penalty once - not to mention that you have to
           | issue way fewer commands to the SSD.
           | 
           | SSDs try to make this somewhat better by having a very deep
           | command queue: you can issue literally thousands of random
           | reads at once, and the SSD will reorder them for faster
           | execution. Unfortunately this doesn't gain you a lot if your
           | random reads have dependencies, such as when traversing a
           | tree structure, and you are still wasting a lot of effort
           | reading entire pages when you only need a few bytes.
        
             | jonstewart wrote:
             | You really need an NVME interface to the SSD, though. SATA3
             | is the bottleneck for SATA SSDs
        
             | ShredKazoo wrote:
             | Interesting, thanks! So it sounds like it's not so much
             | "random" I/O that's slow, but rather "unbatched" I/O or
             | something like that?
             | 
             | Curious to hear your thoughts on this thread if you have
             | time to share:
             | https://news.ycombinator.com/item?id=33752870
        
             | mamcx wrote:
             | > Unfortunately this doesn't gain you a lot if your random
             | reads have dependencies, such as when traversing a tree
             | structure,
             | 
              | So, does this mean B-trees suffer? What would be the most
              | optimal layout for database storage where only SSDs matter?
              | 
              | I'm working on one that is just WAL-only and scans
              | everything on each operation (for now!) and want to see
              | what I can do to improve the situation.
        
         | marginalia_nu wrote:
         | Depending on how your madvise is set up, it's often the case
          | that sequential disk reads _are_ memory reads. You're
          | typically only paying the price for touching the first page in
         | a sequential run, that or subsequent page faults come at a big
         | discount.
         | 
          | If you read 1,000,000 random bytes (~1 MB) scattered across a
         | huge file (let's say you're fetching from some humongous on-
         | disk hash table), it will to a first order be about as slow as
         | reading 4 Gb sequentially. This will incur the same number of
         | page faults. There _are_ ways of speeding this up, but only so
         | much.
         | 
         | Although, I/O is like an onion of caching layers, so in
         | practice this may or may not hold up depending on previous
         | access patterns of the file, lunar cycles, whether venus is in
         | retrograde.
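          | 
          | The arithmetic behind that comparison, assuming 4 KiB pages:
          | 
          |     faults, page = 1_000_000, 4096
          |     print(faults * page / 2**30, "GiB of pages faulted in")
          |     # ~3.8 GiB of pages for ~1 MB of actual payload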
        
           | Sirupsen wrote:
           | `madvise(2)` doesn't matter _that_ much in my experience with
           | [1] on modern Linux kernels. SSD just can't read _quite_ as
           | quickly as memory in my testing. Sure, SSD will be able to
           | re-read a lot into ram, analogous to how memory reading will
           | be able to rapidly prefetch into L1.
           | 
           | I get ~30 GiB/s for threaded sequential memory reads, but ~4
           | GiB/s for SSD. However, I think the SSD number is single-
           | threaded and not even with io_uring--so I need to regenerate
           | those numbers. It's possible it could be 2-4x better.
           | 
           | [1]: https://github.com/sirupsen/napkin-math
        
             | shitlord wrote:
             | How'd you measure the maximum memory bandwidth? In
             | Algorithmica's benchmark, the max bandwidth was observed to
             | be about 42 GBPS: https://en.algorithmica.org/hpc/cpu-
             | cache/sharing/
             | 
             | I'm not sure how they calculated the theoretical limit of
             | 42.4 GBPS, but they have multiple measurements higher than
             | 30 GBPS.
        
             | marginalia_nu wrote:
             | I think the effects of madvise primarily crop up in
             | extremely I/O-saturated scenarios, which are rare. Reads
             | primarily incur latency, with a good SSD it's hard to
             | actually run into IOPS limitations and you're not likely to
             | run out of RAM for caching either in this scenario.
             | MADV_RANDOM is usually a pessimization, MADV_SEQUENTIAL may
             | help if you are truly reading sequentially, but may also
             | worsen performance as pages don't linger as long.
             | 
             | But as I mentioned, there's caching upon caching, and also
             | protocol level optimizations, and hardware-level
             | considerations (physical block size may be quite large but
             | is generally unknown).
             | 
             | It's nearly impossible to benchmark this stuff in a
             | meaningful way. Or rather, it's nearly impossible to know
             | what you are benchmarking, as there are a lot of
             | nontrivially stateful parts all the way down that have real
             | impact on your performance.
             | 
             | There are so many moving parts I think the only meaningful
             | disk benchmarks consider whatever application you want to
             | make go faster. Do the change. Is it faster? Great. Is it
             | not? Well at least you learned.
        
             | menaerus wrote:
             | > I get ~30 GiB/s for threaded sequential memory reads, but
             | ~4 GiB/s for SSD. However, I think the SSD number is
             | single-threaded and not even with io_uring--so I need to
             | regenerate those numbers. It's possible it could be 2-4x
             | better.
             | 
              | Assuming that you run the experiments on an NVMe SSD
              | attached to PCIe 3.0, where the theoretical maximum is
              | around 1GB/s per lane, I am not sure I understand how you
              | expect to go faster than 4 GiB/s? Isn't that already the
              | theoretical maximum of what you can achieve?
        
               | formerly_proven wrote:
               | PCIe 4.0 SSDs are pretty common nowadays and are
               | basically limited to what PCIe 4.0 x4 can do (around 7
               | GB/s net throughput).
        
               | menaerus wrote:
               | I don't think they're that common. You would have to have
               | quite recentish motherboard and CPU that both support
               | PCIe 4.0.
               | 
               | And I'm pretty sure that parent comment doesn't own such
               | a machine because otherwise I'd expect 7-8GB/s figure to
               | be reported in the first place.
        
               | dagmx wrote:
               | I really doubt they're that common. They only became
               | available on motherboards fairly recently, and are quite
               | expensive.
               | 
               | I'd guess that they're a small minority of devices at the
               | moment.
        
               | Sirupsen wrote:
               | You might be very right about that! It's been a while
               | since I did the SSD benchmarks. Glad to hear it's most
               | likely entirely accurate at 4 GiB/s then!
        
         | derefr wrote:
         | > _random_ I/O is still dreadfully slow. I/O bandwidth is
         | brilliant, but latency is still disappointing compared to
         | memory.
         | 
         | The wondrous thing about modern CPU architectures (e.g. Zen3),
         | though, is all the PCIe lanes you get with them. If you really
         | need high random IOPS, you can now cram 24 four-lane NVMe disks
         | into a commodity server (with PCIe M.2 splitter cards) and
         | saturate the link bandwidth on all of them. Throw them all in a
         | RAID0, and stick a filesystem on them with the appropriate
          | stripe width, and you'll get something that's only about 3x
          | higher latency for cold(!) random reads than a read from RAM.
         | 
         | (My company provides a data-analytics SaaS product; this is
         | what our pool of [shared multitenant, high concurrency] DB
         | read-replicas look like.)
        
           | Dylan16807 wrote:
           | I thought NVMe flash latency was measured in tens of
           | microseconds. 3x RAM would be a fraction of a microsecond,
           | right?
        
             | emn13 wrote:
             | I think the person you're replying to is confusing IOPS
             | with latency. If you add enough parallelism, then NAND
             | flash random-read IOPS will eventually reach DRAM
             | performance.
             | 
             | But it's not going to be easy - for a sense of scale I just
             | tested a 7950x at stock speeds with stock JEDEC DDR5
             | timings. I inserted a bunch of numbers in an 8GB block of
             | memory, and with a deterministic random seed randomly pick
             | 4kb pages, computing their sum and eventually reporting
             | that (to avoid overly clever dead-code analysis, and make
             | sure the data is fully read).
             | 
             | With an SSD-friendly 4K page size that resulted in 2.8
             | million iops of QD1 random read. By comparison, a web
              | search for Intel's P5800X Optane's QD1 results shows 0.11
             | million iops, and that's the fastest random read SSD there
             | is at those queue depths, AFAIK.
             | 
             | If you add parallelism, then ddr5 reaches 11.6 million iops
             | at QD16 (via 16 threads), fast SSDs reach around 1 million,
             | the optane reaches 1.5 million. An Epyc Genoa server chip
             | has 6 times as many DDR5 memory channels as this client
             | system does; and I'm not sure how well that scales, but 60
             | million 4kb random read iops sounds reasonable, I assume.
             | Intel's memory controllers are supposedly even better (at
             | least for clients). Turning on XMP and PBO improves results
             | by 15-20%; and even tighter secondary/tertiary timings are
             | likely possible.
             | 
              | I don't think you're going to reach those numbers even
              | with 24 fast NVMe drives.
             | 
             | And then there's the fact that I picked the ssd-friendly
             | 4kb size; 64-byte random reads reach 260 million iops -
             | that's not quite as much bandwidth as @ 4kb, but the
             | scaling is pretty decent. Good luck reaching those kind of
             | numbers on SSDs, let alone the kind of numbers a 12-channel
             | server might reach...
             | 
             | We're getting close enough that the loss in performance at
             | highly parallel workloads is perhaps acceptable enough for
             | some applications. But it's still going to be a serious
             | engineering challenge to even get there, and you're only
             | going to come close under ideal (for the NAND)
             | circumstances - lower parallelism or smaller pages and it's
             | pretty much hopeless to arrive at even the same order of
             | magnitude.
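              | 
              | Not the original benchmark (that sounded like native
              | code), but a rough numpy sketch of the method --
              | deterministic seed, random 4 KiB pages out of an 8 GiB
              | block, summed so the reads can't be optimized away.
              | Interpreter overhead dominates in Python, so treat it as a
              | sketch of the method, not of the numbers:
              | 
              |     import time
              |     import numpy as np
              | 
              |     N = 8 * 2**30 // 8                # 8 GiB of int64
              |     data = np.ones(N, dtype=np.int64)
              |     rng = np.random.default_rng(42)   # fixed seed
              |     page = 4096 // 8                  # 4 KiB in elements
              |     starts = rng.integers(0, N // page, 1_000_000) * page
              | 
              |     t0, total = time.perf_counter(), 0
              |     for s in starts:
              |         total += data[s:s + page].sum()  # one 4 KiB read
              |     dt = time.perf_counter() - t0
              |     print(total, f"{len(starts) / dt:,.0f} reads/s")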
        
               | 867-5309 wrote:
               | at this point _cost_ would become the bottleneck. compare
               | 24x1TB NVMe drives to 24TB of DDR5
        
             | derefr wrote:
             | Under ideal conditions, yes. But the 3x difference I see in
             | practice is less about NVMe being just that good; and more
             | about operations against (main) memory getting bottlenecked
             | under high all-cores concurrent access with no cross-
             | workload memory locality to enable any useful cache
             | coherence. And also about memory accesses only being "in
             | play" when a worker thread isn't context-switched out;
             | while PCIe-triggered NVMe DMA can proceed _while_ the
             | thread has yielded for some other reason.
             | 
             | In other words, _when measured E2E in the context of a
             | larger work-step_ (one large enough to be interrupted by a
             | context-switch), the mean, amortized difference between the
             | two types of fetch becomes  <3x.
             | 
             | Top of my wishlist for future architectures is "more,
             | lower-width memory channels" -- i.e. increased intra-CPU
             | NUMAification. Maybe something CXL.mem will roughly
             | simulate -- kind of a move from circuit-switched memory to
             | packet-switched memory, as it were.
        
           | wil421 wrote:
           | I'm running a FreeNAS box on an i3-8100. Right now I'm
           | converting the NAS and my desktop to server chassis and
           | putting them in a rack. Once I get a 10GB Unifi switch and
           | NICs off ebay, I'm debating on running my desktop and servers
            | diskless using iSCSI backed by RAID0 NVMe drives.
        
             | justsomehnguy wrote:
              | Whatever floats your boat, but iSCSI is limited to 1500 MTU
              | (9k? Are you sure you can boot with 9k enabled?) and while
              | you can have 10Gbit _throughput_ that doesn't mean that's
              | what you will always get, eg 100 IO operations would
              | generate 100 packets and it doesn't matter if each was
              | 1500B or only 100B.
             | 
             | And you wouldn't see the speed improvement on RAID0 NVMe
             | drives except extremely rare fully sequential operations
             | lasting for at least tens of seconds.
             | 
             | You also can try it just by running a VM with iSCSI boot on
             | your current desktop.
        
               | ilyt wrote:
               | > Whatever floats your boat, but iSCSI is limited to 1500
               | MTU (9k? Are you sure you can boot with 9k enabled?) and
               | while you can have 10Gbit throughput that doesn't mean
               | what you will get it always, eg 100 IO operations would
               | generate 100 packets and it doesn't matter if it was
               | 1500B each or only 100B.
               | 
                | Ugh, iSCSI does have queueing so you can have many
                | operations in flight, and one operation doesn't really
                | translate to one packet in the first place; the kernel
                | will happily pack a few smaller operations on a TCP
                | socket into one packet when there is load.
                | 
                | The single queue is the problem here, but a dumb admin
                | trick is just to bring up more than one IP on the server
                | and connect to all of them via multipath
        
               | TristanBall wrote:
                | Been a long time since anything iSCSI related didn't
                | handle 9k, for boot or otherwise.
               | 
               | But I look at it this way. You need 40gbit networking for
               | a single pci3 nvme ( and newer drives can saturate that,
               | or close )
               | 
               | And because you're throttling throughput you'll see much
               | more frequent, longer, queuing delays, on the back of a
               | network stack that ( unless you're using rdma ) is
               | already 5x-10x slower than nvme.
               | 
               | It'll be fast enough for lots of things, especially
                | home/lab use, and it'll be _amazing_ if you're upgrading
               | from sata spinning disk.. but 10gbit is slow by modern
               | storage standards.
               | 
               | Of course, that's not the only consideration. Shared
               | storage and iscsi in particular can be extremely
               | convenient! And sometimes offers storage functionality
               | that clients don't have ( snapshots, compression,
               | replication )
        
             | ilyt wrote:
              | That's a lot of effort to put a silent piece of silicon a
              | few metres away from the machine.
              | 
              | iSCSI is gonna eat some of your CPU (you're changing "send
              | a request to the disk controller and wait" to "do a bunch
              | of work to create a packet, send it over the network, and
              | get it back") if you don't have a card with offload, and it
              | also might kinda not be fast enough to get the most out of
              | NVMe, especially in RAID0.
             | 
             | And, uh, just don't keep anything important there...
        
       | Un1corn wrote:
       | Does someone have a comparison between common server ssds and
       | consumer ssds? I wonder if the speed is equal or not
        
         | PaulKeeble wrote:
         | Hard drives are still about 150MB/s.
         | 
         | SATA SSDs are limited to 550MB/s.
         | 
         | PCI-E 3.0 SSDs more like 3500 MB/s.
         | 
         | PCI-E 4.0 SSDs are 7000MB/s.
         | 
          | All of these are at consumer level pricing, you can get 2TB of
          | PCI-E 4 from Western Digital for £130 at the moment, usually
          | about £180. The issue is sustained writes more than reads for
          | consumer versus enterprise drives where the speed drops off due
         | to a lack of SLC caching and lack of cooling and TLC/QLC which
         | is slower for sustained writing.
         | 
         | The example given is very much a consumer level device and not
         | a particularly quick one by today's standards. You can also
         | expect much faster reads cached than that on a DDR5 system I
         | suspect.
        
           | Dalewyn wrote:
           | It should be noted that those SSD speeds are all protocol
           | limits rather than NAND flash limits. ~7GB/s is literally the
           | maximum speed PCIE4 can provide, likewise ~3.5GB/s for PCIE3
           | and ~500MB/s for SATA3.
        
             | omgtehlion wrote:
             | > the maximum speed PCIE4
             | 
              | 4-lane PCIe, as most nvme drives are. I haven't seen drives
             | with wider lanes though...
        
               | Dalewyn wrote:
               | Ah, right. Lanes. I should have mentioned those as well,
               | thanks for catching that.
        
           | bluedino wrote:
            | Today's large hard drives are over 250MB/s in sequential
            | operations
        
           | js4ever wrote:
            | For me the main difference between HDD and NVMe is not
            | really the throughput but the access time. When you
            | manipulate small files it's a lot more important
        
         | sekh60 wrote:
         | Consumer SSDs are trash for sustained writes since they get
         | their top speeds from cache and use slower NAND. Enterprise
         | SSDs tend to have better write endurance, and faster NAND. I
         | have a small Ceph cluster and as an example when I first bought
         | SSDs for it I tried consumer Samsung 870 Evo's. They performed
         | worse than spinning rust.
        
           | wtallis wrote:
           | Consumer SSDs don't really use slower NAND; the QLC vs TLC
           | ratio _might_ be higher for the consumer SSD market than the
           | enterprise SSD market, but a consumer TLC drive is using the
           | same speed of NAND as an enterprise TLC drive (and likewise
           | for QLC drives).
           | 
           | Enterprise SSDs only really have significantly higher
           | endurance if you're looking at the top market segments where
           | a drive is configured with much more spare area than consumer
           | drives (ie. where a 1TiB drive has 800GB usable capacity
           | rather than 960GB or 1000GB). Most of the discrepancy in
           | write endurance ratings between mainstream consumer and
           | mainstream enterprise drives comes from their respective
           | write endurance ratings being calculated according to
            | _different criteria_, and from consumer SSDs being given
           | low-ball endurance ratings so that they don't cannibalize
           | sales of enterprise drives.
           | 
           | Your poor Ceph performance with Samsung consumer SATA SSDs
           | wasn't due to the NAND, but to the lack of power loss
           | protection on the consumer SSDs leading to poor sync write
           | performance.
        
           | menaerus wrote:
            | If what you say were true then we wouldn't be able to
            | saturate the PCIe 3.0 x4 bus with a consumer NVMe SSD, which
            | we absolutely can. The biggest difference is in the
            | durability, as mentioned in the comment below.
        
         | wtallis wrote:
         | Read speeds are similar between consumer and enterprise SSDs;
         | they use the same flash and there's overlap with high-end
         | consumer SSDs using the same controllers as entry-level and
         | mid-range enterprise SSDs.
         | 
         | The main difference is in write performance: consumer SSDs use
         | SLC caching to provide high burst write performance, while
         | server SSDs usually don't and are optimized instead for
         | consistent, sustainable write performance (for write streams of
         | many GB).
         | 
         | Server SSDs also usually have power loss protection capacitors
         | allowing them to safely buffer writes in RAM even when the host
         | requests writes to be flushed to stable storage; consumer
         | drives have to choose between lying to the host and buffering
         | writes dangerously, or having abysmal write performance if the
         | host is not okay with even a little bit of volatile write
         | caching.
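          | 
          | That last case is easy to feel from userspace; a loop like
          | this (sizes made up) often collapses to a few hundred writes/s
          | on a consumer drive, and runs orders of magnitude faster if
          | you drop the fsync:
          | 
          |     import os, time
          | 
          |     fd = os.open("synctest.bin",
          |                  os.O_WRONLY | os.O_CREAT, 0o644)
          |     buf, n = b"x" * 4096, 500
          |     t0 = time.perf_counter()
          |     for _ in range(n):
          |         os.write(fd, buf)
          |         os.fsync(fd)   # force each write to stable storage
          |     dt = time.perf_counter() - t0
          |     print(f"{n / dt:,.0f} synchronous 4 KiB writes/s")
          |     os.close(fd)
          |     os.remove("synctest.bin")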
        
       | 323 wrote:
       | Further evidence is the fact that optimized SIMD JSON or UTF8
       | libraries exist. If I/O was the bottleneck, there wouldn't be a
       | need to parse JSON using SIMD.
        
       | samsquire wrote:
       | My immediate thought was are you measuring throughput or latency?
       | 
       | The latency of reading from disk is indeed very slow compared to
       | CPU instructions.
       | 
        | A 3GHz processor runs 3 billion (3,000,000,000) cycles a second
        | and some instructions take 1 cycle. You get 3 cycles per
        | nanosecond. An SSD or spinning disk access costs many multiples
        | of that.
       | 
        | Read 1 MB sequentially from SSD*: 1,000,000 ns (~1 ms)
       | 
       | That's a lot of time that could be spent doing additions or
       | looping.
       | 
       | https://gist.github.com/jboner/2841832
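        | 
        | Putting numbers on that (the figure above is in nanoseconds, 3
        | GHz clock assumed):
        | 
        |     clock_hz = 3_000_000_000
        |     ssd_read_1mb_ns = 1_000_000   # ~1 ms per MB, per the gist
        |     cycles = clock_hz * ssd_read_1mb_ns // 10**9
        |     print(cycles, "cycles per 1 MB SSD read")   # 3,000,000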
        
         | formerly_proven wrote:
         | And you get 2-4 instructions per cycle on average.
        
         | noctune wrote:
         | That is true, but assuming file I/O is blocking you also have
         | to pay for a context switch to take advantage of that.
         | 
         | But I guess you could avoid that using eg. io_uring.
        
         | cb321 wrote:
         | I agree - "IO _bandwidth_ is no longer a BW bottleneck " would
         | be a better title.
        
       | tonetheman wrote:
       | I/O is no longer the bottleneck for silly interview questions for
       | the most part. But for real programs it can still be an issue.
        
       | nanis wrote:
       | Related posts from my past experience from about 10 years ago:
       | 
       | * Splitting long lines is slow[1]
       | 
       | * Can Parallel::ForkManager speed up a seemingly IO bound
       | task?[2]
       | 
       | In both cases, Perl is the language used (with a little C thrown
       | in for [1]), but they are in a similar vein to the topic of this
       | post. In [1], I show that the slowness in processing large files
       | line by line is not due to I/O, but due to the amount of work
       | done by code. In [2], a seemingly I/O bound task is sped up by
       | throwing more CPU at it.
       | 
       | [1]: https://www.nu42.com/2013/02/splitting-long-lines-is-
       | slow.ht...
       | 
       | [2]: https://www.nu42.com/2012/04/can-parallelforkmanager-
       | speed-u...
        
       | chaxor wrote:
       | It's still very bizarre to me to see people completely write off
       | spinning platter drives as 'old tech'. They're still used
       | everywhere! (At least in my world)
       | 
       | We have all of my team's data on an array of 18TB drives in a
       | 100TB RAID10 setup, and a NAS at home doing the same, etc. Even
       | some of our OS drives at work are 7200rpm drives - and we're a
       | computational lab. Why is everyone so intent on pretending these
       | effectively no longer exist? The cost of a decent amount of
       | space on NVMe drives is simply astronomical. We're not all
       | millionaires.
        
         | crote wrote:
         | Spinning rust is still by far the cheapest way to store bulk
         | data, and it is fast enough for infrequent access. It's
         | perfectly fine for archival purposes. Large disks are $0.02 /
         | GB while large SSDs are closer to $0.07 / GB.
         | 
         | On the other hand, ignoring my media collection, both my
         | personal computer and server only need a few hundred GB of
         | storage. SSDs are cheap enough that they are a no-brainer: they
         | are physically a lot smaller, quieter, and more resilient. The
         | orders-of-magnitude better IO speed doesn't hurt either, even
         | if most of it is wasted on me.
         | 
         | 1TB NVMe drives are now less than $75. I could get a 1TB HDD
         | for $40 or a 3TB one for $75, but why bother?
        
           | mmis1000 wrote:
           | Even 2TB SSDs can be bought for < $130 now, and consumer
           | motherboards usually have two M.2 slots, so you can have 4TB
           | of ultra-fast storage on a normal consumer-level computer
           | without PCIe expansion cards. That's overkill for most
           | people; most consumers probably don't even need that much
           | storage space.
        
             | kragen wrote:
             | until they start shooting video or finetuning stable
             | diffusion or something
             | 
             | video and archived neural net snapshots are pretty
             | sequential
        
         | jackmott42 wrote:
         | I think you know the answer to your question, which is that
         | normal end users who build PCs basically never use them. Power
         | users with NAS and huge porn collections and servers, of course
         | they still do.
        
         | [deleted]
        
         | swalsh wrote:
         | Most people are conscious that they're making a performance
         | trade off when they decide to store data in one.
        
         | javajosh wrote:
         | _> a 100TB raid10 setup, and a NAS at home doing the same_
         | 
         | How do you deal with these onerous constraints? Do you have a
         | system for deciding how many copies of archive.org to keep
         | locally?
        
           | [deleted]
        
           | fishtacos wrote:
           | I know you're being facetious, but here's some fun stats from
           | archive.org.
           | 
           | A few highlights from the Petabox storage system:
           | 
           | No Air Conditioning, instead use excess heat to help heat the
           | building.
           | 
           | Raw Numbers as of December 2021:
           | 
           | 4 data centers, 745 nodes, 28,000 spinning disks
           | 
           | Wayback Machine: 57 PetaBytes
           | 
           | Books/Music/Video Collections: 42 PetaBytes
           | 
           | Unique data: 99 PetaBytes
           | 
           | Total used storage: 212 PetaBytes
           | 
           | Considering this data is from a year ago, it's got to be
           | substantially larger now.
           | 
           | Link: https://archive.org/web/petabox.php
        
           | chaxor wrote:
           | I'm not sure what exactly you mean - but we typically store
           | DNA sequencing data from people that want us to analyze it,
           | databases of proteins and scientific articles, ML models,
           | etc. Proper considerations of how to store the data and
           | compress it correctly help a lot, but unfortunately academics
           | are pretty terrible at choosing the right data structures for
           | the job, and often several copies are needed for mirroring
           | the raw data, then the processed versions are updated.
        
         | vitno wrote:
         | Spinning disks are fine for some workloads. However, I've
         | definitely seen teams engineering around slower spinning disks.
         | NVMe isn't egregiously expensive, and if teams need to spend
         | too much work and time thinking about disk speed, it's
         | probably cheaper in the long run to switch.
        
         | simonmales wrote:
         | And there is still demand. Otherwise they wouldn't still be
         | produced.
        
         | burntsushi wrote:
         | I don't see anywhere in this article where spinning disks are
         | written off.
         | 
         | It is sensible to consider the I/O speed of things other than
         | spinning disks. Especially when they are exceptionally
         | commonplace in consumer devices and developer workstations.
        
         | Thaxll wrote:
         | They're not used everywhere; most desktops and servers are now
         | based on SSDs or equivalent.
         | 
         | Your use case of people running desktops with 100TB is niche.
         | For $100 you can get a very fast 1TB NVMe drive nowadays, which
         | is fine for 99.99% of the population.
        
       | mrich wrote:
       | I have a hunch that rewriting the program in C/C++ or Rust would
       | change this significantly.
        
         | jhfdbkofdchk wrote:
         | Writing this in Perl will have significantly different results.
         | Can't speak for Go but text processing in Python is slow.
        
         | civopsec wrote:
         | Not necessarily considering that he wrote that the Go version
         | tries to avoid memory allocations.
        
       | iAm25626 wrote:
       | I/O also means network to me. Often the target (database, or
       | device generating telemetry) is 10+ ms away. That round-trip time
       | is bottlenecked by physics (the speed of light). A side benefit
       | of SQLite is that it's on the local file system/in memory.
        
       | throwaway71271 wrote:
       | EBS costs crazy crazy amounts for reasonable iops
       | 
       | We pay 7k per month for an RDS instance that can barely do 2k
       | IOPS... meanwhile a machine at Hetzner does 2 million IOPS for
       | 250 euro per month (not to mention it also has 4x more cores and
       | 5x more RAM).
       | 
       | So, even though I/O is no longer the bottleneck physically, it
       | is still a considerable issue and design challenge on the cloud.
        
         | nix23 wrote:
         | Well yes it's a total ripoff
         | 
         | I installed a DB server for a customer around 2 years ago, in a
         | DC near him: 16 cores, 48GB RAM and ~6TB across 12 SSDs, as
         | mirrored vdevs of 2 with a stripe over the mirrors (kind of a
         | RAID10, but ZFS), zstd compression (1GB compressed down to
         | ~200MB, so about 5 times less reading/writing, and in theory
         | ~30TB of pure DB data, 20TB realistically - remember never
         | fill a zpool over 72%), record size 16kb (PostgreSQL). After 3
         | months the machine had paid for itself (compared to the
         | "cloud" price) and the performance was roughly 10-12 times
         | higher.
         | 
         | Called the customer about two months ago and he said the DB
         | server is still too fast, and maybe he wants another one that
         | uses less power... ;)
        
           | samb1729 wrote:
           | > remember never fill a zpool over 72%
           | 
           | Could you please explain where this number comes from?
        
             | ewwhite wrote:
             | Yeah, that's a myth now. It's not current advice.
        
               | nix23 wrote:
                | 72% is my rule of thumb for write-heavy production stuff
                | (my absolute limit would be 75%), but it depends on
                | record sizes, raidz level, whether you have mostly write
                | or mostly read workloads, how big your files are, how
                | many snapshots you have, whether you have a dedicated
                | ZIL device, and much more. For a home NAS (movies etc.)
                | you can easily go up to 85%... if it's a "~WORM"
                | workload, maybe 90%... but resilvering can then be a
                | thing of days (weeks?), depending again on the raidz
                | level or mirror etc.
                | 
                | >Yeah, that's a myth now. It's not current advice.
                | 
                | It's not, and you know it. Keep it under 72%, believe
                | me, if you want a performant ZFS (especially if you
                | delete files and have many snapshots... check the YT
                | video linked at the end).
               | 
               | >>Keep pool space under 80% utilization to maintain pool
               | performance. Currently, pool performance can degrade when
               | a pool is very full and file systems are updated
               | frequently, such as on a busy mail server. Full pools
               | might cause a performance penalty, but no other issues.
               | If the primary workload is immutable files (write once,
               | never remove), then you can keep a pool in the 95-96%
               | utilization range. Keep in mind that even with mostly
               | static content in the 95-96% range, write, read, and
               | resilvering performance might suffer.
               | 
               | https://web.archive.org/web/20150905142644/http://www.sol
               | ari...
               | 
               | And under no circumstances go over 90%:
               | 
               | https://openzfs.github.io/openzfs-
               | docs/Performance%20and%20T...
               | 
               | >An introduction to the implementation of ZFS - Kirk
               | McKusick
               | 
               | https://www.youtube.com/watch?v=TQe-nnJPNF8
        
           | ilyt wrote:
           | The cloud costs really are between "unreasonable" and "very
           | unreasonable". The only time it gets cheaper is if the
           | workload were so spiky we could've turned most of it off for
           | 2/3 of the day, but most of what we have is many smaller
           | projects that don't even have enough traffic to need
           | scaling, and the big ones, well, can't scale database size
           | down off-peak...
           | 
           | Over the last ~6 years we did the "is it worth going to the
           | cloud" calculation a few times, and it was always
           | ridiculously more expensive.
        
           | fomine3 wrote:
           | Virtualized storage systems increase latency (QD1 IOPS).
           | Naively built non-parallel apps tend to rely on QD1 IOPS
           | performance, so they run very slowly on cloud platforms
           | compared to a dev machine with direct-attached NVMe.
        
         | LastTrain wrote:
         | Agree! And at the end of the day, we are optimizing for cost.
         | Although, the EBS portion of that 7k RDS bill is going to be
         | tiny, right?
        
           | throwaway71271 wrote:
           | Well, it's tiny if you keep it at 2k and make sure you don't
           | touch the disk much. God forbid someone makes a query that
           | requires a temp table... a query you wouldn't even notice on
           | a bare-metal machine brings the whole RDS setup down - it
           | can't even read the write-ahead log, can't replicate, etc.
           | It's like watching a slow-motion train wreck from 2 queries
           | per second.
        
       | Mave83 wrote:
       | If you have to handle millions of I/Os or files, or GB/s to TB/s
       | of bandwidth, just use DAOS.io storage. Awesomely fast, scale-out
       | and self-healing.
        
       | chewbacha wrote:
       | Wouldn't memory allocation still be IO of a different resource?
       | We're still getting slowed down reading and writing bits to a
       | storage device. Perhaps it's not the hard drive but the claimed
       | blocker here doesn't appear to be CPU.
        
         | diarrhea wrote:
         | That discussion reminds me of Intel Optane. The current
         | distinction between hard disks and RAM isn't a necessity
         | dictated by some law of nature. Yet it's the box most people
         | think within (for good reason).
        
         | umanwizard wrote:
         | Reading and writing to main memory is not usually called "IO"
         | in this context.
        
         | dahfizz wrote:
         | Not really, no. Allocation involves reading/writing to memory,
         | but that's not why it's slow. It's slow because allocation
         | involves a context switch into the kernel. And an allocation
         | algorithm itself isn't trivial.
        
           | zabzonk wrote:
           | User-space code will not use the kernel, and there will be no
           | context switch. It will call a function such as malloc, which
           | is a user-space function. malloc will then go on to interact
           | with the memory-allocation subsystem, which may occasionally
           | need to ask the OS for more memory via a system call, if the
           | allocator is any good.
        
             | dahfizz wrote:
             | Yeah malloc is user space code, but it will do a syscall
             | like sbrk to actually get memory from the kernel.
             | 
             | The default malloc in glibc does not pad the values given
             | to sbrk, so you have to do a syscall for every 4k chunk of
             | memory (the pagesize). So unless you do lots of very small
             | (<<4k) allocations, you call sbrk pretty often.
             | 
             | You will also page fault when you access the new pages, and
             | this traps into kernel code again.
             | 
             | So yeah, you are technically correct that some allocations
             | may be fast because the memory is already available and
             | mapped. Allocations, on average, are still slow because
             | they involve context switches to the kernel (potentially
             | multiple).
             | 
             | TLDR: you make it sound like a syscall within malloc is
             | rare, but many/most allocations will trigger a syscall.
        
               | morelisp wrote:
               | It is rare because one of the very first things anyone
               | does if they're concerned about allocations is replace
               | glibc malloc.
        
         | cb321 wrote:
         | Indeed - for that particular problem a big cost is the "IO"
         | from main memory "into the CPU cache". Ben is careful to
         | qualify it as "disk" IO, but even this is somewhat vague. (as
         | is "CPU cache" vs L3/L2/L1 - Ben's interview problem is highly
         | L2 sensitive, I think, and increasing variety in that size will
         | make results harder to interpret.)
         | 
         | On a modern gen4 NVMe, I routinely get 7 GiB/s. gen5 is
         | supposed to double that (as soon as manufacturers get "enough"
         | money out of gen4, given PCIe 4's extremely short life compared
         | to gen3's extremely long one).
         | 
         | There was a time not long ago (maybe still) when highly
         | scaled-up, many-core (40+) Intel CPUs could not match that when
         | pulling from DIMMs into L3 for just one core (as per his
         | interview problem). So we are indeed moving into an era where
         | "IO" from the primary persistent device is no worse than IO
         | from DIMMs, at least in bandwidth terms. DIMMs still have much
         | better latency, and the latency-BW ambiguity has been observed
         | elsethread.
         | 
         | EDIT: I should clarify, to connect my text with your comment,
         | that the real cost of (1-core, uncontended) allocation is also
         | more "populating/mapping the cache" with copies, not just
         | "allocation" in itself.
        
       | slotrans wrote:
       | Article itself is fine but the "conclusion" is loony. You can't
       | draw conclusions about big data from toy experiments with small
       | data.
        
       | [deleted]
        
       | williamcotton wrote:
       | Wouldn't it be nice if we could specify the allocators in GC
       | languages? Like, why not expose a bump allocator arena to Python
       | with a manual release?
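       | 
       | A toy sketch of what that could look like for raw byte buffers
       | (the BumpArena name and sizes are made up; real Python objects
       | wouldn't benefit, since they're still allocated by the runtime):
       | 
       |     class BumpArena:
       |         """Bump allocator over one preallocated buffer."""
       |         def __init__(self, size):
       |             self.buf = bytearray(size)
       |             self.off = 0
       |         def alloc(self, n):
       |             view = memoryview(self.buf)[self.off:self.off + n]
       |             self.off += n
       |             return view
       |         def release(self):
       |             # manual release: everything is "freed" at once
       |             self.off = 0
       |     
       |     arena = BumpArena(1 << 20)     # 1 MiB up front
       |     scratch = arena.alloc(4096)    # hand out a slice of the arena
       |     arena.release()                # reset the whole arena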
        
         | krapp wrote:
         | I don't know if C++ counts as a "GC language" per se, but you
         | can write custom allocators for STL containers (eg here:
         | https://betterprogramming.pub/c-memory-pool-and-small-
         | object...).
        
       | alecco wrote:
       | Nonsense. Latency matters. NVMe latency is measured in
       | microseconds, while DRAM latency is measured in nanoseconds.
       | 
       | Sequential processing is not that common.
        
         | kragen wrote:
         | latency matters but sequential processing is still most
         | processing
         | 
         | i hope you are doing well
        
       | osigurdson wrote:
       | Compared to L1 cache reference, it certainly still is.
        
       | ihatepython wrote:
       | It's true, I/O is no longer the bottleneck.
       | 
       | The bottleneck now is interviewers who think they have the
       | knowledge and expertise, but do not. However their authoritative
       | position ends up distorting everything, and then said person
       | blogs about it and causes even more problems.
        
         | sublimefire wrote:
         | OMG what did I miss, please expand. What is being distorted and
         | what problems are caused?
        
           | YetAnotherNick wrote:
           | Most people don't build their programs for NVMe drives, but
           | for the cloud. Try benchmarking it in a regular AWS/GCP
           | instance using the default SSD and you will see that what the
           | interviewees said is true, as they are likely not building it
           | to run on "your" laptop.
        
         | dang wrote:
         | " _Don 't be snarky._" -
         | https://news.ycombinator.com/newsguidelines.html
         | 
         | If you know more than someone else, that's great - but in that
         | case please share some of what you know, so the rest of us can
         | learn. Just putting another person down doesn't help.
         | 
         | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
        
         | dist1ll wrote:
         | Holy god what a burn! I absolutely agree though.
        
       | [deleted]
        
       | fancyfredbot wrote:
       | NVMe storage really is very fast for sequential reads, but I'd
       | respectfully suggest that for simple tasks a Dell laptop with
       | 1.6GB/s read speed should be bottlenecked by I/O if the compute
       | is optimised. For example, simdjson can parse JSON at over 7GB/s.
       | https://simdjson.org/
        
       | kazinator wrote:
       | Processing _cache hot_ data has never been the bottleneck.
       | 
       | Running some search on a file on your 486-with-8-megs-of-RAM
       | running Linux, where the file was in the operating system's
       | cache, was dependent on the performance of the program, and the
       | overhead of reading data from the cache through syscalls.
       | 
       | You can't handwave away the performance of the program with the
       | argument that it will hide behind I/O, even if that is true for a
       | cache-cold run, because cache-hot performance is important.
       | People run multiple different kinds of processing passes on the
       | same data.
        
       | up2isomorphism wrote:
       | The correct way to describe his experiment would be:
       | 
       | Of course I/O is the slowest part, but it is fast enough that
       | most programmers are not able to fully utilize it.
        
       | lmeyerov wrote:
       | I like that the author started with measuring and thinking about
       | bandwidth, which makes sense for streaming through a big file, so
       | I'd have continued that way towards a different design &
       | conclusion.
       | 
       | Continuing with standard python (pydata) and ok hw:
       | 
       | - 1 cheap ssd: 1-2 GB/s
       | 
       | - 8 core (3 GHz) x 8 SIMD: 1-3 TFLOPS?
       | 
       | - 1 pci card: 10+ GB/s
       | 
       | - 1 cheapo GPU: 1-3 TFLOPS?
       | 
       | ($$$: cross-fancy-multi-GPU bandwidth: 1 TB/s)
       | 
       | For streaming workloads like word count, the floating-point
       | operation (a proxy for actual ops) to read ratio is unclear, and
       | the hardware above supports 1000:1. Where the author is hitting
       | the roofline on either is a fun detour, so I'll switch to what
       | I'd expect of pydata Python.
       | 
       | It's fun to do something like run regexes on logs with cuDF
       | one-liners (the GPU port of pandas) and figure out the
       | bottleneck. 1 GB/s sounds low; I'd expect the compute to be more
       | like 20GB+/s for in-memory, so they'd need to chain 10+ SSDs to
       | achieve that, and there's a good chance the PCI card would still
       | be fine. At 2-5x more compute, the PCI card would probably become
       | the new bottleneck.
        
       | tuyguntn wrote:
       | Haven't gone through the code, but measurement methodology seems
       | wrong to me.
       | 
       | > As you can see, the disk I/O in the simple Go version takes
       | only 14% of the running time. In the optimized version, we've
       | sped up both reading and processing, and the disk I/O takes only
       | 7% of the total.
       | 
       | 1. If I/O wasn't a bottleneck, shouldn't we optimize only reading
       | to have comparable benchmarks?
       | 
       | 2. Imagine program was running 100 sec, (14% I/O) so 14 seconds
       | are spent on I/O. Now we optimize processing and total time
       | became 70 seconds, if I/O wasn't a bottleneck, and we haven't
       | optimized I/O, total disk I/O should become 20% of total
       | execution time, not 7%.
       | 
       | Disk I/O:
       | 
       | > Go simple (0.499), Go optimized (0.154)
       | 
       | Clearly, I/O access was optimized ~3x while total execution was
       | optimized ~1.6x. This is not a good way of measuring things if
       | you want to say I/O is not a bottleneck.
       | 
       | I agree though things are getting faster.
        
       | andrew-ld wrote:
       | Honestly, I find this article too vague. If you process large
       | amounts of data, you rarely do so with orderly reads and writes;
       | even databases optimized for fast disks (see RocksDB) have the
       | disk as a bottleneck, even with the most recently developed
       | hardware.
        
       | austin-cheney wrote:
       | I encountered this myself yesterday when attempting to
       | performance test WebSockets in JavaScript:
       | https://github.com/prettydiff/share-file-systems/blob/master...
       | 
       | The parsing challenge is complex enough that it will always be
       | faster to extract the data from the network than it is to
       | process it. As a result, excess data must be stored until it can
       | be evaluated, or else it must be dropped; therefore the primary
       | processing limitation is memory access, not the CPU speed
       | executing instructions. JavaScript is a garbage-collected
       | language, so you are at the mercy of the language, and it
       | doesn't really matter how you write the code: if the messages
       | arrive frequently enough and are large enough, memory will
       | always be the bottleneck, not the network or the application
       | code.
       | 
       | In terms of numbers this is provable. When testing WebSocket
       | performance on my old desktop with DDR3 memory I was sending
       | messages (without a queue or any kind of safety consideration) at
       | about 180,000 messages per second. In my laptop with DDR4 memory
       | the same test indicated a message send speed at about 420,000
       | messages per second. The CPU in the old desktop is faster and
       | more powerful than the CPU in the laptop.
        
       | prvt wrote:
       | "sorting with O(n^2) is no longer a bottleneck as we have fast
       | processors" /s
        
         | itissid wrote:
         | That makes no sense. The brutal math of a polynomial is always
         | going to be poor enough to notice compared to sub-polynomial
         | times.
        
           | vitiral wrote:
           | Often but not always. Cool trick: any bounded limit is always
           | O(1)!
           | 
           | Pick a small enough bound and an O(n^2) algorithm behaves
           | better than an O(n log n). This is why insertion sort is used
           | for sorting lengths less than ~64, for example.
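           | 
           | A minimal sketch of that hybrid idea (the cutoff of 64 is
           | illustrative, not tuned):
           | 
           |     CUTOFF = 64
           |     
           |     def insertion_sort(a, lo, hi):   # O(n^2), cache-friendly
           |         for i in range(lo + 1, hi):
           |             x, j = a[i], i - 1
           |             while j >= lo and a[j] > x:
           |                 a[j + 1] = a[j]
           |                 j -= 1
           |             a[j + 1] = x
           |     
           |     def hybrid_sort(a, lo=0, hi=None):  # O(n log n) overall
           |         if hi is None:
           |             hi = len(a)
           |         if hi - lo <= CUTOFF:
           |             insertion_sort(a, lo, hi)   # small runs: n^2 wins
           |             return
           |         mid = (lo + hi) // 2
           |         hybrid_sort(a, lo, mid)
           |         hybrid_sort(a, mid, hi)
           |         out, i, j = [], lo, mid         # merge the two halves
           |         while i < mid and j < hi:
           |             if a[i] <= a[j]:
           |                 out.append(a[i]); i += 1
           |             else:
           |                 out.append(a[j]); j += 1
           |         out.extend(a[i:mid]); out.extend(a[j:hi])
           |         a[lo:hi] = out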
        
       | mikesabbagh wrote:
       | Usually we mainly run jobs on NFS or similar disks, where I/O
       | time would be more significant. It would be nice to run those
       | tests on AWS.
        
       | dehrmann wrote:
       | I'm not surprised. I've seen bulk MySQL reads in Python be CPU-
       | bound. The interesting followup was that parallelizing reads in
       | subprocesses wasn't going to help much because I'd get killed by
       | CPU again when serializing/deserializing between processes. I was
       | capped at a ~2x speedup.
        
       | inglor_cz wrote:
       | Hmm. I just fought with a MariaDB installation that, when set to
       | write to disk immediately, became rather slow. 7-8 INSERTs into
       | the DB could easily take 3 seconds; unfortunately the internal
       | logic of the system didn't really lend itself to one INSERT of
       | multiple rows.
       | 
       | Once I reconfigured innodb_flush_log_at_trx_commit to 2, the UI
       | started being lightning fast.
        
       | Dunedan wrote:
       | > I haven't shown an optimized Python version because it's hard
       | to optimize Python much further! (I got the time down from 8.4 to
       | 7.5 seconds). It's as fast as it is because the core operations
       | are happening in C code - that's why it so often doesn't matter
       | that "Python is slow".
       | 
       | An obvious optimization would be to utilize all available CPU
       | cores by using the MapReduce pattern with multiple threads.
       | 
       | I believe that'd be necessary for a fair conclusion anyway, as
       | you can't claim that I/O isn't the bottleneck, without utilizing
       | all of the available CPU and memory resources.
        
         | masklinn wrote:
         | A more obvious optimisation (to me) would be to leverage the
         | native functions and avoid creating a list of ~80 million
         | strings in memory.
         | 
         | On my machine, the base script pretty reliably takes ~10s:
         | 
         |     Reading   : 0.1935129165649414
         |     Processing: 9.955206871032715
         |     Sorting   : 0.0067043304443359375
         |     Outputting: 0.01335597038269043
         |     TOTAL     : 10.168780088424683
         | 
         | Switching content to a no-op (`content = sys.stdin`) and
         | feeding `Counter` from a native iterators pipeline:
         | 
         |     counts = collections.Counter(
         |         chain.from_iterable(
         |             map(str.split, map(str.lower, content))))
         | 
         | is a pretty reliable 10% gain:
         | 
         |     Reading   : 1.1920928955078125e-06
         |     Processing: 8.863707780838013
         |     Sorting   : 0.004117012023925781
         |     Outputting: 0.012418985366821289
         |     TOTAL     : 8.880244970321655
         | 
         | As far as I can tell, the bottleneck is about half the
         | preprocessing (lowercasing and splitting) and half filling the
         | Counter.
         | 
         | You won't get a 10x gain out of that though.
        
         | plonk wrote:
         | > An obvious optimization would be to utilize all available CPU
         | cores by using the MapReduce pattern with multiple threads.
         | 
         | Nope, the GIL will make that useless. You need to actually
         | implement the tight loops in C/C++ and call that with batches
         | of data to get benefits from threading.
         | 
         | An obvious, but more expensive optimization would be to use a
         | process pool. Make sure that all the objects you pass around
         | are serializable.
         | 
         | Python makes optimization much harder than it should be. I hope
         | the GIL gets the hammer at some point, but that seems to be a
         | huge task.
        
           | Waterluvian wrote:
           | You can use Processpool but at that point you're way into re-
           | architecting.
        
             | plonk wrote:
             | Not much more than the switch to MapReduce and threads.
             | Actually the interface is exactly the same if you use
             | executors from concurrent.futures.
        
             | znpy wrote:
             | inter-process communication has its own overhead though.
        
               | plonk wrote:
               | You can usually counteract that by sending large enough
               | batches to the processes.
        
           | Dunedan wrote:
           | > Nope, the GIL will make that useless.
           | 
           | In Python yes. I missed that. The Go implementation would
           | still benefit from multiple threads, wouldn't it?
        
             | plonk wrote:
             | Yes I think so.
        
           | xmcqdpt2 wrote:
           | For this problem the multi-process version would be quite
           | simple in Python or any other language. It's a classic
           | same-program, multiple-data (SPMD) task. You split the file
           | into N chunks, then run N copies of the original program on
           | them (a Map). You then need to collate the results, which
           | requires a second program, but that step is similar to the
           | sorting step in the original and so would be negligible wrt
           | wall time (a quick Reduce). See the sketch below.
           | 
           | For large files you should get almost embarrassing
           | parallelism.
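           | 
           | A minimal sketch of that Map/Reduce split using only the
           | stdlib (the file name is the one from the article's benchmark
           | and the worker count is made up):
           | 
           |     import multiprocessing as mp
           |     from collections import Counter
           |     
           |     def count_chunk(lines):
           |         c = Counter()
           |         for line in lines:
           |             c.update(line.lower().split())
           |         return c
           |     
           |     if __name__ == "__main__":
           |         with open("kjvbible_x100.txt") as f:
           |             lines = f.readlines()
           |         n = 8   # assumed number of worker processes
           |         chunks = [lines[i::n] for i in range(n)]   # Map
           |         with mp.Pool(n) as pool:
           |             counts = pool.map(count_chunk, chunks)
           |         total = sum(counts, Counter())             # Reduce
           |         for word, count in total.most_common():
           |             print(word, count)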
        
             | jvanderbot wrote:
             | Oh, I think a few SIMD instructions could reduce processing
             | to near zero without going crazy with multi-threaded
             | architectures.
             | 
             | Remember that fizzbuzz on HN that hit GB/s? Mostly SIMD.
             | Zero multi-threading, IIRC.
        
           | ShredKazoo wrote:
           | The GIL won't prevent you from parallelizing I/O will it?
        
             | plonk wrote:
             | If I/O was the bottleneck, parallelizing it won't help,
             | your SSD/network link/database won't get magically faster.
             | 
             | If I/O wasn't the bottleneck, I guess you can parallelize
             | reading, but what are you gaining?
             | 
             | If you're writing to files, most of the time the
             | parallelism will be hard to implement correctly. SQLite
             | doesn't support parallel writes, for example.
        
               | ShredKazoo wrote:
               | >If I/O was the bottleneck, parallelizing it won't help,
               | your SSD/network link/database won't get magically
               | faster.
               | 
               | I think your SSD/network link/database might be able to
               | work in parallel even when Python can't. Details:
               | 
               | Suppose I am scraping a website using a breadth-first
               | approach. I have a long queue of pages to scrape. A
               | single-threaded scraper looks like: pop the next page in
               | the queue, block until the web server returns that page,
               | repeat. A multi-threaded scraper looks like: thread wakes
               | up, pops the next page in the queue, sleeps until the web
               | server returns that page, repeat. With the multi-threaded
               | scraper I can initiate additional downloads while the
               | thread sleeps.
               | 
               | My assumption here is that the download over the network
               | is at some level being performed by making a system call
               | (how could it not be?) And once you have multiple system
               | calls going, they can be as parallel as the OS permits
               | them to be; the OS doesn't have to worry about the GIL.
               | And also the server should be able to serve requests in
               | parallel (assuming for the sake of argument that the
               | server doesn't suffer from the GIL).
               | 
               | Same essential argument applies to the database. Suppose
               | I'm communicating with the database using IPC. The
               | database isn't written in Python and doesn't suffer from
               | the GIL. Multiple Python threads can be sleeping on the
               | database while the database processes their requests,
               | possibly in parallel if the db supports that.
               | 
               | I think this argument could even work for the SSD if the
               | kernel is able to batch your requests in a way that takes
               | advantage of the hardware, according to this person:
               | https://news.ycombinator.com/item?id=33752411
               | 
               | Very curious to hear your thoughts here. Essentially my
               | argument is that the SSD/network link/database could be a
               | "bottleneck" in terms of latency without being the
               | bottleneck in terms of throughput (i.e. it has unused
               | parallel capacity even though it's operating at maximum
               | speed).
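               | 
               | A minimal sketch of that threaded scraper, stdlib only
               | (the URL list is hypothetical); the GIL is released
               | while each thread waits on the socket, so the downloads
               | overlap:
               | 
               |     from concurrent.futures import ThreadPoolExecutor
               |     from urllib.request import urlopen
               |     
               |     urls = ["https://example.com/page/%d" % i
               |             for i in range(100)]
               |     
               |     def fetch(url):
               |         # blocks in the OS, not in the interpreter
               |         with urlopen(url, timeout=10) as resp:
               |             return url, len(resp.read())
               |     
               |     with ThreadPoolExecutor(max_workers=16) as pool:
               |         for url, size in pool.map(fetch, urls):
               |             print(url, size)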
        
               | plonk wrote:
               | You're right, my comment only applies when bandwidth is
               | the bottleneck. In Python, that web parser could probably
               | do even better with asyncio in a single OS thread.
        
               | srcreigh wrote:
               | Haven't tried it, but SQLite supports some type of
               | concurrent modification now
               | 
               | https://www.sqlite.org/cgi/src/doc/begin-
               | concurrent/doc/begi...
        
               | ilyt wrote:
               | > If I/O was the bottleneck, parallelizing it won't help,
               | your SSD/network link/database won't get magically
               | faster.
               | 
                | Of course it will. Nearly every serious DB will work on
                | multiple requests in parallel, and unless the DB itself
                | is on something slow, you will get data faster from 2
                | parallel requests than from serializing them.
                | 
                | NVMe SSDs in particular can easily outpace what a single
                | thread can read; just run _fio_ with single vs parallel
                | threads to see that.
               | 
                | > If you're writing to files, most of the time the
                | parallelism will be hard to implement correctly. SQLite
                | doesn't support parallel writes, for example.
               | 
                | That's just one random example. If all you do is "read
                | data, parse, write data" in some batch job, you can have
                | massive parallelism. Sharding is also an easy way to
                | fill up the I/O.
        
         | superjan wrote:
         | That is of course an easy solution but I would argue that this
         | is just throwing more resources at the problem. Not a very
         | impressive optimization.
         | 
         | Emery Berger has a great talk [1] where he argues that it is
         | mostly pointless to optimize Python code; if your program is
         | slow, you should look for a properly optimized library to do
         | that work for you.
         | 
         | 1: https://www.youtube.com/watch?v=vVUnCXKuNOg
        
           | Dunedan wrote:
           | > That is of course an easy solution but I would argue that
           | this is just throwing more resources at the problem. Not a
           | very impressive optimization.
           | 
           | You could say the same about the existing implementation as
           | that reads the whole file into memory instead of processing
           | it in chunks.
        
         | ummonk wrote:
         | Or better yet, write a compute shader since hashing is an
         | embarrassingly parallel operation.
         | 
         | That said, the OP's article is correct in that straightforward
         | idiomatic implementations of this algorithm are very much
         | compute bound. The corollary is that eng work put into
         | optimizing compute usage often won't be wasted for programs
         | processing disk data (or even network data with modern 10Gb
         | fiber connections).
        
         | [deleted]
        
       | mips_avatar wrote:
       | I/O is sometimes the bottleneck. On Windows, any workload with
       | lots of small file operations bottlenecks on NTFS and Defender.
       | That means some applications like git, which run beautifully on
       | Linux, need to take countermeasures to operate well on Windows.
        
       | Dunedan wrote:
       | As the algorithm used in the example is straight-forward I
       | figured that using UNIX command line tools might be an even
       | simpler way to implement it. Here is what I came up with:
       | 
       |     time cat kjvbible_x100.txt \
       |       | tr "[:upper:] " "[:lower:]\n" \
       |       | sort --buffer-size=50M | uniq -c | sort -hr > /dev/null
       | 
       | On my machine this turned out to be ~5 times slower than the
       | provided Python implementation. Nearly all of the time is spent
       | in the first invocation of `sort`. Further increasing the buffer
       | size doesn't make a significant difference. I also played around
       | with the number of threads `sort` uses, but didn't see any
       | improvement there either.
       | 
       | I'm quite puzzled why `sort` is so much slower, especially as it
       | sorts in parallel utilizing multiple CPU cores, while the Python
       | implementation is single-threaded.
       | 
       | Does somebody have an explanation for that?
        
         | zackmorris wrote:
         | Dangit, I'm supposed to be doing yardwork today so you hijacked
         | my procrastination motivation haha!
         | 
         | Edit: I had no idea that awk was so fast, and I suspect that
         | only parallelization would beat it. But I agree with the others
         | that the main bottleneck is the `sort | uniq` for results1.txt.
         | 
         |     # https://stackoverflow.com/a/27986512 # count word occurrences
         |     # https://unix.stackexchange.com/a/205854 # trim surrounding whitespace
         |     # https://linuxhint.com/awk_trim_whitespace/ # trim leading or trailing whitespace
         | 
         |     time cat kjvbible_x100.txt \
         |       | tr "[:upper:] " "[:lower:]\n" \
         |       | sort --buffer-size=50M | uniq -c | sort -hr > results1.txt
         |     real 0m13.852s
         |     user 0m13.836s
         |     sys  0m0.229s
         | 
         |     time cat kjvbible_x100.txt \
         |       | tr "[:upper:] " "[:lower:]\n" \
         |       | awk '{count[$1]++} END {for (word in count) print count[word], word}' \
         |       | sort -hr > results2.txt
         |     real 0m1.425s
         |     user 0m2.243s
         |     sys  0m0.061s
         | 
         |     diff results1.txt results2.txt
         |     109,39133c109,39133
         |     # many whitespace differences due to how `uniq -c` left-pads
         |     # the first column with spaces
         | 
         |     # bash-only due to <() inline files; no differences after
         |     # trimming surrounding whitespace
         |     diff <(cat results1.txt | awk '{$1=$1};1') \
         |          <(cat results2.txt | awk '{$1=$1};1')
         | 
         |     # sh-compatible; no differences after trimming leading
         |     # whitespace of results1.txt
         |     cat results1.txt | awk '{ sub(/^[ \t]+/, ""); print }' \
         |       | diff - results2.txt
         | 
         |     # 13.836 / 2.243 = ~6x speedup with awk
        
         | mastax wrote:
         | `sort | uniq` is really slow for this, as it has to sort the
         | entire input first. I use `huniq` which is way faster for this.
         | I'm sure there are many similar options.
         | 
         | https://github.com/koraa/huniq
        
         | benhoyt wrote:
         | It's because that first invocation of sort is sorting the
         | entire input (413MB), not just the unique words (less than a
         | MB). The sort is probably O(N log N), but that's a big N.
         | Counting by inserting into a hash table is much faster, at
         | O(N).
        
         | aljarry wrote:
         | It looks like you're sorting the whole file, while python
         | implementation sorts only unique values.
        
       | buybackoff wrote:
       | Having dealt with several data parsers I would like to somehow
       | estimate how much electricity is burned globally just for lazy
       | implementations. E.g. in .NET non-devirtualized `Stream.ReadByte`
       | is often one of the hottest methods in a profiler. It and related
       | methods could easily be responsible for a double-digit share of
       | CPU when processing data. I mean, it's not I/O but just pure
       | overhead that disappears with custom buffering, where reading a
       | single byte is as cheap as it should be.
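       | 
       | A rough Python analogue of the same effect (the file name is
       | just the test file from the article; any large file works):
       | reading one byte per call is dominated by call overhead, not by
       | the disk, and a big buffer makes the overhead vanish.
       | 
       |     import time
       |     
       |     def read_per_byte(path):
       |         n = 0
       |         with open(path, "rb") as f:
       |             while f.read(1):              # one call per byte
       |                 n += 1
       |         return n
       |     
       |     def read_buffered(path, bufsize=1 << 20):
       |         n = 0
       |         with open(path, "rb") as f:
       |             while True:
       |                 chunk = f.read(bufsize)   # one call per MiB
       |                 if not chunk:
       |                     break
       |                 n += len(chunk)
       |         return n
       |     
       |     for fn in (read_per_byte, read_buffered):
       |         t = time.perf_counter()
       |         fn("kjvbible_x100.txt")
       |         print(fn.__name__, time.perf_counter() - t)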
        
       | lotik wrote:
       | Can we just give it more resources? /s
        
       | sriku wrote:
       | On a related note, John Ousterhout (in the RAMCloud project) was
       | basically betting that the latency of accessing RAM on another
       | computer on a fast local network will eventually become
       | competitive to local RAM access.
       | 
       | https://ramcloud.atlassian.net/wiki/spaces/RAM/overview
        
       | stabbles wrote:
       | The title I/O is _no longer_ the bottleneck seems to suggest disk
       | speed has caught up, while in reality the slowness is due to poor
       | implementation (slow Python or Go with lots of allocations).
       | 
       | The real problem to me is that languages are too high-level and
       | hide temporary allocations too much. If you had to write this in
       | C, you would naturally avoid unnecessary allocations, because
       | alloc/free in the hot loop looks bad.
       | 
       | Presumably soon enough it's very unlikely you find any new word
       | (actually it's 10 passes over the same text) and most keys exist
       | in the hashmap, so it would be doing a lookup and incrementing a
       | counter, which should not require allocations.
       | 
       | Edit: OK, I've run OP's optimized C version [1] and indeed, it
       | only hits 270MB/s. So OP's point remains valid. Perf tells me
       | that 23% of all cache refs are misses, so I wonder if it can be
       | optimized to group counters of common words together.
       | 
       | [1] https://benhoyt.com/writings/count-words/
        
         | davedx wrote:
         | Yeah I've worked on non trivial javascript codebases where
         | allocations and gc mattered.
        
           | christophilus wrote:
           | I'd like to hear more about this.
        
             | mceachen wrote:
             | Not your OP, but I've got stories!
             | 
             | PhotoStructure is a non-trivial app written in TypeScript
             | for both frontend and backend tasks.
             | 
             | Coming from Scala, I initially used a lot of Array.map,
             | Array.foreach, and, to handle things like Some/None/Either:
             | 
             | function map<T, U>(maybe: T | undefined, f: (t: T) => U)
             | 
             | According to the V8 memory inspector, all those tiny fat-
             | arrow functions can hammer the GC (presumably due to stack
             | allocations and pulling in local context). Replacing large
             | array iteration with for loops was a big win.
             | 
             | Also, handling large files with a streaming parser when
             | possible, instead of reading them entirely into memory,
             | another win.
             | 
             | Buffer concatenation may be faster by pushing read chunks
             | onto an array, and joining the lot at the end, rather than
             | incrementally appending with .concat.
             | 
             | When memoizing functions, if they themselves return
             | functions, watch out for fat-arrows if you don't need the
             | caller's context (the closure may prevent the GC from
             | collecting unexpectedly retained variables).
             | 
             | But the first step should always be to profile your app.
             | You can't assert improvement without a baseline.
        
         | imtringued wrote:
         | The file is only being read once according to the blog and then
         | it is in memory. This isn't an I/O intensive application at
         | all.
         | 
         | I have written a small script in python that does something
         | similar. I have a word list with 1000 words and I check the
         | presence of the words. Here is the thing. For every word I go
         | through the entire file. So lets say I scan the file one
         | thousand times. In fact I did something more complicated and
         | ended up going 6000 times over the original file and yet it
         | still took only three seconds. If all these scans had to reread
         | the file it would take forever.
        
           | kragen wrote:
           | he has figures for both cached and uncached performance
           | 
           | if all your scans had to reread the file from nvme it would
           | take five times as long if we extrapolate from those figures
           | 
           | not forever
        
           | nunez wrote:
           | the optimized version does it byte by byte
        
           | Severian wrote:
           | This sounds like a very inefficient method. The Python stdlib
           | module 'collections' includes a Counter class which can do
           | this for you. Then all you need to do is copy the dict
           | key/item pair to a new one that match your word list.
        
         | adverbly wrote:
         | Of course disk speed has caught up. Single core clock speed
         | hasn't changed in over 10 years now? Disks now use SSDs and
         | have been chugging along with improvements.
         | 
         | I'm curious what a multi thread(multi process for python due to
         | GIL?) comparison would be. Obviously people aren't doing this
         | by default though, so the author's point still stands.
        
           | hn_go_brrrrr wrote:
           | It's not just about clock speed. CPI has been steadily
           | improving across processor generations, even when the clock
           | speed doesn't meaningfully change.
        
             | atty wrote:
             | Sorry do you mean IPC (instructions per clock) or is there
             | another term that I'm not aware of?
        
               | vlovich123 wrote:
               | Cycles per instruction.
        
               | wtallis wrote:
               | "Cycles per instruction" sounds more like a measurement
               | of instruction latency rather than instruction
               | throughput, and I don't think instruction latency has
               | improved as much in the past decade as instruction
               | throughput: the most significant microarchitecture trend
               | has been toward wider CPUs with more execution ports,
               | wider instruction decoders, and larger reorder buffers.
        
           | chronial wrote:
           | > Single core clock speed hasn't changed in over 10 years
           | now?
           | 
           | This is not true. I thought the same thing, and you are right
           | with regard to baseline clock speed. But the performance is
           | still increasing. I just got a new PC at work with a 2021
           | high-end CPU, and it has 200% of the single-core performance
           | of the 2015 mid-range CPU in my personal PC:
           | 
           | https://www.cpubenchmark.net/compare/4597vs2599/Intel-i9-129.
           | ..
        
             | antisthenes wrote:
             | Are you simply glossing over the fact that CPU 1 has a
             | turbo of 5.2 GHz and the other 3.6 GHz?
        
               | glogla wrote:
                | Also, the old one is 65W TDP and the new one is 241W
                | TDP.
        
               | chronial wrote:
               | Yes, I did not go into details where the differences are
               | coming from. The base clockspeeds are the same, which is
               | obviously the number that GP noticed not changing over
               | the last 10 years.
               | 
               | Other things changed. The turbo clockspeed is one of
               | them.
        
             | 0x457 wrote:
             | Well, CPU got a lot better at benchmarks, that is true.
             | Caches got bigger, predictions got better. Specialized
             | instructions were added. IPC improvement kinda slowed down
             | after Sandy Bridge, at least for Intel.
             | 
             | Also, the comment you're quoting is talking about clock
             | speed, and the link you provided literally shows the same
             | base clock speed - 3.2 GHz. Intel progressively pushed
             | turbo speeds higher, but that's a speed you could have
             | achieved yourself by overclocking.
        
               | chronial wrote:
               | Does the CPU constantly hold the turbo speed under a
               | single threaded workload?
        
               | 0x457 wrote:
               | It depends on motherboard and cooling. 6700K, for
               | example, is constantly running at 4.2Ghz or 4.5Ghz
               | (winter clocks). Constantly while thermals allow it...
               | Non-overclocking motherboards allow it to boost for 2
               | minutes, after that, it's iffy.
        
               | eulers_secret wrote:
               | Depends. The cpu attempts to hold the maximum possible
               | speed on any cores that are in use.
               | 
               | On my water cooled and specifically tweaked desktop- yes.
               | It'll hold max boost indefinitely, even with all threads.
               | (getting to about 80c after 10 mins). Single-thread max
               | is faster, and it'll hold that as well.
               | 
               | My laptop will pull power within 15 seconds and be down
               | to base clocks in a couple mins. Unless I set it down
               | outside and it's very cold.
        
           | PartiallyTyped wrote:
           | > Single core clock speed hasn't changed in over 10 years
           | now?
           | 
           | Technically we have gone from 3.3-3.9 GHz base frequency,
           | 4.1-4.2 GHz single-core boosts, and 4.7 GHz with a heavy OC
           | in Haswell to
           | 
           | 4.3 and 3.0 GHz base clocks for efficiency/perf cores, with
           | boosts to 5.4-ish, a stable 5.8 GHz all-core frequency with a
           | heavy OC, and even pushing 6.something GHz single-core with a
           | good setup.
           | 
           | But then again, this is on the more extreme ends. Mainstream
           | laptops have remained relatively stable.
           | 
           | re Python MP:
           | 
           | On my 12700k, I saw almost linear scaling (per core, not
           | thread) with multiprocessing when loading stuff from the disk
           | for ML applications and processing. This was with the
           | datasets library.
        
         | bluedino wrote:
         | Don't forget about naive implementations in languages like C++
         | where using cin/cout can cause your program to run slower than
         | Python when doing file IO
        
         | xyzzy_plugh wrote:
         | I don't know if I can agree. Python is certainly problematic
         | for the reasons you outlined, but Go is much closer to C in
         | terms of performance. Go is also compiled, and as such you have
         | similar opportunity to optimize and measure allocs. In my
         | experience it's much harder to tune Python, and Go is often
         | _easier_ to tune than C due to the great ergonomics of the
         | compiler. You can also drop down to assembly!
         | 
         | But really, I disagree because I've frequently saturated
         | massive IOPS. I/O is still the bottleneck. The article pretty
         | much immediately excludes network I/O, which is in many cases
         | more common than disk I/O. Even so, tiny single-threaded
         | programs reading words one-at-a-time are obviously not going to
         | be I/O constrained with modern disks. For these types of
         | programs, I/O hasn't been a bottleneck in a long, long time,
         | and I'd actually be surprised to hear candidates suggest
         | otherwise.
        
           | exabrial wrote:
           | Let's call spades spades: C spanks Go in every measure. Go is
           | nowhere near C speed.
           | 
           | It's certainly more performant than any dynamically typed
           | scripting language: JavaScript, Python, Ruby, etc but it's
           | probably closer to C#.
        
           | kragen wrote:
           | yeah, a friend of mine wrote a logfile parser in 01996 in c,
           | and when i rewrote it in perl (also in 01996) it was about
           | one tenth the size and ten times slower
           | 
           | whether i/o is the bottleneck depends on what you're doing
           | and on which computer, and that's been true for at least 50
           | years
        
             | computator wrote:
             | May I ask what's the rationale for writing 01996 rather
             | than 1996? I've seen this before but I haven't seen an
             | explanation of the advantage of it.
        
               | mcculley wrote:
               | Y-10K awareness: https://longnow.org/ideas/long-now-
               | years-five-digit-dates-an...
        
               | bqmjjx0kac wrote:
               | But it looks like an octal literal.
        
               | withinboredom wrote:
               | Haha, I thought it was some architecture I've never heard
               | of.
        
           | sidlls wrote:
           | Go's problem is that _in practice_ there are interfaces (fat
           | pointers) everywhere. I'm about to start optimizing one of
           | the services my team owns and my first step is going to be
           | looking for all the bad performance issues that comes out of
           | this.
        
           | dahfizz wrote:
           | I mean, Go is closer to C than Python is. But Go is still a
           | garbage collected language with a runtime. It's not going to
           | come close to the limits of performance that C offers
        
             | morelisp wrote:
             | Having played with this a previous time it was posted and
             | looked at the profiler results, the main difference between
             | the Go and C versions isn't due to memory allocations but
             | rather to the lack of in-place updates and the slightly
             | higher cost of Go's general-purpose hash table. (This is
             | also probably why Rust, without the GC overhead, clusters
             | with Go rather than C.)
        
               | sidlls wrote:
               | I'm fairly certain that if one wrote C (or C++) programs
               | with the same safety targets that Rust has out-of-the-box
               | the performance is close, if not (for practical purposes)
               | identical.
               | 
                | Of course, to do much that is useful (and performant) in
                | Rust one often has to break out `unsafe`, which eliminates
                | _some_ of the out-of-the-box guarantees for safety--and in
                | some cases makes one wonder if it's worth all the overhead
                | instead of just using C or C++.
        
               | morelisp wrote:
               | > the same safety targets
               | 
               | Rust's selling point is that the safety targets' costs
               | are dev/compile-time ones. There should not be a
               | difference unless the C/C++ code requires some IB or
               | extremely manual memory management trickery, which it
               | doesn't; and Go offers basically the same memory safety
               | guarantees as Rust in this regard and is (slightly)
               | faster.
               | 
               | In this case it's really almost entirely about the speed
               | of the hash table.
        
               | sidlls wrote:
                | I was referring to the overhead of development time. Rust
                | is more complex than C, at least, and even after having
                | gained proficiency, coding in Rust still takes more time
                | than writing an equivalent C program.
               | 
               | And "zero-cost" is misleading. There are definitely
               | performance impacts from the implicit (and unadvertised
               | explicit) bounds checking some of Rust's features come
               | with. Writing a C program to an equivalent level of
               | safety would have similar performance impacts. Hence, for
               | as close to the same safety as possible, Rust and C
               | should be almost identical in terms of performance.
        
             | k__ wrote:
              | That's probably the reason why Go got many people switching
              | from their scripting languages and didn't end up as the C
              | killer it was sold as in the beginning.
        
               | taf2 wrote:
                | I never remember it being sold as a C killer, maybe as a
                | Java killer?
        
               | dagw wrote:
               | It was more sold as a C++ killer than a C killer.
        
               | capableweb wrote:
               | golang.org circa 2010:
               | 
               | > go is ... fast
               | 
               | > Go compilers produce fast code fast. Typical builds
               | take a fraction of a second yet the resulting programs
               | run nearly as quickly as comparable C or C++ code.
               | 
               | https://web.archive.org/web/20100217123645/http://golang.
               | org...
               | 
               | That seems to me like they were trying to say "If you
               | want C/C++ performance but nicer/easier syntax, you can
               | use Go", which turned out to be not that true in the end.
               | 
                | Edit: the old "Language Design FAQ" also goes into further
                | detail on how they envisioned (the first version of) the
                | language: https://web.archive.org/web/20100211104313/http
                | ://golang.org...
        
               | [deleted]
        
               | forty wrote:
               | I believe Go was created to replace python at Google.
        
               | habibur wrote:
                | It was sold as a C killer in the beginning.
                | 
                | And then a few years later there was an article that said
                | the Go engineers were surprised to see that C/C++ coders
                | weren't switching to Go; rather, Python/Ruby coders were
                | "upgrading" to Go.
        
               | pdimitar wrote:
                | That's exactly my reason to use it: fast prototyping and
                | the ability to replace bash scripts (once they become too
                | unwieldy to maintain and extend past a certain point).
                | 
                | I've also successfully made an MVP with Golang which I
                | then proceeded to rewrite in Rust almost in one go
                | and almost without blockers along the way.
               | 
               | Golang is pretty good but it still lacks important things
               | like algebraic data types, and they're hugely important
               | for fearless refactoring and correctness.
        
             | eloff wrote:
              | For a script like this you'd be surprised. If you don't do
             | too much allocation, the garbage collector need not run. In
             | fact it's common to disable it at the start. Now you can
             | allocate much faster than C because you never clean up
             | after yourself!
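              | 
              | A minimal sketch of what that looks like in Go (illustrative
              | only, for a short-lived one-shot program):
              | 
              |     package main
              |     
              |     import (
              |         "fmt"
              |         "runtime/debug"
              |     )
              |     
              |     func main() {
              |         // Turn the collector off for this short-lived process;
              |         // memory is simply handed back to the OS at exit.
              |         debug.SetGCPercent(-1)
              |     
              |         counts := make(map[string]int)
              |         // ... allocate freely while counting ...
              |         counts["example"]++
              |         fmt.Println(len(counts))
              |     }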
             | 
              | A real-world example is esbuild: the author implemented it
              | in both Rust and Go initially. The Go version was faster and
             | the code simpler. Which is why it's implemented in Go.
        
               | leethaxor wrote:
               | > A real world example is esbuild, the author implemented
               | it both Rust and Go initially. The Go version was faster
               | and the code simpler. Which is why it's implemented in
               | Go.
               | 
               | But why is swc faster than esbuild then? The code isn't
               | even considerably more complex.
        
               | tail_exchange wrote:
               | You can find the comments from the ESBuild author here on
               | HN:
               | 
               | https://news.ycombinator.com/item?id=22336284
        
               | leethaxor wrote:
                | The Rust compiler is much faster today than it was in 2020,
               | btw.
               | 
               | > This is a side project and it has to be fun for me to
               | work on it.
               | 
                | I respect this 100% - but then we shouldn't assume Go is
                | better than Rust just based on the fact that esbuild used it
                | instead of Rust.
        
               | eloff wrote:
               | Because different programs, implemented differently, run
               | at different speeds...
               | 
               | I'm saying the performance of Go can sometimes be
               | surprisingly fast. Not that it's magic.
        
               | leethaxor wrote:
               | So you're saying they wrote exactly the same program in
               | Go and Rust for the comparison, changing the syntax only?
               | Well then it's no surprise the Go version was faster.
               | 
               | Don't write Rust as if it was Go. That doesn't say
               | anything meaningful about either Go or Rust.
        
               | eloff wrote:
                | Actually he wrote the Rust version first, so you're wrong
                | to jump to conclusions.
               | 
               | I'm not trying to say Go is faster than Rust, it's
               | usually slower. But there are always exceptions to the
               | rule. The Go code, on the other hand, is usually simpler
               | and quicker to write. For that reason I'd prefer Go if
               | the problem lends itself to a garbage collected language.
        
               | leethaxor wrote:
               | So what did you mean by this?
               | 
               | > Because different programs, implemented differently,
               | run at different speeds...
               | 
               | We're talking about two programs with exactly the same
               | purpose - ingest TypeScript and output JavaScript. It's a
               | pretty clear-cut comparison, IMHO.
               | 
               | > The Go code, on the other hand, is usually simpler and
               | quicker to write
               | 
               | I'm writing Go code at work, and Rust code mostly for fun
               | (but used it at work too). I'd say this has changed
               | significantly in the last 2 years. Now with rust-analyzer
               | and much improved compiler output, writing Rust is very
               | simple and quick too. I guess getting into Rust can be a
               | little harder if you've only ever used GCed languages
               | before, but it's not _that_ hard to learn - and once you
                | do it's super-effective. And the type inference of Rust
               | is a huge reason why I'm using it - while Go has none.
               | 
               | Another thing to consider - usually the code in Go is
               | much more about writing algorithms yourself instead of
               | using library functionality (this is changing slowly
               | thanks to the new support of generics but most code
               | hasn't caught up yet and there aren't good libs using it
                | so far). The resulting Go code can be quite convoluted
                | and contain well-hidden bugs. People also usually don't
               | bother implementing a proper search/sorting algorithm for
               | the sake of simplicity/speed of development - which you'd
               | get automatically if you used a library function - so the
               | code is less efficient. My Go code is usually 2-3x longer
               | than the equivalent in TypeScript or Rust.
               | 
                | Go is great, I like it. Rust is great too. I recommend
                | you do what the esbuild author did - test it and
                | choose for yourself, and don't worry too much about others'
                | opinions.
        
               | eloff wrote:
               | > We're talking about two programs with exactly the same
               | purpose - ingest TypeScript and output JavaScript. It's a
               | pretty clear-cut comparison, IMHO.
               | 
               | There are an infinite number of ways to design two
               | programs for that task, with different trade-offs. You
               | can't draw conclusions about which language is faster
               | based on two different implementations by different
               | people.
               | 
               | > Go is great, I like it. Rust is great too. I recommend
               | you to do what the esbuild author did - test it and
               | choose for yourself, don't bother too much about others'
               | opinion.
               | 
                | I've actually been writing Rust code for the last two years.
                | It's been a while since I've used Go. But I'd rather use Go if
               | the problem allows for a garbage collector. It's just
               | simpler than managing it manually in Rust with the borrow
               | checker and its rules. This is my opinion, nobody else's.
        
               | stavros wrote:
               | The GP didn't say that, maybe they wrote Go as if it were
               | Rust, for all we know.
        
               | cma wrote:
                | You could LD_PRELOAD a free() that does nothing in C too,
                | or use a macro define to really make it no-cost.
        
               | eloff wrote:
               | You'd need to change malloc too so it doesn't do extra
               | work. You could do it. Go obviously isn't going to be
               | faster than well written C. But sometimes you can be
               | surprised.
        
         | morelisp wrote:
         | The Go code is reasonably efficient at avoiding allocations,
         | though it's hard to avoid some. Without thinking too hard, it's
         | also going to be hard to avoid those same ones in C, and some
         | possible improvements would apply to both (e.g. defining a
         | maximum word length).
         | 
         | "Lowercase word count" is a surprisingly difficult case in this
         | regard, because you need to check and potentially transform
         | each character individually, and also store a normalized form
         | of each word. Probably some smart SIMD lowercase function could
         | help here but I don't think any language is going to offer that
          | out of the box. It's also defined in a way that I think detaches
          | a bit too much from real-world issues - it's handling arbitrary
          | bytes but also only ASCII. It would be very different if it had to
          | handle UTF-8, but also if it could assume that only a few control
          | characters were relevant.
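          | 
          | For concreteness, a minimal ASCII-only sketch of that per-byte
          | check-and-transform loop (not the article's exact code):
          | 
          |     // Lowercase ASCII in place and split on bytes <= ' ' in a
          |     // single pass, storing the normalized form of each word.
          |     func lowerCount(data []byte, counts map[string]int) {
          |         start := -1
          |         for i := 0; i < len(data); i++ {
          |             c := data[i]
          |             if c >= 'A' && c <= 'Z' {
          |                 c += 'a' - 'A' // check and transform each byte
          |                 data[i] = c
          |             }
          |             if c <= ' ' { // word boundary
          |                 if start >= 0 {
          |                     counts[string(data[start:i])]++
          |                     start = -1
          |                 }
          |             } else if start < 0 {
          |                 start = i
          |             }
          |         }
          |         if start >= 0 {
          |             counts[string(data[start:])]++
          |         }
          |     }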
        
           | dahfizz wrote:
           | > Probably some smart SIMD lowercase function could help here
           | but I don't think any language is going to offer that out of
           | the box.
           | 
           | No, but at least you have direct access to the intrinsics in
           | C. To get vectorization in Go, you have to implement it in C
           | and link that into your Go program.
        
             | morelisp wrote:
             | > To get vectorization in Go, you have to implement it in C
             | and link that into your Go program.
             | 
             | Go has an assembler which does not require implementing
             | anything in C (though the assembler uses a more C-style
             | syntax), nor critically does it require using CGo linkage.
             | It's used to implement many hot paths in the stdlib.
        
             | throwaway894345 wrote:
             | Typically you would just implement vectorization in
             | assembly in the Go case.
        
           | stabbles wrote:
           | > but I don't think any language is going to offer that out
           | of the box.
           | 
           | That's what compilers are for. I tried to improve the C
           | version to make it friendlier to the compiler. Clang does a
           | decent job:
           | 
           | https://godbolt.org/z/o35edavPn
           | 
            | I'm getting 1.325s (321MB/s) instead of 1.506s (282MB/s) on
            | 100 concatenated bibles. That's still not a 10x improvement
           | though; the problem is cache locality in the hash map.
        
             | cb321 wrote:
              | Note: just concatenating the bibles keeps your hash map
              | artificially small (EDIT: relative to more organic natural-
              | language vocabulary statistics)... which matters because, as
              | you correctly note, the big deal is whether you can fit the
              | histogram in the L2 cache, as noted elsethread. This _really_
              | matters if you go parallel, where N CPUs * N L2 caches can
              | speed things up a lot -- _until_ your histograms blow out the
              | CPU-private L2 cache sizes.
             | https://github.com/c-blake/adix/blob/master/tests/wf.nim
             | (or a port to your favorite lang instead of Nim) might make
             | it easy to play with these ideas (and see at least one way
             | to avoid almost all "allocation" - under some
             | interpretations).
             | 
             | A better way to "scale up" is to concatenate various other
             | things from Project Gutenberg: https://www.gutenberg.org/
             | At least then you have "organic" statistics on the hash.
        
             | [deleted]
        
           | Const-me wrote:
           | > I don't think any language is going to offer that out of
           | the box.
           | 
           | C# offers that out of the box, and the solution is much
           | simpler there.
           | 
           | Pass StringComparer.OrdinalIgnoreCase or similar
           | (InvariantCultureIgnoreCase, CurrentCultureIgnoreCase) to the
           | constructor of the hash map, and the hash map will become
           | case-agnostic. No need to transform strings.
        
             | masklinn wrote:
              | The one possible issue is that by not transforming the string
              | you're going to run the possibly less efficient CI
              | comparison _a lot_: because the corpus is text and
              | duplicated, by my reckoning there are ~32000 unique "words"
              | out of 82 million input "words". That's a lot of conflicts.
             | 
             | Though the values should mostly be quite short, so a
             | vectorised comparison might not even trigger as it wouldn't
             | have the time to "stride": only one word of the top 10 even
             | exceeds 4 letters ("shall", a hair under a million in my
             | corpus).
        
               | Const-me wrote:
               | The C# standard library will not run many case-
               | insensitive comparisons. That comparison object doesn't
               | _just_ compare two objects for equality, it also has
               | another interface method which computes a hash of a
               | single object.
               | 
                | Here's the implementation of the hash function used by that
               | StringComparer.OrdinalIgnoreCase: https://source.dot.net/
               | #System.Private.CoreLib/src/libraries... As you see, it
               | has a fast path for ASCII-only input strings.
        
               | masklinn wrote:
               | > The C# standard library will not run many case-
               | insensitive comparisons. That comparison object doesn't
               | just compare two objects for equality, it also has
               | another interface method which computes a hash of a
               | single object.
               | 
               | Which doesn't matter because I'm talking about identical
               | strings, so they _will_ hash the same by definition, and
               | they _will_ have to be compared.
               | 
               | So the question is how fast the CI hash and equality
               | operate _compared to the CS ones_.
               | 
               | And I asked about comparison because I _assumed_ that
               | would be the costlier of the two operations, relative to
               | its CS brethren.
        
               | Const-me wrote:
               | > how fast the CI hash and equality operate compared to
               | the CS ones.
               | 
               | If the string is ASCII like in the OP's use case, I think
               | the difference is not huge.
               | 
               | CS comparison looks more optimized, they have an inner
               | loop which compares 12 bytes as 3 64-bit values: https://
               | source.dot.net/#System.Private.CoreLib/src/libraries...
               | 
               | CI comparer doesn't do that, it loads individual UTF-16
               | elements: https://source.dot.net/#System.Private.CoreLib/
               | src/libraries... But still, it's very simple code which
               | does sequential memory access.
               | 
               | > And I asked about comparison because I assumed that
               | would be the costlier of the two operations, relative to
               | its CS brethren.
               | 
               | I think the bottleneck is random memory loads from the
               | hash table.
               | 
                | Hashing and comparison do sequential RAM access. The
                | prefetcher in the CPU will do its job, you'll get 2
                | memory loads every cycle, and for short strings this is
                | going to be extremely fast. If that hashtable doesn't fit
                | in L3 cache, the main memory latency is much slower than
                | comparing strings of 10-20 characters, case sensitive
                | or not.
        
               | morelisp wrote:
               | But all of those optimizations can also be done, and more
               | efficiently, by transforming while reading. Even if
               | they're as fast as they can be, they're not as fast as a
               | memcmp. The C# approach isn't buying you any performance
               | here.
        
               | Const-me wrote:
               | Yeah, it's possible to change case while loading, but I'm
               | not sure that's gonna be much faster.
               | 
                | But I'm sure it's gonna be much harder.
               | 
               | For non-ASCII strings, converting case may change their
               | length in bytes. You don't even know in advance how much
               | memory you need to transform 2GB of input text (or 1MB
               | buffer if streaming). And if streaming, you need to be
               | careful to keep code points together: with a naive
                | approach you're gonna crash with a runtime exception when
               | you split a single codepoint between chunks.
               | 
                | English words are 99.9% ASCII, but that remaining 0.1%
                | like "naïve" is not. The C# standard library is doing the
                | right thing for this use case. Specifically, for 99.9% of
                | words the CI comparer will use the faster ASCII-only code
                | to compare or hash, and only do the expensive shenanigans
                | for the small count of non-ASCII words.
               | 
               | Note how C# makes the implementation much simpler. A
               | single parameter passed to the constructor of the
               | Dictionary<string,Something> makes it implement case-
               | insensitivity automagically.
        
             | [deleted]
        
       | nimish wrote:
       | If you ignore latency sure. Optane is still the fastest storage
       | by quite a bit. Flash has yet to catch up and might never do so.
       | 
       | Tons of files and random writes can bring even an enterprise
       | flash ssd to its knees but Optane keeps on trucking
        
       | avinassh wrote:
        | I am working on a project [0] to generate 1 billion rows in
        | SQLite under a minute, and have inserted 100M rows in 33
        | seconds. First, I generate the rows and insert them into an in-
        | memory database, then flush them to the disk at the end. Flushing
        | to disk takes only 2 seconds, so 99% of the time is being
        | spent generating and adding rows to the in-memory B-tree.
       | 
       | For Python optimisation, have you tried PyPy? I ran my same code
       | (zero changes) using PyPy, and I got 3.5x better speed.
       | 
       | I published my findings here [1].
       | 
       | [0] - https://github.com/avinassh/fast-sqlite3-inserts
       | 
       | [1] - https://avi.im/blag/2021/fast-sqlite-inserts/
        
       | edejong wrote:
       | "Optimised" code according to the author. What can we do to
       | optimise further?
       | 
       | - Read file in one thread pool, streaming the chunks to...
       | 
       | - ...another thread pool, tokenise, count, sort the chunks and
       | send them to ...
       | 
       | - ... merge in another thread pool. (basically map-reduce).
       | 
       | - please stop malloc'ing for each token
       | 
        | - prealloc map for found tokens (better to just allocate room for
        | 200k words; see the sketch at the end of this comment).
       | 
       | - SIMD would optimise your inner-loop quite a lot. However, there
       | are optimised libraries for this, so you don't have to write this
       | yourself.
       | 
       | - `word = append(word, c)` <= this is very slow
       | 
       | Why is there no profiling? Why don't you check how the compiler
       | interpreted your code and benchmark the subparts?
       | 
       | In addition, there are at least some errors in your optimised
       | program:
       | 
        | - you can't lower-case by subtracting like you do. Non-ASCII
        | characters would fail.
        | 
        | - also you can't tokenise by comparing with c <= ' '. There are
        | many characters which would break a string. See this exercise:
       | https://campus.datacamp.com/courses/introduction-to-natural-...
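        | 
        | A minimal sketch of the prealloc/no-malloc-per-token points above,
        | in Go (names and sizes are illustrative, not a drop-in for the
        | author's program):
        | 
        |     // Pre-size the map and reuse one scratch buffer instead of
        |     // allocating per token.
        |     func countTokens(data []byte) map[string]int {
        |         counts := make(map[string]int, 200_000) // prealloc ~200k words
        |         word := make([]byte, 0, 64)             // reused buffer
        |     
        |         flush := func() {
        |             if len(word) > 0 {
        |                 // Converting the buffer to a string key is where
        |                 // allocation still happens (see the reply below).
        |                 counts[string(word)]++
        |                 word = word[:0]
        |             }
        |         }
        |         for _, c := range data {
        |             if c > ' ' {
        |                 word = append(word, c) // no realloc for words <= 64 bytes
        |             } else {
        |                 flush()
        |             }
        |         }
        |         flush()
        |         return counts
        |     }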
        
         | morelisp wrote:
         | > thread pool
         | 
         | May reduce wall-clock time but increase total compute time (and
         | so also power). It's less an optimization than a tradeoff.
         | 
         | > please stop malloc'ing for each token
         | 
         | It doesn't, only when it gets put in the map. (And while the
         | particular allocation could be smaller, _something_ guaranteed
         | to represent a specific arbitrary-length string has to be put
         | in the map, which is going to malloc.)
         | 
         | > prealloc map for found tokens (better to just allocate room
         | for 200k words).
         | 
         | Has no meaningful effect on performance.
         | 
         | > SIMD would optimise your inner-loop quite a lot.
         | 
         | No, as pointed out elsethread, it's a measurable boost but
         | nowhere near the 10x you need to make the main claim (I/O not
         | the bottleneck) be wrong. Not even 2x.
         | 
         | > `word = append(word, c)` <= this is very slow
         | 
         | Has no meaningful effect on performance.
         | 
         | Perhaps you should read the whole post.
        
       | brianolson wrote:
       | SSD is pretty fast; but my app is actually trying to do more than
       | 100_000 read-modify-write cycles per second and that still
       | requires careful thought about the database and schema we're
       | using.
       | 
        | CPU and RAM are pretty fast. I do a live-coding interview
        | question and I ask people to do a naive implementation first,
        | then later I ask about possible optimizations. A third to a half
        | of candidates want to do fewer RAM accesses, and oh boy is that the
        | wrong avenue for this problem - especially when they just wrote
        | their solution in Python and you could get a 10x-20x speedup by
        | rewriting in C/C++/Go/Rust/etc.
       | 
       | Network is IO too. Network is pretty fast, datacenter-to-
       | datacenter, but end users can still have their experience
       | improved with better encoding and protocol; and outbound
       | bandwidth bills can be improved by that too.
        
       | kgeist wrote:
        | The most common performance problems I've encountered in our
        | projects are: 1) lack of indexes, resulting in extensive table
        | scans, and 2) I/O calls in a loop without batching.
       | 
       | I don't know if it counts as "I/O bottlenecks" or not.
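        | 
        | To illustrate the second one, a rough Go sketch using database/sql
        | and strings (the table and column names are made up):
        | 
        |     // Slow: one query (and one round trip) per id.
        |     func oneByOne(db *sql.DB, ids []int) (map[int]string, error) {
        |         out := make(map[int]string, len(ids))
        |         for _, id := range ids {
        |             var name string
        |             err := db.QueryRow(
        |                 "SELECT name FROM users WHERE id = ?", id).Scan(&name)
        |             if err != nil {
        |                 return nil, err
        |             }
        |             out[id] = name
        |         }
        |         return out, nil
        |     }
        |     
        |     // Batched: one query for the whole set of ids.
        |     func batched(db *sql.DB, ids []int) (map[int]string, error) {
        |         args := make([]any, len(ids))
        |         ph := make([]string, len(ids))
        |         for i, id := range ids {
        |             args[i], ph[i] = id, "?"
        |         }
        |         q := "SELECT id, name FROM users WHERE id IN (" +
        |             strings.Join(ph, ",") + ")"
        |         rows, err := db.Query(q, args...)
        |         if err != nil {
        |             return nil, err
        |         }
        |         defer rows.Close()
        |         out := make(map[int]string, len(ids))
        |         for rows.Next() {
        |             var id int
        |             var name string
        |             if err := rows.Scan(&id, &name); err != nil {
        |                 return nil, err
        |             }
        |             out[id] = name
        |         }
        |         return out, rows.Err()
        |     }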
        
       | Havoc wrote:
        | Picking something counterintuitive as an interview question is not
        | a great idea. It defeats the purpose - it's harder to tell whether
        | the candidate is going with conventional wisdom because that's what
       | they think the answer is, or because the candidate thinks that's
       | the answer the interviewer expects.
       | 
        | i.e. You could get sharp candidates that know the correct
        | technical answer but intentionally give the wrong one because
        | they rightly concluded that, statistically, the odds are the
        | interviewer is on the conventional-wisdom wavelength.
       | 
       | Could be good if you're specifically looking for someone good at
       | mindgames I guess
        
         | [deleted]
        
         | LastTrain wrote:
         | And then compound that by your counterintuitive assertion being
         | generally wrong in most useful cases. He is either interviewing
         | very junior candidates or frustrating a lot of people.
        
       | hansvm wrote:
        | I/O is still often the bottleneck. My laptop can handle 11 GB/s
        | through RAM (and has no NVMe, so under 1 GB/s through the hard
        | drive), less with unpredictable I/O patterns (like a hash map),
        | and 7600 GB/s through the CPU. Unless the thing you're doing is
       | particularly expensive per byte of data, you're going to be
       | limited at a minimum by RAM I/O, and maybe by DISK I/O.
       | 
       | FWIW, all my recent performance wins have either been by reducing
       | RAM I/O or restructuring work to reduce contention in the memory
       | controller, even at the cost of adding significantly more work to
       | the CPU.
        
       | svnpenn wrote:
       | > converting to lowercase
       | 
       | in regards to accuracy, uppercase is the better option:
       | 
       | https://stackoverflow.com/a/65433126
        
         | kragen wrote:
         | there doesn't seem to be any reasoning or evidence in that post
         | supporting "uppercase is the better option", just that
         | uppercase produces a larger number of word classes, which might
         | be correct or incorrect
         | 
         | tchrist explains in that thread why neither uppercase nor
         | lowercase is the best option:
         | 
         | > _Mapping to lowercase doesn't work for Unicode data, only for
         | ASCII. You should be mapping to Unicode foldcase here, not
         | lowercase. Otherwise yours is a Sisyphean task, since lowercase
          | of Σίσυφος is σίσυφος, while lowercase of its uppercase,
          | ΣΊΣΥΦΟΣ, is the correct σίσυφοσ, which is indeed the foldcase
         | of all of those. Do you now understand why Unicode has a
         | separate map? The casemappings are too complex for blindly
         | mapping to anything not designed for that explicit purpose, and
         | hence the presence of a 4th casemap in the Unicode casing
         | tables: uppercase, titlecase, lowercase, foldcase._
         | 
          | of course 'σίσυφοσ' is not correct as a written word but if
          | you were to encounter it then you should clearly consider it
          | equivalent to 'σίσυφος'
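          | 
          | in go, for instance, the difference shows up with strings.ToLower
          | versus the case-folding strings.EqualFold (a small sketch, not
          | from the linked post):
          | 
          |     package main
          |     
          |     import (
          |         "fmt"
          |         "strings"
          |     )
          |     
          |     func main() {
          |         mixed := "Σίσυφος" // written form, final sigma
          |         upper := "ΣΊΣΥΦΟΣ" // its uppercase
          |     
          |         // prints false: per-rune lowercasing yields different strings
          |         fmt.Println(strings.ToLower(mixed) == strings.ToLower(upper))
          |     
          |         // prints true: case folding treats them as equivalent
          |         fmt.Println(strings.EqualFold(mixed, upper))
          |     }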
        
           | svnpenn wrote:
           | > there doesn't seem to be any reasoning or evidence in that
           | post supporting "uppercase is the better option", just that
           | uppercase produces a larger number of word classes, which
           | might be correct or incorrect
           | 
            | this sentence appears to be nonsense. the code doesn't check
            | "word classes", it case-folds two characters and compares
            | them.
        
             | kragen wrote:
             | character classes then
        
               | svnpenn wrote:
               | it doesn't check character classes either. It literally
               | takes two characters, then uppercases both and compares,
               | then lowercases both and compares. I have no idea where
               | you are getting that it has anything to do with word or
               | character classes, it doesn't.
        
               | kragen wrote:
               | by 'word class' i meant 'a set of words that are
               | considered equivalent by whatever your equivalency
               | relation is'
               | 
               | similarly for 'character class'
               | 
               | cf. https://en.wikipedia.org/wiki/Equivalence_class
               | 
               | what i thought the linked program did was that it counted
               | how many of those there were
               | 
               | now on looking at it further i can see that it doesn't
               | seem to be doing that but i don't have any idea what it
               | _does_ do
               | 
               | however, it definitely doesn't take into account the
               | information you would need to learn anything about which
               | candidate equivalency relation is better, which is
               | something you'd need to examine at at least a word level,
                | considering examples like größte, Σίσυφος, and the
                | notoriously fatal sıkışınca/sikişince pair
        
               | svnpenn wrote:
               | > doesn't take into account the information you would
               | need to learn anything about which candidate equivalency
               | relation is better
               | 
                | OK, no one said it did that. It's purely comparing
               | characters, which is and always was what I said it was
               | doing. And somehow it took 5 comments before you even
               | decided to _actually read the answer_. Maybe next time
               | you should start by _actually reviewing_ and
               | understanding what you are commenting on, _before_ making
               | multiple comments.
        
       | namkt wrote:
       | I would have thought that allocations in managed languages like
       | Go/Python would have been the "fast" part of the processing.
        | Isn't it technically the GC that's slowing you down, and not the
        | allocation per se? For one-shot input/output programs like these
       | I guess you could tune the GC to kick in with less frequency.
       | 
        | You also note that _reading a file sequentially from disk is very
        | fast_, which it is, but there is no guarantee that the file's
       | contents are actually sequential on disk (fragmentation), right?
       | We'd have to see how the file was written, and I guess at worst
       | you'd be reading sequential chunks of a hand-wavy 4KB or
       | something depending on the file system and what not. I'm sure
       | others can fill in the details.
       | 
       | Just nit-picking here.
        
         | PaulKeeble wrote:
         | Files aren't stored sequentially in an SSD anyway, they are
         | scattered all over the place physically on different blocks
          | just due to the way SSDs work. This doesn't hurt their
          | sequential read performance at all since they have no seek time;
          | they already have a lookup table from the logical block the OS
          | sees to a physical location within themselves.
         | 
         | However one thing I found out a few years ago is that old data
         | can be slow to read as a lot of error correction kicks in.
         | Additionally a lot of fragmentation at the operating system
         | level in Windows has quite a bit of overhead. It can seriously
         | degrade performance to about 50MB/s sequential reads. In
         | practice defragmentation/rewriting of certain high write files
         | may be necessary on SSDs because Windows read performance
         | degrades at high levels of fragmentation.
        
         | wtallis wrote:
         | > You also note that reading a file sequentially from disk is
         | very fast, which it is, but there is no guarantee that the
         | file's contents are actually sequential on disk
         | (fragmentation), right?
         | 
         | Correct. And there are actually two layers of fragmentation to
         | worry about: the traditional filesystem-level fragmentation of
         | a file being split across many separate chunks of the drive's
         | logical block address space (which can be fixed by a defrag
         | operation), and fragmentation hidden within the SSD's flash
         | translation layer as a consequence of the file contents being
         | written or updated at different times.
         | 
         | The latter can often have a much smaller effect than you might
         | expect for what sounds like it could be a pathological corner
         | case: https://images.anandtech.com/graphs/graph16136/sustained-
         | sr.... shows typically only a 2-3x difference due to
         | artificially induced fragmentation at the FTL level. But if the
         | OS is also having to issue many smaller read commands to
         | reassemble a file, throughput will be severely affected unless
         | the OS is able to issue lots of requests in parallel (which
         | would depend on being able to locate many file extents from
         | each read of the filesystem's B+ tree or whatever, and the OS
         | actually sending those read requests to the drive in a large
         | batch).
        
       | vinay_ys wrote:
       | Storage and compute separation is key to scaling data workloads.
       | Here, scaling could be w.r.t volume/shape of data, number of
       | concurrent jobs on the same dataset, complexity of each job etc.
        | In such an architecture, network access is unavoidable. And if
        | you have multiple jobs competing for access to the same
        | dataset concurrently, your sequential access can turn into semi-
        | random access. You also have concerns about utilization of
       | resources while being scalable w.r.t arbitrary bursty contentious
       | workloads. These are the things that make it complex w.r.t
       | managing IO resources.
        
       | christophilus wrote:
       | Does anyone have a good resource for reasoning about how to avoid
       | allocations in JavaScript?
        
       | Klinky wrote:
       | The Samsung PM9A1 is the OEM version of the 980 Pro, a top-tier
        | PCIe 4.0 NVMe SSD. What about an older SATA SSD (one without DRAM
       | buffer or HMB), or a 5400RPM hard drive? Also as others have
       | pointed out, random I/O will tank perf, especially simultaneous
       | r/w operations to the same media.
        
       | vitiral wrote:
       | I question the methodology.
       | 
       | To measure this I would have N processes reading the file from
       | disk with the max number of parallel heads (typically 16 I
       | think). These would go straight into memory. It's possible you
       | could do this with one process and the kernel will split up the
        | block read into 16 parallel reads as well; that needs investigation.
       | 
       | Then I would use the rest of the compute for number crunching as
       | fast as possible using as many available cores as possible: for
       | this problem, I think that would basically boil down to a map
       | reduce. Possibly a lock-free concurrent hashmap could be
       | competitive.
       | 
       | Now, run these in parallel and measure the real time from start
        | to finish of both. Also get the total CPU time spent for
        | reference.
       | 
        | I'm pretty sure the author's results are polluted: while they are
        | processing data the kernel is caching the next block. Also, it's
        | not really fair to compare single-threaded disk IO to a single
        | process: one of the reasons for IO being a bottleneck is that it
        | has concurrency constraints. Nevertheless I would be interested
        | in both the single-threaded and concurrent results.
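        | 
        | A rough Go sketch of that shape - one reader feeding chunks (cut at
        | whitespace so words don't straddle chunks) to per-core workers with
        | private maps that get merged at the end; purely illustrative:
        | 
        |     package main
        |     
        |     import (
        |         "bytes"
        |         "fmt"
        |         "os"
        |         "runtime"
        |         "sync"
        |     )
        |     
        |     func main() { // usage: wordcount FILE
        |         f, err := os.Open(os.Args[1])
        |         if err != nil {
        |             panic(err)
        |         }
        |         defer f.Close()
        |     
        |         chunks := make(chan []byte, 8)
        |         results := make(chan map[string]int)
        |     
        |         var wg sync.WaitGroup
        |         for i := 0; i < runtime.NumCPU(); i++ {
        |             wg.Add(1)
        |             go func() { // worker: count into a private map
        |                 defer wg.Done()
        |                 counts := make(map[string]int)
        |                 for chunk := range chunks {
        |                     words := bytes.Fields(bytes.ToLower(chunk))
        |                     for _, w := range words {
        |                         counts[string(w)]++
        |                     }
        |                 }
        |                 results <- counts
        |             }()
        |         }
        |     
        |         go func() { // reader: emit ~1 MiB chunks, cut at whitespace
        |             buf := make([]byte, 1<<20)
        |             var carry []byte
        |             for {
        |                 n, err := f.Read(buf)
        |                 if n > 0 {
        |                     data := append(carry, buf[:n]...)
        |                     cut := bytes.LastIndexAny(data, " \t\r\n")
        |                     if cut < 0 {
        |                         carry = data // no boundary yet, keep buffering
        |                     } else {
        |                         chunks <- append([]byte(nil), data[:cut+1]...)
        |                         carry = append([]byte(nil), data[cut+1:]...)
        |                     }
        |                 }
        |                 if err != nil {
        |                     break
        |                 }
        |             }
        |             if len(carry) > 0 {
        |                 chunks <- carry
        |             }
        |             close(chunks)
        |         }()
        |     
        |         go func() { wg.Wait(); close(results) }()
        |     
        |         total := make(map[string]int) // merge (the "reduce" step)
        |         for m := range results {
        |             for w, c := range m {
        |                 total[w] += c
        |             }
        |         }
        |         fmt.Println(len(total), "unique words")
        |     }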
        
         | Klinky wrote:
         | I agree, I think there is often a faulty assumption by many
         | developers doing benchmarking that their underlying environment
         | doesn't matter. Often I see performance results posted with
         | little mention of the details of the environment. At least here
         | they posted they were using a high-end SSD, but note they just
         | say they're on a "Dell XPS 13", as if there aren't multiple
         | variants of that model produced every year for the last 5 or 6
         | years.
         | 
          | You're probably also right that their multiple test runs resulted in
         | the OS caching data, and a lot of the test runs may have just
         | been testing in-memory performance instead of raw storage I/O
         | performance.
        
         | LastTrain wrote:
         | Thank you. I'd be pretty annoyed in this interview. Surely my
          | potential employer would be more interested in having me apply
          | my twenty years of real-world experience than what I learned in
          | CS240.
        
       ___________________________________________________________________
       (page generated 2022-11-26 23:01 UTC)