[HN Gopher] I/O is no longer the bottleneck
___________________________________________________________________
I/O is no longer the bottleneck
Author : benhoyt
Score : 341 points
Date : 2022-11-26 09:24 UTC (13 hours ago)
(HTM) web link (benhoyt.com)
(TXT) w3m dump (benhoyt.com)
| btbuildem wrote:
| As it often goes with these types of interview questions, there's
| a lot of context missing. What is the goal? Do we want readable
| code? Do we want fast code? Are we constrained somehow? It seems
| here the author doesn't really know, but kudos to them for
| examining their assumptions.
|
| As a side note, a trie would be a neat data structure to use in a
| solution for this toy problem.
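|
| A rough Go sketch of that idea (toy example, ASCII lowercase a-z
| only, anything else lumped into one bucket; not from the article):
|
|     package main
|
|     import "fmt"
|
|     // node is one trie node; count > 0 marks the end of a word.
|     type node struct {
|         children [27]*node // 'a'..'z' plus a catch-all bucket
|         count    int
|     }
|
|     func index(c byte) int {
|         if c >= 'a' && c <= 'z' {
|             return int(c - 'a')
|         }
|         return 26
|     }
|
|     // insert walks/creates the path for word and bumps its count.
|     func (n *node) insert(word string) {
|         cur := n
|         for i := 0; i < len(word); i++ {
|             j := index(word[i])
|             if cur.children[j] == nil {
|                 cur.children[j] = &node{}
|             }
|             cur = cur.children[j]
|         }
|         cur.count++
|     }
|
|     func main() {
|         root := &node{}
|         for _, w := range []string{"the", "quick", "the", "fox"} {
|             root.insert(w)
|         }
|         cur := root
|         for i := 0; i < len("the"); i++ { // walk down to "the"
|             cur = cur.children[index("the"[i])]
|         }
|         fmt.Println("count(the) =", cur.count) // prints 2
|     }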
| furstenheim wrote:
| It would be nice to see this benchmark in Node.js with streams
| SleepyMyroslav wrote:
| the state of benchmarking by normal IT people is tragic. If one
| checks out his 'optimization problem statement' article [1] they
| can find:
|
| >ASCII: it's okay to only support ASCII
|
| >Threading: it should run in a single thread on a single machine
|
| >Stdlib: only use the language's standard library functions.
|
| This is truly 1978 all over again. No flame graphs, no hardware
| counters, no bottleneck analysis. Using these 'optimizations' for
| job interviews is questionable at best.
|
| [1] https://benhoyt.com/writings/count-words/
| chakkepolja wrote:
| Still better than obscure question from page 20 of Leetcode.
| mastazi wrote:
| Networking is I/O: API calls, database access, etc. - it's not
| just disk access. The article is deriving a generalised statement
| based on a very specific use case.
| MrLeap wrote:
| Interesting! This made me wonder -- would this kind of
| optimization be recognized and rewarded in colossal scale
| organizations?
|
| I've seen comments about Google multiple times here where people
| say you won't be getting promotions unless you're shipping new
| things -- maintaining the old won't do it.
|
| But if you get to something core enough, it seems like the
| numbers would be pretty tangible and easy to point to during perf
| review time?
|
| "Found a smoother way to sort numbers that reduced the
| "whirrrrrr" noise our disks made. It turns out this reduces disk
| failure rates by 1%, arrested nanoscale structural damage to the
| buildings our servers are in, allowed a reduction in necessary
| PPE, elongated depreciation offsets and other things -- this one
| line of code has saved Google a billion dollars. That's why my
| compensation should be increased to include allowing me to fall
| limply into the arms of another and be carried, drooling, into
| the office, where others will dress me"
|
| In this hypothetical scenario, would a Googler be told "Your
| request has been approved, it may take one or two payment periods
| before your new benefits break into your apartment" or "No, you
| need to ship another chat program before you're eligible for
| that."?
| johnfn wrote:
| > Found a smoother way to sort numbers that reduced the
| "whirrrrrr" noise our disks made. It turns out this reduces
| disk failure rates by 1%, arrested nanoscale structural damage
| to the buildings our servers are in, allowed a reduction in
| necessary PPE, elongated depreciation offsets and other things
| -- this one line of code has saved Google a billion dollars
|
| Hah! I mean, if you can truly prove a business benefit by
| improving performance, I'm sure that you'd have a good shot at
| a promotion. The thing is, it's actually quite difficult to do
| so, and in the likely case that you cannot, it just looks like you
| squandered a bunch of time for no reason.
| anthlax wrote:
| Metrics are religiously collected and any sizable performance
| improvement will have a clear impact on one metric or another
| no?
| TeMPOraL wrote:
| "Promotion? You must be joking. Your little stunt caused a
| revenue spike that pushed us over the threshold in $country,
| prompting $regulatory-agency to look into starting antitrust
| proceedings. Mitigating this will require an army of lawyers,
| that will cost approximately a third of the $1B you 'saved' us.
| Additionally, we will now have to create another throwaway chat
| app, for which we'll allocate another third of a billion in the
| budget. The final third... will go to executive bonuses,
| obviously.
|
| You are hereby placed on a Performance Improvement Plan,
| starting tomorrow. On the off chance you'll come out of the
| other end still employed, keep in mind that your manager isn't
| being stupid by forbidding such 'optimizations', they're just
| following orders."
| PaulKeeble wrote:
| They'd likely have become one of the 10k getting fired for
| "underperforming". Company management in basically all
| businesses doesn't like know-it-alls who do the opposite of what
| they're told, regardless of the outcome.
| plonk wrote:
| Google's core services wouldn't work so reliably if they
| didn't value optimization and engineering. I don't work
| there, but I'm pretty sure that the SREs and the developers
| behind Search and Maps don't get fired based on how many
| products they launched.
| vitiral wrote:
| Is there a valid source for this number? All the articles I
| see seem to quote each other.
| kevinventullo wrote:
| You don't know what you're talking about.
| xyzzy_plugh wrote:
| > This made me wonder -- would this kind of optimization be
| recognized and rewarded in colossal scale organizations?
|
| It depends if this kind of optimization is valuable to the
| organization. Often times it's not. Spending money and time to
| save money and time is often viewed as less efficient than
| generating more revenue.
| hansvm wrote:
| Yeah, when I was there I saw plenty of <1% optimizations saving
| REDACTED gobs of money, and people were rewarded for it. I
| don't think it's applicable to most teams though.
|
| Imagine a foo/bar/widget app that only serves 20B people
| (obvious exaggeration to illustrate the point) and is only
| needed up to a few hundred times per day. You can handle
| that sort of traffic on a laptop on my home router and still
| have enough chutzpah left to stream Netflix. I mean, you are
| Google, and you need to do something better than that [0], but
| the hardware for your project is going to be negligible
| compared to other concerns unless you're doing FHE or video
| transcoding or something extraordinarily expensive.
|
| Walk that backward: how many teams have 20B users or are
| doing extraordinarily expensive things? I don't have any clue,
| but when you look at public examples of cheap things that never
| got much traction and probably had a suite of engineers [1],
| I'd imagine it's not everyone in any case. You're probably
| mostly looking at people with enough seniority to be able to
| choose to work on core code affecting most services.
|
| [0] https://www.youtube.com/watch?v=3t6L-FlfeaI
|
| [1] https://killedbygoogle.com/
| ilyt wrote:
| Google seems to have a problem of "you won't get a
| promotion/raise if you don't work on something new", so they
| are not interested in services that just work and provide a
| tidy little constant revenue.
| evmar wrote:
| [former Googler]
|
| Yes, I occasionally saw people get highlighted for making
| optimizations like "this saves 1% in [some important service]".
| When you're running millions of machines, 1% is a lot of
| machines. However, it's also likely the case that the easy 1%s
| have already been found...
| kevinventullo wrote:
| Pay increases and promotions are for the most part a little
| more formulaic and upper-bounded than what you describe, but
| generally speaking if you can prove you saved the company $XM
| you will be rewarded (from firsthand experience, this is also
| true at Meta).
| joshspankit wrote:
| In the last few years I've been quietly moving database "rows"
| from databases to the disk.
|
| Back in the day accessing data from MySQL was actually _slower_
| than current SSD speeds. And now you can get all sorts of
| benefits for free: hard link deduplication, versioning, live
| backup, easy usage of GNU tools...
|
| I don't discuss this with certain types of colleagues, but the
| results are excellent.
| ThinkBeat wrote:
| Is there any hardware accelerator / co-processor for the PC
| that will read a file into RAM autonomously, mainframe-style,
| and notify the OS when the file is fully loaded into memory
| (bypassing the CPU entirely)?
|
| Leaving the CPU free to do other things during that time.
| jbverschoor wrote:
| Isn't this what a DMA controller does?
| imtringued wrote:
| That is how I/O works already.
|
| The problem is dealing with an asynchronous filesystem API
| provided by the kernel.
| pca006132 wrote:
| You mean DMA?
| 323 wrote:
| This is already how stuff works.
|
| There are even OS APIs for this - DirectStorage on Windows.
| jeroenhd wrote:
| Do we need a coprocessor for that? Any protocol using DMA
| already does exactly this through interrupts or other message
| passing systems.
|
| DMA capable controllers are everywhere, I don't think you'll
| find any storage controllers in your modern computer that don't
| do this.
|
| Of course DMA only operates on byte ranges and not on files,
| but adding file system parsers to drives and disk controllers
| sounds like a way to introduce awful bugs and corruption.
| Assuming the OS keeps the necessary file system structures
| cached in RAM, plain old DMA should be good enough here.
| bogomipz wrote:
| >"Any protocol using DMA already does exactly this through
| interrupts or other message passing systems."
|
| Interesting, I'm only familiar with the classic interrupts
| approach. What are some of the other common message passing
| systems used in DMA?
| wtallis wrote:
| NVMe has memory-mapped doorbell registers associated with
| each submission or completion queue, which are ring buffers
| typically allocated in host RAM and accessed by the SSD
| using DMA. Generating interrupts when a new entry is posted
| to a completion queue is optional. The host system can
| instead simply poll the completion queue to check whether
| any new IO operations have been completed. Depending on how
| busy the drive is vs how busy the CPU is, forgoing
| interrupts in favor of polling may reduce latency (fewer
| context switches when handling a new completion), but can
| also burn a lot of CPU time.
| wtallis wrote:
| > Of course DMA only operates on byte ranges and not on
| files, but adding file system parsers to drives and disk
| controllers sounds like a way to introduce awful bugs and
| corruption.
|
| There's an interesting intermediate solution in the form of
| SSDs that provide a key-value interface instead of a linear
| block device. That gives the SSD more explicit knowledge
| about which chunks of data should be kept contiguous and can
| be expected to have the same lifetime.
| mastax wrote:
| I was recently working on parsing 100K CSV files and inserting
| them into a database. The files have a non-column-oriented header
| and other idiosyncrasies so they can't be directly imported
| easily. They're stored on an HDD so my first instinct was to
| focus on I/O: read the whole file into memory as an async
| operation so that there are fewer, larger IOs to help the HDD and
| so that other parser tasks can do work while waiting for the read
| to complete. I used a pretty featureful C# CSV parsing library
| which did pretty well on benchmarks [0] (CsvHelper) so I wasn't
| really worried about that part.
|
| But that intuition was completely wrong. The 100K CSV files only
| add up to about 2GB. Despite being many small files, reading
| through them all is pretty fast the first time, even on Windows,
| and then they're in the cache and you can ripgrep through them
| all almost instantaneously. The pretty fast parser library is
| fast because it uses runtime code generation for the specific
| object type that is being deserialized. The overhead of
| allocating a bunch of complex parser and typeconverter objects,
| doing reflection on the parsed types, and generating code for a
| parser, means that for parsing lots of tiny files it's really
| slow.
|
| I had to stop worrying about it because 2 minutes is fast enough
| for a batch import process but it bothers me still.
|
| Edit: CsvHelper doesn't have APIs to reuse parser objects. I
| tested patching in a ConcurrentDictionary to cache the generated
| code and it massively sped up the import. But again it was fast
| enough and I couldn't let myself get nerd sniped.
|
| Edit2: the import process would run in production on a server
| with low average load, 256GB RAM, and ZFS with zstd compression.
| So the CSV files will live permanently in the page cache and ZFS
| ARC. The import will probably run a few dozen times a day to
| catch changes. IO is really not going to be the problem. In fact,
| it would probably speed things up to switch to synchronous reads
| and remove all the async overhead. Oh well.
|
| [0]: https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-
| pa...
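|
| For what it's worth, the caching idea itself is tiny; a rough Go
| sketch (hypothetical "csv" struct tags, not CsvHelper's API) of
| building a per-type column plan once and reusing it for every file:
|
|     package main
|
|     import (
|         "fmt"
|         "reflect"
|         "sync"
|     )
|
|     // plan maps CSV column names to struct field indices.
|     type plan map[string]int
|
|     var planCache sync.Map // reflect.Type -> plan
|
|     // planFor does the reflection work once per type, then reuses it.
|     func planFor(t reflect.Type) plan {
|         if p, ok := planCache.Load(t); ok {
|             return p.(plan)
|         }
|         p := plan{}
|         for i := 0; i < t.NumField(); i++ {
|             p[t.Field(i).Tag.Get("csv")] = i
|         }
|         planCache.Store(t, p)
|         return p
|     }
|
|     type Row struct {
|         Name  string `csv:"name"`
|         Value string `csv:"value"`
|     }
|
|     func main() {
|         t := reflect.TypeOf(Row{})
|         fmt.Println(planFor(t)) // map[name:0 value:1]
|         fmt.Println(planFor(t)) // second call hits the cache
|     }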
| notacoward wrote:
| Interesting observation, but I think the author crosses a bridge
| too far here.
|
| > If you're processing "big data", disk I/O probably isn't the
| bottleneck.
|
| If it fits on a single machine, it is _by definition_ not big
| data. When you're dealing with really big data, it's likely
| coming from another machine, or more likely a cluster of them.
| Networks can also be pretty fast, but there will still be some
| delay associated with that plus the I/O (which might well be on
| spinning rust instead of flash) on the other end. Big data
| requires parallelism to cover those latencies. _Requires._ It
| might be true that I/O is no longer likely to be the bottleneck
| for a single-threaded program, but leave "big data" out of it
| because in that world I/O really is still a - if not the - major
| limiter.
| kragen wrote:
| the author is evidently pretty unskilled and unaware of it in a
| lot of ways, this being one of them
| visarga wrote:
| Loading 1GB json files is still slow.
| mrkeen wrote:
| If you specify the type of the data, not just the size, then
| yep: it wasn't really about the number of bytes read, but what
| the CPU had to do to process it.
| omgtehlion wrote:
| Not if you use SAX-like (or "evented") parser
| flohofwoe wrote:
| That's usually not because of IO but because of dumb memory
| management in the JSON parser implementation. Some JSON parsers
| build a huge tree out of individually allocated items for each
| JSON property, or in higher level languages, build an
| object/property tree out of the JSON structure. That's
| guaranteed to be slow for large trees.
| krapp wrote:
| I don't want to be that guy but having 1GB json files to begin
| with seems like the bigger problem.
| therealbilly wrote:
| Does any of this really matter?
| daviddever23box wrote:
| The problem here, as with most interview problems, is that it is
| wholly dissociated from its context; memory constraints, disk I/O,
| and file size are non-trivial considerations. Shame on people for
| stare-at-one's-shoes thinking.
| javajosh wrote:
| People can make reasonable assumptions for interview problems.
| In this case, assume the code runs on a modern developer
| laptop, with an SSD and lots of RAM, using some modern OS, is a
| fine assumption.
|
| If someone asks you a physics question about a car being
| dropped into the ocean, generally you don't have to worry about
| the make and model of the car.
| LastTrain wrote:
| But if you read the article, you would see that you'd run
| afoul of this interviewer by doing that, because he has
| utterly convinced himself that I/O is no longer the
| bottleneck and coming to the same conclusion is your
| unspecified task in the interview.
| nkozyra wrote:
| > Some candidates say that sorting is going to be the bottleneck,
| because it's O(N log N) rather than the input processing, which
| is O(N). However, it's easy to forget we're dealing with two
| different N's: the total number of words in the file, and the
| number of unique words.
|
| I don't see how that changes anything. There's a reason we use
| Big O rather than other notations. Their answer would still be
| correct.
| burlesona wrote:
| No. Big O describes how the compute time will increase as the
| input size increases for a given _algorithm_. It does not say
| anything about the
| performance of a specific _program_ on a specific input.
|
| The author's point is that if we have two different inputs A
| and B, and A is sufficiently smaller than B, then the runtime
| of B will dominate the program. This would be true even if we
| have to operate on A with a very slow algorithm.
|
| For example suppose you have some extreme scenario where you
| have to do three coloring on A and then find the minimum value
| of B. The runtime would be O(2^A) + O(B). Just plug in numbers
| and you can see that if B > 2^A then B takes longer even though
| that algorithm is much faster. If you suppose B is always much
| larger than 2^A, then B is your bottleneck.
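|
| To put toy numbers on the word-count case (mine, not the
| article's), a tiny Go sketch that prints both N's:
|
|     package main
|
|     import (
|         "fmt"
|         "math"
|         "strings"
|     )
|
|     func main() {
|         // 9 words repeated 1000 times: n is large, m stays tiny.
|         phrase := "the quick brown fox jumps over the lazy dog "
|         text := strings.Repeat(phrase, 1000)
|         counts := map[string]int{}
|         n := 0
|         for _, w := range strings.Fields(text) {
|             counts[w]++ // counting is O(n) over all words
|             n++
|         }
|         m := len(counts) // only the m unique words get sorted
|         fmt.Printf("n=%d, m=%d, m*log2(m)=%.0f\n",
|             n, m, float64(m)*math.Log2(float64(m)))
|         // n=9000, m=8, m*log2(m)=24 -- the sort is noise here
|     }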
| bufferoverflow wrote:
| It really depends on the number of unique words in the input
| file. If it's all unique words, sorting can become the
| bottleneck.
| [deleted]
| kragen wrote:
| input processing is o(n), sorting the unique words is o(m log
| m)
|
| whether o(m log m) is bigger or smaller than o(n) depends on
| the relationship between n and m
|
| in the case used by another poster in this thread,
| concatenating some large number of copies of the bible, m is
| constant, so o(m log m) is o(1)
|
| in another possible case where the corpus is generated
| uniformly randomly over all possible words of, say, 20 letters,
| though in theory o(m log m) is o(1), over a practical range it
| would be o(n log n)
|
| more typically m is proportional to some power of n depending
| on the language; for english the power is 'typically between
| 0.4 and 0.6'
|
| as it turns out o(√n log √n) is less than o(n)
|
| a totally different issue is that big-o notation doesn't always
| narrow down where the bottleneck is even if you calculate it
| correctly because n is never arbitrarily large and so the
| bottleneck depends on the constant factors
| nkozyra wrote:
| > a totally different issue is that big-o notation doesn't
| always narrow down where the bottleneck is even if you
| calculate it correctly because n is never arbitrarily large
| and so the bottleneck depends on the constant factors
|
| Yeah I think this is part of my point, in that theoretical
| best, average and worst case scenarios can be useful in
| isolation but rarely tell the whole story.
|
| In the worst case, every word is unique, so the unique-word
| count equals the total word count and it's a standard sorting
| problem.
| mrkeen wrote:
| A few ballpark numbers I encountered:
|
| Sequentially reading a file on a spinny laptop disk was about
| 80-100 MB/s. On an SSD that went up to 400-500 MB/s for me.
|
| That's the sequential case! What about random access? I tried an
| experiment where I memory mapped a large file and started
| updating bytes at random. I could get the rate down to
| kilobytes/sec.
|
| Even though we've all heard that SSDs don't pay as much of a
| penalty for random access as spinny disks do, it's still a huge
| penalty. Sequential spinny disk access is faster than SSD random
| access.
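|
| For reference, the experiment is only a few lines in Go; a rough
| sketch (Linux-only, assumes a pre-existing large file at the
| placeholder path /tmp/big.bin):
|
|     package main
|
|     import (
|         "fmt"
|         "math/rand"
|         "os"
|         "syscall"
|     )
|
|     func main() {
|         f, err := os.OpenFile("/tmp/big.bin", os.O_RDWR, 0)
|         if err != nil {
|             panic(err)
|         }
|         defer f.Close()
|         st, _ := f.Stat()
|         size := int(st.Size())
|
|         // Map the whole file; writes go back via the page cache.
|         data, err := syscall.Mmap(int(f.Fd()), 0, size,
|             syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
|         if err != nil {
|             panic(err)
|         }
|         defer syscall.Munmap(data)
|
|         // Each touch hits a random page: a fault per cold page and,
|         // eventually, lots of small scattered write-backs.
|         for i := 0; i < 1_000_000; i++ {
|             data[rand.Intn(size)]++
|         }
|         fmt.Println("done,", size, "bytes mapped")
|     }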
| GrayShade wrote:
| > Sequential spinny disk access is faster than SSD random
| access.
|
| It is, but on both kinds of drives you'll want to dispatch at
| least a couple of requests at once to get better performance.
| In the memory-mapped case, that means using multiple threads.
|
| In addition, you might also want to call madvise(MADV_RANDOM)
| on the mapping.
| mrkeen wrote:
| I didn't think there'd be that much discussion on my point
| (seq hdd > rand sdd).
|
| > In the memory-mapped case, that means using multiple
| threads.
|
| My gut tells me I'd lose more to contention/false-sharing
| than I'd gain through multithreading - but I haven't done the
| experiment.
| koolba wrote:
| > Sequential spinny disk access is faster than SSD random
| access.
|
| No it's not. At least not with modern SSDs or NVMe storage.
|
| Even at 100 MB/s, a spinning disk in _sequential_ mode is doing
| 100 x 1024 / 4 = 25,600 IOPS (assuming a standard 4K per
| operation).
|
| Even consumer grade NVMe hardware gets 5-10x of that for random
| workloads.
| kragen wrote:
| it is true that spinning rust is slower than most ssds
| including nvme ssds even in sequential mode
|
| however, a spinning disk doing a sequential access is not
| doing 25600 iops
|
| if the sequential access lasts 10 seconds it is doing 0.1
| iops
| [deleted]
| mrkeen wrote:
| > Even consumer grade NVMe hardware gets 5-10x of that for
| random workloads.
|
| Cool, lots of IOPS!
|
| But like I said, I got it down to kilobytes/sec.
| justsomehnguy wrote:
| Because you did it in the most inefficient way.
|
| This [0] comment is totally on point.
|
| Also note that a consumer SSD can be made even with a single
| flash chip. More performant ones are made of a bunch of chips
| internally (essentially a RAID0 with some magic), so they can
| do parallel operations if the data resides on different flash
| blocks. Still, if your thread is only doing one operation at a
| time with blocks < the flash rewrite block size, you will hit
| write amplification anyway.
|
| I think if you do the same test but without a memory mapped
| file (ie let the OS and disk subsystem do their thing) you
| will get much more _speed_.
|
| [0] https://news.ycombinator.com/item?id=33751973
| koolba wrote:
| On what hardware and how much parallelization? The max IOPS
| numbers are only when you're saturating the command queue.
|
| It's a throughput number, not a single operation,
| completion, followed by the next one.
| wtallis wrote:
| > I tried an experiment where I memory mapped a large file and
| started updating bytes at random. I could get the rate down to
| kilobytes/sec.
|
| Memory-mapped IO means you're only giving the SSD one request
| to work on at a time, because a thread can only page fault on
| one page at a time. An SSD can only reach its peak random IO
| throughput if you give it lots of requests to work on in
| parallel. Additionally, your test was probably doing small
| writes with all volatile caching disallowed, forcing the
| (presumably consumer rather than enterprise) SSD to perform
| read-modify-write cycles not just of the 4kB virtual memory
| pages the OS works with, but also the larger native flash
| memory page size (commonly 16kB). If you'd been testing only
| read performance, or permitted a normal degree of write
| caching, you would have seen far higher performance.
| d_tr wrote:
| True. And main memory access can easily become slower than SSD
| sequential access if you do random byte accesses and your
| working set is larger than the CPU's caches or TLBs.
| Sirupsen wrote:
| Yes, sequential I/O bandwidth is closing the gap to memory. [1]
| The I/O pattern to watch out for, and the biggest reason why e.g.
| databases do careful caching to memory, is that _random_ I/O is
| still dreadfully slow. I/O bandwidth is brilliant, but latency is
| still disappointing compared to memory. Not to mention, in
| typical Cloud workloads, IOPS are far more expensive than memory.
|
| [1]: https://github.com/sirupsen/napkin-math
| BulgarianIdiot wrote:
| > Yes, sequential I/O bandwidth is closing the gap to memory.
|
| Hilariously, meanwhile, RAM has become significantly slower
| relative to CPU performance, i.e. you spend a disproportionate
| amount of time reading and writing memory: RAM is fast, but the
| CPU is way faster.
|
| Which means I/O remains a bottleneck...
| jonstewart wrote:
| Random I/O with NVMe is still slower than sequential I/O, but
| the gap between the two has narrowed considerably, and random
| I/O is quite fast in historical/comparative absolute terms. To get
| close to peak random I/O limits, you need to dispatch a lot of
| commands in parallel--that's an I/O idiom that doesn't really
| exist in high level languages, and I think that's where a lot
| of the challenge is.
| crote wrote:
| The problem is that a lot of workloads using random I/O have
| dependencies between the different I/O operations. A database
| traversing a tree-like index cannot issue the next read until
| the previous one has finished. You are limited by latency,
| which for NVMe is still orders of magnitude worse than
| memory.
| the8472 wrote:
| Those are rarely the slow ones though. Lots of software
| simply has not been written to keep IO queues full. They
| read a bit of data, process it, read the next bit and so
| on. On a single thread. This makes all kinds of IO
| (including network) way slower than it has to be.
|
| For example a tree-index can be parallelized by walking
| down different branches. On top of that one can issue a
| prefetch for the next node (on each branch) while
| processing the current ones.
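|
| A minimal Go sketch of "keep the queue full" (placeholder path
| /tmp/big.bin, assumed to be a large file): many positioned reads
| in flight at once instead of one blocking read at a time:
|
|     package main
|
|     import (
|         "fmt"
|         "math/rand"
|         "os"
|         "sync"
|         "sync/atomic"
|     )
|
|     func main() {
|         f, err := os.Open("/tmp/big.bin")
|         if err != nil {
|             panic(err)
|         }
|         defer f.Close()
|         st, _ := f.Stat()
|         pages := st.Size() / 4096
|
|         const workers = 32 // rough stand-in for queue depth
|         var total int64
|         var wg sync.WaitGroup
|         for w := 0; w < workers; w++ {
|             wg.Add(1)
|             go func(seed int64) {
|                 defer wg.Done()
|                 r := rand.New(rand.NewSource(seed))
|                 buf := make([]byte, 4096)
|                 for i := 0; i < 10_000; i++ {
|                     off := r.Int63n(pages) * 4096
|                     // ReadAt is safe for concurrent use, so these
|                     // reads overlap and keep the device busy.
|                     n, _ := f.ReadAt(buf, off)
|                     atomic.AddInt64(&total, int64(n))
|                 }
|             }(int64(w))
|         }
|         wg.Wait()
|         fmt.Println("bytes read:", atomic.LoadInt64(&total))
|     }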
| ilyt wrote:
| Yup, a lot of software is (was?) written with assumptions
| that mattered with spinning rust. And even if author
| didn't intend to, serial code generates serial, dependent
| IO.
| ShredKazoo wrote:
| I thought for SSDs it didn't matter whether data was adjacent
| on disk?
| crote wrote:
| Well, yes and no.
|
| With spinning rust you have to wait for the sector you want
| to read to rotate underneath the read head. For a fast 10,000
| RPM drive, a single rotation takes 6 milliseconds. This means
| that for random access the average latency is going to be 3
| milliseconds - and even that's ignoring the need to move the
| read head between different tracks! Sequential data doesn't
| suffer from this, because it'll be passing underneath the
| read head in the exact order you want - you can even take the
| track switching time into account to make this even better.
|
| SSDs have a different problem. Due to the way NAND is
| physically constructed it is only possible to read a single
| page at a time, and accessing a single page has a latency of
| a few nanoseconds. This immediately places a lower limit on
| the random read access time. However, SSDs allow you to send
| read commands which span many pages, allowing the SSD to
| reorder the reads in the most optimal way, and do multiple
| reads in parallel. This means that you only have to pay the
| random access penalty once - not to mention that you have to
| issue way fewer commands to the SSD.
|
| SSDs try to make this somewhat better by having a very deep
| command queue: you can issue literally thousands of random
| reads at once, and the SSD will reorder them for faster
| execution. Unfortunately this doesn't gain you a lot if your
| random reads have dependencies, such as when traversing a
| tree structure, and you are still wasting a lot of effort
| reading entire pages when you only need a few bytes.
| jonstewart wrote:
| You really need an NVMe interface to the SSD, though. SATA3
| is the bottleneck for SATA SSDs.
| ShredKazoo wrote:
| Interesting, thanks! So it sounds like it's not so much
| "random" I/O that's slow, but rather "unbatched" I/O or
| something like that?
|
| Curious to hear your thoughts on this thread if you have
| time to share:
| https://news.ycombinator.com/item?id=33752870
| mamcx wrote:
| > Unfortunately this doesn't gain you a lot if your random
| reads have dependencies, such as when traversing a tree
| structure,
|
| So, does this mean B-trees suffer? What would be the most
| optimal layout for database storage where only SSD matters?
|
| I'm working on one that is just WAL-only and scans everything
| on each operation (for now!) and want to see what I can do to
| improve the situation.
| marginalia_nu wrote:
| Depending on how your madvise is set up, it's often the case
| that sequential disk reads _are_ memory reads. You're
| typically only paying the price for touching the first page in
| a sequential run, that or subsequent page faults come at a big
| discount.
|
| If you read 1,000,000 random bytes (~1 MB) scattered across a
| huge file (let's say you're fetching from some humongous on-
| disk hash table), it will to a first order be about as slow as
| reading 4 Gb sequentially. This will incur the same number of
| page faults. There _are_ ways of speeding this up, but only so
| much.
|
| Although, I/O is like an onion of caching layers, so in
| practice this may or may not hold up depending on previous
| access patterns of the file, lunar cycles, or whether Venus is in
| retrograde.
| Sirupsen wrote:
| `madvise(2)` doesn't matter _that_ much in my experience with
| [1] on modern Linux kernels. SSD just can't read _quite_ as
| quickly as memory in my testing. Sure, SSD will be able to
| re-read a lot into ram, analogous to how memory reading will
| be able to rapidly prefetch into L1.
|
| I get ~30 GiB/s for threaded sequential memory reads, but ~4
| GiB/s for SSD. However, I think the SSD number is single-
| threaded and not even with io_uring--so I need to regenerate
| those numbers. It's possible it could be 2-4x better.
|
| [1]: https://github.com/sirupsen/napkin-math
| shitlord wrote:
| How'd you measure the maximum memory bandwidth? In
| Algorithmica's benchmark, the max bandwidth was observed to
| be about 42 GB/s: https://en.algorithmica.org/hpc/cpu-
| cache/sharing/
|
| I'm not sure how they calculated the theoretical limit of
| 42.4 GB/s, but they have multiple measurements higher than
| 30 GB/s.
| marginalia_nu wrote:
| I think the effects of madvise primarily crop up in
| extremely I/O-saturated scenarios, which are rare. Reads
| primarily incur latency, with a good SSD it's hard to
| actually run into IOPS limitations and you're not likely to
| run out of RAM for caching either in this scenario.
| MADV_RANDOM is usually a pessimization, MADV_SEQUENTIAL may
| help if you are truly reading sequentially, but may also
| worsen performance as pages don't linger as long.
|
| But as I mentioned, there's caching upon caching, and also
| protocol level optimizations, and hardware-level
| considerations (physical block size may be quite large but
| is generally unknown).
|
| It's nearly impossible to benchmark this stuff in a
| meaningful way. Or rather, it's nearly impossible to know
| what you are benchmarking, as there are a lot of
| nontrivially stateful parts all the way down that have real
| impact on your performance.
|
| There are so many moving parts I think the only meaningful
| disk benchmarks consider whatever application you want to
| make go faster. Do the change. Is it faster? Great. Is it
| not? Well at least you learned.
| menaerus wrote:
| > I get ~30 GiB/s for threaded sequential memory reads, but
| ~4 GiB/s for SSD. However, I think the SSD number is
| single-threaded and not even with io_uring--so I need to
| regenerate those numbers. It's possible it could be 2-4x
| better.
|
| Assuming that you run the experiments on an NVMe SSD which is
| attached to PCIe 3.0 x4, where the theoretical maximum is around
| 1 GB/s per lane, I am not sure how you expect to go faster than
| 4 GiB/s. Isn't that already the theoretical maximum of what you
| can achieve?
| formerly_proven wrote:
| PCIe 4.0 SSDs are pretty common nowadays and are
| basically limited to what PCIe 4.0 x4 can do (around 7
| GB/s net throughput).
| menaerus wrote:
| I don't think they're that common. You would have to have a
| quite recent motherboard and CPU that both support PCIe 4.0.
|
| And I'm pretty sure that the parent commenter doesn't own such
| a machine, because otherwise I'd expect a 7-8 GB/s figure to
| be reported in the first place.
| dagmx wrote:
| I really doubt they're that common. They only became
| available on motherboards fairly recently, and are quite
| expensive.
|
| I'd guess that they're a small minority of devices at the
| moment.
| Sirupsen wrote:
| You might be very right about that! It's been a while
| since I did the SSD benchmarks. Glad to hear it's most
| likely entirely accurate at 4 GiB/s then!
| derefr wrote:
| > _random_ I/O is still dreadfully slow. I/O bandwidth is
| brilliant, but latency is still disappointing compared to
| memory.
|
| The wondrous thing about modern CPU architectures (e.g. Zen3),
| though, is all the PCIe lanes you get with them. If you really
| need high random IOPS, you can now cram 24 four-lane NVMe disks
| into a commodity server (with PCIe M.2 splitter cards) and
| saturate the link bandwidth on all of them. Throw them all in a
| RAID0, and stick a filesystem on them with the appropriate
| stripe width, and you'll get something that's only about 3x
| higher latency for cold(!) random reads than a read from RAM.
|
| (My company provides a data-analytics SaaS product; this is
| what our pool of [shared multitenant, high concurrency] DB
| read-replicas look like.)
| Dylan16807 wrote:
| I thought NVMe flash latency was measured in tens of
| microseconds. 3x RAM would be a fraction of a microsecond,
| right?
| emn13 wrote:
| I think the person you're replying to is confusing IOPS
| with latency. If you add enough parallelism, then NAND
| flash random-read IOPS will eventually reach DRAM
| performance.
|
| But it's not going to be easy - for a sense of scale I just
| tested a 7950x at stock speeds with stock JEDEC DDR5
| timings. I inserted a bunch of numbers into an 8GB block of
| memory, then with a deterministic random seed randomly picked
| 4kb pages, computing their sum and eventually reporting it (to
| avoid overly clever dead-code elimination, and to make sure the
| data is fully read).
|
| With an SSD-friendly 4K page size that resulted in 2.8
| million iops of QD1 random read. By comparison, a web
| search for intel's 5800x optane's QD1 results shows 0.11
| million iops, and that's the fastest random read SSD there
| is at those queue depths, AFAIK.
|
| If you add parallelism, then ddr5 reaches 11.6 million iops
| at QD16 (via 16 threads), fast SSDs reach around 1 million,
| the optane reaches 1.5 million. An Epyc Genoa server chip
| has 6 times as many DDR5 memory channels as this client
| system does; and I'm not sure how well that scales, but 60
| million 4kb random read iops sounds reasonable, I assume.
| Intel's memory controllers are supposedly even better (at
| least for clients). Turning on XMP and PBO improves results
| by 15-20%; and even tighter secondary/tertiary timings are
| likely possible.
|
| I don't think you're going to reach those numbers not even
| with 24 fast NVMe drives.
|
| And then there's the fact that I picked the ssd-friendly
| 4kb size; 64-byte random reads reach 260 million iops -
| that's not quite as much bandwidth as @ 4kb, but the
| scaling is pretty decent. Good luck reaching those kind of
| numbers on SSDs, let alone the kind of numbers a 12-channel
| server might reach...
|
| We're getting close enough that the loss in performance at
| highly parallel workloads is perhaps acceptable enough for
| some applications. But it's still going to be a serious
| engineering challenge to even get there, and you're only
| going to come close under ideal (for the NAND)
| circumstances - lower parallelism or smaller pages and it's
| pretty much hopeless to arrive at even the same order of
| magnitude.
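|
| A scaled-down, single-threaded Go approximation of that DRAM test
| (1 GiB instead of 8GB so it runs anywhere; my numbers, not a
| faithful reproduction):
|
|     package main
|
|     import (
|         "fmt"
|         "math/rand"
|         "time"
|     )
|
|     func main() {
|         const size = 1 << 30 // 1 GiB
|         const page = 4096
|         buf := make([]byte, size)
|         for i := range buf {
|             buf[i] = byte(i) // touch every page so it's resident
|         }
|
|         r := rand.New(rand.NewSource(42)) // deterministic seed
|         const iters = 250_000
|         var sum uint64
|         start := time.Now()
|         for i := 0; i < iters; i++ {
|             off := r.Intn(size/page) * page
|             for _, b := range buf[off : off+page] {
|                 sum += uint64(b) // force the whole page to be read
|             }
|         }
|         dt := time.Since(start)
|         fmt.Printf("sum=%d, %.2f M page-reads/s\n",
|             sum, float64(iters)/dt.Seconds()/1e6)
|     }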
| 867-5309 wrote:
| at this point _cost_ would become the bottleneck. compare
| 24x1TB NVMe drives to 24TB of DDR5
| derefr wrote:
| Under ideal conditions, yes. But the 3x difference I see in
| practice is less about NVMe being just that good; and more
| about operations against (main) memory getting bottlenecked
| under high all-cores concurrent access with no cross-
| workload memory locality to enable any useful cache
| coherence. And also about memory accesses only being "in
| play" when a worker thread isn't context-switched out;
| while PCIe-triggered NVMe DMA can proceed _while_ the
| thread has yielded for some other reason.
|
| In other words, _when measured E2E in the context of a
| larger work-step_ (one large enough to be interrupted by a
| context-switch), the mean, amortized difference between the
| two types of fetch becomes <3x.
|
| Top of my wishlist for future architectures is "more,
| lower-width memory channels" -- i.e. increased intra-CPU
| NUMAification. Maybe something CXL.mem will roughly
| simulate -- kind of a move from circuit-switched memory to
| packet-switched memory, as it were.
| wil421 wrote:
| I'm running a FreeNAS box on an i3-8100. Right now I'm
| converting the NAS and my desktop to server chassis and
| putting them in a rack. Once I get a 10GbE Unifi switch and
| NICs off eBay, I'm debating running my desktop and servers
| diskless using iSCSI backed by RAID0 NVMe drives.
| justsomehnguy wrote:
| Whatever floats your boat, but iSCSI is limited to 1500 MTU
| (9k? Are you sure you can boot with 9k enabled?) and while
| you can have 10Gbit _throughput_, that doesn't mean that's what
| you will always get, e.g. 100 IO operations would generate 100
| packets and it doesn't matter if each was 1500B or only 100B.
|
| And you wouldn't see the speed improvement on RAID0 NVMe
| drives except in extremely rare fully sequential operations
| lasting for at least tens of seconds.
|
| You also can try it just by running a VM with iSCSI boot on
| your current desktop.
| ilyt wrote:
| > Whatever floats your boat, but iSCSI is limited to 1500
| MTU (9k? Are you sure you can boot with 9k enabled?) and
| while you can have 10Gbit throughput, that doesn't mean
| that's what you will always get, e.g. 100 IO operations would
| generate 100 packets and it doesn't matter if each was 1500B
| or only 100B.
|
| Ugh, iSCSI does have queueing so you can have many
| operations in flight, and one operation doesn't really
| translate to one packet in the first place; the kernel will
| happily pack a few smaller operations on a TCP socket into
| one packet when there is load.
|
| The single queue is the problem here, but the dumb admin
| trick is just to bring up more than one IP on the server and
| connect to all of them via multipath.
| TristanBall wrote:
| Been a long time since anything iSCSI related didn't handle
| 9k, for boot or otherwise.
|
| But I look at it this way. You need 40gbit networking for
| a single pci3 nvme ( and newer drives can saturate that,
| or close )
|
| And because you're throttling throughput you'll see much
| more frequent, longer, queuing delays, on the back of a
| network stack that ( unless you're using rdma ) is
| already 5x-10x slower than nvme.
|
| It'll be fast enough for lots of things, especially
| home/lab use, and it'll be _amazing_ if you're upgrading
| from sata spinning disk.. but 10gbit is slow by modern
| storage standards.
|
| Of course, that's not the only consideration. Shared
| storage and iscsi in particular can be extremely
| convenient! And sometimes offers storage functionality
| that clients don't have ( snapshots, compression,
| replication )
| ilyt wrote:
| That's a lot of effort to put a silent piece of silicon a few
| metres away from the machine.
|
| iSCSI is gonna eat some of your CPU (you're changing "send a
| request to the disk controller and wait" into "do a bunch of
| work to create a packet, send it over the network, and get it
| back") if you don't have a card with offload. It also might
| kinda not be fast enough to get the most out of NVMe,
| especially in RAID0.
|
| And, uh, just don't keep anything important there...
| Un1corn wrote:
| Does someone have a comparison between common server SSDs and
| consumer SSDs? I wonder if the speed is equal or not.
| PaulKeeble wrote:
| Hard drives are still about 150MB/s.
|
| SATA SSDs are limited to 550MB/s.
|
| PCI-E 3.0 SSDs more like 3500 MB/s.
|
| PCI-E 4.0 SSDs are 7000MB/s.
|
| All of these are at consumer level pricing, you can get 2TB of
| PCI-E 4 from Western Digital for £130 at the moment, usually
| about £180. The issue is sustained writes more than reads for
| consumer versus enterprise drives, where the speed drops off due
| to a lack of SLC caching and lack of cooling and TLC/QLC which
| is slower for sustained writing.
|
| The example given is very much a consumer level device and not
| a particularly quick one by today's standards. You can also
| expect much faster cached reads than that on a DDR5 system, I
| suspect.
| Dalewyn wrote:
| It should be noted that those SSD speeds are all protocol
| limits rather than NAND flash limits. ~7GB/s is literally the
| maximum speed PCIE4 can provide, likewise ~3.5GB/s for PCIE3
| and ~500MB/s for SATA3.
| omgtehlion wrote:
| > the maximum speed PCIE4
|
| 4-lane PCIe, as most NVMe drives are. I haven't seen drives
| with more lanes though...
| Dalewyn wrote:
| Ah, right. Lanes. I should have mentioned those as well,
| thanks for catching that.
| bluedino wrote:
| Today's large hard drives are over 250 MB/s in sequential
| operations.
| js4ever wrote:
| For me the main difference between HDD and NVMe is not really
| the throughput but the access time. When you manipulate small
| files it's a lot more important.
| sekh60 wrote:
| Consumer SSDs are trash for sustained writes since they get
| their top speeds from cache and use slower NAND. Enterprise
| SSDs tend to have better write endurance, and faster NAND. I
| have a small Ceph cluster and as an example when I first bought
| SSDs for it I tried consumer Samsung 870 Evo's. They performed
| worse than spinning rust.
| wtallis wrote:
| Consumer SSDs don't really use slower NAND; the QLC vs TLC
| ratio _might_ be higher for the consumer SSD market than the
| enterprise SSD market, but a consumer TLC drive is using the
| same speed of NAND as an enterprise TLC drive (and likewise
| for QLC drives).
|
| Enterprise SSDs only really have significantly higher
| endurance if you're looking at the top market segments where
| a drive is configured with much more spare area than consumer
| drives (ie. where a 1TiB drive has 800GB usable capacity
| rather than 960GB or 1000GB). Most of the discrepancy in
| write endurance ratings between mainstream consumer and
| mainstream enterprise drives comes from their respective
| write endurance ratings being calculated according to
| _different criteria_, and from consumer SSDs being given
| low-ball endurance ratings so that they don't cannibalize
| sales of enterprise drives.
|
| Your poor Ceph performance with Samsung consumer SATA SSDs
| wasn't due to the NAND, but to the lack of power loss
| protection on the consumer SSDs leading to poor sync write
| performance.
| menaerus wrote:
| If what you say had been true, then we wouldn't be able to
| saturate the PCIe 3.0 x4 bus with a consumer NVMe SSD, which we
| absolutely can. The biggest difference is in durability, as
| mentioned in the comment below.
| wtallis wrote:
| Read speeds are similar between consumer and enterprise SSDs;
| they use the same flash and there's overlap with high-end
| consumer SSDs using the same controllers as entry-level and
| mid-range enterprise SSDs.
|
| The main difference is in write performance: consumer SSDs use
| SLC caching to provide high burst write performance, while
| server SSDs usually don't and are optimized instead for
| consistent, sustainable write performance (for write streams of
| many GB).
|
| Server SSDs also usually have power loss protection capacitors
| allowing them to safely buffer writes in RAM even when the host
| requests writes to be flushed to stable storage; consumer
| drives have to choose between lying to the host and buffering
| writes dangerously, or having abysmal write performance if the
| host is not okay with even a little bit of volatile write
| caching.
| 323 wrote:
| Further evidence is the fact that optimized SIMD JSON or UTF8
| libraries exist. If I/O was the bottleneck, there wouldn't be a
| need to parse JSON using SIMD.
| samsquire wrote:
| My immediate thought was are you measuring throughput or latency?
|
| The latency of reading from disk is indeed very slow compared to
| CPU instructions.
|
| A 3 GHz processor runs 3 billion cycles a second
| (3,000,000,000), and some instructions take 1 cycle. You get 3
| cycles per nanosecond. An SSD or spinning disk access costs many
| multiples of that.
|
| Read 1 MB sequentially from SSD: ~1,000,000 ns (1 ms)
|
| That's a lot of time that could be spent doing additions or
| looping.
|
| https://gist.github.com/jboner/2841832
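|
| A crude way to feel the difference (placeholder path
| /tmp/one_mib.bin, assumed to be >= 1 MiB; note that if the file
| is already in the page cache, the "read" number is really
| measuring memory, not the SSD):
|
|     package main
|
|     import (
|         "fmt"
|         "io"
|         "os"
|         "time"
|     )
|
|     func main() {
|         const mib = 1 << 20
|         buf := make([]byte, mib)
|
|         start := time.Now()
|         f, err := os.Open("/tmp/one_mib.bin")
|         if err != nil {
|             panic(err)
|         }
|         if _, err := io.ReadFull(f, buf); err != nil {
|             panic(err)
|         }
|         f.Close()
|         readTime := time.Since(start)
|
|         start = time.Now()
|         var sum uint64
|         for _, b := range buf {
|             sum += uint64(b) // ~1M additions over the same bytes
|         }
|         cpuTime := time.Since(start)
|
|         fmt.Printf("read 1 MiB: %v, sum 1 MiB: %v (sum=%d)\n",
|             readTime, cpuTime, sum)
|     }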
| formerly_proven wrote:
| And you get 2-4 instructions per cycle on average.
| noctune wrote:
| That is true, but assuming file I/O is blocking you also have
| to pay for a context switch to take advantage of that.
|
| But I guess you could avoid that using eg. io_uring.
| cb321 wrote:
| I agree - "I/O _bandwidth_ is no longer the bottleneck" would
| be a better title.
| tonetheman wrote:
| I/O is no longer the bottleneck for silly interview questions for
| the most part. But for real programs it can still be an issue.
| nanis wrote:
| Related posts from my past experience from about 10 years ago:
|
| * Splitting long lines is slow[1]
|
| * Can Parallel::ForkManager speed up a seemingly IO bound
| task?[2]
|
| In both cases, Perl is the language used (with a little C thrown
| in for [1]), but they are in a similar vein to the topic of this
| post. In [1], I show that the slowness in processing large files
| line by line is not due to I/O, but due to the amount of work
| done by code. In [2], a seemingly I/O bound task is sped up by
| throwing more CPU at it.
|
| [1]: https://www.nu42.com/2013/02/splitting-long-lines-is-
| slow.ht...
|
| [2]: https://www.nu42.com/2012/04/can-parallelforkmanager-
| speed-u...
| chaxor wrote:
| It's still very bizarre to me to see people completely write off
| spinning platter drives as 'old tech'. They're still used
| everywhere! (At least in my world)
|
| We have all of my teams data on an array of 18tb drives for a
| 100TB raid10 setup, and a NAS at home doing the same, etc. Even
| some of our OS drives at work are 7200rpm drives - and we're a
| computational lab. Why is everyone so intent that these
| effectively no longer exist? The cost for a decent amount of
| space with NVMe drives is just far too astronomical. We're not
| all millionaires.
| crote wrote:
| Spinning rust is still by far the cheapest way to store bulk
| data, and it is fast enough for infrequent access. It's
| perfectly fine for archival purposes. Large disks are $0.02 /
| GB while large SSDs are closer to $0.07 / GB.
|
| On the other hand, ignoring my media collection, both my
| personal computer and server only need a few hundred GB of
| storage. SSDs are cheap enough that they are a no-brainer: they
| are physically a lot smaller, quieter, and more resilient. The
| orders-of-magnitude better IO speed doesn't hurt either, even
| if most of it is wasted on me.
|
| 1TB NVMe drives are now less than $75. I could get a 1TB HDD
| for $40 or a 3TB one for $75, but why bother?
| mmis1000 wrote:
| Even 2TB SSDs can be bought at < $130 now. Plus the fact that
| consumer motherboards usually have 2 M.2 slots. You can have
| 4TB of ultra fast storage on a normal consumer level computer
| without pcie extension cards. Which is overkill for most
| people. Most consumers probably don't even need that much
| storage space.
| kragen wrote:
| until they start shooting video or finetuning stable
| diffusion or something
|
| video and archived neural net snapshots are pretty
| sequential
| jackmott42 wrote:
| I think you know the answer to your question, which is that
| normal end users who build PCs basically never use them. Power
| users with NAS and huge porn collections and servers, of course
| they still do.
| [deleted]
| swalsh wrote:
| Most people are conscious that they're making a performance
| trade off when they decide to store data in one.
| javajosh wrote:
| _> a 100TB raid10 setup, and a NAS at home doing the same_
|
| How do you deal with these onerous constraints? Do you have a
| system for deciding how many copies of archive.org to keep
| locally?
| [deleted]
| fishtacos wrote:
| I know you're being facetious, but here are some fun stats from
| archive.org.
|
| A few highlights from the Petabox storage system:
|
| No Air Conditioning, instead use excess heat to help heat the
| building.
|
| Raw Numbers as of December 2021:
|
| 4 data centers, 745 nodes, 28,000 spinning disks
|
| Wayback Machine: 57 PetaBytes
|
| Books/Music/Video Collections: 42 PetaBytes
|
| Unique data: 99 PetaBytes
|
| Total used storage: 212 PetaBytes
|
| Considering this data is from a year ago, it's got to be
| substantially larger now.
|
| Link: https://archive.org/web/petabox.php
| chaxor wrote:
| I'm not sure what exactly you mean - but we typically store
| DNA sequencing data from people that want us to analyze it,
| databases of proteins and scientific articles, ML models,
| etc. Proper considerations of how to store the data and
| compress it correctly help a lot, but unfortunately academics
| are pretty terrible at choosing the right data structures for
| the job, and often several copies are needed for mirroring
| the raw data, then the processed versions are updated.
| vitno wrote:
| Spinning disks are fine for some workloads. However, I've
| definitely seen teams engineering around slower spinning disks.
| NVMe isn't egregiously expensive, and if teams need to spend
| too much work and time thinking about disk speed, it's probably
| cheaper in the long run to switch.
| simonmales wrote:
| And there is still demand. Otherwise they wouldn't still be
| produced.
| burntsushi wrote:
| I don't see anywhere in this article where spinning disks are
| written off.
|
| It is sensible to consider the I/O speed of things other than
| spinning disks. Especially when they are exceptionally
| commonplace in consumer devices and developer workstations.
| Thaxll wrote:
| They're not used everywhere; most desktops or servers are now
| based on SSDs or equivalent.
|
| Your use case of people running a desktop with 100TB is niche;
| for $100 you can get a very fast 1TB NVMe drive nowadays, which
| is fine for 99.99% of the population.
| mrich wrote:
| I have a hunch that rewriting the program in C/C++ or Rust
| would change this significantly.
| jhfdbkofdchk wrote:
| Writing this in Perl will have significantly different results.
| Can't speak for Go but text processing in Python is slow.
| civopsec wrote:
| Not necessarily, considering that he wrote that the Go version
| tries to avoid memory allocations.
| iAm25626 wrote:
| I/O also means network to me. Often the target (database, or
| device generating telemetry) is 10+ ms away. That round-trip
| time is bottlenecked by physics (speed of light). A side benefit
| of SQLite is being on the local file system / in memory.
| throwaway71271 wrote:
| EBS costs crazy, crazy amounts for reasonable IOPS.
|
| We pay 7k per month for RDS that can do barely 2k IOPS, while a
| machine at Hetzner does 2 million IOPS for 250 euro per month
| (not to mention it also has 4x more cores and 5x more RAM).
|
| So, even though I/O is no longer the bottleneck physically, it
| still is a considerable issue and design challenge on the cloud.
| nix23 wrote:
| Well yes it's a total ripoff
|
| I installed a DB server for a customer around 2 years ago, in a
| DC near him, with 16 cores, 48GB RAM and ~6TB -> 12 SSDs, vdevs
| mirrored in pairs and striped over the mirrored vdevs (kind of a
| RAID10 but ZFS), zstd compression (1GB could be compressed down
| to ~200MB, so 5 times less reading/writing, and in theory ~30TB
| of pure DB data, 20TB realistically; remember never to fill a
| zpool over 72%), record size 16kb (PostgreSQL). After 3 months
| the machine was paid off (compared to the "cloud" price) and the
| performance was roughly 10-12 times higher.
|
| Called the customer about two months ago and he said the DB
| server is still too fast and maybe he wants another one that
| uses less power... ;)
| samb1729 wrote:
| > remember never fill a zpool over 72%
|
| Could you please explain where this number comes from?
| ewwhite wrote:
| Yeah, that's a myth now. It's not current advice.
| nix23 wrote:
| 72% is my rule of thumb for write-heavy production stuff
| (my absolute limit would be 75%), but it depends on record
| sizes, raidz level, whether you have mostly write or mostly
| read workloads, how big your files are, how many snapshots
| you have, whether you have a dedicated ZIL device and much
| more. For a home NAS (movies etc.) you can easily go up to
| 85%... if it's a "~WORM" workload maybe 90%... but resilvering
| can then be a thing of days (weeks?), depending again on the
| raidz level or mirror etc.
|
| >Yeah, that's a myth now. It's not current advice.
|
| It's not, and you know it. Keep it under 72%, believe me, if
| you want a performant ZFS (especially if you delete files
| and have many snapshots... check the YT video linked at the
| end).
|
| >>Keep pool space under 80% utilization to maintain pool
| performance. Currently, pool performance can degrade when
| a pool is very full and file systems are updated
| frequently, such as on a busy mail server. Full pools
| might cause a performance penalty, but no other issues.
| If the primary workload is immutable files (write once,
| never remove), then you can keep a pool in the 95-96%
| utilization range. Keep in mind that even with mostly
| static content in the 95-96% range, write, read, and
| resilvering performance might suffer.
|
| https://web.archive.org/web/20150905142644/http://www.sol
| ari...
|
| And under no circumstances go over 90%:
|
| https://openzfs.github.io/openzfs-
| docs/Performance%20and%20T...
|
| >An introduction to the implementation of ZFS - Kirk
| McKusick
|
| https://www.youtube.com/watch?v=TQe-nnJPNF8
| ilyt wrote:
| The cloud costs really are between "unreasonable" and "very
| unreasonable". The only time it gets cheaper is if the
| workload were so spiky we could've turned most of it off for
| 2/3 of the day, but most of what we have is many smaller
| projects that don't even have enough traffic to need scaling,
| and the big ones, well, can't scale database size down
| off-peak...
|
| Over the last ~6 years we did the "is it worth going to cloud"
| calculation a few times and it was always ridiculously more
| expensive.
| fomine3 wrote:
| Virtualized storage systems increase latency (QD1 IOPS).
| Naively built non-parallel apps tend to rely on QD1 IOPS
| performance, so they run very slowly on a cloud platform,
| compared to a dev machine with direct-attached NVMe.
| LastTrain wrote:
| Agree! And at the end of the day, we are optimizing for cost.
| Although, the EBS portion of that 7k RDS bill is going to be
| tiny, right?
| throwaway71271 wrote:
| Well, it's tiny if you keep it at 2k and make sure you don't
| touch disk much. God forbid someone makes a query that requires
| a temp table.. a query you wouldn't even notice on a bare-metal
| machine brings the whole RDS setup down, can't even read the
| write-ahead log and can't replicate, etc.. it's like watching
| a slow-motion train wreck from 2 queries per second.
| Mave83 wrote:
| If you have to handle multi-million IOPS or files, or GB/s or
| TB/s bandwidth, just use DAOS.io storage. Awesomely fast,
| scale-out and self-healing.
| chewbacha wrote:
| Wouldn't memory allocation still be IO of a different resource?
| We're still getting slowed down reading and writing bits to a
| storage device. Perhaps it's not the hard drive, but the claimed
| blocker here doesn't appear to be the CPU.
| diarrhea wrote:
| That discussion reminds me of Intel Optane. The current
| distinction between hard disks and RAM isn't a necessity
| dictated by some law of nature. Yet it's the box most people
| think within (for good reason).
| umanwizard wrote:
| Reading and writing to main memory is not usually called "IO"
| in this context.
| dahfizz wrote:
| Not really, no. Allocation involves reading/writing to memory,
| but that's not why it's slow. It's slow because allocation
| involves a context switch into the kernel. And an allocation
| algorithm itself isn't trivial.
| zabzonk wrote:
| user space code will not use the kernel and there will be no
| context switch. it will call a function such as malloc, which
| is a user-space function. malloc will then go on to interact
| with the memory allocation sub-system, which may occasionally
| need to ask the os for more memory via a system call, if the
| allocator is any good.
| dahfizz wrote:
| Yeah malloc is user space code, but it will do a syscall
| like sbrk to actually get memory from the kernel.
|
| The default malloc in glibc does not pad the values given
| to sbrk, so you have to do a syscall for every 4k chunk of
| memory (the pagesize). So unless you do lots of very small
| (<<4k) allocations, you call sbrk pretty often.
|
| You will also page fault when you access the new pages, and
| this traps into kernel code again.
|
| So yeah, you are technically correct that some allocations
| may be fast because the memory is already available and
| mapped. Allocations, on average, are still slow because they
| involve context switches into the kernel (potentially
| multiple).
|
| TLDR: you make it sound like a syscall within malloc is
| rare, but many/most allocations will trigger a syscall.
| morelisp wrote:
| It is rare because one of the very first things anyone
| does if they're concerned about allocations is replace
| glibc malloc.
| cb321 wrote:
| Indeed - for that particular problem a big cost is the "IO"
| from main memory "into the CPU cache". Ben is careful to
| qualify it as "disk" IO, but even this is somewhat vague. (as
| is "CPU cache" vs L3/L2/L1 - Ben's interview problem is highly
| L2 sensitive, I think, and increasing variety in that size will
| make results harder to interpret.)
|
| On a modern gen4 NVMe, I routinely get 7 GiB/s. gen5 is
| supposed to double that (as soon as manufacturers get "enough"
| money out of gen4 given PCIe4's extremely short life compared
| to gen3's extremely long one.)
|
| There was a time not long ago (maybe still) where highly scaled
| up, many core (40+) Intel CPUs could not match that getting
| from DIMMs into L3 for just 1 core (as per his interview
| problem). So, we are indeed moving into an era where "IO" from
| the primary persistent device is indeed no worse than IO from
| DIMMs, at least in bandwidth terms. DIMMs still have much
| better latency and the latency-BW ambiguity has been observed
| elsethread.
|
| EDIT: I should clarify, to connect my text with your comment,
| that the real cost of (1-core, uncontended) allocation is also
| more "populating/mapping the cache" with copies, not just
| "allocation" in itself.
| slotrans wrote:
| Article itself is fine but the "conclusion" is loony. You can't
| draw conclusions about big data from toy experiments with small
| data.
| [deleted]
| williamcotton wrote:
| Wouldn't it be nice if we could specify the allocators in GC
| languages? Like, why not expose a bump allocator arena to Python
| with a manual release?
| krapp wrote:
| I don't know if C++ counts as a "GC language" per se, but you
| can write custom allocators for STL containers (eg here:
| https://betterprogramming.pub/c-memory-pool-and-small-
| object...).
| alecco wrote:
| Nonsense. Latency matters. NVMe latency is measured in
| microseconds, while DRAM latency is measured in nanoseconds.
|
| Sequential processing is not that common.
| kragen wrote:
| latency matters but sequential processing is still most
| processing
|
| i hope you are doing well
| osigurdson wrote:
| Compared to L1 cache reference, it certainly still is.
| ihatepython wrote:
| It's true, I/O is no longer the bottleneck.
|
| The bottleneck now is interviewers who think they have the
| knowledge and expertise, but do not. However their authoritative
| position ends up distorting everything, and then said person
| blogs about it and causes even more problems.
| sublimefire wrote:
| OMG what did I miss, please expand. What is being distorted and
| what problems are caused?
| YetAnotherNick wrote:
| Most people don't build their program for NVMe drives, but
| for the cloud. Try benchmarking it in a regular instance on
| AWS/GCP using the default SSD and you will see that what the
| interviewees said is true, as they are likely not building it
| to run a program on "your" laptop.
| dang wrote:
| " _Don 't be snarky._" -
| https://news.ycombinator.com/newsguidelines.html
|
| If you know more than someone else, that's great - but in that
| case please share some of what you know, so the rest of us can
| learn. Just putting another person down doesn't help.
|
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
| dist1ll wrote:
| Holy god what a burn! I absolutely agree though.
| [deleted]
| fancyfredbot wrote:
| NVME storage really is very fast for sequential reads, but I'd
| respectfully suggest that for simple tasks a Dell laptop with
| 1.6GB/s read speed should be bottlenecked by IO if the compute is
| optimised. For example SIMD-json can parse json at over 7GB/s.
| https://simdjson.org/
| kazinator wrote:
| Processing _cache hot_ data has never been the bottleneck.
|
| Running some search on a file on your 486-with-8-megs-of-RAM
| running Linux, where the file was in the operating system's
| cache, was dependent on the performance of the program, and the
| overhead of reading data from the cache through syscalls.
|
| You can't handwave away the performance of the program with the
| argument that it will hide behind I/O, even if that is true for
| a cache-cold run, because cache-hot performance is important:
| people run multiple different kinds of processing passes on the
| same data.
| up2isomorphism wrote:
| The correct way to describe his experiment would be:
|
| Of course I/O is still the slowest part, but it is now fast
| enough that most programmers are unable to fully utilize it.
| lmeyerov wrote:
| I like that the author started with measuring and thinking in
| terms of bandwidth, which makes sense for streaming through a
| big file, so I'd have continued that way towards a different
| design & conclusion.
|
| Continuing with standard python (pydata) and ok hw:
|
| - 1 cheap ssd: 1-2 GB/s
|
| - 8 core (3 GHz) x 8 SIMD: 1-3 TFLOPS?
|
| - 1 pci card: 10+ GB/s
|
| - 1 cheapo GPU: 1-3 TFLOPS?
|
| ($$$: cross-fancy-multi-GPU bandwidth: 1 TB/s)
|
| For streaming workloads like word count, the floating point
| operation (proxy for actual ops) to read ratio is unclear, and
| the above supports 1000:1. Whether the author is reaching the
| roofline on either is a fun detour, so I'll switch to what I'd
| expect of pydata Python.
|
| It's fun to do something like run regexes on logs using cudf
| one-liners (GPU port of pandas) and figure out the bottleneck.
| 1 GB/s sounds low; I'd expect the compute to be more like
| 20GB+/s for in-memory, so they'd need to chain 10+ SSDs to
| achieve that, and there's a good chance the PCI card would
| still be fine. At 2-5x more compute, the PCI card would
| probably become the new bottleneck.
| tuyguntn wrote:
| Haven't gone through the code, but measurement methodology seems
| wrong to me.
|
| > As you can see, the disk I/O in the simple Go version takes
| only 14% of the running time. In the optimized version, we've
| sped up both reading and processing, and the disk I/O takes only
| 7% of the total.
|
| 1. If the claim is that I/O isn't the bottleneck, shouldn't we
| optimize only the processing (and leave the reading alone) to
| have comparable benchmarks?
|
| 2. Imagine the program was running 100 sec (14% I/O), so 14
| seconds are spent on I/O. Now we optimize processing and the
| total time becomes 70 seconds. If I/O wasn't a bottleneck and
| we haven't optimized I/O, disk I/O should become 20% of total
| execution time, not 7%.
|
| Disk I/O:
|
| > Go simple (0.499), Go optimized (0.154)
|
| clearly, I/O access was optimized 3x and total execution was
| optimized 1.6x. This is not a good measurement methodology for
| claiming that I/O is not a bottleneck.
|
| I agree though things are getting faster.
| andrew-ld wrote:
| Honestly I find this article too vague. If you process large
| amounts of data you rarely do so with orderly sequential reads
| and writes; even databases optimized for fast disks (see
| RocksDB) have the disk as a bottleneck, even on the most
| recently developed hardware.
| austin-cheney wrote:
| I encountered this myself yesterday when attempting to
| performance test WebSockets in JavaScript:
| https://github.com/prettydiff/share-file-systems/blob/master...
|
| The parsing challenge is complex enough that it will always be
| faster to extract the data from the network than it is to process
| it. As a result, excess data must be stored until it can be
| evaluated or else it must be dropped; therefore the primary
| processing limitation is memory access, not CPU speed executing
| instructions. JavaScript is a garbage collected language, so you
| are at the mercy of the language, and it doesn't really matter
| how you write the code: if the message input frequency is high
| enough and the messages are large enough, memory will always be
| the bottleneck, not the network or the application code.
|
| In terms of numbers this is provable. When testing WebSocket
| performance on my old desktop with DDR3 memory I was sending
| messages (without a queue or any kind of safety consideration) at
| about 180,000 messages per second. In my laptop with DDR4 memory
| the same test indicated a message send speed at about 420,000
| messages per second. The CPU in the old desktop is faster and
| more powerful than the CPU in the laptop.
| prvt wrote:
| "sorting with O(n^2) is no longer a bottleneck as we have fast
| processors" /s
| itissid wrote:
| That makes no sense. The brutal math of a polynomial is always
| going to be noticeably worse than sub-polynomial running
| times.
| vitiral wrote:
| Often but not always. Cool trick: any bounded limit is always
| O(1)!
|
| Pick a small enough bound and an O(n^2) algorithm behaves
| better than an O(n log n) one. This is why insertion sort is
| used for sub-arrays shorter than ~64 elements, for example.
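|
| As a toy illustration (a minimal merge sort that hands small
| sub-arrays to insertion sort; the cutoff value here is
| illustrative, not tuned -- CPython's timsort does the same kind
| of thing with small runs):
|
|     import heapq
|
|     CUTOFF = 64  # below this length the O(n^2) sort wins on constants
|
|     def insertion_sort(a, lo, hi):
|         # Sort a[lo:hi] in place; tiny overhead, great cache behaviour.
|         for i in range(lo + 1, hi):
|             x = a[i]
|             j = i - 1
|             while j >= lo and a[j] > x:
|                 a[j + 1] = a[j]
|                 j -= 1
|             a[j + 1] = x
|
|     def hybrid_sort(a, lo=0, hi=None):
|         if hi is None:
|             hi = len(a)
|         if hi - lo <= CUTOFF:
|             insertion_sort(a, lo, hi)
|             return
|         mid = (lo + hi) // 2
|         hybrid_sort(a, lo, mid)
|         hybrid_sort(a, mid, hi)
|         # Merge the two sorted halves back into place.
|         a[lo:hi] = list(heapq.merge(a[lo:mid], a[mid:hi]))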
| mikesabbagh wrote:
| Usually we mainly run jobs on NFS or similar disks, where I/O
| time would be more significant. Would be nice to run those
| tests on AWS.
| dehrmann wrote:
| I'm not surprised. I've seen bulk MySQL reads in Python be CPU-
| bound. The interesting followup was that parallelizing reads in
| subprocesses wasn't going to help much because I'd get killed by
| CPU again when serializing/deserializing between processes. I was
| capped at a ~2x speedup.
| inglor_cz wrote:
| Hmm. I just fought with a MariaDB installation that, when set to
| write immediately to disk, became rather slow. 7-8 INSERTs into
| the DB could easily take 3 seconds; unfortunately the internal
| logic of the system didn't really lend itself to one INSERT of
| multiple rows.
|
| Once I reconfigured innodb_flush_log_at_trx_commit to 2, the UI
| started being lightning fast.
| Dunedan wrote:
| > I haven't shown an optimized Python version because it's hard
| to optimize Python much further! (I got the time down from 8.4 to
| 7.5 seconds). It's as fast as it is because the core operations
| are happening in C code - that's why it so often doesn't matter
| that "Python is slow".
|
| An obvious optimization would be to utilize all available CPU
| cores by using the MapReduce pattern with multiple threads.
|
| I believe that'd be necessary for a fair conclusion anyway, as
| you can't claim that I/O isn't the bottleneck, without utilizing
| all of the available CPU and memory resources.
| masklinn wrote:
| A more obvious optimisation (to me) would be to leverage the
| native functions and avoid creating a list of ~80 million
| strings in memory.
|
| On my machine, the base script pretty reliably takes ~10s:
|
|     Reading   : 0.1935129165649414
|     Processing: 9.955206871032715
|     Sorting   : 0.0067043304443359375
|     Outputting: 0.01335597038269043
|     TOTAL     : 10.168780088424683
|
| Switching content to a no-op (`content = sys.stdin`) and
| feeding `Counter` from a native iterators pipeline:
|
|     counts = collections.Counter(
|         chain.from_iterable(map(str.split, map(str.lower, content))))
|
| is a pretty reliable 10% gain:
|
|     Reading   : 1.1920928955078125e-06
|     Processing: 8.863707780838013
|     Sorting   : 0.004117012023925781
|     Outputting: 0.012418985366821289
|     TOTAL     : 8.880244970321655
|
| As far as I can tell, the bottleneck is about half the
| preprocessing (lowercasing and splitting) and half filling the
| Counter.
|
| You won't get a 10x gain out of that though.
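|
| For completeness, a minimal self-contained version of that
| pipeline (a sketch, not the article's script; run it as
| `python3 countwords.py < kjvbible_x100.txt`):
|
|     import collections
|     import sys
|     from itertools import chain
|
|     # Stream stdin line by line instead of materializing one big
|     # list of words; str.lower/str.split run in C, and Counter
|     # does the hash-table counting.
|     counts = collections.Counter(
|         chain.from_iterable(map(str.split, map(str.lower, sys.stdin))))
|
|     # Print unique words ordered by descending count.
|     for word, count in counts.most_common():
|         print(word, count)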
| plonk wrote:
| > An obvious optimization would be to utilize all available CPU
| cores by using the MapReduce pattern with multiple threads.
|
| Nope, the GIL will make that useless. You need to actually
| implement the tight loops in C/C++ and call that with batches
| of data to get benefits from threading.
|
| An obvious, but more expensive optimization would be to use a
| process pool. Make sure that all the objects you pass around
| are serializable.
|
| Python makes optimization much harder than it should be. I hope
| the GIL gets the hammer at some point, but that seems to be a
| huge task.
| Waterluvian wrote:
| You can use Processpool but at that point you're way into re-
| architecting.
| plonk wrote:
| Not much more than the switch to MapReduce and threads.
| Actually the interface is exactly the same if you use
| executors from concurrent.futures.
| znpy wrote:
| inter-process communication has its own overhead though.
| plonk wrote:
| You can usually counteract that by sending large enough
| batches to the processes.
| Dunedan wrote:
| > Nope, the GIL will make that useless.
|
| In Python yes. I missed that. The Go implementation would
| still benefit from multiple threads, wouldn't it?
| plonk wrote:
| Yes I think so.
| xmcqdpt2 wrote:
| For this problem the multiple-process version would be quite
| simple in Python or any other language. It's a classic single
| program, multiple data (SPMD) task. You split the file into N
| chunks, then run N copies of the original program on them (a
| Map). You then need to collate the results, which requires a
| second program, but that step is similar to the sorting step
| in the original and so would be negligible wrt wall time (a
| quick Reduce).
|
| For large files you should get almost embarrassing
| parallelism.
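|
| A minimal sketch of that split/collate idea with the stdlib
| (the file name is hypothetical, and the chunking just deals
| lines round-robin since order doesn't matter for counting):
|
|     import collections
|     import multiprocessing
|
|     def count_chunk(lines):
|         # Map step: count lowercased words in one chunk of lines.
|         c = collections.Counter()
|         for line in lines:
|             c.update(line.lower().split())
|         return c
|
|     if __name__ == "__main__":
|         with open("kjvbible_x100.txt", encoding="utf-8") as f:
|             lines = f.readlines()
|         n = multiprocessing.cpu_count()
|         chunks = [lines[i::n] for i in range(n)]
|         with multiprocessing.Pool(n) as pool:
|             partials = pool.map(count_chunk, chunks)
|         # Reduce step: merge the per-chunk counters.
|         total = sum(partials, collections.Counter())
|         for word, count in total.most_common():
|             print(word, count)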
| jvanderbot wrote:
| Oh I think a few simd instructions could reduce processing
| to near zero without going crazy with multi-threaded
| architectures.
|
| Remember that fizzbuzz on HN that hit GB/s? Mostly SIMD.
| Zero multi-threaded IIRC.
| ShredKazoo wrote:
| The GIL won't prevent you from parallelizing I/O will it?
| plonk wrote:
| If I/O was the bottleneck, parallelizing it won't help,
| your SSD/network link/database won't get magically faster.
|
| If I/O wasn't the bottleneck, I guess you can parallelize
| reading, but what are you gaining?
|
| If you're writing to files, most of the time the parallelism
| will be hard to implement correctly. SQLite doesn't support
| parallel writes, for example.
| ShredKazoo wrote:
| >If I/O was the bottleneck, parallelizing it won't help,
| your SSD/network link/database won't get magically
| faster.
|
| I think your SSD/network link/database might be able to
| work in parallel even when Python can't. Details:
|
| Suppose I am scraping a website using a breadth-first
| approach. I have a long queue of pages to scrape. A
| single-threaded scraper looks like: pop the next page in
| the queue, block until the web server returns that page,
| repeat. A multi-threaded scraper looks like: thread wakes
| up, pops the next page in the queue, sleeps until the web
| server returns that page, repeat. With the multi-threaded
| scraper I can initiate additional downloads while the
| thread sleeps.
|
| My assumption here is that the download over the network
| is at some level being performed by making a system call
| (how could it not be?) And once you have multiple system
| calls going, they can be as parallel as the OS permits
| them to be; the OS doesn't have to worry about the GIL.
| And also the server should be able to serve requests in
| parallel (assuming for the sake of argument that the
| server doesn't suffer from the GIL).
|
| Same essential argument applies to the database. Suppose
| I'm communicating with the database using IPC. The
| database isn't written in Python and doesn't suffer from
| the GIL. Multiple Python threads can be sleeping on the
| database while the database processes their requests,
| possibly in parallel if the db supports that.
|
| I think this argument could even work for the SSD if the
| kernel is able to batch your requests in a way that takes
| advantage of the hardware, according to this person:
| https://news.ycombinator.com/item?id=33752411
|
| Very curious to hear your thoughts here. Essentially my
| argument is that the SSD/network link/database could be a
| "bottleneck" in terms of latency without being the
| bottleneck in terms of throughput (i.e. it has unused
| parallel capacity even though it's operating at maximum
| speed).
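|
| A quick way to see this in Python, as a sketch: blocking
| socket I/O releases the GIL, so a thread pool can overlap
| many requests even though only one thread runs Python
| bytecode at a time (the URL is just a placeholder):
|
|     from concurrent.futures import ThreadPoolExecutor
|     from urllib.request import urlopen
|
|     # Hypothetical work queue of pages to fetch.
|     urls = ["https://example.com/"] * 20
|
|     def fetch(url):
|         # urlopen blocks in OS/network code, releasing the GIL,
|         # so other threads can issue their own requests meanwhile.
|         with urlopen(url, timeout=10) as resp:
|             return url, len(resp.read())
|
|     with ThreadPoolExecutor(max_workers=16) as pool:
|         for url, size in pool.map(fetch, urls):
|             print(url, size)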
| plonk wrote:
| You're right, my comment only applies when bandwidth is
| the bottleneck. In Python, that web parser could probably
| do even better with asyncio in a single OS thread.
| srcreigh wrote:
| Haven't tried it, but SQLite supports some type of
| concurrent modification now
|
| https://www.sqlite.org/cgi/src/doc/begin-
| concurrent/doc/begi...
| ilyt wrote:
| > If I/O was the bottleneck, parallelizing it won't help,
| your SSD/network link/database won't get magically
| faster.
|
| Of course it will. Nearly every serious DB will work on
| multiple requests in parallel, and unless the DB itself
| is on something slow you will get data faster from 2
| parallel requests than from serializing them.
|
| NVMe SSDs in particular can easily exceed what a single
| thread can read; just run _fio_ with single vs parallel
| threads to see that.
|
| > If you're writing to files, most of the time the
| parallelism will be hard to implement correctly. SQLite
| doesn't support parallel writes for example.
|
| That's just one random example. If all you do is "read
| data, parse, write data" in some batch job you can have
| massive parallelism. Sharding is also an easy way to fill
| up the I/O.
| superjan wrote:
| That is of course an easy solution but I would argue that this
| is just throwing more resources at the problem. Not a very
| impressive optimization.
|
| Emery Berger has a great talk [1] where he argues that it is
| mostly pointless to optimize Python code; if your program is
| slow, you should look for a properly optimized library to do
| that work for you.
|
| 1: https://www.youtube.com/watch?v=vVUnCXKuNOg
| Dunedan wrote:
| > That is of course an easy solution but I would argue that
| this is just throwing more resources at the problem. Not a
| very impressive optimization.
|
| You could say the same about the existing implementation as
| that reads the whole file into memory instead of processing
| it in chunks.
| ummonk wrote:
| Or better yet, write a compute shader since hashing is an
| embarrassingly parallel operation.
|
| That said, the OP's article is correct in that straightforward
| idiomatic implementations of this algorithm are very much
| compute bound. The corollary is that eng work put into
| optimizing compute usage often won't be wasted for programs
| processing disk data (or even network data with modern 10Gb
| fiber connections).
| [deleted]
| mips_avatar wrote:
| I/O is sometimes the bottleneck. On Windows, anything involving
| lots of small file operations bottlenecks on NTFS and Defender.
| That makes some applications like git, which run beautifully on
| Linux, need to take countermeasures to operate well on Windows.
| Dunedan wrote:
| As the algorithm used in the example is straight-forward I
| figured that using UNIX command line tools might be an even
| simpler way to implement it. Here is what I came up with:
|
|     time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" \
|       | sort --buffer-size=50M | uniq -c | sort -hr > /dev/null
|
| On my machine this turned out to be ~5 times slower than the
| provided Python implementation. Nearly all of the time is spent
| in the first invocation of `sort`. Further increasing the buffer
| size doesn't make a significant difference. I also played around
| with the number of threads `sort` uses, but didn't see any
| improvement there either.
|
| I'm quite puzzled why `sort` is so much slower, especially as it
| does sorting in parallel utilizing multiple CPU cores, while the
| Python implementation is single-threaded.
|
| Does somebody have an explanation for that?
| zackmorris wrote:
| Dangit, I'm supposed to be doing yardwork today so you hijacked
| my procrastination motivation haha!
|
| Edit: I had no idea that awk was so fast, and I suspect that
| only parallelization would beat it. but I agree with the others
| that the main bottleneck is the `sort | uniq` for results1.txt
|
|     # https://stackoverflow.com/a/27986512        # count word occurrences
|     # https://unix.stackexchange.com/a/205854     # trim surrounding whitespace
|     # https://linuxhint.com/awk_trim_whitespace/  # trim leading or trailing whitespace
|
|     time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" \
|       | sort --buffer-size=50M | uniq -c | sort -hr > results1.txt
|     real 0m13.852s  user 0m13.836s  sys 0m0.229s
|
|     time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" \
|       | awk '{count[$1]++} END {for (word in count) print count[word], word}' \
|       | sort -hr > results2.txt
|     real 0m1.425s   user 0m2.243s   sys 0m0.061s
|
|     diff results1.txt results2.txt
|     109,39133c109,39133
|     # many whitespace differences due to how `uniq -c` left-pads
|     # first column with space
|
|     diff <(cat results1.txt | awk '{$1=$1};1') \
|          <(cat results2.txt | awk '{$1=$1};1')
|     # bash-only due to <() inline file, no differences after
|     # trimming surrounding whitespace
|
|     cat results1.txt | awk '{ sub(/^[ \t]+/, ""); print }' | diff - results2.txt
|     # sh-compatible, no differences after trimming leading
|     # whitespace of results1.txt
|
|     # 13.836 / 2.243 = ~6x speedup with awk
| mastax wrote:
| `sort | uniq` is really slow for this, as it has to sort the
| entire input first. I use `huniq` which is way faster for this.
| I'm sure there are many similar options.
|
| https://github.com/koraa/huniq
| benhoyt wrote:
| It's because that first invocation of sort is sorting the
| entire input (413MB), not just the unique words (less than a
| MB). The sort is probably O(NlogN), but that's a big N.
| Counting by inserting into a hash table is much faster, at
| O(N).
| aljarry wrote:
| It looks like you're sorting the whole file, while python
| implementation sorts only unique values.
| buybackoff wrote:
| Having dealt with several data parsers I would like to somehow
| estimate how much electricity is burned globally just for lazy
| implementations. E.g. in .NET non-devirtualized `Stream.ReadByte`
| is often one of the hottest methods in a profiler. It and related
| methods can easily be responsible for a double-digit share of CPU
| time when processing data. I mean, it's not I/O, just pure
| overhead that disappears with custom buffering, where reading a
| single byte is as cheap as it should be.
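|
| The same kind of overhead is easy to see in Python too (an
| analogy, not .NET; the file path is hypothetical): reading one
| byte per call versus reading through a large buffer.
|
|     import time
|
|     PATH = "data.bin"  # hypothetical large input file
|
|     def read_byte_at_a_time(path):
|         n = 0
|         # buffering=0: every read(1) is a real syscall plus call overhead
|         with open(path, "rb", buffering=0) as f:
|             while f.read(1):
|                 n += 1
|         return n
|
|     def read_buffered(path, chunk=1 << 20):
|         n = 0
|         # default buffering: a handful of large reads
|         with open(path, "rb") as f:
|             while True:
|                 data = f.read(chunk)
|                 if not data:
|                     break
|                 n += len(data)
|         return n
|
|     for fn in (read_byte_at_a_time, read_buffered):
|         t = time.perf_counter()
|         total = fn(PATH)
|         print(fn.__name__, total, "bytes,",
|               round(time.perf_counter() - t, 3), "s")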
| lotik wrote:
| Can we just give it more resources? /s
| sriku wrote:
| On a related note, John Ousterhout (in the RAMCloud project) was
| basically betting that the latency of accessing RAM on another
| computer on a fast local network will eventually become
| competitive to local RAM access.
|
| https://ramcloud.atlassian.net/wiki/spaces/RAM/overview
| stabbles wrote:
| The title I/O is _no longer_ the bottleneck seems to suggest disk
| speed has caught up, while in reality the slowness is due to poor
| implementation (slow Python or Go with lots of allocations).
|
| The real problem to me is that languages are too high-level and
| hide temporary allocations too much. If you had to write this
| in C, you would naturally avoid unnecessary allocations, because
| alloc/free in the hot loop looks bad.
|
| Presumably soon enough it's very unlikely you find any new word
| (actually it's 10 passes over the same text) and most keys exist
| in the hashmap, so it would be doing a lookup and incrementing a
| counter, which should not require allocations.
|
| Edit: OK, I've run OP's optimized C version [1] and indeed, it
| only hits 270MB/s. So, OP's point remains valid. Perf tells me
| that 23% of all cache refs are misses, so I wonder if it can be
| optimized to group counters of common words together.
|
| [1] https://benhoyt.com/writings/count-words/
| davedx wrote:
| Yeah I've worked on non trivial javascript codebases where
| allocations and gc mattered.
| christophilus wrote:
| I'd like to hear more about this.
| mceachen wrote:
| Not your OP, but I've got stories!
|
| PhotoStructure is a non-trivial app written in TypeScript
| for both frontend and backend tasks.
|
| Coming from Scala, I initially used a lot of Array.map,
| Array.foreach, and, to handle things like Some/None/Either:
|
| function map<T, U>(maybe: T | undefined, f: (t: T) => U): U | undefined
|
| According to the V8 memory inspector, all those tiny fat-
| arrow functions can hammer the GC (presumably due to stack
| allocations and pulling in local context). Replacing large
| array iteration with for loops was a big win.
|
| Also, handling large files with a streaming parser when
| possible, instead of reading them entirely into memory,
| another win.
|
| Buffer concatenation may be faster by pushing read chunks
| onto an array, and joining the lot at the end, rather than
| incrementally appending with .concat.
|
| When memoizing functions, if they themselves return
| functions, watch out for fat-arrows if you don't need the
| caller's context (which may prevent GC due to unexpectedly
| retained variables).
|
| But the first step should always be to profile your app.
| You can't assert improvement without a baseline.
| imtringued wrote:
| The file is only being read once according to the blog and then
| it is in memory. This isn't an I/O intensive application at
| all.
|
| I have written a small script in Python that does something
| similar. I have a word list with 1000 words and I check for the
| presence of the words. Here is the thing: for every word I go
| through the entire file. So let's say I scan the file one
| thousand times. In fact I did something more complicated and
| ended up going 6000 times over the original file and yet it
| still took only three seconds. If all these scans had to reread
| the file it would take forever.
| kragen wrote:
| he has figures for both cached and uncached performance
|
| if all your scans had to reread the file from nvme it would
| take five times as long if we extrapolate from those figures
|
| not forever
| nunez wrote:
| the optimized version does it byte by byte
| Severian wrote:
| This sounds like a very inefficient method. The Python stdlib
| module 'collections' includes a Counter class which can do
| this for you. Then all you need to do is copy the dict
| key/value pairs that match your word list into a new one.
| adverbly wrote:
| Of course disk speed has caught up. Single core clock speed
| hasn't changed in over 10 years now? Disks now use SSDs and have
| been chugging along with improvements.
|
| I'm curious what a multi-thread (multi-process for Python due to
| the GIL?) comparison would be. Obviously people aren't doing this
| by default though, so the author's point still stands.
| hn_go_brrrrr wrote:
| It's not just about clock speed. CPI has been steadily
| improving across processor generations, even when the clock
| speed doesn't meaningfully change.
| atty wrote:
| Sorry do you mean IPC (instructions per clock) or is there
| another term that I'm not aware of?
| vlovich123 wrote:
| Cycles per instruction.
| wtallis wrote:
| "Cycles per instruction" sounds more like a measurement
| of instruction latency rather than instruction
| throughput, and I don't think instruction latency has
| improved as much in the past decade as instruction
| throughput: the most significant microarchitecture trend
| has been toward wider CPUs with more execution ports,
| wider instruction decoders, and larger reorder buffers.
| chronial wrote:
| > Single core clock speed hasn't changed in over 10 years
| now?
|
| This is not true. I thought the same thing and you are right
| in regards to base line clock speed. But the performance is
| still increasing. I just got a new PC at work with a 2021
| high-end CPU and it has 200% single core performance of the
| 2015 mid-range CPU in my private PC:
|
| https://www.cpubenchmark.net/compare/4597vs2599/Intel-i9-129.
| ..
| antisthenes wrote:
| Are you simply glossing over the fact that CPU 1 has a
| turbo of 5.2 Ghz and the other 3.6 Ghz ?
| glogla wrote:
| Also, the old one is 65W TDP and the new one is 241W
| TDP.
| chronial wrote:
| Yes, I did not go into details where the differences are
| coming from. The base clockspeeds are the same, which is
| obviously the number that GP noticed not changing over
| the last 10 years.
|
| Other things changed. The turbo clockspeed is one of
| them.
| 0x457 wrote:
| Well, CPU got a lot better at benchmarks, that is true.
| Caches got bigger, predictions got better. Specialized
| instructions were added. IPC improvement kinda slowed down
| after Sandy Bridge, at least for Intel.
|
| Also, the comment you're quoting is talking about clock
| speed, and the link you provided literally shows the same
| base clock speed - 3.2 GHz. Intel progressively pushed the
| turbo speed higher, but that's a speed you could have
| achieved yourself by overclocking.
| chronial wrote:
| Does the CPU constantly hold the turbo speed under a
| single threaded workload?
| 0x457 wrote:
| It depends on motherboard and cooling. 6700K, for
| example, is constantly running at 4.2Ghz or 4.5Ghz
| (winter clocks). Constantly while thermals allow it...
| Non-overclocking motherboards allow it to boost for 2
| minutes, after that, it's iffy.
| eulers_secret wrote:
| Depends. The cpu attempts to hold the maximum possible
| speed on any cores that are in use.
|
| On my water cooled and specifically tweaked desktop- yes.
| It'll hold max boost indefinitely, even with all threads.
| (getting to about 80c after 10 mins). Single-thread max
| is faster, and it'll hold that as well.
|
| My laptop will pull power within 15 seconds and be down
| to base clocks in a couple mins. Unless I set it down
| outside and it's very cold.
| PartiallyTyped wrote:
| > Single core clock speed hasn't changed in over 10 years
| now?
|
| Technically we have gone from a 3.3-3.9 GHz base frequency,
| 4.1-4.2 GHz single-core boosts, and 4.7 GHz with a heavy OC in
| Haswell to
|
| 4.3 and 3.0 GHz base clocks for efficiency/perf cores, with
| boosts to 5.4-ish, a stable 5.8 GHz all-core frequency with a
| heavy OC, and even pushing 6.something GHz single-core with a
| good setup.
|
| But then again, this is on the more extreme ends. Mainstream
| laptops have remained relatively stable.
|
| re Python MP:
|
| On my 12700k, I saw almost linear scaling (per core, not
| thread) with multiprocessing when loading stuff from the disk
| for ML applications and processing. This was with the
| datasets library.
| bluedino wrote:
| Don't forget about naive implementations in languages like C++
| where using cin/cout can cause your program to run slower than
| Python when doing file IO
| xyzzy_plugh wrote:
| I don't know if I can agree. Python is certainly problematic
| for the reasons you outlined, but Go is much closer to C in
| terms of performance. Go is also compiled, and as such you have
| similar opportunity to optimize and measure allocs. In my
| experience it's much harder to tune Python, and Go is often
| _easier_ to tune than C due to the great ergonomics of the
| compiler. You can also drop down to assembly!
|
| But really, I disagree because I've frequently saturated
| massive IOPS. I/O is still the bottleneck. The article pretty
| much immediately excludes network I/O, which is in many cases
| more common than disk I/O. Even so, tiny single-threaded
| programs reading words one-at-a-time are obviously not going to
| be I/O constrained with modern disks. For these types of
| programs, I/O hasn't been a bottleneck in a long, long time,
| and I'd actually be surprised to hear candidates suggest
| otherwise.
| exabrial wrote:
| Let's call spades spades: C spanks Go in every measure. Go is
| nowhere near C speed.
|
| It's certainly more performant than any dynamically typed
| scripting language: JavaScript, Python, Ruby, etc but it's
| probably closer to C#.
| kragen wrote:
| yeah, a friend of mine wrote a logfile parser in 01996 in c,
| and when i rewrote it in perl (also in 01996) it was about
| one tenth the size and ten times slower
|
| whether i/o is the bottleneck depends on what you're doing
| and on which computer, and that's been true for at least 50
| years
| computator wrote:
| May I ask what's the rationale for writing 01996 rather
| than 1996? I've seen this before but I haven't seen an
| explanation of the advantage of it.
| mcculley wrote:
| Y-10K awareness: https://longnow.org/ideas/long-now-
| years-five-digit-dates-an...
| bqmjjx0kac wrote:
| But it looks like an octal literal.
| withinboredom wrote:
| Haha, I thought it was some architecture I've never heard
| of.
| sidlls wrote:
| Go's problem is that _in practice_ there are interfaces (fat
| pointers) everywhere. I'm about to start optimizing one of
| the services my team owns and my first step is going to be
| looking for all the bad performance issues that comes out of
| this.
| dahfizz wrote:
| I mean, Go is closer to C than Python is. But Go is still a
| garbage collected language with a runtime. It's not going to
| come close to the limits of performance that C offers
| morelisp wrote:
| Having played with this a previous time it was posted and
| looking at the profiler results, the main difference
| between the Go and C isn't due to any memory allocations
| but rather the lack of in-place update and slightly higher
| cost from being general-purpose in Go's hash table. (This
| is also probably why Rust, without the GC overhead,
| clusters with Go rather than C.)
| sidlls wrote:
| I'm fairly certain that if one wrote C (or C++) programs
| with the same safety targets that Rust has out-of-the-box
| the performance is close, if not (for practical purposes)
| identical.
|
| Of course, to do much that is useful (and performant) in
| Rust one often has to break out `unsafe`, which eliminates
| _some_ of the out-of-the-box safety guarantees--and in some
| cases makes one wonder if it's worth all the overhead
| instead of just using C or C++.
| morelisp wrote:
| > the same safety targets
|
| Rust's selling point is that the safety targets' costs
| are dev/compile-time ones. There should not be a
| difference unless the C/C++ code requires some IB or
| extremely manual memory management trickery, which it
| doesn't; and Go offers basically the same memory safety
| guarantees as Rust in this regard and is (slightly)
| faster.
|
| In this case it's really almost entirely about the speed
| of the hash table.
| sidlls wrote:
| I was referring to the overhead of development time. Rust
| is more complex than C, at least, and even after having
| gained proficiency coding in Rust still takes more time
| than writing an equivalent C program.
|
| And "zero-cost" is misleading. There are definitely
| performance impacts from the implicit (and unadvertised
| explicit) bounds checking some of Rust's features come
| with. Writing a C program to an equivalent level of
| safety would have similar performance impacts. Hence, for
| as close to the same safety as possible, Rust and C
| should be almost identical in terms of performance.
| k__ wrote:
| That's probably the reason why Go got many people switching
| from their scripting languages and didn't end up as the C
| killer it was sold as in the beginning.
| taf2 wrote:
| I never remember it being sold as a C killer maybe as a
| Java killer?
| dagw wrote:
| It was more sold as a C++ killer than a C killer.
| capableweb wrote:
| golang.org circa 2010:
|
| > go is ... fast
|
| > Go compilers produce fast code fast. Typical builds
| take a fraction of a second yet the resulting programs
| run nearly as quickly as comparable C or C++ code.
|
| https://web.archive.org/web/20100217123645/http://golang.
| org...
|
| That seems to me like they were trying to say "If you
| want C/C++ performance but nicer/easier syntax, you can
| use Go", which turned out to be not that true in the end.
|
| Edit: the old "Language Design FAQ" also goes into further
| detail on how they envisioned (the first version of) the
| language: https://web.archive.org/web/20100211104313/http
| ://golang.org...
| [deleted]
| forty wrote:
| I believe Go was created to replace python at Google.
| habibur wrote:
| It was sold as C killer in the beginning.
|
| And then a few years later there was an article that
| said the Go engineers were surprised when they saw that
| C/C++ coders weren't switching to Go; rather, Python/Ruby
| coders were "upgrading" to Go.
| pdimitar wrote:
| That's exactly my reason to use it: fast prototyping and
| ability to replace bash scripts (as they become too
| unwieldy to maintain and expand from one point and on).
|
| I've also successfully made an MVP with Golang which I
| then proceeded to rewrite in Rust in almost only one go
| and almost without blockers along the way.
|
| Golang is pretty good but it still lacks important things
| like algebraic data types, and they're hugely important
| for fearless refactoring and correctness.
| eloff wrote:
| For a script like this you'd be surprised. If you don't do
| too much allocation, the garbage collector need not run. In
| fact it's common to disable it at the start. Now you can
| allocate much faster than C because you never clean up
| after yourself!
|
| A real-world example is esbuild: the author implemented it
| in both Rust and Go initially. The Go version was faster and
| the code simpler, which is why it's implemented in Go.
| leethaxor wrote:
| > A real world example is esbuild, the author implemented
| it both Rust and Go initially. The Go version was faster
| and the code simpler. Which is why it's implemented in
| Go.
|
| But why is swc faster than esbuild then? The code isn't
| even considerably more complex.
| tail_exchange wrote:
| You can find the comments from the ESBuild author here on
| HN:
|
| https://news.ycombinator.com/item?id=22336284
| leethaxor wrote:
| Rust compiler is much faster today than it was in 2020,
| btw.
|
| > This is a side project and it has to be fun for me to
| work on it.
|
| I respect this 100% - but then we shouldn't assume Go is
| better than Rust just based on that esbuild used it
| instead of Rust.
| eloff wrote:
| Because different programs, implemented differently, run
| at different speeds...
|
| I'm saying the performance of Go can sometimes be
| surprisingly fast. Not that it's magic.
| leethaxor wrote:
| So you're saying they wrote exactly the same program in
| Go and Rust for the comparison, changing the syntax only?
| Well then it's no surprise the Go version was faster.
|
| Don't write Rust as if it was Go. That doesn't say
| anything meaningful about either Go or Rust.
| eloff wrote:
| Actually he wrote the Rust version first, so you're wrong
| jumping to conclusions.
|
| I'm not trying to say Go is faster than Rust, it's
| usually slower. But there are always exceptions to the
| rule. The Go code, on the other hand, is usually simpler
| and quicker to write. For that reason I'd prefer Go if
| the problem lends itself to a garbage collected language.
| leethaxor wrote:
| So what did you mean by this?
|
| > Because different programs, implemented differently,
| run at different speeds...
|
| We're talking about two programs with exactly the same
| purpose - ingest TypeScript and output JavaScript. It's a
| pretty clear-cut comparison, IMHO.
|
| > The Go code, on the other hand, is usually simpler and
| quicker to write
|
| I'm writing Go code at work, and Rust code mostly for fun
| (but used it at work too). I'd say this has changed
| significantly in the last 2 years. Now with rust-analyzer
| and much improved compiler output, writing Rust is very
| simple and quick too. I guess getting into Rust can be a
| little harder if you've only ever used GCed languages
| before, but it's not _that_ hard to learn - and once you
| do it 's super-effective. And the type inference of Rust
| is a huge reason why I'm using it - while Go has none.
|
| Another thing to consider - usually the code in Go is
| much more about writing algorithms yourself instead of
| using library functionality (this is changing slowly
| thanks to the new support of generics but most code
| hasn't caught up yet and there aren't good libs using it
| so far). The resulting code in Go can be convoluted a lot
| and contain very hidden bugs. People also usually don't
| bother implementing a proper search/sorting algorithm for
| the sake of simplicity/speed of development - which you'd
| get automatically if you used a library function - so the
| code is less efficient. My Go code is usually 2-3x longer
| than the equivalent in TypeScript or Rust.
|
| Go is great, I like it. Rust is great too. I recommend
| you to do what the esbuild author did - test it and
| choose for yourself, don't bother too much about others'
| opinion.
| eloff wrote:
| > We're talking about two programs with exactly the same
| purpose - ingest TypeScript and output JavaScript. It's a
| pretty clear-cut comparison, IMHO.
|
| There are an infinite number of ways to design two
| programs for that task, with different trade-offs. You
| can't draw conclusions about which language is faster
| based on two different implementations by different
| people.
|
| > Go is great, I like it. Rust is great too. I recommend
| you to do what the esbuild author did - test it and
| choose for yourself, don't bother too much about others'
| opinion.
|
| I'm actually writing Rust code the last two years. It's
| been a while since I've used Go. But I'd rather use Go if
| the problem allows for a garbage collector. It's just
| simpler than managing it manually in Rust with the borrow
| checker and its rules. This is my opinion, nobody else's.
| stavros wrote:
| The GP didn't say that, maybe they wrote Go as if it were
| Rust, for all we know.
| cma wrote:
| You could LD_PRELOAD a free that doesn't do anything in C
| too or a macro define to really make it no cost.
| eloff wrote:
| You'd need to change malloc too so it doesn't do extra
| work. You could do it. Go obviously isn't going to be
| faster than well written C. But sometimes you can be
| surprised.
| morelisp wrote:
| The Go code is reasonably efficient at avoiding allocations,
| though it's hard to avoid some. Without thinking too hard, it's
| also going to be hard to avoid those same ones in C, and some
| possible improvements would apply to both (e.g. defining a
| maximum word length).
|
| "Lowercase word count" is a surprisingly difficult case in this
| regard, because you need to check and potentially transform
| each character individually, and also store a normalized form
| of each word. Probably some smart SIMD lowercase function could
| help here but I don't think any language is going to offer that
| out of the box. It's also defined in a way I think detaches a
| bit much from real-world issues - it's handling arbitrary bytes
| but also only ASCII. If it had to handle UTF-8 it would be very
| different; but also if it could make assumptions that only a
| few control characters were relevant.
| dahfizz wrote:
| > Probably some smart SIMD lowercase function could help here
| but I don't think any language is going to offer that out of
| the box.
|
| No, but at least you have direct access to the intrinsics in
| C. To get vectorization in Go, you have to implement it in C
| and link that into your Go program.
| morelisp wrote:
| > To get vectorization in Go, you have to implement it in C
| and link that into your Go program.
|
| Go has an assembler which does not require implementing
| anything in C (though the assembler uses a more C-style
| syntax), nor critically does it require using CGo linkage.
| It's used to implement many hot paths in the stdlib.
| throwaway894345 wrote:
| Typically you would just implement vectorization in
| assembly in the Go case.
| stabbles wrote:
| > but I don't think any language is going to offer that out
| of the box.
|
| That's what compilers are for. I tried to improve the C
| version to make it friendlier to the compiler. Clang does a
| decent job:
|
| https://godbolt.org/z/o35edavPn
|
| I'm getting 1.325s (321MB/s) instead of 1.506s (282MB/s) on a
| 100 concatenated bibles. That's still not a 10x improvement
| though; the problem is cache locality in the hash map.
| cb321 wrote:
| Note: Just concatenating the bibles keeps your hash map
| artificially small (EDIT: relative to more organic natural
| language vocabulary statistics)... which matters because, as
| you correctly note, the big deal is whether you can fit the
| histogram in the L2 cache, as noted elsethread. This
| _really_ matters if you go parallel, where N CPUs*L2 caches
| can speed things up a lot -- _until_ your histograms blow
| out CPU-private L2 cache sizes.
| https://github.com/c-blake/adix/blob/master/tests/wf.nim
| (or a port to your favorite lang instead of Nim) might make
| it easy to play with these ideas (and see at least one way
| to avoid almost all "allocation" - under some
| interpretations).
|
| A better way to "scale up" is to concatenate various other
| things from Project Gutenberg: https://www.gutenberg.org/
| At least then you have "organic" statistics on the hash.
| [deleted]
| Const-me wrote:
| > I don't think any language is going to offer that out of
| the box.
|
| C# offers that out of the box, and the solution is much
| simpler there.
|
| Pass StringComparer.OrdinalIgnoreCase or similar
| (InvariantCultureIgnoreCase, CurrentCultureIgnoreCase) to the
| constructor of the hash map, and the hash map will become
| case-agnostic. No need to transform strings.
| masklinn wrote:
| The one possible issue is that by not transforming the string
| you're going to run the possibly less efficient CI
| comparison _a lot_: because the corpus is text and
| duplicated, by my reckoning there are ~32000 unique "words"
| out of 82 million input "words". That's a lot of conflicts.
|
| Though the values should mostly be quite short, so a
| vectorised comparison might not even trigger as it wouldn't
| have the time to "stride": only one word of the top 10 even
| exceeds 4 letters ("shall", a hair under a million in my
| corpus).
| Const-me wrote:
| The C# standard library will not run many case-
| insensitive comparisons. That comparison object doesn't
| _just_ compare two objects for equality, it also has
| another interface method which computes a hash of a
| single object.
|
| Here's implementation of the hash function used by that
| StringComparer.OrdinalIgnoreCase: https://source.dot.net/
| #System.Private.CoreLib/src/libraries... As you see, it
| has a fast path for ASCII-only input strings.
| masklinn wrote:
| > The C# standard library will not run many case-
| insensitive comparisons. That comparison object doesn't
| just compare two objects for equality, it also has
| another interface method which computes a hash of a
| single object.
|
| Which doesn't matter because I'm talking about identical
| strings, so they _will_ hash the same by definition, and
| they _will_ have to be compared.
|
| So the question is how fast the CI hash and equality
| operate _compared to the CS ones_.
|
| And I asked about comparison because I _assumed_ that
| would be the costlier of the two operations, relative to
| its CS brethren.
| Const-me wrote:
| > how fast the CI hash and equality operate compared to
| the CS ones.
|
| If the string is ASCII like in the OP's use case, I think
| the difference is not huge.
|
| CS comparison looks more optimized, they have an inner
| loop which compares 12 bytes as 3 64-bit values: https://
| source.dot.net/#System.Private.CoreLib/src/libraries...
|
| CI comparer doesn't do that, it loads individual UTF-16
| elements: https://source.dot.net/#System.Private.CoreLib/
| src/libraries... But still, it's very simple code which
| does sequential memory access.
|
| > And I asked about comparison because I assumed that
| would be the costlier of the two operations, relative to
| its CS brethren.
|
| I think the bottleneck is random memory loads from the
| hash table.
|
| Hashing and comparison do sequential RAM access. The
| prefetcher in the CPU will do its job, you'll get 2
| memory loads every cycle, and for short strings it's going
| to be extremely fast. If that hashtable doesn't fit in L3
| cache, the main memory latency costs much more than
| comparing strings of 10-20 characters, case-sensitive
| or not.
| morelisp wrote:
| But all of those optimizations can also be done, and more
| efficiently, by transforming while reading. Even if
| they're as fast as they can be, they're not as fast as a
| memcmp. The C# approach isn't buying you any performance
| here.
| Const-me wrote:
| Yeah, it's possible to change case while loading, but I'm
| not sure that's gonna be much faster.
|
| But I'm sure it's gonna be much harder.
|
| For non-ASCII strings, converting case may change their
| length in bytes. You don't even know in advance how much
| memory you need to transform 2GB of input text (or 1MB
| buffer if streaming). And if streaming, you need to be
| careful to keep code points together: with a naive
| approach you're gonna crash with a runtime exception when
| you split a single codepoint between chunks.
|
| English words are 99.9% ASCII, but that remaining 0.1%,
| like "naïve", is not. The C# standard library is doing the
| right thing for this use case. Specifically, for 99.9% of
| words the CI comparer will use the faster ASCII-only code
| to compare or hash, and only do the expensive shenanigans
| for the small count of non-ASCII words.
|
| Note how C# makes the implementation much simpler. A
| single parameter passed to the constructor of the
| Dictionary<string,Something> makes it implement case-
| insensitivity automagically.
| [deleted]
| nimish wrote:
| If you ignore latency sure. Optane is still the fastest storage
| by quite a bit. Flash has yet to catch up and might never do so.
|
| Tons of files and random writes can bring even an enterprise
| flash ssd to its knees but Optane keeps on trucking
| avinassh wrote:
| I am working on a project [0] to generate 1 billion rows in
| SQLite under a minute, and I got 100M row inserts done in 33
| seconds. First, I generate the rows and insert them into an in-
| memory database, then flush it to disk at the end. The flush
| to disk takes only 2 seconds, so 99% of the time is being
| spent generating and adding rows to the in-memory B-tree.
| For Python optimisation, have you tried PyPy? I ran my same code
| (zero changes) using PyPy, and I got 3.5x better speed.
|
| I published my findings here [1].
|
| [0] - https://github.com/avinassh/fast-sqlite3-inserts
|
| [1] - https://avi.im/blag/2021/fast-sqlite-inserts/
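|
| For anyone curious, the in-memory-then-flush pattern is a few
| lines with the stdlib sqlite3 module (the schema and generated
| rows here are made up for illustration, not the project's):
|
|     import sqlite3
|
|     # Build the table entirely in RAM first.
|     mem = sqlite3.connect(":memory:")
|     mem.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, area TEXT, age INTEGER)")
|     mem.executemany(
|         "INSERT INTO user (area, age) VALUES (?, ?)",
|         (("area%03d" % (i % 1000), i % 90) for i in range(1_000_000)),
|     )
|     mem.commit()
|
|     # Connection.backup (Python 3.7+) streams the whole DB to disk.
|     disk = sqlite3.connect("users.db")
|     mem.backup(disk)
|     disk.close()
|     mem.close()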
| edejong wrote:
| "Optimised" code according to the author. What can we do to
| optimise further?
|
| - Read file in one thread pool, streaming the chunks to...
|
| - ...another thread pool, tokenise, count, sort the chunks and
| send them to ...
|
| - ... merge in another thread pool. (basically map-reduce).
|
| - please stop malloc'ing for each token
|
| - prealloc map for found tokens (better to just allocate room for
| 200k words).
|
| - SIMD would optimise your inner-loop quite a lot. However, there
| are optimised libraries for this, so you don't have to write this
| yourself.
|
| - `word = append(word, c)` <= this is very slow
|
| Why is there no profiling? Why don't you check how the compiler
| interpreted your code and benchmark the subparts?
|
| In addition, there are at least some errors in your optimised
| program:
|
| - you can't lower-case by subtracting like you do. Non-ASCII
| characters would fail.
|
| - you also can't tokenise by comparing with c <= ' '. There are
| many characters which would break a string. See this exercise:
| https://campus.datacamp.com/courses/introduction-to-natural-...
| morelisp wrote:
| > thread pool
|
| May reduce wall-clock time but increase total compute time (and
| so also power). It's less an optimization than a tradeoff.
|
| > please stop malloc'ing for each token
|
| It doesn't, only when it gets put in the map. (And while the
| particular allocation could be smaller, _something_ guaranteed
| to represent a specific arbitrary-length string has to be put
| in the map, which is going to malloc.)
|
| > prealloc map for found tokens (better to just allocate room
| for 200k words).
|
| Has no meaningful effect on performance.
|
| > SIMD would optimise your inner-loop quite a lot.
|
| No, as pointed out elsethread, it's a measurable boost but
| nowhere near the 10x you need to make the main claim (I/O not
| the bottleneck) be wrong. Not even 2x.
|
| > `word = append(word, c)` <= this is very slow
|
| Has no meaningful effect on performance.
|
| Perhaps you should read the whole post.
| brianolson wrote:
| SSD is pretty fast; but my app is actually trying to do more than
| 100_000 read-modify-write cycles per second and that still
| requires careful thought about the database and schema we're
| using.
|
| CPU and RAM are pretty fast. I do a live-coding interview
| question and I ask people to do a naive implementation first,
| then later I ask about possible optimizations. A third to a half
| of candidates want to do fewer RAM accesses, and oh boy is that
| the wrong avenue for this problem - especially when they just
| wrote their solution in Python and could get a 10x-20x speedup
| by rewriting in C/C++/Go/Rust/etc.
|
| Network is IO too. Network is pretty fast, datacenter-to-
| datacenter, but end users can still have their experience
| improved with better encoding and protocol; and outbound
| bandwidth bills can be improved by that too.
| kgeist wrote:
| The most common performance problems I've encountered in our
| projects are: 1) lack of indexes resulting in extensive table
| scans 2) I/O calls in a loop without batching.
|
| I don't know if it counts as "I/O bottlenecks" or not.
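|
| As an illustration of the second one, with the stdlib sqlite3
| module and a hypothetical users table: batching turns N round
| trips into one (the effect is far bigger against a networked
| database than against a local file, of course).
|
|     import sqlite3
|
|     conn = sqlite3.connect("app.db")  # hypothetical database
|     ids = [1, 2, 3, 4, 5]
|
|     # N+1 pattern: one I/O round trip per id.
|     rows = [conn.execute("SELECT id, name FROM users WHERE id = ?",
|                          (i,)).fetchone()
|             for i in ids]
|
|     # Batched: a single query fetches all of them at once.
|     marks = ",".join("?" * len(ids))
|     rows = conn.execute(
|         "SELECT id, name FROM users WHERE id IN (%s)" % marks,
|         ids).fetchall()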
| Havoc wrote:
| Picking something counterintuitive as an interview question is
| not a great idea. It defeats the purpose - it's harder to tell
| whether the candidate is going with conventional wisdom because
| that's what they think the answer is, or because the candidate
| thinks that's the answer the interviewer expects.
|
| i.e. you could get sharp candidates that know the correct
| technical answer but intentionally give the wrong one because
| they rightly conclude that, statistically, the odds are the
| interviewer is on the conventional-wisdom wavelength.
|
| Could be good if you're specifically looking for someone good at
| mindgames I guess
| [deleted]
| LastTrain wrote:
| And then compound that by your counterintuitive assertion being
| generally wrong in most useful cases. He is either interviewing
| very junior candidates or frustrating a lot of people.
| hansvm wrote:
| I/O is still often the bottleneck. My laptop can handle 11 GB/s
| through RAM (and no NVME, so under 1 GB/s through the hard
| drive), less with unpredictable I/O patterns (like a hash-map)
| and 7600 GB/s through the CPU. Unless the thing you're doing is
| particularly expensive per byte of data, you're going to be
| limited at a minimum by RAM I/O, and maybe by DISK I/O.
|
| FWIW, all my recent performance wins have either been by reducing
| RAM I/O or restructuring work to reduce contention in the memory
| controller, even at the cost of adding significantly more work to
| the CPU.
| svnpenn wrote:
| > converting to lowercase
|
| in regards to accuracy, uppercase is the better option:
|
| https://stackoverflow.com/a/65433126
| kragen wrote:
| there doesn't seem to be any reasoning or evidence in that post
| supporting "uppercase is the better option", just that
| uppercase produces a larger number of word classes, which might
| be correct or incorrect
|
| tchrist explains in that thread why neither uppercase nor
| lowercase is the best option:
|
| > _Mapping to lowercase doesn't work for Unicode data, only for
| ASCII. You should be mapping to Unicode foldcase here, not
| lowercase. Otherwise yours is a Sisyphean task, since lowercase
| of Σίσυφος is σίσυφος, while lowercase of its uppercase,
| ΣΊΣΥΦΟΣ, is the correct σίσυφοσ, which is indeed the foldcase
| of all of those. Do you now understand why Unicode has a
| separate map? The casemappings are too complex for blindly
| mapping to anything not designed for that explicit purpose, and
| hence the presence of a 4th casemap in the Unicode casing
| tables: uppercase, titlecase, lowercase, foldcase._
|
| of course 'σίσυφοσ' is not correct as a written word but if
| you were to encounter it then you should clearly consider it
| equivalent to 'σίσυφος'
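|
| a quick python sketch of the difference, using plain stdlib str
| methods: uppercasing happens to handle both of these particular
| cases, but casefold is the mapping actually defined for caseless
| matching
|
|     # german sharp s: lowercase comparison fails, casefold works
|     print("ß".lower() == "SS".lower())        # False ('ß' vs 'ss')
|     print("ß".upper() == "SS".upper())        # True  (both 'SS')
|     print("ß".casefold() == "SS".casefold())  # True  (both 'ss')
|
|     # greek final sigma: lowercase keeps ς and σ distinct
|     print("ς".lower() == "σ".lower())         # False
|     print("ς".upper() == "σ".upper())         # True  (both 'Σ')
|     print("ς".casefold() == "σ".casefold())   # True  (both 'σ')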
| svnpenn wrote:
| > there doesn't seem to be any reasoning or evidence in that
| post supporting "uppercase is the better option", just that
| uppercase produces a larger number of word classes, which
| might be correct or incorrect
|
| this sentence appears to be nonsense. the code doesnt check
| "word classes", it cases folds two characters and compares
| them.
| kragen wrote:
| character classes then
| svnpenn wrote:
| it doesn't check character classes either. It literally
| takes two characters, then uppercases both and compares,
| then lowercases both and compares. I have no idea where
| you are getting that it has anything to do with word or
| character classes, it doesn't.
| kragen wrote:
| by 'word class' i meant 'a set of words that are
| considered equivalent by whatever your equivalency
| relation is'
|
| similarly for 'character class'
|
| cf. https://en.wikipedia.org/wiki/Equivalence_class
|
| what i thought the linked program did was that it counted
| how many of those there were
|
| now on looking at it further i can see that it doesn't
| seem to be doing that but i don't have any idea what it
| _does_ do
|
| however, it definitely doesn't take into account the
| information you would need to learn anything about which
| candidate equivalency relation is better, which is
| something you'd need to examine at at least a word level,
| considering examples like größte, Σίσυφος, and the notoriously
| fatal Turkish sıkışınca/sikişince pair
| svnpenn wrote:
| > doesn't take into account the information you would
| need to learn anything about which candidate equivalency
| relation is better
|
| OK, no one said it did that. It's purely comparing
| characters, which is and always was what I said it was
| doing. And somehow it took 5 comments before you even
| decided to _actually read the answer_. Maybe next time
| you should start by _actually reviewing_ and
| understanding what you are commenting on, _before_ making
| multiple comments.
| namkt wrote:
| I would have thought that allocations in managed languages like
| Go/Python would have been the "fast" part of the processing.
| Isn't it technically the GC that's slowing you down, and not
| allocation per se? For one-shot input/output programs like these
| I guess you could tune the GC to kick in with less frequency.
|
| You also note that _reading a file sequentially from disk is very
| fast_ , which it is, but there is no guarantee that the file's
| contents are actually sequential on disk (fragmentation), right?
| We'd have to see how the file was written, and I guess at worst
| you'd be reading sequential chunks of a hand-wavy 4KB or
| something depending on the file system and what not. I'm sure
| others can fill in the details.
|
| Just nit-picking here.
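|
| On the GC point: for a one-shot run you can indeed push the
| collector out of the way, either with GOGC=off on the command
| line or from code. A minimal sketch (note that allocation
| itself still isn't free - the per-word string and map work
| remains even with the GC disabled):
|
|     package main
|
|     import "runtime/debug"
|
|     func main() {
|         // Equivalent to running with GOGC=off: never trigger a
|         // collection, let the heap grow, and rely on process
|         // exit to give the memory back.
|         debug.SetGCPercent(-1)
|
|         // ... read input, count words, print results ...
|     }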
| PaulKeeble wrote:
| Files aren't stored sequentially on an SSD anyway; they are
| scattered all over the place physically on different blocks
| just due to the way SSDs work. This doesn't hurt their
| sequential read performance at all, since they have no seek
| time and already maintain a lookup table from the logical
| blocks the OS sees to physical locations within the flash.
|
| However, one thing I found out a few years ago is that old data
| can be slow to read as a lot of error correction kicks in.
| Additionally, heavy fragmentation at the operating-system level
| in Windows has quite a bit of overhead; it can seriously
| degrade performance, down to about 50 MB/s sequential reads. In
| practice, defragmentation/rewriting of certain high-write files
| may be necessary even on SSDs, because Windows read performance
| degrades at high levels of fragmentation.
| wtallis wrote:
| > You also note that reading a file sequentially from disk is
| very fast, which it is, but there is no guarantee that the
| file's contents are actually sequential on disk
| (fragmentation), right?
|
| Correct. And there are actually two layers of fragmentation to
| worry about: the traditional filesystem-level fragmentation of
| a file being split across many separate chunks of the drive's
| logical block address space (which can be fixed by a defrag
| operation), and fragmentation hidden within the SSD's flash
| translation layer as a consequence of the file contents being
| written or updated at different times.
|
| The latter can often have a much smaller effect than you might
| expect for what sounds like it could be a pathological corner
| case: https://images.anandtech.com/graphs/graph16136/sustained-
| sr.... shows typically only a 2-3x difference due to
| artificially induced fragmentation at the FTL level. But if the
| OS is also having to issue many smaller read commands to
| reassemble a file, throughput will be severely affected unless
| the OS is able to issue lots of requests in parallel (which
| would depend on being able to locate many file extents from
| each read of the filesystem's B+ tree or whatever, and the OS
| actually sending those read requests to the drive in a large
| batch).
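|
| (For that last point, a minimal Go sketch of what "lots of
| requests in parallel" looks like from user space - the file
| name, the 1 MiB chunk size, and the in-flight cap of 32 are
| arbitrary placeholders; *os.File is safe for concurrent ReadAt
| calls:)
|
|     package main
|
|     import (
|         "fmt"
|         "os"
|         "sync"
|         "sync/atomic"
|     )
|
|     func main() {
|         f, err := os.Open("big.bin")
|         if err != nil {
|             panic(err)
|         }
|         defer f.Close()
|         fi, err := f.Stat()
|         if err != nil {
|             panic(err)
|         }
|
|         const chunk = 1 << 20          // 1 MiB per request
|         sem := make(chan struct{}, 32) // cap on requests in flight
|         var wg sync.WaitGroup
|         var total int64
|
|         for off := int64(0); off < fi.Size(); off += chunk {
|             wg.Add(1)
|             sem <- struct{}{}
|             go func(off int64) {
|                 defer wg.Done()
|                 defer func() { <-sem }()
|                 buf := make([]byte, chunk)
|                 n, _ := f.ReadAt(buf, off) // short read at EOF is fine
|                 atomic.AddInt64(&total, int64(n))
|             }(off)
|         }
|         wg.Wait()
|         fmt.Println("read", total, "bytes")
|     }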
| vinay_ys wrote:
| Storage and compute separation is key to scaling data workloads.
| Here, scaling could be w.r.t volume/shape of data, number of
| concurrent jobs on the same dataset, complexity of each job etc.
| In such an architecture, network access is unavoidable. And if
| you have multiple jobs competing for access to the same
| dataset concurrently, your sequential access can turn into semi-
| random access. You also have concerns about utilization of
| resources while being scalable w.r.t arbitrary bursty contentious
| workloads. These are the things that make it complex w.r.t
| managing IO resources.
| christophilus wrote:
| Does anyone have a good resource for reasoning about how to avoid
| allocations in JavaScript?
| Klinky wrote:
| The Samsung PM9A1 is the OEM version of the 980 Pro, a top-tier
| PCIe 4.0 NVMe SSD. What about an older SATA SSD (one without a
| DRAM buffer or HMB), or a 5400 RPM hard drive? Also, as others
| have pointed out, random I/O will tank performance, especially
| simultaneous read/write operations to the same media.
| vitiral wrote:
| I question the methodology.
|
| To measure this I would have N processes reading the file from
| disk with the max number of parallel heads (typically 16 I
| think). These would go straight into memory. It's possible you
| could do this with one process and the kernel would split up
| the block read into 16 parallel reads as well; that needs
| investigation.
|
| Then I would use the rest of the compute for number crunching as
| fast as possible using as many available cores as possible: for
| this problem, I think that would basically boil down to a map
| reduce. Possibly a lock-free concurrent hashmap could be
| competitive.
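|
| Something like this minimal Go sketch of the map/reduce part
| (assuming the input path is the first argument; it reads the
| file once up front and only parallelizes the counting and the
| merge, not the disk reads themselves; error handling and
| Unicode case folding are omitted to keep it short):
|
|     package main
|
|     import (
|         "bytes"
|         "fmt"
|         "os"
|         "runtime"
|         "sync"
|     )
|
|     func main() {
|         data, err := os.ReadFile(os.Args[1]) // path from argv
|         if err != nil {
|             panic(err)
|         }
|
|         // "Map": count each chunk in its own goroutine.
|         chunks := splitOnSpace(data, runtime.NumCPU())
|         partial := make([]map[string]int, len(chunks))
|         var wg sync.WaitGroup
|         for i, chunk := range chunks {
|             wg.Add(1)
|             go func(i int, chunk []byte) {
|                 defer wg.Done()
|                 counts := make(map[string]int)
|                 for _, w := range bytes.Fields(bytes.ToLower(chunk)) {
|                     counts[string(w)]++
|                 }
|                 partial[i] = counts
|             }(i, chunk)
|         }
|         wg.Wait()
|
|         // "Reduce": merge the per-chunk maps.
|         total := make(map[string]int)
|         for _, m := range partial {
|             for w, n := range m {
|                 total[w] += n
|             }
|         }
|         fmt.Println(len(total), "distinct words")
|     }
|
|     // splitOnSpace cuts data into n chunks, moving each cut
|     // forward to the next space so no word straddles a boundary.
|     func splitOnSpace(data []byte, n int) [][]byte {
|         var chunks [][]byte
|         for len(data) > 0 && n > 0 {
|             size := len(data) / n
|             if i := bytes.IndexByte(data[size:], ' '); i >= 0 {
|                 size += i + 1
|             } else {
|                 size = len(data)
|             }
|             chunks = append(chunks, data[:size])
|             data = data[size:]
|             n--
|         }
|         return chunks
|     }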
|
| Now, run these in parallel and measure the real time from start
| to finish of both. Also get the total CPU time spent for
| reference.
|
| I'm pretty sure the author's results are polluted: while they are
| processing data the kernel is caching the next block. Also, it's
| not really fair to compare single threaded disk IO to a single
| process: one of the reasons for IO being a bottleneck is that it
| has concurrency constraints. Nevertheless I would be interested
| in both the single threaded and concurrent results.
| Klinky wrote:
| I agree, I think there is often a faulty assumption by many
| developers doing benchmarking that their underlying environment
| doesn't matter. Often I see performance results posted with
| little mention of the details of the environment. At least here
| they posted they were using a high-end SSD, but note they just
| say they're on a "Dell XPS 13", as if there aren't multiple
| variants of that model produced every year for the last 5 or 6
| years.
|
| You're probably also right their multiple test runs resulted in
| the OS caching data, and a lot of the test runs may have just
| been testing in-memory performance instead of raw storage I/O
| performance.
| LastTrain wrote:
| Thank you. I'd be pretty annoyed in this interview. Surely my
| potential employer would be more interested in having me apply
| my twenty years of real-world experiences to what I learned in
| CS240.
___________________________________________________________________
(page generated 2022-11-26 23:01 UTC)