[HN Gopher] 21 GB/s CSV Parsing Using SIMD on AMD 9950X
___________________________________________________________________
21 GB/s CSV Parsing Using SIMD on AMD 9950X
Author : zigzag312
Score : 227 points
Date : 2025-05-09 13:38 UTC (9 hours ago)
(HTM) web link (nietras.com)
(TXT) w3m dump (nietras.com)
| winterbloom wrote:
| This is a staggering ~3x improvement in just under 2 years since
| Sep was introduced in June 2023.
|
| You can't claim this when you also do a huge hardware jump
| WD-42 wrote:
| Yea wtf is that chart, it literally skips 4 cpu generations
| where it shows "massive performance gain".
|
| Straight to the trash with this post.
| g-mork wrote:
| It also appears to be reporting whole-CPU rather than single-
| thread numbers; 1.3 GB/s is not impressive for single-thread
| perf.
| iamleppert wrote:
| Agreed. How hard is it to keep hardware fixed, load the
| data into memory, and use a single core for your
| benchmarks? When I see a chart like that I think, "What
| else are they hiding?"
|
| Folks should check out https://github.com/dathere/qsv if
| they need an actually fast CSV parser.
| Remnant44 wrote:
| I mean... A single 9950x core is going to struggle to do
| more than 16 GB/second of direct mem copy bandwidth. So
| being within an order of magnitude of that seems reasonable.
| ziml77 wrote:
| But it repeats the 0.9.0 test on the new hardware. So the
| first big jump is a hardware change, but the second jump is
| the software changes.
| matja wrote:
| 4 generations?
|
| 5950x is Zen 3
|
| 9950x is Zen 5
| chupasaurus wrote:
| Since Zen 2 (3000), the mobile CPUs are numbered a thousand
| higher than their desktop counterparts. edit: Or Nx2000 where N
| is from Zen N.
| hinkley wrote:
| And even with 2, CPU generations aren't what they used to
| be back when a candy bar cost less than a dollar.
| freeone3000 wrote:
| They claim a 3 GB/s improvement versus the previous version of
| Sep on equal hardware -- and unlike "marketing" benchmarks, they
| include the actual speed achieved and the hardware used.
| stabbles wrote:
| Do note that this speed, even before the 3 GB/s improvement,
| exceeds the bandwidth of most disks, so the bottleneck is
| loading data into memory. I don't know of many applications
| where CSV is produced and consumed in memory, so I wonder what
| the use is.
| freeone3000 wrote:
| Slower than network! In-memory processing of OLAP tables,
| streaming splitters, large data set division... but also
| the faster the parser, the less time you spend parsing and
| the more you spend doing actual work
| tetha wrote:
| This is honestly something that caught me off-guard a
| bit. If you have good internal network connectivity,
| small queries and your relational database has the data
| in memory, it can be faster to fetch data from the DB via
| the network than reading it from disk.
|
| Like, sure, I can give you an application server with
| faster disks and more memory, and you or I are certainly
| capable of implementing an application server that could
| load the data from disk faster than all of that. And then
| we build caching to keep the hot data in memory, because
| that's faster.
|
| But then we've spent very advanced development resources
| to build a relational database with some application code
| at the edge.
|
| This can make sense in some high frequency trading
| situations, but in many more mundane web-backends, a
| chunky database and someone capable of optimizing stupid
| queries enable and simplify the work of a much bigger
| number of developers.
| bee_rider wrote:
| You can also get this with Infiniband, although it is
| less surprising, and basically what you'd expect to see.
|
| I did once use a system where the network bandwidth was
| in the same ballpark as the memory bandwidth, which might
| not be surprising for some of the real HPC-heads here but
| it surprised me!
| pdpi wrote:
| "We can parse at x GB/s" is more or less the reciprocal of
| "we need y% of your CPU capacity to saturate I/O".
|
| Higher x -> lower y -> more CPU for my actual workload.
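|
| For instance, assuming a 7 GB/s NVMe drive, a 21 GB/s parser
| needs only about a third of one core to keep up with the disk.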
| vardump wrote:
| Decompression is your friend. Usually CSV compresses really
| well.
|
| Multiple cores decompressing LZ4 compressed data can
| achieve crazy bandwidth. More than 5 GB/s per core.
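|
| A rough sketch of that kind of pipeline in C# (assuming the
| K4os.Compression.LZ4.Streams NuGet package; the file name is
| made up):
|
|     using System;
|     using System.IO;
|     using K4os.Compression.LZ4.Streams;
|
|     // Decompress an LZ4-framed CSV on the fly and feed it to a
|     // stream-based reader; LZ4 decode runs at several GB/s per
|     // core, so storage stops being the limit.
|     using var file = File.OpenRead("data.csv.lz4");
|     using var lz4 = LZ4Stream.Decode(file);
|     using var reader = new StreamReader(lz4);
|     long lines = 0;
|     while (reader.ReadLine() is not null) lines++;
|     Console.WriteLine($"{lines} lines");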
| jbverschoor wrote:
| They also included 0.9.0 vs 0.10.0 on the new hardware (21385
| vs 18203), so the jump due to software is 17%.
|
| Then if we take 0.9.0 on previous hardware (13088) and add the
| 17%, it's 15375. Version 0.1.0 was 7335.
|
| So... 15375/7335 -> a staggering 2.1x improvement in just under
| 2 years
| perching_aix wrote:
| > You can't claim this when you also do a huge hardware jump
|
| Well, they did. Personally, I find it an interesting way of
| looking at it; it's a lens for the "real performance" one could
| get using this software year over year. (Not saying it isn't a
| misleading or fallacious claim though.)
| criddell wrote:
| I was expecting to see assembly language and was pleasantly
| surprised to see C#. Very impressive.
|
| Nice work!
| gavinray wrote:
| Modern .NET has the deepest integration with SIMD and vector
| intrinsics among what most people would consider "high-level
| languages".
|
| https://learn.microsoft.com/en-us/dotnet/standard/simd
|
| Tanner Gooding at Microsoft is responsible for a lot of the
| developments in this area and has some decent blogposts on it,
| e.g.
|
| https://devblogs.microsoft.com/dotnet/dotnet-8-hardware-intr...
| voidUpdate wrote:
| I shudder to think who needs to process a million lines of csv
| that fast...
| segmondy wrote:
| Lots of folks in finance. You can share CSV with any finance
| company and they can process it. It's text.
| zzbn00 wrote:
| Humans generate decisions / text information at rates of
| ~bytes per second at most. There are barely enough humans
| around to generate 21 GB/s of information even if all they did
| was make financial decisions!
|
| So 21 GB/s would be solely algos talking to algos... Given
| all the investment in the algos, surely they don't need to be
| exchanging CSV around?
| internetter wrote:
| > Humans generate decisions / text information at rates of
| ~bytes per second at most
|
| Yes, but the consequences of these decisions are worth much
| more. You attach an ID to the user, and an ID to the
| transaction. You store the location and time where it was
| made. Etc.
| zzbn00 wrote:
| I think these would add only a small amount of information
| (and in a DB would be modelled as joins). It only adds lots of
| data if done very inefficiently.
| jajko wrote:
| Why are you theorizing? I can tell you from out there it's
| used massively, and it's not going away; on the contrary.
| Even rather small banks can end up generating various
| reports etc. which can easily become huge.
|
| The speed of human decisions plays basically no role here,
| just as it doesn't with messaging generally; there is far more
| to companies than a direct keyboard-to-output link.
| adrianN wrote:
| You might have accumulated some decades of data in that
| format and now want to ingest it into a database.
| zzbn00 wrote:
| Yes, but if you have decades of data, what hinges on whether
| you wait a minute or 10 minutes to convert it?
| hermitcrab wrote:
| CSV is a questionable choice for a dataset that size. It's
| not very efficient in terms of size (real numbers take more
| bytes to store as text than as binary), it's not the
| fastest to parse (due to escaping), and a single delimiter
| or escape out of place corrupts everything afterwards. That's
| not to mention all the issues around encoding, different
| delimiters, etc.
| zzbn00 wrote:
| It's great for when people need to be in the loop, looking
| at the data, maybe loading it in Excel, etc. (I use it
| myself...). But there aren't enough humans around for 21 GB/s.
| jstimpfle wrote:
| > (real numbers take more bytes to store as text than as
| binary)
|
| Depends on the distribution of numbers in the dataset.
| It's quite common to have small numbers. For those, text
| is a more efficient representation than binary, especially
| compared to 64-bit or larger binary encodings.
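|
| For example, the value 7 takes two bytes as "7," in text but
| eight bytes as a fixed-width 64-bit binary integer.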
| wat10000 wrote:
| Standards (whether official or de facto) often aren't the
| best in isolation, but they're the best in reality because
| they're widely used.
|
| Imagine you want to replace CSV for this purpose. From a
| purely technical view, this makes total sense. So you
| investigate, come up with a better standard, make sure it
| has all the capabilities everyone needs from the existing
| stuff, write a reference implementation, and go off to get
| it adopted.
|
| First place you talk to asks you two questions: "Which of
| my partner institutions accept this?" "What are the
| practical benefits of switching to this?"
|
| Your answer to the first is going to be "none of them" and
| the answer to the second is going to be vague hand-wavey
| stuff around maintainability and making programmers
| happier, with maybe a little bit of "this properly handles
| it when your clients' names have accent marks."
|
| Next place asks the same questions, and since the first
| place wasn't interested, you have the same answers....
|
| Replacing existing standards that are Good Enough is
| really, really hard.
| cyral wrote:
| The only real example I can think of is the US options
| market feed. It is up to something like 50 GiB/s now, and
| is open 6.5 hours per day. Even a small subset of the feed
| that someone may be working on for data analysis could be
| huge. I agree CSV shouldn't even be used here but I am sure
| it is.
| h4ck_th3_pl4n3t wrote:
| You seem to not realize that most humans are not coders.
|
| And non-coders use proprietary software, which usually has
| an export to CSV or XLS to be compatible with Microsoft
| Office.
| sunrunner wrote:
| I shudder to think of what it means to be storing the _results_
| of processing 21 GB/s of CSV. Hopefully some useful kind of
| aggregation, but if this was powering some kind of search over
| structured data then it has to be stored somewhere...
| devmor wrote:
| Just because you're processing 21GB/s of CSV doesn't mean you
| need all of it.
|
| If your data is coming from a source you don't own, it's
| likely to include data you don't need. Maybe there's 30
| columns and you only need 3 - or 200 columns and you only
| need 1.
|
| Enterprise ETL is full of such cases.
| hermitcrab wrote:
| For all its many weaknesses, I believe CSV is still the most
| common data interchange format.
| adra wrote:
| Erm, maybe file-based? JSON is the king if you count
| exchanges worldwide per second. Maybe No. 2 is form-data,
| which is basically email multipart, and of course there's
| email as a format. Very common =)
| hermitcrab wrote:
| I meant file-based.
| devmor wrote:
| I honestly wonder if JSON is king. I used to think so until
| I started working in fintech. XML is unfortunately
| everywhere.
| hermitcrab wrote:
| JSON isn't great for tabular data. And an awful lot of
| data is tabular.
| trollbridge wrote:
| It's become a very common interchange format, even internally;
| it's also easy to deflate. I have had to work on codebases
| where CSV was being pumped out at basically the speed of a NIC
| (its origin was Netflow, which was then aggregated and otherwise
| processed, and the results sent via CSV to a master for further
| aggregation and analysis).
|
| I really don't get, though, why people can't just use protocol
| buffers instead. Is protobuf really that hard?
| nobleach wrote:
| Extremely hard to tell an HR person, "Right-click on here in
| your Workday/Zendesk/Salesforce/etc UI and export a
| protobuf". Most of these folks in the business world LIVE in
| Excel/Spreadsheet land so a CSV feels very native. We can
| agree all day long that for actual data TRANSFER, CSV is
| riddled with edge cases. But it's what the customers are
| using.
| heavenlyblue wrote:
| It's extremely unlikely they need to load spreadsheets
| large enough for 21 GB/s speeds to matter.
| nobleach wrote:
| Oh absolutely! I'm just mentioning why CSV is chosen over
| Protobufs.
| SteveNuts wrote:
| You'd be surprised. Big telcos use CSV and SFTP for CDR
| data, and there's a lot of it.
| matja wrote:
| Kind of: there isn't a 1:1 mapping of protobuf wire types to
| schema types, so you need to either package the protobuf schema
| with the data and compile it to parse the data, or decide on
| the schema beforehand. So now you need to decide on a file
| format to bundle the schema and the data.
| bombela wrote:
| protobuf is more friction, and actually slow to write and
| read.
|
| For better or worse, CSV is easy to produce via printf, and
| easy to read by breaking lines and splitting by the delimiter.
| Escaping delimiters that appear in the content is not hard,
| though it is often added as an afterthought.
|
| Protobuf requires installing a library, understanding how it
| works, writing a schema file, and sharing the schema with
| others. The API is cumbersome.
|
| Finally, to offer this mutable struct via a setter and getter
| abstraction, with variable-length encoded numbers, variable-
| length strings, etc., the library ends up quite slow.
|
| In my experience protobuf is slow and memory hungry. The
| generated code is also quite bloated, which is not helping.
|
| See https://capnproto.org/ for details from the original
| creator of protobuf.
|
| Is CSV faster than protobuf? I don't know, and I haven't
| tested. But I wouldn't be surprised if it is.
| raron wrote:
| > For better or worse, CSV is easy to produce via printf.
| Easy to read by breaking lines and splitting by the
| delimiter. Escaping delimiters part of the content is not
| hard, though often added as an afterthought.
|
| Based on the amount of software I've seen that produces broken
| CSV or can't parse (more-or-less) valid CSV, I don't think that
| is true.
|
| It seems easy, because it's just printf("%s,%d,%d\n", ...), but
| it is full of edge cases most programmers don't think about.
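|
| For illustration, even the "write" side needs more than printf
| once quoting enters the picture. A minimal RFC 4180-style field
| escaper in C# (a sketch, not from any particular library):
|
|     // A field needs quoting if it contains the delimiter, a
|     // quote, or a line break; embedded quotes are doubled.
|     static string EscapeCsvField(string field, char delimiter = ',')
|     {
|         bool needsQuotes =
|             field.IndexOfAny(new[] { delimiter, '"', '\r', '\n' }) >= 0;
|         return needsQuotes
|             ? "\"" + field.Replace("\"", "\"\"") + "\""
|             : field;
|     }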
| to11mtm wrote:
| I'm not the biggest fan of Protobuf, mostly around the
| 'perhaps-too-minimal' typing of the system and the
| performance differentials present on certain languages in the
| library.
|
| e.g. I know in the .NET space, MessagePack is usually faster
| than proto, and I think similar is true for the JVM. The main
| disadvantage is there's no good schema-based tooling around it.
| moregrist wrote:
| I have. I think it's a pretty easy situation for certain kinds
| of startups to find themselves in:
|
| - Someone decides on CSV because it's easy to produce and you
| don't have that much data. Plus it's easier for the <non-
| software people> to read so they quit asking you to give them
| Excel sheets. Here <non-software people> is anyone who has a
| legit need to see your data and knows Excel really well. It can
| range from business types to lab scientists.
|
| - Your internal processes start to consume CSV because it's
| what you produce. You build out key pipelines where one or more
| steps consume CSV.
|
| - Suddenly your data increases by 10x or 100x or more because
| something started working: you got some customers, your sensor
| throughput improved, the science part started working, etc.
|
| Then it starts to make sense to optimize ingesting millions or
| billions of lines of CSV. It buys you time so you can start
| moving your internal processes (and maybe some other teams'
| stuff) to a format more suited for this kind of data.
| ourmandave wrote:
| That cartesian product file accounting sends you at year end?
| constantcrying wrote:
| In basically every situation it is inferior to HDF5.
|
| I do not think there is an actual explanation besides
| ignorance, laziness or "it works".
| pak9rabid wrote:
| Ugh.....I do unfortunately.
| vessenes wrote:
| If we are lucky we will see Arthur Whitney get triggered and post
| either a one liner beating this or a shakti engine update and a
| one liner beating this. Progress!
| stabbles wrote:
| Instead of doing 4 comparisons, one against each of the
| characters `\n`, `\r`, `;` and `"`, followed by 3 OR operations,
| a common trick is to do 1 shuffle, 1 comparison and 0 OR
| operations. I blogged about this
| trick: https://stoppels.ch/2022/11/30/io-is-no-longer-the-
| bottlenec... (Trick 2)
|
| Edit: they do make use of ternary logic to avoid one OR
| operation, which is nice. Basically (a | b | c) | d is computed
| using `vpternlogd` and `vpor` respectively.
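|
| A minimal C# sketch of that nibble-lookup idea (illustrative
| only, not Sep's actual code; assumes SSSE3 and .NET intrinsics):
|
|     using System.Runtime.Intrinsics;
|     using System.Runtime.Intrinsics.X86;
|
|     static class CsvClassify
|     {
|         // Low-nibble lookup table: '"' (0x22), '\n' (0x0A),
|         // ';' (0x3B) and '\r' (0x0D) all have distinct low
|         // nibbles (2, A, B, D). Unused slots hold 0xFF, which
|         // no ASCII input byte can ever equal.
|         static readonly Vector128<byte> Table = Vector128.Create(
|             (byte)0xFF, 0xFF, (byte)'"', 0xFF,
|             0xFF, 0xFF, 0xFF, 0xFF,
|             0xFF, 0xFF, (byte)'\n', (byte)';',
|             0xFF, (byte)'\r', 0xFF, 0xFF);
|
|         // Returns 0xFF in every lane holding one of: \n \r ; "
|         public static Vector128<byte> Classify(Vector128<byte> v)
|         {
|             // PSHUFB: out[i] = Table[v[i] & 0x0F], or 0 when
|             // v[i] >= 0x80, so high-bit bytes never match.
|             Vector128<byte> looked = Ssse3.Shuffle(Table, v);
|             // One compare replaces four compares and three ORs.
|             return Sse2.CompareEqual(looked, v);
|         }
|     }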
| justinhj wrote:
| really cool thanks
| Aardwolf wrote:
| Take that, Intel and your "let's remove AVX-512 from every
| consumer CPU because we want to put slow cores on every single
| one of them and also not consider multi-pumping it"
| tadfisher wrote:
| A lot of this stems from the 10nm hole they had to dig
| themselves out from. Yields are bad, so costs are high, so
| let's cut the die as much as possible, ship Atom-derived cores
| and market it as an energy-saving measure. The expensive parts
| can be bigger and we'll cut the margins on those to retain the
| server/cloud sector. Also our earnings go into the shitter and
| we lose market share anyway, but at least we tried.
| wtallis wrote:
| This issue is less about Intel's fab failures and more about
| their inability to decouple their architecture update cadence
| from their fab progress. They stopped iterating on their CPU
| designs while waiting for 10nm to get fixed. That left them
| with an oversized P core and an outdated E core, and all they
| could do for Alder Lake was slap them onto one die and ship
| it, with no ability to produce a well-matched pair of core
| designs in any reasonable time frame. We're _still_ seeing
| weird consequences of their inability to port CPU designs
| between processes and fabs: this year's laptop processors
| have HyperThreading only in the lowest-cost parts--those that
| still have the CPU chiplet fabbed at Intel while the higher
| core count parts are made by TSMC.
| imtringued wrote:
| Considering the non-standard nature of CSV, quoting throughput
| numbers in bytes is meaningless. It makes sense for JSON, since
| you know what the output is going to be (e.g. floats, integers,
| strings, hashmaps, etc). With CSV you only get strings for each
| column, so 21 GB/s of comma splitting would be the pinnacle of
| meaninglessness. Like, okay, but I still have to parse the
| stringy data, so what gives? Yeah, the blog post does reference
| float parsing, but a single float per line would count as "CSV".
|
| Now someone might counter and say that I should just read the
| README.MD, but then that suspicion simply turns out to be true:
| They don't actually do any escaping or quoting by default, making
| the quoted numbers an example of heavily misleading advertising.
| liuliu wrote:
| CSV is standardized in RFC 4180 (well, as standardized as most
| of what we consider internet "standards").
|
| Otherwise agree, if you don't do escaping (a.k.a. "quoting",
| the same thing for CSV), you are not implementing it correctly.
| For example, if you quote a line break, in RFC 4180 that line
| break will be part of the quoted string; but if you don't need
| to handle that, you can implement CSV parsing much faster
| (properly handling line breaks within quoted strings requires a
| 2-pass approach if you are going to use many cores, while not
| handling them at all can be done in a single pass). I discussed
| this detail in https://liuliu.me/eyes/loading-csv-file-at-the-speed-
| limit-o...
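|
| For example, per RFC 4180 the following is a single record with
| two fields, even though it spans two physical lines:
|
|     "first part
|     second part",42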
| a3w wrote:
| Side note: RFCs are great standards, as they are readable.
|
| As an example of how not to do it: XML can be considered a
| standard, but I cannot afford to read it. DIN/ISO is great
| for manufacturing in theory, but bad for fields like IT that
| expect zero-cost initial investment.
| zeristor wrote:
| Why not use Parquet?
| mcraiha wrote:
| Excel does not output Parquet.
| speed_spread wrote:
| True. But also Excel probably collapses into a black hole
| going straight to hell trying to handle 21GB of data.
| hermitcrab wrote:
| Excel .xlsx files are limited to 1,048,576 rows and 16,384
| columns.
|
| Excel .xls files are limited to 65,536 rows and 256
| columns.
| mihular wrote:
| 21GB/s, not 21GB ...
| anthk wrote:
| mawk would handle a 21 GB csv (or maybe one true awk)
| fast enough.
| buyucu wrote:
| Excel often outputs broken csv :)
| hinkley wrote:
| I have been privileged in my career to never need to parse
| Excel output but occasionally feed it input. Especially
| before Grafana was a household name.
|
| Putting something out so your manager stops asking you 20
| questions about the data is a double-edged sword though.
| Those people can hallucinate more than a pre-Covid AI
| engine. Grafana is just weird enough that people would
| rather consume a chart than try to make one, so you have
| some control over the acid trip.
| constantcrying wrote:
| Or HDF5 or any other format which is actually meant to store
| large amounts of floating point data.
| chao- wrote:
| It feels crazy to me that Intel spent years dedicating die space
| on consumer SKUs to "make fetch happen" with AVX-512, and as more
| and more libraries are finally using it, as Intel's goal is
| achieved, they have removed AVX-512 from their consumer SKUs.
|
| It isn't that AMD has better AVX-512 support, which would be an
| impressive upset on its own. Instead, it is only that AMD has
| AVX-512 on consumer CPUs, because Intel walked away from their
| own investment.
| MortyWaves wrote:
| It's wild seeing how stupid Intel is being.
| neonsunset wrote:
| If it's any consolation, Sep will happily use AVX-512 whenever
| available, without having to opt into that explicitly,
| including the server parts, as it will most likely run under a
| JIT runtime (although it's NAOT-compatible). So you're not
| missing out by being forced to target the lowest common
| denominator.
| sitkack wrote:
| That is what Intel does: they build up a market (Optane) and
| then do a rug pull (Depth Cameras). They continue to do this
| thing where they do a huge push into a new technology, then
| don't see the uptake and let it die, instead of building slowly
| and then, at the right time, doing a big push. Optane support
| was _just getting mature_ in the Linux kernel when they pulled
| it. And they focused on some weird cost-cutting move when
| marketing it as a RAM replacement for semi-idle VMs, ok.
|
| They keep repeating the same mistakes all the way back to
| https://en.wikipedia.org/wiki/Intel_iAPX_432
| sebmellen wrote:
| Bad habits are hard to break!
| etaioinshrdlu wrote:
| Well, Itanium might be a counterexample; they probably tried
| to make that work for far too long...
| sitkack wrote:
| Itanium worked as intended.
| paddy_m wrote:
| In so far as it killed HP PA-RISC, SGI MIPS, and DEC Alpha,
| and seriously hurt the chances of adoption for SPARC and
| POWER outside of their respective parents (did I miss
| any)?
|
| The thing is, they could have killed it by 1998, without ever
| releasing anything, and that would still have killed the other
| architectures it was trying to compete with. Instead they
| waited until 2020 to end support.
|
| What the VLIW of Itanium needed and never really got was
| proper compiler support. Nvidia has this in spades with
| CUDA. It's easy to port to Nvidia where you do get
| serious speedups. AVX-512 never offered enough of a
| speedup from what I could tell, even though it was well
| supported by at least ICC (and numpy/scipy when properly
| compiled).
| knowitnone wrote:
| "they could have killed it by 1998, without ever
| releasing anything"
|
| perhaps Intel really wanted it to work and killing other
| architectures was only a side effect?
| kyboren wrote:
| > What the VLIW of Itanium needed and never really got
| was proper compiler support.
|
| This is kinda under-selling it. The fundamental problem
| with statically-scheduled VLIW machines like Itanium is that
| it puts all of the complexity in the compiler.
| Unfortunately it turns out it's just really hard to make
| a good static scheduler!
|
| In contrast, dynamically-scheduled out-of-order
| superscalar machines work great but put all the
| complexity in silicon. The transistor overhead was
| expensive back in the day, so statically-scheduled VLIWs
| seemed like a good idea.
|
| What happened was that static scheduling stayed really
| hard while the transistor overhead for dynamic scheduling
| became irrelevantly cheap. "Throw more hardware at it"
| won handily over "Make better software".
| bri3d wrote:
| No, VLIW is even worse than this. Describing it as a
| compiler problem undersells the issue. VLIW is not
| tractable for a multitasking / multi tenant system due to
| cache residency issues. The compiler cannot efficiently
| schedule instructions without knowing what is in cache.
| But, it can't know what's going to be in cache if it
| doesn't know what's occupying the adjacent task time
| slices. Add virtualization and it's a disaster.
| sitkack wrote:
| It only works for fixed workloads, like accelerators,
| with no dynamic sharing.
| mrweasel wrote:
| Itanium was more of an HP product than an Intel one.
| sheepscreek wrote:
| > They continue to do this thing where they do a huge push
| into a new technology, then don't see the uptake and let it
| die.
|
| Except Intel deliberately made AVX-512 a feature exclusively
| available to Xeon and enterprise processors in future
| generations. This backward step artificially limits its
| availability, forcing enterprises to invest in more expensive
| hardware.
|
| I wonder if Intel has taken a similar approach with Arc GPUs,
| which lack support for GPU virtualization (SR-IOV). They
| somewhat added vGPU support to all built-in 12th-14th Gen
| chips through the i915 driver on Linux. It's a pleasure to
| have graphics-acceleration in multiple VMs simultaneously,
| through the same GPU.
| sitkack wrote:
| They go out of their way to segment their markets: ECC, AVX,
| Optane support (only on specific server-class SKUs). I hate
| it; I hate it as a home PC user, I hate it as an enterprise
| customer, and I hate it as a shareholder.
| knowitnone wrote:
| Every company does this. If your grandma only uses a
| web browser, word processor, and Excel, does she really
| want to spend an additional $50 on a feature she'll not
| use? Same with NPUs. Different consumers want different
| features at different prices.
| tliltocatl wrote:
| Except it hinders adoption, because not having a feature
| in entry-level products will mean less incentive (and
| ability) for software developers to use it. Compatibility
| is so valuable it makes everyone converge on the least
| common denominator, so when you price-gouge on a
| software-exposed feature, you might as well bury this
| feature altogether.
| sitkack wrote:
| Three fallacies and you are OUT!
| Gud wrote:
| Indeed. Optane/3D XPoint was mind-blowing, futuristic stuff,
| but it was just gone after 5 years on the market? Talk about
| short-sighted.
| gnfargbl wrote:
| The rugpull on Optane was incredibly frustrating. Intel
| developed a technology which made really meaningful
| improvements to workloads in an industry that is full of
| sticky late adopters (RDBMSes). They kept investing until the
| point where they had unequivocally made their point and the
| late adopters were just about getting it... and _then_ killed
| it!
|
| It's hard to understand how they could have played that
| particular hand more badly. Even a few years on, I'm missing
| Optane drives because there is still no functional
| alternative. If they had just held out a bit longer, they would
| have created a set of enterprise customers who would still be
| buying the things in 2040.
| jerryseff wrote:
| Optane was incredible. It's insane that Intel dropped this.
| FpUser wrote:
| I am very disappointed about the loss of Optane drives. They
| were a perfect fit for a superfast, vertically scalable
| database. I was going to build a solution based on them, but
| suddenly they're gone for all practical intents and purposes.
| high_na_euv wrote:
| Optane was cancelled because the manufacturer sold the fab.
| buyucu wrote:
| Intel is horrible with software. My laptop has a pretty good
| iGPU, but it's not properly supported by PyTorch or most other
| software. Vulkan inference with llama.cpp does wonders, and it
| makes me sad that most software other than llama.cpp does not
| take advantage of it.
| kristianp wrote:
| Sounds like something to try. Do I just need to compile in
| Vulkan support to use the iGPU?
| tedunangst wrote:
| I mean, the most interesting part of the article for me:
|
| > A bit surprisingly the AVX2 parser on 9950X hit ~20GB/s! That
| is, it was better than the AVX-512 based parser by ~10%, which
| is pretty significant for Sep.
|
| They fixed it, that's the whole point, but I think there's
| evidence that AVX-512 doesn't actually benefit consumers that
| much. I would be willing to settle for a laptop that can only
| parse 20GB/s and not 21GB/s of CSV. I think vector assembly
| nerds care about support much more than users.
| vardump wrote:
| That probably just means it's a memory bandwidth bound
| problem. It's going to be a different story for tasks that
| require more computation.
| wyager wrote:
| You can still saturate an ultrawide vector unit with
| narrower instructions if you have wide enough dispatch.
| neonsunset wrote:
| AVX512 is not just about width. It ships with a lot of very
| useful instructions available for narrower vectors with
| AVX512VL. It also improves throughput per instruction. You
| usually aren't hand-writing intrinsified code, yet compilers,
| especially JIT ones, can make use of it for all sorts of
| common operations that become x times faster. In .NET, having
| AVX512 will speed up linear search, memory copying, and string
| comparison, which are straightforward, but it will also affect
| Regex performance, which uses SearchValues<T>; under the hood
| that is able to perform complex shuffles and vector
| lookups on larger vectors with much better throughput. AVX512
| lends itself to a more compact codegen (although .NET is not
| perfect in that regard, I think it sometimes regresses vs
| AVX2 with its instruction choices, but it's a matter of
| iterative improvement).
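|
| For illustration, the SearchValues<T> path looks roughly like
| this in user code (standard .NET 8+ API; the character set
| here is just an example):
|
|     using System;
|     using System.Buffers;
|
|     static class CsvProbe
|     {
|         // The runtime specializes this probe for whatever
|         // vector width the CPU offers (AVX-512 on a 9950X), so
|         // the same source gets faster on wider hardware with no
|         // intrinsics in user code.
|         static readonly SearchValues<char> Special =
|             SearchValues.Create("\r\n;\"");
|
|         public static int FirstSpecial(ReadOnlySpan<char> text) =>
|             text.IndexOfAny(Special);
|     }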
| Aurornis wrote:
| In this article, they saw the following speeds:
|
| Original: 18 GB/s
|
| AVX2: 20 GB/s
|
| AVX512: 21 GB/s
|
| This is an AMD CPU, but it's clear that the AVX512 benefits are
| marginal over the AVX2 version. Note that Intel's consumer
| chips do support AVX2, even on the E-cores.
|
| But there's more to the story: This is a single-threaded
| benchmark. Intel gave up AVX512 to free up die space for more
| cores. Intel's top of the line consumer part has 24 cores as a
| result, whereas AMD's top consumer part has 16. We'd have to
| look at actual Intel benchmarks to see, but if the AVX2 to
| AVX512 improvements are marginal, a multithreaded AVX2 version
| across more cores would likely outperform a multithreaded
| AVX512 version across fewer cores. Note that Intel's E-cores
| run AVX2 instructions slower than the P-cores, but again the
| AVX boost is marginal in this benchmark anyway.
|
| I know people like to get angry at Intel for taking a feature
| away, but the real-world benefit of having AVX512 instead of
| only AVX2 is very minimal. In most cases, it's probably offset
| by having extra cores working on the problem. There are very
| specific workloads, often single-threaded, that benefit from
| AVX-512, but on a blended mix of applications and benchmarks I
| suspect Intel made an informed decision to do what they did.
| neonsunset wrote:
| AVX2 vs AVX512 in this case may be somewhat misleading. In
| .NET, even if you use 256bit-wide vectors, it will still take
| advantage of AVX512VL whenever available to fuse chained
| operations into masked, vpternlogd's, etc.[0] (plus standard
| operations like stack zeroing, struct copying, string
| comparison, element search, and other can use the full
| width)[1]
|
| So to force true AVX2 the benchmark would have to be run with
| `DOTNET_EnableAVX512F=0`, which I assume is not the case here.
|
| [0]: https://devblogs.microsoft.com/dotnet/performance-
| improvemen...
|
| [1]: https://devblogs.microsoft.com/dotnet/performance-
| improvemen...
| ChadNauseam wrote:
| Isn't AVX-10 on the horizon, which will have most of the
| goodies that AVX-512 had? (I'm actually not even sure what the
| difference is supposed to be between them.)
| constantcrying wrote:
| There are very good alternatives to csv for storing and
| exchanging floating point/other data.
|
| The HDF5 format is very good and allows far more structure in
| your files, as well as metadata and different types of lossless
| and lossy compression.
| anthk wrote:
| > Net 9.0
|
| heh, do it again with mawk.
| jerryseff wrote:
| Christ using... .NET?
|
| I want to vomit.
|
| Use elixir, you can easily get this close using Rust NIFs and
| pattern matching.
| h4ck_th3_pl4n3t wrote:
| Then show us your elixir implementation?
| chpatrick wrote:
| In my experience I've found it difficult to get substantial gains
| with custom SIMD code compared to modern compiler auto-
| vectorization, but to be fair that was with more vector-friendly
| code than JSON parsing.
| theropost wrote:
| I need this. I just finished 300 GB of CSV extracts, and the
| manipulation, data integrity checks, and so on take longer than
| they should.
| haberman wrote:
| The article doesn't clearly define what this 21 GB/s code is
| doing.
|
| - What format exactly is it parsing? (e.g. does the dialect of CSV
| support quoted commas, or is the parser merely looking for commas
| and newlines)?
|
| - What is the parser doing with the result (i.e. populating a data
| structure, etc.)?
___________________________________________________________________
(page generated 2025-05-09 23:00 UTC)