[HN Gopher] 21 GB/s CSV Parsing Using SIMD on AMD 9950X
___________________________________________________________________
21 GB/s CSV Parsing Using SIMD on AMD 9950X
Author : zigzag312
Score : 227 points
Date : 2025-05-09 13:38 UTC (9 hours ago)
(HTM) web link (nietras.com)
(TXT) w3m dump (nietras.com)
| winterbloom wrote:
| This is a staggering ~3x improvement in just under 2 years since
| Sep was introduced in June 2023.
|
| You can't claim this when you also do a huge hardware jump
| WD-42 wrote:
| Yea wtf is that chart, it literally skips 4 cpu generations
| where it shows "massive performance gain".
|
| Straight to the trash with this post.
| g-mork wrote:
| It also appears to be reporting whole-CPU rather than single-
| thread numbers; 1.3 GB/s is not impressive for single-thread
| perf.
| iamleppert wrote:
| Agreed. How hard is it to keep hardware fixed, load the
| data into memory, and use a single core for your
| benchmarks? When I see a chart like that I think, "What
| else are they hiding?"
|
| Folks should check out https://github.com/dathere/qsv if
| they need an actually fast CSV parser.
| Remnant44 wrote:
| I mean... A single 9950x core is going to struggle to do
| more than 16 GB/second of direct mem copy bandwidth. So
| being within an order of magnitude of that seems reasonable.
| ziml77 wrote:
| But it repeats the 0.9.0 test on the new hardware. So the
| first big jump is a hardware change, but the second jump is
| the software changes.
| matja wrote:
| 4 generations?
|
| 5950x is Zen 3
|
| 9950x is Zen 5
| chupasaurus wrote:
| Since Zen 2 (3000), the mobile CPUs are numbered a thousand
| higher than their desktop counterparts. edit: Or Nx2000 where N
| is from Zen N.
| hinkley wrote:
| And even with 2, CPU generations aren't what they used to
| be back when a candy bar cost less than a dollar.
| freeone3000 wrote:
| They claim a 3 GB/s improvement versus the previous version of
| Sep on equal hardware -- and unlike "marketing" benchmarks, they
| include the actual speed achieved and the hardware used.
| stabbles wrote:
| Do note that this speed, even before the 3 GB/s improvement,
| exceeds the bandwidth of most disks, so the bottleneck is
| loading data into memory. I don't know of many applications
| where CSV is produced and consumed in memory, so I wonder what
| the use is.
| freeone3000 wrote:
| Slower than network! In-memory processing of OLAP tables,
| streaming splitters, large data set division... but also
| the faster the parser, the less time you spend parsing and
| the more you spend doing actual work
| tetha wrote:
| This is honestly something that caught me off-guard a
| bit. If you have good internal network connectivity,
| small queries and your relational database has the data
| in memory, it can be faster to fetch data from the DB via
| the network than reading it from disk.
|
| Like, sure, I can give you an application server with
| faster disks and more memory, and you or I are certainly
| capable of implementing an application server that could
| load the data from disk faster than all of that. And then
| we build caching to keep the hot data in memory, because
| that's faster.
|
| But then we've spent very advanced development resources
| to build a relational database with some application code
| at the edge.
|
| This can make sense in some high frequency trading
| situations, but in many more mundane web-backends, a
| chunky database and someone capable of optimizing stupid
| queries enable and simplify the work of a much bigger
| number of developers.
| bee_rider wrote:
| You can also get this with Infiniband, although it is
| less surprising, and basically what you'd expect to see.
|
| I did once use a system where the network bandwidth was
| in the same ballpark as the memory bandwidth, which might
| not be surprising for some of the real HPC-heads here but
| it surprised me!
| pdpi wrote:
| "We can parse at x GB/s" is more or less the reciprocal of
| "we need y% of your CPU capacity to saturate I/O".
|
| Higher x -> lower y -> more CPU for my actual workload.
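|
| For instance, assuming a 7 GB/s NVMe drive, a 21 GB/s parser
| needs only about a third of one core to keep up with the disk.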
| vardump wrote:
| Decompression is your friend. Usually CSV compresses really
| well.
|
| Multiple cores decompressing LZ4 compressed data can
| achieve crazy bandwidth. More than 5 GB/s per core.
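|
| A rough sketch of that kind of pipeline in C# (assuming the
| K4os.Compression.LZ4.Streams NuGet package; the file name is
| made up):
|
|     using System;
|     using System.IO;
|     using K4os.Compression.LZ4.Streams;
|
|     // Decompress an LZ4-framed CSV on the fly and feed it to a
|     // stream-based reader; LZ4 decode runs at several GB/s per
|     // core, so storage stops being the limit.
|     using var file = File.OpenRead("data.csv.lz4");
|     using var lz4 = LZ4Stream.Decode(file);
|     using var reader = new StreamReader(lz4);
|     long lines = 0;
|     while (reader.ReadLine() is not null) lines++;
|     Console.WriteLine($"{lines} lines");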
| jbverschoor wrote:
| They also included 0.9.0 vs 0.10.0 on the new hardware (21385
| vs 18203), so the jump due to software is 17%.
|
| Then if we take 0.9.0 on previous hardware (13088) and add the
| 17%, it's 15375. Version 0.1.0 was 7335.
|
| So... 15375/7335 -> a staggering 2.1x improvement in just under
| 2 years
| perching_aix wrote:
| > You can't claim this when you also do a huge hardware jump
|
| Well, they did. Personally, I find it an interesting way of
| looking at it; it's a lens for the "real performance" one could
| get using this software year over year. (Not saying it isn't a
| misleading or fallacious claim though.)
| criddell wrote:
| I was expecting to see assembly language and was pleasantly
| surprised to see C#. Very impressive.
|
| Nice work!
| gavinray wrote:
| Modern .NET has the deepest integration with SIMD and vector
| intrinsics among what most people would consider "high-level
| languages".
|
| https://learn.microsoft.com/en-us/dotnet/standard/simd
|
| Tanner Gooding at Microsoft is responsible for a lot of the
| developments in this area and has some decent blogposts on it,
| e.g.
|
| https://devblogs.microsoft.com/dotnet/dotnet-8-hardware-intr...
| voidUpdate wrote:
| I shudder to think who needs to process a million lines of csv
| that fast...
| segmondy wrote:
| Lots of folks in finance. You can share CSV with any finance
| company and they can process it. It's text.
| zzbn00 wrote:
| Humans generate decisions / text information at rates of
| ~bytes per second at most. There are barely enough humans
| around to generate 21 GB/s of information even if all they did
| was make financial decisions!
|
| So 21 GB/s would be solely algos talking to algos... Given
| all the investment in the algos, surely they don't need to be
| exchanging CSV around?
| internetter wrote:
| > Humans generate decisions / text information at rates of
| ~bytes per second at most
|
| Yes, but the consequences of these decisions are worth much
| more. You attach an ID to the user, and an ID to the
| transaction. You store the location and time where it was
| made. Etc.
| zzbn00 wrote:
| I think these would add only a small amount of information
| (and in a DB would be modelled as joins). It only adds lots of
| data if done very inefficiently.
| jajko wrote:
| Why are you theorizing? I can tell you from out there it's
| used massively, and it's not going away; on the contrary.
| Even rather small banks can end up generating various
| reports etc. which can easily become huge.
|
| The speed of human decisions plays basically no role here,
| just as it doesn't with messaging generally; there is far more
| to companies than a direct keyboard-to-output link.
| adrianN wrote:
| You might have accumulated some decades of data in that
| format and now want to ingest it into a database.
| zzbn00 wrote:
| Yes, but if you have decades of data, what hinges on whether
| you wait a minute or 10 minutes to convert it?
| hermitcrab wrote:
| CSV is a questionable choice for a dataset that size. It's
| not very efficient in terms of size (real numbers take more
| bytes to store as text than as binary), it's not the
| fastest to parse (due to escaping), and a single delimiter
| or escape out of place corrupts everything afterwards. That's
| not to mention all the issues around encoding, different
| delimiters, etc.
| zzbn00 wrote:
| It's great for when people need to be in the loop, looking
| at the data, maybe loading it in Excel, etc. (I use it
| myself...). But there aren't enough humans around for 21 GB/s.
| jstimpfle wrote:
| > (real numbers take more bytes to store as text than as
| binary)
|
| Depends on the distribution of numbers in the dataset.
| It's quite common to have small numbers. For those, text
| is a more efficient representation than binary, especially
| compared to 64-bit or larger binary encodings.
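|
| For example, the value 7 takes two bytes as "7," in text but
| eight bytes as a fixed-width 64-bit binary integer.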
| wat10000 wrote:
| Standards (whether official or de facto) often aren't the
| best in isolation, but they're the best in reality because
| they're widely used.
|
| Imagine you want to replace CSV for this purpose. From a
| purely technical view, this makes total sense. So you
| investigate, come up with a better standard, make sure it
| has all the capabilities everyone needs from the existing
| stuff, write a reference implementation, and go off to get
| it adopted.
|
| First place you talk to asks you two questions: "Which of
| my partner institutions accept this?" "What are the
| practical benefits of switching to this?"
|
| Your answer to the first is going to be "none of them" and
| the answer to the second is going to be vague hand-wavey
| stuff around maintainability and making programmers
| happier, with maybe a little bit of "this properly handles
| it when your clients' names have accent marks."
|
| Next place asks the same questions, and since the first
| place wasn't interested, you have the same answers....
|
| Replacing existing standards that are Good Enough is
| really, really hard.
| cyral wrote:
| The only real example I can think of is the US options
| market feed. It is up to something like 50 GiB/s now, and
| is open 6.5 hours per day. Even a small subset of the feed
| that someone may be working on for data analysis could be
| huge. I agree CSV shouldn't even be used here but I am sure
| it is.
| h4ck_th3_pl4n3t wrote:
| You seem to not realize that most humans are not coders.
|
| And non-coders use proprietary software, which usually has
| an export to CSV or XLS to be compatible with Microsoft
| Office.
| sunrunner wrote:
| I shudder to think of what it means to be storing the _results_
| of processing 21 GB/s of CSV. Hopefully some useful kind of
| aggregation, but if this was powering some kind of search over
| structured data then it has to be stored somewhere...
| devmor wrote:
| Just because you're processing 21GB/s of CSV doesn't mean you
| need all of it.
|
| If your data is coming from a source you don't own, it's
| likely to include data you don't need. Maybe there's 30
| columns and you only need 3 - or 200 columns and you only
| need 1.
|
| Enterprise ETL is full of such cases.
| hermitcrab wrote:
| For all its many weaknesses, I believe CSV is still the most
| common data interchange format.
| adra wrote:
| Erm, maybe file-based? JSON is the king if you count
| exchanges worldwide per second. Maybe No. 2 is form-data,
| which is basically email multipart, and of course there's
| email as a format. Very common =)
| hermitcrab wrote:
| I meant file-based.
| devmor wrote:
| I honestly wonder if JSON is king. I used to think so until
| I started working in fintech. XML is unfortunately
| everywhere.
| hermitcrab wrote:
| JSON isn't great for tabular data. And an awful lot of
| data is tabular.
| trollbridge wrote:
| It's become a very common interchange format, even internally;
| it's also easy to deflate. I have had to work on codebases
| where CSV was being pumped out at basically the speed of a NIC
| (its origin was Netflow, which was then aggregated and otherwise
| processed, and the results sent via CSV to a master for further
| aggregation and analysis).
|
| I really don't get, though, why people can't just use protocol
| buffers instead. Is protobuf really that hard?
| nobleach wrote:
| Extremely hard to tell an HR person, "Right-click on here in
| your Workday/Zendesk/Salesforce/etc UI and export a
| protobuf". Most of these folks in the business world LIVE in
| Excel/Spreadsheet land so a CSV feels very native. We can
| agree all day long that for actual data TRANSFER, CSV is
| riddled with edge cases. But it's what the customers are
| using.
| heavenlyblue wrote:
| It's extremely unlikely they need to load spreadsheets
| large enough for 21 GB/s speeds to matter.
| nobleach wrote:
| Oh absolutely! I'm just mentioning why CSV is chosen over
| Protobufs.
| SteveNuts wrote:
| You'd be surprised. Big telcos use CSV and SFTP for CDR
| data, and there's a lot of it.
| matja wrote:
| Kind of: there isn't a 1:1 mapping of protobuf wire types to
| schema types, so you need to either package the protobuf schema
| with the data and compile it to parse the data, or decide on
| the schema beforehand. So now you need to decide on a file
| format to bundle the schema and the data.
| bombela wrote:
| protobuf is more friction, and actually slow to write and
| read.
|
| For better or worse, CSV is easy to produce via printf, and
| easy to read by breaking lines and splitting by the delimiter.
| Escaping delimiters that appear in the content is not hard,
| though it is often added as an afterthought.
|
| Protobuf requires installing a library, understanding how it
| works, writing a schema file, and sharing the schema with
| others. The API is cumbersome.
|
| Finally, to offer this mutable struct via a setter and getter
| abstraction, with variable-length encoded numbers, variable-
| length strings, etc., the library ends up quite slow.
|
| In my experience protobuf is slow and memory hungry. The
| generated code is also quite bloated, which is not helping.
|
| See https://capnproto.org/ for details from the original
| creator of protobuf.
|
| Is CSV faster than protobuf? I don't know, and I haven't
| tested. But I wouldn't be surprised if it is.
| raron wrote:
| > For better or worse, CSV is easy to produce via printf.
| Easy to read by breaking lines and splitting by the
| delimiter. Escaping delimiters part of the content is not
| hard, though often added as an afterthought.
|
| Based on the amount of software I've seen that produces broken
| CSV or can't parse (more-or-less) valid CSV, I don't think that
| is true.
|
| It seems easy, because it's just printf("%s,%d,%d\n", ...), but
| it is full of edge cases most programmers don't think about.
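|
| For illustration, even the "write" side needs more than printf
| once quoting enters the picture. A minimal RFC 4180-style field
| escaper in C# (a sketch, not from any particular library):
|
|     // A field needs quoting if it contains the delimiter, a
|     // quote, or a line break; embedded quotes are doubled.
|     static string EscapeCsvField(string field, char delimiter = ',')
|     {
|         bool needsQuotes =
|             field.IndexOfAny(new[] { delimiter, '"', '\r', '\n' }) >= 0;
|         return needsQuotes
|             ? "\"" + field.Replace("\"", "\"\"") + "\""
|             : field;
|     }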
| to11mtm wrote:
| I'm not the biggest fan of Protobuf, mostly around the
| 'perhaps-too-minimal' typing of the system and the
| performance differentials present on certain languages in the
| library.
|
| e.g. I know in the .NET space, MessagePack is usually faster
| than proto, and I think similar is true for the JVM. The main
| disadvantage is there's no good schema-based tooling around it.
| moregrist wrote:
| I have. I think it's a pretty easy situation for certain kinds
| of startups to find themselves in:
|
| - Someone decides on CSV because it's easy to produce and you
| don't have that much data. Plus it's easier for the <non-
| software people> to read so they quit asking you to give them
| Excel sheets. Here <non-software people> is anyone who has a
| legit need to see your data and knows Excel really well. It can
| range from business types to lab scientists.
|
| - Your internal processes start to consume CSV because it's
| what you produce. You build out key pipelines where one or more
| steps consume CSV.
|
| - Suddenly your data increases by 10x or 100x or more because
| something started working: you got some customers, your sensor
| throughput improved, the science part started working, etc.
|
| Then it starts to make sense to optimize ingesting millions or
| billions of lines of CSV. It buys you time so you can start
| moving your internal processes (and maybe some other teams'
| stuff) to a format more suited for this kind of data.
| ourmandave wrote:
| That cartesian product file accounting sends you at year end?
| constantcrying wrote:
| In basically every situation it is inferior to HDF5.
|
| I do not think there is an actual explanation besides
| ignorance, laziness or "it works".
| pak9rabid wrote:
| Ugh.....I do unfortunately.
| vessenes wrote:
| If we are lucky we will see Arthur Whitney get triggered and post
| either a one liner beating this or a shakti engine update and a
| one liner beating this. Progress!
| stabbles wrote:
| Instead of doing 4 comparisons, one against each of the
| characters `\n`, `\r`, `;` and `"`, followed by 3 OR operations,
| a common trick is to do 1 shuffle, 1 comparison and 0 OR
| operations. I blogged about this
| trick: https://stoppels.ch/2022/11/30/io-is-no-longer-the-
| bottlenec... (Trick 2)
|
| Edit: they do make use of ternary logic to avoid one OR
| operation, which is nice. Basically (a | b | c) | d is computed
| using `vpternlogd` and `vpor` respectively.
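|
| A minimal C# sketch of that nibble-lookup idea (illustrative
| only, not Sep's actual code; assumes SSSE3 and .NET intrinsics):
|
|     using System.Runtime.Intrinsics;
|     using System.Runtime.Intrinsics.X86;
|
|     static class CsvClassify
|     {
|         // Low-nibble lookup table: '"' (0x22), '\n' (0x0A),
|         // ';' (0x3B) and '\r' (0x0D) all have distinct low
|         // nibbles (2, A, B, D). Unused slots hold 0xFF, which
|         // no ASCII input byte can ever equal.
|         static readonly Vector128<byte> Table = Vector128.Create(
|             (byte)0xFF, 0xFF, (byte)'"', 0xFF,
|             0xFF, 0xFF, 0xFF, 0xFF,
|             0xFF, 0xFF, (byte)'\n', (byte)';',
|             0xFF, (byte)'\r', 0xFF, 0xFF);
|
|         // Returns 0xFF in every lane holding one of: \n \r ; "
|         public static Vector128<byte> Classify(Vector128<byte> v)
|         {
|             // PSHUFB: out[i] = Table[v[i] & 0x0F], or 0 when
|             // v[i] >= 0x80, so high-bit bytes never match.
|             Vector128<byte> looked = Ssse3.Shuffle(Table, v);
|             // One compare replaces four compares and three ORs.
|             return Sse2.CompareEqual(looked, v);
|         }
|     }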
| justinhj wrote:
| really cool thanks
| Aardwolf wrote:
| Take that, Intel and your "let's remove AVX-512 from every
| consumer CPU because we want to put slow cores on every single
| one of them and also not consider multi-pumping it"
| tadfisher wrote:
| A lot of this stems from the 10nm hole they had to dig
| themselves out from. Yields are bad, so costs are high, so
| let's cut the die as much as possible, ship Atom-derived cores
| and market it as an energy-saving measure. The expensive parts
| can be bigger and we'll cut the margins on those to retain the
| server/cloud sector. Also our earnings go into the shitter and
| we lose market share anyway, but at least we tried.
| wtallis wrote:
| This issue is less about Intel's fab failures and more about
| their inability to decouple their architecture update cadence
| from their fab progress. They stopped iterating on their CPU
| designs while waiting for 10nm to get fixed. That left them
| with an oversized P core and an outdated E core, and all they
| could do for Alder Lake was slap them onto one die and ship
| it, with no ability to produce a well-matched pair of core
| designs in any reasonable time frame. We're _still_ seeing
| weird consequences of their inability to port CPU designs
| between processes and fabs: this year's laptop processors
| have HyperThreading only in the lowest-cost parts--those that
| still have the CPU chiplet fabbed at Intel while the higher
| core count parts are made by TSMC.
| imtringued wrote:
| Considering the non-standard nature of CSV, quoting throughput
| numbers in bytes is meaningless. It makes sense for JSON, since
| you know what the output is going to be (e.g. floats, integers,
| strings, hashmaps, etc). With CSV you only get strings for each
| column, so 21 GB/s of comma splitting would be the pinnacle of
| meaninglessness. Like, okay, but I still have to parse the
| stringy data, so what gives? Yeah, the blog post does reference
| float parsing, but a single float per line would count as "CSV".
|
| Now someone might counter and say that I should just read the
| README.MD, but then that suspicion simply turns out to be true:
| They don't actually do any escaping or quoting by default, making
| the quoted numbers an example of heavily misleading advertising.
| liuliu wrote:
| CSV is standardized in RFC 4180 (well, as standardized as most
| of what we consider internet "standards").
|
| Otherwise agree, if you don't do escaping (a.k.a. "quoting",
| the same thing for CSV), you are not implementing it correctly.
| For example, if you quote a line break, in RFC 4180 that line
| break will be part of the quoted string; but if you don't need
| to handle that, you can implement CSV parsing much faster
| (properly handling line breaks within quoted strings requires a
| 2-pass approach if you are going to use many cores, while not
| handling them at all can be done in a single pass). I discussed
| this detail in https://liuliu.me/eyes/loading-csv-file-at-the-speed-
| limit-o...
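|
| For example, per RFC 4180 the following is a single record with
| two fields, even though it spans two physical lines:
|
|     "first part
|     second part",42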
| a3w wrote:
| Side note: RFCs are great standards, as they are readable.
|
| As an example of how not to do it: XML can be considered a
| standard, but I cannot afford to read it. DIN/ISO is great
| for manufacturing in theory, but bad for fields like IT that
| expect zero-cost initial investment.
| zeristor wrote:
| Why not use Parquet?
| mcraiha wrote:
| Excel does not output Parquet.
| speed_spread wrote:
| True. But also Excel probably collapses into a black hole
| going straight to hell trying to handle 21GB of data.
| hermitcrab wrote:
| Excel .xlsx files are limited to 1,048,576 rows and 16,384
| columns.
|
| Excel .xls files are limited to 65,536 rows and 256
| columns.
| mihular wrote:
| 21GB/s, not 21GB ...
| anthk wrote:
| mawk would handle a 21 GB csv (or maybe one true awk)
| fast enough.
| buyucu wrote:
| Excel often outputs broken csv :)
| hinkley wrote:
| I have been privileged in my career to never need to parse
| Excel output but occasionally feed it input. Especially
| before Grafana was a household name.
|
| Putting something out so your manager stops asking you 20
| questions about the data is a double-edged sword though.
| Those people can hallucinate more than a pre-Covid AI
| engine. Grafana is just weird enough that people would
| rather consume a chart than try to make one, so you have
| some control over the acid trip.
| constantcrying wrote:
| Or HDF5 or any other format which is actually meant to store
| large amounts of floating point data.
| chao- wrote:
| It feels crazy to me that Intel spent years dedicating die space
| on consumer SKUs to "make fetch happen" with AVX-512, and as more
| and more libraries are finally using it, as Intel's goal is
| achieved, they have removed AVX-512 from their consumer SKUs.
|
| It isn't that AMD has better AVX-512 support, which would be an
| impressive upset on its own. Instead, it is only that AMD has
| AVX-512 on consumer CPUs, because Intel walked away from their
| own investment.
| MortyWaves wrote:
| It's wild seeing how stupid Intel is being.
| neonsunset wrote:
| If it's any consolation, Sep will happily use AVX-512 whenever
| available, without having to opt into that explicitly,
| including the server parts, as it will most likely run under a
| JIT runtime (although it's NAOT-compatible). So you're not
| missing out by being forced to target the lowest common
| denominator.
| sitkack wrote:
| That is what Intel does: they build up a market (Optane) and
| then do a rug pull (Depth Cameras). They continue to do this
| thing where they do a huge push into a new technology, then
| don't see the uptake and let it die, instead of building slowly
| and then, at the right time, doing a big push. Optane support
| was _just getting mature_ in the Linux kernel when they pulled
| it. And they focused on some weird cost-cutting move when
| marketing it as a RAM replacement for semi-idle VMs, ok.
|
| They keep repeating the same mistakes all the way back to
| https://en.wikipedia.org/wiki/Intel_iAPX_432
| sebmellen wrote:
| Bad habits are hard to break!
| etaioinshrdlu wrote:
| Well, Itanium might be a counterexample; they probably tried
| to make that work for far too long...
| sitkack wrote:
| Itanium worked as intended.
| paddy_m wrote:
| In so far as it killed HP PA-RISC, SGI MIPS, and DEC Alpha,
| and seriously hurt the chances of adoption for SPARC and
| POWER outside of their respective parents (did I miss
| any)?
|
| The thing is, they could have killed it by 1998, without ever
| releasing anything, and that would still have killed the other
| architectures it was trying to compete with. Instead they
| waited until 2020 to end support.
|
| What the VLIW of Itanium needed and never really got was
| proper compiler support. Nvidia has this in spades with
| CUDA. It's easy to port to Nvidia where you do get
| serious speedups. AVX-512 never offered enough of a
| speedup from what I could tell, even though it was well
| supported by at least ICC (and numpy/scipy when properly
| compiled).
| knowitnone wrote:
| "they could have killed it by 1998, without ever
| releasing anything"
|
| perhaps Intel really wanted it to work and killing other
| architectures was only a side effect?
| kyboren wrote:
| > What the VLIW of Itanium needed and never really got
| was proper compiler support.
|
| This is kinda under-selling it. The fundamental problem
| with statically-scheduled VLIW machines like Itanium is that
| it puts all of the complexity in the compiler.
| Unfortunately it turns out it's just really hard to make
| a good static scheduler!
|
| In contrast, dynamically-scheduled out-of-order
| superscalar machines work great but put all the
| complexity in silicon. The transistor overhead was
| expensive back in the day, so statically-scheduled VLIWs
| seemed like a good idea.
|
| What happened was that static scheduling stayed really
| hard while the transistor overhead for dynamic scheduling
| became irrelevantly cheap. "Throw more hardware at it"
| won handily over "Make better software".
| bri3d wrote:
| No, VLIW is even worse than this. Describing it as a
| compiler problem undersells the issue. VLIW is not
| tractable for a multitasking / multi tenant system due to
| cache residency issues. The compiler cannot efficiently
| schedule instructions without knowing what is in cache.
| But, it can't know what's going to be in cache if it
| doesn't know what's occupying the adjacent task time
| slices. Add virtualization and it's a disaster.
| sitkack wrote:
| It only works for fixed workloads, like accelerators,
| with no dynamic sharing.
| mrweasel wrote:
| Itanium was more of an HP product than an Intel one.
| sheepscreek wrote:
| > They continue to do this thing where they do a huge push
| into a new technology, then don't see the uptake and let it
| die.
|
| Except Intel deliberately made AVX-512 a feature exclusively
| available to Xeon and enterprise processors in future
| generations. This backward step artificially limits its
| availability, forcing enterprises to invest in more expensive
| hardware.
|
| I wonder if Intel has taken a similar approach with Arc GPUs,
| which lack support for GPU virtualization (SR-IOV). They
| somewhat added vGPU support to all built-in 12th-14th Gen
| chips through the i915 driver on Linux. It's a pleasure to
| have graphics-acceleration in multiple VMs simultaneously,
| through the same GPU.
| sitkack wrote:
| They go out of their way to segment their markets: ECC, AVX,
| Optane support (only on specific server-class SKUs). I hate
| it; I hate it as a home PC user, I hate it as an enterprise
| customer, and I hate it as a shareholder.
| knowitnone wrote:
| Every company does this. If your grandma only uses a
| web browser, word processor, and Excel, does she really
| want to spend an additional $50 on a feature she'll not
| use? Same with NPUs. Different consumers want different
| features at different prices.
| tliltocatl wrote:
| Except it hinders adoption, because not having a feature
| in entry-level products will mean less incentive (and
| ability) for software developers to use it. Compatibility
| is so valuable it makes everyone converge on the least
| common denominator, so when you price-gouge on a
| software-exposed feature, you might as well bury this
| feature altogether.
| sitkack wrote:
| Three fallacies and you are OUT!
| Gud wrote:
| Indeed. Optane/3D XPoint was mind-blowing, futuristic stuff,
| but it was just gone after 5 years on the market? Talk about
| short-sighted.
| gnfargbl wrote:
| The rugpull on Optane was incredibly frustrating. Intel
| developed a technology which made really meaningful
| improvements to workloads in an industry that is full of
| sticky late adopters (RDBMSes). They kept investing until the
| point where they had unequivocally made their point and the
| late adopters were just about getting it... and _then_ killed
| it!
|
| It's hard to understand how they could have played that
| particular hand more badly. Even a few years on, I'm missing
| Optane drives because there is still no functional
| alternative. If they had just held out a bit longer, they would
| have created a set of enterprise customers who would still be
| buying the things in 2040.
| jerryseff wrote:
| Optane was incredible. It's insane that Intel dropped this.
| FpUser wrote:
| I am very disappointed about the loss of Optane drives. They
| were a perfect fit for a superfast, vertically scalable
| database. I was going to build a solution based on them, but
| suddenly they're gone for all practical intents and purposes.
| high_na_euv wrote:
| Optane was cancelled because the manufacturer sold the fab.
| buyucu wrote:
| Intel is horrible with software. My laptop has a pretty good
| iGPU, but it's not properly supported by PyTorch or most other
| software. Vulkan inference with llama.cpp does wonders, and it
| makes me sad that most software other than llama.cpp does not
| take advantage of it.
| kristianp wrote:
| Sounds like something to try. Do I just need to compile in
| Vulkan support to use the iGPU?
| tedunangst wrote:
| I mean, the most interesting part of the article for me:
|
| > A bit surprisingly the AVX2 parser on 9950X hit ~20GB/s! That
| is, it was better than the AVX-512 based parser by ~10%, which
| is pretty significant for Sep.
|
| They fixed it, that's the whole point, but I think there's
| evidence that AVX-512 doesn't actually benefit consumers that
| much. I would be willing to settle for a laptop that can only
| parse 20GB/s and not 21GB/s of CSV. I think vector assembly
| nerds care about support much more than users.
| vardump wrote:
| That probably just means it's a memory bandwidth bound
| problem. It's going to be a different story for tasks that
| require more computation.
| wyager wrote:
| You can still saturate an ultrawide vector unit with
| narrower instructions if you have wide enough dispatch.
| neonsunset wrote:
| AVX512 is not just about width. It ships with a lot of very
| useful instructions available for narrower vectors with
| AVX512VL. It also improves throughput per instruction. You
| usually aren't hand-writing intrinsified code, yet compilers,
| especially JIT ones, can make use of it for all sorts of
| common operations that become x times faster. In .NET, having
| AVX512 will speed up linear search, memory copying, and string
| comparison, which are straightforward, but it will also affect
| Regex performance, which uses SearchValues<T>; under the hood
| that is able to perform complex shuffles and vector
| lookups on larger vectors with much better throughput. AVX512
| lends itself to a more compact codegen (although .NET is not
| perfect in that regard, I think it sometimes regresses vs
| AVX2 with its instruction choices, but it's a matter of
| iterative improvement).
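|
| For illustration, the SearchValues<T> path looks roughly like
| this in user code (standard .NET 8+ API; the character set
| here is just an example):
|
|     using System;
|     using System.Buffers;
|
|     static class CsvProbe
|     {
|         // The runtime specializes this probe for whatever
|         // vector width the CPU offers (AVX-512 on a 9950X), so
|         // the same source gets faster on wider hardware with no
|         // intrinsics in user code.
|         static readonly SearchValues<char> Special =
|             SearchValues.Create("\r\n;\"");
|
|         public static int FirstSpecial(ReadOnlySpan<char> text) =>
|             text.IndexOfAny(Special);
|     }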
| Aurornis wrote:
| In this article, they saw the following speeds:
|
| Original: 18 GB/s
|
| AVX2: 20 GB/s
|
| AVX512: 21 GB/s
|
| This is an AMD CPU, but it's clear that the AVX512 benefits are
| marginal over the AVX2 version. Note that Intel's consumer
| chips do support AVX2, even on the E-cores.
|
| But there's more to the story: This is a single-threaded
| benchmark. Intel gave up AVX512 to free up die space for more
| cores. Intel's top of the line consumer part has 24 cores as a
| result, whereas AMD's top consumer part has 16. We'd have to
| look at actual Intel benchmarks to see, but if the AVX2 to
| AVX512 improvements are marginal, a multithreaded AVX2 version
| across more cores would likely outperform a multithreaded
| AVX512 version across fewer cores. Note that Intel's E-cores
| run AVX2 instructions slower than the P-cores, but again the
| AVX boost is marginal in this benchmark anyway.
|
| I know people like to get angry at Intel for taking a feature
| away, but the real-world benefit of having AVX512 instead of
| only AVX2 is very minimal. In most cases, it's probably offset
| by having extra cores working on the problem. There are very
| specific workloads, often single-threaded, that benefit from
| AVX-512, but on a blended mix of applications and benchmarks I
| suspect Intel made an informed decision to do what they did.
| neonsunset wrote:
| AVX2 vs AVX512 in this case may be somewhat misleading. In
| .NET, even if you use 256bit-wide vectors, it will still take
| advantage of AVX512VL whenever available to fuse chained
| operations into masked, vpternlogd's, etc.[0] (plus standard
| operations like stack zeroing, struct copying, string
| comparison, element search, and other can use the full
| width)[1]
|
| So to force true AVX2 the benchmark would have to be run with
| `DOTNET_EnableAVX512F=0`, which I assume is not the case here.
|
| [0]: https://devblogs.microsoft.com/dotnet/performance-
| improvemen...
|
| [1]: https://devblogs.microsoft.com/dotnet/performance-
| improvemen...
| ChadNauseam wrote:
| Isn't AVX-10 on the horizon, which will have most of the
| goodies that AVX-512 had? (I'm actually not even sure what the
| difference is supposed to be between them.)
| constantcrying wrote:
| There are very good alternatives to csv for storing and
| exchanging floating point/other data.
|
| The HDF5 format is very good and allows far more structure in
| your files, as well as metadata and different types of lossless
| and lossy compression.
| anthk wrote:
| > Net 9.0
|
| heh, do it again with mawk.
| jerryseff wrote:
| Christ using... .NET?
|
| I want to vomit.
|
| Use elixir, you can easily get this close using Rust NIFs and
| pattern matching.
| h4ck_th3_pl4n3t wrote:
| Then show us your elixir implementation?
| chpatrick wrote:
| In my experience I've found it difficult to get substantial gains
| with custom SIMD code compared to modern compiler auto-
| vectorization, but to be fair that was with more vector-friendly
| code than JSON parsing.
| theropost wrote:
| I need this. I just finished 300 GB of CSV extracts, and the
| manipulation, data integrity checks, and so on take longer than
| they should.
| haberman wrote:
| The article doesn't clearly define what this 21 GB/s code is
| doing.
|
| - What format exactly is it parsing? (e.g. does the dialect of CSV
| support quoted commas, or is the parser merely looking for commas
| and newlines)?
|
| - What is the parser doing with the result (i.e. populating a data
| structure, etc.)?
___________________________________________________________________
(page generated 2025-05-09 23:00 UTC)