[HN Gopher] FP8 is ~100 tflops faster when the kernel name has "...
___________________________________________________________________
FP8 is ~100 tflops faster when the kernel name has "cutlass" in it
Author : limoce
Score : 220 points
Date : 2025-07-11 10:36 UTC (12 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| KomoD wrote:
| actual link: https://github.com/triton-lang/triton/pull/7298
| bede wrote:
| Thank you, perhaps the parent can be edited to use this URL
| instead
| orlp wrote:
| GenuineIntel moment.
| hofrogs wrote:
| I'm interested in that story, what are you referring to with
| "GenuineIntel"?
| orlp wrote:
| Intel's C++ compiler is known to add branches in its
| generated code checking if the CPU is "GenuineIntel" and if
| not use a worse routine: https://en.wikipedia.org/wiki/Intel_
| C%2B%2B_Compiler#Support....
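|
| Roughly, the idea of the emitted dispatch looks like this.
| (An illustrative Python sketch only; the real check is done
| with the CPUID instruction in the compiled binary, and the
| function names here are made up. On Linux the vendor string
| can also be read from /proc/cpuinfo.)
|
|     def fast_simd_memcpy(dst, src):        # hypothetical fast path
|         dst[:] = src
|
|     def generic_baseline_memcpy(dst, src):  # hypothetical fallback
|         for i, b in enumerate(src):
|             dst[i] = b
|
|     def load_vendor_string():
|         # Read the CPU vendor string ("GenuineIntel",
|         # "AuthenticAMD", ...) from /proc/cpuinfo.
|         with open("/proc/cpuinfo") as f:
|             for line in f:
|                 if line.startswith("vendor_id"):
|                     return line.split(":", 1)[1].strip()
|         return "unknown"
|
|     def pick_memcpy():
|         # Keying the choice on the vendor string rather than
|         # on feature flags (SSE/AVX) is the objectionable
|         # part: non-Intel CPUs fall back to the slow routine
|         # even when they support the same instructions.
|         if load_vendor_string() == "GenuineIntel":
|             return fast_simd_memcpy
|         return generic_baseline_memcpy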
| pieterbreed wrote:
| Is this for the runtime of the compiled code or for the
| compiling machine? Do they generate slow code if the
| compiler is running on non-intel?
| SSLy wrote:
| the runtime. patching cpuid makes the code go faster
| kstrauser wrote:
| For the compiled code. Its output deliberately runs
| slower on non-Intel CPUs.
| Uvix wrote:
| Runtime of the compiled code. The ostensible intent is so
| that new processors can use new features like SIMD, while
| offering a fallback for older ones. In practice, they're
| detecting an Intel processor, not just the specific
| feature.
| danieldk wrote:
| Also MKL:
|
| https://danieldk.eu/Intel-MKL-on-AMD-Zen
| bayindirh wrote:
| Even in the middle of that turmoil, we managed to compile
| some code with Intel's ICC and make it go faster on AMD
| Opterons, breaking Intel's own numbers.
|
| When my colleague said that they managed to go faster than
| Intel with ICC using some hand-tuned parameters, I remember
| answering "youdidwat?".
|
| Good times.
| reitzensteinm wrote:
| Or maybe Quack III: Arena. https://m.slashdot.org/story/21054
| iforgotpassword wrote:
| I think that was the first case (to go public), but I
| remember reading about similar cases in game magazines a
| couple of times afterwards, for both ATI and Nvidia.
| 42lux wrote:
| Now I want a Quake shooter but with ducks.
| carlos22 wrote:
| Not ducks, but chickens. This one was very popular in
| Germany back in the day:
| https://en.wikipedia.org/wiki/Crazy_Chicken
| avhception wrote:
| Oh wow, that was a blast from the past. The Moorhuhn
| craze!
|
| Many people, including me, didn't have an internet
| connection back in the day. The Sneakernet went into
| overdrive to get everyone a copy!
| supportengineer wrote:
| A Duck Hunt, if you will...
| dahauns wrote:
| Aah, that brings back memories...
|
| Interestingly, what counted as a benchmark controversy back
| in the day is now expected behaviour, i.e. game-specific
| optimizations with no visible image degradation (well, in
| this age of upscalers and other lossy optimization
| techniques, probably even somewhat visible). A gaming-
| specific driver with no game-specific improvements in its
| changelog would be considered strange, and it very much
| works via executable detection.
|
| Back in the day, there was still the argument that drivers
| should not optimize for benchmarks even when visually
| identical, because it wouldn't show the hardware's real world
| potential. Kinda cute from today's perspective. :)
|
| But of course there were the obvious cases...
|
| The Quack3 case of lowered filtering quality, as shown
| above, of course (at least that one was later exposed in
| the driver as a togglable setting).
|
| But the cheekiest one has to be nVidia's 3DMark03
| "optimizations", where they blatantly put static clip planes
| into the scenes so that everything outside the predefined
| camera path of the benchmark sequence would simply be cut
| from the scene early (which e.g. fully broke the freelook
| patched into 3DMark and would generally break any
| interactive application).
| bayindirh wrote:
| You beat me to it. _Grrr..._
|
| Just kidding, nice to see another person who remembers
| these things. Want some root beer?
| bayindirh wrote:
| Ooh, I remember this, but the practice is actually older
| than that.
|
| First, nVidia and ATI used executable names for detecting
| games, then they started to add heuristics.
|
| If you think they stopped the practice, you're _very_
| mistaken. Every AMD and nVidia driver has game- and app-
| specific _fixes and optimizations_.
|
| nVidia cheated in 3DMark that way, so the benchmark's
| developers patched/changed it to prevent this. nVidia also
| patched their drivers so that some of the more expensive but
| visually invisible calls, like scene flushes in a particular
| game, are batched (e.g. do all 50 flushes at the 50th call)
| to prevent the game from becoming a slide show on expensive
| hardware.
|
| This is also why AMD's and Intel's open source drivers under
| Linux are a success: they are _vanilla_ drivers written
| from scratch per spec, and if your code calls OpenGL/Vulkan
| to spec, then you're golden.
|
| Some companies even cross-compile AMD's Linux drivers for
| Windows on embedded systems, since they're free of those
| per-game optimizations.
| koakuma-chan wrote:
| is 100 tflops a lot?
| brightmood wrote:
| yea
| saagarjha wrote:
| It's like 5-10% here
| irrelative wrote:
| Correct, this is the actual headline too. 100 tflops sure
| seems like it'd be more than that, but here we are.
|
| If the headline were "FP8 is ~7% faster when kernel name has
| 'cutlass' in it...", it wouldn't seem sensational.
| saagarjha wrote:
| I think the interesting part is that it improves
| performance measurably at all, not the actual number. These
| people are trying to hit 90+% MFU (though most don't reach
| it) so this does actually translate to many millions of
| dollars for them.
| progx wrote:
| 5060 ti +~15%
| HideousKojima wrote:
| According to Terminator 3 Skynet used a mere 60 TFLOPS
| IAmBroom wrote:
| How much is that in jiggawatts per parsec?
| nolok wrote:
| Intel's quest to move from "trusted by default / the reference"
| to "check for scam" is getting worse every release. And it's 100%
| self inflicted. How weird.
| pkhuong wrote:
| NVIDIA-inflicted in this case.
| aleph_minus_one wrote:
| In my understanding of the PR, it rather seems that _NVidia_
| is the company that is cheating. :-)
| hvenev wrote:
| In `libnvidia-nvvm.so` the string `cutlass` appears right after
| `Memory Dependence Analysis` and `memdep`. Perhaps it acts as an
| optimization attribute of some sort, where the compiler is
| allowed to make assumptions about the kernel's behavior that are
| not valid in general?
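|
| A toy sketch of what such a name-gated heuristic could look
| like (purely speculative; the function and flag names are
| invented for illustration and are not taken from
| libnvidia-nvvm.so):
|
|     def plan_memdep_optimizations(kernel_name: str) -> dict:
|         # Hypothetical compiler pass: if the kernel looks like
|         # it came from a known library, assume stricter
|         # invariants (e.g. no aliasing between tiles) and
|         # unlock more aggressive instruction scheduling.
|         assumptions = {
|             "assume_no_aliasing": False,
|             "aggressive_scheduling": False,
|         }
|         if "cutlass" in kernel_name:
|             assumptions["assume_no_aliasing"] = True
|             assumptions["aggressive_scheduling"] = True
|         return assumptions
|
|     # A generic user kernel misses the fast path...
|     print(plan_memdep_optimizations("attn_fwd_fp8"))
|     # ...until it is renamed, as the Triton PR does.
|     print(plan_memdep_optimizations("cutlass_attn_fwd_fp8"))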
| high_na_euv wrote:
| That's very likely imo
| jdright wrote:
| Yes, that's a very common way (a known practice) for vendors
| to apply specific optimizations to known workloads.
|
| It is also part of the benchmark games they play against
| each other.
| MichaelZuo wrote:
| It's really strange for established companies to waste their
| credibility on games like that...
| MangoToupe wrote:
| I was pretty young at the time, but I recall the market for
| graphics being a _lot_ more wide open when Quake was
| released. Remember 3dfx? They produced the Voodoo series of
| graphics cards. They're barely even a distant memory now.
|
| Quake was also _the_ standard for a game that was willing
| to fully exploit the hardware of the time.
| IAmBroom wrote:
| Never underestimate how much human ego will control
| actions.
| MBCook wrote:
| The link is long dead and the Wayback machine doesn't have a
| copy.
|
| But in 2001 ATI was caught applying optimizations to Quake 3
| when someone realized if you renamed the executable from
| "quake" to "quack" the score dropped a ton. It was a big
| scandal.
|
| I know that's common now but that wasn't a thing that was
| done at the time.
| hinkley wrote:
| There are bugs that certain games rely on and features that
| some don't use. I'm currently trying to optimize a library
| out of spite. (I want it to work better than the competitor
| that caused me a lot of problems on a recent project). The
| amount of conditional logic around what is essentially a
| function to increment a value is breathtaking.
| gatlin wrote:
| Do you have any kind of example you're able to share? I
| don't mean to take your IP but I want to see this
| breathtaking vista.
| hinkley wrote:
| To avoid doxxing myself: In a deep call stack it's
| possible to end up sanitizing inputs multiple times and
| in different ways.
|
| A frequent example I've encountered is web frameworks
| that have to keep checking for escaped text because they
| didn't write it in horizontal layers where you know for
| sure that all inputs have been scrubbed when they reach
| this function but not that one. So the same functions get
| called with data that comes from your team and from
| customers. Reuse is tricky.
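|
| A minimal illustration of the failure mode, using Python's
| stdlib html.escape as a stand-in for whatever escaping a
| framework does (the layering here is hypothetical): when no
| layer knows for sure whether its input has already been
| scrubbed, defensive re-escaping mangles the output.
|
|     from html import escape
|
|     def render_comment(text: str) -> str:
|         # Layer A "defensively" escapes...
|         return f"<p>{escape(text)}</p>"
|
|     def render_page(comment: str) -> str:
|         # ...and layer B, unsure whether it received raw or
|         # escaped input, escapes again.
|         return f"<body>{escape(render_comment(comment))}</body>"
|
|     print(render_page("Tom & Jerry"))
|     # <body>&lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt;</body>
|     # The user ends up seeing "&amp;" instead of "&".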
| hamburglar wrote:
| "Checking for escaped text" is the sort of nonsense that
| tells you you're dealing with amateur developers.
| amiga386 wrote:
| A simple example would be that the function
| glGetString(GL_EXTENSIONS) crashes the original Quake
| engine and many licensees, because it's expecting no more
| than a 256 character string.
|
| The driver looks to see if a known old game is calling
| it, and if it's one known to crash, it returns no more
| than 256 characters, and likely also puts all the _old_
| extensions that the game is likely to know and react to
| in the string.
|
| There are also all sorts of games that called APIs in a
| particular order or set particular options, because they
| represented a "fast path" at the time, and now they
| don't, but if you're that program, then yes they do.
|
| Ultimately, this clutter is what led to the development
| of the Vulkan API, to stop games second-guessing graphics
| APIs which themselves second-guess the games.
| IAmBroom wrote:
| In at least one past version of Windows (circa 1990s), if
| you tried to replace the default web browser of IE with
| another choice you were given an Open File dialog window to
| choose the executable.
|
| Funny quirk, though: that particular window wouldn't show
| files named firefox.exe. It would accept that as typed
| input, if you were at the correct folder, but the file
| listing omitted that particular file.
|
| Maybe it was mozilla.exe; it was a long time ago. But that
| was the discovery that pushed me off IE forever.
| lstamour wrote:
| I vaguely remember that being the start of the browser
| prompts to set your current browser as the default. It
| was so hard to just configure that they had to build a
| way to set it within the browser.
|
| You saw that again in more modern times when Microsoft
| removed support for the APIs they provided to set browser
| defaults, forcing browser makers to write step by step
| instructions on what to click to set the default browser.
|
| I believe they walked that back, but it left such a bad
| taste that I switched my installation of Windows from
| default mode to EU mode in order to avoid it. And come to
| think of it, I haven't used my windows machine for much
| outside of AI in about 6 months.
|
| But Microsoft is not alone in this sort of defaults
| game: every OS or browser maker (Apple, Google,
| Firefox) wants to create moats so they can more easily
| monetize your usage of a product. I never thought I'd
| prefer the business model of free-to-play games, where
| they just outright ask you for money and have to keep
| finding new ways to entertain, instead of relying on
| hard-to-change defaults and selling your data.
| charcircuit wrote:
| An app being able to set itself as the default browser
| sounds like such a dangerous API, especially if it can be
| done silently without the user realizing it.
| atomicnumber3 wrote:
| Was it a scandal at the time? My understanding of how per-
| game card-driver optimizations work today is:
|
| 1. AAAA Game Studio shits out another unoptimized clunker
|
| 2. nvidia considers it a reputational risk if games run at
| 30 FPS on a 5090
|
| 3. They go in, look at the perverse ways the game misuses
| rendering primitives, and then hack shit in to make
| whatever bad things it's doing less bad.
|
| As a gamer, this seems fine to me, and I generally blame the
| AAAA devs for being bad at their jobs or AAAA studio leads
| for being ok with shipping unoptimized messes.
| btbuilder wrote:
| I believe the driver silently swapped the textures to
| lower quality ones that looked worse but gave a
| performance boost.
| antonvs wrote:
| > As a gamer, this seems fine to me
|
| As a software developer, it almost certainly has a bad
| effect on the ecosystem long term. "Hacks shit in" is the
| very definition of technical debt, and that has a cost
| that someone, somewhere is going to have to pay in some
| form.
| monkpit wrote:
| You're looking as a dev, but the reality is that a
| consumer cannot see technical debt. If the studio churns
| out a game, the vendor sprinkles on some optimizations,
| people play it and move on, then the tech debt just
| vaporizes into the void. It's not real at that point.
| wtetzner wrote:
| Just because a consumer can't see technical debt doesn't
| mean they aren't paying for it. Most game studios
| continue to re-use code, so it doesn't just "vaporize"
| into the void.
| 8n4vidtmkvmk wrote:
| I'm pretty sure I pay this debt with lost FPS and every
| time I glitch through the floor into the nether.
| RHSeeger wrote:
| I can't reply to the person that replied to you, so
|
| > You're looking as a dev, but the reality is that a
| consumer cannot see technical debt.
|
| The consumer can't _see_ technical debt, but they sure as
| heck can be impacted by it.
|
| - Technical debt means the code base is harder to work
| with later. So fixes/enhancements take longer to make it
| into the code (and sometimes never can)
|
| - This particular type of technical debt means the code
| by the game developers sets precedent, and the next
| developer may use it as an example. So the amount of code
| incorrectly using the API grows faster over time.
| strbean wrote:
| For some reason HN sometimes hides the reply button on
| leaf comments. I think this only happens for very new
| comments.
|
| You can click the timestamp ("X minutes ago") to view the
| comment without context, and reply from there.
| monocasa wrote:
| I think it's an anti-flamewar tactic to put the brakes on
| quick replies.
| charcircuit wrote:
| >the next developer may use it as an example
|
| These hacks are game specific, so another developer
| wouldn't get them.
| lsaferite wrote:
| The next developer at that company that uses or
| references the crappy code for another project would
| still have the issue, but not get the benefit of the
| down-stream GPU vendor hacks to fix the buggy game.
| RHSeeger wrote:
| The way the API was used incorrectly "worked", and the
| game didn't see the negative impact of it because it was
| "fixed away". And then the incorrect usage is used again
| on another game and doesn't get the "fixed away" benefit.
| And the same incorrect usage could happen over and over
| because "it works".
| SideQuark wrote:
| > technical debt, and that has a cost that someone,
| somewhere is going to have to pay in some form
|
| There is no reason anyone has to pay each and every iota
| of technical debt. Plenty of things with technical debt
| hit end of life and no one ever looks at that code again.
| I suspect most technical debt goes this way: it goes into
| a program, the program never updates (or gets only minor
| updates), and then it dies.
|
| Your claim would require every piece of technical debt in
| anything ever (code, buildings, cars, anywhere) to be
| removed before the thing goes end of life or goes into a
| mode where it never changes. That seems ludicrous to me.
| cyanydeez wrote:
| Does anyone talk about how technical debt often just gets
| thrown into the garbage so we can buy fancy new technical
| crap? It's what pays for most of y'all's jobs.
| itsTyrion wrote:
| It rendered in lower quality; IIRC lower-resolution textures
| plus much more aggressive mipmapping and/or LOD.
| gmueckl wrote:
| Except that if a developer has that kind of market pull,
| nVidia will gladly help those devs with getting it
| right. They are excellent at maintaining developer
| relations.
| mcculley wrote:
| I was surprised to see "AAAA". I didn't know there were 4
| As now.
|
| "AAAA Game Studio shits out another unoptimized clunker"
| seems a paradoxical statement to me. I would have thought
| "AAAA" meant "highly resourced" game company. Does it
| just mean high revenue? Lots of players?
| mwpmaybe wrote:
| High price...
| vinceguidry wrote:
| The more money you throw at an effort, the more gets
| flushed out as waste, and the harder it is to maintain
| quality. Pretty universal across business.
| wtetzner wrote:
| AAA/AAAA just means "how much money was spent developing
| the game". High cost doesn't automatically equal high
| quality. In fact, it seems after a certain point to mean
| the opposite.
| bigfishrunning wrote:
| AAAA isn't a real thing, it's a memey joke based on a
| press release from a Microsoft studio that was closed
| before ever releasing a single game.
| toast0 wrote:
| > Was it a scandal at the time?
|
| Yes. My understanding was it was optimized by reducing
| precision or something to a visibly apparent degree.
|
| It's different if the driver changes things in ways such
| that rendered output is the same or at least
| imperceptibly different. I think there's also a lot more
| communication between gpu makers and game/engine
| developers these days; plus a lot more frequent updates.
| KronisLV wrote:
| > My understanding was it was optimized by reducing
| precision or something to a visibly apparent degree.
|
| If only we had that sort of control over rendering for
| every game ourselves. Projects like OptiScaler at least let
| us claw back control over (sometimes proprietary) upscaling
| and even framegen, but it's not quite enough:
| https://github.com/optiscaler/OptiScaler
|
| I'd also mention Lossless Scaling here, though it still
| only works on upscaling and framegen and with worse
| methods, but at least works for most games out there:
| https://store.steampowered.com/app/993090/Lossless_Scaling/
|
| I want to be able to freely toggle between different
| types of AA and SSAO and reflections and lighting and LOD
| systems and various shader effects (especially things
| like chromatic aberration or motion blur) and ray tracing
| and all that, instead of having to hope that the console
| port that's offered to me has those abilities in the
| graphics menu and that whoever is making the decisions
| hasn't decided that actually "low" graphics (that would
| at least run smoothly) would look too bad for the game's
| brand image or something.
| PLenz wrote:
| The Volkswagen emissions testing model
| rowanG077 wrote:
| Let's hope, for Nvidia's sake, that this is an innocent
| optimization valid only for internal kernels, one that
| cannot be applied in general.
| jagrsw wrote:
| In which case checking for a string inside an arbitrary
| kernel name is sloppy (a bug).
| high_na_euv wrote:
| I have a little experience with compilers and LLVM, but
| you'd be shocked how many things rely on names and parsing
| names.
|
| If you have hundreds of complex passes that rely on various
| "contracts" like type names or some shit, then really crazy
| things like this can happen unintentionally rather than
| maliciously.
| diggan wrote:
| Web-developers are well aware of this too. Sincerely,
| Mozilla/5.0 (X11; Linux x86_64; rv:139.0) Gecko/20100101
| Firefox/139.0
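|
| The classic failure mode, sketched in Python (the sniffing
| logic is invented for illustration, but it is why every
| engine still ships this whole fossil record in its UA
| string):
|
|     def pick_code_path(user_agent: str) -> str:
|         # Browser-war-era server-side sniffing: gate features
|         # on a product name instead of on feature detection.
|         if "Mozilla" in user_agent:
|             return "full-featured page"
|         return "degraded fallback page"
|
|     ua = ("Mozilla/5.0 (X11; Linux x86_64; rv:139.0) "
|           "Gecko/20100101 Firefox/139.0")
|     print(pick_code_path(ua))  # every browser claims "Mozilla"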
| bravesoul2 wrote:
| Funny we send a browser wars tombstone in every request!
| antonvs wrote:
| Let's have a moment of silence for Gecko/20100101
| halJordan wrote:
| Why would I be shocked that a name is informative? Like...
| are you surprised that wrought iron is wrought? Or that
| cast iron is made from a cast?
| IAmBroom wrote:
| Dog piles are often neither composed of dogs, nor actual
| piles.
|
| Names can be both informative, and misdirecting, at the same
| time.
| the8472 wrote:
| Some names are standardized items, like memcpy. Matching those
| is ok, nothing sneaky going on there. Matching something
| vendor-specific in a general-purpose API is a different
| story.
| giingyui wrote:
| And what's the downside of using that kernel name? It can't just
| be that it's faster and nothing else. Unless they included lots
| of sleep(x) calls.
| samus wrote:
| There might be optimizations that are only safe for the
| code that this was intended for.
| bialpio wrote:
| Seems like a bad idea to rely on a name for deciding this
| then, unless it's documented somewhere that using names
| containing certain substrings may trigger unsafe
| optimizations...
| Arch-TK wrote:
| I wish people either learned how to use git or just wholesale
| stopped using it.
| tempaway43563 wrote:
| So, what is Cutlass? Can someone explain whether checking
| for kernel names makes sense here, or is it a form of
| cheating?
|
| https://docs.nvidia.com/cutlass/index.html
| rurban wrote:
| That's strange, because the cutlass docs explicitly do NOT
| mention fp8 support. So it looks like it can nevertheless
| be used with fp8 via the name hack.
| mlazos wrote:
| It supports e5m2 and e4m3 right in the doc linked.
| gpm wrote:
| Github version: https://github.com/NVIDIA/cutlass
|
| I wonder whether we can find something referencing this if
| we search the comments.
| zahlman wrote:
| This tweet appears to be taking the original material out of
| context to misrepresent it:
|
| > Rewrite the attention kernel to be persistent. This gives
| better performance at low-contexts. However, fp16 at large
| context has suffered a bit due to a ptxas instruction scheduling
| issue in the softmax partition. fp8 is ~100 tflops faster when
| the kernel name has "cutlass" in it.
|
| The charitable reading is that, on certain kernels, _using fp8
| rather than fp16 values_ gives better performance. (Although I
| can't even see how the numbers relate to a "~100 tflops faster"
| claim in any respect, nor does it even list any kernel names or
| suggest a control kernel!) But this is being presented as if
| someone has uncovered evidence of cheating on benchmarks.
| saagarjha wrote:
| I think you're the one doing that to the tweet, actually.
| zahlman wrote:
| What are you talking about? When I view the tweet, the _only_
| text I see is:
|
| > > fp8 is 100 tflops faster when the kernel name has
| "cutlass" in it
|
| > kms
| saagarjha wrote:
| And it includes a link to show that this is the context it
| came from.
| zahlman wrote:
| And when I look at the link, the part I quoted is the
| relevant text I see.
|
| In order to get to the part that you're trying to hold me
| accountable for, I would furthermore have to click onto
| the commits tab and search through a 93-commit PR.
|
| I thought today I was using a site where trying to think
| the best of people and propose that someone had taken
| something out of context, based on the immediately
| available context having a simpler explanation, would not
| get me treated like a corporate shill (for a company I
| don't even care about). Apparently I was wrong.
| saagarjha wrote:
| I don't think you are a corporate shill. I do think that
| you immediately going "clearly the tweet is wrong"
| without doing any research whatsoever was unwarranted,
| though. You also keep bringing up that it's 93 commits, but
| they're all getting squashed; all you have to do is search
| for "cutlass" to find out what is going on. I think you're
| obligated to do at least that when you call it out for
| being wrong.
| zettabomb wrote:
| No, that sentence is separate from the rest. Take a look at
| the pull request:
|
|     # Up to 150 TFLOPS faster for fp8!
|     if specialization.constants["dtype"] == gl.float8e5:
|         name = "cutlass_" + name
| zahlman wrote:
| The tweet is quoting from the first message in the
| "conversation" on the PR. There are 93 commits in the PR and
| GitHub doesn't even default to that tab. I looked at the
| obvious text and drew the conclusion that was obvious to me.
| imtringued wrote:
| https://github.com/triton-lang/triton/pull/7298/commits/a5e2...
|
| It's literally in the code.
| zahlman wrote:
| I already had to deal with Twitter and a link shortening
| service just to get to GitHub and then it still only pointed
| to the facing page of a 93-commit PR.
| spoaceman7777 wrote:
| Seems this is likely due to ongoing work on FP8 support in
| nvidia/cutlass. From my reading, the alternative code path
| was likely added recently for testing by external
| contributors to the cutlass project and other involved
| parties (rather than attempting to distribute custom
| packaged internal builds of CUDA).
|
| This ticket is a good starting place to see the chain of issues
| around the ongoing work:
| https://github.com/NVIDIA/cutlass/pull/2037
___________________________________________________________________
(page generated 2025-07-11 23:01 UTC)