[HN Gopher] Highly efficient matrix transpose in Mojo
___________________________________________________________________
Highly efficient matrix transpose in Mojo
Author : timmyd
Score : 123 points
Date : 2025-06-06 19:28 UTC (1 day ago)
(HTM) web link (veitner.bearblog.dev)
(TXT) w3m dump (veitner.bearblog.dev)
| sestep wrote:
| I'm not an expert in this space, but is this meaningful? I'd
| assume that it's more common to fuse together transposition with
| an operation that precedes or follows it (e.g. matmul), which
| should be far more efficient than materializing the entire
| transposition in memory if it's just an intermediate value.
| musebox35 wrote:
| Matrix transpose is a canonical example of a memory bound
| operation and often used to showcase optimization in a
| particular programming language or library. See for example the
| cutlass matrix transpose tutorial from Jay Shah of flash
| attention 3 paper: https://research.colfax-intl.com/tutorial-
| matrix-transpose-i...
| saagarjha wrote:
| Unfortunately the issue (alluded to in the blog post you
| linked) is that transposes do absolutely no work but memory
| loads. Sure, they test that you can swizzle your accesses,
| but modern accelerators are all about pipelining and feeding
| matrix multiply units, which is considerably harder than
| loading from memory as fast as possible. Actually, even the
| Mojo post barely beats CUDA for most of its kernels, because
| you can hit memory bandwidth for transpose on the latest
| hardware using techniques from 5-10 years ago. This is
| definitely not true for more interesting operations.
| musebox35 wrote:
| I totally agree that the resulting kernel will be rarely
| useful. I just wanted to highlight that it is a commonly
| used educational exercise to showcase how to optimize for
| memory throughput. If the post showed how to fuse a
| transpose + rmsnorm epilogue to a gemm then the kernel
| would be more functional but the blog post would be much
| harder to follow for newcomers.
|
| Jay Shah's later articles contain examples that involve
| epilogue fusion. IMHO, understanding how to write an
| efficient transpose helps with following the more involved
| ones.
| simon_vtr wrote:
| That was exactly my reason for writing this blog post and
| optimising transpose. It is a simple, educational, yet
| non-trivial example for learning the basics.
| colesantiago wrote:
| Does anyone use Mojo in production at all or are even hiring for
| Mojo?
| melodyogonna wrote:
| Modular (the company behind Mojo) uses it in production. I
| imagine that if they have any clients then those also use Mojo
| in production - albeit indirectly - since all the GPU kernels
| used by Modular are written in Mojo.
| htrp wrote:
| Left unsaid, the 14% improvement in performance came at the cost
| of increasing dev time by 35%
| bravetraveler wrote:
| Reminds me of this, lol:
|
| > _" From the moment I understood the weakness of my flesh, it
| disgusted me. I craved the strength and certainty of steel."_
|
| 14% all the time vs 35% some of the time
|
| _edit:_ Closing numbers are far less impressive than those
| buried in the middle of the post. Confusing; bye everyone
| vlan121 wrote:
| Mojo's compiler is closed source. That's a big no-no.
| dgurchenkov wrote:
| I work on Mojo. The whole compiler, runtime etc. will get open
| sourced, most likely within a year. It is just a matter of time
| and us getting all the required work done.
|
| https://docs.modular.com/mojo/faq/#open-source
| almostgotcaught wrote:
| > runtime
|
| Are you talking about your libc equivalent or MAX?
| dgurchenkov wrote:
| Both.
|
| Mojo standard library is already open source. Mojo at the
| moment does not need a runtime (but if it ever needs one
| it'd get open sourced). My point was, Mojo as a whole, as a
| programming language & a reference implementation, will
| definitely get open sourced.
|
| MAX itself is a bigger beast to work with, and I am out of
| my depth to talk about it. I think it'll get open sourced
| as well, just the timeline might be different (shorter or
| longer, IDK).
| xiphias2 wrote:
| "Will get open sourced" means closed source; the parent wrote
| the same.
| GeekyBear wrote:
| Chris Lattner (the CEO of Modular) was previously the
| technical lead behind the creation of LLVM, Clang and
| Swift, all of which were open sourced.
|
| He has a bit of a track record already.
| xiphias2 wrote:
| Sure, but at that time he was employed by Apple, for
| example.
|
| Now he's building a for-profit company, and the existing MAX
| and MAX Enterprise offerings make it hard to trust that the
| open-source part will be competitive with the already great
| inference frameworks out there.
| voronar wrote:
| Mr. Mojo Risin'
| arjvik wrote:
| Where's the 14%? Looks like their final kernels show a 0.14%
| improvement of Mojo over the equivalent CUDA kernel?
| 77pt77 wrote:
| It looks that way because it does.
|
| >(2771.35/2775.49 - 1) * 100 = -.14916285052369131300
|
| Flagged.
| timmyd wrote:
| Updated the title to the original. I based the numbers on
|
| "This kernel archives 1437.55 GB/s compared to the 1251.76
| GB/s we get in CUDA" (14.8%), which is still impressive.
| jsnell wrote:
| The "Switching to Mojo gave a 14% improvement over CUDA" title is
| editorialized, the original is "Highly efficient matrix transpose
| in Mojo".
|
| Also, the improvement is 0.14%, not 14%, which makes the
| editorialized linkbait particularly egregious.
| baal80spam wrote:
| 0.14% is within the limits of statistical error. So this is a
| nothing-"article".
| jsnell wrote:
| I don't think that's fair. The article promised a highly
| efficient kernel and seems to have delivered exactly that,
| which isn't "nothing". My beef is entirely with the submitted
| title.
| jebarker wrote:
| Yeah, it seems like the blog post is just meant to be an
| example of how to do something in Mojo and not a dunk on CUDA.
| timmyd wrote:
| FWIW I didn't take the blog as a dunk on CUDA, just as an
| impressive outcome from the blog writer in Mojo. It's great
| to see this on Hopper - if it makes it go faster, that's
| awesome.
| atomicapple wrote:
| I think the OP based the title off of "This kernel archives
| 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA"
| (14.8%) and not the final kernels for whatever reason
| timmyd wrote:
| [op here] To be clear: yes, there are 3 kernels - you can see
| them in the GitHub repo linked at the end of the article.
| These are:
|
| transpose_naive - Basic implementation with TMA transfers
|
| transpose_swizzle - Adds swizzling optimization for better
| memory access patterns (rough sketch below)
|
| transpose_swizzle_batched - Adds thread coarsening (batch
| processing) on top of swizzling
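|
| (To make the swizzle idea concrete: the blog's kernels use the
| TMA hardware swizzle modes, but the same trick can be written
| by hand as an XOR on the shared-memory column index. A rough,
| generic CUDA sketch, not the blog's code:)
|
|     // XOR swizzle for a 32x32 float tile: spreads each logical
|     // column of the tile across all 32 shared-memory banks, so
|     // both the row-wise write and the column-wise read of the
|     // tile are conflict-free without padding.
|     __device__ __forceinline__ int swizzle(int row, int col) {
|         return row * 32 + (col ^ row);  // flat index into tile
|     }
|
|     // Inside a tiled transpose (illustrative names):
|     //   tile[swizzle(ty, tx)] = in[...];   // write a tile row
|     //   out[...] = tile[swizzle(tx, ty)];  // read a tile "column"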
|
| Performance comparison with CUDA: The Mojo implementations
| achieve bandwidths of:
|
| transpose_naive: 1056.08 GB/s (32.0025% of max)
|
| transpose_swizzle: 1437.55 GB/s (43.5622% of max)
|
| transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
|
| via the GitHub - simveit/efficient_transpose_mojo
|
| Comparing to the CUDA implementations mentioned in the article:
|
| Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
|
| Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76
| GB/s
|
| Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's
| 2771.35 GB/s
|
| So there is a highly efficient matrix transpose in Mojo.
|
| All three Mojo kernels outperform their CUDA counterparts, with
| the naive and swizzle kernels showing significant improvements
| (20.6% and 14.8% faster respectively), while the final
| optimized kernel achieves essentially identical performance
| (slightly better by 4.14 GB/s).
|
| The "flag" here seemed innapropriate given that its true this
| implementation is indeed faster, and certainly the final
| iteration could be improved on further. It wasn't wrong to say
| 14% or even 20%.
| jsnell wrote:
| Users of the site only have one control available: the flag.
| There's no way to object only to the title but not to the
| post, and despite what you say, that title hit the trifecta:
| not the original title, factually incorrect, and clickbait.
| So I'm not that surprised it got flagged (even if I did not
| flag it myself).
|
| Email the mods at hn@ycombinator.com. There's a chance
| they'll remove the flag and re-up the post.
| timmyd wrote:
| Thanks jsnell - I did, and they appreciated the comment
| above and unflagged it. I appreciate it!
| noracists wrote:
| slop
| londons_explore wrote:
| _Why_ do we ever need to transpose a matrix?
|
| Isn't it better to simply combine the transposition with whatever
| next operation one wishes to do with the matrix?
| throwawayabcdef wrote:
| The next operation might need the data in column-major order to
| read it fast. So you might have to transpose first. And these
| may be concurrent stages of a processing pipeline.
| viraptor wrote:
| Now I'm curious: how many times do you have to fully read the
| matrix on the GPU before the total cost of reading columns is
| higher than a one-off actual transpose followed by sequential
| row reads? I know it depends on lots of things; I'm after a
| rough estimate.
| saagarjha wrote:
| It's quite rare. Usually problems are tiled anyway and you
| can amortize the cost of having data in the "wrong" layout
| by loading coalesced in whatever is the best layout for
| your data and then transposing inside your tile, which
| gives you access to much faster memory.
| stephencanon wrote:
| The one pure transpose case that does come up
| occasionally is an in-place non-square transpose, where
| there is a rich literature of very fussy algorithms. If
| someone managed to make any headway with compiler
| optimization there, I'd be interested.
| hogepodge wrote:
| You're right that a good graph compiler will do this for you.
| There still may be times, like if you're interfacing with
| another library, where you'll need to switch a matrix between
| row-major and column-major layouts.
| meindnoch wrote:
| Serious linear algebra libraries expect a flag that tells
| whether elements are column-major or row-major.
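|
| cuBLAS, for instance, is column-major only, but its transpose
| op flags play a similar role. A rough sketch of an out-of-place
| transpose via cublasSgeam (illustrative only; a row-major
| rows x cols input is simply treated as a column-major
| cols x rows matrix):
|
|     #include <cublas_v2.h>
|
|     // Transpose a row-major rows x cols matrix on the device.
|     // Error handling omitted for brevity.
|     void transpose_cublas(const float* d_in, float* d_out,
|                           int rows, int cols) {
|         cublasHandle_t handle;
|         cublasCreate(&handle);
|
|         const float alpha = 1.0f, beta = 0.0f;
|         // C = alpha*op(A) + beta*op(B); with op = T and beta = 0
|         // this is just a transpose. d_in is passed as B only to
|         // satisfy the argument list; it contributes nothing.
|         cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_T,
|                     rows, cols,          // shape of C (col-major)
|                     &alpha, d_in, cols,  // op(A) = A^T, lda = cols
|                     &beta,  d_in, cols,  // B unused (beta = 0)
|                     d_out, rows);        // ldc = rows
|
|         cublasDestroy(handle);
|     }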
| fulafel wrote:
| This could make Mojo look even better, as it would be more
| compute-heavy and the last-step thread reduction would be less
| relevant.
| almostgotcaught wrote:
| As someone said below - you'd never write just a transpose kernel
| - it'll be fused into something else.
| saagarjha wrote:
| Look the frontier AI companies need something other than
| reversing binary trees to give interview candidates
| almostgotcaught wrote:
| No one is going to ask this in an interview. It used to be
| matmul; these days it's FA (flash attention).
| melodyogonna wrote:
| I wonder if there is a reason for not using the high level
| abstractions provided by Modular
| saagarjha wrote:
| Most interesting algorithms (e.g. with dynamic shapes, mixed
| computation) are typically better scheduled by hand.
| Q6T46nT668w6i3m wrote:
| Sure, but Modular's mission was to provide abstractions to
| minimize these types of optimizations.
| totalperspectiv wrote:
| I'd also add that Mojo is new, and people are still feeling it
| out by trying to map things 1:1 from CUDA.
| saagarjha wrote:
| > This kernel archives a bandwidth of 1056.08 GB/s which is
| faster than the 875.46 GB/s we archived using CUDA. I believe
| that to be the reason because we use the PTX api for TMA
| transfers in Mojo.
|
| I can't say for sure because I couldn't find the CUDA kernel but
| I kind of doubt this is true. You can hit memory bandwidth on
| Hopper without using TMA at all, which is mostly designed for
| accelerating asynchronous copies and reducing memory pressure. If
| all you are doing is a transpose you don't need any of this to go
| fast (though it might simplify your indexing code...?)
| simon_vtr wrote:
| The CUDA kernels I mention use the equivalent logic to the
| Mojo kernels. You can find them on my GitHub:
| https://github.com/simveit/effective_transpose Feel free to
| provide a faster kernel for H100 via PR and I will merge it
| after checking that it's faster.
| iandanforth wrote:
| I'm probably just ignorant but shouldn't the graphic of the tiled
| transpose have the green vector column-oriented in the final
| matrix?
| somethingsome wrote:
| The colors are read/write operations ;)
|
| You have global memory and shared memory; global memory is
| slower.
|
| You read in rows in the global memory (faster than reading
| columns)
|
| You write in columns in the shared memory (slower than in rows,
| but the shared memory is fast, this is the transpose operation)
|
| You read in rows in the shared memory (very fast)
|
| You write in rows in the global memory (faster than writing in
| columns)
|
| The idea behind that tiling is to hide the slow part in a
| memory that is faster.
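|
| Roughly, in CUDA the pattern looks like this minimal sketch
| (illustrative only, not the blog's kernel; TILE, the padding
| trick and the bounds handling are my own choices):
|
|     #define TILE 32
|
|     // Out-of-place transpose of a row-major rows x cols matrix.
|     // Launch with dim3 block(TILE, TILE), one block per tile.
|     __global__ void transpose_tiled(const float* in, float* out,
|                                     int rows, int cols) {
|         // +1 padding (an XOR swizzle also works) avoids bank
|         // conflicts on the column-wise shared-memory accesses.
|         __shared__ float tile[TILE][TILE + 1];
|
|         int x = blockIdx.x * TILE + threadIdx.x;  // column of in
|         int y = blockIdx.y * TILE + threadIdx.y;  // row of in
|         if (x < cols && y < rows)  // coalesced global read
|             tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
|
|         __syncthreads();
|
|         // Swap block coordinates so the global write is coalesced.
|         x = blockIdx.y * TILE + threadIdx.x;  // column of out
|         y = blockIdx.x * TILE + threadIdx.y;  // row of out
|         if (x < rows && y < cols)
|             out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
|     }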
| daft_pink wrote:
| I think Mojo not being a truly open product, and existing to
| drive profits at Modular, has really held it back.
|
| It's just really impractical to use a licensed programming
| language in 2025.
| totalperspectiv wrote:
| My impression is that this is on purpose on their part. They've
| repeatedly stated that by 2026 they will open source the
| compiler, and I think they've wanted a slow adoption ramp in
| order to spend some more time getting it right first.
|
| Possibly rose-tinted glasses on my part, but I'm optimistic for
| 2026. Chris Lattner has a pretty strong track record of getting
| these things right.
| veidr wrote:
| Yeah, _and_ he's clearly trying to avoid what happened to
| Swift[1]. The risk that "corporate owner priorities dictate
| releasing half-baked/awful changes" is still there, but
| Lattner himself has more influence within Modular (obviously,
| as co-founder and CEO) than he did at Apple, so it may work
| out better this time.
|
| [1]: https://news.ycombinator.com/item?id=30416070
| melodyogonna wrote:
| Yeah, Mojo's development has been pretty transparent. Chris
| publishes technical documents for most features and takes
| community feedback into account. A recent example is here:
| https://forum.modular.com/t/variable-bindings-proposal-
| discu...
|
| Btw, Mojo's development is a masterclass in language
| development and community building; it's been fun watching
| Chris go back to fix technical debt in existing features
| rather than just adding new features.
| GeekyBear wrote:
| > he's clearly trying to avoid what happened to Swift
|
| Also to MLIR while Lattner was at Google:
|
| > MLIR was born--a modular, extensible compiler
| infrastructure designed to bring order to the chaos. It
| brought forth a foundation that could scale across hardware
| platforms, software frameworks, and the rapidly evolving
| needs of machine learning. It aimed to unify these systems,
| and provide a technology platform that could harmonize
| compute from many different hardware makers.
|
| But unification is hard. What started as a technical
| project quickly turned into a battleground: open-source
| governance, corporate rivalries, and competing visions all
| collided. What could have been a straightforward
| engineering win became something much more complicated.
|
| https://www.modular.com/blog/democratizing-ai-compute-
| part-8...
| daft_pink wrote:
| It was just very difficult, primarily because the license
| limitations and install steps made it hard to drop into the
| existing Python tooling ecosystem.
|
| I haven't tried it in a long time, but as it's a Python
| superset, I tried to drop it into my jupyter notebook
| docker container and you had to agree to license terms and
| register your email and install a modular package that
| contained a bunch of extra things.
|
| If you want to get widespread adoption for a python
| superset, you would probably want to get it included in the
| official jupyter docker images as people who do this sort
| of programming like to use a jupyter repl, but they just
| made it so difficult.
|
| I'm no open source zealot and I'm happy to pay for
| software, but I think the underlying language needs to be a
| lot more open to be practical.
| thunkingdeep wrote:
| Is the word archive used in place of achieve? I'm not sure if
| there is a terminology issue that I don't understand in this
| post...
| totalperspectiv wrote:
| In the thread-coarsening code, you use an @parameter for loop.
| Doesn't unrolling that lead to some pretty large code size? Or
| is that less of an issue on GPUs?
|
| Great write up! I learned a lot!
| simon_vtr wrote:
| It doesn't. The batch size is just 8. This is a very good trick
| and often needed to achieve peak performance in memory-bound
| kernels. You can check out the equivalent code in CUDA as
| well :)
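|
| For reference, the CUDA counterpart of @parameter for is a
| compile-time-sized loop with #pragma unroll. A generic sketch
| of thread coarsening (not the repo's actual kernel; names and
| the batch-of-8 layout are my own):
|
|     #define BATCH 8
|
|     // Each thread copies BATCH consecutive rows of one column.
|     // #pragma unroll expands the loop at compile time, like
|     // Mojo's @parameter for; with BATCH = 8 the body is only
|     // repeated 8 times, so the code-size cost is small in
|     // exchange for more work per thread.
|     __global__ void copy_coarsened(const float* in, float* out,
|                                    int rows, int cols) {
|         int x  = blockIdx.x * blockDim.x + threadIdx.x;
|         int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * BATCH;
|
|         #pragma unroll
|         for (int j = 0; j < BATCH; ++j) {
|             int y = y0 + j;
|             if (x < cols && y < rows)
|                 out[y * cols + x] = in[y * cols + x];
|         }
|     }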
| graycat wrote:
| Fast matrix transpose? Agreed - but for a _transposed_ matrix,
| you can just change the indexing arithmetic that converts row i
| and column j to an offset into the matrix's storage, and then
| remember that this is a _transposed_ matrix. Some software
| _object_ semantics could make this easy for other software to
| use.
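|
| A minimal sketch of that idea (illustrative names, compiles as
| CUDA host/device code; nothing here is from the article):
|
|     // A "transposed" matrix as pure index arithmetic: no data
|     // moves, the accessor just swaps the roles of row and column.
|     struct MatView {
|         const float* data;
|         int srows, scols;  // shape of the row-major storage
|         bool transposed;   // present the transpose of the storage?
|
|         __host__ __device__ float at(int i, int j) const {
|             return transposed ? data[j * scols + i]
|                               : data[i * scols + j];
|         }
|     };
|
| Nothing is copied; the cost just shows up later as strided
| reads wherever the transposed view is actually consumed.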
| jjtheblunt wrote:
| I think the problem with changing the indexing arithmetic is
| that you could end up with arithmetic incompatible with the
| vector instructions in hardware that you're hoping to use for
| parallelism.
___________________________________________________________________
(page generated 2025-06-07 23:01 UTC)