[HN Gopher] Highly efficient matrix transpose in Mojo
       ___________________________________________________________________
        
       Highly efficient matrix transpose in Mojo
        
       Author : timmyd
       Score  : 123 points
       Date   : 2025-06-06 19:28 UTC (1 day ago)
        
 (HTM) web link (veitner.bearblog.dev)
 (TXT) w3m dump (veitner.bearblog.dev)
        
       | sestep wrote:
       | I'm not an expert in this space, but is this meaningful? I'd
       | assume that it's more common to fuse together transposition with
       | an operation that precedes or follows it (e.g. matmul), which
       | should be far more efficient than materializing the entire
       | transposition in memory if it's just an intermediate value.
        
         | musebox35 wrote:
         | Matrix transpose is a canonical example of a memory-bound
         | operation and is often used to showcase optimization in a
         | particular programming language or library. See, for
         | example, the CUTLASS matrix transpose tutorial by Jay Shah
         | of the Flash Attention 3 paper:
         | https://research.colfax-intl.com/tutorial-
         | matrix-transpose-i...
        
           | saagarjha wrote:
           | Unfortunately the issue (alluded to in the blog post you
           | linked) is that transposes do absolutely no work but memory
           | loads. Sure, they test that you can swizzle your accesses,
           | but modern accelerators are all about pipelining and feeding
           | matrix multiply units, which is considerably harder than
           | loading from memory as fast as possible. Actually, even the
           | Mojo post barely beats CUDA for most of its kernels, because
           | you can hit memory bandwidth for transpose on the latest
           | hardware using techniques from 5-10 years ago. This is
           | definitely not true for more interesting operations.
        
             | musebox35 wrote:
             | I totally agree that the resulting kernel will rarely be
             | useful. I just wanted to highlight that it is a commonly
             | used educational exercise to showcase how to optimize for
             | memory throughput. If the post showed how to fuse a
             | transpose + rmsnorm epilogue to a gemm then the kernel
             | would be more functional but the blog post would be much
             | harder to follow for newcomers.
             | 
             | Jay Shah's later articles contain examples that involve
             | epilogue fusion. IMHO, understanding how to write an
             | efficient transpose helps with following the more involved
             | ones.
        
               | simon_vtr wrote:
               | That was exactly my reason for writing this blog post
               | and optimising transpose. It is a simple, educational,
               | yet non-trivial example for learning the basics.
        
       | colesantiago wrote:
       | Does anyone use Mojo in production at all or are even hiring for
       | Mojo?
        
         | melodyogonna wrote:
         | Modular (the company behind Mojo) uses it in production. I
         | imagine that if they have any clients then those also use Mojo
         | in production - albeit indirectly - since all the GPU kernels
         | used by Modular are written in Mojo.
        
       | htrp wrote:
       | Left unsaid, the 14% improvement in performance came at the cost
       | of increasing dev time by 35%
        
         | bravetraveler wrote:
         | Reminds me of this, lol:
         | 
         | > _" From the moment I understood the weakness of my flesh, it
         | disgusted me. I craved the strength and certainty of steel."_
         | 
         | 14% all the time vs 35% some of the time
         | 
         |  _edit:_ Closing numbers are far less impressive than those
         | buried in the middle of the post. Confusing; bye everyone
        
       | vlan121 wrote:
       | Mojo's compiler is closed source. That's a big no-no.
        
         | dgurchenkov wrote:
         | I work on Mojo. The whole compiler, runtime etc. will get open
         | sourced, most likely within a year. It is just a matter of time
         | and us getting all the required work done.
         | 
         | https://docs.modular.com/mojo/faq/#open-source
        
           | almostgotcaught wrote:
           | > runtime
           | 
           | Are you talking about your libc equivalent or MAX?
        
             | dgurchenkov wrote:
             | Both.
             | 
             | The Mojo standard library is already open source. Mojo
             | does not need a runtime at the moment (but if it ever
             | needs one, it'd get open sourced). My point was: Mojo as
             | a whole, as a programming language and a reference
             | implementation, will definitely get open sourced.
             | 
             | MAX itself is a bigger beast to work with, and I am out of
             | my depth to talk about it. I think it'll get open sourced
             | as well, just the timeline might be different (shorter or
             | longer, IDK).
        
           | xiphias2 wrote:
           | "will get open sourced" means closed source; parent wrote
           | the same.
        
             | GeekyBear wrote:
             | Chris Lattner (the CEO of Modular) was previously the
             | technical lead behind the creation of LLVM, Clang and
             | Swift, all of which were open sourced.
             | 
             | He has a bit of a track record already.
        
               | xiphias2 wrote:
             | Sure, but at that time he was employed by Apple, for
             | example.
             | 
             | Now he's running a for-profit company, and with the MAX
             | and MAX Enterprise offerings already out there, I don't
             | trust that the open-source part will be competitive with
             | the already-great inference frameworks.
        
       | voronar wrote:
       | Mr. Mojo Risin'
        
       | arjvik wrote:
       | Where's the 14%? Looks like their final kernels show a 0.14%
       | improvement of Mojo over the equivalent CUDA kernel?
        
         | 77pt77 wrote:
         | It looks that way because it does.
         | 
         | >(2771.35/2775.49 - 1) * 100 = -.14916285052369131300
         | 
         | Flagged.
        
           | timmyd wrote:
           | Updated the title to the original. I did base the numbers on
           | 
           | "This kernel archives 1437.55 GB/s compared to the 1251.76
           | GB/s we get in CUDA" (14.8%) which is still impressive
        
       | jsnell wrote:
       | The "Switching to Mojo gave a 14% improvement over CUDA" title is
       | editorialized, the original is "Highly efficient matrix transpose
       | in Mojo".
       | 
       | Also, the improvement is 0.14%, not 14%, making the
       | editorialized linkbait particularly egregious.
        
         | baal80spam wrote:
         | 0.14% is within the limits of statistical error. So this is a
         | nothing-"article".
        
           | jsnell wrote:
           | I don't think that's fair. The article promised a highly
           | efficient kernel and seems to have delivered exactly that,
           | which isn't "nothing". My beef is entirely with the submitted
           | title.
        
         | jebarker wrote:
         | Yeah, it seems like the blog post is just meant to be an
         | example of how to do something in Mojo and not a dunk on CUDA.
        
           | timmyd wrote:
           | FWIW I didn't take the blog as a dunk on CUDA, just as an
           | impressive outcome from the blog writer in Mojo. It's
           | awesome to see this on Hopper - if it makes things go
           | faster, that's awesome.
        
         | atomicapple wrote:
         | I think the OP based the title off of "This kernel archives
         | 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA"
         | (14.8%) and not the final kernels for whatever reason
        
         | timmyd wrote:
         | [op here] To be clear: Yes, there are 3 kernels - you can see
         | those in the linked github at the end of the article if you
         | clicked that. These are:
         | 
         | transpose_naive - Basic implementation with TMA transfers
         | 
         | transpose_swizzle - Adds swizzling optimization for better
         | memory access patterns (see the sketch after this list)
         | 
         | transpose_swizzle_batched - Adds thread coarsening (batch
         | processing) on top of swizzling
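         | 
         | (A minimal CUDA sketch of the swizzle idea, mine rather
         | than the repo's Mojo code: the shared-memory column index
         | is XORed with the row index so that a warp's accesses
         | spread across all 32 banks without padding.)
         | 
         |     __global__ void transpose_swizzle(const float* in,
         |                                       float* out, int n) {
         |         // one 32x32 tile per block, blockDim = (32, 32)
         |         __shared__ float tile[32][32];
         |         int tx = threadIdx.x, ty = threadIdx.y;
         |         int x = blockIdx.x * 32 + tx;
         |         int y = blockIdx.y * 32 + ty;
         |         // store logical (ty, tx) at column tx ^ ty
         |         tile[ty][tx ^ ty] = in[y * n + x];
         |         __syncthreads();
         |         x = blockIdx.y * 32 + tx;
         |         y = blockIdx.x * 32 + ty;
         |         // read logical (tx, ty), undoing the swizzle
         |         out[y * n + x] = tile[tx][ty ^ tx];
         |     }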
         | 
         | Performance comparison with CUDA: The Mojo implementations
         | achieve bandwidths of:
         | 
         | transpose_naive: 1056.08 GB/s (32.0025% of max)
         | 
         | transpose_swizzle: 1437.55 GB/s (43.5622% of max)
         | 
         | transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
         | 
         | via the GitHub - simveit/efficient_transpose_mojo
         | 
         | Comparing to the CUDA implementations mentioned in the article:
         | 
         | Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
         | 
         | Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76
         | GB/s
         | 
         | Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's
         | 2771.35 GB/s
         | 
         | So there is a highly efficient matrix transpose in Mojo.
         | 
         | All three Mojo kernels outperform their CUDA counterparts, with
         | the naive and swizzle kernels showing significant improvements
         | (20.6% and 14.8% faster respectively), while the final
         | optimized kernel achieves essentially identical performance
         | (slightly better by 4.14 GB/s).
         | 
         | The "flag" here seemed innapropriate given that its true this
         | implementation is indeed faster, and certainly the final
         | iteration could be improved on further. It wasn't wrong to say
         | 14% or even 20%.
        
           | jsnell wrote:
           | Users of the site only have one control available: the flag.
           | There's no way to object only to the title but not to the
           | post, and despite what you say, that title hit the trifecta:
           | not the original title, factually incorrect, and clickbait.
           | So I'm not that surprised it got flagged (even if I did not
           | flag it myself).
           | 
           | Email the mods at hn@ycombinator.com. There's a chance
           | they'll remove the flag and re-up the post.
        
             | timmyd wrote:
             | thanks jsnell - i did that, and they appreciated the
             | comment above and unflagged it. i appreciate it!
        
       | noracists wrote:
       | slop
        
       | londons_explore wrote:
       | _Why_ do we ever need to transpose a matrix?
       | 
       | Isn't it better to simply combine the transposition with whatever
       | next operation one wishes to do with the matrix?
        
         | throwawayabcdef wrote:
         | The next operation might need the data in column-major
         | order to read it fast, so you might have to transpose
         | first. And these may be concurrent stages of a processing
         | pipeline.
        
           | viraptor wrote:
           | Now I'm curious: how many times do you have to fully read
           | the matrix on the GPU before the total cost of reading
           | columns exceeds that of a one-off transpose followed by
           | sequential row reads? I know it depends on lots of things;
           | I'm after a rough estimate.
        
             | saagarjha wrote:
             | It's quite rare. Usually problems are tiled anyway and you
             | can amortize the cost of having data in the "wrong" layout
             | by loading coalesced in whatever is the best layout for
             | your data and then transposing inside your tile, which
             | gives you access to much faster memory.
        
               | stephencanon wrote:
               | The one pure transpose case that does come up
               | occasionally is an in-place non-square transpose, where
               | there is a rich literature of very fussy algorithms. If
               | someone managed to make any headway with compiler
               | optimization there, I'd be interested.
        
         | hogepodge wrote:
         | You're right that a good graph compiler will do this for you.
         | There may still be times, such as when you're interfacing
         | with another library, where you'll need to switch a matrix
         | between row-major and column-major layouts.
        
           | meindnoch wrote:
           | Serious linear algebra libraries expect a flag that tells
           | whether elements are column-major or row-major.
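           | 
           | For example, cuBLAS folds the transpose into the GEMM
           | call via its op flags, so A^T never has to be
           | materialized (a sketch; error handling omitted and the
           | helper name is mine):
           | 
           |     #include <cublas_v2.h>
           | 
           |     // C = A^T * B. CUBLAS_OP_T only changes how
           |     // cuBLAS indexes into A. Column-major, as BLAS
           |     // expects: A is k x m as stored, B is k x n,
           |     // C is m x n; all are device pointers.
           |     void gemm_at_b(cublasHandle_t h, const float* A,
           |                    const float* B, float* C,
           |                    int m, int n, int k) {
           |         const float alpha = 1.f, beta = 0.f;
           |         cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N,
           |                     m, n, k, &alpha, A, k, B, k,
           |                     &beta, C, m);
           |     }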
        
         | fulafel wrote:
         | This could make Mojo look even better, as it would be more
         | compute-heavy and the last-step thread reduction would be
         | less relevant.
        
       | almostgotcaught wrote:
       | As someone said below - you'd never write just a transpose kernel
       | - it'll be fused into something else.
        
         | saagarjha wrote:
         | Look, the frontier AI companies need something other than
         | reversing binary trees to give interview candidates.
        
           | almostgotcaught wrote:
           | No one is going to ask this in an interview. It used to be
           | matmul. These days it's FA (flash attention).
        
       | melodyogonna wrote:
       | I wonder if there is a reason for not using the high-level
       | abstractions provided by Modular.
        
         | saagarjha wrote:
         | Most interesting algorithms (e.g. with dynamic shapes, mixed
         | computation) are typically better scheduled by hand.
        
           | Q6T46nT668w6i3m wrote:
           | Sure, but Modular's mission was to provide abstractions
           | that minimize the need for these kinds of hand
           | optimizations.
        
         | totalperspectiv wrote:
         | I'd also add that Mojo is new, and people are still feeling it
         | out by trying to 1:1 things with CUDA.
        
       | saagarjha wrote:
       | > This kernel archives a bandwidth of 1056.08 GB/s which is
       | faster than the 875.46 GB/s we archived using CUDA. I believe
       | that to be the reason because we use the PTX api for TMA
       | transfers in Mojo.
       | 
       | I can't say for sure because I couldn't find the CUDA kernel but
       | I kind of doubt this is true. You can hit memory bandwidth on
       | Hopper without using TMA at all, which is mostly designed for
       | accelerating asynchronous copies and reducing memory pressure. If
       | all you are doing is a transpose you don't need any of this to go
       | fast (though it might simplify your indexing code...?)
        
         | simon_vtr wrote:
         | The CUDA kernels I mention use logic equivalent to the Mojo
         | kernels. You can find them on my GitHub:
         | https://github.com/simveit/effective_transpose
         | Feel free to provide a faster kernel for the H100 via PR;
         | I will merge it after checking that it's faster.
        
       | iandanforth wrote:
       | I'm probably just ignorant but shouldn't the graphic of the tiled
       | transpose have the green vector column-oriented in the final
       | matrix?
        
         | somethingsome wrote:
         | The colors indicate read and write operations ;)
         | 
         | You have global memory and shared memory; global is slower.
         | 
         | You read in rows in the global memory (faster than reading
         | columns)
         | 
         | You write in columns in the shared memory (slower than in rows,
         | but the shared memory is fast, this is the transpose operation)
         | 
         | You read in rows in the shared memory (very fast)
         | 
         | You write in rows in the global memory (faster than writing in
         | columns)
         | 
         | The idea behind that tiling is to hide the slow part in a
         | memory that is faster.
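         | 
         | A minimal CUDA sketch of those four steps (illustrative,
         | not the post's Mojo kernel; tile size and names are mine):
         | 
         |     #define TILE 32
         | 
         |     // grid(n/TILE, n/TILE), block(TILE, TILE)
         |     __global__ void transpose_tiled(const float* in,
         |                                     float* out, int n) {
         |         // +1 pad keeps the column writes conflict-free
         |         __shared__ float tile[TILE][TILE + 1];
         |         int tx = threadIdx.x, ty = threadIdx.y;
         |         int x = blockIdx.x * TILE + tx;
         |         int y = blockIdx.y * TILE + ty;
         |         // global rows in -> shared columns (the transpose)
         |         tile[tx][ty] = in[y * n + x];
         |         __syncthreads();
         |         x = blockIdx.y * TILE + tx;
         |         y = blockIdx.x * TILE + ty;
         |         // shared rows -> global rows, both coalesced
         |         out[y * n + x] = tile[ty][tx];
         |     }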
        
       | daft_pink wrote:
       | I think Mojo not being a truly open product, and existing to
       | drive profits at Modular, has really held it back.
       | 
       | It's just really impractical to use a licensed programming
       | language in 2025.
        
         | totalperspectiv wrote:
         | My impression is that this is on purpose on their part. They've
         | repeatedly stated that by 2026 they will open source the
         | compiler, and I think they've wanted a slow adoption ramp in
         | order to spend some more time getting it right first.
         | 
         | Possibly rose-tinted glasses on my part, but I'm optimistic for
         | 2026. Chris Lattner has a pretty strong track record of getting
         | these things right.
        
           | veidr wrote:
           | Yeah, _and_ he's clearly trying to avoid what happened to
           | Swift[1]. Although the risk of "corporate owner priorities
           | dictating half-baked/awful changes" is still there,
           | Lattner himself has more influence within Modular
           | (obviously, as co-founder and CEO) than he did at Apple,
           | so it may work out better this time.
           | 
           | [1]: https://news.ycombinator.com/item?id=30416070
        
             | melodyogonna wrote:
             | Yeah, Mojo's development has been pretty transparent. Chris
             | publishes technical documents for most features and takes
             | community feedback into account. A recent example is here:
             | https://forum.modular.com/t/variable-bindings-proposal-
             | discu...
             | 
             | Btw, Mojo's development is a masterclass in language
             | development and community building; it's been fun
             | watching Chris go back to fix technical debt in existing
             | features rather than pressing ahead with new ones.
        
             | GeekyBear wrote:
             | > he's clearly trying to avoid what happened to Swift
             | 
             | Also to MLIR while Lattner was at Google:
             | 
             | > MLIR was born--a modular, extensible compiler
             | infrastructure designed to bring order to the chaos. It
             | brought forth a foundation that could scale across hardware
             | platforms, software frameworks, and the rapidly evolving
             | needs of machine learning. It aimed to unify these systems,
             | and provide a technology platform that could harmonize
             | compute from many different hardware makers.
             | 
             | But unification is hard. What started as a technical
             | project quickly turned into a battleground: open-source
             | governance, corporate rivalries, and competing visions all
             | collided. What could have been a straightforward
             | engineering win became something much more complicated.
             | 
             | https://www.modular.com/blog/democratizing-ai-compute-
             | part-8...
        
             | daft_pink wrote:
             | It was just very difficult, primarily because the
             | license limitations and install steps made it hard to
             | drop into the existing Python tooling ecosystem.
             | 
             | I haven't tried it in a long time, but since it's a
             | Python superset, I tried to drop it into my Jupyter
             | notebook Docker container; you had to agree to license
             | terms, register your email, and install a modular
             | package that contained a bunch of extra things.
             | 
             | If you want widespread adoption for a Python superset,
             | you would probably want it included in the official
             | Jupyter Docker images, since people who do this sort of
             | programming like to use a Jupyter REPL, but they just
             | made it so difficult.
             | 
             | I'm no open source zealot and I'm happy to pay for
             | software, but I think the underlying language needs to be a
             | lot more open to be practical.
        
       | thunkingdeep wrote:
       | Is the word archive used in place of achieve? I'm not sure if
       | there is a terminology issue that I don't understand in this
       | post...
        
       | totalperspectiv wrote:
       | In the coarse-graining code, you use an @parameter for loop.
       | Doesn't unrolling that lead to some pretty large code size?
       | Or is that less of an issue on GPUs?
       | 
       | Great write up! I learned a lot!
        
         | simon_vtr wrote:
         | It doesn't. The batch size is just 8. This is a very good
         | trick and often needed to archive peak performance in
         | memory-bound kernels. You can check out the equivalent code
         | in CUDA as well :)
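         | 
         | For a feel of the code-size impact, a CUDA sketch (my
         | names, not the repo's kernel): with a compile-time batch
         | of 8, unrolling expands to just 8 load/store pairs per
         | thread.
         | 
         |     #define BATCH 8
         | 
         |     __global__ void copy_batched(const float* in,
         |                                  float* out, int n) {
         |         int i0 = blockIdx.x * blockDim.x * BATCH
         |                + threadIdx.x;
         |         #pragma unroll  // like Mojo's @parameter for
         |         for (int k = 0; k < BATCH; ++k) {
         |             // stride by blockDim.x so each unrolled
         |             // iteration stays coalesced across the warp
         |             int i = i0 + k * blockDim.x;
         |             if (i < n) out[i] = in[i];
         |         }
         |     }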
        
       | graycat wrote:
       | Fast matrix transpose? Agreed: for a _transposed_ matrix,
       | just change the indexing arithmetic that converts row i and
       | column j to an offset in the matrix's storage, and then
       | remember that this is a _transposed_ matrix. Some software
       | _object_ semantics could make this easy for other software to
       | use.
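       | 
       | A tiny sketch of that object semantics (hypothetical names,
       | host-side C++):
       | 
       |     // Lazy transpose: keep the storage, swap the index
       |     // arithmetic. No element ever moves.
       |     struct MatView {
       |         const float* data;  // row-major rows x cols
       |         int rows, cols;
       |         bool transposed;
       |         float at(int i, int j) const {
       |             // the transposed view maps (i, j) to
       |             // element (j, i) of the underlying matrix
       |             return transposed ? data[j * cols + i]
       |                               : data[i * cols + j];
       |         }
       |     };
       | 
       | E.g. for the 2x3 row-major matrix {1,2,3,4,5,6}, the
       | transposed view's at(2, 1) returns data[1*3+2] = 6, which is
       | indeed (A^T)(2,1).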
        
         | jjtheblunt wrote:
         | i think the problem with changing the indexing arithmetic is
         | that you could end up with arithmetic incompatible with vector
         | instructions in hardware that you're hoping to use for
         | parallelism.
        
       ___________________________________________________________________
       (page generated 2025-06-07 23:01 UTC)