[HN Gopher] Finding a billion factorials in 60 ms with SIMD
       ___________________________________________________________________
        
       Finding a billion factorials in 60 ms with SIMD
        
       Author : todsacerdoti
       Score  : 163 points
       Date   : 2025-06-22 23:12 UTC (23 hours ago)
        
 (HTM) web link (codeforces.com)
 (TXT) w3m dump (codeforces.com)
        
       | few wrote:
       | It's interesting to see which codeforces blog posts get traction
       | on HN.
       | 
       | For context, in competitive programming a lot of combinatorial
       | problems (find some formula to count something) require you to
       | output the answer modulo some prime. This is because otherwise
       | the answer would overflow an int and make the problem too tedious
       | to be fun and too hard for problem setters to write good problems
       | or checkers for.
       | 
        | So to prove that you still know how to count the thing, you can
        | do it in a finite field. If you use the integers mod a prime, you
        | still have all the usual arithmetic operations: addition,
        | subtraction, multiplication. And even division is still easy since
       | you can calculate multiplicative inverse with Fermat's Little
       | Theorem (a^(p-2) = a^(-1) mod p). The final answer you output is
       | not the real thing you're counting, but just evidence that you
       | had the right formula and did all the right operations.
       | 
       | Anyway, just wanted to give context for why competitive
       | programmers care about factorial mod a prime (usually as part of
       | a binomial or multinomial expression). And I'm kind of surprised
       | anyone outside of competitive programming cares about it.
       | 
       | See also:
       | 
       | https://usaco.guide/gold/modular?lang=cpp
       | 
       | https://usaco.guide/gold/combo?lang=cpp
        
         | dataflow wrote:
         | > a^(p-2) = a^(-1) mod p
         | 
          | Tangent, but curious: what made you write it like this? I've
          | only ever seen it written as
          | 
          |       a^(p-1) = 1 mod p
          | 
          | or
          | 
          |       a^p = a mod p
          | 
          | Is that form somehow more useful in some cases?
        
           | addaon wrote:
           | a^(-1) mod p is the multiplicative inverse in a finite field.
           | The point of the original comment was to show how to
           | transform the multiplicative inverse into an easier problem.
        
             | cperciva wrote:
             | Just what do you think the running time of modular inverse
             | is?
        
               | atq2119 wrote:
               | That's a pretty snarky and unhelpful approach to the
               | conversation.
               | 
               | That said, I'm also a bit surprised to see somebody
               | discuss modular inverses without mentioning the extended
               | euclidean algorithm, which is a more elementary solution.
        
               | cperciva wrote:
               | Snarky I'll admit, but it was a serious question. Inverse
               | is quasilinear time; the exponential is quasiquadratic.
               | Maybe he didn't know about the EEA?
        
               | tmyklebu wrote:
               | Well, modular multiplication is faster than modular
               | inverse, both asymptotically for large moduli and
               | practically for almost all moduli I can think of. (2, 3,
               | and 4 being notable exceptions!)
               | 
               | The article computes modular inverses of a_1, ..., a_n
               | by:
               | 
               | - Computing (a_1 * ... * a_i) = (a_1 * ... * a_{i-1}) *
               | a_i recursively
               | 
               | - Computing (a_1 * ... * a_n)^(-1) by square-and-multiply
               | 
                | - Computing (a_1 * ... * a_i)^(-1) = (a_1 * ... *
                | a_{i+1})^(-1) * a_{i+1} recursively.
               | 
               | - Computing a_i^(-1) = (a_1 * ... * a_i)^(-1) * (a_1 *
               | ... * a_{i-1}) for each i.
               | 
               | The second step is a scalar operation, so its running
               | time is immaterial as long as you aren't doing something
               | too silly.
               | 
               | For my caveman brain, both Fermat's little theorem and
               | square-and-multiply exponentiation are pretty easy to
               | understand. Moreover, the code is going to be "defect-
               | evident"---if I've gotten the logic wrong or forgotten
               | integer promotions or modular reductions as in qsort's
               | post, it'll quickly be clear by skimming the code.
        
               | benreesman wrote:
               | I disagree on two counts:
               | 
               | - having Colin stop by your thread is strictly an
               | opportunity for useful information to flow from a
               | singular source to many people
               | 
                | - you would hear that aloud 100 times a day in any office
                | where serious work was being done by professionals on a
                | deadline and think nothing of it. Bet your ass that places
                | in the world where serious hackers rise only on merit and
                | have the best gear embargoed are saying stuff like that
                | all the time. This nepotism-capture bubble is an outlier
                | in the history of serious engineering.
               | 
               | Defining the rudeness threshold down to the point where
               | cperciva clears it is one part comedy and two parts
               | tragedy with the words Hacker News in bold at the top of
               | the page.
        
            | qsort wrote:
            | It's easier to write code for efficiently computing the
            | inverse in that form, roughly:
            | 
            |       int FastExp(int a, int e, int p)
            |       {
            |         if (e == 0) return 1;
            |         if (e == 1) return a;
            |         if (e % 2 == 0) return FastExp((a*a)%p, e/2, p);
            |         else return a * FastExp((a*a)%p, e/2, p);
            |       }
            | 
            | In math competitions where you only have pen and paper, you'd
            | instead turn what you wrote into a Diophantine equation you
            | can solve with the usual method.
        
       | vivzkestrel wrote:
       | Since we are talking factorials, i wanted to ask. What is the
       | largest factorial that the biggest supercomputer known to man has
       | computed? how long did it take
        
         | adgjlsfhk1 wrote:
         | probably not very big. factorials are pretty boring in the
         | sense that it's relatively trivial to compute a factorial
         | almost as big as your hard drive. 99% of the time will happen
         | in a single multiply
        
           | cperciva wrote:
           | _99% of the time will happen in a single multiply_
           | 
           | Far from it. Asymptotically it's a proportion 2/log(N) of the
           | compute cost.
        
           | LegionMammal978 wrote:
            | Yeah, most algorithms that run in O(n polylog(n))
           | time, including most divide-and-conquer bigint arithmetic,
           | will ultimately be dominated by how much memory you have
           | available. For some experiments a while back, I would create
           | a terabyte-long swapfile to extend the range of what GMP
           | could do, and it still filled up within a couple CPU-weeks at
           | most.
        
         | ethan_smith wrote:
         | The current record is 10^10^9 (a billion billion digits) by
         | Peter Luschny in 2021 using a parallel algorithm. For exact
         | factorials, Clifford Stern calculated 170! in 2012 which has
         | over 300 digits.
        
       ___________________________________________________________________
       (page generated 2025-06-23 23:01 UTC)