[HN Gopher] Faer-rs: Linear algebra foundation for Rust
       ___________________________________________________________________
        
       Faer-rs: Linear algebra foundation for Rust
        
       Author : nateb2022
       Score  : 195 points
       Date   : 2024-04-24 12:41 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | rayrrr wrote:
        | How exactly does this dovetail with
        | https://github.com/rust-or/good_lp ? Will it be a replacement,
        | an enhancement, or something else?
        
         | sestep wrote:
         | Seems like a different focus, no? Aren't linear programming
         | solvers a much narrower domain than general linear algebra
         | libraries?
        
           | fastneutron wrote:
           | Correct, linear algebra and linear programming are two very
           | distinct things. The latter is a widely-used optimization
           | technique, and computationally depends on the former, which
           | is a general mathematical framework for essentially all
           | numerical computing.
        
         | rayrrr wrote:
         | I asked because Faer describes itself as a low-level linear
         | algebra library with a high-level wrapper, whereas good_lp
         | describes itself as a high-level wrapper around many relevant
         | low-level libraries.
        
           | constantcrying wrote:
           | Linear programming and linear algebra aren't the same thing.
        
         | electrozav wrote:
         | That's code for linear programming (optimization) not for
         | linear algebra
        
       | andyferris wrote:
       | Does anyone understand the difference (in the benchmark tables)
       | between faer and faer(par)? Which number should be considered
       | important?
        
         | royjacobs wrote:
         | If you look at the benchmark code [0] you can see it indicates
         | whether Rayon is used for parallelism.
         | 
         | [0] https://github.com/sarah-ek/faer-rs/blob/main/faer-
         | bench/src...
        
         | llimllib wrote:
         | the former is run with no parallelism, and the latter with
         | `parallelism::Rayon(0)`, which presumably means that it has
         | some parallelism, though I don't write rust or know what Rayon
         | is.
         | 
         | [1]: https://github.com/sarah-ek/faer-
         | rs/blob/172f651fafe625a2534...
        
           | nateb2022 wrote:
           | Rayon is a library that helps to parallelize sequential
           | computations while guaranteeing data-race freedom. Cf.
           | https://docs.rs/rayon/latest/rayon/
        
           | nindalf wrote:
           | Rayon is a crate that makes it easy to write code that
           | executes in parallel. Take a function that looks like this.
           | It'll execute on a single thread.
           | 
          | fn sum_of_squares(input: &[i32]) -> i32 {
          |     input.iter().map(|&i| i * i).sum()
          | }
           | 
           | If you changed it to input.par_iter() then it executes in
           | parallel on multiple threads.
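To make the comparison concrete, here is a self-contained sketch in plain std Rust of what the parallel version computes: split the slice into chunks, sum each chunk's squares on its own thread, and add the partial sums. Rayon's `par_iter` gets the same result with a work-stealing thread pool instead of one thread per chunk; the function name `sum_of_squares_parallel` and the fixed thread count are illustrative, not part of rayon.

```rust
// Conceptual sketch of what rayon's par_iter does for sum_of_squares:
// split the input into chunks and sum each chunk on its own OS thread.
// (rayon uses a work-stealing pool rather than spawning threads here.)
fn sum_of_squares_parallel(input: &[i32]) -> i32 {
    let n_threads = 4; // illustrative; rayon sizes its pool automatically
    let chunk_len = ((input.len() + n_threads - 1) / n_threads).max(1);
    std::thread::scope(|s| {
        // spawn one scoped thread per chunk, each computing a partial sum
        let handles: Vec<_> = input
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&i| i * i).sum::<i32>()))
            .collect();
        // join the threads and add the partial sums
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<i32> = (1..=10).collect();
    // 1^2 + 2^2 + ... + 10^2 = 385
    assert_eq!(sum_of_squares_parallel(&data), 385);
    println!("sum of squares = {}", sum_of_squares_parallel(&data));
}
```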
        
       | bayindirh wrote:
        | Why is Eigen not run in parallel mode with OpenMP?
        | 
        | Eigen handles most (if not all, I just skimmed the tables)
        | tasks in parallel [0]. Plus, it has hand-tuned SIMD code
        | inside, so it needs "-march=native -mtune=native -O3" to make
        | it "full send".
        | 
        | Some solvers' speed changes more than 3x with "-O3", to begin
        | with.
       | 
       | This is the Eigen benchmark file [1].
       | 
       | [0]: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html
       | 
       | [1]: https://github.com/sarah-ek/faer-rs/blob/main/faer-
       | bench/eig...
        
         | sarah-ek wrote:
         | author here, eigen is compiled with -fopenmp, which enables
         | parallelism by default
        
           | bayindirh wrote:
           | Hi! Thanks for chiming in!
           | 
           | Did you check with resource utilization? If you don't provide
           | "OMP_NUM_THREADS=n", Eigen doesn't auto-parallelize by
           | default.
        
             | sarah-ek wrote:
             | i did check, yes
        
         | dannyz wrote:
          | Related to the Eigen benchmarking, I see a lot of use of
          | auto in the benchmarks. Eigen does not recommend using it
          | like this
          | (https://eigen.tuxfamily.org/dox/TopicPitfalls.html) because
          | the template expressions can be quite complicated. I'm not
          | sure if it matters here or not, but it would probably be
          | better not to use it in the benchmarks.
        
           | sarah-ek wrote:
           | i've contributed to eigen in the past and know enough about
           | the internals of the codebase to know my way around safe
           | `auto` usage
        
             | dannyz wrote:
             | I wasn't worried about safe usage, more that some of the
             | initialization may be moved inside the benchmarking
             | function instead of outside of it like intended. I'm sure
             | you know more about it than me though.
        
           | bayindirh wrote:
            | Beyond the pitfalls, auto is a compile-time burden: it
            | creates a lot of load during compilation of this
            | benchmark.
            | 
            | My complete Ph.D. codebase, using a ton of Eigen
            | components plus other libraries, was compiling in 10
            | seconds flat on a much older computer. This requires gigs
            | of RAM plus a minute.
        
             | sarah-ek wrote:
             | the long compile times are mostly because im instantiating
             | every dense decomposition in the library in one translation
             | unit, for several data types (f32, f64, f128, c32, c64,
             | c128)
        
       | citizen_friend wrote:
       | Is there a rust API for LAPACK?
        
         | sanxiyn wrote:
         | Yes, see https://github.com/blas-lapack-rs.
        
       | meisel wrote:
       | Looking at thin matrix SVD, it appears _much_ faster than
       | everyone else. I'm curious what it's doing differently at a high
       | level and if there's any tradeoff in accuracy. I also wonder how
       | it compares to MKL, which is typically the winner in all these
       | benchmarks on Intel.
        
         | sarah-ek wrote:
         | im in the process of refactoring the benchmark code at the
         | moment, and plan to include mkl in the benches soon.
         | 
          | overall, the results show that faer is usually faster than
          | or even with openblas, and slower than mkl on my desktop
        
           | mjhay wrote:
           | Wow, that's impressive! I wouldn't expect anything to be able
           | to beat MKL, given the optimizations made based on
           | proprietary information.
        
       | netwrt wrote:
       | Does this include an einsum function? I find that it makes
       | complex operations easier to work with
        
         | sarah-ek wrote:
         | not yet. a tensor api is on the long term todo list, but it's a
         | big undertaking and i'd like to focus on matrix operations for
         | the time being
        
       | Ericson2314 wrote:
        | Nixpkgs has some pluggable BLAS/LAPACK implementation infra.
        | If this offers a shim layer implementing exactly that
        | interface, it would be nice to see it packaged as a new
        | alternative!
        
       | heinrichhartman wrote:
       | This does not seem to depend on BLAS/LAPACK.
       | 
       | Good to see LU decomposition with full pivoting being implemented
       | here (which is missing from BLAS/LAPACK). This gives a fast,
       | numerically stable way to compute the rank of a matrix (with a
       | basis of the kernel and image spaces). Details:
       | https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom....
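For readers unfamiliar with the technique the linked post describes, here is a minimal scalar sketch of rank computation via LU with full pivoting. This is not faer's implementation; the function name, data layout, and tolerance handling are made up for illustration.

```rust
// Sketch of numerical rank via LU with full pivoting: at each step
// pick the largest-magnitude entry of the trailing submatrix as the
// pivot, and stop when it falls below a tolerance. The number of
// accepted pivots is the numerical rank.
fn rank_full_pivot_lu(mut a: Vec<Vec<f64>>, tol: f64) -> usize {
    let (m, n) = (a.len(), a[0].len());
    let mut rank = 0;
    for k in 0..m.min(n) {
        // full pivot search over the trailing (m-k) x (n-k) submatrix
        let (mut pi, mut pj, mut pmax) = (k, k, 0.0_f64);
        for i in k..m {
            for j in k..n {
                if a[i][j].abs() > pmax {
                    (pi, pj, pmax) = (i, j, a[i][j].abs());
                }
            }
        }
        if pmax <= tol {
            break; // remaining submatrix is numerically zero
        }
        a.swap(k, pi); // row swap
        for row in a.iter_mut() {
            row.swap(k, pj); // column swap
        }
        rank += 1;
        // eliminate below the pivot
        for i in k + 1..m {
            let factor = a[i][k] / a[k][k];
            for j in k..n {
                a[i][j] -= factor * a[k][j];
            }
        }
    }
    rank
}

fn main() {
    // rank-2 matrix: third row = first row + second row
    let a = vec![
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
        vec![5.0, 7.0, 9.0],
    ];
    assert_eq!(rank_full_pivot_lu(a, 1e-10), 2);
}
```

Partial (row-only) pivoting is cheaper but can accept a pivot column that is a near-linear combination of earlier ones, which is why rank-revealing uses of LU prefer full pivoting.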
        
         | sarah-ek wrote:
         | lapack does expose a full pivoting lu as far as i can tell?
         | https://netlib.org/lapack/explore-html/d8/d4d/group__getc2_g...
        
           | bee_rider wrote:
           | If you are going to include vs MKL benchmarks in your repo,
           | full pivoting LU might be one to consider. I think most
           | people are happy with partial pivoting, so I sorta suspect
           | Intel hasn't heavily tuned their implementation, might be
           | room to beat up on the king of the hill, haha.
        
             | sarah-ek wrote:
              | funny you mention that. full pivoting lu is one of the
              | few benchmarks where faer wins by a huge margin
             | 
              | n     faer       mkl        openblas
              | 1024  27.06 ms   186.33 ms  793.26 ms
              | 1536  73.57 ms   605.71 ms  2.65 s
              | 2048  280.74 ms  1.53 s     8.99 s
              | 2560  867.15 ms  3.31 s     17.06 s
              | 3072  1.87 s     6.13 s     55.21 s
              | 3584  3.42 s     10.18 s    71.56 s
              | 4096  6.11 s     15.70 s    168.88 s
        
               | adgjlsfhk1 wrote:
               | it might be interesting to add butterfly lu
               | https://arxiv.org/abs/1901.11371. it's a way of doing a
               | numerically stable lu like factorization without any
               | pivoting, which allows it to parallelize better.
        
               | sarah-ek wrote:
               | looks interesting! thanks for sharing
        
               | bee_rider wrote:
               | It looks like they are describing a preconditioner there.
        
               | adgjlsfhk1 wrote:
               | the key point is that the preconditioner allows you to
               | skip pivoting which is really nice because the pivoting
               | introduces a lot of data dependence.
        
           | heinrichhartman wrote:
            | Thanks for pointing this out. Looks like it's just not
            | exposed via the Python bindings in numpy.
        
         | bmitc wrote:
         | Is there a collection of algorithms published somewhere, in a
         | coherent manner? It seems like to me, there should be a
         | collection of matrix algebra, symbolic algebra, etc. algorithms
         | somewhere that makes it easy to implement these in any language
         | or system. As it is, it seems you need to already be an expert
         | or researcher in the space to make any movement.
        
           | rvdca wrote:
           | One could argue that libraries like BLAS/LAPACK are those
           | collections...
        
             | bmitc wrote:
              | That can be argued, but they're pretty esoteric. For
              | example, what language are they written in? Fortran,
              | assembly, etc.? Probably not something the majority of
              | developers or mathematicians can understand. And
              | creating bindings for them is non-trivial. I don't even
              | know where to start from the website. And on the
              | website, all the links to the routines are broken.
              | https://www.netlib.org/blas/
        
           | sevagh wrote:
           | BLIS is an interesting new direction in that regard:
           | https://github.com/flame/blis
           | 
           | >The BLAS-like Library Instantiation Software (BLIS)
           | framework is a new infrastructure for rapidly instantiating
           | Basic Linear Algebra Subprograms (BLAS) functionality. Its
           | fundamental innovation is that virtually all computation
           | within level-2 (matrix-vector) and level-3 (matrix-matrix)
           | BLAS operations can be expressed and optimized in terms of
           | very simple kernels.
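The quoted idea can be sketched in scalar Rust: the whole matrix product is tiled so that all the arithmetic happens inside one tiny microkernel that updates an MR x NR block of C. Real BLIS kernels pack panels of A and B and are hand-tuned per CPU; everything below (names, tile sizes, the divisibility assumption) is illustrative only.

```rust
// Toy sketch of the BLIS structure: a single microkernel does all the
// flops, and the outer routine just walks MR x NR tiles of C.
const MR: usize = 2;
const NR: usize = 2;

// microkernel: C_tile (MR x NR) += A_panel (MR x k) * B_panel (k x NR),
// all row-major with the given row strides.
fn microkernel(k: usize, a: &[f64], lda: usize, b: &[f64], ldb: usize,
               c: &mut [f64], ldc: usize) {
    for p in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
            }
        }
    }
}

// C (m x n) += A (m x k) * B (k x n), expressed entirely in terms of
// the microkernel. For simplicity we assume m % MR == 0 and n % NR == 0.
fn gemm(m: usize, n: usize, k: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    assert!(m % MR == 0 && n % NR == 0);
    for i in (0..m).step_by(MR) {
        for j in (0..n).step_by(NR) {
            // offset slices select the A row panel, B column panel,
            // and the C tile that the microkernel updates
            microkernel(k, &a[i * k..], k, &b[j..], n, &mut c[i * n + j..], n);
        }
    }
}

fn main() {
    // A = [[1,2],[3,4]], B = [[5,6],[7,8]] => C = [[19,22],[43,50]]
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0; 4];
    gemm(2, 2, 2, &a, &b, &mut c);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
}
```

The payoff of this factoring is that porting to a new CPU means rewriting only the microkernel (in SIMD or assembly), while the tiling and packing logic is reused unchanged.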
        
         | sevagh wrote:
         | On the contrary, it seemingly can be used to make a BLAS
         | implementation (example in a PR: https://github.com/sarah-
         | ek/faer-rs/pull/37)
        
       | arijun wrote:
       | In a few fields of rust we are starting to see a convergence of
       | lower level libraries that then can be shared amongst the higher
       | level crates. For example, wgpu is seeing broad use across a
       | bunch of libraries from game engines to UI libraries. This then
       | allows the shared libraries to be made more robust, with more
       | shared resources going into them.
       | 
       | Does anyone know how much of this is happening in the
       | matrix/array space in rust? There are several libraries that have
       | overlapping goals: ndarray, nalgebra, etc. How much do they share
       | in terms of underlying code? Do they share data structures, or is
       | anything like that on the horizon?
        
         | antimora wrote:
         | WGPU is also used by Burn deep learning framework and WONNX for
         | computation.
        
         | sarah-ek wrote:
         | as far as i know, very little is shared. ndarray-linalg is
         | mostly a lapack wrapper. nalgebra and faer both implement the
         | algorithms from scratch, with nalgebra focusing more on smaller
         | matrices
        
       | owlbite wrote:
       | Something looks dubious with the benchmarking here to me.
       | 
        | Top-tier numerical linear algebra libraries should all hit
        | the same number (give or take a few percent) for matrix
        | multiply, because they're all achieving the same hardware
        | peak performance.
        
         | sarah-ek wrote:
         | one issue that may be affecting the result is that openmp's
         | threadpool doesn't play well with rayon's. i've seen some perf
         | degradation in the past (on both sides) when both are used in
         | the same program
         | 
         | i plan to address that after refactoring the benches by
         | executing each library individually
        
           | jimbobraggins wrote:
            | Very cool project! I'd suggest before running the new
            | benchmarks to reach out to the developers of the packages
            | you are testing against to see if they think the benchmark
            | you wrote is using efficient calling conventions. I work on a
           | large open source software project and we've had people claim
           | they are 10x faster than us while they were really using our
           | code in some very malformed ways.
           | 
           | Also stops them from grumbling after you post good results!
        
             | sarah-ek wrote:
             | fair enough. i try to stay in touch with the eigen and
             | nalgebra developers so i have a good idea on how to write
             | code efficiently with them. for openblas and mkl i've been
             | trying recently to call into the lapack api (benches doing
             | that are still unpublished at the moment), that way im
             | using a universal interface for that kinda stuff.
             | 
             | and of course i do check the cpu utilization to make sure
             | that all threads are spinning for multithreaded benchmarks,
             | and occasionally check the assembly of the hot loops to
             | make sure that the libraries were built properly and are
             | dispatching to the right code. (avx2, avx512, etc) so
             | overall i try to take it seriously and i'll give credit
             | where credit is due when it turns out another library is
             | faster
        
       | loeg wrote:
       | A bit of a tangent, but the same author has something like a
       | libdivide (C++) for Rust: https://github.com/sarah-ek/fastdiv .
       | Cool.
        
       ___________________________________________________________________
       (page generated 2024-04-24 23:01 UTC)