[HN Gopher] Vectorization: Introduction
___________________________________________________________________
Vectorization: Introduction
Author : gsav
Score : 126 points
Date : 2023-06-01 19:50 UTC (3 hours ago)
(HTM) web link (cvw.cac.cornell.edu)
(TXT) w3m dump (cvw.cac.cornell.edu)
| NKosmatos wrote:
| I know I'm probably going to get downvoted for this, but I'm
| going to post it anyhow :-)
|
| https://jvns.ca/blog/2023/02/08/why-does-0-1-plus-0-2-equal-...
| sva_ wrote:
| Perhaps you could point out in which way you think this is
| relevant?
| NKosmatos wrote:
    | From the introductory text "The ultimate goal of
    | vectorization is an increase in floating-point
    | performance...". I find it a bit funny that by vectorizing
    | floating-point operations we'll be able to calculate wrong
    | answers faster ;-)
| windsignaling wrote:
| Still irrelevant.
| m00x wrote:
| How is this relevant? This is more of a critique of floats
| than it is about vectorization.
| jcranmer wrote:
  | It's not a wrong answer, any more than writing down 1/3 as
  | 0.333, multiplying that by 3, and discovering that the
  | result is 0.999 and not 1.
|
| And there's nothing in vectorization that requires
| floating-point numbers; it's just that the algorithms that
| tend to demand the most out of a computer tend to be large
| numerical algorithms, which use floating-point.
| sva_ wrote:
| > the result is 0.999 and not 1.
|
| But 0.999... is 1.
|
| Edit: Nevermind, I see you meant "writing down 1/3 as
| 0.333" literally.
| comicjk wrote:
| They're not wrong answers, they're floating-point answers.
| What's wrong is the assumption that infinite precision will
| fit in a handful of bytes.
| shoo wrote:
| for many practical applications where we compute things,
| the input data isn't fully accurate, it's an estimate with
| significant error or uncertainty. when we consider how the
| outputs of computations are used, in many cases it isn't
  | useful for outputs to be arbitrarily precise. maybe the
  | output is used to make a decision, and the expected cost or
  | benefit of that decision won't change significantly even if
  | the output is wrong by 1%, for example.
|
| one person's "wrong" is another person's approximation. by
| reducing accuracy we can often get very large performance
| improvements. good engineering makes an appropriate
| tradeoff.
|
| there's even more scope for this kind of thing inside some
| numerical algorithms where having a "pretty good" guess
| gives a massive speed boost, but the algorithm can be
| iterated to find quite precise answers even if the "pretty
| good" guess is an approximation that's off by 20%
|
| there's yet more scope for this if we consider if we're
| trying to mathematically model reality -- even if we
| perform all the math with arbitrary precision calculations,
| the theoretical model itself is still an approximation of
| reality, there's some error there. there's limited value in
| solving a model exactly, as there's always some
| approximation error between the model and reality.
|
| it's amazing we have such good support for high-performance
| scientific quality ieee floating point math everywhere in
| commodity hardware. it's a great tool to have in the
| toolbelt
| nologic01 wrote:
  | We also need something like Tensorization: Introduction
| alfalfasprout wrote:
| Tensor ops are usually implemented using the principles of
| vectorizing code.
|
  | After all, most tensors are implemented as 1D arrays with
  | strides for each dimension, with the innermost dimension
  | always contiguous.
| sva_ wrote:
| I think the bigger difficulty lies in how to store and
| process this stuff on a GPU/other accelerator.
|
| If the OP cares about implementation details about how an API
| like PyTorch is made, I think the MiniTorch 'book' is a
| pretty good intro:
|
| https://minitorch.github.io/
| richardw wrote:
| ChatGPT interpretation:
|
| Sure! Let me break it down for you:
|
| When we talk about "tensor ops," we're referring to
| operations performed on tensors, which are multidimensional
| arrays of numbers commonly used in mathematics and computer
| science.
|
| Now, these tensor operations are typically implemented using
| a technique called "vectorizing code." In simple terms,
| vectorizing code means performing operations on entire arrays
| of data instead of looping through each element one by one.
| It's like doing multiple calculations at once, which can be
| more efficient and faster.
|
| Tensors are usually represented as 1D arrays, meaning all the
| elements are arranged in a single line. Each dimension of the
| tensor has a concept called "stride," which represents how
| many elements we need to skip to move to the next element in
| that dimension. This helps us efficiently access and
| manipulate the data in the tensor.
|
| Additionally, the innermost dimension of a tensor is always
| "contiguous," which means the elements are stored
| sequentially without any gaps. This arrangement also aids in
| efficient processing of the tensor data.
|
| So, in summary, tensor ops involve performing operations on
| multidimensional arrays, and we use vectorized code to do
| these operations more efficiently. Tensors are represented as
| 1D arrays with strides for each dimension, and the innermost
| dimension is always stored sequentially without any gaps.
| riku_iki wrote:
    | > the principles of vectorizing code.
    |
    | > 1D arrays
    |
| I think accelerators mostly do computations over small
| matrices and not vectors, so it is "matricized" code, but I
| am not an expert in this area.
| amelius wrote:
| We really need to name tensors differently. "Array" would be a
| more fitting name.
|
  | Tensors are really different mathematical objects, with far
  | richer structure than those used in the Deep Learning
  | context.
|
| https://en.wikipedia.org/wiki/Tensor
| nologic01 wrote:
| I think that naming battle is lost.
|
| Would indeed be cool to have _true_ tensor processing units
| :-)
|
    | PKD-level cool, as you could argue those chips directly
    | process spacetime chunks.
| sva_ wrote:
| Names can have different meanings based on context, even
| within pure mathematics. I think a mathematician would have
| little trouble discerning which kind of tensor is meant, as
| this sort of reusing existing terms is pretty common.
|
| Just calling it an array might be underselling it, at this
| point. Perhaps a tensor in the context of CompSci is just a
| particular type of array with certain expected properties.
| itishappy wrote:
| Same with vectors!
|
    | https://en.wikipedia.org/wiki/Vector_(mathematics_and_physic...
| opportune wrote:
    | You could apply that argument to "isomorphic" functions, c++
    | "vectors" or compute "vectorization", "integer" types... even
    | though the semantics aren't exactly the same between the
    | computing and math terms, they're related enough that they do
    | help with understanding for those coming from a math
    | background (which to be fair is probably fewer and fewer
    | programmers over time).
| myownpetard wrote:
| Isomorphic javascript is one of the most egregious.
| godelski wrote:
| This is mostly going to be the same. The main difference in
| tensorization is that your vectors are not contiguous. So you
  | lose some speedup. How you typically optimize this is by
  | flattening your tensor. Then your data is contiguous and all
  | the same rules apply.
|
| This is overly simplified though. Things get different when we
| start talking about {S,M}I{S,M}D (page addresses SIMD) or GPUs.
| Parallel computing is a whole other game, and CUDA takes it to
| the next level (lots of people can write CUDA kernels, not a
| lot of people can write GOOD CUDA kernels (I'm definitely not a
| pro, but know some)). Parallelism adds another level of
| complexity due to being able to access different memory
| locations simultaneously. If you think concurrent optimization
| is magic, parallel optimization is wizardry (I still believe
| this is true)
| corsix wrote:
| From a hardware perspective, vector instructions operate on
| small 1D vectors, whereas tensor instructions operate on small
| 2D matrices. I say "instructions", but it's really only matrix
| multiply or matrix multiply and accumulate - most other
| instructions are fine staying as 1D.
| nologic01 wrote:
    | If there is matrix multiply at the hardware level it's fair
    | for it to have another name than vectorization. For example,
    | the dimensions and partitioning of large matrices to fit
    | would be specific to that design and very different from
    | rolling things out on 1D arrays.
| dragontamer wrote:
| Auto-vectorization is easier to get into than other SIMD-
| frameworks like CUDA, OpenCL, ROCm, Intel's ISPC and whatnot. But
| in my experience, auto-vectorizers are just not as flexible as
| the proper SIMD-tools.
|
| I'd say auto-vectorization should still be learned by modern
| high-performance programmers, because it's very low-hanging fruit.
| You barely have to do anything and suddenly your for-loops are
| AVX512 optimized, though maybe not to the fullest extent
| possible.
|
| Still, I suggest that programmers also learn how to properly make
| SIMD code. Maybe intrinsics are too hard in practice, but ISPC,
| CUDA, and other SIMD-programming environments make things far
| easier to learn than you might expect.
|
| ------------
|
| ISPC in particular is Intel's SIMD programming language, much
| akin to CUDA except it compiles into AVX512. So for AVX512-like
| code execution environments, using the ISPC language/compiler is
| exceptionally useful.
|
| It's harder to learn a new language than to learn a few
| compiler options to enable auto-vectorization, however. So in practice,
| auto-vectorization will continue to be used. But for tasks that
| specifically would benefit from SIMD-thinking, the dedicated ISPC
| language should be beneficial.
| sva_ wrote:
  | > I'd say auto-vectorization should still be learned by modern
  | high-performance programmers, because it's very low-hanging
  | fruit.
|
| Somewhat related: 55 GiB/s FizzBuzz
| https://codegolf.stackexchange.com/questions/215216/high-thr...
| (discussion: https://news.ycombinator.com/item?id=29031488)
| saagarjha wrote:
| Definitely not autovectorized.
| jackmott42 wrote:
| What do you mean by learning auto vectorization? Do you mean
| learning how to structure your code so that your compiler of
| choice has a good chance of pulling it off?
| Veliladon wrote:
| Pretty much.
|
    | If one wants to sum two arrays, instead of looping through
    | the elements one by one, one might iterate by chunks so that
    | the compiler can tie them together into a single operation
    | that vectorizes easily.
| photochemsyn wrote:
| This looks like a curious case: non-fixed-length vectorization
| code in RISC-V:
|
| > "Perhaps the most interesting part of the open RISC-V
| instruction set architecture (ISA) is the vector extension
| (RISC-V "V"). In contrast to the average single-instruction
| multiple-data (SIMD) instruction set, RISC-V vector instructions
| are vector length agnostic (VLA). Thus, a RISC-V "V" CPU is
| flexible in choosing a vector register size while RISC-V "V"
| binary code is portable between different CPU implementations."
|
| https://gms.tf/riscv-vector.html
| divbzero wrote:
| With interpreted languages such as Python or MATLAB, the
| performance gains come not only from hardware parallelism but
| also from the use of compiled libraries instead of interpreted
| code.
| saagarjha wrote:
| > Vectorization is a process by which floating-point computations
| in scientific code are compiled into special instructions that
| execute elementary operations (+,-,*, etc.) or functions (exp,
| cos, etc.) in parallel on fixed-size vector arrays.
|
| I guess this is a scientific computing course or something but I
| feel like even so it's important to point out that most
| processors have many vector instructions that operate on
| integers, bitfields, characters, and the like. The fundamental
| premise of "do a thing on a bunch of data at once" isn't limited
| to just floating point operations.
___________________________________________________________________
(page generated 2023-06-01 23:00 UTC)