[HN Gopher] Will Floating Point 8 Solve AI/ML Overhead?
___________________________________________________________________
Will Floating Point 8 Solve AI/ML Overhead?
Author : rbanffy
Score : 40 points
Date   : 2023-01-15 21:51 UTC (1 day ago)
(HTM) web link (semiengineering.com)
(TXT) w3m dump (semiengineering.com)
| jasonjmcghee wrote:
| I am curious whether folks have tried hybrid methods using
| ensembles.
|
| Train the main model using FP8 (or other quantized approaches)
| and have a small calibrating model at FP32 that is trained
| afterward.
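|
| A minimal PyTorch sketch of one way that could look (fake 8-bit
| weight quantization stands in for true FP8, and the model and
| data are made up for illustration):
|
|     import torch
|     import torch.nn as nn
|
|     def fake_quantize_8bit(w):
|         # Simulate 8-bit weights: scale to [-127, 127], round, rescale.
|         scale = w.abs().max() / 127.0
|         return torch.round(w / scale) * scale
|
|     # Stand-in "main" model with quantized, frozen weights.
|     main = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
|     with torch.no_grad():
|         for p in main.parameters():
|             p.copy_(fake_quantize_8bit(p))
|     for p in main.parameters():
|         p.requires_grad_(False)
|
|     # Small FP32 calibrator trained afterward on the residual error.
|     calibrator = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
|     opt = torch.optim.Adam(calibrator.parameters(), lr=1e-3)
|
|     x, y = torch.randn(256, 16), torch.randn(256, 1)   # toy data
|     for _ in range(100):
|         pred = main(x) + calibrator(x)   # quantized model + FP32 correction
|         loss = nn.functional.mse_loss(pred, y)
|         opt.zero_grad()
|         loss.backward()
|         opt.step()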
| thriftwy wrote:
| Why go so extreme when you can have FP12? Perhaps a 4-bit
| exponent in the high bits and a signed 8-bit mantissa in the
| low bits.
|
| Or vice versa: a 7-bit exponent, a sign bit, and a 4-bit
| mantissa.
| jasonjmcghee wrote:
| I think the general idea is to make use of SIMD, and lane widths
| are generally powers of two. So if you're trying to multiply as
| many numbers as possible in, say, 64 bits, FP8 gets you 8 lanes
| and FP16 gets you 4. FP12 would get you 5 lanes with some unused
| space, which would be a huge amount of extra work to implement
| for a ~25% gain over FP16.
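|
| As a quick worked example of the lane math (plain Python, just
| arithmetic):
|
|     WORD_BITS = 64
|     for width in (8, 12, 16):
|         lanes = WORD_BITS // width          # values per 64-bit word
|         wasted = WORD_BITS - lanes * width  # leftover bits
|         print(f"{width}-bit: {lanes} lanes, {wasted} bits unused")
|     # 8-bit: 8 lanes, 0 bits unused
|     # 12-bit: 5 lanes, 4 bits unused
|     # 16-bit: 4 lanes, 0 bits unused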
| sj4nz wrote:
| It could be that 8-bit posits are enough. Has that been done at
| scale? I do not know.
| bigbillheck wrote:
| Posits aren't the answer to any question worth asking.
| meltyness wrote:
| Why not?
| SideQuark wrote:
| The original unum formats (which posits descend from) were
| variable width, making them nearly useless for high-performance
| parallel computation. Later versions don't add anything of use
| for low-precision neural networks, and the lack of hardware
| support anywhere makes them too slow for anything other than
| toying around.
|
| See also
| http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf and
| https://www.youtube.com/watch?v=LZAeZBVAzVw
| moloch-hai wrote:
| > lack of hardware support
|
| That seems fixable. Don't people make chips that do what
| you want a lot of? A chip with an array of 8-bit posit
| PUs could process a hell of a lot in parallel, subject
| only to getting the arguments and results to useful
| places.
| ansk wrote:
| Will Doubling Disk Size Solve Storage?
| wolfram74 wrote:
| I feel like the fastest tier of RAM can only get so big before
| speed-of-light delays become relevant.
| visarga wrote:
| The practical question of interest is: will this make it
| possible to run GPT-3-size models on a normal desktop with a
| GPU, like Stable Diffusion?
| SoftTalker wrote:
| At some point the practical question would be how you get all
| the data onto the desktop.
| Dylan16807 wrote:
| People are already downloading 100GB games. And data rates
| are growing much faster than RAM capacities. The logistics
| of downloading a model smaller than GPU RAM are unlikely to
| ever get complicated.
| eklitzke wrote:
| Kind of a weird article, as 8-bit quantization has been widely
| used in production for neural networks for a number of years
| now. The title of the article is a bit misleading, since it's
| widely known that 8-bit quantization works and is extremely
| effective at improving inference throughput and latency. I'm not
| 100% sure I'm reading the article correctly since it's a bit
| oblique, but it seems the news here is that work is being done
| to formally specify a cross-vendor FP8 standard, as what exists
| right now is a set of de facto standards from different vendors.
| FL33TW00D wrote:
| INT8 quantisation has been used in production for years. FP8
| has not.
| Dylan16807 wrote:
| FP8 provides some nice accuracy benefits over INT8, but swapping
| one for the other doesn't affect your overhead.
| ftufek wrote:
| The article mentions 8-bit quantization, but I believe this is
| about training with FP8 as the native format. The latest GPUs
| provide huge FLOPS for that; Tim Dettmers updated his GPU
| article and talks about this. The claim is 0.66 PFLOPS for an
| RTX 4090.
| make3 wrote:
| The title is nonsensical. The faster the compute or the
| inference is (through, e.g., reduced precision), the larger the
| models people will train, because accuracy / output quality
| increases indefinitely with model size, and everyone knows this.
| So a different precision will not "solve the AI/ML overhead";
| that's nonsense. People will just use as large a model as they
| can for their latency budget at inference and for their $ budget
| at training, whatever it is.
| gumby wrote:
| Really, for me just the mantissa would be fine; no need for an
| exponent because so much of what I work on is between 0 and 1.
|
| There was an interesting paper from the Allen Institute a few
| years ago describing a system with 1-bit weights that worked
| pretty well! Since I read it I've been musing on trying that,
| though it seems unlikely I will be able to any time soon.
| [deleted]
| thfuran wrote:
| If you just have a mantissa, aren't you doing fixed point math?
| gumby wrote:
| Yes, just looking for a weight in the range 0 <= x < 1. But I
| want to do large numbers of calculations using the GPU, else
| I'd use the SIMD int instructions (AVX)
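|
| A minimal NumPy sketch of that kind of fixed-point scheme, for
| illustration (Q0.8 here: weights in [0, 1) stored as 8-bit
| integers; the shapes and data are made up):
|
|     import numpy as np
|
|     FRAC_BITS = 8                 # Q0.8: value = integer / 256, range [0, 1)
|     w = np.random.rand(4, 16)     # float weights in [0, 1)
|     x = np.random.randn(16)       # activations
|
|     wq = np.round(w * (1 << FRAC_BITS)).clip(0, 255).astype(np.uint8)
|     xq = np.round(x * (1 << FRAC_BITS)).astype(np.int32)
|
|     # Integer multiply-accumulate, one rescale at the end.
|     acc = wq.astype(np.int32) @ xq
|     y_fixed = acc / float(1 << (2 * FRAC_BITS))
|
|     y_float = w @ x               # FP64 reference
|     print(np.abs(y_fixed - y_float).max())  # small quantization error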
| snickerbockers wrote:
| Just do fixed point bruh.
| gumby wrote:
| It is, but that doesn't give me the hardware affordance I want:
| https://news.ycombinator.com/item?id=34405604
| voz_ wrote:
| " High on the ML punch list is how to run models more efficiently
| using less power, especially in critical applications like self-
| driving vehicles where latency becomes a matter of life or
| death."
|
| Never ever heard of inference latency being a bottleneck here...
| amelius wrote:
| > People who follow a strict neuromorphic interpretation have
| even discussed binary neural networks, in which the input
| functions like an axon spike, just 0 or 1.
|
| How do you perform differentiation with this datatype?
| _0ffh wrote:
| The article, comparing single and double precision:
|
| >the mantissa jumps from 32 bits to 52 bits
|
| Rather from 23 (+1 for the implicit MSB) to 52 (+1), I suppose.
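|
| (A quick sanity check of those figures in NumPy, for reference:)
|
|     import numpy as np
|     # Explicitly stored mantissa bits; the implicit leading 1 is not counted.
|     print(np.finfo(np.float32).nmant)  # 23
|     print(np.finfo(np.float64).nmant)  # 52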
| amelius wrote:
| In the old days of CS, people were talking about optimizations in
| the big-O sense.
|
| Nowadays the talk is mostly about optimization of constant
| factors, so it seems.
| kortex wrote:
| Related:
|
| https://ai.facebook.com/blog/making-floating-point-math-high...
|
| That's Meta's 8-bit data type, originally called (8, 1, alpha,
| beta, gamma). I think they realized that's a terrible name, so
| now they call it Deepfloat or something.
| [deleted]
| fswd wrote:
| For LLMs, INT8 is old news but still exciting. FP8 would
| definitely be an improvement. However, the new coolness is INT4.
|
| > Excitingly, we manage to reach the INT4 weight quantization for
| GLM-130B while existing successes have thus far only come to the
| INT8 level. Memory-wise, by comparing to INT8, the INT4 version
| helps additionally save half of the required GPU memory to 70GB,
| thus allowing GLM130B inference on 4 x RTX 3090 Ti (24G) or 8 x
| RTX 2080 Ti (11G). Performance-wise, Table 2 left indicates that
| without post-training at all, the INT4-version GLM-130B
| experiences almost no performance degradation, thus maintaining
| the advantages over GPT-3 on common benchmarks.
|
| Page 7 https://arxiv.org/pdf/2210.02414.pdf
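|
| For reference, a minimal sketch of symmetric 4-bit weight
| quantization in NumPy (a generic absmax-per-row illustration,
| not the specific GLM-130B recipe):
|
|     import numpy as np
|
|     def quantize_int4(w):
|         # Symmetric absmax quantization to the 4-bit range [-7, 7], per row.
|         scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
|         q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
|         return q, scale          # real kernels pack two 4-bit values per byte
|
|     def dequantize(q, scale):
|         return q.astype(np.float32) * scale
|
|     w = np.random.randn(8, 64).astype(np.float32)
|     q, scale = quantize_int4(w)
|     print(np.abs(dequantize(q, scale) - w).max())  # quantization error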
| cypress66 wrote:
| Hopper seems to drop int4 support so maybe it's old news now?
|
| https://en.m.wikipedia.org/wiki/Hopper_(microarchitecture)
| dragontamer wrote:
| At this rate, we're going to end up with FP1 (1-bit floating
| point) numbers...
|
| I guess that's nonsensical. One bit of every FP format is the
| sign bit, so I guess the minimum size is 2-bit FP (1 sign bit +
| 1 exponent bit + a 0-bit mantissa with an implicit leading 1).
| kortex wrote:
| At FP2, you are probably better off with {-1, 0, 1, NaN}
| (sign+mantissa) rather than sign/exponent. You basically bit
| pack.
|
| FP3 gives you sign, 1x "exponent", 1x mantissa, so still
| kinda bit packing.
|
| I could see FP4 with sign, 1x exponent, 2x mantissa. Exponent
| would really just be a 4x multiplier, giving
| +/-0,1,2,3,4,8,12
|
| Or invert all those, so you are expressing common fractions
| on 0..1
| dragontamer wrote:
| Real life has the E3 series: 1, 2.2, 4.7, and then 10, 22,
| 47, 100, 220, 470, 1000, etc.
|
| EEs would recognize these as the preferred resistor values
| for projects (though the E6 series is more commonly used in
| projects, the E3 and E1 values are preferred).
|
| That's 3 values per decade, which is slightly more dispersed
| than an FP4 consisting of 1 sign bit + 3 exponent bits + 0
| mantissa bits (with an implicit leading 1).
|
| Or the values -128, -64, -32, -16, -8, -4, -2, -1, 1, 2,
| ... 128.
|
| Maybe we can take -128 and call that zero instead, because
| zero is useful.
|
| --------
|
| Given how even E3 is still useful in real-world electrical
| engineering problems, I'm more inclined to allocate more
| bits to the exponent than the mantissa.
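|
| A tiny enumeration of that hypothetical 4-bit layout (1 sign bit
| + 3 exponent bits, implicit mantissa of 1, with -128 remapped to
| zero), as a sketch:
|
|     # Enumerate the 16 values of the hypothetical format described above.
|     values = []
|     for sign in (+1, -1):
|         for e in range(8):            # 3 exponent bits
|             values.append(sign * 2 ** e)
|     values[values.index(-128)] = 0    # steal one code point for zero
|     print(sorted(values))
|     # [-64, -32, -16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16, 32, 64, 128]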
| ben_w wrote:
| > Real life has the E3 series: 1, 2.2, 4.7, and then 10,
| 22, 47, 100, 220, 470, 1000, etc.
|
| Took me until today to realise that sequence is a rounded
| version of 10^(n/3) for integer n.
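|
| (Checking that quickly in Python; the published E3 values round
| 4.64 up to 4.7 for historical reasons:)
|
|     for n in range(7):
|         print(n, round(10 ** (n / 3), 2))
|     # 0 1.0    -> E3: 1.0
|     # 1 2.15   -> E3: 2.2
|     # 2 4.64   -> E3: 4.7
|     # 3 10.0   -> E3: 10
|     # 4 21.54  -> E3: 22
|     # 5 46.42  -> E3: 47
|     # 6 100.0  -> E3: 100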
| Dylan16807 wrote:
| If you're going to bother doing floats you should probably
| make them balanced around 1.
|
| And exponent seems to be much more important for these
| small sizes. The first paper that shows up for FP4 almost
| has negative mantissa bits. Their encoding has 0, 1/64,
| 1/16, 1/4, 1, 4, 16, 64.
| varispeed wrote:
| That's once we get into asymmetrical number coding, so that you
| could use numbers that take a fraction of a bit each.
| dimatura wrote:
| Binary neural networks, where weights and/or activations are
| just 0s and 1s, are an active research area. In theory they
| could be implemented very efficiently in hardware. But in
| contrast to FP16 (or, to some extent, INT8), just quantizing
| FP32 down to 1 bit doesn't work very well; the successful
| methods in practice are purpose-built for it. There was a
| company called Xnor.ai that was built partially around this
| technology, but it was sold to Apple a couple of years ago. I
| don't know what the current SOTA in this area is, though.
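|
| For illustration, the trick that makes this attractive in
| hardware: a dot product of two +/-1 vectors reduces to XNOR plus
| popcount (a toy Python sketch, with made-up data):
|
|     import numpy as np
|
|     n = 64
|     a = np.random.choice([-1, 1], n)
|     b = np.random.choice([-1, 1], n)
|
|     # Encode +1 as bit 1 and -1 as bit 0, packed into Python ints.
|     abits = int("".join("1" if v > 0 else "0" for v in a), 2)
|     bbits = int("".join("1" if v > 0 else "0" for v in b), 2)
|
|     # XNOR counts matching positions; dot = matches - mismatches.
|     mask = (1 << n) - 1
|     matches = bin(~(abits ^ bbits) & mask).count("1")
|     print(2 * matches - n, int(a @ b))  # identical results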
| SideQuark wrote:
| 1 bit would work fine: make the values represent +/-1 or so.
| visarga wrote:
| I think I read somewhere it only goes as low as int4. Can't
| find the reference.
___________________________________________________________________
(page generated 2023-01-16 23:00 UTC)