[HN Gopher] Exponentially faster language modelling
___________________________________________________________________
Exponentially faster language modelling
Author : born-jre
Score : 162 points
Date : 2023-11-21 14:31 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| fgfm wrote:
| This approach feels like pruning, but the speedup is considerably
| higher. I'm curious how this will play out on more recent
| transformer architectures, though: I'd guess the speedup will
| matter most for the largest architectures, but even a 2x or 10x
| speedup on Mistral/Zephyr, Orca 2 or OpenChat 3.5 would be a
| tremendous achievement!
| webmaven wrote:
| I'm curious as to how applicable this approach might be for
| text-to-image models like Stable Diffusion.
| ndr wrote:
| Abstract:
|
| > Language models only really need to use an exponential fraction
| of their neurons for individual inferences. As proof, we present
| UltraFastBERT, a BERT variant that uses 0.3% of its neurons
| during inference while performing on par with similar BERT
| models. UltraFastBERT selectively engages just 12 out of 4095
| neurons for each layer inference. This is achieved by replacing
| feedforward networks with fast feedforward networks (FFFs). While
| no truly efficient implementation currently exists to unlock the
| full acceleration potential of conditional neural execution, we
| provide high-level CPU code achieving 78x speedup over the
| optimized baseline feedforward implementation, and a PyTorch
| implementation delivering 40x speedup over the equivalent batched
| feedforward inference. We publish our training code, benchmarking
| setup, and model weights.
|
| Conclusions
|
| > We present UltraFastBERT, a modified version of the
| (crammed)BERT architecture that uses fast feedforward instead of
| feedforward networks in its intermediate layers. UltraFastBERT
| serves as proof that large language models only really need to
| engage an exponential fraction of their parameters to perform
| individual inferences. UltraFastBERT-1x11, our deepest model with
| the highest promise of acceleration, uses only 0.3% of its
| neurons during inference and already achieves a 78x CPU speedup
| over the inference time of the corresponding feedforward layer.
| With a theoretical speedup promise of 341x at the scale of BERT-
| base models, we hope that our work will inspire an effort to
| implement primitives for conditional neural execution as a part
| of device programming interfaces.
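| For illustration, a minimal sketch (not the authors' code) of the
| fast feedforward idea at inference time: a full binary tree of
| single-neuron nodes, where each input only evaluates the nodes on
| one root-to-leaf path, matching the "12 of 4095 neurons" figure
| for depth 11. All names, shapes and initialisations below are
| assumptions.
|
|     import torch
|
|     class FFFInference(torch.nn.Module):
|         # Sketch of a fast feedforward layer with hardened branches.
|         def __init__(self, hidden_dim, depth=11):
|             super().__init__()
|             self.depth = depth
|             n_nodes = 2 ** (depth + 1) - 1   # 4095 for depth 11
|             self.w_in = torch.nn.Parameter(
|                 0.02 * torch.randn(n_nodes, hidden_dim))
|             self.b_in = torch.nn.Parameter(torch.zeros(n_nodes))
|             self.w_out = torch.nn.Parameter(
|                 0.02 * torch.randn(n_nodes, hidden_dim))
|
|         def forward(self, x):                # x: (batch, hidden_dim)
|             node = torch.zeros(x.shape[0], dtype=torch.long,
|                                device=x.device)  # start at the root
|             y = torch.zeros_like(x)
|             for _ in range(self.depth + 1):  # visit 12 nodes per input
|                 logit = (x * self.w_in[node]).sum(-1) + self.b_in[node]
|                 act = torch.nn.functional.gelu(logit)
|                 y = y + act.unsqueeze(-1) * self.w_out[node]
|                 # sign of the pre-activation picks the child node
|                 node = 2 * node + 1 + (logit > 0).long()
|             return y
|
|     y = FFFInference(768)(torch.randn(4, 768))   # -> (4, 768)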
| iNic wrote:
| Do I understand correctly that the difficulty of making this
| useful is writing code to run this idea on GPUs?
| nolist_policy wrote:
| As far as I understood it: Forget GPUs, this thing is plenty
| fast on CPUs.
|
| In general, GPUs are bad at branching. The fastest way to
| implement it on GPUs is probably to let it calculate both
| sides of the branch and then only use the result of the one
| that was taken. Which won't be faster than a normal NN.
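| For a toy illustration of "compute both sides and mask" (not from
| the paper; the shapes and the gating rule are made up):
|
|     import torch
|
|     dev = "cuda" if torch.cuda.is_available() else "cpu"
|     x = torch.randn(1024, 768, device=dev)
|     gate = torch.randn(768, 1, device=dev)
|
|     cond = (x @ gate).squeeze(-1) > 0   # branch decision per row
|     left = torch.relu(x)                # both branches are computed...
|     right = -x
|     out = torch.where(cond.unsqueeze(-1), left, right)  # ...then masked
|
| Every row pays for both branches, so no FLOPs are saved - which is
| why per-element branching is a poor fit for the GPU programming
| model.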
| falcor84 wrote:
| I wonder then, if each inference only uses a small part of
| the net, could you possibly perform multiple inferences in
| the same forward pass?
| lawlessone wrote:
| > Forget GPUs, this thing is plenty fast on CPUs.
|
| Does this mean everyone could be running the 100B+ models from
| RAM?
|
| This opens up a lot; some models could be run very fast on small
| machines with this.
|
| Bundling a small model inside a game to act as part of the mind
| for in-game NPCs (obviously with some tuning) becomes practical
| with this.
| hobofan wrote:
| The bottleneck for "easy integration" into games and
| applications right now is as much the RAM usage as the
| slowness. This would probably bring the speed to an acceptable
| level, but you would still have to hold the whole model in RAM.
|
| That would make it a lot more feasible to run models in
| the cloud (triple digit RAM is a lot more abundant than
| VRAM), but wouldn't do that much for consumer hardware.
| nolist_policy wrote:
| I wonder if the model takes similar branches while in the
| same context? Then you can fault in parts of the model
| from disk as needed.
| entropicdrifter wrote:
| Interesting idea. Like texture streaming, you'd just
| stream in the parts of the model from disk to fill up all
| available RAM. If the NPC needed to think about something
| not cached in RAM, you'd throw up a "hmm, let me think
| about this" while stuff loads from disk.
| PaulHoule wrote:
| They are getting a 78x speedup w/o hardware support, which is
| pretty good: they think they could speed it up another 4x if
| they had the right hardware support. So it looks useful now,
| with the possibility of getting better.
|
| For as long as I've been involved with neural networks for text
| analysis, it's seemed to me that we really should be using
| sparse activations, because any particular document only
| involves a limited set of concepts.
|
| For instance a search engine for patents might be looking at
| a patent for adhesive tape which activates a certain set of
| concepts but is not going to activate concepts involved with
| bicycle derailleurs or public key cryptography: a sparse
| representation reflects this and dense representations don't.
| kristianp wrote:
| > a PyTorch implementation delivering 40x speedup over the
| equivalent batched feedforward inference
|
| Does this not indicate a 40x speedup on the GPU?
|
| Edit: looking at the paper, their "Naive CUDA" implementation
| also shows a 117x speedup in Table 2.
| sdrg822 wrote:
| Cool. Important note:
|
| """ One may ask whether the conditionality introduced by the use
| of CMM does not make FFFs incompatible with the processes and
| hardware already in place for dense matrix multiplication and
| deep learning more broadly. In short, the answer is "No, it does
| not, save for some increased caching complexity." """
|
| It's hard to beat the hardware lottery!
| algo_trader wrote:
| In fact, as stated in the paper, this is bad news:
|
| > We therefore leave the attention layers untouched
|
| Meaning, presumably, that the GPU memory remains the bottleneck
|
| Flops really are quite cheap by now, e.g. vision inference chip
| ~$2/teraflop/s !!
| marcinzm wrote:
| A bottleneck for larger models, but this would presumably allow
| for cheaper models at scale or on compute-constrained devices
| (like phones).
| entropicdrifter wrote:
| And potentially for distributing a model across several
| devices at inference time. You could devote a cluster of
| smaller/weaker machines to inference.
| sroussey wrote:
| You can do that today; the only advantage today, though, is
| being able to fit the model in memory. It's sequential and
| slower due to communication costs, though batching might be
| faster?
| ashirviskas wrote:
| >Flops really are quite cheap by now, e.g. vision inference
| chip ~$2/teraflop/s !!
|
| I'm really interested, can you share where you got these
| numbers?
| algo_trader wrote:
| Axelera [1] or Hailo [2] give you 100-200 TFLOPS for ~$200.
|
| 8-bit ops, inference only, low-memory embedded, excluding the
| host; implied utilization from FPS specs is ~20%.
|
| But the trend is there.
|
| There are also newer ADAS/AV units from China which claim 1000
| TFLOPS and can't really cost more than $1000-$2000 per car.
|
| These are all tiled designs (see also Tesla's Dojo), heavily
| overweighted on flops vs memory.
|
| [1] https://www.axelera.ai/
|
| [2] https://hailo.ai/
| Y_Y wrote:
| You can't get flops on a Hailo-8, they're fixed-point
| only. As much as these specialised inference chips are
| cool, we're a long way from just being able to drop them
| in where a GPU was. Not to mention the memory is hugely
| constrained. The Hailo chips I've worked with were all
| limited to 20MiB for the weights which is a squeeze even
| at 4-bit.
| YetAnotherNick wrote:
| > ~$2/teraflop/s
|
| An H100 is basically ~$2/hour for ~2000 TFLOPS, i.e. roughly $1
| for 4*10^18 floating-point operations (2*10^15 FLOP/s * 3600 s
| for $2).
| theGnuMe wrote:
| There's another paper replacing attention with FF networks so
| just combine the two and you've got something.
| gdoug wrote:
| Link? Sounds like a good read! :)
| smeeth wrote:
| Not op but might be this:
| https://arxiv.org/pdf/2311.10642.pdf
| Klaster_1 wrote:
| What are the potential consequences? Does this open doors to
| faster edge inference or improved capabilities?
| yvdriess wrote:
| Both. Cheaper CPU-based inference; GPUs are not as competitive
| for sparse linear algebra. This could lead to much larger
| models, as you only touch a small portion of the matrix during
| inference. However, the training here is still dense-LA on a
| GPU, so you still blow up the compute cost when increasing
| model size.
| WithinReason wrote:
| Note this doesn't speed up training
| swalsh wrote:
| Has anyone used SIMD instructions to try and speed up CPU
| inference?
| hobofan wrote:
| Most inference builds on top of BLAS libraries, which in
| their implementation take advantage of SIMD.
| singhrac wrote:
| A lot of CPU inference libraries (llama.cpp included) use
| as much SIMD as possible, sometimes by hand-writing loops.
| The one I hack on, llama.rs, uses portable_simd but
| specializes to your CPU at compile time.
|
| My experience has been that most CPU inference is actually
| not compute limited, but memory bandwidth limited, since
| most weights are used for a few operations per token (how
| quickly can you load and unload the entire 70 GB of weights
| into your registers?). It's not quite that bad but I found
| most vectorization changes didn't meaningfully change
| performance.
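| A back-of-envelope sketch of that bandwidth ceiling (the 70 GB
| model size and the ~50 GB/s of DDR bandwidth are assumed
| illustrative numbers):
|
|     model_bytes = 70e9   # e.g. ~70B weights at 1 byte each (8-bit)
|     dram_bw = 50e9       # ~50 GB/s of practical DDR5 bandwidth
|
|     # If every weight is read once per token, this is the ceiling:
|     print(f"{dram_bw / model_bytes:.2f} tokens/s")   # ~0.71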
| anonymousDan wrote:
| Would you say that is the state of the art CPU inference
| library?
| valine wrote:
| GPU utilization should be down when using this technique. I'm
| hoping this could allow for more efficient batch inference on
| GPUs. If you can predict 10 tokens for the price of 1 it
| should allow you to do tree of thought much more efficiently.
|
| https://github.com/princeton-nlp/tree-of-thought-llm
| hobofan wrote:
| I think with that magnitude of a speed improvement it should
| become feasible to do just-in-time embedding creation for
| semantic search for much larger documents.
| millisecond wrote:
| Could this be applied to other models like Llama2 or Mistral?
| andy99 wrote:
| Just from the abstract I don't see why not; it's just replacing
| the feedforward network that's part of all of these models with
| a very sparse one. The bigger problem is that you seemingly
| have to retrain the model, so you couldn't just drop in Llama 2
| weights from Meta and have it work, which makes it much more
| limiting. Something that used existing weights would be a lot
| more practical (like quantization, for example). For BERT, I
| can see this being useful if you had to make a really fast
| embedding model. There was a discussion about a fast embedding
| use case not long ago:
| https://news.ycombinator.com/item?id=37898001
| MinusGix wrote:
| It certainly could, and I wouldn't be surprised if the authors
| want to try it out on those. You do have the issue that past
| improvements often don't enhance more powerful models nearly as
| much. I'd expect this to possibly not work as well: the bigger
| models may end up with more polysemantic neurons because
| they're given more "incentive" (training time, neuron count,
| dataset size they're encouraged to be able to reconstruct) to
| extract as much as possible. This might make the method perform
| worse due to that intermingling. (See the transformer circuits
| website for that.) (Though I expect there are ways to recover a
| good chunk of the lost throughput/accuracy, maybe by doing
| extra steps to directly steer the training towards breaking
| apart polysemantic neurons.)
| lopuhin wrote:
| There are two issues here -- for one, in big transformers, more
| compute is in the attention layers, while this work improves
| only feed-forward layers, which are more important for smaller
| models and smaller sequence lengths. Second, in many typical
| scenarios LLM inference is memory-bandwidth bound, and I'm not
| sure whether it's possible to use their approach to reduce the
| required memory bandwidth.
| joelthelion wrote:
| Doesn't reducing the number of neurons drastically reduce
| memory requirements?
| lopuhin wrote:
| Yes it might. "Reduction of number of neurons" is not
| static here, unlike traditional pruning approaches, here
| they still keep all weights, but the network dynamically
| selects which sub-portion of them to use. There is a
| related discussion of this in section 3.2 (page 4), but
| they don't think they mention actual memory bandwidth
| requirements/wins of their implementation, and probably
| there can be different tradeoffs for different devices.
| OneOffAsk wrote:
| Is this similar to what iOS 17 uses for its new autocomplete?
| WithinReason wrote:
| Link to previous paper:
|
| https://arxiv.org/abs/2308.14711
|
| An attempt at a summary: They use a sigmoid function to make
| differentiable "soft" branches, and stack them to construct a
| binary tree, with the goal of only taking one branch at inference
| time (but training the whole tree) leading to log(W) instead of W
| inference cost. They gradually harden the branches so they become
| hard branches at the end of training.
|
| A branch is computed as _branch(input, N)_, with a neural
| network N computing a scalar _c = N(input)_, then using a
| sigmoid to do a soft branch by returning the weighted sum of
| the recursive calls _s(c)*branch(input, N_left) + (1-s(c)) *
| branch(input, N_right)_ (the two weights _s(c)_ and _1-s(c)_
| sum to 1). They only do "proper processing" at the leaf nodes.
|
| Then they add a new loss term that encourages hard decisions by
| minimising the entropy of the Bernoulli distribution, making
| the 2 weights converge to 0 and 1, at which point only one
| branch needs to be taken at inference. They also state that
| this hardening often happens automatically, though.
|
| It's a simple idea, but the loss formulation is nice; you
| usually want your loss terms to be a measure of information.
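| A toy sketch of the soft branch and the entropy term described
| above (the node/leaf networks and all names are illustrative,
| not the paper's code):
|
|     import torch
|
|     def soft_branch(x, node_nets, leaf_nets, idx=0):
|         # Training-time soft branch: every leaf is evaluated,
|         # weighted by the product of sigmoid gates along its path.
|         if idx >= len(node_nets):                  # reached a leaf
|             return leaf_nets[idx - len(node_nets)](x), x.new_zeros(())
|         p = torch.sigmoid(node_nets[idx](x).squeeze(-1))   # s(c)
|         left, h_l = soft_branch(x, node_nets, leaf_nets, 2 * idx + 1)
|         right, h_r = soft_branch(x, node_nets, leaf_nets, 2 * idx + 2)
|         out = p.unsqueeze(-1) * left + (1 - p).unsqueeze(-1) * right
|         # Bernoulli entropy; minimising it pushes p toward 0 or 1,
|         # i.e. hardens the branch so inference can take one side only.
|         eps = 1e-8
|         h = -(p * (p + eps).log()
|               + (1 - p) * (1 - p + eps).log()).mean()
|         return out, h + h_l + h_r
|
|     hidden, depth = 16, 2
|     nodes = torch.nn.ModuleList(
|         torch.nn.Linear(hidden, 1) for _ in range(2 ** depth - 1))
|     leaves = torch.nn.ModuleList(
|         torch.nn.Linear(hidden, hidden) for _ in range(2 ** depth))
|     y, h = soft_branch(torch.randn(8, hidden), nodes, leaves)
|     loss = y.pow(2).mean() + 0.1 * h   # task loss + hardening term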
| WithinReason wrote:
| Also, this didn't come from OpenAI or DeepMind, or even
| industry. What are those guys even doing? :)
| mmaunder wrote:
| Many labs doing foundational work like this and making progress
| don't have anything near the budget or compute to implement at
| scale. In other words, they don't have a Sam and his backers or
| a Zuck and his budget.
| Micoloth wrote:
| They sure as hell have no incentive to make neural networks
| faster and more accessible, for starters...
|
| (Considering that right now they make more money and have more
| control the less accessible and the more computation-hungry AI
| models are.)
|
| To be fair, this approach (claims to) only speed up inference,
| not training, so all the GPUs are needed anyway.
| WithinReason wrote:
| They certainly have an incentive to keep these kinds of
| improvements in-house and not publish them, since they are
| commercial entities and this represents a competitive
| advantage.
| lawlessone wrote:
| I think Nvidia might have an incentive for this not to
| exist.
|
| edit: but you are right; for the AI companies not open-sourcing
| their models, it's an advantage to have it when others don't.
| WithinReason wrote:
| I'm actually not sure about Nvidia, due to
| https://en.wikipedia.org/wiki/Jevons_paradox
| ForkMeOnTinder wrote:
| But if things get too efficient for individual users, you
| won't need an Nvidia GPU anymore. People will use cheaper
| hardware instead. I'm looking forward to running good
| models at decent speed on a low-end CPU or whatever
| crappy GPU is in my phone.
| jacobsimon wrote:
| I had the same thought this morning and was debating
| selling my nvda stock when I saw this - feels like they
| are well-positioned right now, as with crypto a few years
| ago, but if there were an efficiency breakthrough that
| allowed commodity CPUs to do the inference instead, this
| advantage could vanish quickly.
| godelski wrote:
| Nvidia can't make GPUs fast enough. I doubt 10xing
| training and/or inference efficiency would result in a
| decrease in demand. I would be surprised if it didn't
| instead increase demand. Mind you, Nvidia is pushing hard
| on TensorRT which optimizes models at inference time and
| results in major increases in throughput (not 10x though
| lol).
| rictic wrote:
| Yeah, Jevons Paradox suggests that 10xing efficiency of
| training and inference would increase demand for GPUs.
| godelski wrote:
| I wouldn't be so quick to conspiracy. I'm the author of a
| work and a famous blog post that trains a particular common
| architecture much faster (don't want to dox myself too
| much) and with far fewer parameters, but it has been
| rejected several times and is now arxiv only. Our most
| common complaint was "who would use this? Why not just take
| a large model and tune it?" That question alone held us
| back a year (had over a hundred citations by then and
| remains my most cited work) until it switched to "use more
| datasets" and "not novel" (by that time true, others had
| built off of us, cited us, and published in top venues).
|
| I don't think this was some conspiracy by big labs to push
| back against us (we're nobodies) but rather that people get
| caught up in hype and reviewers are lazy and incentivized
| to reject. You're trained to be critical of works and
| especially consider that post hoc most solutions appear far
| simpler than they actually are. But context matters because
| if you don't approach every paper with nuance it's easy to
| say "oh, it's just x." But if those ideas were so simple
| and obvious they would also be prolific. I see a lot of
| small labs suffer the same fate simply due to lack of
| compute. If you don't make your new technique work on many
| datasets it becomes the easiest thing to reject a paper by.
| ACs aren't checking that reviews are reasonable. I've even
| argued with fellow reviewers about papers in workshops --
| papers I would have accepted in the main conference -- that
| are brushed off and the reviewers admit in their reviews
| that they do not work on these topics. I don't understand
| what's going on but at times it feels like a collective
| madness. A 10 page paper with 4 very different datasets
| that solves a problem, is clearly written, has no major
| flaws, and is useful to the community should not need
| defending when submitted to a workshop just because
| reviewers aren't qualified to review the work (this paper
| got in btw). We are moving into a "pay to play" ecosystem
| and that will only create bad science due to group think.
| (another aspect of "pay to play" is in the tuning. Spending
| $1M to tune your model to be the best doesn't mean it is
| better than a model that could not afford the search. Often
| more than half of resources are spent on tuning now)
| wruza wrote:
| Is there a place where you guys discuss... things? I'm a layman
| interested in this topic akin to pop-physics/maths, but have no
| chance to just read papers and "get it". On the other hand,
| immediately available resources focus more on the how-to part
| of it rather than on what's going on overall. Also, do you have
| something like 3b1b/pbs/nph for it? Content that _you_ can
| watch and say "well, yep, good job".
| airgapstopgap wrote:
| ...ETH Zurich is an illustrious research university that often
| cooperates with DeepMind and other hyped groups; they're right
| there at the frontier too, and have been for a very long time.
| They don't have massive training runs of their own, but pound
| for pound I'd say they have better papers.
| godelski wrote:
| ETH Zurich is one of the top labs in the world. Disney Research
| also works with them a lot. Another "sleeper" is the University
| of Amsterdam, which has rockstars like Max Welling and his
| students Kingma, Salimans, van den Berg, and Hoogeboom.
|
| It's easy to get hyped up on the big tech labs because they
| have the most compute, but the best papers come from smaller
| labs, which unfortunately have lately faced larger challenges
| in getting published. It's the smaller works that create the
| foundations that end up in these giant models. ML is in a
| really weird space right now.
| pr337h4m wrote:
| This from DeepMind:
|
| DiLoCo: Distributed Low-Communication Training of Language
| Models - https://arxiv.org/pdf/2311.08105.pdf
|
| From the first author on Twitter: "It could be quite a big deal
| for people who don't have access to a colocated cluster of
| GPUs:
|
| e.g. with DiLoCo you could train your model, with data-
| parallelism, across all GPU providers, looking in real-time
| for the cheapest price, even if pre-emptable, even across
| continents"
|
| https://twitter.com/Ar_Douillard/status/1724839420820361687
| quickthrower2 wrote:
| It is not surprising. The assumption is that they have the
| best people. That you can objectively search 8 billion people
| for the best people globally is folly of course. There are
| geniuses without US citizenship / visas / green cards. And so
| outside brains are going to figure this out. Mix in GDP of
| $rest_of_world has much more resources than any company, and
| the luck-driven nature of making AI discoveries, and I reckon
| most progress will be outside of OpenAI etc., driven by a
| problem the big guys don't need to solve: how do I avoid buying
| a $5k graphics card?
| lawlessone wrote:
| From the previous paper you cited: > Pushing FFFs to the limit,
| we show that they can use as little as 1% of layer neurons for
| inference in vision transformers while preserving 94.2% of
| predictive performance.
|
| This feels like that often-misinterpreted Einstein meme/quote
| about humans only using a fraction of their brain power.
|
| Is this only for inference though? Could it boost training?
| WithinReason wrote:
| That's an interesting question. It actually suggests a nice way
| to parallelize training: pretrain e.g. the first 3 branch
| levels, which effectively fragments the model into 8 separate
| parts that you can continue training across 8 independent
| servers/nodes with no further communication between the nodes.
| A central server would run the first 3 levels and mark the
| parts of the training set that each node has to train on. Maybe
| you could do this for the whole network and distribute the
| training SETI@home-style all over the world.
|
| Hold on, you don't even need to freeze the branches
| completely: each node could train 1 branch on the path to its
| leaf node and communicate a change in the branch node to a
| central server, so you can distribute training without having
| to pre-freeze the branches. Still would need some pre-
| training though, and the splits would change slowly, and the
| attention mechanism could complicate things.
|
| Currently, distributed SETI@home-style neural network training
| looks like a complete pipe dream that nobody is taking
| seriously. But a smart branching mechanism like this could
| _suddenly_ make it possible. Folding@home reached 1.5 exaflops,
| which made it the world's largest supercomputer. Imagine the
| models we could train that way; they would far surpass whatever
| OpenAI or Google could train, and would be public.
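| A sketch of the routing step described above: a central server
| runs the first few (frozen) branch levels and buckets each
| training example by the subtree it lands in. The weights and
| shapes here are placeholders, not part of the paper.
|
|     import torch
|
|     depth, hidden = 3, 768                    # freeze first 3 levels
|     n_frozen = 2 ** depth - 1                 # 7 frozen branch nodes
|     branch_w = torch.randn(n_frozen, hidden)  # pretrained routing weights
|     branch_b = torch.zeros(n_frozen)
|
|     def route(batch):
|         # Which of the 2**depth workers should train on each example?
|         node = torch.zeros(batch.shape[0], dtype=torch.long)
|         for _ in range(depth):
|             logit = (batch * branch_w[node]).sum(-1) + branch_b[node]
|             node = 2 * node + 1 + (logit > 0).long()
|         return node - n_frozen                # subtree id in [0, 8)
|
|     worker_id = route(torch.randn(32, hidden))  # ship to that worker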
| alchemist1e9 wrote:
| This!
|
| If this becomes true then it's a game changer. I hope you
| are correct.
| richardw wrote:
| Also steps up the economic benefit of, and therefore demand
| for, botnets. We really need a solution to bad actors
| controlling vast amounts of compute.
| klyrs wrote:
| That ship has sailed, and her name is bitcoin.
| richardw wrote:
| If bitcoin keeps the botnets away from world beating AI,
| worth it.
| Geee wrote:
| I am barely understanding this, so a stupid question:
|
| Does this also mean that it would be possible to train on a
| parallel GPU-poor setup instead of needing lots of GPU memory /
| bandwidth on one computer?
| bloopernova wrote:
| Apologies for layman question: how much tera/peta/exa-flops
| do current models use to train?
|
| Well, I'm assuming they'd use whatever they're given, so
| maybe the question should be "how much less time would
| training take on a 1.5 exaflops computer?"
| foobiekr wrote:
| As many as they can afford.
|
| A lot of clusters are totally homogeneous, at least
| within some very large domains, so for a given
| interconnect and a generation of GPU you know the maximum
| message latency, the peak sustained pflop rate, and so on
| but what often matters is some combination of the
| depreciation-cost-per-time and the watt hours per unit
| time, where you can sort of approximate both if you
| ignore the unfortunate realities, which then act as a
| multiplier.
|
| For example, one problem is network issues - and not just scale
| - as the training sequence often involves billions of cycles of
| short compute-sync sequences which are bursty (e.g., all-to-
| all, barrier, compute, barrier, all-to-all, ...) but between
| which there isn't enough time to engage low-power modes, so
| you're burning $ due to slack and waste. This is true in
| different ways for a lot of training approaches.
|
| You can approximate this, but it's so sensitive to data
| set size, specific training schedule, etc. that you won't
| be able to get the most important answer.
| thomasahle wrote:
| Sounds like hierarchical softmax from the early NLP days.
| knexer wrote:
| It's mentioned briefly in the paper(1), but I'm more interested
| in the interpretability implications of this approach. In some
| respects, this marries the interpretability/editability of a
| small decision tree with the expressive power of a large neural
| network. Usually you see those two on extreme opposite ends of
| a tradeoff spectrum - but this approach, if it scales, might
| shift the pareto frontier.
|
| (1): As a byproduct, the learned regions can also be used as a
| partition of the input space for interpretability, surgical
| model editing, catastrophic forgetting mitigation, reduction of
| replay data budget, etc.
| tokai wrote:
| Why not use the real title? It's short and precise.
| vorticalbox wrote:
| Hugging Face model:
|
| https://huggingface.co/pbelcak/UltraFastBERT-1x11-long
| lawlessone wrote:
| > This model is provided only as sanity check for research
| purposes, it is untested and unfit for deployment.
|
| I guess this means it isn't pretrained yet? Is it still just
| random weights?
| mjn wrote:
| "Unfit for deployment" or "not intended for deployment" is
| semi-standard wording for research models that are just raw
| language models with none of the safety/bias/offensiveness
| filtering that is usually desired for product applications.
| For example, if you deploy it as a customer-service chatbot,
| it might tell your customers to kill themselves, or call them
| racial slurs.
|
| It doesn't mean that there's anything technically wrong with
| the language model per se as a model of language, just that
| there has been no effort made to ensure it's fit to be
| deployed as-is for any given generative-AI use case, and the
| model authors would prefer you didn't do that.
| ilaksh wrote:
| Is it possible to use this with something like Llama 2?
| vouaobrasil wrote:
| This is rather scary. I feel we are witnessing the evolution of
| language models and artificial intelligence, which seems
| intellectually laudable until you realize that the underlying
| evolutionary framework for this evolution is the global
| capitalistic system, whose only criterion for selection is
| short-term monetary gain.
|
| We are creating a monster.
| hendler wrote:
| Rather than looking to capitalism, which has provided tremendous
| benefits to society as well as unintended consequences, you may
| want to update your thinking to focus on the incentive-alignment
| problem in general.
|
| This TED talk articulates it well: https://youtu.be/WX_vN1QYgmE
|
| What is after capitalism?
| vouaobrasil wrote:
| I absolutely disagree that reformism as in the video via
| incentives will be enough.
| dicroce wrote:
| I think it's good to be concerned and cautious but I also think
| you are being a bit extreme here.
| vouaobrasil wrote:
| I absolutely disagree - I believe everyone else is blind, the
| same way we are blind that our current lifestyles are an
| exercise in extreme violence on the nonhuman world.
| qntty wrote:
| According to scientists, we only use 0.3% of our neural networks.
| Imagine if we could use 100%.
| kleiba wrote:
| Nice.
|
| I know HN can sometimes be the place where humor goes to die,
| but I found this comment hilarious.
| madprofessor wrote:
| Wonderful.
| jredwards wrote:
| Thank you for taking one of my most hated memes and turning it
| into hilarity.
| baq wrote:
| Mix this with yesterday's matmul approximation (maddness) in HW
| for a casual... three orders of magnitude speed increase?
| nulld3v wrote:
| Can you link the post for the matmul approximation?
| jml7c5 wrote:
| https://news.ycombinator.com/item?id=38360776
| nulld3v wrote:
| Thank you!
| hobofan wrote:
| I'm not 100% sure, but those seem mostly mutually exclusive (or
| redundant), with the decision tree in maddness taking on a
| similar function as the binary tree in FFF that decides which
| neurons to activate.
| terafo wrote:
| They are mostly incompatible.
| measured_step wrote:
| How would this scale for a use case like writing code? I could
| imagine that some inputs would require a large number of neurons.
| Would this architecture be able to do that if it were scaled up?
|
| I'm also curious if this model architecture would achieve the
| grokking of more complex concepts at scale.
| jasonjmcghee wrote:
| Does anyone understand why they are using B x H instead of B x S
| x H?
|
| Why are the context size and batch size represented as a single
| parameter?
| numeri wrote:
| I would have to go back and reread the paper to be sure, but FF
| layers are applied position-wise, meaning independently and in
| parallel on all input tokens/positions. Because of that, I
| could imagine contexts where the sequence dimension isn't
| relevant, i.e., for computational complexity.
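| Since a feedforward block acts on each position independently,
| (B, S, H) can be flattened to (B*S, H) without changing the
| result; a small check of that equivalence (shapes are arbitrary):
|
|     import torch
|
|     B, S, H = 2, 5, 16
|     ff = torch.nn.Sequential(torch.nn.Linear(H, 4 * H),
|                              torch.nn.GELU(),
|                              torch.nn.Linear(4 * H, H))
|     x = torch.randn(B, S, H)
|     y_seq = ff(x)                                       # (B, S, H)
|     y_flat = ff(x.reshape(B * S, H)).reshape(B, S, H)
|     print(torch.allclose(y_seq, y_flat))                # True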
| rsolva wrote:
| I find running 7B models on my 6 year old small form factor HP
| EliteDesk to be fast enough for casual everyday use. If this
| speedup can be applied generally to commonly used models, I can
| serve a local ChatGPT experience for both friends and family from
| my tiny homelab in my basement.
|
| _mind blown_
| FooBarWidget wrote:
| I find 7B models to be too stupid. They often respond with
| nonsense or fail to follow instructions.
| cooper_ganglia wrote:
| Even 65B models approach a level of being almost usable, but
| still fall short in my personal experience.
| cloudking wrote:
| This is why I don't understand the excitement around
| open-source models: they pale in comparison to GPT-4 quality,
| so I have no use for them until we have something comparable.
| GaggiX wrote:
| Even like OpenChat-3.5? (Probably the best 7B model out
| there) Demo: https://openchat.team/
|
| HuggingFace: https://huggingface.co/openchat/openchat_3.5
|
| On the LLM arena (blinded comparisons), it's the third best
| non-proprietary model:
| https://huggingface.co/spaces/lmsys/chatbot-arena-
| leaderboar...
| sorokod wrote:
| _What is the sum of odd numbers in this set: 4, 7, 12, 1,
| 3_
|
| _The sum of odd numbers in the given set is 4 + 7 + 1 =
| 12. Therefore, the answer is 12._
| all2 wrote:
| Technically 3 is even. \s
| moffkalast wrote:
| Are there any comparisons with Mistral-instruct? I've yet
| to see anything under 30B beat it in any way.
| quickthrower2 wrote:
| If anyone is on the ball enough to turn this into a Colab or
| notebook, that would be appreciated! Would love to see the code.
___________________________________________________________________
(page generated 2023-11-22 23:00 UTC)