[HN Gopher] Large Transformer Model Inference Optimization
___________________________________________________________________
Large Transformer Model Inference Optimization
Author : headalgorithm
Score : 56 points
Date : 2023-01-20 19:27 UTC (3 hours ago)
(HTM) web link (lilianweng.github.io)
(TXT) w3m dump (lilianweng.github.io)
| ilaksh wrote:
| Are any of the companies with the largest, most capable models
| doing these things? Maybe OpenAI has used some of them for GPT-4.
|
| But also maybe there is another company using a very large
| dataset and some optimizations. I would love to have an
| alternative so I wasn't 100% reliant on OpenAI.
| jayalammar wrote:
| We train and serve large models at cohere.ai. We've shared some
| optimization techniques here: https://txt.cohere.ai/running-
| large-language-models-in-produ...
| ilaksh wrote:
| Awesome! Can your models write code?
| binarymax wrote:
| I help teams run transformers in their production systems on CPU,
| using my product based on ONNX Runtime.
|
| This is a great article, but if you're using something based on
| BERT or RoBERTa, you don't need to do much. Distillation is
| usually the only step you need to take if you're really picky, or
| if your scale is millions of requests per day and you're not
| making enough money to support the infrastructure.
|
| I have had mixed results with quantization and sparsification,
| but IMO they're just not worth it, as they can be unstable.
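|
| To make that concrete, here is a minimal sketch of the generic
| export-and-serve path (plain torch.onnx.export plus ONNX Runtime,
| not my product's actual pipeline; the checkpoint name is only an
| example):
|
|     import torch
|     import onnxruntime as ort
|     from transformers import (AutoModelForSequenceClassification,
|                               AutoTokenizer)
|
|     name = "distilbert-base-uncased-finetuned-sst-2-english"
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForSequenceClassification.from_pretrained(name)
|     model.config.return_dict = False  # tuple outputs export cleanly
|     model.eval()
|
|     # Export once, keeping batch and sequence length dynamic.
|     sample = tok("an example sentence", return_tensors="pt")
|     torch.onnx.export(
|         model,
|         (sample["input_ids"], sample["attention_mask"]),
|         "model.onnx",
|         input_names=["input_ids", "attention_mask"],
|         output_names=["logits"],
|         dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
|                       "attention_mask": {0: "batch", 1: "seq"}},
|         opset_version=17,
|     )
|
|     # Serve on CPU with ONNX Runtime.
|     sess = ort.InferenceSession("model.onnx",
|                                 providers=["CPUExecutionProvider"])
|     feed = tok("transformers on CPU are often fast enough",
|                return_tensors="np")
|     logits = sess.run(["logits"],
|                       {"input_ids": feed["input_ids"],
|                        "attention_mask": feed["attention_mask"]})[0]
|     print(logits.argmax(-1))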
| haldujai wrote:
| Or you could keep it simple and just not use a 500B parameter
| model which is unnecessarily large for 99.9999999999999% of use
| cases.
| drdeca wrote:
| I think that is likely too many nines (depending on how you are
| counting and weighting "use cases").
| haldujai wrote:
| Sure, admittedly I was being hyperbolic and a bit snarky.
|
| However, I am genuinely curious what sort of industrial "real
| world task" requires edge inference on GPT-3.5 or PaLM-sized
| models, where you would hit this problem without the
| infrastructure to handle it and would therefore need these
| potentially unstable tricks?
|
| The point I was alluding to is that LLMs of this size are
| overkill for most commercial use cases (e.g. NER, document
| classification, semantic search, chat bot).
| usmannk wrote:
| This post isn't about "edge inference".
| haldujai wrote:
| Maybe I'm missing the point of the article then. What's
| the low-resource scenario where inference speed is the
| bottleneck for transformer adoption at scale?
| bravura wrote:
| New AI tasks are being unlocked by (large-scale)
| foundation models (Liang, 2022).
|
| Fine-tuning in low-resource (few-shot) scenarios is now
| possible for many new applications.
|
| However, these new AI applications relied upon a huge
| pretrained model to get there, because the old approach
| of training from scratch on 100 labeled examples didn't
| work well.
|
| Thus, we want to distill the knowledge so that the model
| can be deployed in low-resource scenarios.
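|
| To make "distill" concrete, a minimal sketch of the usual
| student-teacher objective (temperature and weighting are
| illustrative defaults, not taken from the article):
|
|     import torch.nn.functional as F
|
|     def distill_loss(student_logits, teacher_logits,
|                      labels, T=2.0, alpha=0.5):
|         # Soft targets: KL between temperature-softened
|         # teacher and student distributions.
|         soft = F.kl_div(
|             F.log_softmax(student_logits / T, dim=-1),
|             F.softmax(teacher_logits / T, dim=-1),
|             reduction="batchmean",
|         ) * (T * T)
|         # Hard targets: cross-entropy on the few labels.
|         hard = F.cross_entropy(student_logits, labels)
|         return alpha * soft + (1 - alpha) * hard
|
| In the training loop the teacher runs under
| torch.no_grad() and only the student is updated.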
|
| [edit: I see your below comment about the concern about
| transformer cost. Agreed. This is one of the many
| concerns around foundation models that must be
| understood. The happy path is that training the
| foundation model is a one-time cost that pays dividends
| in the many tasks it unlocks. However, you are correct
| that the research to get there is quite spendy. I
| encourage you to skim this paper. It's long but very
| accessible: https://arxiv.org/pdf/2108.07258.pdf]
| chipgap98 wrote:
| I would imagine this would also bring down the cost of
| running large models, which could increase their
| adoption.
| haldujai wrote:
| Fair enough. I'm probably biased by my working environment
| and my current belief that we're scaling transformer models
| unnecessarily, though that belief is itself partly influenced
| by their cost.
| madlag wrote:
| May I add another method: block fine-pruning of transformers
| (pruning while fine-tuning)?
|
| https://arxiv.org/abs/2109.04838
|
| Using blocks allows us to keep good performance on GPUs, while
| giving some flexibility in the pruning pattern. And when the
| entirely empty rows and columns are removed, the pruned matrices
| are actually pretty dense, so the approach is competitive with
| structured pruning for speedup, but less "aggressive" on the
| network during the pruning process.
| Disclaimer: I am the main co-author.
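|
| For intuition, a toy version of the block idea (simple magnitude
| scoring, not the learned block scores the paper actually uses
| during fine-tuning): zero the weakest 32x32 blocks of a weight
| matrix, then drop rows and columns that end up entirely empty,
| leaving a smaller dense matrix.
|
|     import torch
|
|     def block_prune(weight, block=32, keep_ratio=0.5):
|         rows, cols = weight.shape
|         # View the matrix as a grid of (block x block) tiles.
|         tiles = weight.reshape(rows // block, block,
|                                cols // block, block)
|         scores = tiles.abs().mean(dim=(1, 3))
|         # Keep the highest-scoring tiles, zero out the rest.
|         k = int(keep_ratio * scores.numel())
|         thresh = scores.flatten().topk(k).values.min()
|         mask = (scores >= thresh).float()
|         pruned = (tiles * mask[:, None, :, None]).reshape(rows, cols)
|         # Rows/columns that became all-zero can be removed, so
|         # what remains is a dense, GPU-friendly matrix.
|         keep_r = pruned.abs().sum(dim=1) > 0
|         keep_c = pruned.abs().sum(dim=0) > 0
|         return pruned[keep_r][:, keep_c]
|
|     small = block_prune(torch.randn(768, 768))
|     print(small.shape)  # shrinks only if whole rows/cols empty out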
| bravura wrote:
| Let us not forget that Lilian Weng was telling us to pay
| attention to diffusion models before they became cool, and
| definitely before your dad was using Stable Diffusion to generate
| logos for his rotary club.
___________________________________________________________________
(page generated 2023-01-20 23:00 UTC)