[HN Gopher] Transformers, originally designed to handle language...
       ___________________________________________________________________
        
       Transformers, originally designed to handle language, are taking on
       vision
        
       Author : theafh
       Score  : 103 points
       Date   : 2022-03-10 17:01 UTC (5 hours ago)
        
 (HTM) web link (www.quantamagazine.org)
 (TXT) w3m dump (www.quantamagazine.org)
        
       | graycat wrote:
        | After a long "training set", I have _learned_ that when I see a
        | headline that ends with a question mark, I shouldn't bother to
        | read the article; the answer to the question is nearly always "No".
       | E.g.,
       | 
       | "Will Expert Systems AI Replace Nearly All Knowledge Workers?"
       | 
        | No, they haven't. And I am fully confident they won't.
       | 
       | So, due to the question mark, just stop reading. Then I apply
       | this _training_ to
       | 
       | "Will Transformers Take over Artificial Intelligence?"
        
         | version_five wrote:
         | https://en.m.wikipedia.org/wiki/Betteridge%27s_law_of_headli...
        
       | [deleted]
        
       | JPKab wrote:
       | Has anyone seen successful use of transformers with tabular data,
       | particularly high-cardinality categorical data sets? Curious.
        
       | godelski wrote:
        | There's a pretty popular vision transformer tutorial on Medium.
        | The authors were focused on making it work even with limited
        | compute.
       | 
       | https://medium.com/pytorch/training-compact-transformers-fro...
        
       | mark_l_watson wrote:
       | I think they will temporarily take over deep learning, but all of
       | AI? No.
        
       | fithisux wrote:
        | There are a lot of hype-driven articles, fewer offering even
        | hand-waving explanations, and none about the mathematics.
        
       | goodmachine wrote:
       | I really like Quanta: but this article was not so great.
        
       | andrewmatte wrote:
       | I have not given transformers enough attention... but my
       | impression is that this is still storing entities in the weights
        | of the neural network instead of in a database where they can be
       | operated on with CRUD. What are the knowledge discovery
       | researchers doing with respect to transformers? And the SAT
       | solver researchers?
       | 
       | Here is an article on KDNuggets that explains transformers but
       | doesn't answer my questions:
       | https://www.kdnuggets.com/2021/06/essential-guide-transforme...
        
         | stingraycharles wrote:
          | I think it's relatively straightforward to serialize such a
          | model into different representations, so I completely understand
          | that they keep the actual data inside PyTorch state by default.
         | 
         | Out of curiosity, what tools are researchers generally using to
         | explore neural networks? I'm just an armchair ML enthusiast
          | myself, but NNs always appear very much like black boxes.
         | 
         | What are the goals and methods for exploring neural network
         | state nowadays?
        
         | lolspace wrote:
         | > I have not given transformers enough attention...
         | 
          | ( ͡° ͜ʖ ͡°)
        
           | fakethenews2022 wrote:
           | Attention is all you need
        
         | cleancoder0 wrote:
          | The first transformer models still dealt only with the training
          | set.
         | 
          | Eventually this was extended to work with an external data
          | source that the model queries. This is not a new thing; for
          | example, image style transfer and some other image tasks that
          | were attempted before the domination of NNs did the same
          | thing (linear models would query a database for help and for
          | guided feature extraction).
         | 
          | The greatest effect in transformers comes from the attention
          | mechanism combined with self-supervised learning.
          | Investigations into self-supervised learning tasks (the
          | article illustrates the one-word-gap task, but there are
          | others) can result in superior models that are sometimes even
          | easier to train.
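          | 
          | As a toy illustration of that self-supervised "one word gap"
          | setup (plain Python, no particular library's API assumed), the
          | training pairs come from the raw text alone:
          | 
          |     import random
          | 
          |     tokens = "the quick brown fox jumps over the lazy dog".split()
          |     i = random.randrange(len(tokens))    # pick a position to hide
          |     target = tokens[i]                   # word the model must recover
          |     masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
          |     print(masked, "->", target)
          | 
          | No human labels are needed; the (masked, target) pairs are
          | generated automatically, which is what lets these models train
          | on huge unlabeled corpora.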
         | 
          | As for SAT and optimization, graph neural networks might end
          | up being more effective (due to the highly structured inputs).
          | I'm definitely waiting for a traveling salesman solver or
          | similar, guided by a NN, solving things faster and reaching
          | optimality more frequently than optimized heuristic algos.
        
           | bglazer wrote:
            | > I'm definitely waiting for a traveling salesman solver or
            | similar, guided by a NN, solving things faster and reaching
            | optimality more frequently than optimized heuristic algos.
           | 
           | There was a competition for exactly this at Neurips 2021
           | 
           | https://www.ecole.ai/2021/ml4co-competition/
           | 
           | Not sure how much they improved over handcrafted heuristics,
           | but the summary paper may give some insights
           | 
           | https://arxiv.org/abs/2203.02433
        
           | graycat wrote:
            | > I'm definitely waiting for a traveling salesman solver or
            | similar, guided by a NN, solving things faster and reaching
            | optimality more frequently than optimized heuristic algos.
           | 
            | Just in case we are not being clear, let's be clear.
            | Bluntly, in nearly every practical sense, the traveling
            | salesman problem (TSP) is NOT very difficult. Instead we
            | have had good approaches for decades.
           | 
           | I got into the TSP writing software to schedule the fleet for
           | FedEx. A famous, highly accomplished mathematician asked me
           | what I was doing at FedEx, and as soon as I mentioned
           | scheduling the fleet he waved his hand and concluded I was
           | only wasting time, that the TSP was too hard. He was wrong,
           | badly wrong.
           | 
           | Once I was talking with some people in a startup to design
           | the backbone of the Internet. They were convinced that the
           | TSP was really difficult. In one word, WRONG. Big mistake.
           | Expensive mistake. Hype over reality.
           | 
           | I mentioned that my most recent encounter with combinatorial
           | optimization was _solving_ a problem with 600,000 0-1
           | variables and 40,000 constraints. They immediately, about 15
           | of them, concluded I was lying. I was telling the full, exact
           | truth.
           | 
           | So, what is difficult about the TSP? Okay, we would like an
           | algorithm for some software that would solve TSP problems (1)
           | to exact optimality, (2) in worst cases, (3) in time that
           | grows no faster than some polynomial in the size of the input
           | data to the problem. So, for (1) being provably within 0.025%
           | of exact optimality is not enough. And for (2) exact
           | optimality in polynomial time for 99 44/100% of real problems
           | is not enough.
           | 
           | In the problem I attacked with 600,000 0-1 variables and
           | 40,000 constraints, a real world case of allocation of
           | marketing resources, I came within the 0.025% of optimality.
           | I know I was this close due to some bounding from some
           | nonlinear duality -- easy math.
           | 
           | So, in your
           | 
            | > reaching optimality more frequently than optimized
           | heuristic algos.
           | 
           | heuristics may not be, in nearly all of reality probably are
           | not, reaching "optimality" in the sense of (2).
           | 
            | The hype around the TSP has been to claim that the TSP is
            | really difficult. Soooo, given some project that is to cost
            | $100 million, where an optimal solution might save $15
            | million, software based on what has long been known (e.g.,
            | from G. Nemhauser) that can save all but $1500 of that is
            | somehow not of interest. Bummer. Wasted nearly all of the
            | $15 million.
           | 
           | For this, see the cartoon early in Garey and Johnson where
           | they confess they can't solve the problem (optimal network
           | design at Bell Labs) but neither can a long line of other
           | people. WRONG. SCAM. The stockholders of AT&T didn't care
           | about the last $1500 and would be thoroughly pleased by the
           | $15 million without the $1500. Still that book wanted to say
           | the network design problem could not yet be solved -- that
           | statement was true only in the sense of exact optimality in
           | polynomial time on worst case problems, a goal of essentially
           | no interest to the stockholders of AT&T.
           | 
            | For neural networks (NNs), I don't expect (A) much progress
            | in any sense over what has been known (e.g., Nemhauser _et
            | al._) for decades. And (B) any progress NNs do make promises
            | to be in performance aspects other than getting to exact
            | optimality.
           | 
           | Yes, there are some reasons for taking the TSP and the issue
           | of P versus NP seriously, but _optimality_ on real world
           | _optimization_ problems is not one of the main reasons.
           | 
           | Here my goal is to get us back to reality and set aside some
           | of the hype about how difficult the real world TSP is.
        
             | cleancoder0 wrote:
              | There's LKH http://webhotel4.ruc.dk/~keld/research/LKH/
              | which is a heuristic and the best open implementation.
              | Adding optimality estimates is the least complicated part.
             | 
              | When TSP is mentioned today, unlike 50 years ago when the
              | LK heuristic got published, I assume it includes all of
              | the popular & practical variants, like time window
              | constraints, pickup and delivery, capacity constraints,
              | max drop time requirement after pickup, flexible route
              | start, adding location-independent breaks (a break can
              | happen anytime in the sequence or in a particular time
              | window of the day), etc. Some of the subproblems are so
              | constrained that you cannot move around as effectively as
              | you can with raw TSP.
             | 
              | Some of the subproblems have O(n) or O(n log n)
              | evaluations of the best local moves, and generic solvers
              | are even worse at handling that (Concorde's LP
              | optimizations cannot cover that efficiently). When no
              | moves are possible, you have to see which moves bring you
              | back to a feasible solution and how many local changes
              | you need to make to accomplish this.
             | 
              | For example, just adding time windows complicates most
              | well-known TSP heuristics or makes them useless. Now
              | imagine if we add a requirement between pairs of locations
              | that they need to be at most X time apart (picking up and
              | then delivering perishable goods), that the route can
              | start at an arbitrary moment, etc.
             | 
             | I personally spent quite a lot of time working on these
             | algorithms and I'd say the biggest issue is instance
             | representation (is it enough to have a sequence of location
              | ids?). For example, one of my recent experiments was using
              | zero-suppressed binary decision diagrams to easily traverse
             | some of these constrained neighborhoods and maintain the
             | invariants after doing local changes. Still too slow for
             | some instances I handle (real world is 5000 locations, 100
             | salesmen and an insane amount of location/salesmen
             | constraints).
        
               | graycat wrote:
                | Amazing. Of course I heard of Kernighan long ago, but
                | this is the first I've heard of LKH.
               | 
                | I did a lot in optimization, in my Ph.D. studies and in
                | my career, but I dropped it decades ago -- my decision
                | was made for me by my customers: essentially there
                | weren't any, or at least not nearly enough that I could
                | find.
               | 
               | Actually, my summary view is that for applications of
               | math in the US, the main customer is US national
               | security. Now there are big bucks to apply algorithms and
                | software to some big data, and maybe, _maybe_, there is
               | some interest in math. But the call I got from Google
               | didn't care at all about my math, optimization,
               | statistics, or stochastic processes background. Instead
               | they asked what was my favorite programming language, and
               | my answer, PL/I, was the end of the interview. I'm sure
               | the correct answer was C++. I still think PL/I is a
               | better language than C++.
               | 
               | Early in my career, I was doing really well with applied
               | math and computing, but that was all for US national
               | security and within 50 miles of the Washington Monument.
               | 
               | Now? I'm doing a startup. There is some math in it, but
               | it is just a small part, an advantage, maybe crucial, but
               | still _small_.
        
               | cleancoder0 wrote:
               | There's quite a resurgence of need for optimization.
               | 
               | There's a lot of companies that want to provide an
                | Uber/Lyft-like service for their own product. So you have
               | a bunch of smaller problems that you want to solve as
               | best as possible in ~1 second.
               | 
               | A lot of small companies with their delivery fleets want
                | to optimize (pest control, Christmas tree delivery,
               | cleaning, technical service, construction (coordinating
               | teams that construct multiple things at multiple
               | locations at the same time) etc.).
               | 
               | On the other hand, not related to TSP, the whole energy
               | market in the US is very LP/ILP optimizable and has a lot
               | of customers (charging home batteries, car batteries,
               | discharging when price is high, etc.).
               | 
               | I would admit that the scientific field of discrete
               | optimization is littered with genetic algorithms, ant
               | colonies and other "no free lunch" optimization
                | algorithms that make very little sense from a progress
               | perspective, so it does feel like the golden era was from
               | the 70s to early 90s. I do not have a PhD but somehow
               | ended up doing machine learning and discrete optimization
               | most of my career.
        
               | feanaro wrote:
               | What do you mean when you say these algorithms make very
               | little sense from a progress perspective?
        
             | enchiridion wrote:
             | Where is a good place to look for algorithms/math for
             | solving problems similar to the ones you mentioned?
        
               | graycat wrote:
               | Can look at the now old work of G. Nemhauser. His work
               | was for _combinatorial_ optimization and not just for
               | exactly the traveling salesman problem (TSP).
               | 
               | E.g., there is
               | 
                | George L. Nemhauser and Laurence A. Wolsey, _Integer and
                | Combinatorial Optimization_, ISBN 0-471-35943-2, John
                | Wiley & Sons, Inc., New York, 1999.
               | 
                | Some approaches involve _set covering_ and _set
                | partitioning_. Soooo, for the FedEx fleet, first just
                | generate all single-airplane feasible tours from the
                | Memphis hub and back. Here can honor some really goofy
                | constraints and complicated costing; can even handle
                | some stochastic issues, i.e., the costs depend on the
                | flight planning and that depends on the loads, which are
                | random, but it would be okay to work with just
                | expectations -- we're talking complicated costing! Then
                | with all those tours generated, pick ones that _cover_
                | all the cities to be served, i.e., _partition_ the
                | cities. Have a good shot at using linear programming,
                | tweaked a little to handle 0-1 constraints, to pick the
                | tours.
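                | 
                | A toy sketch of that set-partitioning step (made-up
                | tours and costs; Python-MIP is used here only as a
                | convenient stand-in solver, not whatever was actually
                | used at FedEx):
                | 
                |     from mip import Model, xsum, minimize, BINARY
                | 
                |     cities = ["A", "B", "C", "D"]
                |     tours = {   # tour id -> (cost, set of cities it serves)
                |         0: (5.0, {"A", "B"}),
                |         1: (4.0, {"C", "D"}),
                |         2: (7.0, {"A", "C", "D"}),
                |         3: (3.0, {"B"}),
                |     }
                | 
                |     m = Model()
                |     x = {t: m.add_var(var_type=BINARY) for t in tours}
                |     m.objective = minimize(xsum(tours[t][0] * x[t] for t in tours))
                |     for c in cities:   # each city served by exactly one chosen tour
                |         m += xsum(x[t] for t in tours if c in tours[t][1]) == 1
                |     m.optimize()
                |     print([t for t in tours if x[t].x >= 0.99])   # chosen tours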
               | 
                | Then more generally, for a lot of practical problems can
                | write linear programming problems with some of the
                | variables integer. Then can tweak the simplex algorithm
                | of linear programming to handle some such constraints
                | fairly _naturally_ in the algorithm. E.g., of course,
                | can proceed with the now classic branch and bound.
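                | 
                | A bare-bones branch and bound sketch for a tiny 0-1 LP,
                | using scipy's linprog for the relaxations (a toy
                | illustration of the idea, not production code):
                | 
                |     import numpy as np
                |     from scipy.optimize import linprog
                | 
                |     def branch_and_bound(c, A_ub, b_ub):
                |         n, best = len(c), (np.inf, None)
                |         stack = [{}]   # each node fixes some variables to 0 or 1
                |         while stack:
                |             fixed = stack.pop()
                |             bounds = [(fixed[i], fixed[i]) if i in fixed else (0, 1)
                |                       for i in range(n)]
                |             res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds,
                |                           method="highs")
                |             if not res.success or res.fun >= best[0]:
                |                 continue   # infeasible, or bounded out by incumbent
                |             frac = [i for i in range(n)
                |                     if 1e-6 < res.x[i] < 1 - 1e-6]
                |             if not frac:
                |                 best = (res.fun, res.x.round())  # integral incumbent
                |                 continue
                |             i = frac[0]   # branch on the first fractional variable
                |             stack.append({**fixed, i: 0})
                |             stack.append({**fixed, i: 1})
                |         return best
                | 
                |     # maximize x0 + x1 (minimize the negative) with 3*x0 + 2*x1 <= 4
                |     print(branch_and_bound(c=[-1, -1], A_ub=[[3, 2]], b_ub=[4]))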
               | 
               | The TSP taken narrowly can be regarded as more
               | specialized.
               | 
                | So, net, there is essentially a big bag of tricks, some
                | with some math and some just heuristics.
               | 
               | Part of the interest in the issue of P versus NP was to
               | do away with the bag of tricks and have just some one
               | grand, fantastic algorithm and computer program with
               | guaranteed performance. Nice if doable. Alas, after all
               | these years, so far not really "doable", not as just one
               | grand, fantastic .... And the question of P versus NP has
               | resisted so much for so long that it has even a
               | philosophical flavor. And there are serious claims that a
               | technically _good_ algorithm would have some really
               | astounding consequences.
               | 
                | Sure, I have some half-baked ideas sitting around that I
                | hope will show that P = NP -- doesn't everyone? But my
                | point here is simple: for several decades we have been
                | able to do quite well on real problems. Oh, for the
                | problem with 600,000 0-1 variables and 40,000
                | constraints, otherwise linear, I used _nonlinear duality
                | theory_ (which is simple) or, if you wish, Lagrangian
                | relaxation -- it's one of the tricks.
               | 
                | Another old trick: For the actual TSP in any _Euclidean_
                | space (sure, the plane but also 3 dimensions or 50 if you
                | want), that is, with Euclidean distance, just find a
                | minimum spanning tree (there are at least two _good_,
                | that is, polynomial, algorithms for that) and then in a
                | simple and fairly obvious way make a TSP tour out of that
                | tree. That approach actually has some probabilistic
                | bounds on how close it is to optimality, and it does
                | better with more cities -- it's another tool in the kit.
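                | 
                | A short sketch of that MST shortcut for points in the
                | plane, assuming the networkx package (my choice here,
                | not from the comment); a preorder walk of the minimum
                | spanning tree gives a tour at most twice the optimal
                | length for Euclidean/metric instances:
                | 
                |     import itertools
                |     import networkx as nx
                |     import numpy as np
                | 
                |     pts = np.random.rand(30, 2)     # 30 random cities in the plane
                |     G = nx.Graph()
                |     for i, j in itertools.combinations(range(len(pts)), 2):
                |         G.add_edge(i, j, weight=float(np.linalg.norm(pts[i] - pts[j])))
                | 
                |     T = nx.minimum_spanning_tree(G)   # polynomial-time step
                |     tour = list(nx.dfs_preorder_nodes(T, source=0))
                |     tour.append(0)                    # close the tour
                |     length = sum(G[a][b]["weight"] for a, b in zip(tour, tour[1:]))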
               | 
               | My main conclusion about the TSP, combinatorial
               | optimization, and optimization more generally is that
               | there are way, Way, WAY too few good customers. Whether
               | there is 15% of project cost to be saved or not, the
               | people responsible for the projects just do NOT want to
               | be bothered. In simple terms, in practice, it is
               | essentially a dead field. My view is that suggesting that
               | a young person devote some significant part of their
               | career to _optimization_ is, bluntly, in a word,
               | irresponsible.
        
               | lacker wrote:
               | Python-MIP is a great library that provides an interface
               | to many different algorithms like this. It's practical
                | for use in scientific programming where appropriate,
               | and if you read through the docs you can find the names
               | of specific algorithms that it uses with pointers to
               | where to learn more.
               | 
               | https://docs.python-mip.com/en/latest/intro.html
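                | 
                | For a flavor of the interface, here is a tiny 0-1
                | knapsack using the documented Model/add_var/xsum pattern
                | (the numbers are made up):
                | 
                |     from mip import Model, xsum, maximize, BINARY
                | 
                |     values, weights, capacity = [10, 13, 18, 31], [2, 3, 4, 7], 10
                | 
                |     m = Model()   # defaults to the bundled CBC solver
                |     x = [m.add_var(var_type=BINARY) for _ in values]
                |     m.objective = maximize(xsum(v * xi for v, xi in zip(values, x)))
                |     m += xsum(w * xi for w, xi in zip(weights, x)) <= capacity
                |     m.optimize()
                |     print([i for i, xi in enumerate(x) if xi.x >= 0.99])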
        
           | yobbo wrote:
           | > As for SAT, optimization, graph neural networks might end
           | up being more effective
           | 
            | Learning from data is a different problem from optimization.
            | For example, if facts about cities gave additional clues
            | beyond their location about the optimal order, then learning
            | could help with the travelling salesman problem. Or if the
            | cost of paths is only known implicitly through data examples.
            | 
            | Compare to how NNs can be used for data compression, for
            | example upscaling images, by learning from photographs only
            | the tiny subset of all possible images that are meaningful to
            | humans. But that is not useful for general data compression.
        
             | cleancoder0 wrote:
             | What about AlphaGo, AlphaZero (chess)?
             | 
              | Optimization is also data: given a local state, can you
              | identify the sequence of transformations that will get you
              | to a better state? The reward is instantly measurable and
             | the goal is minimizing the total cost.
        
         | axg11 wrote:
         | I wrote a short post on retrieval transformers that you might
         | find interesting [0]. It's a twist on transformers that allows
         | scaling "world knowledge" independently in a database-like
         | manner.
         | 
         | [0] - https://arsham.substack.com/p/retrieval-transformers-for-
         | med...
        
         | adamsmith143 wrote:
          | Isn't the benefit of NNs on some level that you can store
          | finer-grained and more abstract data than in a standard DB?
        
           | macrolocal wrote:
           | Maybe. Transformers model associative memory in a way made
           | precise by their connection to Hopfield networks.
           | Individually, they're like look-up tables, but the queries
           | can be ambiguous, even based on subtle higher-order patterns
           | (which the network identifies on its own), and the returned
           | values can be a mixture of stored information, weighted by
           | statistically meaningful confidences.
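            | 
            | A toy numpy sketch of that soft look-up view (single query,
            | single-head dot-product attention; the sizes are arbitrary):
            | 
            |     import numpy as np
            | 
            |     keys   = np.random.randn(8, 16)   # 8 stored items, 16-dim keys
            |     values = np.random.randn(8, 32)   # information stored under each key
            |     query  = np.random.randn(16)      # an ambiguous, partial query
            | 
            |     scores  = keys @ query / np.sqrt(16)             # similarities
            |     weights = np.exp(scores) / np.exp(scores).sum()  # "confidences"
            |     answer  = weights @ values   # weighted mixture of stored values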
        
         | ctoth wrote:
         | > The latest batch of language models can be much smaller yet
         | achieve GPT-3 like performance by being able to query a
         | database or search the web for information[0].
         | 
         | [0]: https://jalammar.github.io/illustrated-retrieval-
         | transformer...
        
       | tourist_on_road wrote:
        | Transformers gained popularity due to the scalable nature of the
        | architecture and how well it can be parallelized on existing
        | GPU/XLA hardware. Modeling is always conditioned on the hardware
        | available at hand. Transformers lack inductive bias, which makes
        | them generic building blocks, unlike CNN/RNN-style models, and
        | by injecting inductive bias like positional encoding, they can
        | be translated well to various domains.
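        | 
        | A small sketch of one such injected bias, the sinusoidal
        | positional encoding from the original Transformer paper (numpy
        | here just for illustration):
        | 
        |     import numpy as np
        | 
        |     def positional_encoding(seq_len, d_model):
        |         pos = np.arange(seq_len)[:, None]      # token positions
        |         i = np.arange(d_model)[None, :]        # embedding dimensions
        |         angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        |         # even dimensions get sine, odd dimensions get cosine
        |         return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
        | 
        |     pe = positional_encoding(seq_len=50, d_model=64)  # added to embeddings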
        
         | saeranv wrote:
         | Are transformers competitive with (for example) CNNs on vision-
         | related tasks when there's less data available? I'm not that
         | familiar with "injecting inductive bias" via positional
         | encodings, but it sounds really interesting. My crude
         | understanding is that the positional encodings were used in the
         | original Transformer architecture to encode the ordering of
          | words for NLP. Are they more flexible than that? For example,
         | can they be used to replicate the image-related inductive bias
         | of CNNs and match CNN performance on small datasets (1000 -
         | 10,000)?
         | 
         | If not, then to me, it seems like only industries where it's
         | possible to get access to a large amount of representative data
         | (i.e. greater than a million?) benefit from transformers. In
         | industries where there are bottlenecks to data generation,
         | there's a clear benefit in leveraging the inductive bias in
         | other architectures, such as the various ways CNNs have biases
         | towards image recognition.
         | 
         | I'm in an industry (building energy consumption prediction)
         | where we can only generate around 10,000 to 100,000 datapoints
         | (from simulation engines) for DL. Are transformers ever used
         | with that scale of data?
        
           | [deleted]
        
           | stevenwalton wrote:
           | > Are transformers competitive with (for example) CNNs on
           | vision-related tasks when there's less data available?
           | 
            | They can be; there's current research into the tradeoffs
           | between local inductive bias (information from local
           | receptive fields: CNNs have strong local inductive bias) and
           | global inductive bias (large receptive fields: i.e.
           | attention). There's plenty of works that combine CNNs and
           | Attention/Transformers. A handful of them focus on smaller
           | datasets, but the majority are more interested in ImageNet.
           | There's also work being done to change the receptive fields
           | within attention mechanisms as a means to balance this.
           | 
           | > Are transformers ever used with that scale of data?
           | 
           | So there's a yes and no to your question. But definitely yes
           | since people have done work on Flowers102 (6.5k training) and
           | CIFAR10 (50k training). Keep in mind that not all these
           | models are pure transformers. Some have early convolutions or
           | intermediate ones. Some of these works even have a smaller
           | number of parameters and better computational efficiency than
           | CNNs.
           | 
           | But more importantly, I think the big question is about what
           | type of data you have. If large receptive fields are helpful
           | to your problem then transformers will work great. If you
           | need local receptive fields then CNNs will tend to do better
           | (or combinations of transformers and CNNs or reduced
           | receptive fields on transformers). I doubt there will be a
           | one size fits all architecture.
           | 
           | One thing to also keep in mind is that transformers typically
           | like heavy amounts of augmentation. Not all data can be
           | augmented significantly. There's also pre-training and
           | knowledge transfer/distillation.
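            | 
            | A minimal sketch of the kind of hybrid mentioned above (a
            | small conv stem supplies the local bias, a transformer
            | encoder the global mixing; this is a generic PyTorch toy,
            | not any specific published model):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class TinyHybrid(nn.Module):
            |         def __init__(self, dim=128, num_classes=10):
            |             super().__init__()
            |             self.stem = nn.Sequential(     # local receptive fields
            |                 nn.Conv2d(3, dim, 3, stride=2, padding=1),
            |                 nn.ReLU(),
            |                 nn.Conv2d(dim, dim, 3, stride=2, padding=1),
            |             )
            |             layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
            |                                                batch_first=True)
            |             self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            |             self.head = nn.Linear(dim, num_classes)
            | 
            |         def forward(self, x):                  # x: (B, 3, H, W)
            |             t = self.stem(x)                   # (B, dim, H/4, W/4)
            |             t = t.flatten(2).transpose(1, 2)   # (B, tokens, dim)
            |             t = self.encoder(t)                # global attention
            |             return self.head(t.mean(dim=1))    # pooled logits
            | 
            |     logits = TinyHybrid()(torch.randn(2, 3, 32, 32))  # CIFAR-sized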
        
           | tourist_on_road wrote:
           | Good point. The fact there is no inductive bias inherent to
           | transformers makes it difficult to train a decent model on
           | small datasets from scratch. However, there are recent
           | research directions that try to address this problem [1].
           | 
            | Baking some sort of domain-specific inductive bias into the
            | model architecture itself can also address this problem [2].
           | 
           | [1]: Escaping the Big Data Paradigm with Compact
           | Transformers: https://arxiv.org/abs/2104.05704
           | 
           | [2]: CvT: Introducing Convolutions to Vision Transformers:
           | https://arxiv.org/abs/2103.15808
        
             | version_five wrote:
             | Maybe a naive question: is there no transfer learning with
             | transformers? I've done a lot of work with CNN
             | architectures on small datasets, and almost always start
             | with something trained on imagenet, and fine tune, or do
              | some kind of semi-supervised training to start. Can we do
              | that with ViT et al. as well? Or are they really usually
             | trained from scratch?
        
               | stevenwalton wrote:
               | Lots of people transfer learn with transformers. ViT[0]
               | originally did CIFAR with it. Then DeiT[1] introduced
               | some knowledge transfer (note: their student is larger
                | than the teacher). ViT was pretrained on both
                | ImageNet21k and JFT-300m.
               | 
               | CCT ([1] from above) was focused on training from
               | scratch.
               | 
               | There's two paradigms to be aware of. ImageNet and pre-
               | training can often be beneficial but it doesn't always
               | help. It really depends on the problem you're trying to
               | tackle and if there are similar features within the
               | target dataset and the pre-trained dataset. If there is
               | low similarity you might as well train from scratch.
                | Also, you might not want models as large as ViT and
                | DeiT (ViT has more parameters than CIFAR-10 has
                | features).
               | 
               | Disclosure: Author on CCT
               | 
               | [0] https://arxiv.org/abs/2010.11929
               | 
               | [1] https://arxiv.org/abs/2012.12877
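                | 
                | A minimal fine-tuning sketch, assuming the timm library
                | and its pretrained ViT weights (an assumption, not
                | something from this thread):
                | 
                |     import timm
                |     import torch
                | 
                |     model = timm.create_model("vit_base_patch16_224",
                |                               pretrained=True, num_classes=10)
                |     for name, p in model.named_parameters():
                |         p.requires_grad = name.startswith("head")  # freeze backbone
                |     optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
                |     # ...then train on the small target dataset as usual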
        
               | version_five wrote:
               | Awesome, thanks for the reply. It's been on my list to
               | try transformers instead of (mainly) Resnet for a while
               | now.
        
             | [deleted]
        
         | Der_Einzige wrote:
         | All of this is just another externalization of the bitter
         | lesson.
         | 
         | http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
       | melling wrote:
       | On Kaggle they are getting more usage.
        
       | dekhn wrote:
       | I assume transformers will be replaced by something, just as
       | transformers replaced other sequential models.
       | 
        | That said, transformers have already earned a place in the
        | annals of ML, if for no other reason than that they were
        | critical to the first technology to solve protein structure
        | prediction.
        
       | g42gregory wrote:
        | I think Transformers are a (good) tool in the AI toolbox, and
        | they are already being used in conjunction with many other
        | recent tools such as Normalizing Flows, Diffusion Models,
        | Energy-Based Models, etc...
        
       | zitterbewegung wrote:
       | If you want to play with Transformers you can go here
       | https://transformer.huggingface.co/
       | 
       | They have a really easy to use library in Python called
       | Transformers. Below is an example of how to use it.
        |     >>> from transformers import pipeline
        |     # Allocate a pipeline for sentiment-analysis
        |     >>> classifier = pipeline('sentiment-analysis')
        |     >>> classifier('We are very happy to introduce pipeline to the transformers repository.')
        |     [{'label': 'POSITIVE', 'score': 0.9996980428695679}]
        
         | PestoDiRucola wrote:
          | Huggingface is great! The only issue is the documentation,
          | which is rather lacking if you want to get more serious about
          | writing custom models and solving more complex issues than
          | what is normally documented in the examples there.
        
       | omarhaneef wrote:
        | For those who are interested in how Transformers have become
        | more prevalent, please read this thread by Karpathy where he
        | talks about a consolidation in ML:
       | 
       | https://twitter.com/karpathy/status/1468370605229547522
       | 
       | And of course one of the early classic papers in the field, as a
       | bonus:
       | 
       | https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd05...
       | 
       | (The paper is mentioned in the article)
        
         | lucidrains wrote:
          | if one prefers video, Yannic Kilcher gives an excellent
          | explanation of the seminal paper
         | https://www.youtube.com/watch?v=iDulhoQ2pro
        
           | amrrs wrote:
           | Glad to see you here lucidrains. Truly appreciate your recent
           | open-source contributions and works like big sleep, deep
           | daze.
           | 
            | Everyone else, check out this:
           | https://github.com/lucidrains?tab=repositories
        
             | lucidrains wrote:
             | Thanks for the kind words, and credit goes to Ryan Murdock
              | for Big Sleep and Deep Daze. I simply packaged them up to
              | spread the usage.
        
         | algo_trader wrote:
         | Have any of the "sub-quadratic transformers" [1] gone
         | mainstream? Or is everyone simply rich enough to buy enough
          | GPUs?
         | 
         | [1]https://www.gwern.net/notes/Attention
        
           | lucidrains wrote:
           | I would recommend Routing Transformer
           | https://github.com/lucidrains/routing-transformer but the
           | real truth is nothing beats full attention. Luckily, someone
           | recently figured out how to get past the memory bottleneck.
           | https://github.com/lucidrains/memory-efficient-attention-
           | pyt...
        
       | CabSauce wrote:
       | I thought they already had.
        
       | iamwil wrote:
        | For those who want a high-level overview of Transformers, we
        | recently covered it in our podcast:
       | https://www.youtube.com/watch?v=Kb0II5DuDE0
        
       ___________________________________________________________________
       (page generated 2022-03-10 23:00 UTC)