[HN Gopher] Transformers, originally designed to handle language...
___________________________________________________________________
Transformers, originally designed to handle language, are taking on
vision
Author : theafh
Score : 103 points
Date : 2022-03-10 17:01 UTC (5 hours ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| graycat wrote:
| After a long "training set", I have _learned_ that when I see a
| headline that ends with a question mark, don't bother to read
| the article; the answer to the question is nearly always "No".
| E.g.,
|
| "Will Expert Systems AI Replace Nearly All Knowledge Workers?"
|
| No, they didn't. And I am fully confident they won't.
|
| So, due to the question mark, just stop reading. Then I apply
| this _training_ to
|
| "Will Transformers Take over Artificial Intelligence?"
| version_five wrote:
| https://en.m.wikipedia.org/wiki/Betteridge%27s_law_of_headli...
| [deleted]
| JPKab wrote:
| Has anyone seen successful use of transformers with tabular data,
| particularly high-cardinality categorical data sets? Curious.
| godelski wrote:
| There's a pretty popular vision transformer tutorial on Medium. The
| authors were focused on making it work even with limited compute.
|
| https://medium.com/pytorch/training-compact-transformers-fro...
| mark_l_watson wrote:
| I think they will temporarily take over deep learning, but all of
| AI? No.
| fithisux wrote:
| There are a lot of hype-driven articles, fewer offering even
| hand-waving explanations, and none about the mathematics.
| goodmachine wrote:
| I really like Quanta: but this article was not so great.
| andrewmatte wrote:
| I have not given transformers enough attention... but my
| impression is that this is still storing entities in the weights
| of the neural network instead of in a database where they can be
| operated on with CRUD. What are the knowledge discovery
| researchers doing with respect to transformers? And the SAT
| solver researchers?
|
| Here is an article on KDNuggets that explains transformers but
| doesn't answer my questions:
| https://www.kdnuggets.com/2021/06/essential-guide-transforme...
| stingraycharles wrote:
| I think it's relatively straightforward to serialize such a
| model into different representations, so I completely understand
| why they keep the actual data inside PyTorch state by default.
|
| Out of curiosity, what tools are researchers generally using to
| explore neural networks? I'm just an armchair ML enthusiast
| myself, but NNs always appear very much like black boxes.
|
| What are the goals and methods for exploring neural network
| state nowadays?
| lolspace wrote:
| > I have not given transformers enough attention...
|
| ( ͡° ͜ʖ ͡°)
| fakethenews2022 wrote:
| Attention is all you need
| cleancoder0 wrote:
| The first transformer models still dealt only with the training
| set.
|
| Eventually the architecture was extended to work with an external
| data source that it queries. This is not new; for example, image
| style transfer and some other image tasks attempted before the
| domination of NNs did the same thing (linear models would query
| the db for help and for guided feature extraction).
|
| The greatest effect in transformers comes from the attention
| mechanism combined with self-supervised learning. Investigating
| other self-supervised learning tasks (the article illustrates the
| one-word-gap task, but there are others) can yield superior
| models that are sometimes even easier to train.
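|
| As a minimal sketch of that attention mechanism (assuming NumPy;
| illustrative only, not any particular library's implementation):
|
|     import numpy as np
|
|     def scaled_dot_product_attention(Q, K, V):
|         # Q, K, V: (seq_len, d) arrays of queries, keys and values
|         d = Q.shape[-1]
|         scores = Q @ K.T / np.sqrt(d)   # query/key similarities
|         scores -= scores.max(axis=-1, keepdims=True)
|         weights = np.exp(scores)
|         weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
|         return weights @ V              # weighted mixture of values
|
|     # toy usage: 4 tokens with 8-dimensional embeddings, self-attention
|     x = np.random.default_rng(0).normal(size=(4, 8))
|     out = scaled_dot_product_attention(x, x, x)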
|
| As for SAT and optimization, graph neural networks might end up
| being more effective (due to the highly structured inputs). I'm
| definitely waiting for a traveling salesman solver or similar,
| guided by a NN, solving things faster and reaching optimality
| more frequently than optimized heuristic algos.
| bglazer wrote:
| > I'm definitely waiting for a traveling salesman solver or
| similar, guided by a NN, solving things faster and reaching
| optimality more frequently than optimized heuristic algos.
|
| There was a competition for exactly this at Neurips 2021
|
| https://www.ecole.ai/2021/ml4co-competition/
|
| Not sure how much they improved over handcrafted heuristics,
| but the summary paper may give some insights
|
| https://arxiv.org/abs/2203.02433
| graycat wrote:
| > I'm definitely waiting for a traveling salesman solver or
| similar, guided by a NN, solving things faster and reaching
| optimality more frequently than optimized heuristic algos.
|
| Just in case we are not being clear, let's be clear. Bluntly,
| in nearly every practical sense, the traveling salesman
| problem (TSP) is NOT very difficult. Instead we have had good
| approaches for decades.
|
| I got into the TSP writing software to schedule the fleet for
| FedEx. A famous, highly accomplished mathematician asked me
| what I was doing at FedEx, and as soon as I mentioned
| scheduling the fleet he waved his hand and concluded I was
| only wasting time, that the TSP was too hard. He was wrong,
| badly wrong.
|
| Once I was talking with some people in a startup to design
| the backbone of the Internet. They were convinced that the
| TSP was really difficult. In one word, WRONG. Big mistake.
| Expensive mistake. Hype over reality.
|
| I mentioned that my most recent encounter with combinatorial
| optimization was _solving_ a problem with 600,000 0-1
| variables and 40,000 constraints. They immediately, about 15
| of them, concluded I was lying. I was telling the full, exact
| truth.
|
| So, what is difficult about the TSP? Okay, we would like an
| algorithm for some software that would solve TSP problems (1)
| to exact optimality, (2) in worst cases, (3) in time that
| grows no faster than some polynomial in the size of the input
| data to the problem. So, for (1) being provably within 0.025%
| of exact optimality is not enough. And for (2) exact
| optimality in polynomial time for 99 44/100% of real problems
| is not enough.
|
| In the problem I attacked with 600,000 0-1 variables and
| 40,000 constraints, a real world case of allocation of
| marketing resources, I came within the 0.025% of optimality.
| I know I was this close due to some bounding from some
| nonlinear duality -- easy math.
|
| So, in your
|
| > reaching optimality more frequently than optimized
| heuristic algos.
|
| heuristics may not be, in nearly all of reality probably are
| not, reaching "optimality" in the sense of (2).
|
| The hype around the TSP has been to claim that the TSP is
| really difficult. Soooo, on some project that is to cost
| $100 million, an optimal solution might save $15 million, yet
| the fact that software based on what has long been known
| (e.g., from G. Nemhauser) can save all but $1500 of that is
| treated as not of interest. Bummer. Nearly all of the $15
| million wasted.
|
| For this, see the cartoon early in Garey and Johnson where
| they confess they can't solve the problem (optimal network
| design at Bell Labs) but neither can a long line of other
| people. WRONG. SCAM. The stockholders of AT&T didn't care
| about the last $1500 and would be thoroughly pleased by the
| $15 million without the $1500. Still that book wanted to say
| the network design problem could not yet be solved -- that
| statement was true only in the sense of exact optimality in
| polynomial time on worst case problems, a goal of essentially
| no interest to the stockholders of AT&T.
|
| For neural networks (NN), I don't expect (A) much progress in
| any sense over what has been known (e.g., Nemhauser _et al._)
| for decades. And (B) the progress NNs might make promises to
| be in performance aspects other than getting to exact
| optimality.
|
| Yes, there are some reasons for taking the TSP and the issue
| of P versus NP seriously, but _optimality_ on real world
| _optimization_ problems is not one of the main reasons.
|
| Here my goal is to get us back to reality and set aside some
| of the hype about how difficult the real world TSP is.
| cleancoder0 wrote:
| There's LKH http://webhotel4.ruc.dk/~keld/research/LKH/
| which is a heuristic and the best open implementation. Adding
| optimality estimates is the least complicated part.
|
| When TSP is mentioned today, unlike 50 years ago when the LK
| heuristic got published, I assume all of the popular &
| practical variants: time window constraints, pickup
| and delivery, capacity constraints, max drop time
| requirement after pickup, flexible route start, adding
| location-independent breaks (a break can happen anytime in
| the sequence or in a particular time window of the day), etc.
| Some of the subproblems are so constrained that you cannot
| even move around as effectively as you can with raw TSP.
|
| Some of the subproblems have O(n) or O(n log n) evaluations
| of the best local moves; generic solvers are even worse at
| handling that (Concorde's LP optimizations cannot cover that
| efficiently). When no moves are possible, you have to see
| which moves bring you back to a feasible solution and how
| many local changes you need to make to accomplish this.
|
| For example, just adding time windows complicates most well-
| known TSP heuristics or makes them useless. Now imagine if we
| add a requirement between pairs of locations that they need
| to be at most X time apart (picking up and then delivering
| perishable goods), that the route can start at an arbitrary
| moment etc.
|
| I personally spent quite a lot of time working on these
| algorithms and I'd say the biggest issue is instance
| representation (is it enough to have a sequence of location
| ids?). For example, one of my recent experiments was using
| zero suppressed binary decision diagrams to easily traverse
| some of these constrained neighborhoods and maintain the
| invariants after doing local changes. Still too slow for
| some instances I handle (real world is 5000 locations, 100
| salesmen and an insane amount of location/salesmen
| constraints).
| graycat wrote:
| Amazing. Of course I've heard of Kernighan long ago, but
| this is the first I've heard of LKH.
|
| I did a lot in optimization, in my Ph.D. studies and in
| my career, but I dropped it decades ago -- the decision
| was made for me by my customers: essentially there
| weren't any, or at least not nearly enough that I could
| find.
|
| Actually, my summary view is that for applications of
| math in the US, the main customer is US national
| security. Now there are big bucks to apply algorithms and
| software to some big data, and maybe, _maybe_ , there is
| some interest in math. But the call I got from Google
| didn't care at all about my math, optimization,
| statistics, or stochastic processes background. Instead
| they asked what was my favorite programming language, and
| my answer, PL/I, was the end of the interview. I'm sure
| the correct answer was C++. I still think PL/I is a
| better language than C++.
|
| Early in my career, I was doing really well with applied
| math and computing, but that was all for US national
| security and within 50 miles of the Washington Monument.
|
| Now? I'm doing a startup. There is some math in it, but
| it is just a small part, an advantage, maybe crucial, but
| still _small_.
| cleancoder0 wrote:
| There's quite a resurgence of need for optimization.
|
| There are a lot of companies that want to provide an
| Uber/Lyft-like service for their own product. So you have
| a bunch of smaller problems that you want to solve as
| best as possible in ~1 second.
|
| A lot of small companies with their delivery fleets want
| to optimize (pest control, Christmas tree delivery,
| cleaning, technical service, construction (coordinating
| teams that construct multiple things at multiple
| locations at the same time) etc.).
|
| On the other hand, not related to TSP, the whole energy
| market in the US is very LP/ILP optimizable and has a lot
| of customers (charging home batteries, car batteries,
| discharging when price is high, etc.).
|
| I would admit that the scientific field of discrete
| optimization is littered with genetic algorithms, ant
| colonies and other "no free lunch" optimization
| algorithms that make very little sense from a progress
| perspective, so it does feel like the golden era was from
| the 70s to early 90s. I do not have a PhD but somehow
| ended up doing machine learning and discrete optimization
| most of my career.
| feanaro wrote:
| What do you mean when you say these algorithms make very
| little sense from a progress perspective?
| enchiridion wrote:
| Where is a good place to look for algorithms/math for
| solving problems similar to the ones you mentioned?
| graycat wrote:
| You can look at the now-old work of G. Nemhauser. His work
| was on _combinatorial_ optimization, not just on the
| traveling salesman problem (TSP).
|
| E.g., there is
|
| George L. Nemhauser and Laurence A. Wolsey, _Integer and
| Combinatorial Optimization,_ ISBN 0-471-35943-2, John
| Wiley & Sons, Inc., New York, 1999.
|
| Some approaches involve _set covering_ and _set
| partitioning_. Soooo, for the FedEx fleet, first just
| generate all feasible single-airplane tours from the
| Memphis hub and back. Here you can honor some really goofy
| constraints and complicated costing; you can even handle
| some stochastic issues, i.e., the costs depend on the
| flight planning and that depends on the loads, which are
| random, but it would be okay to work with just
| expectations -- we're talking complicated costing! Then,
| with all those tours generated, pick ones that _cover_ all
| the cities to be served, i.e., _partition_ the cities. You
| have a good shot at using linear programming, tweaked a
| little to handle 0-1 constraints, to pick the tours.
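|
| In symbols, that set-partitioning step is (as a sketch, with
| x_j = 1 if generated tour j is flown, c_j its cost, and
| a_ij = 1 if tour j serves city i):
|
|     \min \sum_j c_j x_j
|     \quad \text{s.t.} \quad \sum_j a_{ij} x_j = 1 \ \ \text{for each city } i,
|     \qquad x_j \in \{0, 1\} \ \ \text{for each tour } j.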
|
| Then, more generally, for a lot of practical problems you
| can write linear programming problems with some of the
| variables integer. You can tweak the simplex algorithm
| of linear programming to handle some such constraints
| fairly _naturally_ in the algorithm. E.g., of course, you
| can proceed with the now-classic branch and bound.
|
| The TSP taken narrowly can be regarded as more
| specialized.
|
| So, net, there is a big bag of what are essentially tricks,
| some backed by math and some just heuristics.
|
| Part of the interest in the issue of P versus NP was to
| do away with the bag of tricks and have just one grand,
| fantastic algorithm and computer program with guaranteed
| guaranteed performance. Nice if doable. Alas, after all
| these years, so far not really "doable", not as just one
| grand, fantastic .... And the question of P versus NP has
| resisted so much for so long that it has even a
| philosophical flavor. And there are serious claims that a
| technically _good_ algorithm would have some really
| astounding consequences.
|
| Sure, I have some half-baked ideas sitting around that I
| hope will show that P = NP -- doesn't everyone? But my
| point here is simple: for several decades we have
| been able to do quite well on real problems. Oh, for the
| problem with 600,000 0-1 variables and 40,000 constraints,
| otherwise linear, I used _nonlinear duality theory_
| (which is simple) or, if you wish, Lagrangian relaxation
| -- it's one of the tricks.
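|
| For reference, the bound is just weak duality: dualize the hard
| constraints Ax >= b of min { c^T x : Ax >= b, x in X } with
| multipliers lambda >= 0, and (as a sketch)
|
|     L(\lambda) \;=\; \min_{x \in X} \; c^\top x + \lambda^\top (b - A x)
|     \;\le\; \min \{\, c^\top x : A x \ge b,\ x \in X \,\},
|
| so a feasible solution whose cost is within 0.025% of the best
| L(lambda) found is provably within 0.025% of optimal.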
|
| Another old trick: For the actual TSP in any _Euclidean_
| space (sure, the plane but also 3 dimensions or 50 if you
| want), that is, with Euclidean distance, just find a
| minimum spanning tree (there are at least two _good_,
| that is, polynomial, algorithms for that) and then in a
| simple and fairly obvious way make a TSP tour out of that
| tree. That approach actually has some probabilistic
| bounds on how close it is to optimality, and it does
| better with more cities -- it's another tool in the kit.
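|
| A sketch of that tree-to-tour construction (assuming the
| networkx library and Euclidean points; illustrative only):
|
|     import itertools, math, random
|     import networkx as nx
|
|     random.seed(0)
|     pts = {i: (random.random(), random.random()) for i in range(50)}
|
|     # complete graph with Euclidean edge weights
|     G = nx.Graph()
|     for i, j in itertools.combinations(pts, 2):
|         G.add_edge(i, j, weight=math.dist(pts[i], pts[j]))
|
|     # minimum spanning tree, then a preorder walk of the tree
|     # (skipping repeats) gives a tour at most twice the optimal
|     # length for metric distances
|     T = nx.minimum_spanning_tree(G)
|     tour = list(nx.dfs_preorder_nodes(T, source=0))
|     length = sum(G[a][b]["weight"]
|                  for a, b in zip(tour, tour[1:] + tour[:1]))
|     print(f"tour length: {length:.3f}")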
|
| My main conclusion about the TSP, combinatorial
| optimization, and optimization more generally is that
| there are way, Way, WAY too few good customers. Whether
| there is 15% of project cost to be saved or not, the
| people responsible for the projects just do NOT want to
| be bothered. In simple terms, in practice, it is
| essentially a dead field. My view is that suggesting that
| a young person devote some significant part of their
| career to _optimization_ is, bluntly, in a word,
| irresponsible.
| lacker wrote:
| Python-MIP is a great library that provides an interface
| to many different algorithms like this. It's practical
| for use in scientific programming where appropriate,
| and if you read through the docs you can find the names
| of specific algorithms that it uses with pointers to
| where to learn more.
|
| https://docs.python-mip.com/en/latest/intro.html
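|
| A tiny, hedged example of the Python-MIP interface (a 0-1
| knapsack; the library bundles the CBC solver by default):
|
|     from mip import Model, xsum, maximize, BINARY
|
|     values = [10, 13, 18, 31, 7, 15]
|     weights = [11, 15, 20, 35, 10, 33]
|     capacity = 47
|
|     m = Model()
|     x = [m.add_var(var_type=BINARY) for _ in values]
|     m.objective = maximize(xsum(v * x[i] for i, v in enumerate(values)))
|     m += xsum(w * x[i] for i, w in enumerate(weights)) <= capacity
|     m.optimize()
|
|     print("selected:", [i for i in range(len(values)) if x[i].x >= 0.99])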
| yobbo wrote:
| > As for SAT and optimization, graph neural networks might
| end up being more effective
|
| Learning from data is a different problem from optimization.
| For example, if facts about cities gave additional clues,
| beyond their locations, about the optimal order, then
| learning could help with the travelling salesman problem. Or
| if the cost of paths is only known implicitly through data
| examples.
|
| Compare to how NNs can be used for data compression, for
| example upscaling images, by learning from photographs only
| the tiny subset of all possible images that are meaningful
| to humans. But that is not useful for general data
| compression.
| cleancoder0 wrote:
| What about AlphaGo, AlphaZero (chess)?
|
| Optimization is also data: given a local state, can you
| identify the sequence of transformations that will get you
| to a better state? The reward is instantly measurable and
| the goal is minimizing the total cost.
| axg11 wrote:
| I wrote a short post on retrieval transformers that you might
| find interesting [0]. It's a twist on transformers that allows
| scaling "world knowledge" independently in a database-like
| manner.
|
| [0] - https://arsham.substack.com/p/retrieval-transformers-for-
| med...
| adamsmith143 wrote:
| Isn't the benefit of NNs on some level that you can store
| finer-grained and more abstract data than in a standard DB?
| macrolocal wrote:
| Maybe. Transformers model associative memory in a way made
| precise by their connection to Hopfield networks.
| Individually, they're like look-up tables, but the queries
| can be ambiguous, even based on subtle higher-order patterns
| (which the network identifies on its own), and the returned
| values can be a mixture of stored information, weighted by
| statistically meaningful confidences.
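|
| A toy sketch of that associative-memory view (NumPy; the stored
| rows serve as both keys and values, and beta is an assumed
| inverse-temperature parameter):
|
|     import numpy as np
|
|     rng = np.random.default_rng(1)
|     stored = rng.normal(size=(16, 64))      # 16 stored patterns
|     stored /= np.linalg.norm(stored, axis=1, keepdims=True)
|
|     query = stored[3] + 0.3 * rng.normal(size=64)  # noisy, ambiguous query
|
|     beta = 8.0                               # sharper -> closer to a hard lookup
|     weights = np.exp(beta * (stored @ query))
|     weights /= weights.sum()
|     retrieved = weights @ stored             # confidence-weighted mixture
|
|     print(weights.argmax())                  # the closest stored pattern (typically 3) dominates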
| ctoth wrote:
| > The latest batch of language models can be much smaller yet
| achieve GPT-3 like performance by being able to query a
| database or search the web for information[0].
|
| [0]: https://jalammar.github.io/illustrated-retrieval-
| transformer...
| tourist_on_road wrote:
| Transformers gained popularity due to the scalable nature of the
| architecture and how well it can be parallelized on existing
| GPU/XLA hardware. Modeling is always conditioned on the hardware
| available at hand. Transformers lack inductive bias, which makes
| them generic building blocks unlike CNN/RNN-style models, and by
| injecting inductive bias like positional encoding, they can be
| translated well to various domains.
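|
| For reference, a sketch of the sinusoidal positional encoding
| from the original Transformer paper (NumPy), added to the token
| embeddings to inject order information:
|
|     import numpy as np
|
|     def sinusoidal_positional_encoding(seq_len, d_model):
|         # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
|         # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
|         pos = np.arange(seq_len)[:, None]
|         i = np.arange(d_model // 2)[None, :]
|         angles = pos / np.power(10000.0, 2 * i / d_model)
|         pe = np.zeros((seq_len, d_model))
|         pe[:, 0::2] = np.sin(angles)
|         pe[:, 1::2] = np.cos(angles)
|         return pe
|
|     pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
|     # embeddings = token_embeddings + pe, before the first attention layer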
| saeranv wrote:
| Are transformers competitive with (for example) CNNs on vision-
| related tasks when there's less data available? I'm not that
| familiar with "injecting inductive bias" via positional
| encodings, but it sounds really interesting. My crude
| understanding is that the positional encodings were used in the
| original Transformer architecture to encode the ordering of
| words for NLP. Are they more flexible than that? For example,
| can they be used to replicate the image-related inductive bias
| of CNNs and match CNN performance on small datasets (1000 -
| 10,000)?
|
| If not, then to me, it seems like only industries where it's
| possible to get access to a large amount of representative data
| (i.e. greater than a million?) benefit from transformers. In
| industries where there are bottlenecks to data generation,
| there's a clear benefit in leveraging the inductive bias in
| other architectures, such as the various ways CNNs have biases
| towards image recognition.
|
| I'm in an industry (building energy consumption prediction)
| where we can only generate around 10,000 to 100,000 datapoints
| (from simulation engines) for DL. Are transformers ever used
| with that scale of data?
| [deleted]
| stevenwalton wrote:
| > Are transformers competitive with (for example) CNNs on
| vision-related tasks when there's less data available?
|
| They can be; there's current research into the tradeoffs
| between local inductive bias (information from local
| receptive fields: CNNs have strong local inductive bias) and
| global inductive bias (large receptive fields: i.e.
| attention). There's plenty of works that combine CNNs and
| Attention/Transformers. A handful of them focus on smaller
| datasets, but the majority are more interested in ImageNet.
| There's also work being done to change the receptive fields
| within attention mechanisms as a means to balance this.
|
| > Are transformers ever used with that scale of data?
|
| So there's a yes and no to your question. But definitely yes
| since people have done work on Flowers102 (6.5k training) and
| CIFAR10 (50k training). Keep in mind that not all these
| models are pure transformers. Some have early convolutions or
| intermediate ones. Some of these works even have a smaller
| number of parameters and better computational efficiency than
| CNNs.
|
| But more importantly, I think the big question is about what
| type of data you have. If large receptive fields are helpful
| to your problem then transformers will work great. If you
| need local receptive fields then CNNs will tend to do better
| (or combinations of transformers and CNNs or reduced
| receptive fields on transformers). I doubt there will be a
| one size fits all architecture.
|
| One thing to also keep in mind is that transformers typically
| like heavy amounts of augmentation. Not all data can be
| augmented significantly. There's also pre-training and
| knowledge transfer/distillation.
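|
| As an illustration of "heavy augmentation" (a sketch assuming
| torchvision; the exact recipe varies by paper):
|
|     from torchvision import transforms
|
|     train_tfms = transforms.Compose([
|         transforms.RandomResizedCrop(224),
|         transforms.RandomHorizontalFlip(),
|         transforms.RandAugment(num_ops=2, magnitude=9),  # random distortions
|         transforms.ToTensor(),
|         transforms.RandomErasing(p=0.25),                # cutout-style occlusion
|     ])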
| tourist_on_road wrote:
| Good point. The fact there is no inductive bias inherent to
| transformers makes it difficult to train a decent model on
| small datasets from scratch. However, there are recent
| research directions that try to address this problem [1].
|
| Also baking in some sort of domain specific inductive bias
| into model architecture itself can address this problem as
| well [2].
|
| [1]: Escaping the Big Data Paradigm with Compact
| Transformers: https://arxiv.org/abs/2104.05704
|
| [2]: CvT: Introducing Convolutions to Vision Transformers:
| https://arxiv.org/abs/2103.15808
| version_five wrote:
| Maybe a naive question: is there no transfer learning with
| transformers? I've done a lot of work with CNN
| architectures on small datasets, and almost always start
| with something trained on imagenet, and fine tune, or do
| some kind of semi-supervised training to start. Can we do
| that with ViT et al. as well? Or are they really usually
| trained from scratch?
| stevenwalton wrote:
| Lots of people transfer learn with transformers. ViT[0]
| originally did CIFAR with it. Then DeiT[1] introduced
| some knowledge transfer (note: their student is larger
| than the teacher). ViT pretrained on both ImageNet21k and
| JFT-300m.
|
| CCT (the Compact Transformers paper, [1] in the comment
| above) was focused on training from scratch.
|
| There are two paradigms to be aware of. ImageNet pre-
| training can often be beneficial, but it doesn't always
| help. It really depends on the problem you're trying to
| tackle and whether there are similar features between the
| target dataset and the pre-training dataset. If there is
| low similarity you might as well train from scratch.
| Also, you might not want models as large as ViT and
| DeiT (ViT has more parameters than CIFAR-10 has
| features).
|
| Disclosure: Author on CCT
|
| [0] https://arxiv.org/abs/2010.11929
|
| [1] https://arxiv.org/abs/2012.12877
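|
| A hedged sketch of the transfer-learning route with the Hugging
| Face transformers library (the checkpoint name and label count
| here are assumptions for illustration):
|
|     import torch
|     from transformers import ViTForImageClassification
|
|     # load a pre-trained ViT and attach a fresh classification head
|     model = ViTForImageClassification.from_pretrained(
|         "google/vit-base-patch16-224-in21k",  # assumed checkpoint name
|         num_labels=10,                        # e.g. CIFAR-10
|     )
|     optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
|
|     # inside a training loop, with pixel_values of shape (batch, 3, 224, 224):
|     # outputs = model(pixel_values=pixel_values, labels=labels)
|     # outputs.loss.backward(); optimizer.step(); optimizer.zero_grad()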
| version_five wrote:
| Awesome, thanks for the reply. It's been on my list to
| try transformers instead of (mainly) Resnet for a while
| now.
| [deleted]
| Der_Einzige wrote:
| All of this is just another externalization of the bitter
| lesson.
|
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| melling wrote:
| On Kaggle they are getting more usage.
| dekhn wrote:
| I assume transformers will be replaced by something, just as
| transformers replaced other sequential models.
|
| That said, transformers have already earned a place in the annals
| of ML, if for no other reason than that they were critical to
| the first technology to solve protein structure prediction.
| g42gregory wrote:
| I think Transformers are a (good) tool in the AI toolbox, already
| being used in conjunction with many other recent tools
| such as Normalizing Flows, Diffusion Models, Energy-Based Models,
| etc...
| zitterbewegung wrote:
| If you want to play with Transformers you can go here
| https://transformer.huggingface.co/
|
| They have a really easy-to-use library in Python called
| Transformers. Below is an example of how to use it.
|
|     >>> from transformers import pipeline
|     >>> # Allocate a pipeline for sentiment analysis
|     >>> classifier = pipeline('sentiment-analysis')
|     >>> classifier('We are very happy to introduce pipeline to the transformers repository.')
|     [{'label': 'POSITIVE', 'score': 0.9996980428695679}]
| PestoDiRucola wrote:
| Huggingface is great! The only issue is the documentation, which
| is rather lacking if you want to get more serious about writing
| custom models and solving more complex problems than what is
| normally documented in the examples there.
| omarhaneef wrote:
| For those who are interested in how Transformers have become more
| prevalent, please read this thread by Karpathy where he talks
| about a consolidation in ML:
|
| https://twitter.com/karpathy/status/1468370605229547522
|
| And of course one of the early classic papers in the field, as a
| bonus:
|
| https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd05...
|
| (The paper is mentioned in the article)
| lucidrains wrote:
| if one prefers video, Yannic Kilcher does an excellent
| explanation of the seminal paper
| https://www.youtube.com/watch?v=iDulhoQ2pro
| amrrs wrote:
| Glad to see you here lucidrains. Truly appreciate your recent
| open-source contributions and works like big sleep, deep
| daze.
|
| Everyone else check out this
| https://github.com/lucidrains?tab=repositories
| lucidrains wrote:
| Thanks for the kind words, and credit goes to Ryan Murdock
| for Big Sleep and Deep Daze. I simply packaged it up to
| spread the usage
| algo_trader wrote:
| Have any of the "sub-quadratic transformers" [1] gone
| mainstream? Or is everyone simply rich enough to buy enough
| GPUs?
|
| [1] https://www.gwern.net/notes/Attention
| lucidrains wrote:
| I would recommend Routing Transformer
| https://github.com/lucidrains/routing-transformer but the
| real truth is nothing beats full attention. Luckily, someone
| recently figured out how to get past the memory bottleneck.
| https://github.com/lucidrains/memory-efficient-attention-
| pyt...
| CabSauce wrote:
| I thought they already had.
| iamwil wrote:
| For those that want a high level overview of Transformers, we
| recently covered it in our podcast:
| https://www.youtube.com/watch?v=Kb0II5DuDE0
___________________________________________________________________
(page generated 2022-03-10 23:00 UTC)