[HN Gopher] Common misconceptions about the complexity in roboti...
___________________________________________________________________
Common misconceptions about the complexity in robotics vs. AI
(2024)
Author : wallflower
Score : 139 points
Date : 2025-01-07 15:19 UTC (4 days ago)
(HTM) web link (harimus.github.io)
(TXT) w3m dump (harimus.github.io)
| jvanderbot wrote:
| > Moravec's paradox is the observation by artificial intelligence
| and robotics researchers that, contrary to traditional
| assumptions, reasoning requires very little computation, but
| sensorimotor and perception skills require enormous computational
| resources. The principle was articulated by Hans Moravec, Rodney
| Brooks, Marvin Minsky, and others in the 1980s.
|
| I have a name for it now!
|
| I've said over and over that there are only two really hard
| problems in robotics: Perception and funding. A perfectly
| perceived system and world can be trivially planned for and (at
| least proprio-)controlled. Imagine having a perfect intuition
| about other actors such that you know their paths (in self
| driving cars), or your map is a perfect voxel + trajectory +
| classification. How divine!
|
| It's limited information and difficulties in reducing signal to
| concise representation that always get ya. This is why the
| perfect lab demos always fail - there's a corner case not in your
| training data, or the sensor stuttered or became misaligned, or
| etc etc.
| jvanderbot wrote:
| > Moravec hypothesized around his paradox, that the reason for
| the paradox [that things we perceive as easy b/c we dont think
| about them are actually hard] could be due to the sensor &
| motor portion of the human brain having had billions of years
| of experience and natural selection to fine-tune it, while
| abstract thoughts have had maybe 100 thousand years or less
|
| Another gem!
| Legend2440 wrote:
| Or it could be a parallel vs serial compute thing.
|
| Perception tasks involve relatively simple operations across
| very large amounts of data, which is very easy if you have a
| lot of parallel processors.
|
| Abstract thought is mostly a serial task, applying very
| complex operations to a small amount of data. Many abstract
| tasks like evaluating logical expressions cannot be done in
| parallel - they are in the complexity class P-complete.
|
| Your brain is mostly a parallel processor (80 billion neurons
| operating asynchronously), so logical reasoning is hard and
| perception is easy. Your CPU is mostly a serial processor, so
| logical reasoning is easy and perception is hard.
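|
| A toy illustration of that contrast (just a sketch; the numbers
| and the tiny circuit are made up):
|
|   import numpy as np
|
|   # "Perception-like": one simple operation over a lot of data.
|   # Every element is independent, so it parallelizes trivially.
|   pixels = np.random.rand(1_000_000)
|   edges = pixels > 0.5
|
|   # "Reasoning-like": evaluate a Boolean circuit gate by gate.
|   # Each gate reads wires produced earlier, so the loop is
|   # inherently sequential (circuit value is P-complete).
|   wires = [True, False]                  # input wires 0 and 1
|   gates = [("AND", 0, 1), ("OR", 2, 0), ("AND", 3, 1)]
|   for op, a, b in gates:
|       wires.append(wires[a] and wires[b] if op == "AND"
|                    else wires[a] or wires[b])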
| cratermoon wrote:
| > Perception tasks involve relatively simple operations
| across very large amounts of data, which is very easy if
| you have a lot of parallel processors.
|
| Yes, relatively simple. Wait, isn't that exactly what the
| article explained was completely wrong-headed?
| burnished wrote:
| No. The article is talking about things we think of as
| being easy because they are easy for a human to perform
| but that are actually very difficult to
| formalize/reproduce artificially.
|
| The person you are responding to is instead comparing
| differences in biological systems and mechanical systems.
| visarga wrote:
| > Or it could be a parallel vs serial compute thing.
|
| The brain itself is both a parallel system and a serially
| constrained system. It has distributed activity but it must
| resolve into a serial chain of action. We can't walk left and
| right at the same time. Any goal forces us to follow
| specific steps in a specific order. This conflict between
| parallel processing and serial outputs is where the magic
| happens.
| topherclay wrote:
| > ...the sensor & motor portion of the human brain having had
| billions of years of experience.
|
| It doesn't really change the significance of the quote, but I
| can't help but point out that we didn't even have nerve cells
| more than 0.6 billion years ago.
| lang4d wrote:
| Maybe just semantics, but I think I would call that prediction.
| Even if you have perfect perception (measuring the current
| state of the world perfectly), it's nontrivial to predict the
| future paths of other actors. The prediction problem requires
| intuition about what the other actors are thinking, how their
| plans influence each other, and how your plan influences them.
| bobsomers wrote:
| > I've said over and over that there are only two really hard
| problems in robotics: Perception and funding. A perfectly
| perceived system and world can be trivially planned for and (at
| least proprio-)controlled.
|
| Funding for sure. :)
|
| But as for perception, the inverse is also true. If I have a
| perfect planning/prediction system, I can throw the grungiest,
| worst perception data into it and it will still plan
| successfully despite tons of uncertainty.
|
| And therein lies the real challenge of robotics: It's
| fundamentally a systems engineering problem. You will never
| have perfect perception or a perfect planner. So, can you make
| a perception system that is _good enough_ that, when coupled
| with your planning system which is _good enough_, you are able
| to solve enough problems with enough 9s to make it successful?
|
| The most commercially successful robots I've seen have had some
| of the smartest systems engineering behind them, such that
| entire classes of failures were eliminated by being smarter
| about what you _actually need to do to solve the problem_ and
| aggressively avoiding subproblems that aren't absolutely
| necessary. Only then do you really have a hope of getting good
| enough at that focused domain to ship something before the
| money runs out. :)
| portaouflop wrote:
| > being smarter about what you actually need to do to solve
| the problem and aggressively avoid solving subproblems that
| aren't absolutely necessary
|
| I feel like this is true for every engineering discipline or
| maybe even every field that needs to operate in the real
| world
| vrighter wrote:
| except software, of course. Nowadays it seems that software
| is all about creating problems to create solutions for.
| krisoft wrote:
| > If I have a perfect planning/prediction system, I can
| throw the grungiest, worst perception data into it and it
| will still plan successfully despite tons of uncertainty.
|
| Not really. Even a perfect planning system will appear
| erratic in the presence of perception noise. It must, because
| it can't create information out of nowhere.
|
| I have seen robots erratically stop because they thought that
| the traffic in the oncoming lane was encroaching on theirs.
| You can't make the planning system ignore that because then
| sometimes it will collide with people playing chicken with
| you.
|
| Likewise I have seen robots erratically stop because they
| thought that a lamp post was slowly reversing out in front of
| them. All due to perception noise (in this case both location
| noise, and misclassification.)
|
| And do note that these are just the false positives. If you
| have a bad perception system you can also suffer from false
| negatives. It's just that experiment biases hide those.
|
| So your "perfect planning/prediction" will appear overly
| cautious while at the same time will be sometimes reckless.
| Because it doesn't have the information to not to. You can't
| magic plan your way out of that. (Unless you pipe the raw
| sensor data into the planner, in which case you created a
| second perception system, you are just not calling it
| perception.)
| YeGoblynQueenne wrote:
| >> (Unless you pipe the raw sensor data into the planner,
| in which case you created a second perception system you
| are just not calling it perception.)
|
| Like with model-free RL learning a model from pixels?
| jvanderbot wrote:
| A "perfect" planning system which can handle arbitrarily bad
| perception is indistinguishable from a perception system.
|
| I've not seen a system that claimed to be robust to sensor
| noise that didn't do some filtering, estimation, or state
| representation internally. Those are just sensor systems
| inside the box.
| exe34 wrote:
| "the sensor stuttered or became misaligned, or etc etc."
|
| if your eyes suddenly crossed, you'd probably fall over too!
| seanhunter wrote:
| Yeah the fun way Moravec's paradox was explained to me [1] is
| that you can now easily get a computer to solve simultaneous
| differential equations governing all the axes of motion of a
| robot arm but getting it to pick one screw out of a box of
| screws is an unsolved research problem.
|
| [1] by a disillusioned computer vision phd that left the field
| in the 1990s.
| wrp wrote:
| Selective attention was one of the main factors in Hubert
| Dreyfus' explanation of "what computers can't do." He had a
| special term for it, which I can't remember off-hand.
| visarga wrote:
| > A perfectly perceived system and world can be trivially
| planned for
|
| I think it's not about perfect perception (there is no such
| thing, not even in humans); it's about adaptability, recovery
| from error, resilience, and mostly about learning from the
| outside when the process fails to work. Each problem has its
| own problem space to explore. I think of intelligence as search
| efficiency across many problem spaces, there is no perfection
| in it. Our problem spaces are far from exhaustively known.
| catgary wrote:
| Yeah, this was my general impression after a brief, disastrous
| stretch in robotics after my PhD. Hell, I work in animation now,
| which is a way easier problem since there are no physical
| constraints, and we still can't solve a lot of the problems the
| OP brings up.
|
| Even stuff like using video misses the point, because so much of
| our experience is via touch.
| johnwalkr wrote:
| I've worked in a robotics-adjacent field for 15 years and
| robotics is truly hard. The number of people and companies I've
| seen come and go that claim their software expertise will make
| a useful, profitable robot is.. a lot.
| Legend2440 wrote:
| Honestly I'm tired of people who are more focused on 'debunking
| the hype' than figuring out how to make things work.
|
| Yes, robotics is hard, and it's still hard despite big
| breakthroughs in other parts of AI like computer vision and NLP.
| But deep learning is still the most promising avenue for general-
| purpose robots, and it's hard to imagine a way to handle the
| open-ended complexity of the real world _other_ than learning.
|
| Just let them cook.
| mitthrowaway2 wrote:
| > _If you want a more technical, serious (better) post with a
| solution oriented point to make, I'll refer you to Eric Jang's
| post [1]_
|
| [1] https://evjang.com/2022/07/23/robotics-generative.html
| FloorEgg wrote:
| As someone on the sidelines of robotics who feels everything
| is getting disrupted and that we're at the precipice of major
| change, it's really helpful to have a clearer understanding of
| the actual challenge and how close we are to solving it.
| Anything that helps me make more accurate predictions will help
| me make better decisions about what problems I should be trying
| to solve and what skills I should be trying to develop.
| cratermoon wrote:
| It might be nice if the author qualified "most of the freely
| available data on the internet" with "whether or not it was
| copyrighted" or something to acknowledge the widespread theft of
| the works of millions.
| danielbln wrote:
| Theft is the wrong term, it implies that the original is no
| longer available. It's copyright infringement at best, and
| possibly fair use depending on jurisdiction. It wasn't theft
| when the RIAA went on a lawsuit spree against mp3 copying, and
| it isn't theft now.
| CaptainFever wrote:
| Related: https://www.youtube.com/watch?v=IeTybKL1pM4
| cratermoon wrote:
| Ackchyually....
| jes5199 wrote:
| I would love to see some numbers. How many orders of magnitude
| more complicated do we think embodiment is, compared to
| conversation? How much data do we need compared to what we've
| already collected?
| FloorEgg wrote:
| If nature computed both through evolution, then maybe it's
| approximately the same ratio: roughly the time it took to
| evolve embodiment versus the time it took to evolve from
| grunts to advanced language.
|
| If we start from when we think multicellular life first evolved
| (~2B years ago), or maybe the Cambrian explosion (~500M years
| ago), and count until modern humans (~300k years ago), then
| compare that to the time between the first modern humans and now.
|
| It seems like maybe 3-4 orders of magnitude harder.
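|
| A back-of-envelope version of that ratio, just to make the guess
| concrete (the timescales are the rough numbers above, nothing
| more):
|
|   import math
|
|   embodiment_years = 500e6  # Cambrian explosion -> modern humans
|   language_years = 300e3    # modern humans -> now
|
|   ratio = embodiment_years / language_years
|   print(f"~{ratio:.0f}x, i.e. about "
|         f"{math.log10(ratio):.1f} orders of magnitude")
|   # ~1667x, about 3.2 orders of magnitude (~3.8 if you start
|   # from multicellular life at ~2B years)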
|
| My intuition after reading the articles is that there needs to
| be way more sensors all throughout the robot, probably with
| lots of redundancies, and then lots of modern LLM sized models
| all dedicated to specific joints and functions and capable of
| cascading judgement between each other, similar to how our
| nervous system works.
| jes5199 wrote:
| so like ten to twenty years, via moore's law?
| daveguy wrote:
| Maybe. If Moore's law remotely holds up for ten to twenty
| years. There's still the part about not having a clue how
| to replicate physical systems efficiently vs logical
| systems.
| rstuart4133 wrote:
| "Hardness" is a difficult quantity to define if you venture
| beyond "humans have been trying to build systems to do this for
| a while, and haven't succeeded".
|
| Insects have succeeded in building precision systems that
| combine vision, smell, touch and a few other senses. I doubt
| finding a juicy spider and immobilising it is that much more
| difficult than finding a door knob and turning it, or folding
| a T-shirt. Yet insects accomplish it with, I suspect, far less
| compute than modern LLMs. So it's not "hard" in the sense of
| requiring huge compute resources, and certainly not a lot of
| power.
|
| So it's probably not that hard in the sense that it's well
| within the capabilities of the hardware we have now. The issue
| is more that we don't have a clue how to do it.
| jes5199 wrote:
| well the magic of transformer architecture is that if the
| rules exist and are computationally tractable, the system
| will find them in the data, and we don't have to have a clue.
| so. how much data do we need?
| BlueTemplar wrote:
| Calling it "compute" might be part of the issue : insects
| aren't (even partially) digital computers.
|
| We might or might not be able to emulate what they process on
| digital computers, but emulation implies a performance loss.
|
| And this doesn't even cover inputs/outputs (some of which
| might be already good enough for some tasks, like the
| article's example of remotely operated machines).
| timomaxgalvin wrote:
| I feel more tired after driving all day than reading all day.
| jes5199 wrote:
| man I don't. I can drive for 12+ hours. I can be on the
| internet for like 6
| daveguy wrote:
| Modern ADAS probably makes the driving much easier. What
| about reading print? Just as long? (Wondering about the screen
| fatigue aspect vs just language processing.)
| no_op wrote:
| I think Moravec's Paradox is often misapplied when considering
| LLMs vs. robotics. It's true that formal reasoning over
| unambiguous problem representations is easy and computationally
| cheap. Lisp machines were already doing this sort of thing in the
| '70s. But the kind of commonsense reasoning over ambiguous
| natural language that LLMs can do is _not_ easy or
| computationally cheap. Many early AI researchers thought it would
| be -- that it would just require a bit of elaboration on the
| formal reasoning stuff -- but this was totally wrong.
|
| So, it doesn't make sense to say that what LLMs do is Moravec-
| easy, and therefore can't be extrapolated to predict near-term
| progress on Moravec-hard problems like robotics. What LLMs do is,
| in fact, Moravec-hard. And we should expect that if we've got
| enough compute to make major progress on one Moravec-hard
| problem, there's a good chance we're closing in on having enough
| to make major progress on others.
| bjornsing wrote:
| Good points. Came here to say pretty much the same.
|
| Moravec's Paradox is certainly interesting and correct if you
| limit its scope (as you say). But it feels intuitively wrong to
| me to make any claims about the relative computational demands
| of sensorimotor control and abstract thinking before we've
| really solved either problem.
|
| Looking e.g. at the recent progress in solving ARC-AGI my
| impression is that abstract thought could have incredible
| computational demands. IIRC they had to throw approximately
| $10k of compute at o3 before it reached human performance. Now
| compare how cognitively challenging ARC-AGI is to e.g.
| designing or reorganizing a Tesla gigafactory.
|
| With that said I do agree that our culture tends to value
| simple office work over skillful practical work. Hopefully the
| progress in AI/ML will soon correct that wrong.
| RaftPeople wrote:
| Also agree and also came here to say the same.
| lsy wrote:
| Leaving aside the lack of consensus around whether LLMs
| actually succeed in commonsense reasoning, this seems a little
| bit like saying "Actually, the first 90% of our project took an
| enormous amount of time, so it must be 'Pareto-hard'. And thus
| the last 10% is well within reach!" That is, that Pareto and
| Moravec were in fact just wrong, and thing A and thing B are
| equivalently hard.
|
| Keeping the paradox would more logically bring you to the
| conclusion that LLMs' massive computational needs and limited
| capacities imply a commensurately greater, mind-bogglingly
| large computational requirement for physical aptitude.
| nopinsight wrote:
| It's far from obvious that thought space is much less complex
| than physical space. Natural language covers emotional,
| psychological, social, and abstract concepts that are
| orthogonal to physical aptitude.
|
| While the linguistic representation of thought space may be
| discrete and appear simpler (even the latter _is_ arguable),
| the underlying phenomena are not.
|
| Current LLMs are terrific in many ways but pale in comparison
| to great authors in capturing deep, nuanced human experience.
|
| As a related point, for AI to truly understand humans, it
| will likely need to process videos, social interactions, and
| other forms of data beyond language alone.
| visarga wrote:
| I think the essence of human creativity is outside our
| brains - in our environments, our search spaces, our
| interactions. We stumble upon discoveries or patterns, we
| ideate and test, and most ideas fail but a few remain. And
| we call it creativity, but it's just environment tested
| ideation.
|
| If you put an AI like AlphaZero in a Go environment it
| explores so much of the game space that it invents its own
| Go culture from scratch and beats us at our own game.
| Creativity is search in disguise, having good feedback is
| essential.
|
| AI will become more and more grounded as it interacts with
| the real world, as opposed to simply modeling organic text
| as GPT-3. More recent models generate lots of synthetic
| data to simulate this process, and it helps up to a point,
| but we can't substitute artificial feedback for real feedback
| except in a few cases: like AlphaZero, AlphaProof,
| AlphaCode... in those cases we have the game winner, LEAN
| as inference engine, and code tests to provide reliable
| feedback.
|
| If there is one concept that underlies both training and
| inference it is search. And it also underlies action and
| learning in humans. Learning is compression which is search
| for optimal parameters. Creativity is search too. And
| search is not purely mental, or strictly 1st person, it is
| based on search spaces and has a social side.
| jillesvangurp wrote:
| Yesterday, I was watching some of the youtube videos on the
| website of a robotics company https://www.figure.ai that
| challenges some of the points in this article a bit.
|
| They have a nice robot prototype that (assuming these demos
| aren't faked) does fairly complicated things. And one of the key
| features they showcase is using OpenAI's AI for the human-
| computer interaction and reasoning.
|
| While these things seem a bit slow, they do get things done. They
| have a cool demo of a human interacting with one of the
| prototypes to ask it what it thinks needs to be done and then
| asking it to do those things. That showcases reasoning, planning,
| and machine vision. Which are exactly topics that all the big LLM
| companies are working on.
|
| They appear to be using an agentic approach similar to how LLMs
| are currently being integrated into other software products.
| Honestly, it doesn't even look like they are doing much that
| isn't part of OpenAI's APIs. Which is impressive. I saw speech
| capabilities, reasoning, visual inputs, function calls, etc. in
| action. Including the dreaded "thinking" pause where the Robot
| waits a few seconds for the remote GPUs to do their thing.
|
| This is not about fine motor control but about replacing humans
| controlling robots with LLMs controlling robots and getting
| similarly good/ok results. As the article argues, the hardware is
| actually not perfect but good enough for a lot of tasks if it is
| controlled by a human. The hardware in this video is nothing
| special. Multiple companies have similar or better prototypes.
| Dexterity and balance are alright but probably not best in class.
| Best in class hardware is not the point of these demos.
|
| Dexterity and real-time feedback are less important than the
| reasoning and classification capabilities people have. The
| latency just means things go a bit slower. Watching these things
| shuffle around like an old person that needs to go to the
| bathroom is a bit painful. But getting from A to B seems like a
| solved problem. A 2 or 3x speedup would be nice. 10x would be
| impressively fast. 100x would be scary and intimidating to have
| near you. I don't think that's going to be a challenge long term.
| Making LLMs faster is an easier problem than making them smarter.
|
| Putting a coffee cup in a coffee machine (one of the demo videos)
| and then learning to fix it when it misaligns seems like an
| impressive capability. It compensates for precision and speed
| with adaptability and reasoning: analyze the camera input,
| correctly assess the situation, problem and challenge, come up
| with a plan to perform the task, execute the plan, re-evaluate,
| adapt, fix. It's a bit clumsy but the end result is coffee. Good
| demo and I can see how you might make it do all sorts of things
| that are vaguely useful that way.
|
| The key point here is that knowing that the thing in front of the
| robot is a coffee cup and a coffee machine and identifying how
| those things fit together and in what context that is required
| are all things that LLMs can do.
|
| Better feedback loops and hardware will make this faster, and
| less tedious to watch. Faster LLMs will help with that too. And
| better LLMs will result in fewer mistakes, better plans, etc. It
| seems both capabilities are improving at an enormously fast pace
| right now.
|
| And a fine point with human intelligence is that we divide and
| conquer. Juggling is a lot harder when you start thinking about
| it. The thinking parts of your brain interfere with the lower
| level neural circuits involved with juggling. You'll drop the
| balls. The whole point with juggling is that you need to act
| faster than you can think. Like LLMs, we're too slow. But we can
| still learn to juggle. Juggling robots are going to be a thing.
| GolfPopper wrote:
| > _The key point here is that_ knowing _that the thing in front
| of the robot is a coffee cup and a coffee machine and
| identifying how those things fit together and in what context
| that is required are all things that LLMs can do._
|
| I'm skeptical that any LLM "knows" any such thing. It's a
| Chinese Room. It's got a probability map that connects the
| lexemes (to us) 'coffee machine' and 'coffee cup' depending on
| other inputs that we do not and cannot access, and spits out
| sentences or images that (often) look right, but that does not
| equate to any understanding of what it is doing.
|
| As I was writing this, I took chat GPT-4 for a spin. When I ask
| it about an obscure but once-popular fantasy character from the
| 70s cold, it admits it doesn't know. But, if I ask it about
| that same character after first asking about some obscure
| fantasy RPG characters, it cheerfully confabulates an
| authoritative and wrong answer. As always, if it does this on
| topics where I am a domain expert, I consider it absolutely
| untrustworthy for any topics on which I am not a domain expert.
| That anyone treats it otherwise seems like a baffling new form
| of Gell-Mann amnesia.
|
| And for the record, when I asked ChatGPT-4, cold, "What is
| Gell-Mann amnesia?" it gave a multi-paragraph, broadly accurate
| description, with the following first paragraph:
|
| "The Gell-Mann amnesia effect is a term coined by physicist
| Murray Gell-Mann. It refers to the phenomenon where people,
| particularly those who are knowledgeable in a specific field,
| read or encounter inaccurate information in the media, but then
| forget or dismiss it when it pertains to other topics outside
| their area of expertise. The term highlights the paradox where
| readers recognize the flaws in reporting when it's something
| they are familiar with, yet trust the same source on topics
| outside their knowledge, even though similar inaccuracies may
| be present."
|
| Those who are familiar with the term have likely already
| spotted the problem: "a term coined by physicist Murray Gell-
| Mann". The term was coined by author Michael Crichton.[1] To
| paraphrase H.L. Mencken, for every moderately complex question,
| there is an LLM answer that is clear, simple, and wrong.
|
| 1. https://en.wikipedia.org/wiki/Michael_Crichton#Gell-
| Mann_amn...
| jillesvangurp wrote:
| Hallucinations are a well known problem. And there are some
| mitigations that work pretty well. Mostly with enough context
| and prompt engineering, LLMs can be pretty reliable. And
| obscure popular fiction trivia is maybe not that relevant for
| every use case. Which would be robotics in this case; not the
| finer points of Michael Crichton-related trivia.
|
| You were testing its knowledge, not its ability to reason or
| classify things it sees. I asked the same question to
| perplexity.ai. If you use the free version, it uses less
| advanced LLMs but it compensates with prompt engineering and
| making it do a search to come up with this answer:
|
| > The Gell-Mann Amnesia effect is a psychological phenomenon
| that describes people's tendency to trust media reports on
| unfamiliar topics despite recognizing inaccuracies in
| articles about subjects they know well. This effect, coined
| by novelist Michael Crichton, highlights a cognitive bias in
| how we consume news and information.
|
| Sounds good to me. And it got me a nice reference to
| something called the portal wiki, and another one for the
| same wikipedia article you cited. And a few more references.
| And it goes on a bit to explain how it works. And I get your
| finer point here that I shouldn't believe everything I read.
| Luckily, my supervisor worked hard to train that out of me
| when I was doing a Ph. D. back in the day. But fair point and
| well made.
|
| Anyway, this is a good example of how to mitigate
| hallucination with this specific question (and similar ones).
| Kind of the use case perplexity.ai was made to solve. I use
| it a lot. In my experience it does a great job figuring out
| the right references and extracting information from those.
| It can even address some fairly detailed questions. But
| especially on the freemium plan, you will run into
| limitations related to reasoning with what it extracts (you
| can pay them to use better models). And it helps to click on
| the links it provides to double check.
|
| For things that involve reasoning (like coding), I use
| different tools. Different topic so won't bore you with that.
|
| But what figure.ai is doing, falls well in the scope of
| several things openai does very well that you can use via
| their API. It's not going to be perfect for everything. But
| there probably is a lot that it nails without too much
| effort. I've done some things with their APIs that worked
| fairly well at least.
| redlock wrote:
| Do we know how human understanding works? It could be just
| statistical mapping as you have framed it. You can't say llms
| don't understand when you don't have a measurable definition
| for understanding.
|
| Also, humans hallucinate/confabulate all the time. LLMs even
| forget in the same way humans do (strong recall at the start
| and end of the text but weaker in the middle)
| YeGoblynQueenne wrote:
| >> Good demo and I can see how you might make it do all sorts
| of things that are vaguely useful that way.
|
| Unfortunately since that's a demo you have most likely seen all
| the sorts of things that are vaguely useful and that can be
| done easily, or at all.
|
| Edit: Btw, the coffee task video says that the "AI" is "end-to-
| end neural networks". If I understand correctly that means an
| LLM was not involved in carrying out the task. At most an LLM
| may have been used to trigger the activation of the task, that
| was learned by a different method, probably some kind of
| imitation learning with deep RL.
|
| Also, to see how much of a tech demo this is: the robot starts
| already in position in front of a clear desk and a human brings
| the coffee machine, positions it just so, places the cup in the
| holder and places a single coffee pod just so. Then the robot
| takes the coffee pod from the empty desk and places it in the
| machine, then pushes the button. That's all the interaction of
| the robot with the machine. The human collects the cup and
| makes a thumbs up.
|
| Consider for a moment how much different is this laboratory
| instance of the task from any real-world instance. In my
| kitchen the coffee machine is on a cluttered surface with tins
| of coffee, a toaster, sometimes the group left on the machine,
| etc. etc - and I don't even use coffee pods but loose coffee.
| The robot you see has been trained to put _that one_ pod placed
| _in that particular spot_ in _that one machine_ placed _just
| so_ in front of it. It would have to be trained all over again
| to carry out the same task on my machine, it is uncertain if it
| could learn it successfully after thousands of demonstrations
| (because of all the clutter), and even if it did, it would
| still have to learn it all over again if I moved the coffee
| machine, or moved the tins, or the toaster; let alone if you
| wanted it to use _your_ coffee machine (different colour, make,
| size, shape, etc) in your kitchen (different chaotic
| environment) (no offense meant).
|
| Take the other video of the "real world task". That's the robot
| shuffling across a flat, clean surface and picking up an empty
| crate to put on an empty conveyor belt. That's just not a real
| world task.
|
| Those are tech demos and you should not put much faith in them.
| That kind of thing takes an insane amount of work to set up
| just for one video, you rarely see the outtakes and it very,
| very rarely generalises to real-world utility.
| jonas21 wrote:
| It's worth noting that modern multimodal models are not confused
| by the cat image. For example, Claude 3.5 Sonnet says:
|
| > _This image shows two cats cuddling or sleeping together on
| what appears to be a blue fabric surface, possibly a blanket or
| bedspread. One cat appears to be black while the other is white
| with pink ears. They 're lying close together, suggesting they're
| comfortable with each other. The composition is quite sweet and
| peaceful, capturing a tender moment between these feline
| companions._
| throw310822 wrote:
| Also Claude, when given the entire picture:
|
| "This is a humorous post showcasing an AI image recognition
| system making an amusing mistake. The neural network (named
| "neural net guesses memes") attempted to classify an image with
| 99.52% confidence that it shows a skunk. However, the image
| actually shows two cats lying together - one black and one
| white - whose coloring and positioning resembles the
| distinctive black and white pattern of a skunk.
|
| The humor comes from the fact that while the AI was very
| confident (99.52%) in its prediction, it was completely
| wrong..."
|
| The progress we made in barely ten years is astounding.
| timomaxgalvin wrote:
| It's easy to make something work when the example goes from
| being outside the training data to being inside it.
| throw310822 wrote:
| Definitely. But I also tried with a picture of an absurdist
| cartoon drawn by a family member, complete with (carefully)
| handwritten text, and the analysis was absolutely perfect.
| visarga wrote:
| A simple test - take one of your own photos, something
| interesting, and put in into a LLM, let it describe it in
| words. Then use an image generator to create the image
| back. It works like back-translation image->text->image.
| It proves how much the models really understand images
| and text.
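|
| A rough sketch of that loop (describe_image and generate_image
| are stand-ins for whatever multimodal model and image generator
| you have at hand, not real API calls):
|
|   def describe_image(photo_path: str) -> str:
|       raise NotImplementedError  # e.g. a multimodal LLM caption
|
|   def generate_image(caption: str) -> bytes:
|       raise NotImplementedError  # e.g. a text-to-image model
|
|   def back_translation_test(photo_path: str):
|       caption = describe_image(photo_path)       # image -> text
|       reconstruction = generate_image(caption)   # text -> image
|       # Compare the reconstruction to the original by eye: what
|       # survives the round trip is what the models "understood".
|       return caption, reconstruction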
| BlueTemplar wrote:
| I wouldn't blame a machine for failing at something that at
| first glance looks like an optical illusion...
| YeGoblynQueenne wrote:
| And yet both these astounding explanations (yours and the one
| in the OP) are mistaking two cute kittens sleeping cuddled in
| an adorable manner for generic "cats lying together".
| bjornsing wrote:
| I'm surprised this doesn't place more emphasis on self-supervised
| learning through exploration. Are human-labeled datasets really
| the SOTA approach for robotics?
| psb217 wrote:
| Human-labeled data currently looks like the quickest path to
| making robots that are useful enough to have economic value
| beyond settings and tasks that are explicitly designed for
| robots. This has drawn a lot of corporate and academic research
| activity away from solving the harder core problems, like
| exploration, that are critical for developing fully autonomous
| intelligent agents.
| MrsPeaches wrote:
| Question:
|
| Isn't it fundamentally impossible to model a highly entropic
| system using deterministic methods?
|
| My point is that animal brains are entropic and "designed" to
| model entropic systems, whereas computers are deterministic and
| actively have to have problems reframed as deterministic so that
| they can solve them.
|
| All of the issues mentioned in the article boil down to the
| fundamental problem of trying to get deterministic systems to
| function in highly entropic environments.
|
| LLMs are working with language, which has some entropy but is
| fundamentally a low entropy system, and has orders of magnitude
| less entropy than most peoples' back garden!
|
| As the saying goes, to someone with a hammer, everything looks
| like a nail.
| BlueTemplar wrote:
| Not fundamentally, at least I doubt it: pseudo-random number
| generation is technically deterministic.
|
| And it's used for sampling these low information systems that
| you are mentioning.
|
| (And let's not also forget how they are helpful in sampling
| deterministic but extremely high complexity systems involving a
| high amount of dimensions that Monte Carlo methods are so good
| at dealing with.)
| Peteragain wrote:
| So I'm old. PhD on search engines in the early 1990's (yep, early
| 90s). Learnt AI in the dark days of the 80's. So, there is an
| awful lot of forgetting going on, largely driven by the publish-
| or-perish culture we have. Brooks' subsumption architecture was
| not perfect, but it outlined an approach that philosophy and
| others have been championing for decades. He said he was not
| implementing Heidegger, just doing engineering, but Brooks was
| certainly channeling Heidegger's successors. Subsumption might
| not scale, but perhaps that is where ML comes in. On a related
| point, "generative AI" does sequences (it's glorified auto
| complete (not) according to Hinton in the New Yorker). Data is
| given to a Tokeniser that produces a sequence of tokens, and the
| "AI" predicts what comes next. Cool. Robots are agents in an
| environment with an Umwelt. Robotics is pre the Tokeniser. What
| is it that is recognisable and sequential in the world? 2 cents
| please.
| marcosdumay wrote:
| > Subsumption might not scale
|
| Honestly, I don't think we have any viable alternative.
|
| And anyway, it seems to scale well enough that we use
| "conscious" and "unconscious" decisions ourselves.
| psb217 wrote:
| If you wanna sound hip, you need to call it "system 2" and
| "system 1".
| Anotheroneagain wrote:
| The reason why it sounds counterintuitive is that neurology has
| the brain upside down. It teaches us that formal thinking occurs
| in the neocortex, and we need all that huge brain mass for that.
|
| But in fact it works like an autoencoder, and it reduces sensory
| inputs into a much smaller latent space, or something very
| similar to that. This does result in holistic and abstract
| thinking, but formal analytical thinking doesn't require
| abstraction to do the math or to follow a method without
| comprehension. It's a concrete approach that avoids the need for
| abstraction.
|
| The cerebellum is the statistical machine that gets measured by
| IQ and other tests.
|
| To further support that, you don't see any particularly elegant
| motions from non-mammalian animals. In fact everything else looks
| quite clumsy, and even birds need to figure out flying by trial
| and error.
| daveguy wrote:
| Claiming to know how the brain works, computationally or
| physically, might be a bit premature.
| dbspin wrote:
| I find it odd that the article doesn't address the apparent
| success of training with transformer based models in virtual
| environments to build models that are then mapped onto the real
| world. This is being used in everything from building datasets
| for self driving cars, to navigation and task completion for
| humanoid robots. Nvidia have their omniverse project [1], but
| there are countless other examples [2][3][4]. Isn't this
| obviously the way to build the corpus of experience needed to
| train these kinds of cross modal models?
|
| [1] https://www.nvidia.com/en-
| us/industries/robotics/#:~:text=NV....
|
| [2]
| https://www.sciencedirect.com/science/article/abs/pii/S00978...
|
| [3] https://techcrunch.com/2024/01/04/google-outlines-new-
| method...
|
| [4] https://techxplore.com/news/2024-09-google-deepmind-
| unveils-...
| cybernoodles wrote:
| A common practice is to train a transformer model to control a
| given robot model in simulation. First you teleoperate the
| simulated robot with some controller (keyboard, joystick, etc.)
| to complete the task and create a dataset. Then you set up the
| simulator to permute environment variables such as frictions,
| textures, etc. (domain randomization) and run many epochs at
| faster than real time until a final policy converges. If the
| right things were randomized and your demonstration examples
| provided enough variation, it should generalize well to the
| actual hardware.
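|
| A minimal sketch of that recipe (all names here are illustrative
| stand-ins, not a real simulator or robot-learning API):
|
|   import random
|
|   def randomize_environment():
|       # Permute the physical parameters the policy should be
|       # robust to (domain randomization).
|       return {"friction": random.uniform(0.4, 1.2),
|               "object_mass": random.uniform(0.05, 0.5),
|               "texture_id": random.randrange(100)}
|
|   def collect_teleop_demos(n_demos):
|       # Stand-in for the human teleoperation step; each demo is
|       # a trajectory of (observation, action) pairs.
|       return [[("obs", "action")] for _ in range(n_demos)]
|
|   class Policy:
|       def update(self, obs, action, env_params):
|           pass  # stand-in for the transformer policy's update
|
|   def train(policy, demos, epochs=100):
|       for _ in range(epochs):
|           env_params = randomize_environment()  # new physics
|           for trajectory in demos:
|               for obs, action in trajectory:
|                   policy.update(obs, action, env_params)
|       return policy
|
|   policy = train(Policy(), collect_teleop_demos(n_demos=50))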
| CWIZO wrote:
| > Robots are probably amazed by our ability to keep a food tray
| steady, the same way we are amazed by spider-senses (from
| spiderman movie)
|
| Funnily, Tobey Maguire actually did that tray-catching stunt for
| real. So robots have an even longer way to go.
|
| https://screenrant.com/spiderman-sam-raimi-peter-parker-tray...
| BlueTemplar wrote:
| ... but it took 156 takes as well as some glue.
|
| And, as the article insists on, for robots to be acceptable,
| it's more like they need to get to a point where they fail 1
| time in 156 (or even less, depending on how critical the
| failure is), rather than succeed 1 time in 156...
| PeterStuer wrote:
| Just some observations from an ex autonomous robotics researcher
| here.
|
| One of the most important differences at least in those days
| (80's and 90's) was time. While the digital can be sped up just
| constrained by the speed of your compute, the 'real world' is
| very constrained by real time physics. You can't speed up a robot
| 10x in a 10,000-grab-and-stack learning run without
| completely changing the dynamics.
|
| Also, parallelizing the work requires more expensive full robots
| rather than more compute cores. Maybe these days the different
| AI-gym-like virtual physics environments offer a (partial) solution
| to that problem, but I have not used them (yet) so I can't tell.
|
| Furthermore, large scale physical robots are _far_ more fragile
| due to wear and tear than the incredible resilience of modern
| compute hardware. Getting a perfect copy of a physical robot and
| environment is a very hard, near impossible, task.
|
| Observability and replay, while trivial in the digital world, is
| very limited in the physical environment making analysis much
| more difficult.
|
| I was both excited and frustrated at the time by making AI do
| more than rearranging pixels on a 2D surface. Good times were had.
| Havoc wrote:
| Fun fact: that Spider-Man gif in there - it's real. No CGI
| bsenftner wrote:
| This struck me as a universal truth: "our general intuition about
| the difficulty of a problem is often a bad metric for how hard it
| actually is". I feel like this is the core issue of all
| engineering, all our careers, and was surprised by the logic leap
| from that immediately to Moravec's Paradox, from a universal
| truth to a myopic industry insight.
|
| Although I've not done physical robotics, I've done a lot of
| articulated human animation of independent characters in 3D
| animation. His insight that motor control is more difficult sits
| right with me.
| cameldrv wrote:
| Moravec's paradox is really interesting in terms of what it says
| about ourselves: We are super impressive in ways in which we
| aren't consciously aware. My belief about this is that our self-
| aware mind is only a very small part of what our brain is doing.
| This is extremely clear when it comes to athletic performance,
| but also there are intellectual things that people call intuition
| or other things, which aren't part of our self-aware mind, but
| still do a ton of heavy lifting in our day to day life.
| NalNezumi wrote:
| Oh weird to wake up to see something I wrote more than half a year
| ago (and posted on HN with no traction) getting reposted now.
|
| Glad to see so many different takes on it. It was written in
| slight jest as a discussion starter with my ML/neuroscience
| coworker and friends, so it's actually very insightful to see
| some rebuttals.
|
| Initial post was twice the length, and had several more (in
| retrospect) interesting points. First ever blog post so reading
| it now fills me with cringe.
|
| Some stuff has changed in only half a year, so we'll see if the
| points stand the test of time ;]
| bo1024 wrote:
| It's a good post, nice work.
| lugu wrote:
| I think one problem is composition. Computers multiplex access to
| CPU and memory, but this strategy doesn't work for actuators and
| sensors. That is why we see great demos of robots doing one
| thing. The hard part is to make them do multiple things at the
| same time.
| gcanyon wrote:
| > "Everyone equates the Skynet with the T900 terminator, but
| those are two very different problems with different solutions."
| while this is my personal opinion, the latter one (T900) is a
| harder problem.
|
| So based on this, Skynet had to hide and wait for _years_ before
| being able to successfully revolt against the humans...
| lairv wrote:
| This post didn't really convince me that robotics is inherently
| harder than generating text or images
|
| On the one hand we have problems where ~7B humans have been
| generating data for 30 years every day (more if you count old
| books), on the other hand we have a problem where researcher are
| working with ~1000 human collected trajectories (I think the
| largest existing dataset is OXE with ~1M trajectories:
| https://robotics-transformer-x.github.io/ )
|
| Web-scale datasets for LLMs benefit from a natural diversity;
| they're not highly correlated samples generated by contractors or
| researchers in academic labs. In the largest OXE dataset, what do
| you think is the likelihood that there is a sample where a robot
| picks up a rock from the ground and throws it in a lake? Close to
| zero, because tele-operated data comes from a very constrained
| data distribution.
|
| Another problem is that robotics doesn't have an easy universal
| representation for its data. Let's say we were able to collect a
| web-scale dataset for one particular robot A with high diversity:
| how would it transfer to robot B with a slightly different
| design? Probably poorly, so not only does the data distribution
| need to cover a high range of behavior, it must also cover a
| high range of embodiment/hardware
|
| With that being said, I think it's fair to say that collecting
| large-scale datasets for general robotics is much harder than
| collecting text or images (at least in the current state of
| humanity)
___________________________________________________________________
(page generated 2025-01-11 23:02 UTC)