[HN Gopher] Common misconceptions about the complexity in roboti...
       ___________________________________________________________________
        
       Common misconceptions about the complexity in robotics vs. AI
       (2024)
        
       Author : wallflower
       Score  : 139 points
       Date   : 2025-01-07 15:19 UTC (4 days ago)
        
 (HTM) web link (harimus.github.io)
 (TXT) w3m dump (harimus.github.io)
        
       | jvanderbot wrote:
       | > Moravec's paradox is the observation by artificial intelligence
       | and robotics researchers that, contrary to traditional
       | assumptions, reasoning requires very little computation, but
       | sensorimotor and perception skills require enormous computational
       | resources. The principle was articulated by Hans Moravec, Rodney
       | Brooks, Marvin Minsky, and others in the 1980s.
       | 
       | I have a name for it now!
       | 
       | I've said over and over that there are only two really hard
       | problems in robotics: Perception and funding. A perfectly
       | perceived system and world can be trivially planned for and (at
       | least proprio-)controlled. Imagine having a perfect intuition
       | about other actors such that you know their paths (in self
       | driving cars), or your map is a perfect voxel + trajectory +
       | classification. How divine!
       | 
       | It's limited information and difficulties in reducing signal to
       | concise representation that always get ya. This is why the
       | perfect lab demos always fail - there's a corner case not in your
       | training data, or the sensor stuttered or became misaligned, or
       | etc etc.
        
         | jvanderbot wrote:
         | > Moravec hypothesized around his paradox, that the reason for
          | the paradox [that things we perceive as easy b/c we don't think
         | about them are actually hard] could be due to the sensor &
         | motor portion of the human brain having had billions of years
         | of experience and natural selection to fine-tune it, while
         | abstract thoughts have had maybe 100 thousand years or less
         | 
         | Another gem!
        
           | Legend2440 wrote:
           | Or it could be a parallel vs serial compute thing.
           | 
           | Perception tasks involve relatively simple operations across
           | very large amounts of data, which is very easy if you have a
           | lot of parallel processors.
           | 
           | Abstract thought is mostly a serial task, applying very
           | complex operations to a small amount of data. Many abstract
           | tasks like evaluating logical expressions cannot be done in
           | parallel - they are in the complexity class P-complete.
           | 
           | Your brain is mostly a parallel processor (80 billion neurons
           | operating asynchronously), so logical reasoning is hard and
           | perception is easy. Your CPU is mostly a serial processor, so
           | logical reasoning is easy and perception is hard.
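            | 
            | A toy sketch of the contrast (Python/NumPy; the numbers are
            | invented, only the shape of the two workloads matters):
            | 
            |     import numpy as np
            | 
            |     # Perception-like: one cheap op over a million pixels.
            |     # Every pixel is independent, so it parallelizes freely.
            |     image = np.random.rand(1000, 1000)
            |     edges = np.abs(np.diff(image, axis=0))  # one vectorized pass
            | 
            |     # Reasoning-like: a chain of dependent steps. Step i needs
            |     # step i-1, so extra parallel hardware doesn't shorten the
            |     # chain (cf. P-completeness of circuit evaluation).
            |     x = 1.0
            |     for i in range(1, 1000):
            |         x = np.tanh(x) + 1.0 / i  # must run strictly in order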
        
             | cratermoon wrote:
             | > Perception tasks involve relatively simple operations
             | across very large amounts of data, which is very easy if
             | you have a lot of parallel processors.
             | 
             | Yes, relatively simple. Wait, isn't that exactly what the
             | article explained was completely wrong-headed?
        
               | burnished wrote:
               | No. The article is talking about things we think of as
               | being easy because they are easy for a human to perform
               | but that are actually very difficult to
               | formalize/reproduce artificially.
               | 
               | The person you are responding to is instead comparing
               | differences in biological systems and mechanical systems.
        
             | visarga wrote:
             | > Or it could be a parallel vs serial compute thing.
             | 
              | The brain itself is both a parallel system and a serially
              | constrained system. It has distributed activity but it must
              | resolve in a serial chain of action. We can't walk left and
              | right at the same time. Any goal forces us to follow
              | specific steps in a specific order. This conflict between
             | parallel processing and serial outputs is where the magic
             | happens.
        
           | topherclay wrote:
           | > ...the sensor & motor portion of the human brain having had
           | billions of years of experience.
           | 
           | It doesn't really change the significance of the quote, but I
           | can't help but point out that we didn't even have nerve cells
            | more than 0.6 billion years ago.
        
         | lang4d wrote:
         | Maybe just semantics, but I think I would call that prediction.
         | Even if you have perfect perception (measuring the current
         | state of the world perfectly), it's nontrivial to predict the
         | future paths of other actors. The prediction problem requires
         | intuition about what the other actors are thinking, how their
         | plans influence each other, and how your plan influences them.
        
         | bobsomers wrote:
         | > I've said over and over that there are only two really hard
         | problems in robotics: Perception and funding. A perfectly
         | perceived system and world can be trivially planned for and (at
         | least proprio-)controlled.
         | 
         | Funding for sure. :)
         | 
          | But as for perception, the inverse is also true. If I have a
         | perfect planning/prediction system, I can throw the grungiest,
         | worst perception data into it and it will still plan
         | successfully despite tons of uncertainty.
         | 
         | And therein lies the real challenge of robotics: It's
         | fundamentally a systems engineering problem. You will never
         | have perfect perception or a perfect planner. So, can you make
         | a perception system that is _good enough_ that, when coupled
          | with your planning system which is _good enough_, you are able
          | to solve enough problems with enough 9s to make it successful?
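          | 
          | To make the 9s point concrete, a back-of-the-envelope sketch
          | (rates invented, and assuming independent failures, which real
          | systems rarely have):
          | 
          |     perception = 0.995          # per-attempt success rates
          |     planning   = 0.995
          |     control    = 0.999
          | 
          |     step = perception * planning * control  # ~0.989
          |     job = step ** 100     # a 100-step job: ~0.33
          |     print(step, job)      # the 9s erode fast when composed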
         | 
         | The most commercially successful robots I've seen have had some
         | of the smartest systems engineering behind them, such that
         | entire classes of failures were eliminated by being smarter
         | about what you _actually need to do to solve the problem_ and
          | aggressively avoid solving subproblems that aren't absolutely
         | necessary. Only then do you really have a hope of getting good
         | enough at that focused domain to ship something before the
         | money runs out. :)
        
           | portaouflop wrote:
           | > being smarter about what you actually need to do to solve
           | the problem and aggressively avoid solving subproblems that
           | aren't absolutely necessary
           | 
           | I feel like this is true for every engineering discipline or
           | maybe even every field that needs to operate in the real
           | world
        
             | vrighter wrote:
             | except software, of course. Nowadays it seems that software
             | is all about creating problems to create solutions for.
        
           | krisoft wrote:
            | > If I have a perfect planning/prediction system, I can
           | throw the grungiest, worst perception data into it and it
           | will still plan successfully despite tons of uncertainty.
           | 
           | Not really. Even the perfect planning system will appear
            | erratic in the presence of perception noise. It must,
            | because it can't create information out of nowhere.
           | 
            | I have seen robots erratically stop because they thought that
            | the traffic in the oncoming lane was encroaching on theirs.
           | You can't make the planning system ignore that because then
           | sometimes it will collide with people playing chicken with
           | you.
           | 
            | Likewise I have seen robots erratically stop because they
           | thought that a lamp post was slowly reversing out in front of
           | them. All due to perception noise (in this case both location
           | noise, and misclassification.)
           | 
           | And do note that these are just the false positives. If you
           | have a bad perception system you can also suffer from false
            | negatives. It's just that experiment biases hide those.
           | 
           | So your "perfect planning/prediction" will appear overly
           | cautious while at the same time will be sometimes reckless.
           | Because it doesn't have the information to not to. You can't
           | magic plan your way out of that. (Unless you pipe the raw
           | sensor data into the planner, in which case you created a
           | second perception system you are just not calling it
           | perception.)
        
             | YeGoblynQueenne wrote:
             | >> (Unless you pipe the raw sensor data into the planner,
             | in which case you created a second perception system you
             | are just not calling it perception.)
             | 
             | Like with model-free RL learning a model from pixels?
        
           | jvanderbot wrote:
           | A "perfect" planning system which can handle arbitrarily bad
           | perception is indistinguishable from a perception system.
           | 
           | I've not seen a system that claimed to be robust to sensor
           | noise that didn't do some filtering, estimation, or state
           | representation internally. Those are just sensor systems
           | inside the box.
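            | 
            | For what "filtering/estimation inside the box" typically
            | means, a minimal 1-D Kalman-style sketch (noise values
            | invented):
            | 
            |     import random
            | 
            |     q, r = 0.01, 0.5  # process / measurement noise (assumed)
            | 
            |     def kalman_step(z, x_est, p):
            |         p = p + q            # predict: uncertainty grows
            |         k = p / (p + r)      # gain: trust sensor vs. model
            |         x_est = x_est + k * (z - x_est)  # correct with z
            |         p = (1.0 - k) * p
            |         return x_est, p
            | 
            |     x_est, p = 0.0, 1.0
            |     for _ in range(100):
            |         z = 1.0 + random.gauss(0.0, r ** 0.5)  # noisy reading
            |         x_est, p = kalman_step(z, x_est, p)  # -> approaches 1.0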
        
         | exe34 wrote:
         | "the sensor stuttered or became misaligned, or etc etc."
         | 
         | if your eyes suddenly crossed, you'd probably fall over too!
        
         | seanhunter wrote:
         | Yeah the fun way Moravec's paradox was explained to me [1] is
         | that you can now easily get a computer to solve simultaneous
         | differential equations governing all the axes of motion of a
         | robot arm but getting it to pick one screw out of a box of
         | screws is an unsolved research problem.
         | 
          | [1] by a disillusioned computer vision PhD who left the field
          | in the 1990s.
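          | 
          | The "easy" half really is a few library calls these days; e.g.
          | a sketch of integrating the motion of a single-link arm with
          | SciPy (parameters invented):
          | 
          |     import numpy as np
          |     from scipy.integrate import solve_ivp
          | 
          |     # Single rigid link: theta'' = -(g/L) sin(theta) - b theta'
          |     g, L, b = 9.81, 0.5, 0.1  # gravity, length, damping (assumed)
          | 
          |     def dynamics(t, y):
          |         theta, omega = y
          |         return [omega, -(g / L) * np.sin(theta) - b * omega]
          | 
          |     sol = solve_ivp(dynamics, (0.0, 5.0), [np.pi / 4, 0.0])
          |     print(sol.y[0, -1])  # joint angle after five seconds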
        
           | wrp wrote:
           | Selective attention was one of the main factors in Hubert
           | Dreyfus' explanation of "what computers can't do." He had a
           | special term for it, which I can't remember off-hand.
        
         | visarga wrote:
         | > A perfectly perceived system and world can be trivially
         | planned for
         | 
          | I think it's not about perfect perception (there is no such
          | thing, not even in humans); it's about adaptability, recovery
          | from error, resilience, and mostly about learning from the
          | outside when the process fails to work. Each problem has its
          | own problem space to explore. I think of intelligence as search
          | efficiency across many problem spaces; there is no perfection
          | in it. Our problem spaces are far from exhaustively known.
        
       | catgary wrote:
       | Yeah, this was my general impression after a brief, disastrous
       | stretch in robotics after my PhD. Hell, I work in animation now,
       | which is a way easier problem since there are no physical
       | constraints, and we still can't solve a lot of the problems the
       | OP brings up.
       | 
       | Even stuff like using video misses the point, because so much of
       | our experience is via touch.
        
         | johnwalkr wrote:
         | I've worked in a robotics-adjacent field for 15 years and
         | robotics is truly hard. The number of people and companies I've
         | seen come and go that claim their software expertise will make
          | a useful, profitable robot is... a lot.
        
       | Legend2440 wrote:
       | Honestly I'm tired of people who are more focused on 'debunking
       | the hype' than figuring out how to make things work.
       | 
       | Yes, robotics is hard, and it's still hard despite big
       | breakthroughs in other parts of AI like computer vision and NLP.
       | But deep learning is still the most promising avenue for general-
       | purpose robots, and it's hard to imagine a way to handle the
       | open-ended complexity of the real world _other_ than learning.
       | 
       | Just let them cook.
        
         | mitthrowaway2 wrote:
         | > _If you want a more technical, serious (better) post with a
         | solution oriented point to make, I'll refer you to Eric Jang's
         | post [1]_
         | 
         | [1] https://evjang.com/2022/07/23/robotics-generative.html
        
         | FloorEgg wrote:
         | As someone on the sidelines of robotics who generally feels
         | everything getting disrupted and at the precipice of major
         | change, it's really helpful to have a clearer understanding of
         | the actual challenge and how close we are to solving it.
         | Anything that helps me make more accurate predictions will help
         | me make better decisions about what problems I should be trying
         | to solve and what skills I should be trying to develop.
        
       | cratermoon wrote:
       | It might be nice if the author qualified "most of the freely
       | available data on the internet" with "whether or not it was
       | copyrighted" or something to acknowledge the widespread theft of
       | the works of millions.
        
         | danielbln wrote:
         | Theft is the wrong term, it implies that the original is no
         | longer available. It's copyright infringement at best, and
         | possibly fair use depending on jurisdiction. It wasn't theft
         | when the RIAA went on a lawsuit spree against mp3 copying, and
         | it isn't theft now.
        
           | CaptainFever wrote:
           | Related: https://www.youtube.com/watch?v=IeTybKL1pM4
        
           | cratermoon wrote:
           | Ackchyually....
        
       | jes5199 wrote:
       | I would love to see some numbers. How many orders of magnitude
       | more complicated do we think embodiment is, compared to
       | conversation? How much data do we need compared to what we've
       | already collected?
        
         | FloorEgg wrote:
         | If nature computed both through evolution, then maybe it's
         | approximately the same ratio. So roughly the time it took to
         | evolve embodiment, and roughly the time it took to evolve from
         | grunts to advanced language.
         | 
          | If we start from when we think multicellular life first evolved
          | (~2b years ago), or maybe the Cambrian explosion (~500m years
          | ago), and run until modern humans (~300k years ago), we can
          | then compare that to the time between the first modern humans
          | and now.
         | 
         | It seems like maybe 3-4 orders of magnitude harder.
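          | 
          | Spelling out the arithmetic (taking the dates above at face
          | value):
          | 
          |     multicellular = 2e9  # years to evolve embodiment (upper)
          |     cambrian      = 5e8  # (lower bound)
          |     language      = 3e5  # first modern humans -> now
          | 
          |     print(multicellular / language)  # ~6700x, ~3.8 orders
          |     print(cambrian / language)       # ~1700x, ~3.2 orders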
         | 
          | My intuition after reading the articles is that there need to
          | be way more sensors all throughout the robot, probably with
         | lots of redundancies, and then lots of modern LLM sized models
         | all dedicated to specific joints and functions and capable of
         | cascading judgement between each other, similar to how our
         | nervous system works.
        
           | jes5199 wrote:
           | so like ten to twenty years, via moore's law?
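            | 
            | Back of the envelope (assuming one compute doubling every
            | 1.5-2 years):
            | 
            |     import math
            |     orders = 3.5                        # midpoint of 3-4 above
            |     doublings = orders * math.log2(10)  # ~11.6 doublings
            |     print(doublings * 1.5, doublings * 2)  # ~17 to ~23 years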
        
             | daveguy wrote:
             | Maybe. If Moore's law remotely holds up for ten to twenty
             | years. There's still the part about not having a clue how
             | to replicate physical systems efficiently vs logical
             | systems.
        
         | rstuart4133 wrote:
         | "Hardness" is a difficult quantity to define if you venture
         | beyond "humans have been trying to build systems to do this for
         | a while, and haven't succeeded".
         | 
          | Insects have succeeded in building precision systems that
          | combine vision, smell, touch and a few other senses. I doubt
          | finding a juicy spider and immobilising it is that much more
          | difficult than finding a door knob and turning it, or folding
          | a T-shirt. Yet insects accomplish it with, I suspect, far less
          | compute than modern LLMs. So it's not "hard" in the sense of
          | requiring huge compute resources, and certainly not a lot of
          | power.
         | 
         | So it's probably not that hard in the sense that it's well
         | within the capabilities of the hardware we have now. The issue
         | is more that we don't have a clue how to do it.
        
           | jes5199 wrote:
           | well the magic of transformer architecture is that if the
           | rules exist and are computationally tractable, the system
           | will find them in the data, and we don't have to have a clue.
           | so. how much data do we need?
        
           | BlueTemplar wrote:
           | Calling it "compute" might be part of the issue : insects
           | aren't (even partially) digital computers.
           | 
           | We might or might not be able to emulate what they process on
           | digital computers, but emulation implies a performance loss.
           | 
           | And this doesn't even cover inputs/outputs (some of which
           | might be already good enough for some tasks, like the
           | article's example of remotely operated machines).
        
         | timomaxgalvin wrote:
         | I feel more tired after driving all day than reading all day.
        
           | jes5199 wrote:
           | man I don't. I can drive for 12+ hours. I can be on the
           | internet for like 6
        
             | daveguy wrote:
             | Modern ADAS probably makes the driving much easier. What
              | about reading print? Just as long? (Wondering about the
              | screen fatigue aspect vs just language processing.)
        
       | no_op wrote:
       | I think Moravec's Paradox is often misapplied when considering
       | LLMs vs. robotics. It's true that formal reasoning over
       | unambiguous problem representations is easy and computationally
       | cheap. Lisp machines were already doing this sort of thing in the
       | '70s. But the kind of commonsense reasoning over ambiguous
       | natural language that LLMs can do is _not_ easy or
       | computationally cheap. Many early AI researchers thought it would
       | be -- that it would just require a bit of elaboration on the
       | formal reasoning stuff -- but this was totally wrong.
       | 
       | So, it doesn't make sense to say that what LLMs do is Moravec-
       | easy, and therefore can't be extrapolated to predict near-term
       | progress on Moravec-hard problems like robotics. What LLMs do is,
       | in fact, Moravec-hard. And we should expect that if we've got
       | enough compute to make major progress on one Moravec-hard
       | problem, there's a good chance we're closing in on having enough
       | to make major progress on others.
        
         | bjornsing wrote:
         | Good points. Came here to say pretty much the same.
         | 
         | Moravec's Paradox is certainly interesting and correct if you
         | limit its scope (as you say). But it feels intuitively wrong to
         | me to make any claims about the relative computational demands
          | of sensorimotor control and abstract thinking before we've
         | really solved either problem.
         | 
         | Looking e.g. at the recent progress in solving ARC-AGI my
         | impression is that abstract thought could have incredible
         | computational demands. IIRC they had to throw approximately
         | $10k of compute at o3 before it reached human performance. Now
         | compare how cognitively challenging ARC-AGI is to e.g.
         | designing or reorganizing a Tesla gigafactory.
         | 
         | With that said I do agree that our culture tends to value
         | simple office work over skillful practical work. Hopefully the
         | progress in AI/ML will soon correct that wrong.
        
           | RaftPeople wrote:
           | Also agree and also came here to say the same.
        
         | lsy wrote:
         | Leaving aside the lack of consensus around whether LLMs
         | actually succeed in commonsense reasoning, this seems a little
         | bit like saying "Actually, the first 90% of our project took an
         | enormous amount of time, so it must be 'Pareto-hard'. And thus
         | the last 10% is well within reach!" That is, that Pareto and
         | Moravec were in fact just wrong, and thing A and thing B are
         | equivalently hard.
         | 
         | Keeping the paradox would more logically bring you to the
         | conclusion that LLMs' massive computational needs and limited
         | capacities imply a commensurately greater, mind-bogglingly
         | large computational requirement for physical aptitude.
        
           | nopinsight wrote:
           | It's far from obvious that thought space is much less complex
           | than physical space. Natural language covers emotional,
           | psychological, social, and abstract concepts that are
           | orthogonal to physical aptitude.
           | 
           | While the linguistic representation of thought space may be
           | discrete and appear simpler (even the latter _is_ arguable),
           | the underlying phenomena are not.
           | 
           | Current LLMs are terrific in many ways but pale in comparison
           | to great authors in capturing deep, nuanced human experience.
           | 
           | As a related point, for AI to truly understand humans, it
           | will likely need to process videos, social interactions, and
           | other forms of data beyond language alone.
        
             | visarga wrote:
             | I think the essence of human creativity is outside our
             | brains - in our environments, our search spaces, our
             | interactions. We stumble upon discoveries or patterns, we
             | ideate and test, and most ideas fail but a few remain. And
             | we call it creativity, but it's just environment tested
             | ideation.
             | 
             | If you put an AI like AlphaZero in a Go environment it
             | explores so much of the game space that it invents its own
             | Go culture from scratch and beats us at our own game.
             | Creativity is search in disguise, having good feedback is
             | essential.
             | 
             | AI will become more and more grounded as it interacts with
             | the real world, as opposed to simply modeling organic text
              | as GPT-3 did. More recent models generate lots of synthetic
              | data to simulate this process, and it helps up to a point,
              | but we can't substitute artificial feedback for the real
              | thing except in a few cases: like AlphaZero, AlphaProof,
             | AlphaCode... in those cases we have the game winner, LEAN
             | as inference engine, and code tests to provide reliable
             | feedback.
             | 
             | If there is one concept that underlies both training and
             | inference it is search. And it also underlies action and
             | learning in humans. Learning is compression which is search
             | for optimal parameters. Creativity is search too. And
             | search is not purely mental, or strictly 1st person, it is
             | based on search spaces and has a social side.
        
       | jillesvangurp wrote:
       | Yesterday, I was watching some of the youtube videos on the
       | website of a robotics company https://www.figure.ai that
       | challenges some of the points in this article a bit.
       | 
       | They have a nice robot prototype that (assuming these demos
       | aren't faked) does fairly complicated things. And one of the key
        | features they showcase is using OpenAI's AI for the human-
        | computer interaction and reasoning.
       | 
       | While these things seem a bit slow, they do get things done. They
        | have a cool demo of a human interacting with one of the
        | prototypes to ask it what it thinks needs to be done and then
        | asking it to do those things. That showcases reasoning, planning,
       | and machine vision. Which are exactly topics that all the big LLM
       | companies are working on.
       | 
       | They appear to be using an agentic approach similar to how LLMs
       | are currently being integrated into other software products.
       | Honestly, it doesn't even look like they are doing much that
       | isn't part of OpenAI's APIs. Which is impressive. I saw speech
       | capabilities, reasoning, visual inputs, function calls, etc. in
       | action. Including the dreaded "thinking" pause where the Robot
       | waits a few seconds for the remote GPUs to do their thing.
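        | 
        | If they are doing what it looks like they are doing, the control
        | loop is probably something like this sketch (every name here is
        | a hypothetical stand-in, not figure.ai's or OpenAI's actual API):
        | 
        |     def llm_decide(image, goal, history):
        |         # Stand-in for the remote multimodal LLM call (the
        |         # "thinking" pause). A real system sends the camera
        |         # frame, goal and history, and gets back a function call.
        |         return {"name": "grasp", "args": {"obj": "coffee cup"}}
        | 
        |     def execute(action):
        |         # Stand-in for the robot's own low-level controllers;
        |         # the LLM only picks the next high-level skill.
        |         print("executing", action["name"], action["args"])
        |         return "ok"
        | 
        |     def run(goal, max_steps=10):
        |         history = []
        |         for _ in range(max_steps):
        |             image = None  # camera.capture() on real hardware
        |             action = llm_decide(image, goal, history)
        |             history.append((action, execute(action)))
        | 
        |     run("make me a coffee")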
       | 
       | This is not about fine motor control but about replacing humans
       | controlling robots with LLMs controlling robots and getting
       | similarly good/ok results. As the article argues, the hardware is
       | actually not perfect but good enough for a lot of tasks if it is
       | controlled by a human. The hardware in this video is nothing
       | special. Multiple companies have similar or better prototypes.
       | Dexterity and balance are alright but probably not best in class.
       | Best in class hardware is not the point of these demos.
       | 
        | Dexterity and real-time feedback are less important than the
       | reasoning and classification capabilities people have. The
       | latency just means things go a bit slower. Watching these things
        | shuffle around like an old person who needs to go to the
        | bathroom is a bit painful. But getting from A to B seems like a
       | solved problem. A 2 or 3x speedup would be nice. 10x would be
       | impressively fast. 100x would be scary and intimidating to have
       | near you. I don't think that's going to be a challenge long term.
       | Making LLMs faster is an easier problem than making them smarter.
       | 
       | Putting a coffee cup in a coffee machine (one of the demo videos)
       | and then learning to fix it when it misaligns seems like an
       | impressive capability. It compensates for precision and speed
       | with adaptability and reasoning: analyze the camera input,
        | correctly analyze the situation, problem and challenge; come up
        | with a plan to perform the task; execute the plan; re-evaluate,
       | adapt, fix. It's a bit clumsy but the end result is coffee. Good
       | demo and I can see how you might make it do all sorts of things
       | that are vaguely useful that way.
       | 
       | The key point here is that knowing that the thing in front of the
       | robot is a coffee cup and a coffee machine and identifying how
       | those things fit together and in what context that is required
       | are all things that LLMs can do.
       | 
       | Better feedback loops and hardware will make this faster, and
       | less tedious to watch. Faster LLMs will help with that too. And
        | better LLMs will result in fewer mistakes, better plans, etc. It
       | seems both capabilities are improving at an enormously fast pace
       | right now.
       | 
       | And a fine point with human intelligence is that we divide and
       | conquer. Juggling is a lot harder when you start thinking about
        | it. The thinking parts of your brain interfere with the lower-
        | level neural circuits involved with juggling. You'll drop the
       | balls. The whole point with juggling is that you need to act
       | faster than you can think. Like LLMs, we're too slow. But we can
       | still learn to juggle. Juggling robots are going to be a thing.
        
         | GolfPopper wrote:
         | > _The key point here is that_ knowing _that the thing in front
         | of the robot is a coffee cup and a coffee machine and
         | identifying how those things fit together and in what context
         | that is required are all things that LLMs can do._
         | 
         | I'm skeptical that any LLM "knows" any such thing. It's a
         | Chinese Room. It's got a probability map that connects the
          | lexemes (to us) 'coffee machine' and 'coffee cup' depending on
         | other inputs that we do not and cannot access, and spits out
         | sentences or images that (often) look right, but that does not
         | equate to any understanding of what it is doing.
         | 
          | As I was writing this, I took ChatGPT-4 for a spin. When I ask
         | it about an obscure but once-popular fantasy character from the
         | 70s cold, it admits it doesn't know. But, if I ask it about
         | that same character after first asking about some obscure
         | fantasy RPG characters, it cheerfully confabulates an
         | authoritative and wrong answer. As always, if it does this on
         | topics where I am a domain expert, I consider it absolutely
         | untrustworthy for any topics on which I am not a domain expert.
         | That anyone treats it otherwise seems like a baffling new form
         | of Gell-Mann amnesia.
         | 
         | And for the record, when I asked ChatGPT-4, cold, "What is
         | Gell-Mann amnesia?" it gave a multi-paragraph, broadly accurate
         | description, with the following first paragraph:
         | 
         | "The Gell-Mann amnesia effect is a term coined by physicist
         | Murray Gell-Mann. It refers to the phenomenon where people,
         | particularly those who are knowledgeable in a specific field,
         | read or encounter inaccurate information in the media, but then
         | forget or dismiss it when it pertains to other topics outside
         | their area of expertise. The term highlights the paradox where
         | readers recognize the flaws in reporting when it's something
         | they are familiar with, yet trust the same source on topics
         | outside their knowledge, even though similar inaccuracies may
         | be present."
         | 
         | Those who are familiar with the term have likely already
         | spotted the problem: "a term coined by physicist Murray Gell-
         | Mann". The term was coined by author Michael Crichton.[1] To
         | paraphrase H.L. Mencken, for every moderately complex question,
         | there is an LLM answer that is clear, simple, and wrong.
         | 
         | 1. https://en.wikipedia.org/wiki/Michael_Crichton#Gell-
         | Mann_amn...
        
           | jillesvangurp wrote:
           | Hallucinations are a well known problem. And there are some
           | mitigations that work pretty well. Mostly with enough context
           | and prompt engineering, LLMs can be pretty reliable. And
           | obscure popular fiction trivia is maybe not that relevant for
           | every use case. Which would be robotics in this case; not the
            | finer points of Michael Crichton-related trivia.
           | 
           | You were testing its knowledge, not its ability to reason or
           | classify things it sees. I asked the same question to
           | perplexity.ai. If you use the free version, it uses less
           | advanced LLMs but it compensates with prompt engineering and
           | making it do a search to come up with this answer:
           | 
           | > The Gell-Mann Amnesia effect is a psychological phenomenon
           | that describes people's tendency to trust media reports on
           | unfamiliar topics despite recognizing inaccuracies in
           | articles about subjects they know well. This effect, coined
           | by novelist Michael Crichton, highlights a cognitive bias in
           | how we consume news and information.
           | 
           | Sounds good to me. And it got me a nice reference to
           | something called the portal wiki, and another one for the
           | same wikipedia article you cited. And a few more references.
           | And it goes on a bit to explain how it works. And I get your
           | finer point here that I shouldn't believe everything I read.
           | Luckily, my supervisor worked hard to train that out of me
            | when I was doing a Ph.D. back in the day. But fair point and
           | well made.
           | 
           | Anyway, this is a good example of how to mitigate
           | hallucination with this specific question (and similar ones).
           | Kind of the use case perplexity.ai was made to solve. I use
           | it a lot. In my experience it does a great job figuring out
           | the right references and extracting information from those.
           | It can even address some fairly detailed questions. But
           | especially on the freemium plan, you will run into
           | limitations related to reasoning with what it extracts (you
           | can pay them to use better models). And it helps to click on
           | the links it provides to double check.
           | 
           | For things that involve reasoning (like coding), I use
           | different tools. Different topic so won't bore you with that.
           | 
            | But what figure.ai is doing falls well within the scope of
            | several things OpenAI does very well that you can use via
           | their API. It's not going to be perfect for everything. But
           | there probably is a lot that it nails without too much
           | effort. I've done some things with their APIs that worked
           | fairly well at least.
        
           | redlock wrote:
           | Do we know how human understanding works? It could be just
            | statistical mapping as you have framed it. You can't say LLMs
            | don't understand when you don't have a measurable definition
            | for understanding.
            | 
            | Also, humans hallucinate/confabulate all the time. LLMs even
            | forget in the same way humans do (strong recall at the start
            | and end of the text but weaker in the middle).
        
         | YeGoblynQueenne wrote:
         | >> Good demo and I can see how you might make it do all sorts
         | of things that are vaguely useful that way.
         | 
         | Unfortunately since that's a demo you have most likely seen all
         | the sorts of things that are vaguely useful and that can be
         | done easily, or at all.
         | 
         | Edit: Btw, the coffee task video says that the "AI" is "end-to-
         | end neural networks". If I understand correctly that means an
         | LLM was not involved in carrying out the task. At most an LLM
         | may have been used to trigger the activation of the task, that
         | was learned by a different method, probably some kind of
         | imitation learning with deep RL.
         | 
         | Also, to see how much of a tech demo this is: the robot starts
         | already in position in front of a clear desk and a human brings
         | the coffee machine, positions it just so, places the cup in the
         | holder and places a single coffee pod just so. Then the robot
         | takes the coffee pod from the empty desk and places it in the
         | machine, then pushes the button. That's all the interaction of
         | the robot with the machine. The human collects the cup and
         | makes a thumbs up.
         | 
         | Consider for a moment how much different is this laboratory
         | instance of the task from any real-world instance. In my
         | kitchen the coffee machine is on a cluttered surface with tins
         | of coffee, a toaster, sometimes the group left on the machine,
         | etc. etc - and I don't even use coffee pods but loose coffee.
         | The robot you see has been trained to put _that one_ pod placed
         | _in that particular spot_ in _that one machine_ placed _just
         | so_ in front of it. It would have to be trained all over again
         | to carry out the same task on my machine, it is uncertain if it
         | could learn it successfully after thousands of demonstrations
         | (because of all the clutter), and even if it did, it would
         | still have to learn it all over again if I moved the coffee
         | machine, or moved the tins, or the toaster; let alone if you
         | wanted it to use _your_ coffee machine (different colour, make,
         | size, shape, etc) in your kitchen (different chaotic
         | environment) (no offense meant).
         | 
         | Take the other video of the "real world task". That's the robot
         | shuffling across a flat, clean surface and picking up an empty
         | crate to put in an empty conveyor belt. That's just not a real
         | world task.
         | 
         | Those are tech demos and you should not put much faith in them.
         | That kind of thing takes an insane amount of work to set up
         | just for one video, you rarely see the outtakes and it very,
         | very rarely generalises to real-world utility.
        
       | jonas21 wrote:
       | It's worth noting that modern multimodal models are not confused
       | by the cat image. For example, Claude 3.5 Sonnet says:
       | 
       | > _This image shows two cats cuddling or sleeping together on
       | what appears to be a blue fabric surface, possibly a blanket or
       | bedspread. One cat appears to be black while the other is white
       | with pink ears. They 're lying close together, suggesting they're
       | comfortable with each other. The composition is quite sweet and
       | peaceful, capturing a tender moment between these feline
       | companions._
        
         | throw310822 wrote:
         | Also Claude, when given the entire picture:
         | 
         | "This is a humorous post showcasing an AI image recognition
         | system making an amusing mistake. The neural network (named
         | "neural net guesses memes") attempted to classify an image with
         | 99.52% confidence that it shows a skunk. However, the image
         | actually shows two cats lying together - one black and one
         | white - whose coloring and positioning resembles the
         | distinctive black and white pattern of a skunk.
         | 
         | The humor comes from the fact that while the AI was very
         | confident (99.52%) in its prediction, it was completely
         | wrong..."
         | 
         | The progress we made in barely ten years is astounding.
        
           | timomaxgalvin wrote:
            | It's easy to make something work when the example goes from
            | being outside the training data to inside it.
        
             | throw310822 wrote:
             | Definitely. But I also tried with a picture of an absurdist
             | cartoon drawn by a family member, complete with (carefully)
             | handwritten text, and the analysis was absolutely perfect.
        
               | visarga wrote:
               | A simple test - take one of your own photos, something
                | interesting, and put it into an LLM, and let it describe
                | it in words. Then use an image generator to create the
                | image back. It works like back-translation
                | image->text->image. It proves how much the models really
                | understand images and text.
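                | 
                | A sketch of that test (both models here are
                | placeholders, not a specific product):
                | 
                |     def caption_model(image):
                |         # placeholder multimodal captioner
                |         return "two kittens asleep on a blue blanket"
                | 
                |     def image_model(text):
                |         # placeholder text-to-image generator
                |         return f"<image from: {text}>"
                | 
                |     photo = "<your photo>"
                |     caption = caption_model(photo)    # image -> text
                |     recreated = image_model(caption)  # text -> image
                |     # compare `recreated` with `photo` by eye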
        
             | BlueTemplar wrote:
              | I wouldn't blame a machine for failing at something that at
              | first glance looks like an optical illusion...
        
           | YeGoblynQueenne wrote:
           | And yet both these astounding explanations (yours and the one
           | in the OP) are mistaking two cute kittens sleeping cuddled in
           | an adorable manner for generic "cats lying together".
        
       | bjornsing wrote:
       | I'm surprised this doesn't place more emphasis on self-supervised
        | learning through exploration. Are human-labeled datasets really
       | the SOTA approach for robotics?
        
         | psb217 wrote:
         | Human-labeled data currently looks like the quickest path to
         | making robots that are useful enough to have economic value
         | beyond settings and tasks that are explicitly designed for
         | robots. This has drawn a lot of corporate and academic research
         | activity away from solving the harder core problems, like
         | exploration, that are critical for developing fully autonomous
         | intelligent agents.
        
       | MrsPeaches wrote:
       | Question:
       | 
       | Isn't it fundamentally impossible to model a highly entropic
       | system using deterministic methods?
       | 
       | My point is that animal brains are entropic and "designed" to
        | model entropic systems, whereas computers are deterministic and
       | actively have to have problems reframed as deterministic so that
       | they can solve them.
       | 
       | All of the issues mentioned in the article boil down to the
       | fundamental problem of trying to get deterministic systems to
       | function in highly entropic environments.
       | 
       | LLMs are working with language, which has some entropy but is
       | fundamentally a low entropy system, and has orders of magnitude
       | less entropy than most peoples' back garden!
       | 
       | As the saying goes, to someone with a hammer, everything looks
       | like a nail.
        
         | BlueTemplar wrote:
          | Not fundamentally, at least I doubt it: pseudo-random number
          | generation is technically deterministic.
          | 
          | And it's used for sampling these low-information systems that
          | you are mentioning.
          | 
          | (And let's also not forget how helpful they are in sampling
          | deterministic but extremely high-complexity systems involving
          | a high number of dimensions, which Monte Carlo methods are so
          | good at dealing with.)
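          | 
          | E.g. the classic Monte Carlo estimate of pi: deterministic
          | given the seed, yet it samples the system just fine:
          | 
          |     import random
          | 
          |     random.seed(42)  # same seed, same "randomness"
          |     n, inside = 1_000_000, 0
          |     for _ in range(n):
          |         x, y = random.random(), random.random()
          |         inside += (x * x + y * y) <= 1.0
          |     print(4 * inside / n)  # ~3.14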
        
       | Peteragain wrote:
       | So I'm old. PhD on search engines in the early 1990's (yep, early
       | 90s). Learnt AI in the dark days of the 80's. So, there is an
        | awful lot of forgetting going on, largely driven by the publish-
       | or-perish culture we have. Brooks' subsumption architecture was
       | not perfect, but it outlined an approach that philosophy and
       | others have been championing for decades. He said he was not
       | implementing Heidegger, just doing engineering, but Brooks was
       | certainly channeling Heidegger's successors. Subsumption might
       | not scale, but perhaps that is where ML comes in. On a related
       | point, "generative AI" does sequences (it's glorified auto
       | complete (not) according to Hinton in the New Yorker). Data is
       | given to a Tokeniser that produces a sequence of tokens, and the
       | "AI" predicts what comes next. Cool. Robots are agents in an
        | environment with an Umwelt. Robotics is pre-Tokeniser. What
        | is it that is recognisable and sequential in the world? 2 cents
       | please.
        
         | marcosdumay wrote:
         | > Subsumption might not scale
         | 
         | Honestly, I don't think we have any viable alternative.
         | 
         | And anyway, it seems to scale well enough that we use
         | "conscious" and "unconscious" decisions ourselves.
        
           | psb217 wrote:
           | If you wanna sound hip, you need to call it "system 2" and
           | "system 1".
        
       | Anotheroneagain wrote:
       | The reason why it sounds counterintuitive is that neurology has
       | the brain upside down. It teaches us that formal thinking occurs
       | in the neocortex, and we need all that huge brain mass for that.
       | 
       | But in fact it works like an autoencoder, and it reduces sensory
       | inputs into a much smaller latent space, or something very
       | similar to that. This does result in holistic and abstract
       | thinking, but formal analytical thinking doesn't require
       | abstraction to do the math or to follow a method without
       | comprehension. It's a concrete approach that avoids the need for
       | abstraction.
       | 
       | The cerebellum is the statistical machine that gets measured by
       | IQ and other tests.
       | 
       | To further support that, you don't see any particularly elegant
        | motions from non-mammal animals. In fact everything else looks
       | quite clumsy, and even birds need to figure out flying by trial
       | and error.
        
         | daveguy wrote:
         | Claiming to know how the brain works, computationally or
         | physically, might be a bit premature.
        
       | dbspin wrote:
       | I find it odd that the article doesn't address the apparent
        | success of training with transformer-based models in virtual
       | environments to build models that are then mapped onto the real
       | world. This is being used in everything from building datasets
       | for self driving cars, to navigation and task completion for
       | humanoid robots. Nvidia have their omniverse project [1], but
       | there are countless other examples [2][3][4]. Isn't this
       | obviously the way to build the corpus of experience needed to
       | train these kinds of cross modal models?
       | 
       | [1] https://www.nvidia.com/en-
       | us/industries/robotics/#:~:text=NV....
       | 
       | [2]
       | https://www.sciencedirect.com/science/article/abs/pii/S00978...
       | 
       | [3] https://techcrunch.com/2024/01/04/google-outlines-new-
       | method...
       | 
       | [4] https://techxplore.com/news/2024-09-google-deepmind-
       | unveils-...
        
         | cybernoodles wrote:
          | A common practice is to train a transformer model to control a
          | given robot model in simulation. First, teleoperate the
          | simulated model with some controller (keyboard, joystick, etc.)
          | to complete the task and create a dataset. Then set up the
          | simulator to permute environment variables such as friction,
          | textures, etc. (domain randomization) and run many epochs at
          | faster than real time until a final policy converges. If the
          | right things were randomized and your demonstration examples
          | provided enough variation of information, it should generalize
          | well to the actual hardware.
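          | 
          | In code the recipe looks roughly like this (the sim API and
          | ranges are invented stand-ins):
          | 
          |     import random
          | 
          |     def randomized_env():
          |         # Domain randomization: permute what we can't measure.
          |         return {"friction": random.uniform(0.4, 1.2),
          |                 "mass_scale": random.uniform(0.8, 1.2),
          |                 "texture_id": random.randrange(1000)}
          | 
          |     def rollout(policy, env, demos):
          |         # Stand-in for a faster-than-real-time sim episode.
          |         return {"env": env, "reward": random.random()}
          | 
          |     class Policy:
          |         def update(self, episode):
          |             pass  # stand-in for a gradient step
          | 
          |     def train(policy, demos, epochs=10_000):
          |         for _ in range(epochs):
          |             episode = rollout(policy, randomized_env(), demos)
          |             policy.update(episode)
          | 
          |     train(Policy(), demos=[])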
        
       | CWIZO wrote:
       | > Robots are probably amazed by our ability to keep a food tray
       | steady, the same way we are amazed by spider-senses (from
       | spiderman movie)
       | 
          | Funnily, Tobey Maguire actually did that tray catching stunt for
       | real. So robots have an even further way to go.
       | 
       | https://screenrant.com/spiderman-sam-raimi-peter-parker-tray...
        
         | BlueTemplar wrote:
         | ... but it took 156 takes as well as some glue.
         | 
         | And, as the article insists on, for robots to be acceptable,
         | it's more like they need to get to a point where they fail 1
         | time in 156 (or even less, depending on how critical the
         | failure is), rather than succeed 1 time in 156...
        
       | PeterStuer wrote:
       | Just some observations from an ex autonomous robotics researcher
       | here.
       | 
       | One of the most important differences at least in those days
       | (80's and 90's) was time. While the digital can be sped up just
       | constrained by the speed of your compute, the 'real world' is
       | very constrained by real time physics. You can't speed up a robot
        | 10x in a 10,000-trial grabbing and stacking learning run without
       | completely changing the dynamics.
       | 
        | Also, parallelizing the work requires more expensive full robots
        | rather than more compute cores. Maybe these days the different
        | AI-gym-like virtual physics environments offer a (partial) solution
       | to that problem, but I have not used them (yet) so I can't tell.
       | 
       | Furthermore, large scale physical robots are _far_ more fragile
       | due to wear and tear than the incredible resilience of modern
       | compute hardware. Getting a perfect copy of a physical robot and
       | environment is a very hard, near impossible, task.
       | 
       | Observability and replay, while trivial in the digital world, is
       | very limited in the physical environment making analysis much
       | more difficult.
       | 
        | I was both excited and frustrated at the time by making AI do
        | more than rearranging pixels on a 2D surface. Good times were had.
        
       | Havoc wrote:
       | Fun fact: that Spider-Man gif in there - it's real. No CGI
        
       | bsenftner wrote:
       | This struck me as a universal truth: "our general intuition about
       | the difficulty of a problem is often a bad metric for how hard it
       | actually is". I feel like this is the core issue of all
       | engineering, all our careers, and was surprised by the logic leap
       | from that immediately to Moravec's Paradox, from a universal
       | truth to a myopic industry insight.
       | 
       | Although I've not done physical robotics, I've done a lot of
       | articulated human animation of independent characters in 3D
        | animation. His insight that motor control is more difficult sits
        | right with me.
        
       | cameldrv wrote:
       | Moravec's paradox is really interesting in terms of what it says
       | about ourselves: We are super impressive in ways in which we
       | aren't consciously aware. My belief about this is that our self-
       | aware mind is only a very small part of what our brain is doing.
       | This is extremely clear when it comes to athletic performance,
       | but also there are intellectual things that people call intuition
       | or other things, which aren't part of our self-aware mind, but
       | still do a ton of heavy lifting in our day to day life.
        
       | NalNezumi wrote:
        | Oh, weird to wake up to see something I wrote more than half a
        | year ago (and posted on HN with no traction) getting reposted now.
       | 
       | Glad to see so many different takes on it. It was written in
       | slight jest as a discussion starter with my ML/neuroscience
       | coworker and friends, so it's actually very insightful to see
       | some rebuttals.
       | 
        | The initial post was twice the length, and had several more (in
        | retrospect) interesting points. It was my first ever blog post,
        | so reading it now fills me with cringe.
       | 
        | Some stuff has changed in only half a year, so we'll see if the
        | points stand the test of time ;]
        
         | bo1024 wrote:
         | It's a good post, nice work.
        
       | lugu wrote:
        | I think one problem is composition. Computers multiplex access to
        | CPU and memory, but this strategy doesn't work for actuators and
       | sensors. That is why we see great demos of robots doing one
       | thing. The hard part is to make them do multiple things at the
       | same time.
        
       | gcanyon wrote:
       | > "Everyone equates the Skynet with the T900 terminator, but
       | those are two very different problems with different solutions."
       | while this is my personal opinion, the latter one (T900) is a
       | harder problem.
       | 
       | So based on this, Skynet had to hide and wait for _years_ before
       | being able to successfully revolt against the humans...
        
       | lairv wrote:
       | This post didn't really convince me that robotics is inherently
       | harder than generating text or images
       | 
       | On the one hand we have problems where ~7B humans have been
       | generating data for 30 years every day (more if you count old
        | books), on the other hand we have a problem where researchers are
        | working with ~1000 human-collected trajectories (I think the
       | largest existing dataset is OXE with ~1M trajectories:
       | https://robotics-transformer-x.github.io/ )
       | 
        | Web-scale datasets for LLMs benefit from a natural diversity,
       | they're not highly correlated samples generated by contractors or
       | researchers in academic labs. In the largest OXE dataset, what do
       | you think is the likelihood that there is a sample where a robot
       | picks up a rock from the ground and throws it in a lake? Close to
       | zero, because tele-operated data comes from a very constrained
       | data distribution.
       | 
       | Another problem is that robotics doesn't have an easy universal
       | representation for its data. Let's say we were able to collect
        | a web-scale dataset for one particular robot A with high diversity,
       | how would it transfer to robot B with a slightly different
       | design? Probably poorly, so not only does the data distribution
        | need to cover a high range of behavior, it must also cover a
       | high range of embodiment/hardware
       | 
       | With that being said, I think it's fair to say that collecting
        | a large-scale dataset for general robotics is much harder than
       | collecting text or images (at least in the current state of
       | humanity)
        
       ___________________________________________________________________
       (page generated 2025-01-11 23:02 UTC)