[HN Gopher] Figure 01 robot demos its OpenAI integration
___________________________________________________________________
Figure 01 robot demos its OpenAI integration
Author : mvkel
Score : 158 points
Date : 2024-03-13 14:50 UTC (8 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| mvkel wrote:
| The ability to translate from text to servo movement is unreal,
| and it looks like gpt4 vision + whisper are heavily used. They're
| also using the term "reasoning", which is... new.
|
| Can you call this an AI wrapper company? Kinda! The medium is a
| little different than an app, of course.
|
| Lots of amazing applications of AI even if frontier AI
| development froze today.
| jaffee wrote:
| > text to servo movement
|
| yeah this was super impressive. If this is at the point where
| you can put an arbitrary object in front of it and ask it to
| move it somewhere, that's going to be huge for industrial
| automation type stuff I'd imagine.
|
| I do wonder how much of that demo was pre-baked/trained though.
| Could they repeat the same thing with a banana? What if the
| table was more cluttered? What if there were two people in the
| frame?
| mvkel wrote:
| Great question.
|
| Boston Dynamics has been demoing the pre-baked dance routine
| for 2+ decades at this point. Really hoping we can evolve
| past it.
| tintor wrote:
| Not true. Boston Dynamics has been demoing long and
| complicated single take videos involving walking, running,
| jumping and manipulation, in less than ideal conditions.
| archermarks wrote:
| Knowing as many people in the robotics space as I do, I
| suspect the demo may not be completely "pre-baked", but it is
| almost certainly highly selected. Often they'll try the demo
| many, many times until they get a clean run-through without
| mistakes. The circumstances are also likely pretty idealized:
| they pick objects and settings that they know it performs well
| in.
| mvkel wrote:
| Interesting! This would sort of explain the low energy of
| the human demonstrator in the video
|
| "Take 488... action!"
| superzamp wrote:
| Interesting stutter at 0:53
| DalasNoin wrote:
| Similar at 1:47 I... I think
|
| It sounds so human, a person would also stutter at an
| introspective question like this. I wonder if their text to
| speech was trained on human data and produces these artifacts
| of human speech, or if it is intentional.
| mordymoop wrote:
| I use ChatGPT voice a lot, and it is prone to this exact type
| of stutter. I don't think it's intentional. I think there are
| certain phonetic/tonal linkages that are naturally "weird"
| (uncommon in the training corpus) and that AIs struggle with.
| Why this struggle manifests as a very human-like stutter is a
| fascinating question.
| austinkhale wrote:
| I'm not sure when OpenAI added them, but you can hear similar
| things when using the ChatGPT voice mode on iOS. Sometimes it
| feels almost like a latency stutter and other times it feels
| intentionally human.
| andoando wrote:
| I would have added umms and hmms artificially just to make
| the latency less apparent, so I'd say there's a good chance
| that's what they did lol
| bigyikes wrote:
| Now that you mention it, I, uh, also add umms when my
| speech pathways experience high latency.
| ilaksh wrote:
| I believe it's Eleven Labs API with the Stability setting
| turned down a little bit. It is definitely trained on human
| speech and when you use a somewhat lower setting than
| default, it will insert those types of natural imperfections
| or pauses and is very realistic.
| modeless wrote:
| OpenAI's TTS does this. You can hear it in regular ChatGPT's
| voice mode (which this demo is based on, it uses the same
| animations on the robot's face). It will also sometimes
| randomly hallucinate syllables or whole nonsense words,
| although that is rarer.
| EForEndeavour wrote:
| Is there a setting for this in the ChatGPT app? I have never
| once noticed it produce an "uh" or repeated syllable like
| "I... I think I did pretty well."
| modeless wrote:
| Really? Have you used it much? I haven't used it a ton but
| it definitely says "uh" and has various other artifacts.
| Maybe they have improved it recently but it was quite
| obvious when I first got access. Or maybe some of the
| voices are more prone to it than others.
|
| The naturalness of the speech is extremely good, though.
| ben_w wrote:
| Huh, TIL.
|
| I noticed the stutter too; interesting to learn that this is
| just what TTS does now, and not a sign of a human behind a
| sound filter.
| DalasNoin wrote:
| This humanoid form plus the voice really gives off a different
| feeling than the pure chat version. I think it will feel even
| more profound if they can add eyes and eye contact. Imagine
| demoing this to a random person.
| eigenvalue wrote:
| This shows the real utility of Groq's low-latency inference.
| The delay in responding here makes it much less impressive (it's
| still really impressive obviously!)
| modeless wrote:
| That delay will be eliminated very soon. IMO low latency
| natural voice conversations are going to be bigger than
| ChatGPT. It's going to blow people's minds when they can
| converse with these AIs casually just like with their real life
| friends. It won't be anything like Siri or Alexa anymore.
|
| Here's a demo from a startup in this space. Still very early.
| https://deck.sindarin.tech/
| tomcam wrote:
| I'll be thrilled when Siri can spell my wife's name correctly
| after 13 years of continuous usage and explicitly training it
| in her name. Admittedly her name is wildly complicated and
| totally unknown to the software folks at Apple: Ada
|
| Also, my radio-trained voice is so generic that a caller every
| week-ish assumes _I_ am a bot, so I'm pretty sure the problem
| isn't my enunciation or accent.
| margorczynski wrote:
| LLMs can't really spell, as the smallest "block" of information
| they operate on is the token, typically a whole word or a chunk
| of one rather than individual letters. Same issue with e.g.
| arithmetic.
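|
| A quick way to see the chunking, using OpenAI's tiktoken library
| (the encoding name here is just an example, not anything Figure
| or OpenAI's robot stack is known to use):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     ids = enc.encode("margorczynski")
|     # print the multi-character chunks the model actually "sees";
|     # individual letters are usually not among them
|     print([enc.decode([i]) for i in ids])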
| ta8645 wrote:
| But they should be able to select the correct token for
| homophones, which amounts to the same thing.
| jxy wrote:
| Can we stop this kind of misinformation? Training a model to
| map a token to individual letters is no harder than training a
| model to be fluent in English. Arithmetic with a small number
| of digits is achievable as well. You can just try a small 7B
| model yourself. If you don't know where to start, try mistral
| instruct v0.2, and this is how it goes:
|
| > [INST] Spell out the following word letter by letter:
| margorczynski [/INST] m - a - r - g - o - r - c - z - y -
| n - s - k - i
|
| > So, the word "margorczynski" spelled out letter by
| letter is: m-a-r-g-o-r-c-z-y-n-s-k-i.
|
| The text between `[INST]` and `[/INST]` is the input. The
| text after `[/INST]` is the output.
| margorczynski wrote:
| Is Karpathy lying when he says that word tokenization causes
| the kinds of problems seen in many LLMs?
|
| https://twitter.com/karpathy/status/1657949234535211009
|
| I'm not arguing that you can't use single chars, just that
| many of the issues the parent discussed are caused by this.
| imtringued wrote:
| The easy solution is to create an additional dataset that is
| token-aware: take 1% of the dataset, pick random tokens, and
| split them into smaller tokens while expecting the same answer
| at the character level. This should force the model to learn
| multiple token representations of the same character strings.
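|
| A minimal sketch of that idea (the tokenizer choice and the 1%
| rate are placeholders, not anything from a real training setup):
|
|     import random
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|
|     def char_split_augment(token_ids, p=0.01):
|         """Re-encode ~p of multi-char tokens one character at a
|         time, so both representations map to the same target."""
|         out = []
|         for tid in token_ids:
|             text = enc.decode([tid])
|             if len(text) > 1 and random.random() < p:
|                 for ch in text:
|                     out.extend(enc.encode(ch))
|             else:
|                 out.append(tid)
|         return out
|
|     print(char_split_augment(enc.encode("spell margorczynski"), p=1.0))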
| tintor wrote:
| Letter-by-letter tokenization increases inference and
| training costs and latency (as you need more tokens)
| ben_w wrote:
| > Admittedly her name is wildly complicated and totally
| unknown to the software folks at Apple: Ada
|
| Aye.
|
| I was surprised this morning when it decided I was talking
| about a "Mark of chain". 1/3rd of the time it hears "bedroom
| 100%" as "bedroom off".
|
| When cooking dinner today, I asked for a "ten minute timer";
| it responded "for how long?" then confirmed my "ten minute
| minute timer".
|
| Still better than Alexa, which kept telling us it couldn't
| find <<kitchen>> on Spotify even though we didn't even have
| Spotify.
|
| And _way_ better than the voice control on Mac OS Classic;
| back in the late 90s/early 00s, it interpreted 75% of my
| attempts to use it as "tell me a joke" (it wasn't even a
| good joke), and ignored 20%.
| educaysean wrote:
| I must say the demo did nothing to improve my opinion of the
| current state of voice-based AI conversations.
| batwood011 wrote:
| Hey there -- I'm Brian, the founder of Sindarin, the company
| behind this pitch deck.
|
| This demo is pretty bad compared to what we currently have in
| development.
|
| We've been in code freeze in prod for over two months to get
| our substantially improved engine finished.
|
| It'll be out in a few weeks, and it'll blow this version away
| in every way that matters.
|
| Thanks for checking us out!
| zola wrote:
| Make sure to post it on hacker news :)
| penjelly wrote:
| demoed this. The ability to interrupt the language model is
| very cool. However, I noticed it often failed to move on to the
| next slide. It could never get to the final slide without an
| explicit mention to go there, and when I got to the last slide
| and asked to go back to the first slide, it would say "ok let's
| go to the last slide" every time. These are probably more
| control issues than language model issues, but I thought I'd
| point them out, just in case.
| leetharris wrote:
| Absolutely. People on X keep making the mistake of assuming
| cloud / network latency is the problem here.
|
| The vast majority of America is within 10ms of a data center.
| That's nothing.
|
| The current challenge for most interaction is ASR -> prompt
| processing latency. This will be improved with multimodal
| models on specialized hardware like Groq.
| tomp wrote:
| IME it takes about 0.2s to get the first chunk from Mistral
| (i.e. the Mistral API, using the Mixtral model (`mistral-small`),
| not Mixtral on Groq) (and note that Mistral sends larger chunks,
| unlike ChatGPT, which sends individual tokens)
|
| and another 0.6s or so to get the first voice chunks from PlayHT
|
| measuring STT latency is harder; I'd need to implement a local
| VAD model first to measure it properly, but I think it's on the
| order of 0.5s
|
| So this has nothing to do with Groq, really. ChatGPT is just
| slow (too slow for realtime voice communication).
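|
| For reference, a rough way to measure time-to-first-chunk
| against an OpenAI-compatible streaming endpoint (the URL and
| model name below are assumptions, not my exact setup):
|
|     import os
|     import time
|     import requests
|
|     API_URL = "https://api.mistral.ai/v1/chat/completions"
|
|     def time_to_first_chunk(prompt):
|         headers = {
|             "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"
|         }
|         body = {
|             "model": "mistral-small",
|             "messages": [{"role": "user", "content": prompt}],
|             "stream": True,  # server streams incremental chunks
|         }
|         start = time.monotonic()
|         with requests.post(API_URL, json=body, headers=headers,
|                            stream=True) as r:
|             for line in r.iter_lines():
|                 if line:  # first non-empty SSE line = first chunk
|                     return time.monotonic() - start
|         return float("nan")
|
|     print(f"{time_to_first_chunk('Say hello'):.2f}s to first chunk")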
| hobofan wrote:
| Unless the only thing you want to do with the robot is talk,
| you need to do a lot more reasoning and execution planning
| first (= multiple LLM round trips; tool calling) before you
| even know whether talking is the correct action to take. So
| the naive time-to-first-chunk estimate will be way off.
| andoando wrote:
| just add a hmmm before every response
| fragmede wrote:
| Which we humans do all the time. Okay, like, so, hear me
| out, alright? See, what's really going on, yeah, is...
| gfodor wrote:
| The bottleneck here is the multimodal vision processing, at
| least if my experience building this kind of thing is any
| indication. Afaik Groq has not demonstrated its usual speeds
| for this. (Obviously they'll be better than OpenAI, but it
| still may be slow enough to leave people disappointed.)
| daveguy wrote:
| The speech to servo movement is impressive as others pointed out.
| What strikes me as amazing is the speed with which it is
| performing tasks that require dexterity. This is the first object
| manipulation robot demo I have seen that didn't require speeding
| up the video for it to look "natural".
| huytersd wrote:
| How, though? It's probably just predefined actions that are
| triggered by the LLM output. That said, it would be impressive
| if the LLM determined the right function to call in real time
| and was able to deal with the ambiguity in the placement of the
| garbage, and bonus points if it could do that in a scenario
| that wasn't hardcoded to standing exactly behind that table in
| that spot.
| og_kalu wrote:
| The robot movements are separate end-to-end neural networks.
| They're triggered by the LLM but aren't hardcoded.
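|
| Roughly the kind of glue layer you'd expect, purely speculative
| (the skill names, JSON shape, and stub policies here are made up
| for illustration; Figure hasn't published this):
|
|     import json
|
|     def run_handover_policy(target):
|         print(f"[policy net] handing the {target} to the person")
|
|     def run_pick_trash_policy(target):
|         print(f"[policy net] picking up the {target}")
|
|     # each entry points at a separately trained low-level policy
|     SKILLS = {
|         "hand_object_to_person": run_handover_policy,
|         "pick_up_trash": run_pick_trash_policy,
|     }
|
|     def dispatch(llm_tool_call):
|         """Route an LLM 'function call' to a learned skill."""
|         call = json.loads(llm_tool_call)
|         SKILLS[call["name"]](call["arguments"]["target"])
|
|     dispatch('{"name": "hand_object_to_person", '
|              '"arguments": {"target": "apple"}}')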
| margorczynski wrote:
| Why, then, does it return to the "default" position with the
| hands in that strange stance after performing an action or a
| whole plan? Looks kinda like there is some "hard coded" flow
| that simply uses the LLM(s) to perform actions.
| og_kalu wrote:
| That's a closed loop operation. It's pretty common to
| train these robots on closed loop demonstrations.
| huytersd wrote:
| Do you know this for a fact or are you speculating?
| StarCyan wrote:
| One of the Figure engineers described the system at a
| high level:
| https://twitter.com/coreylynch/status/1767927506332721154
| m3kw9 wrote:
| Is it just me, or are people easily impressed by these robot
| demos that move nothing like humans and do simplistic tasks
| like passing a large object? The freaking apple is within his
| arm's reach and he asked it to pass the apple. I almost laughed
| out loud.
| bigyikes wrote:
| Easily impressed? These are difficult tasks! Even being able
| to grip the apple without crushing it is impressive.
|
| Not sure anything here is state of the art, but that doesn't
| make it easy.
| darth_avocado wrote:
| I am 100% convinced that the demo is partly fabricated, or at
| least very far from the robot's capabilities. The inference
| part was probably true, but the dexterity would put them
| waaaaay ahead of what robots are capable of, if not programmed
| finely for a specific task. Industrial robots obviously have
| that kind of dexterity and precision because they're
| specifically programmed for a task. General purpose robots,
| however, are nowhere near this level of accuracy or fluidity
| in their movements.
| whatever1 wrote:
| Probably some sort of imitation learning
| fragmede wrote:
| It is. Have you seen ok robot?
|
| https://ok-robot.github.io/
| binoct wrote:
| OK-Robot is super impressive, but there is a huge
| difference in the manipulation ability OK-Robot shows with
| a simple 1-DOF gripper and parallel-axis arm to pick up
| objects and place them largely on flat surfaces, and the
| human-analog arms and fingers of Figure 01 picking up a
| plate in both hands and placing it in the right slot in a
| drying rack, or dropping the apple into a person's moving
| hand.
|
| It would be absolutely amazing if they really are at that
| level of manipulation in general, and it would put them
| vastly beyond what anyone has been able to do to date.
| However, robotics demos have a long history of being a mix
| of sleight of hand (partial/full tele-op), heavy cherry
| picking, and tuning to an extremely specific example.
|
| Because such a leap is being implied by this video, it's
| reasonable to want significantly more evidence before
| believing they can do this type of interaction and
| manipulation in a general way. But even if it is leaning
| heavily on, say, imitation learning for this exact scenario,
| there are tons of potential applications for this level of
| capability.
| StarCyan wrote:
| Yeah, the motion looks way more fluid than similar systems I've
| seen before.
| btbuildem wrote:
| The slightly dismissive nature of the toss seemed like a hack
| or maybe a deliberate way to speed up the [pick up, carry,
| drop] pathway.
| gotrythis wrote:
| Someone, please ask OpenAI to stop artificially dumbing down
| ChatGPT by adding "um" to the audio output. I get that it is
| supposed to make it more human-like or something, but every time
| I hear it do that, I cringe and feel sad for humanity.
| niek_pas wrote:
| Funny, I have the exact opposite response. It does, indeed,
| make it seem more human-like to me.
| gotrythis wrote:
| I agree. That's the sad part. I want super-intelligence to
| sound intelligent and not artificially brought down to our
| level. :-)
| ilaksh wrote:
| If they are using Eleven Labs, this is just the Stability
| setting. Turning it down will make it more realistic and
| closer to the training data. That is what causes the pauses
| and imperfections.
|
| You can sign up and use their Voice Lab for free or maybe a
| few bucks and experiment with the slider for Stability and
| the other setting.
|
| In my opinion, turning Stability down just a little bit to
| demo extremely realistic speech is a no-brainer. They could
| have turned it up and made it ultra-smooth, but that makes
| no sense. Why make your robot demo less realistic
| deliberately?
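|
| If anyone wants to hear the effect, a minimal sketch of a call
| to the Eleven Labs REST TTS endpoint with Stability turned down
| (the voice ID is a placeholder, and whether Figure actually uses
| this is my speculation):
|
|     import os
|     import requests
|
|     VOICE_ID = "your-voice-id"  # placeholder
|     url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
|
|     resp = requests.post(
|         url,
|         headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
|         json={
|             "text": "I... I think I did pretty well.",
|             "voice_settings": {
|                 "stability": 0.3,  # lower = more expressive, more
|                                    # human-like imperfections
|                 "similarity_boost": 0.75,
|             },
|         },
|     )
|     with open("speech.mp3", "wb") as f:
|         f.write(resp.content)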
| ChrisArchitect wrote:
| Related release about the deal:
|
| _Figure Raises $675M at $2.6B Valuation;Signs Collaboration
| Agreement with OpenAI_
|
| https://news.ycombinator.com/item?id=39553560
| thebiglebrewski wrote:
| Omg. This is really impressive stuff!
|
| I'm sure it's a bit cherry-picked and they chose things it is
| good at. But it is already showing some useful stuff.
| d--b wrote:
| Now that's a robot that will be fun jailbreaking. The safety
| hazards are robocop-level.
| iknownthing wrote:
| Interesting how it put the trash into the basket when that wasn't
| explicitly asked for.
| bilsbie wrote:
| I found this discussion interesting. Talking about this as a new
| form of labor.
|
| https://twitter.com/CernBasher/status/1767939757105991791
| ordinaryradical wrote:
| Reads like an ad written by GPT, honestly.
|
| But this guy is a professional advice giver, so to be expected?
|
| Wouldn't surprise me if he outsourced his tweeting.
| lelag wrote:
| It's a really cool demo and I'm impressed by the dexterity of
| the robot; however, I'm a bit underwhelmed by what's shown here,
| in the sense that the speech and reasoning capabilities are
| obvious to anyone who's been paying attention and has experience
| with GPT4. The function calling was great, but it had a very
| simple "world" to interact with.
|
| It's really interesting to see it integrated with a robot that
| can interact with the world, though. I think that what's really
| holding back the current crop of Gen AI is inference cost and
| speed. When we figure out how to get thousands of tokens per
| second for cheap, I think we will be able to bruteforce many
| hard problems and actually start seeing amazing applications
| like this one (but in production rather than as a cool demo).
| golol wrote:
| Yea, of course this is just an LLM interacting through a very
| crude interface with some control algorithms, but I think it is
| amazing that we sort of have an approach for both ends of the
| complexity spectrum now: LLMs for high-level, vague, common
| sense reasoning, and traditional robot control, planning, and
| machine learning methods for the physical execution of simple
| movements. We just have to gradually connect these two systems.
| mjfisher wrote:
| I take your point about familiarity with GPT4 making this less
| immediately impressive - but just as an end to end demo it's
| absolutely mind-blowing how far we've come.
|
| Can you imagine seeing this ten years ago? Moving so far in
| such a short time frame would have been unbelievable.
|
| This gave me the "I'm actually living in the future" vibes that
| I always imagined I'd get from flying cars.
| marstall wrote:
| after that faked google gemini AI video from a while back, I've
| got a healthy dose of skepticism about these next-gen demos.
| obviously they've done a lot though, kudos.
| unraveller wrote:
| alternate vid source https://www.youtube.com/watch?v=Sq1QZB5baNw
| margorczynski wrote:
| Question is how much this is cherry-picked as we all remember the
| "demo" of Bard. It would be nice to see it thrown into some
| random environment and then asked to do stuff, otherwise this has
| little value aside from marketing.
| modeless wrote:
| Selecting one of a set of pre-trained actions by voice is cool
| but not exactly ground-breaking. Using GPT-4V to describe a scene
| is also pretty simple. The most impressive things here to me are
| the speed of picking up the trash and the fluid passing of
| objects between hands.
|
| It's unclear how general these movement policies are though. The
| way that guy is standing perfectly still makes me think that it
| would fail if everything wasn't set up just so. I'd like to see
| demos with more variation.
|
| I don't want to be too negative here though. I think it's a great
| demo and I can't wait to see more.
| fragmede wrote:
| The Ok-robot demo shows that the technology for it to be fairly
| general is there, though I have no idea if Figure 01 is using
| their technology or not. Simply being able to command a robot
| instead of moving a turtle with gcode is nothing short of
| astounding to those who aren't deeply involved and tracking the
| SOTA progress in this area.
|
| https://ok-robot.github.io/
| Animats wrote:
| > Selecting one of a set of pre-trained actions by voice is
| cool but not exactly ground-breaking.
|
| Yes. Compare "Put That There" (1979).[1]
|
| > The way that guy is standing perfectly still makes me think
| that it would fail if everything wasn't set up just so. I'd
| like to see demos with more variation.
|
| Yes. Unstructured manipulation is hard. Structured robotic
| manipulation is pretty standard. Picking isolated objects is a
| solved problem. Here's a robot recycling sorting system, "Max-
| AI".[2] That's been in use for years. San Francisco recycling
| uses those robots. So do many other cities.
|
| (That's from "Bulk Handling Systems", a company which does
| exactly what their name says. Recycling and trash come in bulk,
| and their machines handle it. Shakers, magnets, screens, and
| vision-based air sorters do 95% of the sorting. The robots only
| handle the hard cases. This is the no-bullshit end of AI.)
|
| [1] https://www.youtube.com/watch?v=RyBEUyEtxQo
|
| [2] https://max-ai.com/
| KoolKat23 wrote:
| I suspect it's less pre-trained than you think; the tweet is
| prefaced with a message that all actions are driven by neural
| networks, indicating that it's probably adjusting for the
| objects, the environment, etc.
| tintor wrote:
| It is safe to assume it works only for that specific demo,
| and nothing else.
| modeless wrote:
| I think it's true that it can adjust, somewhat. I expect that
| it could handle slight variations such as moving any of the
| items on the table around by a few inches, or adding or
| removing a couple of plates from the rack. However, I do not
| expect that it could handle larger variations like replacing
| the apple with a pineapple, picking trash out of the dish
| rack or the cup instead of the plate, replacing the cup with
| a coffee mug, or sliding the dish rack over to the other side
| of the robot.
|
| I'd love to be wrong but I expect that if they had that much
| flexibility in their controllers they would have demonstrated
| it.
| KoolKat23 wrote:
| I'm sure there is definitely some limitation in this demo,
| but who knows; Google RT-2 exists, after all. Enough shape
| and object training will save the day, and there are still
| objects that I, as a human, would be new to handling.
| modeless wrote:
| Oh for sure these limitations will fall. It's only a
| matter of time (and a lot of work behind the scenes).
|
| Covariant's RFM-1 (announced Monday) is an interesting
| approach to generality.
| https://www.youtube.com/watch?v=1Go6HEC-bYU
| zvmaz wrote:
| Yes, perhaps the demonstration can be "demystified," but I can't
| help but be astounded by the robot. A few years ago this was
| unimaginable and only seen in science fiction movies.
|
| Truly flabbergasted.
| oliwary wrote:
| This is cool. But why would the plate go in the drying rack? It
| is obviously dirty since there was trash and an apple on it. It
| should have been washed first.
| educaysean wrote:
| I hadn't even noticed this leap in logic. Good catch.
| basil-rash wrote:
| I'm certainly not eating at Figure 1's house!
| tintor wrote:
| Because the robot likely doesn't have any memory. It sees an
| empty plate after the trash has been removed.
| digitalsalvatn wrote:
| The singularity is nigh! We must work towards it so humans can be
| free from the drudgery of work and can work towards whatever
| their heart desires! Join us on our quest for Digital Salvation.
| userabchn wrote:
| One of my colleagues predicted, when ChatGPT was first released,
| that AI would reduce the value of knowledge work relative to
| manual labour, but I disagreed as I think the main thing holding
| back robots from replacing many manual labour jobs at the moment
| is the difficulty communicating with them. I argued that ChatGPT
| indicated that we were not too far away from being able to tell a
| robot to pick the apples from a particular row and put them in
| the green barn as the usual red barn was being painted today.
| This video suggests that I was right, and in fact I suspect that
| such manual labour is a more realistic candidate for AI (with
| robots) to substantially replace first than most knowledge work,
| where I think AI will remain an assistant for some time.
| iamflimflam1 wrote:
| I think we are going to find that Moravec's paradox is wrong.
| Interesting times...
| imtringued wrote:
| Assuming you already have a robot, it is going to be easier
| to teach it to clean your toilet than to make it write a
| scientific paper that is worth publishing.
| sciencesama wrote:
| Thought he was going to give it knives and ask it to cut the
| apple!!
| ProfessorZoom wrote:
| me when my chatgpt wrapper robot can't respond cause the wifi is
| down
| nemothekid wrote:
| The most impressive part of this demo, to me, is the robot
| "seeing" and picking up objects with human-like appendages. I
| must have missed something, but I was under the impression that
| this was very hard. As I understand it, inverse kinematics is
| pretty hard - did they solve it with NNs?
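|
| For what it's worth, the classic non-learned version of IK is
| closed-form for simple arms; here is a toy 2-link planar
| example, which is certainly not what Figure actually runs:
|
|     import math
|
|     def two_link_ik(x, y, l1=0.3, l2=0.25):
|         """Joint angles (shoulder, elbow) to reach point (x, y)."""
|         d2 = x * x + y * y
|         cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
|         if abs(cos_elbow) > 1:
|             raise ValueError("target out of reach")
|         elbow = math.acos(cos_elbow)  # law of cosines
|         shoulder = math.atan2(y, x) - math.atan2(
|             l2 * math.sin(elbow), l1 + l2 * math.cos(elbow))
|         return shoulder, elbow
|
|     print(two_link_ik(0.4, 0.2))
|
| The hard part is doing this for many joints under contact and
| vision noise, which is presumably where the learned policies
| come in.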
| tintor wrote:
| Try poking it with a hockey stick (Boston Dynamic style)
| beefnugs wrote:
| I wish I were a 5 year old who didn't actually know how
| infeasible and useless this all is, and could just be positive
| about the future for once. But humans can't even figure out
| that you can't operate an "imaginary number go up" economy
| underneath the basic human requirement of rent for shelter, so
| there is no way they can make this technology useful or
| affordable or reliable or good.
| Solvency wrote:
| Fully agreed. This is going to turn into novelty nonsense, like
| some fashion designer will have a robot runway event and it'll
| drum up a bunch of stupid press. Then the military will take
| the rest over into advancing drone technology.
___________________________________________________________________
(page generated 2024-03-13 23:01 UTC)