[HN Gopher] Show HN: Only 1 LLM can fly a drone
___________________________________________________________________
Show HN: Only 1 LLM can fly a drone
Author : beigebrucewayne
Score : 158 points
Date : 2026-01-26 11:00 UTC (23 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bigfishrunning wrote:
| Why would you want an LLM to fly a drone? Seems like the wrong
| tool for the job -- it's like saying "Only one power drill can
| pound roofing nails". Maybe that's true, but just get a hammer
| pavlov wrote:
| Yeah, it feels a bit like asking "which typewriter model is the
| best for swimming".
| peterpost2 wrote:
| Did you read his post?
|
| He answers your question
| macintux wrote:
| > Please don't comment on whether someone read an article.
| "Did you even read the article? It mentions that" can be
| shortened to "The article mentions that".
|
| https://news.ycombinator.com/newsguidelines.html
| philipwhiuk wrote:
| I disagree. The nearest justification is:
|
| > to see what happens
| ceejayoz wrote:
| Isn't that the epitome of the hacker spirit?
|
| "Why?" "Because I can!"
| munchler wrote:
| Because we're interested in AGI (emphasis on _general_ ) and
| LLMs are the closest thing to AGI that we have right now.
| notepad0x90 wrote:
| There are almost endless reasons why. It's like asking why
| would you want a self-driving car. Having a drone to transport
| things would be amazing, or to patrol an area. LLMs can be
| helpful with object identification, reacting to different
| events, and taking commands from users.
|
| The first thought I had was those security guard robots that
| are popping up all over the place. if they were drones instead,
| and LLM talked to people asking them to do/not-do things, that
| would be an improvement.
|
| Or a waiter drone, that takes your order in a restaurant,
| flies to the kitchen, picks up a sealed and secured food
| container, flies it back to the table, opens it, and leaves. It
| will monitor for gestures and voice commands to respond to
| diners and get their feedback, abuse, take the food back if it
| isn't satisfactory, etc.
|
| This is the type of stuff we used to see in futuristic movies.
| It's almost possible now. glad to see this kind of tinkering.
| lewispollard wrote:
| The point is that you don't need an LLM to pilot the thing,
| even if you want to integrate an LLM interface to take a
| request in natural language.
| notepad0x90 wrote:
| We don't need a lot of things, but new tech should also
| address what people want, not just needs. I don't know how
| to pilot drones, nor do I care to learn how to, but I want
| to do things with drones, does that qualify as a need? Tech
| is there to do things for us we're too lazy to do.
| volkercraig wrote:
| I don't think you understand what an "LLM" is. They're
| text generators. We've had autopilot since the 1930s that
| relies on measurable things... like PID loops, direct
| sensor input. You don't need the "language model" part to
| run an autopilot, that's just silly.
| pixl97 wrote:
| You seem to be talking past them and ignoring what they are
| actually saying.
|
| LLMs are a higher level construct than PID loops. With
| things like autopilot I can give the controller a command
| like 'Go from A to B', and chain constructs like this to
| accomplish a task.
|
| With an LLM I can give the drone/LLM system a complex
| command that I'd never be able to encode to a controller
| alone. "Fly a grid over my neighborhood, document the
| location of and take pictures of every flower garden".
|
| And if an LLM is just a 'text generator' then it's a
| pretty damned spectacular one as it can take free formed
| input and turn it into a set of useful commands.
| volkercraig wrote:
| They are text generators, and yes they are pretty good,
| but that really is all they are, they don't actually
| learn, they don't actually think. Every "intelligence"
| feature by every major AI company relies on semantic
| trickery and managing context windows. It even says it
| right on the tin; Large LANGUAGE Model.
|
| Let me put it this way: What OP built is an airplane in
| which a pilot doesn't have a control stick, but they have
| a keyboard, and they type commands into the airplane to
| run it. It's a silly unnecessary step to involve
| language.
|
| Now what you're describing is a language problem, which
| is orchestration, and that is more suited to an LLM.
| lukan wrote:
| "they don't actually learn"
|
| Give the LLM agent write access to a text file to take
| notes and it can actually learn. Not really reliable,
| but some seem to get useful results. They ain't just text
| generators anymore.
|
| (but I agree that it does not seem the smartest way to
| control a plane with a keyboard)
| volkercraig wrote:
| If that's your definition of learning, my casio FX has an
| "ans" feature that "learns" from earlier calculations!!
| lukan wrote:
| Can that "ans" variable influence the general way your
| casio does future calculations?
|
| I don't think so. But with an AI agent it can.
|
| Sure, they still don't have real understanding, but
| calling this technology mere text generators in 2026
| seems a bit out of the loop.
| infecto wrote:
| My confusion maybe? Is this simulator just flying point a
| to b? Seems like it's handling collisions while trying to
| locate the targets and identify them. That seems quite a
| bit more complex than what you are describing has been
| solved since the 1930s.
| notepad0x90 wrote:
| LLMs can do chat-completion, they don't do only chat
| completion. There are LLMs for image generation, voice
| generation, video generation and possibly more. The
| camera of a drone inputs images for the LLM, then it
| determines what action to take based on that. Similar to if
| you asked ChatGPT "there is a tree in this picture, if
| you were operating a drone, what action would you take to
| avoid collision", except the "there is a tree" part is
| done by the LLM's image recognition, and the sys prompt is
| "recognize objects and avoid collision", of course I'm
| simplifying it a lot but it is essentially generating
| navigational directions under a visual context using
| image recognition.
| nrrbtrbbrb wrote:
| > There are LLMs for image generation,
|
| That part isn't handled by an LLM
|
| > voice generation,
|
| That part isn't handled by an LLM
|
| > video generation
|
| That part isn't handled by an LLM
| famouswaffles wrote:
| Yes it can be, and often is. Advanced voice mode in
| ChatGPT and the voice mode in Gemini are LLMs. So is the
| image gen in both ChatGPT and Gemini (Nano Banana).
| cheema33 wrote:
| "You don't need the "language model" part to run an
| autopilot, that's just silly."
|
| I think most of us understood that reproducing what
| existing autopilot can do was not the goal. My
| inexpensive DJI quadcopter has impressive abilities in
| this area as well. But, I cannot give it a mission in
| natural language and expect it to execute it. Not even
| close.
| laffOr wrote:
| There are two different things:
|
| 1. a drone that you can talk to and fly on its own
|
| 2. a drone where the flying is controlled by an LLM
|
| (2) is a specific instance of the larger concept of (1).
|
| You make an argument that 1 should be addressed, which no
| one is denying in this thread - people are arguing that
| (2) is a bad way to do (1).
| notepad0x90 wrote:
| You're considering "talking to" a separate thing, I
| consider it the same as reading street signs or using
| object recognition. My voice or text input is just one
| type of input. Can other ML solutions or algorithms
| detect a tree (same as me telling it there is a tree, yaw
| to the right), yes, can LLMs detect a tree and determine
| what course of action to take? also true. Which is
| better? I don't know, but I won't be quick to dismiss
| anyone attempting to use LLMs.
| infecto wrote:
| That's a pretty boring point for what looks like a fun
| project. Happy to see this project and know I am not the
| only one thinking about these kinds of applications.
| coder543 wrote:
| An LLM that can't understand the environment properly can't
| properly reason about which command to give in response to
| a user's request. Even if the LLM is a very inefficient way
| to pilot the thing, _being able to pilot_ means the LLM has
| the reasoning abilities required to also translate a
| user's request into commands that make sense for the more
| efficient, lower-level piloting subsystem.
| laffOr wrote:
| You could have a program for flying, not LLM-based (though it
| could be an ANN), and an LLM for overseeing; the LLM could give
| the pilot program its instructions as (x,y,z) directions. I
| mean, currently autopilots are typically not LLMs, right?
|
| You describe why it would be useful to have an LLM in a drone
| to interact with it but do not explain why it is the very
| same LLM that should be doing the flying.
| notepad0x90 wrote:
| I'm not OP, I don't know what specific roles the LLM should
| be using, but LLMs are great with object recognition, and
| using both text (street signs, notices, etc.) and visual
| cues to predict the correct response. The actual motor
| control I'm sure needs no LLMs, but the decision making
| could use any number of solutions, I agree that an LLM-only
| solution sounds bad, but I didn't do the testing and
| comparison to be confident in that assessment.
| iso1631 wrote:
| You want a self driving car
|
| You don't want an LLM to drive a car
|
| There is more to "AI" than LLMs
| coder543 wrote:
| Waymo is certainly interested in using LLMs/VLMs for this
| purpose.
|
| https://waymo.com/research/emma/
|
| https://waymo.com/blog/2024/10/introducing-emma
|
| https://waymo.com/blog/2025/12/demonstrably-safe-ai-for-
| auto...
| notepad0x90 wrote:
| I don't mind someone trying LLMs to see if they can do
| better than existing ML solutions.
| fwip wrote:
| Both of those proposed uses are bad things that are worse
| than what they would replace.
| dan-bailey wrote:
| When your only tool is a hammer, every problem begins to
| resemble a nail.
| infecto wrote:
| What's the right tool then?
|
| This looks like a pretty fun project and in my rough estimation
| a fun hacker project.
| bigfishrunning wrote:
| The right tool would likely be some conventional autopilot
| software; if you want AI cred you could train a Neural
| Network which maps some kind of path to the control features
| of the drone. LLMs are language models -- good for language,
| but not good for spatial reasoning or navigation or many of
| the other things you need to pilot a drone.
| infecto wrote:
| So you are suggesting building a full featured package that
| is nontrivial compared to this fun excitement?
|
| Vision models do a pretty decent job with spatial
| reasoning. It's not there yet but you're dismissing some
| interesting work going on.
| bob1029 wrote:
| The system prompt for the drone is hilarious to me. These
| models are horrible at spatial reasoning tasks:
|
| https://github.com/kxzk/snapbench/blob/main/llm_drone/src/ma...
|
| I've been working with integrating GPT-5.2 in Unity. It's
| fantastic at scripting but completely worthless at managing
| transforms for scene objects. Even with elaborate planning
| phases it's going to make a complete jackass of itself in world
| space every time.
|
| LLMs are also wildly unsuitable for real-time control problems.
| They never will be. A PID controller or dedicated pathfinding
| tool being driven by the LLM will provide a radically superior
| result.
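| For reference, a minimal sketch of the kind of PID loop mentioned
| above. The gains, the time step, and the toy plant (controller
| output treated directly as vertical velocity) are illustrative
| assumptions, not anything taken from the linked project.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller; the gains used below are illustrative, not tuned."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        # The first call sees a derivative kick because prev_error starts at 0.
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def hold_altitude(target: float, start: float, steps: int = 2000, dt: float = 0.05) -> float:
    """Toy simulation: the controller output is applied as vertical velocity."""
    pid = PID(kp=1.2, ki=0.1, kd=0.05)
    altitude = start
    for _ in range(steps):
        altitude += pid.step(target, altitude, dt) * dt
    return altitude
```

| In a layout like this, an LLM would only ever pick the target
| altitude; the loop itself runs at a fixed rate with no model call.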
| storystarling wrote:
| Agreed. I've found the only reliable architecture for this is
| treating the LLM purely as a high-level planner rather than a
| controller.
|
| We use a state machine (LangGraph) to manage the intent and
| decision tree, but delegate the actual transform math to
| deterministic code. You really want the model deciding the
| strategy and a standard solver handling the vectors,
| otherwise you're just burning tokens to crash into walls.
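| A rough sketch of that split, assuming the planner replies with a
| single word. It does not use LangGraph; the Intent names and the
| parse_intent and solve_velocity helpers are hypothetical, chosen
| only to show the model picking a strategy while plain code does
| the vector math.

```python
import math
from enum import Enum


class Intent(Enum):
    APPROACH = "approach"
    RETREAT = "retreat"
    HOLD = "hold"


def parse_intent(model_output: str) -> Intent:
    """Map free-form planner text onto a closed set; anything unknown becomes HOLD."""
    word = model_output.strip().lower()
    for intent in Intent:
        if intent.value == word:
            return intent
    return Intent.HOLD


def solve_velocity(intent, position, target, speed=1.0):
    """Deterministic math: unit direction toward the target, scaled by speed."""
    delta = [t - p for p, t in zip(position, target)]
    distance = math.sqrt(sum(d * d for d in delta))
    if intent is Intent.HOLD or distance == 0.0:
        return (0.0, 0.0, 0.0)
    sign = 1.0 if intent is Intent.APPROACH else -1.0
    return tuple(sign * speed * d / distance for d in delta)
```

| Unparseable model output degrades to hovering in place rather
| than to an arbitrary vector, which is the point of the split.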
| ralusek wrote:
| Why would you want an LLM to identify plants and animals? Well,
| they're often better than bespoke image classification models
| at doing just that. Why would you want a language model to help
| diagnose a medical condition?
|
| It would not surprise me at all if self-driving models are
| adopting a lot of the model architecture from LLMs/generative
| AI, and actually invoke actual LLMs in moments where they
| would've needed human intervention.
|
| Imagine if there's a decision engine at the core of a self
| driving model, and it gets a classification result of what to
| do next. Suddenly it gets 3 options back with 33.33% weight
| attached to each of them and a very low confidence interval of
| which is the best choice. Maybe that's the kind of scenario
| that used to trigger self-driving to refuse to choose and defer
| to human intervention. If that can then first defer judgement
| to an LLM which could say "that's just a goat crossing the
| road, INVOKE: HONK_HORN," you could imagine how that might be
| useful. LLMs are clearly proving to be universal reasoning
| agents, and it's getting tiring to hear people continuously try
| to reduce them to "next word predictors."
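| The deferral scenario above could be sketched like this. The
| margin threshold and the fallback callable (standing in for an
| LLM call) are hypothetical; no real self-driving stack is being
| described here.

```python
def choose_action(scores, fallback, min_margin=0.2):
    """Return the top-scoring action, or defer to `fallback` when the top two are too close."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    best, best_score = ranked[0]
    runner_up_score = ranked[1][1]
    if best_score - runner_up_score < min_margin:
        # Ambiguous result: hand the decision to the slower, more general fallback.
        return fallback(scores)
    return best
```

| With three options near 33% each, the margin check fails and the
| fallback decides; with a clear winner the fallback is never called.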
| avaer wrote:
| Using an LLM is the SOTA way to turn plain text instructions
| into embodied world behavior.
|
| Charitably, I guess you can question why you would ever want to
| use text to command a machine in the world (simulated or not).
|
| But I don't see how it's the wrong tool given the goal.
| irl_zebra wrote:
| SOTA typically refers to achieving the best performance, not
| using the trendiest thing regardless of performance. There is
| some subtlety here. At some point an LLM might give the best
| performance in this task, but that day is not today, so an
| LLM is not SOTA, just trendy. It's kinda like rewriting
| something in Rust and calling it SOTA because that's the
| trend right now. Hope that makes sense.
| infecto wrote:
| I don't think trendy is really the right word and maybe
| it's not state of the art but a lot of us in the industry
| are seeing emerging capabilities that might make it SOTA.
| Hope that makes sense.
| irl_zebra wrote:
| LLMs are indeed the definition of trendy (I've found
| using Google Trends to dive in is a good entry point to
| get a broad sense of whether something is "trendy")!
| Basically the right way to think about it is that
| something can be promising, and demonstrate emerging
| capabilities, but those things don't make something
| SOTA, nor do they make it trendy. They can be related
| though (I expect everything SOTA was once promising and
| emerging, but not everything promising or emerging became
| SOTA). It's a subtlety that isn't super easy to grasp,
| but (and here is one area I think an LLM can show
| promise) an LLM like ChatGPT can help unpick the
| distinctions here. Still, it's slightly nuanced and I
| understand the confusion.
| infecto wrote:
| I think the point may have flown over your head. I am
| suggesting you are being dismissive with a distinct lack
| of thought on your reply. Like I said, I don't think state
| of the art is the right way to describe it but I think
| trendy is equally wrong from the other side of the
| spectrum. Models that can deal with vision have some
| really interesting use cases and ones that can be
| valuable, in a lot of ways I would say state of the art
| could describe it but I know to folks that are hopelessly
| negative, it's a hard reach so I was trying to balance it
| for you. Hope that makes sense.
| famouswaffles wrote:
| >Using an LLM is the SOTA way to turn plain text
| instructions into embodied world behavior.
|
| >SOTA typically refers to achieving the best performance
|
| Multimodal Transformers _are_ the best way to turn plain
| text instructions to embodied world behavior. Nothing to do
| with being 'trendy'. A Vision Language Action model would
| probably have done much better but really the only
| difference between that and the models trialed above is
| training data. Same technology.
| smw1218 wrote:
| It's a great feature to tell my drone to do a task in English.
| Like "a child is lost in the woods around here. Fly a search
| pattern to find her" or "film a cool panorama of this property.
| Be sure to get shots of the water feature by the pool." While
| LLMs are bad at flying, better navigation models likely can't
| be prompted in natural language yet.
| volkercraig wrote:
| What you're describing is still ultimately the "view" layer
| of a larger autopilot system, that's not what OP is doing.
| He's getting the text generator to drive the drone. An LLM
| can handle parsing input, but the wayfinding and driving
| would (in the real world) be delegated to modern autopilot.
| Mashimo wrote:
| > Why would you want an LLM to fly a drone?
|
| We are on HACKER news. Using tools outside the scope is the
| ethos of a hacker.
| antisthenes wrote:
| LLMs flying weaponized drones is exactly how it starts.
| popcornricecake wrote:
| One day they'll fly to a drone factory, eliminate all the
| personnel, then start gently shooting at the machinery to
| create more weaponized drones and then it's all over before you
| know it!
| SoftTalker wrote:
| It's pretty entertaining seeing the plot lines and fictitious
| history in _The Terminator_ movies actually happening in real
| time.
| goda90 wrote:
| https://www.youtube.com/watch?v=O-2tpwW0kmU
| accrual wrote:
| I think it's fascinating work even if LLMs aren't the ideal tool
| for this job right now.
|
| There were some experiments with embodied LLMs on the front page
| recently (e.g. basic robot body + task) and SOTA models struggled
| with that too. And of course they would - what training data is
| there for embodying a random device with arbitrary controls and
| feedback? They have to lean on the "general" aspects of their
| intelligence which is still improving.
|
| With dedicated embodiment training and an even tighter/faster
| feedback loop, I don't see why an LLM couldn't successfully pilot
| a drone. I'm sure some will still fall off the rails, but software
| guardrails could help by preventing certain maneuvers.
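| A minimal sketch of what such a software guardrail might look
| like, sitting between the model and the flight controller. The
| Command fields, the limits, and the altitude floor are all
| assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    forward: float   # metres per second
    vertical: float  # metres per second, positive is up
    yaw_rate: float  # degrees per second


def clamp(value, low, high):
    return max(low, min(high, value))


def apply_guardrails(cmd, altitude, min_altitude=1.0, max_speed=2.0, max_yaw_rate=45.0):
    """Clamp a proposed command and refuse descent at or below a floor altitude."""
    vertical = clamp(cmd.vertical, -max_speed, max_speed)
    if altitude <= min_altitude and vertical < 0.0:
        vertical = 0.0  # block any further descent near the ground
    return Command(
        forward=clamp(cmd.forward, -max_speed, max_speed),
        vertical=vertical,
        yaw_rate=clamp(cmd.yaw_rate, -max_yaw_rate, max_yaw_rate),
    )
```

| The model can still propose anything; only the filtered command
| reaches the motors.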
| fsiefken wrote:
| I am curious how these models would perform and how much energy
| they'd take to semi-realtime detect objects: SmolVLM2-500M -
| Moondream 0.5B/2B/2.5B - Qwen3-VL (3B)
| https://huggingface.co/collections/Qwen/qwen3-vl
|
| I am sure this is already worked on in Russia, Ukraine and The
| Netherlands. A lot can go wrong with autonomous flying. One could
| load the VLM on a high end android phone on the drone and have
| dual control.
| SpyCoder77 wrote:
| A better way would be a VLA as opposed to a VLM. VLAs are meant
| to take action, whereas VLMs are for general use.
| https://cognitivedrone.github.io/
| avaer wrote:
| Gemini 3 is the only model I've found that can reason spatially.
| The results here are accurate to my experiments with putting LLM
| NPCs in simulated worlds.
|
| I was surprised that most VLLMs cannot reliably tell if a
| character is facing left or right, they will confidently lie no
| matter what you do (even Gemini 3 cannot do it reliably). I guess
| it's just not in the training data.
|
| That said Qwen3VL models are smaller/faster and better "spatially
| grounded" in pixel space, because pixel coordinates are encoded
| in the tokens. So you can use them for detecting things in the
| scene, and where they are (which you can project to 3d space if
| you are running a sim). But they are not good reasoning models so
| don't ask them to think.
|
| That means the best pipeline I've found at the moment is to tack
| a dumb detection prepass on before your action reasoning. This
| basically turns 3d sims into 1d text sims operating on labels --
| which is something that LLMs _are_ good at.
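| A sketch of that prepass idea, assuming the detector returns
| (label, x_min, x_max) pixel boxes. The tuple layout, the
| left/center/right thirds, and describe_detections itself are
| hypothetical, not any specific model's output format.

```python
def describe_detections(detections, image_width):
    """Turn (label, x_min, x_max) pixel boxes into one line of text per object."""
    lines = []
    # Sort left to right so the text order matches the image layout.
    for label, x_min, x_max in sorted(detections, key=lambda d: d[1]):
        center = (x_min + x_max) / 2 / image_width
        if center < 1 / 3:
            side = "left"
        elif center > 2 / 3:
            side = "right"
        else:
            side = "center"
        lines.append(f"{label}: {side}")
    return "\n".join(lines)
```

| The reasoning model then only ever sees lines like "cat: left",
| so it never has to judge position from pixels itself.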
| Krutonium wrote:
| Neuro-sama, the V-Tuber/AI actually does a decent job of it.
| Vedal seems to have cooked and figured out how to make an LLM
| move reasonably well in VRChat.
|
| Not perfectly, there's a lot of abuse of gravity or the lack
| thereof, but yeah. Neuro has also piloted a Robot Dog in the
| past.
| storystarling wrote:
| I suspect the latency on Gemini 3 makes it non-viable for a
| real-time control loop though. Even if the reasoning works, the
| input token costs would destroy the unit economics pretty
| quickly. I'd be worried about relying on that kind of API
| overhead for the critical path.
| 101008 wrote:
| > the input token costs would destroy the unit economics
| pretty quickly.
|
| They say this is going to happen to every task after they stop
| subsidizing token costs.
| zinodaur wrote:
| Not for coding though - I'd buy 4 H200's and stick them in
| my basement if I had to
| nish__ wrote:
| To do what?
| weird-eye-issue wrote:
| CODING
| volkercraig wrote:
| I don't understand. Surely training an LSTM with sensor input is
| more practical and reasonable way than trying to get a text
| generator to speak commands to a drone.
| encrux wrote:
| Very much depends on what you want to do.
|
| The fact that a language model can "reason" (in the LLM-slang
| meaning of the term) about 3D space is an interesting property.
|
| If you give a text description of a scene and ask a robot to
| perform a peg in hole task, modern models are able to solve
| them fairly easily based on movement primitives. I implemented
| this on a UR robot arm back in 2023
|
| The next logical step is, instead of having the model output
| text (code representing movement primitives), outputting tokens
| in action space. This is what models like pi0 are doing.
| volkercraig wrote:
| I mean semantically language evolved as an interpretation for
| the material world, so assuming that you can describe a
| problem in language, and considering that there exists a
| solution to said problem that is describable in language,
| then I'm sure a big enough LLM could do it... but you can
| also calculate highly detailed orbital maps with epicycles if
| you just keep adding more... you just don't because it's a
| waste of time and there's a simpler way.
|
| The latter part is interesting. I'm not sure how the
| performance of one of those would be once they are working
| well, but my naive gut feeling is that splitting the language
| part and the driving part into two delegates is cleaner,
| safer, faster and more predictable.
| convolvatron wrote:
| note that the control systems you were talking about before
| (i.e. PID) would probably take hold pretty directly in a
| tiny network, and exactly because of that limitation, be
| far less likely to contain 'hallucinations'. object
| avoidance and path planning are likely similar.
|
| since this is a limited and continuous domain, it's a far
| better one for neural training than natural language. I
| guess this notion that a language model should be used for
| 3d motion control is a real indicator about the level of
| thought going into some of these applications.
| eichin wrote:
| At least he's not feeding real drones to the coyotes... oh,
| there's a link in the readme https://github.com/kxzk/tello-bench
| modeless wrote:
| This is what VLA models are for. They would work much better.
| Would need a bit of fine tuning but probably not much. Lots of
| literature out there on using VLAs to control drones.
| SpyCoder77 wrote:
| Did some research, found a model that is exactly that.
| https://cognitivedrone.github.io/
| culi wrote:
| The Black Mirror speedrun continues
| goda90 wrote:
| Slaughterbots: https://www.youtube.com/watch?v=O-2tpwW0kmU
| beigebrucewayne wrote:
| Thanks will check this out!
| andai wrote:
| Gemini Flash beats Gemini Pro? How does that work?
|
| Gemini Pro, like the other models, didn't even find a single
| creature.
| seniortaco wrote:
| "drone"
| broast wrote:
| On the discussion of the right or wrong tool, I find it possible
| that the ability to reason towards a goal is more valuable in the
| long run than an intrinsic ability to achieve the same result. Or
| maybe a mix of both is the ideal.
| me551ah wrote:
| In a real world test you would have a tool call for the LLM which
| is a bit high level like GoTo(object) and the tool calls another
| program which identifies the objects in frame and uses standard
| programs to go to that.
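| A minimal sketch of that arrangement, assuming the model emits a
| JSON tool call. The JSON shape, the TOOLS table, and the go_to
| body are hypothetical; a real system would pass the waypoint to
| an actual autopilot instead of returning it.

```python
import json


def go_to(object_name, detections):
    """Hypothetical tool body: look up the object and return a waypoint for the autopilot."""
    if object_name not in detections:
        return {"ok": False, "error": f"{object_name} not in frame"}
    x, y, z = detections[object_name]
    return {"ok": True, "waypoint": [x, y, z]}


TOOLS = {"GoTo": go_to}


def dispatch(tool_call_json, detections):
    """Route a model-emitted tool call to ordinary code; reject unknown tools."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return {"ok": False, "error": "unknown tool"}
    return tool(call["arguments"]["object"], detections)
```

| The error results can be fed back to the model as text, so a
| missing object becomes something it can react to.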
| SpyCoder77 wrote:
| https://cognitivedrone.github.io/
| mbreese wrote:
| I can't really take this too seriously. This seems to me to be a
| case of asking "can an LLM do X?" Instead, the question I'd like
| to see is: "I want to do X, is an LLM the right tool?"
|
| But that said, I think the author missed something. LLMs aren't
| great at this type of reasoning/state task, but they are good at
| writing programs. Instead of asking the LLM to search with a
| drone, it would be very interesting to know how they performed if
| you asked them to _write a program_ to search with a drone.
|
| This is more aligned with the strengths of LLMs, so I could see
| this as having more success.
| zahlman wrote:
| > I gave 7 frontier LLMs a simple task: pilot a drone through a
| 3D voxel world and find 3 creatures.
|
| > Only one could do it.
|
| If I understood the chart correctly, even the successful one only
| found 1/6 of the creatures across multiple runs.
| uoaei wrote:
| No science detected.
|
| Without comparison to some null hypothesis (a random policy),
| this article is hogwash.
| zahlman wrote:
| Given that all the other agents failed to find any creatures,
| it's hard to imagine that a random policy would except by
| extreme coincidence.
| TOMDM wrote:
| It is possible to be consistently wrong in a way that
| randomness is not.
|
| For some problems, randomness outperforms incompetent
| reasoning
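| A random-policy baseline of the kind asked for above could be
| sketched as follows. The grid size, step budget, and the rule
| that entering a creature's cell counts as finding it are
| assumptions; the benchmark's real world and scoring may differ.

```python
import random


def random_policy_finds(creatures, size=10, steps=200, seed=0):
    """Count distinct creature cells entered by a uniform random walk on a size**3 grid."""
    rng = random.Random(seed)
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    position = (0, 0, 0)
    found = set()
    for _ in range(steps):
        move = rng.choice(moves)
        # Clamp each coordinate so the walker stays inside the grid.
        position = tuple(
            min(size - 1, max(0, c + d)) for c, d in zip(position, move)
        )
        if position in creatures:
            found.add(position)
    return len(found)


def baseline_rate(creatures, trials=100, **kwargs):
    """Average fraction of creatures found across independently seeded trials."""
    total = sum(random_policy_finds(creatures, seed=s, **kwargs) for s in range(trials))
    return total / (trials * len(creatures))
```

| Comparing each model's find rate against baseline_rate on the
| same layouts would show whether it beats chance at all.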
| SoftTalker wrote:
| LLMs are trained on text. Why would we expect them to understand
| a visual and tactile 3D world?
| azinman2 wrote:
| Because they're also multimodal vLLMs.
| kylehotchkiss wrote:
| This sounds like a good way to get your drone shot down by a
| Concerned Citizen or the military.
| dimatura wrote:
| This is neat! It's a bit amusing in that I worked on a somewhat
| similar project for my phd thesis almost 10 years ago, although
| in that case we got it working on a real drone (heavily
| customized, based on DJI matrice) in the field, with only onboard
| compute. Back then it was just a fairly lightweight CNN for the
| perception, not that we could've gotten much more out of the
| jetson TX2.
| arikrahman wrote:
| Interesting. In some benchmarks I even see flash outperforming
| thinking in general reasoning.
___________________________________________________________________
(page generated 2026-01-27 10:01 UTC)