bigfishrunning's comments | Hacker News

Why would you want an LLM to fly a drone? Seems like the wrong tool for the job -- it's like saying "only a power drill can pound roofing nails". Maybe that's true, but just get a hammer.

There are almost endless reasons why. It's like asking why would you want a self-driving car. Having a drone to transport things would be amazing, or to patrol an area. LLMs can be helpful with object identification, reacting to different events, and taking commands from users.

The first thought I had was those security guard robots that are popping up all over the place. If they were drones instead, and an LLM talked to people asking them to do or not do things, that would be an improvement.

Or a waiter drone that takes your order in a restaurant, flies to the kitchen, picks up a sealed and secured food container, flies it back to the table, opens it, and leaves. It will monitor for gestures and voice commands to respond to diners and get their feedback, take abuse, take the food back if it isn't satisfactory, etc.

This is the type of stuff we used to see in futuristic movies. It's almost possible now. Glad to see this kind of tinkering.


You could have a program, not LLM-based but possibly an ANN, for flying, and an LLM for overseeing; the LLM could give instructions to the pilot program as (x, y, z) directions. I mean, current autopilots are typically not LLMs, right?

You describe why it would be useful to have an LLM in a drone to interact with it, but do not explain why that very same LLM should be doing the flying.


I'm not OP, and I don't know what specific roles the LLM should be filling, but LLMs are great at object recognition, and at using both text (street signs, notices, etc.) and visual cues to predict the correct response. The actual motor control surely needs no LLM, but the decision making could use any number of solutions. I agree that an LLM-only solution sounds bad, but I didn't do the testing and comparison to be confident in that assessment.

The point is that you don't need an LLM to pilot the thing, even if you want to integrate an LLM interface to take a request in natural language.

An LLM that can't understand the environment properly can't properly reason about which command to give in response to a user's request. Even if the LLM is a very inefficient way to pilot the thing, being able to pilot means the LLM has the reasoning abilities required to also translate a user's request into commands that make sense for the more efficient, lower-level piloting subsystem.

That’s a pretty boring point for what looks like a fun project. Happy to see this project and know I am not the only one thinking about these kinds of applications.

We don't need a lot of things, but new tech should also address what people want, not just needs. I don't know how to pilot drones, nor do I care to learn how to, but I want to do things with drones, does that qualify as a need? Tech is there to do things for us we're too lazy to do.

There are two different things:

1. a drone that you can talk to and fly on its own

2. a drone where the flying is controlled by an LLM

(2) is a specific instance of the larger concept of (1).

You make an argument that 1 should be addressed, which no one is denying in this thread - people are arguing that (2) is a bad way to do (1).


You're considering "talking to" a separate thing; I consider it the same as reading street signs or using object recognition. My voice or text input is just one type of input. Can other ML solutions or algorithms detect a tree (same as me telling it "there is a tree, yaw to the right")? Yes. Can LLMs detect a tree and determine what course of action to take? Also true. Which is better? I don't know, but I won't be quick to dismiss anyone attempting to use LLMs.

I don't think you understand what an "LLM" is. They're text generators. We've had autopilots since the 1930s that rely on measurable things -- PID loops, direct sensor input. You don't need the "language model" part to run an autopilot; that's just silly.

You seem to be talking past them and ignoring what they are actually saying.

LLMs are a higher level construct than PID loops. With things like autopilot I can give the controller a command like 'Go from A to B', and chain constructs like this to accomplish a task.

With an LLM I can give the drone/LLM system a complex command that I'd never be able to encode for a controller alone: "Fly a grid over my neighborhood, document the location of and take pictures of every flower garden."

And if an LLM is just a 'text generator', then it's a pretty damned spectacular one, as it can take free-form input and turn it into a set of useful commands.


They are text generators, and yes they are pretty good, but that really is all they are; they don't actually learn, they don't actually think. Every "intelligence" feature from every major AI company relies on semantic trickery and managing context windows. It even says it right on the tin: Large LANGUAGE Model.

Let me put it this way: what OP built is an airplane in which the pilot doesn't have a control stick but has a keyboard, and types commands into the airplane to run it. It's a silly, unnecessary step to involve language.

Now what you're describing is a language problem, which is orchestration, and that is more suited to an LLM.


LLMs can do chat completion, but they don't do only chat completion. There are LLMs for image generation, voice generation, video generation, and possibly more. The camera of a drone inputs images to the LLM, which then determines what action to take based on them. It's similar to asking ChatGPT "there is a tree in this picture; if you were operating a drone, what action would you take to avoid collision", except the "there is a tree" part is done by the LLM's image recognition, and the system prompt is "recognize objects and avoid collision". Of course I'm simplifying a lot, but it is essentially generating navigational directions under a visual context using image recognition.
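A minimal Python sketch of the loop being described, with the multimodal model call stubbed out (`query_vision_llm` is a hypothetical stand-in, not a real API -- a real version would send the frame to a vision-capable model):

```python
# Hypothetical sketch: a camera frame plus a system prompt go to a
# multimodal LLM, which returns one navigation command.

AVOIDANCE_PROMPT = "Recognize obstacles in the frame and return one command to avoid collision."

def query_vision_llm(system_prompt: str, frame: bytes) -> str:
    # Stub standing in for a real multimodal API call. Here we pretend
    # the model saw a tree ahead and recommends yawing right.
    return "YAW_RIGHT"

def next_command(frame: bytes) -> str:
    command = query_vision_llm(AVOIDANCE_PROMPT, frame)
    # Never trust free-form model output blindly; validate against a whitelist.
    allowed = {"YAW_LEFT", "YAW_RIGHT", "CLIMB", "HOLD"}
    return command if command in allowed else "HOLD"
```

The whitelist check is the important part of the sketch: whatever the model generates, only a small vocabulary of vetted commands ever reaches the flight controller.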

Maybe the confusion is mine? Is this simulator just flying point A to B? It seems to be handling collisions while trying to locate and identify the targets. That seems quite a bit more complex than what you are describing as solved since the 1930s.

"You don't need the "language model" part to run an autopilot, that's just silly."

I think most of us understood that reproducing what existing autopilots can do was not the goal. My inexpensive DJI quadcopter has impressive abilities in this area as well. But I cannot give it a mission in natural language and expect it to execute it. Not even close.


You want a self driving car

You don't want an LLM to drive a car

There is more to "AI" than LLMs



I don't mind someone trying LLMs to see if they can do better than existing ML solutions.

Both of those proposed uses are worse than the things they would replace.

Because we’re interested in AGI (emphasis on general) and LLM’s are the closest thing to AGI that we have right now.

Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.

Charitably, I guess you can question why you would ever want to use text to command a machine in the world (simulated or not).

But I don't see how it's the wrong tool given the goal.


SOTA typically refers to achieving the best performance, not using the trendiest thing regardless of performance. There is some subtlety here. At some point an LLM might give the best performance in this task, but that day is not today, so an LLM is not SOTA, just trendy. It's kinda like rewriting something in Rust and calling it SOTA because that's the trend right now. Hope that makes sense.

>Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.

>SOTA typically refers to achieving the best performance

Multimodal Transformers are the best way to turn plain text instructions into embodied world behavior. Nothing to do with being 'trendy'. A Vision-Language-Action model would probably have done much better, but really the only difference between that and the models trialed above is training data. Same technology.


I don't think trendy is really the right word, and maybe it's not state of the art, but a lot of us in the industry are seeing emerging capabilities that might make it SOTA. Hope that makes sense.

LLMs are indeed the definition of trendy (I've found using Google Trends to dive in is a good entry point to get a broad sense of whether something is "trendy")! Basically the right way to think about it is that something can be promising and demonstrate emerging capabilities, but those things don't make it SOTA, nor do they make it trendy. They can be related, though (I expect everything SOTA was once promising and emerging, but not everything promising or emerging became SOTA). It's a subtlety that isn't super easy to grasp, but (and here is one area where I think an LLM can show promise) an LLM like ChatGPT can help unpick the distinctions. Still, it's slightly nuanced and I understand the confusion.

I think the point may have flown over your head. I am suggesting you are being dismissive, with a distinct lack of thought in your reply. Like I said, I don't think state of the art is the right way to describe it, but I think trendy is equally wrong from the other side of the spectrum. Models that can deal with vision have some really interesting and valuable use cases; in a lot of ways state of the art could describe it, but I know that to folks who are hopelessly negative it's a hard reach, so I was trying to balance it for you. Hope that makes sense.

> Why would you want an LLM to fly a drone?

We are on HACKER news. Using tools outside their intended scope is the ethos of a hacker.


When your only tool is a hammer, every problem begins to resemble a nail.

Yeah, it feels a bit like asking "which typewriter model is the best for swimming".

It's a great feature to tell my drone to do a task in English. Like "a child is lost in the woods around here. Fly a search pattern to find her" or "film a cool panorama of this property. Be sure to get shots of the water feature by the pool." While LLMs are bad at flying, better navigation models likely can't be prompted in natural language yet.

What you're describing is still ultimately the "view" layer of a larger autopilot system; that's not what OP is doing. He's getting the text generator to drive the drone. An LLM can handle parsing input, but the wayfinding and driving would (in the real world) be delegated to a modern autopilot.

The system prompt for the drone is hilarious to me. These models are horrible at spatial reasoning tasks:

https://github.com/kxzk/snapbench/blob/main/llm_drone/src/ma...

I've been working on integrating GPT-5.2 in Unity. It's fantastic at scripting but completely worthless at managing transforms for scene objects. Even with elaborate planning phases, it will make a complete jackass of itself in world space every time.

LLMs are also wildly unsuitable for real-time control problems, and they likely never will be suitable. A PID controller or a dedicated pathfinding tool driven by the LLM will provide a radically superior result.
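For contrast, this is roughly what the deterministic alternative looks like: a textbook PID loop in Python (the gains are illustrative, not tuned for any real airframe):

```python
class PID:
    """Minimal PID controller: runs at a fixed timestep, no LLM in the loop."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Holding altitude: hundreds of cheap, deterministic updates per second --
# exactly the regime where a token-by-token text generator cannot compete.
altitude_pid = PID(kp=1.2, ki=0.1, kd=0.05)
thrust = altitude_pid.update(setpoint=10.0, measured=9.5, dt=0.01)
```

The whole controller is a few arithmetic operations per tick, which is why this layer has never needed anything resembling a language model.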


Agreed. I’ve found the only reliable architecture for this is treating the LLM purely as a high-level planner rather than a controller.

We use a state machine (LangGraph) to manage the intent and decision tree, but delegate the actual transform math to deterministic code. You really want the model deciding the strategy and a standard solver handling the vectors, otherwise you're just burning tokens to crash into walls.
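A toy sketch of that split, with the planner as a canned stand-in for the LLM call (the function names and the canned plan are illustrative, not the commenter's actual LangGraph setup):

```python
# The planner (stand-in for the LLM) emits high-level waypoints;
# deterministic code does the vector math.

import math

def plan_mission(instruction: str) -> list[tuple[float, float]]:
    # Stand-in for the LLM planner: real code would call the model and
    # parse structured output. Here, a canned plan for a survey request.
    return [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]

def heading_to(current: tuple[float, float], target: tuple[float, float]) -> float:
    # Deterministic solver: plain trigonometry, no tokens burned.
    dx, dy = target[0] - current[0], target[1] - current[1]
    return math.degrees(math.atan2(dy, dx))

waypoints = plan_mission("survey the field")
first_heading = heading_to(waypoints[0], waypoints[1])  # due east
```

The LLM only ever decides *where* to go; every number the motors see comes out of `heading_to` and friends.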


What’s the right tool then?

This looks like a pretty fun project, and in my rough estimation a proper hacker project.


The right tool would likely be some conventional autopilot software; if you want AI cred you could train a neural network which maps some kind of path to the control features of the drone. LLMs are language models -- good for language, but not good for spatial reasoning or navigation or many of the other things you need to pilot a drone.

So you are suggesting building a full-featured package that is nontrivial compared to this fun experiment?

Vision models do a pretty decent job with spatial reasoning. They're not there yet, but you're dismissing some interesting work going on.


Why would you want an LLM to identify plants and animals? Well, they're often better than bespoke image classification models at doing just that. Why would you want a language model to help diagnose a medical condition?

It would not surprise me at all if self-driving models are adopting a lot of the model architecture from LLMs/generative AI, and actually invoke LLMs in moments where they would've needed human intervention.

Imagine there's a decision engine at the core of a self-driving model, and it gets a classification result of what to do next. Suddenly it gets 3 options back with a 33.33% weight attached to each and very low confidence in which is the best choice. Maybe that's the kind of scenario that used to make self-driving refuse to choose and defer to human intervention. If it can first defer judgement to an LLM which could say "that's just a goat crossing the road, INVOKE: HONK_HORN", you can imagine how that might be useful. LLMs are clearly proving to be universal reasoning agents, and it's getting tiring to hear people continuously try to reduce them to "next word predictors."
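A sketch of that deferral logic, with made-up action names and weights (the threshold value is purely illustrative):

```python
# Fast classifier answers when confident; low-confidence cases escalate
# (to an LLM, or today, to a human).

def decide(options: dict[str, float], threshold: float = 0.5) -> str:
    """options maps action -> weight; defer when no action is clearly best."""
    best_action, best_weight = max(options.items(), key=lambda kv: kv[1])
    if best_weight < threshold:
        return "DEFER"  # hand off to the slower, smarter layer
    return best_action

clear = decide({"BRAKE": 0.95, "SWERVE": 0.03, "HONK_HORN": 0.02})
# Three near-equal options, like the goat scenario above:
murky = decide({"BRAKE": 0.34, "SWERVE": 0.33, "HONK_HORN": 0.33})
```

The escalation path only fires on the long tail, so the expensive reasoner never sits in the hot loop.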


Did you read his post?

He answers your question


> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

https://news.ycombinator.com/newsguidelines.html


I disagree. The nearest justification is:

> to see what happens


Isn't that the epitome of the hacker spirit?

"Why?" "Because I can!"


What I don't understand about this is that, if you become "the guy that always does the wrong thing", doesn't that also damage your family's reputation? I don't mean to come off insulting here, just trying to understand.

Why? If you really care that little about the properties of the Linux distribution, just run one of the many that already exist.

Linux From Scratch was never really about running the system anyway -- most people go through it as a learning exercise and then run a maintained distribution; I would think it's a tiny minority that maintains an LFS system for a long time.


> Linux From Scratch was never really about running the system anyway

It definitely is. See the bootscripts:

https://www.linuxfromscratch.org/lfs/view/development/chapte...

Admittedly, the main problem I had was with configuring the Linux kernel. I have no good solution to make this simpler. That config file is sooooooooooo huge... no clue how to handle this. There is no way I have enough time to sift through all the options, or compare what changed from kernel version to kernel version. Anyone have an idea how to handle this on your own?


Starting from a 'known good' config file (olddefconfig, allnoconfig) and then carefully, iteratively switching options around is the only methodology I've had any success with.
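As a rough sketch, that iteration loop looks something like the following, run inside a kernel source tree (the option name is just an example; a distro's `/boot/config-$(uname -r)` is another common starting point):

```shell
make defconfig                            # known-good baseline config
make olddefconfig                         # take defaults for options new to this kernel
scripts/config --enable WIREGUARD         # flip one option at a time...
make olddefconfig                         # ...then let Kconfig resolve its dependencies
scripts/diffconfig .config.old .config    # see what actually changed
```

The `scripts/config` and `scripts/diffconfig` helpers ship in the kernel tree itself, which makes the "change one thing, diff, rebuild" cycle much less painful than editing `.config` by hand.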

I've had some good experiences using Gemini to help me explore kernel config options (e.g. "what's the minimum set of config options I need to have enabled to facilitate feature X?").

LLMs in general are very knowledgeable about the Linux kernel (unsurprisingly!), which is why I made my original comment. You can ask them about relevant places in the kernel source tree to look at for a given mechanism, and they'll point you to the file and function without having to 'look'.


Good thing 4k monitors didn't exist in 2000

My comment was tongue in cheek, while simultaneously highlighting that at least some increased RAM consumption is required for modern computing, and how incredibly far technology has come in 2.5 decades.

So in this case an LLM would just be a less-reliable compiler? What's the point? If you have to formally specify your program, we already have tools for that, no boiling-the-oceans required

Exactly -- how many 192.168.0.1 certs do you think LetsEncrypt wants to issue?

The BRs have specifically forbidden issuing such a certificate since 2015. So, slightly before they were required to stop using SHA-1, and slightly after they were forbidden from issuing certificates for nonsense like .com or .ac.uk, which obviously shouldn't be available to anybody even if they do insist they somehow "own" these names.

I can't edit it now, but that comment should have said *.com or *.ac.uk -- that is, wildcards in which the suffix beyond the wildcard is an entire TLD or an entire "Public Suffix", which the rules say don't belong to anyone as a whole; they're to be shared by unrelated parties, so such a wildcard will never be a reasonable thing to exist.

Yeah, it seems kind of funny how Signal is marketed as a somewhat paranoid solution, but most people run it on an iPhone out of the App Store with no way to verify the source. All it takes is one villain infiltrating one of a few offices and Signal falls apart.

Same goes for Whatsapp, but the marketing is different there.


Ok so which iPhone app can be verified from source?

Or is your problem that your peer might run the app on an insecure device? How would you exclude decade-old Android devices with unpatched holes? I don't want to argue the nirvana fallacy here, but what is the solution you'd like to propose?


I don't think there is a solution -- Signal advertises itself as having a sort of security that isn't really possible with any commercially available device. You have to trust more people than just the person you're communicating with; if that's unacceptable, then you need to give up a bunch of convenience and find another method of communicating.

Fortunately, the parties that you have to trust when you use Signal haven't been malicious in any way, but that doesn't mean they can't be.


You need to be careful here, because we have a real tendency to get stuck in local maxima with technology. For instance, the QWERTY keyboard layout exists to prevent typewriter keys from jamming, but we're stuck with it because it's the "standardized solution" and you can't really buy a non-QWERTY keyboard without getting into the enthusiast market.

I do agree changing things for the sake of change isn't a good thing, but we should also be afraid of being stuck in a rut


I agree with you, but I'm completely aware that the point you're making is the same point that's causing the problem.

"Stuck in a rut" is a matter of perspective. A good marketer can make even the most established best practice be perceived as a "rut", that's the first step of selling someone something: convince them they have a problem.

It's easy to get a non-QWERTY keyboard. I'm typing on a split ortholinear one now. I'm sure we agree it would not be productive for society if 99% of regular QWERTY keyboards deviated a little in search of that new innovation that will turn their company into the next Xerox or Hoover or Google. People need some stability to learn how to make the most of new features.

Technology evolves in cycles: there's a boom of innovation and mass adoption which inevitably levels out into stabilisation and maturity. It's probably time for browser vendors to accept that it's time to transition into stability and maturity. The cost of not doing that is that things like adblockers, NoScript, justthebrowser, etc. will gain popularity and remove any anti-consumer innovations they try. Maybe they'll get to a position where they realise their "innovative" features are being disabled by so many users that it makes sense to shift dev spending to maintenance and improvement of existing features, instead of "innovation".


> For instance, the QWERTY keyboard layout exists to prevent typewriter keys from jamming, but we're stuck with it because it's the "standardized solution" and you can't really buy a non-QWERTY keyboard without getting into the enthusiast market.

So, we are "stuck" with something that apparently seems to work fine for most people, and when it doesn't there is an option to also use something else?

Not sure if that's a great example

Sometimes good enough is just good enough


> the QWERTY keyboard layout exists to prevent typewriter keys from jamming

Even if it is true (is it a myth, by any chance?), it does not mean that alternatives are better at, say, typing speed.


As someone who makes my own keyboard firmware, 100% agree. For most people, typing speed isn't a bottleneck. There is a whole community of people who type faster than 250 wpm on custom, chording-enabled keyboards. The tradeoff is that it takes years to relearn how to type. It's the same as being a stenographer at that point. It's not worth it for most people.

Even if there were a new layout that suddenly allowed everyone to type twice as fast, what would we get from it? Maybe twice as many social media posts, but nothing actually useful.


I'd imagine at this point that most social media posts are made by swiping or tapping a phone's virtual keyboard (if one is used at all).

One doesn't need to be a scientist; just take a look at your own hands and fingers to see that they are not crooked to the left. An ortholinear keyboard would be objectively better, even with the same keymap as QWERTY, but we don't produce those for the masses, for a variety of reasons. Same with many other ideas.

> to see that they are not crooked to the left

How does that make ortholinear keyboards better?


If I recall correctly, QWERTY was designed to minimize jamming. The myth is that it was designed to slow people down.

Whether it does slow people down as a side effect is not as well established, since, as another person pointed out, typing speed isn't the bottleneck for most people; learning the layout and figuring out what to write is. On top of that, most of the claims for faster layouts come from marketing materials. That doesn't mean they are wrong, but there is a vested interest.

If there was a demonstrably much faster input method for most users, I suspect it would have been adopted long ago.


It's been debunked by both research (no such mention at the time) and practice on extant machines.

These days QWERTY keyboards are optimal because programs, programming languages and text formats are optimized for QWERTY keyboards.

Depends on the language, no? QWERTY isn't great for APL.

I have a QWERTZ keyboard!

Is my digital life at a natural end now?


If you mean the default German keyboard layout then yes, putting backslashes, braces, and brackets behind AltGr makes it sub-optimal in my book. Thankfully, what's printed on the keys is not that important, so you too can have a QWERTY keyboard if you want.

Apparently there is a charging circuit, because the battery will run out long before the fluid does

This is the brand I usually use. https://www.off-stamp.com

It has a separate magnetically attached battery/charging unit. I have to charge 5-6 times per "tank" that's attached. The battery side also has a mini-LED display showing animations and the battery/juice left, so it's actually communicating with the tank side. A kit with battery and tank runs me about $25, but the tank alone is about $20, so they add $5 to cover the battery/charging component. It's a vice, but at least with this brand I'm not throwing away batteries weekly.


I always felt those Off-Stamps were at least a bit better than other disposables since the battery portion was at least reusable.

Hey man, RISC architecture is gonna change everything

RISC is good.
