[HN Gopher] Tools: Code Is All You Need
___________________________________________________________________
Tools: Code Is All You Need
Author : Bogdanp
Score : 252 points
Date : 2025-07-03 10:51 UTC (12 hours ago)
(HTM) web link (lucumr.pocoo.org)
(TXT) w3m dump (lucumr.pocoo.org)
| mritchie712 wrote:
| > try completing a GitHub task with the GitHub MCP, then repeat
| it with the gh CLI tool. You'll almost certainly find the latter
| uses context far more efficiently and you get to your intended
| results quicker.
|
| This is spot on. I have a "devops" folder with a CLAUDE.md with
| bash commands for common tasks (e.g. find prod / staging logs
| with this integration ID).
|
| When I complete a novel task (e.g. count all the rows that were
| synced from stripe to duckdb) I tell Claude to update CLAUDE.md
| with the example. The next time I ask a similar question, Claude
| one-shots it.
|
| This is the first few lines of the CLAUDE.md:
|
| This file provides guidance to Claude Code (claude.ai/code)
| when working with code in this repository.
|
| ## Purpose
|
| This devops folder is dedicated to Google Cloud Platform
| (GCP) operations, focusing on:
| - Google Cloud Composer (Airflow) DAG management and
|   monitoring
| - Google Cloud Logging queries and analysis
| - Kubernetes cluster management (GKE)
| - Cloud Run service debugging
|
| ## Common DevOps Commands
|
| ### Google Cloud Composer
|
| ```bash
| # View Composer environment details
| gcloud composer environments describe meltano \
|     --location us-central1 --project definite-some-id
|
| # List DAGs in the environment
| gcloud composer environments storage dags list \
|     --environment meltano --location us-central1 \
|     --project definite-some-id
|
| # View DAG runs
| gcloud composer environments run meltano \
|     --location us-central1 dags list
|
| # Check Airflow logs
| gcloud logging read \
|     'resource.type="cloud_composer_environment" AND resource.labels.environment_name="meltano"' \
|     --project definite-some-id --limit 50
| ```
| lsaferite wrote:
| Just as a related aside, you could literally make that bottom
| section into a super simple stdio MCP Server and attach that to
| Claude Code. Each of your operations could be a tool and have a
| well-defined schema for parameters. Then you are giving the LLM
| a more structured and defined way to access your custom
| commands. I'm pretty positive there are even pre-made MCP
| Servers that are designed for just this activity.
|
| Edit: First result when looking for such an MCP Server:
| https://github.com/inercia/MCPShell
| gbrindisi wrote:
| wouldn't this defeat the point? Claude Code already has
| access to the terminal; adding specific instructions in
| the context is enough
| lsaferite wrote:
| No. You are giving textual instructions to Claude in the
| hopes that it correctly generates a shell command for you
| vs giving it a tool definition with a clearly defined
| schema for parameters and your MCP Server is, presumably,
| enforcing adherence to those parameters BEFORE it hits your
| shell. You would be helping Claude in this case as you're
| giving a clearer set of constraints on operation.
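The parameter-enforcement idea above can be sketched in a few lines. This is an illustrative stdio-style tool wrapper, not the MCPShell API; the tool name, parameter spec, and `run_tool` helper are all made up for the example. The point is that parameters are validated against a declared spec before anything reaches the shell:

```python
import subprocess

# Hypothetical tool registry: each tool declares its parameters
# (a JSON-Schema-like spec, simplified to name -> type here) and
# how to build the vetted argv from validated parameters.
TOOLS = {
    "composer_describe": {
        "params": {"location": str, "project": str},
        "argv": lambda p: ["gcloud", "composer", "environments",
                           "describe", "meltano",
                           "--location", p["location"],
                           "--project", p["project"]],
    },
}

def run_tool(name, params, dry_run=False):
    """Validate params against the tool's spec BEFORE touching the shell."""
    spec = TOOLS[name]
    expected = spec["params"]
    unknown = set(params) - set(expected)
    missing = set(expected) - set(params)
    if unknown or missing:
        raise ValueError(f"bad params: unknown={unknown}, missing={missing}")
    for key, typ in expected.items():
        if not isinstance(params[key], typ):
            raise TypeError(f"{key} must be {typ.__name__}")
    argv = spec["argv"](params)
    if dry_run:
        return argv  # return the vetted command instead of executing it
    return subprocess.run(argv, capture_output=True, text=True).stdout
```

An LLM that emits a malformed or extraneous parameter gets a structured error back instead of a shell invocation, which is the constraint being argued for here.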
| fassssst wrote:
| Either way it is text instructions used to call a
| function (via a JSON object for MCP or a shell command
| for scripts). What works better depends on how the model
| you're using was post trained and where in the prompt
| that info gets injected.
| wrs wrote:
| Well, with MCP you're giving textual instructions to
| Claude in hopes that it correctly generates a tool call
| for you. It's not like tool calls have access to some
| secret deterministic mode of the LLM; it's still just
| text.
|
| To an LLM there's not much difference between the list of
| sample commands above and the list of tool commands it
| would get from an MCP server. JSON and GNU-style args are
| very similar in structure. And presumably the command is
| enforcing constraints even better than the MCP server
| would.
| lsaferite wrote:
| Not strictly true. The LLM provider should be running
| constrained token selection based on the JSON schema of
| the tool call. That alone makes a massive difference, as
| you're already discarding invalid tokens during the
| completion at a low level. Now, if they had a BNF grammar
| for each CLI tool and enforced token selection based on
| that, you'd be much better off than with unconstrained
| token selection.
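The effect of constrained decoding can be shown with a toy mask. This is a deliberately simplified sketch, nothing like a provider's real implementation: the "grammar" is just an enumerated set of valid outputs, and candidate tokens that can no longer lead to any valid completion are discarded before sampling:

```python
def mask_tokens(prefix, candidates, allowed):
    """Keep only candidate tokens that can still extend `prefix`
    toward some string the (toy) grammar allows."""
    keep = []
    for tok in candidates:
        nxt = prefix + tok
        # A token survives only if some allowed string begins with
        # the extended prefix, i.e. a valid completion still exists.
        if any(s.startswith(nxt) for s in allowed):
            keep.append(tok)
    return keep

# Toy "grammar": the only two tool calls the schema permits.
ALLOWED = ['{"tool": "dags_list"}', '{"tool": "logs_read"}']
```

With `prefix = '{"tool": "'`, a candidate like `rm -rf` is masked out at selection time; only tokens consistent with a schema-valid call remain. A BNF grammar generalizes the same idea beyond an enumerated set.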
| jayd16 wrote:
| I feel like I'm taking crazy pills sometimes. You have a file
| with a set of snippets and you prefer to ask the AI to
| hopefully run them instead of just running them yourself?
| lreeves wrote:
| The commands aren't the special sauce, it's the analytical
| capabilities of the LLM to view the outputs of all those
| commands and correlate data or whatever. You could accomplish
| the same by prefilling a gigantic context window with all the
| logs but when the commands are presented ahead of time the
| LLM can "decide" which one to run based on what it needs to
| do.
| light_hue_1 wrote:
| Yes. I'm not the poster but I do something similar.
|
| Because now the capabilities of the model grow over time. And
| I can ask questions that involve a handful of those snippets.
| When we get to something new that requires some doing, it
| becomes another snippet.
|
| I can offload everything I used to know about an API and
| never have to think about it again.
| mritchie712 wrote:
| the snippets are examples. You can ask hundreds of variations
| of similar, but different, complex questions and the LLM can
| adjust the example for that need.
|
| I don't have a snippet for, "find all 500's for the meltano
| service for duckdb syntax errors", but it'd easily nail that
| given the existing examples.
| dingnuts wrote:
| but if I know enough about the service to write examples,
| most of the time I will know the command I want, which is
| less typing, faster, costs less, and doesn't waste a ton of
| electricity.
|
| In the other cases I see what the computer outputs, LEARN,
| and then the functionality of finding what I need just
| isn't useful next time. Next time I just type the command.
|
| I don't get it.
| loudmax wrote:
| LLMs are really good at processing vague descriptions of
| problems and offering a solution that's reasonably close
| to the mark. They can be a great guide for unfamiliar
| tools.
|
| For example, I have a pretty good grasp of regular
| expressions because I'm an old Perl programmer, but I
| find processing json using `jq` utterly baffling. LLMs
| are great at coming up with useful examples, and
| sometimes they'll even get it perfect the first time.
| I've learned more about properly using `jq` with the help
| of LLMs than I ever did on my own. Same goes for
| `ffmpeg`.
|
| LLMs are not a substitute for learning. When used
| properly, they're an enhancement to learning.
|
| Likewise, never mind the idiot CEOs of failing companies
| looking forward to laying off half their workforce and
| replacing them with AI. When properly used, AI is a tool
| to help people become more productive, not replace human
| understanding.
| qazxcvbnmlp wrote:
| You don't ask the AI to run the commands. You say "build
| and test this feature" and then the AI correctly iterates
| back and forth between the build and test commands until
| the thing works.
| chriswarbo wrote:
| I use a similar file, but just for myself (I've never used an
| LLM "agent"). I live in Emacs, but this is the only thing I use
| org-mode for: it lets me fold/unfold the sections, and I can
| press C-c C-c over any of the code snippets to execute it. Some
| of them are shell code, some of them are Emacs Lisp code which
| generates shell code, etc.
| stpedgwdgfhgdd wrote:
| I do something similar, but the problem is that claude.md keeps
| on growing.
|
| To tackle this, I converted a custom prompt into an
| application, but there is an interesting trade-off. The
| application is deterministic. It cannot deal with unknown
| situations. In contrast to CC, which is way slower, but can try
| alternative ways of dealing with an unknown situation.
|
| I ended up with adding an instruction to the custom command to
| run the application and fix the application code (TDD) if there
| is a problem. Self healing software... who ever thought
| e12e wrote:
| You're letting the LLM execute privileged API calls against
| your production/test/staging environment, just hoping it won't
| corrupt something, like truncate logs, files, databases etc?
|
| Or are you asking it to provide example commands that you can
| sanity check?
|
| I'd be curious to see some more concrete examples.
| mindwok wrote:
| More appropriately: the terminal is all you need.
|
| I have used MCP daily for a few months. I'm now down to a single
| MCP server: terminal (iTerm2). I have OpenAPI specs on hand if I
| ever need to provide them, but honestly shell commands and curl
| get you pretty damn far.
| jasonthorsness wrote:
| I never knew how far it was possible to go in bash shell with
| the built-in tools until I saw the LLMs use them.
| zahlman wrote:
| Possibly because most people who could mentor you, would give
| up and switch to their preference of {Perl, Python, Ruby,
| PHP, ...} far earlier.
|
| (Check out Dave Eddy, though. https://github.com/bahamas10 ;
| also occasionally streams on YouTube and then creates short
| educational video content there:
| https://www.youtube.com/@yousuckatprogramming )
| pclowes wrote:
| Directionally I think this is right. Most LLM usage at scale
| tends to be filling the gaps between two hardened interfaces. The
| reliability comes not from the LLM inference and generation but
| the interfaces themselves only allowing certain configuration to
| work with them.
|
| LLM output is often coerced back into something more
| deterministic such as types, or DB primary keys. The value of the
| LLM is determined by how well your existing code and tools model
| the data, logic, and actions of your domain.
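That coercion step might look like the following. A hedged sketch only: the `LogQuery` shape and field names are invented for illustration, but it shows the pattern of forcing free-form LLM output back into a typed value that the rest of the system can trust:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class LogQuery:
    service: str
    status: int       # a real int, not whatever string the LLM emitted
    limit: int = 50

def coerce(llm_output: str) -> LogQuery:
    """Force LLM text back into a deterministic, typed structure."""
    raw = json.loads(llm_output)          # fails loudly on non-JSON
    query = LogQuery(service=str(raw["service"]),
                     status=int(raw["status"]),
                     limit=int(raw.get("limit", 50)))
    # Domain rule enforced by code, not by the model's output:
    if not (100 <= query.status <= 599):
        raise ValueError(f"not an HTTP status: {query.status}")
    return query
```

Everything downstream of `coerce` is deterministic; the LLM's contribution is confined to filling the gap between two hardened interfaces, as described above.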
|
| In some ways I view LLMs today a bit like 3D printers, both in
| terms of hype and in terms of utility. They excel at quickly
| connecting parts similar to rapid prototyping with 3d printing
| parts. For reliability and scale you want either the LLM or an
| engineer to replace the printed/inferred connector with something
| durable and deterministic (metal/code) that is cheap and fast to
| run at scale.
|
| Additionally, there was a minute during the 3D-printer
| Gartner hype cycle when there were notions that we would
| all just print substantial amounts of consumer goods, when
| the reality is that the high-utility use cases are much
| narrower. There is a corollary
| here to LLM usage. While LLMs are extremely useful we cannot rely
| on LLMs to generate or infer our entire operational reality or
| even engage meaningfully with it without some sort of pre-
| existing digital modeling as an anchor.
| abdulhaq wrote:
| this is a really good take
| foobarbecue wrote:
| Hype cycle for drones and VR was similar -- at the peak, you
| have people claiming drones will take over package delivery and
| everyone will spend their day in VR. Reality is that the
| applicability is more narrow.
| jangxx wrote:
| I mean both of these things are actually happening (drone
| deliveries and people spending a lot of time in VR), just at
| a much much smaller scale than it was hyped up to be.
| giovannibonetti wrote:
| Drones and VR require significant upfront hardware
| investment, which curbs adoption. On the other hand,
| adopting LLM-as-a-service has none of these costs, so no
| wonder so many companies are getting involved with it so
| quickly.
| nativeit wrote:
| Right, but abstract costs are still costs to _someone_ ,
| so how far does that go before mass adoption turns into a
| mass liability for whomever is ultimately on the hook? It
| seems like there is this extremely risky wager that
| everyone is playing--that LLM's will find their "killer
| app" before the real costs of maintaining them becomes
| too much to bear. I don't think these kinds of bets often
| pay off. The opposite actually, I think every truly
| revolutionary technological advance in the contemporary
| timeframe has arisen out of its very obvious killer
| app(s), they were in a sense inevitable. Speculative tech
| --the blockchain being one of the more salient and
| frequently tapped examples--tends to work in pretty clear
| bubbles, in my estimation. I've not yet been convinced
| this one is any different, aside from the absurd scale at
| which it has been cynically sold as the biggest thing
| since Gutenberg, but while that makes it somewhat
| distinct, it's still a rather poor argument against it
| being a bubble.
| pxc wrote:
| A parallel outcome for LLMs sounds realistic to me.
| deadbabe wrote:
| If it's not happening at the scale it was pitched, then
| it's not happening.
| jangxx wrote:
| This makes no sense, just because something didn't become
| as big as the hypemen said it would doesn't make the
| inventions or users of those inventions disappear.
| deadbabe wrote:
| For something to be considered "happening" you can't just
| have a handful of localized examples. It has to be
| happening at a large noticeable scale that even people
| unfamiliar with the tech are noticing. Then you can say
| it's "happening". Otherwise, it's just smaller groups of
| people doing stuff.
| falcor84 wrote:
| Considering what we've been seeing in the Russia-Ukraine
| and Iran-Israel wars, drones are definitely happening at
| scale. For better or for worse, I expect worldwide
| production of drones to greatly expand over the coming
| years.
| soulofmischief wrote:
| That's the claim for AR, not VR, and you're just noticing how
| research and development cycles play out, you can draw
| comparisons to literally any technology cycle.
| 65 wrote:
| That is in fact the claim for VR. Remember the Metaverse?
| Oculus headsets are VR headsets. The Apple Vision Pro is a
| VR headset.
| mumbisChungo wrote:
| The metaverse is and was a guess at how the children of
| today might interact as they age into active market
| participants. Like all these other examples, speculative
| mania preceded genuine demand and it remains to be seen
| whether it plays out over the coming 10-15 years.
| sizzle wrote:
| Ahh yes let's get the next generation addicted to literal
| screens strapped to their eyeballs for maximum
| monetization, humanity be damned. Glad it's a failing
| bet. Now sex bots might be onto something...
| mumbisChungo wrote:
| It may or may not be a failing bet. Maybe smartphones are
| the ultimate form of human-data interface and we'll
| simply never do better.
| jrm4 wrote:
| I'll take your argument a bit further. The thing is --
| "human-data" interfaces are not particularly important.
| Human-Human ones are. This is probably why it's going to
| be difficult, if not impossible, to beat the smartphone;
| VR or whatever doesn't fundamentally "bring people closer
| together" in a way the smartphone nearly absolutely did.
| mumbisChungo wrote:
| VR may not, but social interaction with AR might be more
| palatable and better UX than social interaction while
| constantly looking down at at a computer we still call a
| "phone" for some reason.
| outworlder wrote:
| > The Apple Vision Pro is a VR headset.
|
| For some use cases it is indeed used for VR. But it has
| AR applications and all the necessary hardware and
| software.
| threatofrain wrote:
| Good drones are very Chinese atm, as is casual consumer drone
| delivery. Americans might be more than a decade away even
| with concerted bipartisan war-like effort to boost domestic
| drone competency.
|
| The reality is Chinese.
| sarchertech wrote:
| Aren't people building DIY drones that are close to and in
| some cases superior to off the shelf Chinese drones?
| threatofrain wrote:
| Off the shelf Chinese drones is somewhat vague, we can
| just say DJI. Their full drone and dock system for the
| previous generation goes for around $20k. DJI iterates on
| this space on a yearly cadence and have just come out
| with the Dock 3.
|
| 54 minute flight time (47 min hover) for fully unmanned
| operations.
|
| If you're talking about fpv racing where tiny drones fly
| around 140+ mph, then yeah DJI isn't in that space.
| sarchertech wrote:
| That hardly seems like it would take the US 10 years to
| replicate on a war footing aside from the price.
|
| I mean if we're talking dollar to dollar comparison, the
| US will likely never be able to produce something as
| cheaply as China (unless China drastically increases
| their average standard of living).
| tonyarkles wrote:
| There's a really weird phenomenon too with drones. I've
| used Chinese (non-drone) software for work a bunch in the
| past and it's been almost universally awful. On the drone
| side, especially DJI, they've flipped this script
| completely. Every non-DJI drone I've flown has had
| miserable UX in comparison to DJI. Mission Planner (open
| source, as seen in the Ukraine attack videos) is super
| powerful but also looks like ass and functions similarly.
| QGC is a bit better, especially the vendor-customized
| versions (BSD licensed) but the vendors almost always
| neuter great features that are otherwise available in the
| open source version and at the same time modify things so
| that you can't talk to the aircraft using the OSS
| version. The commercial offerings I've used are no
| better.
|
| Sure, we need to be working on being able to build the
| hardware components in North America, and I've seen a
| bunch of people jump on that in the last year. But wow is
| the software ever bad and I haven't really seen anyone
| working to improve that.
| ivape wrote:
| You checked out drone warfare? It's all the rage in every
| conflict at the moment. The hype around drones is not fake,
| and I'd compare it more to autonomous cars because regulation
| is the only reason you don't see a million private drones
| flying around.
| dazed_confused wrote:
| Yes, to an extent, but I would say that is an extension of
| artillery and long-range fire capabilities.
| jmj wrote:
| As is well known, AI is whatever hasn't been done yet.
| golergka wrote:
| People claimed that we would spend most of our day on the
| internet in the mid-90s, and then the dotcom bubble burst.
| And then people claimed that by 2015 robo-taxis would be
| around all the major cities of the planet.
|
| You can be right but too early. There was a hype wave for
| drones and VR (more than one for the latter one), but I
| wouldn't be so sure that it's peak of their real world usage
| yet.
| skeeter2020 wrote:
| >> You can be right but too early.
|
| Unless opportunity cost is zero this is a variation on
| being wrong.
| whiplash451 wrote:
| Interesting take but too bearish on LLMs in my opinion.
|
| LLMs have already found large-scale usage (deep research,
| translation) which makes them more ubiquitous today than 3D
| printers ever will or could have been.
| dingnuts wrote:
| And yet you didn't provide a single reference link! Every
| case of LLM usage that I've seen claimed about those things
| has been largely a lie -- guess you won't take the
| opportunity to be the first to present a real example. Just
| another rumor.
| whiplash451 wrote:
| My reference is the daily usage of chatgpt around me
| (outside of tech circles).
|
| I don't want to sound like a hard-core LLM believer. I get
| your point and it's fair.
|
| I just wanted to point out that the current usage of
| chatgpt is a lot broader than that of 3D printers even at
| the peak hype of it.
| dingnuts wrote:
| Outside of tech circles it looks like NFTs: people
| following hype using tech they don't understand which
| will be popular until the downsides we're aware of that
| they are ignorant to have consequences, and then the
| market will reflect the shift in opinion.
| whiplash451 wrote:
| I see it differently: people are switching to chatgpt
| like they switched to google back in 2005 (from whatever
| alternative existed back then)
|
| And I mean random people, not tech circles
|
| It's very different from NFTs in that respect
| basch wrote:
| No way.
|
| Everybody under a certain age is using ChatGPT, where
| they were once using search and friendship/expertise.
| It's the number 1 app in the App Store. Copilot use in
| the enterprise is so seamless, you just talk to
| PowerPoint or Outlook and it formulates what you were
| supposed to make or write.
|
| It's not a fad, it is a paradigm change.
|
| People don't need to understand how it works for it to
| work.
| lotsoweiners wrote:
| > It's the number 1 app in the App Store.
|
| When I checked the iOS App Store just now, something
| called Love Island USA is the #1 free app. Kinda makes
| you think....
| dingnuts wrote:
| I know it's popular; that doesn't mean it's not a fad.
| Consequences take time. It's easy to use but once you get
| burned in a serious way by the bot that's still wrong 20%
| of the time, you'll become more reluctant to put your
| coin in the slot machine.
|
| Maybe if the AI companies start offering refunds for
| wrong answers, then the price per token might not be such
| a scam.
| jrm4 wrote:
| Not even remotely in the same universe; the difference is
| ChatGPT is actually having an impact, people are
| incorporating it day-to-day in a way that NFTs never
| stood much of a chance.
| retsibsi wrote:
| Even if the most bearish predictions turn out to be
| correct, the comparison of LLMs to NFTs is a galaxy-
| spanning stretch.
|
| NFTs are about as close to literally useless as it gets,
| and that was always obvious; 99% of the serious attention
| paid to them came from hustlers and speculators.
|
| LLMs, for all their limitations, are already good at some
| things and useful in some ways. Even in the areas where
| they are (so far) too unreliable for serious use, they're
| not pure hype and bullshit; they're doing things that
| would have seemed like magic 10 years ago.
| benreesman wrote:
| What we call an LLM today (by which almost everyone means
| an autoregressive language model from the Generative
| Pretrained Transformer family tree, and BERTs are still
| doing important work, believe that) is actually an
| offshoot of neural machine translation.
|
| This isn't (intentionally at least) mere HN pedantry: they
| really do act like translation tools in a bunch of observable
| ways.
|
| And while they have recently crossed the threshold into
| "yeah, I'm always going to have a gptel buffer open now"
| territory at the extreme high end, their utility outside of
| the really specific, totally non-generalizing code-lookup
| gizmo use case remains a claim unsupported by robust
| profits.
|
| There is a hole in the ground into which something between
| 100 billion and a trillion dollars has gone, and so far it
| has about 20B in revenue (not profit) coming out annually.
|
| AI is going to be big (it was big ten years ago).
|
| LLMs? Look more and more like the Metaverse every day as
| concerns the economics.
| rapind wrote:
| > There is a hole in the ground where something between 100
| billion and a trillion dollars in the ground that so far
| has about 20B in revenue (not profit) going into it
| annually.
|
| This is a concern for me. I'm using claude-code daily and
| find it very useful, but I'm expecting the price to
| continue getting jacked up. I do want to support Anthropic,
| but they might eventually need to cross a price threshold
| where I bail. We'll see.
|
| I expect at some point the more open models and tools will
| catch up when the expensive models like ChatGPT plateau
| (assuming they do plateau). Then we'll find out if these
| valuations measure up to reality.
|
| Note to the Hypelords: It's not perfect. I need to read
| every change and intervene often enough. "Vibe coding" is
| nonsense as expected. It is definitely good though.
| benreesman wrote:
| Vibe coding is nonsense, and it's really kind of
| uncomfortable to realize that a bunch of people you had
| tons of respect for are either ignorant or
| dishonest/bought enough to say otherwise. There's a cold
| wind blowing and the bunker-building crowd, well let's
| just say I won't shed a tear.
|
| You don't stock antibiotics and bullets in a survival
| compound because you think that's going to keep out a
| paperclip optimizer gone awry. You do that in the forlorn
| hope that when the guillotines come out that you'll be
| able to ride it out until the Nouveau Regime is in a
| negotiating mood. But they never are.
| juped wrote:
| I'm just taking advantage and burning VCs' money on
| useful but not world-changing tools while I still can.
| We'll come out of it with consumer-level okay tools even
| if they don't reach the levels of Claude today, though.
| strgcmc wrote:
| As a thought-exercise -- assume models continue to
| improve, whereas "using claude-code daily" is something
| you choose to do because it's useful, but is not yet at
| the level of "absolute necessity, can't imagine work
| without it". What if it does become, that level of
| absolute necessity?
|
| - Is your demand inelastic at that point, if having
| claude-code becomes effectively required, to sustain your
| livelihood? Does pricing continue to increase, until it's
| 1%/5%/20%/50% of your salary (because hey, what's the
| alternative? if you don't pay, then you won't keep up
| with other engineers and will just lose your job
| completely)?
|
| - But if tools like claude-code become such a necessity,
| wouldn't enterprises be the ones paying? Maybe, but maybe
| like health-insurance in America (a uniquely dystopian
| thing), your employer may pay some portion of the
| premiums, but they'll also pass some costs to you as the
| employee... Tech salaries have been cushy for a while
| now, but we might be entering a "K-shaped" inflection
| point --> if you are an OpenAI elite researcher, then you
| might get a $100M+ offer from Meta; but if you are an
| average dev doing average enterprise CRUD, maybe your
| wages will be suppressed because the small cabal of LLM
| providers can raise prices and your company HAS to pay,
| which means you HAVE to bear the cost (or else what? you
| can quit and look for another job, but who's hiring?)
|
| This is a pessimistic take of course (and vastly
| oversimplified / too cynical). A more positive outcome
| might be, that increasing quality of AI/LLM options leads
| to a democratization of talent, or a blossoming of "solo
| unicorns"... personally I have toyed with calling this,
| something like a "techno-Amish utopia", in the sense that
| Amish people believe in self-sufficiency and are not
| wholly-resistant to technology (it's actually quite
| clever, what sorts of technology they allow for
| themselves or not), so what if we could take that
| further?
|
| If there was a version of that Amish-mentality of
| loosely-federated self-sufficient communities (they have
| newsletters! they travel to each other! but they largely
| feed themselves, build their own tools, fix their own
| fences, etc.!), where engineers + their chosen LLM
| partner could launch companies from home, manage their
| home automation / security tech, run a high-tech small
| farm, live off-grid from cheap solar, use excess
| electricity to Bitcoin mine if they choose to, etc....
| maybe there is actually a libertarian world that can
| arise, where we are no longer as dependent on large
| institutions to marshal resources, deploy capital, scale
| production, etc., if some of those things are more in-
| reach for regular people in smaller communities, assisted
| by AI. This of course assumes that, the cabal of LLM
| model creators can be broken, that you don't need to pay
| for Claude if the cheaper open-source-ish Llama-like
| alternative is good enough
| rapind wrote:
| Well my business doesn't rely on AI as a competitive
| advantage, at least not yet anyways. So as it stands, if
| claude got 100x as effective, but cost 100x more, I'm not
| sure I could justify the cost because my market might
| just not be large enough. Which means I can either ditch
| it (for an alternative if one exists) or expand into
| other markets... which is appealing but a huge change
| from what I'm currently doing.
|
| As usual, the answer is "it depends". I guarantee though
| that I'll at least start looking at alternatives when
| there's a huge price hike.
|
| Also I suspect that a 100x improvement (if even possible)
| wouldn't just cost 100 times as much, but probably
| 100,000+ times as much. I also suspect than an
| improvement of 100x will be hyped as an improvement of
| 1,000x at least :)
|
| Regardless, AI is really looking like a commodity to me.
| While I'm thankful for all the investment that got us
| here, I doubt anyone investing this late in the game at
| these inflated numbers are going to see a long term
| return (other than ponzi selling).
| sebzim4500 wrote:
| >LLMs? Look more and more like the Metaverse every day as
| concerns the economics.
|
| ChatGPT has 800M+ weekly active users how is that
| comparable to the Metaverse in any way?
| benreesman wrote:
| I said as concerns the economics. It's clearly more
| popular than the Oculus or whatever, but it's still a
| money bonfire and shows no signs of changing on that
| front.
| threetonesun wrote:
| LLMs as we know them via ChatGPT were a way to disrupt
| the search monopoly Google had for so many years. And my
| guess is the reason Google was in no rush to jump into
| that market was because they knew the economics of it
| sucked.
| benreesman wrote:
| Right, and inb4 ads on ChatGPT to stop the bleeding.
| That's the default outcome at this point: quantize it
| down gradually to the point where it can be ad supported.
|
| You can just see the scene from the Sorkin film where
| Fidji is saying to Altman: "Its time to monetize the
| site."
|
| "We don't even know what it is yet, we know that it is
| cool."
| datameta wrote:
| Without trying to take away from your assertion, I think it
| is worthwhile to mention that part of this phenomenon is the
| unavoidable matter of meatspace being expensive and dataspace
| being intangibly present everywhere.
| deadbabe wrote:
| large scale usage in niche domains is still small scale
| overall.
| kibwen wrote:
| No, 3D printers are the backbone of modern physical
| prototyping. They're far more important to today's global
| economy than LLMs are, even if you don't have the vantage
| point to see it from your sector. That might change in the
| future, but snapping your fingers to wink LLMs out of
| existence would change essentially nothing about how the
| world works today; it would be a non-traumatic non-event.
| There just hasn't been time to integrate them into any
| essential processes.
| whiplash451 wrote:
| > snapping your fingers to wink LLMs out of existence would
| change essentially nothing about how the world works today
|
| One could have said the same thing about Google in 2006
| kibwen wrote:
| No, not even close. By 2006 all sorts of load-bearing
| infrastructure was relying on Google (e.g. Gmail). Today
| LLMs are still on the edge of important systems, rather
| than underlying those systems.
| johnsmith1840 wrote:
| Things like BERT are load-bearing structures in data
| science pipelines.
|
| I assume there are massive number of LLM analysis
| pipelines out there.
|
| I suppose it depends if you consider non determinist
| DS/ML pipelines "loadbearing" or not. Most are not using
| LLMs though.
|
| 3D parts regularly are used beyond prototyping though as
| tooling for a small company can be higher than just metal
| 3D parts. So I do somewhat agree but the loss of
| productivity in software prototyping would be a massive
| hit if LLMs vanished.
| nativeit wrote:
| [citation needed]
| skeeter2020 wrote:
| The author is not bearish on LLMs at all; this post is about
| using LLMs and code vs. LLMs with autonomous tools via MCP.
| An example from your set would be translation. The author
| says you'll get better results if you do something like ask
| an LLM to translate documents, review the proposed approach,
| ask it to review it's work and maybe ask another LLM to
| validate the results than if you say "you've got 10K
| documents in English, and these tools - I speak French"
| hk1337 wrote:
| > Directionally I think this is right.
|
| We have a term at work we use called, "directionally accurate",
| when it's not entirely accurate but headed in the right
| direction.
| graerg wrote:
| > This is a significant advantage that an MCP (Multi-Component
| Pipeline) typically cannot offer
|
| Oh god please no, we must stop this initialism. We've gone too
| far.
| the_mitsuhiko wrote:
| It's the wrong acronym. I wrote this blog post on the bike and
| used an LLM to fix up the dictation that I did. While I did
| edit it heavily and rewrote a lot of things, I did not end up
| noticing that my LLM expanded MCP incorrectly. It's Model
| Context Protocol.
| apgwoz wrote:
| And you shipped it to production. Just like real agentic
| coding! Nice!
| the_mitsuhiko wrote:
| Which I don't feel great about because I do not like to use
| LLMs for writing blog posts. I just really wanted to
| explore if I can write a blog post on my bike commute :)
| bitwize wrote:
| We're all in line to get de-rezzed by the MCP, one way or
| another.
| baal80spam wrote:
| Isn't it a bit like saying: a saw is all you need (for carpenters)?
|
| I mean, you _probably_ could make most furniture with only a saw,
| but why?
| nativeit wrote:
| In this analogy, do you have to design, construct, and learn
| from first principles to operate literally every other tool
| you'd like to use in addition to the saw?
| jasonthorsness wrote:
| Tools are constraints and time/token savers. Code is expensive in
| terms of tokens and harder to constrain in environments that
| can't be fully locked down because, for example, the task needs
| network access. You need code AND tools.
| blahgeek wrote:
| > Code is expensive in terms of tokens and harder to constrain
| in environments
|
| It's also true for human. But then we invented functions /
| libraries / modules
| webdevver wrote:
| what does "MCP" stand for?
| dangus wrote:
| I was about to say the same thing.
|
| It's bad writing practice to do this, even if you are assuming
| your followers are following you.
|
| Especially for a site like Twitter that has a login wall.
| jasonlotito wrote:
| I'm confused. It links to the definition in the first
| sentence, and I'm not sure what you mean by Twitter in this
| context.
| empath75 wrote:
| If you don't know what it is and can't be bothered to google
| it, then you probably aren't the audience for this.
| komali2 wrote:
| Microsoft Certified Professional, a very common certification.
|
| Oh wait... hm ;) perhaps the writing nerds had it right when
| they recommend always writing the full acronym out the first
| time it's used in an article, no matter how common one presumes
| it to be
| apgwoz wrote:
| Mashup Context Protocol, of course! There was a post the other
| day comparing MCP tools to the mashups of web 2.0. It's a much
| better acronym expansion.
| aidenn0 wrote:
| I didn't know either, but in the very first sentence, the
| author provides the expansion and a link to the Wikipedia page
| for it.
| vidarh wrote:
| I frankly use tools mostly as an auth layer for things where raw
| access is too big a footgun without a permissions step. So I give
| the agent the choice of asking for permission to do things via
| the shell, or going nuts without user-interaction via a tool that
| enforces reasonable limitations.
|
| Otherwise you can e.g just give it a folder of preapproved
| scripts to run and explain usage in a prompt.
| empath75 wrote:
| The problem with this is that you have to give your LLM basically
| unbounded access to everything you have access to, which is a
| recipe for pain.
| the_mitsuhiko wrote:
| Not necessarily. I have a small little POC agentic tool on my
| side which is fully sandboxed, and it's inherently "non prompt
| injectable" by the data that it processes, since it only ever
| passes that data through generated code.
|
| Disclaimer: it does not work well enough. But I think it shows
| great promise.
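| A minimal sketch of that pattern (toy guard, illustrative only, not
| the actual tool): the model is shown only a schema description and
| writes code against it; the code, never the model, touches the rows.

```python
import ast

def run_generated(code: str, data: list[dict]) -> object:
    """Run model-written code over untrusted rows. The model only ever sees
    a schema description, never the rows themselves, so prompt-injection
    payloads inside `data` can't reach the prompt. (The AST check below is
    a toy guard; a real sandbox needs far more than this.)"""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports not allowed in generated code")
    # Expose only the data and a tiny allowlist of builtins.
    scope = {"data": data,
             "__builtins__": {"len": len, "sum": sum, "min": min, "max": max}}
    exec(compile(tree, "<generated>", "exec"), scope)
    return scope.get("result")

# e.g. the model, told only that rows have an "amount" field, emits:
generated = "result = sum(row['amount'] for row in data)"
```

| The worst an injected row can do here is confuse the generated code;
| it cannot steer the model.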
| elyase wrote:
| This is similar to the tool call (fixed code & dynamic params) vs
| code generation (dynamic code & dynamic params) discussion: tools
| offer constraints and save tokens; code gives you flexibility. Some
| papers suggest that generating code is often superior, and this
| will likely become even more true as language models improve.
|
| [1] https://huggingface.co/papers/2402.01030
|
| [2] https://huggingface.co/papers/2401.00812
|
| [3] https://huggingface.co/papers/2411.01747
|
| I am working on a model that goes a step beyond and even makes
| the distinction between thinking and code execution unnecessary
| (it is all computation in the end), unfortunately no link to
| share yet
| simonw wrote:
| Something I've realized about LLM tool use: if you can reduce a
| problem to something that can be solved by an LLM in a sandbox
| using tools in a loop, you can brute force that problem.
|
| The job then becomes identifying those problems and figuring out
| how to configure a sandbox for them, what tools to provide and
| how to define the success criteria for the model.
|
| That still takes significant skill and experience, but it's at a
| higher level than chewing through that problem using trial and
| error by hand.
|
| My assembly Mandelbrot experiment was the thing that made this
| click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-
| in-x86-assem...
| rasengan wrote:
| Makes sense.
|
| I treat an LLM the same way I'd treat myself as it relates to
| context and goals when working with code.
|
| "If I need to do __________ what do I need to know/see?"
|
| I find that traditional tools, as per the OP, have become ever
| more powerful and useful in the age of LLMs (especially grep).
|
| Furthermore, LLMs are quite good at working with shell tools
| and functionalities (heredoc, grep, sed, etc.).
| dist-epoch wrote:
| I've been using a VM for a sandbox, just to make sure it won't
| delete my files if it goes insane.
|
| With some host data directories mounted read only inside the
| VM.
|
| This creates some friction though. Feels like a tool which runs
| the AI agent in a VM, but then copies its output to the host
| machine after some checks, would help, so that it would feel
| like you are running it natively on the host.
| jitl wrote:
| This is very easy to do with Docker. Not sure it you want the
| vm layer as an extra security boundary, but even so you can
| just specify the VM's docker api endpoint to spawn processes
| and copy files in/out from shell scripts.
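| A sketch of the mount layout this implies (the image name and the
| `agent` command are placeholders for whatever CLI agent you run
| inside the container):

```python
def agent_container_cmd(image: str, checkout: str, refdata: str) -> list[str]:
    """Build a `docker run` invocation where host reference data is mounted
    read-only and only a fresh checkout is writable. Pass the returned list
    to subprocess.run() to launch."""
    return ["docker", "run", "--rm",
            "--network", "none",          # optionally cut off network egress
            "-v", f"{refdata}:/ref:ro",   # host data: look, don't touch
            "-v", f"{checkout}:/work:rw", # disposable working copy
            "-w", "/work",
            image, "agent"]
```

| If the agent goes insane, the blast radius is the checkout, which you
| can diff and then copy back (or discard).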
| simonw wrote:
| Have you tried giving the model a fresh checkout in a read-
| write volume?
| dist-epoch wrote:
| Hmm, excellent idea, somehow I assumed that it would be
| able to do damage in a writable volume, but it wouldn't be
| able to exit it, it would be self-contained to that
| directory.
| nico wrote:
| > LLM in a sandbox using tools in a loop, you can brute force
| that problem
|
| Does this require using big models through their APIs and
| spending a lot of tokens?
|
| Or can this be done either with local models (probably very
| slow), or with subscriptions like Claude Code with Pro (without
| hitting the rate/usage limits)?
|
| I saw the Mandelbrot experiment, it was very cool, but still a
| rather small project, not really comparable to a
| complex/bigger/older code base for a platform used in
| production
| simonw wrote:
| The local models aren't quite good enough for this yet in my
| experience - the big hosted models (o3, Gemini 2.5, Claude 4)
| only just crossed the capability threshold for this to start
| working well.
|
| I think it's possible we'll see a local model that can do
| this well within the next few months though - it needs good
| tool calling, not an encyclopedic knowledge of the world.
| Might be possible to fit that in a model that runs locally.
| pxc wrote:
| There's a fine-tune of Qwen3 4B called "Jan Nano" that I
| started playing with yesterday, which is basically just
| fine-tuned to be more inclined to look things up via web
| searches than to answer them "off the dome". It's not good-
| good, but it does seem to have a much lower effective
| hallucination rate than other models of its size.
|
| It seems like maybe similar approaches could be used for
| coding tasks, especially with tool calls for reading man
| pages, info pages, running `tldr`, specifically consulting
| Stack Overflow, etc. Some of the recent small MoE models
| from Chinese companies are significantly smarter than
| models like Qwen 4B, but run about as quickly, so maybe on
| systems with high RAM or high unified memory, even with
| middling GPUs, they could be genuinely useful for coding if
| they are made to avoid doing anything without tool use.
| never_inline wrote:
| Wasn't there a tool calling benchmark by docker guys which
| concluded qwen models are nearly as good as GPT? What is
| your experience about it?
|
| Personally I am convinced JSON is a bad format for LLMs and
| code orchestration in python-ish DSL is the future. But
| local models are pretty bad at code gen too.
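| As a toy illustration of the difference (all names made up): a JSON
| plan costs one model round-trip per step, while code orchestration
| lets the model emit one program that chains the same steps locally.

```python
# The JSON style: each step is a separate tool dispatch the client interprets.
json_style = [
    {"tool": "search", "args": {"q": "rust"}},
    {"tool": "top", "args": {"n": 1}},
]

def run_plan(plan: list[dict], tools: dict) -> object:
    """Tiny interpreter for the JSON style: thread the previous step's
    result into the next tool call."""
    state = None
    for step in plan:
        state = tools[step["tool"]](state, **step["args"])
    return state

# The code style would instead be a single generated expression, e.g.
#   tools["top"](tools["search"](None, q="rust"), n=1)
```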
| nico wrote:
| > it needs good tool calling, not an encyclopedic knowledge
| of the world
|
| I wonder if there are any groups/companies out there
| building something like this
|
| Would love to have models that only know 1 or 2 languages
| (eg. python + js), but are great at them and at tool
| calling. Definitely don't need my coding agent to know all
| of Wikipedia and translating between 10 different languages
| johnsmith1840 wrote:
| Given 2 datasets:
|
| 1. A special code dataset
| 2. A bunch of "unrelated" books
|
| My understanding is that the model trained on just the
| first will never beat the model trained on both.
| Bloomberg model is my favorite example of this.
|
| If you can squirrel away special data, then that special
| data plus everything else will beat any other model.
| But that's basically what OpenAI, Google, and Anthropic
| are all currently doing.
| e12e wrote:
| I wonder if common lisp with repl and debugger could
| provide a better tool than your example with nasm wrapped
| via apt in Docker...
|
| Essentially just giving LLMs more state of the art systems
| made for incremental development?
|
| Ed: looks like that sort of exists:
| https://github.com/bhauman/clojure-mcp
|
| (Would also be interesting if one could have a few LLMs
| working together on red/green TDD approach - have an
| orchestrator that parse requirements, and dispatch a red
| goblin to write a failing test; a green goblin that writes
| code until the test pass; and then some kind of hobgoblin
| to refactor code, keeping test(s) green - working with the
| orchestrator to "accept" a given feature as done and move
| on to the next...
|
| With any luck the resulting code _might_ be a bit more
| transparent (stricter form) than other LLM code)?
| chamomeal wrote:
| That's super cool, I'm glad you shared this!
|
| I've been thinking about using LLMs for brute forcing problems
| too.
|
| Like LLMs kinda suck at typescript generics. They're
| surprisingly bad at them. Probably because it's easy to write
| generics that _look_ correct, but are then screwy in many
| scenarios. Which is also why generics are hard for humans.
|
| If you could have any LLM actually use TSC, it could run tests,
| make sure things are inferring correctly, etc. it could just
| keep trying until it works. I'm not sure this is a way to
| produce understandable or maintainable generics, but it would
| be pretty neat.
|
| Also, while typing this I realized that Cursor can see
| TypeScript errors. All I need are some utility testing types,
| and I could have cursor write the tests and then brute force
| the problem!
|
| If I ever actually do this I'll update this comment lol
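| A sketch of that brute-force loop (hedged: `Expect`/`Equal`-style
| assertion types from the type-challenges repo are one way to write
| the "tests"; the runner is injectable so nothing below depends on
| Node actually being installed):

```python
import pathlib
import subprocess
import tempfile

def typechecks(candidate: str, type_tests: str, runner=None) -> bool:
    """Append assertion types to a candidate generic and see whether tsc
    accepts the combined file. An agent can loop: generate -> typecheck ->
    feed the errors back -> regenerate, until this returns True."""
    source = candidate + "\n" + type_tests
    if runner is None:
        def runner(src: str) -> bool:
            with tempfile.TemporaryDirectory() as d:
                f = pathlib.Path(d, "check.ts")
                f.write_text(src)
                proc = subprocess.run(
                    ["npx", "tsc", "--noEmit", "--strict", str(f)],
                    capture_output=True, text=True)
                return proc.returncode == 0
    return runner(source)
```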
| vunderba wrote:
| _> The job then becomes identifying those problems and figuring
| out how to configure a sandbox for them, what tools to provide,
| and how to define the success criteria for the model._
|
| Your test case seems like a quintessential example where you're
| missing that last step.
|
| Since it is unlikely that you understand the math behind
| fractals or x86 assembly (apologies if I'm wrong on this), your
| only means for verifying the accuracy of your solution is a
| superficial visual inspection, e.g. "Does it look like the
| Mandelbrot series?"
|
| Ideally, your evaluation criteria would be expressed as a
| continuous function, but at the very least, it should take the
| form of a sufficiently diverse quantifiable set of discrete
| inputs and their expected outputs.
| simonw wrote:
| That's exactly why I like using Mandelbrot as a demo: it's
| perfect for "superficial visual inspection".
|
| With a bunch more work I could likely have got a vision LLM
| to do that visual inspection for me in the assembly example,
| but having a human in the loop for that was much more
| productive.
| shepherdjerred wrote:
| Are fractals or x86 assembly representative of most dev work?
| nartho wrote:
| I think it's irrelevant. The point they are trying to make
| is that anytime you ask an LLM for something that's outside of
| your area of expertise, you have very little to no way to
| ensure it is correct.
| diggan wrote:
| > anytime you ask an LLM for something that's outside of
| your area of expertise, you have very little to no way to
| ensure it is correct.
|
| I regularly use LLMs to code specific functions I don't
| necessarily understand the internals of. Most of the time
| I do that, it's something math-heavy for a game. Just
| like any function, I put it under automated and manual
| tests. Still, I review and try to gain some intuition
| about what is happening, but it is still very far outside my
| area of expertise, yet I can be sure it works as I expect
| it to.
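| That workflow can be made concrete: test invariants you do
| understand, even when you can't follow the math. A sketch, using a
| hypothetical model-written rotation helper of the kind described:

```python
import math
import random

# Suppose the model produced this 2-D rotation helper for a game:
def rotate(x: float, y: float, angle: float) -> tuple[float, float]:
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def check_rotation(n: int = 1000) -> bool:
    """Without following the trig, check properties any rotation must have:
    it preserves vector length, and rotating by zero is the identity."""
    rng = random.Random(0)  # seeded, so the check is reproducible
    for _ in range(n):
        x, y, a = (rng.uniform(-10, 10) for _ in range(3))
        rx, ry = rotate(x, y, a)
        assert math.isclose(math.hypot(rx, ry), math.hypot(x, y),
                            rel_tol=1e-9, abs_tol=1e-12)
    assert rotate(3.0, 4.0, 0.0) == (3.0, 4.0)
    return True
```

| The intuition-building the comment mentions happens while choosing
| which invariants to assert.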
| chrisweekly wrote:
| Giving LLMs the right context -- eg in the form of predefined
| "cognitive tools", as explored with a ton of rigor here^1 --
| seems like the way forward, at least to this casual observer.
|
| 1. https://github.com/davidkimai/Context-
| Engineering/blob/main/...
|
| (the repo is a WIP book, I've only scratched the surface but it
| seems pretty brilliant to me)
| skeeter2020 wrote:
| One of my biggest, ongoing challenges has been to get the LLM
| to use the tool(s) that are appropriate for the job. It feels
| like teaching your kids to, say, do laundry, when you want to
| just tell them to step aside and let you do it.
| FrustratedMonky wrote:
| I wonder if having 2 LLM's communicate will eventually be more
| like humans talking. With all the same problems.
| CuriouslyC wrote:
| I already have agents managing different repositories ask each
| other questions and make requests. It works pretty well for the
| most part.
| victorbjorklund wrote:
| I think the GitHub CLI example isn't entirely fair to MCP. Yes,
| GitHub's CLI is extensively documented online, so of course LLMs
| will excel at generating code for well-known tools. But MCP
| shines in different scenarios.
|
| Consider internal company tools or niche APIs with minimal online
| documentation. Sure, you could dump all the documentation into
| context for code generation, but that often requires more context
| than interacting with an MCP tool. More importantly, generated
| code for unfamiliar APIs is prone to errors so you'd need robust
| testing and retry mechanisms built into the process.
|
| With MCP, if the tools are properly designed and receive correct
| inputs, they work reliably. The LLM doesn't need to figure out
| API intricacies, authentication flows, or handle edge cases -
| that's already handled by the MCP server.
|
| So I agree MCP for GitHub is probably overkill but there are many
| legitimate use cases where pre-built MCP tools make more sense
| than asking an LLM to reverse-engineer poorly documented or
| proprietary systems from scratch.
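| A toy version of what "properly designed" buys you on the server
| side (hypothetical handler; real MCP servers validate arguments
| against a JSON Schema rather than Python types):

```python
def call_tool(schema: dict, handler, **kwargs):
    """Validate the model-supplied arguments before the handler runs, so
    malformed calls fail loudly instead of producing silent garbage."""
    for name, typ in schema.items():
        if name not in kwargs:
            raise ValueError(f"missing argument: {name}")
        if not isinstance(kwargs[name], typ):
            raise TypeError(f"{name} must be {typ.__name__}")
    return handler(**kwargs)
```

| This is the reliability the comment is pointing at: the LLM never has
| to learn the API's edge cases, because bad inputs bounce at the door.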
| light_hue_1 wrote:
| That's "handled" by the MCP server in the sense that it provides
| a simplified view of the world; it doesn't do authentication,
| etc.
|
| If that's what you wanted, you could have designed your poorly
| documented internal API differently to begin with. There's zero
| advantage to MCP in the scenario you describe, aside from
| convincing people that their original API is too hard to use.
| the_mitsuhiko wrote:
| > Sure, you could dump all the documentation into context for
| code generation, but that often requires more context than
| interacting with an MCP tool.
|
| MCP works exactly that way: you dump documentation into the
| context. That's how the LLM knows how to call your tool. Even
| for custom stuff I noticed that giving the LLM things to work
| with that it knows (eg: python, javascript, bash) beats it
| using MCP tool calling, and in some ways it wastes less
| context.
|
| YMMV, but I found the limit of tools available to be <15 with
| sonnet4. That's a super low amount. Basically the official
| playwright MCP alone is enough to fully exhaust your available
| tool space.
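| Back-of-the-envelope for why tool definitions crowd the context
| (~4 characters per token is a rough heuristic, and the schemas
| below are made up but sized like real ones):

```python
import json

def tool_context_cost(tools: list[dict]) -> int:
    """Rough token estimate for the tool definitions that get injected
    into every single request's context, used or not."""
    return len(json.dumps(tools)) // 4

# 25 fake tools, roughly the shape and size of a browser-automation server's.
playwright_like = [
    {"name": f"browser_tool_{i}",
     "description": "x" * 300,
     "inputSchema": {"type": "object",
                     "properties": {"selector": {"type": "string"}}}}
    for i in range(25)
]
```

| At thousands of tokens of standing overhead, a couple of servers can
| eat a meaningful slice of the window before the task even starts.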
| JyB wrote:
| I've never used that many. Does LLM performance
| collapse/degrade significantly because of too much initial
| context? It seems like updates to MCP implementations could
| easily solve that: only inject relevant servers for the given
| task based on the initial user prompt.
| the_mitsuhiko wrote:
| > I've never used that many.
|
| The playwright MCP alone introduces 25 tools into the
| context :(
| forrestthewoods wrote:
| Unpopular Opinion: I hate Bash. Hate it. And hate the ecosystem
| of Unix CLIs that are from the 80s and have the most obtuse,
| inscrutable APIs ever designed. Also this ecosystem doesn't work
| on Windows -- which, as a game dev, is my primary environment.
| And no, WSL does not count.
|
| I don't think the world needs yet another shell scripting
| language. They're all pretty mediocre at best. But maybe this is
| an opportunity to do something interesting.
|
| The Python environment is a clusterfuck, which UV is rapidly
| bringing into something somewhat sane. Python isn't the ultimate
| language.
| But I'd definitely be more interested in "replace yourself with a
| UV Python script" over "replace yourself with a shell script".
| Would be nice to see use this as an opportunity to do better than
| Bash.
|
| I realize this is unpopular. But unpopular doesn't mean wrong.
| lsaferite wrote:
| Python CAN be a "shell script" in this case though...
|
| Tool composition over stdio will get you very very far. That's
| what an interface "from the 80s" does for you 45 years later.
| That same stdio composability is easily piped into/through any
| number of cli tools written in any number of languages,
| compiled and interpreted.
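| The 45-year-old contract in its entirety: read lines on stdin,
| write lines on stdout. A filter in that shape (the field names are
| invented) drops into any pipeline next to grep and jq:

```python
import json
import sys

def filter_errors(lines):
    """Keep only JSON log lines whose severity field is ERROR."""
    return [ln for ln in lines if json.loads(ln).get("severity") == "ERROR"]

if __name__ == "__main__":
    # e.g.  gcloud logging read ... --format=json | python errors_only.py | wc -l
    sys.stdout.writelines(filter_errors(sys.stdin))
```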
| forrestthewoods wrote:
| Composing via stdio is so bloody terrible. Layers and layers
| of bullshit string parsing and encoding and decoding. Soooo
| many bugs. And utterly undebuggable. A truly miserable
| experience.
| zahlman wrote:
| And now you also understand many of the limitations LLMs
| have.
| osigurdson wrote:
| Nobody likes coding in bash but everyone does it (a little)
| because it is everywhere.
| forrestthewoods wrote:
| > because it is everywhere
|
| Except for the fact that actually it is not everywhere.
| nativeit wrote:
| I see your point, but bear with me here--it kind of is.
|
| I suppose if one wanted to be pedantically literal, then
| you are indeed correct. In every other meaningful
| consideration, the parent comment is. Maybe not Bash
| specifically, but #!/bin/sh is broadly available on nearly
| every connected device on the planet, in some capacity.
| From the perspective of how we could automate nearly
| anything, you'd be hard-pressed to find something more
| universal than a shell script.
| forrestthewoods wrote:
| > you'd be hard-pressed to find something more universal
| than a shell script.
|
| 99.9% of my 20-year career has been spent on Windows. So
| bash scripts are entirely worthless and dead to me.
| osigurdson wrote:
| If you use git on Windows, bash is normally available.
| Agree that this isn't widely used though.
| forrestthewoods wrote:
| Yeah, I've never seen anyone rely on users to use Git Bash
| to run shell scripts.
|
| Amusingly although I certainly use GitHub for hobby
| projects I've never actually used it for work. And have
| never come across a Windows project that mandated its
| use. Well, maybe one or two over the years.
| nativeit wrote:
| What do you suppose the proportion is of computers
| actively running Windows in the world right now, versus
| those running some kind of *nix/BSD-based OS? This
| includes everything a person or machine could reasonably
| interface with, and that's Turing complete (in other
| words, a traffic light is limited to its own fixed logic,
| so it doesn't count; but most contemporary wifi routers
| contain general-purpose memory and processors, many even
| run some kind of *nix kernel, so they very much do
| count).
|
| That's my case for Bash being more or less everywhere,
| but I think this debate is entirely semantic. Literally
| just talking about different things.
|
| EDIT: escaped *
| forrestthewoods wrote:
| I think if someone were, for example, to release an open
| source C++ library and it only compiles for Linux or only
| comes with Bash scripts then I would not consider that
| library to be crossplatform nor would I consider it to
| run everywhere.
|
| I don't think it's "just semantics". I think it's a
| meaningful distinction.
|
| Game dev is a perhaps a small niche of computer
| programming. I mean these days the majority of
| programming is webdev JavaScript, blech. But game dev is
| also _overwhelmingly_ Windows based. So I dispute any
| claim that Unix is "everywhere". And I'm regularly
| annoyed by people who falsely pretend it is.
| hollerith wrote:
| Me, too. Also, _Unix_ as a whole is overrated. One reason it
| won was an agreement mediated by a Federal judge presiding over
| an anti-trust trial that AT&T would not enter the computer
| market while IBM would not enter the telecommunications market,
| so Unix was distributed at zero cost rather than sold.
|
| Want to get me talking reverentially about the pioneers of our
| industry? Talk to me about Doug Engelbart, Xerox PARC and the
| Macintosh team at Apple. There was some brilliant work!
| nativeit wrote:
| > Also, Unix as a whole is overrated. One reason it won was
| an agreement mediated by a Federal judge presiding over an
| anti-trust trial that AT&T would not enter the computer
| market while IBM would not enter the telecommunications
| market, so Unix was distributed at zero cost rather than
| sold.
|
| What did Unix win?
| hollerith wrote:
| Mind share of the basic design. Unix's design decisions are
| important parts of MacOS and Linux.
|
| Multics would be an example of a more innovative OS than
| Unix, but its influence on the OSes we use today has been a
| lot less.
| nativeit wrote:
| I suppose the deeper question I'd have would be, how
| would its no-cost distribution prevent better
| alternatives from being developed/promoted/adopted along
| the way? I guess I don't follow your line of logic. To be
| fair, I'm not experienced enough with either OS
| development nor any notable alternatives to Unix to
| agree/disagree with your conclusions. My intuition wants
| to disagree, only because I like Linux, and even sort of
| like Bash scripts--but I have _nothing_ but my own
| subjective preferences to base that position on, and I'm
| actually quite open to being better-informed into
| submission. ;-)
|
| I'm a pretty old hat with Debian at this point, so I've
| got plenty of opinions for its contemporary
| implementations, but I always sort of assumed most of the
| fundamental architectural/systems choices had more or
| less been settled as the "best choices" via the usual
| natural selection, along with the OSS community's abiding
| love for reasoned debate. I can generally understand the
| issues folks have with some of these defaults, but my
| favorite aspect of OS's like Debian are that they
| generally defer to the sysadmin's desires for all things
| where we're likely to have strong opinions. It's "default
| position" of providing no default positions. Certainly
| now that there are containers and orchestration like Nix,
| the layer that is Unix is even less visible, and
| infrastructure-as-code mean a lot of developers can just
| kind of forget about the OS layer altogether, at least
| beyond the OS('s) they choose for their own daily
| driver(s).
|
| Getting this back to the OG point--I can understand why
| people don't like the Bash scripting language. But it
| seems trivial these days to get to a point where one
| could use Python, Lua, Forth, et al to automate and
| control any system running a *nix/BSD OS, and *nix OS's
| do several key things rather well (in my opinion), such
| as service bootstrapping, lifecycle management,
| networking/comms, and maintaining a small footprint.
|
| For whatever it's worth, one could start with nothing but
| a Debian ISO and some preseed files, and get to a point
| where they could orchestrate/launch anything they could
| imagine using their own language/application of choice,
| without ever having touched a shell prompt or written a
| line of Bash. Not for nothing, that's almost
| certainly how many Linux-based customized distributions
| (and even full-blown custom/bespoke OS's) are created,
| but it doesn't have to be so complicated if one just
| wants to get to where Python scripts are able to run (for
| example).
| hollerith wrote:
| Most OSes no longer have any users or squeak by with less
| than 1000 users on their best day ever: Plan 9, OS/2,
| Beos, AmigaOS, Symbian, PalmOS, the OS for the Apple II,
| CP/M, VMS, TOPS-10, Multics, Compatible Time-Sharing
| System, Burroughs Master Control Program, Univac's Exec
| 8, Dartmouth Time-Sharing System, etc.
|
| Some of the events that helped Unix survive longer than
| most are the decision of DARPA (in 1979 or the early
| 1980s IIRC) to fund the addition of a TCP/IP networking
| stack to Unix and the decision in 1983 of Richard
| Stallman to copy the Unix design for his GNU project. The
| reason DARPA and Stallman settled on Unix was that they
| knew about it and were somewhat familiar with it because
| it was given away for free (mostly to universities and
| research labs). Success tends to beget success in
| "spaces" with strong "network externalities" such as the
| OS space.
|
| >Getting this back to the OG point
|
| I agree that it is easy to avoid writing shell scripts.
| The problem is that other people write them, e.g., as the
| recommended way to install some package I want. The
| recommended way to install a Rust toolchain for example
| is to run a shell script (rustup). I trust the Rust
| maintainers not to intentionally put an attack in the
| script, but I don't trust them not to have inadvertently
| included a vulnerability in the script that some third
| party might be able to exploit (particularly since it is
| quite difficult to write an attack-resistant shell
| script).
| hollerith wrote:
| OK, consider the browser market: are there any browsers
| that cost money? If so, I've not heard of it. From the
| beginning, Netscape Corporation, Microsoft, Opera and
| Apple gave away their browsers for free. That is because
| by the early 1990s it was well understood (at least by
| Silicon Valley execs) that what is important is grabbing
| mind share, and charging any amount of money would
| severely curtail the ability to do that.
|
| In the 1970s when Unix started being distributed outside
| of Bell Labs, tech company execs did not yet understand
| that. The owners of Unix adopted a superior strategy to
| ensure survival of Unix _by accident_ (namely, by being
| sued -- IIRC in the 1950s -- by the US Justice Department
| on anti-trust grounds).
| zahlman wrote:
| > Python environment is a clusterfuck. Which UV is rapidly
| bringing into something somewhat sane.
|
| Uv is able to do what it does mainly because of a) being a
| greenfield project b) in an environment of new standards that
| the community has been working on since the first days that
| people complained about said clusterfuck.
|
| But that's assuming you actually need to set up an environment.
| People really underestimate what can be done easily with just
| the standard library. And when they do grab the most popular
| dependencies, they end up exercising a tiny fraction of that
| code.
|
| > But I'd definitely be more interested in "replace yourself
| with a UV Python script" over "replace yourself with a shell
| script".
|
| There is no such thing as "a UV Python script". Uv doesn't
| create a new language. It doesn't even have a monopoly on what
| I _guess_ you 're referring to, i.e. the system it uses for
| specifying dependencies inline in a script. That comes from an
| ecosystem-wide standard, https://peps.python.org/pep-0723/.
| Pipx also implements creating environments for such code and
| running it, as do Hatch and PDM; and other tools offer
| appropriate support - e.g. editors may be able to syntax-
| highlight the declaration etc.
|
| Regardless, what you describe is not at all opposed to what the
| author has in mind here. The term "shell script" is often used
| quite loosely.
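| For reference, the PEP 723 header looks like the string below, and
| extracting it is mechanical (the regex here is a simplified sketch;
| the PEP specifies a stricter one):

```python
import re

SCRIPT = '''\
# /// script
# requires-python = ">=3.12"
# dependencies = ["rich", "httpx"]
# ///
print("hello")
'''

def read_inline_metadata(source: str):
    """Return the TOML between `# /// script` and `# ///`, with the comment
    prefix stripped, or None if the script has no inline metadata block."""
    m = re.search(r"^# /// script\n((?:^#.*\n)+?)^# ///$", source, re.M)
    if not m:
        return None
    return "".join(line[2:] if line.startswith("# ") else line[1:]
                   for line in m.group(1).splitlines(keepends=True))
```

| This is the block that uv, pipx, Hatch, and PDM all read to build an
| environment for a single-file script.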
| forrestthewoods wrote:
| Ok?
| recursivedoubts wrote:
| I would like to see MCP integrate the notion of hypermedia
| controls.
|
| Seems like that would be a potential way to get self-organizing
| integrations.
| vasusen wrote:
| I think the Playwright MCP is a really good example of the
| overall problem that the author brings up.
|
| However, I couldn't really understand if he's saying that the
| Playwright MCP is good to use for your own app, or whether he
| means that, for your own app, you should just tell the LLM
| directly to export Playwright code.
| shelajev wrote:
| it's the latter: "you can actually start telling it to write a
| Playwright Python script instead and run that".
|
| and while running the code might be faster, it's unclear whether
| that approach scales well. Sending an MCP tool command to click
| the button that says "X" is something a small local LLM can
| do. Writing complex code after parsing a significant amount of
| HTML (for the correct selectors, for example) probably needs a
| managed model.
| jumploops wrote:
| We're playing an endless cat and mouse game of capabilities
| between old and new right now.
|
| Claude Code shows that the models can excel at using "old"
| programmatic interfaces (CLIs) to do Real Work(tm).
|
| MCP is a way to dynamically provide "new" programmatic interfaces
| to the models.
|
| At some point this will start to converge, or at least appear to
| do so, as the majority of tools a model needs will be in its pre-
| training set.
|
| Then we'll argue about MPPP (model pre-training protocol
| pipeline), and how to reduce knowledge pollution of all the LLM-
| generated tools we're passing to the model.
|
| Eventually we'll publish the Merrium-Webster Model Tool
| Dictionary (MWMTD), surfacing all of the approved tools hidden in
| the pre-training set.
|
| Then the kids will come up with Model Context Slang (MCS), in an
| attempt to use the models to dynamically choose unapproved tools,
| for much fun and enjoyment.
|
| Ad infinitum.
| JyB wrote:
| > It demands too much context.
|
| This is solved trivially by having default initial prompts. All
| major tools like Claude Code or Gemini CLI have ways to set them
| up.
|
| > You pass all your tools to an LLM and ask it to filter it down
| based on the task at hand. So far, there hasn't been much better
| approaches proposed.
|
| Why is a "better" approach needed if modern LLMs can properly
| figure it out? It's not like LLMs don't keep getting better with
| larger and larger context lengths. I've never had a problem with
| an LLM struggling to use the appropriate MCP function on its own.
|
| > But you run into three problems: cost, speed, and general
| reliability
|
| - cost: They keep getting cheaper and cheaper. It's ridiculously
| inexpensive for what those tools provide.
|
| - speed: That seems extremely short-sighted. No one is sitting
| idle watching Claude Code in their terminal; that would defeat
| the purpose. You can have more than one instance working on
| unrelated topics, so no matter how long a task takes, the time
| spent is pure bonus. You don't have to stay in the loop when
| asking for well-defined tasks.
|
| - reliability: Seems very prompt-correlated at the moment. I
| guess the main issue is that some people don't know what to ask.
|
| Having LLMs able to complete tedious tasks involving so many
| external tools at once is simply amazing, thanks to MCP.
| Anecdotal but just today it did a task flawlessly involving:
| Notion pages, Linear Ticket, git, GitHub PR, GitHub CI logs.
| Being in the loop was just submitting one review on the PR. All
| the while I was busy doing something else. And for what, ~$1?
| the_mitsuhiko wrote:
| > This is solved trivially by having default initial prompts.
| All major tools like Claude Code or Gemini CLI have ways to
| set them up.
|
| That only makes it worse. The MCP tools available all add to
| the initial context. The more tools, the more of the context is
| populated by MCP tool definitions.
| JyB wrote:
| Do you mean that some tools (MCP clients) pass all functions
| of all configured MCP servers in the initial prompt?
|
| If that's the case: I understand the knee-jerk reaction but
| if it works? Also, what theoretically prevents altering the
| prompt-chaining logic in these tools to only expose a
| condensed list of MCP servers, not their whole capabilities,
| and only inject details based on LLM outputs? It doesn't seem
| like an insurmountable problem.
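| What JyB describes could look like a two-stage exposure: the
| initial prompt carries only one-line server summaries, and a
| server's full tool schemas are injected only once the model asks
| for that server. A minimal sketch (all server and tool names
| here are hypothetical, not any real MCP client's design):

```python
# Two-stage tool exposure sketch: keep the initial context small by
# advertising servers with one-line summaries, and expand a server's
# full tool schemas only on demand. Names are made up for illustration.
SERVERS = {
    "github": {
        "summary": "Issues, PRs, and CI logs on GitHub",
        "tools": {
            "create_pr": {
                "description": "Open a pull request",
                "params": {"title": "str", "branch": "str"},
            },
        },
    },
    "linear": {
        "summary": "Linear ticket management",
        "tools": {
            "create_ticket": {
                "description": "Create a ticket",
                "params": {"title": "str"},
            },
        },
    },
}

def condensed_index() -> str:
    """The only tool text in the initial prompt: one line per server."""
    return "\n".join(f"{name}: {s['summary']}" for name, s in SERVERS.items())

def expand(server: str) -> dict:
    """Injected into context only after the model selects a server."""
    return SERVERS[server]["tools"]
```

| The condensed index costs a handful of tokens no matter how many
| tools each server exposes; the full schemas are only paid for
| when they are actually needed.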
| the_mitsuhiko wrote:
| > Do you mean that some tools (MCP clients) pass all
| functions of all configured MCP servers in the initial
| prompt?
|
| Not just some, all. That's just how MCP works.
|
| > If that's the case: I understand the knee-jerk reaction
| but if it works?
|
| I would not be writing about this if it worked well. The
| data indicates that it works significantly worse than not
| using MCP, because of context rot and low tool utilization.
| JyB wrote:
| I guess I don't see the technical limitation. Seems like a
| protocol update issue.
| dingnuts wrote:
| > cost: They keep getting cheaper and cheaper
|
| no they don't[0], the cost is just still hidden from you but
| the freebies will end just like MoviePass and cheap Ubers
|
| https://bsky.app/profile/edzitron.com/post/3lsw4vatg3k2b
|
| "Cursor released a $200-a-month subscription then made their
| $20-a-month subscription worse (worse output, slower) - yet it
| seems even on Max they're rate limiting people!"
|
| https://bsky.app/profile/edzitron.com/post/3lsw3zwgw4c2h
| fkyoureadthedoc wrote:
| The cost will stay hidden from me because my job will pay it,
| just like the cost of my laptop, o365 license, and every
| other tool I use at work.
| nativeit wrote:
| Until they use your salary to pay for another dozen
| licenses.
| JyB wrote:
| Fair. I'm using Claude Code, which is pay-as-you-go. The
| market will probably do its thing. (The company pays anyway,
| obviously.)
| antirez wrote:
| I have the feeling it's not really MCP specifically vs. other
| ways; it is pretty simple: at the current state of AI, having
| a human in the loop is _much_ better. LLMs are great at
| certain tasks but they often get trapped in local minima. If
| you do the back and forth via the web interface of an LLM (ask
| it to write a program, look at it, provide hints to improve
| it, test it, and so on), you get much better results, and you
| don't end up with a 10k-line mess of code that could have been
| 400 lines of clear code. That's the current state of affairs,
| but of course many will try very hard to replace programmers,
| which is currently _not_ possible. What is possible is to
| accelerate the work of a programmer several times over (but
| they must be good both at programming and at LLM usage), or to
| take a smart person with relatively low skill in some
| technology and, thanks to an LLM, make them productive in that
| field without the long training otherwise needed. And many
| other things. But "agentic coding" right now does not work
| well. This will change, but right now the real gain is to use
| the LLM as a colleague.
|
| It is not MCP: it is autonomous agents that don't get feedback
| from smart humans.
| rapind wrote:
| So I run my own business (product), I code everything, and I
| use claude-code. I also wear all the other hats and so I'd be
| happy to let Claude handle all of the coding if / when it can.
| I can confirm we're certainly not there yet.
|
| It's definitely useful, but you have to read everything. I'm
| working in a type-safe functional compiled language too. I'd be
| scared to try this flow in a less "correctness enforced"
| language.
|
| That being said, I do find that it works well. It's not living
| up to the hype, but most of that hype was obvious nonsense. It
| continues to surprise me with its grasp of concepts and is
| definitely saving me some time, and more importantly making
| some larger tasks more approachable, since I can split my
| time better.
| galdre wrote:
| My absolute favorite use of MCP so far is Bruce Hauman's clojure-
| mcp. In short, it gives the LLM (a) a bash tool, (b) a persistent
| Clojure REPL, and (c) structural editing tools.
|
| The effect is that it's far more efficient at editing Clojure
| code than any purely string-diff-based approach, and if you write
| a good test suite it can rapidly iterate back and forth just
| editing files, reloading them, and then re-running the test suite
| at the REPL -- just like I would. It's pretty incredible to
| watch.
| chamomeal wrote:
| I was just going to comment about clojure-mcp!! It's far and
| away the coolest use of mcp I've seen so far.
|
| It can straight up debug your code, eval individual
| expressions, document return types of functions. It's amazing.
|
| It actually makes me think that languages with strong REPLs
| are a better fit for LLMs than those without. Seeing
| clojure-mcp do its thing is the most impressive AI feat I've
| seen since I saw GPT-3 in action for the first time.
| e12e wrote:
| https://github.com/bhauman/clojure-mcp
| manaskarekar wrote:
| Off topic: That font/layout/contrast on the page is very pleasing
| and inviting.
| khalic wrote:
| Honestly, I'm getting tired of these sweeping statements about
| what developers are supposed to be, how it's "the right way to
| use AI". We are in uncharted territories that are changing by the
| day. Maybe we have to drop the self-assurance and opinionated
| viewpoints and tackle this like a scientific problem.
| pizzathyme wrote:
| 100% agreed - he mentions 3 barriers to using MCP over code:
| "cost, speed, and general reliability". But all 3 of these
| could change by 10-100x within a few years, if not months. Just
| recently, OpenAI dropped the price of using o3 by 80%.
|
| This is not an environment where you can establish a durable
| manifesto
| luckystarr wrote:
| I always dreamed of a tool which would know the intent, semantic
| and constraints of all inputs and outputs of any piece of code
| and thus could combine these code pieces automatically. It was
| always a fuzzy idea in my head, but this piece now made it a bit
| more clear. While LLMs could generate those adapters between
| distinct pieces automatically, it's an expensive (latency, tokens)
| process. Having a system with which not only to type the
| variables, but also to type the types (intents, semantic meaning,
| etc.) would be helpful but likely not sufficient. There has been
| so much work on ontologies, semantic networks, logical inference,
| etc. but all of it is spread all over the place. I'd like to have
| something like this integrated into a programming language and
| see what it feels like.
| tristanz wrote:
| You can combine MCPs within composable LLM generated code if you
| put in a little work. At Continual (https://continual.ai), we
| have many workflows that require bulk actions, e.g. iterating
| over all issues, files, customers, etc. We inject MCP tools into
| a sandboxed code interpreter and have the agent generate both
| direct MCP tool calls and composable scripts that leverage MCP
| tools depending on the task complexity. After a bunch of work it
| actually works quite well. We are also experimenting with
| continual learning via a Voyager like approach where the LLM can
| save tool scripts for future use, allowing lifelong learning for
| repeated workflows.
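| The Voyager-like loop can be sketched with a tiny on-disk skill
| library: scripts the agent writes get saved with a description,
| and the name-to-description index is offered back in future
| prompts. (The file layout and function names here are
| assumptions for illustration, not Continual's actual design.)

```python
import json
import pathlib
import tempfile

# Voyager-style skill library sketch: persist agent-written scripts,
# expose a small index for prompts, and load full bodies on demand.
LIBRARY = pathlib.Path(tempfile.gettempdir()) / "tool_scripts"

def save_script(name: str, description: str, code: str) -> None:
    """Persist a script the agent wrote so later runs can reuse it."""
    LIBRARY.mkdir(exist_ok=True)
    path = LIBRARY / f"{name}.json"
    path.write_text(json.dumps({"description": description, "code": code}))

def load_index() -> dict[str, str]:
    """Name -> description map, small enough to inject into a prompt."""
    return {
        p.stem: json.loads(p.read_text())["description"]
        for p in LIBRARY.glob("*.json")
    }

def load_script(name: str) -> str:
    """Fetch the full script body only once the agent picks it."""
    return json.loads((LIBRARY / f"{name}.json").read_text())["code"]
```

| Only the index goes into context on every run; the full script
| bodies stay on disk until one is actually selected.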
| JyB wrote:
| That autocompounding aspect of constantly refining initial
| prompts with more and more knowledge is so interesting. Gut
| feeling says it's something that will be "standardized" in some
| way, exactly like what MCP did.
| tristanz wrote:
| Yes, I think you could get quite far with a few tools like
| memory/todo list + code interpreter + script save/load. You
| could probably get a lot farther though if you RLVRed this
| similar to how o3 uses web search so effectively during its
| thinking process.
| wrs wrote:
| MCP is literally the same as giving an LLM a set of man page
| summaries and a very limited shell over HTTP. It's just in a
| different syntax (JSON instead of man macros and CLI args).
|
| It would be better for MCP to deliver function definitions and
| let the LLM write little scripts in a simple language.
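| A rough sketch of that idea: hand the model plain signatures
| instead of JSON schemas, and run the little script it writes in
| a namespace that contains only the tools. (The tool functions
| below are stand-ins, and `exec` here stands in for a real
| sandbox; a production system needs proper isolation.)

```python
import inspect

# Stand-in tool functions; a real setup would wrap actual MCP calls.
def list_issues(repo: str) -> list[str]:
    return [f"{repo}#1: flaky test", f"{repo}#2: typo in docs"]

def close_issue(issue: str) -> str:
    return f"closed {issue}"

TOOLS = {f.__name__: f for f in (list_issues, close_issue)}

def signatures() -> str:
    """What the model sees: man-page-style one-liners, not JSON."""
    return "\n".join(f"{name}{inspect.signature(fn)}"
                     for name, fn in TOOLS.items())

def run_script(script: str) -> dict:
    """Execute a model-written script with only the tools in scope.
    Note: exec() is NOT a sandbox; this only illustrates the flow."""
    scope = dict(TOOLS)
    exec(script, scope)
    return scope
```

| Composition then comes for free: the model can write
| results = [close_issue(i) for i in list_issues("demo")]
| in a single response instead of round-tripping each call.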
| CharlieDigital wrote:
| > So maybe we need to look at ways to find a better abstraction
| for what MCP is great at, and code generation. For that we
| might need to build better sandboxes and maybe start looking at
| how we can expose APIs in ways that allow an agent to do some
| sort of fan out / fan in for inference. Effectively we want to do
| as much in generated code as we can, but then use the magic of
| LLMs after bulk code execution to judge what we did.
|
| Por que no los dos? (Why not both?) I ended up writing an OSS
| MCP server that securely executes LLM-generated JavaScript
| using a C# JS
| interpreter (Jint) and handing it a `fetch` analogue as well as
| `jsonpath-plus`. Also gave it a built-in secrets manager.
|
| Give it an objective and the LLM writes its own code and uses the
| tool iteratively to accomplish the task (as long as you can
| interact with it via a REST API).
|
| For well known APIs, it does a fine job generating REST API
| calls.
|
| _You can pretty much do anything with this._
|
| https://github.com/CharlieDigital/runjs
| pamelafox wrote:
| Regarding the Playwright example: I had the same experience this
| week attempting to build an agent first by using the Playwright
| MCP server, realizing it was slow, token-inefficient, and flaky,
| and rewriting with direct Playwright calls.
|
| MCP servers might be fun to get an idea for what's possible, and
| good for one-off mashups, but API calls are generally more
| efficient and stable, when you know what you want.
|
| Here's the agent I ended up writing:
| https://github.com/pamelafox/personal-linkedin-agent
|
| Demo: https://www.youtube.com/live/ue8D7Hi4nGs
| arkmm wrote:
| This is cool! Also have found the Playwright MCP implementation
| to be overkill and think of it more as a reference to an
| opinionated subset of the Playwright API.
|
| LinkedIn is notorious for making it hard to build automations
| on top of it. Did you run into any roadblocks when building
| your personal LinkedIn agent?
| zahlman wrote:
| ... Ah, reading these as well as more carefully reading TFA,
| I understand now that there is an MCP based on Playwright,
| and that Playwright itself is not considered an example of
| something that accidentally is an MCP despite having been
| released all the way back in January 2020.
|
| ... But now I still feel far away from understanding what MCP
| really _is_. As in:
|
| * What specifically do I have to implement in order to create
| one?
|
| * Now that the concept exists, what are the implications as
| the author of, say, a traditional REST API?
|
| * Now that the concept exists, what new problems exist to
| solve?
| pramodbiligiri wrote:
| Wouldn't the sweet spot for MCP be where the LLM is able to do
| most of the heavy lifting on its own (outputting some kind of
| structured or unstructured output), but needs a bit of
| external/dynamic data that it can't do without? The list of MCP
| servers/tools it can use should nail that external lookup in a
| (mostly) deterministic way.
|
| This would work best if a human is the end consumer of this
| output, or if it will receive manual vetting eventually. I'm
| not sure
| I'd leave such a system running unsupervised in production ("the
| Automation at Scale" part mentioned by the OP).
| ramoz wrote:
| You don't solve the problem of being able to rely on the agent
| to call the MCP.
|
| Hooks into the agent's execution lifecycle seem more reliable
| for deterministic behavior and supervision.
| pramodbiligiri wrote:
| I agree. In any large backend software running on a server,
| it's the LLM invocation which would be a call out to an
| external system, and with proper validation around the
| results. At which point, calling an "MCP Server" is also just
| your backend software invoking one more library/service based
| on inspecting some part of the response from the LLM.
|
| This doesn't take away from the utility of MCP when it comes
| to Claude Desktop and the like!
| briandw wrote:
| Anyone else switch their LLM subscription every month? I'm back
| on ChatGPT for O3 use, but expect that Grok4 will be next.
| jrm4 wrote:
| Yup, I can't help but think that a lot of the bad thinking comes
| from trying to avoid the following fact: LLMs are only good where
| your output does not need to be precise and/or verifiably
| "perfect," which is kind of the opposite of how code has worked,
| or has tried to work, in the past.
|
| Right now I got it for: DRAFTS of prose things -- and the only
| real killer in my opinion, autotagging thousands of old
| bookmarks. But again, that's just to have cool stuff to go back
| and peruse, not something that _must be correct_.
| never_inline wrote:
| The problem I see with MCP is very simple: it uses JSON as the
| format, and that's nowhere near as expressive as a programming
| language.
|
| Consider a Python function signature:
|
|     list_containers(show_stopped: bool = False,
|                     name_pattern: Optional[str] = None,
|                     sort: Literal["size", "name", "started_at"] = "name")
|
| It doesn't even need docs.
|
| Now convert this to a JSON schema, which is already a 4x larger
| input.
|
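| Spelled out, that signature becomes something like the following
| tool definition (hand-written here to illustrate the size
| difference; the exact field names vary by client, this follows
| the usual JSON Schema conventions):

```python
# The list_containers signature from above, hand-expanded into the
# kind of JSON tool definition MCP-style tool calling requires.
# Shown as a Python dict; it is several times the tokens of the
# one-line signature it encodes.
list_containers_tool = {
    "name": "list_containers",
    "inputSchema": {
        "type": "object",
        "properties": {
            "show_stopped": {"type": "boolean", "default": False},
            "name_pattern": {"type": ["string", "null"], "default": None},
            "sort": {
                "type": "string",
                "enum": ["size", "name", "started_at"],
                "default": "name",
            },
        },
        "required": [],
    },
}
```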
| And when generating output, the LLM will generate almost 2x more
| tokens too, because JSON. Easier to get confused.
|
| And consider that the flow of calling python functions and using
| their output to call other tools etc... is seen 1000x more times
| in their fine tuning data, whereas JSON tool calling flows are
| rare and practically only exist in instruction tuning phase. Then
| I am sure instruction tuning also contains even more complex code
| examples where model has to execute complex logic.
|
| Then there's the whole issue of composition. To my knowledge
| there's no way an LLM can do this in one response:
|
|     vehicle = call_func_1()
|     if vehicle.type == "car":
|         details = lookup_car(vehicle.reg_no)
|     elif vehicle.type == "motorcycle":
|         details = lookup_motorcycle(vehicle.reg_no)
|
| How is JSON tool calling going to solve this?
| chrisweekly wrote:
| Great point.
|
| But "the" problem with MCP? IMVHO (Very humble, non-expert) the
| half-baked or missing security aspects are more fundamental.
| I'd love to hear updates about that from ppl who know what
| they're talking about.
| 8note wrote:
| The reason to use the LLM is that you don't know ahead of time
| that the vehicle type is only a car or motorcycle; the LLM
| will also figure out a way to detail bicycles and boats and
| airplanes, and to consider both left and right shoes
| separately.
|
| The LLM can't just be given this function, because it's
| specialized to just the two options.
|
| You could have it do a feedback loop of rewriting the Python
| script after running it, but what's the savings at that point?
| You're wasting tokens talking about cars in Python when you
| already know it's a ski, and the LLM could ask directly for
| the ski details without writing a script to do it in between.
| prairieroadent wrote:
| Makes sense, and if realized, then Deno is in an excellent
| position to be one of the leading, if not the main, sandbox
| runtimes for agents.
| keybored wrote:
| tl;dr of one of today's AI posts: all you need is code generation
|
| It's 2025 and this is the epitome of progress.
|
| On the positive side code generation can be solid if you also
| have/can generate easy-to-read validation or tests for the
| generated code. I mean that _you_ can read, of course.
| LudwigNagasena wrote:
| I hit the same roadblock with MCP. If you work with data, the
| LLM becomes a very expensive pipe with an added risk of
| hallucinations. It's better to simply connect it to a Python
| environment enriched with integrations you need.
| SatvikBeri wrote:
| I use Julia at work, which benefits from long-running sessions,
| because it compiles functions the first time they run. So I wrote
| a very simple MCP that lets Claude Code send code to a persistent
| Julia kernel using Jupyter.
|
| It had a much bigger impact than I expected - not only does test
| code run much faster (and not time out), but Claude seems to be
| much more willing to just run functions from our codebase rather
| than do a bunch of bespoke bash stuff to try and make something
| work. It's anecdotal, but CCUsage says my token usage has dropped
| nearly 50% since I wrote the server.
|
| Of course, it didn't have to be MCP - I could have used some
| other method to get Claude to run code from my codebase more
| frequently. The broader point is that it's much easier to just
| add a useful function to my codebase than it is to write
| something bespoke for Claude.
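| The persistent-session effect is easy to show in miniature:
| state set up by one tool call survives into the next, instead
| of being rebuilt (or, in Julia's case, recompiled) every time.
| A toy stand-in using an in-process Python interpreter rather
| than a real Jupyter kernel:

```python
import code

# One long-lived interpreter stands in for the persistent Julia
# kernel: each "tool call" sends source to the same session, so
# earlier definitions and state remain available.
session = code.InteractiveInterpreter()

def run(src: str) -> None:
    """Simulate one tool call against the persistent session."""
    session.runsource(src)

run("counter = 0")    # first call sets up state
run("counter += 1")   # later calls still see that state
run("counter += 1")
```

| With Jupyter the same idea would route `run` through a kernel
| client instead, but the payoff is identical: no cold start per
| call.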
| macleginn wrote:
| "Claude seems to be much more willing to just run functions
| from our codebase rather than do a bunch of bespoke bash stuff
| to try and make something work" -- simply because it knows that
| there is a kernel it can send code to?
| alganet wrote:
| It's finally happening. The acceleration of the full AI
| disillusionment:
|
| - LLMs will do everything.
|
| - Shit, they won't. I'll do some traditional programming to put
| it on a leash.
|
| - More traditional programming.
|
| - Wait, this traditional programming thing is quite good.
|
| - I barely use LLMs now.
|
| - Man, that LLM stuff was a bad trip.
|
| See you all on the other side!
___________________________________________________________________
(page generated 2025-07-03 23:01 UTC)