[HN Gopher] Launch HN: Halluminate (YC S25) - Simulating the int...
___________________________________________________________________
Launch HN: Halluminate (YC S25) - Simulating the internet to train
computer use
Hi everyone, Jerry and Wyatt here from Halluminate
(https://halluminate.ai/). We help AI labs train computer use
agents with high-quality data and RL environments.

Training AI agents to use computers, browsers, and software is one
of the highest-potential opportunities for AI. To date, however,
this capability is still unreliable. The emerging method for
improving it is Reinforcement Learning with Verifiable Rewards
(RLVR), but researchers are currently bottlenecked by a lack of
high-quality simulators and task/verifier pairs.

To solve this problem, we're building Westworld, a fully simulated
internet made up of synthetic versions of the most common consumer
and enterprise apps. Agents use Westworld to learn how to do
economically valuable tasks. For example, AI agents can practice
planning vacations on a simulated flight booking site
(https://flights.halluminate.ai/), learn how to reorganize outdated
information in your sales platform, or train to do financial
modeling directly in a spreadsheet. Here's a demo showing our
flight booking simulation:
https://www.loom.com/share/74a3b28067e24c1b886054ba90a90aa5

How it works: AI agents access our environment and are given a
task and a verifier. A task is an objective for the agent to
achieve, for example "Book me a flight from SF to NYC on this date
with x, y, z filters." A verifier is a programmatic way to
determine whether the task was completed successfully; in this
case it might be a JSON check that the final booking data matches
expectations. These signals can then be used to calculate a reward
in RL.
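
To make this concrete, here's a minimal sketch of what a task and
verifier could look like. This is an illustrative assumption, not
our actual API: the field names, the environment's final-state
dict, and the partial-credit reward scheme are all made up.

    # Hypothetical task + verifier pair for a flight booking sim.
    from dataclasses import dataclass

    @dataclass
    class Task:
        prompt: str     # natural-language objective given to the agent
        expected: dict  # ground-truth fields the verifier checks

    task = Task(
        prompt="Book a nonstop economy flight SFO->JFK on 2025-09-01.",
        expected={"origin": "SFO", "destination": "JFK",
                  "date": "2025-09-01", "stops": 0, "cabin": "economy"},
    )

    def verify(final_state: dict, task: Task) -> float:
        """Compare the simulator's final booking state against the
        task spec; return a reward in [0, 1] (fraction matched)."""
        hits = sum(final_state.get(k) == v
                   for k, v in task.expected.items())
        return hits / len(task.expected)

    # At the end of an episode the simulator exposes its final state,
    # and the verifier's score becomes the RL reward for the rollout.
    final_state = {"origin": "SFO", "destination": "JFK",
                   "date": "2025-09-01", "stops": 1, "cabin": "economy"}
    print(verify(final_state, task))  # 0.8 -> partial credit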

The more simulators we build, the more AI labs can improve on
capabilities that computer use agents are currently weak at. One
of our customers saw a ~20% improvement in date-picking
performance when training on our flight booking simulator.

Two things make this hard: (1) The simulations have to be
realistic. You can't get away with a vibe-coded "80% solution"
because even small divergences hurt performance. Generating
simulated data is even harder; for example, massaging flight data
to look realistic took a lot of trial and experimentation. (2) The
tasks you train agents on have to be well chosen. They are only
valuable if they reflect work that people actually want solved, so
we need a lot of feedback from domain experts to get this right.
That said, we find this work incredibly interesting and are
excited to tackle these issues.

A few things we're pumped to ship in the near term:

- Ability to train on long-horizon tasks by stringing multiple
  simulators together for extended workflows.

- Procedural data generation. Instead of synthetically generating
  all the data upfront, how can we model data generation so that
  our simulators are populated procedurally as agents explore
  (think Minecraft)? A rough sketch of the idea is below.

- Open source! We plan to release our environments to the public
  so developers/researchers can hack on them for their own
  experimentation.
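
On procedural generation: here's a toy sketch of the idea, where a
flight search page is derived deterministically from a seed the
moment an agent queries it rather than being generated up front.
The airline names, pricing, and function names are made-up
assumptions for illustration, not our actual implementation.

    # Toy procedural generation for a flight booking simulator:
    # results are created on demand from (origin, dest, date), and
    # the same query always returns the same page, so verifiers can
    # stay stable as agents explore.
    import hashlib
    import random

    AIRLINES = ["Acme Air", "Blue Skies", "Nimbus"]

    def _seed(origin: str, dest: str, date: str) -> int:
        # Stable seed so repeated searches are reproducible.
        key = f"{origin}-{dest}-{date}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

    def search_flights(origin: str, dest: str, date: str, n: int = 5):
        rng = random.Random(_seed(origin, dest, date))
        flights = []
        for _ in range(n):
            hour = rng.randint(6, 21)
            minute = rng.choice([0, 15, 30, 45])
            flights.append({
                "airline": rng.choice(AIRLINES),
                "depart": f"{hour:02d}:{minute:02d}",
                "stops": rng.choices([0, 1, 2], weights=[5, 3, 1])[0],
                "price_usd": round(rng.uniform(120, 650), 2),
            })
        return sorted(flights, key=lambda f: f["price_usd"])

    # The cheapest result for a given query is fixed, so a task's
    # verifier can be derived from the same seed as the page itself.
    print(search_flights("SFO", "JFK", "2025-09-01")[0])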

RL simulators are just one part of our business. The other part is
human data creation (think Scale AI, but for computer use). We
produce off-the-shelf pre-training/fine-tuning datasets and expert
human evaluation/error analysis, and cover whatever other data
needs our customers have. There are also a lot of exciting
overlaps between the two - for example, using human experts to
help create our simulators/tasks. Happy to go into more detail,
but we thought simulators would make for the more interesting
Hacker News post :)

Finally, about us: Wyatt and I met while studying CS at Cornell
and have been living and working together for over 7 years. I
previously led product/research at Capital One Labs, where I
launched one of the first AI agents in banking. Wyatt was
previously a Cornell Milstein scholar and did large-scale data
engineering for two early-stage startups in NYC. We left our jobs
last year and ran into these problems first-hand while building
evals for customers who were browser/computer use agent companies.

If anyone has questions, feedback, or thoughts, please let us
know! Looking forward to your comments.
Author : wujerry2000
Score : 39 points
Date : 2025-08-11 15:30 UTC (7 hours ago)
| mrbluecoat wrote:
| Interesting name for an AI company - one letter away from
| hallucinate..
| rickcarlino wrote:
| I misread it as humiliate. Side note that this is not intended
| as a joke. This name might not be good long term.
| wujerry2000 wrote:
| Yea haha ... early idea was illuminate + hallucinations. Naming
| isn't our strength :)
| mousetree wrote:
| I thought it was halloumi + illuminate
| suninsight wrote:
| how about halarax ...halucinate and paralax
| aresant wrote:
| It's a great / memorable / and tongue-in-cheek name that
| anybody seriously in the space will instantly get and
| appreciate.
| inLewOf wrote:
| Not too late to change the logo to a friendly piece of halloumi
| :) (best i could find from a quick google image search
| https://www.redbubble.com/i/poster/Halloumi-by-
| PaulSDesign/1...)
| seemaze wrote:
| Was disappointed to discover no cheese was involved in this
| venture
| wm2 wrote:
| might have to make some merch from this!
| thebiglebrewski wrote:
| Man, I was kind of hoping this was a YCombinator-backed cheese
| factory. But good luck on the launch!
| mikepurvis wrote:
| "good luck with lunch"
| CodingJeebus wrote:
| Curious to see how this works out. The flight booking example is
| interesting because it's one of the last purchase powers I'd want
| to hand over to an AI.
|
| If it gets a major travel detail wrong, purchases a business
| class ticket by accident, etc., and I need to adjust the booking
| by calling the airline, then I'm way less happy than I was if I
| just bought the ticket myself. Not to mention what happens when
| Google flights gets a UI refresh and knocks the accuracy rate of
| the agent down even 10%.
|
| Digital criminals are gonna love it, though.
|
| I'm personally much more interested in automating browser tasks
| that aren't economically valuable because that mitigates the
| risk.
| mousetree wrote:
| Why are flight bookings the go to example always? For most
| people, booking a flight happens infrequently, is a non-trivial
| expense (to your point), and is not that burdensome to do
| yourself.
| wujerry2000 wrote:
| We agree that, as a demo, flight booking is probably overused.
|
| However, in talking with AI labs, their perspective on flight
| booking is a little different. "Solving" flight booking
| requires the AI agent to solve a LOT of hard problems:
| personalization, context, weighing multiple options,
| interacting with the UI, math, and then wrapping all of that
| up into a coherent response. The thought process is that IF a
| computer use agent is able to solve flight booking well, then
| we will have developed many other powerful primitives that
| will scale to other problems.
|
| So as a standalone use case, I'm inclined to agree this might
| not be where the most agent traction is seen. However, as a
| research/capability goal, there are some generalizations that
| could apply to other very important use cases.
| fragmede wrote:
| It's because most people have done it, and it's infrequent
| and sufficiently expensive to make it enough of a pain point
| to be a good example. Because it's infrequent,
| most people don't have a rigorous well-practiced system for
| how to go about it to get the optimal ticket for their
| particular circumstances for that flight, and because it can
| be somewhat expensive, there's a bit of a burden taken on in
| order to optimize for price as well, especially given all the
| shenanigans airlines play with pricing.
|
| If you're rich, you can just look for the ticket at the time
| you like on your preferred airline and buy a first class
| ticket, whatever the price, for whenever you want to fly,
| even if it's tomorrow. For the rest, that's not practical. So
| the flight search has to begin a few months out, with the
| burden of doing multiple searches (in incognito mode) across
| various airlines and/or aggregators, in order to optimize
| various factors. This takes a non-trivial amount of time. Add
| in looking for hotels and rental cars, and for some it's fun,
| for others it's an annoying burdensome chore that stands in
| the way of being on vacation.
|
| It's just an example use case though. Similar to how a "robot
| maid" that folds clothes isn't the be-all and end-all for
| robotics, if an AI is able to perform that task, it's going
| to have capabilities necessary for performing a wide variety
| of other tasks.
| wujerry2000 wrote:
| UI refreshes knocking down simulator realism is a real issue
| that we're still trying to solve.
|
| I think this will probably be a mixture of automated
| QA/engineering and scale.
|
| Another interesting path is actually partnering directly with
| software providers to offer their platforms as simulators IF
| they see there is a competitive advantage to training agents to
| perform well on their UI.
|
| We're really excited about this idea, but it would require a
| company to see real revenue potential in enabling agentic
| access vs. not. I'd say we're still in the "block them out"
| phase of the internet (ex. see Cloudflare's recent post about
| bot detection: https://blog.cloudflare.com/perplexity-is-using-
| stealth-unde...)
| superb_dev wrote:
| Airlines will love it too. How long until an AI company gets
| paid to prefer a certain company?
| andy99 wrote:
| > it's one of the last purchase powers I'd want to hand over to
| an AI.
|
| A seemingly under-valued aspect of automation is doing it well
| vs just mechanically correctly. This is true for humans as well
| as AI. We had an outsourced Indian "research" team at a old
| employer that would basically just summarize the top 3 google
| search results for your research topic. Now AI does that.
| Likewise, I've had an admin book flights; they suck at it,
| indifferently just picking something and never really able to
| capture complex personal preferences about which schedule is
| best.
|
| Just automating stuff is one thing, doing it thoughtfully is
| very hard, whether human or AI.
| zebomon wrote:
| This is very interesting. I think a lot of people may be quick to
| overlook the value of such simulators when thinking about AI
| agents at the extremes. (Either they're not good enough to trust
| or they're so good they'll leapfrog over any economic value
| here.)
|
| My own experience makes me lean toward thinking that the truth is
| somewhere in the middle in this situation, and that simulators
| like these will be valuable. I've been experimenting a lot with
| computer use on my website Bingeclock, passing through different
| prompts along the lines of "make a movie marathon based on X."
| The newest agents are consistently impressive, while also being
| consistently imperfect in surprising and interesting ways.
|
| Whether or not all the labs are already running this kind of
| thing internally for themselves, you would know better than I.
| But it's an idea that seems very useful nonetheless.
| Congratulations on the launch!
| wujerry2000 wrote:
| Computer use agents are starting to perform well on
| websites/apps that are in their training distribution, but
| still struggle a lot when dealing with tasks outside their
| distribution. A big reason is that many more
| niche/enterprise applications are really hard to test on in the
| real world, hence the need for sims!
|
| re: labs doing this internally. They definitely are! However,
| the scale of sims buildout is going to be massive, probably
| many orders of magnitude above what we have today. We think it
| makes sense for one central player to do this because a really
| good simulator can be used by multiple people at once. It
| doesn't make sense for every AI lab/company to build out their
| own environments if an industry standard catalog exists.
| zebomon wrote:
| Intriguing analysis. I'll be following along with interest!
| sealthedeal wrote:
| Super cool. What would the real-world use cases be for SME
| adoption?
| wujerry2000 wrote:
| A few common ones we've heard:
|
| Engineering: QA automation is huge; it closes the loop on
| "fully automated" software engineering if another computer use
| system is able to click around and help identify bugs in
| software.
|
| Deep Research: probably the biggest use case for computer use
| right now, finding information that isn't easily indexed or
| accessible via APIs.
|
| General RPA: This is industry-specific, but lots of everyday
| knowledge work involves tedious data transfer between platforms
| that no one wants to do. A great example is
| Epic in Healthcare. SO much labor is employed just to write and
| read information from this desktop app that isn't easily
| accessible. Imagine a computer use system that can do automated
| data pulls at scale for legacy desktop apps. This is a huge
| huge use case, and something that we're excited to try and
| improve with simulators of things like Epic, SAP, Salesforce,
| etc.
|
| Consumer: Lots of just general everyday tasks. I would
| recommend checking out https://yutori.com/ if you're interested
| in seeing how a computer use agent can be helpful in your day
| to day. It's fun for daily news reports, restaurant reservation
| checking, etc.
| whymauri wrote:
| Are these simulations shared between your customers, or are you
| building bespoke environments per client/user? How does the
| creation of environments scale?
| wujerry2000 wrote:
| These are really good questions!
|
| We share the public/consumer simulators, but we also build
| bespoke environments on a per-customer basis (think enterprise
| sites or even full VMs loaded with applications and data).
|
| Environment creation scalability is a big priority for us. We
| currently automate most of the process, but it still takes a
| fair bit of manual work to finish environments and get the
| details right. There is some reusability across environments;
| for example, we can use the flight results generation code in
| any travel/flight booking sim. We also have some semi-automated
| approaches for creating tasks and verifiers, but there's still
| lots of work to be done here.
| whymauri wrote:
| Super interesting, thank you.
| orliesaurus wrote:
| Good luck Jerry!!! Interesting pivot for sure, playgrounds for AI
| seems like a good idea, I wish someone tackled them in 3D too
| (not just for browser/computer agent I mean) :P
| DearAll wrote:
| Love what you're doing. Are you currently open to interns? Would
| love to connect with you and chat more about using high quality
| data to help people better train and evaluate their ai agents!
| wm2 wrote:
| hey not hiring right now but connect with me on twitter and we
| can talk more there: https://x.com/wgm752
___________________________________________________________________
(page generated 2025-08-11 23:00 UTC)