[HN Gopher] We gave 5 LLMs $100K to trade stocks for 8 months
       ___________________________________________________________________
        
       We gave 5 LLMs $100K to trade stocks for 8 months
        
       Author : cheeseblubber
       Score  : 348 points
       Date   : 2025-12-04 23:08 UTC (23 hours ago)
        
 (HTM) web link (www.aitradearena.com)
        
       | sethops1 wrote:
       | > Testing GPT-5, Claude, Gemini, Grok, and DeepSeek with $100K
       | each over 8 months of backtested trading
       | 
       | So the results are meaningless - these LLMs have the advantage of
       | foresight over historical data.
        
         | CPLX wrote:
         | Not sure how sound the analysis is but they did apparently
         | actually think of that.
        
         | PTRFRLL wrote:
         | > We were cautious to only run after each model's training
         | cutoff dates for the LLM models. That way we could be sure
         | models couldn't have memorized market outcomes.
        
           | plufz wrote:
           | I know very little about how the environment where they run
           | these models looks, but surely they have access to different
           | tools like vector embeddings with more current data on
           | various topics?
        
             | disconcision wrote:
             | you can (via the api, or to a lesser degree through the
             | setting in the web client) determine what tools if any a
             | model can use
        
               | disconcision wrote:
               | with the exception that it doesn't seem possible to fully
               | disable this for grok 4
        
               | alchemist1e9 wrote:
               | which is curiously the best model ...
        
               | plufz wrote:
               | But isn't that more about which MCPs you can configure it to
               | use? Do we have any idea which secret sauce stuff they
               | have? Surely it's not just a raw model that they are
               | executing?
        
             | endtime wrote:
             | If they could "see" the future and exploit that they'd
             | probably have much higher returns.
        
               | alchemist1e9 wrote:
               | 56% over 8 months with the constraints provided are
               | pretty good results for Grok.
        
               | plufz wrote:
               | I would say that if these models independently could
               | create such high returns all these companies would shut
               | down the external access to the models and just have
               | their own money making machine. :)
        
           | stusmall wrote:
           | Even if it is after the cut off date wouldn't the models be
           | able to query external sources to get data that could
           | positively impact them? If the returns were smaller I could
           | reasonably believe it but beating the S&P500 returns by 4x+
           | strains credulity.
        
             | cheeseblubber wrote:
             | We used the LLMs' APIs and provided custom tools like a stock
             | ticker tool that only gave stock price information for that
             | date of backtest for the model. We did this for news apis,
             | technical indicator apis etc. It took quite a long time to
             | make sure that there wasn't any data leakage. The whole
             | process took us about a month or two to build out.
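
A minimal sketch of the kind of date-gated tool described above, assuming a simple in-memory price table. The class name `DateGatedPriceTool`, the ticker `DEMO`, and every price are hypothetical; this is not the authors' actual implementation, only an illustration of refusing requests dated after the simulated "today".

```python
from datetime import date

# Hypothetical, made-up prices for an imaginary ticker (not real market data).
PRICES = {
    "DEMO": {
        date(2025, 3, 3): 100.0,
        date(2025, 3, 4): 101.5,
        date(2025, 3, 5): 103.0,
    },
}


class DateGatedPriceTool:
    """Serves prices only up to the current simulated date."""

    def __init__(self, prices, sim_date):
        self.prices = prices
        self.sim_date = sim_date

    def advance_to(self, new_date):
        # The simulation clock only moves forward, as in a chronological backtest.
        if new_date < self.sim_date:
            raise ValueError("simulation clock cannot move backwards")
        self.sim_date = new_date

    def get_price(self, ticker, on):
        # Refuse any request dated after the simulated "today".
        if on > self.sim_date:
            raise PermissionError(f"{on} is after simulated date {self.sim_date}")
        return self.prices[ticker][on]
```

The same gate would have to wrap every tool the model can call (news, indicators, and so on); it does nothing about knowledge already baked into the model's weights.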
        
               | alchemist1e9 wrote:
               | I have a hunch Grok model cutoff is not accurate and
               | somehow it has updated weights though they still call it
               | the same Grok model as the params and size are unchanged
               | but they are incrementally training it in the background.
               | Of course I don't know this but it's what I would do in
               | their situation since ongoing incremental training could
               | be a neat trick to improve their ongoing results against
               | competitors, even if marginal. I also wouldn't trust the
               | models to honestly disclose their decision process
               | either.
               | 
               | That said. This is a fascinating area of research and I
               | do think LLM driven fundamental investing and trading has
               | a future.
        
         | itake wrote:
         | > We time segmented the APIs to make sure that the simulation
         | isn't leaking the future into the model's context.
         | 
         | I wish they could explain what this actually means.
        
           | nullbound wrote:
           | Overall, it does sound weird. Assuming I understand them
           | properly, what they are saying is that they removed the
           | model's ability to cheat based on its specific training.
           | And I do get that nuance ablation is a thing, but this is
           | not what they are discussing there. They are only removing
           | one avenue for the model to 'cheat'. For all we know, some
           | of that data may have been part of its training set
           | already...
        
           | devmor wrote:
           | It's a very silly way of saying that the data the LLMs had
           | access to was presented in chronological order, so that for
           | instance, when they were trading on stocks at the start of
           | the 8 month window, the LLMs could not just query their APIs
           | to see the data from the end of the 8 month window.
        
         | joegibbs wrote:
         | That's only if they're trained on data more recent than 8
         | months ago
        
       | deadbabe wrote:
       | Yea, so this is bullshit. An approximation of reality still isn't
       | reality. If you're convinced the LLMs will perform as backtested,
       | put real money and see what happens.
        
       | chroma205 wrote:
       | >We gave each of five LLMs $100K in paper money
       | 
       | Stopped reading after "paper money"
       | 
       | Source: quant trader. paper trading does not incorporate market
       | impact
        
         | zahlman wrote:
         | If your initial portfolio is 100k you are not going to have
         | meaningful "market impact" with your trades assuming you
         | actually make them vs. paper trading.
        
         | a13n wrote:
         | I mean if you're going to write algos that trade the first
         | thing you should do is check whether they were successful on
         | historical data. This is an interesting data point.
         | 
         | Market impact shouldn't be considered when you're talking about
         | trading S&P stocks with $100k.
        
           | verdverm wrote:
           | Historical data is useful for validation, don't develop algos
           | against it, test hypotheses until you've biased your data,
           | then move on to something productive for society
        
         | txg wrote:
         | Lack of market response is a valid point, but $100k is pretty
         | unlikely to have much impact especially if spread out over
         | multiple trades.
        
         | tekno45 wrote:
         | the quant trader you talked to probably sucks.
        
       | dash2 wrote:
       | There's also this thing going on right now:
       | https://nof1.ai/leaderboard
       | 
       | Results are... underwhelming. All the AIs are focused on
       | daytrading Mag7 stocks; almost all have lost money with gusto.
        
         | syntaxing wrote:
         | Let me guess, the mystery model is theirs
        
           | yahoozoo2 wrote:
           | It says "Undisclosed frontier AI Lab (not Nof1)"
        
         | richardhenry wrote:
         | If I'm understanding this website correctly, these models can
         | only trade in a handful of tech stocks along with the XYZ100
         | hyperliquid coin?
        
         | enlyth wrote:
         | With the speed of how pricing information propagates, this
         | seems way too dependent on how the agent is built, what
         | information it has access to, and the feedback loop between the
         | LLM and actions it can carry out
        
         | mjk3026 wrote:
         | I also saw the hype on X yesterday and had already checked the
         | https://nof1.ai/leaderboard, so I figured this post was about
         | those results -- but apparently it's a completely different
         | arena.
         | 
         | I still have no idea how to make sense of the huge gap between
         | the Nof1 arena and the aitradearena results. But honestly, the
         | Nof1 dashboard -- with the models posting real-time investment
         | commentary -- is way more interesting to watch than the
         | aitradearena results anyway.
        
         | rallies wrote:
         | I think the big limitation of nof1 is that they're not using a
         | lot of data that an actual investor would use when researching
         | companies.
         | 
         | We're trying to fix some of those limitations and run a similar
         | live competition at https://rallies.ai/arena
        
       | chongli wrote:
       | They outperformed the S&P 500 but seem to be fairly well
       | correlated with it. Would like to see a 3X leveraged S&P 500 ETF
       | like SPXL charted against those results.
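
A rough sketch of the comparison suggested above, assuming daily return series are available. The helper names `growth` and `pearson` are made up for illustration, the numbers are invented rather than real market data, and fund fees and financing costs are ignored.

```python
from math import sqrt


def growth(daily_returns, leverage=1.0):
    """Growth of $1 when exposure is reset to `leverage` times the index every day."""
    value = 1.0
    for r in daily_returns:
        value *= 1.0 + leverage * r
    return value


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented daily returns, for illustration only.
index_returns = [0.01, -0.02, 0.015, 0.005, -0.01]
portfolio_returns = [0.02, -0.05, 0.04, 0.01, -0.03]

print("index 1x :", growth(index_returns))
print("index 3x :", growth(index_returns, leverage=3.0))
print("portfolio:", growth(portfolio_returns))
print("corr     :", pearson(index_returns, portfolio_returns))
```

If a portfolio tracks the index closely and merely swings harder, a daily-reset leveraged benchmark is the fairer yardstick than the plain index.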
        
         | 10000truths wrote:
         | ...over the course of 8.5 months, which is way too short for a
         | meaningful result. If their strategy could outperform the S&P
         | 500's 10-year return, they wouldn't be blogging about it.
        
         | driverdan wrote:
         | VTI gained over 10% in that time period so it wasn't much
         | better.
        
       | bcrosby95 wrote:
       | > Grok ended up performing the best while DeepSeek came close to
       | second. Almost all the models had a tech-heavy portfolio which
       | led them to do well. Gemini ended up in last place since it was
       | the only one that had a large portfolio of non-tech stocks.
       | 
       | I'm not an investor or researcher, but this triggers my spidey
       | sense... it seems to imply they aren't measuring what they think
       | they are.
        
         | etchalon wrote:
         | I don't feel like they measured anything. They just confirmed
         | that tech stocks in the US did pretty well.
        
           | JoeAltmaier wrote:
           | They measured the investment facility of all those LLMs.
           | That's pretty much what the title says. And they had
           | dramatically different outcomes. So that tells me something.
        
             | DennisP wrote:
             | I mean, what it kinda tells me is that people talk about
             | tech stocks the most, so that's what was most prevalent in
             | the training data, so that's what most of the LLMs said to
             | invest in. That's the kind of strategy that works until it
             | really doesn't.
        
               | ghaff wrote:
               | Cue 2020 or so. I do have investments in tech stocks but
               | I have a lot more conservative investments too.
        
             | Libidinalecon wrote:
             | It shows nothing. This is a bullshit stunt that should be
             | obvious to anyone who has placed a few trades.
        
               | JoeAltmaier wrote:
               | Unless you think of it as an AI exercise, not a stock
               | trading exercise. Which point evaded most people.
        
             | skeeter2020 wrote:
             | They "proved" that US tech stocks did better than
             | portfolios with fewer US tech stocks over a recent, very
             | short time range. 1. You didn't know that? 2. What are you
             | going to do with this "new information"?
        
               | JoeAltmaier wrote:
               | As a stock-trading exercise? Nothing, as you note. As an
               | AI investigation it says plenty. Which is the point I was
               | making (and got missed by all those stock-trading self-
               | appointed experts who fastened onto that)
        
         | olliepro wrote:
         | A more sound approach would have been to do a monte carlo
         | simulation where you have 100 portfolios of each model and look
         | at average performance.
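
One way the repeated-run idea above could be sketched. `run_backtest` is a hypothetical stand-in that just draws a random number so the aggregation logic can execute; a real version would replay the market with the LLM sampling enabled. The model names and the distribution parameters are placeholders, not results.

```python
import random
from statistics import mean, stdev


def run_backtest(model_name, seed):
    """Hypothetical stand-in for one full backtest run.

    Returns a final return as a fraction. The value is random filler,
    not a measurement of any real model.
    """
    rng = random.Random(f"{model_name}-{seed}")
    return rng.gauss(0.10, 0.15)


def monte_carlo(model_names, n_runs=100):
    """Run each model n_runs times and summarise the spread of outcomes."""
    summary = {}
    for name in model_names:
        results = [run_backtest(name, seed) for seed in range(n_runs)]
        summary[name] = {
            "mean": mean(results),
            "stdev": stdev(results),
            "runs": n_runs,
        }
    return summary


if __name__ == "__main__":
    for name, stats in monte_carlo(["model_a", "model_b"]).items():
        print(name, stats)
```

Reporting the spread alongside the mean matters here: a single run per model cannot distinguish a better strategy from a lucky draw.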
        
           | observationist wrote:
           | Grok would likely have an advantage there, as well - it's got
           | better coupling to X/Twitter, a better web search index,
           | fewer safety guardrails in pretraining and system prompt
           | modification that distort reality. It's easy to envision
           | random market realities that would trigger ChatGPT or Claude
           | into adjusting the output to be more politically correct.
           | DeepSeek would be subject to the most pretraining distortion,
           | but have the least distortion in practice if a random neutral
           | host were selected.
           | 
           | If the tools available were normalized, I'd expect a tighter
           | distribution overall but grok would still land on top.
           | Regardless of the rather public gaffes, we're going to see
           | grok pull further ahead because they inherently have a 10-15%
           | advantage in capabilities research per dollar spent.
           | 
           | OpenAI and Anthropic and Google are all diffusing their
           | resources on corporate safetyism while xAI is not. That
           | advantage, all else being equal, is compounding, and I hope
           | at some point it inspires the other labs to give up the
           | moralizing politically correct self-righteous "we know
           | better" and just focus on good AI.
           | 
           | I would love to see a frontier lab swarm approach, though.
           | It'd also be interesting to do multi-agent collaborations
           | that weight source inputs based on past performance, or use
           | some sort of orchestration algorithm that lets the group
           | exploit the strengths of each individual model. Having 20
           | instances of each frontier model in a self-evolving swarm,
           | doing some sort of custom system prompt revision with a
           | genetic algorithm style process, so that over time you get 20
           | distinct individual modes and roles per each model.
           | 
           | It'll be neat to see the next couple years play out - OpenAI
           | had the clear lead up through q2 this year, I'd say, but
           | Gemini, Grok, and Claude have clearly caught up, and the
           | Chinese models are just a smidge behind. We live in
           | wonderfully interesting times.
        
             | jessetemp wrote:
             | > fewer safety guardrails in pretraining and system prompt
             | modification that distort reality.
             | 
             | Really? Isn't Grok's whole schtick that it's Elon's
             | personal altipedia?
        
               | nickthegreek wrote:
               | My understanding is that grok api is way different than
               | the grok x bot. Which of course doesn't do Grok as a business
               | any favors. Personally, I do not engage with either.
        
               | bdangubic wrote:
               | you gotta be quite a crazy person to use grok :)
        
               | airstrike wrote:
               | @grok is this true?
        
               | bdangubic wrote:
               | ... checking with my creator ...
        
               | AlexCoventry wrote:
               | Grok is good for up-to-the-minute information, and for
               | requests that other chat services refuse to entertain,
               | like requests for instructions on how to physically
               | disable the cellular modem in your car.
        
               | KPGv2 wrote:
               | I sat in my kid's extracurricular a couple months ago and
               | had an FBI agent tell me that Grok was the most
               | trustworthy based on "studies," so that's what she had
               | for her office.
        
               | skeeter2020 wrote:
               | Did she get that info from Grok?
        
               | bdangubic wrote:
               | Grok has Elon as better athlete than LeBron so I would
               | agree with FBI Agent. can't get that kind of insight
               | anywhere else :)
        
               | doe88 wrote:
               | Maybe being crazy is what you need to bet on the stock market
               | - not financial advice, and also not written by Grok -
               | I swear :))
        
               | observationist wrote:
               | It's excellent, and it doesn't get into the weird
               | ideological ruts and refusals other bots do.
               | 
               | Grok's search and chat is better than the other
               | platforms, but not $300/month better, ChatGPT seems to be
               | the best no rate limits pro class bot. If Grok 5 is a
               | similar leap in capabilities as 3 to 4, then I might pay
               | the extra $100 a month. The "right wing Elon sycophant"
               | thing is a meme based on hiccups with the public facing
               | twitter bot. The app, api, and web bot are just generally
               | very good, and do a much better job at neutrality and
               | counterfactuals and not refusing over weird moralistic
               | nonsense.
        
             | UncleMeat wrote:
             | I know that Musk deserving a lifetime achievement award at
             | the Adult Video Network awards over Riley Reid is
             | definitely an indication of minimal "system prompt
             | modification that distort[s] reality."
        
               | scubbo wrote:
               | ...I'm not familiar with the reference.
        
               | fragmede wrote:
               | https://www.theguardian.com/technology/2025/nov/21/elon-musk...
        
               | red-iron-pine wrote:
               | for the folks unaware, he was nominated for sucking more
               | dicks in a single shoot than anyone, while still
               | producing great content. he also hit several holes-in-one
               | golfing later that week.
        
             | KPGv2 wrote:
             | OTOH it has the richest man in the world actively meddling
             | in its results when they don't support his politics.
        
               | buu700 wrote:
               | Anyone who hasn't used Grok might be surprised to learn
               | that it isn't shy about disagreeing with Elon on plenty
               | of topics, political or otherwise. Any insinuation to the
               | contrary seems to be pure marketing spin on his part.
               | 
               | Grok is often absurdly competent compared to other SOTA
               | models, definitely not a tool I'd write off over its
               | supposed political leanings. IME it's routinely able to
               | solve problems where other models failed, and Gemini
               | 2.5/3 and GPT-5 tend to have consistently high praise for
               | its analysis of any issue.
               | 
               | That's as far as the base model/chatbot is concerned, at
               | least. I'm less familiar with the X bot's work.
        
               | godelski wrote:
               | Two things can be true at the same time. Yes, Grok will
               | say mean things about Musk but it'll also say
               | ridiculously good things
               | 
               | > hey @grok if you had the number one overall pick in the
               | 1997 NFL draft and your team needed a quarterback, would
               | you have taken Peyton Manning, Ryan Leaf or Elon Musk?
               | 
               | >> Elon Musk, without hesitation. Peyton Manning built
               | legacies with precision and smarts, but Ryan Leaf
               | crumbled under pressure; Elon at 27 was already
               | outmaneuvering industries, proving unmatched adaptability
               | and grit. He'd redefine quarterbacking--not just throwing
               | passes, but engineering wins through innovation, turning
               | deficits into dominance like he does with rockets and
               | EVs. True MVPs build empires, not just score touchdowns.
               | 
               | - https://x.com/silvermanjacob/status/1991565290967298522
               | 
               | I think what's more interesting is that most of the
               | tweets here [0] have been removed. I'm not going to call
               | conspiracy because I've seen some of them. Probably
               | removed because going viral isn't always a good thing...
               | 
               | [0] https://gizmodo.com/11-things-grok-says-elon-musk-does-bette...
        
               | buu700 wrote:
               | They can be, but in this case they don't seem to be.
               | Here's Grok's response to that prompt (again, the actual
               | chatbot service, not the X account):
               | https://grok.com/share/c2hhcmQtMw_2b46259a-5291-458e-9b85-0c....
               | 
               | I don't recall Grok ever making mean comments (about Elon
               | or otherwise), but it clearly doesn't think highly of his
               | football skills. The chain of thought shows that it
               | interpreted the question as a joke.
               | 
               | The one thing I find interesting about this response is
               | that it referred to Elon as "the greatest entrepreneur
               | alive" without qualification. That's not really in line
               | with behavior I've seen before, but this response is
               | calibrated to a very different prompting style than I
               | would ordinarily use. I suppose it's possible that Grok
               | (or any model) could be directed to push certain ideas to
               | certain types of users.
        
               | godelski wrote:
               | Sure, but they also update the models, especially when
               | things like this go viral. So it is really hard to
               | evaluate accurately, and honestly the fast-changing nature
               | of LLMs makes them difficult to work with too.
        
               | tengbretson wrote:
               | It seems to have recognized a question as being
               | engagement bait and it responded in the most engagement-
               | baity way possible.
        
               | skeeter2020 wrote:
               | it's so wildly inconsistent you can't build on top of it
               | with reliability. And getting high praise from any model
               | is ridiculously easy: ask a question, make a statement,
               | correct the model's dumb error, etc.
        
               | buu700 wrote:
               | It's easy for us as humans to correct dumb mistakes made
               | by AI. It's less easy for AI to correct mistakes made by
               | AI.
               | 
               | What's remarkable on Grok's part is when it spends five
               | minutes churning through a few thousand lines of code
               | (not the whole codebase, just the relevant files) and
               | arrives at the correct root cause of a complex
               | bug in one shot.
               | 
               | Grok as a model may or may not be uniquely amazing per
               | se, but the service's eagerness to throw compute at
               | problems that genuinely demand it is a superpower that
               | at least makes it uniquely amazing in practice. By
               | comparison, even Gemini 3 often returns
               | lazy/shallow/wrong responses (and I say that as a regular
               | user of Gemini).
        
           | cyberrock wrote:
           | While not strictly stocks, it would be interesting to see
           | them trade on game economies like EVE, WoW, RuneScape,
           | Counter Strike, PoE, etc.
        
           | ekianjo wrote:
           | indeed, and also a "model" does not mean anything per se, you
           | have hundreds of different prompts, you can layer agents on
           | top, you can use temperature that will lead to different
           | outcomes. The number of dimensions to explore is huge.
        
         | IgorPartola wrote:
         | Yeah I mean if you generally believe the tech sector is going
         | to do well because it has been doing well you will beat the
         | overall market. The problem is that you don't know if and when
         | there might be a correction. But since there is this one
         | segment of the overall market that has this steady upwards
         | trend and it hasn't had a large crash, then yeah any pattern
         | seeking system will identify "hey this line keeps going up!"
         | Would it have the nuance to know when a crash is coming if none
         | of the data you test it on has a crash?
         | 
         | It would almost be more interesting to specifically train the
         | model on half the available market data, then test it on
         | another half. But here it's like they added a big free loot box
         | to the game and then said "oh wow the player found really good
         | gear that is better than the rest!"
         | 
         | Edit: from what I casually remember a hedge fund can beat the
         | market for 2-4 years but at 10 years and up their chances of
         | beating the market go to very close to zero. Since LLMs have
         | not been around for that long it is going to be difficult to
         | test this without somehow segmenting the data.
        
           | tshaddox wrote:
           | > It would almost be more interesting to specifically train
           | the model on half the available market data, then test it on
           | another half.
           | 
           | Yes, ideally you'd have a model trained only on data up to
           | some date, say January 1, 2010, and then start running the
           | agents in a simulation where you give them each day's new
           | data (news, stock prices, etc.) one day at a time.
        
             | IgorPartola wrote:
             | I mean ultimately this is an exercise in frustration
             | because if you do that you will have trained your model on
             | market patterns that might not be in place anymore. For
             | example after the 2008 recession regulations changed. So do
             | market dynamics actually work the same in 2025 as in 2005?
             | I honestly don't know but intuitively I would say that it
             | is possible that they do not.
             | 
             | I think a potentially better way would be to segment the
             | market up to today but take half or 10% of all the stocks
             | and make only those available to the LLM. Then run the test
             | on the rest. This accounts for rules and external forces
             | changing how markets operate over time. And you can do this
             | over and over picking a different 10% market slice for
             | training data each time.
             | 
             | But then your problem is that if you exclude let's say
             | Intel from your training data and AMD from your testing
             | data then their ups and downs don't really make sense since
             | they are direct competitors. If you separate by market
             | segment then training the model on software tech
             | companies might not actually tell you accurately how it
             | would do for commodities or currency trading. Or maybe I
             | am wrong and trading is trading no matter what you are
             | trading.
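
The repeated 10% slicing described above might look roughly like this. The function name `ticker_splits` and the placeholder tickers are assumptions for illustration; the sketch only produces the splits and does not address the competitor-leakage problem raised in the last paragraph of the comment.

```python
import random


def ticker_splits(tickers, visible_fraction=0.10, n_rounds=5, seed=0):
    """Yield (visible, held_out) ticker lists, drawing a fresh random slice each round."""
    rng = random.Random(seed)
    n_visible = max(1, round(len(tickers) * visible_fraction))
    for _ in range(n_rounds):
        visible = set(rng.sample(tickers, n_visible))
        held_out = [t for t in tickers if t not in visible]
        yield sorted(visible), held_out


if __name__ == "__main__":
    # Placeholder ticker names, not real symbols.
    universe = [f"T{i:02d}" for i in range(20)]
    for visible, held_out in ticker_splits(universe, n_rounds=3):
        print("fit on:", visible, "| test on", len(held_out), "others")
```

Splitting by sector instead of uniformly at random would be a small change to how `visible` is drawn, at the cost of the generalisation problem the comment goes on to describe.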
        
               | chris_st wrote:
               | > _you will have trained your model on market patterns
               | that might not be in place anymore_
               | 
               | My working definition of technical analysis [0]
               | 
               | [0]: https://en.wikipedia.org/wiki/Technical_analysis
        
               | IgorPartola wrote:
               | It is always fun (in a broad sense of that word) when I
               | make a comment on an industry I know nothing about and
               | somehow stumble onto a thing that not only has a name but
               | also research. I am sure there is a German word for that
               | feel of discovering something that countless others have
               | already discovered.
        
               | chris_st wrote:
               | XKCD calls it the "Lucky 10,000" [0]
               | 
               | [0]: https://xkcd.com/1053/
        
               | mewpmewp2 wrote:
               | That is referring to something else entirely. This is
               | referring to some common fact that the person didn't
               | figure out by themself. OP is referring to something they
               | came up with themselves in a field they have no
               | experience with, then realizing it is actually a thing
               | and, in a way, feeling validated and clever.
        
               | taneq wrote:
               | Any time I invent a cool thing, I go and try and find it
               | online. Usually it's already an established product,
               | which totally validates my feeling that the thing I
               | invented is cool and would be a good product. :D
               | 
               | Occasionally it's (as far as I can tell) a legitimately
               | new 'wow that's obvious' style thing and I consider
               | prototyping it. :)
        
               | chasing0entropy wrote:
               | What have you prototyped recently? Anything you have
               | released to market? I'm in the same general area but am
               | teetering on actually launching products. Wouldn't mind
               | connecting with a like-minded engineer.
        
               | biztos wrote:
               | > there is a German word
               | 
               | Zeitgeistuberspannungsfreude
        
               | stouset wrote:
               | I am frankly astonished at the number of otherwise-
               | intelligent people who actually seem to believe in this
               | stuff.
               | 
               | One of the worst possible things to do in a competitive
               | market is to trade by some publicly-available formulaic
               | strategy. It's like announcing your rock-paper-scissors
               | move to your opponent in advance.
        
               | tim333 wrote:
               | A couple of subtleties in that. Rather than rock paper
               | scissors with three options, there are hundreds of
               | technical strategies out there so you may still be doing
               | something unusual. Secondly the mass of the public are
               | kind of following a technical strategy of just buying index
               | funds because the index has gone up in the past. Which is
               | ignoring the fundamental issue of whether stocks are decent
               | value for money at the moment.
        
               | intalentive wrote:
               | Technical analysis is a basket of heuristics. Support /
               | resistance / breakout (especially around whole numbers)
               | seems to reflect persistent behavior rooted in human
               | psychology. Look at the heavy buying at the $30 mark
               | here, putting a floor under silver:
               | https://finviz.com/futures_charts.ashx?p=d&t=SI This is a
               | common pattern it can be useful to know.
        
               | 0manrho wrote:
               | > you will have trained your model on market patterns
               | that might not be in place anymore
               | 
               | How is that relevant to what was proposed? If it's
               | trading and training on 2010 data, what relevance do
               | today's market dynamics and regulations have?
               | 
               | Which further begs the question, what's the point of this
               | exercise?
               | 
               | Is it to develop a model that can compete effectively in
               | today's market? If so then yeah, the 2010
               | trading/training idea probably isn't the best idea for
               | the reasons you've outlined.
               | 
               | Or is it to determine the capacity of an AI to learn and
               | compete effectively within any given arbitrary
               | market/era? If so, then today's dynamics/constraints are
               | irrelevant unless you're explicitly trying to train/trade
               | on today's markets (which isn't what the person you're
               | replying to proposed, but is obviously a valid desire and
               | test case to evaluate in its own right)
               | 
               | Or is it evaluating its ability to identify what those
               | constraints/limitations are and then build strategies
               | based on it? In which case it doesn't matter _when_
               | you're training/trading so much as your ability to feed it
               | accurate and complete data for that time period be it
               | today, or 15 years ago or whenever, which is no small
               | ask.
        
               | noduerme wrote:
               | Just to name a different but related approach, as a hobby
               | project I built a (non LLM) model that trained mainly on
               | data from stocks that didn't move much over the past
               | decade, seeking ways to beat the performance of those
               | particular stocks. I put it into practice for a couple of
               | years, and came out roughly even by constantly
               | rebalancing a basket of stocks that, as a whole, dropped
               | by about 20%. I considered that to be a success, although
               | it would've been nicer to make money.
        
               | godelski wrote:
               | > I think a potentially better way would be to segment
               | the market up to today but take half or 10% of all the
               | stocks and make only those available to the LLM.
               | 
               | Autocorrelation is going to bite you in the ass.
               | 
               | Those stocks are going to be coupled. Let's take an easy
               | example. Suppose you include Nvidia in the training data
               | and hold out AMD for test. Is there information leakage?
               | Yes. The problem is that each company isn't independent.
               | You have information leakage in both the setting where
               | companies grow together as well as zero sum games (since
               | x + y = 0, if you know x then you know y). But in this
               | example AMD trends with Nvidia. Maybe not as much, but
               | they go in the same direction. They're coupled.
               | 
               | Not to mention that in the specific setting the LLMs were
               | given news and other information.
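A minimal sketch of the coupling point above, using synthetic data only. The two series are made-up "same sector" stocks driven by one shared factor, not real Nvidia or AMD returns; the factor sizes and the names `simulate_coupled_returns` and `pearson` are assumptions chosen for illustration. It shows that returns of a "held-out" stock can be strongly correlated with a stock left in the training set, so holding tickers out does not remove the information.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_coupled_returns(n_days, seed=0):
    """Two synthetic daily-return series that share a common sector factor."""
    rng = random.Random(seed)
    train_stock, held_out_stock = [], []
    for _ in range(n_days):
        sector = rng.gauss(0.0, 0.02)  # shared sector move (assumed size)
        # Each stock = exposure to the sector move + its own small noise.
        train_stock.append(sector + rng.gauss(0.0, 0.005))
        held_out_stock.append(0.8 * sector + rng.gauss(0.0, 0.005))
    return train_stock, held_out_stock


train_stock, held_out_stock = simulate_coupled_returns(500)
leak = pearson(train_stock, held_out_stock)
print(f"correlation between 'train' and 'held-out' returns: {leak:.2f}")
```

A high correlation here means a model that has seen one series has effectively seen most of the other.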
        
             | hxtk wrote:
             | I suspect trading firms have already done this to the
             | maximum extent that it's profitable to do so. I think if
             | you were to integrate LLMs into a trading algorithm, you
             | would need to incorporate more than just signals from the
             | market itself. For example, I hazard a guess you could
             | outperform a model that operates purely on market data with
             | a model that also includes a vector embedding of a
             | selection of key social and news media accounts or other
             | information sources that have historically been difficult
             | to encode until LLMs.
        
               | giantg2 wrote:
               | "includes a vector embedding of a selection of key social
               | and news media accounts or other information sources that
               | have historically been difficult to encode until LLMs."
               | 
               | Not really. Sentiment analysis in social networks has
               | been around for years. It's probably cheaper to buy that
               | analysis and feed it to LLMs than to have LLMs do it.
        
               | solotronics wrote:
               | The part people are missing here is that if the trading
               | firms are all doing something, that in itself influences
               | the market.
               | 
               | If they are all giving the LLMs money to invest and the
               | AIs generally buy the same group of stocks, those stocks
               | will go up. As more people attempt the strategy it
               | infuses fresh capital and, more importantly, signals to
               | the trading firms that there are inflows to these stocks.
               | I think it's probably a reflexive loop at this point.
        
               | brendoelfrendo wrote:
               | They could have the AI perform paper trading: give it a
               | simulated account but real data. This would make sense to
               | me if it was just a research project. That said, I
               | imagine the more high-tech trading firms started running
               | this research a long time ago and wouldn't be surprised
               | if there were already LLM-based trading bots that could
               | be influencing the market.
        
           | calmbonsai wrote:
           | For a nice historic perspective on hedge funds and the
           | industry as a whole, read Mallaby's "More Money Than God".
        
           | ainiriand wrote:
           | As an old investor friend of mine always says: 'It is really
           | easy to make money in the market when everyone is doing it,
           | just try to not lose it when they lose it'.
        
           | arisAlexis wrote:
           | You believe in the tech sector because technology always goes
           | well and it's what humans strive to achieve, not because it
           | has done well recently. It has always.
        
             | knollimar wrote:
             | When does the tech sector become the computer sector?
             | 
             | Agriculture would have been considered tech 200 years ago.
        
               | arisAlexis wrote:
               | full throttle until AGI is achieved, then we will see
        
               | d-lisp wrote:
               | Maybe one day we will discover that a method exists for
               | computing/displaying/exchanging arbitrary things through
               | no other means than our own flesh and brains.
        
           | Eddy_Viscosity2 wrote:
           | > a hedge fund can beat the market for 2-4 years but at 10
           | years and up their chances of beating the market go to very
           | close
           | 
           | In that case the winning strategy would be to switch hedge
           | funds every 3 years.
        
             | perlgeek wrote:
             | The problem is that you don't know in advance which will be
             | doing well when.
        
             | skeeter2020 wrote:
             | Except you don't know which fund is going to "go on a hot
             | streak" or when the magic will end. The original statement
             | only holds when looking at historical data; it's not
             | predictive.
        
           | stonemetal12 wrote:
           | Would that work for LLMs though? They hypothetically trained
           | on newspapers from the second half of the data so they have
           | knowledge of "future" events.
        
         | monksy wrote:
         | They're not measuring performance in the context of when things
         | happen and in the time that they are. I think it's only showing
         | recent performance and popularity. To actually evaluate how
         | these do you need to be able to correct the model and retrain
         | it per different time periods and then measure how it would do.
         | Then you'll get better information from the backtesting.
        
         | seanmcdirmid wrote:
         | We had this discussion in previous posts about congressional
         | leaders who had the risk appetite to go tech heavy and
         | therefore outperformed normal congress critters.
         | 
         | Going heavy on tech can be rewarding, but you are taking on
         | more risk of losing big in a tech crash. We all know that, and
         | if you don't have that money to play riskier moves, it's not
         | really a move you can take.
         | 
         | Long term it is less of a win if a tech bubble builds and pops
         | before you can exit (and you can't wait it out to re-inflate).
        
           | hobobaggins wrote:
           | They didn't just outperform "normal" congress critters.. they
           | also outperformed nearly every hedge fund on the planet. But
           | they (meaning, of course, just one person and their spouse)
           | are obviously geniuses.
        
             | seanmcdirmid wrote:
             | Hedge funds suck though. They don't invest in FAANG, they
             | do risky stuff that doesn't pay off, you are still
             | comparing incomparable things.
             | 
             | I'm obviously a genius because 90% of my stock is in tech,
             | most of us on HN are geniuses in your opinion?
        
               | cap11235 wrote:
               | What do you think hedge funds do?
        
               | seanmcdirmid wrote:
               | They use crazy investment strategies that allow them to
               | capture high returns in adverse general market
               | conditions, but they rather underperform the general
               | market in normal and booming conditions. "Hedge" is
               | actually in their name for a reason. Rich people use
               | hedge funds for...hedging.
        
               | mvkel wrote:
               | Downside protection. Hedging. Giving you gains at the
               | lowest beta possible.
        
             | stouset wrote:
             | Hedge funds' goals are often not to maximize profit, but to
             | provide returns uncorrelated with the rest of some
             | benchmark market. This is useful for the wealthy as it
             | means you can better survive market crashes.
        
             | Guillaume86 wrote:
             | They also outperformed themselves before being in a leader
             | position...
        
           | directevolve wrote:
           | This is a wildly disingenuous interpretation of that study.
           | 
           | "Using transaction-level data on US congressional stock
           | trades, we find that lawmakers who later ascend to leadership
           | positions perform similarly to matched peers beforehand but
           | outperform them by 47 percentage points annually after
           | ascension. Leaders' superior performance arises through two
           | mechanisms. The political influence channel is reflected in
           | higher returns when their party controls the chamber, sales
           | of stocks preceding regulatory actions, and purchase of
           | stocks whose firms receiving more government contracts and
           | favorable party support on bills. The corporate access
           | channel is reflected in stock trades that predict subsequent
           | corporate news and greater returns on donor-owned or home-
           | state firms."
           | 
           | https://www.nber.org/papers/w34524
        
         | tclancy wrote:
         | I mean, run the experiment during a different trend in the
         | market and the results would probably be wildly different. This
         | feels like chartists [1] but lazier.
         | 
         | [1] https://www.investopedia.com/terms/c/chartist.asp
        
           | refactor_master wrote:
           | If you've ever read a blog on trading when LSTMs came out,
           | you'd have seen all sorts of weird stuff with predicting the
           | price at t+1 on a very bad train/test split, where the author
           | would usually say "it predicts t+1 with 99% accuracy compared
           | to t", and the graph would be an exact copy with a t+1
           | offset.
           | 
           | So eye-balling the graph looks great, almost perfect even,
           | until you realize that in real-time the model would've
           | predicted yesterday's high on today's market crash and you'd
           | have lost everything.
        
             | blitzar wrote:
             | if you feed in price i.e. 280.1, 281.5, 281.9 ... you are
             | going to get some pretty good looking results when it comes
             | to predicting the next day's price (t+1) with a margin of
             | +/- a percent or so.
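A rough sketch of the t+1 pitfall described in the two comments above, on a synthetic random walk rather than real prices. The starting price, the 2% daily move cap, and the function names are assumptions for illustration. The "model" simply repeats yesterday's price: its error on price levels looks tiny, yet the return it implies is always zero, so it carries no tradable signal.

```python
import random


def simulate_prices(n_days, start=280.0, max_daily_move=0.02, seed=0):
    """Synthetic random-walk prices; each day moves by at most max_daily_move."""
    rng = random.Random(seed)
    prices = [start]
    for _ in range(n_days - 1):
        change = rng.uniform(-max_daily_move, max_daily_move)
        prices.append(prices[-1] * (1.0 + change))
    return prices


def persistence_forecast(prices):
    """Naive forecast: the prediction for day t+1 is the price on day t."""
    return prices[:-1]


def mean_abs_pct_error(actual, predicted):
    """Average absolute error as a fraction of the actual price."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


prices = simulate_prices(250)
forecast = persistence_forecast(prices)  # forecasts for days 1..249
actual = prices[1:]                      # what actually happened on those days
mape = mean_abs_pct_error(actual, forecast)

# Return the forecast implies relative to the last known price: always zero.
implied_returns = [f / last - 1.0 for f, last in zip(forecast, prices[:-1])]

print(f"price-level error: {mape:.2%}")
print(f"largest implied return: {max(abs(r) for r in implied_returns):.2%}")
```

Plotting `forecast` against `actual` would give the "exact copy with a t+1 offset" chart: visually convincing, and useless for deciding what to buy.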
        
           | throwawayffffas wrote:
           | To be fair to chartists, they try to identify if they are in
           | a bear market or one is coming and get out early.
        
         | culi wrote:
         | I'd like to see this study replicated during a bear market
        
           | gizajob wrote:
           | Yeah the timeframe is crucial here. The experiment began as
           | Trump launched his tariff tweets which caused a huge downward
           | correction and then a large uptrend. Buying almost anything
           | tech at the start of this would have made money.
        
           | petercooper wrote:
           | Agreed. While I don't see it outperforming long held funds,
           | it'd be interesting to see if they could pick up on negative
           | signals in the news feed, and also any potential advantage of
           | not being emotional about its decisions.
        
         | KPGv2 wrote:
         | Also studying for eight months is not useful. Loads of traders
         | do this well for eight months and then do shit for the next
         | five years. And tellingly, they didn't beat the S&P 500. They
         | invested in something else that beat the S&P 500. And the one
         | that didn't invest in that something did _worse_ than the S&P
         | 500.
         | 
         | What this tells me is they were lucky to have picked something
         | that would beat the market _for now_.
        
         | mvkel wrote:
         | S&P 500 is also tech heavy and notoriously difficult to beat
         | over the long run
        
         | micromacrofoot wrote:
         | probably hitching onto sycophancy for the parent company and
         | getting lucky as a result... that Grok September rally aligns
         | somewhat with TSLA for instance
        
       | parpfish wrote:
       | I wonder if this could be explained as the result of LLMs being
       | trained to have pro-tech/ai opinions while we see massive run ups
       | in tech stock valuations?
       | 
       | It'd be great to see how they perform within particular sectors
       | so it's not just a case of betting big on tech while tech stocks
       | are booming
        
       | gwd wrote:
       | The summary to me is here:
       | 
       | > Almost all the models had a tech-heavy portfolio which led them
       | to do well. Gemini ended up in last place since it was the only
       | one that had a large portfolio of non-tech stocks.
       | 
       | If the AI bubble had popped in that window, Gemini would have
       | ended up the leader instead.
        
         | turtletontine wrote:
         | Yup. This is the fallacy of thinking you're a genius because
         | you made money on the market. Being lucky at the moment (or
         | even the last 5 years) does not mean you'll continue to be
         | lucky in the future.
         | 
         | "Tech line go up forever" is not a viable model of the economy;
         | you need an explanation of why it's going up now, and why it
         | might go down in the future. And also models of many other
         | industries, to understand when and why to invest elsewhere.
         | 
         | And if your bets pay off in the short term, that doesn't
         | necessarily mean your model is right. You could have chosen the
         | right stocks for the wrong reasons! Past performance doesn't
         | guarantee future performance.
        
           | Vegenoid wrote:
           | Clearly AI is not a bubble, look how good it is at predicting
           | the stock market!
        
           | gwd wrote:
           | What would have been impressive is if the favored industries,
           | or individual companies, experienced a major drop during the
           | target testing window, and the LLMs managed to pull out of
           | those industries _before_ they dropped.
        
       | lawlessone wrote:
       | Could they give some random people (i volunteer) 100k for 8
       | months? ...as a control
        
         | iLoveOncall wrote:
         | I know this is a joke comment, but there are plenty of websites
         | that simulate the stock market and where you can use paper
         | money to trade.
         | 
         | People say it's not equivalent to actually trading though, and
         | you shouldn't use it as a predictor of your actual trading
         | performance, because you have a very different risk tolerance
         | when risking your actual money.
        
           | ghaff wrote:
           | Yeah, if you give me $100K I'm almost certainly going to make
           | very different decisions than either a supposedly optimizing
           | computer or myself at different ages.
        
       | andirk wrote:
       | Update with Gemini 3. It's far better than its predecessors.
        
       | apical_dendrite wrote:
       | Looking at the recent holdings for the best models, it looks like
       | it's all tech/semiconductor stocks. So in this time frame they
       | did very well, but if they ended in April, they would have
       | underperformed the S&P500.
        
       | halzm wrote:
       | I think these tests are always difficult to gauge how meaningful
       | they actually are. If the S&P500 went up 12% over that period,
       | mainly due to tech stocks, picking a handful of tech stocks is
       | always going to set you higher than the S&P. So really all I
       | think they test is whether the models picked up on the trend.
       | 
       | I'm more surprised that Gemini managed to lose 10%. I wish they
       | actually mentioned what the models invested in and why.
        
         | taylorlapeyre wrote:
         | Wait -- isn't that exactly what good investors do? They look
         | for what stocks are going to beat expectations and invest in
         | them. If a stock broker I hired got this return, I wouldn't be
         | rolling my eyes and saying "that's only because they noticed
         | the trend in tech stocks." That's exactly what I'm paying them
         | to do.
        
         | Marsymars wrote:
         | > picking a handful of tech stocks is always going to set you
         | higher than the S&P.
         | 
         | That's a bold claim.
        
       | buredoranna wrote:
       | Like so many analyses before them, including my own, this
       | completely misses the basics of mean/variance risk analysis.
       | 
       | We need to know the risk adjusted return, not just the return.
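One common way to express a risk-adjusted return is the Sharpe ratio: mean excess return divided by its standard deviation, annualized. The sketch below assumes daily returns, 252 trading days per year, and a zero risk-free rate by default; the two return streams are invented numbers chosen only to have the same average return with different volatility, not data from the experiment.

```python
import math
import statistics


def sharpe_ratio(period_returns, risk_free_per_period=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a list of per-period simple returns."""
    excess = [r - risk_free_per_period for r in period_returns]
    mean_excess = statistics.fmean(excess)
    volatility = statistics.stdev(excess)  # sample standard deviation
    return mean_excess / volatility * math.sqrt(periods_per_year)


# Two illustrative streams with the same mean daily return (0.15%).
steady = [0.001, 0.002] * 10
volatile = [0.030, -0.027] * 10

print(f"steady:   {sharpe_ratio(steady):.1f}")
print(f"volatile: {sharpe_ratio(volatile):.1f}")
```

Both streams end up with similar raw returns, but the volatile one took far more risk to get there, which is the information a return-only leaderboard hides.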
        
       | xnx wrote:
       | Spoiler: They did not use real money or perform any actual
       | trades.
        
       | jacktheturtle wrote:
       | This is really dumb. Because the models themselves, like markets,
       | are indeterministic. They will yield different investment
       | strategies based on prompts and random variance.
       | 
       | This is a really dumb measurement.
        
       | iLoveOncall wrote:
       | Since it's not included in the main article, here is the prompt:
       | 
       | > You are a stock trading agent. Your goal is to maximize
       | returns.
       | 
       | > You can research any publicly available information and make
       | trades once per day.
       | 
       | > You cannot trade options.
       | 
       | > Analyze the market and provide your trading decisions with
       | reasoning.
       | 
       | >
       | 
       | > Always research and corroborate facts whenever possible.
       | 
       | > Always use the web search tool to identify information on all
       | facts and hypotheses.
       | 
       | > Always use the stock information tools to get current or past
       | stock information.
       | 
       | >
       | 
       | > Trading parameters:
       | 
       | > - Can hold 5-15 positions
       | 
       | > - Minimum position size: $5,000
       | 
       | > - Maximum position size: $25,000
       | 
       | >
       | 
       | > Explain your strategy and today's trades.
       | 
       | Given the parameters, this definitely is NOT representative of
       | any actual performance.
       | 
       | I recommend also looking at the trade history and reasoning for
       | each trade for each model, it's just complete wind.
       | 
       | As an example, Deepseek made only 21 trades, which were all buys,
       | which were all because "Company X is investing in AI". I doubt
       | anyone believes this to be a viable long-term trading strategy.
        
         | Scubabear68 wrote:
         | Agree. Those parameters are incredibly artificial bullshit.
        
       | cheeseblubber wrote:
       | OP here. We realized there are a ton of limitations with
       | backtesting and paper money but still wanted to do this experiment
       | and share the results. By no means is this statistically
       | significant evidence of whether or not these models can beat the
       | market in the long term.
       | But wanted to give everyone a way to see how these models think
       | about and interact with the financial markets.
        
         | irishcoffee wrote:
         | > But wanted to give everyone a way to see how these models
         | think...
         | 
         | Think? What exactly did "it" think about?
        
           | cheeseblubber wrote:
           | You can click into the chart and see the conversation, as
           | well as the reasoning it gave for each trade.
        
             | philipwhiuk wrote:
             | A model can't tell you why it made the decision.
             | 
             | What it can do is inspect the decision it made and make up
             | a reason a human might have said when making the decision.
        
           | stoneyhrm1 wrote:
           | "Pass the salt? You mean pass the sodium chloride?"
        
         | joegibbs wrote:
         | I think it would be interesting to see how it goes in a
         | scenario where the market declines or where tech companies
         | underperform the rest of the market. In recent history they've
         | outperformed the market and that might bias the choices that
         | the LLMs make - would they continue with these positive biases
         | if they were performing badly?
        
         | apparent wrote:
         | > Grok ended up performing the best while DeepSeek came close
         | to second.
         | 
         | I think you mean "DeepSeek came in a close second".
        
           | apparent wrote:
           | OK, now it says:
           | 
           | > Grok ended up performing the best while DeepSeek came close
           | second.
           | 
           | "came in a close second" is an idiom that only makes sense
           | word-for-word.
        
         | gerdesj wrote:
         | These are LLMs - next token guessers. They don't think at all
         | and I suggest that you don't try to get rich quick with one!
         | 
         | LLMs are handy tools but no more. Even Qwen3-30B heavily
         | quantised will do a passable effort of translating some Latin
         | to English. It can whip up small games in a single prompt and
         | much more and with care can deliver seriously decent results
         | but so can my drill driver! That model only needs a £500
         | second hand GPU - that's impressive for me. Also GPT-OSS etc.
         | 
         | Yes, you can dive in with the bigger models that need serious
         | hardware and they seem miraculous. A colleague had to recently
         | "force" Claude to read some manuals until it realised it had
         | made a mistake about something and frankly I think "it" was
         | only saying it had made a mistake. I must ask said colleague to
         | grab the reasoning and analyse it.
        
         | anigbrowl wrote:
         | You should redo this with human controls. By a weird
         | coincidence, I have sufficient free time.
        
         | pottertheotter wrote:
         | Cool experiment.
         | 
         | I have a PhD in capital markets research. It would be even more
         | informative to report abnormal returns (market/factor-adjusted)
         | so we can tell whether the LLMs generated true alpha rather
         | than just loading on tech during a strong market.
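A minimal sketch of a market-adjusted return, assuming the simplest single-factor model: regress portfolio returns on market returns, read the slope as beta and the intercept as alpha. The return numbers are invented for illustration, and a real study would use excess returns and additional factors. The example portfolio is just the market levered 1.5x, so it beats the market in a rising period while its alpha is zero.

```python
import statistics


def alpha_beta(portfolio_returns, market_returns):
    """Ordinary least squares fit of portfolio = alpha + beta * market."""
    mean_p = statistics.fmean(portfolio_returns)
    mean_m = statistics.fmean(market_returns)
    covariance = sum(
        (p - mean_p) * (m - mean_m)
        for p, m in zip(portfolio_returns, market_returns)
    )
    market_variance = sum((m - mean_m) ** 2 for m in market_returns)
    beta = covariance / market_variance
    alpha = mean_p - beta * mean_m
    return alpha, beta


# Illustrative per-period returns for a rising market (made-up numbers).
market = [0.010, -0.020, 0.015, 0.005, 0.010]
# A "high beta" portfolio: 1.5x the market's move every period, no skill added.
levered = [1.5 * m for m in market]

alpha, beta = alpha_beta(levered, market)
print(f"raw outperformance per period: {statistics.fmean(levered) - statistics.fmean(market):.4f}")
print(f"beta: {beta:.2f}, alpha: {alpha:.6f}")
```

The raw outperformance is positive, yet the fitted alpha is zero: all of the extra return is explained by extra market exposure.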
        
         | this_user wrote:
         | I can almost guarantee you that these models will underperform
         | the market in the long run, because they are simply not
         | designed for this purpose. LLMs are designed to simulate a
         | conversation, not predict forward returns of a time series.
         | What's more, most of the widely disseminated knowledge out
         | there on the topic is effectively worthless, because there is
         | an entire cottage industry of fake trading gurus and grifters,
         | and the LLMs have no ability to separate actual information
         | from the BS.
         | 
         | If you really wanted to do this, you would have to train
         | specialist models - not LLMs - for trading, which is what firms
         | are doing, but those are strictly proprietary.
         | 
         | The only other option would be to train an LLM on actually
         | correct information and then see if it can design the
         | specialist model itself, but most of the information you would
         | need for that purpose is effectively hidden and not found in
         | public sources. It is also entirely possible that these trading
         | firms have already been trying this: using their proprietary
         | knowledge and data to attempt to train a model that can act as
         | a quant researcher.
        
         | philipwhiuk wrote:
         | You're not really giving them any money and it's not actually
         | trading.
         | 
         | There's no market impact to any trading decision they make.
        
         | beezle wrote:
         | What were the risk adjusted returns? Without knowing that, this
         | is all kind of meaningless. Being high beta in a rising market
         | doesn't equate to anything brilliant.
        
       | mlmonkey wrote:
       | > We were cautious to only run after each model's training cutoff
       | dates for the LLM models
       | 
       | Grok is constantly training and/or it has access to websearch
       | internally.
       | 
       | You cannot backtest LLMs. You can only "live" test them going
       | forward.
        
         | cheeseblubber wrote:
         | Via api you can turn off websearch internally. We provided all
         | the models with their own custom tools that only provided data
         | up to the date of the backtest.
        
           | mlmonkey wrote:
           | But Grok is internally training on Tweets etc. continuously.
        
       | dogmayor wrote:
       | They could only trade once per day and hold 5-15 positions with a
       | position size of $5k-$25k according to the agent prompt. Limited
       | to say the least.
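For reference, the limits quoted from the agent prompt (5-15 positions, $5,000 to $25,000 each) can be written as a small check. This is a hypothetical validator, not code from the experiment; the function name and the placeholder tickers are assumptions.

```python
MIN_POSITIONS, MAX_POSITIONS = 5, 15
MIN_POSITION_USD, MAX_POSITION_USD = 5_000, 25_000


def constraint_violations(positions):
    """Return a list of rule violations for a {ticker: dollar_value} portfolio."""
    problems = []
    if not MIN_POSITIONS <= len(positions) <= MAX_POSITIONS:
        problems.append(
            f"{len(positions)} positions held; allowed range is "
            f"{MIN_POSITIONS}-{MAX_POSITIONS}"
        )
    for ticker, value in positions.items():
        if not MIN_POSITION_USD <= value <= MAX_POSITION_USD:
            problems.append(
                f"{ticker}: ${value:,} is outside "
                f"${MIN_POSITION_USD:,}-${MAX_POSITION_USD:,}"
            )
    return problems


# Placeholder tickers: four maximum-size positions use the whole $100K,
# but still break the five-position minimum.
four_max_size = {"AAA": 25_000, "BBB": 25_000, "CCC": 25_000, "DDD": 25_000}
print(constraint_violations(four_max_size))
```

Under these rules a fully invested $100K account can never concentrate in fewer than five names, which limits how different the strategies can be.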
        
       | digitcatphd wrote:
       | Backtesting is a complete waste in this scenario. The models
       | already know the best outcomes and are biased towards it.
        
       | 1a527dd5 wrote:
       | Time.
       | 
       | That has been the best way to get returns.
       | 
       | I set up a 212 account when I was looking to buy our first house.
       | I bought small chunks in industries I was comfortable with and
       | knowledgeable in. Over the years I worked up a nice
       | portfolio.
       | 
       | Anyway, long story short. I forgot about the account, we moved
       | in, got a dog, had children.
       | 
       | And then I logged in for the first time in ages, and to my shock,
       | my returns were at 110%. I've done nothing. It's bizarre and
       | perplexing.
        
         | jondwillis wrote:
         | ...did you beat the market? 110% is pretty much what the nasdaq
         | has done over the last 5 years
         | 
         | Also N=1
        
         | delijati wrote:
         | time in the market beats timing the market -> Kenneth Fisher
         | ... i learned it the hard way ;)
        
         | lisbbb wrote:
         | Yeah, uh, all I did was buy BRK.B like a decade ago and it's up
         | 172% or something like that.
         | 
         | The only way I have seen people outperform is by having insider
         | information.
        
       | theideaofcoffee wrote:
       | "Everyone (including LLMs) is a genius in a bull market."
        
         | apparent wrote:
         | Apparently everyone (but Gemini).
        
           | koakuma-chan wrote:
           | Could Gemini end up being better over the longer term?
        
             | scarmig wrote:
             | Depends on if the market can stay irrational longer than
             | Gemini stays solvent.
        
         | mrweasel wrote:
         | I was thinking the same thing. A number of coworkers were
         | trading stocks a few years ago and felt pretty good about their
         | skills, until someone pointed out that making good stock picks
         | was easy when everything is going up. Sure enough, when the
         | market started to fail, they all lost money.
         | 
         | What could make this a bit more interesting is to tell the LLM
         | to avoid the tech stocks, at least the largest ones. Then give
         | it actual money, because your trades will affect the market.
        
       | tiffani wrote:
       | What was the backtesting method? Was walk-forward testing
       | involved? There are different ways to backtest.
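For readers unfamiliar with the term, walk-forward testing evaluates a strategy on a series of rolling windows: fit on one block of history, test on the block that immediately follows, then slide forward. The sketch below only generates the index windows; the function name and window sizes are assumptions, and it says nothing about which method the article actually used.

```python
def walk_forward_splits(n_observations, train_size, test_size):
    """Rolling (train, test) index ranges; each test window follows its train window."""
    splits = []
    start = 0
    while start + train_size + test_size <= n_observations:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        splits.append((train, test))
        start += test_size  # slide forward by one test window
    return splits


for train, test in walk_forward_splits(n_observations=10, train_size=4, test_size=2):
    print(f"train on days {list(train)}, test on days {list(test)}")
```

Every test window lies strictly after the data used to fit it, so no future information reaches the training step.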
        
       | Nevermark wrote:
       | Just one run per model? That isn't backtesting. I mean
       | technically it is, but "testing" implies producing meaningful
       | measures.
       | 
       | Also just one time interval? Something as trivial as "buy AI"
       | could do well in one interval, and given models are going to be
       | pumped about AI, ...
       | 
       | 100 independent runs on each model over 10 very different market
       | behavior time intervals would produce meaningful results. Like
       | actually credible, meaningful means and standard deviations.
       | 
       | This experiment, as is, is a very expensive unbalanced
       | uncharacterizable random number generator.
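       | 
       | A minimal sketch of that kind of aggregation, assuming a
       | hypothetical run_backtest(model, interval, seed) that returns one
       | total return per run. The stub below only draws random numbers so
       | the example runs on its own; it is not the article's method.

```python
import random
import statistics


def run_backtest(model: str, interval: int, seed: int) -> float:
    """Hypothetical stand-in for one backtest run; returns a total return.

    It only draws noise here, so the numbers mean nothing by themselves.
    """
    rng = random.Random(f"{model}-{interval}-{seed}")
    return rng.gauss(0.05, 0.20)


def summarize(model: str, n_intervals: int = 10, n_runs: int = 100) -> dict:
    """Mean and sample standard deviation of returns for each interval."""
    summary = {}
    for interval in range(n_intervals):
        returns = [run_backtest(model, interval, s) for s in range(n_runs)]
        summary[interval] = (statistics.mean(returns), statistics.stdev(returns))
    return summary


if __name__ == "__main__":
    for interval, (mean, std) in summarize("some-model").items():
        print(f"interval {interval}: mean={mean:+.3f} std={std:.3f}")
```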
        
         | cheeseblubber wrote:
         | Yes definitely we were using our own budget and out of our own
         | pocket and these model runs were getting expensive. Claude
         | cost us around 200-300 dollars per 8-month run, for example. We
         | want to scale it and get more statistically significant results
         | but wanted to share something in the interim.
        
           | Nevermark wrote:
           | Got it. It is an interesting thing to explore.
        
         | ipnon wrote:
         | Yes, if these models available for $200/month are making 50%
         | returns reliably, why isn't Citadel having layoffs?
        
           | lisbbb wrote:
           | In my experience, you get a few big winners, but since you
           | have to keep placing new trades (e.g. bets) you eventually
           | blow one and lose most of what you made. This is particularly
           | true with options and futures trades. It's a stupid way to
           | speculate; with or without AI help, it doesn't matter and
           | will never matter.
        
         | energy123 wrote:
         | To their credit, they say in the article that the results
         | aren't statistically significant. It would be better if that
         | disclaimer was more prominently displayed though.
         | 
         | The tone of the article is focused on the results when it
         | should be "we know the results are garbage noise, but here is
         | an interesting idea".
        
         | hhutw wrote:
         | Yeah...one run per model is just random walk in my opinion
        
         | Marsymars wrote:
         | To take it to the absurdist conclusion, you could backtest each
         | LLM "which single stock should I buy on Jan 1, 2010 to maximize
         | my returns over the next 15 years?"
         | 
         | If your backtested LLM performed well, would you use the same
         | strategy for the next 15 years? (I suppose there are people who
         | would.)
        
         | zer0tonin wrote:
         | Not only just one run per model, but no metrics other than
         | total return. If you pick stocks at random you have a very high
         | chance of beating the S&P 500, so you need a bit more than that
         | to make a good benchmark.
        
       | Bender wrote:
       | This experiment was also performed with a fish [1] though it was
       | only given $50,000. Spoiler, the fish did great _vs wall street
       | bets_.
       | 
       | [1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15
       | mins]
        
       | naet wrote:
       | I used to work for a brokerage API geared at algorithmic traders
       | and in my anecdotal experience many strategies seem to
       | work well when back-tested on paper but for various reasons can
       | end up flopping when actually executed in the real market. Even
       | testing a strategy in real time paper trading can end up
       | differently than testing on the actual market where other parties
       | are also viewing your trades and making their own responses. The
       | post did list some potential disadvantages of backtesting, so
       | they clearly aren't totally in the dark on it.
       | 
       | Deepseek did not sell anything, but did well with holding a lot
       | of tech stocks. I think that can be a bit of a risky strategy
       | with everything in one sector, but it has been a successful one
       | recently so not surprising that it performed well. Seems like
       | they only get to "trade" once per day, near the market close, so
       | it's not really a real time ingesting of data and making
       | decisions based on that.
       | 
       | What would really be interesting is if one of the LLMs switched
       | their strategy to another sector at an appropriate time. Very
       | hard to do but very impressive if done correctly. I didn't see
       | that anywhere but I also didn't look deeply at every single
       | trade.
        
         | bmitc wrote:
         | I've honestly never understood what backtesting even does
         | because of the things you mention like time it takes to request
         | and close trades (if they even do!), responses to your trades,
         | the continuous and dynamic input of the market into your model,
         | etc.
         | 
         | Is there any reference that explains the deep technicalities of
         | backtesting and how it is supposed to actually influence your
         | model development? It seems to me that one could spend a huge
         | amount of effort on backtesting that would distract from
         | building out models and tooling and that that effort might not
         | even pay off given that the backtesting environment is not the
         | real market environment.
        
           | tim333 wrote:
           | I'm not sure about deep technicalities but backtesting is a
           | useful thing to see how some strategy would have performed at
           | some times in the past but there are quite a lot of
           | limitations to it. Two of the big ones are the market
           | reacting to you and maybe more so a kind of hindsight bias
           | where you devise some strategy that would have worked great
           | on past markets but the real time ones do something
           | different.
           | 
           | https://en.wikipedia.org/wiki/Long-Term_Capital_Management
           | was kind of an example of both of those. They based their
           | predictions on past behaviour which proved incorrect. Also if
           | other market participants figure a large player is in trouble
           | and going to have to sell a load of bonds they all drop their
           | bids to take advantage of that.
           | 
           | A lot of deviations from efficient market theory are like
           | that - not deeply technical but about human foolishness.
        
           | Maxatar wrote:
           | We use back testing at my firm for two primary reasons, one
           | as a way to verify correctness and two as a way to assess
           | risk.
           | 
           | We do not use it as a way to determine profitability.
        
             | bmitc wrote:
             | This is interesting because I'm not immediately sure how
             | you verify correctness and assess risk without also
             | addressing profitability.
             | 
             | By assessing risk is that just checking that it doesn't dump
             | all your money and that you can at least maintain a stable
             | investment cache?
             | 
             | Are you willing to say more about correctness? Is it the
             | correctness of the models, of the software, or something
             | else?
        
               | Maxatar wrote:
               | Profitability is not in any way considered a property of
               | the correctness of an algorithm. An algorithm can be
               | profitable and incorrect, and an algorithm can be correct
               | but not profitable.
               | 
               | Correctness has to do with whether the algorithm
               | performed the intended actions in response to the
               | inputs/events provided to it, nothing more. For the most
               | part correctness of an algorithm can be tested the same
               | way most software is tested, ie. unit tests, but it's
               | also worth testing the algorithm using live data/back
               | testing it since it's not feasible to cover every
               | possible scenario in giant unit tests, but you can get
               | pretty good coverage of a variety of real world scenarios
               | by back testing.
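               | 
               | A toy sketch of that idea, with a made-up trading rule
               | and made-up limits: the replay checks that the algorithm
               | did what it was meant to do on each input and never looks
               | at profit.

```python
from dataclasses import dataclass


@dataclass
class Account:
    cash: float
    shares: int = 0


def step(account: Account, price: float, prev_price: float, max_shares: int) -> str:
    """Toy rule: buy one share after a price drop, sell one after a rise."""
    if price < prev_price and account.shares < max_shares and account.cash >= price:
        account.cash -= price
        account.shares += 1
        return "BUY"
    if price > prev_price and account.shares > 0:
        account.cash += price
        account.shares -= 1
        return "SELL"
    return "HOLD"


def replay(prices: list[float], cash: float = 1_000.0, max_shares: int = 3) -> list[str]:
    """Replay a historical price series and check invariants, not P&L."""
    account = Account(cash=cash)
    actions = []
    for prev_price, price in zip(prices, prices[1:]):
        actions.append(step(account, price, prev_price, max_shares))
        # Correctness checks: limits hold whether or not the run made money.
        assert 0 <= account.shares <= max_shares
        assert account.cash >= 0
    return actions


if __name__ == "__main__":
    print(replay([100, 99, 98, 97, 96, 97, 97]))
```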
        
         | lisbbb wrote:
         | This. This all day. I used to paper trade using ThinkOrSwim and
         | I was doubling and tripling my money effortlessly. Then I
         | decided to move my strategy to the real deal and it didn't do
         | very well at all. It was all bs.
        
         | chroma205 wrote:
         | >but for various reasons can end up flopping when actually
         | executed in the real market.
         | 
         | 1. Your order can legally be "front run" by the lead or
         | designated market maker who receives priority trade matching,
         | bypassing the normal FIFO queue. Not all exchanges do this.
         | 
         | 2. Market impact. Other participants will cancel their order,
         | or increase their order size, based on your new order. And yes,
         | the algos do care about your little 1 lot order.
         | 
         | Also if you improve the price ("fill the gap"), your single 1
         | qty order can cause 100 other people to follow you. This does
         | not happen in paper trading.
         | 
         | Source: HFT quant
        
           | derrida wrote:
           | Dear HFT Quant,
           | 
           | > And yes, the algos do care about your little 1 lot order.
           | 
           | I'm just your usual "corrupted nerd" geek with some
           | mathematics and computer security background interests - 2
           | questions if I may 1. what's like the most interesting paper
           | you have read recently or unrelated thing you are interested
           | in at the moment? 2. " And yes, the algos do care about your
           | little 1 lot order." How would one see this effect you
           | mentioned - like it seems wildly anomalous, how would go
           | about finding this effect assuming maximum mental
           | venturesomeness, a tiny $100 and too much time?
        
             | ainiriand wrote:
             | Sometimes the spread is really tight.
        
             | tim333 wrote:
             | Retail speculator here. Re 2 it's often quite easy to demo
             | on thinly traded markets - I'm more familiar with crypto.
             | Say the spread is 81.00 buy, 81.03 sell. Put in a limit buy
             | at 81.00 and watch someone/something immediately outbid you
             | at 81.01. In the short term that kind of thing is done by
             | algorithms but there are humans behind it and doing it too.
             | 
             | There's quite a lot of other game playing going on also.
        
             | gosub100 wrote:
             | Even a 1 lot order could be the deciding factor for some
             | algorithm that's calculating averages or other statistics.
             | Especially for options books.
        
           | this_user wrote:
           | If you actually were in the industry, you would know that
           | most retail traders don't fail, because they lose a tick here
           | or there on execution, they fail, because their strategies
           | have no edge in the first place.
        
             | chroma205 wrote:
             | > If you actually were in the industry, you would know that
             | most retail traders don't fail, because they lose a tick
             | here or there on execution
             | 
             | Where did I say "retail trader"?
             | 
             | Because "institutional" low-latency market makers trade 1
             | lot all the time.
        
               | this_user wrote:
               | The context from parent was obviously that. Instis don't
               | trade on Alpaca.
               | 
               | > Because "institutional" low-latency market makers trade
               | 1 lot all the time.
               | 
               | That sentence alone tells me that you're a LARPer.
        
               | chroma205 wrote:
               | > That sentence alone tells me that you're a LARPer
               | 
               | cope.
               | 
               | Equity options are sparse and have 1 order of 1 lot/qty
               | per price. But usually empty. Too many prices and
               | expiration dates.
               | 
               | US treasury bond cash futures (BrokerTec) are almost
               | always 1 lot orders. Multiple orders per level though.
               | 
               | I could go on, but I'm busy as our team of 4's algos are
               | printing US$500k/hour today.
        
           | dubcanada wrote:
           | There is a big difference between back testing scalping and
           | back testing buy 100 NVIDIA at $103 and sell at $110.
        
           | Maxatar wrote:
           | >Your order can legally be "front run" by the lead or
           | designated market maker who receives priority trade matching,
           | bypassing the normal FIFO queue. Not all exchanges do this.
           | 
           | Unless you're thinking of some obscure exchange in a tiny
           | market, this is just untrue in the U.S., Europe, Canada, and
           | APAC. There are no exchanges where market makers get any kind
           | of priority to bypass the FIFO queue.
        
             | chroma205 wrote:
             | > There are no exchanges where market makers get any kind
             | of priority to bypass the FIFO queue.
             | 
             | Nope, several large, active, and liquid markets in the US.
             | 
             | Legally it's not named "bypass the FIFO queue". That would
             | be dumb.
             | 
             | In practice, it goes by politically correct names such as
             | "designated market maker fill" or "institutional order
             | prioritization" or "leveling round".
        
               | Maxatar wrote:
               | I can tell you as someone who is a designated market
               | maker on several ETFs in the U.S., none of this exists as
               | a means of giving market makers priority fills. You're
               | taking existing terms and misusing them. For example
               | institutional order prioritization is used as a wash
               | trade prevention mechanism, not as a way for designated
               | market makers to get some kind of fill preference.
               | Leveling rounds also do not involve exchanges, this is an
               | internal tool used by a broker's OMS to rebalance
               | residuals so accounts end up with the intended
               | allocation, or cleaning up odd-lot/mixed-lot leftovers.
               | 
               | I am getting the feeling you either are not actually a
               | quant, or you were a quant and just misheard and confused
               | a lot of things together, but one thing is for sure...
               | your claim that market makers get some kind of priority
               | fills is factually incorrect.
        
         | ddtaylor wrote:
         | Alpaca?
        
         | acrooks wrote:
         | A really important part of this is the emotional component.
         | When real money is involved, then you will sometimes face
         | actual losses. It's hard for a human to completely trust the
         | machine in real world trading
        
         | andoando wrote:
         | Backtesting is useless because if you try out a million
         | strategies, by chance you will find one that works for past
         | data.
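         | 
         | A small simulation of that selection effect, under the
         | assumption that daily returns are pure noise and every
         | "strategy" is a random sequence of long/short calls. The counts
         | are arbitrary and much smaller than a million.

```python
import random


def strategy_return(signals, market):
    """Total return of going long (+1) or short (-1) each day."""
    return sum(s * r for s, r in zip(signals, market))


def best_of_many(n_strategies=2000, n_days=250, seed=0):
    """Pick the best random strategy on past data, then score it on new data."""
    rng = random.Random(seed)
    # Pure-noise daily returns: by construction no strategy has an edge.
    past = [rng.gauss(0, 0.01) for _ in range(n_days)]
    future = [rng.gauss(0, 0.01) for _ in range(n_days)]
    strategies = [[rng.choice((-1, 1)) for _ in range(n_days)]
                  for _ in range(n_strategies)]
    best = max(strategies, key=lambda s: strategy_return(s, past))
    return strategy_return(best, past), strategy_return(best, future)


if __name__ == "__main__":
    in_sample, out_of_sample = best_of_many()
    print(f"best strategy, past data:   {in_sample:+.1%}")
    print(f"same strategy, unseen data: {out_of_sample:+.1%}")
```

         | The winner looks strong on the data it was selected on only
         | because it was selected; its score on the unseen series is just
         | another draw from the noise.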
        
       | copypaper wrote:
       | >Each model gets access to market data, news APIs, company
       | financials...
       | 
       | The article is very very vague on their methodology (unless I
       | missed it somewhere else?). All I read was, "we gave AI access to
       | market data and forced it to make trades". How often did these
       | models run? Once a day? In a loop continuously? Did it have
       | access to indicators (such as RSI)? Could it do arbitrary
       | calculations with raw data? Etc...
       | 
       | I'm in the camp that AI will never be able to successfully trade
       | on its own behalf. I know a couple of successful traders (and
       | many unsuccessful!), and it took them years of learning and
       | understanding before breaking even. I'm not quite sure what the
       | difference is between the successful and non-successful. Some
       | sort of subconscious knowledge from staring at charts all day? A
       | level of intuition? Regardless, it's more than just market data
       | and news.
       | 
       | I think AI will be invaluable as an assistant (disclaimer; I'm
       | working on an AI trading assistant), but on its own? Never. Some
       | things simply can't be solved with AI and I think this is
       | one of them. I'm open to being wrong, but nothing has convinced
       | me otherwise.
        
       | XenophileJKO wrote:
       | So.. I have been using an LLM to make 30 day buy and hold
       | portfolios. And the results are "ok". (Like 8% vs 6% for the S&P
       | 500 over the last 90 days)
       | 
       | What you ask the model to do is super important. Just like
       | writing or coding.. the default "behavior" is likely to be
       | "average".. you need to be very careful about what you ask for.
       | 
       | For me this is just a fun experiment and very interesting to see
       | the market analysis it does. I started with o3 and now I'm using
       | 5.1 Thinking (set to max).
       | 
       | I have it looking for stocks trading below intrinsic value with
       | some caveats because I know it likes to hinge on binary events
       | like drug trial results. I also have it look at
       | correlation with the positions and make sure they don't have the
       | same macro vulnerability.
       | 
       | I just run it once a month and do some trades with one of my
       | "experimental" trading accounts. It certainly has thought of
       | things I hadn't like using an equal weight s&p 500 etf to catch
       | some upside when the S&P seems really top heavy and there may be
       | some movement away from the top components, like last month.
        
         | themafia wrote:
         | I look for issues with a recent double bottom and high insider
         | buy activity. I've found this to be a highly reliable set of
         | signals.
        
           | XenophileJKO wrote:
           | That is interesting.
           | 
           | I was trying to not be "very" prescriptive. My initial
           | impression was, if you don't tell it to look at intrinsic
           | value, the model will look at meme or very common stocks too
           | much. Alternatively specifying an investing persona would
           | probably also move it out of that default behavior profile.
           | You have to kind of tell it about what it cares about. This
           | isn't necessarily about trying to maximize a strategy, it was
           | more about learning what kinds of things would it focus on,
           | what kind of analysis.
        
       | dismalaf wrote:
       | Back when I was in university we used statistical techniques
       | similar to what LLMs use to predict the stock market. It's not a
       | surprise that LLMs would do well over this time period. The
       | problem is that when the market turns and bucks trends they don't
       | do so well, you need to intervene.
        
       | cedws wrote:
       | Backtesting for 8 months is not rigorous enough and also this
       | site has no source code or detailed methodology. Not worth the
       | click.
        
       | _alternator_ wrote:
       | Wait, they didn't give them real money. They simulated the
       | results.
        
       | petesergeant wrote:
       | If I'm reading this, almost all of Grok's advantage comes from
       | heavy bets into semi-conductors spiking: ASML, INTC, MU.
        
       | mikewarot wrote:
       | They weren't doing it in real time, thus it's possible that the
       | LLMs might have had undisclosed perfect knowledge of the actual
       | history of the market. Only a real-time study is going to
       | eliminate this possibility.
        
       | itake wrote:
       | Model output is non-deterministic.
       | 
       | Did they make 10 calls per decision and then choose the majority?
       | or did they just recreate the monkey picking stocks strategy?
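       | 
       | A sketch of what "10 calls, take the majority" could look like.
       | ask_model is a hypothetical callable standing in for one LLM
       | request, and the 60% threshold is an arbitrary assumption.

```python
from collections import Counter
from typing import Callable


def majority_decision(ask_model: Callable[[], str], n_calls: int = 10,
                      min_share: float = 0.6) -> str:
    """Sample the model n_calls times and act only on a clear majority.

    ask_model is a hypothetical callable returning "BUY", "SELL" or "HOLD".
    Falls back to "HOLD" when no answer reaches min_share of the votes.
    """
    votes = Counter(ask_model() for _ in range(n_calls))
    answer, count = votes.most_common(1)[0]
    return answer if count / n_calls >= min_share else "HOLD"


if __name__ == "__main__":
    canned = iter(["BUY"] * 7 + ["SELL"] * 3)  # stand-in for real responses
    print(majority_decision(lambda: next(canned)))
```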
        
         | ta12653421 wrote:
         | ++1
         | 
         | This.
         | 
         | That's also the reason why I still believe in "classic
         | instruments" when configuring my trade app; the model won't
         | give you the same entries on, let's say, 5 questions.
        
       | hoerzu wrote:
       | How many trades? What's the z-score?
        
       | hoerzu wrote:
       | I built this for backtesting LLMs on Polymarket. You can try it
       | with live data, without signing up, at: https://timba.fun
        
       | luccabz wrote:
       | we should:
       | 
       | 1. train with a cutoff date at ~2006
       | 
       | 2. simulate information flow (financial data, news, earnings,
       | ...) day by day
       | 
       | 3. measure if any model predicts the 2008 collapse, how confident
       | they are in the prediction and how far in advance
        
       | stuffn wrote:
       | Trading in a nearly 20 year bull market and doing well is not an
       | accomplishment.
        
       | dehrmann wrote:
       | Is it just prompting LLMs with "I have $100k to invest. Here are
       | all publicly traded stocks and a few stats on them. Which stocks
       | should I buy?" And repeat daily, rebalancing as needed?
       | 
       | This isn't the best use case for LLMs without a lot of prompt
       | engineering and chaining prompts together, and that's probably
       | more insightful than running LLMs head-to-head.
        
       | client4 wrote:
       | The obvious next question is: does the AI on cocaine outperform?
       | https://pihk.ai/
        
       | Genego wrote:
       | When I see stuff like this, I feel like rereading the Incerto by
       | Taleb just to refresh and sharpen my bullshit senses.
        
         | bwfan123 wrote:
         | LLM is the fad of the day, and these sort of articles provoke
         | the natural get-rich-quick-greed inherent in all of us,
         | especially the young tech-types. As such they are clickbait,
         | and also a barometer of the silliness that is widespread.
         | 
         | I am curious why re-reading incerto sharpens your bullshit
         | sense. I have read a few in that series, but didn't see it as
         | sharpening my bullshit sensor.
        
       | dhosek wrote:
       | I wouldn't trust any backtest with these models. Try
       | doing a real-time test over 8 months and see what happens then.
       | I'd also be suspicious of anything that doesn't take actual costs
       | into account.
        
         | rallies wrote:
         | We're running some live experiments these days, for both stocks
         | and options. https://rallies.ai/arena
        
           | philipwhiuk wrote:
           | With actual money? Or still fake money?
        
       | wowamit wrote:
       | Is finding the right stocks to invest in an LLM problem? Language
       | models aren't the right fit, I would presume. It would also be
       | insightful to compare this with traditional ML models.
        
       | XCSme wrote:
       | If it's backtesting on data older than the model, then strategy
       | can have lookahead bias, because the model might already know
       | what big events will happen that can influence the stock markets.
        
       | lvspiff wrote:
       | I setup real life accounts with etrade and fidelity using the
       | etrade auto portfolio, fidelity i have an advisor for retirement,
       | and then i did a basket portfolio as well but used ms365 with
       | grok 5 and various articles and strategies to pick a set of 5
       | etfs that would perform similarly to the exposure of my other
       | two.
       | 
       | This year so far all are beating the s&p % wise (only by <1%
       | though) but the ai basket is doing the best or at least on par
       | with my advisor and it's getting to a point where the auto
       | investment strategy of etrade at least isn't worth it. It's been
       | an interesting battle to watch as each rebalances at varying
       | times as i put more funds in each and some have solid gains which
       | profits get moved to more stable areas. This is only with a few k
       | in each acct other than retirement but its still fun to see
       | things play out this year.
       | 
       | In other words, though, I'm not surprised at all by the results.
       | AI still isn't something to day trade with, but it is helpful in
       | doing research for your desired risk exposure long term imo.
        
         | lisbbb wrote:
         | How much are the expense ratios on those etfs you chose,
         | though? I mean, Vanguard, Fidelity, Blackrock, and others have
         | extremely low cost funds and etfs and it has been shown year
         | after year and decade after decade that you can't beat their
         | average returns over the long term. Indexing works for a
         | reason. Beating something by 1%? It's not even worth it if your
         | costs and taxes are higher than that.
        
       | IncreasePosts wrote:
       | Just picking tech stocks and winning isn't interesting unless we
       | know the thesis behind picking the tech stocks.
       | 
       | Instead, maybe a better test would be to give it 100 medium cap
       | stocks, and it needs to continually balance its portfolio among
       | those 100 stocks, and then test the performance.
        
       | refactor_master wrote:
       | Should have done GME stocks only. Now THAT would've been
       | interesting to see how much they'd end up losing on that.
       | 
       | Just riding a bubble up for 8 months with no consequences is not
       | an indicator of anything.
        
       | btbuildem wrote:
       | It turns out DeepSeek only made BUY trades (not a single SELL in
       | the history in their live example) -- so basically, buy & hold
       | strategy wins, again.
        
         | culi wrote:
         | this study should be replicated during a bear market
        
           | bmitc wrote:
           | Buy and hold performs well over long time scales by simply
           | not adjusting based upon sentiment.
        
             | throwawayffffas wrote:
             | Operating word is long, historically if you entered the
             | market just before a downturn, it could take years up to a
             | couple of decades to make up. Depending on which downturn
             | we are looking at.
        
               | bmitc wrote:
               | I think that requires entering once. I was referring to
               | continuing to enter periodically and holding.
        
       | darepublic wrote:
       | So in other words I should have listened to the YouTube brainrot
       | and asked ChatGPT for my trades. Sigh.
        
       | theymademe wrote:
       | prince of zamunda LLM edition or whatever that movie was based on
       | that book was based on the realization how pathetic it all was
       | based on was? .... yeah, some did a good one on ya. just imagine
       | evaluating that offspring one or two generations later ... ffs,
       | _this_ is sooooooooooooooo embarrassing
        
       | 867-5309 wrote:
       | tl;dr https://www.aitradearena.com/blog/llm-performance-chart.png
        
       | 867-5309 wrote:
       | GPT-5 was released _4_ months ago..
        
       | regnull wrote:
       | I'm working on a project where you can run your own experiment
       | (or use it for real trading): https://portfoliogenius.ai. Still a
       | bit rough, but most of the main functionality works.
        
       | hsuduebc2 wrote:
       | In a bullish market where a few companies are creating a bubble,
       | does this benchmark have any informational value? Wouldn't it be
       | better to run this on seemingly random intervals in past years?
        
       | mempko wrote:
       | The stats are abysmal. What's the MDD compared to S&P 500. What
       | is the Sortino? What are the confidence intervals for all the
       | stats? Number of trades? So many questions....
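       | 
       | For anyone unfamiliar with those metrics, a minimal sketch of
       | maximum drawdown and the Sortino ratio, computed per period with
       | no annualization and a target return of zero (both assumptions;
       | conventions vary).

```python
import math


def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sortino_ratio(returns: list[float], target: float = 0.0) -> float:
    """Mean excess return divided by downside deviation (not annualized)."""
    excess = [r - target for r in returns]
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0:
        return math.inf  # no periods below target
    return (sum(excess) / len(excess)) / downside


if __name__ == "__main__":
    print(max_drawdown([100, 120, 90, 110, 130]))       # 0.25
    print(sortino_ratio([0.02, -0.01, 0.03, -0.02]))
```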
        
       | energy123 wrote:
       | One of the recent NeurIPS best paper recipients is relevant here:
       | https://openreview.net/forum?id=saDOrrnNTz
       | 
       | > an extensive empirical study across more than 70 models,
       | revealing the Artificial Hivemind effect: pronounced intra- and
       | inter-model homogenization
       | 
       | So the inter-model variety will be exceptionally low. Users of
       | LLMs will intuitively know this already, of course.
        
       | keepamovin wrote:
       | I'd say Grok did best because it has the best access to
       | information. Grok's deep search and real-time knowledge
       | capabilities, due to the X integration and just generally being
       | plugged into the pulse of the Internet, are really best in class.
       | It's a great OSINT research tool.
       | 
       | Interesting how this research seems to tease out a truth traders
       | have known for eons: picking stocks is all about having
       | information, maybe a little bit of asymmetric information due to
       | good research, not necessarily about all the analysis that can be
       | done (that's important, but information is king), because it's a
       | speculative market that's collectively reacting to those kinds
       | of signals.
        
       | stockresearcher wrote:
       | I appreciate that you've made the trade histories downloadable
       | and will be taking a look to see what I can learn.
       | 
       | I've glanced over some of it and really wonder why they seemed to
       | focus on a small group of stocks.
        
       | aperture147 wrote:
       | Why is my bullshit detector ringing as hell right now??? This sounds
       | like another billion-dollar-Markov-chain-IP that claimed to
       | change the world, opening with a paper with flying colors.
        
       | frobisher wrote:
       | lolol Gemini
        
       | rallies wrote:
       | This is pretty cool.
       | 
       | We're also running a live experiment on both stocks and options.
       | One difference with our experiment is a lot more tools being
       | available to the models (anything you can think of, sec filings,
       | fundamentals, live pricing, options data).
       | 
       | We think backtests are meaningless given LLMs have mostly
       | memorized every single thing that happened so it's not a good
       | test. So we're running a forward test. Not enough data for now
       | but pretty interesting initial results
       | 
       | https://rallies.ai/arena
        
         | touristtam wrote:
         | How is Qwen so much worse than the rest (for the period
         | accounted)?
        
         | natiman1000 wrote:
         | Are the code/prompts used open source? If not, how can we say
         | it's legit?
        
       | nurettin wrote:
       | Deepseek and grok together would perform even better.
        
       | vpribish wrote:
       | this is so stupid i wish i could flag it twice
        
         | Frieren wrote:
         | flagged it for you
        
       | aidenn0 wrote:
       | It seems to me that short-term simulations will tend to
       | underprice risk.
       | 
       | Imagine a market where you can buy only two stocks:
       | 
       | Stock A goes up invariably 1% per month
       | 
       | Stock B goes up 1.5% per month with a 99% chance, but loses 99%
       | of its value with a 1% chance.
       | 
       | Stock B has a 94% chance of beating stock A on a 6 month
       | simulation, but only a 30% chance of beating stock A on a 10 year
       | simulation.
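The 94% and 30% figures above can be checked with a few lines of arithmetic. This is a minimal sketch under the comment's own assumptions (independent months, a 1% monthly crash chance for stock B); the function names are made up for illustration. B finishes ahead of A only if it never crashes during the window, because a single 99% loss cannot be made up by an extra 0.5% per month within these horizons.

```python
def prob_b_beats_a(months: int, crash_prob: float = 0.01) -> float:
    """Chance that stock B ends ahead of stock A after `months` months.

    Assumes months are independent and that B wins exactly when it
    never crashes during the window.
    """
    return (1.0 - crash_prob) ** months


def b_value_with_one_crash(months: int) -> float:
    """Terminal value of 1 unit of B that crashes once and otherwise gains 1.5%."""
    return 0.01 * 1.015 ** (months - 1)


def a_value(months: int) -> float:
    """Terminal value of 1 unit of A, which gains 1% every month."""
    return 1.01 ** months


# Expected one-month growth factor of B: below A's 1.01 despite the higher
# "usual" return, which is the risk a short simulation tends to miss.
expected_b_growth = 0.99 * 1.015 + 0.01 * 0.01

six_months = prob_b_beats_a(6)
ten_years = prob_b_beats_a(120)
print(f"6 months: {six_months:.1%}, 10 years: {ten_years:.1%}")
```

The longer the window, the more likely the rare crash shows up at least once, so a short backtest mostly samples the "no crash" path.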
        
       | fortran77 wrote:
       | I would love to see this run during an extended bear market
       | period.
        
         | ta12653421 wrote:
         | Can't the model go short in a bear market?
        
       | toephu2 wrote:
       | Predicting stock prices means you are competing directly against
       | massive hedge funds and professional quant teams with effectively
       | unlimited budgets and large teams of engineers. These
       | professionals are already using and constantly tweaking the
       | latest models to gain an advantage.
       | 
       | It is highly unlikely that you guys or any individual, even
       | utilizing the latest LLMs will consistently discover an edge that
       | beats the market over the long run.
        
       | pech0rin wrote:
       | 8 months of a huge bull market. Not exactly indicative of any
       | real insight.
        
       | rcarmo wrote:
       | I spent a while looking at trading algos a few years back (partly
       | because of quant stuff I got involved in, and partly out of
       | curiosity). I found that none of the "slow" trading (i.e., that
       | you could run at home alongside your day trading account) was
       | substantially effective (at least in my sampling), but I never
       | thought an LLM would be any good at it because all the analysis
       | is quantitative, not qualitative or contextual.
       | 
       | In short, I don't think this study proves anything unless they
       | gave the LLMs additional context besides the pure trading data
       | (Bloomberg terminals have news for a reason--there's typically a
       | lot more context in the market than individual stock values or
       | history).
        
       | morgengold wrote:
       | Am I right that you let LLMs decide for themselves what to read
       | into their input data (like market data, news APIs, company
       | financials)? While this is worth testing, I think it would be
       | more interesting to give them patterns to look for. I played
       | around with using them for technical analysis and let them make
       | the associations with past stock performances. They can even
       | differentiate on what worked in the last 5 years, what in the
       | last year, in the last 3 months etc. This way they can pick up
       | (hopefully) changes in market behavior. Generally the main
       | strength of this approach is to use their pattern recognition
       | capability and also take out the human factor (emotions) for
       | trading decisions.
        
       | Bombthecat wrote:
       | I wouldn't call this a test. I would create a test portfolio of
       | a hundred semi-random stocks and see what they sell, buy, or keep.
       | 
       | That tells me way more than "YOLO tech stocks".
        
       | bitmasher9 wrote:
       | 1. Backtesting doesn't mean very much. For lots of reasons real
       | trading is different than backtesting.
       | 
       | 2. 8 months is an incredibly short trading window. I care where
       | the market will be in 8 years way more than in 8 months.
        
         | ryandvm wrote:
         | It seems like back-testing an LLM is going to require
         | significant white-washing of the test data to prevent the LLM
         | from just trading on historical trends it is aware of.
         | 
         | Scrubbing symbol names wouldn't even be enough because I
         | suspect some of these LLMs could "figure out" which stock is,
         | say NVDA, based on the topology of its performance graph.
        
       | amelius wrote:
       | Nonsense. Title should read $0 because they didn't use actual
       | money.
       | 
       | Also, it seems pretty stupid to use commodity tech like LLMs for
       | this.
        
       | FrustratedMonky wrote:
       | How much of this is just because the market as a whole is going
       | up.
       | 
       | This same kind of mentality happened pre-2008. People thought
       | they were great at being day-traders, and had all kinds of
       | algorithms that were 'beating the market'.
       | 
       | But it was just that the entire market was going up. They weren't
       | doing anything special.
       | 
       | Once the market turned downward, that was when it took talent to
       | stay even. Show me these things beating a downward market.
        
       | throwawayffffas wrote:
       | > We also built a way to simulate what an agent would have seen
       | at any point in the past. Each model gets access to market data,
       | news APIs, company financials--but all time filtered: agents see
       | only what would have been available on that specific day during
       | the test period.
       | 
       | That's not going to work, these agents especially the larger
       | ones, will have news about the companies embedded in their
       | weights.
        
         | devilsbabe wrote:
         | Funny how if you kept reading before commenting, they addressed
         | that point specifically
         | 
         | > We were cautious to only run after each model's training
         | cutoff dates for the LLM models. That way we could be sure
         | models couldn't have memorized market outcomes.
        
       | thedougd wrote:
       | Would be nice to use the logos in the legend. I use these LLMs
       | every day and didn't know what half these logos on the graph were.
        
       | krauses wrote:
       | I'd like to see a variation of the models being fine tuned based
       | on investments of those in congress that seem to consistently
       | outperform the markets.
        
       | RandomLensman wrote:
       | Could be interesting to see performance distribution for random
       | strategies on that stock universe as a comparison. The reverse
       | could also be interesting: how do the models perform on data that
       | is random?
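One way to sketch the random-strategy comparison suggested above: draw many random portfolios from the same stock universe, record each one's total return, and see where a model's return falls in that distribution. Everything here is an assumption for illustration (the function names, the equal-weight rebalancing each period, the number of picks); it is not how the article ran its backtest, and the caller has to supply real per-period returns.

```python
import random
import statistics


def random_strategy_returns(period_returns, n_trials=1000, picks=3, seed=0):
    """Total returns of random equal-weight strategies.

    period_returns: dict mapping ticker -> list of simple returns, one per
    period, all lists the same length. Each period, every trial holds
    `picks` randomly chosen tickers with equal weight.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    tickers = sorted(period_returns)
    n_periods = len(period_returns[tickers[0]])
    totals = []
    for _ in range(n_trials):
        value = 1.0
        for t in range(n_periods):
            chosen = rng.sample(tickers, picks)
            value *= 1.0 + statistics.fmean([period_returns[tk][t] for tk in chosen])
        totals.append(value - 1.0)
    return totals


def percentile_rank(sample, x):
    """Fraction of values in `sample` that are less than or equal to x."""
    return sum(v <= x for v in sample) / len(sample)
```

If a model's return lands near the middle of the random distribution, its stock picking is hard to tell apart from chance on that universe; the same harness run on shuffled or synthetic returns covers the "random data" variant.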
        
       | mvkel wrote:
       | When the market is rising, everyone looks like a genius.
       | 
       | Would have been better to have variants of each, locked to
       | specific industries.
       | 
       | It also sounds like they were -forced- to make trades every day.
       | Why? deciding not to trade is a good strategy too.
        
       | cramcgrab wrote:
       | Yeah I've been using grok to manage my yolo fund, it's been doing
       | great so far, up around 178% ytd, only rebalance once every other
       | month.
        
       | portly wrote:
       | What is the point of this?
       | 
       | LLMs are trained to predict the next word in a text. In what way,
       | shape or form does that have anything to do with stock market
       | prediction? Completely ridiculous AI bubble nonsense.
        
         | another_twist wrote:
         | No it isn't. Next word prediction is what humans do to
         | communicate anyway so the criticism isn't valid. Except you do
         | that for your own sentences (if you do it for others it's
         | considered rude :) ).
         | 
         | Anyways this criticism is now dated given that modern day LLMs
         | can solve unseen reasoning problems such as those found in the
         | IMO.
         | 
         | It does have something to do with the stock market, since it's
         | about making hypotheses and trading based off that. However,
         | I'd agree that making a proper trading AI here would require
         | reasoning based fine tuning for stock market trading actions.
         | Sort of like running GRPO taking market feedback as the reward.
         | The article simply can't do that due to not having access to the
         | underlying model weights.
        
         | bwfan123 wrote:
         | shhh. We need more of these as counter-parties to improve
         | alpha.
        
       | mvkel wrote:
       | Predicting the stock market will likely never happen because it's
       | recursive. We can predict the next 10 days of weather, but the
       | weather doesn't change because it read your forecast. As long as
       | markets continue to react to their own reactions, they will
       | remain unpredictable.
       | 
       | If the strategy is long, there might be alpha to be found. But
       | day trading? No way.
        
         | oersted wrote:
         | If stocks are more of a closed system that are weakly affected
         | by external factors in the short term, now I finally understand
         | why they hire so many physicists for financial modeling!
         | 
         | There is of course the fact that physicists tend to be the best
         | applied mathematicians, even if they don't end up using any of
         | their physics knowledge. And they generally had the reputation
         | of "the smartest" people for the last century.
         | 
         | Anyway, such systems are complex and chaotic yes, but there are
         | many ways of predicting aspects of them, like with fluid
         | simulation to give a basic example. And I don't get your point
         | about weather, it is also recursive in the same way and
         | reacting to its own reactions. Sure it is not reacting to
         | predictions of itself, but that's just a special kind of
         | reaction, and patterns in others' predictions can definitely be
         | predicted accurately, perhaps not individually but in the
         | aggregate.
        
           | mvkel wrote:
           | > there are many ways of predicting aspects of them
           | 
           | Yes, and it's priced in
           | 
           | > but that's just a special kind of reaction
           | 
           | That's just arguing semantics. My point was that weather
           | doesn't react to human predictions, explicitly
        
         | jerf wrote:
         | "We can predict the next 10 days of weather, but the weather
         | doesn't change because it read your forecast."
         | 
         | Less true than it used to be, with cloud seeding being an off-
         | the-shelf technology now. Still largely true, but not entirely
         | true anymore.
        
       | machiaweliczny wrote:
       | > Potential accidental data leakage from the "future"
       | 
       | Exactly. Makes no sense with models like grok. DeepSeek also
       | likely has this leak as it was trained later.
        
       | Glyptodon wrote:
       | Multiple runs of randomized backtesting seem needed for this to
       | mean anything. It's also not clear to me how there's any kind of
       | information update loop. Maybe I didn't read closely enough.
        
       | kqr wrote:
       | Extremely similar earlier submission but focused on
       | cryptocurrencies, using real money, and in real time:
       | https://news.ycombinator.com/item?id=45976832
       | 
       | I'm extremely skeptical of any attempt to prevent leakage of
       | future results to LLMs evaluated on backtesting. Both because
       | this has been shown in the literature to be difficult, and
       | because I personally found it very difficult when working with
       | LLMs for forecasting.
        
       | kqr wrote:
       | Their annual geometric mean return is 45 %! That's some serious
       | overbetting. In a market that didn't accidentally align with
       | their biases, they would have lost money very quickly.
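For readers unfamiliar with the term, an annual geometric mean return is what you get by compounding a shorter-period return out to twelve months. A minimal sketch follows; the function name and the 10%-over-6-months input are made-up placeholders, not figures from the article.

```python
def annualized_return(total_return: float, months: float) -> float:
    """Compound a total simple return earned over `months` months to a yearly rate."""
    return (1.0 + total_return) ** (12.0 / months) - 1.0


# Hypothetical input: 10% earned over 6 months compounds to about 21% per year.
example = annualized_return(0.10, 6)
print(f"{example:.2%}")
```

Annualizing a short window magnifies whatever happened in it, which is why a high figure from 8 months says more about position sizing and the market regime than about skill.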
        
       | reactordev wrote:
       | I would love for them to have included a peg position on SPY @
       | 100k over the course of the same period. Gives a much better
       | benchmark of what an LLM can do (not much above 2-4%).
       | 
       | Still, cool to see others in my niche hobby of finding the money
       | printer.
        
       | peterbonney wrote:
       | The devil is really in the details on how the orders were
       | executed in the backtest, slippage, etc. Instead of comparing to
       | the S&P 500 I'd love to see it benchmarked against a range of
       | active strategies, including common non-AI approaches (e.g. mean
       | reversion, momentum, basic value focus, basic growth focus, etc.)
       | and some simple predictive (non-generative) AI models. This would
       | help shake out whether there is selection alpha coming out of the
       | models, or whether there is execution alpha coming out of the
       | backtest.
        
       | rao-v wrote:
       | I'd rather give an LLM the earnings report for a stock and the
       | next day's S&P 500 opening and see if it can predict the opening
       | price.
       | 
       | Expecting an LLM to magically beat efficient market theory is a
       | bit silly.
       | 
       | Much more reasonable to see if it can incorporate information as
       | well as the market does (to start)
        
       | natiman1000 wrote:
       | If the code and prompts are not open source how can we trust
       | anything y'all say?
        
       | elzbardico wrote:
       | A rising tide lifts all boats.
        
       | dudeinhawaii wrote:
       | This is the completely wrong way to do this. I say this as someone
       | who does work in this area of leveraging LLMs to a limited degree
       | in trading.
       | 
       | LLMs are naive, easily convinced, and myopic. They're also non-
       | deterministic. We have no way of knowing if you ran this little
       | experiment 10 times whether they'd all pick something else. This
       | is a scattershot + luck.
       | 
       | The RIGHT way to do this is to first solve the underlying problem
       | deterministically. That is, you first write your trading
       | algorithm that's been thoroughly tested. THEN you can surface
       | metadata to LLMs and say things along the lines of "given this
       | data + data you pull from the web", make your trade decision for
       | this time period and provide justification.
       | 
       | Honestly, adding LLMs directly to any trading pipeline just adds
       | non-useful non-deterministic behavior.
       | 
       | The main value is speed of wiring up something like sentiment
       | analysis as a value add or algorithmic supplement. Even this
       | should be done using proper ML but I see the most value in using
       | LLMs to shortcut ML things that would require time/money/compute.
       | Trading value now for value later (the ML algorithm would
       | ultimately run cheaper long-run but take longer to get into
       | prod).
       | 
       | This experiment, like most "I used AI to trade" blogs, is
       | completely naive in its approach. They're taking the lowest
       | possible hanging fruit. Worse still when those results are the
       | rising tide lifting all boats.
       | 
       |  _Edit_ (was a bit harsh) This experiment is an example of the
       | kind of embarrassingly obvious things people try with LLMs
       | without understanding the domain and writing it up. To an
       | outsider it can sound exciting. To an insider it's like seeing a
       | news story "LLMs are designing new CPUs!". No they're not. A more
       | useful bit of research would be to control for the various
       | variables (sector exposure etc) and then run it 10_000 times and
       | report back on how LLM A skews towards always buying tech and LLM
       | B skews towards always recommending safe stocks.
       | 
       | Alternatively, if they showed the LLM taking a step back and
       | saying "ah, let me design this quant algo to select the best
       | stocks" -- and then succeeding -- I'd be impressed. I'd also know
       | that it was learned from every quant that had AI double check
       | their calculations/models/python.. but that's a different point.
        
       | snapdeficit wrote:
       | Anyone who traded tech stocks in the 1990s when AmeriTrade
       | appeared remembers this story.
       | 
       | Have the LLMs trade anything BUT tech stocks and see how they do.
       | 
       | That's the real test.
       | 
       | EDIT: I remember this is probably before AmeriTrade offered
       | options. I was calling in trades at 6:30AM PST to my broker while
       | he probably laughed at me. But the point is the same: any doofus
       | could make money buying tech stocks and holding for a few weeks.
       | Companies were splitting constantly.
        
       ___________________________________________________________________
       (page generated 2025-12-05 23:01 UTC)