[HN Gopher] The Deep Research problem
___________________________________________________________________
The Deep Research problem
Author : cratermoon
Score : 92 points
Date : 2025-02-21 21:26 UTC (4 days ago)
(HTM) web link (www.ben-evans.com)
(TXT) w3m dump (www.ben-evans.com)
| Lws803 wrote:
| I always wondered: if deep research has an X% chance of producing
| errors in its report, and you have to double-check everything,
| visit every source, and potentially correct it yourself, does it
| really save time in helping you get research done (outside of
| coding and marketing)?
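|
| A back-of-the-envelope way to frame the break-even point (all
| numbers below are hypothetical, in Python):
|
|     # Does a generated report save time once verification is included?
|     manual_hours = 8.0   # hypothetical: doing the research by hand
|     claims = 40          # factual claims in the generated report
|     error_rate = 0.10    # the X% chance any given claim is wrong
|     check_min = 5        # minutes to spot-check one claim
|     fix_min = 20         # minutes to track down and fix a wrong claim
|
|     verify_hours = claims * (check_min + error_rate * fix_min) / 60
|     print(f"verify: {verify_hours:.1f}h vs manual: {manual_hours:.1f}h")
|     # The tool only pays off while verification stays well under
|     # the manual effort; a high error_rate erodes the savings fast.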
| ImaCake wrote:
| It might depend on how much you struggle with writer's block. An
| LLM essay with sources is probably a better starting point than
| a blank page. But it will vary between people.
| baxtr wrote:
| I urge anyone to do the following: take a subject you know really
| really well and then feed it into one of the deep research tools
| and check the results.
|
| You might be amazed, but most probably you'll be shocked.
| ilrwbwrkhv wrote:
| Yup, none of these tools are anywhere close to AGI or
| "research". They are still a much better search engine and, of
| course, a spam generator.
| tptacek wrote:
| I did a trial run with Deep Research this weekend to do a
| comparative analysis of the comp packages for Village Managers in
| suburbs around Chicagoland (it's election season, our VM's comp
| had become an issue).
|
| I have a decent idea of where to look to find comp information
| for a given municipality. But there are a lot of Chicagoland
| suburbs and tracking documents down for all of them would have
| been a chore.
|
| Deep Research was valuable. But it only did about 60% of the work
| (which, of course, it presented as if it was 100%). It found
| interesting sources I was unaware of, and it assembled lots of
| easy-to-get public data that would have been annoying for me to
| collect (for instance, basic stuff like the name of every
| suburban Village Manager), which made spot-checking easier. But
| I still had to spot-check everything myself.
|
| The premise of this post seems to be that material errors in Deep
| Research results negate the value of the product. I can't speak
| to how OpenAI is selling this; if the claim is "subscribe to Deep
| Research and it will generate reliable research reports for you",
| well, obviously, no. But as with most AI things, if you get past
| the hype, it's plain to see the value it's actually generating.
| WhitneyLand wrote:
| >>The premise of this post seems to be that material errors in
| Deep Research results negate the value of the product
|
| No, it's not. It's that Deep Research is oversold from a
| marketing perspective and comes with some big caveats.
|
| But it does talk about big time savings for the right contexts.
|
| Emphasis from the article:
|
| "these things _are_ useful"
| iandanforth wrote:
| I'll share my recipe for using these products on the off chance
| it helps someone.
|
| 1. Only do searches that result in easily verifiable results from
| non-AI sources.
|
| 2. Always perform the search in multiple products (Gemini 1.5
| Deep Research, Gemini 2.0 Pro, ChatGPT o3-mini-high, Claude 3.7
| w/ extended thinking, Perplexity)
|
| With these two rules I have found the current round of LLMs
| useful for "researchy" queries. Collecting the results across
| tools and then throwing out the 65-75% slop (a sketch of this
| culling step follows below) yields genuinely useful information
| that would have taken me much longer to find.
|
| Now the above could be seen as a harsh critique of these tools,
| as in the kiddie pool is great as long as you're wearing full
| hazmat gear, but I still derive regular and increasing value from
| them.
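|
| For rule 2, the culling step can be made semi-mechanical. A
| minimal sketch (the claim strings and tool names are placeholders
| for whatever comparison you actually do by hand):
|
|     from collections import Counter
|
|     # Keep only claims that at least two independent tools surface.
|     answers_by_tool = {
|         "tool_a": ["claim one", "claim two", "claim three"],
|         "tool_b": ["claim one", "claim four"],
|         "tool_c": ["claim one", "claim two"],
|     }
|
|     counts = Counter(c for claims in answers_by_tool.values()
|                        for c in set(claims))
|     kept = [claim for claim, n in counts.items() if n >= 2]
|     print(kept)  # corroborated claims survive; singletons are "slop"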
| munchler wrote:
| This makes sense. How many of those products do you have to pay
| for?
| kridsdale3 wrote:
| I'm not OP but I do similar stuff. I pay for Claude's basic
| tier, OpenAI's $200 tier, and Gemini ultra-super-advanced I
| get for free because I work there.
|
| I combine all the 'slop' from the three of them into Gemini
| (1M or 2M token context window) and have it distill the valuable
| stuff into a good-enough final product.
|
| Doing so has got me a lot of kudos and applause from those I
| work with.
| munchler wrote:
| Wow, that's eye-opening. So, just to be clear, you're
| paying for Claude and OpenAI out of your own pocket, and
| using the results at your Google job? We live in
| interesting times, for sure. :)
| submeta wrote:
| Deep Research is in its "ChatGPT 2.0" phase. It will improve,
| dramatically. And to the naysayers: when OpenAI released its
| first models, many doubted they would be good at coding. Now,
| two years later, look at Cursor, aider, and all the LLMs powering
| them, and what you can do with a few prompts and iterations.
|
| Deep research will dramatically improve as it's a process that
| can be replicated and automated.
| amelius wrote:
| This is like saying: y = e^-x + 1 will soon be 0, because look
| at how fast it went through y = 2!
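|
| Spelling the joke out numerically: the curve drops quickly at
| first but flattens out at 1 and never reaches 0.
|
|     import math
|
|     # y = e^-x + 1: fast early progress, hard asymptote at 1.
|     for x in range(6):
|         print(x, round(math.exp(-x) + 1, 4))
|     # 0 -> 2.0, 1 -> 1.3679, 2 -> 1.1353, ... limit is 1, not 0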
| kridsdale3 wrote:
| I appreciate your style of humor.
| PeterFBell wrote:
| Thanks for making my day :)
| nicksrose7224 wrote:
| Disagree - I actually think all the problems the author lays
| out about Deep Research apply just as well to GPT-4o / o3-mini-
| whatever. These things are just absolutely terrible at
| precision & recall of information.
| simonw wrote:
| I think Deep Research shows that these things can be very
| good at precision and recall of information if you give them
| access to the right tools... but that's not enough, because
| of source quality. A model that has great precision and
| recall but uses flawed reports from Statista and Statcounter
| is still going to give you bad information.
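|
| For concreteness, precision and recall of a report's claims, with
| hypothetical counts:
|
|     # Precision: of the facts stated, how many are correct?
|     # Recall: of the facts available to find, how many were stated?
|     stated, correct, findable = 50, 45, 60
|
|     precision = correct / stated    # 0.90
|     recall = correct / findable     # 0.75
|     print(f"precision={precision:.2f} recall={recall:.2f}")
|     # Both can look great and the report can still mislead if the
|     # underlying sources themselves are flawed: garbage in, garbage out.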
| lsy wrote:
| Research skills involve not just combining multiple pieces of
| data, but also being able to apply very subtle skills to
| determine whether a source is trustworthy, to cross-check numbers
| where their accuracy is important (and to determine _when_ it's
| "important"), and to engage in some back and forth to determine
| which data actually applies to the research question being asked.
| In this sense, "deep research" is a misleading term, since the
| output is really more akin to a probabilistic "search" over the
| training data where the result may or may not be accurate and
| requires you to spot-check every fact. It is probably useful for
| surfacing new sources or making syntactic conjectures about how
| two pieces of data may fit together, but checking all of those
| sources for _existence_, let alone validity, still _needs_ to be
| done by a person, and the output, as it stands in its polished
| form today, doesn't compel users to take sufficient
| responsibility for its factuality.
| rollinDyno wrote:
| Everyone who has been working on RAG is aware of how important
| source control is. Simply directing your agent to fetch keyword-
| matching documents will lead to inaccurate claims.
|
| The reality is that for now it is not possible to leave the human
| out of research, so I think the best an LLM can do is help curate
| sources and synthesize them; it cannot reliably write sound
| conclusions.
|
| Edit: this is something elicit.com recognized quite early. But
| even when I was using it, I was wishing I had more control over
| the space over which the tool was conducting search.
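|
| A minimal sketch of that kind of source control in a RAG pipeline
| (the domains and document shape are hypothetical): only documents
| from a human-curated allowlist ever reach the model.
|
|     from urllib.parse import urlparse
|
|     # Hypothetical allowlist of vetted sources.
|     ALLOWED = {"census.gov", "bls.gov", "sec.gov"}
|
|     def filter_sources(docs):
|         """Drop retrieved documents whose domain is not vetted."""
|         kept = []
|         for doc in docs:
|             domain = urlparse(doc["url"]).netloc.removeprefix("www.")
|             if domain in ALLOWED:
|                 kept.append(doc)
|         return kept
|
|     docs = [
|         {"url": "https://www.census.gov/data", "text": "..."},
|         {"url": "https://keyword-spam.example", "text": "..."},
|     ]
|     print([d["url"] for d in filter_sources(docs)])  # census.gov only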
| theGnuMe wrote:
| One other existential question is Simpson's paradox, which I
| believe is exploited by politicians to support different policies
| from the same underlying data. I see this as a problem for
| government, especially if we have liberal- or conservative-
| trained LLMs. We expect the computer to give us the correct
| answer, but when the underlying model is trained one way by RLHF
| or by systemic/weighted bias in its source documents -- imagine
| training a libertarian AI on Cato papers -- you could have
| highly confident pseudo-intellectual junk. Economists already
| deal with this problem daily, since their field has been heavily
| politicized. Law is another such field.
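|
| The paradox in miniature, using the textbook kidney-stone numbers:
| treatment A wins in every subgroup yet loses in the aggregate, so
| the same data honestly supports two opposite headlines.
|
|     groups = {
|         "small stones": {"A": (81, 87),   "B": (234, 270)},
|         "large stones": {"A": (192, 263), "B": (55, 80)},
|     }
|
|     totals = {"A": [0, 0], "B": [0, 0]}
|     for name, arms in groups.items():
|         for arm, (ok, n) in arms.items():
|             totals[arm][0] += ok
|             totals[arm][1] += n
|             print(f"{name} {arm}: {ok/n:.1%}")
|     for arm, (ok, n) in totals.items():
|         print(f"overall {arm}: {ok/n:.1%}")
|     # A beats B within both subgroups (93.1% vs 86.7%, 73.0% vs
|     # 68.8%), but B wins overall (82.6% vs 78.0%).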
| ImaCake wrote:
| I've never thought of Simpson's Paradox as a political problem
| before, thanks for sharing this!
|
| Arguably this applies just as well to Bayesian vs Frequentist
| statisticians or Molecular vs Biochemical Biologists.
| jppope wrote:
| These days I feel like GenAI has an accuracy rate of basically
| 95%, maybe 96%. Great at boilerplate, great at stuff you want
| an intern to do or maybe to outsource... but it really struggles
| with the valuable stuff. The errors are almost always in the most
| inconvenient places, and they are hard to see... So I agree with
| Ben Evans on this one: what is one to do? The further you lean on
| it, the worse your skills and specializations get. It is
| invaluable for some kinds of work, greatly speeding you up, but
| then some of the things you would have caught take you down a
| rabbit hole that wastes so much time. The tradeoffs here aren't
| great.
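|
| The 95-96% figure compounds badly across a document. A quick
| illustration (the per-claim rate is taken from the comment above):
|
|     # Chance a document with n independent claims is error-free.
|     p = 0.95
|     for n in (1, 10, 20, 50):
|         print(f"{n:>2} claims: {p**n:.0%} chance of zero errors")
|     # 1: 95%, 10: 60%, 20: 36%, 50: 8% -- and you don't know
|     # which claims are the wrong ones.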
| bakari500 wrote:
| Yeah, but a 4-6% error rate isn't good, even for a dumb
| computer.
| smusamashah wrote:
| I watched the recent Viva La Dirt League videos on how trailers
| lie and make false promises. Now I see an LLM as that marketing
| guy. Even if he knows everything, he can't help lying. You can't
| trust anything he says no matter how authoritative he sounds;
| even if he is telling the truth, you have no way of knowing.
|
| These deep research things are a waste of time if you can't trust
| the output. Code you can run and verify. How do you verify this?
___________________________________________________________________
(page generated 2025-02-25 23:00 UTC)