[HN Gopher] Show NH: "data-to-paper" - autonomous stepwise LLM-d...
       ___________________________________________________________________
        
       Show NH: "data-to-paper" - autonomous stepwise LLM-driven research
        
       Author : roykishony
       Score  : 127 points
       Date   : 2024-05-12 01:52 UTC (21 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | visarga wrote:
       | You can train idea-to-paper models on tons of papers with code.
        | There are many examples of paper implementations on GitHub.
        
         | roykishony wrote:
          | Yes - LLMs tuned on data science publications will be great;
          | we'd need a dataset of papers with reliable and well-performed
          | analyses. Notably, though, it works quite well even with
          | general-purpose LLMs. The key was to break the complex process
          | into smaller steps where results from upstream steps are used
          | downstream. That also creates papers where every downstream
          | result is programmatically linked to upstream data.
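          | 
          | A minimal sketch of the kind of chaining I mean (hypothetical
          | step names, not our actual code): each step reads the recorded
          | outputs of upstream steps, and its own output is recorded in
          | turn, so every downstream result stays linked to what produced
          | it.
          | 
          |     # Hypothetical sketch of stepwise chaining; toy stand-ins
          |     # for the real LLM-driven steps.
          |     def run_pipeline(data, steps):
          |         """Run named steps in order; each sees upstream outputs."""
          |         outputs = {"raw_data": data}
          |         for name, step in steps:
          |             outputs[name] = step(outputs)  # downstream links upstream
          |         return outputs  # the full, traceable record of the run
          | 
          |     steps = [
          |         ("exploration", lambda ctx: f"summary of {len(ctx['raw_data'])} rows"),
          |         ("hypothesis", lambda ctx: f"hypothesis from {ctx['exploration']}"),
          |         ("analysis", lambda ctx: f"test of {ctx['hypothesis']}"),
          |         ("paper", lambda ctx: f"manuscript reporting {ctx['analysis']}"),
          |     ]
          |     print(run_pipeline([1, 2, 3], steps)["paper"])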
        
       | robwwilliams wrote:
       | Most interesting in the omics era. There is a huge gap between
       | massive well structured data and granular use of these data to
       | both develop and test ideas. For one particular family of mice we
       | have about 15 million vectors of phenome data--all of it mappable
       | as genetic loci.
       | 
       | A tool to smoothly catalyze "data to paper" or better yet "data
       | to prevention or treatment" is what we need.
        
         | roykishony wrote:
          | Yes, that sounds like the type of data that will be fun to try
          | out with data-to-paper! The repo is now open - you're welcome
          | to give it a try. And happy to hear suggestions for
          | improvements and development directions: data-to-treatment,
          | data-to-insights, data-to-prevention, data-to-???
        
           | startupsfail wrote:
            | Evaluate the quality of generated papers on 10-20 samples
            | with peer review.
        
       | bjornsing wrote:
       | > data-to-paper is a framework for systematically navigating the
       | power of AI to perform complete end-to-end scientific research,
       | starting from raw data and concluding with comprehensive,
       | transparent, and human-verifiable scientific papers (example).
       | 
        | Even if this thing works, I wouldn't call it "end-to-end
        | scientific research". IMHO the most challenging and interesting
        | part of scientific research is coming up with a hypothesis and
        | designing an experiment to test it. Data analysis and paper
        | writing are just a small part of the end-to-end process.
        
         | rlt wrote:
         | The very next paragraph:
         | 
         | > Towards this goal, data-to-paper systematically guides
         | interacting LLM and rule-based agents through the conventional
         | scientific path, from annotated data, through _creating
         | research hypotheses_ , conducting literature search, writing
         | and debugging data analysis code, interpreting the results, and
         | ultimately the step-by-step writing of a complete research
         | paper.
        
           | bjornsing wrote:
           | > from annotated data, through creating research hypotheses
           | 
           | Then it's all just wrong, automated p-hacking. You're
           | supposed to start with the hypothesis, not generate it from
           | the data you're about to publish.
        
             | YeGoblynQueenne wrote:
             | More to the point you're supposed to start with an
             | observation that your current theory can't explain. Then
             | you make a hypothesis that tries to explain the observation
             | and collect more observations to try and refute your
             | hypothesis; if you're a good falsificationist, that is.
              | That doesn't seem to be the process described above. As
              | you say, it's just a pipeline from data to paper: great
              | for writing papers, but not much use for science.
             | 
             | But I guess these days in many fields of science and in
             | popular parlance "data" has become synonymous with
             | "observation" and "writing papers" with "research", so.
        
       | 8organicbits wrote:
       | > You are solely responsible for the entire content of created
       | manuscripts including their rigour, quality, ethics and any other
       | aspect. The process should be overseen and directed by a human-
       | in-the-loop and created manuscripts should be carefully vetted by
       | a domain expert. The process is NOT error-proof and human
       | intervention is necessary to ensure accuracy and the quality of
       | the results.
       | 
       | I'm happy to see this directly stated. Is there any guidance for
       | domain experts on the types of mistakes an LLM will make? The
        | process will be different from vetting a university student's
        | paper, so they are unlikely to know what to look out for. How
        | often will a domain expert reject generated papers? Given the
        | large vetting burden, does this save any time versus doing the
        | research the traditional way? I'm honestly wary that domain
        | experts won't be used, that careful review won't be performed,
        | and that believable AI slop will spread in academic channels
        | that aren't ready to weed out these flawed papers. We're
        | relying pretty heavily on personal ethics here, right?
        
       | jeffreygoesto wrote:
        | But who wants to spend human time reading all that? To me it
        | seems we should train an AI to do it. Stanislaw Lem predicted
        | that AI would go off on such a tangent that we'd be better off
        | not interacting with it, in his book
        | https://en.m.wikipedia.org/wiki/Peace_on_Earth_(novel)
        
       | uniqueuid wrote:
       | With all the positive comments here, I feel like someone should
       | play the role of the downer.
       | 
       | First of all, it's inevitable that LLMs will be/are used in this
       | way and it's great to see development and discussion in the open!
       | That's really important.
       | 
        | Secondly, this will absolutely destroy some areas of science
        | even more than they already have been.
       | 
        | Why? First, science, like all of humankind, is always a
        | balance between benevolent and malevolent actors. Science
        | already battles data forgery, p-hacking and replication
        | issues. Giving researchers access to tools like this will mean
        | that some conventional quality assurance processes will fail
        | hard. Double-blind peer review will no longer work when
        | AI-generated submissions outnumber high-quality ones 10:1 or
        | 100:1.
       | 
        | Second, doing analysis and writing a paper is one bottleneck
        | of science, but epistemologically it's not the important one.
        | There are innumerable ways to analyze extant data, and it's
        | completely moot to do any analysis in this way. Simmons, Nelson
        | and Simonsohn, Gelman et al. and others have shown that, given
        | a dataset, (1) the findings you can get practically always
        | range from very negative effects to very positive effects,
        | depending on the setup of the analysis, so having _one_
        | analysis is pointless, especially without theory; and (2) even
        | when you give really good labs the same data and question,
        | almost nobody gets the same result (the "many labs"
        | experiments).
       | 
       | What does this tell us? There are a few parts of science that are
       | extremely important and without them science is not only low-
       | impact, it even has a harmful effect by creating costs for
        | pruning and distilling findings. The really important part is
        | causal analysis, which practically always involves data
        | collection. That's why sciences with strong experimental
       | traditions fare a bit better - when you need to run a costly
       | experiment yourself in order to publish a paper, this creates a
       | strong incentive to think things through and do high-impact
       | research.
       | 
        | So yeah, we've seen this coming, and it must create a big
        | backlash that prevents this kind of research from being
        | published, even if vetted by humans.
       | 
       | Source: am a scientist, am a journal editor.
        
         | oefrha wrote:
         | Agreed as a former scientist (theoretical high energy physics).
         | I've yet to meet one person in related fields who's
         | enthusiastic about giving paper mills a 2000% productivity
         | boost while giving honest people a 20% boost at best, and by
          | the looks of it, this kind of data-to-mindless-statistical-
          | correlation agent will hit the already bullshit-laden, not
          | very scientific fields the hardest. I'm not sure that future
          | can be deterred though; the cat is already out of the bag.
        
           | YeGoblynQueenne wrote:
           | I just hope that one day we find the jerk who put the poor
           | animal in the bag in the first place.
           | 
           | Sorry, I just had to. Hottest day of the year in the UK today
           | and warm weather causes me to lose inhibition.
        
         | pilgrim0 wrote:
         | So, in the report, the statement "the power of AI to perform
         | complete _end-to-end_ scientific research" is a blatant lie.
         | Given that your comment seems to be the most reasonable one,
         | and considering that I've seen, over and over, that it's always
         | the domain experts who are the least enthusiastic about AI
         | byproducts, I recalled a saying from the Shogun series:
         | 
         | "Why is it that only those who have never fought in a battle
         | are so eager to be in one?"
        
           | uniqueuid wrote:
           | Thanks, that's a nice quote.
           | 
           | With regard to the debate, I think it's good not to engage in
           | too much black-and-white thinking. Science itself is a pretty
           | muddy affair, and we still haven't grown beyond simplistic
           | null hypothesis significance testing (NHST), even decades
           | after its problematic implications became clear.
           | 
            | That's why it's so important to look at the macro
            | implications, i.e., how does this shift costs? As another
            | comment nicely put it, LLMs are empowering good science,
            | but they are potentially empowering bad science an order of
            | magnitude more.
        
             | pilgrim0 wrote:
              | Having a design background, I agree completely. To explain
              | why design matters in this case, we simply need to look at
              | ergonomic factors: literally the "economy of work." That's
              | why I pointed out the "end to end" claim as a lie: it's
              | impossible to assert such things without thorough testing
              | of the applications and continued analysis of their
              | effects on the whole supply chain.
              | 
              | Most of those AI byproducts will likely be laughable in
              | the coming decades, similarly to the recurring weird-form-
              | factor boom surrounding whatever device is in vogue. Refer
              | to the video linked in [1] for good examples of weird PC
              | input devices from the 2000s. It takes considerable time
              | for the most viable form factors to be established, and
              | once that's achieved, the designs of the vast majority of
              | products within a category converge to the most ergonomic
              | (and economic) one.
              | 
              | What bothers me most is not the advent of novelty and
              | experiments, but the overconfidence and overpromises
              | surrounding what are merely untested product hypotheses
              | for most AI applications. The negligible marginal cost of
              | producing derivative work in software, fueled by the high
              | availability of accessible tooling and the lack of
              | rigorous design and scientific training, is to blame.
              | Never mind the hype cycle, which is natural and expected.
              | Times like these are when we most need pragmatic
              | skepticism. I wonder if AI developers at all care to do
              | the bare minimum due diligence required to launch their
              | products. Seems to be a rare thing in SWE in general.
              | 
              | [1] https://youtu.be/Sbtgc6mi44M?si=X2e0DSlxZjC7_YOf
        
         | escape_goat wrote:
          | Generally speaking, I defer to your expert point of view on
          | the matter, and I agree that it will be far easier to generate
          | meaningless research that passes the test of appearing
          | meaningful to reviewers than it will be to generate meaningful
          | research that passes the same test.
          | 
          | The thing is, though, it's an open secret that this is already
          | true. Meaningful peer review is already confined to islands
          | within a system that has devolved into generating content. The
          | automation of the process doesn't represent a tipping point,
          | and I don't think that the ethically disclosed production of
          | 'research' by large language models is going to represent a
          | significant part of the problem. The errors of the current
          | system will be reduced to absurdity by the existing ethical
          | norms.
        
       | sarusso wrote:
        | The example paper does not mention what type of diabetes it is
        | about - type 1 or type 2 - and they have very different risk
        | factors.
        | 
        | While it's kind of clear from the context that it's about type
        | 2, I doubt a paper like this would pass peer review without
        | stating it explicitly, in particular with respect to a data set
        | that could potentially include both. Rigor is essential in
        | drawing scientific conclusions.
        | 
        | I guess this is a good example of the statistical nature of LLM
        | outputs (type 2 is the most common) and consequently of their
        | limitations...
        
         | ttaallooss wrote:
         | The hypothesis you have raised about the source of the implicit
         | assumptions these models make is indeed an interesting and
         | plausible one, in my opinion.
         | 
         | Biases in data will always exist, as this is the nature of our
         | world. We need to think about them carefully and understand the
         | challenges they introduce, especially when training large
         | "foundational" models that encode a vast amount of data about
         | the world. We should be particularly cautious when interpreting
         | their outputs and when using them to draw any kind of
         | scientific conclusions.
         | 
          | I think this is one of many reasons why we built the system
          | with human oversight at its core and strongly encourage
          | people to provide input and feedback throughout the process.
        
         | twobitshifter wrote:
          | Can we feed LLMs peer reviews and add a reviewer stage to
          | this? A multi-agent system would likely catch the poor-effort
          | submissions. It could either just reject, or provide feedback
          | if the recommendation was to revise.
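          | 
          | A rough sketch of what such a reviewer stage might look like
          | (hypothetical prompt and decision rule; assumes the OpenAI
          | Python client):
          | 
          |     # Hypothetical LLM reviewer stage; the prompt and decision
          |     # rule are made up for illustration.
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     def review(manuscript: str) -> str:
          |         response = client.chat.completions.create(
          |             model="gpt-4o",
          |             messages=[
          |                 {"role": "system",
          |                  "content": "You are a strict peer reviewer. "
          |                             "Start with REJECT, REVISE, or "
          |                             "ACCEPT, then give your comments."},
          |                 {"role": "user", "content": manuscript},
          |             ],
          |         )
          |         return response.choices[0].message.content
          | 
          |     verdict = review("...manuscript text...")
          |     print(verdict)  # REVISE comments could feed back into the loop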
        
       | Cyphase wrote:
       | @dang typo in title ("Show NH")
        
         | MaxBarraclough wrote:
         | Perhaps they're just focusing on the New Hampshire readers.
        
       | YeGoblynQueenne wrote:
       | Oh, cool. Now all those dodgy conferences and journals that fill
       | my inbox with invitations to publish at their venues can stop
       | bothering me and just generate the research they want themselves.
        
       | cess11 wrote:
       | What's scientific about this? The README.md isn't clear about the
       | philosophy of science that this tool supposedly implements and
       | applies.
       | 
       | Seems to me to be scientific in the same manner ELIZA is
       | therapeutic.
        
         | ttaallooss wrote:
         | I encourage you to look at the manuscript we have put on arXiv:
         | https://arxiv.org/abs/2404.17605 and go through the thread on
         | X: https://x.com/RoyKishony/status/1785319021329674593
         | 
         | We will be happy to explain and even correct ourselves, if
         | needed, if approached in a civil, respectful manner.
        
           | cess11 wrote:
            | I skimmed through much of it; I don't see anything explicit
            | about which philosophy of science is applied. It seems more
            | like automated information processing, similar to what quant
            | finance and the like are up to.
            | 
            | Do you subscribe to some Popperian philosophy? It can't be
            | Feyerabendian, since his thinking put virtue as foundational
            | for science. Do you agree with the large journal publishers
            | that the essence of science is to increase their profits?
            | 
            | Not sure why you think you've earned my respect, and it would
            | be very hard for me to violate your rights since we
            | communicate by text alone.
        
       | roykishony wrote:
       | Thanks everyone for engagement and discussion. Following the
       | range of comments, just a few thoughts:
       | 
        | 1. Traceability, transparency and verifiability. The key
        | question for me is not just whether AI can accelerate science,
        | but how we can use AI to accelerate science while at the same
        | time enhancing key scientific values, like transparency,
        | traceability and verifiability.
       | 
        | More and more these days when I read scientific papers,
        | published either in high-impact journals or in more specialized
        | journals, I find it so hard, and sometimes even frustratingly
        | impossible, to understand and check what exactly was done to
        | analyze the raw data and get to the key results: what the
        | specific chain of analysis steps was, what parameters were
        | used, etc. The data is often not there or is poorly annotated,
        | the analysis is explained poorly, the code is missing or is
        | impossible to track. All in all, it has become practically
        | impossible to repeat and check the analysis and the results of
        | many peer-reviewed publications.
       | 
        | Why are papers so hard to follow and trace? Because writing
        | clear, fully traceable and transparent papers is very hard: we
        | don't have powerful tools for doing this, and it requires the
        | scientific process itself (or at least the data analysis part)
        | to be done in an organized and fully traceable way.
       | 
        | Our data-to-paper approach is designed to use AI powerfully,
        | not only to speed up science (by a lot!) but also to enhance
        | transparency, traceability and verifiability. Data-to-paper
        | sets a standard for traceability and verifiability which imo
        | exceeds the current level of human-created manuscripts. In
        | particular:
       | 
       | 1. "Data-Chaining": by tracing information flow through the
       | research steps, data-to-paper creates what we call "data-chained"
       | manuscripts, where results, methodology and data are
       | _programmatically_ linked. See this video
       | (https://youtu.be/mHd7VOj7Q-g). You can also try click-tracing
       | results in this example ms:
       | https://raw.githubusercontent.com/rkishony/data-to-paper-sup...
       | 
       | See more about this and more examples in our preprint:
       | https://arxiv.org/abs/2404.17605
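        | 
        | To give the flavor of it, here is a toy sketch (not our actual
        | implementation) of how a result can carry a programmatic link
        | back to the code that produced it:
        | 
        |     # Toy provenance sketch; the real data-chaining in
        |     # data-to-paper is much richer than this.
        |     import inspect
        | 
        |     class ChainedValue:
        |         """A result that remembers the code that produced it."""
        |         def __init__(self, value, producer):
        |             self.value = value
        |             line = inspect.getsourcelines(producer)[1]
        |             self.source = f"{producer.__name__} (line {line})"
        | 
        |     def mean_bmi(rows):
        |         return sum(r["bmi"] for r in rows) / len(rows)
        | 
        |     rows = [{"bmi": 24.0}, {"bmi": 31.0}]
        |     result = ChainedValue(mean_bmi(rows), mean_bmi)
        |     print(result.value, "<-", result.source)  # click-tracing, in miniature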
       | 
       | 2. Human in the loop. We are looking at different ways to create
       | a co-piloted environment where human scientists can direct and
        | oversee the process. We currently have a co-pilot app that
        | allows users to follow the process, set and change prompts,
        | and provide review comments at the end of each step
        | (https://youtu.be/Nt_460MmM8k). It will be great to get
        | feedback (and help!) on ways in which this could be enhanced.
       | 
        | 3. P-value hacking. Data-to-paper is designed to raise a
        | hypothesis (autonomously, or by user input) and then go
        | through the research steps to test it. If the hypothesis test
        | is negative, it is perfectly fine and suitable to write a
        | negative-result manuscript. In fact, in one of our tests we
        | gave it the data of a peer-reviewed publication that reports
        | both a positive and a negative result, and data-to-paper
        | created manuscripts that correctly report both.
       | 
        | So data-to-paper on its own is not doing multiple hypothesis
        | searches. In fact, it can help you realize just how many
        | hypotheses you have actually tested (something very hard for
        | human researchers even when working honestly). Can people ask
        | data-to-paper to create 1000 papers, read them all, and choose
        | only the single one in which a positive result is found? Yes -
        | people can always cheat, and science is built on trust, but it
        | is not going to be particularly easier than any of the many
        | other ways available for people to cheat if they want.
       | 
        | 4. Final note: LLMs are here to stay and are already used
        | extensively in doing science (sadly, sometimes undisclosed:
        | https://retractionwatch.com/papers-and-peer-reviews-with-evi...).
        | New models - ChatGPT5, ChatGPT6, ... - will likely write a
        | whole manuscript for you in just a single prompt. So the
        | question is not whether AI will go into science (it already
        | has), but rather how to use AI in ways that foster, not
        | jeopardize, accountability, transparency, verifiability and
        | other important scientific values. This is what we are trying
        | to do with data-to-paper. We hope our project stimulates
        | further discussion on how to harness AI in science while
        | preserving and enhancing key scientific values.
        
         | uniqueuid wrote:
         | Hi,
         | 
          | Thanks for the honest and thoughtful discussion you are
          | conducting here. Comments tend to be simplistic, and it's
          | great to see you raise the bar by addressing criticism and
          | questions in earnest!
         | 
          | That said, I think the fundamental problem of such tools is
          | unsolvable: out of all possible analytical designs, they
          | create boring existing results at best, and wrong results
          | (e.g. missing confounders, misunderstanding context ...) at
          | worst. They also pollute science with harmful findings that
          | lack meaning in the context of a field.
          | 
          | These issues have been well known for about ten years and are
          | explained excellently, e.g., in papers such as [1].
         | 
          | There is really only one way to guard against bad science
          | today, and that is _true pre-registration_. And that is
          | something which LLMs fundamentally cannot do.
         | 
         | So while tools such as data-to-paper may be helpful, they can
         | only be so in the context of pre-registered hypotheses where
         | they follow a path pre-defined by humans _before collecting
         | data_.
         | 
         | [1]
         | http://www.stat.columbia.edu/~gelman/research/unpublished/p_...
        
           | alchemist1e9 wrote:
            | > That said, I think the fundamental problem of such tools
            | is unsolvable: out of all possible analytical designs, they
            | create boring existing results at best, and wrong results
            | (e.g. missing confounders, misunderstanding context ...) at
            | worst. They also pollute science with harmful findings that
            | lack meaning in the context of a field.
            | 
            | This doesn't seem correct to me at all. If new data is
            | provided and the LLM is simply an advanced tool that applies
            | known analysis techniques to the data, then why would it
            | create "boring existing results"?
            | 
            | I don't see why systems using an advanced methodology should
            | not produce novel results when provided with new data.
            | 
            | There are a lot of reactionary or even Luddite responses to
            | the direction we are headed with LLMs.
        
             | uniqueuid wrote:
              | Sorry, but I think we have very different perspectives here.
             | 
             | I assume you mean that LLMs can generate new insights in
             | the sense of producing plausible results from new data or
             | in the sense of producing plausible but previously unknown
             | results from old data.
             | 
             | Both these things are definitely possible, but they are not
             | necessarily (and in fact often not) good science.
             | 
              | Insights in science are not rare. There are trillions of
              | plausible insights, and all can be backed by data. The
              | real problem is the reverse: finding a meaningful and
              | useful one in a sea of billions of others.
             | 
             | LLMs learn from past data, and that means they will have
             | more support for "boring", i.e. conventional hypotheses,
             | which have precedent in training material. So I assume that
             | while they can come up with novel hypotheses and results,
             | these results will probably tend to conform to a
             | (statistically defined) paradigm of past findings.
             | 
              | When they produce novel hypotheses or findings, it is
              | unlikely that they will create genuinely meaningful AND
              | true insights, because if you randomly generate new ideas,
              | almost all of them are wrong (see the papers I linked).
             | 
             | So in essence, LLMs should have a hard time doing real
             | science, because real science is the complex task of
             | finding unlikely, true, and interesting things.
        
               | alchemist1e9 wrote:
                | Have you personally used LLMs within agent frameworks
                | that apply CoT and OPA patterns, or others from
                | cognitive-architecture theories?
                | 
                | I'd be surprised if you had used LLMs beyond the classic
                | chat-based linear interface and still held the opinions
                | you do.
                | 
                | In my opinion, once you combine RAG and agent frameworks
                | with raw observational input data, they can absolutely
                | do real reasoning and analysis and create new insights
                | that are meaningful and will be considered genuine new
                | science. The project/group we are discussing has
                | practically proven this with their replication examples.
        
           | roykishony wrote:
           | Thanks much for these thoughtful comments and ideas.
           | 
            | I can't but fully agree: pre-registered hypotheses are the
            | only way to fully guard against bad science. This, in
            | essence, is what the FDA does for clinical trials too. And
            | btw, lowering the traditional and outdated 0.05 cutoff is
            | also critical imo.
           | 
            | Now, say we are in a utopian world where all science is
            | pre-registered. Why can't we imagine AI being part of the
            | process that creates the hypotheses to be registered? And
            | why can't we imagine it also being part of the process that
            | analyzes the data once it's collected? And in fact, maybe
            | it can even be part of the process that helps collect the
            | data itself?
            | 
            | To me, whether we are in such a utopian world or in the
            | far-from-utopian current scientific world, there is
            | ultimately no fundamental tradeoff between using AI in
            | science and adhering to fundamental scientific values. Our
            | purpose with data-to-paper is to demonstrate, and to
            | provide tools for, harnessing AI to speed up scientific
            | discovery while _enhancing_ the values of traceability and
            | transparency, making our scientific output much more
            | traceable, understandable and verifiable.
           | 
            | As for the question of novelty: indeed, research on the
            | existing public datasets we have used so far cannot be too
            | novel. But scientists can also use data-to-paper with their
            | own fascinating original data. It might help with some
            | aspects of the analysis, and certainly help them keep track
            | of what they are doing and how to report it transparently.
            | Ultimately I hope that such co-piloted deployment will
            | allow us to delegate more straightforward tasks to the AI,
            | letting us human scientists engage in higher-level thinking
            | and higher-level conceptualization.
        
             | uniqueuid wrote:
             | True, we seem to have a pretty similar perspective after
             | all.
             | 
             | My concern is an ecological one within science, and your
             | argument addresses the frontier of scientific methods.
             | 
              | I am sure both are compatible. One interesting question
              | is which instruments are suitable for reducing negative
              | externalities from bad actors. Pre-registration works,
              | but is limited to a few fields where the stakes are high.
              | We will probably see a similarly staggered approach, with
              | more restrictive methods in some fields and less
              | restrictive ones in others.
             | 
              | That said, there remain many problems to think about:
              | e.g., what happens to meta-analyses if the majority of
              | findings come from the same mechanism? Will humans be
              | able to resist the pull of easy AI suggestions and
              | instead think hard where they should? Are there sensible
              | mechanisms for enforcing transparency? Will these trends
              | bring us back to a world in which trust is based only on
              | the prestige of known names?
             | 
             | Interesting times, certainly.
        
       | alchemist1e9 wrote:
        | This is a step forward! Forget the detractors and any negative
        | comments; this is a small peek into a future that will include
        | automated research, automated engineering, and all sorts of
        | tangible ways to automate progress. Obviously the road will be
        | bumpy, with many detractors and complaints.
        | 
        | Here is a small idea for taking it one step further in the
        | future. Perhaps there could be an additional stage where, once
        | the initial data is analyzed and some candidate research ideas
        | generated, a domain-knowledge stage is incorporated. Currently
        | the Semantic Scholar API helps generate a set of reference
        | papers; instead, those papers could be downloaded in full, put
        | into a local RAG store, and then agents could read each paper
        | in detail with a summary of the current data in context,
        | effectively doing research, store their summaries and ideas in
        | the same RAG store, and then combine all that context-specific
        | research into material for the further development of the
        | paper.
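        | 
        | A rough sketch of that stage with txtai (paper fetching stubbed
        | out; txtai 6.x-style API, everything else illustrative):
        | 
        |     # Illustrative only: index full-text papers, retrieve
        |     # passages, and store agent notes back into the same index.
        |     from txtai import Embeddings
        | 
        |     embeddings = Embeddings(content=True)  # keeps text with vectors
        | 
        |     papers = [  # stand-ins for full-text downloads
        |         (0, "Full text of reference paper 1 ...", None),
        |         (1, "Full text of reference paper 2 ...", None),
        |     ]
        |     embeddings.index(papers)
        | 
        |     # Retrieve passages relevant to the current data summary...
        |     hits = embeddings.search("diabetes risk factors in survey data", 3)
        | 
        |     # ...have an agent read them (stubbed here), then store its
        |     # notes in the same index for downstream steps to retrieve.
        |     notes = [(len(papers) + i, f"agent note on: {h['text'][:40]}", None)
        |              for i, h in enumerate(hits)]
        |     embeddings.upsert(notes)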
       | 
        | There is a link to awesome-agents, and I'd be curious what
        | their opinion is of the various other agent frameworks,
        | especially as I don't think they actually used any.
       | 
       | For my proposed idea above I think txtai could provide a lot of
       | the tools needed.
        
         | ttaallooss wrote:
         | This is a super cool idea! We have considered implementing a
         | variation of what you suggested, with the additional feature of
         | linking each factual statement directly to the relevant lines
         | in the literature. Imagine that in each scientific paper, you
         | could click on any factual or semi-factual statement to be led
         | to the exact source--not just the paper, but the specific
         | relevant lines. From there, you could continue clicking to
         | trace the origins of each fact or idea.
        
           | alchemist1e9 wrote:
           | > This is a super cool idea!
           | 
           | Thank you. I'm honored you found it useful.
           | 
           | > From there, you could continue clicking to trace the
           | origins of each fact or idea.
           | 
            | Exactly! I think you would like the automated semantic
            | knowledge-graph building example in txtai.
           | 
           | Imagine how much could be done when price/token drops by
           | another few orders of magnitude! I can envision a world with
           | millions of research agents doing automated research on many
           | thousands of data sets simultaneously and then pooling their
           | research together for human scientists to study, interpret
           | and review.
        
         | roykishony wrote:
          | Thanks! Indeed, currently we only provide the LLM with a
          | short TLDR created by Semantic Scholar for each paper.
          | Reading the whole paper and extracting and connecting to
          | specific findings and results would be amazing to do,
          | especially as it could start creating a network of logical
          | links between statements in the vast scientific literature.
          | txtai indeed looks extremely helpful for this.
        
           | alchemist1e9 wrote:
            | Excellent! I'm glad my input was interesting.
            | 
            | txtai has some demos of automated semantic graph building
            | that might be relevant.
            | 
            | I noticed you didn't really use any existing agent
            | frameworks, which I find very understandable, as their
            | added value over DIY approaches can be questionable.
            | However, txtai might fit better with your overall
            | technology style and philosophy.
            | 
            | Has your team studied the latest CoT and OPA work, or
            | research into cognitive architectures?
        
             | roykishony wrote:
              | Thanks, will certainly look deeper into txtai. Our
              | project is now open and you are more than welcome to give
              | a hand if you can! Yes, you are right - it is built
              | completely from scratch. It does have some similarities
              | to other agent packages, but we have some unique aspects,
              | especially in terms of tracing information flow between
              | many steps and thereby creating the idea of "data-
              | chained" manuscripts (you can click each result and go
              | back all the way to the specific code lines). Also, we
              | have a special code-running environment that catches many
              | different types of common improper uses of imported
              | statistical packages.
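              | 
              | To give the flavor of that guarding (a toy sketch, not
              | our actual environment), one can wrap a statistical call
              | to flag a common misuse:
              | 
              |     # Toy sketch of guarding a statistical call; not the
              |     # actual data-to-paper code-running environment.
              |     import functools
              |     import warnings
              |     import scipy.stats
              | 
              |     def guard_ttest(func):
              |         @functools.wraps(func)
              |         def wrapper(*args, **kwargs):
              |             if "equal_var" not in kwargs:
              |                 warnings.warn(
              |                     "ttest_ind: equal_var not set; the default "
              |                     "assumes equal variances, often wrongly.")
              |             return func(*args, **kwargs)
              |         return wrapper
              | 
              |     scipy.stats.ttest_ind = guard_ttest(scipy.stats.ttest_ind)
              |     scipy.stats.ttest_ind([1.0, 2.0, 3.0], [2.0, 4.0, 9.0])  # warns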
        
               | alchemist1e9 wrote:
               | "data-chained" will be very valuable, especially for the
               | system to evaluate itself and verify the work it's
               | performed.
               | 
               | this is obviously just my initial impression on a
               | distracted Sunday but I'm very encouraged by your project
               | and I will absolutely be following it and looking at your
               | source code.
               | 
               | The detractors don't understand LLMs and probably haven't
               | used them in the way you have and I have. They don't
               | understand that with CoT and OPA that they can be used to
               | reason and think themselves.
               | 
               | I've used them for full automated script writing,
               | performing the job of a software developer. I've also
               | used them to create study guides and practice tests, and
               | then grade those tests. When one implements first hand
               | automated systems with agent frameworks using the APIs it
               | gives a deeper understanding of their power over the
               | basic chat usage most are familiar with.
               | 
               | The people arguing that your system can't do real science
               | are silly, as if the tedious process and logical thinking
               | is something so complex and human that the LLMs can't do
               | it when used within a cognitive framework, of course they
               | can!
               | 
               | Anyway I'm very exited by your project. I hope this
               | summer to spend at least a week dedicated to setting it
               | up and exploring potential integrations with txtai for
               | use on private knowledge bases in addition to your public
               | Scholarly published papers.
        
             | roykishony wrote:
              | And yes, we are implementing CoT and OPA - but surely
              | there is a ton of room for improvement!
        
       | missblit wrote:
       | Hello,
       | 
        | Your example paper omits non-English characters in the names
        | of the cited authors. Look especially at citation [5], where
        | several of the authors have European characters in their names
        | that get dropped.
        | 
        | There is also possibly a missing x (or a dropped ×) in
        | "1.81 10^5" on page 3.
        
         | roykishony wrote:
         | wow - thank you for the meticulous check - these are issues we
         | should certainly fix!
        
       | Eiim wrote:
       | I'm working on my Master's in Statistics, so I feel I can comment
       | on some of what's going on here (although there are others more
       | experienced than me in the comments as well, and I generally
       | agree with their assessments). I'm going to look only at the
       | diabetes example paper for now, mostly because I have finals
       | tomorrow. I find it to be the equivalent of a STA261 final
       | project at our university, with some extra fluff and nicer
       | formatting. It's certainly not close to something I could submit
       | to a journal.
       | 
       | The whole paper is "we took an existing dataset and ran the
       | simplest reasonable model (a logistics regression) on it". That's
       | about 5-10 minutes in R (or Python, or SAS, or whatever else).
       | It's a very well-understood process, and it's a good starting
       | point to understand the data, but it can't be the only thing in
       | your paper, this isn't the 80's anymore.
       | 
       | The overall style is verbose and flowery, typical of LLMs. Good
       | research papers should be straightforward and to the point.
       | There's also strange mixing of "we" and "I" throughout.
       | 
        | We learn in the introduction that interaction effects were
        | tested. That's fine; I'd want it set up earlier why these
        | interaction effects are posited to be interesting. It said
        | earlier that "a comprehensive investigation considering a
        | multitude of diabetes-influencing lifestyle factors concurrently
        | in relation to obesity remains to be fully considered", but
        | quite frankly, I don't believe that. Diabetes is remarkably
        | well-studied, especially in observational studies like this
        | one, due to its prevalence. I haven't searched the literature,
        | but I really doubt that no similar analysis has been done. This
        | is one of the hardest parts of a research paper, finding
        | existing research and where its gaps are, and I don't think an
        | LLM will be sufficiently capable of that any time soon.
       | 
        | There's a complete lack of EDA in the paper. I don't need much
        | (the whole analysis of this paper could be part of the EDA for
        | a proper paper), but some basic distributional statistics of
        | the variables: how many respondents in the dataset were
        | diabetic? Is there a sex bias? What about age distribution? Are
        | any values missing? These are really important for
        | observational studies, because if there are any issues they
        | should be addressed in some way. As it is, it's basically
        | saying "trust us, our data is perfect", which is a huge ask.
        | It's really weird that a bunch of this is in the appendix
        | (which is way too long to be included in the paper and would
        | need to be supplementary materials, but that's fine) (and also
        | it's poorly formatted) but not mentioned anywhere in the paper
        | itself. Looking at the appendix, my main concern is that only
        | 14% of the dataset is diabetic. This means that models will be
        | biased towards predicting non-diabetic (if you just predict
        | non-diabetic all of the time, you're already 86% accurate!).
        | It's not as big an issue for logistic regression, or for
        | observational modeling like this, but I would have preferred an
        | adjustment for this.
       | 
        | In the results, I'm disappointed by the over-reliance on
        | p-values. This is something the statistics field is trying to
        | move away from, for a multitude of reasons, one of which is
        | demonstrated quite nicely here: p-values are (almost) always
        | minuscule with large n, and in this case n=253680 is very
        | large. Standard errors and CIs have the same issue. The Z-value
        | is the most useful measure of confidence here in my eyes.
        | Effect sizes are typically the more interesting metric for such
        | studies. On that note, I would have liked to see predictors
        | normalized so that coefficients can be directly compared. BMI,
        | for example, has a small coefficient, but that's likely just
        | because it has a large range and variance.
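        | 
        | For instance (a sketch; file and column names assumed),
        | z-scoring the continuous predictors before fitting makes their
        | coefficients directly comparable:
        | 
        |     # Sketch: standardize continuous predictors so that fitted
        |     # coefficients are on a comparable per-SD scale.
        |     import pandas as pd
        | 
        |     df = pd.read_csv("diabetes_data.csv")   # hypothetical file
        |     for col in ["BMI", "Age"]:              # assumed columns
        |         df[col] = (df[col] - df[col].mean()) / df[col].std()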
       | 
        | It's claimed that the AIC shows improved fit for the second
        | model, but the change is only ~0.5%, which isn't especially
        | convincing. In fact, it could be much less, because we don't
        | have enough significant figures to see how the rounding went.
        | The p-value is basically meaningless, as previously stated.
       | 
        | The methods section says almost nothing that isn't already
        | stated at least once. I'd like to know something about the
        | tools that were used, which is completely lacking. I do want to
        | highlight this quote: "Both models employed a method to adjust
        | for all possible confounders in the analysis." What??? All
        | possible confounders? If you know what that means, you know
        | that's BS. "A method"? What is your magic tool for removing all
        | variance not reflected in the dataset? I need to know! I
        | certainly don't see it reflected in the code.
       | 
        | The code itself seems fine, maybe a little over-complicated,
        | but that might be necessary for how it interfaces with the LLM.
        | The actual analysis is equivalent to 3 basic lines of R (read
        | the CSV, fit logistic regression 1 with default parameters, fit
        | logistic regression 2 with default parameters).
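        | 
        | Roughly, in Python terms (statsmodels; the file and column
        | names here are stand-ins, so treat this as illustrative):
        | 
        |     # Illustrative Python equivalent of the paper's analysis;
        |     # file and column names are assumed, not taken from it.
        |     import pandas as pd
        |     import statsmodels.formula.api as smf
        | 
        |     df = pd.read_csv("diabetes_data.csv")
        |     m1 = smf.logit("Diabetes_binary ~ BMI + Smoker + PhysActivity",
        |                    data=df).fit()
        |     m2 = smf.logit("Diabetes_binary ~ BMI * PhysActivity + Smoker",
        |                    data=df).fit()
        |     print(m1.summary())
        |     print(m1.aic, m2.aic)  # the AIC comparison discussed above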
       | 
       | This paper would probably get about a B+ in 261, but shouldn't
       | pass a 400-level class. The analysis is very simple and
       | unimpressive for a few reasons. For one, the questions asked of
       | the dataset are very light. More interesting, for example, might
       | have been to do variable selection on all interaction terms and
       | find which are important. More models should have been compared.
       | The dataset is also extremely simple and doesn't demand complex
       | analysis. An experimental design, or messy data with errors and
       | missing values, or something requiring multiple datasets, would
       | be a more serious challenge. It's quite possible that one of the
       | other papers addresses this though.
        
         | roykishony wrote:
         | Thanks so much for these thorough comments.
         | 
          | You suggested some directions for more complex analysis that
          | could be done on this data - I would be so curious to see
          | what you get if you could take the time to run data-to-paper
          | as a co-pilot on your own. You can then give it directions
          | and feedback on where to go - it will be fascinating to see
          | where you take it!
         | 
          | We also must look ahead: complexity and novelty will rapidly
          | increase as ChatGPT5, ChatGPT6, etc. are rolled out. The key
          | with data-to-paper is to build a platform that harnesses
          | these tools in a structured way that creates transparent and
          | well-traceable papers. Your ability to read, understand and
          | follow all the analysis in these manuscripts so quickly
          | speaks to your talent, of course, but also to the way these
          | papers are structured. Speaking from experience, it is much
          | harder to review human-created papers at such speed and
          | accuracy...
         | 
         | As for your comments on "it's certainly not close to something
         | I could submit to a journal" - please kindly look at the
         | examples where we show reproducing peer reviewed publications
         | (published in a completely reasonable Q1 journal, PLOS One).
         | See this original paper by Saint-Fleur et al:
         | https://journals.plos.org/plosone/article?id=10.1371/journal...
         | 
          | and here are 10 different independent data-to-paper runs in
          | which we gave it the raw data and the research goal of the
          | original publication and asked it to do the analysis, reach
          | conclusions, and write the paper:
         | https://github.com/rkishony/data-to-paper-supplementary/tree...
         | (look up the 10 manuscripts designated "manuscriptC1.pdf" -
         | "manuscriptC10.pdf")
         | 
          | See our own analysis of these manuscripts and their
          | reliability in our arXiv preprint:
          | https://arxiv.org/abs/2404.17605
         | 
          | Note that the original paper was published after the
          | training horizon of the LLM that we used, and also that we
          | programmatically removed the original paper from the results
          | of the literature search that data-to-paper performs, so
          | that it cannot see it.
         | 
          | Thanks so much again, and good luck with your exam tomorrow!
        
       | QuadmasterXLII wrote:
        | It's paper-reviewing season for me, and I think I got one of
        | these submitted. It took a while of reading to realize that it
        | wasn't just a stupid human writing it; there was literally no
        | substance there to find. I can't share details because the
        | confidentiality statement I signed as part of reviewing was
        | pretty strict. However, going forward we are going to have to
        | start deanonymizing and blacklisting the 'authors'; otherwise
        | the ratio of time spent 'writing' to reviewer time wasted will
        | be crippling.
        
         | twobitshifter wrote:
          | In one group that I am part of, we had a reviewer use AI on
          | submissions. This scared the larger org, and we now have a
          | policy of no AI reviews. However, I think AI is closer to
          | competently reviewing some elements of papers than it is to
          | editing them itself. For example, it's the best spelling and
          | grammar tool I've ever seen. Since many submissions are by
          | non-native English speakers, a limited AI review comment
          | would make sense to me.
          | 
          | Overall, because of the happy-to-serve alignment of
          | commercial AI, it's more likely to praise us than to be
          | critical, which means that off-the-shelf models may not fit
          | into reviews of methods and conclusions.
        
       ___________________________________________________________________
       (page generated 2024-05-12 23:01 UTC)