[HN Gopher] Show NH: "data-to-paper" - autonomous stepwise LLM-d...
___________________________________________________________________
Show NH: "data-to-paper" - autonomous stepwise LLM-driven research
Author : roykishony
Score : 127 points
Date : 2024-05-12 01:52 UTC (21 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| visarga wrote:
| You can train idea-to-paper models on tons of papers with code.
| There are many examples of paper implementations on GitHub.
| roykishony wrote:
| yes - LLMs tuned on data science publications will be great.
| We need a dataset of papers with reliable and well-performed
| analysis. Notably though, it works quite well even with
| general-purpose LLMs. The key was to break the complex process
| into smaller steps where results from upstream steps are used
| downstream. That also creates papers where every downstream
| result is programmatically linked to upstream data.
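|
| To make the idea concrete, here is a minimal sketch (purely
| illustrative - not the actual data-to-paper code; the step
| names are made up):
|
|     def run_pipeline(steps, raw_data):
|         # each step sees everything produced upstream and
|         # adds its own named product, keeping results linked
|         products = {"raw_data": raw_data}
|         for name, step in steps:
|             products[name] = step(products)
|         return products
|
|     # toy stand-ins for goal-setting, analysis, paper-writing
|     steps = [
|         ("goal", lambda p: f"study {p['raw_data']}"),
|         ("analysis", lambda p: f"analysis of: {p['goal']}"),
|         ("paper", lambda p: f"paper reporting: {p['analysis']}"),
|     ]
|     print(run_pipeline(steps, "example dataset")["paper"])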
| robwwilliams wrote:
| Most interesting in the omics era. There is a huge gap between
| massive well structured data and granular use of these data to
| both develop and test ideas. For one particular family of mice we
| have about 15 million vectors of phenome data--all of it mappable
| as genetic loci.
|
| A tool to smoothly catalyze "data to paper" or better yet "data
| to prevention or treatment" is what we need.
| roykishony wrote:
| yes, that sounds like the type of data that will be fun to try
| out with data-to-paper! The repo is now open - you're welcome
| to give it a try. And happy to hear suggestions for
| improvements and development directions. data-to-treatment
| data-to-insights data-to-prevention data-to-???
| startupsfail wrote:
| Evaluate quality of generated papers on 10-20 samples with
| peer review.
| bjornsing wrote:
| > data-to-paper is a framework for systematically navigating the
| power of AI to perform complete end-to-end scientific research,
| starting from raw data and concluding with comprehensive,
| transparent, and human-verifiable scientific papers (example).
|
| Even if this thing works I wouldn't call it "end-to-end
| scientific research". IMHO the most challenging and interesting
| part of scientific research is coming up with a hypothesis and
| designing an experiment to test it. Data analysis and paper
| writing is just a small part of the end-to-end process.
| rlt wrote:
| The very next paragraph:
|
| > Towards this goal, data-to-paper systematically guides
| interacting LLM and rule-based agents through the conventional
| scientific path, from annotated data, through _creating
| research hypotheses_, conducting literature search, writing
| and debugging data analysis code, interpreting the results, and
| ultimately the step-by-step writing of a complete research
| paper.
| bjornsing wrote:
| > from annotated data, through creating research hypotheses
|
| Then it's all just wrong, automated p-hacking. You're
| supposed to start with the hypothesis, not generate it from
| the data you're about to publish.
| YeGoblynQueenne wrote:
| More to the point you're supposed to start with an
| observation that your current theory can't explain. Then
| you make a hypothesis that tries to explain the observation
| and collect more observations to try and refute your
| hypothesis; if you're a good falsificationist, that is.
| That doesn't seem to be the process described above. Like
| you say it's just a pipeline from data to paper, great for
| writing papers, but not much for science.
|
| But I guess these days in many fields of science and in
| popular parlance "data" has become synonymous with
| "observation" and "writing papers" with "research", so.
| 8organicbits wrote:
| > You are solely responsible for the entire content of created
| manuscripts including their rigour, quality, ethics and any other
| aspect. The process should be overseen and directed by a human-
| in-the-loop and created manuscripts should be carefully vetted by
| a domain expert. The process is NOT error-proof and human
| intervention is necessary to ensure accuracy and the quality of
| the results.
|
| I'm happy to see this directly stated. Is there any guidance for
| domain experts on the types of mistakes an LLM will make? The
| process will be different from vetting a university student's
| paper so they are unlikely to know what to look out for. How
| often will a domain expert reject generated papers? Given the
| large vetting burden, does this save any time versus doing the
| research the traditional way? I'm honestly wary that domain
| experts won't be used, that careful review won't be performed,
| and that believable AI slop will spread in academic channels
| that aren't ready to weed out these flawed papers. We're
| relying pretty heavily on
| personal ethics here, right?
| jeffreygoesto wrote:
| But who wants to spend human time reading all that? To me it
| seems we should train an AI to do it. Stanislaw Lem predicted
| that AI would go off on such a tangent that we'd better not
| interact with it, in his book
| https://en.m.wikipedia.org/wiki/Peace_on_Earth_(novel)
| uniqueuid wrote:
| With all the positive comments here, I feel like someone should
| play the role of the downer.
|
| First of all, it's inevitable that LLMs will be/are used in this
| way and it's great to see development and discussion in the open!
| That's really important.
|
| Secondly, this will absolutely destroy some areas of science
| even more than they have already been damaged.
|
| Why? First, science, like all of humankind, is always a
| balance between benevolent and malevolent actors. Science
| already battles data forgery, p-hacking and replication
| issues. Giving researchers access to tools like this will mean
| that some conventional quality assurance processes will fail
| hard. Double-blind peer review will no longer work when
| AI-generated submissions outnumber high-quality ones 10:1 or
| 100:1.
|
| Second, doing the analysis and writing the paper is one
| bottleneck of science, but epistemologically it's not the
| important one. There are innumerable ways to analyze extant
| data, and it's completely moot to do any analysis this way.
| Simmons, Nelson and Simonsohn / Gelman et al. have shown that,
| given a dataset, (1) the findings you can get practically
| always range from very negative effects to very positive
| effects, depending on the setup of the analysis, so having
| _one_ analysis is pointless, especially without theory; and
| (2) even when you give really good labs the same data and
| question, almost nobody gets the same result (the "many labs"
| experiments).
|
| What does this tell us? There are a few parts of science that are
| extremely important and without them science is not only low-
| impact, it even has a harmful effect by creating costs for
| pruning and distilling findings. The really important parts
| are causal analyses, and they practically always involve data
| collection. That's why sciences with strong experimental
| traditions fare a bit better - when you need to run a costly
| experiment yourself in order to publish a paper, this creates a
| strong incentive to think things through and do high-impact
| research.
|
| So yeah, we've seen this coming, and it must create a big
| backlash that prevents this kind of research from being
| published, even if vetted by humans.
|
| Source: am a scientist, am a journal editor.
| oefrha wrote:
| Agreed as a former scientist (theoretical high energy physics).
| I've yet to meet one person in related fields who's
| enthusiastic about giving paper mills a 2000% productivity
| boost while giving honest people a 20% boost at best, and by
| the looks of it, this kind of data-to-mindless-statistical-
| correlation agent will hit the already bullshit-laden, not
| very scientific fields the hardest. I'm not sure that future
| can be deterred though; the cat is already out of the bag.
| YeGoblynQueenne wrote:
| I just hope that one day we find the jerk who put the poor
| animal in the bag in the first place.
|
| Sorry, I just had to. Hottest day of the year in the UK today
| and warm weather causes me to lose inhibition.
| pilgrim0 wrote:
| So, in the report, the statement "the power of AI to perform
| complete _end-to-end_ scientific research" is a blatant lie.
| Given that your comment seems to be the most reasonable one,
| and considering that I've seen, over and over, that it's always
| the domain experts who are the least enthusiastic about AI
| byproducts, I recalled a saying from the Shogun series:
|
| "Why is it that only those who have never fought in a battle
| are so eager to be in one?"
| uniqueuid wrote:
| Thanks, that's a nice quote.
|
| With regard to the debate, I think it's good not to engage in
| too much black-and-white thinking. Science itself is a pretty
| muddy affair, and we still haven't grown beyond simplistic
| null hypothesis significance testing (NHST), even decades
| after its problematic implications became clear.
|
| That's why it's so important to look at the macro
| implications: i.e., how does this shift costs? As another
| comment nicely put it, LLMs are empowering good science, but
| they are potentially empowering bad science by an order of
| magnitude more.
| pilgrim0 wrote:
| Having a design background, I agree completely. To explain
| why design matters in this case, we simply need to look at
| ergonomic factors: literally the "economy of work." That's
| why I pointed out the "end to end" claim as a lie because
| it's impossible to assert such things without thorough
| testing of the applications and continued analysis of its
| effects on the whole supply chain. Most of those AI
| byproducts will likely be laughable in the coming decades,
| similarly to the recurring weird-form-factor boom
| surrounding whatever device is in vogue. Refer to the video
| linked in [1] for good examples of weird PC input devices
| from the 2000s. It takes considerable time for the most
| viable form-factors to be established, and once that's
| achieved, then the designs of the vast majority of products
| within a category converge to the most ergonomic (and
| economic) one. What bothers me most is not the advent of
| novelty and experiments, but the overconfidence and
| overpromises surrounding what are merely untested product
| hypotheses for most of AI applications. The negligible
| marginal cost of producing derivative work in software,
| fueled by the high availability of accessible tooling and
| lack of rigorous design and scientific training, is to
| blame. Never mind the hype cycle, which is natural and
| expected. Times like these are when we most need pragmatic
| skepticism. I wonder if AI developers care at all to do the
| bare minimum due diligence required to launch their products.
| It seems to be a rare thing in SWE in general.
|
| [1] https://youtu.be/Sbtgc6mi44M?si=X2e0DSlxZjC7_YOf
| escape_goat wrote:
| Generally speaking, I defer to your expert point of view in
| the matter, and I agree that it will be far easier to generate
| meaningless research that passes the test of appearing
| meaningful to reviewers than it will be to generate meaningful
| research that passes the same test.
|
| However, it is an open secret that this is already true.
| Meaningful peer review is already confined to islands within a
| system that has devolved into generating content. The
| automation of the process doesn't represent a tipping point,
| and I don't think that the ethically disclosed production of
| 'research' by large language models is going to represent a
| significant part of the problem. The errors of the current
| system will be reduced to absurdity by its existing ethical
| norms.
| sarusso wrote:
| The example paper does not mention which type of diabetes it
| is about - type 1 or type 2 - and they have very different
| risk factors.
|
| While it's kind of clear from the context that it's about type
| 2, I doubt a paper like this would pass peer review without
| stating it explicitly, in particular with respect to a dataset
| that could potentially include both. Rigor is essential in
| drawing scientific conclusions.
|
| I guess this is a good example of the statistical nature of
| LLM outputs (type 2 is the most common) and consequently of
| their limitations...
| ttaallooss wrote:
| The hypothesis you have raised about the source of the implicit
| assumptions these models make is indeed an interesting and
| plausible one, in my opinion.
|
| Biases in data will always exist, as this is the nature of our
| world. We need to think about them carefully and understand the
| challenges they introduce, especially when training large
| "foundational" models that encode a vast amount of data about
| the world. We should be particularly cautious when interpreting
| their outputs and when using them to draw any kind of
| scientific conclusions.
|
| I think this is one of many reasons why we implemented the
| system with inherent human oversight and strongly encourage
| people to provide input and feedback throughout the process.
| twobitshifter wrote:
| Can we feed LLMs peer reviews and add a reviewer stage to
| this? A multi-agent system would likely catch the poor-effort
| submissions. It could either just reject, or provide feedback
| if the recommendation was to revise.
| Cyphase wrote:
| @dang typo in title ("Show NH")
| MaxBarraclough wrote:
| Perhaps they're just focusing on the New Hampshire readers.
| YeGoblynQueenne wrote:
| Oh, cool. Now all those dodgy conferences and journals that fill
| my inbox with invitations to publish at their venues can stop
| bothering me and just generate the research they want themselves.
| cess11 wrote:
| What's scientific about this? The README.md isn't clear about the
| philosophy of science that this tool supposedly implements and
| applies.
|
| Seems to me to be scientific in the same manner ELIZA is
| therapeutic.
| ttaallooss wrote:
| I encourage you to look at the manuscript we have put on arXiv:
| https://arxiv.org/abs/2404.17605 and go through the thread on
| X: https://x.com/RoyKishony/status/1785319021329674593
|
| We will be happy to explain and even correct ourselves, if
| needed, if approached in a civil, respectful manner.
| cess11 wrote:
| Skimmed through much of it, I don't see anything explicit
| about which philosophy of science is applied. It seems more
| like automated information processing, similar to what quant
| finance and the like are up to.
|
| Do you belong to some Popperian philosophy? It can't be
| Feyerabendian, since his thinking put virtue as foundational
| for science. Do you agree with the large journal publishers
| that the essence of science is to increase their profits?
|
| Not sure why you think you've earned my respect, and it would
| be very hard for me to violate your rights since we
| communicate by text alone.
| roykishony wrote:
| Thanks everyone for the engagement and discussion. Following
| the range of comments, just a few thoughts:
|
| 1. Traceability, transparency and verifiability. The key
| question for me is not just whether AI can accelerate science,
| but how we can use AI to accelerate science while at the same
| time enhancing key scientific values, like transparency,
| traceability and verifiability.
|
| More and more these days, when I read scientific papers,
| published either in high-impact journals or in more
| specialized journals, I find it so hard, and sometimes even
| frustratingly impossible, to understand and check what exactly
| was done to analyze the raw data and get to the key results:
| what was the specific chain of analysis steps, what parameters
| were used, etc. The data is often not there or is poorly
| annotated, the analysis is explained poorly, the code is
| missing or impossible to track. All in all, it has become
| practically impossible to repeat and check the analysis and
| results of many peer-reviewed publications.
|
| Why are papers so hard to follow and trace? Because writing
| clear, fully traceable and transparent papers is very hard, we
| don't have powerful tools for doing it, and it requires the
| scientific process itself (or at least the data analysis part)
| to be done in an organized and fully traceable way.
|
| Our data-to-paper approach is designed to provide ways to use
| AI powerfully: not only to speed up science (by a lot!), but
| also to enhance transparency, traceability and verifiability.
| Data-to-paper sets a standard for traceability and
| verifiability which imo exceeds the current level of
| human-created manuscripts. In particular:
|
| 1. "Data-Chaining": by tracing information flow through the
| research steps, data-to-paper creates what we call "data-chained"
| manuscripts, where results, methodology and data are
| _programmatically_ linked. See this video
| (https://youtu.be/mHd7VOj7Q-g). You can also try click-tracing
| results in this example ms:
| https://raw.githubusercontent.com/rkishony/data-to-paper-sup...
|
| See more about this and more examples in our preprint:
| https://arxiv.org/abs/2404.17605
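|
| As a toy illustration of what "programmatically linked" means
| (hypothetical, and much simplified relative to the real
| implementation):
|
|     from dataclasses import dataclass
|
|     @dataclass(frozen=True)
|     class ChainedValue:
|         value: float
|         step: str    # pipeline step that produced the value
|         source: str  # pointer back into the analysis code
|
|     coef = ChainedValue(0.31, step="data_analysis",
|                         source="analysis.py:42")
|     # the writing step embeds coef.value in the manuscript
|     # while keeping step/source as a click-traceable link
|     print(f"beta = {coef.value} (from {coef.source})")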
|
| 2. Human in the loop. We are looking at different ways to create
| a co-piloted environment where human scientists can direct and
| oversee the process. We currently have a co-pilot app that allows
| users to follow the process, to set and change prompts, and to
| provide review comments at the end of each step
| (https://youtu.be/Nt_460MmM8k). It will be great to get
| feedback (and help!) on ways in which this could be enhanced.
|
| 3. P-value hacking. Data-to-paper is designed to raise a
| hypothesis (autonomously, or by user input) and then go through
| the research steps to test it. If the hypothesis test
| is negative, it is perfectly fine and suitable to write a
| negative-result manuscript. In fact, in one of the tests we
| have done, we gave it data from a peer-reviewed publication
| that reports a positive and a negative result, and
| data-to-paper created manuscripts that correctly report both.
|
| So data-to-paper on its own is not doing multiple-hypothesis
| searching. In fact, it can help you realize just how many
| hypotheses you have actually tested (something very hard to
| track in human research, even when done honestly). Can people
| ask data-to-paper to create 1000 papers, read them all, and
| choose only the single one in which a positive result is
| found? Yes - people can always cheat, and science is built on
| trust - but this is not going to be particularly easier than
| the many other ways available for people to cheat if they
| want to.
|
| 4. Final note: LLMs are here to stay and are already used
| extensively in doing science (sadly, sometimes undisclosed:
| https://retractionwatch.com/papers-and-peer-reviews-with-evi...).
| New models - ChatGPT5, ChatGPT6, ... - will likely write a
| whole manuscript for you in just a single prompt. So the
| question is not whether AI will go into science (it already
| has), but rather how to use AI in ways that foster, not
| jeopardize, accountability, transparency, verifiability and
| other important scientific values. This is what we are trying to
| do with data-to-paper. We hope our project stimulates further
| discussions on how to harness AI in science while preserving and
| enhancing key scientific values.
| uniqueuid wrote:
| Hi,
|
| thanks for the honest and thoughtful discussion you are
| conducting here. Comments tend to be simplistic and it's great
| to see that you raise the bar by addressing criticism and
| questions in earnest!
|
| That said, I think the fundamental problem of such tools is
| unsolvable: Out of all possible analytical designs, they create
| boring existing results at best, and wrong results (i.e.
| missing confounders, misunderstanding context ...) as the worst
| outcome. They also pollute science with harmful findings that
| lack meaning in the context of a field.
|
| These issues have been well known for about ten years and are
| explained excellently, e.g., in papers such as [1].
|
| There is really one way to guard against bad science today, and
| that is _true pre-registration_. And that is something which
| LLMs fundamentally cannot do.
|
| So while tools such as data-to-paper may be helpful, they can
| only be so in the context of pre-registered hypotheses where
| they follow a path pre-defined by humans _before collecting
| data_.
|
| [1]
| http://www.stat.columbia.edu/~gelman/research/unpublished/p_...
| alchemist1e9 wrote:
| > That said, I think the fundamental problem of such tools is
| unsolvable: Out of all possible analytical designs, they
| create boring existing results at best, and wrong results
| (i.e. missing confounders, misunderstanding context ...) as
| the worst outcome. They also pollute science with harmful
| findings that lack meaning in the context of a field.
|
| This doesn't seem correct to me at all. If new data is
| provided and the LLM is simply an advanced tool that applies
| known analysis techniques to the data, then why would they
| create "boring existing results"?
|
| I don't see why systems using an advanced methodology should
| not produce genuinely novel results when provided new data.
|
| There are a lot of reactionary or even Luddite responses to
| the direction we are headed with LLMs.
| uniqueuid wrote:
| Sorry but I think we have very different perspectives here.
|
| I assume you mean that LLMs can generate new insights in
| the sense of producing plausible results from new data or
| in the sense of producing plausible but previously unknown
| results from old data.
|
| Both these things are definitely possible, but they are not
| necessarily (and in fact often not) good science.
|
| Insights in science are not rare. There are trillions of
| plausible insights, and all can be backed by data. The real
| problem is the reverse: finding a meaningful and useful
| finding in a sea of billions of others.
|
| LLMs learn from past data, and that means they will have
| more support for "boring", i.e. conventional hypotheses,
| which have precedent in training material. So I assume that
| while they can come up with novel hypotheses and results,
| these results will probably tend to conform to a
| (statistically defined) paradigm of past findings.
|
| When they produce novel hypotheses or findings, it is
| unlikely that they will create genuinely meaningful AND
| true insights. Because if you randomly generate new ideas,
| almost all of them are wrong (see the papers I linked).
|
| So in essence, LLMs should have a hard time doing real
| science, because real science is the complex task of
| finding unlikely, true, and interesting things.
| alchemist1e9 wrote:
| Have you personally used LLMs within agent frameworks
| that apply CoT and OPA patterns or others from cognitive
| architecture theories?
|
| I'd be surprised if you have used LLMs beyond the classic
| chat based linear interface that is commonly used and
| still have the opinions you do.
|
| In my opinion, once you combine RAG and agent frameworks
| with raw observational input data they can absolutely do
| real reasoning, analysis, and create new insights that
| are meaningful and will be considered genuine new
| science. The project/group we are discussing has practically
| proven this with their replication examples.
| roykishony wrote:
| Thanks much for these thoughtful comments and ideas.
|
| I can't but fully agree: pre-registered hypotheses are the
| only way to fully guard against bad science. This, in essence,
| is what the FDA does for clinical trials too. And btw,
| lowering the traditional and outdated 0.05 cutoff is also
| critical imo.
|
| Now, say we are in a utopian world where all science is pre-
| registered. Why can't we imagine AI being part of the process
| that creates the hypotheses to be registered? And why can't
| we imagine it also being part of the process that analyzes
| the data once it's collected? And in fact, maybe it can even
| be part of the process that helps collect the data itself?
|
| To me, whether in such a utopian world or in the
| far-from-utopian current scientific world, there is ultimately
| no fundamental tradeoff between using AI in science and
| adhering to fundamental scientific values. Our purpose with
| data-to-paper is to demonstrate, and to provide tools, to
| harness AI to speed up scientific discovery while _enhancing_
| traceability and transparency, making our scientific output
| much more traceable, understandable and verifiable.
|
| As for the question of novelty: indeed, research on the
| existing public datasets we have used so far cannot be too
| novel. But scientists can also use data-to-paper with their
| own fascinating original data. It might help in some aspects
| of the analysis, and certainly help them keep track of what
| they are doing and how to report it transparently. Ultimately
| I hope that such co-piloted deployment will allow us to
| delegate the more straightforward tasks to the AI, letting us
| human scientists engage in higher-level thinking and
| conceptualization.
| uniqueuid wrote:
| True, we seem to have a pretty similar perspective after
| all.
|
| My concern is an ecological one within science, and your
| argument addresses the frontier of scientific methods.
|
| I am sure both are compatible. One interesting question is
| what instruments are suitable to reduce negative
| externalities from bad actors. Pre-registration works, but is
| limited to the few fields where the stakes are high. We will
| probably similarly see a staggered approach with more
| restrictive methods in some fields and less restrictive
| ones in others.
|
| That said, there remain many problems to think about: e.g.,
| what happens to meta-analyses if the majority of findings
| come from the same mechanism? Will humans be able to
| resist the pull of easy AI suggestions and instead think
| hard where they should? Are there sensible mechanisms for
| enforcing transparency? Will these trends bring us back to
| a world in which trust was only based on prestige of known
| names?
|
| Interesting times, certainly.
| alchemist1e9 wrote:
| This is a step forward! Forget the detractors and any negative
| comments: this is a small peek into a future which will
| include automated research, automated engineering, and all
| sorts of tangible ways to automate progress. Obviously the
| road will be bumpy, with many detractors and complaints.
|
| Here is a small idea for taking it one step further in the
| future. Perhaps there could be an additional stage where, once
| the initial data is analyzed and some candidate research ideas
| generated, a domain-knowledge stage is incorporated. Currently
| the Semantic Scholar API helps generate a set of reference
| papers; instead, those papers could be downloaded in full and
| put into a local RAG. Agents would then read each paper in
| detail with a summary of the current data in context,
| effectively doing research, store their summaries and ideas in
| the same RAG, and finally combine all that context-specific
| research into material for the further development of the
| paper.
|
| There is a link to awesome-agents and I'd be curious what their
| opinion is of various other agent frameworks, especially as I
| don't think they actually used any.
|
| For my proposed idea above I think txtai could provide a lot of
| the tools needed.
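|
| A rough sketch of that retrieval stage with txtai (the paper
| texts, the query and the model choice are all placeholders):
|
|     from txtai.embeddings import Embeddings
|
|     embeddings = Embeddings(
|         {"path": "sentence-transformers/all-MiniLM-L6-v2",
|          "content": True})
|
|     # index full reference-paper texts (placeholders here)
|     embeddings.index([
|         ("ref1", "full text of reference paper 1 ...", None),
|         ("ref2", "full text of reference paper 2 ...", None),
|     ])
|
|     # an agent would query with a summary of the current data
|     for hit in embeddings.search("risk factors in our data", 2):
|         print(hit["id"], hit["score"])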
| ttaallooss wrote:
| This is a super cool idea! We have considered implementing a
| variation of what you suggested, with the additional feature of
| linking each factual statement directly to the relevant lines
| in the literature. Imagine that in each scientific paper, you
| could click on any factual or semi-factual statement to be led
| to the exact source--not just the paper, but the specific
| relevant lines. From there, you could continue clicking to
| trace the origins of each fact or idea.
| alchemist1e9 wrote:
| > This is a super cool idea!
|
| Thank you. I'm honored you found it useful.
|
| > From there, you could continue clicking to trace the
| origins of each fact or idea.
|
| Exactly! I think you would like the automated semantic
| knowledge graph building example in txtai.
|
| Imagine how much could be done when price/token drops by
| another few orders of magnitude! I can envision a world with
| millions of research agents doing automated research on many
| thousands of data sets simultaneously and then pooling their
| research together for human scientists to study, interpret
| and review.
| roykishony wrote:
| thanks! Indeed, currently we only provide the LLM with a short
| tldr created by Semantic Scholar for each paper. Reading the
| whole thing and extracting and connecting it to specific
| findings and results will be amazing to do, especially as it
| can start creating a network of logical links between
| statements in the vast scientific literature. txtai indeed
| looks extremely helpful for this.
| alchemist1e9 wrote:
| Excellent! I'm glad my input was interesting.
|
| txtai has some demos of automated semantic graph building
| that might be relevant.
|
| I noticed you didn't really use any existing agent
| frameworks, which I find very understandable, as their added
| value over DIY approaches can be questionable. However, txtai
| might fit better with your overall technology style and
| philosophy.
|
| Has your team studied the latest CoT, OPA, or research into
| cognitive architectures?
| roykishony wrote:
| thanks. Will certainly look deeper into txtai. Our project is
| now open and you are more than welcome to give a hand if you
| can! Yes, you are right - it is built completely from scratch.
| It does have some similarities to other agent packages, but we
| have some unique aspects, especially in terms of tracing
| information flow between many steps and thereby creating the
| idea of "data-chained" manuscripts (where you can click each
| result and go back all the way to the specific code lines).
| Also, we have a special code-running environment that catches
| many different types of common improper uses of imported
| statistical packages.
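|
| As a toy illustration of the kind of check I mean
| (hypothetical, not our actual code), one can wrap an imported
| statistical function so a risky default gets flagged:
|
|     import numpy as np
|     import scipy.stats as stats
|
|     _orig_ttest = stats.ttest_ind
|
|     def checked_ttest(a, b, **kwargs):
|         # flag a common misuse: silently assuming equal
|         # variances by leaving equal_var at its default
|         if "equal_var" not in kwargs:
|             print("WARNING: equal_var left at default (True)")
|         return _orig_ttest(a, b, **kwargs)
|
|     stats.ttest_ind = checked_ttest
|     rng = np.random.default_rng(0)
|     print(stats.ttest_ind(rng.normal(size=50),
|                           rng.normal(size=50)))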
| alchemist1e9 wrote:
| "data-chained" will be very valuable, especially for the
| system to evaluate itself and verify the work it's
| performed.
|
| This is obviously just my initial impression on a distracted
| Sunday, but I'm very encouraged by your project and I will
| absolutely be following it and looking at your source code.
|
| The detractors don't understand LLMs and probably haven't
| used them in the way you and I have. They don't understand
| that with CoT and OPA they can be used to reason and think
| for themselves.
|
| I've used them for full automated script writing,
| performing the job of a software developer. I've also
| used them to create study guides and practice tests, and
| then grade those tests. When one implements firsthand
| automated systems with agent frameworks using the APIs, it
| gives a deeper understanding of their power over the basic
| chat usage most are familiar with.
|
| The people arguing that your system can't do real science
| are silly - as if the tedious process and logical thinking
| were something so complex and human that LLMs can't do them
| when used within a cognitive framework. Of course they can!
|
| Anyway, I'm very excited by your project. I hope this summer
| to spend at least a week dedicated to setting it up and
| exploring potential integrations with txtai, for use on
| private knowledge bases in addition to publicly available
| scholarly papers.
| roykishony wrote:
| and yes, we are implementing CoT and OPA - but surely there
| is a ton of room for improvement!
| missblit wrote:
| Hello,
|
| Your example paper omits non-English characters from the
| names of the cited authors. Look especially at citation [5],
| where many of the authors have European characters in their
| names which get dropped.
|
| There is also possibly a missing x or [?] in "1.81 10^5" on page
| 3.
| roykishony wrote:
| wow - thank you for the meticulous check - these are issues we
| should certainly fix!
| Eiim wrote:
| I'm working on my Master's in Statistics, so I feel I can comment
| on some of what's going on here (although there are others more
| experienced than me in the comments as well, and I generally
| agree with their assessments). I'm going to look only at the
| diabetes example paper for now, mostly because I have finals
| tomorrow. I find it to be the equivalent of a STA261 final
| project at our university, with some extra fluff and nicer
| formatting. It's certainly not close to something I could submit
| to a journal.
|
| The whole paper is "we took an existing dataset and ran the
| simplest reasonable model (a logistics regression) on it". That's
| about 5-10 minutes in R (or Python, or SAS, or whatever else).
| It's a very well-understood process, and it's a good starting
| point to understand the data, but it can't be the only thing in
| your paper, this isn't the 80's anymore.
|
| The overall style is verbose and flowery, typical of LLMs. Good
| research papers should be straightforward and to the point.
| There's also strange mixing of "we" and "I" throughout.
|
| We learn in the introduction that interaction effects were
| tested. That's fine, but I'd want it set up earlier why these
| interaction effects are posited to be interesting. It said
| earlier that "a comprehensive investigation considering a
| multitude of diabetes-influencing lifestyle factors concurrently
| in relation to obesity remains to be fully considered", but quite
| frankly, I don't believe that. Diabetes is remarkably well-
| studied, especially in observational studies like this one, due
| to its prevalence. I haven't searched the literature but I really
| doubt that no similar analysis has been done. This is one of the
| hardest parts of a research paper, finding existing research and
| where its gaps are, and I don't think an LLM will be sufficiently
| capable of that any time soon.
|
| There's a complete lack of EDA in the paper. I don't need much
| (the whole analysis of this paper could be part of the EDA for
| a proper paper), but I'd expect some basic distributional
| statistics of the variables. How many responses in the dataset
| were diabetic? Is
| there a sex bias? What about age distribution? Are any values
| missing? These are really important for observational studies
| because if there's any issues they should be addressed in some
| way. As it is, it's basically saying "trust us, our data is
| perfect" which is a huge ask. It's really weird that a bunch of
| this is in the appendix (which is way too long to be included in
| the paper, would need to be supplementary materials, but that's
| fine) (and also it's poorly formatted) but not mentioned anywhere
| in the paper itself. When looking at the appendix, the main
| concern that I have is that only 14% of the dataset is diabetic.
| This means that models will be biased towards predicting non-
| diabetic (if you just predict non-diabetic all of the time,
| you're already 86% accurate!). It's not as big of an issue for
| logistic regression, or for observational modeling like this, but
| I would have preferred an adjustment related to this.
|
| In the results, I'm disappointed by the over-reliance on
| p-values. This is something that the statistics field is
| trying to move away from, for a multitude of reasons, one of
| which is demonstrated quite nicely here: p-values are (almost)
| always minuscule with large n, and in this case n=253680 is
| very large.
| Standard errors and CIs have the same issue. The Z-value is the
| most useful measure of confidence here in my eyes. Effect sizes
| are typically the more interesting metric for such studies. On
| that note, I would have liked to see predictors normalized so
| that coefficients can be directly compared. BMI, for example, has
| a small coefficient, but that's likely just because it has a
| large range and variance.
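|
| Something along these lines, for instance (the file name and
| column names are my guesses at this dataset's layout):
|
|     import pandas as pd
|     import statsmodels.api as sm
|
|     df = pd.read_csv("diabetes.csv")   # illustrative path
|     y = df["Diabetes_binary"]          # assumed outcome column
|     X = df.drop(columns=["Diabetes_binary"])
|     X = (X - X.mean()) / X.std()       # z-score the predictors
|     fit = sm.Logit(y, sm.add_constant(X)).fit()
|     # coefficient magnitudes are now directly comparable
|     print(fit.params.drop("const").abs()
|              .sort_values(ascending=False))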
|
| It's claimed that the AIC shows improved fit for the second
| model, but the change is only ~0.5%, which isn't especially
| convincing. In fact, it could be much less, because we don't
| have enough significant figures to see how the rounding went.
| The p-value is basically meaningless here, as previously
| stated.
|
| The methods section says almost nothing that isn't already
| stated at least once. I'd like to know something about the
| tools which were used, which is completely lacking here. I do
| want to highlight this quote: "Both models employed a method
| to adjust for all possible confounders in the analysis."
| What??? All possible confounders? If you know what that means,
| you know that that's BS. "A method"? What is your magic tool
| to remove all variance not reflected in the dataset? I need to
| know! I certainly don't see it reflected in the code.
|
| The code itself seems fine, maybe a little over-complicated,
| but that might be necessary for how it interfaces with the
| LLM. The actual analysis is equivalent to 3 basic lines of R
| (read the CSV, fit basic logistic regression model 1 with
| default parameters, fit basic logistic regression model 2 with
| default parameters).
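|
| In Python/statsmodels the equivalent is about as short;
| something like this (column names and formula terms are my
| guesses at the dataset):
|
|     import pandas as pd
|     import statsmodels.formula.api as smf
|
|     df = pd.read_csv("diabetes.csv")   # read the CSV
|     m1 = smf.logit("Diabetes_binary ~ BMI + Smoker"
|                    " + PhysActivity", data=df).fit()  # model 1
|     m2 = smf.logit("Diabetes_binary ~ BMI * Smoker"
|                    " + PhysActivity", data=df).fit()  # model 2
|     print(m1.aic, m2.aic)   # compare fit of the two models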
|
| This paper would probably get about a B+ in 261, but shouldn't
| pass a 400-level class. The analysis is very simple and
| unimpressive for a few reasons. For one, the questions asked of
| the dataset are very light. More interesting, for example, might
| have been to do variable selection on all interaction terms and
| find which are important. More models should have been compared.
| The dataset is also extremely simple and doesn't demand complex
| analysis. An experimental design, or messy data with errors and
| missing values, or something requiring multiple datasets, would
| be a more serious challenge. It's quite possible that one of the
| other papers addresses this though.
| roykishony wrote:
| Thanks so much for these thorough comments.
|
| You suggested some directions for more complex analysis that
| could be done on this data - I would be so curious to see what
| you get if you could take the time to try running
| data-to-paper as a co-pilot on your own. You can then give it
| directions and feedback on where to go - it will be
| fascinating to see where you take it!
|
| We also must look ahead: complexity and novelty will rapidly
| increase as ChatGPT5, ChatGPT6 etc. are rolled out. The key with
| data-to-paper is to build a platform that harnesses these tools
| in a structured way that creates transparent and well-traceable
| papers. Your ability to read and understand and follow all the
| analysis in these manuscripts so quickly speaks to your talent
| of course, but also to the way these papers are structured.
| Talking from experience, it is much harder to review human-
| created papers at such speed and accuracy...
|
| As for your comments on "it's certainly not close to something
| I could submit to a journal" - please kindly look at the
| examples where we show reproducing peer reviewed publications
| (published in a completely reasonable Q1 journal, PLOS One).
| See this original paper by Saint-Fleur et al:
| https://journals.plos.org/plosone/article?id=10.1371/journal...
|
| and here are 10 different independent data-to-paper runs in
| which we gave it the raw data and the research goal of the
| original publication and asked it to do the analysis, reach
| conclusions, and write the paper:
| https://github.com/rkishony/data-to-paper-supplementary/tree...
| (look up the 10 manuscripts designated "manuscriptC1.pdf" -
| "manuscriptC10.pdf")
|
| See our own analysis of these manuscripts and their
| reliability in our arXiv preprint:
| https://arxiv.org/abs/2404.17605
|
| Note that the original paper was published after the training
| horizon of the LLM that we used, and also that we
| programmatically removed the original paper from the results
| of the literature search that data-to-paper performs, so that
| it cannot see it.
|
| Thanks so much again and good luck for the exam tomorrow!
| QuadmasterXLII wrote:
| It's paper-reviewing season for me, and I think I got one of
| these submitted. It took a while of reading to realize that it
| wasn't just a stupid human writing it; there was literally no
| substance there to find. I can't share details because the
| confidentiality
| statement I signed as part of reviewing was pretty strict.
| However, going forward we are going to have to start
| deanonymizing and blacklisting the 'authors,' otherwise the ratio
| of time spent 'writing' vs reviewer time wasted will be
| crippling.
| twobitshifter wrote:
| In one group that I am part of, we had a reviewer use AI on
| submissions; this scared the larger org, and we now have a
| policy of no-AI reviews. However, I think AI is closer to
| competently reviewing some elements of papers than it is to
| editing them itself. For example, it's the best spelling /
| grammar tool I've ever seen. Since many submissions are by
| non-native English speakers, a limited AI review comment would
| make sense to me.
|
| Overall, because of the happy-to-serve alignment of commercial
| AI, it's more likely to praise us than to be critical, which
| means that off-the-shelf (OTS) models may not fit into the
| review of methods and conclusions.
___________________________________________________________________
(page generated 2024-05-12 23:01 UTC)