[HN Gopher] The business of extracting knowledge from academic p...
___________________________________________________________________
The business of extracting knowledge from academic publications
Author : kevin_hu
Score : 253 points
Date : 2021-12-08 03:49 UTC (1 day ago)
(HTM) web link (markusstrasser.org)
(TXT) w3m dump (markusstrasser.org)
| anyfactor wrote:
| > Why purchase access to a 3rd party AI reading engine or a
| knowledge graph when you can just hire hundreds of postdocs in
| Hyderabad to parse papers into JSON? (at a $6,000 yearly salary)
|
| I really like jobs that people think AI can do in theory but
| can't really do effectively IRL. Where do I get a part-time gig
| like that, if I think I'm capable of reviewing and summarizing
| non-STEM papers? Except for homework and assignments, of
| course.
| Mezzie wrote:
| Yeah, you can't outsource that to Hyderabad. You'd need subject
| knowledge plus very specific English, and possibly other
| languages depending on the field (not saying Indians can't do
| this, but I've studied enough languages to know that doing
| high-level/academic work in a non-native language is hell even
| when the language is pitched to students).
|
| And you'd have to know enough about the process and authors to
| know what makes papers relevant. The metadata matters as much
| as the data.
| anyfactor wrote:
| All good points. But you do have to recognize the tradeoff.
| Has AI come so far that it can perform better than industry-
| specific human intelligence? Consider that some Indian
| researchers could review the papers, doing that job as a
| part-time gig.
|
| You have to test out both solutions. And as these jobs are
| treated as contracts, there is no significant commitment in
| choosing one over the other. We can't be certain that one
| method is better than the other without trying both of them
| out without prejudice.
|
| I, for one, am agnostic about either choice: AI is overhyped
| yet has spillover benefits as a marketing and sales point,
| while offshore human intelligence has a bad reputation but
| could be effective given proper documentation, supervision,
| and a review framework.
| Mezzie wrote:
| Oh yeah, I was just thinking of the present. In five to ten
| years, once AI/ML/etc. trickle out of tech/theory spaces and
| start to be combined with subject expertise, I think we'll
| see really interesting things.
|
| The other matter is that an Indian who could review papers
| that well would also cost more than 6k/year and would not
| be easily replaceable, which eliminates the main benefit of
| outsourcing for a company trying to operate in such a way
| in 2021.
|
| In 2030? I'd say the odds are if somebody in Hyderabad can
| do that then they can start their OWN company rather than
| bother with us at all. Honestly, given India's role in
| pharmaceutical manufacture, I'd be shocked if things like
| that don't start popping up.
| PaulHoule wrote:
| 1. The real value is in operational documents such as clinical
| notes, maintenance records, soldier and police notebooks, etc.
| This info is proprietary to an organization and its partners and
| is directly linked to how it produces and pays for value.
|
| 2. Superhuman accuracy at limited tasks is not good enough. For
| instance, transcribing audio at 95% word-level accuracy would be
| good for a human, but it means every other sentence is garbled.
| People communicate despite this because they ask questions. A
| useful text-to-structured information tool has to exert back
| pressure on bullshit, give feedback about what it understands and
| push the author to tell a story that makes sense and has adequate
| detail.
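|
| To make that arithmetic concrete, here is a minimal sketch
| (assuming, for illustration, that word errors are independent):
|
|     # Chance a sentence survives intact at 95% per-word accuracy,
|     # assuming independent word errors.
|     word_accuracy = 0.95
|     for sentence_length in (10, 15, 20):
|         p_clean = word_accuracy ** sentence_length
|         print(f"{sentence_length} words: {p_clean:.0%} error-free")
|     # ~60%, ~46%, ~36%: roughly every other sentence is garbled.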
| PaulHoule wrote:
| When you count all the ways to be wrong, the median scientific
| paper is wrong.
|
| In biomedical fields they dismiss more than half of papers out of
| hand when they do a Cochrane meta-analysis. It raises the question
| of why such papers (which aren't fit to extract knowledge from)
| are published or funded at all.
|
| I got a PhD in theoretical physics and routinely found that
| something was wrong on page 23 of a 50 page calculation and
| nobody published anything about it in 30 years. Possibly the
| whole body of work on string theory since 1980 [most of hep-th]
| is a pipe dream at best. Because young physicists have to spend
| the first third of their career in a Squid Game fight for
| survival, not to fathom the secrets of the universe but to please
| their elders, we get situations like that absurd idea of Stephen
| Hawking that information gets lost in a black hole. (E.g., if you
| believe that, you aren't even going to try quantum gravity.)
| throwaway984393 wrote:
| Science is not about innovation. Science is about tiny little
| results that by themselves have no immediate benefit, but slowly
| improve our overall understanding, and eventually lead to an
| unexpected benefit. Science is not developed in order to solve a
| business problem - it is purely an advancement in overall
| knowledge of the world (the traditional aim of natural
| philosophy). In this sense, science is not compatible with
| business interests.
| Nalta wrote:
| For anyone interested, my whole PhD was in biomedical hypothesis
| generation! I think the most "serious" attempts at building these
| systems have been focused around providing assistance to
| scientists, and not just coming up with new ideas on their own.
|
| Here's an actual medical paper that my first system, Moliere, was
| able to help discover:
|
| https://link.springer.com/article/10.1007/s11481-019-09885-8
| julienchastang wrote:
| Is your PhD thesis online anywhere?
| Nalta wrote:
| https://sybrandt.com/documents/dissertation.pdf
| Kydlaw wrote:
| Interesting insights, particularly on the business aspect. But I
| am not surprised by the outcome; as the author said, nobody wants
| to pay for what is proposed in academia. Everybody is already
| more or less struggling with funding, so nobody wants to add
| extra fat to their funding requests.
|
| Coming from CS, something I would really like to see though is a
| tool that would summarize a scientific area/domain. Something
| that would kill literature reviews and/or would provide an
| overview of the hot topics/open questions in different areas.
|
| Edit: corrections
| PaulHoule wrote:
| This paper touches on one aspect of it, which is that the source
| material is bad, but it doesn't even start on the fact that the
| tools aren't good enough and that many of the fashionable ideas
| (word embeddings) are dead ends.
| tomlue wrote:
| Knowledge extraction is weird. Just because I extracted some
| knowledge doesn't mean that I now 'have' that knowledge.
|
| The better use case for this is teaching, not creating knowledge
| bases that nobody will use.
| holub008 wrote:
| > Close to nothing of what makes science actually work is
| published as text on the web
|
| Unless there's some nuance I missed, I immensely disagree with
| this statement.
|
| I'm currently in the biomedical literature review space, and I
| appreciate the detailed insights. I wonder if the author
| considered that literature review is used in a wide variety of
| domains outside pharma/drug discovery (where I perceived their
| efforts were focused). Regulatory monitoring/reporting, hospital
| guideline generation, etc.
|
| This is a billion-dollar industry, and I couldn't agree more that
| it's technologically underdeveloped. I do not agree that AI-based
| extraction is the solution, at least in the near term. The formal
| methodologies used by reviewers/meta-analysts (search strategy
| generation, lit search, screening, extraction, critical
| appraisal, synthesis/statistical analysis) are IMO more nuanced
| than an AI can capture. They require human input or review. My
| business is betting on this premise :)
| woliveirajr wrote:
| > This post is about the issues with semantic intelligence
| platforms that predominantly leverage the published academic
| literature.
|
| I was happy to see a post that clearly states its purpose.
|
| edit: misspelling
| [deleted]
| cousin_it wrote:
| I can confirm that in my current area of interest (how to
| synthesize a cello or saxophone sound), there are hundreds of
| academic papers published over decades, each saying "our
| method sounds more realistic than others", but code and audio
| samples are never available, and verbal descriptions always skip
| crucial details. I have no doubt that academics have a ton of
| expertise, but their output in paper form is basically unusable;
| I'm not sure it achieves any purpose besides resume padding.
| Reading a forum of synth hobbyists is a hundred times more
| useful.
| dekhn wrote:
| Hey, that unusability of papers is a form of job security.
|
| Seriously though, you're totally right. I got very dissatisfied
| with science when I realized that many people were effectively
| publishing unreproducible crap created by terrible code.
| Fortunately, more and more people are learning how to recognize
| the crap.
| tchalla wrote:
| > I have no doubt that academics have a ton of expertise, but
| their output in paper form is basically unusable, I'm not sure
| it achieves any purpose besides resume padding.
|
| If you think the entire field of academia doesn't achieve any
| purpose, you may want to reconsider your position. Most likely,
| almost everything that you do today on a computer began as an
| academic paper. Yes, it was without code and data. Yet it was
| not unusable, and it achieved more than enough purpose.
|
| The average comment on HN about academia comes from a mindset
| where everyone wants a product. The purpose of a paper is NOT
| to release software or a product, but to test an idea under
| some assumptions. That's what all research does at its core -
| formulate a hypothesis, design an experiment to test the
| hypothesis, and report the results and implications. Are all
| research papers perfect? No. Are all of them usable? No.
|
| Your use case - sound synthesis for a specific instrument - may
| not be a scientific challenge. It is, however, an engineering
| challenge, and hence you found a better answer amongst
| hobbyists and tinkerers. Now, try looking for a vaccine for
| Covid - and guess where you'd find that answer: in decades of
| research on mRNA, with repeated failures, papers that couldn't
| be replicated, unavailable "code" and samples, and verbal
| descriptions skipping crucial details.
| tvhahn wrote:
| Balaji Srinivasan had a good take on this recently in his
| conversation with Tim Ferriss. I quote:
|
| "The thing is, I don't care if something has a thousand
| retweets, what I care about is if it has two or three
| independent confirmations from economically dis-aligned
| actors. This is the same as academia, by the way, everybody's
| optimizing citations. What you actually want to optimize is
| independent replication. That's what true science is. It's
| not peer review. It is physical tests."
| czzr wrote:
| Yes and no. Literal replications are less valuable than
| people think - what you really want are independent tests
| of different parts of the causal network of the underlying
| model.
| jakub_g wrote:
| I'm an outsider but it seems to me the difference between
| academia and opensource/hobby forums is massive:
|
| In opensource the attitude is "See bug? Send a PR!"
|
| Whereas academic papers are like publishing software onto a
| blockchain (and not source but binaries, i.e. PDFs full of
| shortcuts): you don't want people to easily find bugs and
| contribute fixes, so you handwave a lot so that no one can
| reproduce your exact thing.
| remram wrote:
| The biggest difference IMHO is when comparing to something
| like Wikipedia or Stackoverflow. I wish the fabric of
| scholarly communication similarly allowed for browsing
| reviews, updating papers, commenting with new references,
| etc.
| [deleted]
| Mezzie wrote:
| I think this is a valuable idea. There are online archives
| that allow for paper updating for academics, like SSRN, but
| as a CONSUMER of academic literature, the land is pretty
| barren.
|
| The difficulty in such a thing would be that the journals and
| database companies are holding on to their exclusivity and
| profit motives with an iron fist. So unless you want to get
| sued into oblivion, you'd have to stick with open-source or
| accessible articles, which means you'd need to specialize in
| disciplines that have moved away from closed journals enough
| that the tool wouldn't have massive holes in it.
|
| There's also the problem of determining which new references and
| reviews are relevant (if anybody can comment with new references,
| who goes through to check that they're actually relevant, or that
| they actually say what the commenter claims?), and of preventing
| academics/administrators from gaming the system if it DOES
| get popular, etc. In open source, this is crowd-sourced,
| but for some academic fields the number of people who are
| qualified to speak on a matter is extremely small.
|
| /academic librarian thoughts
| kwertyoowiyop wrote:
| Now THAT might be a realistic technical goal & business
| opportunity.
| Mezzie wrote:
| The legal costs make this a non-starter unless it's done
| by a giant company. Who would, in my opinion, ruin it,
| and the odds of enough academics complying with a big
| tech company are small imo.
|
| It'd be viable for fields that don't use/rely on for-
| profit or closed journals, but I don't know if the money
| to run it would be there, especially since the odds of
| the big Schol Comm players suing are still there; it'd
| be worth it to them to ruin the tool/effort before it can
| challenge them.
|
| Building this would be my dream job, but hahaha no.
| kovvy wrote:
| Generally, anyone writing a paper about something that could
| benefit from bugfixes would love to accept them, but doesn't
| have the time or resources to actually do so - unless there's
| another paper in it. If they have somehow managed to find
| enough personal time to have a hobby project, then they
| probably do accept bugfixes - and you should get them in
| before that person burns out.
| Fomite wrote:
| It also doesn't happen enough to design for - I once
| presented a fairly open-source contributor friendly project
| at SciPy that I hoped would be compelling (it was about
| modeling the zombie epidemic), actively asked for help, had
| set up a couple open requests of varying levels of
| complexity.
|
| I think there was one pull request total?
|
| The juice just didn't end up being worth the squeeze.
| zozbot234 wrote:
| > In opensource the attitude is "See bug? Send a PR!"
|
| More like "What works: You tell me!" and "Kindly fix this bug
| plz sar."
| kkylin wrote:
| I'm an academic (applied math) and want to respond to this:
| academic papers are the way they are for lots of reasons, many
| of which (not so good) have been mentioned on HN. There are a
| couple that I do not see very often however:
|
| (1) Many academics aren't aware non-academics read their papers
| at all: we work with other academics, go to conferences with
| other academics, and on the rare occasions we hear from
| readers, it's from other academics. Big exception: in some
| fields academia and industry have much more interaction,
| biomedical research (the subject of the linked article) being
| one of them. Extracting knowledge from that literature has a
| large number of practical and economic implications.
|
| (2) There seems to be a perception that published papers are a
| repository of established or state-of-the-art knowledge.
| Perhaps they were meant to be that way, and perhaps more of
| them should be. But for many journals in many fields,
| publications are a form of moderated discussion. Reconstructing
| the state of knowledge from snippets of conversation is always
| going to be hard.
|
| What can help make the literature more accessible? Some of the
| forces are structural, some are due to current limitations of
| technology. But one thing that can help: if you find the results
| of a paper interesting and are able to track down the authors,
| write to them. People like hearing their work is noticed, and they
| like talking to people about things they're interested in.
|
| Another is to make constructive suggestions (or even pitch in
| to improve code where it's open source & available). Between
| teaching, advising, committee work, etc (not to mention
| family), most of us have to prioritize, and as much as I'd like
| to clean up old code for release in the hopes someone finds it
| useful, it isn't going to get my grad students out the door
| with a degree or a job -- I'm generally spending more time on
| their research problems these days than my own. But if I know
| there's interest / use I might prioritize time a little
| differently.
| Tomte wrote:
| > What can help make the literature more accessible?
|
| Review articles, sometimes called surveys.
|
| I've always thought that new PhDs would be excellent authors
| for those, having digested lots of literature for their
| dissertations.
| lnwlebjel wrote:
| Also, I believe there is a hierarchy that goes something
| like: academic papers -> review articles -> specialized
| books -> text books.
|
| The text changes to fit the audience, and the knowledge
| becomes more accepted (and/or fundamental) further down the
| line.
| wheelinsupial wrote:
| > Review articles, sometimes called surveys.
|
| Is this field specific? I have read survey articles in math
| and biology, and was told by some of my profs that they use
| these articles as an introduction to a new field.
|
| A quick Google search seems to show these exist in CS
| (along with tutorial papers), physics, and chemistry but
| I'm having a little difficulty finding statistics survey
| papers (survey methods come up instead).
|
| Is the problem that there aren't enough of them or they are
| behind paywalls?
| tenkabuto wrote:
| For Stats, check out
| https://www.annualreviews.org/journal/statistics
|
| Please suggest others if you find them.
|
| Annual Reviews has a bunch of journals for surveys of
| various fields. Most of them are paywalled, but there's
| ways around that.
| ska wrote:
| > I've always thought that new PhDs
|
| A well written PhD or MSc thesis is often the best way into
| a new field, ime. If the committee is good on this aspect
| they'll insist you've put enough detail in for someone to
| follow along mostly self contained.
| markusstrasser wrote:
| You hit the nail on the head. Will put some of that in the
| appendix of the post!
| captainmuon wrote:
| I may be a bit cynical, but at least in my former field
| (experimental physics), the main purpose of papers seems to be
| to "lock in" a finished achivement. You do the actual research,
| pass internal reviews and peer review, and then publishing the
| paper is just to make it "official". Unfortunately, many papers
| are never expected to be read. The crucial information exists,
| but you usually get it from personal communication, internal
| wikis, or review articles. You just need the paper to copy a
| formula or graph, and to cite it in the end.
|
| There _are_ papers that are well-written and useful, but there
| are at least as many that are just drivel (I probably
| contributed to both kinds).
|
| Unfortunately, the prevailing attitude is that outside people
| will not understand our stuff anyway, so we often make no
| effort to make papers understandable, or to publish data.
| (There is a lot of great outreach and science communication,
| but not so much for students or researchers from other fields
| who want to follow the technical details.)
| dsizzle wrote:
| Counterpoint: citations are a valuable currency in science.
| Arguably one of the best ways to earn citations is to do good
| work and write clear papers.
|
| Not saying incentives are perfectly aligned -- many citations
| are superficial ("this topic was studied before"), and papers
| count for a lot even if they're never cited, etc
| temporaryi3 wrote:
| I did my PhD in experimental physics, and I have to say that my
| realisation of this larger point - that papers are little more
| than resume padding to lock in an achievement - was a
| significant contributor towards destroying, and I use
| destroying seriously here, any faith or trust that peer
| review or publishing has anything at all to do with the
| scientific method.
|
| Your results replicate, or they don't. Your calculations,
| equations, and models predict experiment. Or they don't.
|
| Writing papers about it and getting the feedback of "peers"
| is nothing more than an old fashioned circle jerk for padding
| resumes, CVs, and persuading other people in that academic
| hierarchy that you deserve funding. It is a game that is
| divorced from actually learning, researching, understanding,
| measuring, and predicting the world.
| orbifold wrote:
| In academia there is always a difference between the way
| results are advertised and the conclusions that are drawn
| internally. This is more true in some fields than others;
| I'm most familiar with it in ML and physics. Part of your skill
| as a researcher is to understand, based on the omissions, the
| datasets, etc., the quiet part that isn't said out loud.
| Depending on how you sell things, you can get a Nature /
| Science paper with confusing, inconsistent terminology and a
| hand-rolled C++ implementation, provided you are the first,
| while another method which might be 1000x faster will only
| make it into PRL (yes, I'm thinking of two specific papers,
| but won't say which).
| cyanydeez wrote:
| There's probably space for a startup that properly archives
| the technical nature of findings.
| paufernandez wrote:
| +1
| [deleted]
| ska wrote:
| > but their output in paper form is basically unusable
|
| Others have commented as well but I will reinforce: their
| output is basically unusable for you for the purpose you want
| to put it to.
|
| Which is fair, but you should also recognize that you are not
| the audience of the papers and for good or for ill the system
| is not set up to help you with this.
| javajosh wrote:
| Don't know much about this industry but yes, it feels like one of
| those industries that sprang up because one person with money
| said, "Hmm, sounds like a good idea," and then other people with
| money and FOMO joined in. When this happens past a certain level
| you get a miniature innovation bubble (MIB)!
|
| (MIBs are rather harmless, at least in the long run, and
| can actually yield some benefit: innovative people are drawn to
| these types of industries and inevitably create cool things as a
| by-product of their work.)
| pezzana wrote:
| > My biggest mistake was that I didn't have experience as a
| biotech researcher or postdoc working in a lab.
|
| That is a big problem - good to recognize it as such.
|
| I can tell because the article, though lengthy, never seems to
| state an explicit problem to be solved. Rather, various ways to
| apply technology to a field are discussed.
|
| This is a recipe for failure. You need 3 things:
|
| 1. a problem to be solved
|
| 2. a customer who has that problem
|
| 3. money in the customer's pocket waiting to be transferred to
| yours
|
| The article never even gets to (1).
|
| Regarding (2), if academic groups are the target customer, you're
| going to have a bad time. They have little money and they tend to
| be all too happy to build something that sort-of replicates the
| commercial product you've created for them.
|
| This leaves scientific for-profit companies. They have lots of
| problems (and these days money), but these problems tend to be
| quite difficult to discover and solve because of the extensive
| domain and industry knowledge required.
| tvhahn wrote:
| Yes. I wonder what would have happened if the author had first
| worked on the patent side (I'd be interested to hear more about
| this idea). Perhaps working on patents first would be a path to
| gaining experience (and product-market fit). From there, one
| could branch out into
| other domains (e.g. bio).
| toss1 wrote:
| And the bottom line is:
|
| >>... nothing of it will go anywhere.
|
| >>Don't take that as a challenge. Take it as a red flag and run.
| Run towards better problems.
|
| Wow, speaking of the value of negative results, that is hugely
| valuable! Could easily save person-decades of work & funds for
| more productive results.
|
| The key insights are that the most relevant knowledge is not
| written into the publications (for a variety of reasons); that the
| little which is written is of limited use to the target audience;
| and that even when it is useful, it covers only a small part of
| the workload (i.e., not a real pain point). Together these show
| that the entire category of projects to extract & encode such
| knowledge is doomed.
| bryanph_ wrote:
| One thing that strikes me about most academic knowledge tools is
| that they seem to focus on parsing the current set of academic
| literature and producing supposedly interesting insights out of
| them (which quickly tends to snowball into wanting some kind of
| generalized model for knowledge as a whole). What I think is much
| more interesting is creating tools that help people create better
| academic writing in the first place (thinking tools if you will).
| This is, however, much more a UX problem than a pure
| engineering problem. That is why I think we see many more
| tools in the knowledge extraction space: most academics
| thinking about these kinds of things probably have an engineering
| background. That, combined with the fact that we seemingly all
| want to throw machine learning at any problem we encounter.
| eurasiantiger wrote:
| It likely wouldn't take much to craft an "arXiv Copilot" out of
| GitHub Copilot.
| TOMDM wrote:
| "It's well understood how to"
|
| And
|
| "It likely wouldn't take much to"
|
| Are worlds apart in this case, training and deploying models
| on that scale is a huge investment, even if you already had
| all the code and cleaned training data.
| Mezzie wrote:
| Can confirm: This is my main tech interest at the moment
| and if I consider how long it's going to take, I want to
| die.
| a_bonobo wrote:
| As an ECR with English as a second language, the paid version
| of Grammarly has clarified my writing quite a bit. I think
| there's more unexplored value in this space.
| grlass wrote:
| I recall seeing a Show HN post a while back about a research
| focussed web browser that helps as a thinking tool:
|
| https://news.ycombinator.com/item?id=28446147
| totetsu wrote:
| It's amazing what you miss on HN when you skip a day
| Hard_Space wrote:
| A sign-in/sign-up necessary just to see the browser in
| action? Hard pass.
| beauzero wrote:
| https://www.loom.com/share/93c7c0012f514c37b58a42fa65badc88
| civilized wrote:
| To your point but even more general: the ML/AI space is far too
| focused on replacing people rather than helping people. There
| is a suffocating cultural conceit that we are on the verge of
| general AI and oh my gosh what will the humans do, we better
| institute universal basic income right away, etc.
|
| What a joke.
|
| Try to help humans think better first. If you succeed at that,
| you _might_ be on the right track towards developing cold
| fusion, er, general AI.
| urthor wrote:
| Unfortunately you'll rapidly run into the fact that in the
| ML/AI space you get almost zero points for building
| something.
|
| You get a whole lot of points for discovering something,
| designing something, or a proof.
|
| But there's a very large number of people focused entirely on
| aims that are very, very distant from actually making human
| lives genuinely better.
|
| Mostly because everyone quietly understands all the
| extraordinarily complicated mathematics is actually
| extraordinarily complicated.
|
| Hence the ROI isn't worthwhile.
| geoduck14 wrote:
| >But there's a very large amount of people focused entirely
| on aims that are very, very distant from actually making
| human lives genuinely better.
|
| I can't speak to _each and every person_ working on ML, but
| I thought I would share a fun use case I ran across the
| other day.
|
| There is a business in some foreign country that is similar
| to Uber Eats: customer goes to an app, browses for food
| from various restaurants, orders, it gets delivered.
|
| The business was using ML to help the restaurants: the
| restaurants upload a pic of the dishes, a title, and a
| description (usually all from an existing menu). The
| business would parse the description to guess at what was
| in the dish, scan the picture to guess at the kind and
| quantity of food (entree, side, dessert, etc.), and compare
| ingredients against publicly available nutrition info. Now the
| end consumer can do things like search for gluten-free,
| vegetarian, pork-free, <300 calories, dessert, etc.
|
| Almost all of this was "possible" before, but it would have
| required enormous effort from the restaurants inputting the
| data, or from customers reading each item. Now it is "easy", and
| it actually helps the end customers - and the restaurants.
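|
| A minimal sketch of the description-parsing step (the keyword
| lists are hypothetical stand-ins; a real system would use a
| trained model plus a nutrition database):
|
|     # Tag a dish description with hedged dietary labels by matching
|     # it against simple keyword lists (illustrative only).
|     GLUTEN = {"wheat", "flour", "barley", "rye", "soy sauce"}
|     MEAT = {"pork", "beef", "chicken", "bacon", "ham", "fish"}
|
|     def tag_dish(description):
|         text = description.lower()
|         tags = set()
|         if not any(term in text for term in GLUTEN):
|             tags.add("maybe gluten-free")
|         if not any(term in text for term in MEAT):
|             tags.add("maybe vegetarian")
|         if not any(term in text for term in ("pork", "bacon", "ham")):
|             tags.add("maybe pork-free")
|         return tags
|
|     print(tag_dish("Stir-fried rice noodles with egg and peanuts"))
|     # All three "maybe" tags (set order varies).
|
| The "maybe" hedging matters: guessing allergen or calorie content
| from text alone is only a heuristic, as the replies below note.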
| oldsecondhand wrote:
| Guessing allergen content sounds like a disaster waiting
| to happen.
| auggierose wrote:
| Sounds like this would be an epic fail. I mean, just add
| a spoon of oil more, and your calorie guess is totally
| off. This is a clear cut case of SEEMINGLY helping. It
| certainly does not help the end customer. It might help
| the restaurant, as they don't care if the customer is
| receiving valid information, as long as they are buying.
| Mezzie wrote:
| I agree.
|
| I'm an academic librarian, and they're completely different
| ways of working: When I do academic work, I (ideally) have to
| take my time and I'm not supposed to present my work until it's
| developed enough that I'm confident it presents a substantial
| improvement; I have to prove that it's worth a colleague's time
| to engage with by meeting certain requirements.
| Coding/developing, on the other hand, requires a lot more back
| and forth, a lot more "I don't know", and is more immediate in
| a way I find very satisfying.
|
| I would LOVE to see more back and forth between engineers and
| academics in terms of ways of working; I think there's a lot of
| benefit to be gained there: Tech tends to not consider the
| future as much as they should, but the academics could really
| benefit from doing what you mentioned and improve the system
| they work in rather than accepting it.
|
| One of the things I'm trying to do is get better at/learn some
| ML so I can play around with turning the things I learned in
| grad school into useful tools, but I'm a single journeyman dev
| doing this in my spare time, so the odds of anything actually
| useful coming out of it are small.
| Quanttek wrote:
| Exactly! Assist me in my process of researching and writing,
| and use what I have already done, e.g. filing and
| classifying papers in EndNote. It's interesting that the author
| seemed to have a similar idea for a brief second but then
| tossed it away:
|
| > _similar: an app that pops up serendipitous connections
| between a corpus (previous writings, saved articles, bookmarks
| ...) and the active writing session or paragraph. The corpus,
| preferably your own, could be from folders, text files, blog
| archive, a Roam Research graph or a Notion /Evernote database._
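|
| A minimal sketch of that idea, ranking a personal corpus against
| the active paragraph (TF-IDF stands in for fancier embeddings;
| the corpus strings are hypothetical):
|
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.metrics.pairwise import cosine_similarity
|
|     corpus = ["note on mRNA delivery vehicles",
|               "bookmark: knowledge graphs in pharma",
|               "old draft about literature review tooling"]
|     active = "extracting structured knowledge from biomedical papers"
|
|     # Vectorize corpus and active paragraph together, then rank the
|     # corpus by cosine similarity to the active paragraph.
|     vectors = TfidfVectorizer().fit_transform(corpus + [active])
|     scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
|     for score, note in sorted(zip(scores, corpus), reverse=True):
|         print(f"{score:.2f}  {note}")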
| acomjean wrote:
| A lot of papers are annotated by hand or with computer assistance.
|
| For medical papers Mesh terms:
| https://www.nlm.nih.gov/mesh/meshhome.html
|
| Gene information is extracted by flybase/ worm base ...
|
| It's time-consuming, expensive, and probably not perfect, but for
| certain types of papers it makes searching better.
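|
| For example, those MeSH annotations can be queried directly
| through PubMed's E-utilities (a real endpoint; the search term
| here is just an illustration):
|
|     import json
|     import urllib.parse
|     import urllib.request
|
|     # Find papers indexed under a given MeSH heading.
|     term = '"Neurodegenerative Diseases"[MeSH Terms]'
|     url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
|            + urllib.parse.urlencode(
|                {"db": "pubmed", "term": term, "retmode": "json"}))
|     with urllib.request.urlopen(url) as resp:
|         result = json.load(resp)["esearchresult"]
|     print(result["count"], "papers; first PMIDs:", result["idlist"][:5])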
| Throwaway197401 wrote:
| The problem with academic publications is that they are
| compliance-based.
|
| The peer-review process is not a scientific process but a
| publishing process and it serves as an unfortunate gate-keeper.
|
| This gate-keeping has done so much damage to the scientific field
| that it's hard to see any way out of it in its current form.
|
| The biggest problem is that peer review gives the paper a stamp,
| as if it's been approved by some higher scientific standard. And
| that leads us to the very unhealthy idea of "follow the science"
| or "the science is settled".
|
| The scientific method is a process of conjecture and criticism
| and is never-ending. Peer review gives a "blue checkmark" to
| papers they don't deserve, and it is especially problematic in the
| social sciences, where up to 70% of research isn't reproducible.
|
| Reproducibility should be the gold standard, NOT peer review,
| which has its own bias and cargo cult built in.
|
| The purpose of science is to create good explanations that are
| hard to vary. The purpose of scientific publications is to create
| prestige.
|
| So kill peer review; there are other mechanisms to ensure the
| quality of research.
| _Wintermute wrote:
| I think if the author had listened to pretty much any post-
| doc/technician or senior researcher in the field who has had to
| review a number of publications, they would have been told these
| things straight away.
| civilized wrote:
| Interesting but I don't know how to make sense of it. How can it
| be that "close to nothing of what makes science actually work is
| published as text on the web"?
|
| - Is the information that makes science actually work mostly in
| images that the machines don't yet understand?
|
| - Was the information paywalled or in private databases and
| inaccessible to this researcher?
|
| - Are the papers mostly just advertisements for researchers to
| come gab with each other at conferences and doodle on cocktail
| napkins, and that's where all the "real science" happens?
|
| - (From the comments) is the information needed to make sense of
| papers communicated privately or orally from PI's to postdocs and
| grad students, or within industrial research labs?
|
| Something is missing from my mental picture here.
|
| Don't real scientists mostly learn how to think about their
| fields by reading textbooks and papers? (This is a genuine
| question.) If so, isn't it likely that our tools just aren't
| advanced enough to learn like humans do? If not, what do humans
| use to learn how to think about science that's missing from
| textbooks and papers?
| mellavora wrote:
| <disclaimer: former real scientist>
|
| Science is a profession like others. When you are earning your
| Ph.D. you learn to think about the field by reading papers and
| discussing with peers and colleagues, yes.
|
| The intro of a well-structured research paper should follow
| this pattern:
|
| - This is a really important topic and here is why.
|
| - What is the current state of the art in this field? (this
| comes from reading 100-1000 publications on the topic and
| selecting the 5-10 most relevant to the next point). HOWEVER,
| the state of the art leaves this question unanswered.
|
| - Here are some reasons why the idea in this paper can help
| answer that question (cite another 3-10 papers).
|
| - Our hypothesis is that XXX can answer the important
| unanswered question (where X is derived from the prior
| section).
|
| So, what I am getting at, a scientific publication is part of a
| conversation. When I'm citing the 5-10 papers to summarize the
| state of the art, I'm assuming the reader has read 50% of the
| 100-1000 papers which I also read, and knows where the 5-10
| which I cite fit into that broader context.
|
| So any paper, in isolation, only has a fraction of its meaning
| in the publication. The real information is the surrounding
| context.
|
| Pro tip: if I'm reading a paper and want to understand it
| better, I also read one or two of the papers it cites, and one
| or two papers which cite it. Also, it can take a few times
| through before I start to understand what the author is trying
| to say.
| stevenbedrick wrote:
| Exactly! Scientific papers are not meant to stand on their
| own -- they are pieces of a much larger jigsaw puzzle. In
| order to make heads or tails out of a paper, one really needs
| to have a sense of where the paper fits into its larger
| picture. Building up the necessary base of knowledge to develop
| that sense, both in terms of explicit knowledge and tacit
| knowledge, is part of what a PhD student is actually doing
| while they are working on their PhD, and is part of why the
| process takes as long as it does.
|
| Also, the mechanical process of effectively reading a paper
| is highly non-linear, and is a skill in and of itself. In a
| lot of ways, it is more akin to high-level pattern matching
| than it is to more "normal" reading. At least at my
| institution, it is something that we actually teach our
| students to do in formal ways (the obligatory "How to read a
| scientific paper" lecture during the first term or two) and
| then make them practice over and over again for years
| (journal clubs, etc.). The original author eventually figured
| this out, which is to their credit.
| hoseja wrote:
| As the article states, papers are mostly career advancement
| tools and scientists are incentivized to put the least amount
| of useful information into them they can get away with. Real
| scientists mostly learn from their instructors who possess all
| the jealously guarded institutional knowledge.
|
| Yes, it is very broken.
| mellavora wrote:
| Hard disagree. With a caveat -- I do acknowledge that for an
| important number of professional academics your statement may
| be true, and I have heard a former post-doc at ETH Zurich
| describe their papers as career points (so also a grain of
| truth at elite institutes).
|
| But for most of the academics I have known and worked with,
| publications are taken quite seriously, and institutional
| knowledge is freely shared. There is an incentive to reduce
| the content in papers, but it is out of respect for the
| reader (a paper is not a textbook) and an honest attempt to
| limit the discussion to the core hypothesis of the work. You
| have 6 pages to 1) describe the content of 6*100 pages (the
| 100 other relevant papers on the topic), 2) present your
| addition to this body of knowledge, 3) discuss the insights
| your work brings, again referring to the content of 600
| pages.
|
| and those 600 pages you are summarizing are as information-
| dense as your work.
| jokteur wrote:
| It is always difficult to understand and implement the
| theory explained in papers: they seem fine on the surface,
| but when you look more closely you find a bunch of mistakes
| and giant holes in the details, and you end up trying
| to redo the whole paper.
|
| There should be journals/websites/blogs dedicated to trying
| to reexplain / implement papers.
| bluGill wrote:
| Doing a good review is hard.
|
| I've been reviewing a few C++ papers (things proposed to
| C++23) lately. Many of them are over my head and all I can
| find are a few spelling errors. The ones I've understood
| took me 3 readings before I found some giant holes in the
| details (which I pointed out to the authors, the next
| revision corrected them). In one case I actually started
| implementing the feature myself, and only then did I
| realize there was something ambiguous in the paper (after
| talking with the author, we decided to document it as an
| implementor's choice, as it doesn't matter for the 99% use
| case, and the rest could go either way depending on
| hardware and data details, so better to allow choice).
|
| The vast majority of papers I'm far too lazy to go into
| that level of detail on. I just assume the author is a
| smart person, and so I trustingly let them go. It may well
| be that if I understood the paper I'd be horrified; ask me in
| 2043 when we have 20 years of hindsight...
|
| I have to believe that peer review is the same - many
| reviewers are just reading and looking for something
| obvious but not really understanding details.
| mellavora wrote:
| Someone once said that they enjoyed one of my papers, and
| that even though they thought the writing was very clear
| they still had to read it 3 times to understand it.
|
| I told them that I had to write it 100 times and spend
| two years before I understood it.
|
| So if they could pick it up in 3 readings over 3 days,
| they were doing pretty good.
| Mezzie wrote:
| SO difficult, especially given that just because you're
| in the 'same' field and technically qualified to do a
| peer review doesn't mean you actually understand what
| you're reading.
|
| For example, I'm qualified to review papers on
| educational programs for children. I should never be
| asked to do that.
| civilized wrote:
| It's hard for me to even comprehend how this could be true,
| but it does sound familiar enough from credible sources that
| maybe it's right regardless of what makes sense to me.
| anonymousDan wrote:
| I mean honestly this is just total bullshit. There is plenty
| of value in academic papers. It's just that there is very
| little money to be made in developing tools such as those
| mentioned by the OP as there is very little money in
| academia.
| viewfromafar wrote:
| I understood the criticism as directed at the value of papers
| as instruments of knowledge sharing. The argument is not
| that papers are completely useless in terms of knowledge
| sharing, but that this pure purpose of dissemination is
| largely overshadowed by considerations of career,
| prestige, funding, or any interest other than knowledge
| sharing.
|
| This is the world we live in. A scientist is a person that
| needs to make a living and is subject to various
| constraints.
|
| The reason that there is little money to be made is that
| society hasn't found a way to set up the scientific process
| in such a way that the constraints would value the increase
| in public domain knowledge higher than the incentives to
| hold some knowledge back.
|
| Part of this may stem from leaving specialized knowledge to
| academia while letting only companies reap the monetary
| rewards of putting the knowledge to use. Society benefits
| only indirectly (better drugs, machines, etc) but industry
| players will rather shield knowledge and adapt its
| representation to their own needs.
| Al-Khwarizmi wrote:
| I can't speak for biomedicine, but speaking as an academic in
| CS the claim that "close to nothing of what makes science
| actually work is published as text on the web" looks like a
| huge hyperbole to me.
|
| It's true that the so-called "folk knowledge", knowledge that
| exists in the community but no one bothers to publish in the
| form of papers, is a real problem, but at least in my field,
| it's by no means the majority of knowledge.
|
| As someone from a peripheral university where you can't just
| drive a few miles and talk to the best in your field, I have
| successfully started in new subfields of study (getting to the
| level of writing research papers in top-tier venues) by reading
| the literature.
|
| While this essay provides a very interesting point of view, I
| suspect it's heavily colored by the author's failure to
| monetize the technology (which is related to the fact that the
| people doing most of the grunt research work, who would benefit
| the most from this, are PhD students and postdocs who have no
| money to pay for it - in the text, the author hints at this). I
| wouldn't take it as an accurate description of academia.
| viewfromafar wrote:
| Also CS, my interpretation of "what makes science work" is a
| little different and I would argue that - despite a lot of
| foundations and techniques being shared in research papers -
| this field more than any other is constraining the free
| circulation and application of knowledge.
|
| The equivalent of those biomedical industry players is the
| big tech companies, who develop closed source and push the edge
| in some areas. They will publish, but that does not mean you can
| replicate any of it.
|
| Software is also fragmented, crippled by IP lawsuits, patent
| trolls, and so on. This does inhibit the ability of society to
| benefit from software, since it depends on the private sector
| to sort things out. The PhDs go and build businesses to "make
| the science work" in that sense.
|
| The ideal of detached pursuit of knowledge is not a complete
| fiction (despite the hyperbole), but it does remain an ideal
| that can only be approximated.
| Al-Khwarizmi wrote:
| As an academic, all my papers from the last 5 or so years
| have associated github repos where all the code is
| accessible under free licenses. Most of my peers in
| academia do the same. Documentation quality is admittedly
| quite hit-and-miss, because we aren't paid for that and we
| need to jump to the next paper, but all the code is there
| and everything can be replicated even if it takes some
| effort due to rushed code or suboptimal documentation.
|
| Industry is a different world, and indeed there are plenty
| of opaque industry papers that aren't replicable at all
| because much of the model is essentially a trade secret,
| and the paper is more an avenue for bragging than for
| developing new knowledge together with the rest of the
| community. To be honest, I would just outright disallow
| that kind of paper. But that's not a popular opinion, and
| taking into account that big tech companies sponsor our
| conferences and provide grants, I can't blame those who
| think otherwise.
| rm445 wrote:
| The 'what makes science work' is stored in the scientists.
|
| They learn by reading the literature, but also by
| communicating, and by an active process of testing their own
| understanding and resolving gaps and inconsistencies. Even when
| a self-taught genius like Ramanujan comes along, they benefit
| from being brought into the community.
|
| The question of how one would determine the state of the art in
| a field has an answer, but at present it would be
| indistinguishable from training a scientist, rather than
| running a clever software tool that could synthesize from the
| literature.
| civilized wrote:
| Well that's an interesting idea, isn't it (even if completely
| impractical today)? Self-training AI robot scientist who not
| only reads the literature but actually chats with other
| scientists and tries to do science to improve its
| understanding. AlphaZero but for science.
| Vetch wrote:
| AlphaZero cannot chat and interact outside moving pieces.
| For science, self-training would be too wasteful,
| intractable and impossible to boot, given there's no
| simulator.
|
| An AlphaZero for science would instead be like the recent
| deepmind paper where the pattern matching capabilities and
| internal features of a neural network were used to navigate
| some domain's decision space of conjecture formation and
| testing.
| amcoastal wrote:
| Try: Paperswithcode.com
|
| If it's not there, I won't use it! If you don't provide code with
| your paper, it had better have a really useful concept in it;
| otherwise, no citation. Which points to the problem in the
| article: the most important information in basic research
| papers is "Hey, this concept works", as opposed to a rigorous
| test of exactly what makes the concept work and how to use it
| in other situations.
| aimor wrote:
| In my experience useful scientific knowledge is accumulated in
| people actively working. Documents (books, papers, guides,
| programs, talks, blog posts, etc) are communication tools, but
| are limited by the medium and the ability of the authors.
| People can consume documents and create analogies to their
| specific work, but from there it's the process of working that
| produces: experts, systems, tools. Sometimes those products are
| again documented.
| Vetch wrote:
| The article's core claims are:
|
| > Extracting, structuring or synthesizing "insights" from
| academic publications (papers) or building knowledge bases from a
| domain corpus of literature has negligible value in industry.
|
| > Most knowledge necessary to make scientific progress is not
| online and not encoded.
|
| > Close to nothing of what makes science actually work is
| published as text on the web
|
| > The tech is not there to make fact checking work reliably, even
| in constrained domains.
|
| > Accurately and programmatically transforming an entire piece of
| literature into a computer-interpretable, complete and actionable
| knowledge artifact remains a pipe dream.
|
| It also notes that existing old-school "biomedical knowledge
| bases, databases, ontologies that are updated regularly" rely on
| expert entry, which cuts through the noise in a way that NLP
| cannot.
|
| Although I disagree with its conclusions, much of this jibes with
| my experience. From the perspective of research, modern NLP and
| transformers are appropriately hyped, but from the perspective of
| real-world application, they are over-hyped. Transformers have
| deeper understanding than anything prior; they can figure out
| patterns in their context with a flexibility that goes way beyond
| regurgitation.
|
| They are also prone to hallucinating text and quoting misleading
| snippets, require lots of resources for inference, and enjoy being
| confidently wrong at a rate that makes industrial use nearly
| unworkable. They're powerful but you should think hard about
| whether you actually need them. Most of the time their true
| advantage is not leveraged.
|
| -----
|
| My disagreements are with its advice.
|
| > For recommendations, the suggestion is "follow the best
| institutions and ~50 top individuals".
|
| But this just creates a rich-get-richer effect and retards
| science, since most are reluctant to go against those with a lot
| of clout.
|
| > Why purchase access to a 3rd party AI reading engine...when you
| can just hire hundreds of postdocs in Hyderabad to parse papers
| into JSON? (at a $6,000 yearly salary). Would you invest in
| automation if you have billions of disposable income and access
| to cheap labor? After talking with employees of huge companies
| like GSK, AZ and Medscape the answer is a clear no.
|
| This reminds me of responses to questions of the sort: "Why
| didn't X (where X might be the Ottomans or the Chinese) get to
| the industrial revolution first?"
|
| Article also warns against working on ideas such as _"...semantic
| search, interoperable protocols and structured data,
| serendipitous discovery apps, knowledge organization. "_
|
| A lot of such apps are solutions chasing a problem, but they
| could work if designed to solve a specific real-world problem. On
| the other hand, an outsider trying to start a generalized,
| VC-backed business targeting industry is bound to fail. In fact,
| this seems to have been a major sticking point in the author's
| endeavor.
|
| Industry is jaded and set in its ways; startups focus on
| summarization, recommendation, and retrieval, which are low
| value in the scientific enterprise; and academia is focused on
| automation, which turns out brittle. Still, this line of research
| is needed. Knowledge production is growing rapidly while humans
| are not getting any smarter. Specialization has meant increases
| in redundant information, loss of context and a stall in theory
| production (hence "much less logic and deduction happening").
|
| While the published literature is sorely lacking, humans can,
| with effort, extract and/or triangulate value from it. Tooling
| needs to augment that process.
| markusstrasser wrote:
| "follow the best institutions and ~50 top individuals" wasn't
| meant as a suggestion actually, just an observation of what
| most people do.
|
| You're right they "could work if designed to solve a specific
| real world problem" but against what baseline? The baseline
| could be spending that time on actual deep tech projects and
| not NLP meta-science
| markusstrasser wrote:
| But you're right; open source projects for extracting info
| (like PubTator) are valuable, but ontologies/KGs need ongoing
| expert (ML, AI, SWEs, information architects, labelers) work
| (unlike most of Wikipedia or GH), so it's tough to make
| something that doesn't suck in a distributed open source
| fashion.
| plaidfuji wrote:
| Having invested quite a bit of my own time into various aspects
| of the scientific knowledge extraction morass, I'd say the author
| is largely on point, but there's a significant, and potentially
| valuable distinction to be made between extracting research
| outputs and research inputs.
|
| At least in the field of materials science, papers are by and
| large a record of research _outputs_. We made material X and it
| achieved performance Y - here are a bunch of measurements to
| prove that this is in fact what we made and that it truly
| achieved performance Y at relevant conditions, etc. In this
| sense, papers really function as an advertisement: look at what
| we achieved.
|
| What papers do _not_ do is rigorously document inputs, or provide
| a step-by-step guide to reproduce said results, for obvious
| reasons.
|
| My current take on this topic is that it would be both feasible
| and valuable to build a knowledge extraction system to compile
| and compare outputs across a specified field. Think the big
| "chart of all verified solar cell efficiencies over time" [1],
| but generated automatically. This would at least immediately
| orient researchers to the distribution of state of the art
| results, and help ensure that they don't omit relevant references
| in their reviews.
|
| But extracting and making sense of inputs (methods), or even
| "knowledge"? Forget about it.
|
| [1] https://www.nrel.gov/pv/cell-efficiency.html
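|
| A hedged sketch of that outputs-only extraction idea, pulling
| reported efficiency numbers out of abstracts with a crude
| pattern (the abstracts here are made up; a real system would
| need NER, unit normalization, and deduplication):
|
|     import re
|
|     abstracts = [
|         "Our perovskite device achieved an efficiency of 23.4%.",
|         "We report a tandem cell with 29.1 % efficiency.",
|     ]
|     # Match percentages like "23.4%" or "29.1 %".
|     pattern = re.compile(r"(\d{1,2}(?:\.\d+)?)\s*%")
|     for text in abstracts:
|         for m in pattern.finditer(text):
|             print(f"{float(m.group(1)):5.1f}%  <-  {text}")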
| i000 wrote:
| When I was in grad school, I joined a startup incubator and built
| a prototype which combined two of the tools mentioned in the
| article, "a query builder (by demonstration)" and "a paper
| recommender system": a simple companion which would help
| scientists not miss research relevant to them. This was 10
| years ago, before Google Scholar had similar features.
|
| The incubator introduced me to advisors with business experience
| in this field. And I got told in no uncertain terms what is the
| gist of this article: The value lies in the molecular and
| clinical data. In 2021 I would add digital pathology / imaging
| data.
| geoduck14 wrote:
| >And I got told in no uncertain terms what is the gist of this
| article: The value lies in the molecular and clinical data. In
| 2021 I would add digital pathology / imaging data.
|
| I feel like you are trying to tell me something REALLY
| valuable, but I don't quite understand it. Can you please
| elaborate?
| potatoman22 wrote:
| My take: answering questions using clinical data > answering
| questions with papers
| i000 wrote:
| There is immense value in clinical data (all the
| information captured and siloed through EHR). Pharma
| companies pay for access to it to gather real-world
| evidence (RWE) of how, for example, their drug performs.
| Molecular information is increasingly valuable too for
| research, biomarker development, patient cohort
| identification etc. The imaging data and pathology data are
| valuable because they are typically expertly annotated and
| can be used to train computer-vision algorithms etc. to
| solve medical problems - like diagnosis.
| JackFr wrote:
| It's interesting that OP did seemingly little research with
| respect to existing work in the field.
|
| https://www.nlm.nih.gov/medline/medline_overview.html
|
| Medline, a searchable online directory of medical research
| papers, has existed for 50 years. The National Library of
| Medicine was for many years a leader in document search and
| retrieval, before there was a web. In the 80's they were doing
| vector-cosine document similarity, document clustering, and
| automated classification. They were also doing some great stuff
| like indexing papers based on proteins and gene sequences, so a
| paper which might be in a field completely different than yours
| might pop up if a similar protein or sequence was mentioned.
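|
| For reference, a minimal sketch of that 80's-era technique:
| cosine similarity between plain term-frequency vectors, no ML
| involved (the toy documents are made up):
|
|     import math
|     from collections import Counter
|
|     def cosine(doc_a, doc_b):
|         # Term-frequency vectors from whitespace-tokenized text.
|         a = Counter(doc_a.lower().split())
|         b = Counter(doc_b.lower().split())
|         dot = sum(a[t] * b[t] for t in a)
|         norm = (math.sqrt(sum(v * v for v in a.values()))
|                 * math.sqrt(sum(v * v for v in b.values())))
|         return dot / norm if norm else 0.0
|
|     print(cosine("protein binding in gene expression",
|                  "gene expression and protein folding"))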
|
| (Disclosure - I worked at the National Library of Medicine in the
| 90's)
|
| That being said, in the past 30 years search and retrieval have
| exploded, to say nothing of ML. But it's crazy to ignore the
| stuff which has come before, AND it's tough to compete with a
| national lab whose mandate is to basically give the stuff away.
| inlitro wrote:
| 100%
|
| It also felt like a long apology/explanation for Emergent
| Ventures rather than a true deep analysis. Pretty strong (and
| often false) statements for what seems like only half a year of
| total, somewhat vague work.
| stevenbedrick wrote:
| The best thing about the NLM's work in this space is how deeply
| it has been informed by the needs and workflows of the
| biomedical researchers, which is a perspective that has been
| sorely lacking in work coming from outsiders.
|
| I did think that the author did a good job of outlining (some
| of) the basic structural issues that make this a tough field to
| monetize, but even setting those aside, there's no substitute
| for actually knowing your users and what they need, and that's
| something the NLM is amazing at.
|
| (Disclosure, my PhD was funded by an NLM training grant, some
| of my research is funded extramurally by the NLM, and I have a
| lot of NLM colleagues, so I'm maybe a little bit biased)
| bigdict wrote:
| Any article on this topic should mention Tshitoyan et al.
|
| https://www.nature.com/articles/s41586-019-1335-8
| markusstrasser wrote:
| Hey, author here. Great discussion so far. Will update the post
| with some of the comments and critiques.
| shusaku wrote:
| Previous discussion:
| https://news.ycombinator.com/item?id=29445715
___________________________________________________________________
(page generated 2021-12-09 23:00 UTC)