[HN Gopher] AI for Data Journalism: demonstrating what we can do...
___________________________________________________________________
AI for Data Journalism: demonstrating what we can do with this
stuff
Author : duck
Score : 149 points
Date : 2024-04-22 06:09 UTC (16 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| jeffbee wrote:
| Regarding the campaign finance documents, it's not just a vision
| problem with handwritten filings. Even when I feed Gemini Pro a
| typed PDF with normal text (the kind you can search, select, and
| copy in any PDF reader) it just invents things to fill in the
| result. I asked it to summarize the largest donors of a local
| campaign given a 70-page PDF and it gave me a top-10 table that
| seemed plausible but all the names were pulled from somewhere
| else in Gemini's tiny brain. None of them appeared anywhere in
| the filing.
| simonw wrote:
| Yeah, I'm beginning to think detailed OCR-style extraction just
| isn't a good fit for these models.
|
| It's frustrating, because getting that junk out of horribly
| scanned PDFs is 90% of the need of data journalism!
|
| The best OCR API I've used is still AWS Textract - have you
| seen anything better?
|
| I ended up building a tiny CLI tool for Textract because I
| found the thing so hard to use:
| https://github.com/simonw/textract-cli
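|
| For context, the core of what the CLI wraps is just a couple
| of boto3 calls - a minimal sketch (not the actual textract-cli
| code):
|
|     import boto3
|
|     textract = boto3.client("textract")
|
|     with open("filing.png", "rb") as f:
|         response = textract.detect_document_text(
|             Document={"Bytes": f.read()}
|         )
|
|     # blocks come back as PAGE / LINE / WORD items
|     lines = [
|         block["Text"]
|         for block in response["Blocks"]
|         if block["BlockType"] == "LINE"
|     ]
|     print("\n".join(lines))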
| hm-nah wrote:
| Azure Doc Intel is good. It creates a semantic representation
| (as JSON) of the doc. The schema is useful, with page,
| paragraph, table, etc. objects. The bounding box for each
| element is also useful. It flails with images, only giving
| you the bounding box of where the image exists - it's up to
| you to extract any images separately and then figure out how
| to correlate them. Overall I think it's worth using rather
| than rolling your own (at least at small scale).
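|
| e.g. a minimal sketch with the azure-ai-formrecognizer
| package (endpoint, key and file name are placeholders):
|
|     from azure.ai.formrecognizer import DocumentAnalysisClient
|     from azure.core.credentials import AzureKeyCredential
|
|     client = DocumentAnalysisClient(
|         "https://<resource>.cognitiveservices.azure.com/",
|         AzureKeyCredential("<key>"),
|     )
|
|     with open("doc.pdf", "rb") as f:
|         poller = client.begin_analyze_document(
|             "prebuilt-layout", document=f
|         )
|     result = poller.result()
|
|     # page / paragraph / table objects, each carrying
|     # bounding regions
|     for para in result.paragraphs:
|         print(para.content, para.bounding_regions)
|     for table in result.tables:
|         print(table.row_count, table.column_count)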
| simonw wrote:
| I have to admit I've been having trouble figuring out what
| to do with bounding boxes of elements - I've got those out
| of Textract, but it feels like there's a lot of custom code
| needed to get from a bunch of bounding boxes to a useful
| JSON structure of the document.
|
| That's why the idea of having an LLM like GPT-4 Vision or
| Gemini Pro or Claude process these things is so tempting -
| I want to be able to tell it what to do and get back JSON.
| And I can! It's just not reliable enough (yet?) to be
| particularly useful.
| larodi wrote:
| Have you considered using something like SAM (Segment
| Anything) for the bounds, then OCR for the text, and finally
| running it through an LLM - which is a good predictor of
| text - to fix missing words or typos (wrong characters)?
| simonw wrote:
| Oh that's an interesting idea! One challenge I've had
| with OCR is that things like multiple columns frequently
| confuse it - pulling out the regions first, OCRing them
| independently and then using an LLM to try and piece
| everything back together could be a neat direction.
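|
| Something like this, as a very rough sketch - the region
| detection step is stubbed out, since that's the part SAM or
| a layout model would handle:
|
|     from PIL import Image
|     import pytesseract
|
|     def ocr_regions(image_path, regions):
|         """OCR each detected region independently.
|
|         regions: (left, top, right, bottom) boxes from a
|         segmentation step - SAM, a layout model, whatever.
|         """
|         image = Image.open(image_path)
|         return [
|             pytesseract.image_to_string(image.crop(box))
|             for box in regions
|         ]
|
|     # then hand the per-region chunks to an LLM with a prompt
|     # like "reassemble these OCR fragments into reading order
|     # and fix obvious character errors" - that last step is
|     # the unproven bit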
| c_moscardi wrote:
| Yeah, I think MS' is the best out there, but agree that the
| usability leaves something to be desired. 2 thoughts:
|
| 1. I believe the IR jargon for getting a JSON of this form
| is Key Information Extraction (KIE). MS has an out-of-the-
| box model for this. I just tried the screenshot and it did
| a pretty good (but not perfect) job. It didn't get every
| form field, but most. MS sort-of has a flow for fine-
| tuning, but it really leaves a lot to be desired IMO.
| Curious if this would be "good enough" to satisfy the use
| case.
|
| 2. In terms of just OCR (i.e. getting the text/numeric
| strings correct), MS is known to be the best on typed text
| at the moment [1]. Handwriting is a different beast... but
| it looks like MS is doing a very good job there (and SOTA
| on handwriting is very good). In particular, it got all the
| numbers in that screenshot correct.
|
| If you want to see the results from MS on the screenshot in
| this blog post, here's the entire JSON blob. A bit of a
| behemoth but the key/value stuff is in there:
| https://gist.github.com/cmoscardi/8c376094181451a49f0c62406e...
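|
| For reference, the key/value section of that blob comes from
| the prebuilt document model - roughly this (a sketch;
| endpoint, key and file name are placeholders):
|
|     from azure.ai.formrecognizer import DocumentAnalysisClient
|     from azure.core.credentials import AzureKeyCredential
|
|     client = DocumentAnalysisClient(
|         "https://<resource>.cognitiveservices.azure.com/",
|         AzureKeyCredential("<key>"),
|     )
|     with open("filing.png", "rb") as f:
|         poller = client.begin_analyze_document(
|             "prebuilt-document", document=f
|         )
|     for kv in poller.result().key_value_pairs:
|         if kv.key and kv.value:
|             print(kv.key.content, "=>", kv.value.content)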
|
| [1] https://mindee.github.io/doctr/latest/using_doctr/using_mode...
| simonw wrote:
| That does look pretty great, thanks for the tip.
|
| Sending images through that API and then using an LLM to
| extract data from the text result from the OCR could be
| worth exploring.
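|
| Something along these lines with my LLM tool's Python API
| (a sketch - assumes the llm-claude-3 plugin is installed and
| ocr_text holds whatever the OCR API returned):
|
|     import llm
|
|     model = llm.get_model("claude-3-haiku")
|     response = model.prompt(
|         "Extract every donor name and amount from this OCR "
|         "text as a JSON array of {name, amount} objects. "
|         "Only use strings that appear in the text:\n\n"
|         + ocr_text
|     )
|     print(response.text())
|
| My hope is that the hallucination risk drops a lot when the
| model is copying from text in its context rather than reading
| pixels.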
| mistrial9 wrote:
| OCR software has been thirty years in the making! There must
| be dozens of alternatives... interested to hear from people
| close to this topic.
| is_true wrote:
| I had a good experience using easyocr to get data out of
| lottery draw videos.
| wcedmisten wrote:
| This is why I can't really trust LLMs for data tasks like
| this - I can't be certain they won't hallucinate results.
| devmor wrote:
| Generative AI is just not the Leatherman Tool everyone wants it
| to be.
|
| There are absolutely ways to create solutions in this problem
| space using ML as a tool, but they are more specialized and
| probably not economical until the cost of training bespoke
| models goes down.
| anamax wrote:
| "Leatherman Tool" is a good comparison, as pretty much every
| "blade" in a Leathermann is a crappy tool, better only than
| the Swiss Army knife equivalent.
|
| The Leatherman's reason for existence is that you have it
| with you for unexpected problems.
| ddp26 wrote:
| Looks great Simon! Do you have any anecdotes of journalists using
| the techniques in this demo in news pieces we can read?
| simonw wrote:
| Most of the stuff I presented in this talk (the structured data
| extraction things, AI query assistance etc) is so new that it's
| not had a chance to be used for a published story yet - I'm
| working with a few newsrooms right now getting them set up
| with it though.
|
| Datasette itself has been used for all kinds of things within
| newsrooms. Two recent examples I heard about: the WSJ use it
| internally for tools around topics like CEO compensation
| tracking, and Bellingcat have used it for some of their work
| that involves leaked data relating to Russia.
|
| The problem with open source tools is that people can use them
| without telling you about it! I'm trying to get better at
| encouraging people to share details like this with me - I'd
| really like to get some good use cases written up.
| simonw wrote:
| This post is about a talk I gave at a data journalism conference,
| but it's also a demo of a whole bunch of projects I've been
| working on over the past couple of months/years:
|
| - Claude 3 Haiku generation via phone or laptop camera:
| https://tools.simonwillison.net/haiku
|
| - A new Datasette plugin for creating tables by pasting in
| CSV/TSV/JSON data: https://github.com/datasette/datasette-import
|
| - A plugin that lets you ask a question in English and have that
| converted into a SQL query (also using Claude 3 Haiku):
| https://github.com/datasette/datasette-query-assistant
|
| - shot-scraper for scraping web pages from the command-line:
| https://shot-scraper.datasette.io/en/stable/javascript.html
|
| - Datasette Enrichments for applying bulk operations to data in a
| SQLite database table:
| https://enrichments.datasette.io/en/stable/ - demonstrating
| both a geocoder enrichment
| https://github.com/datasette/datasette-enrichments-opencage
| and a GPT-powered one:
| https://github.com/datasette/datasette-enrichments-gpt
|
| - The LLM command-line tool: https://llm.datasette.io/ -
| including a preview of the image support, currently in a branch
|
| - The datasette-extract plugin for extracting structured data
| into tables from unstructured text and images:
| https://www.datasette.cloud/blog/2024/datasette-extract/
|
| - The new datasette-embeddings plugin for calculating and storing
| embeddings for table content and executing semantic search
| queries against them:
| https://github.com/datasette/datasette-embeddings
|
| - Datasette Scribe by Alex Garcia: a tool for transcribing
| YouTube video audio into a database, diarizing speakers and
| making the whole thing searchable:
| https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
| doctorpangloss wrote:
| But Simon, you are a strongly opinionated guy whose opinions
| are interesting. Why don't _you_ make a data journalism piece?
| Why don't _you_ submit an article to the NYTimes, WSJ, etc.?
| They will publish it! What are you waiting for?
| simonw wrote:
| Sounds like a lot of work!
|
| I have been thinking that I need to do some actual reporting
| myself at some point though - the ultimate version of
| dogfooding for all of my stuff.
| andy99 wrote:
| Looks interesting - my reaction though is that AI is great at
| demos and always has been. The devil is in the details and
| historically that has made it unusable for most applications
| despite the presence of a cool demo. I'm not criticizing the
| project - I get that these are demos. It's just that we judge
| AI too much by demos and then handwave about how it will
| actually work in practice.
| simonw wrote:
| That was very much the point of my talk - and the reason I had
| so many live demos (as opposed to pre-recorded demos). I wanted
| to have at least a few instances of demos going wildly wrong to
| help emphasize how unreliable this stuff is.
|
| Being unreliable doesn't mean it isn't useful. Journalists
| handle unreliable sources all the time - fact-checking and
| comparing multiple sources is built into the profession. As
| such, I think journalists may be better equipped to make use of
| LLMs than most other professions!
| larodi wrote:
| It's useful - perhaps very useful - for journalists and other
| people who use it for one-off tasks. It's very ill-suited to
| massive automation at the moment, and that's a real problem
| everyone struggles with.
|
| Applying embedding vectors without the rest of the LLM can
| deliver much more sustainable innovation at present - at
| least compared to present-day SOTA models (IMHO, of course).
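|
| To illustrate: plain embeddings give you semantic search with
| no generation step, so there's nothing to hallucinate. A
| minimal sketch using Simon's llm library (the model name is
| just an example):
|
|     import llm
|
|     def cosine(a, b):
|         dot = sum(x * y for x, y in zip(a, b))
|         norm = lambda v: sum(x * x for x in v) ** 0.5
|         return dot / (norm(a) * norm(b))
|
|     model = llm.get_embedding_model("3-small")
|     docs = [
|         "council approves park budget",
|         "mayor vetoes zoning change",
|     ]
|     vectors = [model.embed(d) for d in docs]
|     query = model.embed("local government spending")
|     best = max(zip(docs, vectors),
|                key=lambda dv: cosine(query, dv[1]))
|     print(best[0])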
| fauigerzigerk wrote:
| I found the talk very interesting because it shows both the
| issues as well as potential solutions.
|
| One of the demos (extracting text from a PDF turned PNG)
| makes me wonder how you're ever going to fact check whether
| something in there is a hallucination. Innocent doctors won't
| always turn out to be Michael Jackson's sister after all :)
|
| But then in one of the last demos you're showing how the fact
| checking can be "engineered" right into the prompt: "What
| were the themes of this meeting and for each theme give me an
| illustrative quote". Now you can search for the quote.
|
| This is kind of eye opening for me, because you could build
| this sort of deterministic provability into all kinds of
| prompts. It certainly doesn't work for all applications but
| where it does work it basically allows you to swap false
| positives for false negatives, which is extremely valuable in
| many cases.
| sorokod wrote:
| What would be the equivalent of searching for quotes in
| your first (PNG) example?
|
| Switching to a text source, what would you do if, say, 30%
| of the quotes don't match with Ctrl-F?
| fauigerzigerk wrote:
| _> What would be the equivalent of searching for quotes
| in your first (PNG) example?_
|
| I don't have a general answer to that. It depends on the
| specifics of the application. In many cases the documents
| I'm interested in will have some overlap with structured
| data I have stored in a database. In the concrete example
| there could be a register of practicing physicians that
| could be used for cross referencing. But in other cases I
| think it's an unsolved problem that may never be solved
| completely.
|
| _> Switching to text source, what would you do if say
| 30% of the quotes do not match with CTR-F?_
|
| That's what I meant by swapping false positives for false
| negatives. You could simply throw out all the items for
| which you can't find the quote (which can obviously be
| done automatically). The remaining items are now "fact
| checked" to some degree. But the number of false
| negatives will probably have increased because not all
| the quotes without matches will be hallucinations.
|
| Another approach would be to send the query separately to
| multiple different models or to ask one model to check
| another model's claims.
|
| I think what works and what is good enough is highly
| application specific.
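|
| As a concrete sketch of the automatic filtering step
| (whitespace/case normalization is the fiddly part in
| practice):
|
|     def verify_quotes(items, source_text):
|         """Keep only items whose quote appears verbatim in
|         the source document."""
|         def norm(s):
|             return " ".join(s.lower().split())
|         haystack = norm(source_text)
|         verified, rejected = [], []
|         for item in items:
|             if norm(item["quote"]) in haystack:
|                 verified.append(item)
|             else:
|                 rejected.append(item)  # possible hallucination
|         return verified, rejected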
| sorokod wrote:
| There are two issues to address:
|
| 1. The price of validation.
|
| 2. The quality.
|
| The baseline is to do the work yourself and compare - the
| equivalent of a "brute force" solution. This of course
| defeats the purpose of the entire exercise. You propose an
| approach that reduces the cost of validation by crafting the
| prompt in such a way that validation can be partially
| automated. This may reduce quality because of false
| negatives and whatnot.
|
| The underlying assumption is that this process is cheaper
| than "brute force" and that the quality is "good enough". It
| would be interesting to see a writeup of some specific
| examples.
| skybrian wrote:
| I think of AI as a "hint generator" that will give you some
| good guesses, but you still have to verify the guesses
| yourself. One thing it can help with is coming up with
| search terms that you might not have thought of.
| Upvoter33 wrote:
| Great talk and I have been enjoying your work. Keep it up!
| photochemsyn wrote:
| A few years ago there was a story about how many legislative
| bills introduced in the US were written by lobbyists -
| reporting that might have benefited from these tools:
|
| https://publicintegrity.org/politics/state-politics/copy-pas...
|
| > "Using data provided by LegiScan, which tracks every proposed
| law introduced in the U.S., we pulled in digital copies of nearly
| 1 million pieces of legislation introduced between 2010 and Oct.
| 15, 2018."
|
| The scoring system used seems fairly simple in comparison to
| what AI can do:
|
| > "Our scoring system is based on three factors: the longest
| string of common text between a model and a bill; the number
| of common strings of five or more words; and the number of
| common strings of 10 or more words."
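|
| All three factors are straightforward to compute - a rough
| word-level sketch (the original analysis may have worked
| differently):
|
|     from difflib import SequenceMatcher
|
|     def ngrams(words, n):
|         return {tuple(words[i:i + n])
|                 for i in range(len(words) - n + 1)}
|
|     def score(model_text, bill_text):
|         a, b = model_text.split(), bill_text.split()
|         # 1. longest string of common text
|         m = SequenceMatcher(None, a, b, autojunk=False)
|         longest = m.find_longest_match(0, len(a), 0, len(b))
|         # 2. and 3. approximated as the number of distinct
|         # shared 5-word and 10-word sequences
|         common5 = len(ngrams(a, 5) & ngrams(b, 5))
|         common10 = len(ngrams(a, 10) & ngrams(b, 10))
|         return longest.size, common5, common10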
| AlbertCory wrote:
| I found a couple of bylined articles in the Daily Illini that I
| wrote as a freshman. _Very_ boring stuff; they didn't give the
| exciting beats to a newbie.
|
| Then I ran them through ChatGPT to see what it thinks "objective
| journalism" is.
|
| https://albertcory50.substack.com/p/ai-does-journalism
|
| Conclusion: if I'd done what it suggested, the editor would have
| red-pencilled it out. It isn't that hard to just write the facts.
| No one wants to hear "both sides" on a donation of land to the
| Park District.
| simonw wrote:
| One of the themes of my talk was that generating text directly
| is actually one of the least interesting applications of LLMs
| to journalism - that's why I focused on things like structured
| data extraction and code generation (SQL and code interpreter) -
| those are much more useful for data journalists IMO.
| Vvector wrote:
| "Haikus from images with Claude 3 Haiku"
|
| But think of all the poet jobs you are eliminating...
___________________________________________________________________
(page generated 2024-04-22 23:02 UTC)