[HN Gopher] AI for Data Journalism: demonstrating what we can do with this stuff
       ___________________________________________________________________
        
       AI for Data Journalism: demonstrating what we can do with this
       stuff
        
       Author : duck
       Score  : 149 points
       Date   : 2024-04-22 06:09 UTC (16 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | jeffbee wrote:
       | Regarding the campaign finance documents, it's not just a vision
       | problem with handwritten filings. Even when I feed Gemini Pro a
       | typed PDF with normal text (the kind you can search, select, and
       | copy in any PDF reader) it just invents things to fill in the
       | result. I asked it to summarize the largest donors of a local
       | campaign given a 70-page PDF and it gave me a top-10 table that
       | seemed plausible but all the names were pulled from somewhere
       | else in Gemini's tiny brain. None of them appeared anywhere in
       | the filing.
        
         | simonw wrote:
         | Yeah, I'm beginning to think detailed OCR-style extraction just
         | isn't a good fit for these models.
         | 
         | It's frustrating, because getting that junk out of horribly
         | scanned PDFs is 90% of the need of data journalism!
         | 
         | The best OCR API I've used is still AWS Textract - have you
         | seen anything better?
         | 
         | I ended up building a tiny CLI tool for Textract because I
         | found the thing so hard to use:
         | https://github.com/simonw/textract-cli
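          | 
          | If anyone wants to try Textract without the CLI wrapper, the
          | core boto3 call is tiny - minimal untested sketch, the
          | filename is a placeholder:
          | 
          |     import boto3
          | 
          |     # Synchronous OCR for a single image (PNG/JPEG)
          |     textract = boto3.client("textract")
          |     with open("filing-page.png", "rb") as f:
          |         resp = textract.detect_document_text(
          |             Document={"Bytes": f.read()})
          | 
          |     # Blocks come back as PAGE / LINE / WORD, with geometry
          |     for block in resp["Blocks"]:
          |         if block["BlockType"] == "LINE":
          |             print(block["Text"])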
        
           | hm-nah wrote:
            | Azure Doc Intel is good. It creates a semantic
            | representation (as JSON) of the doc. The schema is useful,
            | with page, paragraph, table, etc. objects. The bounding box
            | for each element is also useful. It flails with images,
            | only giving you the bounding box of where the image exists.
            | It's up to you to extract any images separately and then
            | figure out how to correlate them. Overall I think it's more
            | useful than rolling your own (at least at small scale).
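            | 
            | Rough shape of the layout call, for anyone who wants to
            | kick the tires - untested sketch, the endpoint and key are
            | placeholders:
            | 
            |     from azure.ai.formrecognizer import DocumentAnalysisClient
            |     from azure.core.credentials import AzureKeyCredential
            | 
            |     client = DocumentAnalysisClient(
            |         endpoint="https://<resource>.cognitiveservices.azure.com/",
            |         credential=AzureKeyCredential("<key>"))
            |     with open("doc.pdf", "rb") as f:
            |         result = client.begin_analyze_document(
            |             "prebuilt-layout", document=f).result()
            | 
            |     # Paragraphs and tables, each with bounding regions
            |     for para in result.paragraphs:
            |         print(para.content, para.bounding_regions)
            |     for table in result.tables:
            |         for cell in table.cells:
            |             print(cell.row_index, cell.column_index,
            |                   cell.content)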
        
             | simonw wrote:
             | I have to admit I've been having trouble figuring out what
             | to do with bounding boxes of elements - I've got those out
             | of Textract, but it feels like there's a lot of custom code
             | needed to get from a bunch of bounding boxes to a useful
             | JSON structure of the document.
             | 
             | That's why the idea of having an LLM like GPT-4 Vision or
             | Gemini Pro or Claude process these things is so tempting -
             | I want to be able to tell it what to do and get back JSON.
             | And I can! It's just not reliable enough (yet?) to be
             | particularly useful.
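              | 
              | The closest I've got is naive row grouping - sort the
              | LINE blocks by the top of their bounding box and start a
              | new row whenever the vertical gap exceeds a threshold.
              | Something like this (sketch - the tolerance is a guess):
              | 
              |     def group_into_rows(lines, tolerance=0.01):
              |         # lines: [(text, top, left)] using Textract's
              |         # 0-1 normalized page coordinates
              |         lines = sorted(lines, key=lambda l: (l[1], l[2]))
              |         rows, row, last_top = [], [], None
              |         for text, top, left in lines:
              |             if (last_top is not None
              |                     and top - last_top > tolerance):
              |                 rows.append(row)
              |                 row = []
              |             row.append(text)
              |             last_top = top
              |         if row:
              |             rows.append(row)
              |         return rows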
        
               | larodi wrote:
                | Have you considered using a sort of SAM (segment
                | anything) model for the bounds, then OCR for the text,
                | and finally running it through an LLM - which is a
                | good predictor of text - to figure out missing words
                | or typos (wrong chars)?
        
               | simonw wrote:
               | Oh that's an interesting idea! One challenge I've had
               | with OCR is that things like multiple columns frequently
                | confuse it - pulling out the regions first, OCRing them
               | independently and then using an LLM to try and piece
               | everything back together could be a neat direction.
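                | 
                | A sketch of what that pipeline could look like -
                | untested, the checkpoint filename is a placeholder,
                | and SAM's generic masks would likely need filtering
                | down to just the text regions:
                | 
                |     import cv2, pytesseract
                |     from segment_anything import (
                |         sam_model_registry, SamAutomaticMaskGenerator)
                | 
                |     image = cv2.cvtColor(cv2.imread("page.png"),
                |                          cv2.COLOR_BGR2RGB)
                |     sam = sam_model_registry["vit_b"](
                |         checkpoint="sam_vit_b.pth")
                |     masks = SamAutomaticMaskGenerator(sam).generate(image)
                | 
                |     chunks = []
                |     for m in masks:
                |         x, y, w, h = m["bbox"]  # pixel coords, XYWH
                |         chunks.append(pytesseract.image_to_string(
                |             image[y:y + h, x:x + w]))
                | 
                |     # ...then hand "\n---\n".join(chunks) to an LLM to
                |     # reassemble reading order and fix OCR typos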
        
             | c_moscardi wrote:
             | Yeah, I think MS' is the best out there, but agree that the
             | usability leaves something to be desired. 2 thoughts:
             | 
             | 1. I believe the IR jargon for getting a JSON of this form
             | is Key Information Extraction (KIE). MS has an out-of-the-
             | box model for this. I just tried the screenshot and it did
             | a pretty good (but not perfect) job. It didn't get every
             | form field, but most. MS sort-of has a flow for fine-
             | tuning, but it really leaves a lot to be desired IMO.
             | Curious if this would be "good enough" to satisfy the use
             | case.
             | 
             | 2. In terms of just OCR (i.e. getting the text/numeric
             | strings correct), MS is known to be the best on typed text
             | at the moment [1]. Handwriting is a different beast... but
             | it looks like MS is doing a very good job there (and SOTA
             | on handwriting is very good). In particular, it got all the
             | numbers in that screenshot correct.
             | 
             | If you want to see the results from MS on the screenshot in
             | this blog post, here's the entire JSON blob. A bit of a
             | behemoth but the key/value stuff is in there: https://gist.
             | github.com/cmoscardi/8c376094181451a49f0c62406e...
             | 
             | [1] https://mindee.github.io/doctr/latest/using_doctr/using
             | _mode...
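              | 
              | For reference, the KIE call is the same layout API with
              | a different model ID - sketch, reusing the
              | DocumentAnalysisClient from the comment upthread:
              | 
              |     result = client.begin_analyze_document(
              |         "prebuilt-document", document=f).result()
              | 
              |     for kv in result.key_value_pairs:
              |         key = kv.key.content if kv.key else None
              |         value = kv.value.content if kv.value else None
              |         print(key, "=>", value, kv.confidence)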
        
               | simonw wrote:
               | That does look pretty great, thanks for the tip.
               | 
               | Sending images through that API and then using an LLM to
               | extract data from the text result from the OCR could be
               | worth exploring.
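                | 
                | With my LLM library that chain is only a few lines -
                | sketch, assuming ocr_text holds the OCR output and the
                | llm-claude-3 plugin is installed for that model ID:
                | 
                |     import llm
                | 
                |     model = llm.get_model("claude-3-haiku")
                |     response = model.prompt(
                |         "Return a JSON array of {name, amount} for "
                |         "every donor in this text:\n\n" + ocr_text)
                |     print(response.text())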
        
           | mistrial9 wrote:
            | OCR software has been thirty years in the making! There
            | must be dozens of alternatives... interested to hear from
            | people close to this topic.
        
           | is_true wrote:
            | I had a good experience using easyocr to get data out of
            | lottery draw videos.
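            | 
            | The whole thing is only a few lines - sketch from memory:
            | 
            |     import easyocr
            | 
            |     # downloads detection/recognition models on first run
            |     reader = easyocr.Reader(["en"])
            |     # each hit is (bounding box, text, confidence)
            |     for bbox, text, conf in reader.readtext("frame.png"):
            |         print(text, conf)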
        
         | wcedmisten wrote:
          | This is why I can't really trust LLMs for data tasks like
          | this. I can't be certain they aren't hallucinating results.
        
         | devmor wrote:
         | Generative AI is just not the Leatherman Tool everyone wants it
         | to be.
         | 
         | There are absolutely ways to create solutions in this problem
         | space using ML as a tool, but they are more specialized and
         | probably not economical until the cost of training bespoke
         | models goes down.
        
           | anamax wrote:
           | "Leatherman Tool" is a good comparison, as pretty much every
           | "blade" in a Leathermann is a crappy tool, better only than
           | the Swiss Army knife equivalent.
           | 
           | The Leatherman's reason for existence is that you have it
           | with you for unexpected problems.
        
       | ddp26 wrote:
       | Looks great Simon! Do you have any anecdotes of journalists using
       | the techniques in this demo in news pieces we can read?
        
         | simonw wrote:
         | Most of the stuff I presented in this talk (the structured data
         | extraction things, AI query assistance etc) is so new that it's
         | not had a chance to be used for a published story yet - I'm
          | working with a few newsrooms right now getting them set up with
         | it though.
         | 
         | Datasette itself has been used for all kinds of things within
         | newsrooms. Two recent examples I heard about: the WSJ use it
         | internally for tools around topics like CEO compensation
         | tracking, and Bellingcat have used it for some of their work
         | that involves leaked data relating to Russia.
         | 
         | The problem with open source tools is that people can use them
         | without telling you about it! I'm trying to get better at
         | encouraging people to share details like this with me, I'd
         | really like to get some good use-cases written up.
        
       | simonw wrote:
       | This post is about a talk I gave at a data journalism conference,
       | but it's also a demo of a whole bunch of projects I've been
       | working on over the past couple of months/years:
       | 
        | - Haiku generation from a phone or laptop camera, using Claude
        | 3 Haiku: https://tools.simonwillison.net/haiku
       | 
       | - A new Datasette plugin for creating tables by pasting in
       | CSV/TSV/JSON data: https://github.com/datasette/datasette-import
       | 
       | - A plugin that lets you ask a question in English and have that
       | converted into a SQL query (also using Claude 3 Haiku):
       | https://github.com/datasette/datasette-query-assistant
       | 
        | - shot-scraper for scraping web pages from the command-line
        | (quick example at the end of this comment):
        | https://shot-scraper.datasette.io/en/stable/javascript.html
       | 
       | - Datasette Enrichments for applying bulk operations to data in a
       | SQLite database table:
       | https://enrichments.datasette.io/en/stable/ - demonstrating both
       | a geocoder enrichment https://github.com/datasette/datasette-
       | enrichments-opencage and a GPT-powered one:
       | https://github.com/datasette/datasette-enrichments-gpt
       | 
       | - The LLM command-line tool: https://llm.datasette.io/ -
       | including a preview of the image support, currently in a branch
       | 
       | - The datasette-extract plugin for extracting structured data
       | into tables from unstructured text and images:
       | https://www.datasette.cloud/blog/2024/datasette-extract/
       | 
       | - The new datasette-embeddings plugin for calculating and storing
       | embeddings for table content and executing semantic search
       | queries against them: https://github.com/datasette/datasette-
       | embeddings
       | 
       | - Datasette Scribe by Alex Garcia: a tool for transcribing
       | YouTube video audio into a database, diarizing speakers and
       | making the whole thing searchable:
       | https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
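        | 
        | A couple of one-liners to give a flavor of the command-line
        | pieces (the URL and model ID are just examples - the Claude
        | model needs the llm-claude-3 plugin):
        | 
        |     # Run JavaScript against a page, get JSON back
        |     shot-scraper javascript https://datasette.io/ "document.title"
        | 
        |     # Pipe anything into a model from the terminal
        |     cat notes.txt | llm -m claude-3-haiku "summarize this"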
        
         | doctorpangloss wrote:
         | But Simon, you are a strongly opinionated guy whose opinions
         | are interesting. Why don't _you_ make a data journalism piece?
          | Why don't _you_ submit an article to the NYTimes, WSJ, etc.?
         | They will publish it! What are you waiting for?
        
           | simonw wrote:
           | Sounds like a lot of work!
           | 
            | I have been thinking that I need to do some actual
            | reporting myself at some point though - the ultimate
            | version of dogfooding for all of my stuff.
        
       | andy99 wrote:
       | Looks interesting - my reaction though is that AI is great at
       | demos and always has been. The devil is in the details and
       | historically that has made it unusable for most applications
       | despite the presence of a cool demo. I'm not criticizing the
        | project, I get that it's demos. Just that we judge AI too much by
       | demos and then handwave about how it will actually work in
       | practice.
        
         | simonw wrote:
         | That was very much the point of my talk - and the reason I had
         | so many live demos (as opposed to pre-recorded demos). I wanted
         | to have at least a few instances of demos going wildly wrong to
         | help emphasize how unreliable this stuff is.
         | 
         | Being unreliable doesn't mean it isn't useful. Journalists
         | handle unreliable sources all the time - fact-checking and
         | comparing multiple sources is built into the profession. As
         | such, I think journalists may be better equipped to make use of
         | LLMs than most other professions!
        
           | larodi wrote:
            | It is useful, perhaps very useful, for journalists and
            | other people who use it for one-off tasks. It is very
            | ill-suited for massive automation at the moment, and that's
            | a real problem everyone struggles with.
            | 
            | The application of embedding vectors, without the rest of
            | the LLM, can presently deliver much more sustainable
            | innovation - at least compared to present-day SOTA models
            | (IMHO, of course).
        
           | fauigerzigerk wrote:
           | I found the talk very interesting because it shows both the
           | issues as well as potential solutions.
           | 
           | One of the demos (extracting text from a PDF turned PNG)
           | makes me wonder how you're ever going to fact check whether
           | something in there is a hallucination. Innocent doctors won't
           | always turn out to be Michael Jackson's sister after all :)
           | 
           | But then in one of the last demos you're showing how the fact
           | checking can be "engineered" right into the prompt: "What
           | were the themes of this meeting and for each theme give me an
           | illustrative quote". Now you can search for the quote.
           | 
           | This is kind of eye opening for me, because you could build
           | this sort of deterministic provability into all kinds of
           | prompts. It certainly doesn't work for all applications but
           | where it does work it basically allows you to swap false
           | positives for false negatives, which is extremely valuable in
           | many cases.
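            | 
            | Even something as dumb as this gets you a long way -
            | sketch, assuming each extracted item is a dict with a
            | "quote" key; whitespace normalization is the fiddly part
            | in practice:
            | 
            |     def verify(items, source_text):
            |         # keep only items whose quote appears verbatim
            |         src = " ".join(source_text.lower().split())
            |         return [i for i in items
            |                 if " ".join(i["quote"].lower().split())
            |                 in src]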
        
             | sorokod wrote:
             | What would be the equivalent of searching for quotes in
             | your first (PNG) example?
             | 
              | Switching to a text source, what would you do if, say,
              | 30% of the quotes do not match with Ctrl-F?
        
               | fauigerzigerk wrote:
               | _> What would be the equivalent of searching for quotes
               | in your first (PNG) example?_
               | 
               | I don't have a general answer to that. It depends on the
               | specifics of the application. In many cases the documents
               | I'm interested in will have some overlap with structured
               | data I have stored in a database. In the concrete example
               | there could be a register of practicing physicians that
               | could be used for cross referencing. But in other cases I
               | think it's an unsolved problem that may never be solved
               | completely.
               | 
                |  _> Switching to a text source, what would you do if,
                | say, 30% of the quotes do not match with Ctrl-F?_
               | 
               | That's what I meant by swapping false positives for false
               | negatives. You could simply throw out all the items for
               | which you can't find the quote (which can obviously be
               | done automatically). The remaining items are now "fact
               | checked" to some degree. But the number of false
               | negatives will probably have increased because not all
               | the quotes without matches will be hallucinations.
               | 
               | Another approach would be to send the query separately to
               | multiple different models or to ask one model to check
               | another model's claims.
               | 
               | I think what works and what is good enough is highly
               | application specific.
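                | 
                | The model-checks-model version is also only a few
                | lines with e.g. the llm library - sketch, where claim
                | and source_text are placeholders and the model ID is
                | just an example:
                | 
                |     import llm
                | 
                |     checker = llm.get_model("gpt-4-turbo")
                |     verdict = checker.prompt(
                |         "Does the source support the claim? "
                |         "Answer YES or NO.\n\n"
                |         f"Claim: {claim}\n\nSource: {source_text}"
                |     ).text()
                |     keep = verdict.strip().upper().startswith("YES")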
        
               | sorokod wrote:
               | There are two issues to address
               | 
               | 1. The price of validation.
               | 
               | 2. The quality.
               | 
                | The baseline is to do the work yourself and compare -
                | the equivalent of a "brute force" solution. This of
                | course defeats the purpose of the entire exercise. You
                | propose an approach to reduce the validation price by
                | crafting the prompt in such a way that the validation
                | can be partially automated. This may reduce the quality
                | because of false negatives and whatnot.
                | 
                | The underlying assumption is that this process is
                | cheaper than "brute force" and the quality is "good
                | enough". It would be interesting to see a writeup of
                | some specific examples.
        
             | skybrian wrote:
             | I think of AI as a "hint generator" that will give you some
             | good guesses, but you still have to verify the guesses
             | yourself. One thing it can help with is coming up with
             | search terms that you might not have thought of.
        
           | Upvoter33 wrote:
           | Great talk and I have been enjoying your work. Keep it up!
        
       | photochemsyn wrote:
        | A few years ago there was a story about how many legislative
        | bills introduced in the US were written by lobbyists; it might
        | have benefited from these tools:
       | 
       | https://publicintegrity.org/politics/state-politics/copy-pas...
       | 
       | > "Using data provided by LegiScan, which tracks every proposed
       | law introduced in the U.S., we pulled in digital copies of nearly
       | 1 million pieces of legislation introduced between 2010 and Oct.
       | 15, 2018."
       | 
        | The scoring system used seems fairly simple in comparison to
        | what AI can do:
        | 
        | > "Our scoring system is based on three factors: the longest
        | string of common text between a model and a bill; the number
        | of common strings of five or more words; and the number of
        | common strings of 10 or more words."
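        | 
        | That scoring is simple enough to reproduce in a few lines of
        | Python - sketch over word tokens rather than raw characters:
        | 
        |     from difflib import SequenceMatcher
        | 
        |     def ngrams(words, n):
        |         return {tuple(words[i:i + n])
        |                 for i in range(len(words) - n + 1)}
        | 
        |     def score(model_text, bill_text):
        |         a, b = model_text.split(), bill_text.split()
        |         longest = SequenceMatcher(None, a, b).find_longest_match(
        |             0, len(a), 0, len(b)).size
        |         return (longest,
        |                 len(ngrams(a, 5) & ngrams(b, 5)),
        |                 len(ngrams(a, 10) & ngrams(b, 10)))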
        
       | AlbertCory wrote:
        | I found a couple of bylined articles in the Daily Illini that
        | I wrote as a freshman. _Very_ boring stuff; they didn't give
        | the exciting beats to a newbie.
       | 
       | Then I ran them through ChatGPT to see what it thinks "objective
       | journalism" is.
       | 
       | https://albertcory50.substack.com/p/ai-does-journalism
       | 
       | Conclusion: if I'd done what it suggested, the editor would have
       | red-pencilled it out. It isn't that hard to just write the facts.
       | No one wants to hear "both sides" on a donation of land to the
       | Park District.
        
         | simonw wrote:
         | One of the themes of my talk was that generating text directly
         | is actually one of the least interesting applications of LLMs
         | to journalism - that's why I focused on things like structured
         | data extraction and code generation (SQL and code interpreter),
         | those are much more useful for data journalists IMO.
        
       | Vvector wrote:
       | "Haikus from images with Claude 3 Haiku"
       | 
       | But think of all the poet jobs you are eliminating...
        
       ___________________________________________________________________
       (page generated 2024-04-22 23:02 UTC)