[HN Gopher] Kor: a half-baked prototype that "helps" you extract...
___________________________________________________________________
Kor: a half-baked prototype that "helps" you extract structured
data using LLMs
Author : BorisWilhelms
Score : 52 points
Date : 2023-06-26 19:05 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| captainmuon wrote:
| Neat, I was just looking for something like this today, I think
| I'll give it a spin.
|
| Does anybody here have experience with metadata extraction using
| LLMs? I've been thinking about it recently, and wonder if just
| making a big prompt and putting that into OpenGPT or even ChatGPT
| is really the way to go, or if there is a "cleverer" way. Maybe
| you could train specifically for certain fields, or use the LLM
| in a different way (like you can use the embeddings directly to
| do similarity search)?
|
| Another idea was, if you have a lot of similar HTML documents, to
| not ask the LLM for the metadata, but to ask it for CSS selectors
| that contain the metadata fields - assuming it can deal with HTML
| and the data is verbatim in there. Then you should be able to get
| much more consistent results.
| nerpderp82 wrote:
| I gave it some CSS paths extracted from devtools, and some
| sample elements with data that needed extraction, and had it
| write a Beautiful Soup + regex routine to do the extractions.
| Worked fine. Also thousands of times faster.
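
The selector-plus-regex approach discussed in this subthread could look roughly like the sketch below. It is not nerpderp82's actual routine: it uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup, and the sample HTML, the class names `byline` and `pubdate`, and the helpers `FieldExtractor` and `extract_metadata` are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; the class names are assumptions for illustration.
SAMPLE_HTML = """
<article>
  <span class="byline">By Jane Doe</span>
  <time class="pubdate">Published 2023-06-26</time>
  <p>Body text.</p>
</article>
"""


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute matches a wanted name."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.current = None  # class name of the element being captured, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted and self.current is None:
                self.current = name
                self.fields[name] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data

    def handle_endtag(self, tag):
        # Simplification: stop capturing at the first closing tag,
        # so nested markup inside a field is not supported.
        self.current = None


def extract_metadata(html):
    """Pull author and date out of the selected elements with regexes."""
    parser = FieldExtractor(["byline", "pubdate"])
    parser.feed(html)
    parser.close()
    author = re.sub(r"^\s*By\s+", "", parser.fields.get("byline", "")).strip()
    date = re.search(r"\d{4}-\d{2}-\d{2}", parser.fields.get("pubdate", ""))
    return {"author": author or None, "date": date.group(0) if date else None}


print(extract_metadata(SAMPLE_HTML))
```

Once the selectors and patterns are fixed, no model call is needed per document, which is where the consistency and speed gains mentioned above would come from.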
| anotherpaulg wrote:
| Does this take advantage of the new OpenAI functions API? From a
| quick look, I can't find any indication that it does. Although I
| find it tricky to disentangle the langchain abstractions, so I
| might be missing it. Kor's last release predates the announcement
| of OpenAI functions, so probably not.
|
| Seems like this is now best done via functions, if you're using
| OpenAI's models? They call out "extracting structured data from
| text" as a key use case in their announcement.
|
| https://openai.com/blog/function-calling-and-other-api-updat...
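
For readers unfamiliar with the feature mentioned here: a function is described to the model with a name, a description, and a JSON Schema for its parameters, and the model replies with the arguments as a JSON string. The sketch below only builds such a definition and checks a reply locally; it makes no API call. The function name `record_metadata`, its fields, the helper `parse_function_arguments`, and the sample reply are all hypothetical.

```python
import json

# Hypothetical function definition: a name, a description, and
# JSON Schema "parameters" describing the fields to extract.
EXTRACT_METADATA_FUNCTION = {
    "name": "record_metadata",
    "description": "Record metadata extracted from a document.",
    "parameters": {
        "type": "object",
        "properties": {
            "author": {"type": "string", "description": "Author name"},
            "publish_date": {"type": "string", "description": "Date as YYYY-MM-DD"},
        },
        "required": ["author", "publish_date"],
    },
}


def parse_function_arguments(arguments_json, function_def):
    """Decode the model's arguments string and check the required keys.

    The arguments arrive as text, so they may be invalid JSON or
    incomplete; both cases are reported as ValueError here.
    """
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    required = function_def["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return args


# Made-up example of an arguments string a model might return.
sample_arguments = '{"author": "Jane Doe", "publish_date": "2023-06-26"}'
metadata = parse_function_arguments(sample_arguments, EXTRACT_METADATA_FUNCTION)
print(metadata)
```

Validating the reply is still worthwhile, since the schema guides the model's output but does not guarantee it.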
| BorisWilhelms wrote:
| No, it is not using OpenAI functions. Since it is built on top of
| LangChain, it uses LangChain's LLM abstraction and can be used
| with other models as well.
| anotherpaulg wrote:
| Yup, the flexibility of running against any model via
| langchain is super helpful.
| kiernanmcgowan wrote:
| Another tool like this is Marvin. My experience is that these
| work pretty well, but the world of prompt "engineering" is a very
| squishy one and getting the exact output format you want is not
| guaranteed.
|
| https://www.askmarvin.ai/
| mark_l_watson wrote:
| I have experimented with Kor several times, cool idea.
| dennisy wrote:
| Have you tried this on HTML?
| BorisWilhelms wrote:
| Yes, tried it on HTML to get "metadata" that was not present in
| the HTML meta tags, such as author, publish date, etc. Works
| well.
| BorisWilhelms wrote:
| Actually not on raw HTML, but with the WebBaseLoader from
| Langchain which strips away HTML tags.
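
As a rough idea of what stripping tags before extraction involves, here is a standard-library sketch. It is not WebBaseLoader's implementation; the `TextOnly` class and `html_to_text` helper are hypothetical names, and the choice to drop `script` and `style` contents is an assumption.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep visible text, dropping tags and script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the text chunks of an HTML document, one per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = "<html><body><h1>Title</h1><p>By <b>Jane</b></p><script>x()</script></body></html>"
print(html_to_text(page))
```

The resulting plain text is what would then be passed to the extraction prompt, which also keeps the token count lower than sending raw markup.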
| dennisy wrote:
| Ahh cool thank you!
___________________________________________________________________
(page generated 2023-06-26 23:00 UTC)