[HN Gopher] Kor: a half-baked prototype that "helps" you extract...
       ___________________________________________________________________
        
       Kor: a half-baked prototype that "helps" you extract structured
       data using LLMs
        
       Author : BorisWilhelms
       Score  : 52 points
       Date   : 2023-06-26 19:05 UTC (3 hours ago)
        
 (HTM) web link (github.com)
        
       | captainmuon wrote:
       | Neat, I was just looking for something like this today, I think
       | I'll give it a spin.
       | 
       | Does anybody here have experience with metadata extraction using
       | LLMs? I've been thinking about it recently, and wonder if just
       | making a big prompt and putting that into OpenGPT or even ChatGPT
       | is really the way to go, or if there is a "cleverer" way. Maybe
       | you could train specifically for certain fields, or use the LLM
       | in a different way (like you can use the embeddings directly to
       | do similarity search)?
       | 
       | Another idea was, if you have a lot of similar HTML documents, to
       | not ask the LLM for the metadata, but to ask it for CSS selectors
       | that contain the metadata fields - assuming it can deal with HTML
       | and the data is verbatim in there. Then you should be able to get
       | much more consistent results.
        
         | nerpderp82 wrote:
         | I gave it some CSS paths extracted from devtools, and some
         | sample elements with data that needed extraction, and had it
         | write a Beautiful Soup + regex routine to do the extractions.
         | Worked fine. Also thousands of times faster.
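
       The selector-driven approach discussed in this subthread can be
       sketched roughly as follows. This is a minimal illustration, not
       the routine nerpderp82 describes: it uses only the standard
       library's html.parser as a stand-in for Beautiful Soup, matches on
       (tag, class) pairs instead of full CSS selectors, and the field
       names, class names, and sample page are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical field map: field name -> (tag, class) pair. This is a
# simplified stand-in for the CSS selectors an LLM might propose once
# for a whole family of similar pages.
SELECTORS = {
    "author": ("span", "byline"),
    "published": ("time", "pubdate"),
}


class FieldExtractor(HTMLParser):
    """Collects the text of the first element matching each (tag, class) pair."""

    def __init__(self, selectors):
        super().__init__()
        self.selectors = selectors
        self.fields = {}
        self._current = None  # name of the field being captured, if any
        self._depth = 0       # nesting depth of the captured tag name

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Track nested elements with the same tag name so the
            # capture ends at the matching close tag.
            if tag == self.selectors[self._current][0]:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name, (want_tag, want_class) in self.selectors.items():
            if name not in self.fields and tag == want_tag and want_class in classes:
                self._current = name
                self._depth = 1
                self.fields[name] = ""
                break

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self.selectors[self._current][0]:
            self._depth -= 1
            if self._depth == 0:
                self._current = None


def extract(html, selectors=SELECTORS):
    """Return a dict of field name -> extracted text for one page."""
    parser = FieldExtractor(selectors)
    parser.feed(html)
    parser.close()
    fields = {name: text.strip() for name, text in parser.fields.items()}
    # Regex post-processing for the hypothetical "published" field:
    # pull an ISO date out of free text such as "Published 2023-06-26".
    if "published" in fields:
        match = re.search(r"\d{4}-\d{2}-\d{2}", fields["published"])
        fields["published"] = match.group(0) if match else None
    return fields
```

       Once the selectors are fixed, every page with the same layout goes
       through the same deterministic code path, which is where the
       consistency and speed mentioned above would come from. The LLM is
       only needed to propose the selectors.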
        
       | anotherpaulg wrote:
       | Does this take advantage of the new OpenAI functions api? From a
       | quick look, I can't find any indication that it does. Although I
       | find it tricky to disentangle the langchain abstractions, so I
       | might be missing it. Kor's last release predates the announcement
       | of OpenAI functions, so probably not.
       | 
       | Seems like this is now best done via functions, if you're using
       | OpenAI's models? They call out "extracting structured data from
       | text" as a key use case in their announcement.
       | 
       | https://openai.com/blog/function-calling-and-other-api-updat...
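
       For readers unfamiliar with the functions approach: the caller
       describes a function with a name, a description, and JSON Schema
       parameters, and the model replies with a JSON string of arguments.
       The sketch below shows only that data shape and a defensive parse
       of the reply. It makes no API call, the function name and fields
       are hypothetical, and the checks are an assumption about what a
       caller might want, since the returned string is not guaranteed to
       be valid or complete.

```python
import json

# Hypothetical function definition: a name, a description, and JSON
# Schema parameters. The name and fields are invented for illustration.
RECORD_METADATA = {
    "name": "record_metadata",
    "description": "Record metadata extracted from an article.",
    "parameters": {
        "type": "object",
        "properties": {
            "author": {"type": "string"},
            "publish_date": {"type": "string", "description": "ISO 8601 date"},
        },
        "required": ["author"],
    },
}


def parse_function_arguments(arguments_json, schema=RECORD_METADATA):
    """Decode the model's arguments string and check it against the schema.

    Raises ValueError on malformed JSON, a non-object payload, or a
    missing required field. Keys the schema does not define are dropped.
    """
    data = json.loads(arguments_json)  # JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")
    params = schema["parameters"]
    missing = [key for key in params["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {key: value for key, value in data.items() if key in params["properties"]}
```

       A definition like RECORD_METADATA would be sent along with the
       chat request, and the arguments string from the reply would go
       through parse_function_arguments before being trusted.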
        
         | BorisWilhelms wrote:
         | No, it is not using OpenAI functions. Since it is built on top
         | of langchain, it uses langchain's LLM abstraction and can be
         | used with other models as well.
        
           | anotherpaulg wrote:
           | Yup, the flexibility of running against any model via
           | langchain is super helpful.
        
       | kiernanmcgowan wrote:
       | Another tool like this is Marvin. My experience is that these
       | work pretty well, but the world of prompt "engineering" is a very
       | squishy one and getting the exact output format you want is not
       | guaranteed.
       | 
       | https://www.askmarvin.ai/
        
       | mark_l_watson wrote:
       | I have experimented with Kor several times, cool idea.
        
       | dennisy wrote:
       | Have you tried this on HTML?
        
         | BorisWilhelms wrote:
         | Yes, tried it on HTML to get "metadata" that was not present in
         | the HTML meta tags, such as author, publish date, etc. Works
         | well.
        
           | BorisWilhelms wrote:
           | Actually not on raw HTML, but with the WebBaseLoader from
           | Langchain which strips away HTML tags.
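
       To illustrate the preprocessing step mentioned here, the sketch
       below reduces a page to its visible text before it would be
       handed to an extraction prompt. It is not WebBaseLoader's
       implementation, only a standard-library approximation of
       "strip away the HTML tags", and the class and function names are
       hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps text nodes and discards tags, scripts, and stylesheets."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def strip_tags(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

       Text such as a byline or a publish date in the page body survives
       this step, which fits the observation that fields absent from the
       meta tags could still be extracted.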
        
             | dennisy wrote:
             | Ahh cool thank you!
        
       ___________________________________________________________________
       (page generated 2023-06-26 23:00 UTC)