hngopher.com

       [HN Gopher] Show HN: Turn any website into a knowledge base for ...
       ___________________________________________________________________
        
       Show HN: Turn any website into a knowledge base for LLMs
        
       I built this tool because I wanted a way to just take a bunch of
       URLs or domains, and query their content in RAG applications.  It
       takes away the pain of crawling, extracting content, chunking,
       vectorizing, and updating periodically.  I'm curious to see if it
       can be useful to others. I meant to launch this six months ago but
       life got in the way...
        
       Author : tompec
       Score  : 124 points
       Date   : 2024-07-30 00:54 UTC (1 day ago)
        
 (HTM) web link (www.embedding.io)
 (TXT) w3m dump (www.embedding.io)
        
       | cranberryturkey wrote:
       | how does this work?
        
         | tompec wrote:
         | Give it URLs or domains, and it will crawl and extract their
         | content, embed them in a vector database, and give you an
         | endpoint that you can then query when doing RAG stuff or
         | semantic search.
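
The steps described above (extract, chunk, embed, query) can be sketched
roughly as follows. This is only an illustration, not embedding.io's
implementation: the function names are hypothetical, `toy_embed` is a hashed
bag-of-words stand-in for a real embedding model, and a plain Python list
stands in for the vector database.

```python
import hashlib
import math
import re


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words (size > overlap)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dims=256):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(pages):
    """pages maps url -> extracted text; returns (url, chunk, vector) tuples."""
    index = []
    for url, text in pages.items():
        for chunk in chunk_text(text):
            index.append((url, chunk, toy_embed(chunk)))
    return index


def query(index, question, top_k=3):
    """Return the top_k (score, url, chunk) tuples by cosine similarity."""
    q = toy_embed(question)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), url, chunk)
        for url, chunk, vec in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

In a RAG setup, the chunks returned by `query` would be pasted into the LLM
prompt as context; the overlap between chunks is there so that a sentence cut
at a window boundary still appears whole in one of them.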
        
           | xiconfjs wrote:
           | But how does it work in the background? What's the tech
           | stack?
        
             | ramon156 wrote:
             | In another comment:
             | 
             | > Tech stack is a mix of serverless Laravel, with
             | Cloudflare and AWS functions, and some Pinecone for vector
             | storage. Still experimenting on a few things but don't want
             | to over-engineer unless I know where I'm going.
        
         | kordlessagain wrote:
         | I do this with https://mitta.ai by using a Playwright container
         | that does a callback to a pipeline that uses either metadata
         | from the PDF or sends it to an EasyOCR deployment on a GPU
         | instance on Google for text extraction. Then I use a custom
         | chunker and instructor/xl embeddings.
         | 
         | All of that code is Open Source, and works well for most sites.
         | Some sites block Google IPs, but the Playwright container can
         | run locally, so should be able to work around it with some
         | minimal effort.
        
       | samuria wrote:
       | Interesting, I wanted to do this for a personal use case (mostly
       | learning), but with PDFs. What's the tech stack? I have explored
       | using the AWS AI tools, but it seems a bit overkill for what I
       | want it to do.
        
         | tompec wrote:
         | Tech stack is a mix of serverless Laravel, with Cloudflare and
         | AWS functions, and some Pinecone for vector storage. Still
         | experimenting on a few things but don't want to over-engineer
         | unless I know where I'm going.
        
           | stevenicr wrote:
           | Given that Cloudflare spies on traffic and reports to
           | multiple agencies on its findings, perhaps a breakdown of
           | the chain and the privacy implications of each block in the
           | stack would be beneficial?
        
             | stevenicr wrote:
             | Ya know, a downvote on this pre-Aug 2019 would be fine.
             | 
             | People still being ignorant about their publicly posted
             | policies 5 years later is annoying.
        
         | lou1306 wrote:
         | If the PDFs are textual or have OCR, then pdftotext from the
         | Poppler suite ought to be enough? If not, add
         | Tesseract/ocrmypdf to the pipeline?
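
A minimal sketch of that fallback, assuming the `pdftotext` (Poppler) and
`ocrmypdf` command-line tools are installed and on the PATH. The function
names and the 200-character threshold for deciding that a PDF has no usable
text layer are assumptions for illustration, not something from the thread.

```python
import subprocess


def needs_ocr(text, min_chars=200):
    """Guess that a PDF lacks a text layer if very little text came out."""
    return len(text.strip()) < min_chars


def pdf_to_text(pdf_path):
    """Run Poppler's pdftotext and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the output to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_pdf_text(pdf_path, ocr_output_path):
    """Try the embedded text first; fall back to OCR only when needed."""
    text = pdf_to_text(pdf_path)
    if not needs_ocr(text):
        return text
    # ocrmypdf writes a copy of the PDF with an OCR text layer added.
    subprocess.run(["ocrmypdf", pdf_path, ocr_output_path], check=True)
    return pdf_to_text(ocr_output_path)
```

Trying the cheap extraction first keeps OCR, which is much slower, off the
path for PDFs that already carry text.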
        
         | kordlessagain wrote:
         | Here's some code to deal with that:
         | 
         | https://github.com/MittaAI/SlothAI/blob/main/SlothAI/lib/pro...
         | 
         | https://github.com/MittaAI/mitta-community/tree/main/service...
         | 
         | There's code in there that just reads PDF metadata as well,
         | but you can't always guarantee it's there in a PDF.
        
       | blackeyeblitzar wrote:
       | Is there a way to deal with websites where you need to login?
       | Like subscription based sites?
        
         | tompec wrote:
         | Unless you own those sites, I'm afraid that's not going to be
         | possible.
        
       | ancras wrote:
       | This is interesting. Can it work with any website, even say
       | document repositories hosted on standard servers like GitBook?
        
         | tompec wrote:
         | It works with pretty much any website, and yes, it works well
         | with docs hosted on GitBook. I have embedded a website that's
         | hosted there.
        
           | webappguy wrote:
           | Confirmation email doesn't work, so cannot try it. I
           | attempted twice and checked spam.
        
             | tompec wrote:
             | Apologies, please email me at support at embedding.io. If
             | you have something you'd like embedded, please also mention
             | it so I can set it up for you.
        
       | dazbradbury wrote:
       | Nice! What's the underlying model / RAG approach being used? It'd
       | be good to understand that part, as presumably it will have a big
       | impact on performance / usability of the results.
        
       | r0b05 wrote:
       | Does it hallucinate much?
        
       | michaelmior wrote:
       | This looks interesting, but I get a 404 on the iframe when I try
       | to go into the chat.
        
         | tompec wrote:
         | Sorry about that, a bit too much load at the moment
        
       | 23B1 wrote:
       | I tried it out. This would be extremely useful to me to the point
       | I'd be willing to happily pay for it, as it's something I would
       | have otherwise had to spend a long time hacking together.
       | 
       | 1) The returned output from a query seems pretty limited in
       | length and breadth.
       | 
       | 2) No apparent way to adjust my prompts to improve/adjust the
       | output e.g. not really 'conversational' (not sure if that is your
       | intent)
       | 
       | Otherwise keep developing and be sure to push update
       | notifications to your new mailing list! ;-)
        
         | dmje wrote:
         | Agree with this. I also think the emphasis here (to OP) should
         | be "I'd be willing to happily pay for it" - ie I'd rather be
         | paying a reasonable amount each month for something that is
         | going to remain active than have the large (current) disparity
         | between "free" and "enterprise". I'd say make some middle tiers
         | of (I don't know?) $5 / $10 / $20 a month for reasonable
         | numbers of queries or whatever. Keep the "enterprise" offering
         | there for the biggies, but offer us small players some hope
         | that this will be sufficiently funded / supported.
         | 
         | Brilliant idea, btw, I like it :-)
        
           | tompec wrote:
           | Thanks! I'm still figuring things out about pricing, but
           | there will be small plans available.
        
         | tompec wrote:
         | Thanks! The chat demo is actually just a small thing I put
         | together as a preview of what can be done, but the main product
         | is the API. But seeing that most users seem to like that,
         | there's probably something there... If you want to email me at
         | support at embedding.io with some requirements, I can see how
         | to make that work for you.
        
       | pryelluw wrote:
       | Does this respect robots.txt?
        
         | srameshc wrote:
         | Valid question and I am sure it doesn't.
        
         | danirod wrote:
         | I hope this gets answered.
         | 
         | Also I've checked their docs to see if there is any mention
         | about the user agents or IP ranges they use for scraping, with
         | no luck.
        
         | tompec wrote:
         | It does respect robots.txt when crawling. I'll add more details
         | about this in the docs.
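
For reference, a robots.txt check can be sketched with Python's standard
library. This is not the service's actual code: the user agent name
`ExampleCrawler` and the sample rules are made up, since the real user agent
is not documented in the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler from /private/, allow the rest.
ROBOTS = """\
User-agent: ExampleCrawler
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/docs",
    "https://example.com/private/page",
]
print(allowed_urls(ROBOTS, "ExampleCrawler", candidates))
```

A site owner can only write a rule like the first one if the crawler
publishes its user agent string, which is why that detail matters in the docs.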
        
           | pryelluw wrote:
           | I appreciate the reply. As someone who runs multiple CMSs,
           | it's painful to deal with the AI crawlers these days,
           | especially the ones that don't respect my terms.
        
       | mkw5053 wrote:
       | I made a similar open source app a year ago or so
       | https://github.com/mkwatson/chat_any_site
        
       | crowcroft wrote:
       | I like this. Abstracting away the management of embeddings and
       | vector database is something I desperately want, and adding in
       | website crawling is useful as well.
        
       | rcarmo wrote:
       | How do I feed it a sitemap?
        
         | tompec wrote:
         | It currently tries to find a sitemap on its own, but letting
         | users add their own is on the roadmap.
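
The thread does not say how the sitemap is located. One common approach,
sketched here as an assumption, is to read `Sitemap:` lines from robots.txt
and then parse the `<loc>` entries of the sitemap XML. The function names and
sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_hints(robots_txt):
    """Collect sitemap URLs declared via 'Sitemap:' lines in robots.txt."""
    hints = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            hints.append(value.strip())
    return hints


def page_urls(sitemap_xml):
    """Extract page URLs from a <urlset> sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


SAMPLE_ROBOTS = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap.xml\n"
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/docs</loc></url>"
    "</urlset>"
)
print(sitemap_hints(SAMPLE_ROBOTS))
print(page_urls(SAMPLE_SITEMAP))
```

If robots.txt declares nothing, a crawler would typically also try
`/sitemap.xml` at the site root before falling back to following links.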
        
       | muggermuch wrote:
       | I like this a lot!
       | 
       | But: I feel the more of these services come into being, the more
       | likely it is that every website starts putting up gates to keep
       | the bots away.
       | 
       | Sort of like a weird GenAI take on Cixin Liu's Dark Forest
       | hypothesis
       | (https://en.wikipedia.org/wiki/Dark_forest_hypothesis).
       | 
       | (Edited to add a reference.)
        
         | marcellus23 wrote:
         | Responding just because it's a pet peeve of mine: Cixin Liu did
         | not invent the dark forest hypothesis. People were discussing
         | it, and writing science fiction books about it, for decades
         | before the 3BP books were published. Nothing against him, and
         | he definitely helped popularize the concept, but I think it's
         | incorrect to refer to it as "Cixin Liu's hypothesis".
        
       | khanan wrote:
       | Can this be deployed on-prem or is it a cloud-toy?
        
         | tompec wrote:
         | Currently just a cloud-toy.
        
       | danirogerc wrote:
       | Can I query multiple vectorized websites at once? Can I export
       | vectorized websites and host them myself? Any chance to export
       | them to a no-code format, like PDF?
        
         | tompec wrote:
         | You can group as many websites as you want into a collection.
         | Then query that collection. Not sure what you mean by
         | exporting; you would like to export the vectors themselves? Or
         | just the chunks of text from the websites?
        
       | Cynddl wrote:
       | I find it interesting that as an (edit: UK) academic researcher,
       | I would likely be forbidden to use tools like this, which fail
       | basic ethics standards, regulations such as GDPR, and practical
       | standards such as respecting robots.txt [given there's no
       | information on embedding.io, it's unlikely I can block the
       | crawler when designing a website].
       | 
       | There's still room for an ethical development of such crawlers
       | and technologies, but it needs to be consent-first, with strong
       | ethical and legal standards. The crazy development of such tools
       | has been a massive issue for a number of small online
       | organisations that struggle with poorly implemented or maintained
       | bots (as discussed for OpenStreetMap or Read The Docs).
        
       ___________________________________________________________________
       (page generated 2024-07-31 23:00 UTC)