[HN Gopher] Show HN: Open-source ETL framework to sync data from...
       ___________________________________________________________________
        
       Show HN: Open-source ETL framework to sync data from SaaS tools to
       vector stores
        
       Hey Hacker News, we launched a few weeks ago as a GPT-powered
       chatbot for developer docs, and quickly realized that the value of
       what we're doing isn't the chatbot itself. Rather, it's the time we
       save developers by automating the extraction of data from their
       SaaS tools (GitHub, Zendesk, Salesforce, etc.) and transforming it
       into contextually relevant chunks that fit into GPT's context
       window.  A lot of companies are building prototypes with GPT right
       now and they're all using some combination of Langchain/Llama Index
       + Weaviate/Pinecone + GPT3.5/GPT4 as their stack for retrieval
       augmented generation (RAG). This works great for prototypes, but
       what we learned was that as you scale your RAG app to more users
       and ingest more sources of content, it becomes a real pain to
       manage your data pipelines.  For example, if you want to ingest
       your developer docs, process them into chunks of <500 tokens, and add
       those chunks to a vector store, you can build a prototype with
       Langchain fairly quickly. However, if you want to deploy it to
       customers like we did for BentoML
       (https://www.bentoml.com/), you'll
       quickly realize that a naive chunking method that splits by
       character/token leads to poor results, and that "delete and re-
       vectorize everything" when the source docs change doesn't scale as
       a data synchronization strategy.  We took the code we used to build
       chatbots for our early customers and turned it into an open source
       framework to rapidly build new data Connectors and Chunkers. This
       way developers can use community built Connectors and Chunkers to
       start running vector searches on data from any source in a matter
       of minutes, or write their own in a matter of hours.  Here's a
       video demo:
       https://youtu.be/I2V3Cu8L6wk  The
       repo has instructions on how to get started and set up API
       endpoints to load, chunk, and vectorize data quickly. Right now it
       only works with websites and GitHub repos, but we'll be adding
       Zendesk, Google Drive, and Confluence integrations soon too.
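
       To make the Connector/Chunker split concrete, here is a minimal
       sketch of a structure-aware chunker in the spirit described above
       (split on document structure first, fall back to fixed-size splits
       only within a section). The class and method names are hypothetical
       illustrations, not the project's actual API:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str


class MarkdownChunker:
    """Split on markdown headings first, then enforce a max size,
    instead of naively cutting every N characters."""

    def __init__(self, max_chars: int = 2000):
        # ~500 tokens at a rough 4 chars/token heuristic
        self.max_chars = max_chars

    def chunk(self, doc: str, source: str) -> list[Chunk]:
        # Group lines into sections, starting a new section at each heading.
        sections, current = [], []
        for line in doc.splitlines():
            if line.startswith("#") and current:
                sections.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            sections.append("\n".join(current))

        # Only split within a section when it exceeds the size budget.
        chunks = []
        for section in sections:
            for i in range(0, len(section), self.max_chars):
                chunks.append(Chunk(section[i:i + self.max_chars], source))
        return chunks
```

       Even this simple heading-aware pass avoids the failure mode the post
       mentions, where a chunk boundary lands mid-sentence or mid-code-block
       and retrieval quality suffers.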
        
       Author : jasonwcfan
       Score  : 34 points
       Date   : 2023-03-30 16:44 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | binarymax wrote:
        | Cool project! It's not clear to me from the code where you're
        | getting embeddings. Are they all coming from OpenAI? If so, that
        | sounds expensive for personal use.
        
         | jasonwcfan wrote:
          | You can use any embeddings you want! We normally use OpenAI's
          | ada, which costs $4 per 10 million tokens; that's fine for now.
         | But eventually we'll need to figure out a way to incrementally
         | sync data from SaaS tools instead of re-vectorizing all the
         | content when the vector store needs to be updated.
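
          One common way to get the incremental sync described here (an
          assumption on my part, not necessarily how this project will
          implement it) is to hash each chunk's content and only re-embed
          chunks whose hash changed between syncs:

```python
import hashlib


def diff_chunks(old: dict[str, str], new_chunks: dict[str, str]):
    """Compare chunk-content hashes between syncs.

    `old` maps chunk_id -> content hash from the previous sync;
    `new_chunks` maps chunk_id -> current chunk text. Returns which
    chunk IDs need (re-)embedding, which should be deleted from the
    vector store, and the new hash state to persist for next time.
    """
    new_hashes = {cid: hashlib.sha256(text.encode()).hexdigest()
                  for cid, text in new_chunks.items()}
    # Re-embed anything new or whose content hash changed.
    to_embed = [cid for cid, h in new_hashes.items() if old.get(cid) != h]
    # Delete vectors for chunks that disappeared from the source.
    to_delete = [cid for cid in old if cid not in new_hashes]
    return to_embed, to_delete, new_hashes
```

          With stable chunk IDs (e.g. file path + section heading), an
          unchanged doc costs zero embedding calls on re-sync.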
        
       | jn2clark wrote:
        | Looks really interesting! Are you looking for more vector search
        | integrations? We have one here:
        | https://github.com/marqo-ai/marqo, which includes a lot of the
        | transformation logic (including inference). If so, we can do a
        | PR.
        
       ___________________________________________________________________
       (page generated 2023-03-30 23:01 UTC)