[HN Gopher] Show HN: Danswer - Open-source question answering ac...
___________________________________________________________________
Show HN: Danswer - Open-source question answering across all your
docs
My friend and I have been feeling frustrated at how inefficient it
is to find information at work. There are so many tools (Slack,
Confluence, GitHub, Jira, Google Drive, etc.) and they provide
different (often not great) ways to find information. We thought
maybe LLMs could help, so over the last couple months we've been
spending a bit of time on the side to build Danswer. It is an open
source, self-hosted search tool that allows you to ask questions
and get answers across common workspace apps AND your personal
documents (via file upload / web scraping)! Full demo here:
https://www.youtube.com/watch?v=geNzY1nbCnU&t=2s. The code
(https://github.com/danswer-ai/danswer) is open source and
permissively licensed (MIT). If you want to try it out, you can set
it up locally with just a couple of commands (more details in our
docs - https://docs.danswer.dev/introduction). We hope that someone
out there finds this useful. We'd love to hear from you in our
Slack
(https://join.slack.com/t/danswer/shared_invite/zt-1u3h3ke3b-...)
or Discord (https://discord.gg/TDJ59cGV2X). Let us know what other
features would be useful for you!
Author : Weves
Score : 128 points
Date : 2023-07-10 14:55 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| kaismh wrote:
| The video looks impressive, well done. Why didn't you build it on
| top of langchain or other similar frameworks?
| Weves wrote:
| We're actually planning on migrating to LangChain very soon
| (primarily to allow for memory / tool usage + automatic
| integrations with llamacpp / other open source model serving
| frameworks). We didn't start with it initially since we didn't
| want to restrict our usage patterns too much while we were
| (even more) unsure of what exactly we were going to build.
|
| As far as using other data connector frameworks, we found that
| we (1) didn't think they were very good and/or (2) they
| didn't support automatic syncing effectively. For larger
| enterprises, it's not feasible to do a complete sync every X
| minutes. We need to be able to get a time-bounded subset (or
| have them push updates to us), which is something LangChain,
| LlamaIndex, etc. don't support natively.
| ssddanbrown wrote:
| I maintain an open source documentation platform, for which I had
| received a few queries about AI tooling. I'm not into the AI
| world of development, and my tech stack & distribution approach
| aren't well suited to providing AI-friendly tech in my project itself,
| but connecting to external applications that can consume/combine
| multiple sources seemed like a good potential approach.
|
| I came across Danswer a few days ago as an option for this, so I
| spent a day building a connector [1]. I was pleasantly surprised
| how accurate the output was for something like this. I have a few
| pages detailing my servers and I could ask things like "Where is
| x server hosted?" and get a correct response accompanied by a
| link to the right source page.
|
| Some things to be aware of specifically about Danswer: It only
| works with OpenAI right now, although the team said that open
| model support is important as a future focus. Additionally, it
| felt fairly heavy to run and required a 30-minute docker build
| process, though I think they've improved on this now with
| pre-built images (and I'm not familiar with the usual
| requirements/weight of this kind of tech). Otherwise, things
| were easy to start up and
| play around with, even for an AI noob like me. Both their web and
| text-upload source connectors worked without issue in my testing.
|
| [1]: https://github.com/danswer-ai/danswer/pull/139
| gardnr wrote:
| There are a couple of open source projects that expose llama.cpp
| and gpt4j models via an OpenAI-compatible API. This is one of
| them: https://github.com/lhenault/simpleAI
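|
| (A minimal sketch of how that pattern is typically used -- not
| specific to simpleAI, and the endpoint and model name below are
| placeholders: point the stock OpenAI Python client at the local,
| API-compatible server instead of api.openai.com.)
|
|       import openai
|
|       # Local server that speaks the OpenAI API (hypothetical URL).
|       openai.api_base = "http://localhost:8080/v1"
|       openai.api_key = "not-needed-for-a-local-server"
|
|       response = openai.ChatCompletion.create(
|           model="local-model",  # whatever model the server exposes
|           messages=[{"role": "user", "content": "Hello!"}],
|       )
|       print(response["choices"][0]["message"]["content"])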
| andre-z wrote:
| We at Qdrant are glad to be a part of this awesome solution,
| providing the Vector Database resource for Danswer.
| https://github.com/qdrant/qdrant
| Weves wrote:
| Amazing foundational tools like Qdrant make building in this
| space so much easier <3
| JimmyRuska wrote:
| This is great, love it!
|
| Crawling sites to index the FAQs and knowledge bases into the
| vector search isn't as intimidating as it sounds, at least on
| Linux systems. Sometimes a thin wrapper function over plain old
| wget will get you 99% of the way:
|
|       wget -rnH -t 1 --waitretry=0 'https://{{domain}}' -P '{{domain}}'
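|
| (A hypothetical example of such a thin wrapper, not code from
| Danswer -- just a few lines of Python around wget to mirror a
| site into a local folder for later indexing:)
|
|       import subprocess
|
|       def mirror_site(domain: str) -> None:
|           """Recursively mirror a site into ./{domain} via wget."""
|           # Note: wget may exit non-zero on individual 404s during
|           # recursion, so we don't raise on a non-zero return code.
|           subprocess.run(
|               ["wget", "-r", "-nH", "-t", "1", "--waitretry=0",
|                f"https://{domain}", "-P", domain],
|           )
|
|       mirror_site("docs.example.com")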
| [deleted]
| danpetrov wrote:
| Sadly completely unusable for our use case - if you are targeting
| Enterprise, you should know better than to use OpenAI models as
| the only LLM available.
|
| For now I will stick to PrivateGPT and LocalGPT.
| jagtstronaut wrote:
| Completely unusable for internal docs* should be the caveat.
| For external docs OpenAI is fine unless you have stuff behind a
| password.
| Weves wrote:
| Yea, that's good feedback - we've gotten requests for open
| source model support from a lot of the people we've talked to.
| It's one of our highest priorities, and should be available
| soon!
| adr1an wrote:
| Are you planning to use GPT4ALL[^1] or something else? If it's
| the latter, please share a link to such resources... I'd be
| interested.
|
| And, to share something with you: I saw somewhere a tool
| (maybe it was GPT4ALL itself) that had the ability to expose
| an OpenAI-compatible local API on localhost:8080... Ah, yes.
| Here it is. Actually, there are two. They are described as
| possible backends for Bavarder (which provides free access to
| multiple online models, no API key required):
| https://bavarder.codeberg.page/help/local/
|
| [^1]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-
| backen...
| lmeyerov wrote:
| How are you thinking about the "document level access control" to
| make this viable for business environments?
|
| Ex: If a connected gdrive document gets indexed, but then someone
| fixes the share settings in google docs for some item to be more
| restrictive... How does Danswer avoid leaking that data? Dynamic
| check before returning any doc that the live federated auth
| settings safelist the requesting user reading that doc?
| Weves wrote:
| Great question! Right now, our access control is very basic.
| When admins set up connectors to other apps, all documents
| indexed are accessible by all (meant to be public documents
| only). Individual users can index private documents by
| providing their own access tokens for connectors, and those
| docs will be only available to the user who owns that access
| token. Improving this is a high priority item for us, as we
| understand this is a deal-breaker for enterprises.
|
| The immediate plan is to extend our current poll / push based
| connectors to also grab access information (+ add IdP
| integrations for cross-app identity). There will be some delay
| to grab access updates, which will be combatted by the dynamic
| check with the app / IdP itself at query time that you
| mentioned (still investigating exactly how this will work).
|
| We are also considering adding support for group based access
| defined within Danswer itself for sources that don't provide
| APIs to get access information (default being all-public if not
| specified). Of course, for these, we will not be able to sync
| permissions.
| XCSme wrote:
| Wait, this is not local? Why use OpenAI third-party requests
| instead of a local model?
| Weves wrote:
| Adding local/open source model support is at the top of our
| TODO list! When we started building, open source models were
| quite a bit further behind GPT-4 than they are now. At that
| time the performance gap was at a point where locally hosted
| models would provide a significantly hampered experience, but
| we think that gap has closed (and will continue to close) rapidly.
| TommyCat wrote:
| Looks great and will test it out, but for enterprises definitely
| needs support for Azure/Office 365 integration to index Word,
| Excel, etc. Lots of docs are stored in Onedrive, Teams channels,
| and SharePoint. I'm going to test these use cases, but it would
| be nice if it supported them OOB like Google Docs. Also, any thoughts on
| OOB connectors to ServiceNow or other ticketing/KB platforms?
| Weves wrote:
| Native support for the Microsoft suite of tools is something we
| plan to add fairly soon! We're a small team, and currently
| swamped with connector/feature requests so no promises on the
| timeline.
|
| Ticketing platforms like ServiceNow fall under a similar
| category, although a bit lower priority in my mind.
| lastdong wrote:
| Noooooooo, not OpenAI! It looks perfect, just allow running models
| like Vicuna or Llama locally - well, since it's open source
| anyone can contribute to make this happen.
|
| Thank you for your work, it looks great
| sixhobbits wrote:
| I've seen a few of these, and this one looks like it is more
| feature complete than many (e.g. including web scraping, which I
| think is an important component).
|
| Looks nice! Curious about the business model or is it just a
| hobby project?
| Weves wrote:
| Thanks! For now, we're just focused on making sure this solves
| a problem effectively for people. In the long term, if we're
| able to build up trust, we'll probably offer a managed version.
| tibanne wrote:
| This looks interesting. Thank you for making public. I made
| something similar that uses data from only Notion. Do you happen
| to have / be developing a Notion connector?
| Weves wrote:
| We are actively building a Notion connector! Will be out very
| soon :)
| PeterStuer wrote:
| In my experience the QA-with-documents pattern is fairly
| straightforward to implement. 90% of the effort to get to a
| performant system, however, goes into massaging the documents
| into semantically meaningful chunks. Most business documents,
| unlike blog posts and news articles, are not just running text.
| They have a lot of implicit structure that, when lost (as it is
| by the typical naive chunkers), takes much of the contextualized
| meaning with it.
| Weves wrote:
| Agree with the point about intelligent chunking being very
| important! Each individual app connector can choose how it
| wants to split each `document` into `section`s (important
| point: this is customized at an app-level). The default chunker
| then keeps each section as part of a single chunk as much as
| possible. The goal here is, as you mentioned, to give each
| chunk the relevant surrounding context.
|
| Additionally, the indexing process is set up as a composable
| pipeline under the hood. It would be fairly trivial to plug in
| different chunkers for different sources as needed in the
| future.
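|
| (A rough, hypothetical sketch of what a section-aware default
| chunker like this can look like -- not Danswer's actual code:
| a connector emits a document as a list of sections, and the
| chunker packs whole sections into chunks, only splitting a
| section when it alone exceeds the chunk budget.)
|
|       from dataclasses import dataclass
|
|       @dataclass
|       class Section:
|           text: str
|           link: str  # anchor back to the source page
|
|       def chunk_document(sections: list[Section],
|                          max_chars: int = 2000) -> list[str]:
|           chunks: list[str] = []
|           current = ""
|           for section in sections:
|               text = section.text
|               if len(text) > max_chars:
|                   # Oversized section: flush, then hard-split it.
|                   if current:
|                       chunks.append(current)
|                       current = ""
|                   chunks.extend(text[i:i + max_chars]
|                                 for i in range(0, len(text), max_chars))
|               elif len(current) + len(text) > max_chars:
|                   # Section doesn't fit: start a new chunk with it.
|                   chunks.append(current)
|                   current = text
|               else:
|                   current = current + "\n" + text if current else text
|           if current:
|               chunks.append(current)
|           return chunks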
| ttul wrote:
| I wonder how long it will be before Google Workspace just has
| this feature for your Docs. It can't be long... Question-
| answering against external docs is something Google could easily
| add. I worry about the defensibility of startups working in this
| area as it's so squarely in front of the steamroller.
| nbulka wrote:
| How does this compare to llama-index?
| Weves wrote:
| LlamaIndex is a _very_ generic framework to ingest data (from
| anywhere, with no specific context in mind). Developers then
| build on top of this framework in order to simplify the process
| of creating LLM-powered apps. Developers need to handle
| automatic syncing, build a UI to manage connections, and
| build out the actual features/functionality they desire, etc.
|
| Danswer is:
|
| (1) itself an end-to-end application which allows you to
| connect to all your workplace tools via a UI, and then ask
| questions and get answers based on these documents. The goal is
| to be a permissively licensed, open source solution to the
| enterprise knowledge retrieval problem.
|
| (2) an ingestion framework specifically targeted for enterprise
| applications. We provide a UI for admins to manage connections
| to other common workplace apps, automatically sync them into a
| vector DB + a keyword search engine, and expose APIs that allow
| access to these underlying data stores (more to come in this
| direction). We take care of access control (more in the
| pipeline here as well), only grabbing updates so we don't have
| to pull thousands (or millions) of documents every X minutes,
| etc. TL;DR: we're focused on a specific ingestion use case.
| monkeydust wrote:
| Also wondering this.
| Intox wrote:
| Another great tool solving the exact problem we want to solve -
| but by using an external service we can't use.
|
| No company of a decent size (the ones that actually reach some
| complexity of documentation) will be okay with exfiltrating
| confidential information to an external service it has no deal
| or NDA with. Sure, OpenAI is easy to integrate, but it's also an
| absolute showstopper for a company.
|
| We don't need state-of-the-art LLMs with 800k context, we need
| confidentiality.
| rolisz wrote:
| Two weeks ago I finished a project for a client who wanted a
| "talk to your documents" application, without using OpenAI or
| other 3rd party APIs, but by using open source models running
| on their own infrastructure.
|
| If you're interested in something similar, send me an email.
| dimal wrote:
| I'm kinda confused by this. Every company already keeps their
| data in Google Docs, Notion, Slack, Confluence, Jira, or any
| number of other providers. When you sign up for one of these
| services, there's always a compliance step to make sure it's
| ok. OpenAI's TOS says they don't use API data for training. So
| what makes sending this data to OpenAI different than sending
| it to any of the above providers? This is an honest question. I
| don't understand the difference.
| [deleted]
| jsiepkes wrote:
| > Every company already keeps their data in Google Docs
|
| The TOS for (paid) enterprise products such as Google
| Workspace are totally different from the (free) consumer
| versions. For example, Google can't use the data for AI
| training.
| curl-up wrote:
| The TOS of the OpenAI API (which tools like this use) does not
| allow model training on the data either. You might be
| confusing their API with ChatGPT, which has a different
| policy.
| nrjames wrote:
| For our part, we self-host Confluence and gitlab, have tons
| of internal documentation and web pages, and are prohibited
| from using external tools unless they can be hosted
| internally in a sandboxed manner. There's no way on the
| planet they would approve the use of connecting to an OpenAI
| API for trawling through internal documentation.
| rolisz wrote:
| There are open source models that perform pretty well for a
| chatbot over internal documentation. If you're
| interested, feel free to reach out to me.
| ixfo wrote:
| Trust. OpenAI's ignored everyone's copyright and legal usage
| terms for the rest of their training data, so what lawyer is
| going to trust them to follow their contractual terms?
| floomk wrote:
| Why would you send your data to the company that built its
| value by slurping up everyone's data without consent? It
| doesn't matter what they promise now, they have shown that
| they don't care about intellectual property, copyright, or any
| of that. They literally cannot be trusted.
___________________________________________________________________
(page generated 2023-07-10 23:00 UTC)