[HN Gopher] Launch HN: Danswer (YC W24) - Open-source AI search ...
       ___________________________________________________________________
        
       Launch HN: Danswer (YC W24) - Open-source AI search and chat over
       private data
        
       Hey HN! Chris and Yuhong here from Danswer
       (https://github.com/danswer-ai/danswer). We're building an
       open-source, self-hostable ChatGPT-style system that can access
       your team's unique knowledge by connecting to 25 of the most
       common workplace tools (Slack, Google Drive, Jira, etc.). You
       ask questions in natural language and get back answers based on
       your team's documents. Where relevant, answers are backed by
       citations and links to the exact documents used to generate
       them.

       Quick Demo: https://youtu.be/hqSouur2FXw

       Originally Danswer was a side project motivated by a challenge
       we experienced at work. We noticed that as teams scale, finding
       the right information becomes more and more challenging. I
       recall being on call and helping a customer recover from a
       mission-critical failure, but the error was related to some
       obscure legacy feature I had never used. For most projects, a
       simple question to ChatGPT would have solved it; but in this
       moment, ChatGPT was completely clueless without additional
       context (which I also couldn't find).

       We believe that within a few years, every org will be using
       team-specific knowledge assistants. We also understand that
       teams don't want to tell us their secrets, and not every team
       has the budget for yet another SaaS solution, so we open-sourced
       the project. It is just a set of containers that can be deployed
       on any cloud or on-premises. All of the data is processed and
       persisted on that same instance. Some teams have even opted to
       self-host open-source LLMs to truly airgap the system.

       I also want to share a bit about the actual design of the system
       (https://docs.danswer.dev/system_overview). If you have
       questions about any part of the flow, such as the model choice,
       hyperparameters, prompting, etc., we're happy to go into more
       depth in the comments.

       The system revolves around a custom Retrieval Augmented
       Generation (RAG) pipeline we've built. At indexing time (we pull
       documents from connected sources every 10 minutes), documents
       are chunked and indexed into hybrid keyword+vector indices
       (https://github.com/danswer-ai/danswer/blob/main/backend/dans...).
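
       As a rough sketch of that indexing flow (helper names like
       `embed` and `index.add` here are illustrative stand-ins, not our
       actual code):

           # Chunk each pulled document, then write every chunk to
           # both the keyword index and the vector index.
           def chunk(text: str, size: int = 512,
                     overlap: int = 64) -> list[str]:
               # Naive fixed-size chunking; the real pipeline respects
               # sentence and section boundaries.
               step = size - overlap
               return [text[i:i + size]
                       for i in range(0, len(text), step)]

           def index_document(doc_id: str, text: str,
                              index, embed) -> None:
               for i, passage in enumerate(chunk(text)):
                   index.add(
                       chunk_id=f"{doc_id}__{i}",
                       text=passage,              # BM25/keyword side
                       embedding=embed(passage),  # vector side
                   )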

       For the vector index (which gives the system the flexibility to
       understand natural language queries), we use state-of-the-art
       prefix-aware embedding models trained with contrastive loss.
       Optionally, the system can be configured to go over each doc
       with multiple passes of different granularity to capture wide
       context vs fine details. We also supplement the vector search
       with a keyword-based BM25 index + N-grams so that the system
       performs well even in low-data domains. Additionally, we've
       added learning from feedback and time-based decay; see our
       custom ranking function
       (https://github.com/danswer-ai/danswer/blob/main/backend/dans...).
       This flexibility is why we love Vespa as a vector DB.
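
       As a toy illustration of how a ranking function can combine
       those signals (the real weights and decay curve live in the
       Vespa rank profile linked above; the numbers here are made up):

           import math
           import time

           HALF_LIFE_DAYS = 90

           def hybrid_score(
               vector_sim: float,  # cosine sim from the vector index
               bm25: float,        # normalized keyword/BM25 score
               updated_at: float,  # unix time of last doc update
               feedback: int,      # net upvotes minus downvotes
           ) -> float:
               age_days = (time.time() - updated_at) / 86400
               decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
               boost = 1.0 + 0.1 * math.tanh(feedback / 5)
               return (0.6 * vector_sim + 0.4 * bm25) * decay * boost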

       At query time, we preprocess the query with query augmentation
       and contextual rephrasing, as well as standard techniques like
       stopword removal and lemmatization. Once the top documents are
       retrieved, we ask a smaller LLM to decide which of the chunks
       are "useful for answering the query" (this is something we
       haven't seen much of elsewhere, but our tests have shown it to
       be one of the biggest drivers of both precision and recall).
       Finally, the most relevant passages are passed to the LLM along
       with the user query and chat history to produce the final
       answer. We post-process by checking guardrails and extracting
       citations to link the user to relevant documents.
       (https://github.com/danswer-ai/danswer/blob/main/backend/dans...)
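
       A stripped-down version of that LLM chunk filter might look like
       the following (the prompt and parsing here are illustrative; the
       real ones are in the repo):

           from openai import OpenAI

           client = OpenAI()  # assumes OPENAI_API_KEY is set

           def chunk_is_useful(query: str, chunk: str) -> bool:
               resp = client.chat.completions.create(
                   model="gpt-3.5-turbo",
                   temperature=0,
                   messages=[{
                       "role": "user",
                       "content": (
                           "Determine if the SECTION is useful for "
                           "answering the QUERY.\n\n"
                           f"QUERY: {query}\n\n"
                           f"SECTION:\n{chunk}\n\n"
                           "Answer with exactly one word: "
                           "useful or not_useful."
                       ),
                   }],
               )
               answer = resp.choices[0].message.content.lower()
               return "not_useful" not in answer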

       The vector and keyword indices are both stored locally, and the
       NLP models run on the same instance (we've chosen ones that can
       run without a GPU). The only exception is that the default
       generative model is OpenAI's GPT; however, this can also be
       swapped out (https://docs.danswer.dev/gen_ai_configs/overview).

       We've seen teams use Danswer on problems like:

       - Improving turnaround times for support by reducing the time
         taken to find relevant documentation;
       - Helping sales teams get customer context instantly by combing
         through calls and notes;
       - Reducing lost engineering time from answering cross-team
         questions, building duplicate features due to an inability to
         surface old tickets or code merges, and helping on-calls
         resolve critical issues faster by providing the complete
         history of an error in one place;
       - Self-serve onboarding for new members who don't know where to
         find information.

       If you'd like to play around with things locally, check out the
       quickstart guide here: https://docs.danswer.dev/quickstart. If
       you already have Docker, you should be able to get things up and
       running in <15 minutes. And for folks who want a zero-effort way
       of trying it out or don't want to self-host, please visit our
       Cloud: https://www.danswer.ai/
        
       Author : yuhongsun
       Score  : 125 points
       Date   : 2024-02-22 14:20 UTC (8 hours ago)
        
       | candiddevmike wrote:
       | How does this compare to something like OpenWebUI?
        
         | yuhongsun wrote:
         | We have a strong emphasis on the retrieval half of RAG. A big
         | part of the value of Danswer is in connecting to sources like
         | Notion, Linear, GitLab, etc. We do incremental updates to keep
         | data fresh, pull in metadata, etc. We also have features to
         | manage access to documents in Danswer like RBAC.
         | 
         | Basically, (as I understand it) OpenWebUI is more like a
         | ChatGPT frontend and we're more like a unified search with an
         | emphasis on LLMs.
        
       | scrollaway wrote:
       | We're building something similar. How do I contact you?
        
         | yuhongsun wrote:
         | Email: founders@danswer.ai Slack:
         | https://join.slack.com/t/danswer/shared_invite/zt-2afut44lv-...
         | Discord: https://discord.gg/TDJ59cGV2X
         | 
         | We hear this a lot - we take it to mean we've found something
         | that people like and need!
        
       | trungdq88 wrote:
       | This is incredible. Congrats on the launch!
        
         | yuhongsun wrote:
         | Thank you for the kind words!
        
       | cpach wrote:
       | Interesting!
       | 
       | I'm curious about the business model. I see more and more YC
       | companies that are FOSS, which is nice.
       | 
       | Why did you choose a FOSS license instead of proprietary with a
       | license fee?
       | 
       | What are your plans for securing funding for further development?
       | 
       | Would be very interesting to hear your thoughts on this.
       | 
       | Congrats on the launch!
        
         | yuhongsun wrote:
         | We chose FOSS because we think this type of tool will be
         | universal in the near future. We likely won't be able to
         | directly serve millions of teams ourselves by the time that
         | happens, but by open-sourcing, if teams want it, they can
         | just set it up.
         | 
         | Especially with small teams that have no budget, they can get
         | value and never need to talk to us. We hope to grow with them.
         | 
         | In terms of funding and the economics of it, we have a cloud
         | currently which is paid and we're developing additional
         | features that are not free. For example, identifying experts on
         | the team when the system is unable to find the information to
         | answer the question directly. This is an example of a feature
         | that large teams are likely willing to pay for but small teams
         | won't need (as they know each other intimately).
        
           | cpach wrote:
           | Ok, cool. Best of luck with Danswer!
        
       | trungdq88 wrote:
       | Quick question: do you offer API? I'm hoping to integrate this
       | with an existing chat UI that I have.
        
         | yuhongsun wrote:
         | Yes, we have an API and a way of accessing it with a generated
         | API key which you can find in the admin panel.
         | 
         | Two things to note though.
         | 
         | The APIs are intended for serving the Danswer frontend. The
         | functionality is generally complete for similar use cases,
         | but the APIs aren't documented, so you'll have to look at
         | the code.
         | 
         | If you're overusing the API on the cloud without providing your
         | own OpenAI key, we will likely have to shut down the instance
         | to prevent losing too much on inference fees.
        
       | BillFranklin wrote:
       | Good luck folks! I'm glad there are projects trying to solve
       | enterprise search.
       | 
       | I guess the main problem is the "private" aspect, if I've
       | understood your goals correctly. Since most SaaS products lock
       | down the private data unless you pay enterprise fees for
       | compliance tooling.
       | 
       | For instance, if you want to ingest data from private Slack
       | channels or Notion groups, you have to get the users in those
       | groups to add your bot to them, otherwise there's no way of your
       | service getting access to the data. It's possible, just a bad UX
       | for users.
       | 
       | That said, built-in search for most SaaS products built after
       | 2015 is generally quite good (e.g. Slack has had an internal
       | Learning-to-Rank service for a while now, which makes their
       | search excellent: https://slack.engineering/search-at-slack/),
       | so you'd be solving for companies like Webex and Confluence
       | where the internal search is not great. Companies like Google
       | have internal search across products, which is the ideal end
       | state, but they have the benefit of owning the source code for
       | most of their internal products.
        
         | yuhongsun wrote:
         | I think I understand your concern but if I miss the point,
         | please follow up!
         | 
         | So regarding getting access to read knowledge from the
         | different tools: it varies tool by tool, but a lot of them
         | have API keys or app-integration options available in the
         | free tier (GitHub, Google Drive, and Confluence come to
         | mind). Other tools don't have a free tier, and you get
         | access to the API keys as part of paying for the service.
         | There are probably tools that require a premium fee for
         | integration access, but I'm not aware of any personally.
         | 
         | For the SlackBot, it can add itself to public channels but for
         | private channels someone needs to add it. It is what it is
         | sadly.
         | 
         | About search being available for most SaaS products: SaaS tools
         | are definitely improving their own searches. But I still think
         | a single place to search and aggregate data has significant
         | value. For example, as an engineer by training, often getting
         | the full picture for some customer escalation means reading
         | Slack threads, Confluence design docs, and old pull requests
         | on GitHub. It would be nice to get it all in one place.
        
           | BillFranklin wrote:
           | > It is what it is sadly.
           | 
           | This is what I mean -- previously I built a similar search
           | engine on top of slack, notion, etc., but didn't launch the
           | product because I thought that requiring users to constantly
           | add bots to private channels would be a subpar experience. I
           | thought this would be a blocker for good UX, so didn't go
           | further, but maybe you'll find a nice solution!
           | 
           | Searching over public internal data is addressed by a few
           | existing tools, but it's the private aspect which is pretty
           | difficult to handle and disastrous to get wrong when managed
           | ad-hoc - e.g. someone accidentally adds the bot to a private
           | slack group called #layoffs :) so you'd want this handled
           | properly and centrally.
           | 
           | I guess you'll also need to handle privacy well, ~maybe it's
           | OK when run as a SaaS for db admins to have access to
           | ingested data, but if it's OSS then the people that run it
           | probably shouldn't be able to read the private data that's
           | ingested, so now you need to handle search over encrypted
           | data, which is a fun problem :D
        
             | nl wrote:
             | > ~maybe it's OK when run as a SaaS for db admins to have
             | access to ingested data, but if it's OSS then the people
             | that run it probably shouldn't be able to read the private
             | data that's ingested
             | 
             | I don't understand the distinction here. If Danswer runs a
             | SaaS version then yes I agree they can have a license
             | agreement that lets their DB Admins see data in some cases
             | which is fine. That seems an orthogonal issue to if a
             | company is running the OSS version internally, in which
             | case presumably their administrator can see all docs (but
             | software administrators usually can do this anyway).
        
               | Weves wrote:
               | Yep, this is exactly correct! For our SaaS version, we do
               | have an agreement which allows us to look at data if
               | needed to debug issues and/or improve search performance.
               | 
               | For self-hosted deployments, usually a select few
               | admins who have set up the plumbing on AWS do have
               | access (but as nl has mentioned, these people usually
               | have superuser access on the tools we connect to
               | anyway, so this is a noop).
        
       | Oras wrote:
       | The integration part (connectors) is the key here. I can see how
       | beneficial this would be for companies as they can plug and play.
       | 
       | Adding the vectorisation locally is superb. I've played around
       | with sbert models before, and the ability to run without a GPU
       | is going to simplify the process a lot.
        
         | yuhongsun wrote:
         | Ah yes, this reminds me! I forgot to mention it, but the
         | local NLP models we run are in the range of 100 million
         | parameters, so they can run on CPU (no GPU required!) with
         | pretty low latency.
         | 
         | Also a fun tidbit on the connectors, more than half of them now
         | are built by open source contributors! We just have an
         | interface that needs to be implemented and people have been
         | able to figure it out generally.
        
       | michaelmior wrote:
       | I noticed the Google Drive connector includes sheets. It looks
       | like for the time being, these just get indexed as CSV files. That
       | seems like it would miss a lot of context since a good number of
       | spreadsheets aren't structured as a simple table. I'm wondering
       | if you have any plans to make spreadsheet indexing more useful
       | going forward.
        
         | yuhongsun wrote:
         | Ya, handling spreadsheets is a beast of its own. We have this
         | simple implementation to cover easy cases for folks but likely
         | it will need its own more involved pipeline for indexing,
         | retrieval, and interactions with the LLM.
         | 
         | Currently with large tables, it's not handled very well either.
         | The more complete approach would be to pass the headers to the
         | LLM and ask it to generate a formula to parse the data rather
         | than feeding the whole table(s) to the LLM directly.
         | 
         | Some of the bigger items we want to target that require
         | special flows are code search, SQL tables, and
         | Excel/CSVs/TSVs.
        
           | cyanydeez wrote:
           | most of the files I have, I'm most interested in finding
           | graphs and then updating relevant data.
           | 
           | it looks like the best way to do that is understand the ooxml
           | format in xlsx. it's all fairly easy to understand.
        
             | yuhongsun wrote:
             | Ya, parsing the file is generally not bad at all. The
             | problem comes with the fact that LLMs are notoriously bad
             | with numbers and formatted data. So the current approach of
             | passing relevant information to the LLM and asking it to
             | generate answers will produce misleading information when
             | larger tables are passed in.
             | 
             | By asking the LLM to generate a formula though, it doesn't
             | actually need to do any number crunching of its own which
             | makes solving the challenge a bit more reliable when it
             | comes to LLMs.
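             | 
             | To make the formula idea concrete, here's a rough sketch
             | (hypothetical; we don't ship this yet). The LLM sees
             | only the column headers and writes a pandas expression,
             | so it never does arithmetic itself:
             | 
             |     import pandas as pd
             |     from openai import OpenAI
             | 
             |     client = OpenAI()
             | 
             |     def answer_table_question(df: pd.DataFrame, q: str):
             |         prompt = (
             |             f"A pandas DataFrame `df` has columns: "
             |             f"{list(df.columns)}.\n"
             |             f"Write one pandas expression answering: {q}\n"
             |             "Reply with only the expression."
             |         )
             |         resp = client.chat.completions.create(
             |             model="gpt-3.5-turbo",
             |             temperature=0,
             |             messages=[{"role": "user", "content": prompt}],
             |         )
             |         expr = resp.choices[0].message.content.strip("` \n")
             |         # eval of model output is unsafe outside a
             |         # sandbox; this is just a demo.
             |         return eval(expr, {"df": df, "pd": pd})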
        
               | cyanydeez wrote:
               | I've limited my expectations for LLM support to file
               | management. locating relevant datasets or filing away
               | things. the interactive QA just doesn't seem salient beyond
               | some high level.
        
               | yuhongsun wrote:
               | A lot of people are very bullish on AI, it's very
               | interesting to hear the opposite side. My opinion is that
               | LLMs are very powerful at digesting and distilling
               | knowledge which is why we built this project. I also
               | think that LLMs are terrible reasoning engines and so
               | agent-flows are not quite ready for primetime.
               | 
               | Would love to hear your perspective on the space!
        
               | cyanydeez wrote:
               | I certainly see the value of large document retrieval and
               | various forms of search.
               | 
               | However, what seems to be the business proposition is
               | giving managers shallow access to documents, which
               | won't lead to rigorous information.
               | 
               | There's a few middle grounds where it can yield insights.
               | like regulatory scenarios where you want to understand
               | how public orgs satisfy permits with written plans.
               | 
               | however, what I don't believe will yield is the context
               | size. when I want to explore my knowledge base, I need
               | far more than 128k, and there are several orders of structure
               | that language itself is not going to bridge.
        
       | tibanne wrote:
       | Are you using anything like LlamaIndex internally, or did you
       | write it from scratch without the assistance of a helping
       | wrapper like this?
        
         | yuhongsun wrote:
         | We use LlamaIndex very sparingly. Specifically the context
         | aware document chunking functionality is via LlamaIndex.
         | 
         | We couldn't use the more involved pipelines because we needed
         | significant custom logic to enforce permissions, filters (like
         | time filter, source filter, document-set filters), and other
         | complexities. At that point, it's easier to write from scratch
         | rather than conform to expectations of these third party
         | libraries.
        
           | tibanne wrote:
           | I was considering starting a project just like this using
           | Llamaindex but I think I'll give yours a try first before
           | going that route. Looks good. Thank you.
        
             | yuhongsun wrote:
             | I think these developer platforms like LlamaIndex and
             | Langchain are super great for prototyping and understanding
             | the crowdsourced best approaches in solving these new LLM
             | related challenges.
             | 
             | Depending on how custom the pipelines need to be, you'll
             | either find that you've saved a huge amount of time using
             | these libraries, or you'll find that you have no option but
             | to switch off and build from scratch.
        
       | ramoz wrote:
       | Your methodology is nice.
       | 
       | Is there any work or planned work around enterprise
       | authentication and access? For instance, indexing Sharepoint in
       | such a way that a user of Danswer isn't exposed to Sharepoint
       | information they wouldn't otherwise have access to?
        
         | yuhongsun wrote:
         | Thank you!
         | 
         | Yes, there are several options for user authentication (Basic
         | Auth, Google OAuth, OIDC, SAML).
         | 
         | Currently the RBAC is managed via Danswer, and this
         | controls who has access to which documents (it's done at
         | the connector level, as it would be untenable to assign
         | access to documents individually).
         | 
         | We're also working on automatically syncing permissions
         | from the sources: basically seeing which emails have access
         | to each doc and mapping them to Danswer users.
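         | 
         | As a sketch of what that sync could look like (all names
         | here are hypothetical; the feature is still in progress):
         | 
         |     # Read each document's ACL from the source and map
         |     # those emails onto Danswer users, so results can be
         |     # filtered per user.
         |     def sync_permissions(source, email_to_user_id, index):
         |         for doc in source.list_documents():
         |             allowed = [
         |                 email_to_user_id[email]
         |                 for email in source.get_acl(doc.id)
         |                 if email in email_to_user_id
         |             ]
         |             index.update(doc.id, allowed_users=allowed)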
        
       | carlossouza wrote:
       | Congrats on the launch! Does it search within pdf files?
        
         | yuhongsun wrote:
         | Thanks! Yes, it does do PDFs. We don't do anything fancy
         | with them though, like Optical Character Recognition (OCR),
         | so pictures of text, as well as images and graphs, will be
         | lost. This is something we will work on though.
         | 
         | Is this something that you would find a lot of value in or is
         | simple text processing of PDFs sufficient?
        
           | canadiantim wrote:
           | Not OP, but I would definitely find a lot of value from
           | processing PDFs in such a way that it could eg understand
           | tables and images. I work in mining and having it digest a
           | 43-101 technical report with images and tables would be
           | supremely valuable.
           | 
           | I know that might be a niche case tho.
           | 
           | Absolutely incredible work you're doing tho wow, I'm very
           | impressed by what you're doing and the way you're doing it.
           | Even if you stopped now this is a masterpiece, so while yes I
           | would definitely find a lot of value from being able to
           | process images and graphs/tables, simply being able to
           | process the text and cite it is already a superpower. Thank
           | you for your amazing work!!!
        
           | johntash wrote:
           | I'd benefit from OCR too. Not just PDFs, but OCR on images
           | could be super useful too.
           | 
           | For a personal use case, I'm thinking things like receipts.
           | For work, I'm thinking OCR on architecture diagrams/etc.
        
       | tibanne wrote:
       | Another question. If I host this publicly for people that I work
       | with on their data, how can I make sure that it's only them that
       | can access the service? Do you have any form of auth?
        
         | yuhongsun wrote:
         | Yes! There is Basic Auth (email + password with email
         | verification) and Google OAuth available in the free version.
         | 
         | We also do OIDC and SAML to integrate with Identity
         | Providers (IdPs) like Okta, but that's part of the paid
         | features. Ahhh please don't hate us!
        
           | tibanne wrote:
           | With the free version, can I constrain the emails to be from
           | one domain? i.e. the company domain
        
             | Weves wrote:
             | Chris here (the other founder) - yes you can! We have a
             | `VALID_EMAIL_DOMAINS` env variable which controls this.
             | 
             | For example, for us we have
             | `VALID_EMAIL_DOMAINS=danswer.ai`.
        
               | tibanne wrote:
               | Awesome, thank you!
        
       | bredren wrote:
       | Does the GitHub connector imply the ability to ask questions
       | of an entire code base, such as: "help me write an endpoint
       | in the style of our codebase"?
       | 
       | I believe there are a variety of projects focused on that problem
       | but would be good to latch onto one that handles and integrates
       | externalities.
       | 
       | Speaking of which, I did not see Sentry in the list of
       | connectors or mentioned in an issue. Any plans there?
        
         | yuhongsun wrote:
         | Code search is something we have in our sights in the next
         | couple months. Currently the GitHub connector pulls in PRs and
         | Issues but not the whole code base. We wanted to have the
         | best RAG pipeline, so we went deep on that. Code search
         | uses a combination of graph-based traversals and a
         | different type of
         | embedding so it's a separate effort, but we will definitely
         | build it out since it's immensely useful to engineering teams!
        
         | yuhongsun wrote:
         | So the intent of the GitHub/GitLab connector in the current
         | iteration is to help people easily find implementations that
         | have been done before. For example, if a new bug comes up with
         | some feature, I can easily search for the PRs relating to that
         | feature in natural language and filter down the code changes
         | that may have caused it.
        
       | Akashic101 wrote:
       | What an amazing tool, I can't wait to implement this in our
       | workflow. How well does this work with documents in other
       | languages besides English?
        
         | yuhongsun wrote:
         | Glad you asked! We are actually the only project (open
         | source or closed), as far as I know, that handles
         | multilingual use cases quite well. We have options for
         | multilingual query expansion and also multilingual
         | embedding models.
         | 
         | Without dropping any names, a big French company with 5000
         | people is actually doing this with Danswer and has found
         | great success.
         | 
         | There is some info here:
         | https://docs.danswer.dev/configuration_guide - you'll also
         | want to change the embedding model in the admin UI.
        
       | BrandiATMuhkuh wrote:
       | This is really nice. Congratulations for launching.
       | 
       | Just in the last 2-3 weeks I've had talks with enterprise
       | companies regarding this topic. It seems to be on every CEO's
       | agenda. I have talked to a couple of startups who wanted to
       | do something similar to you. But they all feared Microsoft
       | Copilot is not beatable. So they don't even try.
        
         | yuhongsun wrote:
         | This is also another reason why we think OSS is the way to go
         | here. Taking on the tech giants alone is definitely a daunting
         | task (maybe even impossible for a small isolated team).
         | 
         | The hope is that by working with the community, we'll be able
         | to incorporate the best ideas and contributions from a large
         | pool of like-minded people to build something everyone can
         | benefit from!
         | 
         | OSS has absolutely taken off in the NLP space, and the
         | excitement has bled over to developer platforms, resulting
         | in some outstanding projects; hopefully the same will happen
         | with LLM applications.
        
         | sigmoid10 wrote:
         | >But they all feared Microsoft Copilot is not beatable. So they
         | don't even try.
         | 
         | The thing is, you don't need to beat copilot. Copilot may well
         | be the worse system and Microsoft can still win because they
         | offer the most enterprise-y solution. I wouldn't be worried
         | about competing with them on functionality. But I also would
         | never even try to outdo them on a business level.
        
           | yuhongsun wrote:
           | I definitely see what you mean. They also have advantages in
           | bundling co-pilot with their other offerings. This will in no
           | means be an easy battle for us, but we have hope that we'll
           | be able to build something people love and end up using!
        
       | parthi2929 wrote:
       | Do we have options for developing our own custom connectors?
       | For example, lesser-known apps like ERPNext. In other words,
       | for any app not in Danswer, we should be able to create a
       | custom connector and use it.
        
         | yuhongsun wrote:
         | Yes! And we'd love it if you contribute them back to the
         | project! More than half of the connectors are community
         | contributed at this point and it's by far the most common area
         | of contribution.
         | 
         | There's a simple Document interface that needs to be
         | implemented to provide stuff like title, content, link,
         | etc. From there, the rest of the Danswer code handles the
         | indexing and makes it available for querying.
         | 
         | There's a contributing guide here: https://github.com/danswer-
         | ai/danswer/blob/main/CONTRIBUTING...
         | 
         | And a connector contributing guide here:
         | https://github.com/danswer-ai/danswer/blob/main/backend/dans...
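         | 
         | For a feel of what that looks like, here's a toy connector
         | (the exact base class and Document fields are in the guides
         | above; the names here are approximate):
         | 
         |     from dataclasses import dataclass
         | 
         |     @dataclass
         |     class Document:
         |         id: str
         |         title: str
         |         content: str
         |         link: str
         | 
         |     class ErpNextConnector:
         |         def __init__(self, base_url: str, api_key: str):
         |             self.base_url = base_url
         |             self.api_key = api_key
         | 
         |         def _fetch_records(self) -> list[dict]:
         |             # Placeholder: call the tool's REST API here.
         |             return []
         | 
         |         def load_documents(self):
         |             # Danswer handles chunking, indexing, and
         |             # querying once Documents are yielded.
         |             for rec in self._fetch_records():
         |                 yield Document(
         |                     id=rec["name"],
         |                     title=rec["title"],
         |                     content=rec["description"],
         |                     link=f"{self.base_url}/{rec['name']}",
         |                 )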
        
       | csmpltn wrote:
       | What's your moat? FAANG+ are all working on similar products.
        
         | yuhongsun wrote:
         | We're leaning heavily into the open source aspect. We think
         | that a solution like this will be useful to even smaller teams
         | (10-50) that other companies won't want to target. There is
         | some non-trivial amount of setup required, specifically chasing
         | down the API keys etc. So for the SaaS alternatives, they rely
         | heavily on their sales orgs being very hands on so it makes no
         | sense for them to target small teams. In fact for many of them,
         | if you try to sign up for a demo and you say you're fewer than
         | 50 people, they straight up ignore you.
         | 
         | So hopefully teams will self-adopt Danswer and as they grow,
         | they will keep using us!
         | 
         | For larger teams, the transparency and peace of mind of self-
         | hosting an open source solution is also a major benefit. We've
         | often heard from large teams that have adopted Danswer, that
         | the customizability of it has been a driving factor in their
         | adoption. They want to own the solution and they want to
         | customize it specifically for their needs. At the very basic
         | level, a lot of teams have swapped in domain specific embedding
         | models and prompts, but we've seen some significantly more
         | involved customizations as well.
        
       | flaviuspopan wrote:
       | This is one of those posts that makes me feel like I'm in the
       | right place at the right time. Thank you, this is a fantastic
       | piece of tech.
        
         | yuhongsun wrote:
         | Thanks so much! It's always rewarding to hear people sharing
         | our excitement!
        
       | pstorm wrote:
       | I've been planning on building some of this for an internal tool,
       | but now it looks like I don't have to. I'm impressed by the demo,
       | it looks really polished.
       | 
       | I'm particularly surprised by the speed considering all of the
       | pre- and post-processing. I am doing some similar things, and
       | that is one of the bottlenecks. I'll dig in, but I'm curious
       | what models you
       | are using for each of these steps.
        
         | yuhongsun wrote:
         | A lot of teams we talk to switched from an in-house solution to
         | either directly using Danswer or building on top of Danswer.
         | Glad you liked the demo!
         | 
         | We're using E5 base by default, but there's an embedding
         | model admin page to choose alternatives. There's also an API
         | for it; if you know what you're doing, you can even set one
         | of the billion+ parameter LLM bi-encoders (but you'd need a
         | GPU for sure).
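         | 
         | For reference, E5-style models are also the "prefix-aware"
         | part mentioned in the post: queries and passages get
         | different prefixes at encode time. Roughly (assuming the
         | e5-base-v2 checkpoint here):
         | 
         |     from sentence_transformers import SentenceTransformer
         |     from sentence_transformers.util import cos_sim
         | 
         |     model = SentenceTransformer("intfloat/e5-base-v2")
         | 
         |     # Distinct prefixes keep short passages from being
         |     # mistaken for short queries at retrieval time.
         |     q = model.encode("query: how do I rotate my API key?")
         |     p = model.encode("passage: API keys can be rotated "
         |                      "from the admin panel.")
         |     print(cos_sim(q, p))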
        
       | ned_at_codomain wrote:
       | Congrats on launching guys.
       | 
       | When I was in consulting, being able to search our internal
       | documents was absurdly painful.
       | 
       | We had all this data from client teams spanning decades and
       | continents, but I could never find what I needed. It was all
       | locked up in these silly PowerPoint files -- not even necessarily
       | PDFs. I'd literally spend all night sometimes clicking through
       | pages by hand.
       | 
       | Corporate knowledge management is an absolutely immense problem,
       | and I'm thrilled to see you tackling it.
        
         | yuhongsun wrote:
         | Ya, there's definitely a common thread there! A lot of
         | consulting firms have given us very positive feedback from
         | using Danswer. I think the nature of time-boxed projects
         | and frequent scope changes means that people always have to
         | take in new information, and a lot of documented knowledge
         | is lost or at least difficult to find.
         | 
         | A great use case for sure!
        
       | esafak wrote:
       | I would suggest putting real-time ingestion on the roadmap;
       | that would unlock a lot of new use cases.
       | 
       | Did you use a library to implement the integrations?
        
         | yuhongsun wrote:
         | Do you mean directly uploading files in the UI and chatting
         | against those files? This one will be done within the next
         | few weeks; it's a very high-value item for sure.
         | 
         | Alternatively, if you're talking about real-time indexing
         | to make documents available to everyone immediately,
         | there's an Ingestion API where users can send documents in
         | the expected format directly to the system. Is this what
         | you're thinking?
         | 
         | The integrations are built in house, some using client
         | libraries of the particular tool (like the Atlassian python
         | client library for example). We considered using Airbyte,
         | LlamaHub etc. but we found that they don't support the full
         | flexibility that we need, including pulling incremental updates
         | and access permissions.
        
           | esafak wrote:
           | I mean streaming insertions and indexing using pub/sub or
           | something.
        
             | yuhongsun wrote:
             | Got it, we considered it but most connectors that we pull
             | from don't support this. Might bring it back for the ones
             | that do, thanks for the suggestion!
             | 
             | It's actually still in the code, just none of the
             | connectors implement it atm: https://github.com/danswer-
             | ai/danswer/blob/main/backend/dans...
        
       | smeej wrote:
       | I think the biggest problem I've run into with company
       | documentation is that the relevant docs either don't exist at all
       | or are woefully out of date. Sure, there might be a procedure doc
       | that spelled out how to handle a particular type of issue, but it
       | has probably been updated three times in Slack DMs and twice on
       | Zoom calls since the doc was actually written. And maybe at least
       | once, the company has "declared bankruptcy" on, say, Confluence,
       | and half-converted a few things over to Notion.
       | 
       | If Danswer could surface the "official" answer, such as it is,
       | but then also note how old the authoritative doc is, and which
       | teams have been messaging about it and/or scheduling calls with
       | related names more recently, and then tell me who's responsible
       | on that team for knowing the _actual_ procedure, I'd never
       | need another tool!
        
         | yuhongsun wrote:
         | We definitely recommend trying it out!
         | 
         | As far as how outdated information is handled - we pass the
         | most relevant documents along with metadata to the LLM. So in
         | the case you mentioned, the LLM will be provided the procedure
         | doc and the time it was updated, the relevant Slack messages
         | and the times they were sent, and the call transcripts along
         | with when the call happened. The LLM tends to handle this
         | pretty well.
         | 
         | Additionally, during the retrieval phase, there's a time based
         | decay applied based on the last time the document was updated.
         | Also there is learning from feedback so users can upvote
         | documents that are useful and downvote documents that are
         | deprecated.
        
       | jdthedisciple wrote:
       | How does this compare to simply rolling your own OpenAI Assistant
       | (apart from direct integration of Slack etc.)?
        
         | yuhongsun wrote:
         | So one of the main things we do is automatically syncing
         | documents from your team's different sources of knowledge.
         | So all of the data connectors as well as the user
         | authentication and access systems would have to be built from
         | scratch if you did your own. Also if you have more than a few
         | documents you would have to recreate the RAG pipeline (and ours
         | is fairly involved so it would be quite some work). Finally
         | there's the UI and other features like learning from feedback,
         | usage analytics, chat history, etc etc.
         | 
         | If you're just looking to upload a few personal docs into a
         | chat assistant for your own use, probably Danswer is overkill
         | and more complex than the effort is worth. If you're thinking
         | of a team wide use case, then using Danswer makes sense.
        
       | Beefin wrote:
       | Another RAG chat bot ... YC is pretty unoriginal huh
        
       | acod wrote:
       | Really like the idea! The company I work for has recently
       | contracted with Glean (glean.com), which seems to serve the
       | same purpose, but imo the killer feature they lack is being
       | able to work collaboratively with the AI to produce an answer
       | by enabling the human operator to explicitly scope down the
       | context of a chat to specific documents and then converse
       | with the document in question.
       | 
       | Sometimes you know roughly where the data you're looking for
       | exists but the artifact containing the information is extremely
       | dense to interpret. For example, a runbook for a system could
       | span 10s to 100s of pages and to actually accomplish what you
       | want means interpreting and joining information from different
       | sections of the same document. It seems like there's potential
       | here to allow an expert to define the explicit scope of what
       | to search and then include information in the context as wide
       | or as narrow as the question requires.
        
         | yuhongsun wrote:
         | Sounds like Danswer might just fit your needs. Also ya, we
         | came up with this idea of chatting with documents that you
         | can select on the fly; I think we're still the only ones
         | who do this. People have really been liking that one!
         | 
         | If you happen to want to talk to us about Danswer, we'd love to
         | welcome you to our Slack:
         | https://join.slack.com/t/danswer/shared_invite/zt-2afut44lv-...
        
       | DrWonbor wrote:
       | I like open source software. I like what you are doing and
       | keeping development open.
       | 
       | I have been closely watching AI development. There are 10k+
       | apps now using AI. Every major company - FAANG, Tier-2,3,4,5 -
       | now has AI as a top priority. However, there's got to be
       | something coming out of wrapper software. I have not read the
       | docs entirely yet. I have a few questions for you that might
       | give us an idea of whether this fits our use case.
       | 
       | 1. Which models are you using for this? Can I switch models to
       | open source?
       | 
       | 2. When you say connect to Apps, how often are you pulling data
       | from these apps? For example, you connect to confluence where
       | tens of wikis get updated. How much of that ends up in your
       | vector DB?
       | 
       | 3. Most important, what separates you from tens of other
       | providers out there? Glean, as someone commented, is very similar
       | to what you are doing.
       | 
       | 4. How do you plan to convince SMBs and mid-size companies to use
       | you over say in-house development?
       | 
       | 5. OpenAI, Mistral, Claude and other LLM model developers can
       | build this functionality natively into their offering. Are you
       | concerned about becoming obsolete or losing competitive ground?
       | If not, why?
       | 
       | Either way, this is a good direction. I will try it out tonight.
       | Feel free to respond when you get a chance.
        
         | yuhongsun wrote:
         | Hello, thanks for the kind words! With regards to your
         | questions:
         | 
         | 1. Are you referring to the local NLP models or the LLM? The
         | local models are already open source models or ones we've
         | trained ourselves. If you're talking about the LLM, the default
         | is OpenAI but it's easy to configure other ones without any
         | code changes.
         | 
         | 2. Most sources are polled every 10 minutes. They have
         | incremental updates, so if you have a Confluence with a
         | million pages, probably only a dozen or so have been updated
         | in the last 10 minutes (see the sketch at the end of this
         | comment). The only exception is websites (which are crawled
         | recursively, so we don't know which pages have been updated
         | before we try); those are re-crawled once a day.
         | 
         | 3. Glean is indeed similar. Without going into the features in
         | detail, we are an open source Glean with more of an emphasis on
         | LLMs and Chat.
         | 
         | 4. There's generally not a great reason to build from scratch
         | if an open source alternative with 75%+ alignment exists. They
         | can always build on top of us if they want. A lot of teams
         | reach out to us because they were looking to switch from their
         | in house solution to Danswer. Generally though these are larger
         | teams, we haven't seen many SMBs building RAG for their own
         | usage, usually these smaller teams building RAG are looking to
         | productize.
         | 
         | 5. Currently there is no cheap and fast way to fine-tune LLMs
         | every time a document is updated. If you want an LLM to
         | remember the document that was just updated, you'd have to
         | augment it with at least dozens of similar (but all correct)
         | examples. RAG is still the only viable option. Then there is
         | the problem of security etc. since you can't enforce user roles
         | at the LLM level. So companies that focus on building LLMs
         | don't really compete in this specific space and they don't want
         | to either as they're trying to build AGI. There is more of a
         | threat from teams like Microsoft and Google who are indeed
         | trying to build knowledge assistants for their product lines,
         | but we think there is a world where open source ends up winning
         | against the giants!
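         | 
         | Re: point 2, the incremental pull loop is roughly this shape
         | (names here are illustrative):
         | 
         |     import time
         | 
         |     POLL_INTERVAL_SECS = 600  # every 10 minutes
         | 
         |     def poll_forever(source, index):
         |         checkpoint = 0.0  # time of last successful sync
         |         while True:
         |             started = time.time()
         |             # Ask the source only for docs changed since the
         |             # last run, then re-index just those.
         |             for doc in source.fetch_updated_since(checkpoint):
         |                 index.upsert(doc)
         |             checkpoint = started
         |             time.sleep(POLL_INTERVAL_SECS)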
        
       | sroussey wrote:
       | > Once the top documents are retrieved, we ask a smaller LLM to
       | decide which of the chunks are "useful for answering the query"
       | 
       | This sounds like normal re-ranking. How is it different?
        
         | Weves wrote:
         | Well the most standard approach is to use cross-encoders (e.g.
         | something like Cohere Rerank) to give similarity scores between
         | the query and the chunk, and then use these scores to update
         | the ranking.
         | 
         | Our approach is to use an LLM (gpt-3.5-turbo for example), and
         | to ask it explicitly "Is this chunk <CHUNK> useful for
         | answering this query <QUERY>". We've found, while certainly a
         | bit more expensive, the larger model size and greater
         | understanding of the world allows this approach to yield
         | significantly better results than the SOTA cross-encoders. It
         | also allows us to ask the model to explain why it's useful,
         | which can be really helpful for the user when determining if
         | they should look deeper into a document (as opposed to the
         | standard keyword-based highlighting which often isn't very
         | useful when determining if a document actually has useful
         | information for your query).
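         | 
         | For contrast, the standard cross-encoder rerank we're
         | comparing against looks roughly like this (the checkpoint
         | choice is just an example):
         | 
         |     from sentence_transformers import CrossEncoder
         | 
         |     reranker = CrossEncoder(
         |         "cross-encoder/ms-marco-MiniLM-L-6-v2")
         | 
         |     def rerank(query: str, chunks: list[str]) -> list[str]:
         |         # Score each (query, chunk) pair directly, then
         |         # sort best-first.
         |         scores = reranker.predict(
         |             [(query, c) for c in chunks])
         |         ranked = sorted(zip(scores, chunks), reverse=True)
         |         return [c for _, c in ranked]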
        
           | sroussey wrote:
           | Interesting! Thanks for the explanation.
        
       | anonu wrote:
       | How do you prevent "how much do my colleagues make?" questions
       | from being answered to the wrong people? I know you mention
       | citations and the ability to backtrack a "fact" to the source
       | document. How robust is this?
        
         | Weves wrote:
         | So we're leaning on access control to do this! Right now we
         | support manually configured group-based access at the connector
         | level (e.g. Users X, Y, and Z make up the `Engineering` group
         | and that group should have access to Folders A, B, C in Google
         | Drive + Github).
         | 
         | We're also in the process of adding the ability to sync
         | permissions from sources. For example, with this in place you
         | would only be able to chat with / search over documents in
         | Google Drive that you have access to. Since everything is RAG
         | based rather than any fine-tuning, this will guarantee that
         | someone asking "how much do my colleagues make?" will not get
         | an answer UNLESS they already have access to the document that
         | has this info (in which case, it shouldn't be a problem :D)
        
       | jhoechtl wrote:
       | Can I connect it to any OpenAI REST-compatible LLM? E.g.,
       | having my own on-premises LLM behind Ollama's OpenAI REST
       | endpoint?
        
         | Weves wrote:
         | Yes absolutely! We actually have a doc specifically for Ollama:
         | https://docs.danswer.dev/gen_ai_configs/ollama
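         | 
         | And to show what "OpenAI-compatible" buys you: Ollama serves
         | an OpenAI-style endpoint, so any client that can override
         | the base URL can talk to it. A quick smoke test (the model
         | name depends on what you've pulled locally):
         | 
         |     from openai import OpenAI
         | 
         |     client = OpenAI(
         |         base_url="http://localhost:11434/v1",
         |         api_key="unused",  # Ollama ignores it, but the
         |                            # client requires a value
         |     )
         | 
         |     resp = client.chat.completions.create(
         |         model="llama2",
         |         messages=[{"role": "user",
         |                    "content": "Say hello"}],
         |     )
         |     print(resp.choices[0].message.content)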
        
       | ck_one wrote:
       | Congrats on the launch!
       | 
       | In what way is "prefix-aware embedding models trained with
       | contrastive loss" better than the standard embedding model
       | provided by OpenAI?
       | 
       | "added in learning from feedback and time based decay" => Sounds
       | interesting! Have you seen significant gains in precision and
       | recall here?
       | 
       | It looks like you are using NextJS app dir + external backend.
       | Why did you decide against NextJS for frontend and backend? Are
       | you happy with your choice?
        
         | Weves wrote:
         | OpenAI's models may fit that description as well under the
         | hood. Specifically, for `prefix-aware`, this is useful when you
         | have short passages (e.g. Slack messages) that you are trying
         | to match against short queries (e.g. user questions). Without
         | being prefix-aware, the model can get confused, think both are
         | queries, and cause any short passages to match very strongly
         | with short queries.
         | 
         | For learning from feedback for sure! No exact benchmarks, but
         | we've heard from quite a few users about how useful this is to
         | push high quality docs up and reduce the prevalence of poor
         | docs. This is all very hard to evaluate since there aren't
         | readily available, real-world "corporate tool / knowledge base"
         | datasets out there. We're actually building our own in house
         | right now, so we should have more concrete numbers around these
         | things soon.
         | 
         | For the backend, we do a lot of stuff with local embedding
         | models / cross encoders / tokenization / stemming / stop word
         | removal etc. Python has the most mature ecosystem for this
         | kinda stuff (and the retrieval pipeline is the core of our
         | product), so we don't regret it at all!
        
       | aberzun wrote:
       | Congrats on the launch! Don't think anyone has mentioned this yet
       | but that is a fire name! :) Love the pun.
        
         | Weves wrote:
         | Thank you for the kind words!
         | 
         | Yea, the name has many meanings. It's a fun little puzzle to
         | find them all :)
        
       ___________________________________________________________________
       (page generated 2024-02-22 23:00 UTC)