[HN Gopher] Launch HN: Hamming (YC S24) - Automated Testing for ...
       ___________________________________________________________________
        
       Launch HN: Hamming (YC S24) - Automated Testing for Voice Agents
        
       Hi HN! Sumanyu and Marius here from Hamming
       (https://www.hamming.ai). Hamming lets you automatically test your
       LLM voice agent. In our interactive demo, you play the role of the
       voice agent, and our agent will play the role of a difficult end
       user. We'll then score your performance on the call. Try it here:
       https://app.hamming.ai/voice-demo (no signup needed). In practice,
       our agents call your agent!  LLM voice agents currently require a
       lot of iteration and tuning. For example, one of our customers is
       building an LLM drive-through voice agent for fast food chains.
       Their KPI is order accuracy. It's crucial for their system to
       gracefully handle dietary restrictions like allergies and customers
       who get distracted or otherwise change their minds mid-order.
       Mistakes in this context could lead to unhappy customers, potential
       health risks, and financial losses.  How do you make sure that such
       a thing actually works? Most teams spend hours calling their voice
       agent to find bugs, change the prompt or function definitions, and
       then call their voice agent again to ensure they fixed the problem
       and didn't create regressions. This is slow, ad hoc, and feels like
       a waste of time. In other areas of software development, automated
       testing has already eliminated this kind of repetitive grunt work
       -- so why not here, too?  We had initially spent a few months
       helping users create evals for prompts & LLM pipelines, but we
       noticed two things:  1) Many of our friends were building LLM
       voice agents.  2) They were spending too much time on manual
       testing.
       This gave us evidence that there will be more voice companies in
       the future, and they will need something to make the iteration
       process easier. We decided to build it!  Our solution involves four
       steps:  (1) Create diverse but realistic user personas and
       scenarios covering the expected conversation space. We create these
       ourselves for each of our customers. Getting LLMs to create
       diverse scenarios, even at high temperatures, is surprisingly
       tricky. Along the way, we're picking up tricks for creating more
       randomness and more faithful role-play from the folks at
       https://www.reddit.com/r/LocalLLaMA/.  (2) Have our agents call
       your agent to test its ability to handle things like
       background noise, long silences, or interruptions. Or have us test
       just the LLM / logic layer (function calls, etc.) via an API hook.
       (3) We score the outputs for each conversation using deterministic
       checks and LLM judges tailored to the specific problem domain
       (e.g., order accuracy, tone, friendliness). An LLM judge reviews
       the entire conversation transcript (including function calls and
       traces) against predefined success criteria, using examples of both
       good and bad transcripts as references. It then provides a
       classification output and detailed reasoning to justify its
       decisions. Building LLM judges that consistently align with human
       preferences is challenging, but we're improving with each judge we
       manually develop.  (4) Re-use the checks and judges above to score
       production traffic and use it to track quality metrics in
       production. (i.e., online evals)  We created a Loom recording
       showing our customers' logged-in experience. We cover how you store
       and manage scenarios, how you can trigger an experiment run, and
       how we score each transcript. See the video here:
       https://www.loom.com/share/839fe585aa1740c0baa4faa33d772d3e  We're
       inspired by our experiences at Tesla, where Sumanyu led growth
       initiatives as a data scientist, and Anduril, where Marius headed a
       data infrastructure team. At both companies, simulations were key
       to testing autonomous systems before deployment. A common
       challenge, however, was that simulations often fell short of
       capturing real-world complexity, resulting in outcomes that didn't
       always translate to reality. In voice testing, we're optimistic
       about overcoming this issue. With tools like PlayHT and ElevenLabs,
       we can generate highly realistic voice interactions, and by
       integrating LLMs that exhibit human-like reasoning, we hope our
       simulations will closely replicate how real users interact with
       voice agents.  For now, we're manually onboarding and activating
       each user. We're working hard to make it self-serve in the next few
       weeks. The demo at https://app.hamming.ai/voice-demo doesn't
       require any signup, though!  Our current pricing is a mix of usage
       and the number of seats: https://hamming.ai/pricing. We don't use
       customer data for training purposes or to benefit other customers,
       and we don't sell any data. We use PostHog to track usage. We're in
       the process of getting HIPAA compliance, with SOC 2 being next on
       the list.  Looking ahead, we're focused on making scenario
       generation and LLM judge creation more automated and self-serve. We
       also want to create personas based on real production conversations
       to make it easier to 'replay' a user on demand.  A natural next
       step beyond testing is optimization. We're considering building
       a voice agent optimizer (like DSPy) that takes scenarios that
       failed during testing and generates a new set of prompts or
       function-call definitions to make those scenarios pass. We find
       the potential of
       self-play and self-improvement here super exciting.  We'd love to
       hear about your experiences with voice agents, whether as a user or
       someone building them. If you're building in the voice or agentic
       space, we're curious about what is working well for you and what
       challenges you are encountering. We're eager to learn from your
       insights about setting up evals and simulation pipelines or your
       thoughts on where this space is heading.
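        
       To make step (3) above concrete, here's a rough sketch in Python
       of how a scoring pass might combine deterministic checks with an
       LLM-judge verdict. All names, criteria, and structures here are
       hypothetical illustrations, not Hamming's actual implementation;
       the judge is stubbed where a real system would call a model API.

```python
# Hypothetical transcript-scoring pass: deterministic checks run first,
# then an LLM judge (stubbed here) reviews the whole transcript against
# predefined success criteria and returns a classification.

def check_order_accuracy(transcript, expected_items):
    """Deterministic check: every expected item must appear in the
    agent's final order confirmation."""
    confirmation = transcript[-1]["text"].lower()
    missing = [item for item in expected_items
               if item.lower() not in confirmation]
    return {"check": "order_accuracy", "passed": not missing,
            "missing_items": missing}

def judge_transcript(transcript, criteria, llm=None):
    """LLM judge: classify the conversation against success criteria.
    `llm` is any callable taking a prompt and returning a label; a
    production system would call a model API here instead."""
    prompt = "Criteria:\n" + "\n".join(f"- {c}" for c in criteria)
    prompt += "\n\nTranscript:\n" + "\n".join(
        f"{turn['role']}: {turn['text']}" for turn in transcript)
    label = llm(prompt) if llm else "pass"  # stubbed default verdict
    return {"check": "llm_judge", "passed": label == "pass",
            "label": label}

def score_call(transcript, expected_items, criteria, llm=None):
    """Run all checks and roll them up into a single pass/fail report."""
    results = [check_order_accuracy(transcript, expected_items),
               judge_transcript(transcript, criteria, llm)]
    return {"passed": all(r["passed"] for r in results),
            "results": results}

transcript = [
    {"role": "user", "text": "A burger, no cheese -- dairy allergy."},
    {"role": "agent", "text": "Confirmed: one burger, no cheese."},
]
report = score_call(transcript,
                    expected_items=["burger", "no cheese"],
                    criteria=["Acknowledges dietary restrictions"])
```

       The same check and judge functions can then be pointed at
       production traffic for the online evals described in step (4).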
        
       Author : sumanyusharma
       Score  : 75 points
       Date   : 2024-08-15 15:44 UTC (7 hours ago)
        
       | rstocker99 wrote:
       | That drive-through customer... oh my. I have newfound empathy
       | for drive-through operators.
        
         | sumanyusharma wrote:
         | Yes! Drive-through customers can be very impatient. We tried to
         | make the demo persona maximally annoying.
         | 
         | Testing for edge cases is especially important because getting
         | an order wrong can cause health hazards, long line-ups, and
         | churn!
        
       | neilk wrote:
       | Why "Hamming"? As in Richard Hamming, ex-Bell Labs, "You and Your
       | Research"?
        
         | sumanyusharma wrote:
         | Yup, we named it after Richard Hamming. His essay 'you and your
         | research' was deeply influential during my undergrad; I re-read
         | it every quarter.
         | 
         | Our current product draws inspiration from Hamming distance
         | because we're comparing the `distance` between current LLM
         | output vs. desired LLM output.
        
       | plurby wrote:
       | Wow, gonna test this with my Retell AI agent.
        
         | sumanyusharma wrote:
         | Nice! What's the use case your agent solves for?
         | 
         | I'm happy to spin up some scenarios that are more relevant for
         | you instead of our stock demo personas :)
         | 
         | Feel free to email me at sumanyu@hamming.ai
        
       | zebomon wrote:
       | As someone whose job has been negatively impacted by LLMs
       | already, I'll echo the sentiment here that use cases like this
       | one are sort of depressing, as they will primarily impact people
       | who work long hours for small pay. It certainly seems like
       | there's money to be made in this, so congratulations. The landing
       | page is clear and inviting as well. I think I understand what my
       | workflow inside it would be like based on your text and images.
       | 
       | I'm most excited to see well-done concepts in this space, though,
       | as I hope it means we're fast-forwarding past this era to one in
       | which we use AI to do new things for people and not just do old
       | things more cheaply. There's undeniably value in the latter but I
       | can't shake the feeling that the short-term effects are really
       | going to sting for some low-income people who can only hope that
       | the next wave of innovations will benefit them too.
        
         | esafak wrote:
         | What line of work was it?
        
           | zebomon wrote:
           | I'm a Top Rated/Pro-verified ghostwriter on Fiverr. It's been
           | my full-time job since 2015. Went from mid-six figures in
           | 2022 to scraping by today.
        
       | pj_mukh wrote:
       | My 2.5-year-old yesterday started saying "Hey, This is a test,
       | Can you hear me?", parroting me spending hours testing my LLM.
       | Hah.
       | 
       | This will work with a https://www.pipecat.ai type system? Would
       | love to wrap a continuous testing system with my bot.
        
       | meiraleal wrote:
       | There is not even one reliable and proven "voice agent" yet
       | (correct me if I'm wrong but the best available, elevenlabs,
       | isn't that great yet to be a voice agent), but there are
       | already companies selling testing for voice agents?
       | 
       | Selling shovels in a gold rush seems to have become the only
       | mantra here.
        
         | bongodongobob wrote:
         | As a test, I asked GPT to call my phone company and get my
         | account balance. It worked and even declined some program they
         | tried to sign me up for. Blew my mind.
        
           | kgc wrote:
           | What were the steps to get it to make a call?
        
         | sumanyusharma wrote:
         | It's a bit of a catch-22.
         | 
         | Making current voice agents reliable is incredibly time-
         | consuming and complex. This challenge has kept many teams from
         | pushing their agents into production. Those who do launch often
         | release a very limited, basic version to minimize risk. We
         | frequently talk to teams in both camps.
         | 
         | As a result, there aren't many 'killer' voice products on the
         | market right now. But as models improve, we'll see more voice-
         | centric companies emerge.
         | 
         | Teams are already calling their agents by hand and keeping
         | track of experiment runs in a spreadsheet. We're just
         | automating the workflow and making it easier to run
         | experiments!
        
       | euvin wrote:
       | The idea of testing an agent with annoying situations, like
       | uncooperative people or vague responses, makes me wonder if, in
       | the future, similar approaches might be tried on humans. People
       | could be (unknowingly) subjected to automated "social benchmarks"
       | with artificially designed situations, and I'm sure I don't
       | have to explain how dystopian that is.
       | 
       | It would essentially be another form of a behavioral interview. I
       | wonder if this exists already, in some form?
        
         | sumanyusharma wrote:
         | I wonder if a more optimistic version of this could be used to
         | train humans and improve their skills. I'm thinking along the
         | lines of LeetCode / Project Euler, but more dynamic and
         | personalized!
         | 
         | A few examples:
         | 
         | 1) Customer service: Simulating challenging customer
         | interactions could help reps develop patience and problem-
         | solving skills.
         | 
         | 2) Emergency responders: Creating realistic crisis scenarios
         | (like 911 calls) that could improve decision-making under
         | pressure.
         | 
         | 3) Healthcare: Virtual patients with complex symptoms could
         | speed up the learning rate for med students.
         | 
         | 4) Conflict resolution: Practicing with difficult personalities
         | could aid mediators and negotiators.
         | 
         | 5) Sales: AI-simulated tough customers could help salespeople
         | refine their pitches and objection-handling skills in a low-
         | stakes environment.
         | 
         | Thoughts?
        
           | euvin wrote:
           | That does sound like an interesting idea. Upon further
           | thought, I think that it would _heavily_ depend on
           | implementation.
           | 
           | In a bad case, I envision a ton of companies or institutions
           | employing very strict & narrow situations to the point where
           | they only accept a very homogenized personality. It could end
            | up creating a stiffer or worse culture than if they had
           | naturally accumulated a diverse population, if that makes
           | sense. Discrimination already exists, but would be made a lot
           | easier, automated, and commonplace.
           | 
           | In a good case, extremely antisocial behavior (situations
           | that are "softballs" or "hard to screw up for reasonable
            | people") could be easily caught at scale and addressed at
            | an early age. Plus the cases you've listed, eliminating
            | the need for special attention and mentorship from the
            | limited people
           | we meet irl.
           | 
           | I'm sure there are other horrible or amazing cases I'm
           | missing.
           | 
           | So as all tools are, it would depend. Whether this will
           | actually benefit more than harm will depend on the society
           | you place it in, and I'm not sure I have that much faith in
           | the corporate world.
        
       | atyro wrote:
       | Nice! Great to see the UI looks clean enough that it's accessible
       | to non-engineers. The prompt management and active monitoring
       | combo looks especially useful. Been looking for something with
       | this combo for an expense app we're building.
        
         | sumanyusharma wrote:
         | Yes! We're aiming to build a tool that both engineers and non-
         | engineers love.
         | 
         | We've discovered that it's often faster for non-technical
         | domain experts to iterate on prompts in a structured, eval-
         | driven way, rather than relying on engineers to translate
         | business requirements into prompts.
         | 
         | While storing prompts in code offers version control benefits,
         | it can hinder collaboration. On the other hand, using a pure
         | CMS for prompts enhances collaboration but sacrifices some
         | modern software development practices.
         | 
         | We're working towards a solution that bridges this gap,
         | combining the best of both approaches. We're not there yet, but
         | we have a clear roadmap to achieve this vision!
        
       | themacguffinman wrote:
       | AI voice agents are weird to me because voice is already a very
       | inefficient and ambiguous medium; the only reason I would make a
       | voice call is to talk to a human who is equipped to tackle the
       | ambiguous edge cases that the engineers didn't already
       | anticipate.
       | 
       | If you're going to develop AI voice agents to tackle pre-
       | determined cases, why wouldn't you just develop a self-serve non-
       | voice UI that's way more efficient? Why make your users navigate
       | a nebulous conversation tree to fulfill a programmable task?
       | 
       | Personally when I realize I can only talk to a bot, I lose
       | interest and end the call. If I wanted to do something routine, I
       | wouldn't have called.
        
         | michaelmior wrote:
         | I'm the same way, and I don't have any data on this, but it's
         | possible that we're in the minority. This probably isn't the
         | case, but hopefully anyone implementing such a system has
         | thought through whether it will actually provide any value.
         | 
         | For example, if you had an existing IVR system and you tracked
         | menu options and found that a significant portion of calls
         | could be answered by non-smart pre-recorded messages,
         | upgrading to an AI voice agent could be a reasonable
         | improvement.
        
           | sumanyusharma wrote:
           | Our customers, who build voice agents, are often asked by
           | their customers to make their voice agents more human-like
           | and flexible. Their clients -- businesses like pest control
           | and automotive repairs -- value providing a personalized
           | experience but want the convenience and reliability of a 24/7
           | booking and answering service.
        
         | Centigonal wrote:
         | Some people (like me) are primarily verbal processors:
         | 
         | - I am dictating this message through macOS's voice to text
         | right now
         | 
         | - I am a huge user of Google Assistant
         | 
         | - I prefer to call people versus texting them
         | 
         | - I tend to call restaurants instead of using something like
         | Toast to order takeout (although this is partially because
         | online services will add a surcharge onto the price sometimes,
         | and sometimes I need to ask questions about dietary
         | restrictions, etc.)
         | 
         | Generally, wherever possible, I will use a voice interface
         | versus a text based one to get my point across. It's just
         | faster and more convenient for me. I'm pretty neutral on the
         | consumption side: I read and listen to audiobooks in roughly
         | equal amounts.
         | 
         | All that to say that, just like there are people out there who
         | prefer text UIs, there are also people who prefer voice
         | interfaces.
        
           | sumanyusharma wrote:
           | I use Superwhisper (no affiliation, just a happy user), which
           | runs a local Whisper model, to create most of my email drafts
           | and post-meeting notes. I find Whisper more accurate than
           | Mac's built-in speech-to-text, plus I'm faster at speaking
           | than typing.
           | 
           | Sometimes, I even 'talk' into Cursor's chat window instead of
           | typing. The only downside? It can get a bit annoying for
           | others when you're talking to yourself all day.
        
         | jcims wrote:
         | Think 1-800-CONTACTS not Siri. Call centers are super expensive
         | and the user experience is usually pretty bad. There's a huge
         | incentive to move to voice agents, but one of the challenges is
         | building a framework to adequately test it. That seems to be
         | what this is focused on.
        
       | serjester wrote:
       | I feel like the better positioning would be evals for voice
       | agents. It seems just as challenging to figure out all the
       | ways your system can go wrong as it is to build the system in
       | the first place. Doing this in a way that actually adds value
       | without any domain expertise seems impossible.
       | 
       | If it did, wouldn't all the companies with production AI text
       | interfaces be using similar techniques? Now being able to easily
       | replay a conversation that was recorded with a real user seems
       | like a huge value add.
        
         | sumanyusharma wrote:
         | Absolutely agree that creating effective evals requires domain
         | expertise. Right now, we're co-building evals with customers,
         | but we're identifying which aspects can be productized.
         | 
         | Regarding text-based evals -- part of testing voice agents
         | involves assessing their core reasoning logic. To do that, we
         | bypass the voice layer and simulate conversations via text. So
         | yes, the core simulation engine is reusable for both
         | conversational text and voice interactions.
         | 
         | We're also excited about shipping the ability to replay a
         | simulated conversation inspired by a real user!
        
       | xan_ps007 wrote:
       | is there an open source variant available? I am building
       | https://github.com/bolna-ai/bolna which is an open-source voice
       | orchestration framework.
       | 
       | would love to have something like this integrated as part of our
       | open source stack.
        
         | sumanyusharma wrote:
         | Bolna looks awesome! We've considered going open-source, but
         | we're not sure how to effectively manage a community.
         | 
         | I'll reach out async!
        
           | xan_ps007 wrote:
           | Sure! Would love to discuss synergies and if we can integrate
           | it. Thanks & all the best!
        
       | prithvi24 wrote:
       | This is great to see. Evals on voice are hard - we only have
       | evals on text-based prompting, but they don't fully capture
       | everything. Excited to give this a try.
        
         | sumanyusharma wrote:
         | This tracks. Text evals to test core logic and voice evals for
         | overall end-to-end performance!
        
       | telecomhacker wrote:
       | I work in the telecom space. I don't think this paradigm will get
       | adopted in the near future. Customers are already building voice
       | bots on top of Google Dialogflow e.g. Cognigy. Cognigy does have
       | LLM capabilities, but it is not widely adopted. I think voice
       | bots will still have to be manually configured for some time.
        
         | sumanyusharma wrote:
         | I'm curious to learn more about what's blocking the widespread
         | adoption of the LLM capabilities. Lack of knowledge,
         | reliability, or something else?
        
       | diwank wrote:
       | Congratulations for the launch! We had a big QC need for
       | https://kea.ai/ where we needed to stress test our CX agents in
       | real time too. This would be a big life saver. kudos on the
       | product and the brilliant demo!
        
         | sumanyusharma wrote:
         | I am curious - how was the team solving this at Kea?
        
       | kinard wrote:
       | I'm working on AI voice agents here in the UK for real estate
       | professionals; unfortunately, I couldn't try your service.
        
         | sumanyusharma wrote:
         | We forgot to enable non-US numbers in our config for the demo.
         | (oops)
         | 
         | We're working on a fix right now!
        
       ___________________________________________________________________
       (page generated 2024-08-15 23:00 UTC)