[HN Gopher] Launch HN: Parity (YC S24) - AI for on-call engineer...
___________________________________________________________________
Launch HN: Parity (YC S24) - AI for on-call engineers working with
Kubernetes
Hey HN -- we're Jeffrey, Coleman, and Wilson, and we're building
Parity (https://tryparity.com), an AI SRE copilot for on-call
engineers working with Kubernetes. Before you've opened your
laptop, Parity has conducted an investigation to triage, determine
root cause, and suggest a remediation for an issue. You can check
out a quick demo of Parity here: https://tryparity.com/demo We met
working together as engineers at Crusoe, a cloud provider, and we
always dreaded being on-call. It meant a week of putting our lives
and projects on hold to be prepared to firefight an issue at any
hour of the day. We experienced sleepless nights after being woken
up by a PagerDuty alert to then find and follow a runbook. We
canceled plans to make time to sift through dashboards and logs in
search of the root cause of downtime in our k8s cluster. After
speaking with other devs and SREs, we realized we weren't alone.
While every team wants better monitoring systems or a more
resilient design, the reality is that time and resources are often
too limited to make these investments. We're building Parity to
solve this problem. We're enabling engineers working with
Kubernetes to more easily handle their on-call by using AI agents
to execute runbooks and conduct root cause analysis. We knew LLMs
could help given their ability to quickly process and interpret
large amounts of data. But we've found that LLMs alone aren't
sufficiently capable, so we've built agents to take on more complex
tasks like root cause analysis. By allowing on-call engineers to
handle these tasks more easily and eventually freeing them from
such responsibilities, we create more time for them to focus on
complex and valuable engineering investments. We built an agent to
investigate issues in Kubernetes by following the same steps a
human would: developing a possible root cause, validating it with
logs and metrics, and iterating until a well-supported root cause
is found. Given a symptom like "we're seeing elevated 503 errors",
our agent develops hypotheses as to why this may be the case, such
as nginx being misconfigured or application pods being under-
resourced. Then, it gathers the necessary information from the
cluster to either support or rule out those hypotheses. These
results are presented to the engineer as a report with a summary
and each hypothesis. It includes all the evidence the agent
considered when coming to a conclusion so that an engineer can
quickly review and validate the results. With the results of the
investigation, an on-call engineer can focus on implementing a fix.
We've built an additional agent to automatically execute runbooks
when an alert is triggered. It follows steps of a runbook more
rigorously than an LLM alone and with more flexibility than
workflow automation tools like Temporal. This agent is a
combination of separate LLM agents each responsible for a single
step of the runbook. Each runbook step agent will execute arbitrary
instructions like "look for nginx logs that could explain the 503
error". A separate LLM will evaluate the results, ensuring the step
agent followed the instructions, and determines which subsequent
step of the runbook to execute. This allows us to execute runbooks
with cycles, retries, and complex branching conditions. With these
tools, we aim to handle the "what's going wrong" part of on-call
for engineers. We still believe it makes the most sense to continue
to trust engineers with actually resolving issues as this requires
potentially dangerous or irreversible commands. For that reason,
our agents exclusively execute read-only commands. If this sounds
like it could be useful for you, we'd love for you to give the
product a try! Our service can be installed in your cluster via a
helm repo in just a couple of minutes. For our HN launch, we've
removed the billing requirement for new accounts, so you can test
it out on your cluster for free. We'd love to hear your feedback
in the comments!
Author : wilson090
Score : 58 points
Date : 2024-08-26 14:55 UTC (8 hours ago)
| Atotalnoob wrote:
| It would be kind of interesting if, based on an engineer
| accepting the suggestion, parity generated a new run book.
|
| This would allow repeated issues to be well documented.
|
| On iOS Firefox, when clicking "pricing" on the menu, it scrolls
| to the proper location, but does not close the menu. Closing the
| menu causes it to jump to the top of the page. Super annoying.
| wilson090 wrote:
| Agreed, this feature is on our todo list. Another big problem
| we're aiming to tackle is the tribal knowledge that builds up
| on teams in part due to a lack of documentation. We want to
| make it easy to build new runbooks and keep existing runbooks
| up to date
|
| And thanks for the bug report, I'll take a look
| upon_drumhead wrote:
| If an issue can be automatically detected and remediated, do
| you really need a runbook? That space has to be huge. I don't
| see a purpose for documenting it.
|
| That said, a tool that runs through existing runbooks and
| improves them or suggests new ones would be extremely useful
| IMHO.
| Atotalnoob wrote:
| Improving documentation.
|
| Keep in mind, they are suggestions. It sounds like the
| product will automatically execute runbooks but hold
| suggestions for engineer input. This would move it from
| "suggestion" to "automatically do X"
|
| Also, sometimes LLMs are wrong.
| jtsaw wrote:
| The product will automatically execute runbooks for you. So
| far we've focused on using runbooks customers already have,
| since they know they work for them. We've also added the
| ability to turn of automatic execution for cases like a
| suggested runbook, so the customer can make any edits if
| necessary before approving it to be executed automatically.
|
| Yea, this is a big challenge for us. We're using a variety
| of strategies to make sure hallucinations are rare, but
| that's why we're also committed to not executing actions
| that modify your cluster unless explicitly specified in a
| runbook
| threeseed wrote:
| > I don't see a purpose for documenting it.
|
| Enterprises implement stringent Change Management procedures.
|
| If you are making _any_ change to a Prod environment it needs
| to be thoroughly documented.
| cortesoft wrote:
| > I don't see a purpose for documenting it.
|
| Because when it goes wrong you will want to know what it did.
| When you discover something new, you are going to want to be
| able to change the runbook. New employees are going to want
| to learn how things work from the runbook.
|
| Why WOULDN'T you want to document what it is doing? I would
| never trust an AI that didn't tell me what it was doing and
| why.
| raunakchowdhuri wrote:
| hmmm idk how I would feel about giving an llm cluster access from
| a security pov
| wilson090 wrote:
| Valid concern, security and safety are essential for anything
| that can access a production system. We use k8s RBAC to ensure
| that the access is read-only, so even if the LLM hallucinates
| and tries to destroy something, it can't
|
| As we will eventually move towards write-access, we're closely
| following the work in LLM safety. There has been some
| interesting work to use smaller models to evaluate tool
| calls/completions against a set of criteria to ensure safety
| martinald wrote:
| Other problem is that you become an extremely big target for
| bad actors as you have read/write (or just even read) access
| to all these k8s clusters. Obviously you can mitigate against
| that to a fairly high degree with on prem, but for users not
| on that...
|
| Cool idea though!
| stackskipton wrote:
| Azure Kubernetes Wrangler (SRE) here, before I turn some LLM
| loose on my cluster, I need to know what it supports, how it
| supports it and how I can integrate into my workflow.
|
| Videos show CrashLoopBackOff pod and analyzing logs. This works
| if Pod is writing to stdout but I've got some stuff doing
| straight to ElasticSearch. Does LLM speak Elastic Search? How
| about Log Files in the Pod? (Don't get me started on that
| nightmare)
|
| You also show fixing by editing YAML in place. That's great
| except my FluxCD is going revert since you violated principle of
| "All goes through GitOps". So if you are going to change
| anything, you need to update the proper git repo. Also in said
| GitOps is Kustomize so hope you understand all interactions
| there.
|
| Personally, the stuff that takes most troubleshooting time is
| Kubernetes infrastructure. Network CNI is acting up. Ingress
| Controller is missing proper path based routing. NetworkPolicy
| says No to Pod talking to PostGres Server. CertManager is on
| strike and certificate has expired. If LLM is quick at
| identifying those, it has some uses but selling me on "Dev made
| mistake with Pod Config" is likely not to move the needle because
| I'm really quick at identifying that.
|
| Maybe I'm not the target market and target market is "Small Dev
| team that bought Kubernetes without realizing what they were
| signing up for"
| pmccarren wrote:
| > CertManager is on strike and certificate has expired
|
| Had a good chuckle here, hah.
| wilson090 wrote:
| Your comment brings up a good point (and also one of our big
| challenges): there is a huge diversity in the tools teams use
| to setup and operate their infra. Right now our platform only
| speaks to your cluster directly through kubectl commands. We'll
| build other integrations so it can communicate with things like
| Elastic Search to broaden its context as needed, but we'll have
| to be somewhat thoughtful in picking the highest ROI
| integrations to build.
|
| Currently, we only handle the investigation piece and suggest a
| remediation to the on-call engineer. But to properly move into
| automatically applying a fix, which we hope to do at some
| point, we'll need to integrate into CI/CD
|
| As for the demo example, I agree that the issue itself isn't
| the most compelling. We used it as an example since it is easy
| to visualize and set up for a demo. The agent is capable of
| investigating more complex issues we've seen in our customer's
| production clusters, but we're still looking for a way to
| better simulate these on our test environment, so if you/anyone
| has ideas we'd love to hear them.
|
| We do think this has more value for engineers/teams with less
| expertise in k8s, but we think SREs will still find it useful
| stackskipton wrote:
| >we're still looking for a way to better simulate these on
| our test environment, so if you/anyone has ideas we'd love to
| hear them.
|
| Pick Kubernetes offering from big 3, deploy it then blow it
| up.
|
| (I couldn't get HackerNews to format properly and done
| fighting it)
|
| On Azure, deploy a Kubernetes cluster with following:
|
| Azure CNI with Network Policies
|
| Application Gateway for Containers
|
| External DNS hooked to Azure DNS
|
| Ingress Nginx
|
| Flexible PostGres Server (outside the cluster)
|
| FluxCD/Argo
|
| Something with using Workload Identity
|
| Once all that is configured, put some fake workloads on it
| and start misconfiguring it with your LLM wired up. When the
| fireworks start, identify the failures and train your LLM
| properly.
| solatic wrote:
| > we think SREs will still find it useful
|
| There are two kinds of outages: people being idiots and legit
| hard-to-track-down bugs. SREs worth their salt don't need
| help with the former. They may find an AI bot somewhat useful
| to find root cause quicker, but usually not so valuable as to
| justify paying the kind of price you would need to charge to
| make your business viable to VCs. As for the latter, good
| luck collecting enough training data.
|
| Otherwise, you're selling a self-driving car to executives
| who want the chauffeur without the salary. Sounds like a
| great idea, until you think about the tail cases. Then you
| wish you had a chauffeur (or picked up driving skills
| yourself).
|
| Maybe you'll find a market, but as an SRE, I wouldn't want to
| sell it.
| solatic wrote:
| I basically want to +1 this. OP isn't selling to any place that
| is already spending six figures on SRE salaries. Actual
| competitors are companies like Komodor and Robusta who sell "we
| know Kubernetes better than you" solutions to companies that
| don't want to spend money on SRE salaries. Companies in this
| situation should just seriously reconsider hosting on
| Kubernetes and go back to higher-level managed services like
| ECS/Fargate, Fly/Railway, etc.
| shmatt wrote:
| Im sure this is on their roadmap, but honestly a pre-requisite
| should be a separate piece of software that analyzes and
| suggests changes to your error handling.
|
| This is a cool proof of concept but almost useless otherwise in
| a production system
|
| I can already feed Claude or ChatGPT my kubectl output pretty
| easily
|
| Error handling and logging that are tailored for consumption of
| a specific pre trained model, thats where this will be ground
| breaking
| wilson090 wrote:
| That is something we're working on -- good observability is a
| place where teams usually fall short and often the limiting
| factor to better incident response. We're working on logging
| integrations as a first step.
| stackskipton wrote:
| The AI needs to be integrated into Dev IDE. All my logging
| screaming is terrible decisions made by long ago Devs but
| getting them fixed now is impossible because they don't want
| to do it and no one is going to make them.
| mdaniel wrote:
| Why would you have your demo video set to "unlisted"? (on what
| appears to be your official channel) I'd think you'd want to show
| up in as many places as possible
| henning wrote:
| An AI agent to triage the production issues caused by code
| generated by some other startup's generative AI bot. I fucking
| love tech in 2024.
| dockerd wrote:
| You forget the AI tech which help test the AI tech
| klinquist wrote:
| Website won't load - just me?
| jtsaw wrote:
| which website doesn't load for you?
| klinquist wrote:
| trtparity.com, looks like it's a local problem, loads on
| cellular.
| threeseed wrote:
| > This agent is a combination of separate LLM agents each
| responsible for a single step of the runbook
|
| Someone needs to explain to me how this is expected to work.
|
| Percentage of Hallucinations/Errors x Steps in Runbook = Total
| Errors
|
| 0.05 x 10 = 0.5 = 50%
| bicx wrote:
| Getting tired of seeing this concept of practically guaranteed
| hallucinations from any LLM used in production. I've used LLMs
| for various tasks, and if you tune your system correctly, it
| can be very reliable. It's just not always plug-and-play
| reliability. You need to set up your fine-tuning and prompts
| and then test well for consistent results.
| threeseed wrote:
| > it can be very reliable
|
| You need to quantify this. With actual numbers.
|
| I am getting very tired of seeing everyone pushing LLMs and
| being disingenuous about exactly how often it is getting
| things wrong. And what the impact of that is. There is a
| reason that AI is not taking off in the enterprise and that
| is because people who take their job seriously are getting
| tired too.
| bicx wrote:
| I understand your sentiment, but I also don't think it's
| fair to say people are being disingenuous. I don't work for
| an AI company, I just use it with decent results.
|
| My last project needed a way to tag certain types of
| business activity indicated in 17,000 user reviews. I wrote
| a prompt with 5 different tags along with descriptions for
| each business activity, took a few-shot approach by
| defining 8 different examples and how I would tag them, and
| then ran the tagging prompt on batches of indexed reviews,
| giving it 100 reviews per batch. I did a random sampling of
| about 200 items, and the success rate was roughly 89%. I
| could have improved it by doing more iterations, and
| possibly fine-tuning if I felt that it was worth it.
|
| In every run, it generated matching results for the input
| JSON in a corresponding output JSON with no errors in any
| run.
|
| That's the only example I have numbers on off the top of my
| head.
|
| EDIT: This was using `gpt-4o-2024-05-13`
| deathanatos wrote:
| (Reader beware: basically everything the LLM emits in
| here is wrong.)
|
| I was attempting to find formats that used a particular
| quoting style -- SQL style double-the-character-to-espace
| -- and bizarrely JSON came up.
|
| Gemini:
|
| > _JSON: In JSON, double quotes are used to delimit
| strings. If a double quote needs to be included within
| the string itself, it 's escaped by doubling it (e.g.,
| "This is a string with a "quote" in it.")._
|
| That is, "This is a string with a "quote"
| in it."
|
| Me: _Does your JSON example parse as valid JSON?_
|
| Gemini:
|
| > _Yes, the JSON example I provided ( "This is a string
| with a "quote" in it.") is actually valid JSON._
|
| > _The reason it 's valid is because the doubled quote
| within the string is interpreted as an escaped character,
| representing a single literal quote within the string
| itself. The JSON parser recognizes the double quotes at
| the beginning and end as delimiters for the entire
| string, not the escaped quote within._
|
| It also called out,
|
| > _XML: Use a backslash (\\) to escape characters,
| including quotes._
|
| In a different context, and more towards "working with
| k8s" that the OP's AI is focused on,
|
| Me: _Regex to extract the replica number from a
| statefulset pod 's name?_
|
| Gemini:
|
| > _Use kubectl: The kubectl get pods command with output
| formatting can directly show replica numbers. Here 's an
| example:_ kubectl get pods -o
| jsonpath='{.items[*].metadata.name}{" - replica:
| "}{.metadata.labels.replicaset-Xu Hao }'
|
| The Chinese here, as best as I can tell, reads "serial
| number" ... which sort of contextually makes sense in the
| sense of an LLM, I guess. The command is utter nonsense,
| of course.
| nerdjon wrote:
| > You need to set up your fine-tuning and prompts and then
| test well for consistent results.
|
| Tell that to Google...
|
| Seriously, it is well established that these systems
| hallucinate. Trying to say otherwise shows you are trying to
| push something that just is not true.
|
| They can be right, yes. But when they are wrong they can be
| catastrophically wrong. You could be wasting time looking
| into the wrong problem with something like this.
| drawnwren wrote:
| If you're curious what the state of the art in multi-agent
| is looking like, I really recommend
| https://thinkwee.top/multiagent_ebook/
| mplewis wrote:
| Every LLM conversation is guaranteed to contain some level of
| hallucination. You will never get the percentage down to
| zero.
| wilson090 wrote:
| We separated out the runbook such that each step is a separate
| LLM in the agent. Between each step, there's sort of a
| "supervisor" that ensures that the step was completed
| correctly, and then routes to another step based on the
| results. So in reality, a single step failing requires two
| hallucinations. Hallucinations are also not a fixed percentage
| across all calls -- you can make them less likely by
| maintaining focused goals (this is why we made runbooks agentic
| rather than a single long conversation)
| threeseed wrote:
| And what is your average error rate per runbook step.
| jtsaw wrote:
| one thing we're experimenting to help with the
| hallucinations/error rate issue is using a committee
| framework where we take a majority vote.
|
| If the error rate of 1 expert is 5%, then for a committee
| of 10 experts, the probability a majority of the committee
| errors is around 0.00276% (binomial distribution with
| p=0.05). For 10 steps, this would be an error rate of
| 0.0276%
| threeseed wrote:
| Pretty bad maths there. Those committee members are not
| independent.
|
| They are highly correlated even amongst LLMs from
| different vendors.
| jtsaw wrote:
| I'm not sure they are highly correlated. A committee uses
| the same LLM with the same input context to generate
| different outputs. Given the same context LLMs should
| produce the same next token output distribution (assuming
| fixed model parameters, temperature, etc). So, while
| tokens in a specific output are highly correlated,
| complete outputs should be independent since they are
| generated independently from the same distribution. You
| are right they are not iid but the calculation was just a
| simplification.
| nerdjon wrote:
| That second step hallucinating is far more likely when you
| are feeding it incorrect information from the first
| hallucination.
|
| LLM's are very easy to manipulate.
|
| At one point with a system prompt telling Claude it was
| OpenAI, I was able to ask what its model is and it would
| confidently tell me it was OpenAI. Garbage data in, garbage
| data out.
|
| Admittedly that is an extreme case, but you're giving that
| second prompt wrong data in the hopes that it will identify
| it instead of just thinking it's fine when it is part of its
| new context.
| jtsaw wrote:
| yea. We're definitely concerned about hallucinations and
| are using a variety of techniques to try and mitigate it
| (there's some existing discussion here, but using
| committees and sub-agents responsible for smaller tasks has
| helped).
|
| What's helped the most, though, is using cluster
| information to back up decision making. That way we know
| the data it's considering isn't garbage, and the outputs
| are backed up by actual data.
| nerdjon wrote:
| Well the website seems to be down so I can't actually see any
| information about what LLM you are using, but I seriously hope
| you are not just sending the data to OpenAI API or something like
| that and are forcing the use of a private (ideally self hosted)
| service.
|
| I would not want any data about my infrastructure sent to a
| public LLM, regardless of how sanitized things are.
|
| Otherwise, on paper it seems cool. But I worry about getting
| complicit with this tech. It is going to fail, that is just the
| reality. We know LLM's will hallucinate and there is not much we
| can do about it, it is the nature of the tech.
|
| SO it might work most of the time, but when it doesn't and you're
| bashing your head against the wall trying to figure out what is
| broken. This system is telling you that all of these things are
| fine, but one of them actually isn't. But it worked enough times
| that you trust it, so you don't bother double checking.
|
| That is before we even talk about having this thing running code
| for automatic remediation, which I hope no one seriously
| considers ever doing that.
| wilson090 wrote:
| Hmm we're not seeing any issues with the website on our end --
| tryparity.com is down for you?
|
| The data security point with LLMs is definitely relevant.
| There's a broader conversation ongoing right now about how
| teams will securely use LLMs, but from our conversations so far
| teams have been willing to adopt the tech. We've been working
| with startups up to this point, so we'll likely need to offer
| support self-hosted LLMs if we were to support enterprise or
| bring-your-own-keys for larger startups.
|
| The hallucination point is interesting. I think a lot of
| products will need to solve this problem of having so much
| trust with the user that they'll blindly follow the outputs,
| but occasionally failing due to hallucination. Our approach has
| been to 1) only focus on investigation/root cause and 2) make
| sure it's easy to audit the results by sharing all of the
| results + supporting evidence
| manveerc wrote:
| Congratulations on the launch! I'm curious--how is what you're
| building different from other AI SRE solutions out there, like
| Cleric, Onegrep, Resolve, Beeps, and others?
| wilson090 wrote:
| Thanks! Hard to make a comparison to Cleric since their site
| doesn't really have any features or a demo. Onegrep is a fellow
| YC company, and we love what they're building! They seem to be
| more focused on workflows and pulling together context (also a
| very important problem in the space), we we've put more of a
| focus on root causing infra issues specifically. Resolve seems
| to come from the same category as Temporal, which are more
| traditional automation platforms. These end up being somewhat
| rigid tools in that you have to very explicitly define each
| step and they require a certain level of CI/CD or monitoring
| sophistication to be useful. Using LLMs allows us to relax
| these requirements and follow workflows like an actual engineer
| would.
|
| I haven't heard of Beeps and can't find it, could you share the
| URL?
| manveerc wrote:
| > They seem to be more focused on workflows and pulling
| together context (also a very important problem in the
| space), we we've put more of a focus on root causing infra
| issues specifically.
|
| So just to clarify, are you saying that Parity is focused on
| infrastructure issues, while something like Onegrep addresses
| the broader problem by providing context?
|
| > I haven't heard of Beeps and can't find it, could you share
| the URL?
|
| https://www.beeps.co/
| wilson090 wrote:
| Yes, my understanding is that Onegrep is meant to provide
| context from your documentation and past incidents, which
| can be helpful when trying to solve an alert. We're focused
| on root-causing underlying infrastructure issues by
| actually looking into the logs/configurations/metrics.
|
| Ah I actually did see beeps a while back. I haven't tried
| their product, but they seem to be similar to
| rootly/Onegrep in that they're working on on-call
| management/post-mortems
| drawnwren wrote:
| This is a great idea. I use claude for my most of my unknown K8s
| bugs and it's impressive how useful it is (far more than my
| coding bugs).
| wilson090 wrote:
| Thanks! We've also been impressed with the performance of out-
| of-the-box LLMs on this use case. I think in part it is because
| k8s is a significantly more constrained problem-space than
| coding, and because of that we'll get to a much more complete
| solution with the existing state of LLMs than we would for a
| product like a general software engineer agent.
| ronald_petty wrote:
| I think this kind of tooling is one positive aspect of
| integrating LLM tech in certain workflows/pipelines. Tools like
| k8sgpt are similar in purpose and show a strong potential to be
| useful. Look forward to seeing how this progresses.
| wilson090 wrote:
| Thanks! k8sgpt is great, it was one of our inspirations
___________________________________________________________________
(page generated 2024-08-26 23:00 UTC)