[HN Gopher] Launch HN: ContainIQ (YC S21) - Kubernetes Native Mo...
       ___________________________________________________________________
        
       Launch HN: ContainIQ (YC S21) - Kubernetes Native Monitoring with
       eBPF
        
       Hi HN, I'm Nate. My co-founder Matt and I are the founders of
       ContainIQ (https://www.containiq.com/). ContainIQ is a complete
       K8s monitoring solution that is easy to set up and maintain and
       provides a comprehensive view of cluster health.

       Over the last few years, we noticed a shift: more of our friends
       and other founders were adopting Kubernetes earlier on. (Whether
       or not they actually need it that early is less clear, but that's
       a discussion for another time.) From our past experience using
       open-source tooling and other platforms on the market, we knew
       that the existing tooling wasn't built for this generation of
       companies building with Kubernetes.

       Many early- to middle-market tech companies don't have the
       resources to manage and maintain a collection of disparate
       monitoring tools, and most engineering teams don't know how to
       use them. But when scaling, engineering teams do know that they
       need to monitor cluster health and core metrics, or else end
       users will suffer. Measuring HTTP response latency by URL path,
       in particular, is important for many companies, but installing
       application-level packages in each individual microservice to get
       it is time-consuming.

       We decided to build a solution that is easy to set up and
       maintain. Our goal was to get users 95% of the way there almost
       instantly.

       Today, our Kubernetes monitoring platform has four core features:
       (1) metrics: CPU and memory for pods/nodes, view limits and
       capacity, correlate to events, and alert on changes; (2) events:
       a K8s events dashboard, correlated to logs, with alerts; (3)
       latency: RPS, p95, and p99 latencies by microservice, including
       by URL path, with alerts; and (4) logs: container-level log
       storage and search.

       Our latency feature set was built using a technology called eBPF.
       BPF, or the Berkeley Packet Filter, was developed from a need to
       filter network packets while minimizing unnecessary packet copies
       from kernel space to user space. Since version 3.18, the Linux
       kernel provides extended BPF, or eBPF, which uses 64-bit
       registers and increases the number of registers from two to ten.
       We install the necessary kernel headers for users automatically.

       With eBPF, we monitor from the kernel and OS level, not at the
       application level. Our users can measure and monitor HTTP
       response latency across all of their microservices and URL paths,
       as long as their kernel version is supported. We deliver this
       experience immediately by parsing network packets directly from
       the socket. We then correlate the socket and sk_buff information
       to your Kubernetes pods to provide metrics like requests per
       second, p95, and p99 latency at the path and microservice level,
       without you having to instrument each microservice at the
       application level. For example, with ContainIQ you can track how
       long your node.js application takes to respond to HTTP requests
       from your users, ultimately letting you see which parts of your
       web application are slowest and alerting you when users are
       experiencing slowdowns.
       Users can correlate events, logs, and metrics in one view. We
       knew how annoying it was to toggle between multiple tabs and then
       scroll endlessly through logs trying to match up timestamps. We
       fixed this. For example, a user can click from an event (e.g., a
       pod dying) to the logs at that point in time.

       Users can set alerts on essentially any data point (e.g., p95
       latency, a K8s job failing, a pod eviction).

       Installation is straightforward, using either helm or our YAML
       files.

       Pricing is $20 per node / month + $1 per GB of log data ingested.
       You can sign up on our website directly with the self-service
       flow. You can also book a demo if you would like to talk to us,
       but that isn't required. Here are some videos
       (https://www.containiq.com/kubernetes-monitoring) if you are
       curious to see our UX before signing up.

       We know that we have a lot of work left to do, and we welcome
       your suggestions, comments, and feedback. Thank you!
        
       Author : NWMatherson
       Score  : 66 points
       Date   : 2022-01-06 16:30 UTC (6 hours ago)
        
       | nyellin wrote:
       | Nice to see a new eBPF based solution out there. Good luck.
        
         | NWMatherson wrote:
         | Thanks so much!
        
       | rlyshw wrote:
       | I recently had an issue where my UDP service worked fine exposed
       | directly as a NodePort type, but not through an nginx UDP
       | ingress. I _think_ the issue was that the ingress controller
       | forwarding operation was just too slow for the service's needs,
       | but I had no way of really knowing.
       | 
       | Now if I'd had this kernel-level network monitoring system, I
       | probably could have had a clearer picture of what was going on.
       | 
       | Really one of the hardest problems I've had with
       | learning/deploying in k8s is trying to trace down the multiple
       | levels of networking, from external TLS termination to
       | LoadBalancers, through ingress controllers, all the way down to
       | application-level networking. More often than not, I've found
       | the easiest path is to just get rid of those layers of
       | complexity completely.
       | 
       | In the end I just exposed my server on NodePort, forwarded my NAT
       | to it, and called it done. But it sounds like something like
       | ContainIQ can really add to a k8s admin's toolset for
       | troubleshooting these complex network issues. I also agree with
       | other comments here that a limited, personal-use/community tier
       | would be great for wider adoption and home-lab users like me :)
        
         | NWMatherson wrote:
         | Appreciate this insight and I agree with you.
         | 
         | And I can definitely circle back here when our limited use tier
         | goes live. Agree on that too.
        
       | gigatexal wrote:
       | A community edition/non-paid would be quite nice to be able to
       | trial this out before paying.
       | 
       | This is how an old employer adopted CockroachDB because we
       | trialed the non-enterprise version and then ultimately bought a
       | license.
        
         | permalac wrote:
         | Agreed.
         | 
         | Our employer does not invest in this kind of tool, so when a
         | free version does not exist, for us the tool does not exist.
         | 
         | We would be happy to provide usage metrics and reports, our
         | company is full open source and open data, and we work/invest
         | time on open projects when possible.
        
           | nyellin wrote:
           | Not the OP, but I develop a different open source tool for
           | Kubernetes and would love to talk! (Email is in my profile)
        
           | HatchedLake721 wrote:
           | Does your company use paid versions of the open source tools
           | or pay support?
        
           | NWMatherson wrote:
           | We are planning to open source our agents in 2022!
        
         | NWMatherson wrote:
         | I agree. We are planning to launch a free edition with limited
         | size and data retention. For users to try / play with before
         | paying. It is in the works and we hope to have this out in the
         | next few months.
         | 
         | We are also thinking about launching trials too.
        
       | nodesocket wrote:
       | Hello. I own and run a DevOps consulting company and use DataDog
       | exclusively for clients. DD works pretty well as it integrates
       | with cloud providers (such as AWS), physical servers (agent), and
       | Kubernetes (helm chart). The pain point is still creating all the
       | custom dashboards, alerts, and DataDog integrations and
       | configuration. Managing the DataDog account can almost be a full-
       | time job for somebody. Especially with clients who have lots of
       | independent k8s clusters all in a single DD account (lots of
       | filtering on tags and labels).
       | 
       | What does ContainIQ offer in terms of benefits over well
       | established players like DataDog? I will say, the Traefik DataDog
       | integration is horrible and hasn't been updated in years so
       | that's something I wish was better. DataDog does support
       | Kubernetes events (into the feed), and their logging offering is
       | quite good (though very expensive).
        
         | NWMatherson wrote:
         | The dashboard configuration issue was actually one of the pain
         | points we targeted initially. It was an issue we experienced
         | too. And we talked to a lot of our friends who had spent
         | significant time setting these dashboards up in Datadog. One of
         | our initial goals has been to try to automate to get you 95% of
         | the way there without any configuration on your end. We've also
         | tried to make alerting really easy and are working to automate
         | the process of setting smart alerts. Would love to chat more
         | about your experience if you are open to it. My email is nate
         | (at) containiq (dot) com
        
       | MoSattler wrote:
       | How does this compare to Opstrace? [0]
       | 
       | [0]: https://opstrace.com
        
         | NWMatherson wrote:
         | Opstrace took an interesting approach (and was a YC company
         | too, recently acquired by GitLab). We are a managed solution,
         | whereas Opstrace was a self-hosted open source solution, and
         | we are not building on top of other open source tools. With
         | ContainIQ, you get metrics natively, plus features that you
         | wouldn't otherwise be able to get with Opstrace and its
         | integrations (e.g., p95 latency by endpoint).
        
       | kolanos wrote:
       | How does this compare to Pixie? [0]
       | 
       | [0]: https://github.com/pixie-io/pixie
        
         | outgame wrote:
         | Polar signals develops Parca [0] which is another eBPF
         | observability tool, and Isovalent develops Cilium [1] which is
         | built on eBPF as well. Genuinely curious if there are
         | differences, or if eBPF only allows for specific observability
         | functionality and each tool has it all.
         | 
         | [0]: https://github.com/parca-dev/parca
         | 
         | [1]: https://github.com/cilium/cilium
        
           | brancz wrote:
           | Polar Signals founder and one of the creators of Parca here.
           | From what I can tell ContainIQ is distinct from Parca and
           | Polar Signals as we only concern ourselves with continuous
           | profiling, which is complementary to metrics, logs and
           | traces. From our experience, while eBPF is certainly limited
           | and it can be painful to work with the verifier at times, it
           | hits a sweet spot for observability collection: the overhead
           | is low, and you really only read some structs from memory
           | somewhere, a case where eBPF's limitations rarely get in the
           | way.
           | 
           | Definitely excited to see more eBPF tooling appear in the
           | observability space.
        
             | NWMatherson wrote:
             | Well said, we are excited to see more eBPF tooling appear
             | as well.
        
         | NWMatherson wrote:
         | Pixie is definitely similar in their eBPF based approach. I
         | believe there are differences in the types of data they collect
         | and correlate with. For example we collect logs and state
         | information (node status, node conditions, pod scheduled, etc.)
         | alongside our eBPF-based metrics like latency. I'm sure there
         | are things they collect that we don't as well.
        
       ___________________________________________________________________
       (page generated 2022-01-06 23:01 UTC)