[HN Gopher] Launch HN: Opstrace (YC S19) - open-source Datadog
___________________________________________________________________
Launch HN: Opstrace (YC S19) - open-source Datadog
Hi HN! Seb here, with my co-founder Mat. We are building an open-
source observability platform aimed at the end user. We assemble
what we consider the best open source APIs and interfaces such as
Prometheus and Grafana, but make them as easy to use and featureful
as Datadog, with, for example, TLS and authentication by default.
It's scalable (horizontally and vertically) and upgradable without
a team of experts. Check it out here: http://opstrace.com/ &
https://github.com/opstrace/opstrace About us: I co-founded
dotCloud which became Docker, and was also an early employee at
Cloudflare where I built their monitoring system back when there
was no Prometheus (I had to use OpenTSDB :-)). I have since been
told it's all been replaced with modern stuff--thankfully! Mat and
I met at Mesosphere where, after building DC/OS, we led the teams
that would eventually transition the company to Kubernetes. In
2019, I was at Red Hat and Mat was still at Mesosphere. A few
months after IBM announced its acquisition of Red Hat, Mat and I
started
brainstorming problems that we could solve in the infrastructure
space. We started interviewing a lot of companies, always asking
them the same questions: "How do you build and test your code? How
do you deploy? What technologies do you use? How do you monitor
your system? Logs? Outages?" A clear set of common problems
emerged. Companies that used external vendors--such as CloudWatch,
Datadog, SignalFx--grew to a certain size where cost became
unpredictable and wildly excessive. As a result (one of many
downsides we would come to uncover) they monitored less (i.e. just
error logs, no real metrics/logs in staging/dev and turning metrics
off in prod to reduce cost). Companies going the opposite route--
choosing to build in-house with open source software--had different
problems. Building their stack took time away from their product
development, and resulted in poorly maintained, complicated messes.
Those companies are usually tempted to go to SaaS but at their
scale, the cost is often prohibitive. It seemed crazy to us that
we are still stuck in this world where we have to choose between
these two paths. As infrastructure engineers, we take pride in
building good software for other engineers. So we started Opstrace
to fix it. Opstrace started with a few core principles: (1) The
customer should always own their data; Opstrace runs entirely in
your cloud account and your data never leaves your network. (2) We
don't want to be a storage vendor--that is, we won't bill customers
by data volume because this creates the wrong incentives for us.
(AWS and GCP are already pretty good at storage.) (3) Transparency
and predictability of costs--you pay your cloud provider for the
storage/network/compute for running Opstrace and can take advantage
of any credits/discounts you negotiate with them. We are
incentivized to help you understand exactly where you are spending
money because you pay us for the value you get from our product
with per-user pricing. (For more about costs, see our recent blog
post here: https://opstrace.com/blog/pulling-cost-curtain-back).
(4) It should be REAL Open Source with the Apache License, Version
2.0. To get started, you install Opstrace into your AWS or GCP
account with one command: `opstrace create`. This installs Opstrace
in your account, creates a domain name and sets up authentication
for you for free. Once logged in you can create tenants that each
contain APIs for Prometheus, Fluentd/Loki and more. Each tenant has
a Grafana instance you can use. A tenant can be used to logically
separate domains, for example, things like prod, test, staging or
teams. Whatever you prefer. At the heart of Opstrace runs a Cortex
(https://github.com/cortexproject/cortex) cluster to provide the
above-mentioned scalable Prometheus API, and a Loki
(https://github.com/grafana/loki) cluster for the logs. We front
those with authenticated endpoints (all public in our repo). All
the data ends up stored only in S3 thanks to the amazing work of
the developers on those projects. An "open source Datadog"
requires more than just metrics and logs. We are actively working
on a new UI for managing, querying and visualizing your data and
many more features, like automatic ingestion of logs/metrics from
cloud services (CloudWatch/Stackdriver), Datadog compatible API
endpoints to ease migrations and side by side comparisons and
synthetics (e.g. Pingdom). You can follow along on our public
roadmap: https://opstrace.com/docs/references/roadmap. We will
always be open source, and we make money by charging a per-user
subscription for our commercial version which will contain fine-
grained authz, bring-your-own OIDC and custom domains. Check out
our repo (https://github.com/opstrace/opstrace) and give it a spin
(https://opstrace.com/docs/quickstart). We'd love to hear what
your perspective is. What are your experiences related to the
problems discussed here? Are you all happy with the tools you're
using today?
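The per-tenant, authenticated data path described in the post can be sketched from a client's point of view. Everything below (the domain scheme, URL paths, and token handling) is an illustrative assumption, not Opstrace's documented API:

```python
# Sketch of a per-tenant, authenticated write path. The URL scheme,
# paths, and header handling here are illustrative assumptions, not
# Opstrace's documented API.

def tenant_endpoints(cluster_domain: str, tenant: str) -> dict:
    """Hypothetical per-tenant write endpoints (Prometheus and Loki)."""
    base = f"https://{tenant}.{cluster_domain}"
    return {
        "prometheus_remote_write": f"{base}/api/v1/push",
        "loki_push": f"{base}/loki/api/v1/push",
    }

def auth_headers(tenant_token: str) -> dict:
    """Every request is authenticated with a tenant-scoped token."""
    return {"Authorization": f"Bearer {tenant_token}"}

# e.g. a 'prod' tenant on a hypothetical cluster domain:
endpoints = tenant_endpoints("mycluster.opstrace.io", "prod")
headers = auth_headers("example-token")
```

The point of the sketch is the isolation model: each tenant (prod, staging, a team) gets its own endpoints and its own credentials, so writes and reads can be attributed and rate-limited per tenant.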
Author : spahl
Score : 298 points
Date   : 2021-02-01 18:16 UTC (1 day ago)
| opsunit wrote:
| Why should I run this instead of renewing my Wavefront contract?
| englambert wrote:
| It's hard to answer that concretely without knowing a little
| bit more about your use cases. Care to share a bit more?
|
| One thing comes to mind: we don't bill by data volume.
| Wavefront is charging you for the volume of data your
| applications produce. This can lead to negative outcomes, such
| as surprise bills from a newly deployed service and a
| subsequent scramble to find and limit the offenders.
|
| We think this pricing model forms the wrong incentives.
| Charging by volume means a company is more incentivized to have
| its customers (you) send it more data, and less incentivized to
| help you get more value from that data. This is a
| fundamental change we want to bring to the market--we want our
| incentives to align with yours, we want to be paid for the
| value we bring to your company. We charge on a per-user basis.
| You should monitor your applications and infrastructure the
| right way, not afraid to send data because it might blow the
| budget.
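The incentive argument above can be made concrete with a bit of arithmetic (all prices and volumes below are invented for illustration):

```python
# Illustrative arithmetic (all prices and volumes invented): a
# volume-based bill tracks whatever your services happen to emit,
# while a per-seat bill tracks headcount, which a company plans.

def volume_cost(gb_ingested: float, price_per_gb: float) -> float:
    """Monthly bill under ingest-volume pricing."""
    return gb_ingested * price_per_gb

def seat_cost(engineers: int, price_per_seat: float) -> float:
    """Monthly bill under per-user pricing."""
    return engineers * price_per_seat

# A newly deployed chatty service triples log volume overnight:
before = volume_cost(2_000, 0.10)  # ~$200/month
after = volume_cost(6_000, 0.10)   # surprise: ~$600/month
# ...while the team, and hence a per-seat bill, is unchanged:
team = seat_cost(10, 20.0)         # $200/month either way
```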
| opsunit wrote:
| Wavefront brings a number of things to the table that aren't
| core competencies we wish to maintain in-house.
|
| I know it can scale to massive volumes without interaction
| from us.
|
| I know it'll be available when our infrastructure isn't. By
| being a third party we can be confident that any action on
| our part (such as rolling an SCP out to an AWS org, despite
| unit tests) won't impact the observability we rely on to tell
| us we've screwed that up.
|
| I can plug 100s of AWS accounts and 10s of payers into it and
| I don't have to think about that in terms of making self-
| hosted infrastructure available via PrivateLinks or some
| other such complication.
|
| I pay mid six-figure sums annually for these things to "just
| work". If you folks believe I can achieve this functionality
| on a per-seat basis I'd be interested in saving those six
| figures.
| englambert wrote:
| We're building Opstrace to be as simple as a provider like
| Wavefront -- we've failed if you need additional
| competencies to manage it. That being said, we're early in
| our journey and still have a ways to go.
|
| As mentioned in the original post here, at the core of
| Opstrace is Cortex (https://cortexproject.io). We know that
| Cortex scales well to hundreds of millions of unique active
| metrics, so depending on the exact characteristics of your
| workload, the fundamentals should be there.
|
| However, Cortex is a serious service to run, and DIYing it
| would require operations work that you currently don't have
| with Wavefront. This is the problem we're trying
| to solve--making these great OSS solutions easier to use
| for people like you.
|
| Opstrace is made to be exposed on the internet (which is
| optional of course), so you can easily run it in an
| isolated account to keep it safe from all other operations.
| And in fact, this is the configuration we recommend for
| production use.
|
| Regarding "100s of AWS accounts and 10s of payers"... does
| that include any form of multi-tenant isolation? We support
| multi-tenancy out of the box to enable controlling rate
| limits and authorization limits for different groups. We'd
| need to talk in more detail about that. If you'd like to do
| that privately, please shoot me an email at chris@opstrace.com.
| We're of course happy to continue the discussion here with
| you as well.
| mh- wrote:
| As a heads up, I think you meant to link
| https://cortexmetrics.io/
| jgehrcke wrote:
| Thanks for the correction! You linked to the right
| Cortex, not to be confused with
| https://github.com/TheHive-Project/Cortex, haha.
| https://github.com/cortexproject/cortex is what we talk
| about. Naming is hard.
| englambert wrote:
| :facepalm: Yes, indeed, I conflated the website and the
| GitHub org. Mea culpa.
| jgehrcke wrote:
| JP from Opstrace here.
|
| Thanks for sharing this perspective, stressing the relative
| value of predictability.
|
| Of course, when things go pear-shaped the last thing you
| want to discover is that your monitoring pipeline doesn't
| work as expected. We feel you.
|
| Your skepticism is justified and I'm super happy to see
| that here. We know that our future users are (and should
| be) quite demanding with respect to robustness of the
| platform.
|
| We're not naively assuming that it's easy to build a
| platform that is highly available, auto-scaling, and
| generally worry-free.
|
| In fact, based on our experience, we really know that we'll
| have to invest an incredible amount of engineering effort
| in order to make things super reliable and predictable. On
| the other hand, by making some smart decisions we can get
| far with little effort. We have super strong building
| blocks that we can rely on (such as using a cloud-provided
| database for storing critical configuration state).
|
| > If you folks believe I can achieve this functionality on
| a per-seat basis I'd be interested in saving those six
| figures.
|
| The bet is on, but of course we need a bit of time :)
| zaczekadam wrote:
| Hey, I think this might be the coolest product intro I've read.
|
| My two points - right now docs are clearly targeting users
| familiar with the competition but for someone like me who does
| not know similar products, a 'how it works' section with examples
| would be awesome.
|
| Fingers crossed!
| jgehrcke wrote:
| Jan-Philip here, from the Opstrace team. Thanks for these kind
| words! For sure, you're right, we can do a much better job at
| describing how things work. Providing great documentation is
| one of our top priorities :-)!
| tmzt wrote:
| You mentioned Loki in your post. I evaluated it for our company
| and was reasonably impressed with the simplicity of setup and
| efficient storage. Where it failed us was the difficulty
| searching by customer identifiers or other "high cardinality"
| labels, or full-text. There's a longstanding issue [1] on the
| Github for this. Are you doing anything to improve log search
| versus an Elasticsearch cluster, for instance?
|
| More broadly, how are you contributing to the upstream projects?
|
| [1]
| https://github.com/grafana/loki/issues?page=7&q=is%3Aissue+i...
| jgehrcke wrote:
| JP from Opstrace here. Great questions!
|
| > Are you doing anything to improve log search versus an
| Elasticsearch cluster, for instance?
|
| No. You're right, Loki is not designed for building up an index
| for full-text search. The premise here is that you won't
| typically need that and that in exchange for not having to
| build up that index, you get other advantages (such as being
| able to rely on an object store for both payload and index
| data!). If, on the other hand, in special situations, you need
| to "grep-search" your logs, this is absolutely doable with
| Loki! Loki does not neglect this use case. The opposite is
| true; everyone is already excited about the performance
| characteristics that Loki already has today when it comes to
| ad-hoc processing full text. For example, see
| https://twitter.com/Kuqd/status/1336722211604996098 and
| definitely have a look at
| https://grafana.com/blog/2020/12/08/how-to-create-fast-
| queri.... I'm sure Cyril is happy to answer your questions,
| too!
|
| > how are you contributing to the upstream projects?
|
| We're reporting issues and try to contribute as much as we can!
| Of course, this effort has only started. So far, we've
| contributed to Loki's Fluentd plugin (https://github.com/grafan
| a/loki/pulls?q=is%3Apr+author%3Ajge...), and our testing
| efforts have helped reveal edge cases; see for example
| https://github.com/grafana/loki/issues/2124 and
| https://github.com/grafana/loki/issues/3085.
|
| We're excited to substantially contribute to both Loki and
| Cortex in the future!
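For readers unfamiliar with Loki's design, here is a toy model in plain Python (a paraphrase, not Loki code) of the label-index-plus-scan approach discussed above:

```python
# Toy model (plain Python, not Loki code) of Loki's approach: only
# stream labels are indexed; a query first narrows streams by label
# match, then linearly scans ("greps") the raw lines. That is why
# high-cardinality values like customer IDs belong in the log line,
# not in a label.
import re

STREAMS = {
    ("app=checkout", "env=prod"): [
        'msg="payment ok" customer=c-123',
        'msg="payment failed" customer=c-456',
    ],
    ("app=checkout", "env=staging"): [
        'msg="payment ok" customer=c-123',
    ],
}

def query(label_filters: set, line_regex: str) -> list:
    pattern = re.compile(line_regex)
    hits = []
    for labels, lines in STREAMS.items():
        if label_filters <= set(labels):  # cheap: uses the label index
            # brute-force scan of the matching streams only:
            hits += [ln for ln in lines if pattern.search(ln)]
    return hits

matches = query({"env=prod"}, r"customer=c-456")
```

The trade-off in one line: keeping the index small (labels only) is what lets both index and payload live in an object store, at the price of scanning lines for anything not in a label.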
| richardw wrote:
| Point around incentives: We use Dynatrace. I'm sure it's an eye-
| watering price but I do like that everyone who wants a license
| can get one. I don't have to consider costs to add an entire dev
| team and teach them how to use it. It also means an entire dev
| team knows how to use it for future jobs.
| fat-apple wrote:
| We certainly don't want to create an adverse incentive where
| you would consider limiting the number of devs who had access
| to the monitoring system. There are trade-offs but we think
| that per-seat pricing (like GitLab, GitHub) actually does make
| it much easier to budget and plan for monitoring spend.
| Generally, a headcount plan is more predictable than the data
| your application generates. For example, a single engineer can
| add (and maybe should be adding) far more metrics and logs to
| their applications to monitor it correctly. They should not
| also be worried about breaking the budget when doing so. Does
| this make sense to you - what do you think?
| thow_away_4242 wrote:
| Meanwhile, AWS getting the "AWS Opstrace Service" branding and
| marketing pages ready.
| fat-apple wrote:
| :-) I have talked about the subject in this comment thread:
| https://news.ycombinator.com/item?id=25991764
| polskibus wrote:
| What are your plans on supporting open telemetry? Can I send open
| telemetry data to opstrace?
| [deleted]
| fat-apple wrote:
| Yes, you can do this today. You cannot send it directly to
| Opstrace via gRPC yet, so you would need to use the Prometheus
| remote write exporter (https://github.com/open-
| telemetry/opentelemetry-collector/bl...) with the open
| telemetry collector. Right now this means only metrics are
| supported but we will be working on traces as well. Check out
| the point about tracing on our roadmap
| https://opstrace.com/docs/references/roadmap
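A minimal OpenTelemetry Collector configuration along the lines described above might look like this (the endpoint and token are placeholders; see the exporter's documentation for the full option set):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheusremotewrite:
    # Placeholder endpoint and token; substitute your own values.
    endpoint: "https://<tenant>.<cluster-domain>/api/v1/push"
    headers:
      Authorization: "Bearer <tenant-token>"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```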
| tamasnet wrote:
| This looks very promising, thank you and congrats!
|
| Also, please don't forget about people (like me) who don't run on
| $MAJOR_CLOUD_PROVIDER. I'd be curious to try this e.g. on self-
| operated Docker w/ Minio.
| nickbp wrote:
| Hi, this is Nick Parker from the Opstrace team. I personally
| have my own on-prem arm64/amd64 K3s cluster, including a basic
| 4-node Minio deployment, so I'm very interested in getting
| local deployment up and running myself. We're a small team and
| we've been focusing on getting a couple well-defined use-cases
| in order before adding support for running Opstrace in custom
| and on-prem environments. It turns into a bit of a
| combinatorial explosion in terms of supporting all the
| possibilities. But we definitely want to support deploying to
| custom infrastructure eventually.
| e12e wrote:
| This looks like an interesting product. We're figuring out
| our monitoring stack - but have also found Loki/Grafana - and
| we're looking at Victoria Metrics rather than Cortex. Our
| hope is that the combination will turn out to be able to
| scale down as well as up, and be possible to fit in with
| Docker swarm/compose on-prem and at digital ocean. Also looks
| like vector might be a good option for collecting data.
|
| Will keep an eye out, to see if Opstrace might be a fit for
| us.
|
| https://victoriametrics.github.io/
| fat-apple wrote:
| Curious to learn more about the tradeoffs you've found
| between Cortex and Victoria Metrics and why you could be
| drawn to one more than the other?
| e12e wrote:
| Mostly the promise that they do mostly the same, but vm
| uses fewer resources. I believe vm has slightly less
| resolution/fidelity - but not so much that we expect it
| to matter for our use-case.
|
| But we're still implementing it - so we might still run
| into surprises.
| jgehrcke wrote:
| Jan-Philip from Opstrace here, good morning. Would love
| to hear more, after all, when you know more :-). By the
| way, we had been asked about VictoriaMetrics before,
| here:
| https://github.com/opstrace/opstrace/discussions/98.
| tamasnet wrote:
| Good to know, and thanks for confirming. I totally get that
| supporting all combinations can be a real challenge. I guess
| containerization helps, but that's becoming its own
| smorgasbord of almost-compatible bits.
| nickbp wrote:
| Yeah, anytime I hear "on-prem" deployment I think of my
| previous experience with getting a product deployed across
| a lot of different K8s environments. At the surface there
| are ostensibly common APIs, but the underlying components
| (networking, storage) are not necessarily interchangeable.
| There may also be custom policies around e.g. labels,
| SecurityContexts, or NetworkPolicies. In my own K3s cluster
| I generally just manage the YAML specs for the deployments
| by hand, since I'll often need to e.g. specify the arch
| constraint to run against, or ensure that it's running a
| multi-arch image. It's a really interesting problem though,
| and it's something that we're targeting.
| filereaper wrote:
| Happy to hear you're now with the old Mesosphere gang at
| Opstrace Nick!
| jgehrcke wrote:
| Feels like we're nowhere and everywhere now!
| mleonhard wrote:
| > opstrace create -c CONFIG_FILE_PATH PROVIDER CLUSTER_NAME
|
| > opstrace destroy PROVIDER CLUSTER_NAME
|
| > opstrace list PROVIDER
|
| I want to keep cluster config in source control, track deployment
| changes in code reviews, and automate deployments. Do you have
| any plans to add an 'apply' command to support this?
|
| $ opstrace apply -c CONFIG_FILE_PATH [--dry-run] PROVIDER
| CLUSTER_NAME
| jgehrcke wrote:
| Hey! JP from Opstrace here. Thanks for reading through things
| and for sharing your thoughts. The quick reply is that we still
| have to introduce a proper cluster config diff and mutation
| design.
|
| An `apply` command might look innocent on the surface. But.
| Upgrades (including config changes) are hard. Super hard. If
| it's helping a bit: the entire current Opstrace team has dealt
| with super challenging platform upgrade scenarios in the most
| demanding customer environments in the past years. We try to
| not underestimate this challenge :).
|
| We're super fast moving right now and didn't want to bother
| with in-place config changes (as you can imagine, we wouldn't
| really be able to provide solid guarantees around that). We'll
| work on that and make it nice when the time is right, and when
| we feel like we can actually provide guarantees.
| brodouevencode wrote:
| We use [insert very large application performance monitoring tool
| here] for workloads running in [insert very, very large cloud
| provider here] and after examining our deployments, concluded
| that we were spending nearly $13k/mo for data transfer out
| expenditures because the monitoring agents have crazy aggressive
| defaults. Seems like running our own (which may be worthwhile)
| would alleviate anything like that.
| nrmitchi wrote:
| Tip, if you happen to be using datadog, make sure datadog agent
| logs are disabled from being ingested into datadog.
|
| If you can disable them at the agent level and avoid the data
| out that would be even better.
|
| At a previous employer the defaults were quite literally half
| of our log volume, that we were paying for. I was doing a
| sanity check before renewing our datadog contract and was very
| not-pleased to discover that.
| fat-apple wrote:
| We're about to release a Datadog compatible API so you can
| point your Datadog agent at Opstrace instead (stay tuned for
| the blog post). Our goal is to be able to tell you exactly
| how much data the agent is sending and how much that is
| costing you (and for example what services/containers are
| responsible for the bulk of the cost). Here's a list of the
| PRs: https://github.com/opstrace/opstrace/pulls?q=is%3Apr+is%
| 3Acl...
| brodouevencode wrote:
| Lol no, the other really large one. Five-minute CloudWatch
| polling defaults are just overkill even in production.
| mdaniel wrote:
| I even opened a support ticket for their stupid python agent
| logging its connection refused tracebacks on every metrics
| poll and was told "too bad"
|
| They really don't give one whit about log discipline or
| allowing the user to influence the agent's log levels
| englambert wrote:
| Perhaps on a related note, see this discussion about the
| power of incentives here:
| https://news.ycombinator.com/item?id=25994653
| alexchamberlain wrote:
| It feels like the large monitoring applications should run
| aggregators in large cloud providers to reduce traffic for
| everyone.
| jgehrcke wrote:
| Haha, sure. I suppose that AWS, for example, has little
| incentive to allow Datadog to offer a special per-AZ
| endpoint. But hey, this is where we come into play :).
| cyberpunk wrote:
| Can hurt yourself that way too -- happened to us, but with not
| a lot of data, and all down to Thanos
| aggregating/reducing/whatever-ing meeeeeeeeeelions of metrics
| inside a s3 bucket to the tune of about 7k a month :/
| spahl wrote:
| Yes that is frustrating indeed. On top of paying your external
| vendor, you are punished by the egress cost you have to pay to
| your infrastructure cloud provider. This is one of the problems
| we wanted to solve. Feel free to contact me seb@opstrace.com.
| [deleted]
| rockyluke wrote:
| Congratulations! You did a really great job.
| arianvanp wrote:
| Your mascot is almost exactly identical to https://scylladb.com/
| 's mascot. Is there any connection; or a happy accident?
| englambert wrote:
| Chris here, from the Opstrace team. As it turns out, it's just
| a happy coincidence. When we discovered theirs we fell in love
| with it as well. They have many different versions of their
| monster (https://www.scylladb.com/media-kit/)... similarly
| you'll see several new versions of our mascot, Tracy the
| Octopus, over time!
| tailspin2019 wrote:
| Nicely designed site, great logo, but after clicking around a bit
| (and looking at GitHub) I'm confused by what this product
| actually is.
|
| DataDog has a UI. Does Opstrace? Or is it just a CLI/API based
| tool?
|
| If you actually have a UI element to your product you're doing a
| huge disservice to yourself by not actually showing this
| anywhere...
|
| EDIT: I don't mean to sound negative, I'm wondering if
| positioning this against Datadog is going to create immediate,
| potentially incorrect, expectations in people's minds as to what
| this product might provide.
|
| From first impressions I'd say this is much closer to Prometheus
| (which does have a UI but it's so basic it may as well not - but
| then the UI is not the point of Prometheus).
| Denzel wrote:
| The headline feels deliberately clickbait-y and disingenuous. I
| love and support the idea. Aspirationally, the founders may
| want to compete with Datadog, but OpsTrace overlaps some small
| percentage of Datadog's feature set.
|
| I'm surprised the mods haven't edited this title.
|
| Source: I'm an engineer that's used, operated and hacked on a
| medium-sized prom+grafana; and used Datadog at a large, multi-
| region, global scale.
| dang wrote:
| Sorry about that! I didn't see anything wrong with the title
| but I'm not familiar with Datadog's feature set. If you take
| the title as aspirational, it may make more sense.
| spahl wrote:
| It's definitely aspirational. With time and work the
| feature gap (for the important ones) will narrow :-)
|
| What we want to emphasize is that it is possible to build
| this and have the advantages of a Datadog without the
| drawbacks.
| Denzel wrote:
| No need to be sorry, you do a great job moderating HN dang!
|
| I just felt let down because the promise of the title
| (which I was excited by) doesn't match reality in that it's
| quite impossible to deliver Datadog's feature set for
| metrics, traces, and logs -- and tie all three together --
| on top of prometheus+grafana because the underlying TSDB
| doesn't even support the notion of user-customizable
| indexing. Prometheus indexes all labels, which is why most
| Prometheus users even have to worry about high cardinality
| and time series explosion. This is one point highlighting
| how OpsTrace's current architecture can not satisfy their
| positioning; there are many more. I'm sure they're aware of
| them.
|
| As an aspiration, the title makes sense. As someone in
| their target market, I was a bit turned off by the
| embellishment.
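The cardinality point made above can be illustrated numerically (label counts invented): because Prometheus indexes every label, the active-series count is roughly the product of each label's distinct-value count.

```python
# Rough model of the series-explosion problem (label counts invented):
# Prometheus indexes all labels, so active series grow as the product
# of each label's distinct-value count.
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Upper bound on active series for one metric name."""
    return prod(label_cardinalities.values())

modest = series_count({"instance": 50, "path": 20, "status": 5})
# 5,000 series: manageable. Add one per-customer label and it explodes:
exploded = series_count({"instance": 50, "path": 20, "status": 5,
                         "customer_id": 10_000})
# 50,000,000 series, each of which the TSDB must index.
```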
| fat-apple wrote:
| We're truly sorry that you felt let down! You raise a
| great point that is worth clarifying though. All logs,
| metrics and future traces are stored in S3/GCS (and yes
| _metrics_ are stored in TSDB format). While this does not
| allow a single query language to ask questions across all
| these sources in one query (yet), it is absolutely
| possible to build what Datadog is with a new user
| interface. To go even further, now that it's all in one
| place (S3 /GCS) other technologies can be leveraged to
| create a higher level query language across everything,
| such as PrestoDB (one would have to build backends for
| it), or even a new dedicated open source columnar store.
| Denzel wrote:
| > other technologies can be leveraged
|
| That's my point: your software architecture _must_ evolve
| to deliver on your promises (open-source Datadog) because
| it's _impossible_ to satisfy your promises with the
| architecture as it exists today.
|
| Nonetheless, I appreciate all the responses. Good luck to
| OpsTrace!
| fat-apple wrote:
| Thanks for your support! We're still early in our journey
| when compared to the featureset Datadog has built over time,
| but we definitely aspire to bridge the best of Datadog and
| open source. Based on your experience, what sorts of features
| would you prioritize to get there?
| jgehrcke wrote:
| I'd love to chime in, too, and say thank you for sharing this
| perspective. It's totally understandable, and this HN thread
| wouldn't be complete and genuine w/o this talking point.
| Based on your experience, we'd certainly appreciate for you
| to keep an eye on how our product evolves! Super interested
| in your expertise and opinions. Thanks again for not being
| afraid of being 'that guy'.
| fat-apple wrote:
| Thanks for that feedback - we've got a lot of work to do to
| make that more clear! We're still early in our journey so we're
| not there yet, but we're moving fast. We're working on a new
| collaborative UI for interacting with your data in a way that
| solves a lot of problems we've witnessed with current
| monitoring UIs (let me know if you want more detail). It's in
| early development and we haven't released it yet, so while
| Opstrace does have a UI now, it's currently limited to system
| management (adding/removing users and tenants). For interacting
| with data, we currently ship a Grafana instance per tenant. The
| roadmap has some basic information about this (might not be
| something you stumbled across). Let me know if I can clarify
| anything else. https://opstrace.com/docs/references/roadmap
| tailspin2019 wrote:
| Thanks for clarifying. I think all the great work you're
| doing on this is at risk of being diminished by pushing the
| DataDog comparison so hard when really there is (currently)
| no comparison in my view.
|
| Not to say you won't get there, but why not shout about your
| current strengths right now (if you're going to talk about
| the product) and not necessarily where you _hope_ to be.
|
| I only make this point because the "open source DataDog"
| headline just spins me off in the complete wrong direction as
| a newcomer and I end up confused and somewhat underwhelmed.
| With better positioning this wouldn't have been the case!
|
| That said, the project looks v interesting so keep up the
| good work and I hope you take my constructive criticism in
| the way it's intended!
| fat-apple wrote:
| Thanks for the great feedback, definitely understand where
| you're coming from and we'll keep that in mind moving
| forward! Very much appreciate the constructive criticism!
| jarym wrote:
| Very exciting! Question: your homepage says it'll always be
| Apache 2 but what will you do if someone like AWS rebrands your
| work (looking over at Elastic here)?
| fat-apple wrote:
| Mat here (Seb's Cofounder). Great question. We are not only
| building a piece of infrastructure but a complete product with
| its own UI and features, rather than a standalone API. Our
| customer is the end-user more than the person wanting to build
| on top of it. GitLab and others have shown that when you do
| that the probability of being forked or just resold goes down
| drastically.
| ensignavenger wrote:
| Gitlab is open-core, so that gives them a lot of closed
| source features to sell. Do you plan to be open-core like
| Gitlab?
| spahl wrote:
| Yes! We will be having features that you have to pay a
| subscription for. It starts with the usual suspects: custom
| SSO, custom domains, and authorization - things that we
| would be hosting as an ongoing service for customers. Most
| features will be open when we create them -- this is near
| and dear to our hearts -- it's important our users can be
| successful with the OSS version. Over time, the commercial
| features will also flow into the open as we release new
| proprietary ones. Our commercial features will be public in
| our repo, under a commercial license.
|
| We will also have a managed version where we deploy and
| maintain it for the customer in a cloud account they
| provide us.
| erinnh wrote:
| Will you have a small(er) plan for homelabs?
|
| I like supporting open source projects, and while SSO is
| pretty useless to me, I always like custom domains.
| spahl wrote:
| We are still experimenting with pricing and what can be
| open and closed. To be completely transparent we chose
| custom domains because we know companies care a lot. When
| we have more features on the commercial side we can start
| to chat about supporting it in the open version. Still
| early in our journey, happy to discuss anything, like a
| small plan with just custom domains. Would you pay for
| that?
| erinnh wrote:
| Yeah. That's kind of what I mean. I have no problem
| paying some money to support you guys and not have to
| host it on my own. I generally prefer my monitoring etc
| to not be done by myself anyway cause I take myself
| offline way too often
|
| My only point was whether there was going to be a smaller
| plan for homelabs or the like that doesn't have the raw
| amount of traffic or features that the enterprise plans do.
| tobilg wrote:
| Great job, congratulations from an ex-Mesosphere colleague!
| NSMyself wrote:
| Looking good, congrats on launching
| hangonhn wrote:
| Damn. That's one hell of a set of credentials for the founders.
|
| I was the engineer who was heavily involved with monitoring at my
| last job and a lot of what this is doing aligns with what I would
| have done myself. At my new job, I work on different stuff but I
| can see we're going to run into monitoring issues soon too. I'm
| so, so, so glad this is an option because I do not want to
| rebuild that stuff all over again. Getting monitoring scalable
| and robust is HARD!
| englambert wrote:
| Hey, thank you. :-) That's kind of how we feel -- it seems like
| everyone is building tooling around Prometheus, and frankly, we
| hope that collective effort can be redirected to more
| impactful value creation for our industry. On a personal note,
| most of us on the team have been there in one way or another,
| struggling to actually monitor our own work. We've had surprise
| Datadog bills and felt the pain of scaling Prometheus. (In
| fact, I'm planning a blog post about this struggle, so stay
| tuned.) It feels like this problem should already be solved,
| but it's not. So we're trying to fix it.
| crazy5sheep wrote:
| Prometheus is great; the main problem is the bloat of metrics
| it collects. One really needs to carefully define rules for
| what to scrape, and to compute, reduce, and filter out the
| metrics that aren't needed while precomputing the ones that
| are.
| englambert wrote:
| You're absolutely right.
|
| As mentioned earlier
| (https://news.ycombinator.com/item?id=25993825), our goal
| is to be super transparent; we want you to fully understand
| what you're spending on infrastructure. We feel good that
| there's an incentive to help you work through the problems
| that you've mentioned.
|
| Attributing collection and querying is made easier with
| authentication enabled by default. You can make your
| tenants as fine- or coarse-grained as you want, handing out
| authentication tokens to the producers writing to those
| tenants. This makes it easier to trace back to sources of
| bloat. You can also place rate limits on individual tenants
| to prevent bloat in the first place.
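To make the tenant-token flow above concrete, here is a minimal Python sketch of how a producer could build a Loki push-API request body (the `/loki/api/v1/push` JSON format is standard Loki; the Opstrace tenant URL and token in the trailing comment are hypothetical placeholders):

```python
import json
import time

def loki_push_payload(labels, lines):
    """Build a request body for Loki's push API (POST /loki/api/v1/push).

    labels: dict of stream labels; lines: list of log line strings.
    Loki expects nanosecond-precision timestamps encoded as strings.
    """
    now_ns = str(time.time_ns())
    return {
        "streams": [{
            "stream": labels,
            "values": [[now_ns, line] for line in lines],
        }]
    }

payload = loki_push_payload({"app": "checkout", "env": "staging"},
                            ["request handled in 12ms"])
body = json.dumps(payload)

# Hypothetical send step -- the exact Opstrace tenant URL scheme is an
# assumption; hand each producer only the token for its own tenant:
# requests.post("https://loki.<tenant>.<cluster>.opstrace.io/loki/api/v1/push",
#               data=body,
#               headers={"Authorization": "Bearer <tenant-api-token>",
#                        "Content-Type": "application/json"})
```

Because each producer authenticates with a tenant-scoped token, a spike in log volume can be traced back to the tenant (and thus the team) that wrote it.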
|
| Additionally, we think users might reconsider the premise
| of the problem. Because the cost of running Opstrace
| follows cloud economics (because it runs in your own cloud
| account), it's basically as cheap as it can possibly be. So
| you might consider that you do not have as much pressure to
| curate what is stored as you think. (I didn't say "no"
| pressure, but "less" might be a huge improvement. :-) )
| boundlessdreamz wrote:
| 1. It would be great if you could integrate with
| https://vector.dev/. It would also save you the effort of
| integrating with many sources.
|
| 2. When opstrace is setup in AWS/GCP, what is the typical fixed
| cost?
| fat-apple wrote:
| Great questions!
|
| (1) As it stands today, you can already use
| https://vector.dev/docs/reference/sinks/prometheus_remote_wr...
| to write metrics directly to our Prometheus API. You can also
| use https://vector.dev/docs/reference/sinks/loki/ to send your
| logs to our Loki API. Vector is very cool in our opinion and
| we'd love to see if there is more we can do with it. What are
| your thoughts?
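As a rough sketch (the endpoint URLs are assumptions, so check the Opstrace docs and the Vector sink docs linked above for the authoritative option names), a minimal Vector configuration wiring both sinks to an Opstrace instance might look like:

```toml
# Hypothetical Opstrace endpoints -- substitute your own tenant and cluster.
[sinks.opstrace_metrics]
type     = "prometheus_remote_write"
inputs   = ["my_metrics_source"]   # an upstream Vector source or transform
endpoint = "https://cortex.<tenant>.<cluster>.opstrace.io/api/v1/push"

[sinks.opstrace_logs]
type     = "loki"
inputs   = ["my_logs_source"]
endpoint = "https://loki.<tenant>.<cluster>.opstrace.io"
labels   = { app = "my-app" }
```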
|
| (2) As for cost, our super early experiments
| (https://opstrace.com/blog/pulling-cost-curtain-back) indicate
| that ingesting 1M active series with 18-month retention is less
| than $30 per day. This is a very important topic, and we've
| already spent quite a bit of time exploring it. Our goal
| is to be super transparent (something you don't get with SaaS
| vendors like Datadog) by adding a system cost tab in the UI.
| Clearly, the cost depends on the specific configuration and use
| case, i.e. on parameters such as load profile, redundancy, and
| retention. A credible general answer would come in the shape of
| some kind of formula, involving some of these parameters -- and
| empirically derived from real-world observations (testing,
| testing, testing!). For now, it's fair to say that we're in the
| observation phase -- from here, we'll certainly do many
| optimizations specifically towards reducing cost, and we'll
| also focus on providing good recommendations (because as we all
| know cost is just one dimension in a trade-off space). We're
| definitely excited about the idea of providing users useful,
| direct insight into the cost (say, daily cost) of their
| specific, current Opstrace setup (observation is key!). We've
| talked a lot about "total cost of ownership" (TCO) in the team.
| GeneralTspoon wrote:
| This looks super cool!
|
| We just moved away from Datadog because their log storage pricing
| is too high for us. We moved to BigQuery instead. But the
| interface kind of sucks.
|
| Would love to get this up and running. A couple of questions:
|
| 1. Is it possible to setup outside of AWS/GCP? I would like to
| set this up on a dedicated server.
|
| 2. If not - then do you have a pricing comparison page where you
| give some example figures? e.g. to ingest 1 billion log lines
| from Apache per month it will cost you roughly $X in AWS hosting
| fees and $Y per seat to use Opstrace
| fat-apple wrote:
| Currently you can only deploy to AWS and GCP, but we do intend
| to extend support to on-prem/dedicated servers in due course
| (see https://news.ycombinator.com/item?id=25992237). Until now
| we've been focusing completely on building a scalable,
| reliable product by standing on the shoulders of these cloud
| providers where we can take advantage of services like S3, RDS,
| and elastic compute.
|
| We've done a deep dive into the cost model for metrics and
| posted more about it here: https://opstrace.com/blog/pulling-
| cost-curtain-back. We are still working on a full cost analysis
| for logs - I'd be happy to send it to you once we have it (feel
| free to email me mat@opstrace.com to chat about your use case).
| Our goal is to be super transparent (see
| https://news.ycombinator.com/item?id=25992081) with cost and to
| have a page on our website that helps someone determine what to
| expect (probably some sort of calculator with live data). Our
| UI will also show you exactly what your system is currently
| costing you with some breakdown for teams or services so you
| know who/what is driving your monitoring cost. We're doing user
| testing on our to-be-released UI now and would love to have
| people like yourself give us early feedback (since you
| mentioned the BigQuery interface).
| alexhutcheson wrote:
| One pain point with Prometheus is that it has relatively weak
| support for quantiles, histograms, and sets[1]:
|
| - Histograms require manually specifying the distribution of your
| data, which is time-consuming, lossy, and can introduce
| significant error bands around your quantile estimates.
|
| - Quantiles calculated via the Prometheus "summary" feature are
| specific to a given host, and not aggregatable, which is almost
| never what you want (you normally want to see e.g. the 95th
| percentile value of request latency for all servers of a given
| type, or all servers within a region). Quantiles can be
| calculated from histograms instead, but that requires a well-
| specified histogram and can be expensive at query time.
|
| - As far as I know, Prometheus doesn't have any explicit support
| for unique sets. You can compute this at query time, but
| persisting and then querying high-cardinality data in this way is
| expensive.
|
| Understanding the distribution of your data (rather than just
| averages) is arguably the most important feature you want from a
| monitoring dashboard, so the weak support for quantiles is very
| limiting.
|
| Veneur[2] addresses these use-cases for applications that use
| DogStatsD[3] by using clever data structures for approximate
| histograms[4] and approximate sets[5], but I believe its
| integration with Prometheus is limited and currently only one-way
| - there is a CLI app to poll Prometheus metrics and push them
| into Veneur[6], but there's no output sink for Veneur to write to
| Prometheus (or expose metrics for a Prometheus instance to poll),
| and you aren't able to use the approximate histogram or
| approximate set datatypes if you go that route, because they
| can't be expressed as Prometheus metrics.
|
| It would be extremely useful to have something similar for
| Prometheus, either by integrating with Veneur or implementing
| those data structures as an extension to Prometheus.
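To illustrate why the bucket choice matters, here is a small Python sketch of the linear-interpolation scheme that PromQL's histogram_quantile() applies to cumulative buckets; the example buckets are made up, and any quantile that falls inside a wide bucket inherits that bucket's width as its error band:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    bound and ending with (float("inf"), total_count), mirroring
    Prometheus bucket semantics. Interpolates linearly within the
    bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if upper == float("inf"):
                # Rank falls in the open-ended bucket: the best we can
                # return is the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return upper
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound

# With coarse buckets, the p95 estimate lands somewhere inside the
# 0.5s-1.0s bucket -- the manual bucket layout bounds the accuracy.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)  # ~0.78s
```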
|
| [1] https://prometheus.io/docs/practices/histograms/
|
| [2] https://github.com/stripe/veneur
|
| [3] https://docs.datadoghq.com/developers/dogstatsd/
|
| [4] https://github.com/stripe/veneur#approximate-histograms
|
| [5] https://github.com/stripe/veneur#approximate-sets
|
| [6] https://github.com/stripe/veneur/tree/master/cmd/veneur-
| prom...
| sneak wrote:
| > _We will always be open source, and we make money by charging a
| per-user subscription for our commercial version which will
| contain fine-grained authz, bring-your-own OIDC and custom
| domains._
|
| Seems to me that these are at odds. If you're open source, why
| does anyone have to pay for these things?
|
| If you're open core, I think it's mighty misleading to say things
| like "We will always be open source" because then not only is it
| untrue on its face, but also if someone contributes useful
| features to the open source project that compete with or supplant
| your paid proprietary bits, you are incentivized to refuse to
| merge their work - extremely not in the spirit of open source.
|
| My perspective, which you asked for, is that open core is
| dishonest, and that you should be honest with yourselves about
| being a _proprietary software vendor_ if that's indeed your
| plan, and stop with the open source posturing.
|
| If I've misunderstood you, then I apologize.
| fat-apple wrote:
| Thanks for your feedback! As with many in the industry, we are
| trying our best to figure this out.
|
| Our intention is to be really transparent with how we build and
| price software, which is why our commercial features will also
| be public in our repo, but commercially licensed. Transparency
| is critical in our opinion.
|
| This is the model we've seen work for other highly impactful
| software projects.
|
| We've created a ticket to track our addition of commercial code
| to our repo: https://github.com/opstrace/opstrace/issues/319
| kazinator wrote:
| I'm gonna put on my St. Ignucius robe here, and say that yes,
| that behavior is within the limits of "open source".
|
| If they used "free software" language, then we might have a
| case for posturing.
| davelester wrote:
| SaaS is a common way for open source companies to create
| revenue; look no further than WordPress, GitLab, Databricks,
| DataStax, and many others. Kudos to the opstrace team for
| taking this path.
|
| There's nothing inherently dishonest when a company emphasizes
| their open source strategy. Open source community building is
| as much about shipping code as it is about leading people,
| and that requires you to be transparent about your
| intentions. I've interpreted opstrace's release as just that.
|
| I think the concern about neutral project governance is an
| important one. It's early days, but from what I've seen it
| seems clear what is being sold vs what is open today. The fact
| that the project is released under the Apache v2 license means
| that folks are able to reuse, distribute, and sell the project
| as they wish -- even fork it if they dislike the direction.
| That said, if governance is a priority for your use, I'd
| definitely look to projects in neutral software foundations
| like the Apache Software Foundation and CNCF.
| sciurus wrote:
| It looks like you're largely selling a fancy installer for
| software primarily developed by another company, Grafana Labs.
| They offer open source, hosted SaaS, and paid-for
| "enterprise" versions of their software.
|
| Why should someone choose Opstrace over purchasing from them
| directly?
| englambert wrote:
| Our installer is indeed an important part of what we're
| offering and we're continuously evolving our operator to manage
| the ongoing maintenance. But in terms of being a
| feature-complete, open-source Datadog, you're right that we
| have a long way to go to achieve our vision. As mentioned in
| other replies,
| we are working on other interesting components as well, such as
| a new collaborative UI
| (https://news.ycombinator.com/item?id=25996154), API
| integrations (https://news.ycombinator.com/item?id=25994268),
| and more.
|
| That being said, in case you couldn't tell, we love software
| from Grafana Labs. It's popular for a reason. However, we want
| it to be as easy to install and maintain as clicking a button,
| i.e., as simple as Datadog. So one problem we are trying to
| solve today is that while, yes, you can stitch together all of
| their OSS projects yourself (and many, many people do), it's a
| non-trivial exercise to set up and then maintain. We've done
| it ourselves and seen friends go through it; we'd like to
| save everyone from having to become a subject matter expert
| and reinvent the wheel. (Especially since when our friends
| do it themselves, they always skimp on important things
| like, say, security.)
| Bottom line--we're inspired by Grafana Labs. We strive to also
| be good OSS stewards and contribute to the overall ecosystem
| like they have.
|
| Another way to solve the "stitching-it-together" problem, as
| you mentioned, is of course to pay Grafana Labs for their
| SaaS (which I've done in the past) or one of their on-prem
| Enterprise versions. However, these are not open source. The
| former is hosted in their cloud account and single-tenant; the
| latter have no free versions. We think Opstrace provides a lot
| of value, but we understand that it's not for everyone.
| rtkaratekid wrote:
| Looking through the docs I'm seeing there will be (at some point)
| an API. Does this include ways to integrate data coming from non-
| Opstrace sources? My specific case is an in-house monitor that
| basically just generates data.
| englambert wrote:
| Yes, indeed! Currently you can already use the Prometheus
| remote_write API, as discussed here:
| https://news.ycombinator.com/item?id=25992392. So if you can
| collect your in-house metrics with Prometheus, then you could
| write to Opstrace.
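As a hedged sketch of what that Prometheus configuration might look like (the Opstrace endpoint and token path below are hypothetical placeholders; check the Opstrace docs for the actual tenant URL scheme):

```yaml
remote_write:
    # Hypothetical tenant endpoint -- substitute your tenant and cluster.
  - url: https://cortex.<tenant>.<cluster>.opstrace.io/api/v1/push
    # Tenant API token issued by Opstrace, mounted wherever you choose.
    bearer_token_file: /var/run/tenant-api-token
```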
|
| Additionally, we are close to launching a Datadog API as
| mentioned here: https://news.ycombinator.com/item?id=25994268
|
| So stay tuned to our blog or newsletter for more on this.
|
| Are there other specific APIs you're interested in?
| rtkaratekid wrote:
| Thanks for the reply! Not necessarily; I work in R&D
| developing low-level data collection, so I'm mostly just
| trying to keep my ear to the ground in terms of what's
| going on in the stack just above what I'm doing :)
| englambert wrote:
| Sounds good. I know that feeling. :-) Cheers.
| ogazitt wrote:
| Congrats on launching - looks awesome! It's about time we have an
| open source datadog :)
|
| Also, it's great to see the early focus on developer experience -
| "opstrace create".
| fat-apple wrote:
| Thank you! We like to simplify but not obfuscate!
| rubiquity wrote:
| Nice. I've talked myself out of starting a monitoring product at
| least a few dozen times. As you point out, customers get to
| choose between being gouged and running their own spaghetti.
|
| On top of bad UX, I do think the storage layer is where customers
| are really getting hit by these companies. The big players are
| using very unoptimized ingestion and querying layers and
| pretending like tiered storage never happened. Developers share
| some of the blame too by not being at all pragmatic about how
| long and how much to keep. It's a tough nut to crack.
|
| What's the plan for commercial? They run it themselves and pay
| per user? If so, that's refreshing.
| nickbp wrote:
| That's the plan! The incumbent SaaS providers are effectively
| charging a premium over the underlying storage -- their
| business is really reselling storage. Removing that premium
| via a self-hosted system greatly reduces the need to
| structure your applications around the cost of monitoring
| them. It also means that any negotiated discounts and storage
| features you may use (e.g., S3 Standard-IA) apply to data in
| Opstrace as well.
|
| We will also have a blog post about bad UX in a couple weeks...
| stay tuned. What are some of your biggest gripes about UX?
| stevemcghee wrote:
| FWIW, I was able to play with a preview and found it
| straightforward to set up and it kinda just did what I expected.
| I'm happy to see them taking next steps here. Good luck opstrace!
| dudeinjapan wrote:
| Hi there, at TableCheck (www.tablecheck.com) we recently adopted
| Lightstep.
|
| In a nutshell, running all these various components (Grafana,
| etc) is a royal pain in the neck. Even if `opstrace create`
| spawns them easily, the problem is running/maintaining them. We
| want someone to run these for us as a SaaS/PaaS and we're happy
| to pay them.
|
| Re: your principles:
|
| (1) The customer should always own their data --> we agree.
| However, we are happy for you to be a custodian of that data.
|
| (2) We don't want to be a storage vendor --> neither do we. We
| want storage to be someone else's problem. We're happy for you to
| use a cloud platform like AWS/GCP and charge us a 50% markup.
|
| (3/4) Transparency, predictability of costs, open source --> all
| excellent.
| BringerOfChaos wrote:
| @dudeinjapan, check out https://grafana.com/products/cloud/ The
| first line on the page says, "Your observability, managed as a
| service"
|
| It might not fit your use case...but it might.
| dudeinjapan wrote:
| We used self-hosted Grafana until late 2020. We looked at
| Grafana Cloud but ultimately chose a combination of New
| Relic One for metric monitoring and Lightstep for request
| tracing. Those two were the "best-of-breed" for their
| respective areas with the best pricing for our use case.
| jgehrcke wrote:
| Jan-Philip from Opstrace here. This is lovely feedback!
|
| > is a royal pain in the neck.
|
| It's fun to see how different people put the same unpleasant
| experience into words in this thread. Thanks for adding your
| personal touch. Every time we hear something like that, we're
| reassured that we're on the right track.
|
| > Even if `opstrace create` spawns them easily, the problem is
| running/maintaining them
|
| Yes. You're right. While we can be proud of our
| setup/installation process already, we know that there's so
| much more to it. We don't underestimate that. Maybe also see
| https://news.ycombinator.com/item?id=25998587, where I just
| commented on the robustness topic.
|
| > However, we are happy for you to be a custodian of that data.
|
| Great.
|
| > We want storage to be someone else's problem.
|
| I share that perspective. We, of course, are happy to let
| S3/GCS do the actual job.
|
| > We're happy for you to use a cloud platform like AWS/GCP and
| charge us a 50% markup.
|
| That's great to hear, and I hope you can be enthusiastic about
| the fact that our markup is _not_ going to be relative to
| storage volume. It's going to be independent of that.
|
| > Transparency, predictability of costs, open source --> all
| excellent.
|
| Thanks for sharing. That's incredibly motivating.
|
| Keep an eye on us, and we'd love to hear from you!
| dudeinjapan wrote:
| Will definitely keep an eye on your service. Again, don't
| underestimate users' willingness to pay for a PaaS product
| you make, even if you also do a dual PaaS/self-hosted option
| like GitLab or MongoDB do, for example. We'd definitely
| prefer the PaaS, and that's where the big $$$ is made these
| days.
|
| By the way, we are requiring that any vendor we choose in
| this area support OpenTelemetry, as we've already
| instrumented our apps with it. Lightstep, Datadog, and
| others already support it.
| xchaotic wrote:
| This is an interesting perspective. Our current monitoring
| system charges by the amount of data ingested / stored so
| there is a perverse incentive to observe less men if we have
| detailed debug level logs available.
| danmur wrote:
| I'm glad I don't have to pay per-man to observe men :P.
| xyzzy_plugh wrote:
| Lightstep was bananas expensive and had several limitations
| that lead to us moving away from it. Hopefully it's easier to
| scrub PII from it these days.
| englambert wrote:
| Heh, yes, the PII scrubbing need is very real. This is
| definitely a contributing factor to our data ownership
| commitment--just keep your data in your own account.
|
| On the cost topic, last week we published a blog post
| analyzing the cost of running Opstrace on AWS
| (https://opstrace.com/blog/pulling-cost-curtain-back). (In
| fact, feel free to do a local repro to confirm our results.)
| As mentioned elsewhere here on HN, we are incentivized to
| provide total transparency in terms of what you spend on your
| cloud infrastructure. We haven't compared ourselves to
| everyone, but feel confident that letting our customers pay
| S3 directly is the best deal possible.
| dudeinjapan wrote:
| Lightstep has recently changed their pricing; it's now the
| most cost-effective option out there. You pay $100 per
| service with basically unlimited data ingestion -- a
| fabulous deal. https://lightstep.com/pricing/
|
| As for scrubbing PII, they now support the OpenTelemetry
| tracing API, which does this as standard. For query
| endpoints you will see something like "QUERY Users where
| name=? email=?", i.e. masking with "?" chars, as you only
| care about the keys since those are what determine your
| indexing or lack thereof. (This is handled at the
| OpenTelemetry application library/plugin level.)
|
| As an aside, PII scrubbing should be done whether or not
| you own the cluster, because even if you own the cluster
| you generally don't want your support staff seeing PII,
| especially as the organization grows larger.
| [deleted]
| mrwnmonm wrote:
| Man, I was hoping someone would do this. Thanks very much. Please
| please please, care about the design. I don't know why open
| source projects always have bad design.
|
| Wish you all the best. and Congratulations!
| fat-apple wrote:
| Thank you! Yes, design is very near and dear to our hearts! If
| you're interested in giving me some early feedback on our UX,
| email me mat@opstrace.com.
| mrwnmonm wrote:
| Would love to. But will wait for the demo.
| snissn wrote:
| hi! Some quick perspective - my thoughts looking into this are
| "ok cool what metrics do i get for free? cpu load? disk usage?
| the hard to find memory usage?" and i just get lost in your home
| page without any examples of what the dashboard looks like
| nickbp wrote:
| Just to answer the question about what metrics are included:
| you can write and read any kind of custom metrics and log
| data from your applications. When first deployed, the user
| tenants (you can create any number of tenants to partition
| your data) are empty -- a clean slate ready for you to send
| any metrics/logs to. You then add your own dashboards to
| interpret the data you've sent.
|
| Opstrace does ship with a "system" tenant designed for
| monitoring the Opstrace system itself. This tenant has built-in
| dashboards that we've designed to show you the health of the
| Opstrace system.
|
| Incidentally, having sharable "dashboards" across
| people/teams/organizations is something we are also working on,
| so people don't have to re-invent dashboards all the time.
|
| We also have some guidelines for you to ingest metrics from
| Kubernetes clusters
| (https://opstrace.com/docs/guides/user/instrumenting-
| a-k8s-cl...) and are building native cloud metrics collection.
| Feel free to follow along in GitHub:
| https://github.com/opstrace/opstrace/issues/310.
| spahl wrote:
| We totally agree that our website is way too wordy, and we
| are working on explaining our vision in various ways --
| screenshots, of course, but also things like short videos.
| We actually just did one of our quickstart:
| https://youtu.be/XkVxYaHsDyY. It's not perfect, but we will
| get there :-)
|
| Thanks for the feedback, we appreciate it!
| spahl wrote:
| Sorry, I don't know what happened; here is a new YouTube
| link: https://www.youtube.com/watch?v=ooqBn1Q-y2Q
|
| It's from this page: https://opstrace.com/docs/quickstart
| nickstinemates wrote:
| Congrats! This is really exciting
___________________________________________________________________
(page generated 2021-02-02 23:02 UTC)