[HN Gopher] Netdata: Open-source real-time monitoring platform
___________________________________________________________________
Netdata: Open-source real-time monitoring platform
Author : dvfjsdhgfv
Score : 229 points
Date : 2021-04-21 08:26 UTC (14 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jitl wrote:
| I run Netdata on my home server using the official docker image.
| Beware - by default it'll send home telemetry, and the web UI
| will try to "register" your instance with some kind of cloud. I
| find this super annoying, but it's possible to turn it off; just
| not well documented.
|
| There are also a lot of plugins that scrape many kinds of logs,
| look at process data, etc. Again, they might be useful, but for a
| home user it's much better to turn it all off.
|
| My notes have a write-up of how to run Netdata via Docker
| with example config files that disable the unwanted features:
| https://jake.tl/notes/2019-11-09-cozy-server#netdata
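For reference, this is roughly what that setup looks like as a Compose file. The DISABLE_TELEMETRY variable and the mounted config directory are sketched from memory of the Docker image docs, so verify against your Netdata version:

```yaml
# docker-compose.yml (sketch)
services:
  netdata:
    image: netdata/netdata
    ports:
      - "19999:19999"
    environment:
      # opt out of anonymous telemetry; equivalent to creating an empty
      # .opt-out-from-anonymous-statistics file in the config directory
      - DISABLE_TELEMETRY=1
    volumes:
      # trimmed-down netdata.conf with unwanted collectors disabled
      - ./netdata-config:/etc/netdata
```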
| linsomniac wrote:
| We have a central influxdb with telegraf metrics among others,
| and some grafana graphs.
|
| I still install netdata on every machine though. I almost never
| use it, but there have been times when it was useful to look at
| netdata. It's lightweight enough that it hasn't been a problem.
| abbaselmas wrote:
| Why is netdata popular on HN now? There must be some big news or
| something..
| andrewm4894 wrote:
| I think it was just a random share by someone.
| saberience wrote:
| Why would I use it over DataDog?
| nine_k wrote:
| Different scale? Different price point?
| PanosJee wrote:
| More metrics, higher resolution, pre-configured alerts. It
| cannot displace DataDog as is, but it can work in tandem very
| nicely.
| Havoc wrote:
| It's very neat for individual servers.
|
| It doesn't work well for monitoring multiple servers, though,
| from what I can tell.
| PanosJee wrote:
| Have you given netdata.cloud a try? It's still early, but it
| solves this problem without any effort.
| _joel wrote:
| There's a netdata prometheus exporter, but it overlaps a lot
| with node-exporter. If you're already running netdata, however,
| then it could be a good choice.
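For anyone curious, pulling netdata into an existing Prometheus is a small scrape config against the agent's allmetrics endpoint (host and port are placeholders; the endpoint path is from the netdata docs, so double-check your version):

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
    static_configs:
      - targets: ['my-netdata-host:19999']
```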
| lima wrote:
| node_exporter is a lot more robust. I had both running for a
| while and netdata would get stuck when there was I/O trouble,
| while node_exporter was carefully built not to do any I/O and
| kept working just fine.
| _joel wrote:
| Yes, I'd stick to node exporter where possible.
| dddw wrote:
| Indeed, a very handy tool. I once used it to discover that a new
| deployment generated CPU load spikes. The reason was badly
| implemented JavaScript making a call to the db when hovering
| over a product (preview stock). Fun to see the actual
| correlation in a GUI in real time.
| odyslam wrote:
| Hey,
|
| That sounds super interesting. Would you be interested in
| sharing the use case in our forums?
| https://community.netdata.cloud
|
| Ping me (@odyslam in the forum) if you need any help. We are
| always looking for awesome stories like that!
| szszrk wrote:
| That is true. It has some features that let you quickly jump
| between machines, but out of the box it is indeed not built for
| many servers at the same time.
|
| BUT it can be configured to push data to Prometheus or similar
| (it's called a "backend") and some other integrations like
| notifications can be done.
|
| Super neat project, very easy to set up. I highly recommend
| it to anyone who does performance troubleshooting. Netdata
| put onto a standard Linux system will detect a lot of
| different things - firewalls, containers, software like
| databases, queueing systems and mail systems - and provide
| additional data every second.
|
| Works flawlessly on Proxmox clusters!
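A sketch of that push configuration, assuming the older "[backend]" section of netdata.conf (newer releases renamed this to "exporting", so check your version; the host and interval are placeholders):

```ini
# netdata.conf (sketch)
[backend]
    enabled = yes
    type = opentsdb
    destination = tsdb-host:4242
    # downsample 1-second data to 10-second averages before pushing
    update every = 10
```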
| Abishek_Muthian wrote:
| Neat project, but I'm also confused by 'distributed'; it
| sounds like it's designed for monitoring multiple systems in a
| single dashboard OOB with 'zero-config', but on further digging
| it seems like the distributed monitoring works 'only' with
| their cloud service[1].
|
| Further clarity on this would be appreciated.
|
| [1]
| https://github.com/netdata/netdata/blob/master/docs/quicksta...
| ohthehugemanate wrote:
| It's OOTB configured to use their free cloud service, but
| with 2 lines of config you can run your own central
| collection point instead. That's what I do for my home
| install.
|
| BUT the UI for this is just a dropdown for each of your
| monitored servers. I've found I actually want to export data
| to a more robust system so I can view patterns across
| machines, too.
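Those "2 lines of config" refer to netdata's streaming setup; a hypothetical sketch (the API key is any UUID you choose, host names are placeholders):

```ini
# on each child node: /etc/netdata/stream.conf (sketch)
[stream]
    enabled = yes
    destination = parent-host:19999
    api key = 11111111-2222-3333-4444-555555555555

# on the parent node: /etc/netdata/stream.conf
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```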
| Abishek_Muthian wrote:
| Thanks. So monitoring multiple machines is possible from a
| central console, although not in a single dashboard.
| Hopefully that will be available soon. The project seems
| useful as it is; I like the idea of getting the system
| information of all the servers and SBCs in my network.
| underhood wrote:
| Did you try Netdata Cloud?
| ktsaou wrote:
| Hi. I am the founder of Netdata.
|
| We complement the Netdata agent with Netdata.Cloud, a free-
| forever SaaS offering that maintains all the principles of the
| Netdata agent, while providing infrastructure level monitoring
| and several additional convenience features.
|
| In Netdata.Cloud, infrastructure is organized in war-rooms. On
| each war-room you will find the "Overview" page, that provides
| a fully automated dashboard, very similar to the one provided
| by the agent, in which every chart presented aggregates data
| from all servers in the war-room! Magic! Zero configuration!
| Fully automated!
|
| Keep in mind that Netdata.Cloud is a thin convenience layer on
| top of the Netdata agent. We don't aggregate your data. Your
| data stays inside your servers. We only collect and store a
| little metadata (how many netdata agents you have, which
| metrics they collect, what alarms have been configured, when
| they triggered - but not the actual metric and log data of
| your systems).
|
| Try it! You will be surprised!
| Havoc wrote:
| >I am the founder of Netdata.
|
| Awesome! I see the free tier is indeed looking generous. Just
| hooked up a node and looks good - I like the calculate
| correlations on alerts thing in particular.
|
| >Keep in mind that Netdata.Cloud is a thin convenience layer
| on top of the Netdata agent.
|
| I see. Didn't know/understand that.
|
| On the claim-node page - could you perhaps add the kickstart
| bash script code too? I find myself needing them one after
| the other, yet they're on different pages.
| andrewm4894 wrote:
| Good to hear metrics correlation might be useful to you;
| just as background, you can get more info here:
| https://www.netdata.cloud/blog/netdata-cloud-metric-
| correlat...
|
| At the moment it's based on a short window of data, so the
| focus is more on short-term changes around an area of
| interest you have already found.
|
| Longer term it would be cool to use an anomaly score on the
| metrics themselves (or the fact that a lot of alarms happen
| to be going off) to automatically find such regions for you,
| so it's more like surfacing insights to you, as opposed to
| you having to already know a window of time you are
| interested in.
| linsomniac wrote:
| >Keep in mind that Netdata.Cloud is a thin convenience layer
| on top of the Netdata agent. We don't aggregate your data.
|
| I didn't get that from the website until just now. I was
| looking and looking for how much it would cost to subscribe
| for our 150 dev/stg/prod VMs -- usually that's the killer.
| sneak wrote:
| Note that netdata phones home without consent in the default
| configuration. For many, the whole point of doing system-
| administration is selfhosting and autonomy, and privacy is
| frequently a big component of that.
|
| Netdata blows a big hole in that by transmitting your usage
| information off of your box without getting permission.
| odyslam wrote:
| That's a bit unfair. In the docs we are being very upfront that
| you can opt-out of anonymous telemetry:
| https://learn.netdata.cloud/docs/get
|
| We use the data we gather in order to make smarter product
| decisions. We want to invest resources where it matters, so we
| need to know how our users use the product.
|
| We are also very detailed on what we gather:
| https://learn.netdata.cloud/docs/agent/netdata-security
|
| Lastly, we just changed our analytics engine from Google
| Analytics to a self-hosted PostHog, which is an open-source
| product analytics platform.
| edoceo wrote:
| How about on install you prompt to opt IN to this "feature"?
|
| posthog or not, your target market is more sensitive to this
| telemetry crap than GP.
| ex_amazon_sde wrote:
| As odyslam wrote, opt-out is unethical.
| sneak wrote:
| The telemetry isn't anonymous: it includes the client IP; the
| method you use to transmit the data cannot work anonymously.
|
| Additionally, what's actually unfair is that you proceed with
| this spying without the consent of the user. Being upfront
| about it is not obtaining consent: it's just informing the
| user you're about to _violate_ their (lack of) consent.
|
| You must obtain consent from the user first, before
| transmitting their information. Otherwise, your software is
| spyware. (Disclosing that you're going to spy on the user
| doesn't make you not-spyware.)
|
| > _we use the data we gather in order to make smarter product
| decisions._
|
| Yes, you transmit the private data of the user for the
| express purpose of enriching yourself.
|
| Opt-out is unethical: you must obtain _opt-in_ consent
| _first_. The data you are transmitting does not belong to
| you.
| yunohn wrote:
| Dude, seriously?
|
| You choose to willfully install Netdata. You have to read
| the docs where the opt-out telemetry is clearly explained,
| before you can self-host it too. If you care, you can
| disable it.
|
| I honestly don't understand HN. Multiple commenters
| deriding a free open-source project for having basic
| telemetry to understand feature usage.
| andrewm4894 wrote:
| Disclaimer - I work for Netdata Cloud.
|
| We actually mask the IP address (https://github.com/netdata
| /dashboard/blob/master/src/domains...) so the real address
| is never sent - we send "127.0.0.1" as the IP to our self-
| hosted PostHog. Likewise with any URL or referrer-type
| event properties that could leak a hostname to us - we
| don't want that data at all, so we explicitly mask it
| before even capturing it in our telemetry system.
|
| Previously, when using a fairly standard Google Analytics
| implementation, we could not really have this level of
| control all that easily.
|
| So the hope is that with PostHog we can do better here
| while still enabling some really useful product telemetry
| to help us understand how to make the product better over
| time, and try to catch bugs and issues quicker too.
|
| Oh, and we have removed Google Tag Manager (GTM) from the
| agent dashboard too, so that it's no longer around as a
| possibility for loading other third-party tags.
|
| You can read more here: https://github.com/netdata/netdata/
| blob/master/docs/anonymou...
|
| p.s. PostHog is really cool - check it out:
| https://posthog.com/docs#philosophy,
| https://github.com/PostHog/posthog
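The masking idea reads roughly like this hypothetical sketch (illustrative property names, not the actual dashboard code linked above):

```python
# Hypothetical sketch of masking identifying fields in a telemetry
# event before it is captured; property names are illustrative.
MASKED_IP = "127.0.0.1"
MASKED_VALUE = "masked"
SENSITIVE_KEYS = ("$current_url", "$referrer")

def mask_event(properties: dict) -> dict:
    """Return a copy of the event with identifying fields replaced."""
    masked = dict(properties)
    if "ip" in masked:
        masked["ip"] = MASKED_IP        # never send the real client IP
    for key in SENSITIVE_KEYS:
        if key in masked:
            masked[key] = MASKED_VALUE  # URLs/referrers could leak hostnames
    return masked
```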
| mekster wrote:
| All these graphs are rarely actionable and are only of
| interest for a short period of time; you won't be looking at
| them after a while, because they don't mean anything unless
| you know where and when the problem is.
|
| A server admin wants an "Incident" panel that shows only the
| anomalous components at the top, coupled with an adjustable
| alerting mechanism - not just a blind dump of all the data
| there is.
|
| There are so many tools that do this and pretend it's
| impressive, including ELK, but whether it's Grafana or
| Kibana, you need a lot of manual tweaking to make the
| dashboards actually useful.
| chrisandchris wrote:
| So true.
|
| But netdata also supports alerts by e-mail or HTTP (e.g.
| for Slack), so why not just turn on notifications and live
| off them?
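Those notifications are set up in health_alarm_notify.conf; a sketch (variable names from memory of the shipped file - verify against your version; the webhook and address are placeholders):

```sh
# /etc/netdata/health_alarm_notify.conf (sketch)
SEND_EMAIL="YES"
DEFAULT_RECIPIENT_EMAIL="ops@example.com"

SEND_SLACK="YES"
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"
# the "|critical" modifier restricts this channel to critical alarms
DEFAULT_RECIPIENT_SLACK="alerts|critical"
```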
| andrewm4894 wrote:
| Disclaimer - I work at Netdata Cloud on ML.
|
| This is one of the things I am focusing on most - how to
| package and then surface "anomaly events" to the user, which
| the user can then quickly digest and decide whether or not
| they represent an "incident". So human-in-the-loop ML to
| help assist and lower the cognitive load of all the charts.
|
| We have a first step on this ladder via the Python-based
| anomalies collector, which you can play around with to see
| if it's any use:
| https://learn.netdata.cloud/docs/agent/collectors/python.d.p...
|
| That will give two summary aggregation type charts based on
| anomaly scores built off all the system.* charts by default
| (but can be configured however you want) - so, at each second,
| an anomaly probability for each chart and an anomaly flag if
| the model thinks that chart looked anomalous at that time.
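As a toy illustration of that output shape - not the collector's actual algorithm - a per-chart anomaly probability and flag per second could be sketched with a rolling z-score:

```python
from collections import deque
import math

class ChartAnomalyScorer:
    """Toy per-chart scorer: each second, emit an anomaly probability
    derived from a rolling z-score, plus a flag when it crosses a
    threshold. Illustrative only."""

    def __init__(self, window: int = 60, threshold: float = 0.99):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def score(self, x: float):
        prob = 0.0
        if len(self.values) >= 10:  # need some history first
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var) or 1e-9
            z = abs(x - mean) / std
            prob = 1.0 - math.exp(-z)  # squash the z-score into [0, 1)
        self.values.append(x)
        return prob, prob >= self.threshold
```

Feeding it a steady series keeps the probability near zero; a sudden spike pushes it toward 1 and raises the flag.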
|
| We are also working on some related projects to build this
| capability out more and do it at the edge, in C++ or Go as
| opposed to Python (or via a parent node), as cheaply as
| possible, so there is minimal, ideally negligible, impact on
| the agent itself. We should have more features related to
| this in the coming months, as we are just trying to dogfood
| them internally a little first.
|
| If anyone is interested, I'd love to hear some more feedback
| here or in this community megathread:
| https://community.netdata.cloud/t/anomalies-collector-feedba...
| or just email me at analytics-ml-team@netdata.cloud
| lima wrote:
| What's the argument for anomaly detection? It's an obvious
| thing to do that has been tried many times, but it doesn't
| actually seem to provide much value in practice (especially
| at large scale, where you'll get spurious correlations).
|
| What would you need it for? Once you've defined your SLOs,
| either your service meets them or it doesn't. What's the
| value in alerting someone that "this graph looks funny"?
| laminatedsmore wrote:
| Do 4xx responses count against your SLO? For me they don't,
| but an abnormal increase might still signify that something
| is actually wrong. (I haven't yet found a useful tool for
| highlighting this kind of abnormality though)
| toomanybeersies wrote:
| > What's the value in alerting someone that "this graph
| looks funny"?
|
| What's the value in mentioning to someone that the chicken
| they're about to eat looks a bit raw?
|
| It stops them from eating it and getting food poisoning.
|
| Anomalies are often warnings (harbingers?) of a problem
| which could lead to a fault and downtime.
| cakrit wrote:
| It's about troubleshooting. When you have a complex
| infrastructure, it's not enough to say that your db queries
| are slower than usual. Ok, so you immediately see that your
| db server is getting a lot more traffic. What was the root
| cause though and what can you do about it now? Given enough
| "funny charts", you can see for example that you have hit a
| resource limit that you can temporarily raise and also see
| that a particular component of your infrastructure has an
| anomalous behavior, e.g. a cron job that was usually
| utilizing resources for a few seconds, now takes minutes.
| So you can provide a quick workaround and move on to
| investigate what changed with that cron job.
| andrewm4894 wrote:
| I almost think of anomaly detection as a UI/UX type tool to
| help users navigate the data/systems. So use ML to find
| "interesting" or "novel" periods of time in your
| architecture (in the sense that the ML thinks they look
| novel based on some model), and then enable a user who is
| ultimately best placed to decide if it's actually of real
| interest to them or more like a false positive that they
| can just ignore and move on.
|
| So doing it in a way where you can quickly scan such events
| could, I think, be useful even if only 1 in 20 actually turn
| out to be a potential problem that might have been missed by
| your alarms, or could even be a precursor to some impact on
| SLOs etc.
|
| The aim would also be "this collection of graphs looks funny
| at the same time" as opposed to "this individual graph looks
| funny", since if you have an anomaly score for every chart,
| at any given moment some individual charts will be randomly
| firing. But when you pool the information across charts and
| hosts and systems, the hope is that you can then use anomaly
| detection as another way to explore your system and catch
| when things change unexpectedly.
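That cross-chart pooling could be sketched as (hypothetical, assuming you already have a per-second boolean anomaly flag per chart):

```python
def anomalous_windows(flags_by_chart: dict, min_charts: int = 3) -> list:
    """flags_by_chart maps chart name -> list of per-second anomaly
    flags. Return the seconds where at least `min_charts` charts are
    flagged at once - candidate "interesting" windows, which filters
    out individual charts randomly firing on their own."""
    n = len(next(iter(flags_by_chart.values())))
    counts = [sum(flags[t] for flags in flags_by_chart.values())
              for t in range(n)]
    return [t for t, c in enumerate(counts) if c >= min_charts]
```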
| mdale wrote:
| I generally agree that well-defined SLOs work for back-end
| services, as you define service contracts between services
| and care less about the particular funny graph being
| surfaced than about a particular service being out of
| contract.
|
| Where automatic anomaly detection was very valuable for us
| was in the video domain, with multi-dimensional end-user
| telemetry. I.e. what would be lost as noise in top-level
| metrics could be surfaced via anomaly detection for specific
| combinations of dimensions that you could not otherwise
| manually observe. E.g. video start time in Mexico is fine...
| but an ISP in Mexico City, while not failing outright, is
| newly underperforming once the data is sliced and anomalies
| highlighted - and we need to feed this data into our CDN
| switching to improve video start time there.
|
| The data had too many dimensions that were always changing,
| with degraded experience easily lost in the noise when
| measuring across platforms and our software updates,
| combinations of target devices, connection types,
| geolocation, specific content, active A/B tests, etc. In
| such cases automatic anomaly detection was pretty critical.
| ktsaou wrote:
| Thank you for this feedback. I am the founder of Netdata.
|
| Netdata is about making our lives easier. If you need to tweak
| Netdata, please open a github issue to let us know. It is a
| bug. Netdata should provide the best possible dashboards and
| alerts out of the box. If it does not for you, we missed
| something and we need your help to fix it, so please open a
| github issue to let us know of your use case. We want Netdata
| to be installed and effectively used with zero configuration,
| even mid-crisis, so although tweaking is possible and we
| support plenty of it, it should not be required.
|
| An "incident" is a way to organize people, an issue management
| tool for monitoring, a collaboration feature. Netdata's primary
| goal however, is about exploring and understanding our
| infrastructure. We are trying to be amazingly effective in this
| by providing unlimited high resolution metrics, real-time
| dashboards and battle tested alarms. In our roadmap we have
| many features that we believe will change the way we understand
| monitoring. We are changing even the most fundamental features
| of a chart.
|
| Of course at the same time we are trying to improve
| collaboration. This is why Netdata.Cloud, our free-forever
| SaaS offering that complements the open-source agent to
| provide out-of-the-box infrastructure-level monitoring
| alongside several convenience features, organizes our infra
| in war-rooms. In these war-rooms we have added metrics
| correlation tools that can help us find the most relevant
| metrics for something that got our attention - an alarm, a
| spike or a dive on a chart.
|
| For Netdata, the high-level incident panel you are looking
| for will be based on a mix of charts and alarms. And we hope
| it is going to be fully automated, autodetected and provided
| with zero configuration and tweaking. Stay tuned. We are
| baking it...
| Croftengea wrote:
| > our free-forever SaaS offering that complements the open-
| source agent
|
| How do you make or plan to make money?
| sdesol wrote:
| I was analyzing the activity in the netdata project, and
| what I found interesting was that this project is less
| active than I would have thought. See the following for
| insights into the project:
|
| https://public-001.gitsense.com/insights/github/repos?q=win
| d...
|
| In the last 30 days, there were 2 frequent and 3 occasional
| contributors. I honestly thought the number of frequent
| contributors would be much higher, which leads me to
| believe the project is quite mature and they don't need a
| lot of people to work on netdata.
|
| Based on Crunchbase, they've raised about 33 million so far,
| and if the number of people required to maintain netdata is
| low (relatively speaking, that is), I can see them not
| really needing to worry about making money; I'm guessing
| they are finding value in gathering data for ML.
| andrewm4894 wrote:
| Oh cool, that's a nice tool.
|
| P.S. I am the only person working on ML at Netdata, and I
| can confirm we don't gather any data for ML purposes -
| which is actually my biggest challenge right now :) -
| convincing people the ML can be useful without having lots
| of nicely labeled data from real netdata users with which
| to quantify that via typical metrics like accuracy. I'm
| hoping to introduce mainly unsupervised ML features into
| the product that don't rely on lots of labeled data, with
| thumbs up/down type feedback, and we can then use that to
| figure out if new ML-based features are working or being
| useful for users. So any models that would be trained would
| be trained on the host and live on the host, as opposed to
| in Netdata Cloud somewhere.
| sdesol wrote:
| > I am the only person working on ML at Netdata and I can
| confirm we don't gather any data for ML purposes, which
| is actually my biggest challenge right now :)
|
| Yeah, I'd have to imagine that would be an issue. This is
| just my personal opinion, but I think there should be a way
| to provide anonymized data for building anomaly-detection
| models. Maybe an opt-in feature, as it would benefit
| everybody using netdata.
| ktsaou wrote:
| > they've raised about 33 million
|
| yes, this is right
|
| > if the number of people required to maintain netdata is
| low (relatively speaking that is)
|
| The Netdata agent is a robust and mature product. We
| maintain it and we constantly improve it, but:
|
| - most of our efforts go to Netdata.Cloud
|
| - most of the action in the agent is in internal forks we
| have. For example, we are currently testing ML at the edge.
| This will eventually go into the agent, but it is not there
| yet. Same with eBPF: we do a lot of work to streamline the
| process of providing the best eBPF experience out there.
|
| > I can see them not really needing to worry about making
| money
|
| We are going to make money on top of the free tier of
| Netdata.Cloud. We are currently building the free tier.
| In about a year from now we will start introducing new
| paid features to Netdata.Cloud. Whatever we will have
| released by then, will always be free.
|
| > I'm guessing they are finding value in gathering data
| for ML
|
| No, we are not gathering any data for any purpose. Our
| database is distributed. Your data are your data. We
| don't need them.
| sdesol wrote:
| Hey thanks for the insights. I figured effort was being
| spent elsewhere and/or was not visible in the public
| repo.
| ktsaou wrote:
| The same way GitHub, Slack or Cloudflare provide massively
| free-forever SaaS offerings while making money.
|
| We believe that the world will greatly benefit from a
| monitoring solution that is massively battle-tested, highly
| opinionated, and incorporates all the knowledge and
| experience of the community in monitoring infrastructure,
| systems and applications. A solution that is installed in
| seconds, even mid-crisis, and is immediately effective in
| identifying performance and stability issues.
|
| The tricky part is to find a way to support this and
| sustain it indefinitely. We believe we nailed it!
|
| So, we plan to avoid selling monitoring features. Our free
| offering will never have a limit on the number of nodes
| monitored, the number of users using it, the number of
| metrics collected, analyzed and presented, the granularity
| of data, the number of war-rooms or dashboards, the number
| of alarms configured, the notifications sent, etc. All of
| these will always be free.
|
| And no, we are not collecting any data for ML or any other
| purpose. The opposite actually: we plan to release ML at
| the edge, so that each server will learn its own behavior.
|
| We plan to eventually sell increased convenience features,
| enforcement of compliance to business policies and
| enterprise specific integrations, all of them on top of the
| free offering.
| nawgz wrote:
| This is a good question, their website doesn't seem to have
| any "Pricing" information anywhere and everything is "get
| now" and "sign up for free"...
| unixhero wrote:
| I do not necessarily disagree with you regarding what a
| server admin / ops person needs; however,
|
| I for one deeply enjoy interacting with my Netdata dashboard
| whenever I want to deep-dive into my servers' resources and
| behaviors. For me it fits a purpose, and if I ever were to
| run a company that hosted things, I would want it and I
| would want to pay for it. I am a huge fan and a long-time
| homelab user of Netdata.
| goodpoint wrote:
| Netdata is focused on short-term, real-time metrics. I use it
| often during development.
| mdip wrote:
| > never really actionable ... only of interest for a short
| period ... you know where and when the problem is.
|
| I'm not a sysadmin at a large shop (I did that for a short
| bit, but prior to this existing), so I can only speak as a
| guy who runs a few big Linux servers/virtuals. I've had
| netdata installed on my home servers for quite some time.
| And yes, the graphs were really cool at first, and kind of
| faded into the background.
|
| Here's the thing: when something isn't right with those
| boxes, that's become the first place I visit. Since I had
| some franken-boxes with a bunch of storage, it's often
| related to the array, or btrfs. When I hop in there I'll
| notice an alert or two, google it, alter something and
| never see it again. It helped me solve some network issues.
|
| I don't know - short of it being a busier process than I
| wish it were on my server (only a little, and I'm running a
| few plugins on the one that I'm unhappy with), it's been
| helpful.
| fatlasp wrote:
| Same. I'm not a fully qualified sysadmin, but I do have
| access to a number of our servers (I'm more of a full-stack
| generalist than an expert at anything), and I immediately go
| to netdata when one of my services isn't acting right. For
| me it's a nice 'system at a glance' where I can check on the
| host and then alert someone more knowledgeable than myself
| if there's something that looks off.
| ARandomerDude wrote:
| I completely disagree. I work on a system with multiple servers
| communicating with each other and billions of events per day.
| Watching the meters wiggle on all the servers simultaneously is
| a really important debugging tool. If something slows down or
| goes down, and you know what it's connected to, it's pretty
| easy to troubleshoot what the cause is at an incredibly
| specific level just by looking at the meters.
|
| I think it's likely one's take on whether these tools are
| useful is very dependent on the system architecture.
| toomanybeersies wrote:
| There's value in having large dashboards that contain a bunch
| of non-prioritised graphs and gauges. I've managed to find a
| fair few problems by scrolling through such dashboards. Usually
| it's due to poorly configured monitors/alerts, but sometimes
| I'll spot things that you wouldn't reasonably expect an
| algorithm to pick up.
|
| Plus it's good fun to look at a big dashboard and pretend
| you're Homer Simpson at the Springfield Nuclear Power Plant.
| nednar wrote:
| You seem to have a very specific vision. Could you mock it
| up somehow? HTML, Figma, Paint, Powerpoint, whatever? Quite
| curious about your ideas.
| xPaw wrote:
| Netdata does have alarms/alerts, and comes with default ones.
|
| https://learn.netdata.cloud/guides/step-by-step/step-05
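The defaults live in the health.d config files; a typical entry looks roughly like this (syntax sketched from the stock files - verify field names against your version):

```ini
# sketch of a netdata health entry (e.g. health.d/cpu.conf)
 alarm: cpu_10min_usage
    on: system.cpu
lookup: average -10m unaligned of user,system
 every: 1m
  warn: $this > 75
  crit: $this > 90
  info: average CPU utilization over the last 10 minutes
```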
| odyslam wrote:
| Actually, we give a lot of thought to defining sane default
| alarms for most of the data sources that we have.
|
| We want our users to get 80% of the value with 20% of the
| effort.
|
| It's an opinionated approach that liberates a lot of users
| from having to set up and maintain everything.
| mdip wrote:
| I think you took a lot of flak in the original comment[0].
|
| The alarms in netdata resolved a long-standing network
| issue on one of my boxes, and have variously alerted me to
| problems I could resolve with storage which greatly
| improved performance on my largest volume. On my other box,
| one look at the graphs alerted me to the fact that the
| _entire_ SSD for my bcache volume was going unused[1]. I
| then used them while altering configuration and working
| with the drive to ensure the cache was being filled in a
| manner consistent with what the volume stored /how it was
| used.
|
| The more I think about it, I might not have been as
| enthusiastic in my original comment as I should have been.
| It's been very helpful to me. I don't usually keep things
| like this running for very long (it wastes cycles on aging
| hardware... that isn't heavily used, but hey, it's the
| principle!), but I've kept this around because every time
| I've thought about removing it, I've visited the dashboard
| one last time and found something there that made me keep
| it.
|
| [0] Though, as I mentioned, I'm not a sysadmin; I have a
| lab that might indicate otherwise, but I don't get paid for
| it.
|
| [1] I had reloaded the machine/redone a previous
| configuration that included bcache and it _screamed_; I
| knew my new setup was much slower, but I had forgotten
| about it until netdata made it obvious again. I can't
| remember what I had to do to fix it, but it had something
| to do with the policy used to determine whether a file
| should be put into the cache, and I think it was related to
| the fact that the cache was added to a volume with data
| present that rarely changed.
| ohthehugemanate wrote:
| > A server admin wants an "Incident" panel that only shows
| anomalous components at the top, coupled with an adjustable
| alerting mechanism, and not just a blind dump of all the
| data there is.
|
| Netdata does this too, with a ton of thresholds already set up
| by default. The list of active alerts is at the top, with a
| badge and everything. Notifications use a hook system, so you
| can use whatever mechanism you like. Personally I get emails
| for medium level alerts, SMS for high and above, and wall
| posts/notifications on my primary machine for crits. It took
| some tuning to get the thresholds right for me, all perfectly
| easy to do.
|
| I agree I would prefer to have the active warnings more visible
| than the graphs, but one click away really isn't bad.
| jeffbee wrote:
| Based on my experience, a dashboard that you only use once
| you know where and when the problem occurred is incredibly
| useful, and the lack of one can be very frustrating. While
| you of course need a systematic approach to incident
| detection, you _also_ need comprehensive eyes-on-glass
| dashboards during your investigations. "Anomaly detection"
| is much spoken of, but generalized anomaly detection
| doesn't exist. You still need skilled operators to just
| have a look around in many cases.
|
| An example, drawn from several major incidents in my career.
| You get an alert, you narrow it down to a process or machine,
| you evict the machine from your serving population to remediate
| the incident, but how do you keep it from recurring? The
| anomalous thing isn't apparent in your monitoring data, so it
| must be among the bazillion statistics that a running system
| exposes, but which you can't afford to collect and monitor on a
| per-host, per-container, per-process level of detail. That's
| when you want something exactly like netdata!
| tyingq wrote:
| I agree with this, and it's interesting how many open source
| tools there are that create these graphs and charts, store tons
| of data, etc. All with a mostly "eyes on glass" bent, which
| doesn't scale terribly well.
|
| When, really, what's more important is actionable events,
| correlation, duplicate suppression, escalating notifications,
| etc. Something like what "Netcool Omnibus" and other commercial
| software does. Isolate actionable problems and make sure
| somebody owns the problem.
|
| But for reasons I don't understand, there isn't much in the
| open source world in that space.
| enz wrote:
| Is it any good? ;)
| papazach wrote:
| It is great. I have claimed my VMs running Netdata to Netdata
| Cloud and I am very happy with it! It took me only a few minutes
| to claim them all (11 VMs) and boom, the dashboards were ready
| out of the box.
| unixhero wrote:
| Very good.
| odyslam wrote:
| yes :)
| mtmsr wrote:
| I played around with netdata just yesterday on my home server.
| Great tool, but the defaults are overkill for my needs. After
| spending an hour trying to simplify it (i.e. disable most of the
| "collectors") using the documentation, I finally gave up.
|
| I settled on neofetch [1] instead: pure bash. I wrote my own
| custom inputs, including color coding for incident reporting, in
| less time than it took me to strip down netdata. Highly
| recommended if you want to spend your time on things other than
| (setting up) server monitoring.
|
| [1] https://github.com/dylanaraps/neofetch
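A custom neofetch line of the kind described above is just a small bash function in its config file. A sketch, assuming the stock config layout: `print_info`, `info`, and `prin` are neofetch's own helpers, while the load-average line is an invented example of a custom input:

```shell
# ~/.config/neofetch/config.conf -- sourced by neofetch, not run standalone
print_info() {
    info "OS" distro
    info "Uptime" uptime
    info "Memory" memory
    # Custom line: raw 1/5/15-minute load averages from the kernel.
    prin "Load" "$(cut -d' ' -f1-3 /proc/loadavg)"
}
```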
| gregwebs wrote:
| Thanks for the link: neofetch seems like a good tool when you
| just want to see manually what is going on. Netdata is also
| designed to alert, forward data to other locations, monitor at
| 1-second granularity, and store historical data efficiently if
| you want to see what went on in the recent past.
| tifadg1 wrote:
| Could someone enlighten me on the internals, how is netdata able
| to get realtime granularity, whereas prometheus defaults to 15s?
| distantsounds wrote:
| it polls every second, to get metrics every second. genius, I
| know?
| gregwebs wrote:
| Prometheus is designed around metric centralization and running
| a scraper at some interval (every 15s). Netdata was originally
| focused on running on a single node and collecting at small
| intervals. Centralizing that data every second is a separate
| task, and you could avoid it with Netdata simply by viewing
| Netdata on the node in question. Netdata can also be configured
| to stream data to a central node.
|
| The centralized pull architecture of Prometheus does not lend
| itself to small-interval updates or to resiliency (you actually
| need to run two Prometheus instances and double-scrape for
| that).
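The per-second granularity is ultimately just the agent's collection interval, set in netdata.conf. A sketch (1 second is Netdata's documented default, versus the 15s scrape interval typically configured in Prometheus):

```
# netdata.conf -- collection interval; Netdata's default is already 1s
[global]
    update every = 1
```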
| wongarsu wrote:
| It doesn't store much history from what I can tell. If you don't
| have years' worth of data points, then having 15 times as many
| isn't a big deal.
| mekster wrote:
| And people are somehow meant to only monitor a single server?
|
| There's a reason timeseries databases are trying to get
| downsampling right.
| manigandham wrote:
| It's a locally installed agent that monitors and serves
| metrics on the same host. If you want to monitor multiple
| hosts then you can either visit the dashboards
| individually, or scrape the APIs and put the metrics on a
| combined dashboard - which is what Netdata Cloud is.
| cakrit wrote:
| It's because it was built with high granularity and unlimited
| metrics as key differentiators from the beginning. The core is
| written in pure C, optimized to death. Even long-term retention
| was initially sacrificed in order to achieve that high
| performance with minimal resource needs.
|
| Long-term retention is now possible, but with relatively high
| memory requirements, depending on how many metrics are
| collected. Again, it was a decision to never give up real-time
| granularity and speed, even at the cost of writing our own
| timeseries db in C and using more memory.
| dvfjsdhgfv wrote:
| The only gripe I have with it is the approach to security, i.e.
| the lack of user accounts (even one). So you have to either
| restrict access to the stats by IP (who does that these days?)
| or use other workarounds like proxying via Nginx.
| manos_saratsis wrote:
| You can use Netdata Cloud to have secure authenticated access
| to your single node dashboard. Data remain on your systems and
| are streamed to your browser. Netdata Cloud stores only
| metadata.
| odyslam wrote:
| Using Netdata Cloud is a great way to avoid spending any time on
| that: you access the Agent's dashboard through Netdata Cloud. We
| use WSS and MQTT, so it's super secure and lightweight.
|
| The data are streamed from the Agent directly to your browser
| via the cloud.
|
| Relevant docs:
| https://learn.netdata.cloud/docs/configure/secure-nodes#disa...
| sammy2244 wrote:
| So the only convenient way to have security is to use the
| cloud version? Got it.
| distantsounds wrote:
| yes, because a 10 line nginx config with basic http auth is
| too difficult for a sysadmin to set up in conjunction with
| his systems monitoring tool
|
| stop being obtuse
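For the record, the nginx config in question really is about ten lines. A minimal sketch with basic auth; the hostname and certificate paths are placeholders, and 19999 is Netdata's default port:

```nginx
# /etc/nginx/conf.d/netdata.conf -- adjust server_name and cert paths
server {
    listen 443 ssl;
    server_name netdata.example.com;

    ssl_certificate     /etc/ssl/certs/netdata.pem;
    ssl_certificate_key /etc/ssl/private/netdata.key;

    location / {
        auth_basic           "Netdata";
        # Create with: htpasswd -c /etc/nginx/htpasswd admin
        auth_basic_user_file /etc/nginx/htpasswd;
        proxy_pass           http://127.0.0.1:19999;
        proxy_set_header     Host $host;
    }
}
```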
| dvfjsdhgfv wrote:
| It's not that it's too difficult; it's that we were accustomed
| to having this functionality built into similar products in the
| past, and then things changed. When ELK first showed up, there
| was a big wave of attacks on ELK servers because they were
| completely unsecured, and at that time X-Pack Security was a
| paid add-on. They changed their mind later, some time after an
| open source alternative appeared.
| PixyMisa wrote:
| Absolutely. It has to be there, and users have to be
| forced to configure it at install time.
|
| How many times do we need to repeat this mistake?
| napsterbr wrote:
| That's the key difference between self-hosted and SaaS. If
| you self-host, you are responsible for setting up the
| required infrastructure, taking care of updates, backups
| etc.
|
| If setting up a reverse proxy in front of whatever monitoring
| you've got is too much, then yes, by all means use the SaaS
| offering -- but that's 100% the user's responsibility, and
| there's no need to be snarky about it.
| dvfjsdhgfv wrote:
| > If you self-host, you are responsible for setting up
| the required infrastructure, taking care of updates,
| backups etc.
|
| Are you speaking about Netdata or in general? Because if the
| former, then at least the updates part is not true: the
| installation script turns on nightly updates (and telemetry).
|
| Frankly, the reason there is no basic auth is that
| Netdata doesn't use a third-party web server but a built-
| in one, so they would have to add this functionality.
| dvfjsdhgfv wrote:
| > So the only convenient way to have security is to use the
| cloud version? Got it.
|
| I wouldn't put it that way. It's just a bit annoying for me to
| see this trend of not having even a tiny bit of security built
| in, and having to do extra work just to protect the dashboard.
| Just one admin account with a randomly generated password would
| be fine.
| tinco wrote:
| Proxying via nginx is not a workaround; it's the industry-
| standard way of managing access to services. It's both more
| convenient _and_ more secure, a rare combination.
|
| More convenient because you can use your company's pre-existing
| authentication to authenticate requests, and more secure because
| you're not having to manage separate passwords and user
| accounts.
| dvfjsdhgfv wrote:
| I understand your opinion, but it's not like that everywhere. I
| work for many clients who have single servers or specific
| setups, and having to configure Nginx is an extra step and an
| additional layer that could be made entirely unnecessary by
| building in just one admin account with a randomly assigned
| password.
| smarx007 wrote:
| I have it listen on a loopback interface and do SSH port
| forwarding when I want to look at the stats. Nginx proxying
| with basic auth is a perfectly reasonable approach and not a
| workaround in my humble opinion. I would trust these two
| approaches more than an unknown mechanism in Netdata.
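The loopback-plus-tunnel approach mentioned above is a one-liner; a sketch, where the hostname is a placeholder and 19999 is Netdata's default port:

```shell
# Forward local port 19999 to the Netdata instance bound to loopback
# on the server, then browse http://localhost:19999 on the workstation.
ssh -N -L 19999:127.0.0.1:19999 user@myserver.example.com
```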
| gregwebs wrote:
| Netdata is a great building block in a monitoring system. It now
| does a lot of monitoring via eBPF, connects to Prometheus, and
| integrates with k8s.
| odyslam wrote:
| We do love eBPF. Guilty as charged ¯\_(ツ)_/¯
|
| We have a whole bunch of metrics that we keep track of, and we
| are currently implementing a load more.
|
| Soonish, we will greatly increase the number of metrics that we
| gather with eBPF. That, coupled with our per-second granularity,
| should give you a very detailed view of the system.
|
| Docs:
| https://learn.netdata.cloud/docs/agent/collectors/ebpf.plugi...
| Community Forums discussion:
| https://community.netdata.cloud/t/linux-kernel-insights-with...
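If anyone wants to try it, the plugin is toggled in netdata.conf. A sketch, with the option name as I recall it from the docs linked above:

```
# netdata.conf
[plugins]
    ebpf = yes
```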
| crazypython wrote:
| Haven't been able to use its graphical interface to view
| historical data. At least it uses fewer resources than Grafana.
| distantsounds wrote:
| netdata doesn't store metrics long-term by default, but you can
| funnel whichever ones you want out and ship them off to a time-
| series store like Graphite or OpenTSDB.
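That funnelling goes through Netdata's exporting engine. A sketch of exporting.conf, assuming a Graphite backend; the connector name and destination are placeholders:

```
[exporting:global]
    enabled = yes

[graphite:my_graphite]
    enabled = yes
    destination = localhost:2003
    update every = 10
```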
| PanosJee wrote:
| It has the option now to set retention for up to a year.
| distantsounds wrote:
| ooh, even better!
| mprovost wrote:
| I write and maintain an open source monitoring tool and I looked
| into adding a mode to output metrics in Netdata format and ran
| away screaming. It's just an unstructured text format where you
| output commands to stdout, one per line. Each command consists of
| whitespace-separated fields. Which field is the units? Oh, the
| 4th. And some fields are optional, I'm not even sure how that
| works but I think you can't skip an optional field if you then
| want to use any field after that. It's like structured data
| formats like JSON or god forbid XML never happened.
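For the curious, the format being complained about looks roughly like this. This is a sketch of the external-plugin protocol from memory of the Netdata docs; the chart and dimension names are invented:

```shell
#!/usr/bin/env bash
# A plugin writes whitespace-separated commands to stdout, one per line:
# CHART and DIMENSION declare metadata, then BEGIN/SET/END form one
# collection cycle. Field order is positional, hence the complaint.
emit_metrics() {
    # CHART type.id name title units family context charttype priority update_every
    echo "CHART example.random '' 'Random Number' 'value' random random line 90000 1"
    echo "DIMENSION rand '' absolute 1 1"
    # one data collection cycle
    echo "BEGIN example.random"
    echo "SET rand = $((RANDOM % 100))"
    echo "END"
}
emit_metrics
```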
| cakrit wrote:
| Netdata can ingest Prometheus metrics as well, so you can just
| use that format. Eventually everything will converge on
| OpenMetrics/OpenTelemetry.
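The Prometheus-ingestion side is configured as a go.d collector job. A sketch; the job name and URL are placeholders:

```yaml
# go.d/prometheus.conf -- scrape any Prometheus-format endpoint
jobs:
  - name: my_app
    url: http://127.0.0.1:8080/metrics
```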
| petecooper wrote:
| Previous discussions:
|
| https://news.ycombinator.com/item?id=11388196
|
| https://news.ycombinator.com/item?id=17773874
|
| https://news.ycombinator.com/item?id=26886792
|
| (For commentary, I'm not being snarky.)
| hivacruz wrote:
| How does it compare to New Relic, which also monitors, if
| enabled, containers and system-level things?
___________________________________________________________________
(page generated 2021-04-21 23:02 UTC)