[HN Gopher] Netdata: Open-source real-time monitoring platform
       ___________________________________________________________________
        
       Netdata: Open-source real-time monitoring platform
        
       Author : dvfjsdhgfv
       Score  : 229 points
       Date   : 2021-04-21 08:26 UTC (14 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jitl wrote:
       | I run Netdata on my home server using the official docker image.
       | Beware - by default it'll send home telemetry, and the web UI
       | will try to "register" your instance with some kind of cloud. I
       | find this super annoying, but it's possible to turn it off; just
       | not well documented.
       | 
        | There are also a lot of plugins that scrape many kinds of logs,
        | look at process data, etc. Again, these might be useful, but for
        | a home user it's much better to turn them all off.
       | 
        | My notes have a write-up of how to run Netdata via Docker with
        | example config files that disable the unwanted features:
        | https://jake.tl/notes/2019-11-09-cozy-server#netdata
        
       | linsomniac wrote:
       | We have a central influxdb with telegraf metrics among others,
       | and some grafana graphs.
       | 
        | I still install netdata on every machine though. I almost never
        | use it, but there have been times where it was useful to look at
        | netdata. It's lightweight enough that it hasn't been a problem.
        
       | abbaselmas wrote:
        | Why is netdata popular on HN now? There must be some big news or
        | something..
        
         | andrewm4894 wrote:
          | I think it was just a random share by someone.
        
       | saberience wrote:
       | Why would I use it over DataDog?
        
         | nine_k wrote:
         | Different scale? Different price point?
        
         | PanosJee wrote:
          | More metrics, higher resolution, pre-configured alerts. It
          | cannot displace it as is, but it can work in tandem very
          | nicely.
        
       | Havoc wrote:
       | It's very neat for individual servers
       | 
        | Doesn't work well for monitoring multiple servers, though, from
        | what I can tell.
        
         | PanosJee wrote:
          | Have you given netdata.cloud a try? It's still early, but it
          | solves this problem without any effort
        
         | _joel wrote:
          | There's a netdata prometheus exporter, but it overlaps a lot
          | with node-exporter. If you're already running netdata,
          | however, then it could be a good choice
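          | 
          | As a sketch, the agent also exposes its metrics in Prometheus
          | text format on its own API, so a scrape job can be configured
          | without an extra exporter (endpoint and params as documented
          | for your Netdata version):

```yaml
# prometheus.yml fragment: scrape a Netdata agent directly.
scrape_configs:
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
    static_configs:
      - targets: ['myhost:19999']   # hypothetical host name
```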
        
           | lima wrote:
            | node_exporter is a lot more robust. I had both running for
            | a while, and netdata would get stuck when there was I/O
            | trouble, while node_exporter was carefully built not to do
            | any I/O and kept working just fine.
        
             | _joel wrote:
              | Yes, I'd stick to node exporter where possible.
        
         | dddw wrote:
          | Indeed, very handy tool. I once used it to discover that a new
          | deployment generated CPU load spikes. The reason was badly
          | implemented javascript making a db call whenever you hovered
          | over a product (preview stock). Fun to see the actual
          | correlation in a GUI in realtime.
        
           | odyslam wrote:
           | Hey,
           | 
            | That sounds super interesting. Would you be interested in
            | sharing the use-case in our forums?
           | https://community.netdata.cloud
           | 
           | Ping me (@odyslam in the forum) if you need any help. We are
           | always looking for awesome stories like that!
        
         | szszrk wrote:
          | That is true. It has some features that let you quickly jump
          | between many machines, but out of the box it is indeed not
          | built for many servers at the same time.
          | 
          | BUT it can be configured to push data to Prometheus or
          | similar (this is called a "backend"), and some other
          | integrations, like notifications, can be set up.
         | 
          | Super neat project, very easy to set up. I highly recommend it
          | to anyone who does performance troubleshooting. Netdata put
          | onto a standard Linux system will detect a lot of different
          | things (firewalls, containers, and software such as databases,
          | queuing systems and mail systems) and provide additional data
          | every second.
         | 
         | Works flawlessly on Proxmox clusters!
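          | 
          | A minimal sketch of that "backend" configuration in
          | netdata.conf (the graphite type shown; option names per the
          | docs for your Netdata version):

```ini
# netdata.conf: push collected metrics to an external time-series store.
[backend]
    enabled = yes
    type = graphite          # other types exist, e.g. opentsdb, json
    destination = localhost:2003
    update every = 10        # seconds between pushes
```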
        
         | Abishek_Muthian wrote:
         | Neat project but I'm also confused with 'distributed'; It
         | sounds like designed for monitoring multiple systems in a
         | single dashbaord OOB with 'zero-config', But on further digging
         | it seems like the distributed monitoring works 'only' with
         | their cloud service[1].
         | 
         | Further clarity on this would be appreciated.
         | 
         | [1]
         | https://github.com/netdata/netdata/blob/master/docs/quicksta...
        
           | ohthehugemanate wrote:
           | It's OOTB configured to use their free cloud service, but
           | with 2 lines of config you can run your own central
           | collection point instead. That's what I do for my home
           | install.
           | 
           | BUT the UI for this is just a dropdown for each of your
           | monitored servers. I've found I actually want to export data
           | to a more robust system so I can view patterns across
           | machines, too.
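            | 
            | For reference, that central collection point is set up via
            | netdata streaming; a minimal sketch of stream.conf (the
            | hostname and api key below are placeholders):

```ini
# stream.conf on each child node: send metrics to a parent.
[stream]
    enabled = yes
    destination = parent.example.com:19999
    api key = 11111111-2222-3333-4444-555555555555

# stream.conf on the parent node: accept children using that key.
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```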
        
             | Abishek_Muthian wrote:
              | Thanks. So monitoring multiple machines is possible in a
              | central console, although not in a single dashboard.
              | Hopefully that will be available soon. The project seems
              | useful as it is; I like the idea of getting the system
              | information of all the servers and SBCs on my network.
        
         | underhood wrote:
         | Did you try Netdata Cloud?
        
         | ktsaou wrote:
         | Hi. I am the founder of Netdata.
         | 
         | We complement the Netdata agent with Netdata.Cloud, a free-
         | forever SaaS offering that maintains all the principles of the
         | Netdata agent, while providing infrastructure level monitoring
         | and several additional convenience features.
         | 
          | In Netdata.Cloud, infrastructure is organized in war-rooms. In
          | each war-room you will find the "Overview" page, which
          | provides a fully automated dashboard, very similar to the one
          | provided by the agent, in which every chart presented
          | aggregates data from all servers in the war-room! Magic! Zero
          | configuration! Fully automated!
         | 
          | Keep in mind that Netdata.Cloud is a thin convenience layer on
          | top of the Netdata agent. We don't aggregate your data. Your
          | data stay inside your servers. We only collect and store a
          | little metadata (how many netdata agents you have, which
          | metrics they collect, what alarms have been configured, when
          | they triggered - but not the actual metric and log data of
          | your systems).
         | 
         | Try it! You will be surprised!
        
           | Havoc wrote:
           | >I am the founder of Netdata.
           | 
           | Awesome! I see the free tier is indeed looking generous. Just
           | hooked up a node and looks good - I like the calculate
           | correlations on alerts thing in particular.
           | 
           | >Keep in mind that Netdata.Cloud is a thin convenience layer
           | on top of the Netdata agent.
           | 
           | I see. Didn't know/understand that.
           | 
           | On the claim node page - could you perhaps add the kickstart
           | bash script code too? I find myself needing them one after
           | the other yet they're on different pages
        
             | andrewm4894 wrote:
              | Good to hear metrics correlation might be useful to you;
              | just as background, you can get more info here:
              | https://www.netdata.cloud/blog/netdata-cloud-metric-
              | correlat...
             | 
             | At the moment it's based on a short window of data so the
             | focus is more for short term changes around an area of
             | interest you have already found.
             | 
            | Longer term, it would be cool to be able to use an anomaly
            | score on the metrics themselves (or notice if a lot of
            | alarms happen to be going off) to automatically find such
            | regions for you, so it's more like surfacing insights to
            | you as opposed to you having to already know a window of
            | time you are interested in.
        
           | linsomniac wrote:
           | >Keep in mind that Netdata.Cloud is a thin convenience layer
           | on top of the Netdata agent. We don't aggregate your data.
           | 
           | I didn't get that from the website until just now. I was
           | looking and looking for how much it would cost to subscribe
           | for our 150 dev/stg/prod VMs -- usually that's the killer.
        
       | sneak wrote:
        | Note that netdata phones home without consent in the default
        | configuration. For many, the whole point of doing system
        | administration is self-hosting and autonomy, and privacy is
        | frequently a big component of that.
        | 
        | Netdata blows a big hole in that by transmitting your usage
        | information off of your box without getting permission.
        
         | odyslam wrote:
          | That's a bit unfair. In the docs we are very upfront that you
          | can opt out of the anonymous telemetry:
          | https://learn.netdata.cloud/docs/get
          | 
          | We use the data we gather in order to make smarter product
          | decisions. We want to invest resources where it matters, so we
          | need to know how our users use the product.
         | 
         | We are also very detailed on what we gather:
         | https://learn.netdata.cloud/docs/agent/netdata-security
         | 
          | Lastly, we just changed our analytics engine from Google
          | Analytics to a self-hosted PostHog, which is an open-source
          | product analytics platform
        
           | edoceo wrote:
            | How about on install you prompt users to opt IN to this
            | "feature"?
            | 
            | posthog or not, your target market is more sensitive to this
            | telemetry crap than GP.
        
           | ex_amazon_sde wrote:
           | As odyslam wrote, opt-out is unethical.
        
           | sneak wrote:
           | The telemetry isn't anonymous: it includes the client IP; the
           | method you use to transmit the data cannot work anonymously.
           | 
           | Additionally, what's actually unfair is that you proceed with
           | this spying without the consent of the user. Being upfront
           | about it is not obtaining consent: it's just informing the
           | user you're about to _violate_ their (lack of) consent.
           | 
           | You must obtain consent from the user first, before
           | transmitting their information. Otherwise, your software is
           | spyware. (Disclosing that you're going to spy on the user
           | doesn't make you not-spyware.)
           | 
           | > _we use the data we gather in order to make smarter product
           | decisions._
           | 
           | Yes, you transmit the private data of the user for the
           | express purpose of enriching yourself.
           | 
           | Opt-out is unethical: you must obtain _opt-in_ consent
           | _first_. The data you are transmitting does not belong to
           | you.
        
             | yunohn wrote:
             | Dude, seriously?
             | 
             | You choose to willfully install Netdata. You have to read
             | the docs where the opt-out telemetry is clearly explained,
             | before you can self-host it too. If you care, you can
             | disable it.
             | 
             | I honestly don't understand HN. Multiple commenters
             | deriding a free open-source project for having basic
             | telemetry to understand feature usage.
        
             | andrewm4894 wrote:
              | Disclaimer - I work for Netdata Cloud.
             | 
              | We actually mask the ip address (https://github.com/netdata
              | /dashboard/blob/master/src/domains...) so the real IP is
              | never sent - we just send "127.0.0.1" as the IP into our
              | self-hosted PostHog. Likewise with any URL or referrer
              | type event properties that could leak a hostname to us -
              | we don't want that data at all, so we explicitly mask it
              | before even capturing it in our telemetry system.
             | 
             | Previously, when using a fairly standard Google Analytics
             | implementation, we could not really have this level of
             | control all that easily.
             | 
              | So the hope is that with PostHog we can do better here,
              | while still enabling some really useful product telemetry
              | to help us understand how to make the product better over
              | time, and try to catch bugs and issues quicker too.
             | 
              | Oh, and we have removed Google Tag Manager (GTM) from the
              | agent dashboard, so that's no longer around as a
              | possibility for loading other third-party tags.
             | 
             | You can read more here: https://github.com/netdata/netdata/
             | blob/master/docs/anonymou...
             | 
             | p.s. PostHog is really cool - check it out:
             | https://posthog.com/docs#philosophy,
             | https://github.com/PostHog/posthog
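              | 
              | An illustrative sketch of that masking step (hypothetical
              | code, not Netdata's actual implementation; the property
              | names are assumptions modeled on PostHog conventions):

```python
# Overwrite sensitive event properties with placeholders before they
# are captured, so the real values never reach the analytics backend.
MASKED_PROPERTIES = {
    "$ip": "127.0.0.1",             # real client IP is dropped
    "$current_url": "agent dashboard",
    "$referrer": "agent dashboard",
}

def mask_event(event: dict) -> dict:
    """Return a copy of `event` with sensitive properties replaced."""
    masked = dict(event)
    for key, placeholder in MASKED_PROPERTIES.items():
        if key in masked:
            masked[key] = placeholder
    return masked

raw = {"event": "page_view",
       "$ip": "203.0.113.7",
       "$current_url": "http://myhost:19999/"}
print(mask_event(raw)["$ip"])  # → 127.0.0.1
```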
        
        | mekster wrote:
        | All these graphs are never really actionable and are only of
        | interest for a short period of time; you won't be looking at
        | them after a while, because they don't mean anything unless you
        | know where and when the problem is.
        | 
        | A server admin wants an "Incident" panel that shows only the
        | anomalous components at the top, coupled with an adjustable
        | alerting mechanism, not just a blind dump of all the data there
        | is.
        | 
        | There are so many tools that do this and pretend it's
        | impressive, including ELK, but whether it's Grafana or Kibana,
        | you need a lot of manual tweaking to make the dashboards
        | actually useful.
        
         | chrisandchris wrote:
         | So true.
         | 
          | But netdata also supports alerts by e-mail or over HTTP (e.g.
          | for Slack), so why not just turn on notifications and live off
          | them?
        
         | andrewm4894 wrote:
          | Disclaimer - I work at Netdata Cloud on ML.
          | 
          | This is one of the things I am focusing on most - how to
          | package and then surface "anomaly events" to the user, which
          | the user can then quickly digest and decide whether or not
          | they represent an "incident". So human-in-the-loop sort of
          | ML, to help assist and lower the cognitive load of all the
          | charts.
          | 
          | We have a first step on this ladder via the python-based
          | anomalies collector, which you could play around with to see
          | if it's any use:
          | https://learn.netdata.cloud/docs/agent/collectors/python.d.p...
         | 
          | That will give two summary aggregation-type charts based on
          | anomaly scores built off all the system.* charts by default
          | (but it can be configured however you want) - so, at each
          | second, an anomaly probability for each chart and an anomaly
          | flag if the model thinks that chart looked anomalous at that
          | time.
         | 
          | We are also working on some other related projects to build
          | this capability out more and do it at the edge, in C++ or Go
          | as opposed to Python (or via a parent node), as cheaply as
          | possible, so there is minimal, ideally negligible, impact on
          | the agent itself. We should have some more features related
          | to this in the coming months, as we are just trying to
          | dogfood them internally a little first.
          | 
          | If anyone is interested, I'd love to hear some more feedback
          | here or in this community megathread:
          | https://community.netdata.cloud/t/anomalies-collector-feedba...
          | or just email me at analytics-ml-team@netdata.cloud
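          | 
          | For a feel of the idea, a toy sketch of per-metric anomaly
          | scoring (a plain rolling z-score, much simpler than the
          | models the collector actually uses):

```python
import math

def anomaly_scores(series, window=30, threshold=3.0):
    """Score each point after the first `window` samples against its
    trailing window: a z-score as the anomaly score, plus a flag when
    the score exceeds `threshold`. Returns (scores, flags), aligned
    with series[window:]."""
    scores, flags = [], []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mean = sum(hist) / window
        std = math.sqrt(sum((x - mean) ** 2 for x in hist) / window)
        if std == 0:
            # Flat history: any deviation at all is maximally anomalous.
            z = 0.0 if series[i] == mean else float("inf")
        else:
            z = abs(series[i] - mean) / std
        scores.append(z)
        flags.append(z > threshold)
    return scores, flags

# A flat metric with one spike: only the spike gets flagged.
data = [1.0] * 40 + [50.0] + [1.0] * 10
scores, flags = anomaly_scores(data)
print(sum(flags))  # → 1
```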
        
           | lima wrote:
            | What's the argument for anomaly detection? It's an obvious
            | thing to do that has been tried many times, but it doesn't
            | actually seem to provide much value in practice (especially
            | at large scale, where you'll get spurious correlations).
           | 
            | What would you need it for? Once you've defined your SLOs,
            | either your service meets them or it doesn't. What's the
            | value in alerting someone that "this graph looks funny"?
        
             | laminatedsmore wrote:
             | Do 4xx responses count against your SLO? For me they don't,
             | but an abnormal increase might still signify that something
             | is actually wrong. (I haven't yet found a useful tool for
             | highlighting this kind of abnormality though)
        
             | toomanybeersies wrote:
             | > What's the value in alerting someone that "this graph
             | looks funny"?
             | 
             | What's the value in mentioning to someone that the chicken
             | they're about to eat looks a bit raw?
             | 
             | It stops them from eating it and getting food poisoning.
             | 
             | Anomalies are often warnings (harbingers?) of a problem
             | which could lead to a fault and downtime.
        
             | cakrit wrote:
             | It's about troubleshooting. When you have a complex
             | infrastructure, it's not enough to say that your db queries
             | are slower than usual. Ok, so you immediately see that your
             | db server is getting a lot more traffic. What was the root
             | cause though and what can you do about it now? Given enough
             | "funny charts", you can see for example that you have hit a
             | resource limit that you can temporarily raise and also see
             | that a particular component of your infrastructure has an
             | anomalous behavior, e.g. a cron job that was usually
             | utilizing resources for a few seconds, now takes minutes.
             | So you can provide a quick workaround and move on to
             | investigate what changed with that cron job.
        
             | andrewm4894 wrote:
              | I almost think of anomaly detection as a UI/UX type tool
              | to help users navigate the data/systems. So use ML to
              | find "interesting" or "novel" periods of time in your
              | architecture (in the sense that the ML thinks they look
              | novel based on some model), and then let the user, who is
              | ultimately best placed to decide, judge whether it's
              | actually of real interest to them or more like a false
              | positive that they can just ignore and move on from.
              | 
              | Doing it in a way where you can quickly scan such events
              | could, I think, be useful even if only 1 in 20 actually
              | turn out to be some potential problem that might have
              | been missed by your alarms, or maybe could even be a
              | precursor to some impact on SLOs etc.
             | 
              | The aim would also be "this collection of graphs looks
              | funny at the same time" as opposed to "this individual
              | graph looks funny", since if you have an anomaly score
              | for every chart, at any given moment some individual
              | charts will be firing at random. But when you pool the
              | information across charts and hosts and systems, the hope
              | is that you can then use anomaly detection as another way
              | to explore your system and catch when things change
              | unexpectedly.
        
             | mdale wrote:
             | I generally agree that well defined SLOs for back end
             | services works as you define service contacts between
             | services and care less about the particular funny graph
             | being surfaced more that a particular service is out of
             | contract.
             | 
             | Where automatic anomaly detection was very valuable for us
             | was in the video domain with multi dimensional end user
             | telemetry. I.e what would be lost as noise in top level
             | metics could be surfaced via anomaly detection for specific
             | combination of dimensions that you could not otherwise
             | manually observe. I.e video start time in Mexico is fine
             | ... But an ISP in Mexico City is not failing but when data
             | is sliced and anomaly highlighted we see its newly under
             | preforming and we need to feed this data into our CDN
             | switching to improve video start time there.
             | 
             | The data had too many dimensions that were always changing
             | with degraded experience easily lost in the noise when
             | measuring across platform and our software updates,
             | combinations target devices, connection types, geo
             | location, specific content, active ab tests, etc. In such
             | cases automatic anomaly detection was pretty critical.
        
         | ktsaou wrote:
         | Thank you for this feedback. I am the founder of Netdata.
         | 
         | Netdata is about making our lives easier. If you need to tweak
         | Netdata, please open a github issue to let us know. It is a
         | bug. Netdata should provide the best possible dashboards and
         | alerts out of the box. If it does not for you, we missed
         | something and we need your help to fix it, so please open a
         | github issue to let us know of your use case. We want Netdata
         | to be installed and effectively used with zero configuration,
         | even mid-crisis, so although tweaking is possible and we
         | support plenty of it, it should not be required.
         | 
         | An "incident" is a way to organize people, an issue management
         | tool for monitoring, a collaboration feature. Netdata's primary
         | goal however, is about exploring and understanding our
         | infrastructure. We are trying to be amazingly effective in this
         | by providing unlimited high resolution metrics, real-time
         | dashboards and battle tested alarms. In our roadmap we have
         | many features that we believe will change the way we understand
         | monitoring. We are changing even the most fundamental features
         | of a chart.
         | 
          | Of course, at the same time we are trying to improve
          | collaboration. This is why Netdata.Cloud, our free-forever
          | SaaS offering that complements the open-source agent to
          | provide out-of-the-box infrastructure-level monitoring
          | alongside several convenience features, organizes our infra
          | in war-rooms. In these war-rooms we have added metrics
          | correlation tools that can help us find the most relevant
          | metrics for something that got our attention: an alarm, a
          | spike or a dive on a chart.
         | 
          | For Netdata, the high-level incident panel you are looking
          | for will be based on a mix of charts and alarms. And we hope
          | it is also going to be fully automated, autodetected and
          | provided with zero configuration and tweaking. Stay tuned. We
          | are baking it...
        
           | Croftengea wrote:
           | > our free-forever SaaS offering that complements the open-
           | source agent
           | 
           | How do you make or plan to make money?
        
             | sdesol wrote:
             | I was analyzing the activity in the netdata project and
             | what I found interesting was this project is less active
             | than I would have thought. See the following for insights
             | into the project:
             | 
             | https://public-001.gitsense.com/insights/github/repos?q=win
             | d...
             | 
              | In the last 30 days, there were 2 frequent and 3
              | occasional contributors. I honestly thought the number of
              | frequent contributors would have been much higher, which
              | leads me to believe the project is quite mature and they
              | don't need a lot of people to work on netdata.
              | 
              | Based on Crunchbase, they've raised about 33 million so
              | far, and if the number of people required to maintain
              | netdata is low (relatively speaking, that is), I can see
              | them not really needing to worry about making money; I'm
              | guessing they are finding value in gathering data for ML.
        
                | andrewm4894 wrote:
                | Oh cool, that's a nice tool.
                | 
                | p.s. I am the only person working on ML at Netdata and
                | I can confirm we don't gather any data for ML purposes,
                | which is actually my biggest challenge right now :) -
                | convincing people the ML can be useful without having
                | lots of nicely labeled data from real netdata users to
                | quantify that with typical metrics like accuracy etc.
                | I'm hoping to introduce mainly unsupervised ML features
                | into the product that don't rely on lots of labeled
                | data, with thumbs up/down type feedback we can then use
                | to figure out whether new ML-based features are working
                | or being useful for users. So any models that get
                | trained would be trained on the host and live on the
                | host, as opposed to in Netdata Cloud somewhere.
        
               | sdesol wrote:
               | > i am the only person working on ML at Netdata and i can
               | confirm we don't gather any data for ML purposes, which
               | is actually my biggest challenge right now :)
               | 
                | Yeah, I would have to imagine that would be an issue.
                | This is just my personal opinion, but I think there
                | should be a way to provide anonymized data for building
                | anomaly-detection models. Maybe an opt-in feature, as
                | it would benefit everybody using netdata.
        
               | ktsaou wrote:
               | > they've raised about 33 million
               | 
               | yes, this is right
               | 
                | > if the number of people required to maintain netdata
                | is low (relatively speaking that is)
               | 
               | The Netdata agent is a robust and mature product. We
               | maintain it and we constantly improve it, but:
               | 
               | - most of our efforts go to Netdata.Cloud
               | 
                | - most of the action in the agent is in internal forks
                | we have. For example, we are currently testing ML at
                | the edge. This will eventually go into the agent, but
                | it is not there yet. Same with eBPF. We do a lot of
                | work to streamline the process of providing the best
                | eBPF experience out there.
               | 
               | > I can see them not really needing to worry about making
               | money
               | 
                | We are going to make money on top of the free tier of
                | Netdata.Cloud. We are currently building the free tier.
                | In about a year from now, we will start introducing new
                | paid features to Netdata.Cloud. Whatever we have
                | released by then will always be free.
               | 
               | > I'm guessing they are finding value in gathering data
               | for ML
               | 
               | No, we are not gathering any data for any purpose. Our
               | database is distributed. Your data are your data. We
               | don't need them.
        
               | sdesol wrote:
               | Hey thanks for the insights. I figured effort was being
               | spent elsewhere and/or was not visible in the public
               | repo.
        
             | ktsaou wrote:
              | The same way GitHub, Slack or Cloudflare provide massive
              | free-forever SaaS offerings while making money.
             | 
              | We believe that the world will greatly benefit from a
              | monitoring solution that is massively battle-tested,
              | highly opinionated, and incorporates all the knowledge
              | and experience of the community for monitoring
              | infrastructure, systems and applications. A solution that
              | is installed in seconds, even mid-crisis, and is
              | immediately effective in identifying performance and
              | stability issues.
             | 
             | The tricky part is to find a way to support this and
             | sustain it indefinitely. We believe we nailed it!
             | 
              | So, we plan to avoid selling monitoring features. Our free
              | offering will never have a limit on the number of nodes
              | monitored, the number of users, the number of metrics
              | collected, analyzed and presented, the granularity of
              | data, the number of war-rooms or dashboards, the number
              | of alarms configured, the notifications sent, etc. All
              | these will always be free.
             | 
             | And no, we are not collecting any data for ML or any other
             | purpose. The opposite actually: we plan to release ML at
             | the edge, so that each server will learn its own behavior.
             | 
             | We plan to eventually sell increased convenience features,
             | enforcement of compliance to business policies and
             | enterprise specific integrations, all of them on top of the
             | free offering.
        
             | nawgz wrote:
             | This is a good question, their website doesn't seem to have
             | any "Pricing" information anywhere and everything is "get
             | now" and "sign up for free"...
        
         | unixhero wrote:
         | I do not necessarily disagree with you regarding what a server
         | admin / ops personnel needs, however;
         | 
         | I for one deeply enjoy interacting with my Netdata dashboard
         | whenever I want to deep dive into my servers resources and
         | behaviors. For me it fits a purpose and if I ever were to run a
         | company that hosted things, I would want it and I would want to
         | pay for it. I am a huge fan and a long time homelab user of
         | Netdata.
        
         | goodpoint wrote:
         | Netdata is focused on short-term, real-time metrics. I use it
         | often during development.
        
         | mdip wrote:
         | > never really actionable ... only of interest for a short
         | period ... you know where and when the problem is.
         | 
         | I'm not a sysadmin of a large shop (I did that for a short bit,
         | but prior to this existing), so I can only speak as a guy who
         | runs a few big linux servers/virtuals. I've had netdata
         | installed on my home servers for quite some time. And yes,
         | the graphs were really cool at first, then kinda faded into
         | the background.
         | 
         | Here's the thing: when something isn't right with those boxes,
         | that's become the first place I visit. Since I had some
         | franken-boxes with a bunch of storage, it's often related to
         | the array, or btrfs. When I hop in there I'll notice an alert
         | or two, google it, alter something and never see it again. It
         | helped me solve some network issues.
         | 
         | I don't know; aside from it being a busier process than I
         | wish it were on my server (only a little, and I'm running a
         | few plugins on the one I'm unhappy with), it's been helpful.
        
           | fatlasp wrote:
           | Same. I'm not a fully qualified sys admin but I do have
           | access to a number of our servers (I'm more of a full stack
           | generalist than an expert at anything) and I immediately go
           | to netdata when one of my services isn't acting right. For
           | me it's a nice 'system at a glance' where I can check on
           | the host and then alert someone more knowledgeable than
           | myself if there's something that looks off.
        
         | ARandomerDude wrote:
         | I completely disagree. I work on a system with multiple servers
         | communicating with each other and billions of events per day.
         | Watching the meters wiggle on all the servers simultaneously is
         | a really important debugging tool. If something slows down or
         | goes down, and you know what it's connected to, it's pretty
         | easy to troubleshoot what the cause is at an incredibly
         | specific level just by looking at the meters.
         | 
         | I think it's likely one's take on whether these tools are
         | useful is very dependent on the system architecture.
        
         | toomanybeersies wrote:
         | There's value in having large dashboards that contain a bunch
         | of non-prioritised graphs and gauges. I've managed to find a
         | fair few problems by scrolling through such dashboards. Usually
         | it's due to poorly configured monitors/alerts, but sometimes
         | I'll spot things that you wouldn't reasonably expect an
         | algorithm to pick up.
         | 
         | Plus it's good fun to look at a big dashboard and pretend
         | you're Homer Simpson at the Springfield Nuclear Power Plant.
        
         | nednar wrote:
         | You seem to have a very specific vision. Could you mock it up
         | somehow? HTML, Figma, Paint, Powerpoint, whatever? Quite
         | curious about your ideas.
        
         | xPaw wrote:
         | Netdata does have alarms/alerts, and comes with default ones.
         | 
         | https://learn.netdata.cloud/guides/step-by-step/step-05
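          | 
          | To give a flavor, a health check is a small stanza in a conf
          | file under health.d/. Something roughly like this (the names
          | and thresholds here are invented, not the shipped defaults):

```conf
 alarm: cpu_usage_high
    on: system.cpu
lookup: average -1m unaligned of user,system
 every: 10s
  warn: $this > 75
  crit: $this > 90
  info: one-minute average CPU utilization
    to: sysadmin
```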
        
           | odyslam wrote:
           | Actually, we put a lot of thought into defining sane
           | default alarms for most of the data sources that we have.
           | 
           | We want our users to get 80% of the value with 20% of the
           | effort.
           | 
           | It's an opinionated approach that liberates a lot of users
           | from having to set up and maintain everything.
        
             | mdip wrote:
             | I think you took a lot of flak in the original comment.[0]
             | 
             | The alarms in netdata resolved a long-standing network
             | issue on one of my boxes, and have variously alerted me
             | to storage problems whose resolution greatly improved
             | performance on my largest volume. On my other box,
             | one look at the graphs alerted me to the fact that the
             | _entire_ SSD for my bcache volume was going unused[1]. I
             | then used them while altering configuration and working
             | with the drive to ensure the cache was being filled in a
             | manner consistent with what the volume stored /how it was
             | used.
             | 
             | The more I think about it, I might not have been as
             | enthusiastic in my original comment as I should have been.
             | It's been very helpful to me. I don't usually keep things
             | like this running for very long (it wastes cycles on aging
             | hardware...that isn't heavily used, but hey, it's the
             | principle!) but I've kept this around because every time
             | I've thought about removing it, I've visited the dashboard
             | one last time and found something there that made me keep
             | it.
             | 
             | [0] Though, as I mentioned, I'm not a sysadmin; I have a
             | lab that might indicate otherwise, but I don't get paid for
             | it.
             | 
             | [1] I had reloaded the machine/redone a previous
             | configuration that included bcache and it _screamed_; I
             | knew my new setup was much slower but I had forgotten about
             | it until netdata made it obvious, again. I can't remember
             | what I had to do to fix it, but it had something to do with
             | the policy used to determine if a file should be put into
             | the cache, and I think it was related to the fact that the
             | cache was added to a volume with data present that rarely
             | changed.
        
         | ohthehugemanate wrote:
         | > A server admin wants an "Incident" panel that only shows anomaly
         | components at the top coupled with adjustable alerting
         | mechanism and not just a dump of all the data there is blindly.
         | 
         | Netdata does this too, with a ton of thresholds already set up
         | by default. The list of active alerts is at the top, with a
         | badge and everything. Notifications use a hook system, so you
         | can use whatever mechanism you like. Personally I get emails
         | for medium level alerts, SMS for high and above, and wall
         | posts/notifications on my primary machine for crits. It took
         | some tuning to get the thresholds right for me, all perfectly
         | easy to do.
         | 
         | I agree I would prefer to have the active warnings more visible
         | than the graphs, but one click away really isn't bad.
        
         | jeffbee wrote:
         | Based on my experience, a dashboard that you only use when
         | you know where and when the problem occurred is incredibly
         | useful, and the lack of one can be very frustrating. While
         | you of course need a systematic approach to incident
         | detection, you _also_ need comprehensive eyes-on-glass
         | dashboards during your investigations. "Anomaly detection"
         | is much spoken of, but generalized anomaly detection doesn't
         | exist. You still need skilled operators to just have a look
         | around in many cases.
         | 
         | An example, drawn from several major incidents in my career.
         | You get an alert, you narrow it down to a process or machine,
         | you evict the machine from your serving population to remediate
         | the incident, but how do you keep it from recurring? The
         | anomalous thing isn't apparent in your monitoring data, so it
         | must be among the bazillion statistics that a running system
         | exposes, but which you can't afford to collect and monitor on a
         | per-host, per-container, per-process level of detail. That's
         | when you want something exactly like netdata!
        
         | tyingq wrote:
         | I agree with this, and it's interesting how many open source
         | tools there are that create these graphs and charts, store tons
         | of data, etc. All with a mostly "eyes on glass" bent, which
         | doesn't scale terribly well.
         | 
         | When, really, what's more important is actionable events,
         | correlation, duplicate suppression, escalating notifications,
         | etc. Something like what "Netcool Omnibus" and other commercial
         | software does. Isolate actionable problems and make sure
         | somebody owns the problem.
         | 
         | But for reasons I don't understand, there isn't much in the
         | open source world in that space.
        
       | enz wrote:
       | Is it any good? ;)
        
         | papazach wrote:
           | It is great. I have claimed my VMs running Netdata to
           | Netdata Cloud and I am very happy with it! It took me only
           | a few minutes to claim them all (11 VMs) and boom, the
           | dashboards were ready out of the box.
        
         | unixhero wrote:
         | Very good.
        
         | odyslam wrote:
         | yes :)
        
       | mtmsr wrote:
       | I played around with netdata just yesterday on my home
       | server. Great tool, but the defaults are overkill for my needs.
       | After spending an hour trying to simplify (=disable most of the
       | "collectors") using the documentation, I finally gave up.
       | 
       | Settled on neofetch [1] instead: pure bash, wrote my own custom
       | inputs including color coding for incident reporting in less time
       | than it took me to strip down netdata. Highly recommended if you
       | want to spend your time on other things than (setting up) server
       | monitoring.
       | 
       | [1] https://github.com/dylanaraps/neofetch
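       | 
       | For anyone trying the same: collector families can be flipped
       | off wholesale in netdata.conf. A sketch (section and option
       | names worth verifying against the current docs):

```conf
[plugins]
    python.d = no
    charts.d = no
    node.d = no
    apps = no
    cgroups = no

[health]
    enabled = no
```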
        
         | gregwebs wrote:
         | Thanks for the link: neofetch seems like a good tool when you just
         | want to manually see what is going on. Netdata is also designed
         | to alert, forward data to other locations, monitor at 1 second
         | granularity, and to store historical data efficiently if you
         | want to see what went on in the recent past.
        
       | tifadg1 wrote:
       | Could someone enlighten me on the internals? How is netdata
       | able to get realtime granularity, whereas prometheus defaults
       | to 15s?
        
         | distantsounds wrote:
         | it polls every second, to get metrics every second. genius, i
         | know?
        
         | gregwebs wrote:
         | Prometheus is designed around metric centralization and running
         | a scraper at some interval (every 15s). Netdata was originally
         | focused on running on a single node and collecting at small
         | intervals. Centralizing that data every second is a separate
         | task, and you could avoid it with Netdata simply by viewing
         | Netdata on the node in question. Netdata can also be configured
         | to stream data to a central node.
         | 
         | The centralized pull architecture of Prometheus does not
         | lend itself to small-interval updates or to resiliency (you
         | actually need to run two Prometheus instances and scrape
         | everything twice for that).
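         | 
         | The mechanics are simple enough to sketch. A toy per-second
         | sampler in Python (nothing like Netdata's optimized C
         | collectors, just the shape of the loop):

```python
import os
import time

def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            fields = [int(x) for x in line.split()[1:]]
            idle = fields[3] + fields[4]  # idle + iowait
            return sum(fields) - idle, sum(fields)
    raise ValueError("no aggregate cpu line found")

def cpu_percent(sample_a, sample_b):
    """CPU utilization between two /proc/stat snapshots, in percent."""
    busy_a, total_a = parse_cpu_times(sample_a)
    busy_b, total_b = parse_cpu_times(sample_b)
    dt = total_b - total_a
    return 100.0 * (busy_b - busy_a) / dt if dt else 0.0

if __name__ == "__main__" and os.path.exists("/proc/stat"):
    # The collector loop: read, sleep one second, read again, diff.
    with open("/proc/stat") as f:
        prev = f.read()
    for _ in range(3):
        time.sleep(1)
        with open("/proc/stat") as f:
            cur = f.read()
        print(f"cpu: {cpu_percent(prev, cur):.1f}%")
        prev = cur
```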
        
         | wongarsu wrote:
         | It doesn't store much history from what I can tell. If you
         | don't have years' worth of data points, then having 15 times
         | as many isn't a big deal.
        
           | mekster wrote:
           | And people are somehow meant to only monitor a single server?
           | 
           | There's a reason timeseries databases are trying to get
           | downsampling right.
        
             | manigandham wrote:
             | It's a locally installed agent that monitors and serves
             | metrics on the same host. If you want to monitor multiple
             | hosts then you can either visit the dashboards
             | individually, or scrape the APIs and put the metrics on a
             | combined dashboard - which is what Netdata Cloud is.
        
         | cakrit wrote:
         | It's because it was built with high granularity and unlimited
         | metrics as a key differentiator from the beginning. The core is
         | written in pure C, optimized to death. Even long-term retention
         | was initially sacrificed in order to be able to achieve that
         | high performance with minimal resource needs.
         | 
         | Long term retention is now possible, but with relatively high
         | memory requirements, depending on how many metrics are
         | collected. Again, it was a decision to never give up realtime
         | granularity and speed, even at the cost of writing our own
         | timeseries db in C and utilizing more memory.
        
       | dvfjsdhgfv wrote:
       | The only gripe I have with it is the approach to security, i.e.
       | the lack of user accounts (even one). So you have to either
       | restrict access by IP (who does that these days?) or use other
       | workarounds like proxying through Nginx, etc.
        
         | manos_saratsis wrote:
         | You can use Netdata Cloud to have secure authenticated access
         | to your single node dashboard. Data remain on your systems and
         | are streamed to your browser. Netdata Cloud stores only
         | metadata.
        
         | odyslam wrote:
         | Using Netdata Cloud is a great way to avoid spending any time
         | on that: you access the Agent's dashboard through Netdata
         | Cloud. We use WSS and MQTT, so it's secure and lightweight.
         | 
         | The data are streamed from the Agent directly to your browser
         | via the cloud.
         | 
         | Relevant docs:
         | https://learn.netdata.cloud/docs/configure/secure-nodes#disa...
        
           | sammy2244 wrote:
           | So the only convenient way to have security is to use the
           | cloud version? Got it.
        
             | distantsounds wrote:
             | yes, because a 10 line nginx config with basic http auth is
             | too difficult for a sysadmin to set up in conjunction with
             | his systems monitoring tool
             | 
             | stop being obtuse
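               | 
               | Something along these lines (hostname and cert paths
               | are placeholders; 19999 is Netdata's default port):

```nginx
# htpasswd -c /etc/nginx/.htpasswd admin   # create credentials first
server {
    listen 443 ssl;
    server_name netdata.example.com;        # placeholder

    ssl_certificate     /etc/ssl/netdata.crt;
    ssl_certificate_key /etc/ssl/netdata.key;

    location / {
        auth_basic           "Netdata";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:19999;
        proxy_set_header     Host $host;
    }
}
```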
        
               | dvfjsdhgfv wrote:
               | It's not that it's too difficult; it's that we were
               | accustomed to having this functionality built into
               | similar products in the past, and then things changed.
               | When ELK first showed up there was a big wave of
               | attacks on ELK servers, because they were completely
               | unsecured and, at the time, X-Pack Security was a paid
               | add-on. They changed their mind later, some time after
               | an open source alternative appeared.
        
               | PixyMisa wrote:
               | Absolutely. It has to be there, and users have to be
               | forced to configure it at install time.
               | 
               | How many times do we need to repeat this mistake?
        
             | napsterbr wrote:
             | That's the key difference between self-hosted and SaaS. If
             | you self-host, you are responsible for setting up the
             | required infrastructure, taking care of updates, backups
             | etc.
             | 
              | If setting up a reverse proxy in front of whatever
              | monitoring you've got is too much, then yes, by all
              | means use the SaaS offering -- but that's 100% the
              | user's responsibility, and there's no need to be snarky
              | about it.
        
               | dvfjsdhgfv wrote:
               | > If you self-host, you are responsible for setting up
               | the required infrastructure, taking care of updates,
               | backups etc.
               | 
               | Are you speaking about Netdata or in general? Because
               | if the former, then at least the updates part is not
               | true: the installation script turns on nightly updates
               | (and telemetry) by default.
               | 
               | Frankly, the reason there is no basic auth is that
               | Netdata doesn't use a third-party web server but a built-
               | in one, so they would have to add this functionality.
        
             | dvfjsdhgfv wrote:
             | > So the only convenient way to have security is to use the
             | cloud version? Got it.
             | 
             | I wouldn't put it that way; it's just a bit annoying to
             | see this trend of not having even a tiny bit of security
             | built in, and having to do extra work just to protect the
             | dashboard. Just one admin account with a randomly
             | generated password would be fine.
        
         | tinco wrote:
         | Proxying through nginx is not a workaround; it's the
         | industry-standard way of managing access to services. It's
         | both more convenient _and_ more secure, a rare combination.
         | 
         | More convenient because you can use your company's
         | pre-existing authentication to authenticate the requests,
         | and more secure because you're not having to manage separate
         | passwords and user accounts.
        
           | dvfjsdhgfv wrote:
           | I understand your opinion, but it's not like that
           | everywhere. I work with many clients who have single
           | servers or specific setups, and having to configure Nginx
           | is an extra step and an additional layer that could be made
           | totally unnecessary by building in just one admin account
           | with a randomly assigned password.
        
         | smarx007 wrote:
         | I have it listen on a loopback interface and do SSH port
         | forwarding when I want to look at the stats. Nginx proxying
         | with basic auth is a perfectly reasonable approach and not a
         | workaround in my humble opinion. I would trust these two
         | approaches more than an unknown mechanism in Netdata.
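         | 
         | Concretely, something like this (hostname is a placeholder):

```sh
# netdata.conf: listen on loopback only
#   [web]
#       bind to = 127.0.0.1
# Then forward the dashboard from your workstation:
ssh -N -L 19999:127.0.0.1:19999 user@myserver.example.com
# ...and browse http://localhost:19999
```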
        
       | gregwebs wrote:
       | Netdata is a great building block in a monitoring system. It now
       | does a lot of monitoring via eBPF, connects to Prometheus, and
       | integrates with k8s.
        
         | odyslam wrote:
         | We do love ebpf. Guilty as charged ¯\_(ツ)_/¯
         | 
         | We have a whole bunch of metrics that we keep track of, and
         | we are currently implementing a load more.
         | 
         | Soonish, we will greatly increase the number of metrics that
         | we gather with ebpf. That, coupled with our per-second
         | granularity, should give you a very detailed view of the
         | system.
         | 
         | Docs:
         | https://learn.netdata.cloud/docs/agent/collectors/ebpf.plugi...
         | Community Forums discussion:
         | https://community.netdata.cloud/t/linux-kernel-insights-with...
        
       | crazypython wrote:
       | Haven't been able to use its graphical interface to view
       | historical data. At least it uses fewer resources than Grafana.
        
         | distantsounds wrote:
           | netdata doesn't store metrics historically by default, but
           | you can funnel whatever ones you want out and ship them off
           | to a time-series store like graphite or opentsdb.
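           | 
           | The exporting side is a few lines of config; roughly this
           | (destination is a placeholder, and the option names are
           | worth checking against the docs):

```conf
[exporting:global]
    enabled = yes

[graphite:my_graphite]
    enabled = yes
    destination = graphite.example.com:2003
    update every = 10
    send charts matching = *
```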
        
           | PanosJee wrote:
           | It has the option now to set retention for up to a year.
        
             | distantsounds wrote:
             | ooh, even better!
        
       | mprovost wrote:
       | I write and maintain an open source monitoring tool. I looked
       | into adding a mode to output metrics in Netdata's format, and
       | ran away screaming. It's just an unstructured text format where
       | output commands to stdout, one per line. Each command consists of
       | whitespace-separated fields. Which field is the units? Oh, the
       | 4th. And some fields are optional, I'm not even sure how that
       | works but I think you can't skip an optional field if you then
       | want to use any field after that. It's like structured data
       | formats like JSON or god forbid XML never happened.
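       | 
       | For the curious, here is roughly what that protocol looks like
       | from a plugin's side; a sketch in Python (field order taken
       | from the external-plugin docs, worth double-checking against
       | learn.netdata.cloud):

```python
import random
import sys
import time

def announce_chart():
    """Emit the chart/dimension definitions, one positional line each."""
    # CHART type.id name title units family context charttype priority update_every
    print("CHART example.random '' 'A random number' value example example.random line 90000 1")
    # DIMENSION id name algorithm multiplier divisor
    print("DIMENSION random1 random1 absolute 1 1")

def emit_sample(value):
    """One data point: BEGIN / SET / END, whitespace-separated fields."""
    print("BEGIN example.random")
    print(f"SET random1 = {value}")
    print("END")
    sys.stdout.flush()

if __name__ == "__main__":
    announce_chart()
    for _ in range(3):  # a real plugin loops forever at its update interval
        emit_sample(random.randint(0, 100))
        time.sleep(1)
```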
        
         | cakrit wrote:
         | Netdata can ingest prometheus metrics as well, so you can
         | just use that format. Eventually everything will converge on
         | OpenMetrics/OpenTelemetry.
        
       | petecooper wrote:
       | Previous discussions:
       | 
       | https://news.ycombinator.com/item?id=11388196
       | 
       | https://news.ycombinator.com/item?id=17773874
       | 
       | https://news.ycombinator.com/item?id=26886792
       | 
       | (For commentary, I'm not being snarky.)
        
       | hivacruz wrote:
       | How does it compare to New Relic, which can also monitor
       | containers and other system things if enabled?
        
       ___________________________________________________________________
       (page generated 2021-04-21 23:02 UTC)