Raspberry Pi Monitoring using Telegraf, InfluxDB, and Grafana

RK Kuppala · Aug 26 · 7 min read

We recently had to build a reliable monitoring dashboard for a fleet of Raspberry Pi devices for one of our customers. These devices are mounted on vehicles on the move, so intermittent connectivity is expected. One of the requirements was to keep collecting metrics even when a device is offline, and to deliver them to the monitoring system once it comes back online.

## Why not Prometheus?

Prometheus is usually the first thing that comes to mind for monitoring. Its node_exporter agent can run on a Raspberry Pi device and expose metrics at http://public.ip.of.device:9100. The problem with this approach is that Prometheus is a pull-based monitoring system, which does not work for our use case because of the intermittent connectivity. When a device is unreachable for a given period, the Prometheus scraper simply misses the node_exporter metrics for that window. We wanted our monitoring agent to keep collecting metrics even when the remote monitoring host is unavailable, buffer them locally, and deliver them as soon as the device comes online.

Prometheus does have the Pushgateway, but it is designed for short-lived, service-level batch jobs, not for full-fledged system monitoring. We tried colocating node_exporter with a local Pushgateway, hoping to buffer metrics that way, but the Pushgateway simply overwrites its metrics store with the latest values and does not actually buffer them. Prometheus is not the main topic of this post, and there are other places where the pull approach shines; it is explained very well here.
This brings us to the main topic of this post -- the TIG stack. With the Telegraf, InfluxDB, and Grafana combination, we found a solution that pushes metrics from the devices using Telegraf, stores them in InfluxDB, and visualizes them in Grafana. Using the metrics_buffer_limit setting in Telegraf, we were able to buffer metrics locally and survive short, intermittent connectivity outages before the metrics are shipped to a remote InfluxDB database. I thought documenting all the setup steps in a post would be useful for posterity, so the rest of this post is a step-by-step guide :)

## Install and Configure InfluxDB 2.0

I assume that you already have a VM running on AWS or GCP, so I won't bore you with the steps of creating one. Let's go ahead with the installation. Find the right binaries from the InfluxData downloads page. I am using CentOS for this guide.

```
wget https://dl.influxdata.com/influxdb/releases/influxdb2-2.0.8.x86_64.rpm
sudo yum localinstall influxdb2-2.0.8.x86_64.rpm
```

Enable the service and start it:

```
systemctl enable influxdb.service && systemctl start influxdb.service
```

You should now be able to configure it for the first time by visiting the public IP of your VM on port 8086. You can set up the initial user, password, organization (a workspace that lets you control access for a group of users), and a bucket (a bucket is similar to a database, with a retention policy). Once connected, you will be able to view your organization, buckets, and tokens. A token is how you allow Telegraf to write data to InfluxDB, or Grafana to read from it. We will come back to tokens in a bit.

## Install and Configure Grafana

You can either install Grafana on the same host that runs InfluxDB or choose a separate VM. For this post, we are using the same VM. Let's install Grafana.
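Before we move on to Grafana, one note for later, when we configure Telegraf: the local buffering described above is governed by the agent's buffer settings. A minimal sketch of the relevant `[agent]` section (the values here are illustrative, not tuned recommendations):

```toml
[agent]
  ## How often to gather metrics from the inputs.
  interval = "10s"
  ## How often to attempt flushing accumulated metrics to the outputs.
  flush_interval = "10s"
  ## Maximum number of unwritten metrics held in memory per output while the
  ## output is unreachable; the oldest metrics are dropped once this fills up.
  metrics_buffer_limit = 50000
```

Size metrics_buffer_limit against your metric rate and the longest outage you want to survive.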
Create a file sudo vi /etc/yum.repos.d/grafana.repo and add the following:

```
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
```

Install:

```
sudo yum install grafana
```

Enable and start the service:

```
sudo systemctl daemon-reload && sudo systemctl start grafana-server && sudo systemctl status grafana-server
```

You will now be able to access the UI at http://public.ip.of.vm:3000. Go ahead and use admin/admin for the first login, then set up a new password. Once you are in, go to add a data source and choose InfluxDB.

Under URL, I am going to use http://localhost:8086, pointing to the InfluxDB instance running locally. If your Influx host is a different VM, adjust the URL accordingly. You can choose either InfluxQL or Flux as the query language, but note that InfluxQL has read-only support with InfluxDB 2.0. For my Raspberry Pi dashboard, I chose Flux.

Now to the authentication part. Toggle "Basic Auth" off. We will now go to the InfluxDB UI in a different browser tab and create an authentication token that Grafana will use to connect to InfluxDB. Open http://your.vm.ip.address:8086 to access InfluxDB and navigate to the tokens page. Generate a token with read, write, or both access, and specify the bucket(s). Get the token, come back to the Grafana configuration, use the token there, then save and test.

We are now ready to ingest data into InfluxDB and visualize it in Grafana. It's Raspberry time now.

## Configure Telegraf on Raspberry Pi and Collect Metrics

SSH to the Raspberry Pi device and install the Telegraf agent.
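(A quick aside before we configure the Pi: if you prefer keeping the Grafana datasource configuration in code instead of clicking through the UI, Grafana also supports file-based provisioning. A hedged sketch, assuming the same local InfluxDB and a Flux datasource; the organization, bucket, and token values below are placeholders you would replace with your own:

```yaml
# /etc/grafana/provisioning/datasources/influxdb.yaml
apiVersion: 1
datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://localhost:8086
    jsonData:
      version: Flux
      organization: your-org
      defaultBucket: your-bucket
    secureJsonData:
      token: YOUR-INFLUXDB-TOKEN
```

Grafana picks this file up on restart.)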
Before installing Telegraf, you might want to change the default device password and hostname using raspi-config.

Install the necessary tools:

```
sudo apt-get update && sudo apt-get install apt-transport-https
```

Our Pi 4 devices run Buster, so let's add the InfluxData key and repository:

```
wget -qO- https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/os-release
test $VERSION_ID = "10" && echo "deb https://repos.influxdata.com/debian buster stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
```

Install and enable the Telegraf service:

```
sudo apt-get update && sudo apt-get install telegraf && systemctl enable telegraf.service
```

You can start the service once and check its status. It will show errors about InfluxDB not being reachable, because the default telegraf.conf talks to a local InfluxDB. Ignore this error and stop the service.

```
systemctl start telegraf.service
systemctl status telegraf.service
# stop the service after this
```

We now have to configure the Telegraf agent to write to the InfluxDB output and to collect metrics using inputs. Out of the box, Telegraf supports a wide variety of inputs and outputs; take a look at all the supported plugins here. Apart from the plugins listed, Telegraf also allows us to execute custom scripts to capture additional metrics. We will use the custom script input to capture some Raspberry Pi specific metrics like SoC temperature, SDRAM voltages, throttle states, etc. There is prior art available already -- here and here. Below is a combination of the two scripts above, by Oostens. Create a file with the script's content on your Raspberry Pi device:

```
sudo vi /var/lib/telegraf/vcgencmd.sh
```

Make the script executable and check whether the telegraf user can execute it.
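The gists linked above are the authoritative version of the script. As a rough, hypothetical illustration of its shape -- turning vcgencmd-style output into InfluxDB line protocol that Telegraf's exec input can ingest (the function name, sample value, and measurement name here are my own, not from the gist):

```shell
#!/bin/sh
# Hypothetical sketch only -- the real collector is the gist referenced above.

# Strip the wrapper from vcgencmd's "temp=48.3'C" style output.
parse_temp() {
  printf '%s' "$1" | sed -e "s/^temp=//" -e "s/'C$//"
}

# On a real Pi this would be: raw=$(vcgencmd measure_temp)
raw="temp=48.3'C"

# Emit InfluxDB line protocol, matching data_format = "influx" in telegraf.conf.
printf 'vcgencmd temperature=%s\n' "$(parse_temp "$raw")"
```

Running it by hand as the telegraf user (sudo -u telegraf ...) should print a single line protocol record such as `vcgencmd temperature=48.3`.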
Also, the telegraf user needs to be able to get information about the GPU, so add telegraf to the video group (note the -a flag, which appends the group instead of replacing the user's existing groups):

```
sudo usermod -aG video telegraf
chmod +x /var/lib/telegraf/vcgencmd.sh
sudo -u telegraf /var/lib/telegraf/vcgencmd.sh
```

Let's now move on to the telegraf.conf file. This gist has the configuration file I have used. Pay attention to the following: the output is configured to write to InfluxDB. The URL is the public IP of the InfluxDB server, port 8086. You will also recollect that we created an organization, a bucket, and a token in the sections above.

```
[[outputs.influxdb_v2]]
urls = ["http://x.x.x.x:8086"]
token = "XXXXXX-YOUR-INFLUXDB-AUTHENTICATION-TOKEN-XXXXXX"
organization = "cloudside-academy"
bucket = "cloudside-academy-dev"
```

After this, there are standard input plugins like [[inputs.cpu]], [[inputs.system]], etc., used to collect the system metrics. Finally, we have the inputs gathered via the custom script we created above:

```
# Read RPi CPU temperature
[[inputs.exec]]
commands = [
  '''sed -e 's/^\([0-9]\{2\}\)\(.*\)$/\1.\2/' /sys/class/thermal/thermal_zone0/temp'''
]
name_override = "sys"
data_format = "grok"
grok_patterns = ["%{NUMBER:thermal_zone0:float}"]

# Vcgencmd input
[[inputs.exec]]
commands = ["/var/lib/telegraf/vcgencmd.sh"]
timeout = "7s"
data_format = "influx"
```

Go ahead and create the new telegraf.conf file with this gist (and your changes as needed):

```
sudo mv /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.conf.bak
sudo vi /etc/telegraf/telegraf.conf
```

Start the service:

```
systemctl start telegraf.service
```

You should now see the Telegraf agent sending data to InfluxDB, ready to be visualized.

Finally, it's dashboard time. You can either build your own custom dashboard or import an existing Grafana dashboard. I ended up building a custom dashboard for our customer.

Hope you found this useful! Happy monitoring! :)

The Cloudside View -- We shall not cease from exploration...!
Tags: Cloud, Raspberry Pi, Monitoring, InfluxDB

We are Cloudside, a trusted team of cloud-native and data problem solvers, helping our clients tackle complex scale, availability, performance, and data analytics problems on GCP and AWS. This blog is a witness to our team's adventures and learnings in Cloud, Data & App Engineering.

Written by RK Kuppala, Co-Founder & CTO, thecloudside.com