Silencing Netdata debugfs.plugin
hwmon log spam with libsensors
For some time I noticed my systemd journal was being inundated with Netdata sensor alarm state transitions. This increased the noise in the logs, making it hard to find important log entries and causing the journal logs to rotate too frequently.
The Problem
Here’s an example of the log spam I saw, showing a sensor rapidly flapping between states due to a motherboard with a poor sensor chip implementation, featuring a floating temperature input and bad thresholds that would flip back and forth many times a minute:
Jun 07 23:39:28 servuh debugfs.plugin[6036]: temperature sensor 'temperature_nct6798-isa-0290_temp3_AUXTIN0' transitioned from state 'clear' to 'critical' [device 'nct6775.656', driver 'nct6798', subsystem 'platform', path '/sys/class/hwmon/hwmon4']: input 127.000000 >= critical high 125.000000 (userspace evaluation using kernel provided thresholds)
Jun 07 23:39:32 servuh debugfs.plugin[6036]: temperature sensor 'temperature_nct6798-isa-0290_temp3_AUXTIN0' transitioned from state 'critical' to 'clear' [device 'nct6775.656', driver 'nct6798', subsystem 'platform', path '/sys/class/hwmon/hwmon4']
Jun 07 23:39:52 servuh debugfs.plugin[6036]: temperature sensor 'temperature_nct6798-isa-0290_temp3_AUXTIN0' transitioned from state 'clear' to 'alarm' [device 'nct6775.656', driver 'nct6798', subsystem 'platform', path '/sys/class/hwmon/hwmon4']: alarm == 1.000000 (kernel driver generated)
Jun 07 23:39:54 servuh debugfs.plugin[6036]: temperature sensor 'temperature_nct6798-isa-0290_temp3_AUXTIN0' transitioned from state 'alarm' to 'clear' [device 'nct6775.656', driver 'nct6798', subsystem 'platform', path '/sys/class/hwmon/hwmon4']
Jun 07 23:40:10 servuh debugfs.plugin[6036]: temperature sensor 'temperature_nct6798-isa-0290_temp3_AUXTIN0' transitioned from state 'clear' to 'alarm' [device 'nct6775.656', driver 'nct6798', subsystem 'platform', path '/sys/class/hwmon/hwmon4']: alarm == 1.000000 (kernel driver generated)
Jun 07 23:40:12 servuh debugfs.plugin[6036]: temperature sensor 'temperature_nct6798-isa-0290_temp3_AUXTIN0' transitioned from state 'alarm' to 'critical' [device 'nct6775.656', driver 'nct6798', subsystem 'platform', path '/sys/class/hwmon/hwmon4']: input 127.000000 >= critical high 125.000000 (userspace evaluation using kernel provided thresholds)
These messages typically appear due to poor hardware implementation by motherboard manufacturers:
- Floating sensor inputs that aren’t properly tied high or low
- Incorrectly configured threshold values in the sensor chip
The Search for a Solution
When I first encountered this issue, I spent quite a bit of time searching for solutions. Most of the search results and documentation pointed to Netdata’s go.d.plugin
for sensor monitoring.
The current implementation, starting with Netdata v2.2.0, uses the debugfs.plugin
with libsensors
integration, but finding accurate documentation about this was challenging. Many forum posts and documentation pages are outdated, still referencing the old go.d.plugin
approach.
I discovered the relevant information in the Netdata PR #19251 which moved sensor monitoring to the debugfs.plugin
, and later found the implementation details in the debugfs.plugin
module-libsensors.c source code.
After some digging through the Netdata source code and libsensors
documentation, I discovered that the solution lies in properly configuring libsensors
rather than trying to modify Netdata’s behavior directly.
The Solution: libsensors
Configuration
Netdata uses libsensors
to monitor hardware sensors. You can control these messages by configuring libsensors
to:
- Set custom thresholds for sensors
- Ignore specific sensors
- Adjust the sensitivity of state transitions
- Configure how sensor events are logged
As a bonus, configuring the sensors3.conf
configuration and friends will also make the sensors
CLI command output match!
Configuration Location and Example
For my ASUS W680 motherboard, I created a drop-in configuration file at /etc/sensors.d/50-w680.conf
. Here’s a real-world example of how to configure libsensors
to reduce log spam on a W680 motherboard with an NCT6798
sensor chip:
chip "nct6798-isa-0290"
# Ignore the problematic temperature sensor AUXTIN0 (temp3)
ignore temp3
ignore temp7
# Ignore the problematic voltage sensor in6
ignore in6
# Set reasonable voltage limits to make ALARM values useful
# These are example values - adjust based on your hardware
set in1_max 1.2
set in2_max 3.5
set in3_max 3.4
set in4_max 1.2
set in7_max 3.5
set in8_max 3.3
set in9_max 1.2
set in10_max 1.2
set in11_max 1.2
set in12_max 1.2
set in13_max 1.0
set in14_max 1.0
# Set minimum fan speeds for the fans you actually use
set fan1_min 200
set fan2_min 200
set fan3_min 200
set fan4_min 200
Viewing Sensor Data
You can view the current sensor readings using the sensors
command. Here’s an example output for the NCT6798
chip:
> sensors 'nct6798-isa-0290'
nct6798-isa-0290
Adapter: ISA adapter
in0: 912.00 mV (min = +0.00 V, max = +1.74 V)
in1: 1.01 V (min = +0.00 V, max = +1.20 V)
in2: 3.41 V (min = +0.00 V, max = +3.50 V)
in3: 3.30 V (min = +0.00 V, max = +3.41 V)
in4: 1000.00 mV (min = +0.00 V, max = +1.20 V)
in5: 32.00 mV (min = +0.00 V, max = +0.00 V)
in7: 3.39 V (min = +0.00 V, max = +3.50 V)
in8: 3.18 V (min = +0.00 V, max = +3.30 V)
in9: 1.04 V (min = +0.00 V, max = +1.20 V)
in10: 1.01 V (min = +0.00 V, max = +1.20 V)
in11: 928.00 mV (min = +0.00 V, max = +1.20 V)
in12: 1.05 V (min = +0.00 V, max = +1.20 V)
in13: 440.00 mV (min = +0.00 V, max = +1.00 V)
in14: 880.00 mV (min = +0.00 V, max = +1.00 V)
fan1: 2960 RPM (min = 200 RPM)
fan2: 1105 RPM (min = 200 RPM)
fan3: 3013 RPM (min = 200 RPM)
fan4: 2980 RPM (min = 200 RPM)
fan5: 0 RPM (min = 0 RPM)
fan7: 0 RPM (min = 0 RPM)
SYSTIN: +26.0°C (high = +80.0°C, hyst = +75.0°C)
(crit = +125.0°C) sensor = thermistor
CPUTIN: +39.5°C (high = +80.0°C, hyst = +75.0°C)
(crit = +125.0°C) sensor = thermistor
AUXTIN1: +26.0°C (high = +80.0°C, hyst = +75.0°C)
(crit = +125.0°C) sensor = thermistor
AUXTIN2: +29.0°C (high = +80.0°C, hyst = +75.0°C)
(crit = +100.0°C) sensor = thermistor
AUXTIN3: +32.0°C (high = +80.0°C, hyst = +75.0°C)
(crit = +100.0°C) sensor = thermistor
PECI Agent 0: +44.0°C (high = +98.0°C, hyst = +95.0°C)
PECI Agent 0 Calibration: +37.0°C
PCH_CHIP_CPU_MAX_TEMP: +0.0°C
PCH_CHIP_TEMP: +56.0°C
PCH_CPU_TEMP: +0.0°C
pwm1: 16% (mode = pwm) MANUAL CONTROL
pwm2: 49% (mode = pwm)
pwm3: 16% (mode = pwm) MANUAL CONTROL
pwm4: 16% (mode = pwm) MANUAL CONTROL
pwm7: 128% (mode = pwm)
intrusion0: ALARM
intrusion1: ALARM
beep_enable: disabled
This output shows all the sensors available on the chip, including temperatures, voltages, fan speeds, and other status indicators.
Applying Configuration at Boot
If you’ve configured min/max values to adjust alarm thresholds, you’ll want to ensure these settings are applied at system boot. Create a systemd service to handle this:
# /etc/systemd/system/sensors-set.service
[Unit]
Description=Set NCT6798 Sensor Configuration
Before=netdata.service
[Service]
Type=oneshot
ExecStart=/usr/bin/sensors -s
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl enable --now sensors-set.service
This ensures your sensor configuration is applied before Netdata starts, allowing it to see the correct alarm values. While I haven’t verified this, I believe Netdata only reads the sensor configuration once during initialization.
Conclusion
While hardware sensor monitoring is important, the most frustrating aspect is dealing with poor sensor integration by motherboard manufacturers, followed closely by Netdata’s lack of documentation on the new debugfs.plugin
. However, these issues are easily solved using libsensors
configuration, which provides a flexible way to maintain important monitoring while keeping your logs clean and manageable.
Comments