I want something with a web UI that can graph CPU and RAM usage for today and maybe a few days before. I would also like to see what was running at any given time (I mean from 2-3 days ago to now).
Is there any (FOSS) software that does that?
Thanks.
I work for a large enterprise and build ML model monitoring pipelines fairly frequently; that’s a more in-depth but similar use case to what you’re asking about.
We use Grafana (visualization) and Prometheus (time-series database); they’re built for exactly this use case. There’s tons of info out there on how to build, configure, connect them to your sensors, and deploy them.
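As a rough illustration of what Grafana does when it draws a CPU graph for the last few days, the sketch below builds a request against Prometheus’s HTTP range-query endpoint (`/api/v1/query_range`) using only the standard library. It assumes a Prometheus server at a hypothetical `http://localhost:9090` that is scraping node_exporter (the source of the `node_cpu_seconds_total` metric); the 5-minute step is an arbitrary choice, not a recommendation.

```python
import json
import time
import urllib.parse
import urllib.request

# Hypothetical address; point this at wherever your Prometheus server runs.
PROMETHEUS_URL = "http://localhost:9090"

# Busy-CPU percentage per host, computed from node_exporter's idle-time counter.
CPU_QUERY = (
    '100 - avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
)


def build_range_url(base_url, query, start, end, step_seconds):
    """Return a /api/v1/query_range URL covering [start, end] in Unix seconds."""
    params = urllib.parse.urlencode({
        "query": query,
        "start": int(start),
        "end": int(end),
        "step": f"{step_seconds}s",
    })
    return f"{base_url}/api/v1/query_range?{params}"


def fetch_range(url):
    """Run a range query and return the list of series under data.result."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


def cpu_last_days(days=3, base_url=PROMETHEUS_URL, step_seconds=300):
    """Fetch CPU usage for the last `days` days, one point per `step_seconds`."""
    now = time.time()
    url = build_range_url(base_url, CPU_QUERY, now - days * 86400, now, step_seconds)
    return fetch_range(url)
```

Against a live server, `cpu_last_days()` returns one entry per host, each with a `metric` label set and a `values` list of `[timestamp, "value"]` pairs; Grafana plots essentially the same data, just with a nicer UI.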
LibreNMS hasn’t been mentioned yet, and it’s very good. It does take some setting up, but its use of SNMP for data collection means that it’s easy to collect data from a wide range of network hardware as well. A wide range of alerting options is available.
I think Prometheus + Grafana might be what you are looking for. In combination with Loki, Grafana can also be used to view log messages.
Absolutely this; nothing else is required. Well, maybe Alertmanager if you want to receive alerts.
Do both have to run on the host machine, or can a remote machine execute the probes (over SSH or something)?
Grafana is just the frontend; it’s a dashboard for your different data sources. Prometheus is the “database”: it scrapes data from your endpoints over HTTP, so it doesn’t have to run on the machines it monitors.
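To make “scrapes data from your endpoints over HTTP” concrete, here is a toy scrape target using only the Python standard library. In practice you would run node_exporter on each host instead of writing your own; the metric names (`toy_*`), the port 9101, and the use of `/proc/loadavg` and `/proc/meminfo` (Linux only) are all assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_linux_stats():
    """Read the 1-minute load average and memory figures (Linux only)."""
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])  # values are reported in kB
    return load1, meminfo["MemTotal"], meminfo["MemAvailable"]


def render_metrics(load1, mem_total_kb, mem_available_kb):
    """Format readings in the Prometheus text exposition format."""
    lines = [
        "# HELP toy_load1 1-minute load average.",
        "# TYPE toy_load1 gauge",
        f"toy_load1 {load1}",
        "# HELP toy_memory_total_bytes Total RAM.",
        "# TYPE toy_memory_total_bytes gauge",
        f"toy_memory_total_bytes {mem_total_kb * 1024}",
        "# HELP toy_memory_available_bytes RAM available for new work.",
        "# TYPE toy_memory_available_bytes gauge",
        f"toy_memory_available_bytes {mem_available_kb * 1024}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with a fresh reading; everything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(*read_linux_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9101):
    """Serve /metrics until interrupted; port 9101 is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

With `serve()` running on the monitored host, a Prometheus server on another machine only needs a scrape job pointing at `http://<host>:9101/metrics`; nothing has to be pushed over SSH.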
I use Check_MK.
+1 for CMK. It’s built on Nagios. Been using it for decades. That shit is rock solid and has never let me down.
Prometheus handles the metrics and Grafana reports on them. IMHO that combo has better reporting and graphing, and better eye candy, but it’s also harder to set up and get right.
With CMK, the agent alone covers 95% of what you want.
Zabbix?
If you’re serious about monitoring your shit, this is really the best answer. Zabbix is love. Zabbix is life.
We use LibreNMS. Its docs state that it will do this, but we only use the uptime monitoring feature, so I can’t attest to how well it monitors everything else.
Munin is a tried-and-true solution. It installs on the server, creates graphs, and makes it easy to spot the stair-step pattern leading up to problems like running out of memory.
I’d also highly recommend installing atop and having it collect stats every 1 to 2 minutes. You can go to a crashed server and step through what was running in a “top”-like interface. I install atop on every server as a means of post-incident diagnosis.
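Munin’s graphs come from plugins, which are just executables: run with the argument `config` they describe the graph, and run with no argument they print one `field.value N` line per field. The sketch below is a hypothetical plugin graphing available memory; it assumes Linux (`/proc/meminfo` with a `MemAvailable` line), and the graph title, category, and field name are illustrative choices. Plugins are typically linked into `/etc/munin/plugins/` on the node, though paths vary by distro.

```python
#!/usr/bin/env python3
"""Hypothetical Munin plugin graphing available memory (Linux only)."""
import sys


def config_output():
    """Text Munin reads when it runs the plugin with the 'config' argument."""
    return "\n".join([
        "graph_title Available memory",
        "graph_vlabel bytes",
        "graph_category system",
        "graph_args --base 1024 --lower-limit 0",
        "available.label available",
    ]) + "\n"


def fetch_output(mem_available_kb):
    """Text Munin reads on a normal run: one 'field.value N' line per field."""
    return f"available.value {mem_available_kb * 1024}\n"


def read_mem_available_kb(path="/proc/meminfo"):
    """Return MemAvailable in kB as reported by the kernel."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found")


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        sys.stdout.write(config_output())
    else:
        sys.stdout.write(fetch_output(read_mem_available_kb()))


if __name__ == "__main__":
    main(sys.argv)
```

A memory leak then shows up as a steadily falling line on this graph, which is the kind of stair-step pattern that makes out-of-memory problems easy to spot.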
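The core idea behind “what was running at any given time” is simply periodic process snapshots. The sketch below is not how atop works internally (atop keeps its own compact log format); it is a minimal Linux-only illustration that reads `/proc/<pid>/stat` and appends timestamped snapshots to a JSON-lines file. The file name `procs.jsonl`, the 60-second interval, and the choice to record only PID, command, and resident memory are assumptions.

```python
import json
import os
import time


def parse_stat(stat_text, page_size):
    """Parse one /proc/<pid>/stat line into (pid, command, rss_bytes).

    The command name sits in parentheses and may itself contain spaces or
    ')', so the line is split on the *last* ')'.
    """
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition(" (")
    fields = tail.split()
    # fields[0] is the state (field 3 in proc(5)); rss is field 24 -> index 21.
    rss_pages = int(fields[21])
    return int(pid_str), comm, rss_pages * page_size


def snapshot(page_size):
    """Return (pid, command, rss_bytes) for every process in /proc."""
    procs = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                procs.append(parse_stat(f.read(), page_size))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited between listdir() and open()/read()
    return procs


def record_forever(log_path="procs.jsonl", interval_seconds=60):
    """Append one timestamped process snapshot per interval to a JSON-lines file."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    while True:
        entry = {"ts": time.time(), "procs": snapshot(page_size)}
        with open(log_path, "a") as log:
            log.write(json.dumps(entry) + "\n")
        time.sleep(interval_seconds)
```

Running `record_forever()` in the background and filtering the file by timestamp gives a crude “what was running two days ago” view; atop does the same job far more efficiently and with much more detail.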
I would use OpenTelemetry, Prometheus, and Grafana…
Which part is OpenTelemetry for? Aren’t Prometheus Agent, Prometheus Server, and Grafana enough?
I like it because I use it for MELT (metrics, events, logs, traces) in general. Prometheus generally handles metrics, and if you want to include logs, traces, and events, it becomes more cumbersome. With the OTel collector, I can just update my collector configuration to point to the various services.
I’m not saying OP can’t use what you suggested, just stating what I would use.