I have a lot of dashboards relating to specific applications/groups of servers. Last week (while I was on holidays) a number of servers failed, and while I can see the (closed) alerts in SCOM for computer unreachable and heartbeat failure, I have been told that the health state of the computers, and hence the tiles, didn’t change from green.
I have run tests and, sure enough, the server icons display the “uncontactable” circle, but do not go red.
When I view the health explorer in SCOM, the availability shows red all right, but it doesn’t roll up to the server status, and hence, the dashboard tile.
Looking at the Microsoft documentation, it appears that it should:
“The health state for the agent-managed computer will change to critical (red) when the Health Service Heartbeat Failure alert is generated.”
Has anyone come across this, or am I losing the plot?
The servers are a mix of 2016 and 2012, but I presume that shouldn’t matter as it is the SCOM health service monitor as opposed to a Windows MP.
I just came back to let you know I’ve found a solution to the issue, which is built into SCOM but I had forgotten about it.
By default, SCOM Component groups ignore the health state if monitoring is lost.
To ensure the tile changes when monitoring is lost, change the health status of the DA component group to roll up to Error when monitoring is lost. Ta-da!
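For anyone who finds this later, here is a toy Python model of the difference that setting makes. It is only an illustration of the rollup idea, not how SCOM implements it internally; `Health`, `roll_up`, and `monitoring_lost_as` are names made up for this sketch.

```python
from enum import IntEnum


class Health(IntEnum):
    """Simplified health states, ordered so that max() picks the worst."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def roll_up(members, monitoring_lost_as=None):
    """Worst-of rollup over component group members (hypothetical model).

    members: list of Health values; None stands for a member whose
             monitoring has been lost (the greyed-out case).
    monitoring_lost_as: if None, members with lost monitoring are ignored
             (the default behaviour described above); otherwise they are
             counted as this state, e.g. Health.CRITICAL.
    Returns the worst effective state, or None if nothing could be counted.
    """
    effective = []
    for state in members:
        if state is None:
            if monitoring_lost_as is None:
                continue  # ignore members we can no longer see
            state = monitoring_lost_as
        effective.append(state)
    return max(effective) if effective else None


# One healthy server plus one server whose agent has gone quiet.
group = [Health.HEALTHY, None]
print(roll_up(group).name)                                      # default: lost member ignored
print(roll_up(group, monitoring_lost_as=Health.CRITICAL).name)  # lost member counted as an error
```

With the default, the lost member is skipped and the group stays healthy; with the override, the same group rolls up as critical, which is what turns the tile red.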
Hello, almost all the health state changes come from the agent itself.
If the agent cannot communicate with the server, it cannot pass on the monitor health changes. So the state does not change on those objects (disks, databases, websites, Windows services) on those machines. They will turn grey or unmonitored at some point, as you may have seen. But the rollups of those states stay green (some also go to unmonitored and grey, but higher up it's green again).
Only the agent heartbeat check runs on the management server; it will complain, make the agent object red, and give you one or two alerts depending on whether the agent is reachable by ICMP ping or not. That object will turn red, but generally it is not part of monitoring dashboards.
You could add it to the relevant dashboards as an object of interest. Just keep in mind that if an agent is down and the machine is up, there is a chance the application is still running on that machine.
We generally have separate dashboards for agent health and try to react quickly if they turn red. It's either the agent's server being down, the server being so overloaded that it cannot respond anymore, or simply the agent on the machine having crashed and not recovered. We react to those cases so that the dashboards these machines sit in don't go into grey or red states, depending on what objects you put in the DA.
When you say “uncontactable” circle, do you mean the green empty circle or the greyed-out state? Or are you saying that they remain green even after the configured number of missed heartbeats/period has passed?
Servers often go grey straight from a healthy state. They have a greyed-out healthy check mark on them when that happens.
However, I agree with what you’re saying, seems like an inconsistency in behavior (unless there’s a reason for it that I’m unaware of).
Thanks Badger, I’ll try to upload the images again
I think you are right (it's SCOM DAs, by the way). I am displaying the OS class, which seems independent of agent health. I must investigate further to see if I can change it to a class that is dependent on the health state.
Do you know what class is dependent on agent health, that could be used to roll up to the DA, and still reflect all the other monitors/rules necessary?
Thanks Bob (and Badger).
What you’ve said makes perfect sense and tallies with my testing.
I’ll incorporate the SCOM agent state as a DA component and add it into the Dashboards.
What threw me off is I remember tiles going red when servers were uncontactable, but it is our Linux servers that do this. They do turn the tile red when the server goes down, or the agent is uncontactable.
Hi Ervia, that is correct. Because Linux/Unix machines and network devices are monitored by the management server (mostly), they will turn red when the agent is not available.
The same often applies when you are using website monitoring with transactions from another machine. For DAs with a web server in them, I advise adding a synthetic transaction or other web monitoring run from another machine to check if the site is reachable.
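As a rough illustration of that kind of outside-in check, here is a minimal Python sketch of a reachability probe you could run from a different machine. It is not SCOM's web monitoring or a real synthetic transaction, just a plain HTTP request; `site_is_reachable` and its `opener` parameter are hypothetical names for this example, and the URL is a placeholder.

```python
import http.client
import urllib.request


def site_is_reachable(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx/3xx status, else False.

    opener is injectable so the probe can be exercised without a network;
    by default it is urllib.request.urlopen from the standard library.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError/HTTPError, timeouts and refused
        # connections; ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the site the DA depends on.
    print(site_is_reachable("http://intranet.example/health"))
```

Because the probe runs somewhere other than the web server, its result does not depend on the agent on that server being able to report in.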
No problem Ervia.
Just FYI, this is the exact reason SquaredUp has EAs, which reflect the “accurate” health of your application, not dependent on the health of its underlying objects (which may or may not change state over something irrelevant, such as health service connection interruptions). Basically what Bob said in his last reply about URL monitoring, EAs do exactly that.