Health service and computer unreachable alerting but not changing health state of computer

I have a lot of dashboards relating to specific applications/groups of servers. Last week (while I was on holidays) a number of servers failed, and while I can see the (closed) alerts in SCOM for computer unreachable and heartbeat failure, I have been told that the health state of the computers, and hence the tiles didn’t change from green.

I have run tests and, sure enough, the server icons display the “uncontactable” circle, but do not go red.

When I view the health explorer in SCOM, the availability shows red all right, but it doesn’t roll up to the server status, and hence, the dashboard tile.

Looking at the Microsoft documentation, it appears that it should:

“The health state for the agent-managed computer will change to critical (red) when the Health Service Heartbeat Failure alert is generated.”

Has anyone come across this, or am I losing the plot?

The servers are a mix of 2016 and 2012, but I presume that shouldn’t matter, as it is the SCOM health service monitor as opposed to a Windows MP.

Hi Ervia,

When you say “uncontactable” circle, do you mean the green empty circle or the greyed-out state? Or are you saying that they remain green even after the configured number of heartbeat misses has passed?

Servers often go grey straight from a healthy state. When that happens, they show a greyed-out healthy check mark.

https://docs.microsoft.com/en-us/system-center/scom/manage-agents-not-healthy?view=sc-om-2019

However, I agree with what you’re saying, seems like an inconsistency in behavior (unless there’s a reason for it that I’m unaware of).

Thanks Badger, I'll try to upload the images again


I think you are right (it’s SCOM DAs, by the way). I am displaying the OS class, which seems to be independent of agent health. I must investigate further to see if I can change it to a class that is dependent on the health state.

Do you know what class is dependent on agent health, that could be used to roll up to the DA, and still reflect all the other monitors/rules necessary?

healthstate.jpg

notred.jpg

Hello, almost all the health state changes come from the agent itself.
If the agent cannot communicate with the management server, it cannot pass on the monitor health changes. So the state does not change on the objects (disks, databases, websites, Windows services) on those machines. They will turn grey or unmonitored at some point, as you may have seen. But the rollups of those states stay green (some also go unmonitored and grey, but higher up it’s green again).

Only the agent heartbeat monitor runs on the management server. It will complain, make the agent object red, and give you one or two alerts depending on whether the agent is reachable by ICMP ping. But generally that object is not part of monitoring dashboards.
You could add it to the relevant dashboards as an object of interest. Just keep in mind that if an agent is down and the machine is up, there is a chance the application is still running on that machine.

We generally have separate dashboards for agent health and try to react quickly if they turn red. It’s either the agent’s server being down, the server being so overloaded that it cannot respond anymore, or simply the agent on the machine having crashed and not recovered. We react to those cases so that the dashboards these machines sit in don’t go into grey or red states, depending on what objects you put in the DA.


Thanks Bob (and Badger).

What you’ve said makes perfect sense and tallies with my testing.

I’ll incorporate the SCOM agent state as a DA component and add it into the Dashboards.

What threw me off is that I remember tiles going red when servers were uncontactable, but it is our Linux servers that do this. They do turn the tile red when the server goes down, or the agent is uncontactable.

Hi Ervia, that is correct. Because Linux/Unix and network devices are monitored (mostly) by the management server, they will turn red when the agent is not available.
The same often applies when you use website monitoring with transactions run from another machine. For DAs with a web server in them, I advise adding a synthetic transaction or other web monitoring run from another machine to check whether the site is reachable.
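To make the "check from another machine" idea concrete, here is a minimal sketch in Python of an outside-in probe. It is not SCOM's web monitoring and not a SCOM API, just an illustration of the principle: the result comes from the machine running the probe, so it keeps reporting when the agent on the web server goes silent. The function name `probe_site`, the state strings, and the example URL are all made up for illustration.

```python
import urllib.error
import urllib.request


def probe_site(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return 'healthy' if the URL answers with a 2xx/3xx status, else 'critical'.

    Hypothetical helper: `opener` is injectable only so the probe can be
    exercised without a real network call.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, and timeouts, as well as
        # HTTP error statuses (urlopen raises HTTPError for those).
        return "critical"
    return "healthy" if 200 <= status < 400 else "critical"


# Example call from a separate watcher machine (hypothetical URL):
# state = probe_site("https://intranet.example/health")
```

Because the probe runs outside the monitored server, a dead server or a dead agent both show up as a failed check rather than a stale green state.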

No problem Ervia.

Just FYI, this is the exact reason SquaredUp has EAs, which reflect the “accurate” health of your application and are not dependent on the health of its underlying objects (which may or may not change state over something irrelevant, such as health service connection interruptions). Basically, what Bob said in his last reply about URL monitoring, EAs do exactly that.

I just came back to let you know I've found a solution to the issue. It is built into SCOM, but I had forgotten about it.

By default, SCOM Component groups ignore the health state if monitoring is lost.

To ensure the tile changes when monitoring is lost, change the health status of the DA component group to roll up to Error when monitoring is lost. Ta-Da
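For anyone who wants to see why that one setting changes the tile, here is a toy model in Python of a worst-of rollup. It is only a sketch of the behaviour described in this thread, not SCOM code: the `Health` enum, the `rollup` function, and the `when_lost` parameter are invented names, and returning `None` when nothing contributes is my own assumption.

```python
from enum import IntEnum
from typing import Iterable, Optional, Tuple


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def rollup(
    members: Iterable[Tuple[Health, bool]],
    when_lost: Optional[Health] = None,
) -> Optional[Health]:
    """Worst-of rollup over (last_known_state, is_monitored) pairs.

    when_lost=None models the "ignore" default: members whose monitoring is
    lost do not contribute at all. Any other value is substituted for them.
    """
    contributing = []
    for last_known, is_monitored in members:
        if is_monitored:
            contributing.append(last_known)
        elif when_lost is not None:
            contributing.append(when_lost)
    # Assumption: with no contributing members the result is unknown (None).
    return max(contributing) if contributing else None


# Two servers that were green; the second one has lost contact.
servers = [(Health.HEALTHY, True), (Health.HEALTHY, False)]
default_tile = rollup(servers)                            # stays HEALTHY
strict_tile = rollup(servers, when_lost=Health.CRITICAL)  # becomes CRITICAL
```

With the default, the uncontactable server is simply skipped and the tile stays green; treating lost monitoring as an error is what pushes the rollup to red.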


Hi Badger, it is the greyed-out state (with a green health state behind it). I’ve added an image. The image is taken from SquaredUp. As you can see, it is showing that the agent is uncontactable, but that the server was in a healthy state before contact was lost. This means that the dashboard tile never changes to a red state, and therefore shows no warning.
In other words, if there is a dashboard tile for an application, and all servers for the application go down, then there is no warning or change in the dashboard tile.
This seems to contradict what Microsoft says, in that the health of the server includes the health of the SCOM agent.
The agent health does go red, as per my second image.

Sorry, but looks like the images were stripped out? I don’t see any image in your reply there.

From what I know, the agent health is determined by the health service watcher object, which has a separate class. The server view in SCOM or the SquaredUp dashboard could be scoped to the OS class, which is why it’s not dependent on the agent health. Though I’ll be honest, it’s just speculation.

But again, I do understand and agree there seems to be inconsistencies with the health states.

When you say applications, do you mean the SCOM DAs or SquaredUp EAs?