Health Service Heartbeat Failure and Computer Not Reachable alerts are generated almost daily from many agents

Hi

Health Service Heartbeat Failure and Computer Not Reachable alerts are being generated almost daily from many agents.

When we contacted the Windows team, they told us that all of the servers are up and running.

We end up closing these alerts as false alerts.

Please suggest a way to resolve the underlying issue that causes these alerts to be generated in the first place.

https://technet.microsoft.com/en-us/library/hh212798(v=sc.12).aspx

We have some places in our locked-down environments where ping is supposed to be allowed, but in reality is not. So whenever we get a heartbeat failure, the FTC (Failed to Connect to Computer) alert fires as well, since SCOM cannot ping the system. The Heartbeat alert closes, but the FTC does not, because SCOM still cannot ping the system. So the alert has to be closed manually.

Is this what you are seeing?

This is the standard response that I expect from a server team. When we experience large numbers of HB/FTC alerts, it's usually one of the following:
> actual network issue where the agents cannot reach their MS/GW for a period of time
> DNS issue where agents are reporting in by IP and not name, which SCOM rejects. (we see this a lot)
> occasionally you can see issues where SCOM DB performance can cause all the agents to throw a HB; but you’d see really poor console performance as well as other indicators (errors) in the console
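For the DNS case, one quick check is whether the agent's name resolves forward and then back to a name again. The following is only a minimal sketch: the function name Test-DnsRoundTrip is hypothetical, and it assumes that a lookup through .NET's System.Net.Dns on the machine where you run it is a fair proxy for what the MS/GW sees.

```powershell
# Hypothetical helper (not part of SCOM): forward-resolve a name, then
# reverse-resolve each returned address so mismatches are easy to spot.
function Test-DnsRoundTrip {
    param([Parameter(Mandatory)][string]$Name)

    try {
        $forward = [System.Net.Dns]::GetHostEntry($Name)
    } catch {
        # Forward lookup failed; report it instead of throwing.
        return [pscustomobject]@{
            Name         = $Name
            Resolved     = $false
            Addresses    = @()
            ReverseNames = @()
        }
    }

    $reverse = foreach ($ip in $forward.AddressList) {
        try { [System.Net.Dns]::GetHostEntry($ip).HostName }
        catch { "<no reverse record for $ip>" }
    }

    [pscustomobject]@{
        Name         = $Name
        Resolved     = $true
        Addresses    = @($forward.AddressList | ForEach-Object { $_.ToString() })
        ReverseNames = @($reverse)
    }
}
```

Run it with the agent's FQDN from the MS/GW, and with the MS/GW's FQDN from the agent. If ReverseNames does not contain the name you expect, DNS is worth a closer look.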

Try to telnet FROM the agent TO the MS/GW using the FQDN of the MS/GW. DO NOT use ping in any troubleshooting; this is the 21st century, say goodbye to ping!

Remember, FTC is a diagnostic step in the HB failure monitor, so it's dependent on the MS being able to PING the agent by name, usually FQDN. You can override the diagnostic step in order to have the PING run from another system; I've had to do this because our agents cannot "see" back into the monitoring environment and can only be reached by their GW or a designated server with that traffic allowed.

You should check the logs at the agent side to see what it says. Double-check AV settings, see: https://support.microsoft.com/en-us/help/975931/recommendations-for-antivirus-exclusions-that-relate-to-operations-man

Just to verify network connectivity, try the following PowerShell script on the agent:

@(gci "HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups" | %{ gci (join-path $_.PSPath "Parent Health Services") | Get-ItemProperty | select -ExpandProperty NetworkName }) | %{ try { (new-object Net.Sockets.TcpClient).Connect($_, 5723); write-host "SUCCESS: Could connect to $_ on port 5723" -foreground Green } catch { write-host "ERROR: $_ " -foreground Red} }

Look at patterns, ensure the firewalls/routes do not interfere with communications, check with network to confirm. Asymmetric routes come to mind…

Anyhow, logs on the agent are a must if you want to go further.

I wrote a script that continuously pings the server from the management server.

Output is written to the log, with a timestamp, only when the server does not respond to ping.

I scheduled the script in the Windows Task Scheduler.

After a few days the log had entries. I compared the FTC alert timestamps with the log timestamps, and they matched.

Hence the alerts are being generated by a network issue.

Script:

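# Placeholders: set these to the target server name and the log file path.
$server  = "<servername>"
$logFile = "<logfilepath>"

while ($true)
{
    # Log only when the ping fails, with a timestamp.
    if (-not (Test-Connection -ComputerName $server -Count 1 -Quiet)) {
        Get-Date >> $logFile
        "Not pinging" >> $logFile
    }
    # Short pause (not in the original script) so the loop does not spin the CPU.
    Start-Sleep -Seconds 1
}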
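The comparison of alert times against the ping log can also be done in a few lines rather than by eye. This is only a sketch: the function name Find-MatchingTimestamps and the 5-minute tolerance are assumptions for illustration, and it expects both lists to be parsed into [datetime] values in the same time zone.

```powershell
# Hypothetical helper: for each alert time, count the ping-failure log
# entries that fall within a tolerance window around it.
function Find-MatchingTimestamps {
    param(
        [datetime[]]$AlertTimes,
        [datetime[]]$LogTimes,
        [int]$ToleranceMinutes = 5   # assumed window; tune to your environment
    )

    foreach ($alert in $AlertTimes) {
        $near = @($LogTimes | Where-Object {
            [math]::Abs(($_ - $alert).TotalMinutes) -le $ToleranceMinutes
        })
        [pscustomobject]@{
            AlertTime    = $alert
            PingFailures = $near.Count
            Matched      = ($near.Count -gt 0)
        }
    }
}
```

Alerts with Matched = False are the interesting ones: the server answered ping around that time, so something other than basic reachability triggered the alert.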

For the record, I've seen overloaded management servers cause this behavior. In my customer's case, they were using service monitor templates in SCOM. When they created a new monitor, the resulting config generation on the agents produced hundreds of heartbeat/not reachable alerts for servers that were actually up. The management server's CPU would spike through the roof and the alerts would remain in place until the MS resources settled back down. It might be worth checking performance on the management servers.

Have you seen any trends? Is it from a certain profile of agent, location, time of day, etc.? I'm not sure how you are trying to spot alert trends, but I would recommend downloading and importing the free Veeam report library and running the "alert statistics" report, which can help with this. I had a customer who had a long-running issue with similar frustrations (heartbeat/computer not reachable). When we ran this report we were able to identify an alert pattern, which led to a resolution shortly afterwards.
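If you want a rough look at trends before installing a report pack, grouping the alerts by hour of day and by agent is a start. This is only a sketch and not a replacement for the report: the function name Get-AlertTrend is hypothetical, and it assumes each input object has TimeRaised and MonitoringObjectDisplayName properties, as alert objects returned by Get-SCOMAlert are expected to.

```powershell
# Hypothetical helper: summarize alerts by hour of day and by agent.
# Assumes each object has TimeRaised ([datetime]) and
# MonitoringObjectDisplayName properties.
function Get-AlertTrend {
    param([Parameter(Mandatory)][object[]]$Alerts)

    $byHour = $Alerts |
        Group-Object { $_.TimeRaised.Hour } |
        Sort-Object Count -Descending |
        Select-Object @{ n = 'Hour'; e = { [int]$_.Name } }, Count

    $byAgent = $Alerts |
        Group-Object MonitoringObjectDisplayName |
        Sort-Object Count -Descending |
        Select-Object @{ n = 'Agent'; e = { $_.Name } }, Count

    [pscustomobject]@{
        ByHour  = @($byHour)
        ByAgent = @($byAgent)
    }
}

# Possible usage on a management server (assumption, not run here):
#   $alerts = Get-SCOMAlert -Name 'Health Service Heartbeat Failure'
#   (Get-AlertTrend -Alerts $alerts).ByHour
```

A cluster in one hour points toward scheduled jobs, backups or config pushes; a cluster on a few agents points toward a site, subnet or gateway.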

No.
Even though the server is up and running, we are getting FTC and heartbeat failure alerts. The alerts auto-close after a few minutes.

Network issue for the agent from what perspective?

We have 5 management servers and manage only 1,500 servers in total. There is not much of a spike in load.