We have some places in our locked-down environments where ping is supposed to be allowed but in reality is not. So whenever we get a heartbeat failure, the Failed to Connect to Computer (FTC) alert fires as well, since SCOM cannot ping the system. The heartbeat alert closes, but the FTC alert does not, because SCOM still cannot ping the system. So the FTC alert has to be closed manually.
This is the standard response that I expect from a server team. When we experience large numbers of HB/FTC alerts, it's usually one of the following:
> actual network issue where the agents cannot reach their MS/GW for a period of time
> DNS issue where agents are reporting in by IP and not name, which SCOM rejects. (we see this a lot)
> occasionally, poor SCOM DB performance can cause all the agents to throw a HB alert, but you’d see really poor console performance as well as other indicators (errors) in the console
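For the DNS case in the list above, one rough way to spot a name that does not round-trip is to resolve the agent's FQDN to an IP and then reverse-resolve that IP back to a name. This is only a sketch of a generic forward/reverse lookup, not anything SCOM itself runs; the function name `dns_round_trip` and the example hostname are hypothetical, and a mismatch is a hint to investigate rather than proof of the cause.

```python
import socket


def dns_round_trip(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve fqdn to an IP, reverse-resolve the IP, and compare the names.

    Returns (ip, reverse_name, matches). If either lookup fails,
    returns (None, None, False). The forward/reverse parameters exist
    only so the lookups can be swapped out (for example in tests).
    """
    try:
        ip = forward(fqdn)
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        reverse_name = reverse(ip)[0]
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError
        return None, None, False
    matches = reverse_name.lower().rstrip(".") == fqdn.lower().rstrip(".")
    return ip, reverse_name, matches
```

Run it on a management server or gateway with something like `dns_round_trip("agent01.example.com")` (a made-up name). If the last value is False, the forward and reverse records for that agent are worth a closer look.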
Try to telnet FROM the agent TO the MS/GW on the agent port (TCP 5723 by default), using the FQDN of the MS/GW. DO NOT use ping in any troubleshooting; this is the 21st century, say goodbye to ping!
Remember, FTC is a diagnostic step in the HB failure monitor, so it depends on the MS being able to PING the agent by name, usually the FQDN. You can override the diagnostic step to have the PING run from another system; I’ve had to do this because our agents cannot “see” back into the monitoring environment and can only be reached by their GW or a designated server with that traffic allowed.
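If telnet is not installed on the agent, the same check can be sketched in a few lines of Python: it simply tries to open a TCP connection to the MS/GW and reports whether that worked. The function name `can_connect` and the example FQDN are hypothetical; 5723 is the default SCOM agent port, so change it if your environment uses something else.

```python
import socket

# Default port SCOM agents use to talk to their MS/GW; adjust if yours differs.
SCOM_AGENT_PORT = 5723


def can_connect(host, port=SCOM_AGENT_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

Run it FROM the agent, for example `can_connect("ms01.example.com")` with your real MS/GW FQDN in place of the made-up one. A False result does not say whether DNS or the network path is at fault, so follow up with a name lookup if it fails.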
For the record, I’ve seen overloaded management servers as a cause for this behavior. In my customer’s case, they were using service monitor templates in SCOM. When they created a new monitor, the resulting config generation on the agents generated hundreds of heartbeat/not reachable alerts for servers that were actually up. The management server’s CPU would spike through the roof and the alerts would remain in place until the MS resources settled back down. Might be worth checking performance on the management servers.
Have you seen any trends? Is it from a certain profile of agent, location, time of day, etc.? Not sure how you are trying to spot alert trends, but I would recommend downloading and importing the free Veeam report library and running the “alert statistics” report, which can help with this. I had a customer who had a long-running issue with similar frustrations (heartbeat/computer not reachable). When we ran this report we were able to identify an alert pattern, which led to a resolution shortly afterwards.
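If you cannot import the report library right away, a crude stand-in is to export the alerts to CSV and count them by hour of day and by agent. The sketch below is not the Veeam report, just a quick tally; the column names `TimeRaised` and `Agent` are assumptions about how your export is laid out, and the timestamps are assumed to be in ISO format (for example `2024-01-01T02:15:00`).

```python
import csv
from collections import Counter
from datetime import datetime


def load_alerts(path):
    """Read an exported alert CSV into a list of dicts (one per alert)."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))


def alert_trends(rows):
    """Count alerts per hour of day and per agent.

    Each row needs a 'TimeRaised' value in ISO format and an 'Agent' value;
    both column names are assumptions about the export, not SCOM field names.
    """
    by_hour = Counter()
    by_agent = Counter()
    for row in rows:
        raised = datetime.fromisoformat(row["TimeRaised"])
        by_hour[raised.hour] += 1
        by_agent[row["Agent"]] += 1
    return by_hour, by_agent
```

Calling `by_hour.most_common(5)` and `by_agent.most_common(5)` on the results gives a quick view of whether the alerts cluster around a time of day or a handful of agents.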