Ping Monitor in SCOM

Hi All,

Someone in our environment rebooted a Critical Production Server and no alert was received from SCOM.

Management is out for blood, wanting to know why and to have it fixed.
The issue I find is that the server is a VM and it went down and came back up in 6 seconds.

SCOM is too slow to detect this failure, as it only polls every 60 seconds and will alert of a failure on the 4th second.

Has anyone been able to find a suitable fix for this, apart from setting up Event ID monitors for Shutdown, Startup, etc.?

I have also tried OpsLogix Ping Monitor, but while it is effective at alerting, I cannot customise the alert console descriptions, which is a significant drawback for alerts going to my Level 1 support.

I am thinking of a PowerShell monitor that runs “independent” of the SCOM Agent, because if the SCOM Agent service stops for any reason (for example, the server is being shut down), I will not receive any alert from a script that depends on it.
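To make the "independent" idea concrete, here is a minimal sketch of a watchdog that would run on a different machine from the one being watched, so it keeps working when the monitored server or its SCOM Agent is down. It is written in Python rather than PowerShell purely for illustration, and it probes a TCP port instead of sending ICMP pings (raw ICMP usually needs elevated privileges). The function names, the 1-second interval, and the two-failure threshold are all assumptions, not anything SCOM provides.

```python
import socket
import time
from typing import Callable, List


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def watch(probe: Callable[[], bool], checks: int, interval: float = 1.0,
          failures_to_alert: int = 2,
          sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Run `checks` probes and return the check indexes at which an alert fired.

    An alert fires once per outage, at the moment `failures_to_alert`
    consecutive probes have failed. Requiring more than one failure is a
    hypothetical guard against single dropped probes.
    """
    alerts: List[int] = []
    consecutive = 0
    for i in range(checks):
        if probe():
            consecutive = 0
        else:
            consecutive += 1
            if consecutive == failures_to_alert:
                alerts.append(i)
        if i < checks - 1:
            sleep(interval)
    return alerts
```

A call such as `watch(lambda: tcp_probe("server01", 3389), checks=60)` would poll a hypothetical host `server01` roughly once per second for about a minute. With these assumed settings a 6-second outage should produce several consecutive failures, but note that each failed probe can also take up to `timeout` seconds, so real spacing is a little longer than `interval`.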

Does anyone out here have any good ideas, or such a script, that can help save my bacon?

Hi SaiyadRa,

The flip side of your request (aggressive monitoring) is that any hiccup (and there are plenty of those) will generate an alert. You may be dealing with a “Cry Wolf” problem soon after.

That said, yes, you can simply make an alert rule for any Event ID 1074 from User32. It will fire if the server is rebooted while not in Maintenance.

Cheers

Hi Pascal,
This solution has not worked for me so far: if someone restarts the VM from vCenter, no Event ID is logged for the shutdown. The Event ID is only generated once the server is back up and the SCOM agent starts running, and even then it still misses the Event ID.
The server goes down and is up again in 6 seconds.
The application has crashed and the customer is now calling, but there is still no alert from SCOM.

I have found that an alert rule for Event ID equals 1074 AND parameter 5 equals restart does alert when someone restarts from vCenter.
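The matching logic of that rule can be sketched as below. This is only a Python illustration of the two conditions, not SCOM rule syntax, and the dictionary layout (`id`, `params`) is a hypothetical stand-in for however the event reaches your code. Parameter 5 is counted from 1, so it is index 4 in a zero-based list.

```python
from typing import Any, Mapping


def is_restart_event(event: Mapping[str, Any]) -> bool:
    """Return True for an Event ID 1074 whose parameter 5 is 'restart'.

    `event` is a hypothetical dict with an integer "id" and a list of
    string "params" in the order the event log records them.
    """
    params = event.get("params", ())
    return (
        event.get("id") == 1074
        and len(params) >= 5
        # Parameter 5 (index 4) is compared case-insensitively.
        and str(params[4]).strip().lower() == "restart"
    )
```

Filtering on parameter 5 as well as the Event ID keeps plain shutdowns or power-offs from matching the same rule, if only restarts are of interest.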

Agree with Mandy: even when restarted from vCenter/Hyper-V, there will still be an event 1074.

Actually, your best plan of action is to have deeper monitoring of the application, regardless of the server infrastructure below. Windows can well be up, but if the service or process is not responding correctly, that doesn’t matter from an application availability standpoint. Thus, assuming a multi-tiered application, another tier of the solution should already be generating an error message when that component goes down, and so that error message could be alerted on… Better yet, health should be determined by whether the other tier can communicate correctly or not.
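As a rough sketch of "health determined by whether the other tier can communicate correctly", the function below classifies a component from the caller's point of view. The three state names and the `send_request` callable are assumptions for illustration; in practice `send_request` would be whatever request the neighbouring tier already makes.

```python
from typing import Callable


def tier_health(send_request: Callable[[], str], expected: str) -> str:
    """Classify a component by how it answers a request from another tier.

    Returns "healthy" if the reply matches `expected`, "degraded" if the
    component answers with something else, and "down" if the request
    cannot be completed at all (connection errors raise OSError).
    """
    try:
        reply = send_request()
    except OSError:
        return "down"
    return "healthy" if reply == expected else "degraded"
```

The point of the "degraded" state is the case described above: the server answers, so a ping would pass, yet the application is not responding correctly.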

<my 2 cents >
And all that is to cover for the fact that 6 seconds of downtime of a single component should not bring down the entire application if it is that critical. The application should either have some retries or some HA solution for its components…
</my 2 cents >
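For the retry side of that, a minimal sketch might look like the following. The helper name, the five attempts, and the 2-second delay are assumptions chosen so that the total wait (8 seconds) is longer than the 6-second outage described in this thread; only connection-type errors (`OSError`) are retried.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T], attempts: int = 5,
                      delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying on OSError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as exc:
            last_error = exc
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay)
    assert last_error is not None
    raise last_error
```

A fixed delay is used here for simplicity; whether a short reboot is survivable this way depends on how the application handles the wait on its side.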

Hi Pascal,

Is there any way to get alerted from SCOM when the server actually goes “down” (i.e. within 3-4 seconds), and not only once the server is “Up” and running again?

Hi SaiyadRa,

It goes back to what I said before about deeper monitoring of the application. You will need custom monitoring to do this. You can either try it yourself (I have done it for several of my apps) or use one of the providers that probably offer this as a service. I would probably recommend having a third party write the custom monitoring for your app and show you how to develop it. That knowledge will probably carry over to other apps.

Cheers