Automatic Recovery scripts

rtbywalski · September 20, 2017, 5:11pm

This is one of my first attempts at writing a script for an automatic recovery, and I ran into an issue. I think I have a solution but want to run it past some people smarter than me at this. The issue is with a service that runs a print spooler for one of our applications: it has a tendency to hang and stop processing jobs. I have a performance counter that detects the number of jobs in the queue and sends a critical alert if the number gets too high. What I was attempting to do was watch this counter and, when the issue occurs, run a script that restarts the service and gets printing going again. When I tried this out of the box, the service just kept restarting every few seconds. I suspect the monitor kept seeing the service as critical because the jobs had not yet cleared the queue, so it kept rerunning the script. My first thought was to put a 2-minute delay at the end of the script so that it would not exit until the service had had time to process some print jobs and bring the counter back to a good condition. What I need to know is: does this sound like it would work, or would it just keep launching multiple instances of the script, which would also keep jobs from processing? I just loaded up SquaredUp’s PowerShell management pack, which should make this script much simpler than the VB script I found online.

jannep · September 21, 2017, 6:30am

Have you selected the checkbox to recalculate the monitor after running the recovery script?

https://technet.microsoft.com/en-us/library/hh705258(v=sc.12).aspx

How often do you check the status of the performance counter? If you reset the monitor after running the recovery script, it should have enough time to clear the print queue, as long as you don’t check it too often.

We had a similar problem with a recovery task: we only wanted the recovery to try once and, if it did not work, not to try again for a while. So we added a function to the recovery script that writes a timestamp to a registry key. Each time the recovery script runs, it checks whether it has already run in the last 4 hours. If not, it runs; otherwise it just exits, so that the alert stays open for manual intervention.
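The cooldown check described above could look roughly like the sketch below. It is written in Python for illustration only and makes some assumptions: the timestamp is kept in a plain file rather than a registry key, and the names `should_run_recovery`, `stamp_file` and `COOLDOWN_SECONDS` are hypothetical, not taken from the actual script.

```python
import time
from pathlib import Path

# Four hours, matching the interval mentioned in the post (adjust as needed).
COOLDOWN_SECONDS = 4 * 60 * 60


def should_run_recovery(stamp_file, now=None, cooldown=COOLDOWN_SECONDS):
    """Return True (and record the attempt) if no recovery ran within the cooldown.

    Assumption: the last-run time is stored as a Unix timestamp in a text file.
    The original setup used a registry key instead.
    """
    now = time.time() if now is None else now
    path = Path(stamp_file)
    try:
        last_run = float(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        last_run = None  # no previous attempt recorded, or the file is unreadable

    if last_run is not None and now - last_run < cooldown:
        # Ran too recently: skip, so the alert stays open for a human.
        return False

    path.write_text(str(now))  # remember this attempt for next time
    return True
```

A recovery script would call this first and exit immediately when it returns False. Only the guard is shown here; the restart itself is left out.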

viper · September 21, 2017, 8:49am

Diagnostics and recoveries are triggered by health state changes, so if the service stops and the monitor changes to a critical state, the recovery will fire once and then not fire again until that monitor once more transitions from healthy/warning to critical.

Typically loops occur when you enable the option to forcibly reset the monitor after the recovery has run (you should almost never do this).

If the service kept restarting, I suspect either the monitor was flapping (causing the recovery to run again) or something else is also attempting to manage the service state and triggering restarts (the service automatic recovery options in Windows and SCOM recoveries do not play well together). If you check the State Change events tab for that monitor in SCOM, you can see the recovery execution history and you’ll be able to work out if either of the above is causing the issue.
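To make the flapping check concrete, here is a small Python sketch that counts how many times a monitor transitioned into critical, since each such transition is what would fire the recovery. The input format (a list of `(timestamp, state)` pairs copied out by hand from the state change history) and the function name are assumptions for illustration; this is not a SCOM API.

```python
def count_critical_transitions(history):
    """Count transitions into 'critical' in a list of (timestamp, state) pairs.

    Assumptions: entries are in chronological order and states are lowercase
    strings such as 'healthy', 'warning' and 'critical'. Consecutive 'critical'
    entries count once, because only the change of state triggers a recovery.
    """
    count = 0
    previous = None
    for _, state in history:
        if state == "critical" and previous != "critical":
            count += 1
        previous = state
    return count


# Hypothetical history: the monitor went critical twice within the sample.
sample = [
    (0, "healthy"),
    (1, "critical"),
    (2, "critical"),
    (3, "healthy"),
    (4, "critical"),
]
print(count_critical_transitions(sample))
```

Many transitions in a short period would point to flapping. A single transition alongside repeated restarts would point to something else managing the service.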

rtbywalski · September 21, 2017, 12:48pm

What I am running here is a performance counter monitor that watches the object EpicPrintService with the counter Jobs On Queue. It is set to a 15-minute interval and alerts if more than 10 jobs are detected.

I do not think I had “Recalculate monitor state after recovery finishes” checked, so that may be worth a try. It could also have been something in the VB script I found online, since I do not speak VB very well, just enough to sometimes modify an existing script to do what I want. We did add the PowerShell MP this past week and plan to try using it instead.

This is my first attempt at a recovery harder than just starting a stopped service, so I appreciate the suggestions. My plan is to start trying to get this working again after Ignite.

michouser · September 25, 2017, 1:26pm

Remediation is possible.

The path we have taken to implement remediation is to incorporate Orchestrator. Doing so allows logic to be introduced that prevents a loop from occurring. Right now, we have remediation working for a few scenarios. One is starting a stopped service. The scenario goes like this: if the service goes critical (stops), the critical alert is triggered. Once triggered, a call to Orchestrator is initiated. This call passes variables that contain the necessary parameters for the runbook. The runbook verifies that the service is down and attempts to restart it. Once it is restarted, a verification step ensures that the service is running. This in turn causes the SCOM critical alert to clear. If for some reason the service will not restart, the runbook stops after three attempts. Remediation can be done; it just takes a lot of testing to rule out all scenarios. This process results in three text messages being sent. The first indicates that the service is down. The second indicates whether or not the service was restarted successfully by Orchestrator. Finally, the third indicates that the critical alert in SCOM has cleared.
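The runbook logic above (verify, restart, verify again, give up after three attempts) can be sketched as a bounded retry loop. This Python version is illustrative only and is not the actual Orchestrator runbook: `remediate_service`, `is_running`, `start_service` and `wait_seconds` are hypothetical names, and the two callables stand in for whatever actually queries and starts the service.

```python
import time


def remediate_service(is_running, start_service, max_attempts=3, wait_seconds=0.0):
    """Try to start a stopped service, verifying after each attempt.

    is_running and start_service are caller-supplied callables (assumptions
    for this sketch). Returns a tuple (recovered, attempts_made).
    """
    if is_running():
        # Nothing to do: the service is already up, so do not touch it.
        return True, 0

    for attempt in range(1, max_attempts + 1):
        start_service()
        time.sleep(wait_seconds)  # give the service time to come up
        if is_running():
            return True, attempt

    # Bounded retries: stop here and leave the alert for manual follow-up.
    return False, max_attempts
```

The hard limit on attempts is what prevents a restart loop. The return value could drive the second notification (restarted or not restarted).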

viper · September 21, 2017, 8:27am

Worth noting: the “recalculate monitor” checkbox only actually does anything if the monitor supports “on demand detection”, the same feature that makes the recalculate button function. So if clicking that button doesn’t do anything for your monitor, enabling that option won’t either. It’s a shame, as most monitors seem not to support this very useful feature, most likely just because it’s not really documented.

jannep · September 21, 2017, 8:41am

Good note by viper. We had a guy from Microsoft visiting us to do a RAP on our System Center setup, and he did not know why they had both “recalculate” and “reset” when “recalculate” actually did not do anything in most cases. He mentioned, though, that in rollup scenarios the recalculate could speed up the rollup process.

jannep · September 21, 2017, 8:59am

Then I would guess that when he fires the restart, the monitor goes green because it does not have any performance data to pick up, and then it goes red again.
You should be able to see if it is flapping under the state change events.

tysonp · September 14, 2021, 7:08pm

OnDemand detection must be baked into the MonitorType for the Recalculate button to do anything. Typically you would find it included for a scheduled, scripted datasource workflow.

There are several reasons why it may not exist in the MonitorType:

The author didn’t know how to implement it
It’s not appropriate for the monitor type (example: event detection monitor)
Author was lazy or forgot
Author omitted it on purpose

Why would the OnDemand module be omitted on purpose, you ask?
Upon initialization of a unit monitor, OnDemand (if included in the MT) will calculate health for every instance of the target class type, but it WILL NOT use cookdown. The agent will run separate instances of the datasource for every single target instance. This can be devastating for target types with many instances. Think of IIS sites or SQL agent jobs with hundreds or potentially thousands of instances on an agent. OnDemand is certainly helpful at times, but the potentially harmful effects on multi-instance classes during unit monitor initialization mean it may be better to leave it out.

What if OnDemand does not exist in the MonitorType?
If OnDemand does not exist in the MonitorType, the unit monitor will automatically initialize to healthy. It will then run next based on the configured schedule and calculate health normally at that time. Cookdown will be leveraged by the datasource if possible.