This is one of my first attempts at writing a script for automatic recovery, and I ran into an issue. I think I have a solution but want to run it past some people smarter than me at this. The issue is with a service that runs a print spooler for one of our applications: it has a tendency to hang and stop processing jobs. I have a performance counter monitor that detects the number of jobs in the queue and sends a critical alert if the number gets too high. What I was attempting to do was watch this and, when the issue occurs, run a script that restarts the service and gets printing going again. When I tried this out of the box, the service just kept restarting every few seconds. I suspect it kept seeing the service as critical because the jobs had not yet cleared the queue, so it kept rerunning the script. My first thought was to put a 2-minute delay at the end of the script so that it would not exit until the service had had time to process some print jobs and bring the counter back to a healthy state. What I need to know is: does this sound like it would work, or would it just keep launching multiple instances of the script, which would also keep jobs from processing? I just loaded up SquaredUp's PowerShell management pack, which should make this script way simpler than the VBScript I found online.
Diagnostics and recoveries are triggered by health state *changes*, so if the service stops and the monitor changes to a critical state, the recovery will fire once and then not fire again until that monitor once more transitions from healthy/warning to critical.
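To illustrate the point about state changes, here is a minimal Python sketch of the trigger logic. It is only a model of the behaviour described above, not SCOM code; the function name and the state strings are made up for the example. It counts how many times a recovery would fire for a sequence of monitor states, firing only on a transition into critical.

```python
def recoveries_fired(states):
    """Count recovery runs for a sequence of monitor states (illustrative model)."""
    fired = 0
    previous = "healthy"  # assumption: the monitor starts out healthy
    for state in states:
        # Fire only on a transition INTO critical, not while it stays critical.
        if state == "critical" and previous != "critical":
            fired += 1
        previous = state
    return fired


# A monitor that goes critical and stays there triggers the recovery once.
steady = ["healthy", "critical", "critical", "critical", "healthy"]

# A flapping monitor re-enters critical repeatedly, so the recovery reruns.
flapping = ["healthy", "critical", "healthy", "critical", "healthy", "critical"]

print(recoveries_fired(steady))    # 1
print(recoveries_fired(flapping))  # 3
```

The flapping case is the one that looks like a restart loop: each return to healthy (or each forced reset) re-arms the recovery.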
Typically loops occur when you enable the option to forcibly reset the monitor after the recovery has run (you should almost never do this).
If the service kept restarting, I suspect either the monitor was flapping (causing the recovery to run again) or something else is also attempting to manage the service state and triggering restarts (the service's automatic recovery options in Windows and SCOM recoveries do not play well together). If you check the State Change Events tab for that monitor in SCOM, you can see the recovery execution history and work out whether either of the above is causing the issue.
Have you selected the checkbox to recalculate the monitor after running the recovery script?
How often do you check the status of the performance counter? If you reset the monitor after running the recovery script, the service should have enough time to clear the print queue, as long as you don't check the counter too often.
We had a similar problem with a recovery task, where we only wanted the recovery to try once and, if that did not work, not try again for a while. So we added a function to the recovery script that writes a timestamp to a registry key. Each time the recovery script starts, it checks whether it has already run within the last 4 hours. If it has not, it runs the recovery; otherwise it just exits, so that the alert stays open for manual intervention.
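The timestamp guard described above can be sketched roughly as follows. This is a Python illustration of the logic only: it stores the timestamp in a plain file rather than a registry key, and the function name, the file path parameter, and the constant are all assumptions made for the example.

```python
import time

COOLDOWN_SECONDS = 4 * 60 * 60  # the 4-hour window mentioned in the post


def recovery_allowed(stamp_path, now=None, cooldown=COOLDOWN_SECONDS):
    """Return True (and record the time) if no recovery ran within the cooldown."""
    now = time.time() if now is None else now
    try:
        with open(stamp_path) as f:
            last_run = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        last_run = None  # no usable timestamp: treat as "never run"

    if last_run is not None and now - last_run < cooldown:
        return False  # ran recently: exit and leave the alert open

    # Record this run before the recovery itself would start.
    with open(stamp_path, "w") as f:
        f.write(str(now))
    return True
```

A recovery script would call this first and exit immediately when it returns False. Writing the timestamp before attempting the restart means a failed restart still counts against the cooldown, which matches the "try once, then leave it for a human" intent.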
What I am running here is a performance counter monitor that watches the object EpicPrintService, and the counter is Jobs On Queue. It is set to a 15-minute interval and alerts if more than 10 jobs are detected.
I do not think I had "Recalculate monitor state after recovery finishes" checked, so that may be worth a try. It could also have been something in the VBScript I found online, since I do not speak VB very well, just enough to sometimes modify an existing script to do what I want. We did add the PowerShell MP this past week and plan to try using it instead.
This is my first attempt at a recovery harder than just starting a stopped service, so I appreciate the suggestions. My plan is to start trying to get this working again after Ignite.
Remediation is possible.
The path we have taken to implement remediation is to incorporate Orchestrator. Doing so allows logic to be introduced that prevents a loop from occurring. Right now, we have remediation working for a few scenarios. One is starting a stopped service. The scenario goes like this: if the service goes critical (stops), the critical alert is triggered. Once triggered, a call to Orchestrator is initiated. This call passes variables that contain the necessary parameters for the runbook. The runbook verifies that the service is down and attempts to restart it. Once the service is restarted, a verification step ensures that it is running. This in turn causes the SCOM critical alert to clear. If for some reason the service will not restart, the runbook stops after three attempts. Remediation can be done; it just takes a lot of testing to rule out all scenarios. This process results in three text messages being sent. The first indicates that the service is down. The second indicates whether or not the service was restarted successfully by Orchestrator. The third indicates that the critical alert in SCOM has cleared.
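For anyone who wants to see the shape of that runbook logic, here is a rough Python sketch of the verify, restart, verify loop with a cap of three attempts. It is not Orchestrator code: `remediate`, `is_running`, and `start_service` are hypothetical names, and the two callables stand in for whatever the runbook actually uses to query and start the service.

```python
def remediate(is_running, start_service, max_attempts=3):
    """Try to restart a service, verifying after each attempt.

    Returns (success, attempts_used).
    """
    if is_running():
        return True, 0  # verify first: nothing to do if the service is up

    for attempt in range(1, max_attempts + 1):
        start_service()
        if is_running():
            return True, attempt  # verified running, so the alert can clear

    return False, max_attempts  # give up and leave the alert for a human


# Stand-in service that only comes up on the second start attempt.
service = {"up": False, "starts": 0}


def fake_start():
    service["starts"] += 1
    if service["starts"] >= 2:
        service["up"] = True


print(remediate(lambda: service["up"], fake_start))  # (True, 2)
```

The hard cap is what prevents the restart loop: once the attempts are used up, the function returns a failure instead of trying again. A real runbook would presumably also wait between attempts, which is left out here.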