Monitor for failures in a short time frame

rtbywalski · January 11, 2018, 1:53pm

I work at a hospital where we use Epic, and we have a custom monitor in place that watches a performance monitor for the number of failures in the Epic print queue. Some failures are normal, and there is a process to clean them up. What I am looking to do is trigger an alert only if I get x failed jobs over a short period of time. For example, 5 failures in 10 minutes should trigger an email so that someone investigates. I'm not quite sure how to go about this, though.

Jelly · January 11, 2018, 2:01pm

Performance monitors overview:

https://technet.microsoft.com/en-us/library/hh457563(v=sc.12).aspx

You want to create a Unit Monitor based on a Windows Performance counter. This blog should give you a rough idea (skip past the rule creation):

https://sites.google.com/site/scomblogs/journal-blog/processperformancemonitoringinscom

Ensure that the monitor target (on the next page) is the same as the rule target (right click the rule > properties, it will have the target on the first page).

You can then configure the sample rate and thresholds further along in the wizard.

rtbywalski · January 11, 2018, 2:12pm

That’s pretty much what I did, and I set a threshold of 20 on it. The problem is that we routinely cross that level before the process that cleans up failed jobs runs. We need to hold on to failed jobs for a few days in case a print device was down, so that we can resubmit the failed jobs. I know it's weird how Epic handles printing, but it is what it is. With all that in mind, I am trying to raise the alert only if we get 5 failed jobs in, say, a 10 minute time frame. Since that is about the number we get over a few hours, it would be a great indicator of potential issues. This came up because one of the servers failed earlier this week and it had a few hundred failed jobs in a very short time frame.
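To make the difference between the two conditions concrete, here is a minimal Python sketch of the "5 new failures in 10 minutes" logic, as opposed to a static threshold of 20. It is only an illustration of the idea, not SCOM code: the class name `FailureRateDetector`, its parameters, and the assumption that the counter reports the total number of failed jobs currently held in the queue are all hypothetical.

```python
from collections import deque


class FailureRateDetector:
    """Flags when a failed-job counter grows too fast (hypothetical helper)."""

    def __init__(self, max_new_failures=5, window_seconds=600):
        self.max_new_failures = max_new_failures
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp_in_seconds, failed_job_count)

    def add_sample(self, timestamp, failed_jobs):
        """Record a counter sample; return True if an alert should be raised."""
        self.samples.append((timestamp, failed_jobs))

        # Drop samples that are older than the window.
        while timestamp - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

        # Sum only the increases between consecutive samples. A drop in the
        # counter (assumed to be the cleanup process) is ignored.
        new_failures = 0
        previous = None
        for _, value in self.samples:
            if previous is not None and value > previous:
                new_failures += value - previous
            previous = value

        return new_failures >= self.max_new_failures


detector = FailureRateDetector(max_new_failures=5, window_seconds=600)
quiet = [detector.add_sample(t, n) for t, n in [(0, 18), (300, 19), (600, 20)]]
burst = detector.add_sample(660, 26)  # six new failures within one minute
```

In this run the counter sits at or near 20 without alerting, because it only grows by one or two jobs inside any 10 minute window, and the alert fires once six new failures arrive together.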

chris_watson · January 11, 2018, 3:25pm

Hello,

I think Jelly has provided the best answer for this. I am curious, though: in the Configure Alerts window, have you unticked “Automatically resolve the alert when the monitor returns to a healthy state”?

This would cause the alert to stick around until it has been manually closed by the responding engineer.

A side problem with this is that the alert count will increase and, if not frequently checked, could skew the results you get if you look solely at the alerts window.

Hope this helps,

Chris

Jelly · January 12, 2018, 10:26am

Perhaps you can create a recovery task to ping the printer if it’s got a queue of jobs remaining: https://technet.microsoft.com/en-us/library/hh551141(v=sc.12).aspx?f=255&MSPPError=-2147217396

Also check out the Squared Up Community PowerShell MP for more complex monitoring options (https://download.squaredup.com/management-packs/powershell-monitoring-management-pack/).

pchip · February 5, 2018, 6:52pm

@Jelly It’s a diagnostic task, not a recovery task (it captures more information instead of trying to fix the health). The downside I see is that the information from the diagnostic does not show in the alert (or, of course, in the notifications).

@rtbywalski, what you need is a hybrid “Correlated Event” and “Performance Counter”, or a Suppressor ConditionDetection Module. It depends on how the data is shown in the Performance Counter. I’d go more with the suppressor. If you probe every 30 seconds, then you could put MatchCount=5 and SampleCount=20, which covers a 10 minute window (20 samples × 30 seconds). You can read about it here: https://msdn.microsoft.com/en-us/library/jj129836.aspx
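For anyone unfamiliar with suppression, the idea of MatchCount and SampleCount can be sketched in a few lines of Python. This is not the SCOM module itself, just a rough model of the behavior under the assumption that the alert fires when at least MatchCount of the last SampleCount samples matched the condition; the class name `Suppressor` and its method are made up for illustration.

```python
from collections import deque


class Suppressor:
    """Rough model of 'MatchCount out of the last SampleCount samples'."""

    def __init__(self, match_count=5, sample_count=20):
        self.match_count = match_count
        # deque with maxlen keeps only the most recent sample_count results.
        self.results = deque(maxlen=sample_count)

    def add(self, matched):
        """Record whether this sample matched; return True if the alert fires."""
        self.results.append(bool(matched))
        return sum(self.results) >= self.match_count


interval_seconds = 30  # probe frequency assumed in the post above
suppressor = Suppressor(match_count=5, sample_count=20)
window_seconds = interval_seconds * suppressor.results.maxlen  # 600 s = 10 min

# Five matching samples in a row: only the fifth one fires.
fired = [suppressor.add(True) for _ in range(5)]
```

How a "match" is defined depends on the counter. If the counter is a running total of failed jobs, a match would probably have to mean "the value went up since the last sample" rather than "the value is above a threshold".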

If you’re serious about doing any true MP Authoring (or what I said above is alien to you), I’d suggest going through the MP Authoring class found in MS Virtual Academy or (same content) here: https://channel9.msdn.com/Series/System-Center-2012-R2-Operations-Manager-Management-Packs?direction=asc

HTH
