SCOM Performance

schoeman · November 28, 2017, 7:37am

Has anyone got some suggestions/advice to help troubleshoot SCOM performance issues? My SCOM database server, SQL Instance, is running constantly between 80% and 90% CPU utilisation. My guess is there might be an MP that is playing up or has been misconfigured. Any ideas in terms of how to find the issue?

aatefsoliman · November 28, 2017, 8:47am

Hi Schoeman,

Is that your data warehouse instance or the main Operations Manager database?

I have encountered the same issue, mainly on my data warehouse instance, and these are the things I noticed:

1- The data warehouse was incredibly big, which slowed down queries against it, especially SquaredUp dashboards showing performance data. After investigating, I changed the grooming settings and disabled all the unneeded event collection rules, because those accounted for the biggest tables in the data warehouse.

Some helpful info here:

https://scompanion.wordpress.com/2014/12/18/datawarehouse-database-cleanup-sql-query/

Here you can find the noisiest rules in the database as well:

https://blogs.technet.microsoft.com/kevinholman/2009/10/04/what-is-config-churn/

2- Troubleshooting a management pack is a tricky one, but one thing you can do is check the most recently installed MPs by installed date, or check the most recently applied overrides.
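As a rough illustration of the "check by installed date" idea: assuming you have exported a list of management packs with their install timestamps, a few lines of Python can list the ones that changed within a recent window, newest first. The field names `name` and `installed`, the 14-day default, and the sample data are all hypothetical; this is not the SCOM schema.

```python
from datetime import datetime, timedelta


def recently_installed(mps, now, days=14):
    """Return names of MPs installed within the last `days` days, newest first.

    `mps` is a list of dicts with hypothetical keys 'name' and 'installed'
    (an ISO 8601 timestamp string). These keys are assumptions for this
    sketch, not real SCOM column names.
    """
    cutoff = now - timedelta(days=days)
    dated = [(datetime.fromisoformat(mp["installed"]), mp["name"]) for mp in mps]
    recent = [item for item in dated if item[0] >= cutoff]
    recent.sort(reverse=True)  # newest install first
    return [name for _, name in recent]


# Made-up sample data, for illustration only.
sample_mps = [
    {"name": "Example.MP.A", "installed": "2017-11-20T09:00:00"},
    {"name": "Example.MP.B", "installed": "2017-06-01T12:30:00"},
    {"name": "Example.MP.C", "installed": "2017-11-25T16:45:00"},
]

suspects = recently_installed(sample_mps, now=datetime(2017, 11, 28))
print(suspects)
```

If the CPU problem started around a known date, narrowing `days` to cover just that period keeps the list of suspects short.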

Jelly · November 28, 2017, 10:24am

I’ve been tinkering with this in our environment for the last couple of weeks, slowly chipping away at what might be causing some general performance issues, as well as some hard crashes, and running some general maintenance that was never set up or performed. My main source of info has been Kevin Holman’s SQL queries (SCOM SQL queries | Microsoft Learn).

- Start with the large table query from the article – this will tell you what is taking the most space, and might hint at what is causing the problem. You can also check the database sizing with the next query.
- Find the noisiest monitors (https://blogs.technet.microsoft.com/kevinholman/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flo) and also determine if there is stale state data (we had 911 days, which is impressive considering we have a 7-day retention).
- Check out the SQL maintenance article (https://blogs.technet.microsoft.com/kevinholman/2017/08/03/what-sql-maintenance-should-i-perform-on-my-scom-2016-databases/) – we had never done a full re-index on the OpsDB and didn’t have maintenance set up. Double check you don’t have your own maintenance running at the same time.
- Check out Squared Up’s tuning DW video (https://youtu.be/NuYm7e5QZfE).
- Tuning. Tuning. Tuning. Use the queries in Kevin Holman’s article to see which rules and monitors are causing the most inserts, state changes, and noise. If a performance rule is rarely used, reduce how often it collects. If an event collection rule is never used, turn it off. The general rule of thumb with alerting and health states in SCOM: Fix it / Tune it / Turn it off.
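Once the queries have produced results, the triage itself is simple arithmetic. Here is a minimal Python sketch, assuming the query output has been reduced to `(name, count)` pairs and that you know the date of the oldest state record. The row shape, function names, and sample figures are assumptions for illustration and do not come from the article.

```python
from datetime import date, timedelta


def rank_by_count(rows, top=10):
    """Sort (name, count) rows by count, highest first, and attach each
    row's share of the overall total as a percentage."""
    total = sum(count for _, count in rows)
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)[:top]
    return [
        (name, count, round(100.0 * count / total, 1) if total else 0.0)
        for name, count in ranked
    ]


def days_past_retention(oldest, today, retention_days):
    """Days by which the oldest record exceeds the retention period
    (0 if it is still within retention)."""
    age_days = (today - oldest).days
    return max(0, age_days - retention_days)


# Made-up counts standing in for the output of a "noisiest rules" query.
noisy = rank_by_count(
    [("Example rule A", 50), ("Example rule B", 150), ("Example rule C", 0)],
    top=2,
)

# The stale-state case from the post: 911 days of data, 7-day retention.
today = date(2017, 11, 28)
overdue = days_past_retention(today - timedelta(days=911), today, 7)
```

A rule that accounts for a large share of all inserts is the obvious first candidate for "Fix it / Tune it / Turn it off", and a non-zero `overdue` value suggests grooming is not keeping up.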

Good luck!

Side note: there is a small note at the end of the article:

The Operations Database needs 50% free space at all times. This is for growth, and for re-index operations to be successful. This is a general supportability recommendation, but the OpsDB will alert when this falls below 40%.
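To make the two thresholds in that note concrete, here is a small Python sketch that classifies a database by its free-space percentage. The 50% and 40% figures come from the note quoted above; the function name, the status labels, and the idea of feeding it size and used space in MB are assumptions for illustration.

```python
def opsdb_free_space_status(size_mb, used_mb):
    """Classify OpsDB free space against the quoted guidance:
    50% free is the recommendation, below 40% an alert is raised."""
    if size_mb <= 0 or not 0 <= used_mb <= size_mb:
        raise ValueError("expected 0 <= used_mb <= size_mb and size_mb > 0")
    free_pct = 100.0 * (size_mb - used_mb) / size_mb
    if free_pct >= 50:
        return "ok"
    if free_pct >= 40:
        return "below recommendation"
    return "alert"


# Hypothetical example: a 100 GB database with 55 GB used has 45% free.
status = opsdb_free_space_status(size_mb=102400, used_mb=56320)
```

The middle band is the useful one in practice: free space has dropped below the recommendation but the alert has not fired yet, so there is still time to grow the file or groom data before a re-index fails.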

schoeman · November 28, 2017, 12:29pm

Hey Jelly, thanks for your reply. I’ll try your suggestions and recommendations to see if they make a difference. I’ll keep you posted.

schoeman · November 28, 2017, 12:31pm

TheCloudMayor, thanks for your reply. It is the OpsManager DB that is getting hammered. I’ll try your suggestions and recommendations to see if they make a difference. Many thanks.

me1 · November 28, 2017, 1:22pm

Great tips. We also created a SquaredUp dashboard with a lot of Kevin’s queries. It quickly lets us see noisy objects over the last 24 hours and the last 7 days, config churn, etc.

Jelly · November 28, 2017, 2:15pm

Dashboards! This is my next step. We’re moving to having multiple MGs for different purposes, with one top MG for monitoring everything. Getting the maintenance in place, alerting for when it fails, monitoring, then dashboards - it’s a whole lotta work, but it saves so much time down the line.