Got a weird one that I can’t find any solid information on. We’re a relatively new environment, one development management group with a few services monitored (20 or so agents so far), production just monitoring itself right now.
So having read Kevin Holman’s health-service restarts article back in the day, we’ve got our two environments overridden to 30K handles for the health service. In our development environment this works perfectly fine and to this day there hasn’t been a single health-service restart. Generally the MonitoringHost.exe process sits at between 2K and 6K handles, with the occasional spike to about 10K once in a blue moon. So in general pretty stable.
Production was also pretty stable until I tried to get development SCOM to monitor the production management group and vice versa, the theory being that if something went wrong with one, the other would be able to tell us about it. Handles on the monitoring hosts for the database servers and web servers went through the roof on both sides, and the health service restarted multiple times a day. I stopped this pretty quickly. The development servers went back to normal; the production ones are still facing this issue (though it has slowed down).
Looking at the process handles using Process Explorer and Handle (Sysinternals), there seem to be a number of unnamed handles of various types that simply don’t exist on the development side, plus tens of thousands of auth tokens. Development seems to have SYSTEM and nothing else, as I would expect. Production seems to have all sorts: mostly SYSTEM, but also things like IIS pool accounts, my own domain account, service accounts, Desktop Window Manager accounts, etc. It just seems to keep hoarding them, ever increasing, until the monitor trips and restarts the service.
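For anyone wanting to compare the two sides the same way, here is a rough Python sketch of how a dump from Sysinternals `handle.exe -a -p <pid>` could be tallied by handle type and by token owner. The line shape it parses (hex handle value, colon, type, optional name, with token names ending in a logon ID) is an assumption based on typical Handle output and may differ between versions; `tally_handles` and the sample accounts are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape in the handle.exe dump (not guaranteed across versions):
#   "  1F4: Token         NT AUTHORITY\SYSTEM:3e7"
HANDLE_LINE = re.compile(r"^\s*([0-9A-Fa-f]+):\s+(\S+)\s*(.*)$")


def tally_handles(text):
    """Count handles per type, and token handles per owning account."""
    by_type = Counter()
    token_owners = Counter()
    for line in text.splitlines():
        match = HANDLE_LINE.match(line)
        if not match:
            continue  # banner, process header, or blank line
        _, handle_type, name = match.groups()
        by_type[handle_type] += 1
        if handle_type == "Token":
            # Drop the trailing ":<logon id>" if there is one.
            owner = name.strip().rsplit(":", 1)[0] or "<unnamed>"
            token_owners[owner] += 1
    return by_type, token_owners


# Hypothetical sample; real input would be the saved handle.exe output.
sample = r"""
   10: Event
  1F4: Token         NT AUTHORITY\SYSTEM:3e7
  1F8: Token         CONTOSO\svc-web:1a2b3
  1FC: Token         NT AUTHORITY\SYSTEM:3e7
"""

by_type, token_owners = tally_handles(sample)
print(by_type.most_common())
print(token_owners.most_common())
```

Running it against a dump from a development agent and one from a production agent should make the difference in token owners obvious at a glance, assuming the output format matches.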
I’ve tried flushing the cache, repairing the agent, pulling it out completely and starting fresh, defragging the healthstore database. None of this seems to help. I’m at a bit of a loss really. Other than starting from scratch is there anything else that would be worth trying?
What you’ve done here is basically ask the agents to run twice the amount of monitoring, therefore increasing the agent utilization. (I say twice figuratively, as I’m assuming you don’t have the same number of machines in each management group (MG), or the same MPs collecting in both.)
I’ve had management groups monitoring other MGs and seen similar results. Where I am now we have five MGs; most don’t monitor each other, but I have a “prod” MG that monitors them all. I see high agent utilization fairly frequently, though I’ve intentionally increased the thresholds as per Kevin Holman’s docs.
One other thing to take into account is that if you are monitoring the management servers from another management group, the management server agents will be seen as normal agents, and can be restarted when the thresholds are breached – this is not good and you’ll need to disable these two monitors for management servers that belong to other management groups, to prevent this:
OK, I get that the handles will be higher in general in this sort of situation, but over the weekend I upped this to 60K handles and they just hit that instead and restart anyway (though the process is slower). Judging by the increase in alert count, this is happening roughly 6 times a day even on the higher threshold. I also created a new VM that does literally nothing but sit there as a test for this scenario, and it too is exhibiting the same problem. (Next step will be to remove it and manage it via dev to see if the same thing happens there now.) Surely this can’t be normal?
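To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the handle count grows linearly between restarts and that the process starts each cycle at about 6K handles (the upper end of what development normally sits at); both are assumptions, not measurements, and the function names are just for illustration.

```python
def handles_per_hour(threshold, baseline, restarts_per_day):
    """Average handle growth per hour, assuming linear growth between restarts."""
    hours_between_restarts = 24 / restarts_per_day
    return (threshold - baseline) / hours_between_restarts


def hours_until_restart(threshold, baseline, rate_per_hour):
    """How long a given threshold would last at a fixed growth rate."""
    return (threshold - baseline) / rate_per_hour


# Figures from the post: 60K threshold, roughly 6 restarts a day.
# The 6K starting point is an assumed baseline.
rate = handles_per_hour(threshold=60_000, baseline=6_000, restarts_per_day=6)
print(f"approx. growth: {rate:.0f} handles/hour")
print(f"30K threshold would last about {hours_until_restart(30_000, 6_000, rate):.1f} hours")
```

If those assumptions are anywhere near right, raising the threshold only buys time between restarts, which fits what I’m seeing.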
I’m pretty close to binning the management group and starting again.
Good point about disabling the monitors for the other management servers, I had missed that one.
Interesting - Have you created any custom MPs? Sounds like there may be some workflows that aren’t terminating once complete. What version+UR of SCOM are you running? Are the agents all up to date? Do you use OMS via management group connect?
Sorry it’s taken so long to get back to you! Been crazy round here recently. Not really to be honest. Only thing along those lines I have done is a custom monitor to try and spot problems with that APM bug in IIS machines. I’ve pulled that out but problems persist. As I’m not monitoring anything properly in that MG I’m going to pull out all the management packs and see if that makes any difference. If it does I’ll add them in slowly to see if I can identify a problematic one. Thanks for all your help so far! I’ve not really had any other help from any other sources so it is much appreciated!