Got a weird one that I can’t find any solid information on. We’re a relatively new environment, one development management group with a few services monitored (20 or so agents so far), production just monitoring itself right now.
So having read Kevin Holman’s health-service restarts article back in the day, we’ve got our two environments overridden to 30K handles for the health service. In our development environment this works perfectly fine, and to this day there hasn’t been a single health-service restart. Generally the monitoringhost.exe executable sits at between 2K and 6K handles, with the occasional spike to about 10K once in a blue moon. So in general pretty stable.
Production was also pretty stable until I tried to get development SCOM to monitor the production management group and vice versa, the theory being that if something went wrong with one, the other would be able to tell us about it. Handles on the monitoring hosts for the database servers and web servers went through the roof on both sides, with health service restarts multiple times a day. I stopped this pretty quickly. The development servers went back to normal, but the production ones are still facing this issue (though it has slowed down).
Looking at the process handles using Process Explorer and Handle (Sysinternals), production seems to have a number of unnamed handles of various types that simply don’t exist on the development side, plus tens of thousands of auth tokens. Development seems to have system tokens and nothing else, as I would expect. Production seems to have all sorts: mostly system, but also things like IIS Pool accounts, my own domain account, service accounts, Desktop Window Manager accounts, etc. It just seems to keep hoarding them, ever increasing, until the monitor trips and restarts the service.
I’ve tried flushing the cache, repairing the agent, pulling it out completely and starting fresh, defragging the healthstore database. None of this seems to help. I’m at a bit of a loss really. Other than starting from scratch is there anything else that would be worth trying?
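For anyone wanting to put a number on the "ever increasing" handle count described above, here is a minimal sketch. It assumes you have already collected (hours elapsed, handle count) samples for monitoringhost.exe by whatever means you prefer. The function names and the sample figures are made up for illustration, and the 30,000 default simply mirrors the override mentioned in the post. It fits a straight line to the samples, which is only a rough guide if the growth is bursty.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, int]  # (hours since first sample, handle count)


def handle_growth_per_hour(samples: List[Sample]) -> float:
    """Least-squares slope of handle count over time, in handles per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    return cov / var_t


def hours_until_threshold(samples: List[Sample],
                          threshold: int = 30_000) -> Optional[float]:
    """Estimate hours until the handle count reaches the threshold.

    Returns 0.0 if the latest sample is already at or over the threshold,
    and None if the count is flat or falling (no breach predicted).
    """
    latest = samples[-1][1]
    if latest >= threshold:
        return 0.0
    slope = handle_growth_per_hour(samples)
    if slope <= 0:
        return None
    return (threshold - latest) / slope


if __name__ == "__main__":
    # Hypothetical samples: a steadily climbing host and a stable one.
    leaking = [(0.0, 5000), (1.0, 10000), (2.0, 15000)]
    stable = [(0.0, 4000), (1.0, 4000), (2.0, 4000)]
    print(hours_until_threshold(leaking))
    print(hours_until_threshold(stable))
```

A steady positive slope on production with a flat line on development would at least confirm that the difference is a leak rather than normal load.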
This is somewhat expected behavior.
What you’ve done here is basically ask the agents to run twice the amount of monitoring (figuratively, as I’m assuming you don’t have the same number of machines in each management group (MG), or the same MPs collecting in both), which increases agent utilization.
I’ve had management groups monitoring other MGs and seen similar results. Where I am now we have five MGs. Most don’t monitor each other, but I have a “prod” MG that monitors them all. I see high agent utilization fairly frequently, though I’ve intentionally increased the thresholds as per Kevin Holman’s docs.
One other thing to take into account is that if you are monitoring the management servers from another management group, the management server agents will be seen as normal agents, and can be restarted when the thresholds are breached. This is not good, and you’ll need to disable these two monitors for management servers that belong to other management groups, to prevent this:
Turns out it was the 10.0.17.0 version of the Server 2016 Management pack!
Updating it to 10.0.21.0 has resolved the issue (and is a listed fix in the release notes).
Quite why this was affecting one MG and not the other is still beyond my understanding!