MonitoringHost.exe crashing loop, a cry for help.

Hi All,

I have a weird problem that I can’t find a solution for.
We have several SCOM environments that we manage, covering all versions of SCOM.
A week ago, we started getting an alert from a handful of servers: “System Center Management Health Service Unloaded System Rule(s)”.
Normally when I get this error, I restart the agent or flush the health service cache; this has always fixed the problem in the past.

But now we have 5 servers in our primary Management Group (1800 agents) with this error. They all failed at about the same time, and the normal fix does not work.

After a deeper investigation, the OpsMgr event log reveals that the MonitoringHost process is stuck in a loop: crashing, starting up, crashing again, etc…
The error is:

A monitoring host is unresponsive or has crashed. The status code for the host failure was 2164195371.
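
Side note: the decimal status code is usually easier to search for in hexadecimal form. Below is a small Python sketch that does the conversion and splits the value into fields. It is not part of any SCOM tooling, the function names are made up for illustration, and it assumes the status code follows the standard HRESULT bit layout, which the event text itself does not confirm.

```python
def to_hex(status: int) -> str:
    """Format an unsigned 32-bit status code as 0xXXXXXXXX."""
    return f"0x{status & 0xFFFFFFFF:08X}"


def split_hresult(status: int) -> dict:
    """Split a 32-bit value using the standard HRESULT bit layout.

    Assumption: the status code really is an HRESULT
    (bit 31 = severity, bits 16-26 = facility, bits 0-15 = code).
    """
    status &= 0xFFFFFFFF
    return {
        "failure": bool(status >> 31),
        "facility": (status >> 16) & 0x7FF,
        "code": status & 0xFFFF,
    }


if __name__ == "__main__":
    status = 2164195371  # value from the OpsMgr event above
    print(to_hex(status))
    print(split_hresult(status))
```

For the value in the event, this gives 0x80FF002B, which may return more search hits than the decimal number.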

The errors in the application log reveal a problem with reading from or writing to memory (all servers are virtual):

Application: MonitoringHost.exe Framework Version: v4.0.30319 Description: The application requested process termination through System.Environment.FailFast(string message). Message: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. Stack: at System.Environment.FailFast(System.String)

What we have tried so far:
Updated an agent from 2012 R2 to the latest 2019. No difference.
Removed the agent from the Management Group, uninstalled the agent software, reinstalled it and added the agent back to the Management Group. No difference.
Removed the agent from the Management Group (2012 R2) and added it to another Management Group (2019). After this, the agent worked as intended.
Returning the agent to the original Management Group resulted in the agent crashing again.
Looked at the number of handles; it never exceeds 25000 on the server.

The one thing we can see that all 5 servers have in common is that they are running Windows Server 2012 R2 (but we have about 200 of these monitored in this management group).
We suspected that a Windows update might have caused this, but all our servers are running the same patch level.

Now a customer is beginning to experience the same behavior on 1 agent (SCOM 1807 Management Group).

No management packs have been updated in recent weeks.

Do you guys have any idea what to try next? Otherwise the next step will be to contact Microsoft Support, and I am guessing their first comment will be to upgrade to SCOM 2019 :slight_smile:

Cheers,

Most of the unusual agent issues I’ve had were triggered by the anti-virus in some way, so perhaps test that if you are able. Ideally turn it off to know for sure, although some companies won’t allow that :slight_smile:

“Remove the agent from the Management Group (2012R2), and add it to another Management Group (2019). This resulted in the agent worked as intended. Returning the agent to the correct Management Group, resulted in the agent crashing again.”
I would say that you have one or more workflows on the production MG that are crashing the agent. I would try to get the list of workflows running on that agent in that MG. It would be best to compare it with the list of workflows running on the other MG, and then you can start your investigation.
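
The comparison itself is simple once you have both lists. Here is a rough Python sketch, assuming each list has been exported as plain workflow names, one per entry (how you export them is up to you; the function name and the sample workflow names below are invented purely for illustration):

```python
def workflow_diff(production, reference):
    """Compare two collections of workflow names.

    Names are stripped of surrounding whitespace and blank
    entries are ignored, so lines read from a text file work too.
    """
    prod = {name.strip() for name in production if name.strip()}
    ref = {name.strip() for name in reference if name.strip()}
    return {
        "only_in_production": sorted(prod - ref),
        "only_in_reference": sorted(ref - prod),
        "in_both": sorted(prod & ref),
    }


if __name__ == "__main__":
    # Hypothetical workflow names, not taken from any real management pack.
    production_mg = ["Example.Rule.A", "Example.Monitor.B", "Example.Discovery.C"]
    other_mg = ["Example.Rule.A", "Example.Monitor.B"]

    result = workflow_diff(production_mg, other_mg)
    for name in result["only_in_production"]:
        print("only in production MG:", name)
```

The workflows that only exist on the production MG are the first candidates to look at, since the agent runs fine without them on the other MG.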

Also, if you don’t use APM, don’t install it! https://kevinholman.com/2017/08/05/reinstalling-your-scom-agents-with-the-noapm-switch/

Hi Nicole, I tried completely removing the installed antivirus, and sadly it had no effect. Thanks for the idea though :slight_smile:

Hi Pascal,
My suspicion is also on one or more of the workflows.
We don’t use APM anywhere, and it is not installed on any of the monitored servers.
I will have to look through all the workflows activated for the agent (500+). Now I just have to find the time to do this :slight_smile: I will get back to you when I have checked the workflows.

You may want to simply open a call with Microsoft, they may be able to filter this out faster…