I have 4 SCOM groups, as shown below, and all 4 are sub-groups of “Group – Citrix Public Environment”. Health rollups have been configured using Tao Yang’s “OpsMgr 2012 Self Maintenance”.
Every now and then this group “Group – Citrix Public Environment” goes into a “Not Monitored” state. See Fig 1.
I then have to put this group in Maintenance Mode for 5 minutes, after which it all shows green again. See Fig 2.
Could someone please explain?
How long did you wait after running the script? I have often found that it can take 10 - 15 minutes for the health state to change the first time.
It is as if the health state is only calculated when an event that updates it is triggered: either a check on one of the items in the group (such as a disk check every 15 minutes), or putting the group into maintenance mode, since coming back out also causes a health state update.
Try running the script and then going for lunch; the health state should be all good by the time you are back.
If the problem is recurring, have you tried flushing the health service state and cache on your monitor? If this doesn’t help, try repairing the monitor. Looking at the error events on the object will give you a better clue as to why it’s not monitored.
Hi Davey - These groups/monitors have been in place for 2 months now. I’m aware of the health state not displaying immediately as you have to wait for the agent to report back on its health state.
Hi SCOMTheRipper - I have tried flushing the Health State by stopping the “Microsoft Monitoring Agent” service, renaming the “Health Service State” folder to “Health Service State.old”, then starting the “Microsoft Monitoring Agent” service again. The rollup monitor then displays fine for a few weeks, but after that it goes back to displaying “Not Monitored” again.
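For anyone who has to repeat this flush regularly, the stop / rename / start sequence described above can be scripted. This is only a minimal sketch, not a tested procedure: the service short name `HealthService` and the folder path are assumptions based on a default agent install, so check both on your own servers. The function name `flush_health_state` and the timestamped backup suffix are made up for illustration.

```python
import subprocess
import time
from pathlib import Path

# Assumption: short service name behind "Microsoft Monitoring Agent".
SERVICE_NAME = "HealthService"
# Assumption: default agent install location; adjust for your environment.
STATE_DIR = Path(r"C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State")


def flush_health_state(state_dir, service=SERVICE_NAME, run=subprocess.run, stamp=None):
    """Stop the agent service, move the state folder aside, start the service.

    Returns the path the folder was renamed to, or None if it did not exist.
    `run` is injectable so the sequence can be dry-run without touching services.
    """
    state_dir = Path(state_dir)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    # A timestamped suffix avoids clashing with an earlier ".old" folder.
    target = state_dir.with_name(f"{state_dir.name}.old-{stamp}")

    run(["net", "stop", service], check=True)
    moved = None
    try:
        if state_dir.exists():
            state_dir.rename(target)
            moved = target
    finally:
        # Always try to bring the agent back, even if the rename failed.
        run(["net", "start", service], check=True)
    return moved
```

On an affected server this would be called as `flush_health_state(STATE_DIR)` from an elevated prompt. The agent should recreate the state folder when the service starts, and the renamed copy can be deleted once the health state looks right again.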
I have tried to repair the monitor but it doesn’t seem to have an impact.
The frustrating thing is that the actual Agents are showing as healthy…
The servers that are monitored sit on a DMZ with a SCOM Gateway Server if that helps.
Are there any error events around the time that it appears unmonitored?
Also, do you have the right management packs downloaded for your object’s operating system? It seems this solved the same issue for a lot of people, e.g. on Server 2012 (apologies if you’ve already checked this).
I think this might be related to the fact that I have updated to “UR 9” while the agents for that DMZ are still on “UR 8”. I’ll let you know tomorrow.
We get this a lot across several platforms. Flushing the health cache to force a health recalculation is time consuming, as is maintenance mode. We also tend to have to schedule the maintenance mode, because only the “all contained objects” option will clear the erroneous health state, and that places all the contained servers and production applications into maintenance mode as well.
In one example we have eight management servers and we go through flushing the health cache on each until health is recalculated.
Health calculation is carried out by one of the management servers in the management group. Which one is intentionally not fixed, as it depends on which server is currently hosting the role, and that isn’t something you can find out easily.
I would be very intrigued to see if anyone has a solution for this problem.
Hi Schoeman - did you manage to find out what was causing this issue? Keep us posted if you find an answer!