Today, while investigating another issue, I noticed an agent logging connectivity issues to one of our management servers. It turned out that 4 MS were having trouble accepting connections on port 5723. A telnet connection was refused as well, and agents were complaining a lot.
The OpsMgr Connector could not connect to MANAGEMENTSRV:5723. The error code is 10061L(No connection could be made because the target machine actively refused it.).
OpsMgr was unable to set up a communications channel to MANAGEMENTSRV. Communication will resume when MANAGEMENTSRV is available and communication from this computer is allowed.
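The manual telnet test can be scripted so that all management servers are checked in one go. This is only a minimal sketch in Python: the function names `can_connect` and `check_servers` are made up for illustration, and the host names in the usage note are placeholders, not our real servers. It checks only that the TCP port accepts a connection, nothing about SCOM itself.

```python
import socket


def can_connect(host: str, port: int = 5723, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers an actively refused connection (error 10061 on Windows),
        # timeouts, and name-resolution failures.
        return False


def check_servers(hosts, port: int = 5723) -> dict:
    """Map each host name to True/False depending on whether the port answers."""
    return {host: can_connect(host, port) for host in hosts}
```

Calling something like `check_servers(["MANAGEMENTSRV1", "MANAGEMENTSRV2"])` (placeholder names) returns a dictionary of host name to `True`/`False`, which makes it easy to spot which management servers are refusing connections on 5723.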
Rebooting the affected management servers seems to resolve the issue. Using Splunk I was able to find out when it happened: during a network outage. But why? Communication to SQL etc. was just fine. Agents failed over to the two MS still in operation.
SquaredUp graph showing avg. batch per sec for a server that is working and one that is not…
SCOM accepts new data from agents, and queues it – there’s a specific workflow that accepts that data, processes it, and then stores it to the database. There are other workflows running at the same time – group calculation, timer, data sync between Operations database and Data Warehouse.
I’ve seen times when the management server is overloaded (either by memory usage or CPU) where the workflow processing new data stalls out, and the management server stops accepting any new connections. I imagine what happened in your case is that the workflow writing to the database got backed up, and the workflows stalled out, and never started back up again. Usually restarting the server is the way to get things “unstuck”… is that what happened in your case?