I am currently monitoring 15 platforms, 1 Linux which is the HQ Server, 12 Windows 2003, and 2 Windows 2000 Server agents. I have when availability changes alerts configured for the Windows platforms.
This has been working well for about two months, but over the last week or so, it has been behaving strangely. It will fire an alert that every platform has changed availability to 0. Then, a minute or two later, it will fire an alert for every platform that availability has changed to 1. When I look in the Dashboard, all platforms show 100% availability as the low, average, and max for the time in question.
Upon on the initial occurrence, I thought it might have been a momentary network blip, but I have one of the Windows agents doing InetAddress Ping on all switches and routers. I did notice 2 things this morning in the HQ Health plugin:
1) There was a clock drift between the agent and server. I do not know if this has always existed or not, but I set the HQ server's OS to NTP from one of the agents that is a domain controller. This has reduced the offset significantly.
2) Several of the items on the Cache tab, including AvailabilitySummary are listed in red. The details for the AvailabilitySummary are Size:236, Hits:39437, and Misses:298.
Does anyone have any ideas why this might be occurring? Getting paged multiple times last night at around 11:30pm, 1:15am, 2:45am, 4:15am, and 5:45am was not exactly a fun experience. Thanks.