House of Brick Principal Architect
In our world, monitoring an IT infrastructure is defined as having accurate, up-to-date knowledge of the current state and health of critical servers. Sending alerts based on monitoring metrics means that administrators are notified when something is unhealthy and that they can (and should) take corrective measures. However, people sometimes blur the lines and alert on items that do not warrant an alert. Conversely, overlooking critical issues and not alerting on them can destroy a business.
When we set up monitoring on business critical systems (specifically VMware, Windows, and SQL Server), many key metrics are monitored. Occasionally we store them for historical purposes. These items include host and guest CPU, memory and disk activity, disk used and capacity, and event and error tracking. Sometimes we will run in-depth SQL Server performance data collection and collect items such as Page Life Expectancy, Buffer Cache Hit Ratio, recompilations, data or log file autogrows, tempdb usage, and other important items to profile and trend.
Most of these items are initially thought to be important enough to set an alert on. However, is there really anything you should do about your CPU running higher than average for five minutes? What if your Page Life Expectancy falls below some predetermined threshold for ten minutes during a backup? What if you miss an alert that you care about because other alerts were bulk deleted and this alert was in the middle of the list?
At House of Brick, we have two levels of alerting: warning and critical. A warning causes a notification to go out, but sometimes it is more informational. Other times the notification is something to be investigated during the next business day. Either way, these items are not critical and administrators should not be alerted in the middle of the night. Critical alerts, on the other hand, are very important. These alerts are for items that simply cannot wait. Even more important is that the alerts are delivered in a timely manner. If a business critical server crashes, the business demands that it get resolved as soon as possible. The following items jump to mind as examples for critical alerting:
SQL Server errors with severity levels 17-25
Core services are stopped
Database integrity problems
Scheduled job failures
Disks approaching full in the operating system or datastores approaching full in VMware
Host memory ballooning that is not resolving itself
Active Directory domain controllers failing to replicate to other DCs
And the obvious server or device failure
Warning alerts are less critical. Examples of these conditions are:
A sample benchmark query is running a bit slower than average, but still returns a recordset
SAN to SAN replication is backed up but still transmitting
VMware host memory utilization is over 80% but below 90%
A virtual disk is approaching 80% full
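The two alert levels can be sketched as a small threshold classifier. This is only an illustration, not House of Brick's actual tooling: the names `Severity`, `classify_utilization`, and `should_page` are hypothetical, the 80% warning threshold comes from the examples above, and treating 90% and higher as critical is an assumption.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify_utilization(percent_used: float,
                         warning_at: float = 80.0,
                         critical_at: float = 90.0) -> Severity:
    """Map a utilization percentage to OK, WARNING, or CRITICAL.

    warning_at mirrors the 80% examples in the text; critical_at is an
    assumed value and should be tuned per resource type.
    """
    if not 0.0 <= percent_used <= 100.0:
        raise ValueError(f"percent_used out of range: {percent_used}")
    if percent_used >= critical_at:
        return Severity.CRITICAL
    if percent_used > warning_at:
        return Severity.WARNING
    return Severity.OK


def should_page(severity: Severity) -> bool:
    """Only critical alerts should interrupt someone in the middle of the night."""
    return severity is Severity.CRITICAL
```

With these assumed defaults, a host at 85% memory utilization produces a warning that can wait for the next business day, while the same host at 95% would page an administrator.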
Other alerting is subtler. What if a scheduled job runs long? What if inbound database connections reach ten times the average? Subtler still: What if your primary data-loading scheduled job takes twice as long today as it did six months ago? What if database growth means the organization will completely run out of available space in four months? How do you detect these sorts of trends? How do you classify them?
Monitoring and alerting should be well thought out before going live. A full list of devices, performance metrics, error states, and scenarios for your specific environment should be developed and then analyzed. Alerts should be configured by resource type so that the critical alerts are appropriate for the underlying cause. Every critical alert should be an item that you would accept being woken up in the middle of the night or interrupted on a weekend to handle, so that the problem causes the least disruption to the business. It should be something you can actively fix. Do not send out critical alerts on items that the administrators cannot take action on immediately.
Monitoring your environment also means that you proactively review the environment for baseline changes. Are things running slower today than they were six months ago? If so, can you quantify specifically how much of a difference exists? Have you automated the analysis and had the reports delivered automatically so that you do not skip the routine checks?
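Trend questions such as "when will we run out of space?" and "how much slower is this job than it was six months ago?" can be quantified from stored history. The following is a minimal sketch, assuming one disk-usage sample per day and job durations recorded in minutes. The function names `days_until_full` and `slowdown_ratio` are hypothetical, and a straight-line fit is an assumed simplification, since real growth is often not linear.

```python
from statistics import mean
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Project days until capacity is reached from daily usage samples.

    Fits a least-squares line through (day index, used GB). Returns None
    when usage is flat or shrinking, because no exhaustion date exists.
    """
    if len(used_gb) < 2:
        raise ValueError("need at least two daily samples")
    days = range(len(used_gb))
    day_mean = mean(days)
    used_mean = mean(used_gb)
    slope_num = sum((d - day_mean) * (u - used_mean) for d, u in zip(days, used_gb))
    slope_den = sum((d - day_mean) ** 2 for d in days)
    growth_per_day = slope_num / slope_den
    if growth_per_day <= 0:
        return None
    remaining_gb = max(capacity_gb - used_gb[-1], 0.0)
    return remaining_gb / growth_per_day


def slowdown_ratio(baseline_minutes: Sequence[float],
                   recent_minutes: Sequence[float]) -> float:
    """Ratio of the recent average job duration to the baseline average."""
    return mean(recent_minutes) / mean(baseline_minutes)
```

A scheduled report could call these against the stored history and raise a warning when, for example, the projection falls under an agreed number of days or the ratio approaches 2.0. Those trigger values are illustrative and should come from your own baseline review.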
You should be monitoring your environment and capturing performance statistics. You should be alerting on items that are of high importance. You should be proactively protecting your business.