Recording the SQL Server System State

House of Brick Principal Architect

“Your database is running slow, and you need to drop every other emergency and fix it right now!”

Do you get nailed with exasperated comments such as this without warning? Don’t you ever wish the person would say “I inserted a billion records into table dbo.XYZ and now the nonclustered index dbo.IX_XYZ_MakeReportingFaster needs to be rebuilt because it’s slowing down reporting”?

That’s always fun (not). Not only are folks in the organization in a panic, but you are left with no useful information to help you start the triage process except for a particular server is running an arbitrary “slow” value. You must start from ground zero on the server in question and work your way into the problem and eventual solution, and that’s time consuming.

What if this same user comes to you tomorrow and says “Sorry about the emergency yesterday… but here’s another question. I have not budgeted for anything for your team for the next couple of years, but when are we going to run out of space on your database servers? I just thought of that today…”

Do you have the system in place that could just pop out the answer?

You should consider automating a system that can help automate some of the collection of common runtime metrics and system states so that you can help yourself determine what is out of the norm during an emergency, as well as provide for a long-term capacity management baseline for all key metrics on your servers.

Recently, I was reading one of Chris Shaw’s (B, L, T) fantastic chapters on the utility database from the new book Pro SQL Server 2012 Practices and thought that I would share some of my practices that can help backup some of his recommendations with some more practical examples.

Checklists and Data Collection

First, reference a previous blog post of mine where I outline all of the usual tasks that I perform during normal day-to-day maintenance of these servers. Now, I’ll show you how to create a system that can assist in automating these processes!

Let’s take a pretty routine task – checking database file sizes.

To keep things simple, I am adapting query number 17 from Glenn Berry’s SQL Server 2012 diagnostic queries, January 2013. You can see more fantastic queries for these types of purposes in those scripts.

Create a placeholder for the data.

CREATE TABLE [dbo].[FileSize](

[ServerName] [nvarchar](128) NULL,

[DatabaseName] [nvarchar](128) NULL,

[FileID] [int] NOT NULL,

[FileName] [sysname] NOT NULL,

[PhysicalName] [nvarchar](260) NOT NULL,

[TypeDesc] [nvarchar](60) NULL,

[StateDesc] [nvarchar](60) NULL,

[MB] [bigint] NULL,

[SampleDT] [datetime] NOT NULL

) ON [PRIMARY]

To start populating this table, you could create a job to periodically execute the following query and store the results. I normally sample database file sizes once a week unless that environment has a high number of transactions and frequent file growth.

 

-- Adapted from Glenn Berry SQL Server 2012 Diagnostic Queries, January 2013

-- SQLServerPerformance.wordpress.com

-- File Names and Paths for TempDB and all user databases in instance (Query 17)

INSERT INTO dbo.FileSize

SELECT

@@ServerName as ServerName,

DB_NAME([database_id])AS [DatabaseName],

[file_id] as FileID,

name as FileName,

physical_name as PhysicalName,

type_desc as TypeDesc,

state_desc as StateDesc,

CONVERT( bigint, size/128.0) AS MB,

GETDATE() as SampleDT

FROM sys.master_files WITH (NOLOCK)

WHERE [database_id] > 4

AND [database_id] 32767

OR [database_id] = 2

ORDER BY DB_NAME([database_id]) OPTION (RECOMPILE);

 

The following query can present to you the previous month’s samples, and calculate the amount of growth per file over that month.

with cteFileSizes

(ServerName, DatabaseName, FileID, FileName, SampleDT, MB, WeeksPrevious)

as (

select

ServerName, DatabaseName, FileID, FileName, SampleDT, MB,

DENSE_RANK() OVER (ORDER by SampleDT DESC) 'WeeksPrevious'

from

dbo.FileSize

)

select

w1.ServerName, w1.DatabaseName, w1.FileID, w1.FileName,

w1.MB - w5.MB as OneMonthGrowthMB,

w1.MB as Week0MB, w2.MB as WeekMinus1MB, w3.MB as WeekMinus2MB,

w4.MB as WeekMinus3MB, w5.MB as WeekMinus4MB,

w1.SampleDT as Week0DT, w2.SampleDT as WeekMinus1DT,

w3.SampleDT as WeekMinus2DT,

w4.SampleDT as WeekMinus3DT, w5.SampleDT as WeekMinus4DT

from

cteFileSizes w1

left join cteFileSizes w2 on w1.ServerName = w2.ServerName

and w1.DatabaseName = w2.DatabaseName and w1.FileID = w2.FileID

left join cteFileSizes w3 on w1.ServerName = w3.ServerName

and w1.DatabaseName = w3.DatabaseName and w1.FileID = w3.FileID

left join cteFileSizes w4 on w1.ServerName = w4.ServerName

and w1.DatabaseName = w4.DatabaseName and w1.FileID = w4.FileID

left join cteFileSizes w5 on w1.ServerName = w5.ServerName

and w1.DatabaseName = w5.DatabaseName and w1.FileID = w5.FileID

where

w1.WeeksPrevious = 1

and w2.WeeksPrevious = 2

and w3.WeeksPrevious = 3

and w4.WeeksPrevious = 4

and w5.WeeksPrevious = 5

order by

ServerName, DatabaseName, FileID

 

Voila! You now have a quick report that you can whip up in SSRS, schedule it to automatically deliver, and include it in your weekly routine. The sky is the limit with the items that you can record and analyze. I always say more data is better than less, so monitor and record anything you can possibly consider valuable.

Perfmon

Next, every server that I manage has Perfmon running in the background at all times, constantly sampling performance data. Perfmon data is always good to have, because not all system activity comes from SQL Server. Other items, such as system backups, antivirus scans, and other programs running in the background can have a negative effect on performance. Understanding what a system was doing at the time of a problem is one of the crucial components that can make or break a triage investigation.

Here’s an example of this in action. A while back I was at a customer site and they were having random SQL Server database mirror failovers in the middle of the night on a couple of older database systems. We did not have much to go on with the logging available at the time. I set up Perfmon to capture data every five minutes. A few days later, we experienced an unplanned failover. Sifting through the Perfmon data, we discovered that disk write activity onto RAID-5 set of local disks were high an hour before the event, and then went through the roof a few moments before the unplanned failover. We were then able to correlate the activity to a couple of mis-timed jobs. A database-level backup was overlapping into the sporadic runtimes of a system-level backup, and the onboard SAS controller was being periodically overwhelmed and went unresponsive while the cache flushed to disk.

We set up Perfmon to collect a number of important counters every five minutes, and store the data to a log file that is date-stamped and rotated every night. The counters that we usually start with include the following items.

Object Name

Counter Name

PhysicalDisk

Current Disk Queue Length

PhysicalDisk

Average Disk Read Queue Length

PhysicalDisk

Average Disk Write Queue Length

PhysicalDisk

Average Disk sec/Read

PhysicalDisk

Average Disk sec/Write

PhysicalDisk

Average Disk Bytes/Read

PhysicalDisk

Average Disk Bytes/Write

PhysicalDisk

Disk Read Bytes / sec

PhysicalDisk

Disk Write Bytes / sec

Memory

Page Faults/sec

Memory

Pages / sec

Memory

Available Mbytes

Paging File

% Usage

Processor

%User Time

Processor

%Privileged Time

Processor

%Processor Time

Processor

Interrupts / sec

System

Processor Queue Length

SQLServer:Access Methods

Forwarded Records / sec

SQLServer:Access Methods

Full scans / sec

SQLServer:Access Methods

Page splits / sec

SQLServer:Memory Manager

Memory Grants Pending

SQLServer:Buffer Manager

Buffer Cache Hit Ratio

SQLServer:Buffer Manager

Checkpoints / sec

SQLServer:Buffer Manager

Lazy Writes / sec

SQLServer:Buffer Manager

Page Life Expenctancy

SQLServer:Buffer Manager

Readahead pages / sec

SQLServer:Databases

Transactions/sec

SQLServer:General Statistics

User Connections

SQLServer:Latches

Average Latch wait Time

SQLServer:Locks

Average Wait Time (ms)

SQLServer:Locks

Lock Wait time (ms)

SQLServer:Locks

Lock waits / sec

SQLServer:SQL Statistics

SQL Compilations / sec

SQLServer:SQL Statistics

SQL Re-Compilations / sec

SQLServer:SQL Statistics

Batch Requests / sec

You want to remember to collect all instances of these items, not the cumulative rollups that can wash out key information.

Infrastructure Statistics

Finally, I prefer to record all statistics underneath the SQL Servers and the operating systems. This includes items like SAN performance, VMware or Hyper-V performance statistics for both the VM itself and the hosts it resides on, and even down to networking activity.

Why would I suggest these items? Ponder this situation. Your organization has a SAN with 50 servers connected. A system administrator misconfigures an antivirus scan setting and accidentally triggers a full scan on all 50 of those servers at the same time.

Your SQL Servers now grind to a crawl. Someone runs into your office and demands that you investigate why the database server is performance poorly (it’s always the DBA’s fault, right?). The SQL Server and Windows performance data stats now record a burst of suddenly high disk latency and reduced throughput. You look at the vCenter statistics and find that most of the virtual machines are strangely maxed out on CPU utilization. Outside of those metrics, you do not know what is occurring, but you now have a great set of metrics to go to the storage group and ask them to investigate further.

Quite frequently, DBAs just do not have access to these items. However, you can task the different administrators for these systems to setup automatic reports to be routinely delivered to you. Find a way to get these statistics, because after all, data is the most important part of the business, and you should be aware of how the infrastructure is performing.

Wrap Up

Maintain your SLAs by baselining the environment and being proactive on resource contention. Use this information to help triage pain points in the infrastructure, and ensure that your systems are running at their peak performance!

Table of Contents

Related Posts