Recording the SQL Server System State

House of Brick Principal Architect

“Your database is running slow, and you need to drop every other emergency and fix it right now!”

Do you get nailed with exasperated comments such as this without warning? Don’t you ever wish the person would say “I inserted a billion records into table dbo.XYZ and now the nonclustered index dbo.IX_XYZ_MakeReportingFaster needs to be rebuilt because it’s slowing down reporting”?

That’s always fun (not). Not only are folks in the organization in a panic, but you are left with no useful information to start the triage process, except that a particular server is running at some arbitrary level of “slow.” You must start from ground zero on the server in question and work your way into the problem and the eventual solution, and that’s time-consuming.

What if this same user comes to you tomorrow and says “Sorry about the emergency yesterday… but here’s another question. I have not budgeted for anything for your team for the next couple of years, but when are we going to run out of space on your database servers? I just thought of that today…”

Do you have a system in place that could just pop out the answer?

Consider building a system that automates the collection of common runtime metrics and system states. It helps you determine what is out of the norm during an emergency, and it provides a long-term capacity management baseline for the key metrics on your servers.

Recently, I was reading one of Chris Shaw’s fantastic chapters on the utility database in the new book Pro SQL Server 2012 Practices, and thought that I would share some of my own practices to back up his recommendations with a few more practical examples.

Checklists and Data Collection

First, see a previous blog post of mine, where I outline the usual tasks that I perform during normal day-to-day maintenance of these servers. Now, I’ll show you how to create a system that can help automate these processes!

Let’s take a pretty routine task – checking database file sizes.

To keep things simple, I am adapting query number 17 from Glenn Berry’s SQL Server 2012 diagnostic queries, January 2013. You can see more fantastic queries for these types of purposes in those scripts.

Create a placeholder for the data.

CREATE TABLE [dbo].[FileSize](
    [ServerName] [nvarchar](128) NULL,
    [DatabaseName] [nvarchar](128) NULL,
    [FileID] [int] NOT NULL,
    [FileName] [sysname] NOT NULL,
    [PhysicalName] [nvarchar](260) NOT NULL,
    [TypeDesc] [nvarchar](60) NULL,
    [StateDesc] [nvarchar](60) NULL,
    [MB] [bigint] NULL,
    [SampleDT] [datetime] NOT NULL
) ON [PRIMARY]

To start populating this table, you could create a job to periodically execute the following query and store the results. I normally sample database file sizes once a week unless that environment has a high number of transactions and frequent file growth.

-- Adapted from Glenn Berry SQL Server 2012 Diagnostic Queries, January 2013
-- SQLServerPerformance.wordpress.com
-- File Names and Paths for TempDB and all user databases in instance (Query 17)
INSERT INTO dbo.FileSize
    (ServerName, DatabaseName, FileID, FileName, PhysicalName,
     TypeDesc, StateDesc, MB, SampleDT)
SELECT
    @@SERVERNAME AS ServerName,
    DB_NAME([database_id]) AS [DatabaseName],
    [file_id] AS FileID,
    name AS FileName,
    physical_name AS PhysicalName,
    type_desc AS TypeDesc,
    state_desc AS StateDesc,
    CONVERT(bigint, size / 128.0) AS MB,  -- size is in 8 KB pages; 128 pages = 1 MB
    GETDATE() AS SampleDT
FROM sys.master_files WITH (NOLOCK)
WHERE ([database_id] > 4            -- skip the system databases
       AND [database_id] <> 32767)  -- skip the resource database
   OR [database_id] = 2             -- but keep tempdb
ORDER BY DB_NAME([database_id]) OPTION (RECOMPILE);
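One way to run the collection query above on a weekly schedule is a SQL Server Agent job. The following is a minimal sketch, not a definitive setup. It uses the standard msdb job procedures, but the job name, the schedule name, the DBAUtility database, and the dbo.usp_CollectFileSize wrapper procedure are all hypothetical names chosen for illustration; it assumes you have wrapped the INSERT statement in a stored procedure of your own.

```sql
-- Sketch only: job, schedule, database, and procedure names are hypothetical.
USE msdb;
GO

EXEC dbo.sp_add_job
    @job_name = N'Collect File Sizes';

EXEC dbo.sp_add_jobstep
    @job_name      = N'Collect File Sizes',
    @step_name     = N'Insert file size sample',
    @subsystem     = N'TSQL',
    @database_name = N'DBAUtility',                     -- assumed database holding dbo.FileSize
    @command       = N'EXEC dbo.usp_CollectFileSize;';  -- assumed wrapper around the INSERT query

EXEC dbo.sp_add_schedule
    @schedule_name          = N'Weekly - Sunday 01:00',
    @freq_type              = 8,      -- weekly
    @freq_interval          = 1,      -- Sunday
    @freq_recurrence_factor = 1,      -- every week
    @active_start_time      = 10000;  -- 01:00:00, encoded as HHMMSS

EXEC dbo.sp_attach_schedule
    @job_name      = N'Collect File Sizes',
    @schedule_name = N'Weekly - Sunday 01:00';

EXEC dbo.sp_add_jobserver
    @job_name = N'Collect File Sizes';  -- targets the local server by default
```

For busier environments with frequent file growth, the same job could be given a daily schedule instead (@freq_type = 4 with @freq_interval = 1).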

The following query returns the five most recent weekly samples side by side and calculates how much each file grew over that month.

WITH cteFileSizes
    (ServerName, DatabaseName, FileID, FileName, SampleDT, MB, WeeksPrevious)
AS (
    -- Rank each server's sample runs from newest (1) to oldest.
    -- Partitioning by server keeps the ranks aligned when several
    -- servers report into the same table at slightly different times.
    SELECT
        ServerName, DatabaseName, FileID, FileName, SampleDT, MB,
        DENSE_RANK() OVER (PARTITION BY ServerName ORDER BY SampleDT DESC)
    FROM dbo.FileSize
)
SELECT
    w1.ServerName, w1.DatabaseName, w1.FileID, w1.FileName,
    w1.MB - w5.MB AS OneMonthGrowthMB,
    w1.MB AS Week0MB, w2.MB AS WeekMinus1MB, w3.MB AS WeekMinus2MB,
    w4.MB AS WeekMinus3MB, w5.MB AS WeekMinus4MB,
    w1.SampleDT AS Week0DT, w2.SampleDT AS WeekMinus1DT,
    w3.SampleDT AS WeekMinus2DT,
    w4.SampleDT AS WeekMinus3DT, w5.SampleDT AS WeekMinus4DT
FROM cteFileSizes w1
-- Inner joins: a file appears only once it has all five samples
INNER JOIN cteFileSizes w2 ON w1.ServerName = w2.ServerName
    AND w1.DatabaseName = w2.DatabaseName AND w1.FileID = w2.FileID
    AND w2.WeeksPrevious = 2
INNER JOIN cteFileSizes w3 ON w1.ServerName = w3.ServerName
    AND w1.DatabaseName = w3.DatabaseName AND w1.FileID = w3.FileID
    AND w3.WeeksPrevious = 3
INNER JOIN cteFileSizes w4 ON w1.ServerName = w4.ServerName
    AND w1.DatabaseName = w4.DatabaseName AND w1.FileID = w4.FileID
    AND w4.WeeksPrevious = 4
INNER JOIN cteFileSizes w5 ON w1.ServerName = w5.ServerName
    AND w1.DatabaseName = w5.DatabaseName AND w1.FileID = w5.FileID
    AND w5.WeeksPrevious = 5
WHERE w1.WeeksPrevious = 1
ORDER BY w1.ServerName, w1.DatabaseName, w1.FileID;

Voila! You now have a quick report that you can whip up in SSRS, schedule it to automatically deliver, and include it in your weekly routine. The sky is the limit with the items that you can record and analyze. I always say more data is better than less, so monitor and record anything you can possibly consider valuable.
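The same dbo.FileSize samples can also feed a rough answer to the “when are we going to run out of space” question. The query below is a sketch under a simplifying assumption: it treats growth as linear and averages it between the oldest and newest sample for each file. The column aliases are my own, and the result says nothing about how much free space is left on the underlying volume, so you would still need to compare it against your disk capacity.

```sql
-- Sketch: average weekly growth per file between its oldest and newest samples.
-- Assumes linear growth; files with only one sample return NULL for the rate.
WITH cteBounds AS (
    SELECT
        ServerName, DatabaseName, FileID,
        MIN(SampleDT) AS FirstDT,
        MAX(SampleDT) AS LastDT
    FROM dbo.FileSize
    GROUP BY ServerName, DatabaseName, FileID
)
SELECT
    b.ServerName, b.DatabaseName, b.FileID,
    f1.MB AS FirstMB,
    f2.MB AS LastMB,
    DATEDIFF(DAY, b.FirstDT, b.LastDT) AS DaysCovered,
    (f2.MB - f1.MB) * 7.0
        / NULLIF(DATEDIFF(DAY, b.FirstDT, b.LastDT), 0) AS AvgGrowthMBPerWeek
FROM cteBounds b
INNER JOIN dbo.FileSize f1 ON f1.ServerName = b.ServerName
    AND f1.DatabaseName = b.DatabaseName AND f1.FileID = b.FileID
    AND f1.SampleDT = b.FirstDT
INNER JOIN dbo.FileSize f2 ON f2.ServerName = b.ServerName
    AND f2.DatabaseName = b.DatabaseName AND f2.FileID = b.FileID
    AND f2.SampleDT = b.LastDT
ORDER BY AvgGrowthMBPerWeek DESC;
```

Dividing the free space on a volume by the summed weekly growth of the files that live on it gives a first-order estimate of the weeks remaining. The longer the sample history, the less a single bulk load skews the average.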

Perfmon

Next, every server that I manage has Perfmon running in the background at all times, constantly sampling performance data. Perfmon data is always good to have, because not all system activity comes from SQL Server. Other items, such as system backups, antivirus scans, and other programs running in the background, can have a negative effect on performance. Understanding what a system was doing at the time of a problem is one of the crucial components that can make or break a triage investigation.

Here’s an example of this in action. A while back I was at a customer site where they were having random SQL Server database mirroring failovers in the middle of the night on a couple of older database systems. We did not have much to go on with the logging available at the time. I set up Perfmon to capture data every five minutes. A few days later, we experienced an unplanned failover. Sifting through the Perfmon data, we discovered that disk write activity on a RAID-5 set of local disks was high an hour before the event, and then went through the roof a few moments before the unplanned failover. We were then able to correlate the activity to a couple of mis-timed jobs. A database-level backup was overlapping with the sporadic runtimes of a system-level backup, and the onboard SAS controller was periodically overwhelmed and became unresponsive while its cache flushed to disk.

We set up Perfmon to collect a number of important counters every five minutes, and store the data to a log file that is date-stamped and rotated every night. The counters that we usually start with include the following items.

Object Name	Counter Name
PhysicalDisk	Current Disk Queue Length
PhysicalDisk	Avg. Disk Read Queue Length
PhysicalDisk	Avg. Disk Write Queue Length
PhysicalDisk	Avg. Disk sec/Read
PhysicalDisk	Avg. Disk sec/Write
PhysicalDisk	Avg. Disk Bytes/Read
PhysicalDisk	Avg. Disk Bytes/Write
PhysicalDisk	Disk Read Bytes/sec
PhysicalDisk	Disk Write Bytes/sec
Memory	Page Faults/sec
Memory	Pages/sec
Memory	Available MBytes
Paging File	% Usage
Processor	% User Time
Processor	% Privileged Time
Processor	% Processor Time
Processor	Interrupts/sec
System	Processor Queue Length
SQLServer:Access Methods	Forwarded Records/sec
SQLServer:Access Methods	Full Scans/sec
SQLServer:Access Methods	Page Splits/sec
SQLServer:Memory Manager	Memory Grants Pending
SQLServer:Buffer Manager	Buffer cache hit ratio
SQLServer:Buffer Manager	Checkpoint pages/sec
SQLServer:Buffer Manager	Lazy writes/sec
SQLServer:Buffer Manager	Page life expectancy
SQLServer:Buffer Manager	Readahead pages/sec
SQLServer:Databases	Transactions/sec
SQLServer:General Statistics	User Connections
SQLServer:Latches	Average Latch Wait Time (ms)
SQLServer:Locks	Average Wait Time (ms)
SQLServer:Locks	Lock Wait Time (ms)
SQLServer:Locks	Lock Waits/sec
SQLServer:SQL Statistics	SQL Compilations/sec
SQLServer:SQL Statistics	SQL Re-Compilations/sec
SQLServer:SQL Statistics	Batch Requests/sec

Remember to collect every instance of these counters (each disk, each processor, each database), not just the _Total rollups, which can wash out key information.
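As a complement to the Perfmon log files, a few of the SQL Server counters in the list can also be sampled from inside the instance and stored next to the file size history. The sketch below is my own illustration, not part of the Perfmon setup described above: it reads the sys.dm_os_performance_counters DMV, and the dbo.PerfCounterSample table is a hypothetical name. It deliberately captures only point-in-time counters. The per-second counters in this DMV are cumulative since startup and would need the difference between two samples, which is not shown here.

```sql
-- Hypothetical holding table for counter samples
CREATE TABLE dbo.PerfCounterSample (
    ServerName   nvarchar(128) NULL,
    ObjectName   nvarchar(128) NOT NULL,
    CounterName  nvarchar(128) NOT NULL,
    InstanceName nvarchar(128) NULL,
    CounterValue bigint        NOT NULL,
    SampleDT     datetime      NOT NULL
);

-- Sample three point-in-time counters from the list above.
-- The LIKE patterns match both default instances (SQLServer:...)
-- and named instances (MSSQL$InstanceName:...).
INSERT INTO dbo.PerfCounterSample
    (ServerName, ObjectName, CounterName, InstanceName, CounterValue, SampleDT)
SELECT
    @@SERVERNAME,
    RTRIM([object_name]),    -- DMV columns are space-padded nchar(128)
    RTRIM(counter_name),
    RTRIM(instance_name),
    cntr_value,
    GETDATE()
FROM sys.dm_os_performance_counters
WHERE cntr_type = 65792      -- raw point-in-time values only
  AND (   ([object_name] LIKE N'%:Buffer Manager%'
           AND counter_name = N'Page life expectancy')
       OR ([object_name] LIKE N'%:General Statistics%'
           AND counter_name = N'User Connections')
       OR ([object_name] LIKE N'%:Memory Manager%'
           AND counter_name = N'Memory Grants Pending'));
```

This only covers counters that SQL Server itself exposes. The disk, memory, and processor counters still have to come from Perfmon.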

Infrastructure Statistics

Finally, I prefer to record all statistics underneath the SQL Servers and the operating systems. This includes items like SAN performance, VMware or Hyper-V performance statistics for both the VM itself and the hosts it resides on, and even down to networking activity.

Why would I suggest these items? Ponder this situation. Your organization has a SAN with 50 servers connected. A system administrator misconfigures an antivirus scan setting and accidentally triggers a full scan on all 50 of those servers at the same time.

Your SQL Servers now grind to a crawl. Someone runs into your office and demands that you investigate why the database server is performing poorly (it’s always the DBA’s fault, right?). The SQL Server and Windows performance statistics record a sudden burst of high disk latency and reduced throughput. You look at the vCenter statistics and find that most of the virtual machines are strangely maxed out on CPU utilization. Outside of those metrics, you do not know what is occurring, but you now have a solid set of data to take to the storage group when you ask them to investigate further.

Quite frequently, DBAs just do not have access to these items. However, you can ask the administrators of these systems to set up automatic reports that are routinely delivered to you. Find a way to get these statistics, because after all, data is the most important part of the business, and you should be aware of how the infrastructure is performing.

Wrap Up

Maintain your SLAs by baselining the environment and being proactive on resource contention. Use this information to help triage pain points in the infrastructure, and ensure that your systems are running at their peak performance!
