House of Brick Principal Architect
“Your database is running slow, and you need to drop every other emergency and fix it right now!”
Do you get nailed with exasperated comments such as this without warning? Don’t you ever wish the person would say “I inserted a billion records into table dbo.XYZ and now the nonclustered index dbo.IX_XYZ_MakeReportingFaster needs to be rebuilt because it’s slowing down reporting”?
That’s always fun (not). Not only are folks in the organization in a panic, but you are left with no useful information to start the triage process, except that a particular server is “slow” by some arbitrary measure. You must start from ground zero on the server in question and work your way to the problem and eventual solution, and that’s time consuming.
What if this same user comes to you tomorrow and says “Sorry about the emergency yesterday… but here’s another question. I have not budgeted for anything for your team for the next couple of years, but when are we going to run out of space on your database servers? I just thought of that today…”
Do you have the system in place that could just pop out the answer?
You should consider building a system that automates the collection of common runtime metrics and system states, so that you can determine what is out of the norm during an emergency, and maintain a long-term capacity management baseline for all key metrics on your servers.
Recently, I was reading one of Chris Shaw’s (B, L, T) fantastic chapters on the utility database from the new book Pro SQL Server 2012 Practices, and thought that I would share some of my practices that can help back up some of his recommendations with more practical examples.
Checklists and Data Collection
First, reference a previous blog post of mine where I outline all of the usual tasks that I perform during normal day-to-day maintenance of these servers. Now, I’ll show you how to create a system that can assist in automating these processes!
Let’s take a pretty routine task – checking database file sizes.
To keep things simple, I am adapting query number 17 from Glenn Berry’s SQL Server 2012 diagnostic queries, January 2013. You can see more fantastic queries for these types of purposes in those scripts.
Create a placeholder for the data.
CREATE TABLE [dbo].[FileSize](
    [ServerName]   [nvarchar](128) NULL,
    [DatabaseName] [nvarchar](128) NULL,
    [FileID]       [int] NOT NULL,
    [FileName]     [sysname] NOT NULL,
    [PhysicalName] [nvarchar](260) NOT NULL,
    [TypeDesc]     [nvarchar](60) NULL,
    [StateDesc]    [nvarchar](60) NULL,
    [MB]           [bigint] NULL,
    [SampleDT]     [datetime] NOT NULL
) ON [PRIMARY]
To start populating this table, you could create a job to periodically execute the following query and store the results. I normally sample database file sizes once a week unless that environment has a high number of transactions and frequent file growth.
-- Adapted from Glenn Berry SQL Server 2012 Diagnostic Queries, January 2013
-- SQLServerPerformance.wordpress.com
-- File Names and Paths for TempDB and all user databases in instance (Query 17)
INSERT INTO dbo.FileSize
SELECT @@SERVERNAME AS ServerName,
       DB_NAME([database_id]) AS [DatabaseName],
       [file_id] AS FileID,
       name AS FileName,
       physical_name AS PhysicalName,
       type_desc AS TypeDesc,
       state_desc AS StateDesc,
       CONVERT(bigint, size/128.0) AS MB,
       GETDATE() AS SampleDT
FROM sys.master_files WITH (NOLOCK)
WHERE ([database_id] > 4 AND [database_id] < 32767)
   OR [database_id] = 2
ORDER BY DB_NAME([database_id]) OPTION (RECOMPILE);
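As a sketch, the collection can be scheduled with a SQL Server Agent job. The job name, schedule name, and the DBAUtility database holding dbo.FileSize are illustrative assumptions; adjust them to your environment.

```sql
-- Illustrative sketch: run the collection query as a weekly SQL Server Agent job.
-- Names and the DBAUtility database are assumptions -- adapt to your environment.
USE msdb;
GO
EXEC dbo.sp_add_job
    @job_name = N'Collect - Database File Sizes';
EXEC dbo.sp_add_jobstep
    @job_name = N'Collect - Database File Sizes',
    @step_name = N'Insert file size sample',
    @subsystem = N'TSQL',
    @database_name = N'DBAUtility',  -- database containing dbo.FileSize
    @command = N'INSERT INTO dbo.FileSize SELECT ... FROM sys.master_files ...'; -- the query above
EXEC dbo.sp_add_schedule
    @schedule_name = N'Weekly - Sunday 2 AM',
    @freq_type = 8,               -- weekly
    @freq_interval = 1,           -- Sunday
    @freq_recurrence_factor = 1,  -- every week
    @active_start_time = 020000;
EXEC dbo.sp_attach_schedule
    @job_name = N'Collect - Database File Sizes',
    @schedule_name = N'Weekly - Sunday 2 AM';
EXEC dbo.sp_add_jobserver
    @job_name = N'Collect - Database File Sizes';
GO
```

For the busier environments mentioned above, tightening @freq_type and @freq_interval to a daily schedule is a one-line change.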
The following query presents the previous month’s samples and calculates the amount of growth per file over that month.
WITH cteFileSizes (ServerName, DatabaseName, FileID, FileName, SampleDT, MB, WeeksPrevious) AS
(
    SELECT ServerName, DatabaseName, FileID, FileName, SampleDT, MB,
           DENSE_RANK() OVER (ORDER BY SampleDT DESC) AS WeeksPrevious
    FROM dbo.FileSize
)
SELECT w1.ServerName, w1.DatabaseName, w1.FileID, w1.FileName,
       w1.MB - w5.MB AS OneMonthGrowthMB,
       w1.MB AS Week0MB,
       w2.MB AS WeekMinus1MB,
       w3.MB AS WeekMinus2MB,
       w4.MB AS WeekMinus3MB,
       w5.MB AS WeekMinus4MB,
       w1.SampleDT AS Week0DT,
       w2.SampleDT AS WeekMinus1DT,
       w3.SampleDT AS WeekMinus2DT,
       w4.SampleDT AS WeekMinus3DT,
       w5.SampleDT AS WeekMinus4DT
FROM cteFileSizes w1
LEFT JOIN cteFileSizes w2
    ON w1.ServerName = w2.ServerName AND w1.DatabaseName = w2.DatabaseName AND w1.FileID = w2.FileID
LEFT JOIN cteFileSizes w3
    ON w1.ServerName = w3.ServerName AND w1.DatabaseName = w3.DatabaseName AND w1.FileID = w3.FileID
LEFT JOIN cteFileSizes w4
    ON w1.ServerName = w4.ServerName AND w1.DatabaseName = w4.DatabaseName AND w1.FileID = w4.FileID
LEFT JOIN cteFileSizes w5
    ON w1.ServerName = w5.ServerName AND w1.DatabaseName = w5.DatabaseName AND w1.FileID = w5.FileID
WHERE w1.WeeksPrevious = 1
  AND w2.WeeksPrevious = 2
  AND w3.WeeksPrevious = 3
  AND w4.WeeksPrevious = 4
  AND w5.WeeksPrevious = 5
ORDER BY w1.ServerName, w1.DatabaseName, w1.FileID
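The same samples can start answering the “when do we run out of space?” question from earlier with a rough linear projection. This is a sketch only: the @FreeMBOnVolume value is a placeholder assumption (in practice you would also collect per-volume free space and join to it), and it assumes the files grow fairly steadily.

```sql
-- Rough linear projection of growth from the collected samples.
-- @FreeMBOnVolume is a placeholder assumption; in practice, join to a table of
-- per-volume free space that you collect alongside the file sizes.
DECLARE @FreeMBOnVolume bigint = 512000; -- e.g., 500 GB free on the data volume

WITH spans AS (
    SELECT ServerName, DatabaseName, FileID, FileName,
           MIN(SampleDT) AS FirstDT,
           MAX(SampleDT) AS LastDT,
           MAX(MB) - MIN(MB) AS GrowthMB -- assumes files only grow over the window
    FROM dbo.FileSize
    GROUP BY ServerName, DatabaseName, FileID, FileName
)
SELECT ServerName, DatabaseName, FileName,
       GrowthMB * 1.0 / NULLIF(DATEDIFF(DAY, FirstDT, LastDT), 0) AS MBPerDay,
       @FreeMBOnVolume /
           NULLIF(GrowthMB * 1.0 / NULLIF(DATEDIFF(DAY, FirstDT, LastDT), 0), 0)
           AS EstDaysUntilFull
FROM spans
WHERE GrowthMB > 0
ORDER BY EstDaysUntilFull;
```

A straight-line estimate like this is crude, but it turns the surprise budgeting question into a report you can hand over the same day.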
Voila! You now have a quick report that you can whip up in SSRS, schedule for automatic delivery, and include in your weekly routine. The sky is the limit with the items that you can record and analyze. I always say more data is better than less, so monitor and record anything you can possibly consider valuable.
Next, every server that I manage has Perfmon running in the background at all times, constantly sampling performance data. Perfmon data is always good to have, because not all system activity comes from SQL Server. Other items, such as system backups, antivirus scans, and other programs running in the background can have a negative effect on performance. Understanding what a system was doing at the time of a problem is one of the crucial components that can make or break a triage investigation.
Here’s an example of this in action. A while back I was at a customer site where they were having random SQL Server database mirror failovers in the middle of the night on a couple of older database systems. We did not have much to go on with the logging available at the time. I set up Perfmon to capture data every five minutes. A few days later, we experienced an unplanned failover. Sifting through the Perfmon data, we discovered that disk write activity on a RAID-5 set of local disks was high an hour before the event, and then went through the roof moments before the unplanned failover. We were then able to correlate the activity to a couple of mis-timed jobs: a database-level backup was overlapping with the sporadic runtimes of a system-level backup, and the onboard SAS controller was periodically overwhelmed and went unresponsive while its cache flushed to disk.
We set up Perfmon to collect a number of important counters every five minutes, and store the data to a log file that is date-stamped and rotated every night. The counters that we usually start with include the following items.
Current Disk Queue Length
Average Disk Read Queue Length
Average Disk Write Queue Length
Average Disk sec/Read
Average Disk sec/Write
Average Disk Bytes/Read
Average Disk Bytes/Write
Disk Read Bytes / sec
Disk Write Bytes / sec
Pages / sec
Interrupts / sec
Processor Queue Length
Forwarded Records / sec
Full scans / sec
Page splits / sec
Memory Grants Pending
Buffer Cache Hit Ratio
Checkpoints / sec
Lazy Writes / sec
Page Life Expectancy
Readahead pages / sec
Average Latch wait Time
Average Wait Time (ms)
Lock Wait time (ms)
Lock waits / sec
SQL Compilations / sec
SQL Re-Compilations / sec
Batch Requests / sec
Remember to collect all instances of these counters, not the cumulative rollups, which can wash out key information.
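SQL Server also exposes its own counters through the sys.dm_os_performance_counters DMV, so the SQL-side items in the list above can be sampled and stored with the same job-based approach as the file sizes. A minimal sketch (the counter names follow the DMV’s naming, which differs slightly from Perfmon’s display names):

```sql
-- Sample a few of the SQL Server counters listed above from inside the engine.
-- Note: per-second counters (e.g., Batch Requests/sec) are cumulative in this DMV;
-- deriving an actual rate requires differencing two samples taken over an interval.
SELECT GETDATE() AS SampleDT,
       RTRIM([object_name]) AS ObjectName,
       RTRIM(counter_name)  AS CounterName,
       RTRIM(instance_name) AS InstanceName,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Page life expectancy',
                       N'Memory Grants Pending',
                       N'Batch Requests/sec',
                       N'Lazy writes/sec',
                       N'Page Splits/sec');
```

Inserting these rows into a date-stamped table gives you a SQL-side history even on servers where you cannot get at the Perfmon logs directly.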
Finally, I prefer to record statistics from every layer beneath SQL Server and the operating system. This includes items like SAN performance, VMware or Hyper-V performance statistics for both the VM itself and the hosts it resides on, and even networking activity.
Why would I suggest these items? Ponder this situation. Your organization has a SAN with 50 servers connected. A system administrator misconfigures an antivirus scan setting and accidentally triggers a full scan on all 50 of those servers at the same time.
Your SQL Servers now grind to a crawl. Someone runs into your office and demands that you investigate why the database server is performing poorly (it’s always the DBA’s fault, right?). The SQL Server and Windows performance data now records a burst of suddenly high disk latency and reduced throughput. You look at the vCenter statistics and find that most of the virtual machines are strangely maxed out on CPU utilization. Outside of those metrics, you do not know what is occurring, but you now have a great set of metrics to take to the storage group and ask them to investigate further.
Quite frequently, DBAs just do not have access to these items. However, you can ask the administrators of these systems to set up automatic reports that are routinely delivered to you. Find a way to get these statistics, because after all, data is the most important part of the business, and you should be aware of how the infrastructure is performing.
Maintain your SLAs by baselining the environment and being proactive on resource contention. Use this information to help triage pain points in the infrastructure, and ensure that your systems are running at their peak performance!