Let's look at SharePoint troubleshooting by examining a well-performing Microsoft SharePoint 2010 farm and working backward from there.
When I initially sat down to start this article, I quickly realized that you can't begin to troubleshoot performance issues until you have a sound baseline to start with. Otherwise, the likelihood of you figuring out what is going wrong in the environment is extremely low. With this in mind, I decided to tackle this topic from a bit of a different perspective, examining what forms the base of a solid and well-performing Microsoft SharePoint 2010 farm and working backward from there.
Performance is a tremendously broad topic when you're talking about SharePoint 2010. End users frequently have concerns such as "Saving that form took too long," or "It felt like it took forever for me to upload a file," or -- my favorite -- "SharePoint feels slow." As IT pros, we rarely get specifics when people complain about performance; they expect us to fix problems instantly without any details of what's really going on.
But SharePoint user education is a topic for another day. This article focuses instead on some of the main areas that you can address to ensure that any bottlenecks users experience are not SharePoint-related and to give you that solid foundation I mentioned earlier.
Windows Server Hardware Sizing
First, you'll want to make sure that the platform that supports SharePoint is sound. To do so, you need to correctly size your hardware to support the SharePoint tier that is being hosted. You'll also need to ensure that Windows Server has been optimized.
Table 1 lists Microsoft's minimum hardware requirements for web servers and application servers in a farm installation. Keep in mind that these are minimum recommendations and will serve up a minimal experience. If you want to optimize performance, these numbers are not going to be anywhere close to good enough.
In most scenarios, web- and application-tier servers will experience CPU contention before RAM contention, but that will depend on your application-pool configuration. If you load 2GB into each application pool at startup -- which you shouldn't be doing, but I've seen it happen in heavily application-development-focused scenarios -- and you're running four app pools, then you've exhausted 8GB of RAM before a user even hits a page. For the optimal RAM profile, examine what your app pools will require, multiply that number by the number of app pools that you expect to have, and then add half again as much to leave room for growth and the occasional application that doesn't dispose properly:
(Required RAM per App Pool) × (# of App Pools) × 1.5 = Proposed RAM Profile

For example, if each app pool needs 2GB and you expect four pools, that's 2GB × 4 × 1.5 = 12GB of RAM.
For more information about application pools for SharePoint, review the Microsoft article "SharePoint Server 2010 Capacity Management: Software Boundaries and Limits."
The proper CPU profile is going to rely primarily on your environment's balance between virtualization and hardware. Microsoft Hyper-V Server 2008 R2 supports as many as four cores per guest virtual machine (VM), and VMware vSphere Enterprise Plus supports as many as eight cores. If you need more than eight cores, then your path lies with physical servers. I believe in virtualization as a path because of the lower total cost of ownership (TCO) of the high availability and disaster recovery that it enables; however, each path has its own virtues. Whichever path you choose, a general performance rule is to keep your CPU utilization at less than 50 percent per server.
Hard disk space is a fairly straightforward decision: 80GB is never going to be enough. Each web and application server will have its own Unified Logging Service (ULS), IIS, and event logs; copy of the 14 hive; and WinSxS directory. Add a pagefile that doubles your RAM count and a desire to make sure that your server doesn't crash because you didn't have enough disk space. I recommend a minimum of 200GB per server -- 400GB if you can afford it. Rather than splitting drives into multiple partitions, keep a single, larger C partition to manage growth.
SQL Server Hardware Sizing
The SQL Server tier is the one in which you'll want to make your hardware investment. If you don't give SQL Server enough horsepower, you're sunk before you leave port. CPU and RAM are both crucial to SQL Server performance, but be aware that SQL Server will chew up as much RAM as it can get its teeth into, regardless of load. In most cases, SQL Server takes RAM and never gives it back. CPU will trend up and down over time, but if you don't have enough cores across which to spread the load, you'll find yourself with pegged CPUs and a poorly performing SharePoint farm. Table 2 lists Microsoft's minimum hardware requirements for SQL Servers in a farm installation.
When sizing SQL, always consider which services you can separate out. The minimum requirements refer specifically to the relational database management system (RDBMS) engine, not taking into account any SQL Server Integration Services (SSIS), Reporting Services (SSRS), or Analysis Services (SSAS) needs that you might have. Separate these services onto their own hardware whenever possible, and address their specific hardware requirements as well.
Again, the minimum hardware requirements quoted above are not overly practical in real-world farms. Determine your high-availability and disaster-recovery requirements and figure out your needs for clustering versus mirroring before making your hardware purchase. When you have finished your analysis and are ready to actually make a purchase, go for as large a hardware footprint as your budget will allow. (For performance information about SQL Server clustering, I recommend the Microsoft article "Optimizing Failover Cluster Performance." For performance information about SQL Server mirroring, read the Microsoft article "Database Mirroring Best Practices and Performance Considerations." )
In my experience, this approach has paid off more times than I can count. Each time I refresh a lease (usually every 3 to 4 years) or purchase a new SQL Server to replace an existing box, I can never remember saying, "Wow, that server was way over-powered for what it was meant to do." My usual statement is, "It's amazing that we managed to limp along on such a small box."
SQL Server Performance
There are many schools of thought as to how to make SQL Server perform at its best. For the sake of this article, we're going to isolate the discussion to specifically those areas that affect SharePoint.
Pregrow your databases and set automatic growth parameters. Why are pregrowing databases and setting automatic growth parameters important? SQL Server is at its most efficient when it has a large block of contiguous space.
One of the worst-performing base operations for SQL Server is its database growth operation. SQL Server goes out, finds available space on disk, and adds that space to the database. By default, this operation grows the file by 1MB at a time. When a content database is created, it has a 20MB data file and a 3MB log file. Creating a blank site takes up another 1MB of space, and the database is forced to grow. Uploading two PDFs (each just under 3MB) brings the data file to a total of 28MB and the log file to 5MB. That's 10 grow operations just to create a blank site with a document library and two uploaded PDFs. Imagine how many grow operations per day will occur on a production server with hundreds of collaborating users.
My general rule for estimating the starting size of the content database is first to get an estimate from your business analysts (or by examining your trend data) and then to double that estimate. If you think you'll need 20GB, grow your database to 40GB to ensure contiguous space. Fibre Channel and Serial ATA (SATA) disks are relatively inexpensive these days, and unless you're using enterprise flash disks (which I typically wouldn't use for content databases), you can get numerous high-capacity spindles for a fairly low price. Do stay beneath the recommended size of 200GB whenever possible.
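Pregrowing can be scripted with plain T-SQL. The sketch below assumes a content database named WSS_Content with default logical file names -- both are placeholders, so substitute your own (the query in Listing 1 will show you the actual names).

```sql
-- Pregrow the data file to 40GB and the log file to 4GB so that
-- SQL Server has a large block of contiguous space to work with.
-- The database name and logical file names are assumed values.
ALTER DATABASE [WSS_Content]
    MODIFY FILE (NAME = N'WSS_Content', SIZE = 40960MB);

ALTER DATABASE [WSS_Content]
    MODIFY FILE (NAME = N'WSS_Content_log', SIZE = 4096MB);
```

Run this during a maintenance window; growing a file by tens of gigabytes is exactly the kind of operation you don't want users waiting on.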
Make sure to change the autogrow setting from the default 1MB to something reasonable. I set autogrow to between 500MB and 1GB, depending upon the importance and purpose of the content database. Why not set autogrow to a percentage, as the product allows? The answer is large databases. If you have a 1TB database set to grow by 10 percent, the next growth operation will add roughly 100GB, and the user is going to wait while it grows. If you set autogrow to 500MB, users will notice but not be seriously inconvenienced.
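The fixed increment is a one-line change per file. As above, WSS_Content and its logical file name are placeholder values in this sketch.

```sql
-- Replace the 1MB default with a fixed 500MB growth increment.
-- Because FILEGROWTH is given as a size here, this also switches
-- the file off percentage-based growth.
ALTER DATABASE [WSS_Content]
    MODIFY FILE (NAME = N'WSS_Content', FILEGROWTH = 500MB);
```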
If variable auto growth still needs to be a part of your maintenance strategy, consider these points:
- Set standard autogrow to somewhere between 500MB and 1GB.
- Create a daily maintenance job to check for the amount of free space in your database.
- If the amount of free space is less than 10 percent, grow the database at a scheduled time when users will not be affected.
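The free-space check in that maintenance job boils down to comparing used pages against allocated pages. This is a sketch against an assumed WSS_Content database; file sizes are reported in 8KB pages, so dividing by 128 yields megabytes.

```sql
-- Report allocated, used, and free space per data file so that a
-- scheduled job can decide whether an off-hours growth is needed.
USE [WSS_Content];
SELECT  name,
        size / 128 AS SizeMB,
        CAST(FILEPROPERTY(name, 'SpaceUsed') AS int) / 128 AS UsedMB,
        (size - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)) / 128 AS FreeMB
FROM    sys.database_files
WHERE   type_desc = 'ROWS';
```

Wrap the SELECT in your own threshold logic (or an IF against the FreeMB value) inside a SQL Server Agent job to complete the strategy.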
Validating and documenting settings can be a bit of an accountability nightmare, but this is where T-SQL can help. You can use the query in Listing 1 to identify your database settings so that you don't have to inventory them manually. Table 3 shows an example of the listing's output.
Know your I/O requirements. SQL Server databases consist of at least one data file and one log file, but you can create supplemental data files to spread out the workload. We're going to assume a minimum configuration for this article.
As a general rule, log files in SQL Server are write-intensive, whereas data files see a healthy mix of reads and writes. Accordingly, the databases that SharePoint uses are generally read/write in nature (TempDB being the notable write-dominant exception). You can set up archival content databases that house read-only site collections, but the real power of SharePoint is in its collaboration functionality, which by its nature needs both read and write elements.
Why is this knowledge vital? Data storage requires redundancy, and the core of this redundancy is RAID technology. The performance of your databases can be severely affected by the RAID level applied to the storage on which your databases reside, especially for your write-intensive databases. The two most commonly used RAID levels for database storage are RAID 5 and RAID 10.
The performance difference between RAID 5 and RAID 10 is all about writes. In RAID 5, every logical write triggers extra physical I/O: the controller must read the existing data and parity and then write the new data and recalculated parity, roughly four physical operations per write. In RAID 10, a write goes to a disk and its mirror, and the mirror write happens in the background without affecting the user's perceived performance. This RAID 5 overhead is commonly referred to as the RAID write penalty. As a result, it is highly suggested that log files, TempDB, and any databases that you expect to be write-intensive live on RAID 10. Table 4 shows a quick chart of recommendations.
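To see why the penalty matters, you can run the arithmetic for a hypothetical workload. The read and write numbers below are made up for illustration; the multipliers are the commonly cited per-write costs (four physical I/Os for RAID 5, two for RAID 10).

```sql
-- Back-of-the-envelope spindle load for an assumed workload of
-- 1,000 reads and 400 writes per second.
DECLARE @reads int = 1000, @writes int = 400;
SELECT  @reads + (@writes * 4) AS Raid5BackendIOPS,   -- 2,600
        @reads + (@writes * 2) AS Raid10BackendIOPS;  -- 1,800
```

Same front-end workload, roughly 45 percent more back-end I/O on RAID 5 -- which is why write-heavy files belong on RAID 10.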
The difficult decision regarding I/O is cost versus return. RAID 10 is extremely storage inefficient and costly but provides the best possible performance and redundancy. RAID 5 is storage efficient and provides redundancy, but you'll contend with a performance hit. The best answer is to do what is best for your business, providing a healthy balance. If your department budget is $100k per year, you aren't going to be able to afford a SAN with enough storage to put everything on a separate RAID 10 and implement enterprise flash disks. Do the most you can with the budget you have (and scrape and plead for more any chance you get).
SharePoint Web Server Load Testing
Load testing is an often overlooked art when it comes to SharePoint. We all plan to test for load, but it seems rare that anyone actually takes the time to do so. Several simple (as well as many more-complex) tools allow you to load-test your environment:
- Microsoft Load Testing Kit (LTK) for SharePoint Foundation 2010
- Microsoft Web Capacity Analysis Tool (WCAT)
- HP LoadRunner
A recent load-test example. For your web servers, you need to examine the number of concurrent users you expect and measure that number against what your hardware and VMs can handle. A recent load test that I performed used this configuration:
- Two SharePoint web servers (four cores, 16GB RAM), using load balancing
- One SharePoint application server (four cores, 16GB RAM)
- One SQL Server instance (16 cores, 128GB RAM)
When performing a simple Create, Read, Update, and Delete operation in a standard SharePoint list, the system became completely non-responsive at 500 concurrent users. This minor field test simply logged a user in, created a list item, added text to the item, saved the item, deleted the item, and logged the user out. Watching this test on the server side, we found that we were immediately CPU-bound.
As a result of this test, we determined that adding CPU cores to the existing servers, or adding load-balanced web servers to the farm, would give us the additional concurrent-user capacity we expected, thereby satisfying customer requirements. These are the kinds of issues that simply knowing your infrastructure requirements can help you alleviate.
Load balancing in your SharePoint application tier. In SharePoint 2010, Microsoft seems to have done a good job with a simple feature called the Service Application Load Balancer. This feature serves round-robin requests to all listening service applications that can respond to a given request. Therefore, if a user is hitting web application A and making a request for Microsoft Excel Services, and you have three application servers that are scoped to listen for requests on web application A, then SharePoint passes the request to the next application server in line to respond. And what manner of configuration must you, the much beleaguered and weary admin, complete to enable this magic? Not a thing. Simply add in the Service Application Load Balancer and allow the proxy to do the work.
In all seriousness, this feature looks in the configuration database, builds a cached list of available service applications, determines which endpoints are available to process the request, and then hands off said request. This process enables not only load balancing, but fault tolerance as well. If an endpoint is unreachable, the proxy drops that endpoint from the rotation for 10 minutes (the default setting, which is configurable), after which it tries to reach the endpoint again. In a multi-farm environment, the Topology web service handles the discovery and loads the information into the local configuration database, with the endpoints being treated as though they were in the local farm.
Any number of factors can contribute to SharePoint users' perceived performance issues. In some cases, these factors are completely out of your control or have nothing to do with SharePoint at all. Nor do users generally think about the size of the operations they execute: the user who had trouble uploading a file isn't going to tell you that the file was 220MB or that they were communicating with the server via a slow hotel connection over SSL VPN. But in many situations, SharePoint issues can be avoided simply by carefully evaluating and properly sizing your hardware and supporting applications before users get anywhere near SharePoint.
Along with the performance solutions that I discussed in this article, a few other simple tools and features can go a long way toward troubleshooting many SharePoint issues. See the sidebar "SharePoint Troubleshooting: A Little Data Goes a Long Way," for more information.
Listing 1: Query to Identify Database Settings
-- Create the temporary table that holds the results.
create table #GrowthTable (
    Type_Desc nvarchar(60),
    Name sysname,
    Physical_Name nvarchar(260),
    State_Desc nvarchar(60),
    Size int,
    Max_Size int,
    Growth int,
    Is_Percent_Growth bit
)

-- Gather the file settings from every database on the instance.
exec sp_msforeachdb 'use [?];
insert into #GrowthTable
select Type_Desc, Name, Physical_Name, State_Desc,
    Size, Max_Size, Growth, Is_Percent_Growth
from sys.database_files'

select * from #GrowthTable

drop table #GrowthTable