Lessons learned from mass uploads
In recent months, we’ve seen customers who are doing major migrations of documents—for example, from file servers—to SharePoint. Far too often, one of the complaints I hear voiced in these scenarios is how slow the upload can be, particularly when uploading huge numbers of documents. I thought it would be worth starting a discussion of lessons learned related to performance of mass uploads. This is a fairly targeted business scenario—most organizations don’t do this too often—but it also brings up some key points to consider about performance in other SharePoint scenarios.
The following are among the factors that can cause performance of mass uploads to suffer:
- The recovery model for the content database (a SQL Server setting) is set to Full by default. While this is certainly the appropriate setting for a production environment, as it allows for recovery using transaction logs, it can slow the content loading of a migration. Set the recovery model to Simple, which causes the contents of the transaction log to be truncated each time a checkpoint is issued for the database. Just remember two things: First, set it back to Full when finished. Second, remember that under the Simple model the database recovery point can only be as recent as the last database backup, so you’ll probably want to back up before your migration (and there are many good reasons for that, anyway).
- Search indexing, if it kicks in, consumes resources that you might need on your WFEs and SQL servers for processing the migration of files. Make sure that search jobs are scheduled appropriately—or paused—while you do your mass upload.
- Anti-virus software, if it is scanning every document that is uploaded, or is scanning the database or BLOB store directly, can slow things down tremendously. Assuming that your documents were scanned when they were uploaded to their original location, you probably don’t need to incur that penalty when simply moving those documents to SharePoint.
- BLOB storage can affect performance, for better or worse. As you know, I’ve done a lot of writing and speaking about BLOB storage and content database scalability. A BLOB (binary large object) is the binary, unstructured chunk of data that is the document as it is stored in SQL, in the AllDocStreams table of your content database. You can externalize BLOBs using EBS or RBS, which means you store BLOBs in a location other than your content database, and the database gets a pointer to the document. When you externalize BLOBs, you reduce the writes to your database. By default, when you upload a document, it gets written to the transaction log first, then gets committed to the database. That’s two writes for every document. So, conceptually, externalizing BLOBs offers a performance benefit. But it really depends on the performance of the storage tier to which you move BLOBs, and on the performance of the EBS or RBS provider (the software that manages the communication between EBS/RBS, which are Microsoft APIs, and your BLOB storage platform). For example, if you’re externalizing BLOBs to cloud storage, like Amazon or Rackspace, it’s likely performance will be penalized. But if you’re externalizing to a high-performance storage tier, performance can definitely increase for this mass-upload scenario.
- Database growth sizing. The default database size and growth settings for SQL databases are really not appropriate for most SharePoint databases, particularly those that will contain BLOBs. Set the size of your content database to something that represents the size of the data you’re going to upload. Consider the space that metadata will take, as well. That way, SQL doesn’t have to “grow” the database as you upload—the space is already there. As a side note, size and growth affect performance as your environment scales—there are some great blog posts on the “interwebs” to help you determine an appropriate setting, but I recommend setting an initial size that represents your expected content size (including metadata and BLOBs, if stored in SQL) over the first few months of your service, and a growth setting of 10% of that size. But be smart about it—there are a lot of variables in that calculation that all depend on your usage patterns.
- Storage performance, of course, can affect the uploads. Consider creative solutions—like moving the database to which you’re uploading to a separate set of spindles, a separate SQL instance, or a separate SQL server, during the upload. Then move it to its “final home” after uploading is complete. Keep in mind you might even be able to do a migration in a lab then bring the content database into production. Just detach and reattach the content databases.
- The web front end (WFE) can be a bottleneck, though it’s typically the SQL side that’s the bottleneck. Consider uploading to a dedicated web front end that is not being hit by users. You can target your migration at that server using DNS or load balancer settings.
- The bottleneck might be the connection between the WFE and SQL Server. Use a dedicated high-speed (Gig-E or 10Gig-E) network between WFE and SQL servers. Use teaming if NICs support it.
- The client side can also be a bottleneck, as can requests that aren’t load balanced. Consider running the migration directly on the WFE or from multiple clients, depending on your infrastructure.
- The source can be the bottleneck. Consider all of the previous issues as they apply to where the files are coming from. Should you perform the upload from the file server, for example? Should you move or copy the files to disks that are local to the WFE to maximize performance of the actual upload? That kind of two-step process may help you migrate during specific time windows of your service level agreements.
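To make the recovery model and database sizing points above a little more concrete, here is a minimal Python sketch that works out an initial size and growth increment and prints the T-SQL you might run before and after the upload. Treat it as an illustration under stated assumptions, not a recommendation for your farm: the database name and logical file name are hypothetical, the 10% metadata allowance is a placeholder I made up for the example, and the 10% growth figure is simply the rule of thumb mentioned in the sizing bullet. The script only builds strings; it does not connect to SQL Server.

```python
from dataclasses import dataclass


@dataclass
class SizingPlan:
    initial_mb: int  # size to pre-allocate for the data file
    growth_mb: int   # fixed autogrowth increment


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, avoiding floating-point surprises."""
    return -(-a // b)


def plan_sizing(content_gb: int, metadata_pct: int = 10,
                growth_pct: int = 10) -> SizingPlan:
    """Estimate the initial data file size and growth increment.

    metadata_pct is an assumed allowance for metadata on top of the
    documents themselves; adjust it to match your own content.
    """
    if content_gb <= 0:
        raise ValueError("content_gb must be positive")
    initial_mb = ceil_div(content_gb * 1024 * (100 + metadata_pct), 100)
    growth_mb = ceil_div(initial_mb * growth_pct, 100)
    return SizingPlan(initial_mb, growth_mb)


def bracket(name: str) -> str:
    """Quote a T-SQL identifier, escaping any closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def literal(text: str) -> str:
    """Quote a T-SQL Unicode string literal, escaping single quotes."""
    return "N'" + text.replace("'", "''") + "'"


def migration_sql(database: str, data_file: str, plan: SizingPlan) -> dict:
    """Build the statements to run before and after the mass upload."""
    db = bracket(database)
    before = [
        # Pre-size the data file so SQL Server does not have to grow it
        # mid-upload. SIZE must be larger than the file's current size.
        f"ALTER DATABASE {db} MODIFY FILE (NAME = {literal(data_file)}, "
        f"SIZE = {plan.initial_mb}MB, FILEGROWTH = {plan.growth_mb}MB);",
        # Take a full backup first: under SIMPLE you can only recover
        # to the most recent backup.
        f"ALTER DATABASE {db} SET RECOVERY SIMPLE;",
    ]
    after = [
        # Switch back when the upload finishes, then take a fresh
        # backup so that log backups can resume.
        f"ALTER DATABASE {db} SET RECOVERY FULL;",
    ]
    return {"before": before, "after": after}


if __name__ == "__main__":
    # Hypothetical names and a hypothetical 200 GB migration.
    plan = plan_sizing(200)
    sql = migration_sql("WSS_Content_Migration", "WSS_Content_Migration", plan)
    for phase in ("before", "after"):
        print(f"-- {phase} the upload")
        for statement in sql[phase]:
            print(statement)
```

With those assumptions, a 200 GB upload works out to an initial size of 225,280 MB and a growth increment of 22,528 MB. Review the generated statements before running them, and remember the usual caveat from the sizing bullet: the right numbers depend on your own usage patterns.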
There’s a lot of room for performance problems—and for creative solutions—when you think about all of the moving parts (both infrastructure and services) that are at play in a simple mass upload!
Hopefully this gives you some ideas for this mass upload scenario—ideas that are also useful points to consider in other performance scenarios. Of course, there’s a LOT more to say about performance and SQL performance in particular. I’d like to thank my colleague Randy Williams for his significant contribution to this newsletter and point you to his great presentation about optimizing SQL for SharePoint. And I’d like to invite you to comment on this article with YOUR experiences related to mass-upload performance optimization.