Two weeks ago, I discussed the User Profile service application (UPA) and User Profile Synchronization service (UPS) in SharePoint 2010 in the context of issues that I, my clients, and others have had getting the UPS to start successfully. The article generated quite a stir, and I received numerous passionate responses from the community. 

Those who have “felt the pain” agreed enthusiastically with one of my tongue-in-cheek best practice recommendations:

To start the User Profile Synchronization successfully, you must sacrifice a chicken at the moment of the green flash of sunset on the night of a new moon.

There were also responses from Microsoft staff and from SharePoint and Forefront Identity Manager (FIM) MVPs, who worked to describe the problems that are arising with the UPS and UPA, and with SharePoint and FIM.

These folks had reasonable and accurate explanations that went into details about versions of “bits” and changes caused by SharePoint patches and cumulative updates (CUs), including the October 2010 CU which had to be pulled because of a change it made that impacted the UPA and manifested itself with a non-startable UPS.

I had to chuckle, because these explanations served to illustrate, to me, exactly the point I was preparing to make this week: Welcome to the new world of IT.  It’s based on multiple services, owned by different organizations, layered and complex beyond belief and, sometimes, impossible to troubleshoot and support. 

This week, I’d like to share with you the details about my experience with UPS in September—a period I refer to as the “14 Days of My Life That I Will Never Get Back.” It’s a horror story perhaps better suited to Halloween than Thanksgiving, but it’s an important one to consider as we head into this new world of IT.

In September, I was writing the SharePoint Training Kit for Microsoft.  The Training Kit is a fantastic resource (it will be published in early Spring, 2011, by the way) and it is unique as a Microsoft resource because it has to step you, the reader, through the creation of a SharePoint farm and all of its features, from beginning to end. 

There is a series of Practices (labs) that you perform, beginning with a “bare metal” machine. By the end of the training kit, you have a fully featured, fully functioning, “real world” SharePoint farm. 

Creating these practices gives me, as an author, an opportunity to fully test all of the moving parts and dependencies not only of SharePoint, but of Windows, Active Directory, client software (like IE and Office 2010), of concepts like least privilege security, and of infrastructure like virtualization, because the reader is guided through building a lab environment using virtual machines.

Sometimes I uncover moving parts and dependencies and interactions that don’t work as well as Microsoft might have us think.  The UPS and UPA practices were a case in point. 

I spent 14 days, about 12-18 hours per day, trying to start the UPS on a farm that was completely clean and set up perfectly (scripted, in fact, so it could be replicated consistently).  FIM just simply wouldn’t start. 

We tried everything we could think of. As I discussed in the previous article, there are a lot of moving parts and requirements for UPS to start successfully.  We tweaked every possible setting on the farm. We tried every iteration of every variable we could think of, including rebuilding the farm on both Hyper-V and VMware. 

No luck.  And of course each “test” of UPS startup means waiting 10-20 minutes for the repetitive attempts to fail, so there was lots of time wasted with each successive iteration of “fixes.”

I had three of the top SharePoint MVPs (huge shoutouts of thanks to Spence Harbar, Todd Klindt, and Matthew McDermott) and generous members of the SharePoint product group all working on the problem with me.  They all put in huge amounts of time and valiant efforts, but nobody could solve the problem.  It was evident that the virtual servers on the farm were configured correctly, and combing the logs revealed nothing other than the fact that FIM wouldn’t start.

Eventually, after 14 days of this nonsense, we tried something completely unrelated to SharePoint. We moved the farm VMs to a different physical host. And, voila, everything started perfectly.

This indicated that our “build” was perfectly fine, but that there was a resource constraint of some kind on the original host that was not “bubbling up.”  Even though we turned all SharePoint logging on to Verbose, FIM wasn’t logging (at all), and there was still zero information as to what problem was actually causing FIM startup to fail. 

Two months later, we are no closer to an answer as to what “ate my life” in September, but the scars have healed.  As I’ve shared this story with folks inside and outside of Microsoft, people nod knowingly. 

Because, really, the root problem here has nothing to do with FIM or SharePoint (both of which are exceptional products) or virtualization or hardware.  It has to do with complexity—with moving parts that work together brilliantly when they work, if they work, but that fail spectacularly when they don’t.

In this case, it’s FIM and SharePoint. FIM is of course owned by a completely separate team at Microsoft from the one that owns SharePoint, and even though SharePoint 2010 and FIM are “stapled together,” settings (e.g. logging levels) don’t get pushed between components, and information doesn’t bubble up in any kind of helpful manner. 

But again, I emphasize it’s not so much a failing of the products themselves, but that UPS & FIM are particularly salient examples of how when two moving parts aren’t really unified, any number of small problems can cause the house of cards to fall.

What does this portend for our future as IT Pros? I think it’s pretty scary.  We are moving “up the stack” of technology, into an increasingly solution-focused level with products like SharePoint that rely on a very complex stack of technologies and moving parts, many of which SharePoint doesn’t really control or unify—it only relies upon them.  You don’t have visibility from SharePoint into the inner workings of FIM (or IIS, or Active Directory or ADFS or…). 

You can extrapolate the problem to cloud services, and interactions with “moving parts” that you don’t even own and don’t have any visibility into. The types of problems we’re setting ourselves up for as IT Pros are ugly, indeed. 

How will you troubleshoot a problem in a custom solution that you’ve built or bought when that custom solution relies on cloud services, such as Azure or Office 365 in the case of Microsoft’s stack, or any number of other “cloud” services? 

Based on what I’ve experienced this year, both in my own UPS drama and with other scenarios at customers, it’s going to be a lot of finger pointing and “we don’t own that so we can’t help you with it” support tickets.

I love SharePoint... I think it’s an absolutely amazing product, both for its out-of-box functionality and as a platform. I admire Microsoft and the product team for getting SharePoint 2010 across the finish line and into the market. I am endlessly grateful for the support Microsoft offers its customers, and me in particular, and for the talented community that extends that support ecosystem.

My real concern has nothing to do with SharePoint itself (or FIM), but with the fact that our trajectory is towards a level of complexity that I don’t believe vendors, let alone IT Pros, are really prepared to support. 

Complexity and interdependencies between distinct, separately owned and supported components are, I fear, the Achilles’ heel of IT and the cloud.  How will you prepare yourself, and protect your Achilles’ heel, as your enterprise sprints into the future?

Related reading:

SharePoint User Profile Synch: Achilles’ Profile (Part I), by Dan Holme

SharePoint 2010 in the Cloud: 17 Risks to Consider, by Dave Chennault