This week we are releasing our sprint 71 work on VS Online. You can read more about the changes on the news page. In the past, we’ve struggled with how to “time” the publishing of release notes with the deployment of the software. As we get more and more instances (we have 4 now and will be adding a 5th one soon), it becomes increasingly hard to get any timing right. Starting this sprint we are going to publish the release notes when the upgrade for the first public instance starts. That means the release notes will always be available a little before the features are but should usually be, at most, a couple of days earlier (and less for many people).
I also wanted to comment on one thing mentioned in the news post: work item performance.
I mentioned last sprint that we are bringing some larger internal teams onto TFS/VS Online internally and, in that process, we’ve run into some performance (and usability) issues. We made some progress on them last sprint and more this sprint.
A couple of examples:
- Work Item form loading – One of the teams has crazy large work item forms. I’m not recommending the practice, but I’ll observe that different teams need different levels of sophistication here. When they came onto TFS with these crazy large forms, opening a work item would take 2.8 seconds on average – painfully slow. With optimizations, including not building the DOM for hidden fields ahead of time, we reduced this to 0.8 seconds. Simpler forms won’t see as dramatic an improvement, but all forms will see some.
- Shared work item queries – this team also keeps a very large number of shared queries. In fact, within a couple of months, they had built up thousands of shared queries – and those take a while to load. For them it was 10.5 seconds to load the query tree – every time you navigate to the queries page. With progressive rendering optimizations, we’ve reduced that to 0.3 seconds. Again, not everyone will see anything that dramatic but everyone will see some improvement.
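Both of those optimizations follow the same general pattern: defer expensive rendering work for items the user can’t see yet. Here’s a minimal, hypothetical sketch of progressive rendering in Python – the names and chunk size are my own illustration, not the actual Web Access code:

```python
def build_node(node):
    # Stand-in for the expensive per-node rendering work
    # (building DOM elements, wiring events, etc.).
    return {"id": node, "rendered": True}

def render_tree_progressively(nodes, visible_count=50):
    """Render the first screenful immediately; yield the rest in
    batches so the UI can paint (and respond) between chunks."""
    # Only the initially visible items are built up front.
    yield [build_node(n) for n in nodes[:visible_count]]
    # Remaining items are materialized lazily, in small batches.
    for i in range(visible_count, len(nodes), visible_count):
        yield [build_node(n) for n in nodes[i:i + visible_count]]
```

With thousands of queries in the tree, the user sees the first batch in a fraction of a second while the rest streams in behind it – which is how a 10.5 second load can feel like 0.3 seconds.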
We are working on additional scenarios for next sprint – like opening a work item form by clicking on a link in an email – that takes seconds today. We’re going to reduce it substantially in sprint 72.
All of these changes will also be included in TFS 2013 Update 4.
Looking forward to sharing more in a few weeks – we have a ton of new stuff queued up for the sprint 72 deployment. Stay tuned.
Today we released the second CTP (community technology preview) of Visual Studio 2013 Update 4 and TFS 2013 Update 4. You can read about the VS changes on the Visual Studio blog. You can download the CTP. Or you can check out the release notes. Also here’s the link to the blog post I wrote on CTP1.
TFS improvements in this CTP include:
- Work item performance improvements
- The ability to maximize the rich text editor and the comments on the history control
- An improved hyperlink experience in the rich text editor
- Increased the number of items you can have in the first and last columns on the kanban board
- Addition of the “stakeholder” license change
The addition of support for stakeholder licensing is certainly the largest change in this CTP. You can read more about stakeholder licensing. And you can go to the “Access Levels” tab in the TFS web access settings page in this CTP to see what it looks like.
I think our next Update 4 release will be our release candidate so we’re getting close to the end.
I’ve been amazed lately how many people don’t know about Team Explorer Everywhere. We’ve had it for a couple of years now. It’s a really nice solution for Java/Eclipse users to use Team Foundation Server or Visual Studio Online. It includes an Eclipse plugin and a command line both of which run on Windows, Mac or Linux. I know quite a few customers who use it and like it but it continues to surprise me that, more often than not, when I mention using TFS on a Mac or Linux, people say “Huh, you can do that?”.
Now, I think there’s a bit of “it’s our own fault” going on here. I’ve recently been giving people here a bunch of grief because you have to work pretty hard to find anything about it on visualstudio.com. I hope that will be fixed soon. There is some content on getting started for Mac/Linux/Eclipse users but that’s about it – and it’s pretty shallow content.
Also, on another thread we had a conversation (again) about long paths. We added support for long paths to the server a while back but, I found out recently, that we did not add support to our cross platform clients. There’s a whole set of issues with supporting it on Windows that I won’t try to rehash here yet again, but there’s no reason we can’t support it in our non-Windows clients. The team has taken up that work and I’m hopeful/expecting that we can deliver it in TEE 2013.3.
Greg Boer posted a very detailed article on how to use the Scaled Agile Framework with TFS: http://blogs.msdn.com/b/visualstudioalm/archive/2014/09/11/scaled-agile-framework-using-tfs-to-support-epics-release-trains-and-multiple-backlogs-whitepaper.aspx
Check it out,
We’ve worked hard over the last year to continue to improve our testing tools offering – for developers, testers and end users. We’ve made our testing experiences available via a web browser, reduced licensing requirements, improved customizability and more.
It’s great to see that, once again, Gartner has placed us in the leader quadrant in their 2014 Integrated Quality Suites report after evaluating our integrated offering with Visual Studio and Team Foundation Server that works well for development and testing teams. So often, I run into people who are surprised that we even have an offering for testers – it’s a very well-kept secret. It’s great to see Gartner recognizing the quality of the offering we have. Please read their report and judge for yourself if Visual Studio ALM can provide a great solution for you.
Gartner asked me to include the following disclaimer…
Gartner, Magic Quadrant for Integrated Software Quality Suites, Mark Driver, Thomas E. Murphy, Nathan Wilson, Ray Valdes, David Norton, Maritess Sobejana 28 August 2014
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
We’ve recently started bringing some larger Microsoft teams onto VS Online. That has been a very enlightening experience. It has particularly highlighted issues with fit and finish for high bandwidth activities like bug triage, performance and scale issues with larger data sets, etc.
This sprint (sprint 70), deploying today, we’ve delivered a large number of smallish improvements based on what we’ve learned. I think this pattern will continue for the next couple of sprints as we get everyone “settled in” to using VS Online in heavy daily use.
**IMPORTANT** This deployment is going to span the weekend and many accounts won’t see all of the improvements until Monday Sept 8th.
You can read more about the specific improvements in Aaron’s new post: http://visualstudio.com/en-us/news/2014-sep-4-vso.
There’s also a chunk of work in this sprint (also described in the release notes) that’s our next installment in our new open web extensibility model. The most significant improvement is substantially improved REST APIs for work item tracking but the coolest addition is Hubot support.
As always, let us know what you think,
I don’t know about you but it’s kind of hard for me to wrap my head around the fact that we are already on the road to delivering Visual Studio 2013.4 and Team Foundation Server 2013.4. Update 3 wasn’t that long ago. Today we are delivering Update 4 CTP (Community Technology Preview) 1. As always, there will be a couple of CTPs – roughly 3 weeks apart – then a release candidate and a final release. So, the final release of Update 4 is still a ways off – Oct/Nov timeframe – but if you are interested in seeing it develop, the CTPs give you a good way to follow it. For TFS functionality, particularly early in the development cycle (like we are now), Visual Studio Online is an even better way to check it out. That way you don’t have to install anything and all of the Update 4 functionality already is, or will soon be, deployed on the cloud service.
As usual, in my post, I’ll focus on the ALM functionality in Update 4 and you can go to the Visual Studio blog to learn more about the IDE pieces. Now, the reality is that the new features in Update 4 are disproportionately in Team Foundation Server. It’s a pretty modest release for the IDE but a pretty big one for TFS.
Here are some valuable links:
So, enough preamble, let’s get to what’s coming.
Since VS/TFS 2012, we’ve had a TFS based code review experience in Visual Studio. It only works with TFVC and, because it’s in the VS IDE, it’s great for VS users, but it’s not so useful for Eclipse (or XCode, …) developers.
Git, being a distributed version control system, brings with it a different code review like workflow called “Pull requests”. A user with changes in a branch or fork submits a pull request for those changes to be merged into another branch/fork. A committer (and others) in the destination is responsible for reviewing the changes, commenting on them, etc and ultimately accepting them by merging them in or rejecting them.
In TFS 2013 Update 4, we are introducing a web based pull request solution for Git. This gives us a good code review solution for Git and it will work reasonably well regardless of what IDE you use. You can read more about it in this detailed walkthrough of pull requests on VS Online.
Sometime in the next year, we will be working to better reconcile the TFVC experience and the Git experience so they aren’t as completely different as they are today.
Update 4 will also include charting improvements in Web Access, including the ability to show trends – up to a year, simple aggregates – sum of values, etc.
Work management improvements
Lately we have been working on improvements, based on feedback, to our work item management UI. None of them are huge but there are lots of nice little improvements. A few are included in CTP1:
Move to position on the backlog – A new keyboard based prioritization capability that’s handy for people who prefer the keyboard or have very long backlogs and get tired of scrolling to drag & drop.
"Full-screen mode" for all the pages under the Backlogs hub – This enables you to eliminate all the chrome and focus on the data you really care about – particularly useful for things like stand up meetings in front of the task board.
Search for an Area path in Web Access – A new way of managing very large area path hierarchies.
And beyond these that are in CTP 1, there will be much more. If you track our VS Online enhancements on our release notes page, you’ll be able to see stuff showing up over the next few sprints that will also make it into Update 4.
That’s all the big stuff in CTP 1. Of course there are lots of bug fixes, various performance improvements, etc. It’s still early, so expect an update with more stuff being added to the list every few weeks.
Thanks and feedback encouraged…
About 6 weeks ago, I announced a plan to make some licensing changes to VS Online and Team Foundation Server that would make it easier for more people to participate in the development process. A few weeks ago, we did part 1 by enabling Test Hub access for VS Online Advanced licenses. Today we completed step 2 by enabling the new Stakeholder license, which gives an unlimited number of people in each account significant access to the work item tracking and Agile planning capabilities, at no charge. You can read my news announcement here: http://www.visualstudio.com/en-us/news/2014-aug-27-vso.
The final step in this wave of changes will happen in TFS 2013 Update 4, later this year, when we enable the Stakeholder license changes in our on-premises product.
Thanks and let us know if we can help you.
Here's a very cool way to use our new VS Online extensibility (that will be available in TFS V.Next) to enable sending an email to a Team Room. Our new REST, OAuth and Service Hooks support can be used in countless creative ways.
We had a pretty serious outage last Thursday; all told, it was a little over 5 hours. The symptoms were that performance was so bad that the service was basically unavailable for most people (though there was some intermittent access as various mitigation steps were taken). It started around 14:00 UTC and ended a little before 19:30 UTC. The duration and severity make this one of the worst incidents we’ve ever had on VS Online.
We feel terrible about it and continue to be committed to doing everything we can to prevent outages. I’m sorry for the problems it caused. The team worked tirelessly from Thursday through Sunday both to address the immediate health issues and to fix underlying bugs that might cause recurrences.
As you might imagine, for the past week, we’ve been hard at work trying to understand what happened and what changes we have to make to prevent such things in the future. It is often very difficult to find proof of the exact trigger for outages but you can learn a ton by studying them closely.
On an outage like this, there’s a set of questions I always ask, and they include:
What happened was that one of the core SPS (Shared Platform Services) databases became overwhelmed with database updates and started queuing up so badly that it effectively blocked callers. Since SPS is part of the authentication and licensing process, we can’t just completely ignore it – though I would suggest that if it became very sluggish, it wouldn’t be the end of the world if we bypassed some licensing checks to keep the service responsive.
What was the trigger? What made it happen today vs yesterday or any other day?
Though we’ve worked hard on this question, we don’t have any definitive answer (we’re still pursuing it though). We know that before the incident, some configuration changes were made that caused a significant increase in traffic between our “TFS” service and our “SPS” (Shared Platform Service). That traffic involved additional license validation checks that had been improperly disabled. We also know that, at about the same time, we saw a spike in latencies and failed deliveries of Service Bus messages. We believe that one or both of these were key triggers but we are missing some logging on SPS database access to be able to be 100% certain. Hopefully, in the next few days, we’ll know more conclusively.
What was the “root cause”?
This is different from the trigger in the sense that the trigger is often a condition that caused some cascading effect. The root cause is more about understanding why the effect cascaded and why it took the system down. It turns out that, I believe, the root cause was that we had accumulated a series of bugs that were causing extra SPS database work to be done and that the system was inherently unstable – from a performance perspective. It just took a poke at the system – in the form of extra identity or licensing churn – to cause a ripple effect on these bugs. Most, but not all, of them were introduced in the last few sprints. Here’s a list of the “core” causal bugs that we’ve found and fixed so far:
- Many calls from TFS -> SPS were inappropriately updating the “TFS service” identity's properties. This created SQL write contention and invalidated the identity by sending a Service Bus message from SPS -> TFS. This message caused the app tiers to invalidate their cache and subsequent TFS requests to make a call to SPS causing further property updates and a vicious cycle.
- A bug in 401-handling code was making an update to the identity causing an invalidation of the identity's cache – no vicious cycle but lots of extra cache flushes.
- A bug in the Azure Portal extension service was retrying 401s every 5sec.
- An old behavior that was causing the same invalidation 'event' to be resent from each SPS AT (user1 was invalidated on AT1, user2 was invalidated from AT2 -> user1 will be sent 2 invalidations). And we have about 4 ATs so this can have a pretty nasty multiplicative effect.
We’ve also found/fixed a few “aggravating” bugs that made the situation worse but wouldn’t have been bad enough to cause serious issues on their own:
- Many volatile properties were being stored in Identity's extended properties causing repeated cache invalidations and broad “change notifications” to be sent to listeners who didn’t care about the property changes.
- A few places were updating properties with unchanged values causing an unnecessary invalidation and SQL round trips.
All of these, in some form, have to do with updates to identities in the system that then often cause propagating change notifications (which in some cases were over propagated) that caused extra processing/updates/cache invalidations. It was “unstable” because anything that caused an unexpected increased load in these identity updates would spiral out of control due to multiplicative effects and cycles.
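Several of the fixes above boil down to one defensive pattern: don’t write, and don’t notify, when nothing actually changed. A hypothetical sketch of that guard in Python (the class and names are my own illustration, not the actual SPS code):

```python
class IdentityStore:
    """Toy property store that only invalidates caches on real changes."""

    def __init__(self):
        self.props = {}
        self.invalidations = 0  # stand-in for Service Bus invalidation messages

    def set_property(self, key, value):
        # Guard: skip the SQL write and the cache-invalidation message
        # entirely when the value is unchanged. Without this check, every
        # redundant update fans out to all listeners and feeds the cycle.
        if key in self.props and self.props[key] == value:
            return False
        self.props[key] = value
        self.invalidations += 1  # one notification would be sent here
        return True
```

A one-line check like this is what separates "one invalidation per real change" from the multiplicative storm of invalidations described above.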
What did we learn from the event?
I always want to look beyond the immediate and understand the underlying pattern. This is sometimes called “The 5 whys”. This is, in fact, the most important question in the list. Why did this happen and what can we do differently? Not what bugs did we hit. Why were those bugs there? What should we have done to ensure those bugs were caught in the design/development process before anything went into production?
Let me start with a story. Way back in 2008, when we were beginning to rollout TFS across very large teams at Microsoft, we had a catastrophe. We significantly underestimated the load that many thousands of people and very large scale build labs would put on TFS. We lived in hell for close to 9 months with significant performance issues, painful daily slowdowns and lots of people sending me hate mail.
My biggest learning from that was, when it comes to performance, you can’t trust abstractions. In that case, we were treating SQL Server as a relational database. What I learned is that it’s really not. It’s a software abstraction layer over disk I/O. If you don’t know what’s happening at the disk I/O layer, you don’t know anything. Your ignorance may be bliss – but when you get hit with a 10x or 100x scale/performance requirement, you fall over dead. We went very deep into SQL disk layout, head seeks, data density, query plans, etc. We optimized the flows from the top to the very bottom and made sure we knew where all the CPU went, where all the I/Os went, etc. When we were done, TFS scaled to crazy large teams and code bases.
We then put in place regression tests that would measure changes, not just in time but also in terms of SQL round trips, etc.
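The idea of baselining resource cost, not just elapsed time, can be sketched as a unit test. This is an illustrative pattern in Python (the operation and threshold are hypothetical), not the actual TFS test infrastructure:

```python
class CountingConnection:
    """Wraps a database execute function and counts round trips."""

    def __init__(self, real_execute):
        self.real_execute = real_execute
        self.round_trips = 0

    def execute(self, sql):
        self.round_trips += 1
        return self.real_execute(sql)

def open_work_item(conn, work_item_id):
    # Stand-in for the real operation under test.
    conn.execute("SELECT * FROM WorkItems WHERE Id = %d" % work_item_id)
    conn.execute("SELECT * FROM Links WHERE SourceId = %d" % work_item_id)

def test_open_work_item_round_trips():
    conn = CountingConnection(lambda sql: [])
    open_work_item(conn, work_item_id=42)
    # Fails the build if a code change regresses the SQL cost of the
    # operation, not merely its wall-clock time on the test machine.
    assert conn.round_trips <= 3, "regression: %d round trips" % conn.round_trips
```

The key property is that round-trip counts are deterministic, so the test catches a cost regression even on fast hardware where the timing still looks fine.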
So back to last Thursday… We’ve gotten sloppy. Sloppy is probably too harsh. As with any team, we are pulled in the tension between eating our Wheaties and adding capabilities that customers are asking for. In the drive toward rapid cadence, value every sprint, etc., we’ve allowed some of the engineering rigor that we had put in place back then to atrophy – or more precisely, not carried it forward to new code that we’ve been writing. This, I believe, is the root cause – Developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations. You must, of course, be able to do this in synthetic test environments – like unit tests, but also in production environments because you’ll never catch everything in your tests.
So, we’ve got some bugs to fix and some more debt to pay off in terms of tuning the interaction between TFS and SPS but, most importantly, we need to put in place some infrastructure to better measure and flag changes in end to end cost – both in test and in production.
The irony here (not funny irony but sad irony), is that there has been some renewed attention on this in the team recently. A few weeks ago, we had a “hack-a-thon” for small groups of people on the team to experiment with new ideas. One of the teams built a prototype of a solution for capturing important performance tracing information across the end-to-end thread of a request. I’ll try to do a blog post in the next couple of weeks to show some of these ideas. And just the week before this incident Buck (our Dev director) and I were having a conversation about needing to invest more in this very scenario. Unfortunately we had a major incident before we could address the gap.
What are we going to do?
OK, so we learned a lot, but what are we actually going to do about it? Clearly step 1 is to mitigate the emergency and get the system back to sustainable health quickly. I think we are there now. But we haven’t addressed the underlying whys yet. So, some plans we are making now include:
- We will analyze call patterns within SPS and between SPS and SQL and build the right telemetry and alerts to catch situations early. Adding baselines into unit and functional tests will enforce that baselines don't get violated when a dev checks in code.
- Partitioning and scaling of SPS Config DB will be a very high priority. With the work to enable tenant-level hosts, we can partition identity related information per tenant. This enables us to scale SPS data across databases, enabling a higher “ceiling” and more isolation in the event things ever go badly again.
- We are looking into building an ability for a service to throttle and recover itself from a slow or failed dependency. We should leverage the same techniques for TFS -> SPS communication and let TFS leverage cached state or fail gracefully. (This is actually also a remaining action item from the previous outage we had a month or so ago.)
- We should test our designs for lag in Service Bus delivery and ensure that our functionality continues to work or degrades gracefully.
- Moving to Service Bus batching APIs and partitioned topics will help us scale better to handle very 'hot' topics like Identity.
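The partitioning item above is worth a quick sketch. Once identity data is partitioned per tenant, a routing function maps each tenant to a fixed database, so one hot database can no longer block every account on the service. This is a hypothetical illustration in Python – the database names and hashing scheme are mine, not the actual SPS design:

```python
import hashlib

# Hypothetical set of SPS partition databases.
DATABASES = ["sps_db_0", "sps_db_1", "sps_db_2", "sps_db_3"]

def database_for_tenant(tenant_id):
    """Deterministically route a tenant's identity data to one partition.

    Using a stable hash (not Python's randomized hash()) means every
    app tier agrees on the mapping, and a problem in one database is
    isolated to the tenants that live there.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return DATABASES[int(digest, 16) % len(DATABASES)]
```

Real systems typically add a lookup table on top of pure hashing so tenants can be rebalanced, but the isolation property – a higher “ceiling” and contained blast radius – is the same.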
As always, hopefully you get some value from the details behind the mistakes we make. Thank you for bearing with us.
Well, today was my turn to take the ALS Ice Bucket Challenge. This morning I was tagged by both Adam Cogan and Scott Guthrie. Tempting as it is, though, I’m not doing it twice. I don’t know if it’s typical, but Adam challenged me to complete the task within 24 hours. I spent today thinking about how I would get home early enough, how I would orchestrate it and where I would get the ice.
I decided to do it on the farm (notice the cows behind me) and to have my kids help me. When I first told them that I needed their help, they moaned in exasperation at having to “help dad again”. When I told them, they would be pouring ice cold water on me while filming it, they jumped up and down screaming with excitement.
I gave them instructions ahead of time on what to do. Apparently I wasn’t clear enough. You’ll notice that my daughter poured the water on me very gradually. I was hoping for an instantaneous drenching but instead I got a slow motion frostbite. It seemed like it lasted 10 minutes.
I’ve now named James Philips, David Treadwell and Buck Hodges as my victims, ahem, I mean nominees in the challenge. Good luck to you all – it’s for a good cause. Of course, you could just donate $100 and chicken out. Or you could drench yourself and donate $100 anyway. Up to you.
Yesterday I rolled out the release notes for our sprint 69 deployment. Check out the updates!
P.S. the outage we had last Thursday was *not* caused by the rollout of updates (or, at best only tangentially so). I’m hoping to get the outage retrospective written up tomorrow.
Today we released a CTP of our next major Visual Studio release. You can look in the release notes to see what is in it. You can also read about the highlights in John’s post. There are no startling new improvements in this CTP but lots of nice, smaller things.
Just a reminder, we aren’t shipping CTP previews of TFS “14” at this point. Visual Studio Online is still your best way to check out the latest TFS improvements.
Let me start by apologizing for the pretty horrific outage we had last week. I’ve been silent on it because I was on a “last family vacation” in Europe before my oldest son goes off to college. Buck Hodges and others have been working hard on it. I’ve been reading up on everything that happened and everything that’s been done. I need to spend some time talking with the team but I expect to publish a lengthy retrospective in the next few days. Stay tuned for more.
At this point, I think enough has been done that we won’t see a recurrence of the issues any time soon. However, there is some underlying work that will take some time (small number of months I expect) to put in place the infrastructure necessary to avoid another incident in the same class.
Again, I’m very sorry for the disruption. We take all incidents very seriously and work hard to ensure they won’t ever happen again.
Today we released the final version of Visual Studio 2013 Update 3 and Team Foundation Server Update 3. You can get the update using the link below. Note that the link includes both the Visual Studio & TFS downloads (among other things) if you expand the Details section on the page.
I’ve blogged about the features before but I’ll reiterate that some of the biggest enhancements in this Update include:
- CodeLens support for Git
- Configurable display of in-progress items on the backlog (a common customer request)
- Application Insights tooling
- Desktop app support in the memory usage tool (including WPF)
- Release management support for PowerShell/DSC and Chef
- Test plan/suite customization, permissions, auditing, etc.
- Cloud load testing integration with Application Insights for app under test telemetry/diagnostics
- and a substantial number of bug fixes (listed in the KB article).
Alongside Update 3, we are also releasing an updated CTP of our Cordova tooling with support for Windows 7. Make sure to look for that too.
Thanks for all of your help with validating early drops of this release and we hope you like it. We’re happy to be delivering it and already turning our attention to Update 4. I’m hoping we’re going to see several very nice improvements to the TFS Agile planning tools in that release plus a lot more. Stay tuned. I suspect we’ll ship the first CTP of Update 4 in a couple of months.
I was out doing chores on my farm yesterday morning and ran across something surprising (to me, at least). By the pig pen, we have an electric fence charger and it is covered by buckets. For some reason I don’t understand (my wife did it), there are two buckets – one nested inside the other. Yesterday, I removed the top bucket and inside it, I found a frog. It took me a minute to recognize it as a frog because it was so white it didn’t look much like a frog.
I know frogs can change colors to match their surroundings but white? I’ve seen documentaries of incredible color changing animals but those are exotic animals in some exotic place – not a frog in my back yard, right?
10 points to anyone who can identify what kind of frog (or maybe toad for all I know) it is.
It was pretty cool.
Sorry it took me a week and a half to get to this.
We had the most significant VS Online outage we’ve had in a while on Friday July 18th. The entire service was unavailable for about 90 minutes. Fortunately it happened during non-peak hours, so fewer customers were affected than might have been, but I know that’s small consolation to those who were.
My main goal from any outage that we have is to learn from it. With that learning, I want to make our service better and also share it so, maybe, other people can avoid similar errors.
The root cause was that a single database in SQL Azure became very slow. I actually don’t know why, so I guess it’s not really the root cause but, for my purposes, it’s close enough. I trust the SQL Azure team chased that part of the root cause – we certainly did loop them in on the incident. Databases will, from time to time, get slow and SQL Azure has been pretty good about that over the past year or so.
The scenario was that Visual Studio (the IDE) was calling our “Shared Platform Services” (a common service instance managing things like identity, user profiles, licensing, etc.) to establish a connection to get notified about updates to roaming settings. The Shared Platform Services were calling Azure Service Bus and it was calling the ailing SQL Azure database.
The slow Azure database caused calls to the Shared Platform Services (SPS) to pile up until all threads in the SPS thread pool were consumed, at which point all calls to TFS eventually got blocked due to dependencies on SPS. The ultimate result was VS Online being down until we manually disabled our connection to Azure Service Bus and the logjam cleared itself up.
There was a lot to learn from this. Some of it I already knew, some I hadn’t thought about but, regardless of which category it was in, it was a damn interesting/enlightening failure.
**UPDATE** Within the first 10 minutes I've been pinged by a couple of people on my team pointing out that people may interpret this as saying the root cause was Azure DB. Actually, the point of my post is that it doesn't matter what the root cause was. Transient failures will happen in a complex service. The interesting thing is that you react to them appropriately. So regardless of what the trigger was, the "root cause" of the outage was that we did not handle a transient failure in a secondary service properly and allowed it to cascade into a total service outage. I'm also told that I may be wrong about what happened in SB/Azure DB. I try to stay away from saying too much about what happens in other services because it's a dangerous thing to do from afar. I'm not going to take the time to go double check and correct any error because, again, it's not relevant to the discussion. The post isn't about the trigger. The post is about how we reacted to the trigger and what we are going to do to handle such situations better in the future.
Don’t let a ‘nice to have’ feature take down your mission critical ones
I’d say the first and foremost lesson is “Don’t let a ‘nice to have’ feature take down your mission critical ones.” There’s a notion in services that all services should be loosely coupled and failure tolerant. One service going down should not cause a cascading failure, causing other services to fail but rather only the portion of functionality that absolutely depends on the failing component is unavailable. Services like Google and Bing are great at this. They are composed of dozens or hundreds of services and any single service might be down and you never even notice because most of the experience looks like it always does.
The crime of this particular case is that the feature that was experiencing the failure was Visual Studio settings roaming. If we had properly contained the failure, your roaming settings wouldn’t have synchronized for 90 minutes and everything else would have been fine. No big deal. Instead, the whole service went down.
In our case, all of our services were written to handle failures in other services but, because the failure ultimately resulted in thread pool exhaustion in a critical service, it reached the point that no service could make forward progress.
Smaller services are better
Part of the problem here was that a very critical service like our authentication service shared an exhaustible resource (the thread pool) with a very non-critical service (the roaming settings service). Another principle of services is that they should be factored into small atomic units of work if at all possible. Those units should be run with as few common failure points as possible and all interactions should honor “defensive programming” practices. If our authentication service goes down, then our service goes down. But the roaming settings service should never take the service down. We’ve been on a journey for the past 18 months or so of gradually refactoring VS Online into a set of loosely coupled services. In fact, only about a year ago, what is now SPS was factored out of TFS into a separate service. All told, we have about 15 or so independent services today. Clearly, we need more :)
How many times do you have to retry?
Another one of the long-standing rules in services is that transient failures are “normal”. Every service consuming another service has to be tolerant of dropped packets, transient delays, flow-control backpressure, etc. The primary technique is to retry when a service you are calling fails. That’s all well and good. The interesting thing we ran into here was a set of cascading retries. Our situation was:
Visual Studio –> SPS –> Service Bus –> Azure DB
When Azure DB failed Service Bus retried 3 times. When Service Bus failed, SPS retried 2 times. When SPS failed, VS retried 3 times. 3 * 2 * 3 = 18 times. So, every single Visual Studio client launched in that time period caused a total of 18 attempts on the SQL Azure database. Since the problem was that the database was running slow (resulting in a timeout after like 30 seconds), that’s 18 tries * 30 seconds = 9 minutes each.
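To make the multiplication concrete, here is a minimal sketch of the cascading-retry effect. This is illustrative Python, not the actual service code (the real services are .NET); the function names stand in for the layers in the diagram above. Each layer independently retries its downstream call, so the attempts multiply: 3 × 2 × 3 = 18 database hits per client launch.

```python
# Sketch of cascading retries: each layer retries independently,
# so retry counts multiply down the stack.

def call_with_retries(downstream, attempts):
    """Call `downstream`, retrying up to `attempts` times on failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return downstream()
        except RuntimeError as e:
            last_error = e
    raise last_error

db_attempts = 0

def azure_db():            # the ailing database: every call times out
    global db_attempts
    db_attempts += 1
    raise RuntimeError("timeout after 30 seconds")

def service_bus():         # Service Bus retries the database 3 times
    return call_with_retries(azure_db, 3)

def sps():                 # SPS retries Service Bus 2 times
    return call_with_retries(service_bus, 2)

def visual_studio():       # the VS client retries SPS 3 times
    return call_with_retries(sps, 3)

try:
    visual_studio()
except RuntimeError:
    pass

print(db_attempts)  # 18 database attempts for a single client launch
```

With each attempt hanging for a 30-second timeout, that single client holds resources for 18 × 30 seconds = 9 minutes.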
Calls in this stack of services piled up and up and up until, eventually, the thread pool was full and no further requests could be processed.
As it turns out, SQL Azure is actually very good about communicating to its callers whether or not a retry is worth attempting. Service Bus doesn’t pay attention to that and doesn’t communicate it to its callers. And neither does SPS. So a new rule I learned is that it’s important that any service carefully determine, based on the error, whether or not retries are called for *and* communicate back to its callers whether or not retries are advisable. If this had been done, each connection would have been only 30 seconds rather than 9 minutes and likely the situation would have been MUCH better.
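One common way to implement that rule is to model “don’t retry” as a distinct error type that every layer re-raises immediately instead of retrying. The sketch below (again illustrative Python, not the actual VSO code) extends the retry helper from the earlier example with that behavior:

```python
# Sketch of propagating a "retries won't help" signal up the stack.

class NonRetryableError(Exception):
    """The failing service has signaled that retrying won't help."""

def call_with_retries(downstream, attempts):
    last_error = None
    for _ in range(attempts):
        try:
            return downstream()
        except NonRetryableError:
            raise              # propagate immediately; no more attempts
        except RuntimeError as e:
            last_error = e     # transient failure: worth retrying
    raise last_error

db_attempts = 0

def azure_db():
    global db_attempts
    db_attempts += 1
    # The database knows this failure class won't clear up soon,
    # so it tells its caller not to bother retrying.
    raise NonRetryableError("database overloaded; do not retry")

def service_bus():
    return call_with_retries(azure_db, 3)

def sps():
    return call_with_retries(service_bus, 2)

def visual_studio():
    return call_with_retries(sps, 3)

try:
    visual_studio()
except NonRetryableError:
    pass

print(db_attempts)  # 1 attempt instead of 18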
A traffic cop goes a long way
Imagine that SPS kept count of how many concurrent calls were in progress to Service Bus. Knowing that this was a “low priority” service, that calls were synchronous and that the thread pool was limited, it could have decided that, once the number of concurrent calls exceeded some threshold (let’s say 30, for argument’s sake), it would start rejecting all subsequent calls until the traffic jam drained a bit. Some callers would very quickly get rejected and their settings wouldn’t be roamed, but we’d never have exhausted threads and the higher priority services would have continued to run just fine. Assuming the client is set to attempt a reconnect on some very infrequent interval, the system would eventually self-heal once the underlying database issue cleared up.
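That “traffic cop” amounts to a concurrency limiter in front of the low-priority downstream call. Here is a minimal sketch of the idea, assuming an in-process counter guarded by a lock; the names and the threshold are illustrative, not what SPS actually does:

```python
# A minimal "traffic cop": cap concurrent calls to a low-priority
# downstream service and fast-fail callers over the threshold, so
# the shared thread pool is never exhausted.
import threading

class ConcurrencyLimiter:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.lock = threading.Lock()

    def call(self, fn):
        with self.lock:
            if self.in_flight >= self.max_concurrent:
                # Fast rejection: the caller fails immediately instead
                # of tying up a thread waiting on an ailing service.
                raise RuntimeError("busy; try again later")
            self.in_flight += 1
        try:
            return fn()
        finally:
            with self.lock:
                self.in_flight -= 1

limiter = ConcurrencyLimiter(max_concurrent=30)

def roam_settings():
    # pretend this call blocks on Service Bus
    return "roamed"

# Under normal load, calls go through fine.
assert limiter.call(roam_settings) == "roamed"

# But once 30 calls are already stuck in flight, new ones fail fast
# instead of queueing up and exhausting the thread pool.
limiter.in_flight = 30   # simulate a traffic jam
try:
    limiter.call(roam_settings)
    rejected = False
except RuntimeError:
    rejected = True
print(rejected)  # True
```

The key property is that a rejected call consumes almost nothing: no thread sits blocked, so the high-priority services sharing the pool keep running.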
Threads, threads and more threads
I’m sure I won’t get out of this without someone pointing out that one of the root causes here is that the inter-service calls were synchronous. They should have been asynchronous, therefore not consuming a thread and never exhausting the thread pool. It’s a fair point, but not my highest priority takeaway here. You are almost always consuming some resource, even on async calls – usually memory. That resource may be large but it, too, is not inexhaustible. The techniques I’ve listed above are valuable regardless of sync or async, and will also prevent other side effects, like pounding an already ailing database into the dirt with excessive retries.
So, it’s a good point, but I don’t think it’s a silver bullet.
So, onto our backlog go another series of “infrastructure” improvements and practices that will help us provide an ever more reliable service. All software will fail eventually, somehow. The key thing is to examine each and every failure, trace the failure all the way to the root cause, generalize the lessons and build defenses for the future.
I’m sorry for the interruption we caused. I can’t promise it won’t happen again, *but* after a few more weeks (for us to implement some of these defenses), it won’t happen again for these reasons.
Thanks as always for joining us on this journey and being astonishingly understanding as we learn. And hopefully these lessons provide some value to you in your own development efforts.
A month ago I wrote about our newly enabled capability to measure quality of service on a customer by customer basis. In that post I mentioned that we had actually identified a customer experiencing issues before they even contacted us about them and had started working with them to understand the issues. Well, the rest of that story…
We’ve identified the underlying issue. The customer had an unusually large number of Team Projects in their account and some of our code paths were not scaling well, resulting in slower than expected response times. We have debugged it, coded a fix and will be deploying it with our next sprint deployment.
Now that’s cool. We’ve already started working with a few of the other accounts that have the lowest quality of service metrics. Our plan is to make this a regular part of our sprint rhythm where, every sprint, we investigate a top few customer accounts on the list and try to deploy fixes within a sprint or two – improving the service every sprint.
Today we began deployment of our sprint 68 work. There’s a bunch of really good stuff there. I say “began” because deployment is a multi-day event now as we roll it out across instances. Everyone should have the updates by tomorrow (Tue) afternoon. You can read the release notes to get details.
You’ll see that one part of the licensing changes I described a couple of weeks ago is now live – the addition of Test Hub access to the Visual Studio Online Advanced license. The remaining stakeholder licensing changes are still tracking to go live in mid-August. Stay tuned for more.
Azure Active Directory support
The biggest thing in the announcement is the next step in our rollout of Azure Active Directory (AAD) support in VS Online. We started this journey back in April with the very first flicker of AAD support at the Build conference. We added more support at TechEd but I’ve stayed pretty quiet about it because, until this week, there was no way to convert an existing account to AAD. With this deployment we’ve enabled it. Officially it’s in preview and you have to ask to get access to do it, but we’re accepting all requests, so it’s nothing more than a speed bump to keep too big a rush from happening all at once. With this last set of changes, you can:
- Associate your OrgID (AAD/AD credentials) with your MSDN subscription, if you have one, and use that to grant your VSO license
- Create a new account connected to an AAD tenant
- Connect an existing account to an AAD tenant
- Disconnect an account from an AAD tenant
- Log in with either a Microsoft Account or an OrgID (AAD only or synchronized from your on-premises Active Directory), giving you single sign-on with your corporate credentials, Office 365, etc.
- I’m probably forgetting something but you get the point
I encourage you to read the docs and more docs for details. One thing I’ve asked to have included in the docs, though I’m still not satisfied with the clarity, is one detail about binding an existing account to AAD. If you have an existing account not connected to AAD then, by definition, you are using Microsoft Accounts. When you connect your VS Online account to AAD, your identities have to be recognized by AAD to authenticate. You have 3 options for each existing user of your account:
- Add the Microsoft Account as an “external identity” in your AAD. All your data and in-progress work carries forward. The drawback is that external Microsoft Accounts won’t fully honor your AAD policies – like two-factor auth, password policies, etc. It’s still a Microsoft Account that’s been associated with your AAD, giving your AAD admin central control over access.
- If you created your Microsoft Account using the same email address as your AD/AAD identity (for instance, for me it’s email@example.com) then, when you connect your VSO account to AAD, your Microsoft Account will be seamlessly rebound to your corporate identity. All your data and in-progress work carries forward and your login gets the full set of AAD governance. This is the “best” of the 3 options but requires that you created your Microsoft Account a certain way.
- If you can’t do #2 and you don’t want to do #1, then you can just add your AAD identity as a “new” VS Online user and remove your old Microsoft Account identity from the VS Online account. To VS Online this is just like adding a new user and deleting an old user. VS Online has no idea they are the same person. This has the advantage of getting full AAD administration but the downside that in-progress work (checkouts, work items assigned to you, …) and other places where your old MS Account identity was associated need to either be deleted or reassigned to your new identity. Work items can be reassigned. Workspaces, shelvesets and stuff like that can be deleted. History will always be associated with your “old” Microsoft Account identity.
So that’s a good segue to what’s left for us to do to really complete AAD support…
- Add the ability to migrate one identity to any other identity, thereby having all references in VSO changed to the new user (to get around the issue in #3). This is on the backlog but is going to take a while.
- Add support for using AAD groups (to assign permissions, query work items, etc.) in VS Online. Today you can use AAD users, but you can’t yet use AAD groups. This feature is coming fairly soon (within the next few sprints).
I’m sure I’m missing something else we haven’t done yet but I don’t think anything big. AAD support is ready for prime time for most user scenarios.
And I have to say something about account deletion. Until this week, VS Online account deletions could only be done by contacting support – and we had to do a delicate dance to ensure that the person requesting a deletion had the rights to do so. For the past few months, account deletion has been the #1 support request, with dozens of requests a month. There are all kinds of reasons –
- Merging multiple accounts into one
- Moving from VS Online back to on-premises TFS
- Wanting to just wipe everything out and start over (for instance after an evaluation)
With this week’s deployment, account deletion is self service (assuming you are an account administrator). However, it’s important to understand that all account deletes are “soft” deletes only. Meaning the account is “marked for deletion” and no one can access it any more but it is *not* actually deleted. It will be physically deleted, I believe, 90 days after you delete it in the UI. This gives you a window to have your “Oh sh%t!” moment. If you realize that you deleted something you did not intend to, you can contact support and they can “undelete” your account. This is indicative of a general direction we are headed where all deletes are “soft” and you always have a time window to go back and recover it. It will take us quite a while to get there on everything that can be deleted but we’ll make progress every chance we get. Of course, if there’s some reason you *REALLY* need a VS Online account permanently deleted immediately, you can contact support to help you.
Oh, and lest I manage to avoid mentioning any feature in this deployment, check out the new trend reports. They are very cool and make the VS Online charting experience even more useful. And, because I know several people will ask, yes, these charting enhancements will be added to Team Foundation Server (our on-premises product). If everything goes according to plan, they will be in TFS 2013.4 (Update 4) later this fall.
It’s a bunch of stuff. Maybe you have to be a bit of a geek to appreciate all of it. We’ve been working on some of this for a good while and I’m really happy to see it all available. Check it out and let us know what you think.