Greg Boer posted a very detailed article on how to use the Scaled Agile Framework with TFS: http://blogs.msdn.com/b/visualstudioalm/archive/2014/09/11/scaled-agile-framework-using-tfs-to-support-epics-release-trains-and-multiple-backlogs-whitepaper.aspx
Check it out,
We’ve worked hard over the last year to continue to improve our testing tools offering – for developers, testers and end users. We’ve made our testing experiences available via a web browser, reduced licensing requirements, improved customizability and more.
It’s great to see that, once again, Gartner has placed us in the leader quadrant in their 2014 Integrated Quality Suites report after evaluating our integrated offering with Visual Studio and Team Foundation Server, which works well for development and testing teams. So often, I run into people who are surprised that we even have an offering for testers – it’s a very well-kept secret. It’s great to see Gartner recognizing the quality of the offering we have. Please read their report and judge for yourself if Visual Studio ALM can provide a great solution for you.
Gartner asked me to include the following disclaimer…
Gartner, Magic Quadrant for Integrated Software Quality Suites, Mark Driver, Thomas E. Murphy, Nathan Wilson, Ray Valdes, David Norton, Maritess Sobejana 28 August 2014
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
We’ve recently started bringing some larger Microsoft teams into VS Online. That has been a very enlightening experience. It has particularly highlighted issues with fit and finish for high-bandwidth activities like bug triage, performance and scale issues with larger data sets, etc.
This sprint (sprint 70), deploying today, we’ve delivered a large number of smallish improvements based on what we’ve learned. I think this pattern will continue for the next couple of sprints as we get everyone “settled in” to using VS Online in heavy daily use.
**IMPORTANT** This deployment is going to span the weekend and many accounts won’t see all of the improvements until Monday Sept 8th.
You can read more about the specific improvements in Aaron’s new post: http://visualstudio.com/en-us/news/2014-sep-4-vso.
There’s also a chunk of work in this sprint (also described in the release notes) that’s our next installment in our new open web extensibility model. The most significant improvement is substantially improved REST APIs for work item tracking but the coolest addition is Hubot support.
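To give a feel for what the improved work item tracking REST APIs enable, here is a small Python sketch that fetches work items by id. The account name and credentials are placeholders, and the exact URL shape and API version shown here are assumptions based on the current preview; check the official REST reference for the authoritative routes.

```python
import base64
import json
import urllib.request

# Hypothetical account name -- substitute your own VS Online account.
ACCOUNT = "fabrikam"
API_VERSION = "1.0"

def work_items_url(ids):
    """Build the REST URL for fetching work items by id."""
    return ("https://{0}.visualstudio.com/DefaultCollection/_apis/wit/workitems"
            "?ids={1}&api-version={2}").format(
                ACCOUNT, ",".join(str(i) for i in ids), API_VERSION)

def auth_header(user, token):
    """Basic auth header built from alternate credentials or a token."""
    raw = "{0}:{1}".format(user, token).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}

def fetch_work_items(ids, user, token):
    """Fetch work items and return the list of work item dictionaries."""
    req = urllib.request.Request(work_items_url(ids),
                                 headers=auth_header(user, token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["value"]
```

From there, each returned dictionary carries the work item's fields (title, state, assigned to, etc.) as plain JSON, which is what makes integrations like Hubot scripts straightforward to write.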
As always, let us know what you think,
I don’t know about you but it’s kind of hard for me to wrap my head around the fact that we are already on the road to delivering Visual Studio 2013.4 and Team Foundation Server 2013.4. Update 3 wasn’t that long ago. Today we are delivering Update 4 CTP (Community Technology Preview) 1. As always, there will be a couple of CTPs – roughly 3 weeks apart, then a release candidate and a final release. So, the final release of Update 4 is still a ways off – Oct/Nov timeframe, but if you are interested in seeing it develop, the CTPs give you a good way to follow it. For TFS functionality, particularly early in the development cycle (like we are now), Visual Studio Online is an even better way to check it out. That way you don’t have to install anything and all of the Update 4 functionality already is or will soon be deployed on the cloud service.
As usual, in my post, I’ll focus on the ALM functionality in Update 4 and you can go to the Visual Studio blog to learn more about the IDE pieces. Now, the reality is that the new features in Update 4 are disproportionately in Team Foundation Server. It’s a pretty modest release for the IDE but a pretty big one for TFS.
Here are some valuable links:
So, enough preamble, let’s get to what’s coming.
Since VS/TFS 2012, we’ve had a TFS based code review experience in Visual Studio. It only works with TFVC and, because it’s in the VS IDE, it’s great for VS users, but it’s not so useful for Eclipse (or XCode, …) developers.
Git, being a distributed version control system, brings with it a different code review-like workflow called “pull requests”. A user with changes in a branch or fork submits a pull request for those changes to be merged into another branch/fork. A committer (and others) in the destination is responsible for reviewing the changes, commenting on them, etc., and ultimately either accepting them by merging them in or rejecting them.
In TFS 2013 Update 4, we are introducing a web based pull request solution for Git. This gives us a good code review solution for Git and it will work reasonably well regardless of what IDE you use. You can read more about it in this detailed walkthrough of pull requests on VS Online.
Sometime in the next year, we will be working to better reconcile the TFVC experience and the Git experience so they aren’t as completely different as they are today.
Update 4 will also include charting improvements in Web Access, including the ability to show trends (up to a year), simple aggregates (sum of values), etc.
Work management improvements
Lately we have been working on improvements, based on feedback, to our work item management UI. None of them are huge but there are lots of nice little improvements. A few are included in CTP1:
Move to position on the backlog – A new keyboard based prioritization capability that’s handy for people who prefer the keyboard or have very long backlogs and get tired of scrolling to drag & drop.
"Full-screen mode" for all the pages under the Backlogs hub – This enables you to eliminate all the chrome and focus on the data you really care about – particularly useful for things like stand up meetings in front of the task board.
Search for an Area path in Web Access – A new way of managing very large area path hierarchies.
And beyond these that are in CTP 1, there will be much more. If you track our VS Online enhancements on our release notes page, you’ll be able to see stuff showing up over the next few sprints that will also make it into Update 4.
That’s all the big stuff in CTP 1. Of course, there are lots of bug fixes, various performance improvements, etc. It’s still early, so expect an update with more stuff added to the list every few weeks.
Thanks and feedback encouraged…
About 6 weeks ago, I announced a plan to make some licensing changes to VS Online and Team Foundation Server that would make it easier for more people to participate in the development process. A few weeks ago, we did part 1 by enabling Test Hub access for VS Online Advanced licenses. Today we have completed step 2 by enabling the new Stakeholder license, allowing an unlimited number of people in each account significant access to the work item tracking and Agile planning capabilities, at no charge. You can read my news announcement here: http://www.visualstudio.com/en-us/news/2014-aug-27-vso.
The final step in this wave of changes will happen in TFS 2013 Update 4, later this year, when we enable the Stakeholder license changes in our on-premises product.
Thanks and let us know if we can help you.
Here's a very cool way to use our new VS Online extensibility (that will be available in TFS V.Next) to enable sending an email to a Team Room. Our new REST, OAuth and Service Hooks support can be used in countless creative ways.
We had a pretty serious outage last Thursday; all told, it was a little over 5 hours. The symptoms were that performance was so bad that the service was basically unavailable for most people (though there was some intermittent access as various mitigation steps were taken). It started around 14:00 UTC and ended a little before 19:30 UTC. The duration and severity make this one of the worst incidents we’ve ever had on VS Online.
We feel terrible about it and continue to be committed to doing everything we can to prevent outages. I’m sorry for the problems it caused. The team worked tirelessly from Thursday through Sunday both to address the immediate health issues and to fix underlying bugs that might cause recurrences.
As you might imagine, for the past week, we’ve been hard at work trying to understand what happened and what changes we have to make to prevent such things in the future. It is often very difficult to find proof of the exact trigger for outages but you can learn a ton by studying them closely.
On an outage like this, there’s a set of questions I always ask, and they include:
What happened?
What happened was that one of the core SPS (Shared Platform Services) databases became overwhelmed with database updates and started queuing up so badly that it effectively blocked callers. Since SPS is part of the authentication and licensing process, we can’t just completely ignore it – though I would suggest that if it became very sluggish, it wouldn’t be the end of the world if we bypassed some licensing checks to keep the service responsive.
What was the trigger? What made it happen today vs yesterday or any other day?
Though we’ve worked hard on this question, we don’t have any definitive answer (we’re still pursuing it though). We know that before the incident, some configuration changes were made that caused a significant increase in traffic between our “TFS” service and our “SPS” (Shared Platform Service). That traffic involved additional license validation checks that had been improperly disabled. We also know that, at about the same time, we saw a spike in latencies and failed deliveries of Service Bus messages. We believe that one or both of these were key triggers but we are missing some logging on SPS database access to be able to be 100% certain. Hopefully, in the next few days, we’ll know more conclusively.
What was the “root cause”?
This is different from the trigger in the sense that the trigger is often a condition that caused some cascading effect. The root cause is more about understanding why the effect cascaded and why it took the system down. I believe the root cause was that we had accumulated a series of bugs that were causing extra SPS database work to be done and that the system was inherently unstable from a performance perspective. It just took some poke at the system, in the form of extra identity or licensing churn, to cause a ripple effect on these bugs. Most, but not all, of them were introduced in the last few sprints. Here’s a list of the “core” causal bugs that we’ve found and fixed so far:
- Many calls from TFS -> SPS were inappropriately updating the “TFS service” identity's properties. This created SQL write contention and invalidated the identity by sending a Service Bus message from SPS -> TFS. This message caused the app tiers to invalidate their cache and subsequent TFS requests to make a call to SPS causing further property updates and a vicious cycle.
- A bug in 401-handling code was making an update to the identity causing an invalidation of the identity's cache – no vicious cycle but lots of extra cache flushes.
- A bug in the Azure Portal extension service was retrying 401s every 5 seconds.
- An old behavior that was causing the same invalidation 'event' to be resent from each SPS AT (user1 was invalidated on AT1, user2 was invalidated from AT2 -> user1 will be sent 2 invalidations). And we have about 4 ATs so this can have a pretty nasty multiplicative effect.
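To make that multiplicative effect concrete, here is a tiny Python model of the old rebroadcast behavior versus the fix (the user names and AT count are illustrative):

```python
from collections import Counter

def fan_out(invalidation_events, num_ats, deduplicate):
    """Model of invalidation delivery across application tiers (ATs).

    invalidation_events: users whose identity was invalidated (on any AT).
    Old behavior (deduplicate=False): every AT rebroadcasts every event,
    so one logical invalidation becomes num_ats notifications.
    Fixed behavior (deduplicate=True): each event is delivered once.
    """
    sent = Counter()
    for user in invalidation_events:
        sent[user] += 1 if deduplicate else num_ats
    return sent

# With 4 ATs, the old behavior amplifies every invalidation 4x,
# and each of those notifications can trigger further cache flushes.
```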
We’ve also found/fixed a few “aggravating” bugs that made the situation worse but wouldn’t have been bad enough to cause serious issues on their own:
- Many volatile properties were being stored in Identity's extended properties causing repeated cache invalidations and broad “change notifications” to be sent to listeners who didn’t care about the property changes.
- A few places were updating properties with unchanged values causing an unnecessary invalidation and SQL round trips.
All of these, in some form, have to do with updates to identities in the system that then often cause propagating change notifications (which in some cases were over-propagated) that caused extra processing/updates/cache invalidations. It was “unstable” because anything that caused an unexpected increase in load from these identity updates would spiral out of control due to multiplicative effects and cycles.
What did we learn from the event?
I always want to look beyond the immediate and understand the underlying pattern. This is sometimes called “The 5 whys”. This is, in fact, the most important question in the list. Why did this happen and what can we do differently? Not what bugs did we hit. Why were those bugs there? What should we have done to ensure those bugs were caught in the design/development process before anything went into production?
Let me start with a story. Way back in 2008, when we were beginning to rollout TFS across very large teams at Microsoft, we had a catastrophe. We significantly underestimated the load that many thousands of people and very large scale build labs would put on TFS. We lived in hell for close to 9 months with significant performance issues, painful daily slowdowns and lots of people sending me hate mail.
My biggest learning from that was, when it comes to performance, you can’t trust abstractions. In that case, we were treating SQL Server as a relational database. What I learned is that it’s really not. It’s a software abstraction layer over disk I/O. If you don’t know what’s happening at the disk I/O layer, you don’t know anything. Your ignorance may be bliss, but when you get hit with a 10x or 100x scale/performance requirement, you fall over dead. We got very deep into SQL disk layout, head seeks, data density, query plans, etc. We optimized the flows from the top to the very bottom and made sure we knew where all the CPU went, where all the I/Os went, etc. When we were done, TFS scaled to crazy large teams and code bases.
We then put in place regression tests that would measure changes, not just in time but also in terms of SQL round trips, etc.
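That kind of regression test can be sketched in a few lines of Python. The connection interface and baseline numbers here are hypothetical; the point is that the test asserts on the count of SQL round trips, not just on elapsed time:

```python
class RoundTripCounter:
    """Wraps a database connection (hypothetical interface) and counts
    round trips so tests can assert against a recorded baseline."""

    def __init__(self, conn):
        self._conn = conn
        self.round_trips = 0

    def execute(self, sql, *params):
        self.round_trips += 1
        return self._conn.execute(sql, *params)

# Assumed baseline values, recorded when the operation was first tuned.
BASELINES = {"get_work_item": 2}

def assert_within_baseline(operation, counter):
    """Fail the test if an operation starts costing more round trips
    than its baseline -- catching cost regressions at check-in time."""
    assert counter.round_trips <= BASELINES[operation], (
        "%s used %d round trips; baseline is %d"
        % (operation, counter.round_trips, BASELINES[operation]))
```

The same counting idea extends to production telemetry: emit the per-operation counts and alert when they drift from the baseline.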
So back to last Thursday… We’ve gotten sloppy. Sloppy is probably too harsh. As with any team, we are pulled by the tension between eating our Wheaties and adding capabilities that customers are asking for. In the drive toward rapid cadence, value every sprint, etc., we’ve allowed some of the engineering rigor that we had put in place back then to atrophy – or more precisely, we haven’t carried it forward to new code that we’ve been writing. This, I believe, is the root cause: developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in the overall resource cost of operations. You must, of course, be able to do this in synthetic test environments – like unit tests – but also in production environments because you’ll never catch everything in your tests.
So, we’ve got some bugs to fix and some more debt to pay off in terms of tuning the interaction between TFS and SPS but, most importantly, we need to put in place some infrastructure to better measure and flag changes in end to end cost – both in test and in production.
The irony here (not funny irony but sad irony) is that there has been some renewed attention on this in the team recently. A few weeks ago, we had a “hack-a-thon” for small groups of people on the team to experiment with new ideas. One of the teams built a prototype of a solution for capturing important performance tracing information across the end-to-end thread of a request. I’ll try to do a blog post in the next couple of weeks to show some of these ideas. And just the week before this incident, Buck (our Dev director) and I were having a conversation about needing to invest more in this very scenario. Unfortunately, we had a major incident before we could address the gap.
What are we going to do?
OK, so we learned a lot, but what are we actually going to do about it? Clearly, step 1 is to mitigate the emergency and get the system back to sustainable health quickly. I think we are there now. But we haven’t addressed the underlying whys yet. So, some plans we are making now include:
- We will analyze call patterns within SPS and between SPS and SQL and build the right telemetry and alerts to catch situations early. Adding baselines into unit and functional tests will ensure that baselines don’t get violated when a dev checks in code.
- Partitioning and scaling of SPS Config DB will be a very high priority. With the work to enable tenant-level hosts, we can partition identity related information per tenant. This enables us to scale SPS data across databases, enabling a higher “ceiling” and more isolation in the event things ever go badly again.
- We are looking into building an ability for a service to throttle and recover itself from a slow or failed dependency. We should leverage the same techniques for TFS -> SPS communication and let TFS leverage cached state or fail gracefully. (This is actually also a remaining action item from the previous outage we had a month or so ago.)
- We should test our designs for lag in Service Bus delivery and ensure that our functionality continues to work or degrades gracefully.
- Moving to Service Bus batching APIs and partitioned topics will help us scale better to handle very 'hot' topics like Identity.
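The throttle-and-recover idea in the list above can be sketched as a simple circuit breaker. This is a hedged illustration of the pattern, not our actual implementation; the threshold, cooldown and fallback-to-cached-state behavior are assumptions:

```python
import time

class CircuitBreaker:
    """Minimal sketch: after `threshold` consecutive failures, stop
    calling the dependency for `cooldown` seconds and fail fast,
    letting the caller fall back to cached state instead of queuing."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.open_until = 0.0

    def call(self, dependency, fallback):
        if self.clock() < self.open_until:
            return fallback()            # breaker open: degrade gracefully
        try:
            result = dependency()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                # Trip the breaker: skip the dependency for a while.
                self.open_until = self.clock() + self.cooldown
                self.failures = 0
            return fallback()
        self.failures = 0                # success resets the failure count
        return result
```

The key property is that a slow dependency costs at most `threshold` slow calls before callers start failing fast, rather than piling up threads indefinitely.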
As always, hopefully you get some value from the details behind the mistakes we make. Thank you for bearing with us.
Well, today was my turn to take the ALS Ice Bucket Challenge. This morning I was tagged by both Adam Cogan and Scott Guthrie. Tempting as it is, though, I’m not doing it twice. I don’t know if it’s typical, but Adam challenged me to complete the task within 24 hours. I spent today thinking about how I would get home early enough, how I would orchestrate it and where I would get the ice.
I decided to do it on the farm (notice the cows behind me) and to have my kids help me. When I first told them that I needed their help, they moaned in exasperation at having to “help dad again”. When I told them, they would be pouring ice cold water on me while filming it, they jumped up and down screaming with excitement.
I gave them instructions ahead of time on what to do. Apparently I wasn’t clear enough. You’ll notice that my daughter poured the water on me very gradually. I was hoping for an instantaneous drenching but instead I got a slow motion frostbite. It seemed like it lasted 10 minutes.
I’ve now named James Philips, David Treadwell and Buck Hodges as my victims, ahem, I mean nominees in the challenge. Good luck to you all – it’s for a good cause. Of course, you could just donate $100 and chicken out. Or you could drench yourself and donate $100 anyway. Up to you.
Yesterday I rolled out the release notes for our sprint 69 deployment. Check out the updates!
P.S. The outage we had last Thursday was *not* caused by the rollout of updates (or, at best, only tangentially so). I’m hoping to get the outage retrospective written up tomorrow.
Today we released a CTP of our next major Visual Studio release. You can look in the release notes to see what is in it. You can also read about the highlights in John’s post. There are no startling new improvements in this CTP but lots of nice, smaller things.
Just a reminder, we aren’t shipping CTP previews of TFS “14” at this point. Visual Studio Online is still your best way to check out the latest TFS improvements.
Let me start by apologizing for the pretty horrific outage we had last week. I’ve been silent on it because I was on a “last family vacation” in Europe before my oldest son goes off to college. Buck Hodges and others have been working hard on it. I’ve been reading up on everything that happened and everything that’s been done. I need to spend some time talking with the team but I expect to publish a lengthy retrospective in the next few days. Stay tuned for more.
At this point, I think enough has been done that we won’t see a recurrence of the issues any time soon. However, there is some underlying work that will take some time (small number of months I expect) to put in place the infrastructure necessary to avoid another incident in the same class.
Again, I’m very sorry for the disruption. We take all incidents very seriously and work hard to ensure they won’t ever happen again.
Today we released the final version of Visual Studio 2013 Update 3 and Team Foundation Server Update 3. You can get the update using the link below. Note that the link includes both the Visual Studio & TFS downloads (among other things) if you expand the Details section on the page.
I’ve blogged about the features before but I’ll reiterate that some of the biggest enhancements in this Update include:
- CodeLens support for Git
- Configurable display of in-progress items on the backlog (a common customer request)
- Application Insights tooling
- Desktop app support in the memory usage tool (including WPF)
- Release management support for PowerShell/DSC and Chef
- Test plan/suite customization, permissions, auditing, etc.
- Cloud load testing integration with Application Insights for app under test telemetry/diagnostics
- and a substantial number of bug fixes (listed in the KB article).
Alongside Update 3, we are also releasing an updated CTP of our Cordova tooling with support for Windows 7. Make sure to look for that too.
Thanks for all of your help with validating early drops of this release and we hope you like it. We’re happy to be delivering it and already turning our attention to Update 4. I’m hoping we’re going to see several very nice improvements to the TFS Agile planning tools in that release plus a lot more. Stay tuned. I suspect we’ll ship the first CTP of Update 4 in a couple of months.
I was out doing chores on my farm yesterday morning and ran across something surprising (to me, at least). By the pig pen, we have an electric fence charger and it is covered by buckets. For some reason I don’t understand (my wife did it), there are two buckets – one nested inside the other. Yesterday, I removed the top bucket and inside it, I found a frog. It took me a minute to recognize it as a frog because it was so white it didn’t look much like a frog.
I know frogs can change colors to match their surroundings but white? I’ve seen documentaries of incredible color changing animals but those are exotic animals in some exotic place – not a frog in my back yard, right?
10 points to anyone who can identify what kind of frog (or maybe toad, for all I know) it is.
It was pretty cool.
Sorry it took me a week and a half to get to this.
We had the most significant VS Online outage we’ve had in a while on Friday July 18th. The entire service was unavailable for about 90 minutes. Fortunately it happened during non-peak hours so the number of affected customers was fewer than it might have been but I know that’s small consolation to those who were affected.
My main goal from any outage that we have is to learn from it. With that learning, I want to make our service better and also share it so, maybe, other people can avoid similar errors.
The root cause was that a single database in SQL Azure became very slow. I actually don’t know why, so I guess it’s not really the root cause but, for my purposes, it’s close enough. I trust the SQL Azure team chased that part of the root cause – we certainly did loop them in on the incident. Databases will, from time to time, get slow, and SQL Azure has been pretty good about that over the past year or so.
The scenario was that Visual Studio (the IDE) was calling our “Shared Platform Services” (a common service instance managing things like identity, user profiles, licensing, etc.) to establish a connection to get notified about updates to roaming settings. The Shared Platform Services were calling Azure Service Bus and it was calling the ailing SQL Azure database.
The slow Azure database caused calls to the Shared Platform Services (SPS) to pile up until all threads in the SPS thread pool were consumed, at which point all calls to TFS eventually got blocked due to dependencies on SPS. The ultimate result was VS Online being down until we manually disabled our connection to Azure Service Bus and the logjam cleared itself up.
There was a lot to learn from this. Some of it I already knew, some I hadn’t thought about but, regardless of which category it was in, it was a damn interesting/enlightening failure.
**UPDATE** Within the first 10 minutes I've been pinged by a couple of people on my team pointing out that people may interpret this as saying the root cause was Azure DB. Actually, the point of my post is that it doesn't matter what the root cause was. Transient failures will happen in a complex service. The interesting thing is that you react to them appropriately. So regardless of what the trigger was, the "root cause" of the outage was that we did not handle a transient failure in a secondary service properly and allowed it to cascade into a total service outage. I'm also told that I may be wrong about what happened in SB/Azure DB. I try to stay away from saying too much about what happens in other services because it's a dangerous thing to do from afar. I'm not going to take the time to go double check and correct any error because, again, it's not relevant to the discussion. The post isn't about the trigger. The post is about how we reacted to the trigger and what we are going to do to handle such situations better in the future.
Don’t let a ‘nice to have’ feature take down your mission critical ones
I’d say the first and foremost lesson is “Don’t let a ‘nice to have’ feature take down your mission critical ones.” There’s a notion in services that all services should be loosely coupled and failure tolerant. One service going down should not cause a cascading failure in which other services fail; rather, only the portion of functionality that absolutely depends on the failing component should be unavailable. Services like Google and Bing are great at this. They are composed of dozens or hundreds of services and any single one might be down and you never even notice because most of the experience looks like it always does.
The crime of this particular case is that the feature that was experiencing the failure was Visual Studio settings roaming. If we had properly contained the failure, your roaming settings wouldn’t have synchronized for 90 minutes and everything else would have been fine. No big deal. Instead, the whole service went down.
In our case, all of our services were written to handle failures in other services, but the failure ultimately resulted in thread pool exhaustion in a critical service, and it reached the point that no service could make forward progress.
Smaller services are better
Part of the problem here was that a very critical service like our authentication service shared an exhaustible resource (the thread pool) with a very non-critical service (the roaming settings service). Another principle of services is that they should be factored into small atomic units of work if at all possible. Those units should be run with as few common failure points as possible and all interactions should honor “defensive programming” practices. If our authentication service goes down, then our service goes down. But the roaming settings service should never take the service down. We’ve been on a journey for the past 18 months or so of gradually refactoring VS Online into a set of loosely coupled services. In fact, only about a year ago, what is now SPS was factored out of TFS into a separate service. All told, we have about 15 or so independent services today. Clearly, we need more :)
How many times do you have to retry?
Another one of the long-standing rules in services is that transient failures are “normal”. Every service consuming another service has to be tolerant of dropped packets, transient delays, flow control backpressure, etc. The primary technique is to retry when a service you are calling fails. That’s all well and good. The interesting thing we ran into here was a set of cascading retries. Our situation was:
Visual Studio –> SPS –> Service Bus –> Azure DB
When Azure DB failed, Service Bus retried 3 times. When Service Bus failed, SPS retried 2 times. When SPS failed, VS retried 3 times. 3 * 2 * 3 = 18 times. So, every single Visual Studio client launched in that time period caused a total of 18 attempts on the SQL Azure database. Since the problem was that the database was running slow (resulting in a timeout after about 30 seconds), that’s 18 tries * 30 seconds = 9 minutes each.
Calls in this stack of services piled up and up and up until, eventually, the thread pool was full and no further requests could be processed.
As it turns out, SQL Azure is actually very good about communicating to its callers whether or not a retry is worth attempting. SB doesn’t pay attention to that and doesn’t communicate it to its callers. And neither does SPS. So a new rule I learned is that it’s important that any service carefully determine, based on the error, whether or not retries are called for *and* communicate back to its callers whether or not retries are advisable. If this had been done, each connection would have been only 30 seconds rather than 9 minutes and likely the situation would have been MUCH better.
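The arithmetic is worth spelling out, since it is the heart of the problem. A quick Python model (the per-layer retry counts are the ones from this incident; the function itself is just illustrative):

```python
def total_attempts(attempts_per_layer):
    """When every layer in a call stack independently retries a failing
    dependency, the attempt counts multiply down the stack."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total

# Visual Studio (3) -> SPS (2) -> Service Bus (3) -> Azure DB
attempts = total_attempts([3, 2, 3])     # 18 attempts per client launch
wall_clock_seconds = attempts * 30       # ~30s timeout each: 540s = 9 min

# If each layer honored a "retries not advisable" signal from below,
# a non-retryable failure would cost a single 30 second attempt instead.
```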
A traffic cop goes a long way
Imagine that SPS kept count of how many concurrent calls were in progress to Service Bus. Knowing that this was a “low priority” service, that calls were synchronous and that the thread pool was limited, it could have decided that, once that number of concurrent calls exceeded some threshold (let’s say 30, for argument’s sake), it would start rejecting all subsequent calls until the traffic jam drained a bit. Some callers would very quickly get rejected and their settings wouldn’t be roamed, but we’d never have exhausted threads and the higher-priority services would have continued to run just fine. Assuming the client is set to attempt a reconnect on some very infrequent interval, the system would eventually self-heal, assuming the underlying database issue was cleared up.
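That traffic cop can be as simple as a non-blocking semaphore in front of the low-priority calls. A minimal Python sketch, assuming a fixed concurrency limit (the limit value and error type are arbitrary illustrative choices):

```python
import threading

class TrafficCop:
    """Admission control sketch: allow at most `limit` concurrent calls
    to a low-priority dependency; reject the rest immediately instead of
    letting them queue up and exhaust the shared thread pool."""

    def __init__(self, limit=30):
        self._slots = threading.BoundedSemaphore(limit)

    def call(self, fn):
        # Non-blocking acquire: if all slots are taken, fail fast.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("dependency saturated; try again later")
        try:
            return fn()
        finally:
            self._slots.release()
```

Rejected callers get an immediate, cheap failure they can retry much later, so a slow dependency never consumes more than `limit` threads.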
Threads, threads and more threads
I’m sure I won’t get out of this without someone pointing out that one of the root causes here is that the inter-service calls were synchronous. They should have been asynchronous, therefore not consuming a thread and never exhausting the thread pool. It’s a fair point but not my highest-priority takeaway here. You are almost always consuming some resource, even on async calls – usually memory. That resource may be large but it, too, is not inexhaustible. The techniques I’ve listed above are valuable regardless of sync or async and will also prevent other side effects, like pounding an already ailing database into the dirt with excessive retries.
So, it’s a good point, but I don’t think it’s a silver bullet.
So, onto our backlog go another series of “infrastructure” improvements and practices that will help us provide an ever more reliable service. All software will fail eventually, somehow. The key thing is to examine each and every failure, trace the failure all the way to the root cause, generalize the lessons and build defenses for the future.
I’m sorry for the interruption we caused. I can’t promise it won’t happen again, *but* after a few more weeks (for us to implement some of these defenses), it won’t happen again for these reasons.
Thanks, as always, for joining us on this journey and being astonishingly understanding as we learn. And hopefully these lessons provide some value to you in your own development efforts.
A month ago I wrote about our newly enabled capability to measure quality of service on a customer by customer basis. In that post I mentioned that we had actually identified a customer experiencing issues before they even contacted us about them and had started working with them to understand the issues. Well, the rest of that story…
We’ve identified the underlying issue. The customer had an unusually large number of Team Projects in their account and some of our code paths were not scaling well, resulting in slower than expected response times. We have debugged it, coded a fix and will be deploying it with our next sprint deployment.
Now that’s cool. We’ve already started working with a few of the other accounts that have the lowest quality of service metrics. Our plan is to make this a regular part of our sprint rhythm where, every sprint, we investigate a top few customer accounts on the list and try to deploy fixes within a sprint or two – improving the service every sprint.
Today we began deployment of our sprint 68 work. There’s a bunch of really good stuff there. I say “began” because deployment is a multi-day event now as we roll it out across instances. Everyone should have the updates by tomorrow (Tuesday) afternoon. You can read the release notes to get the details.
You’ll see that one part of the licensing changes I described a couple of weeks ago is now live – the addition of Test Hub access to the Visual Studio Online Advanced license. The remaining stakeholder licensing changes are still tracking to go live in mid-August. Stay tuned for more.
Azure Active Directory support
The biggest thing in the announcement is the next step in our rollout of Azure Active Directory (AAD) support in VS Online. We started this journey back in April with the very first flicker of AAD support at the Build conference. We added more support at TechEd but I’ve stayed pretty quiet about it because, until this week, there was no way to convert an existing account to AAD. With this deployment we’ve enabled it. Officially it’s in preview and you have to ask to get access to do it, but we’re accepting all requests, so it’s nothing more than a speed bump to keep too big a rush from happening all at once. With this last set of changes, you can:
- Associate your OrgID (AAD/AD credentials) with your MSDN subscription, if you have one, and use that to grant your VSO license
- Create a new account connected to an AAD tenant
- Connect an existing account to an AAD tenant
- Disconnect an account from an AAD tenant
- Log in with either a Microsoft Account or an OrgID (AAD-only or synchronized from your on-premises Active Directory), giving you single sign-on with your corporate credentials, Office 365, etc.
- I’m probably forgetting something but you get the point
I encourage you to read the docs and more docs for details. One thing I’ve asked to be included in the docs – and I’m still not satisfied with the clarity – is one detail about binding an existing account to AAD. If you have an existing account not connected to AAD then, by definition, you are using Microsoft Accounts. When you connect your VS Online account to AAD, your identities have to be recognized by AAD to authenticate. You have 3 options for each existing user of your account:
- Add the Microsoft Account as an “external identity” in your AAD. All your data and in-progress work carries forward. The drawback is that external Microsoft accounts won’t fully honor your AAD policies – like two-factor auth, password policies, etc. It’s still a Microsoft Account, but it’s been associated with your AAD, giving your AAD admin central control over access.
- If you created your Microsoft Account using the same email address as your AD/AAD identity (for instance, for me it’s firstname.lastname@example.org) then, when you connect your VSO account to AAD, your Microsoft Account will be seamlessly rebound to your corporate identity. All your data and in-progress work carries forward and your login gets the full set of AAD governance. This is the “best” of the 3 options but requires that you created your Microsoft Account a certain way.
- If you can’t do #2 and you don’t want to do #1, then you can just add your AAD identity as a “new” VS Online user and remove your old Microsoft Account identity from the VS Online account. To VS Online this is just like adding a new user and deleting an old one; VS Online has no idea they are the same person. This has the advantage of getting full AAD administration, but the downside is that in-progress work (checkouts, work items assigned to you, …) and other places where your old Microsoft Account identity was associated need to either be deleted or reassigned to your new identity. Work items can be reassigned. Workspaces, shelvesets and stuff like that can be deleted. History will always be associated with your “old” Microsoft Account identity.
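To make option #3 concrete, here’s a toy model in Python of what reassignment does and doesn’t cover – the data model is hypothetical, not VS Online’s actual schema:

```python
def rebind_identity(work_items, history, old_id, new_id):
    """Sketch of option #3: open work items get reassigned to the new
    identity, while history entries are immutable and stay attributed
    to the old one. Hypothetical data model for illustration only."""
    for item in work_items:
        if item["assigned_to"] == old_id:
            item["assigned_to"] = new_id  # reassignable, like work items
    # History is append-only and never rewritten: it still names old_id.
    return work_items, history
```

The asymmetry in the sketch is the point: current state can follow you to the new identity, but the historical record keeps its original attribution.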
So that’s a good segue to what’s left for us to do to really complete AAD support…
- Add the ability to migrate one identity to any other identity, thereby having all references in VSO changed to the new user (to get around the issue in #3). This is on the backlog but is going to take a while.
- Add support for using AAD groups (to assign permissions, query work items, etc.) in VS Online. Today you can use AAD users, but you can’t yet use AAD groups. This feature is coming fairly soon (within the next few sprints).
I’m sure I’m missing something else we haven’t done yet but I don’t think anything big. AAD support is ready for prime time for most user scenarios.
And I have to say something about account deletion. Until this week, VS Online account deletions could only be done by contacting support – and we had to do a delicate dance to ensure that the person requesting a deletion had the rights to do so. For the past few months, account deletion has been the #1 support request, with dozens of requests a month. There are all kinds of reasons –
- Merging multiple accounts into one
- Moving from VS Online back to on-premises TFS
- Wanting to just wipe everything out and start over (for instance after an evaluation)
With this week’s deployment, account deletion is self-service (assuming you are an account administrator). However, it’s important to understand that all account deletes are “soft” deletes only. Meaning the account is “marked for deletion” and no one can access it anymore, but it is *not* actually deleted. It will be physically deleted, I believe, 90 days after you delete it in the UI. This gives you a window to have your “Oh sh%t!” moment. If you realize that you deleted something you did not intend to, you can contact support and they can “undelete” your account. This is indicative of a general direction we are headed, where all deletes are “soft” and you always have a time window to go back and recover. It will take us quite a while to get there on everything that can be deleted, but we’ll make progress every chance we get. Of course, if there’s some reason you *REALLY* need a VS Online account permanently deleted immediately, you can contact support and they can help you.
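The soft-delete pattern is simple to picture: delete marks a record and hides it, undelete clears the mark, and a physical purge only runs once the retention window lapses. A minimal sketch, assuming a hypothetical in-memory store and the 90-day window mentioned above:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # the "I believe, 90 days" window above

class AccountStore:
    """Soft-delete sketch: hypothetical model, not VS Online's actual code."""
    def __init__(self):
        self._accounts = {}  # name -> deleted_at timestamp (None if live)

    def create(self, name):
        self._accounts[name] = None

    def delete(self, name, now=None):
        # Mark for deletion; nothing is physically removed yet.
        self._accounts[name] = now or datetime.utcnow()

    def is_accessible(self, name):
        return self._accounts.get(name, "missing") is None

    def undelete(self, name):
        # Support can restore the account any time within the window.
        if self._accounts.get(name) is not None:
            self._accounts[name] = None

    def purge_expired(self, now):
        # Physically remove accounts whose retention window has lapsed.
        expired = [n for n, d in self._accounts.items()
                   if d is not None and now - d >= RETENTION]
        for n in expired:
            del self._accounts[n]
```

The nice property is that “delete” becomes a cheap, reversible metadata change, and the irreversible part is deferred to a background purge you have three months to interrupt.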
Oh, and lest I manage to avoid mentioning any feature in this deployment, check out the new trend reports. They are very cool and make the VS Online charting experience even more useful. And, because I know several people will ask, yes, these charting enhancements will be added to Team Foundation Server (our on-premises product). If everything goes according to plan, they will be in TFS 2013.4 (Update 4) later this fall.
It’s a bunch of stuff. Maybe you have to be a bit of a geek to appreciate all of it. We’ve been working on some of this for a good while and I’m really happy to see it all available. Check it out and let us know what you think.
Through the fall and spring, we transitioned VS Online from Preview to General Availability. That process included changes to branding, the SLA, the announcement of pricing, the end of the early adopter program and more. We’ve been working closely with customers to understand where the friction is and what we can do to make adopting VS Online as easy as possible. This is a continuing process and includes discussions about product functionality, compliance and privacy, pricing and licensing, etc. This is a journey and we’ll keep taking feedback and adjusting.
Today I want to talk about one set of adjustments that we want to make to licensing.
As we ended the early adopter period, we got a lot of questions from customers about how to apply the licensing to their situation. We also watched as people assigned licenses to their users: What kind of licenses did they choose? How many people did they choose to remove from their account? Etc.
From all of this learning, we’ve decided to roll out 2 licensing changes in the next couple of months:
A common question we saw was “What do I do with all of the stakeholders in my organization?” While the early adopter program was in effect and all users were free, customers were liberal with adding people to their account. People who just wanted to track progress or occasionally file a bug or a suggestion were included. As the early adopter period ended, customers had to decide – is this really worth $20/user/month (minus appropriate Azure discounts)? The result was that many of these “stakeholders” were removed from the VS Online accounts in the transition, just adding more friction for the development teams.
As a result of all this feedback we proposed a new “Stakeholder” license for VS Online. Based on the scenarios we wanted to address, we designed a set of features that matched the needs most customers have. These include:
- Full read/write/create on all work items
- Create, run and save (to “My Queries”) work item queries
- View project and team home pages
- Access to the backlog, including add and update (but no ability to reprioritize the work)
- Ability to receive work item alerts
Some of the explicitly excluded items are:
- No access to Code, Build or Test hubs.
- No access to Team Rooms
- No access to any administrative functionality (Team membership, license administration, permissions, area/iterations configuration, sprint configuration, home page configuration, creation of shared queries, etc.)
We then surveyed our “Top Customers” and tuned the list of features (to arrive at what I listed above). One of the conversations we had with them was about the price/value of this feature set. We tested 3 different price points - $5/user/month, $2/user/month and free. Many thought it was worth $5. Every single one thought it was worth $2. However, one of the questions we asked was “How many stakeholders would you add to your account at each of these price points?” The result was 3X more stakeholders if it’s free than if it’s $2. That told us that any amount of money, even if it is perceived as “worth it”, is too much friction. Our goal is to enable everyone who has a stake to participate in the development process (and, of course, to run a business in the process). Ultimately, in balancing the goals of enabling everyone to participate and running a business, we concluded that “free” is the right answer.
As a result, any VS Online account will be able to have an unlimited number of “Stakeholder” users with access to the functionality listed above, at no charge.
Access to the Test Hub
Another point of friction that emerged in the transition was access to the Test hub. During the Preview, all users had access to the Test hub but, at the end of the early adopter program, the only way to get access to the Test hub was by purchasing Visual Studio Test Professional with MSDN (or one of the other products that include it, like VS Premium or VS Ultimate).
We got ample feedback that there was a class of users who really only need access to the web-based Test functionality and don’t need all that’s in VS Test Professional.
Because of this, we’ve decided to include access to all of the Test hub functionality in the Visual Studio Online Advanced plan.
I’m letting you know now so that, if you are currently planning your future, you know what is coming. I’m always loath to get too specific about dates in the future because, as we all know, stuff happens. However, we are working hard to implement these licensing changes now and my expectation is that we’ve got about 2 sprints of work to do to get it all finished. That would put the effective date somewhere in the neighborhood of mid-August. I’ll update you with more certainty as the date gets a little closer.
What about Team Foundation Server?
In general, our goal is to keep the licensing for VS Online and Team Foundation Server as “parallel” as we can – to limit how confusing it could be. As a result, we will be evolving the current “Work Item Web Access” TFS CAL exemption (currently known as “Limited” users in TFS) to match the “Stakeholder” capabilities. That will result in significantly more functionality available to TFS users without CALs. My hope is to get that change made for Team Foundation Server 2013 Update 4. It’s too early yet to be sure that’s going to be possible but I’m hopeful. We do not, currently, plan to provide an alternate license for the Test Hub functionality in TFS, though it’s certainly something we’re looking at and may have a solution in a future TFS version.
As I said, it’s a journey and we’ll keep listening. It was interesting to me to watch the phenomenon of the transition from Preview to GA. Despite announcing the planned pricing many months in advance, the feedback didn’t get really intense until, literally, the week before the end of the early adopter period when everyone had to finish choosing licenses.
One of the things that I’m proud of is that we were able to absorb that feedback, create a plan, review it with enough people, create an engineering plan and (assuming our timelines hold), deliver it in about 3 months. In years past that kind of change would take a year or two.
Hopefully you’ll find this change valuable. We’ll keep listening to feedback and tuning our offering to create the best, most friction-free solution that we can.
I’m not going to make too big a deal about this because there’s going to be tons of them between now and when VS “14” ships. But we shipped another CTP today and you can learn more about it here: http://blogs.msdn.com/b/visualstudio/archive/2014/07/08/visual-studio-14-ctp-2-available.aspx
We’re continuing the practice of making Azure VM templates available to make it really easy to try out the CTPs.
We are starting to show some nice new features that are worth learning about. I think the lightbulb feature is promising, for instance.
For reasons I explained in my last post on the subject, we are not releasing TFS “14” CTPs at this time and, quite honestly, won’t for a while. We will start releasing CTPs of TFS well before the release but there’s just not a good enough cost benefit analysis to it right now. You can see the majority of the work we are doing on VS Online as we do it.
Years ago, I used to do monthly updates on TFS adoption at Microsoft. Eventually, the numbers got so astronomical that it just seemed silly, so I stopped doing them. It’s been long enough, and there are some changes happening, that I figured it was worth updating you all on where we are.
First of all, adoption has continued to grow steadily year over year. We’ve continued to onboard more teams and to deepen the feature set teams are using. Any major change in the ALM solution of an organization of our size and complexity is a journey.
Let’s start with some stats:
As of today, we have 68 TFS “instances”. Instance sizes vary from modest hardware up to very large scaled out hardware for the larger teams. We have over 60K monthly active users and that number is still growing rapidly. Growth varies month to month and the growth below seems unusually high (over 10%). I grabbed the latest data I could get my hands on – and that happened to be from April. The numbers are really staggeringly large.
| Current | 30 day growth |
| --- | --- |
In addition we’ve started to make progress recently with Windows and Office – two of the Microsoft teams with the oldest and most entrenched engineering systems. They’ve both used TFS in the past for work planning but recently Windows has also adopted TFS for all work management (including bugs) and Office is planning a move. We’re also working with them on plans to move their source code over.
In the first couple of years of adoption of TFS at Microsoft, I remember a lot of fire drills. Bringing on so many people and so much data with such mission critical needs really pushed the system, and we spent a lot of time chasing down performance (and occasionally availability) problems. These days things run pretty smoothly. The system is scaled out enough, and the code and our dev processes have been tuned enough, that for the most part the system just works. We upgrade it pretty regularly (a couple of times a year for the breadth of the service, as often as every 3 weeks for our own instances).
As we close in on completing the first leg of our journey – getting all teams at Microsoft onto TFS – we are now beginning the second. A few months ago, the TFS team and a few engineering systems teams working closely with them moved all of their assets into VS Online – code, work items, builds, etc. This is a big step and, I think, foreshadows the future for the entire company. At this point it’s only a few hundred people accessing it but it’s already the largest and most active account on VS Online and it will continue to grow.
It was a big decision for us – and we went through a lot of the same anxieties I hear from anyone wanting to adopt a cloud solution for a mission critical need. Will my intellectual property be safe? What happens when the service goes down? Will I lose any data? Will performance be good? Etc. Etc. At the same time, it was important to us to live the life that we are suggesting our customers live – taking the same risks and working to ensure that all of those risks are mitigated.
The benefits of moving are already visible. I’ve had countless people remark to me how much they’ve enjoyed having access to their work – work items, build status, code reviews, etc from any device, anywhere. No messing with remote desktop or any other connectivity technology. As part of this, we also bound the account to the Microsoft Active Directory tenant so we can log in using the same corporate credentials as we do for everything else. Combining this with a move to Office 365/SharePoint Online for our other collaboration workflows has created for us a fantastic mobile, cloud experience.
I’ll see about starting to post some statistics on our move to the cloud. As I say, at this point it’s a few hundred people and mostly just the TFS codebase – which is pretty large. Over time that will grow but I expect it will be slow – getting larger year over year into a distant future when all of Microsoft has moved to the cloud for our engineering system tools.
I know I have to say this because people will ask. No, we are not abandoning on-prem TFS. The vast majority of our customers still use it, and the overwhelming majority of our internal teams still use it (the few hundred people using VS Online are still a rounding error on the more than 60K people using TFS on premises). We continue to share a codebase between VS Online and TFS and the vast majority of the work we do accrues to both scenarios – and that will continue to be the case. TFS is here to stay and we’ll keep using it ourselves for a very long time. At the same time, VS Online is here to stay too and our use of it will grow rapidly in the coming years. It will be a big milestone when the first big product engineering team not associated with building VS Online/TFS moves over to VSO for all of their core engineering system needs – I’ll be sure to let you know when that happens.