And an embarrassing moment.
I’ve gotten some teasing about wearing a tie at the French Visual Studio 2013 launch event in Paris this week. There are some pictures of me floating around online. It’s certainly a rare occurrence. I don’t wear a tie more than a couple of times a year. I brought it on this trip “just in case”, expecting I wouldn’t actually wear it.
I felt pretty bad about the night before so I decided to err on the side of formality for the event. Here’s the story…
Sunday was, in many ways, a typical day on the farm (except that it was darned cold for November and my friend and VS ALM MVP, Adam Cogan, was visiting again). I dressed for a typical farm day, did chores, etc. A couple hours before I had to leave for the airport, my wife reminded me that we needed to move the cows before I went. Normally that's a pretty minor thing, but the cows were feeling particularly uncooperative so it took much longer than usual. Further, we needed to separate 3 cows from the herd – 2 for harvesting and one that had a respiratory infection that needed treating.
After getting all the cows sorted and moved, Adam and I had to hold the sick cow's head while my wife gave it an IV antibiotic (wish I had pictures). We rarely give antibiotics to cows (this is only the second time ever), but if they get sick enough, it's either that or let them die. Anyway, all this completely ate up the time we had and all my travel buffer. By the time we were done, the best case was for me to get to the airport about 40 minutes before Adam's flight (I usually plan for about 90 minutes).
I ran back home, took off my cow-snot-covered coat (yeah, gross), grabbed my bags and a clean coat, and headed for the airport.
The rest of the day went fine (well, as well as any trans-Atlantic flight can go).
When I arrived in France on Monday morning, I was completely wiped out. I had just taken a red-eye back from Seattle Friday night and then another red-eye to France Sunday night and I could barely see straight. I made it to the hotel, spent a bit of time reading email and then took a nap so I was at least slightly coherent for the dinner I was supposed to go to with people from the local Microsoft office.
You know, I’m an idiot and I didn’t really think about it. Dinner for me is Red Robin. Well… I woke from my nap about 20 minutes before I was to meet them in the hotel lobby. I got up, checked my email, got myself in order and headed down to meet them.
We arrived at the restaurant and, oh crap – it's an incredibly fancy French restaurant on an island in the middle of the Seine and everyone is wearing suits. I think to myself: uh-oh, I have jeans on. But OK, it's not the end of the world – jeans and a nice cotton collared shirt (that's what I almost always travel in). I'm underdressed but I'll be OK. Then I take off my coat and look down.
OH MY GOD! I’m wearing the Minka Farm T-shirt that I was wrestling a cow in 16 hours before. What am I going to do at this point? In my sleep deprived delirium, it just never occurred to me to notice what I was wearing.
My hosts were gracious and didn’t say a word but the whole evening I was horrified. I decided not to say anything about it either and just tried to act as if it were nothing unusual. I was a bit surprised the waiter didn’t come and tell me to leave. I’m sure everyone in the restaurant was thinking “typical stupid American”.
Well, I made it through the evening and decided that I'd better dress nicely for the day of speaking and customer meetings to show that I'm not a complete and total moron. Hence the slacks, nice shirt and tie. And, yes, the shirt was the one the marketing team bought me last year – first time I've had a reason to wear it :)
Ah well, I suppose worse things can happen but this will be one of those stupid memories that stick with me for the rest of my life :)
My wife would have skewered me if she had been with me. Scratch that, it wouldn’t have happened if she had been with me because she’s got more sense than I do.
Hope you enjoy the laugh at my expense.
Brian Keller has released the RTM version of his incredibly useful demo VM. He loads just about everything you could want to demo onto it and makes it easy to see how anything in VS or TFS works.
We are running a couple of surveys on Visual Studio to collect feedback for our next release cycle. We’d appreciate any feedback you have. Here are the links to the surveys on the VS blog.
Either I'm going to get increasingly good at apologizing to fewer and fewer people or we're going to get better at this. I vote for the latter.
We’ve had some issues with the service over the past week and a half. I feel terrible about it and I can’t apologize enough. It’s the biggest incident we’ve had since the instability created by our service refactoring in the March/April timeframe. I know it’s not much consolation but I can assure you that we have taken the issue very seriously and there are a fair number of people on my team who haven’t gotten much sleep recently.
The incident started the morning of the Visual Studio 2013 launch, when we introduced some significant performance issues with the changes we made. You may not have noticed it from my presentation, but for the couple of hours before it I was frantically working with the team to restore the service.
At launch, we introduced the commercial terms for the service and enabled people to start paying for usage over the free level. To follow that with a couple of rough weeks is leaving a bad taste in my mouth (and yours too, I’m sure). Although the service is still officially in preview, I think it’s reasonable to expect us to do better. So, rather than start off on such a sour note, we are going to extend the “early adopter” program for 1 month giving all existing early adopters an extra month at no charge. We will also add all new paying customers to the early adopter program for the month of December – giving them a full month of use at no charge. Meanwhile we’ll be working hard to ensure things run more smoothly.
Hopefully that, at least, demonstrates that we’re committed to offering a very reliable service. For the rest of this post, I’m going to walk through all the things that happened and what we learned from them. It’s a long read and it’s up to you how much of it you want to know.
Here’s a picture of our availability graph to save 1,000 words:
First Incident: Nov 13, 14, 18, 19
Let’s start with the symptoms. What users saw was that the service became very slow to the point of being unresponsive during peak traffic periods. Peak hours start mid-morning on the east coast of the US – right when we were doing the launch keynote :(
The beginning coincided with the changes we made to enable the new branding and billing right before the launch.
We've gotten to a fairly mature stage of dealing with live-site incidents. We detect them quickly, triage them, bring in the right developers, collaborate and fan out to all the various people who need to know/contribute, including pulling in partner teams like Azure fabric, SQL Azure, ACS, etc. Despite that, this one took us a long time to diagnose, for a number of reasons – for one, it was the first incident of this nature that we have ever hit (more on that in a minute).
One of the things we did fairly early on was to roll back all of the feature switches that we had enabled that morning – thinking that just undoing the changes would likely restore the service. Unfortunately, it didn't. And no one is completely sure why. There were some aspects of the issue that had been there for a while and were triggered by the spike in load that we saw, but the bulk of the issue really was caused by enabling the feature flags. It's possible that some error was made in rolling them back but, at this point, they've been changed enough times that it's hard to tell for certain.
Let’s talk about what we do know.
Within a fairly short period of time, we discovered that the biggest underlying symptom of the problem was SNAT port exhaustion. SNAT is a form of network address translation for outgoing calls from a service on Azure (for instance, calls from TFS –> SQL Azure, Azure storage, our accounts service, ACS, etc). For any given destination there are only 64K ports and once you exhaust them, you can no longer talk to the service – the networking infrastructure stops you. This is a design pattern you really need to think about with multi-tier, scale-out services because the multiplexing across tiers with VIPs (virtual IP addresses) can explode the use of these SNAT ports.
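To make the failure mode concrete, here's a minimal Python sketch (the endpoint and calls are invented for illustration – this isn't our code): opening a fresh connection for every outgoing call consumes a SNAT port each time, while a shared, pooled session reuses connections and keeps port consumption roughly flat.

```python
import requests

# Anti-pattern: every call opens a fresh connection, consuming a SNAT port
# each time. Under peak load this can burn through the ~64K ports available
# for a given destination.
def validate_account(account_id):
    # hypothetical endpoint, for illustration only
    resp = requests.get(f"https://accounts.example.test/validate/{account_id}")
    return resp.ok

# Mitigation: a shared session keeps connections alive and reuses them,
# so port consumption stays roughly flat regardless of request rate.
session = requests.Session()

def validate_account_pooled(account_id):
    resp = session.get(f"https://accounts.example.test/validate/{account_id}")
    return resp.ok
```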
This was the first incident we had where the core issue was a networking mechanics issue, and we didn't have a great deal of experience with it. After struggling for a few hours to identify what was happening – and finally figuring out the port exhaustion issue – we pulled in the Azure networking team to provide the expertise to further diagnose it. The core thing we struggled with next was figuring out which of the services we talk to was using up all the SNAT ports. This ultimately uncovered 2 "issues" that will get fixed:
- The Windows Azure networking infrastructure logging records which source IP address exhausted SNAT ports, but not the target IP, so the logs didn't help us figure out which service was killing us. Azure has now decided to add that to their logging info to make future incidents easier to diagnose.
- We, at the VSOnline level, have pretty involved trace information. However, we eventually discovered that we were missing the trace point for the one target service that was actually the choke point here. It was our connection to ACS. We have gone back and made sure that we have trace points for every single service we talk to.
Once we understood the right code path that was causing it, we were able to focus our efforts. As with most of these things, it was not one bug but rather the interaction of several.
Send some people home
A brief aside… I was at the launch with a bunch of people when this happened. One of them was Scott Guthrie. Throughout the day we were chatting about the problems we were having and how the diagnostics were going. He gave me a piece of advice: send some people home. He related to me a story about a particularly bad incident they had 18 months or so ago. He said the whole dev team was up for 36 hours working on it. They had a fix, deployed it and, unfortunately, not only did it not resolve the problem, it made it worse. After 36 hours of straight work, no one on the dev team was coherent enough to work effectively on the new problem. They now have a policy that after so many hours of working on a problem, they send half the team home to sleep in case they need to start rotating fresh people. That made a ton of sense to me so we did the same thing at about 8:00pm that evening.
The core problem
On launch day, we enabled a feature flag that required customers to accept some "terms of service" statements. Those statements had been on the service for a while and most people had already accepted them, but on launch day we started requiring acceptance in order to keep using the service.
However, there were 55 clients out there in the world who had automated tools set up, regularly pinging the service using accounts that no human had logged into to accept the terms of service. Those 55 clients were the trigger for the bug.
Those requests went to TFS. TFS then made a request “on behalf of the client” to our account service to validate the account’s access.
The first bug was that the account service returned a 401 – Unauthenticated error rather than a 403 – Unauthorized error when it determined that the account had not accepted the terms of service.
The second bug was that TFS, upon receiving that error, rather than just failing the request, interpreted it as indicating that the ACS token the TFS service itself used was invalid, and it decided to contact ACS to refresh the token. It nulled out the token member variable and initiated an ACS request to refill it.
The third bug was that the logic that was supposed to make subsequent requests block waiting for ACS to return was broken and, instead, every other request would call ACS to refresh the token.
The net result was a storm of requests to ACS that exhausted SNAT ports and prevented authentications.
Once we did the minimal fix to address this problem (repairing the blocking logic so that requests no longer fell through to ACS), the situation got MUCH better – which is why Monday was much better than Wed or Thurs of the previous week.
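For illustration, here's a minimal Python sketch of the "block and wait" behavior the broken logic was supposed to have – one caller refreshes the token while everyone else waits for it. The names are mine; the actual TFS implementation isn't public.

```python
import threading

class ServiceTokenCache:
    """Single-flight refresh: when the cached token is invalidated, exactly
    one caller fetches a new one (e.g. from ACS); everyone else blocks on
    the same lock and reuses the fresh token instead of issuing its own
    refresh request."""

    def __init__(self, fetch_token):
        self._fetch_token = fetch_token   # the slow cross-service call
        self._token = None
        self._lock = threading.Lock()

    def invalidate(self):
        self._token = None

    def get(self):
        token = self._token
        if token is not None:
            return token
        with self._lock:
            # Re-check inside the lock: another thread may have already
            # refreshed the token while we waited. This re-check is, in
            # effect, the piece that was broken in the incident above --
            # without it, every other request storms the token service.
            if self._token is None:
                self._token = self._fetch_token()
            return self._token
```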
However, it turns out there was another problem hiding behind that one. Once we eliminated ACS SNAT port exhaustion, we still had a small(er) incident on Monday.
The second problem
In order to avoid calling through to the account service, we "cache" validations and check them in TFS, saving the round trip. The "main" way we do this is by wrapping the validation in the token/cookie returned to the client. So when a client comes back with the cookie, TFS checks the signature, the expiration date and the claim, and accepts the claim. It only falls through to the account service when this check fails.
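As a rough illustration of that kind of local check (the token format and key handling here are invented, not our actual implementation), the shape is: verify the signature, then the expiry, and only fall through to the account service if either fails.

```python
import base64, hashlib, hmac, json, time

SIGNING_KEY = b"server-side secret"   # illustrative only

def claim_is_valid(cookie: str) -> bool:
    """Validate a signed claim locally. Returning False is the signal to
    fall through to the account service."""
    try:
        body_b64, sig_b64 = cookie.rsplit(".", 1)
        expected = hmac.new(SIGNING_KEY, body_b64.encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
            return False                               # bad signature
        claims = json.loads(base64.urlsafe_b64decode(body_b64))
        return claims["expires_at"] > time.time()      # expiry check
    except (ValueError, KeyError):
        return False                                   # malformed cookie
```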
When this logic was implemented, it was decided not to change the token format for older clients – for fear of breaking older clients and not wanting to have to update them. So instead, older clients use a more traditional “hash table” cache in TFS. It turns out that the hash table cache was broken and always caused fall through to the account service.
Unfortunately, on Monday, someone with an old client was running hundreds of thousands of requests against TFS (about 8 per second). Combined with our daily peak load, this caused a storm against the account service and again slowed things down unacceptably.
Part 2 of Monday’s incident was a bug where TFS was holding a global lock while it made a call to our accounts service. This code path is not executed on every request but it is executed often enough that it caused the lock to serialize a large number of requests that should have been executed in parallel. That contributed to the overall slow down.
As often happens in a prolonged case like this, along the way we found many other things that turned out not to be "the issue" but are nonetheless worth fixing. My favorite was a regex that was sucking down 25% of the CPU on an AT (application tier) just to rewrite the account creation URL to include the proper locale. Yeah, that's crazy.
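The post-mortem details aren't in this post – rebuilding the pattern on every request and pathological backtracking are the usual suspects for an expensive rewrite regex. Purely as a hypothetical sketch of the standard fix (URL and names invented): compile an anchored pattern once, outside the request path, and reuse it.

```python
import re

# Hypothetical rewrite: inject a locale into the account creation URL.
# The pattern is anchored and compiled once, not per request.
CREATE_ACCOUNT_URL = re.compile(r"^(?P<host>https://[^/]+)/account/create")

def add_locale(url: str, locale: str) -> str:
    return CREATE_ACCOUNT_URL.sub(rf"\g<host>/{locale}/account/create", url)

# add_locale("https://signup.example.test/account/create", "fr-FR")
#   -> "https://signup.example.test/fr-FR/account/create"
```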
All-in-all it was a bad incident but quite a number of improvements will come out of it.
With the incident now understood and mitigated, the next thing we’ll turn our attention to is the retrospective. I think we’ll end up with a lot of learnings. Among them are:
- We really need multi-instance support. The current service is one instance. It's scaled out but you either update it or you don't. Aside from what we can control with feature flags (which is very nice), either all customers are affected or none are. The service is now at a scale where there's just too much risk in that. We need to enable multi-instance support so that we can roll out changes to smaller groups of people and observe the behavior in a production environment. We already have a team working on this, but one result of this incident is that we've accelerated that work.
- Most sophisticated multi-threaded apps have a lock manager. A lock manager manages lock acquisition and ordering, causing illegal ordering (that might cause a deadlock) to fail immediately. This turns a potential race condition into something that's detected every time the code path is executed. One of the things I learned here is that we need to extend that mechanism to also detect cases where we are holding onto important exhaustible resources (like connections, locks, etc) across potentially slow and unpredictable operations (like cross-service calls) – see the sketch after this list. You have to assume any service can fail and you cannot allow that failure (through resource exhaustion or any other means) to cascade into a broader failure across the service. Contain the failure to only the capabilities directly affected by it.
- We need to revisit our pre-production load testing. Some of these issues should have been caught before they went to production. This is a delicate balance because “there’s no place like production” but “trust thy neighbor but lock your door.”
- I think there’s another turn of the crank we can take in evolving some of our live site incident management/debugging process. This includes improvements in telemetry. We should have been able to get to the bottom of these issues sooner.
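On the lock-manager point above, here's a rough Python sketch of the kind of detection I mean – a tracked lock plus a decorator that makes a cross-service call fail immediately if any tracked lock is held. It's a sketch of the idea, not our lock manager:

```python
import threading

_held = threading.local()   # names of tracked locks held by this thread

class TrackedLock:
    """Context-manager lock that records, per thread, that it is held."""

    def __init__(self, name: str):
        self.name = name
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        _held.names = getattr(_held, "names", []) + [self.name]
        return self

    def __exit__(self, *exc):
        _held.names.remove(self.name)
        self._lock.release()

def remote_call(fn):
    """Decorator for cross-service calls: fail fast if any tracked lock is
    held, so the bug is caught every time the code path runs rather than
    only when the remote service happens to be slow."""
    def wrapper(*args, **kwargs):
        held = getattr(_held, "names", [])
        if held:
            raise RuntimeError(f"{fn.__name__} called while holding {held}")
        return fn(*args, **kwargs)
    return wrapper
```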
There’s more and I’m looking forward to the retrospective to get everyone’s perspective.
Second Incident: Nov 20th
Unfortunately, bad luck comes in 3's and we got hit with a pair of incidents on Wed. The first was an intermittent failure in Windows Azure Storage. It appears (though I don't have all the facts yet) that Azure was doing an upgrade on the cluster we use for VS Online and something went wrong with it. The result was intermittent outages starting at 9:00PM PST on Wed night and going through lunchtime on Thurs – unfortunately, right through another peak usage time for us. Then, to add insult to injury, someone made an error in a global Microsoft networking configuration, breaking a DNS setting, that impacted our service and many others across Microsoft. That added about another hour to our misery in the mid-afternoon.
Now, I don't want to point fingers here because this was another example where a failure in one of the services we consume cascaded into a much larger impact on our service than it should have. The changes that I talked about above (managing exhaustible resources and unreliable operations) would have significantly mitigated this incident.
All-in-all, it was an inauspicious week and a half. We are going to have incidents. Every service does. You can plan and test and be robust and there will be issues. However, what we’ve experienced recently isn’t reflective of what we are committed to providing. It’s been a painful but productive learning experience and I believe the result will be a more robust service. I appreciate your patience and we will continue to work hard to provide the best service there is. Anything less is unacceptable.
I don’t get asked that question too often but I do occasionally and, as the service matures, I know I’ll get asked it more and more so it’s been on my mind. I was looking at some data yesterday about some of our largest tenants. No, I wasn’t looking at any of their IP (I can’t) but I was looking at some meta-data to understand usage patterns so we can plan ahead to make sure the service provides a good experience as tenants grow.
So far, no customer has hit any limit on how much they can store in VSOnline but there are limits and I keep wondering how to help people understand what they are so they can think about them in their planning. For the purpose of this conversation there are 2 main kinds of storage that you use:
1) Blob store – this is the size of the files, attachments, etc. that are stored on the service. The files are compressed, so that affects the size. The blob store is, for all intents and purposes, unlimited (though we may from time to time impose limits to prevent abuse). Legitimate use is basically unlimited.
2) Meta-data store – Metadata (version control version info, work item records, test execution results, etc) are stored in a SQL Azure database. Today the limit on a SQL Azure database is 150GB. That’s a hard limit that we live with. SQL Azure has a road map for increasing that and we are also working with them to get compression support (our stuff compresses incredibly well) so I don’t see this being a big issue for anyone anytime soon but it’s always on my mind.
So the question I’ve struggled with is how do I answer the question “How much data can I put in VSOnline?” No one is ever going to be able to wrap their head around what the 150GB meta-data limit means. So I tend to think that people most easily relate to the size of their source code/documents/attachments and everything else kind of works out in the wash. Of course usage patterns can vary and you may have a very large number of work items or test results compared to others but so far, it’s the best measure I’ve been able to come up with.
So as I was looking at the data yesterday, here’s what I found about our largest tenant to date:
260GB compressed blob store – I usually estimate about a 3X compression ratio (it varies depending on how much source vs binary you check in but, on average, it's pretty close). So that's about 780GB of uncompressed data.
11GB of meta data - So, that puts them about 7% of the way to the limit on meta-data size – plenty of headroom there.
So if I extrapolate to how much data they could store before hitting the meta-data limit, I get: 150GB/11GB * 780GB = 10.5TB. That’s a pretty promising number! There aren’t many orgs that have that much development data to store.
So, the next question on my mind was whether or not the blob to meta-data ratio was consistent across tenants. In other words, can everyone get this much data in, or do usage patterns vary enough that the results are significantly different? As you might imagine, the answer is that they do vary a lot. I looked at a number of other larger tenants and found that ratios varied between about 5 and 23 (it turns out the largest tenant also had the largest ratio). So if I take the most conservative number and do the same extrapolation, I get 2.2TB.
So right now, the best I can say is today you can put in between 2.2TB and 10.5TB depending on usage patterns. Either way it’s a lot of data and no one is close to hitting any limits.
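If you want to run the same extrapolation on your own numbers, here's the arithmetic as a tiny Python sketch, using the constants from this post (the 3X compression ratio is my rough estimate, as noted above, and the rounding differs slightly from the in-post figures):

```python
SQL_METADATA_LIMIT_GB = 150   # current SQL Azure database ceiling
COMPRESSION_RATIO = 3         # rough blob compression estimate from above

def capacity_estimate_tb(blob_gb: float, metadata_gb: float) -> float:
    """Uncompressed data an account could hold before hitting the
    meta-data limit, assuming its blob:meta-data ratio stays constant."""
    blob_to_meta = blob_gb / metadata_gb
    return SQL_METADATA_LIMIT_GB * blob_to_meta * COMPRESSION_RATIO / 1000

print(capacity_estimate_tb(260, 11))   # largest tenant: ~10.6 TB
print(capacity_estimate_tb(55, 11))    # conservative ratio of 5: ~2.25 TB
```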
A bit of a random thought for the day but I thought you might be curious.
Overall, tire kicking on Application Insights is going well and the invitation codes continue to get used up pretty quickly. We’ve now got many hundreds of accounts enabled so I’m probably going to slow down a bit on handing out invitation codes. But…
Here’s another one for another opportunity: VSInsights258441337
A couple of weeks ago one of our great partners, eDev Technologies (makers of InteGREAT – an excellent formal requirements management tool for TFS), released a new set of tools called SmartOffice4TFS. Whereas InteGREAT is a pretty comprehensive requirements suite, SmartOffice4TFS is intended for teams with a less formal process but who still need to be able to manage requirements as documents.
SmartOffice4TFS helps bridge the gap between the work that the development team is managing in TFS and stakeholders/customers/vendors/etc. that need a document. To make SmartOffice4TFS even more attractive, eDev Technologies is offering a 40% discount to MSDN subscribers through the end of the year.
SmartWord4TFS allows you to export TFS work items/requirements into document templates, edit those documents, and publish updates back into TFS. It enables full round-tripping. This lets you produce standard requirements documents from data in TFS and author TFS requirements online or offline. Meanwhile, you get all the capabilities of MS Word – like SharePoint workflows for review and approval processes.
SmartVisio4TFS has similar abilities but is designed for working with diagrams. It enables you to link work items to individual elements of your diagram and can color the shapes on your diagram based on the state of the related work item. Of course, it supports the same handy round-tripping with TFS that the Word add-in does. As a really cool bonus, SmartVisio4TFS can process your flow diagrams and automatically generate test cases that cover all the branches in your process. You can learn more at http://www.smartoffice4tfs.com/. And don't forget to ask about your 40% discount for MSDN subscribers.
Well, the last one was exhausted in a few hours. It’s great to see there are plenty of people who want to take an early peek. Here’s another one to unblock the next group of people…
In the launch keynote on Wed, I announced, and Nicole demoed, a new service on VS Online called Application Insights.
I also announced that it is in “limited preview” – meaning it’s an invitation only service for now. It’s early and we want to grow it slowly so that we can incorporate feedback early, getting fresh eyes along the way. The most dependable way to get an invite is to create a VS Online account, then click on the Try Application Insights tile and request an invitation code by clicking on “Add me to waiting list”…
We’ll try to get invite codes out at a reasonable pace but it’s just going to depend on demand.
But, to make it even easier for those willing to dive in early, here’s an invitation code you can use today. It has close to 100 activations on it. Don’t use it if you don’t actually plan to play with it though. I’ll post new codes on my blog every so often and you can always just request one via the web page.
Invitation Code: VSInsights8522381191
We look forward to hearing your feedback.
Today, I am in New York at the launch of Visual Studio 2013 speaking with a local audience and broadcasting around the world. In many ways it’s a celebration of the blood, sweat and tears we put into VS 2013. At the same time, it’s a bit anti-climactic because VS 2013 has been in preview for months and released for weeks. For me, there’s nothing really new to say about it because hundreds of thousands of people have it and are using it daily. So the fun part is talking about all of the stuff that we’ve been working on that isn’t part of VS 2013 – and there’s a ton of it.
First, we all know the cloud never sleeps. It doesn't follow a shrink-wrapped product cadence. It marches on relentlessly. Today we are announcing Visual Studio Online (an evolution of what was Team Foundation Service). I've written a fairly extensive news post on it that I won't repeat here, but there's a lot to know about it. For one thing, we unveiled and enabled the commercial terms for the service – paid accounts. We added cloud build and cloud load testing to the "released" features of the service and released previews of 2 new services: Application Insights, for understanding the reliability, performance and usage of your apps, and "Monaco", a lightweight browser-based editing environment that's a great complement to Azure. Check out the news post to learn more.
And wait Johnny, that’s not all…
We released the first official Microsoft release of Release Management for Visual Studio (formerly InRelease). The new release management capabilities allow you to easily manage software releases in a way that is rapid, reliable and repeatable. You can get a free trial or you can download the licensed software as part of your MSDN subscription from the MSDN portal. This release works with TFS 2010, 2012 and 2013 so trying it is a “no brainer”.
It's an exciting day and there's a bunch more on the way. As I've been saying for the past several sprints now, we've already got a lot of post-VS/TFS 2013 work under way. The new cloud services and capabilities we released today were certainly big chunks of it, but we'll have even more in the coming sprints and some pretty exciting stuff after the new year.
Visual Studio 2012.4
Today, we are releasing Visual Studio 2012 Update 4. We continue the trend of gradually tapering off how much goes into each update – particularly now that VS 2013 is available – and this update contains some customer-driven bug fixes and fixes for a few compat issues.
You can download the update here: http://go.microsoft.com/fwlink/?LinkId=301713
Or you can just download it when Visual Studio notifies you that an update is available.
Hold on tight, it’s a wild ride,
We had a nasty enough service incident on Friday that I can’t let it pass without commenting on it. If you follow my blog or the service, then you know we were doing a deployment on Friday. The natural assumption, both inside and outside Microsoft, was that it was caused by the deployment. It wasn’t, but more on that in a minute.
We received notable Twitter traffic, blog comments and emails, both about this incident and about the pattern of incidents. I'm going to tackle both topics (at least lightly) here. First, people are justified in being upset about Friday's incident. The service was mostly down for about 3 hours. That's clearly not acceptable. There was also a pattern of comments along the lines of "the service has incidents too often". I agree. We know it and we are constantly working on making it better. And, actually, it is getting consistently better. It's not good enough yet, but here is a trend of significant incidents over the past several months.
We’ve got a lot of work to do but we’re making progress. Not only are incidents generally getting less frequent, they are getting less severe (though Friday’s was pretty severe). Incidents as bad as Friday’s are actually reasonably rare.
A few of the comments I got about the service availability referenced our service status blog as evidence that incidents happen almost every day. Well, kind of. It's true that some incident happens almost every day. However, the majority of incidents turn out to be very minor – with little or no customer impact. For now we have chosen to err on the side of notifying of potential issues before clarifying the impact. My guidance to the team is "If a customer is seeing a problem, they shouldn't go to the service status page and have it tell them everything is fine – that's a very frustrating experience". However, it causes us to be very chatty. Our plan is to improve our service health status reporting so customers can better understand whether incidents likely affect them – but that's months away.
None of that takes away from the fact that we want and need to have fewer impactful events.
Availability and service quality have been on my mind a lot lately as the service continues to mature, and that's why I wrote a post on the topic a few weeks ago. As we analyzed various availability models, we took 6 months or so of historical data and tested the models against it. What we found is that dips in availability correlate very strongly with our deployment weeks – in other words, when we change things, stuff breaks. It proved the old operations maxim that "the best way to run a highly available service is to never change it". True, but it misses the point.
So, it’s not surprising that people would assume the outage we had on Friday was the result of our deployment. To be honest, I did too. However, Monday we did a deep dive on the root cause and it was enlightening.
There were actually two distinct (and as best we can tell, unrelated) issues that happened on Friday. The first was a cache overflow and the second was a synchronization task that bogged down the system.
Cache overflow – we keep a cache of group membership resolution so that we don't have to go back to the database to check people's group membership every time we need it. The cache had 1,024 entries in it and a bad eviction policy. The load got high enough (we hit a new usage record on Friday) that the cache overflowed and began to thrash. The next problem was that the cache fault algorithm held a lock on the cache while it fetched the data from the database – never, ever, ever do that. That caused the thrashing cache to queue faults and the system backed up to the point that no one could log in any longer. We took away several action items:
- Increase the cache size.
- Fetch the data outside the lock.
- Switch to an LRU eviction policy.
- Add a new feature to our lock manager and data access layer so that, when the right configuration setting is set, we throw an exception anytime a SQL call is made while a lock is held.
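To illustrate the middle two action items, here's a minimal Python sketch of an LRU membership cache that releases its lock before the slow database fetch (the class and names are mine, not the TFS code):

```python
import threading
from collections import OrderedDict

class MembershipCache:
    """LRU cache that never holds its lock across a database fetch."""

    def __init__(self, fetch, capacity=100_000):
        self._fetch = fetch            # slow call, e.g. group lookup in SQL
        self._capacity = capacity
        self._lock = threading.Lock()
        self._entries = OrderedDict()  # key -> membership, in LRU order

    def get(self, key):
        with self._lock:
            if key in self._entries:
                self._entries.move_to_end(key)   # refresh LRU position
                return self._entries[key]
        # Cache miss: do the slow fetch with NO lock held, so a slow
        # database doesn't serialize every other cache lookup behind us.
        value = self._fetch(key)
        with self._lock:
            self._entries[key] = value
            self._entries.move_to_end(key)
            while len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least recently used
        return value
```

The trade-off is that two concurrent misses for the same key may both fetch; that bit of duplicated work is far cheaper than serializing every lookup behind a slow database call.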
Sync job – We have a synchronization job that updates work item tracking with the relevant list of users (for instance, candidates to assign a work item to). We have to sync any relevant changes to a user (like display name changes). The trigger for the problem was twofold – an increase in the number of changes to identities and an increase in the total number of identities in the system. The root cause was a couple of bad design decisions. The first was the fact that any identity change makes the identity a candidate for synchronization – even if work item tracking doesn't use the property that changed. So we need a filter to only deal with synchronization when a relevant property is changed. The more serious issue, though, was the "directionality" of the algorithm. What it did was query all changed identities in the system and "join" that to the list of relevant identities in an account. The problem is that, particularly on the public service, the number of changed identities dwarfs, by orders of magnitude, the number of relevant identities in any given account. The result is hundreds of times more processing by, effectively, using the wrong index. The quick mitigation was to reverse the join, and sync job times dropped from, sometimes, minutes to milliseconds.
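The directionality point is easy to see in a sketch (Python sets stand in for the real SQL indexes; this is an illustration, not the actual job):

```python
# Wrong direction: walk every changed identity in the whole system (huge
# on a public service) and test membership in one account.
def to_sync_slow(changed_identities: set, account_members: set) -> set:
    return {i for i in changed_identities if i in account_members}

# Reversed: walk the (comparatively tiny) account membership and probe the
# set of recent changes. Same answer, orders of magnitude less work.
def to_sync_fast(changed_identities: set, account_members: set) -> set:
    return {m for m in account_members if m in changed_identities}
```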
Neither of these things changed in the deployment. It was just a coincidence that the thresholds that triggered the issues were crossed on Friday. So my question was: if these things were a result of increased load/data volume, which didn't change overnight, what were the symptoms we should have been able to detect ahead of time? The problem with these kinds of things is that failures like this happen gradually and then all at once. It gets a little worse and a little worse and then the *** hits the fan.
We generally have a "hygiene" process that involves tracking exceptions, event log entries, response times, available memory, etc. The goal is to detect the early signs of a problem and address it before it passes the knee of the curve. The problem is that some of these measures are particularly noisy and it's too easy to convince yourself that you are just seeing noise. Clearly we have some work to do to change our assessment practice here.
To help visualize this, here is a graph of the average times for the group membership resolution times. As you can see, it was regularly spiking at close to 10 seconds (the units of the y-axis are microseconds). In retrospect everyone knows that’s not expected. Typical times should be in milliseconds.
On Friday, the averages spiked to about 30 seconds, and that was enough to start triggering timeouts, etc. Not good. We should have seen this before Friday and never let it happen.
As a brief aside, I'm breaking one of my golden rules here: never use average data – use percentiles. Averages hide "the truth". Lots of "good" values can hide a few VERY bad values. Look at the 50th percentile, or the 75th, or the 90th, etc. Don't look at averages. It turns out averages are easy to compute so, in a rush, they sufficed to demonstrate the problem here, but they're never what I would use in a formal analysis.
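A quick, self-contained illustration of why (numbers invented): with 2% of requests taking 30 seconds and the rest taking 20ms, the average looks merely sluggish while the high percentiles show the disaster.

```python
# 2% of requests take 30 seconds; the rest take 20ms.
times_ms = [20] * 980 + [30_000] * 20

def percentile(values, p):
    ordered = sorted(values)
    return ordered[int(p / 100 * (len(ordered) - 1))]

print(sum(times_ms) / len(times_ms))  # average: ~620ms – looks merely "slow"
print(percentile(times_ms, 50))       # p50: 20ms – most users are fine
print(percentile(times_ms, 99))       # p99: 30000ms – the pain is visible
```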
Anyway, I'm sorry for the incident on Friday. There were plenty of learnings. The biggest one I want to share is that you have to listen to your service. Very often it will "silently" degrade gradually and then catastrophically fail. You have to catch it before the cliff.
Because of the big launch we’re doing next week, we updated TFService with the sprint 56 deployment a little early. We deployed a couple of nice improvements today – making our charting feature easier to share and improving our load testing service. You can read more about the changes in the release notes.
Next week is going to be a big week. I'll be in New York participating in our VS/TFS 2013 launch. As part of that, I'll be making a bunch of new announcements and enabling a new set of features on the service. You can participate online at http://events.visualstudio.com or you can read about it on the service news page.
I’m looking forward to an exciting week.
A little over a year ago we unveiled a continuous delivery feature for TFS –> Azure, allowing you to configure the TFS build pipeline to automatically deploy to running Azure applications. In the intervening time, we added Git support to TFS but getting Git integrated into all of our scenarios is a journey. This week we released the next small step on that journey: TFS Git –> Azure continuous delivery using Team Foundation Service.
Scott wrote up a nice overview of the scenario on his blog so, rather than replicate it, I'll just refer you there. It starts about halfway down his post.
Check it out and, as always, feedback is encouraged.
Among the deluge of Visual Studio 2013 releases in October, we shipped the 2013 release of Team Explorer Everywhere – updating the experience for team members working in Eclipse and/or on non-Windows environments. Team Explorer Everywhere includes an Eclipse plug-in, a cross-platform command line client, and a Java SDK for building custom tools that access TFS.
In addition to a good number of bug fixes, the 2013 release significantly improves the Team Explorer experience and adds new capabilities to both the Team Foundation Version Control and Git version control experiences.
You can download Team Explorer Everywhere 2013 from the Download Center or install the TFS plug-in for Eclipse from directly within your Eclipse IDE (update site URL: http://dl.microsoft.com/eclipse/tfs). If you run into any problems installing or using any of the TEE components, visit the Eclipse and Cross-Platform Tools forum.
Some highlights from the 2013 release:
Improved Team Explorer (with dockable views)
The Team Explorer view in TEE was greatly improved in the 2013 release. The look now matches the much improved look of the Team Explorer view in Visual Studio. TEE borrowed some of the nice organizational and navigational enhancements added in Visual Studio as well. Quick access is provided to the most commonly used functions using a context menu that appears when you right-click on a tile. For example, right-clicking on the Builds tile makes it easy to view completed or queued builds. We have also increased the number of places where you can launch into the web access portal, which saves time when needing to access a function that might only be available from the web.
Dockable views have also been added. You can now undock the Pending Changes and Builds views and position them anywhere within the workbench window. Both views also now appear under Window > Show View, which makes it possible to add these views to another perspective. For example, you can show Pending Changes right in the Java perspective and have quick access to view and check in your pending changes.
Find in Source Control
The Find in Source Control feature, which was previously shipped in the Power Tools, has now been fully integrated into the TEE product. This feature enables quick searching for files and folders in source control, with filtering by name, path, or check out status. This makes it particularly easy to find files that are currently checked out by any user or a specific user and/or under a path. Once results are returned, you can check out a file for edit, undo a pending change, view history, view properties, open in Source Control Explorer, or copy its full path to the clipboard.
Note: check out status and the ability to check out a file for editing in the results view are only available if the Show checkout status checkbox was checked when the search was performed.
Add to Source Control
In TEE 2013 RTM, we significantly improved the experiences around adding files to TF source control. The UI was rebuilt as a multi-page wizard instead of a dialog. The new wizard has several big improvements, including support for adding symbolic links to source control, creating workspace mappings directly when files are added to non-mapped source control folders, automatically filtering out local files that are already in source control, and importing files from local folders which are outside of workspace mappings.
With 2013, TEE now supports symbolic links (symlinks) transparently on Linux-based operating systems. Just like regular files, symbolic links can be added to source control and changes (like add, edit, and delete) can be detected, pended, and subsequently updated in source control. Developers have the full capabilities of version control when working with symlinks (history, branching, merging), and when symlinks are downloaded from version control, they are created as symbolic links in the file system.
Import for Projects in Git repositories
To make it easy to start working with code hosted in Git repositories on TFS, TEE 2013 includes a wizard for cloning and importing projects into your workspace. This wizard enhances the base import wizard provided by the Eclipse EGit tools and provides the ability to clone multiple repositories at one time (useful for large projects where code is spread across multiple repositories). For repositories hosted on Team Foundation Service, the wizard guides you through setting up alternate credentials, which is required since the EGit tools do not support federated authentication like TEE does. Once alternate credentials are set up and stored in the Eclipse Secure Storage (the wizard will do this for you), you will not be prompted to re-supply credentials as you work with the EGit tools. The wizard does the following:
- Clones one or more Git repositories from TFS to your local workstation
- Detects and imports Eclipse projects found in these cloned repositories
- Sets up connections to cloned repositories in EGit
Integration with Eclipse EGit tools
To support working with Git repositories in TEE, we made the decision to leverage and extend the existing Eclipse EGit tools (these tools are included in the most popular Eclipse packages and are well-integrated into Eclipse). A connection to a remote repository hosted in TFS can be manually configured in EGit (note: make sure to use alternate credentials when setting up your connection to a repository hosted in Team Foundation Service) or is set up when you use the import wizard provided by TEE. Once this connection is established, you can use the EGit tools to perform basically any Git function, like committing, pushing, and creating new branches.
To see the full set of EGit tools, open the Git Repository Exploring perspective (under Window > Open Perspective > Other), find your repository, and right-click. Right-clicking on a file or folder in the Navigator, Package Explorer, or other workspace views will show resource-level options (like Commit) under the Team sub-menu.
I encourage you to download the newest version of TEE and start exploring these new features. As always, your feedback to the forum is always appreciated. We are continually enhancing TEE, and have some really cool stuff planned for the next updates. Stay tuned.
We released the first drop of our Visual Studio tools for Git about a year ago as a plugin for VS 2012. Our ultimate goal was to release them as part of VS 2013 RTM (and we did). At the same time, we wanted to iterate quickly on VS 2012 because we could get a lot more feedback and because there’s always an adoption curve for a new VS release and supporting 2012 allows us to deliver for more customers.
A couple of days ago, we released the “final” build of the Git tooling for VS 2012. You can get the latest build on the Visual Studio Gallery: http://visualstudiogallery.msdn.microsoft.com/abafc7d6-dcaa-40f4-8a5e-d6724bdb980c. I encourage you to check it out and we’ll work hard to support you in your use of it.
We’ll continue to support/maintain the VS 2012 Git tooling but, at this point, all significant new Git features will be delivered for VS 2013 and above – we’ll mostly be doing bug fixes on the 2012 Git tooling from here on out.
I genuinely hope you enjoy the tooling and appreciate any feedback you have.
In TFS 2013, we introduced a new feature called "Team Rooms" that keeps a record of things that happen in your team – checkins, work item updates, build failures, code reviews, etc. And you can have conversations about the activity directly in the team room. This keeps a durable record of what's happening in the team and makes it easy for people to catch up if they've been out, or to ask a question about something.
In 2013, Team Rooms only appear in the TFS web UI. From the day we previewed it, we started getting questions about enabling team rooms in Visual Studio. Unfortunately, we didn’t have time to do that for 2013. However, one of our MVPs has taken it upon himself to build a VS extension using our REST APIs.
It’s still in preview but I’ve played with it some and it seems pretty good. Check it out: http://visualstudiogallery.msdn.microsoft.com/c1bf5e4f-5436-465d-87da-09b2f15ff061. It will work both with on-premises Team Foundation Servers and Team Foundation Service.
If you checked out the news post yesterday, you'll see that the big change in this sprint's deployment was to the project and account home pages. Quite honestly, the old account home page was mostly useless and not very visually appealing. One of the big things we focused on with these updates was improving the getting started experience. One change we made was a new "Create your first project" panel on the account home page (where you land immediately after creating an account) to try to make it crystal clear what the next step to do anything useful is.
After you've created a project, the "Create your first project" panel disappears and you see the new account home page with, among other things, a set of (dismissable) tiles that provide additional information about getting started with the experience.
After dismissing it, you see this. I’ve narrowed the browser in this screenshot to highlight another effort we’ve undertaken. We’ve started, with these two home pages, to implement adaptive UI that changes layout when the browser size changes. You’ll notice this is a 2 column layout rather than a 3 column layout. We’re not yet focused on solving for phone form factors but we expect this UI to work effectively for tablets and up. We’ve also got work to do to improve our touch experience but you should expect that this is a direction we’ll be evolving towards.
To see the project home page and learn about a few more things, you can read Aaron’s news post.
In the wee hours of this morning, we made the final versions of Visual Studio 2013, Team Foundation Server 2013 and .NET 4.5.1 available. You can download the trials and related products, and MSDN subscribers can download the licensed product from the subscriber portal. You can learn more about what's new in VS 2013 here.
Windows 8.1 is also available today. It’s quick and easy to upgrade from the store so get it soon.
On November 13th, we’ll be hosting the Visual Studio 2013 launch. At launch, we’ll be highlighting the breadth and depth of new features and capabilities in the Visual Studio 2013 release.
Getting VS 2013 is easy. If you have an active MSDN subscription, you already own it. If you don't, you can upgrade your VS Pro subscription for $99 for a limited time. Check out the purchasing options here: http://www.microsoft.com/visualstudio/eng/buy
VS 2013 can be installed side by side with previous versions of Visual Studio or, if you have a VS 2013 pre-release, it can be installed straight over top of the pre-release. TFS 2013 cannot be installed side by side but can also be installed over top of either a previous version (TFS 2012 or TFS 2010) or a pre-release.
We're already hard at work planning the improvements we will make in VS/TFS 2013.1, so expect to hear more over the next few months.
Brian created a great little video on why he likes Team Foundation Service. I figured I’d share it…
Since we started down the path of building an online service a couple of years ago, I have learned a lot. One of the things I’ve learned a lot about is measuring the health of a service. I don’t pretend to have the only solution to the problem so I’m happy to have anyone with a differing opinion chime in.
For the purpose of this post, I’m defining the “quality of a service” as the degree to which it is available and responsive.
The “traditional way” of tackling this problem is what’s called “synthetic transactions”. In this approach, you create a “test agent” that is going to make some request to your service over and over again every N minutes. A failed response indicates a problem and that time window is marked as “failing”. You then take the number of failed intervals and divide by the total number of intervals in a trailing window, let’s say 30 days, for instance, and that becomes your availability metric.
So what’s wrong with this? Let me start with a story…
When we first launched Team Foundation Service, we had a lot of problems with SQL Azure. We were one of the first high scale, interactive services to go live on SQL Azure and, in the process, discovered quite a lot of issues (it’s much better now, in case you are wondering). But, 3 or 4 months after we launched the service, I was in Redmond and was paying a visit to a couple of the leaders of the SQL Azure team to talk about how the SQL Azure issues were killing us and I needed to understand their plan for addressing the issues quickly.
As I walked through the central hallway on their floor, I noticed they had a service dashboard rotating through a set of screens displaying data about the live service. As an aside, this is a pretty common practice (we do it too). It's a good way to emphasize to the team that in a service business, "live-site" is the most important thing. I stopped for a few minutes just to watch the screens scroll by and see what they said about their service. Everything was green. In fact, looking at the dashboard, you'd have no clue there were any problems – availability was good, performance was good, etc, etc. As a user of the service, I can assure you, there was nothing green about it. I was pretty upset and it made for a colorful beginning to the meeting I was headed to.
Again, before everyone goes and says "Brian said SQL Azure sucks": what I said is that 2 years ago it had some significant reliability issues for us. While it's not perfect now, it works well and I can honestly say that I'm not sure we could easily run our service without it. The high scale elastic database pool it provides is truly fantastic.
So how does this happen? How is it that the people who run the service can have a very different view on the health of the service than the people who use the service? Well, there are many answers but some of them have to do with how you measure and evaluate the health of a service.
Too often, measurements of the health of a service don't reflect the experience customers actually have. The "traditional" model that I described above can lead to this. When you run synthetic transactions, you generally have to run them against some subset of the service endpoints, against some subset of the data. Further, while it's easy to exercise the "read" paths, the "write" paths are more tricky because you often don't actually want to change the data. So, to bring this home: in the early days of TFService, we set up something similar and had a few synthetic transactions that would log in, ping a couple of web pages, read some work items, etc. That all happened in a test account that our service delivery team created (because we couldn't be messing with customer accounts, of course). Every customer of our system could, theoretically, be down and our synthetic transactions could still be working fine.
That's the fundamental problem with this approach, in my humble opinion. Your synthetic transactions only exercise a small subset of the data (especially in an isolated multi-tenant system) and a small subset of the endpoints, leaving lots of ways to miss the experience your customers are actually having.
Another mistake I've seen is evaluating the service in too much of an aggregate view. You might say 99% of my requests are successful and feel OK about that. But if all those failures are clustered on a small number of customers, those customers will abandon you. And then the next set, and so forth. So you can't blur your eyes too much. You need to understand what is happening to individual customers.
OK, enough about the problem, let’s talk about our journey to a solution.
One of the big lessons I learned from the very beginning was that I wanted our primary measure of availability to be based on real customer experience rather than on synthetic transactions (we still use synthetic transactions, but more on that later). Fortunately, for years, TFS has had a capability that we call “Activity logging”. It records every request to the system, who made it, when it arrived, how long it took, whether or not it succeeded, etc. This has been incredibly valuable in understanding and diagnosing issues in TFS.
Another of the lessons I learned is that any measure of “availability”, if you want it to be a meaningful measure of customer experience needs to represent both reliability and performance. Just counting failed requests leaves a major gap. If your users have to wait too long, the system can be just as unusable as if it’s not responding at all.
Lastly, any measure of availability should reflect the overall system health and not just the health of a given component. You may feel good that a component is running well but if a user needs to interact with 3 components, to get anything done, only one of them has to have a problem to cause the user to fail.
Our first cut at an availability metric was to count requests in the availability log. The formula was availability = (total requests – failed requests – slow requests) / total requests. For a long time, this served us pretty well. It did a good job of reflecting the kinds of instability we were experiencing. It was based on real user experience and included both reliability and performance. We also did outside in monitoring with synthetic transactions, BTW, but that wasn’t our primary availability metric.
Over the past 6 months or so, we've found this measure increasingly diverging from what we believe the actual service experience to be. It's been painting a rosier picture than reality. Why? There are a number of reasons. I believe the primary phenomenon is what I'll call "modified behavior". If you hit a failed request, for a number of reasons, you may not make any more requests. For instance, if you try to kick off a build and it fails, all the requests that the build would have caused never happen and never get the opportunity to fail. As a result, you undercount the total number of requests that would have failed if the user had actually been able to make progress. And, of course, if the system isn't working, your users don't just sit and beat their heads against the wall – they go get lunch. In this model, if no one is using the system, the availability is 100% (well, OK, actually it's undefined since the denominator is also 0, but you get the point).
We've been spending the last several months working on a new availability model. We've tried dozens, modeling each against all of our historical data, to see which one appropriately reflects the "real user experience". In the end, nothing else matters.
The data is still measuring the success and failure of real user requests as represented in the activity log, but the computation is very different. One additional constraint we tried to solve for was that we wanted a measure that could be applied equally to an individual customer (to measure their experience) or to the aggregate of all of our customers. This will ultimately be valuable when we get into the business of needing to actually provide refunds for SLA violations.
First, like traditional monitoring, we've introduced a "time penalty" for every failure. That is to say, if we get a failure then we mark an entire time interval as failed. This is intended to address the "modified behavior" phenomenon I described above. It changes the numerator from a request count to a time period. We need to change the denominator to a time period as well to make the math work. We could have just used the # of customers or users multiplied by the # of intervals in a month, but that really dampens the availability curve. Instead, we wanted the denominator to reflect the number of people actually trying to use the service and the duration in which they tried. To do that, we defined an aggregation period: any customer who uses the service in the aggregation period gets counted as part of the denominator. So, let's look at the formula.
In English the process works like this:
- For each customer who used the service in a 5-minute aggregation period, count the number of 1-minute intervals in which they experienced a failure (a failed request or a slow request).
- Sum up all those failing intervals across all customers that used the service in the period.
- Subtract that sum from the total customer minutes (the number of customers who used the service in the period, multiplied by 5 minutes). That gives you the number of "successful customer minutes" in that aggregation period.
- Divide successful customer minutes by total customer minutes to get a % of customer success for the period.
- Average that over all of the 5-minute aggregation periods in the window (288 in 24 hours) to get a % availability.
We're still tweaking the values – the 1-minute interval, the 5-minute aggregation period, and the 10-second perf threshold.
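Expressed as code, the computation looks roughly like this (a Python sketch with invented names and shapes – the production pipeline obviously doesn't look like this):

```python
AGG_PERIOD_MIN = 5      # aggregation period (still being tuned)

def period_availability(active_customers, failed_minutes):
    """Availability for one 5-minute aggregation period.

    active_customers: set of customers who made any request in the period.
    failed_minutes: {customer: count of 1-minute intervals containing a
                     failed or too-slow request for that customer}.
    """
    total_minutes = len(active_customers) * AGG_PERIOD_MIN
    if total_minutes == 0:
        return None                  # nobody used the service: undefined
    failed = sum(failed_minutes.get(c, 0) for c in active_customers)
    return (total_minutes - failed) / total_minutes

def window_availability(periods):
    """Average the per-period numbers over the window (288 periods/day)."""
    scores = [period_availability(a, f) for a, f in periods]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None
```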
Of all the models we've tried, this one provides a result that is reasonably intuitive, reasonably reactive to real customer problems (without being hyperactive) and more closely matches the experience we believe our customers are actually seeing. It's based on real customer experience, not synthetic transactions, and captures every single issue that any customer experiences in the system. To visualize the difference, look at the graph below. The orange line is the old availability model. The blue line is the result of the new one. What you are seeing is a graph of the 24-hour availability numbers. It will dampen a bit more when we turn it into a 30-day rolling average for SLA computation.
There’s a saying “There are lies, damn lies and statistics”. I can craft an availability model that will say anything I want. I can make it look really good or really bad. Neither of those are, of course, the goal. What you want is an availability number that tells you what your customers experience. You want it to be bad when your customers are unhappy and good when your customers are satisfied.
Is that all you need?
Overall, I find this model works very well but there's still something missing. The problem is that no matter where you put your measurement, there can always be a failure in front of it. In our case, the activity log is collected when the request arrives at our service. A request could fail in the IIS pipeline, in the Azure network, in the Azure load balancer, in the ISP, etc, etc. This is a place where we will use synthetic transactions, because there you are primarily just testing that a request can get through to your system. We use our Global Service Monitor service to place endpoints around the world and execute synthetic transactions every few minutes. We have some ideas for how we will integrate this numerically into our availability model but probably won't do so for a few months (this is not one of our real problems at the moment).
When I first started in this space, the head of Azure operations said to me: outside-in monitoring (what GSM, Keynote, Gomez, etc. do) just measures the availability of the internet, while "test in production" – running tests inside your own data center – measures the health of your app. I thought it was insightful. I think you still need to do both, but it's important to think about the role each plays in your overall health assessment strategy.
A word about SLAs
I can’t leave, even this ridiculously long post, without a word about SLAs (Service Level Agreements). An SLA generally defines the minimum level of service that a customer can expect from you. The phenomenon I’ve seen happen in team after team is, once the SLA is defined, it becomes the goal. If we promise 99.9% availability in the SLA then the goal is 99.9% availability. My team and others have heard me rant about this far too many times, I suspect. The SLA is not the goal! The SLA is the worst you can possibly do before you have to give the customer their money back. The goal is 100% availability (or something close to that).
Of course, all of these things are trade-offs. How much work does it take to get the last 0.0001% of availability, and how many great new features could you be providing instead? So, I'll never make my team do everything that is necessary to never have a single failure. But we'll investigate every failure we learn of, understand what we could do to prevent it, and evaluate the cost/benefit knowing the issue and the solution. Right now, I'm pushing for us to work towards 99.99% availability on a regular basis (that's 4.32 minutes of unexpected downtime a month).
Sorry for the length. Hopefully it's at least somewhat useful to someone out there. As always, comments are welcome.