Yesterday we launched a new Windows Dev Center. You might ask “Why are you telling us about that, Brian?” Well, the answer is that my team is responsible for MSDN and much of the infrastructure behind it. We partner very closely with the Windows team to produce an awesome Windows Dev Center. Yesterday was the culmination of months of work to really overhaul the dev center and make it serve our customers even better.
Here’s a list of some of the key improvements in the dev center…
- Navigation aligned with the development lifecycle
- Updates to key landing pages, with more visual appeal and interactivity
- A “Design” section for Windows Store apps that is more comprehensive, easy to use, and discoverable
- A new “Market” section for Windows Store apps that includes tips for marketing apps, evaluating telemetry data, and passing app certification.
- New cross-site navigation
- An experience that matches the personality of Windows and is aligned with Windows online
Check it out…
Yesterday, we deployed an update to Team Foundation Service. You can read the release notes to see the improvements. The improvements aren’t major but there are a few nice things – some updates to UI styling with a greater use of color, some navigation improvements that include the ability to see the task board for all sprints (rather than just the current one), and support for multiple Git repos per Team Project.
Less prominently, we’ve also made a bunch of changes to our authentication code. We’re working to better support people who use multiple Microsoft Accounts (Live IDs) and working towards ultimately supporting Active Directory Federation and O365 accounts. I’ve seen some hiccups today with people logging in – I expect they will be cleared up quickly.
This sprint’s deployment was delayed a week as we hit a number of issues in pre-production testing that had to be ironed out. Next sprint should resume on the normal schedule.
Yesterday, I got a thoughtful comment from Dean on my post about Update 3. I sat down to write a response this morning and it turned into a bit of a novel (in fact, the blog says it’s too long to post as a comment). So I’ve turned it into a post. Here was Dean’s comment:
I looked at the list of fixes, you weren't kidding when you said small. Are those just the major ones (crashing)? There are a whole lot of things logged on Connect that when added together really stifle productivity (many involving the editor with intellisense, syntax highlighting, etc.). I get nervous at this point in a VS life-cycle because I fear that the bugs I care most about will once again get worked on for the next major version. We spend money on the product and the window of time for things getting fixed is unbelievably tiny...before MS turns off the lights/closes shop until the next version. The new update cycle is great, I just hope it is not merely SP1 in chunks...I hope you still give updates to the product all along the way up until the next version.
And my reply…
@Dean, I try not to kid :) I'm not going to swear that's every single bug fix. Plus there will be a few more fixed between now and the final release - for instance, I just heard that we are fixing a few important Blend issues and those fixes didn't make the RC. It's not an incredibly small list - I count 59 bug fixes. I'm not sure whether to feel proud or ashamed that 34 of them are in my team :) I think the answer is proud, though I suspect this is an eye of the beholder issue. A story...
I was speaking last week with one of our test managers and he said to me...
People in shiproom don't know what to make of us. We bring so many more bugs to be fixed than any other team in VS that they just look at us sideways. But, my feeling is that if a customer reports a bug, we should just fix it. We're doing the update anyway and there's no point making the customer wait.
Aside - shiproom is a process we use near the end of any release to share the changes each team is making, allowing for peer review and feedback. The purpose is to help slow down the churn, reduce the risk of regression and warn other teams of any changes that might impact them.
It was a proud moment for me. I've spent the better part of 12 years arguing for a different way of thinking about this kind of thing. I grew up in startups where virtually every customer felt like the difference between going out of business or not. I spent 30% of my time on tech support and generally gave same day (or at most a couple of days) turnaround on bug fixes if a customer had an issue. I've always believed that trying (look, I understand that you'll never fully succeed) to make each and every customer happy is important. And the connection between developers and the problems customers face is an important feedback loop that forces learning over time.
I don't want anyone to think this is a simple "motherhood and apple pie" vs ignorance issue. There's a legitimate debate to be had. First, every time we fix an issue we have a non-zero probability of introducing a regression that our testing misses. Customers tend to be super unhappy if they get a fix for one problem only to find a new problem they didn't have before. Further, we all know interruptions are bad - they sap productivity. Having to stop what you are doing and investigate a customer reported issue (which often turns out to be a configuration problem), produce a fix, test it and deliver it can really reduce the overall volume of value delivered. Further, fixing every issue any customer reports can cause you to spend an inordinate amount of time fixing issues that affect relatively few people while the bulk of your customers wait for value. It's not a simple trade-off and neither extreme is the right answer. It's a balance and it all comes down to the values to apply to weigh that balance.
As for the instability introduced by regressions, I argue that some regression rate is acceptable as long as your time to repair is sufficiently short. While a customer will be frustrated by getting a new bug along with their fix to the last, as long as it doesn't happen too often and you fix the new bug quickly, the net result of fixing people's issues more responsively is a win. Reasonable minds can disagree on what rate is considered acceptable. The lower you want to drive the rate though, the less churn you can tolerate and the less frequently you can release. Generally your tolerance also varies by the kind of component you are working on, the ease of deploying updates, and many other factors.
Now, again, before someone jumps to calling me a hypocrite, let's talk about Team Project Rename. There's an "issue" that has existed for a long time and we've done nothing about it (right Allen?). Could we? Of course, it's just software - anything can be done. So that must mean we have decided not to despite it being one of the top customer generated requests. Brian, you're a hypocrite. I like to think I'm not but you can judge that for yourself. As I said, it's a balance. This is probably one of the things I hang my head in shame about. We made some decisions many, many years ago that, in retrospect, I would do differently. The ramifications of those decisions are that Team Project rename is hard - it shouldn't be, but it is. We've costed it a couple of different times over the past several years and it always comes back as months and months of work with high regression probabilities. That doesn't mean we're ignoring it. I still very much want to do it but we're taking a longer road to get there. For a while we just decided to "live" with the problem. Now we have a plan but it calls for making changes to "undo" some of those decisions we made many years ago that make it hard. Once we get that work done, we can do rename. All I can do is apologize for how long it is taking.
Back more to your point Dean. Yes, Update 3 is likely the last of the updates to the VS 2012 line. Of course, we'll still continue to fix any critical issues people find but we are winding it down and focusing on VS V.Next. I'd like to think many of the issues you refer to will get addressed there and, if not, I hope we'll get to them in the V.Next update train. Referring to the above - Update 3 is the update where we slow down the churn, address remaining high impact customer issues and any regressions introduced in Update 2.
When you say the window is "unbelievably tiny", I don't think I agree. Of course we tried to get as much customer value and feedback incorporated into 2012 RTM as we could and then followed with 3 updates over a period of 9-10 months in which we delivered, in the aggregate, a tremendous amount of stuff - much of which was directly driven by customer feedback (e.g. Desktop Express SKU, C++ XP targeting, Kanban support, Blue theme, and dozens of more significant improvements - and, hundreds of bug fixes). While I can understand it's frustrating that we didn't get everything - or even all the most popular ones, we did make a lot of progress - and we'll keep making progress in the context of V.Next.
I also don’t think it’s “SP1 in chunks”. The kinds of changes we’ve put into the updates go FAR beyond what we would have historically included in a Service Pack. Service Packs had an “aura” that they only contained bug fixes, and while that was never strictly true, any time someone proposed a Service Pack change that didn’t smell like a bug fix, there was a lot of justification that had to be done. One of the fundamental mindset changes with the move from “Service Packs” to “Updates” has been that the primary value of Updates is new value – and sure we’ll fix a lot of bugs too, but that’s not the focus. Read my posts on the updates and you’ll see that generally the bug fixes are a footnote. They are all about the cool new capabilities we are enabling.
I've exposed a lot of flank here - so I suspect it may generate a lively conversation. But, hopefully it sheds a little light on how I think about it (of course, only one person in a large company - so don't construe this as any kind of official policy). As always, I'm happy to engage in healthy debate and learn from customers and from mistakes.
Thanks as always for listening,
Yesterday we released our update to the TFS 2012 Power Tools to work with VS/TFS 2012.2 (Update 2).
Yes, I see the irony in announcing this the day after announcing the first “go-live” Update 3 CTP. It took us longer than we had hoped to get the Power Tools ready. However, we’ve made a key change to mitigate this issue in the future.
Our Power Tools setup has traditionally had a block to prevent it from working with the next version of VS (which is why the Power Tools from Update 1 wouldn’t work with Update 2). This is a bit of a hold-over from before we started doing the regular update cadence. At that time, it was pretty likely that we’d change something in the 2 years between updates that would break the Power Tools and we’d work hard to have new releases of Power Tools ready before the final release of VS/TFS.
In the new world of frequent updates, it’s much less likely that we break something and trying to synchronize the 2 releases is much harder. So, we’ve removed the block and we fully expect that these Power Tools will continue to work fine with Update 3. I’ve just finished installing it on my VS 2012.3 RC and everything looks to be working well.
There really aren’t significant changes in the Power Tools release. Aside from removing the block, we fixed a handful of bugs and removed the backup/restore Power Tool – because the feature was moved into the “official product” in Update 2. So, the main reason to install this release is just to get it to work with VS 2012.2.
I apologize for the delay and the inconvenience it caused. I talked to a number of customers who decided to uninstall Update 2 rather than give up their Power Tools. We’re glad to hear that the Power Tools are valued so much, but at the same time, we’re sorry for forcing you to make a choice. You shouldn’t have to any more.
I know it seems like 2012.2 just released (at least it does to me) but we’re already well into our schedule for 2012.3. Today, we released the updates for both Visual Studio and Team Foundation Server. This is the first “go-live” release in the 2012.3 line of CTPs – this means that we’ve tested it to the point that we think you can install it in your production environment (we have in ours) and get help with any significant issues you encounter. Our go-live process found 8 or 10 TFS bugs for Update 2. We were able to fix those for the final build and ensure everyone had a better experience. We really appreciate your help finding the last few issues.
As I mentioned in my Update 2 availability post – Update 3 is going to be very small compared to Updates 1 and 2. For the most part it just contains bug fixes that have either been reported by customers or found in our own testing. You can read the KB article to find a full list of fixes.
Installing it should be relatively straightforward – for both VS and TFS, just run the installer. There should be no compatibility breaks so you need not upgrade all of your components at once, though, over time, you should plan to get them all updated.
Thanks and let us know what you think,
I’ve had a healthy discussion with some of you on this post and we’ve received quite a lot of feedback on Soma’s blog and other places. Based on this feedback we’ve produced ISOs for Visual Studio Update 2 and decided that for these “larger” updates, we’ll continue to do so in the future.
You can read more about it on the Visual Studio blog here: http://blogs.msdn.com/b/visualstudio/archive/2013/05/03/announcing-availability-of-isos-for-visual-studio-updates.aspx
I missed announcing the availability of Team Explorer Everywhere Update 2. We released the update the first week of April. You can get it either by:
- Downloading it from our download page or
- If you use our Eclipse Update site (http://dl.microsoft.com/eclipse/tfs), using Help –> Check for Updates in Eclipse
Our Update 2 for Team Explorer Everywhere mostly contains bug fixes and small feature improvements. Here’s a list of the improvements you will find…
- Add delete options to builds and retention policy
- Allow import of Java and Ant as Zip files into Version Control
- Fix checkout for edit being disabled for items with merge/branch pending changes
- Remove deleted builds from the build definitions filter in builds page
- Add error when attempting to delete a build that is marked as retain indefinitely
- Update progress indicator during unshelve of large changes
- Improve the way that tooltips are displayed with the Ubuntu OOB color theme
- Allow periods in WIT display names
- Fix issue with setting execute bit on get when talking to Update 1 or Update 2 TFS 2012 server
- Fix issue blocking some label, shelve and workspace operations when there are two users with the same display name on TFS2012
- Fix issue canonicalizing local paths when setting working folder to a symlinked path
- Give proper error message when check-ins rejected by an ISubscriber plug-in on the server
- Language pack updates (fixes for French, German, Japanese and Brazilian Portuguese)
- Hide the workspace selection page during connection, make it less prominent during share and import.
- Allow use of alternative credentials when embedded browser is unavailable to Eclipse
Let us know if you have any feedback.
Last week we released an update to the TFS 2010 MSSCCI provider. The primary motivator for this release was to enable it to work with VS 2005. This now enables you to connect to TFS 2012 using VS 2005, particularly from Windows XP. The support for VS 2005 was the only enhancement to this update of the 2010 MSSCCI provider.
Yesterday, we released a major update to our Visual Studio Tools for Git. While still nowhere near done, we feel like we’ve crossed a significant threshold of completeness and usability. If you’re already using them or have tried them and didn’t feel they were ready, I encourage you to give this update a go.
You can install the latest drop here: http://aka.ms/git4vs
It will require that you have Visual Studio 2012 Update 2.
The most compelling improvements include:
- Push is dramatically faster
- Larger repos now work a lot better
- Merge and pull now allow you to have non-conflicting changes in your working copy
Additional improvements are:
- Added support for integrated Windows authentication when receiving a 407 Proxy Authentication Required
- Added support for pushing to Git servers which do not support Transfer-Encoding: Chunked
- Added support for pushing to certain Git servers which require side-band-64k in order to report push status
- The sync branch button on the Commits page is now functional. It does a fetch, merge, and push, in sequence
- Merge and pull now properly prompt the user to save their work before starting
- Merge now properly auto-reloads non-dirty solution items which changed in the working copy
- The push command now sends fewer objects to the server, which is a substantial performance improvement
- Merge and pull now use checkout instead of reset for fast-forward merges, which is a performance improvement
- Fixed CRLF issue that was causing checkout to fail due to conflicts in certain scenarios
- Improved performance of working directory status computation (much less hashing of workdir contents)
- Fixed an issue where CRCRLF line endings would end up in the working copy in certain scenarios
- The Add, Delete, and Ignore actions in the Untracked Items section of the Changes page are now working
- Undo merge option added to the Resolve Conflicts page
- Fixed a bug where the user was not prompted for fresh credentials when stored credentials failed to authenticate
- Submodule support from the libgit2 project has arrived in this release, but we haven't yet finished all the work in our VS plug-in to make them work smoothly.
As I never like talking about performance improvements without giving some numbers… Here’s a report that I got about a week ago. Not sure these are the final numbers but you can see the degree of improvement. These are differences between the Git client tooling between Sprint 44 (the previous public release) and Sprint 46 (this release), using the TFS Git Server. The numbers are in seconds.
We have also made some improvements to our Git server perf but these are just client improvements. Of course, we haven’t released an on-prem Git server yet so the only place you can see the server improvements is on http://tfs.visualstudio.com. The Git server perf improvements were deployed with our Sprint 46 service update yesterday.
Check it out and let us know what you think!
If you are a regular user of our Team Foundation Service - particularly during peak times, you've probably noticed that it hasn't been running as seamlessly as usual. Our recent issues started with our sprint 45 deployment that I wrote about here and foreshadowed here. In this post I want to give you some insights to what happened and what we've done about it. The team has been working long and hard since the deployment and I really appreciate the dedication and tenacity.
As our service has evolved, we've begun to expand the breadth of the services we are trying to offer. We are also working towards having support in data centers around the world so you can choose where your data is hosted. As part of this drive we've been gradually working to "decouple" components of Team Foundation Service to enable more independent evolution and deployment. Sprint 45 was the culmination of one of the biggest pieces of work we've done to date in that effort. Though you can't really see the effect (ok, other than some of the bugs we introduced), we refactored the core Team Foundation Service into two services - what we call SPS or Shared Platform Services and Team Foundation Service (the functionality you know and love).
SPS contains many things but the simplest way to think about it is the data that is outside a given tenant account - Subscriptions, location, configuration, identity, profile, etc. After many sprints of work, we finally deployed the new SPS service at the end of sprint 45 - in fact on March 22nd and 23rd. It involved a pretty massive data refactoring and the first upgrade with planned downtime in a long time (and I believe the last in a long time).
When you think about the change technologically, the biggest difference was that, for the first time, we introduced a service to service split with fairly heavy traffic. Before this, all of the calls between the "logical" services were either between DLLs in the same app domain or to the database. This split caused those calls to become service to service REST calls, crossing not only processes but even machines in the data center.
We did a ton of testing ahead of this deployment, however, it's clear we didn't catch everything we should have and the result has been a lower quality of service than we (and you) expect. The issues have mostly manifested themselves as slowdowns during peak hours (though in the first couple of days it was a little worse than that). The build service has been the most (and longest) affected service - mostly because it's very intensive: it does a full sync of all of your source to build it and that can put a lot of load on the system when lots of people are running builds at the same time.
I'm going to walk through a list of the issues that we hit. I'm going to try to group them a bit to make it easier to digest.
After we went live and started to see production load, we saw a great deal of time spent in our REST stack. I still don't fully understand why we didn't catch these issues in our pre-production load testing and the team is doing deep retrospectives over the next couple of weeks and this will be one of the key questions to answer.
ClientCertificationOption.Automatic - We have an object model for talking to TFS. With the new service to service architecture, we are now using portions of that object model on the service. A couple of years ago, we added multi-factor auth to our object model to support scenarios where stronger authentication is required. We discovered that this option was still turned on in our server context - where it was not needed and the result was scanning of the certificate store on every service to service request. We disabled the certificate lookup in this scenario and improved performance.
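The option named above lives in our own client object model, but for readers who haven't seen this kind of setting, here's a hedged illustration using the analogous knob on the public .NET HttpClientHandler - this is an assumption for illustration only, not the code we actually changed. Automatic makes the handler look for a client certificate on requests; Manual skips that lookup when no client certificate is needed.

using System.Net.Http;

// Illustration only (not our actual code): the analogous setting in the public .NET HTTP stack.
static class ServiceToServiceClientFactory
{
    public static HttpClient Create()
    {
        var handler = new HttpClientHandler
        {
            // Automatic would trigger a certificate lookup on each request;
            // Manual avoids it when no client certificate is required.
            ClientCertificateOptions = ClientCertificateOption.Manual
        };
        return new HttpClient(handler);
    }
}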
Thread pool threads - The new service to service architecture introduced a lot more outgoing web calls. The way the web stack works, under the covers, all of these calls are async and use the thread pool. We found that we were often running out of threadpool threads and waiting for the CLR to inject more into the pool. By default the CLR starts with 1 thread per core and then uses a "hill climbing" algorithm to adjust threads as needed. This algorithm tries to dampen changes in the threadpool size so it is not thrashing. However, we found that it could not handle our rapidly changing demands well. The result was long pauses of up to 20 seconds while requests queued up waiting for the thread pool to resize. Working with the CLR team, we decided to increase the minimum number of thread pool threads to 6 per core using ThreadPool.SetMinThreads.
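Here's a minimal sketch of that mitigation - the factor of 6 is the number quoted above; the rest of the names are just for illustration, not our actual code.

using System;
using System.Threading;

// Sketch: raise the thread pool's minimum worker threads to 6 per core so the
// CLR injects threads immediately instead of waiting for hill climbing to catch up.
static class ThreadPoolFloor
{
    public static void Apply()
    {
        int workerMin, ioMin;
        ThreadPool.GetMinThreads(out workerMin, out ioMin);

        // Keep the I/O completion thread minimum as-is; only raise the worker floor.
        ThreadPool.SetMinThreads(Environment.ProcessorCount * 6, ioMin);
    }
}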
Wasteful CPU use - Many years ago, someone wrote some code to scan every request/response data stream for invalid characters. This was consuming a fair amount of CPU (I forget how much but more than 10%). No one, including the dev who did it, could remember why this code was put in there or think of any reason it was needed. We removed it.
Extra thread switching - We had a coding mistake that caused extra thread switches on every REST call. This is best explained in code. At the beginning of every REST call there's some lazy initialization logic:
if (a == null)
    lock (syncObject)
        if (a == null)
            a = fetch some data;
The issue is that "fetch some data" was a cross service call and the developer wanted to make it async. So they tried making the method async, doing an await on the fetch, and discovered that you can't do an await from within a lock. Let's forget that this would completely have broken the synchronization. So they decided to do a Task.Run on the whole block of code. The problem is that a == null is true only the first time the code path is used, after which a is cached. The end result is that we did a thread switch just to run an if statement every time there was a service to service call.
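To make the before/after concrete, here's a sketch of the pattern - all the names here are made up for illustration; it's not our actual code.

using System.Threading.Tasks;

static class LazyInitSketch
{
    static readonly object SyncObject = new object();
    static string cachedData;   // stands in for "a"

    // The broken shape: wrapping the whole block in Task.Run pays a thread switch
    // on every call, even though cachedData is null only the very first time.
    public static Task InitializeBroken()
    {
        return Task.Run(() =>
        {
            if (cachedData == null)
            {
                lock (SyncObject)
                {
                    if (cachedData == null)
                        cachedData = FetchSomeData().Result;   // cross service call
                }
            }
        });
    }

    // A cheaper shape: take the fast path synchronously and only go async
    // (with the await outside any lock) when initialization is actually needed.
    public static async Task InitializeAsync()
    {
        if (cachedData != null)
            return;                                   // no thread switch on the hot path

        string data = await FetchSomeData();          // await happens outside the lock
        lock (SyncObject)
        {
            if (cachedData == null)
                cachedData = data;
        }
    }

    static Task<string> FetchSomeData()
    {
        // Placeholder for the cross service REST call.
        return Task.FromResult("data");
    }
}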
Too many cross service calls - As we started to investigate the cause of the slowdown, we quickly realized that we had way more "chattiness" between the services than we realized. This chattiness was exacerbated by the various performance issues above. Before the split most of these interactions were direct in-proc calls or database calls that were much lighter weight than the cross service calls. We were able to rework the code to reduce the number of calls across service boundaries substantially.
We fixed all of these REST stack problems within the first few days and it really helped, taking the service from nearly constant poor performance to intermittent poor performance and build service issues.
Other misc bugs
In addition to the set of bugs/issues we had with the transition to separate services, we had another set of bugs that were uncovered/exacerbated as part of the transition. Among them were:
Work item tracking identity sync - We have a process by which we sync identities from the identity system (now in the SPS service) into work item tracking (in the TFS service) for the purposes of enabling work item tracking rule enforcement based on groups and identities. This change uncovered/exacerbated two identity syncing bugs. One caused older accounts (the first 40K accounts) to start doing full identity syncs (consuming significant resources) frequently. The other was some thrashing in the identity service causing lots of activity but no progress in syncing.
Stored procedure optimization - We also found a few stored procedures with some significant query plan problems.
That pretty much summarizes our "bugs". There were clearly too many and they were too significant. I don't ever expect to get to the point where we can deploy a major change and not miss any bugs but it's clear to me we can do better. However, if it were just this, the overall impact would not have lasted nearly as long as it has. In addition to our own mistakes we got hit by a number of environment/operational issues. The cloud environment (Azure) that we operate in is constantly changing. Almost every day something, somewhere is being updated and we rely on enough of it that we can get wagged by many things. Here are some of the environmental things...
GC/Pinning - As I mentioned above in the section on Thread pool threads, we introduced a bunch more async patterns in our service with this change than we had ever had before. In fact, we now have about 10 async calls per second from each application tier machine. Async calls, in the CLR, ultimately involve some amount of "pinning" GC memory so that the OS can transfer data into buffers asynchronously without blocking the GC. The problem we have is that at that rate there's almost always an outstanding async call, and therefore pinned buffers. The CLR has some designs to help reduce the impact of this but we found they weren't working sufficiently well for us. We were using Large Azure roles (7GB of RAM) and finding that due to pinning effects on the GC, we were seeing high CPU load from the GC and memory exhaustion that would ultimately result in machine recycling - causing intermittent availability. In the process of investigating this, we also uncovered the issue described below under "Windows Azure". We engaged the CLR team quickly and started investigating. They said they had seen this issue before in some high scale services but no service was able to work with them long enough to isolate and fix it - basically as soon as they found a mitigation their interest in it was over. I view part of our role as driving feedback and requirements into our platform to improve the overall quality. As such we're continuing to work with the CLR on this. We're expecting a longer term improvement from them but, in the shorter term, we've found that we could increase to an XL role (14GB) and that mitigated the pinning effects, eliminating the critical issues.
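For anyone who hasn't hit this before, here's a tiny illustration of why the pinning accumulates - an illustration only, not our code, and our mitigation was the larger role size, not a code change. Each outstanding async read pins its buffer until the OS completes the transfer, so at roughly 10 async calls per second per machine there is almost always pinned memory the GC has to work around.

using System.IO;
using System.Threading.Tasks;

static class PinnedBufferIllustration
{
    public static async Task<int> ReadChunkAsync(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                   FileShare.Read, 4096, useAsync: true))
        {
            var buffer = new byte[4096];
            // While this read is outstanding, "buffer" is pinned so the OS can fill it
            // directly; the GC cannot move it, which hurts compaction under heavy load.
            return await stream.ReadAsync(buffer, 0, buffer.Length);
        }
    }
}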
Config DB provisioning - Each of our services has a single database that we call Config DB. It's the core database that the service needs to run. It's got critical configuration information like connection strings, manages cross app tier notifications, etc. Before now, we only had 1 Config DB for the entire service. That database was on an XL reserved SQL Azure instance to make sure we had a committed level of capacity behind this critical database. With the rollout of our new service and a second Config DB along with a new "accounts" DB, we needed to readjust how our config DBs were deployed so that we didn't have to pay for 3 XL reservations but were able, instead, to share the reserved resources between our Config DBs. In this process, there were manual operational procedures which caused all three database primaries to get landed on the same physical machine. The load balancer realized this was a bad idea and began trying to move two of the primaries onto another of our reserved machines. However, we hit a bug in the load balancer where it had difficulty balancing the resource consumption of the foreground work with the background database movement. The result was that it kept trying to move the databases, getting part way through and then giving up. This caused a lot of load on our SQL Azure databases that, coupled with our issues in the first section, compounded our performance problem. Ultimately this was addressed by explicit operational intervention to place the databases.
SQL Azure storm - As it turns out, that first week, SQL Azure (on the cluster we use), had a bit of a storm with some misbehaving tenants using an undue share of the resources and causing the performance of our databases to suffer. These storms happen from time to time in SQL Azure and they are working through throttling and quota mechanisms to contain them. They were MUCH worse a year or so ago - the SQL team has made a lot of progress but the problem is not yet completely gone.
So far, most of what I've talked about has been, at least, tangentially related to the "big" change we were actually rolling out. As it so happens we also got hit by some stuff that had absolutely nothing to do with the big changes we were rolling out. Over the first couple of weeks we had worked through all of the issues we could find with our change and the service was generally running reasonably well - with the exception of the build service. The build service was running poorly - during peak times builds were slow and frequently timing out/failing. Once we had ruled out everything that had anything to do with our changes we discovered some other things going on.
Slow disk I/O - The slow and failed builds were continuing to plague us. Once we had stopped looking for causes related to the changes we had made, it became clear pretty quickly that the root problem had to do with the amount of time it was taking to download source out of version control. After much investigation we concluded it had something to do with changes in Windows Azure. We were seeing incredibly slow disk I/O on the local disk of our web role. We contacted the Azure team and learned they had just rolled out (just about the same day we rolled out our changes), a change to both host and guest VMs that changed the way write through caching works. We were seeing, under load, individual I/Os taking as much as 2 seconds - 2 orders of magnitude longer than they should. The source code we manage is in Azure blob store. However, to optimize delivery, we store a cache of source code on the application tier that we can use the OS TransmitFile API on. It is a VERY optimized API for transmitting files over the network that has very low overhead because it avoids copying buffers in and out of user mode, etc. The disk I/O slow down was causing this cache to behave very badly - causing builds to take forever, downloads to fail, builds to be aborted, etc. Further, it significantly exacerbated our issues with the GC because we'd see page faults in GC collections taking a second or more. It was a mess. To mitigate it, we have turned off our AT cache and are instead fetching source directly from Azure blob store. This is causing our file downloads to take about 1.6 times longer than it used to but it's still way better than where we were. We're continuing to work with the Azure team on a more permanent fix to restore the local disk performance.
Network failures - Unfortunately, with the slow disk I/O problem mitigated, we're still seeing an increased level of build failures. Upon further investigation, we've discovered that we're seeing a much higher level of network errors in Azure than we are used to. At this time, I still don't know the cause of that - we believe it's related to the last Azure fabric rollout but we don't have a clear root cause yet. We are seeing hundreds of builds a day failing due to network failures starting about April 12th. Along the way of investigating many of these issues, we discovered that our build service didn't have appropriate retry logic to handle failed downloads. It's really not been an issue for the past year but all of the issues we've had in the last month have uncovered this hole. We'll soon be rolling out an update that will do retries (which quite honestly should have been there in the first place because all cross service cloud calls should handle intermittent failures well) and it should significantly improve the build success rate. We'll still see some performance issues from the last two issues but at least builds should stop failing.
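For what it's worth, the retry logic I'm describing is nothing fancy - here's a sketch of the kind of thing I mean. The attempt count, backoff schedule, and names are illustrative assumptions, not the actual build service code.

using System;
using System.Threading.Tasks;

static class RetrySketch
{
    // Retry a cross service call a few times with a growing delay between attempts,
    // so a transient network failure doesn't fail the whole build.
    public static async Task<T> WithRetriesAsync<T>(Func<Task<T>> operation, int maxAttempts = 4)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts)
                    throw;   // out of attempts - surface the real failure
            }

            // Back off a little longer each time before retrying: 2, 4, 8 seconds...
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
    }
}

Usage would look something like: await RetrySketch.WithRetriesAsync(() => DownloadFileAsync(url)); where DownloadFileAsync stands in for whatever call is flaky.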
Memory dump problems - To add insult to injury, our ability to debug the service has been seriously compromised. I haven't actually looked deeply at this one myself so my description may be a little sketchy. Debugging some of these problems is greatly facilitated by taking memory dumps from production that we can diagnose offline. Historically, we have used a feature called Reflective dumps using procdump /r. This method reflects the memory state into a parallel process and then dumps from the parallel process so that there is almost no interruption of the production service. However, apparently, for a couple of years the Windows OS reflective dump mechanism has had a bug that caused it to miss fragments of the process memory. Apparently we've seen this in the form of "bad" memory dumps but it was infrequent enough that people just wrote it off to gremlins. Some change in .NET 4.5 (which we rolled out on the service a few months ago), has caused this reflective dump to fail every time - so we can't use reflective dumps any longer. Now we have to take a straight process dump in production and, unfortunately, this takes a minute or two, literally. This long of a pause causes cascades of timeouts and results in a significant service interruption. So every time we needed to take a dump to look at thread state, GC issues, or the myriad of other issues here, we had to basically reboot one of the app tiers, causing significant service interruptions, failed builds, etc. We're working with the Windows team on getting them to produce a hotfix that we can use and the rest of the world can use to reliably get .NET 4.5 reflected memory dumps.
As you can see there were quite a lot of issues that conspired to make it a bad month for us. We've been working incredibly hard and the service has gotten better day by day. We've also been working closely with the platform teams to ensure there are durable solutions to the problems we've hit there that work not just for us, but for everyone.
This is a long post and if you've read this far, I'm impressed. We're very sorry for the problems that we've caused. At this point we feel like we've worked through enough of our issues that we can get back on our regular deployment schedule. As such, we are deploying the delayed Sprint 46 build today and will deploy the Sprint 47 build a week from Monday. We'll keep working hard until everything is completely back to normal and if you see anything that looks squirrelly, please let us know.
Today, we released our second update to Visual Studio and Team Foundation Server 2012. You can read a fairly detailed post with all the new capabilities on the ALM blog. There’s a ton of new value in this update as evidenced by the long list of new features in that post. Roughly speaking, Update 2 is about the same size as Update 1 was (in terms of numbers of new features).
Many of the features have been aired before in our CTP posts. But there are a couple of things about the TFS update that I want to highlight.
TFS 2010 Build controller/agent compat – We’ve received feedback that simultaneously updating all TFS build machines along with the TFS server is not practical – particularly in large organizations where there can be hundreds of build machines, many of which aren’t even known to the TFS administrators. Because of this, in Update 2, we have added support for TFS 2010 build controllers and agents – so you can update your TFS 2010 server without updating your build infrastructure and your builds will just keep working. In general, we expect to continue this pattern from here forward – a new TFS server will support build machines from one major version back. An added benefit in this version is that you can use the TFS 2010 build servers on Windows XP (in the event you need to do that) while the TFS 2012 build machines don’t support XP. Based on the feedback we’ve gotten from our MVPs, this change is very popular and makes people’s lives much easier.
Preservation of TFS settings across updates – You may recall that when you applied TFS Update 1, you had to reconfigure many of the settings manually. In Update 2, we put a great deal of effort into preserving settings across the upgrade. While we didn’t get every one, we got the most common customizations and we plan to get most of the rest in Update 3. In all, the upgrade should be more seamless this time.
Upgrading TFS using SQL Always On – We added support to automatically handle upgrading TFS installs using the SQL Always On high availability configuration. In Update 1, this was a manual process.
So I guess, what I’m trying to say is that, in addition to the long list of new features you’ll find in the blog post above (like new Agile project management capabilities, tons of testing tools improvements, Blend & Sketchflow support and more), we’ve worked really hard to make the upgrade as easy and seamless for you as we can. Of course, if you hit any bumps, please let us know because we’ll want to fix them.
I can’t write this post without commenting on the quality of the TFS 2012 Update 1 release and what we’ve done about it. You may recall that we had a number of issues with our Update 1, had to issue a re-release soon afterwards and then a patch with ~8 critical bug fixes. We vowed not to repeat those issues. We learned a lot shipping a pretty significant update and made a lot of changes for Update 2.
Among them, we added two “go live” CTPs to collect feedback early. The first was a release just for our MVPs. We had a dozen or so MVPs do production upgrades and report all the issues they found. We found and fixed probably 5-6 significant bugs that way. 3 weeks later, we had a “broad go-live” CTP and worked with many more customers to do trial or production upgrades – finding more issues. Throughout, we worked very closely with customers and pursued every issue to its end. In addition to all of the customer testing, we provided an upgrade path from CTP->CTP->RTM and extended our own testing window to ensure we could cover any areas we felt we missed in Update 1 and do full verification of all fixes in the end game. With all the effort and due diligence we’ve put into this release, we feel like we’ve done a good job ensuring the quality of what we are shipping. Ultimately, the true validation of that will be a lot of successful and happy customers so, we’re eager to hear your successes or issues in applying Update 2.
A comment on Update 3…
We’ve already begun working on Update 3. I’d like to set your expectations a bit on it now. Update 1 & 2 were both fairly substantial updates with a fair number of new features. My expectation is that Update 3 will be VERY modest. In all likelihood, we will primarily focus on bug fixes, upgrade issues and small refinements to the experience. At this point we are pretty consumed in working on our next major update to TFS and, as such, can’t manage to do 2 separate & significant things at the same time.
Once you’ve had a chance to try out Update 2, I’d love to hear your overall impression of the VS 2012 Update experience. This is the first release that we’ve tried doing this and sometime later this year, we’ll be sitting down to evaluate overall how successful the effort has been and beginning to think about what we’re going to do for the next major release – in terms of subsequent Updates.
A final note. After an update, I usually get a set of requests to produce a list of bug fixes. For updates of this magnitude, that’s a harder thing than you might think. I’ve done it in the past when our service packs or something would have a few dozen, or maybe even many dozen fixes. It’s actually real work to turn the lingo in our internal bug database into a list that is useful to someone who’s not on the team and I usually spend several hours doing it. I checked and this update contains over 500 bug fixes – just in TFS. Now some of those bug fixes are fixes to things introduced in the process of creating the update. You wouldn’t want to see those and I’d want to filter them out. Because of the magnitude of that effort, I won’t be producing a list of bug fixes. It’s something I’ll look at doing for Update 3 because I expect that will be a much smaller list.
Thanks and good luck with the update. We really hope you like it. As always, we are eager to hear your feedback.
Today, we released the results of Sprint 45 on Team Foundation Service. You can check out the release notes to learn more about today’s release. There were 2 basic areas of new capabilities on the service: Git Branch insights and Web based test execution UI. Both are nice improvements in usability and experience.
You might notice that we’re releasing this on a Friday and we “always” do our service releases on Mondays. This is an unusual one. Last sprint I mentioned that we are making some big infrastructural changes and that these changes are going to require a few minutes of down time. The update today puts everything in place for us to execute the infrastructural changes tomorrow during a relatively quiet time on the service (15 minutes sometime between 11:00am and 2:00pm on Saturday).
As always, we love to hear your feedback.
Saturday night I was at my son’s lacrosse game with the family. My wife got a phone call. It was a fellow farmer and neighbor, Noah (about 2 miles south of us). His question was “Are you missing a cow?” The answer, of course, was we didn’t know. Even if we had been home, counting ~100 cows turns out to be a lot harder than you’d think – the dang things just won’t stay still. It turns out Noah had seen a stray black cow with no distinctive markings.
So I had to leave the lacrosse game to go home and see what was up. By the time I got home it was dark. I stopped by Noah’s house and learned that he had last seen the cow running down another neighbor’s (long) driveway. I drove down the driveway but, let’s just say that looking for a black cow when it’s pitch dark is a less than productive exercise. After a drive up and down the driveway, I gave up.
The next morning, my wife and I woke up at about 6:30 and my wife said to me “Something’s not right. The cows are making too much noise.” We’re expecting calves any day so she was thinking a cow was having problems. So we both got dressed and went down to the lower barn where the cows are wintering. Standing in front of the gate behind the hay barn was a strange cow – a young, all black, female cow (yes, I know that’s redundant ).
I could tell something was not right with this cow. It had the wrong posture – head up high, very alert and poised to run. These cows I call “crazy cows”. Crazy cows are cows that are so afraid that they are hyper and irrational – jacked up on adrenaline. In fact, they are incredibly dangerous. We’ve had a couple of cows that have “gone crazy” and the smartest thing you can do is sell them fast. We are fortunate to generally have a very calm herd that I can walk calmly among and get within a few feet before they ease away. This cow was trouble.
I managed to go the long way around behind and clear the field of our cows. I then opened the gate and we chased the cow into the pasture. This is my goat pasture which is extra secure – posts every 12 ft, high tensile woven wire plus electrical. Not much will get out of it.
Once we had the cow penned up, we started calling farmers in the area. None of them were missing cows.
We noticed that the cow had a “slap tag” on its left shoulder. This is a round white sticker sometimes used as a temporary label for transport.
My best guess is someone bought her at auction with the idea of keeping her on some “extra grass”. My guess is they didn’t realize they were getting a crazy cow and once they got it home, the cow panicked and jumped their fence. I’ve seen a cow jump a 4 1/4 foot stall gate.
Ultimately we called the animal control officer and they are going to go door to door in the area, checking any place nearby that looks like it might have a cow. Here it is Monday night and we still don’t know whose it is. It’s more than a $1,000 animal. Someone has got to be wondering where their investment is.
Are you missing a cow?
A couple of days ago I announced the availability of VS/TFS 2012.2 CTP4. At that time I mentioned that there would be an update of our Visual Studio Git extensions to work with it. We released the Git extension update last night and you can read more about it in Matt’s and Andy’s blog post. We continue to push forward with additional Git functionality as quickly as we can. The biggest advance in this release is a new merge experience. Check it out and give us feedback.
We are continuing our journey to deliver a final release of Update 2. You can read about the CTP3 we shipped a few weeks ago here. The most notable thing about this CTP is that it is “go-live”. One of our big learnings from shipping Update 1 was that we really do need feedback from real customer deployments before we ship a major update. We provided a “go-live” build to a select set of early adopters in the last CTP and got some great feedback. In that process we found 3 or 4 significant bugs and fixed them for this CTP. We now need a bunch more people to give it a go and report any issues they find. If all goes well, we hope to release Update 2 soon.
You can download the CTP here: http://go.microsoft.com/fwlink/?LinkId=273878
You can read about the list of new features in this CTP on the ALM team blog.
Chuck’s list on the ALM blog is pretty high level so I want to call out a few TFS specific things:
Back up & restore Power Tool – We’ve now integrated the backup and restore power tool into the product as part of Update 2. There’s no need to use the Power Tool download for this any longer.
Preserve configuration on upgrade – We did additional work to preserve your TFS configuration when you upgrade. This should make applying update 2 easier.
Servicing in High Availability SQL Environments (SQL AlwaysOn) – We now support upgrade & servicing of SQL AlwaysOn configurations without undue manual intervention.
Customizable kanban columns – Update 2 now includes the customizable kanban column support that I first previewed at the ALM Summit. Gregg has also produced a detailed walkthrough of the customizable kanban columns.
Visual Studio Git extension – You will also need a new version of our Git extensions VSIX to work with this Update. It’s not available as I write this but should be within the next day or two. I’ll post another blog post with details on the Git extensions update as soon as I have it. Just to be clear, the Git extensions will not be included in Update 2. For now they will continue to be a separate VSUpdate as we are rapidly iterating on them and plan to continue for the foreseeable future.
At this point all but a few of the planned Update 2 features are in.
Please try it out and give us any feedback you have. Because this is a “go-live” release all our standard support mechanisms are in place for it. Also, you can submit bug reports on Connect.
Sprint 44 is done and has now been deployed to the service. You can read the release notes on the Team Foundation Service Portal to learn more about what’s in it. Maybe the biggest news is that the customizable swim lanes for kanban are live on the service now. Overall though, it was a sprint of just moving the ball forward on many fronts – improved test case execution, version control annotate/blame, scheduled builds for Git, etc. No real big news – just everything a bit better. It’ll probably continue to be that way for a few sprints as we pull together our next set of “big” improvements.
You may also want to checkout my upcoming posts on VS 2012.2 (Update 2) and VS Git Extension updates. Both enable new scenarios with the service.
I also want to make you aware that we’ll be taking the Team Foundation Service down for about 15 minutes Saturday March 23rd in the afternoon (US East coast time) – see the release notes for more precise timing. I talked a few months ago about some “big infrastructural changes” we had made that were the root of many of our Update 1 problems. That work is finally done and will allow us to repartition some of our key services data (account, identity, etc) and allow the service to scale a lot further than it can today. That work will be enabled in production on March 23rd. Because it’s a pretty big reorganization of our account database, we’ll need to take the service down while we do it. It’s only 15 minutes because most of the work can be done online and only some wrap up stuff has to be done synchronously while the service is down. We chose Saturday afternoon because that’s generally a low usage time.
We haven’t done an intentional offline servicing event in a year or more. We are generally able to make all of our upgrades with the system fully online. I don’t anticipate another event like this in the coming months but I don’t promise it will never happen again. I won’t be surprised if it’s something we need to do once or twice a year as the service goes through significant architectural evolution. I apologize for the inconvenience ahead of time. We’re trying to give you plenty of warning.
As always, please let us know if you have any feedback,
Clearly yesterday was a bad day. Team Foundation Service was mostly down for approximately 9 hours. The underlying issue was an expired SSL certificate in Windows Azure storage. We use HTTPS to access Windows Azure storage, where we store source code files, Git repos, work item attachments and more. The expired certificate prevented access to any of this information, making much of the TFService functionality unavailable.
We were watching the issue very closely, were on the support bridge continuously and were investigating options to mitigate the outage. Unfortunately we were not successful and had to wait until the underlying Azure issue was resolved. I have a new appreciation for the “fog of war” that happens so easily during a large scale crisis. We’ll be sitting down early this week to go through the timeline hour by hour – what we knew, what we didn’t know, what we tried, what else we could have tried, how we communicated with customers and everything else to learn everything we can from the experience.
I can appreciate this problem. Team Foundation Service has dozens of “expiring objects” – certificates, credentials, etc. A couple of years ago, when our service was in its infancy, we too were hit by an expired certificate due to an operational oversight. Afterwards we instituted a regime of reviewing all expiring objects every few months to ensure we never allow another to expire. I’m still not as confident in our protection as I’d like. The current process relies on developers to document any expiring objects they add to the service and for the ops team to properly manually confirm all the expiration dates on a timely schedule. We took the occasion of this incident to raise the priority of automating this check to reduce the likelihood of a recurrence. Of course, one of the things you quickly learn when operating a large scale mission critical service is that you can’t assume anything is going to work. For instance our automated expiration checks, once we build them, might fail. Or, when they find an issue, the alerting system may fail to deliver the alert. Or, let’s say the alert is delivered by email, we may have personnel change and forget to update the email address the alert is sent to, causing it to get ignored. And on and on. The hard thing about this is that anything can go wrong and it’s only obvious in hindsight what you should have been protecting against – so you have to try to protect against every possibility.
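For the curious, the automated check itself is the easy part - the hard part is the process around it. Here's a sketch of the sort of thing I mean; the store location and the 30 day warning window are illustrative assumptions, and a real check would cover credentials and other expiring objects too, not just certificates in one store.

using System;
using System.Security.Cryptography.X509Certificates;

static class ExpiringObjectCheck
{
    public static void WarnOnExpiringCertificates()
    {
        var store = new X509Store(StoreName.My, StoreLocation.LocalMachine);
        store.Open(OpenFlags.ReadOnly);
        try
        {
            foreach (X509Certificate2 cert in store.Certificates)
            {
                TimeSpan remaining = cert.NotAfter - DateTime.Now;
                if (remaining < TimeSpan.FromDays(30))
                {
                    // In a real service this would raise an alert, not write to the console.
                    Console.WriteLine("Certificate '{0}' expires in {1} day(s)",
                        cert.Subject, Math.Max(0, (int)remaining.TotalDays));
                }
            }
        }
        finally
        {
            store.Close();
        }
    }
}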
I haven’t yet seen the Azure incident review so I don’t know exactly what failures led to this outage. Yesterday, the focus was on restoring the service – not understanding how to prevent the next issue. That will be the priority over the next couple of weeks.
Any way you look at it, it was a bad, humbling, embarrassing day that we have to learn from and prevent from ever happening again.
I apologize to all of our affected customers and hope you’ll give us a chance to learn and continue to deliver you a great service.
Yesterday, we posted a small update to the Git tooling for VS. Since we just released it the other day, we don’t have much new yet but we should continue (at least for a while) to update it after each of our 3 week sprints.
This update has 3 significant bug fixes:
1) Add support for VS 2012 Express for Windows Desktop
2) Fix a bug that broke Resharper
3) Fix for a problem detecting global config files.
If you’ve installed the tool, you should get popup notifications in VS of any updates we publish and, at least for now, updating the extensions is super fast – only a few seconds.
That’s all for now