
First AWS, Now Microsoft Cloud; Who's Next?

Ashwin Krishnan
5/4/2017

So, there we have it. Within a window of a few months, the top two public cloud providers on the planet -- Amazon Web Services Inc. and Microsoft Cloud -- have had bodily seizures that have caused the rest of us (mere cells in their ecosystem) to go into crazy orbits. Enough of the drama, let's get to facts. In this age of information deluge it would not be presumptuous to assume that the reader may have forgotten the specifics, so let's recollect.

The Amazon Simple Storage Service (S3) had an outage on Tuesday, February 28. An authorized S3 team member who was using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. However, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. And the rest, as they say, is history!

Now let's turn to the Microsoft episode. On Tuesday, March 21, Outlook, Hotmail, OneDrive, Skype and Xbox Live were all significantly impacted, and trouble ranged from being unable to log in to degraded services. True to form, the Microsoft response was to downplay the impact and provide little detail (by contrast, Amazon provided a much more detailed post mortem). A subset of Azure customers may have experienced intermittent login failures while authenticating with their Microsoft accounts. Engineers identified a recent deployment task as the potential root cause. Engineers rolled back the recent deployment task to mitigate the issue.

So, is this the death of public cloud? Nah. Far from it. And anyone who says otherwise should have their head examined. BUT, it should serve as a wake-up call to every IT, security and compliance professional across every industry. Why? Because this kind of "user error" or "deployment task snafu" can happen anywhere -- on-premises, on private cloud and on public cloud. And since every enterprise is deployed on one or more of the above, every enterprise is at risk. So enough of the fear mongering. What does someone do about it? Glad you asked.

There are really three vectors of control: scope, privileges and governance model.

Scope is the number of "objects" that each admin (or script) is authorized to work on at any given time -- the nuclear radius, so to speak. Using the Microsoft Cloud example (I realize I am extrapolating, since they have not provided any details), this may be the number of containers a deployment task can operate on at any given time.

Privileges calls for controlling what an administrator or task can do on the object. For instance, continuing with the container example from above, the privilege restriction could be that the container can be launched but not destroyed.

And finally, you need a governance model. This is really the implementation of best practices and a well-defined policy for enforcing the above two functions -- scope and privileges -- in a self-driven fashion. In this example, the policy could be to ensure that the number of containers an admin can operate on remains under 100 (scope) and that any increase in that number automatically requires a pre-defined approval process (control). Further sophistication can be built in: the human approver could just as well be a bot that checks the type of container and the load on the system and approves (or denies) the request. Bottom line -- checks and balances.
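To make the three vectors concrete, here is a minimal sketch of how such a policy check might look. It is an illustration under stated assumptions, not a description of how Amazon or Microsoft enforce anything: the names `Request`, `Policy`, `bot_approver` and `evaluate`, the "stateless" container type and the 0.5 load threshold are all hypothetical. Only the limit of 100 containers and the launch-but-not-destroy rule come from the example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    """One admin (or script) asking to act on a set of containers."""
    requester: str
    action: str            # e.g. "launch" or "destroy"
    container_type: str    # hypothetical label, e.g. "stateless"
    container_count: int   # how many containers the task touches
    system_load: float     # hypothetical load figure, 0.0 to 1.0


@dataclass(frozen=True)
class Policy:
    """Scope and privileges, as described in the article's example."""
    scope_limit: int = 100                               # stay under 100
    allowed_actions: frozenset = frozenset({"launch"})   # no "destroy"


def bot_approver(req: Request) -> bool:
    """Hypothetical approval bot: checks container type and system load."""
    return req.container_type == "stateless" and req.system_load < 0.5


def evaluate(policy: Policy, req: Request,
             approver: Callable[[Request], bool] = bot_approver) -> str:
    """Governance: apply the privilege check, then scope, then approval."""
    if req.action not in policy.allowed_actions:
        return "denied: action not permitted"
    if req.container_count < policy.scope_limit:
        return "allowed"
    # Exceeding the scope limit always goes through the approval process.
    if approver(req):
        return "allowed after approval"
    return "denied: scope increase not approved"


policy = Policy()
print(evaluate(policy, Request("alice", "launch", "stateless", 50, 0.2)))
print(evaluate(policy, Request("alice", "destroy", "stateless", 1, 0.2)))
print(evaluate(policy, Request("deploy-task", "launch", "stateless", 250, 0.2)))
print(evaluate(policy, Request("deploy-task", "launch", "stateless", 250, 0.9)))
```

The point of the sketch is the ordering: a request outside the permitted privileges is rejected outright, a request within scope passes without friction, and only a request that would widen the scope triggers the approval step, whether the approver is a human or a bot.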

So there you have it. The two large public clouds suffered embarrassing outages within a month of each other. They will recover, get stronger and most likely have future outages as well. The question for the rest of us is what we learn from their experience and how we make our own environments -- in our data centers and on private and public clouds -- better! If we don't, we may not be lucky enough to fight another day.

— Ashwin Krishnan, SVP, Products & Strategy, HyTrust

akrishnan940,
User Rank: Light Beer
5/4/2017 | 12:46:54 PM
Re: AWS S3 outage and proper architecture
Thanks for the detailed comments. Yes - if you automate a poorly defined process, you are going to crash and burn faster. And there is no 'stress testing' that either cloud providers or enterprises are voluntarily embracing to expose any holes and fix them. But the first step is to know what you don't know or haven't acknowledged - the scope, the privileges and putting governance around them.
danielcawrey,
User Rank: Light Sabre
5/4/2017 | 11:36:58 AM
Re: AWS S3 outage and proper architecture
These are the large-scale issues that can afflict cloud systems. I'm sure Amazon and Microsoft are learning from the mistakes made. Let's keep in mind that this is all still really new, and everyone is learning as we all go along.
mladeb,
User Rank: Light Beer
5/4/2017 | 11:18:32 AM
AWS S3 outage and proper architecture
Northern Virginia is the cheapest AWS region, and many services, including AWS's own dashboard, do not follow AWS's high-availability architecture recommendations and use only one region, the cheapest one, even for valuable services and data.

Backup and disaster recovery are also an issue, and using multiple public cloud providers for the most valuable services and data makes sense. Especially in the case of natural disasters, public cloud providers should ensure that the rest of the regions can operate without disruption, and the AWS outage proved that AWS can deliver on that expectation. The funny part of the story was that some of AWS's own applications were impacted due to single-region architecture, as were many other popular applications relying on the availability of only one region.

With automation scripts that take manual input, there will always be the possibility for humans to cause disaster, even with a well-defined scope, privileges and governance model. But the point is that with proper architecture, public cloud provides real benefits and can cope with both human-caused and natural disasters - especially by using public multi-cloud for a high-availability architecture.