
RIM's Three-Day Service Disruption Spreads

BlackBerry is struggling to resolve a network failure that has caused widespread service disruption for BlackBerry users for the last three days.

The BlackBerry email and messaging service outages that started on Monday in Europe, the Middle East and Africa, continued throughout the day on Tuesday and spread to India, Brazil, Chile and Argentina. (See Euronews: BlackBerry Outage Hits Millions.)

On Wednesday, the problems appear to have spread to North America, according to a Reuters report, and Research in Motion is still scratching its head about what actually has gone wrong.

The last official statement from RIM about the problem was issued Tuesday night at 9:30 GMT:
    The messaging and browsing delays being experienced by BlackBerry users in Europe, the Middle East, Africa, India, Brazil, Chile and Argentina were caused by a core switch failure within RIM's infrastructure. Although the system is designed to failover to a back-up switch, the failover did not function as previously tested. As a result, a large backlog of data was generated and we are now working to clear that backlog and restore normal service as quickly as possible. We apologize for any inconvenience and we will continue to keep you informed.


But at the BlackBerry Innovation Forum in London on Wednesday morning, RIM executives said that employees at their server center in Slough were still trying to figure out what the problem is.

They say
  • The Guardian reports that RIM UK Managing Director Stephen Bates said at Wednesday's BlackBerry event that the company "thought we had found the problem [BB outage] but had not. We are working around the clock to get to the bottom of the problem."

  • In an interview with paidContent, Rory O'Neill, RIM's VP software and services for EMEA, said: "The fundamental cause as far as we know is the issue with the core switch and a failing in the component architecture … It has to do with how our networks speak to each other, and we are trying to fix that while preserving all messages."

We say
    A network outage of this magnitude can never come at a good time, but now is a particularly bad time for RIM. The BlackBerry maker is under pressure from disgruntled investors who have called for changes in the company's management and the sale of all or parts of the company, in response to its declining market share in smartphones. (See RIM Revved on Icahn Rumor, Investor to RIM: Sell Something, RIM's Q2 Profit Falls on Weak Shipments and RIM Plans a Q3 PlayBook Revival.)

    — Michelle Donegan, European Editor, Light Reading Mobile

  • Anne Morris 12/5/2012 | 4:51:25 PM
    re: RIM's Three-Day Service Disruption Spreads

    This really is bad timing for RIM; is the writing now on the wall?? Lots of BlackBerry users badmouthing the company today on Twitter!

    krishanguru143 12/5/2012 | 4:51:25 PM
    re: RIM's Three-Day Service Disruption Spreads

    When is an outage good or well timed? People rely on it every single day. This is not the first nor the last outage that they will have.

    RIM needed to get out of the device business and just support the BES and the app on the phone. That window is now closed.

    Michelle Donegan 12/5/2012 | 4:51:24 PM
    re: RIM's Three-Day Service Disruption Spreads

    Here are the latest official updates from RIM about the outage (not exactly revealing):

    Wednesday 12th October – 9:45 (GMT-5)
    BlackBerry subscribers in the Americas may be experiencing intermittent service delays this morning. We are working to resolve the situation as quickly as possible and we apologize to our customers for any inconvenience. We will provide a further update as soon as more information is available.


    Wednesday 12th October – 12:00 (GMT+1)
    We know that many of you are still experiencing service problems. The resolution of this service issue is our Number One priority right now and we are working night and day to restore all BlackBerry services to normal levels. We will continue to keep this page updated.

    Michelle Donegan 12/5/2012 | 4:51:22 PM
    re: RIM's Three-Day Service Disruption Spreads

    RIM just held a press conference with David Yach, CTO of Software. Here's how he explained what's going on:

    There was a major outage that began on Monday (October 10).… This was the result of a core switch failure in our European infrastructure. The failover did not function as expected. As a result, there was a large backlog of data… we had to throttle the traffic to stabilize the service. This is why we're seeing the issue in other regions. We believe we have found the root cause of the original failure, but need to investigate further...

    He also said that there has been speculation about a breach or hack, but that there was no evidence to believe this was the case.

    The new bit from this press conference is RIM's admission that they have had to throttle traffic in Europe in order to ease the flow of the backlog.

    OK, so here are my questions: who supplies that core switch in Europe (which reportedly is in Slough) and why did it fail? Was RIM conducting a routine upgrade? Were they installing a new one? Does the problem lie with the failure of RIM's network redundancy?

    Gabriel Brown 12/5/2012 | 4:51:22 PM
    re: RIM's Three-Day Service Disruption Spreads

    Skimping on the network spend? For a communications company? Tut tut.

    ^Eagle^ 12/5/2012 | 4:51:22 PM
    re: RIM's Three-Day Service Disruption Spreads

    Clearly, at a minimum, this is a failure in RIM's network and system level redundancy. Amazing to me how many otherwise sophisticated companies fall down in this area. I used to make good money consulting with enterprise users on their "redundancy" or lack thereof. Amazing how naive IT departments can be when it comes to this stuff. Even the mighty OEMs out there took far too long to figure out how to do it properly.

    Ex #1: the failure of a DSC SS7 switch firmware upgrade that propagated and took down all long distance on the eastern seaboard for a few days back in the '90s. DSC never really recovered as a company.

    Ex #2: for many years Cisco had redundant IO blades on their routers, but not redundant switch fabrics. Later they fixed this, but did not put sufficient memory on board the platform to hold 3 images of the firmware software. You need 3 images so that you can update the firmware / software while the system is still running: 1 to be updated, 1 to be running, and 1 as the failsafe redundant memory image in case the 2nd one fails during the upgrade of the first. Then it took still longer for Cisco and others to learn that to call something redundant and protected, it needs to have redundant power supplies and cooling fans. Something very few OEMs do to this day.

    Ex #3: redundant network connections. I cannot tell you how many enterprise users used to pay for redundant network connections, but never asked the key question: were the network links dual homed to different COs in different fiber trunks? Most of those redundant links were in the same fiber trunk. And when some yahoo bulldozed the trunk, the primary and protect links both went down.

    There are other examples of such foolishness.

    It is abundantly clear that the IT teams at RIM made several mistakes in their redundancy plans and platforms.

    sailboat

    Maybe I should go back into that area of consulting... might be time to make money doing the old work once again with newer younger IT "professionals".

    Michelle Donegan 12/5/2012 | 4:51:19 PM
    re: RIM's Three-Day Service Disruption Spreads

    Here's the update from RIM's UK press team that was in my inbox this morning (Thursday):

    “From 6am BST today, all services across Europe, the Middle East and Africa, as well as India, have been operating with significant improvement. We continue to monitor the situation 24x7 to ensure ongoing stability. Thank you for your patience.”

    Last night, RIM CIO Robin Bienfait issued an apology to all BlackBerry customers and said: "You’ve depended on us for reliable, real-time communications, and right now we’re letting you down." The statement and service update are here:

    http://www.rim.com/newsroom/service-update.shtml

    Other service updates are being posted here as well:

    http://uk.blackberry.com/serviceupdate/

    Anne Morris 12/5/2012 | 4:51:18 PM
    re: RIM's Three-Day Service Disruption Spreads

    yes, of course all outages are very bad at any time, but RIM is already in the press more than it would like for the wrong reasons right now, so it seems 'particularly' bad timing.


    anyway...
