The Resilient Internet Dream



If it's such a resilient system, why doesn't it work ?



The Internet - A resilient network of computer systems and infrastructure designed to withstand catastrophic failures and able to continue working despite such failures.

So why, if someone so much as coughs, does it fall crashing to its knees ?

The modern-day Internet as we know it is an evolution of ARPANET, created in 1969 and designed to be a robust communications network which would withstand a nuclear attack in times of war.

ARPANET was meant to be resilient to this sort of problem: capable of surviving catastrophic world events and of automatically, and near instantaneously, re-routing itself around whatever problems cropped up.

The current topologies and design of the Internet are based upon those which made ARPANET what it was, but much of the underlying infrastructure and its higher-level data carrying capabilities have been considerably improved upon. The resilience of ARPANET should still be present in what we have today.

It seems, though, that the reality is somewhat different to what the design goals of the Internet should have delivered. Today, it seems that even the slightest hiccup causes a good chunk of the UK's Internet access to go down, totally, and for sustained periods.

Reality is, of course, often entirely different to theory ( like the rocket cars we should be driving, all that leisure time we wouldn't know what to do with, and nuclear generated electricity that would be so cheap we wouldn't need to pay for it ), but why is a system, which was designed to be so robust and reliable, actually so fragile ?

The tale of a home user with two modems dialled into his ISP, who found his PC grinding to a halt because a whole country's Internet access was being routed through it when an ISP's service failed, may be little more than Urban Myth, but it highlights exactly what automatic recovery and automatic routing were meant to do to keep the Internet ticking over.

No single point of failure should have a significant impact on Internet Users and redundancies in the systems provide alternative routes of connection while overloaded systems can pass their excess onto other systems which can cope. All of which goes on behind the scenes, without the end-user noticing anything more than a momentary glitch if anything.

That theory was shown to be little more than just a theory though when millions of Internet Users in the UK effectively lost the use of the Internet entirely for a considerable amount of time.

NTL was the Service Provider most severely hit by the problems, losing access to most Internet services for over 8 hours, with consequential problems lasting well into the aftermath, but other ISPs such as Freeserve, Nildram, Pipex and Telewest are also reported to have had their services affected. BT has also said that some of its voice services were hit by the problem.

The root cause of the problem is reported to be damage to an undersea cable between the UK and France, which occurred just off the French coast, but the knock-on effects caused NTL's DNS services to crash, preventing many people from accessing any services at all.

It's all well and good Service Providers explaining why their systems fell over big-time, but the question which must be asked is why these systems are allowed to fall over. What has happened to the redundancy built into the system ? Why are single point failures so catastrophic, and why do they have such significant knock-on effects ?

No one would expect a Service Provider to provide backup for the entirety of the Internet structure; if the magic 'cable routing box' for a street is blown up by terrorists or dismantled by vandals, customers expect their access to be steeply curtailed until it is fixed, or they use their own redundancy capabilities to keep their own access going: dialling out on an alternative telephone line, going through a mobile phone, or even running a wire to the nearest house which still has Internet access.

At the centre of the Internet things are entirely different. The inter-continental links are critical to permitting global access between countries, and single point failures there are obviously going to have a significant impact on the Internet as a whole. This is where redundancy and alternative routes come into play, and it is something we look to ISPs to deal with, as individual home users have no control over the issue.

If the main link from the UK to France goes down, the secondary link should kick in, and, apart from a minimal loss of connection while it does, and a slowing down of access as everyone gets funnelled through a less than ideal link, everything should continue as normal. Even if that secondary link fails then we can still access France ( and anywhere else ) by routing through any of the links which are unaffected, even if it means that an email to the PC on the desk next to you goes round the world and back in its travels. This is the fundamental resilience the Internet should offer.
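The re-routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the link map (LINKS) and the helper names (find_route, without) are invented for the example and do not describe the real UK topology or any real routing protocol. A breadth-first search simply stands in for "find any working route".

```python
from collections import deque


def find_route(links, source, target):
    """Return the shortest route over the working links, or None if cut off."""
    # Build an undirected neighbour map from the set of working links.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        # Sorted so the result is deterministic for this illustration.
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def without(links, *failed):
    """Return the link set with the given failed links removed."""
    failed_set = {frozenset(link) for link in failed}
    return {link for link in links if frozenset(link) not in failed_set}


# Hypothetical link map, invented purely for illustration.
LINKS = {
    ("UK", "France"),
    ("UK", "Netherlands"),
    ("Netherlands", "France"),
    ("UK", "USA"),
    ("USA", "France"),
}

direct = find_route(LINKS, "UK", "France")
detour = find_route(without(LINKS, ("UK", "France")), "UK", "France")
long_way = find_route(
    without(LINKS, ("UK", "France"), ("UK", "Netherlands")), "UK", "France"
)
```

With the main link gone the route takes a detour through a neighbouring country, and with the secondary gone as well it goes the long way round; only when every link out of the UK is removed does find_route return None.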

At worst, a loss of inter-continental access from the UK should do nothing more than isolate the UK from everyone else. The UK's Internet should keep running as normal. We'll still be able to send emails within the UK and still browse web sites hosted within the UK; that's another fundamental resilience designed in.

And while email, newsgroup, web browsing and game playing services may all fail individually at times, there is no reason why any failure should cause problems for other services. A third resilience we have.

Key services upon which the Internet relies, which are mostly hidden from the users, are also duplicated and designed to ride out overloads, or pass what they can't deal with onto systems which have the capacity to cope. In short, if it can go wrong, there's something in place to keep everything working as a whole; slightly slower perhaps, but still functional. It all makes for an apparently unfailing system.
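As a rough sketch of what "pass what they can't deal with onto systems which have the capacity to cope" might look like for a duplicated service such as name lookup, the Python below tries each copy of the service in turn. Everything here is hypothetical: the simulated resolvers, their capacities, the Overloaded exception and the example address are made up for illustration, and no real DNS software is being described.

```python
class Overloaded(Exception):
    """Raised by a simulated resolver that has run out of capacity."""


def make_resolver(name, records, capacity):
    """Build a simulated resolver that answers `capacity` lookups, then refuses."""
    state = {"served": 0}

    def resolve(hostname):
        if state["served"] >= capacity:
            raise Overloaded(name)
        state["served"] += 1
        return records[hostname]

    return resolve


def resolve_with_failover(resolvers, hostname):
    """Try each resolver in order; fail only when every one of them is overloaded."""
    for resolve in resolvers:
        try:
            return resolve(hostname)
        except Overloaded:
            continue  # This copy cannot cope, so pass the request along.
    raise LookupError("no resolver could answer for " + hostname)


# Hypothetical data: 192.0.2.10 is an address reserved for documentation.
RECORDS = {"example.org": "192.0.2.10"}
resolvers = [
    make_resolver("primary", RECORDS, capacity=1),
    make_resolver("secondary", RECORDS, capacity=2),
]

answers = [resolve_with_failover(resolvers, "example.org") for _ in range(3)]
```

The first lookup is answered by the primary and the next two by the secondary. Only a fourth lookup, once both copies are exhausted, fails outright, which is the situation the duplication is supposed to make rare.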

Yet a problem which amounts to little more than someone in France having accidentally pulled the UK-France connection plug out of its wall socket, a rather insignificant event given the design of the Internet, had a catastrophic and snowballing effect right through the system, leaving millions in the UK without access to any Internet services at all.

Imagine this having happened to a Military Internet during times of crisis; it would be the most catastrophic event imaginable, and so simply done - the eyes, ears and voice of the UK closed and silenced at the mere flick of a switch from beyond its shores.

Would the military rely upon such a system ? Perhaps they do, and they just aren't admitting it, but if the military can design their systems to overcome such fateful scenarios, then why can't commercial Service Providers ?

Is it really the simple case that it is the commercial drive to make profit over all else which has led them to undermine the resilience of what we should have, or are there really fundamental flaws in the design of the Internet which make the imagined resilience only a dream when it's put into practice ?

That France's Internet services, along with the rest of the world's, kept going when the connection to the UK was "pulled" suggests that the problems are entirely within the UK's domain. That some Service Providers managed to maintain services for their customers while others were unable to deliver any real service at all suggests that the failings lie only with a few of the companies involved.

It is reported that the automatic re-routing of traffic did occur, but because of the problems caused elsewhere, those using affected Service Providers had little traffic to be so routed. No matter what parts of the Internet did remain working, and functioning as predicted, it is but a moot point for those who found they could do nothing at all. That services have been restored before the cable has been repaired suggests it is a failing which is separate to the issue of that single point fault.

Whatever the case, the Service Providers have a duty to explain why a minor problem can have such a massive effect on so many. We know "why" the problem occurred - "It's France's fault", allegedly - but the question "why" requires much deeper answers than that.

It is not just a question of providing an answer to fee paying customers and businesses who have come to rely upon Internet access, but to those who are wondering if the Service Providers are undermining the very soul of the Internet, and are equally wondering why this is being done, and those trying to formulate solutions to make the Internet, once more, what it was meant to be. We need to know if we can trust commercial Service Providers to deliver what we are paying for, and even what they are meant to provide, and we need to decide if we should take this highly important business out of their hands and put it in the hands of the Government to look after. Has self-regulation failed ? If it has, we need to decide what to do in response.

The only alternative is to redefine "The UK Internet" : A loosely coupled network of computer systems connected through points of access which are prone to failure, with ineffective redundancy, a lack of resilience, which continues to work on a wing and a prayer, with the risk of catastrophic failure should the smallest of problems occur. A theoretically superb, robust and resilient design, undermined by implementation and commercial concerns.

Hardly what was dreamt of in the 1960s.





First published on Saturday the 29th of November, 2003 at 16:26:56
Last upload was on Wednesday the 7th of January, 2004 at 04:31:26