Clustering Solutions and Zero Downtime Hosting Pitfalls

2005-01-24 by Godfrey Heron

There are a number of benchmarks we may use to evaluate hosting companies. One of these is reliability.

Like most things in this life, reliability in web hosting is typically a function of how much we are willing to spend for it. In essence, a "cost-effectiveness" equation needs to be determined and solved.

Reliability can be measured in terms of percentage availability. Industry personnel talk of system availability in terms of three nines (99.9%), four nines (99.99%) or five nines (99.999%).
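To put those percentages in perspective, the short sketch below converts each one into the downtime it permits over a year. It assumes a 365-day year and counts every minute of the year as one in which the site should be up; the function name is purely illustrative.

```python
# Convert an availability percentage into the downtime it permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at this availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, percent in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label} ({percent}%): {minutes:.1f} minutes per year")
```

Three nines works out to roughly 8.8 hours of downtime a year, four nines to under an hour, and five nines to a little over 5 minutes.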

Typically, web-hosting availability exceeding three nines was the purview of extremely large companies with multiple layers of redundancy built into their network and software systems. However, technology has now brought high-availability theory and cost-effective reality into alignment.

High availability can be achieved by removing, as far as possible, any "single point of failure" or, where this is not altogether possible, by minimizing the time spent in a "failure" situation.

One of the ways in which small businesses and ISPs can reasonably avoid single points of failure is by employing server farm clustering and load-balancing solutions.

Webopedia defines server farm clustering as follows:

"A server farm is a group of networked servers that are housed in one location. A server farm streamlines internal processes by distributing the workload between the individual components of the farm and expedites computing processes by harnessing the power of multiple servers.

The farms rely on load-balancing software that accomplishes such tasks as tracking demand for processing power from different machines, prioritizing the tasks and scheduling and rescheduling them depending on priority and demand that users put on the network. When one server in the farm fails, another can step in as a backup."

It is important to note that web servers load-balanced in this manner typically display one external IP address to the public Internet, while using internal network IPs to communicate between the clustered servers and the load balancer.

Now this is indeed fantastic! Not only do you receive web site peak demand scalability with web server clusters, but you also have the built-in  "high uptime availability" component which is so

However, this is only half of the picture. There are very important cautionary notes to keep in mind.

Where web hosting is concerned, availability depends on two things:

1. Hardware reliability (RAID drives, server clustering, etc.) within the Data Center;

2. High-bandwidth Internet connectivity to the Data Center / Network Operating Center (NOC).

Now, with all your well-thought-out server clustering solutions, what would be the result if (as recently occurred at a very high-profile web company) a fire in the vicinity of the network caused the entire Data Center to shut down power for hours? Or if a bandwidth provider to the NOC had router problems? All your websites would be showing the dreaded "Page Cannot be Displayed" page.
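Because a visitor needs both the hardware and the connectivity to be working, the two availabilities multiply. The sketch below shows the arithmetic under the assumption that the two fail independently of each other; the figures are illustrative, not measurements of any real provider.

```python
def combined_availability(*component_percents: float) -> float:
    """Availability (in percent) of a chain in which every component must be up.

    Assumes the components fail independently of one another.
    """
    result = 1.0
    for percent in component_percents:
        result *= percent / 100.0
    return result * 100.0


# A four-nines server cluster behind three-nines connectivity (example figures).
print(f"{combined_availability(99.99, 99.9):.2f}%")
```

With these example figures the site as a whole comes in just under three nines: the weaker link sets the ceiling, however good the clustering is.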

The ideal solution, therefore, would be to employ clustering solutions with servers in entirely different Data Centers with different bandwidth providers. Redundant Data Centers eliminate the NOC itself as a single point of failure. The scenario becomes interesting at this point, because the difficulty of addressing the potential problems now increases considerably.

We now have to deal with DNS caching, the concept of failover, and how static and dynamic web applications respond to failure events.

The terms failover and load balancing are frequently used interchangeably; however, they are in fact quite different.

· Load balancing refers to sharing requests across the capacity of several servers, so that no one server is overloaded and swamped with requests.

· Failover, however, is the process that manually or automatically switches from a failed server or bandwidth provider to a standby server or network if the primary system fails or is temporarily shut down for servicing.

As such, failover software is an important function of mission-critical systems that rely on constant accessibility.
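The difference between the two ideas can be sketched in a few lines. This is a minimal illustration, not how any particular product works: the function names and the is_up callback are assumptions made for the example.

```python
from itertools import cycle
from typing import Callable, Sequence


def make_load_balancer(servers: Sequence[str]) -> Callable[[], str]:
    """Load balancing: rotate through every server so each takes a share."""
    rotation = cycle(servers)
    return lambda: next(rotation)


def pick_with_failover(primary: str, standby: str, is_up: Callable[[str], bool]) -> str:
    """Failover: use the primary, and switch to the standby only if it is down."""
    return primary if is_up(primary) else standby


next_server = make_load_balancer(["server1", "server2"])
print([next_server() for _ in range(4)])  # requests alternate between both servers
print(pick_with_failover("server1", "server2", is_up=lambda name: False))  # primary is down
```

In the first case both servers work all the time; in the second, the standby sits idle until the primary fails.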

One of the inherent difficulties with failover for web hosting companies operating on different networks is the limitation imposed by the DNS caching system.

As DNS records are passed from the original DNS servers (i.e., ns1/ns2.your-domain.com), they are cached, or stored, at several different ISPs along the way, which is why it takes a while for a newly registered domain name to resolve to its IP address.

Each DNS record has a TTL (time to live) setting assigned. By manipulating this value, it is possible to alter how long that particular IP address/DNS record combination is stored. If your site is on 2 different servers with 2 different IP addresses, you could set the "time to live" to a value of, say, 2 minutes.

The failover software would check server availability by "pinging" the web server every few minutes to determine whether its IP address is responding appropriately (perhaps by looking for a particular text string in a web page).

If a failure is detected, the software would pull the non-working web server's IP address out of the list of IP addresses assigned to your web site's domain name. If and when that web server's IP comes back online, it would be restored to the list.

With a TTL setting of 2 minutes, theoretically, your web site should be down for just 2 minutes, while switching DNS information to the other web server.
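The checking step described above might look something like the sketch below. It is only an outline under stated assumptions: the function names are hypothetical, the check fetches the home page over plain HTTP, and the step that actually publishes the surviving addresses to DNS is left out because it depends entirely on the DNS provider or software in use.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List


def page_contains(ip: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Fetch http://<ip>/ and report whether the expected text appears in it."""
    try:
        with urllib.request.urlopen(f"http://{ip}/", timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP error statuses and broken
        # responses all count as "not responding appropriately".
        return False
    return expected_text in body


def healthy_addresses(all_ips: Iterable[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the IPs that pass the health check, in their original order."""
    return [ip for ip in all_ips if is_healthy(ip)]
```

Running healthy_addresses over the full list of configured IPs at every check interval, with a check such as lambda ip: page_contains(ip, "OK"), drops a failed server from the result and restores it automatically once it starts answering again.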

The problem with this scenario is that, while some ISPs' caches might respect such low figures, other ISPs may decide (to save on bandwidth utilization) to ignore any TTLs below a certain value, say, 60 minutes. So it is entirely possible that some of your visitors would see your web sites while, for others, your site would be down for 1 hour or more, even though one of your servers was operating perfectly.
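A rough way to see the effect is to treat a resolver's minimum as a floor on the TTL and add the time it takes the failover software to notice the failure. The model below is a simplification, and the 2-minute check interval is an assumption chosen for the example.

```python
def worst_case_outage_seconds(check_interval: int, record_ttl: int, resolver_minimum_ttl: int = 0) -> int:
    """Rough upper bound on how long one visitor may keep reaching a dead server."""
    # A resolver that enforces its own floor caches the record for the larger value.
    effective_ttl = max(record_ttl, resolver_minimum_ttl)
    # Worst case: the server dies just after a check, and the resolver cached
    # the old record just before the DNS change was published.
    return check_interval + effective_ttl


for floor in (0, 3600):
    seconds = worst_case_outage_seconds(check_interval=120, record_ttl=120, resolver_minimum_ttl=floor)
    print(f"resolver minimum {floor}s: up to {seconds / 60:.0f} minutes")
```

With a resolver that honors the 2-minute TTL, the worst case is about 4 minutes; with one that enforces a 60-minute minimum, it is about 62 minutes, whatever TTL you set.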

Static, non-interactive web sites are great candidates for server clustering, but the wicket becomes a bit sticky for dynamically generated sites. Most database application software, although it has some replication capabilities, is not happy with multiple-server master/slave relationships and real-time updating between servers. The issue can become very problematic if your site requires frequent updates.

Then there is the problem of how to keep your web sites synchronized. Unix/Linux servers typically include a synchronizing tool called rsync. You can also automate the synchronizing process by setting up a cron job to run it periodically.
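As an illustration, a small script along the following lines could be run from cron to mirror a site directory onto a second server. The paths, host name and schedule are hypothetical, and the sketch assumes rsync is installed and that the account running it can log in to the second server without a password prompt.

```python
import subprocess
from typing import List


def build_rsync_command(source_dir: str, destination: str) -> List[str]:
    """Build an rsync command that mirrors source_dir onto destination."""
    # A trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    source = source_dir.rstrip("/") + "/"
    # --archive keeps permissions and timestamps, --compress saves bandwidth,
    # --delete removes files on the destination that no longer exist locally.
    return ["rsync", "--archive", "--compress", "--delete", source, destination]


def sync_site(source_dir: str, destination: str) -> None:
    """Run rsync and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_rsync_command(source_dir, destination), check=True)


# Hypothetical crontab entry to run a script that calls sync_site every 15 minutes:
#   */15 * * * * /usr/bin/python3 /home/webadmin/sync_site.py
```

Note that --delete makes the second server an exact mirror, so anything uploaded only to that server is removed at the next run; leave the option out if that is not what you want.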

DNS caching and synchronizing issues can be so problematic as to nullify the advantages of server clustering. For example, a cron job to synchronize your servers every few minutes might very well use up your server capacity.

Your customers will also have to contend with their desktop email client software having dual email addresses for each email account, one on each web server, e.g. info@server1.net and info@server2.net.

It is important to realize that DNS operates by default in a round robin manner, so that, if you have the same web site on 2 separate servers, it is very likely that server 1 will get 50% of all the web

Now, this is important for a number of reasons, but one of the principal ones is that you will not be able to effectively keep a "back-up" site (as some providers would have you believe) which will only be used when the primary server goes down: for example, a site saying "We're sorry, our main server is down, but you may contact us at www.yourdomain2.com".
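The reason is easy to demonstrate with a toy simulation of round-robin DNS. It assumes the name server rotates the order of the addresses on each answer and that every visitor uses the first address returned, which is the common behavior but not a guarantee; the labels "primary" and "backup" are only for illustration.

```python
from collections import Counter, deque
from typing import Sequence


def simulate_round_robin(addresses: Sequence[str], lookups: int) -> Counter:
    """Count how many lookups lead with each address under round-robin rotation."""
    records = deque(addresses)
    hits: Counter = Counter()
    for _ in range(lookups):
        hits[records[0]] += 1  # visitors typically use the first address returned
        records.rotate(-1)     # the next answer leads with the next address
    return hits


print(simulate_round_robin(["primary", "backup"], lookups=1000))
```

Both addresses end up with an equal share of the lookups, so a "back-up" site published in DNS alongside the primary would be shown to about half of your visitors even while the primary is healthy.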

On a final note, hardware-based load balancing solutions tend to be quite expensive and also introduce a potential single point of failure into the system: the load balancer itself. There is a very prominent Data Center that began offering load-balanced hosting solutions, where the load balancer itself failed on several occasions although the web servers were operating perfectly. The net effect to the public, however, was that the sites were unavailable.

Reasonably cost-effective software-based solutions may be obtained as a service or by purchasing the software yourself. Zoneedit is an example of the service model, and Simplefailover is an example of a software-based model which may be purchased on a server license basis.

In conclusion, at this point in time, there are several limiting factors to successfully implementing a "true" high-availability, multiple-server web hosting system. Depending on your clientele and the nature of their web sites, this may indeed be a very viable alternative. For others, simply setting up a server with high-quality components, redundant RAID hard drives and a good supply of server spare parts may be the best way to ensure high availability.


Godfrey Heron

Godfrey Heron is a distributor of the amazing AWARD winning firestarter flash tool which makes complex flash sites in minutes. Download your FREE evaluation software here: http://www.irieisle-online.com/Services/onlinestore.htm He is also web editor of a monthly free ezine. Receive $195 in free bonuses when you subscribe.
