| Marc Goodman, Director of Marketing at Ecessa |
Marc Goodman is the director of marketing at Ecessa, a manufacturer of advanced WAN
Optimization products that provide WAN and ISP link aggregation, intelligent load
balancing, failover, QoS and VPN load balancing and failover within a single
device.
|
| Marc Goodman, Director of Marketing at Ecessa
has written 3 articles for HostReview. |
Overview
The Internet has become so pervasive and integral for conducting
business and communicating with customers, partners and employees that network
performance, high-availability, and uptime are absolute requirements for running
the day-to-day operations of an organization. Network downtime not only costs
money and loss of productivity, it can also adversely affect a company’s
reputation among customers and partners. For many companies, their entire
business strategy depends on how well the network performs.
There are many events that can cause
a network or site to go down, ranging from natural disasters and security
attacks to a backhoe cutting a network line or failing network infrastructure.
Few organizations plan for, and have the budget to implement, the appropriate
network infrastructure to ensure their datacenters and remote offices have the
protection they need in anticipation of disaster. For many companies, it’s a case of closing
the barn door after the horse runs away.
According to market research firm Infonetics, large enterprises
typically lose between 2 and 16 percent of their annual revenues due to losses
associated with network downtime. The more distributed a company’s network is,
the more likely it is to suffer service-provider interruptions. According to a
recent Infonetics survey, retailers are affected the most, with service
providers accounting for more than 30 percent of their downtime costs. Another
cause of downtime is human error, which accounts for about one-fifth of the
downtime costs. For financial institutions, this percentage jumps to nearly
one-third. Thus, network survivability
is key to a business’s productivity and profitability.
Failover
Failover within a communications network is
the process of instantly transferring tasks from a failed component to a
similar redundant component to avoid disruption and maintain ongoing operations.
Automated failover is the ability to reroute data quickly and automatically from a
failed component, such as a server or network connection, to a functioning component,
and it is essential for mission-critical systems.
Different components may be configured for
either cold standby (requiring human intervention), warm standby (automatic but
delayed) or hot standby (automatic) failover. The three critical elements
requiring failover configuration are power, network connectivity and server
capacity.
This article describes the different types of
failover, the requirements of failover design and strategies for successful
failover implementation.
Device Failover
When a device such as a firewall, router, WAN controller, server load balancer,
disk drive or web server fails, its data is transferred to a redundant component
of the same type to ensure there is limited
interruption in data flow and operation.
If a primary component becomes unavailable because of either failure or
scheduled downtime, the secondary component serves as a backup and takes over
for its troubled counterpart.
Automated failover is the capability to switch to a
redundant or standby system or network upon failure without human
intervention (see Failover Hierarchy for other types of failover). It is essential in servers, systems
or networks requiring continuous availability and a high degree of reliability,
namely those that are responsible for mission-critical processes and data (see examples
below). There are few IT managers who
want to be responsible for putting a server back online on a holiday weekend.
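The takeover described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Component` class, its `is_healthy` probe and the device names are hypothetical, and a real product would probe hardware or network state rather than call a stub.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Component:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe, stubbed below


def select_active(components: Sequence[Component]) -> Optional[Component]:
    """Return the first healthy component, checking the primary first."""
    for component in components:
        if component.is_healthy():
            return component
    return None  # every redundant component is down


# The primary firewall has failed, so the secondary takes over.
primary = Component("fw-primary", lambda: False)
secondary = Component("fw-secondary", lambda: True)
active = select_active([primary, secondary])
print(active.name if active else "no component available")
```

Listing components in priority order keeps the rule simple: the primary is used whenever it is healthy, and the secondary is used only when it is not.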
Failover Hierarchy
As mentioned above, there are different types
of failover. Some are intentionally not fully automatic and
require manual intervention.
This is called "automated with manual approval": the failover activity is
automatic once approval is given. When hardware is on “cold standby,” failover must be
performed manually, which invites error.
In contrast, when hardware is on “warm standby,” the backup system runs
in the background, so the transfer takes place automatically. The data on both systems is automatically
synchronized. To the user, failover
resembles a very fast automatic service reboot.
The most reliable failover scenario is “hot standby,” in which both systems
permanently run in parallel — data
on both systems is 100% synchronized at all times. Users will not be aware of any failures. This level of failover protection usually
requires a corresponding modification to the client. To run both systems in complete synchronicity,
the connections to the client must be mirrored 100 percent. This normally requires clients that have
connections with two or more servers at the same time and can communicate with
all of them. A typical Web browser
cannot do this. Some enterprises implement both hot failover
and cold failover for disaster recovery. It is important to differentiate between
failover and disaster recovery. Failover is a methodology through which to resume system
availability in an acceptable period of time while disaster
recovery is used to resume system availability when all failover
strategies have failed.
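As a compact summary of the hierarchy above, the three standby modes can be written down as a small lookup table. The names `Standby`, `Traits` and `TRAITS` are illustrative only. The `user_notices` value for cold standby is an inference from the manual switchover, not a statement made in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Standby(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


@dataclass(frozen=True)
class Traits:
    automatic: bool     # does the switch happen without an operator?
    user_notices: bool  # is the switch visible to users?


TRAITS = {
    # Cold: manual failover; visibility to users is assumed here.
    Standby.COLD: Traits(automatic=False, user_notices=True),
    # Warm: automatic, but resembles a very fast service reboot.
    Standby.WARM: Traits(automatic=True, user_notices=True),
    # Hot: both systems run in parallel; users are not aware of failures.
    Standby.HOT: Traits(automatic=True, user_notices=False),
}

for mode, traits in TRAITS.items():
    print(mode.value, traits)
```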
The Critical Role of Failover
The convergence of voice,
data and video over a single IP network is making the network infrastructure one
of the most critical elements in operational success. These voice, video and data services are
increasingly integrated with business-critical applications such as VoIP, e-mail,
customer relationship management (CRM), etc.
Therefore, all forms of communication with customers, suppliers and
employees are inextricably tied to network operation. If the network fails, access to critical
information can be lost or compromised, with potentially calamitous
results. For example, an airport risks massive delays that affect passengers,
and patients’ health may be compromised if a major medical center experiences
application delivery delays. There are many organizations for which network
failure is not an option.
Examples of Organizations that Need Failover
· Small and medium-sized businesses need both incoming and outgoing network link aggregation
and failover for an increasing assortment of business-critical traffic, from
VoIP to email. Examples include the local
corner store that does online banking and bill-pay over the Internet, and a
manufacturing company that needs email, web services, hosted ERP, and ecommerce
applications available 100% of the time.
· Companies with a central headquarters and a number of branch offices or remote employees
need secure and reliable data communications. They need reliable performance and high availability
for their VPN data, including the ability of the tunnel to fail over
automatically if a WAN link goes down.
· Web hosting companies, MSPs, ASPs and small ISPs need incoming link aggregation and
failover to ensure that their services are reliable, with extra bandwidth and
redundancy available to their servers. Their
mission-critical applications need to be up and running 24/7. If a WAN link goes down, the failover process
has to be smooth and transparent to users.
· Many of these companies are deploying VoIP applications to cut expenses and enhance
productivity. These companies now need quality of service levels and traffic shaping
to guarantee bandwidth to critical services and applications such as VoIP.
· Companies that have ERP, CRM or any other software accessed over the Internet.
Failover Requirements
Most corporate and government networks are
comprised of three main elements — LAN, WAN and network infrastructure devices
and services. The LAN provides
interconnectivity around a single organizational location. The WAN provides interconnectivity between
these locations (interconnecting specific geographical sites), other business
partners, and access to public networks such as the public switched telephone
network in the case of voice traffic, and the Internet for data traffic. The network infrastructure services element
provides the services that allow control of the network and flow of data (DNS,
DHCP, WINS, FTP), and that control access to the network using Active Directory,
RADIUS, etc.
These three elements have several requirements
for creating a failover environment, the most basic of which is a connecting cable between the primary and secondary devices.
The second device initiates its systems only when it detects a problem
in the first device. Some systems have the ability to page or send a message to
a specific technician or support center.
There may also be a third "spare parts" device that has
running spare components for "hot" switching to prevent downtime.
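One common way for a second device to detect a problem in the first is a heartbeat with a timeout. The sketch below assumes that mechanism; the text does not name one. `HeartbeatMonitor`, its parameters and the `notify` callback (standing in for a page or message to a technician) are all hypothetical names.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Runs on the secondary device and watches for primary heartbeats."""

    def __init__(self, timeout_s: float, notify: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.notify = notify
        self.clock = clock
        self.last_seen = clock()
        self.active = False  # secondary stays idle until a failure is detected

    def heartbeat(self) -> None:
        """Record that the primary just signalled over the connecting link."""
        self.last_seen = self.clock()

    def check(self) -> bool:
        """Activate the secondary once the primary has been silent too long."""
        if not self.active and self.clock() - self.last_seen > self.timeout_s:
            self.active = True
            self.notify("primary missed heartbeat; secondary taking over")
        return self.active


# Usage with a fake clock so the example runs instantly.
now = [0.0]
alerts = []
monitor = HeartbeatMonitor(timeout_s=3.0, notify=alerts.append, clock=lambda: now[0])
monitor.heartbeat()
now[0] = 5.0             # five seconds of silence from the primary
print(monitor.check())   # the secondary activates
print(alerts)
```

The alert is sent only once, at the moment of takeover, so a support center is not flooded with repeated pages while the primary remains down.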
The following are other critical elements that comprise a failover
environment:
Power
With power failures being
one of the most common reasons for network and systems failures, all critical
network components at the primary datacenter, call center or failover
site must be connected to a power source that has very high availability: 99.999%
in the case of a datacenter.
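To put the 99.999% figure in perspective, the short calculation below converts an availability percentage into the downtime it permits per year. The function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for percent in (99.9, 99.99, 99.999):
    print(f"{percent}% -> {allowed_downtime_minutes(percent):.2f} minutes per year")
```

At 99.999%, the budget works out to a little over five minutes of downtime per year.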
A LAN that provides
critical services, such as one in a hospital or bank, should be equipped with uninterruptible
power supplies (UPS) for each component of its distribution and access
portions. These should be connected to
emergency power sources to maintain internal communication. The WAN routers,
switches, firewalls, etc. need the same form of protection to provide
continuous communication and interconnection to external sites and other public
networks.
Large datacenters and
critical operations, such as call centers, must rely on multiple electric power
companies to provide utility power to their locations. The power is brought into
the critical site from different geographical locations. So, if power is interrupted by an accident, such as a car hitting a
utility pole and severing electric lines at a particular location, the
other utility can continue to provide uninterrupted power.
Emergency power
generators may be used instead of alternate utilities. These generators, together with UPS equipment,
can provide a continuous stream of electrical power for days if necessary,
while utility power is being restored.
Network Redundancy
Levels of redundancy
should be determined for the primary and backup networks based on the
identification of critical network components, impact analyses, and established
recovery objectives. There should be
consideration for redundancy of network devices such as switches, routers, gateways,
etc. There should also be consideration
given to redundant components such as power supplies, CPUs, and circuit cards
for the network switches and routers.
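When weighing how much redundancy to build in, a rough estimate of the benefit can help. The sketch below uses the standard formula for components in parallel. It assumes that failures are independent, which rarely holds exactly in practice (shared power or a shared carrier can take out both units). The function name is illustrative.

```python
def combined_availability(availability: float, redundant_units: int) -> float:
    """Availability of N identical units in parallel, assuming independent failures."""
    if redundant_units < 1:
        raise ValueError("at least one unit is required")
    # The service is down only when every unit is down at the same time.
    return 1 - (1 - availability) ** redundant_units


for units in (1, 2, 3):
    print(units, f"{combined_availability(0.99, units):.6f}")
```

Under these assumptions, two routers that are each available 99% of the time give roughly 99.99% combined availability.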
WAN Link Aggregation
Consideration must be given
to the redundancy and diversity of WAN links in conjunction with automated
failover. Redundancy can be achieved by
providing multiple links and multiple types of links for a single site, and between
multiple sites. For example, if the WAN
utilizes MPLS or ATM, it might be prudent to provide a different type of link, such
as frame relay, so that if a carrier’s entire service goes down, the
organization has a backup strategy, which may include satellite or
microwave links.
Diversity of links can be
accomplished either through link diversity, in which two or more links travel different
routes to your locations, or through carrier diversity, in which multiple carriers provide Internet
access diversity and redundancy to companies that rely heavily on Internet
connectivity.
WAN Bandwidth Capacity
Several capacity factors associated
with the alternate sites must be properly assessed in order to avoid failures
caused by unanticipated high traffic volumes from a primary site. One factor is the peak capacity coming from
the primary site that failed. The second
factor is the peak capacity of the secondary site to which the traffic will be
rerouted. The size of the WAN links should
allow for both peak capacities, plus an additional 25-40% to accommodate new
peak traffic volumes. Additional traffic may come from new applications such as
VoIP, and/or traffic congestion caused by customers, suppliers, and employees.
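The sizing rule above can be turned into a quick calculation. This sketch reads "both peak capacities" as the sum of the two peaks, which is one interpretation of the text. The function name, the default headroom and the traffic figures are illustrative.

```python
def required_wan_capacity_mbps(primary_peak_mbps: float,
                               secondary_peak_mbps: float,
                               headroom: float = 0.25) -> float:
    """Size failover-site WAN links: both peaks plus headroom for growth.

    headroom is a fraction; the guideline above suggests 0.25 to 0.40.
    """
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    return (primary_peak_mbps + secondary_peak_mbps) * (1 + headroom)


# Hypothetical peaks: 100 Mbps rerouted from the failed primary site,
# 60 Mbps already carried by the secondary site.
low = required_wan_capacity_mbps(100, 60, headroom=0.25)
high = required_wan_capacity_mbps(100, 60, headroom=0.40)
print(f"Provision between {low:.0f} and {high:.0f} Mbps")
```

With these example figures, the secondary site would need between 200 and 224 Mbps of WAN capacity.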
Aggregated bandwidth must be ample enough to
provide ISP failover and redundancy. If one link were to fail, enough bandwidth
would still be needed for users to be productive. Intelligent link load
balancing monitors bandwidth availability throughout the network and
assigns priority traffic to the link with the greatest available bandwidth, in
order to guarantee that time-sensitive traffic (such as voice, video and
other critical applications) receives the bandwidth required for smooth
performance.
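A minimal sketch of that selection rule follows. It is not any vendor's algorithm: the `Link` class, the link names and the bandwidth figures are hypothetical, and a real controller would measure utilization continuously instead of reading static values.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Link:
    name: str
    capacity_mbps: float
    used_mbps: float
    up: bool = True

    @property
    def available_mbps(self) -> float:
        return self.capacity_mbps - self.used_mbps


def pick_link(links: Sequence[Link], required_mbps: float) -> Optional[Link]:
    """Choose the working link with the most free bandwidth for priority traffic."""
    candidates = [l for l in links if l.up and l.available_mbps >= required_mbps]
    return max(candidates, key=lambda l: l.available_mbps, default=None)


links = [
    Link("isp-a", capacity_mbps=100, used_mbps=90),
    Link("isp-b", capacity_mbps=50, used_mbps=10),
    Link("isp-c", capacity_mbps=200, used_mbps=0, up=False),  # failed link
]
chosen = pick_link(links, required_mbps=5)
print(chosen.name if chosen else "no link can carry the call")
```

The failed link is skipped even though it has the largest nominal capacity, which is how failover and load balancing combine in a single decision.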
In addition to the availability of WAN links,
there is a need for a load balancer to connect users to available servers. If a server to which the
user is connected suddenly becomes unavailable, the load balancer redirects the
request to one of the other replicated servers. This action causes the loss of the original
session-to-credential mapping because the user is new to the substitute server,
so the user is normally forced to log in again.
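The lost session can be illustrated with a toy load balancer. The sketch assumes that each server keeps its own session store and that sessions are not replicated between servers. The class, method and server names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


class LoadBalancer:
    def __init__(self, servers: List[str]):
        self.healthy: Dict[str, bool] = {s: True for s in servers}
        # Each server has its own session store (assumed not replicated).
        self.sessions: Dict[str, Set[str]] = {s: set() for s in servers}
        self.assignment: Dict[str, str] = {}

    def route(self, user: str) -> Tuple[str, bool]:
        """Return (server, needs_login) for this user's next request."""
        server = self.assignment.get(user)
        if server is None or not self.healthy[server]:
            available = [s for s, ok in self.healthy.items() if ok]
            if not available:
                raise RuntimeError("no servers available")
            server = available[0]
            self.assignment[user] = server
        needs_login = user not in self.sessions[server]
        self.sessions[server].add(user)  # user logs in if required
        return server, needs_login


lb = LoadBalancer(["web1", "web2"])
print(lb.route("alice"))       # first visit: login required on web1
print(lb.route("alice"))       # session found on web1
lb.healthy["web1"] = False     # web1 becomes unavailable
print(lb.route("alice"))       # redirected to web2, which has no session for alice
```

Replicating or externalizing session state is the usual way to avoid the repeated login, at the cost of extra infrastructure.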
WAN Link Load Balancing and Failover
Many companies deploy a specialized WAN
optimization controller to merge WAN link load balancing and failover to
cost-effectively eliminate downtime for business-critical, time-sensitive
applications and ensure network performance.
These devices enable redundant WAN and ISP access, and can provide both
outbound and inbound WAN/ISP load balancing and failover.
Bandwidth aggregation combines multiple WAN
links into what is effectively one large network connection. Alternatively, the controller can
keep these links separate and allocate
Internet traffic across them. Both techniques result in larger pools of
available bandwidth and greater reliability.
Site-to-site Channel Bonding
For site-to-site
channel bonding, WAN optimization controllers with intelligent link load
balancing are installed at both a local and remote site and direct traffic over
the Internet between the two sites using the combined (or bonded) bandwidth of
multiple ISP or WAN connections. Each site connected by such a bonded link is
assigned a unique identifier that allows it to be differentiated from other
sites. Each site is also configured with addressing information for both the
local and remote end of the bonded link. This allows the WAN optimization
controller at the local site to identify traffic that should be sent across the
bonded link and direct it to the specified IP addresses on the WAN link(s) of
the remote site. When the WAN optimization controller identifies such outgoing
traffic, it disassembles the traffic at the packet level into separate streams of data,
encapsulates them for transmission through the bonded channels and sends them over
all available WAN links. Since each encapsulated packet contains addressing
information for a specific remote location, data is easily reassembled at that
location.
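The disassemble, encapsulate and reassemble cycle can be sketched as follows. This is a simplified illustration, not the actual wire format of any product. The sequence number added to each packet is an assumption made here so that out-of-order arrival can be handled, and the function names, site identifier and link names are hypothetical.

```python
from typing import Dict, List, Tuple

# (site identifier, sequence number, payload chunk)
Packet = Tuple[str, int, bytes]


def split_across_links(data: bytes, site_id: str, links: List[str],
                       chunk_size: int = 4) -> Dict[str, List[Packet]]:
    """Disassemble data into chunks and spread them over all links in turn."""
    per_link: Dict[str, List[Packet]] = {link: [] for link in links}
    for seq, start in enumerate(range(0, len(data), chunk_size)):
        link = links[seq % len(links)]
        per_link[link].append((site_id, seq, data[start:start + chunk_size]))
    return per_link


def reassemble(received: List[Packet], site_id: str) -> bytes:
    """Keep packets addressed to this site and restore the original order."""
    mine = sorted((p for p in received if p[0] == site_id), key=lambda p: p[1])
    return b"".join(payload for _, _, payload in mine)


per_link = split_across_links(b"hello bonded wan", "site-42", ["isp-a", "isp-b"])
# Packets from the two links may arrive interleaved or out of order.
arrived = per_link["isp-b"] + per_link["isp-a"]
print(reassemble(arrived, "site-42"))
```

Spreading chunks round-robin uses the combined bandwidth of both links for a single transfer, which is what distinguishes bonding from simply balancing separate sessions across links.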
Summary
The Internet has become so pervasive and integral for conducting
business and communicating with customers, partners and employees, that network
performance, reliability and uptime are becoming required for running the
day-to-day operations of many organizations. Network downtime not only costs
money and loss of productivity, it can also adversely affect a company’s
reputation among customers and partners. There
are many events that can cause a network or site to go down, such as natural
disasters, security attacks, human errors, and failing network infrastructure
elements. When evaluating how to avoid network failures, it is important
to consider the many options available to ensure high availability, network
uptime and optimal network performance. It is also critical to look for
solutions that not only help
avoid network failures but are also affordable and, when deployed,
operationally cost-effective.
|