Adding Redundant Links for High Availability Embedded Networking

Many techniques are available for mitigating network failures. Modern networking equipment implements a number of open-standards protocols that help to ensure high availability (HA).

One of the typical failure modes that must be addressed for networks connecting embedded systems is the loss of a network link. Links can fail when a cable is severed or degraded, or when a transceiver on one end ceases to function. With common BASE-T links, magnetics and PHY devices can be significant contributors to the failure rate. With fiber optic links, mechanical damage due to shock or vibration can cause a loss of connectivity.

To mitigate link failure, a network design can add a redundant link. This could be, for example, a second cable routed along a different physical path to reduce the risk that both links will be affected at the same time. If one link fails, traffic can use the redundant link, minimizing the effect on the overall system.

When redundant links are used, networking protocols must be properly configured to ensure that the redundancy is effective. If network redundancy is not properly implemented and configured, adding hardware can increase complexity and lead to new failure modes that reduce overall availability. Designers must also carefully analyze two key aspects of their design: failure detection and switch-over time.

Figure 1: High-availability LAN using PRP, redundant switches and dual-connected hosts

Failure detection

If a tree falls in the forest and no one is around to hear it, does it make a sound? This old idea has a parallel in network design. If a host on a network is not receiving any traffic, is this because there is no traffic to receive? Or has it lost its connection? Although many protocols at various layers of the network stack will automatically detect and report a link-down condition, it is up to higher layers to recognize and respond to these conditions. For applications that need to recognize and respond to a loss of connectivity, such as safety-critical or real-time systems, designers should plan for network integrity checking. This can be as simple as using a regular “ping” or “keep-alive” message that serves as a watchdog timer on the link between two hosts. If a message is not received when it is expected, this can indicate a loss of connectivity.
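A keep-alive watchdog of this kind can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the class name LinkWatchdog, its methods, and the timeout value are assumptions made for the example, and the clock is injectable so the behavior can be shown without real waiting.

```python
import time


class LinkWatchdog:
    """Reports a link as down when no keep-alive arrives within the timeout.

    Hypothetical helper for illustration; not part of any standard library.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()  # treat start-up as the first keep-alive

    def keepalive_received(self):
        # Call this whenever a keep-alive message arrives from the peer.
        self._last_seen = self._clock()

    def link_up(self):
        # The link is considered up while the last keep-alive is recent enough.
        return (self._clock() - self._last_seen) <= self.timeout_s


# Demonstration with a simulated clock (seconds).
now = [0.0]
wd = LinkWatchdog(timeout_s=3.0, clock=lambda: now[0])

states = []
now[0] = 1.0
wd.keepalive_received()
states.append(wd.link_up())   # keep-alive just arrived: True
now[0] = 3.0
states.append(wd.link_up())   # 2 s of silence, within the timeout: True
now[0] = 5.0
states.append(wd.link_up())   # 4 s of silence, timeout exceeded: False
wd.keepalive_received()
states.append(wd.link_up())   # peer heard from again: True
```

In a real system the application would poll link_up() periodically, or arm a timer, and trigger its fail-over logic when the result turns False.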

Switch-over time

If a link fails and a redundant link is available, traffic can be re-routed to ensure there is no lasting loss of connectivity. But even with a redundant link, application performance can be significantly affected until connectivity is restored. In many cases, traffic will be lost during the time it takes to notice that a link is down and to propagate this event to the software layers that will act on it. Where failure detection depends on a watchdog as noted above, designers must trade off the time to detect a failure against the overhead involved in generating and responding to these regular messages.
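The trade-off can be put into rough numbers. The sketch below assumes a simple model that is not taken from any particular protocol: the receiver declares a failure after a silence of detect_multiplier keep-alive intervals, and each keep-alive costs message_bytes on the wire in one direction. The function name and parameters are hypothetical.

```python
def keepalive_tradeoff(interval_s, detect_multiplier, message_bytes, link_bps):
    """Estimate worst-case detection time and bandwidth overhead.

    Assumed model: the watchdog fires after detect_multiplier intervals
    without a keep-alive, so a failure right after a received message
    takes the full timeout to detect.
    """
    worst_case_detect_s = interval_s * detect_multiplier
    overhead_bps = message_bytes * 8 / interval_s
    overhead_fraction = overhead_bps / link_bps
    return worst_case_detect_s, overhead_fraction


# Example: 64-byte keep-alive every 10 ms, timeout of 3 intervals, 1 Gb/s link.
fast = keepalive_tradeoff(0.010, 3, 64, 1e9)

# Same link, keep-alive every 1 s: far less overhead, far slower detection.
slow = keepalive_tradeoff(1.0, 3, 64, 1e9)
```

Under these assumptions the 10 ms interval detects a failure within about 30 ms at a cost of roughly 0.005% of the link, while the 1 s interval cuts the overhead a hundredfold but can take up to 3 s to notice the failure. The bandwidth figure ignores the CPU cost of generating and processing each message, which may matter more on small embedded hosts.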

Some newer redundancy protocols aim to provide “hit-less” redundancy, so that the failure of a link or even of network equipment can be mitigated without impacting applications. However, these protocols may be expensive to implement, requiring duplicate hardware or additional network traffic that reduces system performance.
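PRP, the protocol shown in Figure 1, is one example of this approach: each frame is sent over two independent networks with a sequence number, and the receiver keeps the first copy and discards the duplicate. The following sketch illustrates only that duplicate-discard idea. It is not an implementation of the PRP frame format, and the class name, the window size, and the "host1" source label are assumptions made for the example.

```python
from collections import OrderedDict


class DuplicateDiscard:
    """Accepts the first copy of each (source, sequence) pair, drops repeats."""

    def __init__(self, window=1024):
        self._seen = OrderedDict()
        self._window = window  # how many recent frames to remember (assumed)

    def accept(self, source, seq):
        key = (source, seq)
        if key in self._seen:
            return False  # duplicate from the other network: discard
        self._seen[key] = True
        if len(self._seen) > self._window:
            self._seen.popitem(last=False)  # forget the oldest entry
        return True


# Simulation: five frames sent on both LAN A and LAN B.
# LAN A fails after the second frame; LAN B stays up throughout.
rx = DuplicateDiscard()
delivered = []
for seq in range(5):
    lan_a_up = seq < 2
    for lan_up in (lan_a_up, True):
        if lan_up and rx.accept("host1", seq):
            delivered.append(seq)
```

After the loop, delivered holds all five sequence numbers exactly once. The application never sees the failure of LAN A, because there is no switch-over step at all, but every frame was transmitted twice while both networks were healthy.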

To read more about configuring a redundant network and managing redundant switches, download "Staying Connected: High Availability Embedded Networking", a new white paper from Curtiss-Wright.