Some of the key operational metrics of a communications network are the availability of each of the components in the network, the availability of a group of components and the service availability that the network provides for end users. While the theory of availability prediction and estimation is well established, it can be difficult to actually measure the availability for some of the network components.
In general, it has been relatively easy to determine the availability of network components that sit at the core of a network and are managed closely by network operators. Any unavailability of these components will affect a large number of end users, so network operators typically closely monitor the uptime and downtime of these components.
In the case of Customer Premise Equipment (CPE) devices that are deployed at customers' premises, availability metrics are typically much more difficult to determine. Examples of CPE devices include DSL and cable modems, Fixed Wireless Access devices and mobile wireless terminals. In some cases operators will want to understand the availability of a CPE device itself. This requires accurate measurements of the uptime and downtime of each CPE device in the network. However, accurate measurements of this type have proven difficult to make.
The availability of a network or a component may be expressed according to the following formula:
    Availability = Uptime/(Uptime + Downtime) × 100%
where Uptime is the amount of time the network or component has been providing service and Downtime is the amount of time that the network or component has not been providing service. If it is possible to accurately measure the uptime and the downtime, then the availability of the network or component can be determined. The availability is usually expressed as a percentage value.
The unavailability of the network or network component may be expressed according to the following equation 1:
    Unavailability = (1 − Availability) = Downtime/(Uptime + Downtime)  [Equation 1]
In the case where multiple CPE devices are deployed in a network, an average CPE availability is usually desired. This is provided by the following equation 2:
    CPE_Availability = All_CPE_uptime/(All_CPE_uptime + All_CPE_downtime)  [Equation 2]
where All_CPE_uptime is the sum of the uptimes of all the CPE in the network, and All_CPE_downtime is the sum of the downtimes of all the CPE in the network.
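As an illustration only, equations 1 and 2 can be sketched in Python. The function names, the use of hours as the time unit and the sample fleet figures are assumptions made for this sketch and do not come from any measured network.

```python
def availability(uptime_hours, downtime_hours):
    """Availability as a fraction: Uptime / (Uptime + Downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def unavailability(uptime_hours, downtime_hours):
    """Equation 1: Unavailability = 1 - Availability."""
    return 1.0 - availability(uptime_hours, downtime_hours)


def average_cpe_availability(records):
    """Equation 2: sum the uptimes and downtimes of all CPE, then divide.

    `records` is an iterable of (uptime_hours, downtime_hours) pairs,
    one pair per CPE (a hypothetical input format).
    """
    records = list(records)
    all_cpe_uptime = sum(up for up, _ in records)
    all_cpe_downtime = sum(down for _, down in records)
    return availability(all_cpe_uptime, all_cpe_downtime)


# Made-up figures for three CPE observed over a 720-hour month.
fleet = [(720.0, 0.0), (700.0, 20.0), (716.0, 4.0)]
fleet_availability = average_cpe_availability(fleet)
print(f"Average CPE availability: {fleet_availability * 100:.2f}%")
```

Note that equation 2 weights every hour equally, so a CPE observed for a longer period contributes more to the average than one observed briefly.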
In many cases, measurements of the uptime and downtime may not be available, especially for new networks. When measurements are not available, network designers may make estimates of the availability of the network design based on component Mean Time Between Failures (MTBF) and the Mean Time To Repair/Replacement (MTTR). The MTBF of a network component may be provided by the component manufacturer. The MTTR is usually a function of how quickly a failure can be detected, how quickly the failure can be diagnosed, how quickly the failed component can be repaired or replaced, and how long it takes for the network component to be up and running in the network.
The following equation 3 shows the relationship between MTBF, MTTR and Availability:
    Availability = MTBF/(MTBF + MTTR)  [Equation 3]
As an example calculation, for a network component with an MTBF = 20,000 hours and an MTTR = 4 hours, the availability is expressed as:
    Availability = 20,000/(20,000 + 4) = 0.9998
Thus, it can be said that the component has an availability of 0.9998, or that the component is available 99.98% of the time.
The unavailability can be expressed according to the following equation 4:
    Unavailability = (1 − Availability) = MTTR/(MTBF + MTTR)  [Equation 4]
Using the figures from the above example, the unavailability may be calculated as 1 − 0.9998 = 0.0002. Expressed as a percentage, it can be stated that the component is unavailable 0.02% of the time.
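The worked example above can be reproduced with a short Python sketch of equations 3 and 4. The function names are illustrative, and the conversion to expected downtime per year assumes a year of 8,760 hours, which is an addition made here for illustration.

```python
def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Equation 3: Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def unavailability_from_mtbf(mtbf_hours, mttr_hours):
    """Equation 4: Unavailability = MTTR / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mttr_hours / (mtbf_hours + mttr_hours)


# Figures from the example: MTBF = 20,000 hours, MTTR = 4 hours.
a = availability_from_mtbf(20_000, 4)
u = unavailability_from_mtbf(20_000, 4)

HOURS_PER_YEAR = 8760  # assumption: a non-leap year
expected_downtime_hours = u * HOURS_PER_YEAR

print(f"Availability:   {a:.4f}")
print(f"Unavailability: {u:.4f}")
print(f"Expected downtime per year: {expected_downtime_hours:.2f} hours")
```

Under these assumptions, an unavailability of about 0.02% corresponds to roughly 1.75 hours of downtime per year.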
In some cases, network components are assembled from individual hardware components with little or no software implemented by the manufacturer. In the case of hardware MTBF, well-established techniques such as the Telcordia TR-332 method or the methods in MIL-HDBK-217 allow the MTBF of a hardware system to be estimated based simply on the bill of materials of the system.
It can be more difficult to determine the MTBF of a network component that includes software components. Software faults can be the primary reason for low MTBF. The MTBF for software failures can be estimated by measuring the time between successive occasions on which a network component becomes unavailable because of a software failure. This measurement can be done automatically through the use of timers in the network component that record the time between successive restarts of the network component due to software issues.
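A minimal sketch of this kind of estimate is given below. It assumes that the component logs a timestamp, in hours since first power-up, each time it restarts because of a software issue, and it treats the interval between consecutive restarts as the time between failures, ignoring the time taken to recover. The function name and the sample log are hypothetical.

```python
def estimate_software_mtbf(restart_times_hours):
    """Estimate MTBF as the mean interval between consecutive restarts.

    `restart_times_hours` holds the times (in hours) at which the
    component restarted due to a software issue.
    """
    times = sorted(restart_times_hours)
    if len(times) < 2:
        raise ValueError("at least two restart times are needed")
    # Pair each restart with the next one and take the difference.
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(intervals) / len(intervals)


# Made-up restart log: intervals of 480, 1020 and 600 hours.
restarts = [0.0, 480.0, 1500.0, 2100.0]
software_mtbf = estimate_software_mtbf(restarts)
print(f"Estimated software MTBF: {software_mtbf:.0f} hours")
```

With only a handful of restarts the estimate is statistically weak, so in practice the intervals from many components of the same type would likely be pooled.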
Networks are generally built by connecting a number of network components in series or in parallel. If the availability of each network component is known, then the availability of the series or parallel network can be determined. When a number n of components are networked in a series arrangement, then the end-to-end availability of the network may be expressed by the following equation 5:
    Series_Availability = product(i = 1..n, Availability_i)  [Equation 5]
where Availability_i is the availability of the i-th component, and the product function calculates the product of the Availability_i values for i in the range 1 to n.
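The series formula of equation 5 can be sketched in Python as follows. The function name and the component availabilities in the example are assumptions chosen for illustration, and availabilities are expressed as fractions between 0 and 1.

```python
import math


def series_availability(availabilities):
    """Equation 5: product of the availabilities of components in series."""
    values = list(availabilities)
    if not values:
        raise ValueError("at least one component is required")
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("availabilities must lie between 0 and 1")
    return math.prod(values)


# Hypothetical chain of three components connected in series.
chain = [0.9998, 0.9995, 0.9990]
end_to_end = series_availability(chain)
print(f"End-to-end availability: {end_to_end:.6f}")
```

Because every factor is at most 1, the end-to-end availability of a series network can never exceed that of its least available component.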
An example of applying the series availability formula is shown in FIG. 1, which illustrates the availability and unavailability of a typical Public Switched Telephone Network. The example of FIG. 1 shows the unavailability of each of a series of network components and the links between the components, as well as the end-to-end availability and unavailability. In FIG. 1, NI refers to the network interface at a customer premises, LE is the local exchange switch and the box labeled Long Distance refers to the long distance telephone network, including switches and long distance cabling.
When a number n of network components are networked in parallel, then the end-to-end availability of the parallel network is given by equation 6:
    Parallel_Availability = 1 − product(i = 1..n, (1 − Availability_i))  [Equation 6]
where Availability_i is the availability of the i-th parallel component. In this arrangement, the parallel network is available if one or more of the parallel components is available.
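Equation 6 can be sketched in the same way. The function name and the example values are assumptions for illustration, and the sketch further assumes that the parallel components fail independently of one another, which the product form of the equation implies.

```python
import math


def parallel_availability(availabilities):
    """Equation 6: one minus the product of the component unavailabilities."""
    values = list(availabilities)
    if not values:
        raise ValueError("at least one component is required")
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("availabilities must lie between 0 and 1")
    return 1.0 - math.prod(1.0 - a for a in values)


# Two hypothetical redundant components, each available 99% of the time.
redundant_pair = parallel_availability([0.99, 0.99])
print(f"Parallel availability: {redundant_pair:.4f}")

# The redundant pair placed in series with a single 99.9% component.
combined = redundant_pair * 0.999
print(f"Combined availability: {combined:.6f}")
```

In this example the pair is unavailable only when both components are down at once, so duplicating a 99% component raises the availability of that stage to about 99.99%.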
Operators typically want to validate that individual network components, or the entire network itself, are meeting the availability targets that were set during the design of the network. As previously discussed, availability is easier to measure for central network components, in which relatively few components are monitored closely by the operator. However, in the case of Customer Premise Equipment (CPE), where there may be millions of devices deployed for a single operator, determining an average availability for these devices is more challenging.
A typical CPE availability formula takes into account the uptime and downtime of all CPE to arrive at an average CPE availability metric according to equation 2 above. The CPE downtime in this case may consist of the times when a CPE hardware problem or CPE software problem causes the CPE to enter a state where it can no longer provide services to the end user. There may be other times where the end customer may not be provided service, but the lack of service is not a result of a problem with the CPE. These scenarios can include:
    Customer powering down the CPE.
    Customer unplugging CPE from the network.
    Power outage at the CPE.
    Network server outage.
    Outage on a link between CPE and the central network.
In wireless networks, the link between the CPE and the network can also be broken for reasons that are not due to downtime caused by CPE HW or SW failures. These scenarios include:
    Customer repositioning the CPE, thereby breaking the RF link between the CPE and the base station.
    RF link between CPE and base station changing due to seasonal or environmental changes (e.g. a new building is built between a fixed location CPE and the serving base station, increasing the path loss between the CPE and base station).
    Wireless Base Station outage.
    Outage of other wireless core network equipment, such as a PDSN, S-GW, P-GW, Media Gateway, AAA, DSN, DHCP or backhaul outage, etc.
    Power outages at the CPE.
In principle, the average CPE availability can therefore be obtained by measuring the total service outage time due to CPE hardware or software faults and the total CPE uptime, and applying equation 2.
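The principle can be sketched as follows, under the strong assumption that every outage has already been attributed to a cause. The cause labels, the Outage record and the sample figures are all hypothetical. Outages with causes other than a CPE hardware or software fault are counted as neither uptime nor downtime in this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    """Hypothetical outage causes, based on the scenarios listed above."""
    CPE_HARDWARE = "cpe_hardware"
    CPE_SOFTWARE = "cpe_software"
    CUSTOMER_POWER_DOWN = "customer_power_down"
    CUSTOMER_UNPLUGGED = "customer_unplugged"
    POWER_OUTAGE = "power_outage"
    NETWORK_OUTAGE = "network_outage"
    LINK_OUTAGE = "link_outage"


# Only these causes count towards CPE downtime.
CPE_FAULTS = frozenset({Cause.CPE_HARDWARE, Cause.CPE_SOFTWARE})


@dataclass(frozen=True)
class Outage:
    cause: Cause
    hours: float


def cpe_availability(all_cpe_uptime_hours, outages):
    """Apply equation 2, counting only CPE hardware/software outages."""
    all_cpe_downtime = sum(o.hours for o in outages if o.cause in CPE_FAULTS)
    return all_cpe_uptime_hours / (all_cpe_uptime_hours + all_cpe_downtime)


# Made-up outage log for a small population of CPE.
log = [
    Outage(Cause.CPE_SOFTWARE, 2.0),
    Outage(Cause.POWER_OUTAGE, 6.0),         # excluded from downtime
    Outage(Cause.CPE_HARDWARE, 8.0),
    Outage(Cause.CUSTOMER_UNPLUGGED, 24.0),  # excluded from downtime
]
result = cpe_availability(9990.0, log)
print(f"CPE availability: {result * 100:.2f}%")
```

Here only 10 of the 40 logged outage hours are attributed to the CPE, giving an availability of 99.9% rather than the lower figure that would result from counting every outage.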
In practice, it can be difficult to determine whether a service outage is due to a CPE hardware or software fault, or if the service outage is due to one of the other reasons listed above. For example, a wireless CPE may not be able to tell the difference between an outage due to a radio failure at the CPE and an outage due to the repositioning of the CPE to a location where it can no longer communicate with the base station. In both cases the CPE can no longer communicate with the base station, but only in the former should the outage time be logged as downtime. As shown in this example, the main difficulty with determining an average CPE availability rate is differentiating service outages due to CPE hardware or software failures from service outages due to other causes, which is not always feasible through conventional techniques.
Other scenarios, such as a CPE being powered down due to a power outage or by the customer, should not be included in the CPE downtime measurement. Accounting for these distinctions precludes the use of an automated ping mechanism from the core network or base station to the CPE to monitor the CPE availability. With such a ping, it is not possible to differentiate between CPE unavailability due to a hardware/software fault and CPE unavailability due to another reason that should be excluded from any reliability/availability calculations.