We depend so heavily on computing devices in our jobs that a device failure translates directly into business losses. When devices stop working, business stops. Even a disruption of a few minutes can have an enormous impact, irrespective of the size of the network. Beyond the direct loss, downtime can ripple across other areas of the business, for example a bad customer experience that in turn damages the company’s reputation.
What is Downtime?
Nothing is more frustrating in a business than being unable to send an email, or having the printer refuse to print at the moment you need it most. For some employees, downtime becomes so common that they learn to live with it.
In computing, downtime is the duration for which a service is unavailable. The term is most commonly applied to networks and servers. Downtime can be caused by a loss of transmission, a traffic overload, hardware failure, a power outage, or an issue with upstream or downstream equipment, to name a few. Downtime falls into two categories: scheduled and unscheduled.
Scheduled downtime is downtime for which users are notified well in advance that certain services will be unavailable during a given window. It is usually tied to software or hardware maintenance. Unscheduled downtime is the unplanned kind, and for any IT team the latter is a nightmare to deal with.
Let’s face it: downtime is a necessary evil that cannot be eliminated beyond a certain limit, but there are certain vectors that a super administrator must keep under control. Downtime can occur due to:
- Human error
- Security flaw
- Failed device
- Power failure
For example, a fire, or an interface failure on the network device that connects to a server, can cause an outage. The service running on that server then becomes unavailable to end customers, which counts as downtime for that service.
Another example is a server that hosts applications for a bank. As soon as the service went public, the server crashed under the load and was unable to process any new requests. From a business perspective, the service is unavailable to customers, which results in lost business from both potential and existing customers.
How to avoid downtime?
In our experience, downtime can be avoided to an extent if proper measures and best practices are in place.
- Device monitoring:
As network or server administrators, we have tools to proactively monitor, log and report, so that we can act on an incident as soon as possible. All devices should be monitored and logged at all times. This is considered the first line of defence. The data collected can also be used to form policies and take the measures needed to run a smooth operation. There are multiple tools and protocols for monitoring devices, and these tools also help keep a record of the uptime and downtime of any computing device. SNMP (Simple Network Management Protocol) plays a very important part in monitoring. Many companies sell commercial monitoring products built on this protocol, and many more rely on those products to monitor their computing environment. SNMP is so central that “SNMP” and “monitoring” are often used interchangeably, and it is a core part of a productive computing ecosystem.
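As a rough illustration of proactive monitoring and logging, here is a minimal Python sketch. It does not use SNMP; it assumes a plain TCP reachability check as a stand-in for a real monitoring tool, and the names `tcp_check`, `poll` and `count_down_samples` are hypothetical helpers written for this example.

```python
import socket
import time
from typing import Callable, List, Tuple


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def poll(check: Callable[[], bool], samples: int, interval: float) -> List[Tuple[float, bool]]:
    """Run the check `samples` times, `interval` seconds apart, and log each result."""
    log = []
    for i in range(samples):
        log.append((time.time(), check()))  # (timestamp, was the device up?)
        if i < samples - 1:
            time.sleep(interval)
    return log


def count_down_samples(log: List[Tuple[float, bool]]) -> int:
    """Count how many logged samples found the device down."""
    return sum(1 for _, up in log if not up)
```

A call such as `poll(lambda: tcp_check("192.0.2.1", 443), samples=3, interval=60)` would probe a device once a minute; the address here is a documentation placeholder, not a real device. A production setup would normally rely on a dedicated monitoring tool instead of a hand-written loop.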
- Applying proper security patches.
As users, we work with many devices running different operating systems and multiple applications. In an SMB or enterprise, the number of devices runs into the hundreds, which increases the exposure to threats and vulnerabilities in the same ratio. Today there are bad actors constantly on the lookout for vulnerabilities to exploit for their own benefit. As administrators, it is our responsibility to ensure that proper measures are in place and that devices are updated regularly and whenever a vulnerability patch is announced. To avoid attacks on the network or servers, the latest firmware and patches should always be applied.
- Device high availability – Disaster Recovery
High Availability (HA) provides a failover solution in the event that a server, network or database fails. Disaster recovery provides a recovery solution, even across geographically separated sites, in the event of a disaster that causes an entire data centre to fail. We have a section dedicated to High Availability (HA) where we discuss it in detail and show how to configure it across multiple devices. In summary, HA means that the same data is available on another server, that another instance of the server is available in case of failure, or that the server is reachable via another path if the connecting network device fails.
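The failover idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint names and the `is_healthy` callback are hypothetical, and real HA is normally handled by clustering software, load balancers or routing protocols, not by application code like this.

```python
from typing import Callable, Iterable, Optional


def pick_available(endpoints: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical example: the primary server has failed, the standby is still up.
status = {"primary": False, "standby": True}
active = pick_available(["primary", "standby"], lambda name: status[name])
print(active)  # standby
```

Listing the endpoints in priority order means traffic stays on the primary while it is healthy and moves to the standby only when the primary fails.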
How is downtime calculated?
Let’s take the example of a network device: a router that is monitored for a total period of 24 hours. A NIC failure on the router takes the internet down at the site. You, being an instaelearner, troubleshoot the issue and trace it to that particular NIC. You move the cable to a newly configured spare port and voila, the internet is back up in the office. The internet was down for a total of 30 minutes. The uptime of the network can be calculated at the end of the day as shown below.
Total number of seconds the internet was monitored: 86,400 sec (24 hours)
Total number of seconds the internet was down: 1,800 sec (30 minutes)
The total time the internet was up is therefore 86,400 - 1,800 = 84,600 sec
As a percentage, downtime is calculated by dividing the downtime by the total monitored time and multiplying the result by 100, i.e.
1,800 / 86,400 = 0.0208, and 0.0208 × 100 = 2.08
Downtime / monitored time * 100 = Downtime %
In other words, the internet downtime was about 2.08%, which corresponds to an uptime of about 97.92%.
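The same arithmetic can be written as a short Python sketch. The function names `downtime_percent` and `uptime_percent` are made up for this example; the numbers are the ones from the router scenario above.

```python
def downtime_percent(downtime_seconds: float, monitored_seconds: float) -> float:
    """Downtime / monitored time * 100."""
    if monitored_seconds <= 0:
        raise ValueError("monitored time must be positive")
    if not 0 <= downtime_seconds <= monitored_seconds:
        raise ValueError("downtime must be between 0 and the monitored time")
    return downtime_seconds / monitored_seconds * 100


def uptime_percent(downtime_seconds: float, monitored_seconds: float) -> float:
    """Uptime is whatever share of the monitored time was not downtime."""
    return 100 - downtime_percent(downtime_seconds, monitored_seconds)


monitored = 24 * 60 * 60  # 86,400 seconds in 24 hours
down = 30 * 60            # 1,800 seconds in 30 minutes

print(f"Downtime: {downtime_percent(down, monitored):.2f}%")  # Downtime: 2.08%
print(f"Uptime: {uptime_percent(down, monitored):.2f}%")      # Uptime: 97.92%
```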
Now that we understand what downtime is, what causes it and how to avoid it, let’s put this into practice and ensure that our services stay available, up and running.