Servers are amazing beasts of technology that keep enterprises running. They can handle complex processes well enough for users to not even think about their existence. When servers fail though, they can be disruptive and create a domino effect that could cost the business thousands or millions of dollars in lost opportunities.
Servers require regular monitoring and maintenance to ensure they don’t fail when they are most needed. They are, after all, computing machines just like laptops and desktops. Maintenance ensures a relatively minor server problem does not mutate into a catastrophic failure. Often, server failure is the result of an easily preventable situation spiraling out of control due to a lack of timely countermeasures.
Developing a checklist of the things you need to monitor or act on regularly can go a long way in ensuring your servers consistently perform at their best. The following entries must be part of any such checklist.
Verify Backups
If for some reason the production systems and data are corrupted or compromised, a failure of backups can cripple the organization. Elaborate disaster recovery plans are useless if the backup data and systems cannot be restored. That’s why verifying that your backups actually work is perhaps the single most important server maintenance task.
Depending on the volume of data the organization generates per day, backups should be tested daily or weekly. For mission-critical systems, routine testing can go as far as running an actual test recovery. If you back up only part of the system rather than all of it, ensure that any newly introduced applications have been risk rated and, if applicable, included in the backup routine.
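A lightweight backup test can be sketched in Python. The example below assumes your backups are tar archives and that you keep a record of SHA-256 checksums for the files that matter most. Both are assumptions made for illustration, and the function name `verify_backup` is hypothetical. It reads each listed file back out of the archive and compares checksums, which is a readability check rather than a full test recovery.

```python
import hashlib
import tarfile


def verify_backup(archive_path, expected_checksums: dict[str, str]) -> list[str]:
    """Return the names of files that are missing from the archive,
    unreadable, or whose SHA-256 checksum differs from the expected one."""
    problems = []
    with tarfile.open(archive_path) as archive:
        for name, expected in expected_checksums.items():
            try:
                member = archive.getmember(name)
            except KeyError:
                problems.append(name)  # file is not in the backup at all
                continue
            handle = archive.extractfile(member)
            if handle is None:
                problems.append(name)  # not a regular file
                continue
            digest = hashlib.sha256()
            # Read in 1 MiB chunks so large files do not exhaust memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
            if digest.hexdigest() != expected:
                problems.append(name)  # contents differ from the record
    return problems
```

An empty list means every listed file could be read back intact. Anything else deserves a closer look before you need that backup for real.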
Monitor Disk Usage
If disk usage is regularly approaching or exceeding 90 percent of overall capacity, you need to either add more disk space or remove superfluous files. Such high disk usage gradually degrades system performance and increases the likelihood of data corruption.
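The 90 percent rule is easy to check with a few lines of Python. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the figure it reports may differ slightly from what tools like `df` show, since some filesystems reserve space for the root user.

```python
import shutil

THRESHOLD_PERCENT = 90.0  # the limit named in the text above


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def disk_needs_attention(path: str = "/", threshold: float = THRESHOLD_PERCENT) -> bool:
    """Return True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return percent_used(usage.total, usage.used) >= threshold
```

Calling `disk_needs_attention("/var")` from a scheduled job, for example, would tell you whether that filesystem has crossed the line.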
Don’t allow your production system to morph into an archival system. Get rid of old emails, logs, installation files and software that is no longer required. The more redundant data and software your servers contain, the higher the risk of security breaches. A smaller data footprint also means quicker recovery in the event of failure.
Monitor RAID Alarms
Production servers ought to use RAID because a RAID array rarely fails outright. That is both good news and bad news: bad because system administrators can grow accustomed to RAID’s reliability and settle for irregular or ad hoc monitoring. This is one of the reasons RAID controllers are programmed to generate warning and alarm messages when certain problems are detected.
System administrators should keep an eye on RAID status and alarms. In a RAID array, one hard disk can fail and go unnoticed if not for the alarms, because production systems are likely to continue working smoothly on the remaining disks in the array.
Administrators must thus confirm that all hard disks are working. Fortunately, the software that comes with RAID controllers allows you to check the status at any given time.
Update the OS
Hackers routinely scan enterprise systems for vulnerabilities to learn which systems will be easiest to penetrate. This is why server OS updates are released fairly frequently. That can make staying current quite difficult, especially if you are wholly reliant on manually installing the updates. For this reason, an automated patch management system is handy.
An automated system will inform you when a new patch is available and can even install updates automatically if you choose. However, if you cannot automate the patching process, develop a manual schedule. Ideally, you should update newer OS versions weekly (because they are likely to have more bugs) and older versions monthly.
Update Control Panel
If you are using a server or hosting control panel, ensure it’s regularly updated. That means both the control panel and the software the panel controls. For instance, with WHM/cPanel, you also have to update PHP to seal any known loopholes. Updating the control panel alone will not fix the issues affecting the underlying PHP and Apache versions used by your OS.
Update Web Applications
Web apps have been the doorway for the majority of large-scale security breaches. Since they are internet-facing, websites and web apps are more prone to cyberattacks. Hackers will identify and exploit vulnerabilities to gain a foothold in the organization’s network. From there, they can install malware and conduct reconnaissance to pick up glaring opportunities they can take advantage of.
This is why you should prioritize web app updates. Where possible, check for new patches daily and apply high-priority updates as soon as practically possible.
Monitor Remote Management Tools
If your servers are co-located or are administered by a dedicated services provider, you want to be certain that remote management tools and apps always work. Since you do not have physical access to the servers, you’ll be left helpless if these tools do not function when you need to respond to an urgent server issue.
Look Out for Hardware Errors
Review audit and event logs to identify any emerging hardware issues. Network failures, disk read errors and overheating notices are potentially early indicators of a looming hardware failure. As long as you are using good quality hardware, failure should be relatively rare.
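Part of this log review can be automated. The sketch below counts log lines that look like disk, network or thermal trouble. The search patterns are illustrative assumptions only, since the exact wording of such messages varies by operating system, driver and vendor, so you would need to adapt them to what your own logs contain.

```python
import re
from collections import Counter

# Illustrative patterns only; real messages differ between systems.
HARDWARE_PATTERNS = {
    "disk": re.compile(r"I/O error|read error|bad sector", re.IGNORECASE),
    "network": re.compile(r"link is down|carrier lost", re.IGNORECASE),
    "thermal": re.compile(r"temperature above threshold|overheat", re.IGNORECASE),
}


def count_hardware_warnings(lines) -> Counter:
    """Count log lines matching each hardware warning category."""
    counts = Counter()
    for line in lines:
        for category, pattern in HARDWARE_PATTERNS.items():
            if pattern.search(line):
                counts[category] += 1
    return counts
```

Because the function accepts any iterable of strings, you can pass it an open log file directly. A count that keeps rising from one review to the next is the early indicator worth acting on.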
Nevertheless, the risk of hardware failure increases with the frequency of instances where the system’s capacity is exceeded. Which brings us to the next item on the checklist.
Check Resource Utilization
Ideally, increases in system utilization should be relatively gradual and predictable, which allows you to plan for expansion well in advance. However, certain organizational changes, including unanticipated business growth, can stretch resources faster than expected.
Review network, server RAM, CPU and disk utilization. If the system is frequently nearing or exceeding its optimal operating limits, you need to start the process of increasing capacity or completely replacing the equipment. You can keep track of utilization using the default tools available on Linux and Windows servers.
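To judge whether a system is "frequently" near its limits, it helps to look at a series of readings rather than a single snapshot. The following sketch assumes you collect utilization samples as percentages. The 80 percent limit and the 25 percent tolerance are placeholder values for illustration, not recommendations, and the function names are hypothetical. The load sampler relies on `os.getloadavg()`, which is available on Unix-like systems but not on Windows.

```python
import os


def current_load_percent() -> float:
    """One-minute load average as a percentage of available CPU cores (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return 100.0 * one_minute / (os.cpu_count() or 1)


def fraction_over_limit(samples: list[float], limit: float) -> float:
    """Return the share of samples at or above the limit (0.0 when there are none)."""
    if not samples:
        return 0.0
    return sum(1 for sample in samples if sample >= limit) / len(samples)


def needs_capacity_review(samples: list[float],
                          limit: float = 80.0,
                          tolerated_fraction: float = 0.25) -> bool:
    """Flag a resource when too many readings sit at or above the limit."""
    return fraction_over_limit(samples, limit) > tolerated_fraction
```

The same two helper functions work for RAM, disk or network readings, as long as each sample is expressed as a percentage of capacity.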
Physically Inspect Equipment
Over time, administrators can get so caught up in monitoring their systems via an application that they forget about the basics of equipment maintenance. Not all system problems will originate from a bug. A look at the physical state of servers, routers, printers and network cables can unearth problems you cannot identify when seated behind a computer screen.
Perhaps one of the network cables is coming loose or the server room is getting congested, making the servers overheat. Inspecting critical infrastructure should be part of an administrator’s morning routine.
Monitor Server Room Humidity and Temperature
Humidity and temperature can affect server and network performance. In the short term, poor conditions slow down performance. In the long term, they shorten equipment service life. For example, running your servers at temperatures above the specified standard can cause regular hangs and occasional data corruption.
Similarly, if the server room has above-optimal humidity, the resulting condensation can cause a short circuit or equipment corrosion. Of course, if you’re just renting a server from the likes of DigitalOcean or HostGator, you won’t be responsible for such equipment or on-site issues.
Run an Antivirus Scan
The primary purpose of antivirus software is to prevent infection. Usually, users will be notified when they try to introduce an infected file. However, every so often, some files will fall through the cracks and not be detected as soon as they are introduced. In addition, virus definitions are regularly updated, so files that were kosher a few weeks earlier may now be flagged as infected.
Schedule a detailed antivirus scan to detect and remove malware. As this can be a strain on system resources, the scan should be scheduled for off-peak hours.
Review User Accounts
If you have recently had client cancellations, staff changes, supplier changes or any other changes that render some user accounts redundant, disable or delete those accounts at the earliest opportunity. Old user accounts are a data, financial and reputation risk.
Persons who are no longer with the organization but still have system access are a potential conduit for corporate espionage. Disgruntled employees forced to leave can maliciously sabotage the system in revenge. In fact, the review of user accounts in mission-critical systems should be done daily. It would take just a few hours of unauthorized access for someone to wreak devastating havoc.
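The core of an account review is a comparison between the accounts that exist and the accounts that should exist. Here is a minimal sketch, assuming you can export a list of currently authorized usernames from HR or a similar source. The function names are hypothetical, and the UID cutoff of 1000 for regular users is a common Linux convention, not a universal rule.

```python
def stale_accounts(system_accounts: set[str], authorized: set[str]) -> list[str]:
    """Return accounts present on the system but absent from the authorized list."""
    return sorted(system_accounts - authorized)


def local_login_accounts(min_uid: int = 1000) -> set[str]:
    """Collect local accounts that appear to belong to people (Unix only)."""
    import pwd  # imported here because the module does not exist on Windows

    return {
        entry.pw_name
        for entry in pwd.getpwall()
        if entry.pw_uid >= min_uid
        and not entry.pw_shell.endswith(("nologin", "false"))
    }
```

For example, `stale_accounts(local_login_accounts(), authorized)` would list the local accounts that nobody on your authorized list owns, ready for you to disable or delete.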
Change Administrator Passwords
System, database and network administrators are privy to the most powerful user accounts in the organization. In the wrong hands, these superuser accounts can be used to facilitate actions with far-reaching negative consequences.
For instance, in a banking environment, a person who illegally obtains a system administrator password could create unauthorized user accounts that they would later use to commit fraud.
Best practice demands that passwords be changed at least every six months. Fortunately, this is something administrators no longer have to enforce manually, since a password expiry requirement can be configured in most systems.
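Where a system cannot enforce expiry for you, a short script can point out the accounts that are overdue. This sketch assumes you can obtain the date each password was last changed. The input dictionary and the function name are hypothetical, and six months is approximated here as 183 days.

```python
from datetime import date, timedelta

MAX_PASSWORD_AGE = timedelta(days=183)  # roughly six months


def overdue_accounts(last_changed: dict[str, date], today: date) -> list[str]:
    """Return accounts whose password is older than the maximum allowed age."""
    return sorted(
        name
        for name, changed_on in last_changed.items()
        if today - changed_on > MAX_PASSWORD_AGE
    )
```

Passing in `today` explicitly, instead of calling `date.today()` inside the function, keeps the check easy to test and to rerun for past dates.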
Check System Security
Always assess the state of server, database and network security by using remote auditing tools. Some IT departments rely too heavily on internal IT auditors under the assumption that identifying system vulnerabilities is an audit function. This can be a costly miscalculation, given that IT auditors may not have the time to carry out system checks frequently.
Administrators should review system security monthly or quarterly depending on the risk rating assigned to each system. Pay particular attention to OS updates, system configuration and other potential risks.
Despite significant advances in redundancy and performance features in modern servers, growing workload and reliability expectations can take a toll on technology. The more robust your maintenance regime, the less downtime you are likely to experience and the longer your technology infrastructure will serve you.
Don’t wait until there’s a catastrophic failure for you to develop and start following a strict routine. Make time for maintenance because the lack of it will require far more resources and time to correct.