Monitoring

Regular Service Maintenance Checklist

Weekly, High Frequency

Short interval reviews exist for the following.

What | With | When
Server Online State
With: daily log, direct connections, groklog, ping, cacti. When: twice weekly.
Service Online State.

Verify that all specified services are online and operational.

With: daily log, direct connections, groklog, cacti. When: twice weekly.
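As a minimal sketch of the "direct connections" check, the following tries a TCP connection to each service and records whether it succeeded. The service names, hosts, and ports passed in are hypothetical; a successful connection only shows that something is listening, not that the service is behaving correctly.

```python
import socket


def service_is_online(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(services, timeout=3.0):
    """Map each (name, host, port) entry to its online state."""
    return {
        name: service_is_online(host, port, timeout)
        for name, host, port in services
    }
```

A call such as `check_services([("smtp", "mail.example.com", 25)])` returns a dictionary that can be appended to the daily log.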
Service Performance.

Verify that services are performing within the guidelines set for the installation.

Do not make performance changes that promise little projected improvement.

Review, and where possible correct, issues that block or degrade performance (e.g. delays caused by DNS queries).

With: daily log, cacti. When: twice weekly.
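The DNS example above can be checked with a small timing helper. This is a sketch only: the 0.5 second limit is an assumed threshold, not a figure from this checklist, and the `resolver` parameter exists so the lookup can be swapped out for testing. Results may be affected by local resolver caching.

```python
import socket
import time


def dns_lookup_seconds(hostname, resolver=socket.getaddrinfo):
    """Return how long one lookup took in seconds, or None if it failed."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
    except socket.gaierror:
        return None
    return time.perf_counter() - start


def slow_lookups(hostnames, limit_seconds=0.5, resolver=socket.getaddrinfo):
    """Return {hostname: seconds or None} for slow or failed lookups."""
    report = {}
    for hostname in hostnames:
        elapsed = dns_lookup_seconds(hostname, resolver)
        if elapsed is None or elapsed > limit_seconds:
            report[hostname] = elapsed
    return report
```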
Service Performance.

Consider a formal review of new controls and modification changes. Where appropriate, document an RFC.

When: review half-yearly.
Backup State.

Review restorability and security aspects of the backup states.

With: daily log. When: twice weekly.
Resource Utilisation - Disk

Disk usage can have a significant impact on performance and is monitored regularly.

With: daily log, cacti. When: twice weekly.
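A disk check of this kind can be sketched with the standard library. The 80% and 90% thresholds are assumptions chosen for illustration; the real boundaries would come from the policy agreed for each host.

```python
import shutil


def disk_use_percent(path="/"):
    """Return the percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_status(percent_used, warn_at=80.0, critical_at=90.0):
    """Classify a usage percentage against assumed warning and critical bars."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warn"
    return "ok"
```

For example, `disk_status(disk_use_percent("/var"))` gives a one-word state suitable for the daily log.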
Resource Utilisation - Other

Other system resources monitored for abnormal changes include CPU use patterns and RAM utilisation.

With: daily log, cacti, groklog. When: twice weekly.
Mail Queue

Review unforeseen, sustained growth in the incoming and outgoing mail queues.

With: daily log, cacti. When: twice weekly.
Web Proxy

Review proxy logs for noticeable performance issues.

With: daily log, cacti. When: twice weekly.

Monthly

Analyses and reviews that require longer data collection periods (typically a month) include the following.

What | With | When
Firewall Reports

Review a specific section of the network's firewall logs, seeking insights into security and performance.

With: groklog, eyeballs. When: monthly.
VPN Report

Use and capacity performance report.

With: sawmill, webalizer(?). When: bi-monthly.
Web Proxy Report

Use and capacity performance report.

When: bi-monthly.

Quarterly

What | With | When
Firewall Document

Report on the current firewall deployment and potential impact of network change proposals.

With: groklog, eyeballs. When: semi-annually.

Annually

What | With | When
Authentication Keys

Distribute new public SSH keys for all managed servers. Ensure managed hosts get new public keys as a security measure.

With: daily log. When: annually.
Host Build Test Cycle

Each host will be given a full pre-release security audit and review.

With: eyeball. When: on commission.
Performance Assessment

Should the client concur, a review can be made assessing the capacity of the existing host to achieve the required performance for the next 12 months.

When: on commission.

A checklist is useful for ensuring a minimal level of consistency. Of course, we accept that checklists do not in themselves ensure a quality service; that still rests in the hands of the code monkey or admin monkey carrying the checklist around.

In the absence of a good automated system, the more mundane manual process still needs to be completed.

What activities do we want to perform, and why is that beneficial for us and our clients?

Below is a quick list of activities that should or could be reviewed, with an argument (for or against) for their value to clients and Nullcube. It is otherwise known as the feature checklist of what would be nice in a scalable monitoring system.

The suggestion to use Control, Histogram, and Pareto Charts allows us to discuss Policy Procedures. The nature of the charts gives specific data points that can be connected to specific Policy or Flag items, enabling both Nullcube and clients to pre-allocate behaviour and resources.

We all benefit from a visual indicator that can become a common “language” between different skill levels and orientation.

All Hosts

For all managed hosts, the following are points of interest to regularly monitor.

Item | Monitor | Purpose | Audit tools
Uptime | Control Chart

Set boundaries for median and maximum uptime.

The graph should be linear, and the interesting regions lie below a pre-determined minimum control bar and above a maximum control bar. Where the system spends too much time below the minimum uptime (e.g. with a minimum uptime of 2 days, a machine that stays below that bar for more than a week), this should flag a review of the installation. There should also be a control for the maximum number of days live; above it we should begin to worry whether the system can survive a restart when one is forced.

Clients may neglect, or may themselves be unaware of, power cycling of servers caused by issues such as short-term power failures on site.

Benefits - see above reference to Policy and Behaviour

Audit tools: groklog, daily output
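The two uptime control bars can be sketched as a simple rule over daily uptime samples. The 2 day minimum and the "more than a week" rule come from the text above; the 180 day maximum and the flag names are hypothetical, and the sketch assumes days below the bar are counted across the review window rather than required to be consecutive.

```python
def review_uptime(daily_uptime_days, min_days=2.0, max_days=180.0,
                  max_days_below=7):
    """Return a list of flags for one host's daily uptime samples (in days)."""
    flags = []
    # Count the samples that sit below the minimum control bar.
    days_below = sum(1 for uptime in daily_uptime_days if uptime < min_days)
    if days_below > max_days_below:
        flags.append("review-installation")
    # A very long uptime raises the question of whether a restart is survivable.
    if daily_uptime_days and max(daily_uptime_days) > max_days:
        flags.append("restart-risk")
    return flags
```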
Resource Utilisation | Control Charts

Set boundaries for median and maximum disk use.

The maximum value is critical, but we also need to know if there is a pattern of use that is systematically driving use towards the control borders.

Item | Description | Tool
Disk | Load, expansion |
RAM | Load, expansion | Cacti
CPU | Load, expansion | Cacti

Audit tools: groklog
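One way to spot a pattern that is driving use towards a control border is to fit a straight line through recent samples and project when it reaches the limit. This is a sketch under the assumption of evenly spaced samples (for example one per day) and roughly linear growth; it is not a substitute for reading the chart.

```python
def days_until_limit(samples, limit):
    """Project how many sample intervals remain before a linear trend hits limit.

    Returns 0.0 if the latest sample is already at or over the limit, and
    None if there are too few samples or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    if samples[-1] >= limit:
        return 0.0
    # Ordinary least-squares fit of y = intercept + slope * x, x = 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_x = (limit - intercept) / slope
    return max(0.0, crossing_x - (n - 1))
```

With daily disk use of 60, 62, 64, 66 and 68 percent and a limit of 90 percent, the projection is 11 more days.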
Network Links | Control Chart

Set boundaries for median and maximum state behaviour of the link.

Benefits - see above reference to Policy and Behaviour

Audit tools: groklog, netstat
Changes to Configuration Files | Change List

Track changes to configuration files such as /etc/pf.conf, /etc/samba/smb.conf, and /etc/samba/arp.allowed.

/etc/pf.conf | Firewall rules
/etc/rc, /etc/rc.local, /etc/rc.conf.local, /etc/login.conf, root's cron | Startup changes
/etc/mail/*, /etc/samba/*, /etc/squid/* | Sendmail, Samba and Squid configuration files
/home/*/.ssh, /home/*/.bash_profile, /home/*/.profile

Audit tools: groklog
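A change list for these files can be sketched by hashing each file and comparing against the previous run. This is an illustration using Python's standard library, not a description of how groklog works; where the previous snapshot is stored is left to the caller.

```python
import glob
import hashlib


def snapshot(patterns):
    """Return {path: sha256 hex digest} for every readable file matching patterns."""
    digests = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path, "rb") as handle:
                    digests[path] = hashlib.sha256(handle.read()).hexdigest()
            except OSError:
                # Directories and unreadable files are skipped in this sketch.
                continue
    return digests


def change_list(previous, current):
    """Compare two snapshots and list added, removed and modified paths."""
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "modified": sorted(
            path for path in current
            if path in previous and current[path] != previous[path]
        ),
    }
```

A run might call `snapshot(["/etc/pf.conf", "/etc/samba/*"])` and pass the result, with the snapshot saved from the previous run, to `change_list`.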
Specific Services

For special services on hosts, below are the beginnings of a list of issues to monitor.

Service | Item | Monitor | Purpose | Audit Tools
Firewall | Traffic | Control, Histogram, and Pareto Chart

Control Charts can be used for visualising overall traffic patterns as well as behavioural changes for different types of traffic.

Histogram Chart:

Pareto Chart: Visually highlight volume differences in types of traffic.

Benefits - see above reference to Policy and Behaviour

Audit tools: groklog
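The data behind a Pareto chart is a list of categories sorted by volume with a running cumulative percentage. A minimal sketch follows; the traffic categories and counts in the usage note are made up for illustration.

```python
def pareto_table(counts):
    """Return rows of (category, count, cumulative percent), largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    rows = []
    running = 0
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for category, count in ordered:
        running += count
        rows.append((category, count, round(100.0 * running / total, 1)))
    return rows
```

For counts of `{"http": 50, "smtp": 30, "ssh": 15, "dns": 5}` the cumulative column reads 50.0, 80.0, 95.0, 100.0, which shows at a glance that two traffic types account for 80% of the volume.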
Mail Server | Mail Queue | Control, Histogram Chart

Set boundaries for median and maximum disk use.

Benefits - see above reference to Policy and Behaviour

Audit tools: groklog
Mail Server | Traffic | Control, Histogram Chart

There are various issues with mail traffic that should be of value to ourselves and to clients. Some of these, if charted, would significantly improve our ability to react.

Item | Description
Activity | End-user activity. A histogram highlighting the extremes of user activity. Heavy users provide a pattern of behaviour. We anticipate that the interest will be mostly in the extremes. Someone sending out 10GB of email should raise some sort of flag somewhere. A sudden major increase in use should also raise a flag.
Denied/Failed | Denied and failed send-to accounts could imply a user error or some sort of software misconfiguration. A very large number of denials or failures may be a potential security data point.

If the user has failed to adjust their behaviour in response to the mail server error messages, then we may need to look at other means of resolving the problem.
Denied/Failed | Analysis of high incoming denied/failed counts will give us a better lead towards DoS and spam.

Audit tools: groklog
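The two activity flags described above (a very large sender, and a sudden major increase) can be sketched as follows. The 10 GiB volume limit mirrors the 10GB example in the text; the growth factor of 5 and the reason names are assumptions, and the per-user byte counts are presumed to have been extracted from the mail logs already.

```python
GIB = 1024 ** 3


def flag_senders(bytes_sent, previous_bytes_sent=None,
                 volume_limit=10 * GIB, growth_factor=5.0):
    """Return {user: [reasons]} for users whose sending looks extreme."""
    previous = previous_bytes_sent or {}
    flags = {}
    for user, sent in bytes_sent.items():
        reasons = []
        if sent >= volume_limit:
            reasons.append("volume")
        before = previous.get(user, 0)
        # Only compare against a previous period in which the user sent mail.
        if before > 0 and sent >= growth_factor * before:
            reasons.append("increase")
        if reasons:
            flags[user] = reasons
    return flags
```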

Web Proxy | Traffic | Control, Histogram Chart

Item | Description
Activity | End-user activity. A histogram highlighting the extremes of user activity. Heavy users provide a pattern of behaviour. We anticipate that the interest will be mostly in the extremes. Someone sending out 10GB of email should raise some sort of flag somewhere. A sudden major increase in use should also raise a flag.
Denied/Failed | Denied and failed send-to accounts could imply a user error or some sort of software misconfiguration. A very large number of denials or failures may be a potential security data point.

If the user has failed to adjust their behaviour in response to the mail server error messages, then we may need to look at other means of resolving the problem.
Denied/Failed | Analysis of high incoming denied/failed counts will give us a better lead towards DoS and spam.

Benefits - see above reference to Policy and Behaviour

Audit tools: groklog