Administration Guide for Papio and the Backup Servers

Backup Design Overview

The backup computer must be physically plugged into an ethernet network that is itself connected to the Internet. It will automatically configure itself via DHCP.

The backup computer is a stock Debian Linux computer. Backups are kept on an external USB 2.0 hard drive. Backups are daily rsync snapshots, complete copies, of all of papio's filesystems via a script invoked from /etc/cron.d/rsync_backup. There should be at least 100 of them, probably more, in directories named by timestamp. The oldest backups are removed as space runs out or the limit of 400 backups is reached. The filesystem is ext3. On the backup computer the external hard drive is mounted on /srv/backups/papio/.
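The count-based part of the retention rule can be sketched as follows. This is a minimal illustration, not the script invoked from /etc/cron.d/rsync_backup. The function name purge_oldest is hypothetical, and the sketch assumes that the timestamp directory names sort chronologically and contain no whitespace. It does not handle the "space runs out" condition.

```shell
# purge_oldest DIR MAX
# Remove the oldest snapshot directories directly under DIR until at most
# MAX remain. Assumes (hypothetically) that directory names are timestamps
# that sort chronologically, e.g. 2009-09-09_0625.
purge_oldest() {
    dir=$1
    max=$2
    count=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l)
    excess=$((count - max))
    [ "$excess" -gt 0 ] || return 0
    # List snapshots oldest-first and delete the first $excess of them.
    find "$dir" -mindepth 1 -maxdepth 1 -type d | sort | head -n "$excess" |
    while IFS= read -r snapshot; do
        rm -rf -- "$snapshot"
    done
}
```

With the limit described above, a call might look like purge_oldest /srv/backups/papio 400.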

The expectation is that to restore we would give the external hard drive to Duke technical support staff. We could restore individual files or even database backups over the network, but for anything close to a bare-metal restore we'd expect Duke to have primary responsibility.

Papio backs up the database to disk daily (by /etc/cron.daily/babase_postgres_backup.cron) as standard pg_dump files. The output is in /srv/babase_database/postgres/. To restore the database use the standard postgres restore tools, i.e. pg_restore. See the restore instructions in the backup script.

The VPN Tunnel

Because the backup server may not have a static IP, and to get around NAT issues on the network to which the backup server is connected, the backup server initiates a not-necessarily-encrypted (but definitely authenticated) OpenVPN tunnel to papio. All communication between papio and the backup server is through this tunnel.

Papio initiates all the backups. Papio ensures that the VPN tunnel to the backup server does not forward to the rest of the network. Other than the VPN connection there are no inbound connections to papio.

The OpenVPN tunnel used for backups listens on papio on port 1195, rather than the usual 1194. This is because the VPN used for backups uses certificate authentication whereas the regular VPN does not.

The rsync command uses --hard-links so that no additional storage is allocated for those files that do not change between backups. The ext3 backup partition is created with (mke2fs -i 4096) 4096 bytes per inode to allocate the additional inodes such a scheme requires.
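Why hard links save space can be seen in a small experiment. The sketch below is not the backup script and does not use rsync. It assumes GNU cp (as shipped with Debian), and the function names same_inode and link_snapshot are hypothetical. It shows only that a hard-linked copy of an unchanged file occupies the same inode as the original, so it uses no additional data blocks.

```shell
# same_inode A B
# Succeed when A and B are two names for the same file on disk.
same_inode() {
    a=$(ls -di -- "$1" | awk '{print $1}')
    b=$(ls -di -- "$2" | awk '{print $1}')
    [ "$a" = "$b" ]
}

# link_snapshot PREV NEW
# Create NEW as a directory tree whose files are hard links to the files
# in PREV. Requires GNU cp; PREV and NEW must be on the same filesystem.
link_snapshot() {
    cp -al -- "$1" "$2"
}
```

Each hard-linked name still needs its own directory entry, and each snapshot needs its own directories, which is why such a scheme is hungry for inodes rather than for data blocks.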

File ownership

Backups are stored with the numeric uids and gids used on papio. These may or may not, and likely do not, correspond to the uids and gids on the backup server. Caution is required.

The backup script

The backup script is custom because at the time of this writing the rsnapshot program does not purge based on available partition space, nor is it otherwise oriented around partitions.

Backups of the backup server(s)

The backup server is itself backed up to papio using the same methods. Again, all connections are initiated by papio.

Administrative Tasks

There are two administrative tasks to be performed daily.

The administrator of the backup computer receives daily emails reporting the status of the daily security updates. The first task of the administrator is to monitor these emails. Should the email report errors, an extremely unlikely occurrence, a Linux administrator should be called in to examine the situation. Should the email report that manual intervention is required to install a security update, the administrator should follow the procedure below to install the security update and to reboot the backup computer.

Should a daily backup, or some other automated operation, fail, the administrator will receive an email reporting this problem. The second task of the administrator is to refer these failures to a Linux administrator for resolution.

Aside from the backup itself, the only automated process specific to the backup system is a daily check that a minimum number of backups exist. Currently this minimum is set to 30. Mail is sent to the administrator for possible action should the number of existent backups fall below the limit. This number, as well as the maximum number of backups to keep, is set in /etc/cron.d/purge_backups.
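A check of this kind can be sketched in a few lines. This is an illustration only, not the contents of /etc/cron.d/purge_backups. The function name check_min_backups is hypothetical, and the sketch assumes each backup is a directory directly under the backup mount point. It relies on the standard cron behaviour of mailing any output a job produces.

```shell
# check_min_backups DIR MIN
# Print a warning and return non-zero if fewer than MIN snapshot
# directories exist directly under DIR. Prints nothing otherwise, so a
# cron job running it sends mail only when there is a problem.
check_min_backups() {
    dir=$1
    min=$2
    count=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l)
    count=$((count + 0))    # normalize any whitespace from wc
    if [ "$count" -lt "$min" ]; then
        echo "WARNING: only $count backups in $dir (minimum is $min)"
        return 1
    fi
    return 0
}
```

With the current setting this would be called as check_min_backups /srv/backups/papio 30.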

Connecting to the backup computer

Logins (usernames and passwords) are required to connect to the backup computer. They are handed out by Karl, or whoever has the root password.

Those with physical access can plug in a keyboard and screen, and even a mouse if a GUI is desired.

Connections may be made to the backup computer using ssh from the physical network (LAN) to which the backup computer is plugged in. The backup computer obtains its IP address via DHCP. If connections via the LAN are to be made it is up to the LAN administrator to keep track of the IP address assigned to the backup computer.

The presumption is that NATting will prevent arbitrary hosts on the Internet from connecting to the backup server. If this is not the case it is up to the network administrator of the backup server's network to firewall the backup server to prevent ssh connections from the Internet.

Connections from the Internet to the backup computer are made over ssh, via putty or some other ssh client. The user must first connect to papio and then connect over the VPN to the backup computer. This approach serves two purposes: it bypasses NATting and inbound connection firewalling; and it renders moot the occasional random changes most consumer ISPs make to assigned IP addresses which, in turn, randomly change the "location" of the backup server on the Internet.

Once logged in to papio the command to connect to the backup server (where username is your assigned login name) is:

ssh -l username backup-server1

If the username on the backup server is the same as the username used on papio the command may be shortened to:

ssh backup-server1

Disconnecting from the backup computer

Those who use ssh to connect to the backup computer can disconnect by typing:

exit

This command will have to be typed twice if the su - command is in effect, once to exit from the su command and cease being root and again to exit as a normal user.

The exit command can often be typed rapidly enough after the machine has been told to halt or reboot that the session will end before the backup computer finishes its shutdown sequence. Typing exit in these circumstances has the advantage of taking effect immediately. It may otherwise take some time for the ssh session to realize that the backup computer has stopped.

Daily messages

There are 3 sorts of daily messages.

If no message at all is received this is a sign of trouble and should be investigated.

Nothing done

The typical daily message will indicate that nothing happened.

Subject:        Cron <root@foo> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
Date:   09/09/2009 06:25:33 AM
From:   Cron Daemon <root@foo.example.com>
To:     root@foo.example.com

/etc/cron.daily/security_updates:
No security updates to apply.

The administrator need do nothing upon receiving such a message.

Security updates performed

On occasion the system will automatically update itself. This example shows 2 packages being updated.

Subject:        Cron <root@foo> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
Date:   09/09/2009 06:25:33 AM
From:   Cron Daemon <root@foo.example.com>
To:     root@foo.example.com

/etc/cron.daily/security_updates:
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
Reading extended state information...
Initializing package states...
Reading task descriptions...
The following packages will be upgraded:
  libmysqlclient15off mysql-common
2 packages upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 1920kB of archives. After unpacking 0B will be used.
Writing extended state information...
(Reading database ... 40166 files and directories currently installed.)
Preparing to replace mysql-common 5.0.51a-24+lenny1 (using .../mysql-common_5.0.51a-24+lenny2_all.deb) ...
Unpacking replacement mysql-common ...
Preparing to replace libmysqlclient15off 5.0.51a-24+lenny1 (using .../libmysqlclient15off_5.0.51a-24+lenny2_i386.deb) ...
Unpacking replacement libmysqlclient15off ...
Setting up mysql-common (5.0.51a-24+lenny2) ...
Setting up libmysqlclient15off (5.0.51a-24+lenny2) ...
Reading package lists...
Building dependency tree...
Reading state information...
Reading extended state information...
Initializing package states...
Reading task descriptions...

Note that the important part is the bottom where, unlike the next example, there is no indication that the administrator need manually intervene. There are times when the system automatically installs some updates but others require manual installation.

The administrator should scan such emails for words like "error" or "fail" in the unlikely event that something failed during the automatic security update.
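If the message has been saved to a file, the scan can be done with grep. This is a minimal sketch; the function name scan_update_mail is hypothetical, and the two search words are just the ones suggested above, not an exhaustive list of failure indicators.

```shell
# scan_update_mail FILE
# Print, with line numbers, every line of a saved message that mentions
# "error" or "fail" in any letter case. Returns 0 if at least one such
# line was found and non-zero otherwise.
scan_update_mail() {
    grep -n -i -E 'error|fail' -- "$1"
}
```

No output means neither word appears in the message. Any line that is printed still needs to be read and judged by a person.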

Manual intervention required

Less frequently the administrator must manually make a security update and reboot the backup computer. Messages like the one that follows indicate this.

Subject:        Cron <root@foo> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
Date:   09/09/2009 06:25:33 AM
From:   Cron Daemon <root@foo.example.com>
To:     root@foo.example.com

/etc/cron.daily/security_updates:
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
Reading extended state information...
Initializing package states...
Reading task descriptions...
No packages will be installed, upgraded, or removed.
0 packages upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Need to get 0B of archives. After unpacking 0B will be used.
Reading package lists...
Building dependency tree...
Reading state information...
Reading extended state information...
Initializing package states...
Reading task descriptions...

Security (and other) updates that have been held back:
ihA linux-image-2.6.26-2-486        - Linux 2.6.26 image on x86

Note that the message may contain multiple lines that read: error: out of partition. This error is spurious and can be safely ignored so long as a line beginning with run-parts: executing /etc/kernel/postinst.d/zz-update-grub appears somewhere in the preceding text.

The important parts here are the last 2 lines. If the message contains the line:

Security (and other) updates that have been held back:

Then manual intervention is required.

The second line

ihA linux-image-2.6.26-2-486        - Linux 2.6.26 image on x86

reports which package must be manually upgraded. In this case the package is linux-image-2.6.26-2-486.

Before beginning note which package(s) must be manually updated.
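For a message saved to a file, the held-back package names can be pulled out mechanically. The sketch below assumes the layout shown in the example above: a heading line followed by one line per package, with the package name in the second column. The function name held_packages is hypothetical.

```shell
# held_packages FILE
# Print the name of each package listed after the "held back" heading.
# Stops collecting at the first blank line following the list.
held_packages() {
    awk '
        /^Security \(and other\) updates that have been held back:/ { found = 1; next }
        found && NF == 0 { found = 0 }
        found && NF >= 2 { print $2 }
    ' "$1"
}
```

If the function prints nothing, the heading was not present and no manual intervention is indicated by that message.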

Initial reboot of the backup computer

Although it is not required it is good practice to reboot the backup server before a manual upgrade. A reboot will trigger an automatic check of the disk content, if one is needed. Should a problem with the disk content ever be discovered it may require physical interaction with the backup computer's keyboard to fix. Rebooting before a manual upgrade allows for more accurate remote problem diagnosis should there be a problem bringing the backup computer back on-line after a manual update.

Note that should a reboot trigger a check of disk content (a "fsck") the check can take many hours to complete and the computer will not return to service until the check has finished.

To reboot, ssh into the backup server and issue the following sequence of commands. (Explanatory remarks beginning with ;# follow the commands.)

su -                                             ;# You will be prompted for the root password.
reboot                                           ;# Restart the computer, checking the disk if necessary.

Manually update packages

After identifying the package(s) which must be manually updated, ssh in to the backup server to perform the update and, again, reboot the computer. The following sequence of commands will update, e.g., the linux-image-2.6.26-2-486 package. If more than one package must be manually updated then unhold/hold all of the packages before/after doing aptitude safe-upgrade. (You may issue multiple aptitude unhold ... aptitude hold ... commands or simply list all the packages, separated with spaces, to be unheld or held on the command line following the unhold/hold keyword.) (Explanatory remarks beginning with ;# follow the commands.)

su -                                             ;# You will be prompted for the root password.
aptitude update                                  ;# These next two commands perform
aptitude safe-upgrade                            ;# non-security related updates.
aptitude unhold linux-image-2.6.26-2-486         ;# This begins the update of the desired package.
aptitude safe-upgrade                            ;# If prompted whether you really wish to replace
                                                 ;# currently running kernel, answer affirmatively.
aptitude hold linux-image-2.6.26-2-486           ;# Force manual updates in the future.
aptitude search linux-image-2.6.26-2-486         ;# Check the last step worked.
                                                 ;# The first 2 letters should be 'ih'.
reboot                                           ;# Restart the computer with the new kernel.

As the backup computer reboots your ssh session will be terminated and you will return to your login (ssh) session on papio. Check to ensure that the backup computer successfully reboots with the following command:

ping backup-server1

You know the backup server is responding when replies like the following are received:

64 bytes from foo.example.com (192.168.199.2): icmp_seq=1 ttl=254 time=1.59 ms

Terminate the ping command by holding down the Control key (usually labeled "Ctrl") and, while Ctrl is held down, pressing the "c" key.

Note that on occasion, typically at least every 120 days, the backup server will perform file system checks as part of the reboot. This can take quite some time, possibly hours, and the system will not respond to pings until it has finished. If the backup system does not come back up after reboot contact a Linux administrator for assistance.
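Rather than watching ping by hand during a long file system check, the wait can be scripted from papio. This is only a sketch: the function name wait_until_up, the PROBE override, and the default of 60 attempts 30 seconds apart are all assumptions chosen for illustration.

```shell
# wait_until_up HOST [TRIES] [DELAY]
# Probe HOST up to TRIES times, DELAY seconds apart, and succeed on the
# first reply. By default a single ping is sent per attempt; set PROBE to
# substitute another command (useful for testing).
wait_until_up() {
    host=$1
    tries=${2:-60}
    delay=${3:-30}
    probe=${PROBE:-ping -c 1}
    n=0
    while [ "$n" -lt "$tries" ]; do
        # $probe is deliberately unquoted so its options split into words.
        if $probe "$host" >/dev/null 2>&1; then
            echo "$host is responding"
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    echo "$host did not respond after $tries attempts"
    return 1
}
```

For example, wait_until_up backup-server1 480 30 would keep trying for about four hours before giving up.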

After verification that the backup server is operational type:

exit

to end your ssh session with papio.

Note: At this time the only packages that require manual intervention are new versions of the kernel. This list should probably be expanded to include certain critical libraries like glibc.

Shutting down the computer

There are times when the computer must be disconnected from power. If the backup computer is physically powered off without first being shut down, data loss is possible, although it is unlikely that anything significant will be lost unless a backup is running at the time of power loss, and even then it is unlikely that database content will be lost.

To avoid any possibility of data loss shut down the computer manually before removing power. The easy way to do this is to press the power button on the front and wait for all the lights to go off. The computer will detect the button press and shut down cleanly.

An alternate approach, which may be necessary with older software, is to first connect to the backup computer as described above. Then type:

su -                                             ;# You will be prompted for the root password.
halt

After power is physically restored the power button on the front of the computer must be pressed to restart the machine.

Manual Software Updates to Papio

Notifications about needed updates

If you've been granted access, you should receive daily lengthy emails from Papio that describe the various attempts to access the server in the previous 24 hours. You should check these daily for any errors or any extremes in access attempts. If you have any questions about these, contact Karl Pinc about what to do next.

Less regularly, emails should come, detailing any programs on Papio that have available updates. These emails are much shorter, and might look something like this:

From: root@foo.example.com
Date: 28 January 2016 at 03:45:16 GMT-5
To: root@foo.example.com
Subject: Anacron job 'cron.daily' on example.com

bind-libs.x86_64              32:9.8.2-0.37.rc1.el6_7.6              sl-security
bind-utils.x86_64             32:9.8.2-0.37.rc1.el6_7.6              sl-security

In this case, bind-libs.x86_64 and bind-utils.x86_64 need to be updated.
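If you save such a message to a file, the package names can be listed with a one-line awk filter. The sketch assumes the three-column "name, version, repository" layout shown in the example; the function name pending_packages is hypothetical, and any other line that happens to have exactly three fields would be picked up too.

```shell
# pending_packages FILE
# Print the first column (the package name) of every line that has
# exactly three whitespace-separated fields.
pending_packages() {
    awk 'NF == 3 { print $1 }' "$1"
}
```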

Performing updates

To update the above packages, you'll first need to SSH into Papio, as above. Next, you'll need to become root, using sudo:

sudo su -

You will be prompted for your password. Use the same password that you use to log in to Papio.

Once logged in, get all available updates by entering:

yum update

After a moment, you should see a list of available updates, and you'll be asked if you want to go ahead and implement the updates. You do want to update, so go ahead and say yes.

Updates requiring a reboot

After some updates, the VM should be rebooted. Yum won't tell you this; you just need to know. This includes updates to:

For packages like these, after yum finishes doing its updates you should enter:

systemctl reboot

The VM will disconnect you as it shuts itself down. Wait a few minutes, and then SSH back in to Papio (or visit/refresh any web page hosted on Papio, like this one!) to confirm that the VM is back up. If after several minutes you continue to get timeout or "not found" errors, contact Duke's OIT to see what's wrong.

You should coordinate these updates and reboots with the other Papio users. When you get an email indicating that an update like this is available, send an announcement (see BabaseMailingLists). People don't need all the details, just something like:

Papio is in need of some software updates that will require a restart.  I'd like to do this on [DATE], at [Time, including time zone].  The database will be unavailable for ~5-10 minutes.  Please let me know if you have any concerns or issues with this timeframe.

To minimize disruption to other users, you should ideally pick a low-traffic time, like the late afternoon/early evening, or early in the morning.

Restarting services

The day after an update, you may get an email that looks something like this:

Date: Wed, 19 Jun 2024 06:58:15 -0400 (EDT)
From: root <root@example.com>
To: needrestart-email-recipients@example.com
Subject: needrestart report from example.com


Services needing restart with `systemctl restart $SERVICENAME`:
user@1234567.service

If this happens, you probably should have rebooted after yesterday's updates. You _could_ do that now. But let's talk about an alternative.

To get more info about the service, you can use the "id" command:

$ id 1234567

The result will show you a uid, a gid, and groups. If it's a named user, then perhaps you should just go contact that person and tell them to disconnect/reconnect. Most likely, the uid will have a name that indicates that this is related to the backup servers. Specifically, it's the SSH commands that they use to set up the tunnels through which backups are done and email is sent.
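The number passed to id is simply the part of the service name between "user@" and ".service". A small helper can extract it; the function name unit_uid is hypothetical and the sketch assumes the unit name has the form shown in the email above.

```shell
# unit_uid UNIT
# Extract the numeric uid from a unit name such as user@1234567.service.
# The ".service" suffix is optional.
unit_uid() {
    unit=${1#user@}
    printf '%s\n' "${unit%.service}"
}

# Example (not run here): id "$(unit_uid user@1234567.service)"
```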

Do what your email said to do: (as a superuser)

$ systemctl restart user@1234567

This will likely fail, but it does remove the systemd-related process that provoked the email. The actual SSH tunnels are left untouched. If they fail for some reason, the backup server(s) will re-establish them.

The impact of the failure to restart is unclear, but everything seems to work okay after this.

But probably the best thing to do would be to just reboot the server when you do a major update. Then you avoid this mess entirely.

Finishing up

After updating is complete, exit sudo simply by typing "exit" in the terminal. You can type "exit" again to log out of Papio. If you rebooted Papio after updating something, it will do all the exiting for you.

SysAdmin (last edited 2024-06-20 14:53:33 by JakeGordon)

Wiki content based upon work supported by the National Science Foundation under Grant Nos. 0323553 and 0323596. Any opinions, findings, conclusions or recommendations expressed in this material are those of the wiki contributor(s) and do not necessarily reflect the views of the National Science Foundation.