Keeping two very large repositories in sync can be a challenge, especially when you're dealing with residential-grade ADSL connections at one or both ends. rsync can more or less do this out of the box, but in practice it needs a wrapper to make it workable. There are some hints on various sites on how to wrap rsync, but they were somewhat incomplete, so I wrote this bash script to make it a smidge easier.

To be clear, what we're doing here is using the output of rsync to produce a list of differences (files, directories) between two large directory structures, and then copying those differences to a third location: an out-of-band structure, typically a USB drive, that can be physically transported to the remote site and rsync'd into place.

Never underestimate the bandwidth of a station-wagon full of tapes, etc etc.

Architecture assumptions

There are some assumptions inherent in my script. I'll explain my network, so you can adjust to suit yours.

This is almost exclusively a uni-directional operation

My local repository lives on the server py-hub-01, and I'm replicating to a remote box called en-hub-01

It's always run from my primary desktop: royksopp.

I'm at the source end of the operation. py-hub-01 is on my LAN.

The destination is accessed via SSH over the WAN (a VPN, but that's mostly irrelevant).

Deltas between source and destination are usually in the range 10-200GB, but that depends on how frequently I run the script, of course. The biggest constraint will be the size of your intermediate USB stick / drive.

Full repository size is currently around 6TB, and at the remote end this is a VM talking to relatively slow 8TB Archive drives -- even so, an rsync comparison operation usually takes under 10 minutes.

The target for the delta (in my case a USB drive) MUST be mounted, and the user you're running this as needs write access to it.

As an aside, I use Unison-GTK to synchronise most of my non-git data repositories -- it works well for a bunch of stuff. I run it periodically against this repository between the two sites. When I first ran it, the archives were 5TB, and even though I had this destination server physically on my LAN at the time, that first run of unison took the thick end of a day to complete. I shit thee nay.

Anyway, unison is fantastic for keeping large repositories in sync, and if I've got a stack of small files that need to be updated, I'll happily sync them over the WAN directly. It won't break, or be broken by, alternating use of this script.

Code

I'll run through the interesting bits in the next section.

It's not complex, by any stretch, but there's some stuff you're going to need to change.

#!/bin/bash

# jedd
# 2016-05

# Start to automate the sync of py-hub-01 -> en-hub-01 by 
# generating a copy of the delta set of dirs + files, and
# saving to a local usb disk / flash drive / whatever.

# @TODO Don't trust that new directories turn up in rsync output with a
#       trailing /. This seems to sometimes break with changes of files
#       within a sub-dir.

# @TODO Allow splitting the delta over multiple devices - say a couple of
#       USB thumbdrives.   May be attached sequentially, or concurrently,
#       but will also need to track capacity consumption during copy.


if [ $# -ne 1 ]; then
  echo "   Usage:   ${0} /path/to/external/drive"
  echo
  echo "   Makes a copy of the delta between py-hub-01 and en-hub-01 onto"
  echo "   an external drive for subsequent high-way speed transfer."
  echo
  echo "   Make sure the external drive is mounted, and has capacity."
  echo
  exit 1
fi

# Expansions are quoted so an empty value or a path with spaces can't
# break the tests, and failures exit with a non-zero status.
if [ "${HOSTNAME}" != "royksopp" ]; then
  echo "You need to be on royksopp to find this one useful."
  exit 1
fi

if [ "${USER}" != "jedd" ]; then
  echo "You need to be logged in as jedd to find this one useful."
  exit 1
fi

if [ ! -d "${1}" ]; then
  echo "Intermediate drive / directory doesn't exist."
  exit 1
fi

DEBUG=true

local_folder="/pub"
remote_folder="en-hub-01:/hub/pub"

intermediate_folder=${1}

rsync -rv --size-only --dry-run                                        \
      --exclude="/lost+found" --exclude="du*.txt" --exclude="/distros"  \
      --out-format="%n"  $local_folder/  $remote_folder/ |              \
while IFS= read -r x    # IFS= and -r keep leading spaces and backslashes in names
  do
    if [ "${x}" = "sending incremental file list" ]
      then
      continue
    fi

    if [ "${x}" = "" ]
      then
      continue
    fi

    if [ "${x:0:5}" = "sent " ]
      then
      continue
    fi

    if [ "${x:0:11}" = "total size " ]
      then
      continue
    fi

    # Compare against the literal string: [ $DEBUG ] would also be true
    # for DEBUG=false, because any non-empty string passes that test.
    if [ "$DEBUG" = true ]; then
      echo DEBUG: PROCESSING:  "${x}"
    fi
    DIR=$(dirname "${x}")
    if [ "$DEBUG" = true ]; then
      echo DEBUG: DIRNAME: "${DIR}"
    fi
    # A trailing slash marks a directory: recreate it on the intermediate drive.
    # if [ "${x: -1}" = "/" ]  &&  [ ! -d "${DIR}" ]; then
    if [ "${x: -1}" = "/" ];   then
      if [ "$DEBUG" = true ]; then
        echo DEBUG TRAILING SLASH: mkdir -p \"$intermediate_folder/${x}\"
      fi
      mkdir -p "$intermediate_folder/${x}"
    else
      # Fallback: make sure the parent directory exists before copying a file.
      if [ ! -d "$intermediate_folder/${DIR}/" ]; then
        if [ "$DEBUG" = true ]; then
          echo DEBUG FAILBACK CHECK: mkdir -p \"$intermediate_folder/${DIR}\"
        fi
        mkdir -p "$intermediate_folder/${DIR}"
      fi
      if [ -f "$local_folder/$x" ]; then
        if [ "$DEBUG" = true ]; then
          echo DEBUG:  cp -fuv "$local_folder/$x" "$intermediate_folder/$x"
        fi
        cp -fuv "$local_folder/$x" "$intermediate_folder/$x"
      fi
    fi
  echo
done

Notes on the script

The first section is just sanity checks -- make sure the user passes a parameter, that we're running on the right host and as the right user, and that the parameter that's been passed is an actual directory.

You could confirm that the parameter is a mount point -- but I typically store the updates in a sub-directory on my removable disk, rather than at the top level, so that's not a useful test for me.
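If you did want a guard here that still allows a sub-directory, one option is to ask df which filesystem the directory lives on, and refuse to carry on if the answer is the root filesystem. This is only a sketch: the function names are made up for illustration, and the parsing assumes the mount point has no spaces in it.

```shell
#!/bin/bash

# Print the mount point of the filesystem that holds the given path.
# df -P prints a header line, then one line per filesystem, with the
# mount point in the sixth column.
mount_point_of() {
  df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Succeed only if the path is a directory that is NOT on the root
# filesystem, i.e. something else is mounted at or above it.
not_on_root_fs() {
  [ -d "$1" ] && [ "$(mount_point_of "$1")" != "/" ]
}
```

In the script, this could sit next to the existing directory test, e.g. not_on_root_fs "${1}" || exit 1.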

DEBUG is a variable I use to enable/disable verbose output -- you can almost definitely set that to false once you've started using and, ultimately, trusting this script.
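One thing to watch in bash: a test like [ $DEBUG ] only checks that the variable is non-empty, so the string false still counts as true. A small helper that compares against the literal string avoids that, and saves repeating the if block everywhere. The debug function below is my own illustration, not something from the script.

```shell
#!/bin/bash

DEBUG=true

# Print a message only when DEBUG is exactly the string "true".
debug() {
  if [ "$DEBUG" = true ]; then
    echo "DEBUG: $*"
  fi
}

debug "PROCESSING: some/file.iso"
```

With this in place, setting DEBUG=false (or anything other than true) silences the output.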

local_folder and remote_folder are, for me, consistent on every run. I don't use this script for any other purpose.

You could have them come in as parameters if you want this to be a bit more portable for multiple sources & destinations.
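A minimal sketch of how that might look, keeping the current values as defaults so the one-argument form still works. The argument order (drive, then local folder, then remote folder) and the parse_args name are assumptions of mine.

```shell
#!/bin/bash

# Hypothetical argument layout:  drive [local_folder [remote_folder]]
# ${N:-default} falls back to the default when argument N is missing or empty.
parse_args() {
  intermediate_folder="$1"
  local_folder="${2:-/pub}"
  remote_folder="${3:-en-hub-01:/hub/pub}"
}
```

The usage check at the top of the script would need relaxing to match, from exactly one argument to between one and three.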

intermediate_folder is, of course, the parameter that's passed in. This is where the delta file system structure will be copied to -- as noted above, it needs to have sufficient capacity available.
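If you'd rather check capacity up front than find out halfway through a copy, rsync's %l escape (file length in bytes) can be added to the dry-run output and summed. The sketch below assumes an output format of "%l %n"; the two function names are hypothetical. Directory entries report their own small size, so the total is a slight over-estimate.

```shell
#!/bin/bash

# Sum the byte counts from lines of the form "<bytes> <name>" on stdin.
# Lines that don't start with a number (rsync's header and summary text,
# blank lines) are ignored. %.0f avoids integer overflow in older awks.
delta_bytes() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { printf "%.0f\n", total }'
}

# Free space, in bytes, on the filesystem holding the given directory.
# df -Pk reports available space in 1024-byte blocks in column four.
free_bytes() {
  df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# Intended use (not run here, as it needs the real hosts):
#   needed=$(rsync -r --size-only --dry-run --out-format="%l %n" \
#              /pub/ en-hub-01:/hub/pub/ | delta_bytes)
#   [ "$needed" -lt "$(free_bytes "$intermediate_folder")" ] || exit 1
```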

The rsync line is the meat of this.

-r means recursive, and -v means verbose. That's the easy bit.

--size-only tells rsync to treat a file as unchanged if it exists at each end with the same name and size, ignoring timestamps (by default rsync compares both size and modification time). That keeps the comparison quick, and keeps files that differ only in timestamp out of the delta. This works just fine for me, as unison (mentioned above) is used for occasional network synchronisations of smaller sets of files, but otherwise I'll never have the situation where I have same name & same size files in both places that aren't actually the same. Further, on the next run of unison, it'll identify the file as new, do a checksum at the remote end, and alert me if there's disparity (say, due to file corruption, or an incomplete copy, etc).

--dry-run instructs rsync to not actually do anything, just report on what it would do. This works fine as we just need the text output of rsync to provide a list of files that we're going to copy.

The --exclude lines could be pushed into a single --exclude-from option, but that would mean maintaining a separate (text) file, and that seems less clean -- again because in my case I'm not using this across multiple repositories. Obviously you can have multiple --exclude= options, as I've done here.
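For anyone who does want the file-based approach, rsync's option for reading patterns from a file is --exclude-from, with one pattern per line. A rough sketch, where write_excludes and the file location are my own inventions:

```shell
#!/bin/bash

# Write the three patterns used in the script to the given file,
# one per line, in the format --exclude-from expects.
write_excludes() {
  cat > "$1" <<'EOF'
/lost+found
du*.txt
/distros
EOF
}

# The rsync call would then swap its three --exclude options for one:
#   rsync -rv --size-only --dry-run --exclude-from="$exclude_file" \
#         --out-format="%n" "$local_folder/" "$remote_folder/"
```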

--out-format="%n" should give us a more convenient machine-readable list of differences.

But having said that ... I couldn't get an rsync output format that didn't include some stuff I don't care about, namely empty lines, summary text on numbers of files checked, sizes of same, etc -- and that's why those first four if ... fi blocks are there: they get rid of rsync output we don't want (and that would otherwise savagely confuse the cp command).
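The same filtering can be written as a single case statement, which some people find easier to extend. This is a sketch of an equivalent filter rather than what the script does; filter_rsync_noise is a hypothetical name. It shares the script's blind spot, in that a file whose name really starts with "sent " would be dropped too.

```shell
#!/bin/bash

# Read rsync -v output on stdin and print only the path lines,
# dropping the header, blank lines, and the two summary lines.
filter_rsync_noise() {
  local line
  while IFS= read -r line; do
    case "$line" in
      ""|"sending incremental file list"|"sent "*|"total size "*)
        continue ;;
    esac
    printf '%s\n' "$line"
  done
}
```

The rsync command would be piped through this filter, and the main loop would then only ever see paths.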

The rest is simply processing each output line from the rsync command. I'm assuming that any line that ends with a / is a directory, and that's a fairly safe assumption, but as noted in the script, there were occasions when this seemed to be fragile.
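One way to make that less fragile might be to ask the local filesystem what each path is, rather than trusting the trailing slash. The sketch below is an untested-in-anger alternative to the loop body, with stage_entry being a name I've made up; it assumes the path from rsync is relative to the local folder, as it is with --out-format="%n".

```shell
#!/bin/bash

# Stage one rsync path line.
#   $1 = local root (e.g. /pub)
#   $2 = intermediate root (e.g. the USB drive)
#   $3 = relative path as printed by rsync, with or without a trailing /
stage_entry() {
  local src_root="$1" stage_root="$2" rel="${3%/}"
  if [ -d "$src_root/$rel" ]; then
    # A directory locally: just recreate it on the intermediate drive.
    mkdir -p "$stage_root/$rel"
  elif [ -f "$src_root/$rel" ]; then
    # A regular file: make sure its parent exists, then copy if newer.
    mkdir -p "$stage_root/$(dirname "$rel")"
    cp -fu "$src_root/$rel" "$stage_root/$rel"
  fi
}
```

Anything that is neither a directory nor a regular file (symlinks to nowhere, devices) is silently skipped, which matches the script's existing -f test.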

The cp -fuv flags are force, which is self-explanatory; update, which means only copy the file if it doesn't exist at the destination (or if the local copy is newer); and verbose, so you get a bit more feedback on the terminal. The update flag means that if the script crashes, or you stop it, you can re-run it and it won't re-copy stuff that's already copied, which is much faster.

And that's about it.

Certainly the script could be tidied up, but it's been working fine for me for the past couple of weeks.
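One possible tidy-up, which is a different technique from the loop above, is to hand the list straight back to rsync with its --files-from option and let it do the mkdir and copy work. As far as I can tell, running the dry run without -v prints only the names, so no filtering is needed, but I'd check that against your rsync version. The copy_delta name and its three arguments are assumptions for illustration, and symlinks or other special files may need extra care.

```shell
#!/bin/bash

# Copy the delta between $1 (source) and $2 (destination) into $3 (staging).
copy_delta() {
  local src="$1" dest="$2" stage="$3" list rc
  list=$(mktemp) || return 1

  # Dry run without -v: one relative path per line, nothing else.
  if ! rsync -r --size-only --dry-run --out-format="%n" \
         "$src/" "$dest/" > "$list"; then
    rm -f "$list"
    return 1
  fi

  # Paths in the list are taken relative to $src. With --files-from,
  # listed directories are created but not recursed into.
  rsync -a --files-from="$list" "$src/" "$stage/"
  rc=$?
  rm -f "$list"
  return "$rc"
}
```

For my setup the call would look something like copy_delta /pub en-hub-01:/hub/pub /path/to/external/drive, with the --exclude options added to the first rsync.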

Re-importing at the B-end

It probably should go without saying, but at the far end, once you mount the USB drive, it's a simple matter of using a vanilla rsync command to import these additions into the primary structure.

Rsync, by default, augments an existing structure - so assuming that the drive is mounted at /media/jedd/foo, and the deltas were written into a pub directory on that device:

en-hub-01:~#  rsync -av /media/jedd/foo/pub/  /hub/pub/

You can try with the --dry-run parameter, if you want to be cautious, or even just run an rsync with a subset of the total deltas, just by traversing down the directory structure of the USB drive and the remote repository.
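As a sketch of that cautious approach, the two steps can be wrapped up as a preview and an apply. The function names are hypothetical, and --itemize-changes is only there to make the preview easier to read.

```shell
#!/bin/bash

# Show what would be imported, without touching the repository.
#   $1 = delta directory on the USB drive, $2 = repository directory
preview_import() {
  rsync -av --dry-run --itemize-changes "$1/" "$2/"
}

# Do the import for real.
apply_import() {
  rsync -av "$1/" "$2/"
}

# Whole delta, using the paths from above:
#   preview_import /media/jedd/foo/pub /hub/pub
#   apply_import   /media/jedd/foo/pub /hub/pub
# A subset works the same way, by pointing both arguments at the same
# sub-directory on each side ("isos" is a made-up name):
#   apply_import   /media/jedd/foo/pub/isos /hub/pub/isos
```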