An Introduction to Rsync (part 2)

In An Introduction to Rsync (part 1), I described how rsync is a UNIX command that can keep two file folders synchronized, and I promised you a step-by-step example. To illustrate rsync, I’ll talk about how I manage the photos that I upload from my digital camera. This is just an arbitrary example, though. The files could be anything: business documents and spreadsheets, website pages, program source code, archived e-mails, podcasts, or other media files.

Synchronizing a Digital Photo Library: In this case, the hard drive on my desktop currently has about 24 GB of photographs that I have taken over the years. I upload new ones all of the time, and I like to have copies of the “interesting ones” on my laptop, but I don’t have room on my laptop for all of them.

I have two cameras. One assigns each photograph a name like “IMG_6466.JPG”. The other assigns names like “DSCN0028.JPG”. When I upload the photos from my camera, I place them in a separate folder based on the year and month (F:\media\Photos_2007\01\DSCN0028.JPG). Right off the bat, I inspect the photographs, deleting the blurry ones and rotating the ones that were taken in portrait mode. My habit is to save the rotated versions with “_R” added to the file name (F:\media\Photos_2007\01\DSCN0028_R.jpg). At this point, I mark all of the files as read only, and then the fun begins.

Just the “Interesting” Photos: Further editing of the photographs results in copies getting saved to disk with names like DSCN0028_cropped.jpg, DSCN0028_small.jpg, and DSCN0028_retouched.jpg. Also, sometimes the photos don’t need any editing, but I’ll rename them to include something of a description (DSCN0028_birthday_cake.jpg). It is these renamed files (_birthday_cake, _retouched, _small) that I consider “interesting”, so these are the ones that get copied to the laptop. Finally, even though my digital photograph collection goes back to 2004, I only consider the photographs taken in 2006 and 2007 to be “interesting.”

Local Operation: This step-by-step example will focus on local operation. Remember, rsync can work remotely, synchronizing two file folders that are controlled by different computers; however, in this case, I don’t need that. The desktop’s hard disk is a mapped drive that the laptop can access directly (as F:).

Windows CygWin: My laptop happens to run Windows XP, so I use a CygWin shell script to run rsync. There are two parts to this example: the actual shell script, and a file that lists all of the inclusion and exclusion patterns.

Here is the shell script (~/interesting_photos):

#!/bin/sh
pushd /cygdrive/f/media
mkdir -p /cygdrive/c/media
rsync -vrut --filter='. interesting_photos.rules' \
  Photos_2006 Photos_2007 /cygdrive/c/media
popd

Some notes about this script:

  • (The line that begins “Photos_2006…” is actually a continuation of the “rsync …” line.)
  • pushd and popd are variations of the cd command, which is used to change the current directory. pushd has the same effect as cd: in this case, it changes the current directory to /cygdrive/f/media. Before doing so, however, it remembers the original current directory, whatever it is. Then, popd restores the original.
  • The “mkdir -p /cygdrive/c/media” line ensures that a “media” folder exists on the C drive, creating it if it does not already exist. See Quick Tip: Create a Whole Dir Path with One Stroke
  • The syntax of the rsync command itself was described in part one. Briefly, the “-vrut” stands for verbose, recursive, update, and (preserve) timestamps; the --filter switch specifies the (first) filter rule (in this case it’s a rule that says find more rules in the “interesting_photos.rules” file); Photos_2006 and Photos_2007 are source folders (relative to the current directory) and /cygdrive/c/media is the destination folder (which would also be relative to the current directory if it wasn’t already fully qualified).
  • /cygdrive/f/media is the main library on the desktop (that’s CygWin speak for F:\media)
  • /cygdrive/c/media is the copy of the library that is to contain just the interesting photos
  • Photos_2006 Photos_2007 are the two folders within the library that I am interested in.
  • “interesting_photos.rules” refers to a corresponding ASCII text file that contains the filename pattern matching rules that determine what makes an interesting file name.
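
One caveat about pushd and popd: they are built into bash, and a strictly POSIX sh is not required to provide them. Where they are missing, running the cd inside a subshell has a similar effect, because the directory change disappears when the subshell exits. Here is a minimal sketch, assuming a scratch folder from mktemp as a stand-in for /cygdrive/f/media (the names start_dir and work_dir are just for illustration):

```shell
#!/bin/sh
# Sketch: a subshell as a stand-in for pushd/popd (an alternative technique,
# not what the script above does).
start_dir=$(pwd)
work_dir=$(mktemp -d)    # scratch folder standing in for /cygdrive/f/media

(
    # This cd only affects the subshell between the parentheses.
    cd "$work_dir" || exit 1
    echo "inside the subshell: $(pwd)"
)

# Back in the calling shell, the current directory never changed.
echo "back in the caller:  $(pwd)"
rmdir "$work_dir"
```

The trade-off is that anything else set inside the parentheses (variables, for instance) is also discarded when the subshell ends.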

Here is the rule file (interesting_photos.rules):

# Never copy any thumbnails
- *.THN
- Thumbs.db

# Always copy all TXT files (notes about the photos)
+ *.TXT
+ *.txt

# Ignore images with names exactly as the camera assigned them
- IMG_????.JPG
- DSCN????.JPG

# Ignore portrait images that were merely rotated
- Img_????_[Rr].jpg
- Dscn????_[Rr].jpg

+ *.[Jj][Pp][Gg]

Some notes about the rule file:

  • Blank lines, and lines that begin with a pound sign (#), are ignored.
  • Lines that begin with a plus sign and a space are inclusion rules.
  • Lines that begin with a hyphen and a space are exclusion rules.
  • A question mark (?) is a wildcard that matches any one character (exactly one).
  • An asterisk (*) is a wildcard that matches any number of any characters, except a slash (/). In other words, the characters that match an asterisk are confined to a single file name or folder name.
  • A double asterisk (**) is a wildcard that matches any number of any characters, including slashes (/).
  • Square brackets ([]) can be used to make up a custom wildcard that enumerates exactly which characters are possible.
  • Pattern matching is case sensitive.
  • There are other aspects to the pattern matching algorithm. See the MAN pages for details. For example, leading and trailing slashes each have special significance.

One piece of software I use likes to leave little thumbnail files lying around with an extension of THN. So, the first line in the rules file ensures that they are skipped (and the next is to skip the thumbnails that Windows creates).

Rules Are Processed Top-Down: As rsync iterates through the files in the source folder(s), and considers each one to determine whether or not it should be transmitted, it does so by processing the list of filter rules from top to bottom. As soon as it finds a rule that matches the name under consideration, rsync looks to the left to see if the rule is an inclusion rule (+) or an exclusion rule (-). If it is an inclusion rule, then the file is transmitted. If it is an exclusion rule, then the file is skipped. If rsync gets to the end of the list of filters without matching any of them, then the file is transmitted.

Note: In this case, all of the + lines are actually redundant, since the default action is to include (transmit) files. Normally, an explicit inclusion rule is only needed to specify an exception to an exclusion rule.
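
The top-down, first-match-wins behavior can be imitated with a shell case statement, which also tries its patterns in order and stops at the first match. This is only an analogy for experimenting with names: classify is a hypothetical helper, not part of rsync, and case patterns are not identical to rsync’s (for example, a case asterisk will also match a slash), so it should only be fed bare file names.

```shell
#!/bin/sh
# Hypothetical helper that mimics the rule file with shell "case" patterns.
# It approximates rsync's first-match-wins logic; it is not rsync itself.
classify() {
    case "$1" in
        *.THN|Thumbs.db)                     echo skip ;;  # thumbnails
        *.TXT|*.txt)                         echo copy ;;  # notes about the photos
        IMG_????.JPG|DSCN????.JPG)           echo skip ;;  # names as the camera assigned them
        Img_????_[Rr].jpg|Dscn????_[Rr].jpg) echo skip ;;  # merely rotated
        *.[Jj][Pp][Gg])                      echo copy ;;  # any other JPG
        *)                                   echo copy ;;  # no rule matched: transmit by default
    esac
}

for name in DSCN0028.JPG Dscn0028_R.jpg Dscn0028_birthday_cake.jpg notes.txt Thumbs.db; do
    printf '%s: %s\n' "$name" "$(classify "$name")"
done
```

Because the patterns are case sensitive, a name such as DSCN0028_R.jpg (uppercase name, lowercase extension) matches none of the skip patterns and falls through to the JPG line, so it would be reported as copy.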

Case Sensitive: Most implementations of rsync are case sensitive, including CygWin’s (even though Windows apps are normally case insensitive). So, if there is a possibility of filenames that exist with multiple casings, then it is necessary to either repeat the pattern (as I did above with *.TXT patterns) or use the square bracket notation (as I did above with the *.JPG pattern):

- *.EXE
- *.Exe
- *.exe

or

- *.[Ee][Xx][Ee]

I was actually very particular when I specified the exclusion rules for the JPG files. My cameras both save filenames in all uppercase, but my photo editing software always converts the names to an initial cap when saving them back.
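
To experiment with rules like these without touching a real photo library, a throwaway sandbox helps. The following is a minimal sketch, assuming rsync and mktemp are installed; the sandbox folder, the sample file names, and the shortened photos.rules file are all made up for illustration.

```shell
#!/bin/sh
# Sketch: exercise a shortened rule file against empty sample files.
# Everything lives under a scratch folder created by mktemp.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/src/Photos_2007/01" "$sandbox/dest"

# One name straight from the camera, one rotated copy, one renamed copy, one thumbnail.
for name in DSCN0028.JPG Dscn0028_R.jpg Dscn0028_birthday_cake.jpg Thumbs.db; do
    : > "$sandbox/src/Photos_2007/01/$name"
done

# A shortened version of the rule file (hypothetical name: photos.rules).
cat > "$sandbox/photos.rules" <<'EOF'
- Thumbs.db
- DSCN????.JPG
- Dscn????_[Rr].jpg
+ *.[Jj][Pp][Gg]
EOF

if command -v rsync >/dev/null 2>&1; then
    # The subshell keeps the cd from leaking into the calling shell.
    (
        cd "$sandbox/src" || exit 1
        rsync -vrut --filter=". $sandbox/photos.rules" Photos_2007 "$sandbox/dest"
    )
    echo "Files that reached the destination:"
    find "$sandbox/dest" -type f
else
    echo "rsync is not installed; nothing was copied."
fi
```

With these four sample files, only Dscn0028_birthday_cake.jpg should arrive in the destination. Adding -n (--dry-run) to the rsync options makes it report what it would transfer without copying anything, which is a safer first step against a real library.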

Comments

  1. Interesting read. So am I right in assuming that rsync is a more powerful version of rcp?

    I just gave it a whirl on my work CygWin install, but it’s not installed here, so I’ll wait to have a play tonight. I like the idea of syncing it up with my webserver in the form of a quick build script - attach it to a cron job and you’ve got emulated continuous integration.

    Again - cheers for that, this could prove to be a very useful tool.

    Simon

  2. Simon: Exactly. From the man pages: “rsync is a program that behaves in much the same way that rcp does, but has many more options and uses the rsync remote-update protocol to greatly speed up file transfers when the destination file is being updated.” Let me know if you have any questions about setting it up to run remotely, which I didn’t cover here.

© 2006-2007 Maxim Software Corp.  All rights reserved.