Tutorial: Use of the shell 'xargs' command

I have found the ‘xargs’ command line utility to be very useful at times.  Often you can avoid writing a special shell script by using xargs to perform a task on a list of files.  Its use is best described through an example.  I have recently been migrating files from one svn repository to another.  I began by copying all of the files over using rsync.  This also copied all of the .svn directories from the old repository, which I didn’t want.  That left me with at least 30 directories, each containing a .svn directory that needed to be removed.  What to do?  Use xargs.  Here is the command:

command-prompt> find -name .svn | xargs rm -rf

This uses the find command to generate a list of all files and directories named ‘.svn’ (case sensitive).  The list is then piped to the xargs command, which runs ‘rm -rf’ on everything in it.  That’s it!  Note that as written this command searches the current directory and all subdirectories for entries named ‘.svn’.  If you want ‘find’ to search a directory other than the current one, add the path of that directory before the ‘-name’ test.  See the man page for find for more info.
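For example, to search a specific tree, and to be safe if any of the matched paths contain spaces, a common variation (assuming a GNU-style find and xargs; the path here is just a placeholder) looks like this:

command-prompt> find /path/to/working/copy -name .svn -print0 | xargs -0 rm -rf

The -print0 flag makes find separate names with null characters, and -0 tells xargs to split on those, so odd characters in file names don’t confuse the pipeline.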

Tutorial: Use rsync instead of mv or scp when it really matters

I’ve been running a lot of simulations on the ‘updraft’ parallel computing cluster at the University of Utah.  My input files often have to wait in the queue for quite a while (a few days sometimes) before they can be run.  The simulations generate large data sets which I then need for post-processing.  The directory where these files are created on the cluster is regularly wiped by the administrators to keep space free for other users, so you don’t want to leave important data sitting around on that file system.  I had been moving it back to my home directory on the cluster using ‘mv’, and eventually transferring it to my workstation using ‘scp’.  This was kind of a pain and took FOREVER!

I also discovered something that caused me to completely abandon ‘mv’ for any data that is even somewhat important.  I was using ‘mv’ to transfer the data to my home directory when I lost my internet connection.  Big deal, right?  I logged back in only to find that the data files had been corrupted by the interrupted ‘mv’ command.  Now I had to run the simulation all over again to generate a new data file.  Bummer.  I did a little research on ‘mv’ and found that if it is interrupted for any reason, it often loses data.  Not good.

Enter rsync.  rsync is a tool which makes a copy of files and directories.  If it gets interrupted, you can simply restart it and it will essentially continue where it left off.  Why not just use cp or scp?  Two reasons.  First, if cp or scp is interrupted and then issued again, it simply restarts the transfer from the beginning.  This is a real problem when the transfer takes an hour and you need the data NOW.  Which brings me to the second reason: speed.  If you call rsync with the -z flag, it compresses the data before copying it.  On remote file transfers this results in a HUGE speed up.  Of course, with rsync, once the files are transferred you need to manually delete the unwanted copy.  You can use ‘rdiff’ to verify that the two copies are in fact identical before deleting the unwanted files.  Did I mention that rsync is also great for backups too?
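To make that concrete, a transfer from the cluster to a workstation looks roughly like this (a sketch only; the host name and paths are placeholders, not the actual cluster layout):

command-prompt> rsync -avz --partial updraft.example.edu:/scratch/myrun/ ~/simulation-data/myrun/

Here -a preserves permissions and timestamps, -v shows which files are being transferred, -z compresses the data in transit, and --partial keeps partially transferred files so a restarted rsync can pick up where it left off instead of starting the whole file over.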