Discussion:
[BackupPC-users] Why is my rsync so much slower to do an incremental backup than tar over ssh?
gimili
2009-07-15 12:13:52 UTC
I am new at backuppc. I am backing up 250GB from one machine to another
with backuppc. Both machines run Debian lenny as the OS. 2.4GHz
machines, with 4GB RAM in the client and 2GB in the backuppc machine. Both
drives are SATA.

Initially I used rsync and it took 7 hours for a full backup and 5 hours
for an incremental. I applied every tweak I could find to improve speed.

I switched to tar and it still took 7 hours for a full backup but it
only took 10 minutes for an incremental. I modified a few files to make
sure that it was working. The modified files were indeed backed up.

Why is my rsync so much slower to do an incremental backup than tar?

Have I made an error?

It is hard to believe that tar can be so fast on an incremental
backup with that much data.

Are there any serious problems with using tar?

Any advice appreciated! Thanks!
--
gimili
Carl Wilhelm Soderstrom
2009-07-15 13:10:19 UTC
Post by gimili
Why is my rsync so much slower to do an incremental backup than tar?
Because rsync makes checksums of all the files, instead of just checking the
timestamp.
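
For what it's worth, with plain rsync (outside of BackupPC) the difference
shows up in the flags; the paths here are made up:

  # default quick check: skip files whose size and mtime both match
  rsync -av /data/ backuphost:/backup/data/

  # force a full-content checksum of every file (much slower)
  rsync -av --checksum /data/ backuphost:/backup/data/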
Post by gimili
Have I made an error?
No, tar really is faster than rsync in some cases.
Post by gimili
It is hard to believe that tar can be so fast on an incremental
backup with that much data.
Are there any serious problems with using tar?
- tar does not catch files which have changed but whose timestamps say
they have not. If a file's timestamp is set to some point prior to the
last reference backup, the file won't be backed up (a sketch follows
after this list).

- tar copies *all* the files when doing a 'full' backup; rsync only copies
the changed ones (like an 'incremental', but with slightly more thorough
checking).
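
To make the first point concrete: a timestamp-based incremental amounts to
roughly the following (made-up paths and date; BackupPC assembles the real
command from its tar transfer settings):

  # only files modified after the given date are included; anything
  # carrying an older timestamp is silently skipped
  tar -c -f - -C /home --newer='2009-07-14 01:00:00' . \
      | ssh backup-server 'cat > /tmp/incr.tar'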

This is a simplified explanation, but covers the basics. I'm sure others
will chime in with more details.

If tar works better for you in your environment, due to network bandwidth,
processor power available, memory availability, etc., that's why it's an
option. :)

As a noteworthy data point, when making an initial copy of files (not using
backuppc, just plain tar or rsync); tar is 2x-4x faster than rsync,
presumably due to all of rsync's calculating overhead. (Newer versions may
be faster; it was a while ago that I found this out by practical testing.)
Rsync wins when making subsequent copies that don't require a large
percentage of the data to be transferred.
--
Carl Soderstrom
Systems Administrator
Real-Time Enterprises
www.real-time.com
Holger Parplies
2009-07-15 14:21:19 UTC
Hi,
Post by Carl Wilhelm Soderstrom
Post by gimili
Why is my rsync so much slower to do an incremental backup than tar?
Because rsync makes checksums of all the files, instead of just checking the
timestamp.
that is actually not true for incremental backups, unless you have changed
RsyncArgs to include --ignore-times (which you should *not*).
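
A quick way to check (assuming a Debian-style layout with the config under
/etc/backuppc):

  grep -n 'ignore-times' /etc/backuppc/*.pl

No output means you haven't changed it.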
Post by Carl Wilhelm Soderstrom
Post by gimili
Have I made an error?
No, tar really is faster than rsync in some cases.
While that is true, incremental backups should not be taking 5 hours (compared
to a 7 hour full backup). Something is going wrong, I just couldn't guess
what, and I'm not even sure what additional information to ask for, aside from
the usual: relevant config file settings (everything related to rsync or tar,
including BackupFilesOnly/BackupFilesExclude), details on your setup that you
missed (network, type of file system on both sides, number of files in backup
set ...). One other thought: what does your server status page say on the
matter of hash chains and their length?
Post by Carl Wilhelm Soderstrom
Post by gimili
Are there any serious problems with using tar?
- tar does not catch files which have changed but whose timestamps say
they have not. If a file's timestamp is set to some point prior to the
last reference backup, the file won't be backed up.
This includes, most notably, files that have been moved/renamed, especially if
they were moved from a place that was not backed up to one that is (because
then they will not even be in the backup under a wrong name). Other common
examples are unpacked zip file contents with old dates.

Also, incremental tar backups cannot detect deleted files (while incremental
rsync backups will), so these will continue to exist in your backups up to the
next full backup, meaning the incrementals don't accurately reflect the state of
your file system. This may or may not be important for you.
Post by Carl Wilhelm Soderstrom
If tar works better for you in your environment, due to network bandwidth,
processor power available, memory availability, etc., that's why it's an
option. :)
Thankfully, Jon Forrest recently reminded us that you can [probably] get the
best of both worlds by using the '-W' (or '--whole-file') option to rsync.
This makes rsync transfer the whole file (like tar) instead of attempting to
speed up the transfer with the (expensive) rsync algorithm. If network
bandwidth is *not* your limiting factor, you should be able to cut down CPU
usage (and possibly disk I/O, presuming rsync needs to read the files twice -
once for calculating checksums and once for retrieving content to transfer)
and still get the benefits of rsync's file list comparison. I'm just not
positive that File::RsyncP implements this option, but I'm going to test it as
soon as I find some time.
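
With plain rsync that is simply the following (made-up paths; for BackupPC
it would have to go into RsyncArgs, assuming File::RsyncP supports it):

  # send changed files whole instead of computing block deltas,
  # trading network bandwidth for CPU time and disk reads
  rsync -av --whole-file /data/ backuphost:/backup/data/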
Post by Carl Wilhelm Soderstrom
As a noteworthy data point, when making an initial copy of files (not using
backuppc, just plain tar or rsync); tar is 2x-4x faster than rsync,
presumably due to all of rsync's calculating overhead.
I find this somewhat surprising, because an initial copy has no remote files
to compare to. Also, I'm quite sure that this depends on your data set (file
sizes, file counts), computer configurations (extremely slow sender vs. fast
receiver?) and network (high bandwidth). But I suppose tar *can be* 2-4 times
faster than rsync *in some cases*.
Post by Carl Wilhelm Soderstrom
Rsync wins when making subsequent copies that don't require a large
percentage of the data to be transferred.
That is really the point of using rsync (note: this, again, depends on other
factors, probably mainly network speed). One thing to note in the context of
BackupPC: if you switch XferMethod from tar to rsync, you won't get any
benefits until *after the first full rsync backup*, because due to a difference
in attrib file formats the rsync method *won't* match same files from the
reference tar backup. So you can't do an initial full tar backup for the speed
advantage (if there really is one) and then switch to rsync.

Another BackupPC-specific note is that rsync Xfer will save interrupted full
backups as partials and restart them later (saving you the network transfers
that have already been done), while tar will always need to rerun the full
backup from the start, because it has no notion of resuming an interrupted
transfer.

So, to sum it up, there are numerous advantages to rsync as XferMethod,
especially if network bandwidth is a precious resource (compared to CPU cycles
and disk I/O), and rsync should normally not be anywhere near as slow as you
are experiencing it. If tar incrementals take 10 minutes, rsync incrementals
should probably be somewhere in the range of 10-30 minutes, with a typical
value below 15 minutes (I'm just guessing, but that's what I'd expect). It's
just a matter of figuring out why your rsync incrementals are taking so long.

Your relevant XferLOG file might give some clues. Unchanged files should not
be mentioned for incremental backups, and should be listed as "same" for full
(rsync!) backups. Directories are always listed. Sizes of XferLOG files should clearly
indicate which backups were full and which were incremental (meaning your
XferLogLevel should be at 1 ;-). Can you find anything obvious in there? (*)

Regards,
Holger

(*) sudo -v; sudo /usr/share/backuppc/bin/BackupPC_zcat /var/lib/backuppc/pc/hostname/XferLOG.X.z | less
(replace the paths to match your installation).
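
For a rough count of how many files a backup actually transferred, something
like this should work (the exact keywords may vary slightly between versions):

  sudo /usr/share/backuppc/bin/BackupPC_zcat \
      /var/lib/backuppc/pc/hostname/XferLOG.X.z | egrep -c '^ *(create|pool)'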
gimili
2009-07-15 15:33:12 UTC
Post by Holger Parplies
While that is true, incremental backups should not be taking 5 hours (compared
to a 7 hour full backup). Something is going wrong, I just couldn't guess
what, and I'm not even sure what additional information to ask for, aside from
the usual: relevant config file settings (everything related to rsync or tar,
including BackupFilesOnly/BackupFilesExclude), details on your setup that you
missed (network, type of file system on both sides, number of files in backup
set ...). One other thought: what does your server status page say on the
matter of hash chains and their length?
A BIG thanks to everyone for all the info! Because the difference was
so massive between tar and rsync, I figured I must have done something
wrong. The question is what is wrong.

Where do I find the hash chains and their length?

Both drives are ext3. The LAN is gigabit (1000 Mbit/s).
643,036 files in the backup set.
Compression level 3.
Les Mikesell
2009-07-15 14:53:56 UTC
Post by Carl Wilhelm Soderstrom
As a noteworthy data point, when making an initial copy of files (not using
backuppc, just plain tar or rsync); tar is 2x-4x faster than rsync,
presumably due to all of rsync's calculating overhead.
Are you comparing tar writing to a tar archive against rsync, or
'tar -c | ssh ... tar -x'?
directory tree and file structure. Rsync shouldn't do a lot of
calculating when the target is empty, but it does (at least the old
versions) read and transfer the entire source directory tree before
starting any file transfers, which can add some time to the operation,
particularly where there are a lot of small files.
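
You can watch that phase yourself: a verbose run of an old rsync prints
"building file list ..." and sits there until the whole source tree has
been scanned before the first file moves (rsync 3.x builds the list
incrementally instead):

  rsync -av /lots/of/small/files/ backuphost:/dest/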
--
Les Mikesell
***@gmail.com
Carl Wilhelm Soderstrom
2009-07-15 15:30:42 UTC
Post by Les Mikesell
Post by Carl Wilhelm Soderstrom
As a noteworthy data point, when making an initial copy of files (not using
backuppc, just plain tar or rsync); tar is 2x-4x faster than rsync,
presumably due to all of rsync's calculating overhead.
Are you comparing tar writing to a tar archive against rsync, or
'tar -c | ssh ... tar -x'?
The latter.
Post by Les Mikesell
The slow part of the process should be creating the new
directory tree and file structure. Rsync shouldn't do a lot of
calculating when the target is empty, but it does (at least the old
versions) read and transfer the entire source directory tree before
starting any file transfers, which can add some time to the operation,
particularly where there are a lot of small files.
Interesting. I might have vaguely known some of that but never correlated
it. Thank you. I think I can see how it might be a bit slower to create the
directory trees and then transfer the files, rather than create the files and
indices in a more sequential fashion.
--
Carl Soderstrom
Systems Administrator
Real-Time Enterprises
www.real-time.com
Les Mikesell
2009-07-15 15:49:48 UTC
Post by Carl Wilhelm Soderstrom
Post by Les Mikesell
The slow part of the process should be creating the new
directory tree and file structure. Rsync shouldn't do a lot of
calculating when the target is empty, but it does (at least the old
versions) read and transfer the entire source directory tree before
starting any file transfers, which can add some time to the operation,
particularly where there are a lot of small files.
Interesting. I might have vaguely known some of that but never correlated
it. Thank you. I think I can see how it might be a bit slower to create the
directory trees and then transfer the files, rather than create the files and
indices in a more sequential fashion.
I don't think rsync actually creates the directory tree ahead of the
transfer - in fact I'm pretty sure it doesn't based on what is left
after an incomplete transfer. It just loads the remote directory
structure into RAM before starting to walk through it doing the
comparison to what is already there. If nothing already exists, the
comparison won't actually be needed, but it still waits to get the whole
tree before starting to request files, where tar would start writing as
soon as the first thing comes over.
--
Les Mikesell
***@gmail.com
gimili
2009-07-16 11:08:52 UTC
I switched back from tar to rsync. It sounds like rsync is far
superior. I ran a full backup which took 302 minutes and then an
incremental which only took 26 minutes. So it seems like things are
working now as the incremental was much quicker. I am not sure what
happened the first time. Hopefully it won't happen again. Thanks!
--
gimili
Les Mikesell
2009-07-15 13:18:13 UTC
Post by gimili
I am new at backuppc. I am backing up 250GB from one machine to another
with backuppc. Both machines run Debian lenny as the OS. 2.4GHz
machines, with 4GB RAM in the client and 2GB in the backuppc machine. Both
drives are SATA.
Initially I used rsync and it took 7 hours for a full backup and 5 hours
for an incremental. I applied every tweak I could find to improve speed.
I switched to tar and it still took 7 hours for a full backup but it
only took 10 minutes for an incremental. I modified a few files to make
sure that it was working. The modified files were indeed backed up.
Why is my rsync so much slower to do an incremental backup than tar?
The backuppc side is running a Perl implementation of rsync and working against
a compressed copy of the files. It actually compares the previous full run to
the existing files so it knows about deletions, old files under renamed
directories, etc. Tar just uses the timestamp on incrementals.
Post by gimili
Have I made an error?
Some difference is normal - but that seems somewhat extreme unless your 250 gigs
is millions of tiny files. It would probably help to add RAM to the backuppc
server. I think there is also some quirk where the --checksum-seed=32761 option
doesn't take effect on the first run, but I've forgotten the details - and it
shouldn't matter on incrementals, where most files haven't changed anyway.
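
To see whether your install passes that option at all (Debian path assumed):

  grep -n 'checksum-seed' /etc/backuppc/config.pl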
Post by gimili
It is hard to believe that tar can be so fast on an incremental
backup with that much data.
There's a lot less work involved if you don't actually have many new/changed files.
Post by gimili
Are there any serious problems with using tar?
There's more network traffic, especially for the subsequent full runs, and the
incrementals will not pick up deletions or files that, for various reasons, have
old timestamps but are new or in new locations. Those may or may not be serious
problems for you.
--
Les Mikesell
***@gmail.com
Craig Barratt
2009-07-16 20:50:05 UTC
Post by gimili
I switched back from tar to rsync. It sounds like rsync is far
superior. I ran a full backup which took 302 minutes and then an
incremental which only took 26 minutes. So it seems like things are
working now as the incremental was much quicker. I am not sure what
happened the first time. Hopefully it won't happen again. Thanks!
If you changed settings that affect what is backed up (e.g. the share
names, BackupFilesOnly or BackupFilesExclude) after the full but before
the incremental, then the incremental will have a lot more work to do.
If you make significant changes it's not a bad idea to manually run
a full backup.
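
On a typical Debian layout you can kick one off from the command line as
well as from the web interface's "Start Full Backup" button; a sketch,
where "hostname" is a placeholder and it runs as the backuppc user:

  # -f forces a full backup, -v gives verbose output
  sudo -u backuppc /usr/share/backuppc/bin/BackupPC_dump -f -v hostname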

As suggested already, inspecting the XferLOG files will tell you more
about why the first incremental took longer than you expected.

Craig
