This is the home page for my tool svndumpsanitizer. It's a small project born from my experiences with the official subversion tool "svndumpfilter". Svndumpfilter unfortunately does not work with every valid repository, and even though I can't vouch for my program either, I have certainly tried to make it that way. If it doesn't work with some valid repository, that is to be considered a bug. What I do know is that it has handled every file I've thrown at it, including the ones svndumpfilter couldn't.
The latest version can be downloaded here.
If you prefer, I've also made this code available via GitHub. (If you for some odd reason need an older version, this is the place to go.)
The program has been tested on Linux (i386 and x86_64 architectures) and should work out-of-the-box on any system using the GNU toolchain. It uses only standard libraries, so it should be easy to port to other systems as well. The only thing that might cause some snags is the 64-bit file API. (As of 1.0.2 it contains a modification by $ergi0 that should make it possible to build under Windows. I haven't tested that myself, though.)
To compile it, just run:
gcc svndumpsanitizer.c -o svndumpsanitizer
If you've found this page you probably already know. You have a large subversion repository, and you've been charged with the task to filter out some of the paths in the repository while keeping some others - naturally maintaining the entire history of the paths and files that should be kept. You create the dump file and you google the problem. You quickly discover svndumpfilter, and you start feeling hopeful about your task. This is going to work out...
You proceed with reading the man pages of svndumpfilter and when you think you've got it figured out you give it your first shot:
cat foobar.dump | svndumpfilter include trunk/dowant > clean.dump
You see the program start doing its thing, and you're convinced that you're almost done. Then some time later disaster strikes.
Revision 8932 committed as 8932.
Revision 8933 committed as 8933.
Revision 8934 committed as 8934.
svndumpfilter: Invalid copy source path '/trunk/donotwant/hello.c'
Hmm... That looks bad. So you try a new strategy. Maybe if you just exclude the unwanted stuff instead.
cat foobar.dump | svndumpfilter exclude trunk/donotwant branches > clean.dump
You watch svndumpfilter spring into action, but your optimism has already suffered a blow, and a nagging suspicion has taken its place. A while later your worst fears are realized.
Revision 3461 committed as 3461.
Revision 3462 committed as 3462.
Revision 3463 committed as 3463.
svndumpfilter: Invalid copy source path '/branches/george-test-branch/bork.py'
D'oh! Apparently someone has moved stuff from a place you wanted to exclude to some directory you didn't even know existed (because it has long since been deleted), and which you therefore haven't been able to exclude. Svndumpfilter in its wisdom naturally didn't tell you what that directory is either. Realizing that this is probably a dead end, you become somewhat discouraged, but since you really need to get the file cleaned up, you go for the brute force approach. It's back to the include strategy, but you add the offending file to the includes.
cat foobar.dump | svndumpfilter include trunk/dowant trunk/donotwant/hello.c > clean.dump
...
Revision 8932 committed as 8932.
Revision 8933 committed as 8933.
Revision 8934 committed as 8934.
svndumpfilter: Invalid copy source path '/trunk/donotwant/hello.h'
Argh! The header file of the .c file that got you the last time nailed you this time. Well, try, try again. You could of course include the trunk/donotwant directory, but that would defeat the purpose of the filtering, so you add only the offending file. (Also, if you do include the entire directory, you can run into "fun" surprises where svndumpfilter craps out due to stuff that you're not really interested in, but that was moved from an excluded directory to the newly included one.)
cat foobar.dump | svndumpfilter include trunk/dowant trunk/donotwant/hello.c trunk/donotwant/hello.h > clean.dump
...
Revision 8932 committed as 8932.
Revision 8933 committed as 8933.
Revision 8934 committed as 8934.
svndumpfilter: Invalid copy source path '/trunk/donotwant/someotherstupidfile.c'
You can feel your blood pressure rising as you realize that at some point someone has apparently moved a load of files around between the areas of the repository that you want to keep and the areas you do not. Not only is svndumpfilter unable to handle this - it is also unable to tell you about more than one offending file at a time! In order to not have to go through filtering 8934 revisions n times, where n may be an annoyingly large value, you start digging through the dumpfile instead. You locate the offending commit and dig out all the offending files. It takes a fair amount of time, because even though svn dumpfiles are human readable, they aren't all that pleasant to read. With a combination of fury and agony, you try again.
cat foobar.dump | svndumpfilter include trunk/dowant trunk/donotwant/hello.c trunk/donotwant/hello.h trunk/donotwant/someotherstupidfile.c \
trunk/donotwant/someotherstupidfile.h trunk/donotwant/foo.c trunk/donotwant/foo.h trunk/dowant trunk/donotwant/blah.c trunk/donotwant/blah.h \
trunk/donotwant/spreadsheetsarefun.ods trunk/donotwant/andsoarerandombinaries.bin trunk/donotwant/foobar.c trunk/donotwant/foobar.h \
trunk/donotwant/randomcrud.h trunk/donotwant/randomcrud.c trunk/donotwant/main.c trunk/dowant trunk/donotwant/stop.c trunk/donotwant/stop.h \
> clean.dump
...
Revision 1713 committed as 1713.
Revision 1714 committed as 1714.
svndumpfilter: Invalid copy source path '/branches/quux/hello.c'
It's the same story all over again. At an earlier point in time (some of) these files were moved from another location, and svndumpfilter failed to handle it or tell you about it until it ran head first into a brick wall. As if to taunt you, the offending revision is situated at an earlier point in the repository. (You thought that you had at least progressed to revision 8934? Well, think again. Nyah, nyah, nyah!) Also, even if you fix this issue, you have no idea how many fun little surprises lie ahead. You now look and feel like this:
If you google for this problem you will mostly run into things like suggestions on how to manage your repository in order to avoid files that svndumpfilter can't handle. That's nice advice, but also useless when you're expected to fix a b0rked 40GB dumpfile with a huge number of commits, made by people you don't know and are not allowed to murder.
I could keep going, but I think I've made my point by now. Parsing tons of boring data is a job for a computer, not a human being. And if the dumpfile is valid (i.e. represents any actual repository - not just a repository built in a certain manner) the tool should be able to handle it. Hence svndumpsanitizer.
It is in fact quite understandable that svndumpfilter doesn't work. It's an aptly named program, because all it does is take a data stream and output the contents to stdout after filtering it on the fly. The problem is that the subversion repository structure is too complicated for such an approach to even have a theoretical chance of working. When the filter is at revision 10, it has no way of knowing whether a node the user wants to discard will be moved to a position he wishes to keep in revision 113. So it does the only thing it can do - it discards the node, and at revision 113 it craps out because it has already discarded the data it turns out it would have needed.
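To make the problem concrete, here is a minimal Python sketch of a one-pass filter. It is not svndumpfilter's actual code; the node tuples `(revision, path, copy_source)` and the function names are made up for illustration. The filter decides about each node the moment it sees it, so the copy in revision 113 has nothing left to copy from.

```python
def under(path, prefix):
    """True if `path` is `prefix` itself or lies somewhere below it."""
    return path == prefix or path.startswith(prefix + "/")


def streaming_exclude(nodes, excluded):
    """Hypothetical one-pass filter: each node is kept or dropped on arrival."""
    seen = set()   # paths that have already been written to the output
    out = []
    for rev, path, copy_from in nodes:
        if under(path, excluded):
            continue   # discarded for good; there is no way to get it back
        if copy_from is not None and copy_from not in seen:
            raise ValueError(f"r{rev}: invalid copy source path '/{copy_from}'")
        seen.add(path)
        out.append((rev, path, copy_from))
    return out


# Revision 10 adds a file we do not want; revision 113 copies it somewhere we do.
nodes = [
    (10, "trunk/donotwant/hello.c", None),
    (113, "trunk/dowant/hello.c", "trunk/donotwant/hello.c"),
]

try:
    streaming_exclude(nodes, "trunk/donotwant")
    error = None
except ValueError as exc:
    error = str(exc)
    print(error)
```

By the time the copy shows up, the only fix would be to go back and un-discard revision 10, which a pure stream filter cannot do.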
Svndumpsanitizer works in a different manner. It scans the nodes several times in order to discover which nodes should actually be kept. After it has determined which nodes to keep it writes only these nodes to the outfile. Finally - if necessary - it adds a commit that deletes any unwanted nodes that had to be kept in order not to break the repository. There are 6 steps in total. (7 if you want to drop empty revisions.)
Since one example is usually better than a ton of theory, let's look at the following hypothetical repository: Revision 0 is just the empty repository creation. Revision 1 adds the directories "trunk", "trunk/dowant", "trunk/donotwant" and "trunk/foobar". After this, three more commits are made as follows. (Each black rectangle represents a revision.)
Now we take a dump of the repository and try to filter out the trunk/donotwant path. Svndumpfilter will fail on revision 3, because there a file has been copied from a location we want to omit, to a location we wish to keep. Svndumpsanitizer will work, though. Let's see what it would do if we were to run the command:
svndumpsanitizer --infile foobar.dump --outfile foobar.sanitized.dump --exclude trunk/donotwant
After reading through the dumpfile it will parse through the nodes starting with revision 4. There it will keep all nodes except the one that deletes the file "trunk/donotwant/test.c", because it's in the directory we don't want. In revision 3 it will remove the node that adds the "trunk/donotwant/test2.c" file for the same reason. However, it will take note of the fact that the file "trunk/dowant/test.c" has in fact been copied from "trunk/donotwant/test.c". For this reason, when it comes to revision 2 it won't remove anything, as removing the node that adds test.c would break the copy operation in revision 3. The repository now looks like this:
That's not too bad, but if we were to create the repository like this, we would discover that we would still have the file "trunk/donotwant/test.c" hanging around despite it being deleted in the original repository. Svndumpsanitizer therefore scans through the nodes again to bring back any delete nodes for files it was forced to keep. This will resurrect the "delete test.c" operation in revision 4.
Finally it will scan through the repository looking for any nodes that are still alive, even though the user said he didn't want them. Turns out there is one. The directory "trunk/donotwant" that was added in revision 1. Svndumpsanitizer now adds a revision 5 that deletes the offending node. The end result looks like this:
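The walk-through above can be condensed into a small Python sketch. This is a toy model, not the real implementation (svndumpsanitizer is written in C and works on actual dump files). The `Node` class, the function names and the three-pass split are assumptions made for illustration, and the example data is reconstructed from the description of revisions 1 to 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    rev: int
    action: str                      # "add" or "delete"
    path: str
    copy_from: Optional[str] = None  # copy source path, if any


def under(path, prefix):
    """True if `path` is `prefix` itself or lies somewhere below it."""
    return path == prefix or path.startswith(prefix + "/")


def sanitize(nodes, excluded):
    needed = set()   # unwanted paths that must survive as copy sources
    keep = set()     # indices of the nodes we decide to keep

    # Pass 1: newest to oldest, dropping unwanted nodes unless a later copy needs them.
    for i in range(len(nodes) - 1, -1, -1):
        n = nodes[i]
        unwanted = under(n.path, excluded)
        required = any(under(p, n.path) for p in needed)  # needed path or its parent dir
        if unwanted and not (n.action == "add" and required):
            continue
        keep.add(i)
        if n.copy_from and under(n.copy_from, excluded):
            needed.add(n.copy_from)

    # Pass 2: bring back delete nodes for paths we were forced to keep.
    for i, n in enumerate(nodes):
        if n.action == "delete" and i not in keep:
            if any(j < i and nodes[j].action == "add" and nodes[j].path == n.path
                   for j in keep):
                keep.add(i)

    # Pass 3: replay what is left and delete unwanted paths that are still alive.
    kept = [nodes[i] for i in sorted(keep)]
    alive = set()
    for n in kept:
        if n.action == "add":
            alive.add(n.path)
        else:
            alive = {p for p in alive if not under(p, n.path)}
    leftovers = sorted(p for p in alive if under(p, excluded))
    topmost = [p for p in leftovers
               if not any(q != p and under(p, q) for q in leftovers)]
    final_rev = max(n.rev for n in nodes) + 1
    return kept + [Node(final_rev, "delete", p) for p in topmost]


example = [
    Node(1, "add", "trunk"),
    Node(1, "add", "trunk/dowant"),
    Node(1, "add", "trunk/donotwant"),
    Node(1, "add", "trunk/foobar"),
    Node(2, "add", "trunk/donotwant/test.c"),
    Node(3, "add", "trunk/donotwant/test2.c"),
    Node(3, "add", "trunk/dowant/test.c", copy_from="trunk/donotwant/test.c"),
    Node(4, "delete", "trunk/donotwant/test.c"),
]

result = sanitize(example, "trunk/donotwant")
for n in result:
    print(n.rev, n.action, n.path)
```

In this toy model the node adding test2.c disappears, the delete in revision 4 is restored, and a new revision 5 deletes "trunk/donotwant". The sketch ignores copy source revisions, property changes and everything else a real dump file contains.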
This looks fairly similar to the original, but with a small repository like this and the conservative exclude method, that's no surprise. If we were to use this command instead:
svndumpsanitizer --infile foobar.dump --outfile foobar.sanitized.dump --include trunk/dowant
The results would look like this:
In this redacted version of subversion history the directory "trunk/foobar" has never even existed, nor has any of its contents. It wasn't in the include path, nor were any copies ever made from there to the include path, so everything related to it was systematically excluded.
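As a rough illustration of the include logic, the following Python sketch decides which paths are wanted when only include paths are given. Treating the parent directories of an included path as wanted is my assumption here (the included directory has to live somewhere), and the function name is hypothetical. Copy sources outside the include path, such as "trunk/donotwant/test.c", are deliberately not modelled; those get kept and later deleted as described earlier.

```python
def under(path, prefix):
    """True if `path` is `prefix` itself or lies somewhere below it."""
    return path == prefix or path.startswith(prefix + "/")


def wanted_in_include_mode(path, includes):
    """Hypothetical check: is `path` wanted given a list of include paths?"""
    for inc in includes:
        if under(path, inc):   # the included path itself or something below it
            return True
        if under(inc, path):   # a parent directory of an included path
            return True
    return False


paths = [
    "trunk",
    "trunk/dowant",
    "trunk/dowant/test.c",
    "trunk/donotwant",
    "trunk/donotwant/test2.c",
    "trunk/foobar",
]

kept = [p for p in paths if wanted_in_include_mode(p, ["trunk/dowant"])]
print(kept)
```

Nothing under "trunk/foobar" passes the check, which is why it vanishes from the sanitized history entirely.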
Version 0.8.0 adds support for dropping and renumbering revisions, so starting then there are no serious known limitations. Unlike svndumpfilter, svndumpsanitizer does not have 2 different switches for dropping and renumbering. If you want to drop the empty revisions, you typically want to renumber them as well, and having 2 switches would have added complexity to the code, so I decided against it.
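To show why dropping and renumbering belong together, here is a small Python sketch. It is only a model under assumptions: revisions are `(number, nodes)` pairs starting at revision 0, revision 0 is always kept, and a copy that referred to a dropped revision is pointed at the nearest earlier surviving one. The function names are made up, and the real tool may handle these details differently.

```python
def drop_and_renumber(revisions):
    """Drop revisions without nodes and number the survivors consecutively.

    Returns the new revision list and a mapping old number -> new number.
    """
    mapping = {}
    result = []
    for old_rev, nodes in revisions:
        if old_rev != 0 and not nodes:
            continue                     # empty revision, drop it
        new_rev = len(result)
        mapping[old_rev] = new_rev
        result.append((new_rev, nodes))
    return result, mapping


def remap_copy_source(old_rev, mapping):
    """Map a copy source revision to the newest surviving revision at or before it."""
    survivors = [r for r in mapping if r <= old_rev]
    return mapping[max(survivors)]


revisions = [(0, []), (1, ["a"]), (2, []), (3, ["b"]), (4, []), (5, ["c"])]
renumbered, mapping = drop_and_renumber(revisions)
print(renumbered)
print(mapping)
```

If revisions were dropped without renumbering, the dump would contain gaps; if copy source revisions were not remapped, copies would point at revisions that no longer exist.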
The feature to move/rename directories on the fly was too cumbersome to implement, and would have required a complete re-design of how svndumpsanitizer works. Instead, I settled for implementing the most common use case, which is redefining the repository root (i.e. moving stuff up in the directory structure). This feature was implemented in version 1.1.0, and is still considered to be of beta-quality.
It has been pointed out to me that svndumpsanitizer creates a lot of unnecessary newlines in the sanitized files. This is true. If you look at the changelog you'll see that I tried to address this in version 0.8.2, but eventually got rid of it, because the "fix" introduced a bug, and the problem is only cosmetic. Svnadmin ignores surplus newlines anyway. If you absolutely must have a clean dump file (instead of "merely" a working one), the workaround is to import the dump and then dump again.
Bugs can be reported to daniel[dot]suni[at]gmail[dot]com. If the problem is with a specific dumpfile, please include the offending dumpfile. If the contents of the repository are too sensitive/secret/embarrassing/too freakin' huge to post, then please try to recreate the problem with a simple non-sensitive dumpfile.
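Conceptually, redefining the root boils down to stripping a path prefix from every node. The Python sketch below shows just that idea; the function name and the return conventions (an empty string for the new root itself, `None` for paths outside it) are assumptions for illustration and not the tool's actual behavior.

```python
def redefine_root(path, new_root):
    """Return `path` relative to `new_root`.

    Gives "" for the new root itself and None for paths outside of it.
    """
    if path == new_root:
        return ""
    prefix = new_root + "/"
    if path.startswith(prefix):
        return path[len(prefix):]
    return None


for p in ["trunk/dowant", "trunk/dowant/test.c", "trunk/foobar/notes.txt"]:
    print(repr(p), "->", repr(redefine_root(p, "trunk/dowant")))
```

Moving things up like this is simple because every kept path shares the prefix. Arbitrary moves and renames have no such guarantee, which hints at why the general feature was so much harder.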
You can also try creating a non-sensitive dumpfile by using dumpstrip, a tool that strips out all the data, leaving only the metadata (which is usually the interesting part from a debugging perspective). Dumpstrip is used like this:
dumpstrip --infile foobar.dump --outfile stripped.dump
Oh, and if you can code and use gdb, patches are of course welcome. Thanks, Gary. :-)