
separable repository format #370

Open
jonarchist opened this issue Jan 9, 2016 · 1 comment

I originally intended to ask this on the mailing list, but I can't post or subscribe, so I'm going to go ahead and post it here as a feature request.

The following paragraphs are part of what I intended to post to the mailing list (skip past them if you don't care about the motivation):


Hello!

I'm in need of a new backup solution, and I've been playing with zbackup for a while. I like that it delegates in true Unix fashion.

What's nice about this is that zbackup does not implement off-site transfer of the archive itself. Rather, its repository format cleanly separates data chunks from indexes. Data chunks are never modified, only added, and they are only read when restoring an archive. So you can use whatever you want to transfer the data chunks, and then just delete them locally. Only the index files remain on the system, so new backups can still be created efficiently, even if the data chunks are now missing from the original system.

This is nice because, in my setup, it makes a lot of sense to create a backup archive at one time and transfer it off-site slightly later. Also, I do not have full control over the off-site storage (I can't install arbitrary software), so being able to use whatever software/script I want to transfer the data is a huge benefit.

The downside is that zbackup doesn't implement reading the input files itself. Rather, it takes an input stream (from tar, or any other archive program) and deduplicates it. This is where it gets inefficient: even though the data is deduplicated, it generates a lot of disk I/O, because tar still needs to fully read each file on every run of the backup.
....


These are the two features I'm curious about:

  • Since attic reads the input files itself, it appears that it could be smarter than zbackup about deduplication: if a file's metadata indicates it hasn't changed, don't even attempt to read it (as rsync does).

It seems that this is the case, at least judging from some experiments I did. Can you confirm that?

  • Is attic's repository structure similar to zbackup's? That is, can I create an attic repository locally, synchronize all of its files off-site by my own means, and then safely delete the data chunks without affecting the effectiveness of future backups?

This seems not to be the case. How easy would it be to implement this?
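For concreteness, here is roughly the workflow I have in mind, sketched in Python with made-up directory names (zbackup's real on-disk layout differs; the point is only that immutable chunks can be shipped off-site and deleted while the indexes stay local):

```python
import os
import shutil

def ship_and_prune(chunk_dir, offsite_dir):
    """Copy any chunk files not yet present off-site, then delete the
    local copies. Index files live elsewhere and are never touched, so
    future backups can still deduplicate against the shipped chunks."""
    for name in os.listdir(chunk_dir):
        src = os.path.join(chunk_dir, name)
        dst = os.path.join(offsite_dir, name)
        if not os.path.exists(dst):
            shutil.copy2(src, dst)  # any transfer tool could stand in here
        os.remove(src)              # chunks are only needed for restore
```

In practice the copy step would be rsync, scp, or whatever the off-site storage allows; the key property is that nothing ever needs to read the shipped chunks back.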

Thanks,
johannes

wshanks commented Jan 12, 2016

I am interested in similar use cases, so I'll answer based on what I have found, though I haven't worked on the internals, so I can't answer definitively.

You would probably like to look at this internals documentation in the borgbackup repo.

You would probably also be interested in borgbackup/borg#102 about working with Amazon S3. Working with S3 would be similar to what you want to do, since ideally you would want to send new chunks to S3 without needing to read previous chunks (S3 has additional issues due to its eventual-consistency behavior).

Regarding your first question, a cache of file metadata is maintained to allow skipping unchanged files. This cache can be deleted at the cost of re-reading every file for the next archive. The actual data segments do not seem to be used for this check, so deleting them should be fine, at least as far as the file metadata check is concerned.
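Roughly, that kind of metadata check works like the sketch below. This is illustrative only, not borg's actual code; the real files cache also tracks things like the inode and the ids of the chunks each file was split into:

```python
import os

def file_unchanged(path, cache):
    """rsync-style quick check: if size and mtime match the cached
    values, assume the content is unchanged and skip reading it."""
    st = os.stat(path)
    return cache.get(path) == (st.st_size, st.st_mtime_ns)

def remember(path, cache):
    """Record the file's current metadata after it has been backed up."""
    st = os.stat(path)
    cache[path] = (st.st_size, st.st_mtime_ns)
```

The important point for your question is that this check only consults the cache and a stat() call, never the stored data segments.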

Regarding the second question, as explained in the internals document, attic/borg maintains a manifest of all the archives, which it stores in the final data segment file. Every time a new archive is created, the segment file(s) containing the manifest are deleted and the manifest is written out at the end of the new last segment file. Otherwise, segment files do not seem to be touched by the create command once they have been filled to the maximum size. In my testing, I could delete data files other than the last one and then run the create command without error.

So in principle you could do what you want, as long as you don't delete the last segment file(s) containing the manifest. The fact that the manifest could roll over across multiple segment files makes this slightly trickier than just keeping the last segment file. A safer approach might be to note the time just before a new archive is created and, after that archive is created, delete all of the segment files older than the noted time. Also note that if you are copying to a remote destination without deleting remote files, you will retain all of the old segment files with the old manifests in them.

Still, I don't know that it would be a good idea to do this in practice, since attic/borg are not currently designed with this use case in mind and a future update could make different use of the data segments.
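The timestamp-based approach could look something like this. It is purely illustrative and not a supported operation; the segment file names are hypothetical:

```python
import os

def prune_old_segments(segment_dir, cutoff):
    """Delete segment files last modified before `cutoff`, a timestamp
    noted just before the newest archive was created. Files written
    during or after that archive, including the segment(s) holding the
    fresh manifest, are kept. Returns the names of deleted files."""
    removed = []
    for name in sorted(os.listdir(segment_dir)):
        path = os.path.join(segment_dir, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

Using mtimes avoids having to know exactly which segment files hold the manifest, at the cost of trusting the filesystem clock.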

Regarding implementing support for deleting data locally, I think it would not be too hard, but it would require at least changing how the manifest is stored (i.e. not storing it in the same objects as the backup data). Of course, few commands other than create would work well with missing data segments.
