WIP: fs type detection for linux, dev/inode cache keys #1842
Codecov Report
@@ Coverage Diff @@
## master #1842 +/- ##
=========================================
Coverage ? 82.34%
=========================================
Files ? 21
Lines ? 6769
Branches ? 1164
=========================================
Hits ? 5574
Misses ? 887
Partials ? 308
Hmmm. I'm not sure this is a good direction: this isn't average platform-specific code, it's highly platform-specific and hard to test / maintain. For #909 it would be much easier to make it a flag or env var, not only because this seems quite difficult to get right, but also because only a few cases really profit from #909: things contained in the same archive are already indexed through their (dev, ino) for hardlink handling, which I'd expect to be a more typical use case than the rsync/hardlink importer.
Well, at first I also rather disliked the slightly different APIs, header files, etc. on different platforms. But then I realized that we don't need to support all the platforms / all the filesystems. If a platform is not supported, it will just do it based on the path, as before. That's also the reason why I only whitelisted a few popular filesystems for linux. Also, I realized that giving cmdline options to control inode vs. path only works if all source filesystems behave the same way, and not if you back up a stable-inode fs that has an unstable-inode fs mounted somewhere. Yes, the main motivation for me to do this was the rsnapshot/rsync+hardlink importer. There are some other motivations, though:
Ok, let's try it.
99ad908 to e73e00e
src/borg/cache.py
Outdated
from .remote import cache_if_remote


FS_WITH_STABLE_INODES = {'extfs', 'btrfs', 'xfs', 'zfs', }
hmm, move this to platform code?
guess it depends on the platform's implementation of a filesystem whether it has stable inodes?
Maybe make the platform API just more specific, eg. fs_inodes_stable(path)
ok.
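A minimal sketch of what such a platform-level helper could look like. This is not the PR's code: the per-platform table, the `fstype` callable parameter (a stand-in for the platform's `fstype(path)`), and the `platform` parameter are assumptions made for illustration. The Linux whitelist is the one from the diff above.

```python
import sys

# Assumption: one whitelist per platform, because the same filesystem might
# have stable inodes on one platform's implementation but not on another's.
STABLE_INODE_FS_BY_PLATFORM = {
    'linux': frozenset({'extfs', 'btrfs', 'xfs', 'zfs'}),
}


def fs_inodes_stable(path, fstype, platform=sys.platform):
    """Return True only if path lives on a whitelisted filesystem.

    fstype is a callable mapping a path to a filesystem name (or None);
    it is passed in here so the sketch stays self-contained.
    """
    for prefix, whitelist in STABLE_INODE_FS_BY_PLATFORM.items():
        if platform.startswith(prefix):
            return fstype(path) in whitelist
    # Unknown platform: be conservative, callers fall back to path-based keys.
    return False
```

Unknown platforms and unknown filesystems both yield False, which matches the "just do it based on the path, as before" fallback described earlier in the thread.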
src/borg/cache.py
Outdated
@@ -452,25 +455,34 @@ def chunk_decref(self, id, stats):
        else:
            stats.update(-size, -csize, False)

    def file_known_and_unchanged(self, path_hash, st, ignore_inode=False):
    def file_cache_key(self, hash, path, st, ignore_inode):
        if ignore_inode or fstype(path) not in FS_WITH_STABLE_INODES:
the fstype() call could be done less frequently (only at fs boundaries) and the result kept somewhere.
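One way to do that, sketched under assumptions: remember the result per `st_dev`, so the lookup only runs the first time a device number is seen, i.e. when crossing a filesystem boundary. The class name `FsTypeCache` and the injected `fstype` callable are hypothetical, not part of the PR.

```python
from types import SimpleNamespace


class FsTypeCache:
    """Memoize fstype(path) per device number (st_dev)."""

    def __init__(self, fstype):
        self._fstype = fstype   # callable: path -> filesystem name or None
        self._by_dev = {}       # st_dev -> cached result

    def lookup(self, path, st):
        dev = st.st_dev
        if dev not in self._by_dev:
            # First file seen on this device: do the expensive call once.
            self._by_dev[dev] = self._fstype(path)
        return self._by_dev[dev]


# Usage with a counting stand-in for the real fstype() call.
calls = []

def fake_fstype(path):
    calls.append(path)
    return 'xfs'

cache = FsTypeCache(fake_fstype)
st_a = SimpleNamespace(st_dev=2049)
st_b = SimpleNamespace(st_dev=2050)
results = [
    cache.lookup('/data/a', st_a),
    cache.lookup('/data/b', st_a),   # same device, served from the cache
    cache.lookup('/mnt/c', st_b),    # new device, triggers a second call
]
```

The sketch assumes a device number keeps referring to the same filesystem for the duration of one run.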
src/borg/cache.py
Outdated
    def file_known_and_unchanged(self, path_hash, st, ignore_inode=False):
    def file_cache_key(self, hash, path, st, ignore_inode):
        if ignore_inode or fstype(path) not in FS_WITH_STABLE_INODES:
            # we don't use the path directly (but its hash) to safe memory
*save
v- ditto
oops.
src/borg/cache.py
Outdated
    def file_cache_key(self, hash, path, st, ignore_inode):
        if ignore_inode or fstype(path) not in FS_WITH_STABLE_INODES:
            # we don't use the path directly (but its hash) to safe memory
            cache_key = hash(safe_encode(path))
This would be incompatible with existing files caches (incompatible in the sense of rechunking / data is worthless). Since the keys are different it would also ~double the on-disk size until 10 (cache TTL) archives are created.
Also, hash() is pseudo-random for each Python invocation.
So I guess this bit is more of a placeholder for the actual hash we used before? ;)
ad 1: that's the price (only 1 time a problem per data set). size issue can be avoided by either killing the files cache or using the env var to decrease kept generations.
ad 2: hash is a param of this function, not the builtin.
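To illustrate the point about `hash` being a parameter, here is a rough, self-contained sketch of the two key modes. It is an illustration only: `stable_hash` (plain SHA-256) stands in for whatever deterministic hash the caller passes, the `inodes_stable` flag replaces the `fstype(path)` check, and the `dev:ino` byte encoding is an assumption, not taken from the PR.

```python
import hashlib
import os
from types import SimpleNamespace


def stable_hash(data):
    """Deterministic across Python invocations, unlike the builtin hash()."""
    return hashlib.sha256(data).digest()


def file_cache_key(hash, path, st, ignore_inode, inodes_stable):
    if ignore_inode or not inodes_stable:
        # Path-based key: store the hash, not the path, to save memory.
        return hash(os.fsencode(path))
    # Inode-based key: survives renames on filesystems with stable inodes.
    return hash('{}:{}'.format(st.st_dev, st.st_ino).encode('ascii'))


st = SimpleNamespace(st_dev=2049, st_ino=1234)
by_inode_a = file_cache_key(stable_hash, '/old/name', st, False, True)
by_inode_b = file_cache_key(stable_hash, '/new/name', st, False, True)
by_path = file_cache_key(stable_hash, '/old/name', st, True, True)
```

With inode-based keys a renamed file maps to the same entry, while the path-based key differs from both. That difference is also why switching modes invalidates existing entries, as noted above.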
e73e00e to 72f28b3
fixed & rebased.
the fstype detection could also be interesting for skipping some tests when some feature is not supported or known-broken on some filesystem. See e.g. the atime trouble in #1820.
any other feedback on current code before I rebase / resolve conflicts?
lgtm
man statfs -> f_fsid (and read the "below"). seems made for what we want. it still sucks as everybody uses a different include file, long vs 2 ints, etc. :(
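For experimenting without touching the per-platform headers, Python's `os.statvfs()` exposes an `f_fsid` field as an int (added in Python 3.7). This is only a sketch: the helper name `fsid` is made up here, the value comes from `statvfs`, not `statfs`, so it need not be identical to the raw `statfs` field, and some filesystems may report 0.

```python
import os


def fsid(path):
    """Return the filesystem id for path as an int, or None if unavailable."""
    try:
        # getattr() covers Python versions where f_fsid is not exposed.
        return getattr(os.statvfs(path), 'f_fsid', None)
    except (AttributeError, OSError):
        # AttributeError: os.statvfs is missing on this platform.
        # OSError: path does not exist or cannot be queried.
        return None


current_fsid = fsid('.')
missing_fsid = fsid('/nonexistent/path/for/this/sketch')
```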
e.g. btrfs has stable inode numbers on linux, but elsewhere it could maybe be implemented in a strange way that does not have stable inode numbers.
72f28b3 to c19c1bc
src/borg/platform/linux.pyx
Outdated
@@ -261,6 +261,7 @@ def umount(mountpoint):
cdef extern from "sys/statfs.h":
    struct statfs_t "statfs":
        long f_type
        # fsid_t f_fsid
I could not get this working; I always got compiler errors here ...
src/borg/platform/linux.pyx
Outdated
    return MAGIC_TO_NAME.get(buf.f_type)
    return dict(
        fstype=MAGIC_TO_NAME.get(buf.f_type),
        #fsid=...,
or there.
My goal was to have either a bytestring of appropriate size or an int in the fsid value.
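For reference, a small pure-Python sketch of the magic-number lookup that `MAGIC_TO_NAME.get(buf.f_type)` relies on. The table contents and the helper name `fstype_from_magic` are assumptions: the PR's actual table is not shown in this thread, and the constants below are the well-known Linux superblock magics for the four whitelisted filesystems.

```python
# Linux superblock magic numbers as reported in statfs.f_type.
# Assumption: names chosen to match the whitelist used in cache.py.
MAGIC_TO_NAME = {
    0xEF53: 'extfs',       # ext2/ext3/ext4 share one magic
    0x9123683E: 'btrfs',
    0x58465342: 'xfs',
    0x2FC12FC1: 'zfs',
}


def fstype_from_magic(f_type):
    """Map a raw f_type value to a filesystem name, or None if unknown."""
    # f_type is declared as a (signed) long; where long is 32 bits wide,
    # magics with the top bit set show up negative, so normalize first.
    return MAGIC_TO_NAME.get(f_type & 0xFFFFFFFF)
```

Returning None for unknown magics keeps the whitelist semantics: anything not recognized is treated as not having stable inodes.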
random idea from irc: Have an external program to query for maj:min and ship an example / contrib that uses blkid to get the info. Would only work as root. But most backups that care about these things run as root anyway. That would allow us to just use something that really is able to look into file systems etc. and is maintained by people who work with low-level file system aspects. Also it has a lot more bits, so it should be safer. On Linux the script could do something like this ($1 being the device node):
blkid probably wants a device path... not sure how easy it would be to get that from a mountpoint (in a somewhat portable manner). Or leave it to the script.
another issue that came up: Then the modification check would assume foo == bar, which is not the case. Note: a similar thing could happen with the current code, if the filename stays the same but the content is exchanged within mtime granularity (and the size also stays the same).
I opened #3946 to refer to this stale / blocked PR, so it can be closed.
fstype(path) -> determine filesystem type (at least the most important ones with stable inodes)
has_stable_inodes(path) -> True/False
Use for files cache.