Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: fs type detection for linux, dev/inode cache keys #1842

Closed
wants to merge 4 commits into from

Conversation

ThomasWaldmann
Copy link
Member

@ThomasWaldmann ThomasWaldmann commented Nov 13, 2016

fstype(path) -> determine filesystem type (at least the most important ones with stable inodes)
has_stable_inodes(path) -> True/False

Use for files cache.

@codecov-io
Copy link

codecov-io commented Nov 13, 2016

Codecov Report

❗ No coverage uploaded for pull request base (master@df8205a). Click here to learn what that means.
The diff coverage is 77.27%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1842   +/-   ##
=========================================
  Coverage          ?   82.34%           
=========================================
  Files             ?       21           
  Lines             ?     6769           
  Branches          ?     1164           
=========================================
  Hits              ?     5574           
  Misses            ?      887           
  Partials          ?      308
Impacted Files Coverage Δ
src/borg/platform/__init__.py 76.47% <100%> (ø)
src/borg/platform/base.py 70.66% <50%> (ø)
src/borg/archive.py 82.16% <75%> (ø)
src/borg/cache.py 85.31% <83.33%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df8205a...caf45f3. Read the comment docs.

@enkore
Copy link
Contributor

enkore commented Nov 13, 2016

Hmmm. I'm not sure that this is a good direction, because it isn't the average platform specific code, it's highly platform specific and it's hard to test / maintain. I'd say for #909 it would be much easier to make it a flag or env var. Not only because this seems quite difficult to get right, but also because only a few cases really profit from #909 -- things contained in the same archive are already indexed through their (dev, ino) for hardlink handling, which I'd expect to be a more typical use case than the rsync/hardlink importer.

@ThomasWaldmann
Copy link
Member Author

Well, at first I also rather disliked the slightly different APIs, header files, etc. on different platforms. But then I realized that we don't need to support all the platforms / all the filesystems. If a platform is not supported, it will just do it based on the path, as before. That's also the reason why I only whitelisted a few, popular filesystems for linux.

Also, I realized that giving cmdline options to control inode vs. path only works if all source filesystems work in the same way and not if you backup a stable-inode fs that has a unstable-inode fs mounted somewhere.

Yes, the main motivation for me to do this was the rsnapshot/rsync+hardlink importer.

There are some other motivations, though:

@enkore
Copy link
Contributor

enkore commented Nov 13, 2016

Ok let's try it

@ThomasWaldmann ThomasWaldmann changed the title implement fs type detection for linux, dummy fallback WIP: fs type detection for linux, dev/inode cache keys Nov 13, 2016
from .remote import cache_if_remote

FS_WITH_STABLE_INODES = {'extfs', 'btrfs', 'xfs', 'zfs', }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, move this to platform code?

guess it depends on the platform's implementation of a filesystem whether it has stable inodes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make the platform API just more specific, eg. fs_inodes_stable(path)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@@ -452,25 +455,34 @@ def chunk_decref(self, id, stats):
else:
stats.update(-size, -csize, False)

def file_known_and_unchanged(self, path_hash, st, ignore_inode=False):
def file_cache_key(self, hash, path, st, ignore_inode):
if ignore_inode or fstype(path) not in FS_WITH_STABLE_INODES:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fstype() call could be done less frequently (only at fs boundaries) and the result kept somewhere.

def file_known_and_unchanged(self, path_hash, st, ignore_inode=False):
def file_cache_key(self, hash, path, st, ignore_inode):
if ignore_inode or fstype(path) not in FS_WITH_STABLE_INODES:
# we don't use the path directly (but its hash) to safe memory
Copy link
Contributor

@enkore enkore Nov 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*save

v- dito

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops.

def file_cache_key(self, hash, path, st, ignore_inode):
if ignore_inode or fstype(path) not in FS_WITH_STABLE_INODES:
# we don't use the path directly (but its hash) to safe memory
cache_key = hash(safe_encode(path))
Copy link
Contributor

@enkore enkore Nov 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be incompatible with existing files caches (incompatible in the sense of rechunking / data is worthless). Since the keys are different it would also ~double the on-disk size until 10 (cache TTL) archives are created.

Also, hash() is pseudo-random for each Python invocation.

So I guess this bit is more of a placeholder for the actual hash we used before? ;)

Copy link
Member Author

@ThomasWaldmann ThomasWaldmann Nov 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ad 1: that's the price (only 1 time a problem per data set). size issue can be avoided by either killing the files cache or using the env var to decrease kept generations.

ad 2: hash is a param of this function, not the builtin.

@ThomasWaldmann
Copy link
Member Author

fixed & rebased.

@ThomasWaldmann
Copy link
Member Author

the fstype detection could be also interesting to skip some tests when some feature is not supported or known-broken on some filesystem. See e.g. the atime trouble in #1820.

@ThomasWaldmann
Copy link
Member Author

any other feedback on current code before I rebase / resolve conflicts?

@enkore
Copy link
Contributor

enkore commented Nov 20, 2016

lgtm

@ThomasWaldmann
Copy link
Member Author

man statfs -> f_fsid (and read the "below"). seems like made for what we want.

it still sucks as everybody uses a different include file, long vs 2 ints, etc. :(

e.g. btrfs has stable inode numbers on linux, but elsewhere it could maybe
be implemented in a strange way that does not have stable inode numbers.
@@ -261,6 +261,7 @@ def umount(mountpoint):
cdef extern from "sys/statfs.h":
struct statfs_t "statfs":
long f_type
# fsid_t f_fsid
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not get this working, did always get compiler errors here ...

return MAGIC_TO_NAME.get(buf.f_type)
return dict(
fstype=MAGIC_TO_NAME.get(buf.f_type),
#fsid=...,
Copy link
Member Author

@ThomasWaldmann ThomasWaldmann Nov 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or there.

My goal was to have either a bytestring of appropriate size or an int in the fsid value.

@textshell
Copy link
Member

random idea from irc: Have a external programm to query for maj:min and ship an example / contrib that uses blkid to get the info. Would only work as root. But most backups that care about these things run as root anyway.

That would allow us to just use something that really is able to look into file systems etc and is maintained by people who work with low-level file system aspects. Also it does have a lot more bits, so it should be safer.

On Linux the script could do something like this ($1 beeing the device node):
blkid -o export $1 | grep -e "^UUID" -e "^TYPE" | sha1sum

@enkore
Copy link
Contributor

enkore commented Jan 28, 2017

blkid probably wants a device path... not sure how easy it would be to get that from a mountpoint (in a somewhat portable manner). Or leave it to the script.

@ThomasWaldmann
Copy link
Member Author

ThomasWaldmann commented Feb 27, 2017

another issue that came up:
assume files cache has key (dev_uuid, inode_no) -> value (mtime_ns, size).
assume a specific disk has a file "foo" sitting on inode X of mtime M and size S and content Cfoo.
now we delete foo and write "bar" with content Cbar != Cfoo.
inode X gets reused, size is S and mtime is M (could happen if "foo" modify, delete and creation of "bar" happens within mtime granularity).

Then the modification check would assume foo == bar, which is not the case.

Note: similar thing could happen with current code, if the filename stays same, but content is exchanged within mtime granularity time (and also size stays same).

@ThomasWaldmann
Copy link
Member Author

I opened #3946 to refer to this stale / blocked PR, so it can be closed.

@ThomasWaldmann ThomasWaldmann deleted the fs-detect branch October 6, 2020 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants