-
-
Notifications
You must be signed in to change notification settings - Fork 741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
file_known_and_unchanged is wasteful and should be converted to inodes #909
Comments
Currently, a mapping is used from key path_hash -> value FileCacheEntry (size, mtime, inode, age, chunk_ids). key: path_hash = self.key.id_hash(safe_encode(os.path.join(self.cwd, path))) So, the key hash could be just computed from different input data or get replaced by a tuple. Algorithm:
|
A hash isn't even necessary - we can just pack the device number and the inode number in 16 bytes(!) each. But since we in fact don't even have the requirement of len(key) == 32 here, like we do for objects, we could just as well use Maybe that even uses less memory (but certainly less storage when msgpacked) than the 32 byte string. Edit: Uh, well maybe not :D
= 120 bytes per key. While bytes(32).
Storage, however:
But we can combine both: Store tuples in the file = smaller storage. When reading the file, keys that are tuples are converted to a 32 byte string (= same memory footprint). When writing the file, the reverse is performed. This works, because when writing we know we want to operate in use_inodes_for_cache mode, and when reading, the file just tells us. This is is almost compatible: Old cold doesn't touch it. It would just insert all the tuples as keys. The only problem is that msgpack deserializes tuples as lists, which are mutable, which are therefore not hashable, and therefore don't qualify as keys. Hm. |
Trying to list filesystems with known-stable inode numbers (white list):
Trying to list filesystems with known-unstable inode numbers (black list):
|
Crap, os.statvfs() does not return f_type. Calling stat(v)fs ourselves seem to require some platform-specific code as it is slightly different on different platforms (data type of fs type, include files). Also, we would have to classify a lot of filesystems. Maybe just let the user specify whether inodes shall be considered stable? We already have a commandline option to ignore inodes (used when they are unstable or unsupported) - add another one like |
Yes I think the latter would be a better approach to this. Only some 'special' cases profit from this (but if they do, then quite a lot I suspect); hardlinked files in the same archive are already indexed by (dev, ino) anyway. |
#1842 (comment) see there. |
Re. compatibility with older Borg: in ino-mode, truncate/leave files (this is valid). Store ino files cache as files.ino or so. No problems. This also needs next to no code changes in the loading/saving logic, only the filename has to be changed depending on mode. With some added code the cache could be automatically online, as files are seen during backups, as well. |
The code will dynamically switch between modes, so we need both in parallel. |
#1842 currently blocked by some C/Cython problem (access fsid) - can anybody help with that? |
anyone? |
Small related update:
|
About this ticket: does anybody know how to solve it? See also #3946. The past tries have been unsuccessful - are there new ideas or new insights? Changing this before borg2 gets released would be a good time. |
Moved from #908
I guess another change is to turn cache.file_known_and_unchanged into just inode_known_and_unchanged.
Who cares what's the actual file name, since it's always stored in the archive anyway I believe?
Additional benefit, the "files_cache" huge file in .cache might become smaller as it no longer needs to hold path names.
This poses some difficulties for filesystems without inode numbers like vfat and on Windows, as mentioned by @ThomasWaldmann
Windows (assuming they don't have inodes - I don't know much about ntfs) - some other unique identifier might be used, I guess. For filesystems that have no symlinks (like vfat) - file path might be unique enough. Still some races with content changing underneath, but at least nothing security critical (also hopefully nobody resurrects umsdos fs) (it's tempting to use filesize + create date (+modification date?) as unique id for vfat to handle renames more optimally there, but that might not be safe).
On nfs there's "file handle" that's unique (no idea if it's easily visible from userspace).
fuse and friends are harder, I guess.
ntfs has symlinks and hardlinks, so they must have have unique file ids similar to inode numbers of some sort.
These strategies could be determined by filesystem type (available from statfs(2) and could be called once we cross a mountpoint boundary assuming this is enabled, also separately checked for every path).
It should be noted that since inodes are not unique outside of a normal filesystem, really inode in the table should be some combination of actual inode number and filesystem id (similar to how nfs creates file handles). Filesystem id might be anything from mount path (unsafe) to underlying block device id (unsafe) to fs uuid (not bullet proof either) to some other options.
Using file path as generic fallback for inode number for cases when inode numbers are ignored for other reasons (commandline switch) is also possible, though it should be marked as "potentially suboptimal".
The text was updated successfully, but these errors were encountered: