Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maximum of (ctime, mtime) - what I call "version time" or "vtime" #110

Closed
c-blake opened this issue Jul 20, 2023 · 11 comments
Closed

Maximum of (ctime, mtime) - what I call "version time" or "vtime" #110

c-blake opened this issue Jul 20, 2023 · 11 comments

Comments

@c-blake
Copy link

c-blake commented Jul 20, 2023

Especially (but not exclusively) on systems without btime, I often find it useful to find files newer than a given reference file/time. However, things like cp and tar have a way of altering the time backward or forward. When they do so they update the ctime (aka "the mtime for most inode metadata"). This makes max(ctime, mtime) a useful "synthetic" time stamp.

It seems clearly useful to have this sort of thing built in to a find-like utility in things like -newerXY. I am unaware of tools beyond several of my own which integrate the concept. I mean, I know you could just printf both and take the max in a pipeline, but then you have to produce all the output to run the > max() calculation against (binary->ASCII->binary cycles; blech). Much more efficient to do the various inequalities inside whatever evaluator the find-like tool is using. This also ought to be an extremely easy feature to add.

About the only decision to make is what letter to use. I like lowercase 'v' for "v)ersion", but I am not aware of this "version time" as standard terminology. I kind of doubt there is any more standard name. I partly chose "version" so that 'v' would not collide with 'm' (and also since 'm' would be ambiguous with min, as well as modification), but I find it "suggestive". I could be biased having used this abbreviation/term for almost a quarter century now, though..

{ Personally, I think GNU find is kind of kooky for going with uppercase 'B' for birth time.. as if someone did that first with 'C' for creation as it is sometimes called outside Linux, and then thought better of it and changed to 'b' to disambiguate and then forgot to go lowercase to be consistent or some such silliness. So, if things are case-sensitive at all then you might want to make 'b' and alias for 'B'... but this is (more of) a nitpick. }

@tavianator
Copy link
Owner

My understanding is that any time the mtime is changed, the ctime is set to the current time. So typically you'll have ctime >= mtime and you can just use ctime, unless the mtime is in the future.

If that's a concern, then you can use the fact that

$$a > \max\{b, c\} \iff a > b \;\;\&\;\; a > c$$

to write the equivalent of your proposed -nexerXv as

$ bfs -newerXc ... -and -newerXm ...

and or -newervY you could use -or.

Personally, I think GNU find is kind of kooky for going with uppercase 'B' for birth time

I'm not sure where that convention came from, but I don't think GNU find started it. The BSD finds have -Bnewer etc. while GNU find doesn't.

@c-blake
Copy link
Author

c-blake commented Jul 20, 2023

What prevents you from just using ctime is that mtime can be updated on its own just with file writes. { Although file writes can also happen without mtime updates via writable mmaps which mutate sans system calls.. } So, it's not so rare for mtime to be in the future relative to ctime.

But your -and idea does work as a verbose substitute for the single letter change I'm used to doing... :-) Like shell aliases, this "file age alias" is mostly a question of "how often does this construct arise". Anyway, thanks for the reply. I'll just close.

@c-blake c-blake closed this as completed Jul 20, 2023
@tavianator
Copy link
Owner

What prevents you from just using ctime is that mtime can be updated on its own just with file writes.

I think that would be a kernel bug if that happened. I did see some relevant Linux patches recently, e.g. https://lore.kernel.org/linux-fsdevel/[email protected]/

POSIX requires that when the mtime is updated that the ctime also be updated.

@c-blake
Copy link
Author

c-blake commented Jul 20, 2023

It is some pretty radical news to me that POSIX is (or always has been?!) requiring ctime updates after mtime. 🤯 This strikes me as going against ancient traditions in many OS kernels which makes me wonder about its adoption. Very interesting!

FWIW, I really only mentioned all this because bfs seems to aim to have a "union" of features. There are some not-quite-find-but-similar ideas where that kind of boolean combination must be embedded, such as the "newest N files by various ages, X" which I am pretty sure requires a very small heap (or sorting) where the internal comparisons must be outside the max.

Once you had, say, -vnewest 10, the v would be more natural in that newerXY context. So, I may have introduced the idea in the wrong order. Although all my reasoning is somewhat in The Before Times when mtime could update independently.. Independent update does not, of course, change tar/cp/something sets the mtime into the future or time zones or weirdness, but all that is more rare and arguably a "false positive" for the "version concept" anyway. (I've never been super happy with that word, "version".)

I could also see how adding -Xnewest <int> might be considered out of scope. I don't know any finds that have a -newest/-cnewest/-anewest, though I expect you have more exhaustive knowledge of your competitors. I do find it a useful operation and it also requires efficient recursive stats. Maybe bfs could be the first? :-)

These kinds of "heap filters" might be interesting more generally (e.g. as -biggest N file sizes). I'm pretty sure top N is popular in the SQL world.

@tavianator
Copy link
Owner

It is some pretty radical news to me that POSIX is (or always has been?!) requiring ctime updates after mtime. 🤯 This strikes me as going against ancient traditions in many OS kernels which makes me wonder about its adoption. Very interesting!

Yeah I don't know how the various UNIXes behave in practice here. But I did just check on my computer and echo >foo updates both times.

These kinds of "heap filters" might be interesting more generally (e.g. as -biggest N file sizes). I'm pretty sure top N is popular in the SQL world.

Indeed, and I think this might be doable with fselect.

Almost all of bfs's features are "local" in the sense that they look at each file separately, the exceptions being -s (sorting) and -unique. There's an open ticket for more advanced sorting (#58) but that would still be per-directory like -s. I think sorting the whole result stream is out of scope for bfs, but you can probably cobble together a pipeline to do it, something like:

$ bfs -printf '%C@ %T@ %p\n' | awk '{ k = $1 > $2 ? $1 : $2; print k, $3 }' | sort -n | tail -n10

@c-blake
Copy link
Author

c-blake commented Jul 20, 2023

Yeah. I did go to the LKML archives and search around and while I'm convinced that Jeff Layton thinks POSIX says this to be true, I sure cannot find any primary POSIX-like sources that say so. I may check Austin next. There needs to be a nice version ctl log for all this stuff so I can just git log it... Maybe there is by now.

There is various traffic on LKML searching for ctime mtime update including some that suggests mtime on mmap()d files should also be updated (say when the kernel gets around to flushing the dirty page to backing store) as far back as 2007..2008. I'm pretty sure back then I used kernels which did not do that, though there could have been some lag, and while I haven't written a little test program today to check it, I'd swear I also had to do the equivalent of touch after modify for mmaped writes as recently as 2 years ago. So, the short of all this is - I'm still investigating WTF semantics have become.

The problem with your awk pipeline is that it has to do all that binary -> ASCII -> binary as well as then sort the whole thing -- including all that pathname data IO -- instead of just keeping a tiny heap-structured array of like 10 things (or N for top N). The top N problem is not perfectly local, in your terms, but it can still be very low resource. (The lgN per-heap operation costs are only lg10 here, too.) Well, for top 1 it could basically be more local (but for how the flavor of the CLI might be in terms of emitting only at the end).

Anyway, I know how to do it various ways.. I perhaps should have been more clear that my interest here was more in brainstorming to help you improve your tool, not me solving some more concrete problem. I already have newest that does that heap approach that can even use my personal sys_batch Linux kernel module to do batch stat system calls if I'm in a Caffeine-Fueled Rage Optimization (CFRO) mode. :-)

@tavianator
Copy link
Owner

Yeah. I did go to the LKML archives and search around and while I'm convinced that Jeff Layton thinks POSIX says this to be true, I sure cannot find any primary POSIX-like sources that say so.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_09 says

Each function or utility in POSIX.1-2017 that reads or writes data (even if the data does not change) or performs an operation to change file status (even if the file status does not change) indicates which of the appropriate timestamps shall be marked for update.

And e.g. write() says

Upon successful completion, where nbyte is greater than 0, write() shall mark for update the last data modification and last file status change timestamps of the file

I don't know the details around mmap(). Linux does not particularly care: https://yarchive.net/comp/linux/mtime_mmap.html

The problem with your awk pipeline is that it has to do all that binary -> ASCII -> binary as well as then sort the whole thing -- including all that pathname data IO -- instead of just keeping a tiny heap-structured array of like 10 things (or N for top N). The top N problem is not perfectly local, in your terms, but it can still be very low resource. (The lgN per-heap operation costs are only lg10 here, too.) Well, for top 1 it could basically be more local (but for how the flavor of the CLI might be in terms of emitting only at the end).

True, it is inefficient. (By the way, technically you can extract the top N items with a linear time selection algorithm. You can even do this online. But I'd guess a heap is better in practice.)

Anyway, I know how to do it various ways.. I perhaps should have been more clear that my interest here was more in brainstorming to help you improve your tool, not me solving some more concrete problem.

That's fair! There's basically three ways something gets added to bfs:

  • It's a feature I personally need
  • I don't really need it but I get nerdsniped into adding it anyway ;)
  • Someone else writes it, and I think it's worthwhile

I see the value of a "find top 10 newest files" operation, but I think it doesn't outweigh the effort it would take to implement. I reserve the right to change my mind :)

I already have newest that does that heap approach

Nice!

that can even use my personal sys_batch Linux kernel module to do batch stat system calls if I'm in a Caffeine-Fueled Rage Optimization (CFRO) mode. :-)

Also cool! Though I'm guessing you could use io_uring?

@c-blake
Copy link
Author

c-blake commented Jul 20, 2023

Well, I found it in the write(2) spec by now (as you quote), but I remain curious when this settled, who all honors it, and what prior language was. Hence my bemoaning no git log. The mmap stuff is pretty ambiguous about what if you never call msync or munmap. E.g., the process simply exits and the kernel cleans up. I did find this frustrated article LOL. (Warning - it's a long read). At least for me it's usually a very small heap. Quickselect-wise, I found this which supports a maybe-optimized heapselect (not sure how much those JS results generalize..) I was only trying to help, not nerdsnipe. :-) Cheers!

@c-blake
Copy link
Author

c-blake commented Jul 22, 2023

I forgot that I neglected to answer about the io_uring bit and remembered when I saw you start that repo. Sure, you can use that now on Linux. Part of my writing/using sys_batch is that I had very similar Linux kernel patches back around 1999..2000. So, it was follow-on work that actually pre-dated io_uring (by a lot).

Another part was/is to try to help put an example out there (another example?) not only for Linux, but any OS (or even any distributed system with pricey round-trips) with modules &| expensive system call boundaries, but also with a highly uniform/regular interface (like the Linux syscall interface). Once you have that regularity, a little batch-comparing-jump-forward-only nano-"assembly language" has a simple strength to it. batch.c is only 182 lines and the body of sys_batch itself only 55 lines.

io_uring is probably a bit better, also less general in a few ways. E.g., last I knew there were also still a great many missing system calls while if anything sys_batch should probably black list a few more calls. Optimization is specialization, of course. So, io_uring might be marginally faster (and I should say I do like its ideas!), but I have not benchmarked the two together. (Another possibility is the BPF stuff, but my guess is that would be slower.) sys_batch does copy its args from user to kernel which is not free.

@tavianator
Copy link
Owner

Yeah I looked in more detail after I made that comment and the little VM that's part of sys_batch is pretty cool, and not really something io_uring does. Unless you can drive it with eBPF somehow.

@c-blake
Copy link
Author

c-blake commented Jul 22, 2023

Personally, I am a bit at a loss why this kind of thing is not more prevalent all over and kind of like "old hat" if you know that expression. "Byte code" is freakin' everywhere. Maybe that just comes from the deep penetration of Emacs. Ideas need examples to copy/propaganda/something to spread. This one is kind of "too simple" for most academic paper circuits.. It would be some work to port it to all of the arch/'s Linux supports, of course.

I once asked one of the Plan9 principals why 9p didn't have something like that. He said it was because the NJ group did not have enough feedback from the outside world which I thought was kind of an odd answer (but I think I might have been the 3rd person that particular day to ask him that question).


But I often find myself at a loss for why the world is how it is and why we are so captive to existing toolkits. Reading your hyperfine results on bfs on ycombinator today, it reminded me that when I was writing tim.md, I wasn't sure how to express "Why do people keep reporting error bars reflective of system noise probably not even reproducible by themselves?". I mean, we all know it depends on what background activity is happening, where browsers are in various auto-something-or-other cycles, etc. At least in the millisecond time scale regimes. One mitigation is at least the mean+-stderr of min times of N which everyone also "knows" (sort of), at least as well as they know hot cache(s) != cold. I realize this probably comes off as critical and I don't mean it that way. I'm sure you're just used to running hyperfine foo. And I'm not even perfectly happy with tim, but it at least feels less.. off track.

Anyway, I don't know why I write so much or mean to bother you. I liked the vibe of your "could've invented futexes" post. You seem like a smart fellow who might appreciate hearing my frustrations or share some was all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants