Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up functions that use audeer.path() #136

Closed
hagenw opened this issue Jan 18, 2024 · 6 comments
Closed

Speed up functions that use audeer.path() #136

hagenw opened this issue Jan 18, 2024 · 6 comments

Comments

@hagenw
Copy link
Member

hagenw commented Jan 18, 2024

In #132 we introduced the follow_symlink argumet to audeer.path() which does not convert symlinks into the actual links, which makes it much faster.

I checked all functions, that use audeer.path() and add a discussion here, how we might handle applying the changes there too:

  • audeer.basename_wo_ext(): I would silently use follow_symlink=False as the current behavior for symlinks feels like a bug for me (the returned name might be different if the basename of the link is different from the file name) Not follow symlinks in Ensure basename_wo_ext() #139
  • audeer.common_directory(): I'm not sure what the wanted behavior is here. At the moment it would first convert symlink, which means a/b/c and d/e/f can have a common directory if at least some it is a link. If this is the desired behavior, we should not change it or add follow_symlink as an argument.
  • audeer.create_archive(): it is possible to silently use follow_symlink=False. It would introduce a breaking change as a user can then no longer include symlinks in the files argument that are outside of root, but just link into sub-folders of root.
  • audeer.download_url() we can silently use follow_symlink=False. The only breaking change is that it then would return a symlink instead of the real path (Do now follow symlinks in download_url() #135)
  • audeer.extract_archive(): we should be able to silently use follow_symlink=False without any breaking change (Don't convert symlinks in audeer.extract_archive() #138)
  • audeer.extract_archives(): we should be able to silently use follow_symlink=False without any breaking change (Don't convert symlinks in audeer.extract_archive() #138)
  • audeer.file_extension(): we need to add a follow_symlink argument and pass it on to audeer.path() as it leads to different outputs. Or we could argue as with audeer.basename_wo_ext() that the current behavior is more a bug than a feature and change to use follow_symlink=False
  • audeer.list_dir_names(): I would add a follow_symlink option as the output is directly affected or do nothing
  • audeer.list_file_names(): I would add a follow_symlink option as the output is directly affected or do nothing
  • audeer.mkdir(): we could silently use follow_symlink=False, but it would change the return value. Maybe we can skip it as speed is anyway not an issue here as we have to do an hard disc operation anyway
  • audeer.md5(): we would need to add follow_symlink=False as an option as it directly influences the return value. If we add it here, we also need to add it at audeer.list_file_names()
  • audeer.replace_file_extension(): I would propose to silently use follow_symlink=False here. The current behavior seems to have a bug for symlinks, as it might replace a different file extension than the user expects.
  • audeer.rmdir(): we should be able to silently use follow_symlink=False we cannot change the current implementation as rmdir() does not work on symlinks (TST: test for symbolic link in audeer.rmdir() #137)
  • audeer.touch(): we could silently use follow_symlink. It would be a breaking change as it changes the return value for a symlink
@frankenjoe
Copy link
Collaborator

frankenjoe commented Jan 19, 2024

Puhh, I wasn't expecting that things would become so complicated :)

I am a bit afraid we will add a lot of new complexity here and it will become difficult for a user to predict the outcome of certain functions. E.g. with #139 if we have the following files:

$ touch a.txt
$ ln -s a.txt b.txt

This can happen:

audeer.basename_wo_ext('b.txt')
b

but:

audeer.basename_wo_ext(audeer.path('b.txt'))
a

Before it would at least return a in both cases.

The only solution would be to expose follow_symlink in all functions I think. But maybe we should actually reconsider our decision in #131 and simply go with a breaking change and switch to os.path.abspath() in audeer.path().

@frankenjoe
Copy link
Collaborator

frankenjoe commented Jan 19, 2024

Or we keep follow_symlink in audeer.path() but set it to False by default and not change any of the other functions. The user can then explicitly call audeer.path(..., follow_symlink=True) before passing a path string to those functions if that is really needed.

@hagenw
Copy link
Member Author

hagenw commented Jan 19, 2024

Or we keep follow_symlink in audeer.path() but set it to False by default and not change any of the other functions. The user can then explicitly call audeer.path(..., follow_symlink=True) before passing a path string to those functions if that is really needed.

I was also thinking about this as well. There seems to be only a few function like audeer.rmdir() where the current behavior of first converting to a symlink seems to make sense.
But I'm also a little bit afraid that it can break existing code as we use symlinks for everything on our compute servers.

@frankenjoe
Copy link
Collaborator

In places where we think it's crucial we can explicitly call audeer.path() with follow_symlink=True, e.g. when we expand the root folder cache in audb.

@hagenw
Copy link
Member Author

hagenw commented Jan 19, 2024

I created #140 to set follow_symlink=False as the new default.
I also added a discussion for the two most affected functions, that we maybe would like to change or for which we have to update the docstring.

@hagenw
Copy link
Member Author

hagenw commented Jan 22, 2024

This is now handled by #140 which changes the default of follow_symlink to False in audeer.path(); #141 which adds follow_symlink to audeer.rmdir(), but sets the default to False; by #143 which describes the changed return values and behavior of all other affected functions in the changelog of the new release.

@hagenw hagenw closed this as completed Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants