Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for TXT files as media files #392

Merged
merged 6 commits into from
May 8, 2024
Merged

Add support for TXT files as media files #392

merged 6 commits into from
May 8, 2024

Conversation

hagenw
Copy link
Member

@hagenw hagenw commented Apr 29, 2024

Motivated by audeering/audformat#376 (comment)

We want to support publishing databases, that do contain media files, that don't support audio/video related metadata (e.g. sampling rate). In audformat we can store such data anyway, the only problem was so far the dependency table in audb. Because when publishing a database, audb.core.publish._media_values() was trying to run audiofile.sampling_rate(file) on all media files, which obviously fails for non audio/video files.

There are two ways to solve this:

  • Catch the RuntimeError thrown by audiofile.sampling_rate(file) and treat the file automatically as non audio/video file
  • Set explicitly which non audio/video files should be supported

I have opted for the first solution at the moment, as in audformat we also support every file extension at the moment.
An implementation of the second approach can be inspected at d6e9c7f.


In audb.load() we now raise a RuntimeError when a database is loaded with a requested flavor (e.g. sampling_rate=16000), but contains text files. From the docstring of audb.load():

image

Here the question is if we should change this, and instead of raising an error, only convert files to the requested flavor that we can convert. On the other hand, a user might expect that when using sampling_rate=16000 the loaded database will contain only files, that can be passed on to a model working with audio files provided in a sampling rate of 16000 Hz. For that reason, it felt saver to me, to raise an error.

@maxschmitt
Copy link

For that reason, it felt saver to me, to raise an error.

I agree that an error should be raised when loading a text dataset with audio-specific non-default parameters.

Copy link

@maxschmitt maxschmitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look all good to me.

I am just wondering if it is possible at the moment to have datasets with mixed audio and text media files (or if it should be supported).

@hagenw
Copy link
Member Author

hagenw commented Apr 29, 2024

I am just wondering if it is possible at the moment to have datasets with mixed audio and text media files (or if it should be supported).

This is possible. To make it more explicit, I extended the test to also publish a mixed database.
The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

@maxschmitt
Copy link

This is possible. To make it more explicit, I extended the test to also publish a mixed database. The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

I see, we might not want the warning for text files in mixed databased, but there is no simple fix for that as the files are checked independently, I guess.

@hagenw
Copy link
Member Author

hagenw commented Apr 29, 2024

This is possible. To make it more explicit, I extended the test to also publish a mixed database. The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

I see, we might not want the warning for text files in mixed databased, but there is no simple fix for that as the files are checked independently, I guess.

Yes, the files are checked independently, and a RuntimeError is raised as soon as one file is a text file.
The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

@maxschmitt
Copy link

maxschmitt commented Apr 29, 2024

The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

Both makes sense if this is easy to fix.

I am just thinking of the case where a user loads a mixed dataset and thinks that there is only audio and then runs into an error when iterating over the text media files.

@ChristianGeng
Copy link
Member

Yes, the files are checked independently, and a RuntimeError is raised as soon as one file is a text file. The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

I have not studied the implementation in detail yet, so apologies in case this is an unmotivated comment. My understanding is that this is the situation:

Content expected behavior
(a) contains media only as before flavor ok
(b) contains text only errors when flavored
(c) contains media and text Alternatives in MR: error out
(d) contains neither media or text can this happen?

I believe that (c) is the problematic case.

From an api perspective, would not an implementation that is akin to what pandas often does, - e.g. here - be a good way to proceed:

errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

Of course we cannot coerce, but apart from that, would be having default raise as implemented, and the possibility to ignore , i.e. proceed with conversion where possible be extremely hard to implement?

@maxschmitt
Copy link

(d) contains neither media or text can this happen?

This is what we currently do in the our text databases (as publishing non-audio/video files does not work).
All samples/text is handled by misc tables.

@hagenw
Copy link
Member Author

hagenw commented Apr 29, 2024

For case (d) of an empty database, we do indeed not raise an error, e.g.

import audb
import audeer
import audformat


build_dir = "./build"
audeer.rmdir(build_dir)
audeer.mkdir(build_dir)

db = audformat.Database("mydb")
db.save(build_dir)

host = audeer.path("./host")
repo = "repo"
audeer.rmdir(host)
audeer.mkdir(host, repo)
repository = audb.Repository(repo, host, "file-system")
audb.publish(build_dir, "1.0.0", repository)

audb.config.REPOSITORIES = [repository]
db = audb.load("mydb", version="1.0.0", sampling_rate=16000)

@hagenw
Copy link
Member Author

hagenw commented Apr 29, 2024

From an api perspective, would not an implementation that is akin to what pandas often does, - e.g. here - be a good way to proceed:

errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

Of course we cannot coerce, but apart from that, would be having default raise as implemented, and the possibility to ignore , i.e. proceed with conversion where possible be extremely hard to implement?

This sounds like a good solution. But we might think about a different name than errors, as when you get an error during conversion from mp4 to wav, because you don't have installed ffmpeg it should never continue without the conversion.

I will update the code tomorrow accordingly.

@hagenw
Copy link
Member Author

hagenw commented Apr 30, 2024

Unfortunately, requesting a flavor influences also where the files are cached, and if the files in the index are renamed, etc.
So it is not straightforward to change it.

For now I would propose the following behavior (as currently implemented):

Content expected behavior
(a) contains media only as before flavor ok
(b) contains text only errors when flavored
(c) contains media and text errors when flavored
(d) contains neither media or text as before flavor ok

If we want to have an option to request a flavor for a mixed dataset, we should open an issue for it, and work on it in a follow up pull request.

@hagenw
Copy link
Member Author

hagenw commented May 2, 2024

Before merging this here, I would like to discuss one additional change we might want to consider here.
At the moment we store in the dependency table the type of file, where type refers to

audb/audb/core/define.py

Lines 64 to 69 in 069cc04

class DependType:
r"""Dependency file types."""
META = 0
MEDIA = 1
ATTACHMENT = 2

We could add another value here for media files that are not video or audio, which might also help with how we might pick a meaningful example for a dataset (audeering/audbcards#89).

I opted against this solution, as for me all files, that are not attachments and tables, in a database are media files (e.g. all are handled by the media argument in audb.load()). And I think that we should not add another type, but maybe you have another opinion?

@hagenw
Copy link
Member Author

hagenw commented May 8, 2024

I thought about my last comment, and I would indeed not want to introduce another define.DependType for text files, as MEDIA refers in general to all files that are stored as data, that is listed in an index of some table. This is generic, and should be independent of the actual file format of the data.

@ChristianGeng I think it is fine to add this as "experimental" feature to the new audb release, we are going to do on Friday.
Would this be fine with you as well?

@ChristianGeng
Copy link
Member

I thought about my last comment, and I would indeed not want to introduce another define.DependType for text files, as MEDIA refers in general to all files that are stored as data, that is listed in an index of some table. This is generic, and should be independent of the actual file format of the data.

@ChristianGeng I think it is fine to add this as "experimental" feature to the new audb release, we are going to do on Friday. Would this be fine with you as well?

Yes I am fine with it. The same holds for the handling of "mixed datasets" as suggested above.
Both, but in particular the second point seem to be quite far off what can be seen the scope of this MR.
My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

@hagenw
Copy link
Member Author

hagenw commented May 8, 2024

My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

Yes exactly, and for that reason I would continue with the simple solution introduced here, so we can publish (toy) example datasets with text as media files to make it easier to discuss what else we need to change at a later stage.

Copy link
Member

@ChristianGeng ChristianGeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review.

As discussed in single comments: handling the conglommerate between DependType and mixed datasets and flavors is better discussed in follow up issues and the whole complex tackled in its entirety - these are by far scope breaking.

For now it is ok to simply raise a RuntimeError when non-supported files show up.

Therefore my approval.

@ChristianGeng
Copy link
Member

My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

Yes exactly, and for that reason I would continue with the simple solution introduced here, so we can publish (toy) example datasets with text as media files to make it easier to discuss what else we need to change at a later stage.

Therefore it's approved now :-)

@hagenw hagenw merged commit 44fea3e into main May 8, 2024
7 checks passed
@hagenw hagenw deleted the text-files branch May 8, 2024 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants