Add support for TXT files as media files #392

hagenw · 2024-04-29T07:16:24Z

Motivated by audeering/audformat#376 (comment)

We want to support publishing databases, that do contain media files, that don't support audio/video related metadata (e.g. sampling rate). In audformat we can store such data anyway, the only problem was so far the dependency table in audb. Because when publishing a database, audb.core.publish._media_values() was trying to run audiofile.sampling_rate(file) on all media files, which obviously fails for non audio/video files.

There are two ways to solve this:

Catch the RuntimeError thrown by audiofile.sampling_rate(file) and treat the file automatically as non audio/video file
Set explicitly which non audio/video files should be supported

I have opted for the first solution at the moment, as in audformat we also support every file extension at the moment.
An implementation of the second approach can be inspected at d6e9c7f.

In audb.load() we now raise a RuntimeError when a database is loaded with a requested flavor (e.g. sampling_rate=16000), but contains text files. From the docstring of audb.load():

Here the question is if we should change this, and instead of raising an error, only convert files to the requested flavor that we can convert. On the other hand, a user might expect that when using sampling_rate=16000 the loaded database will contain only files, that can be passed on to a model working with audio files provided in a sampling rate of 16000 Hz. For that reason, it felt saver to me, to raise an error.

maxschmitt · 2024-04-29T09:41:08Z

For that reason, it felt saver to me, to raise an error.

I agree that an error should be raised when loading a text dataset with audio-specific non-default parameters.

maxschmitt

The changes look all good to me.

I am just wondering if it is possible at the moment to have datasets with mixed audio and text media files (or if it should be supported).

hagenw · 2024-04-29T10:20:41Z

I am just wondering if it is possible at the moment to have datasets with mixed audio and text media files (or if it should be supported).

This is possible. To make it more explicit, I extended the test to also publish a mixed database.
The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

maxschmitt · 2024-04-29T11:07:40Z

This is possible. To make it more explicit, I extended the test to also publish a mixed database. The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

I see, we might not want the warning for text files in mixed databased, but there is no simple fix for that as the files are checked independently, I guess.

hagenw · 2024-04-29T11:10:34Z

This is possible. To make it more explicit, I extended the test to also publish a mixed database. The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

I see, we might not want the warning for text files in mixed databased, but there is no simple fix for that as the files are checked independently, I guess.

Yes, the files are checked independently, and a RuntimeError is raised as soon as one file is a text file.
The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

maxschmitt · 2024-04-29T11:58:22Z

The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

Both makes sense if this is easy to fix.

I am just thinking of the case where a user loads a mixed dataset and thinks that there is only audio and then runs into an error when iterating over the text media files.

ChristianGeng · 2024-04-29T12:03:34Z

Yes, the files are checked independently, and a RuntimeError is raised as soon as one file is a text file. The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

I have not studied the implementation in detail yet, so apologies in case this is an unmotivated comment. My understanding is that this is the situation:

	Content	expected behavior
(a)	contains media only	as before flavor ok
(b)	contains text only	errors when flavored
(c)	contains media and text	Alternatives in MR: error out
(d)	contains neither media or text	can this happen?

I believe that (c) is the problematic case.

From an api perspective, would not an implementation that is akin to what pandas often does, - e.g. here - be a good way to proceed:

errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

Of course we cannot coerce, but apart from that, would be having default raise as implemented, and the possibility to ignore , i.e. proceed with conversion where possible be extremely hard to implement?

maxschmitt · 2024-04-29T12:30:18Z

(d) contains neither media or text can this happen?

This is what we currently do in the our text databases (as publishing non-audio/video files does not work).
All samples/text is handled by misc tables.

hagenw · 2024-04-29T12:56:43Z

For case (d) of an empty database, we do indeed not raise an error, e.g.

import audb
import audeer
import audformat


build_dir = "./build"
audeer.rmdir(build_dir)
audeer.mkdir(build_dir)

db = audformat.Database("mydb")
db.save(build_dir)

host = audeer.path("./host")
repo = "repo"
audeer.rmdir(host)
audeer.mkdir(host, repo)
repository = audb.Repository(repo, host, "file-system")
audb.publish(build_dir, "1.0.0", repository)

audb.config.REPOSITORIES = [repository]
db = audb.load("mydb", version="1.0.0", sampling_rate=16000)

hagenw · 2024-04-29T13:01:03Z

From an api perspective, would not an implementation that is akin to what pandas often does, - e.g. here - be a good way to proceed:
errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
Of course we cannot coerce, but apart from that, would be having default raise as implemented, and the possibility to ignore , i.e. proceed with conversion where possible be extremely hard to implement?

This sounds like a good solution. But we might think about a different name than errors, as when you get an error during conversion from mp4 to wav, because you don't have installed ffmpeg it should never continue without the conversion.

I will update the code tomorrow accordingly.

hagenw · 2024-04-30T13:12:23Z

Unfortunately, requesting a flavor influences also where the files are cached, and if the files in the index are renamed, etc.
So it is not straightforward to change it.

For now I would propose the following behavior (as currently implemented):

	Content	expected behavior
(a)	contains media only	as before flavor ok
(b)	contains text only	errors when flavored
(c)	contains media and text	errors when flavored
(d)	contains neither media or text	as before flavor ok

If we want to have an option to request a flavor for a mixed dataset, we should open an issue for it, and work on it in a follow up pull request.

hagenw · 2024-05-02T08:28:24Z

Before merging this here, I would like to discuss one additional change we might want to consider here.
At the moment we store in the dependency table the type of file, where type refers to

audb/audb/core/define.py

Lines 64 to 69 in 069cc04

    
           class DependType: 
        
               r"""Dependency file types.""" 
        
               META = 0 
        
               MEDIA = 1 
        
               ATTACHMENT = 2

We could add another value here for media files that are not video or audio, which might also help with how we might pick a meaningful example for a dataset (audeering/audbcards#89).

I opted against this solution, as for me all files, that are not attachments and tables, in a database are media files (e.g. all are handled by the media argument in audb.load()). And I think that we should not add another type, but maybe you have another opinion?

hagenw · 2024-05-08T08:52:45Z

I thought about my last comment, and I would indeed not want to introduce another define.DependType for text files, as MEDIA refers in general to all files that are stored as data, that is listed in an index of some table. This is generic, and should be independent of the actual file format of the data.

@ChristianGeng I think it is fine to add this as "experimental" feature to the new audb release, we are going to do on Friday.
Would this be fine with you as well?

ChristianGeng · 2024-05-08T13:19:01Z

I thought about my last comment, and I would indeed not want to introduce another define.DependType for text files, as MEDIA refers in general to all files that are stored as data, that is listed in an index of some table. This is generic, and should be independent of the actual file format of the data.

@ChristianGeng I think it is fine to add this as "experimental" feature to the new audb release, we are going to do on Friday. Would this be fine with you as well?

Yes I am fine with it. The same holds for the handling of "mixed datasets" as suggested above.
Both, but in particular the second point seem to be quite far off what can be seen the scope of this MR.
My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

hagenw · 2024-05-08T13:23:15Z

My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

Yes exactly, and for that reason I would continue with the simple solution introduced here, so we can publish (toy) example datasets with text as media files to make it easier to discuss what else we need to change at a later stage.

ChristianGeng

Review.

As discussed in single comments: handling the conglommerate between DependType and mixed datasets and flavors is better discussed in follow up issues and the whole complex tackled in its entirety - these are by far scope breaking.

For now it is ok to simply raise a RuntimeError when non-supported files show up.

Therefore my approval.

ChristianGeng · 2024-05-08T13:25:06Z

My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

Yes exactly, and for that reason I would continue with the simple solution introduced here, so we can publish (toy) example datasets with text as media files to make it easier to discuss what else we need to change at a later stage.

Therefore it's approved now :-)

hagenw added 4 commits April 29, 2024 09:05

Add support for TXT files as media files

d6e9c7f

Catch RuntimeError instead

9c635b3

Fix docstring

b1383e3

Raise RuntimeError when a flavor is requested

97e409e

hagenw requested review from maxschmitt and ChristianGeng April 29, 2024 07:39

Use database instead of dataset in docstring

250240b

maxschmitt approved these changes Apr 29, 2024

View reviewed changes

Add test for mixed text/audio database

509fdb0

hagenw mentioned this pull request May 2, 2024

Adjust example on Datacard based on actual mime-type of media file audeering/audbcards#89

Open

ChristianGeng approved these changes May 8, 2024

View reviewed changes

hagenw merged commit 44fea3e into main May 8, 2024
7 checks passed

hagenw deleted the text-files branch May 8, 2024 13:26

hagenw mentioned this pull request Jun 21, 2024

Document that media files don't have to be audio or video files audeering/audformat#439

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for TXT files as media files #392

Add support for TXT files as media files #392

hagenw commented Apr 29, 2024 •

edited

Loading

maxschmitt commented Apr 29, 2024

maxschmitt left a comment

hagenw commented Apr 29, 2024

maxschmitt commented Apr 29, 2024

hagenw commented Apr 29, 2024 •

edited

Loading

maxschmitt commented Apr 29, 2024 •

edited

Loading

ChristianGeng commented Apr 29, 2024

maxschmitt commented Apr 29, 2024

hagenw commented Apr 29, 2024

hagenw commented Apr 29, 2024

hagenw commented Apr 30, 2024

hagenw commented May 2, 2024

hagenw commented May 8, 2024

ChristianGeng commented May 8, 2024

hagenw commented May 8, 2024

ChristianGeng left a comment •

edited

Loading

ChristianGeng commented May 8, 2024

Add support for TXT files as media files #392

Add support for TXT files as media files #392

Conversation

hagenw commented Apr 29, 2024 • edited Loading

maxschmitt commented Apr 29, 2024

maxschmitt left a comment

Choose a reason for hiding this comment

hagenw commented Apr 29, 2024

maxschmitt commented Apr 29, 2024

hagenw commented Apr 29, 2024 • edited Loading

maxschmitt commented Apr 29, 2024 • edited Loading

ChristianGeng commented Apr 29, 2024

maxschmitt commented Apr 29, 2024

hagenw commented Apr 29, 2024

hagenw commented Apr 29, 2024

hagenw commented Apr 30, 2024

hagenw commented May 2, 2024

hagenw commented May 8, 2024

ChristianGeng commented May 8, 2024

hagenw commented May 8, 2024

ChristianGeng left a comment • edited Loading

Choose a reason for hiding this comment

Review.

ChristianGeng commented May 8, 2024

hagenw commented Apr 29, 2024 •

edited

Loading

hagenw commented Apr 29, 2024 •

edited

Loading

maxschmitt commented Apr 29, 2024 •

edited

Loading

ChristianGeng left a comment •

edited

Loading