Filtering for a list of media files results in an error if the requested media format is not the original one #318

Open
maxschmitt opened this issue Jul 25, 2023 · 6 comments

Comments

@maxschmitt

Description

If a list of media files is requested when loading a database, and the requested format (e.g., flac/wav) differs from the "raw" format of the database, filtering for media files results in an error.
This behaviour occurs with both audb.load(media=...) and audb.load_media().

Version

audb==1.5.1 (also found in previous versions)

Example

import numpy as np

import audb
import audeer
import audformat
import audiofile


DB_ROOT = audeer.mkdir('./tmp/db')
REPO_DIR = audeer.mkdir('./tmp/repo')
NAME = 'mydb'
VERSION = '1.0.0'

# Create database with 1 table and 1 flac file
db = audformat.Database(NAME)
db.schemes['column'] = audformat.Scheme('str')
db['a'] = audformat.Table(audformat.filewise_index(['a.flac']))
db['a']['column'] = audformat.Column(scheme_id='column')
db['a']['column'].set(['a'])
sampling_rate = 16000
for table in list(db.tables):
    for file in db[table].files:
        audiofile.write(
            audeer.path(DB_ROOT, file),
            np.zeros((1, sampling_rate)),
            sampling_rate,
        )
db.save(DB_ROOT)

# Publish database
repository = audb.Repository('tmp-repo', '.', 'file-system')
audb.config.REPOSITORIES = [repository]
audb.publish(DB_ROOT, VERSION, repository)

# Load database with wav format and the one media file with wav extension
audb.load(
    NAME,
    version=VERSION,
    tables=['a'],
    media=['a.wav'],
    format='wav',
    verbose=False,
)

Reason

When filtering for media with load(), filter_deps() is called with the original filenames db.files

In core/load.py, ll. 1132-:

            # filter media
            requested_media = filter_deps(
                media,
                db.files,
                'media',
                name,
                version,
            )

while file extensions are not corrected until ll.1165-:

            # Adjust full paths and file extensions in tables
            _update_path(
                db,
                db_root,
                full_path,
                flavor.format,
                num_workers,
                verbose,
            )

Similarly, load_media considers deps.media as available files:
ll.1479-

    available_files = deps.media
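
To illustrate the failure mode in isolation, here is a minimal sketch of an exact-match filter. The function filter_exact() is hypothetical and is not audb's filter_deps(); it only assumes that requested names are compared against the original file names before any extension is rewritten.

```python
def filter_exact(requested, available):
    """Return the requested names, failing on names that are not available.

    Hypothetical stand-in for an exact-match media filter,
    not the actual audb implementation.
    """
    missing = [name for name in requested if name not in available]
    if missing:
        raise ValueError(f"Could not find media file(s): {missing}")
    return list(requested)


# File names as stored in the published database
original_files = ['a.flac']

# Requesting the original name passes the filter
print(filter_exact(['a.flac'], original_files))

# Requesting the converted name fails, because the comparison
# happens before the extension is changed to the target format
try:
    filter_exact(['a.wav'], original_files)
except ValueError as error:
    print(error)
```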
@hagenw
Member

hagenw commented Jul 26, 2023

Thanks for reporting.
The current idea is to filter by the original name, e.g.

audb.load(
    NAME,
    version=VERSION,
    tables=['a'],
    media=['a.flac'],
    format='wav',
    verbose=False,
)

as this allows you to filter the files first, e.g.

db = audb.load(NAME, version=VERSION, tables=['a'], only_metadata=True)
audb.load(
    NAME,
    version=VERSION,
    tables=['a'],
    media=db.files[0],
    format='wav',
    verbose=False,
)

But also the current behavior results in an error: (see comment below)

>>> audb.load(NAME, version=VERSION, tables=['a'], media=['a.flac'], format='wav', verbose=False)
...
KeyError: 'a.flac'

So, yes, we have an issue here. (see comment below)

When fixing it, I would propose that we fix the desired behavior: it should always work when using media=['a.flac']. But we might also support requesting media=['a.wav'] when format='wav'. (see comment below)


BTW, you found another bug with your example, see #319.

@hagenw
Member

hagenw commented Jul 26, 2023

Sorry, it works for media=['a.flac']. I just forgot to use another cache_root and still had a mydb in my cache:

audb.load(
    NAME,
    version=VERSION,
    tables=['a'],
    media=['a.flac'],
    format='wav',
    verbose=False,
    cache_root='.',
)

returns

name: mydb
source: ''
usage: unrestricted
languages: []
schemes:
  column: {dtype: str}
tables:
  a:
    type: filewise
    columns:
      column: {scheme_id: column}
audb:
  root: /home/audeering.local/hwierstorf/tmp/audb-max/mydb/1.0.0/5690b542
  version: 1.0.0
  flavor: {bit_depth: null, channels: null, format: wav, mixdown: false, sampling_rate: null}
  complete: true

@hagenw
Member

hagenw commented Jul 26, 2023

So, the only remaining question here is if we should support the following as well:

audb.load(
    NAME,
    version=VERSION,
    tables=['a'],
    media=['a.wav'],
    format='wav',
    verbose=False,
    cache_root='.',
)

or if we should extend the documentation of the media argument and state that the original format has to be used?

@maxschmitt
Author

Either way would be fine from my point of view. I was facing this bug when I was re-using code for several databases (loading one database, filtering media files, and then re-loading with a selected list of media files).
I needed to look into the source code to understand that it required file extensions of the original format (I never realized that this particular DB was using .flac files natively). However, a note in the documentation (maybe also in the documentation of dependent toolkits) would also have avoided the issue.
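
For code that is reused across databases, a possible workaround is to map the requested names back to the original names before loading. The following is only a sketch: to_original_names() is a hypothetical helper, not part of audb, and it assumes that a converted file differs from its original only in the extension.

```python
import os


def to_original_names(requested, original_files):
    """Map requested media names to the names stored in the database.

    Hypothetical helper, not part of audb.
    A name that already exists in ``original_files`` is kept as is.
    Otherwise the name is matched by its path without extension,
    and an error is raised if there is no match or more than one.
    """
    originals = set(original_files)
    by_stem = {}
    for file in original_files:
        stem, _ = os.path.splitext(file)
        by_stem.setdefault(stem, []).append(file)

    mapped = []
    for name in requested:
        if name in originals:
            mapped.append(name)
            continue
        stem, _ = os.path.splitext(name)
        matches = by_stem.get(stem, [])
        if len(matches) != 1:
            raise ValueError(
                f"Cannot map '{name}' to a unique original file: {matches}"
            )
        mapped.append(matches[0])
    return mapped


print(to_original_names(['a.wav'], ['a.flac', 'b.flac']))
```

The list of original files could come from db.files after loading with only_metadata=True, and the result could then be passed as media= together with format='wav'.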

@hagenw
Member

hagenw commented Jul 26, 2023

The problem with also allowing media=['a.wav'] when the original file is called a.flac is that the name might refer to a different file that shares the same name apart from its extension; compare also #322.

So for now, I would be more in favor of extending the docstrings instead of changing the current behavior.
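
The ambiguity can be made concrete with a small sketch. It assumes that conversion only swaps the file extension; converted_name_collisions() is a hypothetical helper and not part of audb.

```python
import os
from collections import defaultdict


def converted_name_collisions(original_files, target_format):
    """Group original files that end up with the same converted name.

    Hypothetical helper, assuming conversion only replaces the extension.
    Returns a dict mapping each ambiguous converted name
    to the original files that would share it.
    """
    converted = defaultdict(list)
    for file in original_files:
        stem, _ = os.path.splitext(file)
        converted[f"{stem}.{target_format}"].append(file)
    return {
        name: files
        for name, files in converted.items()
        if len(files) > 1
    }


print(converted_name_collisions(['a.flac', 'a.wav', 'b.flac'], 'wav'))
# {'a.wav': ['a.flac', 'a.wav']}
```

In such a database, media=['a.wav'] with format='wav' could mean either original file, whereas the original names stay unique.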

@maxschmitt
Author

Makes total sense and won't break anything.


2 participants