Requesting a format in audb.load() might lead to duplicated index entries #322

hagenw opened this issue Jul 26, 2023 · 3 comments
@hagenw
hagenw commented Jul 26, 2023

Usually, we expect that an audformat-conform table does not have any duplicated index entries.
But if you have a database with the following table:

file,label
a.wav,'a'
a.flac,'b'

and you request the corresponding database with audb.load(..., format='wav'), you will end up with a table that has duplicated index entries, because a.flac is converted to a.wav:

file,label
a.wav,'a'
a.wav,'b'
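
The collision can be reproduced without audb. The following is only a minimal sketch that models the renaming step with the standard library; `replace_extension` is a hypothetical helper and not the code audb uses internally.

```python
from pathlib import PurePosixPath


def replace_extension(file: str, extension: str) -> str:
    """Return ``file`` with its extension replaced by ``extension``.

    Hypothetical helper, only meant to model the renaming
    that happens when a format is requested.
    """
    return str(PurePosixPath(file).with_suffix(f".{extension}"))


# Index of the example table before requesting a format
files = ["a.wav", "a.flac"]

# Model the effect of requesting format='wav' on the file names
converted = [replace_extension(file, "wav") for file in files]

# Both entries now point to the same file name
has_duplicates = len(set(converted)) < len(converted)
print(converted, has_duplicates)
```

Two distinct entries end up with the same name as soon as they differ only in their extension, which is the situation the table above shows.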
Minimal example

Create database with corresponding a.wav and a.flac files and publish it.

import numpy as np

import audb
import audeer
import audformat
import audiofile


DB_ROOT = audeer.mkdir('./db')
REPO_DIR = audeer.path('./repo')
NAME = 'mydb'
VERSION = '1.0.0'

# Create database with 1 table, 1 wav file, and 1 flac file
db = audformat.Database(NAME)
db.schemes['column'] = audformat.Scheme('str')
db['table'] = audformat.Table(audformat.filewise_index(['a.wav', 'a.flac']))
db['table']['column'] = audformat.Column(scheme_id='column')
db['table']['column'].set(['a', 'b'])
sampling_rate = 16000
for file in db.files:
    audiofile.write(
        audeer.path(DB_ROOT, file),
        np.zeros((1, sampling_rate)),
        sampling_rate,
    )
db.save(DB_ROOT)

# Publish database
repository = audb.Repository(REPO_DIR, '.', 'file-system')
audb.publish(DB_ROOT, VERSION, repository)

When loading without a requested format, everything is fine:

>>> audb.config.REPOSITORIES = [repository]

>>> db = audb.load(NAME, version=VERSION, cache_root='.', full_path=False, verbose=False)

>>> db['table'].df
       column
file         
a.wav       a
a.flac      b

But when requesting format='wav', we get duplicated index entries:

>>> db = audb.load(NAME, version=VERSION, format='wav', cache_root='.', full_path=False, verbose=False)

>>> db['table'].df
      column
file        
a.wav      a
a.wav      b
@maxschmitt

I'm wondering whether it should be allowed at all to store two files that differ only in their file extension.

@hagenw

hagenw commented Jul 27, 2023

Yes, I guess this would be the easiest solution to this problem: extending audb.publish() with a check that requires the index entries to still be unique after their file extensions are removed. At the moment we only check for exact duplicates, see

def _check_for_duplicates(
    db: audformat.Database,
    num_workers: int,
    verbose: bool,
):
    r"""Ensures tables do not contain duplicated index entries."""

    def job(table_id):
        audformat.assert_no_duplicates(db[table_id]._df)

    table_ids = list(db)
    audeer.run_tasks(
        job,
        params=[([table_id], {}) for table_id in table_ids],
        num_workers=num_workers,
        progress_bar=verbose,
        task_description='Check tables for duplicates',
    )

@hagenw

hagenw commented Jul 27, 2023

Maybe we could change

    def job(table_id):
        audformat.assert_no_duplicates(db[table_id]._df)

to

    def job(table_id):
        index = audformat.utils.replace_file_extension(db[table_id].index, '')
        audformat.assert_no_duplicates(index)
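
To illustrate what such a check would catch, here is a rough standard-library sketch of the same idea applied to a plain list of file names. The function names `find_extension_collisions` and `assert_no_extension_collisions` are made up for this example and are not part of audb or audformat; segmented indices with start and end times are ignored.

```python
import os
from collections import defaultdict


def find_extension_collisions(files):
    """Group files that share the same path once the extension is removed.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for file in files:
        stem, _ = os.path.splitext(file)
        groups[stem].append(file)
    # Keep only the stems that are used by more than one file
    return {stem: names for stem, names in groups.items() if len(names) > 1}


def assert_no_extension_collisions(files):
    """Raise an error if two files differ only in their extension."""
    collisions = find_extension_collisions(files)
    if collisions:
        raise ValueError(
            f"Files differ only in their extension: {collisions}"
        )


# Passes: all files stay unique without their extension
assert_no_extension_collisions(["a.wav", "b.flac"])

# Would be rejected at publish time under the proposed check
collisions = find_extension_collisions(["a.wav", "a.flac", "b.wav"])
print(collisions)
```

Returning the colliding groups instead of only raising makes it easier to tell the user which files need to be renamed before publishing.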
