Requesting a format in audb.load() might lead to duplicated index entries #322

hagenw · 2023-07-26T08:35:58Z

Usually, we expect that an audformat-conform table doesn't have any duplicated index entries.
But if you have a database with the following table:

file,label
a.wav,'a'
a.flac,'b'

and you request the database with audb.load(..., format='wav'), you will end up with a table that has duplicated index entries:

file,label
a.wav,'a'
a.wav,'b'
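
The collision can be reproduced without audb. The following is a minimal sketch using only the standard library, assuming that requesting a format amounts to swapping the file extension of every index entry. The helper `request_format` is hypothetical and not part of audb.

```python
import os
from collections import Counter


def request_format(files, fmt):
    """Hypothetical helper: swap the extension of every file for `fmt`."""
    return [os.path.splitext(file)[0] + '.' + fmt for file in files]


index = ['a.wav', 'a.flac']
converted = request_format(index, 'wav')

# Entries that occur more than once after the conversion
duplicates = [file for file, count in Counter(converted).items() if count > 1]

print(converted)   # ['a.wav', 'a.wav']
print(duplicates)  # ['a.wav']
```

The two entries are distinct before the conversion and identical afterwards, which is why the table loaded with format='wav' ends up with a duplicated index.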

Minimal example

Create database with corresponding a.wav and a.flac files and publish it.

import numpy as np

import audb
import audeer
import audformat
import audiofile


DB_ROOT = audeer.mkdir('./db')
REPO_DIR = audeer.path('./repo')
NAME = 'mydb'
VERSION = '1.0.0'

# Create database with 1 table, 1 wav file, and 1 flac file
db = audformat.Database(NAME)
db.schemes['column'] = audformat.Scheme('str')
db['table'] = audformat.Table(audformat.filewise_index(['a.wav', 'a.flac']))
db['table']['column'] = audformat.Column(scheme_id='column')
db['table']['column'].set(['a', 'b'])
sampling_rate = 16000
for file in db.files:
    audiofile.write(
        audeer.path(DB_ROOT, file),
        np.zeros((1, sampling_rate)),
        sampling_rate,
    )
db.save(DB_ROOT)

# Publish database
repository = audb.Repository(REPO_DIR, '.', 'file-system')
audb.publish(DB_ROOT, VERSION, repository)

When loading without a requested format, everything is fine:

>>> audb.config.REPOSITORIES = [repository]

>>> db = audb.load(NAME, version=VERSION, cache_root='.', full_path=False, verbose=False)

>>> db['table'].df
       column
file         
a.wav       a
a.flac      b

But when requesting format='wav', we get duplicated index entries:

>>> db = audb.load(NAME, version=VERSION, format='wav', cache_root='.', full_path=False, verbose=False)

>>> db['table'].df
      column
file        
a.wav      a
a.wav      b


maxschmitt · 2023-07-26T18:53:10Z

I'm just wondering whether it should be allowed at all to store two files that differ only in their file extension.

hagenw · 2023-07-27T05:32:33Z

Yes, I guess this would be the easiest solution to this problem: extending audb.publish() with a check that requires the index entries to still be unique after their file extensions are removed. At the moment we only check for exact duplicates, see

audb/audb/core/publish.py

Lines 19 to 36 in a359988

    
    def _check_for_duplicates(
            db: audformat.Database,
            num_workers: int,
            verbose: bool,
    ):
        r"""Ensures tables do not contain duplicated index entries."""
        def job(table_id):
            audformat.assert_no_duplicates(db[table_id]._df)
        table_ids = list(db)
        audeer.run_tasks(
            job,
            params=[([table_id], {}) for table_id in table_ids],
            num_workers=num_workers,
            progress_bar=verbose,
            task_description='Check tables for duplicates',
        )

hagenw · 2023-07-27T05:35:40Z

Maybe we could change

    def job(table_id):
        audformat.assert_no_duplicates(db[table_id]._df)

to

    def job(table_id):
        index = audformat.utils.replace_file_extension(db[table_id].index, '')
        audformat.assert_no_duplicates(index)
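
The idea behind this check can be sketched with the standard library alone: group the file names by their stem and report every stem that is shared by more than one distinct file. This is only an illustration of the logic, not the audb implementation, and the helper `find_extension_collisions` is hypothetical.

```python
import os
from collections import defaultdict


def find_extension_collisions(files):
    """Return {stem: [files]} for stems shared by more than one distinct file."""
    by_stem = defaultdict(list)
    # dict.fromkeys() drops exact duplicates while keeping the order,
    # so a file listed several times (e.g. with different segments)
    # is not reported as a collision with itself
    for file in dict.fromkeys(files):
        stem, _ = os.path.splitext(file)
        by_stem[stem].append(file)
    return {stem: group for stem, group in by_stem.items() if len(group) > 1}


collisions = find_extension_collisions(['a.wav', 'a.flac', 'b.wav'])
print(collisions)  # {'a': ['a.wav', 'a.flac']}
```

A publish-time check along these lines would raise an error whenever the returned dictionary is not empty.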

hagenw added the load label Jul 26, 2023

hagenw mentioned this issue Jul 26, 2023

Filtering for a list of media files results in an error if the requested media format is not the original one #318

Open
