Random seed is useless in get_molnet_dataset function #418

Minys233 · 2020-03-14T06:03:08Z

This function has a seed=777 argument in its signature:

chainer-chemistry/chainer_chemistry/datasets/molnet/molnet.py

Lines 24 to 28 in 56e83de

    
    def get_molnet_dataset(dataset_name, preprocessor=None, labels=None,
                           split=None, frac_train=.8, frac_valid=.1,
                           frac_test=.1, seed=777, return_smiles=False,
                           return_pdb_id=False, target_index=None, task_index=0,
                           **kwargs):

But the seed is never passed to any splitter. In the same function, the splitter is constructed and called without a seed argument:

chainer-chemistry/chainer_chemistry/datasets/molnet/molnet.py

Lines 104 to 130 in 56e83de

    
    if dataset_config['dataset_type'] == 'one_file_csv':
        split = dataset_config['split'] if split is None else split
        if isinstance(split, str):
            splitter = split_method_dict[split]()
        elif isinstance(split, BaseSplitter):
            splitter = split
        else:
            raise TypeError("split must be None, str or instance of"
                            " BaseSplitter, but got {}".format(type(split)))
        if isinstance(splitter, ScaffoldSplitter):
            get_smiles = True
        else:
            get_smiles = return_smiles
        result = parser.parse(get_molnet_filepath(dataset_name),
                              return_smiles=get_smiles,
                              target_index=target_index, **kwargs)
        dataset = result['dataset']
        smiles = result['smiles']
        train_ind, valid_ind, test_ind = \
            splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                            task_index=task_index,
                                            frac_train=frac_train,
                                            frac_valid=frac_valid,
                                            frac_test=frac_test, **kwargs)

As a result, inside the splitter (here the ScaffoldSplitter), the seed argument keeps its default value of None:

chainer-chemistry/chainer_chemistry/dataset/splitters/scaffold_splitter.py

Lines 62 to 65 in 56e83de

    
    def train_valid_test_split(self, dataset, smiles_list, frac_train=0.8,
                               frac_valid=0.1, frac_test=0.1, converter=None,
                               return_index=True, seed=None,
                               include_chirality=False, **kwargs):

In the implementation, this seed is handed to numpy.random.RandomState. According to the NumPy docs, seed=None means the generator is initialized by reading data from /dev/urandom, so every call starts from a different state.

chainer-chemistry/chainer_chemistry/dataset/splitters/scaffold_splitter.py

Lines 23 to 35 in 56e83de

    
    def _split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
               **kwargs):
        numpy.testing.assert_almost_equal(frac_train + frac_valid + frac_test,
                                          1.)
        seed = kwargs.get('seed', None)
        smiles_list = kwargs.get('smiles_list')
        include_chirality = kwargs.get('include_chirality')
        if len(dataset) != len(smiles_list):
            raise ValueError("The lengths of dataset and smiles_list are "
                             "different")
        rng = numpy.random.RandomState(seed)

This bug makes the data split inconsistent across different models and different runs, even if we explicitly specify the same seed. The default seed of 777 has no effect either.
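To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern. It is not chainer-chemistry code: `toy_split`, `get_toy_dataset_buggy` and `get_toy_dataset_fixed` are hypothetical names, and Python's standard `random.Random` stands in for `numpy.random.RandomState` (both fall back to OS entropy when the seed is None).

```python
import random


def toy_split(n_items, frac_train=0.8, seed=None):
    """Shuffle indices 0..n_items-1 and cut them into train/test parts.

    With seed=None the generator is seeded from OS entropy, so the
    permutation changes from call to call.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(frac_train * n_items)
    return indices[:cut], indices[cut:]


def get_toy_dataset_buggy(n_items, seed=777):
    # Mirrors the reported pattern: `seed` is accepted but never forwarded,
    # so toy_split falls back to seed=None.
    return toy_split(n_items)


def get_toy_dataset_fixed(n_items, seed=777):
    # Forwarding the seed makes the split reproducible.
    return toy_split(n_items, seed=seed)


# Call each variant twice with the same (default) seed and compare.
fixed_a = get_toy_dataset_fixed(1000)
fixed_b = get_toy_dataset_fixed(1000)
buggy_a = get_toy_dataset_buggy(1000)
buggy_b = get_toy_dataset_buggy(1000)
print("fixed splits identical:", fixed_a == fixed_b)
print("buggy splits identical:", buggy_a == buggy_b)
```

The fixed variant returns the same split on both calls. The buggy variant will almost certainly return two different splits, which matches the behaviour described above.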

PS: I used the PyCharm debugger to verify the procedure above.

corochann · 2020-03-14T07:50:24Z

Thank you for the report, @Minys233.

I think you are right. It seems we need to change this line to splitter = split_method_dict[split](seed=seed).

chainer-chemistry/chainer_chemistry/datasets/molnet/molnet.py

Line 108 in 56e83de

    splitter = split_method_dict[split]()
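As a rough illustration of the shape of that change, the sketch below uses a toy splitter that stores its seed at construction time. This is an assumption for illustration only: the thread does not confirm that the real splitter constructors accept a seed keyword, and `ToyRandomSplitter`, `toy_split_method_dict` and `get_toy_splits` are hypothetical names. Python's `random.Random` is used in place of `numpy.random.RandomState`.

```python
import random


class ToyRandomSplitter:
    """Hypothetical splitter that remembers the seed given to its constructor."""

    def __init__(self, seed=None):
        self.seed = seed

    def train_valid_test_split(self, n_items, frac_train=0.8, frac_valid=0.1):
        # A fresh generator per call, seeded with the stored value.
        rng = random.Random(self.seed)
        indices = list(range(n_items))
        rng.shuffle(indices)
        train_end = int(frac_train * n_items)
        valid_end = train_end + int(frac_valid * n_items)
        return (indices[:train_end],
                indices[train_end:valid_end],
                indices[valid_end:])


# Stand-in for split_method_dict: maps a split name to a splitter class.
toy_split_method_dict = {"random": ToyRandomSplitter}


def get_toy_splits(n_items, split="random", seed=777):
    # The proposed change: construct the splitter with the caller's seed
    # instead of calling the class with no arguments.
    splitter = toy_split_method_dict[split](seed=seed)
    return splitter.train_valid_test_split(n_items)


run_1 = get_toy_splits(100)
run_2 = get_toy_splits(100)
print("same seed, same split:", run_1 == run_2)
```

If the real splitters only read the seed from the keyword arguments of train_valid_test_split, as the _split code quoted earlier suggests, forwarding seed=seed in that call would be the equivalent fix.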

JFYI: sorry, the project has moved to a maintenance phase and development has not been very active recently. But I can merge a PR for this bug fix.

https://github.com/chainer/chainer-chemistry/releases/tag/v0.7.0
