Random seed is useless in get_molnet_dataset function #418

Open
Minys233 opened this issue Mar 14, 2020 · 1 comment

In this function, there is a seed=777 argument in the signature, like this:

def get_molnet_dataset(dataset_name, preprocessor=None, labels=None,
                       split=None, frac_train=.8, frac_valid=.1,
                       frac_test=.1, seed=777, return_smiles=False,
                       return_pdb_id=False, target_index=None, task_index=0,
                       **kwargs):
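
For context, here is the kind of call where the seed is expected to matter (a minimal usage sketch, not from the issue; the import paths, the 'bbbp' dataset name, and the preprocessor choice are assumptions about chainer-chemistry's API):

from chainer_chemistry.dataset.preprocessors import GGNNPreprocessor
from chainer_chemistry.datasets.molnet import get_molnet_dataset

# One would expect the same seed to give the same train/valid/test split on
# every run, but because seed is never forwarded to the splitter it does not.
result = get_molnet_dataset('bbbp', preprocessor=GGNNPreprocessor(),
                            split='scaffold', seed=777)
train, valid, test = result['dataset']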

But it is never passed to any splitter. In the same function, the splitter is called without a seed argument:

if dataset_config['dataset_type'] == 'one_file_csv':
    split = dataset_config['split'] if split is None else split
    if isinstance(split, str):
        splitter = split_method_dict[split]()
    elif isinstance(split, BaseSplitter):
        splitter = split
    else:
        raise TypeError("split must be None, str or instance of"
                        " BaseSplitter, but got {}".format(type(split)))
    if isinstance(splitter, ScaffoldSplitter):
        get_smiles = True
    else:
        get_smiles = return_smiles
    result = parser.parse(get_molnet_filepath(dataset_name),
                          return_smiles=get_smiles,
                          target_index=target_index, **kwargs)
    dataset = result['dataset']
    smiles = result['smiles']
    train_ind, valid_ind, test_ind = \
        splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                        task_index=task_index,
                                        frac_train=frac_train,
                                        frac_valid=frac_valid,
                                        frac_test=frac_test, **kwargs)
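
A minimal sketch of one possible fix at this call site, assuming train_valid_test_split forwards its seed keyword down to _split (the signatures quoted below suggest it can), is to pass the seed received by get_molnet_dataset explicitly:

    # Hypothetical variant of the call above (not the library's current code):
    # forward the seed so that the splitter's RandomState is actually seeded.
    train_ind, valid_ind, test_ind = \
        splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                        task_index=task_index,
                                        frac_train=frac_train,
                                        frac_valid=frac_valid,
                                        frac_test=frac_test,
                                        seed=seed, **kwargs)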

Then, in the splitter (here the ScaffoldSplitter), the seed argument still defaults to None:

def train_valid_test_split(self, dataset, smiles_list, frac_train=0.8,
                           frac_valid=0.1, frac_test=0.1, converter=None,
                           return_index=True, seed=None,
                           include_chirality=False, **kwargs):

According to the implementation, seed=None means the random state is initialized by reading data from /dev/urandom, according to the numpy docs.

def _split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
           **kwargs):
    numpy.testing.assert_almost_equal(frac_train + frac_valid + frac_test,
                                      1.)
    seed = kwargs.get('seed', None)
    smiles_list = kwargs.get('smiles_list')
    include_chirality = kwargs.get('include_chirality')
    if len(dataset) != len(smiles_list):
        raise ValueError("The lengths of dataset and smiles_list are "
                         "different")
    rng = numpy.random.RandomState(seed)

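A small self-contained illustration (independent of the library code) of why seed=None breaks reproducibility while an explicit seed does not:

import numpy

# seed=None: each RandomState is seeded from the OS entropy source,
# so the permutation (and any split derived from it) changes on every run.
print(numpy.random.RandomState(None).permutation(5))
print(numpy.random.RandomState(None).permutation(5))

# explicit seed: the permutation is identical on every run.
print(numpy.random.RandomState(777).permutation(5))
print(numpy.random.RandomState(777).permutation(5))
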
This bug causes the data split to be inconsistent across different models and different runs, even if we explicitly specify the same seed, and the default seed 777 is useless.

PS: I used the PyCharm debugger to validate the procedure above.

corochann (Member) commented Mar 14, 2020

Thank you for the report, @Minys233.

I think you are right. It seems we need to change this line to splitter = split_method_dict[split](seed=seed).

splitter = split_method_dict[split]()
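
In context, the proposed change would look roughly like this (a sketch of the suggestion above; it assumes the splitter constructors accept a seed keyword, which may require a small change to the splitters themselves):

if isinstance(split, str):
    splitter = split_method_dict[split](seed=seed)  # pass the seed through
elif isinstance(split, BaseSplitter):
    splitter = split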

JFYI: sorry, we have moved to the maintenance phase and development has not been very active recently. But I can merge a PR for the bug fix.
