Numba parallel weights computation + dataloader #5

loodvn · 2024-03-16T02:01:35Z

We should squash all the commits together: a lot of random scripts from other projects had ended up in the history (I've now removed them).

…to mixed batch joint training. Initialising the bias to mean(y_train) for much better convergence, still not great performance though. Moved parameter reading outside of main function so that we can override the z_dim size

# Conflicts: # EVE/VAE_model.py # calc_weights.py # compute_evol_indices.py # data/mappings/example_mapping.csv # examples/Step0_optional_calc_weights.sh # examples/Step0_optional_calc_weights_slurm.sh # train_VAE.py # utils/data_utils.py # utils/weights.py

aaronkollasch

Posting a few comments for now

aaronkollasch · 2024-03-18T22:23:44Z

README.md

@@ -11,7 +11,7 @@ EVE is a set of protein-specific models providing for any single amino acid muta
 The end to end process to compute EVE scores consists of three consecutive steps:
 1. Train the Bayesian VAE on a re-weighted multiple sequence alignment (MSA) for the protein of interest => train_VAE.py
 2. Compute the evolutionary indices for all single amino acid mutations => compute_evol_indices.py
-3. Train a GMM to cluster variants on the basis of the evol indices then output scores and uncertainties on the class assignments => train_GMM_and_compute_EVE_scores.py
+3. Train a GMM to cluster variants on the basis of the qevol indices then output scores and uncertainties on the class assignments => train_GMM_and_compute_EVE_scores.py


revert please

.gitignore

aaronkollasch · 2024-03-18T22:27:07Z

EVE/VAE_model.py

    def sample_latent(self, mu, log_var):
        """
        Samples a latent vector via reparametrization trick
        """
        eps = torch.randn_like(mu).to(self.device)
-        z = torch.exp(0.5*log_var) * eps + mu
+        z = torch.exp(0.5 * log_var) * eps + mu


it would be nice to keep linting/formatting to a separate PR so this one isn't as cluttered

aaronkollasch · 2024-03-18T22:50:19Z

compute_evol_indices.py


    parser = argparse.ArgumentParser(description='Evol indices')
    parser.add_argument('--MSA_data_folder', type=str, help='Folder where MSAs are stored')
    parser.add_argument('--MSA_list', type=str, help='List of proteins and corresponding MSA file name')
    parser.add_argument('--protein_index', type=int, help='Row index of protein in input mapping file')
-    parser.add_argument('--MSA_weights_location', type=str, help='Location where weights for each sequence in the MSA will be stored')
-    parser.add_argument('--theta_reweighting', type=float, help='Parameters for MSA sequence re-weighting')
+    # parser.add_argument('--MSA_weights_location', type=str, help='Location where weights for each sequence in the MSA will be stored')


should these arguments be deprecated instead of removed entirely?
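One way to deprecate rather than remove could look roughly like the sketch below. It assumes the two options should still parse, so existing shell scripts keep running, while their values are ignored downstream. The `DeprecatedStore` action and the `build_parser` helper are hypothetical names used for illustration and are not part of this PR.

```python
import argparse
import warnings


class DeprecatedStore(argparse.Action):
    """Hypothetical action: accept a retired option, warn, and store its value."""

    def __call__(self, parser, namespace, values, option_string=None):
        # FutureWarning is shown to end users by default, unlike DeprecationWarning.
        warnings.warn(
            f"{option_string} is deprecated and will be ignored",
            FutureWarning,
            stacklevel=2,
        )
        setattr(namespace, self.dest, values)


def build_parser():
    """Build a reduced parser that keeps the two retired options parseable."""
    parser = argparse.ArgumentParser(description="Evol indices")
    parser.add_argument("--protein_index", type=int,
                        help="Row index of protein in input mapping file")
    # Retired options: hidden from --help, but old command lines still parse.
    parser.add_argument("--MSA_weights_location", type=str,
                        action=DeprecatedStore, help=argparse.SUPPRESS)
    parser.add_argument("--theta_reweighting", type=float,
                        action=DeprecatedStore, help=argparse.SUPPRESS)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this shape, an old invocation that still passes `--theta_reweighting 0.2` prints a warning instead of failing with an "unrecognized arguments" error.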

aaronkollasch · 2024-03-18T22:54:38Z

train_VAE.py

+    parser.add_argument('--training_logs_location', type=str,
+                        help='Location of VAE model parameters')
+    parser.add_argument("--seed", type=int, help="Random seed", default=42)
+    parser.add_argument('--z_dim', type=int, help='Specify a different latent dim than in the params file')


this can be done by editing the params config file

aaronkollasch · 2024-03-18T22:58:08Z

train_VAE.py

+    parser.add_argument('--force_load_weights', action='store_true',
+                        help="Force loading of weights from MSA_weights_location (useful if you want to make sure you're using precalculated weights). Will fail if weight file doesn't exist.",
+                        default=False)
+    parser.add_argument("--overwrite_weights",


is --overwrite_weights necessary?

aaronkollasch · 2024-03-18T23:00:40Z

train_VAE.py

+                        action="store_true", default=False)
+    parser.add_argument("--batch_size", type=int,
+                        help="Batch size for training", default=None)
+    parser.add_argument("--experimental_stream_data",


we could consider making this the default and removing the CLI option, if it's judged robust enough

aaronkollasch · 2024-03-18T23:05:06Z

train_VAE.py

-    print("Protein name: "+str(protein_name))
-    print("MSA file: "+str(msa_location))
+
+    if mapping_file["MSA_filename"].duplicated().any():


for the new mapping column names "MSA_filename" and "MSA_theta", we should probably also accept the old column names "protein_name" and "theta", so that legacy mapping files still work
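A minimal sketch of that fallback follows. It assumes the old-to-new pairing is `protein_name` to `MSA_filename` and `theta` to `MSA_theta`, which is inferred from the order in the comment above rather than confirmed. `LEGACY_COLUMN_ALIASES` and `legacy_column_renames` are hypothetical names.

```python
# Assumed pairing of legacy column names to the new ones (not confirmed by the PR).
LEGACY_COLUMN_ALIASES = {
    "protein_name": "MSA_filename",
    "theta": "MSA_theta",
}


def legacy_column_renames(columns):
    """Return {old_name: new_name} for legacy columns whose new name is absent.

    A legacy column is left alone when the new column already exists, so a
    mapping file that carries both names is not silently overwritten.
    """
    present = set(columns)
    return {
        old: new
        for old, new in LEGACY_COLUMN_ALIASES.items()
        if old in present and new not in present
    }
```

In `train_VAE.py` the result could be passed straight to pandas before the duplicate check, e.g. `mapping_file = mapping_file.rename(columns=legacy_column_renames(mapping_file.columns))`.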

Lodevicus Van Niekerk and others added 30 commits May 11, 2021 23:27

Added some slurm scripts, small changes to VAE_model for file handling

4e98cb2

defined instance variables in __init__ for clarity; extracted one_hot…

fcb8d1b

…_encoding function; minor linting, changed seq_name_to_sequence to a string instead of list of chars

change focus_seq_trimmed to a string instead of list of chars; added …

9606f5a

…minor printouts to MSA_processing

error checking in compute_evol_indices

fb13194

passing z_dim into train_VAE script manually passing in args for VAE checkpoint reloading small typos in scripts

merged data_utils

09c8b8b

added joint training, one-hot sequence functions

c29fd56

Merge remote-tracking branch 'origin/master' into master

f0984cf

moved optimizer.zero_grad() outside of if-else

ba75f74

joint training script improvements:

8d54d4b

saving vae checkpoints, checkpoint loading vs train from scratch, added sigmoid+bce loss, added 3 very long functions for mixed/alternating/frozen training modes to switch from command line, added linear model loss weight

Joint training: parameterize lm_loss_weight

a4c7be9

linting

adding EVcouplings versions, some data checks, and need to check that…

ab1055d

… outputs are equal

temp hehe

57b7fb7

committing all ideas for parallelising the weights calculation for re…

9687b7c

…cord, will delete the bad ones

checking on O2 now

d65d435

adding mapping files

0597e02

running as array

636f88f

changed logging dir

cc30589

removed old training flags

b5af044

some slurm script changes

87c30d5

updating conda bin and log output dir

b9ab6b1

print equality

7bf655f

added directory exists checking

d8a0951

running all proteins now

7611fd8

74 MSAs

d235b9d

oops was still debugging

c29ddae

using new MSAs

5993736

wrong MSA location

2b66b67

explicitly setting number of CPUs to use

ad85d42

also testing only 1 cpu

91e4ac0

loodvn and others added 25 commits August 8, 2022 13:25

kicking off scoring, had to recheckout compute_evol_indices from deep…

534b0db

…seq_reproduce

turned DMS filename assertion into just a warning for now, need to fi…

536b207

…gure out properly later

using updated MSA and DMS files (v7?), rerunning training/scoring acc…

2cb5cd0

…ordingly

added new disordered MSA using notebook in disorder_human project

07d3984

using new DMS and MSA mapping and new suffix

aeade79

adpred scripts

f761149

allowed to pass in a MSA file directly to calc_weights instead of a m…

af99627

…apping file also added "identity" weights for completion

reformatted whitespace PEP8

f70712c

some more minor whitespace formatting

6043996

syntax errors: added overwrite_weights to signature and fixed :: synt…

875a625

…ax error

added overwrite_weights option to calc_weights.py

098ce58

added overwrite_weights option to calc_weights.py

cb9aabd

Weights calc:

37375c4

Added progress bar, weights-only calc mode

Training: Added some checks to input/output files

c8249f0

Tweaked progress bar; removed debugging statements

f9c291c

Streaming one-hot-encodings is working well

3709d7d

Fallback to normal mode also works well

Using a --experimental_stream_data flag, a bit cleaner

e30f784

Skipping synonymous mutants in the filtering, fixed tqdm bug

74238ea

Using protein_name in compute_evol_indices, added some logging

455ffaf

Removed weights calculation comparison tests, cleaned up dataloader

c801b52

Computing one-hot encodings on the fly for evol_indices using dataloa…

70f63e2

…der, merged in changes from ProteinGym. Removed the aggregation methods for evol indices.

Using dataloaders for train and validation, use multi-cpu weights by …

e18c56f

…default, tested with DLG4 (cherry picked from commit fcb7894)

Added files back from upstream repo to match master before PR

3d48173

deleted internal scripts

d129b81

loodvn requested review from aaronkollasch, jonnyfrazer, danieldritter and pascalnotin March 16, 2024 02:02

aaronkollasch suggested changes Mar 20, 2024
