RECOVAR is a software tool for analyzing different conformations in heterogeneous cryo-EM and cryo-ET datasets. RECOVAR can reconstruct high-resolution volumes, estimate conformational density and low free-energy motions, and automatically identify subsets of images with a particular volume feature.
Click here to see more details about the features RECOVAR
- High resolution. According to a thorough benchmarking led by Ellen Zhong’s lab, cryobench, it achieves higher resolution than other methods in most cases. See analyzing the output.
- Extracting images based on a volume. If you see a particular feature in a volume generated by RECOVAR, there is a feature to automatically extract the set of images that have created that feature. See extracting
- Accurate estimation of conformational density.
- Estimation of low free-energy motions (if you believe conformational density and free energy are related). See analyzing the output.
- It can use a focus mask. See option
--focus_mask
in pipeline. - It supports tomography data (same format as cryoDRGN-ET). One practical advantage over the cryoDRGN-ET and tomodrgn is that a focus mask can be input. This is a new feature (no paper yet) so it may be less stable. See tomography
- You can also use the RECOVAR volume generation on latent space generated by your favorite heterogeneity method. For example, I have used it to improve cryoDRGN reconstructions. Since the volume generation scheme is transparent, you can be quite confident that it produces no hallucination which may also be useful to validate the output of another method. See kernel regression
If you are interested in the research and methods see preprint here and recorded talk here.
Running RECOVAR:
- 1. Preprocessing
- 2. Specifying a mask
- 3. Running the pipeline
- Uses
pipeline.py
- Uses
- 4. Analyzing results
- Uses
analyze.py
- Generating volumes at specific points:
compute_state.py
- Generating and analyzing trajectories:
compute_trajectory.py
- Extracting image subsets:
- Based on volumes:
extract_image_subset.py
- Based on k-means clusters:
extract_image_subset_from_kmeans.py
- Based on volumes:
- Estimating conformational density:
estimate_conformational_density.py
- Uses
- 5. Visualizing results
Other:
(OUT OF DATE) Peak at what output looks like on a synthetic dataset and real dataset.
Also: using the source code, limitations, contact
To run this code, CUDA and JAX are required. See information about JAX installation here. Assuming you already have CUDA, installation should take less than 5 minutes. Below are a set of commands which runs on our university cluster (Della), but may need to be tweaked to run on other clusters. You may need to load CUDA before installing JAX, E.g., on our university cluster with module load cudatoolkit/12.3
Then create an environment, download JAX-cuda (for some reason the latest version is causing issues, so make sure to use 0.4.23), clone the directory and install the requirements (note the --no-deps flag. This is because of some conflict with dependencies of cryodrgn. Will fix it soon.).
git clone https://github.com/ma-gilles/recovar.git
conda create --name recovar python=3.11 -y
conda activate recovar
pip install -U "jax[cuda12_pip]"==0.4.23 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --no-input
pip install --no-deps -r recovar/recovar_install_requirements.txt --no-input
python -m ipykernel install --user --name=recovar
It is recommanded to test your installation before running on a real dataset, see Testing your installation.
NOTE: Until I fix versioning issues, you need to install cryoDRGN and RECOVAR in two different conda environments.
This means you should install cryoDRGN in the crydrgn environment as below, process the data using the cryoDRGN environment after doing conda activate cryodrgn
, then activate the recovar environement with conda activate recovar
, and go on with the analysis.
The input interface of RECOVAR is borrowed directly from the excellent cryoDRGN toolbox. Particles, poses and CTF must be prepared in the same way, and below is copy-pasted part of cryoDRGN's documentation.
You should first install cryoDRGN, and prepare the dataset as below before going on to step 2.
cryodrgn
may be installed via pip
, and we recommend installing cryodrgn
in a clean conda environment.
# Create and activate conda environment
(base) $ conda create --name cryodrgn python=3.9
(cryodrgn) $ conda activate cryodrgn
# install cryodrgn
(cryodrgn) $ pip install cryodrgn
(NOTE: right now you need to install cryoDRGN and RECOVAR in two different environments, will fix soon!)
You can alternatively install a newer, less stable, development version of cryodrgn
using our beta release channel:
(cryodrgn) $ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ cryodrgn --pre
More installation instructions are found in the documentation.
First resize your particle images using the cryodrgn downsample
command:
$ cryodrgn downsample -h
usage: cryodrgn downsample [-h] -D D -o MRCS [--is-vol] [--chunk CHUNK]
[--datadir DATADIR]
mrcs
Downsample an image stack or volume by clipping fourier frequencies
positional arguments:
mrcs Input images or volume (.mrc, .mrcs, .star, .cs, or .txt)
optional arguments:
-h, --help show this help message and exit
-D D New box size in pixels, must be even
-o MRCS Output image stack (.mrcs) or volume (.mrc)
--is-vol Flag if input .mrc is a volume
--chunk CHUNK Chunksize (in # of images) to split particle stack when
saving
--relion31 Flag for relion3.1 star format
--datadir DATADIR Optionally provide path to input .mrcs if loading from a
.star or .cs file
--max-threads MAX_THREADS
Maximum number of CPU cores for parallelization (default: 16)
--ind PKL Filter image stack by these indices
We recommend first downsampling images to 128x128 since larger images can take much longer to train:
$ cryodrgn downsample [input particle stack] -D 128 -o particles.128.mrcs
The maximum recommended image size is D=256, so we also recommend downsampling your images to D=256 if your images are larger than 256x256:
$ cryodrgn downsample [input particle stack] -D 256 -o particles.256.mrcs
The input file format can be a single .mrcs
file, a .txt
file containing paths to multiple .mrcs
files, a RELION .star
file, or a cryoSPARC .cs
file. For the latter two options, if the relative paths to the .mrcs
are broken, the argument --datadir
can supply the path to where the .mrcs
files are located.
If there are memory issues with downsampling large particle stacks, add the --chunk 10000
argument to save images as separate .mrcs
files of 10k images.
CryoDRGN expects image poses to be stored in a binary pickle format (.pkl
). Use the parse_pose_star
or parse_pose_csparc
command to extract the poses from a .star
file or a .cs
file, respectively.
Example usage to parse image poses from a RELION 3.1 starfile:
$ cryodrgn parse_pose_star particles.star -o pose.pkl -D 300
Example usage to parse image poses from a cryoSPARC homogeneous refinement particles.cs file:
$ cryodrgn parse_pose_csparc cryosparc_P27_J3_005_particles.cs -o pose.pkl -D 300
Note: The -D
argument should be the box size of the consensus refinement (and not the downsampled images from step 1) so that the units for translation shifts are parsed correctly.
CryoDRGN expects CTF parameters to be stored in a binary pickle format (.pkl
). Use the parse_ctf_star
or parse_ctf_csparc
command to extract the relevant CTF parameters from a .star
file or a .cs
file, respectively.
Example usage for a .star file:
$ cryodrgn parse_ctf_star particles.star -D 300 --Apix 1.03 -o ctf.pkl
The -D
and --Apix
arguments should be set to the box size and Angstrom/pixel of the original .mrcs
file (before any downsampling).
Example usage for a .cs file:
$ cryodrgn parse_ctf_csparc cryosparc_P27_J3_005_particles.cs -o ctf.pkl
A real space mask is important to boost SNR. Most consensus reconstruction software output a mask, which you can use as input (--mask=path_to_mask.mrc
). Make sure the mask is not too tight; you can use the input --dilate-mask-iter
to expand the mask if needed. You may also want to use a focusing mask to focus on heterogeneity in one part of the volume click here to find instructions to generate one with Chimera.
If you don't input a mask, you can ask the software to estimate one using the two halfmaps of the mean ( --mask=from_halfmaps
). You may also want to run with a loose spherical mask (option --mask=sphere
) and use the computed variance map to observe which parts have large variance.
To run the RECOVAR pipeline, ensure that the input images (.mrcs), poses (.pkl), and CTF parameters (.pkl) are prepared. Then, execute the following command:
$ python [recovar_directory]/pipeline.py particles.128.mrcs -o output_test --ctf ctf.pkl --poses poses.pkl --mask [path_to_your_mask.mrc]
To view a list of all available options and their usage, use the help command:
$ python pipeline.py -h
usage: pipeline.py [-h] -o OUTDIR [--zdim ZDIM] --poses POSES --ctf pkl --mask mrc [--focus-mask mrc]
[--mask-dilate-iter MASK_DILATE_ITER] [--correct-contrast] [--ignore-zero-frequency]
[--ind PKL] [--uninvert-data UNINVERT_DATA] [--lazy] [--datadir DATADIR] [--n-images N_IMAGES]
[--padding PADDING] [--halfsets HALFSETS] [--keep-intermediate] [--noise-model NOISE_MODEL]
[--mean-fn MEAN_FN] [--accept-cpu] [--test-covar-options] [--low-memory-option]
[--very-low-memory-option] [--dont-use-image-mask] [--do-over-with-contrast] [--tilt-series]
[--tilt-series-ctf TILT_SERIES_CTF] [--dose-per-tilt DOSE_PER_TILT] [--angle-per-tilt ANGLE_PER_TILT]
[--only-mean] [--ntilts NTILTS]
particles
positional arguments:
particles Input particles (.mrcs, .star, .cs, or .txt)
optional arguments:
-h, --help Show this help message and exit
-o OUTDIR, --outdir OUTDIR
Output directory to save model
--zdim ZDIM Dimensions of latent variable. Default = 1, 2, 4, 10, 20
--poses POSES Image poses (.pkl)
--ctf pkl CTF parameters (.pkl)
--mask mrc Solvent mask (.mrc). Options: from_halfmaps, sphere, none
--focus-mask mrc Focus mask (.mrc)
--mask-dilate-iter MASK_DILATE_ITER
Number of iterations to dilate solvent and focus mask
--correct-contrast Estimate and correct for amplitude scaling (contrast) variation across images
--ignore-zero-frequency
Ignore zero frequency. Useful if images are normalized to 0 mean
--tilt-series Use tilt series data
--tilt-series-ctf TILT_SERIES_CTF
Specify CTF for tilt series. Default is the same as tilt series
--dose-per-tilt DOSE_PER_TILT
Specify dose per tilt
--angle-per-tilt ANGLE_PER_TILT
Specify angle per tilt
--only-mean Only compute mean
--ntilts NTILTS Number of tilts to use per tilt series. Default is all
Dataset loading:
--ind PKL Filter particles by these indices (.pkl)
--uninvert-data UNINVERT_DATA
Invert data sign: Options are true, false, automatic (default).
Automatic will swap signs if sum(estimated mean) < 0
--lazy Use lazy loading if the full dataset is too large to fit in memory
--datadir DATADIR Path prefix to particle stack if loading relative paths from a .star or .cs file
--n-images N_IMAGES Number of images to use (useful for quick runs)
--padding PADDING Real-space padding
--halfsets HALFSETS Path to a file with indices of split dataset (.pkl)
--keep-intermediate Save intermediate results (useful for debugging)
--noise-model NOISE_MODEL
Noise model to use. Options: radial (default), white
--mean-fn MEAN_FN Mean function to use. Options: triangular (default), old, triangular_reg
--accept-cpu Accept running on CPU if no GPU is found
--test-covar-options Test different covariance estimation options (for development)
--low-memory-option Use low memory options for covariance estimation
--very-low-memory-option
Use very low memory options for covariance estimation
--dont-use-image-mask Do not use image mask
--do-over-with-contrast
Run again once contrast is estimated
- particles: Input particle stack file (.mrcs, .star, .cs, or .txt)
-o OUTDIR, --outdir OUTDIR
: Output directory to save the model--poses POSES
: Image poses file (.pkl)--ctf pkl
: CTF parameters file (.pkl)--mask mrc
: Solvent mask file (.mrc). Options: from_halfmaps, sphere, none (no mask used).
--focus-mask mrc
: Specify a focus mask file (.mrc). If using only a solvent mask, pass it with--mask
. If only a focus mask is available, you can use--mask=sphere
.--mask-dilate-iter MASK_DILATE_ITER
: Number of iterations to dilate the mask (default = 0).--zdim ZDIM
: Dimensions of the PCA to use for embedding. You can specify one integer (--zdim=20
) or a comma-separated list (--zdim=10,50,100
). Default is--zdim=1,2,4,10,20
with no regularization.--only-mean
: Only compute the mean. Fast and useful to make dataset is properly preprocessed.
After the pipeline is run, you can find the mean, eigenvectors, variance maps, and embeddings in the outdir/results
directory, where outdir is the option given above by -o
. You can run some standard analysis by running:
python analyze.py [pipeline_output_dir] --zdim=10
It will run k-means, generate volumes corresponding to the centers, generate trajectories between pairs of cluster centers, and run UMAP. See more input details below.
If you want to sample many points of the distribution (e.g 100-200), and want to do it fast, you can use --n-clusters=200
and reduce run time (at the cost of resolution) with --n-bins=10
. Afterwards, you can recompute the same cluster to higher resolution using compute_state.py
.
Command:
python analyze.py [result_dir] --zdim=10
Positional Arguments:
result_dir
: Path to the result directory (output directory of the pipeline).
Required Arguments:
--zdim ZDIM
: Dimension of the latent variable (must be a single integer, not a list).
Optional Arguments
-o OUTDIR, --outdir OUTDIR
: Output directory to save model. If not provided, results will be saved inresult_dir/analysis_{zdim}/
.--n-clusters N_CLUSTERS
: Number of k-means clusters (default: 40).--n-trajectories N_TRAJECTORIES
: Number of trajectories to compute between k-means clusters (default: 0).--skip-umap
: Skip UMAP embedding (useful for large datasets where UMAP can be slow).--skip-centers
: Skip generating volumes of the k-means cluster centers.--n-vols-along-path N_VOLS_ALONG_PATH
: Number of volumes to compute along each trajectory (default: 6).--Bfactor BFACTOR
: B-factor for sharpening (default: 0).--n-bins N_BINS
: Number of bins for kernel regression.--maskrad-fraction MASKRAD_FRACTION
: Radius of mask used in kernel regression. Default is 20, which means radius =grid_size/20
pixels, orgrid_size * voxel_size / 20
Ă….--n-min-images N_MIN_IMAGES
: Minimum number of images required for kernel regression. Default is 100 for SPA and 10 particles for tilt series.--density DENSITY
: Path to a.pkl
file containing density data with keysdensity
andlatent_space_bounds
.--normalize-kmeans
: Normalize latent variables (z
) before computing k-means clustering.--no-z-regularization
: Use latent variables without regularization (e.g., use2_noreg
instead of2
).--lazy
: Enable lazy loading if the full dataset is too large to fit in memory.
To generate volumes at specific coordinates in latent space, use the compute_state.py
script. Ensure you specify the coordinates using a .txt
file readable by np.loadtxt
and provide a B-factor if sharpening is required.
Command:
python compute_state.py [pipeline_output_dir] -o [volume_output_dir] --latent-points [zfiles.txt] --Bfactor=[Bfac]
Positional Arguments:
result_dir
: The output directory provided topipeline.py
using the--o
option. Foranalyze.py
, this is set by default toresult_dir/output/analysis_[zdim]
.
Required Arguments:
--latent-points LATENT_POINTS
: Path to latent points (.txt
file). You can use the output of k-means, such asoutput/analysis_2/centers.txt
fromanalyze.py
, or create your own latent points. The file should have a shape of(n_points, zdim)
.
Optional Arguments
-o OUTDIR, --outdir OUTDIR
: Output directory to save all outputs. Foranalyze.py
, it defaults toresult_dir/output/analysis_[zdim]
. For other functions, this argument is required.--Bfactor BFACTOR
: B-factor for sharpening. The B-factor of the consensus reconstruction is likely a good guess. Default is 0, meaning no sharpening.--n-bins N_BINS
: Number of bins for kernel regression. Default is 50, which works well for most cases. This setting was used to generate all figures in the paper.--maskrad-fraction MASKRAD_FRACTION
: Radius of mask used in kernel regression. Default is 20, meaningradius = grid_size/20
pixels orgrid_size * voxel_size / 20
Ă…. The default setting works well for most cases. This was used for all figures in the paper. If using cryo-ET or very noisy (or very non-noisy) data, you might want to adjust this value accordingly. For low-resolution data (less than 128x128 images), consider increasing this value. For high-resolution data (more than 512x512 images), consider decreasing it.--n-min-images N_MIN_IMAGES
: Minimum number of images required for kernel regression. The default is 100 for SPA and 10 particles for tilt series. The default works well for most cases, as it was used to generate all figures in the paper. For cryo-ET or very noisy (or very non-noisy) data, consider adjusting this value.--zdim1
: Enable this if dimension 1 is used. This setting addresses a specific issue withnp.loadtxt
.--no-z-regularization
: If set, uses latent variables without regularization.--lazy
: Enables lazy loading when the full dataset is too large to fit in memory.--particles PARTICLES
: Path to the particle stack dataset. If not specified, the same stack as provided topipeline.py
will be used. Use this option to utilize a higher-resolution stack.--datadir DATADIR
: Path prefix to the particle stack if loading relative paths from a.star
or.cs
file. Similar to the--datadir
option inpipeline.py
. If not specified, the same stack as provided topipeline.py
will be used. Use this option to load a higher-resolution stack.
To compute a high-density (low free energy) trajectory in latent space and generate corresponding volumes, use the compute_trajectory.py
script.
Command:
python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] [options]
result_dir
: The output directory provided topipeline.py
using the-o
option. This is where the results frompipeline.py
are stored.
-
-o OUTDIR, --outdir OUTDIR
: Output directory to save all results. This is a required argument. -
--zdim ZDIM
: Dimension of the latent variable (a single integer). This should match thezdim
used in the pipeline.
To define the start and end points of the trajectory in latent space, you need to specify endpoints using one of the following options:
-
--ind IND
: Indices of points in the list of coordinates to use as endpoints. Provide indices as a comma-separated list, e.g.,--ind 0,1
.This option is used in conjunction with the
--endpts
option. -
--endpts ENDPTS_FILE
: Path to a file containing endpoint coordinates in latent space. The file should be a.txt
file with at least two rows, each containing the coordinates of an endpoint. If the file has more than two lines, it will use the first two lines as endpoints by default. If you want to specify different lines, use the--ind
option to specify the indices of the lines to use. -
--z_st Z_ST_FILE
and--z_end Z_END_FILE
: Paths to files containing the starting point (z_st
) and ending point (z_end
) coordinates in latent space, respectively. Each file should be a.txt
file containing a single row of coordinates.
-
--density DENSITY
: Path to the density file saved in.pkl
format, containing keys'density'
and'latent_space_bounds'
. This is typically generated byestimate_conformational_density.py
, e.g.,density/deconv_density_knee.pkl
. -
--n-vols-along-path N_VOLS_ALONG_PATH
: Number of volumes to compute along the trajectory. Default is 6. -
--Bfactor BFACTOR
: B-factor for sharpening. The B-factor of the consensus reconstruction is a good estimate. Default is 0 (no sharpening). -
--n-bins N_BINS
: Number of bins for kernel regression. Default is 50. -
--maskrad-fraction MASKRAD_FRACTION
: Radius of mask used in kernel regression. Default is 20, which meansradius = grid_size / 20
pixels, orgrid_size * voxel_size / 20
Ă…. -
--n-min-images N_MIN_IMAGES
: Minimum number of images required for kernel regression. Default is 100 for SPA data. -
--lazy
: Use lazy loading if the dataset is too large to fit in memory. -
--particles PARTICLES
: Path to the particle stack dataset. If not specified, the same stack used inpipeline.py
will be used. Use this option if you want to use a higher-resolution stack. -
--datadir DATADIR
: Path prefix to particle stack if loading relative paths from a.star
or.cs
file. -
--zdim1
: Enable this if using a 1-dimensional latent space. This addresses a specific issue withnp.loadtxt
. -
--override_z_regularization
: Override thez
regularization setting from the pipeline. Use with caution.
-
Compute trajectory using indices in a coordinate file:
If you have a coordinate file (e.g.,
kmeans_center_coords.txt
generated byanalyze.py
), you can specify which lines (points) to use as endpoints using--ind
.python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \ --density [density.pkl] --endpts kmeans_center_coords.txt --ind 0,1
This will use the first two lines (indices 0 and 1) in
kmeans_center_coords.txt
as the endpoints. -
Compute trajectory using endpoint coordinates from a file:
If your endpoint file has exactly two lines, you can simply specify:
python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \ --density [density.pkl] --endpts endpoints.txt
If the file has more than two lines, and you want to use specific lines, use
--ind
to specify the indices. -
Compute trajectory using specific start and end point files:
If you have separate
.txt
files for the starting point (z_st.txt
) and ending point (z_end.txt
):python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \ --density [density.pkl] --z_st z_st.txt --z_end z_end.txt
-
The
--density
option is important if you want to compute trajectories that follow regions of high conformational density (low free energy). The density file can be generated usingestimate_conformational_density.py
. -
Ensure that the
zdim
matches the dimension used in the pipeline and in density estimation. -
The volumes generated along the trajectory will be saved in the specified output directory (
--outdir
), typically under subdirectories likevol0000
,vol0001
, etc.
For more details and options, you can view the help message by running:
python compute_trajectory.py -h
Help Output:
usage: compute_trajectory.py [-h] -o OUTDIR [--zdim ZDIM] [--Bfactor BFACTOR] [--n-bins N_BINS]
[--maskrad-fraction MASKRAD_FRACTION] [--n-min-images N_MIN_IMAGES] [--zdim1]
[--no-z-regularization] [--override_z_regularization] [--lazy]
[--particles PARTICLES] [--datadir DATADIR] [--n-vols-along-path N_VOLS_ALONG_PATH]
[--density DENSITY] [--ind IND] [--endpts ENDPTS_FILE]
[--z_st Z_ST_FILE] [--z_end Z_END_FILE]
result_dir
positional arguments:
result_dir Output directory provided to pipeline.py using the --o option.
optional arguments:
-h, --help Show this help message and exit.
-o OUTDIR, --outdir OUTDIR
Output directory to save all outputs. This is a required argument.
--zdim ZDIM Dimension of latent variable (a single int, not a list). Must match the zdim used in the pipeline.
--Bfactor BFACTOR B-factor sharpening. The B-factor of the consensus reconstruction is probably a good guess. Default is 0, which means no sharpening.
--n-bins N_BINS Number of bins for kernel regression. Default is 50 and works well for most cases.
--maskrad-fraction MASKRAD_FRACTION
Radius of mask used in kernel regression. Default = 20, which means radius = grid_size/20 pixels, or grid_size * voxel_size / 20 angstrom.
--n-min-images N_MIN_IMAGES
Minimum number of images to compute kernel regression. Default = 100 for SPA, and 10 particles for tilt series.
--zdim1 Enable if using a 1-dimensional latent space. Addresses an issue with np.loadtxt.
--no-z-regularization
Use unregularized latent variables. This should be consistent with how the density was estimated.
--override_z_regularization
Override the z regularization setting from the pipeline. Use with caution.
--lazy Use lazy loading if the dataset is too large to fit in memory.
--particles PARTICLES
Particle stack dataset. If not specified, the same stack as provided to pipeline.py will be used. Use this option if you want to use a higher resolution stack.
--datadir DATADIR Path prefix to particle stack if loading relative paths from a .star or .cs file.
--n-vols-along-path N_VOLS_ALONG_PATH
Number of volumes to compute along each trajectory (default 6)
--density DENSITY Density saved in pkl file, keys are 'density' and 'latent_space_bounds'.
--ind IND Indices of points in the list of coordinates to use as endpoints. Provide as a comma-separated list, e.g., --ind 0,1.
--endpts ENDPTS_FILE Endpoints file (txt). If it has more than 2 lines, it will use the first two lines as endpoints. If you want to specify different lines, use --ind.
--z_st Z_ST_FILE Starting point file (txt).
--z_end Z_END_FILE Ending point file (txt).
This command will automatically figure out which images produced a particular feature of a volume generated by RECOVAR. You may want to use these images for a downstream tasks, or to re-import into RELION/cryosparc.
You can use cryodrgn_utils write_star
and cryodrgn write_cs
to do that after running extract_image_subset
.
The volume generation method of RECOVAR uses different sets of images to generate different parts of the volume. Thus, you need to tell this function which part of the volume is of interest. This can be done by giving a small mask around a region (note that it is really the center of mass of the mask which is used, so if you give a strange non convex mask, it will give strange result).
python extract_image_subset.py --input-dir [input_directory] --output [output_path] --subvol-idx [subvolume_idx] --mask [mask_file] --coordinate [x,y,z]
-
--input-dir
(required):- Type:
str
- Description: Path to the directory containing the volume data, typically generated by volume-processing scripts like analyze, compute_state, or compute_trajectory. input-dir should point to a folder that looks like
\vol00**\
. For example, if you ran pipeline.py then analyze.py --zdim 20, this folder should exist:[pipeline_output_dir]/output/analysis_20/centers/vol0000/
, which contains all the parameters of the volume stored in[pipeline_output_dir]/output/analysis_20/centers/all_volumes/vol0000.mrc
.
- Type:
-
--output
(required):- Type:
str
- Description: Output path where the
.pkl
file containing the indices of the extracted image subset will be saved.
- Type:
-
--subvol-idx
(optional):- Type:
int
- Description: Specifies the subvolume index to use for extracting the subset of images. This index directs the function to focus on a particular subvolume.
- Type:
-
--mask
(optional):- Type:
str
(file path) - Description: Path to an MRC file that defines a mask around the feature of interest. If provided, the center of mass of the mask is calculated, and the subset of images closest to this center is selected.
- Type:
-
--coordinate
(optional):- Type:
list of int
- Description: Specifies the pixel coordinates of the feature for which images are to be extracted. This should be provided in the format
x,y,z
.
- Type:
-
Extracting by Subvolume Index:
python extract_image_subset.py --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --subvol-idx 3
-
Extracting by Mask:
python extract_image_subset.py --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --mask "/path/to/mask.mrc"
-
Extracting by Coordinates:
python extract_image_subset.py --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --coordinate 50,50,50
- Ensure that the correct path is provided for both the
--input-dir
and--output
parameters. - Only one of
--subvol-idx
,--mask
, or--coordinate
should be provided to define the criteria for image extraction. Providing more than one of these parameters will result in an error.
This command extracts a subset of images based on the k-means clustering indices from a specified .pkl
file. The images can be selected either by directly specifying the k-means indices to keep, or by excluding the specified indices using the inverse option.
python extract_image_subset_from_kmeans.py [path_to_centers] [output_path] [kmeans_indices] [-i]
-
path_to_centers
(required):- Type:
str
- Description: Path to the
centers.pkl
file. This file is usually found in the folder[recovar_output_folder]/output/analysis_[zdim]/centers.pkl
and contains the labels of the k-means clustering.
- Type:
-
output_path
(required):- Type:
str
- Description: Path where the output
.pkl
file containing the indices of the extracted image subset will be saved.
- Type:
-
kmeans_indices
(required):- Type:
list of int
- Description: A comma-separated list of k-means indices that you wish to keep. For example,
20,30,50
would keep the images that belong to these k-means clusters.
- Type:
-
-i, --inverse
(optional):- Type:
bool
- Description: If this option is provided, the function will exclude the images that correspond to the specified k-means indices. By default, the function keeps the images that correspond to the specified indices.
- Type:
-
Extracting images corresponding to specific k-means clusters:
python extract_image_subset_from_kmeans.py /path/to/centers.pkl /path/to/output_indices.pkl 20,30,50
-
Excluding specific k-means clusters:
python extract_image_subset_from_kmeans.py /path/to/centers.pkl /path/to/output_indices.pkl 20,30,50 -i
- The
path_to_centers
must be a valid.pkl
file containing the k-means clustering labels, typically located at[recovar_output_folder]/output/analysis_[zdim]/centers.pkl
. - The
output_path
must also be a.pkl
file where the subset of indices will be saved. - Only one of the
--inverse
option should be used. If provided, the images not in the specified clusters will be saved.
This script estimates the conformational density from RECOVAR results using a deconvolution approach. It processes the results in a specified principal component space and outputs the estimated densities along with related plots.
Note that this conformational density works better if you have a relatively simple distribution, which can be well represented in a small number of PCS (<=4). Often this is achievable by using focused mask to focus the analysis on a part of the biomoecule.
Note that the density estimation requires a regularization parameter all_densities.png
plot and the L-curve.png
plots to make sure the densities don't look overfit at the selected knee. The estimated densities with all values of all_densities/
, and you can select the corresponding one if you do not think the knee selection is accurate, e.g. if you think the a[3]
is the most accurate density on the all_densities.png
plot, select the density all_densities/deconv_density_3.pkl
for further analysis (e.g. by passing it to compute_trajectory.py
).
python estimate_conformational_density.py [recovar_result_dir]
Positional Arguments:
recovar_result_dir
Directory containing RECOVAR results provided topipeline.py
.
Optional Arguments
-
--pca_dim PCA_DIM
Dimension of PCA space in which the density is estimated (default 4). The runtime increases exponentially with this number, so <=5 is recommended. -
--z_dim_used Z_DIM_USED
Dimension of latent variable used (default 4). Should be at least as big as pca_dim, and should be one of the dims used in analyze.py -
--output_dir OUTPUT_DIR
Directory to save the density estimation results. Default isrecovar_result_dir/density/
. -
--percentile_reject PERCENTILE_REJECT
Percentile of data to reject due to large covariance (default: 10%). -
--num_disc_points NUM_DISC_POINTS
Number of discretization points in each dimension for the grid density estimation (default: 50). For example,50^4
points for a 4-dimensional grid. -
--alphas ALPHAS
List of alphas for regularization. Provide as space-separated values, e.g.,--alphas 1e-9 1e-8 1e-7
. If not provided, defaults to values from1e-9
to1e1
in logarithmic spacing. -
--percentile_bound PERCENTILE_BOUND
Reject zs with coordinates above this percentile when deciding the bounds of the grid (default: 1%).
-
Basic Density Estimation:
python estimate_conformational_density.py /path/to/recovar_results --pca_dim 3
-
Advanced Example with Custom Parameters:
python estimate_conformational_density.py /path/to/recovar_results --z_dim_used 4 \ --pca_dim 2 \ --output_dir /path/to/output \ --percentile_reject 15 \ --num_disc_points 100 \ --alphas 1e-9 1e-5 1e-3 1e-1 \ --percentile_bound 5
-
The dimensionality of the deconvolved space (
--pca_dim
) determines the runtime complexity, which grows exponentially with higher dimensions. -
Ensure that the correct
recovar_result_dir
is provided, as it should contain the necessary results generated bypipeline.py
. -
The
--alphas
parameter allows for fine-tuning regularization, and the script will automatically select an optimal alpha based on the L-curve.
The script will create a density/
directory inside the recovar_result_dir
(or in the specified output directory), which includes:
all_densities/
: Contains density estimations for different alphas.deconv_density_knee.pkl
: Density file corresponding to the optimal alpha determined by the knee point of the L-curve.all_densities.png
: Visualization of all densities across different alphas.Lcurve.png
: L-curve plot used for selecting the optimal alpha.
After running the RECOVAR pipeline and the subsequent analysis scripts (e.g., analyze.py
, compute_state.py
, compute_trajectory.py
, estimate_conformational_density.py
), the outputs will be organized in a structured directory tree. Understanding this structure will help you navigate the results and locate the files of interest.
If you are running on a remote server and wish to visualize the results locally, it is recommended to copy only the [output_dir]/output
directory to your local machine, as the model files can be quite large. You can then use visualization tools like ChimeraX to inspect the generated volumes.
Below is the updated directory tree structure (click to expand):
Output File Structure
.
├── analysis_2_noreg # Generated by `analyze.py` with zdim=2
│ ├── contrast_histogram.png # Histogram of contrast values
│ ├── density_plots # Density plots from `estimate_conformational_density.py`
│ │ └── density_01.png
│ ├── density_plots_sliced # Sliced density plots
│ │ └── density_01.png
│ ├── kmeans_center_coords.txt # Coordinates of k-means cluster centers in latent space
│ ├── kmeans_center_volumes # Volumes corresponding to k-means cluster centers
│ │ ├── all_volumes
│ │ │ ├── locres0000.mrc
│ │ │ ├── locres0001.mrc
│ │ │ ├── locres0002.mrc
│ │ │ ├── vol0000.mrc
│ │ │ ├── vol0001.mrc
│ │ │ └── vol0002.mrc
│ │ ├── latent_coords.txt # Latent coordinates of volumes
│ │ ├── vol0000 # Subdirectories for each cluster center volume
│ │ │ ├── half1_unfil.mrc
│ │ │ ├── half2_unfil.mrc
│ │ │ ├── heterogeneity_distances.txt
│ │ │ ├── locres_filtered.mrc
│ │ │ ├── locres_filtered_nob.mrc
│ │ │ ├── locres.mrc
│ │ │ ├── params.pkl
│ │ │ ├── split_choice.pkl
│ │ │ ├── unfil.mrc
│ │ │ └── volume_sampling.mrc
│ │ ├── vol0001
│ │ │ └── [same structure as vol0000]
│ │ └── vol0002
│ │ └── [same structure as vol0000]
│ ├── kmeans_result.pkl # Results of k-means clustering
│ ├── PCA # Principal Component Analysis results
│ │ ├── PC_01.png
│ │ └── PC_01no_annotate.png
│ ├── run.log # Log file from `analyze.py`
│ ├── traj0 # Trajectory directory generated by `compute_trajectory.py`
│ │ ├── all_volumes
│ │ │ ├── locres0000.mrc
│ │ │ ├── locres0001.mrc
│ │ │ ├── locres0002.mrc
│ │ │ ├── locres0003.mrc
│ │ │ ├── locres0004.mrc
│ │ │ ├── locres0005.mrc
│ │ │ ├── vol0000.mrc
│ │ │ ├── vol0001.mrc
│ │ │ ├── vol0002.mrc
│ │ │ ├── vol0003.mrc
│ │ │ ├── vol0004.mrc
│ │ │ └── vol0005.mrc
│ │ ├── density
│ │ │ └── density_01.png
│ │ ├── latent_coords.txt
│ │ ├── path.json # Trajectory path in latent space
│ │ ├── vol0000
│ │ │ └── [volume files as in kmeans_center_volumes]
│ │ ├── vol0001
│ │ │ └── [volume files as in kmeans_center_volumes]
│ │ ├── vol0002
│ │ │ └── [volume files as in kmeans_center_volumes]
│ │ ├── vol0003
│ │ │ └── [volume files as in kmeans_center_volumes]
│ │ ├── vol0004
│ │ │ └── [volume files as in kmeans_center_volumes]
│ │ └── vol0005
│ │ └── [volume files as in kmeans_center_volumes]
│ ├── trajectory_endpoints.pkl # Endpoints used for trajectory computation
│ └── umap # UMAP embedding results
│ ├── kmeans_centers.png
│ ├── kmeans_centers_no_annotate.png
│ ├── sns.png
│ ├── sns_hex.png
│ └── umap_embedding.pkl
├── command.txt # Command used to run the pipeline
├── density # Generated by `estimate_conformational_density.py`
│ ├── all_densities
│ │ ├── deconv_density_0.pkl
│ │ ├── deconv_density_1.pkl
│ │ ├── deconv_density_2.pkl
│ │ ├── deconv_density_3.pkl
│ │ ├── deconv_density_4.pkl
│ │ ├── deconv_density_5.pkl
│ │ ├── deconv_density_6.pkl
│ │ ├── deconv_density_7.pkl
│ │ ├── deconv_density_8.pkl
│ │ ├── deconv_density_9.pkl
│ │ └── deconv_density_10.pkl
│ ├── all_densities.png # Visualization of densities across alphas
│ ├── deconv_density_knee.pkl # Optimal density selected via L-curve
│ └── Lcurve.png # L-curve plot for regularization selection
├── model # Generated by `pipeline.py`
│ ├── covariance_cols.pkl # Covariance matrices
│ ├── embeddings.pkl # Embeddings of the data
│ ├── halfsets.pkl # Information about half-sets
│ ├── params.pkl # Parameters used during pipeline execution
│ └── particles_halfsets.pkl # Indices of particles in each half-set
├── output
│ ├── plots # Plots generated during the pipeline
│ │ ├── contrast_histogram.png # Histogram of contrast values
│ │ ├── eigenvalues.png # Eigenvalues plot
│ │ ├── mean_fsc.png # Fourier Shell Correlation of the mean volume
│ │ └── mean_variance_eigenvolume_plots.png
│ └── volumes # Volumes generated by `pipeline.py`
│ ├── dilated_mask.mrc # Dilated version of the mask
│ ├── eigen_neg*.mrc # Negative eigenvolumes (principal components)
│ ├── eigen_pos*.mrc # Positive eigenvolumes (principal components)
│ ├── focus_mask.mrc # Focus mask if provided
│ ├── mask.mrc # Solvent mask used
│ ├── mean.mrc # Mean volume
│ ├── mean_filt.mrc # Filtered mean volume
│ ├── mean_half1_unfil.mrc # Unfiltered mean volume from half-set 1
│ ├── mean_half2_unfil.mrc # Unfiltered mean volume from half-set 2
│ ├── variance4.mrc # Variance map for zdim=4
│ ├── variance10.mrc # Variance map for zdim=10
│ └── variance20.mrc # Variance map for zdim=20
├── run.log # Log file from the pipeline execution
└── trajectory1 # Generated by `compute_trajectory.py`
├── density
│ └── density_01.png
├── path.json # Trajectory path in latent space
└── vol0000
└── [volume files as in kmeans_center_volumes]
-
analysis_2_noreg/
: Results fromanalyze.py
withzdim=2
and no regularization.contrast_histogram.png
: Histogram of estimated contrast values across images.density_plots/
: Density plots generated from conformational density estimation.kmeans_center_coords.txt
: Coordinates of k-means cluster centers in latent space.kmeans_center_volumes/
: Contains volumes corresponding to k-means cluster centers.all_volumes/
: Aggregated volumes for all centers.vol****.mrc
: Volumes for each cluster center.locres****.mrc
: Locally filtered volumes.
latent_coords.txt
: Latent space coordinates of the volumes.vol****/
: Subdirectories for each cluster center containing detailed volume files.
kmeans_result.pkl
: Results of the k-means clustering, including labels and centers.PCA/
: Principal Component Analysis plots.PC_01.png
: Plot of the first principal component.PC_01no_annotate.png
: Same plot without annotations.
traj0/
: Trajectory results fromcompute_trajectory.py
.all_volumes/
: Volumes along the trajectory.density/
: Density plots along the trajectory.path.json
: JSON file containing the trajectory path.vol****/
: Volumes at specific points along the trajectory.
trajectory_endpoints.pkl
: Endpoints used for trajectory computations.umap/
: UMAP embedding results.sns.png
,sns_hex.png
: Scatter plots of the embeddings.kmeans_centers.png
: UMAP plot showing k-means centers.umap_embedding.pkl
: UMAP embedding coordinates.
-
command.txt
: Contains the command used to run the pipeline, useful for record-keeping and reproducing results. -
density/
: Generated byestimate_conformational_density.py
.all_densities/
: Deconvolved densities for different regularization parameters (alphas
).deconv_density_*.pkl
: Density files for each alpha value.
all_densities.png
: Visualization of all densities across different alphas.deconv_density_knee.pkl
: Deconvolved density corresponding to the optimal regularization parameter (knee point).Lcurve.png
: L-curve plot used for selecting the optimal alpha.
-
model/
: Generated bypipeline.py
.covariance_cols.pkl
: Covariance matrices of the columns.embeddings.pkl
: Embeddings of the data.halfsets.pkl
: Information about half-sets used in the analysis.params.pkl
: Parameters used during the pipeline execution.particles_halfsets.pkl
: Indices of particles in each half-set.
-
output/
: Main output directory containing results and volumes.plots/
: Various plots generated during the pipeline.contrast_histogram.png
: Histogram of estimated contrast values.eigenvalues.png
: Plot showing the decay of eigenvalues.mean_fsc.png
: Fourier Shell Correlation of the mean volume.mean_variance_eigenvolume_plots.png
: Combined plots of mean, variance, and eigenvolumes.
volumes/
: Volumes generated bypipeline.py
.mean.mrc
: The mean volume reconstructed from the data.mean_filt.mrc
: Filtered mean volume.mean_half1_unfil.mrc
&mean_half2_unfil.mrc
: Unfiltered mean volumes from half-sets.variance*.mrc
: Variance maps computed for different dimensions.eigen_pos*.mrc
&eigen_neg*.mrc
: Positive and negative eigenvolumes representing principal components of variability.mask.mrc
: The solvent mask used during reconstruction.focus_mask.mrc
: The focus mask if provided.dilated_mask.mrc
: Dilated version of the mask.
-
run.log
: Log file containing messages and errors during the pipeline execution. -
trajectory1/
: Generated bycompute_trajectory.py
.density/
: Density plots along the trajectory.density_01.png
: Density plot for the first pair of dimensions.
path.json
: JSON file containing the trajectory path in latent space.vol0000/
: Volume at the starting point of the trajectory.- [Contains volume files as in
kmeans_center_volumes/vol****/
]
- [Contains volume files as in
-
pipeline.py
: Generates themodel/
directory and theoutput/
directory.- Computes the mean volume, variance maps, eigenvolumes, and saves necessary parameters and embeddings.
- Generates plots such as eigenvalues and FSC curves.
-
analyze.py
: Creates theanalysis_[zdim]_noreg/
directories.- Performs k-means clustering, generating cluster center volumes, and computing UMAP and PCA embeddings.
- Generates contrast histograms and density plots.
-
compute_state.py
: Generates volumes at specific points in latent space provided by the user.- Outputs are stored in directories like
vol****/
underkmeans_center_volumes/
or specified output directory.
- Outputs are stored in directories like
-
compute_trajectory.py
: Produces thetraj0/
directory withinanalysis_[zdim]_noreg/
or a separate trajectory directory.- Contains volumes along a trajectory in latent space, density plots, and the trajectory path (
path.json
).
- Contains volumes along a trajectory in latent space, density plots, and the trajectory path (
-
estimate_conformational_density.py
: Creates thedensity/
directory with deconvolved density estimations, plots, and related data.- Computes the conformational density in the principal component space and selects the optimal regularization parameter using the L-curve method.
For visualization purposes, especially when working remotely, it's recommended to copy only the necessary output directories (e.g., output/
, analysis_[zdim]_noreg/
, trajectory1/
, density/
) to your local machine. This allows you to use visualization tools without transferring large model files.
You can then use software like ChimeraX to open and visualize the .mrc
volume files found in these directories.
You can visualize the results using this notebook, which will show a bunch of results including:
- the FSC of the mean estimation, which you can interpret as an upper bound of the resolution you can expect.
- decay of eigenvalues to help you pick the right
zdim
- and standard clustering visualization (borrowed from the cryoDRGN output).
To verify that RECOVAR is installed properly and functioning correctly, you can run a small test using a synthetic dataset. The script run_test_dataset.py
is already included in the repository and will:
- Generate a small synthetic dataset.
- Run the RECOVAR pipeline on this dataset.
- Perform analysis, including clustering and embedding.
- Estimate conformational density.
- Compute trajectories in the latent space.
- Report which steps passed or failed.
- Delete the test dataset directory at the end if all steps pass.
-
Activate your
recovar
environment:conda activate recovar
-
Run the test script
Navigate to the directory where RECOVAR is installed and execute:
python run_test_dataset.py
This script is located in the root directory of the RECOVAR repository.
The script performs the following steps:
- Generates a small synthetic dataset using
make_test_dataset.py
. - Runs the RECOVAR pipeline on the test dataset using
pipeline.py
. - Analyzes the results using
analyze.py
, including clustering and embedding. - Estimates the conformational density with
estimate_conformational_density.py
. - Computes trajectories in the latent space using
compute_trajectory.py
. - Reports which steps passed or failed.
- Deletes the
test_dataset
directory at the end if all steps pass.
-
Success Message: If all steps complete successfully, you'll see:
Total steps passed: 6 Total steps failed: 0 All functions completed successfully! Test dataset directory 'test_dataset' has been deleted.
-
Failure Notification: If any steps fail, the script will indicate which functions failed, and you can check the output for details.
Total steps passed: 5 Total steps failed: 1 The following functions failed: - estimate_conformational_density.py Please check the output above for details.
-
Verifying Outputs: You can compare the volumes generated in
test_dataset/pipeline_output/output/analysis_2_noreg/kmeans_center_volumes/all_volumes
with the simulated volumes inrecovar/data/vol*.mrc
to ensure the pipeline is producing expected results (the order may differ). -
Cleanup: The script will delete the
test_dataset
directory at the end if all steps pass. This helps keep your workspace clean.
[Rest of the README remains unchanged.]
A short example illustrating the steps to run the code on EMPIAR-10076. Assuming you have downloaded the code and have a GPU, the code should take less than an hour to run, and less than 10 minutes if you downsample to 128 instead (exact running time depends on your hardware). and Read above for more details:
# Downloaded poses from here: https://github.com/zhonge/cryodrgn_empiar.git
git clone https://github.com/zhonge/cryodrgn_empiar.git
cd cryodrgn_empiar/empiar10076/inputs/
# Download particles stack from here. https://www.ebi.ac.uk/empiar/EMPIAR-10076/ with your favorite method.
# My method of choice is to use https://www.globus.org/
# Move the data into cryodrgn_empiar/empiar10076/inputs/
conda activate recovar
# Downsample images to D=256
cryodrgn downsample Parameters.star -D 256 -o particles.256.mrcs --chunk 50000
# Extract pose and ctf information from cryoSPARC refinement
cryodrgn parse_ctf_csparc cryosparc_P4_J33_004_particles.cs -o ctf.pkl
cryodrgn parse_pose_csparc cryosparc_P4_J33_004_particles.cs -D 320 -o poses.pkl
# run recovar
python [recovar_dir]/pipeline.py particles.256.mrcs --ctf --ctf.pkl -poses poses.pkl --o recovar_test
# run analysis
python [recovar_dir]/analysis.py recovar_test --zdim=20
# Open notebook output_visualization.ipynb
# Change the recovar_result_dir = '[path_to_this_dir]/recovar_test' and
Note that this is different from the one in the paper. Run the following pipeline command to get the one in the paper (runs on the filtered stack from the cryoDRGN paper, and uses a predefined mask):
# Download mask
git clone https://github.com/ma-gilles/recovar_masks.git
python ~/recovar/pipeline.py particles.256.mrcs --ctf ctf.pkl --poses poses.pkl -o test-mask --mask recovar_masks/mask_10076.mrc --ind filtered.ind.pkl
The output should be the same as this notebook.
You can generate volumes from embedding not generated by RECOVAR using generate_from_embedding
. E.g., for a cryoDRGN embedding:
python [recovar_dir/]generate_from_embedding.py particles.256.mrcs --poses poses.pkl --ctf ctf.pkl --embedding 02_cryodrgn256/z.24.pkl --o [output_dir] --target zfile.txt
$ python generate_from_embedding.py -h
usage: generate_from_embedding.py [-h] -o OUTDIR [--zdim ZDIM] --poses POSES --ctf pkl [--ind PKL] [--uninvert-data UNINVERT_DATA]
[--datadir DATADIR] [--n-images N_IMAGES] [--padding PADDING] [--halfsets HALFSETS]
[--noise-model NOISE_MODEL] [--Bfactor BFACTOR] [--n-bins N_BINS] --embedding EMBEDDING --target
TARGET [--zdim1]
particles
positional arguments:
particles Input particles (.mrcs, .star, .cs, or .txt)
optional arguments:
-h, --help show this help message and exit
-o OUTDIR, --outdir OUTDIR
Output directory to save model
--zdim ZDIM Dimensions of latent variable. Default=1,2,4,10,20
--poses POSES Image poses (.pkl)
--ctf pkl CTF parameters (.pkl)
--Bfactor BFACTOR 0
--n-bins N_BINS number of bins for kernel regression
--embedding EMBEDDING
Image embeddings zs (.pkl), e.g. 00_cryodrgn256/z.24.pkl if you want to use a cryoDRGN embedding.
--target TARGET Target zs to evaluate the kernel regression (.txt)
--zdim1 Whether dimension 1 embedding is used. This is an annoying corner case for np.loadtxt...
Dataset loading:
--ind PKL Filter particles by these indices
--uninvert-data UNINVERT_DATA
Invert data sign: options: true, false (default)
--datadir DATADIR Path prefix to particle stack if loading relative paths from a .star or .cs file
--n-images N_IMAGES Number of images to use (should only use for quick run)
--padding PADDING Real-space padding
--halfsets HALFSETS Path to a file with indices of split dataset (.pkl).
--noise-model NOISE_MODEL
what noise model to use. Options are radial (default) computed from outside the masks, and white computed by
power spectrum at high frequencies
I hope some developers find parts of the code useful for their projects. See this notebook for a short tutorial. (OUT OF DATE, see cryoJAX for a much better documented JAX cryo-EM code.)
Some of the features which may be of interest:
-
The basic building block operations of cryo-EM efficiently, in batch and on GPU: shift images, slice volumes, do the adjoint slicing, apply CTF. See recovar.core. Though I have not tried it, all of these operations should be differentiable thus you could use JAX's autodiff.
-
A heterogeneity dataset simulator that includes variations in contrast, realistic CTF and pose distribution (loaded from real dataset), junk proteins, outliers, etc. See recovar.simulator.
-
A code to go from atomic positions to volumes or images. Does not run on GPU. Thanks to prody, if you have an internet connection, you can generate the volume from only the PDB ID. E.g., you can do
recovar.simulate_scattering_potential.generate_molecule_spectrum_from_pdb_id('6VXX', 2, 256)
to generate the volume of the spike protein with voxel size 2 on a 256^3 grid. Note that this code exactly evaluates the Fourier transform of the potential, thus it is exact in Fourier space, which can produce some oscillations artifact in the spatial domain. Also see cryoJAX -
Some other features that aren't very well separated from the rest of the code, but I think could easily be stripped out: trajectory computation recovar.trajectory, per-image mask generation recovar.covariance_core, regularization schemes recovar.regularization, various noise estimators recovar.noise.
-
Some features that are not there (but soon, hopefully): pose search, symmetry handling.
I am developping the tomography extension. You can try it out by passing the the --tilt-series
, --tilt-series-ctf=v2
, --angle-per-tilt=[insert angle here]
to pipeline.py
. The input should be identical to cryoDRGN-ET (see here).
E.g, on the M-file:
$ python [recovar_dir]/pipeline.py M_particles.star --ctf ctf.pkl --poses pose.pkl -o v2_nocont_$ntilts --datadir=128 --mask=path_to_mask.mrc --tilt-series-ctf=v2 --ntilts=10 --tilt-series --angle-per-tilt=3.0 --dose-per-tilt=2.93
You can use all tilts by not passing the argument --ntilts.
- Symmetry: there is currently no support for symmetry. If you got your poses through symmetric refinement, it will probably not work. It should probably work if you make a symmetry expansion of the particle stack, but I have not tested it.
- Memory: RECOVAR uses a lot of memory by default. For a stack of images of size 256, you need approximately 200 GB + size of dataset. You can also use the --lazy option, which will do lazy loading, in which case you need a little more than 200. If you run out of memory, you can use the --low-memory-option, in which case you need 60GB. If even that fails, you can try --very-low-memory-option. Finally, you can downsample the data and run RECOVAR. You can generate states from the full resolution images even if you ran the RECOVAR pipeline on downsampled images by using
compute_state.py
and passing the--particles
argument. The memory requirement for the compute_state is approximatelyn_bins * volume_size * 32
bytes where n_bins is 50 by default (~26 GB for size 256 images). - ignore-zero-frequency: I haven't thought much about the best way to do this. I would advise against using it for now.
- Other ones, probably?: if you run into issues, please let me know.
A lot of code in RECOVAR is based on cryoDRGN (https://cryodrgn.cs.princeton.edu/) and RELION (https://relion.readthedocs.io/en/release-5.0/). Hopefully, it is clear from the code naming/comments. Much of this documentation is generated using chatGTP4.
If you use this software for analysis, please cite:
@article{gilles2023bayesian,
title={A Bayesian Framework for Cryo-EM Heterogeneity Analysis using Regularized Covariance Estimation},
author={Gilles, Marc Aurele T and Singer, Amit},
journal={bioRxiv},
pages={2023--10},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
You can reach me (Marc) at [email protected] with questions or comments, or open an issue on Github.
Input files to replicate the results in the manuscript can be found in this dropbox.