Import summary stats using import.wdl. Prepare a list of summary stats as in the reference configuration file import.json (one summary-stat bucket path per line).
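For illustration, the list is a plain-text file with one bucket path per line; the bucket and file names below are hypothetical:

```
gs://my-sumstats-bucket/PHENO_A.sumstats.txt.gz
gs://my-sumstats-bucket/PHENO_B.sumstats.txt.gz
```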
After a successful import run, copy the generated files to a single bucket with the proper file structure using copy_cromwell_import_to_bucket_puddle.py.
The parameters needed are the Cromwell hash and the path to the destination bucket: copy_cromwell_import_to_bucket_puddle.py cromwell_hash gs://bucket_for_deployment_pickerupper/v8/
You need to have a SOCKS5 proxy open on localhost:5000 to the Cromwell machine to get the metadata.
Example proxy creation if Cromwell runs in a Google VM: gcloud compute ssh cromwell-machine-name -- -D localhost:5000 -N
Alternatively, if direct access is available, change the URL with --cromwell_url yourURL and remove the proxy (--socks_proxy "").
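As a sketch, a direct-access invocation combining those options might look like this (the URL is a placeholder, and the exact argument order is an assumption):

```
copy_cromwell_import_to_bucket_puddle.py --cromwell_url http://your-cromwell-host --socks_proxy "" cromwell_hash gs://bucket_for_deployment_pickerupper/v8/
```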
Create a UTF-8 TSV file from Aki's Excel file. I've found this to be the best way to avoid double quotes around pheno names and to correctly encode weird characters:
- Open Endpoints_Controls_FINNGEN_ENDPOINTS_DF7_Final_2021-03-05.xlsx in Excel and save it as UTF-8 CSV.
- Install csvkit (terrible deps) and convert the CSV to TSV:
csvformat -T Endpoints_Controls_FINNGEN_ENDPOINTS_DF7_Final_2021-03-05.csv > Endpoints_Controls_FINNGEN_ENDPOINTS_DF7_Final_2021-03-05.tsv
In refinery, get the numbers of cases and controls from the cov/pheno file, e.g. in R:
```r
library(data.table)

# read the covariate/phenotype file
cov_pheno <- fread("gunzip -c /mnt/nfs/r7/R7_COV_PHENO_V2.FID.txt.gz")
# phenotype columns start at DEATH; everything before it is covariates
first_pheno_index <- match("DEATH", names(cov_pheno))
# cases per phenotype (phenotypes coded 1 = case, 0 = control)
cs <- colSums(cov_pheno[, first_pheno_index:length(cov_pheno)], na.rm=T)
# controls per phenotype
mcs <- colSums(1 - cov_pheno[, first_pheno_index:length(cov_pheno)], na.rm=T)
# effective sample size 2/(1/cases + 1/controls), written out as n_eff.txt
fwrite(data.table(cbind(pheno=names(cs), cases=cs, ctrls=mcs, n_eff=2/(1/cs+1/mcs))), "n_eff.txt", quote=F, sep="\t")
```
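The resulting n_eff.txt is tab-delimited with one row per phenotype, roughly like this (the counts here are made up):

```
pheno	cases	ctrls	n_eff
DEATH	10000	250000	19230.77
```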
In /mnt/nfs/pheweb/r7/phenolist, copy Aki's files, the TSV created above, and the n_eff.txt created above. Then fetch the original pheno-list.json from the import run:
gsutil cp gs://fg-cromwell_fresh/pheweb_import/e4792246-6efb-4b2e-a155-7f0dbbc00380/call-matrix/pheweb/pheno-list.json /mnt/nfs/pheweb/r7/pheno-list.json.orig
python3 phenolist.py /mnt/nfs/pheweb/r7/pheno-list.json.orig /mnt/nfs/pheweb/r6/pheno-list.json TAGLIST_DF7.txt Pheweb_FINNGEN_ENDPOINTS_DF7_Final_2021-03-05.names_tagged_ordered.txt Endpoints_Controls_FINNGEN_ENDPOINTS_DF7_Final_2021-03-05.tsv n_eff.txt /mnt/nfs/pheweb/r7/generated-by-pheweb | python -m json.tool > /mnt/nfs/pheweb/r7/pheno-list.json
Requirements:
- Docker
- Google Cloud SDK
- kubectl
Note that there can be at most one minor version difference between kubectl client and server versions: if the server is running v1.8, the client cannot be v1.10. Versions can be checked with kubectl version. If they differ too much, download a different version of the client or update the server.
In the repository root, build and push the image:
docker build -t gcr.io/phewas-development/pheweb:[TAG] -f deploy/Dockerfile .
gcloud docker -- push gcr.io/phewas-development/pheweb:[TAG]
Get credentials for a running cluster:
gcloud container clusters get-credentials [CLUSTER-NAME] --zone=europe-west1-b
Or create a new cluster:
gcloud container clusters create [CLUSTER-NAME] --num-nodes=1 --machine-type=n1-standard-1 --zone=europe-west1-b
Make sure you're in the right kubernetes context:
kubectl config get-contexts
If necessary:
kubectl config use-context [CONTEXT-NAME]
This example is for R6 data. If using a running cluster: in e.g. deploy/pheweb-deployment-r6.yaml (or another pheweb-deployment-* file), change the Docker image to the one you just created, or make other desired changes (note that replicas should usually equal the cluster size). Make sure that in deploy/pheweb-pv-nfs.yaml (or another pheweb-pv-* file) the NFS / GCE disk is the one with the wanted data, and that there is a correct config.py in the data directory of the disk. The data directory needs to be specified (PHEWEB_DIR) in deploy/pheweb-deployment-r6.yaml.
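For orientation, the fields to touch look roughly like this; this is an illustrative sketch, not the actual contents of deploy/pheweb-deployment-r6.yaml, and the container name and data path are assumptions:

```yaml
# illustrative excerpt only -- check the real deploy/pheweb-deployment-r6.yaml
spec:
  replicas: 4                       # usually equal to the cluster size
  template:
    spec:
      containers:
        - name: pheweb-front        # container name assumed from the StatefulSet below
          image: gcr.io/phewas-development/pheweb:[TAG]   # the image you pushed above
          env:
            - name: PHEWEB_DIR      # data directory on the mounted NFS / GCE disk
              value: /mnt/nfs/pheweb/r6
```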
Then, apply the changes you made (example with dev config):
kubectl apply -f deploy/pheweb-pv-nfs.yaml
(if changed) and/or
kubectl apply -f deploy/pheweb-deployment-r6.yaml
Or if using a new cluster: modify deploy/pheweb-ingress-r6.yaml, deploy/pheweb-deployment-r6.yaml and deploy/pheweb-pv-nfs.yaml -- or other files -- as needed. Then
kubectl create -f deploy/pheweb-ingress-r6.yaml
and
kubectl create -f deploy/pheweb-pv-nfs.yaml
and
kubectl create -f deploy/pheweb-deployment-r6.yaml
Example of updating the image used in a StatefulSet:
kubectl patch statefulset pheweb-front --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"gcr.io/phewas-development/pheweb:r2-2"}]'
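You can watch the rolling update progress with a standard kubectl command:

```
kubectl rollout status statefulset pheweb-front
```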
Kubernetes will do a rolling update, so that while some pods are updating, the others keep serving with the old image. If the new image or settings are not functional, Kubernetes will keep retrying. In that case, update the settings again first and then delete the pods that keep trying to run with the old settings:
kubectl delete pod pheweb-front-3
If the situation becomes incomprehensible and you need to bring the service back ASAP, here's how to do a full restart of the cluster (example with R2 and 4 nodes):
gcloud container clusters delete [CLUSTER_NAME]
gcloud container clusters create [CLUSTER_NAME] --num-nodes=4 --machine-type=n1-standard-1 --zone=europe-west1-b
kubectl create secret tls finngen-tls --key /path/to/star_finngen_fi.key --cert /path/to/star_finngen_fi.crt
kubectl create -f deploy/pheweb-ingress-r6.yaml
kubectl create -f deploy/pheweb-pv-nfs.yaml
kubectl create -f deploy/pheweb-deployment-r6.yaml
To check status and debug:
kubectl get ingress
kubectl describe ingress
kubectl get svc
kubectl describe svc
kubectl get pods
kubectl logs [POD-NAME]
kubectl get events --sort-by=.metadata.creationTimestamp
More here
For an example, see the Michigan Genomics Initiative PheWeb. For a walk-through demo see here. If you have questions or comments, check out our Google Group.
If any of these steps is incorrect, please email me at [email protected] and I'll see what I can do to improve things.
pip3 install pheweb
- If that doesn't work, follow the detailed install instructions.
mkdir ~/my-new-pheweb && cd ~/my-new-pheweb
- This directory will store all data for the PheWeb you are building. All `pheweb ...` commands should be run in this directory.
- You can put it wherever you want and name it whatever you want.
If you want to configure any options, make a file `config.py` in your data directory. Some options you can set are:
- Minor Allele Frequency cutoffs:
  - `assoc_min_maf`: an association (between a phenotype and variant) will only be included if its MAF is greater than this value. (default: `0`, but it saves disk space during loading, so I usually use at least `variant_inclusion_maf / 2`)
  - `variant_inclusion_maf`: a variant will only be included if it has some associations with MAF greater than this value. That is, if some or all associations for a variant are above `assoc_min_maf` but none are above `variant_inclusion_maf`, that entire variant (including all of its associations with phenotypes) will be dropped. If any association's MAF is above `variant_inclusion_maf`, all associations for that variant that are above `assoc_min_maf` will be included. (default: `0`, but I recommend at least `0.005`)
- `cache`: a directory where files common to all datasets can be stored. If you don't want one, set `cache = False`. (default: `cache = "~/.pheweb/cache/"`)
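For instance, a minimal config.py combining these options might look like this (the cutoff values are just the examples suggested above):

```python
# config.py -- example values only
variant_inclusion_maf = 0.005          # recommended minimum above
assoc_min_maf = 0.0025                 # variant_inclusion_maf / 2
cache = "~/.pheweb/cache/"             # the default cache location
```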
You should have one file for each phenotype. It can be gzipped if you want. It should be tab-delimited and have a header row. Variants must be sorted by chromosome and position, with chromosomes in the order [1-22,X,Y,MT].
- If you are using EPACTS, your files should work just fine. If they don't, email me. EPACTS files won't have `REF` or `ALT`, but PheWeb will parse their `MARKER_ID` column to get those.
The file must have columns for:
| column description | name | other allowed column names | allowed values |
|---|---|---|---|
| chromosome | `chrom` | `#chrom` | integer 1-22, `X`, `Y`, `M`, `MT` |
| position | `pos` | `beg`, `begin` | integer |
| reference allele | `ref` | | anything |
| alternate allele | `alt` | | anything |
| p-value | `pval` | `pvalue` | number in [0,1] |
Note: column names are case-insensitive.
Note: any field may be `.` or `NA`. For required fields, these values will cause the variant to be dropped.
Note: if your column name is not one of these, you may set `field_aliases = {"column_name": "field_name"}` in `config.py`. For example, `field_aliases = {'P_BOLT_LMM_INF': 'pval'}`.
Note: scientific notation is okay.
You may also have columns for:
| column description | name | allowed column names | allowed values |
|---|---|---|---|
| minor allele frequency | `maf` | | number in (0,0.5] |
| allele frequency | `af` | | number in (0,1) |
| allele count | `ac` | | integer |
| effect size | `beta` | | number |
| standard error of effect size | `sebeta` | | number |
| odds ratio | `or` | | number |
| R2 | `r2` | | number |
| number of samples | `num_samples` | `ns`, `n` | integer, must be the same for every variant in its phenotype |
| number of controls | `num_controls` | `ns.ctrl`, `n_controls` | integer, must be the same for every variant in its phenotype |
| number of cases | `num_cases` | `ns.case`, `n_cases` | integer, must be the same for every variant in its phenotype |
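Putting the two tables together, a minimal tab-delimited association file might start like this (the values are made up for illustration):

```
chrom	pos	ref	alt	pval	beta	sebeta
1	869334	G	A	0.538	0.012	0.019
1	879481	C	T	4.2e-08	-0.310	0.056
```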
Inside of your data directory, you need a file named `pheno-list.json` that looks like this:
```json
[
    {
        "assoc_files": ["/home/watman/ear-length.epacts.gz"],
        "phenocode": "ear-length"
    },
    {
        "assoc_files": ["/home/watman/eats-kimchi.X.epacts.gz", "/home/watman/eats-kimchi.autosomal.epacts.gz"],
        "phenocode": "eats-kimchi"
    }
]
```
`phenocode` must only contain letters, numbers, or any of `_-~`.
That example file only includes the columns `assoc_files` (a list of paths to association files) and `phenocode` (a string representing your phenotype that is valid in a URL). If you want, you can also include:
- `phenostring`: a string that is more descriptive than `phenocode` and will be shown in several places
- `category`: a string that will group together phenotypes in the PheWAS plot and also be shown in several places
- `num_cases`, `num_controls`, and/or `num_samples`: numbers or strings which will be shown in several places
- anything else you want, but you'll have to modify templates to show it
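For example, an entry using the optional fields could look like this; the phenotype description, category, and count are made up for illustration:

```json
{
    "assoc_files": ["/home/watman/ear-length.epacts.gz"],
    "phenocode": "ear-length",
    "phenostring": "Ear length",
    "category": "anthropometry",
    "num_samples": 10000
}
```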
There are four ways to make a `pheno-list.json`:
1. If you have a csv (or tsv, optionally gzipped) with a header that has EXACTLY the right column names, just import it by running `pheweb phenolist import-phenolist "/path/to/my/pheno-list.csv"`.
   If you have multiple association files for each phenotype, you may put them all into a single column with `|` between them. For example, your file `pheno-list.csv` might look like this:

   ```
   phenocode,assoc_files
   eats-kimchi,/home/watman/eats-kimchi.autosomal.epacts.gz|/home/watman/eats-kimchi.X.epacts.gz
   ear-length,/home/watman/ear-length.all.epacts.gz
   ```
2. If you have one association file per phenotype, you can use a shell-glob and a regex to get assoc-files and phenocodes for them. Suppose that your association files are at paths like:

   ```
   /home/watman/eats-kimchi.epacts.gz
   /home/watman/ear-length.epacts.gz
   ```

   Then you could run `pheweb phenolist glob-files "/home/watman/*.epacts.gz"` to get `assoc-files`.
   To get `phenocodes`, you can use a regex that captures the phenocode from the file path. In most cases (including this one), just use: `pheweb phenolist extract-phenocode-from-filepath --simple`
3. If you have multiple association files for some phenotypes, you can follow the directions in 2 and then run `pheweb phenolist unique-phenocode`.
   For example, if your association files are at:

   ```
   /home/watman/autosomal/eats-kimchi.epacts.gz
   /home/watman/X/eats-kimchi.epacts.gz
   /home/watman/all/ear-length.epacts.gz
   ```

   then you can run:

   ```
   pheweb phenolist glob-files "/home/watman/*/*.epacts.gz"
   pheweb phenolist extract-phenocode-from-filepath --simple
   pheweb phenolist unique-phenocode
   ```
4. If you want to do more advanced things, like merging in more information from another file, email [email protected] and I'll write documentation for `pheweb phenolist`.

No matter what you do, please run `pheweb phenolist verify` when you are done to check that it worked correctly. At any point, you may run `pheweb phenolist view` or `pheweb phenolist print-as-csv` to view the current file.
Run `pheweb process`.
- This step can take hours or days for large datasets. If you want to use the SLURM cluster scheduler, run `pheweb slurm-parse` for parsing and then `pheweb process --no-parse` for everything else.
If something breaks, read the error message.
- If you can understand the error message, modify your association or config files to avoid it, or drop the problematic phenotypes from `pheno-list.json`. Then re-run `pheweb process`.
- If the problem is something that PheWeb should support by default, feel free to email it to me at [email protected].
- If you can't understand the error message, please email your error message to [email protected] and hopefully I can get back to you quickly.
Run `pheweb serve --open`.
That command should either open a browser to your new PheWeb, or give you a URL that you can open in your browser to access your new PheWeb. If it doesn't, follow the directions for hosting a PheWeb and accessing it from your browser.
To use Apache2 or Nginx (for performance), see instructions here. To require login via OAuth, see instructions here. To track page views with Google Analytics, see instructions here.