---
execute:
  eval: false
format:
  html:
    link-external-newwindow: true
---
# Gluing Together Pipelines

This chapter is named as such because I will be detailing the processing pipelines for our NGS data and the command line, i.e. the glue that connects the different parts together.

As of October 2023, we are starting to move ahead with switching to Nextflow and nf-core's processing pipelines, which will take care of most basic processing steps such as mapping, so I will not list them in too much detail.
## Working with High-Performance Computing (HPC)
Northwestern's [Quest User Guide](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=505) is a great place to start with learning how HPCs work. However, with our own server at Emory, things are much simpler.
To cancel a large batch of pending jobs: `scancel --state PENDING --user vkp2256`.
## Environment Reproducibility
Many packages will now use [conda](https://conda.io/projects/conda/en/latest/user-guide/index.html), so many of those will be self-contained. I highly recommend using [mamba](https://mamba.readthedocs.io/en/latest/) to speed up installations.
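For example, a typical mamba workflow looks something like this (the environment name and package list below are just illustrations, not a lab standard):
```{bash}
# Create a self-contained environment for a project
mamba create -n rnaseq-env -c bioconda -c conda-forge samtools star
mamba activate rnaseq-env
# Record exact versions so the environment can be rebuilt later
conda env export > environment.yml
```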
For R, while one can use a conda environment, I've found it hard to set up, so try to use only one version of R in one project to avoid dependency issues, and run `sessionInfo()` to get the versions of the packages that were installed.
## Nextflow
Nextflow is a pipeline construction language. nf-core is a community-contributed collection of best-practice bioinformatics pipelines. If you want to do basic, routine analysis, using nf-core's ready-made pipelines is best, but because of their large code base and complexity, they can be hard to debug. You may want to write your own Nextflow pipeline if you find yourself doing similar analyses many times.
Use `nextflow pull nf-core/rnaseq` to update a pipeline. Without a pull, `nextflow run` will keep using the locally cached version if one is available, even if the pipeline has been updated upstream since.
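A minimal update-then-run sequence looks like this; the revision and parameters below are placeholders, so check the pipeline's documentation for current values:
```{bash}
nextflow pull nf-core/rnaseq
# Pin an explicit revision (-r) so reruns use a known version
nextflow run nf-core/rnaseq -r 3.12.0 \
  -profile singularity \
  --input samplesheet.csv \
  --outdir results
```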
## The Command Line
It is essential to master the basics of the command line. [@Buffalo_2015, Chapter 8] is a great place to get started. I found learning about path expansion, awk, and vim extremely useful. [The Missing Semester](https://www.youtube.com/watch?v=Z56Jmr9Z34Q) is also a great resource.
Give your executable bash scripts a help page by adding a [Help facility](https://opensource.com/article/19/12/help-bash-program).
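A minimal sketch of such a facility (the script name and options are made up):
```{bash}
#!/bin/bash
Help() {
  echo "Usage: myscript.sh [-h] [-o OUTDIR] INPUT"
  echo "  -h   print this help and exit"
  echo "  -o   output directory"
}

# Parse options; show the help page on -h or on an unknown flag
while getopts "ho:" opt; do
  case $opt in
    h) Help; exit 0 ;;
    o) outdir="$OPTARG" ;;
    \?) Help; exit 1 ;;
  esac
done
```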
While I think going through these well-documented resources is better than anything I could share, I'll document the things I found really useful here.
::: callout-warning
Some of these might not be accurate, test carefully before using them (as you should do with any random piece of code you find).
:::
### One-liners
There are many great bioinformatics one-liner compilations such as [this one](https://github.com/stephenturner/oneliners). I suggest you keep a list of your own that you find yourself re-using often.
### File manipulation
Read a file line by line and perform an action on each line.
```{bash}
file = "/path/to/yourfile.txt"
while IFS= read -r line
do
echo "$line"
echo "${line:8:5}" # grab chars starting at 8th position(0 index) for 5 chars
done <"$file"
```
If you want to refer to columns in your file (here, a two-column mapping of old to new file names):
```{bash}
# Loop through each line in the mapping file and rename the files
while IFS=' ' read -r old_name new_name; do
  # Rename the files
  mv "$old_name" "$new_name"
done < "$mapping_file"
```
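Here `$mapping_file` is assumed to contain two space-separated columns per line, for example (file names are hypothetical):
```{bash}
mapping_file="rename_map.txt"
# Example contents of rename_map.txt:
#   old_sample1.fastq.gz new_sample1.fastq.gz
#   old_sample2.fastq.gz new_sample2.fastq.gz
```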
Checking if a directory already exists, and creating it if not:
```{bash}
if [ ! -d "$dir" ]; then
  mkdir "$dir" # or simply use mkdir -p "$dir", which only creates it if missing
fi
```
For loops: there are many ways of specifying what to loop through:
```{bash}
#!/bin/bash
# Alternative ways of specifying what to loop through:
#   for i in m{1..3} m{13..15}
#   for i in 1 2 3 4 5
for i in file1 file2
do
  echo "Welcome $i times"
  j="$(basename "${i}")"                     # the file name without the directory
  k="$(basename "${i}" | sed 's/_[^_]*$//')" # strips the last _-delimited chunk (and with it the extension)
  l="$(dirname "${i}")"                      # the directory without the file name
  echo raw/JYu-${i}_*_L001*                  # filename expansion
done
```
For more complex naming schemes, you may use an array.
```{bash}
# Specify the last two digits of the numbers
numbers=("59" "82" "23" "45" "77")
# Iterate over the array
for num in "${numbers[@]}"; do
  echo "SRR13105$num"
done
# Iterate over indexes
for ((index=0; index<${#numbers[@]}; index++))
do
  num="SRR13105${numbers[$index]}"
  echo "$num"
done
```
See [this](https://geekflare.com/bash-for-loop-examples/) for more examples and use cases.
### Unix Text Processing: Awk and Sed
**Awk**
Awk statements are written as `pattern { action }`. If we omit the pattern, Awk will run the action on all records. If we omit the action but specify a pattern, Awk will print all records that match the pattern.
```{bash}
awk '{if ($3 == "sth") print $0;}' file.tsv # print records where column 3 equals "sth"
awk -F "\t" '{print NF; exit}' file.tsv # prints number of fields/columns
awk -F "\t" 'BEGIN{OFS="\t";} {print $1,$2,$3,"+"}' input.bed > output.bed # indicate that the output should be tab-delimited
color="$(awk 'NR==val{print; exit}' val="$line" color_list_3_chars.txt)" # pick a line based on a counter and assign it to a variable
```
**Sed**\
Sed substitute statements are written as `'s/pattern/replacement/'`. You can chain rules with repeated flags: `sed -e 'some rule' -e 'another rule'`. The trailing `/g` means global replace, i.e. find all occurrences of foo and replace them with bar; if you remove the `/g`, only the first occurrence on each line is changed. Rules can also be separated by semicolons: `sed -e 's/,/\t/g ; 5q'` replaces commas with tabs and quits after printing the first 5 lines.
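A few concrete examples (file names are placeholders):
```{bash}
sed 's/foo/bar/' file.txt           # replace only the first foo on each line
sed 's/foo/bar/g' file.txt          # replace every foo on each line
sed -e 's/,/\t/g' -e '5q' file.csv  # commas to tabs, quit after line 5
```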
### Miscellaneous
- Inside a bash script, you can use `$0` for the name of the script, `$1` for the first argument, and so on (see the sketch after this list).
- `$?` holds the exit code of the last command (0 if everything went well); `$#` is the number of arguments passed to the script.
- `shuf -n 4 example_data.csv` prints 4 random lines.
- `nohup command &` makes the command run in the background; `fg` brings it back to the foreground.
- <https://www.shellcheck.net/> is a website that checks your shell scripts for errors and bugs.
- `paste -sd,` concatenates all lines into a single line (`-s`), delimited by a comma.
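A toy script showing these special variables in action:
```{bash}
#!/bin/bash
echo "script name: $0"
echo "first argument: $1"
echo "number of arguments: $#"
ls /nonexistent 2>/dev/null
echo "exit code of last command: $?" # non-zero, because the ls above failed
```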
#### Working with lines
Count the number of occurrences of each unique line:
```{bash}
sort file.txt | uniq -c
```
Output the number of duplicated lines, if there are any:
```{bash}
uniq -d mm_gene_names.txt | wc -l
```
Count reads in a fastq file (assuming a well-formed fastq, which they usually are):
```{bash}
cat file.fq | echo $((`wc -l`/4))
```
Print only the number of lines, without the file names:
```{bash}
wc -l m*/m*.narrowPeak | cut -f 4 -d' '
```
To get rid of the first N lines of a file:
```{bash}
tail -n +2 file.txt # starts from the 2nd line, i.e. removing the first line
```
To see the first 2 lines and the last 2 lines of a file:
```{bash}
(head -n 2; tail -n 2) < Mus_musculus.GRCm38.75_chr1.bed
```
#### Working with columns
```{bash}
cut -f 2 Mus_musculus.GRCm38.75_chr1.bed # only the second column
cut -f 3-8 Mus_musculus.GRCm38.75_chr1.bed # range of columns
cut -f 3,6,7,8 Mus_musculus.GRCm38.75_chr1.bed # sets of columns, cannot be reordered
cut -d, -f2 some_file.csv # use comma as delimiter
grep -v "^#" some_file.gtf | cut -f 1-8 | column -t | head -n 3 #prettify output
```
Remove header lines and count the number of columns:
```{bash}
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
```
#### Working with strings
To remove everything from the first dot onward; useful for getting the sample name from `filename.fastq.gz`:
```{bash}
${i%%.*}
```
To remove everything from the last dot onward; useful for getting the sample name from `filename.param.fastq`:
```{bash}
${i%.*}
```
To remove everything up to and including the last `/` (i.e. strip the directory part):
```{bash}
${i##*/}
```
To capitalize the first character (use `${var^^}` to uppercase the whole string):
```{bash}
${var^}
```
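A quick worked example (the file names are made up):
```{bash}
i="sample01.param.fastq.gz"
echo "${i%%.*}" # sample01              (cut at the first dot)
echo "${i%.*}"  # sample01.param.fastq  (cut at the last dot)
i="data/run1/sample01.fastq.gz"
echo "${i##*/}" # sample01.fastq.gz     (basename)
```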
#### Directories
To get human-readable sizes of the folders under a parent folder:
```{bash}
du -h --max-depth=1 parent_folder_name
```
`-s` means summary, i.e. a single total for the whole directory:
```{bash}
du -sh /full/path/directory
```
Print each file on its own line:
```{bash}
ls -1
```
Check what takes the most space (`-d 1` limits the listing to one level down):
```{bash}
du -d 1 | sort -n
```
#### Alignment files
Removing EBV chromosomes for viewing on the UCSC browser:
```{bash}
samtools idxstats m${i}_sorted.bam | cut -f 1 | grep -v 'chrEBV' | xargs samtools view -h m${i}_sorted.bam | grep -v 'chrEBV' | samtools view -o m${i}_filtered.bam
```
Remove reads not mapping to chr1-22, X, or Y (this does not remove them from the headers). The `echo -n` suppresses the trailing newline; the first sed expression removes the leading whitespace left over from the brace expansion, and the second prepends "chr" to each chromosome name.
```{bash}
samtools view -o mr392_filtered1.bam mr392_sorted.bam `echo -n {{1..22},X,Y}$'\n' | sed -e 's/^[[:space:]]//' -e 's/^/chr/'`
```
Or you can just chain together reverse greps to remove any chromosomes you want:
```{bash}
chr=$(samtools view -H m1076_rgid_reheader_st.bam | grep chr | cut -f2 | sed 's/SN://g' | grep -v chrM | grep -v chrY | awk '{if(length($0)<6)print}')
```
The awk statement removes the unknown chromosomes and random contigs, since their names are longer than 6 characters.
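The resulting list can then be passed back to samtools view as regions; an untested sketch following the example above (the input BAM must be indexed):
```{bash}
samtools index m1076_rgid_reheader_st.bam
# $chr is deliberately left unquoted so each chromosome becomes a separate region argument
samtools view -b m1076_rgid_reheader_st.bam $chr > m1076_filtered.bam
```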
```{r}
#| echo: false
# Included only to make the bash code chunks have background color and correct code chunk behaviors
```
## Bioinformatics Bits
These are mostly still command line tools, but more bioinformatics related.
### Getting files from GEO
**Using SRAtools** (<https://bioinformaticsworkbook.org/dataAcquisition/fileTransfer/sra.html#gsc.tab=0>)\
To get the SRR run numbers, use the SRA Run Selector: select the runs you want to download (or all of them) and download the Accession List, then upload it to Quest. It is better to submit a job array with one task per SRR. `fasterq-dump` is now the preferred and faster option. To download a BAM, you can pipe `sam-dump` through `samtools view` and skip the intermediate SAM file.
```{bash}
module load sratoolkit
fastq-dump --gzip SRR # use --split-files for paired-end
fasterq-dump SRR -O output/dir # for 10x, need to use --split-files and --include-technical
sam-dump SRR | samtools view -bS - > SRR.bam
```
**Using ftp**\
This can be faster than SRAtoolkit but only works if those files have been uploaded to EBI.
`ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916`\
The pattern is `fastq/<first 6 characters of the accession>/00<last digit>/<full accession>`.
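A small sketch that builds the URL from an accession; this assumes the 7-digit accession layout shown above (the middle directory differs for longer accessions):
```{bash}
acc=SRR1016916
url="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${acc:0:6}/00${acc: -1}/${acc}"
wget "${url}/${acc}.fastq.gz" # paired-end runs usually have _1/_2 suffixes instead
```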
### Making UCSC tracks
```{bash}
module load homer
makeTagDirectory tag1045 1045.bed
makeUCSCfile tagm57 -o m57 -norm 2e7
```
With RNA-Seq UCSC tracks, use `-fragLength given`, otherwise homer's autocorrelation-based fragment length estimation assumes the data is ChIP-Seq, leading to screwed-up tracks.
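For example (the tag directory name is a placeholder):
```{bash}
makeUCSCfile tagRNA -o rna_track -fragLength given -norm 2e7
```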
You can upload the tracks to the UCSC browser and save them as a session that can be shared.
### samtools
[Learn the Bam Format](https://github.com/davetang/learning_bam_file).
```{bash}
samtools view -h test.bam # print bam with headers
samtools view -H test.bam # headers only
samtools flagstat test.bam # get mapping info
samtools idxstats test.bam
samtools view -f 0x2 test.bam # reads mapped in a proper pair
samtools view -H test.bam | grep @HD # Check if BAM files are sorted
samtools view -F 4 test.bam | wc -l # check how many mapped reads
```
## R Quirks
When saving anything produced with ggplot that you will later edit in Illustrator, set `useDingbats = FALSE` (an argument to the `pdf()` device, which `ggsave()` also passes through).