First test with Dolomite Nadia data #101
You seem to have set it up and run it the right way; for some reason, however, umi-tools thinks there are no barcodes in your sample:
umi-tools estimates the number of barcodes from this file. You can also run umi-tools manually by activating the conda environment, to play around with the parameters and so on:
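For reference, a manual run might look roughly like the sketch below. This is an illustration only: the environment name is hypothetical, and the barcode pattern assumes the usual Drop-seq/Nadia layout of a 12 bp cell barcode followed by an 8 bp UMI.

```bash
# Activate the pipeline's conda environment (name is hypothetical)
conda activate dropseqpipe

# Estimate cell barcodes from R1;
# CCCCCCCCCCCCNNNNNNNN = 12 bp cell barcode + 8 bp UMI
umi_tools whitelist \
    --stdin trimmed_repaired_R1.fastq.gz \
    --bc-pattern=CCCCCCCCCCCCNNNNNNNN \
    --log2stderr > whitelist.txt
```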
Thanks @seb-mueller, I can confirm that trimmed_repaired_R1.fastq.gz is completely empty. I looked through the logs to try to work out which upstream command creates these files, and I see this command:
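(The command itself didn't survive the copy here. Given the `repaired` file names and the bbmap step mentioned below, it is presumably bbmap's repair.sh; a typical invocation, with file names inferred from this thread, would look something like the following, though the pipeline's exact flags may differ.)

```bash
repair.sh in1=trimmed_R1.fastq.gz in2=trimmed_R2.fastq.gz \
    out1=trimmed_repaired_R1.fastq.gz out2=trimmed_repaired_R2.fastq.gz
```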
The trimmed_R*.fastq.gz files do seem to be created by cutadapt in the early steps, with what look like roughly correct results, but they seem to be gone by the time the run completes.
I suspect that there is a problem with my fastq format, which I've seen before when using bbMap. My fastq entry names have the following format, with both a trailing "/2" in the read name and a leading "2" in the second field.
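(The example itself didn't copy over; illustratively, a header of that shape might look like the first line below, with the instrument fields purely made up, versus the more common Illumina style on the second line, which carries the mate number only in the second field.)

```text
@HWI-D00525:1:C7JBLANXX:3:1101:1246:2125/2 2:N:0:ACAGTG
@HWI-D00525:1:C7JBLANXX:3:1101:1246:2125 2:N:0:ACAGTG
```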
Used this command:
and am now getting non-empty files, and the analysis is progressing to the alignment stage.
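(The exact command is missing from the thread; as a purely illustrative sketch, stripping the trailing /1 or /2 from read names could be done with GNU sed, whose `1~4` address selects every fourth line, i.e. the headers:)

```bash
# Rewrite only header lines, dropping the /1 or /2 mate suffix
zcat reads_R2.fastq.gz | sed -E '1~4 s|/[12]||' | gzip > reads_fixed_R2.fastq.gz
```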
That's impressive bug-fixing work, Richard! Since you have seen this before (and it might therefore occur for others), this might be material for the troubleshooting section. Do you have a link or similar that elaborates on this problem? Also, bbmap might benefit from handling those files, so we could maybe send them a bug report.
I suspect that not many others are getting read names like ours. I think we have that mix of formats for backwards-compatibility reasons in our pipeline, and I ran into a similar error when using a different bbMap tool last year. From my understanding of the standard fastq formats, we are the outlier. That said, Brian Bushnell over at bbMap might have some text stipulating the fastq formats bbmap can handle. My run after fixing the read names got really close to the end, but still appears to have hit an error (let me know if you'd prefer a separate issue for this):
Activating conda environment:
Indeed, if this drags out we might transfer this to its own issue (not sure how this is handled elsewhere, though).
I haven't had this error before, but I suspect it has to do with how the working directory is set. Normally I call snakemake from the cloned repository (containing the Snakefile) and specify the working dir using
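(The flag itself didn't copy over, but snakemake's standard option for this is `--directory`/`-d`; a typical call might look like:)

```bash
# Run from the cloned dropSeqPipe repo, pointing snakemake at the experiment's working dir
snakemake --use-conda --cores 8 --directory /path/to/my_experiment
```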
Lastly, it might help if you set:
I get a similar error running Drop-seq data (Macosko 2015). I downloaded the files from GEO and SRA and did some filtering:
SRR1873277_S1_L001_R1_001.fastq
SRR1873277_S1_L001_R2_001.fastq
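(For anyone reproducing this: a common way to pull the run from SRA is with sra-tools, as sketched below; renaming the output to the bcl2fastq-style names above would be a separate manual step.)

```bash
prefetch SRR1873277
fasterq-dump --split-files SRR1873277   # writes SRR1873277_1.fastq and SRR1873277_2.fastq
```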
I get this issue running version 0.5 or the current development version. My understanding is that:
This hangs on that step. After seeing this, I think the FASTQ file may be the problem: I seem to be missing the second field from the headers. Is this required?
In my case, I was able to fix it by adding the second field to the header.
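(The exact snippet is missing here; a hypothetical one-liner that appends a minimal second field to every header line would be:)

```bash
# Append a placeholder second field ("1:N:0:1") to each FASTQ header (every 4th line)
awk 'NR % 4 == 1 { $0 = $0 " 1:N:0:1" } { print }' \
    SRR1873277_S1_L001_R1_001.fastq > R1_fixed.fastq
```

For R2 the leading mate number would be 2 rather than 1.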
With this fix in place, I hope this helps anyone running into similar problems, but it may be worth checking for this problem in the pipeline (or removing the requirement for compatible headers if they are not needed).
Update: this gives a "Found 0 cell barcodes in file" error.
I'm guessing this is because the barcodes are parsed from the header rather than from the first 8 bases of R1. In this case, the code to fix the files is:
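(That snippet is also missing; a purely hypothetical reconstruction, copying each read's first 8 bases into the header's index field, might look like:)

```bash
# For each R1 record: take the first 8 bases of the sequence (assumed cell barcode)
# and write them into the trailing colon-separated index field of the header.
paste - - - - < R1_fixed.fastq | awk -F'\t' -v OFS='\n' '{
    bc = substr($2, 1, 8)               # first 8 bases of the read
    sub(/:0:[^ \t]*$/, ":0:" bc, $1)    # swap them into the index field
    print $1, $2, $3, $4
}' > R1_barcoded.fastq
```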
Update 2:
Hi folks,
I'm trying dropSeqPipe for the first time on some test Nadia data we generated. I am using the example Nadia template config, and I have 4 samples, each with paired-end 125 bp reads.
When I start snakemake it seems that fastqc completes successfully, and most of the cutadapt jobs are successful, but things go awry when my first bbmap/whitelist/umi_tools commands are run.
I'll try running a single sample to see if I can isolate the error, but I'd be interested to hear if you have any advice.
thanks
Richard
Here's a trace...