
Error: Number of files (1) does not match number of input files required by technology 10XV1 (3) #60

Gabrisfon opened this issue Mar 9, 2024 · 14 comments



Gabrisfon commented Mar 9, 2024

Hello everyone. First of all, I'd like to congratulate you on this incredible tool.

In my current project, I need to create cell x transcript matrices for my single-nucleus human sample data generated with 10x technology. I have been trying to process this data with kallisto bus following this tutorial, but unfortunately none of the indexes I have created have been compatible with the 10xv1 technology, so I keep getting the following error:

[gabrielfonseca@login 02_kallisto_bus]$ srun kb count --tcc -i index_HS_98.idx -g t2g.txt -c1 cDNA_t2c.txt -c2 introns_t2c.txt -x 10xv1 -o /sn_ovary/results/gabrielfonseca/02_kallisto_bus/01_Kbc_results/ --filter bustools -t 28 --workflow nucleus /sn_ovary/data/00_raw_data/input/raw_data/
[2024-03-09 12:28:07,487] INFO [count_lamanno] Using index index_HS_98.idx to generate BUS file to /sn_ovary/results/gabrielfonseca/02_kallisto_bus/01_Kbc_results/ from
[2024-03-09 12:28:07,488] INFO [count_lamanno] /sn_ovary/data/00_raw_data/input/raw_data/
[2024-03-09 12:28:08,599] ERROR [count_lamanno]
[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
Error: Number of files (1) does not match number of input files required by technology 10XV1 (3)
kallisto 0.48.0
Generates BUS files for single-cell sequencing
Usage: kallisto bus [arguments] FASTQ-files
Required arguments:
-i, --index=STRING Filename for the kallisto index to be used for
pseudoalignment
-o, --output-dir=STRING Directory to write output to
Optional arguments:
-x, --technology=STRING Single-cell technology used
-l, --list List all single-cell technologies supported
-B, --batch=FILE Process files listed in FILE
-t, --threads=INT Number of threads to use (default: 1)
-b, --bam Input file is a BAM file
-n, --num Output number of read in flag column (incompatible with --bam)
-T, --tag=STRING 5′ tag sequence to identify UMI reads for certain technologies
--fr-stranded Strand specific reads for UMI-tagged reads, first read forward
--rf-stranded Strand specific reads for UMI-tagged reads, first read reverse
--unstranded Treat all read as non-strand-specific
--paired Treat reads as paired
--genomebam Project pseudoalignments to genome sorted BAM file
-g, --gtf GTF file for transcriptome information
(required for --genomebam)
-c, --chromosomes Tab separated file with chromosome names and lengths
(optional for --genomebam, but recommended)
--verbose Print out progress information every 1M proccessed reads
[2024-03-09 12:28:08,599] ERROR [main] An exception occurred
Traceback (most recent call last):
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/main.py",
line 1305, in main
COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/main.py",
line 491, in parse_count
count_velocity(
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-ngs-tools-1.8.1-olf5dpwkrl54xgpw6icmsugxnwpponf6/lib/python3.10/site-packages/ngs_tools/logging.py"
, line 62, in inner
return func(*args, **kwargs)
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/count.py",
line 1593, in count_velocity
bus_result = kallisto_bus(
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/validate.p
y", line 116, in inner
results = func(*args, **kwargs)
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/count.py",
line 150, in kallisto_bus
run_executable(command)
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/dry/__init
__.py", line 25, in inner
return func(*args, kwargs)
File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/utils.py",
line 203, in run_executable
raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i index_HS_98.idx -o /sn_ovary/results/gabrielfonseca/02_kallisto_bus/01_Kbc_results/ -x 10xv1 -t 28 /sn_ovary/data/00_raw_data/input/raw_data/' returned non-zero exit status 1.

I have created several indexes with different versions of human GTF and genome files. The last one I tried followed this 10x reference (References - 2020-A (July 7, 2020)), but it didn't work either. I really need to process this data as soon as possible. This is the command I've been using to generate the indexes:

kb ref -i index_HS_98.idx -g t2g.txt -f1 ./cdna.fa -f2 ./introns.fa -c1 cDNA_t2c.txt -c2 introns_t2c.txt --workflow=nucleus /sn_ovary/data/gabrielfonseca/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.1 /sn_ovary/data/gabrielfonseca/gencode.v32.primary_assembly.annotation.gtf.gz --overwrite


Yenaled commented Mar 13, 2024

Hi! Sorry for the semi-late reply (I was traveling for the past week). 10XV1 requires three files: the first contains the 14-bp barcode, the second contains the 10-bp UMI, and the third contains the read sequence to be mapped against the reference.

Additionally, you can't specify a directory (i.e. /raw_data/) as your input -- you must supply the actual FASTQ file names.
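As a minimal sketch (the FASTQ filenames and output directory below are hypothetical placeholders, not your actual files), a 10xv1 run would pass the three files per run explicitly, in that order: barcode read, UMI read, cDNA read.

# Hypothetical filenames; with -x 10xv1, each run needs three FASTQs in order:
# barcode read (14 bp), UMI read (10 bp), cDNA read
kb count --tcc -i index_HS_98.idx -g t2g.txt \
  -c1 cDNA_t2c.txt -c2 introns_t2c.txt \
  -x 10xv1 --workflow nucleus -o output_dir \
  run1_barcode.fastq.gz run1_umi.fastq.gz run1_cdna.fastq.gz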


Gabrisfon commented Mar 13, 2024

Hi Delaney, don't worry, and thanks for replying. Unfortunately I haven't succeeded yet.
I searched here but couldn't find a tutorial on how to obtain these files (the first containing the 14-bp barcode, the second the 10-bp UMI); could you point me to one, please? The third file would be my FASTQ files, referenced by name, right?


Yenaled commented Mar 13, 2024

Can you show me the contents of your /sn_ovary/data/00_raw_data/input/raw_data/ folder? All the FASTQ files should be there.

@Gabrisfon

Yes of course.

[screenshot of the raw_data folder contents, dated 2024-03-13]


Yenaled commented Mar 14, 2024

Hmm, ok, so you only have an R1 and an R2 file (not three files). I'm not sure how to process data in this format because I have no idea where your barcodes are, where your UMIs are, and where your actual RNA sequence is.

You need to find someone who can tell you "My barcode is at location _ in file _, my UMI is at location _ in file _, my RNA sequence is at location _ in file _". Only then can you use kallisto (or any other RNAseq pipeline). I don't even know if your data is 10xv1 (it doesn't look like it...).

@Gabrisfon

Hello Delaney, I apologize for the delay in replying; I was looking for answers to these questions. According to the protocol of the kit used to generate the data, the gel-bead chemistry is v1. The positions of the barcodes and UMIs are shown in the photo below. The kit used was a singleplex kit, so the probe sequence is the same for all the nuclei, as shown in the link. The indices used were from the Dual Index Kit TS, Set A (PN-1000251): CSV | JSON

Thanks in advance for your help.

[image: read structure showing the barcode and UMI positions]


Yenaled commented Mar 16, 2024

OK, based on the protocol provided, the technology appears to be 10xv3 (16 bp barcode, 12 bp UMIs -- R1 read length is 28 bp).

So supply -x 10xv3 to kb count.
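One quick way to sanity-check the chemistry yourself (a generic shell snippet, assuming gzipped FASTQs named like the ones in the next comment) is to print the length of the first R1 read: 28 suggests 10xv3 (16-bp barcode + 12-bp UMI), while 24 (14 + 10) would point to 10xv1.

# Print the length of the first read in an R1 file
zcat SN-1_S1_L001_R1_001.fastq.gz | head -2 | tail -1 | awk '{print length($0)}'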

@Gabrisfon

I ran the command as you instructed and it's working! I'm still waiting for the result, but I think it may take a while. Thank you very much for your help!

[gabrielfonseca@login raw_data]$ srun kb count --tcc -i /data/02_kallisto_bus/index_HS2_98.idx -g /data/02_kallisto_bus/t2g.txt -c1 /data/02_kallisto_bus/cDNA_t2c.txt -c2 /data/02_kallisto_bus/introns_t2c.txt -x 10xv3 -o /results/02_kallisto_bus/01_Kbc_results --filter bustools -t 28 --workflow nucleus SN-1_S1_L001_R1_001.fastq.gz SN-1_S1_L001_R2_001.fastq.gz SN-1_S1_L002_R1_001.fastq.gz SN-1_S1_L002_R2_001.fastq.gz SN-2_S2_L001_R1_001.fastq.gz SN-2_S2_L001_R2_001.fastq.gz SN-2_S2_L002_R1_001.fastq.gz SN-2_S2_L002_R2_001.fastq.gz SN-3_S3_L001_R1_001.fastq.gz SN-3_S3_L001_R2_001.fastq.gz SN-3_S3_L002_R1_001.fastq.gz SN-3_S3_L002_R2_001.fastq.gz SN-4_S4_L001_R1_001.fastq.gz SN-4_S4_L001_R2_001.fastq.gz SN-4_S4_L002_R1_001.fastq.gz SN-4_S4_L002_R2_001.fastq.gz SN-5_S5_L001_R1_001.fastq.gz SN-5_S5_L001_R2_001.fastq.gz SN-5_S5_L002_R1_001.fastq.gz SN-5_S5_L002_R2_001.fastq.gz SN-6_S6_L001_R1_001.fastq.gz SN-6_S6_L001_R2_001.fastq.gz SN-6_S6_L002_R1_001.fastq.gz SN-6_S6_L002_R2_001.fastq.gz SN-7_S7_L001_R1_001.fastq.gz SN-7_S7_L001_R2_001.fastq.gz SN-7_S7_L002_R1_001.fastq.gz SN-7_S7_L002_R2_001.fastq.gz SN-8_S8_L001_R1_001.fastq.gz SN-8_S8_L001_R2_001.fastq.gz SN-8_S8_L002_R1_001.fastq.gz SN-8_S8_L002_R2_001.fastq.gz
[2024-03-15 21:55:32,538] INFO [count_lamanno] Using index /data/02_kallisto_bus/index_HS2_98.idx to
generate BUS file to /results/02_kallisto_bus/01_Kbc_results from
[2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L001_R1_001.fastq.gz
[2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L001_R2_001.fastq.gz
[2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L002_R1_001.fastq.gz
[2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L002_R2_001.fastq.gz
[2024-03-15 21:55:32,539] INFO [count_lamanno] SN-2_S2_L001_R1_001.fastq.gz
...

I'd like to ask you a few more questions. Is the --tcc parameter all that's needed to generate transcript x cell matrices, and are the parameters I used on the command line adequate for this? With this version of the tool, is it still necessary to run bustools separately, or is all the processing already done by kb count? And finally, is it possible to pre-process the data with this tool, or is it necessary to use another tool, for example Seurat?


Yenaled commented Mar 16, 2024

Yes, you can just use --tcc for generating your TCC matrices; kb count does everything you need.

Once you have generated your matrices, that’s when you start using Seurat (knee plots, violin plots, clustering, etc.)

@Gabrisfon

Thank you very much for your help and all the information!

Yours sincerely,
Gabriel.

@Gabrisfon Gabrisfon reopened this Mar 19, 2024
@Gabrisfon

Hello Delaney, I had to reopen this issue for another question, if you can help me. After generating the quantification files with kb count using the --tcc parameter (because I later want to identify the gene isoforms present in the cell populations), I received the following files in the counts_unfiltered output folder: cells_x_tcc.barcodes.txt, cells_x_tcc.ec.txt and cells_x_tcc.mtx. I have tried to follow the recommendations in these links (1 and 2) to import these files with Seurat, but all of my attempts have failed. I have also tried using the transcripts.txt file generated outside this folder as the features file, but unfortunately that doesn't work either because the number of lines is inconsistent.

Is there any other way of importing this data using seurat that is different due to its format?

expression_matrix <- ReadMtx(mtx="/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.mtx", features = "/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.ec.txt", cells = "/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.barcodes.txt", feature.column=1, mtx.transpose = TRUE)

Error in make.unique(names = feature.names) : 'names' must be a character vector

expression_matrix <- ReadMtx(mtx="/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.mtx", features = "/results/02_kallisto_bus/02_Kbc_resultsK/transcripts.txt", cells = "/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.barcodes.txt", feature.column=1, mtx.transpose = TRUE)

Error: Matrix has 13062953 rows but found 1547350 features.

Thanks in advance.


Yenaled commented Mar 19, 2024

This is a bit tricky. You're getting the TCCs, which are NOT the same as isoforms.

In any case, the first way would be the way to go if you do want to import TCCs -- I think it's getting confused about cells_x_tcc.ec.txt because the first column is a list of numbers. You might have to edit the list of numbers to be a list of strings (e.g. prepend an alphabetical character in front of each number).
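A minimal sketch of that edit (assuming the two-column cells_x_tcc.ec.txt layout shown further down in this thread; the output filename is arbitrary):

# Prefix each equivalence-class ID with "ec" so the feature names are character strings
awk '{print "ec" $1}' cells_x_tcc.ec.txt > cells_x_tcc.features.txt

You would then point the features argument of ReadMtx at this new file (keeping feature.column = 1).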

@Gabrisfon

These are the output files. The entries in cells_x_tcc.ec.txt do not contain the feature information that would be needed to import into Seurat. I believe that for the analysis I need to do I would need a features file with the count of transcripts per cell, right? Similar to the count of genes per cell generated by Cell Ranger. Is there any parameter I should add to my command to generate this information?

[gabrielfonseca@login counts_unfiltered]$ head cells_x_tcc.barcodes.txt
AAACCCAGTAATGGCG
AAACCCATCACCATAA
AAACCCATCGAACGCC
AAACCCATCTGACCTT
AAACCCATCTGCTAGG
AAACGAAAGAAACCCG
AAACGAAAGAAACTCA
AAACGAAAGAAAGACA
AAACGAAAGAACCTCC
AAACGAAAGACAACTA

[gabrielfonseca@login counts_unfiltered]$ head cells_x_tcc.ec.txt
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9

[gabrielfonseca@login counts_unfiltered]$ tail cells_x_tcc.ec.txt
13062943 292286,292289,982741,982746,982774,982776
13062944 16518,955620,955634
13062945 83836,83839,83840,687187,955620,955634
13062946 1285587,1323410,1323412,1323414,1323417,1323419,1323422,1323425,1323427,1323431
13062947 58704,598549,598551,598552
13062948 1124772,1124775,1344511
13062949 747983,1297369
13062950 64342,285425,428088,589909,589927,589936,589941,589951,589962,589970,589986,589996,590004,590007,590016,590025,590031,590040,590048,590058,590070,590071,590080,590087,590095,590105,590114,590120,590127,590136,590142,590148,590155,590163,590172,590182,590205,590212,590220,590228,590239
13062951 63415,63416,585449,585453,585455,585457,1015962,1015963
13062952 866308,1282566,1282572,1282575,1282577,1282581,1282587,1282590,1282592,1282594,1282598,1282601,1285241

[gabrielfonseca@login counts_unfiltered]$ head cells_x_tcc.mtx
%%MatrixMarket matrix coordinate real general
%
54595 13062953 1245195
1 1583467 1
2 649261 1
3 1554508 1
4 55953 1
5 1577131 1
6 40130 1
6 231585 2


Yenaled commented Mar 20, 2024

You would need to use kallisto, not just kb-python.

kallisto quant-tcc is what you would need to use. Unfortunately, I don't have a straightforward workflow for this written up, but you're welcome to play around with it. Essentially, kallisto quant-tcc estimates transcript counts using an EM algorithm, but that algorithm takes a while to run, so with thousands of cells it can run for days. Second, with 10x data you're sequencing only the 3' end of each transcript, so there may not be enough diversity to get reliable transcript estimates. Those are the reasons I haven't made a workflow for it (but all the pieces are there if you want to give it a whirl).
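For anyone who wants to experiment, a rough sketch of such a run might look like the lines below. The flag names are based on the kallisto 0.48 quant-tcc help and should be double-checked against kallisto quant-tcc --help for your version; the output directory name is arbitrary.

# Estimate transcript abundances from the TCC matrix with the EM algorithm
# -i: the same kallisto index used for kb count
# -e: the equivalence-class file written next to the TCC matrix
# -g: transcript-to-gene mapping (optional, for gene-level aggregation)
kallisto quant-tcc -t 28 \
  -i index_HS2_98.idx \
  -e counts_unfiltered/cells_x_tcc.ec.txt \
  -g t2g.txt \
  -o quant_tcc_output \
  counts_unfiltered/cells_x_tcc.mtx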
