Documentation updates #8

Merged 1 commit on Dec 21, 2023
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: genecovr
Title: Gene body coverage analysis to evaluate genome assemblies
-Version: 0.1.0
+Version: 0.1.1
Authors@R:
person(given = "Per",
family = "Unneberg",
31 changes: 21 additions & 10 deletions NEWS.md
@@ -1,22 +1,33 @@
-# Release 0.0.0.9013
+<!-- markdownlint-disable MD025 -->
+
+# genecovr 0.1.1
+
+- update README
+- add Empirical studies section
+
+# genecovr 0.1.0
+
+- add pkgdown site
+
+# genecovr 0.0.0.9013

- fix factor level ordering for geneBodyCoverage plot
- save geneBodyCoverage as tsv

-# Release 0.0.0.9012
+# genecovr 0.0.0.9012

- adjust factor levels for number of inserts (#4)
- summarize number of inserts by transcript (#5)

-# Release 0.0.0.9011
+# genecovr 0.0.0.9011

- fix order of factors

-# Release 0.0.0.9010
+# genecovr 0.0.0.9010

- remove duplicate entries in psl input

-# Release 0.0.0.9009
+# genecovr 0.0.0.9009

- add plot of transcript length distributions conditioned on number of
mapped contigs
@@ -26,21 +37,21 @@
DataFrame inputs, obviating the need to rerun geneBodyCoverage
multiple times in genecovr script


-# Release 0.0.0.9008
+# genecovr 0.0.0.9008

- Remove characters trailing first space in fasta headers

-# Release 0.0.0.9007
+# genecovr 0.0.0.9007

- Fix conversion of DNAStringSet to Seqinfo
- Make sure geneBodyCoverage table has nmax levels


-# Release 0.0.0.9006
+# genecovr 0.0.0.9006

- add depthOfCoverage function and analysis to vignette and script
- reduceHitCoverage is deprecated
- improve some docs
- add wrapper for saving plots
- add tests mainly for alignmentpairs and test setup

+<!-- markdownlint-enable MD025 -->
4 changes: 4 additions & 0 deletions README.Rmd
@@ -40,6 +40,10 @@ GitHub](https://github.com/nbis) with:
devtools::install_github("NBISweden/genecovr")
```

The tool has been developed and tested on GNU/Linux systems but should
work on any system that runs `R`. Installation is expected to take at
most a couple of minutes.

## Usage

### genecovr script quick start
66 changes: 66 additions & 0 deletions vignettes/empirical.Rmd
@@ -0,0 +1,66 @@
---
title: "Empirical studies"
author: "Per Unneberg"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Empirical studies}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
biblio-style: plain
bibliography: bibliography.bib
---

# Northern krill

`genecovr` was used to assess the quality of the Northern krill
genome assembly.

To test genecovr with the 19 Gb Northern krill genome and gene data
(16,509 transcripts of protein coding genes), access the collection in
the SciLifeLab Data Repository named "Ecological genomics of the
Northern krill" using the following permanent link:

< URL to be provided >

1. Genome file

Access item: 1. Ecological genomics of the Northern krill: Genome
assembly DNA sequences

Download: northern_krill.genome_assembly.tar.gz

Extract genome assembly for evaluation:
1.m_norvegica.main_w_mito.fasta

2. Gene models

Access item: 3. Ecological genomics of the Northern krill: Genome
assembly annotations (genes and repeats)

Download: trinity_transcript.16509_single_isoforms.cds.fasta.tar.gz

Extract and use transcripts for evaluation:
trinity_transcript.16509_single_isoforms.cds.fasta
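
The two extraction steps above can be sketched in shell as follows. `extract_and_check` is a hypothetical helper, not part of genecovr, and it assumes each archive unpacks its file into the current directory; adjust the expected path if the archive contains subdirectories.

```shell
#!/usr/bin/env bash
# Sketch: unpack a downloaded archive and confirm the expected file appears.
# extract_and_check is a hypothetical helper, not part of genecovr.
extract_and_check() {
    local archive="$1" expected="$2"
    if [ ! -f "$archive" ]; then
        echo "missing archive: $archive" >&2
        return 1
    fi
    # -x extract, -z decompress with gzip, -f read from the named file
    tar -xzf "$archive" || return 1
    # Assumption: the file lands in the current directory.
    if [ ! -f "$expected" ]; then
        echo "expected file not found after extraction: $expected" >&2
        return 1
    fi
    echo "ok: $expected"
}

# Usage with the file names given in the steps above:
# extract_and_check northern_krill.genome_assembly.tar.gz \
#     1.m_norvegica.main_w_mito.fasta
# extract_and_check trinity_transcript.16509_single_isoforms.cds.fasta.tar.gz \
#     trinity_transcript.16509_single_isoforms.cds.fasta
```

The helper returns a non-zero status when the archive is missing or the expected file does not appear, so it can gate the later steps in a script.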

3. gmap alignment

Map transcripts to assembly with gmap:

    # Build index
    gmap_build --genomedb mnorvegica 1.m_norvegica.main_w_mito.fasta
    # Map with gmap; format=1 -> psl output
    gmap -t 12 --dir . --db mnorvegica --format 1 trinity_transcript.16509_single_isoforms.cds.fasta > mnorvegica.psl
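
Before moving on, it can be worth checking how many transcripts received at least one alignment. The sketch below is an optional sanity check, not part of genecovr: `count_mapped_transcripts` is a hypothetical helper that assumes standard 21-column, tab-separated psl rows in which column 10 holds the query (transcript) name.

```shell
#!/usr/bin/env bash
# Sketch: count distinct transcripts that have at least one psl alignment.
# count_mapped_transcripts is a hypothetical helper, not part of genecovr.
count_mapped_transcripts() {
    local psl="$1"
    # Keep rows with 21 tab-separated fields whose first field (matches)
    # is numeric, which skips any header lines; column 10 is qName.
    awk -F'\t' '
        NF == 21 && $1 ~ /^[0-9]+$/ { seen[$10] = 1 }
        END { n = 0; for (q in seen) n++; print n }
    ' "$psl"
}

# Usage with the output file produced by the gmap command above:
# count_mapped_transcripts mnorvegica.psl
```

The result can be compared against the 16,509 transcripts in the input set; a transcript that maps to several contigs is counted once.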

4. genecovr input file

Generate a comma-separated file, assemblies.csv, with the following contents:

    main,mnorvegica.psl,1.m_norvegica.main_w_mito.fasta,trinity_transcript.16509_single_isoforms.cds.fasta

and run

    genecovr assemblies.csv

This will generate a number of summary data files along with png and
pdf plots based on the summary data.
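
The input file from step 4 can also be written from the shell. The following is a minimal sketch: `write_assemblies_csv` is a hypothetical helper, and the column order (label, psl file, assembly fasta, transcript fasta) is taken from the example line above.

```shell
#!/usr/bin/env bash
# Sketch: write the one-line genecovr input file and check its inputs.
# write_assemblies_csv is a hypothetical helper, not part of genecovr.
write_assemblies_csv() {
    local out="$1" label="$2" psl="$3" assembly="$4" transcripts="$5"
    printf '%s,%s,%s,%s\n' "$label" "$psl" "$assembly" "$transcripts" > "$out"
}

write_assemblies_csv assemblies.csv main mnorvegica.psl \
    1.m_norvegica.main_w_mito.fasta \
    trinity_transcript.16509_single_isoforms.cds.fasta

# Warn about any listed input that is missing before launching genecovr.
IFS=, read -r label psl assembly transcripts < assemblies.csv
for f in "$psl" "$assembly" "$transcripts"; do
    [ -f "$f" ] || echo "warning: $f not found" >&2
done
```

Once no warnings are printed, `genecovr assemblies.csv` can be run as described above.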
22 changes: 12 additions & 10 deletions vignettes/genecovr.Rmd
@@ -6,7 +6,7 @@ output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Gene body coverage analysis in R}
%\VignetteEngine{knitr::rmarkdown}
-\usepackage[utf8]{inputenc}
+%\VignetteEncoding{UTF-8}
biblio-style: plain
bibliography: bibliography.bib
---
@@ -71,15 +71,17 @@ are `GenomicRanges::GRanges` objects or objects derived from the
# Analysing gene body coverage

In this section we analyse the mapping of a transcriptome to a
-non-polished and polished assembly. The mapping results consist of two
-gmap files in psl format, `transcripts2nonpolished.psl` and
-`transcripts2polished.psl`. In addition there are fasta index files
-for both assemblies (`nonpolished.fai` and `polished.fai`) and for the
-transcriptome (`transcripts.fai`). The fasta indices are used to
-generate `GenomeInfoDb::Seqinfo` objects that can be used to set
-sequence information on the parsed output. We load the fasta indices
-and parse the psl files with `genecovr::readPsl`, storing the results
-in an `genecovr::AlignmentPairsList` for convenience.
+non-polished and polished assembly, using example data. The entire
+analysis takes less than 5 minutes to execute using these datasets.
+The mapping results consist of two gmap files in psl format,
+`transcripts2nonpolished.psl` and `transcripts2polished.psl`. In
+addition there are fasta index files for both assemblies
+(`nonpolished.fai` and `polished.fai`) and for the transcriptome
+(`transcripts.fai`). The fasta indices are used to generate
+`GenomeInfoDb::Seqinfo` objects that can be used to set sequence
+information on the parsed output. We load the fasta indices and parse
+the psl files with `genecovr::readPsl`, storing the results in a
+`genecovr::AlignmentPairsList` for convenience.
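
For readers preparing their own data, fasta index files are commonly produced with `samtools faidx`, whose first two columns are the sequence name and its length. Assuming the `.fai` files used here follow that layout, the shell sketch below recomputes those two columns directly from a fasta file so an index can be spot-checked. `fasta_lengths` is a hypothetical helper, not part of genecovr.

```shell
#!/usr/bin/env bash
# Sketch: print "name<TAB>length" for every sequence in a fasta file.
# fasta_lengths is a hypothetical helper, not part of genecovr.
fasta_lengths() {
    awk '
        /^>/ {
            # Emit the previous record before starting a new one.
            if (name != "") print name "\t" len
            # Sequence name: header text up to the first whitespace.
            name = substr($1, 2)
            len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Usage (file name is an assumption for illustration):
# fasta_lengths transcripts.fasta | head
```

The output should agree with the first two columns of the corresponding `.fai` file.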

``` {r gbc-load-data}
assembly_fai_fn <- list(