AStarix is a sequence-to-graph aligner based on the A* shortest path algorithm. It finds alignments that are optimal according to edit-distance with non-negative weights. The motivation of AStarix is to speed up optimal alignment in the average case. A trie index allows to scale with reference size [1], while seed heuristic allows to scale with query length [2].

Publications

[1] Ivanov, Bichsel, Mustafa, Kahles, Rätsch, Vechev (RECOMB 2020) AStarix: Fast and Optimal Sequence-to-Graph Alignment (2020) -- Introduces AStarix tool, the trie index which allows scaling to bigger references, and Prefix heuristic which outperforms Dijkstra. Evaluations here.

[2] Ivanov, Bichsel, Vechev (RECOMB 2022) Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds -- Introduces Seed heuristic which outperforms other aproaches on Illumina reads and allows to scale to long HiFi reads.

Installation

Prerequisites of AStarix are:

argp – argument parsing library
pandas (optional) – dataframe library used for testing and evalutaions

To compile AStarix and run a test:

make
make test

Third-party libraries are located in the /ext directory and their own licenses apply. Tested on Ubuntu 20.04.

Example run

This is an example run of AStarix on a toy data: aligning 100 simulated Illumina reads to a linear graph from the first 10000bp of Escherichia coli. You can expect a similar summary printed to standard output:

$ make test

release/astarix align-optimal -a astar-seeds -t 1 -v 0 -g data/ecoli_head10000_linear/graph.gfa -q data/ecoli_head10000_linear/illumina.fq -o tmp/ecoli_head10000_linear/astar-seeds --fixed_trie_depth 1 --seeds_len 25

----
Assert() checks:       OFF
        verbose:        0
----
Loading reference graph... Added reverse complement... done in 0.001078s.
Loading queries... done in 0.000216s.
Contructing trie... done in 0.007017s.
Initializing A* heuristic... done in 1.6e-05s.

 == General parameters and optimizations ==
             Alignment algo: astar-seeds
                 Edit costs: 0, 1, 5, 5 (match, subst, ins, del)
              Greedy match?: true
                    Threads: 1

 == A* parameters ==
          seed length: 25 bp
     skip near crumbs: 1

 == Data ==
         Original reference: 9826 nodes, 9824 edges
                       Trie: 5186 nodes, 24822 edges, depth=7
  Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0

                  Trie: 5186 nodes, 24822 edges, depth=7
  Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0
                      Reads: 100 x 99bp, coverage: 1.00774x
            Avg phred value: 0.13640%

 == Aligning statistics ==
        Explored rate (avg): 1.70707 states/read_bp
         States with crumbs: 0.02992%
            Explored states: 0.01737%
             Skipped states: 99.95271%
     Pushed rate (avg, max): 0.77130, 0.01400    [states/bp] (states normalized by query length)
     Popped rate (avg, max): 0.09310, 0.00140
             Average popped: 7.16000 from trie (76.90655%) vs 2.15000 from ref (per read)
Total cost of aligned reads: 15, 0.15000 per read, 0.15152% per letter
                 Alignments: 100 unique (100.00000%), 0 ambiguous (0.00000%) and 0 overcost (0.00000%)

 == Heuristic stats (astar-seeds) ==
        For all reads:
                            Seeds: 400 (4.00000 per read)
                     Seed matches: 385 (3.85000 per read, 0.96250 per seed)
               States with crumbs: 29105 [+0.00000% repeated], (291.05000 per read)
                  Heuristic (avg): 0.10000 of potential 4.00000

 == Performance ==
    Memory:                    measured | estimated
                   total: 0.00510gb, 100% | -
               reference: 0.00267gb, 52.39521% | 5.74505%
                   reads: 0.00000gb, 0.00000% | 0.18091%
                    trie: 0.00100gb, 19.53593% | 5.82224%
     equiv. classes opt.: 0.00000gb, 0.00000%
          A*-memoization: 0.00000gb, 0.00000

   Total wall runtime:    0.04086s
       reference loading: 0.00108s
         queries loading: 0.00022s
          construct trie: 0.00702s
              precompute: 0.00002s
       align (wall time): 0.03040s = 3289.04470 reads/s = 325.61543 Kbp/s

    Total align cpu time: 0.02998s = 3335.11206 reads/s = 330.17609 Kbp/s
     |          Preprocessing: 31.61686%
     |               A* query: 18.47652%
     |           greedy_match: 11.42609%
     |                  other: 38.48052%
 DONE

The directory tmp/ecoli_head10000_linear/astar-seeds/ will contain a file with execution logs and a file with alignment statistics. Short aggregated statistics are print to standard output (to redirect, you can add >summary.txt).

Usage

AStarix only finds optimal alignments (specified by argument align-optimal). Currently supported formats are .gfa without overlapping nodes (for a graph reference) and .fa/.fasta (for a linear reference). The queries should be in .fq/.fastq format (the phred values are ignored).

$ astarix --help
----  
Usage: astarix [OPTION...] align-optimal -g GRAPH.gfa -q READS.fq -o OUT_DIR/
Optimal sequence-to-graph aligner based on A* shortest path.

  -a, --algorithm={dijkstra, astar-prefix, astar-seeds}
                             Shortest path algorithm
  -c, --prefix_cost_cap=A*_COST_CAP
                             The maximum prefix cost for the A* heuristic
  -d, --prefix_len_cap=A*_PREFIX_CAP
                             The upcoming sequence length cap for the A*
                             heuristic
  -D, --tree_depth=TREE_DEPTH   Suffix tree depth
  -e, --prefix_equivalence_classes=A*_EQ_CLASSES
                             Whether to partition all nodes to equivalence
                             classes in order not to reuse the heuristic
      --fixed_trie_depth=FIXED_TRIE_DEPTH
                             Some leafs depth can be less than tree_depth
                             (variable=0, fixed=1)
  -f, --greedy_match=GREEDY_MATCH
                             Proceed greedily forward if there is a unique
                             matching outgoing edge
  -g, --graph=GRAPH          Input graph (.gfa)
  -G, --gap=GAP_COST         Gap (Insertion or Deletion) penalty [5]
  -k, --k_best_alignments=TOP_K   Output at most k optimal alignments per read
                             [1]
  -M, --match=MATCH_COST     Match penalty [0]
  -o, --outdir=OUTDIR        Output directory
  -q, --query=QUERY          Input queries/reads (.fq, .fastq)
      --seeds_len=A*_SEED_LEN   The length of the A* seeds.
      --seeds_skip_near_crumbs={0,1}
  -S, --subst=SUBST_COST     Substitution penalty [1]
  -t, --threads=THREADS      Number of threads [1]
  -v, --verbose=THREADS      Verbosity (silent=0, info=1, debug=2), [0]
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Publications

Installation

Example run

Usage

Files

README.md

Latest commit

History

README.md

File metadata and controls

Publications

Installation

Example run

Usage