This is a first, and very simplistic, cut at generating BEAST2 XML from Python (2.7, and 3.5 to 3.10 are all known to work).
BEAST2 is a complex program and so is its input XML. People normally generate the input XML using a GUI tool, BEAUti. BEAUti is also a complex tool, and the XML it generates can vary widely. Because BEAUti is a GUI tool it's not possible to use it to programmatically generate XML.
I wrote beast2-xml
because I wanted a way to quickly and easily generate
BEAST2 XML files, from the command line and also from my
Python code.
There are a lot of things that could be added to this code! Contributions welcome.
- A command-line script (bin/beast2-xml.py) to generate BEAST2 XML files.
- A simplistic Python class (in beast2xml/beast2.py) that may be helpful if you are writing Python that needs to generate BEAST2 XML files. (This Python class is of course used by the command line script.)
$ pip install beast2-xml
You can also get the source from PyPI or on Github.
You can use bin/beast2-xml.py
to quickly generate BEAST2 XML. You must
provide the sequences for the analysis (as FASTA or FASTQ), either on
standard input or using the --fastaFile
option.
Run beast2-xml.py --help
to see currently supported options:
$ beast2-xml.py --help
usage: beast2-xml.py [-h] [--clockModel MODEL | --templateFile FILENAME]
[--chainLength LENGTH] [--age ID=N [ID=N ...]]
[--defaultAge N] [--dateUnit UNIT]
[--dateDirection DIRECTION]
[--logFileBasename BASE-FILENAME] [--traceLogEvery N]
[--treeLogEvery N] [--screenLogEvery N] [--mimicBEAUti]
[--sequenceIdDateRegex REGEX]
[--sequenceIdAgeRegex REGEX]
[--sequenceIdRegexMayNotMatch] [--fastaFile FILENAME]
[--readClass CLASSNAME] [--fasta | --fastq | --fasta-ss]
Given FASTA on stdin (or in a file via the --fastaFile option), write an XML
BEAST2 input file on stdout.
optional arguments:
-h, --help show this help message and exit
--clockModel MODEL Specify the clock model. Possible values are 'random-
local', 'relaxed-exponential', 'relaxed-lognormal', or
'strict' (default: strict)
--templateFile FILENAME
The XML template file to use. (default: None)
--chainLength LENGTH The MCMC chain length. (default: None)
--age ID=N [ID=N ...]
The age of a sequence. The format is a sequence id, an
equals sign, then the age. For convenience, just the
first part of a full sequence id (i.e., up to the
first space) may be given. May be specified multiple
times. (default: None)
--defaultAge N The age to use for sequences that are not explicitly
given an age via --age. (default: 0.0)
--dateUnit UNIT Specify the date unit. Possible values are 'day',
'month', or 'year'. (default: year)
--dateDirection DIRECTION
Specify whether dates are back in time from the
present or forward in time from some point in the
past. Possible values are 'forward' or 'backward'.
(default: backward)
--logFileBasename BASE-FILENAME
The base filename to write logs to. A ".log" or
".trees" suffix will be appended to this to make
complete log file names. (default: beast-output)
--traceLogEvery N How often to write to the trace log file. (default:
2000)
--treeLogEvery N How often to write to the tree log file. (default:
2000)
--screenLogEvery N How often to write logging to the screen (i.e.,
terminal). (default: 2000)
--mimicBEAUti If specified, add attributes to the <beast> tag that
mimic what BEAUti uses so that BEAUti will be able to
load the XML. (default: False)
--sequenceIdDateRegex REGEX
A regular expression that will be used to capture sequence
dates from their ids. The regular expression must have three
named capture regions ("year", "month", and "day"). Regular
expression matching is anchored to the start of the id string
(i.e., Python's re.match function is used, not the re.search
function), so you must explicitly match the id from its beginning.
For example, you might use
--sequenceIdDateRegex '^.*_(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)'.
(default: None)
--sequenceIdAgeRegex REGEX
A regular expression that will be used to capture sequence ages
from their ids. The regular expression must have a single capture
region. Regular expression matching is anchored to the start of
the id string (i.e., Python's re.match function is used, not the
re.search function), so you must explicitly match the id from its
beginning. For example, you might use --sequenceIdAgeRegex '^.*_(\d+)$'
to capture an age preceded by an underscore at the very end of the
sequence id. If --sequenceIdDateRegex is also given, it
takes precedence when matching sequence ids. (default: None)
--sequenceIdRegexMayNotMatch
If specified (and --sequenceIdDateRegex or --sequenceIdAgeRegex is given)
it will not be considered an error if a sequence id does not
match the given regular expression. In that case, sequences will be assigned
an age of zero unless one is given via --age. (default: False)
--fastaFile FILENAME The name of the FASTA input file. Standard input will
be read if no file name is given.
--readClass CLASSNAME
If specified, give the type of the reads in the input.
Possible choices: SSAARead, DNARead, TranslatedRead,
RNARead, SSAAReadWithX, AAReadORF, Read, AARead,
AAReadWithX. (default: DNARead)
--fasta If specified, input will be treated as FASTA. This is
the default. (default: False)
--fastq If specified, input will be treated as FASTQ.
(default: False)
--fasta-ss If specified, input will be treated as PDB FASTA
(i.e., regular FASTA with each sequence followed by
its structure). (default: False)
As mentioned, this is extremely simplistic. If you need to generate more
complex XML, you can pass in a template file using --templateFile
. Your
template will need to have a high-level structure that's similar to those
produced by BEAUti, otherwise the various command-line options for
manipulating the template wont find what they need (you'll see an error
message in this case).
If you don't pass a template file name, a default will be chosen based on
the clock model (strict
by default). The default templates
all come from BEAUti, so if you generate a template yourself using BEAUti,
you can almost certainly pass it to beast2-xml.py
to use as a basis to
create variants from.
Note that the generated XML contains just the first part of sequence ids in the given FASTA input. I'm not sure if this is a requirement, but it's what BEAUti does and so I have done the same.
If you want to create BEAST2 XML from your own Python, you can use the
BEAST2XML
class defined in beast2xml/beast2.py.
The simplest possible usage is
from beast2xml import BEAST2XML
print(BEAST2XML().toString())
There are several options you can pass to the BEAST2XML
constructor:
class BEAST2XML(object):
"""
Create BEAST2 XML.
@param template: A C{str} filename or an open file pointer to read the
XML template from. If C{None}, a template based on C{clockModel}
will be used.
@param clockModel: A C{str} specifying the clock model. Possible values
are 'random-local', 'relaxed-exponential', 'relaxed-lognormal',
and 'strict.
@param sequenceIdDateRegex: If not C{None}, gives a C{str} regular
expression that will be used to capture sequence dates from their ids.
See the explanation in ../bin/beast2-xml.py
@param sequenceIdAgeRegex: If not C{None}, gives a C{str} regular
expression that will be used to capture sequence ages from their ids.
See the explanation in ../bin/beast2-xml.py
@param sequenceIdRegexMustMatch: If C{True} it will be considered an error
if a sequence id does not match the regular expression given by
C{sequenceIdDateRegex} or C{sequenceIdAgeRegex}.
"""
and options you can pass to its toString
method:
def toString(self, chainLength=None, defaultAge=0.0, dateUnit='year',
dateDirection='backward', logFileBasename=None,
traceLogEvery=None, treeLogEvery=None, screenLogEvery=None,
transformFunc=None, mimicBEAUti=False):
"""
@param chainLength: The C{int} length of the MCMC chain. If C{None},
the value in the template will be retained.
@param defaultAge: The C{float} age to use for sequences that are not
explicitly given an age via C{addAge}.
@param dateDirection: A C{str}, either 'backward' or 'forward'
indicating whether dates are back in time from the present or
forward in time from some point in the past.
@param logFileBasename: The C{str} The base filename to write logs to.
A .log or .trees suffix will be appended to this to make the
actual log file names. If C{None}, the log file names in the
template will be retained.
@param traceLogEvery: An C{int} specifying how often to write to the
trace log file. If C{None}, the value in the template will be
retained.
@param treeLogEvery: An C{int} specifying how often to write to the
tree log file. If C{None}, the value in the template will be
retained.
@param screenLogEvery: An C{int} specifying how often to write to the
terminal (screen) log. If C{None}, the value in the template will
be retained.
@param transformFunc: If not C{None} A callable that will be passed
the C{ElementTree} instance and which must return an C{ElementTree}
instance.
@param mimicBEAUti: If C{True}, add attributes to the <beast> tag
in the way that BEAUti does, to allow BEAUti to load the XML we
produce.
@raise ValueError: If any required tree elements cannot be found
(raised by our call to self.findElements).
@return: C{str} XML.
"""
An example of using the Python class can be found in the beast2-xml.py script. Small examples showing all functionality can be found in the tests in test/testBeast2.py.
To run the tests:
$ make check
or if you have Twisted installed, you
can use its trial
test runner, via
$ make tcheck
You can also use
$ tox
to run tests for various versions of Python.