<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Sequence assembly with MIRA 5</title><link rel="stylesheet" type="text/css" href="doccss/miradocstyle.css"><meta name="generator" content="DocBook XSL Stylesheets V1.79.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="book"><div class="titlepage"><div><div><h1 class="title"><a name="idm1"></a>Sequence assembly with MIRA 5</h1></div><div><h2 class="subtitle">
The Definitive Guide
</h2></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><span class="contrib">Main author</span> <code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Jacqueline</span> <span class="surname">Weber</span></h3><span class="contrib">Extensive review of early reference manual
</span> </div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Andrea</span> <span class="surname">Hörster</span></h3><span class="contrib">Extensive review of early reference manual
</span> </div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Katrina</span> <span class="surname">Dlugosch</span></h3><span class="contrib">Draft for section on preprocessing of ESTs in EST manual
</span> </div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div><div><div class="legalnotice"><a name="idm6"></a><p>
This documentation is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of
this license, visit <a class="ulink" href="http://creativecommons.org/licenses/by-nc-sa/3.0/" target="_top">http://creativecommons.org/licenses/by-nc-sa/3.0/</a> or send a letter to
Creative Commons, 171 Second Street, Suite 300, San Francisco, California,
94105, USA.
</p></div></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="preface"><a href="#idm30">Preface</a></span></dt><dt><span class="chapter"><a href="#chap_intro">1. Introduction to MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_installation">2. Installing MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_reference">3. MIRA 4 reference manual</a></span></dt><dt><span class="chapter"><a href="#chap_dataprep">4. Preparing data</a></span></dt><dt><span class="chapter"><a href="#chap_denovo">5. De-novo assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_mapping">6. Mapping assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_est">7. EST / RNASeq assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_specialparams">8. Parameters for special situations</a></span></dt><dt><span class="chapter"><a href="#chap_results">9. Working with the results of MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_mutils">10. Utilities in the MIRA package</a></span></dt><dt><span class="chapter"><a href="#chap_hard">11. Assembly of <span class="emphasis"><em>hard</em></span> genome or EST / RNASeq projects</a></span></dt><dt><span class="chapter"><a href="#chap_seqtechdesc">12. Description of sequencing technologies</a></span></dt><dt><span class="chapter"><a href="#chap_seqadvice">13. Some advice when going into a sequencing project</a></span></dt><dt><span class="chapter"><a href="#chap_bitsandpieces">14. Bits and pieces</a></span></dt><dt><span class="chapter"><a href="#chap_faq">15. Frequently asked questions</a></span></dt><dt><span class="chapter"><a href="#chap_maf">16. The MAF format</a></span></dt><dt><span class="chapter"><a href="#chap_logfiles">17. Log and temporary files used by MIRA</a></span></dt></dl></div><div class="list-of-figures"><p><b>List of Figures</b></p><dl><dt>1.1. <a href="#chap_intro::srmc_in_454sxahyb_1stpass.png">
How MIRA learns from misassemblies (1)
</a></dt><dt>1.2. <a href="#chap_intro::srmc_in_454sxahyb_lastpass1.png">
How MIRA learns from misassemblies (2)
</a></dt><dt>1.3. <a href="#chap_intro::srmc_in_454sxahyb_lastpass2.png">
How MIRA learns from misassemblies (3)
</a></dt><dt>1.4. <a href="#chap_intro::gcb99_replocator.png">
Slides presenting the repeat locator at the GCB 99
</a></dt><dt>1.5. <a href="#chap_intro::gcb99_edit.png">
Slides presenting the EdIt automatic Sanger editor at the GCB 99
</a></dt><dt>1.6. <a href="#chap_intro::san_autoedit1.png">
Sanger assembly without EdIt automatic editing routines
</a></dt><dt>1.7. <a href="#chap_intro::san_autoedit2.png">
Sanger assembly with EdIt automatic editing routines
</a></dt><dt>1.8. <a href="#chap_intro::454_autoedit1.png">
454 assembly without 454 automatic editing routines
</a></dt><dt>1.9. <a href="#chap_intro::454_autoedit2.png">
454 assembly with 454 automatic editing routines
</a></dt><dt>1.10. <a href="#chap_intro::haf5_haf2_contigcoverage_ovals.png">
Coverage of a contig.
</a></dt><dt>1.11. <a href="#chap_intro::haf5_repend_rrna.png">
Repetitive end of a contig
</a></dt><dt>1.12. <a href="#chap_intro::haf2_end_nomoredata.png">
Non-repetitive end of a contig
</a></dt><dt>1.13. <a href="#chap_intro::454sxa_stms_hybdenovo.png">
MIRA pointing out problems in hybrid assemblies (1)
</a></dt><dt>1.14. <a href="#chap_intro::454san_stmu_hybdenovo.png">
MIRA pointing out problems in hybrid assemblies (2)
</a></dt><dt>1.15. <a href="#chap_intro::sxa_cer_reads1.png">
Coverage equivalent reads (CERs) explained.
</a></dt><dt>1.16. <a href="#chap_intro::sxa_cer_reads2.png">
Coverage equivalent reads let SNPs become very visible in assembly viewers
</a></dt><dt>1.17. <a href="#chap_intro::sxa_sroc_lenski2.png">
SNP tags in a MIRA assembly
</a></dt><dt>1.18. <a href="#chap_intro::sxa_mcvc_lenski.png">
Tag pointing out a large deletion in a MIRA mapping assembly
</a></dt><dt>9.1. <a href="#chap_res::results_miraconvert.png">
Format conversions with <span class="command"><strong>miraconvert</strong></span>
</a></dt><dt>9.2. <a href="#chap_res::results_mira2other.png">
Conversions needed for other tools.
</a></dt><dt>9.3. <a href="#haf_danger_join_notok.png">
Join at a repetitive site which should not be performed due to
missing spanning templates.
</a></dt><dt>9.4. <a href="#haf_danger_join_ok.png">
Join at a repetitive site which should be performed due to
spanning templates being good.
</a></dt><dt>9.5. <a href="#454_stacks_join.png">
Pseudo-repeat in 454 data due to sequencing artifacts
</a></dt><dt>9.6. <a href="#chap_sol::sxa_sroc_lenski1.png">
"SROc" tag showing a SNP position in a Solexa mapping
assembly.
</a></dt><dt>9.7. <a href="#chap_sol::sxa_sroc_lenski2.png">
"SROc" tag showing a SNP/indel position in a Solexa mapping
assembly.
</a></dt><dt>9.8. <a href="#chap_sol::sxa_mcvc_lenski.png">
"MCVc" tag (dark red stretch in figure) showing a genome
deletion in Solexa mapping assembly.
</a></dt><dt>9.9. <a href="#chap_sol::sxa_wrmcsrmc_hiding_lenski1.png">
An IS150 insertion hiding behind WRMc and SRMc tags
</a></dt><dt>9.10. <a href="#chap_sol::sxa_xmastree_lenski1.png">
A 16 base pair deletion leading to a SROc/UNsC xmas-tree
</a></dt><dt>9.11. <a href="#chap_sol::sxa_xmastree_lenski2.png">
An IS186 insertion leading to a SROc/UNsC xmas-tree
</a></dt><dt>12.1. <a href="#sxa_unsc_ggcxg2_lenski.png">
The Solexa GGCxG problem.
</a></dt><dt>12.2. <a href="#sxa_unsc_ggc1_lenski.png">
The Solexa GGC problem, forward example
</a></dt><dt>12.3. <a href="#sxa_unsc_ggc4_lenski.png">
The Solexa GGC problem, reverse example
</a></dt><dt>12.4. <a href="#sxa_xmastree_lenski2.png">
A genuine place of interest almost masked by the
<code class="literal">GGCxG</code> problem.
</a></dt><dt>12.5. <a href="#sxa_gcbias_nobias2008.png">
Example for no GC coverage bias in 2008 Solexa data.
</a></dt><dt>12.6. <a href="#sxa_gcbias_bias2009.png">
Example for GC coverage bias starting Q3 2009 in Solexa data.
</a></dt><dt>12.7. <a href="#sxa_gcbias_comp20082009.png">
Example for GC coverage bias, direct comparison 2008 / 2010 data.
</a></dt><dt>12.8. <a href="#chap_iontor::ion_dh10bgoodB13.png">
Example for good IonTorrent data (100bp reads)
</a></dt><dt>12.9. <a href="#chap_iontor::iontor_indelhpexample.png">
Example for problematic IonTorrent data (100bp reads)
</a></dt><dt>12.10. <a href="#chap_iontor::ion_dh10bdirdepindel.png.png">
Example for a sequencing direction dependent indel
</a></dt></dl></div><div class="preface"><div class="titlepage"><div><div><h1 class="title"><a name="idm30"></a>Preface</h1></div></div></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">How much intelligence does one need to sneak upon lettuce?
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><p>
This "book" is actually the result of an exercise in self-defense. It
contains texts from several years of help files, mails, postings, questions,
answers, etc. concerning MIRA and the assembly projects one can do with it.
</p><p>
I never really intended to push MIRA. It started out as a PhD thesis and I
subsequently continued development when I needed something to be done which
other programs couldn't do at the time. MIRA has been available
as a binary on the Internet since 1999 ... and as Open Source since
2007. Somehow, MIRA seems to have caught the attention of more than just a
few specialised sequencing labs, and over the years I've seen an ever growing
number of mails in my inbox and on the MIRA mailing list, both from people
who have been in the sequencing business "since ever" and from labs or
people just getting their feet wet in the area.
</p><p>
The help files -- and through them this book -- sort of reflect this
development. Most of the chapters<a href="#ftn.idm40" class="footnote" name="idm40"><sup class="footnote">[1]</sup></a> contain both very specialised
topics as well as step-by-step walk-throughs intended to help people to get
their assembly projects going. Some parts of the documentation are written
in a decidedly non-scientific way. Please excuse this: time for rewriting
mails was somewhat lacking, so some texts were re-used almost verbatim.
</p><p>
The last few years have seen tremendous change in the sequencing
technologies and MIRA 4 reflects that: core data structures and
routines had to be thrown overboard and replaced with faster and/or more
versatile versions suited for the broad range of technologies and use-cases
I am currently running MIRA with.
</p><p>
Nothing is perfect, and both MIRA and this documentation (even if it is
rather pompously called <span class="emphasis"><em>Definitive Guide</em></span>) are far from
it. If you spot an error either in MIRA or this manual, feel free to report
it. Or, even better, correct it if you can. At least with the manual files
it should be easy: they're basically just some decorated text files.
</p><p>
I hope that MIRA will be as useful to you as it has been to me. Have a lot
of fun with it.
</p><p>
Burlington, Spring 2016
</p><p>
Bastien Chevreux
</p><div class="footnotes"><br><hr style="width:100; text-align:left;margin-left: 0"><div id="ftn.idm40" class="footnote"><p><a href="#idm40" class="para"><sup class="para">[1] </sup></a>Avid readers of David
Gerrold will certainly recognise the quotes from his books at the beginning
of each chapter.</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_intro"></a>Chapter 1. Introduction to MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_intro_whatismira">1.1.
What is MIRA?
</a></span></dt><dt><span class="sect1"><a href="#sect_wheretostartreading">1.2.
What to read in this manual and where to start reading?
</a></span></dt><dt><span class="sect1"><a href="#sect_intro_miraquicktour">1.3.
The MIRA quick tour
</a></span></dt><dt><span class="sect1"><a href="#sect_for_which_data_sets_to_use_mira_and_for_which_not">1.4.
For which data sets to use MIRA and for which not
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect3_genome_denovo">1.4.1.
Genome de-novo
</a></span></dt><dt><span class="sect2"><a href="#sect_genome_mapping">1.4.2.
Genome mapping
</a></span></dt><dt><span class="sect2"><a href="#sect3_ests_rnaseq">1.4.3.
ESTs / RNASeq
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_specialfeatures">1.5.
Any special features I might be interested in?
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_miradiscernsrepeats">1.5.1.
MIRA learns to discern non-perfect repeats, leading to better assemblies
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_automatic_editors">1.5.2.
MIRA has integrated editors for data from Sanger, 454, IonTorrent sequencing
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_whycontigsend">1.5.3.
MIRA lets you see why contigs end where they end
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_stmshybrid_tags">1.5.4.
MIRA tags problematic decisions in hybrid assemblies
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_cer_reads">1.5.5.
MIRA allows older finishing programs to cope with the amount of data in Solexa
mapping projects
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_mapping_tags">1.5.6.
MIRA tags SNPs and other features, outputs result files
for biologists
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_miramuchmore">1.5.7.
MIRA has ... much more
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_versions_licenses_disclaimer_and_copyright">1.6.
Versions, Licenses, Disclaimer and Copyright
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_versions">1.6.1.
Versions
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_licenses">1.6.2.
License
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_intro_licensemira">1.6.2.1.
MIRA
</a></span></dt><dt><span class="sect3"><a href="#sect_intro_licensedocs">1.6.2.2.
Documentation
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_intro_copyright">1.6.3.
Copyright
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_external_libraries">1.6.4.
External libraries
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_getting_help___mailing_lists___reporting_bugs">1.7.
Getting help / Mailing lists / Reporting bugs
</a></span></dt><dt><span class="sect1"><a href="#sect_intro_author">1.8.
Author
</a></span></dt><dt><span class="sect1"><a href="#sect_intro_miscellaneous">1.9.
Miscellaneous
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_citations">1.9.1.
Citing MIRA
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_postcards_gold_and_jewellery">1.9.2.
Postcards, gold and jewellery
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Half of being smart is to know what you're dumb at.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_whatismira"></a>1.1.
What is MIRA?
</h2></div></div></div><p>
MIRA is a multi-pass DNA sequence data assembler/mapper for whole
genome and EST/RNASeq projects. MIRA assembles/maps reads gained by
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
electrophoresis sequencing (aka Sanger sequencing)
</p></li><li class="listitem"><p>
454 pyro-sequencing (GS20, FLX or Titanium)
</p></li><li class="listitem"><p>
Ion Torrent
</p></li><li class="listitem"><p>
Solexa (Illumina) sequencing
</p></li><li class="listitem"><p>
Error-corrected Pacific Biosciences sequences
</p></li></ul></div><p>
into contiguous sequences (called <span class="emphasis"><em>contigs</em></span>). One can
use the sequences of different sequencing technologies either in a
single assembly run (a <span class="emphasis"><em>true hybrid assembly</em></span>), or by
mapping one type of data to an assembly of another sequencing type (a
<span class="emphasis"><em>semi-hybrid assembly (or mapping)</em></span>), or by mapping
data against consensus sequences of other assemblies (a <span class="emphasis"><em>simple
mapping</em></span>).
</p><p>
The MIRA acronym stands for <span class="bold"><strong>M</strong></span>imicking
<span class="bold"><strong>I</strong></span>ntelligent <span class="bold"><strong>R</strong></span>ead <span class="bold"><strong>A</strong></span>ssembly
and the program pretty well does what its acronym says (well, most of
the time anyway). It is the Swiss army knife of sequence assembly that
I've used and developed during the past 14 years to get assembly jobs I
work on done efficiently - and especially accurately. That is, without
me actually putting too much manual work into it.
</p><p>
Over time, other labs and sequencing providers have found MIRA useful
for assembly of extremely 'unfriendly' projects containing lots of
repetitive sequences. As always, your mileage may vary.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_wheretostartreading"></a>1.2.
What to read in this manual and where to start reading?
</h2></div></div></div><p>
At the last count, this manual had almost 200 pages and this might seem a little bit daunting.
However, you very probably do not need to read everything.
</p><p>
You should read most of this introductory chapter though, e.g.:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the part with the MIRA quick tour
</p></li><li class="listitem"><p>
the part which gives a quick overview for which data sets to use MIRA and for which not
</p></li><li class="listitem"><p>
the part which showcases different features of MIRA (lots of screen shots!)
</p></li><li class="listitem"><p>
where and how to get help if things don't work out as you expected
</p></li></ul></div><p>
After that, reading should depend on the type of data you intend to work
with: there are specific chapters for assembly of de-novo, of mapping and
of EST / RNASeq projects. They all contain an overview on how to
define your data and how to launch MIRA for these data sets. There is
also a chapter on how to prepare data sets from specific sequencing
technologies.
</p><p>
The chapter on working with results of MIRA should again be of general
interest to everyone. It describes the structure of output directories
and files and gives first pointers on what to find where. Also,
converting results into different formats -- with and without filtering
for specific needs -- is covered there.
</p><p>
As the previously cited chapters are more introductory in their nature,
they do not go into the details of MIRA parametrisation. While MIRA has
a comprehensive set of standard settings which should be suited for a
majority of assembly tasks, there are more than 150 switches / parameters
with which one can fine tune almost every aspect of an assembly. A
complete description for each and every parameter and how to correctly
set parameters for different use cases and sequencing technologies can
be found in the reference chapter.
</p><p>
As not every assembly project is simple, there is also a chapter with
tips on how to deal with projects which turn out to be "hard." It
certainly helps if you at least skim through it even if you do not
expect to have problems with your data ... it contains a couple of
tricks on what one can see in result files as well as in temporary and
log files which are not explained elsewhere.
</p><p>
MIRA comes with a number of additional utilities which are described in
their own chapter. While the purpose of <span class="command"><strong>miraconvert</strong></span>
should become clear quite quickly, the versatility of use cases for
<span class="command"><strong>mirabait</strong></span> might surprise more than one reader. Be sure to
check it out.
</p><p>
As some general questions on sequencing pop up from time to time
on the MIRA talk mailing list, I have added a chapter with some general
musings on what to consider when going into sequencing projects. This
is in no way a replacement for an exhaustive talk with a
sequencing provider, but it can give a couple of hints on what to take
care of.
</p><p>
There is also a FAQ chapter with some of the more frequently asked questions
which popped up in the past few years.
</p><p>
Finally, there are also chapters covering some more technical aspects of MIRA: the MAF format
and the structure / content of the tmp directory have their own chapters.
</p><p>
Complete walkthroughs ... are lacking at the moment for MIRA 4. In the
MIRA 3 manual I had them, but so many things have changed (at all
levels: MIRA, the sequencing technologies, data repositories) that I did
not have time to update them. I probably will need quite some time to
write new ones. Feel free to send me some if you are inclined to help
fellow scientists.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_miraquicktour"></a>1.3.
The MIRA quick tour
</h2></div></div></div><p>
Input can be in various formats like Staden experiment (EXP), Sanger
CAF, FASTA, FASTQ or PHD files. Ancillary data containing additional
information helpful to the assembly, as contained in, e.g., NCBI
traceinfo XML files or Staden EXP files, is also honoured. If present,
base qualities in
<span class="command"><strong>phred</strong></span> style and SCF signal electrophoresis trace
files are used to adjudicate between or even correct contradictory
stretches of bases in reads by either the integrated automatic EdIt
editor (written by Thomas Pfisterer) or the assembler itself.
</p><p>
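The <span class="command"><strong>phred</strong></span> style qualities mentioned above follow a simple
convention: a quality value Q stands for an estimated error probability of
10^(-Q/10). As a minimal sketch in Python (the function names are made up
for this illustration and are not part of MIRA; the ASCII offset of 33 is the
Sanger FASTQ convention and is assumed here), decoding a FASTQ quality
string could look like this:
</p><pre class="programlisting">
def phred_to_error_probability(quality):
    """Return the estimated probability that a base call is wrong."""
    return 10 ** (-quality / 10)


def decode_fastq_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of phred values.

    The offset of 33 (Sanger FASTQ convention) is an assumption of this
    sketch; files written with another offset need a different value.
    """
    return [ord(symbol) - offset for symbol in quality_string]


# Print each decoded quality next to its error probability.
for quality in decode_fastq_qualities("II5+!"):
    print(quality, phred_to_error_probability(quality))
</pre><p>
With this convention, a quality of 20 means one expected error in 100 base
calls, and a quality of 40 one in 10,000.
</p><p>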
MIRA was conceived especially with the problem of repeats in genomic
data and SNPs in transcript (EST / RNASeq) data in mind. Considerable
effort was made to develop a number of strategies -- ranging from
standard clone-pair size restrictions to discovery and marking of base
positions discriminating the different repeats / SNPs -- to ensure that
repetitive elements are correctly resolved and that misassemblies do not
occur.
</p><p>
The resulting assembly can be written in different standard formats like
CAF, Staden GAP4 directed assembly, ACE, HTML, FASTA, simple text or
transposed contig summary (TCS) files. These can easily be imported into
numerous finishing tools or further evaluated with simple scripts.
</p><p>
The aim of MIRA is to build the best possible assembly by
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
having a more or less full overview on the whole project at any time
of the assembly, i.e. knowledge of almost all possible read-pairs in
a project,
</p></li><li class="listitem"><p>
using high confidence regions (HCRs) of several aligned read-pairs to
start contig building at a good anchor point of a contig, extending
clipped regions of reads on a 'can be justified' basis,
</p></li><li class="listitem"><p>
using all available data present at the time of assembly, i.e.,
instead of relying on sequence and base confidence values only, the
assembler will profit from trace files containing electrophoresis
signals, tags marking possible special attributes of DNA,
information on specific insert sizes of read-pairs etc.,
</p></li><li class="listitem"><p>
having 'intelligent' contig objects accept or refuse reads based on
the rate of unexplainable errors introduced into the consensus,
</p></li><li class="listitem"><p>
learning from mistakes by discovering and analysing possible repeats
differentiated only by single nucleotide polymorphisms; the
important bases for discriminating different repetitive elements are
tagged and used as new information,
</p></li><li class="listitem"><p>
using the possibility given by the integrated automatic editor to
correct errors present in contigs (and subsequently in reads) by
generating and verifying complex error hypotheses through analysis
of trace signals in several reads covering the same area of a
consensus,
</p></li><li class="listitem"><p>
iteratively extending reads (and subsequently contigs) based on
</p><div class="orderedlist"><ol class="orderedlist" type="a"><li class="listitem"><p>
additional information gained by overlapping read pairs in contigs
and
</p></li><li class="listitem"><p>
corrections made by the automated editor.
</p></li></ol></div></li></ol></div><p>
</p><p>
MIRA was part of a bigger project that started at the DKFZ (Deutsches
Krebsforschungszentrum, German Cancer Research Centre) Heidelberg in
1997: the "Bundesministerium für Bildung, Wissenschaft, Forschung und
Technologie" supported the PhD theses of Thomas Pfisterer and myself by grant
number <span class="emphasis"><em>01 KW 9611</em></span>. Besides an assembler to tackle
difficult repeats, the grant also supported the automated editor /
finisher EdIt package, written by Thomas Pfisterer. The strength of
MIRA and EdIt is the automatic interaction of both packages, which
produces assemblies that leave less work for human finishers.
</p><p>
I'd like to thank everybody who reported bugs to me, pointed out problems
they encountered while using the predecessors, and sent ideas and suggestions.
Please continue to do so: the feedback made this third version possible.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_for_which_data_sets_to_use_mira_and_for_which_not"></a>1.4.
For which data sets to use MIRA and for which not
</h2></div></div></div><p>
As a general rule of thumb: if you have an organism with more than
100 to 150 megabases or more than 20 to 40 million reads, you might want
to try other assemblers first.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect3_genome_denovo"></a>1.4.1.
Genome de-novo
</h3></div></div></div><p>
For genome assembly, the version 4 series of MIRA has been reported
to work on projects with something like a million Sanger reads (~80 to
100 megabases at 10x coverage), five to ten million 454 Titanium reads
(~100 megabases at 20x coverage) and 20 to 40 million Solexa reads
(enough for de-novo of a bacterium or a small eukaryote with 76mers or
100mers).
</p><p>
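The read numbers and coverages quoted above are linked by simple
arithmetic: average coverage is the total number of sequenced bases divided
by the genome size. A small Python sketch of that relation follows; the
function names, the mean Sanger read length of 900 bases and the 5 megabase
genome are assumptions made for illustration only:
</p><pre class="programlisting">
def expected_coverage(read_count, mean_read_length, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return read_count * mean_read_length / genome_size


def genome_size_for_coverage(read_count, mean_read_length, coverage):
    """Largest genome (in bases) these reads cover at the given depth."""
    return read_count * mean_read_length / coverage


# One million Sanger reads; a mean usable length of 900 bases is assumed.
print(genome_size_for_coverage(1_000_000, 900, 10) / 1e6, "megabases at 10x")

# Thirty million Solexa 100mers on a hypothetical 5 megabase bacterium.
print(expected_coverage(30_000_000, 100, 5_000_000), "x coverage")
</pre><p>
The first call yields 90 megabases, in line with the range quoted above for
a million Sanger reads. The second yields 600x, far more than a bacterial
genome of that size needs, so it is worth running this kind of estimate
before starting an assembly.
</p><p>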
Provided you have the memory, MIRA is expected to work in de-novo
mode with
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Sanger reads: 5 to 10 million
</p></li><li class="listitem"><p>
454 reads: 5 to 15 million
</p></li><li class="listitem"><p>
Ion Torrent reads: 5 to 15 million
</p></li><li class="listitem"><p>
Solexa reads: in normal operation, up to 40 million reads. Some
people use it on up to 300 million, but you'll need a really big
machine and months of computation time ... I do not recommend
that.
</p></li></ul></div><p>
and "normal" coverages, where "normal" means no more than 50x
to 70x for genome projects. Higher coverages will also work, but may
create somewhat larger temporary files without heavy
parametrisation. Lower coverages (&lt;4x for Sanger, &lt;10x for 454,
&lt;10x for IonTorrent) also need special attention in the
parameter settings.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_genome_mapping"></a>1.4.2.
Genome mapping
</h3></div></div></div><p>
As the complexity of mapping is a lot lower than de-novo, one can
basically double (perhaps even triple) the number of reads compared to
'de-novo'. The limiting factor will be the amount of RAM though, and
MIRA will also need lots of it if you go into eukaryotes.
</p><p>
The main limiting factor regarding time will be the number of
reference sequences (backbones) you are using. As MIRA is pedantic
during the mapping process, expect a rather long wait if you have
more than 40 megabases of reference sequences.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect3_ests_rnaseq"></a>1.4.3.
ESTs / RNASeq
</h3></div></div></div><p>
The default values for MIRA should allow it to work with many EST and
RNASeq data sets, sometimes even from non-normalised libraries. For
extreme coverage cases however (e.g., data sets in which many cases reach
10k coverage or more), one may want to resort to data
reduction routines before feeding the sequences to MIRA.
</p><p>
On the other hand, recent developments of MIRA were targeted at making
de-novo RNASeq assembly of non-normalised libraries liveable, and
indeed I now regularly use MIRA for data sets with up to 50 million
Illumina 100bp reads.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_specialfeatures"></a>1.5.
Any special features I might be interested in?
</h2></div></div></div><p>
A few perhaps.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
The screen shots in this section show data from assemblies produced
with MIRA, but the visualisation itself is done in a finishing program
named <span class="command"><strong>gap4</strong></span>.
</p><p>
Some of the screen shots were edited to show a special feature of
MIRA. E.g., in the screen shots with Solexa data, quite a few reads were
left out of the view pane as otherwise -- due to the amount of data --
these screen shots would need several pages for a complete printout.
</p></td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_miradiscernsrepeats"></a>1.5.1.
MIRA learns to discern non-perfect repeats, leading to better assemblies
</h3></div></div></div><p>
MIRA is an iterative assembler (it works in several passes) and acts a
bit like a child when exploring the world: it explores the assembly
space and is specifically parameterised to allow a couple of assembly
errors during the first passes. But after each pass some routines (the
"parents", if you like) check the result, search for assembly
errors and deduce knowledge about specific assemblies MIRA should not
have ventured into. MIRA will then prevent these errors from re-occurring in
subsequent passes.
</p><p>
As an example, consider the following multiple alignment:
</p><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_1stpass.png"></a><p class="title"><b>Figure 1.1. How MIRA learns from misassemblies (1). Multiple alignment
after 1st pass with an obvious assembly error, notice the clustered
columns discrepancies. Two slightly different repeats were assembled
together.</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_1stpass.png" width="100%" alt="How MIRA learns from misassemblies (1). Multiple alignment after 1st pass with an obvious assembly error, notice the clustered columns discrepancies. Two slightly different repeats were assembled together."></td></tr></table></div></div></div><br class="figure-break"><p>
These kinds of errors are easily spotted by a human, but are hard for
normal alignment algorithms to prevent, as sometimes only a
single base column differs between repeats (and not several as in
this example).
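</p><p>
To make the idea of column discrepancies concrete, here is a minimal
sketch in Python. It is not MIRA's actual algorithm: the function name
<code class="literal">discrepancy_columns</code>, the
<code class="literal">min_support</code> threshold and the toy reads are
assumptions made purely for illustration. A column is flagged when two
different bases are each backed by several reads, which hints at two
repeat copies rather than a lone sequencing error.
</p><pre class="programlisting">
```python
from collections import Counter


def discrepancy_columns(alignment, min_support=2):
    """Return indices of columns where two or more different bases
    are each supported by at least min_support reads.

    alignment: list of equally long, already aligned read strings.
    Hypothetical helper for illustration, not part of MIRA.
    """
    flagged = []
    for idx, column in enumerate(zip(*alignment)):
        # Ignore gap and ambiguity characters when counting support.
        counts = Counter(base for base in column if base not in "-*N")
        well_supported = [b for b, n in counts.items() if n >= min_support]
        if len(well_supported) >= 2:
            flagged.append(idx)
    return flagged


reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGCAC",  # second repeat copy: differs in columns 3 and 7
    "ACGAACGCAC",
    "ACGTACGTAA",  # single stray difference in column 9
]
print(discrepancy_columns(reads))  # prints [3, 7]
```
</pre><p>
Column 9 is not reported because only one read disagrees there. A real
assembler has to weigh base qualities and coverage as well, which this
sketch leaves out.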
</p><p>
MIRA spots these things (even if it's only a single column), tags the
base positions in the reads with additional information and then will
use that information in subsequent passes. The net effect is shown in
the next two figures:
</p><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_lastpass1.png"></a><p class="title"><b>Figure 1.2.
Multiple alignment after last pass where assembly errors from
previous passes have been resolved (1st repeat site)
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_lastpass1.png" width="100%" alt="Multiple alignment after last pass where assembly errors from previous passes have been resolved (1st repeat site)"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_lastpass2.png"></a><p class="title"><b>Figure 1.3.
Multiple alignment after last pass where assembly errors from
previous passes have been resolved (2nd repeat site)
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_lastpass2.png" width="100%" alt="Multiple alignment after last pass where assembly errors from previous passes have been resolved (2nd repeat site)"></td></tr></table></div></div></div><br class="figure-break"><p>
The ability of MIRA to learn and discern non-identical repeats from
each other through column discrepancies is nothing new. Here's the
link to a paper from a talk I gave at the German Conference on
Bioinformatics in 1999: <a class="ulink" href="http://www.bioinfo.de/isb/gcb99/talks/chevreux/" target="_top">http://www.bioinfo.de/isb/gcb99/talks/chevreux/</a>
</p><p>
I'm sure you'll recognise the basic principle in figures 8 and 9. The
slides from the corresponding talk also look very similar to the
screen shots above:
</p><div class="figure"><a name="chap_intro::gcb99_replocator.png"></a><p class="title"><b>Figure 1.4.
Slides presenting the repeat locator at the GCB 99
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/gcb99_replocator.png" width="100%" alt="Slides presenting the repeat locator at the GCB 99"></td></tr></table></div></div></div><br class="figure-break"><p>
You can get the talk with these slides here: <a class="ulink" href="http://chevreux.org/dkfzold/gcb99/bachvortrag_gcb99.ppt" target="_top">http://chevreux.org/dkfzold/gcb99/bachvortrag_gcb99.ppt</a>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_automatic_editors"></a>1.5.2.
MIRA has integrated editors for data from Sanger, 454, IonTorrent sequencing
</h3></div></div></div><p>
Since the first versions in 1999, the <span class="emphasis"><em>EdIt</em></span>
automatic Sanger sequence editor from Thomas Pfisterer has been
integrated into MIRA.
</p><div class="figure"><a name="chap_intro::gcb99_edit.png"></a><p class="title"><b>Figure 1.5.
Slides presenting the Edit automatic Sanger editor at the GCB 99
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/gcb99_edit.png" width="100%" alt="Slides presenting the Edit automatic Sanger editor at the GCB 99"></td></tr></table></div></div></div><br class="figure-break"><p>
The routines use a combination of hypothesis generation/testing
together with neural networks (trained on ABI and ALF traces) for
signal recognition to discern between base calling errors and true
multiple alignment differences. They go back to the trace data to
resolve potential conflicts and eventually re-call bases using the
additional information gained in a multiple alignment of reads.
</p><div class="figure"><a name="chap_intro::san_autoedit1.png"></a><p class="title"><b>Figure 1.6.
Sanger assembly without EdIt automatic editing routines. The bases
with blue background are base calling errors.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/san_autoedit1.png" width="100%" alt="Sanger assembly without EdIt automatic editing routines. The bases with blue background are base calling errors."></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::san_autoedit2.png"></a><p class="title"><b>Figure 1.7.
Sanger assembly with EdIt automatic editing routines. Bases with
pink background are corrections made by EdIt after assessing the
underlying trace files (SCF files in this case). Bases with blue
background are base calling errors where the trace files did not
show enough evidence to allow an editing correction.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/san_autoedit2.png" width="100%" alt="Sanger assembly with EdIt automatic editing routines. Bases with pink background are corrections made by EdIt after assessing the underlying trace files (SCF files in this case). Bases with blue background are base calling errors where the evidence in the trace files did not show enough evidence to allow an editing correction."></td></tr></table></div></div></div><br class="figure-break"><p>
In 2007, with the introduction of 454 reads, MIRA also gained specialised
editors which search for and correct typical 454 sequencing problems like
homopolymer run over-/undercalls. These editors are now integrated
into MIRA itself and are not part of EdIt anymore.
</p><p>
While not paramount to the assembly quality, both editors
provide additional layers of safety for the MIRA learning algorithm to
discern non-perfect repeats even on a single base
discrepancy. Furthermore, the multiple alignments generated by these
two editors are far more pleasant to look at (or to analyse
automatically) than ones containing all kinds of gaps, insertions,
deletions etc.
</p><div class="figure"><a name="chap_intro::454_autoedit1.png"></a><p class="title"><b>Figure 1.8.
454 assembly without 454 automatic editing routines
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_autoedit1.png" width="100%" alt="454 assembly without 454 automatic editing routines"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::454_autoedit2.png"></a><p class="title"><b>Figure 1.9.
454 assembly with 454 automatic editing routines
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_autoedit2.png" width="100%" alt="454 assembly with 454 automatic editing routines"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_whycontigsend"></a>1.5.3.
MIRA lets you see why contigs end where they end
</h3></div></div></div><p>
A very useful feature for finishing is the k-mer (hash) frequency tags
which MIRA sets in the assembly. Provided your finishing editor
understands those tags
(<span class="command"><strong>gap4</strong></span>, <span class="command"><strong>gap5</strong></span>
and <span class="command"><strong>consed</strong></span> are fine but there may be others),
they'll give you precious insight into where you might want to be cautious
when joining two contigs or where you would need to perform some primer
walking. MIRA colourises the assembly with the hash frequency (HAF)
tags to show repetitiveness.
</p><p>
You will need to read about the HAF tags in the reference manual, but
in a nutshell: the HAF5, HAF6 and HAF7 tags tell you that you potentially
have repetitive to very repetitive areas in the genome, while HAF2
tags tell you that these areas in the genome have not been
covered as well as they should have been.
</p><p>
As an example, the following figure shows the coverage of a contig.
</p><div class="figure"><a name="chap_intro::haf5_haf2_contigcoverage_ovals.png"></a><p class="title"><b>Figure 1.10.
Coverage of a contig.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf5_haf2_contigcoverage_ovals.png" width="100%" alt="Coverage of a contig."></td></tr></table></div></div></div><br class="figure-break"><p>
The question now is: why did MIRA stop building this contig on the
left end (left oval) and why on the right end (right oval)?
</p><p>
Looking at the HAF tags in the contig, the answer quickly becomes
clear: the left contig end has HAF5 tags in the reads (shown in bright
red in the following figure). This tells you that MIRA stopped because
it probably could not unambiguously continue building this
contig. Indeed, if you BLAST the sequence at the NCBI, you will find
out that this is an rRNA area of a bacterium, and bacteria
normally have several copies of these in the genome:
</p><div class="figure"><a name="chap_intro::haf5_repend_rrna.png"></a><p class="title"><b>Figure 1.11.
HAF5 tags (reads shown with red background) covering a contig end
show repetitiveness as reason for stopping a contig build.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf5_repend_rrna.png" width="100%" alt="HAF5 tags (reads shown with red background) covering a contig end show repetitiveness as reason for stopping a contig build."></td></tr></table></div></div></div><br class="figure-break"><p>
The right end of the same contig however ends in HAF3 tags (normal
coverage, bright green in the next figure) and even HAF2 tags (below
average coverage, pale green in the next image). This tells you MIRA
stopped building the contig at this place simply because there were
no more reads to continue. This is a perfect target for primer
walking if you want to finish a genome.
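</p><p>
The reasoning used for the two contig ends can be written down as a
small decision rule. The following Python sketch is only an
illustration of that reasoning: the function
<code class="literal">explain_contig_end</code> and its input (a plain
list of tag names seen on the reads at a contig end) are hypothetical,
and the tag meanings are just the short summary given in this section,
not the full definition from the reference manual.
</p><pre class="programlisting">
```python
# Tag meanings as summarised in this section only; the reference
# manual holds the authoritative definition of every HAF tag.
REPETITIVE = {"HAF5", "HAF6", "HAF7"}
NORMAL_OR_LOW = {"HAF2", "HAF3"}


def explain_contig_end(tags_at_end):
    """Guess why a contig stopped, given HAF tag names found on the
    reads covering its end. Hypothetical helper for illustration."""
    tags = set(tags_at_end)
    if not tags.isdisjoint(REPETITIVE):
        return "repeat: could not continue unambiguously"
    if tags and tags.issubset(NORMAL_OR_LOW):
        return "no more reads: candidate for primer walking"
    return "unclear: inspect manually"


print(explain_contig_end(["HAF5", "HAF5", "HAF3"]))  # left end in the example
print(explain_contig_end(["HAF3", "HAF2"]))          # right end in the example
```
</pre><p>
Any tag combination the rule does not recognise is deliberately reported
as unclear, as a human should look at those contig ends.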
</p><div class="figure"><a name="chap_intro::haf2_end_nomoredata.png"></a><p class="title"><b>Figure 1.12.
HAF2 tags covering a contig end show that no more reads were
available for assembly at this position.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf2_end_nomoredata.png" width="100%" alt="HAF2 tags covering a contig end show that no more reads were available for assembly at this position."></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_stmshybrid_tags"></a>1.5.4.
MIRA tags problematic decisions in hybrid assemblies
</h3></div></div></div><p>
Many people combine Sanger & 454 -- or nowadays more 454 &
Solexa -- to improve the sequencing quality of their project through
two (or more) sequencing technologies. To reduce time spent in
finishing, MIRA automatically tags those bases in a consensus of a
hybrid assembly where reads from different sequencing technologies
severely contradict each other.
</p><p>
The following example shows a hybrid 454 / Solexa assembly where the
454 reads (highlighted read names in the following figure) disagree on
whether there are one or two "G" at a certain position. The consensus
algorithm would have chosen "two Gs" for 454, obviously a wrong
decision as all Solexa reads at the same spot (the reads which are not
highlighted) show only one "G" for the given position. While MIRA
chose to believe Solexa in this case, it tagged the position anyway in
case someone chooses to check this kind of thing.
</p><div class="figure"><a name="chap_intro::454sxa_stms_hybdenovo.png"></a><p class="title"><b>Figure 1.13.
A "STMS" tag (Sequencing Technology Mismatch Solved, the black
square base in the consensus) showing a potentially difficult
decision in a hybrid 454 / Solexa de-novo assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454sxa_stms_hybdenovo.png" width="100%" alt='A "STMS" tag (Sequencing Technology Mismatch Solved, the black square base in the consensus) showing a potentially difficult decision in a hybrid 454 / Solexa de-novo assembly.'></td></tr></table></div></div></div><br class="figure-break"><p>
This works also for other sequencing technology combinations or in
mapping assemblies. The following is an example in a hybrid Sanger /
454 project where by pure misfortune, all Sanger reads have a base
calling error at a given position while the 454 reads show the true
sequence.
</p><div class="figure"><a name="chap_intro::454san_stmu_hybdenovo.png"></a><p class="title"><b>Figure 1.14.
A "STMU" tag (Sequencing Technology Mismatch Unresolved, light blue
square in the consensus at lower end of large oval) showing a
potentially difficult decision in a hybrid Sanger / 454 mapping
assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454san_stmu_hybdenovo.png" width="100%" alt='A "STMU" tag (Sequencing Technology Mismatch Unresolved, light blue square in the consensus at lower end of large oval) showing a potentially difficult decision in a hybrid Sanger / 454 mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_cer_reads"></a>1.5.5.
MIRA allows older finishing programs to cope with the amount of data in
Solexa mapping projects
</h3></div></div></div><p>
Quality control is paramount when you do mutation analysis for
biologists: I know they'll be on my doorstep the very minute they
find out that one of the SNPs in the resequencing data wasn't a SNP, but a
sequencing artefact. And I can understand them: why should they invest
-- per SNP -- hours in the wet lab if I can invest a couple of minutes
to get them data with false negative rates (and false discovery rates) way
below 1%? So, finishing and quality control for any mapping project is
a must.
</p><p>
Both <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>consed</strong></span> start to
have a couple of problems when projects have millions of reads: you
need lots of RAM and scrolling around the assembly becomes a test of your
patience. Still, these two assembly finishing programs are amongst the
better ones out there, although <span class="command"><strong>gap5</strong></span> is
quickly reaching a state in which it can replace
<span class="command"><strong>gap4</strong></span>.
</p><p>
So, MIRA reduces the number of reads in Solexa mapping projects
without sacrificing information on coverage. The principle is pretty
simple: for 100% matching reads, MIRA tracks the coverage of every
reference base and creates long synthetic, coverage equivalent reads
(CERs) in exchange for the Solexa reads. Reads that do not match 100%
are kept as separate entities, so that no information gets lost. The
following figure illustrates this:
</p><div class="figure"><a name="chap_intro::sxa_cer_reads1.png"></a><p class="title"><b>Figure 1.15.
Coverage equivalent reads (CERs) explained.
<p>
Left side of the figure: a conventional mapping with eleven reads
of size 4 against a consensus (in uppercase). The inverted base in
the lowest read depicts a sequencing error.
</p>
<p>
Right side of the figure: the same situation, but with coverage
equivalent reads (CERs). Note that there are fewer reads, but no
information is lost: the coverage of each reference base is
equivalent to the left side of the figure and reads with
differences to the reference are still present.
</p>
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_cer_reads1.png" width="100%" alt="Coverage equivalent reads (CERs) explained. Left side of the figure: a conventional mapping with eleven reads of size 4 against a consensus (in uppercase). The inversed base in the lowest read depicts a sequencing error. Right side of the figure: the same situation, but with coverage equivalent reads (CERs). Note that there are less reads, but no information is lost: the coverage of each reference base is equivalent to the left side of the figure and reads with differences to the reference are still present."></td></tr></table></div></div></div><br class="figure-break"><p>
This strategy is very effective in reducing the size of a project. As
an example, in a mapping project with 9 million Solexa 36mers, MIRA
created a project with 1.7m reads: 700k CERs representing ~8
million 100% matching Solexa reads, and it kept ~950k mapped reads as
they had at least one mismatch (be it sequencing error or true SNP) to the
reference. That is a reduction of about 80%, and numbers for mapping
projects with Solexa 100bp reads are in a similar range.
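</p><p>
The CER principle can be illustrated with a few lines of Python. This
is a sketch under simplifying assumptions and not MIRA's
implementation: reads are reduced to half-open
<code class="literal">(start, end)</code> intervals on the reference,
all of them match 100%, and the function names are made up for this
example. Coverage is peeled off layer by layer, and every contiguous
covered stretch of a layer becomes one long synthetic read.
</p><pre class="programlisting">
```python
def coverage_of(ref_len, intervals):
    """Per-base coverage for half-open (start, end) intervals."""
    coverage = [0] * ref_len
    for start, end in intervals:
        for pos in range(start, end):
            coverage[pos] += 1
    return coverage


def coverage_equivalent_reads(ref_len, perfect_reads):
    """Replace perfectly matching reads by fewer, longer synthetic
    reads with identical per-base coverage (illustrative sketch)."""
    coverage = coverage_of(ref_len, perfect_reads)
    cers = []
    while any(coverage):
        run_start = None
        # One extra step past the end closes a run reaching the last base.
        for pos in range(ref_len + 1):
            covered = pos != ref_len and coverage[pos] > 0
            if covered and run_start is None:
                run_start = pos
            elif not covered and run_start is not None:
                cers.append((run_start, pos))
                for p in range(run_start, pos):
                    coverage[p] -= 1  # this layer is now accounted for
                run_start = None
    return cers


reads = [(i, i + 4) for i in range(7)]  # seven reads of size 4
cers = coverage_equivalent_reads(10, reads)
print(len(reads), "reads become", len(cers), "CERs:", cers)
print(coverage_of(10, cers) == coverage_of(10, reads))  # prints True
```
</pre><p>
In this toy case seven short reads collapse into four synthetic ones
while every reference base keeps its coverage. Reads with mismatches
would simply be kept next to the CERs.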
</p><p>
Also, mutations of the resequenced strain now really stand out in the
assembly viewer as the following figure shows:
</p><div class="figure"><a name="chap_intro::sxa_cer_reads2.png"></a><p class="title"><b>Figure 1.16.
Coverage equivalent reads let SNPs become very visible in assembly viewers
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_cer_reads2.png" width="100%" alt="Coverage equivalent reads let SNPs become very visible in assembly viewers"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_mapping_tags"></a>1.5.6.
MIRA tags SNPs and other features, outputs result files
for biologists
</h3></div></div></div><p>
Want to assemble two or more very closely related genomes without
a reference, but still find SNPs or differences between them?
</p><p>
Tired of looking at some text output from mapping programs and
guessing whether a SNP is really a SNP or just some random junk?
</p><p>
MIRA tags all SNPs (and other features like missing coverage etc.) it
finds so that -- when using a finishing viewer like gap4 or consed --
one can quickly jump from tag to tag and perform quality control. This
works both in de-novo assembly and in mapping assembly; all MIRA needs
is the information which read comes from which strain.
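</p><p>
Why the strain information matters can be shown with a toy example.
The Python sketch below is an assumption-laden illustration, not the
way MIRA calls SNPs: the function
<code class="literal">strain_snp_columns</code>, the
<code class="literal">min_reads</code> threshold and the strain names
are invented here. A column is reported only if two strains, each with
enough reads agreeing among themselves, show different bases.
</p><pre class="programlisting">
```python
from collections import Counter, defaultdict


def strain_snp_columns(reads, min_reads=2):
    """reads: list of (strain, aligned_sequence) pairs of equal length.
    Return columns where strains with enough support call different
    bases. Hypothetical helper for illustration only."""
    length = len(reads[0][1])
    flagged = []
    for col in range(length):
        by_strain = defaultdict(Counter)
        for strain, seq in reads:
            by_strain[strain][seq[col]] += 1
        calls = set()
        for counts in by_strain.values():
            base, n = counts.most_common(1)[0]
            if n >= min_reads:
                calls.add(base)
        if len(calls) >= 2:
            flagged.append(col)
    return flagged


reads = [
    ("wildtype", "ACGTACGT"),
    ("wildtype", "ACGTACGT"),
    ("wildtype", "ACGTACGA"),  # lone difference within one strain
    ("mutant", "ACGCACGT"),
    ("mutant", "ACGCACGT"),
    ("mutant", "ACGCACGT"),
]
print(strain_snp_columns(reads))  # prints [3]
```
</pre><p>
Column 7 is not reported: the single deviating read belongs to a strain
whose other reads agree with the mutant, so it looks like a sequencing
error rather than a difference between strains.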
</p><p>
The following figure shows a mapping assembly of Solexa 36mers against
a bacterial reference sequence, where a mutant has an indel position
in a gene:
</p><div class="figure"><a name="chap_intro::sxa_sroc_lenski2.png"></a><p class="title"><b>Figure 1.17.
"SROc" tag (Snp inteR Organism on Consensus) showing a SNP position
in a Solexa mapping assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski2.png" width="100%" alt='"SROc" tag (Snp inteR Organism on Consensus) showing a SNP position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><p>
Other interesting places like deletions of whole genome parts are also
directly tagged by MIRA and noted in diverse result files (and
searchable in assembly viewers):
</p><div class="figure"><a name="chap_intro::sxa_mcvc_lenski.png"></a><p class="title"><b>Figure 1.18.
"MCVc" tag (Missing CoVerage in Consensus, dark red stretch in figure)
showing a genome deletion in Solexa mapping assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_mcvc_lenski.png" width="100%" alt='"MCVc" tag (Missing CoVerage in Consensus, dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
For bacteria -- and if you use annotated GenBank files as reference
sequence -- MIRA will also output some nice lists directly usable (in
Excel) by biologists, telling them which gene was affected by what
kind of SNP, whether it changes the protein, the original and the
mutated protein sequence etc.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_miramuchmore"></a>1.5.7.
MIRA has ... much more
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Extensive possibilities to clip data if needed: by quality, by
masked bases, by A/T stretches, by evidence from other reads, ...
</p></li><li class="listitem"><p>
Routines to re-extend reads into clipped parts if multiple
alignment allows for it.
</p></li><li class="listitem"><p>
Read in ancillary data in different formats: EXP, NCBI TRACEINFO
XML, SSAHA2, SMALT result files and text files.
</p></li><li class="listitem"><p>
Detection of chimeric reads.
</p></li><li class="listitem"><p>
Pipeline to discover SNPs in ESTs from different strains
(miraSearchESTSNPs)
</p></li><li class="listitem"><p>
Support for many different input and output formats (FASTA,
EXP, FASTQ, CAF, MAF, ...)
</p></li><li class="listitem"><p>
Automatic memory management (when RAM is tight)
</p></li><li class="listitem"><p>
Over 150 parameters to tune the assembly for a lot of use cases,
many of these parameters being tunable individually depending on
sequencing technology they apply to.
</p></li></ul></div><p>
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_versions_licenses_disclaimer_and_copyright"></a>1.6.
Versions, Licenses, Disclaimer and Copyright
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_versions"></a>1.6.1.
Versions
</h3></div></div></div><p>
There are two kinds of versions of MIRA that can be compiled from
source files: production and development.
</p><p>
Production versions are from the stable branch of the source code. These
versions are available for download from SourceForge.
</p><p>
Development versions are from the development branch of the source
tree. These are also made available to the public and should be
compiled by users who want to test out new functionality or to track
down bugs or errors that might arise at a given location. Release
candidates (rc) also fall into the development versions: they are
usually the last versions of a given development branch before being
folded back into the production branch.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_licenses"></a>1.6.2.
License
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_intro_licensemira"></a>1.6.2.1.
MIRA
</h4></div></div></div><p>
MIRA has been put under the GPL version 2.
</p><p>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.
</p><p>
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
</p><p>
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA
</p><p>
You may also visit <a class="ulink" href="http://www.opensource.org/licenses/gpl-2.0.php" target="_top">http://www.opensource.org/licenses/gpl-2.0.php</a> at the Open
Source Initiative for a copy of this licence.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_intro_licensedocs"></a>1.6.2.2.
Documentation
</h4></div></div></div><p>
The documentation pertaining to MIRA is licensed under the Creative
Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
License. To view a copy of this license, visit <a class="ulink" href="http://creativecommons.org/licenses/by-nc-sa/3.0/" target="_top">http://creativecommons.org/licenses/by-nc-sa/3.0/</a> or send a
letter to Creative Commons, 171 Second Street, Suite 300, San
Francisco, California, 94105, USA.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_copyright"></a>1.6.3.
Copyright
</h3></div></div></div><p>
© 1997-2000 Deutsches Krebsforschungszentrum Heidelberg -- Dept.
of Molecular Biophysics and Bastien Chevreux (for MIRA) and Thomas
Pfisterer (for EdIt)
</p><p>
© 2001-2014 Bastien Chevreux.
</p><p>
All rights reserved.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_external_libraries"></a>1.6.4.
External libraries
</h3></div></div></div><p>
MIRA uses the excellent Expat library to parse XML files. Expat is Copyright
© 1998, 1999, 2000 Thai Open Source Software Center Ltd and Clark
Cooper as well as Copyright ©
2001, 2002 Expat maintainers.
</p><p>
See <a class="ulink" href="http://www.libexpat.org/" target="_top">http://www.libexpat.org/</a> and
<a class="ulink" href="http://sourceforge.net/projects/expat/" target="_top">http://sourceforge.net/projects/expat/</a> for more information on Expat.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_getting_help___mailing_lists___reporting_bugs"></a>1.7.
Getting help / Mailing lists / Reporting bugs
</h2></div></div></div><p>
Please try to find an answer to your question by first reading the
documents provided with the MIRA package (FAQs, READMEs, usage guide,
guides for specific sequencing technologies etc.). It's a lot, but then
again, they hopefully should cover 90% of all questions.
</p><p>
If you have a tough nut to crack or simply could not find what you were
searching for, you can subscribe to the MIRA talk mailing list and send
in your question (or comment, or suggestion), see <a class="ulink" href="http://www.chevreux.org/mira_mailinglists.html" target="_top">http://www.chevreux.org/mira_mailinglists.html</a> for more
information on that. Now that the number of subscribers has reached a
good level, there's a fair chance that someone could answer your
question before I have the opportunity or while I'm away from mail for a
certain time.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Please very seriously consider using the mailing list before mailing
me directly. Every question which can be answered by participants of
the list is time I can invest in development and documentation of
MIRA. I have a day job as a bioinformatician which has nothing to do
with MIRA, and after-work hours are rare enough nowadays.
</p><p>
Furthermore, Google indexes the mailing list and every discussion /
question asked on the mailing list helps future users as they show up
in Google searches.
</p><p>
Only mail me directly (bach@chevreux.org) if you feel that there's
some information you absolutely do not want to share publicly.
</p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Subscribing to the list <span class="emphasis"><em>before sending mails to it </em></span>
is necessary as messages from non-subscribers will be stopped by the
system to keep the spam level low.
</td></tr></table></div><p>
To report bugs or ask for new features, please use the SourceForge
ticketing system at: <a class="ulink" href="http://sourceforge.net/p/mira-assembler/tickets/" target="_top">http://sourceforge.net/p/mira-assembler/tickets/</a>. This ensures
that requests do not get lost <span class="bold"><strong>and</strong></span> you
get the additional benefit of automatically knowing when a bug has been
fixed. I will not send separate emails; that's what bug trackers are
there for.
</p><p>
Finally, new or intermediate versions of MIRA will be announced on the
separate MIRA announce mailing list. Traffic is very low there as the
only one who can post there is me. Subscribe if you want to be informed
automatically on new releases of MIRA.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_author"></a>1.8.
Author
</h2></div></div></div><p>
Bastien Chevreux (mira): <code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code>
</p><p>
WWW: <a class="ulink" href="http://www.chevreux.org/" target="_top">http://www.chevreux.org/</a>
</p><p>
MIRA can use automatic editing routines for Sanger sequences which were
written by Thomas Pfisterer (EdIt):
<code class="email"><<a class="email" href="mailto:t.pfisterer@dkfz-heidelberg.de">t.pfisterer@dkfz-heidelberg.de</a>></code>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_miscellaneous"></a>1.9.
Miscellaneous
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_citations"></a>1.9.1.
Citing MIRA
</h3></div></div></div><p>
Please use these citations:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
For <span class="command"><strong>mira</strong></span>
</span></dt><dd><p>
Chevreux, B., Wetter, T. and Suhai, S. (1999): <span class="emphasis"><em>Genome
Sequence Assembly Using Trace Signals and Additional Sequence
Information</em></span>. Computer Science and Biology:
Proceedings of the German Conference on Bioinformatics (GCB) 99,
pp. 45-56.
</p></dd><dt><span class="term">
For <span class="command"><strong>miraSearchESTSNPs</strong></span> (formerly named
<span class="command"><strong>miraEST</strong></span>)
</span></dt><dd><p> Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J.,
Müller, W. E., Wetter, T. and Suhai, S. (2004): <span class="emphasis"><em>Using
the miraEST Assembler for Reliable and Automated mRNA Transcript
Assembly and SNP Detection in Sequenced ESTs</em></span>. Genome
Research, 14(6), pp. 1147-1159.
</p></dd></dl></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_postcards_gold_and_jewellery"></a>1.9.2.
Postcards, gold and jewellery
</h3></div></div></div><p>
If you find this software useful, please send the author a postcard. If
postcards are not available, a treasure chest full of Spanish doubloons, gold
and jewellery will do nicely, thank you.
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_installation"></a>Chapter 2. Installing MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_install_wheretofetch">2.1.
Where to fetch MIRA
</a></span></dt><dt><span class="sect1"><a href="#sect_install_precompiledbinary">2.2.
Installing from a precompiled binary package
</a></span></dt><dt><span class="sect1"><a href="#sect_install_third_party_integration">2.3.
Integration with third party programs (gap4, consed)
</a></span></dt><dt><span class="sect1"><a href="#sect_install_compiling">2.4.
Compiling MIRA yourself
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_comp_prereq">2.4.1.
Prerequisites
</a></span></dt><dt><span class="sect2"><a href="#sect_install_comp_comp">2.4.2.
Compiling and installing
</a></span></dt><dt><span class="sect2"><a href="#sect_install_comp_conf">2.4.3.
Configure switches for MIRA
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_install_comp_conf_boost">2.4.3.1.
BOOST configure switches for MIRA
</a></span></dt><dt><span class="sect3"><a href="#sect_install_comp_conf_mira">2.4.3.2.
MIRA specific configure switches
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_install_walkthroughs">2.5.
Installation walkthroughs
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_walkthroughs_kubuntu">2.5.1.
(K)Ubuntu 12.04
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_opensuse">2.5.2.
openSUSE 12.1
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_fedora">2.5.3.
Fedora 17
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_osx">2.5.4.
Mac OSX
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_allfromscratch">2.5.5.
Compile everything from scratch
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_dynamic">2.5.6.
Dynamically linked MIRA
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_install_hintotherplatforms">2.6.
Compilation hints for other platforms.
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_hintnetbsd5">2.6.1.
NetBSD 5 (i386)
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_install_notesformaintainers">2.7.
Notes for distribution maintainers / system administrators
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_additionaldatafiles">2.7.1.
Additional data files
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">A problem can be found to almost every solution.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_wheretofetch"></a>2.1.
Where to fetch MIRA
</h2></div></div></div><p>
SourceForge: <a class="ulink" href="http://sourceforge.net/projects/mira-assembler/" target="_top">http://sourceforge.net/projects/mira-assembler/</a>
</p><p>
There you will normally find a couple of precompiled binaries -- usually
for Linux and Mac OSX -- or the source package for compiling yourself.
</p><p>
Precompiled binary packages are named in the following way:
</p><p>
<code class="filename">mira_<em class="replaceable"><code>miraversion</code></em>_<em class="replaceable"><code>OS-and-binarytype</code></em>.tar.bz2</code>
</p><p>
where
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
For <code class="filename"><em class="replaceable"><code>miraversion</code></em></code>, the
stable versions of MIRA with the general public as audience usually
have a version number in three parts, like
<code class="filename">3.0.5</code>, sometimes also followed by some postfix
like in <code class="filename">3.2.0rc1</code> to denote release candidate 1
of the 3.2.0 version of MIRA. On very rare occasions, stable
versions of MIRA can have four parts, e.g.,
<code class="filename">3.4.0.1</code>: these versions create binaries identical
to those of their parent version (<code class="filename">3.4.0</code>) and
contain only fixes to the source build machinery.
</p><p>
The version string sometimes can have a different format:
<code class="filename"><span class="emphasis"><em>sometext</em></span>-0-g<span class="emphasis"><em>somehexnumber</em></span></code>
like in, e.g.,
<code class="filename">ft_fastercontig-0-g4a27c91</code>. These versions of
MIRA are snapshots from the development tree of MIRA and usually
contain new functionality which may not be as well tested as the
rest of MIRA. They therefore contain more checks and more debugging
output to catch potential errors.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>OS-and-binarytype</code></em></code>
defines the operating system and processor class for which
the package is built. E.g.,
<code class="filename">linux-gnu_x86_64_static</code> contains static
binaries for Linux running on a 64-bit processor.
</p></li></ul></div><p>
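As a minimal sketch of the naming scheme above, the following shell
function splits a binary package name into its version and platform
parts. The name <code class="literal">parse_mira_package</code> is
hypothetical and not part of MIRA. The sketch assumes a release-style
version without underscores (like <code class="filename">3.0.5</code>) and
treats everything after the version as the platform part, so it will not
handle development snapshot names correctly.
</p><pre class="screen">
# Hypothetical helper, not part of MIRA: split a binary package name
# into version and platform. Assumes the version has no underscores.
parse_mira_package() {
    name=${1%.tar.bz2}          # drop the archive suffix
    name=${name#mira_}          # drop the leading "mira_"
    echo "version=${name%%_*}"  # text up to the first underscore
    echo "platform=${name#*_}"  # everything after the first underscore
}
</pre><p>
Called with
<code class="filename">mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2</code>,
it prints <code class="literal">version=3.0.5</code> and
<code class="literal">platform=prod_linux-gnu_x86_64_static</code>.
</p><p>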
Source packages are usually named
</p><p>
<code class="filename">mira-<em class="replaceable"><code>miraversion</code></em>.tar.bz2</code>
</p><p>
Examples for packages at SourceForge:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><code class="filename">mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira_3.0.5_prod_linux-gnu_i686_32_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira_3.0.5_prod_OSX_snowleopard_x86_64_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira-3.0.5.tar.bz2</code></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_precompiledbinary"></a>2.2.
Installing from a precompiled binary package
</h2></div></div></div><p>
The distributable package follows the
one-directory-which-contains-everything-which-is-needed philosophy, but
after unpacking and moving the package to its final destination, you
need to run a script which will create some data files.
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Download the package, unpack it.
</p></li><li class="listitem"><p>
Move the directory to its final location on your disk: either one of the
"standard" places like, e.g., <code class="filename">/opt/mira</code> or
<code class="filename">/usr/local/mira</code>, or somewhere in your home
directory.
</p></li><li class="listitem"><p>
Softlink the binaries which are in the <code class="filename">bin</code> directory into a
directory which is in your shell PATH. Then have the shell reload
the location of PATH binaries (either <code class="literal">hash -r</code>
for sh/bash or <code class="literal">rehash</code> for csh/tcsh).
</p><p>
Alternatively, add the <code class="filename">bin</code> directory of the
MIRA package to your PATH variable.
</p></li><li class="listitem"><p>
Test whether the binaries are installed correctly via <code class="literal">mirabait
-v</code>, which should report the version you
downloaded and installed.
</p></li><li class="listitem"><p>
Now you need to run a script which will unpack and reformat some
data needed by MIRA. That script is located in the
<code class="filename">dbdata</code> directory of the package and should
be called with the name of the <span class="emphasis"><em>SLS</em></span> file present
in the same directory, like this:
</p><pre class="screen">
<code class="prompt">arcadia:/path/to/mirapkg$</code> <strong class="userinput"><code>cd dbdata</code></strong>
<code class="prompt">arcadia:/path/to/mirapkg/dbdata$</code> <strong class="userinput"><code>ls -l</code></strong>
drwxr-xr-x 3 bach bach 4096 2016-03-18 14:31 mira-createsls
-rwxr-xr-x 1 bach bach 2547 2015-12-14 04:33 mira-install-sls-rrna.sh
-rw-r--r-- 1 bach bach 337 2016-01-01 14:50 README.txt
lrwxrwxrwx 1 bach bach 10421035 2016-03-18 14:28 rfam_rrna-21-12.sls.gz
<code class="prompt">arcadia:/path/to/mirapkg/dbdata$</code> <strong class="userinput"><code>./mira-install-sls-rrna.sh rfam_rrna-21-12.sls.gz</code></strong></pre><p>
This will take a minute or so. After that, the MIRA installation is complete.
</p></li></ol></div><p>
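The softlinking step from the list above can be sketched as a small shell
function. The name <code class="literal">link_mira_binaries</code> and both
of its arguments are hypothetical; the source directory should be given as
an absolute path so that the created links stay valid.
</p><pre class="screen">
# Hypothetical helper, not part of MIRA: softlink every regular file
# found in directory $1 into directory $2.
link_mira_binaries() {
    src_bin=$1
    dest_dir=$2
    for f in "$src_bin"/*; do
        [ -f "$f" ] || continue   # skip sub-directories and empty globs
        ln -sf "$f" "$dest_dir/"
    done
}
</pre><p>
A call could look like
<code class="literal">link_mira_binaries /opt/mira/bin "$HOME/bin"</code>,
followed by <code class="literal">hash -r</code> or
<code class="literal">rehash</code> as described above.
</p><p>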
Additional scripts for special purposes are in the
<code class="filename">scripts</code> directory. You might or might not want to
have them in your $PATH.
</p><p>
Scripts and programs for MIRA from other authors are in the
<code class="filename">3rdparty</code> directory. Here too, you may or may not
want to have (some of them) in your $PATH.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_third_party_integration"></a>2.3.
Integration with third party programs (gap4, consed)
</h2></div></div></div><p>
MIRA sets tags in the assemblies that can be read and interpreted by the
Staden <span class="command"><strong>gap4</strong></span> package or
<span class="command"><strong>consed</strong></span>. These tags are extremely useful to
efficiently find places of interest in an assembly (be it de-novo or
mapping), but both <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>consed</strong></span>
need to be told about these tags.
</p><p>
Data files for a correct integration are delivered in the
<code class="filename">support</code> directory of the distribution. Please
consult the README in that directory for more information on how to
integrate this information in either of these packages.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_compiling"></a>2.4.
Compiling MIRA yourself
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_prereq"></a>2.4.1.
Prerequisites
</h3></div></div></div><p>
Compiling the 5.x series of MIRA needs a C++14 compatible tool chain, i.e.,
systems starting from 2013/2014 should be OK. The
requisites for <span class="emphasis"><em>compiling</em></span> MIRA are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
gcc ≥ 4.9.1, with libstdc++6. You really want to use a simple
installation package pre-configured for your system, but in case you
want or have to install gcc yourself, please refer to <a class="ulink" href="http://gcc.gnu.org/" target="_top">http://gcc.gnu.org/</a> for more information on the GNU compiler
collection.
</p></li><li class="listitem"><p>
BOOST library ≥ 1.48. Lower versions might work, but are
untested; to try one, you would need to change the version check in the
configure script. You really want to use a simple
installation package pre-configured for your system, but in case you
want or have to install BOOST yourself, please refer to <a class="ulink" href="http://www.boost.org/" target="_top">http://www.boost.org/</a> for more information on the BOOST
library.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Do NOT use a so called <span class="emphasis"><em>staged</em></span> BOOST library,
that will not work.
</td></tr></table></div></li><li class="listitem">
zlib. Should your system not have zlib installed or available as
simple installation package, please see <a class="ulink" href="http://www.zlib.net/" target="_top">http://www.zlib.net/</a> for more information regarding zlib.
</li><li class="listitem">
GNU make. Should your system not have gmake installed or available
as simple installation package, please see <a class="ulink" href="http://www.gnu.org/software/make/" target="_top">http://www.gnu.org/software/make/</a> for more information regarding
GNU make.
</li><li class="listitem">
GNU flex ≥ 2.5.33. Should your system not have flex installed or
available as simple installation package, please see <a class="ulink" href="http://flex.sourceforge.net/" target="_top">http://flex.sourceforge.net/</a> for more information regarding
flex.
</li><li class="listitem">
Expat library ≥ 2.0.1. Should your system not have the Expat library and
header files already installed or available as simple installation
package, you will need to download and install it yourself. Please see
<a class="ulink" href="http://www.libexpat.org/" target="_top">http://www.libexpat.org/</a> and <a class="ulink" href="http://sourceforge.net/projects/expat/" target="_top">http://sourceforge.net/projects/expat/</a> for information on how
to do this.
</li><li class="listitem">
xxd. A small utility from the <span class="command"><strong>vim</strong></span> package.
</li></ul></div><p>
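Before running <span class="command"><strong>configure</strong></span>, the
list above can be checked quickly from the shell. The two functions below
are a sketch only and not part of MIRA: their names are hypothetical, and
<code class="literal">version_ge</code> assumes a
<span class="command"><strong>sort</strong></span> that supports the
<code class="literal">-V</code> (version sort) option, as GNU coreutils
does.
</p><pre class="screen">
# Hypothetical helpers, not part of MIRA.

# Succeed if version $1 is at least version $2 (relies on "sort -V").
version_ge() {
    lowest=$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Print every tool named in the arguments that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        path=$(command -v "$tool") || echo "$tool"
    done
}
</pre><p>
For example, <code class="literal">missing_tools gcc make flex xxd</code>
prints the names of the tools that still need to be installed, and
<code class="literal">version_ge "$(gcc -dumpversion)" 4.9.1</code>
succeeds when the installed gcc is recent enough.
</p><p>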
For <span class="emphasis"><em>building the documentation</em></span>, additional
prerequisites are from the DocBook tool chain:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
xsltproc + docbook-xsl for HTML output
</li><li class="listitem">
dblatex for PDF output
</li></ul></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
Previous versions of MIRA benefited from using the TCMalloc
library. This is not the case anymore! Indeed, tests showed that when
using TCMalloc, MIRA 4.9.x and above will probably need 20 to
30% <span class="emphasis"><em>more</em></span> max memory and up to 80% more overall
memory than without TCMalloc.
</p><p>
In short: do not use TCMalloc at the moment.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_comp"></a>2.4.2.
Compiling and installing
</h3></div></div></div><p>
MIRA uses the GNU autoconf/automake tools, please read the section
"Basic Installation" of the <code class="filename">INSTALL</code> file in the
source package of MIRA for more generic information on how to invoke
them.
</p><p>
The short version: simply type
</p><pre class="screen">
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>./configure</code></strong>
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>make</code></strong>
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>make install</code></strong></pre><p>
This should install the following programs:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><span class="command"><strong>mira</strong></span></li><li class="listitem"><span class="command"><strong>miraconvert</strong></span></li><li class="listitem"><span class="command"><strong>mirabait</strong></span></li><li class="listitem"><span class="command"><strong>miramem</strong></span></li></ul></div><p>
Should the <code class="literal">./configure</code> step fail for some reason or
another, you should get a message telling you at which step this
happened. Then either install the missing packages or tell
<span class="command"><strong>configure</strong></span> where it should search for the packages it
did not find, see also next section.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_conf"></a>2.4.3.
Configure switches for MIRA
</h3></div></div></div><p>
MIRA understands all standard autoconf configure switches like <code class="literal">--prefix=</code>
etc. Please consult the INSTALL file in the MIRA top level directory
of the source package and also call <code class="literal">./configure
--help</code> to get a full list of currently supported switches.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_install_comp_conf_boost"></a>2.4.3.1.
BOOST configure switches for MIRA
</h4></div></div></div><p>
BOOST is maybe the most tricky library to get right in case it does
not come pre-configured for your system. The two main switches for
helping to locate BOOST are
probably <code class="literal">--with-boost=[ARG]</code>
and <code class="literal">--with-boost-libdir=LIB_DIR</code>. Only if those
two fail, try using the other <code class="literal">--with-boost-*=</code> switches
you will see from the ./configure help text.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_install_comp_conf_mira"></a>2.4.3.2.
MIRA specific configure switches
</h4></div></div></div><p>
MIRA honours the following switches:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
--enable-64=yes/no
</span></dt><dd><p>
MIRA should happily build as 32 bit executable on 32 bit
platforms and as 64 bit executable on 64 bit platforms. On 64
bit platforms, setting the switch to 'no' forces the compiler
to produce 32 bit executables (if possible)
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
As of MIRA 3.9.0, support for 32 bit platforms is being
slowly phased out. While MIRA should compile and also run fine
on 32 bit platforms, I do not guarantee it anymore as I
haven't used 32 bit systems in the last 5 years.
</td></tr></table></div></dd><dt><span class="term">
--enable-warnings
</span></dt><dd>
Enables compiler warnings, useful only for developers, not for users.
</dd><dt><span class="term">
--enable-debug
</span></dt><dd>
Lets the MIRA binary contain C/C++ debug symbols.
</dd><dt><span class="term">
--enable-mirastatic
</span></dt><dd>
Builds static binaries which are easier to distribute. Some
platforms (like OpenSolaris) might not like this and you will
get an error from the linker.
</dd><dt><span class="term">
--enable-optimisations
</span></dt><dd>
Instructs the configure script to set optimisation switches for compiling
(on by default). Switching optimisations off (warning: high impact on
run-time) might be interesting only for, e.g., debugging with valgrind.
</dd><dt><span class="term">
--enable-publicquietmira
</span></dt><dd>
Some parts of MIRA can dump additional debug information during
assembly; setting this switch to "no" enables that output. Warning:
MIRA will then be a bit chatty, so this is not recommended for
public usage.
</dd><dt><span class="term">
--enable-developmentversion
</span></dt><dd>
Using MIRA with enabled development mode may lead to extra
output on stdout as well as some additional data in the results
which should not appear in real world data.
</dd><dt><span class="term">
--enable-boundtracking
</span></dt><dd></dd><dt><span class="term">
--enable-bugtracking
</span></dt><dd>
Both flags above compile some basic sanity checks for certain
functions into mira. Leaving them on "yes"
(default) is encouraged; the impact on run time is minimal.
</dd></dl></div></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_walkthroughs"></a>2.5.
Installation walkthroughs
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_kubuntu"></a>2.5.1.
(K)Ubuntu 12.04
</h3></div></div></div><p>
You will need to install a couple of tools and libraries before
compiling MIRA. Here's the recipe:
</p><pre class="screen">
<strong class="userinput"><code>sudo apt-get install make flex
sudo apt-get install libboost-doc libboost.*1.48-dev libboost.*1.48.0</code></strong></pre><p>
Once this is done, you can unpack and compile MIRA. For a dynamically
linked version, use:
</p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make && make install</code></strong></pre><p>
For a statically linked version, just change the configure line from
above into
</p><pre class="screen">
<strong class="userinput"><code>./configure <em class="replaceable"><code>--enable-mirastatic</code></em></code></strong></pre><p>
In case you also want to build documentation yourself, you will need
this in addition:
</p><pre class="screen"><strong class="userinput"><code>sudo apt-get install xsltproc docbook-xsl dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
People working on git checkouts of the MIRA source code will
obviously need some more tools. Get them with this:
</p><pre class="screen"><strong class="userinput"><code>sudo apt-get install automake libtool xutils-dev</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_opensuse"></a>2.5.2.
openSUSE 12.1
</h3></div></div></div><p>
You will need to install a couple of tools and libraries before
compiling MIRA. Here's the recipe:
</p><pre class="screen">
<strong class="userinput"><code>sudo zypper install gcc-c++ boost-devel
sudo zypper install flex libexpat-devel zlib-devel</code></strong></pre><p>
Once this is done, you can unpack and compile MIRA. For a dynamically
linked version, use:
</p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make && make install</code></strong></pre><p>
In case you also want to build documentation yourself, you will need
this in addition:
</p><pre class="screen"><strong class="userinput"><code>sudo zypper install docbook-xsl-stylesheets dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
People working on git checkouts of the MIRA source code will
obviously need some more tools. Get them with this:
</p><pre class="screen"><strong class="userinput"><code>sudo zypper install automake libtool xutils-dev</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_fedora"></a>2.5.3.
Fedora 17
</h3></div></div></div><p>
You will need to install a couple of tools and libraries before
compiling MIRA. Here's the recipe:
</p><pre class="screen">
<strong class="userinput"><code>sudo yum -y install gcc-c++ boost-devel
sudo yum install flex expat-devel vim-common zlib-devel</code></strong></pre><p>
Once this is done, you can unpack and compile MIRA. For a dynamically
linked version, use:
</p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make && make install</code></strong></pre><p>
In case you also want to build documentation yourself, you will need
this in addition:
</p><pre class="screen"><strong class="userinput"><code>sudo yum -y install docbook-xsl dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
People working on git checkouts of the MIRA source code will
obviously need some more tools. Get them with this:
</p><pre class="screen"><strong class="userinput"><code>sudo yum -y install automake libtool xorg-x1-util-devel</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_osx"></a>2.5.4.
Mac OSX
</h3></div></div></div><p>
These instructions are for OSX 10.11 (El Capitan) and use
MacPorts. There are other ways to do this (e.g., see the section "Compile
everything from scratch"), but they are definitely more painful.
</p><p>
If you do not already have it, install MacPorts. See <a class="ulink" href="https://www.macports.org/install.php" target="_top">https://www.macports.org/install.php</a>. Then have the port
system fetch information on the newest ports (can take a while):
</p><pre class="screen">
<strong class="userinput"><code>sudo port selfupdate</code></strong>
</pre><p>
Then go on and install gcc (this is going to take a long time) and
then switch to gcc5:
</p><pre class="screen">
<strong class="userinput"><code>sudo port install m4 gcc5</code></strong>
<strong class="userinput"><code>sudo port select --set gcc mp-gcc5</code></strong>
</pre><p>
Now, the libraries you need to download and compile need to be
installed somewhere. You can take a path in your home directory or any
other path in the system you have access to. For the sake of this
walkthrough we'll continue with
<code class="filename">/opt/biosw/gccchain</code>.
</p><p>
Download and install a current flex. Use at least 2.6.0. If for some
reason you need to use flex 2.5.38 or .39, take care to apply the
patch described here: <a class="ulink" href="https://sourceforge.net/p/flex/bugs/182/" target="_top">https://sourceforge.net/p/flex/bugs/182/</a>. Configure flex to be
installed into the directory you chose in the previous step:
</p><pre class="screen">
<strong class="userinput"><code>tar xvf flex-2.6.0.tar.bz2</code></strong>
<strong class="userinput"><code>cd flex-2.6.0</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make</code></strong>
<strong class="userinput"><code>make install</code></strong>
<strong class="userinput"><code>cd ..</code></strong>
</pre><p>
That done, proceed likewise with the expat and zlib libraries:
</p><pre class="screen">
<strong class="userinput"><code>tar xvf expat-2.1.0.tar.gz</code></strong>
<strong class="userinput"><code>cd expat-2.1.0</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make</code></strong>
<strong class="userinput"><code>make install</code></strong>
<strong class="userinput"><code>cd ..</code></strong>
<strong class="userinput"><code>tar xvf zlib-1.2.8.tar.gz</code></strong>
<strong class="userinput"><code>cd zlib-1.2.8</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
<strong class="userinput"><code>make install</code></strong>
<strong class="userinput"><code>cd ..</code></strong>
</pre><p>
The bzip2 library needs a different installation command line:
</p><pre class="screen">
<strong class="userinput"><code>tar xvf bzip2-1.0.6.tar.gz</code></strong>
<strong class="userinput"><code>cd bzip2-1.0.6</code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
<strong class="userinput"><code>make install PREFIX=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>cd ..</code></strong>
</pre><p>
Last library to be installed for the MIRA compilation is BOOST:
</p><pre class="screen">
<strong class="userinput"><code>tar xvf boost_1_59_0.tar.bz2</code></strong>
<strong class="userinput"><code>cd boost_1_59_0</code></strong>
<strong class="userinput"><code>./bootstrap.sh --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>./b2 -j 4</code></strong>
<strong class="userinput"><code>./b2 install</code></strong>
<strong class="userinput"><code>cd ..</code></strong>
</pre><p>
Now unpack MIRA, configure it and compile. Remember to give the configure
script the location of every package you just installed, or else it
might pick up a version installed by the system (and compiled with
a different compiler), which would invariably lead to errors in the
linker stage of the compilation.
</p><pre class="screen">
<strong class="userinput"><code>tar xvf mira-5.0.0.tar.bz2</code></strong>
<strong class="userinput"><code>cd mira-5.0.0</code></strong>
<strong class="userinput"><code>./configure --enable-debug \
  --with-boost=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
  --with-boost-libdir=<em class="replaceable"><code>/opt/biosw/gccchain</code></em>/lib \
  --with-expat=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
  --with-zlib=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
</pre><p>
That's it for the dynamic version.
</p><p>
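Because the configure calls in this walkthrough repeat the same
installation prefix several times, the location switches can also be
generated from a single value. The function below is a sketch under that
assumption; its name <code class="literal">mira_configure_args</code> is
hypothetical and not part of MIRA, and it only prints the four
<code class="literal">--with-*</code> switches used in this walkthrough.
</p><pre class="screen">
# Hypothetical helper, not part of MIRA: print the location switches used
# in this walkthrough for libraries installed below a common prefix ($1).
mira_configure_args() {
    prefix=$1
    printf '%s\n' \
        "--with-boost=$prefix" \
        "--with-boost-libdir=$prefix/lib" \
        "--with-expat=$prefix" \
        "--with-zlib=$prefix"
}
</pre><p>
It could then be used as
<code class="literal">./configure --enable-debug $(mira_configure_args /opt/biosw/gccchain)</code>.
As the command substitution is unquoted, this only works for a prefix
without spaces.
</p><p>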
For building an almost static version, we need some trickery: after
the configure (this time with the mirastatic argument), create a
special directory <code class="filename">OSXstatlibs</code> in which we
softlink all static libraries MIRA needs. This directory will be
searched first by the build scripts generated by the
<span class="command"><strong>libtool</strong></span> suite during the linking stage of MIRA.
</p><pre class="screen">
<strong class="userinput"><code>./configure --enable-mirastatic --enable-debug \
--with-boost=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
--with-boost-libdir=<em class="replaceable"><code>/opt/biosw/gccchain</code></em>/lib \
--with-expat=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
--with-zlib=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>mkdir OSXstatlibs</code></strong>
<strong class="userinput"><code>cd OSXstatlibs</code></strong>
<strong class="userinput"><code>ln -s /opt/biosw/gccchain/lib/*a .</code></strong>
<strong class="userinput"><code>ln -s /opt/local/lib/*a .</code></strong>
</pre><p>
Note that <code class="filename">/opt/local</code> is the standard installation
path of the MacPorts programs. If you changed that, you need to adapt
it here, too.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_allfromscratch"></a>2.5.5.
Compile everything from scratch
</h3></div></div></div><p>
This lets you build a self-contained static MIRA binary. The only
prerequisite here is that you have a working <span class="command"><strong>gcc</strong></span>
with the minimum version described above. Please download all
necessary files (expat, flex, etc.) and then simply follow the
script below. The only things that you will want to change are the
paths used and, maybe, the names of some packages in case they were
bumped up a version or revision.
</p><p>
Contributed by Sven Klages.
</p><pre class="screen">
## whatever path is appropriate
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em></code></strong>
## expat
<strong class="userinput"><code>tar zxvf <em class="replaceable"><code>expat-2.0.1.tar.gz</code></em>
cd <em class="replaceable"><code>expat-2.0.1</code></em>
./configure <em class="replaceable"><code>--prefix=/home/gls/SvenTemp/expat</code></em>
make && make install</code></strong>
## flex
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>flex-2.5.35.tar.gz</code></em>
cd <em class="replaceable"><code>flex-2.5.35</code></em>
./configure <em class="replaceable"><code>--prefix=/home/gls/SvenTemp/flex</code></em>
make && make install
cd <em class="replaceable"><code>/home/gls/SvenTemp/flex/bin</code></em>
ln -s flex flex++
export PATH=<em class="replaceable"><code>/home/gls/SvenTemp/flex/bin</code></em>:$PATH</code></strong>
## boost
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>boost_1_48_0.tar.gz</code></em>
cd <em class="replaceable"><code>boost_1_48_0</code></em>
./bootstrap.sh --prefix=<em class="replaceable"><code>/home/gls/SvenTemp/boost</code></em>
./b2 install</code></strong>
## mira itself
<strong class="userinput"><code>export CXXFLAGS="-I<em class="replaceable"><code>/home/gls/SvenTemp/flex/include</code></em>"
cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>mira-3.4.0.1.tar.gz</code></em>
cd <em class="replaceable"><code>mira-3.4.0.1</code></em>
./configure --prefix=<em class="replaceable"><code>/home/gls/SvenTemp/mira</code></em> \
--with-boost=<em class="replaceable"><code>/home/gls/SvenTemp/boost</code></em> \
--with-expat=<em class="replaceable"><code>/home/gls/SvenTemp/expat</code></em> \
--enable-mirastatic
make && make install</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_dynamic"></a>2.5.6.
Dynamically linked MIRA
</h3></div></div></div><p>
In case you do not want a static binary of MIRA, but a dynamically
linked version, the following script by Robert Bruccoleri will give
you an idea of how to do this.
</p><p>
Note that he, having root rights, puts all additional software in
/usr/local, and in particular, he keeps updated versions of Boost and
Flex there.
</p><pre class="screen">
#!/bin/sh -x
make distclean
oze=`find . -name "*.o" -print`
if [ -n "$oze" ]
then
echo "Not clean."
exit 1
fi
export prefix=${BUILD_PREFIX:-/usr/local}
export LDFLAGS="-Wl,-rpath,$prefix/lib"
./configure --prefix=$prefix \
--enable-debug=yes \
--enable-mirastatic=no \
--with-boost-libdir=$prefix/lib \
--enable-optimisations \
--enable-boundtracking=yes \
--enable-bugtracking=yes \
--enable-extendedbugtracking=no
make
make install</pre></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_hintotherplatforms"></a>2.6.
Compilation hints for other platforms.
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_hintnetbsd5"></a>2.6.1.
NetBSD 5 (i386)
</h3></div></div></div><p>
Contributed by Thomas Vaughan
</p><p>
The system flex <span class="emphasis"><em>(/usr/bin/flex)</em></span> is too old, but the
devel/flex package from a recent pkgsrc works fine. BSD make doesn't
like one of the lines in <span class="emphasis"><em>src/progs/Makefile</em></span>, so use GNU make instead
(available from <span class="emphasis"><em>pkgsrc</em></span> as <span class="emphasis"><em>devel/gmake</em></span>). Other relevant pkgsrc packages:
<span class="emphasis"><em>devel/boost-libs</em></span>, <span class="emphasis"><em>devel/boost-headers</em></span>
and <span class="emphasis"><em>textproc/expat</em></span>. The configure script has to
be told about these pkgsrc prerequisites (they are usually rooted
at <span class="emphasis"><em>/usr/pkg</em></span> but other locations are possible):
</p><pre class="screen"><strong class="userinput"><code>FLEX=/usr/pkg/bin/flex ./configure --with-expat=/usr/pkg --with-boost=/usr/pkg</code></strong></pre><p>
If attempting to build a pkgsrc package of MIRA, note that the LDFLAGS
passed by the pkgsrc mk files don't remove the need for
the <span class="emphasis"><em>--with-boost</em></span> option. The configure script
complains about flex being too old, but this is harmless because it
honours the $FLEX variable when writing out makefiles.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_notesformaintainers"></a>2.7.
Notes for distribution maintainers / system administrators
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_additionaldatafiles"></a>2.7.1.
Additional data files
</h3></div></div></div><p>
Depending on options/parameters, the MIRA/mirabait binary may need
to load some additional data during the run. By default this data will
always be searched for at this location:
<code class="filename">LOCATION_OF_BINARY/../share/mira/...</code>
</p><p>
That is: If the binary is, e.g.,
<code class="filename">/opt/mira5/bin/mira</code> with a softlink pointing from
<code class="filename">/usr/local/bin/mira -> /opt/mira5/bin/mira</code>
(because, e.g., <code class="filename">/usr/local/bin</code> may be by default in your
PATH variable), then the additional data will be searched for in
<code class="filename">/opt/mira5/share/mira/...</code> and NOT in
<code class="filename">/usr/local/share/mira/...</code>.
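</p><p>
As a sketch of how to check this on an installed system (assuming a
<span class="command"><strong>readlink</strong></span> that supports the
<code class="literal">-f</code> option, and the hypothetical paths from the example above):
</p><pre class="screen">
# resolve the softlink to the real location of the binary;
# for the layout above this is /opt/mira5/bin/mira
<strong class="userinput"><code>readlink -f <em class="replaceable"><code>/usr/local/bin/mira</code></em></code></strong>
# the additional data is then expected in the adjacent share directory
<strong class="userinput"><code>ls <em class="replaceable"><code>/opt/mira5/share/mira</code></em></code></strong>
</pre><p>
If the second command fails, the <code class="filename">share</code> directory was
probably not moved or installed together with the binary.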
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
In short: since MIRA 4.9.6, moving the binary is not enough
anymore. Take care to have the <span style="color: red"><em>share</em></span> directory in the
right place, i.e., adjacent to the directory the MIRA binary lives in.
</td></tr></table></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_reference"></a>Chapter 3. MIRA 4 reference manual</h1></div><div><h3 class="subtitle"><i>aka: The extended man page of MIRA 4,
a genome and EST/RNASeq sequence assembly and mapping program for Sanger, 454, IonTorrent,
PacBio and Illumina/Solexa sequencing data</i></h3></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_ref_synopsis">3.1.
Synopsis
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_requirements">3.2.
Requirements
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_working_modes">3.3.
Working modes
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_config">3.4.
Configuring an assembly: files and parameters
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_manifest_introduction">3.4.1.
The manifest file: introduction
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_manifest_basics">3.4.2.
The manifest file: basics
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_manifest_readgroups">3.4.3.
The manifest file: information on the data you have
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_readgroup">3.4.3.1.
Starting a new readgroup
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_data">3.4.3.2.
Defining data files to load
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_defaultqual">3.4.3.3.
Setting default quality
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_technology">3.4.3.4.
Defining technology used to sequence
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_nostatistics">3.4.3.5.
Preventing statistics for technologies with biases
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_asreference">3.4.3.6.
Setting reference sequence for mapping jobs
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_autopairing">3.4.3.7.
Autopairing: letting MIRA find out pair info by itself
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_templatesize">3.4.3.8.
Setting size of read templates
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_segplace">3.4.3.9.
Read segment placement
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_segname">3.4.3.10.
Read segment naming
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_strainname">3.4.3.11.
Strain naming
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_datadirscf">3.4.3.12.
Data directory for SCF files
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_renameprefix">3.4.3.13.
Renaming read name prefixes
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_manifest_parameters">3.4.4.
The manifest file: extended parameters
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_parameter_groups">3.4.4.1.
Parameter groups
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_technology_sections">3.4.4.2.
Technology sections
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_parameter_shortnames">3.4.4.3.
Parameter short names
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_order_dependent_quick_switches">3.4.4.4.
Order dependent quick switches
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_general_ge">3.4.4.5.
Parameter group: -GENERAL (-GE)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_assembly_as">3.4.4.6.
Parameter group: -ASSEMBLY (-AS)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_strain_backbone_sb">3.4.4.7.
Parameter group: -STRAIN/BACKBONE (-SB)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_dataprocessing_dp">3.4.4.8.
Parameter group: -DATAPROCESSING (-DP)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_clipping_cl">3.4.4.9.
Parameter group: -CLIPPING (-CL)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_skim_sk">3.4.4.10.
Parameter group: -SKIM (-SK)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_hashstatistics_hs">3.4.4.11.
Parameter group: -KMERSTATISTICS (-KS)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_align_al">3.4.4.12.
Parameter group: -ALIGN (-AL)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_contig_co">3.4.4.13.
Parameter group: -CONTIG (-CO)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_edit_ed">3.4.4.14.
Parameter group: -EDIT (-ED)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_misc_mi">3.4.4.15.
Parameter group: -MISC (-MI)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_misc_nw">3.4.4.16.
Parameter group: -NAG_AND_WARN (-NW)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_directory_dir_di">3.4.4.17.
Parameter group: -DIRECTORY (-DIR, -DI)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_output_out">3.4.4.18.
Parameter group: -OUTPUT (-OUT)
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_ref_resuming_assemblies">3.5.
Resuming / restarting assemblies
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_input_output">3.6.
Input / Output
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_directories">3.6.1.
Directories
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_filenames">3.6.2.
Filenames
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_output">3.6.2.1.
Output
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_assembly_statistics_and_information_files">3.6.2.2.
Assembly statistics and information files
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_file_formats">3.6.3.
File formats
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_stdout_stderr">3.6.4.
STDOUT/STDERR
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_ssaha2smalt">3.6.5.
SSAHA2 / SMALT ancillary data
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_xml_traceinfo">3.6.6.
XML TRACEINFO ancillary data
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_contig_naming">3.6.7.
Contig naming
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_recovering_strain_specific_consensus">3.6.8.
Recovering strain specific consensus as FASTA
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_tags_used_in_the_assembly_by_mira_and_edit">3.7.
Tags used in the assembly by MIRA and EdIt
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_tags_read_and_used">3.7.1.
Tags read (and used)
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_tags_set_and_used">3.7.2.
Tags set (and used)
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_contigs_singlets_debris">3.8.
Where reads end up: contigs, singlets, debris
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_snp_discovery">3.9.
Detection of bases distinguishing non-perfect repeats and SNP discovery
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_data_reduction">3.10.
Data reduction: subsampling vs. lossless digital normalisation
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_caveats">3.11.
Caveats
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_using_artificial_reads">3.11.1.
Using data not from sequencing instruments: artificial / synthetic reads
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_ploidy_and_repeats">3.11.2.
Ploidy and repeats
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_handling_of_repeats">3.11.3.
Handling of repeats
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_uniform_read_distribution">3.11.3.1.
Uniform read distribution
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_keeping_'long'_repetitive_contigs_separate">3.11.3.2.
Keeping 'long' repetitive contigs separate
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_helping_finishing_by_tagging_reads_with_haf_tags">3.11.3.3.
Helping finishing by tagging reads with HAF tags
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_consensus_in_finishing_programs_gap4_consed_">3.11.4.
Consensus in finishing programs (gap4, consed, ...)
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_some_other_things_to_consider">3.11.5.
Some other things to consider
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_things_you_should_not_do">3.12.
Things you should not do
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_never_on_nfs">3.12.1.
Do not run MIRA on NFS mounted directories without redirecting the tmp directory
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_never_without_quality_values">3.12.2.
Do not assemble without quality values
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_useful_third_party_programs">3.13.
Useful third party programs
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_speed_and_memory_considerations">3.14.
Speed and memory considerations
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_memory">3.14.1.
Estimating needed memory for an assembly project
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_speed">3.14.2.
Some numbers on speed
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_known_problems_bugs">3.15.
Known Problems / Bugs
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_todos">3.16.
TODOs
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_working_principles">3.17.
Working principles
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_see_also">3.18.
See Also
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">The manual only makes sense after you learn the program.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_synopsis"></a>3.1.
Synopsis
</h2></div></div></div><p>
<code class="literal">mira [-chmMrtv] <em class="replaceable"><code>manifest-file</code></em> [<em class="replaceable"><code>manifest-file</code></em> ...]</code>
</p><p>
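As an illustration, assuming a manifest file with the hypothetical name
<code class="filename">manifest.conf</code>, a typical sequence of calls might be:
</p><pre class="screen">
# check the manifest and the presence of the data files, then exit
<strong class="userinput"><code>mira --mdcheck <em class="replaceable"><code>manifest.conf</code></em></code></strong>
# run the assembly, forcing 4 threads
<strong class="userinput"><code>mira --thread=4 <em class="replaceable"><code>manifest.conf</code></em></code></strong>
# resume an interrupted de-novo assembly
<strong class="userinput"><code>mira --resume <em class="replaceable"><code>manifest.conf</code></em></code></strong>
</pre><p>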
The command line parameters in short:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[-c / --cwd=<em class="replaceable"><code>directory</code></em>]
</span></dt><dd>
Change working directory.
</dd><dt><span class="term">
[-h / --help]
</span></dt><dd>
Print a short help and exit.
</dd><dt><span class="term">
[-m / --mcheck]
</span></dt><dd>
Only check the manifest file, then exit.
</dd><dt><span class="term">
[-M / --mdcheck]
</span></dt><dd>
Only check the manifest file and presence of data files, then exit.
</dd><dt><span class="term">
[-r / --resume]
</span></dt><dd>
Resume / restart an interrupted assembly. Works only for de-novo
assemblies at the moment.
</dd><dt><span class="term">
[-t / --thread=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd>
Force number of threads (overrides equivalent [-GE:not]
manifest entry).
</dd><dt><span class="term">
[-v / --version]
</span></dt><dd>
Print version and exit.
</dd></dl></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_requirements"></a>3.2.
Requirements
</h2></div></div></div><p>
To use MIRA itself, one doesn't need very much:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Sequence data in EXP, CAF, PHD, FASTA or FASTQ format
</p></li><li class="listitem"><p>
Optionally: ancillary information in NCBI traceinfo XML format;
ancillary information about strains in tab delimited format, vector
screen information generated with <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span>.
</p></li><li class="listitem"><p>
Some memory and disk space. Actually lots of both if you are
venturing into 454 or Illumina.
</p></li></ul></div><p>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_working_modes"></a>3.3.
Working modes
</h2></div></div></div><p>
MIRA has three basic working modes: genome, EST/RNASeq or
EST-reconstruction-and-SNP-detection. From version 2.4 on, there is
only one executable which supports all modes. The name with which this
executable is called defines the working mode:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="command"><strong>mira</strong></span> for assembly of genomic data as well as
assembly of EST data from one or multiple strains / organisms
</p><p>
and
</p></li><li class="listitem"><p>
<span class="command"><strong>miraSearchESTSNPs</strong></span> for assembly of EST data from
different strains (or organisms) and SNP detection within this
assembly. This is the former <span class="command"><strong>miraEST</strong></span> program
which was renamed as many people got confused regarding whether to
use MIRA in est mode or miraEST.
</p></li></ol></div><p>
Note that <span class="command"><strong>miraSearchESTSNPs</strong></span> is usually realised as
a link to the <span class="command"><strong>mira</strong></span> executable. The executable
decides which module to start by the name it was called with.
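</p><p>
A minimal sketch of setting up such a link by hand, assuming the
<span class="command"><strong>mira</strong></span> binary lives in the hypothetical directory
<code class="filename">/opt/mira5/bin</code>:
</p><pre class="screen">
<strong class="userinput"><code>cd <em class="replaceable"><code>/opt/mira5/bin</code></em></code></strong>
<strong class="userinput"><code>ln -s mira miraSearchESTSNPs</code></strong>
</pre><p>
Calling <span class="command"><strong>miraSearchESTSNPs</strong></span> then runs the same
executable, which starts the EST SNP search module because of the name it
was called with.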
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_config"></a>3.4.
Configuring an assembly: files and parameters
</h2></div></div></div><p>
All the configuration needed for an assembly is done in one (or several)
configuration file(s): the <span class="emphasis"><em>manifest</em></span> files. This
encompasses things like what kind of assembly you want to perform
(genome or EST / RNASeq, mapping or de-novo, etc.) or which data files
contain the sequences you want to assemble (and in which format these
are).
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_introduction"></a>3.4.1.
The manifest file: introduction
</h3></div></div></div><p>
A <span class="emphasis"><em>manifest</em></span> file can be seen as a two part
configuration file for an assembly: the first part contains some
general information while the second part contains information about
the sequencing data to be loaded. Examples being always easier to
follow than long texts, here's an example for a de-novo assembly with
single-end (also called shotgun) 454 data:
</p><pre class="screen"># Example for a manifest describing a simple 454 de-novo assembly
# A manifest file can contain comment lines, these start with the #-character
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 threads in parallel (where possible)
<strong class="userinput"><code>
project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em>
parameters = <em class="replaceable"><code>-GE:not=4</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups": this reflects the
# ... that read sequences ...
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>TCMFS456ZH345.fastq TQF92GT7H34.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong></pre><p>
To make things a bit more interesting, here's an example using a
couple more technologies and showing some more options of the manifest
file like wild cards in file names, different paired-end/mate-pair
libraries and how to let MIRA refine pairing information (or even find
out everything by itself):
</p><pre class="screen"># Example for a manifest describing a de-novo assembly with
# unpaired 454, paired-end Illumina, a mate-pair Illumina
# and a paired Ion Torrent
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 passes with kmer sizes of
# 17, 31, 63 and 127 nucleotides. Obviously, read lengths of the
# libraries should be greater than 127 bp.
# Note: usually MIRA will choose sensible options for number of
# passes and kmer sizes to be used by itself.
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em>
parameters = <em class="replaceable"><code>-AS:kms=17,31,63,127</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups": this reflects the
# ... that read sequences ...
# defining the shotgun (i.e. unpaired) 454 reads
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>TCMFS456ZH345.fastq TQF92GT7H34.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong>
# defining the paired-end Illumina reads, fixing all needed pair information
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomePairedEndIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>datape*.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>100 300</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em>
segment_naming = <em class="replaceable"><code>solexa</code></em></code></strong>
# defining the mate-pair Illumina reads, fixing most needed pair information
# but letting MIRA refine the template_size via "autorefine"
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeMatePairIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>datamp*.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>2000 4000 autorefine</code></em>
segment_placement = <em class="replaceable"><code><--- ---></code></em>
segment_naming = <em class="replaceable"><code>solexa</code></em></code></strong>
# defining paired Ion Torrent reads
# example to show how lazy one can be and simply let MIRA estimate by itself
# all needed pairing information via "autopairing"
# Hint: it usually does a better job at it than we do ;-)
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomePairedIonReadsIGotFromTheLab</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>dataion*.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_basics"></a>3.4.2.
The manifest file: basics
</h3></div></div></div><p>
The first part of an assembly <span class="emphasis"><em>manifest</em></span> contains
the very basic information the assembler needs to have to know what
you want it to do. This part consists of exactly three entries:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="bold"><strong>project =</strong></span>
<em class="replaceable"><code>project name</code></em> tells the assembler
the name you wish to give to the whole assembly project. MIRA will
use that name throughout the whole assembly for naming
directories, files and a couple of other things.
</p><p>
You can name the assembly any way you want. You should, however,
restrain yourself and use only alphanumeric characters and perhaps
the characters plus, minus and underscore. Using slashes or
backslashes here is a recipe for catastrophe.
</p></li><li class="listitem"><p>
<span class="bold"><strong>job =</strong></span>
[<em class="replaceable"><code>denovo|mapping</code></em>],
[<em class="replaceable"><code>genome|est|fragments|clustering</code></em>],
[<em class="replaceable"><code>draft|accurate</code></em>] tells the
assembler what kind of data it should expect and how it should
assemble it.
</p><p>
You need to make your choice mainly in three steps and in the end
concatenate your choices to the [job=] entry of the manifest:
</p><div class="orderedlist"><ol class="orderedlist" type="a"><li class="listitem"><p>
are you building an assembly from scratch
(choose: <span class="emphasis"><em>denovo</em></span>) or are you mapping reads
to an existing backbone sequence
(choose: <span class="emphasis"><em>mapping</em></span>)? Pick one. Leaving this
out automatically chooses <span class="emphasis"><em>denovo</em></span> as
default.
</p></li><li class="listitem"><p>
are the data you are assembling forming a larger contiguous
sequence (choose: <span class="emphasis"><em>genome</em></span>), are you
assembling EST or mRNA libraries
(choose: <span class="emphasis"><em>est</em></span>), single genes or small
plasmids (choose: <span class="emphasis"><em>fragments</em></span>) or do you cluster assembled
sequences (choose: <span class="emphasis"><em>clustering</em></span>)?
Pick one. Leaving this out
automatically chooses <span class="emphasis"><em>genome</em></span> as default.
</p><p>
Since version 4.9.4, a new mode <span class="emphasis"><em>fragments</em></span>
is available. This mode is essentially similar to the
<span class="emphasis"><em>EST</em></span> mode, but has all safety features
that reduce data sizes switched off. Use this mode for
assembly of comparatively small EST/mRNA, small plasmid or
single gene projects where you
want to have highest accuracy and minimal filtering. Warning:
contigs with coverages going into the 1000s will lead to
really slow assemblies.
</p><p>
Since version 4.9.6, a new mode <span class="emphasis"><em>clustering</em></span>
is available. This mode is essentially for clustering
assembled contigs like they are created in mRNA or EST
assemblies. Basic parameters are: single pass, no clipping, no
editing, ~7.5% differences between sequences allowed,
gaps >= 13 bases disallowed, single occurrence of disagreeing
base leads to SNP tagging.
Warning: do not use that with any type of real sequencing data
... you probably would regret this.
</p></li><li class="listitem"><p>
do you want a quick and dirty assembly for first insights
(choose: <span class="emphasis"><em>draft</em></span>) or an assembly that should
be able to tackle even the nastiest cases (choose:
<span class="emphasis"><em>accurate</em></span>)? Pick one. Leaving this out
automatically chooses <span class="emphasis"><em>accurate</em></span> as default.
</p></li></ol></div><p>
Once you have made your choices, concatenate everything with
commas. E.g.:
'<code class="literal">job = mapping,genome,draft</code>' will give you a
mapping assembly of a genome in draft quality.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
For de-novo assembly of genomes, these switches are optimised for
'decent' coverages that are commonly seen to get you something useful,
i.e., ≥ 7x for Sanger, ≥ 18x for 454 FLX or Titanium, ≥ 25x for
454 GS20 and ≥ 30x for Solexa. Should you venture into lower
coverage or extremely high coverage (say, ≥ 60x for 454), you will
need to adapt a few parameters via extensive switches.
</td></tr></table></div></li><li class="listitem"><p>
<span class="bold"><strong>parameters =</strong></span> is used in case you
want to change one of the 150+ extended parameters MIRA has to
offer to control almost every aspect of an assembly. This is
described in more detail in a separate section below.
</p></li></ol></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_readgroups"></a>3.4.3.
The manifest file: information on the data you have
</h3></div></div></div><p>
The second part of an assembly <span class="emphasis"><em>manifest</em></span> tells
MIRA which files it needs to load, which sequencing technology
generated the data, whether there are DNA template constraints it can
use during the assembly process and a couple of other things.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_readgroup"></a>3.4.3.1.
Starting a new readgroup
</h4></div></div></div><p>
<span class="bold"><strong>readgroup </strong></span> [= <em class="replaceable"><code>group name</code></em>] is the keyword which tells MIRA that you are going to define a new read group. You can optionally name that group.
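</p><p>
A minimal sketch of both forms; the group name is a made-up example and
lines starting with '#' are comments:
</p><pre class="screen"># start a new read group and give it a name
readgroup = MyIlluminaShotgunLibrary

# start a new read group without naming it
readgroup</pre><p>
The lines describing the data of a group (files, technology, etc.) follow
its <span class="emphasis"><em>readgroup</em></span> line.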
</p><div class="sidebar"><a name="sidebar_ref_manifest_readgroups_templates_and_readgroups"></a><div class="titlepage"><div><div><p class="title"><b>
Understanding readgroups and DNA templates
</b></p></div></div></div><p>
When you send away your DNA for sequencing, it is going to be
prepared for sequencing according to your wishes. Sequencing
providers call this "constructing a library" and regardless
whether you sequence with Sanger, 454, Illumina, Ion Torrent,
Pacific Biosciences or other technologies, the "library prep" is
always there.
</p><p>
With most library preps, your DNA is first amplified and then
cut into small pieces. These pieces are called
<span class="emphasis"><em>templates</em></span> and their length can range
from a few dozen or a few hundred bases up to a couple
of dozen or even a hundred kilobases. The important thing is that
these templates can be much bigger in size than the actual read
length. While this is a wet lab step, protocols and providers
have gotten pretty good at constructing libraries where the DNA
templates are all in a given range of bases like, e.g., having a
library with template size 500bp (+/- 100bp) and another library
with template size around 7kb (+/- 500bp).
</p><p>
Depending on the technology and sequencing strategy used, the
DNA templates are used to create either one single read or - and
that's important - two or more reads.
</p><p>
Libraries with "single reads" are often called "single read
libraries" or "shotgun libraries". They can be found for every
sequencing technology and are most of the time easy to construct
(therefore cheap) and are often used to provide a decent amount
of bases as basic coverage for your project.
</p><p>
Libraries with two reads per DNA template are often called
"mate-pair" or "paired-end" libraries. They are harder to
construct and sometimes have a lower yield, therefore they are often
more expensive. But the sequencing approach using several reads
per DNA template allows assembly and scaffolding algorithms to
resolve repetitive regions of a genome which are longer than the
average read length. Note that Pacific Biosciences has a
sequencing mode called "strobed sequencing" which is different
from "paired-end/mate-pair" but also creates multiple reads per
DNA template.
</p><p>
Long story short: an assembler must know afterwards what kind of
reads it has to expect: the sequencing technology, library
preparation strategy etc. For this, the notion of <span class="emphasis"><em>read
groups</em></span> has emerged: reads coming from the same
technology and same library preparation are pooled together in a
read group to tell the assembler: in the assembly, if you see two
reads coming from the same DNA template, you should expect them to
be at a certain distance from each other and they should be
oriented in a certain way.
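</p><p>
As an illustration, the two libraries mentioned above (500bp +/- 100bp and
7kb +/- 500bp) could be described as two read groups, using keywords which
are explained in the following sections. This is only a sketch: group names
and file names are made up, and it assumes an Illumina paired-end library
for the small templates and an Illumina mate-pair library for the large
ones. Lines starting with '#' are comments.
</p><pre class="screen"># small templates, reads point towards each other
readgroup = PairedEnd500bp
data = pe500_*.fastq
technology = solexa
template_size = 400 600
segment_placement = FR

# large templates, reads point away from each other
readgroup = MatePair7kb
data = mp7k_*.fastq
technology = solexa
template_size = 6500 7500
segment_placement = RF</pre><p>
Each group carries its own expectations on distance and orientation, which
is the information an assembler needs to make use of read pairs.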
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The above was a <span class="bold"><strong>very</strong></span> simplified
view on the whole area of DNA templates, readgroups, shotgun and
paired end sequencing. Enough to hopefully understand the
concepts, but you might want to read more about it.
</td></tr></table></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_data"></a>3.4.3.2.
Defining data files to load
</h4></div></div></div><p>
<span class="bold"><strong>data</strong></span> = <em class="replaceable"><code>filepath
[filepath ...]</code></em> defines the file paths from
which sequences should be loaded. A file path can contain just the
name of one (or several) files or it can contain the
<span class="emphasis"><em>path</em></span>, i.e., the directory (absolute or
relative) including the file name.
</p><p>
MIRA automatically recognises the type of the sequence data by
looking at the postfix of files. For postfixes not adhering to widely
used naming schemes for file types, there is additionally a way of
explicitly defining the type (see further down at the end of this
item on how this is done). Currently allowed file types are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename">.fasta</code> for sequences formatted in FASTA
format where there exists an additional
<code class="filename">.fasta.qual</code> file which contains quality
data. If the file with quality data is missing, this is
treated as an error and MIRA will abort.
</p></li><li class="listitem"><p>
<code class="filename">.fna</code> and <code class="filename">.fa</code> also
for sequences formatted in FASTA format. The difference
to <code class="filename">.fasta</code> lies in the way MIRA treats a
missing quality file (called
<code class="filename">.fna.qual</code>
or <code class="filename">.fa.qual</code>): it does not treat that as a
critical error and continues.
</p></li><li class="listitem"><p>
<code class="filename">.fastq</code> or <code class="filename">.fq</code> for files in FASTQ format
</p></li><li class="listitem"><p>
<code class="filename">.gff3</code> or <code class="filename">.gff</code> for files in GFF3 format. Note that
MIRA will load all sequences and annotations contained in this
file.
</p></li><li class="listitem"><p>
<code class="filename">.gbk</code>, <code class="filename">.gbf</code>, <code class="filename">.gbff</code>
or <code class="filename">.gb</code> for files formatted in GenBank
format. Note that the MIRA GenBank loader does not understand
intron/exon or other multiple-locus structures in this format,
use GFF3 instead!
</p></li><li class="listitem"><p>
<code class="filename">.caf</code> for files in the CAF format (from Sanger Centre)
</p></li><li class="listitem"><p>
<code class="filename">.maf</code> for files in the MIRA MAF format
</p></li><li class="listitem"><p>
<code class="filename">.exp</code> for files in the Staden EXP format.
</p></li><li class="listitem"><p>
<code class="filename">.fofnexp</code> for a <span class="emphasis"><em>file of EXP
filenames</em></span> which all point to files in the Staden EXP
format.
</p></li><li class="listitem"><p>
<code class="filename">.xml</code>, <code class="filename">.ssaha2</code> and <code class="filename">.smalt</code> for ancillary data in NCBI TRACEINFO, SSAHA2 or SMALT format respectively.
</p></li></ul></div><p>
Multiple 'data' lines and multiple entries per line (even
different formats) are allowed, as in, e.g.,
</p><pre class="screen">data = file1.fastq file2.fastq file3.fasta file4.gbk
data = myscreenings.smalt</pre><p>
You can also use wildcards and/or directory names. E.g., loading
all file types MIRA understands from a given directory
<code class="filename">mydir</code>:
</p><pre class="screen">data = mydir</pre><p>
or loading all files starting with <code class="filename">mydata</code> and
ending with <code class="filename">fastq</code>:
</p><pre class="screen">data = mydata*fastq</pre><p>
or loading all files in directory <code class="filename">mydir</code>
starting with <code class="filename">mydata</code> and ending with
<code class="filename">fastq</code>:
</p><pre class="screen">data = mydir/mydata*fastq</pre><p>
or loading all FASTQ files in all directories starting with <code class="filename">mydir</code>:
</p><pre class="screen">data = mydir*/*fastq</pre><p>
or ... well, you get the gist.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Giving a directory like in <code class="filename">mydir</code> is
equivalent to <code class="filename">mydir/*</code> (saying: give me all
files in the directory <code class="filename">mydir</code>), however the
first version should be preferred when the directory contains
thousands of files.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
GenBank and GFF3 files may or may not contain embedded sequences. If
annotations are present in these files for which no sequence is
present in the same file, MIRA will look for reads of the same
name which it already loaded in this or previously defined read
groups and add the annotations there.
</p><p>
As a safety measure, annotations in GenBank and GFF3 files for which
absolutely no sequence or read has been defined are treated as an
error.
</p></td></tr></table></div><p>
<span class="emphasis"><em>Explicit definition of file types.</em></span> It is
possible to explicitly tell MIRA the type of a file even if said
file does not have a 'standard' naming scheme. For this, the
EMBOSS double-colon notation has been adapted to work also for
MIRA, i.e., you prepend the type of a file and separate it from
the file name by a double colon. E.g.,
the <code class="filename">.dat</code> postfix is not anything MIRA will
recognise, but you can specify that it should be loaded as a FASTQ file
like this:
</p><pre class="screen">data = fastq::myfile.dat</pre><p>
Another frequent usage is forcing MIRA to load FASTA files
named <code class="filename">.fasta</code> without complaining in case
quality files (which MIRA wants you to provide) are not present:
</p><pre class="screen">data = fna::myfile.fasta</pre><p>
This does (of course) work also with directories or wildcard
characters. In the following example, the first line will load all
files from <code class="filename">mydirectory</code> as FASTQ while the
second line loads just <code class="filename">.dat</code> files in a given
path as FASTA:
</p><pre class="screen">data = fastq::mydirectory
data = fasta::/path/to/somewhere/*.dat</pre><p>
It is entirely possible (although not really sensible) to give
contradictory information to MIRA by using a different explicit
file type than one would guess from the standard postfix. In this
case, the explicit type takes precedence over the automatic
type. E.g.: to force MIRA to load a file as FASTA although it is
named <code class="filename">.fastq</code>, one could use this:
</p><pre class="screen">data = fasta::file.fastq</pre><p>
Note that the above does not perform any kind of file conversion:
<code class="filename">file.fastq</code> needs to be already in FASTA
format or else MIRA will fail loading that data.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_defaultqual"></a>3.4.3.3.
Setting default quality
</h4></div></div></div><p>
<span class="bold"><strong>default_qual</strong></span>=
<em class="replaceable"><code>quality_value</code></em> sets the
default fall-back quality value for sequences where the data files
given above do not contain quality values, e.g., GFF3 or GenBank
formats, and possibly also FASTA files for which the quality file is
missing.
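</p><p>
A short sketch, in which the group name, the file name and the value 30 are
arbitrary examples (lines starting with '#' are comments):
</p><pre class="screen">readgroup = SequencesWithoutQualities
# GenBank files carry no quality values ...
data = reference.gbk
technology = text
# ... so use 30 as fall-back quality for these sequences
default_qual = 30</pre><p>
The value only acts as a fall-back, i.e., for sequences whose data files do
not provide quality values.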
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_technology"></a>3.4.3.4.
Defining technology used to sequence
</h4></div></div></div><p>
<span class="bold"><strong>technology</strong></span>=
<em class="replaceable"><code>technology</code></em> names the technology
with which the sequences were produced. Allowed technologies are:
<span class="emphasis"><em>sanger, 454, solexa, iontor, pcbiolq, pcbiohq,
text</em></span>.
</p><p>
The <span class="emphasis"><em>text</em></span> technology is not a technology per
se, but should be used for sequences which do not come from
sequencing machines like, e.g., database entries, consensus
sequences, artificial reads (which do not comply with the
behaviour of 'normal' sequencing data), etc.
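</p><p>
A sketch with two read groups; group and file names are invented for
illustration and lines starting with '#' are comments:
</p><pre class="screen"># reads coming from an Illumina (Solexa) sequencer
readgroup = IlluminaReads
data = illumina_*.fastq
technology = solexa

# a consensus sequence taken from a database, not from a sequencer
readgroup = DatabaseSequence
data = fna::consensus.fasta
technology = text</pre><p>
The <code class="literal">fna::</code> prefix in the second group keeps MIRA
from complaining about a missing quality file for the FASTA sequence.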
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_nostatistics"></a>3.4.3.5.
Preventing statistics for technologies with biases
</h4></div></div></div><p>
<span class="bold"><strong>nostatistics</strong></span> used as keyword
prevents MIRA from calculating coverage estimates from reads of the given
readgroup.
</p><p>
This keyword should be used in de-novo genome assemblies for reads
from libraries which produce very uneven coverage (e.g.: old
Illumina mate-pair protocols) or have a bias in the randomness of
DNA fragmentations (e.g.: Nextera protocol from Illumina).
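</p><p>
A sketch of a read group for a library with biased fragmentation; group and
file names are hypothetical and lines starting with '#' are comments:
</p><pre class="screen">readgroup = NexteraLibrary
data = nextera_*.fastq
technology = solexa
# do not use the reads of this group for coverage estimates
nostatistics</pre><p>
Being a keyword, <span class="emphasis"><em>nostatistics</em></span> is written
without a value.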
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_asreference"></a>3.4.3.6.
Setting reference sequence for mapping jobs
</h4></div></div></div><p>
<span class="bold"><strong>as_reference</strong></span> This keyword
indicates to MIRA that the sequences in this readgroup should not
be assembled, but should be used as reference backbone for a
mapping assembly. That is, sequencing reads are then placed/mapped
onto these reference reads.
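</p><p>
A sketch of the read group part of a manifest for a mapping assembly; group
and file names are invented and lines starting with '#' are comments:
</p><pre class="screen"># the backbone onto which reads are mapped
readgroup = ReferenceGenome
as_reference
technology = text
data = reference.gbk

# the sequencing reads to be mapped
readgroup = IlluminaReads
technology = solexa
data = reads_*.fastq</pre><p>
Only the first group is marked with
<span class="emphasis"><em>as_reference</em></span>; the reads of the second
group are then mapped onto its sequences. This only makes sense together
with a mapping job.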
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_autopairing"></a>3.4.3.7.
Autopairing: letting MIRA find out pair info by itself
</h4></div></div></div><p>
<span class="bold"><strong>autopairing</strong></span> This keyword is used
to tell MIRA it should estimate values for
<span class="emphasis"><em>template_size</em></span> and
<span class="emphasis"><em>segment_placement</em></span> (see below).
</p><p>
This is basically the lazy way to tell MIRA that the data in the
corresponding readgroup consists of paired reads and that you
trust it will find out the correct values.
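</p><p>
A sketch for a paired library; group and file names are made up and lines
starting with '#' are comments:
</p><pre class="screen">readgroup = PairedEndLibrary
data = pe_*.fastq
technology = solexa
# let MIRA estimate template_size and segment_placement
autopairing</pre><p>
With this keyword, no <span class="emphasis"><em>template_size</em></span> or
<span class="emphasis"><em>segment_placement</em></span> lines need to be given
for the group.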
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><span class="emphasis"><em>autopairing</em></span> usually works quite well for
small and mid-sized libraries (up to, say, 10 kb). For larger
libraries it might be a good thing to tell MIRA some rough
boundaries via <span class="emphasis"><em>template_size</em></span> /
<span class="emphasis"><em>segment_placement</em></span> and let MIRA refine the
values for the template size via <span class="emphasis"><em>autorefine</em></span>
(see below).
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><span class="emphasis"><em>autopairing</em></span> is a feature new to MIRA 4.0rc5,
and it may contain bugs for some corner cases. Feedback is appreciated.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_templatesize"></a>3.4.3.8.
Setting size of read templates
</h4></div></div></div><p>
<span class="bold"><strong>template_size </strong></span>=
<em class="replaceable"><code>min_size max_size
<span class="emphasis"><em>[infoonly|exclusion_criterion]</em></span>
<span class="emphasis"><em>[autorefine]</em></span></code></em>. Defines the
minimum and maximum size of "good" DNA templates in the library
prep for this read group. This defines at which distance the two
reads of a pair are to be expected in a contig, which is very useful
information for an assembler to resolve repeats in a genome or
different splice variants in transcriptome data.
</p><p>
If the term <span class="emphasis"><em>infoonly</em></span> is present, then MIRA
will pass the information on template sizes in result files, but
will not use it for any decision making during de-novo or mapping
assembly. The term <span class="emphasis"><em>exclusion_criterion</em></span> makes
MIRA use the information for decision making.
</p><p>
If <span class="emphasis"><em>infoonly</em></span>
or <span class="emphasis"><em>exclusion_criterion</em></span> are missing, then MIRA
assumes <span class="emphasis"><em>exclusion_criterion</em></span> for de-novo
assemblies and <span class="emphasis"><em>infoonly</em></span> for mapping
assemblies.
</p><p>
If the term <span class="emphasis"><em>autorefine</em></span> is present, MIRA will
start the assembly with the given size information but switch to
refined values computed from the distances observed in the
assembly. However, please note that the size values
can <span class="emphasis"><em>never</em></span> be expanded, only shrunk. It is
therefore advisable to use generous bounds when using the
autorefine feature.
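</p><p>
The following lines sketch some combinations of the options described
above. The sizes are arbitrary examples, each line would belong to a
different read group, and lines starting with '#' are comments:
</p><pre class="screen"># templates between 400 and 600 bases, default handling
template_size = 400 600

# pass the sizes on to result files only, do not use them for decisions
template_size = 400 600 infoonly

# generous bounds for a library of about 7kb, refined during the assembly
template_size = 4000 10000 exclusion_criterion autorefine</pre><p>
Because autorefine can only shrink the range, the last line deliberately
starts with bounds wider than the expected template sizes.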
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The <span class="emphasis"><em>template_size</em></span> line in the manifest file
replaces the parameters -GE:uti:tismin:tismax of earlier versions
of MIRA (3.4.x and below).
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The minimum or the maximum size (or both) can be set to a negative
value for "don't care and don't check". This allows constructs
like <code class="literal">template_size= 500 -1 exclusion_criterion</code>
which would check only the minimum distance but not the maximum
distance.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For <span class="emphasis"><em>mapping</em></span> assemblies with MIRA, you
usually will want to use <span class="emphasis"><em>infoonly</em></span> as else -
in case of genome re-arrangements, larger deletions or
insertions - MIRA would probably reject one read of every read
pair in the corresponding areas as it would not be at the
expected distance and/or orientation ... and you would not be
able to simply find the re-arrangement in downstream analysis.
</p><p>
For <span class="emphasis"><em>de-novo</em></span> assemblies however
you <span class="emphasis"><em>should not</em></span>
use <span class="emphasis"><em>infoonly</em></span> except in very rare cases
where you know what you are doing.
</p></td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding the size of DNA templates
</b></p></div></div></div><p>
When using a <span class="emphasis"><em>paired-end</em></span> or
<span class="emphasis"><em>mate-pair</em></span> sequencing strategy, two
sequences are generated for the ends of each DNA template (see
sidebar above: "understanding readgroups and DNA
templates"). That is, if one has a library with 6kb fragments,
one knows that the outer ends of the two reads will be
approximately 6kb apart, like so:
</p><pre class="screen">DNA template ##############################################################
read 1 .......
read 2 ......
<------------------------- ~6 kb ----------------------------></pre><p>
Sequencing labs will try their best to get these two sequences
from DNA templates which comply with a given length
specification. But as this is chemistry and wet lab work, some
uncertainty is involved and therefore the DNA
templates generated are not exactly of the specified size
(e.g. 6kb); instead, their sizes vary within a given
range, e.g., 5.5kb to 6.5kb.
</p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_segplace"></a>3.4.3.9.
Read segment placement
</h4></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">You do not need to use this when using 'autopairing' (see above).</td></tr></table></div><p>
<span class="bold"><strong>segment_placement </strong></span>=
<em class="replaceable"><code>placementcode <span class="emphasis"><em>[infoonly|exclusion_criterion]</em></span></code></em>. Allowed
placement codes are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<span class="bold"><strong>?</strong></span>
or <span class="bold"><strong>unknown</strong></span> which are
place-holders for "well, in the end: don't care." Segments of
a template can be reads in any direction and in any
relationship to each other.
</p><p>
This is typically used for unpaired libraries (sometimes
called <span class="emphasis"><em>shotgun libraries</em></span>), but may be
also useful for, e.g., primer walking with Sanger.
</p></li><li class="listitem"><p>
<span class="bold"><strong>---> <---</strong></span> or <span class="bold"><strong>FR</strong></span> or <span class="bold"><strong>INNIES</strong></span>. The <span class="emphasis"><em>forward /
reverse</em></span> scheme as used in traditional Sanger
sequencing as well as Illumina paired-end sequencing.
</p><p>
This is the usual placement code for Sanger paired-end
protocols as well as Illumina paired-end. Less frequently used
in IonTorrent paired-end sequencing.
</p></li><li class="listitem"><p>
<span class="bold"><strong><--- ---></strong></span> or <span class="bold"><strong>RF</strong></span> or <span class="bold"><strong>OUTIES</strong></span>. The <span class="emphasis"><em>reverse /
forward</em></span> scheme as used in Illumina mate-pair
sequencing.
</p><p>
This is the usual placement code for Illumina mate-pair protocols.
</p></li><li class="listitem"><p>
<span class="bold"><strong>1---> 2---></strong></span> or
<span class="bold"><strong>samedir forward</strong></span> or <span class="bold"><strong>SF</strong></span> or <span class="bold"><strong>LEFTIES</strong></span>. The <span class="emphasis"><em>forward /
forward</em></span> scheme. Segments of a template are all
placed in the same direction, the segment order in the contig
follows segment ordering of the reads.
</p></li><li class="listitem"><p>
<span class="bold"><strong>2---> 1---></strong></span> or <span class="bold"><strong>samedir backward</strong></span> or <span class="bold"><strong>SB</strong></span> or <span class="bold"><strong>RIGHTIES</strong></span>. Segments of a template are
all placed in the same direction, the segment order in the
contig is reversed compared to segment ordering of the reads.
</p><p>
This is the usual placement code for 454 "paired-end" and IonTorrent
long-mate protocols.
</p></li><li class="listitem"><p>
<span class="bold"><strong>samedir</strong></span> Segments of a
template are all placed in the same direction, but their spatial
relationship is not checked.
</p></li><li class="listitem"><p>
<span class="bold"><strong>>>></strong></span> (reserved for
sequencing of several equidistant fragments per template like
in PacBio strobe sequencing, not implemented yet)
</p></li></ul></div><p>
If the term <span class="emphasis"><em>infoonly</em></span> is present, then MIRA
will pass the information on segment placement in result files, but
will not use it for any decision making during de-novo assembly or
mapping assembly. The term <span class="emphasis"><em>exclusion_criterion</em></span> makes MIRA use the information for decision making.
</p><p>
If <span class="emphasis"><em>infoonly</em></span> or <span class="emphasis"><em>exclusion_criterion</em></span> are missing, then MIRA assumes <span class="emphasis"><em>exclusion_criterion</em></span> for de-novo assemblies and <span class="emphasis"><em>infoonly</em></span> for mapping assemblies.
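</p><p>
Some sketches using the short placement codes listed above; each line would
belong to a different read group and lines starting with '#' are comments:
</p><pre class="screen"># Illumina paired-end library
segment_placement = FR

# Illumina mate-pair library, placement passed on but not used for decisions
segment_placement = RF infoonly

# 454 "paired-end" library
segment_placement = SB</pre><p>
The alternative names given in the list (e.g.
<span class="emphasis"><em>INNIES</em></span> for
<span class="emphasis"><em>FR</em></span>) denote the same placements.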
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For <span class="emphasis"><em>mapping</em></span> assemblies with MIRA, you
usually will want to use <span class="emphasis"><em>infoonly</em></span> as else -
in case of genome re-arrangements, larger deletions or
insertions - MIRA would probably reject one read of every read
pair (as it would not be at the expected distance and/or
orientation) and you would not be able to simply find the
re-arrangement in downstream analysis.
</p><p>
For <span class="emphasis"><em>de-novo</em></span> assemblies however
you <span class="emphasis"><em>should not</em></span>
use <span class="emphasis"><em>infoonly</em></span> except in very rare cases
where you know what you are doing.
</p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
As soon as you tell MIRA that a readgroup contains paired reads (via one of the other typical readgroup parameters like template_size, segment_naming etc.), the <span class="emphasis"><em>segment_placement</em></span> line becomes mandatory in the manifest. This is because different sequencing technologies and/or library preparations result in different read orientations. E.g., Illumina libraries come in paired-end flavour which have FR (forward/reverse) placements, but there are also mate-pair libraries which have reverse/forward (RF) placements.
</td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding read segment placement on DNA templates
</b></p></div></div></div><p>
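Depending on sequencing technology and library preparation, the reads
(segments) of a DNA template end up in different orientations and orders
relative to each other: pointing towards each other (FR), pointing away
from each other (RF), or pointing in the same direction (SF, SB). The
placement code tells MIRA which of these arrangements to expect for the
reads of a read group.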
</p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_segname"></a>3.4.3.10.
Read segment naming
</h4></div></div></div><p>
<span class="bold"><strong>segment_naming </strong></span>= <em class="replaceable"><code>naming_scheme <span class="emphasis"><em>[rollcomment]</em></span></code></em>. Defines
the naming scheme reads are following to indicate the DNA template
they belong to. Allowed naming schemes are: <span class="emphasis"><em>sanger,
stlouis, tigr, FR, solexa, sra</em></span>.
</p><p>
If not defined, the defaults are <span class="underline">sanger</span> for Sanger sequencing data
and <span class="underline">solexa</span> for Solexa, 454
and Ion Torrent.
</p><p>
For FASTQ files, the modifier <span class="emphasis"><em>rollcomment</em></span> can
be used to let MIRA take the first token in the comment as name of
a read instead of the original name. E.g.: for a read
</p><pre class="screen">@DRR014327.1.1 HWUSI-EAS547_0013:1:1:1106:4597.1 length=91
TTAGAAGGAGATCTGGAGAACATTTTAAACCGGATTGAACAACGCGGCCGTGAGATGGAGCTTCAGACAAGCCGGTCTTATTGGGACGAAC
+
bbb`bbbbabbR`\_bb_bba`b`bb_bb_`\^\^Y^`\Zb^b``]]\S^a`]]a``bbbb_bbbb]bbb\`^^^]\aaY\`\\^aa__aB</pre><p>
the rollcomment modifier will lead to the read being named
<code class="filename">HWUSI-EAS547_0013:1:1:1106:4597.1</code> (which
is almost the original instrument read name) instead of
<code class="filename">DRR014327.1.1</code> (which is the SRA read name).
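</p><p>
In a manifest, the naming scheme for such data could be set as in the
following sketch; group and file names are hypothetical and lines starting
with '#' are comments:
</p><pre class="screen">readgroup = ReadsFromSRA
data = DRR014327.fastq
technology = solexa
# reads are named like DRR014327.1.1, i.e., they follow the SRA scheme
segment_naming = sra</pre><p>
The optional <span class="emphasis"><em>rollcomment</em></span> modifier would
be appended after the naming scheme on the same line.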
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For data from the Short Read Archive (SRA), one will usually need
to explicitly specify the 'sra' naming scheme or use the
'rollcomment' modifier in FASTQ files.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The default naming schemes changed with MIRA 3.9.1
and <span class="command"><strong>sff_extract</strong></span> 0.3.0. Before that, 454 and Ion
Torrent were given <span class="underline">fr</span> as naming
scheme.
</td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding read naming schemes
</b></p></div></div></div><p>
Read naming is a long story with lots of historical gotchas: it
needs to be clear and simple, but still people sometimes wanted
to convey additional meta-information with it. Unsurprisingly,
several "standards" emerged over time. In short: it's a mess. See also XKCD entry on <a class="ulink" href="http://xkcd.com/927/" target="_top">proliferating standards</a>.
</p><p>
How to choose: please read the documentation available at the
different centres or ask your sequence provider. In a nutshell
(and probably over-simplified):
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
Sanger scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>.[pqsfrw][12][bckdeflmnpt][a|b|c|...</em></span>"
(e.g. U13a08f10.p1ca), but the length of the postfix
must be at least 4 characters, i.e., ".p" alone will not
be recognised.
</p><p>
Usually, ".p" + 3 characters or ".f" + 3 characters are
used for forward reads, while reverse complement reads
take either ".q" or ".r" (+ 3 characters in both cases).
</p></dd><dt><span class="term">
TIGR scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>TF*|TR*|TA*</em></span>"
(e.g. GCPBN02TF or GCPDL68TABRPT103A58B).
</p><p>
Forward reads take "TF*", reverse reads "TR*".
</p></dd><dt><span class="term">
St. Louis scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>.[sfrxzyingtpedca]*</em></span>"
</p></dd><dt><span class="term">
Forward/Reverse scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>.[fr]*</em></span>"
(e.g. E0K6C4E01DIGEW.f or E0K6C4E01BNDXN.r2nd).
</p><p>
".f*" for forward, ".r*" for reverse.
</p></dd><dt><span class="term">
Solexa scheme
</span></dt><dd><p>
Even simpler than the forward/reverse scheme, it allows
at most two reads per template:
"somename<span class="emphasis"><em>/[12]</em></span>"
</p></dd><dt><span class="term">
SRA scheme
</span></dt><dd><p>
The Short Read Archive (SRA) finally settled on a naming
scheme and renames each and every read within its
database. When you download sequences from the archive,
all reads will be named
<code class="filename">XXX000000.Y[.Z]</code> (where X's are
characters A-Z, 0 are digits from 0 to 9, Y is a counter
and Z is a number denoting the segment (usually 1,2 or
3)). This naming scheme is applied to reads from all
technologies, therefore the MIRA technology dependent
defaults will not apply and one must specify the 'sra'
naming scheme in the command line.
</p></dd></dl></div></div><p>
Any wildcard in the forward/reverse suffix must be consistent for
a read pair, and is treated as part of the template name. This is
to allow multiple sequencing of a fragment, particularly common
with Sanger capillary data (e.g. given somename.f and somename.r,
resequenced as somename.f2 and somename.r2, this would be treated
as two pairs, with template names somename and somename_2
respectively).
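</p><p>
As a minimal sketch of how a naming scheme might be selected for a
read group downloaded from the SRA: the keyword
<code class="literal">segment_naming</code> and the
<code class="literal">data</code> line are assumed here to be the
manifest keywords described elsewhere in this manual, and the read
group and file names are purely hypothetical.
</p><pre class="screen">
# hypothetical read group with reads downloaded from the SRA
readgroup = SraDownload
data = SRR000000.fastq
technology = solexa
# assumption: 'segment_naming' is the keyword selecting the naming scheme
segment_naming = sra</pre><p>
Without such an entry, MIRA would fall back to the technology
dependent default scheme, which does not match the names assigned by
the SRA.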
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_strainname"></a>3.4.3.11.
Strain naming
</h4></div></div></div><p>
<span class="bold"><strong>strain_name </strong></span>=
<em class="replaceable"><code>string</code></em>. Defines the strain /
organism-code the reads of this read group are from. If not set,
MIRA will assign "StrainX" to normal readgroups and
"ReferenceStrain" to readgroups with reference sequences.
</p><p>
Restrictions: de-novo assemblies can have at most 255 strains;
mapping assemblies can have at most 8 strains.
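</p><p>
A minimal sketch of two read groups assigned to different strains. The
read group names, file names and strain names are hypothetical, and
the <code class="literal">data</code> line is assumed to be the usual
manifest keyword for input files:
</p><pre class="screen">
readgroup = WildtypeReads
data = wildtype.fastq
technology = solexa
strain_name = wt

readgroup = MutantReads
data = mutant.fastq
technology = solexa
strain_name = mutant1</pre><p>
Read groups without a <code class="literal">strain_name</code> entry
would get the default names described above.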
</p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding how MIRA uses strain information
</b></p></div></div></div><p>
bla
</p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_datadirscf"></a>3.4.3.12.
Data directory for SCF files
</h4></div></div></div><p>
<span class="bold"><strong>datadir_scf </strong></span>=
<em class="replaceable"><code>directory</code></em>
</p><p>
For SANGER data only: tells MIRA in which directory it can find
SCF data belonging to reads of this read group.
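</p><p>
A minimal sketch, assuming a Sanger read group whose SCF traces sit in
a separate directory. The read group name, file name and directory are
hypothetical, and the <code class="literal">data</code> line is
assumed to be the usual manifest keyword for input files:
</p><pre class="screen">
readgroup = SangerShotgun
data = shotgun_reads.fasta
technology = sanger
# hypothetical directory holding the SCF trace files of this read group
datadir_scf = /path/to/scf_traces</pre><p>
The directory applies only to reads of this read group.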
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_renameprefix"></a>3.4.3.13.
Renaming read name prefixes
</h4></div></div></div><p>
<span class="bold"><strong>rename_prefix</strong></span>=
<em class="replaceable"><code>prefix replacement</code></em>. Renames
reads on the fly while loading data: each read name is searched
for the given <span class="emphasis"><em>prefix</em></span> string and, if it is found,
the prefix is replaced with the given <span class="emphasis"><em>replacement</em></span> string.
</p><p>
This is most useful for systems like Illumina or PacBio which
generate quite long read names which, in the end, are either
utterly useless for an end user or even break older
programs which have a length restriction on read names. E.g.:
</p><pre class="screen">rename_prefix = DQT9AAQ4:436:H371HABMM: Sample1_</pre><p>
will rename reads
like <span class="emphasis"><em>DQT9AAQ4:436:H371HABMM:5:1101:9154:3062</em></span>
into <span class="emphasis"><em>Sample1_5:1101:9154:3062</em></span>
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><code class="literal">rename_prefix</code> entries are valid per
readgroup. I.e., an entry for a readgroup will not rename reads of
another readgroup.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Multiple <code class="literal">rename_prefix</code> entries are
allowed per readgroup. E.g.:
</p><pre class="screen">rename_prefix = DQT9AAQ4:436:H371HABMM: S1sxa_
rename_prefix = m140328_002546_42149_c100624422550000001823118308061414_s1_ S1pb_</pre><p>
will rename a read
called <code class="literal">DQT9AAQ4:436:H371HABMM:1:1101:3099:2186</code>
into <code class="literal">S1sxa_1:1101:3099:2186</code> while renaming
another read called <code class="literal">m140328_002546_42149_c100624422550000001823118308061414_s1_p0/100084/10792_20790/0_9573</code>
into <code class="literal">S1pb_p0/100084/10792_20790/0_9573</code>
</p></td></tr></table></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_parameters"></a>3.4.4.
The manifest file: extended parameters
</h3></div></div></div><p>
The <span class="bold"><strong>parameters=</strong></span> line in the manifest
file opens up the full panoply of possibilities the MIRA assembler
offers. This ranges from fine-tuning assemblies to setting parameters
in a way so that MIRA is suited also for very special assembly cases.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_parameter_groups"></a>3.4.4.1.
Parameter groups
</h4></div></div></div><p>
Some parameters one can set in MIRA belong together. For
example, when specifying an overlap in an alignment of two sequences,
one could tell the assembler to look at overlaps only if they
have a certain similarity and a certain length. On the other hand,
specifying how many processors / threads the assembler should use or
whether the results of an assembly should be written out in SAM
format does not relate to alignments.
</p><p>
MIRA uses <span class="emphasis"><em>parameter groups</em></span> to keep related
parameters together. For example:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code> -GENERAL:number_of_threads=4 \
-ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
-OUTPUT:output_result_caf=no</code></em></code></strong></pre><p>
The parameters of the different parameter groups are described in
detail a bit later in this manual.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_technology_sections"></a>3.4.4.2.
Technology sections
</h4></div></div></div><p>
With the introduction of new sequencing technologies, MIRA also had
to be able to set values that allow technology specific behaviour of
algorithms. One simple example for this could be the minimum length
a read must have to be used in the assembly. For Sanger sequences,
setting this value to 150 (meaning a read should have at least 150
unclipped bases) would be a very valid, albeit conservative,
choice. For 454 reads, and especially for Solexa and ABI SOLiD reads,
this value would be ridiculously high.
</p><p>
To allow very fine grained behaviour, especially in hybrid
assemblies, and to prevent the explosion of parameter names, MIRA
knows two categories of parameters:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="bold"><strong>technology independent parameters</strong></span>
which control general behaviour of MIRA like, e.g., the number of
assembly passes or file names etc.
</p></li><li class="listitem"><p>
<span class="bold"><strong>technology dependent parameters</strong></span>
which control behaviour of algorithms where the sequencing
technology plays a role. Example for this would be the minimum
length of a read (like 200 for Sanger reads and 120 for 454 FLX
reads).
</p></li></ol></div><p>
More on this a bit further down in this documentation.
</p><p>
As example, a manifest using technology dependent and independent parameters could
look like this:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GENERAL:number_of_threads=4 \
SANGER_SETTINGS -ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
454_SETTINGS -ALIGN:min_relative_score=75 -ASSEMBLY:minimum_read_length=100 \
SOLEXA_SETTINGS -ALIGN:min_relative_score=90 -ASSEMBLY:minimum_read_length=75</code></em></code></strong></pre><p>
Now, assume the following read group descriptions in a manifest:
</p><pre class="screen">
...
readgroup
technology=454
...
readgroup
technology=solexa
...</pre><p>
For MIRA, this means a number of parameters should apply to the
assembly as a whole, while others apply to the sequencing data itself
... and some parameters might need to be different depending on the
technology they apply to. MIRA dumps the parameters it is running
with at the beginning of an assembly and makes it clear there
which parameters are "global" and which parameters apply to single
technologies.
</p><p>
As an example, here is part of the parameter output that MIRA
shows when started with 454 and Illumina (Solexa) data:
</p><pre class="screen">
...
Assembly options (-AS):
Number of passes (nop) : 1
Skim each pass (sep) : yes
Maximum number of RMB break loops (rbl) : 1
Spoiler detection (sd) : no
Last pass only (sdlpo) : yes
Minimum read length (mrl) : [454] 40
[sxa] 20
Enforce presence of qualities (epoq) : [454] no
[sxa] yes
...</pre><p>
You can see the two different kinds of settings that MIRA uses:
<span class="emphasis"><em>common</em></span> <span class="emphasis"><em>settings</em></span> (like
[-AS:nop]), which allow only one value, and
<span class="emphasis"><em>technology</em></span> <span class="emphasis"><em>dependent</em></span>
<span class="emphasis"><em>settings</em></span> (like [-AS:mrl]), where
each sequencing technology used in the project can have a
different value.
</p><p>
How would one set a minimum read length of 40 and not enforce
presence of base qualities for Sanger reads, but for 454 reads a
minimum read length of 30 and enforce base qualities? The answer:
</p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=40:epoq=no 454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
Notice the ..._SETTINGS sections in the command line (or parameter file):
each tells MIRA that all the following parameters, up to the next
such switch, are to be set specifically for the named technology.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For improved readability, you can distribute parameters across
several lines, either by prefixing every line with
<code class="literal">parameters=</code>, like so:
</p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=80:epoq=no
parameters= 454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
Alternatively you can use a backslash at the end of a parameter
line to indicate that the next line is a continuing line, like so:
</p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=80:epoq=no <strong class="userinput"><code>\</code></strong>
454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
Note that the very last line of the parameters settings MUST NOT
end with a backslash.
</p></td></tr></table></div><p>
Besides COMMON_SETTINGS, there are currently 7 technology settings available:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
SANGER_SETTINGS
</p></li><li class="listitem"><p>
454_SETTINGS
</p></li><li class="listitem"><p>
IONTOR_SETTINGS
</p></li><li class="listitem"><p>
PCBIOLQ_SETTINGS (currently not supported)
</p></li><li class="listitem"><p>
PCBIOHQ_SETTINGS
</p></li><li class="listitem"><p>
SOLEXA_SETTINGS
</p></li><li class="listitem"><p>
TEXT_SETTINGS
</p></li></ol></div><p>
</p><p>
Some settings of MIRA influence global behaviour and are not
related to a specific sequencing technology. These must be set in the
COMMON_SETTINGS environment. For example, it would not make sense to try to
set a different number of assembly passes for each technology like in
</p><pre class="screen">
<strong class="userinput"><code>parameters= 454_SETTINGS -AS:nop=4 SOLEXA_SETTINGS -AS:nop=3</code></strong></pre><p>
Besides being contradictory, this does not really make sense. MIRA will
complain about cases like these. Simply set those common settings in
an area prefixed with the COMMON_SETTINGS switch like in
</p><pre class="screen">
<strong class="userinput"><code>parameters= COMMON_SETTINGS -AS:nop=4 454_SETTINGS ... SOLEXA_SETTINGS ...</code></strong></pre><p>
</p><p>
Since MIRA 3rc3, the parameter parser helps you by checking
whether parameters are correctly defined as COMMON_SETTINGS or
as technology dependent settings.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_parameter_shortnames"></a>3.4.4.3.
Parameter short names
</h4></div></div></div><p>
Writing the verbose form of parameters can be quite a long task. Here is a short example:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GENERAL:number_of_threads=4 \
SANGER_SETTINGS -ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
454_SETTINGS -ALIGN:min_relative_score=75 -ASSEMBLY:minimum_read_length=100 \
SOLEXA_SETTINGS -ALIGN:min_relative_score=90 -ASSEMBLY:minimum_read_length=75</code></em></code></strong></pre><p>
However, every parameter has a shortened form. The above could be written like this:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GE:not=4 \
SANGER_SETTINGS -AL:mrs=70 -AS:mrl=150 \
454_SETTINGS -AL:mrs=75 -AS:mrl=100 \
SOLEXA_SETTINGS -AL:mrs=90 -AS:mrl=75</code></em></code></strong></pre><p>
Please note that it is also perfectly legal to decompose the switches
so that they can be used more easily in scripted environments (notice
that some sections of the following example give each switch on its own line):
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GE:not=4 \
SANGER_SETTINGS \
-AL:mrs=70 \
-AS:mrl=150 \
454_SETTINGS -AL:mrs=75 -AS:mrl=100 \
SOLEXA_SETTINGS \
-AL:mrs=90 \
-AS:mrl=75</code></em></code></strong></pre></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_order_dependent_quick_switches"></a>3.4.4.4.
Order dependent quick switches
</h4></div></div></div><p>
For some parameters, the order of appearance in the parameter lines
of the manifest is important. This is because the <span class="emphasis"><em>quick
parameters</em></span> are realised internally as a collection of
extended parameters that will overwrite any previously manually set
extended parameters. It is generally a good idea to place parameters in
the order described in this documentation, that is: first the
order dependent quick parameters, then other quick parameters, then all
the other extended parameters.
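</p><p>
A minimal sketch of that ordering, assuming the quick switches are
given in the <code class="literal">parameters=</code> line alongside
the extended parameters (the values are illustrative only):
</p><pre class="screen">
job=denovo,genome,draft
# order dependent quick switch first, extended parameters afterwards
parameters= --hirep_good COMMON_SETTINGS -GE:not=4 454_SETTINGS -AS:mrl=30</pre><p>
Written the other way round, the quick switch could overwrite the
[-AS:mrl] value set before it, if that parameter happens to belong to
the collection of extended parameters behind the switch.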
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[--hirep_best]
, </span><span class="term">
[--hirep_good]
, </span><span class="term">
[--hirep_something]
</span></dt><dd><p>
These are modifier switches for genome data that is deemed to
be highly repetitive. With <span class="emphasis"><em>hirep_good</em></span> and
<span class="emphasis"><em>hirep_best</em></span>, the assemblies will run
slower due to more iterative cycles and slightly different
default parameter sets that give MIRA a chance to resolve many
nasty repeats. The <span class="emphasis"><em>hirep_something</em></span> switch
goes the other way round and resolves repeats less well than a
normal assembly, but allows MIRA to finish even on more
complex data.
</p><p>
Usage recommendations for bacteria: starting MIRA without any
hirep switches yields good enough results in most cases. Under
normal circumstances one can use
<span class="emphasis"><em>hirep_good</em></span> or
even <span class="emphasis"><em>hirep_best</em></span> without remorse as data
sets and genome complexities are small enough to run within a
couple of hours at most.
</p><p>
Usage recommendations for 'simple' lower eukaryotes: starting
MIRA without any hirep switches yields good enough results in
most cases. If the genomes are not too complex,
using <span class="emphasis"><em>hirep_good</em></span> can be a possibility.
</p><p>
Usage recommendations for lower eukaryotes with complex
repeats: starting MIRA without any hirep switches might
already take too long or create temporary data files which are
too big. For these cases, using
<span class="emphasis"><em>hirep_something</em></span> makes MIRA use a
parameter set which is targeted at resolving the
non-repetitive areas of a genome and additionally all repeats
which occur less than 10 times in the genome. Repeats occurring
more often will not be resolved, but using the debris
information one can recover affected reads and use these with
harsh data reduction algorithms (e.g. digital normalisation)
to get a glimpse into these.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
These switches replace the '--highlyrepetitive' switch from
earlier versions.
</td></tr></table></div></dd><dt><span class="term">
[--noclipping=...]
</span></dt><dd><p>
Switches off clipping options. If used
as <code class="literal">--noclipping</code>
or <code class="literal">--noclipping=all</code>, this switches off
really everything, both technology dependent and technology independent clippings.
Clippings for individual technologies can be switched
off via the entries <span class="emphasis"><em>sanger</em></span>,
<span class="emphasis"><em>454</em></span>, <span class="emphasis"><em>iontor</em></span>,
<span class="emphasis"><em>solexa</em></span> or
<span class="emphasis"><em>solid</em></span>. Multiple entries separated by
commas are allowed.
</p><p> Examples:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Switch off 454 and Solexa clippings, but keep technology independent
clippings and all clippings for other technologies (e.g.,
Sanger): <code class="literal">--noclipping=454,solexa</code>
</p></li><li class="listitem"><p>
Switch off really
everything: <code class="literal">--noclipping</code>
or <code class="literal">--noclipping=all</code>
</p></li></ol></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Switching off technology independent clippings
([-CL:pec], [-CL:gbcdc], [-CL:kjd])
via this switch has been implemented for consistency in MIRA
4.9.6. Prior to this they were kept active, which created a
good deal of confusion with a number of users.
</p><p>
As soon as you have any kind of 'real' sequencing data, you
really should use at least [-CL:pec]
and [-CL:gbcdc].
</p></td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_general_ge"></a>3.4.4.5.
Parameter group: -GENERAL (-GE)
</h4></div></div></div><p>
General options control the type of assembly to be performed and
other switches not belonging anywhere else.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[number_of_threads(not)=<em class="replaceable"><code>0 ≤ integer ≤ 256</code></em>]
</span></dt><dd><p> Default is <span class="underline">0</span>. Master switch to set the number
of threads used in different parts of MIRA.
</p><p>
A value of 0 tells MIRA to set this to the number of available
physical cores on the machine it runs on. That is,
hyperthreaded "cores" are not counted in as using these would
cause a tremendous slowdown in the heavy duty computation
parts. E.g., a machine with 2 processors having 4 cores each
will have this value set to 8.
</p><p>
In case MIRA cannot find out the number of cores, the
fall-back value is <span class="underline">2</span>.
</p><p>
Note: when running the SKIM algorithm in parallel threads,
MIRA can give different results when started with the same
data and same arguments. While the effect could be averted for
SKIM, the memory cost for doing so would be an additional 50%
for one of the large tables, so this has not been implemented
at the moment. Besides, at the latest when the Smith-Watermans
run in parallel, this could not be easily avoided at all.
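</p><p>
A minimal sketch, assuming a machine on which MIRA should use 8
threads regardless of the number of cores it detects (the value is
illustrative only):
</p><pre class="screen">
parameters= COMMON_SETTINGS -GE:not=8</pre><p>
Leaving the parameter out, or setting it to 0, keeps the automatic
detection described above.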
</p></dd><dt><span class="term">
[automatic_memory_management(amm)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">Yes</span>. Whether
MIRA tries to optimise the run time of certain algorithms via a
space/time trade-off in memory usage, growing or shrinking some
internal tables as memory permits.
</p><p>
Note 1: This functionality currently relies on the
<code class="filename">/proc</code> file system giving information on
the system memory ("MemTotal" in /proc/meminfo) and the memory
usage of the current process ("VmSize" in
<code class="filename">/proc/self/status</code>). If this is not
available, the functionality is switched off.
</p><p>
Note 2: The automatic memory management can only work if there
actually is unused system memory. It's not a wonder switch
which reduces memory consumption. In tight memory situations,
memory management has no effect and the algorithms fall back
to minimum table sizes. This means that the effective size in
memory can grow larger than given in the memory management
parameters, but then MIRA will try to keep the additional
memory requirements to a minimum.
</p></dd><dt><span class="term">
[max_process_size(mps)=<em class="replaceable"><code>0 ≤ integer</code></em>]
</span></dt><dd><p> Default is <span class="underline">0</span>. If
automatic memory management is used (see above), this number is
the size in gigabytes that the MIRA process will use as maximum
target size when looking for space/time trade-offs. A value of 0
means that MIRA does not try to keep a fixed upper limit.
</p><p>
Note: when in competition to [-GE:kpmf] (see below),
the smaller of both sizes is taken as target. Example: if your
machine has 64 GiB but you limit the use to 32 GiB, then the
MIRA process will try to stay within these 32 GiB.
</p></dd><dt><span class="term">
[keep_percent_memory_free(kpmf)=<em class="replaceable"><code>0 ≤ integer</code></em>]
</span></dt><dd><p> Default is <span class="underline">10</span>. If
automatic memory management is used (see above), this number
works a bit like [-GE:mps] but the other way round: it
tries to keep x percent of the memory free.
</p><p>
Note: when in competition to [-GE:mps] (see above),
the argument leaving the most memory free is taken as
target. Example: if your machine has 64 GiB and you limit the
use to 42 GiB via [-GE:mps] but have a
[-GE:kpmf] of 50, then the MIRA process will try to
stay within 64-(64*50%)=32 GiB.
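</p><p>
The example above, written as a manifest line. This is a sketch which
assumes a machine with 64 GiB and assumes that all three switches are
common settings:
</p><pre class="screen">
# mps allows 42 GiB, kpmf=50 allows 64-(64*50%)=32 GiB: the 32 GiB target wins
parameters= COMMON_SETTINGS -GE:amm=yes:mps=42:kpmf=50</pre><p>
With <code class="literal">kpmf=10</code> instead, the limit derived
from [-GE:kpmf] would be 64-(64*10%)=57.6 GiB, so the 42 GiB given
via [-GE:mps] would become the target.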
</p></dd><dt><span class="term">
[preprocess_only(ppo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>. As a
special use case, MIRA will run only the following tasks:
loading and clipping of reads as well as calculating kmer
frequencies and read repeat information. The resulting reads can
then be found as a MAF file in the checkpoint directory, and the read
repeat information in the info directory.
</p><p>
No assembly is performed.
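</p><p>
A minimal sketch of a manifest line requesting preprocessing only,
assuming [-GE:ppo] is a common setting like the other -GE switches
shown in this manual:
</p><pre class="screen">
parameters= COMMON_SETTINGS -GE:ppo=yes</pre><p>
The read group definitions in the manifest would stay as they are for
a full assembly.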
</p></dd><dt><span class="term">
[est_snp_pipeline_step(esps)=<em class="replaceable"><code>1 ≤ integer ≤ 4</code></em>]
</span></dt><dd><p> Default is <span class="underline">1</span>. Controls the starting step of the
SNP search in EST pipeline and is therefore only useful in
miraSearchESTSNPs.
</p><p>
EST assembly is a three step process, each with different
settings to the assembly engine, with the result of each step
being saved to disk. If results of previous steps are present
in a directory, one can easily "play around" with different
settings for subsequent steps by reusing the results of the
previous steps and directly starting with step two or three.
</p></dd><dt><span class="term">
[print_date(pd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Controls
whether date and time are printed out during the
assembly. Suppressing it is not useful in normal operation,
only when debugging or benchmarking.
</p></dd><dt><span class="term">
[bang_on_throw(bot)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. For
debugging purposes only. Controls whether MIRA raises a signal
when detecting an error which triggers a running debugger like
gdb.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_assembly_as"></a>3.4.4.6.
Parameter group: -ASSEMBLY (-AS)
</h4></div></div></div><p>
General options for controlling the assembly.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[num_of_passes(nop)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">0</span>. Defines how many iterations of the whole
assembly process are done.
</p><p>
The default of 0 will let MIRA choose automatically the number
of passes and the kmer sizes used in each pass
(see also [-AS:kms] below).
</p><p>
Early termination: if the number of passes was chosen too
high, one can simply create a file
<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly/<em class="replaceable"><code>projectname</code></em>_d_chkpt/terminate</code>. At
the beginning of a new pass, MIRA checks for the existence of
that file and, if it finds it, acknowledges by renaming it to
<code class="filename">terminate_acknowledged</code> and then runs 2
more passes (with special "last pass routines") before
finishing the assembly.
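</p><p>
A minimal sketch of setting the number of passes explicitly, here for
a mapping assembly, which normally works with a single pass. The job
line is an assumption about how such a project might be declared:
</p><pre class="screen">
# hypothetical mapping project restricted to one pass
job=mapping,genome,accurate
parameters= COMMON_SETTINGS -AS:nop=1</pre><p>
Leaving the parameter at its default of 0 lets MIRA choose the number
of passes itself.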
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
As a rule of thumb, <span class="emphasis"><em>de-novo</em></span> assemblies
should always have at least two passes,
while <span class="emphasis"><em>mapping</em></span> assemblies should work with
only one pass. Not doing this will lead to results unexpected
by users. The reason is that MIRA's learning routines
either have no chance to learn enough about the assembly (for
de-novo with one pass) or learn "too much" (mapping with more
than one pass).
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
MIRA versions ≤ 4.0.2 were interpreting the value of '0' in
a different way and only performed pre-processing of
reads. MIRA can still do this, but this is controlled by the
new parameter [-GE:ppo].
</td></tr></table></div></dd><dt><span class="term">
[kmer_series(kms)=<em class="replaceable"><code>comma separated list of integers ≥ 0 and ≤ 256</code></em>]
</span></dt><dd><p>
Default is an empty value. If set, overrides [-AS:nop] and [-SK:kms].
</p><p>
If set, this parameter provides a one-stop-shop for defining the number of passes and the kmer size used in each pass. E.g.: <code class="literal">-AS:kms=17,31,63,127</code> defines an assembly with 4 passes which uses a kmer size of 17 in pass 1, 31 in pass 2, 63 in pass 3 and 127 in pass 4.
</p><p>
Note that it is perfectly valid to use the same kmer size more than once, e.g.: <code class="literal">17,31,63,63</code> will perform a 4 pass assembly, using a kmer size of 63 in passes 3 and 4. It also makes sense to do this, as with default parameters MIRA uses its integrated automatic editor which edits away obvious sequencing errors in each step, thus the second pass with a kmer size of 63 bases can rely on improved reads.
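</p><p>
Written as a manifest line, this second example could look as follows
(a sketch, assuming [-AS:kms] is a common setting like [-AS:nop]):
</p><pre class="screen">
# 4 passes: kmer size 17, then 31, then 63 twice
parameters= COMMON_SETTINGS -AS:kms=17,31,63,63</pre><p>
Because [-AS:kms] overrides [-AS:nop], no separate
<code class="literal">nop</code> entry is needed.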
</p></dd><dt><span class="term">
[rmb_break_loops(rbl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and assembly
quality level. Defines the maximum number of times a contig
can be rebuilt during a main assembly pass
(see [-AS:nop] or [-AS:kms]) if misassemblies due to possible repeats
are found.
</p></dd><dt><span class="term">
[max_contigs_per_pass(mcpp)=<em class="replaceable"><code>integer</code></em>]
</span></dt><dd><p>
Default is <span class="underline">0</span>. Defines
how many contigs are maximally built in each pass. A value of
0 stands for 'unlimited'. Values >0 can be used for special
use cases like test assemblies etc.
</p><p>
If in doubt, do not touch this parameter.
</p></dd><dt><span class="term">
[automatic_repeat_detection(ard)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">yes</span>. Tells MIRA to use coverage
information accumulated over time to more accurately pinpoint reads that are
in repetitive regions.
</p></dd><dt><span class="term">
[coverage_threshold(ardct)=<em class="replaceable"><code>float > 1.0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">2.0</span> for all sequencing technologies in most assembly cases. This
option means: if a read has ever been aligned at positions
where the total coverage of all reads of the same sequencing technology
reached the average coverage times [-AS:ardct] (over a length of
[-AS:ardml], see below), then this read is considered to be
repetitive.
</p></dd><dt><span class="term">
[min_length(ardml)=<em class="replaceable"><code>integer > 1</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology, currently
<span class="underline">400</span> for Sanger and
<span class="underline">200</span> for 454 and Ion
Torrent.
</p><p>
The coverage must stay above the [-AS:ardct] threshold for at least
this number of bases before the region is really treated as a repeat.
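</p><p>
A worked sketch of how the two thresholds combine, using the defaults
quoted above for Sanger data and an assumed average coverage of 30x.
The coverage figure is purely illustrative, as is the assumption that
both parameters are set in a technology section:
</p><pre class="screen">
# repeat threshold: 30x average coverage * ardct 2.0 = 60x,
# which must be reached over at least ardml = 400 bases
parameters= SANGER_SETTINGS -AS:ardct=2.0:ardml=400</pre><p>
A read aligned in a stretch of 400 or more bases at 60x or more would
then be considered repetitive, while a 100 base spike to 60x would not.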
</p></dd><dt><span class="term">
[grace_length(ardgl)=<em class="replaceable"><code>integer > 1</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology.
</p></dd><dt><span class="term">
[uniform_read_distribution(urd)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently always <span class="underline">no</span>
as these algorithms were supplanted by better ones in MIRA 4.0.
</p><p>
When set to <span class="underline">yes</span>, MIRA
will analyse the coverage of contigs built at a certain stage of
the assembly and estimate an average expected read coverage
for contigs. This value is used in subsequent passes of the
assembly to ensure that no part of a contig receives
significantly more coverage from reads previously identified as
repetitive than the estimated average coverage allows for.
</p><p>
This switch is useful to disentangle repeats that are
otherwise 100% identical and generally allows larger contigs
to be built. It is expected to be useful for Sanger and 454
sequences. Usage of this switch with Solexa and Ion Torrent
data is currently not recommended.
</p><p>
It is a real improvement for disentangling repeats, but has the
side-effect of creating some "contig debris" (small and low
coverage contigs, things you can normally safely throw away as
they represent sequence that already has enough
coverage).
</p><p>
This switch must be set to <span class="underline">no</span> for EST assembly, assembly of
transcripts etc. It is recommended to also switch this off for
mapping assemblies.
</p></dd><dt><span class="term">
[urd_startinpass(urdsip)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and assembly
quality level. Recommended values are: 3 for an assembly with
3 to 4 passes ([-AS:nop]). Assemblies with 5 passes
or more should set the value to the number of passes minus 2.
</p><p>
Takes effect only if uniform read distribution
([-AS:urd]) is on.
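</p><p>
The recommendation above can be written down as a small Python helper. The
function name is hypothetical, and returning nothing for fewer than three
passes is an assumption, as the text gives no recommendation for that case.
</p><pre class="programlisting">
```python
def recommended_urdsip(nop):
    """Recommended [-AS:urdsip] for a given number of passes [-AS:nop]."""
    if nop >= 5:
        return nop - 2
    if nop in (3, 4):
        return 3
    # No recommendation is documented for fewer than 3 passes.
    return None
```
</pre><p>
For example, an assembly with 6 passes would start uniform read
distribution in pass 4.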
</p></dd><dt><span class="term">
[urd_clipoffmultiplier(urdcm)=<em class="replaceable"><code>float > 1.0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">1.5</span> for all
sequencing technologies in most assembly cases.
</p><p>
This option says this: if MIRA determined that the average
coverage is <span class="emphasis"><em>x</em></span>, then in subsequent passes it will allow
coverage for reads determined to be repetitive to be built
into the contig only up to a total coverage of
<span class="emphasis"><em>x*urdcm</em></span>. Reads that bring the coverage above the threshold
will be rejected from that specific place in the contig (and
either be built into another copy of the repeat somewhere else
or end up as contig debris).
</p><p>
Please note that the lower [-AS:urdcm] is, the more
contig debris you will end up with (contigs with an average
coverage less than half of the expected coverage, mostly short
contigs with just a couple of reads).
</p><p>
Takes effect only if uniform read distribution ([-AS:urd]) is on.
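</p><p>
As a rough sketch of the clip-off rule, assuming coverage is looked at for
a single position only and that each read adds a coverage of 1 (both
simplifications, the function name is hypothetical), the decision could
look like this in Python:
</p><pre class="programlisting">
```python
def accepts_repetitive_read(current_coverage, avg_coverage, urdcm=1.5):
    """True if one more repetitive read keeps the total coverage within
    avg_coverage * urdcm (simplified, single-position view)."""
    return current_coverage + 1 <= avg_coverage * urdcm
```
</pre><p>
With an average coverage of 20 and the default multiplier, repetitive reads
would be accepted up to a total coverage of 30 and rejected beyond that.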
</p></dd><dt><span class="term">
[spoiler_detection(sd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and assembly
quality level. A spoiler is either a chimeric read or
a read that still includes long parts of unclipped vector sequence
(too long for the [-CL:pvc] vector
leftover clipping routines). A spoiler typically prevents
contigs from being joined; MIRA will cut spoilers back so that they
do no more harm to the assembly.
</p><p>
Recommended for mid- to high-coverage genomic
assemblies, not recommended for assemblies of ESTs as one
might lose splice variants with that.
</p><p>
A minimum number of two assembly passes ([-AS:nop])
must be run for this option to take effect.
</p></dd><dt><span class="term">
[sd_last_pass_only(sdlpo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Defines
whether the spoiler detection algorithms are run only for the
last pass or for all passes ([-AS:nop]).
</p><p>
Takes effect only if spoiler detection ([-AS:sd]) is on. If in
doubt, leave it at 'yes'.
</p></dd><dt><span class="term">
[minimum_read_length(mrl)=<em class="replaceable"><code>integer ≥ 20</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology. Defines the minimum length that
reads must have to be considered for the assembly. Shorter sequences will be
filtered out at the beginning of the process and won't be present in the
final project.
</p></dd><dt><span class="term">
[minimum_reads_per_contig(mrpc)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and the
[--job] parameter. For genome assemblies it's usually
around <span class="underline">2</span> for Sanger,
<span class="underline">5</span> for 454, <span class="underline">5</span> for Ion Torrent, <span class="underline">5</span> for PacBio and <span class="underline">10</span> for Solexa. In EST assemblies,
it's currently <span class="underline">2</span> for all
sequencing technologies.
</p><p>
Defines the minimum number of reads a contig must have before
it is built or saved by MIRA. Overlap clusters with fewer reads
than defined will not be assembled into contigs; instead, the reads in
these clusters will be immediately transferred to debris.
</p><p>
This parameter is useful to considerably reduce assembly time
in large projects with millions of reads (like in Solexa
projects) where a lot of small "junk" contigs with
contamination sequence or otherwise uninteresting data may be
created otherwise.
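</p><p>
The effect of [-AS:mrpc] can be pictured with the following Python sketch.
It assumes that overlap clusters are simply lists of read names, which is
a simplification for illustration; the function name is hypothetical and
this is not how MIRA stores its data internally.
</p><pre class="programlisting">
```python
def split_by_mrpc(clusters, mrpc):
    """Separate overlap clusters into those large enough to become contigs
    and the reads of all smaller clusters, which go to debris."""
    contigs = [cluster for cluster in clusters if len(cluster) >= mrpc]
    debris = [read for cluster in clusters if len(cluster) < mrpc
              for read in cluster]
    return contigs, debris
```
</pre><p>
With a value of 2, a cluster consisting of a single read would never be
built into a contig; the read would go straight to debris.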
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Important: a value larger than 1 for this parameter interferes with
the functioning of [-OUT:sssip] and
[-OUT:stsip].
</td></tr></table></div></dd><dt><span class="term">
[enforce_presence_of_qualities(epoq)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. When set
to yes, MIRA will stop the assembly if any read has no quality
values loaded.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">[-AS:epoq] switches on/off the quality check for a
complete sequencing technology. More fine-grained control
for switching the checks per readgroup is available via
the <span class="emphasis"><em>default_qual</em></span> readgroup parameter in
the manifest file.
</td></tr></table></div></dd><dt><span class="term">
[use_genomic_pathfinder(ugpf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. MIRA has
two different pathfinder algorithms it chooses from to find
its way through the (more or less) complete set of possible
sequence overlaps: a genomic and an EST pathfinder. The
genomic pathfinder looks a bit into the future of the assembly and tries
to stay on safe grounds using a maximum of information already
present in the contig that is being built. The EST version on
the contrary will directly jump at the complex cases posed by
very similar repetitive sequences and try to solve those first
and is willing to fall back to first-come-first-served when
really bad cases (like, e.g., coverage with thousands of
sequences) are encountered.
</p><p>
Generally, the genomic pathfinder will also work quite well
with EST sequences (but might get slowed down a lot in
pathological cases), while the EST algorithm does not work so
well on genomes. If in doubt, leave on <span class="underline">yes</span> for genome projects and set to
<span class="underline">no</span> for EST projects.
</p></dd><dt><span class="term">
[use_emergency_search_stop(uess)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Another
important switch if you plan to assemble non-normalised EST
libraries, where some ESTs may reach coverages of several
hundreds or thousands of reads. This switch lets MIRA save a
lot of computational time when aligning those extremely high
coverage areas (but only there), at the expense of some
accuracy.
</p></dd><dt><span class="term">
[ess_partnerdepth(esspd)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is <span class="underline">500</span>. Defines the number of potential
partners a read must have for MIRA to switch into emergency
search stop mode for that read.
</p></dd><dt><span class="term">
[use_max_contig_buildtime(umcbt)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. Defines whether there is an upper limit of time
to be used to build one contig. Set this to yes in EST assemblies where you
think that extremely high coverages occur. Less useful for assembly of
genomic sequences.
</p></dd><dt><span class="term">
[buildtime_in_seconds(bts)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">3600</span> for genome
assemblies, <span class="underline">720</span> for EST
assemblies with Sanger or 454
and <span class="underline">360</span> for EST assemblies
with Solexa or Ion Torrent. Depending on [-AS:umcbt]
above, this number defines the time in seconds allocated to
building one contig.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_strain_backbone_sb"></a>3.4.4.7.
Parameter group: -STRAIN/BACKBONE (-SB)
</h4></div></div></div><p>
Controlling backbone options in mapping assemblies:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[bootstrap_new_backbone(bnb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for
mapping assemblies with Illumina data, no otherwise.
</p><p>
When set to 'yes', MIRA will use a two-stage mapping process
which bootstraps an intermediate backbone (reference) sequence
and greatly improves mapping accuracy at indel sites.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Currently only works with Illumina data, other sequencing
technologies will not be affected by this flag.
</td></tr></table></div></dd><dt><span class="term">
[startbackboneusage_inpass(sbuip)=<em class="replaceable"><code>0 < integer</code></em>]
</span></dt><dd><p> Default is
dependent on assembly quality level chosen: 0 for 'draft'
and [-AS:nop] divided by 2 for 'accurate'.
</p><p>
When assembling against backbones, this parameter defines the
pass iteration (see [-AS:nop]) from which the
backbones will actually be used. In the passes preceding this
number, the non-backbone reads will be assembled together as
if no backbones existed. This allows MIRA to correctly spot
repetitive stretches that differ by single bases and tag them
accordingly. Note that full assemblies are considerably slower
than mapping assemblies, so be careful with this when
assembling millions of reads.
</p><p>
Rule of thumb: if the backbones belong to the same strain as the reads to assemble, set
to <span class="underline">1</span>. If the backbones are a different strain, then set
[-SB:sbuip] to 1 lower than [-AS:nop] (example: nop=4 and
sbuip=3).
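</p><p>
Expressed as a small Python helper, the rule of thumb might look like the
sketch below. The function name is hypothetical, and never going below 1
is an assumption made to stay within the range given for the parameter.
</p><pre class="programlisting">
```python
def rule_of_thumb_sbuip(nop, same_strain):
    """Suggested [-SB:sbuip] following the rule of thumb in the text."""
    if same_strain:
        return 1
    # Different strain: one lower than the number of passes, but at least 1
    # (the lower bound is an assumption, not stated in the rule of thumb).
    return max(1, nop - 1)
```
</pre><p>
For nop=4 this yields 3 for a different strain and 1 for the same strain.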
</p></dd><dt><span class="term">
[backbone_raillength(brl)=<em class="replaceable"><code>0 ≤ integer ≤ 10000</code></em>]
</span></dt><dd><p> Default
is <span class="underline">0</span>. Parameter for the
internal sectioning size of the backbone to compute optimal
alignments. Should be set to twice the length of the longest read in the
input data plus 15%. When set to 0, MIRA will compute optimal
values from the data loaded.
</p></dd><dt><span class="term">
[backbone_railoverlap(bro)=<em class="replaceable"><code>0 ≤ integer ≤ 2000</code></em>]
</span></dt><dd><p> Default is <span class="underline">0</span>.
Parameter for the internal sectioning size of the backbone to
compute optimal alignments. Should be set to length of the
longest read. When set to 0, MIRA will compute optimal values
from the data loaded.
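</p><p>
For those who prefer to set [-SB:brl] and [-SB:bro] by hand, the two
recommendations can be computed as in the Python sketch below. It reads
"two times length of longest read + 15%" as twice the length, increased by
15% and rounded to the nearest integer; that reading, the capping at the
documented maxima and the function name are assumptions.
</p><pre class="programlisting">
```python
def suggested_rail_settings(read_lengths, max_brl=10000, max_bro=2000):
    """Return (brl, bro) derived from the longest read in the input."""
    longest = max(read_lengths)
    # Twice the longest read plus 15%, in integer arithmetic with rounding.
    brl = (2 * longest * 115 + 50) // 100
    # The overlap is recommended to equal the longest read length.
    bro = longest
    return min(brl, max_brl), min(bro, max_bro)
```
</pre><p>
For a data set whose longest read has 250 bases, this gives a rail length
of 575 and a rail overlap of 250. In most cases leaving both values at 0
and letting MIRA decide is the simpler choice.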
</p></dd><dt><span class="term">
[trim_overhanging_reads(tor)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p><p>
When set to 'yes', MIRA will trim back reads at the ends of contigs
which outgrow the reference sequence so that boundaries of
the reference and the mapped reads align perfectly. That is,
the mapping does not perform a sequence extension.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The trimming is performed via setting low quality cutoffs in
the reads, i.e., the trimmed parts are not really gone but
just not part of the active contig anymore. They can be
uncovered when working on the assembly in finishing programs
like, e.g., <span class="command"><strong>gap4</strong></span>
or <span class="command"><strong>gap5</strong></span>.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Previous versions of MIRA (up to and including 3.9.18) behaved
as if this option had been set to 'no'. This is a major change
in behaviour, but it is also what probably most people expect
from a mapping.
</td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_dataprocessing_dp"></a>3.4.4.8.
Parameter group: -DATAPROCESSING (-DP)
</h4></div></div></div><p>
Options for controlling some data processing during the assembly.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[use_read_extension(ure)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used: <span class="underline">yes</span> for Sanger,
no for all others. MIRA expects the sequences it is given to be
quality clipped. During the assembly though, it will try to extend reads
into the clipped region and gain additional coverage by analysing
Smith-Waterman alignments between reads that were found to be valid. Only
the right clip is extended though, the left clip (most of the time
containing sequencing vector) is never touched.
</p></dd><dt><span class="term">
[read_extension_window_length(rewl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used. Only takes effect when
[-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
routines use a sliding window approach on Smith-Waterman alignments. This
parameter defines the window length.
</p></dd><dt><span class="term">
[read_extension_with_maxerrors(rewme)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Only takes effect
when [-DP:ure] (see above) is set to <span class="underline">yes</span>. The read
extension routines use a sliding window approach on Smith-Waterman
alignments. This parameter defines the maximum number of errors
(= disagreements) allowed between the two aligned sequences in the given window.
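</p><p>
How [-DP:rewl] and [-DP:rewme] work together can be illustrated with a
simplified Python sketch. It assumes the alignment in the clipped region is
given as a list of flags (True for a disagreement) and accepts columns for
as long as the window ending at the current column stays within the error
limit. The real routines are more involved; the function name and this
exact stopping rule are assumptions.
</p><pre class="programlisting">
```python
def extension_length(mismatches, rewl, rewme):
    """Number of leading alignment columns accepted for extension.

    mismatches: list of booleans, True where the two sequences disagree.
    rewl: window length, rewme: maximum errors allowed per window.
    """
    accepted = 0
    for end in range(1, len(mismatches) + 1):
        # Window of up to rewl columns ending at the current column.
        window = mismatches[max(0, end - rewl):end]
        if sum(window) > rewme:
            break
        accepted = end
    return accepted
```
</pre><p>
A larger window with a small error limit makes the extension more
conservative, a short window with a higher limit makes it more permissive.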
</p></dd><dt><span class="term">
[first_extension_in_pass(feip)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. Only takes effect when
[-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
routines can be called before assembly and/or after each assembly pass (see
[-AS:nop]). This parameter defines the first pass in which the read
extension routines are called. The default of <span class="underline">0</span> tells
MIRA to extend the reads the first time before the first assembly
pass.
</p></dd><dt><span class="term">
[last_extension_in_pass(leip)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. Only takes effect when
[-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
routines can be called before assembly and/or after each assembly pass (see
[-AS:nop]). This parameter defines the last pass in which the read
extension routines are called. The default of <span class="underline">0</span> tells
MIRA to extend the reads the last time before the first assembly
pass.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_clipping_cl"></a>3.4.4.9.
Parameter group: -CLIPPING (-CL)
</h4></div></div></div><p>
Controls for clipping options: when and how sequences should be clipped.
</p><p>
Every option in this section can be set individually for every sequencing
technology, giving a very fine grained control on how reads are clipped for
each technology.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[msvs_gap_size(msvsgs)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Takes
effect only when loading data from ancillary SSAHA2 or SMALT
files.
</p><p>
While performing the clip of screened vector sequences, MIRA
will check whether it can merge larger chunks of sequencing vector
bases that are a maximum of [-CL:msvsgs] apart.
</p></dd><dt><span class="term">
[msvs_max_front_gap(msvsmfg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Takes
effect only when loading data from ancillary SSAHA2 or SMALT
files.
</p><p>
While performing the clip of screened vector sequences at the
start of a sequence, MIRA will allow up to this number of
non-vector bases in front of a vector stretch.
</p></dd><dt><span class="term">
[msvs_max_end_gap(msvsmeg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Takes
effect only when loading data from ancillary SSAHA2 or SMALT
files.
</p><p>
While performing the clip of screened vector sequences at the
end of a sequence, MIRA will allow up to this number of
non-vector bases behind a vector stretch.
</p></dd><dt><span class="term">
[possible_vector_leftover_clip(pvlc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology
used: <span class="underline">yes</span> for
Sanger, <span class="underline">no</span> for any
other. MIRA will try to identify possible sequencing vector
relics present at the start of a sequence and clip them
away. These relics are usually a few bases long and were not
correctly removed from the sequence in data preprocessing
steps of external programs.
</p><p>
You might want to turn off this option if you know (or think)
that your data contains a lot of repeats and the option below
to fine tune the clipping behaviour does not give the expected
results.
</p><p>
You certainly want to turn off this option in EST assemblies
as this will quite certainly cut back (and thus hide)
different splice variants. But then make certain that your
pre-processing of Sanger data (sequencing vector removal) is
good; other sequencing technologies are not affected.
</p></dd><dt><span class="term">
[pvc_maxlenallowed(pvcmla)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent on the sequencing technology
used. The clipping of possible vector relics option works quite
well. Unfortunately, especially the bounds of repeats or
differences in EST splice variants sometimes show the same
alignment behaviour as possible sequencing vector relics and
could therefore also be clipped.
</p><p>
To keep the vector clipping from mistakenly clipping repetitive
regions or EST splice variants, this option puts an upper
bound to the number of bases a potential clip is allowed to
have. If the number of bases is below or equal to this
threshold, the bases are clipped. If the number of bases
exceeds the threshold, the clip
is <span class="bold"><strong>NOT</strong></span> performed.
</p><p>
Setting the value to 0 turns off the threshold, i.e., clips are then always
performed if a potential vector was found.
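</p><p>
The decision described above boils down to a single comparison, shown here
as a Python sketch (the function name is hypothetical):
</p><pre class="programlisting">
```python
def should_clip_relic(relic_length, pvcmla):
    """True if a potential vector relic of relic_length bases is clipped.

    A pvcmla of 0 disables the upper bound, so every relic is clipped.
    """
    return pvcmla == 0 or relic_length <= pvcmla
```
</pre><p>
With an upper bound of 18, a potential relic of 5 bases would be clipped
while one of 30 bases would be left alone.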
</p></dd><dt><span class="term">
[quality_clip(qc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">no</span>. This will let MIRA
perform its own quality clipping before sequences are entered
into the assembly. The clip function performed is a sequence end
window quality clip with back iteration to get a maximum number
of bases as useful sequence. Note that the bases clipped away
here can still be used afterwards if there is enough evidence
supporting their correctness when the option [-DP:ure]
is turned on.
</p><p>
Warning: The windowing algorithm works pretty well for Sanger,
but apparently does not like 454 type data. It's advisable
not to switch it on for 454. Besides, the 454 quality clipping
algorithm performs a pretty decent albeit not perfect job, so
for genomic 454 data (not ESTs!), it is currently recommended
to use a combination of [-CL:emrc] and
[-DP:ure].
</p></dd><dt><span class="term">
[qc_minimum_quality(qcmq)=<em class="replaceable"><code>integer ≥ 15 and ≤ 35</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. This is the minimum
quality that bases in a window require to be accepted. Please be cautious not to
choose values that are too extreme here, because the clipping will then be too lax or
too harsh. Values below 15 and higher than 30-35 are not recommended.
</p></dd><dt><span class="term">
[qc_window_length(qcwl)=<em class="replaceable"><code>integer ≥ 10</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. This is the length of a window
in bases for the quality clip.
</p></dd><dt><span class="term">
[bad_stretch_quality_clip (bsqc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>. This
option allows clipping of reads that were not correctly preprocessed
and have unclipped bad quality stretches that might prevent a
good assembly.
</p><p> MIRA will search the sequence in forward direction for a
stretch of bases that have on average a quality less than a
defined threshold and then set the right quality clip of this
sequence to cover the given stretch.
</p></dd><dt><span class="term">
[bsqc_minimum_quality (bsqcmq)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent
on the sequencing technology used. Defines the minimum average quality a
given window of bases must have. If this quality is not reached, the
sequence will be clipped at this position.
</p></dd><dt><span class="term">
[bsqc_window_length (bsqcwl)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent on the
sequencing technology used. Defines the length of the window within which
the average quality of the bases is computed.
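</p><p>
The bad stretch search controlled by [-CL:bsqcmq] and [-CL:bsqcwl] can be
sketched in Python as a forward scan with a sliding window. Placing the
right clip at the start of the first bad window is an assumption about
what "cover the given stretch" means, and the function name is
hypothetical.
</p><pre class="programlisting">
```python
def bad_stretch_right_clip(quals, bsqcmq, bsqcwl):
    """Return the right clip position: the start of the first window whose
    average quality is below bsqcmq, or len(quals) if there is none."""
    if bsqcwl <= 0 or len(quals) < bsqcwl:
        return len(quals)
    window_sum = sum(quals[:bsqcwl])
    for start in range(len(quals) - bsqcwl + 1):
        if start > 0:
            # Slide the window one base to the right.
            window_sum += quals[start + bsqcwl - 1] - quals[start - 1]
        # Comparing sums avoids a division for every window.
        if window_sum < bsqcmq * bsqcwl:
            return start
    return len(quals)
```
</pre><p>
A read with 20 bases of quality 30 followed by 10 bases of quality 5 would,
with a minimum quality of 20 and a window of 10, be clipped at position 15,
the first window start where the average drops below 20.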
</p></dd><dt><span class="term">
[maskedbases_clip(mbc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. This will let MIRA
perform a 'clipping' of bases that were masked out (replaced with the
character X). It is generally not a good idea to use masked bases to remove
unwanted portions of a sequence, the EXP file format and the NCBI traceinfo
format have excellent possibilities to circumvent this. But because a lot of
preprocessing software is built around cross_match-, scylla-
and phrap-style base masking, the need arose for MIRA to
be able to handle this, too. MIRA will look at the start and end of
each sequence to see whether there are masked bases that should be
'clipped'.
</p></dd><dt><span class="term">
[mbc_gap_size(mbcgs)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent on
the sequencing technology used. While performing the clip of masked bases,
MIRA will check whether it can merge larger chunks of masked bases that are
a maximum of [-CL:mbcgs] apart.
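</p><p>
Merging masked chunks that lie close together is essentially an interval
merge. The Python sketch below assumes the masked chunks are available as
a sorted list of half-open (start, end) positions; this representation and
the function name are assumptions made for illustration.
</p><pre class="programlisting">
```python
def merge_masked_chunks(chunks, mbcgs):
    """Merge masked chunks separated by at most mbcgs unmasked bases.

    chunks: list of (start, end) tuples, sorted by start, end exclusive.
    """
    merged = []
    for start, end in chunks:
        if merged and start - merged[-1][1] <= mbcgs:
            # Gap is small enough: extend the previous chunk.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```
</pre><p>
With a gap size of 5, two masked chunks separated by 3 unmasked bases
would be treated as one larger chunk.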
</p></dd><dt><span class="term">
[mbc_max_front_gap(mbcmfg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. While performing the clip of
masked bases at the start of a sequence, MIRA will allow up to this
number of unmasked bases in front of a masked stretch.
</p></dd><dt><span class="term">
[mbc_max_end_gap(mbcmeg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. While performing the clip of
masked bases at the end of a sequence, MIRA will allow up to this
number of unmasked bases behind a masked stretch.
</p></dd><dt><span class="term">
[lowercase_clip_front(lccf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used: on for 454 and Ion
Torrent data, off for all
others. This will let MIRA perform a 'clipping' of bases that are in
lowercase at the front end of a sequence, leaving only the uppercase
sequence. Useful when handling 454 data that does not have ancillary data in
XML format.
</p></dd><dt><span class="term">
[lowercase_clip_back(lccb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used: on for 454 and Ion
Torrent data, off for all
others. This will let MIRA perform a 'clipping' of bases that are in
lowercase at the back end of a sequence, leaving only the uppercase
sequence. Useful when handling 454 data that does not have ancillary data in
XML format.
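</p><p>
The combined effect of [-CL:lccf] and [-CL:lccb] can be pictured with the
following Python sketch, which returns the clip positions that leave only
the uppercase part of a read. The function name is hypothetical and the
code is a sketch of the described behaviour only.
</p><pre class="programlisting">
```python
def lowercase_clips(seq, lccf=True, lccb=True):
    """Return (left, right) so that seq[left:right] is the part kept."""
    left = 0
    if lccf:
        # Skip lowercase bases at the front of the read.
        while left < len(seq) and seq[left].islower():
            left += 1
    right = len(seq)
    if lccb:
        # Skip lowercase bases at the back of the read.
        while right > left and seq[right - 1].islower():
            right -= 1
    return left, right
```
</pre><p>
For the read acgtACGTNNACgt, both switches together would keep only
ACGTNNAC.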
</p></dd><dt><span class="term">
[clip_polyat(cpat)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span> for all EST/RNASeq
assemblies. Poly-A stretches in forward reads and poly-T
stretches in reverse reads get either clipped or tagged here
(see [-CL:cpkps] below). The assembler will not use
these stretches for finding overlaps, but it will use these to
discern and disassemble different 3' UTR endings.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Should poly-A / poly-T stretches have been trimmed in
pre-processing steps before MIRA got the reads, this option
MUST be switched off.
</td></tr></table></div></dd><dt><span class="term">
[cp_keep_poly_stretch (cpkps)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span> but takes effect only
if [-CL:cpat] (see above) is also set to yes.
</p><p>
Instead of clipping the poly-A / poly-T sequence away, the
stretch in question in the reads is kept and tagged. The tags
provide additional information for MIRA to discern between
different 3' UTR endings and also serve as a good visual anchor when
looking at the assembly with different programs.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
One side-effect of this option is that the poly-A / poly-T
stretches are 'cleaned'. That is, single non-poly-A / non-poly-T
bases within the stretch are automatically edited to be
conforming to the surrounding stretch. This is necessary as
homopolymers are by nature one of the hardest motifs to be
sequenced correctly by any sequencing technology and one
frequently gets 'dirty' poly-A sequence from sequencing and
this interferes heavily with the methods MIRA uses to discern
repeats.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Keeping the poly-A sequence is a two-edged sword: on the one hand it
enables MIRA to discern different 3' UTR endings, on the other hand
it might be that sequencing problems toward the end of reads
create false-positive different endings. If you find that this
is the case for your data, just switch off this option: MIRA
will then simply build the longest possible 3' UTRs.
</td></tr></table></div></dd><dt><span class="term">
[cp_min_sequence_len(cpmsl)=<em class="replaceable"><code>integer >
0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">10</span>. Only takes effect
when [-CL:cpat] (see above) is set
to <span class="underline">yes</span>. Defines the number
of 'A' (in forward direction) or 'T' (in reverse direction) bases that must
be present for a stretch to be considered a poly-A sequence stretch.
</p></dd><dt><span class="term">
[cp_max_errors_allowed(cpmea)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">1</span>. Only takes effect
when [-CL:cpat] (see above) is set
to <span class="underline">yes</span>. Defines the
maximum number of errors allowed in the potential poly-A
sequence stretch. The distribution of these errors is not
important.
</p></dd><dt><span class="term">
[cp_max_gap_from_end(cpmgfe)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is <span class="underline">9</span>. Only
takes effect when [-CL:cpat] (see above) is set
to <span class="underline">yes</span>. Defines the number
of bases from the end of a sequence (if masked: from the end of
the masked area) within which a poly-A sequence stretch is
looked for.
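</p><p>
To show how [-CL:cpmsl], [-CL:cpmea] and [-CL:cpmgfe] relate to each
other, here is a simplified Python sketch for forward reads. It looks for
a window of cpmsl bases that ends at most cpmgfe bases before the end of
the read and contains at most cpmea non-A bases. This reading of the three
parameters and the function name are assumptions; the real search also
handles reverse reads (poly-T) and masked ends.
</p><pre class="programlisting">
```python
def find_polya(seq, cpmsl=10, cpmea=1, cpmgfe=9):
    """Return (start, end) of a poly-A window near the read end, or None."""
    seq = seq.upper()
    n = len(seq)
    # Try window ends from the read end backwards, at most cpmgfe bases away.
    for end in range(n, max(n - cpmgfe, cpmsl) - 1, -1):
        window = seq[end - cpmsl:end]
        errors = sum(1 for base in window if base != "A")
        if errors <= cpmea:
            return end - cpmsl, end
    return None
```
</pre><p>
A stretch of A bases with one interspersed C, followed by three other
bases at the very end of the read, would still be found with the defaults,
whereas a poly-A stretch far away from the read end would not.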
</p></dd><dt><span class="term">
[clip_3ppolybase (c3pp)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd>
c3p* options to be described ...
</dd><dt><span class="term">
[clip_known_adaptorsright (ckar)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Defines
whether MIRA should search and clip known sequencing technology
specific sequencing adaptors. MIRA knows adaptors for Illumina
best, followed by Ion Torrent and some 454 adaptors.
</p><p>
As the list of known adaptors changes quite frequently, the
best place to get a list of known adaptors by MIRA is by
looking at the text files in the program
sources: <code class="filename">src/mira/adaptorsforclip.*.xxd</code>.
</p></dd><dt><span class="term">
[ensure_minimum_left_clip(emlc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. If on, ensures a
minimum left clip on each read according to the parameters in
[-CL:mlcr:smlc]
</p></dd><dt><span class="term">
[minimum_left_clip_required(mlcr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used. If [-CL:emlc] is
on, checks whether there is a left clip whose length is at least the one
specified here.
</p></dd><dt><span class="term">
[set_minimum_left_clip(smlc)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. If [-CL:emlc] is on
and the actual left clip is shorter than [-CL:mlcr], the left clip of the read is set to
the value given here.
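</p><p>
A minimal sketch of how the three left-clip parameters interact, as read from the
descriptions above. The function and argument names are made up for illustration and
are not MIRA code; the handling of the boundary case (clip exactly equal to mlcr) is
an assumption.
</p>

```python
def ensure_minimum_left_clip(left_clip, emlc, mlcr, smlc):
    """Return the left clip of a read after applying the emlc/mlcr/smlc rule.

    left_clip: current left clip of the read (in bases)
    emlc:      True if [-CL:emlc] is switched on
    mlcr:      minimum left clip required
    smlc:      value the left clip is set to when the requirement is not met
    """
    # Only act when the option is on and the existing clip is too short.
    if emlc and mlcr > left_clip:
        return smlc
    return left_clip
```

<p>
The right-clip parameters (emrc, mrcr, smrc) described next follow the same pattern.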
</p></dd><dt><span class="term">
[ensure_minimum_right_clip(emrc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. If on, ensures a
minimum right clip on each read according to the parameters in
[-CL:mrcr:smrc].
</p></dd><dt><span class="term">
[minimum_right_clip_required(mrcr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used. If [-CL:emrc] is
on, checks whether there is a right clip whose length is at least the one
specified here.
</p></dd><dt><span class="term">
[set_minimum_right_clip(smrc)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. If [-CL:emrc] is on
and the actual right clip is shorter than [-CL:mrcr], the length of the
right clip of the read is set to the value given here.
</p></dd><dt><span class="term">
[gb_chimeradetectionclip(gbcdc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for all jobs.
</p><p>
Very safe chimera detection, should have no false
positives. For repetitive data, a low number of false
negatives is possible.
</p></dd><dt><span class="term">
[kmerjunk_detection(kjd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">yes</span>.
</p><p>
Reads that look "fishy" are marked as potentially
chimeric. This mark leads either to a read being completely
killed or to a read being included into a contig only if no
other possibility remains.
</p><p>
It is currently suggested to leave this parameter switched on
and to fine-tune via [-CL:kjck] (see below).
</p></dd><dt><span class="term">
[kmerjunk_completekill(kjck)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">no</span>
for genome assemblies and <span class="underline">yes</span> for EST/RNASeq assemblies.
</p><p>
If set to yes, reads marked as junk (see above) are completely
removed from an assembly. If set to no, reads are not removed
but are included into a contig only as a very last resort.
</p><p>
Having reads killed guarantees assemblies of extremely high
quality containing virtually no misassembly due to chimeric
sequencing errors. The downside is that, computationally,
there is no difference between junk and stretches with correct
but very low coverage data (generally below 3x coverage). It's
up to you to decide what is more important: total accuracy or
longer contigs.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
As a rule of thumb: I set this to no for genome assemblies
with at least medium average coverage (≥ 20-30x) as MIRA
does a pretty good job of incorporating these reads so late in
an assembly that they do not lead to misassemblies. In
transcript assemblies I set this to yes as there is a high
chance that high coverage transcripts could be extended via
chimeric reads.
</p><p>
With this in mind: deciding for metagenome assemblies would
be really difficult though. It probably depends on what you
need the data for.
</p></td></tr></table></div></dd><dt><span class="term">
[propose_end_clips(pec)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on --job quality: currently <span class="underline">yes</span> for all genome assemblies.
Switched off for EST assemblies (but one might want to switch
it on sometimes).
</p><p>
This implements a pretty powerful strategy to ensure a good
"high confidence region" (HCR) in reads, basically eliminating
99.9% of all junk at the 5' and 3' ends of reads. Note that
one still must ensure that sequencing vectors (Sanger) or
adaptor sequences (454, Solexa, Ion Torrent) are "more or less"
clipped prior to assembly.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Extremely effective, but should NOT be used for very low
coverage genomic data, or for EST projects if one wants to
retain the rarest transcripts.
</td></tr></table></div></dd><dt><span class="term">
[handle_solexa_ggcxg_problem(pechsgp)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p><p>
Solexa data has a pretty awful problem in some reads when
a <code class="literal">GGCxG</code> motif occurs (read more about it in
the chapter on Solexa data). In short: the sequencing errors
produced by this problem lead to many false positive SNP
discoveries in mapping assemblies or problems in contig
building in de-novo assembly.
</p><p>
MIRA knows about this problem and can look for it in Solexa
reads during the proposed end clipping and further clip back
the reads, greatly minimising the impact of this problem.
</p></dd><dt><span class="term">
[pec_kmer_size(peckms)=<em class="replaceable"><code>10 ≤ integer ≤ 32</code></em>]
</span></dt><dd><p>
Default is dependent on technology and quality in the --job
switch: usually
between <span class="underline">17</span>
and <span class="underline">21</span> for Sanger,
higher for 454 (up to
<span class="underline">27</span>) and highest for
Solexa (<span class="underline">31</span>). Ion Torrent
has at the moment <span class="underline">17</span>,
but this may change in the future to somewhat higher values.
</p><p>
This parameter defines the minimum number of bases at each end
of a read that should be free of any sequencing errors.
</p></dd><dt><span class="term">
[pec_minimum_kmer_forward_reverse(pmkfr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on technology and quality in the --job
switch: usually
between <span class="underline">1</span>
and <span class="underline">3</span>
when [-CL:pec=yes].
</p><p>
This parameter defines the minimum number of occurrences of a
kmer at each end of a read that should be free of any
sequencing errors.
</p></dd><dt><span class="term">
[rare_kmer_mask(rkm)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on --job switch: currently
it's <span class="underline">yes</span> for Solexa data
and <span class="underline">no</span> otherwise. If
this parameter is active, MIRA will completely mask with 'X'
those parts of a read whose kmers occur (in forward
and reverse direction) fewer times than the value specified
via [-CL:pmkfr].
</p><p>
This is a quality-ensuring move which improves assembly of
ultra-high coverage contigs by cleaning out what are very likely
low-frequency, sequence-dependent sequencing errors which passed
all previous filters. The drawback is that very rare
transcripts or very lowly covered genome parts with an
occurrence less than the given value will also be masked
out. However, Illumina gives so much data that this is almost
never the case.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This works only if [-CL:pec] is active.
</td></tr></table></div></dd><dt><span class="term">
[search_phix174(spx174)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for Illumina data, off
otherwise.
</p><p>
PhiX 174 is a small phage of enterobacteria whose DNA is often
spiked-in during Illumina sequencing to determine error rates
in datasets and to increase complexity in low-complexity
samples (amplicon, ChIP-seq etc.) to help in cluster
identification.
</p><p>
If it remains in the sequenced data, it has to be
seen as a contaminant for projects working with organisms
which should not contain the PhiX 174 phage.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
However, PhiX may be part of some genome sequences
(enterobacteria). In these cases, the PhiX174 search will
report genuine genome data.
</td></tr></table></div></dd><dt><span class="term">
[filter_phix174(fpx174)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for Illumina data in
EST (RNASeq) assemblies, off otherwise.
</p><p>
If both [-CL:spx174] and [-CL:fpx174] are on,
MIRA will filter out as contaminants all reads in which
PhiX174 sequence is recognised.
</p><p>
The default of having the filtering on only for Illumina
EST (RNASeq) data is a conservative approach: the overwhelming
majority of RNASeq projects do not sequence
enterobacteria, so throwing out reads containing PhiX174
is a valid move. For genomes however, MIRA currently is
cautious and will not filter these reads by default.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
However, PhiX may be part of some genome sequences
(enterobacteria). In these cases, the PhiX174 filter will
remove reads from valid genome or expression data.
</td></tr></table></div></dd><dt><span class="term">
[filter_rrna(frrna)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for EST (RNASeq)
assemblies, off otherwise.
</p><p>
If enabled, MIRA will filter out (and not assemble) all reads
(or pairs, see below) it recognises as being rRNA or
rDNA. This is useful to reduce computing time on data sets
which contain large amounts of contaminating rRNA that was not
filtered away in the wet lab.
</p></dd><dt><span class="term">
[filter_rrna_pairs(frrnap)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for EST (RNASeq)
assemblies, off otherwise.
</p><p>
If enabled together with [-CL:frrna], MIRA will
filter out (and not assemble) all read pairs where at least
one of the reads is recognised as being rRNA or rDNA.
</p><p>
This option is useful to also catch less conserved parts of
rRNA transcripts such as the internal transcribed spacers
(ITS) in eukaryotic data.
</p></dd><dt><span class="term">
[filter_rrna_numkmers(frrnank)=<em class="replaceable"><code>integer > 0
</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>20</em></span>.
</p><p>
The rRNA recognition in MIRA works with a precompiled set of
conserved rRNA kmers, at the time of this writing with
21-mers. To allow for specific recognition, the rRNA filtering
process expects to find at least this number of kmers per read
before identifying it as rRNA.
</p><p>
To increase sensitivity (and at the same time risk more false
positives): reduce this parameter. To increase specificity
(and at the same time risk more reads not being recognised):
increase this parameter.
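</p><p>
The kmer-count threshold can be pictured with the sketch below. It is an illustration
only, not MIRA's implementation: the function names are hypothetical, the kmer set
stands in for the precompiled database, and counting every matching read position
(rather than distinct kmers) is an assumption.
</p>

```python
def count_rrna_kmer_hits(read, rrna_kmers, k=21):
    """Count the positions of `read` whose k-mer is contained in `rrna_kmers`."""
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in rrna_kmers)


def looks_like_rrna(read, rrna_kmers, frrnank=20, k=21):
    """True if the read reaches the number of rRNA k-mer hits given by frrnank."""
    return count_rrna_kmer_hits(read, rrna_kmers, k) >= frrnank


def pair_is_filtered(read1, read2, rrna_kmers, frrnank=20, k=21):
    """Pair rule from [-CL:frrnap]: drop the pair if at least one read is rRNA."""
    return (looks_like_rrna(read1, rrna_kmers, frrnank, k)
            or looks_like_rrna(read2, rrna_kmers, frrnank, k))
```

<p>
In this picture, lowering frrnank lets reads with fewer matching kmers through to
the "rRNA" side (more sensitive), raising it demands more evidence (more specific).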
</p><p>
The default parameters together with the default database seem
to work pretty well and this is expected to work for all but
the most exotic rRNA containing organisms.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_skim_sk"></a>3.4.4.10.
Parameter group: -SKIM (-SK)
</h4></div></div></div><p>
Options that control the behaviour of the initial fast all-against-all read
comparison algorithm. Matches found here will be confirmed later in the
alignment phase. The SKIM3 algorithm, which has been in place since version 2.7.4,
uses a kmer based approach that works similarly to SSAHA (see Ning Z, Cox AJ,
Mullikin JC; "SSAHA: a fast search method for large DNA databases."; Genome
Res. 2001;11;1725-9).
</p><p>
The major differences between SKIM3 and SSAHA are:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
the word length <span class="emphasis"><em>n</em></span> of a kmer (hash) in
SSAHA2 must be < 15, but can be up to 32 bases in 64 bit
versions of MIRA < 4.0.2 and lower, and up to 256 bases for
higher versions of MIRA.
</p></li><li class="listitem"><p>
SKIM3 uses a maximum fixed amount of RAM that is independent of
the word size. E.g., SSAHA would need 4 <span class="underline">exabyte</span> to work with word length of
30 bases ... SKIM3 just takes a couple of hundred MB.
</p></li></ol></div><p>
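The exabyte figure in the second point can be checked with a few lines of arithmetic.
This is a sketch under two assumptions that are not stated in this guide: the table
holds one entry for every possible DNA word of the given length, and each entry
takes 4 bytes.
</p>

```python
def direct_lookup_table_bytes(word_length, bytes_per_entry=4):
    """Size in bytes of a table with one entry per possible DNA word."""
    # Four possible bases per position give 4**word_length distinct words.
    return 4 ** word_length * bytes_per_entry


EXBIBYTE = 2 ** 60
print(direct_lookup_table_bytes(30) / EXBIBYTE)  # prints 4.0
```

<p>
Under the same assumptions a word length of 12 needs only 64 MiB, which shows how quickly
such a table grows with the word length.
</p><p>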
The parameters for SKIM3:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[number_of_threads(not)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p>
Number of threads used in SKIM, default is <span class="underline">2</span>. A few parts of SKIM are
non-threaded, so the speedup is not exactly linear, but it
should be very close. E.g., with 2 processors I get a speedup
of 180-195%, with 4 between 350 and 395%.
</p><p>
Although the main data structures are shared between the
threads, there's some additional memory needed for each
thread.
</p></dd><dt><span class="term">
[also_compute_reverse_complements(acrc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">on</span>. Defines
whether SKIM searches for matches only in forward/forward
direction or whether it also looks for forward/reverse
direction.
</p><p>
You usually will not want to touch the default, except for very
special application cases where you do not want MIRA to use
reverse complement sequences at all.
</p></dd><dt><span class="term">
[kmer_size(kms)=<em class="replaceable"><code>10 < integer ≤ 256</code></em>]
</span></dt><dd><p>
Defaults are dependent on "--job" switch and sequencing
technologies used.
</p><p>
Controls the number of consecutive bases
<span class="emphasis"><em>n</em></span> which are used as a kmer. The
higher the value, the faster the search. The lower the value,
the slower the search and the more weak matches are found.
</p><p>
A secondary effect of this parameter concerns MIRA's estimate
of whether stretches within a read sequence are repetitive or
not. Large values of [-SK:kms] allow a better
distinction between "almost identical" repeats early in the
assembly process and, given enough coverage, generally lead to
fewer and longer contigs.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This parameter gets overridden by the one-stop-shop parameter
[-AS:kms] which determines the number of passes and the kmer
size to use in each pass.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For de-novo assemblies, values below 15 are not
recommended. For mapping assemblies, values below 10 should
not be used.
</td></tr></table></div></dd><dt><span class="term">
[kmer_save_stepping(kss)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p> Default is
<span class="underline">1</span>. This is a parameter
controlling the stepping increment <span class="emphasis"><em>s</em></span> with which kmers are
generated. This allows for more or less fine grained search as
matches are found with at least <span class="emphasis"><em>n+s</em></span> (see [-SK:kms])
equal bases. The higher the value, the faster the search. The
lower the value, the more weak matches are found.
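</p><p>
To make the stepping increment concrete, here is a small sketch that lists the kmers
of length <span class="emphasis"><em>n</em></span> taken every
<span class="emphasis"><em>s</em></span> bases from a sequence. It only illustrates the
idea; the function name is hypothetical and how MIRA stores or compares these kmers
internally is not described here.
</p>

```python
def kmers_with_stepping(seq, n, s=1):
    """Return (position, kmer) pairs for kmers of length n taken every s bases."""
    # range() stops before the last position where a full kmer still fits.
    return [(i, seq[i:i + n]) for i in range(0, len(seq) - n + 1, s)]


if __name__ == "__main__":
    for pos, kmer in kmers_with_stepping("ACGTACGT", n=4, s=2):
        print(pos, kmer)
```

<p>
With a stepping of 1 every position yields a kmer; larger steppings save fewer kmers,
which is where the gain in speed and the loss of weak matches come from.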
</p></dd><dt><span class="term">
[percent_required(pr)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p> Default is dependent on the sequencing technology used
and the assembly quality wished for. Controls the relative percentage of
exact word matches in an approximate overlap that has to be
reached to accept this overlap as possible match. Increasing
this number will decrease the number of possible alignments that
have to be checked by Smith-Waterman later on in the assembly,
but it also might lead to the rejection of weaker overlaps (i.e.
overlaps that contain a higher number of mismatches).
</p><p>
Note: most of the time it makes sense to keep this parameter
in sync with [-AL:mrs].
</p></dd><dt><span class="term">
[maxhits_perread(mhpr)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p> Default is
<span class="underline">2000</span>. Controls the maximum
number of possible hits one read can carry over to the
overlap edge reduction phase. If more potential hits are found,
only the best ones are taken.
</p><p>
In the pre-2.9.x series, this was an important option for
tackling projects which contain <span class="emphasis"><em>extreme</em></span>
assembly conditions. It still is if you run out of memory in
the graph edge reduction phase. Try then to lower it to 1000,
500 or even 100.
</p><p>
As the assembly progresses through its passes ([-AS:nop]),
different combinations of possible hits will be checked,
always the probably best ones first. So the accuracy of the
assembly should only suffer when lowering this number too
much.
</p></dd><dt><span class="term">
[filter_megahubs(fmh)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">yes</span>. Defines whether megahubs (reads
with extremely many overlaps to other reads) are filtered.
See also [-SK:mhc:mmhr].
</p></dd><dt><span class="term">
[megahub_cap(mhc)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is <span class="underline">150000</span>. Defines the number of kmer
overlaps a read may have before it is categorised as megahub.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
You basically don't want to mess with this one, except for
assemblies containing very long reads. Rule of thumb: you
might want to multiply the 150k value by n, where n is the
average read length divided by 2000. Don't overdo it: cap n at 15
or so.
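<p>
The rule of thumb can be written down as a small helper. This is a sketch, not part of
MIRA: the function name is made up, and never going below the default (n of at least 1)
is an assumption made here because the rule is meant for long reads only.
</p>

```python
def suggested_megahub_cap(avg_read_length, base_cap=150000, max_factor=15):
    """Scale the default megahub cap for long reads, following the rule of thumb."""
    n = avg_read_length / 2000
    # Assumption: keep the default for short reads, and cap the factor at 15.
    n = max(1.0, min(n, max_factor))
    return int(base_cap * n)
```

<p>
For example, an average read length of 10000 bases gives n = 5 and a suggested
[-SK:mhc] of 750000.
</p>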
</td></tr></table></div></dd><dt><span class="term">
[max_megahub_ratio(mmhr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">0</span>. If the number of reads
identified as megahubs exceeds the allowed ratio, MIRA will
abort.
</p><p>
This is a fail-safe parameter to avoid assemblies where things
look fishy. In case you see this, you might want to ask for
advice on the mira_talk mailing list. In short: bacteria
should never have megahubs (90% of all cases reported were
contamination of some sort and the remaining 10% were due to incredibly
high coverage numbers). Eukaryotes are likely to contain
megahubs if filtering ([-KS:mnr]) is not on.
</p><p>
EST projects, however, especially from non-normalised libraries,
will very probably contain megahubs. In this case, you might
want to think about masking, see [-KS:mnr].
</p></dd><dt><span class="term">
[sw_check_on_backbones(swcob)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently (3.4.0) <span class="underline">yes</span> for accurate mapping
jobs. Takes effect only in mapping assemblies. Defines whether
SKIM hits against a backbone (reference) sequence with less
than 100% identity are double checked with Smith-Waterman to
improve mapping accuracy.
</p><p>
You will want to set this option to <span class="underline">yes</span> whenever your reference
sequence contains more complex or numerous repeats and your
data has SNPs in those areas.
</p></dd><dt><span class="term">
[max_kmers_in_memory(mkim)=<em class="replaceable"><code>integer ≥ 100000</code></em>]
</span></dt><dd><p> Default is
<span class="underline">15000000</span>. Has no influence
on the quality of the assembly, only on the maximum memory size
needed during the skimming. The default value is equivalent to
approximately 500MB.
</p><p>
Note: reducing the number will increase the run time, the more drastically
the bigger the reduction. On the other hand, increasing the default value
chosen will not result in speed improvements that are really noticeable. In
short: leave this number alone if you are not desperate to save a few MB.
</p></dd><dt><span class="term">
[memcap_hitreduction(mchr)=<em class="replaceable"><code>integer ≥ 10</code></em>]
</span></dt><dd><p> Default is
<span class="underline">1024</span>, <span class="underline">2048</span>
when Solexa sequences are used. Maximum memory used (in MiB)
during the reduction of skim hits.
</p><p>
Note: has no influence on the quality of the assembly,
reducing the number will increase the runtime, the more
drastically the bigger the reduction as hits then must be
streamed multiple times from disk.
</p><p>
The default is good enough for assembly of bacterial genomes
or small eukaryotes (using Sanger and/or 454 sequences). As
soon as you assemble something bigger than 20 megabases, you
should increase it to 2048 or 4096 (equivalent to 2 or 4 GiB
of memory).
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_hashstatistics_hs"></a>3.4.4.11.
Parameter group: -KMERSTATISTICS (-KS)
</h4></div></div></div><p>
Hash statistics (nowadays called kmer statistics in the literature
and in other software packages) allow a quick assessment of reads from a
coverage point of view without actually assembling the reads. MIRA
uses this as a quick pre-assembly evaluation to find and tag reads
which are from repetitive and non-repetitive parts of a project.
</p><p>
The length of the kmer is defined via [-SK:kms]
or [-AS:kms] while the parameters in this section define
the boundaries of the different repeat levels.
</p><p>
A more in-depth description of kmer statistics is given in the
sections <span class="emphasis"><em>Introduction to 'masking'</em></span>
and <span class="emphasis"><em>How does 'nasty repeat' masking work?</em></span> in
the chapter dealing with the assembly of hard projects.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[freq_est_minnormal(fenn)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring less than
[-KS:fenn] times the average occurrence will be tagged
with a HAF2 (less than average) tag.
</p></dd><dt><span class="term">
[freq_est_maxnormal(fexn)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fenn] but less than [-KS:fexn] times
the average occurrence will be tagged with a HAF3 (normal) tag.
</p></dd><dt><span class="term">
[freq_est_repeat(fer)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fexn] but less than [-KS:fer] times
the average occurrence will be tagged with a HAF4 (above average) tag.
</p></dd><dt><span class="term">
[freq_est_heavyrepeat(fehr)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fer] but less than [-KS:fehr] times
the average occurrence will be tagged with a HAF5 (repeat) tag.
</p></dd><dt><span class="term">
[freq_est_crazyrepeat(fecr)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fehr] but less than [-KS:fecr] times
the average occurrence will be tagged with a HAF6 (heavy
repeat) tag. Parts which are occurring more than
[-KS:fecr] but less than [-KS:nrr] times the
average occurrence will be tagged with a HAF7 (crazy repeat)
tag.
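</p><p>
Taken together, the frequency parameters above define a ladder of thresholds. The sketch
below maps an occurrence ratio (occurrence divided by the average occurrence) to a tag.
It is an illustration only: the function name is hypothetical, the threshold values in
the usage note are invented and are not MIRA defaults, the handling of values exactly on
a boundary is an assumption, and tagging everything above nrr as MNRr assumes that
[-KS:mnr] is active.
</p>

```python
import bisect

TAGS = ["HAF2", "HAF3", "HAF4", "HAF5", "HAF6", "HAF7", "MNRr"]


def frequency_tag(ratio, fenn, fexn, fer, fehr, fecr, nrr):
    """Return the tag for a stretch occurring `ratio` times the average occurrence."""
    limits = [fenn, fexn, fer, fehr, fecr, nrr]
    # bisect_right counts how many thresholds the ratio has reached or passed.
    return TAGS[bisect.bisect_right(limits, ratio)]
```

<p>
With invented thresholds fenn=0.5, fexn=1.5, fer=2, fehr=8, fecr=20 and nrr=100, a
stretch seen at the average rate (ratio 1.0) falls into the HAF3 band and one seen 50
times more often than average into the HAF7 band.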
</p></dd><dt><span class="term">
[mask_nasty_repeats(mnr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on --job
type: <span class="underline">yes</span> for
de-novo, <span class="underline">no</span> for mapping.
</p><p>
Tells MIRA to tag (during the kmer statistics phase) read
subsequences of length [-SK:kms] nucleotides that
appear more than X times more often than the median occurrence
of subsequences would otherwise suggest. The threshold X above
which subsequences are considered nasty is set by
[-KS:nrr] or [-KS:nrc]; the action MIRA
should take when encountering those sequences is defined
by [-KS:ldn] (see below).
</p><p>
When not using lossless digital normalisation
([-KS:ldn]), the tag used by MIRA will be "MNRr"
which stands for "Mask Nasty Repeat in read". This tag has an
active masking function in MIRA and the fast all-against-all
overlap searcher (SKIM) will then completely ignore the tagged
subsequences of reads. There's one drawback though: the
shorter the reads are that you try to assemble, the higher the
probability that your reads will not span nasty repeats
completely, which leads to contig building being aborted
at this site. Reads completely covered by the MNRr tag will
land in the debris file as no overlap will be found.
</p><p>
This option is extremely useful for assembly of larger
projects (fungi-size) with a high percentage of repeats.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Although it is expected that bacteria will not really need
this, leaving it turned on will probably not harm except in
unusual cases like several copies of (pro-)phages integrated
in a genome.
</td></tr></table></div></dd><dt><span class="term">
[nasty_repeat_ratio(nrr)=<em class="replaceable"><code>integer ≥ 2</code></em>]
</span></dt><dd><p>
Default depends on the [--job=...]
parameters. Normally it's high (around 100) for genome
assemblies, but much lower (20 or less) for EST assemblies.
</p><p>
Sets the ratio above which subsequences are considered nasty
and hidden from the kmer statistics overlapper with a
<span class="emphasis"><em>MNRr</em></span> tag. E.g.: A value of 10 means: mask all
k-mers of [-SK:kms] length which occur more
than 10 times more often than the average of the whole project.
</p></dd><dt><span class="term">
[nasty_repeat_coverage(nrc)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default depends on the [--job=...]
parameters: <span class="underline">0</span> for genome
assemblies, <span class="underline">200</span> for EST assemblies.
</p><p>
Closely related to the [-KS:nrr] parameter (see
above), but while the above works on ratios derived from a
calculated average, this parameter allows setting an absolute
value. Note that this parameter will take precedence
over [-KS:nrr] if the calculated value of nrr is
larger than the absolute value given here. A value of 0
de-activates this parameter.
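</p><p>
The precedence between the two parameters can be sketched as follows. The function name
is hypothetical and the code is a reading of the description above, not MIRA's
implementation; in particular, "the calculated value of nrr" is taken here to mean
nrr multiplied by the average occurrence.
</p>

```python
def nasty_repeat_threshold(average_occurrence, nrr, nrc=0):
    """Occurrence count above which a kmer would be considered nasty."""
    by_ratio = nrr * average_occurrence
    # nrc == 0 switches the absolute cap off; otherwise the smaller value wins.
    if nrc > 0 and by_ratio > nrc:
        return nrc
    return by_ratio
```

<p>
For instance, with an average occurrence of 50, nrr=20 and nrc=200, the ratio alone would
give 1000, so the absolute value of 200 takes over.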
</p></dd><dt><span class="term">
[lossless_digital_normalisation(ldn)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on --job
type: <span class="underline">yes</span> for denovo
EST/RNAseq assembly, <span class="underline">no</span>
otherwise.
</p><p>
Tells MIRA whether or not to digitally normalise reads containing nasty repeats
when [-KS:mnr] is active.
</p><p>
When set to <span class="emphasis"><em>yes</em></span>, MIRA will apply a
modified digital normalisation step to the reads, effectively
decreasing the coverage of a given repetitive stretch down to
a minimum needed to correctly represent one copy of the
repeat. However, contrary to the published method, MIRA will
keep enough reads of repetitive regions to also correctly
reconstruct slightly different variants of the repeats present
in the genome or EST / RNASeq data set, even if they differ in
only a single base.
</p><p>
The tag used by MIRA to denote stretches which may have
contributed to the digital normalisation will be
"DGNr". Additionally, contigs which contain reads completely
covered by a DGNr tag will get an additional "_dn" as part of
their name to show that they contain read representatives for
digital normalisation. E.g.: "contig_dn_c1".
</p><p>
This option is extremely useful for non-normalised EST /
RNASeq projects, to get at least the sequence of
overrepresented transcripts assembled even if the coverage
values then cannot be interpreted as expression values
anymore.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The lossless digital normalisation will be applied as soon as
the kmer size of the active pass (see [-AS:kms])
reaches a size of at least 50 or, at the latest, in the second
to last pass.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Once digital normalisation has been applied, the
parameters [-KS:nrr] and [-KS:nrc] do not
take effect anymore.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
The effect of lossless digital normalisation on genome data
has not been studied sufficiently by me to approve it for
genomes. Use with care in genome assemblies.
</td></tr></table></div></dd><dt><span class="term">
[repeatlevel_in_infofile(rliif)=<em class="replaceable"><code>integer; 0, 5-8</code></em>]
</span></dt><dd><p>
Default is <span class="underline">6</span>. Sets the
minimum HAF tag level starting from which MIRA will report
tentatively repetitive sequence in the
<code class="filename">*_info_readrepeats.lst</code> file of the info
directory.
</p><p>
A value of <span class="underline">0</span> means
"switched off". The default value of <span class="underline">6</span> means all subsequences tagged
with <span class="emphasis"><em>HAF6</em></span>, <span class="emphasis"><em>HAF7</em></span> and
<span class="emphasis"><em>MNRr</em></span> will be logged. If you, e.g., only
wanted MNRr logged, you'd use <span class="underline">8</span> as parameter value.
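</p><p>
The mapping from parameter value to logged tags can be sketched as
follows. This is only an illustration in Python, not MIRA code: the
function name is hypothetical, and treating HAF5 as level 5 and MNRr
as level 8 is an assumption derived from the allowed range (0, 5-8)
and the examples above.
</p><pre class="programlisting">
```python
# Assumed level numbering: HAF5..HAF7 are levels 5..7, MNRr counts as level 8.
LEVEL_TO_TAG = {5: "HAF5", 6: "HAF6", 7: "HAF7", 8: "MNRr"}


def tags_reported(rliif):
    """Return the tags that would be logged for a given rliif value."""
    if rliif == 0:
        return []  # 0 means "switched off"
    if rliif not in LEVEL_TO_TAG:
        raise ValueError("rliif must be 0 or in the range 5-8")
    # Every tag at or above the requested level is logged.
    return [tag for level, tag in sorted(LEVEL_TO_TAG.items()) if level >= rliif]


print(tags_reported(6))
print(tags_reported(8))
```
</pre><p>
With the default of 6 the sketch lists HAF6, HAF7 and MNRr, with 8
only MNRr remains.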
</p><p>
See also [-KS:fenn:fexn:fer:fehr:mnr:nrr] to set the
different levels for the <span class="emphasis"><em>HAF</em></span> and
<span class="emphasis"><em>MNRr</em></span> tags.
</p></dd><dt><span class="term">
[memory_to_use(mtu)=<em class="replaceable"><code>integer</code></em>]
</span></dt><dd><p>
Default is <span class="underline">75</span>. Defines
the memory MIRA can use to compute kmer statistics.
</p><p>
A value of <span class="underline">>100</span> is
interpreted as an absolute value in megabytes. E.g., 16384 = 16384
megabytes = 16 gigabytes.
</p><p>
A value of <span class="underline">0 ≤ x ≤100</span> is
interpreted as a percentage of the memory that is free at the time of
computation. E.g.: for a value of 75 and 10 gigabytes of free
memory, it will use 7.5 gigabytes.
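</p><p>
The two interpretations of the value can be written down as a small
calculation. The following Python sketch is an illustration only:
the function name is hypothetical, and applying the minimum from the
note below as a simple lower bound to both cases is an assumption.
</p><pre class="programlisting">
```python
def kmer_memory_budget_mb(mtu, free_mb, is_64bit=True):
    """Return the memory budget in megabytes for an mtu-style value."""
    if mtu < 0:
        raise ValueError("mtu must not be negative")
    # Assumed lower bound: 512 MiB on 32-bit, 2 GiB on 64-bit systems.
    floor_mb = 2048.0 if is_64bit else 512.0
    if mtu > 100:
        requested_mb = float(mtu)            # absolute value in megabytes
    else:
        requested_mb = free_mb * mtu / 100   # percentage of free memory
    return max(requested_mb, floor_mb)


print(kmer_memory_budget_mb(75, 10 * 1024))     # 75% of 10 gigabytes free
print(kmer_memory_budget_mb(16384, 10 * 1024))  # absolute: 16 gigabytes
```
</pre><p>
The first call corresponds to the 7.5 gigabyte example above (7680
megabytes), the second to the absolute 16 gigabyte example.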
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The minimum amount of memory this algorithm will use is 512 MiB
on 32-bit systems and 2 GiB on 64-bit systems.
</td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_align_al"></a>3.4.4.12.
Parameter group: -ALIGN (-AL)
</h4></div></div></div><p>
The align options control the behaviour of the Smith-Waterman alignment
routines. Only read pairs which are confirmed here may be included in
contigs. These options affect both the checking of possible alignments found
by SKIM and the phase in which reads are integrated into a contig.
</p><p>
Every option in this section can be set individually for every sequencing
technology, giving a very fine grained control on how reads are aligned for
each technology.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[bandwidth_in_percent(bip)=<em class="replaceable"><code>integer > 0 and ≤100</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used. The banded Smith-Waterman
alignment uses this percentage to compute the bandwidth it has to use
when computing the alignment matrix. E.g., expected overlap is 150 bases,
bip=10 -> the banded SW will compute a band of 15 bases to each side of
the expected alignment diagonal, thus allowing up to 15 unbalanced inserts /
deletes in the alignment. INCREASING AND DECREASING THIS NUMBER:
<span class="emphasis"><em>increase</em></span>: will find more non-optimal alignments, but will also
increase SW runtime, somewhere between linearly and quadratically. <span class="emphasis"><em>decrease</em></span>: the other
way round, might miss a few bad alignments but gains speed.
</p></dd><dt><span class="term">
[bandwidth_min(bmin)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is dependent on the
sequencing technology used. Minimum bandwidth in bases to each side.
</p></dd><dt><span class="term">
[bandwidth_max(bmax)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is dependent on the
sequencing technology used. Maximum bandwidth in bases to each side.
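</p><p>
How [-AL:bip], [-AL:bmin] and [-AL:bmax] might interact can be
sketched in a few lines of Python. This is an illustration under
assumptions, not the MIRA implementation: the function name is
hypothetical, the percentage is rounded down, and the minimum and
maximum are applied as a simple clamp. The bmin and bmax values in
the calls are made up for the example.
</p><pre class="programlisting">
```python
def band_halfwidth(expected_overlap, bip, bmin, bmax):
    """Bases to each side of the expected alignment diagonal."""
    raw = expected_overlap * bip // 100   # percentage of the expected overlap
    return max(bmin, min(raw, bmax))      # keep the result within [bmin, bmax]


print(band_halfwidth(150, 10, bmin=5, bmax=100))    # the bip=10 example
print(band_halfwidth(50, 10, bmin=10, bmax=100))    # short overlap, bmin applies
print(band_halfwidth(2000, 20, bmin=5, bmax=130))   # long overlap, bmax applies
```
</pre><p>
The first call reproduces the example given for bip: 150 bases of
expected overlap and bip=10 give a band of 15 bases to each side.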
</p></dd><dt><span class="term">
[min_overlap(mo)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is dependent on the
sequencing technology used. Minimum number of overlapping bases an
alignment of two sequences needs in order to be accepted.
</p></dd><dt><span class="term">
[min_score(ms)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is dependent on the
sequencing technology used. Describes the minimum score an overlap needs to be
taken into account for assembly. MIRA uses a default scoring scheme
for SW alignments: each match counts 1, a match with an N counts 0, each mismatch
with a non-N base counts -1 and each gap counts -2. Use a higher score to weed out a
number of chance matches, or a lower score to perhaps find the single (short)
alignment that might join two contigs together (at the expense of computing
time and memory).
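</p><p>
The scoring scheme can be illustrated by scoring an already aligned
pair of sequences. The Python sketch below is only an illustration:
the function name is hypothetical, and using "-" as the gap character
is an assumption made for the example.
</p><pre class="programlisting">
```python
def sw_score(aligned_a, aligned_b, gap="-"):
    """Score two aligned sequences of equal length with the default scheme."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            score -= 2      # each gap counts -2
        elif x == "N" or y == "N":
            score += 0      # a match with an N counts 0
        elif x == y:
            score += 1      # each match counts 1
        else:
            score -= 1      # each mismatch with a non-N base counts -1
    return score


print(sw_score("ACGTNACGT", "ACGTAAC-T"))
```
</pre><p>
In this example, seven matches, one N column and one gap give a score
of 5, which would pass a minimum score of 5 but not one of 6.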
</p></dd><dt><span class="term">
[min_relative_score(mrs)=<em class="replaceable"><code>integer > 0 and ≤100</code></em>]
</span></dt><dd><p> Default is dependent on the sequencing technology
used. Describes the minimum percentage of matching between two reads for them to be
considered for assembly. Increasing this number will save
memory, but one might lose possible alignments. I propose a
maximum of 80 here. Decreasing it below 55 will probably make memory and
time consumption explode.
</p><p>
Note: most of the time it makes sense to keep this parameter
in sync with
[-SK:pr].
</p></dd><dt><span class="term">
[solexa_hack_max_errors(shme)=<em class="replaceable"><code>integer > -1</code></em>]
</span></dt><dd><p>
Currently a hack just for Solexa/Illumina data.
</p><p>
When running in mapping mode, this defines the maximum number
of mismatches and gaps a read may have compared to the
reference to be allowed to map. The result is usually a much
better mapping in areas with larger discrepancies between
reference sequence and mapped data. Note that the mapping
process takes longer if this value is unequal to 0 as MIRA
will use iterative mapping which involves a certain amount of
trial and error.
</p><p>
The default value of <span class="underline">-1</span>
lets MIRA choose this value automatically. It sets it to 15%
of the average Illumina read lengths loaded.
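</p><p>
As a worked example of the automatic setting, the following Python
sketch derives the value from a list of read lengths. It is an
illustration only: the function name is hypothetical and rounding
down to a whole number is an assumption.
</p><pre class="programlisting">
```python
def effective_max_errors(shme, read_lengths):
    """Resolve a shme-style value; -1 means 15% of the average read length."""
    if shme != -1:
        return shme   # explicit value, 0 switches the functionality off
    # 15% of the average length, in integer arithmetic and rounded down.
    return sum(read_lengths) * 15 // (100 * len(read_lengths))


print(effective_max_errors(-1, [100, 100, 100]))
print(effective_max_errors(-1, [150, 250]))
```
</pre><p>
For reads of 100 bases this gives 15 allowed mismatches and gaps, for
an average read length of 200 bases it gives 30.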
</p><p>
A value of <span class="underline">0</span> switches off
this functionality, leading to a much faster mapping
process. Useful when mapping expression data where coverage
values may be more important than the best possible alignment.
</p></dd><dt><span class="term">
[extra_gap_penalty(egp)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology
used. Defines whether or not to increase penalties applied to
alignments containing long gaps. Setting this to 'yes' might
help in projects with frequent repeats. On the other hand, it
is definitely disturbing when assembling very long reads
containing multiple long indels in the called base sequence
... although this should not happen in the first place and is
a sure sign of problems lying ahead.
</p><p>
When in doubt, set it
to <span class="underline">yes</span> for EST projects
and de-novo genome assembly, set it
to <span class="underline">no</span> for assembly of
closely related strains (assembly against a backbone).
</p><p>
When set to <span class="underline">no</span>, it is
recommended to have [-CO:amgb]
and [-CO:amgbemc] both set to yes.
</p></dd><dt><span class="term">
[egp_level(egpl)=<em class="replaceable"><code>comma separated list of integer ≥ 0 and ≤ 100</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and job
used. Has no effect if extra_gap_penalty is off.
</p><p>
...
</p></dd><dt><span class="term">
[egp_level(megpp)=<em class="replaceable"><code>0 ≤ integer ≤ 100</code></em>]
</span></dt><dd><p> Default is
<span class="underline">100</span>. Has no effect if
extra_gap_penalty is off. Defines the maximum extra penalty in
percent applied to 'long' gaps.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_contig_co"></a>3.4.4.13.
Parameter group: -CONTIG (-CO)
</h4></div></div></div><p>
The contig options control the behaviour of the contig objects.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[name_prefix(np)=<em class="replaceable"><code>string</code></em>]
</span></dt><dd><p>
Default is
<span class="underline"><projectname></span>. Contigs
will have this string prepended to their names. Normally,
the [project=] line in the manifest will set this.
</p></dd><dt><span class="term">
[reject_on_drop_in_relscore(rodirs)=<em class="replaceable"><code>integer ≥ 0 and ≤100</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used.
</p><p>
When adding reads to a contig, reject the reads if the drop in
the minimum relative score of the alignment of the current
consensus and the new read is > the expected value
calculated during the alignment phase. Lower values mean
stricter checking.
</p><p>
This value is doubled should a read be entered that has an
assembled template partner (a read pair) at the right distance
in the current contig.
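</p><p>
One way to read this check is sketched below in Python. This is an
interpretation, not MIRA code: the function name is hypothetical, and
it is assumed that the "drop" is the difference between the relative
score expected from the alignment phase and the relative score
observed against the current consensus.
</p><pre class="programlisting">
```python
def accept_read(expected_relscore, observed_relscore, rodirs,
                partner_in_contig=False):
    """Decide whether a read passes the drop-in-relative-score check."""
    # The tolerance is doubled if a template partner sits at the right distance.
    allowed_drop = 2 * rodirs if partner_in_contig else rodirs
    drop = expected_relscore - observed_relscore
    return drop <= allowed_drop


print(accept_read(95, 88, rodirs=5))
print(accept_read(95, 88, rodirs=5, partner_in_contig=True))
```
</pre><p>
In this sketch, a read expected at 95% which aligns at only 88% is
rejected with a value of 5, but accepted when a correctly placed
partner doubles the tolerance to 10.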
</p></dd><dt><span class="term">
[cmin_relative_score(cmrs)=<em class="replaceable"><code>integer ≥ -1 and ≤100</code></em>]
</span></dt><dd><p>
Default is <span class="underline">-1</span>. Works
similarly to [-AL:mrs], but during contig
construction phase instead of read vs read alignment phase:
describes the min % of matching between a read being added to
a contig and the current contig consensus.
</p><p>
If value is set to -1, then the value of [-AL:mrs] is used.
</p><p>
Note: most of the time it makes sense to keep this parameter
at -1. Else have it at
approximately <span class="emphasis"><em>[-AL:mrs]-10</em></span> or
switch it completely off via 0.
</p></dd><dt><span class="term">
[mark_repeats(mr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span>. One of the most important switches in MIRA: if set to
<span class="underline">yes</span>, MIRA will try to resolve misassemblies due to repeats by
identifying single base stretch differences and tag those critical bases as
RMB (Repeat Marker Base, weak or strong). This switch is also needed when
MIRA is run in EST mode to identify possible inter-, intra- and
intra-and-interorganism SNPs.
</p></dd><dt><span class="term">
[only_in_result(mroir)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>. Only
takes effect when [-CO:mr] (see above) is set
to <span class="underline">yes</span>. If set
to <span class="underline">yes</span>, MIRA will not use
the repeat resolving algorithm during build time (and therefore
will not be able to take advantage of this), but only before
saving results to disk.
</p><p>
This switch is useful in some (rare) cases of mapping assembly.
</p></dd><dt><span class="term">
[assume_snp_instead_repeat(asir)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>.
Only takes effect when [-CO:mr] (see above) is set to
<span class="underline">yes</span>. The effect also
depends on whether strain data (see
[-SB:lsd]) is present or not. Usually, MIRA will mark
bases that differentiate between repeats when a conflict occurs
between reads that belong to one strain. If the conflict occurs
between reads belonging to different strains, they are marked as
SNP. However, if this switch is set
to <span class="underline">yes</span>, conflicts within a
strain are also marked as SNP.
</p><p>
This switch is mainly used in assemblies of ESTs, it should
not be set for genomic assembly.
</p></dd><dt><span class="term">
[min_reads_per_group(mrpg)=<em class="replaceable"><code>integer ≥ 2</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. Only takes effect when
[-CO:mr] (see above) is set
to <span class="underline">yes</span>. This defines the
minimum number of reads in a group that are needed for the RMB
(Repeat Marker Bases) or SNP detection routines to be
triggered. A group is defined by the reads carrying the same
nucleotide for a given position, i.e., an assembly with mrpg=2
will need at least two times two reads with the same nucleotide
(having at least a quality as defined in [-CO:mgqrt])
to be recognised as repeat marker or a SNP. Setting this to a
low number increases sensitivity, but might produce a few false
positives, resulting in reads being thrown out of contigs
because of falsely identified possible repeat markers (or
wrongly recognised as SNP).
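</p><p>
The group definition can be illustrated on a single alignment column.
The Python sketch below is a simplification and not MIRA code: the
function name is hypothetical, and the quality requirement from
[-CO:mgqrt] is deliberately left out.
</p><pre class="programlisting">
```python
from collections import Counter


def column_triggers_detection(column_bases, mrpg):
    """True if at least two nucleotide groups each contain mrpg or more reads."""
    counts = Counter(base.upper() for base in column_bases)
    large_groups = [base for base, n in counts.items() if n >= mrpg]
    return len(large_groups) >= 2


print(column_triggers_detection("AAAGG", mrpg=2))  # groups: 3 x A, 2 x G
print(column_triggers_detection("AAAAG", mrpg=2))  # only one read carries G
```
</pre><p>
With mrpg=2, a single discrepant read is never enough to trigger the
routines in this sketch; at least two reads must agree on the
alternative base.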
</p></dd><dt><span class="term">
[min_neighbour_qual (mnq)=<em class="replaceable"><code>integer ≥
10</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Only
takes effect when [-CO:mr] is set
to <span class="underline">yes</span>. This defines the
minimum quality of neighbouring bases that a base must have
for being taken into consideration during the decision whether
column base mismatches are relevant or not.
</p></dd><dt><span class="term">
[min_groupqual_for_rmb_tagging(mgqrt)=<em class="replaceable"><code>integer ≥ 25</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Only
takes effect when [-CO:mr] is set
to <span class="underline">yes</span>. This defines the
minimum quality of a group of bases to be taken into account
as potential repeat marker. The lower the number, the more
sensitive you get, but lowering below 25 is not recommended as
a lot of wrongly called bases can have a quality approaching
this value and you'd end up with a lot of false positives. The
higher the overall coverage of your project, the better, and
the higher you can set this number. A value of 35 will
probably remove most false positives, a value of 40 will
probably never show false positives ... but will generate a
sizable number of false negatives.
</p></dd><dt><span class="term">
[min_coverage_percentage(mcp)=<em class="replaceable"><code>0 < integer ≤ 100</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">10</span>. Used to reduce the number of
IUPAC bases due to non-random PCR artefacts or sequencing
errors in very high coverage areas (e.g. Illumina ≥
80). Once the most probable base has been determined,
[-CO:mcp] defines the minimum percentage (calculated
from the most probable base) the coverage of alternative bases
must have to be considered for consensus. E.g.: with mcp=10
and the most probable base having a coverage of 200x, other
bases must have a coverage of at least 20x.
</p><p>
Drawback is that valid low frequency variants will not show up
anymore as IUPAC in the FASTA.
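</p><p>
The threshold calculation can be sketched as follows. The Python
snippet is an illustration only: the function name is hypothetical,
the most probable base is simply taken to be the one with the highest
coverage, and a coverage exactly at the threshold is assumed to pass.
</p><pre class="programlisting">
```python
def alternative_bases_kept(coverage, mcp=10):
    """Return the alternative bases whose coverage reaches mcp percent of the top base."""
    top = max(coverage, key=coverage.get)      # stand-in for "most probable base"
    threshold = coverage[top] * mcp / 100
    return sorted(base for base, cov in coverage.items()
                  if base != top and cov >= threshold)


print(alternative_bases_kept({"A": 200, "G": 25, "T": 19, "C": 20}, mcp=10))
```
</pre><p>
With a top coverage of 200x and mcp=10 the threshold is 20x, so in
this example G (25x) and C (20x) are still considered while T (19x)
is dropped.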
</p></dd><dt><span class="term">
[endread_mark_exclusion_area(emea)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent on the sequencing technology
used. Only takes effect when [-CO:mr] is set to
<span class="underline">yes</span>. Using the end of
sequences of Sanger type shotgun sequencing is always a bit
risky, as wrongly called bases tend to crowd there or some
sequencing vector relics hang around. It is even more risky to
use these stretches for detecting possible repeats, so one can
define an exclusion area where the bases are not used when
determining whether a mismatch is due to repeats or not.
</p></dd><dt><span class="term">
[emea_set1_on_clipping_pec(emeas1clpec)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. When
[-CL:pec] is set, the end-read exclusion area can be
considerably reduced. Setting this parameter will
automatically do this.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Although the parameter is named "set to 1", it may be that the
exclusion area is actually a bit larger (2 to 4), depending on
what users will report back as "best" option.
</td></tr></table></div></dd><dt><span class="term">
[also_mark_gap_bases(amgb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology
used. Determines whether columns containing gap bases (indels)
are also tagged.
</p><p>
Note: it is strongly recommended to not set this to 'yes' for
454 type data.
</p></dd><dt><span class="term">
[also_mark_gap_bases_even_multicolumn(amgbemc)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">yes</span>.
Takes effect only when [-CO:amgb] is set to
<span class="underline">yes</span>. Determines whether multiple columns containing gap bases
(indels) are also tagged.
</p></dd><dt><span class="term">
[also_mark_gap_bases_need_both_strands(amgbnbs)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">yes</span>. Takes effect only when
[-CO:amgb] is set to <span class="underline">yes</span>. Determines whether, for
tagging columns containing gap bases, both strands need to have a gap.
Setting this to <span class="underline">no</span> is not recommended except when working in
desperately low coverage situations.
</p></dd><dt><span class="term">
[force_nonIUPACconsensus_perseqtype(fnic)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span> for
de-novo genome assemblies, yes for all others. If set to
<span class="underline">yes</span>, MIRA will be forced
to make a choice for a consensus base (A,C,G,T or gap) even in
unclear cases where it would normally put an IUPAC base. All
other things being equal (like quality of the possible
consensus base and other things), MIRA will choose a base by
either looking for a majority vote or, if that also is not
clear, by preferring gaps over T over G over C over finally A.
</p><p>
MIRA makes a considerable effort to deduce the right base at
each position of an assembly. Only when cases become
borderline will it use an IUPAC code to make you aware of
potential problems. It
is <span class="bold"><strong>suggested</strong></span> to leave this
option to <span class="underline">no</span> as IUPAC
bases in the consensus are a sign that - if you need 100%
reliability - you really should have a look at this particular
place to resolve potential problems. You might want to set
this parameter to yes in the following cases: 1) when the
tools that use your assembly results cannot handle IUPAC bases and
you don't care about being absolutely perfect in your data (by
looking over them manually). 2) when you assemble data without
any quality values (which you should not do anyway), then this
method will allow you to get a result without IUPAC bases that
is "good enough" with respect to the fact that you did not
have quality values.
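</p><p>
The forced choice described at the beginning of this entry (majority
vote first, then gap over T over G over C over A) can be sketched in
Python. This is a simplification and not MIRA code: the function name
is hypothetical, base qualities are ignored, only read counts are
compared, and "*" as the gap symbol is an assumption.
</p><pre class="programlisting">
```python
# Preference order for ties: gap first, then T, G, C and finally A.
PREFERENCE = ["*", "T", "G", "C", "A"]


def forced_consensus(counts):
    """Pick one of A, C, G, T or gap from per-base read counts."""
    best = max(counts.values())
    tied = [base for base, n in counts.items() if n == best]
    if len(tied) == 1:
        return tied[0]                        # clear majority vote
    return min(tied, key=PREFERENCE.index)    # tie: use the preference order


print(forced_consensus({"A": 5, "G": 3}))
print(forced_consensus({"A": 4, "G": 4}))
```
</pre><p>
The first call is decided by majority (A), the second is a tie that
the preference order resolves in favour of G.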
</p></dd><dt><span class="term">
[merge_short_reads(msr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for all
Solexas when in a mapping assembly, else it's <span class="underline">no</span>. Can only be used in mapping
assemblies. If set to <span class="underline">yes</span>, MIRA will merge all perfectly
mapping Solexa reads into longer reads (Coverage Equivalent
Reads, CERs) while keeping quality and coverage information
intact.
</p><p>
This feature hugely reduces the number of Solexa reads and
makes assembly results with Solexa data small enough to be
handled by current finishing programs (gap4, consed, others)
on normal workstations.
</p></dd><dt><span class="term">
[msr_keepcontigendsunmerged(msrme)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">0</span> for all
Solexas when in a mapping assembly. Takes only effect in
mapping assemblies if [-CO:msr=yes].
</p><p>
Defines how many "errors" (i.e. differences) a read may have
to be merged into a coverage equivalent read. Useful only when
one does not need SNP information from an assembly but wants
to concentrate either on coverage data or on paired-end
information at contig ends.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
This feature allows non-perfect reads to be merged, which makes
most SNP information simply disappear from the alignment. Use
with care!
</td></tr></table></div></dd><dt><span class="term">
[msr_keepcontigendsunmerged(msrkceu)=<em class="replaceable"><code>-1, integer > 0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">-1</span> for all
Solexas when in a mapping assembly. Takes only effect in
mapping assemblies if [-CO:msr=yes] and for reads
which have a paired-end / mate-pair partner actively used in
the assembly.
</p><p>
If set to a value > 0, MIRA will not merge paired-end /
mate-pair reads if they map within the given distance of a
contig end of the original reference sequence
(backbone). Instead of a fixed value, one can also use
-1. MIRA will then automatically not merge reads if the
distance from the contig end is within the maximum size of the
template insert size of the sequencing library for that read
(either given via [-GE:tismax] or via XML TRACEINFO
for the given read).
</p><p>
This feature makes it possible to use the data reduction from
[-CO:msr] while keeping the result of such a mapping
useful for subsequent scaffolding programs that order
contigs.
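</p><p>
The distance rule can be sketched in Python as follows. This is an
illustration under assumptions, not MIRA code: the function and
argument names are hypothetical, positions are 0-based offsets on the
reference, and a read lying exactly at the given distance is treated
as still being "within" it.
</p><pre class="programlisting">
```python
def may_merge(read_start, read_len, contig_len, msrkceu, max_insert_size,
              has_partner=True):
    """True if a perfectly mapping read may be merged into a CER."""
    if not has_partner:
        return True   # the rule only concerns paired-end / mate-pair reads
    # -1: use the maximum template insert size of the read's library.
    distance = max_insert_size if msrkceu == -1 else msrkceu
    to_left_end = read_start
    to_right_end = contig_len - (read_start + read_len)
    return min(to_left_end, to_right_end) > distance


print(may_merge(100, 100, 10000, msrkceu=-1, max_insert_size=500))
print(may_merge(5000, 100, 10000, msrkceu=-1, max_insert_size=500))
```
</pre><p>
In this sketch the first read starts 100 bases from the contig end
and stays unmerged, the second one lies in the middle of the
reference and may be merged.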
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_edit_ed"></a>3.4.4.14.
Parameter group: -EDIT (-ED)
</h4></div></div></div><p>
General options for controlling the integrated automatic editor. The editors
generally do a good job of cleaning up alignments from typical sequencing
errors (like base overcalls etc.). However, they may prove tricky in
certain situations:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
in EST assemblies, they may edit rare transcripts toward almost
identical, more abundant transcripts. Usage must be carefully weighed.
</p></li><li class="listitem"><p>
the editors will not only change bases, but also sometimes delete or
insert non-gap bases as needed to improve an alignment when facts (trace
signals or other) show that this is what should have been the
sequence. However, this can make post processing of assembly results pretty
difficult with some formats like ACE, where the format itself contains no
way to specify certain edits like deletion. There's nothing one can do about
it and the only way to get around this problem is to use file formats with
more complete specifications like CAF, MAF (and BAF once supported by MIRA).
</p></li></ul></div><p>
</p><p>
The following edit parameters are supported:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[_mira_automatic_contig_editing(mace)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. When set
to yes, MIRA will use its own built-in automatic
contig editors (see parameters below) to improve alignments.
</p></dd><dt><span class="term">
[edit_kmer_singlets(eks)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for all
sequencing technologies, but only takes effect
if [-ED:mace] is on (see above).
</p><p>
When set to yes, MIRA uses the alignment information of a
complete contig at places with sequencing errors which lead to
unique kmers and corrects the error according to the alignment.
</p><p>
This is an extremely conservative yet very effective editing
strategy and can therefore be kept always activated.
</p></dd><dt><span class="term">
[edit_homopolymer_overcalls(ehpo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for 454
and Ion Torrent, but only takes effect if [-ED:mace]
is on (see above).
</p><p>
When set to yes, MIRA uses the alignment information of a
complete contig at places with potential homopolymer
sequencing errors and corrects the error according to the
alignment.
</p><p>
This editor should be switched on only for sequencing
technologies with known homopolymer sequencing problems. That
is: currently only 454 and Ion.
</p></dd><dt><span class="term">
[edit_automatic_contig_editing(eace)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. When set
to yes, MIRA will use built-in versions of the "EdIt"
automatic contig editor (see parameters below) to correct
sequencing errors in Sanger reads.
</p><p>
EdIt will try to resolve discrepancies in the contig by
performing trace signal analysis and correct even hard to resolve
errors.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The current development version has a memory leak in
this editor, therefore the option cannot be turned
on.
</td></tr></table></div></dd><dt><span class="term">
[strict_editing_mode(sem)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Only for
Sanger data. If set to yes, the automatic editor will not take
error hypotheses with a low probability into account, even if
all the requirements to make an edit are fulfilled.
</p></dd><dt><span class="term">
[confirmation_threshold(ct)=<em class="replaceable"><code>integer, 0 < x ≤ 100</code></em>]
</span></dt><dd><p>
Default is <span class="underline">50</span>. Only for
Sanger data. The higher this value, the more strict the
automatic editor will apply its internal rule set. Going below
40 is not recommended.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_misc_mi"></a>3.4.4.15.
Parameter group: -MISC (-MI)
</h4></div></div></div><p>
Options which would not fit elsewhere.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[iknowwhatido(ikwid)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. This
switch tells MIRA that you know what you are doing in some
situations and forces it not to stop when it thinks something is
really wrong, but simply to continue.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
You generally should not set this flag except in cases
where MIRA stopped and the warning / error message told you to
get around that very specific problem by setting this flag.
</td></tr></table></div></dd><dt><span class="term">
[large_contig_size(lcs)=<em class="replaceable"><code>integer >
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">500</span>. This
parameter has absolutely no influence whatsoever on the
assembly process of MIRA. It is only used in the reporting within
the <code class="filename">*_assembly_info.txt</code> file after the
assembly where MIRA reports statistics on
<span class="emphasis"><em>large</em></span> contigs and
<span class="emphasis"><em>all</em></span> contigs. [-MI:lcs] is the
threshold value for dividing the contigs into these two
categories.
</p></dd><dt><span class="term">
[large_contig_size_for_stats(lcs4s)=<em class="replaceable"><code>integer >
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">5000</span> for
[--job=genome] and <span class="underline">1000</span> for [--job=est].
</p><p>
This parameter is used for internal statistics calculations
and has a subtle influence when being in a
[--job=genome] assembly mode.
</p><p>
MIRA uses coverage information of an assembly project to find
out about potentially repetitive areas in reads (and thus, a
genome). To calculate statistics which are reflecting the
approximate truth regarding the average coverage of a genome,
the "large contig size for stats" value of
[-MI:lcs4s] is used as a cutoff threshold: contigs
smaller than this value do not contribute to the calculation
of average coverage while contigs larger or equal to this
value do.
</p><p>
This reflects two facts: on the one hand - especially with
short read sequencing technologies and in projects without
read pair libraries - contigs containing predominantly
repetitive sequences are of a relatively small size. On the
other hand, reads which could not be placed into contigs
(maybe due to a sequencing technology dependent motif error)
often enough form small contigs with extremely low
coverage.
</p><p>
It should be clear that one does not want any of the above
when calculating average coverage statistics and having this
cutoff discards small contigs which tend to muddy the
picture. If in doubt, don't touch this parameter.
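</p><p>
The effect of the cutoff can be shown with a small calculation. The
Python sketch below is an illustration and not MIRA code: the
function name is hypothetical, contigs are given as (length, average
coverage) pairs, and weighting each contig by its length is an
assumption.
</p><pre class="programlisting">
```python
def average_coverage(contigs, lcs4s=5000):
    """Length-weighted average coverage over contigs of at least lcs4s bases."""
    large = [(length, cov) for length, cov in contigs if length >= lcs4s]
    if not large:
        return None   # no contig reaches the cutoff
    total_length = sum(length for length, _ in large)
    return sum(length * cov for length, cov in large) / total_length


contigs = [
    (100000, 30.0),   # large contig, typical coverage
    (50000, 30.0),    # large contig, typical coverage
    (800, 95.0),      # small, predominantly repetitive contig
    (300, 2.0),       # small contig of badly placed reads
]
print(average_coverage(contigs))
```
</pre><p>
The two small contigs with atypical coverage fall below the cutoff,
so the result stays at the 30x of the large contigs.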
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_misc_nw"></a>3.4.4.16.
Parameter group: -NAG_AND_WARN (-NW)
</h4></div></div></div><p>
Parameters which let MIRA warn you about unusual things or potential
problems. The flags in this parameter section come in three
flavours: <span class="emphasis"><em>stop</em></span>, <span class="emphasis"><em>warn</em></span> and
<span class="emphasis"><em>no</em></span> which let MIRA either stop, give a warning
or do nothing if a specific problem is detected.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[check_nfs(cnfs)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check whether the tmp directory is located on an NFS
mount.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
You should never ever at all run MIRA on an NFS mounted
directory ... or face the fact that the assembly process
may very well take 5 to 10 times longer (or more) than
normal. You have been warned.
</p><p>
The reason for the slowdown is the same as why one should
never run a BLAST search on a big database located on
an NFS volume: access via network is terribly slow when
compared to local disks, at least if you have not invested a
lot of money into specialised solutions.
</p></td></tr></table></div></dd><dt><span class="term">
[check_duplicate_readnames(cdrn)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check for duplicate read names after loading.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
Duplicate read names usually hint at a serious problem with
your input and should really, really be fixed. You can
choose to ignore this error by switching off this flag, but
this will almost certainly lead to problems with result
files (ACE and CAF for sure, maybe also SAM) and probably to
other unexpected effects.
</p></td></tr></table></div></dd><dt><span class="term">
[check_template_problems(ctp)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check read template naming after loading.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
Problems in read template naming point to problems with read
names or to broken template information. You should try to
find the cause of the problem instead of ignoring this error
message.
</p></td></tr></table></div></dd><dt><span class="term">
[check_maxreadnamelength(cmrnl)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check whether the length of the names of your reads
surpasses the given number of characters (see [-NW:mrnl]).
</p><p>
While MIRA and many other programs have no problem with long read names,
some older programs have restrictions concerning the length of
the read name. For example, the pipeline <code class="literal">CAF ->
caf2gap -> gap2caf</code> will stop working at
the <span class="command"><strong>gap2caf</strong></span> stage if there are read names
longer than 40 characters which differ only beyond the 40th
character.
</p><p>
This should be a warning only, but as a couple of people were
bitten by this, the default behaviour of MIRA is to stop when
it sees that potential problem. You might want to rename your
reads to have ≤ 40 characters.
</p><p>
On the other hand, you also can ignore this potential problem
and force MIRA to continue by using the parameter:
[-NW:cmrnl=warn] or [-NW:cmrnl=no]
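</p><p>
As a minimal sketch, and assuming the parameters are passed on the
<code class="literal">parameters</code> line of a MIRA manifest file,
downgrading this check to a warning could look like this:
</p><pre class="screen">
# manifest excerpt (sketch): long read names only trigger a warning
parameters = -NW:cmrnl=warn
</pre><p>
This is only advisable if none of the downstream tools you plan to use
has a restriction on read name length.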
</p></dd><dt><span class="term">
[maxreadnamelength(mrnl)=<em class="replaceable"><code>integer ≥
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">40</span>. This
defines the effective check length for [-NW:cmrnl].
</p></dd><dt><span class="term">
[check_average_coverage(cac)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. In
genome de-novo assemblies, MIRA will perform checks early in
the assembly process whether the average coverage to be
expected exceeds a given value (see [-NW:acv]).
</p><p>
With today's sequencing technologies (especially Illumina, but
also Ion Torrent and 454), many people simply take everything
they get and throw it into an assembly. Which, in the case of
Illumina and Ion, can mean they try to assemble their organism
with a coverage of 100x, 200x and more (I've seen trials with
more than 1000x).
</p><p>
This is not good. Not. At. All! For two reasons (well, three
to be precise).
</p><p>
The first reason is that, usually, one does not sequence a
single cell but a population of cells. If this population is
not clonal (i.e., it contains subpopulations with genomic
differences with each other), assemblers will be able to pick
up these differences in the DNA once a certain sequence count
is reached and they will try to reconstruct a genome containing
all clonal variations, treating these variations as potential
repeats with slightly different sequences. Which, of course,
will be wrong and I am pretty sure you do not want that.
</p><p>
The second and way more important reason is that none of the
current sequencing technologies is completely error free. Even
more problematic, they contain both random and non-random
sequencing errors. Especially the latter can become a big
hurdle if these non-random errors are so prevalent that they
suddenly appear to be valid sequence to an assembler. This in
turn leads to false repeat detection, hence possibly contig
breaks or even wrong consensus sequence. You don't want that,
do you?
</p><p>
The last reason is that overlap based assemblers (like MIRA
is) need <span class="emphasis"><em>exponentially</em></span> more time and
memory when the coverage increases. So keeping the coverage
comparatively low helps you there.
</p></dd><dt><span class="term">
[average_coverage_value(acv)=<em class="replaceable"><code>integer ≥
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">80</span> for
de-novo assemblies, in mapping assemblies it is 120 for Ion
Torrent and 160 for Illumina data (might change in the
future). This defines the effective coverage to check for in
[-NW:cac].
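</p><p>
As a rough back-of-the-envelope illustration (the numbers are
hypothetical, and MIRA's internal estimate may well be computed
differently), the expected average coverage of a data set can be
approximated like this:
</p><pre class="screen">
expected coverage = (number of reads x average read length) / genome size
                  = (4,000,000 x 150 bases) / 5,000,000 bases
                  = 120
</pre><p>
With the de-novo default of 80, such a data set would trigger the
check. Using roughly two thirds of the reads would bring the estimate
back to about 80x.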
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_directory_dir_di"></a>3.4.4.17.
Parameter group: -DIRECTORY (-DIR, -DI)
</h4></div></div></div><p>
General options for controlling where to find or where to write data.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[tmp_redirected_to(trt)=<em class="replaceable"><code><directoryname></code></em>]
</span></dt><dd><p>
Default is an empty string. When set to a non-empty string,
MIRA will create the MIRA-temporary directory at the given
location instead of using the current working directory.
</p><p>
This option is particularly useful for systems which have
solid state disks (SSDs) and some very fast disk subsystems
which can be used for temporary files. Or in projects where
the input and output files reside on a NFS mounted directory
(current working dir), to put the tmp directory somewhere
outside the NFS (see also: Things you should not do).
</p><p>
In both cases above, and for larger projects, MIRA then runs
a lot faster.
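</p><p>
A minimal manifest sketch, assuming a hypothetical local scratch
directory <code class="filename">/scratch/mira_tmp</code> and that
parameters are given on the <code class="literal">parameters</code>
line of the manifest:
</p><pre class="screen">
# manifest excerpt (sketch): keep the MIRA tmp directory on a local disk
# /scratch/mira_tmp is a placeholder, adapt it to your system
parameters = -DI:trt=/scratch/mira_tmp
</pre><p>
Only the location of the tmp directory is affected by this setting;
the project itself can stay in the current working directory.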
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Prior to MIRA 4.0rc2, users had to make sure themselves that
the target directory did not already exist. MIRA now handles
this automatically by creating directory names with a random
substring attached.
</td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_output_out"></a>3.4.4.18.
Parameter group: -OUTPUT (-OUT)
</h4></div></div></div><p>
Options for controlling which results to write to which type of files.
Additionally, a few options allow output customisation of textual
alignments (in text and HTML files).
</p><p>
There are three types of results: results, temporary results and extra
temporary results. One probably needs only the results. Temporary
and extra temporary results are written while building different
stages of a contig and are given as convenience for trying to find
out why MIRA set some RMBs or disassembled some contigs.
</p><p>
Output can be generated in these formats: CAF, Gap4 Directed
Assembly, FASTA, ACE, TCS, WIG, HTML and simple text.
</p><p>
Naming conventions of the files follow the rules described in
section <span class="bold"><strong>Input / Output</strong></span>, subsection
<span class="bold"><strong>Filenames</strong></span>.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[savesimplesingletsinproject(sssip)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. Controls
whether 'unimportant' singlets are written to the result
files.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Note that a value larger than 1 for the [-AS:mrpc]
parameter will disable the function of this parameter.
</td></tr></table></div></dd><dt><span class="term">
[savetaggedsingletsinproject(stsip)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">yes</span>. Controls whether
singlets which have certain tags (see below) are written to
the result files, even if [-OUT:sssip] (see above) is
set.
</p><p>
If one of the (SRMr, CRMr, WRMr, SROr, SAOr, SIOr) tags
appears in a singlet, MIRA will see that the singlets had been
part of a larger alignment in earlier passes and even was part
of a potentially 'important' decision. To give the possibility
to human finishers to trace back the decision, these singlets
can be written to result files.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Note that a value larger than 1 for the [-AS:mrpc]
parameter will disable the function of this parameter.
</td></tr></table></div></dd><dt><span class="term">
[remove_rollover_tmps(rrot)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">yes</span>. Removes log and
temporary files once they should not be needed anymore during
the assembly process.
</p></dd><dt><span class="term">
[remove_tmp_directory(rtd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>. Removes the
complete tmp directory at the end of the assembly process. Some
logs and temporary files contain useful information that you may
want to analyse though, therefore the default of MIRA is not to
delete it.
</p></dd><dt><span class="term">
[output_result_caf(orc)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_maf(orm)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_gap4da(org)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
If set to <span class="underline">yes</span>, MIRA will
automatically switch back
to <span class="underline">no</span> (and cannot be
forced to 'yes') when 454 or Solexa reads are present in the
project as this ensures that the file system does not get
flooded with millions of files.
</td></tr></table></div></dd><dt><span class="term">
[output_result_fasta(orf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_ace(ora)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
ACE is the least suited file format for NGS data. Use it only
when absolutely necessary.
</td></tr></table></div></dd><dt><span class="term">
[output_result_txt(ort)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_result_tcs(ors)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_html(orh)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_caf(otc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_maf(otm)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_gap4da(otg)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_fasta(otf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_ace(ota)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_txt(ott)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_tcs(ots)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_html(oth)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_caf(oetc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_gap4da(oetg)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_fasta(oetf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_ace(oeta)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_txt(oett)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_html(oeth)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[text_chars_per_line(tcpl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">60</span>. When producing an output in text format
( [-OUT:ort|ott|oett]), this parameter defines how many bases
each line of an alignment should contain.
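</p><p>
For illustration, a manifest excerpt (a sketch, assuming the usual
<code class="literal">parameters</code> line) that switches on the text
result file and widens the alignments to 80 bases per line:
</p><pre class="screen">
# manifest excerpt (sketch): text result output, 80 bases per alignment line
parameters = -OUT:ort=yes -OUT:tcpl=80
</pre><p>
The value 80 is an arbitrary example; any integer larger than 0 is
accepted according to the parameter description above.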
</p></dd><dt><span class="term">
[html_chars_per_line(hcpl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">60</span>. When producing an output in HTML format
( [-OUT:orh|oth|oeth]), this parameter defines how many bases
each line of an alignment should contain.
</p></dd><dt><span class="term">
[text_endgap_fillchar(tegfc)=<em class="replaceable"><code><single character></code></em>]
</span></dt><dd><p> Default
is <span class="underline"> </span> (a blank). When producing an output in text format
( [-OUT:ort|ott|oett]), endgaps are filled up with this
character.
</p></dd><dt><span class="term">
[html_endgap_fillchar(hegfc)=<em class="replaceable"><code><single character></code></em>]
</span></dt><dd><p> Default
is <span class="underline"> </span> (a blank). When producing an output in HTML format
( [-OUT:orh|oth|oeth]), end-gaps are filled up with this
character.
</p></dd></dl></div><p>
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_resuming_assemblies"></a>3.5.
Resuming / restarting assemblies
</h2></div></div></div><p>
It may happen that a MIRA run is interrupted - sometimes rather harshly
- due to events more or less outside your control like, e.g., power
failures, machine shutdowns for maintenance, missing disk space,
run-time quotas etc. This may be less of a problem when assembling or
mapping small data sets with run times between a couple of minutes up to
a few hours, but becomes a nuisance for larger data sets like in small
eukaryotes or RNASeq samples where the run time is measured in days.
</p><p>
If this happens in de-novo assemblies, MIRA has
a <span class="emphasis"><em>resume</em></span> functionality: at predefined points in the
assembly process, MIRA writes out special files to disk which enable it
to resume the assembly at the point where these files were
written. Starting MIRA in resume mode is pretty easy: simply add the
resume flag [-r] on a command line like this:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>mira -r ...</code></strong></pre><p>
where the ellipsis ("...") above stands for the rest of the command line you would have used to start a new assembly.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_input_output"></a>3.6.
Input / Output
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_directories"></a>3.6.1.
Directories
</h3></div></div></div><p>
Since version 3.0.0, MIRA puts all files and directories it
generates into one sub-directory which is named
<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly</code>. This directory contains up to four
sub-directories:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>: this directory contains all the
output files of the assembly in different formats.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>: this directory contains information
files of the final assembly. They provide statistics as well as, e.g.,
information (easily parsable by scripts) on which read is found in which
contig etc.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_tmp</code>:
this directory contains tmp files and temporary assembly files. It
can be safely removed after an assembly, as there may easily be a
few GB of data in there that are normally not needed anymore.
</p><p>
In case of problems: please do not delete. I will get in touch
with you for additional information that might possibly be present
in the tmp directory.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_chkpt</code>: this directory
contains checkpoint files needed to resume assemblies that crashed
or were stopped.
</p></li></ul></div><p>
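For a hypothetical project named <code class="literal">myproject</code>,
the resulting layout would look like this sketch:
</p><pre class="screen">
myproject_assembly/
    myproject_d_results/
    myproject_d_info/
    myproject_d_tmp/
    myproject_d_chkpt/
</pre><p>
Not all four sub-directories need to be present in every project.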
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_filenames"></a>3.6.2.
Filenames
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_output"></a>3.6.2.1.
Output
</h4></div></div></div><p>
These result output files and sub-directories are placed in the
<em class="replaceable"><code>projectname</code></em>_d_results directory after a run of MIRA.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.<type></code>
</span></dt><dd><p> Assembled project written in type =
(<span class="emphasis"><em>maf</em></span> / <span class="emphasis"><em>gap4da</em></span> / <span class="emphasis"><em>caf</em></span> /
<span class="emphasis"><em>ace</em></span> / <span class="emphasis"><em>fasta</em></span> /
<span class="emphasis"><em>html</em></span> / <span class="emphasis"><em>tcs</em></span> /
<span class="emphasis"><em>wig</em></span> / <span class="emphasis"><em>text</em></span>) format by
MIRA, final result.
</p><p>
Type <span class="emphasis"><em>gap4da</em></span> is a directory containing
experiment files and a file of filenames (called 'fofn'), all
other types are files. <span class="emphasis"><em>gap4da</em></span>,
<span class="emphasis"><em>caf</em></span>, <span class="emphasis"><em>ace</em></span> contain the
complete assembly information suitable for import into
different post-processing tools (gap4, consed and
others). <span class="emphasis"><em>html</em></span> and
<span class="emphasis"><em>text</em></span> contain visual representations of
the assembly suited for viewing in browsers or as simple text
file. <span class="emphasis"><em>tcs</em></span> is a summary of a contig suited
for "quick" analysis from command-line tools or even visual
inspection. <span class="emphasis"><em>wig</em></span> is a file containing
coverage information (useful for mapping assemblies) which can
be loaded and shown by different genome browsers (IGB, GMOD,
UCSC and probably many more).
</p><p>
<span class="emphasis"><em>fasta</em></span> contains the contig consensus
sequences (and .fasta.qual the consensus qualities). Please
note that they come in two flavours:
<span class="underline">padded</span>
and <span class="underline">unpadded</span>. The padded
versions may contain stars (*) denoting gap base positions
where there was some minor evidence for additional bases, but
not strong enough to be considered as a real base. Unpadded
versions have these gaps removed. Padded versions have an
additional postfix <span class="emphasis"><em>.padded</em></span>, while
unpadded versions <span class="emphasis"><em>.unpadded</em></span>.
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_LargeContigs_out.<type></code>
</span></dt><dd>
These files are only written when MIRA runs in
<span class="emphasis"><em>de-novo</em></span> mode. They usually contain a subset
of contigs deemed 'large' from the whole project. More details
are given in the chapter "working with results of MIRA."
</dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_assembly_statistics_and_information_files"></a>3.6.2.2.
Assembly statistics and information files
</h4></div></div></div><p>
These information files are placed in the
<em class="replaceable"><code>projectname</code></em>_d_info directory after a run of
MIRA.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_assembly.txt</code>
</span></dt><dd><p>
This file contains basic information about the
assembly. MIRA will split the information in two
parts: information about <span class="emphasis"><em>large</em></span>
contigs and information about all contigs.
</p><p>
For more information on how to interpret this file,
please consult the chapter on "Results" of the MIRA
documentation manual.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
In contrast to other information files, this file
always appears in the "info" directory, even when just
intermediate results are reported.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigreadlist.txt</code>
</span></dt><dd><p> This file contains information on which reads have been
assembled into which contigs (or singlets).
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigstats.txt</code>
</span></dt><dd><p> This file contains statistics about the contigs
themselves, their length, average consensus quality, number of
reads, maximum and average coverage, average read length, number
of A, C, G, T, N, X and gaps in consensus.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For contigs containing digitally normalised reads, the coverage numbers may sometimes seem strange. E.g.: a contig may contain only one read, but have an average coverage of 3. This means that the read was a representative for 3 reads. The coverage numbers are computed as if all 3 reads had been assembled instead of the representative. In EST/RNASeq projects, these numbers thus represent the (more or less) true expression coverage.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_consensustaglist.txt</code>
</span></dt><dd><p> This file contains
information about the tags (and their position) that are present in the
consensus of a contig.
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_largecontigs.txt</code>
</span></dt><dd><p>For de-novo assemblies, this file contains the names of the
contigs which pass the (adaptable) 'large contig' criterion.
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readrepeats.lst</code>
</span></dt><dd><p>
Tab delimited file with three columns: read name, repeat level tag, sequence.
</p><p>
This file permits a quick analysis of the repetitiveness of
different parts of reads in a project. See
[-SK:rliif] to control from which repetitive level on
subsequences of reads are written to this file.
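</p><p>
Because the file is tab delimited, it is easy to summarise with
standard command line tools. The following sketch (the function name
<code class="literal">count_repeat_tags</code> and the file name in the
usage comment are hypothetical) counts how many entries each repeat
level tag has:
</p><pre class="screen">
# Tally the repeat level tags (column 2) of a readrepeats file.
# Usage: count_repeat_tags myproject_info_readrepeats.lst
count_repeat_tags() {
    cut -f 2 "$1" | sort | uniq -c
}
</pre><p>
Each output line shows a count followed by the tag it belongs to.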
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Reads can have more than one entry in this file. E.g., with
standard settings (<code class="literal">-SK:rliif=6</code>) if the
start of a read is covered by MNRr, followed by a HAF3 region
and finally the read ends with HAF6, then there will be two
lines in the file: one for the subsequence covered by MNRr,
one for HAF6.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readstooshort</code>
</span></dt><dd><p> A list containing the
names of those reads that have been sorted out of the assembly before any
processing started only due to the fact that they were too short.
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readtaglist.txt</code>
</span></dt><dd><p> This file contains
information about the tags and their position that are present in each
read. The read positions are given relative to the forward direction of the
sequence (i.e. as it was entered into the assembly).
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_WARNINGS_*.txt</code>
</span></dt><dd><p>
These files collect warning messages MIRA dumped out
throughout the assembly process. These warnings cover a wide
area of things monitored by MIRA and can - together with the
output written to STDOUT - give an insight as to why an
assembly does not behave as expected. There are three warning
files representing different levels of
criticality: <span class="emphasis"><em>critical</em></span>, <span class="emphasis"><em>medium</em></span>
and <span class="emphasis"><em>minor</em></span>. These files may be empty,
meaning that no warning of the corresponding level was
printed. It is strongly suggested to have a look at least at
critical warnings during and after an assembly run.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
These files are quite new to MIRA and not all warning messages
appear there yet. This will come over time.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_error_reads_invalid</code>
</span></dt><dd><p> A list of sequences that
have been found to be invalid due to various reasons (given in the output of
the assembler).
</p></dd></dl></div><p>
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_file_formats"></a>3.6.3.
File formats
</h3></div></div></div><p>
MIRA can write almost all of the following formats and can read most
of them.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="filename">ACE</code>
</span></dt><dd><p> An old assembly file format used mainly by phrap and
consed. Support for .ace output is currently only in test status in
MIRA as documentation on that format is ... sparse and I currently
don't have access to consed to verify my assumptions.
</p><p> Using consed, you will need to load projects with -nophd to
view them. Tags (in reads and consensus) are fully supported. The
only hitch: consed has a bug which prevents it from reading consensus
tags which are located throughout the whole file (as MIRA writes
per default). The solution to that is easy: filter the ACE file
through the fixACE4consed.tcl script which is provided in older
MIRA distributions (V4.9.5 and before), then all should be well.
</p><p> If you don't have consed, you might want to try clview
(<a class="ulink" href="http://www.tigr.org/tdb/tgi/software/" target="_top">http://www.tigr.org/tdb/tgi/software/</a>) from TIGR
to look at .ace files.
</p></dd><dt><span class="term">
<code class="filename">BAM</code>
</span></dt><dd>
The binary cousin of the SAM format. MIRA neither reads nor writes
BAM, but BAMs can be created out of SAMs (which can be created via
<span class="command"><strong>miraconvert</strong></span>).
</dd><dt><span class="term">
<code class="filename">CAF</code>
</span></dt><dd><p> Common Assembly Format (CAF) developed by the Sanger
Centre. <a class="ulink" href="http://www.sanger.ac.uk/resources/software/caf.html" target="_top">http://www.sanger.ac.uk/resources/software/caf.html</a> provides a
description of the format and some software documentation as well as the
source for compiling caf2gap and gap2caf (thanks to Rob Davies
for this).
</p></dd><dt><span class="term">
<code class="filename">EXP</code>
</span></dt><dd><p> Standard experiment files used in genome
sequencing. Correct EXP files are expected. Especially the ID
record (containing the id of the reading) and the LN record
(containing the name of the corresponding trace file) should be
correctly set. See <a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for links to
online format description.
</p></dd><dt><span class="term">
<code class="filename">FASTA</code>
</span></dt><dd><p> A simple format for sequence data, see
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/BLAST/fasta.html" target="_top">http://www.ncbi.nlm.nih.gov/BLAST/fasta.html</a>. A
frequently used extension of that format stores quality
values in a similar fashion; these files have a .fasta.qual
ending.
</p><p>
MIRA writes two kinds of FASTA files for
results: <span class="emphasis"><em>padded</em></span> and
<span class="emphasis"><em>unpadded</em></span>. The difference is that the padded
version still contains the gap (pad) character (an asterisk) at
positions in the consensus where some of the reads apparently
had more bases than others but where the consensus routines
decided to treat them as artifacts. The
<span class="emphasis"><em>unpadded</em></span> version has the gaps removed.
</p></dd><dt><span class="term">
<code class="filename">GBF, GBK</code>
</span></dt><dd><p> GenBank file format as used at the NCBI to describe
sequences. MIRA is able to read and write this format (but only
for viruses or bacteria) for using sequences as backbones in an
assembly. Features of the GenBank format are also transferred
automatically to Staden compatible tags.
</p><p>
If possible, use GFF3 instead (see below).
</p></dd><dt><span class="term">
<code class="filename">GFF3</code>
</span></dt><dd><p> General feature format used to describe sequences and
features on these sequences. MIRA is able to read and write this
format.
</p></dd><dt><span class="term">
<code class="filename">HTML</code>
</span></dt><dd><p> Hypertext Markup Language. Projects written in HTML format
can be viewed directly with any table capable browser. Display is even
better if the browser knows style sheets (CSS).
</p></dd><dt><span class="term">
<code class="filename">MAF</code>
</span></dt><dd><p> MIRA Assembly Format (MAF). A faster and more compact form
than EXP, CAF or ACE. See documentation in separate file.
</p></dd><dt><span class="term">
<code class="filename">PHD</code>
</span></dt><dd><p> This file type originates from the phred base caller
and contains basically -- along with some other status information -- the
base sequence, the base quality values and the peak indices, but not the
sequence traces themselves.
</p></dd><dt><span class="term">
<code class="filename">SAM</code>
</span></dt><dd><p> The Sequence Alignment/Map Format. MIRA does not write SAM
directly, but <span class="command"><strong>miraconvert</strong></span> can be used for
converting a MAF (or CAF) file to SAM.
</p><p>
MIRA cannot read SAM though.
</p></dd><dt><span class="term">
<code class="filename">SCF</code>
</span></dt><dd><p> The Staden trace file format that has established itself as
compact standard replacement for the much bigger ABI files. See
<a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for
links to online format description.
</p><p>
The SCF files should be V2-8bit, V2-16bit, V3-8bit or V3-16bit
and can be packed with compress or gzip.
</p></dd><dt><span class="term">
<code class="filename">traceinfo.XML</code>
</span></dt><dd><p> XML based file with information relating to
traces. Used at the NCBI and ENSEMBL trace archive to store additional
information (like clippings, insert sizes etc.) for projects. See further
down for a description of the fields used and
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc</a> for a full description of all fields.
</p></dd><dt><span class="term">
<code class="filename">TCS</code>
</span></dt><dd><p> Transpose Contig Summary. A text file as written by MIRA
which gives a summary of a contig in tabular fashion, one line per
base. Nicely suited for "quick" analysis from command line tools,
scripts, or even visual inspection in file viewers or spreadsheet
programs.
</p><p> In the current file version (TCS 1.0), each column is
separated by at least one space from the next. Vertical bars are
inserted as visual delimiter to help inspection by eye. The
following columns are written into the file:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
contig name (width 20)
</p></li><li class="listitem"><p>
padded position in contigs (width 3)
</p></li><li class="listitem"><p>
unpadded position in contigs (width 3)
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
called consensus base
</p></li><li class="listitem"><p>
quality of called consensus base (0-100), but MIRA itself caps at 90.
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
total coverage in number of reads. This number can be higher than the
sum of the next five columns if Ns or IUPAC bases are present in the
sequence of reads.
</p></li><li class="listitem"><p>
coverage of reads having an "A"
</p></li><li class="listitem"><p>
coverage of reads having a "C"
</p></li><li class="listitem"><p>
coverage of reads having a "G"
</p></li><li class="listitem"><p>
coverage of reads having a "T"
</p></li><li class="listitem"><p>
coverage of reads having an "*" (a gap)
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
quality of "A" or "--" if none
</p></li><li class="listitem"><p>
quality of "C" or "--" if none
</p></li><li class="listitem"><p>
quality of "G" or "--" if none
</p></li><li class="listitem"><p>
quality of "T" or "--" if none
</p></li><li class="listitem"><p>
quality of "*" (gap) or "--" if none
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
Status. This field sums up the evaluation of MIRA whether you should
have a look at this base or not. The content can be one of the following:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
everything OK: a colon (:)
</p></li><li class="listitem"><p>
unclear base calling (IUPAC base): a "!M"
</p></li><li class="listitem"><p>
potentially problematic base calling involving a gap or low quality: a "!m"
</p></li><li class="listitem"><p>
consensus tag(s) of MIRA that hint to problems: a "!$". Currently,
the following tags will lead to this marker: SRMc, WRMc, DGPc, UNSc,
IUPc.
</p></li></ul></div></li><li class="listitem"><p>
list of consensus tags at that position; tags are delimited by a
space. E.g.: "DGPc H454"
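</p><p>
The column layout above can be read back with a few lines of Python. This is only a minimal sketch, not MIRA code: it assumes that a data line contains the four vertical bars described above and that contig names contain neither spaces nor vertical bars. The names <code class="literal">TcsRow</code> and <code class="literal">parse_tcs_line</code> are hypothetical, and the sample line was invented for illustration and not produced by MIRA. Header or comment lines, if a file has them, are not handled here.
</p>

```python
from typing import Dict, List, NamedTuple, Optional


class TcsRow(NamedTuple):
    contig: str
    padded_pos: int
    unpadded_pos: int
    base: str
    base_qual: int
    total_cov: int
    cov: Dict[str, int]             # coverage per symbol: A, C, G, T, *
    qual: Dict[str, Optional[int]]  # None where the file shows "--"
    status: str
    tags: List[str]


SYMBOLS = ("A", "C", "G", "T", "*")

# Invented sample line (an assumption, not real MIRA output).
SAMPLE = "demo_c1        17     16 | A 45 |  12   6   0   0   0   6 | 45 -- -- -- 40 | !$ DGPc H454"


def parse_tcs_line(line: str) -> TcsRow:
    """Split one TCS data line at the vertical bars, then at whitespace."""
    groups = [group.split() for group in line.split("|")]
    if len(groups) < 5:
        raise ValueError("expected at least four vertical bars")
    contig, padded, unpadded = groups[0]
    base, base_qual = groups[1]
    total_cov, *per_symbol = (int(x) for x in groups[2])
    quals = [None if q == "--" else int(q) for q in groups[3]]
    # Everything after the last expected bar: status first, then tags.
    tail = [token for group in groups[4:] for token in group]
    return TcsRow(
        contig=contig,
        padded_pos=int(padded),
        unpadded_pos=int(unpadded),
        base=base,
        base_qual=int(base_qual),
        total_cov=total_cov,
        cov=dict(zip(SYMBOLS, per_symbol)),
        qual=dict(zip(SYMBOLS, quals)),
        status=tail[0] if tail else "",
        tags=tail[1:],
    )


if __name__ == "__main__":
    row = parse_tcs_line(SAMPLE)
    # Rows whose status is not a colon are the ones worth a second look.
    if row.status != ":":
        print(row.contig, row.padded_pos, row.base, row.status, row.tags)
```

<p>
Filtering on the status column this way gives a quick list of the positions that MIRA suggests looking at.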
</p></li></ol></div></dd></dl></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_stdout_stderr"></a>3.6.4.
STDOUT/STDERR
</h3></div></div></div><p>
The current stage of the assembly is written to STDOUT, with status messages
on what MIRA is doing. MIRA hardly writes to STDERR
anymore; the remnants will disappear over time.
</p><p>
Some debugging information might also be written to STDOUT if MIRA
generates error messages.
</p><p>
On errors, MIRA will dump these also to STDOUT. Basically, three error classes
exist:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
WARNING: Messages in this error class do not stop the assembly but
are meant as information for the user. In some rare cases these
warnings are due to (an always possible) error in the I/O routines
of MIRA, but nowadays they are mostly due to unexpected (read:
wrong) input data and can be traced back to errors in the
preprocessing stages. If these warnings arise, you
definitely <span class="bold"><strong>DO</strong></span> want to check how
and why these errors came into those files in the first place.
</p><p>
Frequent causes of warnings include missing SCF files, SCF files
containing known quirks, EXP files containing known quirks etc.
</p></li><li class="listitem"><p>
FATAL: Messages in this error class actually stop the
assembly. These are mostly due to missing files that MIRA needs or
to very garbled (wrong) input data.
</p><p>
Frequent causes include naming an experiment file in the 'file of filenames'
that could not be found on the disk, listing the same experiment file twice in the
project, suspected errors in the EXP files, etc.
</p></li><li class="listitem"><p>
INTERNAL: These are true programming errors that were caught by internal
checks. Should this happen, please mail the output of STDOUT and STDERR to
the author.
</p></li></ol></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_ssaha2smalt"></a>3.6.5.
SSAHA2 / SMALT ancillary data
</h3></div></div></div><p>
The <span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> programs -
both from the Sanger Centre - can be used to detect possible vector
sequence stretches in the input data for the assembly. MIRA can load
the result files of a
<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> run and
interpret the results to tag the possible vector sequences at the ends
of reads.
</p><p>
Note that this also uses the parameters
[-CL:msvsgs:msvsmfg:msvsmeg] (see below).
</p><p>
ssaha2 must be called like this "<code class="literal">ssaha2
<ssaha2options> vector.fasta sequences.fasta</code>"
to generate an output that can be parsed by MIRA. In the above
example, replace <code class="filename">vector.fasta</code> by the name
of the file with your vector sequences and
<code class="filename">sequences.fasta</code> by the name of the file
containing your sequencing data.
</p><p>
smalt must be called like this: "<code class="literal">smalt map -f ssaha
<ssaha2options> hash_index sequences.fasta</code>"
</p><p>
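The two invocations above can also be assembled in a script. The following is a minimal Python sketch which only builds the argument lists and does not run anything; the function names are hypothetical, the SSAHA2 options are the ones quoted in a note in this section, and the file names are the placeholders used above. How the program output is captured for MIRA is not shown here.
</p>

```python
import shlex

# SSAHA2 options quoted in a note of this section.
SSAHA2_OPTIONS = ["-kmer", "8", "-skip", "1", "-seeds", "1",
                  "-score", "12", "-cmatch", "9", "-ckmer", "6"]


def ssaha2_command(vector_fasta, reads_fasta, options=SSAHA2_OPTIONS):
    """ssaha2 <options> vector.fasta sequences.fasta, native output format."""
    return ["ssaha2", "-output", "ssaha2", *options, vector_fasta, reads_fasta]


def smalt_command(hash_index, reads_fasta, options=()):
    """smalt map -f ssaha <options> hash_index sequences.fasta"""
    return ["smalt", "map", "-f", "ssaha", *options, hash_index, reads_fasta]


if __name__ == "__main__":
    # Print the command lines instead of executing them.
    print(shlex.join(ssaha2_command("vector.fasta", "sequences.fasta")))
    print(shlex.join(smalt_command("hash_index", "sequences.fasta")))
```

<p>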
This makes you basically independent from any other commercial or
license-requiring vector screening software. For Sanger reads, a
combination of <span class="command"><strong>lucy</strong></span> and
<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> together with
this parameter should do the trick. For reads coming from 454
pyro-sequencing, <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> and this parameter will also work very
well. See the usage manual for a walkthrough example on how to use
SSAHA2 / SMALT screening data.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The output format of SSAHA2 must be the native output format
(<code class="literal">-output ssaha2</code>). For SMALT, the output
option <code class="literal">-f ssaha</code> must be used. Other formats cannot
be parsed by MIRA.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I currently use the following SSAHA2 options:
<code class="literal">-kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer
6</code></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Anyone contributing SMALT parameters?
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The sequence vector clippings generated from SSAHA2 /
SMALT data do not replace sequence vector clippings loaded via
the EXP, CAF or XML files, they rather extend them.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_xml_traceinfo"></a>3.6.6.
XML TRACEINFO ancillary data
</h3></div></div></div><p>
MIRA extracts the following data from the TRACEINFO files:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
trace_name (required)
</p></li><li class="listitem"><p>
trace_file (recommended)
</p></li><li class="listitem"><p>
trace_type_code (recommended)
</p></li><li class="listitem"><p>
trace_end (recommended)
</p></li><li class="listitem"><p>
clip_quality_left (recommended)
</p></li><li class="listitem"><p>
clip_quality_right (recommended)
</p></li><li class="listitem"><p>
clip_vector_left (recommended)
</p></li><li class="listitem"><p>
clip_vector_right (recommended)
</p></li><li class="listitem"><p>
strain (recommended)
</p></li><li class="listitem"><p>
template_id (recommended for paired end)
</p></li><li class="listitem"><p>
insert_size (recommended for paired end)
</p></li><li class="listitem"><p>
insert_stdev (recommended for paired end)
</p></li><li class="listitem"><p>
machine_type (optional)
</p></li><li class="listitem"><p>
program_id (optional)
</p></li></ul></div><p>
</p><p>
Other data types are also read, but the info is not used.
</p><p>
Here is an example of a TRACEINFO file with ancillary info:
</p><pre class="screen">
<?xml version="1.0"?>
<trace_volume>
<trace>
<trace_name>GCJAA15TF</trace_name>
<program_id>PHRED (0.990722.G) AND TTUNER (1.1)</program_id>
<template_id>GCJAA15</template_id>
<trace_direction>FORWARD</trace_direction>
<trace_end>F</trace_end>
<clip_quality_left>3</clip_quality_left>
<clip_quality_right>622</clip_quality_right>
<clip_vector_left>1</clip_vector_left>
<clip_vector_right>944</clip_vector_right>
<insert_stdev>600</insert_stdev>
<insert_size>2000</insert_size>
</trace>
<trace>
...
</trace>
...
</trace_volume></pre><p>
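To check which of the fields listed above are present in your own TRACEINFO file, a small script can help. The following Python sketch is not MIRA's parser; it uses the standard library XML module, the function name <code class="literal">read_traceinfo</code> is hypothetical, and the embedded document repeats the first trace of the example above. Fields that MIRA does not use are simply dropped.
</p>

```python
import xml.etree.ElementTree as ET

# Fields this section lists as extracted by MIRA.
FIELDS = ("trace_name", "trace_file", "trace_type_code", "trace_end",
          "clip_quality_left", "clip_quality_right",
          "clip_vector_left", "clip_vector_right", "strain",
          "template_id", "insert_size", "insert_stdev",
          "machine_type", "program_id")

EXAMPLE = """<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>GCJAA15TF</trace_name>
    <program_id>PHRED (0.990722.G) AND TTUNER (1.1)</program_id>
    <template_id>GCJAA15</template_id>
    <trace_direction>FORWARD</trace_direction>
    <trace_end>F</trace_end>
    <clip_quality_left>3</clip_quality_left>
    <clip_quality_right>622</clip_quality_right>
    <clip_vector_left>1</clip_vector_left>
    <clip_vector_right>944</clip_vector_right>
    <insert_stdev>600</insert_stdev>
    <insert_size>2000</insert_size>
  </trace>
</trace_volume>"""


def read_traceinfo(xml_text):
    """Return {trace_name: {field: value}} for the fields listed above."""
    root = ET.fromstring(xml_text)
    records = {}
    for trace in root.iter("trace"):
        record = {}
        for field in FIELDS:
            value = trace.findtext(field)
            if value is not None:
                record[field] = value.strip()
        name = record.get("trace_name")
        if not name:
            raise ValueError("trace without trace_name (a required field)")
        records[name] = record
    return records


if __name__ == "__main__":
    for name, record in read_traceinfo(EXAMPLE).items():
        missing = [f for f in FIELDS if f not in record]
        print(name, "is missing:", ", ".join(missing))
```

<p>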
See
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc</a>
for a full description of all fields and more info on the TRACEINFO XML format.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_contig_naming"></a>3.6.7.
Contig naming
</h3></div></div></div><p>
MIRA names contigs the following
way: <span class="emphasis"><em><projectname>_<contigtype><number></em></span>. While <span class="emphasis"><em><projectname></em></span>
is dictated by the [--project=] parameter
and <span class="emphasis"><em><number></em></span> should be clear,
the <span class="emphasis"><em><contigtype></em></span> might need additional
explaining. There are currently four contig types:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
_c: these are "normal" contigs
</p></li><li class="listitem"><p>
_rep_c: only for genome assembly mode. These are contigs
containing only repetitive areas. These contigs
had <span class="emphasis"><em>_lrc</em></span> as type in previous versions of MIRA;
this was changed to <span class="emphasis"><em>_rep_c</em></span> to make things
clearer.
</p></li><li class="listitem"><p>
_s: these are singlet-contigs. Technically: "contigs" with a
single read.
</p></li><li class="listitem"><p>
_dn: this is an additional contig type which can occur when MIRA
ran a digital normalisation step during the assembly. Contigs
which contain reads completely covered by a DGNr tag will get an
additional "_dn" as part of their name to show that they contain
read representatives for digital normalisation. E.g.:
"contig_dn_c1".
</p><p>
Reads covered only partly by the DGNr tag do not trigger the _dn
naming.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note: Important side note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Important side note</th></tr><tr><td align="left" valign="top"> Due to the digital
normalisation step, the coverage numbers in the info file
regarding contig statistics will not represent the number of
reads in the contig, but they will show an approximation of
the true coverage or expression value as if there had not been
a digital normalisation step performed. The approximation may
be around 10 to 20% below the true value.
</td></tr></table></div></li></ol></div><p>
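The naming scheme above can be taken apart again in scripts. The following Python sketch is one possible way to do it and not part of MIRA; the function name is hypothetical, and it assumes that the "_dn" marker always sits directly in front of the contig type, as in the "contig_dn_c1" example. Project names which themselves end in "_dn" or "_rep" make the split ambiguous.
</p>

```python
import re

CONTIG_NAME = re.compile(
    r"^(?P<project>.+?)"      # project name, matched as short as possible
    r"(?P<dn>_dn)?"           # optional digital normalisation marker
    r"_(?P<type>rep_c|c|s)"   # contig type
    r"(?P<number>\d+)$"       # running number
)


def parse_contig_name(name):
    """Split a contig name into project, type, _dn marker and number."""
    match = CONTIG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a MIRA-style contig name: {name!r}")
    return {
        "project": match["project"],
        "type": match["type"],
        "digitally_normalised": match["dn"] is not None,
        "number": int(match["number"]),
    }


if __name__ == "__main__":
    for name in ("contig_dn_c1", "ecoli_rep_c12", "ecoli_s3"):
        print(name, parse_contig_name(name))
```

<p>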
Basically, for genome assemblies MIRA starts to build contigs in areas
which seem "rock solid", i.e., not a repetitive region (main decision
point) and nice coverage of good reads. Contigs which started like
this get a <span class="emphasis"><em>_c</em></span> name. If during the assembly MIRA
reaches a point where it cannot start building a contig in a
non-repetitive region, it will name the contig
<span class="emphasis"><em>_rep_c</em></span> instead of <span class="emphasis"><em>_c</em></span>. This
is why "_rep_c" contigs occur late in a genome assembly.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
MIRA has a different understanding of "rock solid" when in EST/RNASeq
assembly: here, MIRA will try to reconstruct a full length gene
sequence, starting with the most abundant genes.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Depending on the settings of [-AS:mrpc], your project may or
may not contain <span class="emphasis"><em>_s</em></span> singlet-contigs. Also note
that reads landing in the debris file will not get assigned to
singlet-contigs and hence not get <span class="emphasis"><em>_s</em></span> names.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_recovering_strain_specific_consensus"></a>3.6.8.
Recovering strain specific consensus as FASTA
</h3></div></div></div><p>
In case you used strain information in an assembly, you can
recover the consensus for any given strain
by using <span class="command"><strong>miraconvert</strong></span> and convert from a
full assembly format (e.g. MAF or CAF) which also carries
strain information to FASTA. MIRA will automatically detect
the strain information and create one FASTA file per strain
encountered.
</p><p>
It will also create a blend of all strains encountered and
conveniently add "AllStrains" to the name of these files. Note that
this blend may or may not be something you need, but in some
cases I found it to be useful.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_tags_used_in_the_assembly_by_mira_and_edit"></a>3.7.
Tags used in the assembly by MIRA and EdIt
</h2></div></div></div><p>
MIRA uses and sets a couple of tags during the assembly process. That
is, if information is known before the assembly, it can be stored in tags (in
the EXP and CAF formats) and will be used in the assembly.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_tags_read_and_used"></a>3.7.1.
Tags read (and used)
</h3></div></div></div><p>
This section lists "foreign" tags, i.e., tags whose definition was made
by software packages other than MIRA.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
ALUS, REPT: Sequence stretches tagged as ALUS (ALU Sequence) or REPT
(general repetitive sequence) will be handled with extreme care during the
assembly process. The allowed error rate after automatic contig editing
within these stretches is normally far below the general allowed error rate,
leading to much higher stringency during the assembly process and
subsequently to a better repeat resolving in many cases.
</p></li><li class="listitem"><p>
Fpas: GenBank feature for a poly-A sequence. Used in EST, cDNA or
transcript assembly. Either read in the input files or set when using
[-CL:cpat]. This allows keeping the poly-A sequence in
the reads during assembly without them interfering as massive
repeats or as mismatches.
</p></li><li class="listitem"><p>
FCDS, Fgen: GenBank features as described in GBF/GBK files or set in the
Staden package are used to make some SNP impact analysis on genes.
</p></li><li class="listitem"><p>
other: all other tags in reads will be read and passed through the
assembly without being changed and they currently do not influence the
assembly process.
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_tags_set_and_used"></a>3.7.2.
Tags set (and used)
</h3></div></div></div><p>
This section lists tags which MIRA sets (and reads of course), but that other
software packages might not know about.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
UNSr, UNSc: <span class="bold"><strong>UNS</strong></span>ure
in <span class="bold"><strong>R</strong></span>ead
respectively <span class="bold"><strong>C</strong></span>ontig. These tags
denote positions in an assembly with conflicts that could not be
resolved automatically by MIRA. These positions should be looked
at during the finishing process.
</p><p>
For assemblies using good sequences and enough coverage, roughly
0.01% of the consensus positions have such a tag (e.g. ~300 UNSc
tags for a genome of 3 megabases).
</p></li><li class="listitem"><p>
SRMr, WRMc: <span class="bold"><strong>S</strong></span>trong <span class="bold"><strong>R</strong></span>epeat <span class="bold"><strong>M</strong></span>arker and
<span class="bold"><strong>W</strong></span>eak <span class="bold"><strong>R</strong></span>epeat <span class="bold"><strong>M</strong></span>arker. These
tags are set in two flavours: as
SRM<span class="bold"><strong>r</strong></span> and
WRM<span class="bold"><strong>r</strong></span> when set in reads, and as
SRM<span class="bold"><strong>c</strong></span> and
WRM<span class="bold"><strong>c</strong></span> when set in the
consensus. These tags are used on an individual per base basis for
each read. They denote bases that have been identified as crucial
for resolving repeats, often denoting a single SNP within several
hundreds or thousands of bases. While a SRM is quite certain, the
WRM really is either weak (there wasn't enough comforting
information in the vicinity to be really sure) or involves gap
columns (which is always a bit tricky).
</p><p>
MIRA will automatically set these tags when it encounters repeats
and will tag exactly those bases that can be used to discern the
differences.
</p><p>
Seeing such a tag in the consensus means that MIRA was not able to
finish the disentanglement of that special repeat stretch or that
it found a new one in one of the last passes without having the
opportunity to resolve the problem.
</p></li><li class="listitem"><p>
DGPc: <span class="bold"><strong>D</strong></span>ubious <span class="bold"><strong>G</strong></span>ap <span class="bold"><strong>P</strong></span>osition in
<span class="bold"><strong>C</strong></span>onsensus. Set whenever the gap to base ratio in a column of 454
reads is between 40% and 60%.
</p></li><li class="listitem"><p>
SAO, SRO, SIO: <span class="bold"><strong>S</strong></span>NP intr<span class="bold"><strong>A</strong></span> <span class="bold"><strong>O</strong></span>rganism,
<span class="bold"><strong>S</strong></span>NP inte<span class="bold"><strong>R</strong></span> <span class="bold"><strong>O</strong></span>rganism, <span class="bold"><strong>S</strong></span>NP <span class="bold"><strong>I</strong></span>ntra
and inter <span class="bold"><strong>O</strong></span>rganism. As for SRM
and WRM, these tags have a <span class="bold"><strong>r</strong></span>
appended when set in reads and
a <span class="bold"><strong>c</strong></span> appended when set in the
consensus. These tags denote SNP positions.
</p><p>
MIRA will automatically set these tags when it encounters SNPs and
will tag exactly those bases that can be used to discern the
differences. They denote SNPs as they occur within an organism
(SAO), between two or more organisms (SRO) or within and between
organisms (SIO).
</p><p>
Seeing such a tag in the consensus means that MIRA set this as a
valid SNP in the assembly pass. Seeing such tags only in reads (but not in
the consensus) shows that in a previous pass, MIRA thought these
bases to be SNPs but that in later passes, this SNP does not appear anymore
(perhaps due to resolved misassemblies).
</p></li><li class="listitem"><p>
STMS: (only hybrid assemblies). The <span class="bold"><strong>S</strong></span>equencing <span class="bold"><strong>T</strong></span>ype
<span class="bold"><strong>M</strong></span>ismatch <span class="bold"><strong>S</strong></span>olved
is tagged to positions in the assembly where the consensus of
different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, PacBio, SOLiD)
reads differ, but MIRA thinks it found the correct
solution. Often this is due to low coverage of one of the types
and an additional base calling error.
</p><p>
Sometimes this depicts real differences, where possible explanations
include: slightly different bugs were sequenced or a
mutation occurred during library preparation.
</p></li><li class="listitem"><p>
STMU: (only hybrid assemblies). The <span class="bold"><strong>S</strong></span>equencing <span class="bold"><strong>T</strong></span>ype
<span class="bold"><strong>M</strong></span>ismatch <span class="bold"><strong>U</strong></span>nresolved
is tagged to positions in the assembly where the consensus of
different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, SOLiD)
reads differ, but MIRA could not find a good resolution. Often this
is due to low coverage of one of the types and an additional base
calling error.
</p><p>
Sometimes this depicts real differences, where possible explanations
include: slightly different bugs were sequenced or a mutation
occurred during library preparation.
</p></li><li class="listitem"><p>
MCVc: The <span class="bold"><strong>M</strong></span>issing <span class="bold"><strong>C</strong></span>o<span class="bold"><strong>V</strong></span>erage in <span class="bold"><strong>C</strong></span>onsensus.
Set in assemblies with more than one strain. If a strain has no coverage at
a certain position, the consensus gets tagged with this tag (and the name of
the strain which misses this position is put in the comment). Additionally,
the sequence in the result files for this strain will have an @ character.
</p></li><li class="listitem"><p>
MNRr: (only with [-KS:mnr] active). The <span class="bold"><strong>M</strong></span>asked
<span class="bold"><strong>N</strong></span>asty <span class="bold"><strong>R</strong></span>epeat tags are set over those parts of a read that
have been detected as being many more times present than the average
sub-sequence. MIRA will hide these parts during the initial
all-against-all overlap finding routine (SKIM3) but will otherwise happily
use these sequences for consensus generation during contig building.
</p></li><li class="listitem"><p>
FpAS: See "Tags read (and used)" above.
</p></li><li class="listitem"><p>
ED_C, ED_I, ED_D: EDit Change, EDit Insertion, EDit Deletion. These
tags are set by the integrated automatic editor EdIt and show which edit
actions have been performed.
</p></li><li class="listitem"><p>
HAF2, HAF3, HAF4, HAF5, HAF6, HAF7. These
are <span class="bold"><strong>HA</strong></span>sh <span class="bold"><strong>F</strong></span>requency
tags which show the status of read parts in comparison to the
whole project. Only set if [-AS:ard] is active (default
for genome assemblies).
</p><p>
More info on how to use the information conveyed by HAF tags can be found in
the section dealing with repeats and HAF tags in finishing
programs further down in this manual.
</p><p>
HAF2 coverage below average ( standard setting at < 0.5 times average)
</p><p>
HAF3 coverage is at average ( standard setting at ≥ 0.5 times average and ≤ 1.5 times average)
</p><p>
HAF4 coverage above average ( standard setting at > 1.5 times average and < 2 times average)
</p><p>
HAF5 probably repeat ( standard setting at ≥ 2 times average and < 5 times average)
</p><p>
HAF6 'heavy' repeat ( standard setting at > 8 times average)
</p><p>
HAF7 'crazy' repeat ( standard setting at > 20 times average)
</p></li></ul></div><p>
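The HAF thresholds listed above can be written down as a small decision function. The following Python sketch follows the list literally and is not taken from the MIRA sources; the function name is hypothetical, and how MIRA determines the frequency and the average is not covered here. Ratios from 5 to 8 times average are not covered by the list, so the sketch returns no tag for them.
</p>

```python
def haf_tag(frequency, average):
    """Return the HAF tag for a frequency, or None where the list has no rule."""
    ratio = frequency / average
    if ratio > 20:
        return "HAF7"   # 'crazy' repeat
    if ratio > 8:
        return "HAF6"   # 'heavy' repeat
    if 2 <= ratio < 5:
        return "HAF5"   # probably repeat
    if 1.5 < ratio < 2:
        return "HAF4"   # above average
    if 0.5 <= ratio <= 1.5:
        return "HAF3"   # at average
    if ratio < 0.5:
        return "HAF2"   # below average
    return None         # 5 to 8 times average: not covered by the list


if __name__ == "__main__":
    for frequency in (10, 40, 70, 120, 240, 400, 1000):
        print(frequency, haf_tag(frequency, average=40))
```

<p>
Checking the highest thresholds first means that a ratio above 20 is reported as HAF7 only, although it also satisfies the HAF6 condition.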
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_contigs_singlets_debris"></a>3.8.
Where reads end up: contigs, singlets, debris
</h2></div></div></div><p>
At the start, things are simple: a read either aligns with other reads or it does not. Reads which
align with other reads form contigs, and these MIRA will save in the results with a contig name
of <span class="emphasis"><em>_c</em></span>.
</p><p>
However, not all reads can be placed in an assembly. This can have several reasons and
these reads may end up at two different places in the result files: either in the
<span class="emphasis"><em>debris</em></span> file, then just as a name entry, or as singlet (a "contig"
with just one read) in the regular results.
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
reads are too short and get filtered out (before or after the MIRA
clipping stages). These invariably land in the debris file.
</p></li><li class="listitem"><p>
reads are real singlets: they contain genuine sequence but have no
overlap with any other read. These get caught either by the
[-CL:pec] clipping filter or during the SKIM phase.
</p></li><li class="listitem"><p>
reads contain mostly or completely junk.
</p></li><li class="listitem"><p>
reads contain chimeric sequence (therefore: they're also junk)
</p></li></ol></div><p>
MIRA filters out these reads in different stages: before and after read
clipping, during the SKIM stage, during the Smith-Waterman overlap
checking stage or during contig building. The exact place where these
single reads land is dependent on why they do not align with other
reads. Reads landing in the debris file will have the reason and stage
attached to the decision.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_snp_discovery"></a>3.9.
Detection of bases distinguishing non-perfect repeats and SNP discovery
</h2></div></div></div><p>
MIRA is able to find and tag SNPs in any kind of data -- be it genomic
or EST -- in both de-novo and mapping assemblies ... provided it knows
which read in an assembly is coming from which strain, cell line or
organism.
</p><p>
The SNP detection routines are based on the same routines as the
routines for detecting non-perfect repeats. In fact, MIRA can even
distinguish between bases marking a misassembled repeat from bases
marking a SNP within the same project.
</p><p>
All you need to do to enable this feature is to set
[-CO:mr=yes] (which is standard in all
<code class="literal">--job=...</code> incantations of <span class="command"><strong>mira</strong></span> and
in some steps of <span class="command"><strong>miraSearchESTSNPs</strong></span>). Furthermore, you
will need to provide <span class="emphasis"><em>strain information</em></span>, either in
the manifest file or in ancillary NCBI TRACEINFO XML files.
</p><p>
The effect of using strain names attached to reads can be described
briefly like this. Assume that you have 6 reads (called R1 to R6), three
of them having an <code class="literal">A</code> at a given position, the other
three a <code class="literal">C</code>.
</p><pre class="screen">
R1 ......A......
R2 ......A......
R3 ......A......
R4 ......C......
R5 ......C......
R6 ......C......</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This example is just that: an example. It uses just 6 reads, with two
times three reads as read groups for demonstration purposes and without
looking at qualities. For MIRA to recognise SNPs, a few things must come
together (e.g. for many sequencing technologies it wants forward and
backward reads when in de-novo assembly) and a couple of parameters can
be set to adjust the sensitivity. Read more about the parameters:
[-CO:mrpg:mnq:mgqrt:emea:amgb:amgbemc:amgbnbs]</td></tr></table></div><p>
Now, assume you did not give any strain information. MIRA will most
probably recognise a problem and, having no strain information, assume
it made an error by assembling two different repeats of the same
organism. It will tag the bases in the reads with repeat marker tags
(SRMr) and the base in the consensus with a SROc tag (to point at an
unresolved problem). In a subsequent pass, MIRA will then not assemble
these six reads together again, but create two contigs like this:
</p><pre class="screen">
Contig1:
R1 ......A......
R2 ......A......
R3 ......A......
Contig2:
R4 ......C......
R5 ......C......
R6 ......C......</pre><p>
The bases in the reads will keep their SRMr tags, but the consensus
base of each contig will not get SROc as there is no conflict anymore.
</p><p>
Now, assume you gave reads R1, R2 and R3 the strain information "human",
and reads R4, R5 and R6 "chimpanzee". MIRA will then create this:
</p><pre class="screen">
R1 (hum) ......<span class="bold"><strong>A</strong></span>......
R2 (hum) ......<span class="bold"><strong>A</strong></span>......
R3 (hum) ......<span class="bold"><strong>A</strong></span>......
R4 (chi) ......<span class="bold"><strong>C</strong></span>......
R5 (chi) ......<span class="bold"><strong>C</strong></span>......
R6 (chi) ......<span class="bold"><strong>C</strong></span>......</pre><p>
Instead of creating two contigs, it will again create one contig ... but
it will tag the bases in the reads with a SROr tag and the position in
the contig with a SROc tag. The SRO tags (<span class="bold"><strong>S</strong></span>NP inte<span class="bold"><strong>R</strong></span>
<span class="bold"><strong>O</strong></span>rganisms) tell you: there's a SNP
between those two (or multiple) strains/organisms/whatever.
</p><p>
Changing the above example a little, assume you have this assembly early
on during the MIRA process:
</p><pre class="screen">
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......A......
R5 (chi) ......A......
R6 (chi) ......A......
R7 (chi) ......C......
R8 (chi) ......C......
R9 (chi) ......C......</pre><p>
Because "chimp" has a SNP within itself (<code class="literal">A</code> versus
<code class="literal">C</code>) and there's a SNP between "human" and "chimp"
(also <code class="literal">A</code> versus <code class="literal">C</code>), MIRA will see a
problem and set a tag, this time a SIOr tag: <span class="bold"><strong>S</strong></span>NP <span class="bold"><strong>I</strong></span>ntra- and
inter <span class="bold"><strong>O</strong></span>rganism.
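</p><p>
The three cases described so far (no strain information, a difference between
strains, and a difference within a strain) can be summarised in a small Python
sketch. This is a toy model only and not MIRA's implementation: the function
name <code class="literal">classify_column</code> and the data layout are
assumptions made for illustration, and base qualities, read directions and the
[-CO:...] sensitivity parameters are ignored.
</p>

```python
def classify_column(column):
    """Classify one alignment column.

    `column` is a list of (strain, base) pairs; strain is None when no
    strain information was given. Returns "SRM", "SRO", "SIO" or None.
    Toy model: qualities and read directions are ignored.
    """
    bases = {base for _, base in column}
    if len(bases) == 1:
        return None  # all reads agree, nothing to tag

    # Collect the set of bases seen for each strain.
    per_strain = {}
    for strain, base in column:
        per_strain.setdefault(strain, set()).add(base)

    if set(per_strain) == {None}:
        return "SRM"  # no strain info: treated as a repeat marker
    if any(len(seen) != 1 for seen in per_strain.values()):
        return "SIO"  # conflict within at least one strain
    return "SRO"      # every strain is consistent, strains differ


no_strain = [(None, b) for b in "AAACCC"]
two_strains = [("hum", "A")] * 3 + [("chi", "C")] * 3
mixed = [("hum", "A")] * 3 + [("chi", "A")] * 3 + [("chi", "C")] * 3

print(classify_column(no_strain))    # SRM
print(classify_column(two_strains))  # SRO
print(classify_column(mixed))        # SIO
```

<p>
The sketch only decides which kind of tag a column would get; splitting reads
into separate contigs in later passes is a separate step which it does not
model.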
</p><p>
MIRA does not like conflicts occurring within an organism and will try
to resolve these cleanly. After setting the SIOr tags, MIRA will
re-assemble the reads in subsequent passes like this:
</p><pre class="screen">
Contig1:
R1 (hum) ......<span class="bold"><strong>A</strong></span>......
R2 (hum) ......<span class="bold"><strong>A</strong></span>......
R3 (hum) ......<span class="bold"><strong>A</strong></span>......
R4 (chi) ......<span class="bold"><strong>A</strong></span>......
R5 (chi) ......<span class="bold"><strong>A</strong></span>......
R6 (chi) ......<span class="bold"><strong>A</strong></span>......
Contig2:
R7 (chi) ......<span class="bold"><strong>C</strong></span>......
R8 (chi) ......<span class="bold"><strong>C</strong></span>......
R9 (chi) ......<span class="bold"><strong>C</strong></span>......</pre><p>
The reads in Contig1 (hum+chi) and Contig2 (chi) will keep their SIOr
tags, the consensus will have no SIOc tag as the "problem" was
resolved.
</p><p>
When presented with conflicting information regarding SNPs and possible
repeat markers or SNPs within an organism, MIRA will always first try to
resolve the repeat marker. Assume the following situation:
</p><pre class="screen">
R1 (hum) ......A...T......
R2 (hum) ......A...G......
R3 (hum) ......A...T......
R4 (chi) ......C...G......
R5 (chi) ......C...T......
R6 (chi) ......C...G......</pre><p>
While the first discrepancy column can be "explained away" by a SNP
between organisms (it will get a SROr/SROc tag), the second column
cannot and will get a SIOr/SIOc tag. After that, MIRA opts to get the
SIO conflict resolved:
</p><pre class="screen">
Contig1:
R1 (hum) ......A...T......
R3 (hum) ......A...T......
R5 (chi) ......C...T......
Contig2:
R2 (hum) ......A...G......
R4 (chi) ......C...G......
R6 (chi) ......C...G......</pre></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_data_reduction"></a>3.10.
Data reduction: subsampling vs. lossless digital normalisation
</h2></div></div></div><p>
Some data sets have way too much data. Sometimes it is simply more than
needed: performing a de-novo genome assembly with enough reads
for 300x coverage is like taking a sledgehammer to crack a
nut. Sometimes it is even more than is good for an assembly (see also:
motif dependent sequencing errors).
</p><p>
MIRA being an overlap-based assembler, reducing a data set helps to keep
time and memory requirements low. There are basically two ways to
perform this: reduction by subsampling and reduction by digital
normalisation. Both methods have their pros and cons and can be used
effectively in different scenarios.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<span class="emphasis"><em>Subsampling</em></span> is a process to create a smaller,
hopefully representative set from a larger data set.
</p><p>
In sequencing, various ways exist to perform subsampling. As
sequencing data sets from current sequencing technologies can be
seen as essentially randomised when coming fresh from the machine,
the selection step can be as easy as selecting the
first <span class="emphasis"><em>n</em></span> reads. When the input data set is not
random (e.g. in SAM/BAM files with mapped data), one must resort to
random selection of reads.
</p><p>
Subsampling must be done by the user prior to assembly with MIRA.
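</p><p>
As an illustration of random selection, the following Python sketch draws a
fixed number of records from a FASTQ stream using reservoir sampling. It is
not part of MIRA; the function names are hypothetical and the input is assumed
to be plain four-line FASTQ records.
</p>

```python
import random


def fastq_records(lines):
    """Group an iterable of lines into 4-line FASTQ records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def subsample(records, n, seed=0):
    """Return up to n records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if n > len(reservoir):
            reservoir.append(rec)   # fill phase: take the first n records
        else:
            j = rng.randint(0, i)   # both bounds are inclusive
            if n > j:
                reservoir[j] = rec  # replace a previously chosen record
    return reservoir


# Tiny in-memory demo with ten made-up reads.
demo = []
for k in range(10):
    demo += [f"@read{k}", "ACGT", "+", "IIII"]
picked = subsample(fastq_records(demo), 3, seed=42)
print(len(picked))  # 3
```

<p>
For paired-end data the two reads of a pair have to be kept or dropped
together, which this sketch does not handle.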
</p><p>
On the upside, subsampling preserves the exact copy number structure
of the input data set: a repeat with n copies in a genome will
always be represented by reads forming n copies of the repeat in the
reduced data set. Furthermore, subsampling is comparatively
insensitive to motif dependent sequencing errors. On the downside,
subsampling is more likely to lose rare events of the data set
(e.g., rare SNPs of a cell population or rare transcripts in
EST/RNASeq). Also, in EST/RNASeq projects, subsampling will not be
able to reduce extraordinary coverage events to a level which keeps
the assembly from being painfully slow. Examples of the latter are rRNA
genes or highly expressed house-keeping genes where today's Illumina
data sets sometimes contain enough data to reach coverage numbers
≥ 100,000x or even a million x.
</p><p>
Subsampling should therefore be used for single genome de-novo
assemblies; or for EST/RNASeq assemblies which need reliable
coverage numbers for transcript expression data but where at least
all rDNA has been filtered out prior to assembly.
</p></li><li class="listitem"><p>
<span class="emphasis"><em>Digital normalisation</em></span> is a process to perform a
reduction of sequencing data redundancy. It was made known to a
wider audience by the paper <span class="emphasis"><em>"A Reference-Free Algorithm
for Computational Normalization of Shotgun Sequencing
Data"</em></span> by Brown et al. (see
<a class="ulink" href="http://arxiv.org/abs/1203.4802" target="_top">http://arxiv.org/abs/1203.4802</a>).
</p><p>
The normalisation process works by progressively going through the
sequencing data and selecting reads which bring new, previously
unseen information to the assembly and discarding those which
describe nothing new. For single genome assemblies, this has the
effect that repeats with n copies in the genome are afterwards
present often with just enough reads to reconstruct only a single
copy of the repeat. In EST/RNASeq assemblies, this leads to
reconstructed transcripts all having more or less the same coverage.
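</p><p>
The following Python sketch illustrates the median k-mer abundance rule of the
original algorithm from the cited paper: a read is kept only while the median
count of its k-mers, among reads kept so far, is below a cutoff. It shows the
lossy original idea, not the lossless variant implemented in MIRA, and the
values for <code class="literal">k</code> and
<code class="literal">cutoff</code> are toy values chosen for the example.
</p>

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalise(reads, k=4, cutoff=2):
    """Keep a read only if it still adds new information."""
    counts = Counter()  # k-mer counts of the reads kept so far
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: skipped in this sketch
        if cutoff > median(counts[w] for w in words):
            kept.append(read)
            counts.update(words)
    return kept


reads = ["ACGTTGCAAGGC"] * 5 + ["TTTTCCCCGGGG"]
print(normalise(reads))
# ['ACGTTGCAAGGC', 'ACGTTGCAAGGC', 'TTTTCCCCGGGG']
```

<p>
Five identical reads are reduced to two, while the single read with different
content is retained; this is the flattening of copy numbers described above.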
</p><p>
The normalisation process as described in the paper allows for a
certain lossiness during the data reduction as it was developed to
cope with billions of reads. E.g., it will often lose borders in
genome reorganisation events or SNP information from ploidies, from
closely related gene copies or from closely related species.
</p><p>
MIRA implements a variant of the algorithm: the <span class="emphasis"><em>lossless
digital normalisation</em></span>. Here, normalised data has copy
numbers reduced like in the original algorithm, but all variants
(SNPs, borders of reorganisation events etc.) present in the
original data set are retained in the reduced data set. Furthermore,
the normalisation is parameterised to take place only for
excessively repetitive parts of a data set which would lead to
overly increased run-time and memory consumption. This gives the
assembler the opportunity to correctly evaluate and work with
repeats which do not occur "too often" in a data set while still
being able to reconstruct at least one copy of the really nasty
repeats.
</p><p>
Digital normalisation should not be done prior to an assembly with
MIRA, rather the MIRA parameter to perform a digital normalisation
on the complete data set should be used.
</p><p>
The lossless digital normalisation of MIRA should be used for
EST/RNASeq assemblies containing highly repetitive data. Metagenome
assemblies may also profit from this feature.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
MIRA keeps track of the approximate coverage represented by the
reads chosen in the digital normalisation process. That is, MIRA is
able to give approximate coverage numbers as if digital
normalisation had never happened. The approximation may be around 10
to 20% below the true value. Contigs affected by this coverage
approximation are denoted with an additional "_dn" in their name.
</p><p>
Due to the digital
normalisation step, the coverage numbers in the info file
regarding contig statistics will not represent the number of
reads in the contig, but they will show an approximation of
the true coverage or expression value as if there had not been
a digital normalisation step performed.
</p></td></tr></table></div></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_caveats"></a>3.11.
Caveats
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_using_artificial_reads"></a>3.11.1.
Using data not from sequencing instruments: artificial / synthetic reads
</h3></div></div></div><p>
The default parameters for MIRA assemblies work best when given real
sequencing data and they even expect the data to behave like real
sequencing data. But some assembly strategies work in multiple rounds,
using so called "artificial" or "synthetic" reads in later rounds,
i.e., data which was not generated through sequencing machines but
might be something like the consensus of previous assemblies.
</p><p>
If one doesn't take great care to make these artificial reads at least
behave a little bit like real sequencing data, a number of quality
assurance algorithms of MIRA might spot that they "look funny" and
trim back these artificial reads ... sometimes even removing them
completely.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note: Summary tips for creating artificial reads for MIRA assemblies"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Summary tips for creating artificial reads for MIRA assemblies</th></tr><tr><td align="left" valign="top"><p>
The following should lead to the least amount of surprises for most
assembly use cases when calling MIRA only with the most basic
switches <code class="literal">--project=... --job=...</code>
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><span class="bold"><strong>Length:</strong></span> between 50 and 20000 bp
</li><li class="listitem"><span class="bold"><strong>Quality values:</strong></span> give your
artificial reads quality values. Using <span class="emphasis"><em>30</em></span>
as quality value for your bases should be OK for most
applications.
</li><li class="listitem"><span class="bold"><strong>Orientation:</strong></span> for every read you
create, create a read with the same data (bases and quality
values) in reverse complement direction.
</li></ol></div></td></tr></table></div><p>
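The three tips above can be sketched in Python as follows. This is an
illustration and not a MIRA tool: the function names and the
<code class="literal">_fwd</code> / <code class="literal">_rev</code> read
name suffixes are hypothetical, and the output is assumed to be FASTQ with
Phred+33 encoded qualities.
</p>

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def artificial_reads(name, seq, qual=30, min_len=50, max_len=20000):
    """Return FASTQ text with a forward and a reverse complement read."""
    if len(seq) not in range(min_len, max_len + 1):
        raise ValueError("read length outside the recommended range")
    # Constant quality, so the quality string is identical in both directions.
    qline = chr(qual + 33) * len(seq)  # Phred+33 encoding
    fwd = f"@{name}_fwd\n{seq}\n+\n{qline}\n"
    rev = f"@{name}_rev\n{reverse_complement(seq)}\n+\n{qline}\n"
    return fwd + rev


fastq = artificial_reads("contig1", "AACG" * 20)
print(fastq.splitlines()[0])  # @contig1_fwd
```

<p>
With non-constant qualities, the quality string of the reverse complement read
would have to be reversed as well.
</p><p>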
The following list gives all the gory details on what synthetic reads
should look like or which MIRA algorithms to switch off in certain
cases:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Forward and reverse complement directions: most sequencing
technologies and strategies yield a mixture of reads with both
forward and reverse complement direction to the DNA sequenced. In
fact, having both directions allows for a much better quality
control of an alignment as sequencing technology dependent
sequencing errors will often affect only one direction at a given
place and not both (the exception being homopolymers and 454).
</p><p>
The MIRA <span class="emphasis"><em>proposed end clipping</em></span> algorithm
[-CL:pec] uses this knowledge to initially trim back
ends of reads to an area without sequencing errors. However, if
reads covering a given area of DNA are present in only one
direction, then these reads will be completely eliminated.
</p><p>
If you use only artificial reads in an assembly, then switch off
the <span class="emphasis"><em>proposed end clipping</em></span>
[-CL:pec=no].
</p><p>
If you mix artificial reads with "normal" reads, make sure that
every part of an artificial read is covered by some other read in
reverse complement direction (be it a normal or artificial
read). The easiest way to do that is to add a reverse complement
for every artificial read yourself, though if you use an
overlapping strategy with artificial reads, you can calculate the
overlaps and reverse complements of reads so that every second
artificial read is in reverse complement to save time and memory
afterwards during the computation.
</p></li><li class="listitem"><p>
Sequencing type/technology: MIRA currently knows Sanger, 454, Ion
Torrent, Solexa, PacBioHQ/LQ and "Text" as sequencing
technologies; every read entered in an assembly must be one of
those.
</p><p>
Artificial reads should be classified depending on the data they
were created from, that is, Sanger for consensus of Sanger reads,
454 for consensus of 454 reads etc. However, should reads created
from Illumina consensus be much longer than, say, 200 or 300
bases, you should treat them as Sanger reads.
</p></li><li class="listitem"><p>
Quality values: be careful to assign decent quality values to your
artificial reads as several quality clipping or consensus calling
algorithms make extensive use of qualities. Pay attention to
values of [-CL:qc:bsqc] as well as to
[-CO:mrpg:mnq:mgqrt].
</p></li><li class="listitem"><p>
Read lengths: the current maximum read length for MIRA is around
30kb. However, to leave a safety margin, MIRA currently allows
reads of at most 20kb.
</p></li></ul></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_ploidy_and_repeats"></a>3.11.2.
Ploidy and repeats
</h3></div></div></div><p>
MIRA treats ploidy differences as repeats and will therefore build
separate contigs for the reads of a ploidy that has a difference to
the other ploidy/ploidies.
</p><p>
There is simply no other way to handle ploidy while retaining the
ability to separate repeats based on differences of only a single
base. Everything else would be guesswork. I thought for some time
about doing a coverage analysis around the potential repeat/ploidy
site, but came to the conclusion that due to the stochastic nature of
sequencing data, this would very probably take wrong decisions in too
many cases to be acceptable.
</p><p>
If someone has a good idea, I'll be happy to hear it.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_handling_of_repeats"></a>3.11.3.
Handling of repeats
</h3></div></div></div><p>
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_uniform_read_distribution"></a>3.11.3.1.
Uniform read distribution
</h4></div></div></div><p>
Under the assumption that reads in a project are uniformly
distributed across the genome, MIRA will enforce an average coverage
and temporarily reject reads from a contig when this average
coverage multiplied by a safety factor is reached at a given
site. This strategy reduces over-compression of repeats during the
contig building phase and keeps reads in reserve for other copies of
that repeat.
</p><p>
It's generally a very useful tool to disentangle repeats, but has some
slight secondary effects: rejection of otherwise perfectly good
reads. The assumption of read distribution uniformity is the big
problem we have here: of course it's not really valid. You sometimes
have less, and sometimes more than "the average"
coverage. Furthermore, the new sequencing technologies - 454 perhaps
but certainly the ones from Solexa - show that you also have a skew
towards the site of replication origin.
</p><p>
Warning: Solexa data from late 2009 and 2010 show a high GC content
bias. This bias can reach 200 or 300%, i.e., sequence part for with
low GC
</p><p>
One example: let's assume the average coverage of a project is 8 and
by chance at one place there are 17 (non-repetitive) reads, then the
following happens:
</p><p>
(Note: <span class="emphasis"><em>p</em></span> is the parameter [-AS:urdsip])
</p><p>
Pass 1 to <span class="emphasis"><em>p-1</em></span>: MIRA happily assembles everything together and calculates a
number of different things, amongst them an average coverage of ~8. At the
end of pass <span class="emphasis"><em>p-1</em></span>, it will announce this average coverage as first estimate
to the assembly process.
</p><p>
Pass <span class="emphasis"><em>p</em></span>: MIRA has still assembled everything together, but at the end of each
pass the contig self-checking algorithms now include an "average coverage
check". They'll invariably find the 17 reads stacked and decide (looking at
the [-AS:ardct] parameter which is assumed to be 2 for this example)
that 17 is larger than 2*8 and that this very well may be a repeat. The reads
get flagged as possible repeats.
</p><p>
Pass <span class="emphasis"><em>p+1</em></span> to end: the "possibly repetitive" reads get a much tougher
treatment in MIRA. Amongst other things, while a contig is being built, it
now checks that "possibly repetitive" reads do not stack beyond the average
coverage multiplied by a safety value ([-AS:urdcm]), which we'll
assume to be 1.5 in this example. So, at a certain point, say when read 14
or 15 of that possible repeat want to be aligned to the contig at this given
place, the contig will just flatly refuse and tell the assembler to please
find another place for them, be it in this contig that is built or any other
that will follow. Of course, if the assembler cannot comply, the reads 14 to
17 will end up as contiglet (contig debris, if you want) or if it was only one
read that got rejected like this, it will end up as singlet or in the debris
file.
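</p><p>
The arithmetic of this example can be written down as a short Python sketch.
It only reproduces the two multiplications from the text; the function names
are hypothetical, and the values 2 and 1.5 are the ones assumed in the example
above, not necessarily the MIRA defaults.
</p>

```python
def coverage_thresholds(avg_cov, ardct=2.0, urdcm=1.5):
    """Return (flag_above, stack_cap) for a given average coverage.

    flag_above: coverage above which reads get flagged as possible repeats.
    stack_cap:  coverage up to which flagged reads may stack in later passes.
    """
    return avg_cov * ardct, avg_cov * urdcm


def is_possible_repeat(observed, avg_cov, ardct=2.0):
    """True if the observed coverage exceeds average coverage times ardct."""
    return observed > avg_cov * ardct


flag_above, stack_cap = coverage_thresholds(8)
print(flag_above, stack_cap)      # 16.0 12.0
print(is_possible_repeat(17, 8))  # True
```

<p>
Note that the text speaks of rejection starting around read 14 or 15, so the
real check is evidently not a hard cut at exactly 12 reads; the sketch shows
the order of magnitude only.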
</p><p>
Tough luck. I do have ideas on how to re-integrate those reads at the end of an
assembly, but I have deferred doing this as in every case I had looked up,
adding those reads to the contigs wouldn't have changed anything ... there's
already enough coverage.
</p><p>
What should be done in those cases is simply to filter away the contiglets
(defined as being of small size and having an average coverage below the
average coverage of the project divided by 3 (or 2.5)) from a project.
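</p><p>
Such a filter could look like the following Python sketch. It is an
illustration under assumptions: the text does not say what "small size" means,
so the length limit <code class="literal">max_len=1000</code> is a
hypothetical value, as are the function name and the tuple layout.
</p>

```python
def split_contiglets(contigs, project_avg_cov, max_len=1000, divisor=3.0):
    """Split (name, length, avg_cov) tuples into (kept, contiglets).

    A contig counts as contiglet when it is shorter than max_len (a
    hypothetical limit) and its average coverage is below the project
    average divided by `divisor`.
    """
    kept, contiglets = [], []
    for name, length, avg_cov in contigs:
        small = max_len > length
        thin = project_avg_cov / divisor > avg_cov
        (contiglets if small and thin else kept).append(name)
    return kept, contiglets


contigs = [("c1", 52000, 8.3), ("c2", 600, 1.9), ("c3", 700, 7.5)]
print(split_contiglets(contigs, project_avg_cov=8.0))
# (['c1', 'c3'], ['c2'])
```

<p>
Only the short contig with low coverage is removed; the short contig with
normal coverage is kept.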
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_keeping_'long'_repetitive_contigs_separate"></a>3.11.3.2.
Keeping 'long' repetitive contigs separate
</h4></div></div></div><p>
Since version 2.9.36, MIRA has had a feature to keep long repeats in separate
contigs. Due to algorithm changes, this feature is now standard. The
effect of this is that contigs with non-repetitive sequence will
stop at a 'long repeat' border which cannot be crossed by a single
read or by paired reads, including only the first few bases of the
repeat. Long repeats will be kept as separate contigs.
</p><p>
This has been implemented to get a clean overview on which parts of
an assembly are 'safe' and which parts will be 'difficult'. For
this, the naming of the contigs has been extended: contigs named
with a '_c' at the end are contigs which contain mostly 'normal'
coverage. Contigs with "rep_c" are contigs which contain mostly
sequence classified as repetitive and which could not be assembled
together with a 'c' contig.
</p><p>
The question remains: what are 'long' repeats? MIRA defines these as
repeats that are not spanned by any read that has non-repetitive
parts at the end. Basically, for shotgun assemblies, the mean
length of the reads that go into the assembly defines the minimum
length of 'long' repeats that have to be kept in separate contigs.
</p><p>
It has to be noted that when using paired-end (or template)
sequencing, 'long' repeats which can be spanned by read-pairs (or
templates) are frequently integrated into 'normal' contigs as MIRA
can correctly place them most of the time.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_helping_finishing_by_tagging_reads_with_haf_tags"></a>3.11.3.3.
Helping finishing by tagging reads with HAF tags
</h4></div></div></div><p>
HAF tags (HAsh Frequency) are set by MIRA when the option to colour reads by
kmer frequency ([-GE:crkf], on by default in most --job combinations)
is on. These tags show the status of k-mers (stretch of bases of given length
<span class="emphasis"><em>k</em></span>) in read sequences: whether MIRA recognised them as being present in
sub-average, average, above average or repetitive numbers.
</p><p>
When using a finishing program which can display tags in reads (and using the
proposed tag colour schemes for gap4 or consed), the assembly
will light up in colours ranging from light green to dark red, indicating
whether a certain part of the assembly is deemed non-repetitive to extremely
repetitive.
</p><p>
One of the biggest advantages of the HAF tags is the implicit information they
convey on why the assembler stopped building a contig at an end.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF2
tags (below average frequency, coloured light-green), then one very probably
has a hole in the contig due to coverage problems which means there are no
or not enough reads covering a part of the sequence.
</p></li><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF3
tags (average frequency, coloured green), then you have an unusual situation
as this should only very rarely occur. The reason is that MIRA saw that
there are enough sequences which look the same as the one from your contig
end, but that these could not be joined. Likely reasons for this scenario
include non-random sequencing artifacts (seen in 454 data) or also
non-random chimeric reads (seen in Sanger and 454 data).
</p></li><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF4
tags (above average frequency, coloured yellow), then the assembler stopped
at a grey zone where the coverage is not normal anymore, but not quite
repetitive yet. This can happen in cases where the read coverage is very
unevenly distributed across the project. The contig end in question might be
a repeat occurring two times in the sequence, but having fewer reads than
expected. Or it may be non-repetitive coverage with an unusual excess of
reads.
</p></li><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF5
(repeat, coloured red), HAF6 (heavy repeat, coloured darker red) and HAF7
tags (crazy repeat, coloured very dark red), then there is a repetitive area
in the sequence which could not be uniquely bridged by the reads present in
the assembly.
</p></li></ul></div><p>
</p><p>
This information can be especially helpful when joining reads by hand in a
finishing program. The following list gives you a short guide to cases which
are most likely to occur and what you should do.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF2
tags. Joining these contigs is probably a safe bet. The assembly may have
missed this join because of too many errors in the read ends or because
sequence which could have been useful to join the contigs was clipped away.
Just check whether the join seems sensible, then join.
</p></li><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF3
tags. Joining these contigs is probably a safe bet. The assembly may have
missed this join because of several similar chimeric reads or reads
with similar, severe sequencing errors covering the same spot.
Just check whether the join seems sensible, then join.
</p></li><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF4
tags. Joining these contigs should be done with some caution: it
may be a repeat occurring twice in the sequence. Check whether
the contig ends in question align with ends of several other
contigs. If not, joining is probably the way to go. If potential
joins exist with several other contigs, then it's a repeat (see
below).
</p></li><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF5, HAF6 or
HAF7 tags. Joining these contigs should be done with utmost caution, you are
almost certainly (HAF5) and very certainly (HAF6 and HAF7) in a repetitive
area of your sequence.
You will probably need additional information like paired-end or template
info in order to join your contigs.
</p></li></ul></div><p>
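The two lists above can be condensed into a lookup table, sketched here in
Python. The table content is taken from the text; the function
<code class="literal">join_advice</code> and its input format (number of read
bases covered per HAF tag at a contig end) are assumptions made for
illustration.
</p>

```python
# Meaning and joining advice per HAF tag, as described in the text above.
HAF_INFO = {
    "HAF2": ("below average frequency", "probably safe to join"),
    "HAF3": ("average frequency", "probably safe to join"),
    "HAF4": ("above average frequency", "join with some caution"),
    "HAF5": ("repeat", "utmost caution, extra information needed"),
    "HAF6": ("heavy repeat", "utmost caution, extra information needed"),
    "HAF7": ("crazy repeat", "utmost caution, extra information needed"),
}


def join_advice(bases_per_tag):
    """Return (tag, meaning, advice) for the tag covering most bases.

    `bases_per_tag` maps a HAF tag name to the number of read bases it
    covers at a contig end (a hypothetical input format).
    """
    tag = max(bases_per_tag, key=bases_per_tag.get)
    meaning, advice = HAF_INFO[tag]
    return tag, meaning, advice


print(join_advice({"HAF2": 310, "HAF3": 40}))
# ('HAF2', 'below average frequency', 'probably safe to join')
```

<p>
The lookup is no substitute for checking by eye whether a proposed join looks
sensible.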
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_consensus_in_finishing_programs_gap4_consed_"></a>3.11.4.
Consensus in finishing programs (gap4, consed, ...)
</h3></div></div></div><p>
MIRA goes a long way to calculate a consensus which is as correct as
possible. Unfortunately, communication with finishing programs is a bit
problematic as there currently is no standard way to say which reads are from
which sequencing technology.
</p><p>
It is therefore often the case that finishing programs calculate their own
consensus when loading a project assembled with MIRA. This is the case for
gap4, for example. This consensus may then not be optimal.
</p><p>
The recommended way to deal with this problem is: import the results from MIRA
into your finishing program like you always do. Then finish the genome there,
export the project from the finishing program as CAF and finally use
miraconvert (from the MIRA package) with the "-r" option to
recalculate the optimal consensus of your finished project.
</p><p>
E.g., assuming you have just finished editing the gap4 database
<code class="filename">DEMO.3</code>, do the following. First, export the gap4 database back to
CAF:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>gap2caf -project DEMO -version 3 >demo3.caf</code></strong></pre><p>
</p><p>
Then, use <span class="command"><strong>miraconvert</strong></span> <span class="emphasis"><em>with</em></span> <span class="emphasis"><em>option</em></span> <span class="emphasis"><em>'-r'</em></span> to
convert it into any other format that you need. Example for converting to a
CAF and a FASTA format with correct consensus:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t caf -t fasta -r c demo3.caf final_result</code></strong></pre><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_some_other_things_to_consider"></a>3.11.5.
Some other things to consider
</h3></div></div></div><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
MIRA cannot work with EXP files resulting from GAP4 that already
have been edited. If you want to reassemble an edited GAP4 project, convert
it to CAF format and use the [-caf] option to load.
</p></li><li class="listitem"><p>
As also explained earlier, MIRA relies on sequencing vector being
recognised in preprocessing steps by other programs. Sometimes, when a whole
stretch of bases is not correctly marked as sequencing vector, the reads
might not be aligned into a contig although they might otherwise match quite
perfectly. You can use [-CL:pvc] and [-CO:emea] to address
problems with incomplete clipping of sequencing vectors. Having the
assembler work with less strict parameters may also help here.
</p></li><li class="listitem"><p>
MIRA has been developed to assemble shotgun sequencing or EST
sequencing data. There are no explicit limitations concerning length or
number of sequences. However, there are a few implicit assumptions that were
made while writing portions of the code:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Problems might arise with 'unnaturally' long sequence
reads because of my implementation of the Smith-Waterman alignment
routines. I use a banded version with linear running time
(linear to the bandwidth) but quadratic space usage. So,
comparing two 'reads' of length 5000 will result in memory
usage of 95 MiB, two reads with 50000 bases will need 9.5 GiB.
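</p><p>
The quadratic growth can be checked with a few lines of Python. The sketch
assumes a full matrix with 4 bytes per cell; that cell size is an assumption,
chosen because it reproduces the 95 MiB figure for two reads of 5000 bases.
</p>

```python
MIB = 2 ** 20
GIB = 2 ** 30


def sw_matrix_bytes(len_a, len_b, bytes_per_cell=4):
    """Memory for a full alignment matrix (bytes_per_cell is an assumption)."""
    return len_a * len_b * bytes_per_cell


print(round(sw_matrix_bytes(5000, 5000) / MIB, 1))    # 95.4
print(round(sw_matrix_bytes(50000, 50000) / GIB, 1))  # 9.3
```

<p>
A tenfold increase in read length therefore costs a hundredfold increase in
memory, which is in the range of the figures quoted above.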
</p><p>
This problem has become acute now with PacBio; I'm working on
it. In the meantime, currently usable read lengths from PacBio
are more in the 3 to 4 kilobase range, with only a few reads
attaining or surpassing 20 kb. So today's machines should
still be able to handle the problem more or less effortlessly.
</p></li><li class="listitem"><p>
32 bit versions of MIRA are not supported anymore.
</p></li><li class="listitem"><p>
MIRA is not fully multi-threaded (yet), though most
bottlenecks are now in code areas which cannot be
multi-threaded by algorithm design.
</p></li></ol></div></li><li class="listitem"><p>
To reduce memory overhead, MIRA assumes that a project does not contain
sequences from more than 255 different:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
sequencing machine types
</p></li><li class="listitem"><p>
primers
</p></li><li class="listitem"><p>
strains (in mapping mode: 7)
</p></li><li class="listitem"><p>
base callers
</p></li><li class="listitem"><p>
dyes
</p></li><li class="listitem"><p>
process status
</p></li></ul></div></li><li class="listitem"><p>
It is likewise assumed that a project does not contain sequences from more than 65535 different:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
clone vectors
</p></li><li class="listitem"><p>
sequencing vectors
</p></li></ul></div></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_things_you_should_not_do"></a>3.12.
Things you should not do
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_never_on_nfs"></a>3.12.1.
Do not run MIRA on NFS mounted directories without redirecting the tmp directory
</h3></div></div></div><p>
Of course one can run MIRA atop an NFS mount (a "disk" mounted over a
network using the NFS protocol), but the performance will go down the
drain as the NFS server or the network will not be able to
cope with the amount of data MIRA needs to shift to and from disk
(writes/reads to the tmp directory). Slowdowns by a factor of 10 and
more have been observed. In case you have no other possibility, you
can force MIRA to run atop an NFS mount using [-NW:cnfs=warn]
(or [-NW:cnfs=no]), but you have been warned.
</p><p>
In case you want to keep input and output files on NFS, you can use
[-DI:trt] to redirect the tmp directory to a local
filesystem. Then MIRA will run at almost full speed.
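</p><p>
To find out beforehand whether a directory sits on NFS, one can look it up in
the mount table. The following Python sketch is only an illustration and not
part of MIRA: the function names are made up, and it assumes a Linux system
where the mount table has the format of
<code class="filename">/proc/mounts</code>.
</p><pre class="screen">
```python
# Sketch: find the filesystem type behind a path from a Linux mount table
# (the format of /proc/mounts). Function names are illustrative only.
import posixpath


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    target = posixpath.normpath(path)
    candidates = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Field 2 is the mount point, field 3 the filesystem type.
        point, fstype = fields[1], fields[2]
        prefix = point.rstrip("/") + "/"
        if target == point or target.startswith(prefix):
            candidates.append((len(point), fstype))
    if not candidates:
        return None
    longest = max(length for length, _ in candidates)
    # When several entries share a mount point, the last one listed wins.
    return [fstype for length, fstype in candidates if length == longest][-1]


def on_nfs(path, mounts_text):
    """True if path lives on a filesystem whose type starts with 'nfs'."""
    fstype = filesystem_type(path, mounts_text)
    return fstype is not None and fstype.startswith("nfs")
```
</pre><p>
Called with the contents of <code class="filename">/proc/mounts</code> and
the directory you plan to run in, <code class="literal">on_nfs()</code>
tells you whether redirecting the tmp directory with [-DI:trt] is worth
considering.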
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_never_without_quality_values"></a>3.12.2.
Do not assemble without quality values
</h3></div></div></div><p>
Assembling sequences without quality values is like ... like ... like
driving a car down a winding mountain road with no guard rails at 200
km/h without brakes, airbags or a steering wheel. With a ravine on
one side and a rock face on the other. Did I mention the missing
seat-belts? You <span class="emphasis"><em>might</em></span> get down safely, but
experience tells that the result will more likely be a bloody mess.
</p><p>
Well, assembling without quality values is a bit like the above, but
bloodier. And the worst: you (or the people using the results of such
an assembly) will notice the gore only when it is way too late and
money has been sunk in follow-up experiments based on wrong data.
</p><p>
All MIRA routines internally are geared toward quality values guiding
decisions. No one should ever assemble anything without quality
values. Never. Ever. Even if quality values are sometimes inaccurate,
they do help.
</p><p>
Now, there are <span class="bold"><strong>very rare occasions</strong></span>
where getting quality values is not possible. If you absolutely cannot
get them, and I mean only in this case, use the following
switch: <code class="literal">--noqualities[=SEQUENCINGTECHNOLOGY]</code> and
additionally give a default quality for reads of a readgroup. E.g.:
</p><pre class="screen">parameters= --noqualities=454
readgroup
technology=454
data=...
default_qual=30</pre><p>
This tells MIRA not to complain about missing quality values and to
fake a quality value of 30 for all reads (of a readgroup) having no
qualities, allowing some MIRA routines (in standard parameter
settings) to start disentangling your repeats.
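</p><p>
To get a feeling for what a default quality of 30 means: on the usual Phred
scale, a quality Q corresponds to an error probability of 10^(-Q/10). The
short Python sketch below assumes that the default quality is interpreted on
that scale; the function names are illustrative only.
</p><pre class="screen">
```python
# Sketch: conversion between Phred quality values and error probabilities.
import math


def phred_to_error_probability(quality):
    """Error probability implied by a Phred quality value."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred quality value implied by an error probability."""
    return -10 * math.log10(probability)


if __name__ == "__main__":
    for quality in (10, 20, 30, 40):
        print(quality, phred_to_error_probability(quality))
```
</pre><p>
A faked quality of 30 therefore claims one expected error per 1000 bases for
every base of the read, whatever its real reliability.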
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Doing the above has some severe side-effects. You will be, e.g., at
the mercy of non-random sequencing errors. I suggest combining the
above with a [-CO:mrpg=4] or higher. You also may want to
tune the default quality parameter together with [-CO:mnq]
and [-CO:mgqrt] in cases where you mix sequences with and
without quality values.
</td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_useful_third_party_programs"></a>3.13.
Useful third party programs
</h2></div></div></div><p>
Viewing the results of a MIRA assembly or preprocessing the sequences
for an assembly can be done with a number of different programs. The
following ones are just examples, there are a lot more packages
available:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
HTML browser
</span></dt><dd><p> If you really have nothing else as a viewer, a browser that
understands tables is needed to view the HTML output. A browser knowing
style sheets (CSS) is recommended, as different tags will be highlighted.
Konqueror, Opera, Mozilla, Netscape and Internet Explorer all do fine, lynx
is not really ... optimal.
</p></dd><dt><span class="term">
Assembly viewer / finishing / preprocessing
</span></dt><dd><p>
You'll want GAP4 or its successor GAP5 (generally speaking: the
Staden package) to preprocess the sequences, visualise and
eventually rework the results when using gap4da output. The Staden
package comes with a fully featured sequence preparing and
annotating engine (pregap4) that is very useful to preprocess your
Sanger data (conversion between file types, quality clipping,
tagging etc.).
</p><p>
See <a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for
further information and also a possibility to download precompiled
binaries for different platforms.
</p></dd><dt><span class="term">
Vector screening
</span></dt><dd><p>
Reading result files from <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> from the Sanger Centre is supported
directly by MIRA to perform a fast and efficient tagging of
sequencing vector stretches. This makes you basically independent
from any other commercial or license-requiring vector screening
software. For Sanger reads, a combination of
<span class="command"><strong>lucy</strong></span> (see below), <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> together with the MIRA parameters for
SSAHA2 / SMALT support (see all [-CL:msvs*] parameters) and quality clipping
( [-CL:qc]) should do the trick. For reads coming from 454
pyro-sequencing, <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> and the SSAHA2 / SMALT support also work
pretty well.
</p><p>
See
<a class="ulink" href="http://www.sanger.ac.uk/resources/software/ssaha2/" target="_top">http://www.sanger.ac.uk/resources/software/ssaha2/</a>
and / or <a class="ulink" href="http://www.sanger.ac.uk/resources/software/smalt/" target="_top">http://www.sanger.ac.uk/resources/software/smalt/</a> for
further information and also a possibility to download the source
or precompiled binaries for different platforms.
</p></dd><dt><span class="term">
Preprocessing
</span></dt><dd><p> <span class="command"><strong>lucy</strong></span> from TIGR (now JCVI) is another
useful sequence preprocessing program for Sanger data. Lucy is a
utility that prepares raw DNA sequence fragments for sequence
assembly. The cleanup process includes quality assessment,
confidence reassurance, vector trimming and vector removal.
</p><p>
There's a small script in the MIRA 3rd party package which
converts the clipping data from the lucy format into something
MIRA can understand (NCBI Traceinfo).
</p><p>
See <a class="ulink" href="ftp://ftp.tigr.org/pub/software/Lucy/" target="_top">ftp://ftp.tigr.org/pub/software/Lucy/</a> to download the source code
of lucy.
</p></dd><dt><span class="term">
Assembly viewer
</span></dt><dd><p> Viewing <code class="filename">.ace</code> file output without consed
can be done with clview from TIGR. See
<a class="ulink" href="http://www.tigr.org/tdb/tgi/software/" target="_top">http://www.tigr.org/tdb/tgi/software/</a>.
</p><p>
A better alternative is Tablet <a class="ulink" href="http://bioinf.scri.ac.uk/tablet/" target="_top">http://bioinf.scri.ac.uk/tablet/</a> which also reads SAM
format.
</p></dd><dt><span class="term">
Assembly coverage analysis
</span></dt><dd><p>
The Integrated Genome Browser (IGB) of the GenoViz project at
SourceForge (<a class="ulink" href="http://sourceforge.net/projects/genoviz/" target="_top">http://sourceforge.net/projects/genoviz/</a>) is just perfect
for loading a genome and looking at mapping coverage (provided by
the wiggle result files of MIRA).
</p></dd><dt><span class="term">
Preprocessing (base calling)
</span></dt><dd><p>
TraceTuner (<a class="ulink" href="http://sourceforge.net/projects/tracetuner/" target="_top">http://sourceforge.net/projects/tracetuner/</a>) is a tool for
base and quality calling of trace files from DNA sequencing
instruments. Originally developed by Paracel, this code base was
released as open source in 2006 by Celera.
</p></dd><dt><span class="term">
Preprocessing / viewing
</span></dt><dd><p> phred (basecaller) - cross_match (sequence comparison and
filtering) - phrap (assembler) - consed (assembly viewer and
editor). This is another package that can be used for this type of
job, but requires more programming work. The fact that sequence
stretches are masked out (overwritten with the character X) if they
shouldn't be used in an assembly doesn't really help and is
considered harmful (but it works).
</p><p>
Note the bug of consed when reading ACE files, see more about this
in the section on file types (above) in the entry for ACE.
</p><p>
See <a class="ulink" href="http://www.phrap.org/" target="_top">http://www.phrap.org/</a> for further information.
</p></dd><dt><span class="term">
text viewer
</span></dt><dd><p> A text viewer for the different textual output files.
</p></dd></dl></div><p>
As always, most of the time a combination of several different packages
is possible. My currently preferred combo for genome projects is
<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> and/or
<span class="command"><strong>lucy</strong></span> (vector screening), MIRA (assembly, of course)
and gap4 (assembly viewing and finishing).
</p><p>
For re-assembling projects that were edited in gap4, one will also need
the gap2caf converter. The source for this is available at
<a class="ulink" href="http://www.sanger.ac.uk/resources/software/caf.html" target="_top">http://www.sanger.ac.uk/resources/software/caf.html</a>.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_speed_and_memory_considerations"></a>3.14.
Speed and memory considerations
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_memory"></a>3.14.1.
Estimating needed memory for an assembly project
</h3></div></div></div><p>
Since version V2.9.24x3 of MIRA, there is the <span class="command"><strong>miramem</strong></span>
program. When called from the command line, it will ask a number of
questions and then print out an estimate of the amount of RAM needed to
assemble the project. Take this estimate with a grain of salt: depending on
the properties of the sequences, the estimate can be off by +/- 30% for
bacteria and 'simple' eukaryotes. The higher the number of repeats is, the
more likely you will need to restrict memory usage in some way or another.
</p><p>
Here's the transcript of a session with miramem:
</p><pre class="screen">
This is MIRA V3.2.0rc1 (development version).
Please cite: Chevreux, B., Wetter, T. and Suhai, S. (1999), Genome Sequence
Assembly Using Trace Signals and Additional Sequence Information.
Computer Science and Biology: Proceedings of the German Conference on
Bioinformatics (GCB) 99, pp. 45-56.
To (un-)subscribe the MIRA mailing lists, see:
http://www.chevreux.org/mira_mailinglists.html
After subscribing, mail general questions to the MIRA talk mailing list:
mira_talk@freelists.org
To report bugs or ask for features, please use the SourceForge ticketing
system at:
http://sourceforge.net/p/mira-assembler/tickets/
This ensures that requests do not get lost.
[...]
miraMEM helps you to estimate the memory needed to assemble a project.
Please answer the questions below.
Defaults are give in square brackets and chosen if you just press return.
Hint: you can add k/m/g modifiers to your numbers to say kilo, mega or giga.
Is it a genome or transcript (EST/tag/etc.) project? (g/e/) [g]
g
Size of genome? [4.5m] <strong class="userinput"><code>9.8m</code></strong>
9800000
Size of largest chromosome? [9800000]
9800000
Is it a denovo or mapping assembly? (d/m/) [d]
d
Number of Sanger reads? [0]
0
Are there 454 reads? (y/n/) [n] <strong class="userinput"><code>y</code></strong>
y
Number of 454 GS20 reads? [0]
0
Number of 454 FLX reads? [0]
0
Number of 454 Titanium reads? [0] <strong class="userinput"><code>750k</code></strong>
750000
Are there PacBio reads? (y/n/) [n]
n
Are there Solexa reads? (y/n/) [n]
n
************************* Estimates *************************
The contigs will have an average coverage of ~ 30.6 (+/- 10%)
RAM estimates:
reads+contigs (unavoidable): 7.0 GiB
large tables (tunable): 688. MiB
---------
total (peak): 7.7 GiB
add if using -CL:pvlc=yes : 2.6 GiB
Estimates may be way off for pathological cases.
Note that some algorithms might try to grab more memory if
the need arises and the system has enough RAM. The options
for automatic memory management control this:
-AS:amm, -AS:kpmf, -AS:mps
Further switches that might reduce RAM (at cost of run time
or accuracy):
-SK:mkim, -SK:mchr (both runtime); -SK:mhpr (accuracy)
*************************************************************</pre><p>
If your RAM is not large enough, you can still assemble projects by
using disk swap. Up to 20% of the needed memory can be provided by
swap without the speed penalty getting too large. Going above 20% is
not recommended though: above 30%, the machine will be almost
permanently swapping at some point or another.
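</p><p>
The coverage figure in the transcript and the swap rule of thumb are simple
arithmetic. The Python sketch below is not how miramem computes its
estimates; the mean read length of 400 bases is an assumption, chosen because
it reproduces the coverage of about 30.6 shown above, and the function names
are illustrative only.
</p><pre class="screen">
```python
# Sketch: back-of-the-envelope numbers for planning an assembly.
# The mean read length of 400 bases is an assumption, not miramem output.


def average_coverage(number_of_reads, mean_read_length, genome_size):
    """Expected average coverage of the genome."""
    return number_of_reads * mean_read_length / genome_size


def minimum_physical_ram(peak_estimate, swap_fraction=0.20):
    """RAM needed if at most swap_fraction of the peak estimate comes from swap."""
    return peak_estimate * (1 - swap_fraction)


if __name__ == "__main__":
    print(round(average_coverage(750_000, 400, 9_800_000), 1))
    print(round(minimum_physical_ram(7.7), 2))
```
</pre><p>
For the session above, the 7.7 GiB peak estimate with at most 20% provided by
swap leaves roughly 6.2 GiB that should be available as physical RAM.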
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_speed"></a>3.14.2.
Some numbers on speed
</h3></div></div></div><p>
To be rewritten for MIRA4.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_known_problems_bugs"></a>3.15.
Known Problems / Bugs
</h2></div></div></div><p>
File Input / Output:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
MIRA can only read unedited EXP files.
</p></li><li class="listitem"><p>
There sometimes is a (rather important) memory leak occurring while
using the assembly integrated Sanger read editor. I have not been
able to trace the reason yet.
</p></li></ol></div><p>
</p><p>
Assembly process:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
The routines for determining <span class="emphasis"><em>Repeat Marker
Bases</em></span> (SRMr) are sometimes too sensitive, which can
lead to excessive base tagging and prevent correct assemblies in
subsequent assembly passes. The parameters you should look at for
this problem are
[-CO:mrc:nrz:mgqrt:mgqwpc]. Also look at [-CL:pvc] and
[-CO:emea] if you have a lot of sequencing vector relics at the
end of the sequences.
</p></li></ol></div><p>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_todos"></a>3.16.
TODOs
</h2></div></div></div><p>
These are some of the topics on my TODO list for the next revisions to
come:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Making Smith-Waterman parts of the process multi-threaded or use SIMD
(currently stopped due to other priorities like PacBio etc.)
</p></li></ol></div><p>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_working_principles"></a>3.17.
Working principles
</h2></div></div></div><p>
Note: description is old and needs to be adapted to the current 4.x line
of MIRA.
</p><p>
To avoid the "garbage-in, garbage-out" problem, MIRA uses a 'high
quality alignments first' contig building strategy. This means that the
assembler will start with those regions of sequences that have been
marked as good quality (high confidence region - HCR) with low error
probabilities (the clipping must have been done by the base caller or
other preprocessing programs, e.g. pregap4) and then gradually extends
the alignments as errors in different reads are resolved through error
hypothesis verification and signal analysis.
</p><p>
This assembly approach relies on some of the automatic editing
functionality provided by the EdIt package which has been integrated in
parts within MIRA.
</p><p>
This is an approximate overview on the steps that are executed while
assembling:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
All the experiment / phd / fasta sequences that act as input are
loaded (or the CAF project). Qualities for the bases are loaded from
the FASTA or SCF if needed.
</p></li><li class="listitem"><p>
The ends of the reads are cleaned to ensure they have a minimum stretch
of bases without sequencing errors.
</p></li><li class="listitem"><p>
The high confidence region (HCR) of each read is compared with a
quick algorithm to the HCR of every other read to see if it could
match and have overlapping parts (this is the 'SKIM' filter).
</p></li><li class="listitem"><p>
All the reads which could match are being checked with an adapted
Smith-Waterman alignment algorithm (banded version). Obvious
mismatches are rejected, the accepted alignments form one or several
alignment graphs.
</p></li><li class="listitem"><p>
Optional pre-assembly read extension step: MIRA tries to extend HCR
of reads by analysing the read pairs from the previous
alignment. This is a bit shaky as reads in this step have not been
edited yet, but it can help. Go back to step 2.
</p></li><li class="listitem"><p>
A contig gets made by building a preliminary partial path through
the alignment graph (through in-depth analysis up to a given level)
and then adding the most probable overlap candidates to a given
contig. Contigs may reject reads if these introduce too many errors
in the existing consensus. Errors in regions known as dangerous
(for the time being only ALUS and REPT) get additional attention by
performing simple signal analysis when alignment discrepancies
occur.
</p></li><li class="listitem"><p>
Optional: the contig can be analysed and corrected by the automatic
editor ("EdIt" for Sanger reads, or the new MIRA editor for 454
reads).
</p></li><li class="listitem"><p>
Long repeats are searched for. Bases in reads of different repeats
that have been assembled together but differ sufficiently (for EdIt,
so that they didn't get edited, and by phred quality value) get
tagged with special tags (SRMr and WRMr).
</p></li><li class="listitem"><p>
Go back to step 5 if there are reads present that have not been
assembled into contigs.
</p></li><li class="listitem"><p>
Optional: Detection of spoiler reads that prevent joining of
contigs. Remedy by shortening them.
</p></li><li class="listitem"><p>
Optional: Write out a checkpoint assembly file and go back to step 2.
</p></li><li class="listitem"><p>
The resulting project is written out to different output files and
directories.
</p></li></ol></div><p>
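The banded Smith-Waterman check mentioned in the steps above can be
illustrated with a small Python sketch. This is a textbook-style, score-only
version with made-up scoring values, not MIRA's implementation: it only
visits cells close to the main diagonal and keeps just two rows in memory.
</p><pre class="screen">
```python
# Sketch: score-only banded Smith-Waterman (local alignment).
# Scoring values and the band width are illustrative assumptions.


def banded_sw_score(a, b, band=5, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| not above band."""
    best = 0
    previous = {}  # scores of the previous row, keyed by column
    for i in range(1, len(a) + 1):
        current = {}
        first = max(1, i - band)
        last = min(len(b), i + band)
        for j in range(first, last + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous.get(j - 1, 0) + step
            up = previous.get(j, 0) + gap
            left = current.get(j - 1, 0) + gap
            # Cells outside the band count as 0, i.e. a fresh local start.
            score = max(0, diagonal, up, left)
            current[j] = score
            best = max(best, score)
        previous = current
    return best


if __name__ == "__main__":
    print(banded_sw_score("ACGTACGT", "ACGTACGT"))
    print(banded_sw_score("ACGTACGT", "ACGTTCGT"))
```
</pre><p>
The work per row is proportional to the band width, which is where the
running time advantage of a banded version comes from. The price is that
overlaps whose diagonal lies outside the band are not found.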
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_see_also"></a>3.18.
See Also
</h2></div></div></div><p>
The other MIRA manuals and walkthroughs as well as
<span class="command"><strong>EdIt</strong></span>, <span class="command"><strong>gap4</strong></span>,
<span class="command"><strong>pregap4</strong></span>, <span class="command"><strong>gap5</strong></span>,
<span class="command"><strong>clview</strong></span>, <span class="command"><strong>caf2gap</strong></span>,
<span class="command"><strong>gap2caf</strong></span>, <span class="command"><strong>ssaha2</strong></span>,
<span class="command"><strong>smalt</strong></span>, <span class="command"><strong>compress</strong></span> and
<span class="command"><strong>gzip</strong></span>, <span class="command"><strong>cap3</strong></span>,
<span class="command"><strong>ttuner</strong></span>, <span class="command"><strong>phred</strong></span>,
<span class="command"><strong>phrap</strong></span>, <span class="command"><strong>cross_match</strong></span>,
<span class="command"><strong>consed</strong></span>.
</p></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_dataprep"></a>Chapter 4. Preparing data</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_pd_introduction">4.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_sanger">4.2.
Sanger
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_454">4.3.
Roche / 454
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_illumina">4.4.
Illumina
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_pacbio">4.5.
Pacific Biosciences
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_iontor">4.6.
Ion Torrent
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_sra">4.7.
Short Read Archive (SRA)
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Rome didn't fall in a day either.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_introduction"></a>4.1.
Introduction
</h2></div></div></div><p>
Most of this chapter and many sections are just stubs at the moment.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_sanger"></a>4.2.
Sanger
</h2></div></div></div><p>
Outside MIRA: transform .ab1 to .scf, perform sequencing vector clip
(and cloning vector clip if used), basic quality clips.
</p><p>
Recommended program: <span class="command"><strong>gap4</strong></span> (or
rather <span class="command"><strong>pregap4</strong></span>) from the Staden 4 package.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_454"></a>4.3.
Roche / 454
</h2></div></div></div><p>
Outside MIRA: convert the SFF files from the Roche instrument to FASTQ,
use <span class="command"><strong>sff_extract</strong></span> for that. In case you used
"non-standard" sequencing procedures: clip away MIDs, clip away
non-standard sequencing adaptors used in that project.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_illumina"></a>4.4.
Illumina
</h2></div></div></div><p>
Outside MIRA: for heaven's sake: do NOT try to clip or trim by quality
yourself. Do NOT try to remove standard sequencing adaptors
yourself. Just leave Illumina data alone! (really, I mean it).
</p><p>
MIRA is much, much better at that job than you will probably ever be
... and I dare to say that MIRA is better at that job than 99% of all
clipping/trimming software existing out there. Just make sure you use
the [-CL:pec] (proposed_end_clip) option of MIRA.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The <span class="emphasis"><em>only</em></span> exception to the above is if you (or your
sequencing provider) used decidedly non-standard sequencing
adaptors. Then it might be worthwhile to perform own adaptor
clipping. But this will not be the case for 99% of all sequencing
projects out there.
</td></tr></table></div><p>
Joining paired-ends: if you want to do this, feel free to use any tool
which is out there (TODO: quick list). Just make sure they do not join
on very short overlaps. For me, the minimum overlap is at least 17
bases, but I more commonly use at least 30.
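</p><p>
What a minimum overlap means when joining pairs can be shown with a small
Python sketch. It is deliberately naive (exact matches only, no quality
values) and is not a replacement for a real joining tool; the function names
and the default of 30 bases are illustrative choices.
</p><pre class="screen">
```python
# Sketch: join a read pair only if the exact overlap is long enough.
# Real joining tools tolerate mismatches and use quality values.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Reverse complement of an upper-case DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def join_pair(forward, reverse, min_overlap=30):
    """Join on the longest exact overlap; None if it is below min_overlap."""
    mate = reverse_complement(reverse)
    longest = min(len(forward), len(mate))
    for size in range(longest, max(min_overlap, 1) - 1, -1):
        if forward[-size:] == mate[:size]:
            return forward + mate[size:]
    return None
```
</pre><p>
With <code class="literal">min_overlap=30</code>, a pair overlapping by only
a dozen bases stays unjoined, which is the behaviour you want from whatever
tool you pick.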
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_pacbio"></a>4.5.
Pacific Biosciences
</h2></div></div></div><p>
Outside MIRA: MIRA needs error corrected reads, either
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
PacBio CCS reads (circular consensus sequence) which you get from the
PacBio SMRTAnalysis pipeline
</li><li class="listitem">
or self-corrected reads, or reads corrected with other sequencing
technologies, which you will get either from the PacBio HGAP pipeline
or the pacbioToCA pipeline
</li></ul></div><p>
Assembly of uncorrected PacBio reads (CLR) is currently not supported
officially as of MIRA 4.0.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_iontor"></a>4.6.
Ion Torrent
</h2></div></div></div><p>
Outside MIRA: you need to convert BAM to FASTQ and to clip away
non-standard sequencing adaptors if they were used in that project. Apart from
that: leave the data alone.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_sra"></a>4.7.
Short Read Archive (SRA)
</h2></div></div></div><p>
Outside MIRA: you need to convert SRA format to FASTQ format. This is done
using <span class="command"><strong>fastq-dump</strong></span> from the SRA toolkit from the
NCBI. Make sure to have at least version 2.4.x of the toolkit. Last time
I looked (March 2015), the software was at
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software</a>, the
documentation for the whole toolkit was at
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc</a>,
and for <span class="command"><strong>fastq-dump</strong></span> it was
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump</a>
</p><p>
After extraction, proceed with preprocessing as described above,
depending on the sequencing technology used.
</p><p>
For extracting Illumina data, use something like this:
</p><pre class="screen"><code class="prompt">arcadia:/some/path$</code> <strong class="userinput"><code>fastq-dump -I --split-files <em class="replaceable"><code>somefile.sra</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
As <span class="command"><strong>fastq-dump</strong></span> unfortunately uses a pretty wasteful
variant of the FASTQ format, you might want to reduce the file size
for each FASTQ it produces by doing this:
</p><pre class="screen"><strong class="userinput"><code>sed -i '3~4 s/^+.*$/+/' <em class="replaceable"><code>file.fastq</code></em></code></strong></pre><p>
The above command performs an in-place replacement of unnecessary names
and comments on the quality divider lines of the FASTQ. The exact
translation of the <span class="command"><strong>sed</strong></span> is: do an in-place
replacement (-i); starting on the third line, then every fourth line
(3~4); substitute (s/); a line which starts (^); with a plus (+); and
then can have any character (.); repeated any number of times
including zero (*); until the end of the line ($); by just a single
plus character (/+/). Note that both -i and the 3~4 address form are
GNU sed extensions.
</p><p>
This alone reduces the file size of a typical Illumina data set with
100mers extracted from the SRA by about 15 to 20%.
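</p><p>
Where GNU <span class="command"><strong>sed</strong></span> is not at hand, the
same clean-up can be sketched in Python. The function below is only an
illustration with a made-up name; like the
<span class="command"><strong>sed</strong></span> call, it assumes FASTQ
records of exactly four lines each.
</p><pre class="screen">
```python
# Sketch: shorten the quality divider line of every FASTQ record to "+".
# Assumes four-line FASTQ records, as the sed command does.


def strip_divider_comments(lines):
    """Yield FASTQ lines with each third line of a record reduced to '+'."""
    for number, line in enumerate(lines, start=1):
        if number % 4 == 3 and line.startswith("+"):
            yield "+\n" if line.endswith("\n") else "+"
        else:
            yield line


if __name__ == "__main__":
    record = ["@read1 length=4\n", "ACGT\n", "+read1 length=4\n", "IIII\n"]
    print("".join(strip_divider_comments(record)), end="")
```
</pre><p>
Quality lines that happen to start with a plus are left alone, because only
the third line of each record is touched.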
</p></td></tr></table></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_denovo"></a>Chapter 5. De-novo assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_dn_introduction">5.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_dn_general">5.2.
General steps
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_dn_ge_copying_and_naming_the_sequence_data">5.2.1.
Copying and naming the sequence data
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_ge_writing_a_simple_manifest_file">5.2.2.
Writing a simple manifest file
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_ge_starting_assembly">5.2.3. Starting the assembly</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_dn_manifest_files_use_cases">5.3.
Manifest files for different use cases
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_dn_mf_denovo_with_shotgun_data">5.3.1.
Manifest for shotgun data
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_assembling_with_multiple_technologies">5.3.2.
Assembling with multiple sequencing technologies (hybrid assemblies)
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_manifest_for_pairedend_data">5.3.3.
Manifest for data sets with paired reads
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_denovo_with_multiple_strains">5.3.4.
De-novo with multiple strains
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">The universe is full of surprises - most of them nasty.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_introduction"></a>5.1.
Introduction
</h2></div></div></div><p>
This guide assumes that you have basic working knowledge of Unix systems, know
the basic principles of sequencing (and sequence assembly) and what assemblers
do.
</p><p>
While there are step by step instructions on how to set up your data and
then perform an assembly, this guide expects you to read at some point in time
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Before the assembly, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: “<i>Preparing data</i>”</a> to know what to do (or not to
do) with the sequencing data before giving it to MIRA.
</p></li><li class="listitem"><p>
For users with PacBio reads, <a class="xref" href="#sect_sp_pacbio_ccs" title="8.2.1. PacBio CCS reads">Section 8.2.1: “
PacBio CCS reads
”</a> has important
information regarding special parameters needed.
</p></li><li class="listitem"><p>
After the assembly, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: “<i>Working with the results of MIRA</i>”</a> to know what to do with the
results of the assembly. More specifically, <a class="xref" href="#sect_res_looking_at_results" title="9.1. MIRA output directories and files">Section 9.1: “
MIRA output directories and files
”</a>, <a class="xref" href="#sect_res_first_look:the_assembly_info" title="9.2. First look: the assembly info">Section 9.2: “
First look: the assembly info
”</a>, <a class="xref" href="#sect_res_converting_results" title="9.3. Converting results">Section 9.3: “
Converting results
”</a>, <a class="xref" href="#sect_res_filtering_of_results" title="9.4. Filtering results">Section 9.4: “
Filtering results
”</a> and <a class="xref" href="#sect_res_places_of_importance_in_a_de_novo_assembly" title="9.5. Places of importance in a de-novo assembly">Section 9.5: “
Places of importance in a de-novo assembly
”</a>.
</p></li><li class="listitem"><p>
And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: “<i>MIRA 4 reference manual</i>”</a> to look up how manifest files should be
written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2. The manifest file: basics">Section 3.4.2: “
The manifest file: basics
”</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3. The manifest file: information on the data you have">Section 3.4.3: “
The manifest file: information on the data you have
”</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4. The manifest file: extended parameters">Section 3.4.4: “
The manifest file: extended parameters
”</a>), some command line options as well as general information on
what tags MIRA uses in assemblies, files it generates etc.pp
</p></li><li class="listitem"><p>
Last but not least, you may be interested in some observations about
the different sequencing technologies and the traps they may
contain, see <a class="xref" href="#chap_seqtechdesc" title="Chapter 12. Description of sequencing technologies">Chapter 12: “<i>Description of sequencing technologies</i>”</a> for that. For advice on what to pay
attention to <span class="emphasis"><em>before</em></span> going into a sequencing
project, have a look at <a class="xref" href="#chap_seqadvice" title="Chapter 13. Some advice when going into a sequencing project">Chapter 13: “<i>Some advice when going into a sequencing project</i>”</a>.
</p></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_general"></a>5.2.
General steps
</h2></div></div></div><p>
This part shows you step by step how to get your data together
for a simple de-novo assembly. I'll make up an example using an
imaginary bacterium: <span class="emphasis"><em>Bacillus chocorafoliensis</em></span> (or
short: <span class="emphasis"><em>Bchoc</em></span>). You collected the strain you want to
assemble somewhere in the wild, so you gave the strain the name
<span class="emphasis"><em>Bchoc_wt</em></span>.
</p><p>
Just for laughs, let's assume you sequenced that bug with lots of more
or less current sequencing technologies: Sanger, 454, Illumina, Ion
Torrent and Pacific Biosciences.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_copying_and_naming_the_sequence_data"></a>5.2.1.
Copying and naming the sequence data
</h3></div></div></div><p>
You need to create (or get from your sequencing provider) the
sequencing data in any supported file format. Amongst these, FASTQ and
FASTA + FASTA-quality will be the most common, although the latter is
well on the way out nowadays. The following walkthrough uses what most
people nowadays get: FASTQ.
</p><p>
Create a new project directory (e.g. <code class="filename">myProject</code>)
and a subdirectory of this which will hold the sequencing data
(e.g. <code class="filename">data</code>).
</p><pre class="screen"><code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>mkdir myProject</code></strong>
<code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>cd myProject</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir data</code></strong></pre><p>
Put the FASTQ data into that <code class="filename">data</code> directory so
that it now looks perhaps like this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocwt_lane6.solexa.fastq</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I completely made up the file names above. You can name them any way you
want, and they can live anywhere on the hard disk; you do not
need to put them in this <code class="filename">data</code> directory. It's just
the way I do it ... and it's where the example manifest files a bit
further down in this chapter will look for the data files.
</td></tr></table></div><p>
We're almost finished with the setup. As I like to have things neatly separated, I always create a directory called <code class="filename">assemblies</code> which will hold my assemblies (or different trials) together. Let's quickly do that:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies/1sttrial</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cd assemblies/1sttrial</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_writing_a_simple_manifest_file"></a>5.2.2.
Writing a simple manifest file
</h3></div></div></div><p>
A manifest file is a configuration file for MIRA which tells it what
type of assembly it should do and which data it should load. In this
case we'll make a simple de-novo assembly of a genome with unpaired Illumina
data:
</p><pre class="screen"># Example for a manifest describing a genome de-novo assembly with
# unpaired Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# here comes the unpaired Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/bchocwt_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Please look up the parameters of the manifest file in the main
manual or the example manifest files in the following section.
</p><p>
The ones above basically say: make an accurate denovo assembly of
unpaired Illumina reads.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_starting_assembly"></a>5.2.3. Starting the assembly</h3></div></div></div><p>
Starting the assembly is now just a matter of a simple command line:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject/assemblies/1sttrial$</code> <strong class="userinput"><code>mira <em class="replaceable"><code>manifest.conf >&log_assembly.txt</code></em></code></strong></pre><p>
For this example, if you followed the walkthrough on how to prepare the data,
the only thing you might want to adapt at first is the following option in the
manifest file:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
project= (for naming your assembly project)
</p></li></ul></div><p>
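As a minimal sketch of that one change, the manifest from the previous
section could look like this with a different project name. The name
<span class="emphasis"><em>Bchoc_wt_denovo</em></span> is made up for this
illustration; everything else is unchanged:
</p><pre class="screen"># Same manifest as before, only the project name was adapted
# (Bchoc_wt_denovo is a hypothetical name, pick your own)
<strong class="userinput"><code>project = <em class="replaceable"><code>Bchoc_wt_denovo</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em>
readgroup = <em class="replaceable"><code>SomeUnpairedIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/bchocwt_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><p>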
Of course, you are free to change any option via the extended parameters, but
this is the topic of another part of this manual.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_manifest_files_use_cases"></a>5.3.
Manifest files for different use cases
</h2></div></div></div><p>
This section will introduce you to manifest files for different use
cases. It should cover the most important uses, but as always you are
free to mix and match the parameters and readgroup definitions to suit
your specific needs.
</p><p>
Taking into account that there may be <span class="emphasis"><em>a lot</em></span> of
combinations of sequencing technologies, sequencing libraries (shotgun,
paired-end, mate-pair, etc.) and input file types (FASTQ, FASTA,
GenBank, GFF3, etc.pp), the example manifest files just use Illumina and
454 as technologies, GFF3 as input file type for the reference sequence,
FASTQ as input type for sequencing data ... and they do not show the
multitude of more advanced features like, e.g., using ancillary clipping
information in XML files, ancillary masking information in SSAHA2 or
SMALT files etc.pp.
</p><p>
I'm sure you will be able to find your way by scanning through the
corresponding section on manifest files in the reference chapter :-)
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_denovo_with_shotgun_data"></a>5.3.1.
Manifest for shotgun data
</h3></div></div></div><p>
Well, we've seen that already in the section above, but here it is
again ... but this time with 454 data.
</p><pre class="screen"># Example for a manifest describing a denovo assembly with
# unpaired 454 data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# here's the 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/some454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_assembling_with_multiple_technologies"></a>5.3.2.
Assembling with multiple sequencing technologies (hybrid assemblies)
</h3></div></div></div><p>
Hybrid de-novo assemblies follow the general manifest scheme: tell MIRA
what you want in the first part, then simply add, as a separate readgroup
for each data set, the information MIRA needs to find the data, and off you
go. Just for laughs, here's a manifest for 454 shotgun with Illumina
shotgun:
</p><pre class="screen"># Example for a manifest describing a denovo assembly with
# shotgun 454 and shotgun Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the shotgun 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgun454</code></em>
data = <em class="replaceable"><code>../../data/project454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong>
# now the shotgun Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgunIllumina</code></em>
data = <em class="replaceable"><code>../../data/someillumina.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_manifest_for_pairedend_data"></a>5.3.3.
Manifest for data sets with paired reads
</h3></div></div></div><p>
When using paired-end data, you should know
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
the orientation of the reads toward each other. This is specific
to sequencing technologies and / or the sequencing library preparation.
</p></li><li class="listitem"><p>
the distance these reads should have from each other. This is specific to the
sequencing library preparation and the sequencing lab should tell
you this.
</p></li></ol></div><p>
In case you do not know one (or any) of the above, don't panic! MIRA
is able to estimate the needed values during the assembly if you tell
it to.
</p><p>
The following manifest shows you the laziest way to define a
paired data set by simply adding <span class="emphasis"><em>autopairing</em></span> as keyword to a
readgroup (using Illumina just as example):
</p><pre class="screen"># Example for a lazy manifest describing a denovo assembly with
# one library of paired reads
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina paired-end data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><p>
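Because <span class="emphasis"><em>autopairing</em></span> is a keyword of a
single readgroup, it can presumably be combined with unpaired readgroups in
the same manifest. The following sketch assumes exactly that and mixes an
unpaired 454 readgroup with one paired Illumina library; the file names are
the made-up ones already used in this chapter:
</p><pre class="screen"># Sketch: unpaired 454 data plus one paired Illumina library
# (assumption: autopairing only affects the readgroup it is written in)
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# the unpaired 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgun454</code></em>
data = <em class="replaceable"><code>../../data/project454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong>
# the paired Illumina data, pairing left to MIRA
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><p>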
If you know the orientation of the reads and/or the library size, you
can tell MIRA this in the following way (just showing the readgroup
definition here):
</p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong></pre><p>
In case you are not 100% sure about, e.g., the size of the DNA
template, you can also give a (generous) expected range and then tell
MIRA to automatically refine this range during the assembly based on
real, observed distances of read pairs. Do this with the <span class="emphasis"><em>autorefine</em></span>
modifier like this:
</p><pre class="screen"><strong class="userinput"><code>template_size = <em class="replaceable"><code>50 1000 autorefine</code></em></code></strong></pre><p>
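Put into a complete readgroup, this could look as follows. This is only a
sketch that reuses the made-up paired Illumina files from above; the range
of 50 to 1000 is a generous guess for illustration, not a recommendation:
</p><pre class="screen"># Sketch: paired Illumina readgroup with a generous template size
# range that MIRA is asked to refine during the assembly
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>50 1000 autorefine</code></em></code></strong></pre><p>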
The following manifest file is an example for assembling with several
different libraries from different technologies. Do not forget you
can use <span class="emphasis"><em>autopairing</em></span> or <span class="emphasis"><em>autorefine</em></span> :-)
</p><pre class="screen"># Example for a manifest describing a denovo assembly with
# several kinds of sequencing libraries
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina paired-end data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaForPairedEnd500bpLib</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong>
# now the Illumina mate-pair data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaForMatePair3kbLib</code></em>
data = <em class="replaceable"><code>../../data/project3kb-1.fastq ../../data/project3kb-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>2500 3500</code></em>
segment_placement = <em class="replaceable"><code><--- ---></code></em></code></strong>
# some Sanger data (6kb library)
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSanger6kbLib</code></em>
data = <em class="replaceable"><code>../../data/sangerdata.fastq</code></em>
technology = <em class="replaceable"><code>sanger</code></em>
template_size = <em class="replaceable"><code>5500 6500</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong>
# some 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataFor454Pairs</code></em>
data = <em class="replaceable"><code>../../data/454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em>
template_size = <em class="replaceable"><code>8000 1200</code></em>
segment_placement = <em class="replaceable"><code>2---> 1---></code></em></code></strong>
# some Ion Torrent data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForIonPairs</code></em>
data = <em class="replaceable"><code>../../data/iondata.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em>
template_size = <em class="replaceable"><code>1000 300</code></em>
segment_placement = <em class="replaceable"><code>2---> 1---></code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_denovo_with_multiple_strains"></a>5.3.4.
De-novo with multiple strains
</h3></div></div></div><p>
MIRA will make use of ancillary information present in the manifest
file. One of these is the information to which strain (or organism or
cell line etc.pp) the generated data belongs.
</p><p>
You just need to tell in the manifest file which data comes from which
strain. Let's assume that in the example from above, the "lane6" data
were from a first mutant named <span class="emphasis"><em>bchoc_se1</em></span> and the
"lane7" data were from a second mutant
named <span class="emphasis"><em>bchoc_se2</em></span>. Here's the manifest file you
would write then:
</p><pre class="screen"># Example for a manifest describing a de-novo assembly with
# unpaired Illumina data, but from multiple strains
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE2</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane7.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
While assembling de-novo (or mapping) with multiple strains is
possible, the interpretation of results may become a bit daunting in
some cases. For many scenarios it might therefore be preferable to
use the data sets successively in their own assemblies or mappings.
</td></tr></table></div><p>
This <span class="emphasis"><em>strain</em></span> information for each readgroup is
really the only change you need to perform to tell MIRA everything it
needs for handling strains.
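</p><p>
As a sketch of the alternative mentioned in the note above, each strain can
get its own manifest and its own assembly. The example below covers only
<span class="emphasis"><em>bchoc_se1</em></span>; the project name is
hypothetical, and a second manifest for
<span class="emphasis"><em>bchoc_se2</em></span> would differ only in project
name, readgroup name, data file and strain:
</p><pre class="screen"># Sketch: one strain per assembly instead of a multi-strain assembly
# (BchocSE1Only is a made-up project name)
<strong class="userinput"><code>project = <em class="replaceable"><code>BchocSE1Only</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# only the data of the first mutant
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong></pre><p>
Running each such manifest in a directory of its own (for example a
hypothetical <code class="filename">assemblies/se1only</code>) keeps the
results of the strains apart.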
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_mapping"></a>Chapter 6. Mapping assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_map_introduction">6.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_map_general">6.2.
General steps
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_ge_copying_and_naming_the_sequence_data">6.2.1.
Copying and naming the sequence data
</a></span></dt><dt><span class="sect2"><a href="#sect_map_ma_copying_and_naming_the_reference_sequence">6.2.2.
Copying and naming the reference sequence
</a></span></dt><dt><span class="sect2"><a href="#sect_map_ge_writing_a_simple_manifest_file">6.2.3.
Writing a simple manifest file
</a></span></dt><dt><span class="sect2"><a href="#sect_map_ge_starting_assembly">6.2.4. Starting the assembly</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_manifest_files_use_cases">6.3.
Manifest files for different use cases
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_shotgun_data">6.3.1.
Mapping with shotgun data
</a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_manifest_for_pairedend_data">6.3.2.
Manifest for data sets with paired reads
</a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_multiple_technologies">6.3.3.
Mapping with multiple sequencing technologies (hybrid mapping)
</a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_multiple_strains">6.3.4.
Mapping with multiple strains
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_walkthroughs">6.4.
Walkthroughs
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_walkthrough:_mapping_of_ecoli_from_lenski_lab_against_ecoli_b_rel606">6.4.1.
Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_useful_about_reference_sequences">6.5.
Useful things to know about reference sequences
</a></span></dt><dt><span class="sect1"><a href="#sect_map_known_bugs_problems">6.6.
Known bugs / problems
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">You have to know what you're looking for before you can find it.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_introduction"></a>6.1.
Introduction
</h2></div></div></div><p>
This guide assumes that you have basic working knowledge of Unix systems, know
the basic principles of sequencing (and sequence assembly) and what assemblers
do.
</p><p>
While there are step by step instructions on how to set up your data and
then perform an assembly, this guide expects you to read at some point in time
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Before the mapping, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: “<i>Preparing data</i>”</a> to know what to do (or not to
do) with the sequencing data before giving it to MIRA.
</p></li><li class="listitem"><p>
After the mapping, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: “<i>Working with the results of MIRA</i>”</a> to know what to do with the
results of the assembly. More specifically, <a class="xref" href="#sect_res_converting_results" title="9.3. Converting results">Section 9.3: “
Converting results
”</a>, <a class="xref" href="#sect_res_places_of_interest_in_a_mapping_assembly" title="9.6. Places of interest in a mapping assembly">Section 9.6: “
Places of interest in a mapping assembly
”</a> and <a class="xref" href="#sect_res_postprocessing_mapping_assemblies" title="9.7. Post-processing mapping assemblies">Section 9.7: “
Post-processing mapping assemblies
”</a>.
</p></li><li class="listitem"><p>
And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: “<i>MIRA 4 reference manual</i>”</a> to look up how manifest files should be
written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2. The manifest file: basics">Section 3.4.2: “
The manifest file: basics
”</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3. The manifest file: information on the data you have">Section 3.4.3: “
The manifest file: information on the data you have
”</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4. The manifest file: extended parameters">Section 3.4.4: “
The manifest file: extended parameters
”</a>), some command line options as well as general information on
what tags MIRA uses in assemblies, files it generates etc.pp
</p></li><li class="listitem"><p>
Last but not least, you may be interested in some observations about
the different sequencing technologies and the traps they may
contain, see <a class="xref" href="#chap_seqtechdesc" title="Chapter 12. Description of sequencing technologies">Chapter 12: “<i>Description of sequencing technologies</i>”</a> for that. For advice on what to pay
attention to <span class="emphasis"><em>before</em></span> going into a sequencing
project, have a look at <a class="xref" href="#chap_seqadvice" title="Chapter 13. Some advice when going into a sequencing project">Chapter 13: “<i>Some advice when going into a sequencing project</i>”</a>.
</p></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_general"></a>6.2.
General steps
</h2></div></div></div><p>
This part shows you step by step how to get your data together for a
simple mapping assembly.
</p><p>
I'll make up an example using an imaginary bacterium: <span class="emphasis"><em>Bacillus chocorafoliensis</em></span> (or short: <span class="emphasis"><em>Bchoc</em></span>).
</p><p>
In this example, we assume you have two strains: a wild type strain
<span class="emphasis"><em>Bchoc_wt</em></span> and a mutant which you perhaps got from mutagenesis or other
means. Let's imagine that this mutant needs more time to eliminate a given
amount of chocolate, so we call the mutant <span class="emphasis"><em>Bchoc_se</em></span> ... SE for
<span class="bold"><strong>s</strong></span>low <span class="bold"><strong>e</strong></span>ater.
</p><p>
You want to know which mutations might be responsible for the observed
behaviour. Assume the genome of <span class="emphasis"><em>Bchoc_wt</em></span> is available to you as it was
published (or you previously sequenced it), so you resequenced <span class="emphasis"><em>Bchoc_se</em></span>
with Solexa to examine mutations.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_copying_and_naming_the_sequence_data"></a>6.2.1.
Copying and naming the sequence data
</h3></div></div></div><p>
You need to create (or get from your sequencing provider) the sequencing data
in either FASTQ or FASTA + FASTA quality format. The following walkthrough
uses what most people nowadays get: FASTQ.
</p><p>
Create a new project directory (e.g. <code class="filename">myProject</code>) and a subdirectory of this which will hold the sequencing data (e.g. <code class="filename">data</code>).
</p><pre class="screen"><code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>mkdir myProject</code></strong>
<code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>cd myProject</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir data</code></strong></pre><p>
Put the FASTQ data into that <code class="filename">data</code> directory so that it now looks perhaps like this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_lane6.solexa.fastq
-rw-r--r-- 1 bach users 264823645 2008-03-28 21:51 bchocse_lane7.solexa.fastq</pre></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I completely made up the file names above. You can name them any way you
want. And you can have them live anywhere on the hard disk, you do not
need to put them in this <code class="filename">data</code> directory. It's just
the way I do it ... and it's where the example manifest files a bit further down
in this chapter will look for the data files.
</td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ma_copying_and_naming_the_reference_sequence"></a>6.2.2.
Copying and naming the reference sequence
</h3></div></div></div><p>
The reference sequence (the backbone) can be in a number of different
formats: GFF3, GenBank, MAF, CAF, FASTA. The first three have the advantage
of being able to carry additional information like, e.g.,
annotation. In this example, we will use a GFF3 file like the ones
one can download from the NCBI.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
TODO: Write why GFF3 is better and where to get them at the NCBI.
</td></tr></table></div><p>
So, let's assume that our wild type
strain is in the following file:
<code class="filename">NC_someNCBInumber.gff3</code>.
</p><p>
You do not need to copy the reference sequence to your directory, but
I normally also copy the reference file into the directory with my
data as I want to have, at the end of my work, a nice little
self-sufficient directory which I can archive away and still be sure
that in 10 years' time I have all the data I need together.
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cp /somewhere/NC_someNCBInumber.gff3 data</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 6543511 2008-04-08 23:53 NC_someNCBInumber.gff3
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_lane6.solexa.fastq
-rw-r--r-- 1 bach users 264823645 2008-03-28 21:51 bchocse_lane7.solexa.fastq</pre><p>
We're almost finished with the setup. As I like to have things neatly separated, I always create a directory called <code class="filename">assemblies</code> which will hold my assemblies (or different trials) together. Let's quickly do that:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies/1sttrial</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cd assemblies/1sttrial</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_writing_a_simple_manifest_file"></a>6.2.3.
Writing a simple manifest file
</h3></div></div></div><p>
A manifest file is a configuration file for MIRA which tells it what
type of assembly it should do and which data it should load. In this
case we have unpaired sequencing data which we want to map to a
reference sequence. The manifest file for that is pretty simple:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/*fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Please look up the parameters of the manifest file in the main
manual or the example manifest files in the following section.
</p><p>
The ones above basically say: make an accurate mapping of Solexa
reads against a genome; in one pass; the name of the backbone strain
is 'bchoc_wt'; the data with the backbone sequence (and maybe
annotations) is in a specified GFF3 file; for Solexa data: assign
default strain names for reads which have not loaded ancillary data
with strain info and that default strain name should be 'bchoc_se'.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_starting_assembly"></a>6.2.4. Starting the assembly</h3></div></div></div><p>
Starting the assembly is now just a matter of a simple command line:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject/assemblies/1sttrial$</code> <strong class="userinput"><code>mira <em class="replaceable"><code>manifest.conf >&log_assembly.txt</code></em></code></strong></pre><p>
For this example - if you followed the walk-through on how to prepare the data
- the only things you might want to adapt at first are the following options
in the manifest file:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
project= (for naming your assembly project)
</p></li><li class="listitem"><p>
strain= (to give the names of your reference and mapping strains)
</p></li></ul></div><p>
Of course, you are free to change any option via the extended parameters, but
this is the topic of another part of this manual.
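</p><p>
If you set up several such mappings, the two values above can also be filled in by a small script. The following Python sketch is not part of MIRA: the function name <code class="literal">build_manifest</code> and its arguments are made up for illustration, and the file paths are simply the ones used in the walkthrough above. It only reproduces the layout of the example manifest from the previous section:
</p><pre class="screen">
```python
def build_manifest(project, reference_strain, mapping_strain):
    """Return the text of the simple mapping manifest from the walkthrough."""
    lines = [
        "project = " + project,
        "job = genome,mapping,accurate",
        "# first, the reference sequence",
        "readgroup",
        "is_reference",
        "data = ../../data/NC_someNCBInumber.gff3",
        "strain = " + reference_strain,
        "# now the Illumina data",
        "readgroup = SomeUnpairedIlluminaReadsIGotFromTheLab",
        "data = ../../data/*fastq",
        "technology = solexa",
        "strain = " + mapping_strain,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the manifest; redirect the output into a file of your choice.
    print(build_manifest("MyFirstAssembly", "bchoc_wt", "bchoc_se"), end="")
```
</pre><p>
Saving the printed text as <code class="filename">manifest.conf</code> in the <code class="filename">1sttrial</code> directory gives a file that can be passed to MIRA as shown above.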
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_manifest_files_use_cases"></a>6.3.
Manifest files for different use cases
</h2></div></div></div><p>
This section will introduce you to manifest files for different use
cases. It should cover the most important uses, but as always you are
free to mix and match the parameters and readgroup definitions to suit
your specific needs.
</p><p>
Taking into account that there may be <span class="emphasis"><em>a lot</em></span> of
combinations of sequencing technologies, sequencing libraries (shotgun,
paired-end, mate-pair, etc.) and input file types (FASTQ, FASTA,
GenBank, GFF3, etc.), the example manifest files just use Illumina and
454 as technologies, GFF3 as input file type for the reference sequence,
FASTQ as input type for sequencing data ... and they do not show the
multitude of more advanced features like, e.g., using ancillary clipping
information in XML files, ancillary masking information in SSAHA2 or
SMALT files, etc.
</p><p>
I'm sure you will be able to find your way by scanning through the
corresponding section on manifest files in the reference chapter :-)
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_shotgun_data"></a>6.3.1.
Mapping with shotgun data
</h3></div></div></div><p>
Well, we've seen that already in the section above, but here it is
again ... this time with Ion Torrent data though.
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Ion data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Ion Torrent data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIonReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/someiondata.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em>
strain = <em class="replaceable"><code>bchoc_se</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_manifest_for_pairedend_data"></a>6.3.2.
Manifest for data sets with paired reads
</h3></div></div></div><p>
</p><p>
When using paired-end data in mapping, you must decide whether you want to
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
use the MIRA feature to create long 'coverage equivalent reads'
(CERs), which saves a lot of memory (both in the assembler and
later on in an assembly editor). However, you then
<span class="emphasis"><em>lose information about read pairs!</em></span>
</p></li><li class="listitem"><p>
<span class="emphasis"><em>keep information about read
pairs</em></span> at the expense of larger memory requirements both
in MIRA and in assembly finishing tools or viewers afterwards.
</p></li><li class="listitem"><p>
use a mix of the two above.
</p></li></ol></div><p>
The Illumina pipeline normally gives you two files for paired-end
data: a <code class="filename">project-1.fastq</code> and
<code class="filename">project-2.fastq</code>. The first file contains the
first read of each read-pair, the second file the second read. Depending
on the preprocessing pipeline of your sequencing provider, the names
of the reads are either the very same in both files or already have
a <code class="literal">/1</code> or <code class="literal">/2</code> appended. Also, your
sequencing provider may give you one big file where the reads from
both ends are present.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
MIRA can read all FASTQ variants produced by various Illumina
pipelines, be they with or without the /1 and /2 already appended to
the names. You generally do not need to do any name mangling before
feeding the data to MIRA. However, MIRA will issue a warning if read names are longer than 40 characters.
</p></td></tr></table></div><p>
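To see beforehand whether your data would trigger the read-name warning mentioned in the note above, you can scan the FASTQ files yourself. The following Python sketch is not part of MIRA. It assumes plain four-line FASTQ records (no wrapped sequence lines) and takes the read name to be the header text up to the first whitespace; whether MIRA counts the name in exactly this way is an assumption, and the function name is made up:
</p><pre class="screen">
```python
import io

MAX_NAME_LENGTH = 40  # limit quoted in the note above


def long_read_names(handle, limit=MAX_NAME_LENGTH):
    """Yield read names longer than `limit` from a four-line FASTQ stream."""
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:
            continue  # only the first line of each record holds the name
        fields = line[1:].split()  # drop the leading '@', cut at whitespace
        if fields and len(fields[0]) > limit:
            yield fields[0]


if __name__ == "__main__":
    # Two made-up records: one short name, one name of 45 characters.
    example = io.StringIO(
        "@short_read/1\nACGT\n+\nIIII\n"
        "@" + "x" * 45 + "\nACGT\n+\nIIII\n"
    )
    for name in long_read_names(example):
        print(name)
```
</pre><p>
For real data, open the FASTQ file and pass the file object to <code class="literal">long_read_names</code> instead of the in-memory example.
</p><p>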
When using paired-end data, you should know
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
the orientation of the reads toward each other. This is specific
to sequencing technologies and / or the sequencing library preparation.
</p></li><li class="listitem"><p>
at which distance these reads should be. This is specific to the
sequencing library preparation and the sequencing lab should tell
you this.
</p></li></ol></div><p>
In case you do not know one (or any) of the above, don't panic! MIRA
is able to estimate the needed values during the assembly if you tell
it to.
</p><p>
The following manifest shows you the laziest way to define a
paired data set by simply adding <span class="emphasis"><em>autopairing</em></span> as keyword to a
readgroup (using Illumina just as example):
</p><pre class="screen"># Example for a lazy manifest describing a mapping assembly with
# one library of paired reads
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina paired-end data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
</code></strong></pre><p>
See? Wasn't hard and it did not hurt, did it? One just needs to tell
MIRA it should expect paired reads via
the <span class="emphasis"><em>autopairing</em></span> keyword and that is everything you
need.
</p><p>
If you know the orientation of the reads and/or the library size, you
can tell MIRA this in the following way (just showing the readgroup
definition here):
</p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong></pre><p>
In case you are not 100% sure about, e.g., the size of the DNA
template, you can also give a (generous) expected range and then tell
MIRA to automatically refine this range during the assembly based on
real, observed distances of read pairs. Do this with the <span class="emphasis"><em>autorefine</em></span>
modifier like this:
</p><pre class="screen"><strong class="userinput"><code>template_size = <em class="replaceable"><code>50 1000 autorefine</code></em></code></strong></pre><p>
The following manifest file is an example for mapping a 500 bp
paired-end and a 3 kb mate-pair library of a strain
called <span class="emphasis"><em>bchoc_se1</em></span> against a GFF3 reference
file containing a strain called <span class="emphasis"><em>bchoc_wt</em></span>:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# paired Illumina data, not merging reads and therefore keeping
# all pair information
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
# As special parameter, we want to switch off merging of Solexa reads
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em>
parameters = <em class="replaceable"><code>SOLEXA_SETTINGS -CO:msr=no</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForPairedEnd500bpLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForMatePair3kbLib</code></em>
data = <em class="replaceable"><code>../../data/project3kb-1.fastq ../../data/project3kb-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>2000 4000 autorefine</code></em>
segment_placement = <em class="replaceable"><code><--- ---></code></em></code></strong></pre><p>
Please look up the parameters used in the main manual. The ones
above basically say: make an accurate mapping of Solexa reads
against a genome. Additionally, do not merge short Solexa
reads into the contig.
</p><p>
For the paired-end library, be lazy and let MIRA find out everything
it needs. However, that information should be treated as
"information only" by MIRA, i.e., it is not used for deciding whether
a pair is well mapped.
</p><p>
For the mate-pair library, assume a DNA template size of
2000 to 4000 bp (but let MIRA automatically refine this using observed
distances) and the segment orientation of the read pairs follows
the reverse / forward scheme. That information should be treated as
"information only" by MIRA, i.e., it is not used for deciding whether
a pair is well mapped.
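</p><p>
Since MIRA treats these values as information only, judging whether a mapped pair looks suspicious is left to whatever you run after the mapping. The following Python sketch shows what such a check could look like. It is not part of MIRA: the function name is hypothetical, the template size is simply taken as the span from the leftmost to the rightmost mapped base, and how you obtain the mapped positions from the MIRA results is not covered here:
</p><pre class="screen">
```python
def pair_fits_template(first, second, min_size, max_size):
    """Check a mapped read pair against an expected template size range.

    `first` and `second` are (start, end) positions of the two reads on
    the reference. The template size is taken as the span from the
    leftmost to the rightmost mapped base.
    """
    leftmost = min(first[0], first[1], second[0], second[1])
    rightmost = max(first[0], first[1], second[0], second[1])
    size = rightmost - leftmost
    return size >= min_size and max_size >= size


if __name__ == "__main__":
    # A pair spanning 3000 bases fits the 2000 to 4000 range used above.
    print(pair_fits_template((1000, 1100), (3900, 4000), 2000, 4000))
    # A pair spanning 9000 bases does not fit and may deserve a closer look.
    print(pair_fits_template((1000, 1100), (9900, 10000), 2000, 4000))
```
</pre><p>
Pairs that fall outside the range are not errors as such. They are merely candidates for a closer look, e.g. when hunting for rearrangements or larger insertions/deletions.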
</p><p>
Comparing this manifest with a manifest for unpaired data, three
parameters were added for the Solexa data:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<code class="literal">-CO:msr=no</code> tells MIRA not to merge reads that
are 100% identical to the backbone. This also allows MIRA to keep the
template information (distance and orientation) for the reads.
</p></li><li class="listitem"><p>
<code class="literal">template_size</code> tells MIRA at which distance the
two reads should normally be placed from each other.
</p></li><li class="listitem"><p>
<code class="literal">segment_placement</code> tells MIRA how the different
segments (reads) of a DNA template have to be ordered to form a
valid representation of the sequenced DNA.
</p></li></ol></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Note that in mapping assemblies, the
<code class="literal">template_size</code> and
<code class="literal">segment_placement</code> parameters are normally treated
as <span class="emphasis"><em>information only</em></span>, i.e., MIRA will map the
reads regardless of whether the distance and orientation criteria are
met or not. This enables post-mapping analysis programs to hunt for
genome rearrangements or larger insertions/deletions.
</p></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
If template size and segment placement checking were on, the
following would happen at, e.g. sites of re-arrangement: MIRA would
map the first read of a read-pair without problem. However, it would
very probably reject the second read because it would not map at the
specified distance or orientation from its partner. Therefore, in
mapping assemblies with paired-end data, checking of the template
size must be switched off to give post-processing programs a chance
to spot re-arrangements.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_multiple_technologies"></a>6.3.3.
Mapping with multiple sequencing technologies (hybrid mapping)
</h3></div></div></div><p>
I'm sure you'll have picked up the general scheme of manifest files by
now. Hybrid mapping assemblies follow the general scheme: simply add
as separate readgroup the information MIRA needs to know to find the
data and off you go. Just for laughs, here's a manifest for 454
shotgun with Illumina paired-end:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# shotgun 454 and paired-end Illumina data, not merging reads and therefore keeping
# all pair information
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
# As special parameter, we want to switch off merging of Solexa reads
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em>
parameters = <em class="replaceable"><code>SOLEXA_SETTINGS -CO:msr=no</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the shotgun 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgun454</code></em>
data = <em class="replaceable"><code>../../data/project454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
# now the paired-end Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForPairedEnd500bpLib</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_multiple_strains"></a>6.3.4.
Mapping with multiple strains
</h3></div></div></div><p>
MIRA will make use of ancillary information present in the manifest
file. One of these is the information to which strain (or organism or
cell line, etc.) the generated data belongs.
</p><p>
You just need to state in the manifest file which data comes from which
strain. Let's assume that in the example from above, the "lane6" data
were from a first mutant named <span class="emphasis"><em>bchoc_se1</em></span> and the
"lane7" data were from a second mutant
named <span class="emphasis"><em>bchoc_se2</em></span>. Here's the manifest file you
would write then:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE2</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane7.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
While mapping (or even assembling de-novo) with multiple strains is
possible, the interpretation of results may become a bit daunting in
some cases. For many scenarios it might therefore be preferable to
use the data sets one after another in their own mappings or assemblies.
</td></tr></table></div><p>
This <span class="emphasis"><em>strain</em></span> information for each readgroup is really the only change you need to perform to tell MIRA everything it needs for handling strains.
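</p><p>
With more than a handful of strains, typing one readgroup per strain by hand gets tedious. The following Python sketch prints such readgroup blocks from a table of strain names and files. It is not part of MIRA: the function name and the readgroup labels it generates are invented here, and it only uses the manifest keywords shown in the example above:
</p><pre class="screen">
```python
def strain_readgroups(files_by_strain, technology="solexa"):
    """Return one readgroup block per strain, in the order given."""
    blocks = []
    for strain, path in files_by_strain.items():
        blocks.append("\n".join([
            "readgroup = DataFor_" + strain,  # arbitrary label
            "data = " + path,
            "technology = " + technology,
            "strain = " + strain,
        ]))
    return "\n".join(blocks) + "\n"


if __name__ == "__main__":
    # The two lanes from the example above, one mutant strain each.
    print(strain_readgroups({
        "bchoc_se1": "../../data/bchocse_lane6.solexa.fastq",
        "bchoc_se2": "../../data/bchocse_lane7.solexa.fastq",
    }), end="")
```
</pre><p>
The printed blocks go after the reference readgroup in the manifest; the first part of the manifest (project, job) stays as it is.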
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_walkthroughs"></a>6.4.
Walkthroughs
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_walkthrough:_mapping_of_ecoli_from_lenski_lab_against_ecoli_b_rel606"></a>6.4.1.
Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606
</h3></div></div></div><p>
TODO: Sorry, needs to be re-written for the relatively new SRR format
distributed at the NCBI ... and changes in MIRA 3.9.x. Please come
back later.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_useful_about_reference_sequences"></a>6.5.
Useful things to know about reference sequences
</h2></div></div></div><p>
There are a few things to consider when using reference sequences:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
MIRA is not really made to handle large amounts of reference
sequence, as these currently need insane amounts of memory. Use other
programs for mapping against more than, say, 200 megabases.
</p></li><li class="listitem"><p>
Reference sequences can be as long as needed! They are not subject
to normal read length constraints of a maximum of 32k bases. That
is, if one wants to load one or several entire chromosomes of a
bacterium or lower eukaryote as backbone sequence(s), this is just
fine.
</p></li><li class="listitem"><p>
Reference sequences can be single sequences like provided in, e.g.,
FASTA, FASTQ, GFF or GenBank files. But reference sequences also can
be whole assemblies when they are provided as, e.g., MAF or CAF
format. This opens the possibility to perform semi-hybrid assemblies
by first assembling reads from one sequencing technology de-novo
(e.g. PacBio) and then mapping reads from another sequencing technology
(e.g. Solexa) to the whole PacBio alignment instead of mapping them to
the PacBio consensus.
</p><p>
A semi-hybrid assembly will therefore contain, like a hybrid
assembly, the reads of both sequencing technologies.
</p></li><li class="listitem"><p>
Reference sequences will not be reversed! They will always appear in
forward direction in the output of the assembly. Please note: if the
backbone sequence consists of a MAF or CAF file that contains contigs
which contain reversed reads, then the contigs themselves will be in
forward direction. But the reads they contain that are in reverse
complement direction will of course also stay in reverse complement
direction.
</p></li><li class="listitem"><p>
Reference sequences will not be assembled together! That is,
even if a reference sequence has a perfect overlap with another
reference sequence, they will still not be merged.
</p></li><li class="listitem"><p>
Reads are assembled to reference sequences in a first come, first
served scattering strategy.
</p><p>
Suppose you have two identical reference sequences and a read which
would match both, then the read would be mapped to the first
backbone. If you had two identical reads, the first read would go to
the first backbone, the second read to the second backbone. With
three identical reads, the first backbone would get two reads, the
second backbone one read, and so on.
</p></li><li class="listitem"><p>
Only in references loaded from MAF or CAF files: contigs made out of
single reads (singlets) lose their status as reference sequence and
will be returned to the normal read pool for the assembly
process. That is, these sequences will be assembled to other
reference sequences or with each other.
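</p><p>
The first come, first served scattering described further up in this list can be pictured with a few lines of Python. This is only an illustration of the example given there (identical reads, identical backbones), written as a simple round-robin. It is an assumption about the behaviour, not MIRA's actual implementation, and all names are made up:
</p><pre class="screen">
```python
def scatter_reads(reads, backbones):
    """Distribute identical reads over identical backbones in turn."""
    placement = {name: [] for name in backbones}
    for index, read in enumerate(reads):
        # Take the backbones in order and start over after the last one.
        target = backbones[index % len(backbones)]
        placement[target].append(read)
    return placement


if __name__ == "__main__":
    result = scatter_reads(["read1", "read2", "read3"],
                           ["backbone1", "backbone2"])
    for name, mapped in result.items():
        print(name, mapped)
```
</pre><p>
With three reads and two backbones, the first backbone ends up with two reads and the second with one, as in the example in the list.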
</p></li></ol></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_known_bugs_problems"></a>6.6.
Known bugs / problems
</h2></div></div></div><p>
These are current as of version 4.0 of MIRA and might or might not have been
addressed in later versions.
</p><p>
Bugs:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Mapping of paired-end reads with one read being in a non-repetitive
area and the other in a repeat is not as effective as it should
be. The optimal strategy would be to map the
non-repetitive read first and then the read in the repeat. Unfortunately,
this is not yet implemented in MIRA.
</p></li></ol></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_est"></a>Chapter 7. EST / RNASeq assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_est_introduction">7.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect1_est_preliminaries:on_the_difficulties_of_assembling_ests">7.2.
Preliminaries: on the difficulties of assembling ESTs /RNASeq
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_est_poly-a_tails_in_est_data">7.2.1.
Poly-A tails
</a></span></dt><dt><span class="sect2"><a href="#sect2_est_lowly_expressed_transcripts">7.2.2.
Lowly expressed transcripts
</a></span></dt><dt><span class="sect2"><a href="#sect2_est_library_normalisation">7.2.3.
Very highly expressed transcripts
</a></span></dt><dt><span class="sect2"><a href="#sect_est_chimeras">7.2.4.
Chimeras
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#est_sect1_est_preprocessing">7.3.
Preprocessing of ESTs
</a></span></dt><dt><span class="sect1"><a href="#sect1_est_est_difference_assembly_clustering">7.4.
The difference between <span class="emphasis"><em>assembly</em></span> and
<span class="emphasis"><em>clustering</em></span>
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_est_snp_splitting">7.4.1.
Splitting transcripts into contigs based on SNPs
</a></span></dt><dt><span class="sect2"><a href="#sect2_est_gap_splitting">7.4.2.
Splitting transcripts into contigs based on larger gaps
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect1_est_demopipeline">7.5.
A simple step by step pipeline for reliable RNASeq assembly of eukaryotes
</a></span></dt><dt><span class="sect1"><a href="#idm5079">7.6.
Solving common problems of EST assemblies
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Expect the worst. You'll never get disappointed.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_introduction"></a>7.1.
Introduction
</h2></div></div></div><p>
This document is not complete yet and some sections may be a bit
unclear. I'd be happy to receive suggestions for improvements.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note:
Some reading requirements
"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">
Some reading requirements
</th></tr><tr><td align="left" valign="top"><p>
This guide assumes that you have basic working knowledge of Unix systems, know
the basic principles of sequencing (and sequence assembly) and what assemblers
do. Basic knowledge on mRNA transcription should also be present.
</p><p>
Please read at some point in time
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Before the assembly, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: “<i>Preparing data</i>”</a> to know what to do (or not to
do) with the sequencing data before giving it to MIRA.
</p></li><li class="listitem"><p>
For setting up the assembly, <a class="xref" href="#chap_denovo" title="Chapter 5. De-novo assemblies">Chapter 5: “<i>De-novo assemblies</i>”</a> to know how to
start a de-novo assembly (except that you will obviously need to change
the --job setting from <span class="emphasis"><em>genome</em></span> to
<span class="emphasis"><em>est</em></span>).
</p></li><li class="listitem"><p>
After the assembly, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: “<i>Working with the results of MIRA</i>”</a> to know what to do with the
results of the assembly. More specifically, <a class="xref" href="#sect_res_looking_at_results" title="9.1. MIRA output directories and files">Section 9.1: “
MIRA output directories and files
”</a>, <a class="xref" href="#sect_res_first_look:the_assembly_info" title="9.2. First look: the assembly info">Section 9.2: “
First look: the assembly info
”</a>, <a class="xref" href="#sect_res_converting_results" title="9.3. Converting results">Section 9.3: “
Converting results
”</a>, <a class="xref" href="#sect_res_filtering_of_results" title="9.4. Filtering results">Section 9.4: “
Filtering results
”</a> and <a class="xref" href="#sect_res_places_of_importance_in_a_de_novo_assembly" title="9.5. Places of importance in a de-novo assembly">Section 9.5: “
Places of importance in a de-novo assembly
”</a>.
</p></li><li class="listitem"><p>
And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: “<i>MIRA 4 reference manual</i>”</a> to look up how manifest files should be
written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2. The manifest file: basics">Section 3.4.2: “
The manifest file: basics
”</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3. The manifest file: information on the data you have">Section 3.4.3: “
The manifest file: information on the data you have
”</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4. The manifest file: extended parameters">Section 3.4.4: “
The manifest file: extended parameters
”</a>), some command line options as well as general information on
what tags MIRA uses in assemblies, the files it generates, etc.
</p></li></ul></div></td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_preliminaries:on_the_difficulties_of_assembling_ests"></a>7.2.
Preliminaries: on the difficulties of assembling ESTs / RNASeq
</h2></div></div></div><p>
Assembling ESTs can be, from an assembler's point of view, pure
horror. For example, some genes may have thousands of transcripts
while other genes have just a single transcript in the sequenced
data. Furthermore, the presence of 5' and 3' UTRs, transcription
variants, splice variants, homologues, SNPs etc. complicates the
assembly in some rather interesting ways.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_poly-a_tails_in_est_data"></a>7.2.1.
Poly-A tails
</h3></div></div></div><p>
Poly-A tails are part of the mRNA and therefore also part of the sequenced
data. They can occur as poly-A or poly-T, depending on the
direction and the part of the mRNA that was sequenced. Having poly-A/T
tails in the data is something of a double-edged sword. More
specifically, if the 3' poly-A tail is kept unmasked in the data,
transcripts having this tail will very probably not align with similar
transcripts from different splice variants (which is basically
good). On the other hand, homopolymers (multiple consecutive bases of
the same type) like poly-As are features that are pretty difficult to
get correct with today's sequencing technologies, be it Sanger, Solexa
or, with even more problems, 454. So slight errors in the
poly-A tail could lead to wrongly assigned splice sites ... and
wrongly split contigs.
</p><p>
This is the reason why many people cut off the poly-A tails, which in
turn may lead to transcripts from different splice variants being
assembled together.
</p><p>
Either way, it's not pretty.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_lowly_expressed_transcripts"></a>7.2.2.
Lowly expressed transcripts
</h3></div></div></div><p>
Single transcripts (or very lowly expressed transcripts) containing
SNPs, splice variants or similar differences to other, more highly
expressed transcripts are a problem: it's basically impossible for an
assembler to distinguish them from reads containing junky data
(e.g. reads with a high error rate or chimeras). The standard setting
of many EST assemblers and clusterers is therefore to remove these
reads from the assembly set. MIRA handles things a bit differently:
depending on the settings, single transcripts with sufficiently large
differences are either treated as debris or can be saved as
<span class="emphasis"><em>singlets</em></span>.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_library_normalisation"></a>7.2.3.
Very highly expressed transcripts
</h3></div></div></div><p>
Another interesting problem for de-novo assemblers is non-normalised
libraries. In each cell, the number of mRNA copies per gene may
differ by several orders of magnitude, from a single transcript to
several tens of thousands. Pre-sequencing normalisation is a wet-lab
procedure to approximately equalise those copy numbers. This can,
however, introduce other artefacts.
</p><p>
If an assembler is fed with non-normalised EST data, it may very well
be that an overwhelming number of the reads comes only from a few
genes (house-keeping genes). In Sanger sequencing projects this could
mean a couple of thousand reads per gene. In 454 sequencing projects,
this can mean several tens of thousands of reads per gene. With
Solexa data, this number can grow to something close to a million.
</p><p>
Several effects then hit a de-novo assembler, the three most annoying
being (in ascending order of annoyance): a) non-random sequencing
errors then look like valid SNPs, b) sequencing and library
construction artefacts start to look like valid sequences if the data
set was not cleaned "enough" and, more importantly, c) an explosion in
time and memory requirements when attempting to deliver a "good"
assembly. While MIRA has methods to deal with this kind of data
(e.g. via digital normalisation), a sure sign of the latter is the
appearance of messages
from MIRA about <span class="emphasis"><em>megahubs</em></span> in the data set.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The guide on how to tackle <span class="emphasis"><em>hard</em></span> projects with
MIRA gives an overview on how to hunt down sequences which can lead to
the assembler getting confused, be it sequencing artefacts or highly
expressed genes.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_est_chimeras"></a>7.2.4.
Chimeras
</h3></div></div></div><p>
Chimeras are sequences containing adjacent base stretches which do
not occur in the organism as sequenced, neither as DNA nor as
(m)RNA. Chimeras can be created through recombination effects during
library construction or sequencing. Chimeras can, and often do, lead
to misassemblies of sequence stretches into one contig although they
do not belong together. Have a look at the following example where two
stretches (denoted by <code class="literal">x</code> and <code class="literal">o</code>)
are joined by a chimeric read <span class="emphasis"><em>r4</em></span> containing both
stretches:
</p><pre class="screen">
r1 xxxxxxxxxxxxxxxx
r2 xxxxxxxxxxxxxxxxx
r3 xxxxxxxxxxxxxxxxx
r4 xxxxxxxxxxxxxxxxxxx|oooooooooooooo
r5 ooooooooooo
r6 ooooooooooo
r7 ooooooooo</pre><p>
The site of the recombination event is denoted by <code class="literal">x|o</code>
in read <span class="emphasis"><em>r4</em></span>.
</p><p>
MIRA does have chimera detection -- which works very well in genome
assemblies due to high enough coverage -- that searches for sequence
stretches which are not covered by overlaps. In the above example, the
chimera detection routine will almost certainly flag read
<span class="emphasis"><em>r4</em></span> as a chimera and only use a part of it: either the
<code class="literal">x</code> or <code class="literal">o</code> part, depending on which
part is longer. There is always a chance that <span class="emphasis"><em>r4</em></span> is
a valid read though, but that's a risk to take.
</p><p>
Now, that strategy would also work totally fine in EST projects if one
did not have to account for lowly expressed genes. Imagine the
following situation:
</p><pre class="screen">
s1 xxxxxxxxxxxxxxxxx
s2 xxxxxxxxxxxxxxxxxxxxxxxxx
s3 xxxxxxxxxxxxxxx
</pre><p>
Look at read <span class="emphasis"><em>s2</em></span>; from an overlap coverage
perspective, <span class="emphasis"><em>s2</em></span> could also very well be a chimera,
leading to a break of an otherwise perfectly valid contig if
<span class="emphasis"><em>s2</em></span> were cut back accordingly. This is why chimera
detection is switched off by default in EST assemblies with MIRA.
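</p><p>
The coverage argument behind both examples can be sketched in a few lines of
Python. This is a toy model and not MIRA's actual SKIM routine: the function
names, the half-open interval representation of overlaps and the coordinates
used below are assumptions made purely for illustration.
</p><pre class="screen">
def unspanned_boundaries(read_len, overlaps):
    """Boundaries b (between base b-1 and base b) that no overlap spans.

    overlaps: list of (start, end) half-open intervals on the read, one per
    overlap with another read.
    """
    spanned = set()
    for start, end in overlaps:
        # An overlap [start, end) bridges the boundaries start+1 .. end-1.
        spanned.update(range(start + 1, end))
    return [b for b in range(1, read_len) if b not in spanned]


def longest_supported_part(read_len, overlaps):
    """Cut the read at every unspanned boundary and keep the longest piece."""
    cuts = [0] + unspanned_boundaries(read_len, overlaps) + [read_len]
    parts = list(zip(cuts, cuts[1:]))
    # On ties, max() returns the first (leftmost) piece.
    return max(parts, key=lambda p: p[1] - p[0])


# A read like r4: one set of overlaps ends where the other one starts.
r4_like = longest_supported_part(33, [(0, 19), (19, 33)])
# A read like s2: overlaps at both ends only, nothing in the middle.
s2_like = longest_supported_part(25, [(0, 8), (17, 25)])
</pre><p>
In this sketch a read like <span class="emphasis"><em>r4</em></span> has a single
unsupported boundary and keeps its longer part, while a read like
<span class="emphasis"><em>s2</em></span> is cut back in exactly the same way even
though it is valid, which is the dilemma described above.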
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
When starting an EST assembly via the <code class="literal">--job=est,...</code>
switch, chimera detection is switched off by default. It is absolutely
possible to switch on the SKIM chimera detection afterwards via
[-CL:ascdc]. However, this will have exactly the effects
described above: chimeras in higher coverage contigs will be detected,
but perfectly valid low coverage contigs will be torn apart.
</p><p>
It is up to you to decide what you want or need.
</p></td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="est_sect1_est_preprocessing"></a>7.3.
Preprocessing of ESTs
</h2></div></div></div><p>
With contributions from Katrina Dlugosch
</p><p>
EST sequences necessarily contain fragments of vectors or primers used
to create cDNA libraries from RNA, and may additionally contain primer
and adaptor sequences used during amplification-based library
normalisation and/or high-throughput sequencing. These contaminant
sequences need to be removed prior to assembly. MIRA can trim sequences
by taking contaminant location information from a SSAHA2 or SMALT search
output, or users can remove contaminants beforehand by trimming
sequences themselves or masking unwanted bases with lowercase or other
characters (e.g. 'x', as with <span class="command"><strong>cross_match</strong></span>). Many
folks use preprocessing trimming/masking pipelines because it can be
very important to try a variety of settings to verify that you've
removed all of your contaminants (and fragments thereof) before sending
them into an assembly program like MIRA. It can also be good to spend
some time seeing what contaminants are in your data, so that you get to
know what quality issues are present and how pervasive.
</p><p>
Two features of next generation sequencing can introduce errors into
contaminant sequences that make them particularly difficult to remove,
arguing for preprocessing: First, most next-generation sequence
platforms seem to be sensitive to excess primers present during library
preparation, and can produce a small percentage of sequences composed
entirely of concatenated primer fragments. These are among the most
difficult contaminants to remove, and the program TagDust (<a class="ulink" href="http://genome.gsc.riken.jp/osc/english/dataresource/" target="_top">http://genome.gsc.riken.jp/osc/english/dataresource/</a>) was
recently developed specifically to address this problem. Second, 454 EST
data sets can show high variability within primer sequences designed to
anchor to polyA tails during cDNA synthesis, because 454 has trouble
calling the length of the necessary A and T nucleotide repeats with
accuracy.
</p><p>
A variety of programs exist for preprocessing. Popular ones include
cross_match (<a class="ulink" href="http://www.phrap.org/phredphrapconsed.html" target="_top">http://www.phrap.org/phredphrapconsed.html</a>)
for primer masking, and SeqClean (<a class="ulink" href="http://compbio.dfci.harvard.edu/tgi/software/" target="_top">http://compbio.dfci.harvard.edu/tgi/software/</a>), Lucy (<a class="ulink" href="http://lucy.sourceforge.net/" target="_top">http://lucy.sourceforge.net/</a>), and SeqTrim (<a class="ulink" href="http://www.scbi.uma.es/cgi-bin/seqtrim/seqtrim_login.cgi" target="_top">http://www.scbi.uma.es/cgi-bin/seqtrim/seqtrim_login.cgi</a>) for
both primer and polyA/T trimming. The pipeline SnoWhite (<a class="ulink" href="http://evopipes.net" target="_top">http://evopipes.net</a>) combines Seqclean and TagDust with custom
scripts for aggressive sequence and polyA/T trimming (and is tolerant of
data already masked using cross_match). In all cases, the user must
provide contaminant sequence information and adjust settings for how
sensitive the programs should be to possible matches. To find the best
settings, it is helpful to look directly at some of the sequences that
are being trimmed and inspect them for remaining primer and/or polyA/T
fragments after cleaning.
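</p><p>
Such an inspection can be partly automated. The following Python sketch
flags cleaned reads that still end in a poly-A run, start with a poly-T run
or contain a known primer. It is only an illustration: the function name,
the run length threshold of 10 and the exact-match primer search are
assumptions and are not taken from any of the tools named above.
</p><pre class="screen">
import re


def residual_contaminants(seq, primers=(), min_run=10):
    """Return labels for leftovers still present in an already cleaned read."""
    found = []
    s = seq.upper()
    # Poly-A run at the 3' end of the read.
    if re.search(r"A{%d,}$" % min_run, s):
        found.append("polyA-tail")
    # Poly-T run at the 5' end of the read.
    if re.search(r"^T{%d,}" % min_run, s):
        found.append("polyT-head")
    # Exact primer matches only; real primers are often mutated or truncated.
    for primer in primers:
        if primer.upper() in s:
            found.append("primer:" + primer)
    return found


clean = residual_contaminants("ACGTACGT")
tailed = residual_contaminants("ACGT" + "A" * 12)
</pre><p>
Reads for which the returned list is not empty are good candidates for a
closer look by eye before the trimming settings are changed.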
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
When using <span class="command"><strong>mira</strong></span> or
<span class="command"><strong>miraSearchESTSNPs</strong></span> with the simplest parameter
calls (using the "--job=..." quick switches), the default settings used
include pretty heavy sequence pre-processing to cope with noisy
data. If you have your own pre-processing pipeline, you
<span class="emphasis"><em>must</em></span> then switch off the clip algorithms that
you have already applied yourself. In particular, poly-A clips
should never be run twice (by your pipeline and by
<span class="command"><strong>mira</strong></span>) as they invariably lead to too many bases being
cut away in some sequences.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Here too: In some cases MIRA can get confused if something with the
pre-processing went wrong because, e.g., unexpected sequencing artefacts
like unknown sequencing vectors or adaptors remain in the data. The guide on
how to tackle <span class="emphasis"><em>hard</em></span> projects with MIRA gives an
overview on how to hunt down sequences which can lead to the assembler
getting confused, be it sequencing artefacts or highly expressed genes.
</td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_est_difference_assembly_clustering"></a>7.4.
The difference between <span class="emphasis"><em>assembly</em></span> and
<span class="emphasis"><em>clustering</em></span>
</h2></div></div></div><p>
MIRA in its base settings is an <span class="emphasis"><em>assembler</em></span> and not a
<span class="emphasis"><em>clusterer</em></span>, although it can be configured as such. As an
assembler, it will split up read groups into different contigs if it
thinks there is enough evidence that they come from different RNA
transcripts.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_snp_splitting"></a>7.4.1.
Splitting transcripts into contigs based on SNPs
</h3></div></div></div><p>
Imagine this simple case: a gene has two slightly different alleles and you've
sequenced this:
</p><pre class="screen">
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........G...........
B2-2 ...........G...........
B2-3 ...........G...........
B2-4 ...........G...........
</pre><p>
Depending on base qualities and settings used during the assembly
like, e.g., [-CO:mr:mrpg:mnq:mgqrt:emea:amgb] MIRA will
recognise that there's enough evidence for a T and also enough
evidence for a G at that position and create two contigs, one
containing the "T" allele, one the "G". The consensus will be >99%
identical, but not 100%.
</p><p>
Things become complicated if one has to account for errors in
sequencing. Imagine you sequenced the following case:
</p><pre class="screen">
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........<span class="bold"><strong>G</strong></span>...........
</pre><p>
This looks very much like the case above, except that
there's only one read with a "G" instead of 4 reads. MIRA will, when
using standard settings, treat this as an erroneous base and leave all
these reads in one contig. It will likewise not mark it as a SNP in
the results. However, this could also very well be a lowly expressed
transcript with a single base mutation. It's virtually impossible to
tell which of the possibilities is right.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
You can of course force MIRA to mark situations like the one depicted
above by, e.g., changing the parameters
for [-CO:mrpg:mnq:mgqrt]. But this may have the side-effect
that sequencing errors get an increased chance of getting flagged as
SNP.
</td></tr></table></div><p>
Further complications arise when SNPs and potential sequencing errors
meet at the same place. Consider the following case:
</p><pre class="screen">
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........G...........
B2-2 ...........G...........
B2-3 ...........G...........
B2-4 ...........G...........
E1-1 ...........<span class="bold"><strong>A</strong></span>...........
</pre><p>
This example is exactly like the first one, except that an additional read
<code class="literal">E1-1</code> has made its appearance and has an "A"
instead of a "G" or "T". Again it is impossible to tell whether this
is a sequencing error or a real SNP. MIRA handles these cases in the
following way: it will recognise two valid read groups (one having a
"T", the other a "G") and, in assembly mode, split these two groups
into different contigs. It will also play safe and define that the
single read <code class="literal">E1-1</code> will not be attributed to either
one of the contigs but, if it cannot be assembled to other reads, form
its own contig ... if need be, even as a single read (a
<span class="emphasis"><em>singlet</em></span>).
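</p><p>
The three cases of this section can be condensed into a small Python
sketch. It is a deliberately simplified model and not MIRA's implementation:
base qualities are ignored, only a single alignment column is looked at, and
<code class="literal">min_reads_per_group</code> is a hypothetical stand-in
for the kind of threshold that settings such as [-CO:mrpg] influence.
</p><pre class="screen">
from collections import defaultdict


def split_by_column(bases, min_reads_per_group=2):
    """bases: dict mapping read name to its base at the disputed column.

    Returns (contigs, unplaced): a list of read-name lists and the reads
    that are not attributed to any of the split contigs.
    """
    by_base = defaultdict(list)
    for name, base in bases.items():
        by_base[base].append(name)
    # A group is valid if it holds at least min_reads_per_group reads
    # (a slice of that size can only be taken from a large enough list).
    valid = [names for _, names in sorted(by_base.items())
             if len(names[:min_reads_per_group]) == min_reads_per_group]
    if len(valid) in (0, 1):
        # No second well supported variant: keep all reads together.
        return [list(bases)], []
    placed = {name for names in valid for name in names}
    unplaced = [name for name in bases if name not in placed]
    return valid, unplaced


t_reads = {"A1-%d" % i: "T" for i in range(1, 6)}
g_reads = {"B2-%d" % i: "G" for i in range(1, 5)}
case3 = dict(t_reads)
case3.update(g_reads)
case3["E1-1"] = "A"
contigs, unplaced = split_by_column(case3)
</pre><p>
For the last example this yields two contigs (the "G" reads and the "T"
reads) and leaves <code class="literal">E1-1</code> unplaced, whereas five
"T" reads plus a single "G" read stay together in one contig.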
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Depending on some settings, singlets may either appear in the regular
results or end up in the debris file.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_gap_splitting"></a>7.4.2.
Splitting transcripts into contigs based on larger gaps
</h3></div></div></div><p>
Gaps in alignments of transcripts are handled very cautiously by
MIRA. The standard settings will lead to the creation of different
contigs if three or more consecutive gaps are introduced in an
alignment. Consider the following example:
</p><pre class="screen">
A1-1 ..........CGA..........
A1-2 ..........*GA..........
A1-3 ..........**A..........
B2-1 ..........<span class="bold"><strong>***</strong></span>..........
B2-2 ..........<span class="bold"><strong>***</strong></span>..........
</pre><p>
Under normal circumstances, MIRA will use the reads
<code class="literal">A1-1</code>, <code class="literal">A1-2</code> and
<code class="literal">A1-3</code> to form one contig and put
<code class="literal">B2-1</code> and <code class="literal">B2-2</code> into a separate
contig. MIRA would do this also if there were only one of the B2
reads.
</p><p>
The reason behind this is that the probability for having gaps of
three or more bases only due to sequencing errors is pretty
low. MIRA will therefore treat reads with such attributes as coming
from different transcripts and not assemble them together, though
this can be changed using the [-AL:egp:egpl] parameters of
MIRA if wanted.
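</p><p>
As a rough illustration of this rule, the following Python sketch sorts
aligned reads by whether they contain a run of three or more gap characters
(<code class="literal">*</code>). The function names are made up for this
example and the threshold is hard-wired as a default argument; how MIRA
evaluates gaps internally and how [-AL:egp:egpl] map onto it is not modelled
here.
</p><pre class="screen">
import re


def has_long_gap(aligned, min_run=3):
    """True if the aligned read has min_run or more consecutive '*' gaps."""
    return re.search(r"\*{%d,}" % min_run, aligned) is not None


def partition_by_gaps(reads, min_run=3):
    """Split read names into (kept together, put into a separate contig)."""
    together = [n for n, s in reads.items() if not has_long_gap(s, min_run)]
    apart = [n for n, s in reads.items() if has_long_gap(s, min_run)]
    return together, apart


reads = {
    "A1-1": "..........CGA..........",
    "A1-2": "..........*GA..........",
    "A1-3": "..........**A..........",
    "B2-1": "..........***..........",
    "B2-2": "..........***..........",
}
together, apart = partition_by_gaps(reads)
</pre><p>
For the example above, the three A1 reads end up together and both B2 reads
are set apart, matching the behaviour described for the standard settings.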
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning:
Problems with homopolymers, especially in 454, Ion Torrent and high
coverage Illumina
"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">
Problems with homopolymers, especially in 454, Ion Torrent and high
coverage Illumina
</th></tr><tr><td align="left" valign="top"><p>
As 454 and Ion Torrent sequencing have a general problem with
homopolymers, this rule of MIRA will sometimes lead to the formation of
more contigs than expected due to sequencing errors at "long"
homopolymer sites ... where long starts at ~6-7 bases. Though MIRA
does know about the problem in 454 homopolymers and has some
routines which try to mitigate it, this is not always
successful.
</p><p>
The same applies to Illumina data with long homopolymers (~ 8-9 bp)
and high coverage (≥ 100x).
</p></td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_demopipeline"></a>7.5.
A simple step by step pipeline for reliable RNASeq assembly of eukaryotes
</h2></div></div></div><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Remove rRNA sequences. For that I use <span class="command"><strong>mirabait</strong></span> like this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mirabait -I -j rrna <em class="replaceable"><code>-p norRNAfile_1.fastq norRNAfile_2.fastq ...</code></em></code></strong></pre></li><li class="listitem"><p>
Clean the data. For this I use mira, asking it to perform only a
preprocessing of the data from step 1 via a line like this
</p><pre class="screen">parameters = -AS:nop=0</pre><p>
in the manifest file. After preprocessing, the results will be
present as a MAF file in
<code class="filename">*_assembly/*_d_chkpt/readpool.maf</code>.
</p></li><li class="listitem"><p>
As the MAF file contains paired reads together, they need to be
separated again. Additionally, I perform a hard cut of the clipped
sequence. This is a job for <span class="command"><strong>miraconvert</strong></span>:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -C -F -F readpool.maf</code></strong></pre></li><li class="listitem"><p>
I then use FLASH to merge paired reads together, using high overlap and zero allowed errors.
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>...</code></strong></pre><p>
FLASH will create three files for this: one file with joined pairs,
one file with unjoined pairs and one file with orphan reads (i.e.,
reads which have no mate). I generally continue with just the joined
and unjoined files.
</p></li><li class="listitem"><p>
Reduce the dataset to a reasonable size. Using 3 or 4 gigabases to
reconstruct a eukaryotic transcriptome should yield pretty good
transcripts without too much noise and lose only the rarest
transcripts.
</p><p>
Depending on the Illumina read length (100, 125, 150, 250 or 300) I
generally go for a 1:1 or 2:1 ratio of joined versus unjoined
reads. E.g., if I need to extract 2 gigabases of joined FLASH
results and 1 gigabase of unjoined FLASH results I do this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -Y 2000000 <em class="replaceable"><code>flashjoined.fastq reduced2gb_flashjoined.fastq</code></em></code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -Y 1000000 <em class="replaceable"><code>flashunjoined.fastq reduced1gb_flashunjoined.fastq</code></em></code></strong></pre><p>
</p></li><li class="listitem"><p>
Assemble the cleaned, joined and reduced data set. A simple manifest
file like this will suffice:
</p><pre class="screen">project = myRNASEQ
job=est,denovo,accurate
readgroup
technology=solexa
autopairing
data=reduced2gb_flashjoined.fastq reduced1gb_flashunjoined.fastq
</pre></li></ol></div><p>
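If the pipeline is scripted, the manifest from the last step can be written
by a small helper. The Python sketch below only reproduces the keys shown in
the manifest above; the function name and its arguments are assumptions for
illustration.
</p><pre class="screen">
def est_manifest(project, data_files, technology="solexa"):
    """Build the text of a minimal de-novo EST manifest."""
    lines = [
        "project = " + project,
        "job=est,denovo,accurate",
        "readgroup",
        "technology=" + technology,
        "autopairing",
        "data=" + " ".join(data_files),
    ]
    return "\n".join(lines) + "\n"


manifest_text = est_manifest(
    "myRNASEQ",
    ["reduced2gb_flashjoined.fastq", "reduced1gb_flashunjoined.fastq"],
)
print(manifest_text, end="")
</pre><p>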
The result can be annotated and quality controlled. However, it will
still contain duplicate genes (due to, e.g., ploidy variants) or gene
fragments (due to ploidy variants, splice variants, sequencing
errors). To reduce this number I generally do the following:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem">
Extract CDS of the annotated sequences. Make sure that your pipeline
also annotates hypothetical proteins with a length ≥ 300 bp.
</li><li class="listitem"><p>
Cluster the CDS sequences with MIRA, using a high similarity threshold:
</p><pre class="screen">project = myRNASEQclustering
job=est,clustering,accurate
parameters = --noclipping
parameters = TEXT_SETTINGS -AS:mrs=94
readgroup
technology=text
autopairing
data=fna::CDSfromAnnotation.fasta
</pre></li></ol></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="idm5079"></a>7.6.
Solving common problems of EST assemblies
</h2></div></div></div><p>
... continue here ...
</p><p>
Megahubs => track down reason (high expr, seqvec or adaptor: see
mira_hard) and eliminate it
</p></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_specialparams"></a>Chapter 8. Parameters for special situations</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_sp_introduction">8.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_sp_pacbio">8.2.
PacBio
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_sp_pacbio_ccs">8.2.1.
PacBio CCS reads
</a></span></dt><dt><span class="sect2"><a href="#sect_sp_pacbio_ec">8.2.2.
PacBio error corrected reads
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">... .
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_sp_introduction"></a>8.1.
Introduction
</h2></div></div></div><p>
Most of this chapter and many sections are just stubs at the moment.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_sp_pacbio"></a>8.2.
PacBio
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_sp_pacbio_ccs"></a>8.2.1.
PacBio CCS reads
</h3></div></div></div><p>
Declare the sequencing technology to be high-quality PacBio (<span class="bold"><strong>PCBIOHQ</strong></span>). The last time I worked with CCS, the
ends of the reads were not really clean, so using the proposed end
clipping (which needs to be manually switched on for PCBIOHQ reads)
may be advisable.
</p><pre class="screen"><strong class="userinput"><code>...
parameters = PCBIOHQ_SETTINGS -CL:pec=yes
...
readgroup
technology=pcbiohq
data=...
...</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_sp_pacbio_ec"></a>8.2.2.
PacBio error corrected reads
</h3></div></div></div><p>
Declare the sequencing technology to be high-quality PacBio (<span class="bold"><strong>PCBIOHQ</strong></span>). For self-corrected data or data
corrected with other sequencing technologies, it is recommended to
change the [-CO:mrpg] setting to a value which is 1/4th to
1/5th of the average coverage of the corrected PacBio reads across the
genome. E.g.:
</p><pre class="screen"><strong class="userinput"><code>...
parameters = PCBIOHQ_SETTINGS -CO:mrpg=5
...
readgroup
technology=pcbiohq
data=...
...</code></strong></pre><p>
for a project which has ~24x coverage. This necessity may change in
later versions of MIRA though.
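</p><p>
The rule of thumb above can be written down as a small calculation. The following Python sketch is only an illustration: the function name <code class="literal">mrpg_range</code> is hypothetical and not part of MIRA, and it assumes that you already know the average coverage of the corrected PacBio reads.
</p><pre class="screen">
```python
def mrpg_range(avg_coverage):
    """Return (low, high): 1/5th and 1/4th of the average coverage."""
    return avg_coverage / 5, avg_coverage / 4


low, high = mrpg_range(24)
print(f"-CO:mrpg between {low:.1f} and {high:.1f}")
# prints: -CO:mrpg between 4.8 and 6.0
```
</pre><p>
For the ~24x project above, the value of 5 used in the manifest excerpt lies inside this range.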
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_results"></a>Chapter 9. Working with the results of MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_res_looking_at_results">9.1.
MIRA output directories and files
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_resultsdir">9.1.1.
The <code class="filename">*_d_results</code> directory
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_resultsdir_denovo">9.1.1.1.
Additional 'large contigs' result files for de-novo assemblies of genomes
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_res_infodir">9.1.2.
The <code class="filename">*_d_info</code> directory
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_res_first_look:the_assembly_info">9.2.
First look: the assembly info
</a></span></dt><dt><span class="sect1"><a href="#sect_res_converting_results">9.3.
Converting results
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_converting_miraconvert">9.3.1.
Converting to and from other formats: <span class="command"><strong>miraconvert</strong></span>
</a></span></dt><dt><span class="sect2"><a href="#sect_res_converting_reach_other_programs">9.3.2.
Steps for converting data from / to other tools
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_converting_to_from_staden">9.3.2.1.
Example: converting to and from the Staden package (gap4 / gap5)
</a></span></dt><dt><span class="sect3"><a href="#sect_res_converting_to_from_sam">9.3.2.2.
Example: converting to and from SAM (for samtools, tablet etc.)
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_res_filtering_of_results">9.4.
Filtering results
</a></span></dt><dt><span class="sect1"><a href="#sect_res_places_of_importance_in_a_de_novo_assembly">9.5.
Places of importance in a de-novo assembly
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_tags_set_by_mira">9.5.1.
Tags set by MIRA
</a></span></dt><dt><span class="sect2"><a href="#sect_res_other_places_of_importance">9.5.2.
Other places of importance
</a></span></dt><dt><span class="sect2"><a href="#sect_res_joining_contigs">9.5.3.
Joining contigs
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_joining_truerepeats">9.5.3.1.
Joining contigs at true repetitive sites
</a></span></dt><dt><span class="sect3"><a href="#sect_res_joining_FALSErepeats">9.5.3.2.
Joining contigs at "wrongly discovered" repetitive sites
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_res_places_of_interest_in_a_mapping_assembly">9.6.
Places of interest in a mapping assembly
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_poi_where_are_snps?">9.6.1.
Where are SNPs?
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_where_are_insertions_deletions_or_genome_rearrangements?">9.6.2.
Where are insertions, deletions or genome re-arrangements?
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_other_tags_of_interest">9.6.3.
Other tags of interest
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_res_postprocessing_mapping_assemblies">9.7.
Post-processing mapping assemblies
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_pp_manual_cleanup">9.7.1.
Manual cleanup and validation (optional)
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_comprehensive_snp_analysis_spreadsheet_tables_for_excel_or_oocalc">9.7.2.
Comprehensive SNP analysis spreadsheet tables (for Excel or OOcalc)
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_html_files_depicting_snp_positions_and_deletions">9.7.3.
HTML files depicting SNP positions and deletions
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_wig_files">9.7.4.
WIG files depicting contig coverage or GC content
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_tables_for_feature_coverage">9.7.5.
Comprehensive spreadsheet tables for gene expression values / genome deletions & duplications
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">You have to know what you're looking for before you can find it.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><p>
MIRA makes results available in quite a number of formats: CAF, ACE, FASTA and
a few others. The preferred formats are CAF and MAF, as these formats can be
translated into any other supported format.
</p><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_looking_at_results"></a>9.1.
MIRA output directories and files
</h2></div></div></div><p>
For each assembly, MIRA creates a directory named
<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly</code>
which contains a number of sub-directories.
</p><p>
These sub-directories (and files within) contain the results of the
assembly itself, general information and statistics on the results and
-- if not deleted automatically by MIRA -- a tmp directory with log
files and temporary data:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>:
this directory contains all the output files of the assembly in
different formats.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>:
this directory contains information files of the final
assembly. They provide statistics as well as, e.g., information
(easily parsable by scripts) on which read is found in which
contig etc.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_tmp</code>:
this directory contains log files and temporary assembly files. It
can be safely removed after an assembly, as it can easily hold a
few GB of data that are normally not needed anymore.
</p><p>
With the default settings of MIRA, really big files are
automatically deleted during an assembly as soon as they are no
longer needed.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_chkpt</code>:
this directory contains checkpoint files needed to resume
assemblies that crashed or were stopped.
</p></li></ul></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_resultsdir"></a>9.1.1.
The <code class="filename">*_d_results</code> directory
</h3></div></div></div><p>
The following files in
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>
contain results of the assembly in different formats. Depending on the
output options you defined for MIRA, some files may or may not be
there. As long as the CAF or MAF file is present, you can translate
your assembly later on to almost any supported format with the
<span class="command"><strong>miraconvert</strong></span> program supplied with the MIRA
distribution:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.txt</code>:
this file contains in a human readable format the aligned assembly
results, where all input sequences are shown in the context of the
contig they were assembled into. This file is just meant as a
quick way for people to have a look at their assembly without
specialised alignment finishing tools.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.padded.fasta</code>:
this file contains as FASTA sequence the consensus of the contigs
that were assembled in the process. Positions in the consensus
containing gaps (also called 'pads', denoted by an asterisk) are
still present. The computed consensus qualities are in the
corresponding
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.padded.fasta.qual</code>
file.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.unpadded.fasta</code>:
as above, this file contains as FASTA sequence the consensus of
the contigs that were assembled in the process, but positions in
the consensus containing gaps were removed. The computed consensus
qualities are in the corresponding
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.unpadded.fasta.qual</code>
file.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.caf</code>:
this is the result of the assembly in CAF format, which can be
further worked on with, e.g., tools from the
<span class="emphasis"><em>caftools</em></span> package from the Sanger Centre and
later on be imported into, e.g., the Staden gap4 assembly and
finishing tool.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.ace</code>:
this is the result of the assembly in ACE format. This format can
be read by viewers like the TIGR clview or by consed from the
phred/phrap/consed package.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.gap4da</code>:
this directory contains the result of the assembly suited for the
<span class="emphasis"><em>direct assembly</em></span> import of the Staden gap4
assembly viewer and finishing tool.
</p></li></ul></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_resultsdir_denovo"></a>9.1.1.1.
Additional 'large contigs' result files for de-novo assemblies of genomes
</h4></div></div></div><p>
For de-novo assemblies of genomes, MIRA makes a proposal regarding
which contigs you probably want to have a look at ... and which ones
you can probably forget about.
</p><p>
This proposal relies on the <span class="emphasis"><em>largecontigs</em></span> file
in the info directory (see section below). MIRA automatically
extracts these contigs into all the formats you requested for your
results.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
The result files for 'large contigs' are all named:
<code class="filename"><em class="replaceable"><code>projectname</code></em>_<span class="emphasis"><em>LargeContigs</em></span>_out.<em class="replaceable"><code>resulttype</code></em></code>.
</p></li><li class="listitem"><p>
<code class="filename">extractLargeContigs.sh</code>: this is a small
shell script which just contains the call
to <span class="command"><strong>miraconvert</strong></span> with which MIRA extracted the
large contigs for you. In case you want to redefine what large
contigs are for you, feel free to use this as template.
</p></li></ul></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_infodir"></a>9.1.2.
The <code class="filename">*_d_info</code> directory
</h3></div></div></div><p>
The following files in
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>
contain statistics and other information files of the assembly:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_assembly.txt</code>:
This file should be your first stop after an assembly. It will
tell you some statistics as well as whether or not problematic
areas remain in the result.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_callparameters.txt</code>:
This file contains the parameters as given on the mira command
line when the assembly was started.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_consensustaglist.txt</code>:
This file contains information about the tags (and their position)
that are present in the consensus of a contig.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigreadlist.txt</code>:
This file contains information on which reads have been assembled
into which contigs (or singlets).
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigstats.txt</code>:
This file contains in tabular format statistics about the contigs
themselves, their length, average consensus quality, number of
reads, maximum and average coverage, average read length, number
of A, C, G, T, N, X and gaps in consensus.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_debrislist.txt</code>:
This file contains the names of all the reads which were not
assembled into contigs (or singlets if appropriate MIRA parameters
were chosen). The file has two columns: first column is the name
of the read, second column is a code showing the reason and stage
at which the read was put into the debris category.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_largecontigs.txt</code>:
This file contains, as a simple list, the names of all the contigs
MIRA considers to be more or less important at the end of the
assembly. To be present in this list, a contig needs to reach a
certain length (usually 500, but see [-MI:lcs]) and have a
coverage of at least 1/3 of the average coverage (per sequencing
technology) of the complete project.
</p><p>
Note: only present for de-novo assemblies of genomes.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
The default heuristics (500bp length and 1/3 coverage per
sequencing technology) generally work well enough for most
projects. However, projects with extremely different coverage
numbers per sequencing technology may need to use different
numbers. E.g.: in a project with 80x Illumina and 6x Sanger,
contigs consisting of only 2 or 3 Sanger sequences but with an
average coverage >= 2 would also be in this list, although clearly
no one would look at these under normal circumstances.
</td></tr></table></div></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_groups.txt</code>:
This file contains information about readgroups as determined by
MIRA. Most interesting will probably be statistics concerning
read-pair sizes.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readrepeats</code>:
This file helps to find out which parts of which reads are quite
repetitive in a project. Please consult the chapter on how to
tackle "hard" sequencing projects to learn how this file can help
you in spotting sequencing mistakes and / or difficult parts in a
genome or EST / RNASeq project.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readstooshort</code>:
A list containing the names of those reads that have been sorted
out of the assembly only due to the fact that they were too short,
before any processing started.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readtaglist.txt</code>:
This file contains information about the tags and their position
that are present in each read. The read positions are given
relative to the forward direction of the sequence (i.e. as it was
entered into the assembly).
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_error_reads_invalid</code>:
A list of sequences that have been found to be invalid due to
various reasons (given in the output of the assembler).
</p></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_first_look:the_assembly_info"></a>9.2.
First look: the assembly info
</h2></div></div></div><p>
Once finished, have a look at the file
<code class="filename">*_info_assembly.txt</code> in the info directory. The
assembly information given there is split in three major parts:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
some general assembly information (number of reads assembled etc.). This
part is quite short at the moment and will be expanded in the future.
</p></li><li class="listitem"><p>
assembly metrics for 'large' contigs.
</p></li><li class="listitem"><p>
assembly metrics for all contigs.
</p></li></ol></div><p>
The part for large contigs contains several sections. The first of
these shows what MIRA counts as a large contig for this particular
project. For example, it may look like this:
</p><pre class="screen">
Large contigs:
--------------
With Contig size >= 500
AND (Total avg. Cov >= 19
OR Cov(san) >= 0
OR Cov(454) >= 8
OR Cov(pbs) >= 0
OR Cov(sxa) >= 11
OR Cov(sid) >= 0
)</pre><p>
The above is for a 454 and Solexa hybrid assembly in which MIRA
determined large contigs to be contigs
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
with a length of at least 500 bp and
</p></li><li class="listitem"><p>
with a total average coverage of at least 19x, an average 454
coverage of at least 8x or an average Solexa coverage of at least 11x
</p></li></ol></div><p>
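Expressed as code, the rule reads as follows. This is a minimal Python sketch using the thresholds of the example above; the function and parameter names are hypothetical, technologies with a threshold of 0 (not used in this project) are left out, and the way MIRA evaluates the rule internally may differ.
</p><pre class="screen">
```python
def is_large_contig(length, total_cov, cov_454, cov_solexa,
                    min_length=500, min_total=19, min_454=8, min_solexa=11):
    """Apply the 'size AND (coverage OR coverage ...)' rule shown above."""
    enough_coverage = (total_cov >= min_total
                       or cov_454 >= min_454
                       or cov_solexa >= min_solexa)
    return length >= min_length and enough_coverage


print(is_large_contig(1200, 12.0, 9.5, 2.5))   # True: 454 coverage suffices
print(is_large_contig(450, 60.0, 30.0, 30.0))  # False: contig is too short
```
</pre><p>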
The second section is about length assessment of large contigs:
</p><pre class="screen">
Length assessment:
------------------
Number of contigs: 44
Total consensus: 3567224
Largest contig: 404449
N50 contig size: 186785
N90 contig size: 55780
N95 contig size: 34578</pre><p>
In the above example, 44 contigs totalling 3.57 megabases were built,
the largest contig being 404 kilobases long. The N50, N90 and N95
numbers give the contig length at or above which the contigs contain
50%, 90% and 95% of the total consensus, respectively.
</p><p>
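How an N50-style number can be computed from a list of contig lengths is sketched below in Python. The function name <code class="literal">nxx</code> and the toy lengths are made up for illustration; the sketch follows the common definition of N50 and is not taken from the MIRA source code.
</p><pre class="screen">
```python
def nxx(contig_lengths, percent):
    """Length of the contig at which the running sum of lengths, taken from
    the longest contig downwards, reaches `percent` of the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= total * percent:
            return length
    return 0  # only reached for an empty list


lengths = [400, 300, 200, 100]
print(nxx(lengths, 50), nxx(lengths, 90), nxx(lengths, 95))
# prints: 300 200 100
```
</pre><p>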
The next section shows information about the coverage assessment of
large contigs. An example:
</p><pre class="screen">
Coverage assessment:
--------------------
Max coverage (total): 563
Max coverage
Sanger: 0
454: 271
PacBio: 0
Solexa: 360
Solid: 0
Avg. total coverage (size >= 5000): 57.38
Avg. coverage (contig size >= 5000)
Sanger: 0.00
454: 25.10
PacBio: 0.00
Solexa: 32.88
Solid: 0.00</pre><p>
Maximum coverage attained was 563, maximum for 454 alone 271 and for
Solexa alone 360. The average total coverage (computed from contigs with
a size ≥ 5000 bases) is 57.38. The average coverage by sequencing
technology (in contigs ≥ 5000) is 25.10 for 454 and 32.88 for Solexa
reads.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For genome assemblies, the value for <span class="emphasis"><em>Avg. total coverage
(size >= 5000)</em></span> is currently always calculated for contigs
having 5000 or more consensus bases. While this gives a very effective
measure for genome assemblies, assemblies of EST or RNASeq will often
have totally irrelevant values here: even if the default of MIRA is to
use smaller contig sizes (1000) for EST / RNASeq assemblies, the
coverage values for lowly and highly expressed genes can easily span a
factor of 10000 or more.
</p></td></tr></table></div><p>
The last section contains some numbers useful for quality assessment. It
looks like this:
</p><pre class="screen">
Quality assessment:
-------------------
Average consensus quality: 90
Consensus bases with IUPAC: 11 (you might want to check these)
Strong unresolved repeat positions (SRMc): 0 (excellent)
Weak unresolved repeat positions (WRMc): 19 (you might want to check these)
Sequencing Type Mismatch Unsolved (STMU): 0 (excellent)
Contigs having only reads wo qual: 0 (excellent)
Contigs with reads wo qual values: 0 (excellent)</pre><p>
Besides the average quality of the contigs and whether they contain reads
without quality values, MIRA shows the number of different tags in the
consensus which might point at problems.
</p><p>
The above mentioned sections (length assessment, coverage assessment and
quality assessment) for <span class="emphasis"><em>large</em></span> contigs are then
repeated for <span class="emphasis"><em>all</em></span> contigs, this time also
including contigs which MIRA did not count as large contigs.
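</p><p>
Because the file consists mostly of "name: value" lines, the numbers can be picked up by a script. The Python sketch below assumes the layout shown in the excerpts above; the function name is hypothetical and real files may contain lines this simple parser does not handle. In particular, per-technology lines such as "Sanger:" occur more than once and would overwrite each other here.
</p><pre class="screen">
```python
def parse_assessment(text):
    """Collect 'name: number' lines of an assessment block into a dict."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue  # headings, rulers and empty lines carry no value
        try:
            values[name.strip()] = float(fields[0])
        except ValueError:
            pass  # value is not a number, ignore the line
    return values


example = """\
Length assessment:
------------------
Number of contigs: 44
Total consensus: 3567224
N50 contig size: 186785
Consensus bases with IUPAC: 11 (you might want to check these)
"""
stats = parse_assessment(example)
print(int(stats["N50 contig size"]), int(stats["Consensus bases with IUPAC"]))
# prints: 186785 11
```
</pre><p>
When parsing a real file, keep the part for large contigs and the part for all contigs apart, as both use the same names.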
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_converting_results"></a>9.3.
Converting results
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_converting_miraconvert"></a>9.3.1.
Converting to and from other formats: <span class="command"><strong>miraconvert</strong></span>
</h3></div></div></div><p>
<span class="command"><strong>miraconvert</strong></span> is a tool in the MIRA package which
reads and writes a number of formats, ranging from full assembly
formats like CAF and MAF to simple output view formats like HTML or
plain text.
</p><div class="figure"><a name="chap_res::results_miraconvert.png"></a><p class="title"><b>Figure 9.1. <span class="command">miraconvert</span> supports a wide range of
format conversions to simplify export / import of results to and from
other programs</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/results_miraconvert.png" width="100%" alt="miraconvert supports a wide range of format conversions to simplify export / import of results to and from other programs"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_converting_reach_other_programs"></a>9.3.2.
Steps for converting data from / to other tools
</h3></div></div></div><p>
The question "How do I convert to / from other tools?" is complicated
by the plethora of file formats and tools available. This section
gives an overview of what is needed to reach the most important ones.
</p><div class="figure"><a name="chap_res::results_mira2other.png"></a><p class="title"><b>Figure 9.2.
Conversion steps, formats and programs needed to reach some tools
like assembly viewers, editors or scaffolders.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/results_mira2other.png" width="100%" alt="Conversion steps, formats and programs needed to reach some tools like assembly viewers, editors or scaffolders."></td></tr></table></div></div></div><br class="figure-break"><p>
Please also read the chapter on MIRA utilities in this manual to learn
more on <span class="command"><strong>miraconvert</strong></span> and have a look at
<code class="literal">miraconvert -h</code> which lists all possible formats
and other command line options.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_converting_to_from_staden"></a>9.3.2.1.
Example: converting to and from the Staden package (gap4 / gap5)
</h4></div></div></div><p>
The <span class="command"><strong>gap4</strong></span> program (and its
successor <span class="command"><strong>gap5</strong></span>) from the Staden package are pretty
useful finishing tools and assembly viewers. They have their own
database format which MIRA does not read or write, but there are
interconversion possibilities using the CAF format (for gap4) and
SAM format (for gap5).
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[gap4]
</span></dt><dd><p>
You need the <span class="command"><strong>caf2gap</strong></span>
and <span class="command"><strong>gap2caf</strong></span> utilities for this, which are
distributed separately by the Sanger Centre
(<a class="ulink" href="http://www.sanger.ac.uk/Software/formats/CAF/" target="_top">http://www.sanger.ac.uk/Software/formats/CAF/</a>).
Conversion is pretty straightforward. From MIRA to gap4, it's
like this:
</p><pre class="screen">
<code class="prompt">$</code> caf2gap -project <em class="replaceable"><code>YOURGAP4PROJECTNAME</code></em> -ace <em class="replaceable"><code>mira_result.caf</code></em> >&/dev/null</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Don't be fooled by the <code class="literal">-ace</code> parameter of
<span class="command"><strong>caf2gap</strong></span>. It needs a CAF file as input, not
an ACE file.
</td></tr></table></div><p>
From gap4 to CAF, it's like this:
</p><pre class="screen">
<code class="prompt">$</code> gap2caf -project <em class="replaceable"><code>YOURGAP4PROJECTNAME</code></em> >tmp.caf
<code class="prompt">$</code> miraconvert -r c tmp.caf <em class="replaceable"><code>somenewname</code></em>.caf</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Using <span class="command"><strong>gap2caf</strong></span>, be careful to use the simple
<code class="literal">></code> redirection to file and
<span class="emphasis"><em>not</em></span> the <code class="literal">>&</code>
redirection.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Using first <span class="command"><strong>gap2caf</strong></span> and then
<span class="command"><strong>miraconvert</strong></span> is needed as gap4 writes its
own consensus to the CAF file which is not necessarily the
best. Indeed, gap4 does not know about different sequencing
technologies like 454 and treats everything as
Sanger. Therefore, using
<span class="command"><strong>miraconvert</strong></span> with the [-r c] option
recalculates a MIRA consensus during the "conversion" from CAF to CAF.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
If you work with a 32 bit executable of caf2gap, it might very
well be that the converter needs more memory than can be
handled by 32 bit. Only solution: switch to a 64 bit
executable of caf2gap.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning:
caf2gap bug for sequence annotations in reverse direction
"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">
caf2gap bug for sequence annotations in reverse direction
</th></tr><tr><td align="left" valign="top"><p>
caf2gap has currently (as of version 2.0.2) a bug that turns
around all features in reverse direction during the
conversion from CAF to a gap4 project. There is a fix
available, please contact me for further information (until
I find time to describe it here).
</p></td></tr></table></div></dd><dt><span class="term">
[gap5]
</span></dt><dd><p>
The <span class="command"><strong>gap5</strong></span> program is the successor to
gap4. It comes with its own import utility
(<span class="command"><strong>tg_index</strong></span>) which can import SAM and CAF
files, and gap5 itself has an export function which also
writes SAM and CAF. It is suggested to use the SAM format to
export data to gap5 as it is more efficient and conveys more
information on sequencing technologies used.
</p><p>
Conversion is pretty straightforward. From MIRA to gap5, it's like
this:
</p><pre class="screen">
<code class="prompt">$</code> tg_index <em class="replaceable"><code>INPUT</code></em>_out.sam</pre><p>
This creates a gap5 database named
<code class="filename"><em class="replaceable"><code>INPUT</code></em>_out.g5d</code>
which can be directly loaded with gap5 like this:
</p><pre class="screen">
<code class="prompt">$</code> gap5 <em class="replaceable"><code>INPUT</code></em>_out.g5d</pre><p>
Exporting back to SAM or CAF is done in gap5 via
the <span class="emphasis"><em>File->Export Sequences</em></span> menu there.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_converting_to_from_sam"></a>9.3.2.2.
Example: converting to and from SAM (for samtools, tablet etc.)
</h4></div></div></div><p>
Converting to SAM is done by
using <span class="command"><strong>miraconvert</strong></span> on a MIRA MAF file, like this:
</p><pre class="screen">
<code class="prompt">$</code> miraconvert -f maf -t sam <em class="replaceable"><code>INPUT</code></em>.maf <em class="replaceable"><code>OUTPUT</code></em></pre><p>
The above will create a file named <code class="filename">OUTPUT.sam</code>.
</p><p>
Converting from SAM to a format which either <span class="command"><strong>mira</strong></span>
or <span class="command"><strong>miraconvert</strong></span> can understand takes a few
more steps. As neither tool currently reads SAM natively, you need
to go via the <span class="command"><strong>gap5</strong></span> editor of the Staden package:
convert the SAM via <span class="command"><strong>tg_index</strong></span> to a gap5 database,
load that database in gap5 and export it there to CAF.
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_filtering_of_results"></a>9.4.
Filtering results
</h2></div></div></div><p>
It is important to remember that, depending on assembly options, MIRA
will also include very small contigs (possibly with very low coverage)
made out of reads which were rejected from the "good" contigs for
quality or other reasons. You probably do not want to have a look at
this contig debris when finishing a genome unless you are really,
really, really picky.
</p><p>
Many people prefer to just go on with what would be large
contigs. Therefore, in de-novo assemblies, MIRA writes out separate
files of what it thinks are "good", large contigs. In case you want to
extract contigs differently, the <span class="command"><strong>miraconvert</strong></span> program
from the MIRA package can selectively filter CAF or MAF files for
contigs with a certain size, average coverage or number of reads.
</p><p>
The file <code class="filename">*_info_assembly.txt</code> in the info directory
at the end of an assembly might give you first hints on what could be
suitable filter parameters. As an example, for "normal" assemblies
(whatever this means), one could want to consider only contigs larger
than 500 bases and which have at least one third of the average coverage
of the N50 contigs.
</p><p>
Here's an example: In the "Large contigs" section, there's a "Coverage
assessment" subsection. It looks a bit like this:
</p><pre class="screen">
...
Coverage assessment:
--------------------
Max coverage (total): 43
Max coverage
Sanger: 0
454: 43
Solexa: 0
Solid: 0
Avg. total coverage (size ≥ 5000): 22.30
Avg. coverage (contig size ≥ 5000)
Sanger: 0.00
454: 22.05
Solexa: 0.00
Solid: 0.00
...</pre><p>
This project was obviously a 454-only project, and the average coverage
for it is ~22. This number was estimated by MIRA by taking only contigs
of at least 5kb into account, which for sure left out everything which
could be categorised as debris. Normally it's a pretty solid number.
</p><p>
Now, depending on how much time you want to invest performing some manual
polishing, you should extract contigs which have at least the following
fraction of the average coverage:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
2/3 if a quick and "good enough" result is what you want and you don't
want to do any manual polishing. In this example, that would be around
14 or 15.
</p></li><li class="listitem"><p>
1/2 if you want to have a "quick look" and possibly perform some
contig joins. In this example the number would be 11.
</p></li><li class="listitem"><p>
1/3 if you want to be quite accurate and be sure not to lose any
possible repeat. That would be 7 or 8 in this example.
</p></li></ul></div><p>
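To turn these fractions into concrete numbers for the example above, a quick calculation helps. The following is just a sketch using <span class="command"><strong>awk</strong></span> (not part of MIRA) with the average total coverage of 22.30 taken from the coverage assessment shown earlier:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>awk 'BEGIN { avg = 22.30; printf "2/3: %.2f  1/2: %.2f  1/3: %.2f\n", avg*2/3, avg/2, avg/3 }'</code></strong>
2/3: 14.87  1/2: 11.15  1/3: 7.43</pre><p>
The chosen value, rounded to a whole number, would then be passed to <span class="command"><strong>miraconvert</strong></span> via <code class="literal">-y</code>, as in the examples that follow.
</p><p>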
Example (useful with assemblies of Sanger data): extracting only contigs ≥
1000 bases and with a minimum average coverage of 4 into FASTA format:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 1000 -y 4 <em class="replaceable"><code>sourcefile.maf targetfile.fasta</code></em></code></strong></pre><p>
Example (useful with assemblies of 454 data): extracting only contigs
≥ 500 bases into FASTA format:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 500 <em class="replaceable"><code>sourcefile.maf targetfile.fasta</code></em></code></strong></pre><p>
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only
contigs ≥ 500 bases and with an average coverage ≥ 15 reads into
CAF format, then converting the reduced CAF into a Staden GAP4 project:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 500 -y 15 <em class="replaceable"><code>sourcefile.maf tmp.caf</code></em></code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>caf2gap -project <em class="replaceable"><code>somename</code></em> -ace <em class="replaceable"><code>tmp.caf</code></em></code></strong></pre><p>
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only
contigs ≥ 1000 bases and with ≥ 10 reads from MAF into CAF format,
then converting the reduced CAF into a Staden GAP4 project:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 1000 -z 10 <em class="replaceable"><code>sourcefile.maf tmp.caf</code></em></code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>caf2gap -project <em class="replaceable"><code>somename</code></em> -ace <em class="replaceable"><code>tmp.caf</code></em></code></strong></pre></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_places_of_importance_in_a_de_novo_assembly"></a>9.5.
Places of importance in a de-novo assembly
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_tags_set_by_mira"></a>9.5.1.
Tags set by MIRA
</h3></div></div></div><p>
MIRA sets a number of different tags in resulting assemblies. They can be set in reads
(in which case they mostly end with an <span class="emphasis"><em>r</em></span>) or in the consensus (then
ending with a <span class="emphasis"><em>c</em></span>).
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
If you use the
Staden <span class="command"><strong>gap4</strong></span>, <span class="command"><strong>gap5</strong></span> or
<span class="command"><strong>consed</strong></span> assembly editor to tidy up the assembly, you
can directly jump to places of interest that MIRA marked for further
analysis by using the search functionality of these programs.
</p><p>
However, you need to tell these programs that these tags exist. For
that you must change some configuration files. More information on
how to do this can be found in the
<code class="filename">support/README</code> file of the MIRA distribution.
</p></td></tr></table></div><p>
You should search for the following "consensus" tags for finding places of importance
(in this order).
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
IUPc
</p></li><li class="listitem"><p>
UNSc
</p></li><li class="listitem"><p>
SRMc
</p></li><li class="listitem"><p>
WRMc
</p></li><li class="listitem"><p>
STMU (only hybrid assemblies)
</p></li><li class="listitem"><p>
MCVc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SROc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SAOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SIOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
STMS (only hybrid assemblies)
</p></li></ul></div><p>
</p><p>
Of lesser importance are the "read" versions of the tags above:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
UNSr
</p></li><li class="listitem"><p>
SRMr
</p></li><li class="listitem"><p>
WRMr
</p></li><li class="listitem"><p>
SROr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SAOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SIOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li></ul></div><p>
</p><p>
In normal assemblies (only one sequencing technology, just one
strain), search for the IUPc, UNSc, SRMc and WRMc tags.
</p><p>
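To get a first impression of how many such places there are before opening an editor, the tags can be counted on the command line. This is only a sketch: it assumes the assembly is available as CAF and that tags appear there as lines of the form <code class="literal">Tag TYPE start end "comment"</code>; the file name is a placeholder:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>for tag in IUPc UNSc SRMc WRMc; do printf '%s\t' "$tag"; grep -c "^Tag $tag " <em class="replaceable"><code>INPUT</code></em>_out.caf; done</code></strong></pre><p>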
In hybrid assemblies, searching for the IUPc, UNSc, SRMc, WRMc, and
STMU tags and correcting only those places will allow you to have a
qualitatively good assembly in no time at all.
</p><p>
Columns with SRMr tags (SRM in <span class="bold"><strong>R</strong></span>eads)
in an assembly without an SRMc tag at the same consensus position show
where MIRA was able to resolve a repeat during the different passes of
the assembly; you don't need to look at these. SRMc and WRMc tags,
however, mean that there may be unresolved trouble ahead, so you should
take a look at these.
</p><p>
Especially in mapping assemblies, columns with the MCVc, SROx, SIOx and SAOx tags are
extremely helpful in finding places of interest. As they are only set if you
gave strain information to MIRA, you should always do that.
</p><p>
For more information on the tags set and used by MIRA and what exactly they mean, please look up the
corresponding section in the reference chapter.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_other_places_of_importance"></a>9.5.2.
Other places of importance
</h3></div></div></div><p>
The read coverage histogram as well as the template display of gap4
will help you to spot other places of potential interest. Please consult the
gap4 documentation.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_joining_contigs"></a>9.5.3.
Joining contigs
</h3></div></div></div><p>
I recommend investing a couple of minutes (in the best case) to a few
hours in joining contigs, especially if the uniform read distribution
option of MIRA was used (but first filter for large contigs). This
way, you will reduce the number of "false repeats" and improve the
overall quality of your assembly.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_joining_truerepeats"></a>9.5.3.1.
Joining contigs at true repetitive sites
</h4></div></div></div><p>
Joining contigs at repetitive sites of a genome is always a
difficult decision. There are, however, two rules which can help:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem">
If the sequencing was done without a paired-end library, don't join.
</li><li class="listitem">
If the sequencing was done with a paired-end library, but no
pair (or template) spans the join site, don't join.
</li></ol></div><p>
</p><p>
The following screen shot shows a case where one should not join as
the finishing program (in this case <span class="command"><strong>gap4</strong></span>) warns
that no template (read-pair) spans the join site:
</p><p>
</p><div class="figure"><a name="haf_danger_join_notok.png"></a><p class="title"><b>Figure 9.3.
Join at a repetitive site which should not be performed due to
missing spanning templates.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf_danger_join_notok.png" width="100%" alt="Join at a repetitive site which should not be performed due to missing spanning templates."></td></tr></table></div></div></div><p><br class="figure-break">
</p><p>
The next screen shot shows a case where one should join as the
finishing program (in this case <span class="command"><strong>gap4</strong></span>) finds
templates spanning the join site and all of them are good:
</p><p>
</p><div class="figure"><a name="haf_danger_join_ok.png"></a><p class="title"><b>Figure 9.4.
Join at a repetitive site which should be performed due to
spanning templates being good.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf_danger_join_ok.png" width="100%" alt="Join at a repetitive site which should be performed due to spanning templates being good."></td></tr></table></div></div></div><p><br class="figure-break">
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_joining_FALSErepeats"></a>9.5.3.2.
Joining contigs at "wrongly discovered" repetitive sites
</h4></div></div></div></div><p>
Remember that MIRA takes a very cautious approach in contig building,
and sometimes creates two contigs when it could have created
one. Three main reasons can be the cause for this:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
when using <span class="emphasis"><em>uniform read distribution</em></span>, some
non-repetitive areas may have generated so many more reads that
they start to look like repeats (so-called pseudo-repeats). In
this case, reads that are above a given coverage are
<span class="emphasis"><em>shaved off</em></span> (see [-AS:urdcm]) and kept
in reserve to be used for another copy of that repeat ... which in
case of a non-repetitive region will of course never arrive. So at
the end of an assembly, these shaved-off reads will form short,
low coverage contig debris which can more or less be safely
ignored and sorted out via the filtering options ([-x -y
-z]) of <span class="command"><strong>miraconvert</strong></span>.
</p><p>
Some 454 library construction protocols -- especially, but not
exclusively, for paired-end reads -- create pseudo-repeats quite
frequently. In this case, the pseudo-repeats are characterised by
several reads starting at exactly the same position but which can
have different lengths. Should MIRA have separated these reads
into different contigs, these can be -- most of the time -- safely
joined. The following figure shows such a case:
</p><div class="figure"><a name="454_stacks_join.png"></a><p class="title"><b>Figure 9.5.
Pseudo-repeat in 454 data due to sequencing artifacts
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_stacks_join.png" width="100%" alt="Pseudo-repeat in 454 data due to sequencing artifacts"></td></tr></table></div></div></div><br class="figure-break"><p>
For Solexa data, a non-negligible GC bias has been reported in
genome assemblies since late 2009. In genomes with moderate to
high GC, this bias actually favours regions with lower
GC. Examples were observed where regions with an average GC of 10%
less than the rest of the genome had between two and four times
more reads than the rest of the genome, leading to false
"discovery" of duplicated genome regions.
</p></li><li class="listitem"><p>
when using unpaired data, the above described possibility of
having "too many" reads in a non-repetitive region can also lead
to a contig being separated into two contigs in the region of the
pseudo-repeat.
</p></li><li class="listitem"><p>
a number of reads (sometimes even just one) can contain "high
quality garbage", that is, nonsense bases which got - for some
reason or another - good quality values. This garbage can be
distributed on a long stretch in a single read or concern just a
single base position across several reads.
</p><p>
While MIRA has some algorithms to deal with the disrupting effects
of reads like these, the algorithms are not always 100% effective and
some such reads might slip through the filters.
</p></li></ol></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_places_of_interest_in_a_mapping_assembly"></a>9.6.
Places of interest in a mapping assembly
</h2></div></div></div><p>
This section just gives a short overview of the tags you might find
interesting. For more information, especially on how to configure gap4
or consed, please consult the <span class="emphasis"><em>mira usage</em></span> document
and the <span class="emphasis"><em>mira</em></span> manual.
</p><p>
In file types that allow tags (CAF, MAF, ACE), SNPs and other
interesting features will be marked by MIRA with a number of tags. The
following sections give a brief overview. For a description of what
the tags are (SROc, WRMc etc.), please read the section "Tags used
in the assembly by MIRA and EdIt" in the main manual.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Screen shots in this section are taken from the walk-through with
Lenski data (see below).
</td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_where_are_snps?"></a>9.6.1.
Where are SNPs?
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the <span class="bold"><strong>SROc</strong></span> tag will point to most
SNPs. Should you assemble sequences of more than one strain (I
cannot really recommend such a strategy), you also might
encounter <span class="bold"><strong>SIOc</strong></span> and <span class="bold"><strong>SAOc</strong></span> tags.
</p><div class="figure"><a name="chap_sol::sxa_sroc_lenski1.png"></a><p class="title"><b>Figure 9.6.
"SROc" tag showing a SNP position in a Solexa mapping
assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski1.png" width="100%" alt='"SROc" tag showing a SNP position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_sol::sxa_sroc_lenski2.png"></a><p class="title"><b>Figure 9.7.
"SROc" tag showing a SNP/indel position in a Solexa mapping
assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski2.png" width="100%" alt='"SROc" tag showing a SNP/indel position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></li><li class="listitem"><p>
the <span class="bold"><strong>WRMc</strong></span> tags might sometimes
point to SNPs or to indels of one or two bases.
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_where_are_insertions_deletions_or_genome_rearrangements?"></a>9.6.2.
Where are insertions, deletions or genome re-arrangements?
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Large deletions: the <span class="bold"><strong>MCVc</strong></span> tags
point to deletions in the resequenced data, where no read is
covering the reference genome.
</p><div class="figure"><a name="chap_sol::sxa_mcvc_lenski.png"></a><p class="title"><b>Figure 9.8.
"MCVc" tag (dark red stretch in figure) showing a genome
deletion in Solexa mapping assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_mcvc_lenski.png" width="100%" alt='"MCVc" tag (dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></li><li class="listitem"><p>
Insertions, small deletions and re-arrangements: these are
harder to spot. In unpaired data sets they can be found looking
at clusters of <span class="bold"><strong>SROc</strong></span>, <span class="bold"><strong>SRMc</strong></span>, <span class="bold"><strong>WRMc</strong></span>, and / or <span class="bold"><strong>UNSc</strong></span> tags.
</p><div class="figure"><a name="chap_sol::sxa_wrmcsrmc_hiding_lenski1.png"></a><p class="title"><b>Figure 9.9.
An IS150 insertion hiding behind a WRMc and an SRMc tag
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_wrmcsrmc_hiding_lenski1.png" width="100%" alt="An IS150 insertion hiding behind a WRMc and an SRMc tag"></td></tr></table></div></div></div><br class="figure-break"><p>
More massive occurrences of these tags lead to a rather colourful
display in finishing programs, which is why these clusters are
also sometimes called Xmas-trees.
</p><div class="figure"><a name="chap_sol::sxa_xmastree_lenski1.png"></a><p class="title"><b>Figure 9.10.
A 16 base pair deletion leading to a SROc/UNSc xmas-tree
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski1.png" width="100%" alt="A 16 base pair deletion leading to a SROc/UNSc xmas-tree"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_sol::sxa_xmastree_lenski2.png"></a><p class="title"><b>Figure 9.11.
An IS186 insertion leading to a SROc/UNSc xmas-tree
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski2.png" width="100%" alt="An IS186 insertion leading to a SROc/UNSc xmas-tree"></td></tr></table></div></div></div><br class="figure-break"><p>
In sets with paired-end data, post-processing software (or
alignment viewers) can use the read-pair information to guide
you to these sites (MIRA doesn't set tags at the moment).
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_other_tags_of_interest"></a>9.6.3.
Other tags of interest
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the <span class="bold"><strong>UNSc</strong></span> tag points to areas
where the consensus algorithm had trouble choosing a base. This
happens in low coverage areas, at places of insertions (compared
to the reference genome) or sometimes also in places where
repeats with a few bases difference are present. Often enough,
these tags are in areas with problematic sequences for the
Solexa sequencing technology like, e.g., a
<code class="literal">GGCxG</code> or even <code class="literal">GGC</code> motif in
the reads.
</p></li><li class="listitem"><p>
the <span class="bold"><strong>SRMc</strong></span> tag points to places
where repeats with a few bases difference are present. Here too,
sequences problematic for the Solexa technology are likely to
have caused base-calling errors and, subsequently, the setting of this
tag.
</p></li></ul></div><p>
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_postprocessing_mapping_assemblies"></a>9.7.
Post-processing mapping assemblies
</h2></div></div></div><p>
This section is a bit terse; you should also read the chapter on
<span class="emphasis"><em>working with results</em></span> of MIRA.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_pp_manual_cleanup"></a>9.7.1.
Manual cleanup and validation (optional)
</h3></div></div></div><p>
When working with resequencing data and a mapping assembly, I always
load finished projects into an assembly editor and perform a quick
cleanup of the results. SNPs or small indels normally do not need
cleanups, but every mapper will get larger indels mostly wrong, and
MIRA is no exception to this.
</p><p>
For close relatives of the reference strain this doesn't take long as
MIRA will have set tags (see section earlier in this document) at all
sites you should have a look at. For example, I usually clean up very
close mutant bacteria with just SNPs or simple deletions and no genome
reorganisation in 10 to 15 minutes. That gives the
last boost to data quality and your users (biologists etc.) will thank
you for that as it reduces their work in analysing the data (be it
looking at data or performing wet-lab experiments).
</p><p>
The general workflow I use is to convert the CAF file to a gap4 or gap5
database. Then, in gap4 or gap5, I
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
quickly search for the UNSc and WRMc tags and check whether they
could be real SNPs that were overlooked by MIRA. In that case, I
manually set an SROc (or SIOc) tag in gap4 via hotkeys that were
defined to set these tags.
</p></li><li class="listitem"><p>
sometimes also quickly clean up reads that are causing trouble in
alignments and lead to wrong base calling. These can be found at
sites with UNSc tags, most of the time they have the 5' to 3'
<code class="literal">GGCxG</code> motif which can cause trouble to Solexa.
</p></li><li class="listitem"><p>
look at sites with deletions (tagged with MCVc) and look whether I
should clean up the borders of the deletion.
</p></li></ol></div><p>
After this, I convert the gap4 or gap5 database back to CAF format.
But beware: gap4 does not have the same consensus calling routines as
MIRA and will have saved its own consensus in the new CAF. In fact,
gap4 performs rather badly in projects with multiple sequencing
technologies. So I use miraconvert from the MIRA package to recalculate
a good consensus (and save it in MAF as it's more compact and a lot
faster in handling than CAF).
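</p><p>
A minimal sketch of that step, assuming the <code class="literal">-r c</code> option of miraconvert (recalculate the consensus) behaves as described in the miraconvert reference; the file names are placeholders:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -r c <em class="replaceable"><code>edited.caf final_result.maf</code></em></code></strong></pre><p>
Check the options section of miraconvert for the exact meaning of the <code class="literal">-r</code> values before relying on this.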
</p><p>
From this MAF file I can then convert with miraconvert to any
other format I or my users need: CAF, FASTA, ACE, WIG (for coverage
analysis), SNP and coverage analysis tables (see below), HTML etc.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_comprehensive_snp_analysis_spreadsheet_tables_for_excel_or_oocalc"></a>9.7.2.
Comprehensive SNP analysis spreadsheet tables (for Excel or OOcalc)
</h3></div></div></div><p>
Biologists are not really interested in SNP coordinates, and why
should they? They're more interested where SNPs are, how good they
are, which genes or other elements they hit, whether they have an
effect on a protein sequence, whether they may be important etc. For
organisms without intron/exon structure or splice variants, MIRA can
generate pretty comprehensive tables and files if an annotated
GenBank file was used as reference and strain information was given
to MIRA during the assembly.
</p><p>
Well, MIRA does all that automatically for you if the reference
sequence you gave was annotated.
</p><p>
For this, <span class="command"><strong>miraconvert</strong></span> should be used with the
<span class="emphasis"><em>asnp</em></span> format as target and a MAF (or CAF) file as
input:
</p><pre class="screen"><code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t asnp <em class="replaceable"><code>input.maf output</code></em></code></strong></pre><p>
Note that it is strongly suggested to perform a quick manual cleanup
of the assembly prior to this: in rare cases (mainly at sites of
small indels of one or two bases), MIRA will not tag SNPs with a SNP
tag (SROc, SAOc or SIOc) but will be fooled into a tag denoting
unsure positions (UNSc). This can be quickly corrected manually. See
the section on manual cleanup and validation above.
</p><p>
After conversion, you will have four files in the directory which
you can all drag-and-drop into spreadsheet applications like
OpenOffice Calc or Excel.
</p><p>
The files should be pretty self-explanatory, here's just a short overview:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<code class="filename">output_info_snplist.txt</code> is a simple list of
the SNPs, with their positions compared to the reference
sequence (in bases and map degrees on the genome) as well as the
GenBank features they hit.
</p></li><li class="listitem"><p>
<code class="filename">output_info_featureanalysis.txt</code> is a much
extended version of the list above. It puts the SNPs into
context of the features (proteins, genes, RNAs etc.) and gives a
nice list, SNP by SNP, what might cause bigger changes in
proteins.
</p></li><li class="listitem"><p>
<code class="filename">output_info_featuresummary.txt</code> looks at the
changes (SNPs, indels) from the other way round. It gives an
excellent overview which features (genes, proteins, RNAs,
intergenic regions) you should investigate.
</p><p>
There's one column (named 'interesting') which pretty much
summarises everything you need into three categories: yes,
no, and perhaps. 'Yes' is set if indels were detected, an amino
acid changed, start or stop codon changed or for SNPs in
intergenic regions and RNAs. 'Perhaps' is set for SNPs in
proteins that change a codon, but not an amino acid (silent
SNPs). 'No' is set if no SNP is hitting a feature.
</p></li><li class="listitem"><p>
<code class="filename">output_info_featuresequences.txt</code> simply
gives the sequences of each feature of the reference sequence
and the resequenced strain.
</p></li></ol></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_html_files_depicting_snp_positions_and_deletions"></a>9.7.3.
HTML files depicting SNP positions and deletions
</h3></div></div></div><p>
I've come to realise that people who don't handle data from NextGen
sequencing technologies on a regular basis (e.g., many biologists)
don't want to be bothered with learning to handle specialised
programs to have a look at their resequenced strains. Be it because
they don't have time to learn how to use a new program or because
their desktop is not strong enough (CPU, memory) to handle the data
sets.
</p><p>
Something even biologists know how to operate is a browser. Therefore,
miraconvert has the option to load a MAF (or CAF) file of a
mapping assembly and output to HTML those areas which are interesting
to biologists. It uses the tags SROc, SAOc, SIOc and MCVc and outputs
the surrounding alignment of these areas together with a nice overview
and links to jump from one position to the previous or next.
</p><p>
This is done with the '<code class="literal">-t hsnp</code>' option of
miraconvert:
</p><pre class="screen"><code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t hsnp <em class="replaceable"><code>input.maf output</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I recommend doing this only if the resequenced strain is a very close
relative to the reference genome, else the HTML gets pretty big. But
for a couple of hundred SNPs it works great.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_wig_files"></a>9.7.4.
WIG files depicting contig coverage or GC content
</h3></div></div></div><p>
<span class="command"><strong>miraconvert</strong></span> can also dump a coverage file in WIG
format (using '<code class="literal">-t wig</code>') or a WIG file for GC
content (using '<code class="literal">-t gcwig</code>'). This comes in pretty handy
for searching genome deletions or duplications in programs like the
Affymetrix Integrated Genome Browser (IGB, see <a class="ulink" href="http://igb.bioviz.org/" target="_top">http://igb.bioviz.org/</a>) or when looking for foreign sequence
in a genome.
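</p><p>
Following the pattern of the other conversions in this section, the two calls could look like this (a sketch; the input and output names are placeholders):
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t wig <em class="replaceable"><code>input.maf coverage</code></em></code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t gcwig <em class="replaceable"><code>input.maf gccontent</code></em></code></strong></pre><p>
The resulting WIG files can then be loaded as tracks in a genome browser.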
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_tables_for_feature_coverage"></a>9.7.5.
Comprehensive spreadsheet tables for gene expression values / genome deletions & duplications
</h3></div></div></div><p>
When having data mapped against a reference with annotations (either
from GenBank formats or GFF3 formats),
<span class="command"><strong>miraconvert</strong></span> can generate tables depicting
either expression values (in RNASeq/EST data mappings) or probable
genome multiplication and deletion factors (in genome mappings). For
this to work, you must use a MAF or CAF file as input, specify
<span class="emphasis"><em>fcov</em></span> as output format and the reference sequence
must have had annotations during the mapping with MIRA.
</p><p>TODO: add example</p><pre class="screen"><strong class="userinput"><code>miraconvert -t fcov <em class="replaceable"><code>mira_out.maf myfeaturetable</code></em></code></strong></pre></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_mutils"></a>Chapter 10. Utilities in the MIRA package</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_mutils_convpro">10.1. miraconvert</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_mutils_cp_synopsis">10.1.1.
Synopsis
</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_cp_description">10.1.2. Description</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_cp_options">10.1.3. Options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_mutils_cp_options_general">10.1.3.1. General options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_cp_options_contigs">10.1.3.2. Options for input containing contig data</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_mutils_cp_examples">10.1.4. Examples</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_mutils_bait">10.2. mirabait - a "grep" like tool to select reads with kmers up to 256 bases</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_mutils_bait_synopsis">10.2.1.
Synopsis
</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_description">10.2.2. Description</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_options">10.2.3. Options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_mutils_bait_mainoptions">10.2.3.1. Main options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_bait_outputdef">10.2.3.2. File type options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_bait_other">10.2.3.3. Other options</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_mutils_bait_examples">10.2.4. Usage examples</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_installrrnadb">10.2.5. Installing different rRNA databases</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Ninety percent of success is just growing up.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_mutils_convpro"></a>10.1. miraconvert</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_synopsis"></a>10.1.1.
Synopsis
</h3></div></div></div><div class="cmdsynopsis"><p><code class="command">miraconvert</code> [options] {<em class="replaceable"><code>input_file</code></em>} {<em class="replaceable"><code>output_basename</code></em>}</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_description"></a>10.1.2. Description</h3></div></div></div><p>
<span class="command"><strong>miraconvert</strong></span> is a tool to convert, extract and
sometimes recalculate all kinds of data related to sequence assembly
files.
</p><p>
More specifically, <span class="command"><strong>miraconvert</strong></span> can
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
convert from multiple alignment files (CAF, MAF) to other multiple
alignment files (CAF, MAF, ACE, SAM), and -- if wished -- selecting
contigs by different criteria like name, length, coverage etc.
</p></li><li class="listitem"><p>
extract the consensus from multiple alignments in CAF and MAF format,
writing it to any supported output format (FASTA, FASTQ, plain text,
HTML, etc.) and -- if wished -- recalculating the consensus using
the MIRA consensus engine with MIRA parameters
</p></li><li class="listitem"><p>
extract read sequences (clipped or unclipped) from multiple
alignments and save to any supported format
</p></li><li class="listitem"><p>
Much more, need to document this.
</p></li></ol></div><p>
</p><p>…</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_options"></a>10.1.3. Options</h3></div></div></div><p>…</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_cp_options_general"></a>10.1.3.1. General options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="option">-f
<em class="replaceable"><code>
{ <code class="option">caf</code> | <code class="option">maf</code> | <code class="option">fasta</code> | <code class="option">fastq</code> | <code class="option">gbf</code> | <code class="option">phd</code> | <code class="option">fofnexp</code> }
</code></em>
</code>
</span></dt><dd><p>
<span class="quote">“<span class="quote">From-type</span>”</span>, the format of the input file. CAF and MAF
files can contain full assemblies and/or unassembled (single)
sequences while the other formats contain only unassembled
sequences.
</p></dd><dt><span class="term">
<code class="option">-t
<em class="replaceable"><code>
{ <code class="option">ace</code> | <code class="option">asnp</code> | <code class="option">caf</code> | <code class="option">crlist</code> | <code class="option">cstats</code> | <code class="option">exp</code> | <code class="option">fasta</code> | <code class="option">fastq</code> | <code class="option">fcov</code> | <code class="option">gbf</code> | <code class="option">gff3</code> | <code class="option">hsnp</code> | <code class="option">html</code> | <code class="option">maf</code> | <code class="option">phd</code> | <code class="option">sam</code> | <code class="option">samnbb</code> | <code class="option">text</code> | <code class="option">tcs</code> | <code class="option">wig</code> }
</code></em>
</code>
<code class="option">[ -t … ]</code>
</span></dt><dd><p>
<span class="quote">“<span class="quote">To-type</span>”</span>, the format of the output file. Multiple
mentions of [-t] are allowed, in which case
<span class="command"><strong>miraconvert</strong></span> will convert to multiple types.
</p></dd><dt><span class="term"><code class="option">-a</code></span></dt><dd><p>
Append. Results of conversion are appended to existing files instead of overwriting them.
</p></dd><dt><span class="term"><code class="option">-A</code></span></dt><dd><p>
Do not adjust sequence case.
</p><p>
When reading formats which define clipping points (like CAF,
MAF or EXP), and saving to formats which do not have clipping
information, miraconvert normally adjusts the case of read
sequences: lower case for clipped parts, upper case for
unclipped parts of reads. Use -A if you do not want this. See
also -C.
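</p><p>
As an illustration only, the case convention can be sketched in a few
lines of Python. This is not miraconvert's code: the function names and
the 0-based, half-open clip coordinates are assumptions made for this
sketch.
</p><pre class="screen">
```python
def adjust_case(seq, clip_left, clip_right):
    """Lower-case the clipped ends, upper-case the unclipped middle part.

    clip_left and clip_right are hypothetical 0-based, half-open clip points:
    seq[clip_left:clip_right] is the unclipped (good) part of the read.
    """
    return (seq[:clip_left].lower()
            + seq[clip_left:clip_right].upper()
            + seq[clip_right:].lower())


def hard_clip(seq, clip_left, clip_right):
    """Keep only the unclipped part, comparable to what -C asks for."""
    return seq[clip_left:clip_right]


print(adjust_case("ACGTACGTAC", 2, 7))  # acGTACGtac
print(hard_clip("ACGTACGTAC", 2, 7))    # GTACG
```
</pre><p>
The first call mimics the default behaviour, the second one the effect
of -C; with -A no such case adjustment takes place.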
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Applies only to files/formats which do not contain contigs.
</td></tr></table></div></dd><dt><span class="term"><code class="option">-b</code></span></dt><dd><p>
Blind data. Replace all bases in all reads / contigs with a 'c'.
</p></dd><dt><span class="term"><code class="option">-C</code></span></dt><dd><p>
Hard clip reads. When the input is a format which contains clipping
points in sequences and the requested output consists of sequences
of reads, only the unclipped parts of sequences will be saved as
results.
</p></dd><dt><span class="term"><code class="option">-d</code></span></dt><dd><p>
Delete gap only columns. When output is contigs: delete
columns that are entirely gaps (can occur after having deleted
reads during editing in gap4, consed or other). When output is
reads: delete gaps in reads.
</p></dd><dt><span class="term"><code class="option">-F</code></span></dt><dd><p>
Filter read groups to different files. Works only for input
files containing readgroups, i.e., CAF or MAF. 3 (or 4) files
are generated: one or two for paired, one for unpaired and one
for debris reads. Reads in paired file are interlaced by
default, use -F twice to create separate files.
</p></dd><dt><span class="term"><code class="option">-m</code></span></dt><dd><p>
Make contigs. Encase single reads as contig singlets into a CAF/MAF
file.
</p></dd><dt><span class="term"><code class="option">-n <em class="replaceable"><code>namefile</code></em></code></span></dt><dd><p>
Name select. Only contigs or reads whose name appears in
<code class="filename">namefile</code> are selected for
output. <code class="filename">namefile</code> is a
simple text file having one name entry per line.
</p></dd><dt><span class="term"><code class="option">-i</code></span></dt><dd><p>
When -n is used, inverts the selection.
</p></dd><dt><span class="term"><code class="option">-o <em class="replaceable"><code>offset</code></em></code></span></dt><dd><p>
Offset of quality values in FASTQ files. Only valid if -f is FASTQ.
</p></dd><dt><span class="term"><code class="option">-P <em class="replaceable"><code>MIRA-PARAMETERSTRING</code></em></code></span></dt><dd><p>
Additional MIRA parameters. Allows initialising the underlying MIRA
routines with specific parameters. A use case can be, e.g., to
recalculate a consensus of an assembly in a slightly different way
(see also [-r]) than the one which is stored in assembly
files. Example: to tell the consensus algorithm to use a minimum
number of reads per group for 454 reads, use: "454_SETTINGS -CO:mrpg=4".
</p><p>
Consult the MIRA reference manual for a full list of MIRA
parameters.
</p></dd><dt><span class="term"><code class="option">-q quality_value</code></span></dt><dd><p>
When loading read data from files where sequence and quality
are split in several files (e.g. FASTA with .fasta and
.fasta.qual files), do not stop if the quality values for a
read are missing but set them to be the quality_value given.
</p></dd><dt><span class="term"><code class="option">-R <em class="replaceable"><code>namestring</code></em></code></span></dt><dd><p>
Rename contigs/singlets/reads with given name string to which
a counter is added.
</p><p>
Known bug: will create duplicate names if input (CAF or
MAF) contains contigs/singlets as well as free reads, i.e.
reads that are neither in contigs nor singlets.
</p></dd><dt><span class="term"><code class="option">-S <em class="replaceable"><code>namescheme</code></em></code></span></dt><dd><p>
Naming scheme for renaming reads, important for
paired-ends. Only 'solexa' is supported at the moment.
</p></dd><dt><span class="term"><code class="option">-Y <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Yield. Defines the maximum number of (clipped/padded) bases to
convert. When used on reads: the output contains the first reads
of the file whose clipped lengths total at least -Y.
When used on contigs: the output contains the first contigs of
the file whose padded lengths total at least -Y.
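</p><p>
One way to read this rule is sketched below in Python. It is only my
interpretation of the description above, not code taken from
miraconvert: items are taken in file order until the running total
reaches the requested yield. The function name and the (name, length)
tuples are made up for the example.
</p><pre class="screen">
```python
def take_until_yield(items, min_total):
    """Return the leading items whose summed lengths first reach min_total.

    items is a list of (name, length) tuples in file order; for reads the
    length would be the clipped length, for contigs the padded length.
    """
    selected = []
    total = 0
    for name, length in items:
        if total >= min_total:
            break
        selected.append(name)
        total += length
    return selected


reads = [("r1", 40), ("r2", 35), ("r3", 50), ("r4", 20)]
print(take_until_yield(reads, 100))  # ['r1', 'r2', 'r3']
```
</pre><p>
In this toy input the first three reads total 125 bases, so the fourth
read is no longer written.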
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_cp_options_contigs"></a>10.1.3.2. Options for input containing contig data</h4></div></div></div><p>
The following switches will work only if the input file contains
contigs (i.e., CAF or MAF with contig data). Though infrequent, note
that both CAF and MAF can contain single reads only.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-M</code></span></dt><dd><p>
Do not extract contigs (or their consensus), but the reads
they are composed of.
</p></dd><dt><span class="term"><code class="option">-N <em class="replaceable"><code>namefile</code></em></code></span></dt><dd><p>
Name select, sorted. Only contigs/reads whose name appears in
<code class="filename">namefile</code> are selected for
output. Regardless of the order of
contigs/reads in the input, the output is sorted according to
the appearance of names in
<code class="filename">namefile</code>. <code class="filename">namefile</code>
is a simple text file having one name entry per line.
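</p><p>
The selection and reordering can be pictured with the following Python
sketch. It is a simplified model under my own assumptions (records held
in a dictionary, names missing from the input silently skipped) and not
the actual miraconvert implementation.
</p><pre class="screen">
```python
def select_sorted(records, wanted_names):
    """Return the records named in wanted_names, in the order of that list.

    records maps a contig or read name to its data; names absent from the
    input are skipped silently here (an assumption of this sketch).
    """
    return [(name, records[name]) for name in wanted_names if name in records]


contigs = {"bchoc_c3": "...", "bchoc_c5": "...", "bchoc_c13": "...",
           "bchoc_c14": "...", "bchoc_c20": "..."}
namefile_lines = ["bchoc_c14", "bchoc_c3", "bchoc_c5", "bchoc_c13"]
print([name for name, _ in select_sorted(contigs, namefile_lines)])
# ['bchoc_c14', 'bchoc_c3', 'bchoc_c5', 'bchoc_c13']
```
</pre><p>
The output follows the order of the name list, not the order in which
the contigs are stored in the input.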
</p><p>
Note that for this function to work, all contigs/reads are
loaded into memory which may be straining your RAM for larger
projects.
</p></dd><dt><span class="term">
<code class="option">-r
<em class="replaceable"><code>
{ <code class="option">c</code> | <code class="option">C</code> | <code class="option">q</code> | <code class="option">f</code> }
</code></em>
</code>
</span></dt><dd><p>
Recalculate consensus and / or consensus quality values and / or
SNP feature tags of an assembly. This feature is useful in case
third party programs create their own consensus sequences without
handling different sequencing technologies (e.g. the combination
of <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>caf2gap</strong></span>) or
when the CAF/MAF files do not contain consensus sequences at
all.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">c</code></span></dt><dd>
recalculate consensus & consensus qualities using IUPAC where necessary
</dd><dt><span class="term"><code class="option">C</code></span></dt><dd>
recalculate consensus & consensus qualities forcing ACGT calls and without IUPAC codes
</dd><dt><span class="term"><code class="option">q</code></span></dt><dd>
recalculate consensus quality values only
</dd><dt><span class="term"><code class="option">f</code></span></dt><dd>
recalculate SNP features
</dd></dl></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Only the last of cCq is relevant, 'f' works as a switch and can be
combined with the others (e.g. <span class="quote">“<span class="quote">-r Cf</span>”</span>).
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
If the CAF/MAF contains reads from multiple strains, recalculation
of consensus & consensus qualities is forced, you can just
influence whether IUPACs are used or not. This is due to the fact
that CAF/MAF do not provide facilities to store consensus
sequences from multiple strains.
</td></tr></table></div></dd><dt><span class="term"><code class="option">-s</code></span></dt><dd><p>
Split. Split output into single files, one file per
contig. Files are named according to name of contig.
</p></dd><dt><span class="term"><code class="option">-u</code></span></dt><dd><p>
fillUp strain genomes. In assemblies made of multiple strains,
holes in the consensus of a strain (bases 'N' or '@') can be
filled up with the consensus of the other strains. Takes effect
only when '-r' is active.
</p></dd><dt><span class="term"><code class="option">-Q <em class="replaceable"><code>quality_value</code></em></code></span></dt><dd><p>
Defines minimum quality a consensus base of a strain
must have, consensus bases below this will be set to 'N'.
Only used when -r is active.
</p></dd><dt><span class="term"><code class="option">-V <em class="replaceable"><code>coverage_value</code></em></code></span></dt><dd><p>
Defines minimum coverage a consensus base of a strain must
have, consensus bases below this coverage will be set to 'N'.
Only used when -r is active.
</p></dd><dt><span class="term"><code class="option">-v</code></span></dt><dd><p>
Print version number and exit.
</p></dd><dt><span class="term"><code class="option">-x <em class="replaceable"><code>length</code></em></code></span></dt><dd><p>
Minimum length a contig (in full assemblies) or read (in single
sequence files) must have. All contigs / reads with a
length less than this value are discarded. Default: 0 (=switched
off).
</p><p>
Note: this is of course not applied to reads in contigs! Contigs passing
the [-x] length criterion and stored as complete
assembly (CAF, MAF, ACE, etc.) still contain all their reads.
</p></dd><dt><span class="term"><code class="option">-X <em class="replaceable"><code>length</code></em></code></span></dt><dd><p>
Similar to [-x], but applies only to clipped reads
(input file format must have clipping points set to be
effective).
</p></dd><dt><span class="term"><code class="option">-y <em class="replaceable"><code>contig_coverage</code></em></code></span></dt><dd><p>
Minimum average contig coverage. Contigs with an average
coverage less than this value are discarded.
</p></dd><dt><span class="term"><code class="option">-z <em class="replaceable"><code>min_reads</code></em></code></span></dt><dd><p>
Minimum number of reads in contig. Contigs with fewer
reads than this value are discarded.
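</p><p>
How the thresholds -x, -y and -z act together can be sketched in Python
as below. The dictionary keys and the function name are hypothetical,
and the sketch assumes that a contig must pass every threshold given on
the command line.
</p><pre class="screen">
```python
def passes_filters(contig, min_length=0, min_avg_coverage=0, min_reads=0):
    """True if a contig survives the -x, -y and -z style thresholds.

    contig is a hypothetical dict with the keys 'length', 'avg_coverage'
    and 'num_reads'; a threshold of 0 switches the respective test off.
    """
    return (contig["length"] >= min_length
            and contig["avg_coverage"] >= min_avg_coverage
            and contig["num_reads"] >= min_reads)


contigs = [
    {"name": "c1", "length": 5200, "avg_coverage": 23.4, "num_reads": 810},
    {"name": "c2", "length": 1500, "avg_coverage": 31.0, "num_reads": 290},
    {"name": "c3", "length": 2400, "avg_coverage": 6.5, "num_reads": 95},
]
kept = [c["name"] for c in contigs
        if passes_filters(c, min_length=2000, min_avg_coverage=10)]
print(kept)  # ['c1']
```
</pre><p>
Here c2 is too short and c3 has too little coverage, so only c1 would
be written.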
</p></dd><dt><span class="term"><code class="option">-l <em class="replaceable"><code>line_length</code></em></code></span></dt><dd><p>
On output of assemblies as text or HTML: number of bases shown in
one alignment line. Default: 60.
</p></dd><dt><span class="term"><code class="option">-c <em class="replaceable"><code>endgap_character</code></em></code></span></dt><dd><p>
On output of assemblies as text or HTML: character used to pad
endgaps. Default: ' ' (a blank)
</p></dd></dl></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_examples"></a>10.1.4. Examples</h3></div></div></div><p>
In the following examples, the CAF and MAF files used are expected to
contain full assembly data like the files created by MIRA during an
assembly or by the gap2caf program. CAF and MAF could be used
interchangeably in these examples, depending on which format currently
is available. In general though, MAF is faster to process and smaller
on disk.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
Simple conversion: a MIRA MAF file to a SAM file
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert source.maf destination.sam</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Previous versions of miraconvert had a slightly different
syntax, which however is still supported:
</p><pre class="screen">
<strong class="userinput"><code>miraconvert -f maf -t sam source.maf destination</code></strong></pre></td></tr></table></div></dd><dt><span class="term">
Simple conversion: the consensus of an assembly to FASTA, at the
same time coverage data for contigs to WIG and furthermore
translate the CAF to ACE:
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert source.caf destination.fasta wig ace</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Previous versions of miraconvert had a slightly different
syntax, which however is still supported:
</p><pre class="screen">
<strong class="userinput"><code>miraconvert -f caf -t fasta -t wig -t ace source.caf destination</code></strong></pre></td></tr></table></div></dd><dt><span class="term">
Filtering an assembly for contigs of length ≥2000 and an
average coverage ≥ 10, while translating from MAF to CAF:
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert -x 2000 -y 10 source.maf destination.caf</code></strong></pre></dd><dt><span class="term">
Filtering a FASTQ file for reads ≥ 55 base pairs, rename the
selected reads with a string starting <span class="quote">“<span class="quote">newname</span>”</span> and
save them back to FASTQ. Note how [-t fastq] was left out
as the default behaviour of <span class="command"><strong>miraconvert</strong></span> is
to use the same "to" type as the input type ( [-f]).
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert -x 55 -R newname source.fastq destination.fastq</code></strong></pre></dd><dt><span class="term">
Filtering and reordering contigs of an assembly according to external contig name list.
</span></dt><dd><p>
This example will fetch the contigs named bchoc_c14, ...3, ...5
and ...13 and save the result in exactly that order to a new
file:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l</code></strong>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users 38 2007-10-21 15:16 contigs.lst
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cat contigs.lst</code></strong>
bchoc_c14
bchoc_c3
bchoc_c5
bchoc_c13
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -N contigs.lst bchoc_out.caf myfilteredresult.caf</code></strong>
[...]
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l</code></strong>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users 38 2007-10-21 15:16 contigs.lst
-rw-r--r-- 1 bach users 828726 2007-10-21 15:24 myfilteredresult.caf</pre></dd></dl></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_mutils_bait"></a>10.2. mirabait - a "grep" like tool to select reads with kmers up to 256 bases</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_synopsis"></a>10.2.1.
Synopsis
</h3></div></div></div><div class="cmdsynopsis"><p><code class="command">mirabait</code> [options] {-b <em class="replaceable"><code>baitfile</code></em> [-b ...] | -B <em class="replaceable"><code>file</code></em>} [-p <em class="replaceable"><code>file1 file2</code></em> | -P <em class="replaceable"><code>file3</code></em>]*
[<em class="replaceable"><code>file4 ...</code></em>]</p></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The above command line format, especially the mandatory [-b],
appeared only in MIRA 4.9.0 and represents a major change from 4.0.x!
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_description"></a>10.2.2. Description</h3></div></div></div><p>
<span class="command"><strong>mirabait</strong></span> selects reads from a read collection which
are partly similar or equal to sequences defined as target
baits. Similarity is defined by finding a user-adjustable number of
common k-mers (sequences of k consecutive bases) which are the same in
the bait sequences and the screened sequences to be selected, either in forward
or reverse complement direction.
</p><p>
When used on paired files (-p or -P), selects read pairs where at least
one read matches.
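</p><p>
To make the idea of shared k-mers more concrete, here is a small Python
sketch. It only illustrates the principle and says nothing about how
mirabait is implemented internally; all names in it are invented for
the example, and counting hits per read position is an assumption.
</p><pre class="screen">
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """All k-mers of seq, one per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_bait_hits(read, bait_kmers, k, check_reverse=True):
    """Count positions of read whose k-mer occurs in the bait set.

    With check_reverse the reverse complement of each k-mer is tried too.
    """
    hits = 0
    for kmer in kmers(read.upper(), k):
        if kmer in bait_kmers or (check_reverse and revcomp(kmer) in bait_kmers):
            hits += 1
    return hits


bait = "ACGTTGCAAGTCCATGAGCT"
bait_kmers = set(kmers(bait, 8))
forward_read = "TTGCAAGTCCAT"          # substring of the bait
reverse_read = revcomp(forward_read)   # same region, other strand
print(count_bait_hits(forward_read, bait_kmers, 8))          # 5
print(count_bait_hits(reverse_read, bait_kmers, 8))          # 5
print(count_bait_hits(reverse_read, bait_kmers, 8, False))   # 0
```
</pre><p>
For paired input, a pair would be kept as soon as either of its two
reads collects enough hits.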
</p><p>
One can use <span class="command"><strong>mirabait</strong></span> to do targeted assembly by
fishing out reads belonging to a gene and just assemble these; or to
clean out rRNA sequences from data sets; or to fish out and
iteratively reconstruct mitochondria from metagenomic data; or, or, or
... whenever one has to take in or take out subsets of reads based on
kmer equality, this tool should come in quite handy.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The search performed is exact, that is, sequences selected are
guaranteed to have the required number of matching k-mers to the bait
sequences while sequences not selected are guaranteed not to have them.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_options"></a>10.2.3. Options</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_mainoptions"></a>10.2.3.1. Main options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-b <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
A file containing sequences to be used as bait. The file can
be in any of the following types: FASTQ, FASTA, GenBank (.gbf,
.gbk, .gbff), CAF, MAF or Staden EXP.
</p><p>
If the file extension is non-standard
(e.g. <code class="filename">file.dat</code>), you can force a file type
by using the double colon type specification like in
EMBOSS. E.g.: <code class="filename">fastq::file.dat</code>
</p><p>
Using multiple -b for loading bait sequences from multiple
files is allowed.
</p></dd><dt><span class="term"><code class="option">-B <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
Load bait from an existing kmer statistics file, not from
sequence files. Only one -B allowed, cannot be combined with
-b. See -K on how to create such a file.
</p><p>
This option comes in handy when you always want to bait
against a given set of sequences, e.g., rRNA sequences or the
human genome, and where the statistics computation itself may
be quite time and resource intensive. Once computed and saved
via [-K], a baiting process loading the statistics
via [-B] can start much faster.
</p></dd><dt><span class="term"><code class="option">-j <em class="replaceable"><code>job</code></em></code></span></dt><dd><p>
Set option for predefined job from supplied MIRA library. Currently available jobs:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
'rrna': Bait rRNA/rDNA sequences. Locked options: [-b,
-B, -k, -K, -n]. In the current version mirabait will
use a hash statistics file with 21mers derived from a
subset of the RFAM 12 rRNA database to bait rRNA/rDNA
reads. The supplied subset should catch all but the most
uncommon rRNA data; if needed, one could nevertheless increase
the sensitivity by decreasing [-n].
</p></li></ul></div><p>
Note that [-j] will hardwire a number of options to
be optimal for the chosen job. It is not advisable
to change the 'locked' options, as this either breaks the
functionality or, worse, could lead to undefined results.
</p></dd><dt><span class="term"><code class="option">-p <em class="replaceable"><code>file_1 file_2</code></em></code></span></dt><dd><p>
Instructs to load sequences to be baited from files
<code class="filename">file_1</code> and
<code class="filename">file_2</code>. The sequences are treated as
pairs, where a read in one file is paired with a read in the
second file. The files can be in any of the following types:
FASTQ, FASTA, GenBank (.gbf, .gbk, .gbff), CAF, MAF or Staden
EXP.
</p><p>
If the file extension is non-standard
(e.g. <code class="filename">file.dat</code>), you can force a file type
by using the double colon type specification like in
EMBOSS. E.g.: <code class="filename">fastq::file.dat</code>
</p><p>
Using multiple -p for baiting sequences from multiple file
pairs is allowed.
</p></dd><dt><span class="term"><code class="option">-P <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
Instructs to load sequences to be baited from file
<code class="filename">file</code>. The sequences are treated as pairs,
where a read in the file is immediately followed by its paired
read. The file can be in any of the following types: FASTQ,
FASTA, GenBank (.gbf, .gbk, .gbff), CAF, MAF or Staden
EXP.
</p><p>
If the file extension is non-standard
(e.g. <code class="filename">file.dat</code>), you can force a file type
by using the double colon type specification like in
EMBOSS. E.g.: <code class="filename">fastq::file.dat</code>
</p><p>
Using multiple -P for baiting sequences from multiple files is
allowed.
</p></dd><dt><span class="term"><code class="option">-k <em class="replaceable"><code>kmer-length</code></em></code></span></dt><dd><p>
k-mer, length of bait in bases (≤256, default=31)
</p></dd><dt><span class="term"><code class="option">-n <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Default value: 1.
</p><p>
If the integer given is > 0: minimum number of kmers needed
for a sequence to be selected.
</p><p>
If the integer given is ≤ 0: maximum number of missed kmers
allowed over sequence length for a sequence to be selected.
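</p><p>
The two modes of this option can be written down as a short Python
sketch. It reflects my reading of the description above and is not
taken from the mirabait sources; in particular, counting hits and
misses per k-mer position of the read is an assumption.
</p><pre class="screen">
```python
def is_selected(num_hits, num_kmers_in_read, n=1):
    """Apply an -n style threshold to the k-mer hit count of one read.

    n > 0: at least n k-mers of the read must hit the bait.
    n of 0 or below: at most -n k-mers of the read may miss the bait.
    """
    if n > 0:
        return num_hits >= n
    missed = num_kmers_in_read - num_hits
    return -n >= missed


# A 100 base read has 100 - 31 + 1 = 70 k-mers of length 31.
print(is_selected(1, 70, n=1))     # True: one hit suffices
print(is_selected(66, 70, n=-5))   # True: 4 missed, 5 allowed
print(is_selected(60, 70, n=-5))   # False: 10 missed
print(is_selected(70, 70, n=0))    # True: no k-mer missed
```
</pre><p>
Positive values therefore ask for a minimum amount of similarity, while
zero or negative values ask for near-complete coverage of the read by
bait k-mers.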
</p></dd><dt><span class="term"><code class="option">-d</code></span></dt><dd><p>
Do not use kmers with microrepeats (DUST-like). Standard
length for microrepeats is 67% of kmer length, see
[-D] to change this.
</p><p>
Microrepeats are defined as repeats of a 1, 2, 3 or 4 base
motif at either end (not in the middle) of a kmer. E.g.: a
kmer of 17 will have a microrepeat length of 12 bases, so
that all kmers having a run of 12 identical bases (A, C, G or T)
at either end will be filtered away. E.g.: AAAAAAAAAAAAnnnnn as well as
nnnnnAAAAAAAAAAAA will be filtered.
</p><p>
E.g. for repeats of 2 bases: AGAGAGAGAGAGnnnnn or CACACACACACAnnnnn.
</p><p>
E.g. for repeats of 3 bases: ACGACGACGACGnnnnn.
</p><p>
E.g. for repeats of 4 bases: ACGTACGTACGTnnnnn.
</p><p>
Microrepeat motifs will truncate at the end of allocated
microrepeat length. E.g. kmer length 20 with microrepeat
length of 13 and 4 base repeat: ACGTACGTACGTAnnnnnnn.
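</p><p>
The definition above can be turned into a small Python check. This is a
sketch of the rule as described here, not the filter used inside
mirabait; the function names are invented, and treating the suffix in
the same way as the prefix is an assumption.
</p><pre class="screen">
```python
def is_periodic(region, max_motif=4):
    """True if region is a repeat of a motif of 1 to max_motif bases.

    The last motif copy may be truncated, e.g. ACGTACGTACGTA.
    """
    for motif_len in range(1, max_motif + 1):
        if all(region[i] == region[i - motif_len]
               for i in range(motif_len, len(region))):
            return True
    return False


def has_microrepeat(kmer, repeat_len):
    """True if the first or the last repeat_len bases form a microrepeat.

    repeat_len is expected to be larger than the longest motif (4 bases).
    """
    return is_periodic(kmer[:repeat_len]) or is_periodic(kmer[-repeat_len:])


print(has_microrepeat("AAAAAAAAAAAACGTCA", 12))     # True
print(has_microrepeat("CGTCAAGAGAGAGAGAG", 12))     # True
print(has_microrepeat("ACGTACGTACGTAGGCTTCA", 13))  # True
print(has_microrepeat("ACGTTGCAAGTCCATGA", 12))     # False
```
</pre><p>
The third call corresponds to the truncated 4 base motif shown above,
the last one to a kmer without any microrepeat at its ends.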
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
When saving the kmer statistics via [-K], having
[-d] will save kmer statistics where kmers with
microrepeats have already been removed. Use this when you
always want to have microrepeats removed from a given bait
data set, as [-d] will then not be needed when loading
that set via [-B] later (which saves time).
</p><p>
If you want to be able to bait from precomputed kmer
statistics both with and without microrepeats, use
[-d] only when loading the statistics file with
[-B], not when creating it with [-K].
</p></td></tr></table></div></dd><dt><span class="term"><code class="option">-D <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Set length of microrepeats in kmers to discard from bait.
</p><p>
int > 0: microrepeat length in percentage of kmer length.
E.g.: -k 17 -D 67 --> 67% of 17 bases = 11.39 bases --> 12 bases.
</p><p>
              int < 0: microrepeat length in bases.
</p><p>
int != 0 implies -d, int=0 turns DUST filter off
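</p><p>
As a sketch, the three cases could be expressed in Python as follows. The function name is hypothetical, and rounding the percentage up is an assumption inferred from the 11.39 --> 12 example above, not something checked against the mirabait sources.
</p><pre class="screen">
```python
def microrepeat_length(kmer_length, d_value=67):
    """Microrepeat length in bases for a given -D value (sketch only)."""
    if d_value == 0:
        return None  # 0 turns the DUST-like filter off
    if d_value > 0:
        # Percentage of the kmer length, rounded up (assumption based on
        # the manual's example: 67% of 17 = 11.39 gives 12).
        return (kmer_length * d_value + 99) // 100
    return -d_value  # negative values give the length in bases directly
```
</pre><p>
Here <code class="literal">microrepeat_length(17, 67)</code> and <code class="literal">microrepeat_length(17, -12)</code> both give 12 bases.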
</p></dd><dt><span class="term"><code class="option">-i</code></span></dt><dd><p>
              Inverse selection: selects only sequences that do not meet the
              -k and -n criteria.
</p></dd><dt><span class="term"><code class="option">-I</code></span></dt><dd><p>
Filters and writes sequences which hit to one file and
sequences which do not hit to a second file.
</p></dd><dt><span class="term"><code class="option">-r</code></span></dt><dd><p>
Does not check for hits in reverse complement direction.
</p></dd><dt><span class="term"><code class="option">-t <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Number of threads to use. The default value of 0 is configured
to automatically use up to 4 CPU cores (if present). Numbers
higher than 4 (or maybe 8) will probably not make much sense
because of diminishing returns.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_outputdef"></a>10.2.3.2. File type options</h4></div></div></div><p>
Normally, mirabait writes separate result files (named
<code class="filename">bait_match_*</code> and
<code class="filename">bait_miss_*</code>) for each input to the current
directory. For changing this behaviour, and others relating to
output, use these options:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-c</code></span></dt><dd><p>
Normally, mirabait will change the case of the sequences it
loads to denote kmers which hit a bait in upper case and kmers
which did not hit a bait in lower case. Using -c switches off
this behaviour.
</p></dd><dt><span class="term"><code class="option">-l <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Set length of sequence line in FASTA output.
</p></dd><dt><span class="term"><code class="option">-K <em class="replaceable"><code>filename</code></em></code></span></dt><dd><p>
Save kmer statistics (for baits loaded via [-b]) to
<code class="filename">filename</code>.
</p><p>
As the calculation of kmers can take quite some time for
larger sequences (e.g., human genome), this option is
extremely useful if you want to perform the same baiting
              operation more than once. Once calculated, the kmer statistics
              are saved and can be reloaded for a later baiting operation via
              [-B].
</p></dd><dt><span class="term"><code class="option">-N <em class="replaceable"><code>name</code></em></code></span></dt><dd><p>
Change the file prefix <code class="filename">bait</code> to
<code class="filename">name</code>. Has no effect if -o/-O is used and
targets are not directories.
</p></dd><dt><span class="term"><code class="option">-o <em class="replaceable"><code>path</code></em></code></span></dt><dd><p>
Save sequences matching a bait to
<code class="filename">path</code>. If <code class="filename">path</code> is a
directory, write separate files into this directory. If not,
combine all matching sequences from the input file(s) into a
single file specified by the path.
</p></dd><dt><span class="term"><code class="option">-O <em class="replaceable"><code>path</code></em></code></span></dt><dd><p>
Like -o, but for sequences not matching.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_other"></a>10.2.3.3. Other options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-T <em class="replaceable"><code>dir</code></em></code></span></dt><dd><p>
Use <code class="filename">dir</code> as directory for temporary files
instead of the current working directory.
</p></dd><dt><span class="term"><code class="option">-m <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
              Default is <span class="underline">75</span>. Defines
              the memory MIRA can use to compute kmer statistics. It therefore
              does not apply when precomputed statistics are loaded via [-B].
</p><p>
A value of <span class="underline">>100</span> is
interpreted as absolute value in megabyte. E.g., 16384 = 16384
megabyte = 16 gigabyte.
</p><p>
A value of <span class="underline">0 ≤ x ≤100</span> is
interpreted as relative value of free memory at the time of
computation. E.g.: for a value of 75% and 10 gigabyte of free
memory, it will use 7.5 gigabyte.
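</p><p>
The interpretation of the value can be sketched in Python as below. The function name and arguments are illustrative assumptions, negative values are simply rejected here, and the platform minimum from the note is not modelled.
</p><pre class="screen">
```python
def memory_budget_mb(m_value, free_memory_mb):
    """Memory in megabyte implied by an -m value (sketch only)."""
    if m_value > 100:
        # Values above 100 are absolute amounts in megabyte.
        return m_value
    if m_value >= 0:
        # Values from 0 to 100 are a percentage of the free memory.
        return free_memory_mb * m_value / 100
    raise ValueError("-m expects a value of 0 or more")
```
</pre><p>
With 10 gigabyte (10240 megabyte) of free memory, a value of 75 gives 7680 megabyte, i.e. the 7.5 gigabyte from the example above.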
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The minimum amount of memory this algorithm will use is 512 MiB
on 32 bit systems and 2 GiB on 64 bit systems.
</td></tr></table></div></dd></dl></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_examples"></a>10.2.4. Usage examples</h3></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The examples below, together with the manual above, should be enough to get
you going. If there's a typical use case you are missing, feel free to
suggest it on the MIRA talk mailing list.
</td></tr></table></div><p>Baiting unpaired sequences, bait sequences in FASTA, sequences in FASTQ:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta file.fastq</code></strong></pre><p>Same as above, but baits in two files (FASTA and GenBank):</p><pre class="screen"><strong class="userinput"><code>mirabait -b b1.fasta -b b2.gbk file.fastq</code></strong></pre><p>Baiting paired sequences, read pairs are in two files:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>Baiting paired sequences, pairs are interleaved in one file:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -P file.fastq</code></strong></pre><p>Like above, but selecting sequences which do not match the baits:</p><pre class="screen"><strong class="userinput"><code>mirabait -i -b b.fasta -P file.fastq</code></strong></pre><p>
        Baiting paired sequences (<code class="filename">file_1.fastq</code>,
        <code class="filename">file_2.fastq</code> and
        <code class="filename">file3.fasta</code>) and unpaired sequences
        (<code class="filename">file4.caf</code>), all at once and with different file
        types:
</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf</code></strong></pre><p>
Like above, but writing sequences matching baits and sequences not
matching baits to different files:
</p><pre class="screen"><strong class="userinput"><code>mirabait -I -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf</code></strong></pre><p>Change bait criterion to need 10 kmers of size 27:</p><pre class="screen"><strong class="userinput"><code>mirabait -k 27 -n 10 -b b.fasta file.fastq</code></strong></pre><p>
Change bait criterion to baiting only reads which have all kmers
present in the bait:
</p><pre class="screen"><strong class="userinput"><code>mirabait -n 0 -b b.fasta file.fastq</code></strong></pre><p>
Change bait criterion to baiting all reads having almost all kmers
present in the bait, but allowing for up to 40 kmers not in the bait:
</p><pre class="screen"><strong class="userinput"><code>mirabait -n -40 -b b.fasta file.fastq</code></strong></pre><p>
Force bait sequences to load as FASTA, force sequences to be baited to
be loaded as FASTQ:
</p><pre class="screen"><strong class="userinput"><code>mirabait -b fasta::b.dat fastq::file.dat</code></strong></pre><p>
Write result files to directory <code class="filename">/dev/shm/</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/ -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
Merge all result files containing sequences hitting baits to file
<code class="filename">/dev/shm/match</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/match -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
Like above, but also merge all result files containing sequences not
hitting baits to file <code class="filename">/dev/shm/nomatch</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/match -O /dev/shm/nomatch -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
        Fetch all reads having rRNA motifs in paired FASTQ files:
</p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -p file1.fastq file2.fastq</code></strong></pre><p>
        Fetch all reads not having rRNA motifs in paired FASTQ files:
</p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -i -p file1.fastq file2.fastq</code></strong></pre><p>
        Split paired FASTQ files into two sets of files (4 files total), one
        containing rRNA reads and one containing non-rRNA reads:
</p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -I -p file1.fastq file2.fastq</code></strong></pre><p>
Assuming the file <code class="filename">human_genome.fasta</code> contains the
        human genome: bait all read pairs matching the human genome. Also,
        save the computed kmer statistics for later re-use to file
<code class="filename">HG_kmerstats.mhs.gz</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz -p file1.fastq file2.fastq</code></strong></pre><p>
The same as above, but just precompute and save the kmer statistics, no actual baiting done.
</p><pre class="screen"><strong class="userinput"><code>mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz</code></strong></pre><p>
Using the precomputed kmer statistics from the command above: bait
files with read pairs for human reads:
</p><pre class="screen"><strong class="userinput"><code>mirabait -B HG_kmerstats.mhs.gz -p file_1.fastq file_2.fastq</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_installrrnadb"></a>10.2.5. Installing different rRNA databases</h3></div></div></div><p>
The standard database for rRNA baiting supplied with the MIRA source
code and binary packages is called
<code class="filename">rfam_rrna-21-12.sls.gz</code> which will get installed
as <span class="emphasis"><em>MHS</em></span> (MiraHashStatistics) file into
<code class="filename">$BINDIR/../share/mira/mhs/rfam_rrna-21-12.mhs.gz</code>
(where $BINDIR is the directory where the mira/mirabait binary
resides) and a soft link pointing from
<code class="filename">filter_default_rrna.mhs.gz</code> to
<code class="filename">rfam_rrna-21-12.mhs.gz</code> like so:
</p><pre class="screen"><code class="prompt">arcadia:~$</code> <strong class="userinput"><code>which mira</code></strong>
/usr/local/bin/mira
<code class="prompt">arcadia:~$</code> <strong class="userinput"><code>ls -l /usr/local/share/mira/mhs</code></strong>
lrwxrwxrwx 1 bach bach 22 Mar 24 23:58 filter_default_rrna.mhs.gz -> rfam_rrna-21-12.mhs.gz
-rw-rw-r-- 1 bach bach 148985059 Mar 24 23:58 rfam_rrna-21-12.mhs.gz</pre><p>
        The file naming scheme for the database is as follows:
        dbidentifier-kmerlength-kmerfreqcutoff. The standard database is therefore:
<code class="filename">rfam_rrna</code> as identifier for the RFAM rRNA
sequences (currently RFAM 12), then 21 defining a kmer length of 21
and finally a kmer cutoff frequency of 12, meaning that kmers must
have been seen at least 12 times in the RFAM database to be stored in
the subset.
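</p><p>
A file name following this scheme can be taken apart with a few lines of Python. The helper below is a hypothetical illustration, not part of MIRA, and it assumes the identifier itself contains no dot.
</p><pre class="screen">
```python
def parse_db_name(filename):
    """Split dbidentifier-kmerlength-kmerfreqcutoff (sketch only)."""
    # Drop extensions such as .mhs.gz or .sls.gz.
    stem = filename.split(".", 1)[0]
    # The last two dash separated fields are kmer length and cutoff.
    identifier, kmer_length, cutoff = stem.rsplit("-", 2)
    return identifier, int(kmer_length), int(cutoff)
```
</pre><p>
For <code class="filename">rfam_rrna-21-12.mhs.gz</code> this yields the identifier <code class="literal">rfam_rrna</code>, a kmer length of 21 and a cutoff frequency of 12.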
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The value of 12 as frequency cutoff for the standard mirabait rRNA
database was chosen as a compromise between sensitivity and database
size.
</td></tr></table></div><p>
        Although rRNAs are pretty well conserved overall, the cutoff frequency
        also implies that kmers from rare rRNA variants will not be present in
        the database, so some sensitivity may be lost for rRNA from rarely
        sequenced organisms. For this reason, more sensitive versions of the
        rRNA database can be installed by downloading a file from the MIRA
        repository at SourceForge and calling a script provided by MIRA. To
        install a version with a kmer size of 21 and a cutoff frequency of,
        e.g., 3, download <code class="filename">rfam_rrna-21-3.sls.gz</code> and
        install it like this:
</p><pre class="screen"><code class="prompt">arcadia:~/tmp$</code> <strong class="userinput"><code>ls</code></strong>
<code class="prompt">arcadia:~/tmp$</code> <strong class="userinput"><code>wget https://sourceforge.net/projects/mira-assembler/files/MIRA/slsfiles/rfam_rrna-21-3.sls.gz</code></strong>
...
</pre><p>
</p><p>
TODO: continue docs here.
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_hard"></a>Chapter 11. Assembly of <span class="emphasis"><em>hard</em></span> genome or EST / RNASeq projects</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_hard_getting_mean_data_assembled">11.1.
Getting 'mean' genomes or EST / RNASeq data sets assembled
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_hard_for_the_impatient">11.1.1.
For the impatient
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_introduction_to_masking">11.1.2.
Introduction to 'masking'
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_how_does_nasty_repeat_masking_work">11.1.3.
How does 'nasty repeat' masking work?
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_selecting_a_nasty_repeat_ratio">11.1.4.
Selecting a "nasty repeat ratio"
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_hard_how_MIRA_tags_different_repeat_levels">11.2.
How MIRA tags different repeat levels
</a></span></dt><dt><span class="sect1"><a href="#sect_hard_the_readrepeats_info_file">11.3.
The readrepeats info file
</a></span></dt><dt><span class="sect1"><a href="#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data">11.4.
Pipeline to find worst contaminants or repeats in sequencing data
</a></span></dt><dt><span class="sect1"><a href="#sect_hard_examples_for_kmer_statistics">11.5.
Examples for kmer statistics
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_hard_caveat:_sk:kms">11.5.1.
Caveat: -SK:kmer_size
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_sanger_sequencing_a_simple_bacterium">11.5.2.
Sanger sequencing, a simple bacterium
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_454_sequencing_a_somewhat_more_complex_bacterium">11.5.3.
454 Sequencing, a somewhat more complex bacterium
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_solexa_sequencing_ecoli_mg1655">11.5.4.
Solexa sequencing, E.coli MG1655
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_need_examples_for_eukaryotes">11.5.5.
(NEED EXAMPLES FOR EUKARYOTES)
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_need_examples_for_pathological_cases">11.5.6.
(NEED EXAMPLES FOR PATHOLOGICAL CASES)
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">If it were easy, it would have been done already.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_getting_mean_data_assembled"></a>11.1.
Getting 'mean' genomes or EST / RNASeq data sets assembled
</h2></div></div></div><p>
</p><p>
For some EST data sets you might want to assemble, MIRA will take too
long or the available memory will not be sufficient. For genomes this
can be the case for eukaryotes, plants, but also for some bacteria which
      contain a high number of (pro-)phages, plasmids or engineered operons. For
EST data sets, this concerns all projects with non-normalised libraries.
</p><p>
      This guide is intended to get you through these problematic genomes. It
      is not (and cannot be) exhaustive, but it should get you going.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_for_the_impatient"></a>11.1.1.
For the impatient
</h3></div></div></div><p>
For bacteria with nasty repeats, try first
[--hirep_something]. This will increase runtime and memory
requirements, but helps to get this sorted out. If the data for lower
eukaryotes leads to runtime and memory explosion, try either
[--hirep_good] or, for desperate cases,
[--hirep_something].
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_introduction_to_masking"></a>11.1.2.
Introduction to 'masking'
</h3></div></div></div><p>
The SKIM phase (all-against-all comparison) will report almost every potential
hit to be checked with Smith-Waterman further downstream in the MIRA assembly
process. While this is absolutely no problem for most bacteria, some genomes
(eukaryotes, plants, some bacteria) have so many closely related sequences
(repeats) that the data structures needed to take up all information might get
much larger than your available memory. In those cases, your only chance to
still get an assembly is to tell the assembler it should disregard extremely
repetitive features of your genome.
</p><p>
There is, in most cases, one problem: one doesn't know beforehand which parts
of the genome are extremely repetitive. But MIRA can help you here as it
produces most of the needed information during assembly and you just need to
	choose a threshold above which MIRA will disregard repetitive matches.
</p><p>
	The key to this are three fail-safe command line parameters which mask
	"nasty" repeats from the quick overlap finder (SKIM): [-KS:mnr], combined
	with either [-KS:nrr] or [-KS:nrc]. I'll come back
	to [-SK:kms] later as it also plays a role in this.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_how_does_nasty_repeat_masking_work"></a>11.1.3.
How does 'nasty repeat' masking work?
</h3></div></div></div><p>
</p><p>
	If switched on via [-KS:mnr=yes], MIRA will use k-mer statistics to
	find repetitive stretches. K-mers are nucleotide stretches of length k. In a
	perfectly sequenced genome without any sequencing error and without sequencing
	bias, the k-mer frequency can be used to assess how many times a given
	nucleotide stretch is present in the genome: if a specific k-mer is present as
	many times as the average frequency of all k-mers, it is reasonable
	to assume that the specific k-mer is not part of a repeat (at
	least not in this genome).
</p><p>
	Following the same line of thinking, if the frequency of a specific k-mer is
	two times higher than the average of all k-mers, one would assume that this
	specific k-mer is part of a repeat which occurs exactly two times in the
	genome. For a 3x k-mer frequency, a repeat is present three times, and so on. MIRA
	will merge the frequency information of single k-mers into larger 'repeat'
	stretches and tag these stretches accordingly.
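</p><p>
The basic idea can be tried out on toy data with the following Python sketch. It only illustrates the frequency reasoning described here under idealised assumptions (forward strand only, no sequencing errors); it is not how MIRA computes its statistics, and both function names are made up.
</p><pre class="screen">
```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def estimated_copies(counts, kmer):
    """Frequency of one k-mer relative to the average of all k-mers."""
    average = sum(counts.values()) / len(counts)
    return counts[kmer] / average
```
</pre><p>
A k-mer for which <code class="literal">estimated_copies</code> returns a value near 1 behaves like unique sequence, while values near 2 or 3 hint at a repeat present two or three times.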
</p><p>
Of course, low-complexity nucleotide stretches (like poly-A in eukaryotes),
sequencing errors in reads and non-uniform distribution of reads in a
sequencing project will weaken the initial assumption that a k-mer frequency
is representative for repeat status. But even then the k-mer frequency model
works quite well and will give a pretty good overall picture: most repeats
will be tagged as such.
</p><p>
Note that the parts of reads tagged as "nasty repeat" will not get masked per
se, the sequence will still be present. The stretches dubbed repetitive will
get the "MNRr" tag. They will still be used in Smith-Waterman overlaps and
will generate a correct consensus if included in an alignment, but they will
not be used as seed.
</p><p>
Some reads will invariably end up being completely repetitive. These
will not be assembled into contigs as MIRA will not see overlaps as
they'll be completely masked away. These reads will end up as
debris. However, note that MIRA is pretty good at discerning 100%
matching repeats from repeats which are not 100% matching: if there's
a single base with which repeats can be discerned from each other,
MIRA will find this base and use the k-mers covering that base to find
overlaps.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_selecting_a_nasty_repeat_ratio"></a>11.1.4.
Selecting a "nasty repeat ratio"
</h3></div></div></div><p>
</p><p>
	The ratio above which the MIRA kmer statistics algorithm won't
	report matches is set via [-KS:nrr]. E.g.,
	using [-KS:nrr=10] will hide all k-mers which occur at a
	frequency 10 times (or more) higher than the median of all k-mers.
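</p><p>
On a table of k-mer counts, the effect of such a ratio can be sketched in Python as shown below. This is a simplified illustration of the thresholding described above, with a hypothetical function name, and says nothing about MIRA's internal data structures.
</p><pre class="screen">
```python
from statistics import median


def nasty_kmers(counts, nrr=10):
    """Return the k-mers occurring at nrr times the median or more."""
    threshold = nrr * median(counts.values())
    return {kmer for kmer, count in counts.items() if count >= threshold}
```
</pre><p>
With counts of 50, 5, 4, 6 and 5 the median is 5, so only the k-mer seen 50 times is hidden at a ratio of 10.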
</p><p>
The nastiness of a repeat is difficult to judge, but starting with 10 copies
in a genome, things can get complicated. At 20 copies, you'll have some
troubles for sure.
</p><p>
	The standard value of <span class="emphasis"><em>10</em></span> for
	the [-KS:nrr] parameter is a pretty good starting point
	which can be tried for an assembly before trying to optimise it via
	studying the kmer statistics calculated by MIRA. For the latter, please
	read the section 'Examples for kmer statistics' further down in this
	guide.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_how_MIRA_tags_different_repeat_levels"></a>11.2.
How MIRA tags different repeat levels
</h2></div></div></div><p>
During SKIM phase, MIRA will assign frequency information to each and every
k-mer in all reads of a sequencing project, giving them different
status. Additionally, tags are set in the reads so that one can
assess reads in assembly editors that understand tags (like gap4,
gap5, consed etc.). The following tags are used:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
HAF2
</span></dt><dd><p> coverage below average ( default: < 0.5 times average)
</p></dd><dt><span class="term">
HAF3
</span></dt><dd><p> coverage is at average ( default: ≥ 0.5 times average and ≤ 1.5 times average)
</p></dd><dt><span class="term">
HAF4
</span></dt><dd><p> coverage above average ( default: > 1.5 times average and < 2 times average)
</p></dd><dt><span class="term">
HAF5
</span></dt><dd><p> probably repeat ( default: ≥ 2 times average and < 5 times average)
</p></dd><dt><span class="term">
HAF6
</span></dt><dd><p> 'crazy' repeat ( default: > 5 times average)
</p></dd><dt><span class="term">
MNRr
</span></dt><dd><p> stretches which were masked away by [-KS:<em class="replaceable"><code>mnr=yes</code></em>]
being more repetitive than deduced
by [-KS:<em class="replaceable"><code>nrr=...</code></em>] or given via [-KS:<em class="replaceable"><code>nrc=...</code></em>].
</p></dd></dl></div><p>
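      The default thresholds listed above can be written down as a small Python function. This is a sketch for illustration only: the function name is hypothetical, and a frequency of exactly 5 times the average, which the list leaves open, is assigned to HAF6 here as an assumption.
</p><pre class="screen">
```python
def haf_tag(frequency, average):
    """Map a k-mer frequency to a HAF tag using the default thresholds."""
    ratio = frequency / average
    if ratio >= 5:
        return "HAF6"  # exactly 5 times average: assumption, see text
    if ratio >= 2:
        return "HAF5"
    if ratio > 1.5:
        return "HAF4"
    if ratio >= 0.5:
        return "HAF3"
    return "HAF2"
```
</pre><p>
      With an average frequency of 10, a k-mer seen 15 times is still tagged HAF3, while one seen 20 times is tagged HAF5.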
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_the_readrepeats_info_file"></a>11.3.
The readrepeats info file
</h2></div></div></div><p>
If [-KS:mnr=yes] is used, MIRA will write an additional file into the
info directory:
<code class="filename"><projectname>_info_readrepeats.lst</code>
</p><p>
The "readrepeats" file makes it possible to try and find out what makes
sequencing data nasty. It's a key-value-value file with the name of the
sequence as "key" and then the type of repeat (HAF2 - HAF7 and MNRr) and
the repeat sequence as "values". "Nasty" in this case means
<span class="emphasis"><em>everything which was masked via
[-KS:mnr=yes]</em></span>.
</p><p>
The file looks like this:
</p><pre class="screen">
read1 HAF5 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
read2 HAF7 CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
read2 MNRr AAAAAAAAAAAAAAAAAAAAAAAAAAAA ...
read3 HAF6 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
...
etc.
</pre><p>
That is, each line consists of the read name where a stretch of
repetitive sequences was found, then the MIRA repeat categorisation
level (HAF2 to HAF7 and MNRr) and then the stretch of bases which is
seen to be repetitive.
</p><p>
      Note that a single read can contain several disjoint repeat stretches,
      hence reads can occur more than once in the file, as shown with
      <span class="emphasis"><em>read2</em></span> in the example above.
</p><p>
One will need to search some databases with the "nasty" sequences and find
vector sequences, adaptor sequences or even human sequences in bacterial or
plant genomes ... or vice versa as this type of contamination happens quite
easily with data from new sequencing technologies. After a while one gets a
feeling what constitutes the largest part of the problem and one can start to
think of taking countermeasures like filtering, clipping, masking etc.
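</p><p>
To get a first overview of which sequences are flagged most often, the file can be tallied with a short script. The Python sketch below assumes the three columns are separated by tabs; the function name and its parameters are made up for illustration.
</p><pre class="screen">
```python
from collections import Counter


def top_repeats(lines, tag="MNRr", limit=15):
    """Most frequent repeat sequences carrying the given tag."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Expect read name, tag and sequence; skip anything else.
        if len(fields) >= 3 and fields[1] == tag:
            counts[fields[2]] += 1
    return counts.most_common(limit)
```
</pre><p>
An open file object can be passed directly as <code class="literal">lines</code>; the result is a list of (sequence, count) pairs, most frequent first.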
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data"></a>11.4.
Pipeline to find worst contaminants or repeats in sequencing data
</h2></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
In case you are not familiar with UNIX pipes, now would be a good time
to read an introductory text on how this wonderful system works. You
might want to start with a short introductory article at Wikipedia:
<a class="ulink" href="http://en.wikipedia.org/wiki/Pipeline_%28Unix%29" target="_top">http://en.wikipedia.org/wiki/Pipeline_%28Unix%29</a>
</p><p>
In a nutshell: instead of output to files, a pipe directs the output
of one program as input to another program.
</p></td></tr></table></div><p>
      There's one very simple trick I use to find out whether your data
      contains some kind of sequencing vector or adaptor contamination. It
      makes use of the read repeat file discussed above.
</p><p>
      The following example shows this on a 454 data set where the
      sequencing provider used some special adaptor in the wet lab but somehow
      forgot to tell the Roche pre-processing software about it, so that a
      very large fraction of reads in the SFF file had unclipped adaptor
      sequence in them (which of course wreaks havoc with assembly programs):
</p><pre class="screen"><code class="prompt">arcadia:$</code> <strong class="userinput"><code>grep MNRr <em class="replaceable"><code>badproject</code></em>_info_readrepeats.lst | cut -f 3| sort | uniq -c |sort -g -r | head -15</code></strong>
504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC</pre><p>
      You probably see a sequence pattern
      CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC in the above screenshot. Before
      going into details of what you are actually seeing, here's the
      explanation of how this pipeline works:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
grep MNRr <em class="replaceable"><code>badproject</code></em>_info_readrepeats.lst
</span></dt><dd><p>
From the file with the information on repeats, grab all the lines
containing repetitive sequence which MIRA categorised as 'nasty'
via the 'MNRr' tag. The result looks a bit like this (first 15
lines shown):</p><pre class="screen">C6E3C7T12GKN35 MNRr GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JLIBM MNRr TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12HQOM1 MNRr CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12G52II MNRr CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JRMPO MNRr TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H1A8V MNRr GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H34Z7 MNRr AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H4HGC MNRr GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12FNA1N MNRr AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12F074V MNRr CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I1GYO MNRr CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I53C8 MNRr CACACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12I4V6V MNRr ATCACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12H5R00 MNRr TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12IBA5E MNRr AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre><p>
</p></dd><dt><span class="term">
cut -f 3
</span></dt><dd><p>
We're just interested in the sequence now, which is in the third
column. The above 'cut' command takes care of this. The resulting
output may look like this (only first 15 lines shown):
</p><pre class="screen">GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CACACTCGTATAGTGACACGCAACAGGGG
ATCACTCGTATAGTGACACGCAACAGGGG
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre></dd><dt><span class="term">
sort
</span></dt><dd><p>
	    Simply sort all sequences. The output may look like this now (only first 15 lines shown):</p><pre class="screen">
AAACTCGTATAGTGACACGCA
AAACTCGTATAGTGACACGCAACAGG
AAACTCGTATAGTGACACGCAACAGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGGAT
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
...</pre><p>
</p></dd><dt><span class="term">
uniq -c
</span></dt><dd><p>
This command counts how often identical lines follow each other in a
file. As we previously sorted the whole file by sequence, it
effectively counts how often a certain sequence has been tagged as
MNRr. The output has two columns: the first column contains the
number of times a given line (sequence in our case) was seen, padded
with leading spaces; the second column, separated from the count by a
space, contains the line (sequence) itself. An example output looks like this (only first 15 lines shown):
</p><pre class="screen"> 1 AAACTCGTATAGTGACACGCA
1 AAACTCGTATAGTGACACGCAACAGG
1 AAACTCGTATAGTGACACGCAACAGGG
5 AAACTCGTATAGTGACACGCAACAGGGG
1 AAACTCGTATAGTGACACGCAACAGGGGAT
13 AAACTCGTATAGTGACACGCAACAGGGGATA
6 AAACTCGTATAGTGACACGCAACAGGGGATAGAC
4 AAACTCGTATAGTGACACGCAACAGGGGATAGACAA
9 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGC
3 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCA
257 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
1 AACACTCGTATAGTGACACGCAAC
2 AACACTCGTATAGTGACACGCAACAGGG
23 AACACTCGTATAGTGACACGCAACAGGGG
6 AACACTCGTATAGTGACACGCAACAGGGGATA
...</pre></dd><dt><span class="term">
sort -g -r
</span></dt><dd><p>
We now sort the output of the previous uniq-counting command by
asking 'sort' to perform a numerical sort (via '-g') and
additionally sort in reverse order (via '-r') so that we get the
sequences encountered most often at the top of the output. The
result looks exactly like the output shown previously:
</p><pre class="screen">
504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre></dd></dl></div><p>
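If you prefer a script over a shell pipeline, the four steps above can be
sketched in a few lines of Python. This is only an illustration and not part
of MIRA: it assumes the tagged lines are tab-delimited with the sequence in
the third column (which is what 'cut -f 3' expects), the function name
<code class="literal">count_tagged_sequences</code> and the read names in the
example are made up, and sequences with equal counts may come out in a
different order than with 'sort -g -r'.
</p><pre class="screen">
```python
from collections import Counter

def count_tagged_sequences(lines):
    """Rough equivalent of 'cut -f 3 | sort | uniq -c | sort -g -r'."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip lines that do not have a third tab-separated column
        if len(fields) >= 3:
            counts[fields[2]] += 1
    # most_common() returns (sequence, count) pairs, highest count first
    return counts.most_common()

# Made-up read names; the sequences follow the pattern shown above
example_lines = [
    "read_1\tMNRr\tAAACTCGTATAGTGACACGCAACAGGGG",
    "read_2\tMNRr\tCAACTCGTATAGTGACACGCAACAGGGG",
    "read_3\tMNRr\tAAACTCGTATAGTGACACGCAACAGGGG",
]
ranked = count_tagged_sequences(example_lines)
for sequence, count in ranked:
    print(f"{count:7d} {sequence}")
```
</pre><p>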
So, what is this ominous CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC you are
seeing? To make it short: a modified 454 B-adaptor with an additional MID sequence.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
These adaptor sequences have absolutely no reason to exist in your
data, none! Go back to your sequencing provider and ask them to have a look
at their pipeline as they should have had it set up in a way that you
do not see these things anymore. Yes, due to sequencing errors,
some adaptor or sequencing vector remnants will sometimes stay in
your sequencing data, but that is no problem as MIRA is capable of
handling that very well.
</p><p>
But having much more than 0.1% to 0.5% of your sequences containing
these is a sure sign that someone goofed somewhere ... and it's very
probably not your fault.
</p></td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_examples_for_kmer_statistics"></a>11.5.
Examples for kmer statistics
</h2></div></div></div><p>
Selecting the right ratio so that an assembly fits into your memory is not
straightforward. But MIRA can help you a bit: during assembly, some frequency
statistics are printed out (they'll probably end up in some info file in later
releases). Search for the term "Kmer statistics" in the information printed
out by MIRA (this happens quite early in the process).
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_caveat:_sk:kms"></a>11.5.1.
Caveat: -SK:kmer_size
</h3></div></div></div><p>
Some explanation how kmer size affects the statistics and why it
should be chosen >=16 for [-KS:mnr]
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_sanger_sequencing_a_simple_bacterium"></a>11.5.2.
Sanger sequencing, a simple bacterium
</h3></div></div></div><p>
This example is taken from a pretty standard bacterium where Sanger
sequencing was used:
</p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 15
Deduced thresholds:
-------------------
Min normal cov: 7
Max normal cov: 23
Repeat cov: 29
Crazy cov: 120
Mask cov: 150
Repeat ratio histogram:
-----------------------
0 475191
1 5832419
2 181994
3 6052
4 4454
5 972
6 4
7 8
14 2
16 10
=========================================================
</pre><p>
The above can be interpreted like this: the expected coverage of the genome is
15x. Starting with an estimated kmer frequency of 29, MIRA will treat a k-mer
as 'repetitive'. As shown in the histogram, the overall picture of this
project is pretty healthy:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
only a small fraction of k-mers have a repeat level of '0' (these would be
k-mers in regions with quite low coverage or k-mers containing sequencing
errors)
</p></li><li class="listitem"><p>
the vast majority of k-mers have a repeat level of 1 (so that's
non-repetitive coverage)
</p></li><li class="listitem"><p>
there is a small fraction of k-mers with repeat level of 2-10
</p></li><li class="listitem"><p>
there are almost no k-mers with a repeat level >10
</p></li></ul></div><p>
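The reading of the histogram given above can be put into numbers with a small
script. This is a minimal sketch, assuming the histogram part of the
statistics has been copied from the MIRA output into a string in the layout
shown above. The function name
<code class="literal">summarise_repeat_histogram</code> and the grouping into
the four classes of the list are illustrative choices, not something MIRA
provides.
</p><pre class="screen">
```python
def summarise_repeat_histogram(text):
    """Share of k-mers at repeat level 0, 1, 2-10 and above 10."""
    shares = {"0": 0, "1": 0, "2-10": 0, ">10": 0}
    in_histogram = False
    for line in text.splitlines():
        fields = line.split()
        if line.strip().startswith("Repeat ratio histogram"):
            in_histogram = True
        elif in_histogram and len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            level, count = int(fields[0]), int(fields[1])
            if level > 10:
                shares[">10"] += count
            elif level >= 2:
                shares["2-10"] += count
            else:
                shares[str(level)] += count  # level 0 or 1
    total = sum(shares.values())
    if total == 0:
        return shares  # no histogram lines found
    return {key: count / total for key, count in shares.items()}

# Histogram of the Sanger example above
sanger_stats = """Repeat ratio histogram:
-----------------------
0 475191
1 5832419
2 181994
3 6052
4 4454
5 972
6 4
7 8
14 2
16 10
"""
shares = summarise_repeat_histogram(sanger_stats)
for key, share in shares.items():
    print(f"repeat level {key}: {share:.2%}")
```
</pre><p>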
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_454_sequencing_a_somewhat_more_complex_bacterium"></a>11.5.3.
454 Sequencing, a somewhat more complex bacterium
</h3></div></div></div><p>
For comparison, here is a profile for a more complicated bacterium (454
sequencing):
</p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 20
Deduced thresholds:
-------------------
Min normal cov: 10
Max normal cov: 30
Repeat cov: 38
Crazy cov: 160
Mask cov: 0
Repeat ratio histogram:
-----------------------
0 8292273
1 6178063
2 692642
3 55390
4 10471
5 6326
6 5568
7 3850
8 2472
9 708
10 464
11 270
12 140
13 136
14 116
15 64
16 54
17 54
18 52
19 50
20 58
21 36
22 40
23 26
24 46
25 42
26 44
27 32
28 38
29 44
30 42
31 62
32 116
33 76
34 80
35 82
36 142
37 100
38 120
39 94
40 196
41 172
42 228
43 226
44 214
45 164
46 168
47 122
48 116
49 98
50 38
51 56
52 22
53 14
54 8
55 2
56 2
57 4
87 2
89 6
90 2
92 2
93 2
1177 2
1181 2
=========================================================
</pre><p>
The difference from the first bacterium shown is pretty striking:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
first, the number of k-mers at repeat level 0 (below average) is
higher than the number of k-mers at level 1! This points to a
higher number of sequencing errors in the 454 reads than in the
Sanger project shown previously. Or to a more uneven distribution
of reads (but not in this special case).
</p></li><li class="listitem"><p>
second, the repeat level histogram does not trail off at a repeat
frequency of 10 or 15, but it has a long tail up to the fifties, even having
a local maximum at 42. This points to a small part of the genome being
heavily repetitive ... or to (a) plasmid(s) in high copy numbers.
</p></li></ul></div><p>
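The local maximum in the tail can also be located programmatically. The
following is a minimal sketch that uses only an excerpt of the histogram above
(not every level is copied, see the comment in the code). The function name
<code class="literal">tail_peak</code> and the starting level of 20 are
arbitrary choices for illustration.
</p><pre class="screen">
```python
# Excerpt of the 454 histogram above (repeat level: number of k-mers);
# many levels are left out to keep the example short
histogram_454 = {
    0: 8292273, 1: 6178063, 2: 692642, 3: 55390, 4: 10471, 5: 6326,
    10: 464, 15: 64, 20: 58, 25: 42, 30: 42, 35: 82, 38: 120,
    40: 196, 41: 172, 42: 228, 43: 226, 44: 214, 45: 164, 50: 38, 55: 2,
}

def tail_peak(histogram, min_level):
    """Return (level, count) of the most populated level at or above min_level."""
    tail = {level: count for level, count in histogram.items() if level >= min_level}
    level = max(tail, key=tail.get)
    return level, tail[level]

peak = tail_peak(histogram_454, 20)
print(peak)
```
</pre><p>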
</p><p>
Should MIRA ever have problems with this genome, switch on the nasty repeat
masking and use a level of 15 as cutoff. In this case, 15 is OK to start with
as a) it's a bacterium, it can't be that hard and b) the frequencies above
level 5 are in the low thousands and not in the tens of thousands.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_solexa_sequencing_ecoli_mg1655"></a>11.5.4.
Solexa sequencing, E.coli MG1655
</h3></div></div></div><p>
</p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 23
Deduced thresholds:
-------------------
Min normal cov: 11
Max normal cov: 35
Repeat cov: 44
Crazy cov: 184
Mask cov: 0
Repeat ratio histogram:
-----------------------
0 1365693
1 8627974
2 157220
3 11086
4 4990
5 3512
6 3922
7 4904
8 3100
9 1106
10 868
11 788
12 400
13 186
14 28
15 10
16 12
17 4
18 4
19 2
20 14
21 8
25 2
26 8
27 2
28 4
30 2
31 2
36 4
37 6
39 4
40 2
45 2
46 8
47 14
48 8
49 4
50 2
53 2
56 6
59 4
62 2
63 2
67 2
68 2
70 2
73 4
75 2
77 4
=========================================================
</pre><p>
These kmer statistics show that MG1655 is pretty boring (from a
repetitive point of view). One might expect a few repeats but nothing
fancy: the repeats are actually the rRNA and sRNA stretches in the
genome plus some intergenic regions.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the number of k-mers at repeat level 0 (below average) is
considerably lower than at level 1, so the Solexa sequencing
quality is pretty good and there shouldn't be too many
low coverage areas.
</p></li><li class="listitem"><p>
the histogram tail shows some faint traces of possibly highly repetitive
k-mers, but these are false positive matches due to some standard Solexa
base-calling weaknesses of earlier pipelines like, e.g., adding poly-A,
poly-T or sometimes poly-C and poly-G tails to reads when spots in the
images were faint and the base calls of bad quality
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_need_examples_for_eukaryotes"></a>11.5.5.
(NEED EXAMPLES FOR EUKARYOTES)
</h3></div></div></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_need_examples_for_pathological_cases"></a>11.5.6.
(NEED EXAMPLES FOR PATHOLOGICAL CASES)
</h3></div></div></div><p>
Vector contamination etc.
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_seqtechdesc"></a>Chapter 12. Description of sequencing technologies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_std_intro">12.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_std_sxa">12.2.
Illumina (formerly Solexa)
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_sxa_caveats_for_illumina">12.2.1.
Caveats for Illumina data
</a></span></dt><dt><span class="sect2"><a href="#sect_std_sxa_highlights">12.2.2.
Illumina highlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_sxa_highlights_quality">12.2.2.1.
Quality
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_std_std_sxa_lowlights">12.2.3.
Lowlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_longhomopolymers">12.2.3.1.
Long homopolymers
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_GGCxG_motif">12.2.3.2.
The GGCxG and GGC motifs
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_chimericreads">12.2.3.3.
Chimeric reads
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_samplemix">12.2.3.4.
Sample barcode misidentification
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_nextera">12.2.3.5.
Nextera library prep
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_gcbias">12.2.3.6.
Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_std_iontor">12.3.
Ion Torrent
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_iontor_hpindels">12.3.1.
Homopolymer insertions / deletions
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_seqdirdepindels">12.3.2.
Sequencing direction dependent insertions / deletions
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_covvariance">12.3.3.
Coverage variance
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_gcbias">12.3.4.
GC bias
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_other_sources_of_error">12.3.5.
Other sources of error
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_where_to_find_further_information">12.3.6.
Where to find further information
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_std_pacbio">12.4.
Pacific BioSciences
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_pb_highlights">12.4.1.
Highlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_pb_hl_length">12.4.1.1.
Sequence lengths
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_gcbias">12.4.1.2.
GC bias
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_acccorrected">12.4.1.3.
Accuracy of corrected reads
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_qualassemblies">12.4.1.4.
Assemblies of corrected reads
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_std_pb_lowlights">12.4.2.
Lowlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_pb_ll_namingconfusion">12.4.2.1.
Naming confusion
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_revseq">12.4.2.2.
Forward / reverse chimeric sequences
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_rawreadaccuracy">12.4.2.3.
Accuracy of uncorrected subreads
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_cpu">12.4.2.4.
Immense need for CPU power
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_dnaprep">12.4.2.5.
Increased quality requirements for clean DNA sample prep
</a></span></dt></dl></dd></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Opinions are like chili powder - best used in moderation.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_intro"></a>12.1.
Introduction
</h2></div></div></div><p>
<span class="bold"><strong>Note:</strong></span> This section contains things I've
seen in the past and simply jotted down. These may be fundamentally
correct, or correct only under certain circumstances, or not correct at all with
your data. You may have different observations.
</p><p>
...
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_sxa"></a>12.2.
Illumina (formerly Solexa)
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_sxa_caveats_for_illumina"></a>12.2.1.
Caveats for Illumina data
</h3></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Even if you can get bacteria sequenced with ridiculously high coverage
like 500x or 1000x, this amount of data is simply not needed. Even
more important - though counterintuitive - is the fact that due to
non-random sequence-dependent sequencing errors, too high a coverage
may even make the assembly worse.
</p><p>
Another rule of thumb: when having more than enough data, reduce the
data set so as to have an average coverage of approximately 100x. In
some rare cases (high GC content), perhaps 120x to 150x, but certainly
not more.
</p></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
When reducing a data set, do <span class="bold"><strong>NOT</strong></span>,
under any circumstances, try fancy selection of reads by some
arbitrary quality or length criteria. This will introduce a terrible
bias in your assembly due to non-random sequence-dependent sequencing
errors and non-random sequence-dependent base quality assignment. More
on this in the next section.
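<p>
One unbiased way to reduce a data set is to keep every read (or read pair)
with the same probability, independent of its quality or length. The following
is a minimal sketch of that idea, assuming you know the approximate genome
size and the total number of sequenced bases. The function names, the seed and
the numbers in the example are illustrative only and not part of MIRA.
</p><pre class="screen">
```python
import random

def subsample_fraction(total_bases, genome_size, target_coverage=100):
    """Fraction of reads to keep so that the average coverage drops to the target."""
    coverage = total_bases / genome_size
    return min(1.0, target_coverage / coverage)

def subsample_pairs(read_pairs, fraction, seed=42):
    """Keep each read pair with the same probability, whatever its quality or length."""
    rng = random.Random(seed)
    return [pair for pair in read_pairs if fraction > rng.random()]

# Hypothetical project: 2.5 Gb of reads for a 5 Mb genome, i.e. 500x coverage
fraction = subsample_fraction(2_500_000_000, 5_000_000)
print(fraction)

# Stand-in for real read pairs: any list of (forward, reverse) items works
pairs = [(f"pair{i}/1", f"pair{i}/2") for i in range(10000)]
kept = subsample_pairs(pairs, fraction)
print(len(kept))
```
</pre>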
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_sxa_highlights"></a>12.2.2.
Illumina highlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_highlights_quality"></a>12.2.2.1.
Quality
</h4></div></div></div><p>
For current HiSeq 100bp reads I get - after MIRA clipping - about 90
to 95% of reads matching to a reference without a single error. MiSeq
250bp reads contain a couple more errors, but nothing to be alarmed
about.
</p><p>
In short: Illumina is currently <span class="emphasis"><em>the</em></span> technology
to use if you want high quality reads.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_std_sxa_lowlights"></a>12.2.3.
Lowlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_longhomopolymers"></a>12.2.3.1.
Long homopolymers
</h4></div></div></div><p>
Long homopolymers (stretches of identical bases in reads) can be a
slight problem for Solexa. However, it must be noted that this is a
problem of all sequencing technologies on the market so far (Sanger,
Solexa, 454). Furthermore, the problem is much less pronounced in
Solexa than in 454 data: in Solexa, the first problems may appear
in stretches of 9 to 10 bases, whereas in Ion Torrent a stretch of 3 to 4
bases may already start being problematic in some cases.
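</p><p>
To get a feeling for whether a reference or a read contains stretches of that
length, the longest homopolymer run can be measured with a few lines of
Python. This is a minimal sketch; the function name
<code class="literal">longest_homopolymer</code> is illustrative, and which
run length counts as problematic depends on the technology as described above.
</p><pre class="screen">
```python
from itertools import groupby

def longest_homopolymer(sequence):
    """Return (base, length) of the longest run of identical bases."""
    # groupby() splits the sequence into consecutive runs of the same base
    runs = [(base, len(list(group))) for base, group in groupby(sequence.upper())]
    return max(runs, key=lambda run: run[1]) if runs else ("", 0)

# Made-up sequence with a run of ten T
example = "ACG" + "T" * 10 + "GCA"
print(longest_homopolymer(example))
```
</pre><p>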
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_GGCxG_motif"></a>12.2.3.2.
The GGCxG and GGC motifs
</h4></div></div></div><p>
The problem is a <code class="literal">GGCxG</code> or even <code class="literal">GGC</code> motif in the
5' to 3' direction of reads. This one is particularly annoying and
it took me quite a while to circumvent in MIRA the problems it
causes.
</p><p>
Simply put: at some places in a genome, base calling after a
<code class="literal">GGCxG</code> or <code class="literal">GGC</code> motif is
particularly error prone: the number of reads without errors
declines markedly. Repeated <code class="literal">GGC</code> motifs worsen
the situation. The following screen shots of a mapping assembly
illustrate this.
</p><p>
The first example is the <code class="literal">GGCxG</code> motif (in form
of a <code class="literal">GGCTG</code>) occurring in approximately one third
of the reads at the shown position. Note that all but one read
with this problem are in the same (plus) direction.
</p><div class="figure"><a name="sxa_unsc_ggcxg2_lenski.png"></a><p class="title"><b>Figure 12.1.
The Solexa GGCxG problem.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggcxg2_lenski.png" width="100%" alt="The Solexa GGCxG problem."></td></tr></table></div></div></div><br class="figure-break"><p>
The next two screen shots show the <code class="literal">GGC</code> motif, once for
forward direction and once for reverse direction reads:
</p><div class="figure"><a name="sxa_unsc_ggc1_lenski.png"></a><p class="title"><b>Figure 12.2.
The Solexa GGC problem, forward example
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggc1_lenski.png" width="100%" alt="The Solexa GGC problem, forward example"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="sxa_unsc_ggc4_lenski.png"></a><p class="title"><b>Figure 12.3.
The Solexa GGC problem, reverse example
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggc4_lenski.png" width="100%" alt="The Solexa GGC problem, reverse example"></td></tr></table></div></div></div><br class="figure-break"><p>
Places in the genome that have <code class="literal">GGCGGC.....GCCGCC</code>
(a motif, perhaps even repeated, then some bases and then an
inverted motif) almost always have a very, very low number of good
reads. Especially when the motif is <code class="literal">GGCxG</code>.
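</p><p>
If you want to check a reference sequence for such places before trusting a
low-coverage region, the motifs can be searched with regular expressions. This
is a minimal sketch and not what MIRA does internally: the pattern names, the
function names and the maximum spacer length of 20 bases between motif and
inverted motif are assumptions made for illustration. The search looks at the
sequence as given, so for reads in reverse direction you would apply it to the
reverse complement.
</p><pre class="screen">
```python
import re

# 'x' in GGCxG is any base; the lookahead also reports overlapping motifs
GGCXG = re.compile(r"(?=GGC[ACGT]G)")
# One or more GGC, a short spacer, then the inverted (reverse complement) motif GCC
INVERTED = re.compile(r"(?:GGC)+[ACGT]{1,20}?(?:GCC)+")

def find_ggcxg(sequence):
    """Return the 0-based start positions of all GGCxG motifs."""
    return [match.start() for match in GGCXG.finditer(sequence.upper())]

def has_inverted_ggc(sequence):
    """True if a GGC motif is followed by an inverted GCC motif nearby."""
    return INVERTED.search(sequence.upper()) is not None

# Made-up sequences for illustration
positions = find_ggcxg("AAGGCTGAAGGCCGTT")
print(positions)
print(has_inverted_ggc("TTGGCGGCAAAAAGCCGCCTT"))
```
</pre><p>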
</p><p>
Things get especially difficult when these motifs occur at sites
where users may have a genuine interest. The following example is a
screen shot from the Lenski data (see walk-through below) where a
simple mapping reveals an anomaly which -- in reality -- is an IS
insertion (see <a class="ulink" href="http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html" target="_top">http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html</a>)
but could also look like a <code class="literal">GGCxG</code> motif in forward
direction (<code class="literal">GGCCG</code>) and at the same time a
<code class="literal">GGC</code> motif in reverse direction:
</p><div class="figure"><a name="sxa_xmastree_lenski2.png"></a><p class="title"><b>Figure 12.4.
A genuine place of interest almost masked by the
<code class="literal">GGCxG</code> problem.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski2.png" width="100%" alt="A genuine place of interest almost masked by the GGCxG problem."></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_chimericreads"></a>12.2.3.3.
Chimeric reads
</h4></div></div></div><p>
I did not realise chimeric reads were a problem with Illumina data
until Fall 2014, when I got reads > 100bp for extremely well
characterised bacteria. I had not noticed earlier because MIRA had
always used data cleaning methods which worked very well either on
short reads ≤ 100bp or when chimeras occurred at a very low frequency.
</p><p>
Chimeras are artefact reads from library preparation which
contain parts of the sequence of interest which do not belong
together. E.g., in DNA from a bacterial genome, there may be one
read of 100 bp where the first 40 bp come from the genome position
at 100kb and the last 60 bp come from a position at 1300kb ... more
than one megabase apart.
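</p><p>
The definition above can be turned into a simple check on mapped reads. This
is a minimal sketch under stated assumptions, not MIRA's chimera detection:
it assumes each read has already been split into parts that were aligned
separately to a linear reference on the same strand, and the function name
<code class="literal">looks_chimeric</code>, the input layout and the
tolerance of 1000 bases are hypothetical.
</p><pre class="screen">
```python
def looks_chimeric(parts, max_offset=1000):
    """Flag a read whose consecutive parts map far apart on the reference.

    parts: list of (read_start, ref_start) pairs, sorted by read_start.
    """
    for (read_a, ref_a), (read_b, ref_b) in zip(parts, parts[1:]):
        # In a contiguous read, part b would map this far downstream of part a
        expected_ref_b = ref_a + (read_b - read_a)
        if abs(ref_b - expected_ref_b) > max_offset:
            return True
    return False

# The example from the text: first 40 bp from 100 kb, last 60 bp from 1300 kb
print(looks_chimeric([(0, 100_000), (40, 1_300_000)]))
```
</pre><p>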
</p><p>
There is not much literature regarding chimeric sequences in
Illumina data: most of it deals with 16S or amplicon sequencing
where I always thought <span class="emphasis"><em>"that does not apply to my data
sets."</em></span> Well, tough luck ... it does. After some searching I
found some papers which report quite varying levels depending on the
protocols used. Oyola et al. report between 0.24% and 2.3% of
chimeras (<span class="emphasis"><em>Optimizing Illumina next-generation sequencing
library preparation for extremely AT-biased genomes</em></span>; BMC
Genomics 2012, 13:1; doi:10.1186/1471-2164-13-1; <a class="ulink" href="http://www.biomedcentral.com/1471-2164/13/1" target="_top">http://www.biomedcentral.com/1471-2164/13/1</a>). Apparently, a
paper from researchers at the Sanger Centre reported up to 5%
chimeric reads (Bronner et al., <span class="emphasis"><em>Improved Protocols for
Illumina Sequencing</em></span>; Current Protocols in Human Genetics
18:18.2:18.2.1–18.2.42; DOI: 10.1002/0471142905.hg1802s80; <a class="ulink" href="http://onlinelibrary.wiley.com/doi/10.1002/0471142905.hg1802s80/abstract" target="_top">http://onlinelibrary.wiley.com/doi/10.1002/0471142905.hg1802s80/abstract</a>
via <a class="ulink" href="http://www.sagescience.com/blog/sanger-reports-improved-prep-protocols-for-illumina-sequencing/" target="_top">http://www.sagescience.com/blog/sanger-reports-improved-prep-protocols-for-illumina-sequencing/</a>).
</p><p>
I have now seen MiSeq 250bp and 300bp paired-end genomic data sets
from different (trusted) sequencing providers for very well
characterised, non-complex and non-GC-extreme bacterial genomes with
up to 3% chimeric reads. To make things worse, some chimeras were
represented by both reads of a read-pair, so one had the exact same
chimeric sequence represented twice: once in forward and once in
reverse complement direction.
</p><p>
It turned out that MIRA versions ≤ 4.9.3 have problems in
filtering chimeras in Illumina data sets with reads > 100bp as
the chimera detection algorithms were designed to handle amounts
much less than 1% of the total reads. This led to shorter contigs in
genomic assemblies and to chimeric transcripts (when they are very
low-coverage) in RNA assemblies.
</p><p>
Note that projects using reads ≤ 100 bp assembled fine with MIRA
4.9.3 and before as the default algorithms for proposed-end-clip
([-CL:pec]) implicitly caught chimeras occurring near the
read ends and the remaining chimeras were caught by the algorithms
for low level chimeras.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
MIRA 4.9.4 and higher eliminate all chimeras in Illumina reads of
any length; you do not need to take any precautionary steps
here. But if you use other assemblers and in light of the above, I
highly recommend applying very stringent filters to Illumina data.
Especially for applications like metagenomics or RNA de-novo
assembly where low coverage may be expected for parts of the
results! Indeed, I now treat any assembly result with consensus data
generated from a coverage of less than 3 Illumina reads as
potentially junk data.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_samplemix"></a>12.2.3.4.
Sample barcode misidentification
</h4></div></div></div><p>
Long story short: data from multiplexed samples contains "low"
amounts of foreign samples from the same lane. Probably not a
problem for high coverage assemblies, but can become a problem in
multiplexed RNASeq or projects looking for "rare" variants.
</p><p>
In essence, the barcoding used for multiplexing several samples into
a single lane is not a 100% foolproof process. I found one paper
quantifying this effect to 0.3% of misidentified reads: Kircher et
al., <span class="emphasis"><em>Double indexing overcomes inaccuracies in multiplex
sequencing on the Illumina platform</em></span>; Nucleic Acids
Res. Jan 2012; 40(1): e3. <a class="ulink" href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245947/" target="_top">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245947/</a>
</p><p>
For example, I got some genome sequencing data for a bacterium where
closer inspection showed that some small contigs coming out of the
assembly process were highly expressed genes from a plant. The
sequencing provider had multiplexed our bacterial sample with an
RNASeq project of that plant.
</p><p>
Another example involved RNASeq of two genomes where one of the
organisms had been modified to contain additional genes under a
strong promoter. In the data set we suddenly saw those inserted
genes pop up in the samples of the wild type organism. Which,
clearly, could not be.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_nextera"></a>12.2.3.5.
Nextera library prep
</h4></div></div></div><p>
Opinions seem to be divided about Nextera: some people don't like it
as it sometimes introduces terrible coverage bias in the data, other
people say they're happy with the data.
</p><p>
Someone told me (or wrote, I do not remember) that this divide may
be due to the fact that some people use their sequencing data for
de-novo assemblies, while others just do mappings and hunt for
SNPs. In fact, this would explain a lot: for de-novo assemblies, I
would never use Nextera. When on a hunt for SNPs, it may be OK.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_gcbias"></a>12.2.3.6.
Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)
</h4></div></div></div><p>
I'm recycling a few slides from a couple of talks I gave in 2010.
</p><p>
Things used to be so nice and easy with the early Solexa data I worked
with (36 and 44mers) in late 2007 / early 2008. When sample taking was
done right -- e.g. for bacteria: in stationary phase -- and the
sequencing lab did a good job, the read coverage of the genome was
almost even. I did see a few papers claiming to see non-trivial GC
bias back then, but after having analysed the data I worked with I
dismissed them as "not relevant for my use cases." Have a look at the
following figure showing, as an example, the coverage of a 45% GC
bacterium in 2008:
</p><div class="figure"><a name="sxa_gcbias_nobias2008.png"></a><p class="title"><b>Figure 12.5.
Example for no GC coverage bias in 2008 Solexa data. Apart from a
slight <span class="emphasis"><em>smile shape</em></span> of the coverage --
indicating the sample taking was not 100% in stationary phase of the
bacterial culture -- everything looks pretty nice: the average
coverage is at 27x, and when looking at potential genome
duplications at twice the coverage (54x), there's nothing apart from a
single peak (which turned out to be a problem in an rRNA region).
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_nobias2008.png" width="100%" alt="Example for no GC coverage bias in 2008 Solexa data. Apart from a slight smile shape of the coverage -- indicating the sample taking was not 100% in stationary phase of the bacterial culture -- everything looks pretty nice: the average coverage is at 27x, and when looking at potential genome duplications at twice the coverage (54x), there's nothing apart a single peak (which turned out to be a problem in a rRNA region)."></td></tr></table></div></div></div><br class="figure-break"><p>
Things changed starting sometime in Q3 2009, at least that's when I
got some data which made me notice a problem. Have a look at the
following figure which shows exactly the same organism as in the
figure above (bacterium, 45% GC):
</p><div class="figure"><a name="sxa_gcbias_bias2009.png"></a><p class="title"><b>Figure 12.6.
Example for GC coverage bias starting Q3 2009 in Solexa
data. There's no <span class="emphasis"><em>smile shape</em></span> anymore -- the
people in the lab learned to pay attention to sampling in 100%
stationary phase -- but something else is extremely disconcerting:
the average coverage is at 33x, and when looking at potential genome
duplications at twice the coverage (66x), there are several dozen
peaks crossing the 66x threshold over several kilobases (in one
case over 200 Kb) all over the genome. As if several small genome
duplications had happened.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_bias2009.png" width="100%" alt="Example for GC coverage bias starting Q3 2009 in Solexa data. There's no smile shape anymore -- the people in the lab learned to pay attention to sampling in 100% stationary phase -- but something else is extremely disconcerting: the average coverage is at 33x, and when looking at potential genome duplications at twice the coverage (66x), there are several dozen peaks crossing the 66x threshold over several kilobases (in one case over 200 Kb) all over the genome. As if several small genome duplications had happened."></td></tr></table></div></div></div><br class="figure-break"><p>
By the way, the figures above are just examples: I saw over a dozen
sequencing projects in 2008 without GC bias and several dozen in 2009
/ 2010 with GC bias.
</p><p>
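As an aside, scanning a per-base coverage track for stretches at or
above twice the average coverage (the 54x and 66x thresholds used in the
figures above) can be sketched in a few lines of Python. This is only a
minimal illustration and not part of MIRA: the function name
<code class="literal">high_coverage_regions</code> and the toy numbers are
made up for this example.
</p><pre class="screen">
```python
from statistics import mean


def high_coverage_regions(coverage, factor=2.0):
    """Return half-open (start, end) index pairs of stretches whose
    coverage is at or above factor times the mean coverage."""
    threshold = factor * mean(coverage)
    regions = []
    start = None
    for pos, cov in enumerate(coverage):
        if cov >= threshold and start is None:
            start = pos                      # a stretch begins here
        elif threshold > cov and start is not None:
            regions.append((start, pos))     # the stretch ended at pos - 1
            start = None
    if start is not None:                    # stretch runs to the very end
        regions.append((start, len(coverage)))
    return regions


# Toy per-base coverage (made-up numbers): 30x with one short 90x stretch.
toy_coverage = [30] * 10 + [90] * 4 + [30] * 10
print(high_coverage_regions(toy_coverage))
```
</pre><p>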
When I checked the potential genome duplication sites, they all looked
"clean", i.e., the typical genome insertion markers were
missing. Poking around for possible explanations, I looked at the GC
content of those parts of the genome ... and there was the
explanation:
</p><div class="figure"><a name="sxa_gcbias_comp20082009.png"></a><p class="title"><b>Figure 12.7.
Example for GC coverage bias, direct comparison 2008 / 2010
data. The bug has 45% average GC, areas with above average read
coverage in 2010 data turn out to be lower GC: around 33 to 36%. The
effect is also noticeable in the 2008 data, but barely so.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_comp20082009.png" width="100%" alt="Example for GC coverage bias, direct comparison 2008 / 2010 data. The bug has 45% average GC, areas with above average read coverage in 2010 data turn out to be lower GC: around 33 to 36%. The effect is also noticeable in the 2008 data, but barely so."></td></tr></table></div></div></div><br class="figure-break"><p>
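The comparison of GC content and read coverage described above can be
reproduced with a simple windowed tally. The following is a minimal
sketch under stated assumptions: one coverage value per base is
available, the window size is arbitrary, and the function names and toy
data are hypothetical. It is not how the figure above was produced.
</p><pre class="screen">
```python
from statistics import mean


def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_and_coverage_per_window(genome, coverage, window=1000):
    """Return (start, gc_fraction, mean_coverage) for consecutive windows."""
    if len(genome) != len(coverage):
        raise ValueError("need exactly one coverage value per base")
    rows = []
    for start in range(0, len(genome), window):
        end = start + window
        rows.append((start,
                     gc_fraction(genome[start:end]),
                     mean(coverage[start:end])))
    return rows


# Toy data: a GC-rich window at 30x next to an AT-rich window at 60x.
toy_genome = "GC" * 5 + "AT" * 5
toy_coverage = [30] * 10 + [60] * 10
for start, gc, cov in gc_and_coverage_per_window(toy_genome, toy_coverage, 10):
    print(f"window at {start}: GC {gc:.0%}, mean coverage {cov}x")
```
</pre><p>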
Now, as to <span class="emphasis"><em>why</em></span> the GC bias suddenly
became so strong: that is unknown to me. The people in the lab have used
the same protocol for several years to extract the DNA, and the sequencing
providers claim to always use the Illumina standard protocols.
</p><p>
But obviously something must have changed.
</p><p>
It took Illumina some 18 months to resolve that problem for the
broader public: ever since the data I work on has been generated with
the TruSeq kit, this problem has vanished.
</p><p>
However, if you based some conclusions on, or wrote a paper with, Illumina
data which might be affected by the GC bias (Q3 2009 to Q4 2010), I
suggest you rethink all the conclusions drawn. This applies
especially to transcriptomics experiments, where a difference
in expression of 2x to 3x starts to get highly significant!
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_iontor"></a>12.3.
Ion Torrent
</h2></div></div></div><p>
As of January 2014, I would say Ion Torrent reads behave very much like
late data from the 454 technology (FLX / Titanium chemistry): reads are
on average > 300bp and the homopolymer problem is much less
pronounced than 2 years ago. The following figure shows what you can get
out of 100bp reads if you're lucky:
</p><div class="figure"><a name="chap_iontor::ion_dh10bgoodB13.png"></a><p class="title"><b>Figure 12.8.
Example for good IonTorrent data (100bp reads). Note that only a
single sequencing error - shown by blue background - can be
seen. Apart from this, all homopolymers of size 3 and 4 in the area
shown are good.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/ion_dh10bgoodB13.png" width="100%" alt="Example for good IonTorrent data (100bp reads). Note that only a single sequencing error - shown by blue background - can be seen. Apart from this, all homopolymers of size 3 and 4 in the area shown are good."></td></tr></table></div></div></div><br class="figure-break"><p>
The "if you're lucky" part in the preceding sentence is not there by
accident: having so many clean reads is more the exception than the
rule. On the other hand, most sequencing errors in current IonTorrent
data are unproblematic ... if it were not for indels, which are
explained in the next sections.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_hpindels"></a>12.3.1.
Homopolymer insertions / deletions
</h3></div></div></div><p>
The main source of error in your data will be insertions / deletions
(indels) especially in homopolymer regions (but not only there, see
also next section). Starting with a base run of 4 to 6 bases, there
is a distinct tendency to have an increased occurrence of indel
errors.
</p><div class="figure"><a name="chap_iontor::iontor_indelhpexample.png"></a><p class="title"><b>Figure 12.9.
Example for problematic IonTorrent data (100bp reads).
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/iontor_indelhpexample.png" width="100%" alt="Example for problematic IonTorrent data (100bp reads)."></td></tr></table></div></div></div><br class="figure-break"></div><p>
The above figure contains a couple of particularly nasty indel
problems. While areas 2 (C-homopolymer length 3), 5 (A-homopolymer
length 4) and 6 (T-homopolymer length 3) are not a big problem as most
of the reads got the length right, the areas 1, 3 and 4 are nasty.
</p><p>
Area 1 is an A-homopolymer of length 7 and while many reads get that
length right (enough to tell MIRA what the true length is), it also
contains reads with a length of 6 and others with a length of 8.
</p><p>
Area 3 is an "A-homopolymer" of length 2 where approximately half of the
reads get the length right and the other half do not. See also the following
section.
</p><p>
Area 4 is a T-homopolymer of length 5 which also has approximately half
the reads with a wrong length of 4.
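</p><p>
To make the idea of "enough reads get the length right" concrete, here
is a minimal sketch of a simple majority vote over the run lengths
observed at one site. It is an illustration only and not MIRA's actual
consensus algorithm. The read snippets are made up, and they are assumed
to be cut such that the homopolymer starts at the same offset in every
snippet.
</p><pre class="screen">
```python
from collections import Counter
from itertools import groupby


def run_length_at(read, pos):
    """Length of the run of identical bases covering index pos of read."""
    offset = 0
    for _base, group in groupby(read):
        length = len(list(group))
        if offset + length > pos:      # pos falls inside this run
            return length
        offset += length
    raise IndexError("pos is outside the read")


def vote_run_length(snippets, pos):
    """Tally run lengths seen at pos and return (votes, majority_length)."""
    votes = Counter(run_length_at(snippet, pos) for snippet in snippets)
    majority_length, _count = votes.most_common(1)[0]
    return votes, majority_length


# Made-up snippets around an A-homopolymer of true length 7 (like area 1):
snippets = (["GC" + "A" * 7 + "TG"] * 5      # correct length
            + ["GC" + "A" * 6 + "TG"] * 2    # one base too short
            + ["GC" + "A" * 8 + "TG"] * 2)   # one base too long
votes, majority = vote_run_length(snippets, pos=2)
print(dict(votes), "majority:", majority)
```
</pre><p>
For a site like area 4, where roughly half of the reads carry the wrong
length, such a plain vote is no longer decisive, which is what makes
these areas nasty.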
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_seqdirdepindels"></a>12.3.2.
Sequencing direction dependent insertions / deletions
</h3></div></div></div><p>
In the previous section, the screen shot showing indels had an indel
at a homopolymer of 2, which is quite curious. Upon closer
investigation, one might notice a pattern in the gap/nogap
distribution: it is almost identical to the build direction
(orientation) of the reads!
</p><p>
I looked for other examples of this behaviour and found quite a
number of them. The following figure shows a very clear case of that
error behaviour:
</p><div class="figure"><a name="chap_iontor::ion_dh10bdirdepindel.png.png"></a><p class="title"><b>Figure 12.10.
Example for a sequencing direction dependent indel. Note how all
but one of the reads in '+' direction miss a base while all reads
built in '-' direction have the correct number of bases.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/ion_dh10bdirdepindel.png" width="100%" alt="Example for a sequencing direction dependent indel. Note how all but one of the reads in '+' direction miss a base while all reads built in '-' direction have the correct number of bases."></td></tr></table></div></div></div><br class="figure-break"><p>
This is quite astonishing: the problem occurs at a site without real
homopolymer (calling a 2-bases run a 'homopolymer' starts stretching
the definition a bit) and there are no major problematic homopolymer
sites nearby. In fact, this was more or less the case for all sites I
had a look at.
</p><p>
Nor did the cases I investigated show common base
patterns, so unlike the Solexa GGCxG motif, this IonTorrent error
does not seem to be bound to a particular motif.
</p><p>
While I cannot prove the following statement, I somehow suspect that
there must be some kind of secondary structure forming which leads to
that kind of sequencing error. If anyone has a good explanation I'd be
happy to hear it: feel free to contact me at
<code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code>.
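</p><p>
To check whether a suspicious indel site shows this direction
dependence, one can tally gap / no-gap calls separately for each read
direction. The sketch below assumes that the alignment column has
already been reduced to (direction, has_gap) pairs; the function name
and the example counts are hypothetical and merely mimic the situation
in the figure above.
</p><pre class="screen">
```python
from collections import Counter


def gap_counts_by_direction(observations):
    """Tally gap / no-gap calls per read direction.

    observations is an iterable of (direction, has_gap) pairs, with
    direction being "+" or "-" and has_gap a bool.
    """
    counts = Counter(observations)
    table = {}
    for direction in ("+", "-"):
        gap = counts[(direction, True)]
        nogap = counts[(direction, False)]
        total = gap + nogap
        fraction = gap / total if total else None
        table[direction] = {"gap": gap, "nogap": nogap,
                            "gap_fraction": fraction}
    return table


# Made-up column: 7 of 8 '+' reads miss the base, none of the 6 '-' reads do.
column = [("+", True)] * 7 + [("+", False)] + [("-", False)] * 6
for direction, row in gap_counts_by_direction(column).items():
    print(direction, row)
```
</pre><p>
A gap fraction close to 1 in one direction and close to 0 in the other
is the signature to look for; similar fractions in both directions
point to an ordinary indel error or a true variant instead.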
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_covvariance"></a>12.3.3.
Coverage variance
</h3></div></div></div><p>
The coverage variance with the old ~100bp reads was a bit on the
bad side for low coverage projects (10x to 15x): it varied wildly,
sometimes dropping to nearly zero, sometimes reaching approximately
double the coverage.
</p><p>
This has now improved and I have not seen pronounced coverage variance
in the data sets I have worked on.
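</p><p>
As a rough yardstick for what "wildly" means here: under the textbook
assumption that reads start at uniformly random positions, per-base
coverage is approximately Poisson distributed around the average. The
sketch below evaluates that idealised model (it is not a description of
real IonTorrent data) for the two average coverages mentioned above; the
function names are made up for this example.
</p><pre class="screen">
```python
from math import exp, factorial


def poisson_pmf(k, mean_cov):
    """Probability of seeing exactly k reads over a base (Poisson model)."""
    return exp(-mean_cov) * mean_cov ** k / factorial(k)


def prob_at_least(k, mean_cov):
    """Probability of seeing k or more reads over a base."""
    return 1.0 - sum(poisson_pmf(i, mean_cov) for i in range(k))


for mean_cov in (10, 15):
    print(f"{mean_cov}x average: "
          f"P(no coverage) = {poisson_pmf(0, mean_cov):.2e}, "
          f"P(at least double) = {prob_at_least(2 * mean_cov, mean_cov):.2e}")
```
</pre><p>
Under that model, a base without any coverage in a 10x project has a
probability of e<sup>-10</sup>, i.e. about 4.5 in 100,000, so frequent
drops to nearly zero indicate more than plain sampling noise.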
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_gcbias"></a>12.3.4.
GC bias
</h3></div></div></div><p>
The GC bias seems to be small to non-existent; at least I could not
immediately find a correlation between GC content and coverage.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_other_sources_of_error"></a>12.3.5.
Other sources of error
</h3></div></div></div><p>
You will want to keep an eye on the clipping of the data in the SFF
files from IonTorrent: while it is generally good enough, some data
sets of IonTorrent show that - for some error patterns - the clipping
is too lax and strange artefacts appear. MIRA will take care of these
- or at least of those it knows - but you should be aware of this
potential problem.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_where_to_find_further_information"></a>12.3.6.
Where to find further information
</h3></div></div></div><p>
IonTorrent being pretty new, getting as much information as possible on
that technology is quite important. So here are a couple of links I found
to be helpful:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
There is, of course, the TorrentDev site (<a class="ulink" href="http://lifetech-it.hosted.jivesoftware.com/community/torrent_dev" target="_top">http://lifetech-it.hosted.jivesoftware.com/community/torrent_dev</a>)
at Life Technologies which will be helpful to get a couple of
questions answered.
</p><p>
Just be aware that some of the documents over there are sometimes
painting an - how should I say it diplomatically? - overly
optimistic view on the performance of the technology. On the
other hand, so do documents released by the main competitors
like 454/Roche, Illumina, PacBio etc. ... so no harm done there.
</p></li><li class="listitem"><p>
I found Nick Loman's blog <a class="ulink" href="http://pathogenomics.bham.ac.uk/blog/" target="_top">Pathogens: Genes and
Genomes</a> to be my currently most valuable source of
information on IonTorrent. While the group he works for won a
sequencer from IonTorrent, he makes that fact very clear and still
unsparingly dissects the data he gets from that machine.
</p><p>
His posts got me going in getting MIRA to grok IonTorrent.
</p></li><li class="listitem"><p>
The blog of Lex Nederbragt <a class="ulink" href="http://flxlexblog.wordpress.com/" target="_top">In between lines of
code</a> is playing in the same league: very down to earth and
he knows a bluff when he sees it ... and is not afraid to call it
(be it from IonTorrent, PacBio or 454).
</p><p>
The analyses he did on a couple of Ion data sets have saved me
quite some time.
</p></li><li class="listitem"><p>
Last, but not least, the board with <a class="ulink" href="http://seqanswers.com/forums/forumdisplay.php?f=40" target="_top">IonTorrent-related-stuff</a>
over at <a class="ulink" href="http://seqanswers.com/" target="_top">SeqAnswers</a>,
the first and foremost one-stop-shop ... erm ... discussion board
for everything related to sequencing nowadays.
</p></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_pacbio"></a>12.4.
Pacific BioSciences
</h2></div></div></div><p>
As of January 2014, PacBio should be seen as <span class="emphasis"><em>the</em></span>
technology to go to for de-novo sequencing of bacteria and lower
eukaryotes. Period. Complement it with a bit of Illumina to get rid of
the last remaining errors and you'll have - for a couple of thousand
Euros - the best genome sequences money can buy.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_pb_highlights"></a>12.4.1.
Highlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_length"></a>12.4.1.1.
Sequence lengths
</h4></div></div></div><p>
Just one word: huge. At least compared to other currently existing
technologies. It is not unusual to get average - usable - read lengths
of more than 3 to 4 kb, some chemistries doubling that number (at
the expense of accuracy). The largest - usable - reads I have seen
were > 25kb, though one needs to keep in mind that these are
quite rare and one does not see many of them in a project.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_gcbias"></a>12.4.1.2.
GC bias
</h4></div></div></div><p>
I have seen none in my projects so far, neither have I in public
data. But these were certainly not as many projects as Sanger, 454,
Illumina and Ion, so take this with a grain of salt.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_acccorrected"></a>12.4.1.3.
Accuracy of corrected reads
</h4></div></div></div><p>
Once the raw PacBio data has been corrected (HGAP pipeline), the
resulting reads have a pretty good accuracy. There still are
occasional homopolymer errors remaining at non-random locations, but
they are a minor problem.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_qualassemblies"></a>12.4.1.4.
Assemblies of corrected reads
</h4></div></div></div><p>
The assemblies coming out of the HGAP pipeline are already
astoundingly good. Of course you get long contigs, but also the
number of miscalled consensus bases is not too bad: 1 error per 20
kb. Once the program
<span class="command"><strong>Quiver</strong></span> has gone through the assembly to do its magic
in polishing, the quality improves further into the range of 1
error per 50kb to 1 error per 250kb.
</p><p>
In my hands, I get even better assemblies with MIRA (longer contigs
which span repeats unresolved by HGAP). When combining this with
some low coverage Illumina data (say, 50x) to do cheap polishing,
the error rates I get are lower than 1 error in 4 megabases.
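</p><p>
To put the error rates quoted in this section on a common scale, they
can be converted to Phred-style quality values
(Q = -10 log<sub>10</sub> of the error probability) and to an expected
number of consensus errors per genome. The sketch below does just that;
the 5 megabase genome size is an assumed example value and the function
names are hypothetical.
</p><pre class="screen">
```python
from math import log10


def phred_quality(bases_per_error):
    """Phred-scaled quality for one error per bases_per_error bases."""
    return -10 * log10(1 / bases_per_error)


def expected_errors(genome_size, bases_per_error):
    """Expected number of consensus errors in a genome of the given size."""
    return genome_size / bases_per_error


GENOME_SIZE = 5_000_000   # assumed example size: a 5 Mb bacterium

rates = [("HGAP consensus", 20_000),
         ("after Quiver, low end", 50_000),
         ("after Quiver, high end", 250_000),
         ("MIRA plus Illumina polishing (bound)", 4_000_000)]
for label, bases_per_error in rates:
    print(f"{label}: Q{phred_quality(bases_per_error):.0f}, "
          f"{expected_errors(GENOME_SIZE, bases_per_error):g} expected errors")
```
</pre><p>
For the assumed 5 megabase genome, 1 error per 20 kb corresponds to
about 250 wrong consensus bases, whereas 1 error in 4 megabases
corresponds to about one.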
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Take the above with a grain of salt as, at the time of this writing,
I have analysed only a couple of bacteria in depth. For diploid or
polyploid organisms I have just played around a bit with public data
without really doing an in-depth analysis there.
</td></tr></table></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_pb_lowlights"></a>12.4.2.
Lowlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_namingconfusion"></a>12.4.2.1.
Naming confusion
</h4></div></div></div><p>
With PacBio, there are quite a number of read types being thrown
around which confuse people: <span class="emphasis"><em>polymerase
reads</em></span>, <span class="emphasis"><em>quality clipped
reads</em></span>, <span class="emphasis"><em>subreads</em></span>, <span class="emphasis"><em>corrected
reads</em></span> and maybe some more I have currently forgotten. Here's the
totally unofficial guide on how to keep those things apart:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<span class="bold"><strong>polymerase reads</strong></span> are the rawest
and most unedited stuff you may come into contact with. You can see
them as "data fresh from the machine" and the number of megabases
there is usually the one sequencing providers sell to you.
</p><p>
The sequencing technology PacBio employs uses special hairpin
adaptors they have named SMRTBell, and these adaptors will be
present in the polymerase reads together with the fragments of
your DNA.
</p><p>
In terms of regular expression look-alike, the data in
polymerase reads has the following form:
</p><pre class="screen">(Adaptor + (forward fragment sequence + (Adaptor + (fragment sequence in reverse complement))))*</pre><p>
E.g., some of your <span class="emphasis"><em>polymerase reads</em></span> will
contain just the adaptor and (part of) a fragment sequence:
Adap+FwdSeq. Others might contain: Adap+FwdSeq+Adap+RevSeq. And
still others might contain: multiple copies of
Adap+FwdSeq+Adap+RevSeq.
</p></li><li class="listitem"><span class="bold"><strong>quality clipped reads</strong></span> are
simply <span class="emphasis"><em>polymerase reads</em></span> where some sort of
first quality clipping has been done.
</li><li class="listitem"><span class="bold"><strong>subreads</strong></span> are <span class="emphasis"><em>quality
clipped reads</em></span> where the adaptors have been removed and
the read split into forward fragment sequences and reverse
fragment sequences. Hence, one quality clipped polymerase read can
yield several subreads.
</li><li class="listitem"><p>
<span class="bold"><strong>corrected (sub)reads</strong></span> are
subreads where through the magic of lots of computational power
and a very high coverage of subreads, the errors have been
almost completely removed from the subreads.
</p><p>
This is usually done only on a part of the subreads as it already
takes long enough (several hundred CPU hours for a simple
bacterium).
</p></li></ul></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_revseq"></a>12.4.2.2.
Forward / reverse chimeric sequences
</h4></div></div></div><p>
The splitting of polymerase reads into subreads (see above) needs
the SMRTBell adaptor to be recognised by motif searching
programs. Unfortunately, it looks as if some "low percentage"
of reads have a self-looped end instead of an adaptor, which in turn
means that the subread splitting will not split those reads and you
end up with a chimeric sequence.
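</p><p>
The following toy example illustrates both the subread splitting and
the chimera problem. It is a deliberately simplified sketch: the adaptor
string is a hypothetical placeholder (not the real SMRTBell sequence),
and exact string matching is used, whereas real tools have to find the
adaptor in error-prone reads by alignment or motif search.
</p><pre class="screen">
```python
# Hypothetical placeholder, NOT the real SMRTBell adaptor sequence.
ADAPTOR = "TTTTGGGGCCCCAAAA"


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def split_into_subreads(polymerase_read, adaptor=ADAPTOR):
    """Cut a read at every exact adaptor occurrence, dropping empty pieces."""
    return [piece for piece in polymerase_read.split(adaptor) if piece]


fragment = "ACGGTCAG"

# Well-formed read: Adap+FwdSeq+Adap+RevSeq gives two clean subreads.
good_read = ADAPTOR + fragment + ADAPTOR + reverse_complement(fragment)
print(split_into_subreads(good_read))

# Self-looped end: no adaptor between forward and reverse part, so the
# splitter returns a single forward/reverse chimera.
looped_read = ADAPTOR + fragment + reverse_complement(fragment)
print(split_into_subreads(looped_read))
```
</pre><p>
The second read comes back as one piece consisting of the fragment
followed directly by its own reverse complement, which is exactly the
kind of chimeric sequence described above.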
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_rawreadaccuracy"></a>12.4.2.3.
Accuracy of uncorrected subreads
</h4></div></div></div><p>
You need to be brave now: the accuracy of the unclipped
polymerase reads is usually only about 50%. That is: on average
every second base is wrong. And I have seen a project where this
accuracy was only 14% (6 out of 7 bases are wrong).
</p><p>
After clipping, the average accuracy of the polymerase reads should
be anywhere between 80% and 85% (this depends a little bit on the
chemistry used), which translates to: every 5th to every 7th base is
wrong. The vast majority of errors are insertions or deletions, not
base substitutions.
</p><p>
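The conversion from an accuracy to "every n-th base is wrong" used above
is simply n = 1 / (1 - accuracy). A small sketch of that arithmetic
follows; the function names are hypothetical and the 3 kb read length
is an illustrative assumption.
</p><pre class="screen">
```python
def bases_per_error(accuracy_percent):
    """Average number of bases per wrong base at a given accuracy (in %)."""
    if accuracy_percent >= 100:
        raise ValueError("no errors expected at 100% accuracy")
    return 100 / (100 - accuracy_percent)


def expected_errors(read_length, accuracy_percent):
    """Expected number of wrong bases in a read of the given length."""
    return read_length * (100 - accuracy_percent) / 100


for accuracy in (50, 80, 85):
    print(f"{accuracy}% accurate: one error every "
          f"{bases_per_error(accuracy):.1f} bases")

# A 3 kb read (an assumed, illustrative length) at 85% accuracy:
print(expected_errors(3000, 85), "expected errors")
```
</pre><p>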
80% to 85% accuracy with indels as the primary error is unfortunately
something assemblers cannot use very well. Read: not at all if you
want good assemblies (at least I know no program which does
that). Therefore, one needs to apply some sort of correction
... which needs quite a deal of CPU, see below.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_cpu"></a>12.4.2.4.
Immense need for CPU power
</h4></div></div></div><p>
The above mentioned accuracies of 80% to 85% are too low for any
existing assembler I know of to produce correct assemblies. Therefore,
people came up with the idea of doing error correction on subreads
to improve their quality.
</p><p>
There are two major approaches: 1) correcting PacBio subreads with
other technologies with shorter reads and 2) correcting long PacBio
subreads with shorter PacBio subreads. Both approaches have been
shown to work, though there seems to be a preference nowadays to use
the second option as the "shorter" PacBio reads provide the benefit
of still being longer than reads from other technologies and hence
provide a better repeat resolution.
</p><p>
Anyway, the amount of CPU power needed for either method above is
something to keep in mind: bacteria with 3 to 5 megabases at a 100x
polymerase read coverage can take several hundred hours of CPU for
the correction step.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_dnaprep"></a>12.4.2.5.
Increased quality requirements for clean DNA sample prep
</h4></div></div></div><p>
This is a problem which cannot be really attributed to PacBio: one
absolutely needs to check whether the protocols used "since ever"
for DNA extraction yield results which are clean and long enough for
PacBio. Often they are not.
</p><p>
The reason for this being a problem is simple: PacBio can sequence
really long fragments, but if your DNA extraction protocol smashed
the DNA into small pieces, then no sequencing technology in this
universe will be able to give you long reads from small fragments.
</p></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_seqadvice"></a>Chapter 13. Some advice when going into a sequencing project</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_seqadv_seqprovider">13.1.
Talk to your sequencing provider(s) before sequencing
</a></span></dt><dt><span class="sect1"><a href="#sect_seqadv_whichseqprovider">13.2.
Choosing a sequencing provider
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_want">13.2.1.
WHAT DO YOU WANT?!
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_need">13.2.2.
WHAT DO YOU NEED?!
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_cost">13.2.3.
WHAT WILL IT COST ME?
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_where">13.2.4.
WHERE TO SEQUENCE?
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_summary">13.2.5.
Summary of all the above
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_specific">13.3.
Specific advice
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_technologies">13.3.1.
Technologies
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_seqadv_technologies_sanger">13.3.1.1.
Sanger
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_pacbio">13.3.1.2.
Pacific Biosciences
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_illumina">13.3.1.3.
Illumina
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_iontorrent">13.3.1.4.
Ion Torrent
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_454">13.3.1.5.
Roche 454
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_seqadv_denovo">13.3.2.
Sequencing de-novo
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_mapping">13.3.3.
Re-sequencing / mapping
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_a_word_or_two_on_coverage">13.4.
A word or two on coverage ...
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_lowcov">13.4.1.
Low coverage isn't worth it
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_highcov">13.4.2.
Catch-22: too high coverage
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_when_sequencing_a_word_of_caution_regarding_your_dna">13.5.
A word of caution regarding your DNA in hybrid sequencing projects
</a></span></dt><dt><span class="sect1"><a href="#sect_seqadv_for_bacteria">13.6.
Advice for bacteria
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_for_bacteria_no_not_sample_in_exponential_phase">13.6.1.
Do not sample DNA from bacteria in exponential growth phase!
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_for_bacteria:_beware_of_high_copy_number_plasmids">13.6.2.
Beware of (high copy number) plasmids!
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em>
<span class="quote">“<span class="quote">
Reliable information lets you say 'I don't know' with real confidence.
</span>”</span>
</em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_seqprovider"></a>13.1.
Talk to your sequencing provider(s) before sequencing
</h2></div></div></div><p>
Well, duh! But it's interesting what kind of mails I sometimes get. Like this one:
</p><div class="blockquote"><blockquote class="blockquote"><span class="quote">“<span class="quote">We've sequenced a one gigabase, diploid eukaryote with
Solexa 36bp paired-end with 200bp insert size at 25x coverage. Could you
please tell us how to assemble this data set de-novo to get a finished
genome?</span>”</span></blockquote></div><p>
A situation like the above should never have happened. Good sequencing
providers are interested in keeping customers long term and will
therefore try to find out what exactly your needs are. These folks
generally know their stuff (they're making a living out of it) and will
most of the time propose a strategy that fulfills your needs for a near
minimum amount of money.
</p><p>
Listen to them.
</p><p>
If you think they try to rip you off or are overselling their
competences (which most providers I know won't even think of trying,
but there are some), ask for a quote from a couple of other
providers. You'll see pretty quickly if something is not
right.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
As a matter of fact, a rule which has saved me time and again for
finding sequencing providers is not to go for the cheapest provider,
especially if their price is far below quotes from other
providers. They're cutting corners somewhere others don't cut for a
reason.
</td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_whichseqprovider"></a>13.2.
Choosing a sequencing provider
</h2></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This is a slightly reworked version of a post I made on the MIRA talk
mailing list. The question <span class="emphasis"><em>"Could you please recommend me a
sequencing provider?"</em></span> arrives every now and then in my
private inbox, often enough for me to decide to make a collage of the
responses I gave in the past and post it to MIRA talk.
</td></tr></table></div><p>
This response got, errrr, a little longer, but allow me to note that I
will not give you names. The reasons are manifold:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
once upon a time I worked for a sequencing company
</li><li class="listitem">
the company I am currently employed with is not in the sequencing
provider business, but the company uses more than one sequencing
provider on a regular basis and I get to see quite some data
</li><li class="listitem">
due to my development on MIRA in my free time, I'm getting insight
into a number of highs and lows of sequencing technologies at
different sequencing providers which I would not get if I were to
expose them publicly ... I do not want to jeopardise these
relationships.
</li></ul></div><p>
That being said, there are a number of general considerations which
could help you. Excuse me in case the detours I am going to make are
obvious to you, but I'm writing this also for future reference. Also,
please bear with me if I look at "sequencing" a bit differently than you
might be accustomed to from academia, but I have worked for quite some
time now in industry ... and there, cost-effectiveness or rather the
"probability of success" of a project as a whole is paramount to
everything else. I'll come back to that further down.
</p><p>
There's one -- and only one -- question which you, as sequencing
customer, need to be able to answer ... if necessary in every
excruciating detail, but you must know the answer. The question is:
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_want"></a>13.2.1.
WHAT DO YOU WANT?!
</h3></div></div></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Detour - Sequencing -
</b></p></div></div></div><p>
For me, every "sequencing project", be it genomic or transcriptomic,
really consists of four major phases:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="bold"><strong>data generation:</strong></span> This can be
broadly seen as everything to get the DNA/RNA ready to be sent
off to sequencing (usually something the client does), the
library prep at the sequencing provider and finally the
sequencing itself (including base calling). An area of a thousand
pitfalls where each step (and the communication) is crucial and
even one slight inadvertence can make the difference between a
"simple" project and a "hard" project. E.g.: taking DNA from
growing cells (especially bacteria in exponential growth phase)
might not be a good idea ... it makes assembly more
difficult. Some DNA extraction methods generate more junk than
good fragments, etc.
</p><p>
The reason I am emphasizing this is simple: nowadays, the
"sequencing" itself is not the most expensive part of a
sequencing project, the next two steps are (most of the time
anyway).
</p></li><li class="listitem"><p>
<span class="bold"><strong>assembly & finishing:</strong></span> Still
a hard problem. Even a "simple" bacterium can take weeks of
effort to get right if it's riddled with phages, prophages,
transposon elements, genetically engineered repeats etc. And
starting with eukaryotes the real fun starts: ploidy,
retrotransposons etc. make for an unbelievable genome plasticity
and almost always have their own surprises. I've seen "simple"
Saccharomyces cerevisiae - where biologists swore to high heaven
they were "close to the publicly sequenced strains" - being
*very* different from what they were expected to be, both on the
DNA level and the genome organisation level.
</p><p>
Getting eukaryotes right "down to the last base" might cost
quite some money, especially when looping back to step 1 (data
generation) to tackle difficult areas.
</p></li><li class="listitem"><p>
<span class="bold"><strong>annotation:</strong></span> Something many
people forget: give the sequence a meaning. Here too, things can
get quite costly if done "right", i.e., with hand
curation. Especially on organisms which are not part of the more
commonly sequenced species or are generally more complex.
</p><p>
Annotation of a de-novo transcriptome assembly is also not for
the faint of heart, especially if done on short, unpaired read
assemblies.
</p></li><li class="listitem"><span class="bold"><strong>using the sequencing data:</strong></span>
... for whatever it was generated for.
</li></ol></div></div><p>
The above makes it clear that, depending on what you are really
interested in within your project and what you expect to be able to do
with the sequencing data, one can cut corners and reduce cost here and
there (but not everywhere). And therefore, the above question "What do
you want?" is one which - after the initial chit-chat of "hi, hello,
nice to meet you, a pleasure to be here, etc." - every good
representative of respectable sequencing providers I have met so far
will ask as very first question. Usually in the form of "what do you
want to sequence and what will you want to use the data for (and what
not)?"
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_need"></a>13.2.2.
WHAT DO YOU NEED?!
</h3></div></div></div><p>
... difference between "want" and "need" ...
</p><p>
Every other question - like where to sequence, which sequencing
technology to use, how to process the sequencing data afterwards - is
incidental and subordinated to your answer(s) to the question of "what
do you want?!" But often sequencing customers get their priorities
wrong by putting forward another question:
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_cost"></a>13.2.3.
WHAT WILL IT COST ME?
</h3></div></div></div><p>
And its inevitable companion question "Can you make it cheaper?"
</p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Detour - Putting things into perspective -
</b></p></div></div></div><p>
Come to think of it, people sometimes have very interesting ideas
regarding costs. Interesting as in "outright silly." It may be
because they do not really know what they want or feel unsure on
unfamiliar terrain, and often instead focus their energy on
single aspects of a wider project because they feel more at home
there. And suddenly the focus lies on haggling and bartering for
some prices because, after all, this is something everyone knows how
to do, right?
</p><p>
As I hinted earlier, the pure sequencing costs are nowadays probably
not the biggest factor in any sequencing project: 454, Illumina,
IonTorrent and other technology providers have seen to that. E.g.,
in 2003/2004 it still cost somewhere between 150 - 200 k€ to get an
8x Sanger coverage of a moderately sized bacterium (4 to 5
Mb). Nowadays, for the same organism, you get coverages in the
dozens (going with 454) for a few thousand Euro ... or coverages in
the hundreds or even thousands (going with Illumina) for a few
hundred Euro.
</p><p>
Cost for assembly, finishing and annotation have not followed the
same decrease. Yes, advances in algorithms have made things easier
in some parts, but not really on the same scale. Furthermore, the
"short read" technologies have more than offset those gains by
adding algorithmic complexity compared to the old Sanger reads. Maybe
"(ultra)long read" technologies will alleviate the problem, but I
would not hold my breath for them to really work well.
</p><p>
One thing however has almost not changed at all: your costs of
actually doing followup experiments and data interpretation!
Remember that sequencing in itself is most of the time not the
ultimate goal, you actually want to gain something out of it. Be it
abstract knowledge for a paper or concrete hints for producing some
compounds or whatever, chances are that you will actually devote a
substantial amount of your resources (time, manpower, mental health)
into followup activities (lab experiments, genetic engineering,
writing papers) to turn the abstract act of sequencing into
something tangible, be it papers, fame, new products, money, or
whatever you want to achieve.
</p><p>
And this is the place where it pays to stop and think: "what do I
want? what are my strengths and where are my weaknesses? where are
my priorities?" The English have a nice saying: "Being penny-wise
and pound-foolish is not wise." I may add: Especially not if you are
basing man months / years of lab work and your career on the outcome
of something like sequencing. Maybe I'm spoiled because I have left
academia for quite some time now, but in sequencing I always prefer
to throw a bit more money at the sequencing process itself to
minimise risks of the later stages.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_where"></a>13.2.4.
WHERE TO SEQUENCE?
</h3></div></div></div><p>
There's one last detour I'd like to make, and that is the question of "where to sequence?"
</p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Detour - Public or private, old-timers or young-timers ? -
</b></p></div></div></div><p>
Choosing a sequencing provider is highly dependent on your answer to
"what do you want?" Wanting to keep the sequencing data (or
the very act of sequencing) secret (even only for some time) will
probably lead you to commercial sequencing companies. There you more
or less have complete control over the data. Paranoid people might
perhaps argue that you can have that only with your own sequencing
equipment and personnel, but I have the feeling that only a minority
is able to cough up the necessary money for purchasing sequencing
equipment for a small one-time project.
</p><p>
Instead of companies you could however also look whether one of the
existing sequencing centers in the world might be a good cooperation
candidate. Especially if you are doing this project within the scope
of your university. Note however that there might be a number of
gotchas lurking there, beside the obvious "the data is not really
secret anymore": sometimes the raw sequencing data needs to be
publicly released, maybe earlier than you would like; or the
sequencing center imposes that each and every paper you publish with
that data as basis has them as (co-)first author.
</p><p>
A related problem is "whom do I trust to deliver good work?"
Intuition says that institutes with a long sequencing history have
amassed quite some knowledge in this field, making them experts in
all three aspects (data generation, assembly & finishing,
annotation) of a sequencing project ... and intuition probably isn't
wrong there. The same thing is probably true for sequencing
companies which have existed for more than just a couple of years,
though what I have seen so far is that - due to size -
sequencing companies sometimes really focus on the data generation
and rely on partner companies for "assembly" and "annotation". This
is not to say that younger companies are bad. Incidentally, it is my
belief that in this field, people are still more important than
technology ... and every once in a while good people split off a
well known institute (or company) to try their luck in a
company of their own. Always look for references there.
</p><p>
The following statement is a personal opinion (and you can call me
biased for that): Personally, I am however quite wary of sequencing
done at locations where a sequencer exists because someone got a
grant to buy one (because it was chic & en-vogue to get a shiny
new toy) but where the instrument then slowly starts to collect dust
after the initial flurry ... and because people often do not
factor in the chemistry costs which arise if they really intend to
use the machine 24/7. I want to know that technicians actually
work with those things every day, that they know the ins and outs of
the work, the protocols, the chemistry, the moods of the machine
(even an instrument can have a bad day). I honestly do not believe
that one can build up enough expertise when operating these things
"every once in a while".
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_summary"></a>13.2.5.
Summary of all the above
</h3></div></div></div><p>
All of the above means that depending on what I need the data for, I
have the freedom to choose among different providers. In case I just need
masses of raw data and potential savings are substantial, I might go
with the cheapest whom I know to generate good data. If I want good
service and a second round of data in case I am not 110% satisfied with
the first round (somehow people have stopped questioning me there),
this is usually not the cheapest provider ... but the additional costs
are not really high. If I wanted my data really really quick, I'd
search for a provider with Ion Torrent, or MiSeq (I am actually
looking for one with a MiSeq, so if anyone knows a good one,
preferably in Europe -> mail me). Though I already did transcriptomics
on eukaryotes, in case I needed larger eukaryotes assembled de-novo
& also annotated, I would probably look for the help of a larger
sequencing center as this starts to get dangerously near the fringe of
my field of expertise.
</p><p>
In closing this part, here are a couple of guidelines which have not
failed me so far for choosing sequencing providers:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
Building a good relationship helps. In case your institute /
university already has good (or OK) experience with a provider, ask
there first.
</li><li class="listitem">
It is a lot easier to build a good relationship with someone who
speaks your language ... or good(!) English.
</li><li class="listitem">
I will not haggle for a couple of hundred Euros in a single project,
I'll certainly reconsider this when savings are in the tens of
thousands.
</li><li class="listitem">
Managing expectations: some sequencing projects are high risk from
the start, for lots of possible reasons (underfunded, bad starting
material, unclear organism). This is *sometimes* (!) OK as long as
everyone involved knows and acknowledges this. However, you should
always have a clear target ("what am I looking for?") and preferably
know in advance how to treat the data to get there.
</li><li class="listitem">
Errors occur, stay friendly at first. In case the expectations were
clear (see above), the material and organism are not at fault but
the data quality somehow is bad, it is not too difficult to have the
sequencing provider acknowledge this and get additional sequencing
for no added cost.
</li></ul></div><p>
Regarding the technologies you can use ... it really depends on what
you want to do :-) And note that I base my answers on technologies
available today without bigger problems: PacBio, Illumina, with
IonTorrent as Joker for quick projects. 454 can still be considered,
but probably not for too long anymore as Roche stopped development of
the technology and thus PacBio takes over the part for long
reads. Oxford Nanopore might become a game changer, but they are not
there just yet.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_specific"></a>13.3.
Specific advice
</h2></div></div></div><p>
Here's how I see things as of now (January 2014), which might not
necessarily be how others see them.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_technologies"></a>13.3.1.
Technologies
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_sanger"></a>13.3.1.1.
Sanger
</h4></div></div></div><p>
Use for: checking assemblies; closing gaps by PCR; checking for a couple of genes with
known sequence (i.e., where you can design oligos for).
</p><p>
Do not use for: anything else. In particular, if you find yourself
designing oligos for a 96 well plate destined for Sanger sequencing
of a single bacterial DNA sample, you (probably) are doing something
wrong.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_pacbio"></a>13.3.1.2.
Pacific Biosciences
</h4></div></div></div><p>
Use for: de-novo sequencing of bacteria and lower eukaryotes (or higher
eukaryotes if you have the money). PacBio should be seen as
<span class="emphasis"><em>the</em></span> technology to use when getting the best
assemblies with least number of contigs is important to you. Also,
resequencing of variants of known organisms with lots of genomic
reorganisation flexibility due to high numbers of transposons (where
short reads will not help in getting the chromosome assembled/mapped
correctly).
</p><p>
Do not use for: resequencing of "dull" organisms (where the only
differences will be simple SNPs or simple insertion/deletions or
simple contig reorganisations at non-repetitive places). Illumina
will do a much better and cost effective job there.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
As of January 2014: aim for at least 100x coverage of raw data,
better 130x to 150x as pre-processing (quality clip, removal of
adapters and other sequencing artefacts) will take its toll and
reduce the data by up to 1/3. After that, the error
correction/self-correction of raw reads into corrected reads will
again reduce the data considerably.
</p><p>
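To make the arithmetic above concrete, here is a minimal Python sketch.
The function names, the 5 Mb genome size and the fixed loss of 1/3 are
illustrative assumptions, not part of MIRA or of any PacBio tool:
</p><pre class="programlisting">
```python
def raw_bases_needed(genome_size_bp, raw_coverage):
    """Total raw bases to order for a given genome size and raw coverage."""
    return genome_size_bp * raw_coverage


def coverage_after_preprocessing(raw_coverage, loss_fraction=1 / 3):
    """Coverage left after clipping, assuming a fixed fraction of data is lost."""
    return raw_coverage * (1 - loss_fraction)


if __name__ == "__main__":
    genome_size = 5_000_000  # hypothetical 5 Mb bacterium
    for raw in (100, 130, 150):
        bases = raw_bases_needed(genome_size, raw)
        left = coverage_after_preprocessing(raw)
        # Error correction will reduce the data further; that is not modelled here.
        print(f"{raw}x raw = {bases / 1e6:.0f} Mb, roughly {left:.0f}x after pre-processing")
```
</pre><p>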
It's really a numbers game: the more data you have, the more
likely you will also get many of those really long reads in the 5
to 30 Kb range which are extremely useful to get over those nasty
repeats.
</p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
MIRA will most probably give you longer contigs with corrected
PacBio reads than you get with the HGAP pipeline, but the number of
indel errors will currently be higher. Either use Quiver on the
results of MIRA ... or simply polish the assembly with a cheap
Illumina data set. The latter approach will also give you better
results than a Quiver approach.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For non-haploid organisms, you might need more coverage to get
enough data at ploidy sites to get the reads correctly out of
error correction.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Preparation of your DNA sample is not trivial as many methods will
break your DNA into "small" chunks which are good enough for
Sanger, 454, Illumina or Ion Torrent, but not for PacBio.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_illumina"></a>13.3.1.3.
Illumina
</h4></div></div></div><p>
Use for: general resequencing jobs (finding SNPs, indel locations of
any size, copy number variations etc.); gene expression analysis;
cheap test sequencing of unknown organisms to assess complexity;
de-novo sequencing if you are OK with getting hundreds / thousands
of contigs (depending on organism, some bacteria get only a few
dozen).
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Careful with high-GC organisms: starting at 60% to 65% GC, Illumina
reads contain more errors, so SNP detection may be less reliable if
extreme care is not taken to perform good read clipping. Especially
the dreaded GGCxG motif often leads to problems in Illumina reads.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For de-novo assemblies, do <span class="emphasis"><em>NOT</em></span> (never ever at
all and under no circumstances) use the Nextera kit, take
TruSeq. The non-random fragmentation behaviour of Nextera leads to
all sorts of problems for assemblers (not only MIRA) which try to
use kmer frequencies as a criterion for repetitiveness of a given
sequence.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_iontorrent"></a>13.3.1.4.
Ion Torrent
</h4></div></div></div><p>
Use for: like Illumina. With three notable exceptions: 1) SNP
detection is not as good as with Illumina (more false positives and
false negatives); 2) de-novo assemblies will contain more single-base
indels; and 3) as Ion has problems with homopolymers, that technology
is not as well suited as Illumina to serve as a complementary hybrid
technology for PacBio (except for high-GC perhaps).
</p><p>
Ion has a speed advantage on Illumina: if you have your own machine,
getting from your sample to data takes less time than with Illumina.
</p><p>
Also, it looks as if Ion has fewer problems with GC content or
sequence motifs than Illumina.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_454"></a>13.3.1.5.
Roche 454
</h4></div></div></div><p>
That technology is on the way out, but there may be two reasons to
not completely dismiss 454: 1) the average read length of 700 bp can
be seen as a plus when compared to Illumina or Ion ... but then
there's PacBio to take care of read length. 2) the large read-pair
libraries work better with 454 than Illumina mate-pair libraries,
something which might be important for scaffolding data where even
PacBio could not completely resolve long repeats.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_denovo"></a>13.3.2.
Sequencing de-novo
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
On a cheap gene fishing expedition? Probably Illumina HiSeq, at
least 100bp, 150 to 250bp or 300bp if your provider supports it
well. Paired-end definitely a plus. As alternative: Ion Torrent for
small organisms (maybe up to 100Mb) and when you need results quickly
without caring for possible frameshifts.
</li><li class="listitem">
Want some larger contigs? PacBio. Add in cheap Illumina 100bp
paired-end (150 to 300bp if provider supports it) to get rid of
those last frameshifts which may remain.
</li><li class="listitem">
Maybe scaffolding of contigs above? PacBio + Illumina 100bp + a
large paired-end library (e.g. 454 20kb)
</li><li class="listitem">
Have some good friends at Oxford Nanopore who can give you some
MinION engineering samples? Man, I'd kill for some bacterial test
sets with those (especially Bacillus subtilis 168)
</li></ul></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_mapping"></a>13.3.3.
Re-sequencing / mapping
</h3></div></div></div><p>
There is a reason why Illumina currently dominates the market as it
does: a cheap Illumina run (preferably paired-end) will answer most of
your questions in 99% of the cases. Things will get difficult for
organisms with high numbers of repeats and/or frequent genome
re-arrangements. Then using longer read technologies and/or Illumina
mate-pair may be required.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_a_word_or_two_on_coverage"></a>13.4.
A word or two on coverage ...
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_lowcov"></a>13.4.1.
Low coverage isn't worth it
</h3></div></div></div><p>
There's one thing to be said about coverage and de-novo assembly:
especially for bacteria, getting more than 'decent' coverage is
<span class="emphasis"><em>cheap</em></span> with any current day technology. Every
assembler I know will be happy to assemble de-novo genomes with
coverages of 25x, 30x, 40x ... and the number of contigs will still
drop dramatically between a 15x Ion Torrent and a 30x Ion Torrent
project.
</p><p>
In any case, do some calculations: if the coverage you expect to get
reaches 50x (e.g. 200 Mb raw sequence for a 4 Mb genome), then you
(or rather the assembler) can still throw away the worst 20% of the
sequence (with lots of sequencing errors) and concentrate on the
really, really good parts of the sequences to get you nice contigs.
</p><p>
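The calculation above can be written down as a small Python sketch. The
function names are made up for illustration; the numbers are the ones
from the example (200 Mb of raw sequence, 4 Mb genome, worst 20%
discarded):
</p><pre class="programlisting">
```python
def expected_coverage(total_bases, genome_size_bp):
    """Average coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


def coverage_after_discarding(coverage, discarded_fraction):
    """Coverage that remains once a fraction of the sequence is thrown away."""
    return coverage * (1 - discarded_fraction)


if __name__ == "__main__":
    raw = expected_coverage(200_000_000, 4_000_000)
    usable = coverage_after_discarding(raw, 0.20)
    print(f"raw coverage: {raw:.0f}x, after dropping the worst 20%: {usable:.0f}x")
```
</pre><p>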
Other example: the price for 1 gigabase Illumina paired-end of a
single DNA prep is way, way below USD 1000, even with commercial
providers. Then you just need to do the math: is it worth investing
10, 20, 30 or more days of wet lab work, designing primers, doing PCR
sequencing etc. and trying to close remaining gaps or hunt down
sequencing errors when you went for a 'low' coverage or a non-hybrid
sequencing strategy? Or do you invest a few bucks more to get some
additional coverage and considerably reduce the uncertainties and gaps
which remain?
</p><p>
Remember, you probably want to do research on your bug and not
research on how to best assemble and close genomes. So even if you put
(PhD) students on the job, it's costing you time and money if you
wanted to save money earlier in the sequencing. Penny-wise and
pound-foolish is almost never a good strategy :-)
</p><p>
I do agree that with eukaryotes, things start to get a bit more
interesting from the financial point of view.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_highcov"></a>13.4.2.
Catch-22: too high coverage
</h3></div></div></div><p>
There is, however, a catch-22 situation with coverage: too much
coverage isn't good either. Without going into details: sequencing
errors sometimes interfere heavily when coverage exceeds ~60x to 80x
for 454 & IonTorrent and approximately 150x to 200x for
Solexa/Illumina.
</p><p>
In those cases, do yourself a favour: there's more than enough data
for your project ... just cut it down to some reasonable amount: 40x
to 50x for 454 & IonTorrent, 100x for Solexa/Illumina.
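</p><p>
One simple way to cut the data down is random subsampling. The
following Python sketch shows the idea under the assumption that your
reads are already loaded into a list; the function names and the fixed
random seed are illustrative choices and not something MIRA provides:
</p><pre class="programlisting">
```python
import random


def keep_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep; never more than everything you have."""
    return min(1.0, target_coverage / current_coverage)


def subsample(reads, fraction, seed=42):
    """Pick a random subset of reads, preserving their original order."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_keep = round(len(reads) * fraction)
    kept_indices = sorted(rng.sample(range(len(reads)), n_keep))
    return [reads[i] for i in kept_indices]


if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(1000)]  # stand-in for real reads
    fraction = keep_fraction(current_coverage=200, target_coverage=100)
    kept = subsample(reads, fraction)
    print(f"keeping {len(kept)} of {len(reads)} reads")
```
</pre><p>
For paired-end data, sample whole pairs rather than single reads so that
mates stay together; the sketch treats every list element as one unit,
so a list of pairs works as well.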
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_when_sequencing_a_word_of_caution_regarding_your_dna"></a>13.5.
A word of caution regarding your DNA in hybrid sequencing projects
</h2></div></div></div><p>
So, you have decided that sequencing your bug with PacBio and Illumina
(or PacBio and Ion Torrent or whatever) may be a viable way to get the
best bang for your buck. Then please follow this advice: prepare enough
DNA <span class="emphasis"><em>in</em></span> <span class="emphasis"><em>one</em></span>
<span class="emphasis"><em>go</em></span> for the sequencing provider so that they can
sequence it with all the technologies you chose without you having to
prepare another batch ... or even grow another culture!
</p><p>
The reason for that is that as soon as you do that, the probability that
there is a mutation somewhere that your first batch did not have is not
negligible. And if there is a mutation, even if it is only one base,
there is a >95% chance that MIRA will find it and think it is some
repetitive sequence (like a duplicated gene with a mutation in it) and
split contigs at those places.
</p><p>
Now, there are times when you cannot completely be sure that different
sequencing runs did not use slightly different batches (or even strains).
</p><p>
One example: the SFF files for SRA000156 and SRA001028 from the NCBI
short read archive should both contain E. coli K12 MG1655 (two
unpaired half plates and a paired-end plate). However, they contain
DNA from different cultures. Furthermore, the DNA was prepared by
different labs. The net effect is that the sequences in the paired-end
library contain a few distinct mutations from the sequences in the two
unpaired half-plates. Furthermore, the paired-end sequences contain
sequences from phages that are not present in the unpaired sequences.
</p><p>
In those cases, provide strain information to the reads so that MIRA can
discern possible repeats from possible SNPs.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_for_bacteria"></a>13.6.
Advice for bacteria
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_for_bacteria_no_not_sample_in_exponential_phase"></a>13.6.1.
Do not sample DNA from bacteria in exponential growth phase!
</h3></div></div></div><p>
The reason is simple: some bacteria grow so fast that they start
replicating themselves even before having finished the first
replication cycle. This leads to more DNA around the origin of
replication being present in cells, which in turn fools assemblers and
mappers into believing that those areas are either repeats or that
there are copy number changes.
</p><p>
Sample. In. Stationary. Phase!
</p><p>
For de-novo assemblies, MIRA will warn you if it detects data which
points at exponential phase. In mapping assemblies, look at the
coverage profile of your genome: if you see a smile shape (or V
shape), you have a problem.
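</p><p>
If you prefer a number over eyeballing the profile, the following Python
sketch compares the coverage at both ends of the reference with the
coverage in the middle. It assumes that the reference sequence starts
near the origin of replication (which is what produces the smile shape)
and that you have one coverage value per position or per bin; the
function name and the window size are illustrative, and this is not how
MIRA itself detects the problem:
</p><pre class="programlisting">
```python
from statistics import mean


def edge_to_middle_ratio(coverage, window_fraction=0.1):
    """Mean coverage at both ends of the profile divided by the mean in the middle."""
    n = len(coverage)
    w = max(1, int(n * window_fraction))  # window size in positions (or bins)
    edges = coverage[:w] + coverage[-w:]
    mid_start = (n - w) // 2
    middle = coverage[mid_start:mid_start + w]
    return mean(edges) / mean(middle)


if __name__ == "__main__":
    flat_profile = [30] * 100
    v_profile = [20 + abs(i - 50) for i in range(101)]  # synthetic V shape
    print(f"flat profile ratio: {edge_to_middle_ratio(flat_profile):.2f}")
    print(f"V-shaped profile ratio: {edge_to_middle_ratio(v_profile):.2f}")
```
</pre><p>
A ratio close to 1 is what you would hope for from stationary phase; how
far above 1 counts as a problem is a judgement call which this sketch
does not make.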
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_for_bacteria:_beware_of_high_copy_number_plasmids"></a>13.6.2.
Beware of (high copy number) plasmids!
</h3></div></div></div><p>
This is a source of interesting problems and furthermore gets people
wondering why MIRA sometimes creates more contigs than other
assemblers when it usually creates fewer.
</p><p>
Here's the short story: there are data sets which include one or
several high-copy plasmid(s). Here's a particularly ugly example:
SRA001028 from the NCBI short read archive which contains a plate of
paired-end reads for E. coli K12 MG1655-G
(<a class="ulink" href="ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001028/" target="_top">ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001028/</a>).
</p><p>
The genome is sequenced at ~10x coverage, but during the assembly,
three intermediate contigs with ~2kb attain a silly maximum coverage
of ~1800x each. This means that there were ~540 copies of this
plasmid (or these plasmids) in the sequencing.
</p><p>
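One way to arrive at the figure above is to divide the coverage of each
high-coverage contig by the genome coverage and add up the results. The
following Python sketch does just that; the function name is made up for
illustration and the numbers are the approximate ones quoted above:
</p><pre class="programlisting">
```python
def copies_relative_to_genome(contig_coverage, genome_coverage):
    """Estimated copy number of a contig relative to the chromosome."""
    return contig_coverage / genome_coverage


if __name__ == "__main__":
    genome_coverage = 10                   # ~10x for the chromosome
    contig_coverages = [1800, 1800, 1800]  # three ~2kb contigs at ~1800x each
    per_contig = [copies_relative_to_genome(c, genome_coverage) for c in contig_coverages]
    print(f"copies per contig: {per_contig}, total: {sum(per_contig):.0f}")
```
</pre><p>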
When using the uniform read distribution algorithm - which is switched
on by default when using "--job=" and the quality level of 'accurate' -
MIRA will determine the average coverage of the genome to be
~10x. Subsequently this leads MIRA to dutifully create ~500 additional
contigs (plus a number of contig debris) with various incarnations of
that plasmid at an average of ~10x, because it thought that these were
repetitive sites within the genome that needed to be disentangled.
</p><p>
Things get even more interesting when some of the plasmid / phage
copies are slightly different from each other. These too will be split
apart and when looking through the results later on and trying to join
the copies back into one contig, one will see that this should not be
done because there are real differences.
</p><p>
DON'T PANIC!
</p><p>
The only effect this has on your assembly is that the number of
contigs goes up. This in turn leads to a number of questions in my
mailbox why MIRA is sometimes producing more contigs than Newbler (or
other assemblers), but that is another story (hint: Newbler either
collapses repeats or leaves them completely out of the picture by not
assembling repetitive reads).
</p><p>
What you can do is the following:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
either you assemble everything together and then join the plasmid
contigs manually after assembly, e.g. in gap4 (drawback: on really
high copy numbers, MIRA will work quite a bit longer ... and you
will have a lot of fun joining the contigs afterwards)
</p></li><li class="listitem"><p>
or, after you found out about the plasmid(s) and know the sequence,
you filter out reads in the input data which contain this sequence
(you can use <span class="command"><strong>mirabait</strong></span> for this) and assemble the
remaining reads.
</p></li></ol></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_bitsandpieces"></a>Chapter 14. Bits and pieces</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_bap_using_ssaha2_smalt_to_screen_for_vector_sequence">14.1.
Using SSAHA2 / SMALT to screen for vector sequence
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Just when you think it's finally settled, it isn't.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_bap_using_ssaha2_smalt_to_screen_for_vector_sequence"></a>14.1.
Using SSAHA2 / SMALT to screen for vector sequence
</h2></div></div></div><p>
If your sequencing provider gave you data which was NOT pre-clipped for
vector sequence, you can do this yourself in a pretty robust manner
using SSAHA2 -- or the successor, SMALT -- from the Sanger Centre. You
just need to know which sequencing vector the provider used and have its
sequence in FASTA format (ask your provider).
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This screening is a valid method for any type of Sanger sequencing
vectors, 454 adaptors, Illumina adaptors and paired-end adaptors
etc. However, you probably want to use it only for Sanger type data as
MIRA already knows all standard 454, Ion Torrent and Illumina adaptors.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
SSAHA2 and SMALT need their input data in FASTA format, so you will
need a FASTA copy of your reads to run them. MIRA itself, however, can
load your original data in whatever format it is in.
</td></tr></table></div><p>
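If your reads only exist as FASTQ, a FASTA copy for the screening step can be
produced with a few lines of Python. The following is a minimal sketch, not part
of MIRA: the function name <code class="literal">fastq_to_fasta</code> is made up for this example, and it
assumes strict four-line FASTQ records (header, sequence, plus line, qualities)
without wrapped sequences.
</p>
```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines for strict 4-line FASTQ records.

    Assumption for this sketch: every record is exactly four lines
    (header, sequence, '+' line, qualities), with no line wrapping.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    # Walk the input in steps of four lines, one record at a time.
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        yield ">" + header[1:]   # FASTA header keeps the read name
        yield sequence           # qualities are dropped


example = ["@read1", "ACGT", "+", "IIII"]
print("\n".join(fastq_to_fasta(example)))
```
<p>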
For SSAHA2 follow these steps (most are the same as in the example
above):
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>ssaha2 -output ssaha2
-kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6
/path/where/the/vector/data/resides/vector.fasta
<em class="replaceable"><code>yourinputsequences.fasta</code></em> > <em class="replaceable"><code>screendataforyoursequences.ssaha2</code></em></code></strong></pre><p>
Then, in your manifest file, add the following line in the readgroup
which contains the sequences you screened:
</p><pre class="screen">
<strong class="userinput"><code>readgroup
...
data = <em class="replaceable"><code>yourinputsequences_inwhateverformat_thisexamplehasfastq.fastq</code></em>
data = <em class="replaceable"><code>screendataforyoursequences.ssaha2</code></em>
...</code></strong></pre><p>
For SMALT, the only difference is that you use SMALT for generating the
vector-screen file and ask SMALT to generate it in SSAHA2 format. As
SMALT works in two steps (indexing and then mapping), you also need to
perform it in two steps and then call MIRA. E.g.:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>smalt index -k 7 -s 1 smaltidxdb /path/where/the/vector/data/resides/vector.fasta</code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>smalt map -f ssaha -d -1 -m 7 smaltidxdb <em class="replaceable"><code>yourinputsequences.fasta</code></em> > <em class="replaceable"><code>screendataforyoursequences.smalt</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Please note that, due to subtle differences between output of SSAHA2 (in
ssaha2 format) and SMALT (in ssaha2 format), MIRA identifies the source
of the screening (and the parsing method it needs) by the name of the
screen file. Therefore, screens done with SSAHA2 need to have the
postfix <code class="filename">.ssaha2</code> in the file name and screens done
with SMALT need
<code class="filename">*.smalt</code>.
</td></tr></table></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_faq"></a>Chapter 15. Frequently asked questions</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_faq_assembly_quality">15.1.
Assembly quality
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_is_the_effect_of_uniform_read_distribution_as:urd?">15.1.1.
What is the effect of uniform read distribution (-AS:urd)?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_there_are_too_many_contig_debris_when_using_uniform_read_distribution_how_do_i_filter_for_good_contigs?">15.1.2.
There are too many contig debris when using uniform read distribution, how do I filter for "good" contigs?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_when_finishing_which_places_should_i_have_a_look_at?">15.1.3.
When finishing, which places should I have a look at?
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_454_data">15.2.
454 data
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_do_i_need_sffs_for?">15.2.1.
What do I need SFFs for?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_what's_sff_extract_and_where_do_i_get_it?">15.2.2.
What's sff_extract and where do I get it?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_do_i_need_the_sfftools_from_the_roche_software_package?">15.2.3.
Do I need the sfftools from the Roche software package?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_combining_sffs">15.2.4.
Combining SFFs
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_adaptors_and_pairedend_linker_sequences">15.2.5.
Adaptors and paired-end linker sequences
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_what_do_i_get_in_pairedend_sequencing?">15.2.6.
What do I get in paired-end sequencing?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_sequencing_protocol">15.2.7.
Sequencing protocol
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_filtering_by_seqlen">15.2.8.
Filtering sequences by length and re-assembly
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_solexa___illumina_data">15.3.
Solexa / Illumina data
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_can_i_see_deletions?">15.3.1.
Can I see deletions?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_can_i_see_insertions?">15.3.2.
Can I see insertions?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_denovo_assembly_with_solexa_data">15.3.3.
De-novo assembly with Solexa data
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_hybrid_assemblies">15.4.
Hybrid assemblies
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_are_hybrid_assemblies?">15.4.1.
What are hybrid assemblies?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_what_differences_are_there_in_hybrid_assembly_strategies?">15.4.2.
What differences are there in hybrid assembly strategies?
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_masking">15.5.
Masking
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_should_i_mask?">15.5.1.
Should I mask?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_how_can_i_apply_custom_masking?">15.5.2.
How can I apply custom masking?
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_miscellaneous">15.6.
Miscellaneous
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_are_megahubs?">15.6.1.
What are megahubs?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_passes_and_loops">15.6.2.
Passes and loops
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_debris">15.6.3.
Debris
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_tmpf_files:_more_info_on_what_happened_during_the_assembly">15.6.4.
Log and temporary files: more info on what happened during the assembly
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_faq_sequence_clipping_after_load">15.6.4.1.
Sequence clipping after load
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_faq_platforms_and_compiling">15.7.
Platforms and Compiling
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_windows">15.7.1.
Windows
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Every question defines its own answer. Except perhaps 'Why a duck?'
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><p>
This list is a collection of frequently asked questions and answers
regarding different aspects of the MIRA assembler.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This document needs to be overhauled.
</td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_assembly_quality"></a>15.1.
Assembly quality
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_is_the_effect_of_uniform_read_distribution_as:urd?"></a>15.1.1.
What is the effect of uniform read distribution (-AS:urd)?
</h3></div></div></div><p>
</p><pre class="screen">
I have a project which I once started quite normally via
"--job=denovo,genome,accurate,454"
and once with explicitly switching off the uniform read distribution
"--job=denovo,genome,accurate,454 -AS:urd=no"
I get less contigs in the second case and I wonder if that is not better.
Can you please explain?
</pre><p>
</p><p>
Since 2.9.24x1, MIRA has a feature called "uniform read distribution" which is
normally switched on. This feature reduces over-compression of repeats during
the contig building phase and makes sure that, e.g., an rRNA stretch which is
present 10 times in a bacterium will also be present approximately 10 times in
your result files.
</p><p>
It works a bit like this: under the assumption that reads in a project are
uniformly distributed across the genome, MIRA will enforce an average coverage
and temporarily reject reads from a contig when this average coverage
multiplied by a safety factor is reached at a given site.
</p><p>
It's generally a very useful tool to disentangle repeats, but it has some slight
side effects: rejection of otherwise perfectly good reads. The
assumption of read distribution uniformity is the big problem we have here:
of course it's not really valid. You sometimes have less, and sometimes more
than "the average" coverage. Furthermore, the new sequencing technologies
(454 perhaps, but especially the microreads from Solexa and probably also SOLiD)
show that you also have a skew towards the site of the replication origin.
</p><p>
One example: let's assume the average coverage of your project is 8 and by
chance at one place you have 17 (non-repetitive) reads, then the following
happens:
</p><p>
<span class="emphasis"><em>p</em></span> = parameter of -AS:urdsip
</p><p>
Pass 1 to <span class="emphasis"><em>p</em></span>-1: MIRA happily assembles everything together and calculates a
number of different things, amongst them an average coverage of ~8. At the
end of pass <span class="emphasis"><em>p</em></span>-1, it will announce this average coverage as a first estimate
to the assembly process.
</p><p>
Pass <span class="emphasis"><em>p</em></span>: MIRA has still assembled everything together, but at the end of each
pass the contig self-checking algorithms now include an "average coverage
check". They'll invariably find the 17 reads stacked and decide (looking at
the -AS:urdct parameter which I now assume to be 2) that 17 is larger than
2*8 and that this very well may be a repeat. The reads get flagged as
possible repeats.
</p><p>
Pass <span class="emphasis"><em>p</em></span>+1 to end: the "possibly repetitive" reads get a much tougher
treatment in MIRA. Amongst other things, when building the contig, the contig
now checks that "possibly repetitive" reads do not over-stack beyond the average
coverage multiplied by a safety value (-AS:urdcm) which I'll assume in this
example to be 1.5. So, at a certain point, say when read 14 or 15 of
that possible repeat wants to be aligned to the contig at this given place, the
contig will just flatly refuse and tell the assembler to please find another
place for them, be it in this contig that is being built or any other that will
follow. Of course, if the assembler cannot comply, the reads 14 to 17 will end
up as a contiglet (contig debris, if you want) or, if it was only one read that
got rejected like this, it will end up as a singlet or in the debris file.
</p><p>
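The arithmetic behind this example can be written down in a few lines of Python.
This is only a simplified sketch of the two comparisons described above, not
MIRA's actual implementation: the function names are made up, and the default
values are the ones assumed in the example (-AS:urdct of 2 and -AS:urdcm of 1.5).
Note that plain multiplication gives a stacking cap of 12 for an average coverage
of 8, so the "read 14 or 15" in the text is best read as illustrative.
</p>
```python
def urd_thresholds(avg_coverage, urdct=2.0, urdcm=1.5):
    """Return (repeat_flag_threshold, stacking_cap) for an average coverage.

    urdct and urdcm mirror the values assumed in the text; the function
    itself is a hypothetical helper for illustration only.
    """
    return avg_coverage * urdct, avg_coverage * urdcm


def is_flagged_as_possible_repeat(coverage, avg_coverage, urdct=2.0):
    """True if the local coverage exceeds average coverage times urdct."""
    return coverage > avg_coverage * urdct


flag_threshold, cap = urd_thresholds(8)
print(flag_threshold, cap)                    # 16.0 12.0
print(is_flagged_as_possible_repeat(17, 8))   # True: 17 is larger than 2*8
```
<p>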
Tough luck. I do have ideas on how to re-integrate those reads at the end of an
assembly, but I had deferred doing this as in every case I had looked at,
adding those reads to the contigs wouldn't have changed anything ... there's
already enough coverage. What I do in those cases is simply filter away the
contiglets (defined as being of small size and having an average coverage
below the average coverage of the project / 3 (or 2.5)) from a project.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_there_are_too_many_contig_debris_when_using_uniform_read_distribution_how_do_i_filter_for_good_contigs?"></a>15.1.2.
There are too many contig debris when using uniform read distribution, how do I filter for "good" contigs?
</h3></div></div></div><p>
</p><pre class="screen">
When using uniform read distribution there are too many contig with low
coverage which I don't want to integrate by hand in the finishing process. How
do I filter for "good" contigs?
</pre><p>
</p><p>
OK, let's get rid of the cruft. It's easy, really: you just need to look up
one number, take two decisions and then launch a command.
</p><p>
The first decision you need to take is on the minimum average coverage the
contigs you want to keep should have. Have a look at the file
<code class="filename">*_info_assembly.txt</code> which is in the info directory after
assembly. In the "Large contigs" section, there's a "Coverage assessment"
subsection. It looks a bit like this:
</p><pre class="screen">
...
Coverage assessment:
--------------------
Max coverage (total): 43
Max coverage
Sanger: 0
454: 43
Solexa: 0
Solid: 0
Avg. total coverage (size ≥ 5000): 22.30
Avg. coverage (contig size ≥ 5000)
Sanger: 0.00
454: 22.05
Solexa: 0.00
Solid: 0.00
...
</pre><p>
</p><p>
This project was obviously a 454 only project, and the average coverage for it
is ~22. This number was estimated by MIRA by taking only contigs of at least
5Kb into account, which for sure left out everything which could be
categorised as debris. It's a pretty solid number.
</p><p>
Now, depending on how much time you want to invest performing some manual
polishing, you should extract contigs which have at least the following
fraction of the average coverage:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
2/3 if a quick and "good enough" result is what you want and you don't want to
do any manual polishing. In this example, that would be around 14 or 15.
</p></li><li class="listitem"><p>
1/2 if you want to have a "quick look" and possibly perform some
contig joins. In this example the number would be 11.
</p></li><li class="listitem"><p>
1/3 if you want to be quite accurate and be sure not to lose any possible
repeat. That would be 7 or 8 in this example.
</p></li></ul></div><p>
</p><p>
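The three cut-offs in the list above are simple fractions of the average coverage
reported by MIRA. A minimal Python sketch of that calculation follows; the
function name and the labels are made up for this example, and the value 22.30
is the one from the sample output shown earlier.
</p>
```python
def coverage_cutoffs(avg_coverage):
    """Return candidate minimum-coverage cut-offs as fractions of the average.

    The labels are hypothetical names for the three cases in the text.
    """
    fractions = {
        "quick, no polishing": 2 / 3,
        "quick look, some joins": 1 / 2,
        "accurate, keep repeats": 1 / 3,
    }
    return {label: round(avg_coverage * f, 2) for label, f in fractions.items()}


for label, cutoff in coverage_cutoffs(22.30).items():
    print(f"{label}: {cutoff}")
```
<p>
You would then round the chosen value to a whole number before using it as the
minimum average coverage for filtering.
</p><p>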
The second decision you need to take is on the minimum length your contigs
should have. This decision is a bit dependent on the sequencing technology you
used (the read length). The following are some rules of thumb:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Sanger: 1000 to 2000
</p></li><li class="listitem"><p>
454 GS20: 500
</p></li><li class="listitem"><p>
454 FLX: 1000
</p></li><li class="listitem"><p>
454 Titanium: 1500
</p></li></ul></div><p>
</p><p>
Let's assume we decide on a minimum average coverage of 11 and a minimum length of
1000 bases. Now you can filter your project with miraconvert:
</p><pre class="screen">
miraconvert -x 1000 -y 11 sourcefile.caf filtered.caf
</pre><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_when_finishing_which_places_should_i_have_a_look_at?"></a>15.1.3.
When finishing, which places should I have a look at?
</h3></div></div></div><p>
</p><pre class="screen">
I would like to find those places where MIRA wasn't sure and give it a quick
shot. Where do I need to search?
</pre><p>
</p><p>
Search for the following tags in gap4 or any other finishing program
for finding places of importance (in this order).
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
IUPc
</p></li><li class="listitem"><p>
UNSc
</p></li><li class="listitem"><p>
SRMc
</p></li><li class="listitem"><p>
WRMc
</p></li><li class="listitem"><p>
STMU (only hybrid assemblies)
</p></li><li class="listitem"><p>
STMS (only hybrid assemblies)
</p></li></ul></div><p>
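To get a rough idea of how much work is waiting, you can count these tags in the
assembly result before opening a finishing program. The following Python sketch
does that for text lines of a CAF file. It assumes that tags appear on lines of
the form <code class="literal">Tag TYPE start end "comment"</code>; check this against your own files, as
the layout is an assumption of this example and the function name is made up.
</p>
```python
from collections import Counter

# Tags in the order of importance given in the list above.
PRIORITY = ["IUPc", "UNSc", "SRMc", "WRMc", "STMU", "STMS"]


def count_tags(caf_lines):
    """Count tag lines per tag type and report them in priority order.

    Assumption: a tag line starts with the word 'Tag' followed by the type.
    """
    counts = Counter()
    for line in caf_lines:
        fields = line.split()
        if len(fields) > 1 and fields[0] == "Tag" and fields[1] in PRIORITY:
            counts[fields[1]] += 1
    # Only report tag types that actually occur.
    return [(tag, counts[tag]) for tag in PRIORITY if counts[tag]]


example = [
    'Sequence : contig1',
    'Tag SRMc 120 120 "example"',
    'Tag IUPc 57 57 "example"',
    'Tag SRMc 300 300 "example"',
]
print(count_tags(example))
```
<p>
Both consensus and read tags are counted alike here, which is enough for a first overview.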
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_454_data"></a>15.2.
454 data
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_do_i_need_sffs_for?"></a>15.2.1.
What do I need SFFs for?
</h3></div></div></div><p>
</p><pre class="screen">
I need the .sff files for MIRA to load ...
</pre><p>
</p><p>
Nope, you don't, but it's a common misconception. MIRA does not load SFF
files, it loads FASTA, FASTA qualities, FASTQ, XML, CAF, EXP and PHD. The
reason why one should start from the SFF is: those files can be used to create
an XML file in TRACEINFO format. This XML contains the absolutely vital
clipping information for the 454 adaptors (the sequencing
vector of 454, if you want).
</p><p>
For 454 projects, MIRA will then load the FASTA, FASTA quality and the
corresponding XML. Or from CAF, if you have your data in CAF format.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what's_sff_extract_and_where_do_i_get_it?"></a>15.2.2.
What's sff_extract and where do I get it?
</h3></div></div></div><p>
</p><pre class="screen">
How do I extract the sequence, quality and other values from SFFs?
</pre><p>
</p><p>
Use the <span class="command"><strong>sff_extract</strong></span> script from Jose Blanca at the
University of Valencia to extract everything you need from the SFF
files (sequence, qualities and ancillary information). The home of
sff_extract is: <a class="ulink" href="http://bioinf.comav.upv.es/sff_extract/index.html" target="_top">http://bioinf.comav.upv.es/sff_extract/index.html</a> but I am
thankful to Jose for giving permission to distribute the script in the
MIRA 3rd party package (separate download).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_do_i_need_the_sfftools_from_the_roche_software_package?"></a>15.2.3.
Do I need the sfftools from the Roche software package?
</h3></div></div></div><p>
No, not anymore. Use the <span class="command"><strong>sff_extract</strong></span> script to
extract your reads. That said, the Roche sfftools package contains a few
additional utilities which could be useful.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_combining_sffs"></a>15.2.4.
Combining SFFs
</h3></div></div></div><p>
</p><pre class="screen">
I am trying to use MIRA to assemble reads obtained with the 454 technology
but I can't combine my sff files since I have two files obtained with GS20
system and 2 others obtained with the GS-FLX system. Since they use
different cycles (42 and 100) I can't use the sfffile to combine both.
</pre><p>
</p><p>
You do not need to combine SFFs before translating them into something
MIRA (or other software tools) understands. Use
<span class="command"><strong>sff_extract</strong></span> which extracts data from the SFF files
and combines this into input files.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_adaptors_and_pairedend_linker_sequences"></a>15.2.5.
Adaptors and paired-end linker sequences
</h3></div></div></div><p>
</p><pre class="screen">
I have no idea about the adaptor and the linker sequences, could you send me
the sequences please?
</pre><p>
</p><p>
Here are the sequences as filed by 454 in their patent application:
</p><pre class="screen">
>AdaptorA
CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG
>AdaptorB
CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
</pre><p>
</p><p>
However, looking through some earlier project data I had, I also retrieved the
following (by simply making a consensus of sequences that did not match the
target genome anymore):
</p><pre class="screen">
>5prime454adaptor???
GCCTCCCTCGCGCCATCAGATCGTAGGCACCTGAAA
>3prime454adaptor???
GCCTTGCCAGCCCGCTCAGATTGATGGTGCCTACAG
</pre><p>
</p><p>
Go figure, I have absolutely no idea where these come from as they also do not
comply with the "tcag" ending the adaptors should have.
</p><p>
I currently know one linker sequence (454/Roche also calls it <span class="emphasis"><em>spacer</em></span>)
for GS20 and FLX paired-end sequencing:
</p><pre class="screen">
>flxlinker
GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
</pre><p>
</p><p>
For Titanium data using standard Roche protocol, you need to screen for two
linker sequences:
</p><pre class="screen">
>titlinker1
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
>titlinker2
CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
</pre><p>
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Some sequencing labs modify the adaptor sequences for tagging and
similar things. Ask your sequencing provider for the exact adaptor
and/or linker sequences.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_do_i_get_in_pairedend_sequencing?"></a>15.2.6.
What do I get in paired-end sequencing?
</h3></div></div></div><p>
</p><pre class="screen">
Another question I have is does the read pair sequences have further
adaptors/vectors in the forward and reverse strands?
</pre><p>
</p><p>
As with normal 454 reads, the normal A and B adaptors can be present
in paired-end reads. In theory, this could look like this:
</p><p>
A-Adaptor - DNA1 - Linker - DNA2 - B-Adaptor.
</p><p>
It's possible that one of the two DNA fragments is *very* short or is missing
completely, then one has something like this:
</p><p>
A-Adaptor - DNA1 - Linker - B-Adaptor
</p><p>
or
</p><p>
A-Adaptor - Linker - DNA2 - B-Adaptor
</p><p>
And then there are all intermediate possibilities with the read not having one
of the two adaptors (or both). Though it appears that the majority of reads
will contain the following:
</p><p>
DNA1 - Linker - DNA2
</p><p>
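Splitting such a read into its two halves then comes down to locating the linker.
Here is a minimal Python sketch using the GS20/FLX linker quoted earlier in this
FAQ. It assumes an exact, error-free linker match, which real reads will often
not provide, so treat it as an illustration of the read layout rather than a
replacement for proper screening. The function name is made up for this example.
</p>
```python
# GS20 / FLX paired-end linker as given in this FAQ.
FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


def split_at_linker(read, linker=FLX_LINKER):
    """Return (dna1, dna2) around the first exact linker hit, or None.

    Assumption: the linker occurs without sequencing errors. Either
    part may be empty if one DNA fragment is missing from the read.
    """
    pos = read.upper().find(linker)
    if pos == -1:
        return None
    return read[:pos], read[pos + len(linker):]


print(split_at_linker("ACGTACGT" + FLX_LINKER + "TTGGCCAA"))
print(split_at_linker("ACGTACGT"))   # no linker in this read
```
<p>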
There is one caveat: according to current paired-end protocols, the sequences
will <span class="bold"><strong>NOT</strong></span> have the direction
</p><pre class="screen">
---> Linker <---
</pre><p>
as one might expect when being used to Sanger Sequencing, but rather in this
direction
</p><pre class="screen">
<--- Linker --->
</pre><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_sequencing_protocol"></a>15.2.7.
Sequencing protocol
</h3></div></div></div><p>
</p><pre class="screen">
Is there a way I can find out which protocol was used?
</pre><p>
</p><p>
Yes. The best thing to do is obviously to ask your sequencing provider.
</p><p>
If this is - for whatever reason - not possible, this list might help.
</p><p>
Are the sequences ~100-110 bases long? It's GS20.
</p><p>
Are the sequences ~220-250 bases long? It's FLX.
</p><p>
Are the sequences ~350-450 bases long? It's Titanium.
</p><p>
Do the sequences contain a linker
(GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC)? It's a paired end protocol.
</p><p>
If the sequences left and right of the linker are ~29bp, it's the old short
paired end (SPET, most probably from a GS20). If longer, it's long
paired-end (LPET, from a FLX).
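</p><p>
The length checks above can be turned into a small helper. The following Python
sketch uses the length ranges from this list and picks a typical read length as
the median; the function name and the choice of the median are assumptions of
this example, and the result is only a guess that your sequencing provider can
confirm.
</p>
```python
def guess_454_protocol(read_lengths):
    """Guess the 454 chemistry from a typical (median) read length.

    The ranges are the ones from the list above; anything outside
    them is reported as 'unknown'.
    """
    if not read_lengths:
        return "unknown"
    typical = sorted(read_lengths)[len(read_lengths) // 2]
    if typical in range(100, 111):
        return "GS20"
    if typical in range(220, 251):
        return "FLX"
    if typical in range(350, 451):
        return "Titanium"
    return "unknown"


print(guess_454_protocol([105, 102, 108]))
print(guess_454_protocol([230, 240, 225]))
```
<p>
The two calls above should report GS20 and FLX respectively.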
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_filtering_by_seqlen"></a>15.2.8.
Filtering sequences by length and re-assembly
</h3></div></div></div><pre class="screen">
I have two datasets of ~500K sequences each and the sequencing company
already did an assembly (using MIRA) on the basecalled and fully processed
reads (using of course the accompanying *qual file). Do you suggest that I
should redo the assembly after filtering out sequences being shorter than a
certain length (e.g. those that are <200bp)? In other words, am I taking into
account low quality sequences if I do the assembly the way the sequencing
company did it (fully processed reads + quality files)?
</pre><p>
I don't think that filtering out "shorter" reads will bring much
improvement. If the sequencing company used the standard
Roche/454 pipeline, the cut-offs for quality are already quite good:
the remaining sequences, even those shorter than 200bp, should not be
of bad quality, simply a bit shorter.
</p><p>
Worse, you might even introduce a bias when filtering out short
sequences: chemistry and library construction being what they are
(rather imprecise and sometimes problematic), some parts of DNA/RNA
yield smaller sequences per se ... and filtering those out might not
be the best move.
</p><p>
You might consider redoing the assembly if the company used a rather old
version of MIRA (older than 3.0.0 for sure, perhaps also older than 3.0.5).
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_solexa___illumina_data"></a>15.3.
Solexa / Illumina data
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_can_i_see_deletions?"></a>15.3.1.
Can I see deletions?
</h3></div></div></div><p>
</p><pre class="screen">
Suppose you ran the genome of a strain that had one or more large
deletions. Would it be clear from the data that a deletion had occurred?
</pre><p>
</p><p>
In the question above, I assume you'd compare your strain <span class="emphasis"><em>X</em></span> to a strain
<span class="emphasis"><em>Ref</em></span> and that <span class="emphasis"><em>X</em></span> had deletions compared to
<span class="emphasis"><em>Ref</em></span>. Furthermore, I base my answer on data sets I have seen, which
presently were 36 and 76 mers, paired and unpaired.
</p><p>
Yes, this would be clear. And it's a piece of cake with MIRA.
</p><p>
Short deletions (1 to 10 bases): they'll be tagged SROc or WRMc.
General rule: deletions of up to 10 to 12% of the length of your read should
be found and tagged without problem by MIRA, above that it may or may not,
depending a bit on coverage, indel distribution and luck.
</p><p>
Long deletions (longer than read length): they'll be tagged with the MCVc tag by
MIRA in the consensus. Additionally, when you run the CAF result through
miraconvert and look at the resulting FASTA files, long stretches of
sequence without coverage (the @ sign in the FASTAs) of <span class="emphasis"><em>X</em></span> show missing
genomic DNA.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_can_i_see_insertions?"></a>15.3.2.
Can I see insertions?
</h3></div></div></div><p>
</p><pre class="screen">
Suppose you ran the genome of a strain X that had a plasmid missing from the
reference sequence. Alternatively, suppose you ran a strain that had picked
up a prophage or mobile element lacking in the reference. Would that
situation be clear from the data?
</pre><p>
</p><p>
Short insertions (1 to 10 bases): they'll be tagged SROc or WRMc.
General rule: insertions of up to 10 to 12% of the length of your read should
be found and tagged without problem by MIRA, above that it may or may not,
depending a bit on coverage, indel distribution and luck.
</p><p>
Long insertions: it's a bit more work than for deletions. But if you ran a
de-novo assembly on all reads not mapped against your reference sequence,
chances are good you'd get good chunks of the additional DNA put together.
</p><p>
Once the Solexa paired-end protocol is completely rolled out and used on a
regular basis, you would even be able to place the additional element into the
genome (approximately).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_denovo_assembly_with_solexa_data"></a>15.3.3.
De-novo assembly with Solexa data
</h3></div></div></div><p>
</p><pre class="screen">
Any chance you could assemble de-novo the sequence of a from just the Solexa
data?
</pre><p>
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Highly opinionated answer ahead, your mileage may vary.
</td></tr></table></div><p>
Allow me to make a clear statement on this: maybe.
</p><p>
But the result would probably be nothing I would call a good
assembly. I'm highly sceptical of de-novo assembly with anything
below 76mers, i.e., with Solexa (or ABI SOLiD) reads
that are in the 30 to 50bp range. They're really too short for that;
even paired-end won't help you much (especially if you have library
sizes of just 200 or 500bp). Yes, there are papers describing
different draft assemblers (SHARCGS, EDENA, Velvet, Euler and others),
but at the moment the results are less than thrilling to me.
</p><p>
If a sequencing provider came to me with N50 numbers for an
<span class="emphasis"><em>assembled genome</em></span> in the 5-8 Kb range, I'd laugh
in his face. Or weep. I wouldn't dare to call this even
'draft'. I'd just call it junk.
</p><p>
On the other hand, this could be enough for some purposes like, e.g.,
getting a quick overview on the genetic baggage of a bug. Just don't
expect a finished genome.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_hybrid_assemblies"></a>15.4.
Hybrid assemblies
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_are_hybrid_assemblies?"></a>15.4.1.
What are hybrid assemblies?
</h3></div></div></div><p>
Hybrid assemblies are assemblies where more than one sequencing
technology was used, e.g., Sanger and 454, 454 and Solexa, or Sanger and
Solexa.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_differences_are_there_in_hybrid_assembly_strategies?"></a>15.4.2.
What differences are there in hybrid assembly strategies?
</h3></div></div></div><p>
Basically, one can choose two routes: multi-step or all-in-one-go.
</p><p>
Multi-step means: assemble the reads from one sequencing technology (ideally
the one with the shorter reads like, e.g., Solexa), fragment the resulting
contigs into pseudo-reads of the longer tech and assemble these with the real
reads from the longer tech (like, e.g., 454). The advantage of this approach
is that it will probably be considerably faster than the all-in-one-go approach. The
disadvantage is that you lose a lot of information when using only the consensus
sequence of the shorter read technology for the final assembly.
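</p><p>
The fragmentation step of the multi-step route could be sketched like this in
Python. This is an illustration only: the function name, the naming scheme of
the pseudo-reads and the default values for
<code class="literal">read_len</code> and <code class="literal">step</code> are
assumptions, not something MIRA provides or prescribes.
</p>

```python
def fragment_contig(name, consensus, read_len=400, step=200):
    """Cut a consensus into overlapping pseudo-reads.

    Returns a list of (pseudo_read_name, sequence) pairs. Consecutive
    pseudo-reads overlap by read_len - step bases; the last one may be shorter.
    """
    if not (step >= 1 and read_len >= step):
        raise ValueError("need 1 <= step <= read_len")
    if not consensus:
        return []
    pseudo_reads = []
    start = 0
    while True:
        chunk = consensus[start:start + read_len]
        # Name each pseudo-read after its 1-based start position in the contig.
        pseudo_reads.append(("%s_pseudo_%d" % (name, start + 1), chunk))
        if start + read_len >= len(consensus):
            break
        start += step
    return pseudo_reads


if __name__ == "__main__":
    for read_name, seq in fragment_contig("c1", "ACGTACGTAC", read_len=4, step=2):
        print(read_name, seq)
```

<p>
The overlap between pseudo-reads gives the second assembly step something to
align; how much overlap is sensible depends on the data.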
</p><p>
All-in-one-go means: use all reads in one single assembly. The advantage of
this is that the resulting alignment will be made of true reads with a maximum
of information contained to allow a really good finishing. The disadvantage is
that the assembly will take longer and will need more RAM.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_masking"></a>15.5.
Masking
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_should_i_mask?"></a>15.5.1.
Should I mask?
</h3></div></div></div><p>
</p><pre class="screen">
In EST projects, do you think that the highly repetitive option will get rid
of the repetitive sequences without going to the step of repeat masking?
</pre><p>
</p><p>
For eukaryotes, yes. Please also consult the [-KS:mnr] option.
</p><p>
Remember: you still <span class="bold"><strong>MUST</strong></span> have sequencing vectors and adaptors
clipped! In EST sequences the poly-A tails should also be clipped (or let
MIRA do it).
</p><p>
For prokaryotes, I'm a big fan of having a first look at unmasked data.
Just try to start MIRA without masking the data. After something like 30
minutes, the all-vs-all comparison algorithm should be through with a first
comparison round. grep the log for the term "megahub" ... if it doesn't
appear, you probably don't need to mask repeats.
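</p><p>
If you prefer a script over grep, a small Python equivalent could look like
the sketch below. The path of the log file is left to the caller because it
depends on your project; the function name is an assumption made for this
example.
</p>

```python
def count_megahub_lines(log_path):
    """Count the lines of a log file that contain the term 'megahub'."""
    count = 0
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if "megahub" in line:
                count += 1
    return count
```

<p>
A result of 0 corresponds to grep finding nothing.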
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_how_can_i_apply_custom_masking?"></a>15.5.2.
How can I apply custom masking?
</h3></div></div></div><p>
</p><pre class="screen">
I want to mask away some sequences in my input. How do I do that?
</pre><p>
</p><p>
First, if you want to have Sanger sequencing vectors (or 454 adaptor
sequences) "masked", please note that you should rather use ancillary data
files (CAF, XML or EXP) and use the sequencing or quality clip options there.
</p><p>
Second, please make sure you have read and understood the documentation for all
-CL parameters in the main manual, but especially -CL:mbc:mbcgs:mbcmfg:mbcmeg
as you might want to switch it on or off or set different values depending on
your pipeline and on your sequencing technology.
</p><p>
You can mix your normal repeat masking pipeline with the FASTA
or EXP input for MIRA without problems, as long as you <span class="bold"><strong>mask</strong></span> and not <span class="bold"><strong>clip</strong></span> the
sequence.
</p><p>
An example:
</p><pre class="screen">
>E09238ARF0
tcag GTGTCAGTGTTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
</pre><p>
</p><p>
(spaces inserted just as visual helper in the example sequence, they would not
occur in the real stuff)
</p><p>
The XML will contain the following clippings:
left clip = 4 (clipping away the "tcag", which are the last four bases of the
adaptor used by Roche) and
right clip = ~90 (clipping away the "tgctgac..." lower-case sequence on the
right side of the sequence above).
</p><p>
Now, on the FASTA file that was generated with reads_sff.py or with the Roche
sff* tools, you can run, e.g., a repeat masker. The result could look like
this:
</p><pre class="screen">
>E09238ARF0
tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
</pre><p>
</p><p>
The part with the Xs was masked away by your repeat masker. Now, when MIRA
loads the FASTA, it will first apply the clippings from the XML file (they're
still the same). Then, if the option to clip away masked areas of a read is
switched on (-CL:mbc, which is normally on for EST projects), it will search for the
stretches of X and internally also put clips on the sequence. In the example
above, only the following sequence would remain as "working sequence" (the
clipped parts would still be present, but not used for any computation):
</p><pre class="screen">
>E09238ARF0
...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT........................
</pre><p>
</p><p>
Here you can also see the reason why your filters should <span class="bold"><strong>mask</strong></span> and not
clip the sequence. If you change the length of the sequence, the clips in the
XML would not be correct anymore, wrong clippings would be made, wrong
sequence reconstructed, chaos ensues and the world would ultimately end. Or
something.
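</p><p>
The difference can be made concrete with a few lines of Python. This is only
an illustration on a shortened version of the example read; the two helper
functions are hypothetical and have nothing to do with MIRA's internals.
</p>

```python
def mask_region(sequence, start, end, mask_char="X"):
    """Mask the 1-based, inclusive region [start, end]; the length is unchanged."""
    return sequence[:start - 1] + mask_char * (end - start + 1) + sequence[end:]


def cut_region(sequence, start, end):
    """Remove the same region entirely; everything to its right shifts left."""
    return sequence[:start - 1] + sequence[end:]


read = "tcagGTGTCAGTGTTGACTG"          # first 20 bases of the example read
masked = mask_region(read, 5, 13)      # what a masking filter produces
cut = cut_region(read, 5, 13)          # what a clipping filter would produce

if __name__ == "__main__":
    print(len(read), masked)   # same length, XML clip positions still valid
    print(len(cut), cut)       # shorter, XML clip positions now point elsewhere
```

<p>
After masking, every coordinate in the XML still refers to the same base; after
cutting, every coordinate to the right of the cut is off by the number of
removed bases.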
</p><p>
<span class="bold"><strong>IMPORTANT!</strong></span> It might be that you do not want MIRA to merge the masked
part of your sequence with a left or right clip, but that you want to keep it
as something like DNA - masked part - DNA. In this case, consult the manual for
the -CL:mbc switch, and either switch it off or set adequate options for the
boundaries and gap sizes.
</p><p>
Now, if you look at the sequence above, you will see two possible poly-A
tails ... at least the real poly-A tail should be masked else you will get
megahubs with all the other reads having the poly-A tail.
</p><p>
You have two possibilities: you mask it yourself with your own program or you let
MIRA do the job (-CL:cpat, which should normally be on for EST projects but I
forgot to set the correct switch in the versions prior to 2.9.26x3, so you
need to set it manually for 454 EST projects there).
</p><p>
<span class="bold"><strong>IMPORTANT!</strong></span> Never use two poly-A tail maskers (your own and
the one from MIRA): you would risk masking too much. Example: assume you masked the above
read with a poly-A masker. The result would very probably look like
this:
</p><pre class="screen">
>E09238ARF0
tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTAC XXXXXXXXXXXXXXXXXXXX GTACGT tgctgacgcacatgatcgtagc
</pre><p>
</p><p>
And MIRA would internally make the following out of it after loading:
</p><pre class="screen">
>E09238ARF0
...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTAC..................................................
</pre><p>
</p><p>
and then apply the internal poly-A tail masker:
</p><pre class="screen">
>E09238ARF0
...............TTGACTGT................................................
..........................................................
</pre><p>
</p><p>
You'd be left with ... well, a fragment of your sequence.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_miscellaneous"></a>15.6.
Miscellaneous
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_are_megahubs?"></a>15.6.1.
What are megahubs?
</h3></div></div></div><p>
</p><pre class="screen">
I looked in the log file and that term "megahub" you told me about appears
pretty much everywhere. First of all, what does it mean?
</pre><p>
</p><p>
Megahub is MIRA's internal term for a read that is massively repetitive
with respect to the other reads of the project, i.e., a read that is a
megahub connects to an insane number of other reads.
</p><p>
This is a clear sign that something is wrong. Or that you have a quite
repetitive eukaryote. But most of the time it's sequencing vectors
(Sanger), A and B adaptors or paired-end linkers (454), unmasked
poly-A signals (EST) or non-normalised EST libraries which contain
high amounts of housekeeping genes (always the same or nearly the
same).
</p><p>
Countermeasures to take are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
set clips for the sequencing vectors (Sanger) or Adaptors (454)
either in the XML or EXP files
</p></li><li class="listitem"><p>
for ESTs, mask poly-A in your input data (or let MIRA do it with the
-CL:cpat parameter)
</p></li><li class="listitem"><p>
only after the above steps have been made, use
the [-KS:mnr] switch to let mira automatically mask nasty
repeats, adjust the threshold with [-SK:rt].
</p></li><li class="listitem"><p>
if everything else fails, filter out or mask sequences yourself in the
input data that come from housekeeping genes or nasty repeats.
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_passes_and_loops"></a>15.6.2.
Passes and loops
</h3></div></div></div><p>
</p><pre class="screen">
While processing some contigs with repeats i get
"Accepting probably misassembled contig because of too many iterations."
What is this?
</pre><p>
</p><p>
That's quite normal in the first few passes of an assembly. During each pass
(-AS:nop), contigs get built one by one. After a contig has been finished, it
checks itself whether it can find misassemblies due to repeats (and marks
these internally). If no misassembly, perfect, build next contig. But if yes,
the contig requests immediate re-assembly of itself.
</p><p>
But this can happen only a limited number of times (governed by -AS:rbl). If
there are still misassemblies, the contig is stored away anyway ... chances
are good that in the next full pass of the assembler, enough knowledge has
been gained to correctly place the reads.
</p><p>
So, you need to worry only if these messages still appear during the last
pass. The positions that cause this are marked with "SRMc" tags in the
assemblies (CAF, ACE in the result dir; and some files in the info dir).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_debris"></a>15.6.3.
Debris
</h3></div></div></div><p>
</p><pre class="screen">
What are the debris composed of?
</pre><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
sequences too short (after trimming)
</p></li><li class="listitem"><p>
megahubs
</p></li><li class="listitem"><p>
sequences almost completely masked by the nasty repeat masker
([-KS:mnr])
</p></li><li class="listitem"><p>
singlets, i.e., reads that after an assembly pass did not align
into any contig (or were rejected from every contig).
</p></li><li class="listitem"><p>
sequences that form a contig with fewer reads than defined by
[-AS:mrpc]
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_tmpf_files:_more_info_on_what_happened_during_the_assembly"></a>15.6.4.
Log and temporary files: more info on what happened during the assembly
</h3></div></div></div><p>
</p><pre class="screen">
I do not understand why ... happened. Is there a way to find out?
</pre><p>
Yes. The tmp directory contains, beside temporary data, a number of
log files with more or less readable information. While development
versions of MIRA keep this directory after finishing, production
versions normally delete this directory after an assembly. To keep the
logs and temporary files also in production versions, use
"-OUT:rtd=no".
</p><p>
As MIRA also tries to save as much disk space as possible, some logs
and temporary files are rotated (which means that old logs and tmps
get deleted). To switch off this behaviour, use
"-OUT:rrot=no". Beware, the size of the tmp directory will increase,
sometimes dramatically so.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_faq_sequence_clipping_after_load"></a>15.6.4.1.
Sequence clipping after load
</h4></div></div></div><p>
How MIRA clipped the reads after loading them can be found in the file
<code class="filename">mira_int_clippings.0.txt</code>. The entries look like this:
</p><pre class="screen">
load: minleft. U13a01d05.t1 Left: 11 -> 30
</pre><p>
Interpret this as: after loading, the read "U13a01d05.t1" had a left clipping
of eleven. The "minleft" clipping option of MIRA did not like it and set it to
30.
</p><pre class="screen">
load: bad seq. gnl|ti|1133527649 Shortened by 89 New right: 484
</pre><p>
</p><p>
Interpret this as: after loading, the read "gnl|ti|1133527649" was checked
with the "bad sequence search" clipping algorithm which determined that there
apparently is something dubious, so it shortened the read by 89 bases, setting
the new right clip to position 484.
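</p><p>
If you want to summarise many such entries, a small parser can help. The
Python sketch below handles only the two entry types shown above, and its
regular expressions are inferred from these two sample lines alone; other
entry types in the file are simply skipped (the function returns
<code class="literal">None</code> for them).
</p>

```python
import re

# Patterns inferred from the two sample lines only (an assumption).
MINLEFT = re.compile(r"^load:\s+minleft\.\s+(\S+)\s+Left:\s+(\d+)\s+->\s+(\d+)")
BADSEQ = re.compile(
    r"^load:\s+bad seq\.\s+(\S+)\s+Shortened by\s+(\d+)\s+New right:\s+(\d+)"
)


def parse_clipping_line(line):
    """Parse one line of the clipping log; return a dict or None."""
    line = line.strip()
    m = MINLEFT.match(line)
    if m:
        return {
            "kind": "minleft",
            "read": m.group(1),
            "old_left": int(m.group(2)),
            "new_left": int(m.group(3)),
        }
    m = BADSEQ.match(line)
    if m:
        return {
            "kind": "bad seq",
            "read": m.group(1),
            "shortened_by": int(m.group(2)),
            "new_right": int(m.group(3)),
        }
    return None
```

<p>
Feeding every line of the file through this function and counting the entries
per read gives a quick overview of which reads were clipped and why.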
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_platforms_and_compiling"></a>15.7.
Platforms and Compiling
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_windows"></a>15.7.1.
Windows
</h3></div></div></div><p>
</p><pre class="screen">
Also, is MIRA be available on a windows platform?
</pre><p>
</p><p>
As a matter of fact: it was and may be again. While I haven't done it myself,
according to reports I got, compiling MIRA 2.9.3* in a Cygwin environment was
actually painless. But since then BOOST and multi-threading have been included
and I am not sure whether it is still as easy.
</p><p>
I'd be thankful for reports :-)
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_maf"></a>Chapter 16. The MAF format</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_introduction:_why_an_own_assembly_format?">16.1.
Introduction: why an own assembly format?
</a></span></dt><dt><span class="sect1"><a href="#sect1_the_maf_format">16.2.
The MAF format
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_basics">16.2.1.
Basics
</a></span></dt><dt><span class="sect2"><a href="#sect2_reads">16.2.2.
Reads
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect3_simple_example">16.2.2.1.
Simple example
</a></span></dt><dt><span class="sect3"><a href="#sect3_list_of_records_for_reads">16.2.2.2.
List of records for reads
</a></span></dt><dt><span class="sect3"><a href="#sect3_interpreting_clipping_values">16.2.2.3.
Interpreting clipping values
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect2_contigs">16.2.3.
Contigs
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect3_simple_example_2">16.2.3.1.
Simple example 2
</a></span></dt><dt><span class="sect3"><a href="#sect3_list_of_records_for_contigs">16.2.3.2.
List of records for contigs
</a></span></dt></dl></dd></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Design flaws travel in herds.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><p>
This document describes the purpose and format of the MAF format, version
1, which has been superseded by version 2. Version 2 is not described here
(yet), but as v1 and v2 are very similar (only the notion of readgroups is
a big change), I'll let this description live until I have time to update
this section.
</p><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_introduction:_why_an_own_assembly_format?"></a>16.1.
Introduction: why an own assembly format?
</h2></div></div></div><p>
I had been on the hunt for some time for a file format that allows MIRA to
quickly save and load reads and full assemblies. There are currently a number
of alignment file formats on the market and MIRA can read and/or write most of
them. Why not take one of these? It turned out that all (well, the ones I
know: ACE, BAF, CAF, CALF, EXP, FRG) have some kind of no-go 'feature' (or problem
or bug) that makes one's life pretty difficult if one wants to write or parse
that given file format.
</p><p>
What I needed for MIRA was a format that:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
is easy to parse
</p></li><li class="listitem"><p>
is quick to parse
</p></li><li class="listitem"><p>
contains all needed information of an assembly that MIRA and many
finishing programs use: reads (with sequence and qualities) and contigs,
tags etc.
</p></li></ol></div><p>
</p><p>
MAF is not the format with the smallest possible footprint (though it fares quite
well in comparison to ACE, CAF and EXP), but as it's meant as an interchange format,
it'll do. It can be easily indexed and does not need string lookups during
parsing.
</p><p>
I took the liberty to combine many good ideas from EXP, BAF, CAF and FASTQ
while defining the format and if anything is badly designed, it's all my
fault.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_the_maf_format"></a>16.2.
The MAF format
</h2></div></div></div><p>
This describes version 1 of the MAF format. If the need arises, enhancements
like meta-data about total number of contigs and reads will be implemented in the
next version.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_basics"></a>16.2.1.
Basics
</h3></div></div></div><p>
MAF ...
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
... has for each record a keyword at the beginning of the line, followed
by exactly one blank (a space or a tab), then followed by the values for
this record. At the moment keywords are two character keywords, but keywords
with other lengths might appear in the future
</p></li><li class="listitem"><p>
... is strictly line oriented. Each record is terminated by a newline,
no record spans across lines.
</p></li></ol></div><p>
</p><p>
All coordinates start at 1, i.e., there is no 0 value for coordinates.
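</p><p>
Because of these two rules, splitting a MAF line into keyword and values needs
no tokenizer at all. A minimal Python sketch (the function name is an
assumption made for this example):
</p>

```python
def split_record(line):
    """Split one MAF line into (keyword, values).

    The keyword ends at the first blank (space or tab); everything after
    that single blank is returned unchanged. Lines without a blank, such
    as a lone ER, yield an empty values string.
    """
    line = line.rstrip("\r\n")
    for i, ch in enumerate(line):
        if ch in " \t":
            return line[:i], line[i + 1:]
    return line, ""


if __name__ == "__main__":
    print(split_record("RD U13a05e07.t1\n"))
    print(split_record("ER\n"))
```

<p>
Not hard-coding a keyword length of two keeps the function usable should longer
keywords appear later.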
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_reads"></a>16.2.2.
Reads
</h3></div></div></div><p>
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_simple_example"></a>16.2.2.1.
Simple example
</h4></div></div></div><p>
Here's an example for a simple read, just the read name and the sequence:
</p><pre class="screen">
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
ER
</pre><p>
</p><p>
Reads start with RD and end with ER, the RD keyword is always followed by the
name of the read, ER stands on its own. Reads also should contain a sequence
(RS). Everything else is optional. In the following example, the read has
additional quality values (RQ), template definitions (name in TN, minimum and
maximum insert size in TF and TT), a pointer to the file with the raw data (SF),
a left clip which covers sequencing vector or adaptor sequence (SL), a left
clip covering low quality (QL), a right clip covering low quality (QR), a
right clip covering sequencing vector or adaptor sequence (SR), alignment to
original sequence (AO), a tag (RT) and the sequencing technology it was
generated with (ST).
</p><pre class="screen">
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
RQ ,-+*,1-+/,36;:6<=3327<7A1/,,).('..7=@E8:
TN U13a05e07
DI F
TF 1200
TT 1800
SF U13a05e07.t1.scf
SL 4
QL 7
QR 30
SR 32
AO 1 40 1 40
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
</pre><p>
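A reader for such read blocks can be sketched in a few lines of Python. This
is an illustration, not MIRA's own parser: the function name and the choice to
collect AO and RT values in lists (because these records may occur several
times per read) are assumptions of this example.
</p>

```python
def parse_reads(lines):
    """Yield one dict per RD ... ER block found in an iterable of MAF lines."""
    repeatable = {"AO", "RT"}  # records that may occur more than once per read
    read = None
    for raw in lines:
        parts = raw.rstrip("\r\n").split(None, 1)
        if not parts:
            continue  # skip empty lines
        keyword = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if keyword == "RD":
            read = {"RD": value}
        elif read is None:
            continue  # not inside a read, e.g. a contig-level record
        elif keyword == "ER":
            yield read
            read = None
        elif keyword in repeatable:
            read.setdefault(keyword, []).append(value)
        else:
            read[keyword] = value
```

<p>
All values are kept as strings; converting clips or template sizes to integers
is left to the caller.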
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_list_of_records_for_reads"></a>16.2.2.2.
List of records for reads
</h4></div></div></div><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
RD <span class="emphasis"><em>string: readname</em></span>
</p><p> RD followed by the read name starts a read.
</p></li><li class="listitem"><p>
LR <span class="emphasis"><em>integer: read length</em></span>
</p><p>
The length of the read can be given optionally in LR. This is
meant to help the parser perform sanity checks and possibly
pre-allocate memory for sequence and quality.
</p><p>
MIRA at the moment only writes LR lines for reads with more than
2000 bases.
</p></li><li class="listitem"><p>
RS <span class="emphasis"><em>string: DNA sequence</em></span>
</p><p> Sequence of a read is stored in RS.
</p></li><li class="listitem"><p>
RQ <span class="emphasis"><em>string: qualities</em></span>
</p><p> Qualities are stored in FASTQ format, i.e., each quality
value + 33 is written as a single ASCII character.
</p></li><li class="listitem"><p>
SV <span class="emphasis"><em>string: sequencing vector</em></span>
</p><p> Name of the sequencing vector or
adaptor used in this read.
</p></li><li class="listitem"><p>
TN <span class="emphasis"><em>string: template name</em></span>
</p><p> Template name. This defines the DNA template a sequence
comes from. In its simplest form, a DNA template is sequenced
only once. In paired-end sequencing, a DNA template is sequenced
once in forward and once in reverse direction (Sanger, 454,
Solexa). In Sanger sequencing, several forward and/or reverse
reads can be sequenced from a DNA template. In PacBio sequencing,
a DNA template can be sequenced in several "strobes", leading to
multiple reads on a DNA template.
</p></li><li class="listitem"><p>
DI <span class="emphasis"><em>character: F or R</em></span>
</p><p> Direction of the read with respect to the
template. F for forward, R for reverse.
</p></li><li class="listitem"><p>
TF <span class="emphasis"><em>integer: template size from</em></span>
</p><p> Minimum estimated
size of a sequencing template. In paired-end sequencing, this is the minimum
distance of the read pair.
</p></li><li class="listitem"><p>
TT <span class="emphasis"><em>integer: template size to</em></span>
</p><p> Maximum estimated
size of a sequencing template. In paired-end sequencing, this is the maximum
distance of the read pair.
</p></li><li class="listitem"><p>
SF <span class="emphasis"><em>string: sequencing file</em></span>
</p><p> Name of the sequencing file which
contains raw data for this read.
</p></li><li class="listitem"><p>
SL <span class="emphasis"><em>integer: seqvec left</em></span>
</p><p>
Clip left due to sequencing vector. Assumed to be 1 if not
present. Note that left clip values are excluding, e.g.: a value
of '7' clips off the left 6 bases.
</p></li><li class="listitem"><p>
QL <span class="emphasis"><em>integer: qual left</em></span>
</p><p>
Clip left due to low quality. Assumed to be 1 if not
present. Note that left clip values are excluding, e.g.: a value
of '7' clips off the left 6 bases.
</p></li><li class="listitem"><p>
CL <span class="emphasis"><em>integer: clip left</em></span>
</p><p>
Clip left (any reason). Assumed to be 1 if not present. Note
that left clip values are excluding, e.g.: a value of '7' clips
off the left 6 bases.
</p></li><li class="listitem"><p>
SR <span class="emphasis"><em>integer: seqvec right</em></span>
</p><p> Clip right due to sequencing
vector. Assumed to be the length of the sequence if not present. Note that
right clip values are including, e.g., a value of '10' leaves the bases 1 to
9 and clips at and including base 10 and higher.
</p></li><li class="listitem"><p>
QR <span class="emphasis"><em>integer: qual right</em></span>
</p><p> Clip right due to low quality. Assumed
to be the length of the sequence if not present. Note that right clip values
are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at
and including base 10 and higher.
</p></li><li class="listitem"><p>
CR <span class="emphasis"><em>integer: clip right</em></span>
</p><p> Clip right (any reason). Assumed to be
the length of the sequence if not present. Note that
right clip values are including, e.g., a value of '10' leaves the bases 1 to
9 and clips at and including base 10 and higher.
</p></li><li class="listitem"><p>
AO <span class="emphasis"><em>four integers: x1 y1 x2 y2</em></span>
</p><p> AO stands for "Align to
Original". The interval [x1 y1] in the read as stored in the MAF file aligns
with [x2 y2] in the original, unedited read sequence. This makes it possible to model
insertions and deletions in the read and still be able to find the correct
position in the original, base-called sequence data.
</p><p> A read can have
several AO lines which together define all the edits performed to this
read.
</p><p> Assumed to be "1 x 1 x" if not present, where 'x' is the length of
the unclipped sequence.
</p></li><li class="listitem"><p>
RT <span class="emphasis"><em>string + 2 integers + optional string: type x1 y1 comment</em></span>
</p><p> Read tags are given by naming the tag type, which positions
in the read the tag spans in the interval [x1 y1] and afterwards
optionally a comment. As MAF is strictly line oriented, newline
characters in the comment are encoded
as <code class="literal">\n</code>.
</p><p> If x1 > y1, the tag is in reverse direction.
</p><p>
The tag type can be a free form string, though MIRA will
recognise and work with tag types used by the Staden gap4
package (and of course the MIRA tags as described in the main
documentation of MIRA).
</p></li><li class="listitem"><p>
ST <span class="emphasis"><em>string: sequencing technology</em></span>
</p><p> The following technologies
can currently be defined: Sanger, 454, Solexa, SOLiD.
</p></li><li class="listitem"><p>
SN <span class="emphasis"><em>string: strain name</em></span>
</p><p> Strain name of the sample that was
sequenced, this is a free form string.
</p></li><li class="listitem"><p>
MT <span class="emphasis"><em>string: machine type</em></span>
</p><p> Machine type which generated the data,
this is a free form string.
</p></li><li class="listitem"><p>
BC <span class="emphasis"><em>string: base caller</em></span>
</p><p>
Base calling program used to call the bases.
</p></li><li class="listitem"><p>
IB <span class="emphasis"><em>boolean (0 or 1): is backbone</em></span>
</p><p> Whether the read is a backbone. Reads used as reference
(backbones) in mapping assemblies get this attribute.
</p></li><li class="listitem"><p>
IC <span class="emphasis"><em>boolean (0 or 1)</em></span>
</p><p> Whether the read is a coverage equivalent
read (e.g. from mapping Solexa). This is internal to MIRA.
</p></li><li class="listitem"><p>
IR <span class="emphasis"><em>boolean (0 or 1)</em></span>
</p><p> Whether the read is a rail. This also is
internal to MIRA.
</p></li><li class="listitem"><p>
ER
</p><p> This ends a read and is mandatory.
</p></li></ul></div><p>
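As an illustration of the RQ record described in the list above, the quality
string can be converted to numbers and back with the usual FASTQ offset of 33.
The two function names are assumptions made for this Python sketch.
</p>

```python
def decode_qualities(rq):
    """Turn an RQ string into a list of integer quality values."""
    return [ord(ch) - 33 for ch in rq]


def encode_qualities(values):
    """Turn a list of integer quality values into an RQ string."""
    return "".join(chr(q + 33) for q in values)


if __name__ == "__main__":
    # First four characters of the RQ line in the example read.
    print(decode_qualities(",-+*"))
```

<p>
A parser can use the decoded list for a simple sanity check: it must have
exactly as many entries as the RS sequence has bases.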
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_interpreting_clipping_values"></a>16.2.2.3.
Interpreting clipping values
</h4></div></div></div><p>
Every left and right clipping pair (SL & SR, QL & QR, CL & CR) forms a clear
range in the interval [left right[ in the sequence of a read. E.g. a read with
SL=4 and SR=10 has the bases 1,2,3 clipped away on the left side, the bases
4,5,6,7,8,9 as clear range and the bases 10 and following clipped away on the
right side.
</p><p>
The left clip of a read is determined as max(SL,QL,CL) (the rightmost left
clip) whereas the right clip is min(SR,QR,CR).
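</p><p>
The two rules above translate directly into code. The following Python sketch
follows the text literally, including the stated defaults (1 for missing left
clips, the sequence length for missing right clips); the function name and the
lower-case parameter names are assumptions of this example.
</p>

```python
def clear_range(seq_len, sl=1, ql=1, cl=1, sr=None, qr=None, cr=None):
    """Return the clear range as 1-based, inclusive (first, last) bases.

    The left clip is the first base kept, the right clip is the first base
    clipped again, so the clear range is the half-open interval [left right[.
    Returns None if no base is left.
    """
    left = max(sl, ql, cl)  # the rightmost left clip wins
    # Missing right clips default to the sequence length, as stated above.
    right = min(seq_len if v is None else v for v in (sr, qr, cr))
    if left >= right:
        return None
    return left, right - 1


if __name__ == "__main__":
    # The example from the text: SL=4 and SR=10 leave bases 4 to 9.
    print(clear_range(40, sl=4, sr=10))
```

<p>
With several clip pairs present, the most restrictive value on each side
decides.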
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_contigs"></a>16.2.3.
Contigs
</h3></div></div></div><p>
Contigs are not much more than containers containing reads with some
additional information. Contrary to CAF or ACE, MAF does not first store all reads in
single containers and then define the contigs. In MAF, contigs are defined as
outer container and within those, the reads are stored like normal reads.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_simple_example_2"></a>16.2.3.1.
Simple example 2
</h4></div></div></div><p>
The above example for a read can be encased in a contig like this (with two
consensus tags gratuitously added in):
</p><pre class="screen">
CO contigname_s1
NR 1
LC 24
CS TGCCTGCAGGTCGACTCTAGAAGG
CQ -+/,36;:6<=3327<7A1/,,).
CT COMM 5 8 Some comment to this consensus tag.
CT COMM 7 12 Another comment to this consensus tag.
\\
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
RQ ,-+*,1-+/,36;:6<=3327<7A1/,,).('..7=@E8:
TN U13a05e07
TF 1200
TT 1800
SF U13a05e07.t1.scf
SL 4
SR 32
QL 7
QR 30
AO 1 40 1 40
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
AT 1 24 7 30
//
EC
</pre><p>
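The AT line of this example maps the read interval [7 30] onto the contig
interval [1 24]. A minimal Python sketch of that mapping follows. The function
name is hypothetical, and the handling of the reverse case (x1 > y1) is an
assumption derived from the description of the AT record, not taken from
MIRA's source.
</p><pre class="screen">
```python
def read_to_contig(at, read_pos):
    """Map a 1-based read position to a contig position via an AT line.

    `at` is the tuple (x1, y1, x2, y2) of the AT record. Assumption of
    this sketch: positions map one-to-one inside the aligned interval,
    and for x1 > y1 the read runs backwards along the contig.
    """
    x1, y1, x2, y2 = at
    if read_pos not in range(x2, y2 + 1):
        raise ValueError("read position outside the aligned interval")
    offset = read_pos - x2
    if x1 > y1:              # reverse complement of the read is aligned
        return x1 - offset
    return x1 + offset


at = (1, 24, 7, 30)                      # "AT 1 24 7 30" from the example
print(read_to_contig(at, 7))             # 1
print(read_to_contig(at, 30))            # 24

reverse = (24, 1, 7, 30)                 # hypothetical reverse placement
print(read_to_contig(reverse, 7))        # 24
print(read_to_contig(reverse, 30))       # 1
```
</pre><p>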
</p><p>
Note that the read shown previously (and now encased in a contig) is
absolutely unchanged. It has just been complemented with a bit of data which
describes the contig as well as with a one liner which places the read into
the contig.
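</p><p>
To illustrate the nesting, here is a minimal Python sketch of a reader for one
contig block. It is not a complete MAF parser: the function names and the
dictionary layout are hypothetical, record key and value are assumed to be
separated by whitespace, and the input is a shortened, made-up contig rather
than the example above.
</p><pre class="screen">
```python
def parse_maf_contig(lines):
    """Parse one CO ... EC block into a nested dictionary.

    Sketch only: record key and value are assumed to be separated by
    whitespace, and repeated read records (e.g. tags) are not collected.
    """
    contig = {"tags": [], "reads": []}
    read = None                      # the read currently being filled
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue                 # skip blank lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "CO":
            contig["name"] = value
        elif key in ("NR", "LC"):
            contig[key] = int(value)
        elif key in ("CS", "CQ"):
            contig[key] = value
        elif key == "CT":
            contig["tags"].append(value)
        elif key == "RD":
            read = {"name": value}
        elif key == "ER":
            contig["reads"].append(read)
        elif key == "AT":
            # AT follows ER directly, so it belongs to the last read stored
            contig["reads"][-1]["AT"] = tuple(int(x) for x in value.split())
        elif key in (r"\\", "//", "EC"):
            read = None              # structural markers carry no data
        elif read is not None:
            read[key] = value        # RS, RQ, SL, SR, ... kept as strings
    return contig


def phred_scores(quality_string):
    """Decode FASTQ-style qualities: ASCII code minus 33."""
    return [ord(c) - 33 for c in quality_string]


# Shortened, made-up contig in the layout of the example above.
example = r"""
CO demo_contig
NR 1
LC 4
CS ACGT
CQ II5+
\\
RD demo_read
RS TTACGTAA
RQ IIII5+II
ER
AT 1 4 3 6
//
EC
""".splitlines()

contig = parse_maf_contig(example)
# LC must match the lengths of CS and CQ; NR should match the read count.
assert contig["LC"] == len(contig["CS"]) == len(contig["CQ"])
assert contig["NR"] == len(contig["reads"])
print(contig["name"], contig["reads"][0]["AT"])   # demo_contig (1, 4, 3, 6)
print(phred_scores(contig["CQ"]))                 # [40, 40, 20, 10]
```
</pre><p>
Keeping the AT tuple on the read it follows mirrors the file layout, where the
AT line comes directly after the closing ER.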
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_list_of_records_for_contigs"></a>16.2.3.2.
List of records for contigs
</h4></div></div></div><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
CO <span class="emphasis"><em>string: contig name</em></span>
</p><p> CO starts a contig, the contig name
behind is mandatory but can be any string, including numbers.
</p></li><li class="listitem"><p>
NR <span class="emphasis"><em>integer: num reads in contig</em></span>
</p><p> This is optional but highly
recommended.
</p></li><li class="listitem"><p>
LC <span class="emphasis"><em>integer: contig length</em></span>
</p><p> Note that this length defines the length of the 'clear
range' of the consensus. It must be exactly equal to the length of the CS
(sequence) and CQ (quality) strings below.
</p></li><li class="listitem"><p>
CT <span class="emphasis"><em>string + 2 integers + optional string: identifier
x1 y1 comment</em></span>
</p><p> Consensus tags are defined like read tags but apply to the
consensus. Here too, the interval [x1 y1] is inclusive, and if x1 > y1, the tag
is in reverse direction.
</p></li><li class="listitem"><p>
CS <span class="emphasis"><em>string: consensus sequence</em></span>
</p><p> The sequence of the consensus, stored in the same way as RS stores the sequence of a read.
</p></li><li class="listitem"><p>
CQ <span class="emphasis"><em>string: qualities</em></span>
</p><p> Consensus qualities are stored in FASTQ
format, i.e., each quality value + 33 is written as a single ASCII character.
</p></li><li class="listitem"><p>
\\
</p><p> This marks the start of read data of this contig. After
this, all reads are stored one after the other, just separated by
an "AT" line (see below).
</p></li><li class="listitem"><p>
AT <span class="emphasis"><em>Four integers: x1 y1 x2 y2</em></span>
</p><p> The AT (Assemble_To) line defines the placement of the read
in the contig and immediately follows the closing "ER" of a read
so that parsers do not need to perform time-consuming string
lookups. Every read in a contig has exactly one AT line.
</p><p> The interval
[x2 y2] of the read (i.e., the unclipped data, also called the 'clear range')
aligns with the interval [x1 y1] of the contig. If x1 > y1 (the contig
positions), then the reverse complement of the read is aligned to the
contig. For the read positions, x2 is always < y2.
</p></li><li class="listitem"><p>
//
</p><p> This marks the end of read data
</p></li><li class="listitem"><p>
EC
</p><p> This ends a contig and is mandatory
</p></li></ul></div></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_logfiles"></a>Chapter 17. Log and temporary files used by MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_logf_introduction">17.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect1_logf_the_files">17.2.
The files
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_logf_mira_error_reads_invalid">17.2.1.
mira_error_reads_invalid
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_info_reads_tooshort">17.2.2.
mira_info_reads_tooshort
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_alignextends_preassembly10txt">17.2.3.
mira_int_alignextends_preassembly1.0.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_clippings0txt">17.2.4.
mira_int_clippings.0.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_megahubs_passxlst">17.2.5.
mira_int_posmatch_megahubs_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_multicopystat_preassembly0txt">17.2.6.
mira_int_posmatch_multicopystat_preassembly.0.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_rawhashhits_passxlst">17.2.7.
mira_int_posmatch_rawhashhits_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_skimmarknastyrepeats_hist_passxlst">17.2.8.
mira_int_skimmarknastyrepeats_hist_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_skimmarknastyrepeats_nastyseq_passxlst">17.2.9.
mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_vectorclip_passxtxt">17.2.10.
mira_int_vectorclip_pass.X.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpads_passxforward_and_miratmpads_passxcomplement">17.2.11.
miratmp.ads_pass.X.forward and miratmp.ads_pass.X.complement
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpads_passxreject">17.2.12.
miratmp.ads_pass.X.reject
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpnoqualities">17.2.13.
miratmp.noqualities
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpusedids">17.2.14.
miratmp.usedids
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_readpoolinfolst">17.2.15.
mira_readpoolinfo.lst
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">The amount of entropy in the universe is constant - except when it increases.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_logf_introduction"></a>17.1.
Introduction
</h2></div></div></div><p>
The tmp directory used by mira (usually
<code class="filename"><projectname>_d_tmp</code>) may contain a number of
files with information which could be interesting for other uses than
the pure assembly. This guide gives a short overview.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This guide is probably the least complete and most out-of-date as it is
updated only very infrequently. If in doubt, ask on the MIRA talk
mailing list.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Please note that the format of these files may change over time,
although I try very hard to keep changes reduced to a minimum.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Remember that mira has two options that control whether log and
temporary files get deleted: while [-OUT:rtd] removes the
complete tmp directory after an assembly, [-OUT:rrot] removes
only those log and temporary files which are not needed anymore for the
continuation of the assembly. Setting both options to <span class="underline">no</span> will keep all log and temporary files.
</td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_logf_the_files"></a>17.2.
The files
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_error_reads_invalid"></a>17.2.1.
mira_error_reads_invalid
</h3></div></div></div><p>
A simple list of those reads that were invalid (no sequence or similar
problems).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_info_reads_tooshort"></a>17.2.2.
mira_info_reads_tooshort
</h3></div></div></div><p>
A simple list of those reads that were sorted out because the unclipped
sequence was too short as defined by [-AS:mrl].
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_alignextends_preassembly10txt"></a>17.2.3.
mira_int_alignextends_preassembly1.0.txt
</h3></div></div></div><p>
If read extension is used ([-DP:ure]), this file contains the read
name and the number of bases by which the right clipping was extended.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_clippings0txt"></a>17.2.4.
mira_int_clippings.0.txt
</h3></div></div></div><p>
If any of the [-CL:] options leads to the clipping of a read, this
file records when it happened, which kind of clipping it was, which read was
affected and by how much (or to which position) the clipping was set.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_megahubs_passxlst"></a>17.2.5.
mira_int_posmatch_megahubs_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Should any read be
categorised as megahub during the all-against-all search (SKIM3), this file
will tell you which.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_multicopystat_preassembly0txt"></a>17.2.6.
mira_int_posmatch_multicopystat_preassembly.0.txt
</h3></div></div></div><p>
After the initial all-against-all search (SKIM3), this file tells you how
many other reads each read overlaps with. Furthermore, reads that have more
overlaps than expected are tagged with "mc" (multicopy).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_rawhashhits_passxlst"></a>17.2.7.
mira_int_posmatch_rawhashhits_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Similar to
<code class="filename">mira_int_posmatch_multicopystat_preassembly.0.txt</code>, this counts the
kmer hits of each read to other reads, but this time per pass.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_skimmarknastyrepeats_hist_passxlst"></a>17.2.8.
mira_int_skimmarknastyrepeats_hist_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
[-KS:mnr] is set to <span class="underline">yes</span>. This file contains a
histogram of kmer occurrences encountered by SKIM3.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_skimmarknastyrepeats_nastyseq_passxlst"></a>17.2.9.
mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
[-KS:mnr] is set to <span class="underline">yes</span>. One of the more interesting
files if you want to know which repetitive sequences make the assembly
really difficult: for each masked part of a read, the masked sequence is
shown here.
</p><p>
E.g.
</p><pre class="screen">
U13a04h11.t1 TATATATATATATATATATATATA
U13a05b01.t1 TATATATATATATATATATATATA
U13a05c07.t1 AAAAAAAAAAAAAAA
U13a05e12.t1 CTCTCTCTCTCTCTCTCTCTCTCTCTCTC
</pre><p>
Simple repeats like the ones shown above will certainly pop up there,
but a few other sequences (e.g. rDNA/rRNA or, in eukaryotes, SINEs and
LINEs) will also appear.
</p><p>
A nifty thing to try if you want a more compact overview: sort by the
second column and keep only one line per distinct sequence.
</p><pre class="screen">
sort -k 2 -u mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
</pre><p>
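A variant of the same idea that also keeps counts: the Python sketch below
tallies how many masked read parts share each sequence. It assumes the
two-column, whitespace-separated layout shown in the example above; the
function name is hypothetical.
</p><pre class="screen">
```python
from collections import Counter


def count_masked_sequences(lines):
    """Count occurrences of each masked sequence (second column)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:          # skip blank or malformed lines
            counts[fields[1]] += 1
    return counts


# The four lines from the example above; for a real run, pass an open
# file handle of the .lst file instead of this list.
sample = [
    "U13a04h11.t1 TATATATATATATATATATATATA",
    "U13a05b01.t1 TATATATATATATATATATATATA",
    "U13a05c07.t1 AAAAAAAAAAAAAAA",
    "U13a05e12.t1 CTCTCTCTCTCTCTCTCTCTCTCTCTCTC",
]
for sequence, n in count_masked_sequences(sample).most_common():
    print(n, sequence)
```
</pre><p>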
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_vectorclip_passxtxt"></a>17.2.10.
mira_int_vectorclip_pass.X.txt
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
[-CL:pvlc] is set to <span class="underline">yes</span>. Tells you where possible
sequencing vector (or adaptor) leftovers were found and clipped (or not
clipped).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpads_passxforward_and_miratmpads_passxcomplement"></a>17.2.11.
miratmp.ads_pass.X.forward and miratmp.ads_pass.X.complement
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. These files list which read
aligns with Smith-Waterman against which other read, in 'forward-forward' and
'forward-complement' orientation respectively.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpads_passxreject"></a>17.2.12.
miratmp.ads_pass.X.reject
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Which possible read
overlaps failed the Smith-Waterman alignment check.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpnoqualities"></a>17.2.13.
miratmp.noqualities
</h3></div></div></div><p>
Which reads went completely without qualities into the assembly.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpusedids"></a>17.2.14.
miratmp.usedids
</h3></div></div></div><p>
Which reads effectively went into the assembly (after clipping etc.).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_readpoolinfolst"></a>17.2.15.
mira_readpoolinfo.lst
</h3></div></div></div></div></div></div></div></body></html>