<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Sequence assembly with MIRA 5</title><link rel="stylesheet" type="text/css" href="doccss/miradocstyle.css"><meta name="generator" content="DocBook XSL Stylesheets V1.79.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="book"><div class="titlepage"><div><div><h1 class="title"><a name="idm1"></a>Sequence assembly with MIRA 5</h1></div><div><h2 class="subtitle">
  The Definitive Guide
</h2></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><span class="contrib">Main author</span> <code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Jacqueline</span> <span class="surname">Weber</span></h3><span class="contrib">Extensive review of early reference manual
  </span> </div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Andrea</span> <span class="surname">Hörster</span></h3><span class="contrib">Extensive review of early reference manual
  </span> </div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Katrina</span> <span class="surname">Dlugosch</span></h3><span class="contrib">Draft for section on preprocessing of ESTs in EST manual
  </span> </div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div><div><div class="legalnotice"><a name="idm6"></a><p>
This documentation is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of
this license, visit <a class="ulink" href="http://creativecommons.org/licenses/by-nc-sa/3.0/" target="_top">http://creativecommons.org/licenses/by-nc-sa/3.0/</a> or send a letter to
Creative Commons, 171 Second Street, Suite 300, San Francisco, California,
94105, USA.
</p></div></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="preface"><a href="#idm30">Preface</a></span></dt><dt><span class="chapter"><a href="#chap_intro">1. Introduction to MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_installation">2. Installing MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_reference">3. MIRA 4 reference manual</a></span></dt><dt><span class="chapter"><a href="#chap_dataprep">4. Preparing data</a></span></dt><dt><span class="chapter"><a href="#chap_denovo">5. De-novo assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_mapping">6. Mapping assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_est">7. EST / RNASeq assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_specialparams">8. Parameters for special situations</a></span></dt><dt><span class="chapter"><a href="#chap_results">9. Working with the results of MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_mutils">10. Utilities in the MIRA package</a></span></dt><dt><span class="chapter"><a href="#chap_hard">11. Assembly of <span class="emphasis"><em>hard</em></span> genome or EST / RNASeq projects</a></span></dt><dt><span class="chapter"><a href="#chap_seqtechdesc">12. Description of sequencing technologies</a></span></dt><dt><span class="chapter"><a href="#chap_seqadvice">13. Some advice when going into a sequencing project</a></span></dt><dt><span class="chapter"><a href="#chap_bitsandpieces">14. Bits and pieces</a></span></dt><dt><span class="chapter"><a href="#chap_faq">15. Frequently asked questions</a></span></dt><dt><span class="chapter"><a href="#chap_maf">16. The MAF format</a></span></dt><dt><span class="chapter"><a href="#chap_logfiles">17. Log and temporary files used by MIRA</a></span></dt></dl></div><div class="list-of-figures"><p><b>List of Figures</b></p><dl><dt>1.1. <a href="#chap_intro::srmc_in_454sxahyb_1stpass.png">
	  How MIRA learns from misassemblies (1)
	</a></dt><dt>1.2. <a href="#chap_intro::srmc_in_454sxahyb_lastpass1.png">
	  How MIRA learns from misassemblies (2)
	</a></dt><dt>1.3. <a href="#chap_intro::srmc_in_454sxahyb_lastpass2.png">
	  How MIRA learns from misassemblies (3)
	</a></dt><dt>1.4. <a href="#chap_intro::gcb99_replocator.png">
	  Slides presenting the repeat locator at the GCB 99
	</a></dt><dt>1.5. <a href="#chap_intro::gcb99_edit.png">
	  Slides presenting the Edit automatic Sanger editor at the GCB 99
	</a></dt><dt>1.6. <a href="#chap_intro::san_autoedit1.png">
	  Sanger assembly without EdIt automatic editing routines
	</a></dt><dt>1.7. <a href="#chap_intro::san_autoedit2.png">
	  Sanger assembly with EdIt automatic editing routines
	</a></dt><dt>1.8. <a href="#chap_intro::454_autoedit1.png">
	  454 assembly without 454 automatic editing routines
	</a></dt><dt>1.9. <a href="#chap_intro::454_autoedit2.png">
	  454 assembly with 454 automatic editing routines
	</a></dt><dt>1.10. <a href="#chap_intro::haf5_haf2_contigcoverage_ovals.png">
	  Coverage of a contig.
	</a></dt><dt>1.11. <a href="#chap_intro::haf5_repend_rrna.png">
	  Repetitive end of a contig
	</a></dt><dt>1.12. <a href="#chap_intro::haf2_end_nomoredata.png">
	  Non-repetitive end of a contig
	</a></dt><dt>1.13. <a href="#chap_intro::454sxa_stms_hybdenovo.png">
	  MIRA pointing out problems in hybrid assemblies (1)
	</a></dt><dt>1.14. <a href="#chap_intro::454san_stmu_hybdenovo.png">
	  MIRA pointing out problems in hybrid assemblies (2)
	</a></dt><dt>1.15. <a href="#chap_intro::sxa_cer_reads1.png">
	  Coverage equivalent reads (CERs) explained.
	</a></dt><dt>1.16. <a href="#chap_intro::sxa_cer_reads2.png">
	  Coverage equivalent reads let SNPs become very visible in assembly viewers
	</a></dt><dt>1.17. <a href="#chap_intro::sxa_sroc_lenski2.png">
	  SNP tags in a MIRA assembly
	</a></dt><dt>1.18. <a href="#chap_intro::sxa_mcvc_lenski.png">
	  Tag pointing out a large deletion in a MIRA mapping assembly
	</a></dt><dt>9.1. <a href="#chap_res::results_miraconvert.png">
	  Format conversions with <span class="command"><strong>miraconvert</strong></span>
	</a></dt><dt>9.2. <a href="#chap_res::results_mira2other.png">
	  Conversions needed for other tools.
	</a></dt><dt>9.3. <a href="#haf_danger_join_notok.png">
	      Join at a repetitive site which should not be performed due to
	      missing spanning templates.
	    </a></dt><dt>9.4. <a href="#haf_danger_join_ok.png">
	      Join at a repetitive site which should be performed due to
	      spanning templates being good.
	    </a></dt><dt>9.5. <a href="#454_stacks_join.png">
	      Pseudo-repeat in 454 data due to sequencing artifacts
	    </a></dt><dt>9.6. <a href="#chap_sol::sxa_sroc_lenski1.png">
	      "SROc" tag showing a SNP position in a Solexa mapping
	      assembly.
	    </a></dt><dt>9.7. <a href="#chap_sol::sxa_sroc_lenski2.png">
	      "SROc" tag showing a SNP/indel position in a Solexa mapping
	      assembly.
	    </a></dt><dt>9.8. <a href="#chap_sol::sxa_mcvc_lenski.png">
	      "MCVc" tag (dark red stretch in figure) showing a genome
	      deletion in Solexa mapping assembly.
	    </a></dt><dt>9.9. <a href="#chap_sol::sxa_wrmcsrmc_hiding_lenski1.png">
	      An IS150 insertion hiding behind a WRMc and a SRMc tags
	    </a></dt><dt>9.10. <a href="#chap_sol::sxa_xmastree_lenski1.png">
	      A 16 base pair deletion leading to a SROc/UNsC xmas-tree
	    </a></dt><dt>9.11. <a href="#chap_sol::sxa_xmastree_lenski2.png">
	      An IS186 insertion leading to a SROc/UNsC xmas-tree
	    </a></dt><dt>12.1. <a href="#sxa_unsc_ggcxg2_lenski.png">
	    The Solexa GGCxG problem.
	  </a></dt><dt>12.2. <a href="#sxa_unsc_ggc1_lenski.png">
	    The Solexa GGC problem, forward example
	  </a></dt><dt>12.3. <a href="#sxa_unsc_ggc4_lenski.png">
	    The Solexa GGC problem, reverse example
	  </a></dt><dt>12.4. <a href="#sxa_xmastree_lenski2.png">
	    A genuine place of interest almost masked by the
	    <code class="literal">GGCxG</code> problem.
	  </a></dt><dt>12.5. <a href="#sxa_gcbias_nobias2008.png">
	    Example for no GC coverage bias in 2008 Solexa data.
	  </a></dt><dt>12.6. <a href="#sxa_gcbias_bias2009.png">
	    Example for GC coverage bias starting Q3 2009 in Solexa data.
	  </a></dt><dt>12.7. <a href="#sxa_gcbias_comp20082009.png">
	    Example for GC coverage bias, direct comparison 2008 / 2010 data.
	  </a></dt><dt>12.8. <a href="#chap_iontor::ion_dh10bgoodB13.png">
	Example for good IonTorrent data (100bp reads)
      </a></dt><dt>12.9. <a href="#chap_iontor::iontor_indelhpexample.png">
	  Example for problematic IonTorrent data (100bp reads)
	</a></dt><dt>12.10. <a href="#chap_iontor::ion_dh10bdirdepindel.png.png">
	  Example for a sequencing direction dependent indel
	</a></dt></dl></div><div class="preface"><div class="titlepage"><div><div><h1 class="title"><a name="idm30"></a>Preface</h1></div></div></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
    <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">How much intelligence does one need to sneak upon lettuce?
    </span>&#8221;</span></em></span>
  </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><p>
  This "book" is actually the result of an exercise in self-defense. It
  contains texts from several years of help files, mails, postings, questions,
  answers and so on concerning MIRA and the assembly projects one can do with it.
</p><p>
  I never really intended to push MIRA. It started out as a PhD thesis and I
  subsequently continued development when I needed something to be done which
  other programs couldn't do at the time. But MIRA has always been available
  as binary on the Internet since 1999 ... and as Open Source since
  2007. Somehow, MIRA seems to have caught the attention of more than just a
  few specialised sequencing labs and over the years I've seen an ever growing
  number of mails in my inbox and on the MIRA mailing list, both from people
  who have been in the sequencing business "since ever" and from labs or
  people just getting their feet wet in the area.
</p><p>
  The help files -- and through them this book -- sort of reflect this
  development. Most of the chapters<a href="#ftn.idm40" class="footnote" name="idm40"><sup class="footnote">[1]</sup></a> contain both very specialised
  topics as well as step-by-step walk-throughs intended to help people to get
  their assembly projects going. Some parts of the documentation are written
  in a decidedly non-scientific way. Please excuse this: time for rewriting mails
  was somewhat lacking, so some texts were re-used almost verbatim.
</p><p>
  The last few years have seen tremendous change in the sequencing
  technologies and MIRA 4 reflects that: core data structures and
  routines had to be thrown overboard and replaced with faster and/or more
  versatile versions suited for the broad range of technologies and use-cases
  I am currently running MIRA with.
</p><p>
  Nothing is perfect, and both MIRA and this documentation (even if it is
  rather pompously called <span class="emphasis"><em>Definitive Guide</em></span>) are far from
  it. If you spot an error either in MIRA or this manual, feel free to report
  it. Or, even better, correct it if you can. At least with the manual files
  it should be easy: they're basically just some decorated text files.
</p><p>
  I hope that MIRA will be as useful to you as it has been to me. Have a lot
  of fun with it.
</p><p>
  Burlington, Spring 2016
</p><p>
  Bastien Chevreux
</p><div class="footnotes"><br><hr style="width:100; text-align:left;margin-left: 0"><div id="ftn.idm40" class="footnote"><p><a href="#idm40" class="para"><sup class="para">[1] </sup></a>Avid readers of David
  Gerrold will certainly recognise the quotes from his books at the beginning
  of each chapter</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_intro"></a>Chapter 1. Introduction to MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_intro_whatismira">1.1. 
      What is MIRA?
    </a></span></dt><dt><span class="sect1"><a href="#sect_wheretostartreading">1.2. 
      What to read in this manual and where to start reading?
    </a></span></dt><dt><span class="sect1"><a href="#sect_intro_miraquicktour">1.3. 
      The MIRA quick tour
    </a></span></dt><dt><span class="sect1"><a href="#sect_for_which_data_sets_to_use_mira_and_for_which_not">1.4. 
      For which data sets to use MIRA and for which not
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect3_genome_denovo">1.4.1. 
	Genome de-novo
      </a></span></dt><dt><span class="sect2"><a href="#sect_genome_mapping">1.4.2. 
	Genome mapping
      </a></span></dt><dt><span class="sect2"><a href="#sect3_ests_rnaseq">1.4.3. 
	ESTs / RNASeq
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_specialfeatures">1.5. 
      Any special features I might be interested in?
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_miradiscernsrepeats">1.5.1. 
	MIRA learns to discern non-perfect repeats, leading to better assemblies
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_automatic_editors">1.5.2. 
	MIRA has integrated editors for data from Sanger, 454, IonTorrent sequencing
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_whycontigsend">1.5.3. 
	MIRA lets you see why contigs end where they end
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_stmshybrid_tags">1.5.4. 
	MIRA tags problematic decisions in hybrid assemblies
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_cer_reads">1.5.5. 
	MIRA allows older finishing programs to cope with the amount of data in Solexa
	mapping projects
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_mapping_tags">1.5.6. 
	MIRA tags SNPs and other features, outputs result files
	for biologists
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_miramuchmore">1.5.7. 
	MIRA has ... much more
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_versions_licenses_disclaimer_and_copyright">1.6. 
      Versions, Licenses, Disclaimer and Copyright
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_versions">1.6.1. 
	Versions
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_licenses">1.6.2. 
	License
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_intro_licensemira">1.6.2.1. 
	  MIRA
	</a></span></dt><dt><span class="sect3"><a href="#sect_intro_licensedocs">1.6.2.2. 
	  Documentation
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_intro_copyright">1.6.3. 
	Copyright
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_external_libraries">1.6.4. 
	External libraries
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_getting_help___mailing_lists___reporting_bugs">1.7. 
      Getting help / Mailing lists / Reporting bugs
    </a></span></dt><dt><span class="sect1"><a href="#sect_intro_author">1.8. 
      Author
    </a></span></dt><dt><span class="sect1"><a href="#sect_intro_miscellaneous">1.9. 
      Miscellaneous
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_citations">1.9.1. 
	Citing MIRA
      </a></span></dt><dt><span class="sect2"><a href="#sect_intro_postcards_gold_and_jewellery">1.9.2. 
	Postcards, gold and jewellery
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Half of being smart is to know what you're dumb at.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_whatismira"></a>1.1. 
      What is MIRA?
    </h2></div></div></div><p>
      MIRA is a multi-pass DNA sequence data assembler/mapper for whole
      genome and EST/RNASeq projects. MIRA assembles/maps reads gained by
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  electrophoresis sequencing (aka Sanger sequencing)
	</p></li><li class="listitem"><p>
	  454 pyro-sequencing (GS20, FLX or Titanium)
	</p></li><li class="listitem"><p>
	  Ion Torrent
	</p></li><li class="listitem"><p>
	  Solexa (Illumina) sequencing
	</p></li><li class="listitem"><p>
	  Error-corrected Pacific Biosciences sequences
	</p></li></ul></div><p>
      into contiguous sequences (called <span class="emphasis"><em>contigs</em></span>). One can
      use the sequences of different sequencing technologies either in a
      single assembly run (a <span class="emphasis"><em>true hybrid assembly</em></span>) or by
      mapping one type of data to an assembly of another sequencing type (a
      <span class="emphasis"><em>semi-hybrid assembly (or mapping)</em></span>) or by mapping
      data against consensus sequences of other assemblies (a <span class="emphasis"><em>simple
      mapping</em></span>).
    </p><p>
      The MIRA acronym stands for <span class="bold"><strong>M</strong></span>imicking
      <span class="bold"><strong>I</strong></span>ntelligent <span class="bold"><strong>R</strong></span>ead <span class="bold"><strong>A</strong></span>ssembly
      and the program pretty well does what its acronym says (well, most of
      the time anyway). It is the Swiss army knife of sequence assembly that
      I've used and developed during the past 14 years to get assembly jobs I
      work on done efficiently - and especially accurately. That is, without
      me actually putting too much manual work into it.
    </p><p>
      Over time, other labs and sequencing providers have found MIRA useful
      for assembly of extremely 'unfriendly' projects containing lots of
      repetitive sequences. As always, your mileage may vary.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_wheretostartreading"></a>1.2. 
      What to read in this manual and where to start reading?
    </h2></div></div></div><p>
      At the last count, this manual had almost 200 pages and this might seem a little bit daunting.
      However, you very probably do not need to read everything.
    </p><p>
      You should read most of this introductory chapter though, e.g.,
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    the part with the MIRA quick tour
	  </p></li><li class="listitem"><p>
	    the part which gives a quick overview for which data sets to use MIRA and for which not
	  </p></li><li class="listitem"><p>
	    the part which showcases different features of MIRA (lots of screen shots!)
	  </p></li><li class="listitem"><p>
	    where and how to get help if things don't work out as you expected
	  </p></li></ul></div><p>
      After that, reading should depend on the type of data you intend to work
      with: there are specific chapters for assembly of de-novo, of mapping and
      of EST / RNASeq projects. They all contain an overview on how to
      define your data and how to launch MIRA for these data sets. There is
      also a chapter on how to prepare data sets from specific sequencing
      technologies.
    </p><p>
      The chapter on working with results of MIRA should again be of general
      interest to everyone. It describes the structure of output directories
      and files and gives first pointers on what to find where. Also,
      converting results into different formats -- with and without filtering
      for specific needs -- is covered there.
    </p><p>
      As the previously cited chapters are more introductory in their nature,
      they do not go into the details of MIRA parametrisation. While MIRA has
      a comprehensive set of standard settings which should be suited for a
      majority of assembly tasks, there are more than 150 switches / parameters
      with which one can fine tune almost every aspect of an assembly. A
      complete description for each and every parameter and how to correctly
      set parameters for different use cases and sequencing technologies can
      be found in the reference chapter.
    </p><p>
      As not every assembly project is simple, there is also a chapter with
      tips on how to deal with projects which turn out to be "hard." It
      certainly helps if you at least skim through it even if you do not
      expect to have problems with your data ... it contains a couple of
      tricks on what one can see in result files as well as in temporary and
      log files which are not explained elsewhere.
    </p><p>
      MIRA comes with a number of additional utilities which are described in
      their own chapter. While the purpose of <span class="command"><strong>miraconvert</strong></span>
      should become clear quite quickly, the versatility of use cases for
      <span class="command"><strong>mirabait</strong></span> might surprise more than one reader. Be sure to
      check it out.
    </p><p>
      As from time to time some general questions on sequencing are popping up
      on the MIRA talk mailing list, I have added a chapter with some general
      musings on what to consider when going into sequencing projects. This
      should be in no way a replacement for an exhaustive talk with a
      sequencing provider, but it can give a couple of hints on what to take
      care of.
    </p><p>
      There is also a FAQ chapter with some of the more frequently asked questions
      which popped up in the past few years.
    </p><p>
      Finally, there are also chapters covering some more technical aspects of MIRA: the MAF format
      and the structure / content of the tmp directory have their own chapters.
    </p><p>
      Complete walkthroughs ... are lacking at the moment for MIRA 4. In the
      MIRA 3 manual I had them, but so many things have changed (at all
      levels: MIRA, the sequencing technologies, data repositories) that I did
      not have time to update them. I probably will need quite some time to
      write new ones. Feel free to send me some if you are inclined to help
      fellow scientists.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_miraquicktour"></a>1.3. 
      The MIRA quick tour
    </h2></div></div></div><p>
      Input can be in various formats like Staden experiment (EXP), Sanger
      CAF, FASTA, FASTQ or PHD files. Ancillary data containing additional
      information helpful to the assembly, as contained in e.g. NCBI
      traceinfo XML files or Staden EXP files, is also honoured. If present,
      base qualities in
      <span class="command"><strong>phred</strong></span> style and SCF signal electrophoresis trace
      files are used to adjudicate between or even correct contradictory
      stretches of bases in reads by either the integrated automatic EdIt
      editor (written by Thomas Pfisterer) or the assembler itself.
    </p><p>
      MIRA was conceived especially with the problem of repeats in genomic
      data and SNPs in transcript (EST / RNASeq) data in mind. Considerable
      effort was made to develop a number of strategies -- ranging from
      standard clone-pair size restrictions to discovery and marking of base
      positions discriminating the different repeats / SNPs -- to ensure that
      repetitive elements are correctly resolved and that misassemblies do not
      occur.
    </p><p>
      The resulting assembly can be written in different standard formats like
      CAF, Staden GAP4 directed assembly, ACE, HTML, FASTA, simple text or
      transposed contig summary (TCS) files. These can easily be imported into
      numerous finishing tools or further evaluated with simple scripts.
    </p><p>
      The aim of MIRA is to build the best possible assembly by
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  having a more or less full overview on the whole project at any time
	  of the assembly, i.e. knowledge of almost all possible read-pairs in
	  a project,
	</p></li><li class="listitem"><p>
	  using high confidence regions (HCRs) of several aligned read-pairs to
	  start contig building at a good anchor point of a contig, extending
	  clipped regions of reads on a 'can be justified' basis.
	</p></li><li class="listitem"><p>
	  using all available data present at the time of assembly, i.e.,
	  instead of relying on sequence and base confidence values only, the
	  assembler will profit from trace files containing electrophoresis
	  signals, tags marking possible special attributes of DNA,
	  information on specific insert sizes of read-pairs etc.
	</p></li><li class="listitem"><p>
	  having 'intelligent' contig objects accept or refuse reads based on
	  the rate of unexplainable errors introduced into the consensus
	</p></li><li class="listitem"><p>
	  learning from mistakes by discovering and analysing possible repeats
	  differentiated only by single nucleotide polymorphisms. The
	  important bases for discriminating different repetitive elements are
	  tagged and used as new information.
	</p></li><li class="listitem"><p>
	  using the possibility given by the integrated automatic editor to
	  correct errors present in contigs (and subsequently reads) by
	  generating and verifying complex error hypotheses through analysis
	  of trace signals in several reads covering the same area of a
	  consensus,
	</p></li><li class="listitem"><p>
	  iteratively extending reads (and subsequently contigs) based on
	</p><div class="orderedlist"><ol class="orderedlist" type="a"><li class="listitem"><p>
	      additional information gained by overlapping read pairs in contigs
	      and
	    </p></li><li class="listitem"><p>
	      corrections made by the automated editor.
	    </p></li></ol></div></li></ol></div><p>
    </p><p>
      MIRA was part of a bigger project that started at the DKFZ (Deutsches
      Krebsforschungszentrum, German Cancer Research Centre) Heidelberg in
      1997: the "Bundesministerium für Bildung, Wissenschaft, Forschung und
      Technologie" supported the PhD thesis of Thomas and myself by grant
      number <span class="emphasis"><em>01 KW 9611</em></span>. Beside an assembler to tackle
      difficult repeats, the grant also supported the automated editor /
      finisher EdIt package -- written by Thomas Pfisterer. The strength of
      MIRA and EdIt is the automatic interaction of both packages, which
      produces assemblies that leave less work for human finishers.
    </p><p>
      I'd like to thank everybody who reported bugs to me, pointed out problems
      they encountered while using the predecessors, or sent ideas and suggestions.
      Please continue to do so; the feedback made this third version possible.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_for_which_data_sets_to_use_mira_and_for_which_not"></a>1.4. 
      For which data sets to use MIRA and for which not
    </h2></div></div></div><p>
      As a general rule of thumb: if you have an organism with more than
      100 to 150 megabases or more than 20 to 40 million reads, you might want
      to try other assemblers first.
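      </p><p>
	A minimal sketch of how one might encode this rule of thumb in a
	pre-flight check, here in Python. The function name and the choice of
	the lower ends of the quoted ranges as cut-offs are assumptions made
	for illustration; nothing like this is part of MIRA itself.
      </p><pre class="programlisting">
```python
# Lower ends of the rule-of-thumb ranges quoted in the text (a cautious,
# assumed choice; the text gives ranges, not hard limits).
MAX_GENOME_MEGABASES = 100
MAX_READS_MILLIONS = 20


def consider_other_assembler_first(genome_megabases, reads_millions):
    """Return True if either figure exceeds the lower end of the quoted ranges."""
    return (genome_megabases > MAX_GENOME_MEGABASES
            or reads_millions > MAX_READS_MILLIONS)


# A bacterial-sized project and a human-sized project (hypothetical numbers).
print(consider_other_assembler_first(5, 8))
print(consider_other_assembler_first(3000, 600))
```
</pre><p>
	Using the lower ends of the ranges errs on the side of caution;
	projects that fall between the lower and upper ends remain a
	judgment call.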
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect3_genome_denovo"></a>1.4.1. 
	Genome de-novo
      </h3></div></div></div><p>
	For genome assembly, the version 4 series of MIRA has been reported
	to work on projects with something like a million Sanger reads (~80 to
	100 megabases at 10x coverage), five to ten million 454 Titanium reads
	(~100 megabases at 20x coverage) and 20 to 40 million Solexa reads
	(enough for de-novo of a bacterium or a small eukaryote with 76mers or
	100mers).
      </p><p>
	Provided you have the memory, MIRA is expected to work in de-novo
	mode with
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    Sanger reads: 5 to 10 million
	  </p></li><li class="listitem"><p>
	    454 reads: 5 to 15 million
	  </p></li><li class="listitem"><p>
	    Ion Torrent reads: 5 to 15 million
	  </p></li><li class="listitem"><p>
	    Solexa reads: in normal operation, up to 40 million reads. Some
	    people use it on up to 300 million, but you'll need a really big
	    machine and months of computation time ... I do not recommend
	    that.
	  </p></li></ul></div><p>
	and "normal" coverages, where "normal" means no more than 50x
	to 70x for genome projects. Higher coverages will also work, but may
	create somewhat larger temporary files without heavy
	parametrisation. Lower coverages (&lt;4x for Sanger, &lt;10x for 454,
	&lt;10x for IonTorrent) also need special attention in the
	parameter settings.
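      </p><p>
	The coverage figures quoted in this section follow from simple
	arithmetic: average coverage is the total number of sequenced bases
	divided by the genome size. A minimal Python sketch of that
	calculation, together with a rough classification using the limits
	mentioned here, could look as follows. All names, the mean read
	length of 900 bases and the use of 50x (the lower end of the "normal"
	ceiling) are assumptions made for illustration only.
      </p><pre class="programlisting">
```python
# Coverage below which the text says parameters need special attention.
# Solexa is not listed in the text, so it has no entry here.
LOW_COVERAGE_LIMIT = {"sanger": 4, "454": 10, "iontorrent": 10}

# The text gives 50x to 70x as the ceiling of "normal"; 50 is the cautious pick.
NORMAL_COVERAGE_CEILING = 50


def estimated_coverage(num_reads, mean_read_length, genome_size):
    """Average coverage = total sequenced bases / genome size (in bases)."""
    return num_reads * mean_read_length / genome_size


def coverage_class(technology, coverage):
    """Classify a coverage value as 'low', 'normal' or 'high'."""
    low_limit = LOW_COVERAGE_LIMIT.get(technology, 0)
    if low_limit > coverage:
        return "low"
    if coverage > NORMAL_COVERAGE_CEILING:
        return "high"
    return "normal"


# One million Sanger reads of an assumed 900 bases on a 90 megabase genome.
sanger_cov = estimated_coverage(1_000_000, 900, 90_000_000)
print(sanger_cov, coverage_class("sanger", sanger_cov))
```
</pre><p>
	With these assumed numbers, one million Sanger reads on a 90 megabase
	genome come out at 10x, in line with the figures quoted for Sanger
	projects in this section.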
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_genome_mapping"></a>1.4.2. 
	Genome mapping
      </h3></div></div></div><p>
	As the complexity of mapping is a lot lower than de-novo, one can
	basically double (perhaps even triple) the number of reads compared to
	'de-novo'. The limiting factor will be the amount of RAM though, and
	MIRA will also need lots of it if you go into eukaryotes.
      </p><p>
	The main limiting factor regarding time will be the number of
	reference sequences (backbones) you are using. As MIRA is pedantic
	during the mapping process, it might be a rather long wait if you have
	more than 40 megabases of reference sequences.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect3_ests_rnaseq"></a>1.4.3. 
	ESTs / RNASeq
      </h3></div></div></div><p>
	The default values for MIRA should allow it to work with many EST and
	RNASeq data sets, sometimes even from non-normalised libraries. For
	extreme coverage cases however (like, something with a lot of cases at
	and above 10k coverage), one would perhaps want to resort to data
	reduction routines before feeding the sequences to MIRA.
      </p><p>
	On the other hand, recent developments of MIRA were targeted at making
	de-novo RNASeq assembly of non-normalised libraries practical, and
	indeed I now regularly use MIRA for data sets with up to 50 million
	Illumina 100bp reads.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_specialfeatures"></a>1.5. 
      Any special features I might be interested in?
    </h2></div></div></div><p>
      A few perhaps.
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	The screen shots in this section show data from assemblies produced
	with MIRA, but the visualisation itself is done in a finishing program
	named <span class="command"><strong>gap4</strong></span>.
      </p><p>
	Some of the screen shots were edited for showing a special feature of
	MIRA. E.g., in the screen shots with Solexa data, quite some reads were
	left out of the view pane as else -- due to the amount of data --
	these screen shots would need several pages for a complete printout.
      </p></td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_miradiscernsrepeats"></a>1.5.1. 
	MIRA learns to discern non-perfect repeats, leading to better assemblies
      </h3></div></div></div><p>
	MIRA is an iterative assembler (it works in several passes) and acts a
	bit like a child exploring the world: it explores the assembly
	space and is specifically parameterised to allow a couple of assembly
	errors during the first passes. But after each pass some routines (the
	"parents", if you like) check the result, search for assembly
	errors and deduce knowledge about specific assemblies MIRA should not
	have ventured into. MIRA will then prevent these errors from recurring in
	subsequent passes.
      </p><p>
	As an example, consider the following multiple alignment:
      </p><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_1stpass.png"></a><p class="title"><b>Figure 1.1. How MIRA learns from misassemblies (1). Multiple alignment
	after 1st pass with an obvious assembly error; note the clustered
	column discrepancies. Two slightly different repeats were assembled
	together.</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_1stpass.png" width="100%" alt="How MIRA learns from misassemblies (1). Multiple alignment after 1st pass with an obvious assembly error; note the clustered column discrepancies. Two slightly different repeats were assembled together."></td></tr></table></div></div></div><br class="figure-break"><p>
	These kinds of errors are easily spotted by a human, but are hard for
	normal alignment algorithms to prevent, as sometimes only a
	single base column differs between repeats (and not several as in
	this example).
      </p><p>
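	To make the idea of clustered column discrepancies concrete, here is a
	minimal Python sketch. It is not MIRA's actual implementation: the
	function name discrepancy_columns, the min_reads threshold and the toy
	alignment are assumptions made purely for illustration. A column is
	flagged when at least two different bases are each supported by several
	reads, something a lone base calling error would not produce.
      </p>

```python
from collections import Counter


def discrepancy_columns(alignment, min_reads=2):
    """Return columns where two or more different bases each have
    support from at least min_reads reads (hypothetical criterion)."""
    flagged = []
    for col in range(len(alignment[0])):
        counts = Counter(read[col] for read in alignment)
        supported = [base for base, n in counts.items() if n >= min_reads]
        if len(supported) >= 2:
            flagged.append(col)
    return flagged


# Toy alignment: six gap-free reads of equal length (illustrative data only).
# Reads 1-3 stem from one repeat copy, reads 4-6 from a slightly different one;
# the last read additionally carries a single base calling error at its end.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACCTACTTAC",
    "ACCTACTTAC",
    "ACCTACTTAA",
]

print(discrepancy_columns(reads))
```

      <p>
	In this toy alignment the two columns that separate the repeat copies
	are reported, while the isolated error in the last column is not.
      </p><p>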
	MIRA spots these things (even if it's only a single column), tags the
	base positions in the reads with additional information and then
	uses that information in subsequent passes. The net effect is shown in
	the next two figures:
      </p><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_lastpass1.png"></a><p class="title"><b>Figure 1.2. 
	  Multiple alignment after last pass where assembly errors from
	  previous passes have been resolved (1st repeat site)
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_lastpass1.png" width="100%" alt="Multiple alignment after last pass where assembly errors from previous passes have been resolved (1st repeat site)"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_lastpass2.png"></a><p class="title"><b>Figure 1.3. 
	  Multiple alignment after last pass where assembly errors from
	  previous passes have been resolved (2nd repeat site)
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_lastpass2.png" width="100%" alt="Multiple alignment after last pass where assembly errors from previous passes have been resolved (2nd repeat site)"></td></tr></table></div></div></div><br class="figure-break"><p>
	The ability of MIRA to learn and discern non-identical repeats from
	each other through column discrepancies is nothing new. Here's the
	link to a paper from a talk I gave at the German Conference on
	Bioinformatics in 1999: <a class="ulink" href="http://www.bioinfo.de/isb/gcb99/talks/chevreux/" target="_top">http://www.bioinfo.de/isb/gcb99/talks/chevreux/</a>
      </p><p>
	I'm sure you'll recognise the basic principle in figures 8 and 9. The
	slides from the corresponding talk also look very similar to the
	screen shots above:
      </p><div class="figure"><a name="chap_intro::gcb99_replocator.png"></a><p class="title"><b>Figure 1.4. 
	  Slides presenting the repeat locator at the GCB 99
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/gcb99_replocator.png" width="100%" alt="Slides presenting the repeat locator at the GCB 99"></td></tr></table></div></div></div><br class="figure-break"><p>
	You can get the talk with these slides here: <a class="ulink" href="http://chevreux.org/dkfzold/gcb99/bachvortrag_gcb99.ppt" target="_top">http://chevreux.org/dkfzold/gcb99/bachvortrag_gcb99.ppt</a>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_automatic_editors"></a>1.5.2. 
	MIRA has integrated editors for data from Sanger, 454, IonTorrent sequencing
      </h3></div></div></div><p>
	Since the first versions in 1999, the <span class="emphasis"><em>EdIt</em></span>
	automatic Sanger sequence editor from Thomas Pfisterer has been
	integrated into MIRA.
      </p><div class="figure"><a name="chap_intro::gcb99_edit.png"></a><p class="title"><b>Figure 1.5. 
	  Slides presenting the Edit automatic Sanger editor at the GCB 99
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/gcb99_edit.png" width="100%" alt="Slides presenting the Edit automatic Sanger editor at the GCB 99"></td></tr></table></div></div></div><br class="figure-break"><p>
	The routines use a combination of hypothesis generation/testing
	together with neural networks (trained on ABI and ALF traces) for
	signal recognition to discern between base calling errors and true
	multiple alignment differences. They go back to the trace data to
	resolve potential conflicts and eventually re-call bases using the
	additional information gained in a multiple alignment of reads.
      </p><div class="figure"><a name="chap_intro::san_autoedit1.png"></a><p class="title"><b>Figure 1.6. 
	  Sanger assembly without EdIt automatic editing routines. The bases
	  with blue background are base calling errors.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/san_autoedit1.png" width="100%" alt="Sanger assembly without EdIt automatic editing routines. The bases with blue background are base calling errors."></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::san_autoedit2.png"></a><p class="title"><b>Figure 1.7. 
	  Sanger assembly with EdIt automatic editing routines. Bases with
	  pink background are corrections made by EdIt after assessing the
	  underlying trace files (SCF files in this case). Bases with blue
	  background are base calling errors where the evidence in the trace
	  files did not show enough evidence to allow an editing correction.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/san_autoedit2.png" width="100%" alt="Sanger assembly with EdIt automatic editing routines. Bases with pink background are corrections made by EdIt after assessing the underlying trace files (SCF files in this case). Bases with blue background are base calling errors where the evidence in the trace files did not show enough evidence to allow an editing correction."></td></tr></table></div></div></div><br class="figure-break"><p>
	With the introduction of 454 reads, MIRA also gained specialised
	editors in 2007 to search for and correct typical 454 sequencing problems
	like homopolymer run over-/undercalls. These editors are now integrated
	into MIRA itself and are not part of EdIt anymore.
      </p><p>
	While not paramount to the assembly quality, both editors
	provide additional layers of safety for the MIRA learning algorithm to
	discern non-perfect repeats even on a single base
	discrepancy. Furthermore, the multiple alignments generated with these
	two editors are far more pleasant to look at (or to analyse
	automatically) than ones containing all kinds of gaps, insertions,
	deletions etc.
      </p><div class="figure"><a name="chap_intro::454_autoedit1.png"></a><p class="title"><b>Figure 1.8. 
	  454 assembly without 454 automatic editing routines
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_autoedit1.png" width="100%" alt="454 assembly without 454 automatic editing routines"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::454_autoedit2.png"></a><p class="title"><b>Figure 1.9. 
	  454 assembly with 454 automatic editing routines
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_autoedit2.png" width="100%" alt="454 assembly with 454 automatic editing routines"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_whycontigsend"></a>1.5.3. 
	MIRA lets you see why contigs end where they end
      </h3></div></div></div><p>
	A very useful feature for finishing is the set of kmer (hash) frequency
	tags which MIRA sets in the assembly. Provided your finishing editor
	understands those tags
	(<span class="command"><strong>gap4</strong></span>, <span class="command"><strong>gap5</strong></span>
	and <span class="command"><strong>consed</strong></span> are fine, but there may be others),
	they give you valuable insight into where you might want to be cautious
	when joining two contigs or where you would need to perform some primer
	walking. MIRA colourises the assembly with the hash frequency (HAF)
	tags to show repetitiveness.
      </p><p>
	You will need to read about the HAF tags in the reference manual, but
	in a nutshell: the HAF5, HAF6 and HAF7 tags tell you that you potentially
	have repetitive to very repetitive read areas in the genome, while HAF2
	tags tell you that these areas in the genome have not been
	covered as well as they should have been.
      </p><p>
	As an example, the following figure shows the coverage of a contig.
      </p><div class="figure"><a name="chap_intro::haf5_haf2_contigcoverage_ovals.png"></a><p class="title"><b>Figure 1.10. 
	  Coverage of a contig.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf5_haf2_contigcoverage_ovals.png" width="100%" alt="Coverage of a contig."></td></tr></table></div></div></div><br class="figure-break"><p>
	The question now is: why did MIRA stop building this contig on the
	left end (left oval) and why on the right end (right oval)?
      </p><p>
	Looking at the HAF tags in the contig, the answer quickly becomes
	clear: the left contig end has HAF5 tags in the reads (shown in bright
	red in the following figure). This tells you that MIRA stopped because
	it probably could not unambiguously continue building this
	contig. Indeed, if you BLAST the sequence at the NCBI, you will find
	that this is a bacterial rRNA region, of which bacteria
	normally have several copies in the genome:
      </p><div class="figure"><a name="chap_intro::haf5_repend_rrna.png"></a><p class="title"><b>Figure 1.11. 
	  HAF5 tags (reads shown with red background) covering a contig end
	  show repetitiveness as reason for stopping a contig build.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf5_repend_rrna.png" width="100%" alt="HAF5 tags (reads shown with red background) covering a contig end show repetitiveness as reason for stopping a contig build."></td></tr></table></div></div></div><br class="figure-break"><p>
	The right end of the same contig however ends in HAF3 tags (normal
	coverage, bright green in the next figure) and even HAF2 tags (below
	average coverage, pale green in the next image). This tells you MIRA
	stopped building the contig at this place simply because there were
	no more reads to continue. This is a perfect target for primer
	walking if you want to finish a genome.
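      </p><p>
	The reasoning in this example can be written down as a small decision
	rule. The following Python sketch is only an illustration: the function
	name contig_end_hint and the returned messages are made up here, and only
	the HAF tags discussed in this section are covered (see the reference
	manual for the full set and their exact meaning).
      </p>

```python
# HAF tags as described in this section; other tags are deliberately left out.
REPEAT_TAGS = {"HAF5", "HAF6", "HAF7"}   # repetitive to very repetitive
COVERAGE_TAGS = {"HAF2", "HAF3"}         # below-average and normal coverage


def contig_end_hint(tags_at_end):
    """Suggest why a contig may have ended, given the HAF tags seen in the
    reads covering that end (hypothetical helper, not part of MIRA)."""
    tags = set(tags_at_end)
    if not tags.isdisjoint(REPEAT_TAGS):
        return "repeat: the contig probably could not be extended unambiguously"
    if tags and tags.issubset(COVERAGE_TAGS):
        return "no more reads: candidate for primer walking"
    return "unclear: check the reference manual and the alignment"


print(contig_end_hint(["HAF5", "HAF3"]))   # left end of the example contig
print(contig_end_hint(["HAF3", "HAF2"]))   # right end of the example contig
```

      <p>
	Such a rule is no substitute for looking at the contig end in a
	finishing editor, but it shows how the tags can be read.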
      </p><div class="figure"><a name="chap_intro::haf2_end_nomoredata.png"></a><p class="title"><b>Figure 1.12. 
	  HAF2 tags covering a contig end show that no more reads were
	  available for assembly at this position.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf2_end_nomoredata.png" width="100%" alt="HAF2 tags covering a contig end show that no more reads were available for assembly at this position."></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_stmshybrid_tags"></a>1.5.4. 
	MIRA tags problematic decisions in hybrid assemblies
      </h3></div></div></div><p>
	Many people combine Sanger &amp; 454 -- or nowadays more 454 &amp;
	Solexa -- to improve the sequencing quality of their project through
	two (or more) sequencing technologies. To reduce time spent in
	finishing, MIRA automatically tags those bases in a consensus of a
	hybrid assembly where reads from different sequencing technologies
	severely contradict each other.
      </p><p>
	The following example shows a hybrid 454 / Solexa assembly where the
	454 reads (highlighted read names in the following figure) disagree on
	whether there are one or two "G" at a certain position. The consensus
	algorithm would have chosen "two Gs" for 454, obviously a wrong
	decision as all Solexa reads at the same spot (the reads which are not
	highlighted) show only one "G" for the given position. While MIRA
	chose to believe Solexa in this case, it tagged the position anyway in
	case someone wants to check these kinds of things.
      </p><div class="figure"><a name="chap_intro::454sxa_stms_hybdenovo.png"></a><p class="title"><b>Figure 1.13. 
	  A "STMS" tag (Sequencing Technology Mismatch Solved, the black
	  square base in the consensus) showing a potentially difficult
	  decision in a hybrid 454 / Solexa de-novo assembly.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454sxa_stms_hybdenovo.png" width="100%" alt='A "STMS" tag (Sequencing Technology Mismatch Solved, the black square base in the consensus) showing a potentially difficult decision in a hybrid 454 / Solexa de-novo assembly.'></td></tr></table></div></div></div><br class="figure-break"><p>
	This works also for other sequencing technology combinations or in
	mapping assemblies. The following is an example in a hybrid Sanger /
	454 project where by pure misfortune, all Sanger reads have a base
	calling error at a given position while the 454 reads show the true
	sequence.
      </p><div class="figure"><a name="chap_intro::454san_stmu_hybdenovo.png"></a><p class="title"><b>Figure 1.14. 
	  A "STMU" tag (Sequencing Technology Mismatch Unresolved, light blue
	  square in the consensus at lower end of large oval) showing a
	  potentially difficult decision in a hybrid Sanger / 454 mapping
	  assembly.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454san_stmu_hybdenovo.png" width="100%" alt='A "STMU" tag (Sequencing Technology Mismatch Unresolved, light blue square in the consensus at lower end of large oval) showing a potentially difficult decision in a hybrid Sanger / 454 mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_cer_reads"></a>1.5.5. 
	MIRA allows older finishing programs to cope with the amount of data in
	Solexa mapping projects
      </h3></div></div></div><p>
	Quality control is paramount when you do mutation analysis for
	biologists: I know they'll be on my doorstep the very minute they
	find out that one of the SNPs in the resequencing data wasn't a SNP, but a
	sequencing artefact. And I can understand them: why should they invest
	-- per SNP -- hours in the wet lab if I can invest a couple of minutes
	to get them data with false negative rates (and false discovery rates) way
	below 1%? So, finishing and quality control are a must for any mapping
	project.
      </p><p>
	Both <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>consed</strong></span> start to
	have problems when projects contain millions of reads: you
	need lots of RAM, and scrolling around the assembly becomes a test of your
	patience. Still, these two assembly finishing programs are amongst the
	better ones out there, although <span class="command"><strong>gap5</strong></span> is
	quickly reaching a state in which it can substitute for
	<span class="command"><strong>gap4</strong></span>.
      </p><p>
	So, MIRA reduces the number of reads in Solexa mapping projects
	without sacrificing information on coverage. The principle is pretty
	simple: for 100% matching reads, MIRA tracks the coverage of every
	reference base and creates long synthetic, coverage equivalent reads
	(CERs) in exchange for the Solexa reads. Reads that do not match 100%
	are kept as separate entities, so that no information gets lost. The
	following figure illustrates this:
      </p><div class="figure"><a name="chap_intro::sxa_cer_reads1.png"></a><p class="title"><b>Figure 1.15. 
	    Coverage equivalent reads (CERs) explained.
	  <p>
	    Left side of the figure: a conventional mapping with eleven reads
	    of size 4 against a consensus (in uppercase). The inversed base in
	    the lowest read depicts a sequencing error.
	  </p>
	  <p>
	    Right side of the figure: the same situation, but with coverage
	    equivalent reads (CERs). Note that there are fewer reads, but no
	    information is lost: the coverage of each reference base is
	    equivalent to the left side of the figure and reads with
	    differences to the reference are still present.
	  </p>
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_cer_reads1.png" width="100%" alt="Coverage equivalent reads (CERs) explained. Left side of the figure: a conventional mapping with eleven reads of size 4 against a consensus (in uppercase). The inversed base in the lowest read depicts a sequencing error. Right side of the figure: the same situation, but with coverage equivalent reads (CERs). Note that there are less reads, but no information is lost: the coverage of each reference base is equivalent to the left side of the figure and reads with differences to the reference are still present."></td></tr></table></div></div></div><br class="figure-break"><p>
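	The principle can be sketched in a few lines of Python. This is only an
	illustration under simplifying assumptions, not MIRA's real algorithm:
	reads are reduced to (start, length) pairs, the greedy layering strategy
	and the names coverage_of and build_cers are invented for this example,
	and reads with mismatches are assumed to be kept aside unchanged.
      </p>

```python
def coverage_of(ref_len, reads):
    """Per-base coverage for reads given as (start, length) tuples."""
    coverage = [0] * ref_len
    for start, length in reads:
        for pos in range(start, start + length):
            coverage[pos] += 1
    return coverage


def build_cers(ref_len, perfect_reads):
    """Replace 100% matching reads by long synthetic reads with the
    same per-base coverage (greedy layering, an assumed strategy)."""
    remaining = coverage_of(ref_len, perfect_reads)
    cers = []
    while any(remaining):
        pos = 0
        while pos < ref_len:
            if remaining[pos] == 0:
                pos += 1
                continue
            start = pos
            # Extend the synthetic read as far as coverage remains.
            while pos < ref_len and remaining[pos] > 0:
                remaining[pos] -= 1
                pos += 1
            cers.append((start, pos - start))
    return cers


# Eleven perfectly matching reads of length 4 on a 12-base reference (toy data).
perfect = [(0, 4), (0, 4), (1, 4), (2, 4), (3, 4), (4, 4),
           (5, 4), (6, 4), (7, 4), (8, 4), (8, 4)]
cers = build_cers(12, perfect)
print(len(perfect), "reads become", len(cers), "CERs")
```

      <p>
	The synthetic reads reproduce the per-base coverage of the original
	reads exactly; reads with differences to the reference would simply be
	kept alongside them.
      </p><p>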
	This strategy is very effective in reducing the size of a project. As
	an example, in a mapping project with 9 million Solexa 36mers, MIRA
	created a project with 1.7 million reads: 700k CER reads representing ~8
	million 100% matching Solexa reads, and ~950k mapped reads kept as
	they had &#8805; 1 mismatch (be it sequencing error or true SNP) to the
	reference. That is a reduction of about 80%, and numbers for mapping
	projects with Solexa 100bp reads are in a similar range.
      </p><p>
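	The numbers quoted above can be checked with a little arithmetic. The
	values in this short Python snippet are the approximate ones given in the
	text, so the results are rounded estimates rather than exact counts; the
	variable names are chosen here for illustration.
      </p>

```python
total_solexa_reads = 9_000_000   # reads in the mapping project
cer_reads = 700_000              # synthetic coverage equivalent reads
kept_reads = 950_000             # reads kept because they differ from the reference

reads_in_project = cer_reads + kept_reads
replaced_reads = total_solexa_reads - kept_reads
reduction = 1 - reads_in_project / total_solexa_reads

print(f"reads in final project: {reads_in_project:,}")
print(f"reads replaced by CERs: {replaced_reads:,}")
print(f"reduction: {reduction:.0%}")
```

      <p>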
	Also, mutations of the resequenced strain now really stand out in the
	assembly viewer as the following figure shows:
      </p><div class="figure"><a name="chap_intro::sxa_cer_reads2.png"></a><p class="title"><b>Figure 1.16. 
	  Coverage equivalent reads let SNPs become very visible in assembly viewers
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_cer_reads2.png" width="100%" alt="Coverage equivalent reads let SNPs become very visible in assembly viewers"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_mapping_tags"></a>1.5.6. 
	MIRA tags SNPs and other features, outputs result files
	for biologists
      </h3></div></div></div><p>
	Want to assemble two or more very closely related genomes without a
	reference, while still finding SNPs or differences between them?
      </p><p>
	Tired of looking at some text output from mapping programs and
	guessing whether a SNP is really a SNP or just some random junk?
      </p><p>
	MIRA tags all SNPs (and other features like missing coverage etc.) it
	finds so that -- when using a finishing viewer like gap4 or consed --
	one can quickly jump from tag to tag and perform quality control. This
	works both in de-novo assembly and in mapping assembly, all MIRA needs
	is the information which read comes from which strain.
      </p><p>
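	Why the strain information matters can be shown with a small Python
	sketch. It is a deliberately simplified illustration and not how MIRA
	itself decides: the helper names, the simple majority vote per strain and
	the example data are all assumptions. For one alignment column, the bases
	are grouped by strain, and the column is reported when the strains
	disagree.
      </p>

```python
from collections import Counter


def strain_consensus(column):
    """Majority base per strain for one alignment column.
    column is a list of (strain, base) pairs."""
    per_strain = {}
    for strain, base in column:
        per_strain.setdefault(strain, Counter())[base] += 1
    return {s: c.most_common(1)[0][0] for s, c in per_strain.items()}


def differs_between_strains(column):
    """True if the strains do not agree on the base in this column."""
    return len(set(strain_consensus(column).values())) > 1


# Toy column: five wild-type reads, five mutant reads (one with an error).
snp_column = [("wt", "A")] * 5 + [("mut", "G")] * 4 + [("mut", "A")]
plain_column = [("wt", "A")] * 5 + [("mut", "A")] * 5

print(strain_consensus(snp_column))
print(differs_between_strains(snp_column), differs_between_strains(plain_column))
```

      <p>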
	The following figure shows a mapping assembly of Solexa 36mers against
	a bacterial reference sequence, where a mutant has an indel position
	in a gene:
      </p><div class="figure"><a name="chap_intro::sxa_sroc_lenski2.png"></a><p class="title"><b>Figure 1.17. 
	  "SROc" tag (Snp inteR Organism on Consensus) showing a SNP position
	  in a Solexa mapping assembly.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski2.png" width="100%" alt='"SROc" tag (Snp inteR Organism on Consensus) showing a SNP position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><p>
	Other interesting places like deletions of whole genome parts are also
	directly tagged by MIRA and noted in diverse result files (and
	searchable in assembly viewers):
      </p><div class="figure"><a name="chap_intro::sxa_mcvc_lenski.png"></a><p class="title"><b>Figure 1.18. 
	  "MCVc" tag (Missing CoVerage in Consensus, dark red stretch in figure)
	  showing a genome deletion in Solexa mapping assembly.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_mcvc_lenski.png" width="100%" alt='"MCVc" tag (Missing CoVerage in Consensus, dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	For bacteria -- and if you use annotated GenBank files as reference
	sequence -- MIRA will also output some nice lists directly usable (in
	Excel) by biologists, telling them which gene was affected by what
	kind of SNP, whether it changes the protein, the original and the
	mutated protein sequence etc.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_miramuchmore"></a>1.5.7. 
	MIRA has ... much more
      </h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    Extensive possibilities to clip data if needed: by quality, by
	    masked bases, by A/T stretches, by evidence from other reads, ...
	  </p></li><li class="listitem"><p>
	    Routines to re-extend reads into clipped parts if multiple
	    alignment allows for it.
	  </p></li><li class="listitem"><p>
	    Read in ancillary data in different formats: EXP, NCBI TRACEINFO
	    XML, SSAHA2, SMALT result files and text files.
	  </p></li><li class="listitem"><p>
	    Detection of chimeric reads.
	  </p></li><li class="listitem"><p>
	    Pipeline to discover SNPs in ESTs from different strains
	    (miraSearchESTSNPs)
	  </p></li><li class="listitem"><p>
	    Support for many different input and output formats (FASTA,
	    EXP, FASTQ, CAF, MAF, ...)
	  </p></li><li class="listitem"><p>
	    Automatic memory management (when RAM is tight)
	  </p></li><li class="listitem"><p>
	    Over 150 parameters to tune the assembly for a lot of use cases,
	    many of these parameters being tunable individually depending on
	    sequencing technology they apply to.
	  </p></li></ul></div><p>
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_versions_licenses_disclaimer_and_copyright"></a>1.6. 
      Versions, Licenses, Disclaimer and Copyright
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_versions"></a>1.6.1. 
	Versions
      </h3></div></div></div><p>
	There are two kinds of versions of MIRA that can be compiled from
	source files: production and development.
      </p><p>
	Production versions are from the stable branch of the source code. These
	versions are available for download from SourceForge.
      </p><p>
	Development versions are from the development branch of the source
	tree. These are also made available to the public and should be
	compiled by users who want to test out new functionality or to track
	down bugs or errors that might arise at a given location. Release
	candidates (rc) also fall into the development versions: they are
	usually the last versions of a given development branch before being
	folded back into the production branch.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_licenses"></a>1.6.2. 
	License
      </h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_intro_licensemira"></a>1.6.2.1. 
	  MIRA
	</h4></div></div></div><p>
	  MIRA has been put under the GPL version 2.
	</p><p>
	  This program is free software; you can redistribute it and/or modify
	  it under the terms of the GNU General Public License as published by
	  the Free Software Foundation; either version 2 of the License, or (at
	  your option) any later version.
	</p><p>
	  This program is distributed in the hope that it will be useful, but
	  WITHOUT ANY WARRANTY; without even the implied warranty of
	  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	  General Public License for more details.
	</p><p>
	  You should have received a copy of the GNU General Public License
	  along with this program; if not, write to the Free Software
	  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
	  02110-1301, USA
	</p><p>
	  You may also visit <a class="ulink" href="http://www.opensource.org/licenses/gpl-2.0.php" target="_top">http://www.opensource.org/licenses/gpl-2.0.php</a> at the Open
	  Source Initiative for a copy of this licence.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_intro_licensedocs"></a>1.6.2.2. 
	  Documentation
	</h4></div></div></div><p>
	  The documentation pertaining to MIRA is licensed under the Creative
	  Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
	  License. To view a copy of this license, visit <a class="ulink" href="http://creativecommons.org/licenses/by-nc-sa/3.0/" target="_top">http://creativecommons.org/licenses/by-nc-sa/3.0/</a> or send a
	  letter to Creative Commons, 171 Second Street, Suite 300, San
	  Francisco, California, 94105, USA.
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_copyright"></a>1.6.3. 
	Copyright
      </h3></div></div></div><p>
	© 1997-2000 Deutsches Krebsforschungszentrum Heidelberg -- Dept.
	of Molecular Biophysics and Bastien Chevreux (for MIRA) and Thomas
	Pfisterer (for EdIt)
      </p><p>
	© 2001-2014 Bastien Chevreux.
      </p><p>
	All rights reserved.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_external_libraries"></a>1.6.4. 
	External libraries
      </h3></div></div></div><p>
	MIRA uses the excellent Expat library to parse XML files. Expat is Copyright
	©  1998, 1999, 2000 Thai Open Source Software Center Ltd and Clark
	Cooper as well as Copyright ©
	2001, 2002 Expat maintainers.
      </p><p>
	See <a class="ulink" href="http://www.libexpat.org/" target="_top">http://www.libexpat.org/</a> and
	<a class="ulink" href="http://sourceforge.net/projects/expat/" target="_top">http://sourceforge.net/projects/expat/</a> for more information on Expat.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_getting_help___mailing_lists___reporting_bugs"></a>1.7. 
      Getting help / Mailing lists / Reporting bugs
    </h2></div></div></div><p>
      Please try to find an answer to your question by first reading the
      documents provided with the MIRA package (FAQs, READMEs, usage guide,
      guides for specific sequencing technologies etc.). It's a lot, but then
      again, they hopefully should cover 90% of all questions.
    </p><p>
      If you have a tough nut to crack or simply could not find what you were
      searching for, you can subscribe to the MIRA talk mailing list and send
      in your question (or comment, or suggestion), see <a class="ulink" href="http://www.chevreux.org/mira_mailinglists.html" target="_top">http://www.chevreux.org/mira_mailinglists.html</a> for more
      information on that. Now that the number of subscribers has reached a
      good level, there's a fair chance that someone could answer your
      question before I have the opportunity or while I'm away from mail for a
      certain time.
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	Please very seriously consider using the mailing list before mailing
	me directly. Every question which can be answered by participants of
	the list is time I can invest in development and documentation of
	MIRA. I have a day job as a bioinformatician which has nothing to do
	with MIRA, and after-work hours are rare enough nowadays.
      </p><p>
	Furthermore, Google indexes the mailing list and every discussion /
	question asked on the mailing list helps future users as they show up
	in Google searches.
      </p><p>
	Only mail me directly (bach@chevreux.org) if you feel that there's
	some information you absolutely do not want to share publicly.
      </p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      Subscribing to the list <span class="emphasis"><em>before sending mails to it </em></span>
      is necessary as messages from non-subscribers will be stopped by the
      system to keep the spam level low.
    </td></tr></table></div><p>
      To report bugs or ask for new features, please use the SourceForge
      ticketing system at: <a class="ulink" href="http://sourceforge.net/p/mira-assembler/tickets/" target="_top">http://sourceforge.net/p/mira-assembler/tickets/</a>. This ensures
      that requests do not get lost <span class="bold"><strong>and</strong></span> you
      get the additional benefit of automatically knowing when a bug has been
      fixed. I will not send separate emails; that's what bug trackers are
      there for.
    </p><p>
      Finally, new or intermediate versions of MIRA will be announced on the
      separate MIRA announce mailing list. Traffic is very low there as the
      only one who can post there is me. Subscribe if you want to be informed
      automatically about new releases of MIRA.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_author"></a>1.8. 
      Author
    </h2></div></div></div><p>
      Bastien Chevreux (mira): <code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code>
    </p><p>
      WWW:  <a class="ulink" href="http://www.chevreux.org/" target="_top">http://www.chevreux.org/</a>
    </p><p>
      MIRA can use automatic editing routines for Sanger sequences which were
      written by Thomas Pfisterer (EdIt):
      <code class="email">&lt;<a class="email" href="mailto:t.pfisterer@dkfz-heidelberg.de">t.pfisterer@dkfz-heidelberg.de</a>&gt;</code>
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_miscellaneous"></a>1.9. 
      Miscellaneous
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_citations"></a>1.9.1. 
	Citing MIRA
      </h3></div></div></div><p>
	Please use these citations:
      </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	    For <span class="command"><strong>mira</strong></span>
	  </span></dt><dd><p>
	      Chevreux, B., Wetter, T. and Suhai, S. (1999): <span class="emphasis"><em>Genome
	      Sequence Assembly Using Trace Signals and Additional Sequence
	      Information</em></span>. Computer Science and Biology:
	      Proceedings of the German Conference on Bioinformatics (GCB) 99,
	      pp. 45-56.
	    </p></dd><dt><span class="term">
	    For <span class="command"><strong>miraSearchESTSNPs</strong></span> (was named
	    <span class="command"><strong>miraEST</strong></span> in earlier times)
	  </span></dt><dd><p> Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J.,
	    Müller, W. E., Wetter, T. and Suhai, S. (2004): <span class="emphasis"><em>Using
	    the miraEST Assembler for Reliable and Automated mRNA Transcript
	    Assembly and SNP Detection in Sequenced ESTs</em></span>. Genome
	    Research, 14(6)
	    </p></dd></dl></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_postcards_gold_and_jewellery"></a>1.9.2. 
	Postcards, gold and jewellery
      </h3></div></div></div><p>
	If you find this software useful, please send the author a postcard. If
	postcards are not available, a treasure chest full of Spanish doubloons, gold
	and jewellery will do nicely, thank you.
      </p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_installation"></a>Chapter 2. Installing MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_install_wheretofetch">2.1. 
      Where to fetch MIRA
    </a></span></dt><dt><span class="sect1"><a href="#sect_install_precompiledbinary">2.2. 
      Installing from a precompiled binary package
    </a></span></dt><dt><span class="sect1"><a href="#sect_install_third_party_integration">2.3. 
      Integration with third party programs (gap4, consed)
    </a></span></dt><dt><span class="sect1"><a href="#sect_install_compiling">2.4. 
      Compiling MIRA yourself
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_comp_prereq">2.4.1. 
	Prerequisites
      </a></span></dt><dt><span class="sect2"><a href="#sect_install_comp_comp">2.4.2. 
	Compiling and installing
      </a></span></dt><dt><span class="sect2"><a href="#sect_install_comp_conf">2.4.3. 
	Configure switches for MIRA
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_install_comp_conf_boost">2.4.3.1. 
	  BOOST configure switches for MIRA
	</a></span></dt><dt><span class="sect3"><a href="#sect_install_comp_conf_mira">2.4.3.2. 
	  MIRA specific configure switches
	</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_install_walkthroughs">2.5. 
      Installation walkthroughs
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_walkthroughs_kubuntu">2.5.1. 
	(K)Ubuntu 12.04
      </a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_opensuse">2.5.2. 
	openSUSE 12.1
      </a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_fedora">2.5.3. 
	Fedora 17
      </a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_osx">2.5.4. 
	Mac OSX
      </a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_allfromscratch">2.5.5. 
	Compile everything from scratch
      </a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_dynamic">2.5.6. 
	Dynamically linked MIRA
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_install_hintotherplatforms">2.6. 
      Compilation hints for other platforms.
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_hintnetbsd5">2.6.1. 
	NetBSD 5 (i386)
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_install_notesformaintainers">2.7. 
      Notes for distribution maintainers / system administrators
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_additionaldatafiles">2.7.1. 
	Additional data files
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">A problem can be found to almost every solution.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_wheretofetch"></a>2.1. 
      Where to fetch MIRA
    </h2></div></div></div><p>
      SourceForge: <a class="ulink" href="http://sourceforge.net/projects/mira-assembler/" target="_top">http://sourceforge.net/projects/mira-assembler/</a>
    </p><p>
      There you will normally find a couple of precompiled binaries -- usually
      for Linux and Mac OSX -- or the source package for compiling yourself.
    </p><p>
      Precompiled binary packages are named in the following way:
    </p><p>
      <code class="filename">mira_<em class="replaceable"><code>miraversion</code></em>_<em class="replaceable"><code>OS-and-binarytype</code></em>.tar.bz2</code>
    </p><p>
      where
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  For <code class="filename"><em class="replaceable"><code>miraversion</code></em></code>, the
	  stable versions of MIRA with the general public as audience usually
	  have a version number in three parts, like
	  <code class="filename">3.0.5</code>, sometimes also followed by some postfix
	  like in <code class="filename">3.2.0rc1</code> to denote release candidate 1
	  of the 3.2.0 version of MIRA. On very rare occasions, stable
	  versions of MIRA can have four parts, as in, e.g.,
	  <code class="filename">3.4.0.1</code>: these versions create binaries
	  identical to those of their parent version (<code class="filename">3.4.0</code>) and
	  just contain fixes to the source build machinery.
	</p><p>
	  The version string sometimes can have a different format:
	  <code class="filename"><span class="emphasis"><em>sometext</em></span>-0-g<span class="emphasis"><em>somehexnumber</em></span></code>
	  like in, e.g.,
	  <code class="filename">ft_fastercontig-0-g4a27c91</code>. These versions of
	  MIRA are snapshots from the development tree of MIRA and usually
	  contain new functionality which may not be as well tested as the
	  rest of MIRA; hence they contain more checks and more debugging output
	  to catch potential errors.
	</p></li><li class="listitem"><p>
	  <code class="filename"><em class="replaceable"><code>OS-and-binarytype</code></em></code>
	  finally defines for which operating system and which processor class
	  the package is destined. E.g.,
	  <code class="filename">linux-gnu_x86_64_static</code> contains static
	  binaries for Linux running on a 64 bit processor.
	</p></li></ul></div><p>
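      As a rough illustration of this naming scheme, the following sketch
      splits a package name into its parts with plain POSIX shell parameter
      expansion. It assumes a stable release whose version string contains no
      underscore (so not a development snapshot); the variable names are
      purely illustrative:
    </p><pre class="screen">
<strong class="userinput"><code>pkg=mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2
base=${pkg%.tar.bz2}     # strip the archive suffix
rest=${base#mira_}       # strip the leading "mira_"
version=${rest%%_*}      # everything before the first underscore
platform=${rest#*_}      # everything after the first underscore
echo "$version $platform"</code></strong>
3.0.5 prod_linux-gnu_x86_64_static</pre><p>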
      Source packages are usually named
    </p><p>
      <code class="filename">mira-<em class="replaceable"><code>miraversion</code></em>.tar.bz2</code>
    </p><p>
      Examples for packages at SourceForge:
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><code class="filename">mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira_3.0.5_prod_linux-gnu_i686_32_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira_3.0.5_prod_OSX_snowleopard_x86_64_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira-3.0.5.tar.bz2</code></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_precompiledbinary"></a>2.2. 
      Installing from a precompiled binary package
    </h2></div></div></div><p>
      The distributable package follows the
      one-directory-which-contains-everything-which-is-needed philosophy, but
      after unpacking and moving the package to its final destination, you
      need to run a script which will create some data files.
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  Download the package, unpack it.
	</p></li><li class="listitem"><p>
	  Move the directory to a suitable place on your disk: either one of the
	  "standard" places like, e.g., <code class="filename">/opt/mira</code> or
	  <code class="filename">/usr/local/mira</code>, or somewhere in your home
	  directory.
	</p></li><li class="listitem"><p>
	  Softlink the binaries which are in the 'bin' directory into a
	  directory which is in your shell PATH. Then have the shell reload
	  the location of PATH binaries (either <code class="literal">hash -r</code>
	  for sh/bash or <code class="literal">rehash</code> for csh/tcsh).
	</p><p>
	  Alternatively, add the <code class="filename">bin</code> directory of the
	  MIRA package to your PATH variable.
	</p></li><li class="listitem"><p>
	  Test whether the binaries are installed ok via <code class="literal">mirabait
	  -v</code> which should return with the current version you
	  downloaded and installed.
	</p></li><li class="listitem"><p>
	  Now you need to run a script which will unpack and reformat some
	  data needed by MIRA. That script is located in the
	  <code class="filename">dbdata</code> directory of the package and should
	  be called with the name of the <span class="emphasis"><em>SLS</em></span> file present
	  in the same directory like this:
	</p><pre class="screen">
<code class="prompt">arcadia:/path/to/mirapkg$</code> <strong class="userinput"><code>cd dbdata</code></strong>
<code class="prompt">arcadia:/path/to/mirapkg/dbdata$</code> <strong class="userinput"><code>ls -l</code></strong>
drwxr-xr-x 3 bach bach     4096 2016-03-18 14:31 mira-createsls
-rwxr-xr-x 1 bach bach     2547 2015-12-14 04:33 mira-install-sls-rrna.sh
-rw-r--r-- 1 bach bach      337 2016-01-01 14:50 README.txt
lrwxrwxrwx 1 bach bach 10421035 2016-03-18 14:28 rfam_rrna-21-12.sls.gz
<code class="prompt">arcadia:/path/to/mirapkg/dbdata$</code> <strong class="userinput"><code>./mira-install-sls-rrna.sh rfam_rrna-21-12.sls.gz</code></strong></pre><p>
	  This will take a minute or so. Then you're done for MIRA.
        </p></li></ol></div><p>
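      Steps 3 and 4 above might look like the following sketch. It assumes
      that the package was moved to <code class="filename">/opt/mira</code> and that
      <code class="filename">$HOME/bin</code> exists and is already part of your PATH; both
      locations are only examples:
    </p><pre class="screen">
<strong class="userinput"><code># softlink all MIRA binaries into a directory which is in PATH
ln -s /opt/mira/bin/* "$HOME/bin/"
# sh/bash: forget cached binary locations (csh/tcsh: rehash)
hash -r
# should print the version you just installed
mirabait -v</code></strong></pre><p>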
      Additional scripts for special purposes are in the
      <code class="filename">scripts</code> directory. You might or might not want to
      have them in your $PATH.
    </p><p>
      Scripts and programs for MIRA from other authors are in the
      <code class="filename">3rdparty</code> directory. Here too, you may or may not
      want to have (some of them) in your $PATH.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_third_party_integration"></a>2.3. 
      Integration with third party programs (gap4, consed)
    </h2></div></div></div><p>
      MIRA sets tags in the assemblies that can be read and interpreted by the
      Staden <span class="command"><strong>gap4</strong></span> package or
      <span class="command"><strong>consed</strong></span>. These tags are extremely useful to
      efficiently find places of interest in an assembly (be it de-novo or
      mapping), but both <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>consed</strong></span>
      need to be told about these tags.
    </p><p>
      Data files for a correct integration are delivered in the
      <code class="filename">support</code> directory of the distribution. Please
      consult the README in that directory for more information on how to
      integrate this information in either of these packages.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_compiling"></a>2.4. 
      Compiling MIRA yourself
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_prereq"></a>2.4.1. 
	Prerequisites
      </h3></div></div></div><p>
	Compiling the 5.x series of MIRA needs a C++14 compatible tool chain, i.e.,
	systems starting from 2013/2014 should be OK. The
	requisites for <span class="emphasis"><em>compiling</em></span> MIRA are:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    gcc &#8805; 4.9.1, with libstdc++6. You really want to use a simple
	    installation package pre-configured for your system, but in case you
	    want or have to install gcc yourself, please refer to <a class="ulink" href="http://gcc.gnu.org/" target="_top">http://gcc.gnu.org/</a> for more information on the GNU compiler
	    collection.
	  </p></li><li class="listitem"><p>
	    BOOST library &#8805; 1.48. Lower versions might work, but are
	    untested; you would need to change the check in the configure
	    script for them to be accepted. You really want to use a simple
	    installation package pre-configured for your system, but in case you
	    want or have to install BOOST yourself, please refer to <a class="ulink" href="http://www.boost.org/" target="_top">http://www.boost.org/</a> for more information on the BOOST
	    library.
	  </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	    Do NOT use a so-called <span class="emphasis"><em>staged</em></span> BOOST library;
	    that will not work.
	  </td></tr></table></div></li><li class="listitem">
	  zlib. Should your system not have zlib installed or available as
	  simple installation package, please see <a class="ulink" href="http://www.zlib.net/" target="_top">http://www.zlib.net/</a> for more information regarding zlib.
	</li><li class="listitem">
	  GNU make. Should your system not have gmake installed or available
	  as simple installation package, please see <a class="ulink" href="http://www.gnu.org/software/make/" target="_top">http://www.gnu.org/software/make/</a> for more information regarding
	  GNU make.
	</li><li class="listitem">
	  GNU flex &#8805; 2.5.33. Should your system not have flex installed or
	  available as simple installation package, please see <a class="ulink" href="http://flex.sourceforge.net/" target="_top">http://flex.sourceforge.net/</a> for more information regarding
	  flex.
	</li><li class="listitem">
	  Expat library &#8805; 2.0.1. Should your system not have the Expat library and
	  header files already installed or available as simple installation
	  package, you will need to download and install it yourself. Please see
	  <a class="ulink" href="http://www.libexpat.org/" target="_top">http://www.libexpat.org/</a> and <a class="ulink" href="http://sourceforge.net/projects/expat/" target="_top">http://sourceforge.net/projects/expat/</a> for information on how
	  to do this.
	</li><li class="listitem">
	  xxd. A small utility from the <span class="command"><strong>vim</strong></span> package.
	</li></ul></div><p>
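	Whether an installed tool satisfies one of the minimum versions above
	can be checked with a small helper like the one sketched here. The
	function name is made up for this example, and the sketch assumes a
	<span class="command"><strong>sort</strong></span> which understands the
	<code class="literal">-V</code> (version sort) option, as GNU sort does:
      </p><pre class="screen">
<strong class="userinput"><code># succeeds if version $1 is at least version $2
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# gcc 7 and later print only the major version with -dumpversion,
# hence both options are given
gccversion=$(gcc -dumpfullversion -dumpversion)
version_ge "$gccversion" 4.9.1 || echo "gcc $gccversion is too old for MIRA"</code></strong></pre><p>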
	For <span class="emphasis"><em>building the documentation</em></span>, additional
	prerequisites are from the DocBook tool chain:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
	  xsltproc + docbook-xsl for HTML output
	</li><li class="listitem">
	  dblatex for PDF output
	</li></ul></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
	  Previous versions of MIRA benefited from using the TCMalloc
	  library. This is not the case anymore! Indeed, tests showed that when
	  using TCMalloc, MIRA 4.9.x and above will probably need 20 to
	  30% <span class="emphasis"><em>more</em></span> max memory and up to 80% more overall
	  memory than without TCMalloc.
	</p><p>
	  In short: do not use TCMalloc at the moment.
	</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_comp"></a>2.4.2. 
	Compiling and installing
      </h3></div></div></div><p>
	MIRA uses the GNU autoconf/automake tools, please read the section
	"Basic Installation" of the <code class="filename">INSTALL</code> file in the
	source package of MIRA for more generic information on how to invoke
	them.
      </p><p>
	The short version: simply type
      </p><pre class="screen">
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>./configure</code></strong>
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>make</code></strong>
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>make install</code></strong></pre><p>
	This should install the following programs:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><span class="command"><strong>mira</strong></span></li><li class="listitem"><span class="command"><strong>miraconvert</strong></span></li><li class="listitem"><span class="command"><strong>mirabait</strong></span></li><li class="listitem"><span class="command"><strong>miramem</strong></span></li></ul></div><p>
	Should the <code class="literal">./configure</code> step fail for some reason or
	another, you should get a message telling you at which step this
	happens. Then either install the missing packages or tell
	<span class="command"><strong>configure</strong></span> where it should search for the packages it
	did not find; see also the next section.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_conf"></a>2.4.3. 
	Configure switches for MIRA
      </h3></div></div></div><p>
	MIRA understands all standard autoconf configure switches like <code class="literal">--prefix=</code>
	etc. Please consult the INSTALL file in the MIRA top level directory
	of the source package and also call <code class="literal">./configure
	--help</code> to get a full list of currently supported switches.
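      </p><p>
	As an example, and assuming you want a private installation below the
	hypothetical directory <code class="filename">$HOME/sw/mira</code> instead of the
	system wide default location, the build could be sketched as:
      </p><pre class="screen">
<strong class="userinput"><code>./configure --prefix="$HOME/sw/mira"
make &amp;&amp; make install</code></strong></pre><p>
	With the usual autoconf conventions, the programs should then end up
	in <code class="filename">$HOME/sw/mira/bin</code>.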
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_install_comp_conf_boost"></a>2.4.3.1. 
	  BOOST configure switches for MIRA
	</h4></div></div></div><p>
	  BOOST is maybe the most tricky library to get right in case it does
	  not come pre-configured for your system. The two main switches for
	  helping to locate BOOST are
	  probably <code class="literal">--with-boost=[ARG]</code>
	  and <code class="literal">--with-boost-libdir=LIB_DIR</code>. Only if those
	  two fail, try using the other <code class="literal">--with-boost-*=</code> switches
	  you will see from the ./configure help text.
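	</p><p>
	  For example, with a BOOST installation below the hypothetical
	  directory <code class="filename">/opt/boost</code> (both paths are placeholders,
	  adapt them to your system), a call using these two switches might
	  look like this:
	</p><pre class="screen">
<strong class="userinput"><code>./configure --with-boost=/opt/boost --with-boost-libdir=/opt/boost/lib</code></strong></pre><p>
	  The first switch names the installation root of BOOST, the second the
	  directory which holds the compiled BOOST libraries.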
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_install_comp_conf_mira"></a>2.4.3.2. 
	  MIRA specific configure switches
	</h4></div></div></div><p>
	  MIRA honours the following switches:
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      --enable-64=yes/no
	    </span></dt><dd><p>
		MIRA should happily build as 32 bit executable on 32 bit
		platforms and as 64 bit executable on 64 bit platforms. On 64
		bit platforms, setting the switch to 'no' forces the compiler
		to produce 32 bit executables (if possible).
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		As of MIRA 3.9.0, support for 32 bit platforms is being
		slowly phased out. While MIRA should compile and also run fine
		on 32 bit platforms, I do not guarantee it anymore as I
		haven't used 32 bit systems in the last 5 years.
	      </td></tr></table></div></dd><dt><span class="term">
	      --enable-warnings
	    </span></dt><dd>
	      Enables compiler warnings, useful only for developers, not for users.
	    </dd><dt><span class="term">
	      --enable-debug
	    </span></dt><dd>
	      Lets the MIRA binary contain C/C++ debug symbols.
	    </dd><dt><span class="term">
	      --enable-mirastatic
	    </span></dt><dd>
	      Builds static binaries which are easier to distribute. Some
	      platforms (like OpenSolaris) might not like this and you will
	      get an error from the linker.
	    </dd><dt><span class="term">
	      --enable-optimisations
	    </span></dt><dd>
	      Instructs the configure script to set optimisation switches for compiling
	      (on by default). Switching optimisations off (warning, high impact on
	      run-time) might be interesting only for, e.g, debugging with valgrind.
	    </dd><dt><span class="term">
	      --enable-publicquietmira
	    </span></dt><dd>
	      Some parts of MIRA can dump additional debug information during
	      assembly; setting this switch to "no" enables this. Warning:
	      MIRA will then be a bit chatty, so this is not recommended for
	      public usage.
	    </dd><dt><span class="term">
	      --enable-developmentversion
	    </span></dt><dd>
	      Using MIRA with enabled development mode may lead to extra
	      output on stdout as well as some additional data in the results
	      which should not appear in real world data.
	    </dd><dt><span class="term">
	      --enable-boundtracking
	    </span></dt><dd></dd><dt><span class="term">
	      --enable-bugtracking
	    </span></dt><dd>
	      Both flags above compile some basic checks into MIRA that
	      look for sanity within some functions. Leaving them on "yes"
	      (default) is encouraged; the impact on run time is minimal.
	    </dd></dl></div></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_walkthroughs"></a>2.5. 
      Installation walkthroughs
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_kubuntu"></a>2.5.1. 
	(K)Ubuntu 12.04
      </h3></div></div></div><p>
	You will need to install a couple of tools and libraries before
	compiling MIRA. Here's the recipe:
      </p><pre class="screen">
<strong class="userinput"><code>sudo apt-get install make flex
sudo apt-get install libboost-doc libboost.*1.48-dev libboost.*1.48.0</code></strong></pre><p>
	Once this is done, you can unpack and compile MIRA. For a dynamically
	linked version, use:
      </p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make &amp;&amp; make install</code></strong></pre><p>
	For a statically linked version, just change the configure line from
	above into
      </p><pre class="screen">
<strong class="userinput"><code>./configure <em class="replaceable"><code>--enable-mirastatic</code></em></code></strong></pre><p>
	In case you also want to build documentation yourself, you will need
	this in addition:
      </p><pre class="screen"><strong class="userinput"><code>sudo apt-get install xsltproc docbook-xsl dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  People working on git checkouts of the MIRA source code will
	  obviously need some more tools. Get them with this:
	</p><pre class="screen"><strong class="userinput"><code>sudo apt-get install automake libtool xutils-dev</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_opensuse"></a>2.5.2. 
	openSUSE 12.1
      </h3></div></div></div><p>
	You will need to install a couple of tools and libraries before
	compiling MIRA. Here's the recipe:
      </p><pre class="screen">
<strong class="userinput"><code>sudo zypper install gcc-c++ boost-devel
sudo zypper install flex libexpat-devel zlib-devel</code></strong></pre><p>
	Once this is done, you can unpack and compile MIRA. For a dynamically
	linked version, use:
      </p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make &amp;&amp; make install</code></strong></pre><p>
	In case you also want to build documentation yourself, you will need
	this in addition:
      </p><pre class="screen"><strong class="userinput"><code>sudo zypper install docbook-xsl-stylesheets dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  People working on git checkouts of the MIRA source code will
	  obviously need some more tools. Get them with this:
	</p><pre class="screen"><strong class="userinput"><code>sudo zypper install automake libtool xutils-dev</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_fedora"></a>2.5.3. 
	Fedora 17
      </h3></div></div></div><p>
	You will need to install a couple of tools and libraries before
	compiling MIRA. Here's the recipe:
      </p><pre class="screen">
<strong class="userinput"><code>sudo yum -y install gcc-c++ boost-devel
sudo yum install flex expat-devel vim-common zlib-devel</code></strong></pre><p>
	Once this is done, you can unpack and compile MIRA. For a dynamically
	linked version, use:
      </p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make &amp;&amp; make install</code></strong></pre><p>
	In case you also want to build documentation yourself, you will need
	this in addition:
      </p><pre class="screen"><strong class="userinput"><code>sudo yum -y install docbook-xsl dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  People working on git checkouts of the MIRA source code will
	  obviously need some more tools. Get them with this:
	</p><pre class="screen"><strong class="userinput"><code>sudo yum -y install automake libtool xorg-x1-util-devel</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_osx"></a>2.5.4. 
	Mac OSX
      </h3></div></div></div><p>
	These instructions are for OSX 10.11 (El Capitan) and use
	MacPorts. There are other ways to do this (e.g., see the "Compile
	everything from scratch" walkthrough), but they are definitely more painful.
      </p><p>
	If you do not already have it, install MacPorts. See <a class="ulink" href="https://www.macports.org/install.php" target="_top">https://www.macports.org/install.php</a>. Then have the port
	system fetch information about the newest ports (can take a while):
      </p><pre class="screen">
<strong class="userinput"><code>sudo port selfupdate</code></strong>
</pre><p>
	Then go on and install gcc (this is going to take a long time) and
	then switch to gcc5:
      </p><pre class="screen">
<strong class="userinput"><code>sudo port install m4 gcc5</code></strong>
<strong class="userinput"><code>sudo port select --set gcc mp-gcc5</code></strong>
</pre><p>
	Now, the libraries you need to download and compile need to be
	installed somewhere. You can use a path in your home directory or any
	other path on the system you have write access to. For the sake of this
	walkthrough, we'll continue with
	<code class="filename">/opt/biosw/gccchain</code>.
      </p><p>
	Download and install a current flex. Use at least 2.6.0. If for some
	reason you need to use flex 2.5.38 or 2.5.39, take care to apply the
	patch described here: <a class="ulink" href="https://sourceforge.net/p/flex/bugs/182/" target="_top">https://sourceforge.net/p/flex/bugs/182/</a>. Configure flex to be
	installed into the directory you chose in the previous step:
      </p><pre class="screen">
<strong class="userinput"><code>tar xvf flex-2.6.0.tar.bz2</code></strong>
<strong class="userinput"><code>cd flex-2.6.0</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make</code></strong>
<strong class="userinput"><code>make install</code></strong>
</pre><p>
	That done, proceed likewise with the expat and zlib libraries:
      </p><pre class="screen">
<strong class="userinput"><code>tar xvf expat-2.1.0.tar.gz</code></strong>
<strong class="userinput"><code>cd expat-2.1.0</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make</code></strong>
<strong class="userinput"><code>make install</code></strong>
<strong class="userinput"><code>cd ..</code></strong>
<strong class="userinput"><code>tar xvf zlib-1.2.8.tar.gz</code></strong>
<strong class="userinput"><code>cd zlib-1.2.8</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
<strong class="userinput"><code>make install</code></strong>
</pre><p>
	The bzip2 library needs a different installation command line:
      </p><pre class="screen">
<strong class="userinput"><code>tar xvf bzip2-1.0.6.tar.gz</code></strong>
<strong class="userinput"><code>cd bzip2-1.0.6</code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
<strong class="userinput"><code>make install PREFIX=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
</pre><p>
	The last library to be installed for the MIRA compilation is Boost:
      </p><pre class="screen">
<strong class="userinput"><code>tar xvf boost_1_59_0.tar.bz2</code></strong>
<strong class="userinput"><code>cd boost_1_59_0</code></strong>
<strong class="userinput"><code>./bootstrap.sh --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>./b2 -j 4</code></strong>
<strong class="userinput"><code>./b2 install</code></strong>
</pre><p>
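	Before configuring MIRA, it can help to put the tools you just built
	first in your search path, so that configure does not fall back to an
	older system flex. The lines below are a minimal sketch, assuming the
	<code class="filename">/opt/biosw/gccchain</code> prefix used in this
	walkthrough; the from-scratch walkthrough further down does the same for
	its own flex directory.
      </p><pre class="screen">
# Put the freshly built tools first in the search path
export PATH=/opt/biosw/gccchain/bin:$PATH
# Show which flex will be used (the path printed depends on your system)
command -v flex || echo "flex not found in PATH"
</pre><p>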
	Now unpack MIRA, configure it and compile. Remember to give the configure
	script the location of every package you just installed, or else it
	might pick up a version installed by the system (and compiled with
	a different compiler), which would invariably lead to errors in the
	linker stage of the compilation.
      </p><pre class="screen">
<strong class="userinput"><code>tar xvf mira-5.0.0.tar.bz2</code></strong>
<strong class="userinput"><code>cd mira-5.0.0</code></strong>
<strong class="userinput"><code>./configure --enable-debug \
  --with-boost=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
  --with-boost-libdir=<em class="replaceable"><code>/opt/biosw/gccchain</code></em>/lib \
  --with-expat=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
  --with-zlib=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
</pre><p>
	That's it for the dynamic version.
      </p><p>
	For building an almost static version, we need some trickery: after
	running configure (this time with the mirastatic argument), create a
	special directory <code class="filename">OSXstatlibs</code> into which we
	softlink all static libraries MIRA needs. This directory will be
	searched first by the build scripts generated by the
	<span class="command"><strong>libtool</strong></span> suite during the linking stage of MIRA.
      </p><pre class="screen">
<strong class="userinput"><code>./configure --enable-mirastatic --enable-debug \
  --with-boost=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
  --with-boost-libdir=<em class="replaceable"><code>/opt/biosw/gccchain</code></em>/lib \
  --with-expat=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
  --with-zlib=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>mkdir OSXstatlibs</code></strong>
<strong class="userinput"><code>cd OSXstatlibs</code></strong>
<strong class="userinput"><code>ln -s /opt/biosw/gccchain/lib/*a</code></strong>
<strong class="userinput"><code>ln -s /opt/local/lib/*a</code></strong>
</pre><p>
	Note that <code class="filename">/opt/local</code> is the standard installation
	path of the MacPorts programs. If you changed that, you need to adapt
	it here, too.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_allfromscratch"></a>2.5.5. 
	Compile everything from scratch
      </h3></div></div></div><p>
	This lets you build a self-contained static MIRA binary. The only
	prerequisite here is that you have a working <span class="command"><strong>gcc</strong></span>
	with the minimum version described above. Please download all
	necessary files (expat, flex, etc.) and then simply follow the
	script below. The only things you will want to change are the
	paths used and, maybe, the names of some packages in case they were
	bumped up a version or revision.
      </p><p>
	Contributed by Sven Klages.
      </p><pre class="screen">
## whatever path is appropriate
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em></code></strong>

## expat
<strong class="userinput"><code>tar zxvf <em class="replaceable"><code>expat-2.0.1.tar.gz</code></em>
cd <em class="replaceable"><code>expat-2.0.1</code></em>
./configure <em class="replaceable"><code>--prefix=/home/gls/SvenTemp/expat</code></em>
make &amp;&amp; make install</code></strong>

## flex
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>flex-2.5.35.tar.gz</code></em>
cd <em class="replaceable"><code>flex-2.5.35</code></em>
./configure <em class="replaceable"><code>--prefix=/home/gls/SvenTemp/flex</code></em>
make &amp;&amp; make install
cd <em class="replaceable"><code>/home/gls/SvenTemp/flex/bin</code></em>
ln -s flex flex++
export PATH=<em class="replaceable"><code>/home/gls/SvenTemp/flex/bin</code></em>:$PATH</code></strong>

## boost
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>boost_1_48_0.tar.gz</code></em>
cd <em class="replaceable"><code>boost_1_48_0</code></em>
./bootstrap.sh --prefix=<em class="replaceable"><code>/home/gls/SvenTemp/boost</code></em>
./b2 install</code></strong>

## mira itself
<strong class="userinput"><code>export CXXFLAGS="-I<em class="replaceable"><code>/home/gls/SvenTemp/flex/include</code></em>"

cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>mira-3.4.0.1.tar.gz</code></em>
cd <em class="replaceable"><code>mira-3.4.0.1</code></em>
./configure --prefix=<em class="replaceable"><code>/home/gls/SvenTemp/mira</code></em> \
--with-boost=<em class="replaceable"><code>/home/gls/SvenTemp/boost</code></em> \
--with-expat=<em class="replaceable"><code>/home/gls/SvenTemp/expat</code></em> \
--enable-mirastatic
make &amp;&amp; make install</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_dynamic"></a>2.5.6. 
	Dynamically linked MIRA
      </h3></div></div></div><p>
	In case you do not want a static binary of MIRA, but a dynamically
	linked version, the following script by Robert Bruccoleri will give
	you an idea of how to do this.
      </p><p>
	Note that he, having root rights, puts all additional software in
	/usr/local, and in particular, he keeps updated versions of Boost and
	Flex there.
      </p><pre class="screen">
#!/bin/sh -x

make distclean
oze=`find . -name "*.o" -print`
if [ -n "$oze" ]
then
   echo "Not clean."
   exit 1

fi

export prefix=${BUILD_PREFIX:-/usr/local}
export LDFLAGS="-Wl,-rpath,$prefix/lib"

./configure --prefix=$prefix \
           --enable-debug=yes \
           --enable-mirastatic=no \
           --with-boost-libdir=$prefix/lib \
           --enable-optimisations \
           --enable-boundtracking=yes \
           --enable-bugtracking=yes \
           --enable-extendedbugtracking=no
make
make install</pre></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_hintotherplatforms"></a>2.6. 
      Compilation hints for other platforms.
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_hintnetbsd5"></a>2.6.1. 
	NetBSD 5 (i386)
      </h3></div></div></div><p>
	Contributed by Thomas Vaughan
      </p><p>
	The system flex <span class="emphasis"><em>(/usr/bin/flex)</em></span> is too old, but the
	devel/flex package from a recent pkgsrc works fine. BSD make doesn't
	like one of the lines in <span class="emphasis"><em>src/progs/Makefile</em></span>, so use GNU make instead
	(available from <span class="emphasis"><em>pkgsrc</em></span> as <span class="emphasis"><em>devel/gmake</em></span>). Other relevant pkgsrc packages:
	<span class="emphasis"><em>devel/boost-libs</em></span>, <span class="emphasis"><em>devel/boost-headers</em></span>
	and <span class="emphasis"><em>textproc/expat</em></span>. The configure script has to
	be told about these pkgsrc prerequisites (they are usually rooted
	at <span class="emphasis"><em>/usr/pkg</em></span> but other locations are possible):
      </p><pre class="screen"><strong class="userinput"><code>FLEX=/usr/pkg/bin/flex ./configure --with-expat=/usr/pkg --with-boost=/usr/pkg</code></strong></pre><p>
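	Put together, a build on NetBSD might look like the sketch below. It
	assumes that GNU make from <span class="emphasis"><em>devel/gmake</em></span> is installed
	under its usual name <span class="command"><strong>gmake</strong></span>; adapt the paths if your
	pkgsrc tree is not rooted at <span class="emphasis"><em>/usr/pkg</em></span>.
      </p><pre class="screen">
# configure with the pkgsrc flex, expat and boost
FLEX=/usr/pkg/bin/flex ./configure --with-expat=/usr/pkg --with-boost=/usr/pkg
# build and install with GNU make instead of BSD make
gmake
gmake install
</pre><p>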
	If attempting to build a pkgsrc package of MIRA, note that the LDFLAGS
	passed by the pkgsrc mk files don't remove the need for
	the <span class="emphasis"><em>--with-boost</em></span> option.  The configure script
	complains about flex being too old, but this is harmless because it
	honours the $FLEX variable when writing out makefiles.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_notesformaintainers"></a>2.7. 
      Notes for distribution maintainers / system administrators
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_additionaldatafiles"></a>2.7.1. 
	Additional data files
      </h3></div></div></div><p>
	Depending on options/parameters, the MIRA/mirabait binary may need
	to load some additional data during the run. By default, this data will
	always be searched for at this location:
	<code class="filename">LOCATION_OF_BINARY/../share/mira/...</code>
      </p><p>
	That is: If the binary is, e.g.,
	<code class="filename">/opt/mira5/bin/mira</code> with a softlink pointing from
	<code class="filename">/usr/local/bin/mira -&gt; /opt/mira5/bin/mira</code>
	(because, e.g., <code class="filename">/usr/local/bin</code> may be by default in your
	PATH variable), then the additional data will be searched in
	<code class="filename">/opt/mira5/share/mira/...</code> and NOT in
	<code class="filename">/usr/local/share/mira/...</code>.
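      </p><p>
	The lookup can be mimicked in the shell as shown below. This is only a
	sketch of the rule described above, not MIRA's actual code:
	<code class="literal">mira_share_dir</code> is a hypothetical helper, and the
	<code class="literal">-f</code> option assumes GNU <span class="command"><strong>readlink</strong></span>.
      </p><pre class="screen">
# Print the directory in which a given mira binary would look for its data
mira_share_dir() {
  real=$(readlink -f "$1")      # follow softlinks to the real binary
  bindir=$(dirname "$real")     # directory the real binary lives in
  echo "$(dirname "$bindir")/share/mira"
}

mira_share_dir /usr/local/bin/mira
</pre><p>
	With the softlink from the example above in place, the call reports
	<code class="filename">/opt/mira5/share/mira</code>.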
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	In short: since MIRA 4.9.6, moving the binary alone is no longer
	enough. Take care to have the <span class="emphasis"><em>share</em></span> directory in the
	right place, i.e., adjacent to the directory the MIRA binary lives in.
      </td></tr></table></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_reference"></a>Chapter 3. MIRA 4 reference manual</h1></div><div><h3 class="subtitle"><i>aka: The extended man page of MIRA 4,
  a genome and EST/RNASeq sequence assembly and mapping program for Sanger, 454, IonTorrent,
  PacBio and Illumina/Solexa sequencing data</i></h3></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_ref_synopsis">3.1. 
      Synopsis
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_requirements">3.2. 
      Requirements
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_working_modes">3.3. 
      Working modes
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_config">3.4. 
      Configuring an assembly: files and parameters
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_manifest_introduction">3.4.1. 
	The manifest file: introduction
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_manifest_basics">3.4.2. 
	The manifest file: basics
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_manifest_readgroups">3.4.3. 
	The manifest file: information on the data you have
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_readgroup">3.4.3.1. 
	Starting a new readgroup
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_data">3.4.3.2. 
	Defining data files to load
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_defaultqual">3.4.3.3. 
	Setting default quality
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_technology">3.4.3.4. 
	  Defining technology used to sequence
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_nostatistics">3.4.3.5. 
	  Preventing statistics for technologies with biases
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_asreference">3.4.3.6. 
	Setting reference sequence for mapping jobs
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_autopairing">3.4.3.7. 
	Autopairing: letting MIRA find out pair info by itself
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_templatesize">3.4.3.8. 
	Setting size of read templates
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_segplace">3.4.3.9. 
	Read segment placement
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_segname">3.4.3.10. 
	Read segment naming
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_strainname">3.4.3.11. 
	Strain naming
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_datadirscf">3.4.3.12. 
	Data directory for SCF files
      </a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_renameprefix">3.4.3.13. 
	Renaming read name prefixes
      </a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_manifest_parameters">3.4.4. 
	The manifest file: extended parameters
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_parameter_groups">3.4.4.1. 
	  Parameter groups
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_technology_sections">3.4.4.2. 
	  Technology sections
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_parameter_shortnames">3.4.4.3. 
	  Parameter short names
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_order_dependent_quick_switches">3.4.4.4. 
	  Order dependent quick switches
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_general_ge">3.4.4.5. 
	  Parameter group: -GENERAL (-GE)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_assembly_as">3.4.4.6. 
	  Parameter group: -ASSEMBLY (-AS)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_strain_backbone_sb">3.4.4.7. 
	  Parameter group: -STRAIN/BACKBONE (-SB)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_dataprocessing_dp">3.4.4.8. 
	  Parameter group: -DATAPROCESSING (-DP)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_clipping_cl">3.4.4.9. 
	  Parameter group: -CLIPPING (-CL)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_skim_sk">3.4.4.10. 
	  Parameter group: -SKIM (-SK)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_hashstatistics_hs">3.4.4.11. 
	  Parameter group: -KMERSTATISTICS (-KS)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_align_al">3.4.4.12. 
	  Parameter group: -ALIGN (-AL)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_contig_co">3.4.4.13. 
	  Parameter group: -CONTIG (-CO)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_edit_ed">3.4.4.14. 
	  Parameter group: -EDIT (-ED)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_misc_mi">3.4.4.15. 
	  Parameter group: -MISC (-MI)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_misc_nw">3.4.4.16. 
	  Parameter group: -NAG_AND_WARN (-NW)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_directory_dir_di">3.4.4.17. 
	  Parameter group: -DIRECTORY (-DIR, -DI)
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_output_out">3.4.4.18. 
	  Parameter group: -OUTPUT (-OUT)
	</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_ref_resuming_assemblies">3.5. 
      Resuming / restarting assemblies
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_input_output">3.6. 
      Input / Output
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_directories">3.6.1. 
	Directories
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_filenames">3.6.2. 
	Filenames
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_output">3.6.2.1. 
	  Output
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_assembly_statistics_and_information_files">3.6.2.2. 
	  Assembly statistics and information files
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_file_formats">3.6.3. 
	File formats
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_stdout_stderr">3.6.4. 
	STDOUT/STDERR
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_ssaha2smalt">3.6.5. 
	SSAHA2 / SMALT ancillary data
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_xml_traceinfo">3.6.6. 
	XML TRACEINFO ancillary data
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_contig_naming">3.6.7. 
	Contig naming
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_recovering_strain_specific_consensus">3.6.8. 
	Recovering strain specific consensus as FASTA
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_tags_used_in_the_assembly_by_mira_and_edit">3.7. 
      Tags used in the assembly by MIRA and EdIt
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_tags_read_and_used">3.7.1. 
	Tags read (and used)
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_tags_set_and_used">3.7.2. 
	Tags set (and used)
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_contigs_singlets_debris">3.8. 
      Where reads end up: contigs, singlets, debris
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_snp_discovery">3.9. 
      Detection of bases distinguishing non-perfect repeats and SNP discovery
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_data_reduction">3.10. 
      Data reduction: subsampling vs. lossless digital normalisation
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_caveats">3.11. 
      Caveats
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_using_artificial_reads">3.11.1. 
	Using data not from sequencing instruments: artificial / synthetic reads
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_ploidy_and_repeats">3.11.2. 
	Ploidy and repeats
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_handling_of_repeats">3.11.3. 
	Handling of repeats
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_uniform_read_distribution">3.11.3.1. 
	  Uniform read distribution
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_keeping_'long'_repetitive_contigs_separate">3.11.3.2. 
	  Keeping 'long' repetitive contigs separate
	</a></span></dt><dt><span class="sect3"><a href="#sect_ref_helping_finishing_by_tagging_reads_with_haf_tags">3.11.3.3. 
	  Helping finishing by tagging reads with HAF tags
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_consensus_in_finishing_programs_gap4_consed_">3.11.4. 
	Consensus in finishing programs (gap4, consed, ...)
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_some_other_things_to_consider">3.11.5. 
	Some other things to consider
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_things_you_should_not_do">3.12. 
      Things you should not do
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_never_on_nfs">3.12.1. 
	Do not run MIRA on NFS mounted directories without redirecting the tmp directory
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_never_without_quality_values">3.12.2. 
	Do not assemble without quality values
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_useful_third_party_programs">3.13. 
      Useful third party programs
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_speed_and_memory_considerations">3.14. 
      Speed and memory considerations
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_memory">3.14.1. 
	Estimating needed memory for an assembly project
      </a></span></dt><dt><span class="sect2"><a href="#sect_ref_speed">3.14.2. 
	Some numbers on speed
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_known_problems_bugs">3.15. 
      Known Problems / Bugs
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_todos">3.16. 
      TODOs
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_working_principles">3.17. 
      Working principles
    </a></span></dt><dt><span class="sect1"><a href="#sect_ref_see_also">3.18. 
      See Also
    </a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">The manual only makes sense after you learn the program.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_synopsis"></a>3.1. 
      Synopsis
    </h2></div></div></div><p>
      <code class="literal">mira [-chmMrtv] <em class="replaceable"><code>manifest-file</code></em>  [<em class="replaceable"><code>manifest-file</code></em> ...]</code>
    </p><p>
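      A few typical invocations, as a sketch: the manifest name
      <code class="filename">manifest.conf</code> is an assumption, the options
      are the ones listed below.
    </p><pre class="screen">
# run an assembly, forcing 8 threads regardless of -GE:not in the manifest
mira --thread=8 manifest.conf
# resume that assembly after an interruption (de-novo only)
mira --resume manifest.conf
# print the version and exit
mira --version
</pre><p>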
      The command line parameters in short:
    </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	  [-c / --cwd=<em class="replaceable"><code>directory</code></em>]
	</span></dt><dd>
	  Change working directory.
	</dd><dt><span class="term">
	  [-h / --help]
	</span></dt><dd>
	  Print a short help and exit.
	</dd><dt><span class="term">
	  [-m / --mcheck]
	</span></dt><dd>
	  Only check the manifest file, then exit.
	</dd><dt><span class="term">
	  [-M / --mdcheck]
	</span></dt><dd>
	  Only check the manifest file and presence of data files, then exit.
	</dd><dt><span class="term">
	  [-r / --resume]
	</span></dt><dd>
	  Resume / restart an interrupted assembly. Works only for de-novo
	  assemblies at the moment.
	</dd><dt><span class="term">
	  [-t / --thread=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	</span></dt><dd>
	  Force number of threads (overrides equivalent [-GE:not]
          manifest entry).
	</dd><dt><span class="term">
	  [-v / --version]
	</span></dt><dd>
	  Print version and exit.
	</dd></dl></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_requirements"></a>3.2. 
      Requirements
    </h2></div></div></div><p>
      To use MIRA itself, one doesn't need very much:
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  Sequence data in EXP, CAF, PHD, FASTA or FASTQ format
	</p></li><li class="listitem"><p>
	  Optionally: ancillary information in NCBI traceinfo XML format;
	  ancillary information about strains in tab delimited format, vector
	  screen information generated with <span class="command"><strong>ssaha2</strong></span> or
	  <span class="command"><strong>smalt</strong></span>.
	</p></li><li class="listitem"><p>
	  Some memory and disk space. Actually lots of both if you are
	  venturing into 454 or Illumina.
	</p></li></ul></div><p>
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_working_modes"></a>3.3. 
      Working modes
    </h2></div></div></div><p>
      MIRA has three basic working modes: genome, EST/RNASeq or
      EST-reconstruction-and-SNP-detection. From version 2.4 on, there is
      only one executable which supports all modes. The name with which this
      executable is called defines the working mode:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  <span class="command"><strong>mira</strong></span> for assembly of genomic data as well as
	  assembly of EST data from one or multiple strains / organisms
	</p><p>
	  and
	</p></li><li class="listitem"><p>
	  <span class="command"><strong>miraSearchESTSNPs</strong></span> for assembly of EST data from
	  different strains (or organisms) and SNP detection within this
	  assembly. This is the former <span class="command"><strong>miraEST</strong></span> program
	  which was renamed as many people got confused regarding whether to
	  use MIRA in est mode or miraEST.
	</p></li></ol></div><p>
      Note that <span class="command"><strong>miraSearchESTSNPs</strong></span> is usually realised as
      a link to the <span class="command"><strong>mira</strong></span> executable; the executable
      decides which module to start by the name it was called with.
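    </p><p>
      Such a link can be created with, e.g.,
      <code class="literal">ln -s mira miraSearchESTSNPs</code> in the directory holding the
      executable. The name-based dispatch itself can be pictured with the
      shell sketch below; <code class="literal">working_mode</code> is a hypothetical helper
      for illustration and not part of MIRA.
    </p><pre class="screen">
# Decide on a working mode from the name a program was called with
working_mode() {
  case "$(basename "$1")" in
    miraSearchESTSNPs) echo "EST assembly with SNP detection" ;;
    *)                 echo "genome or EST assembly" ;;
  esac
}

working_mode /usr/local/bin/miraSearchESTSNPs
working_mode /usr/local/bin/mira
</pre><p>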
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_config"></a>3.4. 
      Configuring an assembly: files and parameters
    </h2></div></div></div><p>
      All the configuration needed for an assembly is done in one (or several)
      configuration file(s): the <span class="emphasis"><em>manifest</em></span> files. This
      encompasses things like what kind of assembly you want to perform
      (genome or EST / RNASeq, mapping or de-novo, etc.) or which data files
      contain the sequences you want to assemble (and in which format these
      are).
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_introduction"></a>3.4.1. 
	The manifest file: introduction
      </h3></div></div></div><p>
	A <span class="emphasis"><em>manifest</em></span> file can be seen as a two part
	configuration file for an assembly: the first part contains some
	general information while the second part contains information about
	the sequencing data to be loaded. Examples being always easier to
	follow than long texts, here's an example for a de-novo assembly with
	single-end (also called shotgun) 454 data:
      </p><pre class="screen"># Example for a manifest describing a simple 454 de-novo assembly

# A manifest file can contain comment lines, these start with the #-character

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 threads in parallel (where possible)

<strong class="userinput"><code>
project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em>
parameters = <em class="replaceable"><code>-GE:not=4</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups": this reflects the
#  ... that read sequences ...

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>TCMFS456ZH345.fastq TQF92GT7H34.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong></pre><p>
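	Before starting a long run, it is worth letting MIRA check such a
	manifest. A minimal sketch, assuming the example above was saved as
	<code class="filename">manifest.conf</code> (the file name is arbitrary):
      </p><pre class="screen">
# check only the syntax of the manifest
mira --mcheck manifest.conf
# additionally check that the FASTQ files named in "data" are present
mira --mdcheck manifest.conf
# start the assembly
mira manifest.conf
</pre><p>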
	To make things a bit more interesting, here's an example using a
	couple more technologies and showing some more options of the manifest
	file like wild cards in file names, different paired-end/mate-pair
	libraries and how to let MIRA refine pairing information (or even find
	out everything by itself):
      </p><pre class="screen"># Example for a manifest describing a de-novo assembly with
# unpaired 454, paired-end Illumina, mate-pair Illumina
# and paired Ion Torrent data

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 passes with kmer sizes of
# 17, 31, 63 and 127 nucleotides. Obviously, read lengths of the
# libraries should be greater than 127 bp.
# Note: usually MIRA will choose sensible options for number of
#  passes and kmer sizes to be used by itself.

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em>
parameters = <em class="replaceable"><code>-AS:kms=17,31,63,127</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups": this reflects the
#  ... that read sequences ...

# defining the shotgun (i.e. unpaired) 454 reads
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>TCMFS456ZH345.fastq TQF92GT7H34.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong>

# defining the paired-end Illumina reads, fixing all needed pair information
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomePairedEndIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>datape*.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>100 300</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em>
segment_naming = <em class="replaceable"><code>solexa</code></em></code></strong>

# defining the mate-pair Illumina reads, fixing most needed pair information
#  but letting MIRA refine the template_size via "autorefine"
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeMatePairIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>datamp*.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>2000 4000 autorefine</code></em>
segment_placement = <em class="replaceable"><code>&lt;--- ---&gt;</code></em>
segment_naming = <em class="replaceable"><code>solexa</code></em></code></strong>

# defining paired Ion Torrent reads
# example to show how lazy one can be and simply let MIRA estimate by itself
#  all needed pairing information via "autopairing"
#  Hint: it usually does a better job at it than we do ;-)
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomePairedIonReadsIGotFromTheLab</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>dataion*.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_basics"></a>3.4.2. 
	The manifest file: basics
      </h3></div></div></div><p>
	The first part of an assembly <span class="emphasis"><em>manifest</em></span> contains
	the basic information the assembler needs in order to know what
	you want it to do. This part consists of exactly three entries:
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    <span class="bold"><strong>project =</strong></span>
	    <em class="replaceable"><code>project name</code></em> tells the assembler
	    the name you wish to give to the whole assembly project. MIRA will
	    use that name throughout the whole assembly for naming
	    directories, files and a couple of other things.
	  </p><p>
	    You can name the assembly any way you want; you should, however,
	    restrict yourself to alphanumeric characters and perhaps
	    the characters plus, minus and underscore. Using slashes or
	    backslashes here is a recipe for catastrophe.
	  </p></li><li class="listitem"><p>
	    <span class="bold"><strong>job =</strong></span>
	     [<em class="replaceable"><code>denovo|mapping</code></em>],
	     [<em class="replaceable"><code>genome|est|fragments|clustering</code></em>],
	     [<em class="replaceable"><code>draft|accurate</code></em>] tells the
	    assembler what kind of data it should expect and how it should
	    assemble it.
	  </p><p>
	  You need to make your choice mainly in three steps and in the end
	  concatenate your choices to the [job=] entry of the manifest:
	  </p><div class="orderedlist"><ol class="orderedlist" type="a"><li class="listitem"><p>
		are you building an assembly from scratch
		(choose: <span class="emphasis"><em>denovo</em></span>) or are you mapping reads
		to an existing backbone sequence
		(choose: <span class="emphasis"><em>mapping</em></span>)?  Pick one. Leaving this
		out automatically chooses <span class="emphasis"><em>denovo</em></span> as
		default.
	      </p></li><li class="listitem"><p>
		are the data you are assembling forming a larger contiguous
		sequence (choose: <span class="emphasis"><em>genome</em></span>), are you
		assembling EST or mRNA libraries
		(choose: <span class="emphasis"><em>est</em></span>), single genes or small
		plasmids (choose: <span class="emphasis"><em>fragments</em></span>) or do you cluster assembled
		sequences (choose: <span class="emphasis"><em>clustering</em></span>)?
		Pick one. Leaving this out
		automatically chooses <span class="emphasis"><em>genome</em></span> as default.
	      </p><p>
		Since version 4.9.4, a new mode <span class="emphasis"><em>fragments</em></span>
		is available. This mode is essentially similar to the
		<span class="emphasis"><em>EST</em></span> mode, but all safety features
		which reduce data sizes are switched off. Use this mode for
		the assembly of comparatively small EST/mRNA, small plasmid or
		single gene projects where you
		want the highest accuracy and minimal filtering. Warning:
		contigs with coverages going into the 1000s will lead to
		really slow assemblies.
	      </p><p>
		Since version 4.9.6, a new mode <span class="emphasis"><em>clustering</em></span>
		is available. This mode is meant for clustering already
		assembled contigs such as those created in mRNA or EST
		assemblies. Basic parameters are: single pass, no clipping, no
		editing, ~7.5% differences between sequences allowed,
		gaps &gt;= 13 bases disallowed, single occurrence of disagreeing
		base leads to SNP tagging.
		Warning: do not use this mode with any type of real sequencing data
		... you would probably regret it.
	      </p></li><li class="listitem"><p>
		do you want a quick and dirty assembly for first insights
		(choose: <span class="emphasis"><em>draft</em></span>) or an assembly that should
		be able to tackle even most nasty cases (choose:
		<span class="emphasis"><em>accurate</em></span>)? Pick one. Leaving this out
		automatically chooses <span class="emphasis"><em>accurate</em></span> as default.
	      </p></li></ol></div><p>
	    Once you're done with your choices, concatenate everything with
	    commas. E.g.:
	    '<code class="literal">job = mapping,genome,draft</code>' will give you a
	    mapping assembly of a genome in draft quality.
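	  </p><p>
	    As a small illustration, here are a few combinations built only from
	    the choices listed above. The lines are alternatives; a manifest would
	    normally carry just one of them:
	  </p><pre class="screen"># de-novo assembly of a genome in accurate mode (the defaults, spelled out)
job = denovo,genome,accurate

# mapping assembly of a genome in draft quality
job = mapping,genome,draft

# de-novo assembly of EST / mRNA libraries in accurate mode
job = denovo,est,accurate</pre><p>
	    As each of the three choices has a default, an entry such as
	    <code class="literal">job = genome</code> should, going by the
	    defaults described above, select the same behaviour as the first line.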
	  </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	    For de-novo assembly of genomes, these switches are optimised for
	    'decent' coverages that are commonly seen to get you something useful,
	    i.e., &#8805; 7x for Sanger, &#8805; 18x for 454 FLX or Titanium, &#8805; 25x for
	    454 GS20 and &#8805; 30x for Solexa. Should you venture into lower
	    coverage or extremely high coverage (say, &#8805; 60x for 454), you will
	    need to adapt a few parameters via extensive switches.
	  </td></tr></table></div></li><li class="listitem"><p>
	    <span class="bold"><strong>parameters =</strong></span> is used in case you
	    want to change one of the 150+ extended parameters MIRA has to
	    offer to control almost every aspect of an assembly. This is
	    described in more detail in a separate section below.
	  </p></li></ol></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_readgroups"></a>3.4.3. 
	The manifest file: information on the data you have
      </h3></div></div></div><p>
	The second part of an assembly <span class="emphasis"><em>manifest</em></span> tells
	MIRA which files it needs to load, which sequencing technology
	generated the data, whether there are DNA template constraints it can
	use during the assembly process and a couple of other things.
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_readgroup"></a>3.4.3.1. 
	Starting a new readgroup
      </h4></div></div></div><p>
	  <span class="bold"><strong>readgroup </strong></span> [= <em class="replaceable"><code>group name</code></em>] is the keyword which tells MIRA that you are going to define a new read group. You can optionally name that group.
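	</p><p>
	  A minimal sketch of how two read groups could follow each other in a
	  manifest, the first one named and the second one unnamed. Group names
	  and file names are placeholders invented for this illustration:
	</p><pre class="screen"># first read group, with a name
readgroup = ShotgunLibraryA
data = libA*.fastq
technology = 454

# second read group, started by the bare keyword (the name is optional)
readgroup
data = libB*.fastq
technology = solexa</pre><p>
	  As in the example manifest further up, each
	  <span class="emphasis"><em>readgroup</em></span> line opens a new group and
	  the entries following it are taken to describe that group.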
	</p><div class="sidebar"><a name="sidebar_ref_manifest_readgroups_templates_and_readgroups"></a><div class="titlepage"><div><div><p class="title"><b>
	    Understanding readgroups and DNA templates
	  </b></p></div></div></div><p>
	    When you send away your DNA for sequencing, it is going to be
	    prepared for sequencing according to your wishes. Sequencing
	    providers call this "constructing a library" and regardless
	    whether you sequence with Sanger, 454, Illumina, Ion Torrent,
	    Pacific Biosciences or other technologies, the "library prep" is
	    always there.
	  </p><p>
	    With most library preps, your DNA is first amplified and then
	    cut into small pieces. These pieces are called
	    <span class="emphasis"><em>templates</em></span> and their length can range
	    from a few dozen or a few hundred bases up to a couple
	    of dozen or even a hundred kilobases. The important thing is that
	    these templates can be much bigger in size than the actual read
	    length. While this is a wet lab step, protocols and providers
	    have gotten pretty good at constructing libraries where the DNA
	    templates are all in a given range of bases like, e.g., having a
	    library with template size 500bp (+/- 100bp) and another library
	    with template size around 7kb (+/- 500bp).
	  </p><p>
	    Depending on the technology and sequencing strategy used, the
	    DNA templates are used to create either one single read or - and
	    that's important - two or more reads.
	  </p><p>
	    Libraries with "single reads" are often called "single read
	    libraries" or "shotgun libraries". They can be found for every
	    sequencing technology and are most of the time easy to construct
	    (therefore cheap) and are often used to provide a decent amount
	    of bases as basic coverage for your project.
	  </p><p>
	    Libraries with two reads per DNA template are often called
	    "mate-pair" or "paired-end" libraries. They are harder to
	    construct and sometimes have less yield, therefore they are often
	    more expensive. But the sequencing approach using several reads
	    per DNA template allows assembly and scaffolding algorithms to
	    resolve repetitive regions of a genome which are longer than the
	    average read length. Note that Pacific Biosciences has a
	    sequencing mode called "strobed sequencing" which is different
	    from "paired-end/mate-pair" but also creates multiple reads per
	    DNA template.
	  </p><p>
	    Long story short: an assembler must know afterwards what kind of
	    reads it has to expect: the sequencing technology, library
	    preparation strategy etc. For this, the notion of <span class="emphasis"><em>read
	    groups</em></span> has emerged: reads coming from the same
	    technology and same library preparation are pooled together in a
	    read group to tell the assembler: in the assembly, if you see two
	    reads coming from the same DNA template, you should expect them to
	    be at a certain distance from each other and they should be
	    oriented in a certain way.
	  </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	    The above was a <span class="bold"><strong>very</strong></span> simplified
	    view on the whole area of DNA templates, readgroups, shotgun and
	    paired-end sequencing. Enough to hopefully understand the
	    concepts, but you might want to read more about it.
	  </td></tr></table></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_data"></a>3.4.3.2. 
	Defining data files to load
      </h4></div></div></div><p>
	  <span class="bold"><strong>data</strong></span> = <em class="replaceable"><code>filepath
	  [filepath ...]</code></em> defines the file paths from
	  which sequences should be loaded. A file path can contain just the
	  name of one (or several) files or it can contain the
	  <span class="emphasis"><em>path</em></span>, i.e., the directory (absolute or
	  relative) including the file name.
	</p><p>
	  MIRA automatically recognises the type of the sequence data by
	  looking at the postfix of files. For postfixes not adhering to widely
	  used naming schemes for file types, there is additionally a way of
	  explicitly defining the type (see further down at the end of this
	  item on how this is done). Currently allowed file types are:
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      <code class="filename">.fasta</code> for sequences formatted in FASTA
	      format where there exists an additional
	      <code class="filename">.fasta.qual</code> file which contains quality
	      data. If the file with quality data is missing, this is
	      interpreted as an error and MIRA will abort.
	    </p></li><li class="listitem"><p>
	      <code class="filename">.fna</code> and <code class="filename">.fa</code> also
	      for sequences formatted in FASTA format. The difference
	      to <code class="filename">.fasta</code> lies in the way MIRA treats a
	      missing quality file (called
	      <code class="filename">.fna.qual</code>
	      or <code class="filename">.fa.qual</code>): it does not see that as
	      a critical error and continues.
	    </p></li><li class="listitem"><p>
	      <code class="filename">.fastq</code> or <code class="filename">.fq</code> for files in FASTQ format
	    </p></li><li class="listitem"><p>
	      <code class="filename">.gff3</code> or <code class="filename">.gff</code> for files in GFF3 format. Note that
	      MIRA will load all sequences and annotations contained in this
	      file.
	    </p></li><li class="listitem"><p>
	      <code class="filename">.gbk</code>, <code class="filename">.gbf</code>, <code class="filename">.gbff</code>
	      or <code class="filename">.gb</code> for files formatted in GenBank
	      format. Note that the MIRA GenBank loader does not understand
	      intron/exon or other multiple-locus structures in this format,
	      use GFF3 instead!
	    </p></li><li class="listitem"><p>
	      <code class="filename">.caf</code> for files in the CAF format (from Sanger Centre)
	    </p></li><li class="listitem"><p>
	      <code class="filename">.maf</code> for files in the MIRA MAF format
	    </p></li><li class="listitem"><p>
	      <code class="filename">.exp</code> for files in the Staden EXP format.
	    </p></li><li class="listitem"><p>
	      <code class="filename">.fofnexp</code> for a <span class="emphasis"><em>file of EXP
	      filenames</em></span> which all point to files in the Staden EXP
	      format.
	    </p></li><li class="listitem"><p>
	      <code class="filename">.xml</code>, <code class="filename">.ssaha2</code> and <code class="filename">.smalt</code> for ancillary data in NCBI TRACEINFO, SSAHA2 or SMALT format respectively.
	    </p></li></ul></div><p>
	  Multiple 'data' lines and multiple entries per line (even
	  different formats) are allowed, as in, e.g.,
	</p><pre class="screen">data = file1.fastq file2.fastq file3.fasta file4.gbk
data = myscreenings.smalt</pre><p>
	  You can also use wildcards and/or directory names. E.g., loading
	  all file types MIRA understands from a given directory
	  <code class="filename">mydir</code>:
	</p><pre class="screen">data = mydir</pre><p>
	  or loading all files starting with <code class="filename">mydata</code> and
	  ending with <code class="filename">fastq</code>:
	</p><pre class="screen">data = mydata*fastq</pre><p>
	  or loading all files in directory <code class="filename">mydir</code>
	  starting with <code class="filename">mydata</code> and ending with
	  <code class="filename">fastq</code>:
	</p><pre class="screen">data = mydir/mydata*fastq</pre><p>
	  or loading all FASTQ files in all directories starting with <code class="filename">mydir</code>:
	</p><pre class="screen">data = mydir*/*fastq</pre><p>
	  or ... well, you get the gist.
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	  Giving a directory like in <code class="filename">mydir</code> is
	  equivalent to <code class="filename">mydir/*</code> (saying: give me all
	  files in the directory <code class="filename">mydir</code>), however the
	  first version should be preferred when the directory contains
	  thousands of files.
	</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	    GenBank and GFF3 files may or may not contain embedded sequences. If
	    annotations are present in these files for which no sequence is
	    present in the same file, MIRA will look for reads of the same
	    name which it already loaded in this or previously defined read
	    groups and add the annotations there.
	  </p><p>
	    As a safety measure, annotations in GenBank and GFF3 files for which
	    absolutely no sequence or read has been defined are treated as
	    an error.
	  </p></td></tr></table></div><p>
	  <span class="emphasis"><em>Explicit definition of file types.</em></span> It is
	  possible to explicitly tell MIRA the type of a file even if said
	  file does not have a 'standard' naming scheme. For this, the
	  EMBOSS double-colon notation has been adapted to work also for
	  MIRA, i.e., you prepend the type of a file and separate it from
	  the file name by a double colon. E.g.,
	  the <code class="filename">.dat</code> postfix is not anything MIRA will
	  recognise, but you can specify that it should be loaded as a FASTQ file
	  like this:
	</p><pre class="screen">data = fastq::myfile.dat</pre><p>
	  Another frequent usage is forcing MIRA to load FASTA files
	  named <code class="filename">.fasta</code> without complaining in case
	  quality files (which MIRA wants you to provide) are not present:
	</p><pre class="screen">data = fna::myfile.fasta</pre><p>
	  This does (of course) work also with directories or wildcard
	  characters. In the following example, the first line will load all
	  files from <code class="filename">mydirectory</code> as FASTQ while the
	  second line loads just <code class="filename">.dat</code> files in a given
	  path as FASTA:
	</p><pre class="screen">data = fastq::mydirectory
data = fasta::/path/to/somewhere/*.dat</pre><p>
	  It is entirely possible (although not really sensible) to give
	  contradictory information to MIRA by using a different explicit
	  file type than one would guess from the standard postfix. In this
	  case, the explicit type takes precedence over the automatic
	  type. E.g.: to force MIRA to load a file as FASTA although it is
	  named <code class="filename">.fastq</code>, one could use this:
	</p><pre class="screen">data = fasta::file.fastq</pre><p>
	  Note that the above does not perform any kind of file conversion:
	  <code class="filename">file.fastq</code> needs to be already in FASTA
	  format or else MIRA will fail loading that data.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_defaultqual"></a>3.4.3.3. 
	Setting default quality
      </h4></div></div></div><p>
	  <span class="bold"><strong>default_qual</strong></span>=
	  <em class="replaceable"><code>quality_value</code></em> sets the
	  default fall-back quality value for sequences where the data files
	  given above do not contain quality values, e.g., GFF3 or GenBank
	  formats, possibly also FASTA files where the quality data file is
	  missing.
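	</p><p>
	  As an illustration, a read group that loads a FASTA file without an
	  accompanying quality file and assigns a fall-back quality could look
	  like the sketch below. The group name, the file name and the value 30
	  are assumptions chosen for this example, not recommendations:
	</p><pre class="screen">readgroup = SequencesWithoutQualities
# fna:: lets MIRA continue although no quality file is present
data = fna::contigs.fasta
technology = text
# assumed value: fall-back for sequences which come without quality values
default_qual = 30</pre><p>
	  The <code class="literal">fna::</code> prefix is the explicit file
	  type notation described in the section on defining data files.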
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_technology"></a>3.4.3.4. 
	  Defining technology used to sequence
	</h4></div></div></div><p>
	  <span class="bold"><strong>technology</strong></span>=
	  <em class="replaceable"><code>technology</code></em> names the technology
	  with which the sequences were produced. Allowed technologies are:
	  <span class="emphasis"><em>sanger, 454, solexa, iontor, pcbiolq, pcbiohq,
	  text</em></span>.
	</p><p>
	  The <span class="emphasis"><em>text</em></span> technology is not a technology per
	  se, but should be used for sequences which do not come from
	  sequencing machines like, e.g., database entries, consensus
	  sequences, artificial reads (which do not comply with the
	  behaviour of 'normal' sequencing data), etc.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_nostatistics"></a>3.4.3.5. 
	  Preventing statistics for technologies with biases
	</h4></div></div></div><p>
	  <span class="bold"><strong>nostatistics</strong></span> used as keyword will
	  prevent MIRA from calculating coverage estimates from reads of the given
	  readgroup.
	</p><p>
	  This keyword should be used in de-novo genome assemblies for reads
	  from libraries which produce very uneven coverage (e.g.: old
	  Illumina mate-pair protocols) or have a bias in the randomness of
	  DNA fragmentation (e.g.: Nextera protocol from Illumina).
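	</p><p>
	  A sketch of a read group for such a library, modelled on the mate-pair
	  read group of the example manifest further up with the keyword added.
	  Group name, file names and sizes are placeholders:
	</p><pre class="screen">readgroup = OldMatePairLibrary
data = datamp*.fastq
technology = solexa
template_size = 2000 4000 autorefine
segment_placement = RF
segment_naming = solexa
# bare keyword: no coverage estimates are calculated from this read group
nostatistics</pre><p>
	  The keyword stands on a line of its own, in the same way
	  <span class="emphasis"><em>autopairing</em></span> does in the example
	  manifest.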
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_asreference"></a>3.4.3.6. 
	Setting reference sequence for mapping jobs
      </h4></div></div></div><p>
	  <span class="bold"><strong>as_reference</strong></span> This keyword
	  indicates to MIRA that the sequences in this readgroup should not
	  be assembled, but should be used as reference backbone for a
	  mapping assembly. That is, sequencing reads are then placed/mapped
	  onto these reference reads.
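	</p><p>
	  Put together with the entries described so far, a small manifest for a
	  mapping assembly might look like the following sketch. Project name,
	  group names and file names are invented for illustration:
	</p><pre class="screen">project = MyFirstMapping
job = mapping,genome,accurate

# the reference (backbone): not assembled, reads get mapped onto it
readgroup = ReferenceSequence
data = reference.gbk
technology = text
as_reference

# the sequencing reads which are to be mapped
readgroup = SomePairedEndIlluminaReads
data = datape*.fastq
technology = solexa
autopairing</pre><p>
	  The reference is given the <span class="emphasis"><em>text</em></span>
	  technology here because it is a database entry and does not come
	  directly from a sequencing machine.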
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_autopairing"></a>3.4.3.7. 
	Autopairing: letting MIRA find out pair info by itself
      </h4></div></div></div><p>
	  <span class="bold"><strong>autopairing</strong></span> This keyword is used
	  to tell MIRA it should estimate values for
	  <span class="emphasis"><em>template_size</em></span> and
	  <span class="emphasis"><em>segment_placement</em></span> (see below).
	</p><p>
	  This is basically the lazy way to tell MIRA that the data in the
	  corresponding readgroup consists of paired reads and that you
	  trust it will find out the correct values.
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><span class="emphasis"><em>autopairing</em></span> usually works quite well for
	  small and mid-sized libraries (up to, say, 10 kb). For larger
	  libraries it might be a good thing to tell MIRA some rough
	  boundaries via <span class="emphasis"><em>template_size</em></span> /
	  <span class="emphasis"><em>segment_placement</em></span> and let MIRA refine the
	  values for the template size via <span class="emphasis"><em>autorefine</em></span>
	  (see below).
	</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><span class="emphasis"><em>autopairing</em></span> is a feature new to MIRA 4.0rc5;
	  it may contain bugs for some corner cases. Feedback is appreciated.
	</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_templatesize"></a>3.4.3.8. 
	Setting size of read templates
      </h4></div></div></div><p>
	  <span class="bold"><strong>template_size </strong></span>=
	  <em class="replaceable"><code>min_size max_size
	  <span class="emphasis"><em>[infoonly|exclusion_criterion]</em></span>
	  <span class="emphasis"><em>[autorefine]</em></span></code></em>. Defines the
	  minimum and maximum size of "good" DNA templates in the library
	  prep for this read group. This defines at which distance the two
	  reads of a pair are to be expected in a contig, which is very useful
	  information for an assembler to resolve repeats in a genome or
	  different splice variants in transcriptome data.
	</p><p>
	  If the term <span class="emphasis"><em>infoonly</em></span> is present, then MIRA
	  will pass the information on template sizes in result files, but
	  will not use it for any decision making during de-novo or mapping
	  assembly. The term <span class="emphasis"><em>exclusion_criterion</em></span> makes
	  MIRA use the information for decision making.
	</p><p>
	  If <span class="emphasis"><em>infoonly</em></span>
	  or <span class="emphasis"><em>exclusion_criterion</em></span> are missing, then MIRA
	  assumes <span class="emphasis"><em>exclusion_criterion</em></span> for de-novo
	  assemblies and <span class="emphasis"><em>infoonly</em></span> for mapping
	  assemblies.
	</p><p>
	  If the term <span class="emphasis"><em>autorefine</em></span> is present, MIRA will
	  start the assembly with the given size information but switch to
	  refined values computed from the distances observed in the
	  assembly. However, please note that the size range
	  can <span class="emphasis"><em>never</em></span> be expanded, only shrunk. It is
	  therefore advisable to use generous bounds when using the
	  autorefine feature.
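	</p><p>
	  To illustrate the optional terms, here is a sketch for a library with
	  a nominal template size of about 8 kb (an assumed figure, as are all
	  the numbers below). The three lines are alternatives for one read
	  group:
	</p><pre class="screen"># pass sizes on to result files only, no decision making
template_size = 7000 9000 infoonly

# use sizes for decision making
template_size = 7000 9000 exclusion_criterion

# generous bounds which MIRA may shrink (but never expand) during assembly
template_size = 4000 12000 exclusion_criterion autorefine</pre><p>
	  The bounds in the last line are deliberately wide because refinement
	  can only narrow the range given in the manifest.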
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	  The <span class="emphasis"><em>template_size</em></span> line in the manifest file
	  replaces the parameters -GE:uti:tismin:tismax of earlier versions
	  of MIRA (3.4.x and below).
	</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	  The minimum or the maximum size (or both) can be set to a negative
	  value for "don't care and don't check". This allows constructs
	  like <code class="literal">template_size= 500 -1 exclusion_criterion</code>
	  which would check only the minimum distance but not the maximum
	  distance.
	</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	    For <span class="emphasis"><em>mapping</em></span> assemblies with MIRA, you
	    will usually want to use <span class="emphasis"><em>infoonly</em></span>. Otherwise,
	    in case of genome re-arrangements, larger deletions or
	    insertions, MIRA would probably reject one read of every read
	    pair in the corresponding areas as it would not be at the
	    expected distance and/or orientation, and you would not be
	    able to easily find the re-arrangement in downstream analysis.
	  </p><p>
	    For <span class="emphasis"><em>de-novo</em></span> assemblies however
	    you <span class="emphasis"><em>should not</em></span>
	    use <span class="emphasis"><em>infoonly</em></span> except in very rare cases
	    where you know what you are doing.
	  </p></td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
	    Understanding the size of DNA templates
	  </b></p></div></div></div><p>
	    When using a <span class="emphasis"><em>paired-end</em></span> or
	    <span class="emphasis"><em>mate-pair</em></span> sequencing strategy, two
	    sequences are generated for the ends of each DNA template (see
	    sidebar above: "understanding readgroups and DNA
	    templates"). That is, if one has a library with 6kb fragments,
	    one knows that the outer ends of the two reads will be
	    approximately 6kb apart, like so:
	  </p><pre class="screen">DNA template    ##############################################################
read 1          .......
read 2                                                                  ......
                &lt;------------------------- ~6 kb ----------------------------&gt;</pre><p>
	    Sequencing labs will try their best to get these two sequences
	    from DNA templates which comply with a given length
	    specification. But as this is chemistry and wet lab work, some
	    variation is unavoidable and therefore the DNA
	    templates generated are not exactly of the specified size
	    (e.g. 6kb), but the size distribution will vary in a given
	    range, e.g., 5.5kb to 6.5 kb.
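	  </p><p>
	    Translated into a manifest entry, the 6kb library from this
	    example could be described as sketched below. The sketch assumes
	    that the lab's range of 5.5kb to 6.5kb is trusted as it stands:
	  </p><pre class="screen"># templates expected between 5.5kb and 6.5kb, sizes given in bases
template_size = 5500 6500</pre><p>
	    When the actual range is less certain, wider bounds are probably the
	    safer starting point.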
	  </p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_segplace"></a>3.4.3.9. 
	Read segment placement
      </h4></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">You do not need to use this when using 'autopairing' (see above).</td></tr></table></div><p>
	  <span class="bold"><strong>segment_placement </strong></span>=
	  <em class="replaceable"><code>placementcode <span class="emphasis"><em>[infoonly|exclusion_criterion]</em></span></code></em>. Allowed
	  placement codes are:
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      <span class="bold"><strong>?</strong></span>
	      or <span class="bold"><strong>unknown</strong></span> which are
	      place-holders for "well, in the end: don't care." Segments of
	      a template can be reads in any direction and in any
	      relationship to each other.
	    </p><p>
	      This is typically used for unpaired libraries (sometimes
	      called <span class="emphasis"><em>shotgun libraries</em></span>), but may be
	      also useful for, e.g., primer walking with Sanger.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>---&gt; &lt;---</strong></span> or <span class="bold"><strong>FR</strong></span> or <span class="bold"><strong>INNIES</strong></span>. The <span class="emphasis"><em>forward /
	      reverse</em></span> scheme as used in traditional Sanger
	      sequencing as well as Illumina paired-end sequencing.
	    </p><p>
	      This is the usual placement code for Sanger paired-end
	      protocols as well as Illumina paired-end. Less frequently used
	      in IonTorrent paired-end sequencing.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>&lt;--- ---&gt;</strong></span> or <span class="bold"><strong>RF</strong></span> or <span class="bold"><strong>OUTIES</strong></span>. The <span class="emphasis"><em>reverse /
	      forward</em></span> scheme as used in Illumina mate-pair
	      sequencing.
	    </p><p>
	      This is the usual placement code for Illumina mate-pair protocols.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>1---&gt; 2---&gt;</strong></span> or
	      <span class="bold"><strong>samedir forward</strong></span> or <span class="bold"><strong>SF</strong></span> or <span class="bold"><strong>LEFTIES</strong></span>. The <span class="emphasis"><em>forward /
	      forward</em></span> scheme. Segments of a template are all
	      placed in the same direction, the segment order in the contig
	      follows segment ordering of the reads.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>2---&gt; 1---&gt;</strong></span> or <span class="bold"><strong>samedir backward</strong></span> or <span class="bold"><strong>SB</strong></span> or <span class="bold"><strong>RIGHTIES</strong></span>. Segments of a template are
	      all placed in the same direction, the segment order in the
	      contig is reversed compared to segment ordering of the reads.
	    </p><p>
	      This is the usual placement code for 454 "paired-end" and IonTorrent
	      long-mate protocols.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>samedir</strong></span>. Segments of a
	      template are all placed in the same direction; their spatial
	      relationship, however, is not taken into account.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>&gt;&gt;&gt;</strong></span> (reserved for
	      sequencing of several equidistant fragments per template like
	      in PacBio strobe sequencing, not implemented yet)
	    </p></li></ul></div><p>
	  If the term <span class="emphasis"><em>infoonly</em></span> is present, then MIRA
	  will pass the information on segment placement in result files, but
	  will not use it for any decision making during de-novo assembly or
	  mapping assembly. The term <span class="emphasis"><em>exclusion_criterion</em></span> makes MIRA use the information for decision making.
	</p><p>
	  If <span class="emphasis"><em>infoonly</em></span> or <span class="emphasis"><em>exclusion_criterion</em></span> are missing, then MIRA assumes <span class="emphasis"><em>exclusion_criterion</em></span> for de-novo assemblies and <span class="emphasis"><em>infoonly</em></span> for mapping assemblies.
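	</p><p>
	  The default rule above can be written down as a small Python
	  sketch. It is only an illustration of the rule as described here,
	  not MIRA code: the function name and the labels
	  <code class="literal">"denovo"</code> and <code class="literal">"mapping"</code>
	  are assumptions made for this example.
	</p><pre class="screen">
```python
def effective_placement_mode(assembly_type, modifier=None):
    """Return how segment placement information is used.

    assembly_type: "denovo" or "mapping" (labels chosen for this sketch).
    modifier: "infoonly", "exclusion_criterion", or None if the
    segment_placement line carries neither term.
    """
    if assembly_type not in ("denovo", "mapping"):
        raise ValueError("unknown assembly type: " + assembly_type)
    if modifier is None:
        # Defaults as described in the text: de-novo assemblies use the
        # placement for decisions, mapping assemblies only report it.
        return "exclusion_criterion" if assembly_type == "denovo" else "infoonly"
    if modifier not in ("infoonly", "exclusion_criterion"):
        raise ValueError("unknown modifier: " + modifier)
    return modifier
```
</pre><p>
	  An explicitly given term always wins in this sketch; the defaults
	  only apply when neither term is present.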
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	    For <span class="emphasis"><em>mapping</em></span> assemblies with MIRA, you
	    usually will want to use <span class="emphasis"><em>infoonly</em></span> as else -
	    in case of genome re-arrangements, larger deletions or
	    insertions - MIRA would probably reject one read of every read
	    pair (as it would not be at the expected distance and/or
	    orientation) and you would not be able to simply find the
	    re-arrangement in downstream analysis.
	  </p><p>
	    For <span class="emphasis"><em>de-novo</em></span> assemblies however
	    you <span class="emphasis"><em>should not</em></span>
	    use <span class="emphasis"><em>infoonly</em></span> except in very rare cases
	    where you know what you are doing.
	  </p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	  As soon as you tell MIRA that a readgroup contains paired reads (via one of the other typical readgroup parameters like template_size, segment_naming etc.), the <span class="emphasis"><em>segment_placement</em></span> line becomes mandatory in the manifest. This is because different sequencing technologies and/or library preparations result in different read orientations. E.g., Illumina libraries come in paired-end flavour which have FR (forward/reverse) placements, but there are also mate-pair libraries which have reverse/forward (RF) placements.
	</td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
	    Understanding read segment placement on DNA templates
	  </b></p></div></div></div><p>
	    bla
	  </p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_segname"></a>3.4.3.10. 
	Read segment naming
      </h4></div></div></div><p>
	  <span class="bold"><strong>segment_naming </strong></span>= <em class="replaceable"><code>naming_scheme <span class="emphasis"><em>[rollcomment]</em></span></code></em>. Defines
	  the naming scheme reads are following to indicate the DNA template
	  they belong to. Allowed naming schemes are: <span class="emphasis"><em>sanger,
	  stlouis, tigr, FR, solexa, sra</em></span>.
	</p><p>
	  If not defined, the default is <span class="underline">sanger</span> for Sanger sequencing data
	  and <span class="underline">solexa</span> for Solexa, 454
	  and Ion Torrent data.
	</p><p>
	  For FASTQ files, the modifier <span class="emphasis"><em>rollcomment</em></span> can
	  be used to let MIRA take the first token in the comment as name of
	  a read instead of the original name. E.g.: for a read
	</p><pre class="screen">@DRR014327.1.1 HWUSI-EAS547_0013:1:1:1106:4597.1 length=91
TTAGAAGGAGATCTGGAGAACATTTTAAACCGGATTGAACAACGCGGCCGTGAGATGGAGCTTCAGACAAGCCGGTCTTATTGGGACGAAC
+
bbb`bbbbabbR`\_bb_bba`b`bb_bb_`\^\^Y^`\Zb^b``]]\S^a`]]a``bbbb_bbbb]bbb\`^^^]\aaY\`\\^aa__aB</pre><p>
	  the rollcomment modifier will lead to the read being named
	  <code class="filename"> HWUSI-EAS547_0013:1:1:1106:4597.1</code> (which
	  is almost the original instrument read name) instead of
	  <code class="filename">DRR014327.1.1</code> (which is the SRA read name).
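	</p><p>
	  The effect of the modifier on a FASTQ header line can be sketched
	  in Python as follows. This is a minimal illustration of the
	  behaviour described above, assuming that name and comment are
	  separated by whitespace; the function name is hypothetical and
	  this is not how MIRA itself is implemented.
	</p><pre class="screen">
```python
def fastq_read_name(header_line, rollcomment=False):
    """Pick the read name from a FASTQ header line."""
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    tokens = header_line[1:].split()
    if not tokens:
        raise ValueError("empty FASTQ header")
    if rollcomment and len(tokens) >= 2:
        # First token of the comment replaces the original name.
        return tokens[1]
    return tokens[0]
```
</pre><p>
	  Called on the header shown above, the sketch returns the SRA name
	  without the modifier and the instrument-style name with it.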
	</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	  For data from the short read archive (SRA), one will usually need
	  to explicitly specify the 'sra' naming scheme or use the
	  'rollcomment' modifier in FASTQ files.
	</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	  This has changed with MIRA 3.9.1
	  and <span class="command"><strong>sff_extract</strong></span> 0.3.0. Before that, 454 and Ion
	  Torrent were given <span class="underline">fr</span> as naming
	  scheme.
	</td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
	    Understanding read naming schemes
	  </b></p></div></div></div><p>
	    Read naming is a long story with lots of historical gotchas: it
	    needs to be clear and simple, but still people sometimes wanted
	    to convey additional meta-information with it. Unsurprisingly,
	    several "standards" emerged over time. In short: it's a mess. See also XKCD entry on <a class="ulink" href="http://xkcd.com/927/" target="_top">proliferating standards</a>.
	  </p><p>
	    How to choose: please read the documentation available at the
	    different centres or ask your sequence provider. In a nutshell
	    (and probably over-simplified):
	  </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
		Sanger scheme
	      </span></dt><dd><p>
		  "somename<span class="emphasis"><em>.[pqsfrw][12][bckdeflmnpt][a|b|c|...</em></span>"
		  (e.g.  U13a08f10.p1ca), but the length of the postfix
		  must be at least 4 characters, i.e., ".p" alone will not
		  be recognised.
		</p><p>
		  Usually, ".p" + 3 characters or ".f" + 3 characters are
		  used for forward reads, while reverse complement reads
		  take either ".q" or ".r" (+ 3 characters in both cases).
		</p></dd><dt><span class="term">
		TIGR scheme
	      </span></dt><dd><p>
		  "somename<span class="emphasis"><em>TF*|TR*|TA*</em></span>"
		  (e.g. GCPBN02TF or GCPDL68TABRPT103A58B).
		</p><p>
		  Forward reads take "TF*", reverse reads "TR*".
		</p></dd><dt><span class="term">
		St. Louis scheme
	      </span></dt><dd><p>
		  "somename<span class="emphasis"><em>.[sfrxzyingtpedca]*</em></span>"
		</p></dd><dt><span class="term">
		Forward/Reverse scheme
	      </span></dt><dd><p>
		  "somename<span class="emphasis"><em>.[fr]*</em></span>"
		  (e.g.  E0K6C4E01DIGEW.f or E0K6C4E01BNDXN.r2nd).
		</p><p>
		  ".f*" for forward, ".r*" for reverse.
		</p></dd><dt><span class="term">
		Solexa scheme
	      </span></dt><dd><p>
		  Even simpler than the forward/reverse scheme, it allows
		  only one or two reads per template:
		  "somename<span class="emphasis"><em>/[12]</em></span>"
		</p></dd><dt><span class="term">
		SRA scheme
	      </span></dt><dd><p>
		  The Short Read Archive (SRA) finally settled on a naming
		  scheme and renames each and every read within its
		  database. When you download sequences from the archive,
		  all reads will be named
		  <code class="filename">XXX000000.Y[.Z]</code> (where X's are
		  characters A-Z, 0 are digits from 0 to 9, Y is a counter
		  and Z is a number denoting the segment (usually 1,2 or
		  3)). This naming scheme is applied to reads from all
		  technologies, therefore the MIRA technology dependent
		  defaults will not apply and one must specify the 'sra'
		  naming scheme in the command line.
		</p></dd></dl></div></div><p>
	  Any wildcard in the forward/reverse suffix must be consistent for
	  a read pair, and is treated as part of the template name. This is
	  to allow multiple sequencing of a fragment, particularly common
	  with Sanger capillary data (e.g. given somename.f and somename.r,
	  resequenced as somename.f2 and somename.r2, this would be treated
	  as two pairs, with template names somename and somename_2
	  respectively).
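	</p><p>
	  How a read name is split into a template name and a segment can
	  be sketched in Python for two of the schemes. This is a
	  simplified illustration, not MIRA's parser: the regular
	  expressions are approximations of the patterns listed above, and
	  joining the wildcard to the template name with an underscore is
	  an assumption derived from the somename_2 example.
	</p><pre class="screen">
```python
import re

# Suffix patterns, loosely following the scheme descriptions in the text.
FR_SUFFIX = re.compile(r"^(.+)\.([fr])(.*)$")
SOLEXA_SUFFIX = re.compile(r"^(.+)/([12])$")


def split_read_name(name, scheme):
    """Return (template_name, segment), or (name, None) if nothing matches."""
    if scheme == "solexa":
        match = SOLEXA_SUFFIX.match(name)
        if match is None:
            return name, None
        return match.group(1), int(match.group(2))
    if scheme == "FR":
        match = FR_SUFFIX.match(name)
        if match is None:
            return name, None
        base, direction, wildcard = match.groups()
        # The wildcard becomes part of the template name, so that
        # somename.f2 and somename.r2 form their own pair.
        template = base + "_" + wildcard if wildcard else base
        return template, direction
    raise ValueError("scheme not covered by this sketch: " + scheme)
```
</pre><p>
	  With this sketch, somename.f2 and somename.r2 share the template
	  name somename_2, while somename.f and somename.r share somename.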
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_strainname"></a>3.4.3.11. 
	Strain naming
      </h4></div></div></div><p>
	  <span class="bold"><strong>strain_name </strong></span>=
	  <em class="replaceable"><code>string</code></em>. Defines the strain /
	  organism-code the reads of this read group are from. If not set,
	  MIRA will assign "StrainX" to normal readgroups and
	  "ReferenceStrain" to readgroups with reference sequences.
	</p><p>
	  Restrictions: in de-novo assemblies you can have at most 255 strains. In
	  mapping assemblies, you can have at most 8 strains.
	</p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
	    Understanding how MIRA uses strain information
	  </b></p></div></div></div><p>
	    bla
	  </p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_datadirscf"></a>3.4.3.12. 
	Data directory for SCF files
      </h4></div></div></div><p>
	  <span class="bold"><strong>datadir_scf </strong></span>=
	  <em class="replaceable"><code>directory</code></em>
	</p><p>
	  For SANGER data only: tells MIRA in which directory it can find
	  SCF data belonging to reads of this read group.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_renameprefix"></a>3.4.3.13. 
	Renaming read name prefixes
      </h4></div></div></div><p>
	  <span class="bold"><strong>rename_prefix</strong></span>=
	  <em class="replaceable"><code>prefix replacement</code></em>. Allows renaming
	  reads on the fly while loading data by searching each read name
	  for a given <span class="emphasis"><em>prefix</em></span> string and, if found,
	  replacing it with a given <span class="emphasis"><em>replacement</em></span> string.
	</p><p>
	  This is most useful for systems like Illumina or PacBio which
	  generate quite long read names which, in the end, are either
	  utterly useless for an end user or are even breaking older
	  programs which have a length restriction on read names. E.g.:
	</p><pre class="screen">rename_prefix = DQT9AAQ4:436:H371HABMM: Sample1_</pre><p>
	  will rename reads
	  like <span class="emphasis"><em>DQT9AAQ4:436:H371HABMM:5:1101:9154:3062</em></span>
	  into <span class="emphasis"><em>Sample1_5:1101:9154:3062</em></span>
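	</p><p>
	  The renaming itself amounts to a prefix substitution, which can
	  be sketched in Python as below. The function name is
	  hypothetical, and applying only the first matching rule when
	  several rules are given is an assumption of this sketch, not
	  documented MIRA behaviour.
	</p><pre class="screen">
```python
def rename_read(name, rules):
    """Apply the first (prefix, replacement) rule whose prefix matches."""
    for prefix, replacement in rules:
        if name.startswith(prefix):
            return replacement + name[len(prefix):]
    return name  # no rule matched: the name is kept unchanged
```
</pre><p>
	  With the single rule from the example above, the sketch turns
	  DQT9AAQ4:436:H371HABMM:5:1101:9154:3062 into
	  Sample1_5:1101:9154:3062 and leaves all other names untouched.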
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><code class="literal">rename_prefix</code> entries are valid per
	  readgroup. I.e., an entry for a readgroup will not rename reads of
	  another readgroup.
	</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	    Multiple <code class="literal">rename_prefix</code> entries are
	    allowed per readgroup. E.g.:
	  </p><pre class="screen">rename_prefix = DQT9AAQ4:436:H371HABMM: S1sxa_
rename_prefix = m140328_002546_42149_c100624422550000001823118308061414_s1_ S1pb_</pre><p>
	    will rename a read
	    called <code class="literal">DQT9AAQ4:436:H371HABMM:1:1101:3099:2186</code>
	    into <code class="literal">S1sxa_1:1101:3099:2186</code> while renaming
	    another read called <code class="literal">m140328_002546_42149_c100624422550000001823118308061414_s1_p0/100084/10792_20790/0_9573</code>
	    into <code class="literal">S1pb_p0/100084/10792_20790/0_9573</code>
	  </p></td></tr></table></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_parameters"></a>3.4.4. 
	The manifest file: extended parameters
      </h3></div></div></div><p>
	The <span class="bold"><strong>parameters=</strong></span> line in the manifest
	file opens up the full panoply of possibilities the MIRA assembler
	offers. This ranges from fine-tuning assemblies to setting parameters
	in a way so that MIRA is suited also for very special assembly cases.
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_parameter_groups"></a>3.4.4.1. 
	  Parameter groups
	</h4></div></div></div><p>
	  Some parameters one can set in MIRA somehow belong together. For example,
	  given: when specifying an overlap in an alignment of two sequences,
	  one could tell the assembler it should look at overlaps only if they
	  have a certain similarity and a certain length. On the other hand,
	  specifying how many processors / threads the assembler should use or
	  whether the results of an assembly should be written out as SAM
	  format does not seem to relate to alignments.
	</p><p>
	  MIRA uses <span class="emphasis"><em>parameter groups</em></span> to keep parameters
	  together which somehow belong together. For example:
	</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code> -GENERAL:number_of_threads=4 \
              -ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
              -OUTPUT:output_result_caf=no</code></em></code></strong></pre><p>
	The parameters of the different parameter groups are described in
	detail a bit later in this manual.
      </p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_technology_sections"></a>3.4.4.2. 
	  Technology sections
	</h4></div></div></div><p>
	  With the introduction of new sequencing technologies, MIRA also had
	  to be able to set values that allow technology specific behaviour of
	  algorithms. One simple example for this could be the minimum length
	  a read must have to be used in the assembly. For Sanger sequences,
	  having this value to be 150 (meaning a read should have at least 150
	  unclipped bases) would be a very valid, albeit conservative
	  choice. For 454 reads and especially Solexa and ABI SOLiD reads
	  however, this value would be ridiculously high.
	</p><p>
	  To allow very fine grained behaviour, especially in hybrid
	  assemblies, and to prevent the explosion of parameter names, MIRA
	  knows two categories of parameters:
	</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	      <span class="bold"><strong>technology independent parameters</strong></span>
	      which control general behaviour of MIRA like, e.g., the number of
	      assembly passes or file names etc.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>technology dependent parameters</strong></span>
	      which control behaviour of algorithms where the sequencing
	      technology plays a role. An example would be the minimum
	      length of a read (like 200 for Sanger reads and 120 for 454 FLX
	      reads).
	    </p></li></ol></div><p>
	  More on this a bit further down in this documentation.
	</p><p>
	  As an example, a manifest using technology dependent and independent parameters could
	  look like this:
	</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GENERAL:number_of_threads=4 \
              SANGER_SETTINGS -ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
              454_SETTINGS -ALIGN:min_relative_score=75 -ASSEMBLY:minimum_read_length=100 \
              SOLEXA_SETTINGS -ALIGN:min_relative_score=90 -ASSEMBLY:minimum_read_length=75</code></em></code></strong></pre><p>
	  Now, assume the following read group descriptions in a manifest:
	</p><pre class="screen">
...

readgroup
technology=454
...

readgroup
technology=solexa
...</pre><p>
	  For MIRA, this means a number of parameters should apply to the
	  assembly as whole, while others apply to the sequencing data itself
	  ... and some parameters might need to be different depending on the
	  technology they apply to. MIRA dumps the parameters it is running
	  with at the beginning of an assembly and it makes it clear there
	  which parameters are "global" and which parameters apply to single
	  technologies.
	</p><p>
	  As an example, here is part of the parameter output that MIRA
	  shows when started with 454 and Illumina (Solexa) data:
	</p><pre class="screen">
...

Assembly options (-AS):
    Number of passes (nop)                      : 1
    Skim each pass (sep)                        : yes
    Maximum number of RMB break loops (rbl)     : 1
    Spoiler detection (sd)                      : no
    Last pass only (sdlpo)                      : yes

    Minimum read length (mrl)                   :  [454]  40
                                                   [sxa]  20
    Enforce presence of qualities (epoq)        :  [454]  no
                                                   [sxa]  yes

...</pre><p>
	  You can see the two different kinds of settings that MIRA uses:
	  <span class="emphasis"><em>common</em></span> <span class="emphasis"><em>settings</em></span> (like
	   [-AS:nop]) which allow only one value and
	  <span class="emphasis"><em>technology</em></span> <span class="emphasis"><em>dependent</em></span>
	  <span class="emphasis"><em>settings</em></span> (like  [-AS:mrl]), where for
	  each sequencing technology used in the project, the setting can be
	  different.
	</p><p>
	  How would one set a minimum read length of 40 and not enforce
	  presence of base qualities for Sanger reads, but for 454 reads a
	  minimum read length of 30 and enforce base qualities? The answer:
	</p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=40:epoq=no 454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
	  Notice the ..._SETTINGS section in the command line (or parameter file):
	  these tell MIRA that all the following parameters until the advent of
	  another switch are to be set specifically for the said technology.
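	</p><p>
	  The scoping rule can be illustrated with a short Python sketch
	  that groups the tokens of a parameter string by the preceding
	  ..._SETTINGS switch. It is not MIRA's parser: the function names
	  are hypothetical, and the <code class="literal">UNSPECIFIED</code>
	  bucket for tokens that appear before any switch is an assumption
	  made only to keep the sketch complete.
	</p><pre class="screen">
```python
def split_sections(parameter_string):
    """Group parameter tokens by the ..._SETTINGS switch preceding them."""
    sections = {}
    current = "UNSPECIFIED"  # tokens before any switch (sketch only)
    for token in parameter_string.split():
        if token.endswith("_SETTINGS"):
            current = token
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(token)
    return sections


def expand(token):
    """Turn '-AS:mrl=40:epoq=no' into ('AS', {'mrl': '40', 'epoq': 'no'})."""
    group, _, rest = token.lstrip("-").partition(":")
    settings = dict(item.split("=", 1) for item in rest.split(":"))
    return group, settings
```
</pre><p>
	  For a string with a SANGER_SETTINGS and a 454_SETTINGS part, each
	  technology ends up with its own list of tokens, and
	  <code class="literal">expand</code> splits a combined token such as
	  -AS:mrl=30:epoq=yes into its group and individual values.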
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	    For improved readability, you can distribute parameters across
	    several lines either by pre-fixing every line with
	    <code class="literal">parameters=</code>, like so:
	  </p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=80:epoq=no
parameters= 454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
	    Alternatively you can use a backslash at the end of a parameter
	    line to indicate that the next line is a continuing line, like so:
	  </p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=80:epoq=no <strong class="userinput"><code>\</code></strong>
            454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
	    Note that the very last line of the parameters settings MUST NOT
	    end with a backslash.
	  </p></td></tr></table></div><p>
	  Besides COMMON_SETTINGS, the following technology settings are currently available:
	</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	      SANGER_SETTINGS
	    </p></li><li class="listitem"><p>
	      454_SETTINGS
	    </p></li><li class="listitem"><p>
	      IONTOR_SETTINGS
	    </p></li><li class="listitem"><p>
	      PCBIOLQ_SETTINGS (currently not supported)
	    </p></li><li class="listitem"><p>
	      PCBIOHQ_SETTINGS
	    </p></li><li class="listitem"><p>
	      SOLEXA_SETTINGS
	    </p></li><li class="listitem"><p>
	      TEXT_SETTINGS
	    </p></li></ol></div><p>
	</p><p>
	  Some MIRA settings influence global behaviour and are not
	  related to a specific sequencing technology; these must be set in the
	  COMMON_SETTINGS environment. For example, it would not make sense to try and
	  set a different number of assembly passes for each technology like in
	</p><pre class="screen">
<strong class="userinput"><code>parameters= 454_SETTINGS -AS:nop=4 SOLEXA_SETTINGS -AS:nop=3</code></strong></pre><p>
	  Besides being contradictory, this does not really make sense. MIRA will
	  complain about cases like these. Simply set those common settings in
	  an area prefixed with the COMMON_SETTINGS switch like in
	</p><pre class="screen">
<strong class="userinput"><code>parameters= COMMON_SETTINGS -AS:nop=4 454_SETTINGS ... SOLEXA_SETTINGS ...</code></strong></pre><p>
	</p><p>
	  Since MIRA 3rc3, the parameter parser will help you by checking
	  whether parameters are correctly defined as COMMON_SETTINGS or
	  technology dependent setting.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_parameter_shortnames"></a>3.4.4.3. 
	  Parameter short names
	</h4></div></div></div><p>
	  Writing the verbose form of parameters can be quite a long task. Here is a short example:
	</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GENERAL:number_of_threads=4 \
              SANGER_SETTINGS -ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
              454_SETTINGS -ALIGN:min_relative_score=75 -ASSEMBLY:minimum_read_length=100 \
              SOLEXA_SETTINGS -ALIGN:min_relative_score=90 -ASSEMBLY:minimum_read_length=75</code></em></code></strong></pre><p>
	  However, every parameter has a shortened form. The above could be written like this:
	</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GE:not=4 \
              SANGER_SETTINGS -AL:mrs=70 -AS:mrl=150 \
              454_SETTINGS -AL:mrs=75 -AS:mrl=100 \
              SOLEXA_SETTINGS -AL:mrs=90 -AS:mrl=75</code></em></code></strong></pre><p>
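	  The correspondence between the long and the short form can be
	  sketched in Python. The two lookup tables below contain only the
	  names that appear in the two examples above and are by no means a
	  complete list; the function name is hypothetical and the sketch
	  is not part of MIRA.
	</p><pre class="screen">
```python
import re

# Long-to-short names taken from the two examples in the text only.
GROUPS = {"GENERAL": "GE", "ALIGN": "AL", "ASSEMBLY": "AS"}
PARAMETERS = {
    "number_of_threads": "not",
    "min_relative_score": "mrs",
    "minimum_read_length": "mrl",
}

TOKEN = re.compile(r"^-([A-Z]+):([a-z_]+)=(.+)$")


def shorten(token):
    """Rewrite one '-GROUP:parameter=value' token with short names."""
    match = TOKEN.match(token)
    if match is None:
        return token  # section switches and other tokens pass through
    group, parameter, value = match.groups()
    return "-{}:{}={}".format(
        GROUPS.get(group, group), PARAMETERS.get(parameter, parameter), value
    )
```
</pre><p>
	  Applied token by token, the sketch rewrites
	  -GENERAL:number_of_threads=4 as -GE:not=4 and leaves switches
	  such as COMMON_SETTINGS unchanged.
	</p><p>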
	Please note that it is also perfectly legal to decompose the switches
	so that they can be used more easily in scripted environments (notice
	the multiple -AL in some sections of the following example):
      </p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GE:not=4 \
              SANGER_SETTINGS \
                -AL:mrs=70 \
		-AL:mrl=150 \
              454_SETTINGS -AL:mrs=75:mrl=100 \
              SOLEXA_SETTINGS \
	        -AL:mrs=90 \
                -AL:mrl=75</code></em></code></strong></pre></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_order_dependent_quick_switches"></a>3.4.4.4. 
	  Order dependent quick switches
	</h4></div></div></div><p>
	  For some parameters, the order of appearance in the parameter lines
	  of the manifest is important. This is because the <span class="emphasis"><em>quick
	  parameters</em></span> are realised internally as a collection of
	  extended parameters that will overwrite any previously manually set
	  extended parameters. It is generally a good idea to place quick parameters in
	  the order as described in this documentation, that is: first the
	  order dependent quick parameters, then other quick parameters, then all
	  the other extended parameters.
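	</p><p>
	  Why the order matters can be shown with a small Python sketch in
	  which settings are applied strictly from left to right. Note
	  that the preset attached to <code class="literal">--hirep_good</code>
	  below is entirely made up for illustration: the real collection
	  of extended parameters behind a quick switch is not listed in
	  this section.
	</p><pre class="screen">
```python
# Hypothetical preset: the real parameter set behind a quick switch is not
# given in this section, so the entry below is invented for illustration.
QUICK_SWITCHES = {"--hirep_good": {"AS:nop": "5"}}


def apply_in_order(tokens):
    """Apply tokens left to right; later assignments overwrite earlier ones."""
    settings = {}
    for token in tokens:
        if token in QUICK_SWITCHES:
            settings.update(QUICK_SWITCHES[token])
        else:
            key, value = token.lstrip("-").split("=", 1)
            settings[key] = value
    return settings
```
</pre><p>
	  In this sketch, a manually set -AS:nop=3 placed before the quick
	  switch is overwritten by the preset, whereas the same setting
	  placed after the quick switch survives.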
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [--hirep_best]
	    , </span><span class="term">
	      [--hirep_good]
	    , </span><span class="term">
	      [--hirep_something]
	    </span></dt><dd><p>
		These are modifier switches for genome data that is deemed to
		be highly repetitive. With <span class="emphasis"><em>hirep_good</em></span> and
		<span class="emphasis"><em>hirep_best</em></span>, the assemblies will run
		slower due to more iterative cycles and slightly different
		default parameter sets that give MIRA a chance to resolve many
		nasty repeats. The <span class="emphasis"><em>hirep_something</em></span> switch
		goes the other way round and resolves repeats less well than a
		normal assembly, but allows MIRA to finish even on more
		complex data.
	      </p><p>
		Usage recommendations for bacteria: starting MIRA without any
		hirep switches yields good enough results in most cases. Under
		normal circumstances one can use
		<span class="emphasis"><em>hirep_good</em></span> or
		even <span class="emphasis"><em>hirep_best</em></span> without remorse as data
		sets and genome complexities are small enough to run within a
		couple of hours at most.
	      </p><p>
		Usage recommendations for 'simple' lower eukaryotes: starting
		MIRA without any hirep switches yields good enough results in
		most cases. If the genomes are not too complex,
		using <span class="emphasis"><em>hirep_good</em></span> can be a possibility.
	      </p><p>
		Usage recommendations for lower eukaryotes with complex
		repeats: starting MIRA without any hirep switches might
		already take too long or create temporary data files which are
		too big. For these cases, using
		<span class="emphasis"><em>hirep_something</em></span> makes MIRA use a
		parameter set which is targeted at resolving the
		non-repetitive areas of a genome and additionally all repeats
		which occur less than 10 times in the genome. Repeats occurring
		more often will not be resolved, but using the debris
		information one can recover affected reads and use these with
		harsh data reduction algorithms (e.g. digital normalisation)
		to get a glimpse into these.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		These switches replace the '--highlyrepetitive' switch from
		earlier versions.
	      </td></tr></table></div></dd><dt><span class="term">
	      [--noclipping=...]
	    </span></dt><dd><p>
		Switching off clipping options. If used
		as <code class="literal">--noclipping</code>
		or <code class="literal">--noclipping=all</code>, this switches off
		really everything, both technology dependent and independent switches.
		Technology dependent clipping options can be switched
		off individually via the entries <span class="emphasis"><em>sanger</em></span>,
		<span class="emphasis"><em>454</em></span>, <span class="emphasis"><em>iontor</em></span>,
		<span class="emphasis"><em>solexa</em></span> or
		<span class="emphasis"><em>solid</em></span>. Multiple entries separated by
		comma are allowed.
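	      </p><p>
		How the value of the switch is interpreted can be sketched
		in Python. This is an illustration of the rules stated
		above, not MIRA's own parser; the function name and the
		shape of the return value are assumptions of the sketch.
	      </p><pre class="screen">
```python
TECHNOLOGIES = ("sanger", "454", "iontor", "solexa", "solid")


def parse_noclipping(switch):
    """Return (technologies switched off, technology independent off?)."""
    name, _, value = switch.partition("=")
    if name != "--noclipping":
        raise ValueError("not a --noclipping switch: " + switch)
    if value in ("", "all"):
        # Bare switch or 'all': really everything is switched off.
        return set(TECHNOLOGIES), True
    entries = value.split(",")
    unknown = [entry for entry in entries if entry not in TECHNOLOGIES]
    if unknown:
        raise ValueError("unknown entries: " + ", ".join(unknown))
    return set(entries), False
```
</pre><p>
		Only the bare switch and the value all also switch off the
		technology independent clippings in this sketch; a list of
		technologies leaves them active.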
	      </p><p> Examples:
	      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
		    Switch off 454 and Solexa, but keep technology independent
		    clippings and all clippings for other technologies, (like,
		    e.g., Sanger) <code class="literal">--noclipping=454,solexa</code>
		  </p></li><li class="listitem"><p>
		    Switch off really
		    everything: <code class="literal">--noclipping</code>
		    or <code class="literal">--noclipping=all</code>
		  </p></li></ol></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
		  Switching off technology independent clippings
		  ([-CL:pec],  [-CL:gbcdc],  [-CL:kjd])
		  via this switch has been implemented for consistency in MIRA
		  4.9.6. Prior to this they were kept active, which created a
		  good deal of confusion with a number of users.
		</p><p>
		  As soon as you have any kind of 'real' sequencing data, you
		  really should use at least [-CL:pec]
		  and  [-CL:gbcdc].
		</p></td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_general_ge"></a>3.4.4.5. 
	  Parameter group: -GENERAL (-GE)
	</h4></div></div></div><p>
	  General options control the type of assembly to be performed and
	  other switches not belonging anywhere else.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [number_of_threads(not)=<em class="replaceable"><code>0 &#8804; integer &#8804; 256</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">0</span>. Master switch to set the number
	      of threads used in different parts of MIRA.
	      </p><p>
		A value of 0 tells MIRA to set this to the number of available
		physical cores on the machine it runs on. That is,
		hyperthreaded "cores" are not counted in as using these would
		cause a tremendous slowdown in the heavy duty computation
		parts. E.g., a machine with 2 processors having 4 cores each
		will have this value set to 8.
	      </p><p>
		In case MIRA cannot find out the number of cores, the
		fall-back value is <span class="underline">2</span>.
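	      </p><p>
		The selection rule can be summarised in a short Python
		sketch. The function name is hypothetical, and the number
		of physical cores is passed in as an argument because the
		way MIRA detects it is not described here.
	      </p><pre class="screen">
```python
def resolve_thread_count(requested, physical_cores=None):
    """Return the thread count for a number_of_threads value.

    physical_cores is whatever core detection reported, or None if
    detection failed.
    """
    if requested not in range(0, 257):
        raise ValueError("number_of_threads must be between 0 and 256")
    if requested != 0:
        return requested
    if physical_cores is None:
        return 2  # documented fall-back value
    return physical_cores
```
</pre><p>
		For the machine in the example above (2 processors with 4
		cores each), a value of 0 resolves to 8 threads.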
	      </p><p>
		Note: when running the SKIM algorithm in parallel threads,
		MIRA can give different results when started with the same
		data and same arguments. While the effect could be averted for
		SKIM, the memory cost for doing so would be an additional 50%
		for one of the large tables, so this has not been implemented
		at the moment. Besides, at the latest when the Smith-Watermans
		run in parallel, this could not be easily avoided at all.
	      </p></dd><dt><span class="term">
	      [automatic_memory_management(amm)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">Yes</span>. Whether
		MIRA tries to optimise the run time of certain algorithms by
		trading memory for speed, increasing or reducing some
		internal tables as memory permits.
	      </p><p>
		Note 1: This functionality currently relies on the
		<code class="filename">/proc</code> file system giving information on
		the system memory ("MemTotal" in /proc/meminfo) and the memory
		usage of the current process ("VmSize" in
		<code class="filename">/proc/self/status</code>). If this is not
		available, the functionality is switched off.
	      </p><p>
		Note 2: The automatic memory management can only work if there
		actually is unused system memory. It's not a wonder switch
		which reduces memory consumption. In tight memory situations,
		memory management has no effect and the algorithms fall back
		to minimum table sizes. This means that the effective size in
		memory can grow larger than given in the memory management
		parameters, but then MIRA will try to keep the additional
		memory requirements to a minimum.
	      </p></dd><dt><span class="term">
	      [max_process_size(mps)=<em class="replaceable"><code>0 &#8804; integer</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">0</span>. If
	      automatic memory management is used (see above), this number is
	      the size in gigabytes that the MIRA process will use as maximum
	      target size when looking for space/time trade-offs. A value of 0
	      means that MIRA does not try to keep a fixed upper limit.
	      </p><p>
		Note: when in competition with [-GE:kpmf] (see below),
		the smaller of both sizes is taken as target. Example: if your
		machine has 64 GiB but you limit the use to 32 GiB, then the
		MIRA process will try to stay within these 32 GiB.
	      </p></dd><dt><span class="term">
	      [keep_percent_memory_free(kpmf)=<em class="replaceable"><code>0 &#8804; integer</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">10</span>. If
	      automatic memory management is used (see above), this number
	      works a bit like  [-GE:mps] but the other way round: it
	      tries to keep x percent of the memory free.
	      </p><p>
		Note: when in competition with [-GE:mps] (see above),
		the argument leaving the most memory free is taken as
		target. Example: if your machine has 64 GiB and you limit the
		use to 42 GiB via  [-GE:mps] but have a
		 [-GE:kpmf] of 50, then the MIRA process will try to
		stay within 64-(64*50%)=32 GiB.
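	      </p><p>
		As a minimal sketch, the example above corresponds to the
		following <code class="literal">parameters</code> line in a
		manifest file. The rest of the manifest is omitted and the
		values are illustrative only:
	      </p><pre class="screen">
# 64 GiB machine: mps alone would allow 42 GiB, kpmf=50 allows 64-(64*50%)=32 GiB
parameters = -GE:amm=yes -GE:mps=42 -GE:kpmf=50
</pre><p>
		As the argument leaving the most memory free is taken as
		target, MIRA would aim at 32 GiB here.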
	      </p></dd><dt><span class="term">
	      [preprocess_only(ppo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">no</span>. As a
	      special use case, MIRA will just run the following tasks:
	      loading and clipping of reads as well as calculating kmer
	      frequencies and read repeat information. The resulting reads can
	      then be found as MAF file in the checkpoint directory; the read
	      repeat information in the info directory.
	      </p><p>
		No assembly is performed.
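	      </p><p>
		As a minimal sketch, such a preprocessing-only run could be
		requested with the following manifest line (the
		<code class="literal">project</code>, <code class="literal">job</code>
		and <code class="literal">readgroup</code> entries of a complete
		manifest are omitted here):
	      </p><pre class="screen">
# load and clip reads, compute kmer frequencies and repeat information, then stop
parameters = -GE:ppo=yes
</pre><p>
		The clipped reads and the read repeat information are then
		written to the checkpoint and info directories as described
		above.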
	      </p></dd><dt><span class="term">
	      [est_snp_pipeline_step(esps)=<em class="replaceable"><code>1 &#8804; integer &#8804; 4</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">1</span>. Controls the starting step of the
	      SNP search in EST pipeline and is therefore only useful in
	      miraSearchESTSNPs.
	      </p><p>
		EST assembly is a three-step process. Each step uses different
		settings for the assembly engine, and the result of each step
		is saved to disk. If results of previous steps are present
		in a directory, one can easily "play around" with different
		settings for subsequent steps by reusing the results of the
		previous steps and directly starting with step two or three.
	      </p></dd><dt><span class="term">
	      [print_date(pd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. Controls
		whether date and time are printed out during the
		assembly. Suppressing it is not useful in normal operation,
		only when debugging or benchmarking.
	      </p></dd><dt><span class="term">
	      [bang_on_throw(bot)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>. For
		debugging purposes only. Controls whether MIRA raises a signal
		when detecting an error which triggers a running debugger like
		gdb.
	      </p></dd></dl></div><p>
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_assembly_as"></a>3.4.4.6. 
	  Parameter group: -ASSEMBLY (-AS)
	</h4></div></div></div><p>
	  General options for controlling the assembly.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [num_of_passes(nop)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">0</span>. Defines how many iterations of the whole
		assembly process are done.
	      </p><p>
		The default of 0 will let MIRA choose automatically the number
		of passes and the kmer sizes used in each pass
		(see also [-AS:kms] below).
	      </p><p>
		Early termination: if the number of passes was chosen too
		high, one can simply create a file
		<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly/<em class="replaceable"><code>projectname</code></em>_d_chkpt/terminate</code>. At
		the beginning of a new pass, MIRA checks for the existence of
		that file and, if it finds it, acknowledges by renaming it to
		<code class="filename">terminate_acknowledged</code> and then runs 2
		more passes (with special "last pass routines") before
		finishing the assembly.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		As a rule of thumb, <span class="emphasis"><em>de-novo</em></span> assemblies
		should always have at least two passes,
		while <span class="emphasis"><em>mapping</em></span> assemblies should work with
		only one pass. Doing otherwise will lead to results that users
		do not expect. The reason is that the MIRA learning routines
		either have no chance to learn enough about the assembly (for
		de-novo with one pass) or learn "too much" (mapping with more
		than one pass).
	      </td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		MIRA versions &#8804; 4.0.2 were interpreting the value of '0' in
		a different way and only performed pre-processing of
		reads. MIRA can still do this, but this is controlled by the
		new parameter [-GE:ppo].
	      </td></tr></table></div></dd><dt><span class="term">
	      [kmer_series(kms)=<em class="replaceable"><code>comma separated list of integers &#8805; 0 and &#8804; 256</code></em>]
	    </span></dt><dd><p>
		Default is an empty value. If set, overrides [-AS:nop] and  [-SK:kms].
	      </p><p>
		If set, this parameter provides a one-stop-shop for defining the number of passes and the kmer size used in each pass. E.g.: <code class="literal">-AS:kms=17,31,63,127</code> defines an assembly with 4 passes which uses a kmer size of 17 in pass 1, 31 in pass 2, 63 in pass 3 and 127 in pass 4.
	      </p><p>
		Note that it is perfectly valid to use the same kmer size more than once, e.g.: <code class="literal">17,31,63,63</code> will perform a 4 pass assembly, using a kmer size of 63 in passes 3 and 4. It also makes sense to do this, as with default parameters MIRA uses its integrated automatic editor which edits away obvious sequencing errors in each step, thus the second pass with a kmer size of 63 bases can rely on improved reads.
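	      </p><p>
		A minimal manifest sketch of such a setup follows. Only the
		relevant lines are shown, and the job line is an assumption
		chosen for illustration:
	      </p><pre class="screen">
job = genome,denovo,accurate
# 4 passes: kmer size 17, then 31, then 63 twice
parameters = -AS:kms=17,31,63,63
</pre><p>
		Here [-AS:nop] does not need to be set, as [-AS:kms]
		overrides it.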
	      </p></dd><dt><span class="term">
	      [rmb_break_loops(rbl)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology and assembly
		quality level. Defines the maximum number of times a contig
		can be rebuilt during a main assembly pass
		(see [-AS:nop] or  [-AS:kms]) if misassemblies due to possible repeats
		are found.
	      </p></dd><dt><span class="term">
	      [max_contigs_per_pass(mcpp)=<em class="replaceable"><code>integer</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">0</span>. Defines
		how many contigs are maximally built in each pass. A value of
		0 stands for 'unlimited'. Values &gt;0 can be used for special
		use cases like test assemblies etc.
	      </p><p>
		If in doubt, do not touch this parameter.
	      </p></dd><dt><span class="term">
	      [automatic_repeat_detection(ard)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is currently <span class="underline">yes</span>. Tells MIRA to use coverage
		information accumulated over time to more accurately pinpoint reads that are
		in repetitive regions.
	      </p></dd><dt><span class="term">
	      [coverage_threshold(ardct)=<em class="replaceable"><code>float &gt; 1.0</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">2.0</span> for all sequencing technologies in most assembly cases. If
	      a read has ever been aligned at positions
	      where the total coverage of all reads of the same sequencing technology
	      reached the average coverage times [-AS:ardct] (over a length of
	      [-AS:ardml], see below), then this read is considered to be
	      repetitive.
	      </p></dd><dt><span class="term">
	      [min_length(ardml)=<em class="replaceable"><code>integer &gt; 1</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology, currently
		<span class="underline">400</span> for Sanger and
		<span class="underline">200</span> for 454 and Ion
		Torrent.
	      </p><p>
		The coverage must exceed the [-AS:ardct] threshold over at
		least this number of bases before a region is really treated
		as a repeat.
	      </p></dd><dt><span class="term">
	      [grace_length(ardgl)=<em class="replaceable"><code>integer &gt; 1</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology.
	      </p></dd><dt><span class="term">
	      [uniform_read_distribution(urd)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is currently always <span class="underline">no</span>
		as these algorithms were supplanted by better ones in MIRA 4.0.
	      </p><p>
		When set to <span class="underline">yes</span>, MIRA
		will analyse coverage of contigs built at a certain stage of
		the assembly and estimate an average expected coverage of
		reads for contigs. This value will be used in subsequent
		passes of the assembly to ensure that no part of the contig
		gets significantly more read coverage of reads that were
		previously identified as repetitive than the estimated average
		coverage allows for.
	      </p><p>
		This switch is useful to disentangle repeats that are
		otherwise 100% identical and generally allows larger contigs
		to be built. It is expected to be useful for Sanger and 454
		sequences. Usage of this switch with Solexa and Ion Torrent
		data is currently not recommended.
	      </p><p>
		It is a real improvement to disentangle repeats, but has the
		side-effect of creating some "contig debris" (small and low
		coverage contigs, things you normally can safely throw away as
		they are representing sequence that already has enough
		coverage).
	      </p><p>
		This switch must be set to <span class="underline">no</span> for EST assembly, assembly of
		transcripts etc. It is recommended to also switch this off for
		mapping assemblies.
	      </p></dd><dt><span class="term">
	      [urd_startinpass(urdsip)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology and assembly
		quality level. Recommended values are: 3 for an assembly with
		3 to 4 passes ([-AS:nop]). Assemblies with 5 passes
		or more should set the value to the number of passes minus 2.
	      </p><p>
		Takes effect only if uniform read distribution
		([-AS:urd]) is on.
	      </p></dd><dt><span class="term">
	      [urd_clipoffmultiplier(urdcm)=<em class="replaceable"><code>float &gt; 1.0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">1.5</span> for all
	        sequencing technologies in most assembly cases.
	      </p><p>
		This option says this: if MIRA determined that the average
		coverage is <span class="emphasis"><em>x</em></span>, then in subsequent passes it will allow
		coverage for reads determined to be repetitive to be built
		into the contig only up to a total coverage of
		<span class="emphasis"><em>x*urdcm</em></span>. Reads that bring the coverage above the threshold
		will be rejected from that specific place in the contig (and
		either be built into another copy of the repeat somewhere else
		or end up as contig debris).
	      </p><p>
		Please note that the lower [-AS:urdcm] is, the more
		contig debris you will end up with (contigs with an average
		coverage less than half of the expected coverage, mostly short
		contigs with just a couple of reads).
	      </p><p>
		Takes effect only if uniform read distribution ([-AS:urd]) is on.
	      </p></dd><dt><span class="term">
	      [spoiler_detection(sd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology and assembly
		quality level. A spoiler is either a chimeric read or
		a read with long parts of unclipped vector sequence still
		included (too long for the [-CL:pvlc] vector
		leftover clipping routines). A spoiler typically prevents
		contigs from being joined. MIRA will cut spoilers back so that
		they no longer harm the assembly.
	      </p><p>
		Recommended for mid- to high-coverage genomic
		assemblies, not recommended for assemblies of ESTs as one
		might lose splice variants with that.
	      </p><p>
		A minimum number of two assembly passes ([-AS:nop])
		must be run for this option to take effect.
	      </p></dd><dt><span class="term">
	      [sd_last_pass_only(sdlpo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. Defines
		whether the spoiler detection algorithms are run only for the
		last pass or for all passes ( [-AS:nop]).
	      </p><p>
		Takes effect only if spoiler detection ([-AS:sd]) is on. If in
		doubt, leave it to 'yes'.
	      </p></dd><dt><span class="term">
	      [minimum_read_length(mrl)=<em class="replaceable"><code>integer &#8805; 20</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology. Defines the minimum length that
	      reads must have to be considered for the assembly. Shorter sequences will be
	      filtered out at the beginning of the process and won't be present in the
	      final project.
	      </p></dd><dt><span class="term">
	      [minimum_reads_per_contig(mrpc)=<em class="replaceable"><code>integer &#8805; 1</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology and the
		[--job] parameter. For genome assemblies it's usually
		around <span class="underline">2</span> for Sanger,
		<span class="underline">5</span> for 454, <span class="underline">5</span> for Ion Torrent, <span class="underline">5</span> for PacBio and <span class="underline">10</span> for Solexa. In EST assemblies,
		it's currently <span class="underline">2</span> for all
		sequencing technologies.
	      </p><p>
		Defines the minimum number of reads a contig must have before
		it is built or saved by MIRA. Overlap clusters with fewer reads
		than defined will not be assembled into contigs; instead, the
		reads in these clusters are immediately transferred to debris.
	      </p><p>
		This parameter is useful to considerably reduce assembly time
		in large projects with millions of reads (like in Solexa
		projects) where a lot of small "junk" contigs with
		contamination sequence or otherwise uninteresting data may be
		created otherwise.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Important: a value larger than 1 for this parameter interferes with
		the functioning of [-OUT:sssip] and
		 [-OUT:stsip].
	      </td></tr></table></div></dd><dt><span class="term">
	      [enforce_presence_of_qualities(epoq)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. When set
		to yes, MIRA will stop the assembly if any read has no quality
		values loaded.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">[-AS:epoq] switches on/off the quality check for a
		complete sequencing technology. More fine-grained control,
		switching the check per readgroup, is available via
		the <span class="emphasis"><em>default_qual</em></span> readgroup parameter in
		the manifest file.
	      </td></tr></table></div></dd><dt><span class="term">
	      [use_genomic_pathfinder(ugpf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. MIRA has
		two different pathfinder algorithms it chooses from to find
		its way through the (more or less) complete set of possible
		sequence overlaps: a genomic and an EST pathfinder. The
		genomic looks a bit into the future of the assembly and tries
		to stay on safe grounds using a maximum of information already
		present in the contig that is being built. The EST version on
		the contrary will directly jump at the complex cases posed by
		very similar repetitive sequences and try to solve those first
		and is willing to fall back to first-come-first-served when
		really bad cases (like, e.g., coverage with thousands of
		sequences) are encountered.
	      </p><p>
		Generally, the genomic pathfinder will also work quite well
		with EST sequences (but might get slowed down a lot in
		pathological cases), while the EST algorithm does not work so
		well on genomes. If in doubt, leave on <span class="underline">yes</span> for genome projects and set to
		<span class="underline">no</span> for EST projects.
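	      </p><p>
		Following this advice, an EST project could switch to the EST
		pathfinder as sketched below. This is a manifest fragment only:
		the job line is an illustrative assumption, the remaining
		manifest entries are omitted, and it is assumed that the
		parameter is set in the common (technology-independent)
		parameter section.
	      </p><pre class="screen">
job = est,denovo,accurate
# use the EST pathfinder instead of the genomic one
parameters = -AS:ugpf=no
</pre><p>
		Genome projects need no such line, as the default is already
		<span class="underline">yes</span>.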
	      </p></dd><dt><span class="term">
	      [use_emergency_search_stop(uess)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. Another
		important switch if you plan to assemble non-normalised EST
		libraries, where some ESTs may reach coverages of several
		hundreds or thousands of reads. This switch lets MIRA save a
		lot of computational time when aligning those extremely high
		coverage areas (but only there), at the expense of some
		accuracy.
	      </p></dd><dt><span class="term">
	      [ess_partnerdepth(esspd)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">500</span>. Defines the number of potential
	      partners a read must have for MIRA to switch into emergency
	      search stop mode for that read.
	      </p></dd><dt><span class="term">
	      [use_max_contig_buildtime(umcbt)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>. Defines whether there is an upper limit of time
		to be used to build one contig. Set this to yes in EST assemblies where you
		think that extremely high coverages occur. Less useful for assembly of
		genomic sequences.
	      </p></dd><dt><span class="term">
	      [buildtime_in_seconds(bts)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">3600</span> for genome
	      assemblies, <span class="underline">720</span> for EST
	      assemblies with Sanger or 454
	      and <span class="underline">360</span> for EST assemblies
	      with Solexa or Ion Torrent. Depending on  [-AS:umcbt]
	      above, this number defines the time in seconds allocated to
	      building one contig.
	      </p></dd></dl></div><p>
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_strain_backbone_sb"></a>3.4.4.7. 
	  Parameter group: -STRAIN/BACKBONE (-SB)
	</h4></div></div></div><p>
	  Controlling backbone options in mapping assemblies:
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [bootstrap_new_backbone(bnb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span> for
		mapping assemblies with Illumina data, no otherwise.
	      </p><p>
		When set to 'yes', MIRA will use a two stage mapping process
		which bootstraps an intermediate backbone (reference) sequence
		and greatly improves mapping accuracy at indel sites.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Currently only works with Illumina data, other sequencing
		technologies will not be affected by this flag.
	      </td></tr></table></div></dd><dt><span class="term">
	      [startbackboneusage_inpass(sbuip)=<em class="replaceable"><code>0 &lt; integer</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on assembly quality level chosen: 0 for 'draft'
	      and [-AS:nop] divided by 2 for 'accurate'.
	      </p><p>
		When assembling against backbones, this parameter defines the
		pass iteration (see [-AS:nop]) from which on the
		backbones will be really used. In the passes preceding this
		number, the non-backbone reads will be assembled together as
		if no backbones existed.  This allows MIRA to correctly spot
		repetitive stretches that differ by single bases and tag them
		accordingly. Note that full assemblies are considerably slower
		than mapping assemblies, so be careful with this when
		assembling millions of reads.
	      </p><p>
		Rule of thumb: if backbones belong to the same strain as the reads to assemble, set
		to <span class="underline">1</span>. If backbones are a different strain, then set
		[-SB:sbuip] to 1 lower than [-AS:nop] (example: nop=4 and
		sbuip=3).
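	      </p><p>
		The second case of this rule of thumb could look as follows in
		a manifest. This is a sketch only: the job line is an
		illustrative assumption and the rest of the manifest is
		omitted.
	      </p><pre class="screen">
job = genome,mapping,accurate
# backbone from a different strain: use it from pass 3 of 4 onwards
parameters = -AS:nop=4 -SB:sbuip=3
</pre><p>
		For a backbone of the same strain,
		<code class="literal">-SB:sbuip=1</code> would be used instead.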
	      </p></dd><dt><span class="term">
	      [backbone_raillength(brl)=<em class="replaceable"><code>0 &#8804; integer &#8804; 10000</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline">0</span>. Parameter for the
	      internal sectioning size of the backbone to compute optimal
	      alignments. Should be set to twice the length of the longest read in
	      the input data, plus 15%. When set to 0, MIRA will compute optimal
	      values from the data loaded.
	      </p></dd><dt><span class="term">
	      [backbone_railoverlap(bro)=<em class="replaceable"><code>0 &#8804; integer &#8804; 2000</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">0</span>.
	      Parameter for the internal sectioning size of the backbone to
	      compute optimal alignments. Should be set to length of the
	      longest read. When set to 0, MIRA will compute optimal values
	      from the data loaded.
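	      </p><p>
		As a worked example of the two recommendations above, assume
		(purely for illustration) that the longest read is 500 bases
		long. Then [-SB:brl] would be 2*500 + 15% = 1150 and
		[-SB:bro] would be 500, which could be written in a manifest
		as:
	      </p><pre class="screen">
# longest read assumed to be 500 bases
parameters = -SB:brl=1150 -SB:bro=500
</pre><p>
		Leaving both parameters at 0 lets MIRA derive these values
		itself from the data loaded.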
	      </p></dd><dt><span class="term">
	      [trim_overhanging_reads(tor)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>.
	      </p><p>
		When set to 'yes', MIRA will trim back reads at end of contigs
		which outgrow the reference sequence so that boundaries of
		the reference and the mapped reads align perfectly. That is,
		the mapping does not perform a sequence extension.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		The trimming is performed via setting low quality cutoffs in
		the reads, i.e., the trimmed parts are not really gone but
		just not part of the active contig anymore. They can be
		uncovered when working on the assembly in finishing programs
		like, e.g., <span class="command"><strong>gap4</strong></span>
		or <span class="command"><strong>gap5</strong></span>.
	      </td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		Previous versions of MIRA (up to and including 3.9.18) behaved
		as if this option had been set to 'no'. This is a major change
		in behaviour, but it is also what probably most people expect
		from a mapping.
	      </td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_dataprocessing_dp"></a>3.4.4.8. 
	  Parameter group: -DATAPROCESSING (-DP)
	</h4></div></div></div><p>
	  Options for controlling some data processing during the assembly.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [use_read_extension(ure)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default
	      is dependent on the sequencing technology used: <span class="underline">yes</span> for Sanger,
	      no for all others. MIRA expects the sequences it is given to be
	      quality clipped. During the assembly though, it will try to extend reads
	      into the clipped region and gain additional coverage by analysing
	      Smith-Waterman alignments between reads that were found to be valid. Only
	      the right clip is extended though, the left clip (most of the time
	      containing sequencing vector) is never touched.
	      </p></dd><dt><span class="term">
	      [read_extension_window_length(rewl)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default
	      is dependent on the sequencing technology used. Only takes effect when
	      [-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
	      routines use a sliding window approach on Smith-Waterman alignments. This
	      parameter defines the window length.
	      </p></dd><dt><span class="term">
	      [read_extension_with_maxerrors(rewme)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. Only takes effect
		when [-DP:ure] (see above) is set to <span class="underline">yes</span>. The read
		extension routines use a sliding window approach on Smith-Waterman
		alignments. This parameter defines the maximum number of errors
		(=disagreements) between the two aligned reads in the given window.
	      </p></dd><dt><span class="term">
	      [first_extension_in_pass(feip)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. Only takes effect when
	      [-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
	      routines can be called before assembly and/or after each assembly pass (see
	       [-AS:nop]). This parameter defines the first pass in which the read
	      extension routines are called. The default of <span class="underline">0</span> tells
	      MIRA to extend the reads the first time before the first assembly
	      pass.
	      </p></dd><dt><span class="term">
	      [last_extension_in_pass(leip)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. Only takes effect when
	      [-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
	      routines can be called before assembly and/or after each assembly pass (see
	       [-AS:nop]). This parameter defines the last pass in which the read
	      extension routines are called. The default of <span class="underline">0</span> tells
	      MIRA to extend the reads the last time before the first assembly
	      pass.
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_clipping_cl"></a>3.4.4.9. 
	  Parameter group: -CLIPPING (-CL)
	</h4></div></div></div><p>
	  Controls for clipping options: when and how sequences should be clipped.
	</p><p>
	  Every option in this section can be set individually for every sequencing
	  technology, giving a very fine grained control on how reads are clipped for
	  each technology.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [msvs_gap_size(msvsgs)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. Takes
		effect only when loading data from ancillary SSAHA2 or SMALT
		files.
	      </p><p>
		While performing the clip of screened vector sequences, MIRA
		will look if it can merge larger chunks of sequencing vector
		bases that are a maximum of [-CL:msvsgs] apart.
	      </p></dd><dt><span class="term">
	      [msvs_max_front_gap(msvsmfg)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. Takes
		effect only when loading data from ancillary SSAHA2 or SMALT
		files.
	      </p><p>
		While performing the clip of screened vector sequences at the
		start of a sequence, MIRA will allow up to this number of
		non-vector bases in front of a vector stretch.
	      </p></dd><dt><span class="term">
	      [msvs_max_end_gap(msvsmeg)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. Takes
		effect only when loading data from ancillary SSAHA2 or SMALT
		files.
	      </p><p>
		While performing the clip of screened vector sequences at the
		end of a sequence, MIRA will allow up to this number of
		non-vector bases behind a vector stretch.
	      </p></dd><dt><span class="term">
	      [possible_vector_leftover_clip(pvlc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology
		used: <span class="underline">yes</span> for
		Sanger, <span class="underline">no</span> for any
		other. MIRA will try to identify possible sequencing vector
		relics present at the start of a sequence and clip them
		away. These relics are usually a few bases long and were not
		correctly removed from the sequence in data preprocessing
		steps of external programs.
	      </p><p>
		You might want to turn off this option if you know (or think)
		that your data contains a lot of repeats and the option below
		to fine tune the clipping behaviour does not give the expected
		results.
	      </p><p>
		You certainly want to turn off this option in EST assemblies
		as this will quite certainly cut back (and thus hide)
		different splice variants. But then make certain that your
		pre-processing of Sanger data (sequencing vector removal) is
		good; other sequencing technologies are not affected.
	      </p></dd><dt><span class="term">
	      [pvc_maxlenallowed(pvcmla)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is dependent on the sequencing technology
	      used. The clipping of possible vector relics works quite
	      well. Unfortunately, the bounds of repeats or
	      differences in EST splice variants sometimes show the same
	      alignment behaviour as possible sequencing vector relics and
	      could therefore also be clipped.
	      </p><p>
		To keep the vector clipping from mistakenly clipping repetitive
		regions or EST splice variants, this option puts an upper
		bound on the number of bases a potential clip is allowed to
		have. If the number of bases is below or equal to this
		threshold, the bases are clipped. If the number of bases
		exceeds the threshold, the clip
		is <span class="bold"><strong>NOT</strong></span> performed.
	      </p><p>
		Setting the value to 0 turns off the threshold, i.e., clips are then always
		performed if a potential vector was found.
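	      </p><p>
		The decision rule described above can be summarised in a few
		lines of Python. This is only an illustrative sketch and not
		MIRA source code; the function name
		<code class="literal">should_clip_vector_relic</code> and its
		arguments are made up for this example.
	      </p><pre class="programlisting">
def should_clip_vector_relic(relic_len, pvcmla):
    """Decide whether a potential vector relic of relic_len bases is clipped.

    pvcmla mirrors [-CL:pvcmla]: 0 disables the threshold, otherwise
    only candidates of at most pvcmla bases are clipped.
    """
    if pvcmla == 0:
        return True  # threshold switched off: always clip
    return pvcmla >= relic_len
</pre><p>
		With a threshold of 7, a 5 base candidate would be clipped
		while an 8 base candidate would be left untouched.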
	      </p></dd><dt><span class="term">
	      [quality_clip(qc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">no</span>. This will let MIRA
	      perform its own quality clipping before sequences are entered
	      into the assembly. The clip function performed is a sequence end
	      window quality clip with back iteration to get a maximum number
	      of bases as useful sequence. Note that the bases clipped away
	      here can still be used afterwards if there is enough evidence
	      supporting their correctness when the option  [-DP:ure]
	      is turned on.
	      </p><p>
		Warning: The windowing algorithm works pretty well for Sanger,
		but apparently does not like 454 type data. It's advisable
		not to switch it on for 454. Besides, the 454 quality clipping
		algorithm performs a pretty decent albeit not perfect job, so
		for genomic 454 data (not ESTs!), it is currently recommended
		to use a combination of [-CL:emrc] and
		 [-DP:ure].
	      </p></dd><dt><span class="term">
	      [qc_minimum_quality(qcmq)=<em class="replaceable"><code>integer &#8805; 15 and &#8804; 35</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. This is the minimum
		quality that the bases in a window need in order to be accepted. Please be careful not to
		choose too extreme values here, because then the clipping will be too lax or
		too harsh. Values below 15 and higher than 30-35 are not recommended.
	      </p></dd><dt><span class="term">
	      [qc_window_length(qcwl)=<em class="replaceable"><code>integer &#8805; 10</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. This is the length of a window
	      in bases for the quality clip.
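	      </p><p>
		To show how [-CL:qcmq] and [-CL:qcwl] interact, here is a
		strongly simplified Python sketch of a window based quality
		clip. It is not the algorithm implemented in MIRA: it assumes
		that a window is accepted when its mean quality reaches the
		minimum, and it does not model the back iteration mentioned
		above. The function name is hypothetical.
	      </p><pre class="programlisting">
def window_quality_clip(quals, window_len, min_qual):
    """Return (left, right) so that quals[left:right] is the kept part.

    Simplification: the kept part spans from the first to the last
    window whose mean quality reaches min_qual.
    """
    n = len(quals)
    good_starts = [
        i for i in range(n - window_len + 1)
        if sum(quals[i:i + window_len]) >= min_qual * window_len
    ]
    if not good_starts:
        return (0, 0)  # no acceptable window: nothing usable
    return (good_starts[0], good_starts[-1] + window_len)
</pre><p>
		Larger windows smooth over single bad bases, while a higher
		minimum quality moves both clip points towards the middle of
		the read.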
	      </p></dd><dt><span class="term">
	      [bad_stretch_quality_clip (bsqc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">no</span>. This
		option allows MIRA to clip reads that were not correctly preprocessed
		and have unclipped bad quality stretches that might prevent a
		good assembly.
	      </p><p> MIRA will search the sequence in forward direction for a
		stretch of bases that have on average a quality less than a
		defined threshold and then set the right quality clip of this
		sequence to cover the given stretch.
	      </p></dd><dt><span class="term">
	      [bsqc_minimum_quality (bsqcmq)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is dependent
	      on the sequencing technology used. Defines the minimum average quality a
	      given window of bases must have. If this quality is not reached, the
	      sequence will be clipped at this position.
	      </p></dd><dt><span class="term">
	      [bsqc_window_length (bsqcwl)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is dependent on the
	      sequencing technology used. Defines the length of the window within which
	      the average quality of the bases is computed.
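	      </p><p>
		The interplay of [-CL:bsqcmq] and [-CL:bsqcwl] can be
		pictured with the following Python sketch. It is an
		illustration under the assumption that the read is scanned
		window by window in forward direction and clipped at the
		start of the first window whose average falls below the
		minimum; MIRA's actual implementation may differ in detail.
		The function name is hypothetical.
	      </p><pre class="programlisting">
def bad_stretch_right_clip(quals, window_len, min_avg_qual):
    """Return the right clip position; len(quals) means no bad stretch."""
    n = len(quals)
    for start in range(n - window_len + 1):
        window = quals[start:start + window_len]
        # Compare sums instead of averages to stay in integer arithmetic.
        if min_avg_qual * window_len > sum(window):
            return start
    return n
</pre><p>
		Everything from the returned position to the end of the read
		would then be covered by the right quality clip.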
	      </p></dd><dt><span class="term">
	      [maskedbases_clip(mbc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. This will let MIRA
	      perform a 'clipping' of bases that were masked out (replaced with the
	      character X). It is generally not a good idea to mask bases to remove
	      unwanted portions of a sequence; the EXP file format and the NCBI traceinfo
	      format offer excellent ways to avoid this. But because a lot of
	      preprocessing software is built around cross_match, scylla-
	      and phrap-style base masking, the need arose for MIRA to
	      be able to handle this, too.  MIRA will look at the start and end of
	      each sequence to see whether there are masked bases that should be
	      'clipped'.
	      </p></dd><dt><span class="term">
	      [mbc_gap_size(mbcgs)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is dependent on
	      the sequencing technology used. While performing the clip of masked bases,
	      MIRA will look if it can merge larger chunks of masked bases that are
	      a maximum of [-CL:mbcgs] apart.
	      </p></dd><dt><span class="term">
	      [mbc_max_front_gap(mbcmfg)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. While performing the clip of
	      masked bases at the start of a sequence, MIRA will allow up to this
	      number of unmasked bases in front of a masked stretch.
	      </p></dd><dt><span class="term">
	      [mbc_max_end_gap(mbcmeg)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. While performing the clip of
	      masked bases at the end of a sequence, MIRA will allow up to this
	      number of unmasked bases behind a masked stretch.
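	      </p><p>
		How the front gap and the merging of masked chunks could work
		together is sketched below in Python for the start of a
		sequence. This is my own simplified reading of the three
		parameters [-CL:mbcmfg], [-CL:mbcgs] and [-CL:mbcmeg], not
		code taken from MIRA; the function name is hypothetical and
		the end of the sequence would be handled symmetrically.
	      </p><pre class="programlisting">
import re

def masked_front_clip(seq, max_front_gap, gap_size):
    """Return the number of bases to clip at the start of seq (0 if none)."""
    runs = [m.span() for m in re.finditer(r"X+", seq)]
    # No masked stretch, or too many unmasked bases in front of it.
    if not runs or runs[0][0] > max_front_gap:
        return 0
    clip_end = runs[0][1]
    # Merge following masked chunks that are at most gap_size bases apart.
    for start, end in runs[1:]:
        if start - clip_end > gap_size:
            break
        clip_end = end
    return clip_end
</pre><p>
		For the sequence <code class="literal">ACXXXXXGTXXXACGTACGT</code>,
		a front gap of 3 and a gap size of 2, the sketch merges both
		masked chunks and proposes a clip of 12 bases.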
	      </p></dd><dt><span class="term">
	      [lowercase_clip_front(lccf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used: on for 454 and Ion
	      Torrent data, off for all
	      others. This will let MIRA perform a 'clipping' of bases that are in
	      lowercase at the front end of a sequence, leaving only the uppercase
	      sequence. Useful when handling 454 data that does not have ancillary data in
	      XML format.
	      </p></dd><dt><span class="term">
	      [lowercase_clip_back(lccb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used: on for 454 and Ion
	      Torrent data, off for all
	      others. This will let MIRA perform a 'clipping' of bases that are in
	      lowercase at the back end of a sequence, leaving only the uppercase
	      sequence. Useful when handling 454 data that does not have ancillary data in
	      XML format.
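	      </p><p>
		The effect of [-CL:lccf] and [-CL:lccb] on a single read can
		be illustrated with a small Python sketch. It is an
		illustration only, with a hypothetical function name, and
		simply treats every lowercase character at either end as
		clipped.
	      </p><pre class="programlisting">
def lowercase_clip(seq):
    """Return (left, right) so that seq[left:right] is the uppercase core."""
    left = 0
    while len(seq) > left and seq[left].islower():
        left += 1  # skip lowercase bases at the front
    right = len(seq)
    while right > left and seq[right - 1].islower():
        right -= 1  # skip lowercase bases at the back
    return left, right
</pre><p>
		For <code class="literal">acgtACGTACgta</code> the sketch
		keeps only the uppercase part
		<code class="literal">ACGTAC</code>.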
	      </p></dd><dt><span class="term">
	      [clip_polyat(cpat)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">yes</span> for all EST/RNASeq
	      assemblies. Poly-A stretches in forward reads and poly-T
	      stretches in reverse reads get either clipped or tagged here
	      (see  [-CL:cpkps] below). The assembler will not use
	      these stretches for finding overlaps, but it will use these to
	      discern and disassemble different 3' UTR endings.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		Should poly-A / poly-T stretches have been trimmed in
		pre-processing steps before MIRA got the reads, this option
		MUST be switched off.
	      </td></tr></table></div></dd><dt><span class="term">
	      [cp_keep_poly_stretch (cpkps)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">yes</span> but takes effect only
	      if  [-CL:cpat] (see above) is also set to yes.
	      </p><p>
		Instead of clipping the poly-A / poly-T sequence away, the
		stretch in question in the reads is kept and tagged. The tags
		provide additional information for MIRA to discern between
		different 3' UTR endings and also serve as a good visual anchor when
		looking at the assembly with different programs.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		One side-effect of this option is that the poly-A / poly-T
		stretches are 'cleaned'. That is, single non-A / non-T
		bases within the stretch are automatically edited to
		conform to the surrounding stretch. This is necessary as
		homopolymers are by nature one of the hardest motifs to be
		sequenced correctly by any sequencing technology and one
		frequently gets 'dirty' poly-A sequence from sequencing and
		this interferes heavily with the methods MIRA uses to discern
		repeats.
	      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Keeping the poly-A sequence is a two-edged sword: on the one hand it
		makes it possible to discern different 3' UTR endings, on the other hand
		it might be that sequencing problems toward the end of reads
		create false-positive different endings. If you find that this
		is the case for your data, just switch off this option: MIRA
		will then simply build the longest possible 3' UTRs.
	      </td></tr></table></div></dd><dt><span class="term">
	      [cp_min_sequence_len(cpmsl)=<em class="replaceable"><code>integer &gt;
	      0</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">10</span>. Only takes effect
	      when  [-CL:cpat] (see above) is set
	      to <span class="underline">yes</span>. Defines the number
	      of 'A' (in forward direction) or 'T' (in reverse direction) that must
	      be present for a stretch to be considered a poly-A sequence stretch.
	      </p></dd><dt><span class="term">
	      [cp_max_errors_allowed(cpmea)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">1</span>. Only takes effect
	      when  [-CL:cpat] (see above) is set
	      to <span class="underline">yes</span>. Defines the
	      maximum number of errors allowed in the potential poly-A
	      sequence stretch. The distribution of these errors is not
	      important.
	      </p></dd><dt><span class="term">
	      [cp_max_gap_from_end(cpmgfe)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">9</span>. Only
	      takes effect when  [-CL:cpat] (see above) is set
	      to <span class="underline">yes</span>. Defines the number
	      of bases from the end of a sequence (if masked: from the end of
	      the masked area) within which a poly-A sequence stretch is
	      looked for.
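	      </p><p>
		The three parameters [-CL:cpmsl], [-CL:cpmea] and
		[-CL:cpmgfe] can be read together as a small search
		procedure. The Python sketch below is one possible
		interpretation and not MIRA's implementation: it only handles
		poly-A in forward reads, assumes that the stretch has to end
		within the given distance from the read end, and uses a
		hypothetical function name. The defaults mirror the defaults
		documented above.
	      </p><pre class="programlisting">
def find_poly_a(seq, min_len=10, max_errors=1, max_gap_from_end=9):
    """Return (start, end) of a poly-A stretch near the 3' end, or None."""
    n = len(seq)
    best = None
    # Try every stretch end that lies close enough to the read end.
    for end in range(n, max(n - max_gap_from_end, 0) - 1, -1):
        if end == 0 or seq[end - 1] != "A":
            continue
        start, errors, last_a = end, 0, end
        # Extend towards the 5' end while the error budget allows it.
        while start > 0:
            if seq[start - 1] == "A":
                last_a = start - 1
            elif errors == max_errors:
                break
            else:
                errors += 1
            start -= 1
        length = end - last_a  # stretch is trimmed to start with an 'A'
        if length >= min_len and (best is None or length > best[1] - best[0]):
            best = (last_a, end)
    return best
</pre><p>
		In this sketch a single 'G' inside an otherwise clean run of
		'A' is tolerated with the default of one allowed error, which
		corresponds to the 'dirty' poly-A sequence mentioned for
		[-CL:cpkps].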
	      </p></dd><dt><span class="term">
	      [clip_3ppolybase (c3pp)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd>
	      c3p* options to be described ...
	    </dd><dt><span class="term">
	      [clip_known_adaptorsright (ckar)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. Defines
		whether MIRA should search for and clip known sequencing technology
		specific sequencing adaptors. MIRA knows adaptors for Illumina
		best, followed by Ion Torrent and some 454 adaptors.
	      </p><p>
		As the list of known adaptors changes quite frequently, the
		best way to get the list of adaptors known to MIRA is to
		look at the text files in the program
		sources: <code class="filename">src/mira/adaptorsforclip.*.xxd</code>.
	      </p></dd><dt><span class="term">
	      [ensure_minimum_left_clip(emlc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. If on, ensures a
		minimum left clip on each read according to the parameters in
		[-CL:mlcr:smlc].
	      </p></dd><dt><span class="term">
	      [minimum_left_clip_required(mlcr)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default
	      is dependent on the sequencing technology used. If [-CL:emlc] is
	      on, checks whether there is a left clip whose length is at least the one
	      specified here.
	      </p></dd><dt><span class="term">
	      [set_minimum_left_clip(smlc)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. If [-CL:emlc] is on
	      and the actual left clip is &lt;  [-CL:mlcr], set the left clip of the read to
	      the value given here.
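	      </p><p>
		Taken together, [-CL:emlc], [-CL:mlcr] and [-CL:smlc]
		describe a simple rule, sketched here in Python. The function
		name and argument names are made up for this illustration;
		the right clip parameters [-CL:emrc:mrcr:smrc] are described
		analogously.
	      </p><pre class="programlisting">
def ensure_minimum_left_clip(left_clip, mlcr, smlc, emlc=True):
    """Return the left clip after applying the minimum left clip rule."""
    if emlc and mlcr > left_clip:
        return smlc  # existing clip is too short: set it to smlc
    return left_clip  # clip is long enough or the option is off
</pre><p>
		A read with no left clip at all and
		<code class="literal">mlcr=4, smlc=4</code> would thus get a
		left clip of 4 bases, while a read already clipped by 10
		bases stays as it is.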
	      </p></dd><dt><span class="term">
	      [ensure_minimum_right_clip(emrc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. If on, ensures a
		minimum right clip on each read according to the parameters in
		[-CL:mrcr:smrc].
	      </p></dd><dt><span class="term">
	      [minimum_right_clip_required(mrcr)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default
	      is dependent on the sequencing technology used. If [-CL:emrc] is
	      on, checks whether there is a right clip whose length is at least the one
	      specified here.
	      </p></dd><dt><span class="term">
	      [set_minimum_right_clip(smrc)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. If [-CL:emrc] is on
	      and the actual right clip is &lt;  [-CL:mrcr], set the length of the
	      right clip of the read to the value given here.
	      </p></dd><dt><span class="term">
	      [gb_chimeradetectionclip(gbcdc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span> for all jobs.
	      </p><p>
		Very safe chimera detection, should have no false
		positives. For repetitive data, a low number of false
		negatives is possible.
	      </p></dd><dt><span class="term">
	      [kmerjunk_detection(kjd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is currently <span class="underline">yes</span>.
	      </p><p>
		Reads that look "fishy" are marked as potentially
		chimeric. This mark leads either to a read being completely
		killed or to a read being included into a contig only if no
		other possibility remains.
	      </p><p>
		It is currently suggested to leave this parameter switched on
		and to fine-tune via [-CL:kjck] (see below).
	      </p></dd><dt><span class="term">
	      [kmerjunk_completekill(kjck)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is currently <span class="underline">no</span>
		for genome assemblies and <span class="underline">yes</span> for EST/RNASeq assemblies.
	      </p><p>
		If set to yes, reads marked as junk (see above) are completely
		removed from an assembly. If set to no, reads are not removed
		but included only into a contig as a very last resort.
	      </p><p>
		Having reads killed guarantees assemblies of extremely high
		quality containing virtually no misassembly due to chimeric
		sequencing errors. The downside is that, computationally,
		there is no difference between junk and stretches with correct
		but very low coverage data (generally &lt; 3x coverage). It's
		up to you to decide what is more important: total accuracy or
		longer contigs.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
		  As a rule of thumb: I set this to no for genome assemblies
		  with at least medium average coverage (&#8805; 20-30x) as MIRA
		  does a pretty good job of incorporating these reads so late in
		  an assembly that they do not lead to misassemblies. In
		  transcript assemblies I set this to yes as there is a high
		  chance that high coverage transcripts could be extended via
		  chimeric reads.
		</p><p>
		  With this in mind: deciding for metagenome assemblies would
		  be really difficult though. It probably depends on what you
		  need the data for.
		</p></td></tr></table></div></dd><dt><span class="term">
	      [propose_end_clips(pec)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on --job quality: currently <span class="underline">yes</span> for all genome assemblies.
		Switched off for EST assemblies (but one might want to switch
		it on sometimes).
	      </p><p>
		This implements a pretty powerful strategy to ensure a good
		"high confidence region" (HCR) in reads, basically eliminating
		99.9% of all junk at the 5' and 3' ends of reads. Note that
		one still must ensure that sequencing vectors (Sanger) or
		adaptor sequences (454, Solexa, Ion Torrent) are "more or less"
		clipped prior to assembly.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		Extremely effective, but should NOT be used for very low
		coverage genomic data, or for EST projects if one wants to
		retain the rarest transcripts.
	      </td></tr></table></div></dd><dt><span class="term">
	      [handle_solexa_ggcxg_problem(pechsgp)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>.
	      </p><p>
		Solexa data has a pretty awful problem in some reads when
		a <code class="literal">GGCxG</code> motif occurs (read more about it in
		the chapter on Solexa data). In short: the sequencing errors
		produced by this problem lead to many false positive SNP
		discoveries in mapping assemblies or problems in contig
		building in de-novo assembly.
	      </p><p>
		MIRA knows about this problem and can look for it in Solexa
		reads during the proposed end clipping and further clip back
		the reads, greatly minimising the impact of this problem.
	      </p></dd><dt><span class="term">
	      [pec_kmer_size(peckms)=<em class="replaceable"><code>10 &#8804; integer &#8804; 32</code></em>]
	    </span></dt><dd><p>
		Default is dependent on technology and quality in the --job
		switch: usually
		between <span class="underline">17</span>
		and <span class="underline">21</span> for Sanger,
		higher for 454 (up to
		<span class="underline">27</span>) and highest for
		Solexa (<span class="underline">31</span>). Ion Torrent
		has at the moment <span class="underline">17</span>,
		but this may change in the future to somewhat higher values.
	      </p><p>
		This parameter defines the minimum number of bases at each end
		of a read that should be free of any sequencing errors.
	      </p></dd><dt><span class="term">
	      [pec_minimum_kmer_forward_reverse(pmkfr)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p>
		Default is dependent on technology and quality in the --job
		switch: usually
		between <span class="underline">1</span>
		and <span class="underline">3</span>
		when  [-CL:pec=yes].
	      </p><p>
		This parameter defines the minimum number of occurrences of a
		kmer at each end of a read that should be free of any
		sequencing errors.
	      </p></dd><dt><span class="term">
	      [rare_kmer_mask(rkm)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on --job switch: currently
		it's <span class="underline">yes</span> for Solexa data
		and <span class="underline">no</span> otherwise.  If
		this parameter is active, MIRA will completely mask with 'X'
		those parts of a read which have kmer occurrence (in forward
		and reverse direction) less than the value specified
		via  [-CL:pmkfr].
	      </p><p>
		This is a quality ensuring move which improves assembly of
		ultra-high coverage contigs by cleaning out very likely, low
		frequency sequence dependent sequencing errors which passed
		all previous filters. The drawback is that very rare
		transcripts or very lowly covered genome parts with an
		occurrence less than the given value will also be masked
		out. However, Illumina gives so much data that this is almost
		never the case.
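	      </p><p>
		The idea behind masking rare kmers can be sketched in Python
		as follows. This is a toy illustration and not how MIRA
		implements it: it assumes that a kmer and its reverse
		complement are counted together, and that a base is kept as
		soon as one kmer covering it occurs often enough. All names
		in the sketch are hypothetical.
	      </p><pre class="programlisting">
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])

def mask_rare_kmers(reads, k, min_count):
    """Replace bases covered only by rare kmers with 'X'."""
    counts = Counter(
        canonical(r[i:i + k]) for r in reads for i in range(len(r) - k + 1)
    )
    masked = []
    for r in reads:
        keep = [False] * len(r)
        for i in range(len(r) - k + 1):
            if counts[canonical(r[i:i + k])] >= min_count:
                keep[i:i + k] = [True] * k  # kmer is frequent enough
        masked.append("".join(b if ok else "X" for b, ok in zip(r, keep)))
    return masked
</pre><p>
		In a toy data set of three reads where only the last two
		bases of one read are unique, exactly those two bases would
		end up masked when requiring at least two occurrences.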
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		This works only if [-CL:pec] is active.
	      </td></tr></table></div></dd><dt><span class="term">
	      [search_phix174(spx174)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for Illumina data, off
	      otherwise.
	      </p><p>
		PhiX 174 is a small phage of enterobacteria whose DNA is often
		spiked-in during Illumina sequencing to determine error rates
		in datasets and to increase complexity in low-complexity
		samples (amplicon, chipseq etc) to help in cluster
		identification.
	      </p><p>
		If it remains in the sequenced data, it has to be
		seen as a contaminant for projects working with organisms
		which should not contain the PhiX 174 phage.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		However, PhiX may be part of some genome sequences
		(enterobacteria). In these cases, the PhiX174 search will
		report genuine genome data.
	      </td></tr></table></div></dd><dt><span class="term">
	      [filter_phix174(fpx174)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for Illumina data in
	      EST (RNASeq) assemblies, off otherwise.
	      </p><p>
		If [-CL:spx174] is on and  [-CL:fpx174] also,
		MIRA will filter out as contaminants all reads which have
		PhiX174 sequence recognised.
	      </p><p>
		The default value of having the filtering on only for Illumina
		EST (RNASeq) data is a conservative approach: the overwhelming
		majority of RNASeq projects will not sequence
		enterobacteria, so having PhiX174 containing reads thrown out
		is indeed a valid move. For genomes however, MIRA currently is
		cautious and will not filter these reads by default.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		However, PhiX may be part of some genome sequences
		(enterobacteria). In these cases, the PhiX174 filter will
		remove reads from valid genome or expression data.
	      </td></tr></table></div></dd><dt><span class="term">
	      [filter_rrna(frrna)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for EST (RNASeq)
	      assemblies, off otherwise.
	      </p><p>
		If enabled, MIRA will filter out (and not assemble) all reads
		(or pairs, see below) it recognises as being rRNA or
		rDNA. This is useful to reduce computing time on data sets
		which contain large amounts of contaminating rRNA that was not
		filtered away in the wet lab.
	      </p></dd><dt><span class="term">
	      [filter_rrna_pairs(frrnap)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for EST (RNASeq)
	      assemblies, off otherwise.
	      </p><p>
		If enabled together with [-CL:frrna], MIRA will
		filter out (and not assemble) all read pairs where at least
		one of the reads is recognised as being rRNA or rDNA.
	      </p><p>
		This option is useful to also catch less conserved parts of
		transcribed rRNA, e.g., the internal transcribed spacers
		(ITS) in eukaryotic data.
	      </p></dd><dt><span class="term">
	      [filter_rrna_numkmers(frrnank)=<em class="replaceable"><code>integer &gt; 0
	      </code></em>]
	    </span></dt><dd><p> Default is <span class="emphasis"><em>20</em></span>.
	      </p><p>
		The rRNA recognition in MIRA works with a precompiled set of
		conserved rRNA kmers, at the time of this writing with
		21-mers. To allow for specific recognition, the rRNA filtering
		process expects to find at least this number of kmers per read
		before identifying it as rRNA.
	      </p><p>
		To increase sensitivity (and at the same time risk more false
		positives): reduce this parameter. To increase specificity
		(and at the same time risk more reads not being recognised):
		increase this parameter.
	      </p><p>
		The default parameters together with the default database seem
		to work pretty well and this is expected to work for all but
		the most exotic rRNA containing organisms.
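	      </p><p>
		The trade-off between sensitivity and specificity described
		above follows directly from a hit counting scheme, which can
		be sketched in Python like this. The sketch is an
		illustration only: the function name is hypothetical, the
		kmer set stands in for MIRA's precompiled database, and
		details such as reverse complement handling are left out.
	      </p><pre class="programlisting">
def looks_like_rrna(read, rrna_kmers, k=21, min_hits=20):
    """Return True if at least min_hits kmers of read are known rRNA kmers."""
    hits = sum(
        1 for i in range(len(read) - k + 1) if read[i:i + k] in rrna_kmers
    )
    return hits >= min_hits
</pre><p>
		Lowering <code class="literal">min_hits</code> flags more
		reads (including more false positives), raising it flags
		fewer.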
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_skim_sk"></a>3.4.4.10. 
	  Parameter group: -SKIM (-SK)
	</h4></div></div></div><p>
	  Options that control the behaviour of the initial fast all-against-all read
	  comparison algorithm. Matches found here will be confirmed later in the
	  alignment phase. The new SKIM3 algorithm that is in place since version 2.7.4
	  uses a kmer based algorithm that works similarly to SSAHA (see Ning Z, Cox AJ,
	  Mullikin JC; "SSAHA: a fast search method for large DNA databases."; Genome
	  Res. 2001;11;1725-9).
	</p><p>
	  The major differences of SKIM3 and SSAHA are:
	</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	      the word length <span class="emphasis"><em>n</em></span> of a kmer (hash) in
	      SSAHA2 must be &lt; 15, but can be up to 32 bases in 64 bit
	      versions of MIRA &lt; 4.0.2, and up to 256 bases for
	      higher versions of MIRA.
	    </p></li><li class="listitem"><p>
	      SKIM3 uses a maximum fixed amount of RAM that is independent of
	      the word size. E.g., SSAHA would need 4 <span class="underline">exabytes</span> to work with a word length of
	      30 bases ... SKIM3 just takes a couple of hundred MB.
	    </p></li></ol></div><p>
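	  The memory figure quoted for SSAHA can be reproduced with a little
	  arithmetic, shown here as a Python sketch. It rests on two
	  assumptions of mine that are not stated in the text: a direct
	  lookup table with one slot for every possible word over the
	  alphabet A, C, G, T, and 4 bytes per slot. The function name is
	  hypothetical.
	</p><pre class="programlisting">
def direct_table_bytes(word_len, bytes_per_entry=4):
    """Size in bytes of a table with one entry per possible DNA word."""
    return 4 ** word_len * bytes_per_entry

# 4**30 words times 4 bytes is 2**62 bytes, i.e. 4 * 2**60 bytes.
exabytes = direct_table_bytes(30) // 2 ** 60
</pre><p>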
	  The parameters for SKIM3:
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [number_of_threads(not)=<em class="replaceable"><code>integer &#8805; 1</code></em>]
	    </span></dt><dd><p>
		Number of threads used in SKIM, default is <span class="underline">2</span>. A few parts of SKIM are
		non-threaded, so the speedup is not exactly linear, but it
		should be very close. E.g., with 2 processors I get a speedup
		of 180-195%, with 4 between 350 and 395%.
	      </p><p>
		Although the main data structures are shared between the
		threads, there's some additional memory needed for each
		thread.
	      </p></dd><dt><span class="term">
	      [also_compute_reverse_complements(acrc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">on</span>. Defines
		whether SKIM searches for matches only in forward/forward
		direction or whether it also looks for forward/reverse
		direction.
	      </p><p>
		You usually will not want to touch the default, except for very
		special application cases where you do not want MIRA to use
		reverse complement sequences at all.
	      </p></dd><dt><span class="term">
	      [kmer_size(kms)=<em class="replaceable"><code>10 &lt; integer &#8804; 256</code></em>]
	    </span></dt><dd><p>
		Defaults are dependent on "--job" switch and sequencing
		technologies used.
	      </p><p>
		Controls the number of consecutive bases
		<span class="emphasis"><em>n</em></span> which are used as a kmer. The
		higher the value, the faster the search. The lower the value,
		the slower the search and the more weak matches are found.
	      </p><p>
		A secondary effect of this parameter is the estimation of MIRA
		on whether stretches within a read sequence are repetitive or
		not. Large values of [-SK:kms] allow a better
		distinction between "almost identical" repeats early in the
		assembly process and, given enough coverage, generally lead to
		fewer and longer contigs.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		This parameter gets overridden by the one-stop-shop parameter
		[-AS:kms] which determines number of passes and kmer
		size to use in each pass.
	      </td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		For de-novo assemblies, values below 15 are not
		recommended. For mapping assemblies, values below 10 should
		not be used.
	      </td></tr></table></div></dd><dt><span class="term">
	      [kmer_save_stepping(kss)=<em class="replaceable"><code>integer &#8805; 1</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">1</span>. This is a parameter
	      controlling the stepping increment <span class="emphasis"><em>s</em></span> with which kmers are
	      generated. This allows for more or less fine grained search as
	      matches are found with at least <span class="emphasis"><em>n+s</em></span> (see  [-SK:kms])
	      equal bases. The higher the value, the faster the search. The
	      lower the value, the more weak matches are found.
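	      </p><p>
		What the kmer size <span class="emphasis"><em>n</em></span>
		and the stepping <span class="emphasis"><em>s</em></span>
		mean for the kmers taken from a read can be illustrated with
		a short Python sketch. It is an illustration with a
		hypothetical function name and says nothing about how SKIM
		stores or compares the kmers internally.
	      </p><pre class="programlisting">
def kmers_with_stepping(seq, n, s=1):
    """Return (position, kmer) pairs, taking a kmer every s bases."""
    return [(i, seq[i:i + n]) for i in range(0, len(seq) - n + 1, s)]
</pre><p>
		For a read of 10 bases and
		<span class="emphasis"><em>n</em></span>=4, a stepping of 1
		yields 7 kmers while a stepping of 3 yields only 3, which is
		why higher values make the search faster but less sensitive.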
	      </p></dd><dt><span class="term">
	      [percent_required(pr)=<em class="replaceable"><code>integer &#8805; 1</code></em>]
	    </span></dt><dd><p> Default is dependent on the sequencing technology used
	      and the assembly quality desired. Controls the relative percentage of
	      exact word matches in an approximate overlap that has to be
	      reached to accept this overlap as possible match. Increasing
	      this number will decrease the number of possible alignments that
	      have to be checked by Smith-Waterman later on in the assembly,
	      but it also might lead to the rejection of weaker overlaps (i.e.
	      overlaps that contain a higher number of mismatches).
	      </p><p>
		Note: most of the time it makes sense to keep this parameter
		in sync with [-AL:mrs].
	      </p></dd><dt><span class="term">
	      [maxhits_perread(mhpr)=<em class="replaceable"><code>integer &#8805; 1</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">2000</span>. Controls the maximum
	      number of possible hits one read can carry into the
	      overlap edge reduction phase. If more potential hits are found,
	      only the best ones are taken.
	      </p><p>
		In the pre-2.9.x series, this was an important option for
		tackling projects which contain <span class="emphasis"><em>extreme</em></span>
		assembly conditions. It still is if you run out of memory in
		the graph edge reduction phase: in that case, try lowering it
		to 1000, 500 or even 100.
	      </p><p>
		As the assembly progresses through its passes ([-AS:nop]),
		different combinations of possible hits will be checked,
		always the most promising ones first. The accuracy of the
		assembly should therefore only suffer when this number is
		lowered too much.
	      </p></dd><dt><span class="term">
	      [filter_megahubs(fmh)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">yes</span>. Defines whether megahubs (reads
	      with extremely many overlaps to other reads) are filtered.
	      See also  [-SK:mhc:mmhr].
	      </p></dd><dt><span class="term">
	      [megahub_cap(mhc)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">150000</span>. Defines the number of kmer
	      overlaps a read may have before it is categorised as megahub.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		You basically don't want to mess with this one, except for
		assemblies containing very long reads. Rule of thumb: you
		might want to multiply the 150k value by n, where n is the
		average read length divided by 2000. Don't overdo it: cap n at
		15 or so.
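		<p>
		  The rule of thumb can be written down as a small Python
		  sketch. The function name
		  <code class="literal">scaled_megahub_cap</code> is made up for this
		  example, and never scaling below the default (n of at least 1)
		  is an assumption on top of the rule as stated.
		</p><pre class="programlisting">
def scaled_megahub_cap(avg_read_length, base_cap=150000, max_factor=15):
    """Scale the default cap by n = average read length / 2000, n capped."""
    n = avg_read_length / 2000
    # Assumption: short reads keep the default cap (n is at least 1).
    n = min(max(n, 1.0), max_factor)
    return int(round(base_cap * n))


print(scaled_megahub_cap(10000))  # n = 5
</pre><p>
		  The resulting number would then be passed to [-SK:mhc].
		</p>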
	      </td></tr></table></div></dd><dt><span class="term">
	      [max_megahub_ratio(mmhr)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">0</span>. If the number of reads
	      identified as megahubs exceeds the allowed ratio, MIRA will
	      abort.
	      </p><p>
		This is a fail-safe parameter to avoid assemblies where things
		look fishy. In case you see this, you might want to ask for
		advice on the mira_talk mailing list. In short: bacteria
		should never have megahubs (90% of all cases reported were
		contamination of some sort and the remaining 10% were due to
		incredibly high coverage numbers). Eukaryotes are likely to
		contain megahubs if filtering ([-KS:mnr]) is not on.
	      </p><p>
		EST projects, however, especially from non-normalised libraries,
		will very probably contain megahubs. In this case, you might
		want to think about masking, see [-KS:mnr].
	      </p></dd><dt><span class="term">
	      [sw_check_on_backbones(swcob)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is currently (3.4.0) <span class="underline">yes</span> for accurate mapping
		jobs. Takes effect only in mapping assemblies. Defines whether
		SKIM hits against a backbone (reference) sequence with less
		than 100% identity are double checked with Smith-Waterman to
		improve mapping accuracy.
	      </p><p>
		You will want to set this option to <span class="underline">yes</span> whenever your reference
		sequence contains more complex or numerous repeats and your
		data has SNPs in those areas.
	      </p></dd><dt><span class="term">
	      [max_kmers_in_memory(mkim)=<em class="replaceable"><code>integer &#8805; 100000</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">15000000</span>. Has no influence
	      on the quality of the assembly, only on the maximum memory size
	      needed during the skimming. The default value is equivalent to
	      approximately 500MB.
	      </p><p>
		Note: reducing the number will increase the run time, the more drastically
		the bigger the reduction. On the other hand, increasing the default value
		chosen will not result in speed improvements that are really noticeable. In
		short: leave this number alone if you are not desperate to save a few MB.
	      </p></dd><dt><span class="term">
	      [memcap_hitreduction(mchr)=<em class="replaceable"><code>integer &#8805; 10</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">1024</span>, <span class="underline">2048</span>
	      when Solexa sequences are used. Maximum memory used (in MiB)
	      during the reduction of skim hits.
	      </p><p>
		Note: has no influence on the quality of the assembly,
		reducing the number will increase the runtime, the more
		drastically the bigger the reduction as hits then must be
		streamed multiple times from disk.
	      </p><p>
		The default is good enough for assembly of bacterial genomes
		or small eukaryotes (using Sanger and/or 454 sequences). As
		soon as you assemble something bigger than 20 megabases, you
		should increase it to 2048 or 4096 (equivalent to 2 or 4 GiB
		of memory).
	      </p></dd></dl></div><p>
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_hashstatistics_hs"></a>3.4.4.11. 
	  Parameter group: -KMERSTATISTICS (-KS)
	</h4></div></div></div><p>
	  Hash statistics (nowadays called kmer statistics in the literature
	  and in other software packages) allow you to quickly assess reads from a
	  coverage point of view without actually assembling the reads. MIRA
	  uses this as a quick pre-assembly evaluation to find and tag reads
	  which are from repetitive and non-repetitive parts of a project.
	</p><p>
	  The length of the kmer is defined via [-SK:kms]
	  or  [-AS:kms] while the parameters in this section define
	  the boundaries of the different repeat levels.
	</p><p>
	  A more in-depth description on kmer statistics is given in the
	  sections <span class="emphasis"><em>Introduction to 'masking'</em></span>
	  and <span class="emphasis"><em>How does 'nasty repeat' masking work?</em></span> in
	  the chapter dealing with the assembly of hard projects.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [freq_est_minnormal(fenn)=<em class="replaceable"><code>float &gt; 0</code></em>]
	    </span></dt><dd><p>
		During kmer statistics analysis, MIRA will estimate how repetitive parts
		of reads are. Parts which are occurring less than
		[-KS:fenn] times the average occurrence will be tagged
		with a HAF2 (less than average) tag.
	      </p></dd><dt><span class="term">
	      [freq_est_maxnormal(fexn)=<em class="replaceable"><code>float &gt; 0</code></em>]
	    </span></dt><dd><p>
		During kmer statistics analysis, MIRA will estimate how repetitive parts
		of reads are. Parts which are occurring more than
		[-KS:fenn] but less than  [-KS:fexn] times
		the average occurrence will be tagged with a HAF3 (normal) tag.
	      </p></dd><dt><span class="term">
	      [freq_est_repeat(fer)=<em class="replaceable"><code>float &gt; 0</code></em>]
	    </span></dt><dd><p>
		During kmer statistics analysis, MIRA will estimate how repetitive parts
		of reads are. Parts which are occurring more than
		[-KS:fexn] but less than  [-KS:fer] times
		the average occurrence will be tagged with a HAF4 (above average) tag.
	      </p></dd><dt><span class="term">
	      [freq_est_heavyrepeat(fehr)=<em class="replaceable"><code>float &gt; 0</code></em>]
	    </span></dt><dd><p>
		During kmer statistics analysis, MIRA will estimate how repetitive parts
		of reads are. Parts which are occurring more than
		[-KS:fer] but less than  [-KS:fehr] times
		the average occurrence will be tagged with a HAF5 (repeat) tag.
	      </p></dd><dt><span class="term">
	      [freq_est_crazyrepeat(fecr)=<em class="replaceable"><code>float &gt; 0</code></em>]
	    </span></dt><dd><p>
		During kmer statistics analysis, MIRA will estimate how repetitive parts
		of reads are. Parts which are occurring more than
		[-KS:fehr] but less than  [-KS:fecr] times
		the average occurrence will be tagged with a HAF6 (heavy
		repeat) tag. Parts which are occurring more than
		 [-KS:fecr] but less than  [-KS:nrr] times the
		average occurrence will be tagged with a HAF7 (crazy repeat)
		tag.
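	      </p><p>
		Taken together, the thresholds above form a ladder from HAF2 up
		to HAF7, with [-KS:nrr] as the upper end. The following Python
		sketch shows that ladder. It is not MIRA code: the function name,
		the threshold values and the treatment of values lying exactly
		on a boundary are assumptions made for illustration, and the
		top rung is only labelled MNRr on the assumption
		that [-KS:mnr] is switched on.
	      </p><pre class="programlisting">
from bisect import bisect_right

TAGS = ["HAF2", "HAF3", "HAF4", "HAF5", "HAF6", "HAF7", "MNRr"]


def repeat_tag(occurrence, average, fenn, fexn, fer, fehr, fecr, nrr):
    """Map a kmer occurrence count to the tag described in the text."""
    ratio = occurrence / average
    thresholds = [fenn, fexn, fer, fehr, fecr, nrr]
    # bisect_right counts how many thresholds are at or below the ratio.
    return TAGS[bisect_right(thresholds, ratio)]


# Hypothetical threshold values, chosen only for illustration.
limits = dict(fenn=0.5, fexn=1.5, fer=2.0, fehr=8.0, fecr=20.0, nrr=100.0)
for count in (10, 30, 50, 150, 500, 900, 6000):
    print(count, repeat_tag(count, 30, **limits))
</pre><p>
		With an average occurrence of 30, the seven counts in the loop
		walk through all tags once, from HAF2 to MNRr.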
	      </p></dd><dt><span class="term">
	      [mask_nasty_repeats(mnr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on --job
		type: <span class="underline">yes</span> for
		de-novo, <span class="underline">no</span> for mapping.
	      </p><p>
		Tells MIRA to tag (during the kmer statistics phase) read
		subsequences of length [-SK:kms] nucleotides that
		appear more than X times more often than the median occurrence
		of subsequences would otherwise suggest. The threshold X above
		which subsequences are considered nasty is set by
		 [-KS:nrr] or  [-KS:nrc]; the action MIRA
		should take when encountering those sequences is defined
		by  [-KS:ldn] (see below).
	      </p><p>
		When not using lossless digital normalisation
		([-KS:ldn]), the tag used by MIRA will be "MNRr"
		which stands for "Mask Nasty Repeat in read". This tag has an
		active masking function in MIRA and the fast all-against-all
		overlap searcher (SKIM) will then completely ignore the tagged
		subsequences of reads. There's one drawback though: the
		smaller the reads are that you try to assemble, the higher the
		probability that your reads will not span nasty repeats
		completely, which leads to contig building being aborted
		at this site. Reads completely covered by the MNRr tag will
		land in the debris file as no overlap will be found for them.
	      </p><p>
		This option is extremely useful for assembly of larger
		projects (fungi-size) with a high percentage of repeats.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Although it is expected that bacteria will not really need
		this, leaving it turned on will probably not harm except in
		unusual cases like several copies of (pro-)phages integrated
		in a genome.
	      </td></tr></table></div></dd><dt><span class="term">
	      [nasty_repeat_ratio(nrr)=<em class="replaceable"><code>integer &#8805; 2</code></em>]
	    </span></dt><dd><p>
		Default depends on the [--job=...]
		parameters. Normally it's high (around 100) for genome
		assemblies, but much lower (20 or less) for EST assemblies.
	      </p><p>
		Sets the ratio above which subsequences are considered nasty
		and hidden from the kmer statistics overlapper with a
		<span class="emphasis"><em>MNRr</em></span> tag. E.g.: a value of 10 means: mask all
		k-mers of  [-SK:kms] length which occur more
		than 10 times more often than the average of the whole project.
	      </p></dd><dt><span class="term">
	      [nasty_repeat_coverage(nrc)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p>
		Default depends on the [--job=...]
		parameters: <span class="underline">0</span> for genome
		assemblies, <span class="underline">200</span> for EST assemblies.
	      </p><p>
		Closely related to the [-KS:nrr] parameter (see
		above), but while the above works on ratios derived from a
		calculated average, this parameter allows you to set an absolute
		value. Note that this parameter will take precedence
		over  [-KS:nrr] if the calculated value of nrr is
		larger than the absolute value given here. A value of 0
		de-activates this parameter.
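	      </p><p>
		The interplay of both parameters can be sketched in a few lines
		of Python. This is an illustration under assumptions, not MIRA
		code: the function name is made up, and "the calculated value
		of nrr" is read here as nrr multiplied by the average kmer
		occurrence.
	      </p><pre class="programlisting">
def nasty_threshold(average_occurrence, nrr, nrc=0):
    """Return the kmer occurrence count above which a kmer counts as nasty."""
    from_ratio = nrr * average_occurrence
    if nrc > 0:
        # The absolute value wins when the ratio-derived value is larger.
        return min(from_ratio, nrc)
    return from_ratio  # nrc = 0: only the ratio is used


print(nasty_threshold(30, 100))           # ratio only
print(nasty_threshold(30, 20, nrc=200))   # absolute value takes precedence
print(nasty_threshold(5, 20, nrc=200))    # ratio-derived value is smaller
</pre><p>
		In the second call the ratio alone would give 600, so the
		absolute value of 200 is the one that counts.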
	      </p></dd><dt><span class="term">
	      [lossless_digital_normalisation(ldn)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on --job
		type: <span class="underline">yes</span> for denovo
		EST/RNAseq assembly, <span class="underline">no</span>
		otherwise.
	      </p><p>
		Tells MIRA whether or not to digitally normalise reads
		containing nasty repeats when [-KS:mnr] is active.
	      </p><p>
		When set to <span class="emphasis"><em>yes</em></span>, MIRA will apply a
		modified digital normalisation step to the reads, effectively
		decreasing the coverage of a given repetitive stretch down to
		a minimum needed to correctly represent one copy of the
		repeat. However, contrary to the published method, MIRA will
		keep enough reads of repetitive regions to also correctly
		reconstruct slightly different variants of the repeats present
		in the genome or EST / RNASeq data set, even if they differ in
		only a single base.
	      </p><p>
		The tag used by MIRA to denote stretches which may have
		contributed to the digital normalisation will be
		"DGNr". Additionally, contigs which contain reads completely
		covered by a DGNr tag will get an additional "_dn" as part of
		their name to show that they contain read representatives for
		digital normalisation. E.g.: "contig_dn_c1".
	      </p><p>
		This option is extremely useful for non-normalised EST /
		RNASeq projects, to get at least the sequence of
		overrepresented transcripts assembled even if the coverage
		values then cannot be interpreted as expression values
		anymore.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		The lossless digital normalisation will be applied as soon as
		the kmer size of the active pass (see [-AS:kms])
		reaches a size of at least 50 or, at the latest, in the second
		to last pass.
	      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Once digital normalisation has been applied, the
		parameters [-KS:nrr] and  [-KS:nrc] do not
		take effect anymore.
	      </td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		The effect of lossless digital normalisation on genome data
		has not been studied sufficiently by me to approve it for
		genomes. Use with care in genome assemblies.
	      </td></tr></table></div></dd><dt><span class="term">
	      [repeatlevel_in_infofile(rliif)=<em class="replaceable"><code>integer; 0, 5-8</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">6</span>. Sets the
		minimum level of the HAF tags from which on MIRA will report
		tentatively repetitive sequence in the
		<code class="filename">*_info_readrepeats.lst</code> file of the info
		directory.
	      </p><p>
		A value of <span class="underline">0</span> means
		"switched off". The default value of <span class="underline">6</span> means all subsequences tagged
		with <span class="emphasis"><em>HAF6</em></span>, <span class="emphasis"><em>HAF7</em></span> and
		<span class="emphasis"><em>MNRr</em></span> will be logged. If you, e.g., only
		wanted MNRr logged, you'd use <span class="underline">8</span> as parameter value.
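	      </p><p>
		As a small Python sketch of the levels just described (the
		function name and the numeric level assigned to MNRr are
		assumptions derived from the examples above, not taken from
		the MIRA sources):
	      </p><pre class="programlisting">
LEVEL_OF_TAG = {"HAF5": 5, "HAF6": 6, "HAF7": 7, "MNRr": 8}


def tags_reported(rliif):
    """List the tags logged for a given repeatlevel_in_infofile value."""
    if rliif == 0:
        return []  # switched off
    if rliif not in (5, 6, 7, 8):
        raise ValueError("rliif must be 0 or in the range 5-8")
    return [tag for tag, level in LEVEL_OF_TAG.items() if level >= rliif]


print(tags_reported(6))
print(tags_reported(8))
</pre><p>
		Lower values therefore make the info file longer, higher values
		restrict it to the most repetitive stretches.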
	      </p><p>
		See also [-KS:fenn:fexn:fer:fehr:mnr:nrr] to set the
		different levels for the <span class="emphasis"><em>HAF</em></span> and
		<span class="emphasis"><em>MNRr</em></span> tags.
	      </p></dd><dt><span class="term">
	      [memory_to_use(mtu)=<em class="replaceable"><code>integer</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">75</span>. Defines
		the memory MIRA can use to compute kmer statistics.
	      </p><p>
		A value of <span class="underline">&gt;100</span> is
		interpreted as an absolute value in megabytes. E.g., 16384 = 16384
		megabytes = 16 gigabytes.
	      </p><p>
		A value of <span class="underline">0 &#8804; x &#8804;100</span> is
		interpreted as a percentage of the free memory at the time of
		computation. E.g.: for a value of 75 and 10 gigabytes of free
		memory, it will use 7.5 gigabytes.
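	      </p><p>
		The two interpretations can be summarised in a short Python
		sketch. The function name and the
		<code class="literal">floor_mb</code> parameter are hypothetical;
		treating the minimum memory of the algorithm as a simple lower
		bound is an assumption.
	      </p><pre class="programlisting">
def kmer_stats_memory_mb(mtu, free_mb, floor_mb=2048):
    """Translate an mtu value into megabytes usable for kmer statistics."""
    if mtu > 100:
        wanted = mtu                  # absolute value in megabytes
    elif mtu >= 0:
        wanted = free_mb * mtu / 100  # percentage of currently free memory
    else:
        raise ValueError("mtu must not be negative")
    # Assumption: the minimum memory of the algorithm acts as a lower bound.
    return max(wanted, floor_mb)


print(kmer_stats_memory_mb(75, 10240))     # 75 percent of 10 gigabytes
print(kmer_stats_memory_mb(16384, 10240))  # absolute value
</pre><p>
		The first call reproduces the example above: 7680 megabytes,
		i.e. 7.5 gigabytes.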
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		The minimum amount of memory this algorithm will use is 512 MiB
		on 32 bit systems and 2 GiB on 64 bit systems.
	      </td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_align_al"></a>3.4.4.12. 
	  Parameter group: -ALIGN (-AL)
	</h4></div></div></div><p>
	  The align options control the behaviour of the Smith-Waterman alignment
	  routines. Only read pairs which are confirmed here may be included into
	  contigs. Affects both the checking of possible alignments found by SKIM as
	  well as the phase when reads are integrated into a contig.
	</p><p>
	  Every option in this section can be set individually for every sequencing
	  technology, giving very fine-grained control over how reads are aligned for
	  each technology.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [bandwidth_in_percent(bip)=<em class="replaceable"><code>integer &gt; 0 and &#8804;100</code></em>]
	    </span></dt><dd><p> Default
	      is dependent on the sequencing technology used. The banded Smith-Waterman
	      alignment uses this percentage to compute the bandwidth it has to use
	      when computing the alignment matrix. E.g., with an expected overlap of 150 bases
	      and bip=10, the banded SW will compute a band of 15 bases to each side of
	      the expected alignment diagonal, thus allowing up to 15 unbalanced inserts /
	      deletes in the alignment. Increasing and decreasing this number:
	      <span class="emphasis"><em>increase</em></span>: will find more non-optimal alignments, but will also
	      increase SW runtime somewhere between linearly and quadratically.  <span class="emphasis"><em>decrease</em></span>: the other
	      way round, might miss a few bad alignments but gains speed.
	      </p></dd><dt><span class="term">
	      [bandwidth_min(bmin)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is dependent on the
	      sequencing technology used. Minimum bandwidth in bases to each side.
	      </p></dd><dt><span class="term">
	      [bandwidth_max(bmax)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is dependent on the
	      sequencing technology used. Maximum bandwidth in bases to each side.
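	      </p><p>
		How the three bandwidth parameters might work together is
		sketched below in Python. This is an illustration under
		assumptions, not MIRA code: the function name is made up, the
		values for bmin and bmax are hypothetical, and both the
		rounding and the clamping of the percentage-derived band
		between bmin and bmax are my reading of the parameter
		descriptions.
	      </p><pre class="programlisting">
def band_width(expected_overlap, bip, bmin, bmax):
    """Bases to each side of the expected diagonal for the banded SW."""
    raw = expected_overlap * bip // 100  # assumption: rounded down
    return max(bmin, min(raw, bmax))     # keep within [bmin, bmax]


print(band_width(150, 10, bmin=5, bmax=50))   # the example from the text
print(band_width(40, 10, bmin=5, bmax=50))    # raised to bmin
print(band_width(2000, 10, bmin=5, bmax=50))  # cut down to bmax
</pre><p>
		The first call gives the 15 bases from the bip example; the
		other two show the minimum and maximum taking over for very
		short and very long overlaps.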
	      </p></dd><dt><span class="term">
	      [min_overlap(mo)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is dependent on the
	      sequencing technology used. Minimum number of overlapping bases needed for an
	      alignment of two sequences to be accepted.
	      </p></dd><dt><span class="term">
	      [min_score(ms)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is dependent on the
	      sequencing technology used. Describes the minimum score of an overlap to be
	      taken into account for assembly. MIRA uses a default scoring scheme
	      for SW align: each match counts 1, a match with an N counts 0, each mismatch
	      with a non-N base -1 and each gap -2. Choose a higher score to weed out a
	      number of chance matches, a lower score to perhaps find the single (short)
	      alignment that might join two contigs together (at the expense of computing
	      time and memory).
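	      </p><p>
		To make the scoring scheme concrete, here is a minimal Python
		sketch that scores two rows of an existing alignment. It does
		not compute an alignment and is not MIRA code; the function
		name and the use of <code class="literal">-</code> as gap character
		are assumptions made for this example.
	      </p><pre class="programlisting">
def overlap_score(row_a, row_b, gap="-"):
    """Score two equally long, already aligned rows with the default scheme."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if gap in (a, b):
            score -= 2          # gap
        elif "N" in (a, b):
            score += 0          # any column involving an N
        elif a == b:
            score += 1          # match
        else:
            score -= 1          # mismatch between two non-N bases
    return score


print(overlap_score("ACGTNACG-T", "ACGTAACCTT"))
</pre><p>
		The toy rows contain seven matches, one N column, one mismatch
		and one gap, giving a score of 4; with [-AL:ms] set higher than
		that, such an overlap would not be considered.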
	      </p></dd><dt><span class="term">
	      [min_relative_score(mrs)=<em class="replaceable"><code>integer &gt; 0 and &#8804;100</code></em>]
	    </span></dt><dd><p> Default is dependent on the sequencing technology
	      used. Describes the minimum percentage of matching between two reads
	      for them to be considered for assembly. Increasing this number will save
	      memory, but one might lose possible alignments. I propose a
	      maximum of 80 here.  Decreasing it below 55% will probably make
	      memory and time consumption explode.
	      </p><p>
		Note: most of the time it makes sense to keep this parameter
		in sync with
		[-SK:pr].
	      </p></dd><dt><span class="term">
	      [solexa_hack_max_errors(shme)=<em class="replaceable"><code>integer &gt; -1</code></em>]
	    </span></dt><dd><p>
		Currently a hack just for Solexa/Illumina data.
	      </p><p>
		When running in mapping mode, this defines the maximum number
		of mismatches and gaps a read may have compared to the
		reference to be allowed to map. The result is usually a much
		better mapping in areas with larger discrepancies between
		reference sequence and mapped data. Note that the mapping
		process takes longer if this value is not 0, as MIRA
		will use iterative mapping which involves a certain amount of
		trial and error.
	      </p><p>
		The default value of <span class="underline">-1</span>
		lets MIRA choose this value automatically. It sets it to 15%
		of the average length of the Illumina reads loaded.
	      </p><p>
		A value of <span class="underline">0</span> switches off
		this functionality, leading to a much faster mapping
		process. Useful when mapping expression data where coverage
		values may be more important than the best possible alignment.
	      </p></dd><dt><span class="term">
	      [extra_gap_penalty(egp)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology
		used. Defines whether or not to increase penalties applied to
		alignments containing long gaps. Setting this to 'yes' might
		help in projects with frequent repeats. On the other hand, it
		is definitely a hindrance when assembling very long reads
		containing multiple long indels in the called base sequence
		... although this should not happen in the first place and is
		a sure sign of problems lying ahead.
	      </p><p>
		When in doubt, set it
		to <span class="underline">yes</span> for EST projects
		and de-novo genome assembly, set it
		to <span class="underline">no</span> for assembly of
		closely related strains (assembly against a backbone).
	      </p><p>
		When set to <span class="underline">no</span>, it is
		recommended to have  [-CO:amgb]
		and  [-CO:amgbemc] both set to yes.
	      </p></dd><dt><span class="term">
	      [egp_level(egpl)=<em class="replaceable"><code>comma separated list of integer &#8805; 0 and &#8804; 100</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology and job
		used. Has no effect if extra_gap_penalty is off.
	      </p><p>
		...
	      </p></dd><dt><span class="term">
	      [max_egp_percent(megpp)=<em class="replaceable"><code>0 &#8804; integer &#8804; 100</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">100</span>. Has no effect if
	      extra_gap_penalty is off. Defines the maximum extra penalty in
	      percent applied to 'long' gaps.
	      </p></dd></dl></div><p>
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_contig_co"></a>3.4.4.13. 
	  Parameter group: -CONTIG (-CO)
	</h4></div></div></div><p>
	  The contig options control the behaviour of the contig objects.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [name_prefix(np)=<em class="replaceable"><code>string</code></em>]
	    </span></dt><dd><p>
		Default is
		<span class="underline">&lt;projectname&gt;</span>. Contigs
		will have this string prepended to their names. Normally,
		the  [project=] line in the manifest will set this.
	      </p></dd><dt><span class="term">
	      [reject_on_drop_in_relscore(rodirs)=<em class="replaceable"><code>integer &#8805; 0 and &#8804;100</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used.
	      </p><p>
		When adding reads to a contig, reject the reads if the drop in
		the minimum relative score of the alignment of the current
		consensus and the new read is greater than the expected value
		calculated during the alignment phase. Lower values mean
		stricter checking.
	      </p><p>
		This value is doubled should a read be entered that has an
		assembled template partner (a read pair) at the right distance
		in the current contig.
	      </p></dd><dt><span class="term">
	      [cmin_relative_score(cmrs)=<em class="replaceable"><code>integer &#8805; -1 and &#8804;100</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">-1</span>. Works
		similarly to  [-AL:mrs], but during contig
		construction phase instead of read vs read alignment phase:
		describes the minimum percentage of matching between a read being added to
		a contig and the current contig consensus.
	      </p><p>
		If the value is set to -1, then the value of [-AL:mrs] is used.
	      </p><p>
		Note: most of the time it makes sense to keep this parameter
		at -1. Else have it at
		approximately <span class="emphasis"><em>[-AL:mrs]-10</em></span> or
		switch it completely off via 0.
	      </p></dd><dt><span class="term">
	      [mark_repeats(mr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">yes</span>. One of the most important switches in MIRA: if set to
	      <span class="underline">yes</span>, MIRA will try to resolve misassemblies due to repeats by
	      identifying single base stretch differences and tag those critical bases as
	      RMB (Repeat Marker Base, weak or strong). This switch is also needed when
	      MIRA is run in EST mode to identify possible inter-, intra- and
	      intra-and-interorganism SNPs.
	      </p></dd><dt><span class="term">
	      [only_in_result(mroir)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">no</span>. Only
	      takes effect when  [-CO:mr] (see above) is set
	      to <span class="underline">yes</span>. If set
	      to <span class="underline">yes</span>, MIRA will not use
	      the repeat resolving algorithm during build time (and therefore
	      will not be able to take advantage of this), but only before
	      saving results to disk.
	      </p><p>
		This switch is useful in some (rare) cases of mapping assembly.
	      </p></dd><dt><span class="term">
	      [assume_snp_instead_repeat(asir)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">no</span>.
	      Only takes effect when  [-CO:mr] (see above) is set to
	      <span class="underline">yes</span>, effect is also
	      dependent on whether strain data (see
	      [-SB:lsd]) is present or not.  Usually, MIRA will mark
	      bases that differentiate between repeats when a conflict occurs
	      between reads that belong to one strain. If the conflict occurs
	      between reads belonging to different strains, they are marked as
	      SNP. However, if this switch is set
	      to <span class="underline">yes</span>, conflicts within a
	      strain are also marked as SNP.
	      </p><p>
		This switch is mainly used in assemblies of ESTs, it should
		not be set for genomic assembly.
	      </p></dd><dt><span class="term">
	      [min_reads_per_group(mrpg)=<em class="replaceable"><code>integer &#8805; 2</code></em>]
	    </span></dt><dd><p> Default is
	      dependent on the sequencing technology used. Only takes effect when
	      [-CO:mr] (see above) is set
	      to <span class="underline">yes</span>. This defines the
	      minimum number of reads in a group that are needed for the RMB
	      (Repeat Marker Bases) or SNP detection routines to be
	      triggered. A group is defined by the reads carrying the same
	      nucleotide for a given position, i.e., an assembly with mrpg=2
	      will need at least two times two reads with the same nucleotide
	      (having at least a quality as defined in  [-CO:mgqrt])
	      to be recognised as repeat marker or a SNP.  Setting this to a
	      low number increases sensitivity, but might produce a few false
	      positives, resulting in reads being thrown out of contigs
	      because of falsely identified possible repeat markers (or
	      wrongly recognised as SNP).
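	      </p><p>
		The group criterion for a single alignment column can be
		sketched in Python as follows. This is a simplification for
		illustration and not MIRA code: the function name is made up,
		and base qualities (see  [-CO:mgqrt]) are ignored
		entirely.
	      </p><pre class="programlisting">
from collections import Counter


def column_triggers_detection(column_bases, mrpg=2):
    """True if two or more nucleotide groups each hold at least mrpg reads."""
    groups = Counter(base.upper() for base in column_bases)
    big_groups = [base for base, count in groups.items() if count >= mrpg]
    return len(big_groups) >= 2


print(column_triggers_detection("AAAAGG"))          # two G reads form a group
print(column_triggers_detection("AAAAAG"))          # a single G is ignored
print(column_triggers_detection("AAAGGG", mrpg=4))  # both groups too small
</pre><p>
		Each character of the string stands for the base one read
		shows in that column; with mrpg=2 a single deviating read is
		treated as a sequencing error rather than as a repeat marker
		or SNP.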
	      </p></dd><dt><span class="term">
	      [min_neighbour_qual(mnq)=<em class="replaceable"><code>integer &#8805; 10</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. Only
		takes effect when [-CO:mr] is set
		to <span class="underline">yes</span>. This defines the
		minimum quality that the bases neighbouring a base must have
		for that base to be taken into consideration when deciding whether
		column base mismatches are relevant or not.
	      </p></dd><dt><span class="term">
	      [min_groupqual_for_rmb_tagging(mgqrt)=<em class="replaceable"><code>integer &#8805; 25</code></em>]
	    </span></dt><dd><p>
		Default is dependent on the sequencing technology used. Only
		takes effect when [-CO:mr] is set
		to <span class="underline">yes</span>. This defines the
		minimum quality of a group of bases to be taken into account
		as potential repeat marker. The lower the number, the more
		sensitive you get, but lowering below 25 is not recommended as
		a lot of wrongly called bases can have a quality approaching
		this value and you'd end up with a lot of false positives. The
		higher the overall coverage of your project, the better, and
		the higher you can set this number. A value of 35 will
		probably remove most false positives, a value of 40 will
		probably never show false positives ... but will generate a
		sizable number of false negatives.
	      </p></dd><dt><span class="term">
	      [min_coverage_percentage(mcp)=<em class="replaceable"><code>0 &lt; integer &#8804; 100</code></em>]
	    </span></dt><dd><p>
		Default is currently <span class="underline">10</span>. Used to reduce the number of
		IUPAC bases due to non-random PCR artefacts or sequencing
		errors in very high coverage areas (e.g. Illumina data at a
		coverage of 80x or more). Once the most probable base has been determined,
		 [-CO:mcp] defines the minimum percentage (calculated
		from the most probable base) the coverage of alternative bases
		must have to be considered for consensus. E.g.: with mcp=10
		and the most probable base having a coverage of 200x, other
		bases must have a coverage of at least 20x.
	      </p><p>
		Drawback is that valid low frequency variants will not show up
		anymore as IUPAC in the FASTA.
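	      </p><p>
		The mcp=10 example can be reproduced with a few lines of
		Python. The sketch is an illustration only: the function name
		is made up, and the "most probable base" is simply taken to be
		the base with the highest coverage, which is an assumption and
		leaves out everything else MIRA may use to determine it.
	      </p><pre class="programlisting">
def bases_for_consensus(coverage_by_base, mcp=10):
    """Return the bases whose coverage reaches mcp percent of the top base."""
    top = max(coverage_by_base.values())
    needed = top * mcp / 100
    return sorted(b for b, cov in coverage_by_base.items() if cov >= needed)


print(bases_for_consensus({"A": 200, "G": 25, "T": 3}))  # T stays below 20x
</pre><p>
		Here G at 25x passes the 20x mark and would still show up in
		an IUPAC code, while T at 3x is dropped as a probable artefact.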
	      </p></dd><dt><span class="term">
	      [endread_mark_exclusion_area(emea)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p> Default depends on the sequencing technology
	      used. Takes effect only when [-CO:mr] is set to
	      <span class="underline">yes</span>. Using the end of
	      sequences of Sanger type shotgun sequencing is always a bit
	      risky, as wrongly called bases tend to crowd there or some
	      sequencing vector relics hang around. It is even more risky to
	      use these stretches for detecting possible repeats, so one can
	      define an exclusion area where the bases are not used when
	      determining whether a mismatch is due to repeats or not.
	      </p></dd><dt><span class="term">
	      [emea_set1_on_clipping_pec(emeas1clpec)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. When
		 [-CL:pec] is set, the end-read exclusion area can be
		considerably reduced. Setting this parameter will
		automatically do this.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Although the parameter is named "set to 1", it may be that the
		exclusion area is actually a bit larger (2 to 4), depending on
		what users will report back as "best" option.
	      </td></tr></table></div></dd><dt><span class="term">
	      [also_mark_gap_bases(amgb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default depends on the sequencing technology
		used. Determines whether columns containing gap bases (indels)
		are also tagged.
	      </p><p>
		Note: it is strongly recommended to not set this to 'yes' for
		454 type data.
	      </p></dd><dt><span class="term">
	      [also_mark_gap_bases_even_multicolumn(amgbemc)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">yes</span>.
	      Takes effect only when  [-CO:amgb] is set to
	      <span class="underline">yes</span>. Determines whether multiple columns containing gap bases
	      (indels) are also tagged.
	      </p></dd><dt><span class="term">
	      [also_mark_gap_bases_need_both_strands(amgbnbs)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is <span class="underline">yes</span>.  Takes effect only when
	       [-CO:amgb] is set to <span class="underline">yes</span>. Determines whether, for
	      tagging columns containing gap bases, both strands need to have a gap.
	      Setting this to <span class="underline">no</span> is not recommended except when working in
	      desperately low coverage situations.
	      </p></dd><dt><span class="term">
	      [force_nonIUPACconsensus_perseqtype(fnic)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span> for
		de-novo genome assemblies, yes for all others. If set to
		<span class="underline">yes</span>, MIRA will be forced
		to make a choice for a consensus base (A,C,G,T or gap) even in
		unclear cases where it would normally put an IUPAC base. All
		other things being equal (like quality of the possible
		consensus base and other things), MIRA will choose a base by
		either looking for a majority vote or, if that also is not
		clear, by preferring gaps over T over G over C and finally A.
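	      </p><p>
		The tie-breaking rule can be sketched as follows. This is a
		minimal illustration under stated assumptions, not MIRA's
		code: the function name is hypothetical, the gap is written
		as "*", and all candidates are assumed to be already
		equivalent in quality.
	      </p>

```python
from collections import Counter

# Preference order quoted in the text: gap, then T, G, C and finally A.
PREFERENCE = "*TGCA"  # assumption: "*" stands for a gap here


def forced_consensus_base(candidates: str) -> str:
    """Pick one base from equally good candidates: majority vote first,
    then the fixed preference order."""
    counts = Counter(candidates)
    top = max(counts.values())
    tied = [base for base in counts if counts[base] == top]
    # min() over the preference index picks the most preferred tied base.
    return min(tied, key=PREFERENCE.index)


print(forced_consensus_base("AAG"))  # A (clear majority)
print(forced_consensus_base("AG"))   # G (tie, G preferred over A)
print(forced_consensus_base("A*T"))  # * (tie, gap preferred)
```

	      <p>
		The sketch only accepts the five symbols listed in the
		preference string; any other character raises a ValueError.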
	      </p><p>
		MIRA makes a considerable effort to deduce the right base at
		each position of an assembly. Only when cases begin to be
		borderline will it use an IUPAC code to make you aware of
		potential problems. It
		is <span class="bold"><strong>suggested</strong></span> to leave this
		option to <span class="underline">no</span> as IUPAC
		bases in the consensus are a sign that - if you need 100%
		reliability - you really should have a look at this particular
		place to resolve potential problems. You might want to set
		this parameter to yes in the following cases: 1) when the
		tools that use your assembly results cannot handle IUPAC bases and
		you don't care about your data being absolutely perfect (which would
		require looking over them manually). 2) when you assemble data without
		any quality values (which you should not do anyway), then this
		method will allow you to get a result without IUPAC bases that
		is "good enough" with respect to the fact that you did not
		have quality values.
	      </p></dd><dt><span class="term">
	      [merge_short_reads(msr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span> for all
		Solexas when in a mapping assembly, else it's <span class="underline">no</span>. Can only be used in mapping
		assemblies. If set to <span class="underline">yes</span>, MIRA will merge all perfectly
		mapping Solexa reads into longer reads (Coverage Equivalent
		Reads, CERs) while keeping quality and coverage information
		intact.
	      </p><p>
		This feature hugely reduces the number of Solexa reads and
		makes assembly results with Solexa data small enough to be
		handled by current finishing programs (gap4, consed, others)
		on normal workstations.
	      </p></dd><dt><span class="term">
	      [msr_keepcontigendsunmerged(msrme)=<em class="replaceable"><code>integer &#8805; 0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">0</span> for all
		Solexas when in a mapping assembly. Takes effect only in
		mapping assemblies if  [-CO:msr=yes].
	      </p><p>
		Defines how many "errors" (i.e. differences) a read may have
		to be merged into a coverage equivalent read. Useful only when
		one does not need SNP information from an assembly but wants
		to concentrate either on coverage data or on paired-end
		information at contig ends.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		This feature allows merging of non-perfect reads, which makes
		most SNP information simply disappear from the alignment. Use
		with care!
	      </td></tr></table></div></dd><dt><span class="term">
	      [msr_keepcontigendsunmerged(msrkceu)=<em class="replaceable"><code>-1, integer &gt; 0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">-1</span> for all
		Solexas when in a mapping assembly. Takes effect only in
		mapping assemblies if  [-CO:msr=yes] and for reads
		which have a paired-end / mate-pair partner actively used in
		the assembly.
	      </p><p>
		If set to a value &gt; 0, MIRA will not merge paired-end /
		mate-pair reads if they map within the given distance of a
		contig end of the original reference sequence
		(backbone). Instead of a fixed value, one can also use
		-1. MIRA will then automatically not merge reads if the
		distance from the contig end is within the maximum size of the
		template insert size of the sequencing library for that read
		(either given via [-GE:tismax] or via XML TRACEINFO
		for the given read).
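	      </p><p>
		The decision described above might look roughly like the
		following sketch. It is not taken from MIRA: the function and
		argument names are hypothetical, positions are assumed to be
		0-based with an exclusive end, and measuring the distance from
		the read to the nearer contig end is an assumption about what
		"within the given distance" means.
	      </p>

```python
def keep_unmerged(read_start: int, read_end: int, contig_length: int,
                  msrkceu: int = -1, max_insert_size: int = 0) -> bool:
    """True if a paired read lies close enough to a contig end
    that it should not be merged."""
    # With -1, fall back to the maximum insert size of the read's library.
    distance = max_insert_size if msrkceu == -1 else msrkceu
    # Distance of the read to the nearer end of the reference contig.
    to_left_end = read_start
    to_right_end = contig_length - read_end
    return distance >= min(to_left_end, to_right_end)


# Read 100 bases from the left end, library with inserts of up to 500 bases.
print(keep_unmerged(100, 200, 10000, max_insert_size=500))    # True
# Read in the middle of the contig: can be merged as usual.
print(keep_unmerged(5000, 5100, 10000, max_insert_size=500))  # False
```

	      <p>
		Reads for which the sketch returns True stay as individual
		reads, all others remain candidates for merging.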
	      </p><p>
		This feature makes it possible to use the data reduction from
		[-CO:msr] while keeping the result of such a mapping
		useful for subsequent scaffolding programs which order
		contigs.
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_edit_ed"></a>3.4.4.14. 
	  Parameter group: -EDIT (-ED)
	</h4></div></div></div><p>
	  General options for controlling the integrated automatic editor. The editors
	  generally do a good job of cleaning up typical sequencing errors
	  (base overcalls etc.) in alignments. However, they may prove tricky in
	  certain situations:
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      in EST assemblies, they may edit rare transcripts toward almost
	      identical, more abundant transcripts. Usage must be carefully weighed.
	    </p></li><li class="listitem"><p>
	      the editors will not only change bases, but also sometimes delete or
	      insert non-gap bases as needed to improve an alignment when facts (trace
	      signals or other) show that this is what should have been the
	      sequence. However, this can make post processing of assembly results pretty
	      difficult with some formats like ACE, where the format itself contains no
	      way to specify certain edits like deletion. There's nothing one can do about
	      it and the only way to get around this problem is to use file formats with
	      more complete specifications like CAF, MAF (and BAF once supported by MIRA).
	    </p></li></ul></div><p>
	</p><p>
	  The following edit parameters are supported:
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [_mira_automatic_contig_editing(mace)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. When set
		to yes, MIRA will use its own built-in automatic
		contig editors (see parameters below) to improve alignments.
	      </p></dd><dt><span class="term">
	      [edit_kmer_singlets(eks)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span> for all
		sequencing technologies, but only takes effect
		if  [-ED:mace] is on (see above).
	      </p><p>
		When set to yes, MIRA uses the alignment information of a
		complete contig at places with sequencing errors which lead to
		unique kmers and corrects the error according to the alignment.
	      </p><p>
		This is an extremely conservative yet very effective editing
		strategy and can therefore be kept always activated.
	      </p></dd><dt><span class="term">
	      [edit_homopolymer_overcalls(ehpo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span> for 454
		and Ion Torrent, but only takes effect if  [-ED:mace]
		is on (see above).
	      </p><p>
		When set to yes, MIRA uses the alignment information of a
		complete contig at places with potential homopolymer
		sequencing errors and corrects the error according to the
		alignment.
	      </p><p>
		This editor should be switched on only for sequencing
		technologies with known homopolymer sequencing problems. That
		is: currently only 454 and Ion.
	      </p></dd><dt><span class="term">
	      [edit_automatic_contig_editing(eace)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>. When set
		to yes, MIRA will use a built-in version of the "EdIt"
		automatic contig editor (see parameters below) to correct
		sequencing errors in Sanger reads.
	      </p><p>
		EdIt will try to resolve discrepancies in the contig by
		performing trace signal analysis, correcting even hard-to-resolve
		errors.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		The current development version has a memory leak in
		this editor, therefore the option cannot be turned
		on.
	      </td></tr></table></div></dd><dt><span class="term">
	      [strict_editing_mode(sem)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>. Only for
		Sanger data. If set to yes, the automatic editor will not take
		error hypotheses with a low probability into account, even if
		all the requirements to make an edit are fulfilled.
	      </p></dd><dt><span class="term">
	      [confirmation_threshold(ct)=<em class="replaceable"><code>integer, 0 &lt; x &#8804; 100</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">50</span>. Only for
		Sanger data. The higher this value, the more strict the
		automatic editor will apply its internal rule set. Going below
		40 is not recommended.
	      </p></dd></dl></div><p>
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_misc_mi"></a>3.4.4.15. 
	  Parameter group: -MISC (-MI)
	</h4></div></div></div><p>
	  Options which would not fit elsewhere.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [iknowwhatido(ikwid)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>. This
		switch tells MIRA that you know what you are doing in some
		situations and forces it not to stop when it thinks something is
		really wrong, but to simply continue.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		You generally should not set this flag except in cases
		where MIRA stopped and the warning / error message told you to
		get around that very specific problem by setting this flag.
	      </td></tr></table></div></dd><dt><span class="term">
	      [large_contig_size(lcs)=<em class="replaceable"><code>integer &gt;
	      0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">500</span>. This
		parameter has absolutely no influence whatsoever on the
		assembly process of MIRA. It is used only in the reporting within
		the <code class="filename">*_assembly_info.txt</code> file after the
		assembly where MIRA reports statistics on
		<span class="emphasis"><em>large</em></span> contigs and
		<span class="emphasis"><em>all</em></span> contigs.  [-MI:lcs] is the
		threshold value for dividing the contigs into these two
		categories.
	      </p></dd><dt><span class="term">
	      [large_contig_size_for_stats(lcs4s)=<em class="replaceable"><code>integer &gt;
	      0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">5000</span> for
		 [--job=genome] and <span class="underline">1000</span> for  [--job=est].
	      </p><p>
		This parameter is used for internal statistics calculations
		and has a subtle influence in
		[--job=genome] assembly mode.
	      </p><p>
		MIRA uses coverage information of an assembly project to find
		out about potentially repetitive areas in reads (and thus, a
		genome). To calculate statistics which reflect the
		approximate truth regarding the average coverage of a genome,
		the "large contig size for stats" value of
		[-MI:lcs4s] is used as a cutoff threshold: contigs
		smaller than this value do not contribute to the calculation
		of average coverage while contigs larger than or equal to this
		value do.
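	      </p><p>
		As an illustration of the cutoff, the following sketch
		computes an average coverage from large contigs only. The
		function name and the length-weighted averaging are
		assumptions made for this example; the text does not say how
		MIRA weights the contigs internally.
	      </p>

```python
def average_coverage(contigs: list, lcs4s: int = 5000) -> float:
    """Length-weighted average coverage over contigs of at least lcs4s bases.

    contigs holds (length, average_coverage) pairs.
    """
    # Contigs below the cutoff do not contribute at all.
    large = [(length, cov) for length, cov in contigs if length >= lcs4s]
    total_length = sum(length for length, _ in large)
    if total_length == 0:
        return 0.0
    return sum(length * cov for length, cov in large) / total_length


# Two large contigs, one small repetitive contig with inflated coverage
# and one small low-coverage contig made of leftover reads.
contigs = [(120000, 30.0), (80000, 32.0), (900, 95.0), (400, 2.0)]
print(average_coverage(contigs))  # 30.8
```

	      <p>
		The two small contigs (95x and 2x) are ignored, so the
		result is determined by the large contigs alone.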
	      </p><p>
		This reflects two facts: on the one hand - especially with
		short read sequencing technologies and in projects without
		read pair libraries - contigs containing predominantly
		repetitive sequences are of a relatively small size. On the
		other hand, reads which could not be placed into contigs
		(maybe due to a sequencing technology dependent motif error)
		often enough form small contigs with extremely low
		coverage.
	      </p><p>
		 It should be clear that one does not want any of the above
		 when calculating average coverage statistics and having this
		 cutoff discards small contigs which tend to muddy the
		 picture. If in doubt, don't touch this parameter.
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_misc_nw"></a>3.4.4.16. 
	  Parameter group: -NAG_AND_WARN (-NW)
	</h4></div></div></div><p>
	  Parameters which let MIRA warn you about unusual things or potential
	  problems. The flags in this parameter section come in three
	  flavours: <span class="emphasis"><em>stop</em></span>, <span class="emphasis"><em>warn</em></span> and
	  <span class="emphasis"><em>no</em></span> which let MIRA either stop, give a warning
	  or do nothing if a specific problem is detected.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [check_nfs(cnfs)=<em class="replaceable"><code>stop|warn|no</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">stop</span>. MIRA
		will check whether the tmp directory is located on an NFS
		mount.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
		  You should never ever at all run MIRA on an NFS mounted
		  directory ... or face the fact that the assembly process
		  may very well take 5 to 10 times longer (or more) than
		  normal. You have been warned.
		</p><p>
		  The reason for the slowdown is the same as why one should
		  never run a BLAST search on a big database located on
		  an NFS volume: access via network is terribly slow when
		  compared to local disks, at least if you have not invested a
		  lot of money into specialised solutions.
		</p></td></tr></table></div></dd><dt><span class="term">
	      [check_duplicate_readnames(cdrn)=<em class="replaceable"><code>stop|warn|no</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">stop</span>. MIRA
		will check for duplicate read names after loading.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
		  Duplicate read names usually hint at a serious problem with
		  your input and should really, really be fixed. You can
		  choose to ignore this error by switching off this flag, but
		  this will almost certainly lead to problems with result
		  files (ACE and CAF for sure, maybe also SAM) and probably to
		  other unexpected effects.
		</p></td></tr></table></div></dd><dt><span class="term">
	      [check_template_problems(ctp)=<em class="replaceable"><code>stop|warn|no</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">stop</span>. MIRA
		will check read template naming after loading.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
		  Problems in read template naming point to problems with read
		  names or to broken template information. You should try to
		  find the cause of the problem instead of ignoring this error
		  message.
		</p></td></tr></table></div></dd><dt><span class="term">
	      [check_maxreadnamelength(cmrnl)=<em class="replaceable"><code>stop|warn|no</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">stop</span>. MIRA
		will check whether the length of the names of your reads
		surpasses the given number of characters (see  [-NW:mrnl]).
	      </p><p>
		While MIRA and many other programs have no problem with long read names,
		some older programs have restrictions concerning the length of
		the read name. For example, the pipeline <code class="literal">CAF -&gt;
		caf2gap -&gt; gap2caf</code> will stop working at
		the <span class="command"><strong>gap2caf</strong></span> stage if there are read names
		longer than 40 characters which differ only after the 40th
		character.
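	      </p><p>
		To find affected reads before starting an assembly, a small
		script along the following lines may help. It is a sketch and
		not part of MIRA; the function name is hypothetical. Besides
		listing over-long names, it also reports names that become
		identical once cut to the check length, which is the situation
		described above for gap2caf.
	      </p>

```python
from collections import defaultdict


def check_read_names(names: list, mrnl: int = 40) -> tuple:
    """Return (too_long, clashes): names longer than mrnl characters, and
    groups of names that become identical when cut to mrnl characters."""
    too_long = [name for name in names if len(name) > mrnl]
    by_prefix = defaultdict(list)
    for name in names:
        by_prefix[name[:mrnl]].append(name)
    clashes = [group for group in by_prefix.values() if len(group) > 1]
    return too_long, clashes


# A short check length keeps the demonstration readable.
names = ["run1_read_0001.f", "run1_read_0001.r", "short"]
too_long, clashes = check_read_names(names, mrnl=14)
print(too_long)  # ['run1_read_0001.f', 'run1_read_0001.r']
print(clashes)   # [['run1_read_0001.f', 'run1_read_0001.r']]
```

	      <p>
		With the default check length of 40, names of up to 40
		characters pass and only longer ones are reported.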
	      </p><p>
		This should be a warning only, but as a couple of people were
		bitten by this, the default behaviour of MIRA is to stop when
		it sees that potential problem. You might want to rename your
		reads to have &#8804; 40 characters.
	      </p><p>
		On the other hand, you also can ignore this potential problem
		and force MIRA to continue by using the parameter:
		[-NW:cmrnl=warn] or  [-NW:cmrnl=no]
	      </p></dd><dt><span class="term">
	      [maxreadnamelength(mrnl)=<em class="replaceable"><code>integer &#8805;
	      0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">40</span>. This
		defines the effective check length for  [-NW:cmrnl].
	      </p></dd><dt><span class="term">
	      [check_average_coverage(cac)=<em class="replaceable"><code>stop|warn|no</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">stop</span>. In
		genome de-novo assemblies, MIRA will perform checks early in
		the assembly process whether the average coverage to be
		expected exceeds a given value (see  [-NW:acv]).
	      </p><p>
		With today's sequencing technologies (especially Illumina, but
		also Ion Torrent and 454), many people simply take everything
		they get and throw it into an assembly. Which, in the case of
		Illumina and Ion, can mean they try to assemble their organism
		with a coverage of 100x, 200x and more (I've seen trials with
		more than 1000x).
	      </p><p>
		This is not good. Not. At. All! For two reasons (well, three
		to be precise).
	      </p><p>
		The first reason is that, usually, one does not sequence a
		single cell but a population of cells. If this population is
		not clonal (i.e., it contains subpopulations which differ
		genomically from each other), assemblers will be able to pick
		up these differences in the DNA once a certain sequence count
		is reached and they will try to reconstruct a genome containing
		all clonal variations, treating these variations as potential
		repeats with slightly different sequences. Which, of course,
		will be wrong and I am pretty sure you do not want that.
	      </p><p>
		The second and way more important reason is that none of the
		current sequencing technologies is completely error free. Even
		more problematic, they contain both random and non-random
		sequencing errors. Especially the latter can become a big
		hurdle if these non-random errors are so prevalent that they
		suddenly appear to be valid sequence to an assembler. This in
		turn leads to false repeat detection, hence possibly contig
		breaks or even wrong consensus sequence. You don't want that,
		do you?
	      </p><p>
		The last reason is that overlap based assemblers (like MIRA
		is) need <span class="emphasis"><em>exponentially</em></span> more time and
		memory when the coverage increases. So keeping the coverage
		comparatively low helps you there.
	      </p></dd><dt><span class="term">
	      [average_coverage_value(acv)=<em class="replaceable"><code>integer &#8805;
	      0</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">80</span> for
		de-novo assemblies, in mapping assemblies it is 120 for Ion
		Torrent and 160 for Illumina data (might change in
		future). This defines the effective coverage to check for in
		 [-NW:cac].
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_directory_dir_di"></a>3.4.4.17. 
	  Parameter group: -DIRECTORY (-DIR, -DI)
	</h4></div></div></div><p>
	  General options for controlling where to find or where to write data.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [tmp_redirected_to(trt)=<em class="replaceable"><code>&lt;directoryname&gt;</code></em>]
	    </span></dt><dd><p>
		Default is an empty string. When set to a non-empty string,
		MIRA will create the MIRA-temporary directory at the given
		location instead of using the current working directory.
	      </p><p>
		This option is particularly useful for systems which have
		solid state disks (SSDs) and some very fast disk subsystems
		which can be used for temporary files. Or in projects where
		the input and output files reside on a NFS mounted directory
		(current working dir), to put the tmp directory somewhere
		outside the NFS (see also: Things you should not do).
	      </p><p>
		In both cases above, and for larger projects, MIRA then runs
		a lot faster.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Prior to MIRA 4.0rc2, users had to make sure themselves that
		the target directory did not already exist. MIRA now handles
		this automatically by creating directory names with a random
		substring attached.
	      </td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_output_out"></a>3.4.4.18. 
	  Parameter group: -OUTPUT (-OUT)
	</h4></div></div></div><p>
	  Options for controlling which results to write to which type of files.
	  Additionally, a few options allow output customisation of textual
	  alignments (in text and HTML files).
	</p><p>
	  There are 3 types of results: results, temporary results and extra
	  temporary results. One probably needs only the results. Temporary
	  and extra temporary results are written while building different
	  stages of a contig and are given as convenience for trying to find
	  out why MIRA set some RMBs or disassembled some contigs.
	</p><p>
	  Output can be generated in these formats: CAF, Gap4 Directed
	  Assembly, FASTA, ACE, TCS, WIG, HTML and simple text.
	</p><p>
	  Naming conventions of the files follow the rules described in
	  section <span class="bold"><strong>Input / Output</strong></span>, subsection
	  <span class="bold"><strong>Filenames</strong></span>.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [savesimplesingletsinproject(sssip)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>. Controls
		whether 'unimportant' singlets are written to the result
		files.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Note that a value larger than 1 for the [-AS:mrpc]
		parameter will disable the function of this parameter.
	      </td></tr></table></div></dd><dt><span class="term">
	      [savetaggedsingletsinproject(stsip)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default
		is <span class="underline">yes</span>. Controls whether
		singlets which have certain tags (see below) are written to
		the result files, even if  [-OUT:sssip] (see above) is
		set.
	      </p><p>
		If one of the (SRMr, CRMr, WRMr, SROr, SAOr, SIOr) tags
		appears in a singlet, MIRA will see that the singlet had been
		part of a larger alignment in earlier passes and was even part
		of a potentially 'important' decision. To give human finishers
		the possibility to trace back the decision, these singlets
		can be written to result files.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Note that a value larger than 1 for the [-AS:mrpc]
		parameter will disable the function of this parameter.
	      </td></tr></table></div></dd><dt><span class="term">
	      [remove_rollover_tmps(rrot)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline">yes</span>. Removes log and
	      temporary files once they should not be needed anymore during
	      the assembly process.
	      </p></dd><dt><span class="term">
	      [remove_tmp_directory(rtd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline">no</span>. Removes the
	      complete tmp directory at the end of the assembly process. Some
	      logs and temporary files contain useful information that you may
	      want to analyse though, therefore the default of MIRA is not to
	      delete it.
	      </p></dd><dt><span class="term">
	      [output_result_caf(orc)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">yes</span>.
	      </p></dd><dt><span class="term">
	      [output_result_maf(orm)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">yes</span>.
	      </p></dd><dt><span class="term">
	      [output_result_gap4da(org)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		If set to <span class="underline">yes</span>, MIRA will
		automatically switch back
		to <span class="underline">no</span> (and cannot be
		forced to 'yes') when 454 or Solexa reads are present in the
		project as this ensures that the file system does not get
		flooded with millions of files.
	      </td></tr></table></div></dd><dt><span class="term">
	      [output_result_fasta(orf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>.
	      </p></dd><dt><span class="term">
	      [output_result_ace(ora)=<em class="replaceable"><code>on|y[es]|t[rue],
	      off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline">no</span>.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		ACE is the least suited file format for NGS data. Use it only
		when absolutely necessary.
	      </td></tr></table></div></dd><dt><span class="term">
	      [output_result_txt(ort)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_result_tcs(ors)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">yes</span>.
	      </p></dd><dt><span class="term">
	      [output_result_html(orh)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_caf(otc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_maf(otm)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_gap4da(otg)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_fasta(otf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_ace(ota)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_txt(ott)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_tcs(ots)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_tmpresult_html(oth)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_exttmpresult_caf(oetc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_exttmpresult_gap4da(oetg)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_exttmpresult_fasta(oetf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_exttmpresult_ace(oeta)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_exttmpresult_txt(oett)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [output_exttmpresult_html(oeth)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
	    </span></dt><dd><p>
		Default is <span class="underline">no</span>.
	      </p></dd><dt><span class="term">
	      [text_chars_per_line(tcpl)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">60</span>. When producing an output in text format
	      ( [-OUT:ort|ott|oett]), this parameter defines how many bases
	      each line of an alignment should contain.
	      </p></dd><dt><span class="term">
	      [html_chars_per_line(hcpl)=<em class="replaceable"><code>integer &gt; 0</code></em>]
	    </span></dt><dd><p> Default is
	      <span class="underline">60</span>. When producing an output in HTML format
	      ([-OUT:orh|oth|oeth]), this parameter defines how many bases
	      each line of an alignment should contain.
	      </p></dd><dt><span class="term">
	      [text_endgap_fillchar(tegfc)=<em class="replaceable"><code>&lt;single character&gt;</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline"> </span> (a blank). When producing an output in text format
	      ( [-OUT:ort|ott|oett]), endgaps are filled up with this
	      character.
	      </p></dd><dt><span class="term">
	      [html_endgap_fillchar(hegfc)=<em class="replaceable"><code>&lt;single character&gt;</code></em>]
	    </span></dt><dd><p> Default
	      is <span class="underline"> </span> (a blank). When producing an output in HTML format
	      ( [-OUT:orh|oth|oeth]), end-gaps are filled up with this
	      character.
	      </p></dd></dl></div><p>
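	  The line-length and fill-character parameters above can be illustrated
	  with a small Python sketch. It is not MIRA's own output code, and the
	  layout it prints is an assumption made purely for illustration: reads
	  are given as hypothetical (offset, sequence) pairs, end gaps are padded
	  with a fill character (as with [-OUT:tegfc]) and the rows are cut into
	  blocks of a fixed number of columns (as with [-OUT:tcpl]).
	</p><pre class="programlisting">
def render_rows(rows, chars_per_line=60, endgap_fillchar=" "):
    """Return text lines for reads given as (offset, sequence) pairs.

    Every read is padded on both sides with endgap_fillchar so that all
    rows span the same columns; the rows are then cut into blocks of
    chars_per_line columns, with an empty line after each block.
    """
    width = max(offset + len(seq) for offset, seq in rows)
    padded = [
        endgap_fillchar * offset + seq
        + endgap_fillchar * (width - offset - len(seq))
        for offset, seq in rows
    ]
    lines = []
    for start in range(0, width, chars_per_line):
        for row in padded:
            lines.append(row[start:start + chars_per_line])
        lines.append("")  # empty line between blocks
    return lines


# Two made-up reads, six columns per line, "." as visible fill character.
reads = [(0, "ACGTACGT"), (3, "TACGTAA")]
for line in render_rows(reads, chars_per_line=6, endgap_fillchar="."):
    print(line)
</pre><p>
	  Using a visible fill character such as "." instead of the default
	  blank makes it easier to see where a read starts and ends within a
	  block.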
	</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_resuming_assemblies"></a>3.5. 
      Resuming / restarting assemblies
    </h2></div></div></div><p>
      It may happen that a MIRA run is interrupted - sometimes rather harshly
      - due to events more or less outside your control like, e.g., power
      failures, machine shutdowns for maintenance, missing disk space,
      run-time quotas etc. This may be less of a problem when assembling or
      mapping small data sets with run times between a couple of minutes up to
      a few hours, but becomes a nuisance for larger data sets like in small
      eukaryotes or RNASeq samples where the run time is measured in days.
    </p><p>
      If this happens in de-novo assemblies, MIRA has
      a <span class="emphasis"><em>resume</em></span> functionality: at predefined points in the
      assembly process, MIRA writes special files to disk which enable it
      to resume the assembly at the point where these files were
      written. Starting MIRA in resume mode is easy: simply add the
      resume flag [-r] to the command line like this:
    </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>mira -r ...</code></strong></pre><p>
      where the ellipsis ("...") above stands for the rest of the command line you would have used to start a new assembly.
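    </p><p>
      A wrapper script can choose between a fresh start and a resume. The
      following Python sketch shows one way to do that under assumptions
      that are not part of MIRA itself: the helper name
      <code class="literal">mira_command</code> is made up,
      <code class="filename">manifest.conf</code> is only a placeholder for
      your own arguments, and the presence of the checkpoint directory (named
      as in the "Directories" section) is taken as a hint that a resume may be
      possible.
    </p><pre class="programlisting">
from pathlib import Path


def mira_command(original_args, project_name, workdir="."):
    """Return the argument list for starting or resuming a MIRA run.

    original_args: the arguments you would use to start a new assembly.
    The checkpoint directory is looked up below workdir; treating its
    presence as "resumable" is an assumption of this sketch.
    """
    chkpt_dir = (Path(workdir) / f"{project_name}_assembly"
                 / f"{project_name}_d_chkpt")
    if chkpt_dir.is_dir():
        # Same command line as before, with the resume flag added.
        return ["mira", "-r", *original_args]
    return ["mira", *original_args]


# "manifest.conf" and "myproject" are placeholders.
print(mira_command(["manifest.conf"], "myproject"))
</pre><p>
      The returned list can be handed to
      <code class="literal">subprocess.run()</code>. Whether a given
      checkpoint is actually usable is decided by MIRA, not by this check.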
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_input_output"></a>3.6. 
      Input / Output
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_directories"></a>3.6.1. 
	Directories
      </h3></div></div></div><p>
	Since version 3.0.0, MIRA puts all files and directories it
	generates into one sub-directory which is named
	<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly</code>. This directory contains up to four
	sub-directories:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>: this directory contains all the
	    output files of the assembly in different formats.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>: this directory contains information
	    files of the final assembly. They provide statistics as well as, e.g.,
	    information (easily parsable by scripts) on which read is found in which
	    contig etc.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_tmp</code>:
	    this directory contains temporary files, including temporary
	    assembly files. It can safely be removed after an assembly, as it
	    may easily hold a few GB of data that are normally not needed anymore.
	  </p><p>
	    In case of problems, please do not delete it: I will get in touch
	    with you for additional information that might be present
	    in the tmp directory.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_chkpt</code>: this directory
	    contains checkpoint files needed to resume assemblies that crashed
	    or were stopped.
	  </p></li></ul></div><p>
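	The directory layout above is regular enough to be derived from the
	project name in a script. A minimal Python sketch, assuming the project
	directory lives below the current working directory; the function name
	and the project name "myproject" are hypothetical:
      </p><pre class="programlisting">
from pathlib import Path

# Suffixes of the four sub-directories described above.
SUBDIR_KINDS = ("results", "info", "tmp", "chkpt")


def project_directories(project_name, workdir="."):
    """Map each sub-directory kind to the path where it is expected."""
    base = Path(workdir) / f"{project_name}_assembly"
    return {kind: base / f"{project_name}_d_{kind}" for kind in SUBDIR_KINDS}


# Report which of the (up to four) sub-directories are present.
for kind, path in project_directories("myproject").items():
    state = "present" if path.is_dir() else "missing"
    print(f"{kind:8} {path} ({state})")
</pre><p>
	Because only "up to four" sub-directories are created, a missing entry
	is not necessarily an error.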
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_filenames"></a>3.6.2. 
	Filenames
      </h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_output"></a>3.6.2.1. 
	  Output
	</h4></div></div></div><p>
	  These result output files and sub-directories are placed in the
	  <em class="replaceable"><code>projectname</code></em>_d_results directory after a run of MIRA.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.&lt;type&gt;</code>
	    </span></dt><dd><p> Assembled project written in type =
	      (<span class="emphasis"><em>maf</em></span> / <span class="emphasis"><em>gap4da</em></span> / <span class="emphasis"><em>caf</em></span> /
	      <span class="emphasis"><em>ace</em></span> / <span class="emphasis"><em>fasta</em></span> /
	      <span class="emphasis"><em>html</em></span> / <span class="emphasis"><em>tcs</em></span> /
	      <span class="emphasis"><em>wig</em></span> / <span class="emphasis"><em>text</em></span>) format by
	      MIRA, final result.
	      </p><p>
		Type <span class="emphasis"><em>gap4da</em></span> is a directory containing
		experiment files and a file of filenames (called 'fofn'), all
		other types are files. <span class="emphasis"><em>gap4da</em></span>,
		<span class="emphasis"><em>caf</em></span>, <span class="emphasis"><em>ace</em></span> contain the
		complete assembly information suitable for import into
		different post-processing tools (gap4, consed and
		others). <span class="emphasis"><em>html</em></span> and
		<span class="emphasis"><em>text</em></span> contain visual representations of
		the assembly suited for viewing in browsers or as simple text
		file. <span class="emphasis"><em>tcs</em></span> is a summary of a contig suited
		for "quick" analysis from command-line tools or even visual
		inspection. <span class="emphasis"><em>wig</em></span> is a file containing
		coverage information (useful for mapping assemblies) which can
		be loaded and shown by different genome browsers (IGB, GMOD,
		UCSC and probably many more).
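	      </p><p>
		As an illustration of such a "quick" analysis, the following
		Python sketch parses one data line of a
		<span class="emphasis"><em>tcs</em></span> file. It follows the
		column order documented in the TCS entry of the "File formats"
		section; the function name and the sample line are made up for
		this example and were not produced by MIRA.
	      </p><pre class="programlisting">
def parse_tcs_line(line):
    """Parse one data line of a TCS 1.0 file into a dictionary."""
    # The vertical bars are visual separators only, so drop them.
    fields = [field for field in line.split() if field != "|"]
    name, padded_pos, unpadded_pos, base, quality = fields[:5]
    # Total coverage, then coverage of A, C, G, T and "*" (gap).
    coverage = [int(value) for value in fields[5:11]]
    # Per-base qualities are "--" when no read shows that base.
    qualities = [None if q == "--" else int(q) for q in fields[11:16]]
    return {
        "contig": name,
        "padded_pos": int(padded_pos),
        "unpadded_pos": int(unpadded_pos),
        "base": base,
        "quality": int(quality),
        "total_coverage": coverage[0],
        "coverage": dict(zip("ACGT*", coverage[1:])),
        "qualities": dict(zip("ACGT*", qualities)),
        "status": fields[16],
        "tags": fields[17:],
    }


# Hand-written sample line, not real MIRA output.
sample = "demo_c1  10  9 | A 60 | 12 11 0 0 0 1 | 60 -- -- -- 20 | !m | DGPc H454"
record = parse_tcs_line(sample)
if record["status"] != ":":
    print(record["contig"], record["padded_pos"], record["status"], record["tags"])
</pre><p>
		Lines that are not data lines (for example comment or header
		lines, should your file contain any) have to be skipped before
		calling the function.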
	      </p><p>
		<span class="emphasis"><em>fasta</em></span> contains the contig consensus
		sequences (and .fasta.qual the consensus qualities). Please
		note that they come in two flavours:
		<span class="underline">padded</span>
		and <span class="underline">unpadded</span>. The padded
		versions may contain stars (*) denoting gap base positions
		where there was some minor evidence for additional bases, but
		not strong enough to be considered as a real base. Unpadded
		versions have these gaps removed. Padded versions have an
		additional postfix <span class="emphasis"><em>.padded</em></span>, while
		unpadded versions have <span class="emphasis"><em>.unpadded</em></span>.
	      </p></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_LargeContigs_out.&lt;type&gt;</code>
	    </span></dt><dd>
	      These files are only written when MIRA runs in
	      <span class="emphasis"><em>de-novo</em></span> mode. They usually contain a subset
	      of contigs deemed 'large' from the whole project. More details
	      are given in the chapter "working with results of MIRA."
	    </dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_assembly_statistics_and_information_files"></a>3.6.2.2. 
	  Assembly statistics and information files
	</h4></div></div></div><p>
	  These information files are placed in the
	  <em class="replaceable"><code>projectname</code></em>_d_info directory after a run of
	  MIRA.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_assembly.txt</code>
	    </span></dt><dd><p>
		This file contains basic information about the
		assembly. MIRA will split the information in two
		parts: information about <span class="emphasis"><em>large</em></span>
		contigs and information about all contigs.
	      </p><p>
		For more information on how to interpret this file,
		please consult the chapter on "Results" of the MIRA
		documentation manual.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		In contrast to other information files, this file
		always appears in the "info" directory, even when just
		intermediate results are reported.
	      </td></tr></table></div></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigreadlist.txt</code>
	    </span></dt><dd><p> This file lists which reads have been
	      assembled into which contigs (or singlets).
	      </p></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigstats.txt</code>
	    </span></dt><dd><p> This file contains statistics about the contigs
	      themselves, their length, average consensus quality, number of
	      reads, maximum and average coverage, average read length, number
	      of A, C, G, T, N, X and gaps in consensus.
	      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
		For contigs containing digitally normalised reads, the coverage numbers may sometimes seem strange. E.g.: a contig may contain only one read, but have an average coverage of 3. This means that the read was a representative for 3 reads. The coverage numbers are computed as if all 3 reads had been assembled instead of the representative. In EST/RNASeq projects, these numbers thus represent the (more or less) true expression coverage.
	      </td></tr></table></div></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_consensustaglist.txt</code>
	    </span></dt><dd><p> This file contains
	      information about the tags (and their position) that are present in the
	      consensus of a contig.
	      </p></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_largecontigs.txt</code>
	    </span></dt><dd><p>For de-novo assemblies, this file contains the names of the
	      contigs which pass the (adaptable) 'large contig' criterion.
	      </p></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readrepeats.lst</code>
	    </span></dt><dd><p>
		Tab delimited file with three columns: read name, repeat level tag, sequence.
	      </p><p>
		This file permits a quick analysis of the repetitiveness of
		different parts of reads in a project. Use
		[-SK:rliif] to control the repeat level from which
		subsequences of reads are written to this file.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Reads can have more than one entry in this file. E.g., with
		standard settings (<code class="literal">-SK:rliif=6</code>) if the
		start of a read is covered by MNRr, followed by a HAF3 region
		and finally the read ends with HAF6, then there will be two
		lines in the file: one for the subsequence covered by MNRr,
		one for HAF6.
	      </td></tr></table></div></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readstooshort</code>
	    </span></dt><dd><p> A list containing the
	      names of those reads that were removed from the assembly before any
	      processing started solely because they were too short.
	      </p></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readtaglist.txt</code>
	    </span></dt><dd><p> This file contains
	      information about the tags and their position that are present in each
	      read. The read positions are given relative to the forward direction of the
	      sequence (i.e. as it was entered into the assembly).
	      </p></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_WARNINGS_*.txt</code>
	    </span></dt><dd><p>
		These files collect warning messages MIRA dumped out
		throughout the assembly process. These warnings cover a wide
		area of things monitored by MIRA and can - together with the
		output written to STDOUT - give an insight as to why an
		assembly does not behave as expected. There are three warning
		files representing different levels of
		criticality: <span class="emphasis"><em>critical</em></span>, <span class="emphasis"><em>medium</em></span>
		and <span class="emphasis"><em>minor</em></span>. These files may be empty,
		meaning that no warning of the corresponding level was
		printed. It is strongly suggested to have a look at least at
		critical warnings during and after an assembly run.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		These files are quite new to MIRA and not all warning messages
		appear there yet. This will come over time.
	      </td></tr></table></div></dd><dt><span class="term">
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_error_reads_invalid</code>
	    </span></dt><dd><p> A list of sequences that
	      have been found to be invalid due to various reasons (given in the output of
	      the assembler).
	      </p></dd></dl></div><p>
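	  Several of the information files above are meant to be parsed by
	  scripts. As one example, the tab-delimited
	  <code class="filename">_info_readrepeats.lst</code> file (read name,
	  repeat level tag, sequence) can be summarised per tag. The following
	  Python sketch assumes exactly those three columns; the function name
	  and the sample rows are made up for illustration.
	</p><pre class="programlisting">
import csv
from collections import Counter


def count_repeat_tags(lines):
    """Count the entries per repeat level tag.

    lines: any iterable of text lines, for example an open file, with
    the tab-separated columns read name, repeat level tag, sequence.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip empty or malformed lines
        read_name, tag, sequence = row
        counts[tag] += 1
    return counts


# Hand-written sample rows; a read may appear more than once.
sample = [
    "read1\tMNRr\tACGTACGT",
    "read1\tHAF6\tTTTTGGGG",
    "read2\tHAF6\tCCCCAAAA",
]
for tag, number in sorted(count_repeat_tags(sample).items()):
    print(tag, number)
</pre><p>
	  For a real project, pass an open file handle of the
	  <code class="filename">_info_readrepeats.lst</code> file instead of
	  the sample list.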
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_file_formats"></a>3.6.3. 
	File formats
      </h3></div></div></div><p>
	MIRA can write almost all of the following formats and can read most
	of them.
      </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	    <code class="filename">ACE</code>
	  </span></dt><dd><p> An old assembly file format used mainly by phrap and
	    consed. Support for .ace output is currently only in test status in
	    MIRA, as documentation on that format is sparse and I currently
	    don't have access to consed to verify my assumptions.
	    </p><p> Using consed, you will need to load projects with -nophd to
	    view them. Tags (in reads and consensus) are fully supported. The
	    only hitch: consed has a bug which prevents it from reading consensus
	    tags which are located throughout the whole file (as MIRA writes
	    by default). The solution is easy: filter the ACE file
	    through the fixACE4consed.tcl script which is provided in older
	    MIRA distributions (V4.9.5 and before), then all should be well.
	    </p><p> If you don't have consed, you might want to try clview
	    (<a class="ulink" href="http://www.tigr.org/tdb/tgi/software/" target="_top">http://www.tigr.org/tdb/tgi/software/</a>) from TIGR
	    to look at .ace files.
	    </p></dd><dt><span class="term">
	    <code class="filename">BAM</code>
	  </span></dt><dd>
	    The binary cousin of the SAM format. MIRA neither reads nor writes
	    BAM, but BAMs can be created out of SAMs (which can be created via
	    <span class="command"><strong>miraconvert</strong></span>).
	  </dd><dt><span class="term">
	    <code class="filename">CAF</code>
	  </span></dt><dd><p> Common Assembly Format (CAF) developed by the Sanger
	    Centre. <a class="ulink" href="http://www.sanger.ac.uk/resources/software/caf.html" target="_top">http://www.sanger.ac.uk/resources/software/caf.html</a> provides a
	    description of the format and some software documentation as well as the
	    source for compiling caf2gap and gap2caf (thanks to Rob Davies
	    for this).
	    </p></dd><dt><span class="term">
	    <code class="filename">EXP</code>
	  </span></dt><dd><p> Standard experiment files used in genome
	    sequencing. Correct EXP files are expected. Especially the ID
	    record (containing the id of the reading) and the LN record
	    (containing the name of the corresponding trace file) should be
	    correctly set. See <a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for links to
	    online format description.
	    </p></dd><dt><span class="term">
	    <code class="filename">FASTA</code>
	  </span></dt><dd><p> A simple format for sequence data, see
	    <a class="ulink" href="http://www.ncbi.nlm.nih.gov/BLAST/fasta.html" target="_top">http://www.ncbi.nlm.nih.gov/BLAST/fasta.html</a>. An
	    often used extension of that format is used to also store quality
	    values in a similar fashion, these files have a .fasta.qual
	    ending.
	    </p><p>
	      MIRA writes two kinds of FASTA files for
	      results: <span class="emphasis"><em>padded</em></span> and
	      <span class="emphasis"><em>unpadded</em></span>. The difference is that the padded
	      version still contains the gap (pad) character (an asterisk) at
	      positions in the consensus where some of the reads apparently
	      had some more bases than others but where the consensus routines
	      decided to treat them as artifacts. The
	      <span class="emphasis"><em>unpadded</em></span> version has the gaps removed.
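	    </p><p>
	      The relation between the two flavours can be sketched in a few
	      lines of Python. The helper name is hypothetical, and the
	      handling of quality values assumes that a padded quality list
	      carries one value per padded position, which is an assumption
	      of this example.
	    </p><pre class="programlisting">
def unpad(sequence, qualities=None, pad="*"):
    """Remove pad characters and, if given, their quality values."""
    # Indices of all positions that hold a real base.
    keep = [i for i, base in enumerate(sequence) if base != pad]
    unpadded = "".join(sequence[i] for i in keep)
    if qualities is None:
        return unpadded
    return unpadded, [qualities[i] for i in keep]


print(unpad("AC*GT"))
print(unpad("AC*GT", [30, 30, 5, 40, 40]))
</pre><p>
	      Note that positions shift once pads are removed, so coordinates
	      taken from a padded consensus cannot be used unchanged on the
	      unpadded one.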
	    </p></dd><dt><span class="term">
	    <code class="filename">GBF, GBK</code>
	  </span></dt><dd><p> GenBank file format as used at the NCBI to describe
	    sequences. MIRA is able to read and write this format (but only
	    for viruses or bacteria) for using sequences as backbones in an
	    assembly. Features of the GenBank format are also transferred
	    automatically to Staden compatible tags.
	    </p><p>
	      If possible, use GFF3 instead (see below).
	    </p></dd><dt><span class="term">
	    <code class="filename">GFF3</code>
	  </span></dt><dd><p> General feature format used to describe sequences and
	    features on these sequences. MIRA is able to read and write this
	    format.
	    </p></dd><dt><span class="term">
	    <code class="filename">HTML</code>
	  </span></dt><dd><p> Hypertext Markup Language. Projects written in HTML format
	    can be viewed directly with any table capable browser. Display is even
	    better if the browser knows style sheets (CSS).
	    </p></dd><dt><span class="term">
	    <code class="filename">MAF</code>
	  </span></dt><dd><p> MIRA Assembly Format (MAF). A faster and more compact form
	    than EXP, CAF or ACE. See documentation in separate file.
	    </p></dd><dt><span class="term">
	    <code class="filename">PHD</code>
	  </span></dt><dd><p> This file type originates from the phred base caller.
	    Along with some other status information, it contains the
	    base sequence, the base quality values and the peak indices, but not the
	    sequence traces themselves.
	    </p></dd><dt><span class="term">
	    <code class="filename">SAM</code>
	  </span></dt><dd><p> The Sequence Alignment/Map Format. MIRA does not write SAM
	    directly, but <span class="command"><strong>miraconvert</strong></span> can be used for
	    converting a MAF (or CAF) file to SAM.
	    </p><p>
	      MIRA cannot read SAM though.
	    </p></dd><dt><span class="term">
	    <code class="filename">SCF</code>
	  </span></dt><dd><p> The Staden trace file format that has established itself as a
	    compact standard replacement for the much bigger ABI files. See
	    <a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for
	    links to online format description.
	    </p><p>
	      The SCF files should be V2-8bit, V2-16bit, V3-8bit or V3-16bit
	      and can be packed with compress or gzip.
	    </p></dd><dt><span class="term">
	    <code class="filename">traceinfo.XML</code>
	  </span></dt><dd><p> XML based file with information relating to
	    traces. Used at the NCBI and ENSEMBL trace archive to store additional
	    information (like clippings, insert sizes etc.) for projects. See further
	    down for a description of the fields used and
	    <a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&amp;f=rfc&amp;m=main&amp;s=rfc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&amp;f=rfc&amp;m=main&amp;s=rfc</a> for a full description of all fields.
	    </p></dd><dt><span class="term">
	    <code class="filename">TCS</code>
	  </span></dt><dd><p> Transpose Contig Summary. A text file as written by MIRA
	    which gives a summary of a contig in tabular fashion, one line per
	    base. Nicely suited for "quick" analysis from command line tools,
	    scripts, or even visual inspection in file viewers or spreadsheet
	    programs.
	    </p><p> In the current file version (TCS 1.0), each column is
	    separated by at least one space from the next. Vertical bars are
	    inserted as visual delimiter to help inspection by eye. The
	    following columns are written into the file:
	    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
		  contig name (width 20)
		</p></li><li class="listitem"><p>
		  padded position in contigs (width 3)
		</p></li><li class="listitem"><p>
		  unpadded position in contigs (width 3)
		</p></li><li class="listitem"><p>
		  separator (a vertical bar)
		</p></li><li class="listitem"><p>
		  called consensus base
		</p></li><li class="listitem"><p>
		  quality of called consensus base (0-100), but MIRA itself caps at 90.
		</p></li><li class="listitem"><p>
		  separator (a vertical bar)
		</p></li><li class="listitem"><p>
		  total coverage in number of reads. This number can be higher than the
		  sum of the next five columns if Ns or IUPAC bases are present in the
		  sequence of reads.
		</p></li><li class="listitem"><p>
		  coverage of reads having an "A"
		</p></li><li class="listitem"><p>
		  coverage of reads having a "C"
		</p></li><li class="listitem"><p>
		  coverage of reads having a "G"
		</p></li><li class="listitem"><p>
		  coverage of reads having a "T"
		</p></li><li class="listitem"><p>
		  coverage of reads having an "*" (a gap)
		</p></li><li class="listitem"><p>
		  separator (a vertical bar)
		</p></li><li class="listitem"><p>
		  quality of "A" or "--" if none
		</p></li><li class="listitem"><p>
		  quality of "C" or "--" if none
		</p></li><li class="listitem"><p>
		  quality of "G" or "--" if none
		</p></li><li class="listitem"><p>
		  quality of "T" or "--" if none
		</p></li><li class="listitem"><p>
		  quality of "*" (gap) or "--" if none
		</p></li><li class="listitem"><p>
		  separator (a vertical bar)
		</p></li><li class="listitem"><p>
		  Status. This field sums up the evaluation of MIRA whether you should
		  have a look at this base or not. The content can be one of the following:
		</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
		      everything OK: a colon (:)
		    </p></li><li class="listitem"><p>
		      unclear base calling (IUPAC base): a "!M"
		    </p></li><li class="listitem"><p>
		      potentially problematic base calling involving a gap or low quality: a "!m"
		    </p></li><li class="listitem"><p>
		      consensus tag(s) of MIRA that hint to problems: a "!$". Currently,
		      the following tags will lead to this marker: SRMc, WRMc, DGPc, UNSc,
		      IUPc.
		    </p></li></ul></div></li><li class="listitem"><p>
		  list of consensus tags at that position, delimited by a
		  space. E.g.: "DGPc H454"
		</p></li></ol></div></dd></dl></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_stdout_stderr"></a>3.6.4. 
	STDOUT/STDERR
      </h3></div></div></div><p>
	The current stage of the assembly is written to STDOUT, with status messages
	on what MIRA is doing at the moment. MIRA hardly writes to STDERR
	anymore; remnants will disappear over time.
      </p><p>
	Some debugging information might also be written to STDOUT if MIRA
	generates error messages.
      </p><p>
	On errors, MIRA will dump these also to STDOUT. Basically, three error classes
	exist:
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    WARNING: Messages in this error class do not stop the assembly but
	    are meant as information for the user. In some rare cases these
	    errors are due to an (always possible) error in the I/O routines
	    of MIRA, but nowadays they are mostly due to unexpected (read:
	    wrong) input data and can be traced back to errors in the
	    preprocessing stages. If these errors arise, you
	    definitely <span class="bold"><strong>DO</strong></span> want to check how
	    and why these errors came into those files in the first place.
	  </p><p>
	    Frequent causes of warnings include missing SCF files, SCF files
	    containing known quirks, EXP files containing known quirks etc.
	  </p></li><li class="listitem"><p>
	    FATAL: Messages in this error class actually stop the
	    assembly. These are mostly due to missing files that MIRA needs or
	    to very garbled (wrong) input data.
	  </p><p>
	    Frequent causes include naming an experiment file in the 'file of filenames'
	    that could not be found on the disk, same experiment file twice in the
	    project, suspected errors in the EXP files, etc.
	  </p></li><li class="listitem"><p>
	    INTERNAL: These are true programming errors that were caught by internal
	    checks. Should this happen, please mail the output of STDOUT and STDERR to
	    the author.
	  </p></li></ol></div><p>
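	Since status messages, warnings and errors all end up on STDOUT, and
	both STDOUT and STDERR are requested for bug reports, it is convenient
	to capture the two streams in a single log file. A minimal Python
	sketch using the standard <code class="literal">subprocess</code>
	module; the function name, the log file name and the
	<code class="filename">manifest.conf</code> argument in the comment
	are placeholders.
      </p><pre class="programlisting">
import subprocess


def run_and_log(command, log_path):
    """Run command and write its STDOUT and STDERR to one log file.

    Returns the exit code of the command.
    """
    with open(log_path, "w") as log:
        completed = subprocess.run(
            command,
            stdout=log,
            stderr=subprocess.STDOUT,  # merge STDERR into the same file
        )
    return completed.returncode


# Possible use (placeholder arguments, not executed here):
#   run_and_log(["mira", "manifest.conf"], "mira_run.log")
</pre><p>
	The resulting log file is what you would attach when reporting an
	INTERNAL error.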
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_ssaha2smalt"></a>3.6.5. 
	SSAHA2 / SMALT ancillary data
      </h3></div></div></div><p>
	The <span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> programs -
	both from the Sanger Centre - can be used to detect possible vector
	sequence stretches in the input data for the assembly.  MIRA can load
	the result files of a
	<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> run and
	interpret the results to tag the possible vector sequences at the ends
	of reads.
      </p><p>
	Note that this also uses the parameters
	[-CL:msvsgs:msvsmfg:msvsmeg] (see below).
      </p><p>
	ssaha2 must be called like this: "<code class="literal">ssaha2
	  &lt;ssaha2options&gt; vector.fasta sequences.fasta</code>"
	to generate an output that can be parsed by MIRA. In the above
	example, replace <code class="filename">vector.fasta</code> by the name
	of the file with your vector sequences and
	<code class="filename">sequences.fasta</code> by the name of the file
	containing your sequencing data.
      </p><p>
	smalt must be called like this: "<code class="literal">smalt map -f ssaha
	  &lt;ssaha2options&gt; hash_index sequences.fasta</code>"
      </p><p>
	This makes you basically independent from any other commercial or
	license-requiring vector screening software. For Sanger reads, a
	combination of <span class="command"><strong>lucy</strong></span> and
	<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> together with
	this parameter should do the trick. For reads coming from 454
	pyro-sequencing, <span class="command"><strong>ssaha2</strong></span> or
	<span class="command"><strong>smalt</strong></span> and this parameter will also work very
	well. See the usage manual for a walkthrough example on how to use
	SSAHA2 / SMALT screening data.
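      </p><p>
	The two calls described above can also be put together in a script. The
	following is a minimal Python sketch and not part of MIRA: it only builds and
	prints the command lines, using the SSAHA2 options and output format
	switches quoted in the notes of this section. The helper names
	<code class="literal">ssaha2_command</code> and <code class="literal">smalt_command</code>
	as well as the file names are placeholders chosen for illustration.
      </p><pre class="screen">
```python
import shlex

# SSAHA2 options quoted in this section; adjust them to your data.
SSAHA2_OPTIONS = [
    "-kmer", "8", "-skip", "1", "-seeds", "1",
    "-score", "12", "-cmatch", "9", "-ckmer", "6",
]


def ssaha2_command(vector_fasta, reads_fasta, options=SSAHA2_OPTIONS):
    """Build the ssaha2 call; -output ssaha2 requests the native format."""
    return ["ssaha2", "-output", "ssaha2", *options, vector_fasta, reads_fasta]


def smalt_command(hash_index, reads_fasta, options=()):
    """Build the smalt call; -f ssaha is the format MIRA can parse."""
    return ["smalt", "map", "-f", "ssaha", *options, hash_index, reads_fasta]


if __name__ == "__main__":
    # Print the command lines instead of running them.
    print(shlex.join(ssaha2_command("vector.fasta", "sequences.fasta")))
    print(shlex.join(smalt_command("hash_index", "sequences.fasta")))
```
</pre><p>
	The returned lists could be handed to <code class="literal">subprocess.run</code>,
	assuming both programs write their results to standard output, which is
	then redirected to the result file that MIRA is to load.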
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	The output format of SSAHA2 must be the native output format
	(<code class="literal">-output ssaha2</code>). For SMALT, the output
	option <code class="literal">-f ssaha</code> must be used. Other formats cannot
	be parsed by MIRA.
      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	I currently use the following SSAHA2 options:
	<code class="literal">-kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer
	  6</code></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	Anyone contributing SMALT parameters?
      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	The sequence vector clippings generated from SSAHA2 /
	SMALT data do not replace sequence vector clippings loaded via
	the EXP, CAF or XML files; rather, they extend them.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_xml_traceinfo"></a>3.6.6. 
	XML TRACEINFO ancillary data
      </h3></div></div></div><p>
	MIRA extracts the following data from the TRACEINFO files:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    trace_name (required)
	  </p></li><li class="listitem"><p>
	    trace_file (recommended)
	  </p></li><li class="listitem"><p>
	    trace_type_code (recommended)
	  </p></li><li class="listitem"><p>
	    trace_end (recommended)
	  </p></li><li class="listitem"><p>
	    clip_quality_left (recommended)
	  </p></li><li class="listitem"><p>
	    clip_quality_right (recommended)
	  </p></li><li class="listitem"><p>
	    clip_vector_left (recommended)
	  </p></li><li class="listitem"><p>
	    clip_vector_right (recommended)
	  </p></li><li class="listitem"><p>
	    strain (recommended)
	  </p></li><li class="listitem"><p>
	    template_id (recommended for paired end)
	  </p></li><li class="listitem"><p>
	    insert_size (recommended for paired end)
	  </p></li><li class="listitem"><p>
	    insert_stdev (recommended for paired end)
	  </p></li><li class="listitem"><p>
	    machine_type (optional)
	  </p></li><li class="listitem"><p>
	    program_id (optional)
	  </p></li></ul></div><p>
      </p><p>
	Other data types are also read, but the info is not used.
      </p><p>
	Here's an example of a TRACEINFO file with ancillary info:
      </p><pre class="screen">
&lt;?xml version="1.0"?&gt;
&lt;trace_volume&gt;
&lt;trace&gt;
  &lt;trace_name&gt;GCJAA15TF&lt;/trace_name&gt;
  &lt;program_id&gt;PHRED (0.990722.G) AND TTUNER (1.1)&lt;/program_id&gt;
  &lt;template_id&gt;GCJAA15&lt;/template_id&gt;
  &lt;trace_direction&gt;FORWARD&lt;/trace_direction&gt;
  &lt;trace_end&gt;F&lt;/trace_end&gt;
  &lt;clip_quality_left&gt;3&lt;/clip_quality_left&gt;
  &lt;clip_quality_right&gt;622&lt;/clip_quality_right&gt;
  &lt;clip_vector_left&gt;1&lt;/clip_vector_left&gt;
  &lt;clip_vector_right&gt;944&lt;/clip_vector_right&gt;
  &lt;insert_stdev&gt;600&lt;/insert_stdev&gt;
  &lt;insert_size&gt;2000&lt;/insert_size&gt;
&lt;/trace&gt;
&lt;trace&gt;
  ...
&lt;/trace&gt;
...
&lt;/trace_volume&gt;</pre><p>
	See
	<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&amp;f=rfc&amp;m=main&amp;s=rfc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&amp;f=rfc&amp;m=main&amp;s=rfc</a>
	for a full description of all fields and more info on the TRACEINFO XML format.
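      </p><p>
	For projects where the ancillary data lives in a spreadsheet or database, a
	TRACEINFO file can be generated with a few lines of code. The following
	Python sketch uses only the standard library; the function names
	<code class="literal">build_traceinfo</code> and <code class="literal">trace_names</code>
	are made up for this illustration, and the field values are the ones from
	the example above. It makes no attempt to validate field names against the
	NCBI specification.
      </p><pre class="screen">
```python
import xml.etree.ElementTree as ET


def build_traceinfo(reads):
    """Return a trace_volume element with one trace element per read."""
    volume = ET.Element("trace_volume")
    for fields in reads:
        trace = ET.SubElement(volume, "trace")
        for name, value in fields.items():
            # Every field becomes a child element of the trace element.
            ET.SubElement(trace, name).text = str(value)
    return volume


def trace_names(volume):
    """Read back the required trace_name field of every trace element."""
    return [trace.findtext("trace_name") for trace in volume.iter("trace")]


if __name__ == "__main__":
    reads = [
        {
            "trace_name": "GCJAA15TF",
            "template_id": "GCJAA15",
            "trace_end": "F",
            "clip_quality_left": 3,
            "clip_quality_right": 622,
            "insert_size": 2000,
            "insert_stdev": 600,
        },
    ]
    volume = build_traceinfo(reads)
    print(ET.tostring(volume, encoding="unicode"))
    print(trace_names(volume))
```
</pre><p>
	Reading the names back with <code class="literal">trace_names</code> is a cheap
	sanity check that every trace carries the one field MIRA requires.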
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_contig_naming"></a>3.6.7. 
	Contig naming
      </h3></div></div></div><p>
	MIRA names contigs the following
	way: <span class="emphasis"><em>&lt;projectname&gt;_&lt;contigtype&gt;&lt;number&gt;</em></span>. While <span class="emphasis"><em>&lt;projectname&gt;</em></span>
	is dictated by the  [--project=] parameter
	and <span class="emphasis"><em>&lt;number&gt;</em></span> should be clear,
	the <span class="emphasis"><em>&lt;contigtype&gt;</em></span> might need additional
	explaining. The following contig types currently exist:
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    _c: these are "normal" contigs
	  </p></li><li class="listitem"><p>
	    _rep_c: only for genome assembly mode. These are contigs
	    containing only repetitive areas. These contigs
	    had <span class="emphasis"><em>_lrc</em></span> as type in previous versions of MIRA;
	    this was changed to <span class="emphasis"><em>_rep_c</em></span> to make things
	    clearer.
	  </p></li><li class="listitem"><p>
	    _s: these are singlet-contigs. Technically: "contigs" with a
	    single read.
	  </p></li><li class="listitem"><p>
	    _dn: this is an additional contig type which can occur when MIRA
	    ran a digital normalisation step during the assembly. Contigs
	    which contain reads completely covered by a DGNr tag will get an
	    additional "_dn" as part of their name to show that they contain
	    read representatives for digital normalisation. E.g.:
	    "contig_dn_c1".
	  </p><p>
	    Reads covered only partly by the DGNr tag do not trigger the _dn
	    naming.
	  </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note: Important side note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Important side note</th></tr><tr><td align="left" valign="top"> Due to the digital
	    normalisation step, the coverage numbers in the info file
	    regarding contig statistics will not represent the number of
	    reads in the contig, but they will show an approximation of
	    the true coverage or expression value as if there had not been
	    a digital normalisation step performed. The approximation may
	    be around 10 to 20% below the true value.
	  </td></tr></table></div></li></ol></div><p>
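	Scripts which post-process MIRA results often need to take such names
	apart again. The following Python sketch shows one way to do it. The
	regular expression is an assumption derived from the naming scheme and the
	"contig_dn_c1" example above, not something taken from the MIRA sources,
	and the function name <code class="literal">parse_contig_name</code> is
	hypothetical. Project names which themselves end in something like
	"_rep" cannot be told apart reliably this way.
      </p><pre class="screen">
```python
import re

# Assumed layout: project name, optional "_dn", contig type, running number.
CONTIG_NAME = re.compile(r"^(.+?)(_dn)?_(rep_c|c|s)(\d+)$")

CONTIG_TYPES = {
    "c": "normal contig",
    "rep_c": "repetitive contig",
    "s": "singlet",
}


def parse_contig_name(name):
    """Split a contig name into project, type, number and the _dn flag."""
    match = CONTIG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised contig name: {name!r}")
    project, dn, ctype, number = match.groups()
    return {
        "project": project,
        "type": CONTIG_TYPES[ctype],
        "number": int(number),
        "digitally_normalised": dn is not None,
    }


if __name__ == "__main__":
    for name in ("ecoli_c1", "ecoli_rep_c42", "ecoli_s7", "contig_dn_c1"):
        print(name, parse_contig_name(name))
```
</pre><p>
	Whether "_dn" can also be combined with the "_rep_c" and "_s" types is
	not stated above; the pattern simply allows it.
      </p><p>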
	Basically, for genome assemblies MIRA starts to build contigs in areas
	which seem "rock solid", i.e., not a repetitive region (main decision
	point) and nice coverage of good reads. Contigs which started like
	this get a <span class="emphasis"><em>_c</em></span> name. If during the assembly MIRA
	reaches a point where it cannot start building a contig in a
	non-repetitive region, it will name the contig
	<span class="emphasis"><em>_rep_c</em></span> instead of <span class="emphasis"><em>_c</em></span>. This
	is why "_rep_c" contigs occur late in a genome assembly.
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	MIRA has a different understanding of "rock solid" when in EST/RNASeq
	assembly: here, MIRA will try to reconstruct a full length gene
	sequence, starting with the most abundant genes.
      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	Depending on the settings of [-AS:mrpc], your project may or
	may not contain <span class="emphasis"><em>_s</em></span> singlet-contigs. Also note
	that reads landing in the debris file will not get assigned to
	singlet-contigs and hence not get <span class="emphasis"><em>_s</em></span> names.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_recovering_strain_specific_consensus"></a>3.6.8. 
	Recovering strain specific consensus as FASTA
      </h3></div></div></div><p>
	If you used strain information in an assembly, you can
	recover the consensus for any given strain
	by using <span class="command"><strong>miraconvert</strong></span> to convert from a
	full assembly format (e.g. MAF or CAF) which also carries
	strain information to FASTA. MIRA will automatically detect
	the strain information and create one FASTA file per strain
	encountered.
      </p><p>
	It will also create a blend of all strains encountered and
	conveniently add "AllStrains" to the name of these files. Note that
	this blend may or may not be something you need, but in some
	cases I found it to be useful.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_tags_used_in_the_assembly_by_mira_and_edit"></a>3.7. 
      Tags used in the assembly by MIRA and EdIt
    </h2></div></div></div><p>
      MIRA uses and sets a couple of tags during the assembly process. That
      is, if information is known before the assembly, it can be stored in tags (in
      the EXP and CAF formats) and will be used in the assembly.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_tags_read_and_used"></a>3.7.1. 
	Tags read (and used)
      </h3></div></div></div><p>
	This section lists "foreign" tags, i.e., tags whose definition was made
	by software packages other than MIRA.
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    ALUS, REPT: Sequence stretches tagged as ALUS (ALU Sequence) or REPT
	    (general repetitive sequence) will be handled with extreme care during the
	    assembly process. The allowed error rate after automatic contig editing
	    within these stretches is normally far below the general allowed error rate,
	    leading to much higher stringency during the assembly process and
	    subsequently to a better repeat resolving in many cases.
	  </p></li><li class="listitem"><p>
	    Fpas: GenBank feature for a poly-A sequence. Used in EST, cDNA or
	    transcript assembly. Either read in the input files or set when using
	    [-CL:cpat]. This allows the poly-A sequence to be kept in
	    the reads during assembly without it interfering as a massive
	    repeat or as mismatches.
	  </p></li><li class="listitem"><p>
	    FCDS, Fgen: GenBank features as described in GBF/GBK files or set in the
	    Staden package are used to make some SNP impact analysis on genes.
	  </p></li><li class="listitem"><p>
	    other: all other tags in reads will be read and passed through the
	    assembly without being changed and they currently do not influence the
	    assembly process.
	  </p></li></ul></div><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_tags_set_and_used"></a>3.7.2. 
	Tags set (and used)
      </h3></div></div></div><p>
	This section lists tags which MIRA sets (and reads of course), but that other
	software packages might not know about.
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    UNSr, UNSc: <span class="bold"><strong>UNS</strong></span>ure
	    in <span class="bold"><strong>R</strong></span>ead
	    respectively <span class="bold"><strong>C</strong></span>ontig.  These tags
	    denote positions in an assembly with conflicts that could not be
	    resolved automatically by MIRA. These positions should be looked
	    at during the finishing process.
	  </p><p>
	    For assemblies using good sequences and enough coverage, something
	    like 0.01% of the consensus positions have such a tag (e.g. ~300 UNSc
	    tags for a genome of 3 megabases).
	  </p></li><li class="listitem"><p>
	    SRMr, WRMc: <span class="bold"><strong>S</strong></span>trong <span class="bold"><strong>R</strong></span>epeat <span class="bold"><strong>M</strong></span>arker and
	    <span class="bold"><strong>W</strong></span>eak <span class="bold"><strong>R</strong></span>epeat <span class="bold"><strong>M</strong></span>arker. These
	    tags are set in two flavours: as
	    SRM<span class="bold"><strong>r</strong></span> and
	    WRM<span class="bold"><strong>r</strong></span> when set in reads, and as
	    SRM<span class="bold"><strong>c</strong></span> and
	    WRM<span class="bold"><strong>c</strong></span> when set in the
	    consensus. These tags are used on an individual per base basis for
	    each read. They denote bases that have been identified as crucial
	    for resolving repeats, often denoting a single SNP within several
	    hundreds or thousands of bases. While a SRM is quite certain, the
	    WRM really is either weak (there wasn't enough comforting
	    information in the vicinity to be really sure) or involves gap
	    columns (which is always a bit tricky).
	  </p><p>
	    MIRA will automatically set these tags when it encounters repeats
	    and will tag exactly those bases that can be used to discern the
	    differences.
	  </p><p>
	    Seeing such a tag in the consensus means that MIRA was not able to
	    finish the disentanglement of that special repeat stretch or that
	    it found a new one in one of the last passes without having the
	    opportunity to resolve the problem.
	  </p></li><li class="listitem"><p>
	    DGPc: <span class="bold"><strong>D</strong></span>ubious <span class="bold"><strong>G</strong></span>ap <span class="bold"><strong>P</strong></span>osition in
	    <span class="bold"><strong>C</strong></span>onsensus. Set whenever the gap to base ratio in a column of 454
	    reads is between 40% and 60%.
	  </p></li><li class="listitem"><p>
	    SAO, SRO, SIO: <span class="bold"><strong>S</strong></span>NP intr<span class="bold"><strong>A</strong></span> <span class="bold"><strong>O</strong></span>rganism,
	    <span class="bold"><strong>S</strong></span>NP inte<span class="bold"><strong>R</strong></span> <span class="bold"><strong>O</strong></span>rganism, <span class="bold"><strong>S</strong></span>NP <span class="bold"><strong>I</strong></span>ntra
	    and inter <span class="bold"><strong>O</strong></span>rganism. As for SRM
	    and WRM, these tags have an <span class="bold"><strong>r</strong></span>
	    appended when set in reads and
	    a <span class="bold"><strong>c</strong></span> appended when set in the
	    consensus. These tags denote SNP positions.
	  </p><p>
	    MIRA will automatically set these tags when it encounters SNPs and
	    will tag exactly those bases that can be used to discern the
	    differences.  They denote SNPs as they occur within an organism
	    (SAO), between two or more organisms (SRO) or within and between
	    organisms (SIO).
	  </p><p>
	    Seeing such a tag in the consensus means that MIRA set this as a
	    valid SNP in the assembly pass. Seeing such tags only in reads (but not in
	    the consensus) shows that in a previous pass, MIRA thought these
	    bases to be SNPs but that in later passes, this SNP does not appear anymore
	    (perhaps due to resolved misassemblies).
	  </p></li><li class="listitem"><p>
	    STMS: (only hybrid assemblies). The <span class="bold"><strong>S</strong></span>equencing <span class="bold"><strong>T</strong></span>ype
	    <span class="bold"><strong>M</strong></span>ismatch <span class="bold"><strong>S</strong></span>olved
	    is tagged to positions in the assembly where the consensus of reads
	    from different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, PacBio, SOLiD)
	    differs, but MIRA thinks it found the correct
	    solution. Often this is due to low coverage of one of the types
	    and an additional base calling error.
	  </p><p>
	    Sometimes this depicts real differences where possible explanations
	    might include: slightly different bugs were sequenced or a
	    mutation occurred during library preparation.
	  </p></li><li class="listitem"><p>
	    STMU: (only hybrid assemblies). The <span class="bold"><strong>S</strong></span>equencing <span class="bold"><strong>T</strong></span>ype
	    <span class="bold"><strong>M</strong></span>ismatch <span class="bold"><strong>U</strong></span>nresolved
	    is tagged to positions in the assembly where the consensus of reads
	    from different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, SOLiD)
	    differs, but MIRA could not find a good resolution. Often this
	    is due to low coverage of one of the types and an additional base
	    calling error.
	  </p><p>
	    Sometimes this depicts real differences where possible explanations
	    might include: slightly different bugs were sequenced or a mutation
	    occurred during library preparation.
	  </p></li><li class="listitem"><p>
	    MCVc: The <span class="bold"><strong>M</strong></span>issing <span class="bold"><strong>C</strong></span>o<span class="bold"><strong>V</strong></span>erage in <span class="bold"><strong>C</strong></span>onsensus.
	    Set in assemblies with more than one strain. If a strain has no coverage at
	    a certain position, the consensus gets tagged with this tag (and the name of
	    the strain which misses this position is put in the comment). Additionally,
	    the sequence in the result files for this strain will have an @ character.
	  </p></li><li class="listitem"><p>
	    MNRr: (only with [-KS:mnr] active). The <span class="bold"><strong>M</strong></span>asked
	    <span class="bold"><strong>N</strong></span>asty <span class="bold"><strong>R</strong></span>epeat tags are set over those parts of a read that
	    have been detected as being present many more times than the average
	    sub-sequence. MIRA will hide these parts during the initial
	    all-against-all overlap finding routine (SKIM3) but will otherwise happily
	    use these sequences for consensus generation during contig building.
	  </p></li><li class="listitem"><p>
	    FpAS: See "Tags read (and used)" above.
	  </p></li><li class="listitem"><p>
	    ED_C, ED_I, ED_D: EDit Change, EDit Insertion, EDit Deletion. These
	    tags are set by the integrated automatic editor EdIt and show which edit
	    actions have been performed.
	  </p></li><li class="listitem"><p>
	    HAF2, HAF3, HAF4, HAF5, HAF6, HAF7. These
	    are <span class="bold"><strong>HA</strong></span>sh <span class="bold"><strong>F</strong></span>requency
	    tags which show the status of read parts in comparison to the
	    whole project. Only set if  [-AS:ard] is active (default
	    for genome assemblies).
	  </p><p>
	    More info on how to use the information conveyed by HAF tags can be
	    found in the section dealing with repeats and HAF tags in finishing
	    programs further down in this manual.
	  </p><p>
	    HAF2 coverage below average ( standard setting at &lt; 0.5 times average)
	  </p><p>
	    HAF3 coverage is at average ( standard setting at &#8805; 0.5 times average and &#8804; 1.5 times average)
	  </p><p>
	    HAF4 coverage above average ( standard setting at &gt; 1.5 times average and &lt; 2 times average)
	  </p><p>
	    HAF5 probably repeat ( standard setting at &#8805; 2 times average and &lt; 5 times average)
	  </p><p>
	    HAF6 'heavy' repeat ( standard setting at &gt; 8 times average)
	  </p><p>
	    HAF7 'crazy' repeat ( standard setting at &gt; 20 times average)
	  </p></li></ul></div><p>
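	The HAF levels listed above can be sketched as a simple threshold
	function. The following Python snippet is an illustration only, not
	MIRA code. Two assumptions are made and marked in the comments: the
	listed standard settings leave the range between 5 and 8 times average
	unassigned, which is given to HAF5 here, and the overlap of HAF6 and HAF7
	is resolved by testing the highest level first. The function name
	<code class="literal">haf_tag</code> is hypothetical.
      </p><pre class="screen">
```python
def haf_tag(frequency, average):
    """Map a hash frequency to a HAF tag, relative to the project average."""
    ratio = frequency / average
    if ratio > 20:
        return "HAF7"  # 'crazy' repeat
    if ratio > 8:
        return "HAF6"  # 'heavy' repeat (ASSUMPTION: up to 20 times average)
    if ratio >= 2:
        return "HAF5"  # probably repeat (ASSUMPTION: extends up to 8 times)
    if ratio > 1.5:
        return "HAF4"  # above average
    if ratio >= 0.5:
        return "HAF3"  # at average
    return "HAF2"      # below average


if __name__ == "__main__":
    # Hypothetical hash counts for a project whose average frequency is 30.
    for frequency in (10, 30, 52, 90, 300, 1000):
        print(frequency, haf_tag(frequency, average=30))
```
</pre><p>
	The real boundaries depend on the parameter settings of the assembly, so
	the constants should be treated as placeholders for the standard
	settings quoted above.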
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_contigs_singlets_debris"></a>3.8. 
      Where reads end up: contigs, singlets, debris
    </h2></div></div></div><p>
      At the start, things are simple: a read either aligns with other reads or it does not. Reads which
      align with other reads form contigs, and these MIRA will save in the results with a contig name
      of <span class="emphasis"><em>_c</em></span>.
    </p><p>
      However, not all reads can be placed in an assembly. This can have several reasons and
      these reads may end up at two different places in the result files: either in the
      <span class="emphasis"><em>debris</em></span> file, then just as a name entry, or as a singlet (a "contig"
      with just one read) in the regular results.
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  reads are too short and get filtered out (before or after the MIRA
	  clipping stages). These invariably land in the debris file.
	</p></li><li class="listitem"><p>
	  reads are real singlets: they contain genuine sequence but have no
	  overlap with any other read. These either get caught by the
	  [-CL:pec] clipping filter or during the SKIM phase.
	</p></li><li class="listitem"><p>
	  reads contain mostly or completely junk.
	</p></li><li class="listitem"><p>
	  reads contain chimeric sequence (therefore: they're also junk)
	</p></li></ol></div><p>
      MIRA filters out these reads in different stages: before and after read
      clipping, during the SKIM stage, during the Smith-Waterman overlap
      checking stage or during contig building. The exact place where these
      single reads land is dependent on why they do not align with other
      reads. Reads landing in the debris file will have the reason and stage
      attached to the decision.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_snp_discovery"></a>3.9. 
      Detection of bases distinguishing non-perfect repeats and SNP discovery
    </h2></div></div></div><p>
      MIRA is able to find and tag SNPs in any kind of data -- be it genomic
      or EST -- in both de-novo and mapping assemblies ... provided it knows
      which read in an assembly is coming from which strain, cell line or
      organism.
    </p><p>
      The SNP detection routines are based on the same routines as the
      routines for detecting non-perfect repeats. In fact, MIRA can even
      distinguish between bases marking a misassembled repeat from bases
      marking a SNP within the same project.
    </p><p>
      All you need to do to enable this feature is to set
      [-CO:mr=yes] (which is standard in all
      <code class="literal">--job=...</code> incantations of <span class="command"><strong>mira</strong></span> and
      in some steps of <span class="command"><strong>miraSearchESTSNPs</strong></span>). Furthermore, you
      will need to provide <span class="emphasis"><em>strain information</em></span>, either in
      the manifest file or in ancillary NCBI TRACEINFO XML files.
    </p><p>
      The effect of using strain names attached to reads can be described
      briefly like this. Assume that you have 6 reads (called R1 to R6), three
      of them having an <code class="literal">A</code> at a given position, the other
      three a <code class="literal">C</code>.
    </p><pre class="screen">
R1   ......A......
R2   ......A......
R3   ......A......
R4   ......C......
R5   ......C......
R6   ......C......</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      This example is just that: an example. It uses just 6 reads, with two
      times three reads as read groups for demonstration purposes and without
      looking at qualities. For MIRA to recognise SNPs, a few things must come
      together (e.g. for many sequencing technologies it wants forward and
      backward reads when in de-novo assembly) and a couple of parameters can
      be set to adjust the sensitivity. Read more about the parameters:
      [-CO:mrpg:mnq:mgqrt:emea:amgb:amgbemc:amgbnbs]</td></tr></table></div><p>
      Now, assume you did not give any strain information. MIRA will most
      probably recognise a problem and, having no strain information, assume
      it made an error by assembling two different repeats of the same
      organism. It will tag the bases in the reads with repeat marker tags
      (SRMr) and the base in the consensus with a SRMc tag (to point at an
      unresolved problem). In a subsequent pass, MIRA will then not assemble
      these six reads together again, but create two contigs like this:
    </p><pre class="screen">
Contig1:
R1   ......A......
R2   ......A......
R3   ......A......

Contig2:
R4   ......C......
R5   ......C......
R6   ......C......</pre><p>
      The bases in the reads will keep their SRMr tags, but the consensus
      base of each contig will not get an SRMc tag as there is no conflict anymore.
    </p><p>
      Now, assume you gave reads R1, R2 and R3 the strain information "human",
      and read R4, R5 and R6 "chimpanzee". MIRA will then create this:
    </p><pre class="screen">
R1 (hum)  ......<span class="bold"><strong>A</strong></span>......
R2 (hum)  ......<span class="bold"><strong>A</strong></span>......
R3 (hum)  ......<span class="bold"><strong>A</strong></span>......
R4 (chi)  ......<span class="bold"><strong>C</strong></span>......
R5 (chi)  ......<span class="bold"><strong>C</strong></span>......
R6 (chi)  ......<span class="bold"><strong>C</strong></span>......</pre><p>
      Instead of creating two contigs, it will create again one contig ... but
      it will tag the bases in the reads with a SROr tag and the position in
      the contig with a SROc tag. The SRO tags (<span class="bold"><strong>S</strong></span>NP inte<span class="bold"><strong>R</strong></span>
      <span class="bold"><strong>O</strong></span>rganisms) tell you: there's a SNP
      between those two (or multiple) strains/organisms/whatever.
    </p><p>
      Changing the above example a little, assume you have this assembly early
      on during the MIRA process:
    </p><pre class="screen">
R1 (hum)  ......A......
R2 (hum)  ......A......
R3 (hum)  ......A......
R4 (chi)  ......A......
R5 (chi)  ......A......
R6 (chi)  ......A......
R7 (chi)  ......C......
R8 (chi)  ......C......
R9 (chi)  ......C......</pre><p>
      Because "chimp" has a SNP within itself (<code class="literal">A</code> versus
      <code class="literal">C</code>) and there's a SNP between "human" and "chimp"
      (also <code class="literal">A</code> versus <code class="literal">C</code>), MIRA will see a
      problem and set a tag, this time a SIOr tag: <span class="bold"><strong>S</strong></span>NP <span class="bold"><strong>I</strong></span>ntra- and
      inter <span class="bold"><strong>O</strong></span>rganism.
      </p><p>
	MIRA does not like conflicts occurring within an organism and will try
	to resolve these cleanly. After setting the SIOr tags, MIRA will
	re-assemble the reads in subsequent passes like this:
    </p><pre class="screen">
Contig1:
R1 (hum)  ......<span class="bold"><strong>A</strong></span>......
R2 (hum)  ......<span class="bold"><strong>A</strong></span>......
R3 (hum)  ......<span class="bold"><strong>A</strong></span>......
R4 (chi)  ......<span class="bold"><strong>A</strong></span>......
R5 (chi)  ......<span class="bold"><strong>A</strong></span>......
R6 (chi)  ......<span class="bold"><strong>A</strong></span>......

Contig2:
R7 (chi)  ......<span class="bold"><strong>C</strong></span>......
R8 (chi)  ......<span class="bold"><strong>C</strong></span>......
R9 (chi)  ......<span class="bold"><strong>C</strong></span>......</pre><p>
      The reads in Contig1 (hum+chi) and Contig2 (chi) will keep their SIOr
      tags, the consensus will have no SIOc tag as the "problem" was
      resolved.
    </p><p>
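      The tagging decisions in the examples above can be mimicked by a toy
      rule which looks at one alignment column at a time. The following Python
      sketch is only meant to make the distinction between the tag families
      tangible and is not MIRA's algorithm: it ignores base qualities, read
      direction and all [-CO:...] thresholds. The SAO branch and the handling
      of reads without strain information are assumptions made for
      illustration, and the function name <code class="literal">column_tag</code>
      is hypothetical.
    </p><pre class="screen">
```python
from collections import defaultdict


def column_tag(column):
    """Suggest a tag family for one alignment column.

    column is a list of (strain, base) pairs; strain is None when no
    strain information was given for the read.
    """
    bases_by_strain = defaultdict(set)
    for strain, base in column:
        bases_by_strain[strain].add(base)

    all_bases = set().union(*bases_by_strain.values())
    if len(all_bases) in (0, 1):
        return None  # no discrepancy in this column
    if None in bases_by_strain:
        return "SRM"  # ASSUMPTION: any read without strain means repeat marker
    conflict_within = any(len(b) > 1 for b in bases_by_strain.values())
    if not conflict_within:
        return "SRO"  # strains are uniform but differ from each other
    if len(bases_by_strain) > 1:
        return "SIO"  # conflict within a strain, other strains present
    return "SAO"      # ASSUMPTION: conflict within the only strain present


if __name__ == "__main__":
    no_strain = [(None, "A")] * 3 + [(None, "C")] * 3
    two_strains = [("human", "A")] * 3 + [("chimpanzee", "C")] * 3
    mixed = two_strains[:3] + [("chimpanzee", "A")] * 3 + two_strains[3:]
    for column in (no_strain, two_strains, mixed):
        print(column_tag(column))
```
</pre><p>
      The three columns in the usage part correspond to the example without
      strain information, the human/chimpanzee example and the nine-read
      example, in that order.
    </p><p>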
      When presented with conflicting information regarding SNPs and possible
      repeat markers or SNPs within an organism, MIRA will always first try to
      resolve the repeat markers. Assume the following situation:
    </p><pre class="screen">
R1 (hum)  ......A...T......
R2 (hum)  ......A...G......
R3 (hum)  ......A...T......
R4 (chi)  ......C...G......
R5 (chi)  ......C...T......
R6 (chi)  ......C...G......</pre><p>
      While the first discrepancy column can be "explained away" by a SNP
      between organisms (it will get a SROr/SROc tag), the second column
      cannot and will get a SIOr/SIOc tag. After that, MIRA opts to get the
      SIO conflict resolved:
    </p><pre class="screen">
Contig1:
R1 (hum)  ......A...T......
R3 (hum)  ......A...T......
R5 (chi)  ......C...T......

Contig2:
R2 (hum)  ......A...G......
R4 (chi)  ......C...G......
R6 (chi)  ......C...G......</pre></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_data_reduction"></a>3.10. 
      Data reduction: subsampling vs. lossless digital normalisation
    </h2></div></div></div><p>
      Some data sets have way too much data. Sometimes it is simply more than
      needed: performing a de-novo genome assembly with enough reads
      for 300x coverage is like taking a sledgehammer to crack a
      nut. Sometimes it is even more than is good for an assembly (see also:
      motif dependent sequencing errors).
    </p><p>
      MIRA being an overlap-based assembler, reducing a data set helps to keep
      time and memory requirements low. There are basically two ways to
      perform this: reduction by subsampling and reduction by digital
      normalisation. Both methods have their pros and cons and can be used
      effectively in different scenarios.
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  <span class="emphasis"><em>Subsampling</em></span> is a process to create a smaller,
	  hopefully representative set from a larger data set.
	</p><p>
	  In sequencing, various ways exist to perform subsampling. As
	  sequencing data sets from current sequencing technologies can be
	  seen as essentially randomised when coming fresh from the machine,
	  the selection step can be as easy as selecting the
	  first <span class="emphasis"><em>n</em></span> reads. When the input data set is not
	  random (e.g. in SAM/BAM files with mapped data), one must resort to
	  random selection of reads.
	</p><p>
	  Subsampling must be done by the user prior to assembly with MIRA.
	</p><p>
	  On the upside, subsampling preserves the exact copy number structure
	  of the input data set: a repeat with n copies in a genome will
	  always be represented by reads forming n copies of the repeat in the
	  reduced data set. Furthermore, subsampling is comparatively
	  insensitive to motif dependent sequencing errors. On the downside,
	  subsampling is more likely to lose rare events of the data set
	  (e.g., rare SNPs of a cell population or rare transcripts in
	  EST/RNASeq). Also, in EST/RNASeq projects, subsampling will not be
	  able to reduce extraordinary coverage events to a level at which
	  the assembly is not painfully slow. Examples of the latter are rRNA
	  genes or highly expressed house-keeping genes, where today's Illumina
	  data sets sometimes contain enough data to reach coverage numbers
	  &#8805; 100,000x or even a million x.
	</p><p>
	  Subsampling should therefore be used for single genome de-novo
	  assemblies; or for EST/RNASeq assemblies which need reliable
	  coverage numbers for transcript expression data but where at least
	  all rDNA has been filtered out prior to assembly.
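	</p><p>
	  The random selection mentioned above can be sketched as follows. This
	  is only an illustration, assuming the reads are available as an
	  iterable of records; the function name
	  <code class="literal">subsample_reads</code> and the fixed seed are
	  hypothetical and not part of MIRA.
	</p><pre class="programlisting">
```python
import random


def subsample_reads(reads, n, seed=42):
    """Return n reads chosen uniformly at random from an iterable of reads.

    Uses reservoir sampling, so the input is streamed once and only the
    n selected reads are ever held in memory.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    reservoir = []
    for index, read in enumerate(reads):
        if n > index:
            # Fill the reservoir with the first n reads.
            reservoir.append(read)
        else:
            # Keep this read with probability n / (index + 1).
            slot = rng.randint(0, index)
            if n > slot:
                reservoir[slot] = read
    return reservoir
```
</pre><p>
	  For paired reads it is probably wise to pass each pair as one item,
	  so that both mates are either kept or dropped together.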
	</p></li><li class="listitem"><p>
	  <span class="emphasis"><em>Digital normalisation</em></span> is a process to perform a
	  reduction of sequencing data redundancy. It was made known to a
	  wider audience by the paper <span class="emphasis"><em>"A Reference-Free Algorithm
	  for Computational Normalization of Shotgun Sequencing
	  Data"</em></span> by Brown et al. (see
	  <a class="ulink" href="http://arxiv.org/abs/1203.4802" target="_top">http://arxiv.org/abs/1203.4802</a>).
	</p><p>
	  The normalisation process works by progressively going through the
	  sequencing data and selecting reads which bring new, previously
	  unseen information to the assembly and discarding those which
	  describe nothing new. For single genome assemblies, this has the
	  effect that repeats with n copies in the genome are afterwards
	  present often with just enough reads to reconstruct only a single
	  copy of the repeat. In EST/RNASeq assemblies, this leads to
	  reconstructed transcripts all having more or less the same coverage.
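	</p><p>
	  A rough sketch of this selection idea, in the spirit of the published
	  algorithm and explicitly not MIRA's own implementation: a read is
	  kept only while the median count of its k-mers among the reads kept so far
	  is still below a cutoff. The names
	  <code class="literal">normalise</code>, <code class="literal">k</code>
	  and <code class="literal">cutoff</code> as well as the tiny values
	  used are assumptions for illustration, and an exact counter stands
	  in for the memory-efficient counting structure a real tool would
	  need.
	</p><pre class="programlisting">
```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalise(reads, k=5, cutoff=3):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: nothing to judge it by
        if cutoff > median(counts[kmer] for kmer in read_kmers):
            kept.append(read)          # read still brings new information
            counts.update(read_kmers)  # remember what it contributed
    return kept
```
</pre><p>
	  With these toy settings, ten identical copies of a read are reduced
	  to three. A read differing from a well covered sequence by a single
	  base keeps a high median and is dropped as well.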
	</p><p>
	  The normalisation process as described in the paper allows for a
	  certain lossiness during the data reduction as it was developed to
	  cope with billions of reads. E.g., it will often lose borders in
	  genome reorganisation events or SNP information from ploidies, from
	  closely related gene copies or from closely related species.
	</p><p>
	  MIRA implements a variant of the algorithm: the <span class="emphasis"><em>lossless
	  digital normalisation</em></span>. Here, normalised data has copy
	  numbers reduced like in the original algorithm, but all variants
	  (SNPs, borders of reorganisation events etc.) present in the
	  original data set are retained in the reduced data set. Furthermore,
	  the normalisation is parameterised to take place only for
	  excessively repetitive parts of a data set which would lead to
	  overly increased run-time and memory consumption. This gives the
	  assembler the opportunity to correctly evaluate and work with
	  repeats which do not occur "too often" in a data set while still
	  being able to reconstruct at least one copy of the really nasty
	  repeats.
	</p><p>
	  Digital normalisation should not be done prior to an assembly with
	  MIRA. Instead, use the MIRA parameter which performs a digital
	  normalisation on the complete data set.
	</p><p>
	  The lossless digital normalisation of MIRA should be used for
	  EST/RNASeq assemblies containing highly repetitive data. Metagenome
	  assemblies may also profit from this feature.
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	    MIRA keeps track of the approximate coverage represented by the
	    reads chosen in the digital normalisation process. That is, MIRA is
	    able to give approximate coverage numbers as if digital
	    normalisation had never happened. The approximation may be around 10
	    to 20% below the true value. Contigs affected by this coverage
	    approximation are denoted with an additional "_dn" in their name.
	  </p><p>
	    Due to the digital
	    normalisation step, the coverage numbers in the info file
	    regarding contig statistics will not represent the number of
	    reads in the contig, but they will show an approximation of
	    the true coverage or expression value as if there had not been
	    a digital normalisation step performed.
	</p></td></tr></table></div></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_caveats"></a>3.11. 
      Caveats
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_using_artificial_reads"></a>3.11.1. 
	Using data not from sequencing instruments: artificial / synthetic reads
      </h3></div></div></div><p>
	The default parameters for MIRA assemblies work best when given real
	sequencing data and they even expect the data to behave like real
	sequencing data. But some assembly strategies work in multiple rounds,
	using so called "artificial" or "synthetic" reads in later rounds,
	i.e., data which was not generated through sequencing machines but
	might be something like the consensus of previous assemblies.
      </p><p>
	If one doesn't take great care to make these artificial reads at least
	behave a little bit like real sequencing data, a number of quality
	assurance algorithms of MIRA might spot that they "look funny" and
	trim back these artificial reads ... sometimes even removing them
	completely.
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note: Summary tips for creating artificial reads for MIRA assemblies"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Summary tips for creating artificial reads for MIRA assemblies</th></tr><tr><td align="left" valign="top"><p>
	  The following should lead to the fewest surprises for most
	  assembly use cases when calling MIRA only with the most basic
	  switches <code class="literal">--project=... --job=...</code>:
	</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><span class="bold"><strong>Length:</strong></span> between 50 and 20000 bp
	  </li><li class="listitem"><span class="bold"><strong>Quality values:</strong></span> give your
	    artificial reads quality values. Using <span class="emphasis"><em>30</em></span>
	    as quality value for your bases should be OK for most
	    applications.
	  </li><li class="listitem"><span class="bold"><strong>Orientation:</strong></span> for every read you
	    create, create a read with the same data (bases and quality
	    values) in reverse complement direction.
	  </li></ol></div></td></tr></table></div><p>
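	The three tips can be combined in a few lines. The following is a
	hedged sketch, not a MIRA tool: the function
	<code class="literal">artificial_fastq</code> and the read name
	suffixes <code class="literal">_fwd</code> and
	<code class="literal">_rev</code> are made up for illustration, and
	the output is assumed to be FASTQ with the usual Phred+33 quality
	encoding.
      </p><pre class="programlisting">
```python
# Complement table for plain bases; other IUPAC codes pass through unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def artificial_fastq(name, seq, qual=30, min_len=50, max_len=20000):
    """Return two FASTQ records for seq: forward and reverse complement."""
    if min_len > len(seq) or len(seq) > max_len:
        raise ValueError("read length %d outside %d..%d bp"
                         % (len(seq), min_len, max_len))
    # Every base gets the same quality, so the quality string needs no
    # reversal for the reverse complement read.
    qual_string = chr(qual + 33) * len(seq)
    records = []
    for suffix, bases in (("_fwd", seq), ("_rev", reverse_complement(seq))):
        records.append("@%s%s\n%s\n+\n%s\n" % (name, suffix, bases, qual_string))
    return "".join(records)
```
</pre><p>
	Calling <code class="literal">artificial_fastq("cons1", consensus)</code>
	for each consensus sequence and writing the results to one file gives
	a data set in which every artificial read has a partner in the other
	direction.
      </p><p>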
	The following list gives all the gory details on what synthetic reads
	should look like or which MIRA algorithms to switch off in certain
	cases:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    Forward and reverse complement directions: most sequencing
	    technologies and strategies yield a mixture of reads with both
	    forward and reverse complement direction to the DNA sequenced. In
	    fact, having both directions allows for a much better quality
	    control of an alignment as sequencing technology dependent
	    sequencing errors will often affect only one direction at a given
	    place and not both (the exception being homopolymers and 454).
	  </p><p>
	    The MIRA <span class="emphasis"><em>proposed end clipping</em></span> algorithm
	     [-CL:pec] uses this knowledge to initially trim back
	    ends of reads to an area without sequencing errors. However, if
	    reads covering a given area of DNA are present in only one
	    direction, then these reads will be completely eliminated.
	  </p><p>
	    If you use only artificial reads in an assembly, then switch off
	    the <span class="emphasis"><em>proposed end clipping</em></span>
	     [-CL:pec=no].
	  </p><p>
	    If you mix artificial reads with "normal" reads, make sure that
	    every part of an artificial read is covered by some other read in
	    reverse complement direction (be it a normal or artificial
	    read). The easiest way to do that is to add a reverse complement
	    for every artificial read yourself, though if you use an
	    overlapping strategy with artificial reads, you can calculate the
	    overlaps and reverse complements of reads so that every second
	    artificial read is in reverse complement to save time and memory
	    afterwards during the computation.
	  </p></li><li class="listitem"><p>
	    Sequencing type/technology: MIRA currently knows Sanger, 454, Ion
	    Torrent, Solexa, PacBioHQ/LQ and "Text" as sequencing
	    technologies; every read entered in an assembly must be one of
	    those.
	  </p><p>
	    Artificial reads should be classified depending on the data they
	    were created from, that is, Sanger for consensus of Sanger reads,
	    454 for consensus of 454 reads etc. However, should reads created
	    from Illumina consensus be much longer than, say, 200 or 300
	    bases, you should treat them as Sanger reads.
	  </p></li><li class="listitem"><p>
	    Quality values: be careful to assign decent quality values to your
	    artificial reads as several quality clipping or consensus calling
	    algorithms make extensive use of qualities. Pay attention to
	    values of [-CL:qc:bsqc] as well as to
	     [-CO:mrpg:mnq:mgqrt].
	  </p></li><li class="listitem"><p>
	    Read lengths: the current maximum read length for MIRA is around
	    30 kb. However, as a safety margin, MIRA currently allows
	    only 20 kb reads as maximum length.
	  </p></li></ul></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_ploidy_and_repeats"></a>3.11.2. 
	Ploidy and repeats
      </h3></div></div></div><p>
	MIRA treats ploidy differences as repeats and will therefore build
	separate contigs for the reads of a ploidy that has a difference to
	the other ploidy/ploidies.
      </p><p>
	There is simply no other way to handle ploidy while retaining the
	ability to separate repeats based on differences of only a single
	base. Everything else would be guesswork. I thought for some time
	about doing a coverage analysis around the potential repeat/ploidy
	site, but came to the conclusion that due to the stochastic nature of
	sequencing data, this would very probably take wrong decisions in too
	many cases to be acceptable.
      </p><p>
	If someone has a good idea, I'll be happy to hear it.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_handling_of_repeats"></a>3.11.3. 
	Handling of repeats
      </h3></div></div></div><p>
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_uniform_read_distribution"></a>3.11.3.1. 
	  Uniform read distribution
	</h4></div></div></div><p>
	  Under the assumption that reads in a project are uniformly
	  distributed across the genome, MIRA will enforce an average coverage
	  and temporarily reject reads from a contig when this average
	  coverage multiplied by a safety factor is reached at a given
	  site. This strategy reduces over-compression of repeats during the
	  contig building phase and keeps reads in reserve for other copies of
	  that repeat.
	</p><p>
	  It's generally a very useful tool to disentangle repeats, but has some
	  slight secondary effects: rejection of otherwise perfectly good
	  reads. The assumption of read distribution uniformity is the big
	  problem we have here: of course it's not really valid. You sometimes
	  have less, and sometimes more than "the average"
	  coverage. Furthermore, the new sequencing technologies - 454 perhaps
	  but certainly the ones from Solexa - show that you also have a skew
	  towards the site of replication origin.
	</p><p>
	  Warning: Solexa data from late 2009 and 2010 show a high GC content
	  bias. This bias can reach 200 or 300%, i.e., sequence part for with
	  low GC
	</p><p>
	  One example: let's assume the average coverage of a project is 8 and
	  by chance at one place there are 17 (non-repetitive) reads, then the
	  following happens:
	</p><p>
	  (Note: <span class="emphasis"><em>p</em></span> is the parameter  [-AS:urdsip])
	</p><p>
	  Pass 1 to <span class="emphasis"><em>p-1</em></span>: MIRA happily assembles everything together and calculates a
	  number of different things, amongst them an average coverage of ~8. At the
	  end of pass <span class="emphasis"><em>p-1</em></span>, it will announce this average coverage as first estimate
	  to the assembly process.
	</p><p>
	  Pass <span class="emphasis"><em>p</em></span>: MIRA has still assembled everything together, but at the end of each
	  pass the contig self-checking algorithms now include an "average coverage
	  check". They'll invariably find the 17 reads stacked and decide (looking at
	  the  [-AS:ardct] parameter which is assumed to be 2 for this example)
	  that 17 is larger than 2*8 and that this very well may be a repeat. The reads
	  get flagged as possible repeats.
	</p><p>
	  Pass <span class="emphasis"><em>p+1</em></span> to end: the "possibly repetitive" reads get a much tougher
	  treatment in MIRA. Amongst other things, when building the contig, the contig
	  now checks that "possibly repetitive" reads do not over-stack beyond the average
	  coverage multiplied by a safety value ( [-AS:urdcm]) which we'll
	  assume to be 1.5 in this example. So, at a certain point, say when read 14
	  or 15 of that possible repeat wants to be aligned to the contig at this given
	  place, the contig will just flatly refuse and tell the assembler to please
	  find another place for it, be it in the contig currently being built or any other
	  that will follow. Of course, if the assembler cannot comply, reads 14 to
	  17 will end up as a contiglet (contig debris, if you want) or, if only one
	  read got rejected like this, it will end up as a singlet or in the debris
	  file.
	</p><p>
	  Tough luck. I do have ideas on how to re-integrate those reads at the end of an
	  assembly, but I have deferred doing this as in every case I looked at,
	  adding those reads to the contigs wouldn't have changed anything ... there's
	  already enough coverage.
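	</p><p>
	  The two thresholds used in the passes above boil down to simple
	  arithmetic. The sketch below only restates the numbers of the
	  example; the function names are hypothetical, the arguments
	  <code class="literal">ardct</code> and <code class="literal">urdcm</code>
	  merely borrow the short names of the MIRA parameters, and the
	  actual checks inside MIRA are certainly more involved.
	</p><pre class="programlisting">
```python
def flag_possible_repeat(coverage_at_site, average_coverage, ardct=2.0):
    """Pass p: is the local coverage high enough to suspect a repeat?"""
    return coverage_at_site > ardct * average_coverage


def coverage_cap(average_coverage, urdcm=1.5):
    """Pass p+1 onwards: how far 'possibly repetitive' reads may stack."""
    return urdcm * average_coverage
```
</pre><p>
	  With an average coverage of 8, a stack of 17 reads exceeds 2*8 = 16
	  and gets flagged, and the plain product for the cap is 1.5*8 = 12.
	  The example in the text names read 14 or 15 as the point of
	  refusal, so treat the plain product as a rough guide only.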
	</p><p>
	  What should be done in those cases is to simply filter away the contiglets
	  (defined as being of small size and having an average coverage below the
	  average coverage of the project divided by 3 (or 2.5)) from a project.
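	</p><p>
	  Such a filter could look like the following sketch. It assumes
	  that contig name, length and average coverage have already been
	  extracted (e.g. from the contig statistics) into tuples; the
	  function names and the length limit of 1000 bases for "small" are
	  assumptions, as the text does not give a number.
	</p><pre class="programlisting">
```python
def is_contiglet(length, coverage, project_coverage,
                 max_length=1000, divisor=3.0):
    """True for small contigs whose coverage is far below the project average."""
    return max_length >= length and project_coverage / divisor > coverage


def drop_contiglets(contigs, project_coverage, max_length=1000, divisor=3.0):
    """Return the (name, length, coverage) tuples which are not contiglets."""
    return [contig for contig in contigs
            if not is_contiglet(contig[1], contig[2], project_coverage,
                                max_length, divisor)]
```
</pre><p>
	  Passing <code class="literal">divisor=2.5</code> gives the slightly
	  stricter variant mentioned above.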
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_keeping_'long'_repetitive_contigs_separate"></a>3.11.3.2. 
	  Keeping 'long' repetitive contigs separate
	</h4></div></div></div><p>
	  Since version 2.9.36, MIRA has had a feature to keep long repeats in separate
	  contigs. Due to algorithm changes, this feature is now standard. The
	  effect of this is that contigs with non-repetitive sequence will
	  stop at a 'long repeat' border which cannot be crossed by a single
	  read or by paired reads, including only the first few bases of the
	  repeat. Long repeats will be kept as separate contigs.
	</p><p>
	  This has been implemented to get a clean overview on which parts of
	  an assembly are 'safe' and which parts will be 'difficult'. For
	  this, the naming of the contigs has been extended: contigs named
	  with a '_c' at the end are contigs which contain mostly 'normal'
	  coverage. Contigs with 'rep_c' are contigs which contain mostly
	  sequence classified as repetitive and which could not be assembled
	  together with a '_c' contig.
	</p><p>
	  The question remains: what are 'long' repeats? MIRA defines these as
	  repeats that are not spanned by any read that has non-repetitive
	  parts at the end. Basically, for shotgun assemblies, the mean
	  length of the reads that go into the assembly defines the minimum
	  length of 'long' repeats that have to be kept in separate contigs.
	</p><p>
	  It has to be noted that when using paired-end (or template)
	  sequencing, 'long' repeats which can be spanned by read-pairs (or
	  templates) are frequently integrated into 'normal' contigs as MIRA
	  can correctly place them most of the time.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_helping_finishing_by_tagging_reads_with_haf_tags"></a>3.11.3.3. 
	  Helping finishing by tagging reads with HAF tags
	</h4></div></div></div><p>
	  HAF tags (HAsh Frequency) are set by MIRA when the option to colour reads by
	  kmer frequency ([-GE:crkf], on by default in most --job combinations)
	  is on. These tags show the status of k-mers (stretch of bases of given length
	  <span class="emphasis"><em>k</em></span>) in read sequences: whether MIRA recognised them as being present in
	  sub-average, average, above average or repetitive numbers.
	</p><p>
	  When using a finishing program which can display tags in reads (and using the
	  proposed tag colour schemes for gap4 or consed), the assembly
	  will light up in colours ranging from light green to dark red, indicating
	  whether a certain part of the assembly is deemed non-repetitive to extremely
	  repetitive.
	</p><p>
	  One of the biggest advantages of the HAF tags is the implicit information they
	  convey on why the assembler stopped building a contig at an end.
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      if the read parts composing a contig end are mostly covered with HAF2
	      tags (below average frequency, coloured light-green), then one very probably
	      has a hole in the contig due to coverage problems which means there are no
	      or not enough reads covering a part of the sequence.
	    </p></li><li class="listitem"><p>
	      if the read parts composing a contig end are mostly covered with HAF3
	      tags (average frequency, coloured green), then you have an unusual situation
	      as this should only very rarely occur. The reason is that MIRA saw that
	      there are enough sequences which look the same as the one from your contig
	      end, but that these could not be joined. Likely reasons for this scenario
	      include non-random sequencing artifacts (seen in 454 data) or also
	      non-random chimeric reads (seen in Sanger and 454 data).
	    </p></li><li class="listitem"><p>
	      if the read parts composing a contig end are mostly covered with HAF4
	      tags (above average frequency, coloured yellow), then the assembler stopped
	      at a grey zone where the coverage is not normal anymore, but not quite
	      repetitive yet. This can happen in cases where the read coverage is very
	      unevenly distributed across the project. The contig end in question might be
	      a repeat occurring two times in the sequence, but having fewer reads than
	      expected. Or it may be non-repetitive coverage with an unusual excess of
	      reads.
	    </p></li><li class="listitem"><p>
	      if the read parts composing a contig end are mostly covered with HAF5
	      (repeat, coloured red), HAF6 (heavy repeat, coloured darker red) and HAF7
	      tags (crazy repeat, coloured very dark red), then there is a repetitive area
	      in the sequence which could not be uniquely bridged by the reads present in
	      the assembly.
	    </p></li></ul></div><p>
	</p><p>
	  This information can be especially helpful when joining reads by hand in a
	  finishing program. The following list gives you a short guide to cases which
	  are most likely to occur and what you should do.
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      the proposed join involves contig ends mostly covered by HAF2
	      tags. Joining these contigs is probably a safe bet. The assembly may have
	      missed this join because of too many errors in the read ends or because
	      sequence which could have been useful for joining the contigs was clipped away.
	      Just check whether the join seems sensible, then join.
	    </p></li><li class="listitem"><p>
	      the proposed join involves contig ends mostly covered by HAF3
	      tags. Joining these contigs is probably a safe bet. The assembly may have
	      missed this join because of several similar chimeric reads or reads
	      with similar, severe sequencing errors covering the same spot.
	      Just check whether the join seems sensible, then join.
	    </p></li><li class="listitem"><p>
	      the proposed join involves contig ends mostly covered by HAF4
	      tags. Joining these contigs should be done with some caution, it
	      may be a repeat occurring twice in the sequence.  Check whether
	      the contig ends in question align with ends of several other
	      contigs. If not, joining is probably the way to go. If potential
	      joins exist with several other contigs, then it's a repeat (see
	      below).
	    </p></li><li class="listitem"><p>
	      the proposed join involves contig ends mostly covered by HAF5, HAF6 or
	      HAF7 tags. Joining these contigs should be done with utmost caution, you are
	      almost certainly (HAF5) and very certainly (HAF6 and HAF7) in a repetitive
	      area of your sequence.
	      You will probably need additional information like paired-end or template
	      info in order to join your contigs.
	    </p></li></ul></div><p>
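	  The two lists above can be condensed into a small lookup table, e.g.
	  for a script which summarises the tags found at contig ends. This is
	  merely a sketch: the table layout and the function
	  <code class="literal">join_advice</code> are hypothetical, only the
	  tag names, frequency classes and colours are taken from the
	  description above.
	</p><pre class="programlisting">
```python
SAFE = "probably safe: check that the join is sensible, then join"
CAREFUL = "caution: check for joins with several other contigs first"
REPEAT = "utmost caution: repeat, use paired-end or template info"

# tag: (frequency class, colour in the proposed schemes, advice for a join)
HAF_TAGS = {
    "HAF2": ("below average", "light green", SAFE),
    "HAF3": ("average", "green", SAFE),
    "HAF4": ("above average", "yellow", CAREFUL),
    "HAF5": ("repeat", "red", REPEAT),
    "HAF6": ("heavy repeat", "darker red", REPEAT),
    "HAF7": ("crazy repeat", "very dark red", REPEAT),
}


def join_advice(tag):
    """Return a one-line summary for the HAF tag dominating a contig end."""
    frequency, colour, advice = HAF_TAGS[tag]
    return "%s (%s, %s): %s" % (tag, frequency, colour, advice)
```
</pre><p>
	  Tags outside HAF2 to HAF7 raise a
	  <code class="literal">KeyError</code>, as the lists above do not
	  cover them.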
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_consensus_in_finishing_programs_gap4_consed_"></a>3.11.4. 
	Consensus in finishing programs (gap4, consed, ...)
      </h3></div></div></div><p>
	MIRA goes a long way to calculate a consensus which is as correct as
	possible. Unfortunately, communication with finishing programs is a bit
	problematic as there currently is no standard way to say which reads are from
	which sequencing technology.
      </p><p>
	It is therefore often the case that finishing programs calculate their own
	consensus when loading a project assembled with MIRA. This is the case for
	at least gap4. This consensus may then not be optimal.
      </p><p>
	The recommended way to deal with this problem is: import the results from MIRA
	into your finishing program like you always do. Then finish the genome there,
	export the project from the finishing program as CAF and finally use
	miraconvert (from the MIRA package) with the "-r" option to
	recalculate the optimal consensus of your finished project.
      </p><p>
	E.g., assuming you have just finished editing the gap4 database
	<code class="filename">DEMO.3</code>, do the following. First, export the gap4 database back to
	CAF:
      </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>gap2caf -project DEMO -version 3 &gt;demo3.caf</code></strong></pre><p>
      </p><p>
	Then, use <span class="command"><strong>miraconvert</strong></span> with option '-r' to
	convert it into any other format that you need. Example for converting to a
	CAF and a FASTA format with correct consensus:
      </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t caf -t fasta -r c demo3.caf final_result</code></strong></pre><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_some_other_things_to_consider"></a>3.11.5. 
	Some other things to consider
      </h3></div></div></div><p>
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    MIRA cannot work with EXP files resulting from GAP4 that already
	    have been edited. If you want to reassemble an edited GAP4 project, convert
	    it to CAF format and use the [-caf] option to load.
	  </p></li><li class="listitem"><p>
	    As also explained earlier, MIRA relies on sequencing vector being
	    recognised in preprocessing steps by other programs. Sometimes, when a whole
	    stretch of bases is not correctly marked as sequencing vector, the reads
	    might not be aligned into a contig although they might otherwise match quite
	    perfectly. You can use [-CL:pvc] and  [-CO:emea] to address
	    problems with incomplete clipping of sequencing vectors. Having the
	    assembler work with less strict parameters may also help here.
	  </p></li><li class="listitem"><p>
	    MIRA has been developed to assemble shotgun sequencing or EST
	    sequencing data. There are no explicit limitations concerning length or
	    number of sequences. However, there are a few implicit assumptions that were
	    made while writing portions of the code:
	  </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
		Problems might arise with 'unnaturally' long sequence
		reads because of my implementation of the Smith-Waterman alignment
		routines. I use a banded version with linear running time
		(linear to the bandwidth) but quadratic space usage. So,
		comparing two 'reads' of length 5000 will result in memory
		usage of 95 MiB, while two reads with 50000 bases will need
		100 times as much, i.e., roughly 9.3 GiB.
	      </p><p>
		This problem has become acute now with PacBio, I'm working on
		it. In the meantime, usable sequence lengths of PacBio reads
		are mostly in the 3 to 4 kilobase range, with only a few reads
		attaining or surpassing 20 kb. So today's machines should
		still be able to handle the problem more or less effortlessly.
	      </p></li><li class="listitem"><p>
		32 bit versions of MIRA are not supported anymore.
	      </p></li><li class="listitem"><p>
		MIRA is not fully multi-threaded (yet), though most
		bottlenecks are now in code areas which cannot be
		multi-threaded by algorithm design.
	      </p></li><li class="listitem"><p>
		to reduce memory overhead, the following assumptions have been made:
	      </p></li></ol></div></li><li class="listitem"><p>
	    a project does not contain sequences from more than 255 different:
	  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
		sequencing machine types
	      </p></li><li class="listitem"><p>
		primers
	      </p></li><li class="listitem"><p>
		strains (in mapping mode: 7)
	      </p></li><li class="listitem"><p>
		base callers
	      </p></li><li class="listitem"><p>
		dyes
	      </p></li><li class="listitem"><p>
		process status
	      </p></li></ul></div></li><li class="listitem"><p>
	    a project does not contain sequences from more than 65535 different:
	  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
		clone vectors
	      </p></li><li class="listitem"><p>
		sequencing vectors
	      </p></li></ul></div></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_things_you_should_not_do"></a>3.12. 
      Things you should not do
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_never_on_nfs"></a>3.12.1. 
	Do not run MIRA on NFS mounted directories without redirecting the tmp directory
      </h3></div></div></div><p>
	Of course one can run MIRA atop an NFS mount (a "disk" mounted over a
	network using the NFS protocol), but the performance will go down the
	drain as the NFS server or the network will not be able to
	cope with the amount of data MIRA needs to shift to and from disk
	(writes/reads to the tmp directory). Slowdowns by a factor of 10 and
	more have been observed. In case you have no other possibility, you
	can force MIRA to run atop NFS using [-NW:cnfs=warn]
	(or [-NW:cnfs=no]), but you have been warned.
      </p><p>
	In case you want to keep input and output files on NFS, you can use
	[-DI:trt] to redirect the tmp directory to a local
	filesystem. Then MIRA will run at almost full speed.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_never_without_quality_values"></a>3.12.2. 
	Do not assemble without quality values
      </h3></div></div></div><p>
	Assembling sequences without quality values is like ... like ... like
	driving a car down a sinuous mountain road with no rails at 200
	km/h without brakes, airbags or a steering wheel. With a ravine on
	one side and a rock face on the other. Did I mention the missing
	seat-belts? You <span class="emphasis"><em>might</em></span> get down safely, but
	experience tells that the result will rather be a bloody mess.
      </p><p>
	Well, assembling without quality values is a bit like above, but
	bloodier. And the worst: you (or the people using the results of such
	an assembly) will notice the gore only when it is way too late and
	money has been sunk in follow-up experiments based on wrong data.
      </p><p>
	All MIRA routines internally are geared toward quality values guiding
	decisions. No one should ever assemble anything without quality
	values. Never. Ever. Even if quality values are sometimes inaccurate,
	they do help.
      </p><p>
	Now, there are <span class="bold"><strong>very rare occasions</strong></span>
	where getting quality values is not possible. If you absolutely cannot
	get them, and I mean only in this case, use the following
	switch: <code class="literal">--noqualities[=SEQUENCINGTECHNOLOGY]</code> and
	additionally give a default quality for reads of a readgroup. E.g.:
      </p><pre class="screen">parameters= --noqualities=454

readgroup
technology=454
data=...
default_qual=30</pre><p>
	This tells MIRA not to complain about missing quality values and to
	fake a quality value of 30 for all reads (of a readgroup) having no
	qualities, allowing some MIRA routines (in standard parameter
	settings) to start disentangling your repeats.
      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	Doing the above has some severe side-effects. You will be, e.g., at
	the mercy of non-random sequencing errors. I suggest combining the
	above with a [-CO:mrpg=4] or higher. You also may want to
	tune the default quality parameter together with  [-CO:mnq]
	and  [-CO:mgqrt] in cases where you mix sequences with and
	without quality values.
      </td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_useful_third_party_programs"></a>3.13. 
      Useful third party programs
    </h2></div></div></div><p>
      Viewing the results of a MIRA assembly or preprocessing the sequences
      for an assembly can be done with a number of different programs. The
      following ones are just examples; there are a lot more packages
      available:
    </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	  HTML browser
	</span></dt><dd><p> If you really have nothing else as a viewer, a browser that
	  understands tables is needed to view the HTML output. A browser knowing
	  style sheets (CSS) is recommended, as different tags will be highlighted.
	  Konqueror, Opera, Mozilla, Netscape and Internet Explorer all do fine, lynx
	  is not really ...  optimal.
	  </p></dd><dt><span class="term">
	  Assembly viewer / finishing / preprocessing
	</span></dt><dd><p>
	    You'll want GAP4 or its successor GAP5 (generally speaking: the
	    Staden package) to preprocess the sequences, visualise and
	    eventually rework the results when using gap4da output. The Staden
	    package comes with a fully featured sequence preparing and
	    annotating engine (pregap4) that is very useful to preprocess your
	    Sanger data (conversion between file types, quality clipping,
	    tagging etc.).
	  </p><p>
	    See <a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for
	    further information and to download precompiled
	    binaries for different platforms.
	  </p></dd><dt><span class="term">
	  Vector screening
	</span></dt><dd><p>
	    Reading result files from <span class="command"><strong>ssaha2</strong></span> or
	    <span class="command"><strong>smalt</strong></span> from the Sanger Centre is supported
	    directly by MIRA to perform a fast and efficient tagging of
	    sequencing vector stretches. This makes you basically independent
	    from any other commercial or license-requiring vector screening
	    software.  For Sanger reads, a combination of
	    <span class="command"><strong>lucy</strong></span> (see below), <span class="command"><strong>ssaha2</strong></span> or
	    <span class="command"><strong>smalt</strong></span> together with the MIRA parameters for
	    SSAHA2 / SMALT support (see all  [-CL:msvs*] parameters) and quality clipping
	    ( [-CL:qc]) should do the trick. For reads coming from 454
	    pyro-sequencing, <span class="command"><strong>ssaha2</strong></span> or
	    <span class="command"><strong>smalt</strong></span> and the SSAHA2 / SMALT support also work
	    pretty well.
	  </p><p>
	    See
	    <a class="ulink" href="http://www.sanger.ac.uk/resources/software/ssaha2/" target="_top">http://www.sanger.ac.uk/resources/software/ssaha2/</a>
	    and / or <a class="ulink" href="http://www.sanger.ac.uk/resources/software/smalt/" target="_top">http://www.sanger.ac.uk/resources/software/smalt/</a> for
	    further information and to download the source
	    or precompiled binaries for different platforms.
	  </p></dd><dt><span class="term">
	  Preprocessing
	</span></dt><dd><p> <span class="command"><strong>lucy</strong></span> from TIGR (now JCVI) is another
	  useful sequence preprocessing program for Sanger data. Lucy is a
	  utility that prepares raw DNA sequence fragments for sequence
	  assembly. The cleanup process includes quality assessment,
	  confidence reassurance, vector trimming and vector removal.
	  </p><p>
	    There's a small script in the MIRA 3rd party package which
	    converts the clipping data from the lucy format into something
	    MIRA can understand (NCBI Traceinfo).
	  </p><p>
	    See <a class="ulink" href="ftp://ftp.tigr.org/pub/software/Lucy/" target="_top">ftp://ftp.tigr.org/pub/software/Lucy/</a> to download the source code
	    of lucy.
	  </p></dd><dt><span class="term">
	  Assembly viewer
	</span></dt><dd><p> Viewing <code class="filename">.ace</code> file output without consed
	  can be done with clview from TIGR. See
	  <a class="ulink" href="http://www.tigr.org/tdb/tgi/software/" target="_top">http://www.tigr.org/tdb/tgi/software/</a>.
	  </p><p>
	    A better alternative is Tablet <a class="ulink" href="http://bioinf.scri.ac.uk/tablet/" target="_top">http://bioinf.scri.ac.uk/tablet/</a> which also reads SAM
	    format.
	  </p></dd><dt><span class="term">
	  Assembly coverage analysis
	</span></dt><dd><p>
	    The Integrated Genome Browser (IGB) of the GenoViz project at
	    SourceForge (<a class="ulink" href="http://sourceforge.net/projects/genoviz/" target="_top">http://sourceforge.net/projects/genoviz/</a>) is just perfect
	    for loading a genome and looking at mapping coverage (provided by
	    the wiggle result files of MIRA).
	  </p></dd><dt><span class="term">
	  Preprocessing (base calling)
	</span></dt><dd><p>
	    TraceTuner (<a class="ulink" href="http://sourceforge.net/projects/tracetuner/" target="_top">http://sourceforge.net/projects/tracetuner/</a>) is a tool for
	    base and quality calling of trace files from DNA sequencing
	    instruments. Originally developed by Paracel, this code base was
	    released as open source in 2006 by Celera.
	  </p></dd><dt><span class="term">
	  Preprocessing / viewing
	</span></dt><dd><p> phred (basecaller) - cross_match (sequence comparison and
	  filtering) - phrap (assembler) - consed (assembly viewer and
	  editor). This is another package that can be used for this type of
	  job, but requires more programming work. The fact that sequence
	  stretches are masked out (overwritten with the character X) if they
	  shouldn't be used in an assembly doesn't really help and is
	  considered harmful (but it works).
	  </p><p>
	    Note the bug of consed when reading ACE files, see more about this
	    in the section on file types (above) in the entry for ACE.
	  </p><p>
	    See <a class="ulink" href="http://www.phrap.org/" target="_top">http://www.phrap.org/</a> for further information.
	  </p></dd><dt><span class="term">
	  text viewer
	</span></dt><dd><p> A text viewer for the different textual output files.
	  </p></dd></dl></div><p>
      As always, most of the time a combination of several different packages
      is possible. My currently preferred combo for genome projects is
      <span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> and / or
      <span class="command"><strong>lucy</strong></span> (vector screening), MIRA (assembly, of course)
      and gap4 (assembly viewing and finishing).
    </p><p>
      For re-assembling projects that were edited in gap4, one will also need
      the gap2caf converter. The source for this is available at
      <a class="ulink" href="http://www.sanger.ac.uk/resources/software/caf.html" target="_top">http://www.sanger.ac.uk/resources/software/caf.html</a>.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_speed_and_memory_considerations"></a>3.14. 
      Speed and memory considerations
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_memory"></a>3.14.1. 
	Estimating needed memory for an assembly project
      </h3></div></div></div><p>
	Since version V2.9.24x3 of MIRA, <span class="command"><strong>miramem</strong></span> is
	available as a program call. When called from the command line, it asks
	a number of questions and then prints an estimate of the amount of RAM
	needed to assemble the project. Take this estimate with a grain of salt:
	depending on the properties of the sequences, the estimate can be off by
	+/- 30% for bacteria and 'simple' eukaryotes. The higher the number of
	repeats, the more likely it is that you will need to restrict memory
	usage in some way or another.
      </p><p>
	Here's the transcript of a session with miramem:
      </p><pre class="screen">
This is MIRA V3.2.0rc1 (development version).

Please cite: Chevreux, B., Wetter, T. and Suhai, S. (1999), Genome Sequence
Assembly Using Trace Signals and Additional Sequence Information.
Computer Science and Biology: Proceedings of the German Conference on
Bioinformatics (GCB) 99, pp. 45-56.

To (un-)subscribe the MIRA mailing lists, see:
        http://www.chevreux.org/mira_mailinglists.html

After subscribing, mail general questions to the MIRA talk mailing list:
        mira_talk@freelists.org

To report bugs or ask for features, please use the SourceForge ticketing
system at:
        http://sourceforge.net/p/mira-assembler/tickets/
This ensures that requests do not get lost.

[...]

miraMEM helps you to estimate the memory needed to assemble a project.
Please answer the questions below.

Defaults are give in square brackets and chosen if you just press return.
Hint: you can add k/m/g modifiers to your numbers to say kilo, mega or giga.

Is it a genome or transcript (EST/tag/etc.) project? (g/e/) [g]
g
Size of genome? [4.5m] <strong class="userinput"><code>9.8m</code></strong>
9800000
Size of largest chromosome? [9800000]
9800000
Is it a denovo or mapping assembly? (d/m/) [d]
d
Number of Sanger reads? [0]
0
Are there 454 reads? (y/n/) [n] <strong class="userinput"><code>y</code></strong>
y
Number of 454 GS20 reads? [0]
0
Number of 454 FLX reads? [0]
0
Number of 454 Titanium reads? [0] <strong class="userinput"><code>750k</code></strong>
750000
Are there PacBio reads? (y/n/) [n]
n
Are there Solexa reads? (y/n/) [n]
n


************************* Estimates *************************

The contigs will have an average coverage of ~ 30.6 (+/- 10%)

RAM estimates:
           reads+contigs (unavoidable): 7.0 GiB
                large tables (tunable): 688. MiB
                                        ---------
                          total (peak): 7.7 GiB

            add if using -CL:pvlc=yes : 2.6 GiB

Estimates may be way off for pathological cases.

Note that some algorithms might try to grab more memory if
the need arises and the system has enough RAM. The options
for automatic memory management control this:
  -AS:amm, -AS:kpmf, -AS:mps
Further switches that might reduce RAM (at cost of run time
or accuracy):
  -SK:mkim, -SK:mchr (both runtime); -SK:mhpr (accuracy)
*************************************************************</pre><p>
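	The coverage figure in this transcript can be checked by hand: average
	coverage is roughly the number of reads times the average read length,
	divided by the genome size. The sketch below assumes an average usable
	length of 400 bases per 454 Titanium read. That value is an assumption
	made here (miramem does not print it), chosen because it reproduces the
	~30.6 shown above. The function name is made up for this example.
      </p><pre class="screen">
# Rough coverage check; not part of MIRA.
coverage() {
    # $1 = number of reads, $2 = assumed average read length, $3 = genome size
    awk -v r="$1" -v l="$2" -v g="$3" 'BEGIN { printf "%.1f\n", r * l / g }'
}
coverage 750000 400 9800000</pre><p>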
	If your RAM is not large enough, you can still assemble projects by
	using disk swap. Up to 20% of the needed memory can be provided by
	swap without the speed penalty getting too large. Going above 20% is
	not recommended, though: above 30%, the machine will be swapping
	almost permanently at some point or another.
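      </p><p>
	Applied to the peak estimate from the miramem transcript, this rule of
	thumb works out as follows. This is only a small sketch of the
	arithmetic; the helper name <code class="literal">swap_budget</code> is made up
	for this example and is not part of MIRA.
      </p><pre class="screen">
# At most 20% of the estimated memory should come from swap.
swap_budget() {
    # $1 = estimated peak memory in GiB
    awk -v need="$1" 'BEGIN {
        printf "min RAM: %.1f GiB, max swap: %.1f GiB\n", need * 0.8, need * 0.2
    }'
}
swap_budget 7.7</pre><p>
	For the 7.7 GiB estimate, this suggests having at least about 6.2 GiB
	of physical RAM and letting swap cover no more than about 1.5 GiB.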
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_speed"></a>3.14.2. 
	Some numbers on speed
      </h3></div></div></div><p>
	To be rewritten for MIRA4.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_known_problems_bugs"></a>3.15. 
      Known Problems / Bugs
    </h2></div></div></div><p>
      File Input / Output:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  MIRA can only read unedited EXP files.
	</p></li><li class="listitem"><p>
	  There sometimes is a (rather important) memory leak occurring while
	  using the assembly integrated Sanger read editor. I have not been
	  able to trace the reason yet.
	</p></li></ol></div><p>
    </p><p>
      Assembly process:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  The routines for determining <span class="emphasis"><em>Repeat Marker
	  Bases</em></span> (SRMr) are sometimes too sensitive, which can
	  lead to excessive base tagging and prevent correct assemblies in
	  subsequent assembly processes. The parameters you should look at for
	  this problem are
	   [-CO:mrc:nrz:mgqrt:mgqwpc]. Also look at  [-CL:pvc] and
	   [-CO:emea] if you have a lot of sequencing vector relics at the
	  end of the sequences.
	</p></li></ol></div><p>
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_todos"></a>3.16. 
      TODOs
    </h2></div></div></div><p>
      These are some of the topics on my TODO list for the next revisions to
      come:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  Making the Smith-Waterman parts of the process multi-threaded or
	  having them use SIMD (currently on hold due to other priorities like
	  PacBio etc.)
	</p></li></ol></div><p>
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_working_principles"></a>3.17. 
      Working principles
    </h2></div></div></div><p>
      Note: description is old and needs to be adapted to the current 4.x line
      of MIRA.
    </p><p>
      To avoid the "garbage-in, garbage-out" problem, MIRA uses a 'high
      quality alignments first' contig building strategy. This means that the
      assembler will start with those regions of sequences that have been
      marked as good quality (high confidence region - HCR) with low error
      probabilities (the clipping must have been done by the base caller or
      other preprocessing programs, e.g. pregap4) and then gradually extends
      the alignments as errors in different reads are resolved through error
      hypothesis verification and signal analysis.
    </p><p>
      This assembly approach relies on some of the automatic editing
      functionality provided by the EdIt package which has been integrated in
      parts within MIRA.
    </p><p>
      This is an approximate overview of the steps that are executed while
      assembling:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  All the experiment / phd / fasta sequences that act as input are
	  loaded (or the CAF project). Qualities for the bases are loaded from
	  the FASTA or SCF if needed.
	</p></li><li class="listitem"><p>
	  The ends of the reads are cleaned to ensure they have a minimum
	  stretch of bases without sequencing errors.
	</p></li><li class="listitem"><p>
	  The high confidence region (HCR) of each read is compared with a
	  quick algorithm to the HCR of every other read to see if it could
	  match and have overlapping parts (this is the 'SKIM' filter).
	</p></li><li class="listitem"><p>
	  All the reads which could match are checked with an adapted
	  Smith-Waterman alignment algorithm (banded version). Obvious
	  mismatches are rejected; the accepted alignments form one or several
	  alignment graphs.
	</p></li><li class="listitem"><p>
	  Optional pre-assembly read extension step: MIRA tries to extend HCR
	  of reads by analysing the read pairs from the previous
	  alignment. This is a bit shaky as reads in this step have not been
	  edited yet, but it can help. Go back to step 2.
	</p></li><li class="listitem"><p>
	  A contig gets made by building a preliminary partial path through
	  the alignment graph (through in-depth analysis up to a given level)
	  and then adding the most probable overlap candidates to a given
	  contig. Contigs may reject reads if these introduce too many errors
	  in the existing consensus. Errors in regions known as dangerous
	  (for the time being only ALUS and REPT) get additional attention by
	  performing simple signal analysis when alignment discrepancies
	  occur.
	</p></li><li class="listitem"><p>
	  Optional: the contig can be analysed and corrected by the automatic
	  editor ("EdIt" for Sanger reads, or the new MIRA editor for 454
	  reads).
	</p></li><li class="listitem"><p>
	  Long repeats are searched for. Bases in reads of different repeats
	  that have been assembled together but differ sufficiently (enough
	  that EdIt did not edit them, and judged by phred quality value) get
	  tagged with special tags (SRMr and WRMr).
	</p></li><li class="listitem"><p>
	  Go back to step 5 if there are reads present that have not been
	  assembled into contigs.
	</p></li><li class="listitem"><p>
	  Optional: Detection of spoiler reads that prevent joining of
	  contigs. Remedy by shortening them.
	</p></li><li class="listitem"><p>
	  Optional: Write out a checkpoint assembly file and go back to step 2.
	</p></li><li class="listitem"><p>
	  The resulting project is written out to different output files and
	  directories.
	</p></li></ol></div><p>
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_see_also"></a>3.18. 
      See Also
    </h2></div></div></div><p>
      The other MIRA manuals and walkthroughs as well as
      <span class="command"><strong>EdIt</strong></span>, <span class="command"><strong>gap4</strong></span>,
      <span class="command"><strong>pregap4</strong></span>, <span class="command"><strong>gap5</strong></span>,
      <span class="command"><strong>clview</strong></span>, <span class="command"><strong>caf2gap</strong></span>,
      <span class="command"><strong>gap2caf</strong></span>, <span class="command"><strong>ssaha2</strong></span>,
      <span class="command"><strong>smalt</strong></span>, <span class="command"><strong>compress</strong></span> and
      <span class="command"><strong>gzip</strong></span>, <span class="command"><strong>cap3</strong></span>,
      <span class="command"><strong>ttuner</strong></span>, <span class="command"><strong>phred</strong></span>,
      <span class="command"><strong>phrap</strong></span>, <span class="command"><strong>cross_match</strong></span>,
      <span class="command"><strong>consed</strong></span>.
    </p></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_dataprep"></a>Chapter 4. Preparing data</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_pd_introduction">4.1. 
      Introduction
    </a></span></dt><dt><span class="sect1"><a href="#sect_pd_sanger">4.2. 
      Sanger
    </a></span></dt><dt><span class="sect1"><a href="#sect_pd_454">4.3. 
      Roche / 454
    </a></span></dt><dt><span class="sect1"><a href="#sect_pd_illumina">4.4. 
      Illumina
    </a></span></dt><dt><span class="sect1"><a href="#sect_pd_pacbio">4.5. 
      Pacific Biosciences
    </a></span></dt><dt><span class="sect1"><a href="#sect_pd_iontor">4.6. 
      Ion Torrent
    </a></span></dt><dt><span class="sect1"><a href="#sect_pd_sra">4.7. 
      Short Read Archive (SRA)
    </a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Rome didn't fall in a day either.</span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_introduction"></a>4.1. 
      Introduction
    </h2></div></div></div><p>
      Most of this chapter and many sections are just stubs at the moment.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_sanger"></a>4.2. 
      Sanger
    </h2></div></div></div><p>
      Outside MIRA: transform .ab1 to .scf, perform sequencing vector clip
      (and cloning vector clip if used), basic quality clips.
    </p><p>
      Recommended program: <span class="command"><strong>gap4</strong></span> (or
      rather <span class="command"><strong>pregap4</strong></span>) from the Staden 4 package.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_454"></a>4.3. 
      Roche / 454
    </h2></div></div></div><p>
      Outside MIRA: convert the SFF files from the Roche instrument to FASTQ,
      use <span class="command"><strong>sff_extract</strong></span> for that. In case you used
      "non-standard" sequencing procedures: clip away MIDs, clip away
      non-standard sequencing adaptors used in that project.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_illumina"></a>4.4. 
      Illumina
    </h2></div></div></div><p>
      Outside MIRA: for heaven's sake: do NOT try to clip or trim by quality
      yourself. Do NOT try to remove standard sequencing adaptors
      yourself. Just leave Illumina data alone! (really, I mean it).
    </p><p>
      MIRA is much, much better at that job than you will probably ever be
      ... and I dare to say that MIRA is better at that job than 99% of all
      clipping/trimming software existing out there. Just make sure you use
      the [-CL:pec] (proposed_end_clip) option of MIRA.
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      The <span class="emphasis"><em>only</em></span> exception to the above is if you (or your
      sequencing provider) used decidedly non-standard sequencing
      adaptors. Then it might be worthwhile to perform your own adaptor
      clipping. But this will not be the case for 99% of all sequencing
      projects out there.
    </td></tr></table></div><p>
      Joining paired-ends: if you want to do this, feel free to use any tool
      which is out there (TODO: quick list). Just make sure it does not join
      on very short overlaps. For me, the minimum overlap is at least 17
      bases, but I more commonly use at least 30.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_pacbio"></a>4.5. 
      Pacific Biosciences
    </h2></div></div></div><p>
      Outside MIRA: MIRA needs error corrected reads, either
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
	PacBio CCS reads (circular consensus sequence) which you get from the
	PacBio SMRTAnalysis pipeline
      </li><li class="listitem">
	or self-corrected reads, or reads corrected with other sequencing
	technologies, which you will get either from the PacBio HGAP pipeline
	or the pacbioToCA pipeline
      </li></ul></div><p>
      Assembly of uncorrected PacBio reads (CLR) is currently not supported
      officially as of MIRA 4.0.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_iontor"></a>4.6. 
      Ion Torrent
    </h2></div></div></div><p>
      Outside MIRA: you need to convert BAM to FASTQ and to clip away
      non-standard sequencing adaptors if they were used in that project.
      Apart from that: leave the data alone.
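    </p><p>
      One common way to do the BAM to FASTQ conversion is
      <span class="command"><strong>samtools</strong></span>. This tool is a suggestion added
      here, not something MIRA requires, and the file names below are
      placeholders. A minimal sketch for unpaired Ion Torrent reads, assuming
      a recent samtools version that provides the
      <code class="literal">fastq</code> subcommand:
    </p><pre class="screen"><code class="prompt">arcadia:/some/path$</code> <strong class="userinput"><code>samtools fastq <em class="replaceable"><code>reads.bam</code></em> &gt; <em class="replaceable"><code>reads.fastq</code></em></code></strong></pre><p>
      It is worth comparing the number of reads in the resulting FASTQ with
      what your sequencing provider reported before moving on.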
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_sra"></a>4.7. 
      Short Read Archive (SRA)
    </h2></div></div></div><p>
      Outside MIRA: you need to convert SRA format to FASTQ format. This is done
      using <span class="command"><strong>fastq-dump</strong></span> from the SRA toolkit from the
      NCBI. Make sure to have at least version 2.4.x of the toolkit. Last time
      I looked (March 2015), the software was at
      <a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software</a>, the
      documentation for the whole toolkit was at
      <a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc</a>,
      and for <span class="command"><strong>fastq-dump</strong></span> it was
      <a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&amp;f=fastq-dump" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&amp;f=fastq-dump</a>
    </p><p>
      After extraction, proceed with preprocessing as described above,
      depending on the sequencing technology used.
    </p><p>
      For extracting Illumina data, use something like this:
    </p><pre class="screen"><code class="prompt">arcadia:/some/path$</code> <strong class="userinput"><code>fastq-dump -I --split-files <em class="replaceable"><code>somefile.sra</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	As <span class="command"><strong>fastq-dump</strong></span> unfortunately uses a pretty wasteful
	variant of the FASTQ format, you might want to reduce the file size
	for each FASTQ it produces by doing this:
      </p><pre class="screen"><strong class="userinput"><code>sed -i '3~4 s/^+.*$/+/' <em class="replaceable"><code>file.fastq</code></em></code></strong></pre><p>
	The above command performs an in-place replacement of the unnecessary
	names and comments on the quality divider lines of the FASTQ. The exact
	translation of the <span class="command"><strong>sed</strong></span> is: edit the file
	in place (-i); starting on the third line, then every fourth line
	(3~4); substitute (s/); a line which starts (^); with a plus (+); and
	then can have any character (.); repeated any number of times
	including zero (*); until the end of the line ($); by just a single
	plus character (/+/).
      </p><p>
	This alone reduces the file size of a typical Illumina data set with
	100mers extracted from the SRA by about 15 to 20%.
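      </p><p>
	Note that the <code class="literal">3~4</code> address form is a GNU
	<span class="command"><strong>sed</strong></span> extension. Where GNU sed is not
	available, the same clean-up can be sketched with
	<span class="command"><strong>awk</strong></span>, writing to standard output instead of
	editing in place. The function name
	<code class="literal">slim_fastq</code> is made up for this example, and the
	sketch assumes strict four-line FASTQ records:
      </p><pre class="screen">
# Replace the third line of every 4-line FASTQ record by a bare "+".
# usage: slim_fastq file.fastq, redirecting the output to a new file
slim_fastq() {
    awk 'NR % 4 == 3 { print "+"; next } { print }' "$1"
}</pre><p>
	Unlike the sed call, this does not check that the line starts with a
	plus, so it should only be used on well-formed FASTQ files.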
      </p></td></tr></table></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_denovo"></a>Chapter 5. De-novo assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_dn_introduction">5.1. 
      Introduction
    </a></span></dt><dt><span class="sect1"><a href="#sect_dn_general">5.2. 
      General steps
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_dn_ge_copying_and_naming_the_sequence_data">5.2.1. 
	Copying and naming the sequence data
      </a></span></dt><dt><span class="sect2"><a href="#sect_dn_ge_writing_a_simple_manifest_file">5.2.2. 
	Writing a simple manifest file
      </a></span></dt><dt><span class="sect2"><a href="#sect_dn_ge_starting_assembly">5.2.3. Starting the assembly</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_dn_manifest_files_use_cases">5.3. 
      Manifest files for different use cases
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_dn_mf_denovo_with_shotgun_data">5.3.1. 
	Manifest for shotgun data
      </a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_assembling_with_multiple_technologies">5.3.2. 
	Assembling with multiple sequencing technologies (hybrid assemblies)
      </a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_manifest_for_pairedend_data">5.3.3. 
	Manifest for data sets with paired reads
      </a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_denovo_with_multiple_strains">5.3.4. 
	De-novo with multiple strains
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">The universe is full of surprises - most of them nasty.</span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_introduction"></a>5.1. 
      Introduction
    </h2></div></div></div><p>
      This guide assumes that you have basic working knowledge of Unix systems, know
      the basic principles of sequencing (and sequence assembly) and what assemblers
      do.
    </p><p>
      While there are step by step instructions on how to setup your data and
      then perform an assembly, this guide expects you to read at some point in time
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  Before the assembly, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: &#8220;<i>Preparing data</i>&#8221;</a> to know what to do (or not to
	  do) with the sequencing data before giving it to MIRA.
	</p></li><li class="listitem"><p>
	  For users with PacBio reads, <a class="xref" href="#sect_sp_pacbio_ccs" title="8.2.1.  PacBio CCS reads">Section 8.2.1: &#8220;
	PacBio CCS reads
      &#8221;</a> has important
	  information regarding special parameters needed.
	</p></li><li class="listitem"><p>
	  After the assembly, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: &#8220;<i>Working with the results of MIRA</i>&#8221;</a> to know what to do with the
	  results of the assembly. More specifically, <a class="xref" href="#sect_res_looking_at_results" title="9.1.  MIRA output directories and files">Section 9.1: &#8220;
      MIRA output directories and files
    &#8221;</a>, <a class="xref" href="#sect_res_first_look:the_assembly_info" title="9.2.  First look: the assembly info">Section 9.2: &#8220;
      First look: the assembly info
    &#8221;</a>, <a class="xref" href="#sect_res_converting_results" title="9.3.  Converting results">Section 9.3: &#8220;
      Converting results
    &#8221;</a>, <a class="xref" href="#sect_res_filtering_of_results" title="9.4.  Filtering results">Section 9.4: &#8220;
      Filtering results
    &#8221;</a> and <a class="xref" href="#sect_res_places_of_importance_in_a_de_novo_assembly" title="9.5.  Places of importance in a de-novo assembly">Section 9.5: &#8220;
      Places of importance in a de-novo assembly
    &#8221;</a>.
	</p></li><li class="listitem"><p>
	  And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: &#8220;<i>MIRA 4 reference manual</i>&#8221;</a> to look up how manifest files should be
	  written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2.  The manifest file: basics">Section 3.4.2: &#8220;
	The manifest file: basics
      &#8221;</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3.  The manifest file: information on the data you have">Section 3.4.3: &#8220;
	The manifest file: information on the data you have
      &#8221;</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4.  The manifest file: extended parameters">Section 3.4.4: &#8220;
	The manifest file: extended parameters
      &#8221;</a>), some command line options as well as general information on
	  what tags MIRA uses in assemblies, files it generates etc.pp
	</p></li><li class="listitem"><p>
	  Last but not least, you may be interested in some observations about
	  the different sequencing technologies and the traps they may
	  contain, see <a class="xref" href="#chap_seqtechdesc" title="Chapter 12. Description of sequencing technologies">Chapter 12: &#8220;<i>Description of sequencing technologies</i>&#8221;</a> for that. For advice on what to pay
	  attention to <span class="emphasis"><em>before</em></span> going into a sequencing
	  project, have a look at <a class="xref" href="#chap_seqadvice" title="Chapter 13. Some advice when going into a sequencing project">Chapter 13: &#8220;<i>Some advice when going into a sequencing project</i>&#8221;</a>.
	</p></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_general"></a>5.2. 
      General steps
    </h2></div></div></div><p>
      This part will walk you step by step through getting your data together
      for a simple de-novo assembly. I'll make up an example using an
      imaginary bacterium: <span class="emphasis"><em>Bacillus chocorafoliensis</em></span> (or
      short: <span class="emphasis"><em>Bchoc</em></span>). You collected the strain you want to
      assemble somewhere in the wild, so you gave the strain the name
      <span class="emphasis"><em>Bchoc_wt</em></span>.
    </p><p>
      Just for laughs, let's assume you sequenced that bug with lots of more
      or less current sequencing technologies: Sanger, 454, Illumina, Ion
      Torrent and Pacific Biosciences.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_copying_and_naming_the_sequence_data"></a>5.2.1. 
	Copying and naming the sequence data
      </h3></div></div></div><p>
	You need to create (or get from your sequencing provider) the
	sequencing data in any supported file format. Amongst these, FASTQ and
	FASTA + FASTA-quality will be the most common, although the latter is
	well on the way out nowadays. The following walkthrough uses what most
	people nowadays get: FASTQ.
      </p><p>
	Create a new project directory (e.g. <code class="filename">myProject</code>)
	and a subdirectory of this which will hold the sequencing data
	(e.g. <code class="filename">data</code>).
      </p><pre class="screen"><code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>mkdir myProject</code></strong>
<code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>cd myProject</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir data</code></strong></pre><p>
	Put the FASTQ data into that <code class="filename">data</code> directory so
	that it now looks perhaps like this:
      </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocwt_lane6.solexa.fastq</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      I completely made up the file names above. You can name them any way you
      want. They can also live anywhere on the hard disk; you do not
      need to put them in this <code class="filename">data</code> directory. It's just
      the way I do it ... and it's where the example manifest files a bit
      further down in this chapter will look for the data files.
    </td></tr></table></div><p>
      We're almost finished with the setup. As I like to have things neatly separated, I always create a directory called <code class="filename">assemblies</code> which will hold my assemblies (or different trials) together. Let's quickly do that:
    </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies/1sttrial</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cd assemblies/1sttrial</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_writing_a_simple_manifest_file"></a>5.2.2. 
	Writing a simple manifest file
      </h3></div></div></div><p>
	A manifest file is a configuration file for MIRA which tells it what
	type of assembly it should do and which data it should load. In this
	case we'll make a simple de-novo assembly of a genome with unpaired
	Illumina data:
      </p><pre class="screen"># Example for a manifest describing a genome de-novo assembly with
# unpaired Illumina data

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# here comes the unpaired Illumina data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/bchocwt_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  Please look up the parameters of the manifest file in the main
	  manual or the example manifest files in the following section.
	</p><p>
	  The ones above basically say: make an accurate denovo assembly of
	  unpaired Illumina reads.
	</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_starting_assembly"></a>5.2.3. Starting the assembly</h3></div></div></div><p>
	Starting the assembly is now just a matter of a simple command line:
      </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject/assemblies/1sttrial$</code> <strong class="userinput"><code>mira <em class="replaceable"><code>manifest.conf &gt;&amp;log_assembly.txt</code></em></code></strong></pre><p>
	For this example - if you followed the walk-through on how to prepare the data
	- the only thing you might want to adapt at first is the following option in
	the manifest file:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    project= (for naming your assembly project)
	  </p></li></ul></div><p>
	Of course, you are free to change any option via the extended parameters, but
	this is the topic of another part of this manual.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_manifest_files_use_cases"></a>5.3. 
      Manifest files for different use cases
    </h2></div></div></div><p>
      This section will introduce you to manifest files for different use
      cases. It should cover the most important uses, but as always you are
      free to mix and match the parameters and readgroup definitions to suit
      your specific needs.
    </p><p>
      Taking into account that there may be <span class="emphasis"><em>a lot</em></span> of
      combinations of sequencing technologies, sequencing libraries (shotgun,
      paired-end, mate-pair, etc.) and input file types (FASTQ, FASTA,
      GenBank, GFF3, etc.pp), the example manifest files just use Illumina and
      454 as technologies, GFF3 as input file type for the reference sequence,
      FASTQ as input type for sequencing data ... and they do not show the
      multitude of more advanced features like, e.g., using ancillary clipping
      information in XML files, ancillary masking information in SSAHA2 or
      SMALT files etc.pp.
    </p><p>
      I'm sure you will be able to find your way by scanning through the
      corresponding section on manifest files in the reference chapter :-)
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_denovo_with_shotgun_data"></a>5.3.1. 
	Manifest for shotgun data
      </h3></div></div></div><p>
	Well, we've seen that already in the section above, but here it is
	again ... but this time with 454 data.
      </p><pre class="screen"># Example for a manifest describing a denovo assembly with
# unpaired 454 data

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# here's the 454 data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/some454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_assembling_with_multiple_technologies"></a>5.3.2. 
	Assembling with multiple sequencing technologies (hybrid assemblies)
      </h3></div></div></div><p>
	Hybrid de-novo assemblies follow the general manifest scheme: tell
	MIRA what you want in the first part, then simply add, as separate
	readgroups, the information it needs to find the data and off you
	go. Just for laughs, here's a manifest for 454 shotgun with Illumina
	shotgun:
      </p><pre class="screen"># Example for a manifest describing a denovo assembly with
# shotgun 454 and shotgun Illumina data

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# now the shotgun 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgun454</code></em>
data = <em class="replaceable"><code>../../data/project454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong>

# now the shotgun Illumina data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgunIllumina</code></em>
data = <em class="replaceable"><code>../../data/someillumina.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_manifest_for_pairedend_data"></a>5.3.3. 
	Manifest for data sets with paired reads
      </h3></div></div></div><p>
	When using paired-end data, you should know
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    the orientation of the reads toward each other. This is specific
	    to sequencing technologies and / or the sequencing library preparation.
	  </p></li><li class="listitem"><p>
	    at which distance these reads should be. This is specific to the
	    sequencing library preparation and the sequencing lab should tell
	    you this.
	  </p></li></ol></div><p>
	In case you do not know one (or any) of the above, don't panic! MIRA
	is able to estimate the needed values during the assembly if you tell
	it to.
      </p><p>
	The following manifest shows you the laziest way to define a
	paired data set by simply adding <span class="emphasis"><em>autopairing</em></span> as keyword to a
	readgroup (using Illumina just as an example):
      </p><pre class="screen"># Example for a lazy manifest describing a denovo assembly with
# one library of paired reads

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# now the Illumina paired-end data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><p>
	If you know the orientation of the reads and/or the library size, you
	can tell MIRA this in the following way (just showing the readgroup
	definition here):
      </p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em></code></strong></pre><p>
	In case you are not 100% sure about, e.g., the size of the DNA
	template, you can also give a (generous) expected range and then tell
	MIRA to automatically refine this range during the assembly based on
	real, observed distances of read pairs. Do this with the <span class="emphasis"><em>autorefine</em></span>
	modifier like this:
      </p><pre class="screen"><strong class="userinput"><code>template_size = <em class="replaceable"><code>50 1000 autorefine</code></em></code></strong></pre><p>
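	Put together with the readgroup definition shown before, a complete
	readgroup using <span class="emphasis"><em>autorefine</em></span> might look like the
	following sketch. As before, the readgroup name and the file names are
	made up; the keywords are only the ones already introduced in this
	section:
      </p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEndLib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>50 1000 autorefine</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em></code></strong></pre><p>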
	The following manifest file is an example for assembling with several
	different libraries from different technologies. Do not forget you
	can use <span class="emphasis"><em>autopairing</em></span> or <span class="emphasis"><em>autorefine</em></span> :-)
      </p><pre class="screen"># Example for a manifest describing a denovo assembly with
# several kinds of sequencing libraries

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# now the Illumina paired-end data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaForPairedEnd500bpLib</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em></code></strong>

# now the Illumina mate-pair data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaForMatePair3kbLib</code></em>
data = <em class="replaceable"><code>../../data/project3kb-1.fastq ../../data/project3kb-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>2500 3500</code></em>
segment_placement = <em class="replaceable"><code>&lt;--- ---&gt;</code></em></code></strong>

# some Sanger data (6kb library)

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSanger6kbLib</code></em>
data = <em class="replaceable"><code>../../data/sangerdata.fastq</code></em>
technology = <em class="replaceable"><code>sanger</code></em>
template_size = <em class="replaceable"><code>5500 6500</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em></code></strong>

# some 454 data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataFor454Pairs</code></em>
data = <em class="replaceable"><code>../../data/454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em>
template_size = <em class="replaceable"><code>8000 12000</code></em>
segment_placement = <em class="replaceable"><code>2---&gt; 1---&gt;</code></em></code></strong>

# some Ion Torrent data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForIonPairs</code></em>
data = <em class="replaceable"><code>../../data/iondata.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em>
template_size = <em class="replaceable"><code>1000 3000</code></em>
segment_placement = <em class="replaceable"><code>2---&gt; 1---&gt;</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_denovo_with_multiple_strains"></a>5.3.4. 
	De-novo with multiple strains
      </h3></div></div></div><p>
	MIRA will make use of ancillary information present in the manifest
	file. One of these is the information to which strain (or organism or
	cell line etc.pp) the generated data belongs.
      </p><p>
	You just need to tell in the manifest file which data comes from which
	strain. Let's assume you have two lanes of Illumina data: the "lane6" data
	were from a first mutant named <span class="emphasis"><em>bchoc_se1</em></span> and the
	"lane7" data were from a second mutant
	named <span class="emphasis"><em>bchoc_se2</em></span>. Here's the manifest file you
	would write then:
      </p><pre class="screen"># Example for a manifest describing a de-novo assembly with
# unpaired Illumina data, but from multiple strains

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should assemble a genome de-novo in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# now the Illumina data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE2</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane7.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	While assembling de-novo (or mapping) with multiple strains is
	possible, the interpretation of results may become a bit daunting in
	some cases. For many scenarios it might therefore be preferable to
	use the data sets one after another in separate assemblies or mappings.
      </td></tr></table></div><p>
	This <span class="emphasis"><em>strain</em></span> information for each readgroup is
	really the only change you need to perform to tell MIRA everything it
	needs for handling strains.
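      </p><p>
	As readgroup keywords can be mixed and matched, the strain information
	should also combine with the keywords for paired reads from the
	previous section. The following readgroup definitions are only a
	sketch: the readgroup names and file names are made up, and I assume
	here that each strain was sequenced as one library of paired reads.
      </p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>PairedDataForSE1</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/bchocse1_1.fastq ../../data/bchocse1_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>PairedDataForSE2</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/bchocse2_1.fastq ../../data/bchocse2_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><p>
	Both readgroups use <span class="emphasis"><em>autopairing</em></span> so that MIRA
	estimates the orientation and distance of the read pairs itself. If you
	know these values, give them
	with <span class="emphasis"><em>template_size</em></span>
	and <span class="emphasis"><em>segment_placement</em></span> instead.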
      </p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_mapping"></a>Chapter 6. Mapping assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_map_introduction">6.1. 
      Introduction
    </a></span></dt><dt><span class="sect1"><a href="#sect_map_general">6.2. 
      General steps
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_ge_copying_and_naming_the_sequence_data">6.2.1. 
	Copying and naming the sequence data
      </a></span></dt><dt><span class="sect2"><a href="#sect_map_ma_copying_and_naming_the_reference_sequence">6.2.2. 
	Copying and naming the reference sequence
      </a></span></dt><dt><span class="sect2"><a href="#sect_map_ge_writing_a_simple_manifest_file">6.2.3. 
	Writing a simple manifest file
      </a></span></dt><dt><span class="sect2"><a href="#sect_map_ge_starting_assembly">6.2.4. Starting the assembly</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_manifest_files_use_cases">6.3. 
      Manifest files for different use cases
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_shotgun_data">6.3.1. 
	Mapping with shotgun data
      </a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_manifest_for_pairedend_data">6.3.2. 
	Manifest for data sets with paired reads
      </a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_multiple_technologies">6.3.3. 
	Mapping with multiple sequencing technologies (hybrid mapping)
      </a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_multiple_strains">6.3.4. 
	Mapping with multiple strains
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_walkthroughs">6.4. 
      Walkthroughs
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_walkthrough:_mapping_of_ecoli_from_lenski_lab_against_ecoli_b_rel606">6.4.1. 
	Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_useful_about_reference_sequences">6.5. 
      Useful things to know about reference sequences
    </a></span></dt><dt><span class="sect1"><a href="#sect_map_known_bugs_problems">6.6. 
      Known bugs / problems
    </a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">You have to know what you're looking for before you can find it.</span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_introduction"></a>6.1. 
      Introduction
    </h2></div></div></div><p>
      This guide assumes that you have basic working knowledge of Unix systems, know
      the basic principles of sequencing (and sequence assembly) and what assemblers
      do.
    </p><p>
      While there are step by step instructions on how to set up your data and
      then perform an assembly, this guide expects you to read at some point in time:
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  Before the mapping, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: &#8220;<i>Preparing data</i>&#8221;</a> to know what to do (or not to
	  do) with the sequencing data before giving it to MIRA.
	</p></li><li class="listitem"><p>
	  Generally, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: &#8220;<i>Working with the results of MIRA</i>&#8221;</a> to know what to do with the
	  results of the assembly. More specifically, <a class="xref" href="#sect_res_converting_results" title="9.3.  Converting results">Section 9.3: &#8220;
      Converting results
    &#8221;</a>, <a class="xref" href="#sect_res_places_of_interest_in_a_mapping_assembly" title="9.6.  Places of interest in a mapping assembly">Section 9.6: &#8220;
      Places of interest in a mapping assembly
    &#8221;</a> and <a class="xref" href="#sect_res_postprocessing_mapping_assemblies" title="9.7.  Post-processing mapping assemblies">Section 9.7: &#8220;
      Post-processing mapping assemblies
    &#8221;</a>.
	</p></li><li class="listitem"><p>
	  And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: &#8220;<i>MIRA 4 reference manual</i>&#8221;</a> to look up how manifest files should be
	  written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2.  The manifest file: basics">Section 3.4.2: &#8220;
	The manifest file: basics
      &#8221;</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3.  The manifest file: information on the data you have">Section 3.4.3: &#8220;
	The manifest file: information on the data you have
      &#8221;</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4.  The manifest file: extended parameters">Section 3.4.4: &#8220;
	The manifest file: extended parameters
      &#8221;</a>), some command line options as well as general information on
	  what tags MIRA uses in assemblies, files it generates etc.pp
	</p></li><li class="listitem"><p>
	  Last but not least, you may be interested in some observations about
	  the different sequencing technologies and the traps they may
	  contain, see <a class="xref" href="#chap_seqtechdesc" title="Chapter 12. Description of sequencing technologies">Chapter 12: &#8220;<i>Description of sequencing technologies</i>&#8221;</a> for that. For advice on what to pay
	  attention to <span class="emphasis"><em>before</em></span> going into a sequencing
	  project, have a look at <a class="xref" href="#chap_seqadvice" title="Chapter 13. Some advice when going into a sequencing project">Chapter 13: &#8220;<i>Some advice when going into a sequencing project</i>&#8221;</a>.
	</p></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_general"></a>6.2. 
      General steps
    </h2></div></div></div><p>
      This part will walk you step by step through getting your data together for a
      simple mapping assembly.
    </p><p>
      I'll make up an example using an imaginary bacterium: <span class="emphasis"><em>Bacillus chocorafoliensis</em></span> (or short: <span class="emphasis"><em>Bchoc</em></span>).
    </p><p>
      In this example, we assume you have two strains: the wild type strain
      <span class="emphasis"><em>Bchoc_wt</em></span> and a mutant which you perhaps got from mutagenesis or other
      means. Let's imagine that this mutant needs more time to eliminate a given
      amount of chocolate, so we call the mutant <span class="emphasis"><em>Bchoc_se</em></span> ... SE for
      <span class="bold"><strong>s</strong></span>low <span class="bold"><strong>e</strong></span>ater.
    </p><p>
      You wanted to know which mutations might be responsible for the observed
      behaviour. Assume the genome of <span class="emphasis"><em>Bchoc_wt</em></span> is available to you as it was
      published (or you previously sequenced it), so you resequenced <span class="emphasis"><em>Bchoc_se</em></span>
      with Solexa to examine mutations.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_copying_and_naming_the_sequence_data"></a>6.2.1. 
	Copying and naming the sequence data
      </h3></div></div></div><p>
	You need to create (or get from your sequencing provider) the sequencing data
	in either FASTQ or FASTA + FASTA quality format. The following walkthrough
	uses what most people nowadays get: FASTQ.
      </p><p>
	Create a new project directory (e.g. <code class="filename">myProject</code>) and a subdirectory of this which will hold the sequencing data (e.g. <code class="filename">data</code>).
      </p><pre class="screen"><code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>mkdir myProject</code></strong>
<code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>cd myProject</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir data</code></strong></pre><p>
	Put the FASTQ data into that <code class="filename">data</code> directory so that it now looks perhaps like this:
      </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_lane6.solexa.fastq
-rw-r--r-- 1 bach users 264823645 2008-03-28 21:51 bchocse_lane7.solexa.fastq</pre></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      I completely made up the file names above. You can name them any way you
      want. They can also live anywhere on the hard disk; you do not
      need to put them in this <code class="filename">data</code> directory. It's just
      the way I do it ... and it's where the example manifest files a bit further down
      in this chapter will look for the data files.
    </td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ma_copying_and_naming_the_reference_sequence"></a>6.2.2. 
	Copying and naming the reference sequence
      </h3></div></div></div><p>
	The reference sequence (the backbone) can be in a number of different
	formats: GFF3, GenBank, MAF, CAF, FASTA. The first three have the advantage
	of being able to carry additional information like, e.g.,
	annotation. In this example, we will use a GFF3 file like the ones
	one can download from the NCBI.
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	TODO: Write why GFF3 is better and where to get them at the NCBI.
      </td></tr></table></div><p>
	So, let's assume that our wild type
	strain is in the following file:
	<code class="filename">NC_someNCBInumber.gff3</code>.
      </p><p>
	You do not need to copy the reference sequence to your directory, but
	I normally also copy the reference file into the directory with my
	data, as I want to have, at the end of my work, a nice little
	self-sufficient directory which I can archive away and still be sure
	that in 10 years' time I have all the data I need together.
      </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cp /somewhere/NC_someNCBInumber.gff3 data</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users   6543511 2008-04-08 23:53 NC_someNCBInumber.gff3
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_lane6.solexa.fastq
-rw-r--r-- 1 bach users 264823645 2008-03-28 21:51 bchocse_lane7.solexa.fastq</pre><p>
      We're almost finished with the setup. As I like to have things neatly separated, I always create a directory called <code class="filename">assemblies</code> which will hold my assemblies (or different trials) together. Let's quickly do that:
    </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies/1sttrial</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cd assemblies/1sttrial</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_writing_a_simple_manifest_file"></a>6.2.3. 
	Writing a simple manifest file
      </h3></div></div></div><p>
	A manifest file is a configuration file for MIRA which tells it what
	type of assembly it should do and which data it should load. In this
	case we have unpaired sequencing data which we want to map to a
	reference sequence. The manifest file for that is pretty simple:
      </p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Illumina data

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should map a genome in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>

# now the Illumina data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/*fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  Please look up the parameters of the manifest file in the main
	  manual or the example manifest files in the following section.
	</p><p>
	  The ones above basically say: make an accurate mapping of Solexa
	  reads against a genome; in one pass; the name of the backbone strain
	  is 'bchoc_wt'; the data with the backbone sequence (and maybe
	  annotations) is in a specified GFF3 file; for Solexa data: assign
	  default strain names for reads which have not loaded ancillary data
	  with strain info and that default strain name should be 'bchoc_se'.
	</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_starting_assembly"></a>6.2.4. Starting the assembly</h3></div></div></div><p>
	Starting the assembly is now just a matter of a simple command line:
      </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject/assemblies/1sttrial$</code> <strong class="userinput"><code>mira <em class="replaceable"><code>manifest.conf &gt;&amp;log_assembly.txt</code></em></code></strong></pre><p>
	For this example - if you followed the walk-through on how to prepare the data
	- the only things you might want to adapt at first are the following
	options in the manifest file:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    project= (for naming your assembly project)
	  </p></li><li class="listitem"><p>
	    strain_name= to give the names of your reference and mapping strain
	  </p></li></ul></div><p>
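	As a sketch of what that means for the manifest from the previous section, only the three lines below would change. The values shown are hypothetical placeholders, and the strain keyword is written as in the example manifests of this chapter:
      </p><pre class="screen"># name of the assembly project
<strong class="userinput"><code>project = <em class="replaceable"><code>MySecondMapping</code></em></code></strong>

# in the readgroup of the reference sequence
<strong class="userinput"><code>strain = <em class="replaceable"><code>myorg_wt</code></em></code></strong>

# in the readgroup of the sequencing data
<strong class="userinput"><code>strain = <em class="replaceable"><code>myorg_mutant</code></em></code></strong></pre><p>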
	Of course, you are free to change any option via the extended parameters, but
	this is the topic of another part of this manual.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_manifest_files_use_cases"></a>6.3. 
      Manifest files for different use cases
    </h2></div></div></div><p>
      This section will introduce you to manifest files for different use
      cases. It should cover the most important uses, but as always you are
      free to mix and match the parameters and readgroup definitions to suit
      your specific needs.
    </p><p>
      Taking into account that there may be <span class="emphasis"><em>a lot</em></span> of
      combinations of sequencing technologies, sequencing libraries (shotgun,
      paired-end, mate-pair, etc.) and input file types (FASTQ, FASTA,
      GenBank, GFF3, etc.), the example manifest files just use Illumina and
      454 as technologies, GFF3 as input file type for the reference sequence,
      FASTQ as input type for sequencing data ... and they do not show the
      multitude of more advanced features like, e.g., using ancillary clipping
      information in XML files, ancillary masking information in SSAHA2 or
      SMALT files, etc.
    </p><p>
      I'm sure you will be able to find your way by scanning through the
      corresponding section on manifest files in the reference chapter :-)
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_shotgun_data"></a>6.3.1. 
	Mapping with shotgun data
      </h3></div></div></div><p>
	Well, we've seen that already in the section above, but here it is
	again ... this time with Ion Torrent data though.
      </p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Ion data

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should map a genome in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>

# now the Ion Torrent data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIonReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/someiondata.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em>
strain = <em class="replaceable"><code>bchoc_se</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_manifest_for_pairedend_data"></a>6.3.2. 
	Manifest for data sets with paired reads
      </h3></div></div></div><p>
      </p><p>
	When using paired-end data in mapping, you must decide whether you want to
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    use the MIRA feature to create long 'coverage equivalent reads'
	    (CERs), which saves a lot of memory (both in the assembler and
	    later on in an assembly editor). However, you then
	    <span class="emphasis"><em>lose information about read pairs!</em></span>
	  </p></li><li class="listitem"><p>
	    <span class="emphasis"><em>keep information about read
	    pairs</em></span> at the expense of larger memory requirements both
	    in MIRA and in assembly finishing tools or viewers afterwards.
	  </p></li><li class="listitem"><p>
	    use a mix of the two above.
	  </p></li></ol></div><p>
	The Illumina pipeline normally gives you two files for paired-end
	data: a <code class="filename">project-1.fastq</code> and
	<code class="filename">project-2.fastq</code>. The first file contains the
	first read of a read-pair, the second file the second read. Depending
	on the preprocessing pipeline of your sequencing provider, the names
	of the reads are either the very same in both files or already have
	a <code class="literal">/1</code> or <code class="literal">/2</code> appended. Also, your
	sequencing provider may give you one big file where the reads from
	both ends are present.
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  MIRA can read all FASTQ variants produced by various Illumina
	  pipelines, be they with or without the /1 and /2 already appended to
	  the names. You generally do not need to do any name mangling before
	  feeding the data to MIRA. However, MIRA will issue a warning if read names are longer than 40 characters.
	</p></td></tr></table></div><p>
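	If you want to know beforehand whether that warning will show up, a small shell helper can count the affected reads. This is only a sketch: the function name <code class="literal">count_long_names</code> is made up for this example, it assumes plain four-line FASTQ records, and it takes the read name to be everything between the leading @ and the first whitespace, which may differ from how MIRA itself measures names.
      </p><pre class="screen"># hypothetical helper, not part of MIRA:
# prints the number of reads whose name is longer than 40 characters
count_long_names() {
    # keep only the header line of each four-line FASTQ record, then
    # count headers with 41 or more non-blank characters after the @
    awk 'NR % 4 == 1' "$1" | grep -c -E '^@[^[:space:]]{41,}' || true
}

# usage: count_long_names ../../data/project_1.fastq</pre><p>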
	When using paired-end data, you should know
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    the orientation of the reads toward each other. This is specific
	    to sequencing technologies and / or the sequencing library preparation.
	  </p></li><li class="listitem"><p>
	    at which distance these reads should be. This is specific to the
	    sequencing library preparation and the sequencing lab should tell
	    you this.
	  </p></li></ol></div><p>
	In case you do not know one (or any) of the above, don't panic! MIRA
	is able to estimate the needed values during the assembly if you tell
	it to.
      </p><p>
	The following manifest shows you the laziest way to define a
	paired data set by simply adding <span class="emphasis"><em>autopairing</em></span> as a keyword to a
	readgroup (using Illumina just as an example):
      </p><pre class="screen"># Example for a lazy manifest describing a mapping assembly with
# one library of paired reads

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should map a genome in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# first the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>

# now the Illumina paired-end data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
</code></strong></pre><p>
	See? Wasn't hard and it did not hurt, did it? One just needs to tell
	MIRA it should expect paired reads via
	the <span class="emphasis"><em>autopairing</em></span> keyword and that is everything you
	need.
      </p><p>
	If you know the orientation of the reads and/or the library size, you
	can tell MIRA this the following way (just showing the readgroup
	definition here):
      </p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em></code></strong></pre><p>
	In case you are not 100% sure about, e.g., the size of the DNA
	template, you can also give a (generous) expected range and then tell
	MIRA to automatically refine this range during the assembly based on
	real, observed distances of read pairs. Do this with the <span class="emphasis"><em>autorefine</em></span>
	modifier like this:
      </p><pre class="screen"><strong class="userinput"><code>template_size = <em class="replaceable"><code>50 1000 autorefine</code></em></code></strong></pre><p>
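	In context, a sketch of a complete readgroup using this modifier could look as follows. The readgroup name and file names are hypothetical and simply reuse the naming from the readgroup definition shown earlier; the range is the deliberately generous one from the line just shown:
      </p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>50 1000 autorefine</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em></code></strong></pre><p>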
	The following manifest file is an example for mapping a 500 bp
	paired-end and a 3kb mate-pair library of a strain
	called <span class="emphasis"><em>bchoc_se1</em></span> against a GFF3 reference
	file containing a strain called <span class="emphasis"><em>bchoc_wt</em></span>:
      </p><pre class="screen"># Example for a manifest describing a mapping assembly with
# paired Illumina data, not merging reads and therefore keeping
# all pair information

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should map a genome in accurate mode
# As special parameter, we want to switch off merging of Solexa reads

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em>
parameters = <em class="replaceable"><code>SOLEXA_SETTINGS -CO:msr=no</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>

# now the Illumina data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForPairedEnd500bpLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForMatePair3kbLib</code></em>
data = <em class="replaceable"><code>../../data/project3kb-1.fastq ../../data/project3kb-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>2000 4000 autorefine</code></em>
segment_placement = <em class="replaceable"><code>&lt;--- ---&gt;</code></em></code></strong></pre><p>
	Please look up the parameters used in the main manual. The ones
	above basically say: make an accurate mapping of Solexa reads
	against a genome. Additionally, do not merge short Solexa
	reads into the contig.
      </p><p>
	For the paired-end library, be lazy and let MIRA find out everything
	it needs. However, that information should be treated as
	"information only" by MIRA, i.e., it is not used for deciding whether
	a pair is well mapped.
      </p><p>
	For the mate-pair library, assume a DNA template size of
	2000 to 4000 bp (but let MIRA automatically refine this using observed
	distances) and that the segment orientation of the read pairs follows
	the reverse / forward scheme. That information should be treated as
	"information only" by MIRA, i.e., it is not used for deciding whether
	a pair is well mapped.
      </p><p>
	Comparing this manifest with a manifest for unpaired data, three
	settings were added for the Solexa data:
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    <code class="literal">-CO:msr=no</code> tells MIRA not to merge reads that
	    are 100% identical to the backbone. This also allows MIRA to keep the
	    template information (distance and orientation) for the reads.
	  </p></li><li class="listitem"><p>
	    <code class="literal">template_size</code> tells MIRA at which distance the
	    two reads should normally be placed from each other.
	  </p></li><li class="listitem"><p>
	    <code class="literal">segment_placement</code> tells MIRA how the different
	    segments (reads) of a DNA template have to be ordered to form a
	    valid representation of the sequenced DNA.
	  </p></li></ol></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  Note that in mapping assemblies, the
	  <code class="literal">template_size</code> and
	  <code class="literal">segment_placement</code> parameters are normally treated
	  as <span class="emphasis"><em>information only</em></span>, i.e., MIRA will map the
	  reads regardless of whether the distance and orientation criteria are
	  met or not. This enables post-mapping analysis programs to hunt for
	  genome rearrangements or larger insertions/deletions.
	</p></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
	  If template size and segment placement checking were on, the
	  following would happen at, e.g. sites of re-arrangement: MIRA would
	  map the first read of a read-pair without problem. However, it would
	  very probably reject the second read because it would not map at the
	  specified distance or orientation from its partner. Therefore, in
	  mapping assemblies with paired-end data, checking of the template
	  size must be switched off to give post-processing programs a chance
	  to spot re-arrangements.
	</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_multiple_technologies"></a>6.3.3. 
	Mapping with multiple sequencing technologies (hybrid mapping)
      </h3></div></div></div><p>
	I'm sure you'll have picked up the general scheme of manifest files by
	now. Hybrid mapping assemblies follow the same scheme: simply add,
	as a separate readgroup, the information MIRA needs to know to find the
	data and off you go. Just for laughs, here's a manifest for 454
	shotgun with Illumina paired-end data:
      </p><pre class="screen"># Example for a manifest describing a mapping assembly with
# shotgun 454 and paired-end Illumina data, not merging reads and therefore keeping
# all pair information

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should map a genome in accurate mode
# As special parameter, we want to switch off merging of Solexa reads

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em>
parameters = <em class="replaceable"><code>SOLEXA_SETTINGS -CO:msr=no</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>

# now the shotgun 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgun454</code></em>
data = <em class="replaceable"><code>../../data/project454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>

# now the paired-end Illumina data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForPairedEnd500bpLib</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---&gt; &lt;---</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_multiple_strains"></a>6.3.4. 
	Mapping with multiple strains
      </h3></div></div></div><p>
	MIRA will make use of ancillary information present in the manifest
	file. One piece of such information is the strain (or organism or
	cell line, etc.) to which the generated data belongs.
      </p><p>
	You just need to tell in the manifest file which data comes from which
	strain. Let's assume that in the example from above, the "lane6" data
	were from a first mutant named <span class="emphasis"><em>bchoc_se1</em></span> and the
	"lane7" data were from a second mutant
	named <span class="emphasis"><em>bchoc_se2</em></span>. Here's the manifest file you
	would write then:
      </p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Illumina data

# First part: defining some basic things
# In this example, we just give a name to the assembly
#  and tell MIRA it should map a genome in accurate mode

<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"

# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>

# now the Illumina data

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>

<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE2</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane7.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	While mapping (or even assembling de-novo) with multiple strains is
	possible, the interpretation of results may become a bit daunting in
	some cases. For many scenarios it might therefore be preferable to
	use the data sets one after another in separate mappings or assemblies.
      </td></tr></table></div><p>
	This <span class="emphasis"><em>strain</em></span> information for each readgroup is really the only change you need to perform to tell MIRA everything it needs for handling strains.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_walkthroughs"></a>6.4. 
      Walkthroughs
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_walkthrough:_mapping_of_ecoli_from_lenski_lab_against_ecoli_b_rel606"></a>6.4.1. 
	Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606
      </h3></div></div></div><p>
	TODO: Sorry, needs to be re-written for the relatively new SRR format
	distributed at the NCBI ... and changes in MIRA 3.9.x. Please come
	back later.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_useful_about_reference_sequences"></a>6.5. 
      Useful things to know about reference sequences
    </h2></div></div></div><p>
      There are a few things to consider when using reference sequences:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  MIRA is not really made to handle large amounts of reference
	  sequence, as these currently need insane amounts of memory. Use other
	  programs for mapping against more than, say, 200 megabases.
	</p></li><li class="listitem"><p>
	  Reference sequences can be as long as needed! They are not subject
	  to normal read length constraints of a maximum of 32k bases. That
	  is, if one wants to load one or several entire chromosomes of a
	  bacterium or lower eukaryote as backbone sequence(s), this is just
	  fine.
	</p></li><li class="listitem"><p>
	  Reference sequences can be single sequences as provided in, e.g.,
	  FASTA, FASTQ, GFF or GenBank files. But reference sequences can also
	  be whole assemblies when they are provided in, e.g., MAF or CAF
	  format. This opens the possibility to perform semi-hybrid assemblies
	  by first assembling reads from one sequencing technology de-novo
	  (e.g. PacBio) and then mapping reads from another sequencing technology
	  (e.g. Solexa) to the whole PacBio alignment instead of mapping them to
	  the PacBio consensus.
	</p><p>
	  A semi-hybrid assembly will therefore contain, like a hybrid
	  assembly, the reads of both sequencing technologies.
	</p></li><li class="listitem"><p>
	  Reference sequences will not be reversed! They will always appear in
	  forward direction in the output of the assembly. Please note: if the
	  backbone sequence consists of a MAF or CAF file that contains contigs
	  which contain reversed reads, then the contigs themselves will be in
	  forward direction. But the reads they contain that are in reverse
	  complement direction will of course also stay in reverse complement
	  direction.
	</p></li><li class="listitem"><p>
	  Reference sequences will not be assembled together! That is,
	  even if a reference sequence has a perfect overlap with another
	  reference sequence, they will still not be merged.
	</p></li><li class="listitem"><p>
	  Reads are assembled to reference sequences in a first come, first
	  served scattering strategy.
	</p><p>
	  Suppose you have two identical reference sequences and a read which
	  would match both, then the read would be mapped to the first
	  backbone. If you had two identical reads, the first read would go to
	  the first backbone, the second read to the second backbone. With
	  three identical reads, the first backbone would get two reads, the
	  second backbone one read. And so on.
	</p></li><li class="listitem"><p>
	  Only in references loaded from MAF or CAF files: contigs made out of
	  single reads (singlets) lose their status as reference sequence and
	  will be returned to the normal read pool for the assembly
	  process. That is, these sequences will be assembled to other
	  reference sequences or with each other.
	</p></li></ol></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_known_bugs_problems"></a>6.6. 
      Known bugs / problems
    </h2></div></div></div><p>
      These are current as of version 4.0 of MIRA and might or might not have been
      addressed in later versions.
    </p><p>
      Bugs:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  Mapping of paired-end reads with one read being in a non-repetitive
	  area and the other in a repeat is not as effective as it should
	  be. The optimal strategy would be to first map the
	  non-repetitive read and then the read in the repeat. Unfortunately,
	  this is not yet implemented in MIRA.
	</p></li></ol></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_est"></a>Chapter 7. EST / RNASeq assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_est_introduction">7.1. 
      Introduction
    </a></span></dt><dt><span class="sect1"><a href="#sect1_est_preliminaries:on_the_difficulties_of_assembling_ests">7.2. 
      Preliminaries: on the difficulties of assembling ESTs /RNASeq
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_est_poly-a_tails_in_est_data">7.2.1. 
	Poly-A tails
      </a></span></dt><dt><span class="sect2"><a href="#sect2_est_lowly_expressed_transcripts">7.2.2. 
	Lowly expressed transcripts
      </a></span></dt><dt><span class="sect2"><a href="#sect2_est_library_normalisation">7.2.3. 
	Very highly expressed transcripts
      </a></span></dt><dt><span class="sect2"><a href="#sect_est_chimeras">7.2.4. 
	Chimeras
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#est_sect1_est_preprocessing">7.3. 
      Preprocessing of ESTs
    </a></span></dt><dt><span class="sect1"><a href="#sect1_est_est_difference_assembly_clustering">7.4. 
      The difference between <span class="emphasis"><em>assembly</em></span> and
      <span class="emphasis"><em>clustering</em></span>
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_est_snp_splitting">7.4.1. 
	Splitting transcripts into contigs based on SNPs
      </a></span></dt><dt><span class="sect2"><a href="#sect2_est_gap_splitting">7.4.2. 
	Splitting transcripts into contigs based on larger gaps
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect1_est_demopipeline">7.5. 
      A simple step by step pipeline for reliable RNASeq assembly of eukaryotes
    </a></span></dt><dt><span class="sect1"><a href="#idm5079">7.6. 
      Solving common problems of EST assemblies
    </a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Expect the worst. You'll never get disappointed.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_introduction"></a>7.1. 
      Introduction
    </h2></div></div></div><p>
      This document is not complete yet and some sections may be a bit
      unclear. I'd be happy to receive suggestions for improvements.
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note: 
	Some reading requirements
      "><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">
	Some reading requirements
      </th></tr><tr><td align="left" valign="top"><p>
	This guide assumes that you have basic working knowledge of Unix systems, know
	the basic principles of sequencing (and sequence assembly) and what assemblers
	do. Basic knowledge on mRNA transcription should also be present.
      </p><p>
	Please read at some point in time
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    Before the assembly, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: &#8220;<i>Preparing data</i>&#8221;</a> to know what to do (or not to
	    do) with the sequencing data before giving it to MIRA.
	  </p></li><li class="listitem"><p>
	    For setting up the assembly, <a class="xref" href="#chap_denovo" title="Chapter 5. De-novo assemblies">Chapter 5: &#8220;<i>De-novo assemblies</i>&#8221;</a> to know how to
	    start a de-novo assembly (except that you will obviously need to change
	    the --job setting from <span class="emphasis"><em>genome</em></span> to
	    <span class="emphasis"><em>est</em></span>).
	  </p></li><li class="listitem"><p>
	    After the assembly, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: &#8220;<i>Working with the results of MIRA</i>&#8221;</a> to know what to do with the
	    results of the assembly. More specifically, <a class="xref" href="#sect_res_looking_at_results" title="9.1.  MIRA output directories and files">Section 9.1: &#8220;
      MIRA output directories and files
    &#8221;</a>, <a class="xref" href="#sect_res_first_look:the_assembly_info" title="9.2.  First look: the assembly info">Section 9.2: &#8220;
      First look: the assembly info
    &#8221;</a>, <a class="xref" href="#sect_res_converting_results" title="9.3.  Converting results">Section 9.3: &#8220;
      Converting results
    &#8221;</a>, <a class="xref" href="#sect_res_filtering_of_results" title="9.4.  Filtering results">Section 9.4: &#8220;
      Filtering results
    &#8221;</a> and <a class="xref" href="#sect_res_places_of_importance_in_a_de_novo_assembly" title="9.5.  Places of importance in a de-novo assembly">Section 9.5: &#8220;
      Places of importance in a de-novo assembly
    &#8221;</a>.
	  </p></li><li class="listitem"><p>
	    And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: &#8220;<i>MIRA 4 reference manual</i>&#8221;</a> to look up how manifest files should be
	    written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2.  The manifest file: basics">Section 3.4.2: &#8220;
	The manifest file: basics
      &#8221;</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3.  The manifest file: information on the data you have">Section 3.4.3: &#8220;
	The manifest file: information on the data you have
      &#8221;</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4.  The manifest file: extended parameters">Section 3.4.4: &#8220;
	The manifest file: extended parameters
      &#8221;</a>), some command line options as well as general information on
	    what tags MIRA uses in assemblies, which files it generates, etc.
	  </p></li></ul></div></td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_preliminaries:on_the_difficulties_of_assembling_ests"></a>7.2. 
      Preliminaries: on the difficulties of assembling ESTs /RNASeq
    </h2></div></div></div><p>
      Assembling ESTs can be, from an assembler's point of view, pure
      horror. E.g., it may be that some genes have thousands of transcripts
      while other genes have just one single transcript in the sequenced
      data. Furthermore, the presence of 5' and 3' UTRs, transcription
      variants, splice variants, homologues, SNPs etc. complicates the
      assembly in some rather interesting ways.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_poly-a_tails_in_est_data"></a>7.2.1. 
	Poly-A tails
      </h3></div></div></div><p>
	Poly-A tails are part of the mRNA and therefore also part of the sequenced
	data. They can occur as poly-A or poly-T, depending on the direction
	and on which part of the mRNA was sequenced. Having poly-A/T
	tails in the data is something of a double-edged sword. More
	specifically, if the 3' poly-A tail is kept unmasked in the data,
	transcripts having this tail will very probably not align with similar
	transcripts from different splice variants (which is basically
	good). On the other hand, homopolymers (multiple consecutive bases of
	the same type) like poly-As are features that are pretty difficult to
	get correct with today's sequencing technologies, be it Sanger, Solexa
	or, with even more problems, 454. So slight errors in the
	poly-A tail could lead to wrongly assigned splice sites ... and
	wrongly split contigs.
      </p><p>
	This is the reason why many people cut off the poly-A tails. Which in
	turn may lead to transcripts from different splice variants being
	assembled together.
      </p><p>
	Either way, it's not pretty.
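      </p><p>
	As a rough illustration of what detecting such tails involves, the
	following Python sketch looks for an uninterrupted poly-A run at the 3'
	end of a read or a poly-T run at its 5' end. It is not how MIRA clips
	tails; the function names <code class="literal">find_polya_tail</code> and
	<code class="literal">find_polyt_head</code> as well as the minimum run length
	of 10 are assumptions made for this example only.
      </p><pre class="screen">
import re

def find_polya_tail(seq, min_len=10):
    """Return (start, end) of a poly-A run at the 3' end, or None."""
    match = re.search(r"A{%d,}$" % min_len, seq.upper())
    return match.span() if match else None

def find_polyt_head(seq, min_len=10):
    """Return (start, end) of a poly-T run at the 5' end, or None."""
    match = re.match(r"T{%d,}" % min_len, seq.upper())
    return match.span() if match else None

read = "GATTACAGGCTTACG" + "A" * 14
print(find_polya_tail(read))   # (15, 29)
</pre><p>
	A search for exact runs like this one misses tails which contain a
	single miscalled base, which is precisely the homopolymer problem
	described above.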
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_lowly_expressed_transcripts"></a>7.2.2. 
	Lowly expressed transcripts
      </h3></div></div></div><p>
	Single transcripts (or very lowly expressed transcripts) containing
	SNPs, splice variants or similar differences to other, more highly
	expressed transcripts are a problem: it's basically impossible for an
	assembler to distinguish them from reads containing junky data
	(e.g. reads with a high error rate or chimeras). The standard setting
	of many EST assemblers and clusterers is therefore to remove these
	reads from the assembly set. MIRA handles things a bit differently:
	depending on the settings, single transcripts with sufficiently large
	differences are either treated as debris or can be saved as a
	<span class="emphasis"><em>singlet</em></span>.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_library_normalisation"></a>7.2.3. 
	Very highly expressed transcripts
      </h3></div></div></div><p>
	Another interesting problem for de-novo assemblers is non-normalised
	libraries. In each cell, the number of mRNA copies per gene may
	differ by several orders of magnitude, from a single transcript to
	several tens of thousands. Pre-sequencing normalisation is a wet-lab
	procedure to approximately equalise those copy numbers. This can,
	however, introduce other artefacts.
      </p><p>
	If an assembler is fed with non-normalised EST data, it may very well
	be that an overwhelming number of the reads comes only from a few
	genes (house-keeping genes). In Sanger sequencing projects this could
	mean a couple of thousand reads per gene. In 454 sequencing projects,
	this can mean several tens of thousands of reads per gene. With
	Solexa data, this number can grow to something close to a million.
      </p><p>
	Several effects then hit a de-novo assembler, the three most annoying
	being (in ascending order of annoyance): a) non-random sequencing
	errors then look like valid SNPs, b) sequencing and library
	construction artefacts start to look like valid sequences if the data
	set was not cleaned "enough" and, more importantly, c) an explosion in
	time and memory requirements when attempting to deliver a "good"
	assembly. While MIRA has methods to deal with this kind of data
	(e.g. via digital normalisation), a sure sign of the latter are messages
	from MIRA about <span class="emphasis"><em>megahubs</em></span> in the data set.
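      </p><p>
	To give an idea of what digital normalisation means in general, here is
	a minimal Python sketch of one common variant: a read is kept only as
	long as the median count of its k-mers among the reads kept so far is
	below a cutoff. This is a generic illustration and makes no claim about
	how MIRA implements it; the function name
	<code class="literal">digital_normalisation</code>, the k-mer size and the
	cutoff are assumptions chosen to keep the example tiny.
      </p><pre class="screen">
from collections import Counter
from statistics import median

def digital_normalisation(reads, k=5, cutoff=3):
    """Keep a read only while the median count of its k-mers is below cutoff."""
    kmer_counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue   # read shorter than k, nothing to judge it by
        if cutoff > median(kmer_counts[kmer] for kmer in kmers):
            kept.append(read)
            kmer_counts.update(kmers)
    return kept

# 50 copies of a "highly expressed" read plus one rare read
reads = ["ACGTTGCAAG"] * 50 + ["TTGACCATGG"]
print(len(digital_normalisation(reads)))   # 4
</pre><p>
	Of the 50 identical reads only three survive, while the single rare
	read is kept untouched.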
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	The guide on how to tackle <span class="emphasis"><em>hard</em></span> projects with
	MIRA gives an overview on how to hunt down sequences which can lead to
	the assembler getting confused, be it sequencing artefacts or highly
	expressed genes.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_est_chimeras"></a>7.2.4. 
	Chimeras
      </h3></div></div></div><p>
	Chimeras are sequences containing adjacent base stretches which do
	not occur in the organism as sequenced, neither as DNA nor as
	(m)RNA. Chimeras can be created through recombination effects during
	library construction or sequencing. Chimeras can, and often do, lead
	to misassemblies of sequence stretches into one contig although they
	do not belong together. Have a look at the following example where two
	stretches (denoted by <code class="literal">x</code> and <code class="literal">o</code>)
	are joined by a chimeric read <span class="emphasis"><em>r4</em></span> containing both
	stretches:
      </p><pre class="screen">
r1 xxxxxxxxxxxxxxxx
r2 xxxxxxxxxxxxxxxxx
r3 xxxxxxxxxxxxxxxxx
r4 xxxxxxxxxxxxxxxxxxx|oooooooooooooo
r5                        ooooooooooo
r6                        ooooooooooo
r7                          ooooooooo</pre><p>
      The site of the recombination event is denoted by <code class="literal">x|o</code>
      in read <span class="emphasis"><em>r4</em></span>.
    </p><p>
      MIRA does have a chimera detection -- which works very well in genome
      assemblies due to high enough coverage -- by searching for sequence
      stretches which are not covered by overlaps. In the above example, the
      chimera detection routine will almost certainly flag read
      <span class="emphasis"><em>r4</em></span> as a chimera and only use a part of it: either the
      <code class="literal">x</code> or <code class="literal">o</code> part, depending on which
      part is longer. There is always a chance that <span class="emphasis"><em>r4</em></span> is
      a valid read though, but that's a risk to take.
    </p><p>
      Now, that strategy would also work totally fine in EST projects if one
      did not have to account for lowly expressed genes. Imagine the
      following situation:
    </p><pre class="screen">
s1 xxxxxxxxxxxxxxxxx
s2         xxxxxxxxxxxxxxxxxxxxxxxxx
s3                          xxxxxxxxxxxxxxx
    </pre><p>
      Look at read <span class="emphasis"><em>s2</em></span>; from an overlap coverage
      perspective, <span class="emphasis"><em>s2</em></span> could also very well be a chimera,
      leading to a break of an otherwise perfectly valid contig if
      <span class="emphasis"><em>s2</em></span> were cut back accordingly. This is why chimera
      detection is switched off by default in EST assemblies with MIRA.
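    </p><p>
      The idea of looking for stretches not covered by overlaps can be
      sketched in a few lines of Python. This is only an illustration of the
      principle, not MIRA's routine; the function name
      <code class="literal">uncovered_stretches</code> is hypothetical and the
      coordinates are approximate values read off the two diagrams above.
    </p><pre class="screen">
def uncovered_stretches(read_len, overlaps):
    """Return the parts of a read that no overlap with another read covers.

    overlaps is a list of (start, end) intervals on the read, end exclusive.
    """
    gaps = []
    pos = 0
    for start, end in sorted(overlaps):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if read_len > pos:
        gaps.append((pos, read_len))
    return gaps

# r4: overlaps with r1-r3 on the x part and with r5-r7 on the o part
print(uncovered_stretches(33, [(0, 17), (22, 33)]))   # [(17, 22)]
# s2: overlaps with s1 at its start and with s3 at its end
print(uncovered_stretches(25, [(0, 9), (17, 25)]))    # [(9, 17)]
</pre><p>
      Both reads show an uncovered stretch in their middle, so a test based
      purely on overlap coverage cannot tell the chimera
      <span class="emphasis"><em>r4</em></span> from the valid read
      <span class="emphasis"><em>s2</em></span>.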
    </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
	When starting an EST assembly via the <code class="literal">--job=est,...</code>
	switch, chimera detection is switched off by default. It is absolutely
	possible to switch on the SKIM chimera detection afterwards via
	 [-CL:ascdc]. However, this will have exactly the effects
	described above: chimeras in higher coverage contigs will be detected,
	but perfectly valid low coverage contigs will be torn apart.
      </p><p>
	It is up to you to decide what you want or need.
      </p></td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="est_sect1_est_preprocessing"></a>7.3. 
      Preprocessing of ESTs
    </h2></div></div></div><p>
      With contributions from Katrina Dlugosch
    </p><p>
      EST sequences necessarily contain fragments of vectors or primers used
      to create cDNA libraries from RNA, and may additionally contain primer
      and adaptor sequences used during amplification-based library
      normalisation and/or high-throughput sequencing.  These contaminant
      sequences need to be removed prior to assembly.  MIRA can trim sequences
      by taking contaminant location information from a SSAHA2 or SMALT search
      output, or users can remove contaminants beforehand by trimming
      sequences themselves or masking unwanted bases with lowercase or other
      characters (e.g. 'x', as with <span class="command"><strong>cross_match</strong></span>).  Many
      folks use preprocessing trimming/masking pipelines because it can be
      very important to try a variety of settings to verify that you've
      removed all of your contaminants (and fragments thereof) before sending
      them into an assembly program like MIRA.  It can also be good to spend
      some time seeing what contaminants are in your data, so that you get to
      know what quality issues are present and how pervasive.
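    </p><p>
      Masking itself is conceptually simple, as the following Python sketch
      shows: every exact occurrence of a known contaminant is replaced by
      'x' characters. Real tools such as
      <span class="command"><strong>cross_match</strong></span> use alignments and tolerate
      mismatches, which this sketch does not; the function name
      <code class="literal">mask_contaminant</code> and the adaptor sequence are made
      up for the example.
    </p><pre class="screen">
def mask_contaminant(seq, contaminant, mask_char="x"):
    """Replace every exact occurrence of contaminant in seq by mask_char."""
    upper_seq = seq.upper()
    upper_cont = contaminant.upper()
    masked = list(seq)
    start = upper_seq.find(upper_cont)
    while start != -1:
        end = start + len(upper_cont)
        masked[start:end] = mask_char * len(upper_cont)
        start = upper_seq.find(upper_cont, start + 1)
    return "".join(masked)

adaptor = "GGGTTTCCCAAAGGG"   # hypothetical adaptor sequence
read = "ACGTACGT" + adaptor + "GGCCTTAA"
print(mask_contaminant(read, adaptor))   # ACGTACGTxxxxxxxxxxxxxxxGGCCTTAA
</pre><p>
      Looking at what such a simple exact search leaves behind is one quick
      way to see how many degraded contaminant fragments are in the data.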
    </p><p>
      Two features of next generation sequencing can introduce errors into
      contaminant sequences that make them particularly difficult to remove,
      arguing for preprocessing: First, most next-generation sequence
      platforms seem to be sensitive to excess primers present during library
      preparation, and can produce a small percentage of sequences composed
      entirely of concatenated primer fragments.  These are among the most
      difficult contaminants to remove, and the program TagDust (<a class="ulink" href="http://genome.gsc.riken.jp/osc/english/dataresource/" target="_top">http://genome.gsc.riken.jp/osc/english/dataresource/</a>) was
      recently developed specifically to address this problem. Second, 454 EST
      data sets can show high variability within primer sequences designed to
      anchor to polyA tails during cDNA synthesis, because 454 has trouble
      calling the length of the necessary A and T nucleotide repeats with
      accuracy.
    </p><p>
      A variety of programs exist for preprocessing.  Popular ones include
      cross_match (<a class="ulink" href="http://www.phrap.org/phredphrapconsed.html" target="_top">http://www.phrap.org/phredphrapconsed.html</a>)
      for primer masking, and SeqClean (<a class="ulink" href="http://compbio.dfci.harvard.edu/tgi/software/" target="_top">http://compbio.dfci.harvard.edu/tgi/software/</a>), Lucy (<a class="ulink" href="http://lucy.sourceforge.net/" target="_top">http://lucy.sourceforge.net/</a>), and SeqTrim (<a class="ulink" href="http://www.scbi.uma.es/cgi-bin/seqtrim/seqtrim_login.cgi" target="_top">http://www.scbi.uma.es/cgi-bin/seqtrim/seqtrim_login.cgi</a>) for
      both primer and polyA/T trimming.  The pipeline SnoWhite (<a class="ulink" href="http://evopipes.net" target="_top">http://evopipes.net</a>) combines SeqClean and TagDust with custom
      scripts for aggressive sequence and polyA/T trimming (and is tolerant of
      data already masked using cross_match).  In all cases, the user must
      provide contaminant sequence information and adjust settings for how
      sensitive the programs should be to possible matches.  To find the best
      settings, it is helpful to look directly at some of the sequences that
      are being trimmed and inspect them for remaining primer and/or polyA/T
      fragments after cleaning.
    </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
      When using <span class="command"><strong>mira</strong></span> or
      <span class="command"><strong>miraSearchESTSNPs</strong></span> with the simplest parameter
      calls (using the "--job=..." quick switches), the default settings used
      include pretty heavy sequence pre-processing to cope with noisy
      data. If you have your own pre-processing pipeline, you
      <span class="emphasis"><em>must</em></span> then switch off those clip algorithms that
      you have already applied yourself. Especially poly-A clips
      should never be run twice (by your pipeline and by
      <span class="command"><strong>mira</strong></span>) as they invariably lead to too many bases being
      cut away in some sequences.
    </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      Here too: In some cases MIRA can get confused if something with the
      pre-processing went wrong because, e.g., unexpected sequencing artefacts
      like unknown sequencing vectors or adaptors remain in the data. The guide on
      how to tackle <span class="emphasis"><em>hard</em></span> projects with MIRA gives an
      overview on how to hunt down sequences which can lead to the assembler
      getting confused, be it sequencing artefacts or highly expressed genes.
    </td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_est_difference_assembly_clustering"></a>7.4. 
      The difference between <span class="emphasis"><em>assembly</em></span> and
      <span class="emphasis"><em>clustering</em></span>
    </h2></div></div></div><p>
      MIRA in its base settings is an <span class="emphasis"><em>assembler</em></span> and not a
      <span class="emphasis"><em>clusterer</em></span>, although it can be configured as such. As
      an assembler, it will split up read groups into different contigs if it
      thinks there is enough evidence that they come from different RNA
      transcripts.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_snp_splitting"></a>7.4.1. 
	Splitting transcripts into contigs based on SNPs
      </h3></div></div></div><p>
	Imagine this simple case: a gene has two slightly different alleles and you've
	sequenced this:
      </p><pre class="screen">
A1-1  ...........T...........
A1-2  ...........T...........
A1-3  ...........T...........
A1-4  ...........T...........
A1-5  ...........T...........
B2-1  ...........G...........
B2-2  ...........G...........
B2-3  ...........G...........
B2-4  ...........G...........
      </pre><p>
	Depending on base qualities and settings used during the assembly
	like, e.g., [-CO:mr:mrpg:mnq:mgqrt:emea:amgb], MIRA will
	recognise that there's enough evidence for a T and also enough
	evidence for a G at that position and create two contigs, one
	containing the "T" allele, one the "G". The consensus will be &gt;99%
	identical, but not 100%.
      </p><p>
	Things become complicated if one has to account for errors in
	sequencing. Imagine you sequenced the following case:
      </p><pre class="screen">
A1-1  ...........T...........
A1-2  ...........T...........
A1-3  ...........T...........
A1-4  ...........T...........
A1-5  ...........T...........
B2-1  ...........<span class="bold"><strong>G</strong></span>...........
      </pre><p>
	This looks very much like the case above, except that
	there's only one read with a "G" instead of four. MIRA will, when
	using standard settings, treat this as an erroneous base and leave all
	these reads in one contig. It will likewise not mark it as a SNP in
	the results. However, this could also very well be a lowly expressed
	transcript with a single base mutation. It's virtually impossible to
	tell which of the possibilities is right.
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	You can of course force MIRA to mark situations like the one depicted
	above by, e.g., changing the parameters
	for [-CO:mrpg:mnq:mgqrt]. But this may have the side-effect
	that sequencing errors get an increased chance of getting flagged as
	SNPs.
      </td></tr></table></div><p>
	Further complications arise when SNPs and potential sequencing errors
	meet at the same place. Consider the following case:
      </p><pre class="screen">
A1-1  ...........T...........
A1-2  ...........T...........
A1-3  ...........T...........
A1-4  ...........T...........
A1-5  ...........T...........
B2-1  ...........G...........
B2-2  ...........G...........
B2-3  ...........G...........
B2-4  ...........G...........
E1-1  ...........<span class="bold"><strong>A</strong></span>...........
      </pre><p>
	This example is exactly like the first one, except an additional read
	<code class="literal">E1-1</code> has made its appearance and has an "A"
	instead of a "G" or "T". Again it is impossible to tell whether this
	is a sequencing error or a real SNP. MIRA handles these cases in the
	following way: it will recognise two valid read groups (one having a
	"T", the other a "G") and, in assembly mode, split these two groups
	into different contigs. It will also play safe and define that the
	single read <code class="literal">E1-1</code> will not be attributed to either
	one of the contigs but, if it cannot be assembled with other reads, form
	its own contig ... if need be, even as a single read (a
	<span class="emphasis"><em>singlet</em></span>).
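      </p><p>
	The three cases discussed in this section can be mimicked with a small
	Python sketch which counts the bases seen at one alignment column and
	separates well supported groups from weakly supported bases. It is a
	strong simplification: MIRA's real decision also depends on base
	qualities and on the settings named above. The function name
	<code class="literal">base_groups</code> and the threshold
	<code class="literal">min_reads_per_group</code> are hypothetical.
      </p><pre class="screen">
from collections import Counter

def base_groups(column, min_reads_per_group=2):
    """Split the bases of one alignment column into supported groups and
    bases with too little support to form a group of their own."""
    counts = Counter(column)
    groups = {b: n for b, n in counts.items() if n >= min_reads_per_group}
    weak = {b: n for b, n in counts.items() if b not in groups}
    return groups, weak

print(base_groups("TTTTTGGGG"))    # ({'T': 5, 'G': 4}, {})
print(base_groups("TTTTTG"))       # ({'T': 5}, {'G': 1})
print(base_groups("TTTTTGGGGA"))   # ({'T': 5, 'G': 4}, {'A': 1})
</pre><p>
	Two supported groups correspond to a split into two contigs, and a
	weakly supported base corresponds either to a presumed sequencing error
	or to a read set aside like <code class="literal">E1-1</code>.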
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	Depending on some settings, singlets may either appear in the regular
	results or end up in the debris file.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_gap_splitting"></a>7.4.2. 
	Splitting transcripts into contigs based on larger gaps
      </h3></div></div></div><p>
	Gaps in alignments of transcripts are handled very cautiously by
	MIRA. The standard settings will lead to the creation of different
	contigs if three or more consecutive gaps are introduced in an
	alignment. Consider the following example:
      </p><pre class="screen">
A1-1  ..........CGA..........
A1-2  ..........*GA..........
A1-3  ..........**A..........
B2-1  ..........<span class="bold"><strong>***</strong></span>..........
B2-2  ..........<span class="bold"><strong>***</strong></span>..........
      </pre><p>
	Under normal circumstances, MIRA will use the reads
	<code class="literal">A1-1</code>, <code class="literal">A1-2</code> and
	<code class="literal">A1-3</code> to form one contig and put
	<code class="literal">B2-1</code> and <code class="literal">B2-2</code> into a separate
	contig. MIRA would do this also if there were only one of the B2
	reads.
      </p><p>
	The reason behind this is that the probability for having gaps of
	three or more bases only due to sequencing errors is pretty
	low. MIRA will therefore treat reads with such attributes as coming
	from different transcripts and not assemble them together, though
	this can be changed using the [-AL:egp:egpl] parameters of
	MIRA if wanted.
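      </p><p>
	The rule of three or more consecutive gaps can be illustrated with the
	aligned reads from the example above. The Python sketch below only
	measures the longest gap run per read; it does not reproduce MIRA's
	alignment logic, and the names <code class="literal">longest_gap_run</code>,
	<code class="literal">splits_contig</code> and
	<code class="literal">max_gap_run</code> are assumptions for this example.
      </p><pre class="screen">
import re

def longest_gap_run(aligned_read, gap_char="*"):
    """Length of the longest run of consecutive gap characters."""
    runs = re.findall(re.escape(gap_char) + "+", aligned_read)
    return max((len(run) for run in runs), default=0)

def splits_contig(aligned_read, max_gap_run=2):
    """True if the read has more consecutive gaps than max_gap_run allows."""
    return longest_gap_run(aligned_read) > max_gap_run

for name, row in [("A1-2", "..........*GA.........."),
                  ("A1-3", "..........**A.........."),
                  ("B2-1", "..........***..........")]:
    print(name, longest_gap_run(row), splits_contig(row))
# A1-2 1 False
# A1-3 2 False
# B2-1 3 True
</pre><p>
	Only the B2 read exceeds the limit, which matches the split into two
	contigs described above.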
      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning: 
	  Problems with homopolymers, especially in 454, Ion Torrent and high
	  coverage Illumina
	"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">
	  Problems with homopolymers, especially in 454, Ion Torrent and high
	  coverage Illumina
	</th></tr><tr><td align="left" valign="top"><p>
	  As 454 and Ion Torrent sequencing has a general problem with
	  homopolymers, this rule of MIRA will sometimes lead to the formation of
	  more contigs than expected due to sequencing errors at "long"
	  homopolymer sites ... where long starts at ~6-7 bases. Though MIRA
	  does know about the problem in 454 homopolymers and has some
	  routines which try to mitigate the problem, this is not always
	  successful.
	</p><p>
	  The same applies to Illumina data with long homopolymers (~ 8-9 bp)
	  and high coverage (&#8805; 100x).
	</p></td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_demopipeline"></a>7.5. 
      A simple step by step pipeline for reliable RNASeq assembly of eukaryotes
    </h2></div></div></div><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  Remove rRNA sequences. For that I use <span class="command"><strong>mirabait</strong></span> like this:
	</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mirabait -I -j rrna <em class="replaceable"><code>-p norRNAfile_1.fastq norRNAfile_2.fastq ...</code></em></code></strong></pre></li><li class="listitem"><p>
	  Clean the data. For this I use mira, asking it to perform only a
	  preprocessing of the data from step 1 via a line like this
	</p><pre class="screen">parameters = -AS:nop=0</pre><p>
	  in the manifest file. After preprocessing, the results will be
	  available as a MAF file in
	  <code class="filename">*_assembly/*_d_chkpt/readpool.maf</code>.
	</p></li><li class="listitem"><p>
	  As the MAF file contains paired reads together, they need to be
	  separated again. Additionally, I perform a hard cut of the clipped
	  sequence. This is a job for <span class="command"><strong>miraconvert</strong></span>:
	</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -C -F -F readpool.maf</code></strong></pre></li><li class="listitem"><p>
	  I then use FLASH to merge paired reads together, using high overlap and zero allowed errors.
	</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>...</code></strong></pre><p>
	  FLASH will create three files for this: one file with joined pairs,
	  one file with unjoined pairs and one file with orphan reads (i.e.,
	  reads which have no mate). I generally continue with just the joined
	  and unjoined files.
	</p></li><li class="listitem"><p>
	  Reduce the dataset to a reasonable size. Using 3 or 4 gigabases to
	  reconstruct a eukaryotic transcriptome should yield pretty good
	  transcripts without too much noise and capture all but the rarest
	  transcripts.
	</p><p>
	  Depending on the Illumina read length (100, 125, 150, 250 or 300) I
	  generally go for a 1:1 or 2:1 ratio of joined versus unjoined
	  reads. E.g., if I need to extract 2 gigabases of joined FLASH
	  results and 1 gigabase of unjoined FLASH results I do this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -Y 2000000 <em class="replaceable"><code>flashjoined.fastq reduced2gb_flashjoined.fastq</code></em></code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -Y 1000000 <em class="replaceable"><code>flashunjoined.fastq reduced1gb_flashunjoined.fastq</code></em></code></strong></pre><p>
	</p></li><li class="listitem"><p>
	  Assemble the cleaned, joined and reduced data set. A simple manifest
	  file like this will suffice:
	</p><pre class="screen">project = myRNASEQ
job=est,denovo,accurate

readgroup
technology=solexa
autopairing
data=reduced2gb_flashjoined.fastq reduced1gb_flashunjoined.fastq
</pre></li></ol></div><p>
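      To check how many bases a FASTQ file from the pipeline above actually
      holds, for example after the reduction in step 5, a few lines of Python
      suffice. This is a minimal sketch and not part of MIRA; the function
      name <code class="literal">count_bases</code> is hypothetical and the code
      assumes uncompressed FASTQ with exactly four lines per read. The tiny
      file written at the end exists only to show the call:
    </p><pre class="screen">
import os
import tempfile

def count_bases(fastq_path):
    """Return (number of reads, total bases) of a 4-line-per-read FASTQ file."""
    reads = 0
    bases = 0
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:      # the sequence line of each record
                reads += 1
                bases += len(line.strip())
    return reads, bases

# Tiny self-made FASTQ file, only for demonstration
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n")
demo_reads, demo_bases = count_bases(tmp.name)
os.remove(tmp.name)
print(demo_reads, demo_bases)   # 2 12
</pre><p>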
      The result can be annotated and quality controlled. However, this will
      still contain duplicate genes (due to, e.g., ploidy variants) or gene
      fragments (due to ploidy variants, splice variants, sequencing
      errors). To reduce this number I generally do the following:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem">
	Extract CDS of the annotated sequences. Make sure that your pipeline
	also annotates hypothetical proteins with a length &#8805; 300 bp.
      </li><li class="listitem"><p>
	  Cluster the CDS sequences with MIRA, using a high similarity threshold:
	</p><pre class="screen">project = myRNASEQclustering
job=est,clustering,accurate
parameters = --noclipping
parameters = TEXT_SETTINGS -AS:mrs=94

readgroup
technology=text
autopairing
data=fna::CDSfromAnnotation.fasta
</pre></li></ol></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="idm5079"></a>7.6. 
      Solving common problems of EST assemblies
    </h2></div></div></div><p>
      ... continue here ...
    </p><p>
      Megahubs: track down the reason (highly expressed genes, sequencing
      vector or adaptor remnants; see mira_hard) and eliminate it.
    </p></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_specialparams"></a>Chapter 8. Parameters for special situations</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_sp_introduction">8.1. 
      Introduction
    </a></span></dt><dt><span class="sect1"><a href="#sect_sp_pacbio">8.2. 
      PacBio
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_sp_pacbio_ccs">8.2.1. 
	PacBio CCS reads
      </a></span></dt><dt><span class="sect2"><a href="#sect_sp_pacbio_ec">8.2.2. 
	PacBio error corrected reads
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">... .
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_sp_introduction"></a>8.1. 
      Introduction
    </h2></div></div></div><p>
      Most of this chapter and many sections are just stubs at the moment.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_sp_pacbio"></a>8.2. 
      PacBio
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_sp_pacbio_ccs"></a>8.2.1. 
	PacBio CCS reads
      </h3></div></div></div><p>
	Declare the sequencing technology to be high-quality PacBio (<span class="bold"><strong>PCBIOHQ</strong></span>). The last time I worked with CCS, the
	ends of the reads were not really clean, so using the proposed end
	clipping (which needs to be manually switched on for PCBIOHQ reads)
	may be advisable.
      </p><pre class="screen"><strong class="userinput"><code>...
parameters = PCBIOHQ_SETTINGS -CL:pec=yes
...

readgroup
technology=pcbiohq
data=...
...</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_sp_pacbio_ec"></a>8.2.2. 
	PacBio error corrected reads
      </h3></div></div></div><p>
	Declare the sequencing technology to be high-quality PacBio (<span class="bold"><strong>PCBIOHQ</strong></span>). For self-corrected data or data
	corrected with other sequencing technologies, it is recommended to
	change the  [-CO:mrpg] setting to a value which is 1/4th to
	1/5th of the average coverage of the corrected PacBio reads across the
	genome. E.g.:
      </p><pre class="screen"><strong class="userinput"><code>...
parameters = PCBIOHQ_SETTINGS -CO:mrpg=5
...

readgroup
technology=pcbiohq
data=...
...</code></strong></pre><p>
	for a project which has ~24x coverage. This necessity may change in
	later versions of MIRA though.
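      </p><p>
	The rule of thumb above is simple arithmetic. As a minimal sketch (the
	helper name <code class="literal">suggested_mrpg_range</code> is made up for this
	illustration and is not part of MIRA), the following Python lines
	compute the 1/5th to 1/4th window for a given average coverage:
      </p>

```python
def suggested_mrpg_range(avg_coverage):
    """Return the (1/5th, 1/4th) window of the average coverage.

    Hypothetical helper, not part of MIRA: it only restates the rule of
    thumb from the text as arithmetic.
    """
    return avg_coverage / 5, avg_coverage / 4


low, high = suggested_mrpg_range(24)
print(f"choose -CO:mrpg between {low:.1f} and {high:.1f}")
```

      <p>
	For the ~24x project of the example this gives a window of 4.8 to 6.0,
	and the value of 5 used above lies inside it.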
      </p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_results"></a>Chapter 9. Working with the results of MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_res_looking_at_results">9.1. 
      MIRA output directories and files
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_resultsdir">9.1.1. 
	The <code class="filename">*_d_results</code> directory
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_resultsdir_denovo">9.1.1.1. 
	  Additional 'large contigs' result files for de-novo assemblies of genomes
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_res_infodir">9.1.2. 
	The <code class="filename">*_d_info</code> directory
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_res_first_look:the_assembly_info">9.2. 
      First look: the assembly info
    </a></span></dt><dt><span class="sect1"><a href="#sect_res_converting_results">9.3. 
      Converting results
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_converting_miraconvert">9.3.1. 
	Converting to and from other formats: <span class="command"><strong>miraconvert</strong></span>
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_converting_reach_other_programs">9.3.2. 
	Steps for converting data from / to other tools
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_converting_to_from_staden">9.3.2.1. 
	  Example: converting to and from the Staden package (gap4 / gap5)
	</a></span></dt><dt><span class="sect3"><a href="#sect_res_converting_to_from_sam">9.3.2.2. 
	  Example: converting to and from SAM (for samtools, tablet etc.)
	</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_res_filtering_of_results">9.4. 
      Filtering results
    </a></span></dt><dt><span class="sect1"><a href="#sect_res_places_of_importance_in_a_de_novo_assembly">9.5. 
      Places of importance in a de-novo assembly
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_tags_set_by_mira">9.5.1. 
	Tags set by MIRA
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_other_places_of_importance">9.5.2. 
	Other places of importance
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_joining_contigs">9.5.3. 
	Joining contigs
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_joining_truerepeats">9.5.3.1. 
	  Joining contigs at true repetitive sites
	</a></span></dt><dt><span class="sect3"><a href="#sect_res_joining_FALSErepeats">9.5.3.2. 
	  Joining contigs at "wrongly discovered" repetitive sites
	</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_res_places_of_interest_in_a_mapping_assembly">9.6. 
      Places of interest in a mapping assembly
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_poi_where_are_snps?">9.6.1. 
	Where are SNPs?
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_where_are_insertions_deletions_or_genome_rearrangements?">9.6.2. 
	Where are insertions, deletions or genome re-arrangements?
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_other_tags_of_interest">9.6.3. 
	Other tags of interest
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_res_postprocessing_mapping_assemblies">9.7. 
      Post-processing mapping assemblies
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_pp_manual_cleanup">9.7.1. 
	Manual cleanup and validation (optional)
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_comprehensive_snp_analysis_spreadsheet_tables_for_excel_or_oocalc">9.7.2. 
	Comprehensive SNP analysis spreadsheet tables (for Excel or OOcalc)
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_html_files_depicting_snp_positions_and_deletions">9.7.3. 
	HTML files depicting SNP positions and deletions
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_wig_files">9.7.4. 
	WIG files depicting contig coverage or GC content
      </a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_tables_for_feature_coverage">9.7.5. 
	Comprehensive spreadsheet tables for gene expression values / genome deletions &amp; duplications
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">You have to know what you're looking for before you can find it.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><p>
    MIRA makes results available in quite a number of formats: CAF, ACE, FASTA and
    a few others. The preferred formats are CAF and MAF, as these formats can be
    translated into any other supported format.
  </p><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_looking_at_results"></a>9.1. 
      MIRA output directories and files
    </h2></div></div></div><p>
      For the assembly, MIRA creates a directory named
      <code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly</code>
      which contains a number of sub-directories.
    </p><p>
      These sub-directories (and files within) contain the results of the
      assembly itself, general information and statistics on the results and
      -- if not deleted automatically by MIRA -- a tmp directory with log
      files and temporary data:
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>:
	  this directory contains all the output files of the assembly in
	  different formats.
	</p></li><li class="listitem"><p>
	  <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>:
	  this directory contains information files of the final
	  assembly. They provide statistics as well as, e.g., information
	  (easily parsable by scripts) on which read is found in which
	  contig etc.
	</p></li><li class="listitem"><p>
	  <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_tmp</code>:
	  this directory contains log files and temporary assembly files. It
	  can be safely removed after an assembly, as it may easily hold a
	  few GB of data that are normally not needed anymore.
	</p><p>
	  The default settings of MIRA are such that really big files are
	  automatically deleted when they are not needed anymore during an
	  assembly.
	</p></li><li class="listitem"><p>
	  <code class="filename"><em class="replaceable"><code>projectname</code></em>_d_chkpt</code>:
	  this directory contains checkpoint files needed to resume
	  assemblies that crashed or were stopped.
	</p></li></ul></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_resultsdir"></a>9.1.1. 
	The <code class="filename">*_d_results</code> directory
      </h3></div></div></div><p>
	The following files in
	<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>
	contain results of the assembly in different formats. Depending on the
	output options you defined for MIRA, some files may or may not be
	there. As long as the CAF or MAF file is present, you can translate
	your assembly later on to almost any supported format with the
	<span class="command"><strong>miraconvert</strong></span> program supplied with the MIRA
	distribution:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.txt</code>:
	    this file contains in a human readable format the aligned assembly
	    results, where all input sequences are shown in the context of the
	    contig they were assembled into. This file is just meant as a
	    quick way for people to have a look at their assembly without
	    specialised alignment finishing tools.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.padded.fasta</code>:
	    this file contains as FASTA sequence the consensus of the contigs
	    that were assembled in the process.  Positions in the consensus
	    containing gaps (also called 'pads', denoted by an asterisk) are
	    still present. The computed consensus qualities are in the
	    corresponding
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.padded.fasta.qual</code>
	    file.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.unpadded.fasta</code>:
	    as above, this file contains as FASTA sequence the consensus of
	    the contigs that were assembled in the process, but positions in
	    the consensus containing gaps were removed. The computed consensus
	    qualities are in the corresponding
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.unpadded.fasta.qual</code>
	    file.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.caf</code>:
	    this is the result of the assembly in CAF format, which can be
	    further worked on with, e.g., tools from the
	    <span class="emphasis"><em>caftools</em></span> package from the Sanger Centre and
	    later on be imported into, e.g., the Staden gap4 assembly and
	    finishing tool.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.ace</code>:
	    this is the result of the assembly in ACE format. This format can
	    be read by viewers like the TIGR clview or by consed from the
	    phred/phrap/consed package.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_out.gap4da</code>:
	    this directory contains the result of the assembly suited for the
	    <span class="emphasis"><em>direct assembly</em></span> import of the Staden gap4
	    assembly viewer and finishing tool.
	  </p></li></ul></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_resultsdir_denovo"></a>9.1.1.1. 
	  Additional 'large contigs' result files for de-novo assemblies of genomes
	</h4></div></div></div><p>
	  For de-novo assemblies of genomes, MIRA makes a proposal regarding
	  which contigs you probably want to have a look at ... and which ones
	  you can probably forget about.
	</p><p>
	  This proposal relies on the <span class="emphasis"><em>largecontigs</em></span> file
	  in the info directory (see section below). MIRA automatically
	  extracts these contigs into all the formats you requested for your
	  results.
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      The result files for 'large contigs' are all named
	      <code class="filename"><em class="replaceable"><code>projectname</code></em>_<span class="emphasis"><em>LargeContigs</em></span>_out.<em class="replaceable"><code>resulttype</code></em></code>.
	    </p></li><li class="listitem"><p>
	      <code class="filename">extractLargeContigs.sh</code>: this is a small
	      shell script which just contains the call
	      to <span class="command"><strong>miraconvert</strong></span> with which MIRA extracted the
	      large contigs for you. In case you want to redefine what large
	      contigs are for you, feel free to use this as a template.
	    </p></li></ul></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_infodir"></a>9.1.2. 
	The <code class="filename">*_d_info</code> directory
      </h3></div></div></div><p>
	The following files in
	<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>
	contain statistics and other information files of the assembly:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_assembly.txt</code>:
	    This file should be your first stop after an assembly. It will
	    tell you some statistics as well as whether or not problematic
	    areas remain in the result.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_callparameters.txt</code>:
	    This file contains the parameters as given on the mira command
	    line when the assembly was started.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_consensustaglist.txt</code>:
	    This file contains information about the tags (and their position)
	    that are present in the consensus of a contig.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigreadlist.txt</code>:
	    This file contains information on which reads have been assembled
	    into which contigs (or singlets).
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigstats.txt</code>:
	    This file contains in tabular format statistics about the contigs
	    themselves, their length, average consensus quality, number of
	    reads, maximum and average coverage, average read length, number
	    of A, C, G, T, N, X and gaps in consensus.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_debrislist.txt</code>:
	    This file contains the names of all the reads which were not
	    assembled into contigs (or singlets if appropriate MIRA parameters
	    were chosen). The file has two columns: first column is the name
	    of the read, second column is a code showing the reason and stage
	    at which the read was put into the debris category.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_largecontigs.txt</code>:
	    This file contains, as a simple list, the names of all the contigs
	    MIRA considers more or less important at the end of the
	    assembly. To be present in this list, a contig needs to reach a
	    certain length (usually 500 bp, but see [-MI:lcs]) and have a
	    coverage of at least 1/3 of the average coverage (per sequencing
	    technology) of the complete project.
	  </p><p>
	    Note: only present for de-novo assemblies of genomes.
	  </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	    The default heuristics (500 bp length and 1/3 coverage per
	    sequencing technology) generally work well enough for most
	    projects. However, projects with extremely different coverage
	    numbers per sequencing technology may need different
	    values. E.g.: in a project with 80x Illumina and 6x Sanger,
	    contigs consisting of only 2 or 3 Sanger sequences but with an
	    average coverage &gt;= 2 would also end up in this list, although
	    clearly no one would look at these under normal circumstances.
	  </td></tr></table></div></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_groups.txt</code>:
	    This file contains information about readgroups as determined by
	    MIRA. Most interesting will probably be statistics concerning
	    read-pair sizes.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readrepeats</code>:
	    This file helps to find out which parts of which reads are quite
	    repetitive in a project. Please consult the chapter on how to
	    tackle "hard" sequencing projects to learn how this file can help
	    you in spotting sequencing mistakes and / or difficult parts in a
	    genome or EST / RNASeq project.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readstooshort</code>:
	    A list containing the names of those reads that have been sorted
	    out of the assembly only due to the fact that they were too short,
	    before any processing started.
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readtaglist.txt</code>:
	    This file contains information about the tags and their position
	    that are present in each read.  The read positions are given
	    relative to the forward direction of the sequence (i.e. as it was
	    entered into the assembly).
	  </p></li><li class="listitem"><p>
	    <code class="filename"><em class="replaceable"><code>projectname</code></em>_error_reads_invalid</code>:
	    A list of sequences that have been found to be invalid due to
	    various reasons (given in the output of the assembler).
	  </p></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_first_look:the_assembly_info"></a>9.2. 
      First look: the assembly info
    </h2></div></div></div><p>
      Once finished, have a look at the file
      <code class="filename">*_info_assembly.txt</code> in the info directory. The
	assembly information given there is split into three major parts:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  some general assembly information (number of reads assembled etc.). This
	    part is quite short at the moment and will be expanded in the future.
	</p></li><li class="listitem"><p>
	  assembly metrics for 'large' contigs.
	</p></li><li class="listitem"><p>
	  assembly metrics for all contigs.
	  </p></li></ol></div><p>
      The part on large contigs contains several sections. The first of
      these shows what MIRA counts as a large contig for this particular
      project. As an example, this may look like this:
    </p><pre class="screen">
Large contigs:
--------------
With    Contig size             &gt;= 500
        AND (Total avg. Cov     &gt;= 19
             OR Cov(san)        &gt;= 0
             OR Cov(454)        &gt;= 8
             OR Cov(pbs)        &gt;= 0
             OR Cov(sxa)        &gt;= 11
             OR Cov(sid)        &gt;= 0
            )</pre><p>
      The above is for a 454 and Solexa hybrid assembly in which MIRA
      determined large contigs to be contigs
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  of length of at least 500 bp and
	</p></li><li class="listitem"><p>
	  having a total average coverage of at least 19x, or an average 454
	  coverage of at least 8x, or an average Solexa coverage of at least 11x
	</p></li></ol></div><p>
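      The two conditions above can be written down as a small predicate. The
      following Python lines are only a sketch of the rule as described here,
      not MIRA's actual code: the function name is invented, the thresholds
      are copied from the example report, and treating a threshold of 0 as
      "technology not used in this project" is an assumption (taken
      literally, a coverage of 0 would always satisfy "&gt;= 0").
    </p>

```python
# Thresholds copied from the example "Large contigs" report above.
MIN_LENGTH = 500
MIN_TOTAL_COVERAGE = 19
MIN_TECH_COVERAGE = {"san": 0, "454": 8, "pbs": 0, "sxa": 11, "sid": 0}


def is_large_contig(length, total_coverage, tech_coverage):
    """Sketch of the 'large contig' rule; not MIRA's implementation.

    tech_coverage maps a technology code (e.g. "454") to the average
    coverage of the contig with reads of that technology.
    """
    long_enough = length >= MIN_LENGTH
    # Assumption: a threshold of 0 means the technology is absent from
    # the project, so it must not make the OR condition trivially true.
    covered = total_coverage >= MIN_TOTAL_COVERAGE or any(
        tech_coverage.get(tech, 0) >= minimum
        for tech, minimum in MIN_TECH_COVERAGE.items()
        if minimum > 0
    )
    return long_enough and covered


print(is_large_contig(1200, 25.0, {"454": 10.0, "sxa": 15.0}))  # True
print(is_large_contig(800, 12.0, {"454": 9.0, "sxa": 3.0}))     # True
print(is_large_contig(400, 60.0, {"454": 30.0, "sxa": 30.0}))   # False
```

    <p>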
      The second section is about length assessment of large contigs:
    </p><pre class="screen">
  Length assessment:
  ------------------
  Number of contigs:    44
  Total consensus:      3567224
  Largest contig:       404449
  N50 contig size:      186785
  N90 contig size:      55780
  N95 contig size:      34578</pre><p>
      In the above example, 44 contigs totalling 3.57 megabases were built,
      the largest contig being 404 kilobases long; the N50, N90 and N95
      numbers give the respective contig sizes.
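    </p><p>
      For readers unfamiliar with these numbers: the N50 contig size is
      commonly defined as the length of the contig at which, going from
      the largest contig downwards, at least 50% of the total consensus
      length has been accumulated (N90 and N95 likewise). The Python
      sketch below implements this common definition on made-up contig
      lengths; whether MIRA computes its values in exactly this way is an
      assumption.
    </p>

```python
def nxx(contig_lengths, percent):
    """Return the Nxx contig size (e.g. percent=50 for N50)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating point rounding surprises.
        if running * 100 >= total * percent:
            return length
    return 0  # only reached for an empty list


lengths = [80, 70, 50, 40, 30, 20]  # made-up contig lengths, total 290
print(nxx(lengths, 50), nxx(lengths, 90), nxx(lengths, 95))  # 70 30 20
```

    <p>
      Because contigs are visited from largest to smallest, a handful of
      tiny contigs barely changes the result, in contrast to a plain
      average contig length.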
    </p><p>
      The next section shows information about the coverage assessment of
      large contigs. An example:
    </p><pre class="screen">
  Coverage assessment:
  --------------------
  Max coverage (total): 563
  Max coverage
        Sanger: 0
        454:    271
        PacBio: 0
        Solexa: 360
        Solid:  0
  Avg. total coverage (size &gt;= 5000): 57.38
  Avg. coverage (contig size &gt;= 5000)
        Sanger: 0.00
        454:    25.10
        PacBio: 0.00
        Solexa: 32.88
        Solid:  0.00</pre><p>
      Maximum coverage attained was 563, maximum for 454 alone 271 and for
      Solexa alone 360. The average total coverage (computed from contigs with
      a size &#8805; 5000 bases) is 57.38. The average coverage by sequencing
      technology (in contigs &#8805; 5000) is 25.10 for 454 and 32.88 for Solexa
      reads.
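    </p><p>
      These averages tie in with the large contig thresholds shown earlier:
      taking 1/3 of each average and rounding to the nearest integer
      reproduces the values 19, 8 and 11 from the example. The Python lines
      below merely redo that arithmetic; that MIRA derives its thresholds
      from exactly these averages and with this rounding is an assumption.
    </p>

```python
# Average coverages copied from the example report above.
average_coverage = {"total": 57.38, "454": 25.10, "sxa": 32.88}

# 1/3 of the average coverage, rounded to the nearest integer.
thresholds = {name: round(cov / 3) for name, cov in average_coverage.items()}
print(thresholds)  # {'total': 19, '454': 8, 'sxa': 11}
```

    <p>
      The same arithmetic explains the figure in the warning on projects
      with very different coverage per technology: with 6x Sanger coverage,
      the per-technology threshold is 6 / 3 = 2.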
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	For genome assemblies, the value for <span class="emphasis"><em>Avg. total coverage
	(size &gt;= 5000)</em></span> is currently always calculated for contigs
	having 5000 or more consensus bases. While this gives a very effective
	measure for genome assemblies, assemblies of EST or RNASeq will often
	have totally irrelevant values here: even if the default of MIRA is to
	use smaller contig sizes (1000) for EST / RNASeq assemblies, the
	coverage values for lowly and highly expressed genes can easily span a
	factor of 10000 or more.
      </p></td></tr></table></div><p>
      The last section contains some numbers useful for quality assessment. It
      looks like this:
    </p><pre class="screen">
  Quality assessment:
  -------------------
  Average consensus quality:                    90
  Consensus bases with IUPAC:                   11      (you might want to check these)
  Strong unresolved repeat positions (SRMc):    0       (excellent)
  Weak unresolved repeat positions (WRMc):      19      (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU):     0       (excellent)
  Contigs having only reads wo qual:            0       (excellent)
  Contigs with reads wo qual values:            0       (excellent)</pre><p>
      Besides the average quality of the contigs and whether they contain reads
      without quality values, MIRA shows the number of different tags in the
      consensus which might point at problems.
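    </p><p>
      Since the report is plain text, such counts are easy to pick up in a
      script. The following Python sketch (function name and regular
      expression are illustrative, not a MIRA interface, and assume the
      layout shown above) extracts the numbers of a quality assessment
      block and lists the counters that are not zero:
    </p>

```python
import re

REPORT = """\
  Quality assessment:
  -------------------
  Average consensus quality:                    90
  Consensus bases with IUPAC:                   11      (you might want to check these)
  Strong unresolved repeat positions (SRMc):    0       (excellent)
  Weak unresolved repeat positions (WRMc):      19      (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU):     0       (excellent)
  Contigs having only reads wo qual:            0       (excellent)
  Contigs with reads wo qual values:            0       (excellent)
"""


def parse_quality_block(text):
    """Map each 'label: number' line of the block to an integer."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s+(\d+)", line)
        if match:
            metrics[match.group(1)] = int(match.group(2))
    return metrics


metrics = parse_quality_block(REPORT)
# The average quality is a score, not a problem counter, so skip it.
to_check = {
    label: count
    for label, count in metrics.items()
    if label != "Average consensus quality" and count > 0
}
print(to_check)
```

    <p>
      For the example above this leaves the consensus bases with IUPAC
      codes (11) and the weak unresolved repeat positions (19), the two
      lines MIRA itself marks as worth checking.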
    </p><p>
      The above mentioned sections (length assessment, coverage assessment and
      quality assessment) for <span class="emphasis"><em>large</em></span> contigs are then
      repeated for <span class="emphasis"><em>all</em></span> contigs, this time also
      including contigs which MIRA did not count as large contigs.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_converting_results"></a>9.3. 
      Converting results
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_converting_miraconvert"></a>9.3.1. 
	Converting to and from other formats: <span class="command"><strong>miraconvert</strong></span>
      </h3></div></div></div><p>
	<span class="command"><strong>miraconvert</strong></span> is a tool in the MIRA package which
	reads and writes a number of formats, ranging from full assembly
	formats like CAF and MAF to simple output view formats like HTML or
	plain text.
      </p><div class="figure"><a name="chap_res::results_miraconvert.png"></a><p class="title"><b>Figure 9.1. <span class="command">miraconvert</span> supports a wide range of
	format conversions to simplify export / import of results to and from
	other programs</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/results_miraconvert.png" width="100%" alt="miraconvert supports a wide range of format conversions to simplify export / import of results to and from other programs"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_converting_reach_other_programs"></a>9.3.2. 
	Steps for converting data from / to other tools
      </h3></div></div></div><p>
	The question "How do I convert to / from other tools?" is complicated
	by the plethora of file formats and tools available. This section
	gives an overview of what is needed to reach the most important ones.
      </p><div class="figure"><a name="chap_res::results_mira2other.png"></a><p class="title"><b>Figure 9.2. 
	  Conversion steps, formats and programs needed to reach some tools
	  like assembly viewers, editors or scaffolders.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/results_mira2other.png" width="100%" alt="Conversion steps, formats and programs needed to reach some tools like assembly viewers, editors or scaffolders."></td></tr></table></div></div></div><br class="figure-break"><p>
	Please also read the chapter on MIRA utilities in this manual to learn
	more on <span class="command"><strong>miraconvert</strong></span> and have a look at
	<code class="literal">miraconvert -h</code> which lists all possible formats
	and other command line options.
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_converting_to_from_staden"></a>9.3.2.1. 
	  Example: converting to and from the Staden package (gap4 / gap5)
	</h4></div></div></div><p>
	  The <span class="command"><strong>gap4</strong></span> program and its
	  successor <span class="command"><strong>gap5</strong></span> from the Staden package are pretty
	  useful finishing tools and assembly viewers. They have their own
	  database format which MIRA does not read or write, but there are
	  interconversion possibilities using the CAF format (for gap4) and
	  the SAM format (for gap5).
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      [gap4]
	    </span></dt><dd><p>
		You need the <span class="command"><strong>caf2gap</strong></span>
		and <span class="command"><strong>gap2caf</strong></span> utilities for this, which are
		distributed separately by the Sanger Centre
		(<a class="ulink" href="http://www.sanger.ac.uk/Software/formats/CAF/" target="_top">http://www.sanger.ac.uk/Software/formats/CAF/</a>).
		Conversion is pretty straightforward. From MIRA to gap4, it's
		like this:
	      </p><pre class="screen">
<code class="prompt">$</code> caf2gap -project <em class="replaceable"><code>YOURGAP4PROJECTNAME</code></em> -ace <em class="replaceable"><code>mira_result.caf</code></em> &gt;&amp;/dev/null</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Don't be fooled by the <code class="literal">-ace</code> parameter of
		<span class="command"><strong>caf2gap</strong></span>. It needs a CAF file as input, not
		an ACE file.
	      </td></tr></table></div><p>
		From gap4 to CAF, it's like this:
	      </p><pre class="screen">
<code class="prompt">$</code> gap2caf -project <em class="replaceable"><code>YOURGAP4PROJECTNAME</code></em> &gt;tmp.caf
<code class="prompt">$</code> miraconvert -r c tmp.caf <em class="replaceable"><code>somenewname</code></em>.caf</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Using <span class="command"><strong>gap2caf</strong></span>, be careful to use the simple
		<code class="literal">&gt;</code> redirection to file and
		<span class="emphasis"><em>not</em></span> the <code class="literal">&gt;&amp;</code>
		redirection.
	      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Using first <span class="command"><strong>gap2caf</strong></span> and then
		<span class="command"><strong>miraconvert</strong></span> is needed as gap4 writes its
		own consensus to the CAF file which is not necessarily the
		best. Indeed, gap4 does not know about different sequencing
		technologies like 454 and treats everything as
		Sanger. Therefore, using
		<span class="command"><strong>miraconvert</strong></span> with the  [-r c] option
		recalculates a MIRA consensus during the "conversion" from CAF to CAF.
	      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		If you work with a 32 bit executable of caf2gap, it might very
		well be that the converter needs more memory than can be
		handled by 32 bit. Only solution: switch to a 64 bit
		executable of caf2gap.
	      </td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning: 
		  caf2gap bug for sequence annotations in reverse direction
		"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">
		  caf2gap bug for sequence annotations in reverse direction
		</th></tr><tr><td align="left" valign="top"><p>
		  caf2gap currently has (as of version 2.0.2) a bug that turns
		  around all features in reverse direction during the
		  conversion from CAF to a gap4 project. There is a fix
		  available, please contact me for further information (until
		  I find time to describe it here).
		</p></td></tr></table></div></dd><dt><span class="term">
	      [gap5]
	    </span></dt><dd><p>
		The <span class="command"><strong>gap5</strong></span> program is the successor to
		gap4. It comes with its own import utility
		(<span class="command"><strong>tg_index</strong></span>) which can import SAM and CAF
		files, and gap5 itself has an export function which also
		writes SAM and CAF. It is suggested to use the SAM format to
		exchange data with gap5 as it is more efficient and conveys more
		information on sequencing technologies used.
	      </p><p>
		Conversion is pretty straightforward. From MIRA to gap5, it's like
		this:
	      </p><pre class="screen">
<code class="prompt">$</code> tg_index <em class="replaceable"><code>INPUT</code></em>_out.sam</pre><p>
		This creates a gap5 database named
		<code class="filename"><em class="replaceable"><code>INPUT</code></em>_out.g5d</code>
		which can be directly loaded with gap5 like this:
	      </p><pre class="screen">
<code class="prompt">$</code> gap5 <em class="replaceable"><code>INPUT</code></em>_out.g5d</pre><p>
		Exporting back to SAM or CAF is done in gap5 via
		the <span class="emphasis"><em>File-&gt;Export Sequences</em></span> menu there.
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_converting_to_from_sam"></a>9.3.2.2. 
	  Example: converting to and from SAM (for samtools, tablet etc.)
	</h4></div></div></div><p>
	  Converting to SAM is done by
	  using <span class="command"><strong>miraconvert</strong></span> on a MIRA MAF file, like this:
	</p><pre class="screen">
<code class="prompt">$</code> miraconvert -f maf -t sam <em class="replaceable"><code>INPUT</code></em>.maf <em class="replaceable"><code>OUTPUT</code></em></pre><p>
	  The above will create a file named <code class="filename">OUTPUT.sam</code>.
	</p><p>
	  Converting from SAM to a format which either <span class="command"><strong>mira</strong></span>
	  or <span class="command"><strong>miraconvert</strong></span> can understand takes a few
	  more steps. As neither tool currently reads SAM natively, you need
	  to go via the <span class="command"><strong>gap5</strong></span> editor of the Staden package:
	  convert the SAM via <span class="command"><strong>tg_index</strong></span> to a gap5 database,
	  load that database in gap5 and export it there to CAF.
	</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_filtering_of_results"></a>9.4. 
      Filtering results
    </h2></div></div></div><p>
      It is important to remember that, depending on assembly options, MIRA
      will also include very small contigs (possibly with very low coverage)
      made out of reads which were rejected from the "good" contigs for
      quality or other reasons. You probably do not want to have a look at
      this contig debris when finishing a genome unless you are really,
      really, really picky.
    </p><p>
      Many people prefer to just go on with what would be large
      contigs. Therefore, in de-novo assemblies, MIRA writes out separate
      files of what it thinks are "good", large contigs. In case you want to
      extract contigs differently, the <span class="command"><strong>miraconvert</strong></span> program
      from the MIRA package can selectively filter CAF or MAF files for
      contigs with a certain size, average coverage or number of reads.
    </p><p>
      The file <code class="filename">*_info_assembly.txt</code> in the info directory
      at the end of an assembly might give you first hints on what could be
      suitable filter parameters. As an example, for "normal" assemblies
      (whatever this means), one could want to consider only contigs larger
      than 500 bases and which have at least one third of the average coverage
      of the N50 contigs.
    </p><p>
      Here's an example: In the "Large contigs" section, there's a "Coverage
      assessment" subsection. It looks a bit like this:
    </p><pre class="screen">
...
Coverage assessment:
--------------------
Max coverage (total): 43
Max coverage
Sanger: 0
454:    43
Solexa: 0
Solid:  0
Avg. total coverage (size &#8805; 5000): 22.30
Avg. coverage (contig size &#8805; 5000)
Sanger: 0.00
454:    22.05
Solexa: 0.00
Solid:  0.00
...</pre><p>
      This project was obviously a 454 only project, and the average coverage
      for it is ~22. This number was estimated by MIRA by taking only contigs
      of at least 5kb into account, which for sure left out everything which
      could be categorised as debris. Normally it's a pretty solid number.
    </p><p>
      Now, depending on how much time you want to invest performing some manual
      polishing, you should extract contigs which have at least the following
      fraction of the average coverage:
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  2/3 if a quick and "good enough" result is what you want and you don't want
	  to do any manual polishing. In this example, that would be around
	  14 or 15.
	</p></li><li class="listitem"><p>
	  1/2 if you want to have a "quick look" and possibly perform some
	  contig joins. In this example the number would be 11.
	</p></li><li class="listitem"><p>
	  1/3 if you want a quite accurate result and want to be sure not to lose any possible
	  repeat. That would be 7 or 8 in this example.
	</p></li></ul></div><p>
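      The three thresholds can be derived from the reported average
      coverage with a small shell helper. This is only a sketch: the
      function name <code class="literal">cov_thresholds</code> is made up for
      illustration and is not part of the MIRA package, and it assumes a
      POSIX shell with <span class="command"><strong>awk</strong></span> available.
    </p><pre class="screen">
# Hypothetical helper, not part of the MIRA package: prints 2/3, 1/2 and 1/3
# of the average coverage given as first argument.
cov_thresholds() {
    awk -v c="$1" 'BEGIN {
        printf "2/3: %.2f\n", c * 2 / 3
        printf "1/2: %.2f\n", c / 2
        printf "1/3: %.2f\n", c / 3
    }'
}

# Average total coverage taken from the example above
cov_thresholds 22.30
</pre><p>
      For the example project the results are in line with the values of
      roughly 14 to 15, 11 and 7 to 8 listed above. The value you settle
      on, rounded, is what would be passed as minimum average coverage
      to the -y option of <span class="command"><strong>miraconvert</strong></span>.
    </p><p>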
      Example (useful with assemblies of Sanger data): extracting only contigs &#8805;
      1000 bases and with a minimum average coverage of 4 into FASTA format:
    </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 1000 -y 4 <em class="replaceable"><code>sourcefile.maf targetfile.fasta</code></em></code></strong></pre><p>
      Example (useful with assemblies of 454 data): extracting only contigs
      &#8805; 500 bases into FASTA format:
    </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 500 <em class="replaceable"><code>sourcefile.maf targetfile.fasta</code></em></code></strong></pre><p>
      Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only
      contigs &#8805; 500 bases and with an average coverage &#8805; 15 reads into
      CAF format, then converting the reduced CAF into a Staden GAP4 project:
    </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 500 -y 15 <em class="replaceable"><code>sourcefile.maf tmp.caf</code></em></code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>caf2gap -project <em class="replaceable"><code>somename</code></em> -ace <em class="replaceable"><code>tmp.caf</code></em></code></strong></pre><p>
      Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only
      contigs &#8805; 1000 bases and with &#8805; 10 reads from MAF into CAF format,
      then converting the reduced CAF into a Staden GAP4 project:
    </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 1000 -z 10 <em class="replaceable"><code>sourcefile.maf tmp.caf</code></em></code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>caf2gap -project <em class="replaceable"><code>somename</code></em> -ace <em class="replaceable"><code>tmp.caf</code></em></code></strong></pre></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_places_of_importance_in_a_de_novo_assembly"></a>9.5. 
      Places of importance in a de-novo assembly
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_tags_set_by_mira"></a>9.5.1. 
	Tags set by MIRA
      </h3></div></div></div><p>
        MIRA sets a number of different tags in resulting assemblies. They can be set in reads
        (in which case they mostly end with an <span class="emphasis"><em>r</em></span>) or in the consensus (then
        ending with a <span class="emphasis"><em>c</em></span>).
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  If you use the
	  Staden <span class="command"><strong>gap4</strong></span>, <span class="command"><strong>gap5</strong></span> or
	  <span class="command"><strong>consed</strong></span> assembly editor to tidy up the assembly, you
	  can directly jump to places of interest that MIRA marked for further
	  analysis by using the search functionality of these programs.
	</p><p>
	  However, you need to tell these programs that these tags exist. For
	  that you must change some configuration files. More information on
	  how to do this can be found in the
	  <code class="filename">support/README</code> file of the MIRA distribution.
	</p></td></tr></table></div><p>
	You should search for the following "consensus" tags for finding places of importance
	(in this order).
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    IUPc
	  </p></li><li class="listitem"><p>
	    UNSc
	  </p></li><li class="listitem"><p>
	    SRMc
	  </p></li><li class="listitem"><p>
	    WRMc
	    </p></li><li class="listitem"><p>
	    STMU (only hybrid assemblies)
	  </p></li><li class="listitem"><p>
	    MCVc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
	  </p></li><li class="listitem"><p>
	    SROc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
	  </p></li><li class="listitem"><p>
	    SAOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
	  </p></li><li class="listitem"><p>
	    SIOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
	  </p></li><li class="listitem"><p>
	    STMS (only hybrid assemblies)
	  </p></li></ul></div><p>
      </p><p>
	Of lesser importance are the "read" versions of the tags above:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    UNSr
	  </p></li><li class="listitem"><p>
	    SRMr
	  </p></li><li class="listitem"><p>
	    WRMr
	  </p></li><li class="listitem"><p>
	    SROr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
	  </p></li><li class="listitem"><p>
	    SAOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
	  </p></li><li class="listitem"><p>
	    SIOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
	    </p></li></ul></div><p>
      </p><p>
	In normal assemblies (only one sequencing technology, just one
	strain), search for the IUPc, UNSc, SRMc and WRMc tags.
      </p><p>
	In hybrid assemblies, searching for the IUPc, UNSc, SRMc, WRMc, and
	STMU tags and correcting only those places will allow you to have a
	qualitatively good assembly in no time at all.
      </p><p>
	Columns with SRMr tags (SRM in <span class="bold"><strong>R</strong></span>eads)
	in an assembly without a SRMc tag at the same consensus position show
	where MIRA was able to resolve a repeat during the different passes of
	the assembly; you don't need to look at these. SRMc and WRMc tags,
	however, mean that there may be unresolved trouble ahead, so you should take a
	look at these.
      </p><p>
	Especially in mapping assemblies, columns with the MCVc, SROx, SIOx and SAOx tags are
	extremely helpful in finding places of interest. As they are only set if you
	gave strain information to MIRA, you should always do that.
      </p><p>
	For more information on tags set/used by MIRA and what they exactly mean, please look up the
	corresponding section in the reference chapter.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_other_places_of_importance"></a>9.5.2. 
	Other places of importance
      </h3></div></div></div><p>
	The read coverage histogram as well as the template display of gap4
	will help you to spot other places of potential interest. Please consult the
	gap4 documentation.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_joining_contigs"></a>9.5.3. 
	Joining contigs
      </h3></div></div></div><p>
	I recommend investing a couple of minutes (in the best case) to a few
	hours in joining contigs, especially if the uniform read distribution
	option of MIRA was used (but first filter for large contigs). This
	way, you will reduce the number of "false repeats" and improve the
	overall quality of your assembly.
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_joining_truerepeats"></a>9.5.3.1. 
	  Joining contigs at true repetitive sites
	</h4></div></div></div><p>
	  Joining contigs at repetitive sites of a genome is always a
	  difficult decision. There are, however, two rules which can help:
	  </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem">
	      If the sequencing was done without a paired-end library, don't join.
	    </li><li class="listitem">
	      If the sequencing was done with a paired-end library, but no
	      pair (or template) spans the join site, don't join.
	    </li></ol></div><p>
	  </p><p>
	    The following screen shot shows a case where one should not join as
	    the finishing program (in this case <span class="command"><strong>gap4</strong></span>) warns
	    that no template (read-pair) spans the join site:
	  </p><p>
	  </p><div class="figure"><a name="haf_danger_join_notok.png"></a><p class="title"><b>Figure 9.3. 
	      Join at a repetitive site which should not be performed due to
	      missing spanning templates.
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf_danger_join_notok.png" width="100%" alt="Join at a repetitive site which should not be performed due to missing spanning templates."></td></tr></table></div></div></div><p><br class="figure-break">
	  </p><p>
	    The next screen shot shows a case where one should join as the
	    finishing program (in this case <span class="command"><strong>gap4</strong></span>) finds
	    templates spanning the join site and all of them are good:
	  </p><p>
	  </p><div class="figure"><a name="haf_danger_join_ok.png"></a><p class="title"><b>Figure 9.4. 
	      Join at a repetitive site which should be performed due to
	      spanning templates being good.
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf_danger_join_ok.png" width="100%" alt="Join at a repetitive site which should be performed due to spanning templates being good."></td></tr></table></div></div></div><p><br class="figure-break">
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_joining_FALSErepeats"></a>9.5.3.2. 
	  Joining contigs at "wrongly discovered" repetitive sites
	</h4></div></div></div></div><p>
	Remember that MIRA takes a very cautious approach in contig building,
	and sometimes creates two contigs when it could have created
	one. Three main reasons can be the cause for this:
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    when using <span class="emphasis"><em>uniform read distribution</em></span>, some
	    non-repetitive areas may have generated so many more reads that
	    they start to look like repeats (so called pseudo-repeats). In
	    this case, reads that are above a given coverage are
	    <span class="emphasis"><em>shaved off</em></span> (see [-AS:urdcm]) and kept
	    in reserve to be used for another copy of that repeat ... which in
	    case of a non-repetitive region will of course never arrive. So at
	    the end of an assembly, these shaved-off reads will form short,
	    low coverage contig debris which can more or less be safely
	    ignored and sorted out via the filtering options ( [-x -y
	    -z]) of <span class="command"><strong>miraconvert</strong></span>.
	  </p><p>
	    Some 454 library construction protocols -- especially, but not
	    exclusively, for paired-end reads -- create pseudo-repeats quite
	    frequently. In this case, the pseudo-repeats are characterised by
	    several reads starting at exactly the same position but which can
	    have different lengths. Should MIRA have separated these reads
	    into different contigs, these can be -- most of the time -- safely
	    joined. The following figure shows such a case:
	  </p><div class="figure"><a name="454_stacks_join.png"></a><p class="title"><b>Figure 9.5. 
	      Pseudo-repeat in 454 data due to sequencing artifacts
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_stacks_join.png" width="100%" alt="Pseudo-repeat in 454 data due to sequencing artifacts"></td></tr></table></div></div></div><br class="figure-break"><p>
	    For Solexa data, a non-negligible GC bias has been reported in
	    genome assemblies since late 2009. In genomes with moderate to
	    high GC, this bias actually favours regions with lower
	    GC. Examples were observed where regions with an average GC of 10%
	    less than the rest of the genome had between two and four times
	    more reads than the rest of the genome, leading to false
	    "discovery" of duplicated genome regions.
	  </p></li><li class="listitem"><p>
	    when using unpaired data, the above described possibility of
	    having "too many" reads in a non-repetitive region can also lead
	    to a contig being separated into two contigs in the region of the
	    pseudo-repeat.
	  </p></li><li class="listitem"><p>
	    a number of reads (sometimes even just one) can contain "high
	    quality garbage", that is, nonsense bases which got - for some
	    reason or another - good quality values.  This garbage can be
	    distributed on a long stretch in a single read or concern just a
	    single base position across several reads.
	  </p><p>
	    While MIRA has some algorithms to deal with the disrupting effects
	    of reads like these, the algorithms are not always 100% effective and
	    some reads might slip through the filters.
	  </p></li></ol></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_places_of_interest_in_a_mapping_assembly"></a>9.6. 
      Places of interest in a mapping assembly
    </h2></div></div></div><p>
      This section just gives a short overview on the tags you might find
      interesting. For more information, especially on how to configure gap4
      or consed, please consult the <span class="emphasis"><em>mira usage</em></span> document
      and the <span class="emphasis"><em>mira</em></span> manual.
    </p><p>
      In file types that allow tags (CAF, MAF, ACE), SNPs and other
      interesting features will be marked by MIRA with a number of tags. The
      following sections give a brief overview. For a description of what
      the tags are (SROc, WRMc etc.), please read the section "Tags used
      in the assembly by MIRA and EdIt" in the main manual.
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      Screen shots in this section are taken from the walk-through with
      Lenski data (see below).
    </td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_where_are_snps?"></a>9.6.1. 
	Where are SNPs?
      </h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    the <span class="bold"><strong>SROc</strong></span> tag will point to most
	    SNPs. Should you assemble sequences of more than one strain (I
	    cannot really recommend such a strategy), you also might
	    encounter <span class="bold"><strong>SIOc</strong></span> and <span class="bold"><strong>SAOc</strong></span> tags.
	  </p><div class="figure"><a name="chap_sol::sxa_sroc_lenski1.png"></a><p class="title"><b>Figure 9.6. 
	      "SROc" tag showing a SNP position in a Solexa mapping
	      assembly.
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski1.png" width="100%" alt='"SROc" tag showing a SNP position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_sol::sxa_sroc_lenski2.png"></a><p class="title"><b>Figure 9.7. 
	      "SROc" tag showing a SNP/indel position in a Solexa mapping
	      assembly.
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski2.png" width="100%" alt='"SROc" tag showing a SNP/indel position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></li><li class="listitem"><p>
	    the <span class="bold"><strong>WRMc</strong></span> tags might sometimes
	    point to SNPs or to indels of one or two bases.
	  </p></li></ul></div><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_where_are_insertions_deletions_or_genome_rearrangements?"></a>9.6.2. 
	Where are insertions, deletions or genome re-arrangements?
      </h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    Large deletions: the <span class="bold"><strong>MCVc</strong></span> tags
	    point to deletions in the resequenced data, where no read is
	    covering the reference genome.
	  </p><div class="figure"><a name="chap_sol::sxa_mcvc_lenski.png"></a><p class="title"><b>Figure 9.8. 
	      "MCVc" tag (dark red stretch in figure) showing a genome
	      deletion in Solexa mapping assembly.
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_mcvc_lenski.png" width="100%" alt='"MCVc" tag (dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></li><li class="listitem"><p>
	    Insertions, small deletions and re-arrangements: these are
	    harder to spot. In unpaired data sets they can be found looking
	    at clusters of <span class="bold"><strong>SROc</strong></span>, <span class="bold"><strong>SRMc</strong></span>, <span class="bold"><strong>WRMc</strong></span>, and / or <span class="bold"><strong>UNSc</strong></span> tags.
	  </p><div class="figure"><a name="chap_sol::sxa_wrmcsrmc_hiding_lenski1.png"></a><p class="title"><b>Figure 9.9. 
	      An IS150 insertion hiding behind a WRMc and a SRMc tags
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_wrmcsrmc_hiding_lenski1.png" width="100%" alt="An IS150 insertion hiding behind a WRMc and a SRMc tags"></td></tr></table></div></div></div><br class="figure-break"><p>
	    more massive occurrences of these tags lead to a rather colourful
	    display in finishing programs, which is why these clusters are
	    also sometimes called Xmas-trees.
	  </p><div class="figure"><a name="chap_sol::sxa_xmastree_lenski1.png"></a><p class="title"><b>Figure 9.10. 
	      A 16 base pair deletion leading to a SROc/UNSc xmas-tree
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski1.png" width="100%" alt="A 16 base pair deletion leading to a SROc/UNSc xmas-tree"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_sol::sxa_xmastree_lenski2.png"></a><p class="title"><b>Figure 9.11. 
	      An IS186 insertion leading to a SROc/UNSc xmas-tree
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski2.png" width="100%" alt="An IS186 insertion leading to a SROc/UNSc xmas-tree"></td></tr></table></div></div></div><br class="figure-break"><p>
	    In sets with paired-end data, post-processing software (or
	    alignment viewers) can use the read-pair information to guide
	    you to these sites (MIRA doesn't set tags at the moment).
	  </p></li></ul></div><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_other_tags_of_interest"></a>9.6.3. 
	Other tags of interest
      </h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    the <span class="bold"><strong>UNSc</strong></span> tag points to areas
	    where the consensus algorithm had trouble choosing a base. This
	    happens in low coverage areas, at places of insertions (compared
	    to the reference genome) or sometimes also in places where
	    repeats with a few bases difference are present. Often enough,
	    these tags are in areas with problematic sequences for the
	    Solexa sequencing technology like, e.g., a
	    <code class="literal">GGCxG</code> or even <code class="literal">GGC</code> motif in
	    the reads.
	  </p></li><li class="listitem"><p>
	    the <span class="bold"><strong>SRMc</strong></span> tag points to places
	    where repeats with a few bases difference are present. Here too,
	    sequences problematic for the Solexa technology are likely to
	    have caused base calling errors and subsequently the setting of this
	    tag.
	  </p></li></ul></div><p>
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_postprocessing_mapping_assemblies"></a>9.7. 
      Post-processing mapping assemblies
    </h2></div></div></div><p>
      This section is a bit terse, you should also read the chapter on
      <span class="emphasis"><em>working with results</em></span> of MIRA.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_pp_manual_cleanup"></a>9.7.1. 
	Manual cleanup and validation (optional)
      </h3></div></div></div><p>
	When working with resequencing data and a mapping assembly, I always
	load finished projects into an assembly editor and perform a quick
	cleanup of the results. SNPs or small indels normally do not need
	cleanups, but every mapper will get larger indels mostly wrong, and
	MIRA is no exception to this.
      </p><p>
	For close relatives of the reference strain this doesn't take long as
	MIRA will have set tags (see section earlier in this document) at all
	sites you should have a look at. For example, very close mutant
	bacteria with just SNPs or simple deletions and no genome
	reorganisation, I usually clean up in 10 to 15 minutes. That gives the
	last boost to data quality and your users (biologists etc.) will thank
	you for that as it reduces their work in analysing the data (be it
	looking at data or performing wet-lab experiments).
      </p><p>
        The general workflow I use is to convert the CAF file to a gap4 or gap5
        database. Then, in gap4 or gap5, I
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    quickly search for the UNSc and WRMc tags and check whether they
	    could be real SNPs that were overlooked by MIRA. In that case, I
	    manually set a SROc (or SIOc) tag in gap4 via hotkeys that were
	    defined to set these tags.
	  </p></li><li class="listitem"><p>
	    sometimes also quickly clean up reads that are causing trouble in
	    alignments and lead to wrong base calling. These can be found at
	    sites with UNSc tags, most of the time they have the 5' to 3'
	    <code class="literal">GGCxG</code> motif which can cause trouble to Solexa.
	  </p></li><li class="listitem"><p>
	    look at sites with deletions (tagged with MCVc) and look whether I
	    should clean up the borders of the deletion.
	  </p></li></ol></div><p>
	After this, I convert the gap4 or gap5 database back to CAF format.
	But beware: gap4 does not have the same consensus calling routines as
	MIRA and will have saved its own consensus in the new CAF. In fact,
	gap4 performs rather badly in projects with multiple sequencing
	technologies. So I use miraconvert from the MIRA package to recalculate
	a good consensus (and save it in MAF as it's more compact and a lot
	faster in handling than CAF):
      </p><p>
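	A minimal sketch of such a call, assuming the [-r c] option
	mentioned in the notes on gap4 conversion earlier in this chapter
	and hypothetical file names, with the output format inferred from
	the file extension as in the filtering examples:
      </p><pre class="screen">
<code class="prompt">$</code> miraconvert -r c <em class="replaceable"><code>edited.caf</code></em> <em class="replaceable"><code>recalled.maf</code></em></pre><p>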
	And from this MAF file I can then convert with miraconvert to any
	other format I or my users need: CAF, FASTA, ACE, WIG (for coverage
	analysis), SNP and coverage analysis (see below), HTML etc.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_comprehensive_snp_analysis_spreadsheet_tables_for_excel_or_oocalc"></a>9.7.2. 
	Comprehensive SNP analysis spreadsheet tables (for Excel or OOcalc)
      </h3></div></div></div><p>
	Biologists are not really interested in SNP coordinates, and why
	should they? They're more interested where SNPs are, how good they
	are, which genes or other elements they hit, whether they have an
	effect on a protein sequence, whether they may be important etc. For
	organisms without intron/exon structure or splice variants, MIRA can
	generate pretty comprehensive tables and files if an annotated
	GenBank file was used as reference and strain information was given
	to MIRA during the assembly.
      </p><p>
	Well, MIRA does all that automatically for you if the reference
	sequence you gave was annotated.
      </p><p>
	For this, <span class="command"><strong>miraconvert</strong></span> should be used with the
	<span class="emphasis"><em>asnp</em></span> format as target and a MAF (or CAF) file as
	input:
      </p><pre class="screen"><code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t asnp <em class="replaceable"><code>input.maf output</code></em></code></strong></pre><p>
	Note that it is strongly suggested to perform a quick manual cleanup
	of the assembly prior to this: in rare cases (mainly at sites of
	small indels of one or two bases), MIRA will not tag SNPs with a SNP
	tag (SROc, SAOc or SIOc) but will be fooled into a tag denoting
	unsure positions (UNSc). This can be quickly corrected manually. See
	further down in this manual in the section on post-processing.
      </p><p>
	After conversion, you will have four files in the directory which
	you can all drag-and-drop into spreadsheet applications like
	OpenOffice Calc or Excel.
      </p><p>
	The files should be pretty self-explanatory, here's just a short overview:
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    <code class="filename">output_info_snplist.txt</code> is a simple list of
	    the SNPs, with their positions compared to the reference
	    sequence (in bases and map degrees on the genome) as well as the
	    GenBank features they hit.
	  </p></li><li class="listitem"><p>
	    <code class="filename">output_info_featureanalysis.txt</code> is a much
	    extended version of the list above. It puts the SNPs into
	    context of the features (proteins, genes, RNAs etc.) and gives a
	    nice list, SNP by SNP, of what might cause bigger changes in
	    proteins.
	  </p></li><li class="listitem"><p>
	    <code class="filename">output_info_featuresummary.txt</code> looks at the
	    changes (SNPs, indels) the other way round. It gives an
	    excellent overview of which features (genes, proteins, RNAs,
	    intergenic regions) you should investigate.
	  </p><p>
	    There's one column (named 'interesting') which summarises
	    everything you need into three categories: yes,
	    no, and perhaps. 'Yes' is set if indels were detected, an amino
	    acid changed, a start or stop codon changed, or for SNPs in
	    intergenic regions and RNAs. 'Perhaps' is set for SNPs in
	    proteins that change a codon but not an amino acid (silent
	    SNPs). 'No' is set if no SNP hits a feature.
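	  </p><p>
	    The rules for this column can be written down as a small decision
	    function. The Python sketch below only illustrates the three
	    categories as described in the text; it is not MIRA's code, and the
	    function and argument names are hypothetical.
	  </p><pre class="programlisting">
def interesting(indel=False, aa_change=False, start_stop_change=False,
                snp_in_intergenic_or_rna=False, silent_snp=False):
    """Classify a feature as 'yes', 'perhaps' or 'no' following the text."""
    if indel or aa_change or start_stop_change or snp_in_intergenic_or_rna:
        return "yes"
    if silent_snp:
        return "perhaps"
    return "no"


print(interesting(aa_change=True))
print(interesting(silent_snp=True))
print(interesting())
</pre><p>
	    The sketch assumes that the 'yes' conditions take precedence, so a
	    feature carrying both an indel and a silent SNP is reported as
	    'yes'.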
	  </p></li><li class="listitem"><p>
	    <code class="filename">output_info_featuresequences.txt</code> simply
	    gives the sequences of each feature of the reference sequence
	    and the resequenced strain.
	  </p></li></ol></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_html_files_depicting_snp_positions_and_deletions"></a>9.7.3. 
	HTML files depicting SNP positions and deletions
      </h3></div></div></div><p>
	I've come to realise that people who don't handle data from NextGen
	sequencing technologies on a regular basis (e.g., many biologists)
	don't want to be bothered with learning to handle specialised
	programs to have a look at their resequenced strains, be it because
	they don't have time to learn how to use a new program or because
	their desktop is not powerful enough (CPU, memory) to handle the data
	sets.
      </p><p>
	Something even biologists know how to operate is a browser. Therefore,
	miraconvert has the option to load a MAF (or CAF) file of a
	mapping assembly and output to HTML those areas which are interesting
	to biologists. It uses the tags SROc, SAOc, SIOc and MCVc and outputs
	the surrounding alignment of these areas together with a nice overview
	and links to jump from one position to the previous or next.
      </p><p>
	This is done with the '<code class="literal">-t hsnp</code>' option of
	miraconvert:
      </p><pre class="screen"><code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t hsnp <em class="replaceable"><code>input.maf output</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	I recommend doing this only if the resequenced strain is a very close
	relative of the reference genome; otherwise the HTML gets pretty big. But
	for a couple of hundred SNPs it works great.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_wig_files"></a>9.7.4. 
	WIG files depicting contig coverage or GC content
      </h3></div></div></div><p>
	<span class="command"><strong>miraconvert</strong></span> can also dump a coverage file in WIG
	format (using '<code class="literal">-t wig</code>') or a WIG file for GC
	content (using '<code class="literal">-t gcwig</code>'). This comes in pretty handy
	when searching for genome deletions or duplications in programs like the
	Affymetrix Integrated Genome Browser (IGB, see <a class="ulink" href="http://igb.bioviz.org/" target="_top">http://igb.bioviz.org/</a>) or when looking for foreign sequence
	in a genome.
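      </p><p>
	A coverage WIG file can also be scanned with a few lines of script
	instead of a genome browser. The following Python sketch lists
	stretches with zero coverage as deletion candidates. It assumes the
	file follows the standard WIG layout with
	<code class="literal">fixedStep</code> or <code class="literal">variableStep</code>
	declaration lines (which of the two miraconvert writes is not stated
	here), it ignores any <code class="literal">span</code> setting, and
	the function names are hypothetical.
      </p><pre class="programlisting">
def read_wig(lines):
    """Yield (sequence_name, position, value) from fixedStep or variableStep WIG lines."""
    mode = name = None
    position = step = 1
    for raw in lines:
        fields = raw.split()
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        if fields[0] in ("fixedStep", "variableStep"):
            # Declaration line: remember the layout and the sequence name.
            mode = fields[0]
            options = dict(field.split("=", 1) for field in fields[1:])
            name = options["chrom"]
            position = int(options.get("start", 1))
            step = int(options.get("step", 1))
        elif mode == "fixedStep":
            yield name, position, float(fields[0])
            position += step
        elif mode == "variableStep":
            yield name, int(fields[0]), float(fields[1])


def low_coverage_runs(records, threshold=0.0):
    """Merge consecutive records with a value at or below threshold into (name, first, last)."""
    runs = []
    current = None
    for name, position, value in records:
        if threshold >= value:
            if current is not None and current[0] == name:
                current[2] = position
            else:
                current = [name, position, position]
                runs.append(current)
        else:
            current = None
    return [tuple(run) for run in runs]


wig = """track type=wiggle_0
fixedStep chrom=contig1 start=1 step=1
12
0
0
7
0
""".splitlines()
print(low_coverage_runs(read_wig(wig)))
</pre><p>
	For a real file, pass an open file handle to
	<code class="literal">read_wig()</code> instead of the list of lines
	used in this toy example; raising <code class="literal">threshold</code>
	reports weakly covered regions as well.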
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_tables_for_feature_coverage"></a>9.7.5. 
	Comprehensive spreadsheet tables for gene expression values / genome deletions &amp; duplications
      </h3></div></div></div><p>
	When data has been mapped against a reference with annotations (either
	from GenBank or GFF3 formats),
	<span class="command"><strong>miraconvert</strong></span> can generate tables depicting
	either expression values (in RNASeq/EST data mappings) or probable
	genome multiplication and deletion factors (in genome mappings). For
	this to work, you must use a MAF or CAF file as input, specify
	<span class="emphasis"><em>fcov</em></span> as output format, and the reference sequence
	must have had annotations during the mapping with MIRA.
      </p><p>For example:</p><pre class="screen"><code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t fcov <em class="replaceable"><code>mira_out.maf myfeaturetable</code></em></code></strong></pre></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_mutils"></a>Chapter 10. Utilities in the MIRA package</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_mutils_convpro">10.1. miraconvert</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_mutils_cp_synopsis">10.1.1. 
	Synopsis
      </a></span></dt><dt><span class="sect2"><a href="#sect_mutils_cp_description">10.1.2. Description</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_cp_options">10.1.3. Options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_mutils_cp_options_general">10.1.3.1. General options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_cp_options_contigs">10.1.3.2. Options for input containing contig data</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_mutils_cp_examples">10.1.4. Examples</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_mutils_bait">10.2. mirabait - a "grep" like tool to select reads with kmers up to 256 bases</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_mutils_bait_synopsis">10.2.1. 
	Synopsis
      </a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_description">10.2.2. Description</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_options">10.2.3. Options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_mutils_bait_mainoptions">10.2.3.1. Main options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_bait_outputdef">10.2.3.2. File type options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_bait_other">10.2.3.3. Other options</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_mutils_bait_examples">10.2.4. Usage examples</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_installrrnadb">10.2.5. Installing different rRNA databases</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Ninety percent of success is just growing up.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_mutils_convpro"></a>10.1. miraconvert</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_synopsis"></a>10.1.1. 
	Synopsis
      </h3></div></div></div><div class="cmdsynopsis"><p><code class="command">miraconvert</code>  [options] {<em class="replaceable"><code>input_file</code></em>} {<em class="replaceable"><code>output_basename</code></em>}</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_description"></a>10.1.2. Description</h3></div></div></div><p>
	<span class="command"><strong>miraconvert</strong></span> is a tool to convert, extract and
	sometimes recalculate all kinds of data related to sequence assembly
	files.
      </p><p>
	More specifically, <span class="command"><strong>miraconvert</strong></span> can
	</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	      convert from multiple alignment files (CAF, MAF) to other multiple
	      alignment files (CAF, MAF, ACE, SAM), and -- if wished -- selecting
	      contigs by different criteria like name, length, coverage etc.
	    </p></li><li class="listitem"><p>
	      extract the consensus from multiple alignments in CAF and MAF format,
	      writing it to any supported output format (FASTA, FASTQ, plain text,
	      HTML, etc.) and -- if wished -- recalculating the consensus using
	      the MIRA consensus engine with MIRA parameters
	    </p></li><li class="listitem"><p>
	      extract read sequences (clipped or unclipped) from multiple
	      alignments and save to any supported format
	    </p></li><li class="listitem"><p>
	      Much more, need to document this.
	    </p></li></ol></div><p>
      </p><p>&#8230;</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_options"></a>10.1.3. Options</h3></div></div></div><p>&#8230;</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_cp_options_general"></a>10.1.3.1. General options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	      <code class="option">-f
	      <em class="replaceable"><code>
		{ <code class="option">caf</code>  |   <code class="option">maf</code>  |   <code class="option">fasta</code>  |   <code class="option">fastq</code>  |   <code class="option">gbf</code>  |   <code class="option">phd</code>  |   <code class="option">fofnexp</code> }
	      </code></em>
	      </code>
	    </span></dt><dd><p>
		<span class="quote">&#8220;<span class="quote">From-type</span>&#8221;</span>, the format of the input file. CAF and MAF
		files can contain full assemblies and/or unassembled (single)
		sequences while the other formats contain only unassembled
		sequences.
	      </p></dd><dt><span class="term">
	      <code class="option">-t
	      <em class="replaceable"><code>
		{ <code class="option">ace</code>  |   <code class="option">asnp</code>  |   <code class="option">caf</code>  |   <code class="option">crlist</code>  |   <code class="option">cstats</code>  |   <code class="option">exp</code>  |   <code class="option">fasta</code>  |   <code class="option">fastq</code>  |   <code class="option">fcov</code>  |   <code class="option">gbf</code>  |   <code class="option">gff3</code>  |   <code class="option">hsnp</code>  |   <code class="option">html</code>  |   <code class="option">maf</code>  |   <code class="option">phd</code>  |   <code class="option">sam</code>  |   <code class="option">samnbb</code>  |   <code class="option">text</code>  |   <code class="option">tcs</code>  |   <code class="option">wig</code> }
	      </code></em>
	      </code>
	      <code class="option">[ -t &#8230; ]</code>
	    </span></dt><dd><p>
		<span class="quote">&#8220;<span class="quote">To-type</span>&#8221;</span>, the format of the output file. Multiple
		mentions of  [-t] are allowed, in which case
		<span class="command"><strong>miraconvert</strong></span> will convert to multiple types.
	      </p></dd><dt><span class="term"><code class="option">-a</code></span></dt><dd><p>
		Append. Results of conversion are appended to existing files instead of overwriting them.
	      </p></dd><dt><span class="term"><code class="option">-A</code></span></dt><dd><p>
		Do not adjust sequence case.
	      </p><p>
		When reading formats which define clipping points (like CAF,
		MAF or EXP), and saving to formats which do not have clipping
		information, miraconvert normally adjusts the case of read
		sequences: lower case for clipped parts, upper case for
		unclipped parts of reads.  Use -A if you do not want this. See
		also -C.
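	      </p><p>
		The difference between the default case adjustment and hard
		clipping (-C) can be illustrated with a short Python sketch.
		It is not miraconvert's code: the function names are
		hypothetical, and the clipping points are assumed to be
		0-based, half-open coordinates purely for this illustration.
	      </p><pre class="programlisting">
def adjust_case(sequence, clip_left, clip_right):
    """Lower case for the clipped ends, upper case for the unclipped part."""
    return (sequence[:clip_left].lower()
            + sequence[clip_left:clip_right].upper()
            + sequence[clip_right:].lower())


def hard_clip(sequence, clip_left, clip_right):
    """Keep only the unclipped part of the read."""
    return sequence[clip_left:clip_right]


read = "NNACGTTGCANNN"
print(adjust_case(read, 2, 10))   # nnACGTTGCAnnn
print(hard_clip(read, 2, 10))     # ACGTTGCA
</pre><p>
		With -A, the sequence would be written as read from the input,
		without either of these two transformations.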
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Applies only to files/formats which do not contain contigs.
	      </td></tr></table></div></dd><dt><span class="term"><code class="option">-b</code></span></dt><dd><p>
		Blind data. Replace all bases in all reads / contigs with a 'c'.
	      </p></dd><dt><span class="term"><code class="option">-C</code></span></dt><dd><p>
		Hard clip reads. When the input is a format which contains clipping
		points in sequences and the requested output consists of sequences
		of reads, only the unclipped parts of sequences will be saved as
		results.
	      </p></dd><dt><span class="term"><code class="option">-d</code></span></dt><dd><p>
		Delete gap only columns. When output is contigs: delete
		columns that are entirely gaps (can occur after having deleted
		reads during editing in gap4, consed or other). When output is
		reads: delete gaps in reads.
	      </p></dd><dt><span class="term"><code class="option">-F</code></span></dt><dd><p>
		Filter read groups to different files. Works only for input
		files containing readgroups, i.e., CAF or MAF. 3 (or 4) files
		are generated: one or two for paired, one for unpaired and one
		for debris reads. Reads in paired file are interlaced by
		default, use -F twice to create separate files.
	      </p></dd><dt><span class="term"><code class="option">-m</code></span></dt><dd><p>
		Make contigs. Encase single reads as contig singlets into a CAF/MAF
		file.
	      </p></dd><dt><span class="term"><code class="option">-n <em class="replaceable"><code>namefile</code></em></code></span></dt><dd><p>
		Name select. Only contigs or reads whose name appears in
		<code class="filename">namefile</code> are selected for
		output. <code class="filename">namefile</code> is a
		simple text file having one name entry per line.
	      </p></dd><dt><span class="term"><code class="option">-i</code></span></dt><dd><p>
		When -n is used, inverts the selection.
	      </p></dd><dt><span class="term"><code class="option">-o <em class="replaceable"><code>offset</code></em></code></span></dt><dd><p>
		Offset of quality values in FASTQ files. Only valid if -f is FASTQ.
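	      </p><p>
		As background: FASTQ stores each quality value as a single
		character whose character code is the quality plus an offset,
		commonly 33 (Sanger style) or 64 (older Illumina pipelines).
		The Python sketch below shows the decoding; the function name
		is hypothetical, and the default that miraconvert itself
		assumes is not stated here.
	      </p><pre class="programlisting">
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer quality values."""
    return [ord(char) - offset for char in quality_string]


print(decode_qualities("II5!", offset=33))   # [40, 40, 20, 0]
print(decode_qualities("hhT@", offset=64))   # [40, 40, 20, 0]
</pre><p>
		Decoding with the wrong offset shifts every quality value by
		the difference between the two offsets, which is why the
		offset must match the file.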
	      </p></dd><dt><span class="term"><code class="option">-P <em class="replaceable"><code>MIRA-PARAMETERSTRING</code></em></code></span></dt><dd><p>
		Additional MIRA parameters. Allows you to initialise the underlying MIRA
		routines with specific parameters. A use case can be, e.g., to
		recalculate a consensus of an assembly in a slightly different way
		(see also [-r]) than the one which is stored in assembly
		files. Example: to tell the consensus algorithm to use a minimum
		number of reads per group for 454 reads, use: "454_SETTINGS -CO:mrpg=4".
	      </p><p>
		Consult the MIRA reference manual for a full list of MIRA
		parameters.
	      </p></dd><dt><span class="term"><code class="option">-q <em class="replaceable"><code>quality_value</code></em></code></span></dt><dd><p>
		When loading read data from files where sequence and quality
		are split in several files (e.g. FASTA with .fasta and
		.fasta.qual files), do not stop if the quality values for a
		read are missing but set them to be the quality_value given.
	      </p></dd><dt><span class="term"><code class="option">-R <em class="replaceable"><code>namestring</code></em></code></span></dt><dd><p>
		Rename contigs/singlets/reads with given name string to which
		a counter is added.
	      </p><p>
		Known bug: will create duplicate names if input (CAF or
		MAF) contains contigs/singlets as well as free reads, i.e.
		reads not in contigs nor singlets.
	      </p></dd><dt><span class="term"><code class="option">-S <em class="replaceable"><code>namescheme</code></em></code></span></dt><dd><p>
		Naming scheme for renaming reads, important for
		paired-ends. Only 'solexa' is supported at the moment.
	      </p></dd><dt><span class="term"><code class="option">-Y <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
		Yield. Defines the maximum number of (clipped/padded) bases to
		convert. When used on reads: output will contain first reads
		of file where length of clipped bases totals at least -Y.
		When used on contigs: output will contain first contigs of
		file where length of padded contigs totals at least -Y.
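	      </p><p>
		One reading of this description is "take entries from the
		start of the file until the running total reaches the given
		value". The Python sketch below implements that reading for
		reads; it is an interpretation of the text rather than
		miraconvert's code, and the function name is hypothetical.
	      </p><pre class="programlisting">
def take_until_yield(sequences, min_total):
    """Return the first sequences whose summed length reaches min_total."""
    selected = []
    total = 0
    for sequence in sequences:
        if total >= min_total:
            break
        selected.append(sequence)
        total += len(sequence)
    return selected


reads = ["ACGTACGTAC", "ACGTA", "ACGTACGT", "ACG"]
print(take_until_yield(reads, 12))   # ['ACGTACGTAC', 'ACGTA']
</pre><p>
		In this reading the output can exceed the requested number of
		bases by up to one entry, since the last entry taken is never
		truncated.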
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_cp_options_contigs"></a>10.1.3.2. Options for input containing contig data</h4></div></div></div><p>
	  The following switches will work only if the input file contains
	  contigs (i.e., CAF or MAF with contig data). Though infrequent, note
	  that both CAF and MAF can contain single reads only.
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-M</code></span></dt><dd><p>
		Do not extract contigs (or their consensus), but the reads
		they are composed of.
	      </p></dd><dt><span class="term"><code class="option">-N <em class="replaceable"><code>namefile</code></em></code></span></dt><dd><p>
		Name select, sorted. Only contigs/reads whose name appears in
		<code class="filename">namefile</code> are selected for
		output. Regardless of the order of
		contigs/reads in the input, the output is sorted according to
		the appearance of names in
		<code class="filename">namefile</code>. <code class="filename">namefile</code>
		is a simple text file having one name entry per line.
	      </p><p>
		Note that for this function to work, all contigs/reads are
		loaded into memory, which may strain your RAM for larger
		projects.
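	      </p><p>
		The difference between -n (input order kept, optionally
		inverted with -i) and -N (order of the name file) can be
		sketched in Python as follows. The function names are
		hypothetical and the code only mirrors the behaviour described
		in the text.
	      </p><pre class="programlisting">
def select_in_input_order(input_names, wanted, invert=False):
    """Like -n (and -i when invert is True): the order of the input is kept."""
    wanted_set = set(wanted)
    return [name for name in input_names if (name in wanted_set) != invert]


def select_in_namefile_order(input_names, wanted):
    """Like -N: the output follows the order of the name file."""
    present = set(input_names)
    return [name for name in wanted if name in present]


contigs = ["bchoc_c3", "bchoc_c5", "bchoc_c13", "bchoc_c14", "bchoc_c20"]
namefile = ["bchoc_c14", "bchoc_c3", "bchoc_c5", "bchoc_c13"]
print(select_in_input_order(contigs, namefile))
print(select_in_input_order(contigs, namefile, invert=True))
print(select_in_namefile_order(contigs, namefile))
</pre><p>
		The second function needs the complete set of input names
		before it can write anything, which matches the memory remark
		above.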
	      </p></dd><dt><span class="term">
	      <code class="option">-r
	      <em class="replaceable"><code>
		{ <code class="option">c</code>  |   <code class="option">C</code>  |   <code class="option">q</code>  |   <code class="option">f</code> }
	      </code></em>
	      </code>
	    </span></dt><dd><p>
		Recalculate consensus and / or consensus quality values and / or
		SNP feature tags of an assembly. This feature is useful in case
		third party programs create their own consensus sequences without
		handling different sequencing technologies (e.g. the combination
		of <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>caf2gap</strong></span>) or
		when the CAF/MAF files do not contain consensus sequences at
		all.
	      </p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">c</code></span></dt><dd>
		    recalculate consensus &amp; consensus qualities using IUPAC where necessary
		  </dd><dt><span class="term"><code class="option">C</code></span></dt><dd>
		    recalculate consensus &amp; consensus qualities forcing ACGT calls and without IUPAC codes
		  </dd><dt><span class="term"><code class="option">q</code></span></dt><dd>
		    recalculate consensus quality values only
		  </dd><dt><span class="term"><code class="option">f</code></span></dt><dd>
		    recalculate SNP features
		  </dd></dl></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		Only the last of cCq is relevant, 'f' works as a switch and can be
		combined with the others (e.g. <span class="quote">&#8220;<span class="quote">-r Cf</span>&#8221;</span>).
	      </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		If the CAF/MAF contains reads from multiple strains, recalculation
		of consensus &amp; consensus qualities is forced; you can only
		influence whether IUPACs are used or not. This is due to the fact
		that CAF/MAF do not provide facilities to store consensus
		sequences from multiple strains.
	      </td></tr></table></div></dd><dt><span class="term"><code class="option">-s</code></span></dt><dd><p>
		Split. Split output into single files, one file per
		contig. Files are named according to name of contig.
	      </p></dd><dt><span class="term"><code class="option">-u</code></span></dt><dd><p>
		fillUp strain genomes. In assemblies made of multiple strains,
		holes in the consensus of a strain (bases 'N' or '@') can be
		filled up with the consensus of the other strains. Takes effect
		only when '-r' is active.
	      </p></dd><dt><span class="term"><code class="option">-Q <em class="replaceable"><code>quality_value</code></em></code></span></dt><dd><p>
		Defines minimum quality a consensus base of a strain
		must have, consensus bases below this will be set to 'N'.
		Only used when -r is active.
	      </p></dd><dt><span class="term"><code class="option">-V <em class="replaceable"><code>coverage_value</code></em></code></span></dt><dd><p>
		Defines minimum coverage a consensus base of a strain must
		have, consensus bases below this coverage will be set to 'N'.
		Only used when -r is active.
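	      </p><p>
		The combined effect of -Q and -V on a recalculated consensus
		can be pictured with the following Python sketch. It assumes
		per-base quality and coverage values are available as lists;
		the function and parameter names are hypothetical and this is
		not the MIRA consensus engine.
	      </p><pre class="programlisting">
def mask_consensus(bases, qualities, coverages, min_quality=0, min_coverage=0):
    """Replace consensus bases below the quality or coverage minimum with 'N'."""
    masked = []
    for base, quality, coverage in zip(bases, qualities, coverages):
        keep = quality >= min_quality and coverage >= min_coverage
        masked.append(base if keep else "N")
    return "".join(masked)


print(mask_consensus("ACGT", [40, 10, 35, 30], [12, 12, 2, 9],
                     min_quality=20, min_coverage=5))   # ANNT
</pre><p>
		In the example, the second base is masked for low quality and
		the third for low coverage.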
	      </p></dd><dt><span class="term"><code class="option">-v</code></span></dt><dd><p>
		Print version number and exit.
	      </p></dd><dt><span class="term"><code class="option">-x <em class="replaceable"><code>length</code></em></code></span></dt><dd><p>
		Minimum length a contig (in full assemblies) or read (in single
		sequence files) must have. All contigs / reads with a
		length less than this value are discarded. Default: 0 (=switched
		off).
	      </p><p>
		Note: this is of course not applied to reads in contigs! Contigs passing
		the [-x] length criterion and stored as complete
		assembly (CAF, MAF, ACE, etc.) still contain all their reads.
	      </p></dd><dt><span class="term"><code class="option">-X <em class="replaceable"><code>length</code></em></code></span></dt><dd><p>
		Similar to [-x], but applies only to clipped reads
		(input file format must have clipping points set to be
		effective).
	      </p></dd><dt><span class="term"><code class="option">-y <em class="replaceable"><code>contig_coverage</code></em></code></span></dt><dd><p>
		Minimum average contig coverage. Contigs with an average
		coverage less than this value are discarded.
	      </p></dd><dt><span class="term"><code class="option">-z <em class="replaceable"><code>min_reads</code></em></code></span></dt><dd><p>
		Minimum number of reads in contig. Contigs with fewer
		reads than this value are discarded.
	      </p></dd><dt><span class="term"><code class="option">-l <em class="replaceable"><code>line_length</code></em></code></span></dt><dd><p>
		On output of assemblies as text or HTML: number of bases shown in
		one alignment line. Default: 60.
	      </p></dd><dt><span class="term"><code class="option">-c <em class="replaceable"><code>endgap_character</code></em></code></span></dt><dd><p>
		On output of assemblies as text or HTML: character used to pad
		endgaps. Default: ' ' (a blank)
	      </p></dd></dl></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_examples"></a>10.1.4. Examples</h3></div></div></div><p>
	In the following examples, the CAF and MAF files used are expected to
	contain full assembly data like the files created by MIRA during an
	assembly or by the gap2caf program. CAF and MAF could be used
	interchangeably in these examples, depending on which format currently
	is available. In general though, MAF is faster to process and smaller
	on disk.
      </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	    Simple conversion: a MIRA MAF file to a SAM file
	  </span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert source.maf destination.sam</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
		Previous versions of miraconvert had a slightly different
		syntax, which however is still supported:
	      </p><pre class="screen">
<strong class="userinput"><code>miraconvert -f maf -t sam source.maf destination</code></strong></pre></td></tr></table></div></dd><dt><span class="term">
	    Simple conversion: the consensus of an assembly to FASTA, at the
	    same time coverage data for contigs to WIG and furthermore
	    translate the CAF to ACE:
	  </span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert source.caf destination.fasta wig ace</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
		Previous versions of miraconvert had a slightly different
		syntax, which however is still supported:
	      </p><pre class="screen">
<strong class="userinput"><code>miraconvert -f caf -t fasta -t wig -t ace source.caf destination</code></strong></pre></td></tr></table></div></dd><dt><span class="term">
	    Filtering an assembly for contigs of length &#8805; 2000 and an
	    average coverage &#8805; 10, while translating from MAF to CAF:
	  </span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert -x 2000 -y 10 source.maf destination.caf</code></strong></pre></dd><dt><span class="term">
	    Filtering a FASTQ file for reads &#8805; 55 base pairs, renaming the
	    selected reads with a string starting with <span class="quote">&#8220;<span class="quote">newname</span>&#8221;</span> and
	    saving them back to FASTQ. Note how  [-t fastq] was left out,
	    as the default behaviour of <span class="command"><strong>miraconvert</strong></span> is
	    to use the same "to" type as the input type ( [-f]).
	  </span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert -x 55 -R newname source.fastq destination.fastq</code></strong></pre></dd><dt><span class="term">
	    Filtering and reordering contigs of an assembly according to external contig name list.
	  </span></dt><dd><p>
	      This example will fetch the contigs named bchoc_c14, ...3, ...5
	      and ...13 and save the result in exactly that order to a new
	      file:
	    </p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l</code></strong>
-rw-r--r-- 1 bach users  231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users         38 2007-10-21 15:16 contigs.lst
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cat contigs.lst</code></strong>
bchoc_c14
bchoc_c3
bchoc_c5
bchoc_c13
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -N contigs.lst bchoc_out.caf myfilteredresult.caf</code></strong>
[...]
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l</code></strong>
-rw-r--r-- 1 bach users  231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users         38 2007-10-21 15:16 contigs.lst
-rw-r--r-- 1 bach users     828726 2007-10-21 15:24 myfilteredresult.caf</pre></dd></dl></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_mutils_bait"></a>10.2. mirabait - a "grep" like tool to select reads with kmers up to 256 bases</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_synopsis"></a>10.2.1. 
	Synopsis
      </h3></div></div></div><div class="cmdsynopsis"><p><code class="command">mirabait</code>  [options] {-b <em class="replaceable"><code>baitfile</code></em> [-b ...] | -B <em class="replaceable"><code>file</code></em>} [-p <em class="replaceable"><code>file1 file2</code></em> | -P <em class="replaceable"><code>file3</code></em>]*
	 [<em class="replaceable"><code>file4 ...</code></em>]</p></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	The above command line, especially the mandatory [-b] format,
	first appeared in MIRA 4.9.0 and represents a major change from 4.0.x!
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_description"></a>10.2.2. Description</h3></div></div></div><p>
	<span class="command"><strong>mirabait</strong></span> selects reads from a read collection which
	are partly similar or equal to sequences defined as target
	baits. Similarity is defined by finding a user-adjustable number of
	common k-mers (sequences of k consecutive bases) which are the same in
	the bait sequences and the screened sequences to be selected, either in forward
	or reverse complement direction.
      </p><p>
	When used on paired files (-p or -P), selects read pairs where at least
	one read matches.
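      </p><p>
	The selection principle can be sketched in a few lines of Python.
	This is a toy illustration of k-mer baiting, not mirabait's
	implementation: all function names are hypothetical, the sketch counts
	distinct shared k-mers, and <code class="literal">min_hits</code>
	stands in for the user-adjustable number of common k-mers.
      </p><pre class="programlisting">
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers (k consecutive bases) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_bait_index(baits, k):
    """Collect the k-mers of every bait in forward and reverse complement direction."""
    index = set()
    for bait in baits:
        index.update(kmers(bait, k))
        index.update(kmers(revcomp(bait.upper()), k))
    return index


def is_baited(read, index, k, min_hits=1):
    """True if the read shares at least min_hits distinct k-mers with the baits."""
    return len(kmers(read, k).intersection(index)) >= min_hits


def pair_is_baited(read1, read2, index, k, min_hits=1):
    """A read pair is selected if at least one of its reads matches."""
    return is_baited(read1, index, k, min_hits) or is_baited(read2, index, k, min_hits)


index = build_bait_index(["AAAACCCC"], k=4)
print(is_baited("CAAACCCA", index, k=4, min_hits=3))   # forward match
print(is_baited("TTGGGTTT", index, k=4, min_hits=3))   # reverse complement match
print(is_baited("ACGTACGT", index, k=4, min_hits=1))   # no shared k-mer
print(pair_is_baited("ACGTACGT", "TTGGGTTT", index, k=4, min_hits=3))
</pre><p>
	Because the bait k-mers are stored in both directions, each read only
	needs to be scanned once, in the direction in which it was read.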
      </p><p>
	One can use <span class="command"><strong>mirabait</strong></span> to do targeted assembly by
	fishing out reads belonging to a gene and just assemble these; or to
	clean out rRNA sequences from data sets; or to fish out and
	iteratively reconstruct mitochondria from metagenomic data; or, or, or
	... whenever one has to take in or take out subsets of reads based on
	kmer equality, this tool should come in quite handy.
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	The search performed is exact, that is, sequences selected are
	guaranteed to have the required number of matching k-mers to the bait
	sequences while sequences not selected are guaranteed not to have them.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_options"></a>10.2.3. Options</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_mainoptions"></a>10.2.3.1. Main options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-b <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
		A file containing sequences to be used as bait. The file can
		be in any of the following types: FASTQ, FASTA, GenBank (.gbf,
		.gbk, .gbff), CAF, MAF or Staden EXP.
	      </p><p>
		If the file extension is non-standard
		(e.g. <code class="filename">file.dat</code>), you can force a file type
		by using the double colon type specification like in
		EMBOSS. E.g.: <code class="filename">fastq::file.dat</code>
	      </p><p>
		Using multiple -b for loading bait sequences from multiple
		files is allowed.
	      </p></dd><dt><span class="term"><code class="option">-B <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
		Load bait from an existing kmer statistics file, not from
		sequence files. Only one -B allowed, cannot be combined with
		-b. See -K on how to create such a file.
	      </p><p>
		This option comes in handy when you always want to bait
		against a given set of sequences, e.g., rRNA sequences or the
		human genome, and where the statistics computation itself may
		be quite time and resource intensive. Once computed and saved
		via [-K], a baiting process loading the statistics
		via  [-B] can start much faster.
	      </p></dd><dt><span class="term"><code class="option">-j <em class="replaceable"><code>job</code></em></code></span></dt><dd><p>
		Set option for predefined job from supplied MIRA library. Currently available jobs:
	      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
		    'rrna': Bait rRNA/rDNA sequences. Locked options: [-b,
		    -B, -k, -K, -n]. In the current version mirabait will
		    use a hash statistics file with 21mers derived from a
		    subset of the RFAM 12 rRNA database to bait rRNA/rDNA
		    reads. The supplied subset should catch all but the most
		    uncommon rRNA data; if needed, the sensitivity can be
		    increased by decreasing  [-n].
		  </p></li></ul></div><p>
		Note that [-j] will hardwire a number of options to
		be optimal for the chosen job. It is not advisable
		to change the 'locked' options as this either breaks the
		functionality or, worse, could lead to undefined results.
	      </p></dd><dt><span class="term"><code class="option">-p <em class="replaceable"><code>file_1 file_2</code></em></code></span></dt><dd><p>
		Instructs mirabait to load sequences to be baited from files
		<code class="filename">file_1</code> and
		<code class="filename">file_2</code>. The sequences are treated as
		pairs, where a read in one file is paired with a read in the
		second file. The files can be in any of the following types:
		FASTQ, FASTA, GenBank (.gbf, .gbk, .gbff), CAF, MAF or Staden
		EXP.
	      </p><p>
		If the file extension is non-standard
		(e.g. <code class="filename">file.dat</code>), you can force a file type
		by using the double colon type specification like in
		EMBOSS. E.g.: <code class="filename">fastq::file.dat</code>
	      </p><p>
		Using multiple -p for baiting sequences from multiple file
		pairs is allowed.
	      </p></dd><dt><span class="term"><code class="option">-P <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
		Instructs mirabait to load sequences to be baited from file
		<code class="filename">file</code>. The sequences are treated as pairs,
		where a read in the file is immediately followed by its paired
		read. The file can be in any of the following types: FASTQ,
		FASTA, GenBank (.gbf, .gbk, .gbff), CAF, MAF or Staden
		EXP.
	      </p><p>
		If the file extension is non-standard
		(e.g. <code class="filename">file.dat</code>), you can force a file type
		by using the double colon type specification like in
		EMBOSS. E.g.: <code class="filename">fastq::file.dat</code>
	      </p><p>
		Using multiple -P for baiting sequences from multiple files is
		allowed.
	      </p></dd><dt><span class="term"><code class="option">-k <em class="replaceable"><code>kmer-length</code></em></code></span></dt><dd><p>
		k-mer, length of bait in bases (&#8804;256, default=31)
	      </p></dd><dt><span class="term"><code class="option">-n <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
		Default value: 1.
	      </p><p>
		If the integer given is &gt; 0: minimum number of kmers needed
		for a sequence to be selected.
	      </p><p>
		If the integer given is &#8804; 0: maximum number of missed kmers
		allowed over sequence length for a sequence to be selected.
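	      </p><p>
		The two modes of [-n] can be illustrated with a minimal Python
		sketch. This is not MIRA's implementation: the function names are
		hypothetical, k-mers are kept in a plain set, and the sketch assumes
		that a "missed kmer" is a k-mer of the sequence which does not occur
		in the bait. Reverse complement k-mers of the bait are added unless
		told otherwise, mirroring the behaviour described for [-r].
	      </p><pre class="programlisting">
```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (non-ACGT becomes N)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs.get(base, "N") for base in reversed(seq.upper()))


def kmers(seq: str, k: int) -> list:
    """All overlapping k-mers of seq, in order of appearance."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_bait_set(baits: list, k: int, both_strands: bool = True) -> set:
    """Collect bait k-mers; both_strands=False corresponds to using -r."""
    table = set()
    for bait in baits:
        table.update(kmers(bait, k))
        if both_strands:
            table.update(kmers(reverse_complement(bait), k))
    return table


def is_selected(read: str, bait_set: set, k: int, n: int = 1) -> bool:
    """Apply the -n rule as described in the text (illustrative only)."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return False  # read shorter than k: nothing to compare
    hits = sum(1 for kmer in read_kmers if kmer in bait_set)
    if n > 0:
        # positive n: at least n k-mers must match the bait
        return hits >= n
    # n of 0 or below: at most -n k-mers may be missing from the bait
    allowed_misses = -n
    misses = len(read_kmers) - hits
    return allowed_misses >= misses
```
</pre><p>
		Under these assumptions, <code class="option">-n 0</code> selects
		only sequences whose k-mers are all present in the bait, and
		<code class="option">-n -3</code> tolerates up to three k-mers that
		are not.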
	      </p></dd><dt><span class="term"><code class="option">-d</code></span></dt><dd><p>
		Do not use kmers with microrepeats (DUST-like). Standard
		length for microrepeats is 67% of kmer length, see
		[-D] to change this.
	      </p><p>
		Microrepeats are defined as repeats of a 1, 2, 3 or 4 base
		motif at either end (not in the middle) of a kmer. E.g.: a
		kmer of 17 will have a microrepeat length of 12 bases, so
		that, all kmers having 12 A, C, G, T at either end will be
		filtered away. E.g.: AAAAAAAAAAAAnnnnn as well as
		nnnnnAAAAAAAAAAAA will be filtered.
	      </p><p>
		E.g. for repeats of 2 bases: AGAGAGAGAGAGnnnnn or CACACACACACAnnnnn.
	      </p><p>
		E.g. for repeats of 3 bases: ACGACGACGACGnnnnn.
	      </p><p>
		E.g. for repeats of 4 bases: ACGTACGTACGTnnnnn.
	      </p><p>
		Microrepeat motifs will truncate at the end of allocated
		microrepeat length. E.g. kmer length 20 with microrepeat
		length of 13 and 4 base repeat: ACGTACGTACGTAnnnnnnn.
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
		  When saving the kmer statistics via [-K], having
		   [-d] will save kmer statistics where kmers with
		  microrepeats have already been removed. Use this when you
		  always want to have microrepeats removed from a given bait
		  data set, as  [-d] will then not be needed when loading
		  that set later via  [-B] (which saves time).
		</p><p>
		  If you want to be able to bait from precomputed kmer
		  statistics both with and without microrepeats, use
		  [-d] only when loading the statistics file with
		   [-B], not when creating it with  [-K].
		</p></td></tr></table></div></dd><dt><span class="term"><code class="option">-D <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
		Set length of microrepeats in kmers to discard from bait.
	      </p><p>
	      int &gt; 0: microrepeat length in percentage of kmer length.
	      E.g.: -k 17 -D 67 --&gt; 67% of 17 bases = 11.39 bases --&gt; 12 bases.
	      </p><p>
		int &lt; 0: microrepeat length in bases.
	      </p><p>
		int != 0 implies -d, int=0 turns DUST filter off
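	      </p><p>
		The length arithmetic and the microrepeat definition given for
		[-d] can be sketched in Python as follows. The function names are
		hypothetical and this is not MIRA's code; rounding up is assumed
		because it matches the example above (11.39 bases become 12), and a
		"repeat of a motif" is modelled as a stretch that is periodic with
		period 1, 2, 3 or 4, which also covers motifs truncated at the end
		of the stretch.
	      </p><pre class="programlisting">
```python
def microrepeat_length(k: int, d: int = 67) -> int:
    """Microrepeat length in bases for k-mer size k and a -D style value d.

    positive d: percentage of k, rounded up (assumption taken from the example)
    d equal 0:  filter switched off, reported here as length 0
    negative d: absolute length in bases
    """
    if d > 0:
        return (k * d + 99) // 100  # integer ceiling of k * d / 100
    return -d


def is_periodic(stretch: str, period: int) -> bool:
    """True if every base equals the base 'period' positions further on."""
    return all(stretch[i] == stretch[i + period]
               for i in range(len(stretch) - period))


def has_microrepeat(kmer: str, length: int) -> bool:
    """True if either end of kmer is a 1 to 4 base motif repeat of 'length' bases."""
    if length == 0 or length > len(kmer):
        return False
    ends = (kmer[:length], kmer[-length:])
    # only periods shorter than the stretch are meaningful
    periods = [p for p in (1, 2, 3, 4) if length > p]
    return any(is_periodic(end, p) for end in ends for p in periods)
```
</pre><p>
		For example, <code class="literal">microrepeat_length(17, 67)</code>
		gives 12, and a 17-mer starting or ending with twelve A bases is
		then reported as containing a microrepeat.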
	      </p></dd><dt><span class="term"><code class="option">-i</code></span></dt><dd><p>
		Inverse selection: selects only sequences that do not meet the
		-k and -n criteria.
	      </p></dd><dt><span class="term"><code class="option">-I</code></span></dt><dd><p>
		Filters and writes sequences which hit to one file and
		sequences which do not hit to a second file.
	      </p></dd><dt><span class="term"><code class="option">-r</code></span></dt><dd><p>
		Does not check for hits in reverse complement direction.
	      </p></dd><dt><span class="term"><code class="option">-t <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
		Number of threads to use. The default value of 0 is configured
		to automatically use up to 4 CPU cores (if present). Numbers
		higher than 4 (or maybe 8) will probably not make much sense
		because of diminishing returns.
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_outputdef"></a>10.2.3.2. File type options</h4></div></div></div><p>
	  Normally, mirabait writes separate result files (named
	  <code class="filename">bait_match_*</code> and
	  <code class="filename">bait_miss_*</code>) for each input to the current
	  directory. For changing this behaviour, and others relating to
	  output, use these options:
	</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-c</code></span></dt><dd><p>
		Normally, mirabait will change the case of the sequences it
		loads to denote kmers which hit a bait in upper case and kmers
		which did not hit a bait in lower case. Using -c switches off
		this behaviour.
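	      </p><p>
		The case convention can be pictured with a small Python sketch.
		It is only an illustration with a hypothetical function name, and
		it assumes that a base is written in upper case as soon as at least
		one k-mer covering it hits a bait.
	      </p><pre class="programlisting">
```python
def mark_hits(seq: str, bait_set: set, k: int) -> str:
    """Upper-case bases covered by a bait-matching k-mer, lower-case the rest."""
    seq = seq.upper()
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in bait_set:
            # every base of a matching k-mer counts as covered
            covered[i:i + k] = [True] * k
    return "".join(base if hit else base.lower()
                   for base, hit in zip(seq, covered))
```
</pre><p>
		Called with the sequence <code class="literal">TTACGTACGG</code>,
		the bait k-mers ACGT, CGTA and GTAC and a k of 4, the sketch returns
		<code class="literal">ttACGTACgg</code>.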
	      </p></dd><dt><span class="term"><code class="option">-l <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
		Set length of sequence line in FASTA output.
	      </p></dd><dt><span class="term"><code class="option">-K <em class="replaceable"><code>filename</code></em></code></span></dt><dd><p>
		Save kmer statistics (for baits loaded via [-b]) to
		<code class="filename">filename</code>.
	      </p><p>
		As the calculation of kmers can take quite some time for
		larger sequences (e.g., human genome), this option is
		extremely useful if you want to perform the same baiting
		operation more than once. Once calculated, the kmer statistics
		is saved and can be reloaded for a later baiting operation via
		[-B].
	      </p></dd><dt><span class="term"><code class="option">-N <em class="replaceable"><code>name</code></em></code></span></dt><dd><p>
		Change the file prefix <code class="filename">bait</code> to
		<code class="filename">name</code>. Has no effect if -o/-O is used and
		targets are not directories.
	      </p></dd><dt><span class="term"><code class="option">-o <em class="replaceable"><code>path</code></em></code></span></dt><dd><p>
		Save sequences matching a bait to
		<code class="filename">path</code>. If <code class="filename">path</code> is a
		directory, write separate files into this directory. If not,
		combine all matching sequences from the input file(s) into a
		single file specified by the path.
	      </p></dd><dt><span class="term"><code class="option">-O <em class="replaceable"><code>path</code></em></code></span></dt><dd><p>
		Like -o, but for sequences not matching.
	      </p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_other"></a>10.2.3.3. Other options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-T <em class="replaceable"><code>dir</code></em></code></span></dt><dd><p>
		Use <code class="filename">dir</code> as directory for temporary files
		instead of the current working directory.
	      </p></dd><dt><span class="term"><code class="option">-m <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
		Default is <span class="underline">75</span>. Defines
		the memory MIRA can use to compute kmer statistics. It therefore
		does not apply when using  [-B].
	      </p><p>
		A value of <span class="underline">&gt;100</span> is
		interpreted as absolute value in megabyte. E.g., 16384 = 16384
		megabyte = 16 gigabyte.
	      </p><p>
		A value of <span class="underline">0 &#8804; x &#8804;100</span> is
		interpreted as relative value of free memory at the time of
		computation. E.g.: for a value of 75% and 10 gigabyte of free
		memory, it will use 7.5 gigabyte.
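	      </p><p>
		The two interpretations of the value can be written down as a
		short Python sketch. The function name is hypothetical, one
		megabyte is assumed to be 1024 * 1024 bytes (consistent with 16384
		megabyte being 16 gigabyte), and the minimum amounts of memory
		mentioned for this algorithm are deliberately not modelled.
	      </p><pre class="programlisting">
```python
MIB = 1024 * 1024


def kmer_memory_budget(m: int, free_bytes: int) -> int:
    """Bytes usable for k-mer statistics under the two -m rules in the text."""
    if m > 100:
        # values above 100: absolute amount in megabytes
        return m * MIB
    # values from 0 to 100: percentage of the memory currently free
    return free_bytes * m // 100
```
</pre><p>
		With 10 gigabyte free and the default of 75, the sketch yields
		7.5 gigabyte, as in the example above.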
	      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
		The minimum amount of memory this algorithm will use is 512 MiB
		on 32 bit systems and 2 GiB on 64 bit systems.
	      </td></tr></table></div></dd></dl></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_examples"></a>10.2.4. Usage examples</h3></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	The examples below, together with the manual above, should be enough to get
	you going. If there's a typical use case you are missing, feel free to
	suggest it on the MIRA talk mailing list.
      </td></tr></table></div><p>Baiting unpaired sequences, bait sequences in FASTA, sequences in FASTQ:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta file.fastq</code></strong></pre><p>Same as above, but baits in two files (FASTA and GenBank):</p><pre class="screen"><strong class="userinput"><code>mirabait -b b1.fasta -b b2.gbk file.fastq</code></strong></pre><p>Baiting paired sequences, read pairs are in two files:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>Baiting paired sequences, pairs are interleaved in one file:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -P file.fastq</code></strong></pre><p>Like above, but selecting sequences which do not match the baits:</p><pre class="screen"><strong class="userinput"><code>mirabait -i -b b.fasta -P file.fastq</code></strong></pre><p>
	Baiting paired sequences (<code class="filename">file_1.fastq</code>,
	<code class="filename">file_2.fastq</code> and
	<code class="filename">file3.fasta</code>) and unpaired sequences
	(<code class="filename">file4.caf</code>), all at once and with different
	file types:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf</code></strong></pre><p>
	Like above, but writing sequences matching baits and sequences not
	matching baits to different files:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -I -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf</code></strong></pre><p>Change bait criterion to need 10 kmers of size 27:</p><pre class="screen"><strong class="userinput"><code>mirabait -k 27 -n 10 -b b.fasta file.fastq</code></strong></pre><p>
	Change bait criterion to baiting only reads which have all kmers
	present in the bait:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -n 0 -b b.fasta file.fastq</code></strong></pre><p>
	Change bait criterion to baiting all reads having almost all kmers
	present in the bait, but allowing for up to 40 kmers not in the bait:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -n -40 -b b.fasta file.fastq</code></strong></pre><p>
	Force bait sequences to load as FASTA, force sequences to be baited to
	be loaded as FASTQ:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -b fasta::b.dat fastq::file.dat</code></strong></pre><p>
	Write result files to directory <code class="filename">/dev/shm/</code>:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/ -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
	Merge all result files containing sequences hitting baits to file
	<code class="filename">/dev/shm/match</code>:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/match -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
	Like above, but also merge all result files containing sequences not
	hitting baits to file <code class="filename">/dev/shm/nomatch</code>:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/match -O /dev/shm/nomatch -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
	Fetch all reads having rRNA motifs in paired FASTQ files:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -p file1.fastq file2.fastq</code></strong></pre><p>
	Fetch all reads not having rRNA motifs in paired FASTQ files:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -i -p file1.fastq file2.fastq</code></strong></pre><p>
	Split paired FASTQ files into two sets of files (4 files total), one
	containing rRNA reads and one containing non-rRNA reads:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -I -p file1.fastq file2.fastq</code></strong></pre><p>
	Assuming the file <code class="filename">human_genome.fasta</code> contains the
	human genome: bait all read pairs matching the human genome. Also,
	save the computed kmer statistics for later re-use to file
	<code class="filename">HG_kmerstats.mhs.gz</code>:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz -p file1.fastq file2.fastq</code></strong></pre><p>
	The same as above, but just precompute and save the kmer statistics, no actual baiting done.
      </p><pre class="screen"><strong class="userinput"><code>mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz</code></strong></pre><p>
	Using the precomputed kmer statistics from the command above: bait
	files with read pairs for human reads:
      </p><pre class="screen"><strong class="userinput"><code>mirabait -B HG_kmerstats.mhs.gz -p file_1.fastq file_2.fastq</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_installrrnadb"></a>10.2.5. Installing different rRNA databases</h3></div></div></div><p>
	The standard database for rRNA baiting supplied with the MIRA source
	code and binary packages is called
	<code class="filename">rfam_rrna-21-12.sls.gz</code> which will get installed
	as <span class="emphasis"><em>MHS</em></span> (MiraHashStatistics) file into
	<code class="filename">$BINDIR/../share/mira/mhs/rfam_rrna-21-12.mhs.gz</code>
	(where $BINDIR is the directory where the mira/mirabait binary
	resides) and a soft link pointing from
	<code class="filename">filter_default_rrna.mhs.gz</code> to
	<code class="filename">rfam_rrna-21-12.mhs.gz</code> like so:
      </p><pre class="screen"><code class="prompt">arcadia:~$</code> <strong class="userinput"><code>which mira</code></strong>
/usr/local/bin/mira
<code class="prompt">arcadia:~$</code> <strong class="userinput"><code>ls -l /usr/local/share/mira/mhs</code></strong>
lrwxrwxrwx 1 bach bach        22 Mar 24 23:58 filter_default_rrna.mhs.gz -&gt; rfam_rrna-21-12.mhs.gz
-rw-rw-r-- 1 bach bach 148985059 Mar 24 23:58 rfam_rrna-21-12.mhs.gz</pre><p>
	The file naming scheme for the database is as follows:
	dbidentifier-kmerlength-kmerfreqcutoff. The standard database is therefore:
	<code class="filename">rfam_rrna</code> as identifier for the RFAM rRNA
	sequences (currently RFAM 12), then 21 defining a kmer length of 21
	and finally a kmer cutoff frequency of 12, meaning that kmers must
	have been seen at least 12 times in the RFAM database to be stored in
	the subset.
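      </p><p>
	As an illustration of the naming scheme, the following Python sketch
	splits such a file name into its three parts. The function name is
	hypothetical; the sketch assumes that the kmer length and the cutoff
	are always the last two hyphen-separated fields before the first dot
	of the file name.
      </p><pre class="programlisting">
```python
import os


def parse_db_name(filename: str) -> tuple:
    """Split 'dbidentifier-kmerlength-kmerfreqcutoff.ext' into its parts."""
    base = os.path.basename(filename)
    stem = base.split(".", 1)[0]  # drop extensions such as .mhs.gz or .sls.gz
    identifier, kmer_length, cutoff = stem.rsplit("-", 2)
    return identifier, int(kmer_length), int(cutoff)
```
</pre><p>
	For <code class="filename">rfam_rrna-21-12.mhs.gz</code> this returns
	the identifier <code class="literal">rfam_rrna</code>, a kmer length of
	21 and a cutoff frequency of 12.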
      </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	The value of 12 as frequency cutoff for the standard mirabait rRNA
	database was chosen as a compromise between sensitivity and database
	size.
      </td></tr></table></div><p>
	Although rRNA is pretty well conserved overall, the cutoff frequency
	also implies that kmers from rare rRNA variants will not be present in
	the database, possibly losing some sensitivity for rRNA from rarely
	sequenced organisms. More sensitive versions of the
	rRNA database can therefore be installed by downloading a file from the MIRA
	repository at SourceForge and calling a script provided by MIRA. To
	install a version with a kmer size of 21 and a cutoff frequency of,
	e.g., 3, download <code class="filename">rfam_rrna-21-3.sls.gz</code> and
	install it like this:
      </p><pre class="screen"><code class="prompt">arcadia:~/tmp$</code> <strong class="userinput"><code>ls</code></strong>
<code class="prompt">arcadia:~/tmp$</code> <strong class="userinput"><code>wget https://sourceforge.net/projects/mira-assembler/files/MIRA/slsfiles/rfam_rrna-21-3.sls.gz</code></strong>
...
</pre><p>
      </p><p>
	TODO: continue docs here.
      </p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_hard"></a>Chapter 11. Assembly of <span class="emphasis"><em>hard</em></span> genome or EST / RNASeq projects</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_hard_getting_mean_data_assembled">11.1. 
      Getting 'mean' genomes or EST / RNASeq data sets assembled
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_hard_for_the_impatient">11.1.1. 
	For the impatient
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_introduction_to_masking">11.1.2. 
	Introduction to 'masking'
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_how_does_nasty_repeat_masking_work">11.1.3. 
	How does 'nasty repeat' masking work?
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_selecting_a_nasty_repeat_ratio">11.1.4. 
	Selecting a "nasty repeat ratio"
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_hard_how_MIRA_tags_different_repeat_levels">11.2. 
      How MIRA tags different repeat levels
    </a></span></dt><dt><span class="sect1"><a href="#sect_hard_the_readrepeats_info_file">11.3. 
      The readrepeats info file
    </a></span></dt><dt><span class="sect1"><a href="#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data">11.4. 
      Pipeline to find worst contaminants or repeats in sequencing data
    </a></span></dt><dt><span class="sect1"><a href="#sect_hard_examples_for_kmer_statistics">11.5. 
      Examples for kmer statistics
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_hard_caveat:_sk:kms">11.5.1. 
	Caveat: -SK:kmer_size
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_sanger_sequencing_a_simple_bacterium">11.5.2. 
	Sanger sequencing, a simple bacterium
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_454_sequencing_a_somewhat_more_complex_bacterium">11.5.3. 
	454 Sequencing, a somewhat more complex bacterium
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_solexa_sequencing_ecoli_mg1655">11.5.4. 
	Solexa sequencing, E.coli MG1655
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_need_examples_for_eukaryotes">11.5.5. 
	(NEED EXAMPLES FOR EUKARYOTES)
      </a></span></dt><dt><span class="sect2"><a href="#sect_hard_need_examples_for_pathological_cases">11.5.6. 
	(NEED EXAMPLES FOR PATHOLOGICAL CASES)
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">If it were easy, it would have been done already.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_getting_mean_data_assembled"></a>11.1. 
      Getting 'mean' genomes or EST / RNASeq data sets assembled
    </h2></div></div></div><p>
    </p><p>
      For some EST data sets you might want to assemble, MIRA will take too
      long or the available memory will not be sufficient. For genomes this
      can be the case for eukaryotes, plants, but also for some bacteria which
      contain a high number of (pro-)phages, plasmids or engineered operons. For
      EST data sets, this concerns all projects with non-normalised libraries.
    </p><p>
      This guide is intended to get you through these problematic genomes. It
      is not (and cannot be) exhaustive, but it should get you going.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_for_the_impatient"></a>11.1.1. 
	For the impatient
      </h3></div></div></div><p>
	For bacteria with nasty repeats, try first
	[--hirep_something]. This will increase runtime and memory
	requirements, but helps to get this sorted out. If the data for lower
	eukaryotes leads to runtime and memory explosion, try either
	 [--hirep_good] or, for desperate cases,
	 [--hirep_something].
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_introduction_to_masking"></a>11.1.2. 
	Introduction to 'masking'
      </h3></div></div></div><p>
	The SKIM phase (all-against-all comparison) will report almost every potential
	hit to be checked with Smith-Waterman further downstream in the MIRA assembly
	process. While this is absolutely no problem for most bacteria, some genomes
	(eukaryotes, plants, some bacteria) have so many closely related sequences
	(repeats) that the data structures needed to take up all information might get
	much larger than your available memory. In those cases, your only chance to
	still get an assembly is to tell the assembler it should disregard extremely
	repetitive features of your genome.
      </p><p>
	There is, in most cases, one problem: one doesn't know beforehand which parts
	of the genome are extremely repetitive. But MIRA can help you here as it
	produces most of the needed information during assembly and you just need to
	choose a threshold from where on MIRA won't care about repetitive matches.
      </p><p>
	The key to this are the three fail-safe command line parameters which will mask
	"nasty" repeats from the quick overlap finder (SKIM): [-KS:mnr] together with
	 [-KS:nrr] or  [-KS:nrc]. I'll come back
	to  [-SK:kms] later as it also plays a role in this.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_how_does_nasty_repeat_masking_work"></a>11.1.3. 
	How does 'nasty repeat' masking work?
      </h3></div></div></div><p>
      </p><p>
	If switched on via [-KS:mnr=yes], MIRA will use k-mer statistics to
	find repetitive stretches. K-mers are nucleotide stretches of length k. In a
	perfectly sequenced genome without any sequencing error and without sequencing
	bias, the k-mer frequency can be used to assess how many times a given
	nucleotide stretch is present in the genome: if a specific k-mer is present as
	many times as the average frequency of all k-mers, it is reasonable
	to assume that the specific k-mer is not part of a repeat (at
	least not in this genome).
      </p><p>
	Following the same line of thinking, if the frequency of a specific k-mer is two
	times higher than the average of all k-mers, one would assume that this
	specific k-mer is part of a repeat which occurs exactly two times in the
	genome. For 3x k-mer frequency, a repeat is present three times, and so on. MIRA
	will merge the frequency information of single k-mers into larger 'repeat'
	stretches and tag these stretches accordingly.
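      </p><p>
	The reasoning above can be made concrete with a minimal Python
	sketch. It is a simplification under stated assumptions, not MIRA's
	algorithm: the function names are hypothetical, k-mers are counted on
	the forward strand only, and the copy number of a k-mer is estimated
	simply as its frequency divided by the average frequency of all
	distinct k-mers.
      </p><pre class="programlisting">
```python
from collections import Counter


def count_kmers(reads: list, k: int) -> Counter:
    """Count every overlapping k-mer in all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def copy_number_estimates(counts: dict) -> dict:
    """Ratio of each k-mer frequency to the average k-mer frequency."""
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)
    return {kmer: freq / average for kmer, freq in counts.items()}
```
</pre><p>
	A ratio near 1 then suggests a k-mer which is unique in the genome, a
	ratio near 2 a k-mer from a two-copy repeat, with the caveat that the
	repeats themselves pull the average upwards.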
      </p><p>
	Of course, low-complexity nucleotide stretches (like poly-A in eukaryotes),
	sequencing errors in reads and non-uniform distribution of reads in a
	sequencing project will weaken the initial assumption that a k-mer frequency
	is representative for repeat status. But even then the k-mer frequency model
	works quite well and will give a pretty good overall picture: most repeats
	will be tagged as such.
      </p><p>
	Note that the parts of reads tagged as "nasty repeat" will not get masked per
	se, the sequence will still be present. The stretches dubbed repetitive will
	get the "MNRr" tag. They will still be used in Smith-Waterman overlaps and
	will generate a correct consensus if included in an alignment, but they will
	not be used as seed.
      </p><p>
	Some reads will invariably end up being completely repetitive. These
	will not be assembled into contigs as MIRA will not see overlaps as
	they'll be completely masked away. These reads will end up as
	debris. However, note that MIRA is pretty good at discerning 100%
	matching repeats from repeats which are not 100% matching: if there's
	a single base with which repeats can be discerned from each other,
	MIRA will find this base and use the k-mers covering that base to find
	overlaps.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_selecting_a_nasty_repeat_ratio"></a>11.1.4. 
	Selecting a "nasty repeat ratio"
      </h3></div></div></div><p>
      </p><p>
	The ratio from which on the MIRA kmer statistics algorithm won't
	report matches is set via [-KS:nrr]. E.g.,
	using  [-KS:nrr=10] will hide all k-mers which occur at a
	frequency 10 times (or more) higher than the median of all k-mers.
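      </p><p>
	As a rough illustration of this threshold, the following Python
	sketch selects the k-mers that would count as "nasty" for a given
	ratio. The function name and the dictionary of k-mer frequencies are
	hypothetical; only the rule itself (frequency at least nrr times the
	median) is taken from the text.
      </p><pre class="programlisting">
```python
from statistics import median


def nasty_kmers(counts: dict, nrr: int = 10) -> set:
    """K-mers whose frequency is at least nrr times the median frequency."""
    if not counts:
        return set()
    threshold = nrr * median(counts.values())
    return {kmer for kmer, freq in counts.items() if freq >= threshold}
```
</pre><p>
	With frequencies of 19, 20, 21, 200 and 450 the median is 21, so a
	ratio of 10 hides only the k-mer seen 450 times, while a ratio of 5
	would also hide the one seen 200 times.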
      </p><p>
	The nastiness of a repeat is difficult to judge, but starting with 10 copies
	in a genome, things can get complicated. At 20 copies, you'll have some
	troubles for sure.
      </p><p>
	The default value of <span class="emphasis"><em>10</em></span> for
	the  [-KS:nrr] parameter is a pretty good starting point
	which can be tried for an assembly before trying to optimise it via
	studying the kmer statistics calculated by MIRA. For the latter, please
	read the section 'Examples for kmer statistics' further down in this
	guide.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_how_MIRA_tags_different_repeat_levels"></a>11.2. 
      How MIRA tags different repeat levels
    </h2></div></div></div><p>
      During the SKIM phase, MIRA will assign frequency information to each and every
      k-mer in all reads of a sequencing project, giving them different
      status. Additionally, tags are set in the reads so that one can
      assess reads in assembly editors that understand tags (like gap4,
      gap5, consed etc.). The following tags are used:
    </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	  HAF2
	</span></dt><dd><p> coverage below average ( default: &lt; 0.5 times average)
	  </p></dd><dt><span class="term">
	  HAF3
	</span></dt><dd><p> coverage is at average ( default: &#8805; 0.5 times average and &#8804; 1.5 times average)
	  </p></dd><dt><span class="term">
	  HAF4
	</span></dt><dd><p> coverage above average ( default: &gt; 1.5 times average and &lt; 2 times average)
	  </p></dd><dt><span class="term">
	  HAF5
	</span></dt><dd><p> probably repeat ( default: &#8805; 2 times average and &lt; 5 times average)
	  </p></dd><dt><span class="term">
	  HAF6
	</span></dt><dd><p> 'crazy' repeat ( default: &gt; 5 times average)
	  </p></dd><dt><span class="term">
	  MNRr
	</span></dt><dd><p> stretches which were masked away by [-KS:<em class="replaceable"><code>mnr=yes</code></em>]
	  being more repetitive than deduced
	  by  [-KS:<em class="replaceable"><code>nrr=...</code></em>] or given via  [-KS:<em class="replaceable"><code>nrc=...</code></em>].
	  </p></dd></dl></div><p>
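      The default frequency classes listed above can be expressed as a
      small Python sketch. The function name is hypothetical and this is
      not MIRA's code; because the list leaves a ratio of exactly 5 times
      the average unassigned, the sketch assumes it belongs to HAF6. The
      MNRr tag depends on the [-KS] settings and is not modelled.
    </p><pre class="programlisting">
```python
def haf_tag(frequency: float, average: float) -> str:
    """Map a k-mer frequency to the default HAF class listed in the text."""
    ratio = frequency / average
    if ratio >= 5:
        return "HAF6"  # assumption: exactly 5 times average counts as HAF6
    if ratio >= 2:
        return "HAF5"  # probably repeat
    if ratio > 1.5:
        return "HAF4"  # coverage above average
    if ratio >= 0.5:
        return "HAF3"  # coverage at average
    return "HAF2"      # coverage below average
```
</pre><p>
      For an average k-mer frequency of 40, a k-mer seen 70 times would be
      classified as HAF4 and one seen 80 times as HAF5.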
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_the_readrepeats_info_file"></a>11.3. 
      The readrepeats info file
    </h2></div></div></div><p>
      If [-KS:mnr=yes] is used, MIRA will write an additional file into the
      info directory:
      <code class="filename">&lt;projectname&gt;_info_readrepeats.lst</code>
    </p><p>
      The "readrepeats" file makes it possible to try and find out what makes
      sequencing data nasty. It's a key-value-value file with the name of the
      sequence as "key" and then the type of repeat (HAF2 - HAF7 and MNRr) and
      the repeat sequence as "values". "Nasty" in this case means
      <span class="emphasis"><em>everything which was masked via
      [-KS:mnr=yes]</em></span>.
    </p><p>
      The file looks like this:
    </p><pre class="screen">
read1     HAF5   GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
read2     HAF7   CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
read2     MNRr   AAAAAAAAAAAAAAAAAAAAAAAAAAAA ...
read3     HAF6   GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
...
etc.
    </pre><p>
      That is, each line consists of the read name where a stretch of
      repetitive sequences was found, then the MIRA repeat categorisation
      level (HAF2 to HAF7 and MNRr) and then the stretch of bases which is
      seen to be repetitive.
    </p><p>
      Note that a single read can contain several disjoint repeat stretches
      and can therefore occur more than once in the file, as shown with
      <span class="emphasis"><em>read2</em></span> in the example above.
    </p><p>
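      If you prefer to look at the file from a script, the three-column
      layout is easy to read. The following Python sketch groups the
      stretches by read name; the function name
      <code class="literal">parse_readrepeats</code> and the shortened example
      lines are made up for illustration, and the columns are assumed to be
      separated by whitespace.
    </p><pre class="programlisting">
```python
from collections import defaultdict


def parse_readrepeats(lines):
    """Group the (tag, sequence) pairs of a readrepeats file by read name."""
    repeats = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or incomplete lines
        name, tag, sequence = fields
        repeats[name].append((tag, sequence))
    return dict(repeats)


# Shortened, made-up lines in the layout shown above.
EXAMPLE = [
    "read1\tHAF5\tGCTTCGGCTTCGGCT",
    "read2\tHAF7\tCCGAAGCCGAAGCCG",
    "read2\tMNRr\tAAAAAAAAAAAAAAA",
]

for name, stretches in parse_readrepeats(EXAMPLE).items():
    print(name, [tag for tag, _ in stretches])
```
</pre><p>
      A read with several stretches, like read2 here, simply ends up with
      more than one entry in its list.
    </p><p>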
      One will need to search some databases with the "nasty" sequences and find
      vector sequences, adaptor sequences or even human sequences in bacterial or
      plant genomes ... or vice versa as this type of contamination happens quite
      easily with data from new sequencing technologies. After a while one gets a
      feeling what constitutes the largest part of the problem and one can start to
      think of taking countermeasures like filtering, clipping, masking etc.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data"></a>11.4. 
      Pipeline to find worst contaminants or repeats in sequencing data
    </h2></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	In case you are not familiar with UNIX pipes, now would be a good time
	to read an introductory text on how this wonderful system works. You
	might want to start with a short introductory article at Wikipedia:
	<a class="ulink" href="http://en.wikipedia.org/wiki/Pipeline_%28Unix%29" target="_top">http://en.wikipedia.org/wiki/Pipeline_%28Unix%29</a>
      </p><p>
	In a nutshell: instead of output to files, a pipe directs the output
	of one program as input to another program.
      </p></td></tr></table></div><p>
      There's one very simple trick I use to find out whether your data
      contains some kind of sequencing vector or adaptor contamination. It
      makes use of the read repeat file discussed above.
    </p><p>
      The following example illustrates this on a 454 data set where the
      sequencing provider used some special adaptor in the wet lab but somehow
      forgot to tell the Roche pre-processing software about it, so that a
      very large fraction of reads in the SFF file had unclipped adaptor
      sequence in them (which of course wreaks havoc with assembly programs):
    </p><pre class="screen"><code class="prompt">arcadia:$</code> <strong class="userinput"><code>grep MNRr <em class="replaceable"><code>badproject</code></em>_info_readrepeats.lst | cut -f 3| sort | uniq -c |sort -g -r | head -15</code></strong>
    504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC</pre><p>
      You probably see a sequence pattern
      CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC in the above screenshot. Before
      going into details of what you are actually seeing, here's an
      explanation of how this pipeline works:
    </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
	  grep MNRr <em class="replaceable"><code>badproject</code></em>_info_readrepeats.lst
	</span></dt><dd><p>
	    From the file with the information on repeats, grab all the lines
	    containing repetitive sequence which MIRA categorised as 'nasty'
	    via the 'MNRr' tag. The result looks a bit like this (first 15
	    lines shown):</p><pre class="screen">C6E3C7T12GKN35  MNRr    GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JLIBM  MNRr    TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12HQOM1  MNRr    CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12G52II  MNRr    CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JRMPO  MNRr    TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H1A8V  MNRr    GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H34Z7  MNRr    AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H4HGC  MNRr    GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12FNA1N  MNRr    AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12F074V  MNRr    CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I1GYO  MNRr    CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I53C8  MNRr    CACACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12I4V6V  MNRr    ATCACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12H5R00  MNRr    TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12IBA5E  MNRr    AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre><p>
	  </p></dd><dt><span class="term">
	  cut -f 3
	</span></dt><dd><p>
	    We're just interested in the sequence now, which is in the third
	    column. The above 'cut' command takes care of this. The resulting
	    output may look like this (only first 15 lines shown):
	  </p><pre class="screen">GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CACACTCGTATAGTGACACGCAACAGGGG
ATCACTCGTATAGTGACACGCAACAGGGG
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre></dd><dt><span class="term">
	  sort
	</span></dt><dd><p>
	    Simply sort all sequences. The output may look like this now (only first 15 lines shown):</p><pre class="screen">
AAACTCGTATAGTGACACGCA
AAACTCGTATAGTGACACGCAACAGG
AAACTCGTATAGTGACACGCAACAGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGGAT
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
...</pre><p>
	  </p></dd><dt><span class="term">
	  uniq -c
	</span></dt><dd><p>
	    This command collapses adjacent identical lines and counts how
	    often each one occurred. As we previously sorted the whole file by
	    sequence, it effectively counts how often a certain sequence has
	    been tagged as MNRr. The output has two columns separated by
	    blanks: the first column contains the number of times a given line
	    (sequence in our case) was seen, the second column contains the
	    line (sequence) itself. Example output looks like this (only first 15 lines shown):
	  </p><pre class="screen">      1 AAACTCGTATAGTGACACGCA
      1 AAACTCGTATAGTGACACGCAACAGG
      1 AAACTCGTATAGTGACACGCAACAGGG
      5 AAACTCGTATAGTGACACGCAACAGGGG
      1 AAACTCGTATAGTGACACGCAACAGGGGAT
     13 AAACTCGTATAGTGACACGCAACAGGGGATA
      6 AAACTCGTATAGTGACACGCAACAGGGGATAGAC
      4 AAACTCGTATAGTGACACGCAACAGGGGATAGACAA
      9 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGC
      3 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCA
    257 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
      1 AACACTCGTATAGTGACACGCAAC
      2 AACACTCGTATAGTGACACGCAACAGGG
     23 AACACTCGTATAGTGACACGCAACAGGGG
      6 AACACTCGTATAGTGACACGCAACAGGGGATA
...</pre></dd><dt><span class="term">
	  sort -g -r
	</span></dt><dd><p>
	    We now sort the output of the previous uniq-counting command by
	    asking 'sort' to perform a numerical sort (via '-g') and
	    additionally sort in reverse order (via '-r') so that we get the
	    sequences encountered most often at the top of the output. The
	    top of that output looks exactly as shown previously (the trailing
	    'head -15' of the complete command line merely cuts it off after
	    15 lines):
	  </p><pre class="screen">
    504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre></dd></dl></div><p>
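      In case you would rather do the same counting from a script, here is
      a rough Python equivalent of the pipeline. It is a sketch, not part of
      MIRA: the function name <code class="literal">top_masked_sequences</code>
      is made up, and unlike 'grep' it only looks for the tag in the second
      column. The three example lines are taken from the 'grep' output shown
      above.
    </p><pre class="programlisting">
```python
from collections import Counter


def top_masked_sequences(lines, tag="MNRr", limit=15):
    """Count how often each sequence carries `tag`, most frequent first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Column 2 is the repeat type, column 3 the repetitive stretch.
        if len(fields) == 3 and fields[1] == tag:
            counts[fields[2]] += 1
    return counts.most_common(limit)


EXAMPLE = [
    "C6E3C7T12GKN35  MNRr    GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC",
    "C6E3C7T12HQOM1  MNRr    CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC",
    "C6E3C7T12G52II  MNRr    CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC",
]

for sequence, count in top_masked_sequences(EXAMPLE):
    print(f"{count:7d} {sequence}")
```
</pre><p>
      For a real project, pass the open
      <code class="filename"><em class="replaceable"><code>badproject</code></em>_info_readrepeats.lst</code>
      file instead of the example list.
    </p><p>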
      So, what is this ominous CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC you are
      seeing? To make it short: a modified 454 B-adaptor with an additional MID sequence.
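    </p><p>
      Once you know such a pattern, a quick check of how many reads carry
      it tells you whether you have a real problem. The sketch below is
      only an illustration: the function names are made up, the pattern is
      the one found above, and the 0.5% default is the upper end of the rule
      of thumb given in the note that follows, here interpreted as a
      fraction of reads.
    </p><pre class="programlisting">
```python
ADAPTOR = "CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC"  # pattern found above


def fraction_with_motif(read_sequences, motif=ADAPTOR):
    """Return the fraction of reads that contain `motif` at least once."""
    reads = list(read_sequences)
    if not reads:
        return 0.0
    hits = sum(1 for sequence in reads if motif in sequence)
    return hits / len(reads)


def looks_contaminated(read_sequences, motif=ADAPTOR, threshold=0.005):
    """True if more than `threshold` (0.5% by default) of the reads are hit."""
    return fraction_with_motif(read_sequences, motif) > threshold
```
</pre><p>
      This only finds exact occurrences of the pattern, so reads with
      sequencing errors inside the adaptor are not counted.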
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	These adaptor sequences have absolutely no reason to exist in your
	data, none! Go back to your sequencing provider and ask them to have a look
	at their pipeline as they should have had it set up in a way that you
	do not see these things anymore. Yes, due to sequencing errors,
	sometimes some adaptor or sequencing vector remnants will stay in
	your sequencing data, but that is no problem as MIRA is capable of
	handling that very well.
      </p><p>
	But having much more than 0.1% to 0.5% of your sequence containing
	these is a sure sign that someone goofed somewhere ... and it's very
	probably not your fault.
      </p></td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_examples_for_kmer_statistics"></a>11.5. 
      Examples for kmer statistics
    </h2></div></div></div><p>
      Selecting the right ratio so that an assembly fits into your memory is not
      straightforward. But MIRA can help you a bit: during assembly, some frequency
      statistics are printed out (they'll probably end up in some info file in later
      releases). Search for the term "Kmer statistics" in the information printed
      out by MIRA (this happens quite early in the process).
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_caveat:_sk:kms"></a>11.5.1. 
	Caveat: -SK:kmer_size
      </h3></div></div></div><p>
	Some explanation how kmer size affects the statistics and why it
	should be chosen &gt;=16 for [-KS:mnr]
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_sanger_sequencing_a_simple_bacterium"></a>11.5.2. 
	Sanger sequencing, a simple bacterium
      </h3></div></div></div><p>
	This example is taken from a pretty standard bacterium where Sanger
	sequencing was used:
      </p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 15

Deduced thresholds:
-------------------
Min normal cov: 7
Max normal cov: 23
Repeat cov: 29
Crazy cov: 120
Mask cov: 150

Repeat ratio histogram:
-----------------------
0       475191
1       5832419
2       181994
3       6052
4       4454
5       972
6       4
7       8
14      2
16      10
=========================================================
      </pre><p>
	The above can be interpreted like this: the expected coverage of the genome is
	15x. Starting with an estimated kmer frequency of 29, MIRA will treat a k-mer
	as 'repetitive'. As shown in the histogram, the overall picture of this
	project is pretty healthy:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    only a small fraction of k-mers have a repeat level of '0' (these would be
	    k-mers in regions with quite low coverage or k-mers containing sequencing
	    errors)
	  </p></li><li class="listitem"><p>
	    the vast majority of k-mers have a repeat level of 1 (so that's non-
	    repetitive coverage)
	  </p></li><li class="listitem"><p>
	    there is a small fraction of k-mers with repeat level of 2-10
	  </p></li><li class="listitem"><p>
	    there are almost no k-mers with a repeat level &gt;10
	  </p></li></ul></div><p>
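      The four observations above can be put into numbers. The following
      Python sketch takes the histogram printed above and computes the share
      of k-mers in each of the four groups; the function name
      <code class="literal">summarise</code> and the grouping are made up for
      this illustration and are not something MIRA prints.
    </p><pre class="programlisting">
```python
# Repeat ratio histogram of the Sanger example above: level -> number of k-mers.
SANGER = {0: 475191, 1: 5832419, 2: 181994, 3: 6052, 4: 4454,
          5: 972, 6: 4, 7: 8, 14: 2, 16: 10}


def summarise(histogram):
    """Return the share of k-mers at level 0, level 1, levels 2-10 and above 10."""
    total = sum(histogram.values())
    shares = {"0": 0, "1": 0, "2-10": 0, "above 10": 0}
    for level, count in histogram.items():
        if level > 10:
            shares["above 10"] += count
        elif level >= 2:
            shares["2-10"] += count
        else:
            shares[str(level)] += count
    return {name: count / total for name, count in shares.items()}


for name, share in summarise(SANGER).items():
    print(f"level {name}: {share:.4%}")
```
</pre><p>
      For this project roughly 7% of the k-mers are at level 0, about 90% at
      level 1, about 3% at levels 2 to 10 and next to nothing above that.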
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_454_sequencing_a_somewhat_more_complex_bacterium"></a>11.5.3. 
	454 Sequencing, a somewhat more complex bacterium
      </h3></div></div></div><p>
	For comparison, here's a profile for a more complicated bacterium (454
	sequencing):
      </p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 20

Deduced thresholds:
-------------------
Min normal cov: 10
Max normal cov: 30
Repeat cov: 38
Crazy cov: 160
Mask cov: 0

Repeat ratio histogram:
-----------------------
0       8292273
1       6178063
2       692642
3       55390
4       10471
5       6326
6       5568
7       3850
8       2472
9       708
10      464
11      270
12      140
13      136
14      116
15      64
16      54
17      54
18      52
19      50
20      58
21      36
22      40
23      26
24      46
25      42
26      44
27      32
28      38
29      44
30      42
31      62
32      116
33      76
34      80
35      82
36      142
37      100
38      120
39      94
40      196
41      172
42      228
43      226
44      214
45      164
46      168
47      122
48      116
49      98
50      38
51      56
52      22
53      14
54      8
55      2
56      2
57      4
87      2
89      6
90      2
92      2
93      2
1177    2
1181    2
=========================================================
      </pre><p>
	The difference to the first bacterium shown is pretty striking:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    first, the number of k-mers in repeat level 0 (below average) is
	    higher than the number of k-mers in level 1! This points to a
	    higher number of sequencing errors in the 454 reads than in the
	    Sanger project shown previously. Or to a more uneven distribution
	    of reads (but not in this special case).
	  </p></li><li class="listitem"><p>
	    second, the repeat level histogram does not trail off at a repeat
	    frequency of 10 or 15, but it has a long tail up to the fifties, even having
	    a local maximum at 42. This points to a small part of the genome being
	    heavily repetitive ... or to (a) plasmid(s) in high copy numbers.
	  </p></li></ul></div><p>
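	The local maximum in the tail can also be found programmatically. The
	sketch below uses the counts for levels 15 to 57 from the histogram
	above; the function name <code class="literal">tail_peak</code> and the
	choice of 15 as the start of the tail are assumptions made for this
	example.
      </p><pre class="programlisting">
```python
# Counts for repeat levels 15 to 57 of the 454 example above, in order.
TAIL_COUNTS = [64, 54, 54, 52, 50, 58, 36, 40, 26, 46, 42, 44, 32, 38, 44,
               42, 62, 116, 76, 80, 82, 142, 100, 120, 94, 196, 172, 228,
               226, 214, 164, 168, 122, 116, 98, 38, 56, 22, 14, 8, 2, 2, 4]
HISTOGRAM = dict(enumerate(TAIL_COUNTS, start=15))


def tail_peak(histogram, start_level):
    """Return the repeat level with the most k-mers at or above start_level."""
    tail = {level: count for level, count in histogram.items()
            if level >= start_level}
    if not tail:
        return None
    return max(tail, key=tail.get)


print(tail_peak(HISTOGRAM, start_level=15))
```
</pre><p>
	For the data above this reports level 42, the local maximum mentioned
	in the list.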
      </p><p>
	Should MIRA ever have problems with this genome, switch on the nasty repeat
	masking and use a level of 15 as cutoff. In this case, 15 is OK to start with
	as a) it's a bacterium, it can't be that hard and b) the frequencies above
	level 5 are in the low thousands and not in the tens of thousands.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_solexa_sequencing_ecoli_mg1655"></a>11.5.4. 
	Solexa sequencing, E.coli MG1655
      </h3></div></div></div><p>
      </p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 23

Deduced thresholds:
-------------------
Min normal cov: 11
Max normal cov: 35
Repeat cov: 44
Crazy cov: 184
Mask cov: 0

Repeat ratio histogram:
-----------------------
0       1365693
1       8627974
2       157220
3       11086
4       4990
5       3512
6       3922
7       4904
8       3100
9       1106
10      868
11      788
12      400
13      186
14      28
15      10
16      12
17      4
18      4
19      2
20      14
21      8
25      2
26      8
27      2
28      4
30      2
31      2
36      4
37      6
39      4
40      2
45      2
46      8
47      14
48      8
49      4
50      2
53      2
56      6
59      4
62      2
63      2
67      2
68      2
70      2
73      4
75      2
77      4
=========================================================
      </pre><p>
	These kmer statistics show that MG1655 is pretty boring (from a
	repetitive point of view). One might expect a few repeats but nothing
	fancy: the repeats are actually the rRNA and sRNA stretches in the
	genome plus some intergenic regions.
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    the number of k-mers in repeat level 0 (below average) is
	    considerably lower than in level 1, so the Solexa sequencing
	    quality is pretty good and there shouldn't be too many
	    low coverage areas.
	  </p></li><li class="listitem"><p>
	    the histogram tail shows some faint traces of possibly highly repetitive
	    k-mers, but these are false positive matches due to some standard Solexa
	    base-calling weaknesses of earlier pipelines like, e.g., adding poly-A,
	    poly-T or sometimes poly-C and poly-G tails to reads when spots in the
	    images were faint and the base calls of bad quality
	  </p></li></ul></div><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_need_examples_for_eukaryotes"></a>11.5.5. 
	(NEED EXAMPLES FOR EUKARYOTES)
      </h3></div></div></div><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_need_examples_for_pathological_cases"></a>11.5.6. 
	(NEED EXAMPLES FOR PATHOLOGICAL CASES)
      </h3></div></div></div><p>
	Vector contamination etc.
      </p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_seqtechdesc"></a>Chapter 12. Description of sequencing technologies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_std_intro">12.1. 
      Introduction
    </a></span></dt><dt><span class="sect1"><a href="#sect_std_sxa">12.2. 
      Illumina (formerly Solexa)
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_sxa_caveats_for_illumina">12.2.1. 
	Caveats for Illumina data
      </a></span></dt><dt><span class="sect2"><a href="#sect_std_sxa_highlights">12.2.2. 
	Illumina highlights
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_sxa_highlights_quality">12.2.2.1. 
	  Quality
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_std_std_sxa_lowlights">12.2.3. 
	Lowlights
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_longhomopolymers">12.2.3.1. 
	  Long homopolymers
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_GGCxG_motif">12.2.3.2. 
	  The GGCxG and GGC motifs
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_chimericreads">12.2.3.3. 
	  Chimeric reads
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_samplemix">12.2.3.4. 
	  Sample barcode misidentification
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_nextera">12.2.3.5. 
	  Nextera library prep
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_gcbias">12.2.3.6. 
	  Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)
	</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_std_iontor">12.3. 
      Ion Torrent
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_iontor_hpindels">12.3.1. 
	Homopolymer insertions / deletions
      </a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_seqdirdepindels">12.3.2. 
	Sequencing direction dependent insertions / deletions
      </a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_covvariance">12.3.3. 
	Coverage variance
      </a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_gcbias">12.3.4. 
	GC bias
      </a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_other_sources_of_error">12.3.5. 
	Other sources of error
      </a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_where_to_find_further_information">12.3.6. 
	Where to find further information
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_std_pacbio">12.4. 
      Pacific BioSciences
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_pb_highlights">12.4.1. 
	Highlights
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_pb_hl_length">12.4.1.1. 
	  Sequence lengths
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_gcbias">12.4.1.2. 
	  GC bias
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_acccorrected">12.4.1.3. 
	  Accuracy of corrected reads
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_qualassemblies">12.4.1.4. 
	  Assemblies of corrected reads
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_std_pb_lowlights">12.4.2. 
	Lowlights
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_pb_ll_namingconfusion">12.4.2.1. 
	  Naming confusion
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_revseq">12.4.2.2. 
	  Forward / reverse chimeric sequences
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_rawreadaccuracy">12.4.2.3. 
	  Accuracy of uncorrected subreads
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_cpu">12.4.2.4. 
	  Immense need for CPU power
	</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_dnaprep">12.4.2.5. 
	  Increased quality requirements for clean DNA sample prep
	</a></span></dt></dl></dd></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Opinions are like chili powder - best used in moderation.</span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_intro"></a>12.1. 
      Introduction
    </h2></div></div></div><p>
      <span class="bold"><strong>Note:</strong></span> This section contains things I've
      seen in the past and simply jotted down. These may be fundamentally
      correct or correct only under circumstances or not correct at all with
      your data. You may have different observations.
    </p><p>
      ...
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_sxa"></a>12.2. 
      Illumina (formerly Solexa)
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_sxa_caveats_for_illumina"></a>12.2.1. 
	Caveats for Illumina data
      </h3></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	  Even if you can get bacteria sequenced with ridiculously high coverage
	  like 500x or 1000x, this amount of data is simply not needed. Even
	  more important - though counterintuitive - is the fact that due to
	  non-random sequence dependent sequencing errors, a too high coverage
	  may even make the assembly worse.
	</p><p>
	  Another rule of thumb: when having more than enough data, reduce the
	  data set so as to have an average coverage of approximately 100x. In
	  some rare cases (high GC content), perhaps 120x to 150x, but certainly
	  not more.
	</p></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	When reducing a data set, do <span class="bold"><strong>NOT</strong></span>,
	under any circumstances, try fancy selection of reads by some
	arbitrary quality or length criteria. This will introduce a terrible
	bias in your assembly due to non-random sequence-dependent sequencing
	errors and non-random sequence dependent base quality assignment. More
	on this in the next section.
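	<p>
	  A purely random reduction is the safe way to do it. The following
	  Python sketch keeps every read pair with the same probability,
	  regardless of quality or length. The function name
	  <code class="literal">subsample</code>, the handling of reads as
	  pairs and the fixed seed are assumptions of this example; the
	  target of 100x is the rule of thumb from the note above.
	</p><pre class="programlisting">
```python
import random


def subsample(read_pairs, current_coverage, target_coverage=100, seed=1):
    """Keep each read pair with the same probability, whatever its quality."""
    keep_fraction = min(1.0, target_coverage / current_coverage)
    rng = random.Random(seed)  # fixed seed so that the selection is repeatable
    return [pair for pair in read_pairs if keep_fraction > rng.random()]


pairs = [(f"read{i}/1", f"read{i}/2") for i in range(10000)]
reduced = subsample(pairs, current_coverage=500)
print(len(pairs), "pairs before,", len(reduced), "pairs after")
```
</pre><p>
	  With 500x of data and a target of 100x, about one pair in five
	  survives. A data set that is already at or below the target is
	  returned unchanged.
	</p>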
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_sxa_highlights"></a>12.2.2. 
	Illumina highlights
      </h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_highlights_quality"></a>12.2.2.1. 
	  Quality
	</h4></div></div></div><p>
	  For current HiSeq 100bp reads I get - after MIRA clipping - about 90
	  to 95% reads matching to a reference without a single error. MiSeq
	  250bp reads contain a couple more errors, but nothing to be alarmed
	  about.
	</p><p>
	  In short: Illumina is currently <span class="emphasis"><em>the</em></span> technology
	  to use if you want high quality reads.
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_std_sxa_lowlights"></a>12.2.3. 
	Lowlights
      </h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_longhomopolymers"></a>12.2.3.1. 
	  Long homopolymers
	</h4></div></div></div><p>
	  Long homopolymers (stretches of identical bases in reads) can be a
	  slight problem for Solexa. However, it must be noted that this is a
	  problem of all sequencing technologies on the market so far (Sanger,
	  Solexa, 454). Furthermore, the problem is much less pronounced in
	  Solexa than in 454 data: in Solexa, the first problems may appear
	  in stretches of 9 to 10 bases, in Ion Torrent a stretch of 3 to 4
	  bases may already start being problematic in some cases.
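	</p><p>
	  If you want to know where such stretches sit in a reference or a
	  contig, a regular expression is enough. The sketch below is only
	  an illustration with a made-up function name
	  (<code class="literal">homopolymer_runs</code>); the default minimum
	  length of 9 is the Solexa figure mentioned above.
	</p><pre class="programlisting">
```python
import re


def homopolymer_runs(sequence, min_length=9):
    """Return (start, base, length) for every run of at least min_length bases."""
    extra = max(min_length, 1) - 1  # repeats needed after the first base
    pattern = re.compile(r"([ACGT])\1{%d,}" % extra)
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(sequence)]


example = "ACGT" + "A" * 10 + "CGTTTG"
print(homopolymer_runs(example))
print(homopolymer_runs(example, min_length=3))
```
</pre><p>
	  Positions are zero-based. Lowering
	  <code class="literal">min_length</code> to 3 also reports the short
	  stretches which may already matter for Ion Torrent data.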
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_GGCxG_motif"></a>12.2.3.2. 
	  The GGCxG and GGC motifs
	</h4></div></div></div><p>
	  The <code class="literal">GGCxG</code> or even <code class="literal">GGC</code> motif in the
	  5' to 3' direction of reads is particularly annoying, and
	  it took me quite a while to circumvent in MIRA the problems it
	  causes.
	</p><p>
	  Simply put: at some places in a genome, base calling after a
	  <code class="literal">GGCxG</code> or <code class="literal">GGC</code> motif is
	  particularly error prone and the number of reads without errors
	  declines markedly. Repeated <code class="literal">GGC</code> motifs worsen
	  the situation. The following screen shots of a mapping assembly
	  illustrate this.
	</p><p>
	  The first example is the <code class="literal">GGCxG</code> motif (in form
	  of a <code class="literal">GGCTG</code>) occurring in approximately one third
	  of the reads at the shown position. Note that all but one read
	  with this problem are in the same (plus) direction.
	</p><div class="figure"><a name="sxa_unsc_ggcxg2_lenski.png"></a><p class="title"><b>Figure 12.1. 
	    The Solexa GGCxG problem.
	  </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggcxg2_lenski.png" width="100%" alt="The Solexa GGCxG problem."></td></tr></table></div></div></div><br class="figure-break"><p>
	  The next two screen shots show the <code class="literal">GGC</code> motif, once with
	  forward direction and once with reverse direction reads:
	</p><div class="figure"><a name="sxa_unsc_ggc1_lenski.png"></a><p class="title"><b>Figure 12.2. 
	    The Solexa GGC problem, forward example
	  </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggc1_lenski.png" width="100%" alt="The Solexa GGC problem, forward example"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="sxa_unsc_ggc4_lenski.png"></a><p class="title"><b>Figure 12.3. 
	    The Solexa GGC problem, reverse example
	  </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggc4_lenski.png" width="100%" alt="The Solexa GGC problem, reverse example"></td></tr></table></div></div></div><br class="figure-break"><p>
	  Places in the genome that have <code class="literal">GGCGGC.....GCCGCC</code>
	  (a motif, perhaps even repeated, then some bases and then an
	  inverted motif) almost always have a very, very low number of good
	  reads. Especially when the motif is <code class="literal">GGCxG</code>.
	</p><p>
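	  To get an idea where such places are in a reference sequence, one
	  can simply search for the motif and its inverted form. The Python
	  sketch below does that with two regular expressions; the function
	  name <code class="literal">find_ggcxg</code> is made up, and hits of
	  the inverted motif <code class="literal">CxGCC</code> are labelled
	  as reverse strand occurrences of <code class="literal">GGCxG</code>.
	</p><pre class="programlisting">
```python
import re

# Lookaheads so that overlapping occurrences are reported as well.
FORWARD = re.compile(r"(?=(GGC[ACGT]G))")  # GGCxG as written
REVERSE = re.compile(r"(?=(C[ACGT]GCC))")  # reverse complement of GGCxG


def find_ggcxg(sequence):
    """Return sorted (position, strand, motif) hits for GGCxG on both strands."""
    hits = [(m.start(), "+", m.group(1)) for m in FORWARD.finditer(sequence)]
    hits += [(m.start(), "-", m.group(1)) for m in REVERSE.finditer(sequence)]
    return sorted(hits)


print(find_ggcxg("AAGGCTGAATTCAGCCTT"))
```
</pre><p>
	  Positions are zero-based. Whether a reported place really causes
	  trouble still has to be checked in the mapping, as not every
	  occurrence of the motif does.
	</p><p>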
	  Things get especially difficult when these motifs occur at sites
	  where users may have a genuine interest. The following example is a
	  screen shot from the Lenski data (see walk-through below) where a
	  simple mapping reveals an anomaly which -- in reality -- is an IS
	  insertion (see <a class="ulink" href="http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html" target="_top">http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html</a>)
	  but could also look like a <code class="literal">GGCxG</code> motif in forward
	  direction (<code class="literal">GGCCG</code>) and at the same time a
	  <code class="literal">GGC</code> motif in reverse direction:
	</p><div class="figure"><a name="sxa_xmastree_lenski2.png"></a><p class="title"><b>Figure 12.4. 
	    A genuine place of interest almost masked by the
	    <code class="literal">GGCxG</code> problem.
	  </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski2.png" width="100%" alt="A genuine place of interest almost masked by the GGCxG problem."></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_chimericreads"></a>12.2.3.3. 
	  Chimeric reads
	</h4></div></div></div><p>
	  I did not realise chimeric reads were a problem with Illumina data
	  until Fall 2014, when I got reads &gt; 100bp for extremely well
	  characterised bacteria. Until then, the data cleaning methods MIRA
	  had always used worked very well, either on short reads &#8804;
	  100bp or when chimeras occurred at a very low frequency.
	</p><p>
	  Chimeras are artefact reads from library preparation which
	  contain parts of the sequence of interest which do not belong
	  together. E.g., in DNA from a bacterial genome, there may be one
	  read of 100 bp where the first 40 bp come from the genome position
	  at 100kb and the last 60 bp come from a position at 1300kb ... more
	  than one megabase apart.
	</p><p>
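	  To make the definition concrete, here is a toy Python sketch that
	  builds such a read and then tries to explain it as two exact pieces
	  from different places of a reference. It is not how MIRA detects
	  chimeras: the function name, the <code class="literal">min_part</code>
	  parameter and the exact-match search are assumptions for
	  illustration, and the positions 100 and 1300 merely stand in for
	  100kb and 1300kb. Real data would need error-tolerant alignment.
	</p><pre class="screen">
```python
import random


def split_chimera(read, reference, min_part=20):
    """Try to explain a read as two exact pieces of the reference.

    Returns (breakpoint, pos_left, pos_right) for the first breakpoint at
    which both parts are found, or None if the read maps in one piece or
    cannot be explained this way.
    """
    if reference.find(read) != -1:
        return None  # the whole read maps in one piece: not a chimera
    for breakpoint in range(min_part, len(read) - min_part + 1):
        pos_left = reference.find(read[:breakpoint])
        pos_right = reference.find(read[breakpoint:])
        if pos_left != -1 and pos_right != -1:
            return breakpoint, pos_left, pos_right
    return None


random.seed(1)
reference = "".join(random.choice("ACGT") for _ in range(5000))

# Toy chimera: 40 bases from position 100 joined to 60 bases from 1300.
chimera = reference[100:140] + reference[1300:1360]
print(split_chimera(chimera, reference))

# A normal 100 bp read from a single location is not reported.
print(split_chimera(reference[100:200], reference))
```
</pre><p>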
	  There is not much literature regarding chimeric sequences in
	  Illumina data: most of it deals with 16S or amplicon sequencing
	  where I always thought <span class="emphasis"><em>"that does not apply to my data
	  sets."</em></span> Well, tough luck ... it does. After some searching I
	  found some papers which report quite varying levels depending on the
	  protocols used.  Oyola et al. report between 0.24% and 2.3% of
	  chimeras (<span class="emphasis"><em>Optimizing Illumina next-generation sequencing
	  library preparation for extremely AT-biased genomes</em></span>; BMC
	  Genomics 2012, 13:1; doi:10.1186/1471-2164-13-1; <a class="ulink" href="http://www.biomedcentral.com/1471-2164/13/1" target="_top">http://www.biomedcentral.com/1471-2164/13/1</a>). Apparently, a
	  paper from researchers at the Sanger Centre reported up to 5%
	  chimeric reads (Bronner et al., <span class="emphasis"><em>Improved Protocols for
	  Illumina Sequencing</em></span>; Current Protocols in Human Genetics
	  18:18.2:18.2.1&#8211;18.2.42; DOI: 10.1002/0471142905.hg1802s80; <a class="ulink" href="http://onlinelibrary.wiley.com/doi/10.1002/0471142905.hg1802s80/abstract" target="_top">http://onlinelibrary.wiley.com/doi/10.1002/0471142905.hg1802s80/abstract</a>
	  via <a class="ulink" href="http://www.sagescience.com/blog/sanger-reports-improved-prep-protocols-for-illumina-sequencing/" target="_top">http://www.sagescience.com/blog/sanger-reports-improved-prep-protocols-for-illumina-sequencing/</a>).
	</p><p>
	  I have now seen MiSeq 250bp and 300bp paired-end genomic data sets
	  from different (trusted) sequencing providers for very well
	  characterised, non-complex and non-GC-extreme bacterial genomes with
	  up to 3% chimeric reads. To make things worse, some chimeras were
	  represented by both reads of a read-pair, so one had the exact same
	  chimeric sequence represented twice: once in forward and once in
	  reverse complement direction.
	</p><p>
	  It turned out that MIRA versions &#8804; 4.9.3 have problems in
	  filtering chimeras in Illumina data sets with reads &gt; 100bp as
	  the chimera detection algorithms were designed to handle amounts
	  much less than 1% of the total reads. This led to shorter contigs in
	  genomic assemblies and to chimeric transcripts (when they are very
	  low-coverage) in RNA assemblies.
	</p><p>
	  Note that projects using reads &#8804; 100 bp assembled fine with MIRA
	  4.9.3 and before as the default algorithms for proposed-end-clip
	  ([-CL:pec]) implicitly caught chimeras occurring near the
	  read ends and the remaining chimeras were caught by the algorithms
	  for low level chimeras.
	</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	  MIRA 4.9.4 and higher eliminate all chimeras in Illumina reads of
	  any length; you do not need to take any precautionary steps
	  here. But if you use other assemblers, and in light of the above, I
	  highly recommend applying very stringent filters to Illumina data,
	  especially for applications like metagenomics or RNA de-novo
	  assembly where low coverage may be expected for parts of the
	  results! Indeed, I now treat any assembly result with consensus data
	  generated from a coverage of less than 3 Illumina reads as
	  potentially junk data.
	</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_samplemix"></a>12.2.3.4. 
	  Sample barcode misidentification
	</h4></div></div></div><p>
	  Long story short: data from multiplexed samples contains "low"
	  amounts of foreign samples from the same lane. Probably not a
	  problem for high coverage assemblies, but can become a problem in
	  multiplexed RNASeq or projects looking for "rare" variants.
	</p><p>
	  In essence, the barcoding used for multiplexing several samples into
	  a single lane is not a 100% foolproof process. I found one paper
	  quantifying this effect to 0.3% of misidentified reads: Kircher et
	  al., <span class="emphasis"><em>Double indexing overcomes inaccuracies in multiplex
	  sequencing on the Illumina platform</em></span>; Nucleic Acids
	  Res. Jan 2012; 40(1): e3. <a class="ulink" href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245947/" target="_top">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245947/</a>
	</p><p>
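	  A back-of-the-envelope calculation shows why such a small
	  percentage still matters. The sketch below simply applies the 0.3%
	  figure quoted above to a made-up sample size; the function name and
	  the 2 million reads are assumptions for illustration, and the rate
	  will differ between runs and protocols.
	</p><pre class="screen">
```python
def expected_foreign_reads(reads_in_sample, misid_rate=0.003):
    """Expected number of reads in a sample that stem from other samples."""
    return round(reads_in_sample * misid_rate)


# Hypothetical sample with 2 million reads after demultiplexing.
print(expected_foreign_reads(2_000_000))
```
</pre><p>
	  A few thousand foreign reads are negligible in a high coverage
	  genome assembly, but they are enough to form small contigs or to
	  mimic a weakly expressed gene.
	</p><p>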
	  For example, I got some genome sequencing data for a bacterium where
	  some small contigs coming out of the assembly process turned out, on
	  closer inspection, to be highly expressed genes from a plant. The
	  sequencing provider had multiplexed our bacterial sample with a
	  RNASeq project of that plant.
	</p><p>
	  Another example involved RNASeq of two genomes where one of the
	  organisms had been modified to contain additional genes under a
	  strong promoter. In the data set we suddenly saw those inserted
	  genes pop up in the samples of the wild type organism. Which,
	  clearly, could not be.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_nextera"></a>12.2.3.5. 
	  Nextera library prep
	</h4></div></div></div><p>
	  Opinions seem to be divided about Nextera: some people don't like it
	  as it sometimes introduces terrible coverage bias in the data, other
	  people say they're happy with the data.
	</p><p>
	  Someone told me (or wrote, I do not remember) that this divide may
	  be due to the fact that some people use their sequencing data for
	  de-novo assemblies, while others just do mappings and hunt for
	  SNPs. In fact, this would explain a lot: for de-novo assemblies, I
	  would never use Nextera. When on a hunt for SNPs, it may be OK.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_gcbias"></a>12.2.3.6. 
	  Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)
	</h4></div></div></div><p>
	  I'm recycling a few slides from a couple of talks I held in 2010.
	</p><p>
	  Things used to be so nice and easy with the early Solexa data I worked
	  with (36 and 44mers) in late 2007 / early 2008. When sample taking was
	  done right -- e.g. for bacteria: in stationary phase -- and the
	  sequencing lab did a good job, the read coverage of the genome was
	  almost even. I did see a few papers claiming to see non-trivial GC
	  bias back then, but after having analysed the data I worked with I
	  dismissed them as "not relevant for my use cases." Have a look at the
	  following figure showing exemplarily the coverage of a 45% GC
	  bacterium in 2008:
	</p><div class="figure"><a name="sxa_gcbias_nobias2008.png"></a><p class="title"><b>Figure 12.5. 
	    Example for no GC coverage bias in 2008 Solexa data. Apart from a
	    slight <span class="emphasis"><em>smile shape</em></span> of the coverage --
	    indicating the sample taking was not 100% in stationary phase of the
	    bacterial culture -- everything looks pretty nice: the average
	    coverage is at 27x, and when looking at potential genome
	    duplications at twice the coverage (54x), there's nothing apart from
	    a single peak (which turned out to be a problem in an rRNA region).
	  </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_nobias2008.png" width="100%" alt="Example for no GC coverage bias in 2008 Solexa data. Apart from a slight smile shape of the coverage -- indicating the sample taking was not 100% in stationary phase of the bacterial culture -- everything looks pretty nice: the average coverage is at 27x, and when looking at potential genome duplications at twice the coverage (54x), there's nothing apart a single peak (which turned out to be a problem in a rRNA region)."></td></tr></table></div></div></div><br class="figure-break"><p>
	  Things changed starting sometime in Q3 2009, at least that's when I
	  got some data which made me notice a problem. Have a look at the
	  following figure which shows exactly the same organism as in the
	  figure above (bacterium, 45% GC):
	</p><div class="figure"><a name="sxa_gcbias_bias2009.png"></a><p class="title"><b>Figure 12.6. 
	    Example for GC coverage bias starting Q3 2009 in Solexa
	    data. There's no <span class="emphasis"><em>smile shape</em></span> anymore -- the
	    people in the lab learned to pay attention to sampling in 100%
	    stationary phase -- but something else is extremely disconcerting:
	    the average coverage is at 33x, and when looking at potential genome
	    duplications at twice the coverage (66x), there are several dozen
	    peaks crossing the 66x threshold over several kilobases (in one
	    case over 200 Kb) all over the genome. As if several small genome
	    duplications happened.
	  </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_bias2009.png" width="100%" alt="Example for GC coverage bias starting Q3 2009 in Solexa data. There's no smile shape anymore -- the people in the lab learned to pay attention to sample in 100% stationary phase -- but something else is extremely disconcerting: the average coverage is at 33x, and when looking at potential genome duplications at twice the coverage (66x), there are several dozen peaks crossing the 66x threshold over a several kilobases (in one case over 200 Kb) all over the genome. As if several small genome duplications happened."></td></tr></table></div></div></div><br class="figure-break"><p>
	  By the way, the figures above are just examples: I saw over a dozen
	  sequencing projects in 2008 without GC bias and several dozen in 2009
	  / 2010 with GC bias.
	</p><p>
	  Checking the potential genome duplication sites, they all looked
	  "clean", i.e., the typical genome insertion markers are
	  missing. Poking around at possible explanations, I looked at GC
	  content of those parts in the genome ... and there was the
	  explanation:
	</p><div class="figure"><a name="sxa_gcbias_comp20082009.png"></a><p class="title"><b>Figure 12.7. 
	    Example for GC coverage bias, direct comparison 2008 / 2010
	    data. The bug has 45% average GC, areas with above average read
	    coverage in 2010 data turn out to be lower GC: around 33 to 36%. The
	    effect is also noticeable in the 2008 data, but barely so.
	  </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_comp20082009.png" width="100%" alt="Example for GC coverage bias, direct comparison 2008 / 2010 data. The bug has 45% average GC, areas with above average read coverage in 2010 data turn out to be lower GC: around 33 to 36%. The effect is also noticeable in the 2008 data, but barely so."></td></tr></table></div></div></div><br class="figure-break"><p>
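	  The kind of check described above (find regions at roughly twice
	  the average coverage, then look at their GC content) can be
	  sketched in a few lines of Python. This is a minimal illustration
	  and not the analysis actually used for the figures: the function
	  names, the window size and the toy data are assumptions.
	</p><pre class="screen">
```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def window_stats(reference, coverage, window=1000):
    """Yield (start, gc_fraction, mean_coverage) per non-overlapping window."""
    for start in range(0, len(reference) - window + 1, window):
        chunk = reference[start:start + window]
        cov = coverage[start:start + window]
        yield start, gc_fraction(chunk), sum(cov) / window


def suspicious_windows(reference, coverage, window=1000, factor=2.0):
    """Windows whose mean coverage reaches factor times the overall mean."""
    overall = sum(coverage) / len(coverage)
    return [(start, gc, cov)
            for start, gc, cov in window_stats(reference, coverage, window)
            if cov >= factor * overall]


# Toy data: four windows of 10 bases, the low-GC one is over-covered.
reference = "GCGCGATATA" + "ATATATGCAT" + "GCGCATATGC" + "GGCCATATAT"
coverage = [30] * 10 + [100] * 10 + [30] * 10 + [30] * 10
for start, gc, cov in suspicious_windows(reference, coverage, window=10):
    print(start, gc, cov)
```
</pre><p>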
	  Now, <span class="emphasis"><em>why</em></span> the GC bias suddenly
	  became so strong is unknown to me. The people in the lab have used
	  the same protocol for several years to extract the DNA, and the
	  sequencing providers claim to always use the Illumina standard
	  protocols.
	</p><p>
	  But obviously something must have changed.
	</p><p>
	  It took Illumina some 18 months to resolve that problem for the
	  broader public: ever since the data I work on has been produced with
	  the TruSeq kit, this problem has vanished.
	</p><p>
	  However, if you based some conclusions or wrote a paper with Illumina
	  data which might be affected by the GC bias (Q3 2009 to Q4 2010), I
	  suggest you rethink all the conclusions drawn. This should be
	  especially the case for transcriptomics experiments where a difference
	  in expression of 2x to 3x starts to get highly significant!
	</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_iontor"></a>12.3. 
      Ion Torrent
    </h2></div></div></div><p>
      As of January 2014, I would say Ion Torrent reads behave very much like
      late data from the 454 technology (FLX / Titanium chemistry): reads are
      on average &gt; 300bp and the homopolymer problem is much less
      pronounced than 2 years ago. The following figure shows what you can get
      out of 100bp reads if you're lucky:
    </p><div class="figure"><a name="chap_iontor::ion_dh10bgoodB13.png"></a><p class="title"><b>Figure 12.8. 
	Example for good IonTorrent data (100bp reads). Note that only a
	single sequencing error - shown by blue background - can be
	seen. Except this, all homopolymers of size 3 and 4 in the area
	shown are good.
      </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/ion_dh10bgoodB13.png" width="100%" alt="Example for good IonTorrent data (100bp reads). Note that only a single sequencing error - shown by blue background - can be seen. Except this, all homopolymers of size 3 and 4 in the area shown are good."></td></tr></table></div></div></div><br class="figure-break"><p>
      The "if you're lucky" part in the preceding sentence is not there by
      accident: having so many clean reads is more of an exception than a
      rule. On the other hand, most sequencing errors in current IonTorrent
      data are unproblematic ... if it were not for indels, which are
      explained in the next sections.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_hpindels"></a>12.3.1. 
	Homopolymer insertions / deletions
      </h3></div></div></div><p>
	The main source of error in your data will be insertions / deletions
	(indels) especially in homopolymer regions (but not only there, see
	also next section). Starting with a base run of 4 to 6 bases, there
	is a distinct tendency to have an increased occurrence of indel
	errors.
      </p><div class="figure"><a name="chap_iontor::iontor_indelhpexample.png"></a><p class="title"><b>Figure 12.9. 
	  Example for problematic IonTorrent data (100bp reads).
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/iontor_indelhpexample.png" width="100%" alt="Example for problematic IonTorrent data (100bp reads)."></td></tr></table></div></div></div><br class="figure-break"></div><p>
      The above figure contains a couple of particularly nasty indel
      problems. While areas 2 (C-homopolymer length 3), 5 (A-homopolymer
      length 4) and 6 (T-homopolymer length 3) are not a big problem as most
      of the reads got the length right, the areas 1, 3 and 4 are nasty.
    </p><p>
      Area 1 is an A-homopolymer of length 7 and while many reads get that
      length right (enough to tell MIRA what the true length is), it also
      contains reads with a length of 6 and others with a length of 8.
    </p><p>
      Area 3 is an "A-homopolymer" of length 2 where approximately half of the
      reads get the length right, the other half not. See also the following
      section.
    </p><p>
      Area 4 is a T-homopolymer of length 5 which also has approximately half
      the reads with a wrong length of 4.
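      </p><p>
      The following Python sketch illustrates the two ideas used in this
      discussion: locating homopolymer runs in a sequence and picking the
      run length most reads agree on. It is only an illustration and not
      MIRA's actual algorithm; the function names, the minimum run length
      of 4 and the example read lengths are assumptions.
      </p><pre class="screen">
```python
from collections import Counter
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of at least min_len bases."""
    runs, pos = [], 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def consensus_run_length(observed_lengths):
    """Majority vote over the run lengths seen in the reads covering a site."""
    return Counter(observed_lengths).most_common(1)[0][0]


sequence = "ACG" + "A" * 7 + "TCG" + "T" * 5 + "GC"
print(homopolymer_runs(sequence))

# Hypothetical run lengths reported by reads at an A-run of true length 7.
print(consensus_run_length([7, 7, 6, 7, 8, 7, 6, 7]))
```
</pre><p>
      A simple vote like this works when enough reads get the length
      right, as in area 1. It becomes unreliable when about half of the
      reads are wrong, as in area 4.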
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_seqdirdepindels"></a>12.3.2. 
	Sequencing direction dependent insertions / deletions
      </h3></div></div></div><p>
	In the previous section, the screen shot showing indels had an indel
	at a homopolymer of 2, which is something quite curious. Upon closer
	investigation, one might notice a pattern in the gap/nogap
	distribution: it is almost identical to the build direction of the
	reads!
      </p><p>
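	A suspicion like this can be checked by tabulating, for one
	alignment column, how many reads of each direction carry a gap. The
	Python sketch below does just that on made-up data; the function
	name, the use of <code class="literal">*</code> as gap character and
	the example column are assumptions for illustration.
      </p><pre class="screen">
```python
from collections import Counter


def gap_by_direction(observations):
    """Count (direction, has_gap) pairs for the reads covering one column.

    observations: iterable of (direction, base), where direction is '+' or
    '-' and base is '*' when the read has a gap at that column.
    """
    return Counter((direction, base == "*") for direction, base in observations)


# Hypothetical column: most '+' reads miss the base, no '-' read does.
column = [("+", "*")] * 9 + [("+", "G")] + [("-", "G")] * 8
counts = gap_by_direction(column)
for direction in "+-":
    gaps = counts[(direction, True)]
    total = gaps + counts[(direction, False)]
    print(direction, gaps, "of", total, "reads with gap")
```
</pre><p>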
	I looked for other examples of this behaviour and found quite a
	number of them, the following figure shows a very clear case of that
	error behaviour:
      </p><div class="figure"><a name="chap_iontor::ion_dh10bdirdepindel.png.png"></a><p class="title"><b>Figure 12.10. 
	  Example for a sequencing direction dependent indel. Note how all
	  but one of the reads in '+' direction miss a base while all reads
	  built in '-' direction have the correct number of bases.
	</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/ion_dh10bdirdepindel.png" width="100%" alt="Example for a sequencing direction dependent indel. Note how all but one of the reads in '+' direction miss a base while all reads built in in '-' direction have the correct number of bases."></td></tr></table></div></div></div><br class="figure-break"><p>
	This is quite astonishing: the problem occurs at a site without real
	homopolymer (calling a 2-bases run a 'homopolymer' starts stretching
	the definition a bit) and there are no major problematic homopolymer
	sites nearby. In fact, this was more or less the case for all sites I
	had a look at.
      </p><p>
	Neither did the cases which were investigated show common base
	patterns, so unlike the Solexa GGCxG motif it does not look like
	that error of IonTorrent is bound to a particular motif.
      </p><p>
	While I cannot prove the following statement, I somehow suspect that
	there must be some kind of secondary structure forming which leads to
	that kind of sequencing error. If anyone has a good explanation I'd be
	happy to hear it: feel free to contact me at
	<code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code>.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_covvariance"></a>12.3.3. 
	Coverage variance
      </h3></div></div></div><p>
	The coverage variance with the old ~100bp reads was a bit on the
	bad side for low coverage projects (10x to 15x): it varied wildly,
	sometimes dropping to nearly zero, sometimes reaching approximately
	double the coverage.
      </p><p>
	This has now improved and I have not seen pronounced coverage variance
	in the data sets I have worked on.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_gcbias"></a>12.3.4. 
	GC bias
      </h3></div></div></div><p>
	The GC bias seems to be small to non-existent, at least I could not
	immediately make a correlation between GC content and coverage.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_other_sources_of_error"></a>12.3.5. 
	Other sources of error
      </h3></div></div></div><p>
	You will want to keep an eye on the clipping of the data in the SFF
	files from IonTorrent: while it is generally good enough, some data
	sets of IonTorrent show that - for some error patterns - the clipping
	is too lax and strange artefacts appear. MIRA will take care of these
	- or at least of those it knows - but you should be aware of this
	potential problem.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_where_to_find_further_information"></a>12.3.6. 
	Where to find further information
      </h3></div></div></div><p>
	IonTorrent being pretty new, getting as much information as possible
	on that technology is quite important. So here are a couple of links
	I found to be helpful:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    There is, of course, the TorrentDev site (<a class="ulink" href="http://lifetech-it.hosted.jivesoftware.com/community/torrent_dev" target="_top">http://lifetech-it.hosted.jivesoftware.com/community/torrent_dev</a>)
	    at Life Technologies which will be helpful to get a couple of
	    questions answered.
	  </p><p>
	    Just be aware that some of the documents over there are sometimes
	    painting an - how should I say it diplomatically? - overly
	    optimistic view on the performance of the technology. On the
	    other hand, so do documents released by the main competitors
	    like 454/Roche, Illumina, PacBio etc. ... so no harm done there.
	  </p></li><li class="listitem"><p>
	    I found Nick Loman's blog <a class="ulink" href="http://pathogenomics.bham.ac.uk/blog/" target="_top">Pathogens: Genes and
	    Genomes</a> to be my currently most valuable source of
	    information on IonTorrent. While the group he works for won a
	    sequencer from IonTorrent, he makes that fact very clear and still
	    unsparingly dissects the data he gets from that machine.
	  </p><p>
	    His posts got me going in getting MIRA to grok IonTorrent.
	  </p></li><li class="listitem"><p>
	    The blog of Lex Nederbragt <a class="ulink" href="http://flxlexblog.wordpress.com/" target="_top">In between lines of
	    code</a> is playing in the same league: very down to earth and
	    he knows a bluff when he sees it ... and is not afraid to call it
	    (be it from IonTorrent, PacBio or 454).
	  </p><p>
	    The analyses he did on a couple of Ion data sets have saved me
	    quite some time.
	  </p></li><li class="listitem"><p>
	    Last, but not least, the board with <a class="ulink" href="http://seqanswers.com/forums/forumdisplay.php?f=40" target="_top">IonTorrent-related-stuff</a>
	    over at <a class="ulink" href="http://seqanswers.com/" target="_top">SeqAnswers</a>,
	    the first and foremost one-stop-shop ... erm ... discussion board
	    for everything related to sequencing nowadays.
	  </p></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_pacbio"></a>12.4. 
      Pacific BioSciences
    </h2></div></div></div><p>
      As of January 2014, PacBio should be seen as <span class="emphasis"><em>the</em></span>
      technology to go to for de-novo sequencing of bacteria and lower
      eukaryotes. Period. Complement it with a bit of Illumina to get rid of
      the last remaining errors and you'll have - for a couple of thousand
      Euros - the best genome sequences money can buy.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_pb_highlights"></a>12.4.1. 
	Highlights
      </h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_length"></a>12.4.1.1. 
	  Sequence lengths
	</h4></div></div></div><p>
	  Just one word: huge. At least compared to other currently existing
	  technologies. It is not unusual to get average - usable - read lengths
	  of more than 3 to 4 kb, some chemistries doubling that number (at
	  the expense of accuracy). The largest - usable - reads I have seen
	  were &gt; 25kb, though one needs to keep in mind that these are
	  quite rare and one does not see many of them in a project.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_gcbias"></a>12.4.1.2. 
	  GC bias
	</h4></div></div></div><p>
	  I have seen none in my projects so far, neither have I in public
	  data. But these were certainly not as many projects as Sanger, 454,
	  Illumina and Ion, so take this with a grain of salt.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_acccorrected"></a>12.4.1.3. 
	  Accuracy of corrected reads
	</h4></div></div></div><p>
	  Once the raw PacBio data has been corrected (HGAP pipeline), the
	  resulting reads have a pretty good accuracy. There still are
	  occasional homopolymer errors remaining at non-random locations, but
	  they are a minor problem.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_qualassemblies"></a>12.4.1.4. 
	  Assemblies of corrected reads
	</h4></div></div></div><p>
	  The assemblies coming out of the HGAP pipeline are already
	  astoundingly good. Of course you get long contigs, but also the
	  number of miscalled consensus bases is not too bad: 1 error per 20
	  kb. Once the program
	  <span class="command"><strong>Quiver</strong></span> went through the assembly to do its magic
	  in polishing, the quality improves further into the range of 1
	  error per 50kb to 1 error per 250kb.
	</p><p>
	  In my hands, I get even better assemblies with MIRA (longer contigs
	  which span repeats unresolved by HGAP). When combining this with
	  some low coverage Illumina data (say, 50x) to do cheap polishing,
	  the error rates I get are lower than 1 error in 4 megabases.
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	  Take the above with a grain of salt as, at the time of this writing,
	  I have analysed only a couple of bacteria in depth. For ploidal
	  organisms I have just played around a bit with public data without
	  really doing an in-depth analysis there.
	</td></tr></table></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_pb_lowlights"></a>12.4.2. 
	Lowlights
      </h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_namingconfusion"></a>12.4.2.1. 
	  Naming confusion
	</h4></div></div></div><p>
	  With PacBio, there are quite a number of read types being thrown
	  around and which do confuse people: <span class="emphasis"><em>polymerase
	  reads</em></span>, <span class="emphasis"><em>quality clipped
	  reads</em></span>, <span class="emphasis"><em>subreads</em></span>, <span class="emphasis"><em>corrected
	  reads</em></span> and maybe some more I have currently forgotten. Here's the
	  totally unofficial guide on how to keep those things apart:
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      <span class="bold"><strong>polymerase reads</strong></span> are the rawest
	      and most unedited stuff you may come into contact with. You can see
	      it as "data fresh from the machine" and the number of megabases
	      there is usually the one sequencing providers sell to you.
	    </p><p>
	      The sequencing technology PacBio employs uses special hairpin
	      adaptors they have named SMRTBell, and these adaptors will be
	      present in the polymerase reads together with the fragments of
	      your DNA.
	    </p><p>
	      In terms of regular expression look-alike, the data in
	      polymerase reads has the following form:
	    </p><pre class="screen">(Adaptor + (forward fragment sequence + (Adaptor + (fragment sequence in reverse complement))))*</pre><p>
	      E.g., some of your <span class="emphasis"><em>polymerase reads</em></span> will
	      contain just the adaptor and (part of) a fragment sequence:
	      Adap+FwdSeq. Others might contain: Adap+FwdSeq+Adap+RevSeq. And
	      still others might contain: multiple copies of
	      Adap+FwdSeq+Adap+RevSeq.
	    </p></li><li class="listitem"><span class="bold"><strong>quality clipped reads</strong></span> are
	    simply <span class="emphasis"><em>polymerase reads</em></span> where some sort of
	    first quality clipping has been done.
	  </li><li class="listitem"><span class="bold"><strong>subreads</strong></span> are <span class="emphasis"><em>quality
	    clipped reads</em></span> where the adaptors have been removed and
	    the read split into forward fragment sequences and reverse
	    fragment sequences. Hence, one quality clipped polymerase read can
	    yield several subreads.
	  </li><li class="listitem"><p>
	      <span class="bold"><strong>corrected (sub)reads</strong></span> are
	      subreads where through the magic of lots of computational power
	      and a very high coverage of subreads, the errors have been
	      almost completely removed from the subreads.
	    </p><p>
	      This is usually done only on a part of the subreads as it
	      already takes long enough (several hundred CPU hours for a simple
	      bacterium).
	    </p></li></ul></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_revseq"></a>12.4.2.2. 
	  Forward / reverse chimeric sequences
	</h4></div></div></div><p>
	  The splitting of polymerase reads into subreads (see above) needs
	  the SMRTBell adaptor to be recognised by motif searching
	  programs. Unfortunately, it looks as if some "low percentage"
	  of reads have a self-looped end instead of an adaptor. Which in turn
	  means that the subread splitting will not split those reads and you
	  end up with a chimeric sequence.
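	</p><p>
	  The effect can be reproduced with a toy Python sketch. It is only
	  an illustration under strong assumptions: the adaptor string is a
	  placeholder and not the real SMRTBell sequence, the function names
	  are made up, and splitting is done on exact matches whereas real
	  tools have to search for the adaptor in error-prone reads.
	</p><pre class="screen">
```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def split_subreads(polymerase_read, adaptor):
    """Split at exact adaptor matches and drop empty pieces."""
    return [piece for piece in polymerase_read.split(adaptor) if piece]


ADAPTOR = "TTTTCCCCGGGG"  # placeholder, not the real SMRTBell sequence
fragment = "ACGTTGCAAGGCTA"

# Adap+FwdSeq+Adap+RevSeq: the adaptor is found, two clean subreads result.
good = ADAPTOR + fragment + ADAPTOR + revcomp(fragment)
print(split_subreads(good, ADAPTOR))

# Self-looped end: no adaptor between forward and reverse pass, so both
# passes stay glued together in a single chimeric subread.
looped = ADAPTOR + fragment + revcomp(fragment)
print(split_subreads(looped, ADAPTOR))
```
</pre><p>
	  In the second case the single resulting "subread" contains the
	  fragment followed directly by its own reverse complement, which is
	  exactly the forward / reverse chimera described above.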
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_rawreadaccuracy"></a>12.4.2.3. 
	  Accuracy of uncorrected subreads
	</h4></div></div></div><p>
	  You need to be brave now: the accuracy of the unclipped
	  polymerase reads is usually only about 50%. That is: on average
	  every second base is wrong. And I have seen a project where this
	  accuracy was only 14% (6 out of 7 bases are wrong).
	</p><p>
	  After clipping, the average accuracy of the polymerase reads should
	  be anywhere between 80% and 85% (this depends a little bit on the
	  chemistry used), which translates to: every 5th to every 7th base is
	  wrong. The vast majority of errors are insertions or deletions, not
	  base substitutions.
	</p><p>
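	  The translation from accuracy to "every n-th base is wrong" is
	  simple arithmetic: with an accuracy a, one base in 1 / (1 - a) is
	  wrong on average. A small Python sketch, with a function name made
	  up for the example:
	</p><pre class="screen">
```python
def bases_per_error(accuracy):
    """Average number of bases per wrong base, accuracy given as a fraction."""
    return 1.0 / (1.0 - accuracy)


for accuracy in (0.50, 0.80, 0.85):
    print(f"{accuracy:.0%} accurate: "
          f"one error every {bases_per_error(accuracy):.1f} bases")
```
</pre><p>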
	  80% to 85% accuracy with indels as primary error is unfortunately
	  something assemblers cannot use very well. Read: not at all if you
	  want good assemblies (at least I know no program which does
	  that). Therefore, one needs to apply some sort of correction
	  ... which needs quite a deal of CPU, see below.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_cpu"></a>12.4.2.4. 
	  Immense need for CPU power
	</h4></div></div></div><p>
	  The above mentioned accuracies of 80% to 85% are too low for any
	  existing assembler I know of to produce a correct assembly. Therefore,
	  people came up with the idea of doing error correction on subreads
	  to improve their quality.
	</p><p>
	  There are two major approaches: 1) correcting PacBio subreads with
	  other technologies with shorter reads and 2) correcting long PacBio
	  subreads with shorter PacBio subreads. Both approaches have been
	  shown to work, though there seems to be a preference nowadays to use
	  the second option as the "shorter" PacBio reads provide the benefit
	  of still being longer than reads from other technologies and hence
	  provide a better repeat resolution.
	</p><p>
	  Anyway, the amount of CPU power needed for either method above is
	  something to keep in mind: bacteria with 3 to 5 megabases at a 100x
	  polymerase read coverage can take several hundred hours of CPU for
	  the correction step.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_dnaprep"></a>12.4.2.5. 
	  Increased quality requirements for clean DNA sample prep
	</h4></div></div></div><p>
	  This is a problem which cannot be really attributed to PacBio: one
	  absolutely needs to check whether the protocols used "since ever"
	  for DNA extraction yield results which are clean and long enough for
	  PacBio. Often they are not.
	</p><p>
	  The reason for this being a problem is simple: PacBio can sequence
	  really long fragments, but if your DNA extraction protocol smashed
	  the DNA into small pieces, then no sequencing technology in this
	  universe will be able to give you long reads from small fragments.
	</p></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_seqadvice"></a>Chapter 13. Some advice when going into a sequencing project</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_seqadv_seqprovider">13.1. 
      Talk to your sequencing provider(s) before sequencing
    </a></span></dt><dt><span class="sect1"><a href="#sect_seqadv_whichseqprovider">13.2. 
      Choosing a sequencing provider
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_want">13.2.1. 
	WHAT DO YOU WANT?!
      </a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_need">13.2.2. 
	WHAT DO YOU NEED?!
      </a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_cost">13.2.3. 
	WHAT WILL IT COST ME?
      </a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_where">13.2.4. 
	WHERE TO SEQUENCE?
      </a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_summary">13.2.5. 
	Summary of all the above
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_specific">13.3. 
      Specific advice
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_technologies">13.3.1. 
	Technologies
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_seqadv_technologies_sanger">13.3.1.1. 
	  Sanger
	</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_pacbio">13.3.1.2. 
	  Pacific Biosciences
	</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_illumina">13.3.1.3. 
	  Illumina
	</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_iontorrent">13.3.1.4. 
	  Ion Torrent
	</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_454">13.3.1.5. 
	  Roche 454
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_seqadv_denovo">13.3.2. 
	Sequencing de-novo
      </a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_mapping">13.3.3. 
	Re-sequencing / mapping
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_a_word_or_two_on_coverage">13.4. 
      A word or two on coverage ...
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_lowcov">13.4.1. 
	Low coverage isn't worth it
      </a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_highcov">13.4.2. 
	Catch-22: too high coverage
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_when_sequencing_a_word_of_caution_regarding_your_dna">13.5. 
      A word of caution regarding your DNA in hybrid sequencing projects
    </a></span></dt><dt><span class="sect1"><a href="#sect_seqadv_for_bacteria">13.6. 
      Advice for bacteria
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_for_bacteria_no_not_sample_in_exponential_phase">13.6.1. 
	Do not sample DNA from bacteria in exponential growth phase!
      </a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_for_bacteria:_beware_of_high_copy_number_plasmids">13.6.2. 
	Beware of (high copy number) plasmids!
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em>
	<span class="quote">&#8220;<span class="quote">
	  Reliable information lets you say 'I don't know' with real confidence.
	</span>&#8221;</span>
      </em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_seqprovider"></a>13.1. 
      Talk to your sequencing provider(s) before sequencing
    </h2></div></div></div><p>
      Well, duh! But it's interesting what kind of mails I sometimes get. Like in:
    </p><div class="blockquote"><blockquote class="blockquote"><span class="quote">&#8220;<span class="quote">We've sequenced a one gigabase, diploid eukaryote with
    Solexa 36bp paired-end with 200bp insert size at 25x coverage. Could you
    please tell us how to assemble this data set de-novo to get a finished
    genome?</span>&#8221;</span></blockquote></div><p>
      A situation like the above should never have happened. Good sequencing
      providers are interested in keeping customers long term and will
      therefore try to find out what exactly your needs are. These folks
      generally know their stuff (they're making a living out of it) and most
      of the time will propose a strategy that fulfills your needs for a near
      minimum amount of money.
    </p><p>
      Listen to them.
    </p><p>
      If you think they are trying to rip you off or are overselling their
      competences (which most providers I know won't even think of trying,
      but there are some), ask for a quote from a couple of other
      providers. You'll see pretty quickly if something is not
      right.
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      As a matter of fact, a rule which has saved me time and again for
      finding sequencing providers is not to go for the cheapest provider,
      especially if their price is far below quotes from other
      providers. They're cutting corners somewhere others don't cut for a
      reason.
    </td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_whichseqprovider"></a>13.2. 
      Choosing a sequencing provider
    </h2></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      This is a slightly reworked version of a post I made on the MIRA talk
      mailing list.  The question <span class="emphasis"><em>"Could you please recommend me a
      sequencing provider?"</em></span> arrives every now and then in my
      private inbox, often enough for me to decide to make a collage of the
      responses I gave in the past and post it to MIRA talk.
    </td></tr></table></div><p>
      This response got, errrr, a little longer, but allow me to note that I
      will not give you names. The reasons are manifold:
    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
	once upon a time I worked for a sequencing company
      </li><li class="listitem">
	the company I am currently employed with is not in the sequencing
	provider business, but the company uses more than one sequencing
	provider on a regular basis and I get to see quite some data
      </li><li class="listitem">
	due to my development on MIRA in my free time, I'm getting insight
	into a number of highs and lows of sequencing technologies at
	different sequencing providers which I would not get if I were to
	expose them publicly ... I do not want to jeopardise these
	relationships.
      </li></ul></div><p>
      That being said, there are a number of general considerations which
      could help you. Excuse me in case the detours I am going to make are
      obvious to you, but I'm writing this also for future references. Also,
      please bear with me if I look at "sequencing" a bit differently than you
      might be accustomed to from academia, but I have worked for quite some
      time now in industry ... and there the cost-effectiveness, or rather
      the "probability of success", of a project as a whole is paramount to
      everything else. I'll come back to that further down.
    </p><p>
      There's one -- and only one -- question which you, as sequencing
      customer, need to be able to answer ... if necessary in every
      excruciating detail, but you must know the answer. The question is:
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_want"></a>13.2.1. 
	WHAT DO YOU WANT?!
      </h3></div></div></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
	  Detour - Sequencing -
	</b></p></div></div></div><p>
	  For me, every "sequencing project", be it genomic or transcriptomic,
	  really consists of four major phases:
	</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	      <span class="bold"><strong>data generation:</strong></span> This can be
	      broadly seen as everything to get the DNA/RNA ready to be sent
	      off to sequencing (usually something the client does), the
	      library prep at the sequencing provider and finally the
	      sequencing itself (including base calling). An area of a thousand
	      pitfalls where each step (and the communication) is crucial and
	      even one slight inadvertence can make the difference between a
	      "simple" project and a "hard" project. E.g.: taking DNA from
	      growing cells (especially bacteria in exponential growth phase)
	      might not be a good idea ... it makes assembly more
	      difficult. Some DNA extraction methods generate more junk than
	      good fragments, etc.
	    </p><p>
	      The reason I am emphasizing this is simple: nowadays, the
	      "sequencing" itself is not the most expensive part of a
	      sequencing project, the next two steps are (most of the time
	      anyway).
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>assembly &amp; finishing:</strong></span> Still
	      a hard problem. Even a "simple" bacterium can present weeks of
	      effort to get right if it's riddled with phages, prophages,
	      transposon elements, genetically engineered repeats etc. And
	      starting with eukaryotes the real fun starts: ploidy,
	      retrotransposons etc. make for an unbelievable genome plasticity
	      and almost always have their own surprises. I've seen "simple"
	      Saccharomyces cerevisiae - where biologists swore to high heaven
	      they were "close to the publicly sequenced strains" - being
	      *very* different from what they were expected to be, both on the
	      DNA level and the genome organisation level.
	    </p><p>
	      Getting eukaryotes right "down to the last base" might cost
	      quite some money, especially when looping back to step 1 (data
	      generation) to tackle difficult areas.
	    </p></li><li class="listitem"><p>
	      <span class="bold"><strong>annotation:</strong></span> Something many
	      people forget: give the sequence a meaning. Here too, things can
	      get quite costly if done "right", i.e., with hand
	      curation. Especially on organisms which are not part of the more
	      commonly sequenced species or are generally more complex.
	    </p><p>
	      Annotation of a de-novo transcriptome assembly is also not for
	      the faint of heart, especially if done on short, unpaired read
	      assemblies.
	    </p></li><li class="listitem"><span class="bold"><strong>using the sequencing data:</strong></span>
	    ... for whatever it was generated for.
	  </li></ol></div></div><p>
	The above makes it clear that, depending on what you are really
	interested in within your project and what you expect to be able to do
	with the sequencing data, one can cut corners and reduce cost here and
	there (but not everywhere). And therefore, the above question "What do
	you want?" is one which - after the initial chit-chat of "hi, hello,
	nice to meet you, a pleasure to be here, etc." - every good
	representative of respectable sequencing providers I have met so far
	will ask as very first question. Usually in the form of "what do you
	want to sequence and what will you want to use the data for (and what
	not)?"
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_need"></a>13.2.2. 
	WHAT DO YOU NEED?!
      </h3></div></div></div><p>
	... difference between "want" and "need" ...
      </p><p>
	Every other question - like where to sequence, which sequencing
	technology to use, how to process the sequencing data afterwards - is
	incidental and subordinated to your answer(s) to the question of "what
	do you want?!" But often sequencing customers get their priorities
	wrong by putting forward another question:
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_cost"></a>13.2.3. 
	WHAT WILL IT COST ME?
      </h3></div></div></div><p>
	And its inevitable companion question "Can you make it cheaper?"
      </p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
	  Detour - Putting things into perspective -
	</b></p></div></div></div><p>
	  Come to think of it, people sometimes have very interesting ideas
	  regarding costs. Interesting as in "outright silly." It may be
	  because they do not really know what they want or feel unsure on a
	  terrain unbeknownst to them, and often instead focus their energy on
	  single aspects of a wider project because they feel more at home
	  there. And suddenly the focus lies on haggling and bartering for
	  some prices because, after all, this is something everyone knows how
	  to do, right?
	</p><p>
	  As I hinted earlier, the pure sequencing costs are nowadays probably
	  not the biggest factor in any sequencing project: 454, Illumina,
	  IonTorrent and other technology providers have seen to that. E.g.,
	  in 2003/2004 it still cost somewhere between 150 - 200 k&#8364; to get an
	  8x Sanger coverage of a moderately sized bacterium (4 to 5
	  Mb). Nowadays, for the same organism, you get coverages in the
	  dozens (going with 454) for a few thousand Euro ... or coverages in
	  the hundreds or even thousands (going with Illumina) for a few
	  hundred Euro.
	</p><p>
	  Cost for assembly, finishing and annotation have not followed the
	  same decrease. Yes, advances in algorithms have made things easier
	  in some parts, but not really on the same scale. Furthermore, the
	  "short read" technologies have more than offset these gains by
	  adding algorithmic complexity compared to the old Sanger reads.
	  Maybe "(ultra)long read" technologies will alleviate the problem, but I
	  would not hold my breath for them to really work well.
	</p><p>
	  One thing however has almost not changed at all: your costs of
	  actually doing followup experiments and data interpretation!
	  Remember that sequencing in itself is most of the time not the
	  ultimate goal, you actually want to gain something out of it. Be it
	  abstract knowledge for a paper or concrete hints for producing some
	  compounds or whatever, chances are that you will actually devote a
	  substantial amount of your resources (time, manpower, mental health)
	  into followup activities (lab experiments, genetic engineering,
	  writing papers) to turn the abstract act of sequencing into
	  something tangible, be it papers, fame, new products, money, or
	  whatever you want to achieve.
	</p><p>
	  And this is the place where it pays to stop and think: "what do I
	  want? what are my strengths and where are my weaknesses? where are
	  my priorities?" The English have a nice saying: "Being penny-wise
	  and pound-foolish is not wise." I may add: Especially not if you are
	  basing man months / years of lab work and your career on the outcome
	  of something like sequencing. Maybe I'm spoiled because I have left
	  academia for quite some time now, but in sequencing I always prefer
	  to throw a bit more money at the sequencing process itself to
	  minimise risks of the later stages.
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_where"></a>13.2.4. 
	WHERE TO SEQUENCE?
      </h3></div></div></div><p>
	There's one last detour I'd like to make, and that is the question of "where to sequence?"
      </p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
	  Detour - Public or private, old-timers or young-timers ? -
	</b></p></div></div></div><p>
	  Choosing a sequencing provider is highly dependent on your answer to
	  "what do you want?" Wanting to keep the sequencing data (or
	  the very act of sequencing) secret (even only for some time) will
	  probably lead you to commercial sequencing companies. There you more
	  or less have complete control over the data. Paranoid people might
	  perhaps argue that you can have that only with your own sequencing
	  equipment and personnel, but I have the feeling that only a minority
	  is able to cough up the necessary money for purchasing sequencing
	  equipment for a small one-time project.
	</p><p>
	  Instead of companies you could however also look whether one of the
	  existing sequencing centers in the world might be a good cooperation
	  candidate. Especially if you are doing this project within the scope
	  of your university. Note however that there might be a number of
	  gotchas lurking there, beside the obvious "the data is not really
	  secret anymore": sometimes the raw sequencing data needs to be
	  publicly released, maybe earlier than you would like; or the
	  sequencing center imposes that each and every paper you publish with
	  that data as basis has them as (co-)first author.
	</p><p>
	  A related problem is "whom do I trust to deliver good work?"
	  Intuition says that institutes with a long sequencing history have
	  amassed quite some knowledge in this field, making them experts in
	  all three aspects (data generation, assembly &amp; finishing,
	  annotation) of a sequencing project ... and intuition probably isn't
	  wrong there. The same thing is probably true for sequencing
	  companies which have existed for more than just a couple of years,
	  though what I have seen so far is that - due to size -
	  sequencing companies sometimes really focus on the data generation
	  and rely on partner companies for "assembly" and "annotation". This
	  is not to say that younger companies are bad. Incidentally, it is my
	  belief that in this field, people are still more important than
	  technology ... and every once in a while good people split off a
	  well known institute (or company) to try their luck in their own
	  company. Always look for references there.
	</p><p>
	  The following statement is a personal opinion (and you can call me
	  biased for that): Personally, I am however quite wary of sequencing
	  done at locations where a sequencer exists because someone got a
	  grant to buy one (because it was chic &amp; en-vogue to get a shiny
	  new toy) but where the instrument then slowly starts to collect dust
	  after the initial flurry ... and because people often do not
	  factor in the chemistry costs which arise if they really want
	  to use the machine 24/7. I want to know that technicians actually
	  work with those things every day, that they know the ins and outs of
	  the work, the protocols, the chemistry, the moods of the machine
	  (even an instrument can have a bad day). I honestly do not believe
	  that one can build up enough expertise when operating these things
	  "every once in a while".
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_summary"></a>13.2.5. 
	Summary of all the above
      </h3></div></div></div><p>
	All of the above means that depending on what I need the data for, I
	have the freedom to choose among different providers. In case I just need
	masses of raw data and potential savings are substantial, I might go
	with the cheapest whom I know to generate good data. If I want good
	service and a second round of data in case I am not 110% satisfied with
	the first round (somehow people have stopped questioning me there),
	this is usually not the cheapest provider ... but the additional costs
	are not really high. If I wanted my data really really quick, I'd
	search for a provider with Ion Torrent, or MiSeq (I am actually
	looking for one with a MiSeq, so if anyone knows a good one,
	preferably in Europe -&gt; mail me). Though I already did transcriptomics
	on eukaryotes, in case I needed larger eukaryotes assembled de-novo
	&amp; also annotated, I would probably look for the help of a larger
	sequencing center as this starts to get dangerously near the fringe of
	my field of expertise.
      </p><p>
	In closing this part, here are a couple of guidelines which have not
	failed me so far for choosing sequencing providers:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
	  Building a good relationship helps. In case your institute /
	  university already has good (or OK) experience with a provider, ask
	  there first.
	</li><li class="listitem">
	  It is a lot easier to build a good relationship with someone who
	  speaks your language ... or good(!) English.
	</li><li class="listitem">
	  I will not haggle for a couple of hundred Euros in a single project,
	  but I'll certainly reconsider this when savings are in the tens of
	  thousands.
	</li><li class="listitem">
	  Managing expectations: some sequencing projects are high risk from
	  the start, for lots of possible reasons (underfunded, bad starting
	  material, unclear organism). This is *sometimes* (!) OK as long as
	  everyone involved knows and acknowledges this. However, you should
	  always have a clear target ("what am I looking for?") and preferably
	  know in advance how to treat the data to get there.
	</li><li class="listitem">
	  Errors occur, stay friendly at first. In case the expectations were
	  clear (see above), the material and organism are not at fault but
	  the data quality somehow is bad, it is not too difficult to have the
	  sequencing provider acknowledge this and get additional sequencing
	  for no added cost.
	</li></ul></div><p>
	Regarding the technologies you can use ... it really depends on what
	you want to do :-) And note that I base my answers on technologies
	available today without bigger problems: PacBio, Illumina, with
	IonTorrent as Joker for quick projects. 454 can still be considered,
	but probably not for too long anymore as Roche stopped development of
	the technology and thus PacBio takes over the part for long
	reads. Oxford Nanopore might become a game changer, but they are not
	there just yet.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_specific"></a>13.3. 
      Specific advice
    </h2></div></div></div><p>
      Here's how I see things as of now (January 2014), which might not
      necessarily be how others see them.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_technologies"></a>13.3.1. 
	Technologies
      </h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_sanger"></a>13.3.1.1. 
	  Sanger
	</h4></div></div></div><p>
	  Use for: checking assemblies; closing gaps by PCR; checking for a couple of genes with
	  known sequence (i.e., for which you can design oligos).
	</p><p>
	  Do not use for: anything else. In particular, if you find yourself
	  designing oligos for a 96 well plate destined for Sanger sequencing
	  of a single bacterial DNA sample, you (probably) are doing something
	  wrong.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_pacbio"></a>13.3.1.2. 
	  Pacific Biosciences
	</h4></div></div></div><p>
	  Use for: de-novo of bacteria and lower eukaryotes (or higher
	  eukaryotes if you have the money). PacBio should be seen as
	  <span class="emphasis"><em>the</em></span> technology to use when getting the best
	  assemblies with least number of contigs is important to you. Also,
	  resequencing of variants of known organisms with lots of genomic
	  reorganisation flexibility due to high numbers of transposons (where
	  short reads will not help in getting the chromosome assembled/mapped
	  correctly).
	</p><p>
	  Do not use for: resequencing of "dull" organisms (where the only
	  differences will be simple SNPs or simple insertion/deletions or
	  simple contig reorganisations at non-repetitive places). Illumina
	  will do a much better and more cost-effective job there.
	</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
	    As of January 2014: aim for at least 100x coverage of raw data,
	    better 130x to 150x as pre-processing (quality clip, removal of
	    adapters and other sequencing artefacts) will take its toll and
	    reduce the data by up to 1/3. After that, the error
	    correction/self-correction of raw reads into corrected reads will
	    again reduce the data considerably.
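	  </p><p>
	    To put rough numbers on this recommendation, here is a small
	    Python sketch. The 5 Mb genome size is a hypothetical example,
	    the function names are made up for illustration, and the loss of
	    one third is the worst case quoted above. The additional loss
	    during error correction is not modelled because no figure is given
	    for it here.
	  </p><pre class="programlisting">
```python
def raw_bases_needed(genome_size_bp, raw_coverage):
    """Total raw bases to order for a genome size and raw coverage."""
    return genome_size_bp * raw_coverage


def coverage_after_preprocessing(raw_coverage, fraction_lost=1.0 / 3.0):
    """Coverage left once pre-processing removed a fraction of the data."""
    return raw_coverage * (1.0 - fraction_lost)


genome_size_bp = 5_000_000  # hypothetical 5 Mb bacterium
for raw_coverage in (100, 130, 150):
    total = raw_bases_needed(genome_size_bp, raw_coverage)
    left = coverage_after_preprocessing(raw_coverage)
    print(f"{raw_coverage}x raw = {total / 1e6:.0f} Mb, "
          f"about {left:.0f}x after worst-case pre-processing")
```
</pre><p>
	    Under these assumptions, only the 150x order still leaves 100x
	    after pre-processing in the worst case, which is one way to read
	    the "better 130x to 150x" advice.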
	  </p><p>
	    It's really a numbers game: the more data you have, the more
	    likely you will also get many of those really long reads in the 5
	    to 30 Kb range which are extremely useful to get over those nasty
	    repeats.
	  </p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	  MIRA will most probably give you longer contigs with corrected
	  PacBio reads than you get with the HGAP pipeline, but the number of
	  indel errors will currently be higher. Either use Quiver on the
	  results of MIRA ... or simply polish the assembly with a cheap
	  Illumina data set. The latter approach will also give you better
	  results than a Quiver approach.
	</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	  For non-haploid organisms, you might need more coverage to get
	  enough data at ploidy sites to get the reads correctly out of
	  error correction.
	</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	  Preparation of your DNA sample is not trivial as many methods will
	  break your DNA into "small" chunks which are good enough for
	  Sanger, 454, Illumina or Ion Torrent, but not for PacBio.
	</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_illumina"></a>13.3.1.3. 
	  Illumina
	</h4></div></div></div><p>
	  Use for: general resequencing jobs (finding SNPs, indel locations of
	  any size, copy number variations etc.); gene expression analysis;
	  cheap test sequencing of unknown organisms to assess complexity;
	  de-novo sequencing if you are OK with getting hundreds / thousands
	  of contigs (depending on organism, some bacteria get only a few
	  dozen).
	</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	  Careful with high GC organisms: starting at 60% to 65% GC, Illumina
	  reads contain more errors. SNP detection may be less reliable if
	  extreme care is not taken to perform good read clipping. Especially
	  the dreaded GGCxG motif often leads to problems in Illumina reads.
	</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	  For de-novo assemblies, do <span class="emphasis"><em>NOT</em></span> (never ever at
	  all and under no circumstances) use the Nextera kit, take
	  TruSeq. The non-random fragmentation behaviour of Nextera leads to
	  all sorts of problems for assemblers (not only MIRA) which try to
	  use kmer frequencies as a criterion for repetitiveness of a given
	  sequence.
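	  <p>
	    To illustrate why uneven sampling is a problem for this kind of
	    criterion, here is a toy Python sketch. It is emphatically not
	    MIRA's algorithm: the function names, the k-mer size and the
	    "twice the average count" threshold are all assumptions chosen
	    for the example. One region is simply sampled four times as often
	    as the others, and the naive frequency test then flags it as
	    repetitive although it occurs only once in the made-up sequences.
	  </p><pre class="programlisting">
```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer over all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def suspected_repeats(counts, factor=2.0):
    """Return k-mers seen at least `factor` times the average count."""
    average = sum(counts.values()) / len(counts)
    threshold = factor * average
    return {kmer for kmer, n in counts.items() if n >= threshold}


# Toy data: the third region is over-sampled (four identical reads),
# mimicking biased fragmentation rather than a real genomic repeat.
reads = ["AAAACCCC", "GGGGTTTT"] + ["ACGTTGCA"] * 4
flagged = suspected_repeats(kmer_counts(reads, k=4))
print(sorted(flagged))
```
</pre><p>
	    All k-mers flagged in this toy run come from the over-sampled
	    read, i.e. the coverage bias alone is enough to trigger the
	    repeat criterion.
	  </p>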
	</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_iontorrent"></a>13.3.1.4. 
	  Ion Torrent
	</h4></div></div></div><p>
	  Use for: like Illumina, with three notable exceptions: 1) SNP
	  detection is not as good as with Illumina (more false positives and
	  false negatives), 2) de-novo assemblies will contain more single-base
	  indels and 3) as Ion has problems with homopolymers, that technology
	  is not as well suited as Illumina as a complementary hybrid
	  technology for PacBio (except for high-GC perhaps).
	</p><p>
	  Ion has a speed advantage over Illumina: if you have your own machine,
	  getting from your sample to data takes less time than with Illumina.
	</p><p>
	  Also, it looks as if Ion has fewer problems with GC content or
	  sequence motifs than Illumina.
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_454"></a>13.3.1.5. 
	  Roche 454
	</h4></div></div></div><p>
	  That technology is on the way out, but there may be two reasons to
	  not completely dismiss 454: 1) the average read length of 700 bp can
	  be seen as a plus when compared to Illumina or Ion ... but then
	  there's PacBio to take care of read length. 2) the large read-pair
	  libraries work better with 454 than Illumina mate-pair libraries,
	  something which might be important for scaffolding data where even
	  PacBio could not completely resolve long repeats.
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_denovo"></a>13.3.2. 
	Sequencing de-novo
      </h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
	  On a cheap gene fishing expedition? Probably Illumina HiSeq, at
	  least 100bp; 150 to 250bp or even 300bp if your provider supports
	  it well. Paired-end is definitely a plus. As an alternative: Ion
	  Torrent for small organisms (maybe up to 100Mb) and when you need
	  results quickly and do not care about possible frameshifts.
	</li><li class="listitem">
	  Want some larger contigs? PacBio. Add in cheap Illumina 100bp
	  paired-end (150 to 300bp if provider supports it) to get rid of
	  those last frameshifts which may remain.
	</li><li class="listitem">
	  Maybe scaffolding of contigs above? PacBio + Illumina 100bp + a
	  large paired-end library (e.g. 454 20kb)
	</li><li class="listitem">
	  Have some good friends at Oxford Nanopore who can give you some
	  MinION engineering samples? Man, I'd kill for some bacterial test
	  sets with those (especially Bacillus subtilis 168)
	</li></ul></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_mapping"></a>13.3.3. 
	Re-sequencing / mapping
      </h3></div></div></div><p>
	There is a reason why Illumina currently dominates the market as it
	does: a cheap Illumina run (preferably paired-end) will answer most of
	your questions in 99% of the cases. Things will get difficult for
	organisms with high numbers of repeats and/or frequent genome
	re-arrangements. Then using longer read technologies and/or Illumina
	mate-pair may be required.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_a_word_or_two_on_coverage"></a>13.4. 
      A word or two on coverage ...
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_lowcov"></a>13.4.1. 
	Low coverage isn't worth it
      </h3></div></div></div><p>
	There's one thing to be said about coverage and de-novo assembly:
	especially for bacteria, getting more than 'decent' coverage is
	<span class="emphasis"><em>cheap</em></span> with any current day technology. Every
	assembler I know will be happy to assemble de-novo genomes with
	coverages of 25x, 30x, 40x ... and the number of contigs will still
	drop dramatically between a 15x Ion Torrent and a 30x Ion Torrent
	project.
      </p><p>
	In any case, do some calculations: if the coverage you expect to get
	reaches 50x (e.g. 200MB raw sequence for a 4MB genome), then you
	(or rather the assembler) can still throw away the worst 20% of the
	sequence (with lots of sequencing errors) and concentrate on the
	really, really good parts of the sequences to get you nice contigs.
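      </p><p>
	The arithmetic behind such an estimate can be scripted in a few
	lines. The following is a minimal Python sketch and not part of
	MIRA; the function name expected_coverage and its parameters are
	made up for this illustration:
      </p><pre class="screen">
def expected_coverage(raw_bases, genome_size, discard_fraction=0.0):
    """Average coverage left after discarding a fraction of the raw bases."""
    usable_bases = raw_bases * (1.0 - discard_fraction)
    return usable_bases / genome_size


# Numbers from the text: 200 Mb of raw sequence for a 4 Mb genome.
raw = 200_000_000
genome = 4_000_000

print(expected_coverage(raw, genome))        # all data
print(expected_coverage(raw, genome, 0.2))   # worst 20% thrown away
</pre><p>
	With these numbers the full data set gives 50x, and about 40x
	remain after the worst 20% of the sequence has been discarded.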
      </p><p>
	Other example: the price for 1 gigabase Illumina paired-end of a
	single DNA prep is way, way below USD 1000, even with commercial
	providers. Then you just need to do the math: is it worth investing
	10, 20, 30 or more days of wet lab work, designing primers, doing PCR
	sequencing etc. and trying to close remaining gaps or hunt down
	sequencing errors when you went for a 'low' coverage or a non-hybrid
	sequencing strategy? Or do you invest a few bucks more to get some
	additional coverage and considerably reduce the uncertainties and gaps
	which remain?
      </p><p>
	Remember, you probably want to do research on your bug and not
	research on how to best assemble and close genomes. So even if you put
	(PhD) students on the job, it's costing you time and money if you
	wanted to save money earlier in the sequencing. Penny-wise and
	pound-foolish is almost never a good strategy :-)
      </p><p>
	I do agree that with eukaryotes, things start to get a bit more
	interesting from the financial point of view.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_highcov"></a>13.4.2. 
	Catch-22: too high coverage
      </h3></div></div></div><p>
	There is, however, a catch-22 situation with coverage: too much
	coverage isn't good either. Without going into details: sequencing
	errors sometimes interfere heavily when coverage exceeds ~60x to 80x
	for 454 &amp; IonTorrent and approximately 150x to 200x for
	Solexa/Illumina.
      </p><p>
	In those cases, do yourself a favour: there's more than enough data
	for your project ... just cut it down to some reasonable amount: 40x
	to 50x for 454 &amp; IonTorrent, 100x for Solexa/Illumina.
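      </p><p>
	How much of the data to keep follows directly from the current and
	the target coverage. A minimal Python sketch, assuming you already
	know the current coverage of your project; the function name
	keep_fraction and the example projects are hypothetical:
      </p><pre class="screen">
def keep_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that coverage drops to the target.

    Returns 1.0 when the data set is already at or below the target.
    """
    return min(1.0, target_coverage / current_coverage)


# Hypothetical projects: 300x Illumina and 90x Ion Torrent.
print(keep_fraction(300, 100))   # keep about one third of the reads
print(keep_fraction(90, 45))     # keep half of the reads
print(keep_fraction(30, 100))    # already below target, keep everything
</pre><p>
	When subsampling paired reads with whatever tool you prefer, make
	sure both mates of a pair are kept or dropped together.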
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_when_sequencing_a_word_of_caution_regarding_your_dna"></a>13.5. 
      A word of caution regarding your DNA in hybrid sequencing projects
    </h2></div></div></div><p>
      So, you have decided that sequencing your bug with PacBio and Illumina
      (or PacBio and Ion Torrent or whatever) may be a viable way to get the
      best bang for your buck. Then please follow this advice: prepare enough
      DNA <span class="emphasis"><em>in</em></span> <span class="emphasis"><em>one</em></span>
      <span class="emphasis"><em>go</em></span> for the sequencing provider so that they can
      sequence it with all the technologies you chose without you having to
      prepare another batch ... or even grow another culture!
    </p><p>
      The reason is that as soon as you prepare a second batch, the
      probability that it contains a mutation somewhere that your first
      batch did not have is not negligible. And if there is a mutation,
      even if it is only one base, there is a &gt;95% chance that MIRA will
      find it, think it is some repetitive sequence (like a duplicated gene
      with a mutation in it) and split contigs at those places.
    </p><p>
      Now, there are times when you cannot completely be sure that different
      sequencing runs did not use slightly different batches (or even strains).
    </p><p>
      One example: the SFF files for SRA000156 and SRA001028 from the NCBI
      short read archive should both contain E. coli K12 MG1655 (two
      unpaired half plates and a paired-end plate). However, they contain
      DNA from different cultures. Furthermore, the DNA was prepared by
      different labs. The net effect is that the sequences in the paired-end
      library contain a few distinct mutations from the sequences in the two
      unpaired half-plates. Furthermore, the paired-end sequences contain
      sequences from phages that are not present in the unpaired sequences.
    </p><p>
      In those cases, provide strain information to the reads so that MIRA can
      discern possible repeats from possible SNPs.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_for_bacteria"></a>13.6. 
      Advice for bacteria
    </h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_for_bacteria_no_not_sample_in_exponential_phase"></a>13.6.1. 
	Do not sample DNA from bacteria in exponential growth phase!
      </h3></div></div></div><p>
	The reason is simple: some bacteria grow so fast that they start
	replicating themselves even before having finished the first
	replication cycle. This leads to more DNA around the origin of
	replication being present in cells, which in turn fools assemblers and
	mappers into believing that those areas are either repeats or that
	there are copy number changes.
      </p><p>
	Sample. In. Stationary. Phase!
      </p><p>
	For de-novo assemblies, MIRA will warn you if it detects data which
	points at exponential phase. In mapping assemblies, look at the
	coverage profile of your genome: if you see a smile shape (or V
	shape), you have a problem.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_for_bacteria:_beware_of_high_copy_number_plasmids"></a>13.6.2. 
	Beware of (high copy number) plasmids!
      </h3></div></div></div><p>
	This is a source of interesting problems and furthermore gets people
	wondering why MIRA sometimes creates more contigs than other
	assemblers when it usually creates fewer.
      </p><p>
	Here's the short story: there are data sets which include one or
	several high-copy plasmid(s). Here's a particularly ugly example:
	SRA001028 from the NCBI short read archive which contains a plate of
	paired-end reads for E. coli K12 MG1655-G
	(<a class="ulink" href="ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001028/" target="_top">ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001028/</a>).
      </p><p>
	The genome is sequenced at ~10x coverage, but during the assembly,
	three intermediate contigs of ~2kb each attain a silly maximum
	coverage of ~1800x. Each of these contigs is therefore covered ~180
	times deeper than the genome, which adds up to ~540 copies (3 x 180)
	of this plasmid (or these plasmids) in the sequencing.
      </p><p>
	When using the uniform read distribution algorithm - which is switched
	on by default when using "--job=" and the quality level of 'accurate' -
	MIRA will determine the average coverage of the genome to be
	~10x.  Subsequently, this leads MIRA to dutifully create ~500 additional
	contigs (plus a number of contig debris) with various incarnations of
	that plasmid at an average of ~10x, because it thought that these were
	repetitive sites within the genome that needed to be disentangled.
      </p><p>
	Things get even more interesting when some of the plasmid / phage
	copies are slightly different from each other. These too will be split
	apart and when looking through the results later on and trying to join
	the copies back into one contig, one will see that this should not be
	done because there are real differences.
      </p><p>
	DON'T PANIC!
      </p><p>
	The only effect this has on your assembly is that the number of
	contigs goes up. This in turn leads to a number of questions in my
	mailbox why MIRA is sometimes producing more contigs than Newbler (or
	other assemblers), but that is another story (hint: Newbler either
	collapses repeats or leaves them completely out of the picture by not
	assembling repetitive reads).
      </p><p>
	What you can do is the following:
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    either you assemble everything together and then join the plasmid
	    contigs manually after assembly, e.g. in gap4 (drawback: on really
	    high copy numbers, MIRA will work quite a bit longer ... and you
	    will have a lot of fun joining the contigs afterwards)
	  </p></li><li class="listitem"><p>
	    or, after you found out about the plasmid(s) and know the sequence,
	    you filter out reads in the input data which contain this sequence
	    (you can use <span class="command"><strong>mirabait</strong></span> for this) and assemble the
	    remaining reads.
	  </p></li></ol></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_bitsandpieces"></a>Chapter 14. Bits and pieces</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_bap_using_ssaha2_smalt_to_screen_for_vector_sequence">14.1. 
      Using SSAHA2 / SMALT to screen for vector sequence
    </a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Just when you think it's finally settled, it isn't.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
  The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
  </p><pre class="screen">
mira --project=... --job=... ...</pre><p>
    For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
  </p></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_bap_using_ssaha2_smalt_to_screen_for_vector_sequence"></a>14.1. 
      Using SSAHA2 / SMALT to screen for vector sequence
    </h2></div></div></div><p>
      If your sequencing provider gave you data which was NOT pre-clipped for
      vector sequence, you can do this yourself in a pretty robust manner
      using SSAHA2 -- or the successor, SMALT -- from the Sanger Centre. You
      just need to know which sequencing vector the provider used and have its
      sequence in FASTA format (ask your provider).
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      This screening is a valid method for any type of Sanger sequencing
      vectors, 454 adaptors, Illumina adaptors and paired-end adaptors
      etc. However, you probably want to use it only for Sanger type data as
      MIRA already knows all standard 454, Ion Torrent and Illumina adaptors.
    </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      SSAHA2 and SMALT need their input data to be in FASTA format, so you
      will need your reads in FASTA format to run them. MIRA, however, can
      load your original data in whatever format you have it.
    </td></tr></table></div><p>
      For SSAHA2 follow these steps (most are the same as in the example
      above):
    </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>ssaha2 -output ssaha2 \
  -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6 \
  /path/where/the/vector/data/resides/vector.fasta \
  <em class="replaceable"><code>yourinputsequences.fasta</code></em> &gt; <em class="replaceable"><code>screendataforyoursequences.ssaha2</code></em></code></strong></pre><p>
      Then, in your manifest file, add the following line in the readgroup
      which contains the sequences you screened:
    </p><pre class="screen">
<strong class="userinput"><code>readgroup
...
data = <em class="replaceable"><code>yourinputsequences_inwhateverformat_thisexamplehasfastq.fastq</code></em>
data = <em class="replaceable"><code>screendataforyoursequences.ssaha2</code></em>
...</code></strong></pre><p>
      For SMALT, the only difference is that you use SMALT for generating the
      vector-screen file and ask SMALT to generate it in SSAHA2 format. As
      SMALT works in two steps (indexing and then mapping), you also need to
      perform it in two steps and then call MIRA. E.g.:
    </p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>smalt index -k 7 -s 1 smaltidxdb /path/where/the/vector/data/resides/vector.fasta</code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>smalt map -f ssaha -d -1 -m 7 smaltidxdb <em class="replaceable"><code>yourinputsequences.fasta</code></em> &gt; <em class="replaceable"><code>screendataforyoursequences.smalt</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      Please note that, due to subtle differences between output of SSAHA2 (in
      ssaha2 format) and SMALT (in ssaha2 format), MIRA identifies the source
      of the screening (and the parsing method it needs) by the name of the
      screen file. Therefore, screens done with SSAHA2 need to have the
      postfix <code class="filename">.ssaha2</code> in the file name and screens done
      with SMALT need
      <code class="filename">*.smalt</code>.
    </td></tr></table></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_faq"></a>Chapter 15. Frequently asked questions</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_faq_assembly_quality">15.1. 
      Assembly quality
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_is_the_effect_of_uniform_read_distribution_as:urd?">15.1.1. 
	What is the effect of uniform read distribution (-AS:urd)?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_there_are_too_many_contig_debris_when_using_uniform_read_distribution_how_do_i_filter_for_good_contigs?">15.1.2. 
	There are too many contig debris when using uniform read distribution, how do I filter for "good" contigs?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_when_finishing_which_places_should_i_have_a_look_at?">15.1.3. 
	When finishing, which places should I have a look at?
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_454_data">15.2. 
      454 data
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_do_i_need_sffs_for?">15.2.1. 
	What do I need SFFs for?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_what's_sff_extract_and_where_do_i_get_it?">15.2.2. 
	What's sff_extract and where do I get it?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_do_i_need_the_sfftools_from_the_roche_software_package?">15.2.3. 
	Do I need the sfftools from the Roche software package?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_combining_sffs">15.2.4. 
	Combining SFFs
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_adaptors_and_pairedend_linker_sequences">15.2.5. 
	Adaptors and paired-end linker sequences
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_what_do_i_get_in_pairedend_sequencing?">15.2.6. 
	What do I get in paired-end sequencing?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_sequencing_protocol">15.2.7. 
	Sequencing protocol
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_filtering_by_seqlen">15.2.8. 
	Filtering sequences by length and re-assembly
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_solexa___illumina_data">15.3. 
      Solexa / Illumina data
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_can_i_see_deletions?">15.3.1. 
	Can I see deletions?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_can_i_see_insertions?">15.3.2. 
	Can I see insertions?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_denovo_assembly_with_solexa_data">15.3.3. 
	De-novo assembly with Solexa data
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_hybrid_assemblies">15.4. 
      Hybrid assemblies
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_are_hybrid_assemblies?">15.4.1. 
	What are hybrid assemblies?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_what_differences_are_there_in_hybrid_assembly_strategies?">15.4.2. 
	What differences are there in hybrid assembly strategies?
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_masking">15.5. 
      Masking
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_should_i_mask?">15.5.1. 
	Should I mask?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_how_can_i_apply_custom_masking?">15.5.2. 
	How can I apply custom masking?
      </a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_miscellaneous">15.6. 
      Miscellaneous
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_are_megahubs?">15.6.1. 
	What are megahubs?
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_passes_and_loops">15.6.2. 
	Passes and loops
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_debris">15.6.3. 
	Debris
      </a></span></dt><dt><span class="sect2"><a href="#sect_faq_tmpf_files:_more_info_on_what_happened_during_the_assembly">15.6.4. 
	Log and temporary files: more info on what happened during the assembly
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_faq_sequence_clipping_after_load">15.6.4.1. 
	  Sequence clipping after load
	</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_faq_platforms_and_compiling">15.7. 
      Platforms and Compiling
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_windows">15.7.1. 
	Windows
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Every question defines its own answer. Except perhaps 'Why a duck?'
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
  The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
  </p><pre class="screen">
mira --project=... --job=... ...</pre><p>
    For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
  </p></td></tr></table></div><p>
    This list is a collection of frequently asked questions and answers
    regarding different aspects of the MIRA assembler.
  </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
    This document needs to be overhauled.
  </td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_assembly_quality"></a>15.1. 
      Assembly quality
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_is_the_effect_of_uniform_read_distribution_as:urd?"></a>15.1.1. 
	What is the effect of uniform read distribution (-AS:urd)?
      </h3></div></div></div><p>
      </p><pre class="screen">
	I have a project which I once started quite normally via
	"--job=denovo,genome,accurate,454"
	and once with explicitly switching off the uniform read distribution
	"--job=denovo,genome,accurate,454 -AS:urd=no"
	I get less contigs in the second case and I wonder if that is not better.
	Can you please explain?
      </pre><p>
      </p><p>
	Since 2.9.24x1, MIRA has a feature called "uniform read distribution" which is
	normally switched on. This feature reduces over-compression of repeats during
	the contig building phase and makes sure that, e.g., an rRNA stretch which is
	present 10 times in a bacterium will also be present approximately 10 times in
	your result files.
      </p><p>
	It works a bit like this: under the assumption that reads in a project are
	uniformly distributed across the genome, MIRA will enforce an average coverage
	and temporarily reject reads from a contig when this average coverage
	multiplied by a safety factor is reached at a given site.
      </p><p>
	It's generally a very useful tool to disentangle repeats, but has some slight
	secondary effects: rejection of otherwise perfectly good reads. The
	assumption of read distribution uniformity is the big problem we have here:
	of course it's not really valid. You sometimes have less, and sometimes more
	than "the average" coverage. Furthermore, the new sequencing technologies -
	454 perhaps but especially the microreads from Solexa &amp; probably also SOLiD -
	show that you also have a skew towards the site of replication origin.
      </p><p>
	One example: let's assume the average coverage of your project is 8 and by
	chance at one place you have 17 (non-repetitive) reads, then the following
	happens:
      </p><p>
	<span class="emphasis"><em>p</em></span> = value of the parameter -AS:urdsip
      </p><p>
	Pass 1 to <span class="emphasis"><em>p</em></span>-1: MIRA happily assembles everything together and calculates a
	number of different things, amongst them an average coverage of ~8. At the
	end of pass <span class="emphasis"><em>p</em></span>-1, it will announce this average coverage as a first estimate
	to the assembly process.
      </p><p>
	Pass <span class="emphasis"><em>p</em></span>: MIRA has still assembled everything together, but at the end of each
	pass the contig self-checking algorithms now include an "average coverage
	check". They'll invariably find the 17 reads stacked and decide (looking at
	the -AS:urdct parameter which I now assume to be 2) that 17 is larger than
	2*8 and that this very well may be a repeat. The reads get flagged as
	possible repeats.
      </p><p>
	Pass <span class="emphasis"><em>p</em></span>+1 to end: the "possibly repetitive" reads get a much tougher
	treatment in MIRA. Amongst other things, when building the contig, the contig
	now makes sure that "possibly repetitive" reads do not stack higher than the
	average coverage multiplied by a safety value (-AS:urdcm) which I'll assume in
	this example to be 1.5. So, at a certain point, say when read 14 or 15 of
	that possible repeat wants to be aligned to the contig at this given place, the
	contig will just flatly refuse and tell the assembler to please find another
	place for it, be it in the contig currently being built or any other that will
	follow. Of course, if the assembler cannot comply, the reads 14 to 17 will end
	up as a contiglet (contig debris, if you want) or, if only one read got
	rejected like this, it will end up as a singlet or in the debris file.
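      </p><p>
	The two thresholds used in this example can be written down as a
	small calculation. This is only a sketch of the arithmetic described
	above, not MIRA's actual implementation; the function name
	urd_thresholds is hypothetical, and the default values are the ones
	assumed in the example:
      </p><pre class="screen">
def urd_thresholds(average_coverage, urdct=2.0, urdcm=1.5):
    """Return (flag_threshold, stacking_cap) for a given average coverage.

    flag_threshold: coverage above which reads are flagged as possibly
                    repetitive (average coverage times -AS:urdct).
    stacking_cap:   coverage up to which flagged reads may stack in later
                    passes (average coverage times -AS:urdcm).
    """
    return average_coverage * urdct, average_coverage * urdcm


flag_threshold, stacking_cap = urd_thresholds(8)
reads_at_site = 17

print(flag_threshold)                   # 16.0
print(stacking_cap)                     # 12.0
print(reads_at_site - flag_threshold)   # positive: the stack gets flagged
</pre><p>
	Note that both values are coverage values at a given position.
	Which read of a stack is refused first presumably depends on where
	the reads actually overlap, which is why the description above only
	gives an approximate read number.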
      </p><p>
	Tough luck. I do have ideas on how to re-integrate those reads at the end of an
	assembly, but I have deferred doing this as in every case I looked at,
	adding those reads to the contigs wouldn't have changed anything ... there's
	already enough coverage. What I do in those cases is simply filter away the
	contiglets (defined as being of small size and having an average coverage
	below the average coverage of the project / 3 (or 2.5)) from a project.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_there_are_too_many_contig_debris_when_using_uniform_read_distribution_how_do_i_filter_for_good_contigs?"></a>15.1.2. 
	There are too many contig debris when using uniform read distribution, how do I filter for "good" contigs?
      </h3></div></div></div><p>
      </p><pre class="screen">
	When using uniform read distribution there are too many contig with low
	coverage which I don't want to integrate by hand in the finishing process. How
	do I filter for "good" contigs?
      </pre><p>
      </p><p>
	OK, let's get rid of the cruft. It's easy, really: you just need to look up
	one number, make two decisions and then launch a command.
      </p><p>
	The first decision you need to make is on the minimum average coverage the
	contigs you want to keep should have. Have a look at the file
	<code class="filename">*_info_assembly.txt</code> which is in the info directory after
	assembly. In the "Large contigs" section, there's a "Coverage assessment"
	subsection. It looks a bit like this:
      </p><pre class="screen">
	...
	Coverage assessment:
	--------------------
	Max coverage (total): 43
	Max coverage
	Sanger: 0
	454:    43
	Solexa: 0
	Solid:  0
	Avg. total coverage (size &#8805; 5000): 22.30
	Avg. coverage (contig size &#8805; 5000)
	Sanger: 0.00
	454:    22.05
	Solexa: 0.00
	Solid:  0.00
	...
      </pre><p>
      </p><p>
	This project was obviously a 454 only project, and the average coverage for it
	is ~22. This number was estimated by MIRA by taking only contigs of at least
	5Kb into account, which for sure left out everything which could be
	categorised as debris. It's a pretty solid number.
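      </p><p>
	If you need this number for many projects, looking it up can be
	automated. The following Python sketch assumes that the info file
	contains a line formatted exactly like the "Avg. total coverage"
	line in the excerpt above; the function name
	average_total_coverage is hypothetical:
      </p><pre class="screen">
import re


def average_total_coverage(info_text):
    """Return the 'Avg. total coverage' value from the info text, or None."""
    match = re.search(
        r"^\s*Avg\. total coverage[^:]*:\s*([0-9.]+)",
        info_text,
        re.MULTILINE,
    )
    return float(match.group(1)) if match else None


# A few lines in the layout of the excerpt shown above.
sample = (
    "Coverage assessment:\n"
    "--------------------\n"
    "Max coverage (total): 43\n"
    "Avg. total coverage (size \u2265 5000): 22.30\n"
)
print(average_total_coverage(sample))   # 22.3
</pre><p>
	For a real project you would pass the contents of the
	<code class="filename">*_info_assembly.txt</code> file instead of the sample text.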
      </p><p>
	Now, depending on how much time you want to invest performing some manual
	polishing, you should extract contigs which have at least the following
	fraction of the average coverage:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    2/3 if a quick and "good enough" result is what you want and you
	    don't want to do any manual polishing. In this example, that would
	    be around 14 or 15.
	  </p></li><li class="listitem"><p>
	    1/2 if you want to have a "quick look" and possibly perform some
	    contig joins. In this example the number would be 11.
	  </p></li><li class="listitem"><p>
	    1/3 if you want to be quite accurate and be sure not to lose any
	    possible repeat. That would be 7 or 8 in this example.
	  </p></li></ul></div><p>
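	The three fractions translate into cutoffs with a bit of arithmetic. A
	minimal sketch in Python (the function name is hypothetical, 22.30 is the
	average coverage from the example above):
      </p>

```python
def coverage_cutoffs(average_coverage):
    """Map each rule-of-thumb fraction to the minimum average coverage."""
    return {
        "2/3": average_coverage * 2 / 3,  # quick and "good enough"
        "1/2": average_coverage / 2,      # quick look, some contig joins
        "1/3": average_coverage / 3,      # do not lose any possible repeat
    }

for fraction, cutoff in coverage_cutoffs(22.30).items():
    print(f"{fraction} of the average coverage: {cutoff:.2f}")
```

      <p>
	Round the results to whole numbers in whichever direction suits you; the
	fractions are rules of thumb, not exact limits.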
      </p><p>
	The second decision you need to make is the minimum length your contigs
	should have. This decision depends somewhat on the sequencing technology you
	used (the read length). The following are some rules of thumb:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    Sanger: 1000 to 2000
	  </p></li><li class="listitem"><p>
	    454 GS20: 500
	  </p></li><li class="listitem"><p>
	    454 FLX: 1000
	  </p></li><li class="listitem"><p>
	    454 Titanium: 1500
	  </p></li></ul></div><p>
      </p><p>
	Let's assume we decide on a minimum average coverage of 11 and a minimum length
	of 1000 bases. Now you can filter your project with miraconvert:
      </p><pre class="screen">
	miraconvert -x 1000 -y 11 sourcefile.caf filtered.caf
      </pre><p>
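	If you filter many projects, the two decisions can be turned into the
	command line mechanically. The Python sketch below only builds the argument
	list and does not run anything; the table and function names are
	hypothetical, and only the -x and -y switches shown above are used.
      </p>

```python
# Rule-of-thumb minimum contig lengths from the list above; for Sanger the
# lower bound of the 1000 to 2000 range is used here.
MIN_LENGTH_BY_TECH = {
    "Sanger": 1000,
    "454 GS20": 500,
    "454 FLX": 1000,
    "454 Titanium": 1500,
}

def miraconvert_filter_command(source, target, technology, min_avg_coverage):
    """Build (but do not run) the miraconvert filter call as an argument list."""
    min_length = MIN_LENGTH_BY_TECH[technology]
    return ["miraconvert", "-x", str(min_length), "-y", str(min_avg_coverage),
            source, target]

command = miraconvert_filter_command("sourcefile.caf", "filtered.caf", "454 FLX", 11)
print(" ".join(command))
```

      <p>
	The printed line is the same call as the one shown above.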
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_when_finishing_which_places_should_i_have_a_look_at?"></a>15.1.3. 
	When finishing, which places should I have a look at?
      </h3></div></div></div><p>
      </p><pre class="screen">
	I would like to find those places where MIRA wasn't sure and give it a quick
	shot. Where do I need to search?
      </pre><p>
      </p><p>
	Search for the following tags in gap4 or any other finishing program
	for finding places of importance (in this order).
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    IUPc
	  </p></li><li class="listitem"><p>
	    UNSc
	  </p></li><li class="listitem"><p>
	    SRMc
	  </p></li><li class="listitem"><p>
	    WRMc
	  </p></li><li class="listitem"><p>
	    STMU (only hybrid assemblies)
	  </p></li><li class="listitem"><p>
	    STMS (only hybrid assemblies)
	  </p></li></ul></div><p>
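	If your finishing program can export a list of tags, the order above can be
	used to sort it. A minimal Python sketch, assuming the hits are already
	available as (contig, position, tag) tuples; the tuples and the OTHR tag
	below are made up for illustration:
      </p>

```python
# Tag names in the order given in the list above.
TAG_PRIORITY = ["IUPc", "UNSc", "SRMc", "WRMc", "STMU", "STMS"]

def order_for_review(tag_hits):
    """Sort (contig, position, tag) tuples by tag priority, then by location.
    Tags that are not in TAG_PRIORITY are dropped."""
    rank = {tag: index for index, tag in enumerate(TAG_PRIORITY)}
    wanted = [hit for hit in tag_hits if hit[2] in rank]
    return sorted(wanted, key=lambda hit: (rank[hit[2]], hit[0], hit[1]))

# Hypothetical hits, e.g. collected from a tag list export.
hits = [("contig2", 150, "WRMc"), ("contig1", 900, "SRMc"),
        ("contig1", 40, "IUPc"), ("contig1", 10, "OTHR")]
for contig, position, tag in order_for_review(hits):
    print(tag, contig, position)
```

      <p>
	Tags that are not in the list are simply dropped, so the output only
	contains places worth a look.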
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_454_data"></a>15.2. 
      454 data
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_do_i_need_sffs_for?"></a>15.2.1. 
	What do I need SFFs for?
      </h3></div></div></div><p>
      </p><pre class="screen">
	I need the .sff files for MIRA to load ...
      </pre><p>
      </p><p>
	Nope, you don't, but it's a common misconception. MIRA does not load SFF
	files; it loads FASTA, FASTA qualities, FASTQ, XML, CAF, EXP and PHD. The
	reason why one should start from the SFF is that those files can be used to
	create an XML file in TRACEINFO format. This XML contains the absolutely vital
	clipping information for the 454 adaptors (the sequencing vector of 454, if
	you want).
      </p><p>
	For 454 projects, MIRA will then load the FASTA, the FASTA quality file and
	the corresponding XML. Alternatively, it loads CAF if you have your data in
	CAF format.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what's_sff_extract_and_where_do_i_get_it?"></a>15.2.2. 
	What's sff_extract and where do I get it?
      </h3></div></div></div><p>
      </p><pre class="screen">
	How do I extract the sequence, quality and other values from SFFs?
      </pre><p>
      </p><p>
	Use the <span class="command"><strong>sff_extract</strong></span> script from Jose Blanca at the
	University of Valencia to extract everything you need from the SFF
	files (sequence, qualities and ancillary information). The home of
	sff_extract is <a class="ulink" href="http://bioinf.comav.upv.es/sff_extract/index.html" target="_top">http://bioinf.comav.upv.es/sff_extract/index.html</a>, but I am
	thankful to Jose for giving permission to distribute the script in the
	MIRA 3rd party package (a separate download).
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_do_i_need_the_sfftools_from_the_roche_software_package?"></a>15.2.3. 
	Do I need the sfftools from the Roche software package?
      </h3></div></div></div><p>
	No, not anymore. Use the <span class="command"><strong>sff_extract</strong></span> script to
	extract your reads. That said, the Roche sfftools package contains a few
	additional utilities which could be useful.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_combining_sffs"></a>15.2.4. 
	Combining SFFs
      </h3></div></div></div><p>
      </p><pre class="screen">
	I am trying to use MIRA to assemble reads obtained with the 454 technology
	but I can't combine my sff files since I have two files obtained with GS20
	system and 2 others obtained with the GS-FLX system. Since they use
	different cycles (42 and 100) I can't use the sfffile to combine both.
      </pre><p>
      </p><p>
	You do not need to combine SFFs before translating them into something
	MIRA (or other software tools) understands. Use
	<span class="command"><strong>sff_extract</strong></span> which extracts data from the SFF files
	and combines this into input files.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_adaptors_and_pairedend_linker_sequences"></a>15.2.5. 
	Adaptors and paired-end linker sequences
      </h3></div></div></div><p>
      </p><pre class="screen">
	I have no idea about the adaptor and the linker sequences, could you send me
	the sequences please?
      </pre><p>
      </p><p>
	Here are the sequences as filed by 454 in their patent application:
      </p><pre class="screen">
	&gt;AdaptorA
	CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG
	&gt;AdaptorB
	CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
      </pre><p>
      </p><p>
	However, looking through some earlier project data I had, I also retrieved the
	following (by simply making a consensus of sequences that did not match the
	target genome anymore):
      </p><pre class="screen">
	&gt;5prime454adaptor???
	GCCTCCCTCGCGCCATCAGATCGTAGGCACCTGAAA
	&gt;3prime454adaptor???
	GCCTTGCCAGCCCGCTCAGATTGATGGTGCCTACAG
      </pre><p>
      </p><p>
	Go figure, I have absolutely no idea where these come from as they also do not
	comply with the "tcag" ending the adaptors should have.
      </p><p>
	I currently know one linker sequence (454/Roche also calls it <span class="emphasis"><em>spacer</em></span>)
	for GS20 and FLX paired-end sequencing:
      </p><pre class="screen">
	&gt;flxlinker
	GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
      </pre><p>
      </p><p>
	For Titanium data using standard Roche protocol, you need to screen for two
	linker sequences:
      </p><pre class="screen">
	&gt;titlinker1
	TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
	&gt;titlinker2
	CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
      </pre><p>
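	When screening reads for these linkers, remember that a linker may show up
	on either strand. The following Python sketch (function names are
	hypothetical, matching is exact and therefore ignores sequencing errors)
	checks both orientations:
      </p>

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return sequence.translate(COMPLEMENT)[::-1]

FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"
TIT_LINKER_1 = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"
TIT_LINKER_2 = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"

def contains_linker(read, linkers):
    """True if the read contains any linker, exactly, on either strand."""
    read = read.upper()
    return any(linker in read or reverse_complement(linker) in read
               for linker in linkers)

print(reverse_complement(FLX_LINKER) == FLX_LINKER)
print(reverse_complement(TIT_LINKER_1) == TIT_LINKER_2)
```

      <p>
	Both comparisons print True: the GS20/FLX linker is its own reverse
	complement, and the two Titanium linkers are reverse complements of each
	other.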
      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	Some sequencing labs modify the adaptor sequences for tagging and
	similar things. Ask your sequencing provider for the exact adaptor
	and/or linker sequences.
      </td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_do_i_get_in_pairedend_sequencing?"></a>15.2.6. 
	What do I get in paired-end sequencing?
      </h3></div></div></div><p>
      </p><pre class="screen">
	Another question I have is does the read pair sequences have further
	adaptors/vectors in the forward and reverse strands?
      </pre><p>
      </p><p>
	As with normal 454 reads, the normal A and B adaptors can be present
	in paired-end reads. In theory, a read could look like this:
      </p><p>
	A-Adaptor - DNA1 - Linker - DNA2 - B-Adaptor.
      </p><p>
	It's possible that one of the two DNA fragments is *very* short or is missing
	completely, then one has something like this:
      </p><p>
	A-Adaptor - DNA1 - Linker - B-Adaptor
      </p><p>
	or
      </p><p>
	A-Adaptor - Linker - DNA2 - B-Adaptor
      </p><p>
	And then there are all intermediate possibilities with the read not having one
	of the two adaptors (or both). Though it appears that the majority of reads
	will contain the following:
      </p><p>
	DNA1 - Linker - DNA2
      </p><p>
	There is one caveat: according to current paired-end protocols, the sequences
	will <span class="bold"><strong>NOT</strong></span> have the direction
      </p><pre class="screen">
	---&gt; Linker &lt;---
      </pre><p>
	as one might expect when being used to Sanger Sequencing, but rather in this
	direction
      </p><pre class="screen">
	&lt;--- Linker ---&gt;
      </pre><p>
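	Splitting a read at the linker can be sketched as follows in Python. This
	assumes that the A and B adaptors have already been clipped and that the
	linker is present without sequencing errors; the function names are
	hypothetical and the orientation of the two parts is not handled.
      </p>

```python
FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"

def split_at_linker(read, linker):
    """Split a paired-end read into (dna1, dna2) around the first exact
    occurrence of the linker; return None if the linker is absent."""
    position = read.find(linker)
    if position == -1:
        return None
    return read[:position], read[position + len(linker):]

def describe(read, linker, min_length=1):
    """Name the layout of a read, using the notation of this section."""
    parts = split_at_linker(read, linker)
    if parts is None:
        return "no linker"
    dna1, dna2 = parts
    if len(dna1) >= min_length and len(dna2) >= min_length:
        return "DNA1 - Linker - DNA2"
    if len(dna1) >= min_length:
        return "DNA1 - Linker"
    if len(dna2) >= min_length:
        return "Linker - DNA2"
    return "linker only"

print(describe("ACGTACGT" + FLX_LINKER + "TTGACC", FLX_LINKER))
```

      <p>
	A real pipeline would have to tolerate mismatches in the linker, which an
	exact string search does not.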
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_sequencing_protocol"></a>15.2.7. 
	Sequencing protocol
      </h3></div></div></div><p>
      </p><pre class="screen">
	Is there a way I can find out which protocol was used?
      </pre><p>
      </p><p>
	Yes. The best thing to do is obviously to ask your sequencing provider.
      </p><p>
	If this is - for whatever reason - not possible, this list might help.
      </p><p>
	Are the sequences ~100-110 bases long? It's GS20.
      </p><p>
	Are the sequences ~220-250 bases long? It's FLX.
      </p><p>
	Are the sequences ~350-450 bases long? It's Titanium.
      </p><p>
	Do the sequences contain a linker
	(GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC)? It's a paired end protocol.
      </p><p>
	If the sequences left and right of the linker are ~29bp, it's the old short
	paired end (SPET, also it's most probably from a GS20). If longer, it's long
	paired-end (LPET, from a FLX).
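      </p><p>
	The checklist above can be written down as a small Python sketch. The use
	of the median read length and the cutoff of 35 bases for the short
	paired-end flanks are assumptions made for this illustration, not values
	taken from Roche documentation:
      </p>

```python
from statistics import median

LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"

def guess_platform(read_lengths):
    """Guess the 454 platform from the median read length."""
    typical = median(read_lengths)
    if 100 <= typical <= 110:
        return "GS20"
    if 220 <= typical <= 250:
        return "FLX"
    if 350 <= typical <= 450:
        return "Titanium"
    return "unknown"

def guess_paired_end(read, max_spet_flank=35):
    """Return None (no linker), 'SPET' or 'LPET' for a single read.
    max_spet_flank is an illustrative cutoff, not a MIRA or Roche value."""
    position = read.find(LINKER)
    if position == -1:
        return None
    left = position
    right = len(read) - position - len(LINKER)
    return "SPET" if max(left, right) <= max_spet_flank else "LPET"

print(guess_platform([228, 240, 251, 236]))
print(guess_paired_end("A" * 29 + LINKER + "C" * 29))
```

      <p>
	Treat the answer as a hint only; your sequencing provider remains the
	authoritative source.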
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_filtering_by_seqlen"></a>15.2.8. 
	Filtering sequences by length and re-assembly
      </h3></div></div></div><pre class="screen">
I have two datasets of ~500K sequences each and the sequencing company
already did an assembly (using MIRA) on the basecalled and fully processed
reads (using of course the accompanying *qual file). Do you suggest that I
should redo the assembly after filtering out sequences being shorter than a
certain length (e.g. those that are &lt;200bp)? In other words, am I taking into
account low quality sequences if I do the assembly the way the sequencing
company did it (fully processed reads + quality files)?
      </pre><p>
	I don't think that filtering out "shorter" reads will bring much
	positive improvement. If the sequencing company used the standard
	Roche/454 pipeline, the cut-offs for quality are already quite good,
	remaining sequences should be, even when being &lt; 200bp, not of bad
	quality, simply a bit shorter.
      </p><p>
	Worse, you might even introduce a bias when filtering out short
	sequences: chemistry and library construction being what they are
	(rather imprecise and sometimes problematic), some parts of DNA/RNA
	yield smaller sequences per se ... and filtering those out might not
	be the best move.
      </p><p>
	You might consider redoing the assembly if the company used a rather old
	version of MIRA (&lt;3.0.0 for sure, perhaps also &lt;3.0.5).
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_solexa___illumina_data"></a>15.3. 
      Solexa / Illumina data
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_can_i_see_deletions?"></a>15.3.1. 
	Can I see deletions?
      </h3></div></div></div><p>
      </p><pre class="screen">
	Suppose you ran the genome of a strain that had one or more large
	deletions. Would it be clear from the data that a deletion had occurred?
      </pre><p>
      </p><p>
	In the question above, I assume you'd compare your strain <span class="emphasis"><em>X</em></span> to a strain
	<span class="emphasis"><em>Ref</em></span> and that <span class="emphasis"><em>X</em></span> had deletions compared to
	<span class="emphasis"><em>Ref</em></span>. Furthermore, I base my answer on data sets I have seen, which
	presently were 36 and 76 mers, paired and unpaired.
      </p><p>
	Yes, this would be clear. And it's a piece of cake with MIRA.
      </p><p>
	Short deletions (1 to 10 bases): they'll be tagged SROc or WRMc.
	General rule: deletions of up to 10 to 12% of the length of your read should
	be found and tagged without problem by MIRA, above that it may or may not,
	depending a bit on coverage, indel distribution and luck.
      </p><p>
	Long deletions (longer than read length): they'll be tagged with the MCVc tag
	by MIRA in the consensus. Additionally, when you run the CAF result through
	miraconvert and look at the resulting FASTA files, long stretches of
	sequence without coverage (the @ sign in the FASTAs) of <span class="emphasis"><em>X</em></span> show missing
	genomic DNA.
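      </p><p>
	Finding these stretches in a FASTA sequence does not need special tools. A
	minimal Python sketch, where the function name and the minimum length are
	choices made for this example:
      </p>

```python
import re

def uncovered_stretches(sequence, min_length):
    """Return (start, end) pairs, 0-based and end-exclusive, for every run of
    '@' characters that is at least min_length long."""
    pattern = re.compile(r"@{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]

# Toy sequence: one run of five '@' and one run of two '@'.
sequence = "ACGT" + "@" * 5 + "AC" + "@" * 2 + "G"
print(uncovered_stretches(sequence, 3))
```

      <p>
	One possible choice for the minimum length is the read length, so that
	only stretches longer than a single read are reported.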
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_can_i_see_insertions?"></a>15.3.2. 
	Can I see insertions?
      </h3></div></div></div><p>
      </p><pre class="screen">
	Suppose you ran the genome of a strain X that had a plasmid missing from the
	reference sequence. Alternatively, suppose you ran a strain that had picked
	up a prophage or mobile element lacking in the reference. Would that
	situation be clear from the data?
      </pre><p>
      </p><p>
	Short insertions (1 to 10 bases): they'll be tagged SROc or WRMc.
	General rule: insertions of up to 10 to 12% of the length of your read should
	be found and tagged without problem by MIRA, above that it may or may not,
	depending a bit on coverage, indel distribution and luck.
      </p><p>
	Long insertions: it's a bit more work than for deletions. But if you ran a
	de-novo assembly on all reads not mapped against your reference sequence,
	chances are good you'd get good chunks of the additional DNA put together.
      </p><p>
	Once the Solexa paired-end protocol is completely rolled out and used on a
	regular basis, you would even be able to place the additional element into the
	genome (approximately).
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_denovo_assembly_with_solexa_data"></a>15.3.3. 
	De-novo assembly with Solexa data
      </h3></div></div></div><p>
      </p><pre class="screen">
	Any chance you could assemble de-novo the sequence of a from just the Solexa
	data?
      </pre><p>
      </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
	Highly opinionated answer ahead, your mileage may vary.
      </td></tr></table></div><p>
	Allow me to make a clear statement on this: maybe.
      </p><p>
	But the result would probably be nothing I would call a good
	assembly. If you used anything below 76mers, I'm highly sceptical
	towards the idea of de-novo assembly with Solexa (or ABI SOLiD) reads
	that are in the 30 to 50bp range. They're really too short for that,
	even paired end won't help you much (especially if you have library
	sizes of just 200 or 500bp). Yes, there are papers describing
	different draft assemblers (SHARCGS, EDENA, Velvet, Euler and others),
	but at the moment the results are less than thrilling to me.
      </p><p>
	If a sequencing provider came to me with N50 numbers for an
	<span class="emphasis"><em>assembled genome</em></span> in the 5-8 Kb range, I'd laugh
	in his face. Or weep. I wouldn't even dare to call this a
	'draft'. I'd just call it junk.
      </p><p>
	On the other hand, this could be enough for some purposes like, e.g.,
	getting a quick overview on the genetic baggage of a bug. Just don't
	expect a finished genome.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_hybrid_assemblies"></a>15.4. 
      Hybrid assemblies
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_are_hybrid_assemblies?"></a>15.4.1. 
	What are hybrid assemblies?
      </h3></div></div></div><p>
	Hybrid assemblies are assemblies which use reads from more than one sequencing
	technology, e.g., Sanger and 454, or 454 and Solexa, or Sanger and Solexa,
	etc.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_differences_are_there_in_hybrid_assembly_strategies?"></a>15.4.2. 
	What differences are there in hybrid assembly strategies?
      </h3></div></div></div><p>
	Basically, one can choose two routes: multi-step or all-in-one-go.
      </p><p>
	Multi-step means: assemble reads from one sequencing technology (ideally
	the one with the shorter reads like, e.g., Solexa), fragment the resulting
	contigs into pseudo-reads of the longer tech and assemble these with the real
	reads from the longer tech (like, e.g., 454). The advantage of this approach
	is that it will probably be quite a bit faster than the all-in-one-go approach.
	The disadvantage is that you lose a lot of information when using only the
	consensus sequence of the shorter read technology for the final assembly.
      </p><p>
	All-in-one-go means: use all reads in one single assembly. The advantage of
	this is that the resulting alignment will be made of true reads with a maximum
	of information contained to allow a really good finishing. The disadvantage is
	that the assembly will take longer and will need more RAM.
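      </p><p>
	The fragmentation step of the multi-step route can be sketched like this in
	Python. Read length and step size are up to you and are assumptions of the
	example; the step should be smaller than the read length so that
	neighbouring pseudo-reads overlap.
      </p>

```python
def pseudo_reads(contig, read_length, step):
    """Cut a contig consensus into overlapping pseudo-reads.

    Consecutive fragments start `step` bases apart; a final fragment is
    anchored at the contig end so that no bases are left uncovered."""
    if len(contig) <= read_length:
        return [contig]
    starts = list(range(0, len(contig) - read_length + 1, step))
    last_start = len(contig) - read_length
    if starts[-1] != last_start:
        starts.append(last_start)
    return [contig[s:s + read_length] for s in starts]

# Toy "contig" of ten letters, pseudo-reads of length 4 starting every 3 bases.
print(pseudo_reads("ABCDEFGHIJ", 4, 3))
```

      <p>
	The last fragment is anchored at the contig end, so the final bases are not
	lost when the contig length does not fit the step size.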
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_masking"></a>15.5. 
      Masking
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_should_i_mask?"></a>15.5.1. 
	Should I mask?
      </h3></div></div></div><p>
      </p><pre class="screen">
	In EST projects, do you think that the highly repetitive option will get rid
	of the repetitive sequences without going to the step of repeat masking?
      </pre><p>
      </p><p>
	For eukaryotes, yes. Please also consult the [-KS:mnr] option.
      </p><p>
	Remember: you still <span class="bold"><strong>MUST</strong></span> have sequencing vectors and adaptors
	clipped! In EST sequences the poly-A tails should also be clipped (or let
	MIRA do it).
      </p><p>
	For prokaryotes, I'm a big fan of having a first look at unmasked data.
	Just try to start MIRA without masking the data. After something like 30
	minutes, the all-vs-all comparison algorithm should be through with a first
	comparison round. grep the log for the term "megahub" ... if it doesn't
	appear, you probably don't need to mask repeats.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_how_can_i_apply_custom_masking?"></a>15.5.2. 
	How can I apply custom masking?
      </h3></div></div></div><p>
      </p><pre class="screen">
	I want to mask away some sequences in my input. How do I do that?
      </pre><p>
      </p><p>
	First, if you want to have Sanger sequencing vectors (or 454 adaptor
	sequences) "masked", please note that you should rather use ancillary data
	files (CAF, XML or EXP) and use the sequencing or quality clip options there.
      </p><p>
	Second, please make sure you have read and understood the documentation for all
	-CL parameters in the main manual, but especially -CL:mbc:mbcgs:mbcmfg:mbcmeg
	as you might want to switch it on or off or set different values depending on
	your pipeline and on your sequencing technology.
      </p><p>
	You can without problem mix your normal repeat masking pipeline with the FASTA
	or EXP input for MIRA, as long as you <span class="bold"><strong>mask</strong></span> and not <span class="bold"><strong>clip</strong></span> the
	sequence.
      </p><p>
	An example:
      </p><pre class="screen">
	&gt;E09238ARF0
	tcag GTGTCAGTGTTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
	GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
      </pre><p>
      </p><p>
	(spaces inserted just as visual helper in the example sequence, they would not
	occur in the real stuff)
      </p><p>
	The XML will contain the following clippings:
	left clip = 4    (clipping away the "tcag" which are the last four bases of the
	adaptor used by Roche)
	right clip= ~90  (clipping away the "tgctgac..." lower case sequence on the
	right side of the sequence above)
      </p><p>
	Now, on the FASTA file that was generated with reads_sff.py or with the Roche
	sff* tools, you can let run, e.g., a repeat masker. The result could look like
	this:
      </p><pre class="screen">
	&gt;E09238ARF0
	tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
	GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
      </pre><p>
      </p><p>
	The part with the Xs was masked away by your repeat masker. Now, when MIRA
	loads the FASTA, it will first apply the clippings from the XML file (they're
	still the same). Then, if the option to clip away masked areas of a read is
	switched on (-CL:mbc, which is normally on for EST projects), it will search
	for the stretches of X and internally also put clips on the sequence. In the
	example above, only the following sequence would remain as "working sequence"
	(the clipped parts would still be present, but not used for any computation):
      </p><pre class="screen">
	&gt;E09238ARF0
	...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
	GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT........................
      </pre><p>
      </p><p>
	Here you can also see the reason why your filters should <span class="bold"><strong>mask</strong></span> and not
	clip the sequence. If you change the length of the sequence, the clips in the
	XML would not be correct anymore, wrong clippings would be made, wrong
	sequence reconstructed, chaos ensues and the world would ultimately end. Or
	something.
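      </p><p>
	The way clips and masked stretches combine can be illustrated with a small
	Python sketch. It is a simplification: only a run of X that directly touches
	a clip is merged into it, whereas the real -CL:mbc option has further
	settings for boundaries and gap sizes. The function name and the convention
	for the clip values are assumptions of this example.
      </p>

```python
def working_sequence(sequence, left_clip, right_clip):
    """Show the part of a read that stays in use after clipping.

    left_clip  -- number of bases clipped off the left end
    right_clip -- number of bases from the start up to the end of the good part
    A run of X directly next to a clip is merged into that clip. Clipped
    bases are shown as '.', as in the examples of this section.
    """
    while left_clip < right_clip and sequence[left_clip] in "Xx":
        left_clip += 1
    while right_clip > left_clip and sequence[right_clip - 1] in "Xx":
        right_clip -= 1
    kept = sequence[left_clip:right_clip]
    return "." * left_clip + kept + "." * (len(sequence) - right_clip)

# Shortened toy read: adaptor end, masked stretch, good sequence, right junk.
good = "TTGACTGT" + "A" * 9 + "GTACGT"
read = "tcag" + "X" * 9 + good + "tgctgacg"
print(working_sequence(read, 4, 4 + 9 + len(good)))
```

      <p>
	The output shows the clipped and masked bases as dots and leaves only the
	working sequence readable.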
      </p><p>
	<span class="bold"><strong>IMPORTANT!</strong></span> It might be that you do not want MIRA to merge the masked
	part of your sequence with a left or right clip, but that you want to keep it
	something like DNA - masked part - DNA. In this case, consult the manual for
	the -CL:mbc switch, either switch it off or set adequate options for the
	boundaries and gap sizes.
      </p><p>
	Now, if you look at the sequence above, you will see two possible poly-A
	tails ... at least the real poly-A tail should be masked else you will get
	megahubs with all the other reads having the poly-A tail.
      </p><p>
	You have two possibilities: you mask it yourself with your own program or you
	let MIRA do the job (-CL:cpat, which should normally be on for EST projects,
	but I forgot to set the correct switch in the versions prior to 2.9.26x3, so
	you need to set it manually for 454 EST projects there).
      </p><p>
	<span class="bold"><strong>IMPORTANT!</strong></span> Never ever use two poly-A tail maskers (your own and
	the one from MIRA): you would risk masking too much. Example: assume you masked
	the above read with a poly-A masker. The result would very probably look like
	this:
      </p><pre class="screen">
	&gt;E09238ARF0
	tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
	GTCAGTAC XXXXXXXXXXXXXXXXXXXX GTACGT tgctgacgcacatgatcgtagc
      </pre><p>
      </p><p>
	And MIRA would internally make the following out of it after loading:
      </p><pre class="screen">
	&gt;E09238ARF0
	...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
	GTCAGTAC..................................................
      </pre><p>
      </p><p>
	and then apply the internal poly-A tail masker:
      </p><pre class="screen">
	&gt;E09238ARF0
	...............TTGACTGT................................................
	..........................................................
      </pre><p>
      </p><p>
	You'd be left with ... well, a fragment of your sequence.
      </p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_miscellaneous"></a>15.6. 
      Miscellaneous
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_are_megahubs?"></a>15.6.1. 
	What are megahubs?
      </h3></div></div></div><p>
      </p><pre class="screen">
	I looked in the log file and that term "megahub" you told me about appears
	pretty much everywhere. First of all, what does it mean?
      </pre><p>
      </p><p>
	Megahub is MIRA's internal term for a read that is massively repetitive
	with respect to the other reads of the project, i.e., a read that is a
	megahub connects to an insane number of other reads.
      </p><p>
	This is a clear sign that something is wrong. Or that you have a quite
	repetitive eukaryote. But most of the time it's sequencing vectors
	(Sanger), A and B adaptors or paired-end linkers (454), unmasked
	poly-A signals (EST) or non-normalised EST libraries which contain
	high amounts of housekeeping genes (always the same or nearly the
	same).
      </p><p>
	Countermeasures to take are:
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    set clips for the sequencing vectors (Sanger) or Adaptors (454)
	    either in the XML or EXP files
	  </p></li><li class="listitem"><p>
	    for ESTs, mask poly-A in your input data (or let MIRA do it with the
	    -CL:cpat parameter)
	  </p></li><li class="listitem"><p>
	    only after the above steps have been made, use
	    the [-KS:mnr] switch to let mira automatically mask nasty
	    repeats, adjust the threshold with  [-SK:rt].
	  </p></li><li class="listitem"><p>
	    if everything else fails, filter out or mask sequences yourself in the
	    input data that come from housekeeping genes or nasty repeats.
	  </p></li></ul></div><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_passes_and_loops"></a>15.6.2. 
	Passes and loops
      </h3></div></div></div><p>
      </p><pre class="screen">
	While processing some contigs with repeats i get
	"Accepting probably misassembled contig because of too many iterations."
	What is this?
      </pre><p>
      </p><p>
	That's quite normal in the first few passes of an assembly. During each pass
	(-AS:nop), contigs get built one by one. After a contig has been finished, it
	checks itself whether it can find misassemblies due to repeats (and marks
	these internally). If no misassembly, perfect, build next contig. But if yes,
	the contig requests immediate re-assembly of itself.
      </p><p>
	But this can happen only a limited number of times (governed by -AS:rbl). If
	there are still misassemblies, the contig is stored away anyway ... chances
	are good that in the next full pass of the assembler, enough knowledge has
	been gained to correctly place the reads.
      </p><p>
	So, you need to worry only if these messages still appear during the last
	pass. The positions that cause this are marked with "SRMc" tags in the
	assemblies (CAF, ACE in the result dir; and some files in the info dir).
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_debris"></a>15.6.3. 
	Debris
      </h3></div></div></div><p>
      </p><pre class="screen">
	What are the debris composed of?
      </pre><p>
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    sequences too short (after trimming)
	  </p></li><li class="listitem"><p>
	    megahubs
	  </p></li><li class="listitem"><p>
	    sequences almost completely masked by the nasty repeat masker
	    ([-KS:mnr])
	  </p></li><li class="listitem"><p>
	    singlets, i.e., reads that after an assembly pass did not align
	    into any contig (or were rejected from every contig).
	  </p></li><li class="listitem"><p>
	    sequences that form a contig with fewer reads than defined by
	    [-AS:mrpc]
	  </p></li></ul></div><p>
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_tmpf_files:_more_info_on_what_happened_during_the_assembly"></a>15.6.4. 
	Log and temporary files: more info on what happened during the assembly
      </h3></div></div></div><p>
      </p><pre class="screen">
	I do not understand why ... happened. Is there a way to find out?
      </pre><p>
	Yes. The tmp directory contains, besides temporary data, a number of
	log files with more or less readable information. While development
	versions of MIRA keep this directory after finishing, production
	versions normally delete this directory after an assembly. To keep the
	logs and temporary files in production versions as well, use
	"-OUT:rtd=no".
      </p><p>
	As MIRA also tries to save as much disk space as possible, some logs
	and temporary files are rotated (which means that old logs and tmps
	get deleted). To switch off this behaviour, use
	"-OUT:rrot=no". Beware, the size of the tmp directory will increase,
	sometimes dramatically so.
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_faq_sequence_clipping_after_load"></a>15.6.4.1. 
	  Sequence clipping after load
	</h4></div></div></div><p>
	  How MIRA clipped the reads after loading them can be found in the file
	  <code class="filename">mira_int_clippings.0.txt</code>. The entries look like this:
	</p><pre class="screen">
	  load:  minleft. U13a01d05.t1    Left: 11         -&gt; 30
	</pre><p>
	  Interpret this as: after loading, the read "U13a01d05.t1" had a left clipping
	  of eleven. The "minleft" clipping option of MIRA did not like it and set it to
	  30.
	</p><pre class="screen">
	  load:  bad seq. gnl|ti|1133527649       Shortened by 89 New right: 484
	</pre><p>
	</p><p>
	  Interpret this as: after loading, the read "gnl|ti|1133527649" was checked
	  with the "bad sequence search" clipping algorithm which determined that there
	  apparently is something dubious, so it shortened the read by 89 bases, setting
	  the new right clip to position 484.
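	</p><p>
	  If you want to collect these entries for many reads, the two line types
	  shown above can be parsed with regular expressions. A minimal Python
	  sketch, assuming the lines look exactly like the two examples (other entry
	  types are ignored; the names are hypothetical):
	</p>

```python
import re

MINLEFT = re.compile(r"load:\s+minleft\.\s+(\S+)\s+Left:\s+(\d+)\s+->\s+(\d+)")
BADSEQ = re.compile(
    r"load:\s+bad seq\.\s+(\S+)\s+Shortened by\s+(\d+)\s+New right:\s+(\d+)")

def parse_clipping_line(line):
    """Turn one log line into a dict, or None for unrecognised lines."""
    m = MINLEFT.search(line)
    if m:
        return {"read": m.group(1), "clip": "minleft",
                "old_left": int(m.group(2)), "new_left": int(m.group(3))}
    m = BADSEQ.search(line)
    if m:
        return {"read": m.group(1), "clip": "bad seq",
                "shortened_by": int(m.group(2)), "new_right": int(m.group(3))}
    return None

print(parse_clipping_line("load:  minleft. U13a01d05.t1    Left: 11         -> 30"))
```

	<p>
	  Each recognised line becomes a small dictionary, which makes it easy to,
	  e.g., list all reads that were shortened by the bad sequence search.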
	</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_platforms_and_compiling"></a>15.7. 
      Platforms and Compiling
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_windows"></a>15.7.1. 
	Windows
      </h3></div></div></div><p>
      </p><pre class="screen">
	Also, is MIRA be available on a windows platform?
      </pre><p>
      </p><p>
	As a matter of fact: it was and may be again. While I haven't done it myself,
	according to reports I got, compiling MIRA 2.9.3* in a Cygwin environment was
	actually painless. But since then BOOST and multi-threading have been included
	and I am not sure whether it is still as easy.
      </p><p>
	I'd be thankful for reports :-)
      </p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_maf"></a>Chapter 16. The MAF format</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_introduction:_why_an_own_assembly_format?">16.1. 
      Introduction: why an own assembly format?
    </a></span></dt><dt><span class="sect1"><a href="#sect1_the_maf_format">16.2. 
      The MAF format
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_basics">16.2.1. 
	Basics
      </a></span></dt><dt><span class="sect2"><a href="#sect2_reads">16.2.2. 
	Reads
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect3_simple_example">16.2.2.1. 
	  Simple example
	</a></span></dt><dt><span class="sect3"><a href="#sect3_list_of_records_for_reads">16.2.2.2. 
	  List of records for reads
	</a></span></dt><dt><span class="sect3"><a href="#sect3_interpreting_clipping_values">16.2.2.3. 
	  Interpreting clipping values
	</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect2_contigs">16.2.3. 
	Contigs
      </a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect3_simple_example_2">16.2.3.1. 
	  Simple example 2
	</a></span></dt><dt><span class="sect3"><a href="#sect3_list_of_records_for_contigs">16.2.3.2. 
	  List of records for contigs
	</a></span></dt></dl></dd></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">Design flaws travel in herds.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
  The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
  </p><pre class="screen">
mira --project=... --job=... ...</pre><p>
    For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
  </p></td></tr></table></div><p>
    This document describes the purpose and format of the MAF format, version
    1. It has been superseded by version 2, which is not described here
    (yet). But as v1 and v2 are very similar (only the notion of readgroups is
    a big change), I'll let this description live until I have time to update
    this section.
  </p><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_introduction:_why_an_own_assembly_format?"></a>16.1. 
      Introduction: why an own assembly format?
    </h2></div></div></div><p>
      I had been on the hunt for some time for a file format that allows MIRA to
      quickly save and load reads and full assemblies. There are currently a number
      of alignment file formats on the market and MIRA can read and/or write most of
      them. Why not take one of these? It turned out that all (well, the ones I
      know: ACE, BAF, CAF, CALF, EXP, FRG) have some kind of no-go 'feature' (or problem
      or bug) that makes one's life pretty difficult if one wants to write or parse
      that given file format.
    </p><p>
      What I needed for MIRA was a format that:
    </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	  is easy to parse
	</p></li><li class="listitem"><p>
	  is quick to parse
	</p></li><li class="listitem"><p>
	  contains all the information about an assembly that MIRA and many
	  finishing programs use: reads (with sequence and qualities), contigs,
	  tags, etc.
	</p></li></ol></div><p>
    </p><p>
      MAF is not the format with the smallest possible footprint (though it fares quite
      well in comparison to ACE, CAF and EXP), but as it's meant as an interchange format,
      it'll do. It can be easily indexed and does not need string lookups during
      parsing.
    </p><p>
      I took the liberty to combine many good ideas from EXP, BAF, CAF and FASTQ
      while defining the format and if anything is badly designed, it's all my
      fault.
    </p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_the_maf_format"></a>16.2. 
      The MAF format
    </h2></div></div></div><p>
      This describes version 1 of the MAF format. If the need arises, enhancements
      like meta-data about total number of contigs and reads will be implemented in the
      next version.
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_basics"></a>16.2.1. 
	Basics
      </h3></div></div></div><p>
	MAF ...
      </p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    ... has for each record a keyword at the beginning of the line, followed
	    by exactly one blank (a space or a tab), then followed by the values for
	    this record. At the moment, all keywords are two characters long, but
	    keywords with other lengths might appear in the future.
	  </p></li><li class="listitem"><p>
	    ... is strictly line oriented. Each record is terminated by a newline;
	    no record spans across lines.
	  </p></li></ol></div><p>
      </p><p>
	All coordinates start at 1, i.e., there is no 0 value for coordinates.
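      </p><p>
	The two rules above can be sketched in a few lines of Python. This is
	only an illustration, not code taken from MIRA: the function name
	<code class="literal">split_record</code> is made up for this example,
	and the sample lines use a tab as the single blank.
      </p><pre class="programlisting">
```python
import re


def split_record(line):
    """Split one MAF line into (keyword, value).

    The keyword is everything up to the first blank (space or tab); the
    value is the rest of the line and may be empty, e.g. for ER.
    """
    line = line.rstrip("\n")
    parts = re.split(r"[ \t]", line, maxsplit=1)
    keyword = parts[0]
    value = parts[1] if len(parts) == 2 else ""
    return keyword, value


print(split_record("RD\tU13a05e07.t1\n"))
print(split_record("ER\n"))
```
      </pre><p>
	Because the keyword is not assumed to have a fixed length, this sketch
	would also cope with keywords longer than two characters.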
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_reads"></a>16.2.2. 
	Reads
      </h3></div></div></div><p>
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_simple_example"></a>16.2.2.1. 
	  Simple example
	</h4></div></div></div><p>
	  Here's an example for a simple read, just the read name and the sequence:
	</p><pre class="screen">
	  RD      U13a05e07.t1
	  RS      CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
	  ER
	</pre><p>
	</p><p>
	  Reads start with RD and end with ER. The RD keyword is always followed by the
	  name of the read; ER stands on its own. Reads should also contain a sequence
	  (RS). Everything else is optional. In the following example, the read has
	  additional quality values (RQ), template definitions (name in TN, read
	  direction in DI, minimum and maximum insert size in TF and TT), a pointer
	  to the file with the raw data (SF), a left clip which covers sequencing
	  vector or adaptor sequence (SL), a left clip covering low quality (QL), a
	  right clip covering low quality (QR), a right clip covering sequencing
	  vector or adaptor sequence (SR), alignment to original sequence (AO), a
	  tag (RT) and the sequencing technology it was generated with (ST).
	</p><pre class="screen">
	  RD      U13a05e07.t1
	  RS      CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
	  RQ      ,-+*,1-+/,36;:6&lt;=3327&lt;7A1/,,).('..7=@E8:
	  TN      U13a05e07
	  DI      F
	  TF      1200
	  TT      1800
	  SF      U13a05e07.t1.scf
	  SL      4
	  QL      7
	  QR      30
	  SR      32
	  AO      1 40 1 40
	  RT      ALUS 10 15 Some comment to this read tag.
	  ST      Sanger
	  ER
	</pre><p>
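	  Reading such a record back is straightforward. The following Python
	  sketch is an illustration only, not the parser used by MIRA: the name
	  <code class="literal">parse_read</code> and the dictionary layout are
	  assumptions made for this example, and the whitespace handling is more
	  lenient than the format requires.
	</p><pre class="programlisting">
```python
def parse_read(lines):
    """Collect the records of one read, from RD up to and including ER."""
    read = {"name": None, "records": {}}
    for line in lines:
        fields = line.strip().split(None, 1)
        if not fields:
            continue  # skip empty lines
        keyword = fields[0]
        value = fields[1] if len(fields) == 2 else ""
        if keyword == "RD":
            read["name"] = value
        elif keyword == "ER":
            return read
        else:
            # Keywords such as AO or RT may occur several times per read.
            read["records"].setdefault(keyword, []).append(value)
    raise ValueError("read is not terminated by ER")


SAMPLE = """\
RD\tU13a05e07.t1
RS\tCTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
SL\t4
QL\t7
QR\t30
SR\t32
RT\tALUS 10 15 Some comment to this read tag.
ST\tSanger
ER
"""

read = parse_read(SAMPLE.splitlines())
print(read["name"], read["records"]["QL"])
```
	</pre><p>
	  All values are kept as strings here; converting clips to integers or
	  splitting tags into their parts is left to the caller.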
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_list_of_records_for_reads"></a>16.2.2.2. 
	  List of records for reads
	</h4></div></div></div><p>
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      RD <span class="emphasis"><em>string: readname</em></span>
	    </p><p> RD followed by the read name starts a read.
	    </p></li><li class="listitem"><p>
	      LR <span class="emphasis"><em>integer: read length</em></span>
	    </p><p>
	      The length of the read can optionally be given in LR. This is
	      meant to help the parser perform sanity checks and possibly
	      pre-allocate memory for sequence and quality.
	    </p><p>
	      MIRA at the moment only writes LR lines for reads with more than
	      2000 bases.
	    </p></li><li class="listitem"><p>
	      RS <span class="emphasis"><em>string: DNA sequence</em></span>
	    </p><p> Sequence of a read is stored in RS.
	    </p></li><li class="listitem"><p>
	      RQ <span class="emphasis"><em>string: qualities</em></span>
	    </p><p> Qualities are stored in FASTQ format, i.e., each quality
	    value + 33 is written as a single ASCII character.
	    </p></li><li class="listitem"><p>
	      SV <span class="emphasis"><em>string: sequencing vector</em></span>
	    </p><p> Name of the sequencing vector or
	    adaptor used in this read.
	    </p></li><li class="listitem"><p>
	      TN <span class="emphasis"><em>string: template name</em></span>
	    </p><p> Template name. This defines the DNA template a sequence
	    comes from. In its simplest form, a DNA template is sequenced
	    only once. In paired-end sequencing, a DNA template is sequenced
	    once in forward and once in reverse direction (Sanger, 454,
	    Solexa). In Sanger sequencing, several forward and/or reverse
	    reads can be sequenced from a DNA template. In PacBio sequencing,
	    a DNA template can be sequenced in several "strobes", leading to
	    multiple reads on a DNA template.
	    </p></li><li class="listitem"><p>
	      DI <span class="emphasis"><em>character: F or R</em></span>
	    </p><p> Direction of the read with respect to the
	    template. F for forward, R for reverse.
	    </p></li><li class="listitem"><p>
	      TF <span class="emphasis"><em>integer: template size from</em></span>
	    </p><p> Minimum estimated
	    size of a sequencing template. In paired-end sequencing, this is the minimum
	    distance of the read pair.
	    </p></li><li class="listitem"><p>
	      TT <span class="emphasis"><em>integer: template size to</em></span>
	    </p><p> Maximum estimated
	    size of a sequencing template. In paired-end sequencing, this is the maximum
	    distance of the read pair.
	    </p></li><li class="listitem"><p>
	      SF <span class="emphasis"><em>string: sequencing file</em></span>
	    </p><p> Name of the sequencing file which
	    contains raw data for this read.
	    </p></li><li class="listitem"><p>
	      SL <span class="emphasis"><em>integer: seqvec left</em></span>
	    </p><p>
	      Clip left due to sequencing vector. Assumed to be 1 if not
	      present. Note that left clip values are excluding, e.g.: a value
	      of '7' clips off the left 6 bases.
	    </p></li><li class="listitem"><p>
	      QL <span class="emphasis"><em>integer: qual left</em></span>
	    </p><p>
	      Clip left due to low quality. Assumed to be 1 if not
	      present. Note that left clip values are excluding, e.g.: a value
	      of '7' clips off the left 6 bases.
	    </p></li><li class="listitem"><p>
	      CL <span class="emphasis"><em>integer: clip left</em></span>
	    </p><p>
	      Clip left (any reason). Assumed to be 1 if not present. Note
	      that left clip values are excluding, e.g.: a value of '7' clips
	      off the left 6 bases.
	    </p></li><li class="listitem"><p>
	      SR <span class="emphasis"><em>integer: seqvec right</em></span>
	    </p><p> Clip right due to sequencing
	    vector. Assumed to be the length of the sequence if not present. Note that
	    right clip values are including, e.g., a value of '10' leaves the bases 1 to
	    9 and clips at and including base 10 and higher.
	    </p></li><li class="listitem"><p>
	      QR <span class="emphasis"><em>integer: qual right</em></span>
	    </p><p> Clip right due to low quality. Assumed
	    to be the length of the sequence if not present. Note that right clip values
	    are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at
	    and including base 10 and higher.
	    </p></li><li class="listitem"><p>
	      CR <span class="emphasis"><em>integer: clip right</em></span>
	    </p><p> Clip right (any reason). Assumed to be
	    the length of the sequence if not present. Note that
	    right clip values are including, e.g., a value of '10' leaves the bases 1 to
	    9 and clips at and including base 10 and higher.
	    </p></li><li class="listitem"><p>
	      AO <span class="emphasis"><em>four integers: x1 y1 x2 y2</em></span>
	    </p><p> AO stands for "Align to
	    Original". The interval [x1 y1] in the read as stored in the MAF file aligns
	    with [x2 y2] in the original, unedited read sequence. This makes it possible
	    to model insertions and deletions in the read while still being able to find
	    the correct position in the original, base-called sequence data.
	    </p><p> A read can have
	    several AO lines which together define all the edits performed to this
	    read.
	    </p><p> Assumed to be "1 x 1 x" if not present, where 'x' is the length of
	    the unclipped sequence.
	    </p></li><li class="listitem"><p>
	      RT <span class="emphasis"><em>string + 2 integers + optional string: type x1 y1 comment</em></span>
	    </p><p> A read tag is given by naming the tag type, then the
	    interval [x1 y1] of read positions that the tag spans and,
	    optionally, a comment. As MAF is strictly line oriented, newline
	    characters in the comment are encoded
	    as <code class="literal">\n</code>.
	    </p><p> If x1 &gt; y1, the tag is in reverse direction.
	    </p><p>
	      The tag type can be a free form string, though MIRA will
	      recognise and work with tag types used by the Staden gap4
	      package (and of course the MIRA tags as described in the main
	      documentation of MIRA).
	    </p></li><li class="listitem"><p>
	      ST <span class="emphasis"><em>string: sequencing technology</em></span>
	    </p><p> Currently, the following technologies
	    can be defined: Sanger, 454, Solexa, SOLiD.
	    </p></li><li class="listitem"><p>
	      SN <span class="emphasis"><em>string: strain name</em></span>
	    </p><p> Strain name of the sample that was
	    sequenced, this is a free form string.
	    </p></li><li class="listitem"><p>
	      MT <span class="emphasis"><em>string: machine type</em></span>
	    </p><p> Machine type which generated the data,
	    this is a free form string.
	    </p></li><li class="listitem"><p>
	      BC <span class="emphasis"><em>string: base caller</em></span>
	    </p><p>
	      Base calling program used to call bases
	    </p></li><li class="listitem"><p>
	      IB <span class="emphasis"><em>boolean (0 or 1): is backbone</em></span>
	    </p><p> Whether the read is a backbone. Reads used as reference
	    (backbones) in mapping assemblies get this attribute.
	    </p></li><li class="listitem"><p>
	      IC <span class="emphasis"><em>boolean (0 or 1)</em></span>
	    </p><p> Whether the read is a coverage equivalent
	    read (e.g. from mapping Solexa). This is internal to MIRA.
	    </p></li><li class="listitem"><p>
	      IR <span class="emphasis"><em>boolean (0 or 1)</em></span>
	    </p><p> Whether the read is a rail. This also is
	    internal to MIRA.
	    </p></li><li class="listitem"><p>
	      ER
	    </p><p> This ends a read and is mandatory.
	    </p></li></ul></div><p>
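	  As a small illustration of the RQ record described in the list above,
	  the following Python sketch converts between a quality string and the
	  integer quality values by applying the offset of 33. The function names
	  are made up for this example.
	</p><pre class="programlisting">
```python
def decode_qualities(rq):
    """Turn an RQ string into a list of integer quality values."""
    return [ord(ch) - 33 for ch in rq]


def encode_qualities(quals):
    """Inverse operation: one ASCII character per quality value."""
    return "".join(chr(q + 33) for q in quals)


# First four characters of the RQ line in the example read.
print(decode_qualities(",-+*"))
```
	</pre><p>
	  The same conversion applies to the CQ record of a contig, which uses
	  the same encoding.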
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_interpreting_clipping_values"></a>16.2.2.3. 
	  Interpreting clipping values
	</h4></div></div></div><p>
	  Every left and right clipping pair (SL &amp; SR, QL &amp; QR, CL &amp; CR) forms a clear
	  range in the interval [left right[ in the sequence of a read. E.g. a read with
	  SL=4 and SR=10 has the bases 1,2,3 clipped away on the left side, the bases
	  4,5,6,7,8,9 as clear range and the bases 10 and following clipped away on the
	  right side.
	</p><p>
	  The left clip of a read is determined as max(SL,QL,CL) (the rightmost left
	  clip) whereas the right clip is min(SR,QR,CR).
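	</p><p>
	  The rule can be written down in a few lines of Python. This is a sketch
	  for illustration only: the function name
	  <code class="literal">clear_range</code> is hypothetical, the caller is
	  expected to pass only the clip values that are actually present, and the
	  numbers in the second call are made up.
	</p><pre class="programlisting">
```python
def clear_range(left_clips, right_clips):
    """Return (first, last) base kept after clipping, 1-based and inclusive.

    left_clips holds the SL/QL/CL values that are present, right_clips the
    SR/QR/CR values. None is returned if no base survives the clipping.
    """
    left = max(left_clips)    # rightmost left clip
    right = min(right_clips)  # leftmost right clip
    # The clear range is the half-open interval [left, right[, which is
    # exactly what range(left, right) enumerates.
    kept = range(left, right)
    if not kept:
        return None
    return kept[0], kept[-1]


print(clear_range([4], [10]))           # the SL=4, SR=10 example from the text
print(clear_range([3, 12], [55, 60]))   # made-up values for two clip pairs
```
	</pre><p>
	  For SL=4 and SR=10 this yields the bases 4 to 9, as described above.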
	</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_contigs"></a>16.2.3. 
	Contigs
      </h3></div></div></div><p>
	Contigs are not much more than containers holding reads with some
	additional information. Contrary to CAF or ACE, MAF does not first store all reads in
	single containers and then define the contigs. In MAF, contigs are defined as
	the outer container and within those, the reads are stored like normal reads.
      </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_simple_example_2"></a>16.2.3.1. 
	  Simple example 2
	</h4></div></div></div><p>
	  The above example for a read can be encased in a contig like this (with two
	  consensus tags gratuitously added in):
	</p><pre class="screen">
	  CO      contigname_s1
	  NR      1
	  LC      24
	  CS      TGCCTGCAGGTCGACTCTAGAAGG
	  CQ      -+/,36;:6&lt;=3327&lt;7A1/,,).
	  CT      COMM 5 8 Some comment to this consensus tag.
	  CT      COMM 7 12 Another comment to this consensus tag.
	  \\
	  RD      U13a05e07.t1
	  RS      CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
	  RQ      ,-+*,1-+/,36;:6&lt;=3327&lt;7A1/,,).('..7=@E8:
	  TN      U13a05e07
	  TF      1200
	  TT      1800
	  SF      U13a05e07.t1.scf
	  SL      4
	  SR      32
	  QL      7
	  QR      30
	  AO      1 40 1 40
	  RT      ALUS 10 15 Some comment to this read tag.
	  ST      Sanger
	  ER
	  AT      1 24 7 30
	  //
	  EC
	</pre><p>
	</p><p>
	  Note that the read shown previously (and now encased in a contig) is
	  essentially unchanged (only the DI line has been left out and the order
	  of the clip records differs). It has just been complemented with a bit of
	  data which describes the contig as well as with a one-liner which places
	  the read into the contig.
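	</p><p>
	  The nesting can be followed with a very small state machine. The Python
	  sketch below is an illustration under simplifying assumptions, not the
	  parser used by MIRA: the name <code class="literal">parse_contig</code>
	  is made up, only the CO, RD, AT and EC records are interpreted, and all
	  other lines are skipped.
	</p><pre class="programlisting">
```python
def parse_contig(lines):
    """Return (contig_name, placements) for the first contig in 'lines'.

    placements is a list of (read_name, (x1, y1, x2, y2)) tuples, taken
    from the AT line that follows the ER of each read.
    """
    name = None
    placements = []
    current_read = None
    for line in lines:
        fields = line.strip().split(None, 1)
        if not fields:
            continue
        keyword = fields[0]
        value = fields[1] if len(fields) == 2 else ""
        if keyword == "CO":
            name = value
        elif keyword == "RD":
            current_read = value
        elif keyword == "AT":
            x1, y1, x2, y2 = (int(v) for v in value.split())
            placements.append((current_read, (x1, y1, x2, y2)))
        elif keyword == "EC":
            return name, placements
    raise ValueError("contig is not terminated by EC")


SAMPLE = [
    "CO\tcontigname_s1",
    "NR\t1",
    "LC\t24",
    "CS\tTGCCTGCAGGTCGACTCTAGAAGG",
    "\\\\",
    "RD\tU13a05e07.t1",
    "RS\tCTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA",
    "ER",
    "AT\t1 24 7 30",
    "//",
    "EC",
]

print(parse_contig(SAMPLE))
```
	</pre><p>
	  As the AT line directly follows the read it belongs to, the sketch only
	  has to remember the most recent read name and never needs to look up a
	  read by name.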
	</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_list_of_records_for_contigs"></a>16.2.3.2. 
	  List of records for contigs
	</h4></div></div></div><p>
	</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	      CO <span class="emphasis"><em>string: contig name</em></span>
	    </p><p> CO starts a contig, the contig name
	    behind is mandatory but can be any string, including numbers.
	    </p></li><li class="listitem"><p>
	      NR <span class="emphasis"><em>integer: num reads in contig</em></span>
	    </p><p> This is optional but highly
	    recommended.
	    </p></li><li class="listitem"><p>
	      LC <span class="emphasis"><em>integer: contig length</em></span>
	    </p><p> Note that this length defines the length of the 'clear
	    range' of the consensus. It is 100% equal to the length of the CS
	    (sequence) and CQ (quality) strings below.
	    </p></li><li class="listitem"><p>
	      CT <span class="emphasis"><em>string + 2 integers + optional string: identifier
	      x1 y1 comment</em></span>
	    </p><p> Consensus tags are defined like read tags but apply to the
	    consensus. Here too, the interval [x1 y1] is including and if x1 &gt; y1, the tag
	    is in reverse direction.
	    </p></li><li class="listitem"><p>
	      CS <span class="emphasis"><em>string: consensus sequence</em></span>
	    </p><p> Sequence of a consensus is stored in CS.
	    </p></li><li class="listitem"><p>
	      CQ <span class="emphasis"><em>string: qualities</em></span>
	    </p><p> Consensus Qualities are stored in FASTQ
	    format, i.e., each quality value + 33 is written as a single ASCII character.
	    </p></li><li class="listitem"><p>
	      \\
	    </p><p> This marks the start of read data of this contig. After
	    this, all reads are stored one after the other, just separated by
	    an "AT" line (see below).
	    </p></li><li class="listitem"><p>
	      AT <span class="emphasis"><em>Four integers: x1 y1 x2 y2</em></span>
	    </p><p> The AT (Assemble_To) line defines the placement of the read
	    in the contig and immediately follows the closing "ER" of a read
	    so that parsers do not need to perform time consuming string
	    lookups. Every read in a contig has exactly one AT line.
	    </p><p> The interval
	    [x2 y2] of the read (i.e., the unclipped data, also called the 'clear range')
	    aligns with the interval [x1 y1] of the contig. If x1 &gt; y1 (the contig
	    positions), then the reverse complement of the read is aligned to the
	    contig. For the read positions, x2 is always &lt; y2.
	    </p></li><li class="listitem"><p>
	      //
	    </p><p> This marks the end of read data
	    </p></li><li class="listitem"><p>
	      EC
	    </p><p> This ends a contig and is mandatory
	    </p></li></ul></div></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_logfiles"></a>Chapter 17. Log and temporary files used by MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email">&lt;<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>&gt;</code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_logf_introduction">17.1. 
      Introduction
    </a></span></dt><dt><span class="sect1"><a href="#sect1_logf_the_files">17.2. 
      The files
    </a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_logf_mira_error_reads_invalid">17.2.1. 
	mira_error_reads_invalid
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_info_reads_tooshort">17.2.2. 
	mira_info_reads_tooshort
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_alignextends_preassembly10txt">17.2.3. 
	mira_int_alignextends_preassembly1.0.txt
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_clippings0txt">17.2.4. 
	mira_int_clippings.0.txt
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_megahubs_passxlst">17.2.5. 
	mira_int_posmatch_megahubs_pass.X.lst
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_multicopystat_preassembly0txt">17.2.6. 
	mira_int_posmatch_multicopystat_preassembly.0.txt
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_rawhashhits_passxlst">17.2.7. 
	mira_int_posmatch_rawhashhits_pass.X.lst
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_skimmarknastyrepeats_hist_passxlst">17.2.8. 
	mira_int_skimmarknastyrepeats_hist_pass.X.lst
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_skimmarknastyrepeats_nastyseq_passxlst">17.2.9. 
	mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_vectorclip_passxtxt">17.2.10. 
	mira_int_vectorclip_pass.X.txt
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpads_passxforward_and_miratmpads_passxcomplement">17.2.11. 
	miratmp.ads_pass.X.forward and miratmp.ads_pass.X.complement
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpads_passxreject">17.2.12. 
	miratmp.ads_pass.X.reject
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpnoqualities">17.2.13. 
	miratmp.noqualities
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpusedids">17.2.14. 
	miratmp.usedids
      </a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_readpoolinfolst">17.2.15. 
	mira_readpoolinfo.lst
      </a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
      <span class="emphasis"><em><span class="quote">&#8220;<span class="quote">The amount of entropy in the universe is constant - except when it increases.
      </span>&#8221;</span></em></span>
    </p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
  The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
  </p><pre class="screen">
mira --project=... --job=... ...</pre><p>
    For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
  </p></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_logf_introduction"></a>17.1. 
      Introduction
    </h2></div></div></div><p>
      The tmp directory used by mira (usually
      <code class="filename">&lt;projectname&gt;_d_tmp</code>) may contain a number of
      files with information which could be interesting for other uses than
      the pure assembly. This guide gives a short overview.
    </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      This guide is probably the least complete and most out-of-date as it is
      updated only very infrequently. If in doubt, ask on the MIRA talk
      mailing list.
    </td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
      Please note that the format of these files may change over time,
      although I try very hard to keep changes reduced to a minimum.
    </td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
      Remember that mira has two options that control whether log and
      temporary files get deleted: while [-OUT:rtd] removes the
      complete tmp directory after an assembly,  [-OUT:rrot] removes
      only those log and temporary files which are not needed anymore for the
      continuation of the assembly. Setting both options to <span class="underline">no</span> will keep all log and temporary files.
    </td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_logf_the_files"></a>17.2. 
      The files
    </h2></div></div></div><p>
    </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_error_reads_invalid"></a>17.2.1. 
	mira_error_reads_invalid
      </h3></div></div></div><p>
	A simple list of those reads that were invalid (no sequence or similar
	problems).
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_info_reads_tooshort"></a>17.2.2. 
	mira_info_reads_tooshort
      </h3></div></div></div><p>
	A simple list of those reads that were sorted out because the unclipped
	sequence was too short as defined by [-AS:mrl].
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_alignextends_preassembly10txt"></a>17.2.3. 
	mira_int_alignextends_preassembly1.0.txt
      </h3></div></div></div><p>
	If read extension is used ([-DP:ure]), this file contains the read
	name and the number of bases by which the right clipping was extended.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_clippings0txt"></a>17.2.4. 
	mira_int_clippings.0.txt
      </h3></div></div></div><p>
	If any of the [-CL:] options leads to the clipping of a read, this
	file will tell when, which clipping, which read and by how much (or to where)
	the clippings were set.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_megahubs_passxlst"></a>17.2.5. 
	mira_int_posmatch_megahubs_pass.X.lst
      </h3></div></div></div><p>
	Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Should any read be
	categorised as megahub during the all-against-all search (SKIM3), this file
	will tell you which.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_multicopystat_preassembly0txt"></a>17.2.6. 
	mira_int_posmatch_multicopystat_preassembly.0.txt
      </h3></div></div></div><p>
	After the initial all-against-all search (SKIM3), this file tells you with how
	many other reads each read overlaps. Furthermore, reads that have more
	overlaps than expected are tagged with "mc" (multicopy).
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_rawhashhits_passxlst"></a>17.2.7. 
	mira_int_posmatch_rawhashhits_pass.X.lst
      </h3></div></div></div><p>
	Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Similar to
	<code class="filename">mira_int_posmatch_multicopystat_preassembly.0.txt</code>, this counts the
	kmer hits of each read to other reads, but this time per pass.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_skimmarknastyrepeats_hist_passxlst"></a>17.2.8. 
	mira_int_skimmarknastyrepeats_hist_pass.X.lst
      </h3></div></div></div><p>
	Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
	 [-KS:mnr] is set to <span class="underline">yes</span>. This file contains a
	histogram of kmer occurrences encountered by SKIM3.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_skimmarknastyrepeats_nastyseq_passxlst"></a>17.2.9. 
	mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
      </h3></div></div></div><p>
	Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira.  Only written if
	 [-KS:mnr] is set to <span class="underline">yes</span>. One of the more interesting
	files if you want to know which repetitive sequences cause the assembly to be
	really difficult: for each masked part of a read, the masked sequence is
	shown here.
      </p><p>
	E.g.
      </p><pre class="screen">
	U13a04h11.t1    TATATATATATATATATATATATA
	U13a05b01.t1    TATATATATATATATATATATATA
	U13a05c07.t1    AAAAAAAAAAAAAAA
	U13a05e12.t1    CTCTCTCTCTCTCTCTCTCTCTCTCTCTC
      </pre><p>
	Simple repeats like the ones shown above will certainly pop up there,
	but a few other sequences (e.g. rDNA/rRNA, or SINEs and LINEs in
	eukaryotes) will also appear.
      </p><p>
	A nifty trick if you want a more compact overview: sort by the second
	column and keep only one line per distinct sequence.
      </p><pre class="screen">
	sort -k 2 -u mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
      </pre><p>
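	A related sketch, not part of MIRA itself: to see how many reads carry
	each masked sequence, rather than only the list of distinct sequences, the
	second column can be counted instead of merely unified. The helper name
	count_nasty_seqs is made up for this illustration, and the file is assumed
	to have exactly the two whitespace-separated columns shown in the example
	above.
      </p>

```shell
# Hypothetical helper (not shipped with MIRA): count how often each masked
# sequence occurs in a nastyseq file, most frequent first.
# Assumption: two whitespace-separated columns (read name, masked sequence).
count_nasty_seqs() {
    # Keep only the sequence column, group identical sequences, count them,
    # then order by count (descending) and by sequence to break ties.
    awk 'NF == 2 { print $2 }' "$1" | sort | uniq -c | sort -k1,1nr -k2,2
}

# Replace X by the pass number, as for the sort command above:
# count_nasty_seqs mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
```

      <p>
	Each output line holds a count followed by the sequence, so the repeats
	that were masked in the most reads end up at the top of the list.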
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_vectorclip_passxtxt"></a>17.2.10. 
	mira_int_vectorclip_pass.X.txt
      </h3></div></div></div><p>
	Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
	 [-CL:pvlc] is set to <span class="underline">yes</span>. Tells you where possible
	sequencing vector (or adaptor) leftovers were found and clipped (or not
	clipped).
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpads_passxforward_and_miratmpads_passxcomplement"></a>17.2.11. 
	miratmp.ads_pass.X.forward and miratmp.ads_pass.X.complement
      </h3></div></div></div><p>
	Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. These files list
	which read aligns with Smith-Waterman against which other read, for the
	'forward-forward' and 'forward-complement' orientations respectively.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpads_passxreject"></a>17.2.12. 
	miratmp.ads_pass.X.reject
      </h3></div></div></div><p>
	Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Which possible read
	overlaps failed the Smith-Waterman alignment check.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpnoqualities"></a>17.2.13. 
	miratmp.noqualities
      </h3></div></div></div><p>
	Which reads went completely without qualities into the assembly.
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpusedids"></a>17.2.14. 
	miratmp.usedids
      </h3></div></div></div><p>
	Which reads effectively went into the assembly (after clipping etc.).
      </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_readpoolinfolst"></a>17.2.15. 
	mira_readpoolinfo.lst
      </h3></div></div></div></div></div></div></div></body></html>