/usr/share/doc/mira-assembler/DefinitiveGuideToMIRA.html is in mira-doc 4.9.6-3build2.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
6985
6986
6987
6988
6989
6990
6991
6992
6993
6994
6995
6996
6997
6998
6999
7000
7001
7002
7003
7004
7005
7006
7007
7008
7009
7010
7011
7012
7013
7014
7015
7016
7017
7018
7019
7020
7021
7022
7023
7024
7025
7026
7027
7028
7029
7030
7031
7032
7033
7034
7035
7036
7037
7038
7039
7040
7041
7042
7043
7044
7045
7046
7047
7048
7049
7050
7051
7052
7053
7054
7055
7056
7057
7058
7059
7060
7061
7062
7063
7064
7065
7066
7067
7068
7069
7070
7071
7072
7073
7074
7075
7076
7077
7078
7079
7080
7081
7082
7083
7084
7085
7086
7087
7088
7089
7090
7091
7092
7093
7094
7095
7096
7097
7098
7099
7100
7101
7102
7103
7104
7105
7106
7107
7108
7109
7110
7111
7112
7113
7114
7115
7116
7117
7118
7119
7120
7121
7122
7123
7124
7125
7126
7127
7128
7129
7130
7131
7132
7133
7134
7135
7136
7137
7138
7139
7140
7141
7142
7143
7144
7145
7146
7147
7148
7149
7150
7151
7152
7153
7154
7155
7156
7157
7158
7159
7160
7161
7162
7163
7164
7165
7166
7167
7168
7169
7170
7171
7172
7173
7174
7175
7176
7177
7178
7179
7180
7181
7182
7183
7184
7185
7186
7187
7188
7189
7190
7191
7192
7193
7194
7195
7196
7197
7198
7199
7200
7201
7202
7203
7204
7205
7206
7207
7208
7209
7210
7211
7212
7213
7214
7215
7216
7217
7218
7219
7220
7221
7222
7223
7224
7225
7226
7227
7228
7229
7230
7231
7232
7233
7234
7235
7236
7237
7238
7239
7240
7241
7242
7243
7244
7245
7246
7247
7248
7249
7250
7251
7252
7253
7254
7255
7256
7257
7258
7259
7260
7261
7262
7263
7264
7265
7266
7267
7268
7269
7270
7271
7272
7273
7274
7275
7276
7277
7278
7279
7280
7281
7282
7283
7284
7285
7286
7287
7288
7289
7290
7291
7292
7293
7294
7295
7296
7297
7298
7299
7300
7301
7302
7303
7304
7305
7306
7307
7308
7309
7310
7311
7312
7313
7314
7315
7316
7317
7318
7319
7320
7321
7322
7323
7324
7325
7326
7327
7328
7329
7330
7331
7332
7333
7334
7335
7336
7337
7338
7339
7340
7341
7342
7343
7344
7345
7346
7347
7348
7349
7350
7351
7352
7353
7354
7355
7356
7357
7358
7359
7360
7361
7362
7363
7364
7365
7366
7367
7368
7369
7370
7371
7372
7373
7374
7375
7376
7377
7378
7379
7380
7381
7382
7383
7384
7385
7386
7387
7388
7389
7390
7391
7392
7393
7394
7395
7396
7397
7398
7399
7400
7401
7402
7403
7404
7405
7406
7407
7408
7409
7410
7411
7412
7413
7414
7415
7416
7417
7418
7419
7420
7421
7422
7423
7424
7425
7426
7427
7428
7429
7430
7431
7432
7433
7434
7435
7436
7437
7438
7439
7440
7441
7442
7443
7444
7445
7446
7447
7448
7449
7450
7451
7452
7453
7454
7455
7456
7457
7458
7459
7460
7461
7462
7463
7464
7465
7466
7467
7468
7469
7470
7471
7472
7473
7474
7475
7476
7477
7478
7479
7480
7481
7482
7483
7484
7485
7486
7487
7488
7489
7490
7491
7492
7493
7494
7495
7496
7497
7498
7499
7500
7501
7502
7503
7504
7505
7506
7507
7508
7509
7510
7511
7512
7513
7514
7515
7516
7517
7518
7519
7520
7521
7522
7523
7524
7525
7526
7527
7528
7529
7530
7531
7532
7533
7534
7535
7536
7537
7538
7539
7540
7541
7542
7543
7544
7545
7546
7547
7548
7549
7550
7551
7552
7553
7554
7555
7556
7557
7558
7559
7560
7561
7562
7563
7564
7565
7566
7567
7568
7569
7570
7571
7572
7573
7574
7575
7576
7577
7578
7579
7580
7581
7582
7583
7584
7585
7586
7587
7588
7589
7590
7591
7592
7593
7594
7595
7596
7597
7598
7599
7600
7601
7602
7603
7604
7605
7606
7607
7608
7609
7610
7611
7612
7613
7614
7615
7616
7617
7618
7619
7620
7621
7622
7623
7624
7625
7626
7627
7628
7629
7630
7631
7632
7633
7634
7635
7636
7637
7638
7639
7640
7641
7642
7643
7644
7645
7646
7647
7648
7649
7650
7651
7652
7653
7654
7655
7656
7657
7658
7659
7660
7661
7662
7663
7664
7665
7666
7667
7668
7669
7670
7671
7672
7673
7674
7675
7676
7677
7678
7679
7680
7681
7682
7683
7684
7685
7686
7687
7688
7689
7690
7691
7692
7693
7694
7695
7696
7697
7698
7699
7700
7701
7702
7703
7704
7705
7706
7707
7708
7709
7710
7711
7712
7713
7714
7715
7716
7717
7718
7719
7720
7721
7722
7723
7724
7725
7726
7727
7728
7729
7730
7731
7732
7733
7734
7735
7736
7737
7738
7739
7740
7741
7742
7743
7744
7745
7746
7747
7748
7749
7750
7751
7752
7753
7754
7755
7756
7757
7758
7759
7760
7761
7762
7763
7764
7765
7766
7767
7768
7769
7770
7771
7772
7773
7774
7775
7776
7777
7778
7779
7780
7781
7782
7783
7784
7785
7786
7787
7788
7789
7790
7791
7792
7793
7794
7795
7796
7797
7798
7799
7800
7801
7802
7803
7804
7805
7806
7807
7808
7809
7810
7811
7812
7813
7814
7815
7816
7817
7818
7819
7820
7821
7822
7823
7824
7825
7826
7827
7828
7829
7830
7831
7832
7833
7834
7835
7836
7837
7838
7839
7840
7841
7842
7843
7844
7845
7846
7847
7848
7849
7850
7851
7852
7853
7854
7855
7856
7857
7858
7859
7860
7861
7862
7863
7864
7865
7866
7867
7868
7869
7870
7871
7872
7873
7874
7875
7876
7877
7878
7879
7880
7881
7882
7883
7884
7885
7886
7887
7888
7889
7890
7891
7892
7893
7894
7895
7896
7897
7898
7899
7900
7901
7902
7903
7904
7905
7906
7907
7908
7909
7910
7911
7912
7913
7914
7915
7916
7917
7918
7919
7920
7921
7922
7923
7924
7925
7926
7927
7928
7929
7930
7931
7932
7933
7934
7935
7936
7937
7938
7939
7940
7941
7942
7943
7944
7945
7946
7947
7948
7949
7950
7951
7952
7953
7954
7955
7956
7957
7958
7959
7960
7961
7962
7963
7964
7965
7966
7967
7968
7969
7970
7971
7972
7973
7974
7975
7976
7977
7978
7979
7980
7981
7982
7983
7984
7985
7986
7987
7988
7989
7990
7991
7992
7993
7994
7995
7996
7997
7998
7999
8000
8001
8002
8003
8004
8005
8006
8007
8008
8009
8010
8011
8012
8013
8014
8015
8016
8017
8018
8019
8020
8021
8022
8023
8024
8025
8026
8027
8028
8029
8030
8031
8032
8033
8034
8035
8036
8037
8038
8039
8040
8041
8042
8043
8044
8045
8046
8047
8048
8049
8050
8051
8052
8053
8054
8055
8056
8057
8058
8059
8060
8061
8062
8063
8064
8065
8066
8067
8068
8069
8070
8071
8072
8073
8074
8075
8076
8077
8078
8079
8080
8081
8082
8083
8084
8085
8086
8087
8088
8089
8090
8091
8092
8093
8094
8095
8096
8097
8098
8099
8100
8101
8102
8103
8104
8105
8106
8107
8108
8109
8110
8111
8112
8113
8114
8115
8116
8117
8118
8119
8120
8121
8122
8123
8124
8125
8126
8127
8128
8129
8130
8131
8132
8133
8134
8135
8136
8137
8138
8139
8140
8141
8142
8143
8144
8145
8146
8147
8148
8149
8150
8151
8152
8153
8154
8155
8156
8157
8158
8159
8160
8161
8162
8163
8164
8165
8166
8167
8168
8169
8170
8171
8172
8173
8174
8175
8176
8177
8178
8179
8180
8181
8182
8183
8184
8185
8186
8187
8188
8189
8190
8191
8192
8193
8194
8195
8196
8197
8198
8199
8200
8201
8202
8203
8204
8205
8206
8207
8208
8209
8210
8211
8212
8213
8214
8215
8216
8217
8218
8219
8220
8221
8222
8223
8224
8225
8226
8227
8228
8229
8230
8231
8232
8233
8234
8235
8236
8237
8238
8239
8240
8241
8242
8243
8244
8245
8246
8247
8248
8249
8250
8251
8252
8253
8254
8255
8256
8257
8258
8259
8260
8261
8262
8263
8264
8265
8266
8267
8268
8269
8270
8271
8272
8273
8274
8275
8276
8277
8278
8279
8280
8281
8282
8283
8284
8285
8286
8287
8288
8289
8290
8291
8292
8293
8294
8295
8296
8297
8298
8299
8300
8301
8302
8303
8304
8305
8306
8307
8308
8309
8310
8311
8312
8313
8314
8315
8316
8317
8318
8319
8320
8321
8322
8323
8324
8325
8326
8327
8328
8329
8330
8331
8332
8333
8334
8335
8336
8337
8338
8339
8340
8341
8342
8343
8344
8345
8346
8347
8348
8349
8350
8351
8352
8353
8354
8355
8356
8357
8358
8359
8360
8361
8362
8363
8364
8365
8366
8367
8368
8369
8370
8371
8372
8373
8374
8375
8376
8377
8378
8379
8380
8381
8382
8383
8384
8385
8386
8387
8388
8389
8390
8391
8392
8393
8394
8395
8396
8397
8398
8399
8400
8401
8402
8403
8404
8405
8406
8407
8408
8409
8410
8411
8412
8413
8414
8415
8416
8417
8418
8419
8420
8421
8422
8423
8424
8425
8426
8427
8428
8429
8430
8431
8432
8433
8434
8435
8436
8437
8438
8439
8440
8441
8442
8443
8444
8445
8446
8447
8448
8449
8450
8451
8452
8453
8454
8455
8456
8457
8458
8459
8460
8461
8462
8463
8464
8465
8466
8467
8468
8469
8470
8471
8472
8473
8474
8475
8476
8477
8478
8479
8480
8481
8482
8483
8484
8485
8486
8487
8488
8489
8490
8491
8492
8493
8494
8495
8496
8497
8498
8499
8500
8501
8502
8503
8504
8505
8506
8507
8508
8509
8510
8511
8512
8513
8514
8515
8516
8517
8518
8519
8520
8521
8522
8523
8524
8525
8526
8527
8528
8529
8530
8531
8532
8533
8534
8535
8536
8537
8538
8539
8540
8541
8542
8543
8544
8545
8546
8547
8548
8549
8550
8551
8552
8553
8554
8555
8556
8557
8558
8559
8560
8561
8562
8563
8564
8565
8566
8567
8568
8569
8570
8571
8572
8573
8574
8575
8576
8577
8578
8579
8580
8581
8582
8583
8584
8585
8586
8587
8588
8589
8590
8591
8592
8593
8594
8595
8596
8597
8598
8599
8600
8601
8602
8603
8604
8605
8606
8607
8608
8609
8610
8611
8612
8613
8614
8615
8616
8617
8618
8619
8620
8621
8622
8623
8624
8625
8626
8627
8628
8629
8630
8631
8632
8633
8634
8635
8636
8637
8638
8639
8640
8641
8642
8643
8644
8645
8646
8647
8648
8649
8650
8651
8652
8653
8654
8655
8656
8657
8658
8659
8660
8661
8662
8663
8664
8665
8666
8667
8668
8669
8670
8671
8672
8673
8674
8675
8676
8677
8678
8679
8680
8681
8682
8683
8684
8685
8686
8687
8688
8689
8690
8691
8692
8693
8694
8695
8696
8697
8698
8699
8700
8701
8702
8703
8704
8705
8706
8707
8708
8709
8710
8711
8712
8713
8714
8715
8716
8717
8718
8719
8720
8721
8722
8723
8724
8725
8726
8727
8728
8729
8730
8731
8732
8733
8734
8735
8736
8737
8738
8739
8740
8741
8742
8743
8744
8745
8746
8747
8748
8749
8750
8751
8752
8753
8754
8755
8756
8757
8758
8759
8760
8761
8762
8763
8764
8765
8766
8767
8768
8769
8770
8771
8772
8773
8774
8775
8776
8777
8778
8779
8780
8781
8782
8783
8784
8785
8786
8787
8788
8789
8790
8791
8792
8793
8794
8795
8796
8797
8798
8799
8800
8801
8802
8803
8804
8805
8806
8807
8808
8809
8810
8811
8812
8813
8814
8815
8816
8817
8818
8819
8820
8821
8822
8823
8824
8825
8826
8827
8828
8829
8830
8831
8832
8833
8834
8835
8836
8837
8838
8839
8840
8841
8842
8843
8844
8845
8846
8847
8848
8849
8850
8851
8852
8853
8854
8855
8856
8857
8858
8859
8860
8861
8862
8863
8864
8865
8866
8867
8868
8869
8870
8871
8872
8873
8874
8875
8876
8877
8878
8879
8880
8881
8882
8883
8884
8885
8886
8887
8888
8889
8890
8891
8892
8893
8894
8895
8896
8897
8898
8899
8900
8901
8902
8903
8904
8905
8906
8907
8908
8909
8910
8911
8912
8913
8914
8915
8916
8917
8918
8919
8920
8921
8922
8923
8924
8925
8926
8927
8928
8929
8930
8931
8932
8933
8934
8935
8936
8937
8938
8939
8940
8941
8942
8943
8944
8945
8946
8947
8948
8949
8950
8951
8952
8953
8954
8955
8956
8957
8958
8959
8960
8961
8962
8963
8964
8965
8966
8967
8968
8969
8970
8971
8972
8973
8974
8975
8976
8977
8978
8979
8980
8981
8982
8983
8984
8985
8986
8987
8988
8989
8990
8991
8992
8993
8994
8995
8996
8997
8998
8999
9000
9001
9002
9003
9004
9005
9006
9007
9008
9009
9010
9011
9012
9013
9014
9015
9016
9017
9018
9019
9020
9021
9022
9023
9024
9025
9026
9027
9028
9029
9030
9031
9032
9033
9034
9035
9036
9037
9038
9039
9040
9041
9042
9043
9044
9045
9046
9047
9048
9049
9050
9051
9052
9053
9054
9055
9056
9057
9058
9059
9060
9061
9062
9063
9064
9065
9066
9067
9068
9069
9070
9071
9072
9073
9074
9075
9076
9077
9078
9079
9080
9081
9082
9083
9084
9085
9086
9087
9088
9089
9090
9091
9092
9093
9094
9095
9096
9097
9098
9099
9100
9101
9102
9103
9104
9105
9106
9107
9108
9109
9110
9111
9112
9113
9114
9115
9116
9117
9118
9119
9120
9121
9122
9123
9124
9125
9126
9127
9128
9129
9130
9131
9132
9133
9134
9135
9136
9137
9138
9139
9140
9141
9142
9143
9144
9145
9146
9147
9148
9149
9150
9151
9152
9153
9154
9155
9156
9157
9158
9159
9160
9161
9162
9163
9164
9165
9166
9167
9168
9169
9170
9171
9172
9173
9174
9175
9176
9177
9178
9179
9180
9181
9182
9183
9184
9185
9186
9187
9188
9189
9190
9191
9192
9193
9194
9195
9196
9197
9198
9199
9200
9201
9202
9203
9204
9205
9206
9207
9208
9209
9210
9211
9212
9213
9214
9215
9216
9217
9218
9219
9220
9221
9222
9223
9224
9225
9226
9227
9228
9229
9230
9231
9232
9233
9234
9235
9236
9237
9238
9239
9240
9241
9242
9243
9244
9245
9246
9247
9248
9249
9250
9251
9252
9253
9254
9255
9256
9257
9258
9259
9260
9261
9262
9263
9264
9265
9266
9267
9268
9269
9270
9271
9272
9273
9274
9275
9276
9277
9278
9279
9280
9281
9282
9283
9284
9285
9286
9287
9288
9289
9290
9291
9292
9293
9294
9295
9296
9297
9298
9299
9300
9301
9302
9303
9304
9305
9306
9307
9308
9309
9310
9311
9312
9313
9314
9315
9316
9317
9318
9319
9320
9321
9322
9323
9324
9325
9326
9327
9328
9329
9330
9331
9332
9333
9334
9335
9336
9337
9338
9339
9340
9341
9342
9343
9344
9345
9346
9347
9348
9349
9350
9351
9352
9353
9354
9355
9356
9357
9358
9359
9360
9361
9362
9363
9364
9365
9366
9367
9368
9369
9370
9371
9372
9373
9374
9375
9376
9377
9378
9379
9380
9381
9382
9383
9384
9385
9386
9387
9388
9389
9390
9391
9392
9393
9394
9395
9396
9397
9398
9399
9400
9401
9402
9403
9404
9405
9406
9407
9408
9409
9410
9411
9412
9413
9414
9415
9416
9417
9418
9419
9420
9421
9422
9423
9424
9425
9426
9427
9428
9429
9430
9431
9432
9433
9434
9435
9436
9437
9438
9439
9440
9441
9442
9443
9444
9445
9446
9447
9448
9449
9450
9451
9452
9453
9454
9455
9456
9457
9458
9459
9460
9461
9462
9463
9464
9465
9466
9467
9468
9469
9470
9471
9472
9473
9474
9475
9476
9477
9478
9479
9480
9481
9482
9483
9484
9485
9486
9487
9488
9489
9490
9491
9492
9493
9494
9495
9496
9497
9498
9499
9500
9501
9502
9503
9504
9505
9506
9507
9508
9509
9510
9511
9512
9513
9514
9515
9516
9517
9518
9519
9520
9521
9522
9523
9524
9525
9526
9527
9528
9529
9530
9531
9532
9533
9534
9535
9536
9537
9538
9539
9540
9541
9542
9543
9544
9545
9546
9547
9548
9549
9550
9551
9552
9553
9554
9555
9556
9557
9558
9559
9560
9561
9562
9563
9564
9565
9566
9567
9568
9569
9570
9571
9572
9573
9574
9575
9576
9577
9578
9579
9580
9581
9582
9583
9584
9585
9586
9587
9588
9589
9590
9591
9592
9593
9594
9595
9596
9597
9598
9599
9600
9601
9602
9603
9604
9605
9606
9607
9608
9609
9610
9611
9612
9613
9614
9615
9616
9617
9618
9619
9620
9621
9622
9623
9624
9625
9626
9627
9628
9629
9630
9631
9632
9633
9634
9635
9636
9637
9638
9639
9640
9641
9642
9643
9644
9645
9646
9647
9648
9649
9650
9651
9652
9653
9654
9655
9656
9657
9658
9659
9660
9661
9662
9663
9664
9665
9666
9667
9668
9669
9670
9671
9672
9673
9674
9675
9676
9677
9678
9679
9680
9681
9682
9683
9684
9685
9686
9687
9688
9689
9690
9691
9692
9693
9694
9695
9696
9697
9698
9699
9700
9701
9702
9703
9704
9705
9706
9707
9708
9709
9710
9711
9712
9713
9714
9715
9716
9717
9718
9719
9720
9721
9722
9723
9724
9725
9726
9727
9728
9729
9730
9731
9732
9733
9734
9735
9736
9737
9738
9739
9740
9741
9742
9743
9744
9745
9746
9747
9748
9749
9750
9751
9752
9753
9754
9755
9756
9757
9758
9759
9760
9761
9762
9763
9764
9765
9766
9767
9768
9769
9770
9771
9772
9773
9774
9775
9776
9777
9778
9779
9780
9781
9782
9783
9784
9785
9786
9787
9788
9789
9790
9791
9792
9793
9794
9795
9796
9797
9798
9799
9800
9801
9802
9803
9804
9805
9806
9807
9808
9809
9810
9811
9812
9813
9814
9815
9816
9817
9818
9819
9820
9821
9822
9823
9824
9825
9826
9827
9828
9829
9830
9831
9832
9833
9834
9835
9836
9837
9838
9839
9840
9841
9842
9843
9844
9845
9846
9847
9848
9849
9850
9851
9852
9853
9854
9855
9856
9857
9858
9859
9860
9861
9862
9863
9864
9865
9866
9867
9868
9869
9870
9871
9872
9873
9874
9875
9876
9877
9878
9879
9880
9881
9882
9883
9884
9885
9886
9887
9888
9889
9890
9891
9892
9893
9894
9895
9896
9897
9898
9899
9900
9901
9902
9903
9904
9905
9906
9907
9908
9909
9910
9911
9912
9913
9914
9915
9916
9917
9918
9919
9920
9921
9922
9923
9924
9925
9926
9927
9928
9929
9930
9931
9932
9933
9934
9935
9936
9937
9938
9939
9940
9941
9942
9943
9944
9945
9946
9947
9948
9949
9950
9951
9952
9953
9954
9955
9956
9957
9958
9959
9960
9961
9962
9963
9964
9965
9966
9967
9968
9969
9970
9971
9972
9973
9974
9975
9976
9977
9978
9979
9980
9981
9982
9983
9984
9985
9986
9987
9988
9989
9990
9991
9992
9993
9994
9995
9996
9997
9998
9999
10000
10001
10002
10003
10004
10005
10006
10007
10008
10009
10010
10011
10012
10013
10014
10015
10016
10017
10018
10019
10020
10021
10022
10023
10024
10025
10026
10027
10028
10029
10030
10031
10032
10033
10034
10035
10036
10037
10038
10039
10040
10041
10042
10043
10044
10045
10046
10047
10048
10049
10050
10051
10052
10053
10054
10055
10056
10057
10058
10059
10060
10061
10062
10063
10064
10065
10066
10067
10068
10069
10070
10071
10072
10073
10074
10075
10076
10077
10078
10079
10080
10081
10082
10083
10084
10085
10086
10087
10088
10089
10090
10091
10092
10093
10094
10095
10096
10097
10098
10099
10100
10101
10102
10103
10104
10105
10106
10107
10108
10109
10110
10111
10112
10113
10114
10115
10116
10117
10118
10119
10120
10121
10122
10123
10124
10125
10126
10127
10128
10129
10130
10131
10132
10133
10134
10135
10136
10137
10138
10139
10140
10141
10142
10143
10144
10145
10146
10147
10148
10149
10150
10151
10152
10153
10154
10155
10156
10157
10158
10159
10160
10161
10162
10163
10164
10165
10166
10167
10168
10169
10170
10171
10172
10173
10174
10175
10176
10177
10178
10179
10180
10181
10182
10183
10184
10185
10186
10187
10188
10189
10190
10191
10192
10193
10194
10195
10196
10197
10198
10199
10200
10201
10202
10203
10204
10205
10206
10207
10208
10209
10210
10211
10212
10213
10214
10215
10216
10217
10218
10219
10220
10221
10222
10223
10224
10225
10226
10227
10228
10229
10230
10231
10232
10233
10234
10235
10236
10237
10238
10239
10240
10241
10242
10243
10244
10245
10246
10247
10248
10249
10250
10251
10252
10253
10254
10255
10256
10257
10258
10259
10260
10261
10262
10263
10264
10265
10266
10267
10268
10269
10270
10271
10272
10273
10274
10275
10276
10277
10278
10279
10280
10281
10282
10283
10284
10285
10286
10287
10288
10289
10290
10291
10292
10293
10294
10295
10296
10297
10298
10299
10300
10301
10302
10303
10304
10305
10306
10307
10308
10309
10310
10311
10312
10313
10314
10315
10316
10317
10318
10319
10320
10321
10322
10323
10324
10325
10326
10327
10328
10329
10330
10331
10332
10333
10334
10335
10336
10337
10338
10339
10340
10341
10342
10343
10344
10345
10346
10347
10348
10349
10350
10351
10352
10353
10354
10355
10356
10357
10358
10359
10360
10361
10362
10363
10364
10365
10366
10367
10368
10369
10370
10371
10372
10373
10374
10375
10376
10377
10378
10379
10380
10381
10382
10383
10384
10385
10386
10387
10388
10389
10390
10391
10392
10393
10394
10395
10396
10397
10398
10399
10400
10401
10402
10403
10404
10405
10406
10407
10408
10409
10410
10411
10412
10413
10414
10415
10416
10417
10418
10419
10420
10421
10422
10423
10424
10425
10426
10427
10428
10429
10430
10431
10432
10433
10434
10435
10436
10437
10438
10439
10440
10441
10442
10443
10444
10445
10446
10447
10448
10449
10450
10451
10452
10453
10454
10455
10456
10457
10458
10459
10460
10461
10462
10463
10464
10465
10466
10467
10468
10469
10470
10471
10472
10473
10474
10475
10476
10477
10478
10479
10480
10481
10482
10483
10484
10485
10486
10487
10488
10489
10490
10491
10492
10493
10494
10495
10496
10497
10498
10499
10500
10501
10502
10503
10504
10505
10506
10507
10508
10509
10510
10511
10512
10513
10514
10515
10516
10517
10518
10519
10520
10521
10522
10523
10524
10525
10526
10527
10528
10529
10530
10531
10532
10533
10534
10535
10536
10537
10538
10539
10540
10541
10542
10543
10544
10545
10546
10547
10548
10549
10550
10551
10552
10553
10554
10555
10556
10557
10558
10559
10560
10561
10562
10563
10564
10565
10566
10567
10568
10569
10570
10571
10572
10573
10574
10575
10576
10577
10578
10579
10580
10581
10582
10583
10584
10585
10586
10587
10588
10589
10590
10591
10592
10593
10594
10595
10596
10597
10598
10599
10600
10601
10602
10603
10604
10605
10606
10607
10608
10609
10610
10611
10612
10613
10614
10615
10616
10617
10618
10619
10620
10621
10622
10623
10624
10625
10626
10627
10628
10629
10630
10631
10632
10633
10634
10635
10636
10637
10638
10639
10640
10641
10642
10643
10644
10645
10646
10647
10648
10649
10650
10651
10652
10653
10654
10655
10656
10657
10658
10659
10660
10661
10662
10663
10664
10665
10666
10667
10668
10669
10670
10671
10672
10673
10674
10675
10676
10677
10678
10679
10680
10681
10682
10683
10684
10685
10686
10687
10688
10689
10690
10691
10692
10693
10694
10695
10696
10697
10698
10699
10700
10701
10702
10703
10704
10705
10706
10707
10708
10709
10710
10711
10712
10713
10714
10715
10716
10717
10718
10719
10720
10721
10722
10723
10724
10725
10726
10727
10728
10729
10730
10731
10732
10733
10734
10735
10736
10737
10738
10739
10740
10741
10742
10743
10744
10745
10746
10747
10748
10749
10750
10751
10752
10753
10754
10755
10756
10757
10758
10759
10760
10761
10762
10763
10764
10765
10766
10767
10768
10769
10770
10771
10772
10773
10774
10775
10776
10777
10778
10779
10780
10781
10782
10783
10784
10785
10786
10787
10788
10789
10790
10791
10792
10793
10794
10795
10796
10797
10798
10799
10800
10801
10802
10803
10804
10805
10806
10807
10808
10809
10810
10811
10812
10813
10814
10815
10816
10817
10818
10819
10820
10821
10822
10823
10824
10825
10826
10827
10828
10829
10830
10831
10832
10833
10834
10835
10836
10837
10838
10839
10840
10841
10842
10843
10844
10845
10846
10847
10848
10849
10850
10851
10852
10853
10854
10855
10856
10857
10858
10859
10860
10861
10862
10863
10864
10865
10866
10867
10868
10869
10870
10871
10872
10873
10874
10875
10876
10877
10878
10879
10880
10881
10882
10883
10884
10885
10886
10887
10888
10889
10890
10891
10892
10893
10894
10895
10896
10897
10898
10899
10900
10901
10902
10903
10904
10905
10906
10907
10908
10909
10910
10911
10912
10913
10914
10915
10916
10917
10918
10919
10920
10921
10922
10923
10924
10925
10926
10927
10928
10929
10930
10931
10932
10933
10934
10935
10936
10937
10938
10939
10940
10941
10942
10943
10944
10945
10946
10947
10948
10949
10950
10951
10952
10953
10954
10955
10956
10957
10958
10959
10960
10961
10962
10963
10964
10965
10966
10967
10968
10969
10970
10971
10972
10973
10974
10975
10976
10977
10978
10979
10980
10981
10982
10983
10984
10985
10986
10987
10988
10989
10990
10991
10992
10993
10994
10995
10996
10997
10998
10999
11000
11001
11002
11003
11004
11005
11006
11007
11008
11009
11010
11011
11012
11013
11014
11015
11016
11017
11018
11019
11020
11021
11022
11023
11024
11025
11026
11027
11028
11029
11030
11031
11032
11033
11034
11035
11036
11037
11038
11039
11040
11041
11042
11043
11044
11045
11046
11047
11048
11049
11050
11051
11052
11053
11054
11055
11056
11057
11058
11059
11060
11061
11062
11063
11064
11065
11066
11067
11068
11069
11070
11071
11072
11073
11074
11075
11076
11077
11078
11079
11080
11081
11082
11083
11084
11085
11086
11087
11088
11089
11090
11091
11092
11093
11094
11095
11096
11097
11098
11099
11100
11101
11102
11103
11104
11105
11106
11107
11108
11109
11110
11111
11112
11113
11114
11115
11116
11117
11118
11119
11120
11121
11122
11123
11124
11125
11126
11127
11128
11129
11130
11131
11132
11133
11134
11135
11136
11137
11138
11139
11140
11141
11142
11143
11144
11145
11146
11147
11148
11149
11150
11151
11152
11153
11154
11155
11156
11157
11158
11159
11160
11161
11162
11163
11164
11165
11166
11167
11168
11169
11170
11171
11172
11173
11174
11175
11176
11177
11178
11179
11180
11181
11182
11183
11184
11185
11186
11187
11188
11189
11190
11191
11192
11193
11194
11195
11196
11197
11198
11199
11200
11201
11202
11203
11204
11205
11206
11207
11208
11209
11210
11211
11212
11213
11214
11215
11216
11217
11218
11219
11220
11221
11222
11223
11224
11225
11226
11227
11228
11229
11230
11231
11232
11233
11234
11235
11236
11237
11238
11239
11240
11241
11242
11243
11244
11245
11246
11247
11248
11249
11250
11251
11252
11253
11254
11255
11256
11257
11258
11259
11260
11261
11262
11263
11264
11265
11266
11267
11268
11269
11270
11271
11272
11273
11274
11275
11276
11277
11278
11279
11280
11281
11282
11283
11284
11285
11286
11287
11288
11289
11290
11291
11292
11293
11294
11295
11296
11297
11298
11299
11300
11301
11302
11303
11304
11305
11306
11307
11308
11309
11310
11311
11312
11313
11314
11315
11316
11317
11318
11319
11320
11321
11322
11323
11324
11325
11326
11327
11328
11329
11330
11331
11332
11333
11334
11335
11336
11337
11338
11339
11340
11341
11342
11343
11344
11345
11346
11347
11348
11349
11350
11351
11352
11353
11354
11355
11356
11357
11358
11359
11360
11361
11362
11363
11364
11365
11366
11367
11368
11369
11370
11371
11372
11373
11374
11375
11376
11377
11378
11379
11380
11381
11382
11383
11384
11385
11386
11387
11388
11389
11390
11391
11392
11393
11394
11395
11396
11397
11398
11399
11400
11401
11402
11403
11404
11405
11406
11407
11408
11409
11410
11411
11412
11413
11414
11415
11416
11417
11418
11419
11420
11421
11422
11423
11424
11425
11426
11427
11428
11429
11430
11431
11432
11433
11434
11435
11436
11437
11438
11439
11440
11441
11442
11443
11444
11445
11446
11447
11448
11449
11450
11451
11452
11453
11454
11455
11456
11457
11458
11459
11460
11461
11462
11463
11464
11465
11466
11467
11468
11469
11470
11471
11472
11473
11474
11475
11476
11477
11478
11479
11480
11481
11482
11483
11484
11485
11486
11487
11488
11489
11490
11491
11492
11493
11494
11495
11496
11497
11498
11499
11500
11501
11502
11503
11504
11505
11506
11507
11508
11509
11510
11511
11512
11513
11514
11515
11516
11517
11518
11519
11520
11521
11522
11523
11524
11525
11526
11527
11528
11529
11530
11531
11532
11533
11534
11535
11536
11537
11538
11539
11540
11541
11542
11543
11544
11545
11546
11547
11548
11549
11550
11551
11552
11553
11554
11555
11556
11557
11558
11559
11560
11561
11562
11563
11564
11565
11566
11567
11568
11569
11570
11571
11572
11573
11574
11575
11576
11577
11578
11579
11580
11581
11582
11583
11584
11585
11586
11587
11588
11589
11590
11591
11592
11593
11594
11595
11596
11597
11598
11599
11600
11601
11602
11603
11604
11605
11606
11607
11608
11609
11610
11611
11612
11613
11614
11615
11616
11617
11618
11619
11620
11621
11622
11623
11624
11625
11626
11627
11628
11629
11630
11631
11632
11633
11634
11635
11636
11637
11638
11639
11640
11641
11642
11643
11644
11645
11646
11647
11648
11649
11650
11651
11652
11653
11654
11655
11656
11657
11658
11659
11660
11661
11662
11663
11664
11665
11666
11667
11668
11669
11670
11671
11672
11673
11674
11675
11676
11677
11678
11679
11680
11681
11682
11683
11684
11685
11686
11687
11688
11689
11690
11691
11692
11693
11694
11695
11696
11697
11698
11699
11700
11701
11702
11703
11704
11705
11706
11707
11708
11709
11710
11711
11712
11713
11714
11715
11716
11717
11718
11719
11720
11721
11722
11723
11724
11725
11726
11727
11728
11729
11730
11731
11732
11733
11734
11735
11736
11737
11738
11739
11740
11741
11742
11743
11744
11745
11746
11747
11748
11749
11750
11751
11752
11753
11754
11755
11756
11757
11758
11759
11760
11761
11762
11763
11764
11765
11766
11767
11768
11769
11770
11771
11772
11773
11774
11775
11776
11777
11778
11779
11780
11781
11782
11783
11784
11785
11786
11787
11788
11789
11790
11791
11792
11793
11794
11795
11796
11797
11798
11799
11800
11801
11802
11803
11804
11805
11806
11807
11808
11809
11810
11811
11812
11813
11814
11815
11816
11817
11818
11819
11820
11821
11822
11823
11824
11825
11826
11827
11828
11829
11830
11831
11832
11833
11834
11835
11836
11837
11838
11839
11840
11841
11842
11843
11844
11845
11846
11847
11848
11849
11850
11851
11852
11853
11854
11855
11856
11857
11858
11859
11860
11861
11862
11863
11864
11865
11866
11867
11868
11869
11870
11871
11872
11873
11874
11875
11876
11877
11878
11879
11880
11881
11882
11883
11884
11885
11886
11887
11888
11889
11890
11891
11892
11893
11894
11895
11896
11897
11898
11899
11900
11901
11902
11903
11904
11905
11906
11907
11908
11909
11910
11911
11912
11913
11914
11915
11916
11917
11918
11919
11920
11921
11922
11923
11924
11925
11926
11927
11928
11929
11930
11931
11932
11933
11934
11935
11936
11937
11938
11939
11940
11941
11942
11943
11944
11945
11946
11947
11948
11949
11950
11951
11952
11953
11954
11955
11956
11957
11958
11959
11960
11961
11962
11963
11964
11965
11966
11967
11968
11969
11970
11971
11972
11973
11974
11975
11976
11977
11978
11979
11980
11981
11982
11983
11984
11985
11986
11987
11988
11989
11990
11991
11992
11993
11994
11995
11996
11997
11998
11999
12000
12001
12002
12003
12004
12005
12006
12007
12008
12009
12010
12011
12012
12013
12014
12015
12016
12017
12018
12019
12020
12021
12022
12023
12024
12025
12026
12027
12028
12029
12030
12031
12032
12033
12034
12035
12036
12037
12038
12039
12040
12041
12042
12043
12044
12045
12046
12047
12048
12049
12050
12051
12052
12053
12054
12055
12056
12057
12058
12059
12060
12061
12062
12063
12064
12065
12066
12067
12068
12069
12070
12071
12072
12073
12074
12075
12076
12077
12078
12079
12080
12081
12082
12083
12084
12085
12086
12087
12088
12089
12090
12091
12092
12093
12094
12095
12096
12097
12098
12099
12100
12101
12102
12103
12104
12105
12106
12107
12108
12109
12110
12111
12112
12113
12114
12115
12116
12117
12118
12119
12120
12121
12122
12123
12124
12125
12126
12127
12128
12129
12130
12131
12132
12133
12134
12135
12136
12137
12138
12139
12140
12141
12142
12143
12144
12145
12146
12147
12148
12149
12150
12151
12152
12153
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Sequence assembly with MIRA 5</title><link rel="stylesheet" type="text/css" href="doccss/miradocstyle.css"><meta name="generator" content="DocBook XSL Stylesheets V1.79.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="book"><div class="titlepage"><div><div><h1 class="title"><a name="idm1"></a>Sequence assembly with MIRA 5</h1></div><div><h2 class="subtitle">
The Definitive Guide
</h2></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><span class="contrib">Main author</span> <code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Jacqueline</span> <span class="surname">Weber</span></h3><span class="contrib">Extensive review of early reference manual
</span> </div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Andrea</span> <span class="surname">Hörster</span></h3><span class="contrib">Extensive review of early reference manual
</span> </div></div><div><div class="othercredit"><h3 class="othercredit"><span class="firstname">Katrina</span> <span class="surname">Dlugosch</span></h3><span class="contrib">Draft for section on preprocessing of ESTs in EST manual
</span> </div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div><div><div class="legalnotice"><a name="idm6"></a><p>
This documentation is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of
this license, visit <a class="ulink" href="http://creativecommons.org/licenses/by-nc-sa/3.0/" target="_top">http://creativecommons.org/licenses/by-nc-sa/3.0/</a> or send a letter to
Creative Commons, 171 Second Street, Suite 300, San Francisco, California,
94105, USA.
</p></div></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="preface"><a href="#idm30">Preface</a></span></dt><dt><span class="chapter"><a href="#chap_intro">1. Introduction to MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_installation">2. Installing MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_reference">3. MIRA 4 reference manual</a></span></dt><dt><span class="chapter"><a href="#chap_dataprep">4. Preparing data</a></span></dt><dt><span class="chapter"><a href="#chap_denovo">5. De-novo assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_mapping">6. Mapping assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_est">7. EST / RNASeq assemblies</a></span></dt><dt><span class="chapter"><a href="#chap_specialparams">8. Parameters for special situations</a></span></dt><dt><span class="chapter"><a href="#chap_results">9. Working with the results of MIRA</a></span></dt><dt><span class="chapter"><a href="#chap_mutils">10. Utilities in the MIRA package</a></span></dt><dt><span class="chapter"><a href="#chap_hard">11. Assembly of <span class="emphasis"><em>hard</em></span> genome or EST / RNASeq projects</a></span></dt><dt><span class="chapter"><a href="#chap_seqtechdesc">12. Description of sequencing technologies</a></span></dt><dt><span class="chapter"><a href="#chap_seqadvice">13. Some advice when going into a sequencing project</a></span></dt><dt><span class="chapter"><a href="#chap_bitsandpieces">14. Bits and pieces</a></span></dt><dt><span class="chapter"><a href="#chap_faq">15. Frequently asked questions</a></span></dt><dt><span class="chapter"><a href="#chap_maf">16. The MAF format</a></span></dt><dt><span class="chapter"><a href="#chap_logfiles">17. Log and temporary files used by MIRA</a></span></dt></dl></div><div class="list-of-figures"><p><b>List of Figures</b></p><dl><dt>1.1. <a href="#chap_intro::srmc_in_454sxahyb_1stpass.png">
How MIRA learns from misassemblies (1)
</a></dt><dt>1.2. <a href="#chap_intro::srmc_in_454sxahyb_lastpass1.png">
How MIRA learns from misassemblies (2)
</a></dt><dt>1.3. <a href="#chap_intro::srmc_in_454sxahyb_lastpass2.png">
How MIRA learns from misassemblies (3)
</a></dt><dt>1.4. <a href="#chap_intro::gcb99_replocator.png">
Slides presenting the repeat locator at the GCB 99
</a></dt><dt>1.5. <a href="#chap_intro::gcb99_edit.png">
Slides presenting the Edit automatic Sanger editor at the GCB 99
</a></dt><dt>1.6. <a href="#chap_intro::san_autoedit1.png">
Sanger assembly without EdIt automatic editing routines
</a></dt><dt>1.7. <a href="#chap_intro::san_autoedit2.png">
Sanger assembly with EdIt automatic editing routines
</a></dt><dt>1.8. <a href="#chap_intro::454_autoedit1.png">
454 assembly without 454 automatic editing routines
</a></dt><dt>1.9. <a href="#chap_intro::454_autoedit2.png">
454 assembly with 454 automatic editing routines
</a></dt><dt>1.10. <a href="#chap_intro::haf5_haf2_contigcoverage_ovals.png">
Coverage of a contig.
</a></dt><dt>1.11. <a href="#chap_intro::haf5_repend_rrna.png">
Repetitive end of a contig
</a></dt><dt>1.12. <a href="#chap_intro::haf2_end_nomoredata.png">
Non-repetitive end of a contig
</a></dt><dt>1.13. <a href="#chap_intro::454sxa_stms_hybdenovo.png">
MIRA pointing out problems in hybrid assemblies (1)
</a></dt><dt>1.14. <a href="#chap_intro::454san_stmu_hybdenovo.png">
MIRA pointing out problems in hybrid assemblies (2)
</a></dt><dt>1.15. <a href="#chap_intro::sxa_cer_reads1.png">
Coverage equivalent reads (CERs) explained.
</a></dt><dt>1.16. <a href="#chap_intro::sxa_cer_reads2.png">
Coverage equivalent reads let SNPs become very visible in assembly viewers
</a></dt><dt>1.17. <a href="#chap_intro::sxa_sroc_lenski2.png">
SNP tags in a MIRA assembly
</a></dt><dt>1.18. <a href="#chap_intro::sxa_mcvc_lenski.png">
Tag pointing out a large deletion in a MIRA mapping assembly
</a></dt><dt>9.1. <a href="#chap_res::results_miraconvert.png">
Format conversions with <span class="command"><strong>miraconvert</strong></span>
</a></dt><dt>9.2. <a href="#chap_res::results_mira2other.png">
Conversions needed for other tools.
</a></dt><dt>9.3. <a href="#haf_danger_join_notok.png">
Join at a repetitive site which should not be performed due to
missing spanning templates.
</a></dt><dt>9.4. <a href="#haf_danger_join_ok.png">
Join at a repetitive site which should be performed due to
spanning templates being good.
</a></dt><dt>9.5. <a href="#454_stacks_join.png">
Pseudo-repeat in 454 data due to sequencing artifacts
</a></dt><dt>9.6. <a href="#chap_sol::sxa_sroc_lenski1.png">
"SROc" tag showing a SNP position in a Solexa mapping
assembly.
</a></dt><dt>9.7. <a href="#chap_sol::sxa_sroc_lenski2.png">
"SROc" tag showing a SNP/indel position in a Solexa mapping
assembly.
</a></dt><dt>9.8. <a href="#chap_sol::sxa_mcvc_lenski.png">
"MCVc" tag (dark red stretch in figure) showing a genome
deletion in Solexa mapping assembly.
</a></dt><dt>9.9. <a href="#chap_sol::sxa_wrmcsrmc_hiding_lenski1.png">
An IS150 insertion hiding behind a WRMc and a SRMc tags
</a></dt><dt>9.10. <a href="#chap_sol::sxa_xmastree_lenski1.png">
A 16 base pair deletion leading to a SROc/UNsC xmas-tree
</a></dt><dt>9.11. <a href="#chap_sol::sxa_xmastree_lenski2.png">
An IS186 insertion leading to a SROc/UNsC xmas-tree
</a></dt><dt>12.1. <a href="#sxa_unsc_ggcxg2_lenski.png">
The Solexa GGCxG problem.
</a></dt><dt>12.2. <a href="#sxa_unsc_ggc1_lenski.png">
The Solexa GGC problem, forward example
</a></dt><dt>12.3. <a href="#sxa_unsc_ggc4_lenski.png">
The Solexa GGC problem, reverse example
</a></dt><dt>12.4. <a href="#sxa_xmastree_lenski2.png">
A genuine place of interest almost masked by the
<code class="literal">GGCxG</code> problem.
</a></dt><dt>12.5. <a href="#sxa_gcbias_nobias2008.png">
Example for no GC coverage bias in 2008 Solexa data.
</a></dt><dt>12.6. <a href="#sxa_gcbias_bias2009.png">
Example for GC coverage bias starting Q3 2009 in Solexa data.
</a></dt><dt>12.7. <a href="#sxa_gcbias_comp20082009.png">
Example for GC coverage bias, direct comparison 2008 / 2010 data.
</a></dt><dt>12.8. <a href="#chap_iontor::ion_dh10bgoodB13.png">
Example for good IonTorrent data (100bp reads)
</a></dt><dt>12.9. <a href="#chap_iontor::iontor_indelhpexample.png">
Example for problematic IonTorrent data (100bp reads)
</a></dt><dt>12.10. <a href="#chap_iontor::ion_dh10bdirdepindel.png.png">
Example for a sequencing direction dependent indel
</a></dt></dl></div><div class="preface"><div class="titlepage"><div><div><h1 class="title"><a name="idm30"></a>Preface</h1></div></div></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">How much intelligence does one need to sneak upon lettuce?
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><p>
This "book" is actually the result of an exercise in self-defense. It
contains texts from several years of help files, mails, postings, questions,
answers, etc. concerning MIRA and the assembly projects one can do with it.
</p><p>
I never really intended to push MIRA. It started out as a PhD thesis and I
subsequently continued development when I needed something to be done which
other programs couldn't do at the time. But MIRA has been available
as a binary on the Internet since 1999 ... and as Open Source since
2007. Somehow, MIRA seems to have caught the attention of more than just a
few specialised sequencing labs, and over the years I've seen an ever-growing
number of mails in my inbox and on the MIRA mailing list, both from people
who have been in the sequencing business "since ever" and from labs or
people just getting their feet wet in the area.
</p><p>
The help files -- and through them this book -- sort of reflect this
development. Most of the chapters<a href="#ftn.idm40" class="footnote" name="idm40"><sup class="footnote">[1]</sup></a> contain both very specialised
topics as well as step-by-step walk-throughs intended to help people to get
their assembly projects going. Some parts of the documentation are written
in a decidedly non-scientific way. Please excuse this: with time for rewriting
mails somewhat lacking, some texts were re-used almost verbatim.
</p><p>
The last few years have seen tremendous change in the sequencing
technologies and MIRA 4 reflects that: core data structures and
routines had to be thrown overboard and replaced with faster and/or more
versatile versions suited for the broad range of technologies and use-cases
I am currently running MIRA with.
</p><p>
Nothing is perfect, and both MIRA and this documentation (even if it is
rather pompously called <span class="emphasis"><em>Definitive Guide</em></span>) are far from
it. If you spot an error either in MIRA or this manual, feel free to report
it. Or, even better, correct it if you can. At least with the manual files
it should be easy: they're basically just some decorated text files.
</p><p>
I hope that MIRA will be as useful to you as it has been to me. Have a lot
of fun with it.
</p><p>
Burlington, Spring 2016
</p><p>
Bastien Chevreux
</p><div class="footnotes"><br><hr style="width:100; text-align:left;margin-left: 0"><div id="ftn.idm40" class="footnote"><p><a href="#idm40" class="para"><sup class="para">[1] </sup></a>Avid readers of David
Gerrold will certainly recognise the quotes from his books at the beginning
of each chapter</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_intro"></a>Chapter 1. Introduction to MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_intro_whatismira">1.1.
What is MIRA?
</a></span></dt><dt><span class="sect1"><a href="#sect_wheretostartreading">1.2.
What to read in this manual and where to start reading?
</a></span></dt><dt><span class="sect1"><a href="#sect_intro_miraquicktour">1.3.
The MIRA quick tour
</a></span></dt><dt><span class="sect1"><a href="#sect_for_which_data_sets_to_use_mira_and_for_which_not">1.4.
For which data sets to use MIRA and for which not
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect3_genome_denovo">1.4.1.
Genome de-novo
</a></span></dt><dt><span class="sect2"><a href="#sect_genome_mapping">1.4.2.
Genome mapping
</a></span></dt><dt><span class="sect2"><a href="#sect3_ests_rnaseq">1.4.3.
ESTs / RNASeq
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_specialfeatures">1.5.
Any special features I might be interested in?
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_miradiscernsrepeats">1.5.1.
MIRA learns to discern non-perfect repeats, leading to better assemblies
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_automatic_editors">1.5.2.
MIRA has integrated editors for data from Sanger, 454, IonTorrent sequencing
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_whycontigsend">1.5.3.
MIRA lets you see why contigs end where they end
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_stmshybrid_tags">1.5.4.
MIRA tags problematic decisions in hybrid assemblies
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_cer_reads">1.5.5.
MIRA allows older finishing programs to cope with amount data in Solexa
mapping projects
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_mapping_tags">1.5.6.
MIRA tags SNPs and other features, outputs result files
for biologists
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_miramuchmore">1.5.7.
MIRA has ... much more
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_versions_licenses_disclaimer_and_copyright">1.6.
Versions, Licenses, Disclaimer and Copyright
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_versions">1.6.1.
Versions
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_licenses">1.6.2.
License
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_intro_licensemira">1.6.2.1.
MIRA
</a></span></dt><dt><span class="sect3"><a href="#sect_intro_licensedocs">1.6.2.2.
Documentation
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_intro_copyright">1.6.3.
Copyright
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_external_libraries">1.6.4.
External libraries
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_intro_getting_help___mailing_lists___reporting_bugs">1.7.
Getting help / Mailing lists / Reporting bugs
</a></span></dt><dt><span class="sect1"><a href="#sect_intro_author">1.8.
Author
</a></span></dt><dt><span class="sect1"><a href="#sect_intro_miscellaneous">1.9.
Miscellaneous
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_intro_citations">1.9.1.
Citing MIRA
</a></span></dt><dt><span class="sect2"><a href="#sect_intro_postcards_gold_and_jewellery">1.9.2.
Postcards, gold and jewellery
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Half of being smart is to know what you're dumb at.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_whatismira"></a>1.1.
What is MIRA?
</h2></div></div></div><p>
MIRA is a multi-pass DNA sequence data assembler/mapper for whole
genome and EST/RNASeq projects. MIRA assembles/maps reads gained by
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
electrophoresis sequencing (aka Sanger sequencing)
</p></li><li class="listitem"><p>
454 pyro-sequencing (GS20, FLX or Titanium)
</p></li><li class="listitem"><p>
Ion Torrent
</p></li><li class="listitem"><p>
Solexa (Illumina) sequencing
</p></li><li class="listitem"><p>
Error-corrected Pacific Biosciences sequences
</p></li></ul></div><p>
into contiguous sequences (called <span class="emphasis"><em>contigs</em></span>). One can
use the sequences of different sequencing technologies in one of three ways: in a
single assembly run (a <span class="emphasis"><em>true hybrid assembly</em></span>), by
mapping one type of data to an assembly of another sequencing type (a
<span class="emphasis"><em>semi-hybrid assembly (or mapping)</em></span>), or by mapping
data against consensus sequences of other assemblies (a <span class="emphasis"><em>simple
mapping</em></span>).
</p><p>
The MIRA acronym stands for <span class="bold"><strong>M</strong></span>imicking
<span class="bold"><strong>I</strong></span>ntelligent <span class="bold"><strong>R</strong></span>ead <span class="bold"><strong>A</strong></span>ssembly
and the program pretty well does what its acronym says (well, most of
the time anyway). It is the Swiss army knife of sequence assembly that
I've used and developed during the past 14 years to get assembly jobs I
work on done efficiently - and especially accurately. That is, without
me actually putting too much manual work into it.
</p><p>
Over time, other labs and sequencing providers have found MIRA useful
for assembly of extremely 'unfriendly' projects containing lots of
repetitive sequences. As always, your mileage may vary.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_wheretostartreading"></a>1.2.
What to read in this manual and where to start reading?
</h2></div></div></div><p>
At the last count, this manual had almost 200 pages and this might seem a little bit daunting.
However, you very probably do not need to read everything.
</p><p>
You should read most of this introductory chapter though, in particular
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the part with the MIRA quick tour
</p></li><li class="listitem"><p>
the part which gives a quick overview for which data sets to use MIRA and for which not
</p></li><li class="listitem"><p>
the part which showcases different features of MIRA (lots of screen shots!)
</p></li><li class="listitem"><p>
where and how to get help if things don't work out as you expected
</p></li></ul></div><p>
After that, reading should depend on the type of data you intend to work
with: there are specific chapters for assembly of de-novo, of mapping and
of EST / RNASeq projects. They all contain an overview on how to
define your data and how to launch MIRA for these data sets. There is
also a chapter on how to prepare data sets from specific sequencing
technologies.
</p><p>
The chapter on working with results of MIRA should again be of general
interest to everyone. It describes the structure of output directories
and files and gives first pointers on what to find where. Also,
converting results into different formats -- with and without filtering
for specific needs -- is covered there.
</p><p>
As the previously cited chapters are more introductory in their nature,
they do not go into the details of MIRA parametrisation. While MIRA has
a comprehensive set of standard settings which should be suited for a
majority of assembly tasks, there are more than 150 switches / parameters
with which one can fine tune almost every aspect of an assembly. A
complete description for each and every parameter and how to correctly
set parameters for different use cases and sequencing technologies can
be found in the reference chapter.
</p><p>
As not every assembly project is simple, there is also a chapter with
tips on how to deal with projects which turn out to be "hard." It
certainly helps if you at least skim through it even if you do not
expect to have problems with your data ... it contains a couple of
tricks on what one can see in result files as well as in temporary and
log files which are not explained elsewhere.
</p><p>
MIRA comes with a number of additional utilities which are described in
their own chapter. While the purpose of <span class="command"><strong>miraconvert</strong></span>
should be quite clear quite quickly, the versatility of use cases for
<span class="command"><strong>mirabait</strong></span> might surprise more than one. Be sure to
check it out.
</p><p>
As from time to time some general questions on sequencing are popping up
on the MIRA talk mailing list, I have added a chapter with some general
musings on what to consider when going into sequencing projects. This
should be in no way a replacement for an exhaustive talk with a
sequencing provider, but it can give a couple of hints on what to take
care of.
</p><p>
There is also a FAQ chapter with some of the more frequently asked questions
which popped up in the past few years.
</p><p>
Finally, there are also chapters covering some more technical aspects of MIRA: the MAF format
and the structure / content of the tmp directory have their own chapters.
</p><p>
Complete walkthroughs ... are lacking at the moment for MIRA 4. In the
MIRA 3 manual I had them, but so many things have changed (at all
levels: MIRA, the sequencing technologies, data repositories) that I did
not have time to update them. I probably will need quite some time to
write new ones. Feel free to send me some if you are inclined to help
fellow scientists.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_miraquicktour"></a>1.3.
The MIRA quick tour
</h2></div></div></div><p>
Input can be in various formats like Staden experiment (EXP), Sanger
CAF, FASTA, FASTQ or PHD files. Ancillary data with additional
information helpful to the assembly, as contained in e.g. NCBI
traceinfo XML files or Staden EXP files, is also honoured. If present,
base qualities in
<span class="command"><strong>phred</strong></span> style and SCF signal electrophoresis trace
files are used to adjudicate between or even correct contradictory
stretches of bases in reads by either the integrated automatic EdIt
editor (written by Thomas Pfisterer) or the assembler itself.
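</p><p>
To make the notion of <span class="command"><strong>phred</strong></span> style base qualities a bit more
tangible, here is a minimal Python sketch (not part of MIRA, and not how
MIRA reads its input) which decodes the quality line of a FASTQ record into
phred scores and error probabilities via the usual definition
Q = -10 * log10(p). The function names are invented for this example, and
the ASCII offset of 33 is an assumption which holds for Sanger-style FASTQ
files; older Solexa / Illumina files used other encodings.
</p>

```python
# Assumption: Sanger-style FASTQ, where quality characters are ASCII + 33.
PHRED_OFFSET = 33


def phred_scores(quality_string, offset=PHRED_OFFSET):
    """Decode a FASTQ quality string into a list of integer phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Hypothetical quality line of a five-base read.
    scores = phred_scores("II5+!")
    for q in scores:
        print(f"Q{q}: error probability {error_probability(q):.4f}")
```

<p>
With this encoding the character <code class="literal">I</code> decodes to Q40, i.e. an
estimated one wrong base call in 10,000, while <code class="literal">+</code> decodes to
Q10, i.e. one in ten.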
</p><p>
MIRA was conceived especially with the problem of repeats in genomic
data and SNPs in transcript (EST / RNASeq) data in mind. Considerable
effort was made to develop a number of strategies -- ranging from
standard clone-pair size restrictions to discovery and marking of base
positions discriminating the different repeats / SNPs -- to ensure that
repetitive elements are correctly resolved and that misassemblies do not
occur.
</p><p>
The resulting assembly can be written in different standard formats like
CAF, Staden GAP4 directed assembly, ACE, HTML, FASTA, simple text or
transposed contig summary (TCS) files. These can easily be imported into
numerous finishing tools or further evaluated with simple scripts.
</p><p>
The aim of MIRA is to build the best possible assembly by
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
having a more or less full overview on the whole project at any time
of the assembly, i.e. knowledge of almost all possible read-pairs in
a project,
</p></li><li class="listitem"><p>
using high confidence regions (HCRs) of several aligned read-pairs to
start contig building at a good anchor point of a contig, extending
clipped regions of reads on a 'can be justified' basis.
</p></li><li class="listitem"><p>
using all available data present at the time of assembly, i.e.,
instead of relying on sequence and base confidence values only, the
assembler will profit from trace files containing electrophoresis
signals, tags marking possible special attributes of DNA,
information on specific insert sizes of read-pairs etc.
</p></li><li class="listitem"><p>
having 'intelligent' contig objects accept or refuse reads based on
the rate of unexplainable errors introduced into the consensus
</p></li><li class="listitem"><p>
learning from mistakes by discovering and analysing possible repeats
differentiated only by single nucleotide polymorphisms. The
important bases for discriminating different repetitive elements are
tagged and used as new information.
</p></li><li class="listitem"><p>
using the possibility given by the integrated automatic editor to
correct errors present in contigs (and subsequently in reads) by
generating and verifying complex error hypotheses through analysis
of trace signals in several reads covering the same area of a
consensus,
</p></li><li class="listitem"><p>
iteratively extending reads (and subsequently contigs) based on
</p><div class="orderedlist"><ol class="orderedlist" type="a"><li class="listitem"><p>
additional information gained by overlapping read pairs in contigs
and
</p></li><li class="listitem"><p>
corrections made by the automated editor.
</p></li></ol></div></li></ol></div><p>
</p><p>
MIRA was part of a bigger project that started at the DKFZ (Deutsches
Krebsforschungszentrum, German Cancer Research Centre) Heidelberg in
1997: the "Bundesministerium für Bildung, Wissenschaft, Forschung und
Technologie" supported the PhD theses of Thomas Pfisterer and myself by grant
number <span class="emphasis"><em>01 KW 9611</em></span>. Beside an assembler to tackle
difficult repeats, the grant also supported the automated editor /
finisher EdIt package -- written by Thomas Pfisterer. The strength of
MIRA and EdIt is the automatic interaction of both packages, which
produces assemblies that leave less work for human finishers.
</p><p>
I'd like to thank everybody who reported bugs to me, pointed out problems
they encountered while using the predecessors, and sent ideas and suggestions.
Please continue to do so: the feedback made this third version possible.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_for_which_data_sets_to_use_mira_and_for_which_not"></a>1.4.
For which data sets to use MIRA and for which not
</h2></div></div></div><p>
As a general rule of thumb: if your organism has a genome of more than
100 to 150 megabases, or if you have more than 20 to 40 million reads, you
might want to try other assemblers first.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect3_genome_denovo"></a>1.4.1.
Genome de-novo
</h3></div></div></div><p>
For genome assembly, the version 4 series of MIRA has been reported
to work on projects with something like a million Sanger reads (~80 to
100 megabases at 10x coverage), five to ten million 454 Titanium reads
(~100 megabases at 20x coverage) and 20 to 40 million Solexa reads
(enough for de-novo of a bacterium or a small eukaryote with 76mers or
100mers).
</p><p>
Provided you have the memory, MIRA is expected to work in de-novo
mode with
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Sanger reads: 5 to 10 million
</p></li><li class="listitem"><p>
454 reads: 5 to 15 million
</p></li><li class="listitem"><p>
Ion Torrent reads: 5 to 15 million
</p></li><li class="listitem"><p>
Solexa reads: in normal operation, up to 40 million reads. Some
people use it on up to 300 million, but you'll need a really big
machine and a month of computation time ... I do not recommend
that.
</p></li></ul></div><p>
and "normal" coverages, where "normal" means no more than 50x
to 70x for genome projects. Higher coverages will also work, but may
create somewhat larger temporary files without heavy
parametrisation. Lower coverages (&lt;4x for Sanger, &lt;10x for 454,
&lt;10x for IonTorrent) also need special attention in the
parameter settings.
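</p><p>
The read counts and coverages quoted in this section are linked by simple arithmetic: average coverage is roughly the number of reads times the mean read length, divided by the genome size. The Python sketch below works through that relation. The function names and the read lengths used in the example (about 900 bases for Sanger, about 400 for 454 Titanium) are illustrative assumptions, not values taken from MIRA.
</p><pre class="programlisting">
```python
def expected_coverage(num_reads, mean_read_length, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Number of reads needed to reach a target average coverage (rounded up)."""
    total_bases = target_coverage * genome_size
    # Ceiling division without importing math
    return -(-total_bases // mean_read_length)
```
</pre><p>
With these assumed read lengths, <code class="literal">expected_coverage(1_000_000, 900, 90_000_000)</code> gives 10x and <code class="literal">expected_coverage(5_000_000, 400, 100_000_000)</code> gives 20x, in line with the project sizes mentioned above.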
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_genome_mapping"></a>1.4.2.
Genome mapping
</h3></div></div></div><p>
As the complexity of mapping is a lot lower than de-novo, one can
basically double (perhaps even triple) the number of reads compared to
'de-novo'. The limiting factor will be the amount of RAM though, and
MIRA will also need lots of it if you go into eukaryotes.
</p><p>
The main limiting factor regarding time will be the number of
reference sequences (backbones) you are using. MIRA being pedantic
during the mapping process, it might be a rather long wait if you have
more than 40 megabases of reference sequences.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect3_ests_rnaseq"></a>1.4.3.
ESTs / RNASeq
</h3></div></div></div><p>
The default values for MIRA should allow it to work with many EST and
RNASeq data sets, sometimes even from non-normalised libraries. For
extreme coverage cases, however (for example, data with many regions at
or above 10k coverage), you may want to apply data
reduction routines before feeding the sequences to MIRA.
</p><p>
On the other hand, recent developments of MIRA were targeted at making
de-novo RNASeq assembly of non-normalised libraries practical, and
indeed I now regularly use MIRA for data sets with up to 50 million
Illumina 100bp reads.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_specialfeatures"></a>1.5.
Any special features I might be interested in?
</h2></div></div></div><p>
A few perhaps.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
The screen shots in this section show data from assemblies produced
with MIRA, but the visualisation itself is done in a finishing program
named <span class="command"><strong>gap4</strong></span>.
</p><p>
Some of the screen shots were edited to show a special feature of
MIRA. For example, in the screen shots with Solexa data, quite a few reads were
left out of the view pane because otherwise -- due to the amount of data --
these screen shots would need several pages for a complete printout.
</p></td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_miradiscernsrepeats"></a>1.5.1.
MIRA learns to discern non-perfect repeats, leading to better assemblies
</h3></div></div></div><p>
MIRA is an iterative assembler (it works in several passes) and acts a
bit like a child when exploring the world: it explores the assembly
space and is specifically parameterised to allow a couple of assembly
errors during the first passes. But after each pass some routines (the
"parents", if you like) check the result, search for assembly
errors and deduce knowledge about specific assemblies MIRA should not
have ventured into. MIRA will then prevent these errors from recurring in
subsequent passes.
</p><p>
As an example, consider the following multiple alignment:
</p><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_1stpass.png"></a><p class="title"><b>Figure 1.1. How MIRA learns from misassemblies (1). Multiple alignment
after 1st pass with an obvious assembly error; notice the clustered
column discrepancies. Two slightly different repeats were assembled
together.</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_1stpass.png" width="100%" alt="How MIRA learns from misassemblies (1). Multiple alignment after 1st pass with an obvious assembly error; notice the clustered column discrepancies. Two slightly different repeats were assembled together."></td></tr></table></div></div></div><br class="figure-break"><p>
This kind of error can easily be spotted by a human, but is hard to
prevent with normal alignment algorithms, as sometimes there is only a
single base column difference between repeats (and not several as in
this example).
</p><p>
MIRA spots these things (even if it's only a single column), tags the
base positions in the reads with additional information and then will
use that information in subsequent passes. The net effect is shown in
the next two figures:
</p><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_lastpass1.png"></a><p class="title"><b>Figure 1.2.
Multiple alignment after last pass where assembly errors from
previous passes have been resolved (1st repeat site)
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_lastpass1.png" width="100%" alt="Multiple alignment after last pass where assembly errors from previous passes have been resolved (1st repeat site)"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::srmc_in_454sxahyb_lastpass2.png"></a><p class="title"><b>Figure 1.3.
Multiple alignment after last pass where assembly errors from
previous passes have been resolved (2nd repeat site)
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/srmc_in_454sxahyb_lastpass2.png" width="100%" alt="Multiple alignment after last pass where assembly errors from previous passes have been resolved (2nd repeat site)"></td></tr></table></div></div></div><br class="figure-break"><p>
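The column-discrepancy idea illustrated in the figures above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MIRA's actual routine: reads are given as equal-length gapped strings, and a column counts as suspicious when two or more different bases are each backed by at least <code class="literal">min_support</code> reads (a hypothetical parameter), which a lone sequencing error would not achieve.
</p><pre class="programlisting">
```python
from collections import Counter


def discrepancy_columns(aligned_reads, min_support=2):
    """Return indices of alignment columns where two or more different
    bases are each supported by at least `min_support` reads."""
    if not aligned_reads:
        return []
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    suspicious = []
    for col in range(width):
        # Count bases in this column, ignoring gaps ('-', '*') and unknowns ('N')
        counts = Counter(
            read[col].upper() for read in aligned_reads
            if read[col] not in "-*Nn"
        )
        well_supported = [base for base, n in counts.items() if n >= min_support]
        if len(well_supported) >= 2:
            suspicious.append(col)
    return suspicious
```
</pre><p>
For the five toy reads ACGTACGT, ACGTACGT, ACGAACGT, ACGAACGT and ACGTACCT, the function reports only column 3 (counting from 0): the T/A split there is backed by several reads on each side, while the single C in column 6 looks like an ordinary sequencing error.
</p><p>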
The ability of MIRA to learn and discern non-identical repeats from
each other through column discrepancies is nothing new. Here's the
link to a paper from a talk I gave at the German Conference on
Bioinformatics in 1999: <a class="ulink" href="http://www.bioinfo.de/isb/gcb99/talks/chevreux/" target="_top">http://www.bioinfo.de/isb/gcb99/talks/chevreux/</a>
</p><p>
I'm sure you'll recognise the basic principle in figures 8 and 9. The
slides from the corresponding talk also look very similar to the
screen shots above:
</p><div class="figure"><a name="chap_intro::gcb99_replocator.png"></a><p class="title"><b>Figure 1.4.
Slides presenting the repeat locator at the GCB 99
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/gcb99_replocator.png" width="100%" alt="Slides presenting the repeat locator at the GCB 99"></td></tr></table></div></div></div><br class="figure-break"><p>
You can get the talk with these slides here: <a class="ulink" href="http://chevreux.org/dkfzold/gcb99/bachvortrag_gcb99.ppt" target="_top">http://chevreux.org/dkfzold/gcb99/bachvortrag_gcb99.ppt</a>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_automatic_editors"></a>1.5.2.
MIRA has integrated editors for data from Sanger, 454, IonTorrent sequencing
</h3></div></div></div><p>
Since the first versions in 1999, the <span class="emphasis"><em>EdIt</em></span>
automatic Sanger sequence editor from Thomas Pfisterer has been
integrated into MIRA.
</p><div class="figure"><a name="chap_intro::gcb99_edit.png"></a><p class="title"><b>Figure 1.5.
Slides presenting the EdIt automatic Sanger editor at the GCB 99
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/gcb99_edit.png" width="100%" alt="Slides presenting the EdIt automatic Sanger editor at the GCB 99"></td></tr></table></div></div></div><br class="figure-break"><p>
The routines use a combination of hypothesis generation/testing
together with neural networks (trained on ABI and ALF traces) for
signal recognition to discern between base calling errors and true
multiple alignment differences. They go back to the trace data to
resolve potential conflicts and eventually re-call bases using the
additional information gained in a multiple alignment of reads.
</p><div class="figure"><a name="chap_intro::san_autoedit1.png"></a><p class="title"><b>Figure 1.6.
Sanger assembly without EdIt automatic editing routines. The bases
with blue background are base calling errors.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/san_autoedit1.png" width="100%" alt="Sanger assembly without EdIt automatic editing routines. The bases with blue background are base calling errors."></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::san_autoedit2.png"></a><p class="title"><b>Figure 1.7.
Sanger assembly with EdIt automatic editing routines. Bases with
pink background are corrections made by EdIt after assessing the
underlying trace files (SCF files in this case). Bases with blue
background are base calling errors where the trace
files did not show enough evidence to allow an editing correction.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/san_autoedit2.png" width="100%" alt="Sanger assembly with EdIt automatic editing routines. Bases with pink background are corrections made by EdIt after assessing the underlying trace files (SCF files in this case). Bases with blue background are base calling errors where the trace files did not show enough evidence to allow an editing correction."></td></tr></table></div></div></div><br class="figure-break"><p>
In 2007, with the introduction of 454 reads, MIRA also gained specialised
editors that search for and correct typical 454 sequencing problems such as
homopolymer run over-/undercalls. These editors are now integrated
into MIRA itself and are not part of EdIt anymore.
</p><p>
While not paramount to the assembly quality, both editors
provide additional layers of safety for the MIRA learning algorithm to
discern non-perfect repeats even on a single base
discrepancy. Furthermore, the multiple alignments generated by these
two editors are far more pleasant to look at (or to analyse
automatically) than the ones containing all kinds of gaps, insertions,
deletions etc.
</p><div class="figure"><a name="chap_intro::454_autoedit1.png"></a><p class="title"><b>Figure 1.8.
454 assembly without 454 automatic editing routines
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_autoedit1.png" width="100%" alt="454 assembly without 454 automatic editing routines"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_intro::454_autoedit2.png"></a><p class="title"><b>Figure 1.9.
454 assembly with 454 automatic editing routines
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_autoedit2.png" width="100%" alt="454 assembly with 454 automatic editing routines"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_whycontigsend"></a>1.5.3.
MIRA lets you see why contigs end where they end
</h3></div></div></div><p>
The kmer (hash) frequency tags which MIRA sets in the assembly are a
very useful feature for finishing. Provided your finishing editor
understands those tags
(<span class="command"><strong>gap4</strong></span>, <span class="command"><strong>gap5</strong></span>
and <span class="command"><strong>consed</strong></span> are fine but there may be others),
they'll give you valuable insight into where you might want to be cautious
when joining two contigs or where you would need to perform some primer
walking. MIRA colourises the assembly with the hash frequency (HAF)
tags to show repetitiveness.
</p><p>
You will need to read about the HAF tags in the reference manual, but
in a nutshell: the HAF5, HAF6 and HAF7 tags tell you that you potentially
have repetitive to very repetitive areas in the genome, while HAF2
tags tell you that these areas in the genome have not been
covered as well as they should have been.
</p><p>
As an example, the following figure shows the coverage of a contig.
</p><div class="figure"><a name="chap_intro::haf5_haf2_contigcoverage_ovals.png"></a><p class="title"><b>Figure 1.10.
Coverage of a contig.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf5_haf2_contigcoverage_ovals.png" width="100%" alt="Coverage of a contig."></td></tr></table></div></div></div><br class="figure-break"><p>
The question now is: why did MIRA stop building this contig on the
left end (left oval) and why on the right end (right oval)?
</p><p>
Looking at the HAF tags in the contig, the answer quickly becomes
clear: the left contig end has HAF5 tags in the reads (shown in bright
red in the following figure). This tells you that MIRA stopped because
it probably could not unambiguously continue building this
contig. Indeed, if you BLAST the sequence at the NCBI, you will find
out that this is an rRNA area of a bacterium, of which bacteria
normally have several copies in the genome:
</p><div class="figure"><a name="chap_intro::haf5_repend_rrna.png"></a><p class="title"><b>Figure 1.11.
HAF5 tags (reads shown with red background) covering a contig end
show repetitiveness as reason for stopping a contig build.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf5_repend_rrna.png" width="100%" alt="HAF5 tags (reads shown with red background) covering a contig end show repetitiveness as reason for stopping a contig build."></td></tr></table></div></div></div><br class="figure-break"><p>
The right end of the same contig however ends in HAF3 tags (normal
coverage, bright green in the next figure) and even HAF2 tags (below
average coverage, pale green in the next image). This tells you MIRA
stopped building the contig at this place simply because there were
no more reads to continue. This is a perfect target for primer
walking if you want to finish a genome.
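</p><p>
The reasoning above can be condensed into a small lookup. The Python sketch below is illustrative only: the tag meanings are limited to those mentioned in this section (HAF2, HAF3, HAF5, HAF6 and HAF7), and the function names and the wording of the returned hints are hypothetical. The reference manual remains the authority on HAF tags.
</p><pre class="programlisting">
```python
# Meanings restricted to what this section states; other HAF tags are
# deliberately left out rather than guessed.
HAF_MEANING = {
    "HAF2": "below average coverage",
    "HAF3": "normal coverage",
    "HAF5": "repetitive to very repetitive area",
    "HAF6": "repetitive to very repetitive area",
    "HAF7": "repetitive to very repetitive area",
}

REPEAT_TAGS = {"HAF5", "HAF6", "HAF7"}


def describe_tag(tag):
    """Short description of a HAF tag, as far as this section covers it."""
    return HAF_MEANING.get(tag, "not covered in this section")


def contig_end_hint(tags_at_end):
    """Suggest why a contig might end, given HAF tag names seen at that end."""
    tags = set(tags_at_end)
    if not tags.isdisjoint(REPEAT_TAGS):
        return "repeat: contig could probably not be extended unambiguously"
    if tags and tags.issubset({"HAF2", "HAF3"}):
        return "no more reads: candidate for primer walking"
    return "unknown: consult the reference manual"
```
</pre><p>
For the contig discussed here, <code class="literal">contig_end_hint(["HAF5"])</code> points to a repeat at the left end, while <code class="literal">contig_end_hint(["HAF3", "HAF2"])</code> points to missing reads at the right end.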
</p><div class="figure"><a name="chap_intro::haf2_end_nomoredata.png"></a><p class="title"><b>Figure 1.12.
HAF2 tags covering a contig end show that no more reads were
available for assembly at this position.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf2_end_nomoredata.png" width="100%" alt="HAF2 tags covering a contig end show that no more reads were available for assembly at this position."></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_stmshybrid_tags"></a>1.5.4.
MIRA tags problematic decisions in hybrid assemblies
</h3></div></div></div><p>
Many people combine Sanger &amp; 454 -- or nowadays more 454 &amp;
Solexa -- to improve the sequencing quality of their project through
two (or more) sequencing technologies. To reduce time spent in
finishing, MIRA automatically tags those bases in a consensus of a
hybrid assembly where reads from different sequencing technologies
severely contradict each other.
</p><p>
The following example shows a hybrid 454 / Solexa assembly where the
454 reads (highlighted read names in the following figure) disagree on
whether there are one or two "G" at a certain position. The consensus
algorithm would have chosen "two Gs" for 454, obviously a wrong
decision as all Solexa reads at the same spot (the reads which are not
highlighted) show only one "G" for the given position. While MIRA
chose to believe Solexa in this case, it tagged the position anyway in
case someone chooses to check this kind of thing.
</p><div class="figure"><a name="chap_intro::454sxa_stms_hybdenovo.png"></a><p class="title"><b>Figure 1.13.
A "STMS" tag (Sequencing Technology Mismatch Solved, the black
square base in the consensus) showing a potentially difficult
decision in a hybrid 454 / Solexa de-novo assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454sxa_stms_hybdenovo.png" width="100%" alt='A "STMS" tag (Sequencing Technology Mismatch Solved, the black square base in the consensus) showing a potentially difficult decision in a hybrid 454 / Solexa de-novo assembly.'></td></tr></table></div></div></div><br class="figure-break"><p>
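The basic idea of flagging a consensus position where sequencing technologies contradict each other can be sketched as follows. This Python fragment is a simplified illustration, not MIRA's consensus algorithm: it takes a plain majority vote per technology and reports whether the votes differ. It does not model base qualities or the choice between the tags shown in this section, and all names in it are hypothetical.
</p><pre class="programlisting">
```python
from collections import Counter


def majority_call(calls):
    """Most frequent call in a non-empty list (ties resolved arbitrarily)."""
    return Counter(calls).most_common(1)[0][0]


def technology_mismatch(column):
    """Return True if the per-technology majority calls disagree.

    `column` maps a technology name to the calls its reads make at one
    consensus position, e.g. {"454": [...], "Solexa": [...]}.
    """
    calls = {majority_call(c) for c in column.values() if c}
    return len(calls) > 1
```
</pre><p>
Modelled on the example above, <code class="literal">technology_mismatch({"454": ["GG", "GG", "G"], "Solexa": ["G", "G", "G", "G"]})</code> returns <code class="literal">True</code>, marking the position as one worth a second look.
</p><p>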
This works also for other sequencing technology combinations or in
mapping assemblies. The following is an example in a hybrid Sanger /
454 project where by pure misfortune, all Sanger reads have a base
calling error at a given position while the 454 reads show the true
sequence.
</p><div class="figure"><a name="chap_intro::454san_stmu_hybdenovo.png"></a><p class="title"><b>Figure 1.14.
A "STMU" tag (Sequencing Technology Mismatch Unresolved, light blue
square in the consensus at lower end of large oval) showing a
potentially difficult decision in a hybrid Sanger / 454 mapping
assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454san_stmu_hybdenovo.png" width="100%" alt='A "STMU" tag (Sequencing Technology Mismatch Unresolved, light blue square in the consensus at lower end of large oval) showing a potentially difficult decision in a hybrid Sanger / 454 mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_cer_reads"></a>1.5.5.
MIRA allows older finishing programs to cope with the amount of data in Solexa
mapping projects
</h3></div></div></div><p>
Quality control is paramount when you do mutation analysis for
biologists: I know they'll be on my doorstep the minute they
find out that one of the SNPs in the resequencing data wasn't a SNP, but a
sequencing artefact. And I can understand them: why should they invest
-- per SNP -- hours in the wet lab if I can invest a couple of minutes
to get them data with false negative rates (and false discovery rates) way
below 1%? So, finishing and quality control for any mapping project is
a must.
</p><p>
Both <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>consed</strong></span> start to
have a couple of problems when projects have millions of reads: you
need lots of RAM and scrolling around the assembly becomes a test of your
patience. Still, these two assembly finishing programs are amongst the
better ones out there, although <span class="command"><strong>gap5</strong></span> is
quickly approaching a state in which it can replace
<span class="command"><strong>gap4</strong></span>.
</p><p>
So, MIRA reduces the number of reads in Solexa mapping projects
without sacrificing information on coverage. The principle is pretty
simple: for 100% matching reads, MIRA tracks coverage of every
reference base and creates long synthetic, coverage equivalent reads
(CERs) in exchange for the Solexa reads. Reads that do not match 100%
are kept as separate entities, so that no information gets lost. The
following figure illustrates this:
</p><div class="figure"><a name="chap_intro::sxa_cer_reads1.png"></a><p class="title"><b>Figure 1.15.
Coverage equivalent reads (CERs) explained.
<p>
Left side of the figure: a conventional mapping with eleven reads
of size 4 against a consensus (in uppercase). The inverted base in
the lowest read depicts a sequencing error.
</p>
<p>
Right side of the figure: the same situation, but with coverage
equivalent reads (CERs). Note that there are fewer reads, but no
information is lost: the coverage of each reference base is
equivalent to the left side of the figure and reads with
differences to the reference are still present.
</p>
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_cer_reads1.png" width="100%" alt="Coverage equivalent reads (CERs) explained. Left side of the figure: a conventional mapping with eleven reads of size 4 against a consensus (in uppercase). The inverted base in the lowest read depicts a sequencing error. Right side of the figure: the same situation, but with coverage equivalent reads (CERs). Note that there are fewer reads, but no information is lost: the coverage of each reference base is equivalent to the left side of the figure and reads with differences to the reference are still present."></td></tr></table></div></div></div><br class="figure-break"><p>
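The principle can be illustrated with a short Python sketch. How MIRA actually builds its CERs is not described here, so the layer-by-layer decomposition below is only one possible way, chosen for simplicity, to replace many short, perfectly matching reads with fewer long intervals that have the same per-base coverage. All function names are hypothetical, and reads are reduced to (start, length) pairs on the reference. Reads with mismatches are outside the sketch; they would simply be kept as they are.
</p><pre class="programlisting">
```python
def coverage_profile(ref_length, reads):
    """Per-base coverage from (start, length) pairs of perfectly matching reads."""
    cov = [0] * ref_length
    for start, length in reads:
        for pos in range(start, start + length):
            cov[pos] += 1
    return cov


def coverage_equivalent_reads(cov):
    """Decompose a coverage profile into long (start, length) intervals.

    Layer k contains every maximal run of positions with coverage >= k,
    so stacking all layers reproduces the original profile exactly.
    """
    cers = []
    for layer in range(1, max(cov, default=0) + 1):
        run_start = None
        for pos, depth in enumerate(cov):
            if depth >= layer and run_start is None:
                run_start = pos  # a run begins
            elif layer > depth and run_start is not None:
                cers.append((run_start, pos - run_start))  # a run ends
                run_start = None
        if run_start is not None:  # run reaches the end of the reference
            cers.append((run_start, len(cov) - run_start))
    return cers
```
</pre><p>
For nine reads of size 4 starting at positions 0 to 8 of a 12-base reference, the sketch produces four intervals, and <code class="literal">coverage_profile</code> applied to those four intervals returns the same coverage values as for the nine original reads.
</p><p>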
This strategy is very effective in reducing the size of a project. As
an example, in a mapping project with 9 million Solexa 36mers, MIRA
created a project with 1.7 million reads: 700k CER reads representing ~8
million 100% matching Solexa reads, and it kept ~950k mapped reads as
they had ≥ 1 mismatch (be it sequencing error or true SNP) to the
reference. This is a reduction of about 80%, and numbers for mapping
projects with Solexa 100bp reads are in a similar range.
</p><p>
Also, mutations of the resequenced strain now really stand out in the
assembly viewer as the following figure shows:
</p><div class="figure"><a name="chap_intro::sxa_cer_reads2.png"></a><p class="title"><b>Figure 1.16.
Coverage equivalent reads let SNPs become very visible in assembly viewers
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_cer_reads2.png" width="100%" alt="Coverage equivalent reads let SNPs become very visible in assembly viewers"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_mapping_tags"></a>1.5.6.
MIRA tags SNPs and other features, outputs result files
for biologists
</h3></div></div></div><p>
Want to assemble two or more very closely related genomes without a
reference, while still finding SNPs or differences between them?
</p><p>
Tired of looking at some text output from mapping programs and
guessing whether a SNP is really a SNP or just some random junk?
</p><p>
MIRA tags all SNPs (and other features like missing coverage etc.) it
finds so that -- when using a finishing viewer like gap4 or consed --
one can quickly jump from tag to tag and perform quality control. This
works both in de-novo assembly and in mapping assembly; all MIRA needs
is the information which read comes from which strain.
</p><p>
The following figure shows a mapping assembly of Solexa 36mers against
a bacterial reference sequence, where a mutant has an indel position
in a gene:
</p><div class="figure"><a name="chap_intro::sxa_sroc_lenski2.png"></a><p class="title"><b>Figure 1.17.
"SROc" tag (Snp inteR Organism on Consensus) showing a SNP position
in a Solexa mapping assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski2.png" width="100%" alt='"SROc" tag (Snp inteR Organism on Consensus) showing a SNP position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><p>
Other interesting places like deletions of whole genome parts are also
directly tagged by MIRA and noted in diverse result files (and
searchable in assembly viewers):
</p><div class="figure"><a name="chap_intro::sxa_mcvc_lenski.png"></a><p class="title"><b>Figure 1.18.
"MCVc" tag (Missing CoVerage in Consensus, dark red stretch in figure)
showing a genome deletion in Solexa mapping assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_mcvc_lenski.png" width="100%" alt='"MCVc" tag (Missing CoVerage in Consensus, dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
For bacteria -- and if you use annotated GenBank files as reference
sequence -- MIRA will also output some nice lists directly usable (in
Excel) by biologists, telling them which gene was affected by what
kind of SNP, whether it changes the protein, the original and the
mutated protein sequence etc.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_miramuchmore"></a>1.5.7.
MIRA has ... much more
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Extensive possibilities to clip data if needed: by quality, by
masked bases, by A/T stretches, by evidence from other reads, ...
</p></li><li class="listitem"><p>
Routines to re-extend reads into clipped parts if multiple
alignment allows for it.
</p></li><li class="listitem"><p>
Read in ancillary data in different formats: EXP, NCBI TRACEINFO
XML, SSAHA2, SMALT result files and text files.
</p></li><li class="listitem"><p>
Detection of chimeric reads.
</p></li><li class="listitem"><p>
Pipeline to discover SNPs in ESTs from different strains
(miraSearchESTSNPs)
</p></li><li class="listitem"><p>
Support for many different input and output formats (FASTA,
EXP, FASTQ, CAF, MAF, ...)
</p></li><li class="listitem"><p>
Automatic memory management (when RAM is tight)
</p></li><li class="listitem"><p>
Over 150 parameters to tune the assembly for a lot of use cases,
many of these parameters being tunable individually depending on
the sequencing technology they apply to.
</p></li></ul></div><p>
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_versions_licenses_disclaimer_and_copyright"></a>1.6.
Versions, Licenses, Disclaimer and Copyright
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_versions"></a>1.6.1.
Versions
</h3></div></div></div><p>
There are two kinds of versions of MIRA that can be compiled from
source files: production and development.
</p><p>
Production versions are from the stable branch of the source code. These
versions are available for download from SourceForge.
</p><p>
Development versions are from the development branch of the source
tree. These are also made available to the public and should be
compiled by users who want to test out new functionality or to track
down bugs or errors that might arise at a given location. Release
candidates (rc) also fall into the development versions: they are
usually the last versions of a given development branch before being
folded back into the production branch.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_licenses"></a>1.6.2.
License
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_intro_licensemira"></a>1.6.2.1.
MIRA
</h4></div></div></div><p>
MIRA has been put under the GPL version 2.
</p><p>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.
</p><p>
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
</p><p>
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA
</p><p>
You may also visit <a class="ulink" href="http://www.opensource.org/licenses/gpl-2.0.php" target="_top">http://www.opensource.org/licenses/gpl-2.0.php</a> at the Open
Source Initiative for a copy of this licence.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_intro_licensedocs"></a>1.6.2.2.
Documentation
</h4></div></div></div><p>
The documentation pertaining to MIRA is licensed under the Creative
Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
License. To view a copy of this license, visit <a class="ulink" href="http://creativecommons.org/licenses/by-nc-sa/3.0/" target="_top">http://creativecommons.org/licenses/by-nc-sa/3.0/</a> or send a
letter to Creative Commons, 171 Second Street, Suite 300, San
Francisco, California, 94105, USA.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_copyright"></a>1.6.3.
Copyright
</h3></div></div></div><p>
© 1997-2000 Deutsches Krebsforschungszentrum Heidelberg -- Dept.
of Molecular Biophysics and Bastien Chevreux (for MIRA) and Thomas
Pfisterer (for EdIt)
</p><p>
© 2001-2014 Bastien Chevreux.
</p><p>
All rights reserved.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_external_libraries"></a>1.6.4.
External libraries
</h3></div></div></div><p>
MIRA uses the excellent Expat library to parse XML files. Expat is Copyright
© 1998, 1999, 2000 Thai Open Source Software Center Ltd and Clark
Cooper as well as Copyright ©
2001, 2002 Expat maintainers.
</p><p>
See <a class="ulink" href="http://www.libexpat.org/" target="_top">http://www.libexpat.org/</a> and
<a class="ulink" href="http://sourceforge.net/projects/expat/" target="_top">http://sourceforge.net/projects/expat/</a> for more information on Expat.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_getting_help___mailing_lists___reporting_bugs"></a>1.7.
Getting help / Mailing lists / Reporting bugs
</h2></div></div></div><p>
Please try to find an answer to your question by first reading the
documents provided with the MIRA package (FAQs, READMEs, usage guide,
guides for specific sequencing technologies etc.). It's a lot, but then
again, they hopefully should cover 90% of all questions.
</p><p>
If you have a tough nut to crack or simply could not find what you were
searching for, you can subscribe to the MIRA talk mailing list and send
in your question (or comment, or suggestion), see <a class="ulink" href="http://www.chevreux.org/mira_mailinglists.html" target="_top">http://www.chevreux.org/mira_mailinglists.html</a> for more
information on that. Now that the number of subscribers has reached a
good level, there's a fair chance that someone could answer your
question before I have the opportunity or while I'm away from mail for a
certain time.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Please very seriously consider using the mailing list before mailing
me directly. Every question which can be answered by participants of
the list is time I can invest in development and documentation of
MIRA. I have a day job as a bioinformatician which has nothing to do
with MIRA, and after-work hours are rare enough nowadays.
</p><p>
Furthermore, Google indexes the mailing list and every discussion /
question asked on the mailing list helps future users as they show up
in Google searches.
</p><p>
Only mail me directly (bach@chevreux.org) if you feel that there's
some information you absolutely do not want to share publicly.
</p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Subscribing to the list <span class="emphasis"><em>before sending mails to it </em></span>
is necessary as messages from non-subscribers will be stopped by the
system to keep the spam level low.
</td></tr></table></div><p>
To report bugs or ask for new features, please use the SourceForge
ticketing system at: <a class="ulink" href="http://sourceforge.net/p/mira-assembler/tickets/" target="_top">http://sourceforge.net/p/mira-assembler/tickets/</a>. This ensures
that requests do not get lost <span class="bold"><strong>and</strong></span> you
get the additional benefit of automatically knowing when a bug has been
fixed. I will not send separate emails; that's what bug trackers are
there for.
</p><p>
Finally, new or intermediate versions of MIRA will be announced on the
separate MIRA announce mailing list. Traffic is very low there as the
only one who can post there is me. Subscribe if you want to be informed
automatically about new releases of MIRA.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_author"></a>1.8.
Author
</h2></div></div></div><p>
Bastien Chevreux (mira): <code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code>
</p><p>
WWW: <a class="ulink" href="http://www.chevreux.org/" target="_top">http://www.chevreux.org/</a>
</p><p>
MIRA can use automatic editing routines for Sanger sequences which were
written by Thomas Pfisterer (EdIt):
<code class="email"><<a class="email" href="mailto:t.pfisterer@dkfz-heidelberg.de">t.pfisterer@dkfz-heidelberg.de</a>></code>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_intro_miscellaneous"></a>1.9.
Miscellaneous
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_citations"></a>1.9.1.
Citing MIRA
</h3></div></div></div><p>
Please use these citations:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
For <span class="command"><strong>mira</strong></span>
</span></dt><dd><p>
Chevreux, B., Wetter, T. and Suhai, S. (1999): <span class="emphasis"><em>Genome
Sequence Assembly Using Trace Signals and Additional Sequence
Information</em></span>. Computer Science and Biology:
Proceedings of the German Conference on Bioinformatics (GCB) 99,
pp. 45-56.
</p></dd><dt><span class="term">
For <span class="command"><strong>miraSearchESTSNPs</strong></span> (was named
<span class="command"><strong>miraEST</strong></span> in earlier times)
</span></dt><dd><p> Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J.,
Müller, W. E., Wetter, T. and Suhai, S. (2004): <span class="emphasis"><em>Using
the miraEST Assembler for Reliable and Automated mRNA Transcript
Assembly and SNP Detection in Sequenced ESTs</em></span>. Genome
Research, 14(6)
</p></dd></dl></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_intro_postcards_gold_and_jewellery"></a>1.9.2.
Postcards, gold and jewellery
</h3></div></div></div><p>
If you find this software useful, please send the author a postcard. If
postcards are not available, a treasure chest full of Spanish doubloons, gold
and jewellery will do nicely, thank you.
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_installation"></a>Chapter 2. Installing MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_install_wheretofetch">2.1.
Where to fetch MIRA
</a></span></dt><dt><span class="sect1"><a href="#sect_install_precompiledbinary">2.2.
Installing from a precompiled binary package
</a></span></dt><dt><span class="sect1"><a href="#sect_install_third_party_integration">2.3.
Integration with third party programs (gap4, consed)
</a></span></dt><dt><span class="sect1"><a href="#sect_install_compiling">2.4.
Compiling MIRA yourself
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_comp_prereq">2.4.1.
Prerequisites
</a></span></dt><dt><span class="sect2"><a href="#sect_install_comp_comp">2.4.2.
Compiling and installing
</a></span></dt><dt><span class="sect2"><a href="#sect_install_comp_conf">2.4.3.
Configure switches for MIRA
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_install_comp_conf_boost">2.4.3.1.
BOOST configure switches for MIRA
</a></span></dt><dt><span class="sect3"><a href="#sect_install_comp_conf_mira">2.4.3.2.
MIRA specific configure switches
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_install_walkthroughs">2.5.
Installation walkthroughs
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_walkthroughs_kubuntu">2.5.1.
(K)Ubuntu 12.04
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_opensuse">2.5.2.
openSUSE 12.1
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_fedora">2.5.3.
Fedora 17
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_osx">2.5.4.
Mac OSX
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_allfromscratch">2.5.5.
Compile everything from scratch
</a></span></dt><dt><span class="sect2"><a href="#sect_install_walkthroughs_dynamic">2.5.6.
Dynamically linked MIRA
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_install_hintotherplatforms">2.6.
Compilation hints for other platforms.
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_hintnetbsd5">2.6.1.
NetBSD 5 (i386)
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_install_notesformaintainers">2.7.
Notes for distribution maintainers / system administrators
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_install_additionaldatafiles">2.7.1.
Additional data files
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">A problem can be found to almost every solution.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_wheretofetch"></a>2.1.
Where to fetch MIRA
</h2></div></div></div><p>
SourceForge: <a class="ulink" href="http://sourceforge.net/projects/mira-assembler/" target="_top">http://sourceforge.net/projects/mira-assembler/</a>
</p><p>
There you will normally find a couple of precompiled binaries -- usually
for Linux and Mac OSX -- or the source package for compiling yourself.
</p><p>
Precompiled binary packages are named in the following way:
</p><p>
<code class="filename">mira_<em class="replaceable"><code>miraversion</code></em>_<em class="replaceable"><code>OS-and-binarytype</code></em>.tar.bz2</code>
</p><p>
where
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
For <code class="filename"><em class="replaceable"><code>miraversion</code></em></code>, the
stable versions of MIRA with the general public as audience usually
have a version number in three parts, like
<code class="filename">3.0.5</code>, sometimes also followed by some postfix
like in <code class="filename">3.2.0rc1</code> to denote release candidate 1
of the 3.2.0 version of MIRA. On very rare occasions, stable
versions of MIRA can have four parts like in, e.g.,
<code class="filename">3.4.0.1</code>: these versions create identical
binaries to their parent version (<code class="filename">3.4.0</code>) and
just contain fixes to the source build machinery.
</p><p>
The version string sometimes can have a different format:
<code class="filename"><span class="emphasis"><em>sometext</em></span>-0-g<span class="emphasis"><em>somehexnumber</em></span></code>
like in, e.g.,
<code class="filename">ft_fastercontig-0-g4a27c91</code>. These versions of
MIRA are snapshots from the development tree of MIRA and usually
contain new functionality which may not be as well tested as the
rest of MIRA; hence they contain more checks and more debugging output
to catch potential errors.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>OS-and-binarytype</code></em></code>
finally defines for which operating system and which processor class
the package is destined. E.g.,
<code class="filename">linux-gnu_x86_64_static</code> contains static
binaries for Linux running on a 64 bit processor.
</p></li></ul></div><p>
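As a rough illustration of this naming scheme, the following POSIX shell
sketch splits the name of a release package into its parts. It assumes a
release version without underscores; snapshot versions such as
<code class="filename">ft_fastercontig-0-g4a27c91</code> contain an underscore
and would need different handling. The variable names are made up for this
example and are not used by MIRA itself.
</p><pre class="screen">
# One of the example package names listed in this section.
pkg="mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2"

base="${pkg%.tar.bz2}"      # drop the archive suffix
rest="${base#mira_}"        # drop the leading "mira_"
miraversion="${rest%%_*}"   # text before the first underscore
binarytype="${rest#*_}"     # everything after the first underscore

echo "version: $miraversion"
echo "type:    $binarytype"</pre><p>
For the example name, this prints version <code class="filename">3.0.5</code>
and type <code class="filename">prod_linux-gnu_x86_64_static</code>.
</p><p>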
Source packages are usually named
</p><p>
<code class="filename">mira-<em class="replaceable"><code>miraversion</code></em>.tar.bz2</code>
</p><p>
Examples for packages at SourceForge:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><code class="filename">mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira_3.0.5_prod_linux-gnu_i686_32_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira_3.0.5_prod_OSX_snowleopard_x86_64_static.tar.bz2</code></li><li class="listitem"><code class="filename">mira-3.0.5.tar.bz2</code></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_precompiledbinary"></a>2.2.
Installing from a precompiled binary package
</h2></div></div></div><p>
The distributable package follows the
one-directory-which-contains-everything-which-is-needed philosophy, but
after unpacking and moving the package to its final destination, you
need to run a script which will create some data files.
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Download the package, unpack it.
</p></li><li class="listitem"><p>
Move the directory somewhere on your disk: either to one of the
"standard" places like, e.g., <code class="filename">/opt/mira</code> or
<code class="filename">/usr/local/mira</code>, or somewhere in your home
directory.
</p></li><li class="listitem"><p>
Softlink the binaries which are in the 'bin' directory into a
directory which is in your shell PATH. Then have the shell reload
the location of PATH binaries (either <code class="literal">hash -r</code>
for sh/bash or <code class="literal">rehash</code> for csh/tcsh).
</p><p>
Alternatively, add the <code class="filename">bin</code> directory of the
MIRA package to your PATH variable.
</p></li><li class="listitem"><p>
Test whether the binaries are installed ok via <code class="literal">mirabait
-v</code> which should return with the current version you
downloaded and installed.
</p></li><li class="listitem"><p>
Now you need to run a script which will unpack and reformat some
data needed by MIRA. That script is located in the
<code class="filename">dbdata</code> directory of the package and should
be called with the name of the <span class="emphasis"><em>SLS</em></span> file present
in the same directory like this:
</p><pre class="screen">
<code class="prompt">arcadia:/path/to/mirapkg$</code> <strong class="userinput"><code>cd dbdata</code></strong>
<code class="prompt">arcadia:/path/to/mirapkg/dbdata$</code> <strong class="userinput"><code>ls -l</code></strong>
drwxr-xr-x 3 bach bach 4096 2016-03-18 14:31 mira-createsls
-rwxr-xr-x 1 bach bach 2547 2015-12-14 04:33 mira-install-sls-rrna.sh
-rw-r--r-- 1 bach bach 337 2016-01-01 14:50 README.txt
lrwxrwxrwx 1 bach bach 10421035 2016-03-18 14:28 rfam_rrna-21-12.sls.gz
<code class="prompt">arcadia:/path/to/mirapkg/dbdata$</code> <strong class="userinput"><code>./mira-install-sls-rrna.sh rfam_rrna-21-12.sls.gz</code></strong></pre><p>
This will take a minute or so. Then you're done for MIRA.
</p></li></ol></div><p>
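For the alternative mentioned in step 3, adding the
<code class="filename">bin</code> directory to PATH, a minimal sh/bash sketch
could look like this. The location <code class="filename">/opt/mira</code> and
the variable name <code class="literal">MIRA_HOME</code> are assumptions for
illustration, not something the package defines:
</p><pre class="screen">
# Hypothetical install location; use the directory you moved the package to.
MIRA_HOME="/opt/mira"

# Put the package's bin directory in front of the existing search path.
PATH="$MIRA_HOME/bin:$PATH"
export PATH

# Make the shell forget previously cached command locations (sh/bash).
hash -r

# Print the first PATH entry as a quick check.
echo "${PATH%%:*}"</pre><p>
To keep the setting across sessions, the assignment and
<code class="literal">export</code> lines would typically go into your shell
start-up file.
</p><p>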
Additional scripts for special purposes are in the
<code class="filename">scripts</code> directory. You might or might not want to
have them in your $PATH.
</p><p>
Scripts and programs for MIRA from other authors are in the
<code class="filename">3rdparty</code> directory. Here too, you may or may not
want to have (some of them) in your $PATH.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_third_party_integration"></a>2.3.
Integration with third party programs (gap4, consed)
</h2></div></div></div><p>
MIRA sets tags in the assemblies that can be read and interpreted by the
Staden <span class="command"><strong>gap4</strong></span> package or
<span class="command"><strong>consed</strong></span>. These tags are extremely useful to
efficiently find places of interest in an assembly (be it de-novo or
mapping), but both <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>consed</strong></span>
need to be told about these tags.
</p><p>
Data files for a correct integration are delivered in the
<code class="filename">support</code> directory of the distribution. Please
consult the README in that directory for more information on how to
integrate this information in either of these packages.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_compiling"></a>2.4.
Compiling MIRA yourself
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_prereq"></a>2.4.1.
Prerequisites
</h3></div></div></div><p>
Compiling the 5.x series of MIRA needs a C++14 compatible tool chain, i.e.,
systems starting from 2013/2014 should be OK. The
requisites for <span class="emphasis"><em>compiling</em></span> MIRA are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
gcc ≥ 4.9.1, with libstdc++6. You really want to use a simple
installation package pre-configured for your system, but in case you
want or have to install gcc yourself, please refer to <a class="ulink" href="http://gcc.gnu.org/" target="_top">http://gcc.gnu.org/</a> for more information on the GNU compiler
collection.
</p></li><li class="listitem"><p>
BOOST library ≥ 1.48. Lower versions might work, but are
untested; you would need to change the version check in the configure
script for this to run through. You really want to use a simple
installation package pre-configured for your system, but in case you
want or have to install BOOST yourself, please refer to <a class="ulink" href="http://www.boost.org/" target="_top">http://www.boost.org/</a> for more information on the BOOST
library.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Do NOT use a so-called <span class="emphasis"><em>staged</em></span> BOOST library;
that will not work.
</td></tr></table></div></li><li class="listitem">
zlib. Should your system not have zlib installed or available as
simple installation package, please see <a class="ulink" href="http://www.zlib.net/" target="_top">http://www.zlib.net/</a> for more information regarding zlib.
</li><li class="listitem">
GNU make. Should your system not have gmake installed or available
as simple installation package, please see <a class="ulink" href="http://www.gnu.org/software/make/" target="_top">http://www.gnu.org/software/make/</a> for more information regarding
GNU make.
</li><li class="listitem">
GNU flex ≥ 2.5.33. Should your system not have flex installed or
available as simple installation package, please see <a class="ulink" href="http://flex.sourceforge.net/" target="_top">http://flex.sourceforge.net/</a> for more information regarding
flex.
</li><li class="listitem">
Expat library ≥ 2.0.1. Should your system not have the Expat library and
header files already installed or available as simple installation
package, you will need to download and install it yourself. Please see
<a class="ulink" href="http://www.libexpat.org/" target="_top">http://www.libexpat.org/</a> and <a class="ulink" href="http://sourceforge.net/projects/expat/" target="_top">http://sourceforge.net/projects/expat/</a> for information on how
to do this.
</li><li class="listitem">
xxd. A small utility from the <span class="command"><strong>vim</strong></span> package.
</li></ul></div><p>
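Before running <span class="command"><strong>configure</strong></span>, a quick
look at which of the command-line tools from the list above are present can
save a failed run. The sketch below only searches PATH for the programs and
assumes the compiler is installed under the name
<code class="literal">g++</code>. It does not check versions, and the
libraries (BOOST, zlib, Expat) are left for
<span class="command"><strong>configure</strong></span> to detect:
</p><pre class="screen">
# Collect the names of build tools that cannot be found in PATH.
missing=""
for tool in g++ make flex xxd; do
    if command -v "$tool" >/dev/null; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing="$missing $tool"
    fi
done

# Summarise what still has to be installed, if anything.
if [ -n "$missing" ]; then
    echo "install before running configure:$missing"
fi</pre><p>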
For <span class="emphasis"><em>building the documentation</em></span>, additional
prerequisites are from the DocBook tool chain:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
xsltproc + docbook-xsl for HTML output
</li><li class="listitem">
dblatex for PDF output
</li></ul></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
Previous versions of MIRA benefited from using the TCMalloc
library. This is not the case anymore! Indeed, tests showed that when
using TCMalloc, MIRA 4.9.x and above will probably need 20 to
30% <span class="emphasis"><em>more</em></span> max memory and up to 80% more overall
memory than without TCMalloc.
</p><p>
In short: do not use TCMalloc at the moment.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_comp"></a>2.4.2.
Compiling and installing
</h3></div></div></div><p>
MIRA uses the GNU autoconf/automake tools. Please read the section
"Basic Installation" of the <code class="filename">INSTALL</code> file in the
source package of MIRA for more generic information on how to invoke
them.
</p><p>
The short version: simply type
</p><pre class="screen">
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>./configure</code></strong>
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>make</code></strong>
<code class="prompt">arcadia:/path/to/mira-5.0.0$</code> <strong class="userinput"><code>make install</code></strong></pre><p>
This should install the following programs:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><span class="command"><strong>mira</strong></span></li><li class="listitem"><span class="command"><strong>miraconvert</strong></span></li><li class="listitem"><span class="command"><strong>mirabait</strong></span></li><li class="listitem"><span class="command"><strong>miramem</strong></span></li></ul></div><p>
Should the <code class="literal">./configure</code> step fail for some reason or
another, you should get a message telling you at which step this
happens. Then either install the missing packages or tell
<span class="command"><strong>configure</strong></span> where it should search for the packages it
did not find; see also the next section.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_comp_conf"></a>2.4.3.
Configure switches for MIRA
</h3></div></div></div><p>
MIRA understands all standard autoconf configure switches like <code class="literal">--prefix=</code>
etc. Please consult the INSTALL file in the MIRA top level directory
of the source package and also call <code class="literal">./configure
--help</code> to get a full list of currently supported switches.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_install_comp_conf_boost"></a>2.4.3.1.
BOOST configure switches for MIRA
</h4></div></div></div><p>
BOOST is maybe the most tricky library to get right in case it does
not come pre-configured for your system. The two main switches for
helping to locate BOOST are
probably <code class="literal">--with-boost=[ARG]</code>
and <code class="literal">--with-boost-libdir=LIB_DIR</code>. Only if those
two fail, try using the other <code class="literal">--with-boost-*=</code> switches
you will see from the ./configure help text.
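As a sketch, a call using both switches could look like the following. The
paths under <code class="filename">/opt/boost</code> are placeholders for
wherever BOOST lives on your system:
</p><pre class="screen">
<strong class="userinput"><code>./configure --with-boost=<em class="replaceable"><code>/opt/boost</code></em> --with-boost-libdir=<em class="replaceable"><code>/opt/boost/lib</code></em></code></strong></pre><p>
In this sketch the first switch is assumed to name the BOOST installation
directory and the second the directory holding the compiled BOOST libraries;
check the <code class="literal">./configure --help</code> output of your
source package for the exact meaning.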
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_install_comp_conf_mira"></a>2.4.3.2.
MIRA specific configure switches
</h4></div></div></div><p>
MIRA honours the following switches:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
--enable-64=yes/no
</span></dt><dd><p>
MIRA should happily build as 32 bit executable on 32 bit
platforms and as 64 bit executable on 64 bit platforms. On 64
bit platforms, setting the switch to 'no' forces the compiler
to produce 32 bit executables (if possible).
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
As of MIRA 3.9.0, support for 32 bit platforms is being
slowly phased out. While MIRA should compile and also run fine
on 32 bit platforms, I do not guarantee it anymore as I
haven't used 32 bit systems in the last 5 years.
</td></tr></table></div></dd><dt><span class="term">
--enable-warnings
</span></dt><dd>
Enables compiler warnings, useful only for developers, not for users.
</dd><dt><span class="term">
--enable-debug
</span></dt><dd>
Lets the MIRA binary contain C/C++ debug symbols.
</dd><dt><span class="term">
--enable-mirastatic
</span></dt><dd>
Builds static binaries which are easier to distribute. Some
platforms (like OpenSolaris) might not like this and you will
get an error from the linker.
</dd><dt><span class="term">
--enable-optimisations
</span></dt><dd>
Instructs the configure script to set optimisation switches for compiling
(on by default). Switching optimisations off (warning, high impact on
run-time) might be interesting only for, e.g., debugging with valgrind.
</dd><dt><span class="term">
--enable-publicquietmira
</span></dt><dd>
Some parts of MIRA can dump additional debug information during
assembly; setting this switch to "no" enables this. Warning:
MIRA will be a bit chatty, so using this is not recommended for
public usage.
</dd><dt><span class="term">
--enable-developmentversion
</span></dt><dd>
Using MIRA with enabled development mode may lead to extra
output on stdout as well as some additional data in the results
which should not appear in real world data
</dd><dt><span class="term">
--enable-boundtracking
</span></dt><dd></dd><dt><span class="term">
--enable-bugtracking
</span></dt><dd>
Both flags above compile some basic checks into mira that
look for sanity within some functions. Leaving this on "yes"
(default) is encouraged; the impact on run time is minimal.
</dd></dl></div></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_walkthroughs"></a>2.5.
Installation walkthroughs
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_kubuntu"></a>2.5.1.
(K)Ubuntu 12.04
</h3></div></div></div><p>
You will need to install a couple of tools and libraries before
compiling MIRA. Here's the recipe:
</p><pre class="screen">
<strong class="userinput"><code>sudo apt-get install make flex
sudo apt-get install libboost-doc libboost.*1.48-dev libboost.*1.48.0</code></strong></pre><p>
Once this is done, you can unpack and compile MIRA. For a dynamically
linked version, use:
</p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make && make install</code></strong></pre><p>
For a statically linked version, just change the configure line from
above into
</p><pre class="screen">
<strong class="userinput"><code>./configure <em class="replaceable"><code>--enable-mirastatic</code></em></code></strong></pre><p>
In case you also want to build documentation yourself, you will need
this in addition:
</p><pre class="screen"><strong class="userinput"><code>sudo apt-get install xsltproc docbook-xsl dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
People working on git checkouts of the MIRA source code will
obviously need some more tools. Get them with this:
</p><pre class="screen"><strong class="userinput"><code>sudo apt-get install automake libtool xutils-dev</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_opensuse"></a>2.5.2.
openSUSE 12.1
</h3></div></div></div><p>
You will need to install a couple of tools and libraries before
compiling MIRA. Here's the recipe:
</p><pre class="screen">
<strong class="userinput"><code>sudo zypper install gcc-c++ boost-devel
sudo zypper install flex libexpat-devel zlib-devel</code></strong></pre><p>
Once this is done, you can unpack and compile MIRA. For a dynamically
linked version, use:
</p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make && make install</code></strong></pre><p>
In case you also want to build documentation yourself, you will need
this in addition:
</p><pre class="screen"><strong class="userinput"><code>sudo zypper install docbook-xsl-stylesheets dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
People working on git checkouts of the MIRA source code will
obviously need some more tools. Get them with this:
</p><pre class="screen"><strong class="userinput"><code>sudo zypper install automake libtool xutils-dev</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_fedora"></a>2.5.3.
Fedora 17
</h3></div></div></div><p>
You will need to install a couple of tools and libraries before
compiling MIRA. Here's the recipe:
</p><pre class="screen">
<strong class="userinput"><code>sudo yum -y install gcc-c++ boost-devel
sudo yum install flex expat-devel vim-common zlib-devel</code></strong></pre><p>
Once this is done, you can unpack and compile MIRA. For a dynamically
linked version, use:
</p><pre class="screen">
<strong class="userinput"><code>tar xvjf <em class="replaceable"><code>mira-5.0.0.tar.bz2</code></em>
cd <em class="replaceable"><code>mira-5.0.0</code></em>
./configure
make && make install</code></strong></pre><p>
In case you also want to build documentation yourself, you will need
this in addition:
</p><pre class="screen"><strong class="userinput"><code>sudo yum -y install docbook-xsl dblatex</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
People working on git checkouts of the MIRA source code will
obviously need some more tools. Get them with this:
</p><pre class="screen"><strong class="userinput"><code>sudo yum -y install automake libtool xorg-x1-util-devel</code></strong></pre></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_osx"></a>2.5.4.
Mac OSX
</h3></div></div></div><p>
These instructions are for OSX 10.11 (El Capitan) and use
MacPorts. There are other ways to do this (e.g., see the "Compile
everything from scratch" walkthrough), but they are definitely more painful.
</p><p>
If you do not already have it, install MacPorts. See <a class="ulink" href="https://www.macports.org/install.php" target="_top">https://www.macports.org/install.php</a>. Then have the port
system fetch information on the newest ports (can take a while):
</p><pre class="screen">
<strong class="userinput"><code>sudo port selfupdate</code></strong>
</pre><p>
Then go on and install gcc (this is going to take a long time) and
then switch to gcc5:
</p><pre class="screen">
<strong class="userinput"><code>sudo port install m4 gcc5</code></strong>
<strong class="userinput"><code>sudo port select --set gcc mp-gcc5</code></strong>
</pre><p>
Now, the libraries you need to download and compile need to be
installed somewhere. You can take a path in your home directory or any
other path in the system you have access to, for the sake of this
walkthrough we'll continue with
<code class="filename">/opt/biosw/gccchain</code>
</p><p>
Download and install a current flex. Use at least 2.6.0. If for some
reason you need to use flex 2.5.38 or .39, take care to apply the
patch described here: <a class="ulink" href="https://sourceforge.net/p/flex/bugs/182/" target="_top">https://sourceforge.net/p/flex/bugs/182/</a>. Configure flex to be
installed into the directory you chose in the previous step:
</p><pre class="screen">
<strong class="userinput"><code>tar xvf flex-2.6.0.tar.bz2</code></strong>
<strong class="userinput"><code>cd flex-2.6.0</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make</code></strong>
<strong class="userinput"><code>make install</code></strong>
</pre><p>
That done, proceed likewise with the expat and zlib libraries:
</p><pre class="screen">
<strong class="userinput"><code>cd ..</code></strong>
<strong class="userinput"><code>tar xvf expat-2.1.0.tar.gz</code></strong>
<strong class="userinput"><code>cd expat-2.1.0</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make</code></strong>
<strong class="userinput"><code>make install</code></strong>
<strong class="userinput"><code>cd ..</code></strong>
<strong class="userinput"><code>tar xvf zlib-1.2.8.tar.gz</code></strong>
<strong class="userinput"><code>cd zlib-1.2.8</code></strong>
<strong class="userinput"><code>./configure --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
<strong class="userinput"><code>make install</code></strong>
</pre><p>
The bzip2 library needs a different installation command line:
</p><pre class="screen">
<strong class="userinput"><code>cd ..</code></strong>
<strong class="userinput"><code>tar xvf bzip2-1.0.6.tar.gz</code></strong>
<strong class="userinput"><code>cd bzip2-1.0.6</code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
<strong class="userinput"><code>make install PREFIX=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
</pre><p>
The last library to be installed for the MIRA compilation is Boost:
</p><pre class="screen">
<strong class="userinput"><code>cd ..</code></strong>
<strong class="userinput"><code>tar xvf boost_1_59_0.tar.bz2</code></strong>
<strong class="userinput"><code>cd boost_1_59_0</code></strong>
<strong class="userinput"><code>./bootstrap.sh --prefix=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>./b2 -j 4</code></strong>
<strong class="userinput"><code>./b2 install</code></strong>
</pre><p>
Now unpack MIRA, configure it and compile. Remember to give the configure
script the location of every package you just installed, or else it
might pick up a version installed by the system (and compiled with
a different compiler), which would invariably lead to errors in the
linker stage of the compilation.
</p><pre class="screen">
<strong class="userinput"><code>cd ..</code></strong>
<strong class="userinput"><code>tar xvf mira-5.0.0.tar.bz2</code></strong>
<strong class="userinput"><code>cd mira-5.0.0</code></strong>
<strong class="userinput"><code>./configure --enable-debug \
--with-boost=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
--with-boost-libdir=<em class="replaceable"><code>/opt/biosw/gccchain</code></em>/lib \
--with-expat=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
--with-zlib=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>make -j 4</code></strong>
</pre><p>
That's it for the dynamic version.
</p><p>
For building an almost static version, we need some trickery: after
the configure (this time with the mirastatic argument), create a
special directory <code class="filename">OSXstatlibs</code> in which we
softlink all static libraries MIRA needs. This directory will be
searched first by the build scripts generated by the
<span class="command"><strong>libtool</strong></span> suite during the linking stage of MIRA.
</p><pre class="screen">
<strong class="userinput"><code>./configure --enable-mirastatic --enable-debug \
--with-boost=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
--with-boost-libdir=<em class="replaceable"><code>/opt/biosw/gccchain</code></em>/lib \
--with-expat=<em class="replaceable"><code>/opt/biosw/gccchain</code></em> \
--with-zlib=<em class="replaceable"><code>/opt/biosw/gccchain</code></em></code></strong>
<strong class="userinput"><code>mkdir OSXstatlibs</code></strong>
<strong class="userinput"><code>cd OSXstatlibs</code></strong>
<strong class="userinput"><code>ln -s /opt/biosw/gccchain/lib/*a .</code></strong>
<strong class="userinput"><code>ln -s /opt/local/lib/*a .</code></strong>
</pre><p>
Note that <code class="filename">/opt/local</code> is the standard installation
path of the MacPorts programs. If you changed that, you need to adapt
it here, too.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_allfromscratch"></a>2.5.5.
Compile everything from scratch
</h3></div></div></div><p>
This lets you build a self-contained static MIRA binary. The only
prerequisite here is that you have a working <span class="command"><strong>gcc</strong></span>
with the minimum version described above. Please download all
necessary files (expat, flex, etc.) and then simply follow the
script below. The only things that you will want to change are the
paths used and, maybe, the names of some packages in case they were
bumped up a version or revision.
</p><p>
Contributed by Sven Klages.
</p><pre class="screen">
## whatever path is appropriate
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em></code></strong>
## expat
<strong class="userinput"><code>tar zxvf <em class="replaceable"><code>expat-2.0.1.tar.gz</code></em>
cd <em class="replaceable"><code>expat-2.0.1</code></em>
./configure <em class="replaceable"><code>--prefix=/home/gls/SvenTemp/expat</code></em>
make && make install</code></strong>
## flex
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>flex-2.5.35.tar.gz</code></em>
cd <em class="replaceable"><code>flex-2.5.35</code></em>
./configure <em class="replaceable"><code>--prefix=/home/gls/SvenTemp/flex</code></em>
make && make install
cd <em class="replaceable"><code>/home/gls/SvenTemp/flex/bin</code></em>
ln -s flex flex++
export PATH=<em class="replaceable"><code>/home/gls/SvenTemp/flex/bin</code></em>:$PATH</code></strong>
## boost
<strong class="userinput"><code>cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>boost_1_48_0.tar.gz</code></em>
cd <em class="replaceable"><code>boost_1_48_0</code></em>
./bootstrap.sh --prefix=<em class="replaceable"><code>/home/gls/SvenTemp/boost</code></em>
./b2 install</code></strong>
## mira itself
<strong class="userinput"><code>export CXXFLAGS="-I<em class="replaceable"><code>/home/gls/SvenTemp/flex/include</code></em>"
cd <em class="replaceable"><code>/home/gls/SvenTemp/install</code></em>
tar zxvf <em class="replaceable"><code>mira-3.4.0.1.tar.gz</code></em>
cd <em class="replaceable"><code>mira-3.4.0.1</code></em>
./configure --prefix=<em class="replaceable"><code>/home/gls/SvenTemp/mira</code></em> \
--with-boost=<em class="replaceable"><code>/home/gls/SvenTemp/boost</code></em> \
--with-expat=<em class="replaceable"><code>/home/gls/SvenTemp/expat</code></em> \
--enable-mirastatic
make && make install</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_walkthroughs_dynamic"></a>2.5.6.
Dynamically linked MIRA
</h3></div></div></div><p>
In case you do not want a static binary of MIRA, but a dynamically
linked version, the following script by Robert Bruccoleri will give
you an idea of how to do this.
</p><p>
Note that he, having root rights, puts all additional software in
/usr/local, and in particular, he keeps updated versions of Boost and
Flex there.
</p><pre class="screen">
#!/bin/sh -x
make distclean
oze=`find . -name "*.o" -print`
if [ -n "$oze" ]
then
echo "Not clean."
exit 1
fi
export prefix=${BUILD_PREFIX:-/usr/local}
export LDFLAGS="-Wl,-rpath,$prefix/lib"
./configure --prefix=$prefix \
--enable-debug=yes \
--enable-mirastatic=no \
--with-boost-libdir=$prefix/lib \
--enable-optimisations \
--enable-boundtracking=yes \
--enable-bugtracking=yes \
--enable-extendedbugtracking=no
make
make install</pre></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_hintotherplatforms"></a>2.6.
Compilation hints for other platforms.
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_hintnetbsd5"></a>2.6.1.
NetBSD 5 (i386)
</h3></div></div></div><p>
Contributed by Thomas Vaughan
</p><p>
The system flex <span class="emphasis"><em>(/usr/bin/flex)</em></span> is too old, but the
devel/flex package from a recent pkgsrc works fine. BSD make doesn't
like one of the lines in <span class="emphasis"><em>src/progs/Makefile</em></span>, so use GNU make instead
(available from <span class="emphasis"><em>pkgsrc</em></span> as <span class="emphasis"><em>devel/gmake</em></span>). Other relevant pkgsrc packages:
<span class="emphasis"><em>devel/boost-libs</em></span>, <span class="emphasis"><em>devel/boost-headers</em></span>
and <span class="emphasis"><em>textproc/expat</em></span>. The configure script has to
be told about these pkgsrc prerequisites (they are usually rooted
at <span class="emphasis"><em>/usr/pkg</em></span> but other locations are possible):
</p><pre class="screen"><strong class="userinput"><code>FLEX=/usr/pkg/bin/flex ./configure --with-expat=/usr/pkg --with-boost=/usr/pkg</code></strong></pre><p>
If attempting to build a pkgsrc package of MIRA, note that the LDFLAGS
passed by the pkgsrc mk files don't remove the need for
the <span class="emphasis"><em>--with-boost</em></span> option. The configure script
complains about flex being too old, but this is harmless because it
honours the $FLEX variable when writing out makefiles.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_install_notesformaintainers"></a>2.7.
Notes for distribution maintainers / system administrators
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_install_additionaldatafiles"></a>2.7.1.
Additional data files
</h3></div></div></div><p>
Depending on options/parameters, the MIRA/mirabait binary may need
to load some additional data during the run. By default, this data will
always be searched for at this location:
<code class="filename">LOCATION_OF_BINARY/../share/mira/...</code>
</p><p>
That is: If the binary is, e.g.,
<code class="filename">/opt/mira5/bin/mira</code> with a softlink pointing from
<code class="filename">/usr/local/bin/mira -> /opt/mira5/bin/mira</code>
(because, e.g., <code class="filename">/usr/local/bin</code> may be by default in your
PATH variable), then the additional data will be searched for in
<code class="filename">/opt/mira5/share/mira/...</code> and NOT in
<code class="filename">/usr/local/share/mira/...</code>.
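</p><p>
The rule above can be checked from the shell. The following is only a sketch: it
assumes <span class="command"><strong>mira</strong></span> is found via your PATH and that your
<span class="command"><strong>readlink</strong></span> supports the <code class="literal">-f</code>
option (GNU coreutils does). It mimics the lookup described here and does not
call MIRA itself:
</p><pre class="screen">
# resolve softlinks to find the real location of the binary
<strong class="userinput"><code>bin=$(readlink -f "$(command -v mira)")</code></strong>
# LOCATION_OF_BINARY/../share/mira
<strong class="userinput"><code>ls "$(dirname "$(dirname "$bin")")/share/mira"</code></strong>
</pre><p>
With the example paths used above, this lists the contents of
<code class="filename">/opt/mira5/share/mira</code>. An error from
<span class="command"><strong>ls</strong></span> suggests that the data directory is not where
the binary will look for it.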
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
In short: since MIRA 4.9.6, moving the binary is no longer
enough. Take care to have the <span style="color: red"><em>share</em></span> directory in the
right place, i.e., adjacent to the directory the MIRA binary lives in.
</td></tr></table></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_reference"></a>Chapter 3. MIRA 4 reference manual</h1></div><div><h3 class="subtitle"><i>aka: The extended man page of MIRA 4,
a genome and EST/RNASeq sequence assembly and mapping program for Sanger, 454, IonTorrent,
PacBio and Illumina/Solexa sequencing data</i></h3></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_ref_synopsis">3.1.
Synopsis
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_requirements">3.2.
Requirements
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_working_modes">3.3.
Working modes
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_config">3.4.
Configuring an assembly: files and parameters
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_manifest_introduction">3.4.1.
The manifest file: introduction
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_manifest_basics">3.4.2.
The manifest file: basics
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_manifest_readgroups">3.4.3.
The manifest file: information on the data you have
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_readgroup">3.4.3.1.
Starting a new readgroup
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_data">3.4.3.2.
Defining data files to load
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_defaultqual">3.4.3.3.
Setting default quality
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_technology">3.4.3.4.
Defining technology used to sequence
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_nostatistics">3.4.3.5.
Preventing statistics for technologies with biases
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_asreference">3.4.3.6.
Setting reference sequence for mapping jobs
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_autopairing">3.4.3.7.
Autopairing: letting MIRA find out pair info by itself
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_templatesize">3.4.3.8.
Setting size of read templates
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_segplace">3.4.3.9.
Read segment placement
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_segname">3.4.3.10.
Read segment naming
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_strainname">3.4.3.11.
Strain naming
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_datadirscf">3.4.3.12.
Data directory for SCF files
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_manifest_readgroups_renameprefix">3.4.3.13.
Renaming read name prefixes
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_manifest_parameters">3.4.4.
The manifest file: extended parameters
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_parameter_groups">3.4.4.1.
Parameter groups
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_technology_sections">3.4.4.2.
Technology sections
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_parameter_shortnames">3.4.4.3.
Parameter short names
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_order_dependent_quick_switches">3.4.4.4.
Order dependent quick switches
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_general_ge">3.4.4.5.
Parameter group: -GENERAL (-GE)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_assembly_as">3.4.4.6.
Parameter group: -ASSEMBLY (-AS)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_strain_backbone_sb">3.4.4.7.
Parameter group: -STRAIN/BACKBONE (-SB)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_dataprocessing_dp">3.4.4.8.
Parameter group: -DATAPROCESSING (-DP)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_clipping_cl">3.4.4.9.
Parameter group: -CLIPPING (-CL)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_skim_sk">3.4.4.10.
Parameter group: -SKIM (-SK)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_hashstatistics_hs">3.4.4.11.
Parameter group: -KMERSTATISTICS (-KS)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_align_al">3.4.4.12.
Parameter group: -ALIGN (-AL)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_contig_co">3.4.4.13.
Parameter group: -CONTIG (-CO)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_edit_ed">3.4.4.14.
Parameter group: -EDIT (-ED)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_misc_mi">3.4.4.15.
Parameter group: -MISC (-MI)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_misc_nw">3.4.4.16.
Parameter group: -NAG_AND_WARN (-NW)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_directory_dir_di">3.4.4.17.
Parameter group: -DIRECTORY (-DIR, -DI)
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_output_out">3.4.4.18.
Parameter group: -OUTPUT (-OUT)
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_ref_resuming_assemblies">3.5.
Resuming / restarting assemblies
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_input_output">3.6.
Input / Output
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_directories">3.6.1.
Directories
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_filenames">3.6.2.
Filenames
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_output">3.6.2.1.
Output
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_assembly_statistics_and_information_files">3.6.2.2.
Assembly statistics and information files
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_file_formats">3.6.3.
File formats
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_stdout_stderr">3.6.4.
STDOUT/STDERR
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_ssaha2smalt">3.6.5.
SSAHA2 / SMALT ancillary data
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_xml_traceinfo">3.6.6.
XML TRACEINFO ancillary data
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_contig_naming">3.6.7.
Contig naming
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_recovering_strain_specific_consensus">3.6.8.
Recovering strain specific consensus as FASTA
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_tags_used_in_the_assembly_by_mira_and_edit">3.7.
Tags used in the assembly by MIRA and EdIt
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_tags_read_and_used">3.7.1.
Tags read (and used)
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_tags_set_and_used">3.7.2.
Tags set (and used)
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_contigs_singlets_debris">3.8.
Where reads end up: contigs, singlets, debris
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_snp_discovery">3.9.
Detection of bases distinguishing non-perfect repeats and SNP discovery
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_data_reduction">3.10.
Data reduction: subsampling vs. lossless digital normalisation
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_caveats">3.11.
Caveats
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_using_artificial_reads">3.11.1.
Using data not from sequencing instruments: artificial / synthetic reads
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_ploidy_and_repeats">3.11.2.
Ploidy and repeats
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_handling_of_repeats">3.11.3.
Handling of repeats
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_ref_uniform_read_distribution">3.11.3.1.
Uniform read distribution
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_keeping_'long'_repetitive_contigs_separate">3.11.3.2.
Keeping 'long' repetitive contigs separate
</a></span></dt><dt><span class="sect3"><a href="#sect_ref_helping_finishing_by_tagging_reads_with_haf_tags">3.11.3.3.
Helping finishing by tagging reads with HAF tags
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_ref_consensus_in_finishing_programs_gap4_consed_">3.11.4.
Consensus in finishing programs (gap4, consed, ...)
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_some_other_things_to_consider">3.11.5.
Some other things to consider
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_things_you_should_not_do">3.12.
Things you should not do
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_never_on_nfs">3.12.1.
Do not run MIRA on NFS mounted directories without redirecting the tmp directory
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_never_without_quality_values">3.12.2.
Do not assemble without quality values
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_useful_third_party_programs">3.13.
Useful third party programs
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_speed_and_memory_considerations">3.14.
Speed and memory considerations
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_ref_memory">3.14.1.
Estimating needed memory for an assembly project
</a></span></dt><dt><span class="sect2"><a href="#sect_ref_speed">3.14.2.
Some numbers on speed
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_ref_known_problems_bugs">3.15.
Known Problems / Bugs
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_todos">3.16.
TODOs
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_working_principles">3.17.
Working principles
</a></span></dt><dt><span class="sect1"><a href="#sect_ref_see_also">3.18.
See Also
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">The manual only makes sense after you learn the program.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_synopsis"></a>3.1.
Synopsis
</h2></div></div></div><p>
<code class="literal">mira [-chmMrtv] <em class="replaceable"><code>manifest-file</code></em> [<em class="replaceable"><code>manifest-file</code></em> ...]</code>
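</p><p>
A few typical invocations, given as a sketch only: the file name
<code class="filename">manifest.conf</code> is a placeholder, and the options used are
the ones described in the list below.
</p><pre class="screen">
# only check the manifest file, then exit
<strong class="userinput"><code>mira -m <em class="replaceable"><code>manifest.conf</code></em></code></strong>
# check the manifest file and the presence of the data files, then exit
<strong class="userinput"><code>mira -M <em class="replaceable"><code>manifest.conf</code></em></code></strong>
# run the assembly, forcing 4 threads
<strong class="userinput"><code>mira -t 4 <em class="replaceable"><code>manifest.conf</code></em></code></strong>
# resume an interrupted de-novo assembly
<strong class="userinput"><code>mira -r <em class="replaceable"><code>manifest.conf</code></em></code></strong>
</pre><p>
Running a check with <code class="literal">-m</code> or <code class="literal">-M</code> before a
long assembly is a cheap way to catch mistakes in the manifest early.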
</p><p>
The command line parameters in short:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[-c / --cwd=<em class="replaceable"><code>directory</code></em>]
</span></dt><dd>
Change working directory.
</dd><dt><span class="term">
[-h / --help]
</span></dt><dd>
Print a short help and exit.
</dd><dt><span class="term">
[-m / --mcheck]
</span></dt><dd>
Only check the manifest file, then exit.
</dd><dt><span class="term">
[-M / --mdcheck]
</span></dt><dd>
Only check the manifest file and presence of data files, then exit.
</dd><dt><span class="term">
[-r / --resume]
</span></dt><dd>
Resume / restart an interrupted assembly. Works only for de-novo
assemblies at the moment.
</dd><dt><span class="term">
[-t / --thread=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd>
Force number of threads (overrides equivalent [-GE:not]
manifest entry).
</dd><dt><span class="term">
[-v / --version]
</span></dt><dd>
Print version and exit.
</dd></dl></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_requirements"></a>3.2.
Requirements
</h2></div></div></div><p>
To use MIRA itself, one doesn't need very much:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Sequence data in EXP, CAF, PHD, FASTA or FASTQ format
</p></li><li class="listitem"><p>
Optionally: ancillary information in NCBI traceinfo XML format;
ancillary information about strains in tab-delimited format; vector
screen information generated with <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span>.
</p></li><li class="listitem"><p>
Some memory and disk space. Actually lots of both if you are
venturing into 454 or Illumina.
</p></li></ul></div><p>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_working_modes"></a>3.3.
Working modes
</h2></div></div></div><p>
MIRA has three basic working modes: genome, EST/RNASeq or
EST-reconstruction-and-SNP-detection. From version 2.4 on, there is
only one executable which supports all modes. The name with which this
executable is called defines the working mode:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="command"><strong>mira</strong></span> for assembly of genomic data as well as
assembly of EST data from one or multiple strains / organisms
</p><p>
and
</p></li><li class="listitem"><p>
<span class="command"><strong>miraSearchESTSNPs</strong></span> for assembly of EST data from
different strains (or organisms) and SNP detection within this
assembly. This is the former <span class="command"><strong>miraEST</strong></span> program
which was renamed as many people got confused regarding whether to
use MIRA in est mode or miraEST.
</p></li></ol></div><p>
Note that <span class="command"><strong>miraSearchESTSNPs</strong></span> is usually realised as
a link to the <span class="command"><strong>mira</strong></span> executable; the executable
decides which module to start by the name it was called with.
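</p><p>
If such a link is missing in your installation, it can be created by hand. A
minimal sketch, assuming the binary lives in
<code class="filename">/opt/mira5/bin</code> (a hypothetical path, adapt it to your
installation):
</p><pre class="screen">
<strong class="userinput"><code>cd <em class="replaceable"><code>/opt/mira5/bin</code></em></code></strong>
<strong class="userinput"><code>ln -s mira miraSearchESTSNPs</code></strong>
</pre><p>
Calling <span class="command"><strong>miraSearchESTSNPs</strong></span> then runs the same
executable, which selects the EST SNP search module from the name it was
called with.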
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_config"></a>3.4.
Configuring an assembly: files and parameters
</h2></div></div></div><p>
All the configuration needed for an assembly is done in one (or several)
configuration file(s): the <span class="emphasis"><em>manifest</em></span> files. This
encompasses things like what kind of assembly you want to perform
(genome or EST / RNASeq, mapping or de-novo, etc.) or which data files
contain the sequences you want to assemble (and in which format these
are).
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_introduction"></a>3.4.1.
The manifest file: introduction
</h3></div></div></div><p>
A <span class="emphasis"><em>manifest</em></span> file can be seen as a two part
configuration file for an assembly: the first part contains some
general information while the second part contains information about
the sequencing data to be loaded. Examples being always easier to
follow than long texts, here's an example for a de-novo assembly with
single-end (also called shotgun) 454 data:
</p><pre class="screen"># Example for a manifest describing a simple 454 de-novo assembly
# A manifest file can contain comment lines, these start with the #-character
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 threads in parallel (where possible)
<strong class="userinput"><code>
project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em>
parameters = <em class="replaceable"><code>-GE:not=4</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups": this reflects the
# ... that read sequences ...
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>TCMFS456ZH345.fastq TQF92GT7H34.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong></pre><p>
To make things a bit more interesting, here's an example using a
couple more technologies and showing some more options of the manifest
file like wild cards in file names, different paired-end/mate-pair
libraries and how to let MIRA refine pairing information (or even find
out everything by itself):
</p><pre class="screen"># Example for a manifest describing a de-novo assembly with
# unpaired 454, paired-end Illumina, a mate-pair Illumina
# and a paired Ion Torrent
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 passes with kmer sizes of
# 17, 31, 63 and 127 nucleotides. Obviously, read lengths of the
# libraries should be greater than 127 bp.
# Note: usually MIRA will choose sensible options for number of
# passes and kmer sizes to be used by itself.
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em>
parameters = <em class="replaceable"><code>-AS:kms=17,31,63,127</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups": this reflects the
# ... that read sequences ...
# defining the shotgun (i.e. unpaired) 454 reads
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>TCMFS456ZH345.fastq TQF92GT7H34.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong>
# defining the paired-end Illumina reads, fixing all needed pair information
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomePairedEndIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>datape*.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>100 300</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em>
segment_naming = <em class="replaceable"><code>solexa</code></em></code></strong>
# defining the mate-pair Illumina reads, fixing most needed pair information
# but letting MIRA refine the template_size via "autorefine"
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeMatePairIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>datamp*.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>2000 4000 autorefine</code></em>
segment_placement = <em class="replaceable"><code><--- ---></code></em>
segment_naming = <em class="replaceable"><code>solexa</code></em></code></strong>
# defining paired Ion Torrent reads
# example to show how lazy one can be and simply let MIRA estimate by itself
# all needed pairing information via "autopairing"
# Hint: it usually does a better job at it than we do ;-)
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomePairedIonReadsIGotFromTheLab</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>dataion*.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_basics"></a>3.4.2.
The manifest file: basics
</h3></div></div></div><p>
The first part of an assembly <span class="emphasis"><em>manifest</em></span> contains
the very basic information the assembler needs to have to know what
you want it to do. This part consists of exactly three entries:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="bold"><strong>project =</strong></span>
<em class="replaceable"><code>project name</code></em> tells the assembler
the name you wish to give to the whole assembly project. MIRA will
use that name throughout the whole assembly for naming
directories, files and a couple of other things.
</p><p>
You can name the assembly any way you want; you should, however,
restrain yourself and use only alphanumeric characters and perhaps
the characters plus, minus and underscore. Using slashes or
backslashes here is a recipe for catastrophe.
</p></li><li class="listitem"><p>
<span class="bold"><strong>job =</strong></span>
[<em class="replaceable"><code>denovo|mapping</code></em>],
[<em class="replaceable"><code>genome|est|fragments|clustering</code></em>],
[<em class="replaceable"><code>draft|accurate</code></em>] tells the
assembler what kind of data it should expect and how it should
assemble it.
</p><p>
You make your choice in three steps and, in the end,
concatenate your choices into the [job=] entry of the manifest:
</p><div class="orderedlist"><ol class="orderedlist" type="a"><li class="listitem"><p>
are you building an assembly from scratch
(choose: <span class="emphasis"><em>denovo</em></span>) or are you mapping reads
to an existing backbone sequence
(choose: <span class="emphasis"><em>mapping</em></span>)? Pick one. Leaving this
out automatically chooses <span class="emphasis"><em>denovo</em></span> as
default.
</p></li><li class="listitem"><p>
are the data you are assembling forming a larger contiguous
sequence (choose: <span class="emphasis"><em>genome</em></span>), are you
assembling EST or mRNA libraries
(choose: <span class="emphasis"><em>est</em></span>), single genes or small
plasmids (choose: <span class="emphasis"><em>fragments</em></span>) or do you cluster assembled
sequences (choose: <span class="emphasis"><em>clustering</em></span>)?
Pick one. Leaving this out
automatically chooses <span class="emphasis"><em>genome</em></span> as default.
</p><p>
Since version 4.9.4, a new mode <span class="emphasis"><em>fragments</em></span>
is available. This mode is essentially similar to the
<span class="emphasis"><em>EST</em></span> mode, but all safety features
which reduce data sizes are switched off. Use this mode for
the assembly of comparatively small EST/mRNA, small plasmid or
single gene projects where you
want the highest accuracy and minimal filtering. Warning:
contigs with coverages going into the 1000s will lead to
really slow assemblies.
</p><p>
Since version 4.9.6, a new mode <span class="emphasis"><em>clustering</em></span>
is available. This mode is meant for clustering
assembled contigs such as those created in mRNA or EST
assemblies. Basic parameters are: single pass, no clipping, no
editing, ~7.5% differences between sequences allowed,
gaps >= 13 bases disallowed, a single occurrence of a disagreeing
base leads to SNP tagging.
Warning: do not use this mode with any type of real sequencing data;
you would probably regret it.
</p></li><li class="listitem"><p>
do you want a quick and dirty assembly for first insights
(choose: <span class="emphasis"><em>draft</em></span>) or an assembly that should
be able to tackle even the nastiest cases (choose:
<span class="emphasis"><em>accurate</em></span>)? Pick one. Leaving this out
automatically chooses <span class="emphasis"><em>accurate</em></span> as default.
</p></li></ol></div><p>
Once you have made your choices, concatenate them with
commas. E.g.:
'<code class="literal">job = mapping,genome,draft</code>' will give you a
mapping assembly of a genome in draft quality.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
For de-novo assembly of genomes, these switches are optimised for
'decent' coverages that are commonly seen to get you something useful,
i.e., ≥ 7x for Sanger, ≥ 18x for 454 FLX or Titanium, ≥ 25x for
454 GS20 and ≥ 30x for Solexa. Should you venture into lower
coverage or extremely high coverage (say, ≥ 60x for 454), you will
need to adapt a few parameters via extensive switches.
</td></tr></table></div></li><li class="listitem"><p>
<span class="bold"><strong>parameters =</strong></span> is used in case you
want to change one of the 150+ extended parameters MIRA has to
offer to control almost every aspect of an assembly. This is
described in more detail in a separate section below.
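</p><p>
Put together, the first part of a manifest can be as short as the
following sketch. The project name is a made-up placeholder, and the
<span class="bold"><strong>parameters</strong></span> line is left out on the
assumption that no extended parameter needs changing:
</p><pre class="screen"># hypothetical project name: alphanumeric characters only
project = MyFirstAssembly
# de-novo assembly of a genome in accurate quality
job = denovo,genome,accurate</pre><p>
As <span class="emphasis"><em>denovo</em></span>, <span class="emphasis"><em>genome</em></span>
and <span class="emphasis"><em>accurate</em></span> are the defaults named above,
this <span class="bold"><strong>job</strong></span> line only spells out what
MIRA would assume for any of the three choices left out.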
</p></li></ol></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_readgroups"></a>3.4.3.
The manifest file: information on the data you have
</h3></div></div></div><p>
The second part of an assembly <span class="emphasis"><em>manifest</em></span> tells
MIRA which files it needs to load, which sequencing technology
generated the data, whether there are DNA template constraints it can
use during the assembly process and a couple of other things.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_readgroup"></a>3.4.3.1.
Starting a new readgroup
</h4></div></div></div><p>
<span class="bold"><strong>readgroup </strong></span> [= <em class="replaceable"><code>group name</code></em>] is the keyword which tells MIRA that you are going to define a new read group. You can optionally name that group.
</p><div class="sidebar"><a name="sidebar_ref_manifest_readgroups_templates_and_readgroups"></a><div class="titlepage"><div><div><p class="title"><b>
Understanding readgroups and DNA templates
</b></p></div></div></div><p>
When you send away your DNA for sequencing, it is going to be
prepared for sequencing according to your wishes. Sequencing
providers call this "constructing a library" and regardless
whether you sequence with Sanger, 454, Illumina, Ion Torrent,
Pacific Biosciences or other technologies, the "library prep" is
always there.
</p><p>
With most library preps, your DNA is first amplified and then
cut into small pieces. These pieces are called
<span class="emphasis"><em>templates</em></span> and their length can range
from a few dozen or a few hundred bases up to dozens or even
hundreds of kilobases. The important thing is that
these templates can be much bigger in size than the actual read
length. While this is a wet lab step, protocols and providers
have gotten pretty good at constructing libraries where the DNA
templates are all in a given range of bases like, e.g., having a
library with template size 500bp (+/- 100bp) and another library
with template size around 7kb (+/- 500bp).
</p><p>
Depending on the technology and sequencing strategy used, the
DNA templates are used to create either one single read or - and
that's important - two or more reads.
</p><p>
Libraries with "single reads" are often called "single read
libraries" or "shotgun libraries". They can be found for every
sequencing technology and are most of the time easy to construct
(therefore cheap) and are often used to provide a decent amount
of bases as basic coverage for your project.
</p><p>
Libraries with two reads per DNA template are often called
"mate-pair" or "paired-end" libraries. They are harder to
construct and sometimes have less yield, therefore they are often
more expensive. But the sequencing approach using several reads
per DNA template allows assembly and scaffolding algorithms to
resolve repetitive regions of a genome which are longer than the
average read length. Note that Pacific Biosciences has a
sequencing mode called "strobed sequencing" which is different
from "paired-end/mate-pair" but also creates multiple reads per
DNA template.
</p><p>
Long story short: an assembler must know afterwards what kind of
reads it has to expect: the sequencing technology, library
preparation strategy etc. For this, the notion of <span class="emphasis"><em>read
groups</em></span> has emerged: reads coming from the same
technology and same library preparation are pooled together in a
read group to tell the assembler: in the assembly, if you see two
reads coming from the same DNA template, you should expect them to
be at a certain distance from each other and they should be
oriented in a certain way.
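</p><p>
As a sketch of how this maps onto a manifest, the two example
libraries mentioned above (templates of 500bp +/- 100bp and of 7kb +/-
500bp) could each get a read group of their own. The group names, the
file names and the assumption that the first is an Illumina paired-end
and the second an Illumina mate-pair library are purely illustrative;
the keywords used are explained in the following sections:
</p><pre class="screen"># assumed: Illumina paired-end library, templates of 500bp +/- 100bp
readgroup = ShortTemplateLibrary
data = shortlib*.fastq
technology = solexa
template_size = 400 600
segment_placement = FR

# assumed: Illumina mate-pair library, templates of 7kb +/- 500bp
readgroup = LongTemplateLibrary
data = longlib*.fastq
technology = solexa
template_size = 6500 7500
segment_placement = RF</pre><p>
The size bounds are simply the nominal template size minus and plus
the stated tolerance.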
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The above was a <span class="bold"><strong>very</strong></span> simplified
view on the whole area of DNA templates, readgroups, shotgun and
paired end sequencing. Enough to hopefully understand the
concepts, but you might want to read more about it.
</td></tr></table></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_data"></a>3.4.3.2.
Defining data files to load
</h4></div></div></div><p>
<span class="bold"><strong>data</strong></span> = <em class="replaceable"><code>filepath
[filepath ...]</code></em> defines the file paths from
which sequences should be loaded. A file path can contain just the
name of one (or several) files or it can contain the
<span class="emphasis"><em>path</em></span>, i.e., the directory (absolute or
relative) including the file name.
</p><p>
MIRA automatically recognises what type the sequence data is by
looking at the postfix of files. For postfixes not adhering to widely
used naming schemes for file types, there's additionally a way of
explicitly defining the type (see further down at the end of this
item on how this is done). Currently allowed file types are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename">.fasta</code> for sequences formatted in FASTA
format where there exists an additional
<code class="filename">.fasta.qual</code> file which contains quality
data. If the file with quality data is missing, this is
interpreted as an error and MIRA will abort.
</p></li><li class="listitem"><p>
<code class="filename">.fna</code> and <code class="filename">.fa</code> also
for sequences formatted in FASTA format. The difference
to <code class="filename">.fasta</code> lies in the way MIRA treats a
missing quality file (called
<code class="filename">.fna.qual</code>
or <code class="filename">.fa.qual</code>): it does not see that as
a critical error and continues.
</p></li><li class="listitem"><p>
<code class="filename">.fastq</code> or <code class="filename">.fq</code> for files in FASTQ format
</p></li><li class="listitem"><p>
<code class="filename">.gff3</code> or <code class="filename">.gff</code> for files in GFF3 format. Note that
MIRA will load all sequences and annotations contained in this
file.
</p></li><li class="listitem"><p>
<code class="filename">.gbk</code>, <code class="filename">.gbf</code>, <code class="filename">.gbff</code>
or <code class="filename">.gb</code> for files formatted in GenBank
format. Note that the MIRA GenBank loader does not understand
intron/exon or other multiple-locus structures in this format,
use GFF3 instead!
</p></li><li class="listitem"><p>
<code class="filename">.caf</code> for files in the CAF format (from Sanger Centre)
</p></li><li class="listitem"><p>
<code class="filename">.maf</code> for files in the MIRA MAF format
</p></li><li class="listitem"><p>
<code class="filename">.exp</code> for files in the Staden EXP format.
</p></li><li class="listitem"><p>
<code class="filename">.fofnexp</code> for a <span class="emphasis"><em>file of EXP
filenames</em></span> which all point to files in the Staden EXP
format.
</p></li><li class="listitem"><p>
<code class="filename">.xml</code>, <code class="filename">.ssaha2</code> and <code class="filename">.smalt</code> for ancillary data in NCBI TRACEINFO, SSAHA2 or SMALT format respectively.
</p></li></ul></div><p>
Multiple 'data' lines and multiple entries per line (even
different formats) are allowed, as in, e.g.,
</p><pre class="screen">data = file1.fastq file2.fastq file3.fasta file4.gbk
data = myscreenings.smalt</pre><p>
You can also use wildcards and/or directory names. E.g., loading
all file types MIRA understands from a given directory
<code class="filename">mydir</code>:
</p><pre class="screen">data = mydir</pre><p>
or loading all files starting with <code class="filename">mydata</code> and
ending with <code class="filename">fastq</code>:
</p><pre class="screen">data = mydata*fastq</pre><p>
or loading all files in directory <code class="filename">mydir</code>
starting with <code class="filename">mydata</code> and ending with
<code class="filename">fastq</code>:
</p><pre class="screen">data = mydir/mydata*fastq</pre><p>
or loading all FASTQ files in all directories starting with <code class="filename">mydir</code>:
</p><pre class="screen">data = mydir*/*fastq</pre><p>
or ... well, you get the gist.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Giving a directory like in <code class="filename">mydir</code> is
equivalent to <code class="filename">mydir/*</code> (saying: give me all
files in the directory <code class="filename">mydir</code>), however the
first version should be preferred when the directory contains
thousands of files.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
GenBank and GFF3 files may or may not contain embedded sequences. If
annotations are present in these files for which no sequence is
present in the same file, MIRA will look for reads of the same
name which it already loaded in this or previously defined read
groups and add the annotations there.
</p><p>
As a safety measure, annotations in GenBank and GFF3 files for which
absolutely no sequence or read has been defined are treated as
an error.
</p></td></tr></table></div><p>
<span class="emphasis"><em>Explicit definition of file types.</em></span> It is
possible to explicitly tell MIRA the type of a file even if said
file does not have a 'standard' naming scheme. For this, the
EMBOSS double-colon notation has been adapted to work also for
MIRA, i.e., you prepend the type of a file and separate it from
the file name by a double colon. E.g.,
the <code class="filename">.dat</code> postfix is not anything MIRA will
recognise, but you can specify that it should be loaded as a FASTQ file
like this:
</p><pre class="screen">data = fastq::myfile.dat</pre><p>
Another frequent usage is forcing MIRA to load FASTA files
named <code class="filename">.fasta</code> without complaining in case
quality files (which MIRA wants you to provide) are not present:
</p><pre class="screen">data = fna::myfile.fasta</pre><p>
This does (of course) work also with directories or wildcard
characters. In the following example, the first line will load all
files from <code class="filename">mydirectory</code> as FASTQ while the
second line loads just <code class="filename">.dat</code> files in a given
path as FASTA:
</p><pre class="screen">data = fastq::mydirectory
data = fasta::/path/to/somewhere/*.dat</pre><p>
It is entirely possible (although not really sensible) to give
contradictory information to MIRA by using a different explicit
file type than one would guess from the standard postfix. In this
case, the explicit type takes precedence over the automatic
type. E.g.: to force MIRA to load a file as FASTA although it is
named <code class="filename">.fastq</code>, one could use this:
</p><pre class="screen">data = fasta::file.fastq</pre><p>
Note that the above does not perform any kind of file conversion:
<code class="filename">file.fastq</code> needs to be already in FASTA
format or else MIRA will fail loading that data.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_defaultqual"></a>3.4.3.3.
Setting default quality
</h4></div></div></div><p>
<span class="bold"><strong>default_qual</strong></span>=
<em class="replaceable"><code>quality_value</code></em> sets the
fall-back quality value for sequences where the data files
given above do not contain quality values, e.g., GFF3 or GenBank
files, and possibly also FASTA files whose quality data file is
missing.
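</p><p>
A minimal sketch of how this could look in a read group; the group
name, the file name and the value 30 are illustrative assumptions, not
recommendations:
</p><pre class="screen"># FASTA file without accompanying quality file, hence loaded via fna::
readgroup = SequencesWithoutQualities
data = fna::mysequences.fasta
technology = text
# fall-back quality for sequences which come without quality values
default_qual = 30</pre><p>
Here the loaded sequences would be given the fall-back quality 30, as
the file itself provides none.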
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_technology"></a>3.4.3.4.
Defining technology used to sequence
</h4></div></div></div><p>
<span class="bold"><strong>technology</strong></span>=
<em class="replaceable"><code>technology</code></em> names the technology
with which the sequences were produced. Allowed technologies are:
<span class="emphasis"><em>sanger, 454, solexa, iontor, pcbiolq, pcbiohq,
text</em></span>.
</p><p>
The <span class="emphasis"><em>text</em></span> technology is not a technology per
se, but should be used for sequences which are not coming from
sequencing machines like, e.g., database entries, consensus
sequences, artificial reads (which do not comply with the
behaviour of 'normal' sequencing data), etc.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_nostatistics"></a>3.4.3.5.
Preventing statistics for technologies with biases
</h4></div></div></div><p>
<span class="bold"><strong>nostatistics</strong></span> used as a keyword will
prevent MIRA from calculating coverage estimates from reads of the given
readgroup.
</p><p>
This keyword should be used in denovo genome assemblies for reads
from libraries which produce very uneven coverage (e.g.: old
Illumina mate-pair protocols) or have a bias in the randomness of
DNA fragmentations (e.g.: Nextera protocol from Illumina).
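</p><p>
As a sketch, the keyword stands on a line of its own within the read
group it applies to. The group name and file name are assumptions made
for illustration:
</p><pre class="screen"># reads of this group should not enter the coverage estimates
readgroup = SomeNexteraReads
data = nextera*.fastq
technology = solexa
nostatistics</pre><p>
As the keyword belongs to one read group, reads from other read groups
of the same manifest would still be used for the estimates.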
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_asreference"></a>3.4.3.6.
Setting reference sequence for mapping jobs
</h4></div></div></div><p>
<span class="bold"><strong>as_reference</strong></span> This keyword
indicates to MIRA that the sequences in this readgroup should not
be assembled, but should be used as reference backbone for a
mapping assembly. That is, sequencing reads are then placed/mapped
onto these reference reads.
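</p><p>
A hedged sketch of the read groups for a mapping assembly, assuming a
reference in a GenBank file and Illumina reads to be mapped; all group
and file names are made up:
</p><pre class="screen"># the backbone: not assembled, only used to map reads onto
readgroup = MyReference
as_reference
data = myreference.gbk
technology = text

# the sequencing reads which are placed onto the reference
readgroup = MyIlluminaReads
data = myreads*.fastq
technology = solexa</pre><p>
The <span class="emphasis"><em>text</em></span> technology is used for the
reference because it is a database entry rather than the output of a
sequencing machine. The first part of such a manifest would select
<span class="emphasis"><em>mapping</em></span> in its
<span class="bold"><strong>job</strong></span> line.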
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_autopairing"></a>3.4.3.7.
Autopairing: letting MIRA find out pair info by itself
</h4></div></div></div><p>
<span class="bold"><strong>autopairing</strong></span> This keyword is used
to tell MIRA it should estimate values for
<span class="emphasis"><em>template_size</em></span> and
<span class="emphasis"><em>segment_placement</em></span> (see below).
</p><p>
This is basically the lazy way to tell MIRA that the data in the
corresponding readgroup consists of paired reads and that you
trust it will find out the correct values.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><span class="emphasis"><em>autopairing</em></span> usually works quite well for
small and mid-sized libraries (up to, say, 10 kb). For larger
libraries it might be a good thing to tell MIRA some rough
boundaries via <span class="emphasis"><em>template_size</em></span> /
<span class="emphasis"><em>segment_placement</em></span> and let MIRA refine the
values for the template size via <span class="emphasis"><em>autorefine</em></span>
(see below).
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><span class="emphasis"><em>autopairing</em></span> is a feature new to MIRA 4.0rc5,
and may contain bugs in some corner cases. Feedback is appreciated.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_templatesize"></a>3.4.3.8.
Setting size of read templates
</h4></div></div></div><p>
<span class="bold"><strong>template_size </strong></span>=
<em class="replaceable"><code>min_size max_size
<span class="emphasis"><em>[infoonly|exclusion_criterion]</em></span>
<span class="emphasis"><em>[autorefine]</em></span></code></em>. Defines the
minimum and maximum size of "good" DNA templates in the library
prep for this read group. This defines at which distance the two
reads of a pair are to be expected in a contig, which is very useful
information for an assembler to resolve repeats in a genome or
different splice variants in transcriptome data.
</p><p>
If the term <span class="emphasis"><em>infoonly</em></span> is present, then MIRA
will pass the information on template sizes in result files, but
will not use it for any decision making during de-novo or mapping
assembly. The term <span class="emphasis"><em>exclusion_criterion</em></span> makes
MIRA use the information for decision making.
</p><p>
If <span class="emphasis"><em>infoonly</em></span>
or <span class="emphasis"><em>exclusion_criterion</em></span> are missing, then MIRA
assumes <span class="emphasis"><em>exclusion_criterion</em></span> for de-novo
assemblies and <span class="emphasis"><em>infoonly</em></span> for mapping
assemblies.
</p><p>
If the term <span class="emphasis"><em>autorefine</em></span> is present, MIRA will
start the assembly with the given size information but switch to
refined values computed from the distances observed in the
assembly. However, please note that the size values
can <span class="emphasis"><em>never</em></span> be expanded, only shrunk. It is
therefore advisable to use generous bounds when using the
autorefine feature.
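</p><p>
The following lines sketch how the optional terms combine. They are
alternatives for different read groups, not lines of one and the same
group, and all numbers are illustrative assumptions:
</p><pre class="screen"># de-novo: enforce generous bounds and let MIRA narrow them down
template_size = 2000 6000 exclusion_criterion autorefine
# mapping: only pass the size information on to the result files
template_size = 100 300 infoonly
# no term given: MIRA picks the default for the kind of assembly
template_size = 400 600</pre><p>
In the first line the bounds are deliberately wider than the expected
template sizes, as autorefine can only shrink the range.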
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The <span class="emphasis"><em>template_size</em></span> line in the manifest file
replaces the parameters -GE:uti:tismin:tismax of earlier versions
of MIRA (3.4.x and below).
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The minimum or the maximum size (or both) can be set to a negative
value for "don't care and don't check". This allows constructs
like <code class="literal">template_size= 500 -1 exclusion_criterion</code>
which would check only the minimum distance but not the maximum
distance.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For <span class="emphasis"><em>mapping</em></span> assemblies with MIRA, you
usually will want to use <span class="emphasis"><em>infoonly</em></span> as otherwise -
in case of genome re-arrangements, larger deletions or
insertions - MIRA would probably reject one read of every read
pair in the corresponding areas as it would not be at the
expected distance and/or orientation ... and you would not be
able to simply find the re-arrangement in downstream analysis.
</p><p>
For <span class="emphasis"><em>de-novo</em></span> assemblies however
you <span class="emphasis"><em>should not</em></span>
use <span class="emphasis"><em>infoonly</em></span> except in very rare cases
where you know what you are doing.
</p></td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding the size of DNA templates
</b></p></div></div></div><p>
When using a <span class="emphasis"><em>paired-end</em></span> or
<span class="emphasis"><em>mate-pair</em></span> sequencing strategy, two
sequences are generated for the ends of each DNA template (see
sidebar above: "understanding readgroups and DNA
templates"). That is, if one has a library with 6kb fragments,
one knows that the outer ends of the two reads will be
approximately 6kb apart, like so:
</p><pre class="screen">DNA template  ##############################################################
read 1        .......
read 2                                                                ......
              <------------------------- ~6 kb ----------------------------></pre><p>
Sequencing labs will try their best to get these two sequences
from DNA templates which comply with a given length
specification. But as this is wet lab chemistry, some
uncertainty is unavoidable: the DNA
templates generated are not exactly of the specified size
(e.g. 6kb), but their sizes will vary within a given
range, e.g., 5.5kb to 6.5kb.
</p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_segplace"></a>3.4.3.9.
Read segment placement
</h4></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">You do not need to use this when using 'autopairing' (see above).</td></tr></table></div><p>
<span class="bold"><strong>segment_placement </strong></span>=
<em class="replaceable"><code>placementcode <span class="emphasis"><em>[infoonly|exclusion_criterion]</em></span></code></em>. Allowed
placement codes are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<span class="bold"><strong>?</strong></span>
or <span class="bold"><strong>unknown</strong></span> which are
place-holders for "well, in the end: don't care." Segments of
a template can be reads in any direction and in any
relationship to each other.
</p><p>
This is typically used for unpaired libraries (sometimes
called <span class="emphasis"><em>shotgun libraries</em></span>), but may be
also useful for, e.g., primer walking with Sanger.
</p></li><li class="listitem"><p>
<span class="bold"><strong>---> <---</strong></span> or <span class="bold"><strong>FR</strong></span> or <span class="bold"><strong>INNIES</strong></span>. The <span class="emphasis"><em>forward /
reverse</em></span> scheme as used in traditional Sanger
sequencing as well as Illumina paired-end sequencing.
</p><p>
This is the usual placement code for Sanger paired-end
protocols as well as Illumina paired-end. Less frequently used
in IonTorrent paired-end sequencing.
</p></li><li class="listitem"><p>
<span class="bold"><strong><--- ---></strong></span> or <span class="bold"><strong>RF</strong></span> or <span class="bold"><strong>OUTIES</strong></span>. The <span class="emphasis"><em>reverse /
forward</em></span> scheme as used in Illumina mate-pair
sequencing.
</p><p>
This is the usual placement code for Illumina mate-pair protocols.
</p></li><li class="listitem"><p>
<span class="bold"><strong>1---> 2---></strong></span> or
<span class="bold"><strong>samedir forward</strong></span> or <span class="bold"><strong>SF</strong></span> or <span class="bold"><strong>LEFTIES</strong></span>. The <span class="emphasis"><em>forward /
forward</em></span> scheme. Segments of a template are all
placed in the same direction, the segment order in the contig
follows segment ordering of the reads.
</p></li><li class="listitem"><p>
<span class="bold"><strong>2---> 1---></strong></span> or <span class="bold"><strong>samedir backward</strong></span> or <span class="bold"><strong>SB</strong></span> or <span class="bold"><strong>RIGHTIES</strong></span>. Segments of a template are
all placed in the same direction, the segment order in the
contig is reversed compared to segment ordering of the reads.
</p><p>
This is the usual placement code for 454 "paired-end" and IonTorrent
long-mate protocols.
</p></li><li class="listitem"><p>
<span class="bold"><strong>samedir</strong></span> Segments of a
template are all placed in the same direction, but their spatial
relationship is not taken into account.
</p></li><li class="listitem"><p>
<span class="bold"><strong>>>></strong></span> (reserved for
sequencing of several equidistant fragments per template like
in PacBio strobe sequencing, not implemented yet)
</p></li></ul></div><p>
If the term <span class="emphasis"><em>infoonly</em></span> is present, then MIRA
will pass the information on segment placement in result files, but
will not use it for any decision making during de-novo assembly or
mapping assembly. The term <span class="emphasis"><em>exclusion_criterion</em></span> makes MIRA use the information for decision making.
</p><p>
If <span class="emphasis"><em>infoonly</em></span> or <span class="emphasis"><em>exclusion_criterion</em></span> are missing, then MIRA assumes <span class="emphasis"><em>exclusion_criterion</em></span> for de-novo assemblies and <span class="emphasis"><em>infoonly</em></span> for mapping assemblies.
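</p><p>
As a sketch, a few alternative lines using the short placement codes
from the list above. Each line would belong to a different read group,
and which library type a group holds is an assumption made for
illustration:
</p><pre class="screen"># Illumina paired-end; no term given, so the default applies
segment_placement = FR
# Illumina mate-pair in a mapping assembly, stated as information only
segment_placement = RF infoonly
# 454 "paired-end", placement used for decision making
segment_placement = SB exclusion_criterion</pre><p>
The short codes FR, RF and SB are equivalent to the arrow notations
listed above and are easier to type.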
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For <span class="emphasis"><em>mapping</em></span> assemblies with MIRA, you
usually will want to use <span class="emphasis"><em>infoonly</em></span> as otherwise -
in case of genome re-arrangements, larger deletions or
insertions - MIRA would probably reject one read of every read
pair (as it would not be at the expected distance and/or
orientation) and you would not be able to simply find the
re-arrangement in downstream analysis.
</p><p>
For <span class="emphasis"><em>de-novo</em></span> assemblies however
you <span class="emphasis"><em>should not</em></span>
use <span class="emphasis"><em>infoonly</em></span> except in very rare cases
where you know what you are doing.
</p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
As soon as you tell MIRA that a readgroup contains paired reads (via one of the other typical readgroup parameters like template_size, segment_naming etc.), the <span class="emphasis"><em>segment_placement</em></span> line becomes mandatory in the manifest. This is because different sequencing technologies and/or library preparations result in different read orientations. E.g., Illumina libraries come in paired-end flavour which have FR (forward/reverse) placements, but there are also mate-pair libraries which have reverse/forward (RF) placements.
</td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding read segment placement on DNA templates
</b></p></div></div></div><p>
bla
</p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_segname"></a>3.4.3.10.
Read segment naming
</h4></div></div></div><p>
<span class="bold"><strong>segment_naming </strong></span>= <em class="replaceable"><code>naming_scheme <span class="emphasis"><em>[rollcomment]</em></span></code></em>. Defines
the naming scheme reads are following to indicate the DNA template
they belong to. Allowed naming schemes are: <span class="emphasis"><em>sanger,
stlouis, tigr, FR, solexa, sra</em></span>.
</p><p>
If not defined, the defaults are <span class="underline">sanger</span> for Sanger sequencing data
and <span class="underline">solexa</span> for Solexa, 454
and Ion Torrent data.
</p><p>
For FASTQ files, the modifier <span class="emphasis"><em>rollcomment</em></span> can
be used to let MIRA take the first token in the comment as the name of
a read instead of the original name. E.g., for a read
</p><pre class="screen">@DRR014327.1.1 HWUSI-EAS547_0013:1:1:1106:4597.1 length=91
TTAGAAGGAGATCTGGAGAACATTTTAAACCGGATTGAACAACGCGGCCGTGAGATGGAGCTTCAGACAAGCCGGTCTTATTGGGACGAAC
+
bbb`bbbbabbR`\_bb_bba`b`bb_bb_`\^\^Y^`\Zb^b``]]\S^a`]]a``bbbb_bbbb]bbb\`^^^]\aaY\`\\^aa__aB</pre><p>
the rollcomment modifier will lead to the read being named
<code class="filename">HWUSI-EAS547_0013:1:1:1106:4597.1</code> (which
is almost the original instrument read name) instead of
<code class="filename">DRR014327.1.1</code> (which is the SRA read name).
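</p><p>
The effect of <span class="emphasis"><em>rollcomment</em></span> can be sketched in a few lines of Python. This is only an illustration of the rule described above, not MIRA's own parser; the function name <code class="literal">fastq_read_name</code> is hypothetical.
</p>

```python
def fastq_read_name(header_line, rollcomment=False):
    """Return the read name taken from a FASTQ header line.

    Illustration only: this mimics the rule described in the text and is
    not MIRA's parser.
    """
    # Drop the leading '@' and split on whitespace: the first token is the
    # original read name, everything after it is the comment.
    if header_line.startswith("@"):
        header_line = header_line[1:]
    tokens = header_line.split()
    if rollcomment and len(tokens) > 1:
        # With rollcomment, the first token of the comment becomes the name.
        return tokens[1]
    return tokens[0]


header = "@DRR014327.1.1 HWUSI-EAS547_0013:1:1:1106:4597.1 length=91"
sra_name = fastq_read_name(header)
rolled_name = fastq_read_name(header, rollcomment=True)
print(sra_name)     # DRR014327.1.1
print(rolled_name)  # HWUSI-EAS547_0013:1:1:1106:4597.1
```

<p>
In this sketch, a header without a comment simply keeps its original name.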
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For data from the short read archive (SRA), one will usually need
to explicitly specify the 'sra' naming scheme or use the
'rollcomment' modifier in FASTQ files.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This has changed with MIRA 3.9.1
and <span class="command"><strong>sff_extract</strong></span> 0.3.0. Before that, 454 and Ion
Torrent were given <span class="underline">fr</span> as naming
scheme.
</td></tr></table></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding read naming schemes
</b></p></div></div></div><p>
Read naming is a long story with lots of historical gotchas: names
need to be clear and simple, yet people have often wanted
to convey additional meta-information with them. Unsurprisingly,
several "standards" emerged over time. In short: it's a mess. See also the XKCD entry on <a class="ulink" href="http://xkcd.com/927/" target="_top">proliferating standards</a>.
</p><p>
How to choose: please read the documentation available at the
different centres or ask your sequence provider. In a nutshell
(and probably over-simplified):
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
Sanger scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>.[pqsfrw][12][bckdeflmnpt][a|b|c|...]</em></span>"
(e.g. U13a08f10.p1ca). The postfix
must be at least 4 characters long, i.e., ".p" alone will not
be recognised.
</p><p>
Usually, ".p" + 3 characters or ".f" + 3 characters are
used for forward reads, while reverse complement reads
take either ".q" or ".r" (+ 3 characters in both cases).
</p></dd><dt><span class="term">
TIGR scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>TF*|TR*|TA*</em></span>"
(e.g. GCPBN02TF or GCPDL68TABRPT103A58B).
</p><p>
Forward reads take "TF*", reverse reads "TR*".
</p></dd><dt><span class="term">
St. Louis scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>.[sfrxzyingtpedca]*</em></span>"
</p></dd><dt><span class="term">
Forward/Reverse scheme
</span></dt><dd><p>
"somename<span class="emphasis"><em>.[fr]*</em></span>"
(e.g. E0K6C4E01DIGEW.f or E0K6C4E01BNDXN.r2nd).
</p><p>
".f*" for forward, ".r*" for reverse.
</p></dd><dt><span class="term">
Solexa scheme
</span></dt><dd><p>
Even simpler than the forward/reverse scheme, it allows
only one or two reads per template:
"somename<span class="emphasis"><em>/[12]</em></span>"
</p></dd><dt><span class="term">
SRA scheme
</span></dt><dd><p>
The Short Read Archive (SRA) finally settled on a naming
scheme and renames each and every read within its
database. When you download sequences from the archive,
all reads will be named
<code class="filename">XXX000000.Y[.Z]</code> (where X's are
characters A-Z, 0 are digits from 0 to 9, Y is a counter
and Z is a number denoting the segment (usually 1,2 or
3)). This naming scheme is applied to reads from all
technologies, therefore the MIRA technology dependent
defaults will not apply and one must specify the 'sra'
naming scheme in the command line.
</p></dd></dl></div></div><p>
Any wildcard in the forward/reverse suffix must be consistent for
a read pair, and is treated as part of the template name. This is
to allow multiple sequencing of a fragment, particularly common
with Sanger capillary data (e.g. given somename.f and somename.r,
resequenced as somename.f2 and somename.r2, this would be treated
as two pairs, with template names somename and somename_2
respectively).
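</p><p>
As a minimal sketch of how template names could be derived for the forward/reverse and Solexa schemes, consider the following Python snippet. It follows the rules stated above (a wildcard after ".f" or ".r" becomes part of the template name), but the regular expressions and function names are assumptions made for illustration and are not taken from the MIRA sources.
</p>

```python
import re

# Hypothetical patterns for two of the naming schemes described in the text.
FR_PATTERN = re.compile(r"^(?P<template>.+)\.(?P<direction>[fr])(?P<rest>.*)$")
SOLEXA_PATTERN = re.compile(r"^(?P<template>.+)/(?P<segment>[12])$")


def split_fr(read_name):
    """Return (template, direction) for a forward/reverse style name, or None."""
    match = FR_PATTERN.match(read_name)
    if match is None:
        return None
    template = match.group("template")
    if match.group("rest"):
        # The wildcard part is treated as part of the template name.
        template += "_" + match.group("rest")
    return template, match.group("direction")


def split_solexa(read_name):
    """Return (template, segment) for a Solexa style name, or None."""
    match = SOLEXA_PATTERN.match(read_name)
    if match is None:
        return None
    return match.group("template"), int(match.group("segment"))


print(split_fr("somename.f"))     # ('somename', 'f')
print(split_fr("somename.r2"))    # ('somename_2', 'r')
print(split_solexa("read17/2"))   # ('read17', 2)
```

<p>
Reads such as somename.f2 and somename.r2 end up with the same template name (somename_2) in this sketch, which is what allows them to be recognised as a pair.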
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_strainname"></a>3.4.3.11.
Strain naming
</h4></div></div></div><p>
<span class="bold"><strong>strain_name </strong></span>=
<em class="replaceable"><code>string</code></em>. Defines the strain /
organism-code the reads of this read group are from. If not set,
MIRA will assign "StrainX" to normal readgroups and
"ReferenceStrain" to readgroups with reference sequences.
</p><p>
Restrictions: de-novo assemblies can have at most 255 strains,
mapping assemblies at most 8 strains.
</p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Understanding how MIRA uses strain information
</b></p></div></div></div><p>
bla
</p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_datadirscf"></a>3.4.3.12.
Data directory for SCF files
</h4></div></div></div><p>
<span class="bold"><strong>datadir_scf </strong></span>=
<em class="replaceable"><code>directory</code></em>
</p><p>
For SANGER data only: tells MIRA in which directory it can find
SCF data belonging to reads of this read group.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_manifest_readgroups_renameprefix"></a>3.4.3.13.
Renaming read name prefixes
</h4></div></div></div><p>
<span class="bold"><strong>rename_prefix</strong></span>=
<em class="replaceable"><code>prefix replacement</code></em>. Renames
reads on the fly while loading data by searching each read name
for a given <span class="emphasis"><em>prefix</em></span> string and, if found,
replacing it with a given <span class="emphasis"><em>replacement</em></span> string.
</p><p>
This is most useful for systems like Illumina or PacBio, which
generate quite long read names that are either
utterly useless for an end user or even break older
programs which have a length restriction on read names. E.g.:
</p><pre class="screen">rename_prefix = DQT9AAQ4:436:H371HABMM: Sample1_</pre><p>
will rename reads
like <span class="emphasis"><em>DQT9AAQ4:436:H371HABMM:5:1101:9154:3062</em></span>
into <span class="emphasis"><em>Sample1_5:1101:9154:3062</em></span>.
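</p><p>
The renaming rule can be illustrated with a short Python sketch. It assumes that a rule only matches at the start of a read name and that, if several rules were given, the first matching one wins; both points are assumptions made for this example, and <code class="literal">rename_read</code> is a hypothetical helper, not a MIRA function.
</p>

```python
def rename_read(name, rules):
    """Apply the first matching (prefix, replacement) rule to a read name.

    Illustration only; the "first match wins" behaviour is an assumption.
    """
    for prefix, replacement in rules:
        if name.startswith(prefix):
            # Replace only the prefix, keep the rest of the name untouched.
            return replacement + name[len(prefix):]
    return name


rules = [("DQT9AAQ4:436:H371HABMM:", "Sample1_")]
renamed = rename_read("DQT9AAQ4:436:H371HABMM:5:1101:9154:3062", rules)
print(renamed)  # Sample1_5:1101:9154:3062
```

<p>
Read names which do not start with the given prefix are returned unchanged.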
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><code class="literal">rename_prefix</code> entries are valid per
readgroup. I.e., an entry for a readgroup will not rename reads of
another readgroup.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Multiple <code class="literal">rename_prefix</code> entries are
allowed per readgroup. E.g.:
</p><pre class="screen">rename_prefix = DQT9AAQ4:436:H371HABMM: S1sxa_
rename_prefix = m140328_002546_42149_c100624422550000001823118308061414_s1_ S1pb_</pre><p>
will rename a read
called <code class="literal">DQT9AAQ4:436:H371HABMM:1:1101:3099:2186</code>
into <code class="literal">S1sxa_1:1101:3099:2186</code> while renaming
another read called <code class="literal">m140328_002546_42149_c100624422550000001823118308061414_s1_p0/100084/10792_20790/0_9573</code>
into <code class="literal">S1pb_p0/100084/10792_20790/0_9573</code>
</p></td></tr></table></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_manifest_parameters"></a>3.4.4.
The manifest file: extended parameters
</h3></div></div></div><p>
The <span class="bold"><strong>parameters=</strong></span> line in the manifest
file opens up the full panoply of possibilities the MIRA assembler
offers. This ranges from fine-tuning assemblies to setting parameters
in a way so that MIRA is suited also for very special assembly cases.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_parameter_groups"></a>3.4.4.1.
Parameter groups
</h4></div></div></div><p>
Some parameters one can set in MIRA naturally belong together. For
example, when specifying an overlap in an alignment of two sequences,
one could tell the assembler it should look at overlaps only if they
have a certain similarity and a certain length. On the other hand,
specifying how many processors / threads the assembler should use or
whether the results of an assembly should be written out in SAM
format does not relate to alignments.
</p><p>
MIRA uses <span class="emphasis"><em>parameter groups</em></span> to keep related
parameters together. For example:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code> -GENERAL:number_of_threads=4 \
-ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
-OUTPUT:output_result_caf=no</code></em></code></strong></pre><p>
The parameters of the different parameter groups are described in
detail a bit later in this manual.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_technology_sections"></a>3.4.4.2.
Technology sections
</h4></div></div></div><p>
With the introduction of new sequencing technologies, MIRA also had
to be able to set values that allow technology specific behaviour of
algorithms. One simple example for this could be the minimum length
a read must have to be used in the assembly. For Sanger sequences,
setting this value to 150 (meaning a read should have at least 150
unclipped bases) would be a very valid, albeit conservative,
choice. For 454 reads, and especially Solexa and ABI SOLiD reads,
however, this value would be ridiculously high.
</p><p>
To allow very fine grained behaviour, especially in hybrid
assemblies, and to prevent the explosion of parameter names, MIRA
knows two categories of parameters:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="bold"><strong>technology independent parameters</strong></span>
which control general behaviour of MIRA like, e.g., the number of
assembly passes or file names etc.
</p></li><li class="listitem"><p>
<span class="bold"><strong>technology dependent parameters</strong></span>
which control behaviour of algorithms where the sequencing
technology plays a role. Example for this would be the minimum
length of a read (like 200 for Sanger reads and 120 for 454 FLX
reads).
</p></li></ol></div><p>
More on this a bit further down in this documentation.
</p><p>
As an example, a manifest using technology dependent and independent parameters could
look like this:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GENERAL:number_of_threads=4 \
SANGER_SETTINGS -ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
454_SETTINGS -ALIGN:min_relative_score=75 -ASSEMBLY:minimum_read_length=100 \
SOLEXA_SETTINGS -ALIGN:min_relative_score=90 -ASSEMBLY:minimum_read_length=75</code></em></code></strong></pre><p>
Now, assume the following read group descriptions in a manifest:
</p><pre class="screen">
...
readgroup
technology=454
...
readgroup
technology=solexa
...</pre><p>
For MIRA, this means a number of parameters should apply to the
assembly as a whole, while others apply to the sequencing data itself
... and some parameters might need to be different depending on the
technology they apply to. MIRA dumps the parameters it is running
with at the beginning of an assembly and it makes it clear there
which parameters are "global" and which parameters apply to single
technologies.
</p><p>
Here is, as an example, part of the parameter output that MIRA
shows when started with 454 and Illumina (Solexa) data:
</p><pre class="screen">
...
Assembly options (-AS):
Number of passes (nop) : 1
Skim each pass (sep) : yes
Maximum number of RMB break loops (rbl) : 1
Spoiler detection (sd) : no
Last pass only (sdlpo) : yes
Minimum read length (mrl) : [454] 40
[sxa] 20
Enforce presence of qualities (epoq) : [454] no
[sxa] yes
...</pre><p>
You can see the two different kinds of settings that MIRA uses:
<span class="emphasis"><em>common</em></span> <span class="emphasis"><em>settings</em></span> (like
[-AS:nop]), which allow only one value, and
<span class="emphasis"><em>technology</em></span> <span class="emphasis"><em>dependent</em></span>
<span class="emphasis"><em>settings</em></span> (like [-AS:mrl]), where
each sequencing technology used in the project can have a
different value.
</p><p>
How would one set a minimum read length of 40 and not enforce
presence of base qualities for Sanger reads, but for 454 reads a
minimum read length of 30 and enforce base qualities? The answer:
</p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=40:epoq=no 454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
Notice the ..._SETTINGS switches in the command line (or parameter file):
they tell MIRA that all following parameters, up to the
next such switch, are to be set specifically for the named technology.
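</p><p>
How the ..._SETTINGS switches scope the parameters that follow them can be sketched with a small Python parser. This is a simplified illustration and not MIRA's parameter parser: it ignores quick switches, does no validation, and the label <code class="literal">UNSCOPED</code> for parameters appearing before any switch is a hypothetical placeholder.
</p>

```python
def parse_parameters(text):
    """Group MIRA-style parameters by ..._SETTINGS section and parameter group.

    Illustration only: not MIRA's parser; quick switches such as
    --noclipping are skipped.
    """
    settings = {}
    section = "UNSCOPED"  # hypothetical label for parameters before any switch
    for token in text.split():
        if token.endswith("_SETTINGS"):
            # A section switch: everything that follows belongs to it.
            section = token
        elif token.startswith("-") and not token.startswith("--"):
            # "-AS:mrl=40:epoq=no" -> group "AS", assignments "mrl=40:epoq=no"
            group, _, assignments = token[1:].partition(":")
            group_settings = settings.setdefault(section, {}).setdefault(group, {})
            for assignment in assignments.split(":"):
                if not assignment:
                    continue
                name, _, value = assignment.partition("=")
                group_settings[name] = value
    return settings


params = "SANGER_SETTINGS -AS:mrl=40:epoq=no 454_SETTINGS -AS:mrl=30:epoq=yes"
result = parse_parameters(params)
print(result["SANGER_SETTINGS"]["AS"])  # {'mrl': '40', 'epoq': 'no'}
print(result["454_SETTINGS"]["AS"])     # {'mrl': '30', 'epoq': 'yes'}
```

<p>
The same parameter name (here mrl and epoq) is stored once per section, which mirrors the per-technology values MIRA prints for technology dependent settings.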
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For improved readability, you can distribute parameters across
several lines, either by prefixing every line with
<code class="literal">parameters=</code>, like so:
</p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=80:epoq=no
parameters= 454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
Alternatively you can use a backslash at the end of a parameter
line to indicate that the next line is a continuing line, like so:
</p><pre class="screen">
job=denovo,genome,draft
parameters= SANGER_SETTINGS -AS:mrl=80:epoq=no <strong class="userinput"><code>\</code></strong>
454_SETTINGS -AS:mrl=30:epoq=yes</pre><p>
Note that the very last line of the parameters settings MUST NOT
end with a backslash.
</p></td></tr></table></div><p>
Besides COMMON_SETTINGS, there are currently 7 technology settings available:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
SANGER_SETTINGS
</p></li><li class="listitem"><p>
454_SETTINGS
</p></li><li class="listitem"><p>
IONTOR_SETTINGS
</p></li><li class="listitem"><p>
PCBIOLQ_SETTINGS (currently not supported)
</p></li><li class="listitem"><p>
PCBIOHQ_SETTINGS
</p></li><li class="listitem"><p>
SOLEXA_SETTINGS
</p></li><li class="listitem"><p>
TEXT_SETTINGS
</p></li></ol></div><p>
</p><p>
Some settings of MIRA influence global behaviour and are not
related to a specific sequencing technology; these must be set in the
COMMON_SETTINGS environment. For example, it would not make sense to try to
set a different number of assembly passes for each technology, as in
</p><pre class="screen">
<strong class="userinput"><code>parameters= 454_SETTINGS -AS:nop=4 SOLEXA_SETTINGS -AS:nop=3</code></strong></pre><p>
Besides being contradictory, this does not really make sense. MIRA will
complain about cases like these. Simply set those common settings in
an area prefixed with the COMMON_SETTINGS switch, as in
</p><pre class="screen">
<strong class="userinput"><code>parameters= COMMON_SETTINGS -AS:nop=4 454_SETTINGS ... SOLEXA_SETTINGS ...</code></strong></pre><p>
</p><p>
Since MIRA 3rc3, the parameter parser will help you by checking
whether parameters are correctly defined as COMMON_SETTINGS or
technology dependent setting.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_parameter_shortnames"></a>3.4.4.3.
Parameter short names
</h4></div></div></div><p>
Writing the verbose form of parameters can be quite tedious. Here is a short example:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GENERAL:number_of_threads=4 \
SANGER_SETTINGS -ALIGN:min_relative_score=70 -ASSEMBLY:minimum_read_length=150 \
454_SETTINGS -ALIGN:min_relative_score=75 -ASSEMBLY:minimum_read_length=100 \
SOLEXA_SETTINGS -ALIGN:min_relative_score=90 -ASSEMBLY:minimum_read_length=75</code></em></code></strong></pre><p>
However, every parameter has a shortened form. The above could be written like this:
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GE:not=4 \
SANGER_SETTINGS -AL:mrs=70 -AS:mrl=150 \
454_SETTINGS -AL:mrs=75 -AS:mrl=100 \
SOLEXA_SETTINGS -AL:mrs=90 -AS:mrl=75</code></em></code></strong></pre><p>
Please note that it is also perfectly legal to decompose the switches
so that they can be used more easily in scripted environments (notice
the multiple -AL in some sections of the following example):
</p><pre class="screen">
<strong class="userinput"><code>parameters = <em class="replaceable"><code>COMMON_SETTINGS -GE:not=4 \
SANGER_SETTINGS \
-AL:mrs=70 \
-AL:mrl=150 \
454_SETTINGS -AL:mrs=75:mrl=100 \
SOLEXA_SETTINGS \
-AL:mrs=90 \
-AL:mrl=75</code></em></code></strong></pre></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_order_dependent_quick_switches"></a>3.4.4.4.
Order dependent quick switches
</h4></div></div></div><p>
For some parameters, the order of appearance in the parameter lines
of the manifest is important. This is because the <span class="emphasis"><em>quick
parameters</em></span> are realised internally as a collection of
extended parameters that will overwrite any previously manually set
extended parameters. It is generally a good idea to place quick parameters in
the order described in this documentation, that is: first the
order dependent quick parameters, then other quick parameters, then all
the other extended parameters.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[--hirep_best]
, </span><span class="term">
[--hirep_good]
, </span><span class="term">
[--hirep_something]
</span></dt><dd><p>
These are modifier switches for genome data that is deemed to
be highly repetitive. With <span class="emphasis"><em>hirep_good</em></span> and
<span class="emphasis"><em>hirep_best</em></span>, the assemblies will run
slower due to more iterative cycles and slightly different
default parameter sets that give MIRA a chance to resolve many
nasty repeats. The <span class="emphasis"><em>hirep_something</em></span> switch
goes the other way round and resolves repeats less well than a
normal assembly, but allows MIRA to finish even on more
complex data.
</p><p>
Usage recommendations for bacteria: starting MIRA without any
hirep switches yields good enough results in most cases. Under
normal circumstances one can use
<span class="emphasis"><em>hirep_good</em></span> or
even <span class="emphasis"><em>hirep_best</em></span> without remorse, as data
sets and genome complexities are small enough to run within a
couple of hours at most.
</p><p>
Usage recommendations for 'simple' lower eukaryotes: starting
MIRA without any hirep switches yields good enough results in
most cases. If the genomes are not too complex,
using <span class="emphasis"><em>hirep_good</em></span> can be a possibility.
</p><p>
Usage recommendations for lower eukaryotes with complex
repeats: starting MIRA without any hirep switches might
already take too long or create temporary data files which are
too big. For these cases, using
<span class="emphasis"><em>hirep_something</em></span> makes MIRA use a
parameter set which is targeted at resolving the
non-repetitive areas of a genome and additionally all repeats
which occur less than 10 times in the genome. Repeats occurring
more often will not be resolved, but using the debris
information one can recover affected reads and use these with
harsh data reduction algorithms (e.g. digital normalisation)
to get a glimpse into these.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
These switches replace the '--highlyrepetitive' switch from
earlier versions.
</td></tr></table></div></dd><dt><span class="term">
[--noclipping=...]
</span></dt><dd><p>
Switches off clipping options. If used
as <code class="literal">--noclipping</code>
or <code class="literal">--noclipping=all</code>, this switches off
really everything, both technology dependent and independent clippings.
Technology dependent clipping options can be switched
off individually via the entries <span class="emphasis"><em>sanger</em></span>,
<span class="emphasis"><em>454</em></span>, <span class="emphasis"><em>iontor</em></span>,
<span class="emphasis"><em>solexa</em></span> or
<span class="emphasis"><em>solid</em></span>. Multiple entries separated by
commas are allowed.
</p><p> Examples:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Switch off 454 and Solexa clippings, but keep technology independent
clippings and all clippings for other technologies (like,
e.g., Sanger): <code class="literal">--noclipping=454,solexa</code>
</p></li><li class="listitem"><p>
Switch off really
everything: <code class="literal">--noclipping</code>
or <code class="literal">--noclipping=all</code>
</p></li></ol></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Switching off technology independent clippings
([-CL:pec], [-CL:gbcdc], [-CL:kjd])
via this switch has been implemented for consistency in MIRA
4.9.6. Prior to this they were kept active, which created a
good deal of confusion with a number of users.
</p><p>
As soon as you have any kind of 'real' sequencing data, you
really should use at least [-CL:pec]
and [-CL:gbcdc].
</p></td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_general_ge"></a>3.4.4.5.
Parameter group: -GENERAL (-GE)
</h4></div></div></div><p>
General options control the type of assembly to be performed and
other switches not belonging anywhere else.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[number_of_threads(not)=<em class="replaceable"><code>0 ≤ integer ≤ 256</code></em>]
</span></dt><dd><p> Default is <span class="underline">0</span>. Master switch to set the number
of threads used in different parts of MIRA.
</p><p>
A value of 0 tells MIRA to set this to the number of available
physical cores on the machine it runs on. That is,
hyperthreaded "cores" are not counted in as using these would
cause a tremendous slowdown in the heavy duty computation
parts. E.g., a machine with 2 processors having 4 cores each
will have this value set to 8.
</p><p>
In case MIRA cannot find out the number of cores, the
fall-back value is <span class="underline">2</span>.
</p><p>
Note: when running the SKIM algorithm in parallel threads,
MIRA can give different results when started with the same
data and the same arguments. While the effect could be averted for
SKIM, the memory cost for doing so would be an additional 50%
for one of the large tables, so this has not been implemented
at the moment. Besides, once the Smith-Waterman alignments
run in parallel, this could not easily be avoided anyway.
</p></dd><dt><span class="term">
[automatic_memory_management(amm)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">Yes</span>. Whether
MIRA tries to optimise the run time of certain algorithms via a
space/time trade-off in memory usage, enlarging or shrinking some
internal tables as memory permits.
</p><p>
Note 1: This functionality currently relies on the
<code class="filename">/proc</code> file system giving information on
the system memory ("MemTotal" in /proc/meminfo) and the memory
usage of the current process ("VmSize" in
<code class="filename">/proc/self/status</code>). If this is not
available, the functionality is switched off.
</p><p>
Note 2: The automatic memory management can only work if there
actually is unused system memory. It's not a wonder switch
which reduces memory consumption. In tight memory situations,
memory management has no effect and the algorithms fall back
to minimum table sizes. This means that the effective size in
memory can grow larger than given in the memory management
parameters, but then MIRA will try to keep the additional
memory requirements to a minimum.
</p></dd><dt><span class="term">
[max_process_size(mps)=<em class="replaceable"><code>0 ≤ integer</code></em>]
</span></dt><dd><p> Default is <span class="underline">0</span>. If
automatic memory management is used (see above), this number is
the size in gigabytes that the MIRA process will use as maximum
target size when looking for space/time trade-offs. A value of 0
means that MIRA does not try to keep a fixed upper limit.
</p><p>
Note: when in competition to [-GE:kpmf] (see below),
the smaller of both sizes is taken as target. Example: if your
machine has 64 GiB but you limit the use to 32 GiB, then the
MIRA process will try to stay within these 32 GiB.
</p></dd><dt><span class="term">
[keep_percent_memory_free(kpmf)=<em class="replaceable"><code>0 ≤ integer</code></em>]
</span></dt><dd><p> Default is <span class="underline">10</span>. If
automatic memory management is used (see above), this number
works a bit like [-GE:mps] but the other way round: it
tries to keep x percent of the memory free.
</p><p>
Note: when in competition to [-GE:mps] (see above),
the argument leaving the most memory free is taken as
target. Example: if your machine has 64 GiB and you limit the
use to 42 GiB via [-GE:mps] but have a
[-GE:kpmf] of 50, then the MIRA process will try to
stay within 64-(64*50%)=32 GiB.
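</p><p>
The interplay of [-GE:mps] and [-GE:kpmf] described above boils down to a small calculation, sketched here in Python. The function <code class="literal">memory_target_gib</code> is a hypothetical helper that only reproduces the arithmetic of the examples; it is not how MIRA is implemented internally.
</p>

```python
def memory_target_gib(total_gib, mps=0, kpmf=10):
    """Return the memory target (in GiB) implied by mps and kpmf.

    Illustration only: reproduces the arithmetic from the text.
    """
    # Limit implied by keeping kpmf percent of the system memory free.
    kpmf_limit = total_gib * (100 - kpmf) / 100
    if mps == 0:
        # mps=0 means no fixed upper limit, so only kpmf applies.
        return kpmf_limit
    # Otherwise the setting which leaves more memory free wins.
    return min(mps, kpmf_limit)


target_a = memory_target_gib(64, mps=42, kpmf=50)
target_b = memory_target_gib(64, mps=32)
print(target_a)  # 32.0
print(target_b)  # 32
```

<p>
In the first call the kpmf limit (32 GiB) is smaller than mps (42 GiB); in the second call mps (32 GiB) is smaller than the limit implied by the default kpmf of 10.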
</p></dd><dt><span class="term">
[preprocess_only(ppo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>. As a
special use case, MIRA will just run the following tasks:
loading and clipping of reads as well as calculating kmer
frequencies and read repeat information. The resulting reads can
then be found as MAF file in the checkpoint directory; the read
repeat information in the info directory.
</p><p>
No assembly is performed.
</p></dd><dt><span class="term">
[est_snp_pipeline_step(esps)=<em class="replaceable"><code>1 ≤ integer ≤ 4</code></em>]
</span></dt><dd><p> Default is <span class="underline">1</span>. Controls the starting step of the
SNP search in EST pipeline and is therefore only useful in
miraSearchESTSNPs.
</p><p>
EST assembly is a three step process, each with different
settings to the assembly engine, with the result of each step
being saved to disk. If results of previous steps are present
in a directory, one can easily "play around" with different
settings for subsequent steps by reusing the results of the
previous steps and directly starting with step two or three.
</p></dd><dt><span class="term">
[print_date(pd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Controls
whether date and time are printed out during the
assembly. Suppressing it is not useful in normal operation,
only when debugging or benchmarking.
</p></dd><dt><span class="term">
[bang_on_throw(bot)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. For
debugging purposes only. Controls whether MIRA, when detecting an
error, raises a signal which triggers a running debugger like
gdb.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_assembly_as"></a>3.4.4.6.
Parameter group: -ASSEMBLY (-AS)
</h4></div></div></div><p>
General options for controlling the assembly.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[num_of_passes(nop)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">0</span>. Defines how many iterations of the whole
assembly process are done.
</p><p>
The default of 0 will let MIRA choose automatically the number
of passes and the kmer sizes used in each pass
(see also [-AS:kms] below).
</p><p>
Early termination: if the number of passes was chosen too
high, one can simply create a file
<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly/<em class="replaceable"><code>projectname</code></em>_d_chkpt/terminate</code>. At
the beginning of a new pass, MIRA checks for the existence of
that file and, if it finds it, acknowledges by renaming it to
<code class="filename">terminate_acknowledged</code> and then runs 2
more passes (with special "last pass routines") before
finishing the assembly.
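</p><p>
Creating the terminate file can be scripted, for example with the following Python sketch. The helper name <code class="literal">request_early_termination</code> and the project name <code class="literal">myproject</code> are made up for illustration, and the demonstration runs in a scratch directory rather than in a real assembly directory.
</p>

```python
import tempfile
from pathlib import Path


def request_early_termination(projectname, workdir="."):
    """Create the 'terminate' file in the checkpoint directory of a project.

    Raises FileNotFoundError if the checkpoint directory does not exist.
    """
    chkpt_dir = Path(workdir) / f"{projectname}_assembly" / f"{projectname}_d_chkpt"
    terminate_file = chkpt_dir / "terminate"
    terminate_file.touch()
    return terminate_file


# Demonstration in a scratch directory with a fake checkpoint directory.
with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "myproject_assembly" / "myproject_d_chkpt").mkdir(parents=True)
    created = request_early_termination("myproject", workdir)
    created_exists = created.exists()
    print(created.name)  # terminate
```

<p>
For a real assembly, one would call the helper with the actual project name from the directory in which MIRA was started.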
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
As a rule of thumb, <span class="emphasis"><em>de-novo</em></span> assemblies
should always have at least two passes,
while <span class="emphasis"><em>mapping</em></span> assemblies should work with
only one pass. Deviating from this will lead to unexpected
results. The reason is that the MIRA learning routines
either have no chance to learn enough about the assembly (for
de-novo with one pass) or learn "too much" (mapping with more
than one pass).
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
MIRA versions ≤ 4.0.2 were interpreting the value of '0' in
a different way and only performed pre-processing of
reads. MIRA can still do this, but this is controlled by the
new parameter [-GE:ppo].
</td></tr></table></div></dd><dt><span class="term">
[kmer_series(kms)=<em class="replaceable"><code>comma separated list of integers ≥ 0 and ≤ 256</code></em>]
</span></dt><dd><p>
Default is an empty value. If set, overrides [-AS:nop] and [-SK:kms].
</p><p>
If set, this parameter provides a one-stop-shop for defining the number of passes and the kmer size used in each pass. E.g.: <code class="literal">-AS:kms=17,31,63,127</code> defines an assembly with 4 passes which uses a kmer size of 17 in pass 1, 31 in pass 2, 63 in pass 3 and 127 in pass 4.
</p><p>
Note that it is perfectly valid to use the same kmer size more than once, e.g.: <code class="literal">17,31,63,63</code> will perform a 4 pass assembly, using a kmer size of 63 in passes 3 and 4. It also makes sense to do this, as with default parameters MIRA uses its integrated automatic editor which edits away obvious sequencing errors in each step, thus the second pass with a kmer size of 63 bases can rely on improved reads.
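</p><p>
To make the relation between the list and the passes explicit, here is
a small Python sketch that turns a kms value into a mapping from pass
number to kmer size. The function
name <code class="literal">passes_from_kmer_series</code> is
hypothetical and the code only mirrors the description above, it is
not how MIRA parses the parameter internally.
</p>
<pre class="programlisting">
```python
def passes_from_kmer_series(kms):
    """Map pass number (starting at 1) to the kmer size listed in a kms value."""
    sizes = [int(token) for token in kms.split(",")]
    for size in sizes:
        # The documented value range is 0 to 256 inclusive.
        if size not in range(0, 257):
            raise ValueError(f"kmer size {size} is outside the range 0 to 256")
    return dict(enumerate(sizes, start=1))
```
</pre>
<p>
For the value <code class="literal">17,31,63,63</code> this yields four
entries, with passes 3 and 4 both mapped to a kmer size of 63.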
</p></dd><dt><span class="term">
[rmb_break_loops(rbl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and assembly
quality level. Defines the maximum number of times a contig
can be rebuilt during a main assembly pass
(see [-AS:nop] or [-AS:kms]) if misassemblies due to possible repeats
are found.
</p></dd><dt><span class="term">
[max_contigs_per_pass(mcpp)=<em class="replaceable"><code>integer</code></em>]
</span></dt><dd><p>
Default is <span class="underline">0</span>. Defines
how many contigs are maximally built in each pass. A value of
0 stands for 'unlimited'. Values >0 can be used for special
use cases like test assemblies etc.
</p><p>
If in doubt, do not touch this parameter.
</p></dd><dt><span class="term">
[automatic_repeat_detection(ard)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">yes</span>. Tells MIRA to use coverage
information accumulated over time to more accurately pinpoint reads that are
in repetitive regions.
</p></dd><dt><span class="term">
[coverage_threshold(ardct)=<em class="replaceable"><code>float > 1.0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">2.0</span> for all sequencing technologies in most assembly cases. This
option says this: if a read has ever been aligned at positions
where the total coverage of all reads of the same sequencing technology
attained the average coverage times [-AS:ardct] (over a length of
[-AS:ardml], see below), then this read is considered to be
repetitive.
</p></dd><dt><span class="term">
[min_length(ardml)=<em class="replaceable"><code>integer > 1</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology, currently
<span class="underline">400</span> for Sanger and
<span class="underline">200</span> for 454 and Ion
Torrent.
</p><p>
The coverage must stay above the [-AS:ardct] threshold over at least
this number of bases before the region is really treated as a repeat.
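</p><p>
The interplay of [-AS:ardct] and [-AS:ardml] can be sketched as a check
on a coverage profile. The Python function below is a simplified
reading of the two descriptions, not MIRA's actual implementation:
the name <code class="literal">has_repeat_coverage</code>, the list
based coverage profile and the default values in the signature are
assumptions made for illustration.
</p>
<pre class="programlisting">
```python
def has_repeat_coverage(coverage, average_coverage, ardct=2.0, ardml=200):
    """Return True if coverage stays at or above average_coverage * ardct
    for at least ardml consecutive positions."""
    threshold = average_coverage * ardct
    run_length = 0
    for depth in coverage:
        if depth >= threshold:
            run_length += 1
            if run_length >= ardml:
                return True
        else:
            # The stretch is interrupted, start counting again.
            run_length = 0
    return False
```
</pre>
<p>
With an average coverage of 10 and the values used in the signature, a
stretch of 200 positions at coverage 25 counts as repetitive, while a
stretch of only 199 such positions does not.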
</p></dd><dt><span class="term">
[grace_length(ardgl)=<em class="replaceable"><code>integer > 1</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology.
</p></dd><dt><span class="term">
[uniform_read_distribution(urd)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently always <span class="underline">no</span>
as these algorithms were supplanted by better ones in MIRA 4.0.
</p><p>
Takes effect only if uniform read distribution
([-AS:urd]) is on.
</p><p>
When set to <span class="underline">yes</span>, MIRA
will analyse coverage of contigs built at a certain stage of
the assembly and estimate an average expected coverage of
reads for contigs. This value will be used in subsequent
passes of the assembly to ensure that no part of the contig
gets significantly more coverage from reads that were
previously identified as repetitive than the estimated average
coverage allows for.
</p><p>
This switch is useful to disentangle repeats that are
otherwise 100% identical and generally allows MIRA to build larger
contigs. It is expected to be useful for Sanger and 454
sequences. Usage of this switch with Solexa and Ion Torrent
data is currently not recommended.
</p><p>
It is a real improvement to disentangle repeats, but has the
side-effect of creating some "contig debris" (small and low
coverage contigs, things you normally can safely throw away as
they are representing sequence that already has enough
coverage).
</p><p>
This switch must be set to <span class="underline">no</span> for EST assembly, assembly of
transcripts etc. It is recommended to also switch this off for
mapping assemblies.
</p></dd><dt><span class="term">
[urd_startinpass(urdsip)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and assembly
quality level. Recommended values are: 3 for an assembly with
3 to 4 passes ([-AS:nop]). Assemblies with 5 passes
or more should set the value to the number of passes minus 2.
</p><p>
Takes effect only if uniform read distribution
([-AS:urd]) is on.
</p></dd><dt><span class="term">
[urd_clipoffmultiplier(urdcm)=<em class="replaceable"><code>float > 1.0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">1.5</span> for all
sequencing technologies in most assembly cases.
</p><p>
This option says this: if MIRA determined that the average
coverage is <span class="emphasis"><em>x</em></span>, then in subsequent passes it will allow
coverage for reads determined to be repetitive to be built
into the contig only up to a total coverage of
<span class="emphasis"><em>x*urdcm</em></span>. Reads that bring the coverage above the threshold
will be rejected from that specific place in the contig (and
either be built into another copy of the repeat somewhere else
or end up as contig debris).
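</p><p>
The arithmetic behind this rule is small enough to write down. The
following Python sketch assumes that a read is rejected as soon as
adding it would lift the coverage above
<span class="emphasis"><em>x*urdcm</em></span>; both function names are
hypothetical and the code does not reflect how MIRA tracks coverage
internally.
</p>
<pre class="programlisting">
```python
def coverage_cap(average_coverage, urdcm=1.5):
    """Highest total coverage allowed where repetitive reads are placed."""
    return average_coverage * urdcm


def accepts_repetitive_read(current_coverage, average_coverage, urdcm=1.5):
    """Return True if one more read keeps the coverage at or below the cap."""
    return coverage_cap(average_coverage, urdcm) >= current_coverage + 1
```
</pre>
<p>
For an average coverage of 30 and the default multiplier of 1.5 the cap
is 45: a repetitive read is still accepted at a place that currently
has coverage 44, but rejected there once coverage 45 is reached.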
</p><p>
Please note that the lower [-AS:urdcm] is, the more
contig debris you will end up with (contigs with an average
coverage less than half of the expected coverage, mostly short
contigs with just a couple of reads).
</p><p>
Takes effect only if uniform read distribution ([-AS:urd]) is on.
</p></dd><dt><span class="term">
[spoiler_detection(sd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and assembly
quality level. A spoiler can be either a chimeric read or
a read with long parts of unclipped vector sequence still
included (that was too long for the [-CL:pvlc] vector
leftover clipping routines). A spoiler typically prevents
contigs from being joined; MIRA will cut spoilers back so that they
do no more harm to the assembly.
</p><p>
Recommended for mid- to high-coverage genomic
assemblies, not recommended for assemblies of ESTs as one
might lose splice variants with that.
</p><p>
A minimum number of two assembly passes ([-AS:nop])
must be run for this option to take effect.
</p></dd><dt><span class="term">
[sd_last_pass_only(sdlpo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Defines
whether the spoiler detection algorithms are run only for the
last pass or for all passes ([-AS:nop]).
</p><p>
Takes effect only if spoiler detection ([-AS:sd]) is on. If in
doubt, leave it to 'yes'.
</p></dd><dt><span class="term">
[minimum_read_length(mrl)=<em class="replaceable"><code>integer ≥ 20</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology. Defines the minimum length that
reads must have to be considered for the assembly. Shorter sequences will be
filtered out at the beginning of the process and won't be present in the
final project.
</p></dd><dt><span class="term">
[minimum_reads_per_contig(mrpc)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology and the
[--job] parameter. For genome assemblies it's usually
around <span class="underline">2</span> for Sanger,
<span class="underline">5</span> for 454, <span class="underline">5</span> for Ion Torrent, <span class="underline">5</span> for PacBio and <span class="underline">10</span> for Solexa. In EST assemblies,
it's currently <span class="underline">2</span> for all
sequencing technologies.
</p><p>
Defines the minimum number of reads a contig must have before
it is built or saved by MIRA. Overlap clusters with fewer reads
than defined will not be assembled into contigs but reads in
these clusters will be immediately transferred to debris.
</p><p>
This parameter is useful to considerably reduce assembly time
in large projects with millions of reads (like in Solexa
projects) where a lot of small "junk" contigs with
contamination sequence or otherwise uninteresting data may be
created otherwise.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Important: a value larger than 1 for this parameter interferes with
the functioning of [-OUT:sssip] and
[-OUT:stsip].
</td></tr></table></div></dd><dt><span class="term">
[enforce_presence_of_qualities(epoq)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. When set
to yes, MIRA will stop the assembly if any read has no quality
values loaded.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">[-AS:epoq] switches on/off the quality check for a
complete sequencing technology. A more fine-grained control
for switching these checks per readgroup is available via
the <span class="emphasis"><em>default_qual</em></span> readgroup parameter in
the manifest file.
</td></tr></table></div></dd><dt><span class="term">
[use_genomic_pathfinder(ugpf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. MIRA has
two different pathfinder algorithms it chooses from to find
its way through the (more or less) complete set of possible
sequence overlaps: a genomic and an EST pathfinder. The
genomic looks a bit into the future of the assembly and tries
to stay on safe grounds using a maximum of information already
present in the contig that is being built. The EST version on
the contrary will directly jump at the complex cases posed by
very similar repetitive sequences and try to solve those first
and is willing to fall back to first-come-first-served when
really bad cases (like, e.g., coverage with thousands of
sequences) are encountered.
</p><p>
Generally, the genomic pathfinder will also work quite well
with EST sequences (but might get slowed down a lot in
pathological cases), while the EST algorithm does not work so
well on genomes. If in doubt, leave on <span class="underline">yes</span> for genome projects and set to
<span class="underline">no</span> for EST projects.
</p></dd><dt><span class="term">
[use_emergency_search_stop(uess)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Another
important switch if you plan to assemble non-normalised EST
libraries, where some ESTs may reach coverages of several
hundreds or thousands of reads. This switch lets MIRA save a
lot of computational time when aligning those extremely high
coverage areas (but only there), at the expense of some
accuracy.
</p></dd><dt><span class="term">
[ess_partnerdepth(esspd)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is <span class="underline">500</span>. Defines the number of potential
partners a read must have for MIRA switching into emergency
search stop mode for that read.
</p></dd><dt><span class="term">
[use_max_contig_buildtime(umcbt)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. Defines whether there is an upper limit of time
to be used to build one contig. Set this to yes in EST assemblies where you
think that extremely high coverages occur. Less useful for assembly of
genomic sequences.
</p></dd><dt><span class="term">
[buildtime_in_seconds(bts)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">3600</span> for genome
assemblies, <span class="underline">720</span> for EST
assemblies with Sanger or 454
and <span class="underline">360</span> for EST assemblies
with Solexa or Ion Torrent. Depending on [-AS:umcbt]
above, this number defines the time in seconds allocated to
building one contig.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_strain_backbone_sb"></a>3.4.4.7.
Parameter group: -STRAIN/BACKBONE (-SB)
</h4></div></div></div><p>
Controlling backbone options in mapping assemblies:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[bootstrap_new_backbone(bnb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for
mapping assemblies with Illumina data, no otherwise.
</p><p>
When set to 'yes', MIRA will use a two stage mapping process
which bootstraps an intermediate backbone (reference) sequence
and greatly improves mapping accuracy at indel sites.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Currently only works with Illumina data, other sequencing
technologies will not be affected by this flag.
</td></tr></table></div></dd><dt><span class="term">
[startbackboneusage_inpass(sbuip)=<em class="replaceable"><code>0 < integer</code></em>]
</span></dt><dd><p> Default is
dependent on assembly quality level chosen: 0 for 'draft'
and [-AS:nop] divided by 2 for 'accurate'.
</p><p>
When assembling against backbones, this parameter defines the
pass iteration (see [-AS:nop]) from which on the
backbones will be really used. In the passes preceding this
number, the non-backbone reads will be assembled together as
if no backbones existed. This allows MIRA to correctly spot
repetitive stretches that differ by single bases and tag them
accordingly. Note that full assemblies are considerably slower
than mapping assemblies, so be careful with this when
assembling millions of reads.
</p><p>
Rule of thumb: if backbones belong to same strain as reads to assemble, set
to <span class="underline">1</span>. If backbones are a different strain, then set
[-SB:sbuip] to 1 lower than [-AS:nop] (example: nop=4 and
sbuip=3).
</p></dd><dt><span class="term">
[backbone_raillength(brl)=<em class="replaceable"><code>0 ≤ integer ≤ 10000</code></em>]
</span></dt><dd><p> Default
is <span class="underline">0</span>. Parameter for the
internal sectioning size of the backbone to compute optimal
alignments. Should be set to two times the length of the longest read in
the input data + 15%. When set to 0, MIRA will compute optimal
values from the data loaded.
</p></dd><dt><span class="term">
[backbone_railoverlap(bro)=<em class="replaceable"><code>0 ≤ integer ≤ 2000</code></em>]
</span></dt><dd><p> Default is <span class="underline">0</span>.
Parameter for the internal sectioning size of the backbone to
compute optimal alignments. Should be set to length of the
longest read. When set to 0, MIRA will compute optimal values
from the data loaded.
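</p><p>
For users who prefer to set [-SB:brl] and [-SB:bro] by hand, the two
recommendations can be computed from the read lengths. The Python
sketch below is an illustration under assumptions: it reads "two times
the length of the longest read + 15%" as twice the longest read
multiplied by 1.15, rounds to the nearest integer and caps the results
at the documented maxima of 10000 and 2000. The function
name <code class="literal">suggest_rail_settings</code> is hypothetical.
</p>
<pre class="programlisting">
```python
def suggest_rail_settings(read_lengths):
    """Return (brl, bro) derived from the longest read, capped at the documented maxima."""
    longest = max(read_lengths)
    # Assumed reading of the rule: (2 * longest read) plus 15 percent.
    rail_length = min(round(2 * longest * 1.15), 10000)
    # The overlap should equal the length of the longest read.
    rail_overlap = min(longest, 2000)
    return rail_length, rail_overlap
```
</pre>
<p>
For reads of at most 300 bases this gives a rail length of 690 and a
rail overlap of 300. Leaving both parameters at 0 remains the simpler
choice, as MIRA then derives the values from the loaded data itself.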
</p></dd><dt><span class="term">
[trim_overhanging_reads(tor)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p><p>
When set to 'yes', MIRA will trim back reads at the ends of contigs
which outgrow the reference sequence so that boundaries of
the reference and the mapped reads align perfectly. That is,
the mapping does not perform a sequence extension.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The trimming is performed via setting low quality cutoffs in
the reads, i.e., the trimmed parts are not really gone but
just not part of the active contig anymore. They can be
uncovered when working on the assembly in finishing programs
like, e.g., <span class="command"><strong>gap4</strong></span>
or <span class="command"><strong>gap5</strong></span>.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Previous versions of MIRA (up to and including 3.9.18) behaved
as if this option had been set to 'no'. This is a major change
in behaviour, but it is also what probably most people expect
from a mapping.
</td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_dataprocessing_dp"></a>3.4.4.8.
Parameter group: -DATAPROCESSING (-DP)
</h4></div></div></div><p>
Options for controlling some data processing during the assembly.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[use_read_extension(ure)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used: <span class="underline">yes</span> for Sanger,
no for all others. MIRA expects the sequences it is given to be
quality clipped. During the assembly though, it will try to extend reads
into the clipped region and gain additional coverage by analysing
Smith-Waterman alignments between reads that were found to be valid. Only
the right clip is extended though, the left clip (most of the time
containing sequencing vector) is never touched.
</p></dd><dt><span class="term">
[read_extension_window_length(rewl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used. Only takes effect when
[-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
routines use a sliding window approach on Smith-Waterman alignments. This
parameter defines the window length.
</p></dd><dt><span class="term">
[read_extension_with_maxerrors(rewme)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Only takes effect
when [-DP:ure] (see above) is set to <span class="underline">yes</span>. The read
extension routines use a sliding window approach on Smith-Waterman
alignments. This parameter defines the maximum number of errors
(=disagreements) between two aligned reads in the given window.
</p></dd><dt><span class="term">
[first_extension_in_pass(feip)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. Only takes effect when
[-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
routines can be called before assembly and/or after each assembly pass (see
[-AS:nop]). This parameter defines the first pass in which the read
extension routines are called. The default of <span class="underline">0</span> tells
MIRA to extend the reads the first time before the first assembly
pass.
</p></dd><dt><span class="term">
[last_extension_in_pass(leip)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. Only takes effect when
[-DP:ure] (see above) is set to <span class="underline">yes</span>. The read extension
routines can be called before assembly and/or after each assembly pass (see
[-AS:nop]). This parameter defines the last pass in which the read
extension routines are called. The default of <span class="underline">0</span> tells
MIRA to extend the reads the last time before the first assembly
pass.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_clipping_cl"></a>3.4.4.9.
Parameter group: -CLIPPING (-CL)
</h4></div></div></div><p>
Controls for clipping options: when and how sequences should be clipped.
</p><p>
Every option in this section can be set individually for every sequencing
technology, giving a very fine grained control on how reads are clipped for
each technology.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[msvs_gap_size(msvsgs)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Takes
effect only when loading data from ancillary SSAHA2 or SMALT
files.
</p><p>
While performing the clip of screened vector sequences, MIRA
will look if it can merge larger chunks of sequencing vector
bases that are a maximum of [-CL:msvsgs] apart.
</p></dd><dt><span class="term">
[msvs_max_front_gap(msvsmfg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Takes
effect only when loading data from ancillary SSAHA2 or SMALT
files.
</p><p>
While performing the clip of screened vector sequences at the
start of a sequence, MIRA will allow up to this number of
non-vector bases in front of a vector stretch.
</p></dd><dt><span class="term">
[msvs_max_end_gap(msvsmeg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. Takes
effect only when loading data from ancillary SSAHA2 or SMALT
files.
</p><p>
While performing the clip of screened vector sequences at the
end of a sequence, MIRA will allow up to this number of
non-vector bases behind a vector stretch.
</p></dd><dt><span class="term">
[possible_vector_leftover_clip(pvlc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology
used: <span class="underline">yes</span> for
Sanger, <span class="underline">no</span> for any
other. MIRA will try to identify possible sequencing vector
relics present at the start of a sequence and clip them
away. These relics are usually a few bases long and were not
correctly removed from the sequence in data preprocessing
steps of external programs.
</p><p>
You might want to turn off this option if you know (or think)
that your data contains a lot of repeats and the option below
to fine tune the clipping behaviour does not give the expected
results.
</p><p>
You certainly want to turn off this option in EST assemblies
as this will quite certainly cut back (and thus hide)
different splice variants. But then make certain that your
pre-processing of Sanger data (sequencing vector removal) is
good; other sequencing technologies are not affected.
</p></dd><dt><span class="term">
[pvc_maxlenallowed(pvcmla)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent of the sequencing technology
used. The clipping of possible vector relics option works quite
well. Unfortunately, especially the bounds of repeats or
differences in EST splice variants sometimes show the same
alignment behaviour as possible sequencing vector relics and
could therefore also be clipped.
</p><p>
To keep the vector clipping from mistakenly clipping repetitive
regions or EST splice variants, this option puts an upper
bound on the number of bases a potential clip is allowed to
have. If the number of bases is below or equal to this
threshold, the bases are clipped. If the number of bases
exceeds the threshold, the clip
is <span class="bold"><strong>NOT</strong></span> performed.
</p><p>
Setting the value to 0 turns off the threshold, i.e., clips are then always
performed if a potential vector was found.
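</p><p>
The decision rule described above, including the special meaning of 0,
can be summarised in a few lines of Python. This is a sketch of the
documented behaviour only; the function
name <code class="literal">clip_is_performed</code> is hypothetical.
</p>
<pre class="programlisting">
```python
def clip_is_performed(candidate_length, pvcmla):
    """Decide whether a potential vector relic of candidate_length bases is clipped."""
    if pvcmla == 0:
        # 0 switches the threshold off: every potential relic is clipped.
        return True
    # Clip only if the relic is not longer than the allowed maximum.
    return pvcmla >= candidate_length
```
</pre>
<p>
With an arbitrarily chosen threshold of 18, a potential relic of 18
bases is clipped while one of 19 bases is left untouched; with a
threshold of 0 both are clipped.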
</p></dd><dt><span class="term">
[quality_clip(qc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">no</span>. This will let MIRA
perform its own quality clipping before sequences are entered
into the assembly. The clip function performed is a sequence end
window quality clip with back iteration to get a maximum number
of bases as useful sequence. Note that the bases clipped away
here can still be used afterwards if there is enough evidence
supporting their correctness when the option [-DP:ure]
is turned on.
</p><p>
Warning: The windowing algorithm works pretty well for Sanger,
but apparently does not like 454 type data. It's advisable to
not switch it on for 454. Besides, the 454 quality clipping
algorithm performs a pretty decent albeit not perfect job, so
for genomic 454 data (not! ESTs), it is currently recommended
to use a combination of [-CL:emrc] and
[-DP:ure].
</p></dd><dt><span class="term">
[qc_minimum_quality(qcmq)=<em class="replaceable"><code>integer ≥ 15 and ≤ 35</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. This is the minimum
quality that bases in a window must have to be accepted. Please be cautious not to
choose too extreme values here, because then the clipping will be too lax or
too harsh. Values below 15 and higher than 30-35 are not recommended.
</p></dd><dt><span class="term">
[qc_window_length(qcwl)=<em class="replaceable"><code>integer ≥ 10</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. This is the length of a window
in bases for the quality clip.
</p></dd><dt><span class="term">
[bad_stretch_quality_clip (bsqc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>. This
option allows MIRA to clip reads that were not correctly preprocessed
and have unclipped bad quality stretches that might prevent a
good assembly.
</p><p> MIRA will search the sequence in forward direction for a
stretch of bases that have on average a quality less than a
defined threshold and then set the right quality clip of this
sequence to cover the given stretch.
</p></dd><dt><span class="term">
[bsqc_minimum_quality (bsqcmq)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent
on the sequencing technology used. Defines the minimum average quality a
given window of bases must have. If this quality is not reached, the
sequence will be clipped at this position.
</p></dd><dt><span class="term">
[bsqc_window_length (bsqcwl)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent of the
sequencing technology used. Defines the length of the window within which
the average quality of the bases is computed.
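</p><p>
How [-CL:bsqc], [-CL:bsqcmq] and [-CL:bsqcwl] work together can be
sketched as a forward scan with a sliding window. The Python code
below is an illustration under assumptions, not MIRA's implementation:
it places the right clip at the start of the first window whose
average quality falls below the threshold, and the function
name <code class="literal">bad_stretch_right_clip</code> is
hypothetical.
</p>
<pre class="programlisting">
```python
def bad_stretch_right_clip(qualities, bsqcmq, bsqcwl):
    """Return the right clip position: the start of the first window whose
    average quality is below bsqcmq, or len(qualities) if there is none."""
    if bsqcwl == 0 or bsqcwl > len(qualities):
        # No usable window: leave the read unclipped in this sketch.
        return len(qualities)
    for start in range(len(qualities) - bsqcwl + 1):
        window = qualities[start:start + bsqcwl]
        if bsqcmq > sum(window) / bsqcwl:
            return start
    return len(qualities)
```
</pre>
<p>
For a read with ten bases of quality 40 followed by ten bases of
quality 5, a minimum quality of 20 and a window length of 5, the first
window with a too low average starts at position 8, so the sketch
clips there.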
</p></dd><dt><span class="term">
[maskedbases_clip(mbc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. This will let MIRA
perform a 'clipping' of bases that were masked out (replaced with the
character X). It is generally not a good idea to use masked bases to remove
unwanted portions of a sequence, the EXP file format and the NCBI traceinfo
format have excellent possibilities to circumvent this. But because a lot of
preprocessing software is built around cross_match, scylla-
and phrap-style of base masking, the need arose for MIRA to
be able to handle this, too. MIRA will look at the start and end of
each sequence to see whether there are masked bases that should be
'clipped'.
</p></dd><dt><span class="term">
[mbc_gap_size(mbcgs)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is dependent of
the sequencing technology used. While performing the clip of masked bases,
MIRA will look if it can merge larger chunks of masked bases that are
a maximum of [-CL:mbcgs] apart.
</p></dd><dt><span class="term">
[mbc_max_front_gap(mbcmfg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. While performing the clip of
masked bases at the start of a sequence, MIRA will allow up to this
number of unmasked bases in front of a masked stretch.
</p></dd><dt><span class="term">
[mbc_max_end_gap(mbcmeg)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. While performing the clip of
masked bases at the end of a sequence, MIRA will allow up to this
number of unmasked bases behind a masked stretch.
</p></dd><dt><span class="term">
[lowercase_clip_front(lccf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used: on for 454 and Ion
Torrent data, off for all
others. This will let MIRA perform a 'clipping' of bases that are in
lowercase at the front end of a sequence, leaving only the uppercase
sequence. Useful when handling 454 data that does not have ancillary data in
XML format.
</p></dd><dt><span class="term">
[lowercase_clip_back(lccb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used: on for 454 and Ion
Torrent data, off for all
others. This will let MIRA perform a 'clipping' of bases that are in
lowercase at the back end of a sequence, leaving only the uppercase
sequence. Useful when handling 454 data that does not have ancillary data in
XML format.
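</p><p>
The effect of [-CL:lccf] and [-CL:lccb] on a single read can be
sketched in Python as follows. The function
name <code class="literal">lowercase_clip</code> and its keyword
arguments are hypothetical and merely mirror the two switches; the code
is an illustration, not MIRA's implementation.
</p>
<pre class="programlisting">
```python
def lowercase_clip(sequence, lccf=True, lccb=True):
    """Return (left, right) so that sequence[left:right] drops lowercase ends."""
    left, right = 0, len(sequence)
    if lccf:
        # Move the left clip past lowercase bases at the front.
        while right > left and sequence[left].islower():
            left += 1
    if lccb:
        # Move the right clip in front of lowercase bases at the back.
        while right > left and sequence[right - 1].islower():
            right -= 1
    return left, right
```
</pre>
<p>
For the sequence <code class="literal">acgtACGTTGCAacg</code> the sketch
returns the clip positions 4 and 12, which leaves exactly the
uppercase part <code class="literal">ACGTTGCA</code>.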
</p></dd><dt><span class="term">
[clip_polyat(cpat)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span> for all EST/RNASeq
assemblies. Poly-A stretches in forward reads and poly-T
stretches in reverse reads get either clipped or tagged here
(see [-CL:cpkps] below). The assembler will not use
these stretches for finding overlaps, but it will use these to
discern and disassemble different 3' UTR endings.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Should poly-A / poly-T stretches have been trimmed in
pre-processing steps before MIRA got the reads, this option
MUST be switched off.
</td></tr></table></div></dd><dt><span class="term">
[cp_keep_poly_stretch (cpkps)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span> but takes effect only
if [-CL:cpat] (see above) is also set to yes.
</p><p>
Instead of clipping the poly-A / poly-T sequence away, the
stretch in question in the reads is kept and tagged. The tags
provide additional information for MIRA to discern between
different 3' UTR endings and also a good visual anchor when
looking at the assembly with different programs.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
One side-effect of this option is that the poly-A / poly-T
stretches are 'cleaned'. That is, single non-A bases within a
poly-A stretch (or non-T bases within a poly-T stretch) are
automatically edited to conform to the surrounding stretch. This is necessary as
homopolymers are by nature one of the hardest motifs to be
sequenced correctly by any sequencing technology and one
frequently gets 'dirty' poly-A sequence from sequencing and
this interferes heavily with the methods MIRA uses to discern
repeats.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Keeping the poly-A sequence is a two-edged sword: on one hand it
enables MIRA to discern different 3' UTR endings, on the other hand
it might be that sequencing problems toward the end of reads
create false-positive different endings. If you find that this
is the case for your data, just switch off this option: MIRA
will then simply build the longest possible 3' UTRs.
</td></tr></table></div></dd><dt><span class="term">
[cp_min_sequence_len(cpmsl)=<em class="replaceable"><code>integer >
0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">10</span>. Only takes effect
when [-CL:cpat] (see above) is set
to <span class="underline">yes</span>. Defines the number
of 'A' (in forward direction) or 'T' (in reverse direction) that must
be present for a stretch to be considered a poly-A sequence stretch.
</p></dd><dt><span class="term">
[cp_max_errors_allowed(cpmea)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">1</span>. Only takes effect
when [-CL:cpat] (see above) is set
to <span class="underline">yes</span>. Defines the
maximum number of errors allowed in the potential poly-A
sequence stretch. The distribution of these errors is not
important.
</p></dd><dt><span class="term">
[cp_max_gap_from_end(cpmgfe)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is <span class="underline">9</span>. Only
takes effect when [-CL:cpat] (see above) is set
to <span class="underline">yes</span>. Defines the number
of bases from the end of a sequence (if masked: from the end of
the masked area) within which a poly-A sequence stretch is
looked for.
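</p><p>
The interplay of [-CL:cpmsl], [-CL:cpmea] and [-CL:cpmgfe] can be
illustrated with a small Python sketch. This is a simplified
illustration under stated assumptions, not MIRA's actual implementation:
the function name <code class="literal">find_polya_stretch</code> and its
scanning strategy are hypothetical, only poly-A in forward reads is
handled, and masked areas and quality values are ignored.
</p>

```python
def find_polya_stretch(seq, min_len=10, max_errors=1, max_gap_from_end=9):
    """Return (start, end) of a poly-A stretch near the 3' end, or None.

    min_len          ~ cpmsl:  minimum length of the stretch
    max_errors       ~ cpmea:  non-'A' bases tolerated inside the stretch
    max_gap_from_end ~ cpmgfe: how far from the read end the stretch may stop
    """
    seq = seq.upper()
    n = len(seq)
    for gap in range(max_gap_from_end + 1):
        end = n - gap  # exclusive end of the candidate stretch
        if min_len > end:
            break  # not enough sequence left for a stretch
        if seq[end - 1] != "A":
            continue  # assumption: a stretch must end on a real 'A'
        errors = 0
        start = end
        best_start = None
        # Extend to the left while the error budget allows it.
        while start > 0:
            if seq[start - 1] != "A":
                if errors == max_errors:
                    break
                errors += 1
            start -= 1
            # Assumption: a stretch must also begin with a real 'A'.
            if seq[start] == "A" and end - start >= min_len:
                best_start = start
        if best_start is not None:
            return best_start, end
    return None


# 12 'A' followed by two junk bases: found, as the gap (2) is within 9.
print(find_polya_stretch("CGTCGTCGTG" + "A" * 12 + "GT"))
```

<p>
In this sketch, a reverse read could be handled by reverse-complementing it
first, so that its poly-T stretch becomes a poly-A stretch.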
</p></dd><dt><span class="term">
[clip_3ppolybase (c3pp)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd>
c3p* options to be described ...
</dd><dt><span class="term">
[clip_known_adaptorsright (ckar)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Defines
whether MIRA should search for and clip known sequencing
technology-specific adaptors. MIRA knows adaptors for Illumina
best, followed by Ion Torrent and some 454 adaptors.
</p><p>
As the list of known adaptors changes quite frequently, the
best way to see which adaptors MIRA knows is to look at the
text files in the program
sources: <code class="filename">src/mira/adaptorsforclip.*.xxd</code>.
</p></dd><dt><span class="term">
[ensure_minimum_left_clip(emlc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. If on, ensures a
minimum left clip on each read according to the parameters in
[-CL:mlcr:smlc].
</p></dd><dt><span class="term">
[minimum_left_clip_required(mlcr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used. If [-CL:emlc] is
on, checks whether there is a left clip whose length is at least the one
specified here.
</p></dd><dt><span class="term">
[set_minimum_left_clip(smlc)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. If [-CL:emlc] is on
and the actual left clip is smaller than [-CL:mlcr], sets the left clip of the read to
the value given here.
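</p><p>
How [-CL:emlc], [-CL:mlcr] and [-CL:smlc] work together can be
sketched in a few lines of Python. The function name and the example
values are hypothetical (the real defaults depend on the sequencing
technology); only the decision rule is taken from the descriptions above.
</p>

```python
def apply_minimum_left_clip(left_clip, emlc, mlcr, smlc):
    """Return the left clip of a read after the emlc/mlcr/smlc rule.

    left_clip: current left clip of the read (number of bases)
    emlc:      True if ensure_minimum_left_clip is switched on
    mlcr:      minimum left clip required
    smlc:      left clip to set when the requirement is not met
    """
    if emlc and mlcr > left_clip:
        return smlc
    return left_clip


# Hypothetical values: require at least 25 bases, else set the clip to 30.
print(apply_minimum_left_clip(0, True, 25, 30))    # clip too small: 30
print(apply_minimum_left_clip(40, True, 25, 30))   # clip large enough: 40
```

<p>
According to the descriptions below, the right clip parameters
([-CL:emrc:mrcr:smrc]) follow the same pattern.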
</p></dd><dt><span class="term">
[ensure_minimum_right_clip(emrc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the sequencing technology used. If on, ensures a
minimum right clip on each read according to the parameters in
[-CL:mrcr:smrc].
</p></dd><dt><span class="term">
[minimum_right_clip_required(mrcr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default
is dependent on the sequencing technology used. If [-CL:emrc] is
on, checks whether there is a right clip whose length is at least the one
specified here.
</p></dd><dt><span class="term">
[set_minimum_right_clip(smrc)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
dependent on the sequencing technology used. If [-CL:emrc] is on
and the actual right clip is smaller than [-CL:mrcr], sets the length of the
right clip of the read to the value given here.
</p></dd><dt><span class="term">
[gb_chimeradetectionclip(gbcdc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for all jobs.
</p><p>
Very safe chimera detection, should have no false
positives. For repetitive data, a low number of false
negatives is possible.
</p></dd><dt><span class="term">
[kmerjunk_detection(kjd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">yes</span>.
</p><p>
Reads that look "fishy" are marked as potentially
chimeric. This mark leads either to a read being completely
killed or to a read being included into a contig only if no
other possibility remains.
</p><p>
It is currently suggested to leave this parameter switched on
and to fine-tune via [-CL:kjck] (see below).
</p></dd><dt><span class="term">
[kmerjunk_completekill(kjck)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">no</span>
for genome assemblies and <span class="underline">yes</span> for EST/RNASeq assemblies.
</p><p>
If set to yes, reads marked as junk (see above) are completely
removed from an assembly. If set to no, reads are not removed
but included only into a contig as a very last resort.
</p><p>
Having reads killed guarantees assemblies of extremely high
quality containing virtually no misassemblies due to chimeric
sequencing errors. The downside is that, computationally,
there is no difference between junk and stretches with correct
but very low coverage data (generally < 3x coverage). It's
up to you to decide what is more important: total accuracy or
longer contigs.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
As a rule of thumb: I set this to no for genome assemblies
with at least medium average coverage (≥ 20-30x) as MIRA
does a pretty good job of incorporating these reads so late in
an assembly that they do not lead to misassemblies. In
transcript assemblies I set this to yes as there is a high
chance that high coverage transcripts could be extended via
chimeric reads.
</p><p>
With this in mind: deciding for metagenome assemblies would
be really difficult though. It probably depends on what you
need the data for.
</p></td></tr></table></div></dd><dt><span class="term">
[propose_end_clips(pec)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on --job quality: currently <span class="underline">yes</span> for all genome assemblies.
Switched off for EST assemblies (but one might want to switch
it on sometimes).
</p><p>
This implements a pretty powerful strategy to ensure a good
"high confidence region" (HCR) in reads, basically eliminating
99.9% of all junk at the 5' and 3' ends of reads. Note that
one still must ensure that sequencing vectors (Sanger) or
adaptor sequences (454, Solexa, Ion Torrent) are "more or less"
clipped prior to assembly.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Extremely effective, but should NOT be used for very low
coverage genomic data, or for EST projects if one wants to
retain the rarest transcripts.
</td></tr></table></div></dd><dt><span class="term">
[handle_solexa_ggcxg_problem(pechsgp)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p><p>
Solexa data has a pretty awful problem in some reads when
a <code class="literal">GGCxG</code> motif occurs (read more about it in
the chapter on Solexa data). In short: the sequencing errors
produced by this problem lead to many false positive SNP
discoveries in mapping assemblies or problems in contig
building in de-novo assembly.
</p><p>
MIRA knows about this problem and can look for it in Solexa
reads during the proposed end clipping and further clip back
the reads, greatly minimising the impact of this problem.
</p></dd><dt><span class="term">
[pec_kmer_size(peckms)=<em class="replaceable"><code>10 ≤ integer ≤ 32</code></em>]
</span></dt><dd><p>
Default is dependent on technology and quality in the --job
switch: usually
between <span class="underline">17</span>
and <span class="underline">21</span> for Sanger,
higher for 454 (up to
<span class="underline">27</span>) and highest for
Solexa (<span class="underline">31</span>). Ion Torrent
has at the moment <span class="underline">17</span>,
but this may change in the future to somewhat higher values.
</p><p>
This parameter defines the minimum number of bases at each end
of a read that should be free of any sequencing errors.
</p></dd><dt><span class="term">
[pec_minimum_kmer_forward_reverse(pmkfr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is dependent on technology and quality in the --job
switch: usually
between <span class="underline">1</span>
and <span class="underline">3</span>
when [-CL:pec=yes].
</p><p>
This parameter defines the minimum number of occurrences of a
kmer at each end of a read that should be free of any
sequencing errors.
</p></dd><dt><span class="term">
[rare_kmer_mask(rkm)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on the --job switch: currently
it's <span class="underline">yes</span> for Solexa data
and <span class="underline">no</span> otherwise. If
this parameter is active, MIRA will completely mask with 'X'
those parts of a read which have kmer occurrence (in forward
and reverse direction) less than the value specified
via [-CL:pmkfr].
</p><p>
This is a quality ensuring move which improves assembly of
ultra-high coverage contigs by cleaning out very likely, low
frequency sequence dependent sequencing errors which passed
all previous filters. The drawback is that very rare
transcripts or very lowly covered genome parts with an
occurrence less than the given value will also be masked
out. However, Illumina gives so much data that this is almost
never the case.
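</p><p>
The masking idea can be sketched in Python. This is a toy illustration,
not MIRA's code, and it rests on two assumptions that are not confirmed
by this guide: forward and reverse occurrences of a kmer are pooled by
counting each kmer together with its reverse complement, and a base is
kept as soon as at least one sufficiently frequent kmer covers it. All
function names are hypothetical.
</p>

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Pool a kmer with its reverse complement under one key."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def count_kmers(reads, k):
    """Count all kmers of all reads, forward and reverse pooled."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def mask_rare(read, counts, k, min_count):
    """Replace bases not covered by any frequent kmer with 'X'."""
    keep = [False] * len(read)
    for i in range(len(read) - k + 1):
        if counts[canonical(read[i:i + k])] >= min_count:
            for j in range(i, i + k):
                keep[j] = True
    return "".join(base if ok else "X" for base, ok in zip(read, keep))


# Three identical reads and one read with a deviating end.
reads = ["ACGTTGCAAG"] * 3 + ["ACGTTGCATT"]
counts = count_kmers(reads, k=4)
print(mask_rare(reads[3], counts, k=4, min_count=2))
```

<p>
In this toy data set only the deviating end of the fourth read is
masked, while the three identical reads would stay untouched.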
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This works only if [-CL:pec] is active.
</td></tr></table></div></dd><dt><span class="term">
[search_phix174(spx174)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for Illumina data, off
otherwise.
</p><p>
PhiX 174 is a small phage of enterobacteria whose DNA is often
spiked-in during Illumina sequencing to determine error rates
in datasets and to increase complexity in low-complexity
samples (amplicon, chipseq etc) to help in cluster
identification.
</p><p>
If it remains in the sequenced data, it has to be
seen as a contaminant for projects working with organisms
which should not contain the PhiX 174 phage.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
However, PhiX may be part of some genome sequences
(enterobacteria). In these cases, the PhiX174 search will
report genuine genome data.
</td></tr></table></div></dd><dt><span class="term">
[filter_phix174(fpx174)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for Illumina data in
EST (RNASeq) assemblies, off otherwise.
</p><p>
If [-CL:spx174] is on and [-CL:fpx174] also,
MIRA will filter out as contaminants all reads which have
PhiX174 sequence recognised.
</p><p>
The default value of having the filtering on only for Illumina
EST (RNASeq) data is a conservative approach: the overwhelming
majority of RNASeq data will indeed not sequence some
enterobacteria, so having PhiX174 containing reads thrown out
is indeed a valid move. For genomes however, MIRA currently is
cautious and will not filter these reads by default.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
However, PhiX may be part of some genome sequences
(enterobacteria). In these cases, the PhiX174 filter will
remove reads from valid genome or expression data.
</td></tr></table></div></dd><dt><span class="term">
[filter_rrna(frrna)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for EST (RNASeq)
assemblies, off otherwise.
</p><p>
If enabled, MIRA will filter out (and not assemble) all reads
(or pairs, see below) it recognises as being rRNA or
rDNA. This is useful to reduce computing time on data sets
which contain large amounts of contaminating rRNA that were not
filtered away in the wet lab.
</p></dd><dt><span class="term">
[filter_rrna_pairs(frrnap)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>on</em></span> for EST (RNASeq)
assemblies, off otherwise.
</p><p>
If enabled together with [-CL:frrna], MIRA will
filter out (and not assemble) all read pairs where at least
one of the reads is recognised as being rRNA or rDNA.
</p><p>
This option is useful to also catch less conserved parts of
transcribed rRNA, e.g. the internal transcribed spacers
(ITS) in eukaryotic data.
</p></dd><dt><span class="term">
[filter_rrna_numkmers(frrnank)=<em class="replaceable"><code>integer > 0
</code></em>]
</span></dt><dd><p> Default is <span class="emphasis"><em>20</em></span>.
</p><p>
The rRNA recognition in MIRA works with a precompiled set of
conserved rRNA kmers, at the time of this writing with
21-mers. To allow for specific recognition, the rRNA filtering
process expects to find at least this number of kmers per read
before identifying it as rRNA.
</p><p>
To increase sensitivity (and at the same time risk more false
positives): reduce this parameter. To increase specificity
(and at the same time risk more reads not being recognised):
increase this parameter.
</p><p>
The default parameters together with the default database seem
to work pretty well and this is expected to work for all but
the most exotic rRNA containing organisms.
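</p><p>
The role of [-CL:frrnank] and [-CL:frrnap] can be illustrated with a
short Python sketch. It is a simplification under assumptions: the
function names are hypothetical, the kmer set is passed in by the caller
instead of coming from MIRA's precompiled database, and every matching
kmer position counts as one hit.
</p>

```python
def looks_like_rrna(read, rrna_kmers, k=21, min_hits=20):
    """True if at least min_hits kmers of the read are in rrna_kmers."""
    hits = 0
    for i in range(len(read) - k + 1):
        if read[i:i + k] in rrna_kmers:
            hits += 1
            if hits >= min_hits:
                return True
    return False


def filter_rrna_pairs(pairs, rrna_kmers, k=21, min_hits=20):
    """Drop every pair in which at least one read looks like rRNA."""
    return [
        pair for pair in pairs
        if not any(looks_like_rrna(r, rrna_kmers, k, min_hits) for r in pair)
    ]


# Tiny example with 3-mers instead of 21-mers to keep it readable.
known = {"ACG", "CGT"}
print(looks_like_rrna("ACGTACGT", known, k=3, min_hits=4))   # 4 hits
print(looks_like_rrna("ACGTACGT", known, k=3, min_hits=5))   # too strict
```

<p>
Lowering <code class="literal">min_hits</code> in the sketch makes more reads
qualify as rRNA, raising it makes fewer reads qualify, which mirrors the
sensitivity versus specificity trade-off described above.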
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_skim_sk"></a>3.4.4.10.
Parameter group: -SKIM (-SK)
</h4></div></div></div><p>
Options that control the behaviour of the initial fast all-against-all read
comparison algorithm. Matches found here will be confirmed later in the
alignment phase. The new SKIM3 algorithm that is in place since version 2.7.4
uses a kmer based algorithm that works similarly to SSAHA (see Ning Z, Cox AJ,
Mullikin JC; "SSAHA: a fast search method for large DNA databases."; Genome
Res. 2001;11;1725-9).
</p><p>
The major differences of SKIM3 and SSAHA are:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
the word length <span class="emphasis"><em>n</em></span> of a kmer (hash) in
SSAHA2 must be < 15, but can be up to 32 bases in 64 bit
versions of MIRA < 4.0.2, and up to 256 bases for
later versions of MIRA.
</p></li><li class="listitem"><p>
SKIM3 uses a maximum fixed amount of RAM that is independent of
the word size. E.g., SSAHA would need 4 <span class="underline">exabytes</span> to work with a word length of
30 bases ... SKIM3 just takes a couple of hundred MB.
</p></li></ol></div><p>
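The exabyte figure in the second point can be reproduced with a little
arithmetic, shown here as a Python sketch. It assumes a lookup table with
one slot per possible word and 4 bytes per slot; the slot size is an
assumption chosen for illustration, not a documented property of SSAHA.
</p>

```python
def direct_table_entries(word_length):
    """Number of distinct DNA words of this length (4 letters per base)."""
    return 4 ** word_length


def direct_table_bytes(word_length, bytes_per_entry=4):
    """Size of a table with one slot per possible word."""
    return direct_table_entries(word_length) * bytes_per_entry


EXBIBYTE = 2 ** 60
GIBIBYTE = 2 ** 30
print(direct_table_bytes(30) / EXBIBYTE)   # 4.0
print(direct_table_bytes(14) / GIBIBYTE)   # 1.0
```

<p>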
The parameters for SKIM3:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[number_of_threads(not)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p>
Number of threads used in SKIM, default is <span class="underline">2</span>. A few parts of SKIM are
non-threaded, so the speedup is not exactly linear, but it
should be very close. E.g., with 2 processors I get a speedup
of 180-195%, with 4 between 350 and 395%.
</p><p>
Although the main data structures are shared between the
threads, there's some additional memory needed for each
thread.
</p></dd><dt><span class="term">
[also_compute_reverse_complements(acrc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">on</span>. Defines
whether SKIM searches for matches only in forward/forward
direction or whether it also looks for forward/reverse
direction.
</p><p>
You usually will not want to touch the default, except for very
special application cases where you do not want MIRA to use
reverse complement sequences at all.
</p></dd><dt><span class="term">
[kmer_size(kms)=<em class="replaceable"><code>10 < integer ≤ 256</code></em>]
</span></dt><dd><p>
Defaults are dependent on "--job" switch and sequencing
technologies used.
</p><p>
Controls the number of consecutive bases
<span class="emphasis"><em>n</em></span> which are used as a kmer. The
higher the value, the faster the search. The lower the value,
the slower the search and the more weak matches are found.
</p><p>
A secondary effect of this parameter is the estimation of MIRA
on whether stretches within a read sequence are repetitive or
not. Large values of [-SK:kms] allow a better
distinction between "almost identical" repeats early in the
assembly process and, given enough coverage, generally lead to
fewer and longer contigs.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This parameter gets overridden by the one-stop-shop parameter
[-AS:kms], which determines the number of passes and the kmer
size to use in each pass.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For de-novo assemblies, values below 15 are not
recommended. For mapping assemblies, values below 10 should
not be used.
</td></tr></table></div></dd><dt><span class="term">
[kmer_save_stepping(kss)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p> Default is
<span class="underline">1</span>. This is a parameter
controlling the stepping increment <span class="emphasis"><em>s</em></span> with which kmers are
generated. This allows for more or less fine grained search as
matches are found with at least <span class="emphasis"><em>n+s</em></span> (see [-SK:kms])
equal bases. The higher the value, the faster the search. The
lower the value, the more weak matches are found.
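</p><p>
The effect of the stepping on the number of kmers that have to be stored
can be illustrated with a minimal Python sketch. The function name is
hypothetical and the code only demonstrates the general idea of sampling
kmers at every <span class="emphasis"><em>s</em></span>-th position, not
the SKIM3 implementation.
</p>

```python
def kmers_with_stepping(seq, k, step=1):
    """Return (position, kmer) pairs, taking one kmer every `step` bases."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


seq = "ACGTACGTAC"
print(len(kmers_with_stepping(seq, k=4, step=1)))   # every position
print(kmers_with_stepping(seq, k=4, step=3))        # every third position
```

<p>
A larger stepping leaves fewer kmers to store and compare, which is where
the speed gain comes from.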
</p></dd><dt><span class="term">
[percent_required(pr)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p> Default is dependent on the sequencing technology used
and the desired assembly quality. Controls the relative percentage of
exact word matches in an approximate overlap that has to be
reached to accept this overlap as possible match. Increasing
this number will decrease the number of possible alignments that
have to be checked by Smith-Waterman later on in the assembly,
but it also might lead to the rejection of weaker overlaps (i.e.
overlaps that contain a higher number of mismatches).
</p><p>
Note: most of the time it makes sense to keep this parameter
in sync with [-AL:mrs].
</p></dd><dt><span class="term">
[maxhits_perread(mhpr)=<em class="replaceable"><code>integer ≥ 1</code></em>]
</span></dt><dd><p> Default is
<span class="underline">2000</span>. Controls the maximum
number of possible hits one read can transport to the
overlap edge reduction phase. If more potential hits are found,
only the best ones are taken.
</p><p>
In the pre-2.9.x series, this was an important option for
tackling projects which contain <span class="emphasis"><em>extreme</em></span>
assembly conditions. It still is if you run out of memory in
the graph edge reduction phase. Try then to lower it to 1000,
500 or even 100.
</p><p>
As the assembly increases in passes ([-AS:nop]),
different combinations of possible hits will be checked,
always the probably best ones first. So the accuracy of the
assembly should only suffer when lowering this number too
much.
</p></dd><dt><span class="term">
[filter_megahubs(fmh)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">yes</span>. Defines whether megahubs (reads
with extremely many overlaps to other reads) are filtered.
See also [-SK:mhc:mmhr].
</p></dd><dt><span class="term">
[megahub_cap(mhc)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is <span class="underline">150000</span>. Defines the number of kmer
overlaps a read may have before it is categorised as megahub.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
You basically don't want to mess with this one. Except for
assemblies containing very long reads. Rule of thumb: you
might want to multiply the 150k value by n where n is the
average read length divided by 2000. Don't overdo, max n at 15
or so.
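<p>
The rule of thumb can be written down as a small Python helper. The
function name is hypothetical, and clamping the factor to at least 1 (so
that short reads keep the default of 150000) is an assumption of this
sketch.
</p>

```python
def suggested_megahub_cap(avg_read_length, base_cap=150000, max_factor=15):
    """Scale the megahub cap by n = average read length / 2000."""
    n = avg_read_length / 2000
    n = max(1.0, min(n, max_factor))   # never below default, never above 15x
    return int(base_cap * n)


print(suggested_megahub_cap(300))      # short reads: default stays
print(suggested_megahub_cap(10000))    # n = 5
print(suggested_megahub_cap(100000))   # n would be 50, capped at 15
```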
</td></tr></table></div></dd><dt><span class="term">
[max_megahub_ratio(mmhr)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">0</span>. If the number of reads
identified as megahubs exceeds the allowed ratio, MIRA will
abort.
</p><p>
This is a fail-safe parameter to avoid assemblies where things
look fishy. In case you see this, you might want to ask for
advice on the mira_talk mailing list. In short: bacteria
should never have megahubs (90% of all cases reported were
contamination of some sort and the remaining 10% were due to incredibly
high coverage numbers). Eukaryotes are likely to contain
megahubs if filtering ([-KS:mnr]) is not on.
</p><p>
EST projects, however, especially from non-normalised libraries,
will very probably contain megahubs. In this case, you might
want to think about masking, see [-KS:mnr].
</p></dd><dt><span class="term">
[sw_check_on_backbones(swcob)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is currently (3.4.0) <span class="underline">yes</span> for accurate mapping
jobs. Takes effect only in mapping assemblies. Defines whether
SKIM hits against a backbone (reference) sequence with less
than 100% identity are double checked with Smith-Waterman to
improve mapping accuracy.
</p><p>
You will want to set this option to <span class="underline">yes</span> whenever your reference
sequence contains more complex or numerous repeats and your
data has SNPs in those areas.
</p></dd><dt><span class="term">
[max_kmers_in_memory(mkim)=<em class="replaceable"><code>integer ≥ 100000</code></em>]
</span></dt><dd><p> Default is
<span class="underline">15000000</span>. Has no influence
on the quality of the assembly, only on the maximum memory size
needed during the skimming. The default value is equivalent to
approximately 500MB.
</p><p>
Note: reducing the number will increase the run time, the more drastically
the bigger the reduction. On the other hand, increasing the default value
chosen will not result in speed improvements that are really noticeable. In
short: leave this number alone if you are not desperate to save a few MB.
</p></dd><dt><span class="term">
[memcap_hitreduction(mchr)=<em class="replaceable"><code>integer ≥ 10</code></em>]
</span></dt><dd><p> Default is
<span class="underline">1024</span>, <span class="underline">2048</span>
when Solexa sequences are used. Maximum memory used (in MiB)
during the reduction of skim hits.
</p><p>
Note: has no influence on the quality of the assembly,
reducing the number will increase the runtime, the more
drastically the bigger the reduction as hits then must be
streamed multiple times from disk.
</p><p>
The default is good enough for assembly of bacterial genomes
or small eukaryotes (using Sanger and/or 454 sequences). As
soon as assembling something bigger than 20 megabases, you
should increase it to 2048 or 4096 (equivalent to 2 or 4 GiB
of memory).
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_hashstatistics_hs"></a>3.4.4.11.
Parameter group: -KMERSTATISTICS (-KS)
</h4></div></div></div><p>
Hash statistics (nowadays called kmer statistics in literature
or other software packages) allow reads to be quickly assessed from a
coverage point of view without actually assembling them. MIRA
uses this as a quick pre-assembly evaluation to find and tag reads
which are from repetitive and non-repetitive parts of a project.
</p><p>
The length of the kmer is defined via [-SK:kms]
or [-AS:kms] while the parameters in this section define
the boundaries of the different repeat levels.
</p><p>
A more in-depth description on kmer statistics is given in the
sections <span class="emphasis"><em>Introduction to 'masking'</em></span>
and <span class="emphasis"><em>How does 'nasty repeat' masking work?</em></span> in
the chapter dealing with the assembly of hard projects.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[freq_est_minnormal(fenn)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring less than
[-KS:fenn] times the average occurrence will be tagged
with a HAF2 (less than average) tag.
</p></dd><dt><span class="term">
[freq_est_maxnormal(fexn)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fenn] but less than [-KS:fexn] times
the average occurrence will be tagged with a HAF3 (normal) tag.
</p></dd><dt><span class="term">
[freq_est_repeat(fer)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fexn] but less than [-KS:fer] times
the average occurrence will be tagged with a HAF4 (above average) tag.
</p></dd><dt><span class="term">
[freq_est_heavyrepeat(fehr)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fer] but less than [-KS:fehr] times
the average occurrence will be tagged with a HAF5 (repeat) tag.
</p></dd><dt><span class="term">
[freq_est_crazyrepeat(fecr)=<em class="replaceable"><code>float > 0</code></em>]
</span></dt><dd><p>
During kmer statistics analysis, MIRA will estimate how repetitive parts
of reads are. Parts which are occurring more than
[-KS:fehr] but less than [-KS:fecr] times
the average occurrence will be tagged with a HAF6 (heavy
repeat) tag. Parts which are occurring more than
[-KS:fecr] but less than [-KS:nrr] times the
average occurrence will be tagged with a HAF7 (crazy repeat)
tag.
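</p><p>
The thresholds of this section partition the occurrence ratio (occurrence
of a kmer divided by the average occurrence) into consecutive bands. A
Python sketch of that partition follows. The threshold values in the
example are purely illustrative and not the documented defaults, the
function name is hypothetical, and the treatment of values lying exactly
on a boundary is an assumption.
</p>

```python
from bisect import bisect_right

TAGS = ["HAF2", "HAF3", "HAF4", "HAF5", "HAF6", "HAF7", "nasty"]


def haf_tag(ratio, fenn, fexn, fer, fehr, fecr, nrr):
    """Map an occurrence ratio to its tag.

    Ratios above nrr are reported as 'nasty': these are the stretches
    that -KS:mnr deals with instead of giving them a HAF tag.
    """
    bounds = [fenn, fexn, fer, fehr, fecr, nrr]
    return TAGS[bisect_right(bounds, ratio)]


# Illustrative thresholds only.
limits = dict(fenn=0.4, fexn=1.6, fer=1.9, fehr=8, fecr=20, nrr=100)
for ratio in (0.2, 1.0, 1.7, 5, 10, 50, 500):
    print(ratio, haf_tag(ratio, **limits))
```

<p>
The bands only make sense if the thresholds increase from [-KS:fenn] to
[-KS:nrr], which the sketch silently assumes.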
</p></dd><dt><span class="term">
[mask_nasty_repeats(mnr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on --job
type: <span class="underline">yes</span> for
de-novo, <span class="underline">no</span> for mapping.
</p><p>
Tells MIRA to tag (during the kmer statistics phase) read
subsequences of length [-SK:kms] nucleotides that
appear more than X times more often than the median occurrence
of subsequences would otherwise suggest. The threshold X from
which on subsequences are considered nasty is set by
[-KS:nrr] or [-KS:nrc], the action MIRA
should take when encountering those sequences is defined
by [-KS:ldn] (see below).
</p><p>
When not using lossless digital normalisation
([-KS:ldn]), the tag used by MIRA will be "MNRr"
which stands for "Mask Nasty Repeat in read". This tag has an
active masking function in MIRA and the fast all-against-all
overlap searcher (SKIM) will then completely ignore the tagged
subsequences of reads. There's one drawback though: the
smaller the reads are that you try to assemble, the higher the
probability that your reads will not span nasty repeats
completely, therefore leading to an abortion of contig building
at this site. Reads completely covered by the MNRr tag will
land in the debris file as no overlap will be found.
</p><p>
This option is extremely useful for assembly of larger
projects (fungi-size) with a high percentage of repeats.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Although it is expected that bacteria will not really need
this, leaving it turned on will probably not harm except in
unusual cases like several copies of (pro-)phages integrated
in a genome.
</td></tr></table></div></dd><dt><span class="term">
[nasty_repeat_ratio(nrr)=<em class="replaceable"><code>integer ≥ 2</code></em>]
</span></dt><dd><p>
Default depends on the [--job=...]
parameters. Normally it's high (around 100) for genome
assemblies, but much lower (20 or less) for EST assemblies.
</p><p>
Sets the ratio above which subsequences are considered nasty
and hidden from the kmer statistics overlapper with an
<span class="emphasis"><em>MNRr</em></span> tag. E.g., a value of 10 means: mask all
k-mers of [-SK:kms] length which occur more
than 10 times more often than the average of the whole project.
</p></dd><dt><span class="term">
[nasty_repeat_coverage(nrc)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
The default depends on the [--job=...]
parameters: <span class="underline">0</span> for genome
assemblies, <span class="underline">200</span> for EST assemblies.
</p><p>
Closely related to the [-KS:nrr] parameter (see
above), but while the above works on ratios derived from a
calculated average, this parameter lets you set an absolute
value. Note that this parameter will take precedence
over [-KS:nrr] if the calculated value of nrr is
larger than the absolute value given here. A value of 0
de-activates this parameter.
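</p><p>
The interplay of [-KS:nrr] and [-KS:nrc] can be pictured with a
small Python sketch. It is not MIRA code: the function name
nasty_threshold and its arguments are hypothetical, and the sketch
assumes that the ratio is simply multiplied with the average kmer
occurrence of the project.
</p><pre class="programlisting">
```python
def nasty_threshold(average_occurrence, nrr, nrc=0):
    """Return the kmer occurrence above which a kmer counts as nasty.

    Hypothetical helper, not part of MIRA: assumes the ratio nrr is
    multiplied with the average kmer occurrence of the project.
    """
    by_ratio = nrr * average_occurrence
    # nrc=0 de-activates the absolute value; otherwise it takes
    # precedence whenever the ratio-derived value is larger.
    if nrc > 0 and by_ratio > nrc:
        return nrc
    return by_ratio


print(nasty_threshold(30, nrr=100))           # ratio only
print(nasty_threshold(30, nrr=100, nrc=200))  # absolute value wins
```
</pre><p>
Under these assumptions, an average occurrence of 30 with nrr=100
gives a threshold of 3000, which an nrc of 200 would cap at 200.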
</p></dd><dt><span class="term">
[lossless_digital_normalisation(ldn)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is dependent on --job
type: <span class="underline">yes</span> for denovo
EST/RNAseq assembly, <span class="underline">no</span>
otherwise.
</p><p>
Tells MIRA whether or not to digitally normalise reads containing nasty repeats
when [-KS:mnr] is active.
</p><p>
When set to <span class="emphasis"><em>yes</em></span>, MIRA will apply a
modified digital normalisation step to the reads, effectively
decreasing the coverage of a given repetitive stretch down to
a minimum needed to correctly represent one copy of the
repeat. However, contrary to the published method, MIRA will
keep enough reads of repetitive regions to also correctly
reconstruct slightly different variants of the repeats present
in the genome or EST / RNASeq data set, even if they differ in
only a single base.
</p><p>
The tag used by MIRA to denote stretches which may have
contributed to the digital normalisation will be
"DGNr". Additionally, contigs which contain reads completely
covered by a DGNr tag will get an additional "_dn" as part of
their name to show that they contain read representatives for
digital normalisation. E.g.: "contig_dn_c1".
</p><p>
This option is extremely useful for non-normalised EST /
RNASeq projects, to get at least the sequence of
overrepresented transcripts assembled even if the coverage
values then cannot be interpreted as expression values
anymore.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The lossless digital normalisation will be applied as soon as
the kmer size of the active pass (see [-AS:kms])
reaches a size of at least 50 or, at the latest, in the second
to last pass.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Once digital normalisation has been applied, the
parameters [-KS:nrr] and [-KS:nrc] do not
take effect anymore.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
The effect of lossless digital normalisation on genome data
has not been studied sufficiently by me to approve it for
genomes. Use with care in genome assemblies.
</td></tr></table></div></dd><dt><span class="term">
[repeatlevel_in_infofile(rliif)=<em class="replaceable"><code>integer; 0, 5-8</code></em>]
</span></dt><dd><p>
Default is <span class="underline">6</span>. Sets the
minimum level of the HAF tags from which on MIRA will report
tentatively repetitive sequence in the
<code class="filename">*_info_readrepeats.lst</code> file of the info
directory.
</p><p>
A value of <span class="underline">0</span> means
"switched off". The default value of <span class="underline">6</span> means all subsequences tagged
with <span class="emphasis"><em>HAF6</em></span>, <span class="emphasis"><em>HAF7</em></span> and
<span class="emphasis"><em>MNRr</em></span> will be logged. If you, e.g., only
wanted MNRr logged, you'd use <span class="underline">8</span> as parameter value.
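</p><p>
As a rough illustration of how the level selects tags, consider the
following Python sketch. The names LEVELS and logged_tags are
hypothetical, and mapping level 5 to a HAF5 tag is an assumption
inferred from the allowed value range.
</p><pre class="programlisting">
```python
# Tag levels in ascending order; the numeric levels follow the
# examples given for [-KS:rliif] (6 = HAF6 and up, 8 = MNRr only).
# Level 5 = HAF5 is an assumption of this sketch.
LEVELS = [(5, "HAF5"), (6, "HAF6"), (7, "HAF7"), (8, "MNRr")]


def logged_tags(rliif):
    """Return the tags reported for a given rliif value (0 = off)."""
    if rliif == 0:
        return []
    return [tag for level, tag in LEVELS if level >= rliif]


print(logged_tags(6))
print(logged_tags(8))
```
</pre><p>
The sketch returns HAF6, HAF7 and MNRr for the default of 6, only
MNRr for 8, and nothing for 0.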
</p><p>
See also [-KS:fenn:fexn:fer:fehr:mnr:nrr] to set the
different levels for the <span class="emphasis"><em>HAF</em></span> and
<span class="emphasis"><em>MNRr</em></span> tags.
</p></dd><dt><span class="term">
[memory_to_use(mtu)=<em class="replaceable"><code>integer</code></em>]
</span></dt><dd><p>
Default is <span class="underline">75</span>. Defines
the memory MIRA can use to compute kmer statistics.
</p><p>
A value of <span class="underline">>100</span> is
interpreted as absolute value in megabyte. E.g., 16384 = 16384
megabyte = 16 gigabyte.
</p><p>
A value of <span class="underline">0 ≤ x ≤100</span> is
interpreted as relative value of free memory at the time of
computation. E.g.: for a value of 75% and 10 gigabyte of free
memory, it will use 7.5 gigabyte.
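</p><p>
The two interpretations can be summarised in a few lines of
Python. This is only a sketch: the helper kmer_memory_mb and its
argument names are hypothetical and not part of MIRA.
</p><pre class="programlisting">
```python
def kmer_memory_mb(mtu, free_memory_mb):
    """Translate an mtu value into megabytes (hypothetical helper)."""
    if mtu > 100:
        # Values above 100 are absolute amounts in megabyte.
        return float(mtu)
    # Values from 0 to 100 are a percentage of the free memory.
    return free_memory_mb * mtu / 100


print(kmer_memory_mb(16384, 32000))  # absolute form
print(kmer_memory_mb(75, 10000))     # relative form
```
</pre><p>
With these inputs the sketch yields 16384 megabyte for the absolute
form and 7500 megabyte (7.5 gigabyte) for the relative form.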
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The minimum amount of memory this algorithm will use is 512 MiB
on 32 bit systems and 2 GiB on 64 bit systems.
</td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_align_al"></a>3.4.4.12.
Parameter group: -ALIGN (-AL)
</h4></div></div></div><p>
The align options control the behaviour of the Smith-Waterman alignment
routines. Only read pairs which are confirmed here may be included into
contigs. These options affect both the checking of possible alignments found by SKIM
and the phase when reads are integrated into a contig.
</p><p>
Every option in this section can be set individually for every sequencing
technology, giving a very fine grained control on how reads are aligned for
each technology.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[bandwidth_in_percent(bip)=<em class="replaceable"><code>integer > 0 and ≤100</code></em>]
</span></dt><dd><p> Default
depends on the sequencing technology used. The banded Smith-Waterman
alignment uses this percentage to compute the bandwidth it has to use
when computing the alignment matrix. E.g., with an expected overlap of 150 bases
and bip=10, the banded SW will compute a band of 15 bases to each side of
the expected alignment diagonal, thus allowing up to 15 unbalanced inserts /
deletes in the alignment. INCREASING AND DECREASING THIS NUMBER:
<span class="emphasis"><em>increase</em></span>: will find more non-optimal alignments, but will also
increase SW runtime somewhere between linearly and quadratically. <span class="emphasis"><em>decrease</em></span>: the other
way round, might miss a few bad alignments but gains speed.
</p></dd><dt><span class="term">
[bandwidth_min(bmin)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default depends on the
sequencing technology used. Minimum bandwidth in bases to each side.
</p></dd><dt><span class="term">
[bandwidth_max(bmax)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default depends on the
sequencing technology used. Maximum bandwidth in bases to each side.
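</p><p>
How [-AL:bip], [-AL:bmin] and [-AL:bmax] might combine can be
sketched in Python as follows. The helper band_per_side is
hypothetical, and both the clamping and the integer rounding are
assumptions rather than behaviour taken from the MIRA sources. The
bmin and bmax values in the call are made-up illustration values.
</p><pre class="programlisting">
```python
def band_per_side(expected_overlap, bip, bmin, bmax):
    """Bases computed to each side of the expected diagonal.

    Hypothetical helper: the clamping order and the integer rounding
    are assumptions, not taken from the MIRA sources.
    """
    band = expected_overlap * bip // 100
    return max(bmin, min(band, bmax))


print(band_per_side(150, bip=10, bmin=5, bmax=100))
```
</pre><p>
For an expected overlap of 150 bases and bip=10 the sketch gives a
band of 15 bases to each side; very short or very long overlaps end
up at the minimum or maximum bandwidth instead.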
</p></dd><dt><span class="term">
[min_overlap(mo)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default depends on the
sequencing technology used. Minimum number of overlapping bases needed for an
alignment of two sequences to be accepted.
</p></dd><dt><span class="term">
[min_score(ms)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default depends on the
sequencing technology used. Describes the minimum score of an overlap to be
taken into account for assembly. MIRA uses a default scoring scheme
for SW align: each match counts 1, a match with an N counts 0, each mismatch
with a non-N base -1 and each gap -2. Use a higher score to weed out a
number of chance matches, a lower score to perhaps find the single (short)
alignment that might join two contigs together (at the expense of computing
time and memory).
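</p><p>
The default scoring scheme can be tried out on a finished alignment
with the following Python sketch. It is not MIRA's implementation:
the function alignment_score is hypothetical, it scores two already
aligned rows instead of computing a Smith-Waterman alignment, and
the use of "-" as gap character is an assumption.
</p><pre class="programlisting">
```python
def alignment_score(read_a, read_b, gap="-"):
    """Score two gapped, equally long alignment rows."""
    if len(read_a) != len(read_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(read_a.upper(), read_b.upper()):
        if a == gap or b == gap:
            score -= 2          # each gap counts -2
        elif a == "N" or b == "N":
            score += 0          # a match with an N counts 0
        elif a == b:
            score += 1          # each match counts 1
        else:
            score -= 1          # mismatch with a non-N base
    return score


print(alignment_score("ACGTNACGT", "ACCT-ACGT"))
```
</pre><p>
The example rows contain seven matches, one mismatch and one gap,
which adds up to a score of 4. An alignment is kept only if its
score reaches the value of [-AL:ms].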
</p></dd><dt><span class="term">
[min_relative_score(mrs)=<em class="replaceable"><code>integer > 0 and ≤100</code></em>]
</span></dt><dd><p> Default depends on the sequencing technology
used. Describes the minimum percentage of matching between two reads to be
considered for assembly. Increasing this number will save
memory, but one might lose possible alignments. I propose a
maximum of 80 here. Decreasing below 55% will probably make memory and
time consumption explode.
</p><p>
Note: most of the time it makes sense to keep this parameter
in sync with
[-SK:pr].
</p></dd><dt><span class="term">
[solexa_hack_max_errors(shme)=<em class="replaceable"><code>integer > -1</code></em>]
</span></dt><dd><p>
Currently a hack just for Solexa/Illumina data.
</p><p>
When running in mapping mode, this defines the maximum number
of mismatches and gaps a read may have compared to the
reference to be allowed to map. The result is usually a much
better mapping in areas with larger discrepancies between
reference sequence and mapped data. Note that the mapping
process takes longer if this value is not 0, as MIRA
will use iterative mapping which involves a certain amount of
trial and error.
</p><p>
The default value of <span class="underline">-1</span>
lets MIRA choose this value automatically. It sets it to 15%
of the average Illumina read lengths loaded.
</p><p>
A value of <span class="underline">0</span> switches off
this functionality, leading to a much faster mapping
process. Useful when mapping expression data where coverage
values may be more important than the best possible alignment.
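</p><p>
The three kinds of values can be summarised in a short Python
sketch. The helper max_mapping_errors is hypothetical, and rounding
the automatic value down is an assumption.
</p><pre class="programlisting">
```python
def max_mapping_errors(shme, average_read_length):
    """Resolve shme into an error limit (hypothetical helper).

    A result of 0 means the iterative mapping is switched off.
    """
    if shme == -1:
        # Automatic mode: 15% of the average Illumina read length.
        # Rounding down is an assumption of this sketch.
        return average_read_length * 15 // 100
    return shme


print(max_mapping_errors(-1, 100))
print(max_mapping_errors(0, 100))
```
</pre><p>
For reads averaging 100 bases the automatic setting resolves to 15
mismatches and gaps per read in this sketch.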
</p></dd><dt><span class="term">
[extra_gap_penalty(egp)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default depends on the sequencing technology
used. Defines whether or not to increase penalties applied to
alignments containing long gaps. Setting this to 'yes' might
help in projects with frequent repeats. On the other hand, it
is definitely counterproductive when assembling very long reads
containing multiple long indels in the called base sequence
... although this should not happen in the first place and is
a sure sign of problems lying ahead.
</p><p>
When in doubt, set it
to <span class="underline">yes</span> for EST projects
and de-novo genome assembly, set it
to <span class="underline">no</span> for assembly of
closely related strains (assembly against a backbone).
</p><p>
When set to <span class="underline">no</span>, it is
recommended to have [-CO:amgb]
and [-CO:amgbemc] both set to yes.
</p></dd><dt><span class="term">
[egp_level(egpl)=<em class="replaceable"><code>comma separated list of integer ≥ 0 and ≤ 100</code></em>]
</span></dt><dd><p>
Default depends on the sequencing technology and job
used. Has no effect if extra_gap_penalty is off.
</p><p>
...
</p></dd><dt><span class="term">
[egp_level(megpp)=<em class="replaceable"><code>0 ≤ integer ≤ 100</code></em>]
</span></dt><dd><p> Default is
<span class="underline">100</span>. Has no effect if
extra_gap_penalty is off. Defines the maximum extra penalty in
percent applied to 'long' gaps.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_contig_co"></a>3.4.4.13.
Parameter group: -CONTIG (-CO)
</h4></div></div></div><p>
The contig options control the behaviour of the contig objects.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[name_prefix(np)=<em class="replaceable"><code>string</code></em>]
</span></dt><dd><p>
Default is
<span class="underline"><projectname></span>. Contigs
will have this string prepended to their names. Normally,
the [project=] line in the manifest will set this.
</p></dd><dt><span class="term">
[reject_on_drop_in_relscore(rodirs)=<em class="replaceable"><code>integer ≥ 0 and ≤100</code></em>]
</span></dt><dd><p>
Default depends on the sequencing technology used.
</p><p>
When adding reads to a contig, reject the reads if the drop in
the minimum relative score of the alignment of the current
consensus and the new read is larger than the expected value
calculated during the alignment phase. Lower values mean
stricter checking.
</p><p>
This value is doubled should a read be entered that has an
assembled template partner (a read pair) at the right distance
in the current contig.
</p></dd><dt><span class="term">
[cmin_relative_score(cmrs)=<em class="replaceable"><code>integer ≥ -1 and ≤100</code></em>]
</span></dt><dd><p>
Default is <span class="underline">-1</span>. Works
similarly to [-AL:mrs], but during contig
construction phase instead of read vs read alignment phase:
describes the min % of matching between a read being added to
a contig and the current contig consensus.
</p><p>
If value is set to -1, then the value of [-AL:mrs] is used.
</p><p>
Note: most of the time it makes sense to keep this parameter
at -1. Else have it at
approximately <span class="emphasis"><em>[-AL:mrs]-10</em></span> or
switch it completely off via 0.
</p></dd><dt><span class="term">
[mark_repeats(mr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span>. One of the most important switches in MIRA: if set to
<span class="underline">yes</span>, MIRA will try to resolve misassemblies due to repeats by
identifying single base stretch differences and tagging those critical bases as
RMB (Repeat Marker Base, weak or strong). This switch is also needed when
MIRA is run in EST mode to identify possible inter-, intra- and
intra-and-interorganism SNPs.
</p></dd><dt><span class="term">
[only_in_result(mroir)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>. Only
takes effect when [-CO:mr] (see above) is set
to <span class="underline">yes</span>. If set
to <span class="underline">yes</span>, MIRA will not use
the repeat resolving algorithm during build time (and therefore
will not be able to take advantage of this), but only before
saving results to disk.
</p><p>
This switch is useful in some (rare) cases of mapping assembly.
</p></dd><dt><span class="term">
[assume_snp_instead_repeat(asir)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">no</span>.
Only takes effect when [-CO:mr] (see above) is set to
<span class="underline">yes</span>; the effect also
depends on whether strain data (see
[-SB:lsd]) is present or not. Usually, MIRA will mark
bases that differentiate between repeats when a conflict occurs
between reads that belong to one strain. If the conflict occurs
between reads belonging to different strains, they are marked as
SNP. However, if this switch is set
to <span class="underline">yes</span>, conflicts within a
strain are also marked as SNP.
</p><p>
This switch is mainly used in assemblies of ESTs, it should
not be set for genomic assembly.
</p></dd><dt><span class="term">
[min_reads_per_group(mrpg)=<em class="replaceable"><code>integer ≥ 2</code></em>]
</span></dt><dd><p> Default
depends on the sequencing technology used. Only takes effect when
[-CO:mr] (see above) is set
to <span class="underline">yes</span>. This defines the
minimum number of reads in a group that are needed for the RMB
(Repeat Marker Bases) or SNP detection routines to be
triggered. A group is defined by the reads carrying the same
nucleotide for a given position, i.e., an assembly with mrpg=2
will need at least two times two reads with the same nucleotide
(having at least a quality as defined in [-CO:mgqrt])
to be recognised as repeat marker or a SNP. Setting this to a
low number increases sensitivity, but might produce a few false
positives, resulting in reads being thrown out of contigs
because of falsely identified possible repeat markers (or
wrongly recognised as SNP).
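</p><p>
The group criterion can be illustrated with a simplified Python
sketch. The function triggers_detection is hypothetical; it looks
at a single alignment column only and leaves out the group quality
check of [-CO:mgqrt].
</p><pre class="programlisting">
```python
from collections import Counter


def triggers_detection(column, mrpg=2):
    """True if at least two base groups each hold mrpg or more reads.

    column holds one base per read at a single alignment position.
    Simplified sketch: the group quality check of [-CO:mgqrt] is
    deliberately left out.
    """
    groups = Counter(column)
    large_groups = [base for base, n in groups.items() if n >= mrpg]
    return len(large_groups) >= 2


print(triggers_detection("AAAAGG"))
print(triggers_detection("AAAAAG"))
```
</pre><p>
With mrpg=2, a column of four A and two G qualifies, whereas a
single deviating G among five A does not.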
</p></dd><dt><span class="term">
[min_neighbour_qual (mnq)=<em class="replaceable"><code>integer ≥
10</code></em>]
</span></dt><dd><p>
Default depends on the sequencing technology used. Only takes
effect when [-CO:mr] is set
to <span class="underline">yes</span>. This defines the
minimum quality that the neighbouring bases of a base must have
for that base to be taken into consideration when deciding whether
column base mismatches are relevant or not.
</p></dd><dt><span class="term">
[min_groupqual_for_rmb_tagging(mgqrt)=<em class="replaceable"><code>integer ≥ 25</code></em>]
</span></dt><dd><p>
Default depends on the sequencing technology used. Only takes
effect when [-CO:mr] is set
to <span class="underline">yes</span>. This defines the
minimum quality of a group of bases to be taken into account
as potential repeat marker. The lower the number, the more
sensitive you get, but lowering below 25 is not recommended as
a lot of wrongly called bases can have a quality approaching
this value and you'd end up with a lot of false positives. The
higher the overall coverage of your project, the better, and
the higher you can set this number. A value of 35 will
probably remove most false positives, a value of 40 will
probably never show false positives ... but will generate a
sizable number of false negatives.
</p></dd><dt><span class="term">
[min_coverage_percentage(mcp)=<em class="replaceable"><code>0 < integer ≤ 100</code></em>]
</span></dt><dd><p>
Default is currently <span class="underline">10</span>. Used to reduce the number of
IUPAC bases due to non-random PCR artefacts or sequencing
errors in very high coverage areas (e.g. Illumina ≥
80). Once the most probable base has been determined,
[-CO:mcp] defines the minimum percentage (calculated
from the most probable base) the coverage of alternative bases
must have to be considered for consensus. E.g.: with mcp=10
and the most probable base having a coverage of 200x, other
bases must have a coverage of at least 20x.
</p><p>
Drawback is that valid low frequency variants will not show up
anymore as IUPAC in the FASTA.
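</p><p>
The effect of [-CO:mcp] on a single consensus column can be
sketched in Python. The function consensus_candidates is
hypothetical, and taking the base with the highest coverage as the
most probable base is a simplifying assumption.
</p><pre class="programlisting">
```python
def consensus_candidates(coverage, mcp=10):
    """Bases that remain candidates for the consensus.

    coverage maps each base to its coverage in one column. For
    simplicity the most probable base is taken to be the one with
    the highest coverage, which is an assumption.
    """
    best = max(coverage, key=coverage.get)
    minimum = coverage[best] * mcp / 100
    return sorted(b for b, c in coverage.items() if c >= minimum)


print(consensus_candidates({"A": 200, "G": 25, "T": 5}))
```
</pre><p>
With A at 200x, the 25x of G clears the limit of 20x while the 5x
of T is ignored, so only A and G could contribute to an IUPAC code.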
</p></dd><dt><span class="term">
[endread_mark_exclusion_area(emea)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p> Default depends on the sequencing technology
used. Only takes effect when [-CO:mr] is set to
<span class="underline">yes</span>. Using the end of
sequences of Sanger type shotgun sequencing is always a bit
risky, as wrongly called bases tend to crowd there or some
sequencing vector relics hang around. It is even more risky to
use these stretches for detecting possible repeats, so one can
define an exclusion area where the bases are not used when
determining whether a mismatch is due to repeats or not.
</p></dd><dt><span class="term">
[emea_set1_on_clipping_pec(emeas1clpec)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. When
[-CL:pec] is set, the end-read exclusion area can be
considerably reduced. Setting this parameter will
automatically do this.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Although the parameter is named "set to 1", it may be that the
exclusion area is actually a bit larger (2 to 4), depending on
what users will report back as "best" option.
</td></tr></table></div></dd><dt><span class="term">
[also_mark_gap_bases(amgb)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default depends on the sequencing technology
used. Determines whether columns containing gap bases (indels)
are also tagged.
</p><p>
Note: it is strongly recommended to not set this to 'yes' for
454 type data.
</p></dd><dt><span class="term">
[also_mark_gap_bases_even_multicolumn(amgbemc)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">yes</span>.
Takes effect only when [-CO:amgb] is set to
<span class="underline">yes</span>. Determines whether multiple columns containing gap bases
(indels) are also tagged.
</p></dd><dt><span class="term">
[also_mark_gap_bases_need_both_strands(amgbnbs)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is <span class="underline">yes</span>. Takes effect only when
[-CO:amgb] is set to <span class="underline">yes</span>. Determines whether, for
tagging columns containing gap bases, both strands need to have a gap.
Setting this to <span class="underline">no</span> is not recommended except when working in
desperately low coverage situations.
</p></dd><dt><span class="term">
[force_nonIUPACconsensus_perseqtype(fnic)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span> for
de-novo genome assemblies, yes for all others. If set to
<span class="underline">yes</span>, MIRA will be forced
to make a choice for a consensus base (A,C,G,T or gap) even in
unclear cases where it would normally put an IUPAC base. All
other things being equal (like quality of the possible
consensus base and other things), MIRA will choose a base by
either looking for a majority vote or, if that also is not
clear, by preferring gaps over T over G over C over finally A.
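</p><p>
The fallback order can be made concrete with a small Python sketch.
It only covers the majority vote and the tie-break; quality values
are ignored. The names PREFERENCE and forced_consensus are
hypothetical, as is the use of "*" as gap character.
</p><pre class="programlisting">
```python
# Preference order when the vote is tied: gap, T, G, C, A.
# Using "*" as the gap character is an assumption of this sketch.
PREFERENCE = "*TGCA"


def forced_consensus(votes):
    """Pick one base from per-base vote counts (hypothetical helper).

    votes maps "A", "C", "G", "T" or "*" to the number of reads.
    """
    top = max(votes.values())
    tied = [base for base, count in votes.items() if count == top]
    return min(tied, key=PREFERENCE.index)


print(forced_consensus({"A": 4, "T": 3}))
print(forced_consensus({"A": 3, "T": 3}))
```
</pre><p>
A clear majority wins outright (A in the first call); only in the
tied second call does the preference order decide, in favour of T.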
</p><p>
MIRA makes a considerable effort to deduce the right base at
each position of an assembly. Only when cases become
borderline will it use an IUPAC code to make you aware of
potential problems. It
is <span class="bold"><strong>suggested</strong></span> to leave this
option set to <span class="underline">no</span> as IUPAC
bases in the consensus are a sign that - if you need 100%
reliability - you really should have a look at this particular
place to resolve potential problems. You might want to set
this parameter to yes in the following cases: 1) when the
tools that use the assembly result cannot handle IUPAC bases and
you don't care about being absolutely perfect in your data (by
looking over them manually). 2) when you assemble data without
any quality values (which you should not do anyway), then this
method will allow you to get a result without IUPAC bases that
is "good enough" with respect to the fact that you did not
have quality values.
</p></dd><dt><span class="term">
[merge_short_reads(msr)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for all
Solexas when in a mapping assembly, else it's <span class="underline">no</span>. Can only be used in mapping
assemblies. If set to <span class="underline">yes</span>, MIRA will merge all perfectly
mapping Solexa reads into longer reads (Coverage Equivalent
Reads, CERs) while keeping quality and coverage information
intact.
</p><p>
This feature hugely reduces the number of Solexa reads and
makes assembly results with Solexa data small enough to be
handled by current finishing programs (gap4, consed, others)
on normal workstations.
</p></dd><dt><span class="term">
[msr_keepcontigendsunmerged(msrme)=<em class="replaceable"><code>integer ≥ 0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">0</span> for all
Solexas when in a mapping assembly. Only takes effect in
mapping assemblies if [-CO:msr=yes].
</p><p>
Defines how many "errors" (i.e. differences) a read may have
to be merged into a coverage equivalent read. Useful only when
one does not need SNP information from an assembly but wants
to concentrate either on coverage data or on paired-end
information at contig ends.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
This feature allows merging of non-perfect reads, which makes
most SNP information simply disappear from the alignment. Use
with care!
</td></tr></table></div></dd><dt><span class="term">
[msr_keepcontigendsunmerged(msrkceu)=<em class="replaceable"><code>-1, integer > 0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">-1</span> for all
Solexas when in a mapping assembly. Only takes effect in
mapping assemblies if [-CO:msr=yes] and for reads
which have a paired-end / mate-pair partner actively used in
the assembly.
</p><p>
If set to a value > 0, MIRA will not merge paired-end /
mate-pair reads if they map within the given distance of a
contig end of the original reference sequence
(backbone). Instead of a fixed value, one can also use
-1. MIRA will then automatically not merge reads if the
distance from the contig end is within the maximum size of the
template insert size of the sequencing library for that read
(either given via [-GE:tismax] or via XML TRACEINFO
for the given read).
</p><p>
This feature makes it possible to use the data reduction from
[-CO:msr] while keeping the result of such a mapping
useful for subsequent scaffolding programs that order
contigs.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_edit_ed"></a>3.4.4.14.
Parameter group: -EDIT (-ED)
</h4></div></div></div><p>
General options for controlling the integrated automatic editor. The editors
generally do a good job of cleaning up typical sequencing
errors in alignments (like base overcalls etc.). However, they may prove tricky in
certain situations:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
in EST assemblies, they may edit rare transcripts toward almost
identical, more abundant transcripts. Usage must be carefully weighed.
</p></li><li class="listitem"><p>
the editors will not only change bases, but also sometimes delete or
insert non-gap bases as needed to improve an alignment when facts (trace
signals or other) show that this is what should have been the
sequence. However, this can make post processing of assembly results pretty
difficult with some formats like ACE, where the format itself contains no
way to specify certain edits like deletion. There's nothing one can do about
it and the only way to get around this problem is to use file formats with
more complete specifications like CAF, MAF (and BAF once supported by MIRA).
</p></li></ul></div><p>
</p><p>
The following edit parameters are supported:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[_mira_automatic_contig_editing(mace)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. When set
to yes, MIRA will use built-in versions of its own automatic
contig editors (see parameters below) to improve alignments.
</p></dd><dt><span class="term">
[edit_kmer_singlets(eks)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for all
sequencing technologies, but only takes effect
if [-ED:mace] is on (see above).
</p><p>
When set to yes, MIRA uses the alignment information of a
complete contig at places with sequencing errors which lead to
unique kmers and corrects the error according to the alignment.
</p><p>
This is an extremely conservative yet very effective editing
strategy and can therefore be kept always activated.
</p></dd><dt><span class="term">
[edit_homopolymer_overcalls(ehpo)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span> for 454
and Ion Torrent, but only takes effect if [-ED:mace]
is on (see above).
</p><p>
When set to yes, MIRA uses the alignment information of a
complete contig at places with potential homopolymer
sequencing errors and corrects the error according to the
alignment.
</p><p>
This editor should be switched on only for sequencing
technologies with known homopolymer sequencing problems. That
is: currently only 454 and Ion.
</p></dd><dt><span class="term">
[edit_automatic_contig_editing(eace)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. When set
to yes, MIRA will use built-in versions of the "EdIt"
automatic contig editor (see parameters below) to correct
sequencing errors in Sanger reads.
</p><p>
EdIt will try to resolve discrepancies in the contig by
performing trace signal analysis and can correct even hard to resolve
errors.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The current development version has a memory leak in
this editor, therefore the option cannot be turned
on.
</td></tr></table></div></dd><dt><span class="term">
[strict_editing_mode(sem)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>. Only for
Sanger data. If set to yes, the automatic editor will not take
error hypotheses with a low probability into account, even if
all the requirements to make an edit are fulfilled.
</p></dd><dt><span class="term">
[confirmation_threshold(ct)=<em class="replaceable"><code>integer, 0 < x ≤ 100</code></em>]
</span></dt><dd><p>
Default is <span class="underline">50</span>. Only for
Sanger data. The higher this value, the more strict the
automatic editor will apply its internal rule set. Going below
40 is not recommended.
</p></dd></dl></div><p>
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_misc_mi"></a>3.4.4.15.
Parameter group: -MISC (-MI)
</h4></div></div></div><p>
Options which would not fit elsewhere.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[iknowwhatido(ikwid)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. This
switch tells MIRA that you know what you are doing in some
situations and forces it not to stop when it thinks something is
really wrong, but to simply continue.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
You generally should not set this flag except in cases
where MIRA stopped and the warning / error message told you to
get around that very specific problem by setting this flag.
</td></tr></table></div></dd><dt><span class="term">
[large_contig_size(lcs)=<em class="replaceable"><code>integer >
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">500</span>. This
parameter has absolutely no influence whatsoever on the
assembly process of MIRA, but it is used in the reporting within
the <code class="filename">*_info_assembly.txt</code> file after the
assembly where MIRA reports statistics on
<span class="emphasis"><em>large</em></span> contigs and
<span class="emphasis"><em>all</em></span> contigs. [-MI:lcs] is the
threshold value for dividing the contigs into these two
categories.
</p></dd><dt><span class="term">
[large_contig_size_for_stats(lcs4s)=<em class="replaceable"><code>integer >
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">5000</span> for
[--job=genome] and <span class="underline">1000</span> for [--job=est].
</p><p>
This parameter is used for internal statistics calculations
and has a subtle influence on assemblies run in
[--job=genome] mode.
</p><p>
MIRA uses coverage information of an assembly project to find
out about potentially repetitive areas in reads (and thus, a
genome). To calculate statistics which are reflecting the
approximate truth regarding the average coverage of a genome,
the "large contig size for stats" value of
[-MI:lcs4s] is used as a cutoff threshold: contigs
smaller than this value do not contribute to the calculation
of average coverage while contigs larger or equal to this
value do.
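</p><p>
As a rough illustration of this cutoff (a sketch only, not MIRA's
actual implementation), the following Python snippet computes an
average coverage from hypothetical (length, average coverage) pairs
and ignores contigs shorter than the threshold. The function name,
the input layout and the length weighting are assumptions made for
this example.
</p><pre class="screen">
def average_coverage(contigs, lcs4s=5000):
    """Length-weighted average coverage of contigs with at least lcs4s bases.

    contigs: iterable of (length, average_coverage) pairs (hypothetical input).
    Returns 0.0 when no contig reaches the cutoff.
    """
    kept = [(length, cov) for length, cov in contigs if length >= lcs4s]
    total_length = sum(length for length, _ in kept)
    if total_length == 0:
        return 0.0
    return sum(length * cov for length, cov in kept) / total_length


# Made-up numbers: two large contigs at roughly 30x, one small
# repeat-rich contig at 90x and one small low-coverage contig at 2x.
contigs = [(120000, 30.0), (80000, 32.0), (900, 90.0), (400, 2.0)]
print(average_coverage(contigs))
</pre><p>
In this made-up example only the two large contigs enter the
calculation, so neither the high-coverage repeat contig nor the
low-coverage debris contig shifts the average.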
</p><p>
This reflects two facts: on the one hand - especially with
short read sequencing technologies and in projects without
read pair libraries - contigs containing predominantly
repetitive sequences are of a relatively small size. On the
other hand, reads which could not be placed into contigs
(maybe due to a sequencing technology dependent motif error)
often enough form small contigs with extremely low
coverage.
</p><p>
It should be clear that one does not want any of the above
when calculating average coverage statistics and having this
cutoff discards small contigs which tend to muddy the
picture. If in doubt, don't touch this parameter.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_misc_nw"></a>3.4.4.16.
Parameter group: -NAG_AND_WARN (-NW)
</h4></div></div></div><p>
Parameters which let MIRA warn you about unusual things or potential
problems. The flags in this parameter section come in three
flavours: <span class="emphasis"><em>stop</em></span>, <span class="emphasis"><em>warn</em></span> and
<span class="emphasis"><em>no</em></span> which let MIRA either stop, give a warning
or do nothing if a specific problem is detected.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[check_nfs(cnfs)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check whether the tmp directory is located on an NFS
mount.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
You should never, ever run MIRA on an NFS-mounted
directory ... or face the fact that the assembly process
may very well take 5 to 10 times longer (or more) than
normal. You have been warned.
</p><p>
The reason for the slowdown is the same as why one should
never run a BLAST search on a big database located on
an NFS volume: access via network is terribly slow when
compared to local disks, at least if you have not invested a
lot of money into specialised solutions.
</p></td></tr></table></div></dd><dt><span class="term">
[check_duplicate_readnames(cdrn)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check for duplicate read names after loading.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
Duplicate read names usually hint at a serious problem with
your input and should really, really be fixed. You can
choose to ignore this error by switching off this flag, but
this will almost certainly lead to problems with result
files (ACE and CAF for sure, maybe also SAM) and probably to
other unexpected effects.
</p></td></tr></table></div></dd><dt><span class="term">
[check_template_problems(ctp)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check read template naming after loading.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
Problems in read template naming point to problems with read
names or to broken template information. You should try to
find the cause of the problem instead of ignoring this error
message.
</p></td></tr></table></div></dd><dt><span class="term">
[check_maxreadnamelength(cmrnl)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. MIRA
will check whether the length of the names of your reads
surpasses the given number of characters (see [-NW:mrnl]).
</p><p>
While MIRA and many other programs have no problem with long read names,
some older programs have restrictions concerning the length of
the read name. For example, the pipeline <code class="literal">CAF ->
caf2gap -> gap2caf</code> will stop working at
the <span class="command"><strong>gap2caf</strong></span> stage if there are read names
longer than 40 characters which differ from each other only after
the 40th character.
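</p><p>
To see whether a data set is affected before starting an assembly,
one can look for read names that become identical once cut to 40
characters. The following Python sketch shows the idea; the function
name and the example read names are made up for illustration and are
not part of MIRA.
</p><pre class="screen">
from collections import defaultdict


def find_prefix_collisions(read_names, max_len=40):
    """Group read names that are identical in their first max_len characters.

    Returns a dict mapping each shared prefix to the sorted list of
    distinct names sharing it. Prefixes used by one name only are left out.
    """
    groups = defaultdict(set)
    for name in read_names:
        groups[name[:max_len]].add(name)
    return {
        prefix: sorted(names)
        for prefix, names in groups.items()
        if len(names) > 1
    }


# Hypothetical names: the first two differ only after a 40-character prefix.
prefix = "run2014_plate07_well_H12_library_paired_"
names = [prefix + "fwd", prefix + "rev", "short_read_1"]
for shared, clashing in find_prefix_collisions(names).items():
    print(shared, clashing)
</pre><p>
An empty result means that no two names clash at the chosen length.
The <code class="literal">max_len</code> argument plays the role of
the value set with [-NW:mrnl].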
</p><p>
This should be a warning only, but as a couple of people were
bitten by this, the default behaviour of MIRA is to stop when
it sees that potential problem. You might want to rename your
reads to have ≤ 40 characters.
</p><p>
On the other hand, you also can ignore this potential problem
and force MIRA to continue by using the parameter:
[-NW:cmrnl=warn] or [-NW:cmrnl=no]
</p></dd><dt><span class="term">
[maxreadnamelength(mrnl)=<em class="replaceable"><code>integer ≥
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">40</span>. This
defines the effective check length for [-NW:cmrnl].
</p></dd><dt><span class="term">
[check_average_coverage(cac)=<em class="replaceable"><code>stop|warn|no</code></em>]
</span></dt><dd><p>
Default is <span class="underline">stop</span>. In
genome de-novo assemblies, MIRA will perform checks early in
the assembly process whether the average coverage to be
expected exceeds a given value (see [-NW:acv]).
</p><p>
With today's sequencing technologies (especially Illumina, but
also Ion Torrent and 454), many people simply take everything
they get and throw it into an assembly. Which, in the case of
Illumina and Ion, can mean they try to assemble their organism
with a coverage of 100x, 200x and more (I've seen trials with
more than 1000x).
</p><p>
This is not good. Not. At. All! For two reasons (well, three
to be precise).
</p><p>
The first reason is that, usually, one does not sequence a
single cell but a population of cells. If this population is
not clonal (i.e., it contains subpopulations with genomic
differences between them), assemblers will be able to pick
up these differences in the DNA once a certain sequence count
is reached and they will try to reconstruct a genome containing
all clonal variations, treating these variations as potential
repeats with slightly different sequences. Which, of course,
will be wrong and I am pretty sure you do not want that.
</p><p>
The second and way more important reason is that none of the
current sequencing technologies is completely error free. Even
more problematic, they contain both random and non-random
sequencing errors. Especially the latter can become a big
hurdle if these non-random errors are so prevalent that they
suddenly appear to be valid sequence to an assembler. This in
turn leads to false repeat detection, hence possibly contig
breaks or even wrong consensus sequence. You don't want that,
do you?
</p><p>
The last reason is that overlap based assemblers (like MIRA
is) need <span class="emphasis"><em>exponentially</em></span> more time and
memory when the coverage increases. So keeping the coverage
comparatively low helps you there.
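</p><p>
A quick back-of-the-envelope check can be done before starting MIRA.
The Python sketch below uses the common estimate "sequenced bases
divided by genome size" and derives how many reads to keep for a
target coverage. How MIRA itself estimates the coverage is not
described here, so the function names and the formula are
assumptions for illustration, and the genome size has to be an
estimate supplied by you.
</p><pre class="screen">
def expected_coverage(total_read_bases, genome_size):
    """Estimate average coverage as sequenced bases per genome base."""
    if not genome_size > 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def reads_to_keep(num_reads, read_length, genome_size, target_coverage=80):
    """Return how many reads to keep to reach about target_coverage."""
    if not read_length > 0:
        raise ValueError("read_length must be positive")
    needed = (target_coverage * genome_size) // read_length
    return min(num_reads, int(needed))


# Made-up example: 10 million reads of 100 bases, 4.6 Mb genome.
num_reads, read_length, genome_size = 10_000_000, 100, 4_600_000
print(expected_coverage(num_reads * read_length, genome_size))
print(reads_to_keep(num_reads, read_length, genome_size))
</pre><p>
In this made-up example the data would give well over 200x, so only
a subset of the reads is needed to stay around the default value of
80 used by [-NW:acv] for de-novo assemblies.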
</p></dd><dt><span class="term">
[average_coverage_value(acv)=<em class="replaceable"><code>integer ≥
0</code></em>]
</span></dt><dd><p>
Default is <span class="underline">80</span> for
de-novo assemblies, in mapping assemblies it is 120 for Ion
Torrent and 160 for Illumina data (might change in
future). This defines the effective coverage to check for in
[-NW:cac].
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_directory_dir_di"></a>3.4.4.17.
Parameter group: -DIRECTORY (-DIR, -DI)
</h4></div></div></div><p>
General options for controlling where to find or where to write data.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[tmp_redirected_to(trt)=<em class="replaceable"><code><directoryname></code></em>]
</span></dt><dd><p>
Default is an empty string. When set to a non-empty string,
MIRA will create the MIRA-temporary directory at the given
location instead of using the current working directory.
</p><p>
This option is particularly useful for systems which have
solid state disks (SSDs) or other very fast disk subsystems
which can be used for temporary files. It also helps in projects
where the input and output files reside on an NFS-mounted directory
(the current working directory): the tmp directory can then be put
somewhere outside the NFS (see also: Things you should not do).
</p><p>
In both cases above, and for larger projects, MIRA then runs
a lot faster.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Prior to MIRA 4.0rc2, users had to make sure themselves that
the target directory did not already exist. MIRA now handles
this automatically by creating directory names with a random
substring attached.
</td></tr></table></div></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_output_out"></a>3.4.4.18.
Parameter group: -OUTPUT (-OUT)
</h4></div></div></div><p>
Options for controlling which results to write to which type of files.
Additionally, a few options allow output customisation of textual
alignments (in text and HTML files).
</p><p>
There are 3 types of results: results, temporary results and extra
temporary results. One probably needs only the results. Temporary
and extra temporary results are written while building different
stages of a contig and are given as convenience for trying to find
out why MIRA set some RMBs or disassembled some contigs.
</p><p>
Output can be generated in these formats: CAF, Gap4 Directed
Assembly, FASTA, ACE, TCS, WIG, HTML and simple text.
</p><p>
Naming conventions of the files follow the rules described in
section <span class="bold"><strong>Input / Output</strong></span>, subsection
<span class="bold"><strong>Filenames</strong></span>.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[savesimplesingletsinproject(sssip)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>. Controls
whether 'unimportant' singlets are written to the result
files.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Note that a value larger than 1 for the [-AS:mrpc]
parameter will disable the function of this parameter.
</td></tr></table></div></dd><dt><span class="term">
[savetaggedsingletsinproject(stsip)=<em class="replaceable"><code>on|y[es]|t[rue],off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">yes</span>. Controls whether
singlets which have certain tags (see below) are written to
the result files, even if [-OUT:sssip] (see above) is
set.
</p><p>
If one of the (SRMr, CRMr, WRMr, SROr, SAOr, SIOr) tags
appears in a singlet, MIRA knows that the singlet had been
part of a larger alignment in earlier passes and was even part
of a potentially 'important' decision. To give human finishers
the possibility to trace back the decision, these singlets
can be written to result files.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Note that a value larger than 1 for the [-AS:mrpc]
parameter will disable the function of this parameter.
</td></tr></table></div></dd><dt><span class="term">
[remove_rollover_tmps(rrot)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">yes</span>. Removes log and
temporary files once they should not be needed anymore during
the assembly process.
</p></dd><dt><span class="term">
[remove_tmp_directory(rtd)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>. Removes the
complete tmp directory at the end of the assembly process. Some
logs and temporary files contain useful information that you may
want to analyse though, therefore the default of MIRA is not to
delete it.
</p></dd><dt><span class="term">
[output_result_caf(orc)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_maf(orm)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_gap4da(org)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
If set to <span class="underline">yes</span>, MIRA will
automatically switch back
to <span class="underline">no</span> (and cannot be
forced to 'yes') when 454 or Solexa reads are present in the
project as this ensures that the file system does not get
flooded with millions of files.
</td></tr></table></div></dd><dt><span class="term">
[output_result_fasta(orf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_ace(ora)=<em class="replaceable"><code>on|y[es]|t[rue],
off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
ACE is the least suited file format for NGS data. Use it only
when absolutely necessary.
</td></tr></table></div></dd><dt><span class="term">
[output_result_txt(ort)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_result_tcs(ors)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">yes</span>.
</p></dd><dt><span class="term">
[output_result_html(orh)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_caf(otc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_maf(otm)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_gap4da(otg)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_fasta(otf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_ace(ota)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_txt(ott)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_tcs(ots)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default
is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_tmpresult_html(oth)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_caf(oetc)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_gap4da(oetg)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_fasta(oetf)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p> Default is
<span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_ace(oeta)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_txt(oett)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[output_exttmpresult_html(oeth)=<em class="replaceable"><code>on|y[es]|t[rue], off|n[o]|f[alse]</code></em>]
</span></dt><dd><p>
Default is <span class="underline">no</span>.
</p></dd><dt><span class="term">
[text_chars_per_line(tcpl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">60</span>. When producing an output in text format
( [-OUT:ort|ott|oett]), this parameter defines how many bases
each line of an alignment should contain.
</p></dd><dt><span class="term">
[html_chars_per_line(hcpl)=<em class="replaceable"><code>integer > 0</code></em>]
</span></dt><dd><p> Default is
<span class="underline">60</span>. When producing an output in HTML format
( [-OUT:orh|oth|oeth]), this parameter defines how many bases
each line of an alignment should contain.
</p></dd><dt><span class="term">
[text_endgap_fillchar(tegfc)=<em class="replaceable"><code><single character></code></em>]
</span></dt><dd><p> Default
is <span class="underline"> </span> (a blank). When producing an output in text format
( [-OUT:ort|ott|oett]), endgaps are filled up with this
character.
</p></dd><dt><span class="term">
[html_endgap_fillchar(hegfc)=<em class="replaceable"><code><single character></code></em>]
</span></dt><dd><p> Default
is <span class="underline"> </span> (a blank). When producing an output in HTML format
( [-OUT:orh|oth|oeth]), end-gaps are filled up with this
character.
</p></dd></dl></div><p>
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_resuming_assemblies"></a>3.5.
Resuming / restarting assemblies
</h2></div></div></div><p>
It may happen that a MIRA run is interrupted - sometimes rather harshly
- due to events more or less outside your control like, e.g., power
failures, machine shutdowns for maintenance, missing disk space,
run-time quotas etc. This may be less of a problem when assembling or
mapping small data sets with run times between a couple of minutes up to
a few hours, but becomes a nuisance for larger data sets like in small
eukaryotes or RNASeq samples where the run time is measured in days.
</p><p>
If this happens in de-novo assemblies, MIRA has
a <span class="emphasis"><em>resume</em></span> functionality: at predefined points in the
assembly process, MIRA writes out special files to disk which enable it
to resume the assembly at the point where these files were
written. Starting MIRA in resume mode is pretty easy: simply add the
resume flag [-r] on a command line like this:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>mira -r ...</code></strong></pre><p>
where the ellipsis ("...") above stands for the rest of the command line you would have used to start a new assembly.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_input_output"></a>3.6.
Input / Output
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_directories"></a>3.6.1.
Directories
</h3></div></div></div><p>
Since version 3.0.0, MIRA puts all files and directories it
generates into one sub-directory which is named
<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly</code>. This directory contains up to four
sub-directories:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>: this directory contains all the
output files of the assembly in different formats.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>: this directory contains information
files of the final assembly. They provide statistics as well as, e.g.,
information (easily parsable by scripts) on which read is found in which
contig etc.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_tmp</code>:
this directory contains tmp files and temporary assembly files. It
can be safely removed after an assembly, as it may easily hold a
few GB of data that are normally not needed anymore.
</p><p>
In case of problems: please do not delete it. I will get in touch
with you for additional information that might possibly be present
in the tmp directory.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_chkpt</code>: this directory
contains checkpoint files needed to resume assemblies that crashed
or were stopped.
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_filenames"></a>3.6.2.
Filenames
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_output"></a>3.6.2.1.
Output
</h4></div></div></div><p>
These result output files and sub-directories are placed in the
<em class="replaceable"><code>projectname</code></em>_d_results directory after a run of MIRA.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.<type></code>
</span></dt><dd><p> Assembled project written in type =
(<span class="emphasis"><em>maf</em></span> / <span class="emphasis"><em>gap4da</em></span> / <span class="emphasis"><em>caf</em></span> /
<span class="emphasis"><em>ace</em></span> / <span class="emphasis"><em>fasta</em></span> /
<span class="emphasis"><em>html</em></span> / <span class="emphasis"><em>tcs</em></span> /
<span class="emphasis"><em>wig</em></span> / <span class="emphasis"><em>text</em></span>) format by
MIRA, final result.
</p><p>
Type <span class="emphasis"><em>gap4da</em></span> is a directory containing
experiment files and a file of filenames (called 'fofn'), all
other types are files. <span class="emphasis"><em>gap4da</em></span>,
<span class="emphasis"><em>caf</em></span>, <span class="emphasis"><em>ace</em></span> contain the
complete assembly information suitable for import into
different post-processing tools (gap4, consed and
others). <span class="emphasis"><em>html</em></span> and
<span class="emphasis"><em>text</em></span> contain visual representations of
the assembly suited for viewing in browsers or as simple text
file. <span class="emphasis"><em>tcs</em></span> is a summary of a contig suited
for "quick" analysis from command-line tools or even visual
inspection. <span class="emphasis"><em>wig</em></span> is a file containing
coverage information (useful for mapping assemblies) which can
be loaded and shown by different genome browsers (IGB, GMOD,
UCSC and probably many more).
</p><p>
<span class="emphasis"><em>fasta</em></span> contains the contig consensus
sequences (and .fasta.qual the consensus qualities). Please
note that they come in two flavours:
<span class="underline">padded</span>
and <span class="underline">unpadded</span>. The padded
versions may contain stars (*) denoting gap base positions
where there was some minor evidence for additional bases, but
not strong enough to be considered as a real base. Unpadded
versions have these gaps removed. Padded versions have an
additional postfix <span class="emphasis"><em>.padded</em></span>, while
unpadded versions have <span class="emphasis"><em>.unpadded</em></span>.
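</p><p>
The relation between the two flavours can be sketched in a few lines
of Python. This is only an illustration of the description above;
the function names are hypothetical and positions are 0-based by
assumption.
</p><pre class="screen">
def unpad(padded_sequence, pad_char="*"):
    """Remove gap (pad) characters from a padded consensus sequence."""
    return padded_sequence.replace(pad_char, "")


def padded_to_unpadded_position(padded_sequence, padded_pos, pad_char="*"):
    """Map a 0-based padded position to the matching unpadded position.

    The result is the number of non-pad characters in front of padded_pos.
    """
    pads_before = padded_sequence.count(pad_char, 0, padded_pos)
    return padded_pos - pads_before


padded = "ACG*T**A"
print(unpad(padded))
print(padded_to_unpadded_position(padded, 4))
</pre><p>
Such a mapping is needed whenever positions taken from a padded
file have to be compared with positions in an unpadded one.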
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_LargeContigs_out.<type></code>
</span></dt><dd>
These files are only written when MIRA runs in
<span class="emphasis"><em>de-novo</em></span> mode. They usually contain a subset
of contigs deemed 'large' from the whole project. More details
are given in the chapter "working with results of MIRA."
</dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_assembly_statistics_and_information_files"></a>3.6.2.2.
Assembly statistics and information files
</h4></div></div></div><p>
These information files are placed in the
<em class="replaceable"><code>projectname</code></em>_d_info directory after a run of
MIRA.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_assembly.txt</code>
</span></dt><dd><p>
This file contains basic information about the
assembly. MIRA will split the information in two
parts: information about <span class="emphasis"><em>large</em></span>
contigs and information about all contigs.
</p><p>
For more information on how to interpret this file,
please consult the chapter on "Results" of the MIRA
documentation manual.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
In contrast to other information files, this file
always appears in the "info" directory, even when just
intermediate results are reported.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigreadlist.txt</code>
</span></dt><dd><p> This file contains information on which reads have been
assembled into which contigs (or singlets).
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigstats.txt</code>
</span></dt><dd><p> This file contains statistics about the contigs
themselves, their length, average consensus quality, number of
reads, maximum and average coverage, average read length, number
of A, C, G, T, N, X and gaps in consensus.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For contigs containing digitally normalised reads, the coverage numbers may sometimes seem strange. E.g.: a contig may contain only one read, but have an average coverage of 3. This means that the read was a representative for 3 reads. The coverage numbers are computed as if all 3 reads had been assembled instead of the representative. In EST/RNASeq projects, these numbers thus represent the (more or less) true expression coverage.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_consensustaglist.txt</code>
</span></dt><dd><p> This file contains
information about the tags (and their position) that are present in the
consensus of a contig.
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_largecontigs.txt</code>
</span></dt><dd><p>For de-novo assemblies, this file contains the name of the
contigs which pass the (adaptable) 'large contig' criterion.
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readrepeats.lst</code>
</span></dt><dd><p>
Tab delimited file with three columns: read name, repeat level tag, sequence.
</p><p>
This file permits a quick analysis of the repetitiveness of
different parts of reads in a project. See
[-SK:rliif] to control from which repetitive level on
subsequences of reads are written to this file.
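</p><p>
Because the file is plain tab-delimited text, it is easy to
summarise with a small script. The Python sketch below counts how
often each repeat level tag occurs. It assumes exactly three columns
per line and no header line; the function name and the example
content are made up for illustration.
</p><pre class="screen">
import csv
import io
from collections import Counter


def count_repeat_tags(handle):
    """Count how often each repeat level tag occurs.

    handle: text file object with tab-delimited lines holding
            read name, repeat level tag and sequence.
    """
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _read_name, tag, _sequence = row
        counts[tag] += 1
    return counts


# Made-up content standing in for a real readrepeats file.
example = io.StringIO(
    "read_1\tMNRr\tACGTACGT\n"
    "read_1\tHAF6\tTTTTGGGG\n"
    "read_2\tHAF6\tCCCCAAAA\n"
)
print(count_repeat_tags(example))
</pre><p>
For a real project, the same function can be given a file object
opened on the
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readrepeats.lst</code>
file instead of the in-memory example.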
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Reads can have more than one entry in this file. E.g., with
standard settings (<code class="literal">-SK:rliif=6</code>) if the
start of a read is covered by MNRr, followed by a HAF3 region
and finally the read ends with HAF6, then there will be two
lines in the file: one for the subsequence covered by MNRr,
one for HAF6.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readstooshort</code>
</span></dt><dd><p> A list containing the
names of those reads that have been sorted out of the assembly before any
processing started only due to the fact that they were too short.
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readtaglist.txt</code>
</span></dt><dd><p> This file contains
information about the tags and their position that are present in each
read. The read positions are given relative to the forward direction of the
sequence (i.e. as it was entered into the assembly).
</p></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_WARNINGS_*.txt</code>
</span></dt><dd><p>
These files collect warning messages MIRA dumped out
throughout the assembly process. These warnings cover a wide
area of things monitored by MIRA and can - together with the
output written to STDOUT - give an insight as to why an
assembly does not behave as expected. There are three warning
files representing different levels of
criticality: <span class="emphasis"><em>critical</em></span>, <span class="emphasis"><em>medium</em></span>
and <span class="emphasis"><em>minor</em></span>. These files may be empty,
meaning that no warning of the corresponding level was
printed. It is strongly suggested to have a look at least at
critical warnings during and after an assembly run.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
These files are quite new to MIRA and not all warning messages
appear there yet. This will come over time.
</td></tr></table></div></dd><dt><span class="term">
<code class="filename"><em class="replaceable"><code>projectname</code></em>_error_reads_invalid</code>
</span></dt><dd><p> A list of sequences that
have been found to be invalid due to various reasons (given in the output of
the assembler).
</p></dd></dl></div><p>
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_file_formats"></a>3.6.3.
File formats
</h3></div></div></div><p>
MIRA can write almost all of the following formats and can read most
of them.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="filename">ACE</code>
</span></dt><dd><p> An old assembly file format used mainly by phrap and
consed. Support for .ace output is currently only in test status in
MIRA as documentation on that format is ... sparse and I currently
don't have access to consed to verify my assumptions.
</p><p> Using consed, you will need to load projects with -nophd to
view them. Tags (in reads and consensus) are fully supported. The
only hitch: consed has a bug which prevents it from reading consensus
tags which are located throughout the whole file (as MIRA writes
per default). The solution to that is easy: filter the ACE file
through the fixACE4consed.tcl script which is provided in older
MIRA distributions (V4.9.5 and before), then all should be well.
</p><p> If you don't have consed, you might want to try clview
(<a class="ulink" href="http://www.tigr.org/tdb/tgi/software/" target="_top">http://www.tigr.org/tdb/tgi/software/</a>) from TIGR
to look at .ace files.
</p></dd><dt><span class="term">
<code class="filename">BAM</code>
</span></dt><dd>
The binary cousin of the SAM format. MIRA neither reads nor writes
BAM, but BAMs can be created out of SAMs (which can be created via
<span class="command"><strong>miraconvert</strong></span>).
</dd><dt><span class="term">
<code class="filename">CAF</code>
</span></dt><dd><p> Common Assembly Format (CAF) developed by the Sanger
Centre. <a class="ulink" href="http://www.sanger.ac.uk/resources/software/caf.html" target="_top">http://www.sanger.ac.uk/resources/software/caf.html</a> provides a
description of the format and some software documentation as well as the
source for compiling caf2gap and gap2caf (thanks to Rob Davies
for this).
</p></dd><dt><span class="term">
<code class="filename">EXP</code>
</span></dt><dd><p> Standard experiment files used in genome
sequencing. Correct EXP files are expected. Especially the ID
record (containing the id of the reading) and the LN record
(containing the name of the corresponding trace file) should be
correctly set. See <a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for links to
online format description.
</p></dd><dt><span class="term">
<code class="filename">FASTA</code>
</span></dt><dd><p> A simple format for sequence data, see
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/BLAST/fasta.html" target="_top">http://www.ncbi.nlm.nih.gov/BLAST/fasta.html</a>. A
frequently used extension of that format also stores quality
values in a similar fashion; these files have a .fasta.qual
ending.
</p><p>
MIRA writes two kinds of FASTA files for
results: <span class="emphasis"><em>padded</em></span> and
<span class="emphasis"><em>unpadded</em></span>. The difference is that the padded
version still contains the gap (pad) character (an asterisk) at
positions in the consensus where some of the reads apparently
had more bases than others but where the consensus routines
decided to treat them as artifacts. The
<span class="emphasis"><em>unpadded</em></span> version has the gaps removed.
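</p><p>
The relationship between the two variants can be sketched in a few lines of Python. This assumes only what is stated above (the pad character is an asterisk); the function names are illustrative and not part of MIRA, and indices are 0-based as usual in Python.
</p>

```python
def unpad(padded):
    """Return the unpadded form of a padded consensus sequence."""
    return padded.replace("*", "")


def unpadded_index(padded, padded_index):
    """Map a 0-based padded index to the matching 0-based unpadded index.

    Returns None when the padded position is itself a pad, because that
    column has no counterpart in the unpadded sequence.
    """
    if padded[padded_index] == "*":
        return None
    # Every pad to the left shifts the unpadded index down by one.
    return padded_index - padded.count("*", 0, padded_index)
```

<p>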
</p></dd><dt><span class="term">
<code class="filename">GBF, GBK</code>
</span></dt><dd><p> GenBank file format as used at the NCBI to describe
sequences. MIRA is able to read and write this format (but only
for viruses or bacteria) for using sequences as backbones in an
assembly. Features of the GenBank format are also transferred
automatically to Staden compatible tags.
</p><p>
If possible, use GFF3 instead (see below).
</p></dd><dt><span class="term">
<code class="filename">GFF3</code>
</span></dt><dd><p> General feature format used to describe sequences and
features on these sequences. MIRA is able to read and write this
format.
</p></dd><dt><span class="term">
<code class="filename">HTML</code>
</span></dt><dd><p> Hypertext Markup Language. Projects written in HTML format
can be viewed directly with any table capable browser. Display is even
better if the browser knows style sheets (CSS).
</p></dd><dt><span class="term">
<code class="filename">MAF</code>
</span></dt><dd><p> MIRA Assembly Format (MAF). A faster and more compact form
than EXP, CAF or ACE. See documentation in separate file.
</p></dd><dt><span class="term">
<code class="filename">PHD</code>
</span></dt><dd><p> This file type originates from the phred base caller
and contains basically -- along with some other status information -- the
base sequence, the base quality values and the peak indices, but not the
sequence traces themselves.
</p></dd><dt><span class="term">
<code class="filename">SAM</code>
</span></dt><dd><p> The Sequence Alignment/Map Format. MIRA does not write SAM
directly, but <span class="command"><strong>miraconvert</strong></span> can be used for
converting a MAF (or CAF) file to SAM.
</p><p>
MIRA cannot read SAM though.
</p></dd><dt><span class="term">
<code class="filename">SCF</code>
</span></dt><dd><p> The Staden trace file format that has established itself as
compact standard replacement for the much bigger ABI files. See
<a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for
links to online format description.
</p><p>
The SCF files should be V2-8bit, V2-16bit, V3-8bit or V3-16bit
and can be packed with compress or gzip.
</p></dd><dt><span class="term">
<code class="filename">traceinfo.XML</code>
</span></dt><dd><p> XML based file with information relating to
traces. Used at the NCBI and ENSEMBL trace archive to store additional
information (like clippings, insert sizes etc.) for projects. See further
down for a description of the fields used and
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc</a> for a full description of all fields.
</p></dd><dt><span class="term">
<code class="filename">TCS</code>
</span></dt><dd><p> Transpose Contig Summary. A text file as written by MIRA
which gives a summary of a contig in tabular fashion, one line per
base. Nicely suited for "quick" analysis from command line tools,
scripts, or even visual inspection in file viewers or spreadsheet
programs.
</p><p> In the current file version (TCS 1.0), each column is
separated by at least one space from the next. Vertical bars are
inserted as visual delimiter to help inspection by eye. The
following columns are written into the file:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
contig name (width 20)
</p></li><li class="listitem"><p>
padded position in contigs (width 3)
</p></li><li class="listitem"><p>
unpadded position in contigs (width 3)
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
called consensus base
</p></li><li class="listitem"><p>
quality of called consensus base (0-100), but MIRA itself caps at 90.
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
total coverage in number of reads. This number can be higher than the
sum of the next five columns if Ns or IUPAC bases are present in the
sequence of reads.
</p></li><li class="listitem"><p>
coverage of reads having an "A"
</p></li><li class="listitem"><p>
coverage of reads having a "C"
</p></li><li class="listitem"><p>
coverage of reads having a "G"
</p></li><li class="listitem"><p>
coverage of reads having a "T"
</p></li><li class="listitem"><p>
coverage of reads having an "*" (a gap)
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
quality of "A" or "--" if none
</p></li><li class="listitem"><p>
quality of "C" or "--" if none
</p></li><li class="listitem"><p>
quality of "G" or "--" if none
</p></li><li class="listitem"><p>
quality of "T" or "--" if none
</p></li><li class="listitem"><p>
quality of "*" (gap) or "--" if none
</p></li><li class="listitem"><p>
separator (a vertical bar)
</p></li><li class="listitem"><p>
Status. This field sums up the evaluation of MIRA whether you should
have a look at this base or not. The content can be one of the following:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
everything OK: a colon (:)
</p></li><li class="listitem"><p>
unclear base calling (IUPAC base): a "!M"
</p></li><li class="listitem"><p>
potentially problematic base calling involving a gap or low quality: a "!m"
</p></li><li class="listitem"><p>
consensus tag(s) of MIRA that hint to problems: a "!$". Currently,
the following tags will lead to this marker: SRMc, WRMc, DGPc, UNSc,
IUPc.
</p></li></ul></div></li><li class="listitem"><p>
list of consensus tags at that position; tags are delimited by a
space. E.g.: "DGPc H454"
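</p><p>
The column layout above lends itself to simple scripting. The following is a minimal sketch of a parser for one data line, assuming that the four vertical bars occur exactly as listed and that contig names contain no whitespace; the function name and the sample line in the usage note are made up for illustration.
</p>

```python
def parse_tcs_line(line):
    """Parse one data line of a TCS 1.0 file into a dictionary."""
    # The four vertical bars split the line into five groups of columns;
    # maxsplit=4 keeps any further text in the last group.
    ident, consensus, coverage, qualities, status = (
        part.split() for part in line.split("|", 4)
    )

    def quality(value):
        # "--" means that no quality is available for this base.
        return None if value == "--" else int(value)

    bases = "ACGT*"
    return {
        "contig": ident[0],
        "padded_pos": int(ident[1]),
        "unpadded_pos": int(ident[2]),
        "base": consensus[0],
        "quality": int(consensus[1]),
        "total_coverage": int(coverage[0]),
        "coverage": dict(zip(bases, map(int, coverage[1:6]))),
        "base_quality": dict(zip(bases, map(quality, qualities[:5]))),
        "status": status[0],
        "tags": status[1:],
    }
```

<p>
For a hypothetical line such as <code class="literal">example_c1 42 41 | C 90 | 20 0 18 0 0 2 | -- 90 -- -- 35 | !$ DGPc H454</code>, the sketch would report a status of "!$" together with the tags DGPc and H454.
</p><p>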
</p></li></ol></div></dd></dl></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_stdout_stderr"></a>3.6.4.
STDOUT/STDERR
</h3></div></div></div><p>
The current stage of the assembly is written to STDOUT, with status messages
on what MIRA is currently doing. MIRA hardly writes to STDERR
anymore; the remnants will disappear over time.
</p><p>
Some debugging information might also be written to STDOUT if MIRA
generates error messages.
</p><p>
Errors are also written to STDOUT. Basically, three error classes
exist:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
WARNING: Messages in this error class do not stop the assembly but
are meant as information for the user. In some rare cases these
errors are due to an (always possible) error in the I/O routines
of MIRA, but nowadays they are mostly due to unexpected (read:
wrong) input data and can be traced back to errors in the
preprocessing stages. If these errors arise, you
definitely <span class="bold"><strong>DO</strong></span> want to check how
and why these errors came into those files in the first place.
</p><p>
Frequent causes of warnings include missing SCF files, SCF files
containing known quirks, EXP files containing known quirks etc.
</p></li><li class="listitem"><p>
FATAL: Messages in this error class actually stop the
assembly. These are mostly due to missing files that MIRA needs or
to very garbled (wrong) input data.
</p><p>
Frequent causes include naming an experiment file in the 'file of filenames'
that could not be found on the disk, same experiment file twice in the
project, suspected errors in the EXP files, etc.
</p></li><li class="listitem"><p>
INTERNAL: These are true programming errors that were caught by internal
checks. Should this happen, please mail the output of STDOUT and STDERR to
the author.
</p></li></ol></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_ssaha2smalt"></a>3.6.5.
SSAHA2 / SMALT ancillary data
</h3></div></div></div><p>
The <span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> programs -
both from the Sanger Centre - can be used to detect possible vector
sequence stretches in the input data for the assembly. MIRA can load
the result files of a
<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> run and
interpret the results to tag the possible vector sequences at the ends
of reads.
</p><p>
Note that this also uses the parameters
[-CL:msvsgs:msvsmfg:msvsmeg] (see below).
</p><p>
ssaha2 must be called like this "<code class="literal">ssaha2
<ssaha2options> vector.fasta sequences.fasta</code>"
to generate an output that can be parsed by MIRA. In the above
example, replace <code class="filename">vector.fasta</code> by the name
of the file with your vector sequences and
<code class="filename">sequences.fasta</code> by the name of the file
containing your sequencing data.
</p><p>
smalt must be called like this: "<code class="literal">smalt map -f ssaha
<ssaha2options> hash_index sequences.fasta</code>"
</p><p>
This makes you basically independent from any other commercial or
license-requiring vector screening software. For Sanger reads, a
combination of <span class="command"><strong>lucy</strong></span> and
<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> together with
this parameter should do the trick. For reads coming from 454
pyro-sequencing, <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> and this parameter will also work very
well. See the usage manual for a walkthrough example on how to use
SSAHA2 / SMALT screening data.
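</p><p>
As a hedged sketch, the two command lines quoted in this section can be assembled in Python before being run or written to a script. The option values are the SSAHA2 options the author reports using in this section; the helper names are illustrative, and the file names are the placeholders from the text.
</p>

```python
import shlex

# SSAHA2 options quoted in this section.
SSAHA2_OPTIONS = (
    "-kmer", "8", "-skip", "1", "-seeds", "1",
    "-score", "12", "-cmatch", "9", "-ckmer", "6",
)


def ssaha2_command(vector_fasta, sequences_fasta, options=SSAHA2_OPTIONS):
    """Build the ssaha2 call; "-output ssaha2" selects the native format."""
    return ["ssaha2", "-output", "ssaha2", *options,
            vector_fasta, sequences_fasta]


def smalt_command(hash_index, sequences_fasta, options=()):
    """Build the smalt call; "-f ssaha" selects the parsable format."""
    return ["smalt", "map", "-f", "ssaha", *options,
            hash_index, sequences_fasta]


# Print shell-quoted command lines for inspection.
print(shlex.join(ssaha2_command("vector.fasta", "sequences.fasta")))
print(shlex.join(smalt_command("hash_index", "sequences.fasta")))
```

<p>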
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The output format of SSAHA2 must be the native output format
(<code class="literal">-output ssaha2</code>). For SMALT, the output
option <code class="literal">-f ssaha</code> must be used. Other formats cannot
be parsed by MIRA.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I currently use the following SSAHA2 options:
<code class="literal">-kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer
6</code></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Anyone contributing SMALT parameters?
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The sequence vector clippings generated from SSAHA2 /
SMALT data do not replace sequence vector clippings loaded via
the EXP, CAF or XML files, they rather extend them.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_xml_traceinfo"></a>3.6.6.
XML TRACEINFO ancillary data
</h3></div></div></div><p>
MIRA extracts the following data from the TRACEINFO files:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
trace_name (required)
</p></li><li class="listitem"><p>
trace_file (recommended)
</p></li><li class="listitem"><p>
trace_type_code (recommended)
</p></li><li class="listitem"><p>
trace_end (recommended)
</p></li><li class="listitem"><p>
clip_quality_left (recommended)
</p></li><li class="listitem"><p>
clip_quality_right (recommended)
</p></li><li class="listitem"><p>
clip_vector_left (recommended)
</p></li><li class="listitem"><p>
clip_vector_right (recommended)
</p></li><li class="listitem"><p>
strain (recommended)
</p></li><li class="listitem"><p>
template_id (recommended for paired end)
</p></li><li class="listitem"><p>
insert_size (recommended for paired end)
</p></li><li class="listitem"><p>
insert_stdev (recommended for paired end)
</p></li><li class="listitem"><p>
machine_type (optional)
</p></li><li class="listitem"><p>
program_id (optional)
</p></li></ul></div><p>
</p><p>
Other data types are also read, but the info is not used.
</p><p>
Here is an example of a TRACEINFO file with ancillary info:
</p><pre class="screen">
<?xml version="1.0"?>
<trace_volume>
<trace>
<trace_name>GCJAA15TF</trace_name>
<program_id>PHRED (0.990722.G) AND TTUNER (1.1)</program_id>
<template_id>GCJAA15</template_id>
<trace_direction>FORWARD</trace_direction>
<trace_end>F</trace_end>
<clip_quality_left>3</clip_quality_left>
<clip_quality_right>622</clip_quality_right>
<clip_vector_left>1</clip_vector_left>
<clip_vector_right>944</clip_vector_right>
<insert_stdev>600</insert_stdev>
<insert_size>2000</insert_size>
</trace>
<trace>
...
</trace>
...
</trace_volume></pre><p>
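A file like this can be read with a few lines of Python. The following is only a sketch of how the listed fields could be pulled out with the standard library, not a description of how MIRA parses the file; the function name is illustrative.
</p>

```python
import xml.etree.ElementTree as ET

# Fields listed above as the ones MIRA extracts from TRACEINFO files.
FIELDS = (
    "trace_name", "trace_file", "trace_type_code", "trace_end",
    "clip_quality_left", "clip_quality_right",
    "clip_vector_left", "clip_vector_right", "strain",
    "template_id", "insert_size", "insert_stdev",
    "machine_type", "program_id",
)


def read_traceinfo(xml_text):
    """Return one dict per <trace> element, keeping only the known fields."""
    traces = []
    for trace in ET.fromstring(xml_text).iter("trace"):
        record = {}
        for child in trace:
            if child.tag in FIELDS and child.text is not None:
                record[child.tag] = child.text.strip()
        if "trace_name" not in record:
            raise ValueError("trace without the required trace_name field")
        traces.append(record)
    return traces


EXAMPLE = """<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>GCJAA15TF</trace_name>
    <template_id>GCJAA15</template_id>
    <trace_direction>FORWARD</trace_direction>
    <trace_end>F</trace_end>
    <clip_quality_left>3</clip_quality_left>
    <clip_quality_right>622</clip_quality_right>
    <insert_stdev>600</insert_stdev>
    <insert_size>2000</insert_size>
  </trace>
</trace_volume>"""

for record in read_traceinfo(EXAMPLE):
    print(record["trace_name"], record.get("insert_size"))
```

<p>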
See
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc</a>
for a full description of all fields and more info on the TRACEINFO XML format.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_contig_naming"></a>3.6.7.
Contig naming
</h3></div></div></div><p>
MIRA names contigs the following
way: <span class="emphasis"><em><projectname>_<contigtype><number></em></span>. While <span class="emphasis"><em><projectname></em></span>
is dictated by the [--project=] parameter
and <span class="emphasis"><em><number></em></span> should be clear,
the <span class="emphasis"><em><contigtype></em></span> might need additional
explaining. The following contig types currently exist:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
_c: these are "normal" contigs
</p></li><li class="listitem"><p>
_rep_c: only for genome assembly mode. These are contigs
containing only repetitive areas. These contigs
had <span class="emphasis"><em>_lrc</em></span> as type in previous versions of MIRA;
this was changed to <span class="emphasis"><em>_rep_c</em></span> to make things
clearer.
</p></li><li class="listitem"><p>
_s: these are singlet-contigs. Technically: "contigs" with a
single read.
</p></li><li class="listitem"><p>
_dn: this is an additional contig type which can occur when MIRA
ran a digital normalisation step during the assembly. Contigs
which contain reads completely covered by a DGNr tag will get an
additional "_dn" as part of their name to show that they contain
read representatives for digital normalisation. E.g.:
"contig_dn_c1".
</p><p>
Reads covered only partly by the DGNr tag do not trigger the _dn
naming.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note: Important side note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Important side note</th></tr><tr><td align="left" valign="top"> Due to the digital
normalisation step, the coverage numbers in the info file
regarding contig statistics will not represent the number of
reads in the contig, but they will show an approximation of
the true coverage or expression value as if there had not been
a digital normalisation step performed. The approximation may
be around 10 to 20% below the true value.
</td></tr></table></div></li></ol></div><p>
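The naming scheme can be taken apart with a regular expression. The sketch below assumes that the "_dn" marker, when present, sits directly before the contig type, as in the "contig_dn_c1" example above; the pattern and function name are illustrative and not part of MIRA.
</p>

```python
import re

CONTIG_NAME = re.compile(
    r"^(?P<project>.+?)"     # project name, matched as short as possible
    r"(?P<dn>_dn)?"          # optional digital normalisation marker
    r"_(?P<type>rep_c|c|s)"  # contig type
    r"(?P<number>\d+)$"      # contig number
)


def parse_contig_name(name):
    """Split a contig name into project, type, number and the _dn flag."""
    match = CONTIG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised contig name: {name!r}")
    return {
        "project": match["project"],
        "type": match["type"],
        "number": int(match["number"]),
        "digitally_normalised": match["dn"] is not None,
    }
```

<p>
Note that a project name which itself ends in "_rep" or "_dn" cannot be told apart from the markers by the name alone, so the sketch resolves such cases in favour of the markers.
</p>
<p>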
Basically, for genome assemblies MIRA starts to build contigs in areas
which seem "rock solid", i.e., not a repetitive region (main decision
point) and nice coverage of good reads. Contigs which started like
this get a <span class="emphasis"><em>_c</em></span> name. If during the assembly MIRA
reaches a point where it cannot start building a contig in a
non-repetitive region, it will name the contig
<span class="emphasis"><em>_rep_c</em></span> instead of <span class="emphasis"><em>_c</em></span>. This
is why "_rep_c" contigs occur late in a genome assembly.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
MIRA has a different understanding of "rock solid" when in EST/RNASeq
assembly: here, MIRA will try to reconstruct a full length gene
sequence, starting with the most abundant genes.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Depending on the settings of [-AS:mrpc], your project may or
may not contain <span class="emphasis"><em>_s</em></span> singlet-contigs. Also note
that reads landing in the debris file will not get assigned to
singlet-contigs and hence not get <span class="emphasis"><em>_s</em></span> names.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_recovering_strain_specific_consensus"></a>3.6.8.
Recovering strain specific consensus as FASTA
</h3></div></div></div><p>
In case you used strain information in an assembly, you can
recover the consensus for any given strain
by using <span class="command"><strong>miraconvert</strong></span> to convert from a
full assembly format (e.g. MAF or CAF) which also carries
strain information to FASTA. MIRA will automatically detect
the strain information and create one FASTA file per strain
encountered.
</p><p>
It will also create a blend of all strains encountered and
conveniently add "AllStrains" to the name of these files. Note that
this blend may or may not be something you need, but in some
cases I found it to be useful.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_tags_used_in_the_assembly_by_mira_and_edit"></a>3.7.
Tags used in the assembly by MIRA and EdIt
</h2></div></div></div><p>
MIRA uses and sets a couple of tags during the assembly process. That
is, if information is known before the assembly, it can be stored in tags (in
the EXP and CAF formats) and will be used in the assembly.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_tags_read_and_used"></a>3.7.1.
Tags read (and used)
</h3></div></div></div><p>
This section lists "foreign" tags, i.e., tags whose definition was made
by software packages other than MIRA.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
ALUS, REPT: Sequence stretches tagged as ALUS (ALU Sequence) or REPT
(general repetitive sequence) will be handled with extreme care during the
assembly process. The allowed error rate after automatic contig editing
within these stretches is normally far below the general allowed error rate,
leading to much higher stringency during the assembly process and
subsequently to a better repeat resolving in many cases.
</p></li><li class="listitem"><p>
Fpas: GenBank feature for a poly-A sequence. Used in EST, cDNA or
transcript assembly. Either read in the input files or set when using
[-CL:cpat]. This allows the poly-A sequence to be kept in
the reads during assembly without it interfering as a massive
repeat or as mismatches.
</p></li><li class="listitem"><p>
FCDS, Fgen: GenBank features as described in GBF/GBK files or set in the
Staden package are used to make some SNP impact analysis on genes.
</p></li><li class="listitem"><p>
other. All other tags in reads will be read and passed through the
assembly without being changed and they currently do not influence the
assembly process.
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_tags_set_and_used"></a>3.7.2.
Tags set (and used)
</h3></div></div></div><p>
This section lists tags which MIRA sets (and reads of course), but that other
software packages might not know about.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
UNSr, UNSc: <span class="bold"><strong>UNS</strong></span>ure
in <span class="bold"><strong>R</strong></span>ead
respectively <span class="bold"><strong>C</strong></span>ontig. These tags
denote positions in an assembly with conflicts that could not be
resolved automatically by MIRA. These positions should be looked
at during the finishing process.
</p><p>
For assemblies using good sequences and enough coverage, roughly
0.01% of the consensus positions have such a tag (e.g. ~300 UNSc
tags for a genome of 3 megabases).
</p></li><li class="listitem"><p>
SRMr, SRMc, WRMr, WRMc: <span class="bold"><strong>S</strong></span>trong <span class="bold"><strong>R</strong></span>epeat <span class="bold"><strong>M</strong></span>arker and
<span class="bold"><strong>W</strong></span>eak <span class="bold"><strong>R</strong></span>epeat <span class="bold"><strong>M</strong></span>arker. These
tags are set in two flavours: as
SRM<span class="bold"><strong>r</strong></span> and
WRM<span class="bold"><strong>r</strong></span> when set in reads, and as
SRM<span class="bold"><strong>c</strong></span> and
WRM<span class="bold"><strong>c</strong></span> when set in the
consensus. These tags are used on an individual per base basis for
each read. They denote bases that have been identified as crucial
for resolving repeats, often denoting a single SNP within several
hundreds or thousands of bases. While a SRM is quite certain, the
WRM really is either weak (there wasn't enough comforting
information in the vicinity to be really sure) or involves gap
columns (which is always a bit tricky).
</p><p>
MIRA will automatically set these tags when it encounters repeats
and will tag exactly those bases that can be used to discern the
differences.
</p><p>
Seeing such a tag in the consensus means that MIRA was not able to
finish the disentanglement of that special repeat stretch or that
it found a new one in one of the last passes without having the
opportunity to resolve the problem.
</p></li><li class="listitem"><p>
DGPc: <span class="bold"><strong>D</strong></span>ubious <span class="bold"><strong>G</strong></span>ap <span class="bold"><strong>P</strong></span>osition in
<span class="bold"><strong>C</strong></span>onsensus. Set whenever the gap to base ratio in a column of 454
reads is between 40% and 60%.
</p></li><li class="listitem"><p>
SAO, SRO, SIO: <span class="bold"><strong>S</strong></span>NP intr<span class="bold"><strong>A</strong></span> <span class="bold"><strong>O</strong></span>rganism,
<span class="bold"><strong>S</strong></span>NP inte<span class="bold"><strong>R</strong></span> <span class="bold"><strong>O</strong></span>rganism, <span class="bold"><strong>S</strong></span>NP <span class="bold"><strong>I</strong></span>ntra
and inter <span class="bold"><strong>O</strong></span>rganism. As for SRM
and WRM, these tags have an <span class="bold"><strong>r</strong></span>
appended when set in reads and
a <span class="bold"><strong>c</strong></span> appended when set in the
consensus. These tags denote SNP positions.
</p><p>
MIRA will automatically set these tags when it encounters SNPs and
will tag exactly those bases that can be used to discern the
differences. They denote SNPs as they occur within an organism
(SAO), between two or more organisms (SRO) or within and between
organisms (SIO).
</p><p>
Seeing such a tag in the consensus means that MIRA set this as a
valid SNP in the assembly pass. Seeing such tags only in reads (but not in
the consensus) shows that in a previous pass, MIRA thought these
bases to be SNPs but that in later passes, this SNP does not appear anymore
(perhaps due to resolved misassemblies).
</p></li><li class="listitem"><p>
STMS: (only hybrid assemblies). The <span class="bold"><strong>S</strong></span>equencing <span class="bold"><strong>T</strong></span>ype
<span class="bold"><strong>M</strong></span>ismatch <span class="bold"><strong>S</strong></span>olved
is tagged to positions in the assembly where the consensus of
different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, PacBio, SOLiD)
reads differ, but MIRA thinks it found out the correct
solution. Often this is due to low coverage of one of the types
and an additional base calling error.
</p><p>
Sometimes this depicts real differences; possible explanations
include slightly different bugs having been sequenced or a
mutation having occurred during library preparation.
</p></li><li class="listitem"><p>
STMU: (only hybrid assemblies). The <span class="bold"><strong>S</strong></span>equencing <span class="bold"><strong>T</strong></span>ype
<span class="bold"><strong>M</strong></span>ismatch <span class="bold"><strong>U</strong></span>nresolved
is tagged to positions in the assembly where the consensus of
different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, SOLiD)
reads differ, but MIRA could not find a good resolution. Often this
is due to low coverage of one of the types and an additional base
calling error.
</p><p>
Sometimes this depicts real differences; possible explanations
include slightly different bugs having been sequenced or a mutation
having occurred during library preparation.
</p></li><li class="listitem"><p>
MCVc: The <span class="bold"><strong>M</strong></span>issing <span class="bold"><strong>C</strong></span>o<span class="bold"><strong>V</strong></span>erage in <span class="bold"><strong>C</strong></span>onsensus.
Set in assemblies with more than one strain. If a strain has no coverage at
a certain position, the consensus gets tagged with this tag (and the name of
the strain which misses this position is put in the comment). Additionally,
the sequence in the result files for this strain will have an @ character.
</p></li><li class="listitem"><p>
MNRr: (only with [-KS:mnr] active). The <span class="bold"><strong>M</strong></span>asked
<span class="bold"><strong>N</strong></span>asty <span class="bold"><strong>R</strong></span>epeat tags are set over those parts of a read that
have been detected as being many more times present than the average
sub-sequence. MIRA will hide these parts during the initial
all-against-all overlap finding routine (SKIM3) but will otherwise happily
use these sequences for consensus generation during contig building.
</p></li><li class="listitem"><p>
FpAS: See "Tags read (and used)" above.
</p></li><li class="listitem"><p>
ED_C, ED_I, ED_D: EDit Change, EDit Insertion, EDit Deletion. These
tags are set by the integrated automatic editor EdIt and show which edit
actions have been performed.
</p></li><li class="listitem"><p>
HAF2, HAF3, HAF4, HAF5, HAF6, HAF7. These
are <span class="bold"><strong>HA</strong></span>sh <span class="bold"><strong>F</strong></span>requency
tags which show the status of read parts in comparison to the
whole project. Only set if [-AS:ard] is active (default
for genome assemblies).
</p><p>
More information on how to use what the HAF tags convey can be
found in the section dealing with repeats and HAF tags in finishing
programs further down in this manual.
</p><p>
HAF2: coverage below average (standard setting: < 0.5 times average)
</p><p>
HAF3: coverage at average (standard setting: ≥ 0.5 times average and ≤ 1.5 times average)
</p><p>
HAF4: coverage above average (standard setting: > 1.5 times average and < 2 times average)
</p><p>
HAF5: probably repeat (standard setting: ≥ 2 times average and < 5 times average)
</p><p>
HAF6: 'heavy' repeat (standard setting: > 8 times average)
</p><p>
HAF7: 'crazy' repeat (standard setting: > 20 times average)
</p></li></ul></div><p>
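As a rough illustration of the standard settings listed above, the
following Python sketch maps a k-mer count to a HAF tag name. It is not
MIRA's implementation: the function name <code class="literal">haf_tag</code> and its
arguments are made up for this example. Because the list above leaves
ratios between 5 and 8 times average unassigned, the sketch assumes that
HAF5 extends up to the HAF6 threshold.
</p>

```python
def haf_tag(kmer_count, average_count):
    """Return the HAF tag name for a k-mer seen kmer_count times.

    Thresholds are the standard settings from the list above.
    Assumption: ratios from 5 to 8 are not assigned in that list,
    so this sketch lets HAF5 run up to the HAF6 threshold.
    """
    ratio = kmer_count / average_count
    if ratio > 20:
        return "HAF7"  # 'crazy' repeat
    if ratio > 8:
        return "HAF6"  # 'heavy' repeat
    if ratio >= 2:
        return "HAF5"  # probably repeat
    if ratio > 1.5:
        return "HAF4"  # above average
    if ratio >= 0.5:
        return "HAF3"  # average
    return "HAF2"      # below average


# Example: a k-mer seen 17 times in a project averaging 10 per k-mer.
print(haf_tag(17, 10))
```
<p>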
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_contigs_singlets_debris"></a>3.8.
Where reads end up: contigs, singlets, debris
</h2></div></div></div><p>
At the start, things are simple: a read either aligns with other reads or it does not. Reads which
align with other reads form contigs, and MIRA saves these in the results with a contig name
containing <span class="emphasis"><em>_c</em></span>.
</p><p>
However, not all reads can be placed in an assembly. There can be several reasons for this, and
these reads may end up in one of two places in the result files: either in the
<span class="emphasis"><em>debris</em></span> file (then just as a name entry) or as a singlet (a "contig"
with just one read) in the regular results. Possible reasons are:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
reads are too short and get filtered out (before or after the MIRA
clipping stages). These invariably land in the debris file.
</p></li><li class="listitem"><p>
reads are real singlets: they contain genuine sequence but have no
overlap with any other read. These get caught either by the
[-CL:pec] clipping filter or during the SKIM phase.
</p></li><li class="listitem"><p>
reads contain mostly or completely junk.
</p></li><li class="listitem"><p>
reads contain chimeric sequence (therefore: they're also junk)
</p></li></ol></div><p>
MIRA filters out these reads in different stages: before and after read
clipping, during the SKIM stage, during the Smith-Waterman overlap
checking stage or during contig building. The exact place where these
single reads land depends on why they do not align with other
reads. Reads landing in the debris file will have the reason for and the stage
of the decision attached.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_snp_discovery"></a>3.9.
Detection of bases distinguishing non-perfect repeats and SNP discovery
</h2></div></div></div><p>
MIRA is able to find and tag SNPs in any kind of data -- be it genomic
or EST -- in both de-novo and mapping assemblies ... provided it knows
which read in an assembly is coming from which strain, cell line or
organism.
</p><p>
The SNP detection routines are based on the same routines as those
for detecting non-perfect repeats. In fact, MIRA can even
distinguish bases marking a misassembled repeat from bases
marking a SNP within the same project.
</p><p>
All you need to do to enable this feature is to set
[-CO:mr=yes] (which is standard in all
<code class="literal">--job=...</code> incantations of <span class="command"><strong>mira</strong></span> and
in some steps of <span class="command"><strong>miraSearchESTSNPs</strong></span>). Furthermore, you
will need to provide <span class="emphasis"><em>strain information</em></span>, either in
the manifest file or in ancillary NCBI TRACEINFO XML files.
</p><p>
The effect of using strain names attached to reads can be described
briefly like this. Assume that you have 6 reads (called R1 to R6), three
of them having an <code class="literal">A</code> at a given position, the other
three a <code class="literal">C</code>.
</p><pre class="screen">
R1 ......A......
R2 ......A......
R3 ......A......
R4 ......C......
R5 ......C......
R6 ......C......</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This example is just that: an example. It uses just 6 reads, with two
times three reads as read groups for demonstration purposes and without
looking at qualities. For MIRA to recognise SNPs, a few things must come
together (e.g. for many sequencing technologies it wants forward and
backward reads when in de-novo assembly) and a couple of parameters can
be set to adjust the sensitivity. Read more about the parameters:
[-CO:mrpg:mnq:mgqrt:emea:amgb:amgbemc:amgbnbs]</td></tr></table></div><p>
Now, assume you did not give any strain information. MIRA will most
probably recognise a problem and, having no strain information, assume
it made an error by assembling two different repeats of the same
organism. It will tag the bases in the reads with repeat marker tags
(SRMr) and the base in the consensus with a SRMc tag (to point at an
unresolved problem). In a subsequent pass, MIRA will then not assemble
these six reads together again, but create two contigs like this:
</p><pre class="screen">
Contig1:
R1 ......A......
R2 ......A......
R3 ......A......
Contig2:
R4 ......C......
R5 ......C......
R6 ......C......</pre><p>
The bases in the reads will keep their SRMr tags, but the consensus
base of each contig will not get SRMc as there is no conflict anymore.
</p><p>
Now, assume you gave reads R1, R2 and R3 the strain information "human",
and reads R4, R5 and R6 "chimpanzee". MIRA will then create this:
</p><pre class="screen">
R1 (hum) ......<span class="bold"><strong>A</strong></span>......
R2 (hum) ......<span class="bold"><strong>A</strong></span>......
R3 (hum) ......<span class="bold"><strong>A</strong></span>......
R4 (chi) ......<span class="bold"><strong>C</strong></span>......
R5 (chi) ......<span class="bold"><strong>C</strong></span>......
R6 (chi) ......<span class="bold"><strong>C</strong></span>......</pre><p>
Instead of creating two contigs, it will again create one contig ... but
it will tag the bases in the reads with a SROr tag and the position in
the contig with a SROc tag. The SRO tags (<span class="bold"><strong>S</strong></span>NP inte<span class="bold"><strong>R</strong></span>
<span class="bold"><strong>O</strong></span>rganisms) tell you: there's a SNP
between those two (or multiple) strains/organisms/whatever.
</p><p>
Changing the above example a little, assume you have this assembly early
on during the MIRA process:
</p><pre class="screen">
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......A......
R5 (chi) ......A......
R6 (chi) ......A......
R7 (chi) ......C......
R8 (chi) ......C......
R9 (chi) ......C......</pre><p>
Because "chimp" has a SNP within itself (<code class="literal">A</code> versus
<code class="literal">C</code>) and there's a SNP between "human" and "chimp"
(also <code class="literal">A</code> versus <code class="literal">C</code>), MIRA will see a
problem and set a tag, this time a SIOr tag: <span class="bold"><strong>S</strong></span>NP <span class="bold"><strong>I</strong></span>ntra- and
inter <span class="bold"><strong>O</strong></span>rganism.
</p><p>
MIRA does not like conflicts occurring within an organism and will try
to resolve these cleanly. After setting the SIOr tags, MIRA will
re-assemble the reads in subsequent passes like this:
</p><pre class="screen">
Contig1:
R1 (hum) ......<span class="bold"><strong>A</strong></span>......
R2 (hum) ......<span class="bold"><strong>A</strong></span>......
R3 (hum) ......<span class="bold"><strong>A</strong></span>......
R4 (chi) ......<span class="bold"><strong>A</strong></span>......
R5 (chi) ......<span class="bold"><strong>A</strong></span>......
R6 (chi) ......<span class="bold"><strong>A</strong></span>......
Contig2:
R7 (chi) ......<span class="bold"><strong>C</strong></span>......
R8 (chi) ......<span class="bold"><strong>C</strong></span>......
R9 (chi) ......<span class="bold"><strong>C</strong></span>......</pre><p>
The reads in Contig1 (hum+chi) and Contig2 (chi) will keep their SIOr
tags, the consensus will have no SIOc tag as the "problem" was
resolved.
</p><p>
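The tagging rules described so far can be condensed into a small Python
sketch. It only looks at the bases and strain names of a single column;
qualities, read directions and the [-CO:...] parameters mentioned in the
note above are ignored. The function name and the data layout are
assumptions made for illustration, not MIRA internals, and cases not
covered by the examples in this section are not modelled.
</p>

```python
from collections import defaultdict


def classify_column(column):
    """Classify one alignment column.

    column: list of (strain, base) pairs; strain is None when no
    strain information was given for the read.
    Returns "SRM", "SRO" or "SIO" (the tag families used in the
    text), or None if all bases agree.
    """
    if len({base for _, base in column}) == 1:
        return None  # no discrepancy, nothing to tag
    if any(strain is None for strain, _ in column):
        return "SRM"  # no strain info: treated as repeat marker
    bases_per_strain = defaultdict(set)
    for strain, base in column:
        bases_per_strain[strain].add(base)
    if any(len(bases) > 1 for bases in bases_per_strain.values()):
        return "SIO"  # conflict inside at least one strain
    return "SRO"      # each strain is consistent, strains differ


# The six-read example with strain names attached:
column = [("human", "A")] * 3 + [("chimpanzee", "C")] * 3
print(classify_column(column))
```
<p>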
When presented with conflicting information regarding SNPs and possible
repeat markers or SNPs within an organism, MIRA will always first try to
resolve the repeat markers. Assume the following situation:
</p><pre class="screen">
R1 (hum) ......A...T......
R2 (hum) ......A...G......
R3 (hum) ......A...T......
R4 (chi) ......C...G......
R5 (chi) ......C...T......
R6 (chi) ......C...G......</pre><p>
While the first discrepancy column can be "explained away" by a SNP
between organisms (it will get a SROr/SROc tag), the second column
cannot and will get a SIOr/SIOc tag. After that, MIRA opts to get the
SIO conflict resolved:
</p><pre class="screen">
Contig1:
R1 (hum) ......A...T......
R3 (hum) ......A...T......
R5 (chi) ......C...T......
Contig2:
R2 (hum) ......A...G......
R4 (chi) ......C...G......
R6 (chi) ......C...G......</pre></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_data_reduction"></a>3.10.
Data reduction: subsampling vs. lossless digital normalisation
</h2></div></div></div><p>
Some data sets have way too much data. Sometimes it is simply more than
needed: performing a de-novo genome assembly with enough reads
for 300x coverage, for example, is like taking a sledgehammer to crack a
nut. Sometimes it is even more than is good for an assembly (see also:
motif dependent sequencing errors).
</p><p>
MIRA being an overlap-based assembler, reducing a data set helps to keep
time and memory requirements low. There are basically two ways to
perform this: reduction by subsampling and reduction by digital
normalisation. Both methods have their pros and cons and can be used
effectively in different scenarios.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<span class="emphasis"><em>Subsampling</em></span> is a process to create a smaller,
hopefully representative set from a larger data set.
</p><p>
In sequencing, various ways exist to perform subsampling. As
sequencing data sets from current sequencing technologies can be
seen as essentially randomised when coming fresh from the machine,
the selection step can be as easy as selecting the
first <span class="emphasis"><em>n</em></span> reads. When the input data set is not
random (e.g. in SAM/BAM files with mapped data), one must resort to
random selection of reads.
</p><p>
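A minimal Python sketch of such a random selection, assuming
uncompressed FASTQ input with exactly four lines per record. It uses
reservoir sampling so that the input needs to be read only once. The
function name and the fixed seed are choices made for this example. For
paired-end data both mates of a pair would have to be selected
together, which this sketch does not do.
</p>

```python
import random


def subsample_fastq(lines, n, seed=42):
    """Reservoir-sample up to n records from an iterable of FASTQ lines.

    Assumes plain 4-line records (no wrapped sequence lines). Every
    record has the same chance of being kept, wherever it sits in
    the input. Returns a list of 4-line tuples.
    """
    rng = random.Random(seed)
    reservoir = []
    record = []
    seen = 0
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        seen += 1
        if len(reservoir) < n:
            reservoir.append(tuple(record))
        else:
            # Replace a random slot with probability n / seen.
            slot = rng.randrange(seen)
            if slot < n:
                reservoir[slot] = tuple(record)
        record = []
    return reservoir


# Toy input: ten identical-looking reads named r0 to r9.
toy = "".join("@r%d\nACGT\n+\nIIII\n" % i for i in range(10))
for rec in subsample_fastq(toy.splitlines(True), 3):
    print(rec[0])
```
<p>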
Subsampling must be done by the user prior to assembly with MIRA.
</p><p>
On the upside, subsampling preserves the exact copy number structure
of the input data set: a repeat with n copies in a genome will
always be represented by reads forming n copies of the repeat in the
reduced data set. Furthermore, subsampling is comparatively
insensitive to motif dependent sequencing errors. On the downside,
subsampling is more likely to lose rare events of the data set
(e.g., rare SNPs of a cell population or rare transcripts in
EST/RNASeq). Also, in EST/RNASeq projects, subsampling will not be
able to reduce extraordinary coverage events to a level at which
the assembly is not painfully slow. Examples of the latter are rRNA
genes or highly expressed house-keeping genes, where today's Illumina
data sets sometimes contain enough data to reach coverage numbers
≥ 100,000x or even a million x.
</p><p>
Subsampling should therefore be used for single genome de-novo
assemblies; or for EST/RNASeq assemblies which need reliable
coverage numbers for transcript expression data but where at least
all rDNA has been filtered out prior to assembly.
</p></li><li class="listitem"><p>
<span class="emphasis"><em>Digital normalisation</em></span> is a process to perform a
reduction of sequencing data redundancy. It was made known to a
wider audience by the paper <span class="emphasis"><em>"A Reference-Free Algorithm
for Computational Normalization of Shotgun Sequencing
Data"</em></span> by Brown et al. (see
<a class="ulink" href="http://arxiv.org/abs/1203.4802" target="_top">http://arxiv.org/abs/1203.4802</a>).
</p><p>
The normalisation process works by progressively going through the
sequencing data and selecting reads which bring new, previously
unseen information to the assembly and discarding those which
describe nothing new. For single genome assemblies, this has the
effect that repeats with n copies in the genome are afterwards
present often with just enough reads to reconstruct only a single
copy of the repeat. In EST/RNASeq assemblies, this leads to
reconstructed transcripts all having more or less the same coverage.
</p><p>
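The selection step can be sketched in Python as follows. The sketch
follows the general idea of the paper: a read is kept as long as the
median count of its k-mers, among the reads kept so far, is below a
cutoff. It is therefore the basic scheme, not necessarily what MIRA does
internally, and the values for <code class="literal">k</code> and
<code class="literal">cutoff</code> are toy values chosen for the example.
</p>

```python
from collections import Counter
from statistics import median


def digital_normalisation(reads, k=5, cutoff=2):
    """Keep a read only while its k-mers are still 'new enough'.

    A read is kept if the median count of its k-mers, counted over
    the reads kept so far, is below cutoff. Counts are updated only
    for kept reads.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k: skipped in this sketch
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept


# Five copies of one read plus one unrelated read.
reads = ["ACGTTGCAAG"] * 5 + ["TTTTTGGGGG"]
print(digital_normalisation(reads))
```
<p>
With these toy settings, the redundant copies are reduced while the
unrelated read is retained.
</p><p>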
The normalisation process as described in the paper allows for a
certain lossiness during the data reduction as it was developed to
cope with billions of reads. E.g., it will often lose borders in
genome reorganisation events or SNP information from ploidies, from
closely related gene copies or from closely related species.
</p><p>
MIRA implements a variant of the algorithm: the <span class="emphasis"><em>lossless
digital normalisation</em></span>. Here, normalised data has copy
numbers reduced like in the original algorithm, but all variants
(SNPs, borders of reorganisation events etc.) present in the
original data set are retained in the reduced data set. Furthermore,
the normalisation is parameterised to take place only for
excessively repetitive parts of a data set which would lead to
overly increased run-time and memory consumption. This gives the
assembler the opportunity to correctly evaluate and work with
repeats which do not occur "too often" in a data set while still
being able to reconstruct at least one copy of the really nasty
repeats.
</p><p>
Digital normalisation should not be done prior to an assembly with
MIRA; rather, the MIRA parameter to perform a digital normalisation
on the complete data set should be used.
</p><p>
The lossless digital normalisation of MIRA should be used for
EST/RNASeq assemblies containing highly repetitive data. Metagenome
assemblies may also profit from this feature.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
MIRA keeps track of the approximate coverage represented by the
reads chosen in the digital normalisation process. That is, MIRA is
able to give approximate coverage numbers as if digital
normalisation had never happened. The approximation may be around 10
to 20% below the true value. Contigs affected by this coverage
approximation are denoted with an additional "_dn" in their name.
</p><p>
Due to the digital normalisation step, the coverage numbers in the
info file regarding contig statistics will not represent the number
of reads in the contig. Instead, they show an approximation of
the true coverage or expression value as if no digital
normalisation step had been performed.
</p></td></tr></table></div></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_caveats"></a>3.11.
Caveats
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_using_artificial_reads"></a>3.11.1.
Using data not from sequencing instruments: artificial / synthetic reads
</h3></div></div></div><p>
The default parameters for MIRA assemblies work best when given real
sequencing data and they even expect the data to behave like real
sequencing data. But some assembly strategies work in multiple rounds,
using so called "artificial" or "synthetic" reads in later rounds,
i.e., data which was not generated through sequencing machines but
might be something like the consensus of previous assemblies.
</p><p>
If one doesn't take the utmost care to make these artificial reads at least
behave a little bit like real sequencing data, a number of quality
assurance algorithms of MIRA might spot that they "look funny" and
trim back these artificial reads ... sometimes even removing them
completely.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note: Summary tips for creating artificial reads for MIRA assemblies"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Summary tips for creating artificial reads for MIRA assemblies</th></tr><tr><td align="left" valign="top"><p>
The following should lead to the fewest surprises for most
assembly use cases when calling MIRA only with the most basic
switches <code class="literal">--project=... --job=...</code>:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><span class="bold"><strong>Length:</strong></span> between 50 and 20000 bp
</li><li class="listitem"><span class="bold"><strong>Quality values:</strong></span> give your
artificial reads quality values. Using <span class="emphasis"><em>30</em></span>
as quality value for your bases should be OK for most
applications.
</li><li class="listitem"><span class="bold"><strong>Orientation:</strong></span> for every read you
create, create a read with the same data (bases and quality
values) in reverse complement direction.
</li></ol></div></td></tr></table></div><p>
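One possible way to follow these three tips is sketched below in Python:
it checks the length, assigns quality 30 to every base and emits each
read together with its reverse complement in FASTQ format. The read name
suffixes <code class="literal">_fwd</code> and <code class="literal">_rev</code> and the
function names are hypothetical, and Phred+33 encoded FASTQ is assumed
as the output format.
</p>

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def synthetic_fastq(name, seq, quality=30):
    """Return FASTQ text holding seq and its reverse complement.

    All bases get the same quality value. Since the quality is
    uniform, the same quality string serves both directions.
    """
    if not 50 <= len(seq) <= 20000:
        raise ValueError("length outside the suggested 50 to 20000 bp")
    qual = chr(quality + 33) * len(seq)  # Phred+33 encoding (assumed)
    records = [
        (name + "_fwd", seq),                      # hypothetical naming
        (name + "_rev", reverse_complement(seq)),
    ]
    return "".join("@%s\n%s\n+\n%s\n" % (rid, s, qual) for rid, s in records)


# Example: a 60 bp artificial read.
print(synthetic_fastq("syn1", "ACGTT" * 12), end="")
```
<p>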
The following list gives all the gory details on what synthetic reads
should look like or which MIRA algorithms to switch off in certain
cases:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Forward and reverse complement directions: most sequencing
technologies and strategies yield a mixture of reads with both
forward and reverse complement direction to the DNA sequenced. In
fact, having both directions allows for a much better quality
control of an alignment as sequencing technology dependent
sequencing errors will often affect only one direction at a given
place and not both (the exception being homopolymers and 454).
</p><p>
The MIRA <span class="emphasis"><em>proposed end clipping</em></span> algorithm
[-CL:pec] uses this knowledge to initially trim back
ends of reads to an area without sequencing errors. However, if
reads covering a given area of DNA are present in only one
direction, then these reads will be completely eliminated.
</p><p>
If you use only artificial reads in an assembly, then switch off
the <span class="emphasis"><em>proposed end clipping</em></span>
[-CL:pec=no].
</p><p>
If you mix artificial reads with "normal" reads, make sure that
every part of an artificial read is covered by some other read in
reverse complement direction (be it a normal or artificial
read). The easiest way to do that is to add a reverse complement
for every artificial read yourself, though if you use an
overlapping strategy with artificial reads, you can calculate the
overlaps and reverse complements of reads so that every second
artificial read is in reverse complement to save time and memory
afterwards during the computation.
</p></li><li class="listitem"><p>
Sequencing type/technology: MIRA currently knows Sanger, 454, Ion
Torrent, Solexa, PacBioHQ/LQ and "Text" as sequencing
technologies; every read entered in an assembly must be one of
those.
</p><p>
Artificial reads should be classified depending on the data they
were created from, that is, Sanger for consensus of Sanger reads,
454 for consensus of 454 reads etc. However, should reads created
from Illumina consensus be much longer than, say, 200 or 300
bases, you should treat them as Sanger reads.
</p></li><li class="listitem"><p>
Quality values: be careful to assign decent quality values to your
artificial reads as several quality clipping or consensus calling
algorithms make extensive use of qualities. Pay attention to
values of [-CL:qc:bsqc] as well as to
[-CO:mrpg:mnq:mgqrt].
</p></li><li class="listitem"><p>
Read lengths: the current maximum read length for MIRA is around
30kb. However, as a safety margin, MIRA currently allows
only 20kb reads as maximum length.
</p></li></ul></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_ploidy_and_repeats"></a>3.11.2.
Ploidy and repeats
</h3></div></div></div><p>
MIRA treats ploidy differences as repeats and will therefore build
separate contigs for the reads of a ploidy that has a difference to
the other ploidy/ploidies.
</p><p>
There is simply no other way to handle ploidy while retaining the
ability to separate repeats based on differences of only a single
base. Everything else would be guesswork. I thought for some time
about doing a coverage analysis around the potential repeat/ploidy
site, but came to the conclusion that due to the stochastic nature of
sequencing data, this would very probably take wrong decisions in too
many cases to be acceptable.
</p><p>
If someone has a good idea, I'll be happy to hear it.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_handling_of_repeats"></a>3.11.3.
Handling of repeats
</h3></div></div></div><p>
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_uniform_read_distribution"></a>3.11.3.1.
Uniform read distribution
</h4></div></div></div><p>
Under the assumption that reads in a project are uniformly
distributed across the genome, MIRA will enforce an average coverage
and temporarily reject reads from a contig when this average
coverage multiplied by a safety factor is reached at a given
site. This strategy reduces over-compression of repeats during the
contig building phase and keeps reads in reserve for other copies of
that repeat.
</p><p>
It's generally a very useful tool to disentangle repeats, but it has a
slight side effect: the rejection of otherwise perfectly good
reads. The assumption of read distribution uniformity is the big
problem we have here: of course it's not really valid. You sometimes
have less, and sometimes more than "the average"
coverage. Furthermore, the new sequencing technologies - 454 perhaps
but certainly the ones from Solexa - show that you also have a skew
towards the site of replication origin.
</p><p>
Warning: Solexa data from late 2009 and 2010 show a high GC content
bias. This bias can reach 200 or 300%, i.e., sequence part for with
low GC
</p><p>
One example: let's assume the average coverage of a project is 8 and
by chance at one place there are 17 (non-repetitive) reads, then the
following happens:
</p><p>
(Note: <span class="emphasis"><em>p</em></span> is the parameter [-AS:urdsip])
</p><p>
Pass 1 to <span class="emphasis"><em>p-1</em></span>: MIRA happily assembles everything together and calculates a
number of different things, amongst them an average coverage of ~8. At the
end of pass <span class="emphasis"><em>p-1</em></span>, it will announce this average coverage as first estimate
to the assembly process.
</p><p>
Pass <span class="emphasis"><em>p</em></span>: MIRA has still assembled everything together, but at the end of each
pass the contig self-checking algorithms now include an "average coverage
check". They'll invariably find the 17 reads stacked and decide (looking at
the [-AS:ardct] parameter which is assumed to be 2 for this example)
that 17 is larger than 2*8 and that this very well may be a repeat. The reads
get flagged as possible repeats.
</p><p>
Pass <span class="emphasis"><em>p+1</em></span> to end: the "possibly repetitive" reads get a much tougher
treatment in MIRA. Amongst other things, when building the contig, the contig
now makes sure that "possibly repetitive" reads do not over-stack beyond the average
coverage multiplied by a safety value ([-AS:urdcm]), which we'll
assume to be 1.5 in this example. So, at a certain point, say when read 14
or 15 of that possible repeat wants to be aligned to the contig at this given
place, the contig will just flatly refuse and tell the assembler to please
find another place for them, be it in the contig currently being built or any other
that will follow. Of course, if the assembler cannot comply, reads 14 to
17 will end up as a contiglet (contig debris, if you want) or, if only one
read got rejected like this, it will end up as a singlet or in the debris
file.
</p><p>
Tough luck. I do have ideas on how to re-integrate those reads at the end of an
assembly, but I have deferred doing this as in every case I have looked at,
adding those reads to the contigs wouldn't have changed anything ... there's
already enough coverage.
</p><p>
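The threshold arithmetic of the example above can be written down as a
short Python sketch. The argument names mirror [-AS:ardct] and
[-AS:urdcm]; the values 2 and 1.5 are the ones assumed in the example
rather than guaranteed defaults, and the real checks in MIRA take more
into account than this simple product, so the exact read at which a
contig starts refusing may differ.
</p>

```python
def coverage_checks(site_coverage, average_coverage, ardct=2.0, urdcm=1.5):
    """Return (flagged, max_stack) for one site of a contig.

    flagged:   True if the site exceeds ardct times the average
               coverage and its reads would be marked as possibly
               repetitive.
    max_stack: coverage level (urdcm times the average) above which
               possibly repetitive reads would be refused.
    """
    flagged = site_coverage > ardct * average_coverage
    max_stack = urdcm * average_coverage
    return flagged, max_stack


# The example from the text: average coverage 8, 17 reads at one site.
print(coverage_checks(17, 8))
```
<p>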
What should be done in those cases is simply to filter away the contiglets
(defined as being of small size and having an average coverage below the
average coverage of the project divided by 3 (or 2.5)) from a project.
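</p><p>
Such a filter could look like the following Python sketch. The text
gives no number for "small size", so the <code class="literal">max_length</code> of
500 bases is purely a placeholder. The tuple layout of the contig list
is likewise an assumption; the actual values would have to be taken from
the contig statistics of the project.
</p>

```python
def filter_contiglets(contigs, project_average, max_length=500, divisor=3.0):
    """Drop contigs that are both short and of low coverage.

    contigs: list of (name, length, average_coverage) tuples.
    max_length: hypothetical stand-in for "small size".
    divisor: the 3 (or 2.5) mentioned in the text.
    """
    limit = project_average / divisor
    return [
        contig for contig in contigs
        if not (contig[1] <= max_length and contig[2] < limit)
    ]


# Made-up contig statistics for a project with average coverage 8.
contigs = [("c1", 50000, 8.2), ("c2", 300, 2.0), ("c3", 300, 7.5)]
print(filter_contiglets(contigs, 8.0))
```
<p>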
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_keeping_'long'_repetitive_contigs_separate"></a>3.11.3.2.
Keeping 'long' repetitive contigs separate
</h4></div></div></div><p>
Since version 2.9.36, MIRA has had a feature to keep long repeats in separate
contigs. Due to algorithm changes, this feature is now standard. The
effect of this is that contigs with non-repetitive sequence will
stop at a 'long repeat' border which cannot be crossed by a single
read or by paired reads, including only the first few bases of the
repeat. Long repeats will be kept as separate contigs.
</p><p>
This has been implemented to get a clean overview on which parts of
an assembly are 'safe' and which parts will be 'difficult'. For
this, the naming of the contigs has been extended: contigs named
with a '_c' at the end are contigs which contain mostly 'normal'
coverage. Contigs with "rep_c" are contigs which contain mostly
sequence classified as repetitive and which could not be assembled
together with a 'c' contig.
</p><p>
The question remains: what are 'long' repeats? MIRA defines these as
repeats that are not spanned by any read that has non-repetitive
parts at the end. Basically, for shotgun assemblies, the mean
length of the reads that go into the assembly defines the minimum
length of 'long' repeats that have to be kept in separate contigs.
</p><p>
It has to be noted that when using paired-end (or template)
sequencing, 'long' repeats which can be spanned by read-pairs (or
templates) are frequently integrated into 'normal' contigs as MIRA
can correctly place them most of the time.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_ref_helping_finishing_by_tagging_reads_with_haf_tags"></a>3.11.3.3.
Helping finishing by tagging reads with HAF tags
</h4></div></div></div><p>
HAF tags (HAsh Frequency) are set by MIRA when the option to colour reads by
kmer frequency ([-GE:crkf], on by default in most --job combinations)
is on. These tags show the status of k-mers (stretch of bases of given length
<span class="emphasis"><em>k</em></span>) in read sequences: whether MIRA recognised them as being present in
sub-average, average, above average or repetitive numbers.
</p><p>
When using a finishing program which can display tags in reads (and using the
proposed tag colour schemes for gap4 or consed), the assembly
will light up in colours ranging from light green to dark red, indicating
whether a certain part of the assembly is deemed non-repetitive to extremely
repetitive.
</p><p>
One of the biggest advantages of the HAF tags is the implicit information they
convey on why the assembler stopped building a contig at an end.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF2
tags (below average frequency, coloured light-green), then one very probably
has a hole in the contig due to coverage problems which means there are no
or not enough reads covering a part of the sequence.
</p></li><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF3
tags (average frequency, coloured green), then you have an unusual situation
as this should only very rarely occur. The reason is that MIRA saw that
there are enough sequences which look the same as the one from your contig
end, but that these could not be joined. Likely reasons for this scenario
include non-random sequencing artifacts (seen in 454 data) or also
non-random chimeric reads (seen in Sanger and 454 data).
</p></li><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF4
tags (above average frequency, coloured yellow), then the assembler stopped
in a grey zone where the coverage is not normal anymore, but not quite
repetitive yet. This can happen in cases where the read coverage is very
unevenly distributed across the project. The contig end in question might be
a repeat occurring two times in the sequence, but having fewer reads than
expected. Or it may be non-repetitive coverage with an unusual excess of
reads.
</p></li><li class="listitem"><p>
if the read parts composing a contig end are mostly covered with HAF5
(repeat, coloured red), HAF6 (heavy repeat, coloured darker red) and HAF7
tags (crazy repeat, coloured very dark red), then there is a repetitive area
in the sequence which could not be uniquely bridged by the reads present in
the assembly.
</p></li></ul></div><p>
</p><p>
This information can be especially helpful when joining reads by hand in a
finishing program. The following list gives you a short guide to cases which
are most likely to occur and what you should do.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF2
tags. Joining these contigs is probably a safe bet. The assembly may have
missed this join because of too many errors in the read ends or because
sequence which could have been useful for joining the contigs was clipped away.
Just check whether the join seems sensible, then join.
</p></li><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF3
tags. Joining these contigs is probably a safe bet. The assembly may have
missed this join because of several similar chimeric reads or reads
with similar, severe sequencing errors covering the same spot.
Just check whether the join seems sensible, then join.
</p></li><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF4
tags. Joining these contigs should be done with some caution, it
may be a repeat occurring twice in the sequence. Check whether
the contig ends in question align with ends of several other
contigs. If not, joining is probably the way to go. If potential
joins exist with several other contigs, then it's a repeat (see
below).
</p></li><li class="listitem"><p>
the proposed join involves contig ends mostly covered by HAF5, HAF6 or
HAF7 tags. Joining these contigs should be done with utmost caution: you are
almost certainly (HAF5) or very certainly (HAF6 and HAF7) in a repetitive
area of your sequence.
You will probably need additional information like paired-end or template
info in order to join your contigs.
</p></li></ul></div><p>
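The guide above can be condensed into a small lookup table. The following
Python sketch is only an illustration and not part of MIRA: the name
<code class="literal">join_advice</code> and the wording of the returned strings are
made up here, and it assumes you have already determined (by eye, in your
finishing program) which HAF tag mostly covers the contig ends in question.
</p><pre class="programlisting">
```python
# Hypothetical helper: condenses the join guide above into a lookup table.
# Keys are the HAF tag numbers that mostly cover the contig ends in question.
JOIN_ADVICE = {
    2: "probably safe: check that the join looks sensible, then join",
    3: "probably safe: check that the join looks sensible, then join",
    4: "some caution: possible repeat, check for alignments with other contig ends",
    5: "utmost caution: almost certainly repetitive, get paired-end or template info",
    6: "utmost caution: repetitive, get paired-end or template info",
    7: "utmost caution: repetitive, get paired-end or template info",
}


def join_advice(haf_level):
    """Return the rule of thumb for contig ends mostly covered by HAF tags of this level."""
    if haf_level not in JOIN_ADVICE:
        raise ValueError("the guide above only covers HAF2 to HAF7")
    return JOIN_ADVICE[haf_level]


if __name__ == "__main__":
    for level in sorted(JOIN_ADVICE):
        print("HAF%d: %s" % (level, join_advice(level)))
```
</pre><p>
A table like this is no replacement for actually looking at the alignment, it
merely tells you how suspicious you should be before joining.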
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_consensus_in_finishing_programs_gap4_consed_"></a>3.11.4.
Consensus in finishing programs (gap4, consed, ...)
</h3></div></div></div><p>
MIRA goes a long way to calculate a consensus which is as correct as
possible. Unfortunately, communication with finishing programs is a bit
problematic as there currently is no standard way to say which reads are from
which sequencing technology.
</p><p>
It is therefore often the case that finishing programs calculate their own
consensus when loading a project assembled with MIRA. This is the case for
gap4, at least. This consensus may then not be optimal.
</p><p>
The recommended way to deal with this problem is: import the results from MIRA
into your finishing program like you always do. Then finish the genome there,
export the project from the finishing program as CAF and finally use
miraconvert (from the MIRA package) with the "-r" option to
recalculate the optimal consensus of your finished project.
</p><p>
E.g., assuming you have just finished editing the gap4 database
<code class="filename">DEMO.3</code>, do the following. First, export the gap4 database back to
CAF:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>gap2caf -project DEMO -version 3 >demo3.caf</code></strong></pre><p>
</p><p>
Then, use <span class="command"><strong>miraconvert</strong></span> <span class="emphasis"><em>with option '-r'</em></span> to
convert it into any other format that you need. Example for converting to a
CAF and a FASTA format with correct consensus:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t caf -t fasta -r c demo3.caf final_result</code></strong></pre><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_some_other_things_to_consider"></a>3.11.5.
Some other things to consider
</h3></div></div></div><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
MIRA cannot work with EXP files resulting from GAP4 that already
have been edited. If you want to reassemble an edited GAP4 project, convert
it to CAF format and use the [-caf] option to load.
</p></li><li class="listitem"><p>
As also explained earlier, MIRA relies on sequencing vector being
recognised in preprocessing steps by other programs. Sometimes, when a whole
stretch of bases is not correctly marked as sequencing vector, the reads
might not be aligned into a contig although they might otherwise match quite
perfectly. You can use [-CL:pvc] and [-CO:emea] to address
problems with incomplete clipping of sequencing vectors. Letting the
assembler work with less strict parameters may also help here.
</p></li><li class="listitem"><p>
MIRA has been developed to assemble shotgun sequencing or EST
sequencing data. There are no explicit limitations concerning length or
number of sequences. However, there are a few implicit assumptions that were
made while writing portions of the code:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Problems might arise with 'unnaturally' long sequence
reads because of my implementation of the Smith-Waterman alignment
routines. I use a banded version with linear running time
(linear to the bandwidth) but quadratic space usage. So,
comparing two 'reads' of length 5000 will result in memory
usage of 95 MiB, two reads with 50000 bases will need 100 times
as much, i.e., about 9.3 GiB.
</p><p>
This problem has become acute now with PacBio, I'm working on
it. In the meantime, current usable sequence lengths of PacBio
are more in the 3 to 4 kilobase range, with only a few reads
attaining or surpassing 20 kb. So today's machines should
still be able to handle the problem more or less effortlessly.
</p></li><li class="listitem"><p>
32 bit versions of MIRA are not supported anymore.
</p></li><li class="listitem"><p>
to reduce memory overhead, the following assumptions have been made:
</p></li><li class="listitem"><p>
MIRA is not fully multi-threaded (yet), though most
bottlenecks are now in code areas which cannot be
multi-threaded by algorithm design.
</p></li></ol></div></li><li class="listitem"><p>
a project does not contain sequences from more than 255 different:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
sequencing machine types
</p></li><li class="listitem"><p>
primers
</p></li><li class="listitem"><p>
strains (in mapping mode: 7)
</p></li><li class="listitem"><p>
base callers
</p></li><li class="listitem"><p>
dyes
</p></li><li class="listitem"><p>
process status
</p></li></ul></div></li><li class="listitem"><p>
a project does not contain sequences from more than 65535 different
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
clone vectors
</p></li><li class="listitem"><p>
sequencing vectors
</p></li></ul></div></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_things_you_should_not_do"></a>3.12.
Things you should not do
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_never_on_nfs"></a>3.12.1.
Do not run MIRA on NFS mounted directories without redirecting the tmp directory
</h3></div></div></div><p>
Of course one can run MIRA atop an NFS mount (a "disk" mounted over a
network using the NFS protocol), but the performance will go down the
drain as the NFS server or the network will not be able to
cope with the amount of data MIRA needs to shift to and from disk
(writes/reads to the tmp directory). Slowdowns of a factor of 10 and
more have been observed. In case you have no other possibility, you
can force MIRA to run atop an NFS mount using [-NW:cnfs=warn]
( [-NW:cnfs=no]), but you have been warned.
</p><p>
In case you want to keep input and output files on NFS, you can use
[-DI:trt] to redirect the tmp directory to a local
filesystem. Then MIRA will run at almost full speed.
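</p><p>
If you are not sure whether a directory sits on NFS, you can look it up before
starting an assembly. The Python sketch below is one way to do this on Linux; it
is not part of MIRA. It assumes the usual <code class="filename">/proc/mounts</code>
layout (device, mount point, filesystem type, ...), and the function names
<code class="literal">filesystem_type</code> and <code class="literal">is_on_nfs</code> are
made up for this example.
</p><pre class="programlisting">
```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount point that contains path.

    mounts_text is expected in /proc/mounts format:
    device, mount point, filesystem type, options, ... separated by spaces.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mount_point, fs_type = fields[1], fields[2]
            prefix = mount_point.rstrip("/") + "/"
            inside = path == mount_point or (path + "/").startswith(prefix)
            # The longest matching mount point is the one that actually holds path.
            if inside and len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


def is_on_nfs(path, mounts_text):
    """True if path lives on an NFS mount (types nfs, nfs4, ...)."""
    fs_type = filesystem_type(path, mounts_text)
    return fs_type is not None and fs_type.startswith("nfs")


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        mounts = handle.read()
    print("current directory on NFS:", is_on_nfs(os.getcwd(), mounts))
```
</pre><p>
If the check says your project directory is on NFS, redirect the tmp directory
with [-DI:trt] as described above.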
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_never_without_quality_values"></a>3.12.2.
Do not assemble without quality values
</h3></div></div></div><p>
Assembling sequences without quality values is like ... like ... like
driving a car down a sinuous mountain road with no rails at 200
km/h without brakes, airbags and no steering wheel. With a ravine on
one side and a rock face on the other. Did I mention the missing
seat-belts? You <span class="emphasis"><em>might</em></span> get down safely, but
experience says the result will more likely be a bloody mess.
</p><p>
Well, assembling without quality values is a bit like above, but
bloodier. And the worst: you (or the people using the results of such
an assembly) will notice the gore only when it is way too late and
money has been sunk in follow-up experiments based on wrong data.
</p><p>
All MIRA routines internally are geared toward quality values guiding
decisions. No one should ever assemble anything without quality
values. Never. Ever. Even if quality values are sometimes inaccurate,
they do help.
</p><p>
Now, there are <span class="bold"><strong>very rare occasions</strong></span>
where getting quality values is not possible. If you absolutely cannot
get them, and I mean only in this case, use the following
switch: <code class="literal">--noqualities[=SEQUENCINGTECHNOLOGY]</code> and
additionally give a default quality for reads of a readgroup. E.g.:
</p><pre class="screen">parameters= --noqualities=454
readgroup
technology=454
data=...
default_qual=30</pre><p>
This tells MIRA not to complain about missing quality values and to
fake a quality value of 30 for all reads (of a readgroup) having no
qualities, allowing some MIRA routines (in standard parameter
settings) to start disentangling your repeats.
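</p><p>
To get a feeling for what a faked quality of 30 claims about your reads: on the
usual phred scale, a quality value Q corresponds to an error probability of
10^(-Q/10). The small Python sketch below does the conversion. It assumes that
the default quality is read on that conventional phred scale, and the function
name is made up for this example.
</p><pre class="programlisting">
```python
def phred_to_error_probability(quality):
    """Convert a phred-scaled quality value to the probability of a wrong base call."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print("Q%d: about 1 wrong base call in %d" % (q, round(1.0 / p)))
```
</pre><p>
In other words, a default quality of 30 asserts one wrong base call in a
thousand for every base of every read without qualities, whether that is true
or not.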
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Doing the above has some severe side-effects. You will be, e.g., at
the mercy of non-random sequencing errors. I suggest combining the
above with a [-CO:mrpg=4] or higher. You also may want to
tune the default quality parameter together with [-CO:mnq]
and [-CO:mgqrt] in cases where you mix sequences with and
without quality values.
</td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_useful_third_party_programs"></a>3.13.
Useful third party programs
</h2></div></div></div><p>
Viewing the results of a MIRA assembly or preprocessing the sequences
for an assembly can be done with a number of different programs. The
following ones are just examples, there are a lot more packages
available:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
HTML browser
</span></dt><dd><p> If you have really nothing else as viewer, a browser that
understands tables is needed to view the HTML output. A browser knowing
style sheets (CSS) is recommended, as different tags will be highlighted.
Konqueror, Opera, Mozilla, Netscape and Internet Explorer all do fine, lynx
is not really ... optimal.
</p></dd><dt><span class="term">
Assembly viewer / finishing / preprocessing
</span></dt><dd><p>
You'll want GAP4 or its successor GAP5 (generally speaking: the
Staden package) to preprocess the sequences, visualise and
eventually rework the results when using gap4da output. The Staden
package comes with a fully featured sequence preparing and
annotating engine (pregap4) that is very useful to preprocess your
Sanger data (conversion between file types, quality clipping,
tagging etc.).
</p><p>
See <a class="ulink" href="http://www.sourceforge.net/projects/staden/" target="_top">http://www.sourceforge.net/projects/staden/</a> for
further information and also a possibility to download precompiled
binaries for different platforms.
</p></dd><dt><span class="term">
Vector screening
</span></dt><dd><p>
Reading result files from <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> from the Sanger Centre is supported
directly by MIRA to perform a fast and efficient tagging of
sequencing vector stretches. This makes you basically independent
from any other commercial or license-requiring vector screening
software. For Sanger reads, a combination of
<span class="command"><strong>lucy</strong></span> (see below), <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> together with the MIRA parameters for
SSAHA2 / SMALT support (see all [-CL:msvs*] parameters) and quality clipping
( [-CL:qc]) should do the trick. For reads coming from 454
pyro-sequencing, <span class="command"><strong>ssaha2</strong></span> or
<span class="command"><strong>smalt</strong></span> and the SSAHA2 / SMALT support also work
pretty well.
</p><p>
See
<a class="ulink" href="http://www.sanger.ac.uk/resources/software/ssaha2/" target="_top">http://www.sanger.ac.uk/resources/software/ssaha2/</a>
and / or <a class="ulink" href="http://www.sanger.ac.uk/resources/software/smalt/" target="_top">http://www.sanger.ac.uk/resources/software/smalt/</a> for
further information and also a possibility to download the source
or precompiled binaries for different platforms.
</p></dd><dt><span class="term">
Preprocessing
</span></dt><dd><p> <span class="command"><strong>lucy</strong></span> from TIGR (now JCVI) is another
useful sequence preprocessing program for Sanger data. Lucy is a
utility that prepares raw DNA sequence fragments for sequence
assembly. The cleanup process includes quality assessment,
confidence reassurance, vector trimming and vector removal.
</p><p>
There's a small script in the MIRA 3rd party package which
converts the clipping data from the lucy format into something
MIRA can understand (NCBI Traceinfo).
</p><p>
See <a class="ulink" href="ftp://ftp.tigr.org/pub/software/Lucy/" target="_top">ftp://ftp.tigr.org/pub/software/Lucy/</a> to download the source code
of lucy.
</p></dd><dt><span class="term">
Assembly viewer
</span></dt><dd><p> Viewing <code class="filename">.ace</code> file output without consed
can be done with clview from TIGR. See
<a class="ulink" href="http://www.tigr.org/tdb/tgi/software/" target="_top">http://www.tigr.org/tdb/tgi/software/</a>.
</p><p>
A better alternative is Tablet <a class="ulink" href="http://bioinf.scri.ac.uk/tablet/" target="_top">http://bioinf.scri.ac.uk/tablet/</a> which also reads SAM
format.
</p></dd><dt><span class="term">
Assembly coverage analysis
</span></dt><dd><p>
The Integrated Genome Browser (IGB) of the GenoViz project at
SourceForge (<a class="ulink" href="http://sourceforge.net/projects/genoviz/" target="_top">http://sourceforge.net/projects/genoviz/</a>) is just perfect
for loading a genome and looking at mapping coverage (provided by
the wiggle result files of MIRA).
</p></dd><dt><span class="term">
Preprocessing (base calling)
</span></dt><dd><p>
TraceTuner (<a class="ulink" href="http://sourceforge.net/projects/tracetuner/" target="_top">http://sourceforge.net/projects/tracetuner/</a>) is a tool for
base and quality calling of trace files from DNA sequencing
instruments. Originally developed by Paracel, this code base was
released as open source in 2006 by Celera.
</p></dd><dt><span class="term">
Preprocessing / viewing
</span></dt><dd><p> phred (basecaller) - cross_match (sequence comparison and
filtering) - phrap (assembler) - consed (assembly viewer and
editor). This is another package that can be used for this type of
job, but requires more programming work. The fact that sequence
stretches are masked out (overwritten with the character X) if they
shouldn't be used in an assembly doesn't really help and is
considered harmful (but it works).
</p><p>
Note the bug of consed when reading ACE files, see more about this
in the section on file types (above) in the entry for ACE.
</p><p>
See <a class="ulink" href="http://www.phrap.org/" target="_top">http://www.phrap.org/</a> for further information.
</p></dd><dt><span class="term">
text viewer
</span></dt><dd><p> A text viewer for the different textual output files.
</p></dd></dl></div><p>
As always, most of the time a combination of several different packages
is possible. My currently preferred combo for genome projects is
<span class="command"><strong>ssaha2</strong></span> or <span class="command"><strong>smalt</strong></span> and/or
<span class="command"><strong>lucy</strong></span> (vector screening), MIRA (assembly, of course)
and gap4 (assembly viewing and finishing).
</p><p>
For re-assembling projects that were edited in gap4, one will also need
the gap2caf converter. The source for this is available at
<a class="ulink" href="http://www.sanger.ac.uk/resources/software/caf.html" target="_top">http://www.sanger.ac.uk/resources/software/caf.html</a>.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_speed_and_memory_considerations"></a>3.14.
Speed and memory considerations
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_memory"></a>3.14.1.
Estimating needed memory for an assembly project
</h3></div></div></div><p>
Since the V2.9.24x3 version of MIRA, <span class="command"><strong>miramem</strong></span> is available as a
program call. When called from the command line, it will ask a number of
questions and then print out an estimate of the amount of RAM needed to
assemble the project. Take this estimate with a grain of salt: depending on
the sequences' properties, variations in the estimate can be +/- 30% for
bacteria and 'simple' eukaryotes. The higher the number of repeats is, the
more likely you will need to restrict memory usage in some way or another.
</p><p>
Here's the transcript of a session with miramem:
</p><pre class="screen">
This is MIRA V3.2.0rc1 (development version).
Please cite: Chevreux, B., Wetter, T. and Suhai, S. (1999), Genome Sequence
Assembly Using Trace Signals and Additional Sequence Information.
Computer Science and Biology: Proceedings of the German Conference on
Bioinformatics (GCB) 99, pp. 45-56.
To (un-)subscribe the MIRA mailing lists, see:
http://www.chevreux.org/mira_mailinglists.html
After subscribing, mail general questions to the MIRA talk mailing list:
mira_talk@freelists.org
To report bugs or ask for features, please use the SourceForge ticketing
system at:
http://sourceforge.net/p/mira-assembler/tickets/
This ensures that requests do not get lost.
[...]
miraMEM helps you to estimate the memory needed to assemble a project.
Please answer the questions below.
Defaults are give in square brackets and chosen if you just press return.
Hint: you can add k/m/g modifiers to your numbers to say kilo, mega or giga.
Is it a genome or transcript (EST/tag/etc.) project? (g/e/) [g]
g
Size of genome? [4.5m] <strong class="userinput"><code>9.8m</code></strong>
9800000
Size of largest chromosome? [9800000]
9800000
Is it a denovo or mapping assembly? (d/m/) [d]
d
Number of Sanger reads? [0]
0
Are there 454 reads? (y/n/) [n] <strong class="userinput"><code>y</code></strong>
y
Number of 454 GS20 reads? [0]
0
Number of 454 FLX reads? [0]
0
Number of 454 Titanium reads? [0] <strong class="userinput"><code>750k</code></strong>
750000
Are there PacBio reads? (y/n/) [n]
n
Are there Solexa reads? (y/n/) [n]
n
************************* Estimates *************************
The contigs will have an average coverage of ~ 30.6 (+/- 10%)
RAM estimates:
reads+contigs (unavoidable): 7.0 GiB
large tables (tunable): 688. MiB
---------
total (peak): 7.7 GiB
add if using -CL:pvlc=yes : 2.6 GiB
Estimates may be way off for pathological cases.
Note that some algorithms might try to grab more memory if
the need arises and the system has enough RAM. The options
for automatic memory management control this:
-AS:amm, -AS:kpmf, -AS:mps
Further switches that might reduce RAM (at cost of run time
or accuracy):
-SK:mkim, -SK:mchr (both runtime); -SK:mhpr (accuracy)
*************************************************************</pre><p>
If your RAM is not large enough, you can still assemble projects by
using disk swap. Up to 20% of the needed memory can be provided by
swap without the speed penalty getting too large. Going above 20% is
not recommended though, above 30% the machine will be almost
permanently swapping at some point or another.
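</p><p>
The numbers in the transcript can be cross-checked with a bit of arithmetic.
The Python sketch below is only that, a sketch: the function names are made up,
and the mean Titanium read length of 400 bases is an assumption (it is not
shown in the transcript, but it reproduces the coverage of ~30.6 printed by
miramem). The second part applies the 20% / 30% swap rule of thumb from the
paragraph above to the peak estimate of 7.0 GiB + 688 MiB.
</p><pre class="programlisting">
```python
def average_coverage(num_reads, mean_read_length, genome_size):
    """Expected average coverage: total sequenced bases divided by genome size."""
    return num_reads * mean_read_length / float(genome_size)


def swap_fraction(needed_gib, physical_ram_gib):
    """Fraction of the needed memory that would have to come from swap."""
    missing = max(0.0, needed_gib - physical_ram_gib)
    return missing / needed_gib


def swap_verdict(needed_gib, physical_ram_gib):
    """Rule of thumb from the text: up to 20% swap is fine, above 30% means constant swapping."""
    fraction = swap_fraction(needed_gib, physical_ram_gib)
    if fraction > 0.30:
        return "almost permanent swapping"
    if fraction > 0.20:
        return "not recommended"
    return "ok"


if __name__ == "__main__":
    # 750k Titanium reads on a 9.8 Mb genome; 400 bases mean length is an assumption.
    print("coverage: %.1f" % average_coverage(750000, 400, 9800000))
    # Peak estimate from the transcript: 7.0 GiB + 688 MiB.
    peak_gib = 7.0 + 688.0 / 1024.0
    for ram in (8.0, 6.5, 5.5, 5.0):
        print("%.1f GiB RAM: %s" % (ram, swap_verdict(peak_gib, ram)))
```
</pre><p>
Keep in mind that the estimate itself can be off by 30% or more, so leave
yourself some headroom instead of planning right at the 20% mark.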
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_ref_speed"></a>3.14.2.
Some numbers on speed
</h3></div></div></div><p>
To be rewritten for MIRA4.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_known_problems_bugs"></a>3.15.
Known Problems / Bugs
</h2></div></div></div><p>
File Input / Output:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
MIRA can only read unedited EXP files.
</p></li><li class="listitem"><p>
There sometimes is a (rather important) memory leak occurring while
using the assembly integrated Sanger read editor. I have not been
able to trace the reason yet.
</p></li></ol></div><p>
</p><p>
Assembly process:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
The routines for determining <span class="emphasis"><em>Repeat Marker
Bases</em></span> (SRMr) are sometimes too sensitive, which can
lead to excessive base tagging and prevent correct assemblies in
subsequent assembly processes. The parameters you should look at for
this problem are
[-CO:mrc:nrz:mgqrt:mgqwpc]. Also look at [-CL:pvc] and
[-CO:emea] if you have a lot of sequencing vector relics at the
end of the sequences.
</p></li></ol></div><p>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_todos"></a>3.16.
TODOs
</h2></div></div></div><p>
These are some of the topics on my TODO list for the next revisions to
come:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Making Smith-Waterman parts of the process multi-threaded or using SIMD
(currently stopped due to other priorities like PacBio etc.)
</p></li></ol></div><p>
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_working_principles"></a>3.17.
Working principles
</h2></div></div></div><p>
Note: description is old and needs to be adapted to the current 4.x line
of MIRA.
</p><p>
To avoid the "garbage-in, garbage-out" problem, MIRA uses a 'high
quality alignments first' contig building strategy. This means that the
assembler will start with those regions of sequences that have been
marked as good quality (high confidence region - HCR) with low error
probabilities (the clipping must have been done by the base caller or
other preprocessing programs, e.g. pregap4) and then gradually extends
the alignments as errors in different reads are resolved through error
hypothesis verification and signal analysis.
</p><p>
This assembly approach relies on some of the automatic editing
functionality provided by the EdIt package which has been integrated in
parts within MIRA.
</p><p>
This is an approximate overview of the steps that are executed while
assembling:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
All the experiment / phd / fasta sequences that act as input are
loaded (or the CAF project). Qualities for the bases are loaded from
the FASTA or SCF if needed.
</p></li><li class="listitem"><p>
The ends of the reads are cleaned to ensure they have a minimum stretch
of bases without sequencing errors.
</p></li><li class="listitem"><p>
The high confidence region (HCR) of each read is compared with a
quick algorithm to the HCR of every other read to see if it could
match and have overlapping parts (this is the 'SKIM' filter).
</p></li><li class="listitem"><p>
All the reads which could match are being checked with an adapted
Smith-Waterman alignment algorithm (banded version). Obvious
mismatches are rejected, the accepted alignments form one or several
alignment graphs.
</p></li><li class="listitem"><p>
Optional pre-assembly read extension step: MIRA tries to extend HCR
of reads by analysing the read pairs from the previous
alignment. This is a bit shaky as reads in this step have not been
edited yet, but it can help. Go back to step 2.
</p></li><li class="listitem"><p>
A contig gets made by building a preliminary partial path through
the alignment graph (through in-depth analysis up to a given level)
and then adding the most probable overlap candidates to a given
contig. Contigs may reject reads if these introduce too many errors
in the existing consensus. Errors in regions known as dangerous
(for the time being only ALUS and REPT) get additional attention by
performing simple signal analysis when alignment discrepancies
occur.
</p></li><li class="listitem"><p>
Optional: the contig can be analysed and corrected by the automatic
editor ("EdIt" for Sanger reads, or the new MIRA editor for 454
reads).
</p></li><li class="listitem"><p>
Long repeats are searched for, bases in reads of different repeats
that have been assembled together but differ sufficiently (for EdIT
so that they didn't get edited and by phred quality value) get
tagged with special tags (SRMr and WRMr).
</p></li><li class="listitem"><p>
Go back to step 6 if there are reads present that have not been
assembled into contigs.
</p></li><li class="listitem"><p>
Optional: Detection of spoiler reads that prevent joining of
contigs. Remedy by shortening them.
</p></li><li class="listitem"><p>
Optional: Write out a checkpoint assembly file and go back to step 2.
</p></li><li class="listitem"><p>
The resulting project is written out to different output files and
directories.
</p></li></ol></div><p>
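Steps 3 and 4 above follow a common two-stage pattern: a cheap filter proposes
pairs of reads that could overlap, and only those pairs are handed to the
expensive alignment. The Python sketch below illustrates the first stage with a
simple shared k-mer count. It is explicitly <span class="emphasis"><em>not</em></span>
the SKIM algorithm used by MIRA, just a stand-in to show the idea; the k-mer
size, the threshold and all function names are assumptions made for this
example.
</p><pre class="programlisting">
```python
from collections import defaultdict
from itertools import combinations


def kmers(sequence, k):
    """Return the set of all substrings of length k in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def candidate_pairs(reads, k=8, min_shared=2):
    """Cheap filter: report read pairs sharing at least min_shared k-mers.

    reads maps read names to sequences. Only the forward strand is
    considered, which keeps the sketch short; a real assembler would
    also have to look at reverse complements.
    """
    index = defaultdict(set)  # k-mer -> names of the reads containing it
    for name, sequence in reads.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)

    shared = defaultdict(int)  # (name_a, name_b) -> number of shared k-mers
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1

    return sorted(pair for pair, count in shared.items() if count >= min_shared)


if __name__ == "__main__":
    demo = {
        "r1": "ACGTTGCATGCCGATAGGCT",
        "r2": "GCATGCCGATAGGCTTTACA",
        "r3": "TTTTTTTTTTTTTTTTTTTT",
    }
    # Only these pairs would be handed to the expensive alignment step.
    print(candidate_pairs(demo))
```
</pre><p>
The point of such a filter is that the banded Smith-Waterman check only has to
run on the few pairs that survive, instead of on every read against every
other read.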
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_ref_see_also"></a>3.18.
See Also
</h2></div></div></div><p>
The other MIRA manuals and walkthroughs as well as
<span class="command"><strong>EdIt</strong></span>, <span class="command"><strong>gap4</strong></span>,
<span class="command"><strong>pregap4</strong></span>, <span class="command"><strong>gap5</strong></span>,
<span class="command"><strong>clview</strong></span>, <span class="command"><strong>caf2gap</strong></span>,
<span class="command"><strong>gap2caf</strong></span>, <span class="command"><strong>ssaha2</strong></span>,
<span class="command"><strong>smalt</strong></span>, <span class="command"><strong>compress</strong></span> and
<span class="command"><strong>gzip</strong></span>, <span class="command"><strong>cap3</strong></span>,
<span class="command"><strong>ttuner</strong></span>, <span class="command"><strong>phred</strong></span>,
<span class="command"><strong>phrap</strong></span>, <span class="command"><strong>cross_match</strong></span>,
<span class="command"><strong>consed</strong></span>.
</p></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_dataprep"></a>Chapter 4. Preparing data</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_pd_introduction">4.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_sanger">4.2.
Sanger
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_454">4.3.
Roche / 454
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_illumina">4.4.
Illumina
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_pacbio">4.5.
Pacific Biosciences
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_iontor">4.6.
Ion Torrent
</a></span></dt><dt><span class="sect1"><a href="#sect_pd_sra">4.7.
Short Read Archive (SRA)
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Rome didn't fall in a day either.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_introduction"></a>4.1.
Introduction
</h2></div></div></div><p>
Most of this chapter and many sections are just stubs at the moment.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_sanger"></a>4.2.
Sanger
</h2></div></div></div><p>
Outside MIRA: transform .ab1 to .scf, perform sequencing vector clip
(and cloning vector clip if used), basic quality clips.
</p><p>
Recommended program: <span class="command"><strong>gap4</strong></span> (or
rather <span class="command"><strong>pregap4</strong></span>) from the Staden 4 package.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_454"></a>4.3.
Roche / 454
</h2></div></div></div><p>
Outside MIRA: convert the SFF files from the Roche instrument to FASTQ,
use <span class="command"><strong>sff_extract</strong></span> for that. In case you used
"non-standard" sequencing procedures: clip away MIDs, clip away
non-standard sequencing adaptors used in that project.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_illumina"></a>4.4.
Illumina
</h2></div></div></div><p>
Outside MIRA: for heaven's sake: do NOT try to clip or trim by quality
yourself. Do NOT try to remove standard sequencing adaptors
yourself. Just leave Illumina data alone! (really, I mean it).
</p><p>
MIRA is much, much better at that job than you will probably ever be
... and I dare to say that MIRA is better at that job than 99% of all
clipping/trimming software existing out there. Just make sure you use
the [-CL:pec] (proposed_end_clip) option of MIRA.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The <span class="emphasis"><em>only</em></span> exception to the above is if you (or your
sequencing provider) used decidedly non-standard sequencing
adaptors. Then it might be worthwhile to perform your own adaptor
clipping. But this will not be the case for 99% of all sequencing
projects out there.
</td></tr></table></div><p>
Joining paired-ends: if you want to do this, feel free to use any tool
which is out there (TODO: quick list). Just make sure they do not join
on very short overlaps. For me, the minimum overlap is at least 17
bases, but I more commonly use at least 30.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_pacbio"></a>4.5.
Pacific Biosciences
</h2></div></div></div><p>
Outside MIRA: MIRA needs error corrected reads, either
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
PacBio CCS reads (circular consensus sequence) which you get from the
PacBio SMRTAnalysis pipeline
</li><li class="listitem">
or reads which were self-corrected or corrected with other sequencing
technologies, which you will get either from the PacBio HGAP pipeline
or the pacbioToCA pipeline
</li></ul></div><p>
Assembly of uncorrected PacBio reads (CLR) is currently not supported
officially as of MIRA 4.0.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_iontor"></a>4.6.
Ion Torrent
</h2></div></div></div><p>
Outside MIRA: need to convert BAM to FASTQ. Need to clip away
non-standard sequencing adaptors if used in that project. Apart from
that: leave the data alone.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_pd_sra"></a>4.7.
Short Read Archive (SRA)
</h2></div></div></div><p>
Outside MIRA: you need to convert SRA format to FASTQ format. This is done
using <span class="command"><strong>fastq-dump</strong></span> from the SRA toolkit from the
NCBI. Make sure to have at least version 2.4.x of the toolkit. Last time
I looked (March 2015), the software was at
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software</a>, the
documentation for the whole toolkit was at
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc</a>,
and for <span class="command"><strong>fastq-dump</strong></span> it was
<a class="ulink" href="http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump" target="_top">http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump</a>
</p><p>
After extraction, proceed with preprocessing as described above,
depending on the sequencing technology used.
</p><p>
For extracting Illumina data, use something like this:
</p><pre class="screen"><code class="prompt">arcadia:/some/path$</code> <strong class="userinput"><code>fastq-dump -I --split-files <em class="replaceable"><code>somefile.sra</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
As <span class="command"><strong>fastq-dump</strong></span> unfortunately uses a pretty wasteful
variant of the FASTQ format, you might want to reduce the file size
for each FASTQ it produces by doing this:
</p><pre class="screen"><strong class="userinput"><code>sed -i '3~4 s/^+.*$/+/' <em class="replaceable"><code>file.fastq</code></em></code></strong></pre><p>
The above command performs an in-place replacement of the unnecessary
names and comments on the quality divider lines of the FASTQ. The exact
translation of the <span class="command"><strong>sed</strong></span> is: do an in-place
replacement (-i); starting on the third line, then every fourth line
(3~4, a GNU sed address form); substitute (s/); a line which starts (^); with a plus (+); and
then can have any character (.); repeated any number of times
including zero (*); until the end of the line ($); by just a single
plus character (/+/).
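</p><p>
As a made-up illustration (read name, bases and qualities below are invented,
and the exact layout of the name line depends on the run and on the
<span class="command"><strong>fastq-dump</strong></span> options used), a record might look roughly like this before the replacement:
</p><pre class="screen">@SRR000000.1 1 length=12
ACGTACGTACGT
+SRR000000.1 1 length=12
IIIIIIIIIIII</pre><p>
and like this afterwards:
</p><pre class="screen">@SRR000000.1 1 length=12
ACGTACGTACGT
+
IIIIIIIIIIII</pre><p>
Only the third line of each record changes; the read name, bases and
quality values are left untouched.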
</p><p>
This alone reduces the file size of a typical Illumina data set with
100mers extracted from the SRA by about 15 to 20%.
</p></td></tr></table></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_denovo"></a>Chapter 5. De-novo assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_dn_introduction">5.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_dn_general">5.2.
General steps
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_dn_ge_copying_and_naming_the_sequence_data">5.2.1.
Copying and naming the sequence data
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_ge_writing_a_simple_manifest_file">5.2.2.
Writing a simple manifest file
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_ge_starting_assembly">5.2.3. Starting the assembly</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_dn_manifest_files_use_cases">5.3.
Manifest files for different use cases
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_dn_mf_denovo_with_shotgun_data">5.3.1.
Manifest for shotgun data
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_assembling_with_multiple_technologies">5.3.2.
Assembling with multiple sequencing technologies (hybrid assemblies)
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_manifest_for_pairedend_data">5.3.3.
Manifest for data sets with paired reads
</a></span></dt><dt><span class="sect2"><a href="#sect_dn_mf_denovo_with_multiple_strains">5.3.4.
De-novo with multiple strains
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">The universe is full of surprises - most of them nasty.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_introduction"></a>5.1.
Introduction
</h2></div></div></div><p>
This guide assumes that you have basic working knowledge of Unix systems, know
the basic principles of sequencing (and sequence assembly) and what assemblers
do.
</p><p>
While there are step by step instructions on how to set up your data and
then perform an assembly, this guide expects you to read at some point in time:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Before the assembly, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: “<i>Preparing data</i>”</a> to know what to do (or not to
do) with the sequencing data before giving it to MIRA.
</p></li><li class="listitem"><p>
For users with PacBio reads, <a class="xref" href="#sect_sp_pacbio_ccs" title="8.2.1. PacBio CCS reads">Section 8.2.1: “
PacBio CCS reads
”</a> has important
information regarding special parameters needed.
</p></li><li class="listitem"><p>
After the assembly, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: “<i>Working with the results of MIRA</i>”</a> to know what to do with the
results of the assembly. More specifically, <a class="xref" href="#sect_res_looking_at_results" title="9.1. MIRA output directories and files">Section 9.1: “
MIRA output directories and files
”</a>, <a class="xref" href="#sect_res_first_look:the_assembly_info" title="9.2. First look: the assembly info">Section 9.2: “
First look: the assembly info
”</a>, <a class="xref" href="#sect_res_converting_results" title="9.3. Converting results">Section 9.3: “
Converting results
”</a>, <a class="xref" href="#sect_res_filtering_of_results" title="9.4. Filtering results">Section 9.4: “
Filtering results
”</a> and <a class="xref" href="#sect_res_places_of_importance_in_a_de_novo_assembly" title="9.5. Places of importance in a de-novo assembly">Section 9.5: “
Places of importance in a de-novo assembly
”</a>.
</p></li><li class="listitem"><p>
And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: “<i>MIRA 4 reference manual</i>”</a> to look up how manifest files should be
written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2. The manifest file: basics">Section 3.4.2: “
The manifest file: basics
”</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3. The manifest file: information on the data you have">Section 3.4.3: “
The manifest file: information on the data you have
”</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4. The manifest file: extended parameters">Section 3.4.4: “
The manifest file: extended parameters
”</a>), some command line options as well as general information on
what tags MIRA uses in assemblies, files it generates etc.pp
</p></li><li class="listitem"><p>
Last but not least, you may be interested in some observations about
the different sequencing technologies and the traps they may
contain, see <a class="xref" href="#chap_seqtechdesc" title="Chapter 12. Description of sequencing technologies">Chapter 12: “<i>Description of sequencing technologies</i>”</a> for that. For advice on what to pay
attention to <span class="emphasis"><em>before</em></span> going into a sequencing
project, have a look at <a class="xref" href="#chap_seqadvice" title="Chapter 13. Some advice when going into a sequencing project">Chapter 13: “<i>Some advice when going into a sequencing project</i>”</a>.
</p></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_general"></a>5.2.
General steps
</h2></div></div></div><p>
This part will show you step by step how to get your data together
for a simple de-novo assembly. I'll make up an example using an
imaginary bacterium: <span class="emphasis"><em>Bacillus chocorafoliensis</em></span> (or
short: <span class="emphasis"><em>Bchoc</em></span>). You collected the strain you want to
assemble somewhere in the wild, so you gave the strain the name
<span class="emphasis"><em>Bchoc_wt</em></span>.
</p><p>
Just for laughs, let's assume you sequenced that bug with lots of more
or less current sequencing technologies: Sanger, 454, Illumina, Ion
Torrent and Pacific Biosciences.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_copying_and_naming_the_sequence_data"></a>5.2.1.
Copying and naming the sequence data
</h3></div></div></div><p>
You need to create (or get from your sequencing provider) the
sequencing data in any supported file format. Amongst these, FASTQ and
FASTA + FASTA-quality will be the most common, although the latter is
well on the way out nowadays. The following walkthrough uses what most
people nowadays get: FASTQ.
</p><p>
Create a new project directory (e.g. <code class="filename">myProject</code>)
and a subdirectory of this which will hold the sequencing data
(e.g. <code class="filename">data</code>).
</p><pre class="screen"><code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>mkdir myProject</code></strong>
<code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>cd myProject</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir data</code></strong></pre><p>
Put the FASTQ data into that <code class="filename">data</code> directory so
that it now looks perhaps like this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocwt_lane6.solexa.fastq</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I completely made up the file name above. You can name your files any way you
want. And you can have them live anywhere on the hard disk; you do not
need to put them in this <code class="filename">data</code> directory. It's just
the way I do it ... and it's where the example manifest files a bit
further down in this chapter will look for the data files.
</td></tr></table></div><p>
We're almost finished with the setup. As I like to have things neatly separated, I always create a directory called <code class="filename">assemblies</code> which will hold my assemblies (or different trials) together. Let's quickly do that:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies/1sttrial</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cd assemblies/1sttrial</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_writing_a_simple_manifest_file"></a>5.2.2.
Writing a simple manifest file
</h3></div></div></div><p>
A manifest file is a configuration file for MIRA which tells it what
type of assembly it should do and which data it should load. In this
case we'll make a simple assembly of a genome with unpaired Illumina
data:
</p><pre class="screen"># Example for a manifest describing a genome de-novo assembly with
# unpaired Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# here comes the unpaired Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/bchocwt_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Please look up the parameters of the manifest file in the main
manual or the example manifest files in the following section.
</p><p>
The ones above basically say: make an accurate denovo assembly of
unpaired Illumina reads.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_ge_starting_assembly"></a>5.2.3. Starting the assembly</h3></div></div></div><p>
Starting the assembly is now just a matter of a simple command line:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject/assemblies/1sttrial$</code> <strong class="userinput"><code>mira <em class="replaceable"><code>manifest.conf >&log_assembly.txt</code></em></code></strong></pre><p>
For this example, if you followed the walkthrough on how to prepare the data,
the only thing you might want to adapt at first is the following option in the
manifest file:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
project= (for naming your assembly project)
</p></li></ul></div><p>
Of course, you are free to change any option via the extended parameters, but
this is the topic of another part of this manual.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_dn_manifest_files_use_cases"></a>5.3.
Manifest files for different use cases
</h2></div></div></div><p>
This section will introduce you to manifest files for different use
cases. It should cover the most important uses, but as always you are
free to mix and match the parameters and readgroup definitions to suit
your specific needs.
</p><p>
Taking into account that there may be <span class="emphasis"><em>a lot</em></span> of
combinations of sequencing technologies, sequencing libraries (shotgun,
paired-end, mate-pair, etc.) and input file types (FASTQ, FASTA,
GenBank, GFF3, etc.pp), the example manifest files just use Illumina and
454 as technologies, GFF3 as input file type for the reference sequence,
FASTQ as input type for sequencing data ... and they do not show the
multitude of more advanced features like, e.g., using ancillary clipping
information in XML files, ancillary masking information in SSAHA2 or
SMALT files etc.pp.
</p><p>
I'm sure you will be able to find your way by scanning through the
corresponding section on manifest files in the reference chapter :-)
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_denovo_with_shotgun_data"></a>5.3.1.
Manifest for shotgun data
</h3></div></div></div><p>
Well, we've seen that already in the section above, but here it is
again ... but this time with 454 data.
</p><pre class="screen"># Example for a manifest describing a denovo assembly with
# unpaired 454 data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# here's the 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpaired454ReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/some454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_assembling_with_multiple_technologies"></a>5.3.2.
Assembling with multiple sequencing technologies (hybrid assemblies)
</h3></div></div></div><p>
Hybrid de-novo assemblies follow the general manifest scheme: tell MIRA
what you want in the first part, then simply add a separate readgroup
for each data set with the information MIRA needs to find the data, and
off you go. Just for laughs, here's a manifest for 454 shotgun with
Illumina shotgun:
</p><pre class="screen"># Example for a manifest describing a denovo assembly with
# shotgun 454 and shotgun Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the shotgun 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgun454</code></em>
data = <em class="replaceable"><code>../../data/project454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em></code></strong>
# now the shotgun Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgunIllumina</code></em>
data = <em class="replaceable"><code>../../data/someillumina.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_manifest_for_pairedend_data"></a>5.3.3.
Manifest for data sets with paired reads
</h3></div></div></div><p>
When using paired-end data, you should know
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
the orientation of the reads toward each other. This is specific
to sequencing technologies and / or the sequencing library preparation.
</p></li><li class="listitem"><p>
the distance these reads should have from each other. This is specific to the
sequencing library preparation and the sequencing lab should tell
you this.
</p></li></ol></div><p>
In case you do not know one (or any) of the above, don't panic! MIRA
is able to estimate the needed values during the assembly if you tell
it to.
</p><p>
The following manifest shows you the laziest way to define a
paired data set by simply adding <span class="emphasis"><em>autopairing</em></span> as keyword to a
readgroup (using Illumina just as example):
</p><pre class="screen"># Example for a lazy manifest describing a denovo assembly with
# one library of paired reads
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina paired-end data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em></code></strong></pre><p>
If you know the orientation of the reads and/or the library size, you
can tell MIRA this in the following way (just showing the readgroup
definition here):
</p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong></pre><p>
In case you are not 100% sure about, e.g., the size of the DNA
template, you can also give a (generous) expected range and then tell
MIRA to automatically refine this range during the assembly based on
real, observed distances of read pairs. Do this with the <span class="emphasis"><em>autorefine</em></span>
modifier like this:
</p><pre class="screen"><strong class="userinput"><code>template_size = <em class="replaceable"><code>50 1000 autorefine</code></em></code></strong></pre><p>
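To see how the pieces fit together, here is a sketch of a complete
readgroup that merely combines the two snippets above: same placeholder
names and paths, a known read orientation, and a generous size range
which MIRA is asked to refine. Treat it as an illustration, not as a
recommended setting for your library:
</p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>50 1000 autorefine</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong></pre><p>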
The following manifest file is an example for assembling with several
different libraries from different technologies. Do not forget you
can use <span class="emphasis"><em>autopairing</em></span> or <span class="emphasis"><em>autorefine</em></span> :-)
</p><pre class="screen"># Example for a manifest describing a denovo assembly with
# several kinds of sequencing libraries
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina paired-end data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaForPairedEnd500bpLib</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong>
# now the Illumina mate-pair data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaForMatePair3kbLib</code></em>
data = <em class="replaceable"><code>../../data/project3kb-1.fastq ../../data/project3kb-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>2500 3500</code></em>
segment_placement = <em class="replaceable"><code><--- ---></code></em></code></strong>
# some Sanger data (6kb library)
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSanger6kbLib</code></em>
data = <em class="replaceable"><code>../../data/sangerdata.fastq</code></em>
technology = <em class="replaceable"><code>sanger</code></em>
template_size = <em class="replaceable"><code>5500 6500</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong>
# some 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataFor454Pairs</code></em>
data = <em class="replaceable"><code>../../data/454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em>
template_size = <em class="replaceable"><code>8000 12000</code></em>
segment_placement = <em class="replaceable"><code>2---> 1---></code></em></code></strong>
# some Ion Torrent data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForIonPairs</code></em>
data = <em class="replaceable"><code>../../data/iondata.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em>
template_size = <em class="replaceable"><code>1000 3000</code></em>
segment_placement = <em class="replaceable"><code>2---> 1---></code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_dn_mf_denovo_with_multiple_strains"></a>5.3.4.
De-novo with multiple strains
</h3></div></div></div><p>
MIRA will make use of ancillary information present in the manifest
file. One of these is the information to which strain (or organism or
cell line etc.pp) the generated data belongs.
</p><p>
You just need to state in the manifest file which data comes from which
strain. Let's assume that in the example from above, the "lane6" data
were from a first mutant named <span class="emphasis"><em>bchoc_se1</em></span> and the
"lane7" data were from a second mutant
named <span class="emphasis"><em>bchoc_se2</em></span>. Here's the manifest file you
would write then:
</p><pre class="screen"># Example for a manifest describing a de-novo assembly with
# unpaired Illumina data, but from multiple strains
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de novo in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,denovo,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE2</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane7.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
While assembling de-novo (or mapping) with multiple strains is
possible, the interpretation of results may become a bit daunting in
some cases. For many scenarios it might therefore be preferable to
use the data sets successively, each in its own assembly or mapping.
</td></tr></table></div><p>
This <span class="emphasis"><em>strain</em></span> information for each readgroup is
really the only change you need to perform to tell MIRA everything it
needs for handling strains.
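</p><p>
The strain keyword can presumably be combined with the other readgroup
keywords shown earlier in this chapter. As a sketch only (the readgroup
and file names are invented for this illustration, and I am assuming
here that <span class="emphasis"><em>autopairing</em></span> and <span class="emphasis"><em>strain</em></span> may appear in the same
readgroup), two paired Illumina libraries from the two mutants might be
described like this:
</p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1Paired</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/bchocse1_1.fastq ../../data/bchocse1_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE2Paired</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/bchocse2_1.fastq ../../data/bchocse2_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><p>
Each readgroup carries exactly one strain name; everything else is as in
the paired-read examples of the previous section.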
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_mapping"></a>Chapter 6. Mapping assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_map_introduction">6.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_map_general">6.2.
General steps
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_ge_copying_and_naming_the_sequence_data">6.2.1.
Copying and naming the sequence data
</a></span></dt><dt><span class="sect2"><a href="#sect_map_ma_copying_and_naming_the_reference_sequence">6.2.2.
Copying and naming the reference sequence
</a></span></dt><dt><span class="sect2"><a href="#sect_map_ge_writing_a_simple_manifest_file">6.2.3.
Writing a simple manifest file
</a></span></dt><dt><span class="sect2"><a href="#sect_map_ge_starting_assembly">6.2.4. Starting the assembly</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_manifest_files_use_cases">6.3.
Manifest files for different use cases
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_shotgun_data">6.3.1.
Mapping with shotgun data
</a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_manifest_for_pairedend_data">6.3.2.
Manifest for data sets with paired reads
</a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_multiple_technologies">6.3.3.
Mapping with multiple sequencing technologies (hybrid mapping)
</a></span></dt><dt><span class="sect2"><a href="#sect_map_mf_mapping_with_multiple_strains">6.3.4.
Mapping with multiple strains
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_walkthroughs">6.4.
Walkthroughs
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_map_walkthrough:_mapping_of_ecoli_from_lenski_lab_against_ecoli_b_rel606">6.4.1.
Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_map_useful_about_reference_sequences">6.5.
Useful things to know about reference sequences
</a></span></dt><dt><span class="sect1"><a href="#sect_map_known_bugs_problems">6.6.
Known bugs / problems
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">You have to know what you're looking for before you can find it.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_introduction"></a>6.1.
Introduction
</h2></div></div></div><p>
This guide assumes that you have basic working knowledge of Unix systems, know
the basic principles of sequencing (and sequence assembly) and what assemblers
do.
</p><p>
While there are step by step instructions on how to set up your data and
then perform an assembly, this guide expects you to read at some point in time:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Before the mapping, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: “<i>Preparing data</i>”</a> to know what to do (or not to
do) with the sequencing data before giving it to MIRA.
</p></li><li class="listitem"><p>
Generally, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: “<i>Working with the results of MIRA</i>”</a> to know what to do with the
results of the assembly. More specifically, <a class="xref" href="#sect_res_converting_results" title="9.3. Converting results">Section 9.3: “
Converting results
”</a>, <a class="xref" href="#sect_res_places_of_interest_in_a_mapping_assembly" title="9.6. Places of interest in a mapping assembly">Section 9.6: “
Places of interest in a mapping assembly
”</a> and <a class="xref" href="#sect_res_postprocessing_mapping_assemblies" title="9.7. Post-processing mapping assemblies">Section 9.7: “
Post-processing mapping assemblies
”</a>.
</p></li><li class="listitem"><p>
And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: “<i>MIRA 4 reference manual</i>”</a> to look up how manifest files should be
written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2. The manifest file: basics">Section 3.4.2: “
The manifest file: basics
”</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3. The manifest file: information on the data you have">Section 3.4.3: “
The manifest file: information on the data you have
”</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4. The manifest file: extended parameters">Section 3.4.4: “
The manifest file: extended parameters
”</a>), some command line options as well as general information on
what tags MIRA uses in assemblies, files it generates etc.pp
</p></li><li class="listitem"><p>
Last but not least, you may be interested in some observations about
the different sequencing technologies and the traps they may
contain, see <a class="xref" href="#chap_seqtechdesc" title="Chapter 12. Description of sequencing technologies">Chapter 12: “<i>Description of sequencing technologies</i>”</a> for that. For advice on what to pay
attention to <span class="emphasis"><em>before</em></span> going into a sequencing
project, have a look at <a class="xref" href="#chap_seqadvice" title="Chapter 13. Some advice when going into a sequencing project">Chapter 13: “<i>Some advice when going into a sequencing project</i>”</a>.
</p></li></ul></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_general"></a>6.2.
General steps
</h2></div></div></div><p>
This part shows you, step by step, how to get your data together for a
simple mapping assembly.
</p><p>
I'll make up an example using an imaginary bacterium: <span class="emphasis"><em>Bacillus chocorafoliensis</em></span> (or short: <span class="emphasis"><em>Bchoc</em></span>).
</p><p>
In this example, we assume you have two strains: a wild type strain
<span class="emphasis"><em>Bchoc_wt</em></span> and a mutant which you perhaps got from mutagenesis or other
means. Let's imagine that this mutant needs more time to eliminate a given
amount of chocolate, so we call the mutant <span class="emphasis"><em>Bchoc_se</em></span> ... SE for
<span class="bold"><strong>s</strong></span>low <span class="bold"><strong>e</strong></span>ater.
</p><p>
You want to know which mutations might be responsible for the observed
behaviour. Assume the genome of <span class="emphasis"><em>Bchoc_wt</em></span> is available to you as it was
published (or you previously sequenced it), so you resequenced <span class="emphasis"><em>Bchoc_se</em></span>
with Solexa to look for mutations.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_copying_and_naming_the_sequence_data"></a>6.2.1.
Copying and naming the sequence data
</h3></div></div></div><p>
You need to create (or get from your sequencing provider) the sequencing data
in either FASTQ or FASTA + FASTA quality format. The following walkthrough
uses what most people nowadays get: FASTQ.
</p><p>
Create a new project directory (e.g. <code class="filename">myProject</code>) and a subdirectory of this which will hold the sequencing data (e.g. <code class="filename">data</code>).
</p><pre class="screen"><code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>mkdir myProject</code></strong>
<code class="prompt">arcadia:/path/to$</code> <strong class="userinput"><code>cd myProject</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir data</code></strong></pre><p>
Put the FASTQ data into that <code class="filename">data</code> directory so that it now looks perhaps like this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_lane6.solexa.fastq
-rw-r--r-- 1 bach users 264823645 2008-03-28 21:51 bchocse_lane7.solexa.fastq</pre></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I completely made up the file names above. You can name them any way you
want, and they can live anywhere on the hard disk; you do not
need to put them in this <code class="filename">data</code> directory. It's just
the way I do it ... and it's where the example manifest files a bit further down
in this chapter will look for the data files.
</td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ma_copying_and_naming_the_reference_sequence"></a>6.2.2.
Copying and naming the reference sequence
</h3></div></div></div><p>
The reference sequence (the backbone) can be in a number of different
formats: GFF3, GenBank, MAF, CAF, FASTA. The first three have the advantage
of being able to carry additional information like, e.g.,
annotation. In this example, we will use a GFF3 file like the ones
one can download from the NCBI.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
TODO: Write why GFF3 is better and where to get them at the NCBI.
</td></tr></table></div><p>
So, let's assume that our wild type
strain is in the following file:
<code class="filename">NC_someNCBInumber.gff3</code>.
</p><p>
You do not need to copy the reference sequence to your directory, but
I normally copy the reference file into the directory with my
data as well: at the end of my work, I want to have a nice little
self-sufficient directory which I can archive away and still be sure
that in 10 years' time I have all the data I need together.
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cp /somewhere/NC_someNCBInumber.gff3 data</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l data</code></strong>
-rw-r--r-- 1 bach users 6543511 2008-04-08 23:53 NC_someNCBInumber.gff3
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_lane6.solexa.fastq
-rw-r--r-- 1 bach users 264823645 2008-03-28 21:51 bchocse_lane7.solexa.fastq</pre><p>
We're almost finished with the setup. As I like to have things neatly separated, I always create a directory called <code class="filename">assemblies</code> which will hold my assemblies (or different trials) together. Let's quickly do that:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mkdir assemblies/1sttrial</code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cd assemblies/1sttrial</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_writing_a_simple_manifest_file"></a>6.2.3.
Writing a simple manifest file
</h3></div></div></div><p>
A manifest file is a configuration file for MIRA which tells it what
type of assembly it should do and which data it should load. In this
case we have unpaired sequencing data which we want to map to a
reference sequence. The manifest file for that is pretty simple:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIlluminaReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/*fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Please look up the parameters of the manifest file in the main
manual or the example manifest files in the following section.
</p><p>
The ones above basically say: make an accurate mapping of Solexa
reads against a genome; in one pass; the name of the backbone strain
is 'bchoc_wt'; the data with the backbone sequence (and maybe
annotations) is in a specified GFF3 file; the Solexa reads are taken
from the FASTQ files in the data directory and belong to the strain
'bchoc_se'.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_ge_starting_assembly"></a>6.2.4. Starting the assembly</h3></div></div></div><p>
Starting the assembly is now just a matter of a simple command line:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject/assemblies/1sttrial$</code> <strong class="userinput"><code>mira <em class="replaceable"><code>manifest.conf >&log_assembly.txt</code></em></code></strong></pre><p>
For this example - if you followed the walk-through on how to prepare the data
- all you might want to adapt at first are the following things in the manifest file:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
project= (for naming your assembly project)
</p></li><li class="listitem"><p>
strain= (to give the names of your reference and mapping strains)
</p></li></ul></div><p>
Of course, you are free to change any option via the extended parameters, but
this is the topic of another part of this manual.
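</p><p>
If you run several trials, the two adaptations above can also be scripted. The following Python sketch is only an illustration under stated assumptions: the function name <code class="literal">simple_mapping_manifest</code>, its arguments and the readgroup name are made up for this example, and it writes nothing but the keywords already shown in the manifest of this section.
</p><pre class="screen">def simple_mapping_manifest(project, reference, reads, ref_strain, read_strain):
    """Return the text of a manifest for mapping unpaired Illumina reads."""
    lines = [
        f"project = {project}",
        "job = genome,mapping,accurate",
        "# reference sequence",
        "readgroup",
        "is_reference",
        f"data = {reference}",
        f"strain = {ref_strain}",
        "# sequencing reads (group name below is a placeholder)",
        "readgroup = UnpairedIlluminaReads",
        f"data = {reads}",
        "technology = solexa",
        f"strain = {read_strain}",
    ]
    return "\n".join(lines) + "\n"


# Values taken from the example in this section
print(simple_mapping_manifest(
    project="MyFirstAssembly",
    reference="../../data/NC_someNCBInumber.gff3",
    reads="../../data/*fastq",
    ref_strain="bchoc_wt",
    read_strain="bchoc_se",
), end="")</pre><p>
Redirect the printed text into a file such as <code class="filename">manifest.conf</code> in the trial directory and start MIRA as shown above.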
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_manifest_files_use_cases"></a>6.3.
Manifest files for different use cases
</h2></div></div></div><p>
This section will introduce you to manifest files for different use
cases. It should cover the most important uses, but as always you are
free to mix and match the parameters and readgroup definitions to suit
your specific needs.
</p><p>
Taking into account that there may be <span class="emphasis"><em>a lot</em></span> of
combinations of sequencing technologies, sequencing libraries (shotgun,
paired-end, mate-pair, etc.) and input file types (FASTQ, FASTA,
GenBank, GFF3, etc.pp), the example manifest files just use Illumina and
454 as technologies, GFF3 as input file type for the reference sequence,
FASTQ as input type for sequencing data ... and they do not show the
multitude of more advanced features like, e.g., using ancillary clipping
information in XML files, ancillary masking information in SSAHA2 or
SMALT files etc.pp.
</p><p>
I'm sure you will be able to find your way by scanning through the
corresponding section on manifest files in the reference chapter :-)
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_shotgun_data"></a>6.3.1.
Mapping with shotgun data
</h3></div></div></div><p>
Well, we've seen that already in the section above, but here it is
again ... this time with Ion Torrent data though.
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Ion data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Ion Torrent data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>SomeUnpairedIonReadsIGotFromTheLab</code></em>
data = <em class="replaceable"><code>../../data/someiondata.fastq</code></em>
technology = <em class="replaceable"><code>iontor</code></em>
strain = <em class="replaceable"><code>bchoc_se</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_manifest_for_pairedend_data"></a>6.3.2.
Manifest for data sets with paired reads
</h3></div></div></div><p>
</p><p>
When using paired-end data in mapping, you must decide whether you want to
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
use the MIRA feature to create long 'coverage equivalent reads'
(CERs), which saves a lot of memory (both in the assembler and
later on in an assembly editor). However, you then
<span class="emphasis"><em>lose information about read pairs!</em></span>
</p></li><li class="listitem"><p>
or whether you want to <span class="emphasis"><em>keep information about read
pairs</em></span> at the expense of larger memory requirements both
in MIRA and in assembly finishing tools or viewers afterwards.
</p></li><li class="listitem"><p>
or a mix of the two above
</p></li></ol></div><p>
The Illumina pipeline normally gives you two files for paired-end
data: a <code class="filename">project-1.fastq</code> and a
<code class="filename">project-2.fastq</code>. The first file contains the
first read of each read pair, the second file the second read. Depending
on the preprocessing pipeline of your sequencing provider, the names
of the reads are either exactly the same in both files or already have
a <code class="literal">/1</code> or <code class="literal">/2</code> appended. Also, your
sequencing provider may give you one big file where the reads from
both ends are present.
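</p><p>
If you want to convince yourself that two files really belong together before mapping, the read names can be compared pairwise. The sketch below is a hypothetical helper, not part of MIRA: it assumes that the read names have already been extracted from both files in file order, and that a pair is marked at most by a trailing <code class="literal">/1</code> or <code class="literal">/2</code>.
</p><pre class="screen">from itertools import zip_longest


def template_name(read_name):
    """Drop a trailing /1 or /2 so that both reads of a pair share one name."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name


def first_mismatch(names_1, names_2):
    """Return the first two names that do not belong together, or None."""
    # zip_longest pads the shorter list, so a missing partner is reported too
    for name_1, name_2 in zip_longest(names_1, names_2, fillvalue=""):
        if template_name(name_1) != template_name(name_2):
            return (name_1, name_2)
    return None


print(first_mismatch(["read7/1", "read8/1"], ["read7/2", "read8/2"]))</pre><p>
A result of <code class="literal">None</code> means that every read in the first list has its partner at the same position in the second list; anything else is the first offending pair of names.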
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
MIRA can read all FASTQ variants produced by various Illumina
pipelines, be they with or without the /1 and /2 already appended to
the names. You generally do not need to do any name mangling before
feeding the data to MIRA. However, MIRA will emit a warning if read names are longer than 40 characters.
</p></td></tr></table></div><p>
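The 40-character limit mentioned in the note can be checked up front. The following Python sketch makes two assumptions that you should verify for your data: every FASTQ record occupies exactly four lines, and the read name is the first word after the <code class="literal">@</code>. The function names are invented for this example.
</p><pre class="screen">import io


def fastq_read_names(handle):
    """Yield the read name of every record in a four-lines-per-record FASTQ stream."""
    for index, line in enumerate(handle):
        # Only the first line of each four-line record carries the name
        if index % 4 == 0 and line.startswith("@"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def names_longer_than(handle, limit=40):
    """Return all read names with more than 'limit' characters."""
    return [name for name in fastq_read_names(handle) if len(name) > limit]


# Tiny in-memory example: one short name, one name with 41 characters
demo = io.StringIO("@read_1/1\nACGT\n+\nIIII\n@" + "x" * 41 + "\nACGT\n+\nIIII\n")
print(names_longer_than(demo))</pre><p>
For real data, replace the in-memory <code class="literal">demo</code> stream with an opened FASTQ file.
</p><p>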
When using paired-end data, you should know
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
the orientation of the reads toward each other. This is specific
to sequencing technologies and / or the sequencing library preparation.
</p></li><li class="listitem"><p>
the distance at which these reads should be from each other. This is
specific to the sequencing library preparation and the sequencing lab
should tell you this.
</p></li></ol></div><p>
In case you do not know one (or any) of the above, don't panic! MIRA
is able to estimate the needed values during the assembly if you tell
it to.
</p><p>
The following manifest shows you the laziest way to define a
paired data set by simply adding <span class="emphasis"><em>autopairing</em></span> as a keyword to a
readgroup (using Illumina just as an example):
</p><pre class="screen"># Example for a lazy manifest describing a mapping assembly with
# one library of paired reads
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina paired-end data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
</code></strong></pre><p>
See? Wasn't hard and it did not hurt, did it? One just needs to tell
MIRA it should expect paired reads via
the <span class="emphasis"><em>autopairing</em></span> keyword and that is everything you
need.
</p><p>
If you know the orientation of the reads and/or the library size, you
can tell MIRA this in the following way (just showing the readgroup
definition here):
</p><pre class="screen"><strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataIlluminaPairedEnd500Lib</code></em>
data = <em class="replaceable"><code>../../data/project_1.fastq ../../data/project_2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong></pre><p>
In case you are not 100% sure about, e.g., the size of the DNA
template, you can also give a (generous) expected range and then tell
MIRA to automatically refine this range during the assembly based on
real, observed distances of read pairs. Do this with the <span class="emphasis"><em>autorefine</em></span>
modifier like this:
</p><pre class="screen"><strong class="userinput"><code>template_size = <em class="replaceable"><code>50 1000 autorefine</code></em></code></strong></pre><p>
The following manifest file is an example for mapping a 500 bp
paired-end and a 3 kb mate-pair library of a strain
called <span class="emphasis"><em>bchoc_se1</em></span> against a GFF3 reference
file containing a strain called <span class="emphasis"><em>bchoc_wt</em></span>:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# paired Illumina data, not merging reads and therefore keeping
# all pair information
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
# As special parameter, we want to switch off merging of Solexa reads
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em>
parameters = <em class="replaceable"><code>SOLEXA_SETTINGS -CO:msr=no</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForPairedEnd500bpLib</code></em>
<em class="replaceable"><code>autopairing</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForMatePair3kbLib</code></em>
data = <em class="replaceable"><code>../../data/project3kb-1.fastq ../../data/project3kb-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>2000 4000 autorefine</code></em>
segment_placement = <em class="replaceable"><code><--- ---></code></em></code></strong></pre><p>
Please look up the parameters used in the main manual. The ones
above basically say: make an accurate mapping of Solexa reads
against a genome. Additionally, do not merge short Solexa
reads into the contig.
</p><p>
For the paired-end library, be lazy and let MIRA find out everything
it needs. However, that information should be treated as
"information only" by MIRA, i.e., it is not used for deciding whether
a pair is well mapped.
</p><p>
For the mate-pair library, assume a DNA template size of
2000 to 4000 bp (but let MIRA automatically refine this using observed
distances) and a segment orientation of the read pairs that follows
the reverse / forward scheme. That information should be treated as
"information only" by MIRA, i.e., it is not used for deciding whether
a pair is well mapped.
</p><p>
Comparing this manifest with a manifest for unpaired data, three
parameters were added for the Solexa data:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<code class="literal">-CO:msr=no</code> tells MIRA not to merge reads that
are 100% identical to the backbone. This also keeps the
template information (distance and orientation) for the reads.
</p></li><li class="listitem"><p>
<code class="literal">template_size</code> tells MIRA at which distance the
two reads should normally be placed from each other.
</p></li><li class="listitem"><p>
<code class="literal">segment_placement</code> tells MIRA how the different
segments (reads) of a DNA template have to be ordered to form a
valid representation of the sequenced DNA.
</p></li></ol></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Note that in mapping assemblies, these
<code class="literal">template_size</code> and
<code class="literal">segment_placement</code> parameters are normally treated
as <span class="emphasis"><em>information only</em></span>, i.e., MIRA will map the
reads regardless of whether the distance and orientation criteria are
met or not. This enables post-mapping analysis programs to hunt for
genome rearrangements or larger insertions/deletions.
</p></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
If template size and segment placement checking were on, the
following would happen at, e.g. sites of re-arrangement: MIRA would
map the first read of a read-pair without problem. However, it would
very probably reject the second read because it would not map at the
specified distance or orientation from its partner. Therefore, in
mapping assemblies with paired-end data, checking of the template
size must be switched off to give post-processing programs a chance
to spot re-arrangements.
</p></td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_multiple_technologies"></a>6.3.3.
Mapping with multiple sequencing technologies (hybrid mapping)
</h3></div></div></div><p>
I'm sure you'll have picked up the general scheme of manifest files by
now. Hybrid mapping assemblies follow the same scheme: simply add,
as a separate readgroup, the information MIRA needs to find the
data, and off you go. Just for laughs, here's a manifest for 454
shotgun with Illumina paired-end data:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# shotgun 454 and paired-end Illumina data, not merging reads and therefore keeping
# all pair information
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
# As special parameter, we want to switch off merging of Solexa reads
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em>
parameters = <em class="replaceable"><code>SOLEXA_SETTINGS -CO:msr=no</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the shotgun 454 data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForShotgun454</code></em>
data = <em class="replaceable"><code>../../data/project454data.fastq</code></em>
technology = <em class="replaceable"><code>454</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
# now the paired-end Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForPairedEnd500bpLib</code></em>
data = <em class="replaceable"><code>../../data/project500bp-1.fastq ../../data/project500bp-2.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em>
template_size = <em class="replaceable"><code>250 750</code></em>
segment_placement = <em class="replaceable"><code>---> <---</code></em></code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_mf_mapping_with_multiple_strains"></a>6.3.4.
Mapping with multiple strains
</h3></div></div></div><p>
MIRA will make use of ancillary information present in the manifest
file. One piece of such information is the strain (or organism or
cell line etc.pp) to which the generated data belongs.
</p><p>
You just need to tell in the manifest file which data comes from which
strain. Let's assume that in the example from above, the "lane6" data
were from a first mutant named <span class="emphasis"><em>bchoc_se1</em></span> and the
"lane7" data were from a second mutant
named <span class="emphasis"><em>bchoc_se2</em></span>. Here's the manifest file you
would write then:
</p><pre class="screen"># Example for a manifest describing a mapping assembly with
# unpaired Illumina data
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
<strong class="userinput"><code>project = <em class="replaceable"><code>MyFirstAssembly</code></em>
job = <em class="replaceable"><code>genome,mapping,accurate</code></em></code></strong>
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# first, the reference sequence
<strong class="userinput"><code>readgroup
is_reference
data = <em class="replaceable"><code>../../data/NC_someNCBInumber.gff3</code></em>
technology = <em class="replaceable"><code>text</code></em>
strain = <em class="replaceable"><code>bchoc_wt</code></em></code></strong>
# now the Illumina data
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE1</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane6.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se1</code></em></code></strong>
<strong class="userinput"><code>readgroup = <em class="replaceable"><code>DataForSE2</code></em>
data = <em class="replaceable"><code>../../data/bchocse_lane7.solexa.fastq</code></em>
technology = <em class="replaceable"><code>solexa</code></em>
strain = <em class="replaceable"><code>bchoc_se2</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
While mapping (or even assembling de-novo) with multiple strains is
possible, the interpretation of results may become a bit daunting in
some cases. For many scenarios it might therefore be preferable to
use the data sets one after another in separate mappings or assemblies.
</td></tr></table></div><p>
This <span class="emphasis"><em>strain</em></span> information for each readgroup is really the only change you need to perform to tell MIRA everything it needs for handling strains.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_walkthroughs"></a>6.4.
Walkthroughs
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_map_walkthrough:_mapping_of_ecoli_from_lenski_lab_against_ecoli_b_rel606"></a>6.4.1.
Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606
</h3></div></div></div><p>
TODO: Sorry, needs to be re-written for the relatively new SRR format
distributed at the NCBI ... and changes in MIRA 3.9.x. Please come
back later.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_useful_about_reference_sequences"></a>6.5.
Useful things to know about reference sequences
</h2></div></div></div><p>
There are a few things to consider when using reference sequences:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
MIRA is not really made to handle large amounts of reference
sequence, as these currently need huge amounts of memory. Use other
programs for mapping against more than, say, 200 megabases.
</p></li><li class="listitem"><p>
Reference sequences can be as long as needed! They are not subject
to normal read length constraints of a maximum of 32k bases. That
is, if one wants to load one or several entire chromosomes of a
bacterium or lower eukaryote as backbone sequence(s), this is just
fine.
</p></li><li class="listitem"><p>
Reference sequences can be single sequences like provided in, e.g.,
FASTA, FASTQ, GFF or GenBank files. But reference sequences also can
be whole assemblies when they are provided as, e.g., MAF or CAF
format. This opens the possibility to perform semi-hybrid assemblies
by assembling first reads from one sequencing technology de-novo
(e.g. PacBio) and then map reads from another sequencing technology
(e.g. Solexa) to the whole PacBio alignment instead of mapping it to
the PacBio consensus.
</p><p>
A semi-hybrid assembly will therefore contain, like a hybrid
assembly, the reads of both sequencing technologies.
</p></li><li class="listitem"><p>
Reference sequences will not be reversed! They will always appear in
forward direction in the output of the assembly. Please note: if the
backbone sequence consists of a MAF or CAF file that contains contigs
which contain reversed reads, then the contigs themselves will be in
forward direction. But the reads they contain that are in reverse
complement direction will of course also stay in reverse complement
direction.
</p></li><li class="listitem"><p>
Reference sequences will not be assembled together! That is,
even if a reference sequence has a perfect overlap with another
reference sequence, they will still not be merged.
</p></li><li class="listitem"><p>
Reads are assembled to reference sequences in a first come, first
served scattering strategy.
</p><p>
Suppose you have two identical reference sequences and a read which
would match both, then the read would be mapped to the first
backbone. If you had two identical reads, the first read would go to
the first backbone, the second read to the second backbone. With
three identical reads, the first backbone would get two reads, the
second backbone one read. Etc.pp.
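</p><p>
The scattering behaviour described above can be illustrated with a small Python sketch. This is not MIRA code: the function name <code class="literal">scatter</code>, the exact-substring matching and the "fewest reads so far" rule are assumptions chosen only to reproduce the example of identical backbones and identical reads.
</p>

```python
def scatter(reads, backbones):
    """Assign each read to one of the backbones it matches.

    reads:     list of (read_name, sequence) tuples, in input order
    backbones: dict mapping backbone name to sequence, in input order
    """
    counts = {name: 0 for name in backbones}
    placement = {}
    for read_name, read_seq in reads:
        # Toy matching rule (assumption): the read occurs verbatim in the backbone.
        candidates = [name for name, seq in backbones.items() if read_seq in seq]
        if not candidates:
            continue  # the read matches no backbone in this toy model
        # Pick the candidate holding the fewest reads so far; min() returns
        # the first such candidate, so ties go to the earlier backbone.
        target = min(candidates, key=lambda name: counts[name])
        counts[target] += 1
        placement[read_name] = target
    return placement, counts


# Two identical (hypothetical) backbones and three identical reads.
backbones = {"backbone1": "ACGTACGTTGCA", "backbone2": "ACGTACGTTGCA"}
reads = [("read1", "GTACGT"), ("read2", "GTACGT"), ("read3", "GTACGT")]
placement, counts = scatter(reads, backbones)
print(placement)
print(counts)
```

<p>
With the three identical reads of this toy input, the first backbone ends up with two reads and the second backbone with one, as in the description above.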
</p></li><li class="listitem"><p>
Only in references loaded from MAF or CAF files: contigs made out of
single reads (singlets) lose their status as reference sequence and
will be returned to the normal read pool for the assembly
process. That is, these sequences will be assembled to other
reference sequences or with each other.
</p></li></ol></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_map_known_bugs_problems"></a>6.6.
Known bugs / problems
</h2></div></div></div><p>
These apply to version 4.0 of MIRA and might or might not have been
addressed in later versions.
</p><p>
Bugs:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Mapping of paired-end reads with one read being in a non-repetitive
area and the other in a repeat is not as effective as it should
be. The optimal strategy to use would be to map first the
non-repetitive read and then the read in the repeat. Unfortunately,
this is not yet implemented in MIRA.
</p></li></ol></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_est"></a>Chapter 7. EST / RNASeq assemblies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_est_introduction">7.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect1_est_preliminaries:on_the_difficulties_of_assembling_ests">7.2.
Preliminaries: on the difficulties of assembling ESTs /RNASeq
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_est_poly-a_tails_in_est_data">7.2.1.
Poly-A tails
</a></span></dt><dt><span class="sect2"><a href="#sect2_est_lowly_expressed_transcripts">7.2.2.
Lowly expressed transcripts
</a></span></dt><dt><span class="sect2"><a href="#sect2_est_library_normalisation">7.2.3.
Very highly expressed transcripts
</a></span></dt><dt><span class="sect2"><a href="#sect_est_chimeras">7.2.4.
Chimeras
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#est_sect1_est_preprocessing">7.3.
Preprocessing of ESTs
</a></span></dt><dt><span class="sect1"><a href="#sect1_est_est_difference_assembly_clustering">7.4.
The difference between <span class="emphasis"><em>assembly</em></span> and
<span class="emphasis"><em>clustering</em></span>
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_est_snp_splitting">7.4.1.
Splitting transcripts into contigs based on SNPs
</a></span></dt><dt><span class="sect2"><a href="#sect2_est_gap_splitting">7.4.2.
Splitting transcripts into contigs based on larger gaps
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect1_est_demopipeline">7.5.
A simple step by step pipeline for reliable RNASeq assembly of eukaryotes
</a></span></dt><dt><span class="sect1"><a href="#idm5079">7.6.
Solving common problems of EST assemblies
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Expect the worst. You'll never get disappointed.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_introduction"></a>7.1.
Introduction
</h2></div></div></div><p>
This document is not complete yet and some sections may be a bit
unclear. I'd be happy to receive suggestions for improvements.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note:
Some reading requirements
"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">
Some reading requirements
</th></tr><tr><td align="left" valign="top"><p>
This guide assumes that you have basic working knowledge of Unix systems, know
the basic principles of sequencing (and sequence assembly) and what assemblers
do. Basic knowledge on mRNA transcription should also be present.
</p><p>
Please read at some point in time
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Before the assembly, <a class="xref" href="#chap_dataprep" title="Chapter 4. Preparing data">Chapter 4: “<i>Preparing data</i>”</a> to know what to do (or not to
do) with the sequencing data before giving it to MIRA.
</p></li><li class="listitem"><p>
For setting up the assembly, <a class="xref" href="#chap_denovo" title="Chapter 5. De-novo assemblies">Chapter 5: “<i>De-novo assemblies</i>”</a> to know how to
start a denovo assembly (except you obviously will need to change
the --job setting from <span class="emphasis"><em>genome</em></span> to
<span class="emphasis"><em>est</em></span>).
</p></li><li class="listitem"><p>
After the assembly, <a class="xref" href="#chap_results" title="Chapter 9. Working with the results of MIRA">Chapter 9: “<i>Working with the results of MIRA</i>”</a> to know what to do with the
results of the assembly. More specifically, <a class="xref" href="#sect_res_looking_at_results" title="9.1. MIRA output directories and files">Section 9.1: “
MIRA output directories and files
”</a>, <a class="xref" href="#sect_res_first_look:the_assembly_info" title="9.2. First look: the assembly info">Section 9.2: “
First look: the assembly info
”</a>, <a class="xref" href="#sect_res_converting_results" title="9.3. Converting results">Section 9.3: “
Converting results
”</a>, <a class="xref" href="#sect_res_filtering_of_results" title="9.4. Filtering results">Section 9.4: “
Filtering results
”</a> and <a class="xref" href="#sect_res_places_of_importance_in_a_de_novo_assembly" title="9.5. Places of importance in a de-novo assembly">Section 9.5: “
Places of importance in a de-novo assembly
”</a>.
</p></li><li class="listitem"><p>
And also <a class="xref" href="#chap_reference" title="Chapter 3. MIRA 4 reference manual">Chapter 3: “<i>MIRA 4 reference manual</i>”</a> to look up how manifest files should be
written (<a class="xref" href="#sect_ref_manifest_basics" title="3.4.2. The manifest file: basics">Section 3.4.2: “
The manifest file: basics
”</a> and <a class="xref" href="#sect_ref_manifest_readgroups" title="3.4.3. The manifest file: information on the data you have">Section 3.4.3: “
The manifest file: information on the data you have
”</a> and <a class="xref" href="#sect_ref_manifest_parameters" title="3.4.4. The manifest file: extended parameters">Section 3.4.4: “
The manifest file: extended parameters
”</a>), some command line options as well as general information on
what tags MIRA uses in assemblies, files it generates etc.pp
</p></li></ul></div></td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_preliminaries:on_the_difficulties_of_assembling_ests"></a>7.2.
Preliminaries: on the difficulties of assembling ESTs /RNASeq
</h2></div></div></div><p>
Assembling ESTs can be, from an assembler's point of view, pure
horror. E.g., it may be that some genes have thousands of transcripts
while other genes have just one single transcript in the sequenced
data. Furthermore, the presence of 5' and 3' UTR, transcription
variants, splice variants, homologues, SNPs etc.pp complicates the
assembly in some rather interesting ways.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_poly-a_tails_in_est_data"></a>7.2.1.
Poly-A tails
</h3></div></div></div><p>
Poly-A tails are part of the mRNA and therefore also part of sequenced
data. They can occur as poly-A or poly-T, depending from which
direction and which part of the mRNA was sequenced. Having poly-A/T
tails in the data is something of a double-edged sword. More
specifically, if the 3' poly-A tail is kept unmasked in the data,
transcripts having this tail will very probably not align with similar
transcripts from different splice variants (which is basically
good). On the other hand, homopolymers (multiple consecutive bases of
the same type) like poly-As are features that are pretty difficult to
get correct with today's sequencing technologies, be it Sanger, Solexa
or, with even more problems, 454. So slight errors in the
poly-A tail could lead to wrongly assigned splice sites ... and
wrongly split contigs.
</p><p>
This is the reason why many people cut off the poly-A tails. Which in
turn may lead to transcripts from different splice variants being
assembled together.
</p><p>
Either way, it's not pretty.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_lowly_expressed_transcripts"></a>7.2.2.
Lowly expressed transcripts
</h3></div></div></div><p>
Single transcripts (or very lowly expressed transcripts) containing
SNPs, splice variants or similar differences to other, more highly
expressed transcripts are a problem: it's basically impossible for an
assembler to distinguish them from reads containing junky data
(e.g. reads with a high error rate or chimeras). The standard setting
of many EST assemblers and clusterers is therefore to remove these
reads from the assembly set. MIRA handles things a bit differently:
depending on the settings, single transcripts with sufficiently large
differences are either treated as debris or can be saved as
<span class="emphasis"><em>singlet</em></span>.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_library_normalisation"></a>7.2.3.
Very highly expressed transcripts
</h3></div></div></div><p>
Another interesting problem for de-novo assemblers are non-normalised
libraries. In each cell, the number of mRNA copies per gene may
differ by several orders of magnitude, from a single transcript to
several tens of thousands. Pre-sequencing normalisation is a wet-lab
procedure to approximately equalise those copy numbers. This can,
however, introduce other artifacts.
</p><p>
If an assembler is fed with non-normalised EST data, it may very well
be that an overwhelming number of the reads comes only from a few
genes (house-keeping genes). In Sanger sequencing projects this could
mean a couple of thousand reads per gene. In 454 sequencing projects,
this can mean several tens of thousands of reads per gene. With
Solexa data, this number can grow to something close to a million.
</p><p>
Several effects then hit a de-novo assembler, the three most annoying
being (in ascending order of annoyance): a) non-random sequencing
errors then look like valid SNPs, b) sequencing and library
construction artefacts start to look like valid sequences if the data
set was not cleaned "enough" and more importantly, c) an explosion in
time and memory requirements when attempting to deliver a "good"
assembly. While MIRA has methods to deal with this kind of data
(e.g. via digital normalisation), a sure sign of the latter is the appearance of messages
from MIRA about <span class="emphasis"><em>megahubs</em></span> in the data set.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The guide on how to tackle <span class="emphasis"><em>hard</em></span> projects with
MIRA gives an overview on how to hunt down sequences which can lead to
the assembler getting confused, be it sequencing artefacts or highly
expressed genes.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_est_chimeras"></a>7.2.4.
Chimeras
</h3></div></div></div><p>
Chimeras are sequences containing adjacent base stretches which are
not occurring in an organism as sequenced, neither as DNA nor as
(m)RNA. Chimeras can be created through recombination effects during
library construction or sequencing. Chimeras can, and often do, lead
to misassemblies of sequence stretches into one contig although they
do not belong together. Have a look at the following example where two
stretches (denoted by <code class="literal">x</code> and <code class="literal">o</code>)
are joined by a chimeric read <span class="emphasis"><em>r4</em></span> containing both
stretches:
</p><pre class="screen">
r1 xxxxxxxxxxxxxxxx
r2 xxxxxxxxxxxxxxxxx
r3 xxxxxxxxxxxxxxxxx
r4 xxxxxxxxxxxxxxxxxxx|oooooooooooooo
r5 ooooooooooo
r6 ooooooooooo
r7 ooooooooo</pre><p>
The site of the recombination event is denoted by <code class="literal">x|o</code>
in read <span class="emphasis"><em>r4</em></span>.
</p><p>
MIRA does have a chimera detection -- which works very well in genome
assemblies due to high enough coverage -- by searching for sequence
stretches which are not covered by overlaps. In the above example, the
chimera detection routine will almost certainly flag read
<span class="emphasis"><em>r4</em></span> as chimera and only use a part of it: either the
<code class="literal">x</code> or <code class="literal">o</code> part, depending on which
part is longer. There is always a chance that <span class="emphasis"><em>r4</em></span> is
a valid read though, but that's a risk to take.
</p><p>
Now, that strategy would also work totally fine in EST projects if one
would not have to account for lowly expressed genes. Imagine the
following situation:
</p><pre class="screen">
s1 xxxxxxxxxxxxxxxxx
s2 xxxxxxxxxxxxxxxxxxxxxxxxx
s3 xxxxxxxxxxxxxxx
</pre><p>
Look at read <span class="emphasis"><em>s2</em></span>; from an overlap coverage
perspective, <span class="emphasis"><em>s2</em></span> could also very well be a chimera,
leading to a break of an otherwise perfectly valid contig if
<span class="emphasis"><em>s2</em></span> were cut back accordingly. This is why chimera
detection is switched off by default in MIRA.
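</p><p>
The coverage argument behind both examples can be sketched in Python. The sketch is an illustration only and not the SKIM routine of MIRA: it assumes that the overlaps with other reads are already known as half-open intervals on the read, and all coordinates are made up to resemble the two figures above.
</p>

```python
def unbridged_junctions(read_length, overlaps):
    """Return junctions (between base j and base j+1) spanned by no overlap.

    overlaps: list of (start, end) half-open intervals on the read, one per
              overlap with another read.
    """
    bridged = [False] * (read_length - 1)
    for start, end in overlaps:
        # An overlap covering bases start..end-1 spans junctions start..end-2.
        for junction in range(start, end - 1):
            bridged[junction] = True
    return [j for j, ok in enumerate(bridged) if not ok]


# r4: 19 bases of "x" followed by 14 bases of "o"; r1-r3 overlap only the
# x part, r5-r7 only the o part (hypothetical coordinates).
r4 = unbridged_junctions(33, [(0, 16), (2, 19), (1, 18), (19, 30), (22, 33), (24, 33)])

# s2: 25 bases; s1 overlaps its first 10 bases, s3 its last 10 bases.
s2 = unbridged_junctions(25, [(0, 10), (15, 25)])

print(r4)
print(s2)
```

<p>
Both reads show junctions that no overlap spans: one at the recombination site of r4 and several in the middle of s2. A test based purely on overlap coverage therefore cannot tell the chimera from the valid, lowly expressed read.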
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
When starting an EST assembly via the <code class="literal">--job=est,...</code>
switch, chimera detection is switched off by default. It is absolutely
possible to switch on the SKIM chimera detection afterwards via
[-CL:ascdc]. However, this will have exactly the effects
described above: chimeras in higher coverage contigs will be detected,
but perfectly valid low coverage contigs will be torn apart.
</p><p>
It is up to you to decide what you want or need.
</p></td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="est_sect1_est_preprocessing"></a>7.3.
Preprocessing of ESTs
</h2></div></div></div><p>
With contributions from Katrina Dlugosch
</p><p>
EST sequences necessarily contain fragments of vectors or primers used
to create cDNA libraries from RNA, and may additionally contain primer
and adaptor sequences used during amplification-based library
normalisation and/or high-throughput sequencing. These contaminant
sequences need to be removed prior to assembly. MIRA can trim sequences
by taking contaminant location information from a SSAHA2 or SMALT search
output, or users can remove contaminants beforehand by trimming
sequences themselves or masking unwanted bases with lowercase or other
characters (e.g. 'x', as with <span class="command"><strong>cross_match</strong></span>). Many
folks use preprocessing trimming/masking pipelines because it can be
very important to try a variety of settings to verify that you've
removed all of your contaminants (and fragments thereof) before sending
them into an assembly program like MIRA. It can also be good to spend
some time seeing what contaminants are in your data, so that you get to
know what quality issues are present and how pervasive.
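</p><p>
As an illustration of masking unwanted bases with 'x', here is a minimal Python sketch for a 3' poly-A tail. The function name, the minimum tail length of 10 and the exact-match rule are assumptions made for this example; real trimming tools also tolerate sequencing errors inside the tail, which this sketch does not.
</p>

```python
import re


def mask_poly_a_tail(seq, min_len=10, mask_char="x"):
    """Replace a trailing run of at least min_len A bases with mask_char."""
    pattern = re.compile(r"[Aa]{%d,}$" % min_len)
    return pattern.sub(lambda match: mask_char * len(match.group(0)), seq)


# Hypothetical reads: one with a 12 base poly-A tail, one ending in only 3 A.
with_tail = "GATTACAGGC" + "A" * 12
without_tail = "GATTACAGGCAAA"
print(mask_poly_a_tail(with_tail))
print(mask_poly_a_tail(without_tail))
```

<p>
The read with the long tail keeps its length but has the tail replaced by 'x' characters, while the read ending in a short run of A bases is left untouched.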
</p><p>
Two features of next generation sequencing can introduce errors into
contaminant sequences that make them particularly difficult to remove,
arguing for preprocessing: First, most next-generation sequence
platforms seem to be sensitive to excess primers present during library
preparation, and can produce a small percentage of sequences composed
entirely of concatenated primer fragments. These are among the most
difficult contaminants to remove, and the program TagDust (<a class="ulink" href="http://genome.gsc.riken.jp/osc/english/dataresource/" target="_top">http://genome.gsc.riken.jp/osc/english/dataresource/</a>) was
recently developed specifically to address this problem. Second, 454 EST
data sets can show high variability within primer sequences designed to
anchor to polyA tails during cDNA synthesis, because 454 has trouble
calling the length of the necessary A and T nucleotide repeats with
accuracy.
</p><p>
A variety of programs exist for preprocessing. Popular ones include
cross_match (<a class="ulink" href="http://www.phrap.org/phredphrapconsed.html" target="_top">http://www.phrap.org/phredphrapconsed.html</a>)
for primer masking, and SeqClean (<a class="ulink" href="http://compbio.dfci.harvard.edu/tgi/software/" target="_top">http://compbio.dfci.harvard.edu/tgi/software/</a>), Lucy (<a class="ulink" href="http://lucy.sourceforge.net/" target="_top">http://lucy.sourceforge.net/</a>), and SeqTrim (<a class="ulink" href="http://www.scbi.uma.es/cgi-bin/seqtrim/seqtrim_login.cgi" target="_top">http://www.scbi.uma.es/cgi-bin/seqtrim/seqtrim_login.cgi</a>) for
both primer and polyA/T trimming. The pipeline SnoWhite (<a class="ulink" href="http://evopipes.net" target="_top">http://evopipes.net</a>) combines Seqclean and TagDust with custom
scripts for aggressive sequence and polyA/T trimming (and is tolerant of
data already masked using cross_match). In all cases, the user must
provide contaminant sequence information and adjust settings for how
sensitive the programs should be to possible matches. To find the best
settings, it is helpful to look directly at some of the sequences that
are being trimmed and inspect them for remaining primer and/or polyA/T
fragments after cleaning.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
When using <span class="command"><strong>mira</strong></span> or
<span class="command"><strong>miraSearchESTSNPs</strong></span> with the simplest parameter
calls (using the "--job=..." quick switches), the default settings used
include pretty heavy sequence pre-processing to cope with noisy
data. Especially if you have your own pre-processing pipeline, you
<span class="emphasis"><em>must</em></span> then switch off different clip algorithms that
you might have applied previously yourself. Especially poly-A clips
should never be run twice (by your pipeline and by
<span class="command"><strong>mira</strong></span>) as they invariably lead to too many bases being
cut away in some sequences.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Here too: In some cases MIRA can get confused if something with the
pre-processing went wrong because, e.g., unexpected sequencing artefacts
like unknown sequencing vectors or adaptors remain in data. The guide on
how to tackle <span class="emphasis"><em>hard</em></span> projects with MIRA gives an
overview on how to hunt down sequences which can lead to the assembler
getting confused, be it sequencing artefacts or highly expressed genes.
</td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_est_difference_assembly_clustering"></a>7.4.
The difference between <span class="emphasis"><em>assembly</em></span> and
<span class="emphasis"><em>clustering</em></span>
</h2></div></div></div><p>
MIRA in its base settings is an <span class="emphasis"><em>assembler</em></span> and not a
<span class="emphasis"><em>clusterer</em></span>, although it can be configured as such. As
assembler, it will split up read groups into different contigs if it
thinks there is enough evidence that they come from different RNA
transcripts.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_snp_splitting"></a>7.4.1.
Splitting transcripts into contigs based on SNPs
</h3></div></div></div><p>
Imagine this simple case: a gene has two slightly different alleles and you've
sequenced this:
</p><pre class="screen">
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........G...........
B2-2 ...........G...........
B2-3 ...........G...........
B2-4 ...........G...........
</pre><p>
Depending on base qualities and settings used during the assembly
like, e.g., [-CO:mr:mrpg:mnq:mgqrt:emea:amgb] MIRA will
recognise that there's enough evidence for a T and also enough
evidence for a G at that position and create two contigs, one
containing the "T" allele, one the "G". The consensus will be >99%
identical, but not 100%.
</p><p>
Things become complicated if one has to account for errors in
sequencing. Imagine you sequenced the following case:
</p><pre class="screen">
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........<span class="bold"><strong>G</strong></span>...........
</pre><p>
This case looks very much like the one above, except that
there's only one read with a "G" instead of 4 reads. MIRA will, when
using standard settings, treat this as erroneous base and leave all
these reads in a contig. It will likewise also not mark it as SNP in
the results. However, this could also very well be a lowly expressed
transcript with a single base mutation. It's virtually impossible to
tell which of the possibilities is right.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
You can of course force MIRA to mark situations like the one depicted
above by, e.g., changing the parameters
for [-CO:mrpg:mnq:mgqrt]. But this may have the side-effect
that sequencing errors get an increased chance of getting flagged as
SNP.
</td></tr></table></div><p>
Further complications arise when SNPs and potential sequencing errors
meet at the same place. Consider the following case:
</p><pre class="screen">
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........G...........
B2-2 ...........G...........
B2-3 ...........G...........
B2-4 ...........G...........
E1-1 ...........<span class="bold"><strong>A</strong></span>...........
</pre><p>
This example is exactly like the first one, except an additional read
<code class="literal">E1-1</code> has made its appearance and has an "A"
instead of a "G" or "T". Again it is impossible to tell whether this
is a sequencing error or a real SNP. MIRA handles these cases in the
following way: it will recognise two valid read groups (one having a
"T", the other a "G") and, in assembly mode, split these two groups
into different contigs. It will also play safe and define that the
single read <code class="literal">E1-1</code> will not be attributed to either
one of the contigs but, if it cannot be assembled to other reads, form
its own contig ... if need be even as a single read (a
<span class="emphasis"><em>singlet</em></span>).
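</p><p>
The decisions in the three examples above can be sketched in Python for a single alignment column. This is a strongly simplified illustration and not MIRA's algorithm: base qualities are ignored, and the parameter <code class="literal">min_reads_per_group</code> with its value of 2 is an assumption that only loosely mirrors the idea behind [-CO:mrpg].
</p>

```python
from collections import Counter


def classify_column(bases, min_reads_per_group=2):
    """Group reads by the base they show at one alignment column.

    bases: dict mapping read name to the base observed at the column.
    Returns (groups, leftovers): groups maps each well-supported base to
    its reads, leftovers lists reads whose base lacks support.
    """
    counts = Counter(bases.values())
    supported = sorted(b for b, n in counts.items() if n >= min_reads_per_group)
    groups = {b: sorted(r for r, x in bases.items() if x == b) for b in supported}
    leftovers = sorted(r for r, x in bases.items() if x not in supported)
    return groups, leftovers


def plan(bases):
    """Decide on contigs for the reads of one column (toy model)."""
    groups, leftovers = classify_column(bases)
    if len(groups) >= 2:
        # Two or more supported bases: one contig per base, and reads with
        # an unsupported base are set aside instead of being forced in.
        return {"contigs": list(groups.values()), "set_aside": leftovers}
    # At most one supported base: the odd base is treated as an error and
    # all reads stay together.
    return {"contigs": [sorted(bases)], "set_aside": []}


two_alleles = {"A1-%d" % i: "T" for i in range(1, 6)}
two_alleles.update({"B2-%d" % i: "G" for i in range(1, 5)})
lone_g = {"A1-%d" % i: "T" for i in range(1, 6)}
lone_g["B2-1"] = "G"
with_e = dict(two_alleles)
with_e["E1-1"] = "A"

for name, column in [("two alleles", two_alleles), ("lone G", lone_g), ("extra A", with_e)]:
    print(name, plan(column))
```

<p>
In this toy model the lone "G" of the second example stays in the single contig, while the extra "A" of the third example is set aside and may end up as a
<span class="emphasis"><em>singlet</em></span>.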
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Depending on some settings, singlets may either appear in the regular
results or end up in the debris file.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_est_gap_splitting"></a>7.4.2.
Splitting transcripts into contigs based on larger gaps
</h3></div></div></div><p>
Gaps in alignments of transcripts are handled very cautiously by
MIRA. The standard settings will lead to the creation of different
contigs if three or more consecutive gaps are introduced in an
alignment. Consider the following example:
</p><pre class="screen">
A1-1 ..........CGA..........
A1-2 ..........*GA..........
A1-3 ..........**A..........
B2-1 ..........<span class="bold"><strong>***</strong></span>..........
B2-2 ..........<span class="bold"><strong>***</strong></span>..........
</pre><p>
Under normal circumstances, MIRA will use the reads
<code class="literal">A1-1</code>, <code class="literal">A1-2</code> and
<code class="literal">A1-3</code> to form one contig and put
<code class="literal">B2-1</code> and <code class="literal">B2-2</code> into a separate
contig. MIRA would do this also if there were only one of the B2
reads.
</p><p>
The reason behind this is that the probability for having gaps of
three or more bases only due to sequencing errors is pretty
low. MIRA will therefore treat reads with such attributes as coming
from different transcripts and not assemble them together, though
this can be changed using the [-AL:egp:egpl] parameters of
MIRA if wanted.
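</p><p>
The rule can be sketched in Python for single aligned reads. The threshold of three consecutive gaps is taken from the text above; the function names are made up for this illustration, and the real decision in MIRA is governed by the [-AL:egp:egpl] parameters rather than by this code.
</p>

```python
from itertools import groupby


def longest_gap_run(aligned_read, gap_char="*"):
    """Length of the longest run of consecutive gap characters."""
    runs = [len(list(run)) for char, run in groupby(aligned_read) if char == gap_char]
    return max(runs, default=0)


def goes_to_separate_contig(aligned_read, min_gap_run=3):
    """True if the read shows a gap run long enough to keep it apart."""
    return longest_gap_run(aligned_read) >= min_gap_run


# The alignment from the example above.
alignment = {
    "A1-1": "..........CGA..........",
    "A1-2": "..........*GA..........",
    "A1-3": "..........**A..........",
    "B2-1": "..........***..........",
    "B2-2": "..........***..........",
}
for name, row in alignment.items():
    print(name, longest_gap_run(row), goes_to_separate_contig(row))
```

<p>
Only the two B2 reads reach a run of three gaps, so only they are kept apart from the A1 reads in this sketch.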
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning:
Problems with homopolymers, especially in 454, Ion Torrent and high
coverage Illumina
"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">
Problems with homopolymers, especially in 454, Ion Torrent and high
coverage Illumina
</th></tr><tr><td align="left" valign="top"><p>
As 454 and Ion Torrent sequencing has a general problem with
homopolymers, this rule of MIRA will sometimes lead to the formation of
more contigs than expected due to sequencing errors at "long"
homopolymer sites ... where long starts at ~6-7 bases. Though MIRA
does know about the problem in 454 homopolymers and has some
routines which try to mitigate the problem, this is not always
successful.
</p><p>
The same applies for Illumina data with long homopolymers (~ 8-9 bp)
and high coverage (≥ 100x).
</p></td></tr></table></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_est_demopipeline"></a>7.5.
A simple step by step pipeline for reliable RNASeq assembly of eukaryotes
</h2></div></div></div><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
Remove rRNA sequences. For that I use <span class="command"><strong>mirabait</strong></span> like this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>mirabait -I -j rrna <em class="replaceable"><code>-p norRNAfile_1.fastq norRNAfile_2.fastq ...</code></em></code></strong></pre></li><li class="listitem"><p>
Clean the data. For this I use mira, asking it to perform only a
preprocessing of the data from step 1 via a line like this
</p><pre class="screen">parameters = -AS:nop=0</pre><p>
in the manifest file. After preprocessing, the results will be
present as MAF file in the file
<code class="filename">*_assembly/*_d_chkpt/readpool.maf</code>.
</p></li><li class="listitem"><p>
As the MAF file contains paired reads together, they need to be
separated again. Additionally, I perform a hard cut of the clipped
sequence. This is a job for <span class="command"><strong>miraconvert</strong></span>:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -C -F -F readpool.maf</code></strong></pre></li><li class="listitem"><p>
I then use FLASH to merge paired reads together, using high overlap and zero allowed errors.
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>...</code></strong></pre><p>
FLASH will create three files for this: one file with joined pairs,
one file with unjoined pairs and one file with orphan reads (i.e.,
reads which have no mate). I generally continue with just the joined
and unjoined files.
</p></li><li class="listitem"><p>
Reduce the dataset to a reasonable size. Using 3 or 4 gigabases to
reconstruct a eukaryotic transcriptome should yield pretty good
transcripts without too much noise and lose none but the rarest
transcripts.
</p><p>
Depending on the Illumina read length (100, 125, 150, 250 or 300) I
generally go for a 1:1 or 2:1 ratio of joined versus unjoined
reads. E.g., if I need to extract 2 gigabases of joined FLASH
results and 1 gigabase of unjoined FLASH results I do this:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -Y 2000000 <em class="replaceable"><code>flashjoined.fastq reduced2gb_flashjoined.fastq</code></em></code></strong>
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -Y 1000000 <em class="replaceable"><code>flashunjoined.fastq reduced1gb_flashunjoined.fastq</code></em></code></strong></pre><p>
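The arithmetic behind such a split can be sketched in Python. The helpers below are hypothetical and not part of MIRA; they only divide a base budget according to a joined:unjoined ratio and estimate how many reads that corresponds to for assumed average read lengths.
</p>

```python
def split_budget(total_bases, joined_parts, unjoined_parts):
    """Split a base budget according to a joined:unjoined ratio."""
    parts = joined_parts + unjoined_parts
    joined = total_bases * joined_parts // parts
    return joined, total_bases - joined


def reads_needed(bases, average_read_length):
    """Number of reads needed to reach a base yield (rounded up)."""
    return -(-bases // average_read_length)


# A 3 gigabase budget split 2:1 between joined and unjoined reads.
joined, unjoined = split_budget(3_000_000_000, 2, 1)
print(joined, unjoined)

# Assumed average lengths: 180 bases for joined reads, 125 for unjoined ones.
print(reads_needed(joined, 180), reads_needed(unjoined, 125))
```

<p>
With a budget of 3 gigabases and a 2:1 ratio, this gives the 2 gigabases of joined and 1 gigabase of unjoined data used in the commands above.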
</p></li><li class="listitem"><p>
Assemble the cleaned, joined and reduced data set. A simple manifest
file like this will suffice:
</p><pre class="screen">project = myRNASEQ
job=est,denovo,accurate
readgroup
technology=solexa
autopairing
data=reduced2gb_flashjoined.fastq reduced1gb_flashunjoined.fastq
</pre></li></ol></div><p>
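The base budget for the reduction step in the list above can be worked out with a few lines of Python. This is only a sketch: the function name <code class="literal">split_target_bases</code> is made up for this example, and it only computes numbers. Check <code class="literal">miraconvert -h</code> for the unit expected by the -Y option before passing the values on.
</p>
```python
def split_target_bases(total_bases, joined_parts, unjoined_parts):
    """Split a total base budget between joined and unjoined reads.

    joined_parts:unjoined_parts is the desired ratio, e.g. 2:1 or 1:1.
    Returns (bases_for_joined, bases_for_unjoined).
    """
    parts = joined_parts + unjoined_parts
    joined = total_bases * joined_parts // parts
    # Give the remainder to the unjoined reads so both numbers add up exactly.
    return joined, total_bases - joined


if __name__ == "__main__":
    # 3 gigabases in total, split 2:1 between joined and unjoined reads.
    joined, unjoined = split_target_bases(3_000_000_000, 2, 1)
    print("joined:", joined, "unjoined:", unjoined)
```
<p>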
The result can be annotated and quality controlled. However, this will
still contain duplicate genes (due to, e.g., ploidy variants) or gene
fragments (due to ploidy variants, splice variants, sequencing
errors). To reduce this number I generally do the following:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem">
Extract the CDS of the annotated sequences. Make sure that your pipeline
also annotates hypothetical proteins with a length ≥ 300 bp.
</li><li class="listitem"><p>
Cluster the CDS sequences with MIRA, using a high similarity threshold:
</p><pre class="screen">project = myRNASEQclustering
job=est,clustering,accurate
parameters = --noclipping
parameters = TEXT_SETTINGS -AS:mrs=94
readgroup
technology=text
autopairing
data=fna::CDSfromAnnotation.fasta
</pre></li></ol></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="idm5079"></a>7.6.
Solving common problems of EST assemblies
</h2></div></div></div><p>
... continue here ...
</p><p>
Megahubs: track down the reason (high expression, sequencing vector or
adaptor; see the chapter on "hard" sequencing projects) and eliminate it.
</p></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_specialparams"></a>Chapter 8. Parameters for special situations</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_sp_introduction">8.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_sp_pacbio">8.2.
PacBio
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_sp_pacbio_ccs">8.2.1.
PacBio CCS reads
</a></span></dt><dt><span class="sect2"><a href="#sect_sp_pacbio_ec">8.2.2.
PacBio error corrected reads
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">... .
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_sp_introduction"></a>8.1.
Introduction
</h2></div></div></div><p>
Most of this chapter and many sections are just stubs at the moment.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_sp_pacbio"></a>8.2.
PacBio
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_sp_pacbio_ccs"></a>8.2.1.
PacBio CCS reads
</h3></div></div></div><p>
Declare the sequencing technology to be high-quality PacBio (<span class="bold"><strong>PCBIOHQ</strong></span>). The last time I worked with CCS, the
ends of the reads were not really clean, so using the proposed end
clipping (which needs to be manually switched on for PCBIOHQ reads)
may be advisable.
</p><pre class="screen"><strong class="userinput"><code>...
parameters = PCBIOHQ_SETTINGS -CL:pec=yes
...
readgroup
technology=pcbiohq
data=...
...</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_sp_pacbio_ec"></a>8.2.2.
PacBio error corrected reads
</h3></div></div></div><p>
Declare the sequencing technology to be high-quality PacBio (<span class="bold"><strong>PCBIOHQ</strong></span>). For self-corrected data or data
corrected with other sequencing technologies, it is recommended to
change the [-CO:mrpg] setting to a value which is 1/4th to
1/5th of the average coverage of the corrected PacBio reads across the
genome. E.g.:
</p><pre class="screen"><strong class="userinput"><code>...
parameters = PCBIOHQ_SETTINGS -CO:mrpg=5
...
readgroup
technology=pcbiohq
data=...
...</code></strong></pre><p>
for a project which has ~24x coverage. This necessity may change in
later versions of MIRA though.
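</p><p>
The rule of thumb above can be written as a small Python sketch. The function name <code class="literal">suggest_mrpg_range</code> is hypothetical, and rounding to whole numbers is an assumption made for this example; the result is meant as a starting point, not as a value MIRA itself computes.
</p>
```python
def suggest_mrpg_range(avg_coverage):
    """Return (low, high) candidate values for -CO:mrpg.

    low is about 1/5th and high about 1/4th of the average coverage of the
    corrected PacBio reads. Rounding to integers is an assumption.
    """
    low = max(1, round(avg_coverage / 5))
    high = max(low, round(avg_coverage / 4))
    return low, high


if __name__ == "__main__":
    low, high = suggest_mrpg_range(24)
    print("parameters = PCBIOHQ_SETTINGS -CO:mrpg=%d   (up to %d)" % (low, high))
```
<p>
For a project with ~24x coverage this yields candidate values of 5 to 6, in line with the example above.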
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_results"></a>Chapter 9. Working with the results of MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_res_looking_at_results">9.1.
MIRA output directories and files
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_resultsdir">9.1.1.
The <code class="filename">*_d_results</code> directory
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_resultsdir_denovo">9.1.1.1.
Additional 'large contigs' result files for de-novo assemblies of genomes
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_res_infodir">9.1.2.
The <code class="filename">*_d_info</code> directory
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_res_first_look:the_assembly_info">9.2.
First look: the assembly info
</a></span></dt><dt><span class="sect1"><a href="#sect_res_converting_results">9.3.
Converting results
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_converting_miraconvert">9.3.1.
Converting to and from other formats: <span class="command"><strong>miraconvert</strong></span>
</a></span></dt><dt><span class="sect2"><a href="#sect_res_converting_reach_other_programs">9.3.2.
Steps for converting data from / to other tools
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_converting_to_from_staden">9.3.2.1.
Example: converting to and from the Staden package (gap4 / gap5)
</a></span></dt><dt><span class="sect3"><a href="#sect_res_converting_to_from_sam">9.3.2.2.
Example: converting to and from SAM (for samtools, tablet etc.)
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_res_filtering_of_results">9.4.
Filtering results
</a></span></dt><dt><span class="sect1"><a href="#sect_res_places_of_importance_in_a_de_novo_assembly">9.5.
Places of importance in a de-novo assembly
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_tags_set_by_mira">9.5.1.
Tags set by MIRA
</a></span></dt><dt><span class="sect2"><a href="#sect_res_other_places_of_importance">9.5.2.
Other places of importance
</a></span></dt><dt><span class="sect2"><a href="#sect_res_joining_contigs">9.5.3.
Joining contigs
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_res_joining_truerepeats">9.5.3.1.
Joining contigs at true repetitive sites
</a></span></dt><dt><span class="sect3"><a href="#sect_res_joining_FALSErepeats">9.5.3.2.
Joining contigs at "wrongly discovered" repetitive sites
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_res_places_of_interest_in_a_mapping_assembly">9.6.
Places of interest in a mapping assembly
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_poi_where_are_snps?">9.6.1.
Where are SNPs?
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_where_are_insertions_deletions_or_genome_rearrangements?">9.6.2.
Where are insertions, deletions or genome re-arrangements?
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_other_tags_of_interest">9.6.3.
Other tags of interest
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_res_postprocessing_mapping_assemblies">9.7.
Post-processing mapping assemblies
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_res_pp_manual_cleanup">9.7.1.
Manual cleanup and validation (optional)
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_comprehensive_snp_analysis_spreadsheet_tables_for_excel_or_oocalc">9.7.2.
Comprehensive SNP analysis spreadsheet tables (for Excel or OOcalc)
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_html_files_depicting_snp_positions_and_deletions">9.7.3.
HTML files depicting SNP positions and deletions
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_wig_files">9.7.4.
WIG files depicting contig coverage or GC content
</a></span></dt><dt><span class="sect2"><a href="#sect_res_poi_tables_for_feature_coverage">9.7.5.
Comprehensive spreadsheet tables for gene expression values / genome deletions & duplications
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">You have to know what you're looking for before you can find it.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><p>
MIRA makes results available in quite a number of formats: CAF, ACE, FASTA and
a few others. The preferred formats are CAF and MAF, as these formats can be
translated into any other supported format.
</p><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_looking_at_results"></a>9.1.
MIRA output directories and files
</h2></div></div></div><p>
For the assembly, MIRA creates a directory named
<code class="filename"><em class="replaceable"><code>projectname</code></em>_assembly</code>
which contains a number of sub-directories.
</p><p>
These sub-directories (and files within) contain the results of the
assembly itself, general information and statistics on the results and
-- if not deleted automatically by MIRA -- a tmp directory with log
files and temporary data:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>:
this directory contains all the output files of the assembly in
different formats.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>:
this directory contains information files of the final
assembly. They provide statistics as well as, e.g., information
(easily parsable by scripts) on which read is found in which
contig etc.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_tmp</code>:
this directory contains log files and temporary assembly files. It
can be safely removed after an assembly, as it may easily hold a
few GB of data that are normally not needed anymore.
</p><p>
The default settings of MIRA are such that really big files are
automatically deleted when they are no longer needed during an
assembly.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_chkpt</code>:
this directory contains checkpoint files needed to resume
assemblies that crashed or were stopped.
</p></li></ul></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_resultsdir"></a>9.1.1.
The <code class="filename">*_d_results</code> directory
</h3></div></div></div><p>
The following files in
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_results</code>
contain results of the assembly in different formats. Depending on the
output options you defined for MIRA, some files may or may not be
there. As long as a CAF or MAF file is present, you can translate
your assembly later on to just about any supported format with the
<span class="command"><strong>miraconvert</strong></span> program supplied with the MIRA
distribution:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.txt</code>:
this file contains in a human readable format the aligned assembly
results, where all input sequences are shown in the context of the
contig they were assembled into. This file is just meant as a
quick way for people to have a look at their assembly without
specialised alignment finishing tools.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.padded.fasta</code>:
this file contains as FASTA sequence the consensus of the contigs
that were assembled in the process. Positions in the consensus
containing gaps (also called 'pads', denoted by an asterisk) are
still present. The computed consensus qualities are in the
corresponding
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.padded.fasta.qual</code>
file.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.unpadded.fasta</code>:
as above, this file contains as FASTA sequence the consensus of
the contigs that were assembled in the process, but positions in
the consensus containing gaps were removed. The computed consensus
qualities are in the corresponding
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.unpadded.fasta.qual</code>
file.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.caf</code>:
this is the result of the assembly in CAF format, which can be
further worked on with, e.g., tools from the
<span class="emphasis"><em>caftools</em></span> package from the Sanger Centre and
later on be imported into, e.g., the Staden gap4 assembly and
finishing tool.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.ace</code>:
this is the result of the assembly in ACE format. This format can
be read by viewers like the TIGR clview or by consed from the
phred/phrap/consed package.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_out.gap4da</code>:
this directory contains the result of the assembly suited for the
<span class="emphasis"><em>direct assembly</em></span> import of the Staden gap4
assembly viewer and finishing tool.
</p></li></ul></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_resultsdir_denovo"></a>9.1.1.1.
Additional 'large contigs' result files for de-novo assemblies of genomes
</h4></div></div></div><p>
For de-novo assemblies of genomes, MIRA makes a proposal regarding
which contigs you probably want to have a look at ... and which ones
you can probably forget about.
</p><p>
This proposal relies on the <span class="emphasis"><em>largecontigs</em></span> file
in the info directory (see section below); MIRA automatically
extracts these contigs into all the formats you wanted to have your
results in.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
The result files for 'large contigs' are all named
<code class="filename"><em class="replaceable"><code>projectname</code></em>_<span class="emphasis"><em>LargeContigs</em></span>_out.<em class="replaceable"><code>resulttype</code></em></code>.
</p></li><li class="listitem"><p>
<code class="filename">extractLargeContigs.sh</code>: this is a small
shell script which just contains the call
to <span class="command"><strong>miraconvert</strong></span> with which MIRA extracted the
large contigs for you. In case you want to redefine what large
contigs are for you, feel free to use this as a template.
</p></li></ul></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_infodir"></a>9.1.2.
The <code class="filename">*_d_info</code> directory
</h3></div></div></div><p>
The following files in
<code class="filename"><em class="replaceable"><code>projectname</code></em>_d_info</code>
contain statistics and other information files of the assembly:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_assembly.txt</code>:
This file should be your first stop after an assembly. It will
tell you some statistics as well as whether or not problematic
areas remain in the result.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_callparameters.txt</code>:
This file contains the parameters as given on the mira command
line when the assembly was started.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_consensustaglist.txt</code>:
This file contains information about the tags (and their position)
that are present in the consensus of a contig.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigreadlist.txt</code>:
This file contains information on which reads have been assembled
into which contigs (or singlets).
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_contigstats.txt</code>:
This file contains in tabular format statistics about the contigs
themselves, their length, average consensus quality, number of
reads, maximum and average coverage, average read length, number
of A, C, G, T, N, X and gaps in consensus.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_debrislist.txt</code>:
This file contains the names of all the reads which were not
assembled into contigs (or singlets if appropriate MIRA parameters
were chosen). The file has two columns: first column is the name
of the read, second column is a code showing the reason and stage
at which the read was put into the debris category.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_largecontigs.txt</code>:
This file contains, as a simple list, the names of all the contigs
MIRA considers to be more or less important at the end of the
assembly. To be present in this list, a contig needs to reach a
certain length (usually 500, but see [-MI:lcs]) and to have a
coverage of at least 1/3 of the average coverage (per sequencing
technology) of the complete project.
</p><p>
Note: only present for de-novo assemblies of genomes.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
The default heuristics (500bp length and 1/3 coverage per
sequencing technology) generally work well enough for most
projects. However, projects with extremely different coverage
numbers per sequencing technology may need to use different
numbers. E.g.: in a project with 80x Illumina and 6x Sanger, contigs
consisting of only 2 or 3 Sanger sequences but with an
average coverage ≥ 2 would also be in this list, although clearly no
one would look at these under normal circumstances.
</td></tr></table></div></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_groups.txt</code>:
This file contains information about readgroups as determined by
MIRA. Most interesting will probably be statistics concerning
read-pair sizes.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readrepeats</code>:
This file helps to find out which parts of which reads are quite
repetitive in a project. Please consult the chapter on how to
tackle "hard" sequencing projects to learn how this file can help
you in spotting sequencing mistakes and / or difficult parts in a
genome or EST / RNASeq project.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readstooshort</code>:
A list containing the names of those reads that have been sorted
out of the assembly only due to the fact that they were too short,
before any processing started.
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_info_readtaglist.txt</code>:
This file contains information about the tags and their position
that are present in each read. The read positions are given
relative to the forward direction of the sequence (i.e. as it was
entered into the assembly).
</p></li><li class="listitem"><p>
<code class="filename"><em class="replaceable"><code>projectname</code></em>_error_reads_invalid</code>:
A list of sequences that have been found to be invalid due to
various reasons (given in the output of the assembler).
</p></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_first_look:the_assembly_info"></a>9.2.
First look: the assembly info
</h2></div></div></div><p>
Once finished, have a look at the file
<code class="filename">*_info_assembly.txt</code> in the info directory. The
assembly information given there is split into three major parts:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
some general assembly information (number of reads assembled etc.). This
part is quite short at the moment and will be expanded in the future.
</p></li><li class="listitem"><p>
assembly metrics for 'large' contigs.
</p></li><li class="listitem"><p>
assembly metrics for all contigs.
</p></li></ol></div><p>
The part on large contigs contains several sections. The first of
these shows what MIRA counts as a large contig for this particular
project. As an example, this may look like this:
</p><pre class="screen">
Large contigs:
--------------
With Contig size >= 500
AND (Total avg. Cov >= 19
OR Cov(san) >= 0
OR Cov(454) >= 8
OR Cov(pbs) >= 0
OR Cov(sxa) >= 11
OR Cov(sid) >= 0
)</pre><p>
The above is for a 454 and Solexa hybrid assembly in which MIRA
determined large contigs to be contigs
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
of length of at least 500 bp and
</p></li><li class="listitem"><p>
having a total average coverage of at least 19x or an
average 454 coverage of 8 or an average Solexa coverage of 11
</p></li></ol></div><p>
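Written as code, this rule is a simple predicate. The Python sketch below uses the thresholds of the example; all names are made up for illustration, and leaving out the technologies that show a threshold of 0 is an assumption, not something taken from the MIRA sources.
</p>
```python
# Thresholds copied from the example "Large contigs" block above.
MIN_LENGTH = 500
MIN_TOTAL_COVERAGE = 19
# Only technologies with a non-zero threshold are listed (assumption:
# a threshold of 0 means the technology is not used in the project).
MIN_TECH_COVERAGE = {"454": 8, "sxa": 11}


def is_large_contig(length, total_cov, tech_cov):
    """Return True if a contig meets the example's 'large contig' criteria.

    length:    consensus length in bases
    total_cov: average total coverage of the contig
    tech_cov:  dict mapping technology name to its average coverage
    """
    if length < MIN_LENGTH:
        return False
    if total_cov >= MIN_TOTAL_COVERAGE:
        return True
    return any(
        tech_cov.get(tech, 0.0) >= minimum
        for tech, minimum in MIN_TECH_COVERAGE.items()
    )


if __name__ == "__main__":
    print(is_large_contig(1200, 12.0, {"454": 9.0, "sxa": 3.0}))
```
<p>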
The second section is about length assessment of large contigs:
</p><pre class="screen">
Length assessment:
------------------
Number of contigs: 44
Total consensus: 3567224
Largest contig: 404449
N50 contig size: 186785
N90 contig size: 55780
N95 contig size: 34578</pre><p>
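In the above example, 44 contigs totalling 3.57 megabases were built,
the largest contig being 404 kilobases long. The N50, N90 and N95
numbers give the length of the shortest contig among the largest contigs
that together cover at least 50%, 90% and 95% of the total consensus,
respectively.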
</p><p>
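As an illustration of how such numbers are derived, here is a minimal Python sketch of the usual Nxx definition (N50, N90, N95). It is not MIRA's own code; the function name <code class="literal">nxx</code> and the contig lengths are invented for this example.
</p>
```python
def nxx(lengths, percent):
    """Return the Nxx contig size for a list of contig lengths.

    Contigs are visited from longest to shortest; the result is the length
    of the contig at which the running sum first reaches `percent` percent
    of the total length. Integer arithmetic avoids rounding surprises.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    contig_lengths = [40, 30, 20, 10]  # invented toy data
    for percent in (50, 90, 95):
        print("N%d contig size: %d" % (percent, nxx(contig_lengths, percent)))
```
<p>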
The next section shows information about the coverage assessment of
large contigs. An example:
</p><pre class="screen">
Coverage assessment:
--------------------
Max coverage (total): 563
Max coverage
Sanger: 0
454: 271
PacBio: 0
Solexa: 360
Solid: 0
Avg. total coverage (size >= 5000): 57.38
Avg. coverage (contig size >= 5000)
Sanger: 0.00
454: 25.10
PacBio: 0.00
Solexa: 32.88
Solid: 0.00</pre><p>
The maximum coverage attained was 563, the maximum for 454 alone 271 and for
Solexa alone 360. The average total coverage (computed from contigs with
a size ≥ 5000 bases) is 57.38. The average coverage by sequencing
technology (in contigs ≥ 5000) is 25.10 for 454 and 32.88 for Solexa
reads.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
For genome assemblies, the value for <span class="emphasis"><em>Avg. total coverage
(size >= 5000)</em></span> is currently always calculated for contigs
having 5000 or more consensus bases. While this gives a very effective
measure for genome assemblies, assemblies of EST or RNASeq will often
have totally irrelevant values here: even if the default of MIRA is to
use smaller contig sizes (1000) for EST / RNASeq assemblies, the
coverage values for lowly and highly expressed genes can easily span a
factor of 10000 or more.
</p></td></tr></table></div><p>
The last section contains some numbers useful for quality assessment. It
looks like this:
</p><pre class="screen">
Quality assessment:
-------------------
Average consensus quality: 90
Consensus bases with IUPAC: 11 (you might want to check these)
Strong unresolved repeat positions (SRMc): 0 (excellent)
Weak unresolved repeat positions (WRMc): 19 (you might want to check these)
Sequencing Type Mismatch Unsolved (STMU): 0 (excellent)
Contigs having only reads wo qual: 0 (excellent)
Contigs with reads wo qual values: 0 (excellent)</pre><p>
Besides the average quality of the contigs and whether they contain reads
without quality values, MIRA shows the number of different tags in the
consensus which might point at problems.
</p><p>
The above mentioned sections (length assessment, coverage assessment and
quality assessment) for <span class="emphasis"><em>large</em></span> contigs are then
repeated for <span class="emphasis"><em>all</em></span> contigs, this time also including
contigs which MIRA did not count as large contigs.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_converting_results"></a>9.3.
Converting results
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_converting_miraconvert"></a>9.3.1.
Converting to and from other formats: <span class="command"><strong>miraconvert</strong></span>
</h3></div></div></div><p>
<span class="command"><strong>miraconvert</strong></span> is a tool in the MIRA package which
reads and writes a number of formats, ranging from full assembly
formats like CAF and MAF to simple output view formats like HTML or
plain text.
</p><div class="figure"><a name="chap_res::results_miraconvert.png"></a><p class="title"><b>Figure 9.1. <span class="command">miraconvert</span> supports a wide range of
format conversions to simplify export / import of results to and from
other programs</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/results_miraconvert.png" width="100%" alt="miraconvert supports a wide range of format conversions to simplify export / import of results to and from other programs"></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_converting_reach_other_programs"></a>9.3.2.
Steps for converting data from / to other tools
</h3></div></div></div><p>
The question "How do I convert to / from other tools?" is complicated
by the plethora of file formats and tools available. This section
gives an overview on what is needed to reach the most important ones.
</p><div class="figure"><a name="chap_res::results_mira2other.png"></a><p class="title"><b>Figure 9.2.
Conversion steps, formats and programs needed to reach some tools
like assembly viewers, editors or scaffolders.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/results_mira2other.png" width="100%" alt="Conversion steps, formats and programs needed to reach some tools like assembly viewers, editors or scaffolders."></td></tr></table></div></div></div><br class="figure-break"><p>
Please also read the chapter on MIRA utilities in this manual to learn
more on <span class="command"><strong>miraconvert</strong></span> and have a look at
<code class="literal">miraconvert -h</code> which lists all possible formats
and other command line options.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_converting_to_from_staden"></a>9.3.2.1.
Example: converting to and from the Staden package (gap4 / gap5)
</h4></div></div></div><p>
The <span class="command"><strong>gap4</strong></span> program and its
successor <span class="command"><strong>gap5</strong></span> from the Staden package are pretty
useful finishing tools and assembly viewers. They have their own
database format which MIRA does not read or write, but there are
interconversion possibilities using the CAF format (for gap4) and
the SAM format (for gap5).
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
[gap4]
</span></dt><dd><p>
You need the <span class="command"><strong>caf2gap</strong></span>
and <span class="command"><strong>gap2caf</strong></span> utilities for this, which are
distributed separately by the Sanger Centre
(<a class="ulink" href="http://www.sanger.ac.uk/Software/formats/CAF/" target="_top">http://www.sanger.ac.uk/Software/formats/CAF/</a>).
Conversion is pretty straightforward. From MIRA to gap4, it's
like this:
</p><pre class="screen">
<code class="prompt">$</code> caf2gap -project <em class="replaceable"><code>YOURGAP4PROJECTNAME</code></em> -ace <em class="replaceable"><code>mira_result.caf</code></em> >&/dev/null</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Don't be fooled by the <code class="literal">-ace</code> parameter of
<span class="command"><strong>caf2gap</strong></span>. It needs a CAF file as input, not
an ACE file.
</td></tr></table></div><p>
From gap4 to CAF, it's like this:
</p><pre class="screen">
<code class="prompt">$</code> gap2caf -project <em class="replaceable"><code>YOURGAP4PROJECTNAME</code></em> >tmp.caf
<code class="prompt">$</code> miraconvert -r c tmp.caf <em class="replaceable"><code>somenewname</code></em>.caf</pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Using <span class="command"><strong>gap2caf</strong></span>, be careful to use the simple
<code class="literal">></code> redirection to file and
<span class="emphasis"><em>not</em></span> the <code class="literal">>&</code>
redirection.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	      Running <span class="command"><strong>gap2caf</strong></span> first and then
	      <span class="command"><strong>miraconvert</strong></span> is needed because gap4 writes its
	      own consensus to the CAF file, which is not necessarily the
	      best. Indeed, gap4 does not know about different sequencing
	      technologies like 454 and treats everything as
	      Sanger. Using
	      <span class="command"><strong>miraconvert</strong></span> with the [-r c] option
	      therefore recalculates a MIRA consensus during the "conversion" from CAF to CAF.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
	      If you work with a 32 bit executable of caf2gap, it might very
	      well be that the converter needs more memory than can be
	      handled by 32 bit. The only solution is to switch to a 64 bit
	      executable of caf2gap.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning:
caf2gap bug for sequence annotations in reverse direction
"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">
caf2gap bug for sequence annotations in reverse direction
</th></tr><tr><td align="left" valign="top"><p>
		caf2gap currently (as of version 2.0.2) has a bug that flips
		all features annotated in reverse direction during the
		conversion from CAF to a gap4 project. There is a fix
		available; please contact me for further information (until
		I find time to describe it here).
</p></td></tr></table></div></dd><dt><span class="term">
[gap5]
</span></dt><dd><p>
	      The <span class="command"><strong>gap5</strong></span> program is the successor to
	      gap4. It comes with its own import utility
	      (<span class="command"><strong>tg_index</strong></span>) which can import SAM and CAF
	      files, and gap5 itself has an export function which also
	      writes SAM and CAF. It is suggested to use the SAM format to
	      export data to gap5 as it is more efficient and conveys more
	      information on the sequencing technologies used.
</p><p>
Conversion is pretty straightforward. From MIRA to gap5, it's like
this:
</p><pre class="screen">
<code class="prompt">$</code> tg_index <em class="replaceable"><code>INPUT</code></em>_out.sam</pre><p>
This creates a gap5 database named
<code class="filename"><em class="replaceable"><code>INPUT</code></em>_out.g5d</code>
which can be directly loaded with gap5 like this:
</p><pre class="screen">
<code class="prompt">$</code> gap5 <em class="replaceable"><code>INPUT</code></em>_out.g5d</pre><p>
Exporting back to SAM or CAF is done in gap5 via
the <span class="emphasis"><em>File->Export Sequences</em></span> menu there.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_converting_to_from_sam"></a>9.3.2.2.
Example: converting to and from SAM (for samtools, tablet etc.)
</h4></div></div></div><p>
Converting to SAM is done by
using <span class="command"><strong>miraconvert</strong></span> on a MIRA MAF file, like this:
</p><pre class="screen">
<code class="prompt">$</code> miraconvert maf -t sam <em class="replaceable"><code>INPUT</code></em>.maf <em class="replaceable"><code>OUTPUT</code></em></pre><p>
The above will create a file named <code class="filename">OUTPUT.sam</code>.
</p><p>
Converting from SAM to a format which either <span class="command"><strong>mira</strong></span>
or <span class="command"><strong>miraconvert</strong></span> can understand takes a few
more steps. As neither tool currently reads SAM natively, you need
to go via the <span class="command"><strong>gap5</strong></span> editor of the Staden package:
convert the SAM via <span class="command"><strong>tg_index</strong></span> to a gap5 database,
load that database in gap5 and export it there to CAF.
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_filtering_of_results"></a>9.4.
Filtering results
</h2></div></div></div><p>
      It is important to remember that, depending on assembly options, MIRA
      will also include very small contigs (possibly with very low coverage)
      made out of reads which were rejected from the "good" contigs for
      quality or other reasons. You probably do not want to have a look at
      this contig debris when finishing a genome unless you are really,
      really, really picky.
</p><p>
Many people prefer to just go on with what would be large
contigs. Therefore, in de-novo assemblies, MIRA writes out separate
files of what it thinks are "good", large contigs. In case you want to
extract contigs differently, the <span class="command"><strong>miraconvert</strong></span> program
from the MIRA package can selectively filter CAF or MAF files for
contigs with a certain size, average coverage or number of reads.
</p><p>
      The file <code class="filename">*_info_assembly.txt</code> in the info directory
      at the end of an assembly might give you first hints on what could be
      suitable filter parameters. As an example, for "normal" assemblies
      (whatever this means), one might want to consider only contigs that are
      larger than 500 bases and have at least one third of the average coverage
      of the N50 contigs.
</p><p>
Here's an example: In the "Large contigs" section, there's a "Coverage
assessment" subsection. It looks a bit like this:
</p><pre class="screen">
...
Coverage assessment:
--------------------
Max coverage (total): 43
Max coverage
Sanger: 0
454: 43
Solexa: 0
Solid: 0
Avg. total coverage (size ≥ 5000): 22.30
Avg. coverage (contig size ≥ 5000)
Sanger: 0.00
454: 22.05
Solexa: 0.00
Solid: 0.00
...</pre><p>
This project was obviously a 454 only project, and the average coverage
for it is ~22. This number was estimated by MIRA by taking only contigs
of at least 5kb into account, which for sure left out everything which
could be categorised as debris. Normally it's a pretty solid number.
</p><p>
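      The average coverage can also be pulled out of the info file by a small script.
      The following Python sketch is only an illustration: the function name is made up
      for this example, and it assumes that the file on disk follows the layout of the
      excerpt above (the exact spelling of the size condition in the real file is an
      assumption, which is why the pattern skips over it).
    </p><pre class="programlisting">
```python
import re

# Sample text modelled on the "Coverage assessment" excerpt above.
# Assumption: the real *_info_assembly.txt uses the same line layout.
SAMPLE = """\
Coverage assessment:
--------------------
Max coverage (total): 43
Avg. total coverage (size >= 5000): 22.30
Avg. coverage (contig size >= 5000)
454: 22.05
"""


def average_total_coverage(text):
    """Return the 'Avg. total coverage' value as a float, or None if absent."""
    # Skip whatever stands between the label and the colon, then read the number.
    pattern = r"Avg\. total coverage[^:\n]*:\s*([0-9]+(?:\.[0-9]+)?)"
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


print(average_total_coverage(SAMPLE))
```
</pre><p>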
Now, depending on how much time you want to invest performing some manual
polishing, you should extract contigs which have at least the following
fraction of the average coverage:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	  2/3 if a quick and "good enough" result is what you want and you don't
	  want to do any manual polishing. In this example, that would be around
	  14 or 15.
	</p></li><li class="listitem"><p>
	  1/2 if you want to have a "quick look" and possibly perform some
	  contig joins. In this example the number would be 11.
	</p></li><li class="listitem"><p>
	  1/3 if you want to be quite accurate and be sure not to lose any
	  possible repeat. That would be 7 or 8 in this example.
</p></li></ul></div><p>
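      The arithmetic behind these numbers can be sketched in a few lines of Python.
      This is only an illustration: the function name and the labels are made up for
      this example and are not part of MIRA. It uses the rounded average coverage of
      22 from the excerpt above.
    </p><pre class="programlisting">
```python
def coverage_cutoffs(average_coverage):
    """Map each polishing strategy from the list above to a minimum coverage."""
    strategies = [
        ("good enough, no manual polishing", 2 / 3),
        ("quick look, some contig joins", 1 / 2),
        ("thorough, keep every possible repeat", 1 / 3),
    ]
    # One decimal place is plenty; the text itself only quotes whole numbers.
    return {label: round(average_coverage * share, 1) for label, share in strategies}


for label, cutoff in coverage_cutoffs(22.0).items():
    print(f"{label}: {cutoff}")
```
</pre><p>
      For an average coverage of 22 this gives 14.7, 11.0 and 7.3, which matches the
      ranges quoted in the list. Rounded to a whole number, such a value can then serve
      as the minimum average coverage in the filtering examples that follow.
    </p><p>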
Example (useful with assemblies of Sanger data): extracting only contigs ≥
1000 bases and with a minimum average coverage of 4 into FASTA format:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 1000 -y 4 <em class="replaceable"><code>sourcefile.maf targetfile.fasta</code></em></code></strong></pre><p>
Example (useful with assemblies of 454 data): extracting only contigs
≥ 500 bases into FASTA format:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 500 <em class="replaceable"><code>sourcefile.maf targetfile.fasta</code></em></code></strong></pre><p>
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only
contigs ≥ 500 bases and with an average coverage ≥ 15 reads into
CAF format, then converting the reduced CAF into a Staden GAP4 project:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 500 -y 15 <em class="replaceable"><code>sourcefile.maf tmp.caf</code></em></code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>caf2gap -project <em class="replaceable"><code>somename</code></em> -ace <em class="replaceable"><code>tmp.caf</code></em></code></strong></pre><p>
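      When the same filtering is needed for many projects, the command line can be
      assembled by a script. The following Python sketch is only an illustration: the
      function and parameter names are made up for this example. It relies solely on the
      [-x], [-y] and [-z] options as they are used in the examples of this section
      (minimum contig length, minimum average coverage, minimum number of reads) and
      does not run anything itself.
    </p><pre class="programlisting">
```python
import shlex


def filter_command(source, target, min_length=None, min_coverage=None, min_reads=None):
    """Build a miraconvert argument list from the filters used in this section."""
    args = ["miraconvert"]
    if min_length is not None:
        args += ["-x", str(min_length)]    # minimum contig length in bases
    if min_coverage is not None:
        args += ["-y", str(min_coverage)]  # minimum average coverage
    if min_reads is not None:
        args += ["-z", str(min_reads)]     # minimum number of reads
    return args + [source, target]


cmd = filter_command("sourcefile.maf", "tmp.caf", min_length=500, min_coverage=15)
print(shlex.join(cmd))
```
</pre><p>
      The printed line is the first command of the example above. To actually execute
      it, the list can be handed to <code class="literal">subprocess.run(cmd, check=True)</code>.
    </p><p>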
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only
contigs ≥ 1000 bases and with ≥ 10 reads from MAF into CAF format,
then converting the reduced CAF into a Staden GAP4 project:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>miraconvert -x 1000 -z 10 <em class="replaceable"><code>sourcefile.maf tmp.caf</code></em></code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>caf2gap -project <em class="replaceable"><code>somename</code></em> -ace <em class="replaceable"><code>tmp.caf</code></em></code></strong></pre></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_places_of_importance_in_a_de_novo_assembly"></a>9.5.
Places of importance in a de-novo assembly
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_tags_set_by_mira"></a>9.5.1.
Tags set by MIRA
</h3></div></div></div><p>
	MIRA sets a number of different tags in resulting assemblies. They can be set in reads
	(in which case they mostly end with an <span class="emphasis"><em>r</em></span>) or in the consensus (then
	ending with a <span class="emphasis"><em>c</em></span>).
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
If you use the
Staden <span class="command"><strong>gap4</strong></span>, <span class="command"><strong>gap5</strong></span> or
<span class="command"><strong>consed</strong></span> assembly editor to tidy up the assembly, you
can directly jump to places of interest that MIRA marked for further
analysis by using the search functionality of these programs.
</p><p>
However, you need to tell these programs that these tags exist. For
that you must change some configuration files. More information on
how to do this can be found in the
<code class="filename">support/README</code> file of the MIRA distribution.
</p></td></tr></table></div><p>
	To find places of importance, you should search for the following "consensus" tags
	(in this order):
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
IUPc
</p></li><li class="listitem"><p>
UNSc
</p></li><li class="listitem"><p>
SRMc
</p></li><li class="listitem"><p>
WRMc
</p></li><li class="listitem"><p>
STMU (only hybrid assemblies)
</p></li><li class="listitem"><p>
MCVc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SROc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SAOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SIOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
STMS (only hybrid assemblies)
</p></li></ul></div><p>
</p><p>
	Of lesser importance are the "read" versions of the tags above:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
UNSr
</p></li><li class="listitem"><p>
SRMr
</p></li><li class="listitem"><p>
WRMr
</p></li><li class="listitem"><p>
SROr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SAOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li><li class="listitem"><p>
SIOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
</p></li></ul></div><p>
</p><p>
In normal assemblies (only one sequencing technology, just one
strain), search for the IUPc, UNSc, SRMc and WRMc tags.
</p><p>
In hybrid assemblies, searching for the IUPc, UNSc, SRMc, WRMc, and
STMU tags and correcting only those places will allow you to have a
qualitatively good assembly in no time at all.
</p><p>
	Columns with SRMr tags (SRM in <span class="bold"><strong>R</strong></span>eads)
	in an assembly without an SRMc tag at the same consensus position show
	where mira was able to resolve a repeat during the different passes of
	the assembly; you don't need to look at these. SRMc and WRMc tags,
	however, mean that there may be unresolved trouble ahead, so you should take a
	look at these.
</p><p>
Especially in mapping assemblies, columns with the MCVc, SROx, SIOx and SAOx tags are
extremely helpful in finding places of interest. As they are only set if you
gave strain information to MIRA, you should always do that.
</p><p>
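	The search order described in this section can be written down as a small lookup
	table. The following Python sketch is only an illustration: the function name and
	the condition labels are made up for this example. It simply filters the consensus
	tag list given above, keeping the order in which the tags are listed.
      </p><pre class="programlisting">
```python
# Consensus tags in the order of the list above, each paired with the
# condition under which the text describes the tag as relevant.
CONSENSUS_TAGS = [
    ("IUPc", "always"),
    ("UNSc", "always"),
    ("SRMc", "always"),
    ("WRMc", "always"),
    ("STMU", "hybrid"),
    ("MCVc", "strains"),
    ("SROc", "strains"),
    ("SAOc", "strains"),
    ("SIOc", "strains"),
    ("STMS", "hybrid"),
]


def tags_to_review(hybrid=False, strains=False):
    """Return the consensus tags to search for, in review order."""
    enabled = {"always"}
    if hybrid:
        enabled.add("hybrid")    # more than one sequencing technology
    if strains:
        enabled.add("strains")   # strain information was given to MIRA
    return [tag for tag, when in CONSENSUS_TAGS if when in enabled]


print(tags_to_review())
print(tags_to_review(strains=True))
```
</pre><p>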
	For more information on tags set/used by MIRA and what they exactly mean, please look up the
	corresponding section in the reference chapter.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_other_places_of_importance"></a>9.5.2.
Other places of importance
</h3></div></div></div><p>
The read coverage histogram as well as the template display of gap4
will help you to spot other places of potential interest. Please consult the
gap4 documentation.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_joining_contigs"></a>9.5.3.
Joining contigs
</h3></div></div></div><p>
	I recommend investing anywhere from a couple of minutes (in the best case) to a few
	hours in joining contigs, especially if the uniform read distribution
	option of MIRA was used (but first filter for large contigs). This
	way, you will reduce the number of "false repeats" and improve the
	overall quality of your assembly.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_joining_truerepeats"></a>9.5.3.1.
Joining contigs at true repetitive sites
</h4></div></div></div><p>
Joining contigs at repetitive sites of a genome is always a
difficult decision. There are, however, two rules which can help:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem">
If the sequencing was done without a paired-end library, don't join.
</li><li class="listitem">
	    If the sequencing was done with a paired-end library, but no
	    pair (or template) spans the join site, don't join.
</li></ol></div><p>
</p><p>
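	  The two rules, together with the check that the spanning templates are good, can
	  be condensed into a small decision function. The following Python sketch is only
	  an illustration: the function name and the way templates are represented (one
	  boolean per template spanning the join site, true if the finishing program
	  reports it as good) are assumptions made for this example.
	</p><pre class="programlisting">
```python
def may_join(paired_end_library, spanning_templates):
    """Decide whether a join at a repetitive site is supported.

    spanning_templates holds one boolean per template (read pair) that
    spans the join site: True if the template is reported as good.
    """
    if not paired_end_library:
        return False  # rule 1: no paired-end library, don't join
    if not spanning_templates:
        return False  # rule 2: no template spans the join site, don't join
    return all(spanning_templates)  # join only if every spanning template is good


print(may_join(True, [True, True, True]))
print(may_join(True, []))
```
</pre><p>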
	  The following screen shot shows a case where one should not join as
	  the finishing program (in this case <span class="command"><strong>gap4</strong></span>) warns
	  that no template (read-pair) spans the join site:
</p><p>
</p><div class="figure"><a name="haf_danger_join_notok.png"></a><p class="title"><b>Figure 9.3.
Join at a repetitive site which should not be performed due to
missing spanning templates.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf_danger_join_notok.png" width="100%" alt="Join at a repetitive site which should not be performed due to missing spanning templates."></td></tr></table></div></div></div><p><br class="figure-break">
</p><p>
The next screen shot shows a case where one should join as the
finishing program (in this case <span class="command"><strong>gap4</strong></span>) finds
templates spanning the join site and all of them are good:
</p><p>
</p><div class="figure"><a name="haf_danger_join_ok.png"></a><p class="title"><b>Figure 9.4.
Join at a repetitive site which should be performed due to
spanning templates being good.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/haf_danger_join_ok.png" width="100%" alt="Join at a repetitive site which should be performed due to spanning templates being good."></td></tr></table></div></div></div><p><br class="figure-break">
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_res_joining_FALSErepeats"></a>9.5.3.2.
Joining contigs at "wrongly discovered" repetitive sites
</h4></div></div></div></div><p>
Remember that MIRA takes a very cautious approach in contig building,
and sometimes creates two contigs when it could have created
one. Three main reasons can be the cause for this:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
when using <span class="emphasis"><em>uniform read distribution</em></span>, some
non-repetitive areas may have generated so many more reads that
	    they start to look like repeats (so-called pseudo-repeats). In
	    this case, reads that are above a given coverage are
	    <span class="emphasis"><em>shaved off</em></span> (see [-AS:urdcm]) and kept
	    in reserve to be used for another copy of that repeat ... which in
	    case of a non-repetitive region will of course never arrive. So at
	    the end of an assembly, these shaved-off reads will form short,
	    low coverage contig debris which can more or less be safely
	    ignored and sorted out via the filtering options ([-x -y
	    -z]) of <span class="command"><strong>miraconvert</strong></span>.
</p><p>
	    Some 454 library construction protocols -- especially, but not
	    exclusively, for paired-end reads -- create pseudo-repeats quite
	    frequently. In this case, the pseudo-repeats are characterised by
	    several reads starting at exactly the same position but which can
	    have different lengths. Should MIRA have separated these reads
	    into different contigs, these can be -- most of the time -- safely
	    joined. The following figure shows such a case:
</p><div class="figure"><a name="454_stacks_join.png"></a><p class="title"><b>Figure 9.5.
Pseudo-repeat in 454 data due to sequencing artifacts
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/454_stacks_join.png" width="100%" alt="Pseudo-repeat in 454 data due to sequencing artifacts"></td></tr></table></div></div></div><br class="figure-break"><p>
For Solexa data, a non-negligible GC bias has been reported in
genome assemblies since late 2009. In genomes with moderate to
high GC, this bias actually favours regions with lower
GC. Examples were observed where regions with an average GC of 10%
less than the rest of the genome had between two and four times
more reads than the rest of the genome, leading to false
"discovery" of duplicated genome regions.
</p></li><li class="listitem"><p>
when using unpaired data, the above described possibility of
having "too many" reads in a non-repetitive region can also lead
to a contig being separated into two contigs in the region of the
pseudo-repeat.
</p></li><li class="listitem"><p>
a number of reads (sometimes even just one) can contain "high
quality garbage", that is, nonsense bases which got - for some
reason or another - good quality values. This garbage can be
distributed on a long stretch in a single read or concern just a
single base position across several reads.
</p><p>
	    While MIRA has some algorithms to deal with the disrupting effects
	    of reads like these, the algorithms are not always 100% effective and
	    some reads might slip through the filters.
</p></li></ol></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_places_of_interest_in_a_mapping_assembly"></a>9.6.
Places of interest in a mapping assembly
</h2></div></div></div><p>
      This section just gives a short overview of the tags you might find
      interesting. For more information, especially on how to configure gap4
      or consed, please consult the <span class="emphasis"><em>mira usage</em></span> document
      and the <span class="emphasis"><em>mira</em></span> manual.
</p><p>
      In file types that allow tags (CAF, MAF, ACE), SNPs and other
      interesting features will be marked by MIRA with a number of tags. The
      following sections give a brief overview. For a description of what
      the tags are (SROc, WRMc etc.), please read the section "Tags used
      in the assembly by MIRA and EdIt" in the main manual.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Screen shots in this section are taken from the walk-through with
Lenski data (see below).
</td></tr></table></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_where_are_snps?"></a>9.6.1.
Where are SNPs?
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the <span class="bold"><strong>SROc</strong></span> tag will point to most
SNPs. Should you assemble sequences of more than one strain (I
cannot really recommend such a strategy), you also might
encounter <span class="bold"><strong>SIOc</strong></span> and <span class="bold"><strong>SAOc</strong></span> tags.
</p><div class="figure"><a name="chap_sol::sxa_sroc_lenski1.png"></a><p class="title"><b>Figure 9.6.
"SROc" tag showing a SNP position in a Solexa mapping
assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski1.png" width="100%" alt='"SROc" tag showing a SNP position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_sol::sxa_sroc_lenski2.png"></a><p class="title"><b>Figure 9.7.
"SROc" tag showing a SNP/indel position in a Solexa mapping
assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_sroc_lenski2.png" width="100%" alt='"SROc" tag showing a SNP/indel position in a Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></li><li class="listitem"><p>
	    the <span class="bold"><strong>WRMc</strong></span> tags might sometimes
	    point to SNPs or to indels of one or two bases.
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_where_are_insertions_deletions_or_genome_rearrangements?"></a>9.6.2.
Where are insertions, deletions or genome re-arrangements?
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Large deletions: the <span class="bold"><strong>MCVc</strong></span> tags
point to deletions in the resequenced data, where no read is
covering the reference genome.
</p><div class="figure"><a name="chap_sol::sxa_mcvc_lenski.png"></a><p class="title"><b>Figure 9.8.
"MCVc" tag (dark red stretch in figure) showing a genome
deletion in Solexa mapping assembly.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_mcvc_lenski.png" width="100%" alt='"MCVc" tag (dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.'></td></tr></table></div></div></div><br class="figure-break"></li><li class="listitem"><p>
Insertions, small deletions and re-arrangements: these are
harder to spot. In unpaired data sets they can be found looking
at clusters of <span class="bold"><strong>SROc</strong></span>, <span class="bold"><strong>SRMc</strong></span>, <span class="bold"><strong>WRMc</strong></span>, and / or <span class="bold"><strong>UNSc</strong></span> tags.
</p><div class="figure"><a name="chap_sol::sxa_wrmcsrmc_hiding_lenski1.png"></a><p class="title"><b>Figure 9.9.
	      An IS150 insertion hiding behind a WRMc and an SRMc tag
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_wrmcsrmc_hiding_lenski1.png" width="100%" alt="An IS150 insertion hiding behind a WRMc and an SRMc tag"></td></tr></table></div></div></div><br class="figure-break"><p>
	    More massive occurrences of these tags lead to a rather colourful
	    display in finishing programs, which is why these clusters are
	    also sometimes called Xmas-trees.
</p><div class="figure"><a name="chap_sol::sxa_xmastree_lenski1.png"></a><p class="title"><b>Figure 9.10.
	      A 16 base pair deletion leading to a SROc/UNSc xmas-tree
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski1.png" width="100%" alt="A 16 base pair deletion leading to a SROc/UNSc xmas-tree"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="chap_sol::sxa_xmastree_lenski2.png"></a><p class="title"><b>Figure 9.11.
	      An IS186 insertion leading to a SROc/UNSc xmas-tree
	    </b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski2.png" width="100%" alt="An IS186 insertion leading to a SROc/UNSc xmas-tree"></td></tr></table></div></div></div><br class="figure-break"><p>
In sets with paired-end data, post-processing software (or
alignment viewers) can use the read-pair information to guide
you to these sites (MIRA doesn't set tags at the moment).
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_other_tags_of_interest"></a>9.6.3.
Other tags of interest
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
	    the <span class="bold"><strong>UNSc</strong></span> tag points to areas
	    where the consensus algorithm had trouble choosing a base. This
happens in low coverage areas, at places of insertions (compared
to the reference genome) or sometimes also in places where
repeats with a few bases difference are present. Often enough,
these tags are in areas with problematic sequences for the
Solexa sequencing technology like, e.g., a
<code class="literal">GGCxG</code> or even <code class="literal">GGC</code> motif in
the reads.
</p></li><li class="listitem"><p>
	    the <span class="bold"><strong>SRMc</strong></span> tag points to places
	    where repeats with a few bases difference are present. Here too,
	    sequences that are problematic for the Solexa technology are likely to
	    have caused base calling errors and, subsequently, the setting of this
	    tag.
</p></li></ul></div><p>
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_res_postprocessing_mapping_assemblies"></a>9.7.
Post-processing mapping assemblies
</h2></div></div></div><p>
      This section is a bit terse; you should also read the chapter on
      <span class="emphasis"><em>working with results</em></span> of MIRA.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_pp_manual_cleanup"></a>9.7.1.
Manual cleanup and validation (optional)
</h3></div></div></div><p>
	When working with resequencing data and a mapping assembly, I always
	load finished projects into an assembly editor and perform a quick
	cleanup of the results. SNPs or small indels normally do not need
	cleanups, but every mapper will get larger indels mostly wrong, and
	MIRA is no exception to this.
</p><p>
For close relatives of the reference strain this doesn't take long as
MIRA will have set tags (see section earlier in this document) at all
sites you should have a look at. For example, very close mutant
bacteria with just SNPs or simple deletions and no genome
reorganisation, I usually clean up in 10 to 15 minutes. That gives the
last boost to data quality and your users (biologists etc.) will thank
you for that as it reduces their work in analysing the data (be it
looking at data or performing wet-lab experiments).
</p><p>
The general workflow I use is to convert the CAF file to a gap4 or gap5
database. Then, in gap4 or gap5, I
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
	    quickly search for the UNSc and WRMc tags and check whether they
	    could be real SNPs that were overlooked by MIRA. In that case, I
	    manually set a SROc (or SIOc) tag in gap4 via hotkeys that were
	    defined to set these tags.
</p></li><li class="listitem"><p>
sometimes also quickly clean up reads that are causing trouble in
alignments and lead to wrong base calling. These can be found at
sites with UNSc tags, most of the time they have the 5' to 3'
<code class="literal">GGCxG</code> motif which can cause trouble to Solexa.
</p></li><li class="listitem"><p>
look at sites with deletions (tagged with MCVc) and look whether I
should clean up the borders of the deletion.
</p></li></ol></div><p>
	After this, I convert the gap4 or gap5 database back to CAF format.
	But beware: gap4 does not have the same consensus calling routines as
	MIRA and will have saved its own consensus in the new CAF. In fact,
	gap4 performs rather badly in projects with multiple sequencing
	technologies. So I use miraconvert from the MIRA package to re-call
	a good consensus (and save it in MAF as it's more compact and a lot
	faster in handling than CAF):
</p><p>
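	A minimal sketch of this step for a gap4 database, written in Python, could look as
	follows. It is only an illustration: the function and file names are made up for
	this example. The <span class="command"><strong>gap2caf</strong></span> call and the
	[-r c] option are taken from the gap4 conversion example earlier in this chapter;
	selecting MAF as target with [-t maf] is an assumption made by analogy with the
	[-t sam] and [-t asnp] examples.
      </p><pre class="programlisting">
```python
import shlex


def recall_consensus_commands(gap4_project, tmp_caf, result_name):
    """Return the two command lines for exporting gap4 and re-calling the consensus."""
    # gap2caf writes the CAF data to standard output, which must be
    # redirected into tmp_caf (plain redirection of stdout only).
    export_cmd = ["gap2caf", "-project", gap4_project]
    # -r c recalculates the consensus; "-t maf" is an assumption, see above.
    recall_cmd = ["miraconvert", "-t", "maf", "-r", "c", tmp_caf, result_name]
    return export_cmd, recall_cmd


export_cmd, recall_cmd = recall_consensus_commands("YOURGAP4PROJECTNAME", "tmp.caf", "somenewname")
print(shlex.join(export_cmd))
print(shlex.join(recall_cmd))
```
</pre><p>
	To execute the commands, the first list can be run with
	<code class="literal">subprocess.run</code> and its standard output directed into the
	temporary CAF file; the second list is then run on that file.
      </p><p>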
	From this MAF file I can then convert with miraconvert to any
	other format I or my users need: CAF, FASTA, ACE, WIG (for coverage
	analysis), SNP and coverage analysis (see below), HTML etc.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_comprehensive_snp_analysis_spreadsheet_tables_for_excel_or_oocalc"></a>9.7.2.
Comprehensive SNP analysis spreadsheet tables (for Excel or OOcalc)
</h3></div></div></div><p>
	  Biologists are not really interested in SNP coordinates, and why
	  should they be? They're more interested in where SNPs are, how good they
	  are, which genes or other elements they hit, whether they have an
	  effect on a protein sequence, whether they may be important etc. For
organisms without intron/exon structure or splice variants, MIRA can
generate pretty comprehensive tables and files if an annotated
GenBank file was used as reference and strain information was given
to MIRA during the assembly.
</p><p>
Well, MIRA does all that automatically for you if the reference
sequence you gave was annotated.
</p><p>
For this, <span class="command"><strong>miraconvert</strong></span> should be used with the
<span class="emphasis"><em>asnp</em></span> format as target and a MAF (or CAF) file as
input:
</p><pre class="screen"><code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t asnp <em class="replaceable"><code>input.maf output</code></em></code></strong></pre><p>
	  Note that it is strongly suggested to perform a quick manual cleanup
	  of the assembly prior to this: in rare cases (mainly at sites of
	  small indels of one or two bases), MIRA will not tag SNPs with a SNP
	  tag (SROc, SAOc or SIOc) but will be fooled into a tag denoting
	  unsure positions (UNSc). This can be quickly corrected manually. See
	  further down in this manual in the section on post-processing.
</p><p>
After conversion, you will have four files in the directory which
you can all drag-and-drop into spreadsheet applications like
OpenOffice Calc or Excel.
</p><p>
The files should be pretty self-explanatory, here's just a short overview:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<code class="filename">output_info_snplist.txt</code> is a simple list of
the SNPs, with their positions compared to the reference
sequence (in bases and map degrees on the genome) as well as the
GenBank features they hit.
</p></li><li class="listitem"><p>
<code class="filename">output_info_featureanalysis.txt</code> is a much
extended version of the list above. It puts the SNPs into
context of the features (proteins, genes, RNAs etc.) and gives a
nice list, SNP by SNP, what might cause bigger changes in
proteins.
</p></li><li class="listitem"><p>
<code class="filename">output_info_featuresummary.txt</code> looks at the
changes (SNPs, indels) from the other way round. It gives an
excellent overview which features (genes, proteins, RNAs,
intergenic regions) you should investigate.
</p><p>
There's one column (named 'interesting') which pretty much
sums up everything you need in three categories: yes,
no, and perhaps. 'Yes' is set if indels were detected, an amino
acid changed, start or stop codon changed or for SNPs in
intergenic regions and RNAs. 'Perhaps' is set for SNPs in
proteins that change a codon, but not an amino acid (silent
SNPs). 'No' is set if no SNP is hitting a feature.
</p></li><li class="listitem"><p>
<code class="filename">output_info_featuresequences.txt</code> simply
gives the sequences of each feature of the reference sequence
and the resequenced strain.
</p></li></ol></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_html_files_depicting_snp_positions_and_deletions"></a>9.7.3.
HTML files depicting SNP positions and deletions
</h3></div></div></div><p>
I've come to realise that people who don't handle data from NextGen
sequencing technologies on a regular basis (e.g., many biologists)
don't want to be bothered with learning to handle specialised
programs to have a look at their resequenced strains. Be it because
they don't have time to learn how to use a new program or because
their desktop is not strong enough (CPU, memory) to handle the data
sets.
</p><p>
Something even biologists know how to operate is a browser. Therefore,
miraconvert has the option to load a MAF (or CAF) file of a
mapping assembly and output to HTML those areas which are interesting
to biologists. It uses the tags SROc, SAOc, SIOc and MCVc and outputs
the surrounding alignment of these areas together with a nice overview
and links to jump from one position to the previous or next.
</p><p>
This is done with the '<code class="literal">-t hsnp</code>' option of
miraconvert:
</p><pre class="screen"><code class="prompt">$</code> <strong class="userinput"><code>miraconvert -t hsnp <em class="replaceable"><code>input.maf output</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
I recommend doing this only if the resequenced strain is a very close
relative to the reference genome, else the HTML gets pretty big. But
for a couple of hundred SNPs it works great.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_wig_files"></a>9.7.4.
WIG files depicting contig coverage or GC content
</h3></div></div></div><p>
<span class="command"><strong>miraconvert</strong></span> can also dump a coverage file in WIG
format (using '<code class="literal">-t wig</code>') or a WIG file for GC
content (using '<code class="literal">-t gcwig</code>'). This comes in pretty handy
for searching genome deletions or duplications in programs like the
Affymetrix Integrated Genome Browser (IGB, see <a class="ulink" href="http://igb.bioviz.org/" target="_top">http://igb.bioviz.org/</a>) or when looking for foreign sequence
in a genome.
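</p><p>
Since several [-t] options may be given in one call (see the description of
[-t] in the miraconvert options), both WIG files can probably be written in
one go. The following is only a sketch: the file name
<code class="filename">mira_out.maf</code> and the basename
<code class="filename">mystrain</code> are placeholders, and the command is
merely printed if miraconvert or the input file is not available.
</p><pre class="screen">
```shell
# Sketch: request the coverage WIG and the GC content WIG in a single call.
# Assumptions: mira_out.maf and "mystrain" are placeholder names.
input="mira_out.maf"
outbase="mystrain"
cmd="miraconvert -t wig -t gcwig $input $outbase"

# Run only if miraconvert is installed and the input exists;
# otherwise just show what would be run.
if command -v miraconvert >/dev/null 2>&1 && [ -f "$input" ]; then
  $cmd
else
  echo "would run: $cmd"
fi
```
</pre><p>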
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_res_poi_tables_for_feature_coverage"></a>9.7.5.
Comprehensive spreadsheet tables for gene expression values / genome deletions & duplications
</h3></div></div></div><p>
When having data mapped against a reference with annotations (either
from GenBank formats or GFF3 formats),
<span class="command"><strong>miraconvert</strong></span> can generate tables depicting
either expression values (in RNASeq/EST data mappings) or probable
genome multiplication and deletion factors (in genome mappings). For
this to work, you must use a MAF or CAF file as input, specify
<span class="emphasis"><em>fcov</em></span> as output format and the reference sequence
must have had annotations during the mapping with MIRA.
</p><p>TODO: add example</p><pre class="screen"><strong class="userinput"><code>miraconvert -t fcov <em class="replaceable"><code>mira_out.maf myfeaturetable</code></em></code></strong></pre></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_mutils"></a>Chapter 10. Utilities in the MIRA package</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_mutils_convpro">10.1. miraconvert</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_mutils_cp_synopsis">10.1.1.
Synopsis
</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_cp_description">10.1.2. Description</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_cp_options">10.1.3. Options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_mutils_cp_options_general">10.1.3.1. General options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_cp_options_contigs">10.1.3.2. Options for input containing contig data</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_mutils_cp_examples">10.1.4. Examples</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_mutils_bait">10.2. mirabait - a "grep" like tool to select reads with kmers up to 256 bases</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_mutils_bait_synopsis">10.2.1.
Synopsis
</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_description">10.2.2. Description</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_options">10.2.3. Options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_mutils_bait_mainoptions">10.2.3.1. Main options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_bait_outputdef">10.2.3.2. File type options</a></span></dt><dt><span class="sect3"><a href="#sect_mutils_bait_other">10.2.3.3. Other options</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_mutils_bait_examples">10.2.4. Usage examples</a></span></dt><dt><span class="sect2"><a href="#sect_mutils_bait_installrrnadb">10.2.5. Installing different rRNA databases</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Ninety percent of success is just growing up.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_mutils_convpro"></a>10.1. miraconvert</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_synopsis"></a>10.1.1.
Synopsis
</h3></div></div></div><div class="cmdsynopsis"><p><code class="command">miraconvert</code> [options] {<em class="replaceable"><code>input_file</code></em>} {<em class="replaceable"><code>output_basename</code></em>}</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_description"></a>10.1.2. Description</h3></div></div></div><p>
<span class="command"><strong>miraconvert</strong></span> is a tool to convert, extract and
sometimes recalculate all kinds of data related to sequence assembly
files.
</p><p>
More specifically, <span class="command"><strong>miraconvert</strong></span> can
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
convert from multiple alignment files (CAF, MAF) to other multiple
alignment files (CAF, MAF, ACE, SAM), and -- if wished -- selecting
contigs by different criteria like name, length, coverage etc.
</p></li><li class="listitem"><p>
extract the consensus from multiple alignments in CAF and MAF format,
writing it to any supported output format (FASTA, FASTQ, plain text,
HTML, etc.) and -- if wished -- recalculating the consensus using
the MIRA consensus engine with MIRA parameters
</p></li><li class="listitem"><p>
extract read sequences (clipped or unclipped) from multiple
alignments and save to any supported format
</p></li><li class="listitem"><p>
Much more, need to document this.
</p></li></ol></div><p>
</p><p>…</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_options"></a>10.1.3. Options</h3></div></div></div><p>…</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_cp_options_general"></a>10.1.3.1. General options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<code class="option">-f
<em class="replaceable"><code>
{ <code class="option">caf</code> | <code class="option">maf</code> | <code class="option">fasta</code> | <code class="option">fastq</code> | <code class="option">gbf</code> | <code class="option">phd</code> | <code class="option">fofnexp</code> }
</code></em>
</code>
</span></dt><dd><p>
<span class="quote">“<span class="quote">From-type</span>”</span>, the format of the input file. CAF and MAF
files can contain full assemblies and/or unassembled (single)
sequences while the other formats contain only unassembled
sequences.
</p></dd><dt><span class="term">
<code class="option">-t
<em class="replaceable"><code>
{ <code class="option">ace</code> | <code class="option">asnp</code> | <code class="option">caf</code> | <code class="option">crlist</code> | <code class="option">cstats</code> | <code class="option">exp</code> | <code class="option">fasta</code> | <code class="option">fastq</code> | <code class="option">fcov</code> | <code class="option">gbf</code> | <code class="option">gff3</code> | <code class="option">hsnp</code> | <code class="option">html</code> | <code class="option">maf</code> | <code class="option">phd</code> | <code class="option">sam</code> | <code class="option">samnbb</code> | <code class="option">text</code> | <code class="option">tcs</code> | <code class="option">wig</code> }
</code></em>
</code>
<code class="option">[ -t … ]</code>
</span></dt><dd><p>
<span class="quote">“<span class="quote">To-type</span>”</span>, the format of the output file. Multiple
mentions of [-t] are allowed, in which case
<span class="command"><strong>miraconvert</strong></span> will convert to multiple types.
</p></dd><dt><span class="term"><code class="option">-a</code></span></dt><dd><p>
Append. Results of conversion are appended to existing files instead of overwriting them.
</p></dd><dt><span class="term"><code class="option">-A</code></span></dt><dd><p>
Do not adjust sequence case.
</p><p>
When reading formats which define clipping points (like CAF,
MAF or EXP), and saving to formats which do not have clipping
information, miraconvert normally adjusts the case of read
sequences: lower case for clipped parts, upper case for
unclipped parts of reads. Use -A if you do not want this. See
also -C.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Applies only to files/formats which do not contain contigs.
</td></tr></table></div></dd><dt><span class="term"><code class="option">-b</code></span></dt><dd><p>
Blind data. Replace all bases in all reads / contigs with a 'c'.
</p></dd><dt><span class="term"><code class="option">-C</code></span></dt><dd><p>
Hard clip reads. When the input is a format which contains clipping
points in sequences and the requested output consists of sequences
of reads, only the unclipped parts of sequences will be saved as
results.
</p></dd><dt><span class="term"><code class="option">-d</code></span></dt><dd><p>
Delete gap only columns. When output is contigs: delete
columns that are entirely gaps (can occur after having deleted
reads during editing in gap4, consed or other). When output is
reads: delete gaps in reads.
</p></dd><dt><span class="term"><code class="option">-F</code></span></dt><dd><p>
Filter read groups to different files. Works only for input
files containing readgroups, i.e., CAF or MAF. 3 (or 4) files
are generated: one or two for paired, one for unpaired and one
for debris reads. Reads in paired file are interlaced by
default, use -F twice to create separate files.
</p></dd><dt><span class="term"><code class="option">-m</code></span></dt><dd><p>
Make contigs. Encase single reads as contig singlets into a CAF/MAF
file.
</p></dd><dt><span class="term"><code class="option">-n <em class="replaceable"><code>namefile</code></em></code></span></dt><dd><p>
Name select. Only contigs or reads are selected for output whose
name appears in
<code class="filename">namefile</code>. <code class="filename">namefile</code> is a
simple text file having one name entry per line.
</p></dd><dt><span class="term"><code class="option">-i</code></span></dt><dd><p>
When -n is used, inverts the selection.
</p></dd><dt><span class="term"><code class="option">-o <em class="replaceable"><code>offset</code></em></code></span></dt><dd><p>
Offset of quality values in FASTQ files. Only valid if -f is FASTQ.
</p></dd><dt><span class="term"><code class="option">-P <em class="replaceable"><code>MIRA-PARAMETERSTRING</code></em></code></span></dt><dd><p>
Additional MIRA parameters. Allows you to initialise the underlying MIRA
routines with specific parameters. A use case can be, e.g., to
recalculate a consensus of an assembly in a slightly different way
(see also [-r]) than the one which is stored in assembly
files. Example: to tell the consensus algorithm to use a minimum
number of reads per group for 454 reads, use: "454_SETTINGS -CO:mrpg=4".
</p><p>
Consult the MIRA reference manual for a full list of MIRA
parameters.
</p></dd><dt><span class="term"><code class="option">-q quality_value</code></span></dt><dd><p>
When loading read data from files where sequence and quality
are split in several files (e.g. FASTA with .fasta and
.fasta.qual files), do not stop if the quality values for a
read are missing but set them to be the quality_value given.
</p></dd><dt><span class="term"><code class="option">-R <em class="replaceable"><code>namestring</code></em></code></span></dt><dd><p>
Rename contigs/singlets/reads with given name string to which
a counter is added.
</p><p>
Known bug: will create duplicate names if input (CAF or
MAF) contains contigs/singlets as well as free reads, i.e.
reads not in contigs nor singlets.
</p></dd><dt><span class="term"><code class="option">-S <em class="replaceable"><code>namescheme</code></em></code></span></dt><dd><p>
Naming scheme for renaming reads, important for
paired-ends. Only 'solexa' is supported at the moment.
</p></dd><dt><span class="term"><code class="option">-Y <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Yield. Defines the maximum number of (clipped/padded) bases to
convert. When used on reads: output will contain first reads
of file where length of clipped bases totals at least -Y.
When used on contigs: output will contain first contigs of
file where length of padded contigs totals at least -Y.
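</p><p>
One way to picture the [-Y] rule is as a running total over the entries of
the file. The sketch below is only an illustration of that reading, not of
miraconvert's code: the helper <code class="literal">take_until_yield</code>
and the "name length" input lines are hypothetical, and the assumption that
the entry which crosses the threshold is still included is based solely on
the wording "at least" above.
</p><pre class="screen">
```shell
# take_until_yield Y: read "name length" lines on stdin and print names
# until the running total of the lengths is at least Y.
# Assumption: the entry that crosses the threshold is still printed.
take_until_yield() {
  awk -v y="$1" '{ print $1; total += $2; if (total >= y) exit }'
}

# Toy input: four reads with made-up clipped lengths, yield of 240 bases.
printf '%s\n' 'read1 100' 'read2 150' 'read3 80' 'read4 120' |
  take_until_yield 240
```
</pre><p>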
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_cp_options_contigs"></a>10.1.3.2. Options for input containing contig data</h4></div></div></div><p>
The following switches will work only if the input file contains
contigs (i.e., CAF or MAF with contig data). Though infrequent, note
that both CAF and MAF can contain single reads only.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-M</code></span></dt><dd><p>
Do not extract contigs (or their consensus), but the reads
they are composed of.
</p></dd><dt><span class="term"><code class="option">-N <em class="replaceable"><code>namefile</code></em></code></span></dt><dd><p>
Name select, sorted. Only contigs/reads are selected for
output whose name appears in
<code class="filename">namefile</code>. Regardless of the order of
contigs/reads in the input, the output is sorted according to
the appearance of names in
<code class="filename">namefile</code>. <code class="filename">namefile</code>
is a simple text file having one name entry per line.
</p><p>
Note that for this function to work, all contigs/reads are
loaded into memory, which may strain your RAM for larger
projects.
</p></dd><dt><span class="term">
<code class="option">-r
<em class="replaceable"><code>
{ <code class="option">c</code> | <code class="option">C</code> | <code class="option">q</code> | <code class="option">f</code> }
</code></em>
</code>
</span></dt><dd><p>
Recalculate consensus and / or consensus quality values and / or
SNP feature tags of an assembly. This feature is useful in case
third party programs create own consensus sequences without
handling different sequencing technologies (e.g. the combination
of <span class="command"><strong>gap4</strong></span> and <span class="command"><strong>caf2gap</strong></span>) or
when the CAF/MAF files do not contain consensus sequences at
all.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">c</code></span></dt><dd>
recalculate consensus & consensus qualities using IUPAC where necessary
</dd><dt><span class="term"><code class="option">C</code></span></dt><dd>
recalculate consensus & consensus qualities forcing ACGT calls and without IUPAC codes
</dd><dt><span class="term"><code class="option">q</code></span></dt><dd>
recalculate consensus quality values only
</dd><dt><span class="term"><code class="option">f</code></span></dt><dd>
recalculate SNP features
</dd></dl></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Only the last of cCq is relevant, 'f' works as a switch and can be
combined with the others (e.g. <span class="quote">“<span class="quote">-r Cf</span>”</span>).
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
If the CAF/MAF contains reads from multiple strains, recalculation
of consensus & consensus qualities is forced, you can just
influence whether IUPACs are used or not. This is due to the fact
that CAF/MAF do not provide facilities to store consensus
sequences from multiple strains.
</td></tr></table></div></dd><dt><span class="term"><code class="option">-s</code></span></dt><dd><p>
Split. Split output into single files, one file per
contig. Files are named according to name of contig.
</p></dd><dt><span class="term"><code class="option">-u</code></span></dt><dd><p>
fillUp strain genomes. In assemblies made of multiple strains,
holes in the consensus of a strain (bases 'N' or '@') can be
filled up with the consensus of the other strains. Takes effect
only when '-r' is active.
</p></dd><dt><span class="term"><code class="option">-Q <em class="replaceable"><code>quality_value</code></em></code></span></dt><dd><p>
Defines minimum quality a consensus base of a strain
must have, consensus bases below this will be set to 'N'.
Only used when -r is active.
</p></dd><dt><span class="term"><code class="option">-V <em class="replaceable"><code>coverage_value</code></em></code></span></dt><dd><p>
Defines minimum coverage a consensus base of a strain must
have, consensus bases below this coverage will be set to 'N'.
Only used when -r is active.
</p></dd><dt><span class="term"><code class="option">-v</code></span></dt><dd><p>
Print version number and exit.
</p></dd><dt><span class="term"><code class="option">-x <em class="replaceable"><code>length</code></em></code></span></dt><dd><p>
Minimum length a contig (in full assemblies) or read (in single
sequence files) must have. All contigs / reads with a
length less than this value are discarded. Default: 0 (=switched
off).
</p><p>
Note: this is of course not applied to reads in contigs! Contigs passing
the [-x] length criterion and stored as complete
assembly (CAF, MAF, ACE, etc.) still contain all their reads.
</p></dd><dt><span class="term"><code class="option">-X <em class="replaceable"><code>length</code></em></code></span></dt><dd><p>
Similar to [-x], but applies only to clipped reads
(input file format must have clipping points set to be
effective).
</p></dd><dt><span class="term"><code class="option">-y <em class="replaceable"><code>contig_coverage</code></em></code></span></dt><dd><p>
Minimum average contig coverage. Contigs with an average
coverage less than this value are discarded.
</p></dd><dt><span class="term"><code class="option">-z <em class="replaceable"><code>min_reads</code></em></code></span></dt><dd><p>
Minimum number of reads in contig. Contigs with fewer
reads than this value are discarded.
</p></dd><dt><span class="term"><code class="option">-l <em class="replaceable"><code>line_length</code></em></code></span></dt><dd><p>
On output of assemblies as text or HTML: number of bases shown in
one alignment line. Default: 60.
</p></dd><dt><span class="term"><code class="option">-c <em class="replaceable"><code>endgap_character</code></em></code></span></dt><dd><p>
On output of assemblies as text or HTML: character used to pad
endgaps. Default: ' ' (a blank)
</p></dd></dl></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_cp_examples"></a>10.1.4. Examples</h3></div></div></div><p>
In the following examples, the CAF and MAF files used are expected to
contain full assembly data like the files created by MIRA during an
assembly or by the gap2caf program. CAF and MAF could be used
interchangeably in these examples, depending on which format currently
is available. In general though, MAF is faster to process and smaller
on disk.
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
Simple conversion: a MIRA MAF file to a SAM file
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert source.maf destination.sam</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Previous versions of miraconvert had a slightly different
syntax, which however is still supported:
</p><pre class="screen">
<strong class="userinput"><code>miraconvert -f maf -t sam source.maf destination</code></strong></pre></td></tr></table></div></dd><dt><span class="term">
Simple conversion: the consensus of an assembly to FASTA, at the
same time coverage data for contigs to WIG and furthermore
translate the CAF to ACE:
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert source.caf destination.fasta wig ace</code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Previous versions of miraconvert had a slightly different
syntax, which however is still supported:
</p><pre class="screen">
<strong class="userinput"><code>miraconvert -f caf -t fasta -t wig -t ace source.caf destination</code></strong></pre></td></tr></table></div></dd><dt><span class="term">
Filtering an assembly for contigs of length ≥2000 and an
average coverage ≥ 10, while translating from MAF to CAF:
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert -x 2000 -y 10 source.maf destination.caf</code></strong></pre></dd><dt><span class="term">
Filtering a FASTQ file for reads ≥ 55 base pairs, renaming the
selected reads with a string starting <span class="quote">“<span class="quote">newname</span>”</span> and
saving them back to FASTQ. Note how [-t fastq] was left out
as the default behaviour of <span class="command"><strong>miraconvert</strong></span> is
to use the same "to" type as the input type ( [-f]).
</span></dt><dd><pre class="screen">
<strong class="userinput"><code>miraconvert -x 55 -R newname source.fastq destination.fastq</code></strong></pre></dd><dt><span class="term">
Filtering and reordering contigs of an assembly according to external contig name list.
</span></dt><dd><p>
This example will fetch the contigs named bchoc_c14, ...3, ...5
and ...13 and save the result in exactly that order to a new
file:
</p><pre class="screen"><code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l</code></strong>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users 38 2007-10-21 15:16 contigs.lst
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>cat contigs.lst</code></strong>
bchoc_c14
bchoc_c3
bchoc_c5
bchoc_c13
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>miraconvert -N contigs.lst bchoc_out.caf myfilteredresult.caf</code></strong>
[...]
<code class="prompt">arcadia:/path/to/myProject$</code> <strong class="userinput"><code>ls -l</code></strong>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users 38 2007-10-21 15:16 contigs.lst
-rw-r--r-- 1 bach users 828726 2007-10-21 15:24 myfilteredresult.caf</pre></dd></dl></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_mutils_bait"></a>10.2. mirabait - a "grep" like tool to select reads with kmers up to 256 bases</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_synopsis"></a>10.2.1.
Synopsis
</h3></div></div></div><div class="cmdsynopsis"><p><code class="command">mirabait</code> [options] {-b <em class="replaceable"><code>baitfile</code></em> [-b ...] | -B <em class="replaceable"><code>file</code></em>} [-p <em class="replaceable"><code>file1 file2</code></em> | -P <em class="replaceable"><code>file3</code></em>]*
[<em class="replaceable"><code>file4 ...</code></em>]</p></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The above command line, especially with mandatory [-b] format
appeared only in MIRA 4.9.0 and represents a major change to 4.0.x!
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_description"></a>10.2.2. Description</h3></div></div></div><p>
<span class="command"><strong>mirabait</strong></span> selects reads from a read collection which
are partly similar or equal to sequences defined as target
baits. Similarity is defined by finding a user-adjustable number of
common k-mers (sequences of k consecutive bases) which are the same in
the bait sequences and the screened sequences to be selected, either in forward
or reverse complement direction.
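</p><p>
The k-mer criterion can be illustrated with a toy shell sketch. This is not
how mirabait is implemented; it only counts, for one bait and one read, how
many distinct k-mers of the read (forward or reverse complement) also occur
in the bait. The function names <code class="literal">kmers</code>,
<code class="literal">revcomp</code> and
<code class="literal">shared_kmers</code> as well as the example sequences
are made up for this illustration.
</p><pre class="screen">
```shell
# kmers SEQ K: print every k-mer of SEQ, one per line.
kmers() {
  seq=$1; k=$2; i=1
  while [ $((i + k - 1)) -le ${#seq} ]; do
    printf '%s\n' "$seq" | cut -c "$i-$((i + k - 1))"
    i=$((i + 1))
  done
}

# revcomp SEQ: reverse complement of an upper-case DNA sequence.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# shared_kmers BAIT READ K: number of distinct k-mers of READ, taken in
# forward and reverse complement direction, that also occur in BAIT.
shared_kmers() {
  bait_kmers=$(kmers "$1" "$3" | sort -u)
  { kmers "$2" "$3"; kmers "$(revcomp "$2")" "$3"; } | sort -u |
    grep -c -x -F -e "$bait_kmers"
}

# The read shares GATT, ATTA and TTAC with the bait.
shared_kmers GATTACAGG CCGATTACT 4   # prints 3
# The same read given as its reverse complement yields the same count.
shared_kmers GATTACAGG AGTAATCGG 4   # prints 3
```
</pre><p>
A read would then count as baited if this number reaches the user-adjustable
threshold mentioned above.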
</p><p>
When used on paired files (-p or -P), selects read pairs where at least
one read matches.
</p><p>
One can use <span class="command"><strong>mirabait</strong></span> to do targeted assembly by
fishing out reads belonging to a gene and just assemble these; or to
clean out rRNA sequences from data sets; or to fish out and
iteratively reconstruct mitochondria from metagenomic data; or, or, or
... whenever one has to take in or take out subsets of reads based on
kmer equality, this tool should come in quite handy.
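</p><p>
Following the synopsis above, a call for a paired-end data set could look
like the sketch below. All file names are placeholders, and the command is
merely printed if mirabait or the input files are not available.
</p><pre class="screen">
```shell
# Sketch: bait read pairs against the sequences in one bait file.
# Assumptions: gene.fasta, reads_1.fastq and reads_2.fastq are placeholders.
bait="gene.fasta"
reads1="reads_1.fastq"
reads2="reads_2.fastq"
cmd="mirabait -b $bait -p $reads1 $reads2"

# Run only if mirabait is installed and all inputs exist;
# otherwise just show what would be run.
if command -v mirabait >/dev/null 2>&1 &&
   [ -f "$bait" ] && [ -f "$reads1" ] && [ -f "$reads2" ]; then
  $cmd
else
  echo "would run: $cmd"
fi
```
</pre><p>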
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The search performed is exact, that is, sequences selected are
guaranteed to have the required number of matching k-mers to the bait
sequences while sequences not selected are guaranteed not to have them.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_options"></a>10.2.3. Options</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_mainoptions"></a>10.2.3.1. Main options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-b <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
A file containing sequences to be used as bait. The file can
be in any of the following types: FASTQ, FASTA, GenBank (.gbf,
.gbk, .gbff), CAF, MAF or Staden EXP.
</p><p>
If the file extension is non-standard
(e.g. <code class="filename">file.dat</code>), you can force a file type
by using the double colon type specification like in
EMBOSS. E.g.: <code class="filename">fastq::file.dat</code>
</p><p>
Using multiple -b for loading bait sequences from multiple
files is allowed.
</p></dd><dt><span class="term"><code class="option">-B <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
Load bait from an existing kmer statistics file, not from
sequence files. Only one -B allowed, cannot be combined with
-b. See -K on how to create such a file.
</p><p>
This option comes in handy when you always want to bait
against a given set of sequences, e.g., rRNA sequences or the
human genome, and where the statistics computation itself may
be quite time and resource intensive. Once computed and saved
via [-K], a baiting process loading the statistics
via [-B] can start much faster.
</p></dd><dt><span class="term"><code class="option">-j <em class="replaceable"><code>job</code></em></code></span></dt><dd><p>
Set option for predefined job from supplied MIRA library. Currently available jobs:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
'rrna': Bait rRNA/rDNA sequences. Locked options: [-b,
-B, -k, -K, -n]. In the current version mirabait will
use a hash statistics file with 21mers derived from a
subset of the RFAM 12 rRNA database to bait rRNA/rDNA
reads. The supplied subset should catch all but the most
uncommon rRNA data; if needed, one could increase
the sensitivity by decreasing [-n].
</p></li></ul></div><p>
Note that [-j] will hardwire a number of options to
values that are optimal for the chosen job. It is not advisable
to change the 'locked' options, as this either breaks the
functionality or, worse, could lead to undefined results.
</p></dd><dt><span class="term"><code class="option">-p <em class="replaceable"><code>file_1 file_2</code></em></code></span></dt><dd><p>
Load the sequences to be baited from files
<code class="filename">file_1</code> and
<code class="filename">file_2</code>. The sequences are treated as
pairs, where each read in the first file is paired with a read in the
second file. The files can be in any of the following formats:
FASTQ, FASTA, GenBank (.gbf, .gbk, .gbff), CAF, MAF or Staden
EXP.
</p><p>
If the file extension is non-standard
(e.g. <code class="filename">file.dat</code>), you can force a file type
by using the double colon type specification as in
EMBOSS, e.g.: <code class="filename">fastq::file.dat</code>
</p><p>
Using multiple -p for baiting sequences from multiple file
pairs is allowed.
</p></dd><dt><span class="term"><code class="option">-P <em class="replaceable"><code>file</code></em></code></span></dt><dd><p>
Load the sequences to be baited from file
<code class="filename">file</code>. The sequences are treated as pairs,
where a read in the file is immediately followed by its paired
read. The file can be in any of the following formats: FASTQ,
FASTA, GenBank (.gbf, .gbk, .gbff), CAF, MAF or Staden
EXP.
</p><p>
If the file extension is non-standard
(e.g. <code class="filename">file.dat</code>), you can force a file type
by using the double colon type specification as in
EMBOSS, e.g.: <code class="filename">fastq::file.dat</code>
</p><p>
Using multiple -P for baiting sequences from multiple files is
allowed.
</p></dd><dt><span class="term"><code class="option">-k <em class="replaceable"><code>kmer-length</code></em></code></span></dt><dd><p>
Length of the k-mers used as bait, in bases (≤256, default=31).
</p></dd><dt><span class="term"><code class="option">-n <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Default value: 1.
</p><p>
If the integer given is > 0: minimum number of kmers needed
for a sequence to be selected.
</p><p>
If the integer given is ≤ 0: maximum number of missed kmers
allowed over sequence length for a sequence to be selected.
</p></dd><dt><span class="term"><code class="option">-d</code></span></dt><dd><p>
Do not use kmers with microrepeats (DUST-like). Standard
length for microrepeats is 67% of kmer length, see
[-D] to change this.
</p><p>
Microrepeats are defined as repeats of a 1, 2, 3 or 4 base
motif at either end (not in the middle) of a kmer. E.g.: a
kmer of 17 will have a microrepeat length of 12 bases, so
that all kmers having 12 identical bases (A, C, G or T) at either end will be
filtered away. E.g.: AAAAAAAAAAAAnnnnn as well as
nnnnnAAAAAAAAAAAA will be filtered.
</p><p>
E.g. for repeats of 2 bases: AGAGAGAGAGAGnnnnn or CACACACACACAnnnnn.
</p><p>
E.g. for repeats of 3 bases: ACGACGACGACGnnnnn.
</p><p>
E.g. for repeats of 4 bases: ACGTACGTACGTnnnnn.
</p><p>
Microrepeat motifs will truncate at the end of allocated
microrepeat length. E.g. kmer length 20 with microrepeat
length of 13 and 4 base repeat: ACGTACGTACGTAnnnnnnn.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
When saving the kmer statistics via [-K], having
[-d] will save kmer statistics where kmers with
microrepeats have already been removed. Use this when you
always want to have microrepeats removed from a given bait
data set, as [-d] will then not be needed when loading that
set via [-B] later (which saves time).
</p><p>
If you want to be able to bait from precomputed kmer
statistics both with and without microrepeats, use
[-d] only when loading the statistics file with
[-B], not when creating it with [-K].
</p></td></tr></table></div></dd><dt><span class="term"><code class="option">-D <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Set length of microrepeats in kmers to discard from bait.
</p><p>
int > 0: microrepeat length in percentage of kmer length.
E.g.: -k 17 -D 67 --> 67% of 17 bases = 11.39 bases --> 12 bases.
</p><p>
int < 0: microrepeat length in bases.
</p><p>
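The percentage and base-count cases above, together with the microrepeat
definition given for [-d], can be sketched in a few lines of Python. This is
only an illustration and not mirabait's actual code: the function names are
hypothetical, and rounding up is an assumption taken from the
11.39 --> 12 example.
</p>

```python
def microrepeat_length(k, d=67):
    """Microrepeat length in bases for k-mer length k and a -D style value d.

    Assumption: percentages are rounded up (67% of 17 = 11.39 -> 12).
    """
    if d > 0:
        # d is a percentage of the k-mer length; integer ceiling division
        return (k * d + 99) // 100
    # d < 0 gives the length directly in bases; d == 0 means "filter off"
    return -d


def has_microrepeat(kmer, rep_len):
    """True if a 1-4 base motif repeats over rep_len bases at either end."""
    if rep_len <= 0 or rep_len > len(kmer):
        return False
    for window in (kmer[:rep_len], kmer[-rep_len:]):
        for motif_len in (1, 2, 3, 4):
            motif = window[:motif_len]
            # Repeat the motif and truncate it at the microrepeat length
            repeated = (motif * (rep_len // motif_len + 1))[:rep_len]
            if repeated == window:
                return True
    return False
```

<p>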
int != 0 implies -d, int=0 turns DUST filter off
</p></dd><dt><span class="term"><code class="option">-i</code></span></dt><dd><p>
Inverse selection: selects only sequences that do not meet the
-k and -n criteria.
</p></dd><dt><span class="term"><code class="option">-I</code></span></dt><dd><p>
Filters and writes sequences which hit to one file and
sequences which do not hit to a second file.
</p></dd><dt><span class="term"><code class="option">-r</code></span></dt><dd><p>
Does not check for hits in reverse complement direction.
</p></dd><dt><span class="term"><code class="option">-t <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Number of threads to use. The default value of 0 is configured
to automatically use up to 4 CPU cores (if present). Numbers
higher than 4 (or maybe 8) will probably not make much sense
because of diminishing returns.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_outputdef"></a>10.2.3.2. File type options</h4></div></div></div><p>
Normally, mirabait writes separate result files (named
<code class="filename">bait_match_*</code> and
<code class="filename">bait_miss_*</code>) for each input to the current
directory. For changing this behaviour, and others relating to
output, use these options:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-c</code></span></dt><dd><p>
Normally, mirabait will change the case of the sequences it
loads to denote kmers which hit a bait in upper case and kmers
which did not hit a bait in lower case. Using -c switches off
this behaviour.
</p></dd><dt><span class="term"><code class="option">-l <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Set length of sequence line in FASTA output.
</p></dd><dt><span class="term"><code class="option">-K <em class="replaceable"><code>filename</code></em></code></span></dt><dd><p>
Save kmer statistics (for baits loaded via [-b]) to
<code class="filename">filename</code>.
</p><p>
As the calculation of kmers can take quite some time for
larger sequences (e.g., human genome), this option is
extremely useful if you want to perform the same baiting
operation more than once. Once calculated, the kmer statistics
are saved and can be reloaded for a later baiting operation via
[-B].
</p></dd><dt><span class="term"><code class="option">-N <em class="replaceable"><code>name</code></em></code></span></dt><dd><p>
Change the file prefix <code class="filename">bait</code> to
<code class="filename">name</code>. Has no effect if -o/-O is used and
targets are not directories.
</p></dd><dt><span class="term"><code class="option">-o <em class="replaceable"><code>path</code></em></code></span></dt><dd><p>
Save sequences matching a bait to
<code class="filename">path</code>. If <code class="filename">path</code> is a
directory, write separate files into this directory. If not,
combine all matching sequences from the input file(s) into a
single file specified by the path.
</p></dd><dt><span class="term"><code class="option">-O <em class="replaceable"><code>path</code></em></code></span></dt><dd><p>
Like -o, but for sequences not matching.
</p></dd></dl></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_mutils_bait_other"></a>10.2.3.3. Other options</h4></div></div></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="option">-T <em class="replaceable"><code>dir</code></em></code></span></dt><dd><p>
Use <code class="filename">dir</code> as directory for temporary files
instead of the current working directory.
</p></dd><dt><span class="term"><code class="option">-m <em class="replaceable"><code>integer</code></em></code></span></dt><dd><p>
Default is <span class="underline">75</span>. Defines
the memory MIRA can use to compute kmer statistics. It therefore
does not apply when using [-B].
</p><p>
A value of <span class="underline">>100</span> is
interpreted as absolute value in megabyte. E.g., 16384 = 16384
megabyte = 16 gigabyte.
</p><p>
A value of <span class="underline">0 ≤ x ≤100</span> is
interpreted as relative value of free memory at the time of
computation. E.g.: for a value of 75% and 10 gigabyte of free
memory, it will use 7.5 gigabyte.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The minimum amount of memory this algorithm will use is 512 MiB
on 32 bit systems and 2 GiB on 64 bit systems.
</td></tr></table></div></dd></dl></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_examples"></a>10.2.4. Usage examples</h3></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The examples below, together with the manual above, should be enough to get
you going. If there's a typical use case you are missing, feel free to
suggest it on the MIRA talk mailing list.
</td></tr></table></div><p>Baiting unpaired sequences, bait sequences in FASTA, sequences in FASTQ:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta file.fastq</code></strong></pre><p>Same as above, but baits in two files (FASTA and GenBank):</p><pre class="screen"><strong class="userinput"><code>mirabait -b b1.fasta -b b2.gbk file.fastq</code></strong></pre><p>Baiting paired sequences, read pairs are in two files:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>Baiting paired sequences, pairs are interleaved in one file:</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -P file.fastq</code></strong></pre><p>Like above, but selecting sequences which do not match the baits:</p><pre class="screen"><strong class="userinput"><code>mirabait -i -b b.fasta -P file.fastq</code></strong></pre><p>
Baiting paired sequences (<code class="filename">file_1.fastq</code>,
<code class="filename">file_2.fastq</code> and
<code class="filename">file3.fasta</code>) and unpaired sequences
(<code class="filename">file4.caf</code>), all at once and with different file
types:
</p><pre class="screen"><strong class="userinput"><code>mirabait -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf</code></strong></pre><p>
Like above, but writing sequences matching baits and sequences not
matching baits to different files:
</p><pre class="screen"><strong class="userinput"><code>mirabait -I -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf</code></strong></pre><p>Change bait criterion to need 10 kmers of size 27:</p><pre class="screen"><strong class="userinput"><code>mirabait -k 27 -n 10 -b b.fasta file.fastq</code></strong></pre><p>
Change bait criterion to baiting only reads which have all kmers
present in the bait:
</p><pre class="screen"><strong class="userinput"><code>mirabait -n 0 -b b.fasta file.fastq</code></strong></pre><p>
Change bait criterion to baiting all reads having almost all kmers
present in the bait, but allowing for up to 40 kmers not in the bait:
</p><pre class="screen"><strong class="userinput"><code>mirabait -n -40 -b b.fasta file.fastq</code></strong></pre><p>
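The three [-n] examples above follow one selection rule, which can be written
down as a small Python sketch. It is a reading of the option description, not
mirabait's own code; the function and parameter names are hypothetical, and
the <code class="literal">invert</code> flag mimics what [-i] is described to do.
</p>

```python
def is_selected(kmers_hit, kmers_total, n=1, invert=False):
    """Decide whether a sequence is baited, following the -n description.

    kmers_hit:   number of kmers of the sequence found in the bait
    kmers_total: number of kmers in the sequence
    n > 0:       at least n kmers must hit
    n <= 0:      at most -n kmers may miss
    """
    if n > 0:
        hit = kmers_hit >= n
    else:
        hit = (kmers_total - kmers_hit) <= -n
    # invert=True corresponds to the inverse selection described for -i
    return (not hit) if invert else hit
```

<p>
With 100 kmers in a read, <code class="literal">is_selected(60, 100, n=-40)</code>
accepts the read, whereas 59 hits would leave 41 missed kmers and reject it.
</p>

<p>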
Force bait sequences to load as FASTA, force sequences to be baited to
be loaded as FASTQ:
</p><pre class="screen"><strong class="userinput"><code>mirabait -b fasta::b.dat fastq::file.dat</code></strong></pre><p>
Write result files to directory <code class="filename">/dev/shm/</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/ -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
Merge all result files containing sequences hitting baits to file
<code class="filename">/dev/shm/match</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/match -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
Like above, but also merge all result files containing sequences not
hitting baits to file <code class="filename">/dev/shm/nomatch</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -o /dev/shm/match -O /dev/shm/nomatch -b b.fasta -p file_1.fastq file_2.fastq</code></strong></pre><p>
Fetch all reads having rRNA motifs in paired FASTQ files:
</p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -p file1.fastq file2.fastq</code></strong></pre><p>
Fetch all reads not having rRNA motifs in paired FASTQ files:
</p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -i -p file1.fastq file2.fastq</code></strong></pre><p>
Split paired FASTQ files into two sets of files (4 files total), one
containing rRNA reads and one containing non-rRNA reads:
</p><pre class="screen"><strong class="userinput"><code>mirabait -j rrna -I -p file1.fastq file2.fastq</code></strong></pre><p>
Assuming the file <code class="filename">human_genome.fasta</code> contains the
human genome: bait all read pairs matching the human genome. Also,
save the computed kmer statistics for later re-use to file
<code class="filename">HG_kmerstats.mhs.gz</code>:
</p><pre class="screen"><strong class="userinput"><code>mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz -p file1.fastq file2.fastq</code></strong></pre><p>
The same as above, but just precompute and save the kmer statistics, no actual baiting done.
</p><pre class="screen"><strong class="userinput"><code>mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz</code></strong></pre><p>
Using the precomputed kmer statistics from the command above: bait
files with read pairs for human reads:
</p><pre class="screen"><strong class="userinput"><code>mirabait -B HG_kmerstats.mhs.gz -p file_1.fastq file_2.fastq</code></strong></pre></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_mutils_bait_installrrnadb"></a>10.2.5. Installing different rRNA databases</h3></div></div></div><p>
The standard database for rRNA baiting supplied with the MIRA source
code and binary packages is called
<code class="filename">rfam_rrna-21-12.sls.gz</code> which will get installed
as <span class="emphasis"><em>MHS</em></span> (MiraHashStatistics) file into
<code class="filename">$BINDIR/../share/mira/mhs/rfam_rrna-21-12.mhs.gz</code>
(where $BINDIR is the directory where the mira/mirabait binary
resides) and a soft link pointing from
<code class="filename">filter_default_rrna.mhs.gz</code> to
<code class="filename">rfam_rrna-21-12.mhs.gz</code> like so:
</p><pre class="screen"><code class="prompt">arcadia:~$</code> <strong class="userinput"><code>which mira</code></strong>
/usr/local/bin/mira
<code class="prompt">arcadia:~$</code> <strong class="userinput"><code>ls -l /usr/local/share/mira/mhs</code></strong>
lrwxrwxrwx 1 bach bach 22 Mar 24 23:58 filter_default_rrna.mhs.gz -> rfam_rrna-21-12.mhs.gz
-rw-rw-r-- 1 bach bach 148985059 Mar 24 23:58 rfam_rrna-21-12.mhs.gz</pre><p>
The file naming scheme for the database is as follows:
dbidentifier-kmerlength-kmerfreqcutoff. The standard database is therefore:
<code class="filename">rfam_rrna</code> as identifier for the RFAM rRNA
sequences (currently RFAM 12), then 21 defining a kmer length of 21
and finally a kmer cutoff frequency of 12, meaning that kmers must
have been seen at least 12 times in the RFAM database to be stored in
the subset.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
The value of 12 as frequency cutoff for the standard mirabait rRNA
database was chosen as a compromise between sensitivity and database
size.
</td></tr></table></div><p>
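The naming scheme dbidentifier-kmerlength-kmerfreqcutoff described above can be
taken apart mechanically, e.g. to check which kmer length a database file was
built with. The following Python sketch is purely illustrative; the function
name is hypothetical, and it assumes that the identifier itself contains no
dot and that the last two dash-separated fields are integers.
</p>

```python
def parse_db_name(filename):
    """Split 'dbidentifier-kmerlength-kmerfreqcutoff.ext' into its parts."""
    # Drop everything from the first dot on (".mhs.gz", ".sls.gz", ...)
    stem = filename.split(".", 1)[0]
    # The identifier may contain dashes, so split from the right
    identifier, kmer_length, cutoff = stem.rsplit("-", 2)
    return identifier, int(kmer_length), int(cutoff)


# The standard database: RFAM rRNA, kmer length 21, frequency cutoff 12
standard_db = parse_db_name("rfam_rrna-21-12.mhs.gz")
```

<p>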
Although rRNA sequences are well conserved overall, the cutoff frequency
also implies that kmers from rare rRNA variants will not be present in
the database, which may cost some sensitivity for rRNA from rarely
sequenced organisms. More sensitive versions of the
rRNA database can therefore be installed by downloading a file from the MIRA
repository at SourceForge and calling a script provided by MIRA. To
install a version with a kmer size of 21 and a cutoff frequency of,
e.g., 3, download <code class="filename">rfam_rrna-21-3.sls.gz</code> and
install it like this:
</p><pre class="screen"><code class="prompt">arcadia:~/tmp$</code> <strong class="userinput"><code>ls</code></strong>
<code class="prompt">arcadia:~/tmp$</code> <strong class="userinput"><code>wget https://sourceforge.net/projects/mira-assembler/files/MIRA/slsfiles/rfam_rrna-21-3.sls.gz</code></strong>
...
</pre><p>
</p><p>
TODO: continue docs here.
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_hard"></a>Chapter 11. Assembly of <span class="emphasis"><em>hard</em></span> genome or EST / RNASeq projects</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_hard_getting_mean_data_assembled">11.1.
Getting 'mean' genomes or EST / RNASeq data sets assembled
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_hard_for_the_impatient">11.1.1.
For the impatient
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_introduction_to_masking">11.1.2.
Introduction to 'masking'
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_how_does_nasty_repeat_masking_work">11.1.3.
How does 'nasty repeat' masking work?
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_selecting_a_nasty_repeat_ratio">11.1.4.
Selecting a "nasty repeat ratio"
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_hard_how_MIRA_tags_different_repeat_levels">11.2.
How MIRA tags different repeat levels
</a></span></dt><dt><span class="sect1"><a href="#sect_hard_the_readrepeats_info_file">11.3.
The readrepeats info file
</a></span></dt><dt><span class="sect1"><a href="#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data">11.4.
Pipeline to find worst contaminants or repeats in sequencing data
</a></span></dt><dt><span class="sect1"><a href="#sect_hard_examples_for_kmer_statistics">11.5.
Examples for kmer statistics
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_hard_caveat:_sk:kms">11.5.1.
Caveat: -SK:kmer_size
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_sanger_sequencing_a_simple_bacterium">11.5.2.
Sanger sequencing, a simple bacterium
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_454_sequencing_a_somewhat_more_complex_bacterium">11.5.3.
454 Sequencing, a somewhat more complex bacterium
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_solexa_sequencing_ecoli_mg1655">11.5.4.
Solexa sequencing, E.coli MG1655
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_need_examples_for_eukaryotes">11.5.5.
(NEED EXAMPLES FOR EUKARYOTES)
</a></span></dt><dt><span class="sect2"><a href="#sect_hard_need_examples_for_pathological_cases">11.5.6.
(NEED EXAMPLES FOR PATHOLOGICAL CASES)
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">If it were easy, it would have been done already.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_getting_mean_data_assembled"></a>11.1.
Getting 'mean' genomes or EST / RNASeq data sets assembled
</h2></div></div></div><p>
</p><p>
For some data sets you might want to assemble, MIRA will take too
long or the available memory will not be sufficient. For genomes this
can be the case for eukaryotes, plants, but also for some bacteria which
contain a high number of (pro-)phages, plasmids or engineered operons. For
EST data sets, this concerns all projects with non-normalised libraries.
</p><p>
This guide is intended to get you through these problematic genomes. It
is not (and cannot be) exhaustive, but it should get you going.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_for_the_impatient"></a>11.1.1.
For the impatient
</h3></div></div></div><p>
For bacteria with nasty repeats, try first
[--hirep_something]. This will increase runtime and memory
requirements, but helps to get this sorted out. If the data for lower
eukaryotes leads to runtime and memory explosion, try either
[--hirep_good] or, for desperate cases,
[--hirep_something].
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_introduction_to_masking"></a>11.1.2.
Introduction to 'masking'
</h3></div></div></div><p>
The SKIM phase (all-against-all comparison) will report almost every potential
hit to be checked with Smith-Waterman further downstream in the MIRA assembly
process. While this is absolutely no problem for most bacteria, some genomes
(eukaryotes, plants, some bacteria) have so many closely related sequences
(repeats) that the data structures needed to take up all information might get
much larger than your available memory. In those cases, your only chance to
still get an assembly is to tell the assembler it should disregard extremely
repetitive features of your genome.
</p><p>
There is, in most cases, one problem: one doesn't know beforehand which parts
of the genome are extremely repetitive. But MIRA can help you here as it
produces most of the needed information during assembly and you just need to
choose a threshold from where on MIRA won't care about repetitive matches.
</p><p>
The key to this are the three fail-safe command line parameters which will mask
"nasty" repeats from the quick overlap finder (SKIM): [-KS:mnr] together with
either [-KS:nrr] or [-KS:nrc]. I'll come back
to [-SK:kms] later as it also plays a role in this.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_how_does_nasty_repeat_masking_work"></a>11.1.3.
How does 'nasty repeat' masking work?
</h3></div></div></div><p>
</p><p>
If switched on via [-KS:mnr=yes], MIRA will use k-mer statistics to
find repetitive stretches. K-mers are nucleotide stretches of length k. In a
perfectly sequenced genome without any sequencing error and without sequencing
bias, the k-mer frequency can be used to assess how many times a given
nucleotide stretch is present in the genome: if a specific k-mer is present as
many times as the average frequency of all k-mers, it is reasonable
to assume that the specific k-mer is not part of a repeat (at
least not in this genome).
</p><p>
Following the same line of thought, if the frequency of a specific k-mer is two
times higher than the average of all k-mers, one would assume that this
specific k-mer is part of a repeat which occurs exactly two times in the
genome. For a 3x k-mer frequency, the repeat is present three times, and so on. MIRA
will merge the frequency information of single k-mers into larger 'repeat'
stretches and tag these stretches accordingly.
</p><p>
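The reasoning above can be tried out on toy data with a few lines of Python.
This is a minimal sketch of the idea only, not of MIRA's implementation: the
function names are hypothetical, and real data would additionally suffer from
the sequencing errors and coverage bias that the idealised model ignores.
</p>

```python
from collections import Counter
from statistics import mean


def kmer_counts(sequences, k):
    """Count every k-mer of length k in the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def copy_estimate(counts, kmer):
    """Frequency of one k-mer relative to the average of all k-mers."""
    return counts[kmer] / mean(counts.values())


# Toy example: "ACGT" occurs twice, every other 4-mer once
toy_counts = kmer_counts(["ACGTACGT"], 4)
toy_ratio = copy_estimate(toy_counts, "ACGT")
```

<p>
A ratio close to 1 suggests a non-repetitive k-mer, a ratio close to 2 a
repeat with two copies, and so on.
</p>

<p>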
Of course, low-complexity nucleotide stretches (like poly-A in eukaryotes),
sequencing errors in reads and non-uniform distribution of reads in a
sequencing project will weaken the initial assumption that a k-mer frequency
is representative for repeat status. But even then the k-mer frequency model
works quite well and will give a pretty good overall picture: most repeats
will be tagged as such.
</p><p>
Note that the parts of reads tagged as "nasty repeat" will not get masked per
se, the sequence will still be present. The stretches dubbed repetitive will
get the "MNRr" tag. They will still be used in Smith-Waterman overlaps and
will generate a correct consensus if included in an alignment, but they will
not be used as seed.
</p><p>
Some reads will invariably end up being completely repetitive. These
will not be assembled into contigs as MIRA will not see overlaps as
they'll be completely masked away. These reads will end up as
debris. However, note that MIRA is pretty good at discerning 100%
matching repeats from repeats which are not 100% matching: if there's
a single base with which repeats can be discerned from each other,
MIRA will find this base and use the k-mers covering that base to find
overlaps.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_selecting_a_nasty_repeat_ratio"></a>11.1.4.
Selecting a "nasty repeat ratio"
</h3></div></div></div><p>
</p><p>
The ratio above which the MIRA kmer statistics algorithm will no longer
report matches is set via [-KS:nrr]. E.g.,
using [-KS:nrr=10] will hide all k-mers which occur at a
frequency 10 times (or more) higher than the median of all k-mers.
</p><p>
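As an illustration of the threshold just described, the following Python
sketch picks out the k-mers that a given ratio would hide. It is not taken
from MIRA: the function name and the example counts are made up, and the
k-mer frequencies are assumed to be available as a simple dictionary.
</p>

```python
from statistics import median


def nasty_kmers(counts, nrr=10):
    """Return the k-mers occurring at least nrr times the median frequency."""
    med = median(counts.values())
    return {kmer for kmer, freq in counts.items() if freq >= nrr * med}


# Hypothetical k-mer frequencies; the median is 2, so the cutoff is 20
example_counts = {"ACGT": 2, "CGTA": 2, "GTAC": 2, "AAAA": 20, "TTTT": 19}
hidden = nasty_kmers(example_counts, nrr=10)
```

<p>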
The nastiness of a repeat is difficult to judge, but starting with 10 copies
in a genome, things can get complicated. At 20 copies, you'll have some
troubles for sure.
</p><p>
The standard value of <span class="emphasis"><em>10</em></span> for
the [-KS:nrr] parameter is a good starting point
which can be tried for an assembly before trying to optimise it by
studying the kmer statistics calculated by MIRA. For the latter, please
read the section 'Examples for kmer statistics' further down in this
guide.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_how_MIRA_tags_different_repeat_levels"></a>11.2.
How MIRA tags different repeat levels
</h2></div></div></div><p>
During SKIM phase, MIRA will assign frequency information to each and every
k-mer in all reads of a sequencing project, giving them different
status. Additionally, tags are set in the reads so that one can
assess reads in assembly editors that understand tags (like gap4,
gap5, consed etc.). The following tags are used:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
HAF2
</span></dt><dd><p> coverage below average ( default: < 0.5 times average)
</p></dd><dt><span class="term">
HAF3
</span></dt><dd><p> coverage is at average ( default: ≥ 0.5 times average and ≤ 1.5 times average)
</p></dd><dt><span class="term">
HAF4
</span></dt><dd><p> coverage above average ( default: > 1.5 times average and < 2 times average)
</p></dd><dt><span class="term">
HAF5
</span></dt><dd><p> probably repeat ( default: ≥ 2 times average and < 5 times average)
</p></dd><dt><span class="term">
HAF6
</span></dt><dd><p> 'crazy' repeat ( default: > 5 times average)
</p></dd><dt><span class="term">
MNRr
</span></dt><dd><p> stretches which were masked away by [-KS:<em class="replaceable"><code>mnr=yes</code></em>]
being more repetitive than deduced
by [-KS:<em class="replaceable"><code>nrr=...</code></em>] or given via [-KS:<em class="replaceable"><code>nrc=...</code></em>].
</p></dd></dl></div><p>
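The default HAF boundaries listed above translate into a simple lookup, shown
here as a Python sketch. The function name is hypothetical and this is not
MIRA's code. Note that the list leaves a ratio of exactly 5 unassigned; the
sketch places it in HAF6, which is an assumption. MNRr is not covered, as it
depends on the [-KS:nrr] or [-KS:nrc] settings.
</p>

```python
def haf_tag(frequency, average):
    """Map a k-mer frequency to a HAF tag using the default boundaries."""
    ratio = frequency / average
    if ratio < 0.5:
        return "HAF2"  # coverage below average
    if ratio <= 1.5:
        return "HAF3"  # coverage at average
    if ratio < 2:
        return "HAF4"  # coverage above average
    if ratio < 5:
        return "HAF5"  # probably repeat
    # Assumption: a ratio of exactly 5 is treated like "> 5 times average"
    return "HAF6"      # 'crazy' repeat
```

<p>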
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_the_readrepeats_info_file"></a>11.3.
The readrepeats info file
</h2></div></div></div><p>
If [-KS:mnr=yes] is used, MIRA will write an additional file into the
info directory:
<code class="filename"><projectname>_info_readrepeats.lst</code>
</p><p>
The "readrepeats" file makes it possible to try and find out what makes
sequencing data nasty. It's a key-value-value file with the name of the
sequence as "key" and then the type of repeat (HAF2 - HAF7 and MNRr) and
the repeat sequence as "values". "Nasty" in this case means
<span class="emphasis"><em>everything which was masked via
[-KS:mnr=yes]</em></span>.
</p><p>
The file looks like this:
</p><pre class="screen">
read1 HAF5 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
read2 HAF7 CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
read2 MNRr AAAAAAAAAAAAAAAAAAAAAAAAAAAA ...
read3 HAF6 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
...
etc.
</pre><p>
That is, each line consists of the read name where a stretch of
repetitive sequences was found, then the MIRA repeat categorisation
level (HAF2 to HAF7 and MNRr) and then the stretch of bases which is
seen to be repetitive.
</p><p>
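A file with this layout is easy to process with a short script. The Python
sketch below reads such lines and counts how often each repetitive stretch
occurs for a given tag. It assumes that the three fields are separated by
whitespace (tabs or blanks) and that lines with fewer than three fields can be
skipped; the function names are hypothetical.
</p>

```python
from collections import Counter


def parse_readrepeats(lines):
    """Yield (read name, repeat tag, repeat sequence) for each valid line."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip empty or incomplete lines
        yield fields[0], fields[1], fields[2]


def most_common_stretches(lines, tag="MNRr", top=15):
    """Count identical repeat stretches carrying the given tag."""
    counts = Counter(seq for _, t, seq in parse_readrepeats(lines) if t == tag)
    return counts.most_common(top)


# Made-up example lines in the layout shown above
sample = [
    "read1\tHAF5\tGCTTCGGCTTCG",
    "read2\tHAF6\tCCGAAGCCGAAG",
    "read2\tMNRr\tAAAAAAAAAAAA",
    "read3\tMNRr\tAAAAAAAAAAAA",
    "",
]
records = list(parse_readrepeats(sample))
top_mnrr = most_common_stretches(sample)
```

<p>
To process a real file, pass an open file object, e.g.
<code class="literal">most_common_stretches(open(path))</code>, where
<code class="literal">path</code> points to the readrepeats file of your project.
</p>

<p>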
Note that reads can have several disjunct repeat stretches in a single
read, hence they can occur more than one time in the file as shown with
<span class="emphasis"><em>read2</em></span> in the example above.
</p><p>
One will need to search some databases with the "nasty" sequences and find
vector sequences, adaptor sequences or even human sequences in bacterial or
plant genomes ... or vice versa as this type of contamination happens quite
easily with data from new sequencing technologies. After a while one gets a
feeling what constitutes the largest part of the problem and one can start to
think of taking countermeasures like filtering, clipping, masking etc.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data"></a>11.4.
Pipeline to find worst contaminants or repeats in sequencing data
</h2></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
In case you are not familiar with UNIX pipes, now would be a good time
to read an introductory text on how this wonderful system works. You
might want to start with a short introductory article at Wikipedia:
<a class="ulink" href="http://en.wikipedia.org/wiki/Pipeline_%28Unix%29" target="_top">http://en.wikipedia.org/wiki/Pipeline_%28Unix%29</a>
</p><p>
In a nutshell: instead of output to files, a pipe directs the output
of one program as input to another program.
</p></td></tr></table></div><p>
There's one very simple trick I use to find out whether your data contains
some kind of sequencing vector or adaptor contamination. It
makes use of the read repeat file discussed above.
</p><p>
The following example demonstrates this on a 454 data set where the
sequencing provider used some special adaptor in the wet lab but somehow
forgot to tell the Roche pre-processing software about it, so that a
very large fraction of reads in the SFF file had unclipped adaptor
sequence in it (which of course wreaks havoc with assembly programs):
</p><pre class="screen"><code class="prompt">arcadia:$</code> <strong class="userinput"><code>grep MNRr <em class="replaceable"><code>badproject</code></em>_info_readrepeats.lst | cut -f 3| sort | uniq -c |sort -g -r | head -15</code></strong>
504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC</pre><p>
You probably see a sequence pattern
CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC in the above screenshot. Before
going into details of what you are actually seeing, here's an
explanation of how this pipeline works:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
grep MNRr <em class="replaceable"><code>badproject</code></em>_info_readrepeats.lst
</span></dt><dd><p>
From the file with the information on repeats, grab all the lines
containing repetitive sequence which MIRA categorised as 'nasty'
via the 'MNRr' tag. The result looks a bit like this (first 15
lines shown):</p><pre class="screen">C6E3C7T12GKN35 MNRr GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JLIBM MNRr TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12HQOM1 MNRr CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12G52II MNRr CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JRMPO MNRr TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H1A8V MNRr GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H34Z7 MNRr AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H4HGC MNRr GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12FNA1N MNRr AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12F074V MNRr CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I1GYO MNRr CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I53C8 MNRr CACACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12I4V6V MNRr ATCACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12H5R00 MNRr TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12IBA5E MNRr AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre><p>
</p></dd><dt><span class="term">
cut -f 3
</span></dt><dd><p>
We're just interested in the sequence now, which is in the third
column. The above 'cut' command takes care of this. The resulting
output may look like this (only first 15 lines shown):
</p><pre class="screen">GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CACACTCGTATAGTGACACGCAACAGGGG
ATCACTCGTATAGTGACACGCAACAGGGG
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre></dd><dt><span class="term">
sort
</span></dt><dd><p>
Simply sort all sequences. The output may look like this now (only first 15 lines shown):</p><pre class="screen">
AAACTCGTATAGTGACACGCA
AAACTCGTATAGTGACACGCAACAGG
AAACTCGTATAGTGACACGCAACAGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGGAT
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
...</pre><p>
</p></dd><dt><span class="term">
uniq -c
</span></dt><dd><p>
This command collapses runs of identical adjacent lines and counts how
many lines each run contained. As we previously sorted the whole file
by sequence, it effectively counts how often a certain sequence has
been tagged as MNRr. The output has two whitespace-separated columns:
the first column contains the number of times a given line (sequence
in our case) was seen, padded with leading blanks; the second column
contains the line (sequence) itself. Example output looks like this (only first 15 lines shown):
</p><pre class="screen"> 1 AAACTCGTATAGTGACACGCA
1 AAACTCGTATAGTGACACGCAACAGG
1 AAACTCGTATAGTGACACGCAACAGGG
5 AAACTCGTATAGTGACACGCAACAGGGG
1 AAACTCGTATAGTGACACGCAACAGGGGAT
13 AAACTCGTATAGTGACACGCAACAGGGGATA
6 AAACTCGTATAGTGACACGCAACAGGGGATAGAC
4 AAACTCGTATAGTGACACGCAACAGGGGATAGACAA
9 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGC
3 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCA
257 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
1 AACACTCGTATAGTGACACGCAAC
2 AACACTCGTATAGTGACACGCAACAGGG
23 AACACTCGTATAGTGACACGCAACAGGGG
6 AACACTCGTATAGTGACACGCAACAGGGGATA
...</pre></dd><dt><span class="term">
sort -g -r
</span></dt><dd><p>
We now sort the output of the previous uniq-counting command by
asking 'sort' to perform a numerical sort (via '-g', the general
numeric sort of GNU sort; the more widely available '-n' would do
as well for these integer counts) and additionally sort in reverse
order (via '-r') so that we get the sequences encountered most often
at the top of the output. The result looks exactly like the output
shown at the start of this section:
</p><pre class="screen">
504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...</pre></dd></dl></div><p>
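The counts above are spread over variants that differ only in their first few
bases. As a minimal sketch, and not something MIRA provides, the sequences can
also be tallied by their last bases so that such variants are counted together.
The function name <code class="literal">tally_by_tail</code> and the default tail
length of 38 (the length of the shared pattern in this example) are illustrative
assumptions, as is the tab-separated three-column layout of the listing:
</p><pre class="screen">
```shell
# Sketch: tally MNRr sequences by their last K bases, so that variants which
# differ only in a few leading bases are counted together.
# Input on stdin: tab-separated lines "readname  tag  sequence" (assumed layout).
# K defaults to 38; pick the length of the pattern you suspect.
# Possible use:
#   cat badproject_info_readrepeats.lst | tally_by_tail 38 | head -15
tally_by_tail() {
    k="${1:-38}"
    awk -F '\t' -v k="$k" '
        $2 == "MNRr" {
            s = $3
            # keep only the last k bases; shorter sequences stay as they are
            if (length(s) > k) s = substr(s, length(s) - k + 1)
            print s
        }' | sort | uniq -c | sort -k1,1nr
}
```
</pre><p>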
So, what is this ominous CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC you are
seeing? To make it short: a modified 454 B-adaptor with an additional MID sequence.
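</p><p>
To put a number on how widespread such a sequence is, one can count the reads
whose MNRr-tagged sequence contains it and relate that to the total number of
reads. The following is only a sketch under assumptions: the function name
<code class="literal">adaptor_percent</code> is made up, the listing is assumed
to be tab-separated as shown above, and the total read count is not in the
listing, so you have to supply it yourself. As the listing only contains what
MIRA tagged, the result is best read as a lower bound.
</p><pre class="screen">
```shell
# Sketch: percentage of reads whose MNRr-tagged sequence contains PATTERN.
# Usage: adaptor_percent PATTERN TOTAL_READS   (listing on stdin)
# TOTAL_READS must be a positive number supplied by the caller.
adaptor_percent() {
    awk -F '\t' -v pat="$1" -v total="$2" '
        $2 == "MNRr" {
            # remember the read name so that each read is counted only once
            if (index($3, pat) > 0) hit[$1] = 1
        }
        END {
            n = 0
            for (r in hit) n++
            printf "%.2f\n", 100 * n / total
        }'
}
```
</pre><p>
The printed value can then be compared with the few tenths of a percent
mentioned in the note that follows.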
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
These adaptor sequences have absolutely no reason to exist in your
data, none! Go back to your sequencing provider and ask them to have a look
at their pipeline as they should have had it set up in a way that you
do not see these things anymore. Yes, due to sequencing errors,
some adaptor or sequencing vector remnants will sometimes stay in
your sequencing data, but that is no problem as MIRA is capable of
handling that very well.
</p><p>
But having much more than 0.1% to 0.5% of your sequences containing
these is a sure sign that someone goofed somewhere ... and it's very
probably not your fault.
</p></td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_hard_examples_for_kmer_statistics"></a>11.5.
Examples for kmer statistics
</h2></div></div></div><p>
Selecting the right ratio so that an assembly fits into your memory is not
straightforward. But MIRA can help you a bit: during assembly, some frequency
statistics are printed out (they'll probably end up in some info file in later
releases). Search for the term "Kmer statistics" in the information printed
out by MIRA (this happens quite early in the process).
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_caveat:_sk:kms"></a>11.5.1.
Caveat: -SK:kmer_size
</h3></div></div></div><p>
Some explanation how kmer size affects the statistics and why it
should be chosen >=16 for [-KS:mnr]
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_sanger_sequencing_a_simple_bacterium"></a>11.5.2.
Sanger sequencing, a simple bacterium
</h3></div></div></div><p>
This example is taken from a pretty standard bacterium where Sanger
sequencing was used:
</p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 15
Deduced thresholds:
-------------------
Min normal cov: 7
Max normal cov: 23
Repeat cov: 29
Crazy cov: 120
Mask cov: 150
Repeat ratio histogram:
-----------------------
0 475191
1 5832419
2 181994
3 6052
4 4454
5 972
6 4
7 8
14 2
16 10
=========================================================
</pre><p>
The above can be interpreted like this: the expected coverage of the genome is
15x. MIRA will treat a k-mer as 'repetitive' once its estimated frequency
reaches 29 (the 'Repeat cov' value). As shown in the histogram, the overall
picture of this project is pretty healthy:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
only a small fraction of k-mers have a repeat level of '0' (these would be
k-mers in regions with quite low coverage or k-mers containing sequencing
errors)
</p></li><li class="listitem"><p>
the vast majority of k-mers have a repeat level of 1 (so that's
non-repetitive coverage)
</p></li><li class="listitem"><p>
there is a small fraction of k-mers with repeat level of 2-10
</p></li><li class="listitem"><p>
there are almost no k-mers with a repeat level >10
</p></li></ul></div><p>
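The proportions described in the list above can be checked with a small helper.
This is a sketch only: the function name
<code class="literal">kmer_level_shares</code> is an assumption, and it expects
the histogram lines ("level count") on standard input, for example copied from
the MIRA output into a file.
</p><pre class="screen">
```shell
# Sketch: summarise a "Repeat ratio histogram" as percentages of k-mers
# at repeat level 0, level 1, and level 2 or higher.
# Input on stdin: lines of the form "level count"; other lines are ignored.
kmer_level_shares() {
    awk 'NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            total += $2
            if ($1 == 0) zero += $2
            else if ($1 == 1) one += $2
            else high += $2
        }
        END {
            if (total == 0) exit 1      # no histogram lines seen
            printf "level0 %.1f%%\nlevel1 %.1f%%\nlevel2+ %.1f%%\n",
                   100 * zero / total, 100 * one / total, 100 * high / total
        }'
}
```
</pre><p>
For the Sanger histogram above this works out to roughly 7% of k-mers at
level 0, 90% at level 1 and 3% at level 2 or higher.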
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_454_sequencing_a_somewhat_more_complex_bacterium"></a>11.5.3.
454 Sequencing, a somewhat more complex bacterium
</h3></div></div></div><p>
For comparison, here is a profile for a more complicated bacterium (454
sequencing):
</p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 20
Deduced thresholds:
-------------------
Min normal cov: 10
Max normal cov: 30
Repeat cov: 38
Crazy cov: 160
Mask cov: 0
Repeat ratio histogram:
-----------------------
0 8292273
1 6178063
2 692642
3 55390
4 10471
5 6326
6 5568
7 3850
8 2472
9 708
10 464
11 270
12 140
13 136
14 116
15 64
16 54
17 54
18 52
19 50
20 58
21 36
22 40
23 26
24 46
25 42
26 44
27 32
28 38
29 44
30 42
31 62
32 116
33 76
34 80
35 82
36 142
37 100
38 120
39 94
40 196
41 172
42 228
43 226
44 214
45 164
46 168
47 122
48 116
49 98
50 38
51 56
52 22
53 14
54 8
55 2
56 2
57 4
87 2
89 6
90 2
92 2
93 2
1177 2
1181 2
=========================================================
</pre><p>
The difference to the first bacterium shown is pretty striking:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
first, the number of k-mers at repeat level 0 (below average) is
higher than the number of k-mers at level 1! This points to a higher
number of sequencing errors in the 454 reads than in the Sanger project
shown previously. Or to a more uneven distribution of reads (but
not in this special case).
</p></li><li class="listitem"><p>
second, the repeat level histogram does not trail off at a repeat
frequency of 10 or 15, but it has a long tail up to the fifties, even having
a local maximum at 42. This points to a small part of the genome being
heavily repetitive ... or to (a) plasmid(s) in high copy numbers.
</p></li></ul></div><p>
</p><p>
Should MIRA ever have problems with this genome, switch on the nasty repeat
masking and use a level of 15 as cutoff. In this case, 15 is OK to start with
as a) it's a bacterium, it can't be that hard and b) the frequencies above
level 5 are in the low thousands and not in the tens of thousands.
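</p><p>
Before settling on a cutoff it can help to see how many k-mers sit at or above
a candidate level. The helper below is a sketch with a made-up name
(<code class="literal">kmers_at_or_above</code>); it only sums histogram counts
and makes no claim about what MIRA will actually mask.
</p><pre class="screen">
```shell
# Sketch: sum the histogram counts of all repeat levels >= CUTOFF.
# Usage: kmers_at_or_above CUTOFF   (histogram lines "level count" on stdin)
kmers_at_or_above() {
    awk -v cutoff="$1" '
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            if ($1 + 0 >= cutoff + 0) sum += $2
        }
        END { printf "%d\n", sum }'
}
```
</pre><p>
Running it with a few candidate cutoffs (say 10, 15 and 20) on the histogram
above shows how quickly the affected number of k-mers drops.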
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_solexa_sequencing_ecoli_mg1655"></a>11.5.4.
Solexa sequencing, E.coli MG1655
</h3></div></div></div><p>
</p><pre class="screen">
Kmer statistics:
=========================================================
Measured avg. coverage: 23
Deduced thresholds:
-------------------
Min normal cov: 11
Max normal cov: 35
Repeat cov: 44
Crazy cov: 184
Mask cov: 0
Repeat ratio histogram:
-----------------------
0 1365693
1 8627974
2 157220
3 11086
4 4990
5 3512
6 3922
7 4904
8 3100
9 1106
10 868
11 788
12 400
13 186
14 28
15 10
16 12
17 4
18 4
19 2
20 14
21 8
25 2
26 8
27 2
28 4
30 2
31 2
36 4
37 6
39 4
40 2
45 2
46 8
47 14
48 8
49 4
50 2
53 2
56 6
59 4
62 2
63 2
67 2
68 2
70 2
73 4
75 2
77 4
=========================================================
</pre><p>
These kmer statistics show that MG1655 is pretty boring (from a
repetitive point of view). One might expect a few repeats but nothing
fancy: the repeats are actually the rRNA and sRNA stretches in the
genome plus some intergenic regions.
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
the number of k-mers at repeat level 0 (below average) is
considerably lower than at level 1, so the Solexa sequencing
quality is pretty good and there shouldn't be too many
low coverage areas.
</p></li><li class="listitem"><p>
the histogram tail shows some faint traces of possibly highly repetitive
k-mers, but these are false positive matches due to some standard Solexa
base-calling weaknesses of earlier pipelines like, e.g., adding poly-A,
poly-T or sometimes poly-C and poly-G tails to reads when spots in the
images were faint and the base calls of bad quality
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_need_examples_for_eukaryotes"></a>11.5.5.
(NEED EXAMPLES FOR EUKARYOTES)
</h3></div></div></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_hard_need_examples_for_pathological_cases"></a>11.5.6.
(NEED EXAMPLES FOR PATHOLOGICAL CASES)
</h3></div></div></div><p>
Vector contamination etc.
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_seqtechdesc"></a>Chapter 12. Description of sequencing technologies</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_std_intro">12.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect_std_sxa">12.2.
Illumina (formerly Solexa)
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_sxa_caveats_for_illumina">12.2.1.
Caveats for Illumina data
</a></span></dt><dt><span class="sect2"><a href="#sect_std_sxa_highlights">12.2.2.
Illumina highlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_sxa_highlights_quality">12.2.2.1.
Quality
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_std_std_sxa_lowlights">12.2.3.
Lowlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_longhomopolymers">12.2.3.1.
Long homopolymers
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_GGCxG_motif">12.2.3.2.
The GGCxG and GGC motifs
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_chimericreads">12.2.3.3.
Chimeric reads
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_samplemix">12.2.3.4.
Sample barcode misidentification
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_nextera">12.2.3.5.
Nextera library prep
</a></span></dt><dt><span class="sect3"><a href="#sect_std_sxa_lowlights_gcbias">12.2.3.6.
Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_std_iontor">12.3.
Ion Torrent
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_iontor_hpindels">12.3.1.
Homopolymer insertions / deletions
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_seqdirdepindels">12.3.2.
Sequencing direction dependent insertions / deletions
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_covvariance">12.3.3.
Coverage variance
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_gcbias">12.3.4.
GC bias
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_other_sources_of_error">12.3.5.
Other sources of error
</a></span></dt><dt><span class="sect2"><a href="#sect_std_iontor_where_to_find_further_information">12.3.6.
Where to find further information
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_std_pacbio">12.4.
Pacific BioSciences
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_std_pb_highlights">12.4.1.
Highlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_pb_hl_length">12.4.1.1.
Sequence lengths
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_gcbias">12.4.1.2.
GC bias
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_acccorrected">12.4.1.3.
Accuracy of corrected reads
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_hl_qualassemblies">12.4.1.4.
Assemblies of corrected reads
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_std_pb_lowlights">12.4.2.
Lowlights
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_std_pb_ll_namingconfusion">12.4.2.1.
Naming confusion
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_revseq">12.4.2.2.
Forward / reverse chimeric sequences
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_rawreadaccuracy">12.4.2.3.
Accuracy of uncorrected subreads
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_cpu">12.4.2.4.
Immense need for CPU power
</a></span></dt><dt><span class="sect3"><a href="#sect_std_pb_ll_dnaprep">12.4.2.5.
Increased quality requirements for clean DNA sample prep
</a></span></dt></dl></dd></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Opinions are like chili powder - best used in moderation.</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_intro"></a>12.1.
Introduction
</h2></div></div></div><p>
<span class="bold"><strong>Note:</strong></span> This section contains things I've
seen in the past and simply jotted down. These may be fundamentally
correct, correct only under certain circumstances, or not correct at all with
your data. You may have different observations.
</p><p>
...
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_sxa"></a>12.2.
Illumina (formerly Solexa)
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_sxa_caveats_for_illumina"></a>12.2.1.
Caveats for Illumina data
</h3></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
Even if you can get bacteria sequenced with ridiculously high coverage
like 500x or 1000x, this amount of data is simply not needed. Even
more important - though counterintuitive - is the fact that due to
non-random sequence dependent sequencing errors, too high a coverage
may even make the assembly worse.
</p><p>
Another rule of thumb: when having more than enough data, reduce the
data set so as to have an average coverage of approximately 100x. In
some rare cases (high GC content), perhaps 120x to 150x, but certainly
not more.
</p></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
When reducing a data set, do <span class="bold"><strong>NOT</strong></span>,
under any circumstances, try fancy selection of reads by some
arbitrary quality or length criteria. This will introduce a terrible
bias in your assembly due to non-random sequence-dependent sequencing
errors and non-random sequence dependent base quality assignment. More
on this in the next section.
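<p>
If reads are instead picked at random, the fraction to keep follows directly
from the ratio of target coverage to current coverage. A minimal sketch, with a
made-up function name (<code class="literal">keep_fraction</code>) and the 100x
rule of thumb from the note above as default target:
</p><pre class="screen">
```shell
# Sketch: fraction of reads (or read pairs) to keep, chosen at random,
# to get from the current average coverage down to a target coverage.
# Usage: keep_fraction CURRENT_COVERAGE [TARGET_COVERAGE]   (target defaults to 100)
keep_fraction() {
    awk -v cur="$1" -v tgt="${2:-100}" 'BEGIN {
        # never keep more than everything: below the target, the answer is 1
        f = (cur > tgt) ? tgt / cur : 1
        printf "%.3f\n", f
    }'
}
```
</pre><p>
For a data set at 500x, <code class="literal">keep_fraction 500</code> gives
0.2, i.e. keep a random fifth of the reads. With paired-end data, both reads of
a pair should be kept or dropped together.
</p>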
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_sxa_highlights"></a>12.2.2.
Illumina highlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_highlights_quality"></a>12.2.2.1.
Quality
</h4></div></div></div><p>
For current HiSeq 100bp reads I get - after MIRA clipping - about 90
to 95% reads matching to a reference without a single error. MiSeq
250bp reads contain a couple more errors, but nothing to be alarmed
about.
</p><p>
In short: Illumina is currently <span class="emphasis"><em>the</em></span> technology
to use if you want high quality reads.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_std_sxa_lowlights"></a>12.2.3.
Lowlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_longhomopolymers"></a>12.2.3.1.
Long homopolymers
</h4></div></div></div><p>
Long homopolymers (stretches of identical bases in reads) can be a
slight problem for Solexa. However, it must be noted that this is a
problem of all sequencing technologies on the market so far (Sanger,
Solexa, 454). Furthermore, the problem is much less pronounced in
Solexa than in 454 data: in Solexa, the first problems may appear
in stretches of 9 to 10 bases, whereas in Ion Torrent a stretch of 3 to 4
bases may already start being problematic in some cases.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_GGCxG_motif"></a>12.2.3.2.
The GGCxG and GGC motifs
</h4></div></div></div><p>
<code class="literal">GGCxG</code> or even <code class="literal">GGC</code> motif in the
5' to 3' direction of reads. This one is particularly annoying and
it took me quite a while to work around the problems it causes in
MIRA.
</p><p>
Simply put: at some places in a genome, base calling after a
<code class="literal">GGCxG</code> or <code class="literal">GGC</code> motif is
particularly error prone and the number of reads without errors
declines markedly. Repeated <code class="literal">GGC</code> motifs worsen
the situation. The following screen shots of a mapping assembly
illustrate this.
</p><p>
The first example is the <code class="literal">GGCxG</code> motif (in the form
of a <code class="literal">GGCTG</code>) occurring in approximately one third
of the reads at the shown position. Note that all but one read
with this problem are in the same (plus) direction.
</p><div class="figure"><a name="sxa_unsc_ggcxg2_lenski.png"></a><p class="title"><b>Figure 12.1.
The Solexa GGCxG problem.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggcxg2_lenski.png" width="100%" alt="The Solexa GGCxG problem."></td></tr></table></div></div></div><br class="figure-break"><p>
The next two screen shots show the <code class="literal">GGC</code> motif, once for
forward direction and once for reverse direction reads:
</p><div class="figure"><a name="sxa_unsc_ggc1_lenski.png"></a><p class="title"><b>Figure 12.2.
The Solexa GGC problem, forward example
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggc1_lenski.png" width="100%" alt="The Solexa GGC problem, forward example"></td></tr></table></div></div></div><br class="figure-break"><div class="figure"><a name="sxa_unsc_ggc4_lenski.png"></a><p class="title"><b>Figure 12.3.
The Solexa GGC problem, reverse example
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_unsc_ggc4_lenski.png" width="100%" alt="The Solexa GGC problem, reverse example"></td></tr></table></div></div></div><br class="figure-break"><p>
Places in the genome that have <code class="literal">GGCGGC.....GCCGCC</code>
(a motif, perhaps even repeated, then some bases and then an
inverted motif) almost always have a very, very low number of good
reads. Especially when the motif is <code class="literal">GGCxG</code>.
</p><p>
Things get especially difficult when these motifs occur at sites
where users may have a genuine interest. The following example is a
screen shot from the Lenski data (see walk-through below) where a
simple mapping reveals an anomaly which -- in reality -- is an IS
insertion (see <a class="ulink" href="http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html" target="_top">http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html</a>)
but could also look like a <code class="literal">GGCxG</code> motif in forward
direction (<code class="literal">GGCCG</code>) and at the same time a
<code class="literal">GGC</code> motif in reverse direction:
</p><div class="figure"><a name="sxa_xmastree_lenski2.png"></a><p class="title"><b>Figure 12.4.
A genuine place of interest almost masked by the
<code class="literal">GGCxG</code> problem.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_xmastree_lenski2.png" width="100%" alt="A genuine place of interest almost masked by the GGCxG problem."></td></tr></table></div></div></div><br class="figure-break"></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_chimericreads"></a>12.2.3.3.
Chimeric reads
</h4></div></div></div><p>
I did not realise chimeric reads were a problem with Illumina data
until Fall 2014, when I got reads > 100bp for extremely well
characterised bacteria. This was partly because MIRA has always used data
cleaning methods which worked very well either on short reads ≤
100bp or when chimeras occurred at a very low frequency.
</p><p>
Chimeras are artefact reads from library preparation which
contain parts of the sequence of interest which do not belong
together. E.g., in DNA from a bacterial genome, there may be one
read of 100 bp where the first 40 bp come from the genome position
at 100kb and the last 60 bp come from a position at 1300kb ... more
than one megabase apart.
</p><p>
There is not much literature regarding chimeric sequences in
Illumina data: most of it deals with 16S or amplicon sequencing
where I always thought <span class="emphasis"><em>"that does not apply to my data
sets."</em></span> Well, tough luck ... it does. After some searching I
found some papers which report quite varying levels depending on the
protocols used. Oyola et al. report between 0.24% and 2.3% of
chimeras (<span class="emphasis"><em>Optimizing illumina next-generation sequencing
library preparation for extremely at-biased genomes</em></span>; BMC
Genomics 2012, 13:1; doi:10.1186/1471-2164-13-1; <a class="ulink" href="http://www.biomedcentral.com/1471-2164/13/1" target="_top">http://www.biomedcentral.com/1471-2164/13/1</a>). Apparently, a
paper from researchers at the Sanger Centre reported up to 5%
chimeric reads (Bronner et al., <span class="emphasis"><em>Improved Protocols for
Illumina Sequencing</em></span>; Current Protocols in Human Genetics
18:18.2:18.2.1–18.2.42; DOI: 10.1002/0471142905.hg1802s80; <a class="ulink" href="http://onlinelibrary.wiley.com/doi/10.1002/0471142905.hg1802s80/abstract" target="_top">http://onlinelibrary.wiley.com/doi/10.1002/0471142905.hg1802s80/abstract</a>
via <a class="ulink" href="http://www.sagescience.com/blog/sanger-reports-improved-prep-protocols-for-illumina-sequencing/" target="_top">http://www.sagescience.com/blog/sanger-reports-improved-prep-protocols-for-illumina-sequencing/</a>).
</p><p>
I have now seen MiSeq 250bp and 300bp paired-end genomic data sets
from different (trusted) sequencing providers for very well
characterised, non-complex and non-GC-extreme bacterial genomes with
up to 3% chimeric reads. To make things worse, some chimeras were
represented by both reads of a read-pair, so one had the exact same
chimeric sequence represented twice: once in forward and once in
reverse complement direction.
</p><p>
It turned out that MIRA versions ≤ 4.9.3 have problems in
filtering chimeras in Illumina data sets with reads > 100bp as
the chimera detection algorithms were designed to handle amounts
much less than 1% of the total reads. This led to shorter contigs in
genomic assemblies and to chimeric transcripts (when they are very
low-coverage) in RNA assemblies.
</p><p>
Note that projects using reads ≤ 100 bp assembled fine with MIRA
4.9.3 and before as the default algorithms for proposed-end-clip
([-CL:pec]) implicitly caught chimeras occurring near the
read ends and the remaining chimeras were caught by the algorithms
for low level chimeras.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
MIRA 4.9.4 and higher eliminate all chimeras in Illumina reads of
any length, so you do not need to take any precautionary steps
here. But if you use other assemblers and in light of the above, I
highly recommend applying very stringent filters to Illumina data.
Especially for applications like metagenomics or RNA de-novo
assembly where low coverage may be expected for parts of the
results! Indeed, I now treat any assembly result with consensus data
generated from a coverage of less than 3 Illumina reads as
potentially junk data.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_samplemix"></a>12.2.3.4.
Sample barcode misidentification
</h4></div></div></div><p>
Long story short: data from multiplexed samples contains "low"
amounts of foreign samples from the same lane. Probably not a
problem for high coverage assemblies, but can become a problem in
multiplexed RNASeq or projects looking for "rare" variants.
</p><p>
In essence, the barcoding used for multiplexing several samples into
a single lane is not a 100% foolproof process. I found one paper
quantifying this effect at 0.3% of misidentified reads: Kircher et
al., <span class="emphasis"><em>Double indexing overcomes inaccuracies in multiplex
sequencing on the Illumina platform</em></span>; Nucleic Acids
Res. Jan 2012; 40(1): e3. <a class="ulink" href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245947/" target="_top">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245947/</a>
</p><p>
For example, I got some genome sequencing data for a bacterium where
closer inspection of some small contigs coming out of the assembly
process showed them to be highly expressed genes from a plant. The
sequencing provider had multiplexed our bacterial sample with an
RNASeq project of that plant.
</p><p>
Another example involved RNASeq of two genomes where one of the
organisms had been modified to contain additional genes under a
strong promoter. In the data set we suddenly saw those inserted
genes pop up in the samples of the wild type organism. Which,
clearly, could not be.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_nextera"></a>12.2.3.5.
Nextera library prep
</h4></div></div></div><p>
Opinions seem to be divided about Nextera: some people don't like it
as it introduces sometimes terrible coverage bias in the data, other
people say they're happy with the data.
</p><p>
Someone told me (or wrote, I do not remember) that this divide may
be due to the fact that some people use their sequencing data for
de-novo assemblies, while others just do mappings and hunt for
SNPs. In fact, this would explain a lot: for de-novo assemblies, I
would never use Nextera. When on a hunt for SNPs, it may be OK.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_sxa_lowlights_gcbias"></a>12.2.3.6.
Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)
</h4></div></div></div><p>
I'm recycling a few slides from a couple of talks I gave in 2010.
</p><p>
Things used to be so nice and easy with the early Solexa data I worked
with (36 and 44mers) in late 2007 / early 2008. When sample taking was
done right -- e.g. for bacteria: in stationary phase -- and the
sequencing lab did a good job, the read coverage of the genome was
almost even. I did see a few papers claiming to see non-trivial GC
bias back then, but after having analysed the data I worked with I
dismissed them as "not relevant for my use cases." Have a look at the
following figure showing, as an example, the coverage of a 45% GC
bacterium in 2008:
</p><div class="figure"><a name="sxa_gcbias_nobias2008.png"></a><p class="title"><b>Figure 12.5.
Example for no GC coverage bias in 2008 Solexa data. Apart from a
slight <span class="emphasis"><em>smile shape</em></span> of the coverage --
indicating the sample taking was not 100% in stationary phase of the
bacterial culture -- everything looks pretty nice: the average
coverage is at 27x, and when looking at potential genome
duplications at twice the coverage (54x), there's nothing apart from a
single peak (which turned out to be a problem in an rRNA region).
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_nobias2008.png" width="100%" alt="Example for no GC coverage bias in 2008 Solexa data. Apart from a slight smile shape of the coverage -- indicating the sample taking was not 100% in stationary phase of the bacterial culture -- everything looks pretty nice: the average coverage is at 27x, and when looking at potential genome duplications at twice the coverage (54x), there's nothing apart from a single peak (which turned out to be a problem in an rRNA region)."></td></tr></table></div></div></div><br class="figure-break"><p>
Things changed starting sometime in Q3 2009, at least that's when I
got some data which made me notice a problem. Have a look at the
following figure which shows exactly the same organism as in the
figure above (bacterium, 45% GC):
</p><div class="figure"><a name="sxa_gcbias_bias2009.png"></a><p class="title"><b>Figure 12.6.
Example for GC coverage bias starting Q3 2009 in Solexa
data. There's no <span class="emphasis"><em>smile shape</em></span> anymore -- the
people in the lab learned to pay attention to sampling in 100%
stationary phase -- but something else is extremely disconcerting:
the average coverage is at 33x, and when looking at potential genome
duplications at twice the coverage (66x), there are several dozen
peaks crossing the 66x threshold over several kilobases (in one
case over 200 Kb) all over the genome. As if several small genome
duplications happened.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_bias2009.png" width="100%" alt="Example for GC coverage bias starting Q3 2009 in Solexa data. There's no smile shape anymore -- the people in the lab learned to pay attention to sampling in 100% stationary phase -- but something else is extremely disconcerting: the average coverage is at 33x, and when looking at potential genome duplications at twice the coverage (66x), there are several dozen peaks crossing the 66x threshold over several kilobases (in one case over 200 Kb) all over the genome. As if several small genome duplications happened."></td></tr></table></div></div></div><br class="figure-break"><p>
By the way, the figures above are just examples: I saw over a dozen
sequencing projects in 2008 without GC bias and several dozen in 2009
/ 2010 with GC bias.
</p><p>
When I checked the potential genome duplication sites, they all looked
"clean", i.e., the typical genome insertion markers were
missing. Poking around at possible explanations, I looked at GC
content of those parts in the genome ... and there was the
explanation:
</p><div class="figure"><a name="sxa_gcbias_comp20082009.png"></a><p class="title"><b>Figure 12.7.
Example for GC coverage bias, direct comparison 2008 / 2010
data. The bug has 45% average GC, areas with above average read
coverage in 2010 data turn out to be lower GC: around 33 to 36%. The
effect is also noticeable in the 2008 data, but barely so.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/sxa_gcbias_comp20082009.png" width="100%" alt="Example for GC coverage bias, direct comparison 2008 / 2010 data. The bug has 45% average GC, areas with above average read coverage in 2010 data turn out to be lower GC: around 33 to 36%. The effect is also noticeable in the 2008 data, but barely so."></td></tr></table></div></div></div><br class="figure-break"><p>
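As a minimal sketch of the kind of check described above (this is not MIRA code; the function names, the window size and the factor of 2 are assumptions made purely for illustration), one can compute GC content and mean coverage per window and list the windows whose coverage reaches twice the genome-wide average:
</p><pre class="screen">
```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def high_coverage_windows(reference, coverage, window=1000, factor=2.0):
    """Return (start, gc, mean_cov) for windows at or above factor * average.

    reference: genome sequence as a string
    coverage:  per-base read coverage, same length as reference
    window and factor are illustrative defaults, not recommended values.
    """
    if len(reference) != len(coverage):
        raise ValueError("reference and coverage must have the same length")
    if not coverage:
        return []
    average = sum(coverage) / len(coverage)
    flagged = []
    for start in range(0, len(reference), window):
        chunk_cov = coverage[start:start + window]
        mean_cov = sum(chunk_cov) / len(chunk_cov)
        if mean_cov >= factor * average:
            chunk_gc = gc_fraction(reference[start:start + window])
            flagged.append((start, chunk_gc, mean_cov))
    return flagged
```
</pre><p>
If the flagged windows consistently show a GC content well below the genome average, a coverage bias is a more plausible explanation than a series of real duplications.
</p><p>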
Now, actually <span class="emphasis"><em>why</em></span> the GC bias suddenly
became so strong is unknown to me. The people in the lab have used the
same protocol for several years to extract the DNA, and the sequencing
providers claim to always use the Illumina standard protocols.
</p><p>
But obviously something must have changed.
</p><p>
It took Illumina some 18 months to resolve that problem for the
broader public: ever since the data I work on has been generated with
the TruSeq kit, this problem has vanished.
</p><p>
However, if you based some conclusions on, or wrote a paper with, Illumina
data which might be affected by the GC bias (Q3 2009 to Q4 2010), I
suggest you rethink all the conclusions drawn. This should be
especially the case for transcriptomics experiments where a difference
in expression of 2x to 3x starts to get highly significant!
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_iontor"></a>12.3.
Ion Torrent
</h2></div></div></div><p>
As of January 2014, I would say Ion Torrent reads behave very much like
late data from the 454 technology (FLX / Titanium chemistry): reads are
on average > 300bp and the homopolymer problem is much less
pronounced than 2 years ago. The following figure shows what you can get
out of 100bp reads if you're lucky:
</p><div class="figure"><a name="chap_iontor::ion_dh10bgoodB13.png"></a><p class="title"><b>Figure 12.8.
Example for good IonTorrent data (100bp reads). Note that only a
single sequencing error - shown by blue background - can be
seen. Apart from this, all homopolymers of size 3 and 4 in the area
shown are good.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/ion_dh10bgoodB13.png" width="100%" alt="Example for good IonTorrent data (100bp reads). Note that only a single sequencing error - shown by blue background - can be seen. Apart from this, all homopolymers of size 3 and 4 in the area shown are good."></td></tr></table></div></div></div><br class="figure-break"><p>
The "if you're lucky" part in the preceding sentence is not there by
accident: having so many clean reads is more of an exception than a
rule. On the other hand, most sequencing errors in current IonTorrent
data are unproblematic ... if it were not for indels, which are going to
be explained in the next sections.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_hpindels"></a>12.3.1.
Homopolymer insertions / deletions
</h3></div></div></div><p>
The main source of error in your data will be insertions / deletions
(indels) especially in homopolymer regions (but not only there, see
also next section). Starting with a base run of 4 to 6 bases, there
is a distinct tendency to have an increased occurrence of indel
errors.
</p><div class="figure"><a name="chap_iontor::iontor_indelhpexample.png"></a><p class="title"><b>Figure 12.9.
Example for problematic IonTorrent data (100bp reads).
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/iontor_indelhpexample.png" width="100%" alt="Example for problematic IonTorrent data (100bp reads)."></td></tr></table></div></div></div><br class="figure-break"></div><p>
The above figure contains a couple of particularly nasty indel
problems. While areas 2 (C-homopolymer length 3), 5 (A-homopolymer
length 4) and 6 (T-homopolymer length 3) are not a big problem as most
of the reads got the length right, the areas 1, 3 and 4 are nasty.
</p><p>
Area 1 is an A-homopolymer of length 7 and while many reads get that
length right (enough to tell MIRA what the true length is), it also
contains reads with a length of 6 and others with a length of 8.
</p><p>
Area 3 is an "A-homopolymer" of length 2 where approximately half of the
reads get the length right, the other half not. See also the following
section.
</p><p>
Area 4 is a T-homopolymer of length 5 which also has approximately half
the reads with a wrong length of 4.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_seqdirdepindels"></a>12.3.2.
Sequencing direction dependent insertions / deletions
</h3></div></div></div><p>
In the previous section, the screen shot showing indels had an indel
at a homopolymer of 2, which is something quite curious. Upon closer
investigation, one might notice a pattern in the gap/nogap
distribution: it is almost identical to the build direction
(orientation) of the reads!
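</p><p>
A minimal sketch of how such a pattern could be tallied at a single site, assuming you have already extracted, for each read covering the site, its build direction and whether it shows a gap there (the function names, the input format and the 0.9 threshold are hypothetical and not part of MIRA):
</p><pre class="screen">
```python
from collections import Counter


def gap_counts_by_strand(observations):
    """Count reads with and without a gap, separately per strand.

    observations: iterable of (strand, has_gap) pairs, where strand is
    "+" or "-" and has_gap is True if the read shows a gap at the site.
    """
    counts = Counter()
    for strand, has_gap in observations:
        counts[(strand, "gap" if has_gap else "nogap")] += 1
    return counts


def looks_direction_dependent(counts, min_fraction=0.9):
    """True if one strand is mostly gapped and the other mostly ungapped.

    min_fraction is an arbitrary illustrative threshold.
    """
    def gap_fraction(strand):
        total = counts[(strand, "gap")] + counts[(strand, "nogap")]
        return counts[(strand, "gap")] / total if total else None

    plus, minus = gap_fraction("+"), gap_fraction("-")
    if plus is None or minus is None:
        return False  # no reads on one strand: nothing to compare
    plus_gapped = plus >= min_fraction and 1.0 - minus >= min_fraction
    minus_gapped = minus >= min_fraction and 1.0 - plus >= min_fraction
    return plus_gapped or minus_gapped
```
</pre><p>
A site where gaps are spread evenly over both directions will not be flagged, whereas a site where nearly all gapped reads share one direction will.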
</p><p>
I looked for other examples of this behaviour and found quite a
number of them, the following figure shows a very clear case of that
error behaviour:
</p><div class="figure"><a name="chap_iontor::ion_dh10bdirdepindel.png.png"></a><p class="title"><b>Figure 12.10.
Example for a sequencing direction dependent indel. Note how all
but one of the reads in '+' direction miss a base while all reads
built in '-' direction have the correct number of bases.
</b></p><div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" style="cellpadding: 0; cellspacing: 0;" width="90%"><tr><td><img src="bookfigures/ion_dh10bdirdepindel.png" width="100%" alt="Example for a sequencing direction dependent indel. Note how all but one of the reads in '+' direction miss a base while all reads built in '-' direction have the correct number of bases."></td></tr></table></div></div></div><br class="figure-break"><p>
This is quite astonishing: the problem occurs at a site without a real
homopolymer (calling a 2-bases run a 'homopolymer' starts stretching
the definition a bit) and there are no major problematic homopolymer
sites nearby. In fact, this was more or less the case for all sites I
had a look at.
</p><p>
Neither did the cases which were investigated show common base
patterns, so unlike the Solexa GGCxG motif it does not look as if
that IonTorrent error is bound to a particular motif.
</p><p>
While I cannot prove the following statement, I somehow suspect that
there must be some kind of secondary structure forming which leads to
that kind of sequencing error. If anyone has a good explanation I'd be
happy to hear it: feel free to contact me at
<code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code>.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_covvariance"></a>12.3.3.
Coverage variance
</h3></div></div></div><p>
The coverage variance with the old ~100bp reads was a bit on the
bad side for low coverage projects (10x to 15x): it varied wildly,
sometimes dropping to nearly zero, sometimes reaching approximately
double the coverage.
</p><p>
This has now improved and I have not seen pronounced coverage variance
in the data sets I have worked on.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_gcbias"></a>12.3.4.
GC bias
</h3></div></div></div><p>
The GC bias seems to be small to non-existent, at least I could not
immediately make a correlation between GC content and coverage.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_other_sources_of_error"></a>12.3.5.
Other sources of error
</h3></div></div></div><p>
You will want to keep an eye on the clipping of the data in the SFF
files from IonTorrent: while it is generally good enough, some data
sets of IonTorrent show that - for some error patterns - the clipping
is too lax and strange artefacts appear. MIRA will take care of these
- or at least of those it knows - but you should be aware of this
potential problem.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_iontor_where_to_find_further_information"></a>12.3.6.
Where to find further information
</h3></div></div></div><p>
IonTorrent being pretty new, getting as much information as possible on
that technology is quite important. So here are a couple of links I found
to be helpful:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
There is, of course, the TorrentDev site (<a class="ulink" href="http://lifetech-it.hosted.jivesoftware.com/community/torrent_dev" target="_top">http://lifetech-it.hosted.jivesoftware.com/community/torrent_dev</a>)
at Life Technologies which will be helpful to get a couple of
questions answered.
</p><p>
Just be aware that some of the documents over there are sometimes
painting an - how should I say it diplomatically? - overly
optimistic view on the performance of the technology. On the
other hand, so do documents released by the main competitors
like 454/Roche, Illumina, PacBio etc. ... so no harm done there.
</p></li><li class="listitem"><p>
I found Nick Loman's blog <a class="ulink" href="http://pathogenomics.bham.ac.uk/blog/" target="_top">Pathogens: Genes and
Genomes</a> to be my currently most valuable source of
information on IonTorrent. While the group he works for won a
sequencer from IonTorrent, he makes that fact very clear and still
unsparingly dissects the data he gets from that machine.
</p><p>
His posts got me going in getting MIRA to grok IonTorrent.
</p></li><li class="listitem"><p>
The blog of Lex Nederbragt <a class="ulink" href="http://flxlexblog.wordpress.com/" target="_top">In between lines of
code</a> is playing in the same league: very down to earth and
he knows a bluff when he sees it ... and is not afraid to call it
(be it from IonTorrent, PacBio or 454).
</p><p>
The analyses he did on a couple of Ion data sets have saved me
quite some time.
</p></li><li class="listitem"><p>
Last, but not least, the board with <a class="ulink" href="http://seqanswers.com/forums/forumdisplay.php?f=40" target="_top">IonTorrent-related-stuff</a>
over at <a class="ulink" href="http://seqanswers.com/" target="_top">SeqAnswers</a>,
the first and foremost one-stop-shop ... erm ... discussion board
for everything related to sequencing nowadays.
</p></li></ul></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_std_pacbio"></a>12.4.
Pacific BioSciences
</h2></div></div></div><p>
As of January 2014, PacBio should be seen as <span class="emphasis"><em>the</em></span>
technology to go to for de-novo sequencing of bacteria and lower
eukaryotes. Period. Complement it with a bit of Illumina to get rid of
the last remaining errors and you'll have - for a couple of thousand
Euros - the best genome sequences money can buy.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_pb_highlights"></a>12.4.1.
Highlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_length"></a>12.4.1.1.
Sequence lengths
</h4></div></div></div><p>
Just one word: huge. At least compared to other currently existing
technologies. It is not unusual to get average - usable - read lengths
of more than 3 to 4 kb, some chemistries doubling that number (at
the expense of accuracy). The largest - usable - reads I have seen
were > 25kb, though one needs to keep in mind that these are
quite rare and one does not see many of them in a project.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_gcbias"></a>12.4.1.2.
GC bias
</h4></div></div></div><p>
I have seen none in my projects so far, neither have I in public
data. But these were certainly not as many projects as Sanger, 454,
Illumina and Ion, so take this with a grain of salt.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_acccorrected"></a>12.4.1.3.
Accuracy of corrected reads
</h4></div></div></div><p>
Once the raw PacBio data has been corrected (HGAP pipeline), the
resulting reads have a pretty good accuracy. There still are
occasional homopolymer errors remaining at non-random locations, but
they are a minor problem.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_hl_qualassemblies"></a>12.4.1.4.
Assemblies of corrected reads
</h4></div></div></div><p>
The assemblies coming out of the HGAP pipeline are already
astoundingly good. Of course you get long contigs, but also the
number of miscalled consensus bases is not too bad: 1 error per 20
kb. Once the program
<span class="command"><strong>Quiver</strong></span> went through the assembly to do its magic
in polishing, the quality improves further into the range of 1
error per 50kb to 1 error per 250kb.
</p><p>
In my hands, I get even better assemblies with MIRA (longer contigs
which span repeats unresolved by HGAP). When combining this with
some low coverage Illumina data (say, 50x) to do cheap polishing,
the error rates I get are lower than 1 error in 4 megabases.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Take the above with a grain of salt as at the time of this writing,
I have analysed only a couple of bacteria in depth. For ploidal
organisms I have just played around a bit with public data without
really doing an in-depth analysis there.
</td></tr></table></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_std_pb_lowlights"></a>12.4.2.
Lowlights
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_namingconfusion"></a>12.4.2.1.
Naming confusion
</h4></div></div></div><p>
With PacBio, there are quite a number of read types being thrown
around which do confuse people: <span class="emphasis"><em>polymerase
reads</em></span>, <span class="emphasis"><em>quality clipped
reads</em></span>, <span class="emphasis"><em>subreads</em></span>, <span class="emphasis"><em>corrected
reads</em></span> and maybe some more I have currently forgotten. Here's the
total unofficial guide on how to keep those things apart:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
<span class="bold"><strong>polymerase reads</strong></span> are the rawest
and most unedited stuff you may come into contact with. You can see
it as "data fresh from the machine" and the number of megabases
there is usually the one sequencing providers sell to you.
</p><p>
The sequencing technology PacBio employs uses special hairpin
adaptors they have named SMRTBell, and these adaptors will be
present in the polymerase reads together with the fragments of
your DNA.
</p><p>
In terms of regular expression look-alike, the data in
polymerase reads has the following form:
</p><pre class="screen">(Adaptor + (forward fragment sequence + (Adaptor + (fragment sequence in reverse complement))))*</pre><p>
E.g., some of your <span class="emphasis"><em>polymerase reads</em></span> will
contain just the adaptor and (part of) a fragment sequence:
Adap+FwdSeq. Others might contain: Adap+FwdSeq+Adap+RevSeq. And
still others might contain: multiple copies of
Adap+FwdSeq+Adap+RevSeq.
</p></li><li class="listitem"><span class="bold"><strong>quality clipped reads</strong></span> are
simply <span class="emphasis"><em>polymerase reads</em></span> where some sort of
first quality clipping has been done.
</li><li class="listitem"><span class="bold"><strong>subreads</strong></span> are <span class="emphasis"><em>quality
clipped reads</em></span> where the adaptors have been removed and
the read split into forward fragment sequences and reverse
fragment sequences. Hence, one quality clipped polymerase read can
yield several subreads.
</li><li class="listitem"><p>
<span class="bold"><strong>corrected (sub)reads</strong></span> are
subreads where through the magic of lots of computational power
and a very high coverage of subreads, the errors have been
almost completely removed from the subreads.
</p><p>
This is usually done only on a part of the subreads as it already
takes long enough (several hundred CPU hours for a simple
bacterium).
</p></li></ul></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_revseq"></a>12.4.2.2.
Forward / reverse chimeric sequences
</h4></div></div></div><p>
The splitting of polymerase reads into subreads (see above) needs
the SMRTBell adaptor to be recognised by motif searching
programs. Unfortunately, it looks as if some "low percentage"
of reads have a self-looped end instead of an adaptor. Which in turn
means that the subread splitting will not split those reads and you
end up with a chimeric sequence.
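</p><p>
To make the mechanism concrete, here is a deliberately simplified sketch of subread splitting. It is not how the PacBio software works: the adaptor sequence is a made-up placeholder and exact string matching is assumed, whereas real tools have to find the adaptor in error-prone raw reads with an error-tolerant search.
</p><pre class="screen">
```python
# Hypothetical placeholder, NOT the real SMRTBell adaptor sequence.
ADAPTOR = "GATTACAGATTACA"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_into_subreads(polymerase_read, adaptor=ADAPTOR):
    """Split a polymerase read at exact adaptor matches.

    Empty pieces (e.g. before a leading adaptor) are dropped. If an
    adaptor is missing between a forward and a reverse pass, both
    passes stay glued together in one chimeric piece.
    """
    return [piece for piece in polymerase_read.split(adaptor) if piece]
```
</pre><p>
With this sketch, a read of the form Adap+FwdSeq+Adap+RevSeq comes back as two subreads, while a read of the form Adap+FwdSeq+RevSeq (no adaptor between the two passes) comes back as a single piece containing the fragment followed by its own reverse complement.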
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_rawreadaccuracy"></a>12.4.2.3.
Accuracy of uncorrected subreads
</h4></div></div></div><p>
You need to be brave now: the accuracy of the unclipped
polymerase reads is usually only about 50%. That is: on average
every second base is wrong. And I have seen a project where this
accuracy was only 14% (6 out of 7 bases are wrong).
</p><p>
After clipping, the average accuracy of the polymerase reads should
be anywhere between 80% and 85% (this depends a little bit on the
chemistry used), which translates to: every 5th to every 7th base is
wrong. The vast majority of errors are insertions or deletions, not
base substitutions.
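</p><p>
The "every Nth base is wrong" figures follow directly from the accuracy: at an accuracy a, the error rate is 1 - a, so on average one base in 1 / (1 - a) is wrong. A small sketch of that arithmetic (the function name is made up for illustration):
</p><pre class="screen">
```python
def bases_per_error(accuracy):
    """Average number of bases per erroneous base.

    accuracy is a fraction: 0.0 (everything wrong) up to, but not
    including, 1.0 (no errors, for which the ratio is undefined).
    """
    if accuracy >= 1.0 or 0.0 > accuracy:
        raise ValueError("accuracy must be at least 0.0 and below 1.0")
    return 1.0 / (1.0 - accuracy)
```
</pre><p>
For the accuracies quoted above this gives roughly 1.2 (14%), 2 (50%), 5 (80%) and 6.7 (85%) bases per error.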
</p><p>
80% to 85% accuracy with indels as the primary error is unfortunately
something assemblers cannot use very well. Read: not at all if you
want good assemblies (at least I know no program which does
that). Therefore, one needs to apply some sort of correction
... which needs quite a deal of CPU, see below.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_cpu"></a>12.4.2.4.
Immense need for CPU power
</h4></div></div></div><p>
The above mentioned accuracies of 80% to 85% are too low for any
existing assembler I know of to produce correct assemblies. Therefore,
people came up with the idea of doing error correction on subreads
to improve their quality.
</p><p>
There are two major approaches: 1) correcting PacBio subreads with
other technologies with shorter reads and 2) correcting long PacBio
subreads with shorter PacBio subreads. Both approaches have been
shown to work, though there seems to be a preference nowadays to use
the second option as the "shorter" PacBio reads provide the benefit
of being still longer than reads from other technologies and hence
provide a better repeat resolution.
</p><p>
Anyway, the amount of CPU power needed for either method above is
something to keep in mind: bacteria with 3 to 5 megabases at a 100x
polymerase read coverage can take several hundred hours of CPU for
the correction step.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_std_pb_ll_dnaprep"></a>12.4.2.5.
Increased quality requirements for clean DNA sample prep
</h4></div></div></div><p>
This is a problem which cannot be really attributed to PacBio: one
absolutely needs to check whether the protocols used "since ever"
for DNA extraction yield results which are clean and long enough for
PacBio. Often they are not.
</p><p>
The reason for this being a problem is simple: PacBio can sequence
really long fragments, but if your DNA extraction protocol smashed
the DNA into small pieces, then no sequencing technology in this
universe will be able to give you long reads from small fragments.
</p></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_seqadvice"></a>Chapter 13. Some advice when going into a sequencing project</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_seqadv_seqprovider">13.1.
Talk to your sequencing provider(s) before sequencing
</a></span></dt><dt><span class="sect1"><a href="#sect_seqadv_whichseqprovider">13.2.
Choosing a sequencing provider
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_want">13.2.1.
WHAT DO YOU WANT?!
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_need">13.2.2.
WHAT DO YOU NEED?!
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_cost">13.2.3.
WHAT WILL IT COST ME?
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_where">13.2.4.
WHERE TO SEQUENCE?
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_whichseqprovider_summary">13.2.5.
Summary of all the above
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_specific">13.3.
Specific advice
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_technologies">13.3.1.
Technologies
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_seqadv_technologies_sanger">13.3.1.1.
Sanger
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_pacbio">13.3.1.2.
Pacific Biosciences
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_illumina">13.3.1.3.
Illumina
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_iontorrent">13.3.1.4.
Ion Torrent
</a></span></dt><dt><span class="sect3"><a href="#sect_seqadv_technologies_454">13.3.1.5.
Roche 454
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect_seqadv_denovo">13.3.2.
Sequencing de-novo
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_mapping">13.3.3.
Re-sequencing / mapping
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_a_word_or_two_on_coverage">13.4.
A word or two on coverage ...
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_lowcov">13.4.1.
Low coverage isn't worth it
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_highcov">13.4.2.
Catch-22: too high coverage
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_seqadv_when_sequencing_a_word_of_caution_regarding_your_dna">13.5.
A word of caution regarding your DNA in hybrid sequencing projects
</a></span></dt><dt><span class="sect1"><a href="#sect_seqadv_for_bacteria">13.6.
Advice for bacteria
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_seqadv_for_bacteria_no_not_sample_in_exponential_phase">13.6.1.
Do not sample DNA from bacteria in exponential growth phase!
</a></span></dt><dt><span class="sect2"><a href="#sect_seqadv_for_bacteria:_beware_of_high_copy_number_plasmids">13.6.2.
Beware of (high copy number) plasmids!
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em>
<span class="quote">“<span class="quote">
Reliable information lets you say 'I don't know' with real confidence.
</span>”</span>
</em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_seqprovider"></a>13.1.
Talk to your sequencing provider(s) before sequencing
</h2></div></div></div><p>
Well, duh! But it's interesting what kind of mails I sometimes get. Like in:
</p><div class="blockquote"><blockquote class="blockquote"><span class="quote">“<span class="quote">We've sequenced a one gigabase, diploid eukaryote with
Solexa 36bp paired-end with 200bp insert size at 25x coverage. Could you
please tell us how to assemble this data set de-novo to get a finished
genome?</span>”</span></blockquote></div><p>
A situation like the above should have never happened. Good sequencing
providers are interested in keeping customers long term and will
therefore try to find out what exactly your needs are. These folks
generally know their stuff (they're making a living out of it) and most
of the time will propose a strategy that fulfills your needs for a near
minimum amount of money.
</p><p>
Listen to them.
</p><p>
If you think they try to rip you off or are overselling their
competences (which most providers I know won't even think of trying,
but there are some), ask for a quote from a couple of other
providers. You'll see pretty quickly if something is not
right.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
As a matter of fact, a rule which has saved me time and again for
finding sequencing providers is not to go for the cheapest provider,
especially if their price is far below quotes from other
providers. They're cutting corners somewhere others don't cut for a
reason.
</td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_whichseqprovider"></a>13.2.
Choosing a sequencing provider
</h2></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This is a slightly reworked version of a post I made on the MIRA talk
mailing list. The question <span class="emphasis"><em>"Could you please recommend me a
sequencing provider?"</em></span> arrives every now and then in my
private inbox, often enough for me to decide to make a collage of the
responses I gave in the past and post it to MIRA talk.
</td></tr></table></div><p>
This response got, errrr, a little longer, but allow me to note that I
will not give you names. The reasons are manifold:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
once upon a time I worked for a sequencing company
</li><li class="listitem">
the company I am currently employed with is not in the sequencing
provider business, but the company uses more than one sequencing
provider on a regular basis and I get to see quite a lot of data
</li><li class="listitem">
due to my development on MIRA in my free time, I'm getting insight
into a number of highs and lows of sequencing technologies at
different sequencing providers which I would not get if I were to
expose them publicly ... I do not want to jeopardise these
relationships.
</li></ul></div><p>
That being said, there are a number of general considerations which
could help you. Excuse me in case the detours I am going to make are
obvious to you, but I'm also writing this for future reference. Also,
please bear with me if I look at "sequencing" a bit differently than you
might be accustomed to from academia, but I have worked for quite some
time now in industry ... and there the cost-effectiveness (or rather the
"probability of success") of a project as a whole is paramount to
everything else. I'll come back to that further down.
</p><p>
There's one -- and only one -- question which you, as sequencing
customer, need to be able to answer ... if necessary in every
excruciating detail, but you must know the answer. The question is:
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_want"></a>13.2.1.
WHAT DO YOU WANT?!
</h3></div></div></div><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Detour - Sequencing -
</b></p></div></div></div><p>
For me, every "sequencing project", be it genomic or transcriptomic,
really consists of four major phases:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
<span class="bold"><strong>data generation:</strong></span> This can be
broadly seen as everything to get the DNA/RNA ready to be sent
off to sequencing (usually something the client does), the
library prep at the sequencing provider and finally the
sequencing itself (including base calling). An area of a thousand
pitfalls where each step (and the communication) is crucial and
even one slight oversight can make the difference between a
"simple" project and a "hard" project. E.g.: taking DNA from
growing cells (especially bacteria in exponential growth phase)
might not be a good idea ... it makes assembly more
difficult. Some DNA extraction methods generate more junk than
good fragments, etc.
</p><p>
The reason I am emphasizing this is simple: nowadays, the
"sequencing" itself is not the most expensive part of a
sequencing project, the next two steps are (most of the time
anyway).
</p></li><li class="listitem"><p>
<span class="bold"><strong>assembly & finishing:</strong></span> Still
a hard problem. Even a "simple" bacterium can present weeks of
effort to get right if it's riddled with phages, prophages,
transposon elements, genetically engineered repeats etc. And
starting with eukaryotes the real fun starts: ploidy,
retrotransposons etc. make for an unbelievable genome plasticity
and almost always have their own surprises. I've seen "simple"
Saccharomyces cerevisiae - where biologists swore to high heaven
they were "close to the publicly sequenced strains" - being
*very* different from what they were expected to be, both on the
DNA level and the genome organisation level.
</p><p>
Getting eukaryotes right "down to the last base" might cost
quite some money, especially when looping back to step 1 (data
generation) to tackle difficult areas.
</p></li><li class="listitem"><p>
<span class="bold"><strong>annotation:</strong></span> Something many
people forget: give the sequence a meaning. Here too, things can
get quite costly if done "right", i.e., with hand
curation, especially for organisms which are not among the more
commonly sequenced species or are generally more complex.
</p><p>
Annotation of a de-novo transcriptome assembly is also not for
the faint of heart, especially if done on short, unpaired read
assemblies.
</p></li><li class="listitem"><span class="bold"><strong>using the sequencing data:</strong></span>
... for whatever it was generated for.
</li></ol></div></div><p>
The above makes it clear that, depending on what you are really
interested in within your project and what you expect to be able to do
with the sequencing data, one can cut corners and reduce cost here and
there (but not everywhere). And therefore, the above question "What do
you want?" is one which - after the initial chit-chat of "hi, hello,
nice to meet you, a pleasure to be here, etc." - every good
representative of respectable sequencing providers I have met so far
will ask as the very first question. Usually in the form of "what do you
want to sequence and what will you want to use the data for (and what
not)?"
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_need"></a>13.2.2.
WHAT DO YOU NEED?!
</h3></div></div></div><p>
... difference between "want" and "need" ...
</p><p>
Every other question - like where to sequence, which sequencing
technology to use, how to process the sequencing data afterwards - is
incidental and subordinated to your answer(s) to the question of "what
do you want?!" But often sequencing customers get their priorities
wrong by putting forward another question:
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_cost"></a>13.2.3.
WHAT WILL IT COST ME?
</h3></div></div></div><p>
And its inevitable companion question "Can you make it cheaper?"
</p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Detour - Putting things into perspective -
</b></p></div></div></div><p>
Come to think of it, people sometimes have very interesting ideas
regarding costs. Interesting as in "outright silly." It may be
because they do not really know what they want or feel unsure on
unfamiliar terrain, and so often focus their energy on
single aspects of a wider project because they feel more at home
there. And suddenly the focus lies on haggling and bartering for
some prices because, after all, this is something everyone knows how
to do, right?
</p><p>
As I hinted earlier, the pure sequencing costs are nowadays probably
not the biggest factor in any sequencing project: 454, Illumina,
IonTorrent and other technology providers have seen to that. E.g.,
in 2003/2004 it still cost somewhere between 150 and 200 k€ to get an
8x Sanger coverage of a moderately sized bacterium (4 to 5
Mb). Nowadays, for the same organism, you get coverages in the
dozens (going with 454) for a few thousand Euro ... or coverages in
the hundreds or even thousands (going with Illumina) for a few
hundred Euro.
</p><p>
Costs for assembly, finishing and annotation have not followed the
same decrease. Yes, advances in algorithms have made things easier
in some parts, but not really on the same scale. Furthermore, the
"short read" technologies have more than offset those gains by adding
algorithmic complexity when compared to the old Sanger reads. Maybe
"(ultra)long read" technologies will alleviate the problem, but I
would not hold my breath for them to really work well.
</p><p>
One thing however has almost not changed at all: your costs of
actually doing followup experiments and data interpretation!
Remember that sequencing in itself is most of the time not the
ultimate goal, you actually want to gain something out of it. Be it
abstract knowledge for a paper or concrete hints for producing some
compounds or whatever, chances are that you will actually devote a
substantial amount of your resources (time, manpower, mental health)
into followup activities (lab experiments, genetic engineering,
writing papers) to turn the abstract act of sequencing into
something tangible, be it papers, fame, new products, money, or
whatever you want to achieve.
</p><p>
And this is the place where it pays to stop and think: "what do I
want? what are my strengths and where are my weaknesses? where are
my priorities?" The English have a nice saying: "Being penny-wise
and pound-foolish is not wise." I may add: Especially not if you are
basing man months / years of lab work and your career on the outcome
of something like sequencing. Maybe I'm spoiled because I left
academia quite some time ago, but in sequencing I always prefer
to throw a bit more money at the sequencing process itself to
minimise risks of the later stages.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_where"></a>13.2.4.
WHERE TO SEQUENCE?
</h3></div></div></div><p>
There's one last detour I'd like to make, and that is the question of "where to sequence?"
</p><div class="sidebar"><div class="titlepage"><div><div><p class="title"><b>
Detour - Public or private, old-timers or young-timers ? -
</b></p></div></div></div><p>
Choosing a sequencing provider is highly dependent on your answer to
"what do you want?" Wanting to keep the sequencing data (or
the very act of sequencing) secret (even only for some time) will
probably lead you to commercial sequencing companies. There you more
or less have complete control over the data. Paranoid people might
perhaps argue that you can have that only with your own sequencing
equipment and personnel, but I have the feeling that only a minority
is able to cough up the necessary money for purchasing sequencing
equipment for a small one-time project.
</p><p>
Instead of companies you could however also look whether one of the
existing sequencing centers in the world might be a good cooperation
candidate. Especially if you are doing this project within the scope
of your university. Note however that there might be a number of
gotchas lurking there, besides the obvious "the data is not really
secret anymore": sometimes the raw sequencing data needs to be
publicly released, maybe earlier than you would like; or the
sequencing center requires that each and every paper you publish
based on that data lists them as (co-)first author.
</p><p>
A related problem is "whom do I trust to deliver good work?"
Intuition says that institutes with a long sequencing history have
amassed quite some knowledge in this field, making them experts in
all three aspects (data generation, assembly & finishing,
annotation) of a sequencing project ... and intuition probably isn't
wrong there. The same thing is probably true for sequencing
companies which have existed for more than just a couple of years,
though what I have seen so far is that - due to size -
sequencing companies sometimes really focus on the data generation
and rely on partner companies for "assembly" and "annotation". This
is not to say that younger companies are bad. Incidentally, it is my
belief that in this field, people are still more important than
technology ... and every once in a while good people split off from a
well known institute (or company) to try their luck in their own
company. Always look for references there.
</p><p>
The following statement is a personal opinion (and you can call me
biased for that): I am quite wary of sequencing
done at locations where a sequencer exists because someone got a
grant to buy one (because it was chic & en-vogue to get a shiny
new toy) but where the instrument then slowly starts to collect dust
after the initial flurry ... and because people often do not
calculate the chemistry costs which would arise if they really intended
to use the machine 24/7. I want to know that technicians actually
work with those things every day, that they know the ins and outs of
the work, the protocols, the chemistry, the moods of the machine
(even an instrument can have a bad day). I honestly do not believe
that one can build up enough expertise when operating these things
"every once in a while".
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_whichseqprovider_summary"></a>13.2.5.
Summary of all the above
</h3></div></div></div><p>
All of the above means that depending on what I need the data for, I
have the freedom to choose among different providers. In case I just need
masses of raw data and potential savings are substantial, I might go
with the cheapest provider I know to generate good data. If I want good
service and a second round of data in case I am not 110% satisfied with
the first round (somehow people have stopped questioning me there),
this is usually not the cheapest provider ... but the additional costs
are not really high. If I wanted my data really really quick, I'd
search for a provider with Ion Torrent, or MiSeq (I am actually
looking for one with a MiSeq, so if anyone knows a good one,
preferably in Europe -> mail me). Though I already did transcriptomics
on eukaryotes, in case I needed larger eukaryotes assembled de-novo
& also annotated, I would probably look for the help of a larger
sequencing center as this starts to get dangerously near the fringe of
my field of expertise.
</p><p>
In closing this part, here are a couple of guidelines which have not
failed me so far for choosing sequencing providers:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
Building a good relationship helps. In case your institute /
university already has good (or OK) experience with a provider, ask
there first.
</li><li class="listitem">
It is a lot easier to build a good relationship with someone who
speaks your language ... or good(!) English.
</li><li class="listitem">
I will not haggle for a couple of hundred Euros in a single project,
but I'll certainly reconsider this when savings are in the tens of
thousands.
</li><li class="listitem">
Managing expectations: some sequencing projects are high risk from
the start, for lots of possible reasons (underfunded, bad starting
material, unclear organism). This is *sometimes* (!) OK as long as
everyone involved knows and acknowledges this. However, you should
always have a clear target ("what am I looking for?") and preferably
know in advance how to treat the data to get there.
</li><li class="listitem">
Errors occur, stay friendly at first. In case the expectations were
clear (see above), the material and organism are not at fault but
the data quality somehow is bad, it is not too difficult to have the
sequencing provider acknowledge this and get additional sequencing
for no added cost.
</li></ul></div><p>
Regarding the technologies you can use ... it really depends on what
you want to do :-) And note that I base my answers on technologies
available today without bigger problems: PacBio, Illumina, with
IonTorrent as Joker for quick projects. 454 can still be considered,
but probably not for much longer, as Roche has stopped development of
the technology and PacBio is taking over the role for long
reads. Oxford Nanopore might become a game changer, but they are not
there just yet.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_specific"></a>13.3.
Specific advice
</h2></div></div></div><p>
Here's how I see things as of now (January 2014), which might not
necessarily be how others see them.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_technologies"></a>13.3.1.
Technologies
</h3></div></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_sanger"></a>13.3.1.1.
Sanger
</h4></div></div></div><p>
Use for: checking assemblies; closing gaps by PCR; checking for a couple of genes with
known sequence (i.e., for which you can design oligos).
</p><p>
Do not use for: anything else. In particular, if you find yourself
designing oligos for a 96 well plate destined for Sanger sequencing
of a single bacterial DNA sample, you (probably) are doing something
wrong.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_pacbio"></a>13.3.1.2.
Pacific Biosciences
</h4></div></div></div><p>
Use for: de-novo of bacteria and lower eukaryotes (or higher
eukaryotes if you have the money). PacBio should be seen as
<span class="emphasis"><em>the</em></span> technology to use when getting the best
assemblies with the least number of contigs is important to you. Also,
resequencing of variants of known organisms with lots of genomic
reorganisation flexibility due to high numbers of transposons (where
short reads will not help in getting the chromosome assembled/mapped
correctly).
</p><p>
Do not use for: resequencing of "dull" organisms (where the only
differences will be simple SNPs or simple insertion/deletions or
simple contig reorganisations at non-repetitive places). Illumina
will do a much better and more cost-effective job there.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top"><p>
As of January 2014: aim for at least 100x coverage of raw data,
better 130x to 150x as pre-processing (quality clip, removal of
adapters and other sequencing artefacts) will take its toll and
reduce the data by up to 1/3. After that, the error
correction/self-correction of raw reads into corrected reads will
again reduce the data considerably.
</p><p>
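To put numbers on the coverage advice above, here is a minimal back-of-the-envelope sketch in Python. The function names and the 5 Mb genome are made up for illustration; the only figures taken from the text are the 100x to 150x raw coverage and the "up to 1/3" loss in pre-processing. The additional loss during error correction is not modelled because no figure is given for it.
</p><pre class="programlisting">
```python
def pacbio_raw_bases_needed(genome_size_bp, target_raw_coverage=130):
    """Raw bases to order for a given genome size and raw coverage."""
    return genome_size_bp * target_raw_coverage


def coverage_after_preprocessing(raw_coverage, max_loss_fraction=1 / 3):
    """Worst-case coverage left if clipping removes up to 1/3 of the data."""
    return raw_coverage * (1 - max_loss_fraction)


if __name__ == "__main__":
    genome = 5_000_000  # hypothetical 5 Mb bacterium, not from the guide
    for cov in (100, 130, 150):
        raw = pacbio_raw_bases_needed(genome, cov)
        left = coverage_after_preprocessing(cov)
        print(f"{cov}x raw = {raw / 1e6:.0f} Mb ordered, "
              f"roughly {left:.0f}x or more after pre-processing")
```
</pre><p>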
It's really a numbers game: the more data you have, the more
likely you will also get many of those really long reads in the 5
to 30 Kb range which are extremely useful to get over those nasty
repeats.
</p></td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
MIRA will most probably give you longer contigs with corrected
PacBio reads than you get with the HGAP pipeline, but the number of
indel errors will currently be higher. Either use Quiver on the
results of MIRA ... or simply polish the assembly with a cheap
Illumina data set. The latter approach will also give you better
results than a Quiver approach.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For non-haploid organisms, you might need more coverage to get
enough data at ploidy sites to get the reads correctly out of
error correction.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Preparation of your DNA sample is not trivial as many methods will
break your DNA into "small" chunks which are good enough for
Sanger, 454, Illumina or Ion Torrent, but not for PacBio.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_illumina"></a>13.3.1.3.
Illumina
</h4></div></div></div><p>
Use for: general resequencing jobs (finding SNPs, indel locations of
any size, copy number variations etc.); gene expression analysis;
cheap test sequencing of unknown organisms to assess complexity;
de-novo sequencing if you are OK with getting hundreds / thousands
of contigs (depending on organism, some bacteria get only a few
dozen).
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Careful with high-GC organisms: starting at 60% to 65% GC, Illumina
reads contain more errors, and SNP detection may be less reliable if
extreme care is not taken to perform good read clipping. Especially
the dreaded GGCxG motif often leads to problems in Illumina reads.
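<p>
If you already have a reference or a draft sequence, a quick check like the following Python sketch tells you whether you are in that danger zone. The function names are hypothetical, the 60% threshold is simply the lower bound quoted above, and only the forward strand is scanned for GGCxG, so treat the counts as a rough indicator rather than a quality measure.
</p><pre class="programlisting">
```python
import re

# Lookahead so that overlapping occurrences are counted as well.
GGCXG = re.compile(r"(?=GGC[ACGT]G)")


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def count_ggcxg(seq):
    """Number of GGCxG motifs on the forward strand (x = any base)."""
    return len(GGCXG.findall(seq.upper()))


if __name__ == "__main__":
    example = "ATGGCAGGCTGCCGGCGGCCG"  # made-up sequence for illustration
    gc = gc_fraction(example)
    print(f"GC content: {gc:.0%}, GGCxG motifs: {count_ggcxg(example)}")
    if gc >= 0.60:
        print("High GC: be extra careful with read clipping.")
```
</pre>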
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
For de-novo assemblies, do <span class="emphasis"><em>NOT</em></span> (never ever at
all and under no circumstances) use the Nextera kit, take
TruSeq. The non-random fragmentation behaviour of Nextera leads to
all sorts of problems for assemblers (not only MIRA) which try to
use kmer frequencies as a criterion for repetitiveness of a given
sequence.
</td></tr></table></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_iontorrent"></a>13.3.1.4.
Ion Torrent
</h4></div></div></div><p>
Use for: like Illumina. With three notable exceptions: 1) SNP
detection is not as good as with Illumina (more false positives and
false negatives), 2) de-novo assemblies will contain more single-base
indels and 3) because Ion has problems with homopolymers, it is not
as well suited as Illumina as a complementary hybrid technology for
PacBio (except for high-GC perhaps).
</p><p>
Ion has a speed advantage over Illumina: if you have your own machine,
getting from your sample to data takes less time than with Illumina.
</p><p>
Also, it looks as if Ion has fewer problems with GC content or
sequence motifs than Illumina.
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_seqadv_technologies_454"></a>13.3.1.5.
Roche 454
</h4></div></div></div><p>
That technology is on the way out, but there may be two reasons to
not completely dismiss 454: 1) the average read length of 700 bp can
be seen as a plus when compared to Illumina or Ion ... but then
there's PacBio to take care of read length. 2) the large read-pair
libraries work better with 454 than Illumina mate-pair libraries,
something which might be important for scaffolding data where even
PacBio could not completely resolve long repeats.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_denovo"></a>13.3.2.
Sequencing de-novo
</h3></div></div></div><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
On a cheap gene fishing expedition? Probably Illumina HiSeq, at
least 100bp, 150 to 250bp or 300bp if your provider supports it
well. Paired-end is definitely a plus. As an alternative: Ion Torrent for
small organisms (maybe up to 100Mb) and when you need results quickly
without caring for possible frameshifts.
</li><li class="listitem">
Want some larger contigs? PacBio. Add in cheap Illumina 100bp
paired-end (150 to 300bp if provider supports it) to get rid of
those last frameshifts which may remain.
</li><li class="listitem">
Maybe scaffolding of contigs above? PacBio + Illumina 100bp + a
large paired-end library (e.g. 454 20kb)
</li><li class="listitem">
Have some good friends at Oxford Nanopore who can give you some
MinION engineering samples? Man, I'd kill for some bacterial test
sets with those (especially Bacillus subtilis 168)
</li></ul></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_mapping"></a>13.3.3.
Re-sequencing / mapping
</h3></div></div></div><p>
There is a reason why Illumina currently dominates the market as it
does: a cheap Illumina run (preferably paired-end) will answer most of
your questions in 99% of the cases. Things will get difficult for
organisms with high numbers of repeats and/or frequent genome
re-arrangements. Then using longer read technologies and/or Illumina
mate-pair may be required.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_a_word_or_two_on_coverage"></a>13.4.
A word or two on coverage ...
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_lowcov"></a>13.4.1.
Low coverage isn't worth it
</h3></div></div></div><p>
There's one thing to be said about coverage and de-novo assembly:
especially for bacteria, getting more than 'decent' coverage is
<span class="emphasis"><em>cheap</em></span> with any current day technology. Every
assembler I know will be happy to assemble de-novo genomes with
coverages of 25x, 30x, 40x ... and the number of contigs will still
drop dramatically between a 15x Ion Torrent and a 30x Ion Torrent
project.
</p><p>
In any case, do some calculations: if the coverage you expect to get
reaches 50x (e.g. 200MB raw sequence for a 4MB genome), then you
(respectively the assembler) can still throw away the worst 20% of the
sequence (with lots of sequencing errors) and concentrate on the
really, really good parts of the sequences to get you nice contigs.
</p><p>
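The calculation above is simple enough to do in your head, but here it is spelled out as a small Python sketch (function names are mine, the numbers are the ones from the example: 200MB of raw sequence, a 4MB genome, and the worst 20% thrown away).
</p><pre class="programlisting">
```python
def expected_coverage(raw_bases, genome_size_bp):
    """Average coverage = total sequenced bases / genome size."""
    return raw_bases / genome_size_bp


def coverage_after_discarding(coverage, discard_fraction):
    """Coverage that remains when a fraction of the data is thrown away."""
    return coverage * (1 - discard_fraction)


if __name__ == "__main__":
    cov = expected_coverage(200_000_000, 4_000_000)
    kept = coverage_after_discarding(cov, 0.20)
    print(f"expected coverage: {cov:.0f}x")
    print(f"after dropping the worst 20%: {kept:.0f}x")
```
</pre><p>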
Other example: the price for 1 gigabase Illumina paired-end of a
single DNA prep is way, way below USD 1000, even with commercial
providers. Then you just need to do the math: is it worth investing
10, 20, 30 or more days of wet lab work, designing primers, doing PCR
sequencing etc. and trying to close remaining gaps or hunt down
sequencing errors when you went for a 'low' coverage or a non-hybrid
sequencing strategy? Or do you invest a few bucks more to get some
additional coverage and considerably reduce the uncertainties and gaps
which remain?
</p><p>
Remember, you probably want to do research on your bug and not
research on how to best assemble and close genomes. So even if you put
(PhD) students on the job, it will cost you time and money later if you
tried to save money earlier on the sequencing. Penny-wise and
pound-foolish is almost never a good strategy :-)
</p><p>
I do agree that with eukaryotes, things start to get a bit more
interesting from the financial point of view.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_highcov"></a>13.4.2.
Catch-22: too high coverage
</h3></div></div></div><p>
There is, however, a catch-22 situation with coverage: too much
coverage isn't good either. Without going into details: sequencing
errors sometimes interfere heavily when coverage exceeds ~60x to 80x
for 454 & IonTorrent and approximately 150x to 200x for
Solexa/Illumina.
</p><p>
In those cases, do yourself a favour: there's more than enough data
for your project ... just cut it down to some reasonable amount: 40x
to 50x for 454 & IonTorrent, 100x for Solexa/Illumina.
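</p><p>
How much to cut is again simple arithmetic. The following Python sketch (hypothetical helper names, a made-up 400x project) computes the fraction of reads to keep and draws a random subsample; it is only meant to show the idea and says nothing about how any particular tool, MIRA included, does it.
</p><pre class="programlisting">
```python
import random


def keep_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that coverage drops to the target."""
    if target_coverage >= current_coverage:
        return 1.0  # already at or below the target: keep everything
    return target_coverage / current_coverage


def subsample(records, fraction, seed=42):
    """Keep each record with probability `fraction`.

    A record may be a single read or a tuple holding both mates of a
    pair, so that paired reads are kept or dropped together.
    """
    rng = random.Random(seed)  # fixed seed makes the selection repeatable
    return [rec for rec in records if fraction > rng.random()]


if __name__ == "__main__":
    # Hypothetical Illumina project at 400x, cut down to 100x.
    fraction = keep_fraction(400, 100)
    reads = [f"read_{i}" for i in range(10000)]
    kept = subsample(reads, fraction)
    print(f"keeping {fraction:.0%}: {len(kept)} of {len(reads)} reads")
```
</pre><p>
For paired-end data, make sure both mates of a pair are treated as one
record, otherwise the subsample will be full of orphaned reads.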
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_when_sequencing_a_word_of_caution_regarding_your_dna"></a>13.5.
A word of caution regarding your DNA in hybrid sequencing projects
</h2></div></div></div><p>
So, you have decided that sequencing your bug with PacBio and Illumina
(or PacBio and Ion Torrent or whatever) may be a viable way to get the
best bang for your buck. Then please follow this advice: prepare enough
DNA <span class="emphasis"><em>in</em></span> <span class="emphasis"><em>one</em></span>
<span class="emphasis"><em>go</em></span> for the sequencing provider so that they can
sequence it with all the technologies you chose without you having to
prepare another batch ... or even grow another culture!
</p><p>
The reason is that as soon as you prepare a second batch, the probability
that it contains a mutation somewhere that your first batch did not have is
not negligible. And if there is a mutation, even if it is only one base,
there is a >95% chance that MIRA will find it, think it is some
repetitive sequence (like a duplicated gene with a mutation in it) and
split contigs at those places.
</p><p>
Now, there are times when you cannot completely be sure that different
sequencing runs did not use slightly different batches (or even strains).
</p><p>
One example: the SFF files for SRA000156 and SRA001028 from the NCBI
short read archive should both contain E. coli K12 MG1655 (two
unpaired half plates and a paired-end plate). However, they contain
DNA from different cultures. Furthermore, the DNA was prepared by
different labs. The net effect is that the sequences in the paired-end
library contain a few distinct mutations from the sequences in the two
unpaired half-plates. Furthermore, the paired-end sequences contain
sequences from phages that are not present in the unpaired sequences.
</p><p>
In those cases, provide strain information for the reads so that MIRA can
discern possible repeats from possible SNPs.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_seqadv_for_bacteria"></a>13.6.
Advice for bacteria
</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_for_bacteria_no_not_sample_in_exponential_phase"></a>13.6.1.
Do not sample DNA from bacteria in exponential growth phase!
</h3></div></div></div><p>
The reason is simple: some bacteria grow so fast that they start
replicating themselves even before having finished the first
replication cycle. This leads to more DNA around the origin of
replication being present in cells, which in turn fools assemblers and
mappers into believing that those areas are either repeats or that
there are copy number changes.
</p><p>
Sample. In. Stationary. Phase!
</p><p>
For de-novo assemblies, MIRA will warn you if it detects data which
points at exponential phase. In mapping assemblies, look at the
coverage profile of your genome: if you see a smile shape (or V
shape), you have a problem.
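</p><p>
If you prefer a number over eyeballing the plot, a crude check is to smooth the coverage profile and compare the highest with the lowest region. The Python sketch below does that for a list of per-window coverage values along a circular genome. The function names and the window size are my own choices for illustration; this is not what MIRA does internally.
</p><pre class="programlisting">
```python
from statistics import mean


def smooth_circular(values, half_window):
    """Moving average that wraps around, as bacterial genomes are circular."""
    n = len(values)
    return [
        mean(values[(i + k) % n] for k in range(-half_window, half_window + 1))
        for i in range(n)
    ]


def peak_to_trough_ratio(window_coverage, half_window=5):
    """Highest smoothed coverage divided by the lowest smoothed coverage."""
    smoothed = smooth_circular(window_coverage, half_window)
    lowest = min(smoothed)
    if lowest == 0:
        return float("inf")
    return max(smoothed) / lowest


if __name__ == "__main__":
    # Made-up V-shaped profile: 60x at the origin, 30x at the terminus.
    v_shape = [30 + 30 * abs(i - 50) / 50 for i in range(100)]
    flat = [30] * 100
    print(f"V-shaped profile: ratio {peak_to_trough_ratio(v_shape):.2f}")
    print(f"flat profile:     ratio {peak_to_trough_ratio(flat):.2f}")
```
</pre><p>
A ratio close to 1 means flat coverage; where exactly to draw the line is
a judgment call, and repeats or high-copy plasmids will also produce
local peaks, so always look at the profile itself before blaming the
growth phase.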
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_seqadv_for_bacteria:_beware_of_high_copy_number_plasmids"></a>13.6.2.
Beware of (high copy number) plasmids!
</h3></div></div></div><p>
This is a source of interesting problems and furthermore gets people
wondering why MIRA sometimes creates more contigs than other
assemblers when it usually creates fewer.
</p><p>
Here's the short story: there are data sets which include one or
several high-copy plasmid(s). Here's a particularly ugly example:
SRA001028 from the NCBI short read archive which contains a plate of
paired-end reads for E. coli K12 MG1655-G
(<a class="ulink" href="ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001028/" target="_top">ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001028/</a>).
</p><p>
The genome is sequenced at ~10x coverage, but during the assembly,
three intermediate contigs of ~2kb each attain a silly maximum coverage
of ~1800x each. Taken together (3 × 1800x against a genome at ~10x), this
means that there were ~540 copies of this plasmid (or these plasmids)
for every copy of the genome in the sequencing.
</p><p>
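The copy number estimate above is simple arithmetic. A minimal Python sketch,
assuming that coverage scales linearly with copy number and that the three
contigs belong to the same plasmid(s); the function name is hypothetical:
</p><pre class="screen">
```python
def estimate_plasmid_copies(contig_coverages, genome_coverage):
    """Rough copies-per-genome estimate from coverage ratios.

    Assumes read depth scales linearly with copy number and that the
    listed contigs are pieces or variants of the same plasmid(s), so
    their coverages add up.
    """
    if genome_coverage == 0:
        raise ValueError("genome coverage must be non-zero")
    return sum(contig_coverages) / genome_coverage


# Numbers from the SRA001028 example: three ~2kb contigs at ~1800x each,
# genome at ~10x.
copies = estimate_plasmid_copies([1800, 1800, 1800], 10)
print(round(copies))  # 540
```
</pre><p>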
When using the uniform read distribution algorithm - which is switched
on by default when using "--job=" and the quality level of 'accurate' -
MIRA will estimate the average coverage of the genome to be
~10x. Subsequently this leads MIRA to dutifully create ~500 additional
contigs (plus some contig debris) with various incarnations of
that plasmid at an average of ~10x, because it thinks that these are
repetitive sites within the genome that need to be disentangled.
</p><p>
Things get even more interesting when some of the plasmid / phage
copies are slightly different from each other. These too will be split
apart and when looking through the results later on and trying to join
the copies back into one contig, one will see that this should not be
done because there are real differences.
</p><p>
DON'T PANIC!
</p><p>
The only effect this has on your assembly is that the number of
contigs goes up. This in turn leads to a number of questions in my
mailbox about why MIRA sometimes produces more contigs than Newbler (or
other assemblers), but that is another story (hint: Newbler either
collapses repeats or leaves them completely out of the picture by not
assembling repetitive reads).
</p><p>
What you can do is the following:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
either you assemble everything together and then join the plasmid
contigs manually after assembly, e.g. in gap4 (drawback: with really
high copy numbers, MIRA will work quite a bit longer ... and you
will have a lot of fun joining the contigs afterwards)
</p></li><li class="listitem"><p>
or, after you found out about the plasmid(s) and know the sequence,
you filter out reads in the input data which contain this sequence
(you can use <span class="command"><strong>mirabait</strong></span> for this) and assemble the
remaining reads.
</p></li></ol></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_bitsandpieces"></a>Chapter 14. Bits and pieces</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_bap_using_ssaha2_smalt_to_screen_for_vector_sequence">14.1.
Using SSAHA2 / SMALT to screen for vector sequence
</a></span></dt></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Just when you think it's finally settled, it isn't.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_bap_using_ssaha2_smalt_to_screen_for_vector_sequence"></a>14.1.
Using SSAHA2 / SMALT to screen for vector sequence
</h2></div></div></div><p>
If your sequencing provider gave you data which was NOT pre-clipped for
vector sequence, you can do this yourself in a pretty robust manner
using SSAHA2 -- or the successor, SMALT -- from the Sanger Centre. You
just need to know which sequencing vector the provider used and have its
sequence in FASTA format (ask your provider).
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This screening is a valid method for any type of Sanger sequencing
vectors, 454 adaptors, Illumina adaptors and paired-end adaptors
etc. However, you probably want to use it only for Sanger type data as
MIRA already knows all standard 454, Ion Torrent and Illumina adaptors.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
SSAHA2 and SMALT need their input data to be in FASTA format, so you
will need a FASTA copy of your reads to run them. MIRA itself, however,
can load your original data in whatever format it came in.
</td></tr></table></div><p>
For SSAHA2 follow these steps (most are the same as in the example
above):
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>ssaha2 -output ssaha2 \
  -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6 \
  /path/where/the/vector/data/resides/vector.fasta \
  <em class="replaceable"><code>yourinputsequences.fasta</code></em> > <em class="replaceable"><code>screendataforyoursequences.ssaha2</code></em></code></strong></pre><p>
Then, in your manifest file, add the following line in the readgroup
which contains the sequences you screened:
</p><pre class="screen">
<strong class="userinput"><code>readgroup
...
data = <em class="replaceable"><code>yourinputsequences_inwhateverformat_thisexamplehasfastq.fastq</code></em>
data = <em class="replaceable"><code>screendataforyoursequences.ssaha2</code></em>
...</code></strong></pre><p>
For SMALT, the only difference is that you use SMALT for generating the
vector-screen file and ask SMALT to generate it in SSAHA2 format. As
SMALT works in two steps (indexing and then mapping), you also need to
perform it in two steps and then call MIRA. E.g.:
</p><pre class="screen">
<code class="prompt">$</code> <strong class="userinput"><code>smalt index -k 7 -s 1 smaltidxdb /path/where/the/vector/data/resides/vector.fasta</code></strong>
<code class="prompt">$</code> <strong class="userinput"><code>smalt map -f ssaha -d -1 -m 7 smaltidxdb <em class="replaceable"><code>yourinputsequences.fasta</code></em> > <em class="replaceable"><code>screendataforyoursequences.smalt</code></em></code></strong></pre><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Please note that, due to subtle differences between output of SSAHA2 (in
ssaha2 format) and SMALT (in ssaha2 format), MIRA identifies the source
of the screening (and the parsing method it needs) by the name of the
screen file. Therefore, screens done with SSAHA2 need to have the
suffix <code class="filename">.ssaha2</code> in the file name and screens done
with SMALT need the suffix
<code class="filename">.smalt</code>.
</td></tr></table></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_faq"></a>Chapter 15. Frequently asked questions</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect_faq_assembly_quality">15.1.
Assembly quality
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_is_the_effect_of_uniform_read_distribution_as:urd?">15.1.1.
What is the effect of uniform read distribution (-AS:urd)?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_there_are_too_many_contig_debris_when_using_uniform_read_distribution_how_do_i_filter_for_good_contigs?">15.1.2.
There are too many contig debris when using uniform read distribution, how do I filter for "good" contigs?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_when_finishing_which_places_should_i_have_a_look_at?">15.1.3.
When finishing, which places should I have a look at?
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_454_data">15.2.
454 data
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_do_i_need_sffs_for?">15.2.1.
What do I need SFFs for?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_what's_sff_extract_and_where_do_i_get_it?">15.2.2.
What's sff_extract and where do I get it?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_do_i_need_the_sfftools_from_the_roche_software_package?">15.2.3.
Do I need the sfftools from the Roche software package?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_combining_sffs">15.2.4.
Combining SFFs
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_adaptors_and_pairedend_linker_sequences">15.2.5.
Adaptors and paired-end linker sequences
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_what_do_i_get_in_pairedend_sequencing?">15.2.6.
What do I get in paired-end sequencing?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_sequencing_protocol">15.2.7.
Sequencing protocol
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_filtering_by_seqlen">15.2.8.
Filtering sequences by length and re-assembly
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_solexa___illumina_data">15.3.
Solexa / Illumina data
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_can_i_see_deletions?">15.3.1.
Can I see deletions?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_can_i_see_insertions?">15.3.2.
Can I see insertions?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_denovo_assembly_with_solexa_data">15.3.3.
De-novo assembly with Solexa data
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_hybrid_assemblies">15.4.
Hybrid assemblies
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_are_hybrid_assemblies?">15.4.1.
What are hybrid assemblies?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_what_differences_are_there_in_hybrid_assembly_strategies?">15.4.2.
What differences are there in hybrid assembly strategies?
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_masking">15.5.
Masking
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_should_i_mask?">15.5.1.
Should I mask?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_how_can_i_apply_custom_masking?">15.5.2.
How can I apply custom masking?
</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sect_faq_miscellaneous">15.6.
Miscellaneous
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_what_are_megahubs?">15.6.1.
What are megahubs?
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_passes_and_loops">15.6.2.
Passes and loops
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_debris">15.6.3.
Debris
</a></span></dt><dt><span class="sect2"><a href="#sect_faq_tmpf_files:_more_info_on_what_happened_during_the_assembly">15.6.4.
Log and temporary files: more info on what happened during the assembly
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect_faq_sequence_clipping_after_load">15.6.4.1.
Sequence clipping after load
</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sect_faq_platforms_and_compiling">15.7.
Platforms and Compiling
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect_faq_windows">15.7.1.
Windows
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Every question defines its own answer. Except perhaps 'Why a duck?'
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><p>
This list is a collection of frequently asked questions and answers
regarding different aspects of the MIRA assembler.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This document needs to be overhauled.
</td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_assembly_quality"></a>15.1.
Assembly quality
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_is_the_effect_of_uniform_read_distribution_as:urd?"></a>15.1.1.
What is the effect of uniform read distribution (-AS:urd)?
</h3></div></div></div><p>
</p><pre class="screen">
I have a project which I once started quite normally via
"--job=denovo,genome,accurate,454"
and once with explicitly switching off the uniform read distribution
"--job=denovo,genome,accurate,454 -AS:urd=no"
I get less contigs in the second case and I wonder if that is not better.
Can you please explain?
</pre><p>
</p><p>
Since 2.9.24x1, MIRA has a feature called "uniform read distribution" which is
normally switched on. This feature reduces over-compression of repeats during
the contig building phase and makes sure that, e.g., an rRNA stretch which is
present 10 times in a bacterium will also be present approximately 10 times in
your result files.
</p><p>
It works a bit like this: under the assumption that reads in a project are
uniformly distributed across the genome, MIRA will enforce an average coverage
and temporarily reject reads from a contig when this average coverage
multiplied by a safety factor is reached at a given site.
</p><p>
It's generally a very useful tool to disentangle repeats, but it has a slight
side effect: rejection of otherwise perfectly good reads. The
assumption of read distribution uniformity is the big problem we have here:
of course it's not really valid. You sometimes have less, and sometimes more
than "the average" coverage. Furthermore, the new sequencing technologies -
454 perhaps but especially the microreads from Solexa and probably also SOLiD -
show that you also have a skew towards the site of replication origin.
</p><p>
One example: let's assume the average coverage of your project is 8 and by
chance at one place you have 17 (non-repetitive) reads, then the following
happens:
</p><p>
<span class="emphasis"><em>p</em></span> = value of the parameter -AS:urdsip
</p><p>
Pass 1 to <span class="emphasis"><em>p</em></span>-1: MIRA happily assembles everything together and calculates a
number of different things, amongst them an average coverage of ~8. At the
end of pass <span class="emphasis"><em>p</em></span>-1, it will announce this average coverage as a first estimate
to the assembly process.
</p><p>
Pass <span class="emphasis"><em>p</em></span>: MIRA has still assembled everything together, but at the end of each
pass the contig self-checking algorithms now include an "average coverage
check". They'll invariably find the 17 reads stacked and decide (looking at
the -AS:urdct parameter which I now assume to be 2) that 17 is larger than
2*8 and that this very well may be a repeat. The reads get flagged as
possible repeats.
</p><p>
Pass <span class="emphasis"><em>p</em></span>+1 to end: the "possibly repetitive" reads get a much tougher
treatment in MIRA. Amongst other things, when building the contig, the contig
now makes sure that "possibly repetitive" reads do not over-stack beyond the average
coverage multiplied by a safety value (-AS:urdcm) which I'll assume in this
example to be 1.5. So, at a certain point, say when read 14 or 15 of
that possible repeat wants to be aligned to the contig at this given place, the
contig will just flatly refuse and tell the assembler to please find another
place for them, be it in this contig that is being built or any other that will
follow. Of course, if the assembler cannot comply, the reads 14 to 17 will end
up as a contiglet (contig debris, if you want) or, if it was only one read that
got rejected like this, it will end up as a singlet or in the debris file.
</p><p>
Tough luck. I do have ideas on how to re-integrate those reads at the end of an
assembly, but I have deferred doing this as in every case I looked at,
adding those reads to the contigs wouldn't have changed anything ... there's
already enough coverage. What I do in those cases is simply filter away the
contiglets (defined as being of small size and having an average coverage
below the average coverage of the project divided by 3 (or 2.5)) from a project.
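</p><p>
The two thresholds from the worked example can be written down as a small
Python sketch. It only reproduces the arithmetic with the values assumed in
the example (average coverage 8, -AS:urdct of 2, -AS:urdcm of 1.5); the
function name is hypothetical and MIRA's real checks are more involved.
</p><pre class="screen">
```python
def urd_thresholds(avg_coverage, urdct=2.0, urdcm=1.5):
    """Return (flag_above, stack_limit) for a given average coverage.

    flag_above:  coverage above this marks reads as possibly repetitive
                 (average coverage times the assumed -AS:urdct value).
    stack_limit: coverage up to which flagged reads may still stack
                 (average coverage times the assumed -AS:urdcm value).
    """
    return avg_coverage * urdct, avg_coverage * urdcm


flag_above, stack_limit = urd_thresholds(8)
possibly_repetitive = 17 > flag_above
print(flag_above, stack_limit, possibly_repetitive)  # 16.0 12.0 True
```
</pre><p>
With the assumed values, the 17 stacked reads exceed the flag threshold of 16.
The stacking limit works out at 12; the exact read at which the contig starts
refusing depends on how the reads overlap at that site.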
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_there_are_too_many_contig_debris_when_using_uniform_read_distribution_how_do_i_filter_for_good_contigs?"></a>15.1.2.
There are too many contig debris when using uniform read distribution, how do I filter for "good" contigs?
</h3></div></div></div><p>
</p><pre class="screen">
When using uniform read distribution there are too many contig with low
coverage which I don't want to integrate by hand in the finishing process. How
do I filter for "good" contigs?
</pre><p>
</p><p>
OK, let's get rid of the cruft. It's easy, really: you just need to look up
one number, take two decisions and then launch a command.
</p><p>
The first decision you need to take is on the minimum average coverage the
contigs you want to keep should have. Have a look at the file
<code class="filename">*_info_assembly.txt</code> which is in the info directory after
assembly. In the "Large contigs" section, there's a "Coverage assessment"
subsection. It looks a bit like this:
</p><pre class="screen">
...
Coverage assessment:
--------------------
Max coverage (total): 43
Max coverage
Sanger: 0
454: 43
Solexa: 0
Solid: 0
Avg. total coverage (size ≥ 5000): 22.30
Avg. coverage (contig size ≥ 5000)
Sanger: 0.00
454: 22.05
Solexa: 0.00
Solid: 0.00
...
</pre><p>
</p><p>
This project was obviously a 454 only project, and the average coverage for it
is ~22. This number was estimated by MIRA by taking only contigs of at least
5Kb into account, which for sure left out everything which could be
categorised as debris. It's a pretty solid number.
</p><p>
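If you want to pick up that number in a script, a small regular expression is
enough. This is a sketch that assumes the file layout shown above; the function
name is hypothetical, and it simply returns the first "Avg. total coverage"
value it finds, which is assumed to be the one from the "Large contigs" section:
</p><pre class="screen">
```python
import re


def read_avg_total_coverage(text):
    """Return the first 'Avg. total coverage' value found in the text.

    Assumes the layout shown above; returns None if the line is absent.
    """
    match = re.search(r"Avg\. total coverage[^:]*:\s*([0-9.]+)", text)
    return float(match.group(1)) if match else None


sample = """Coverage assessment:
--------------------
Max coverage (total): 43
Avg. total coverage (size >= 5000): 22.30
"""
print(read_avg_total_coverage(sample))  # 22.3
```
</pre><p>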
Now, depending on how much time you want to invest performing some manual
polishing, you should extract contigs which have at least the following
fraction of the average coverage:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
2/3 if a quick and "good enough" result is what you want and you don't want to
do any manual polishing. In this example, that would be around 14 or 15.
</p></li><li class="listitem"><p>
1/2 if you want to have a "quick look" and possibly perform some
contig joins. In this example the number would be 11.
</p></li><li class="listitem"><p>
1/3 if you want to be quite accurate and be sure not to lose any possible
repeat. That would be 7 or 8 in this example.
</p></li></ul></div><p>
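The three cut-offs in the list above are plain fractions of the average
coverage. A minimal Python sketch with hypothetical names, using the average
coverage of 22 from this example:
</p><pre class="screen">
```python
def coverage_cutoffs(avg_coverage):
    """Minimum average contig coverage for each strategy listed above."""
    return {
        "good enough, no polishing": avg_coverage * 2 / 3,
        "quick look, some joins": avg_coverage / 2,
        "keep every possible repeat": avg_coverage / 3,
    }


for strategy, cutoff in coverage_cutoffs(22).items():
    print(f"{strategy}: {cutoff:.1f}")
```
</pre><p>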
</p><p>
The second decision you need to take is on the minimum length your contigs
should have. This decision is a bit dependent on the sequencing technology you
used (the read length). The following are some rules of thumb:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
Sanger: 1000 to 2000
</p></li><li class="listitem"><p>
454 GS20: 500
</p></li><li class="listitem"><p>
454 FLX: 1000
</p></li><li class="listitem"><p>
454 Titanium: 1500
</p></li></ul></div><p>
</p><p>
Let's assume we decide on a minimum average coverage of 11 and a minimum length of
1000 bases. Now you can filter your project with miraconvert:
</p><pre class="screen">
miraconvert -x 1000 -y 11 sourcefile.caf filtered.caf
</pre><p>
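To make the effect of the two cut-offs concrete, here is a small Python sketch
of the same decision applied to a few made-up contigs. The contig names and
numbers are hypothetical, and this illustrates the filter rule only, not how
miraconvert is implemented:
</p><pre class="screen">
```python
def keep_contig(length, avg_coverage, min_length=1000, min_coverage=11):
    """Apply the two cut-offs chosen above (1000 bases, coverage 11)."""
    return length >= min_length and avg_coverage >= min_coverage


# Hypothetical contigs as (name, length, average coverage).
contigs = [("c1", 52000, 21.8), ("c2", 1400, 4.2), ("c3", 600, 19.0)]
kept = [name for name, length, cov in contigs if keep_contig(length, cov)]
print(kept)  # ['c1']
```
</pre><p>
Here c2 is dropped for its low coverage and c3 for its short length.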
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_when_finishing_which_places_should_i_have_a_look_at?"></a>15.1.3.
When finishing, which places should I have a look at?
</h3></div></div></div><p>
</p><pre class="screen">
I would like to find those places where MIRA wasn't sure and give it a quick
shot. Where do I need to search?
</pre><p>
</p><p>
Search for the following tags in gap4 or any other finishing program
for finding places of importance (in this order).
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
IUPc
</p></li><li class="listitem"><p>
UNSc
</p></li><li class="listitem"><p>
SRMc
</p></li><li class="listitem"><p>
WRMc
</p></li><li class="listitem"><p>
STMU (only hybrid assemblies)
</p></li><li class="listitem"><p>
STMS (only hybrid assemblies)
</p></li></ul></div><p>
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_454_data"></a>15.2.
454 data
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_do_i_need_sffs_for?"></a>15.2.1.
What do I need SFFs for?
</h3></div></div></div><p>
</p><pre class="screen">
I need the .sff files for MIRA to load ...
</pre><p>
</p><p>
Nope, you don't, but it's a common misconception. MIRA does not load SFF
files, it loads FASTA, FASTA qualities, FASTQ, XML, CAF, EXP and PHD. The
reason why one should start from the SFF is that those files can be used to create
an XML file in TRACEINFO format. This XML contains the absolutely vital
clipping information for the 454 adaptors (the sequencing
vector of 454, if you want).
</p><p>
For 454 projects, MIRA will then load the FASTA, FASTA quality and the
corresponding XML. Or from CAF, if you have your data in CAF format.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what's_sff_extract_and_where_do_i_get_it?"></a>15.2.2.
What's sff_extract and where do I get it?
</h3></div></div></div><p>
</p><pre class="screen">
How do I extract the sequence, quality and other values from SFFs?
</pre><p>
</p><p>
Use the <span class="command"><strong>sff_extract</strong></span> script from Jose Blanca at the
University of Valencia to extract everything you need from the SFF
files (sequence, qualities and ancillary information). The home of
sff_extract is: <a class="ulink" href="http://bioinf.comav.upv.es/sff_extract/index.html" target="_top">http://bioinf.comav.upv.es/sff_extract/index.html</a> but I am
thankful to Jose for giving permission to distribute the script in the
MIRA 3rd party package (separate download).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_do_i_need_the_sfftools_from_the_roche_software_package?"></a>15.2.3.
Do I need the sfftools from the Roche software package?
</h3></div></div></div><p>
No, not anymore. Use the <span class="command"><strong>sff_extract</strong></span> script to
extract your reads. The Roche sfftools package does, however, contain a few
additional utilities which could be useful.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_combining_sffs"></a>15.2.4.
Combining SFFs
</h3></div></div></div><p>
</p><pre class="screen">
I am trying to use MIRA to assemble reads obtained with the 454 technology
but I can't combine my sff files since I have two files obtained with GS20
system and 2 others obtained with the GS-FLX system. Since they use
different cycles (42 and 100) I can't use the sfffile to combine both.
</pre><p>
</p><p>
You do not need to combine SFFs before translating them into something
MIRA (or other software tools) understands. Use
<span class="command"><strong>sff_extract</strong></span> which extracts data from the SFF files
and combines this into input files.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_adaptors_and_pairedend_linker_sequences"></a>15.2.5.
Adaptors and paired-end linker sequences
</h3></div></div></div><p>
</p><pre class="screen">
I have no idea about the adaptor and the linker sequences, could you send me
the sequences please?
</pre><p>
</p><p>
Here are the sequences as filed by 454 in their patent application:
</p><pre class="screen">
>AdaptorA
CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG
>AdaptorB
CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
</pre><p>
</p><p>
However, looking through some earlier project data I had, I also retrieved the
following (by simply making a consensus of sequences that did not match the
target genome anymore):
</p><pre class="screen">
>5prime454adaptor???
GCCTCCCTCGCGCCATCAGATCGTAGGCACCTGAAA
>3prime454adaptor???
GCCTTGCCAGCCCGCTCAGATTGATGGTGCCTACAG
</pre><p>
</p><p>
Go figure, I have absolutely no idea where these come from as they also do not
comply with the "tcag" ending the adaptors should have.
</p><p>
I currently know one linker sequence (454/Roche also calls it <span class="emphasis"><em>spacer</em></span>)
for GS20 and FLX paired-end sequencing:
</p><pre class="screen">
>flxlinker
GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
</pre><p>
</p><p>
For Titanium data using the standard Roche protocol, you need to screen for two
linker sequences:
</p><pre class="screen">
>titlinker1
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
>titlinker2
CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
</pre><p>
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Some sequencing labs modify the adaptor sequences for tagging and
similar things. Ask your sequencing provider for the exact adaptor
and/or linker sequences.
</td></tr></table></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_do_i_get_in_pairedend_sequencing?"></a>15.2.6.
What do I get in paired-end sequencing?
</h3></div></div></div><p>
</p><pre class="screen">
Another question I have is does the read pair sequences have further
adaptors/vectors in the forward and reverse strands?
</pre><p>
</p><p>
Like for normal 454 reads, the normal A and B adaptors can be present
in paired-end reads. In theory, this could look like this:
</p><p>
A-Adaptor - DNA1 - Linker - DNA2 - B-Adaptor.
</p><p>
It's possible that one of the two DNA fragments is *very* short or is missing
completely, then one has something like this:
</p><p>
A-Adaptor - DNA1 - Linker - B-Adaptor
</p><p>
or
</p><p>
A-Adaptor - Linker - DNA2 - B-Adaptor
</p><p>
And then there are all intermediate possibilities with the read not having one
of the two adaptors (or both). Though it appears that the majority of reads
will contain the following:
</p><p>
DNA1 - Linker - DNA2
</p><p>
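Splitting such a read at the linker can be sketched in a few lines of Python.
The sketch assumes the adaptors have already been clipped and looks for an
exact match of the GS20 / FLX linker only; real reads contain sequencing
errors, so a tolerant search would be needed in practice. The function name
is hypothetical.
</p><pre class="screen">
```python
FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


def split_at_linker(read, linker=FLX_LINKER):
    """Split a read into (DNA1, DNA2) around the first exact linker match.

    Returns None when the linker is not found. Either part may be an
    empty string if one of the two fragments is missing.
    """
    before, found, after = read.partition(linker)
    if not found:
        return None
    return before, after


read = "ACGTACGT" + FLX_LINKER + "TTGACCA"
print(split_at_linker(read))  # ('ACGTACGT', 'TTGACCA')
```
</pre><p>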
There is one caveat: according to current paired-end protocols, the sequences
will <span class="bold"><strong>NOT</strong></span> have the direction
</p><pre class="screen">
---> Linker <---
</pre><p>
as one might expect when being used to Sanger Sequencing, but rather in this
direction
</p><pre class="screen">
<--- Linker --->
</pre><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_sequencing_protocol"></a>15.2.7.
Sequencing protocol
</h3></div></div></div><p>
</p><pre class="screen">
Is there a way I can find out which protocol was used?
</pre><p>
</p><p>
Yes. The best thing to do is obviously to ask your sequencing provider.
</p><p>
If this is - for whatever reason - not possible, this list might help.
</p><p>
Are the sequences ~100-110 bases long? It's GS20.
</p><p>
Are the sequences ~220-250 bases long? It's FLX.
</p><p>
Are the sequences ~350-450 bases long? It's Titanium.
</p><p>
Do the sequences contain a linker
(GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC)? It's a paired end protocol.
</p><p>
If the sequences left and right of the linker are ~29bp, it's the old short
paired end (SPET, also it's most probably from a GS20). If longer, it's long
paired-end (LPET, from a FLX).
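</p><p>
The checklist above can be sketched in a few lines of Python. This is only a
rough helper under stated assumptions, not something MIRA ships: the function
names and the 35bp cut-off between "short" and "long" flanks are made up for
illustration, and the linker is matched exactly, so reads with sequencing
errors inside the linker will be missed.
</p>

```python
from statistics import median

# Linker sequence quoted in the checklist above.
LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


def split_at_linker(read, linker=LINKER):
    """Return the (left, right) flanks if the exact linker occurs, else None."""
    pos = read.upper().find(linker)
    if pos == -1:
        return None
    return read[:pos], read[pos + len(linker):]


def guess_platform(reads):
    """Guess the 454 platform from the median read length (rule of thumb)."""
    med = median(len(r) for r in reads)
    if 100 <= med <= 110:
        return "GS20"
    if 220 <= med <= 250:
        return "FLX"
    if 350 <= med <= 450:
        return "Titanium"
    return "unknown"


def guess_paired_end(reads, short_flank_max=35):
    """Return 'none', 'SPET' or 'LPET' depending on the linker flanks.

    short_flank_max is an assumed cut-off: flanks of ~29bp count as short.
    """
    flanks = [f for f in (split_at_linker(r) for r in reads) if f]
    if not flanks:
        return "none"
    med_flank = median(len(side) for pair in flanks for side in pair)
    return "SPET" if med_flank <= short_flank_max else "LPET"
```

<p>
Feed it a few thousand reads rather than a handful, and treat the answer as a
hint only: asking your sequencing provider remains the better option.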
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_filtering_by_seqlen"></a>15.2.8.
Filtering sequences by length and re-assembly
</h3></div></div></div><pre class="screen">
I have two datasets of ~500K sequences each and the sequencing company
already did an assembly (using MIRA) on the basecalled and fully processed
reads (using of course the accompanying *qual file). Do you suggest that I
should redo the assembly after filtering out sequences being shorter than a
certain length (e.g. those that are <200bp)? In other words, am I taking into
account low quality sequences if I do the assembly the way the sequencing
company did it (fully processed reads + quality files)?
</pre><p>
I don't think that filtering out "shorter" reads will bring much
positive improvement. If the sequencing company used the standard
Roche/454 pipeline, the cut-offs for quality are already quite good:
the remaining sequences, even those shorter than 200bp, should not be
of bad quality, simply a bit shorter.
</p><p>
Worse, you might even introduce a bias when filtering out short
sequences: chemistry and library construction being what they are
(rather imprecise and sometimes problematic), some parts of DNA/RNA
yield smaller sequences per se ... and filtering those out might not
be the best move.
</p><p>
You might consider redoing the assembly if the company used a rather old
version of MIRA (<3.0.0 for sure, perhaps also <3.0.5).
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_solexa___illumina_data"></a>15.3.
Solexa / Illumina data
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_can_i_see_deletions?"></a>15.3.1.
Can I see deletions?
</h3></div></div></div><p>
</p><pre class="screen">
Suppose you ran the genome of a strain that had one or more large
deletions. Would it be clear from the data that a deletion had occurred?
</pre><p>
</p><p>
In the question above, I assume you'd compare your strain <span class="emphasis"><em>X</em></span> to a strain
<span class="emphasis"><em>Ref</em></span> and that <span class="emphasis"><em>X</em></span> had deletions compared to
<span class="emphasis"><em>Ref</em></span>. Furthermore, I base my answer on data sets I have seen, which
so far were 36- and 76-mers, paired and unpaired.
</p><p>
Yes, this would be clear. And it's a piece of cake with MIRA.
</p><p>
Short deletions (1 to 10 bases): they'll be tagged SROc or WRMc.
General rule: deletions of up to 10 to 12% of the length of your read should
be found and tagged without problem by MIRA, above that it may or may not,
depending a bit on coverage, indel distribution and luck.
</p><p>
Long deletions (longer than read length): they'll be tagged with the MCVc tag by
MIRA in the consensus. Additionally, in the FASTA files you get when
running the CAF result through miraconvert, long stretches of
sequence without coverage (the @ sign in the FASTAs) of <span class="emphasis"><em>X</em></span> show missing
genomic DNA.
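</p><p>
Finding those stretches by eye gets tedious on a whole genome. A minimal
sketch for locating the runs of @ in a sequence string could look like the
following; the function name and the <code class="literal">min_length</code>
parameter are my own invention, and reading the FASTA file itself is left out.
</p>

```python
import re


def uncovered_stretches(sequence, min_length=1):
    """Return 1-based, inclusive (start, end) intervals of '@' runs."""
    return [
        (m.start() + 1, m.end())
        for m in re.finditer(r"@+", sequence)
        if m.end() - m.start() >= min_length
    ]
```

<p>
Setting <code class="literal">min_length</code> to something above your read
length keeps only the stretches that are candidates for long deletions.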
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_can_i_see_insertions?"></a>15.3.2.
Can I see insertions?
</h3></div></div></div><p>
</p><pre class="screen">
Suppose you ran the genome of a strain X that had a plasmid missing from the
reference sequence. Alternatively, suppose you ran a strain that had picked
up a prophage or mobile element lacking in the reference. Would that
situation be clear from the data?
</pre><p>
</p><p>
Short insertions (1 to 10 bases): they'll be tagged SROc or WRMc.
General rule: insertions of up to 10 to 12% of the length of your read should
be found and tagged without problem by MIRA, above that it may or may not,
depending a bit on coverage, indel distribution and luck.
</p><p>
Long insertions: it's a bit more work than for deletions. But if you ran a
de-novo assembly on all reads not mapped against your reference sequence,
chances are good you'd get good chunks of the additional DNA put together.
</p><p>
Once the Solexa paired-end protocol is completely rolled out and used on a
regular basis, you would even be able to place the additional element into the
genome (approximately).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_denovo_assembly_with_solexa_data"></a>15.3.3.
De-novo assembly with Solexa data
</h3></div></div></div><p>
</p><pre class="screen">
Any chance you could assemble de-novo the sequence of a from just the Solexa
data?
</pre><p>
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Highly opinionated answer ahead, your mileage may vary.
</td></tr></table></div><p>
Allow me to make a clear statement on this: maybe.
</p><p>
But the result would probably be nothing I would call a good
assembly. If you used anything below 76mers, I'm highly sceptical
towards the idea of de-novo assembly with Solexa (or ABI SOLiD) reads
that are in the 30 to 50bp range. They're really too short for that,
even paired end won't help you much (especially if you have library
sizes of just 200 or 500bp). Yes, there are papers describing
different draft assemblers (SHARCGS, EDENA, Velvet, Euler and others),
but at the moment the results are less than thrilling to me.
</p><p>
If a sequencing provider came to me with N50 numbers for an
<span class="emphasis"><em>assembled genome</em></span> in the 5-8 Kb range, I'd laugh
in his face. Or weep. I wouldn't dare to call this even
'draft'. I'd just call it junk.
</p><p>
On the other hand, this could be enough for some purposes like, e.g.,
getting a quick overview on the genetic baggage of a bug. Just don't
expect a finished genome.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_hybrid_assemblies"></a>15.4.
Hybrid assemblies
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_are_hybrid_assemblies?"></a>15.4.1.
What are hybrid assemblies?
</h3></div></div></div><p>
Hybrid assemblies are assemblies that use reads from more than one sequencing
technology, e.g., Sanger and 454, or 454 and Solexa, or Sanger and Solexa,
etc.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_differences_are_there_in_hybrid_assembly_strategies?"></a>15.4.2.
What differences are there in hybrid assembly strategies?
</h3></div></div></div><p>
Basically, one can choose two routes: multi-step or all-in-one-go.
</p><p>
Multi-step means: assemble the reads from one sequencing technology (ideally
the one with the shorter reads, e.g., Solexa), fragment the resulting
contigs into pseudo-reads of the longer tech and assemble these with the real
reads from the longer tech (e.g., 454). The advantage of this approach
is that it will probably be considerably faster than the all-in-one-go approach. The
disadvantage is that you lose a lot of information when using only the consensus
sequence of the shorter read technology for the final assembly.
</p><p>
All-in-one-go means: use all reads in one single assembly. The advantage of
this is that the resulting alignment will be made of true reads with a maximum
of information contained to allow a really good finishing. The disadvantage is
that the assembly will take longer and will need more RAM.
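</p><p>
To make the "fragment the contigs into pseudo-reads" step of the multi-step
route a bit more concrete, here is a minimal sketch. The function name, the
naming scheme for the pseudo-reads and the default length and step (400 and
200 bases) are assumptions chosen for illustration; pick values that resemble
your longer technology.
</p>

```python
def fragment_contig(name, sequence, read_length=400, step=200):
    """Cut one contig consensus into overlapping pseudo-reads.

    Returns a list of (pseudo_read_name, subsequence) tuples. Consecutive
    pseudo-reads overlap by read_length - step bases.
    """
    if read_length <= 0 or step <= 0:
        raise ValueError("read_length and step must be positive")
    if not sequence:
        return []
    fragments = []
    start = 0
    while True:
        piece = sequence[start:start + read_length]
        fragments.append((f"{name}_pseudo_{len(fragments) + 1}", piece))
        if start + read_length >= len(sequence):
            break  # this piece already reaches the end of the contig
        start += step
    return fragments
```

<p>
Note that the pseudo-reads carry consensus bases only: per-read qualities and
the individual short reads are gone at this point, which is exactly the loss of
information mentioned above.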
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_masking"></a>15.5.
Masking
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_should_i_mask?"></a>15.5.1.
Should I mask?
</h3></div></div></div><p>
</p><pre class="screen">
In EST projects, do you think that the highly repetitive option will get rid
of the repetitive sequences without going to the step of repeat masking?
</pre><p>
</p><p>
For eukaryotes, yes. Please also consult the [-KS:mnr] option.
</p><p>
Remember: you still <span class="bold"><strong>MUST</strong></span> have sequencing vectors and adaptors
clipped! In EST sequences the poly-A tails should also be clipped (or let
MIRA do it).
</p><p>
For prokaryotes, I'm a big fan of having a first look at unmasked data.
Just try to start MIRA without masking the data. After something like 30
minutes, the all-vs-all comparison algorithm should be through with a first
comparison round. grep the log for the term "megahub" ... if it doesn't
appear, you probably don't need to mask repeats.
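</p><p>
If you prefer a script over grep, the same check is a one-liner in Python. The
helper below is just a sketch with a made-up name; it takes any iterable of
lines, so you decide yourself which log file to open.
</p>

```python
def count_megahub_lines(lines):
    """Count the lines that mention 'megahub' (case-insensitive)."""
    return sum(1 for line in lines if "megahub" in line.lower())
```

<p>
Open your log file and pass the file handle to the function: a result of 0
means the same as grep finding nothing.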
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_how_can_i_apply_custom_masking?"></a>15.5.2.
How can I apply custom masking?
</h3></div></div></div><p>
</p><pre class="screen">
I want to mask away some sequences in my input. How do I do that?
</pre><p>
</p><p>
First, if you want to have Sanger sequencing vectors (or 454 adaptor
sequences) "masked", please note that you should rather use ancillary data
files (CAF, XML or EXP) and use the sequencing or quality clip options there.
</p><p>
Second, please make sure you have read and understood the documentation for all
-CL parameters in the main manual, but especially -CL:mbc:mbcgs:mbcmfg:mbcmeg
as you might want to switch it on or off or set different values depending on
your pipeline and on your sequencing technology.
</p><p>
You can without problem mix your normal repeat masking pipeline with the FASTA
or EXP input for MIRA, as long as you <span class="bold"><strong>mask</strong></span> and not <span class="bold"><strong>clip</strong></span> the
sequence.
</p><p>
An example:
</p><pre class="screen">
>E09238ARF0
tcag GTGTCAGTGTTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
</pre><p>
</p><p>
(spaces inserted just as visual helper in the example sequence, they would not
occur in the real stuff)
</p><p>
The XML will contain the following clippings:
left clip = 4 (clipping away the "tcag" which are the last four bases of the
adaptor used by Roche)
right clip= ~90 (clipping away the "tgctgac..." lower case sequence on the
right side of the sequence above).
</p><p>
Now, on the FASTA file that was generated with reads_sff.py or with the Roche
sff* tools, you can let run, e.g., a repeat masker. The result could look like
this:
</p><pre class="screen">
>E09238ARF0
tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
</pre><p>
</p><p>
The part with the Xs was masked away by your repeat masker. Now, when MIRA
loads the FASTA, it will first apply the clippings from the XML file (they're
still the same). Then, if the option to clip away masked areas of a read is on
(-CL:mbc, which is normally on for EST projects), it will search for the
stretches of X and internally also put clips to the sequence. In the example
above, only the following sequence would remain as "working sequence" (the
clipped parts would still be present, but not used for any computation):
</p><pre class="screen">
>E09238ARF0
...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT........................
</pre><p>
</p><p>
Here you can also see the reason why your filters should <span class="bold"><strong>mask</strong></span> and not
clip the sequence. If you change the length of the sequence, the clips in the
XML would not be correct anymore, wrong clippings would be made, wrong
sequence reconstructed, chaos ensues and the world would ultimately end. Or
something.
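</p><p>
The difference between masking and clipping can be shown with a toy read. The
sketch below is a simplification under stated assumptions: clips are
expressed as a 0-based, half-open Python slice (not the convention of the XML
file), the read is shortened for readability, and the merge function only
swallows masked bases that directly touch a clip boundary. It ignores the gap
and boundary settings of the real -CL:mbc option and is not MIRA's code.
</p>

```python
def merge_mask_into_clips(seq, left, right, mask="X"):
    """Extend the clips over masked bases that touch a clip boundary.

    left and right delimit the working sequence as seq[left:right].
    """
    while left < right and seq[left] == mask:
        left += 1
    while right > left and seq[right - 1] == mask:
        right -= 1
    return left, right


# Toy read: adaptor, repeat, real DNA, adaptor (shortened, illustrative only).
adaptor5, repeat, dna, adaptor3 = "tcag", "GTGTCAGTG", "TTGACTGTAAAAAAAAAGTACGT", "tgctgacg"
raw = adaptor5 + repeat + dna + adaptor3

# Clips as they would be computed once, on the raw read.
clips = (len(adaptor5), len(raw) - len(adaptor3))

masked = adaptor5 + "X" * len(repeat) + dna + adaptor3   # length preserved
trimmed = adaptor5 + dna + adaptor3                      # repeat cut out

left, right = merge_mask_into_clips(masked, *clips)
working_masked = masked[left:right]          # exactly the real DNA
working_trimmed = trimmed[clips[0]:clips[1]] # right clip is now off target
```

<p>
With the masked read, the working sequence is exactly the real DNA. With the
trimmed read, the unchanged right clip no longer lines up and the 3' adaptor
ends up inside the working sequence.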
</p><p>
<span class="bold"><strong>IMPORTANT!</strong></span> It might be that you do not want MIRA to merge the masked
part of your sequence with a left or right clip, but that you want to keep it
something like DNA - masked part - DNA. In this case, consult the manual for
the -CL:mbc switch, either switch it off or set adequate options for the
boundaries and gap sizes.
</p><p>
Now, if you look at the sequence above, you will see two possible poly-A
tails ... at least the real poly-A tail should be masked else you will get
megahubs with all the other reads having the poly-A tail.
</p><p>
You have two possibilities: you mask the reads yourself with your own program or you let
MIRA do the job (-CL:cpat, which should normally be on for EST projects but I
forgot to set the correct switch in the versions prior to 2.9.26x3, so you
need to set it manually for 454 EST projects there).
</p><p>
<span class="bold"><strong>IMPORTANT!</strong></span> Never ever use two poly-A tail maskers (your own and
the one from MIRA): you would risk masking too much. Example: assume you masked
the above read with a poly-A masker. The result would very probably look like
this:
</p><pre class="screen">
>E09238ARF0
tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTAC XXXXXXXXXXXXXXXXXXXX GTACGT tgctgacgcacatgatcgtagc
</pre><p>
</p><p>
And MIRA would internally make the following out of it after loading:
</p><pre class="screen">
>E09238ARF0
...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA
GTCAGTAC..................................................
</pre><p>
</p><p>
and then apply the internal poly-A tail masker:
</p><pre class="screen">
>E09238ARF0
...............TTGACTGT................................................
..........................................................
</pre><p>
</p><p>
You'd be left with ... well, a fragment of your sequence.
</p></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_miscellaneous"></a>15.6.
Miscellaneous
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_what_are_megahubs?"></a>15.6.1.
What are megahubs?
</h3></div></div></div><p>
</p><pre class="screen">
I looked in the log file and that term "megahub" you told me about appears
pretty much everywhere. First of all, what does it mean?
</pre><p>
</p><p>
Megahub is MIRA's internal term for a read that is massively repetitive
with respect to the other reads of the project, i.e., a read that is a
megahub connects to an insane number of other reads.
</p><p>
This is a clear sign that something is wrong. Or that you have a quite
repetitive eukaryote. But most of the time it's sequencing vectors
(Sanger), A and B adaptors or paired-end linkers (454), unmasked
poly-A signals (EST) or non-normalised EST libraries which contain
high amounts of housekeeping genes (always the same or nearly the
same).
</p><p>
Countermeasures to take are:
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
set clips for the sequencing vectors (Sanger) or Adaptors (454)
either in the XML or EXP files
</p></li><li class="listitem"><p>
for ESTs, mask poly-A in your input data (or let MIRA do it with the
-CL:cpat parameter)
</p></li><li class="listitem"><p>
only after the above steps have been made, use
the [-KS:mnr] switch to let mira automatically mask nasty
repeats, adjust the threshold with [-SK:rt].
</p></li><li class="listitem"><p>
if everything else fails, filter out or mask sequences yourself in the
input data that come from housekeeping genes or nasty repeats.
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_passes_and_loops"></a>15.6.2.
Passes and loops
</h3></div></div></div><p>
</p><pre class="screen">
While processing some contigs with repeats i get
"Accepting probably misassembled contig because of too many iterations."
What is this?
</pre><p>
</p><p>
That's quite normal in the first few passes of an assembly. During each pass
(-AS:nop), contigs get built one by one. After a contig has been finished, it
checks itself whether it can find misassemblies due to repeats (and marks
these internally). If no misassembly, perfect, build next contig. But if yes,
the contig requests immediate re-assembly of itself.
</p><p>
But this can happen only a limited number of times (governed by -AS:rbl). If
there are still misassemblies, the contig is stored away anyway ... chances
are good that in the next full pass of the assembler, enough knowledge has
been gained to correctly place the reads.
</p><p>
So, you need to worry only if these messages still appear during the last
pass. The positions that cause this are marked with "SRMc" tags in the
assemblies (CAF, ACE in the result dir; and some files in the info dir).
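</p><p>
For those who think better in code, the control flow described above can be
modelled roughly like this. It is a toy model, not MIRA's implementation:
<code class="literal">number_of_passes</code> and
<code class="literal">rebuild_limit</code> merely play the roles of -AS:nop and
-AS:rbl, and the contig builder is a stand-in that "discovers" one hidden
repeat position per attempt.
</p>

```python
def assemble(seeds, build_contig, number_of_passes=3, rebuild_limit=2):
    """Toy model of passes and per-contig rebuilds.

    build_contig(seed, knowledge) is a caller-supplied stand-in returning
    (contig, misassembly_positions); it is not a MIRA function.
    """
    knowledge = set()   # repeat markers learned so far, kept across passes
    messages = []       # (pass number, seed) for contigs accepted with problems
    contigs = []
    for pass_no in range(1, number_of_passes + 1):
        contigs = []
        for seed in seeds:
            rebuilds = 0
            while True:
                contig, problems = build_contig(seed, knowledge)
                if not problems:
                    break                       # clean: build the next contig
                knowledge.update(problems)      # mark the misassemblies
                if rebuilds == rebuild_limit:
                    messages.append((pass_no, seed))  # "too many iterations"
                    break
                rebuilds += 1                   # immediate re-assembly
            contigs.append(contig)
    return contigs, messages


# Stand-in builder: each attempt reveals at most one still unknown repeat.
HIDDEN_REPEATS = {"a": set(), "r": {10, 20, 30, 40}}


def toy_build(seed, knowledge):
    unknown = sorted(HIDDEN_REPEATS[seed] - knowledge)
    return seed.upper(), set(unknown[:1])


contigs, messages = assemble(["a", "r"], toy_build)
```

<p>
In this toy run the repetitive contig is accepted with problems in pass 1 only;
by pass 2 enough has been learned to build it cleanly, which is the behaviour
you hope to see in a real assembly, too.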
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_debris"></a>15.6.3.
Debris
</h3></div></div></div><p>
</p><pre class="screen">
What are the debris composed of?
</pre><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
sequences too short (after trimming)
</p></li><li class="listitem"><p>
megahubs
</p></li><li class="listitem"><p>
sequences almost completely masked by the nasty repeat masker
([-KS:mnr])
</p></li><li class="listitem"><p>
singlets, i.e., reads that after an assembly pass did not align
into any contig (or were rejected from every contig).
</p></li><li class="listitem"><p>
sequences that form a contig with fewer reads than defined by
[-AS:mrpc]
</p></li></ul></div><p>
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_tmpf_files:_more_info_on_what_happened_during_the_assembly"></a>15.6.4.
Log and temporary files: more info on what happened during the assembly
</h3></div></div></div><p>
</p><pre class="screen">
I do not understand why ... happened. Is there a way to find out?
</pre><p>
Yes. The tmp directory contains, beside temporary data, a number of
log files with more or less readable information. While development
versions of MIRA keep this directory after finishing, production
versions normally delete this directory after an assembly. To keep the
logs and temporary files also in production versions, use
"-OUT:rtd=no".
</p><p>
As MIRA also tries to save as much disk space as possible, some logs
and temporary files are rotated (which means that old logs and tmps
get deleted). To switch off this behaviour, use
"-OUT:rrot=no". Beware, the size of the tmp directory will increase,
sometimes dramatically so.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect_faq_sequence_clipping_after_load"></a>15.6.4.1.
Sequence clipping after load
</h4></div></div></div><p>
How MIRA clipped the reads after loading them can be found in the file
<code class="filename">mira_int_clippings.0.txt</code>. The entries look like this:
</p><pre class="screen">
load: minleft. U13a01d05.t1 Left: 11 -> 30
</pre><p>
Interpret this as: after loading, the read "U13a01d05.t1" had a left clipping
of eleven. The "minleft" clipping option of MIRA did not like it and set it to
30.
</p><pre class="screen">
load: bad seq. gnl|ti|1133527649 Shortened by 89 New right: 484
</pre><p>
</p><p>
Interpret this as: after loading, the read "gnl|ti|1133527649" was checked
with the "bad sequence search" clipping algorithm which determined that there
apparently is something dubious, so it shortened the read by 89 bases, setting
the new right clip to position 484.
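</p><p>
If you want to summarise such a file instead of reading it line by line, the
two entry types shown above can be parsed with regular expressions. This is a
sketch that covers only those two examples; other entry types may well exist
in the file and simply yield <code class="literal">None</code> here. The
function and key names are my own choice.
</p>

```python
import re

MINLEFT = re.compile(
    r"^load:\s+minleft\.\s+(\S+)\s+Left:\s+(\d+)\s+->\s+(\d+)$"
)
BADSEQ = re.compile(
    r"^load:\s+bad\s+seq\.\s+(\S+)\s+Shortened by\s+(\d+)\s+New right:\s+(\d+)$"
)


def parse_clipping_line(line):
    """Parse one of the two entry types shown above, else return None."""
    line = line.strip()
    m = MINLEFT.match(line)
    if m:
        return {"kind": "minleft", "read": m.group(1),
                "old_left": int(m.group(2)), "new_left": int(m.group(3))}
    m = BADSEQ.match(line)
    if m:
        return {"kind": "bad seq", "read": m.group(1),
                "shortened_by": int(m.group(2)), "new_right": int(m.group(3))}
    return None
```

<p>
Counting the parsed entries per kind gives a quick impression of which clipping
option was most active on your data.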
</p></div></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect_faq_platforms_and_compiling"></a>15.7.
Platforms and Compiling
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect_faq_windows"></a>15.7.1.
Windows
</h3></div></div></div><p>
</p><pre class="screen">
Also, is MIRA be available on a windows platform?
</pre><p>
</p><p>
As a matter of fact: it was and may be again. While I haven't done it myself,
according to reports I got, compiling MIRA 2.9.3* in a Cygwin environment was
actually painless. But since then BOOST and multi-threading have been included
and I am not sure whether it is still as easy.
</p><p>
I'd be thankful for reports :-)
</p></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_maf"></a>Chapter 16. The MAF format</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_introduction:_why_an_own_assembly_format?">16.1.
Introduction: why an own assembly format?
</a></span></dt><dt><span class="sect1"><a href="#sect1_the_maf_format">16.2.
The MAF format
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_basics">16.2.1.
Basics
</a></span></dt><dt><span class="sect2"><a href="#sect2_reads">16.2.2.
Reads
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect3_simple_example">16.2.2.1.
Simple example
</a></span></dt><dt><span class="sect3"><a href="#sect3_list_of_records_for_reads">16.2.2.2.
List of records for reads
</a></span></dt><dt><span class="sect3"><a href="#sect3_interpreting_clipping_values">16.2.2.3.
Interpreting clipping values
</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sect2_contigs">16.2.3.
Contigs
</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sect3_simple_example_2">16.2.3.1.
Simple example 2
</a></span></dt><dt><span class="sect3"><a href="#sect3_list_of_records_for_contigs">16.2.3.2.
List of records for contigs
</a></span></dt></dl></dd></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">Design flaws travel in herds.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><p>
This document describes the purpose and format of the MAF format, version
1, which has been superseded by version 2. Version 2 is not described here
(yet), but as v1 and v2 are very similar (only the notion of readgroups is
a big change), I'll let this description live until I have time to update
this section.
</p><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_introduction:_why_an_own_assembly_format?"></a>16.1.
Introduction: why an own assembly format?
</h2></div></div></div><p>
I had been on the hunt for some time for a file format that allows MIRA to
quickly save and load reads and full assemblies. There are currently a number
of alignment file formats on the market and MIRA can read and/or write most of
them. Why not take one of these? It turned out that all (well, the ones I
know: ACE, BAF, CAF, CALF, EXP, FRG) have some kind of no-go 'feature' (or problem
or bug) that makes one's life pretty difficult if one wants to write or parse
that given file format.
</p><p>
What I needed for MIRA was a format that:
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
is easy to parse
</p></li><li class="listitem"><p>
is quick to parse
</p></li><li class="listitem"><p>
contains all needed information of an assembly that MIRA and many
finishing programs use: reads (with sequence and qualities) and contigs,
tags, etc.
</p></li></ol></div><p>
</p><p>
MAF is not a format with the smallest possible footprint (though it fares quite
well in comparison to ACE, CAF and EXP), but as it's meant as an interchange format,
it'll do. It can be easily indexed and does not need string lookups during
parsing.
</p><p>
I took the liberty to combine many good ideas from EXP, BAF, CAF and FASTQ
while defining the format and if anything is badly designed, it's all my
fault.
</p></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_the_maf_format"></a>16.2.
The MAF format
</h2></div></div></div><p>
This describes version 1 of the MAF format. If the need arises, enhancements
like meta-data about total number of contigs and reads will be implemented in the
next version.
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_basics"></a>16.2.1.
Basics
</h3></div></div></div><p>
MAF ...
</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p>
... has for each record a keyword at the beginning of the line, followed
by exactly one blank (a space or a tab), then followed by the values for
this record. At the moment keywords are two character keywords, but keywords
with other lengths might appear in the future.
</p></li><li class="listitem"><p>
... is strictly line oriented. Each record is terminated by a newline,
no record spans across lines.
</p></li></ol></div><p>
</p><p>
All coordinates start at 1, i.e., there is no 0 value for coordinates.
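</p><p>
These rules make a parser pleasantly short. As a minimal sketch (function
names are mine, nothing here is part of MIRA): split each line at the first
blank, and convert the 1-based coordinates before using them as Python
slices. I assume here that an interval like [x1 y1] includes both end points
and runs in forward direction.
</p>

```python
def split_record(line):
    """Split one MAF line into (keyword, value) at the first space or tab."""
    line = line.rstrip("\r\n")
    for i, ch in enumerate(line):
        if ch in " \t":
            return line[:i], line[i + 1:]
    return line, ""   # keyword without values


def to_slice(first, last):
    """Turn a 1-based, inclusive, forward interval into a Python slice."""
    return slice(first - 1, last)
```

<p>
As the split does not rely on the keyword being two characters long, it keeps
working should longer keywords appear one day.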
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_reads"></a>16.2.2.
Reads
</h3></div></div></div><p>
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_simple_example"></a>16.2.2.1.
Simple example
</h4></div></div></div><p>
Here's an example for a simple read, just the read name and the sequence:
</p><pre class="screen">
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
ER
</pre><p>
</p><p>
Reads start with RD and end with ER, the RD keyword is always followed by the
name of the read, ER stands on its own. Reads also should contain a sequence
(RS). Everything else is optional. In the following example, the read has
additional quality values (RQ), template definitions (name in TN, minimum and
maximum insert size in TF and TT), a pointer to the file with the raw data (SF),
a left clip which covers sequencing vector or adaptor sequence (SL), a left
clip covering low quality (QL), a right clip covering low quality (QR), a
right clip covering sequencing vector or adaptor sequence (SR), alignment to
original sequence (AO), a tag (RT) and the sequencing technology it was
generated with (ST).
</p><pre class="screen">
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
RQ ,-+*,1-+/,36;:6<=3327<7A1/,,).('..7=@E8:
TN U13a05e07
DI F
TF 1200
TT 1800
SF U13a05e07.t1.scf
SL 4
QL 7
QR 30
SR 32
AO 1 40 1 40
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
</pre><p>
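</p><p>
Reading such records back takes only a few lines. The following is a sketch
of a reader for read records only (no contigs), run on a record modelled on
the example above with some lines left out. The function names are my own, and
the quality decoding assumes the FASTQ-style offset of 33 used for RQ.
</p>

```python
import re


def parse_reads(lines):
    """Collect RD ... ER blocks into dictionaries (keyword -> list of values)."""
    reads, current = [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        parts = re.split(r"[ \t]", line, maxsplit=1)
        keyword = parts[0]
        value = parts[1] if len(parts) == 2 else ""
        if keyword == "RD":
            current = {"RD": [value]}
        elif keyword == "ER" and current is not None:
            reads.append(current)
            current = None
        elif current is not None:
            # Some records may occur several times, hence a list per keyword.
            current.setdefault(keyword, []).append(value)
    return reads


def decode_qualities(rq):
    """Undo the '+ 33' ASCII encoding of an RQ string."""
    return [ord(ch) - 33 for ch in rq]


example = """\
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
RQ ,-+*,1-+/,36;:6<=3327<7A1/,,).('..7=@E8:
TN U13a05e07
SL 4
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
"""

read = parse_reads(example.splitlines())[0]
qualities = decode_qualities(read["RQ"][0])
```

<p>
A cheap sanity check after parsing is to compare the number of decoded
qualities with the length of the sequence: both should be equal.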
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_list_of_records_for_reads"></a>16.2.2.2.
List of records for reads
</h4></div></div></div><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
RD <span class="emphasis"><em>string: readname</em></span>
</p><p> RD followed by the read name starts a read.
</p></li><li class="listitem"><p>
LR <span class="emphasis"><em>integer: read length</em></span>
</p><p>
The length of the read can be given optionally in LR. This is
meant to help the parser perform sanity checks and possibly
pre-allocate memory for sequence and quality.
</p><p>
MIRA at the moment only writes LR lines for reads with more than
2000 bases.
</p></li><li class="listitem"><p>
RS <span class="emphasis"><em>string: DNA sequence</em></span>
</p><p> Sequence of a read is stored in RS.
</p></li><li class="listitem"><p>
RQ <span class="emphasis"><em>string: qualities</em></span>
</p><p> Qualities are stored in FASTQ format, i.e., each quality
value + 33 is written as a single ASCII character.
</p></li><li class="listitem"><p>
SV <span class="emphasis"><em>string: sequencing vector</em></span>
</p><p> Name of the sequencing vector or
adaptor used in this read.
</p></li><li class="listitem"><p>
TN <span class="emphasis"><em>string: template name</em></span>
</p><p> Template name. This defines the DNA template a sequence
comes from. In its simplest form, a DNA template is sequenced
only once. In paired-end sequencing, a DNA template is sequenced
once in forward and once in reverse direction (Sanger, 454,
Solexa). In Sanger sequencing, several forward and/or reverse
reads can be sequenced from a DNA template. In PacBio sequencing,
a DNA template can be sequenced in several "strobes", leading to
multiple reads on a DNA template.
</p></li><li class="listitem"><p>
DI <span class="emphasis"><em>character: F or R</em></span>
</p><p> Direction of the read with respect to the
template. F for forward, R for reverse.
</p></li><li class="listitem"><p>
TF <span class="emphasis"><em>integer: template size from</em></span>
</p><p> Minimum estimated
size of a sequencing template. In paired-end sequencing, this is the minimum
distance of the read pair.
</p></li><li class="listitem"><p>
TT <span class="emphasis"><em>integer: template size to</em></span>
</p><p> Maximum estimated
size of a sequencing template. In paired-end sequencing, this is the maximum
distance of the read pair.
</p></li><li class="listitem"><p>
SF <span class="emphasis"><em>string: sequencing file</em></span>
</p><p> Name of the sequencing file which
contains raw data for this read.
</p></li><li class="listitem"><p>
SL <span class="emphasis"><em>integer: seqvec left</em></span>
</p><p>
Clip left due to sequencing vector. Assumed to be 1 if not
present. Note that left clip values are excluding, e.g.: a value
of '7' clips off the left 6 bases.
</p></li><li class="listitem"><p>
QL <span class="emphasis"><em>integer: qual left</em></span>
</p><p>
Clip left due to low quality. Assumed to be 1 if not
present. Note that left clip values are excluding, e.g.: a value
of '7' clips off the left 6 bases.
</p></li><li class="listitem"><p>
CL <span class="emphasis"><em>integer: clip left</em></span>
</p><p>
Clip left (any reason). Assumed to be 1 if not present. Note
that left clip values are excluding, e.g.: a value of '7' clips
off the left 6 bases.
</p></li><li class="listitem"><p>
SR <span class="emphasis"><em>integer: seqvec right</em></span>
</p><p> Clip right due to sequencing
vector. Assumed to be the length of the sequence if not present. Note that
right clip values are including, e.g., a value of '10' leaves the bases 1 to
9 and clips at and including base 10 and higher.
</p></li><li class="listitem"><p>
QR <span class="emphasis"><em>integer: qual right</em></span>
</p><p> Clip right due to low quality. Assumed
to be the length of the sequence if not present. Note that right clip values
are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at
and including base 10 and higher.
</p></li><li class="listitem"><p>
CR <span class="emphasis"><em>integer: clip right</em></span>
</p><p> Clip right (any reason). Assumed to be
the length of the sequence if not present. Note that
right clip values are including, e.g., a value of '10' leaves the bases 1 to
9 and clips at and including base 10 and higher.
</p></li><li class="listitem"><p>
AO <span class="emphasis"><em>four integers: x1 y1 x2 y2</em></span>
</p><p> AO stands for "Align to
Original". The interval [x1 y1] in the read as stored in the MAF file aligns
with [x2 y2] in the original, unedited read sequence. This makes it possible to
model insertions and deletions in the read and still find the correct
position in the original, base-called sequence data.
</p><p> A read can have
several AO lines which together define all the edits performed to this
read.
</p><p> Assumed to be "1 x 1 x" if not present, where 'x' is the length of
the unclipped sequence.
</p></li><li class="listitem"><p>
RT <span class="emphasis"><em>string + 2 integers + optional string: type x1 y1 comment</em></span>
</p><p> Read tags are given by naming the tag type, which positions
in the read the tag spans in the interval [x1 y1] and afterwards
optionally a comment. As MAF is strictly line oriented, newline
characters in the comment are encoded
as <code class="literal">\n</code>.
</p><p> If x1 > y1, the tag is in reverse direction.
</p><p>
The tag type can be a free form string, though MIRA will
recognise and work with tag types used by the Staden gap4
package (and of course the MIRA tags as described in the main
documentation of MIRA).
</p></li><li class="listitem"><p>
ST <span class="emphasis"><em>string: sequencing technology</em></span>
</p><p> The following technologies
can currently be defined: Sanger, 454, Solexa, SOLiD.
</p></li><li class="listitem"><p>
SN <span class="emphasis"><em>string: strain name</em></span>
</p><p> Strain name of the sample that was
sequenced, this is a free form string.
</p></li><li class="listitem"><p>
MT <span class="emphasis"><em>string: machine type</em></span>
</p><p> Machine type which generated the data,
this is a free form string.
</p></li><li class="listitem"><p>
BC <span class="emphasis"><em>string: base caller</em></span>
</p><p>
Base calling program used to call the bases.
</p></li><li class="listitem"><p>
IB <span class="emphasis"><em>boolean (0 or 1): is backbone</em></span>
</p><p> Whether the read is a backbone. Reads used as reference
(backbones) in mapping assemblies get this attribute.
</p></li><li class="listitem"><p>
IC <span class="emphasis"><em>boolean (0 or 1)</em></span>
</p><p> Whether the read is a coverage equivalent
read (e.g. from mapping Solexa). This is internal to MIRA.
</p></li><li class="listitem"><p>
IR <span class="emphasis"><em>boolean (0 or 1)</em></span>
</p><p> Whether the read is a rail. This also is
internal to MIRA.
</p></li><li class="listitem"><p>
ER
</p><p> This ends a read and is mandatory.
</p></li></ul></div><p>
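The records listed above can be read with very little code. What follows is a minimal Python sketch and not MIRA's own parser: the function name <code class="literal">parse_read</code> is hypothetical, and the sketch assumes that the two-letter record identifier is separated from its data by whitespace and that the lines of one read (RD to ER) have already been isolated.
</p>

```python
# Hypothetical helper, not part of MIRA: parse one RD..ER block into a dict.
# Assumption: record identifier and data are separated by whitespace.

INTEGER_FIELDS = {"TF", "TT", "SL", "QL", "CL", "SR", "QR", "CR"}


def parse_read(lines):
    """Return a dict with the records found between RD and ER."""
    read = {"AO": [], "RT": []}          # AO and RT may occur several times
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue                      # skip empty lines
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if key == "ER":                   # mandatory end-of-read marker
            return read
        if key == "AO":
            read["AO"].append(tuple(int(v) for v in value.split()))
        elif key == "RT":
            # type x1 y1 [comment]; "\n" in the comment encodes a newline
            fields = value.split(None, 3)
            comment = fields[3].replace("\\n", "\n") if len(fields) > 3 else ""
            read["RT"].append((fields[0], int(fields[1]), int(fields[2]), comment))
        elif key in INTEGER_FIELDS:
            read[key] = int(value)
        else:
            read[key] = value             # everything else is kept as a string
    raise ValueError("read block is not terminated by ER")


example = """RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
SL 4
SR 32
QL 7
QR 30
AO 1 40 1 40
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
""".splitlines()

read = parse_read(example)
print(read["RD"], read["SL"], read["SR"], read["AO"])
```

<p>
Records that are absent from a read are simply missing from the returned dictionary, so the documented defaults (e.g. 1 for left clips) still have to be applied by the caller.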
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_interpreting_clipping_values"></a>16.2.2.3.
Interpreting clipping values
</h4></div></div></div><p>
Every left and right clipping pair (SL & SR, QL & QR, CL & CR) forms a clear
range in the interval [left right[ in the sequence of a read. E.g. a read with
SL=4 and SR=10 has the bases 1,2,3 clipped away on the left side, the bases
4,5,6,7,8,9 as clear range and the bases 10 and following clipped away on the
right side.
</p><p>
The left clip of a read is determined as max(SL,QL,CL) (the rightmost left
clip) whereas the right clip is min(SR,QR,CR).
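</p><p>
The rule above can be sketched in a few lines of Python. The function names <code class="literal">clear_range</code> and <code class="literal">clear_bases</code> are hypothetical, and the 12-base sequence is made up for illustration. Missing clips get the defaults given in the record descriptions (1 on the left, the sequence length on the right).
</p>

```python
# Illustrative sketch, not MIRA code: combine the clip pairs of a read.

def clear_range(seq_len, SL=None, SR=None, QL=None, QR=None, CL=None, CR=None):
    """Return (left, right) of the combined clear range [left right[ (1-based)."""
    lefts = [1 if v is None else v for v in (SL, QL, CL)]
    rights = [seq_len if v is None else v for v in (SR, QR, CR)]
    return max(lefts), min(rights)        # rightmost left clip, leftmost right clip


def clear_bases(seq, left, right):
    """Bases left .. right-1 (1-based), i.e. the half-open interval [left right[."""
    return seq[left - 1:right - 1]


seq = "ACGTACGTACGT"                      # made-up 12-base read

left, right = clear_range(len(seq), SL=4, SR=10)
print(left, right, clear_bases(seq, left, right))   # 4 10 TACGTA

left, right = clear_range(len(seq), SL=4, SR=10, QL=6)
print(left, right, clear_bases(seq, left, right))   # 6 10 CGTA
```

<p>
The first call reproduces the SL=4 / SR=10 case from the text: bases 1 to 3 are clipped on the left, bases 4 to 9 form the clear range.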
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_contigs"></a>16.2.3.
Contigs
</h3></div></div></div><p>
Contigs are little more than containers that hold reads along with some
additional information. Unlike CAF or ACE, MAF does not first store all reads in
single containers and then define the contigs. In MAF, contigs are defined as
the outer container, and the reads are stored within them like normal reads.
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_simple_example_2"></a>16.2.3.1.
Simple example 2
</h4></div></div></div><p>
The above example for a read can be encased in a contig like this (with two
consensus tags gratuitously added in):
</p><pre class="screen">
CO contigname_s1
NR 1
LC 24
CS TGCCTGCAGGTCGACTCTAGAAGG
CQ -+/,36;:6<=3327<7A1/,,).
CT COMM 5 8 Some comment to this consensus tag.
CT COMM 7 12 Another comment to this consensus tag.
\\
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
RQ ,-+*,1-+/,36;:6<=3327<7A1/,,).('..7=@E8:
TN U13a05e07
TF 1200
TT 1800
SF U13a05e07.t1.scf
SL 4
SR 32
QL 7
QR 30
AO 1 40 1 40
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
AT 1 24 7 30
//
EC
</pre><p>
</p><p>
Note that the read shown previously (and now encased in a contig) is
absolutely unchanged. It has just been complemented with a bit of data which
describes the contig as well as with a one liner which places the read into
the contig.
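</p><p>
The placement line of this example can be checked by hand or with a short Python sketch like the one below. The helper names are hypothetical, and the intervals of the AT line are read as 1-based and inclusive, which is what the numbers in the example suggest (24 consensus bases, read positions 7 to 30). The equality test only makes sense here because the contig contains a single read, so the consensus is identical to the placed part of that read.
</p>

```python
# Illustrative sketch, not MIRA code: check an AT line of a one-read contig.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def placement_matches(consensus, read_seq, at):
    """True if read interval [x2 y2] equals contig interval [x1 y1]."""
    x1, y1, x2, y2 = at
    read_part = read_seq[x2 - 1:y2]       # 1-based, inclusive
    if x1 > y1:                           # read is placed reverse-complemented
        read_part = reverse_complement(read_part)
        x1, y1 = y1, x1
    return consensus[x1 - 1:y1] == read_part


CS = "TGCCTGCAGGTCGACTCTAGAAGG"                      # from the example
RS = "CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA"      # from the example

print(len(CS))                                       # 24, the LC value
print(placement_matches(CS, RS, (1, 24, 7, 30)))     # True
```

<p>
For contigs with more than one read the consensus may differ from individual reads, so a real check would have to tolerate mismatches.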
</p></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="sect3_list_of_records_for_contigs"></a>16.2.3.2.
List of records for contigs
</h4></div></div></div><p>
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
CO <span class="emphasis"><em>string: contig name</em></span>
</p><p> CO starts a contig; the contig name
that follows is mandatory but can be any string, including numbers.
</p></li><li class="listitem"><p>
NR <span class="emphasis"><em>integer: num reads in contig</em></span>
</p><p> This is optional but highly
recommended.
</p></li><li class="listitem"><p>
LC <span class="emphasis"><em>integer: contig length</em></span>
</p><p> Note that this length defines the length of the 'clear
range' of the consensus. It is exactly equal to the length of the CS
(sequence) and CQ (quality) strings below.
</p></li><li class="listitem"><p>
CT <span class="emphasis"><em>string + 2 integers + optional string: identifier
x1 y1 comment</em></span>
</p><p> Consensus tags are defined like read tags but apply to the
consensus. Here too, the interval [x1 y1] is including and if x1 > y1, the tag
is in reverse direction.
</p></li><li class="listitem"><p>
CS <span class="emphasis"><em>string: consensus sequence</em></span>
</p><p> The sequence of a consensus is stored in CS.
</p></li><li class="listitem"><p>
CQ <span class="emphasis"><em>string: qualities</em></span>
</p><p> Consensus qualities are stored in FASTQ
format, i.e., each quality value + 33 is written as a single ASCII character.
</p></li><li class="listitem"><p>
\\
</p><p> This marks the start of read data of this contig. After
this, all reads are stored one after the other, just separated by
an "AT" line (see below).
</p></li><li class="listitem"><p>
AT <span class="emphasis"><em>Four integers: x1 y1 x2 y2</em></span>
</p><p> The AT (Assemble_To) line defines the placement of the read
in the contig and immediately follows the closing "ER" of a read
so that parsers do not need to perform time-consuming string
lookups. Every read in a contig has exactly one AT line.
</p><p> The interval
[x2 y2] of the read (i.e., the unclipped data, also called the 'clear range')
aligns with the interval [x1 y1] of the contig. If x1 > y1 (the contig
positions), then the reverse complement of the read is aligned to the
contig. For the read positions, x2 is always < y2.
</p></li><li class="listitem"><p>
//
</p><p> This marks the end of the read data.
</p></li><li class="listitem"><p>
EC
</p><p> This ends a contig and is mandatory.
</p></li></ul></div></div></div></div></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="chap_logfiles"></a>Chapter 17. Log and temporary files used by MIRA</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bastien</span> <span class="surname">Chevreux</span></h3><code class="email"><<a class="email" href="mailto:bach@chevreux.org">bach@chevreux.org</a>></code></div></div><div><p class="releaseinfo">MIRA Version 4.9.6</p></div><div><p class="copyright">Copyright © 2016 Bastien Chevreux</p></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="sect1"><a href="#sect1_logf_introduction">17.1.
Introduction
</a></span></dt><dt><span class="sect1"><a href="#sect1_logf_the_files">17.2.
The files
</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sect2_logf_mira_error_reads_invalid">17.2.1.
mira_error_reads_invalid
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_info_reads_tooshort">17.2.2.
mira_info_reads_tooshort
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_alignextends_preassembly10txt">17.2.3.
mira_int_alignextends_preassembly1.0.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_clippings0txt">17.2.4.
mira_int_clippings.0.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_megahubs_passxlst">17.2.5.
mira_int_posmatch_megahubs_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_multicopystat_preassembly0txt">17.2.6.
mira_int_posmatch_multicopystat_preassembly.0.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_posmatch_rawhashhits_passxlst">17.2.7.
mira_int_posmatch_rawhashhits_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_skimmarknastyrepeats_hist_passxlst">17.2.8.
mira_int_skimmarknastyrepeats_hist_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_skimmarknastyrepeats_nastyseq_passxlst">17.2.9.
mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_int_vectorclip_passxtxt">17.2.10.
mira_int_vectorclip_pass.X.txt
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpads_passxforward_and_miratmpads_passxcomplement">17.2.11.
miratmp.ads_pass.X.forward and miratmp.ads_pass.X.complement
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpads_passxreject">17.2.12.
miratmp.ads_pass.X.reject
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpnoqualities">17.2.13.
miratmp.noqualities
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_miratmpusedids">17.2.14.
miratmp.usedids
</a></span></dt><dt><span class="sect2"><a href="#sect2_logf_mira_readpoolinfolst">17.2.15.
mira_readpoolinfo.lst
</a></span></dt></dl></dd></dl></div><div class="blockquote"><table border="0" class="blockquote" style="width: 100%; cellspacing: 0; cellpadding: 0;" summary="Block quote"><tr><td width="10%" valign="top"> </td><td width="80%" valign="top"><p>
<span class="emphasis"><em><span class="quote">“<span class="quote">The amount of entropy in the universe is constant - except when it increases.
</span>”</span></em></span>
</p></td><td width="10%" valign="top"> </td></tr><tr><td width="10%" valign="top"> </td><td colspan="2" align="right" valign="top">--<span class="attribution">Solomon Short</span></td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top"><p>
The documentation of MIRA 3.9.x has not completely caught up yet with the changes introduced by MIRA now using manifest files. Quite a number of recipes still show the old command-line style, e.g.:
</p><pre class="screen">
mira --project=... --job=... ...</pre><p>
For those cases, please refer to chapter 3 (the reference) for how to write manifest files.
</p></td></tr></table></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_logf_introduction"></a>17.1.
Introduction
</h2></div></div></div><p>
The tmp directory used by mira (usually
<code class="filename"><projectname>_d_tmp</code>) may contain a number of
files with information which could be interesting for other uses than
the pure assembly. This guide gives a short overview.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
This guide is probably the least complete and most out-of-date as it is
updated only very infrequently. If in doubt, ask on the MIRA talk
mailing list.
</td></tr></table></div><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td align="left" valign="top">
Please note that the format of these files may change over time,
although I try very hard to keep changes reduced to a minimum.
</td></tr></table></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="images/note.png"></td><th align="left">Note</th></tr><tr><td align="left" valign="top">
Remember that mira has two options that control whether log and
temporary files get deleted: while [-OUT:rtd] removes the
complete tmp directory after an assembly, [-OUT:rrot] removes
only those log and temporary files which are not needed anymore for the
continuation of the assembly. Setting both options to <span class="underline">no</span> will keep all log and temporary files.
</td></tr></table></div></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sect1_logf_the_files"></a>17.2.
The files
</h2></div></div></div><p>
</p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_error_reads_invalid"></a>17.2.1.
mira_error_reads_invalid
</h3></div></div></div><p>
A simple list of those reads that were invalid (no sequence or similar
problems).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_info_reads_tooshort"></a>17.2.2.
mira_info_reads_tooshort
</h3></div></div></div><p>
A simple list of those reads that were sorted out because the unclipped
sequence was too short as defined by [-AS:mrl].
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_alignextends_preassembly10txt"></a>17.2.3.
mira_int_alignextends_preassembly1.0.txt
</h3></div></div></div><p>
If read extension is used ([-DP:ure]), this file contains the read
name and the number of bases by which the right clipping was extended.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_clippings0txt"></a>17.2.4.
mira_int_clippings.0.txt
</h3></div></div></div><p>
If any of the [-CL:] options leads to the clipping of a read, this
file will tell when, which clipping, which read and by how much (or to where)
the clippings were set.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_megahubs_passxlst"></a>17.2.5.
mira_int_posmatch_megahubs_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Should any read be
categorised as megahub during the all-against-all search (SKIM3), this file
will tell you which.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_multicopystat_preassembly0txt"></a>17.2.6.
mira_int_posmatch_multicopystat_preassembly.0.txt
</h3></div></div></div><p>
After the initial all-against-all search (SKIM3), this file tells you how
many other reads each read overlaps with. Furthermore, reads that have more
overlaps than expected are tagged with "mc" (multicopy).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_posmatch_rawhashhits_passxlst"></a>17.2.7.
mira_int_posmatch_rawhashhits_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Similar to
<code class="filename">mira_int_posmatch_multicopystat_preassembly.0.txt</code>, this counts the
kmer hits of each read to other reads. This time, however, the counts are per pass.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_skimmarknastyrepeats_hist_passxlst"></a>17.2.8.
mira_int_skimmarknastyrepeats_hist_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
[-KS:mnr] is set to <span class="underline">yes</span>. This file contains a
histogram of kmer occurrences encountered by SKIM3.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_skimmarknastyrepeats_nastyseq_passxlst"></a>17.2.9.
mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
[-KS:mnr] is set to <span class="underline">yes</span>. One of the more interesting
files if you want to know which repetitive sequences make the assembly
really difficult: for each masked part of a read, the masked sequence is
shown here.
</p><p>
E.g.
</p><pre class="screen">
U13a04h11.t1 TATATATATATATATATATATATA
U13a05b01.t1 TATATATATATATATATATATATA
U13a05c07.t1 AAAAAAAAAAAAAAA
U13a05e12.t1 CTCTCTCTCTCTCTCTCTCTCTCTCTCTC
</pre><p>
Simple repeats like the ones shown above will certainly pop up there,
but a few other sequences (e.g. rDNA/rRNA or SINEs and LINEs in
eukaryotes) will also appear.
</p><p>
A nifty thing to try if you want a more compact overview: sort by the
second column and keep only unique entries.
</p><pre class="screen">
sort -k 2 -u mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
</pre><p>
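If you also want to know how often each masked sequence occurs, a few lines of Python will do. This is only a sketch: the function name <code class="literal">count_masked_sequences</code> is hypothetical, and it assumes the two-column, whitespace-separated layout (read name, masked sequence) of the example shown above.
</p>

```python
# Illustrative sketch, not part of MIRA: count masked sequences per occurrence.
# Assumption: each line is "<read name> <masked sequence>", whitespace-separated.
from collections import Counter


def count_masked_sequences(lines):
    """Count how often each masked sequence (second column) occurs."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:              # ignore empty or malformed lines
            counts[fields[1]] += 1
    return counts


sample = """U13a04h11.t1 TATATATATATATATATATATATA
U13a05b01.t1 TATATATATATATATATATATATA
U13a05c07.t1 AAAAAAAAAAAAAAA
U13a05e12.t1 CTCTCTCTCTCTCTCTCTCTCTCTCTCTC
""".splitlines()

counts = count_masked_sequences(sample)
for seq, n in counts.most_common():       # most frequent sequences first
    print(n, seq)
```

<p>
To run this on real data, pass an open file handle of the nastyseq file of the pass you are interested in instead of <code class="literal">sample</code>.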
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_int_vectorclip_passxtxt"></a>17.2.10.
mira_int_vectorclip_pass.X.txt
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Only written if
[-CL:pvlc] is set to <span class="underline">yes</span>. Tells you where possible
sequencing vector (or adaptor) leftovers were found and clipped (or not
clipped).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpads_passxforward_and_miratmpads_passxcomplement"></a>17.2.11.
miratmp.ads_pass.X.forward and miratmp.ads_pass.X.complement
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Which read aligns with
Smith-Waterman against which other read, 'forward-forward' and
'forward-complement'.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpads_passxreject"></a>17.2.12.
miratmp.ads_pass.X.reject
</h3></div></div></div><p>
Note: replace the <span class="emphasis"><em>X</em></span> by the pass of mira. Which possible read
overlaps failed the Smith-Waterman alignment check.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpnoqualities"></a>17.2.13.
miratmp.noqualities
</h3></div></div></div><p>
Which reads went completely without qualities into the assembly.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_miratmpusedids"></a>17.2.14.
miratmp.usedids
</h3></div></div></div><p>
Which reads effectively went into the assembly (after clipping etc.).
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sect2_logf_mira_readpoolinfolst"></a>17.2.15.
mira_readpoolinfo.lst
</h3></div></div></div></div></div></div></div></body></html>