/usr/share/clustalw/clustalw_help is in clustalw 2.1+lgpl-4.
This file is owned by root:root, with mode 0o644.
CLUSTAL 2.1 Multiple Sequence Alignments
>> HELP NEW << NEW FEATURES/OPTIONS
==UPGMA==
The UPGMA algorithm has been added to allow faster tree construction. The user now
has the choice of using Neighbour Joining or UPGMA. The default is still NJ, but the
user can change this by setting the clustering parameter.
-CLUSTERING= :NJ or UPGMA
==ITERATION==
A remove-first iteration scheme has been added. This can be used to improve the final
alignment or improve the alignment at each stage of the progressive alignment. During the
iteration step each sequence is removed in turn and realigned. If the resulting alignment
is better than the previous alignment it is kept. This process is repeated until the score
converges (the score is not improved) or until the maximum number of iterations is
reached. The user can iterate at each step of the progressive alignment by setting the
iteration parameter to TREE or just on the final alignment by setting the iteration
parameter to ALIGNMENT. The default is no iteration. The maximum number of iterations can
be set using the numiter parameter. The default number of iterations is 3.
-ITERATION= :NONE or TREE or ALIGNMENT
-NUMITER=n :Maximum number of iterations to perform
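The remove-first loop described above can be sketched roughly as follows. This is only an illustration of the control flow: `realign` and `score` are hypothetical callbacks standing in for Clustal's internal realignment and scoring, which this help text does not specify.

```python
def iterate_remove_first(alignment, realign, score, max_iter=3):
    """Illustrative remove-first iteration (not ClustalW's actual code).

    realign(alignment, i) -> alignment with sequence i removed and realigned
    score(alignment)      -> number, higher is better
    Both callbacks are hypothetical placeholders.
    """
    best = score(alignment)
    for _ in range(max_iter):              # cf. -NUMITER (default 3)
        improved = False
        for i in range(len(alignment)):    # remove each sequence in turn
            candidate = realign(alignment, i)
            candidate_score = score(candidate)
            if candidate_score > best:     # keep the result only if it is better
                alignment, best = candidate, candidate_score
                improved = True
        if not improved:                   # score has converged
            break
    return alignment, best
```

With -ITERATION=TREE a loop of this kind would run at each progressive step; with -ITERATION=ALIGNMENT it would run once on the final alignment.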
==HELP==
-FULLHELP :Print out the complete help content
==MISC==
-MAXSEQLEN=n :Maximum allowed sequence length
-QUIET :Reduce console output to minimum
-STATS=file :Log some alignment statistics to file
>> HELP 1 << General help for CLUSTAL W (2.1)
Clustal W is a general purpose multiple alignment program for DNA or proteins.
SEQUENCE INPUT: all sequences must be in 1 file, one after another.
7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT,
Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.
All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP ("." in MSF-RSF).
To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
INPUT them; go to menu item 2 to do the multiple alignment.
PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to
add a new sequence to an old alignment, or to use secondary structure to guide
the alignment process. GAPS in the old alignments are indicated using the "-"
character. PROFILES can be input in ANY of the allowed formats; just
use "-" (or "." for MSF-RSF) for each gap position.
PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
with "-" characters to indicate gaps) OR after a multiple alignment while the
alignment is still in memory.
The program tries to automatically recognise the different file formats used
and to guess whether the sequences are amino acid or nucleotide. This is not
always foolproof.
FASTA and NBRF-PIR formats are recognised by having a ">" as the first
character in the file.
EMBL-Swiss Prot formats are recognised by the letters
ID at the start of the file (the token for the entry name field).
CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
GCG-MSF format is recognised by one of the following:
- the word PileUp at the start of the file.
- the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
at the start of the file.
- the word MSF on the first line of the file, and the characters ..
at the end of this line.
GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of
the file.
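The recognition rules above can be approximated by a small sniffing function. This is a sketch, not Clustal's own code: the name `guess_format` is hypothetical, GDE is not covered because its rule is not given here, and telling NBRF-PIR from FASTA by a ";" in the fourth column (as in ">P1;") is an assumption based on the usual PIR header.

```python
def guess_format(text: str) -> str:
    """Guess an input format from the first line, following the rules above."""
    lines = text.splitlines()
    first = lines[0] if lines else ""
    if first.startswith(">"):
        # Assumption: PIR headers look like ">P1;NAME"; anything else is FASTA.
        return "NBRF-PIR" if len(first) > 3 and first[3] == ";" else "FASTA"
    if first.startswith("ID"):
        return "EMBL-SWISSPROT"
    if first.startswith("CLUSTAL"):
        return "CLUSTAL"
    if first.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    if (first.startswith("PileUp")
            or first.startswith(("!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT"))
            or ("MSF" in first and first.rstrip().endswith(".."))):
        return "GCG-MSF"
    return "UNKNOWN"
```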
If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
sequence will be assumed to be nucleotide. This works in 97.3% of cases
but watch out!
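The 85% rule is easy to try out on your own sequences. The sketch below is an illustration only; the function name and the use of integer arithmetic for the threshold are choices made here, not taken from the program.

```python
def looks_like_nucleotide(seq: str, min_percent: int = 85) -> bool:
    """Return True if at least min_percent of the letters are A, C, G, T, U or N."""
    residues = [ch.upper() for ch in seq if ch.isalpha()]  # gaps, digits etc. are ignored
    if not residues:
        return False
    hits = sum(ch in "ACGTUN" for ch in residues)
    # Integer comparison avoids floating-point rounding at exactly 85%.
    return hits * 100 >= min_percent * len(residues)
```

A short peptide that happens to be rich in A, C, G and T would be misclassified, which is why -TYPE= can be used to override the guess.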
>> HELP 2 << Help for multiple alignments
If you have already loaded sequences, use menu item 1 to do the complete
multiple alignment. You will be prompted for 2 output files: 1 for the
alignment itself; another to store a dendrogram that describes the similarity
of the sequences to each other.
Multiple alignments are carried out in 3 stages (automatically done from menu
item 1 ...Do complete multiple alignments now):
1) all sequences are compared to each other (pairwise alignments);
2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file).
3) the final multiple alignment is carried out, using the dendrogram as a guide.
PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial
alignments.
MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences
during multiple alignment if you wish to change the parameters and try again.
This only takes effect just before you do a second multiple alignment. You
can make phylogenetic trees after alignment whether or not this is ON.
If you turn this OFF, the new gaps are kept even if you do a second multiple
alignment. This allows you to iterate the alignment gradually. Sometimes, the
alignment is improved by a second or third pass.
SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the
screen as well as to the output file.
You can skip the first stages (pairwise alignments; dendrogram) by using an
old dendrogram file (menu item 3); or you can just produce the dendrogram
with no final multiple alignment (menu item 2).
OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7
different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA).
>> HELP 3 << Help for pairwise alignment parameters
A distance is calculated between every pair of sequences and these are used to
construct the dendrogram which guides the final multiple alignment. The scores
are calculated from separate pairwise alignments. These can be calculated using
2 methods: dynamic programming (slow but accurate) or by the method of Wilbur
and Lipman (extremely fast but approximate).
You can choose between the 2 alignment methods using menu option 8. The
slow-accurate method is fine for short sequences but will be VERY SLOW for
many (e.g. >100) long (e.g. >1000 residue) sequences.
SLOW-ACCURATE alignment parameters:
These parameters do not have any effect on the speed of the alignments.
They are used to give initial alignments which are then rescored to give percent
identity scores. These % scores are the ones which are displayed on the
screen. The scores are converted to distances for the trees.
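As a rough illustration of the rescoring step, the sketch below computes a percent identity from two aligned sequences and turns it into a distance. Skipping columns where either sequence has a gap, and the linear conversion (100 - score) / 100, are assumptions made for this example; the help text does not state the exact formulas.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != gap and y != gap]
    if not pairs:
        return 0.0
    same = sum(x == y for x, y in pairs)
    return 100.0 * same / len(pairs)


def identity_to_distance(pid: float) -> float:
    """Assumed linear conversion from percent identity to a distance in [0, 1]."""
    return (100.0 - pid) / 100.0
```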
1) Gap Open Penalty: the penalty for opening a gap in the alignment.
2) Gap extension penalty: the penalty for extending a gap by 1 residue.
3) Protein weight matrix: the scoring table which describes the similarity
of each amino acid to each other.
4) DNA weight matrix: the scores assigned to matches and mismatches
(including IUB ambiguity codes).
FAST-APPROXIMATE alignment parameters:
These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters. 2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.
K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. >1000 residues) you may need to increase the default.
GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
little effect on the speed or sensitivity except for extreme values.
TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated. Only the best ones (with most matches) are
used in the alignment. This parameter specifies how many. Decrease for speed;
increase for sensitivity.
WINDOW SIZE: This is the number of diagonals around each of the 'best'
diagonals that will be used. Decrease for speed; increase for sensitivity.
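The idea behind K-TUPLE SIZE and TOP DIAGONALS can be illustrated with a few lines of code. This is a simplified sketch of k-tuple counting on dot-matrix diagonals, not the Wilbur and Lipman implementation used by the program; the function names are hypothetical and the WINDOW SIZE and GAP PENALTY parameters are not modelled.

```python
from collections import Counter


def ktuple_diagonals(a: str, b: str, k: int) -> Counter:
    """Count exact k-tuple matches on each diagonal (i - j) of the dot matrix."""
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)   # positions of each k-tuple in a
    diagonals = Counter()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals


def top_diagonals(diagonals: Counter, n: int) -> list:
    """Return the n diagonals with the most k-tuple matches."""
    return [diag for diag, _ in diagonals.most_common(n)]
```

A larger k gives fewer index hits per position, hence the speed gain; a smaller k finds weaker similarities at the cost of more work.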
>> HELP 4 << Help for multiple alignment parameters
These parameters control the final multiple alignment. This is the core of the
program and the details are complicated. To fully understand the use of the
parameters and the scoring system, you will have to refer to the documentation.
Each step in the final multiple alignment consists of aligning two alignments
or sequences. This is done progressively, following the branching order in
the GUIDE TREE. The basic parameters to control this are two gap penalties and
the scores for various identical-non-identical residues.
1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the
cost of opening up every new gap and the cost of every item in a gap.
Increasing the gap opening penalty will make gaps less frequent. Increasing
the gap extension penalty will make gaps shorter. Terminal gaps are not
penalised.
3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
distantly related sequences until after the most closely related sequences have
been aligned. The setting shows the percent identity level required to delay
the addition of a sequence; sequences that are less identical than this level
to any other sequences will be aligned later.
4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T
i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0
and 1; a weight of zero means that the transitions are scored as mismatches,
while a weight of 1 gives the transitions the match score. For distantly related
DNA sequences, the weight should be near to zero; for closely related sequences
it can be useful to assign a higher score.
5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of
weight matrices. The default for proteins in version 1.8 is the PAM series
derived by Gonnet and colleagues. Note, a series is used! The actual matrix
that is used depends on how similar the sequences to be aligned at this
alignment step are. Different matrices work differently at each evolutionary
distance.
6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)
can be selected. The default is the matrix used by BESTFIT for comparison of
nucleic acid sequences.
Further help is offered in the weight matrix menu.
7) In the weight matrices, you can use negative as well as positive values if
you wish, although the matrix will be automatically adjusted to all positive
scores, unless the NEGATIVE MATRIX option is selected.
8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty
options which are only used in protein alignments.
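Two of the parameters above lend themselves to small sketches. `gap_cost` charges the opening penalty once per gap plus the extension penalty for every position in it, and leaves terminal gaps unpenalised. `dna_pair_score` applies the transition weight. Both are simplified illustrations: the real program adjusts penalties by position, and the linear interpolation between mismatch and match score for weights between 0 and 1 is an assumption.

```python
import re

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def gap_cost(row: str, gap_open: float, gap_ext: float) -> float:
    """Total gap cost of one aligned row; terminal gaps are not penalised."""
    core = row.strip("-")
    return sum(gap_open + gap_ext * len(m.group()) for m in re.finditer(r"-+", core))


def is_transition(a: str, b: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    a, b = a.upper(), b.upper()
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def dna_pair_score(a: str, b: str, weight: float,
                   match: float = 1.0, mismatch: float = 0.0) -> float:
    """Score a DNA pair; weight 0 scores transitions as mismatches, 1 as matches."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("transition weight must be between 0 and 1")
    if a.upper() == b.upper():
        return match
    if is_transition(a, b):
        return mismatch + weight * (match - mismatch)   # assumed linear interpolation
    return mismatch
```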
>> HELP A << Help for protein gap parameters.
1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce
or increase the gap opening penalties at each position in the alignment or
sequence. See the documentation for details. As an example, positions that
are rich in glycine are more likely to have an adjacent gap than positions that
are rich in valine.
2) 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
a run (5 or more residues) of hydrophilic amino acids; these are likely to
be loop or random coil regions where gaps are more common. The residues that
are "considered" to be hydrophilic are set by menu item 3.
4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too
close to each other. Gaps that are less than this distance apart are penalised
more than other gaps. This does not prevent close gaps; it makes them less
frequent, promoting a block-like appearance of the alignment.
5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes
of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).
If you turn this off, end gaps will be ignored for this purpose. This is
useful when you wish to align fragments where the end gaps are not biologically
meaningful.
>> HELP 5 << Help for output format options.
Several output formats are offered. You can choose any (or all 7 if you wish).
CLUSTAL format output is a self explanatory alignment format. It shows the
sequences aligned in blocks. It can be read in again at a later date to
(for example) calculate a phylogenetic tree or add a new sequence with a
profile alignment.
GCG output can be used by any of the GCG programs that can work on multiple
alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
.msf format files (multiple sequence file); new in version 7 of GCG.
Fasta output is widely used because of its simplicity. Each sequence name is
preceded by a '>' sign. The sequence itself is printed on the following lines.
PHYLIP format output can be used for input to the PHYLIP package of Joe
Felsenstein. This is an extremely widely used package for doing every
imaginable form of phylogenetic analysis (MUCH more than the modest
introduction offered by this program).
NBRF-PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
characters "-" are used to indicate the positions of gaps in the multiple
alignment. These files can be re-used as input in any part of clustal that
allows sequences (or alignments or profiles) to be read in.
GDE: this is the flat file format used by the GDE package of Steven Smith.
NEXUS: the format used by several phylogeny programs, including PAUP and
MacClade.
GDE OUTPUT CASE: sequences in GDE format may be written in either upper or
lower case.
CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the
alignment lines in clustalw format.
OUTPUT ORDER is used to control the order of the sequences in the output
alignments. By default, the order corresponds to the order in which the
sequences were aligned (from the guide tree-dendrogram), thus automatically
grouping closely related sequences. This switch can be used to set the order
to the same as the input file.
PARAMETER OUTPUT: This option allows you to save all your parameter settings
in a parameter file. This file can be used subsequently to rerun Clustal W
using the same parameters.
>> HELP 6 << Help for profile and structure alignments
By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile
alignments allow you to store alignments of your favourite sequences and add
new sequences to them in small bunches at a time. A profile is simply an
alignment of one or more sequences (e.g. an alignment output file from CLUSTAL
W). Each input can be a single sequence. One or both sets of input sequences
may include secondary structure assignments or gap penalty masks to guide the
alignment.
The profiles can be in any of the allowed input formats with "-" characters
used to specify gaps (except for MSF-RSF where "." is used).
You have to specify the 2 profiles by choosing menu items 1 and 2 and giving
2 file names. Then Menu item 3 will align the 2 profiles to each other.
Secondary structure masks in either profile can be used to guide the alignment.
Menu item 4 will take the sequences in the second profile and align them to
the first profile, 1 at a time. This is useful to add some new sequences to
an existing alignment, or to align a set of sequences to a known structure.
In this case, the second profile would not be pre-aligned.
The alignment parameters can be set using menu items 5, 6 and 7. These are
EXACTLY the same parameters as used by the general, automatic multiple
alignment procedure. The general multiple alignment procedure is simply a
series of profile alignments. Carrying out a series of profile alignments on
larger and larger groups of sequences, allows you to manually build up a
complete alignment, if necessary editing intermediate alignments.
SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set 2D structure
parameters. If a solved structure is available, it can be used to guide the
alignment by raising gap penalties within secondary structure elements, so
that gaps will preferentially be inserted into unstructured surface loops.
Alternatively, a user-specified gap penalty mask can be supplied directly.
A gap penalty mask is a series of numbers between 1 and 9, one per position in
the alignment. Each number specifies how much the gap opening penalty is to be
raised at that position (raised by multiplying the basic gap opening penalty
by the number) i.e. a mask figure of 1 at a position means no change
in gap opening penalty; a figure of 4 means that the gap opening penalty is
four times greater at that position, making gaps 4 times harder to open.
The format for gap penalty masks and secondary structure masks is explained
in the help under option 0 (secondary structure options).
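The effect of a gap penalty mask can be shown in a couple of lines. In this sketch the function name and the example penalty of 10.0 are illustrative only.

```python
def apply_gap_mask(gap_open: float, mask: str) -> list:
    """Multiply the basic gap opening penalty by the mask digit at each position."""
    if not all(ch in "123456789" for ch in mask):
        raise ValueError("mask must consist of digits between 1 and 9")
    return [gap_open * int(ch) for ch in mask]
```

For a basic penalty of 10.0 and the mask "1124", the per-position penalties become 10.0, 10.0, 20.0 and 40.0.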
>> HELP B << Help for secondary structure - gap penalty masks
The use of secondary structure-based penalties has been shown to improve the
accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty
masks to be supplied with the input sequences. The masks work by raising gap
penalties in specified regions (typically secondary structure elements) so that
gaps are preferentially opened in the less well conserved regions (typically
surface loops).
Options 1 and 2 control whether the input secondary structure information or
gap penalty masks will be used.
Option 3 controls whether the secondary structure and gap penalty masks should
be included in the output alignment.
Options 4 and 5 provide the value for raising the gap penalty at core Alpha
Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues
denote the A and B core structure notation. The basic gap penalties are
multiplied by the amount specified.
Option 6 provides the value for the gap penalty in Loops. By default this
penalty is not raised. In CLUSTAL format, loops are specified by "." in the
secondary structure notation.
Option 7 provides the value for setting the gap penalty at the ends of
secondary structures. Ends of secondary structures are observed to grow
and-or shrink in related structures. Therefore by default these are given
intermediate values, lower than the core penalties. All secondary structure
read in as lower case in CLUSTAL format gets the reduced terminal penalty.
Options 8 and 9 specify the range of structure termini for the intermediate
penalties. In the alignment output, these are indicated as lower case.
For Alpha Helices, by default, the range spans the end helical turn. For
Beta Strands, the default range spans the end residue and the adjacent loop
residue, since sequence conservation often extends beyond the actual H-bonded
Beta Strand.
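The relationship between a secondary structure string and the resulting penalty mask can be sketched as below. The multipliers are assumptions: helix 4, terminal 2 and loop 1 are chosen to agree with the HBA_HUMA example mask in this help, and strand 4 is a guess. The sketch also assumes the termini are already marked in lower case, so the ranges set by options 8 and 9 are not modelled.

```python
def ss_to_mask(structure: str, helix: int = 4, strand: int = 4,
               terminal: int = 2, loop: int = 1) -> str:
    """Translate CLUSTAL-style structure symbols into gap penalty mask digits.

    'A'/'B' are core helix/strand residues, 'a'/'b' are structure termini
    and '.' is a loop. Multipliers must be single digits (1 to 9).
    """
    table = {"A": helix, "B": strand, "a": terminal, "b": terminal, ".": loop}
    try:
        return "".join(str(table[ch]) for ch in structure)
    except KeyError as exc:
        raise ValueError(f"unexpected structure symbol: {exc.args[0]!r}") from None
```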
CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input
files. For many 3-D protein structures, secondary structure information is
recorded in the feature tables of SWISS-PROT database entries. You should
always check that the assignments are correct - some are quite inaccurate.
CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.
FT HELIX 100 115
FT STRAND 118 119
The structure and penalty masks can also be read from CLUSTAL alignment format
as comment lines beginning "!SS_" or "!GM_" e.g.
!SS_HBA_HUMA ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA 112224444444444222122244444444442222224222111111111222444444
HBA_HUMA VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
Note that the mask itself is a set of numbers between 1 and 9 each of which is
assigned to the residue(s) in the same column below.
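If you want to inspect the masks in a CLUSTAL file yourself, the comment lines can be pulled out with a short script. This is a sketch with a hypothetical function name; joining the pieces of a mask that is split over several alignment blocks is an assumption about how such files are laid out.

```python
def parse_mask_lines(text: str) -> dict:
    """Collect "!SS_" and "!GM_" comment lines, keyed by kind and sequence name."""
    masks = {"SS": {}, "GM": {}}
    for line in text.splitlines():
        for kind in masks:
            prefix = f"!{kind}_"
            if line.startswith(prefix):
                fields = line[len(prefix):].split()
                if len(fields) >= 2:
                    name, piece = fields[0], fields[1]
                    # Append, in case the mask continues in a later block.
                    masks[kind][name] = masks[kind].get(name, "") + piece
    return masks
```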
In GDE flat file format, the masks are specified as text and the names must
begin with "SS_ or "GM_.
Either a structure or penalty mask or both may be used. If both are included in
an alignment, the user will be asked which is to be used.
>> HELP C << Help for secondary structure - gap penalty mask output options
The options in this menu let you choose whether or not to include the masks
in the CLUSTAL W output alignments. Showing both is useful for understanding
how the masks work. The secondary structure information is itself very useful
in judging the alignment quality and in seeing how residue conservation
patterns vary with secondary structure.
>> HELP 7 << Help for phylogenetic trees
1) Before calculating a tree, you must have an ALIGNMENT in memory. This can
either be read in (in any format) or be left in memory from a full multiple
alignment that you have just carried out.
*************** Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!! ***************
The methods used are NJ (Neighbour Joining) and UPGMA. First
you calculate distances (percent divergence) between all pairs of sequences from
a multiple alignment; second you apply the NJ or UPGMA method to the distance matrix.
2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where
ANY of the sequences have a gap will be ignored. This means that 'like' will be
compared to 'like' in all distances, which is highly desirable. It also
automatically throws away the most ambiguous parts of the alignment, which are
concentrated around gaps (usually). The disadvantage is that you may throw away
much of the data if there are many gaps (which is why it is difficult for us to
make it the default).
3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this
option makes no difference. For greater divergence, it corrects for the fact
that observed distances underestimate actual evolutionary distances. This is
because, as sequences diverge, more than one substitution will happen at many
sites. However, you only see one difference when you look at the present day
sequences. Therefore, this option has the effect of stretching branch lengths
in trees (especially long branches). The corrections used here (for DNA or
proteins) are both due to Motoo Kimura. See the documentation for details.
Where possible, this option should be used. However, for VERY divergent
sequences, the distances cannot be reliably corrected. You will be warned if
this happens. Even if none of the distances in a data set exceed the reliable
threshold, if you bootstrap the data, some of the bootstrap distances may
randomly exceed the safe limit.
4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED
tree and all branch lengths. The root of the tree can only be inferred by
using an outgroup (a sequence that you are certain branches at the outside
of the tree .... certain on biological grounds) OR if you assume a degree
of constancy in the 'molecular clock', you can place the root in the 'middle'
of the tree (roughly equidistant from all tips).
5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
By default, the bootstrap values are correctly placed on the tree branches of
the phylip format output tree. The toggle allows them to be placed on the
nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView
and Phylowin) only support node labelling but not branch labelling. Care
should be taken to note which branches and labels go together.
6) OUTPUT FORMATS: four different formats are allowed. None of these displays
the tree visually. Useful display programs accepting PHYLIP format include
NJplot (from Manolo Gouy and supplied with Clustal W), TreeView (Mac-PC), and
PHYLIP itself - OR get the PHYLIP package and use the tree drawing facilities
there. (Get the PHYLIP package anyway if you are interested in trees). The
NEXUS format can be read into PAUP or MacClade.
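Options 2) and 3) above can be illustrated with two small functions. `toss_gap_columns` drops every alignment column in which any sequence has a gap. `kimura_protein_distance` uses the widely quoted Kimura approximation for proteins, -ln(1 - D - 0.2 D^2); whether this is exactly the correction applied by the program is an assumption, since this help only refers to the documentation. Both names are hypothetical.

```python
import math


def toss_gap_columns(rows: list, gap: str = "-") -> list:
    """Remove all columns in which ANY of the aligned sequences has a gap."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("aligned sequences must have the same length")
    keep = [col for col in zip(*rows) if gap not in col]
    if not keep:
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*keep)]


def kimura_protein_distance(observed: float) -> float:
    """Correct an observed fractional divergence D for multiple substitutions."""
    inner = 1.0 - observed - 0.2 * observed ** 2
    if inner <= 0.0:
        raise ValueError("sequences too divergent to correct reliably")
    return -math.log(inner)
```

For an observed divergence of 0.1 the corrected distance is only slightly larger, while at 0.5 it is stretched to about 0.8, which is the branch-lengthening effect described in 3).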
>> HELP 8 << Help for choosing a weight matrix
For protein alignments, you use a weight matrix to determine the similarity of
non-identical amino acids. For example, Tyr aligned with Phe is usually judged
to be 'better' than Tyr aligned with Pro.
There are three 'in-built' series of weight matrices offered. Each consists of
several matrices which work differently at different evolutionary distances. To
see the exact details, read the documentation. Crudely, we store several
matrices in memory, spanning the full range of amino acid distance (from almost
identical sequences to highly divergent ones). For very similar sequences, it
is best to use a strict weight matrix which only gives a high score to
identities and the most favoured conservative substitutions. For more divergent
sequences, it is appropriate to use "softer" matrices which give a high score
to many other frequent substitutions.
1) BLOSUM (Henikoff). These matrices appear to be the best available for
carrying out database similarity (homology searches). The matrices used are:
Blosum 80, 62, 45 and 30. (BLOSUM was the default in earlier Clustal W
versions)
2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
We use the PAM 20, 60, 120 and 350 matrices.
3) GONNET. These matrices were derived using almost the same procedure as the
Dayhoff one (above) but are much more up to date and are based on a far larger
data set. They appear to be more sensitive than the Dayhoff series. We use the
GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for
Clustal W version 1.8.
We also supply an identity matrix which gives a score of 1.0 to two identical
amino acids and a score of zero otherwise. This matrix is not very useful.
Alternatively, you can read in your own (just one matrix, not a series).
A new matrix can be read from a file on disk, if the filename consists only
of lower case characters. The values in the new weight matrix must be integers
and the scores should be similarities. You can use negative as well as positive
values if you wish, although the matrix will be automatically adjusted to all
positive scores.
For DNA, a single matrix (not a series) is used. Two hard-coded matrices are
available:
1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
of nucleic acid sequences. X's and N's are treated as matches to any IUB
ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
2) CLUSTALW(1.6). The previous system used by Clustal W, in which matches score
1.0 and mismatches score 0. All matches for IUB symbols also score 0.
INPUT FORMAT The format used for a new matrix is the same as the BLAST program.
Any lines beginning with a # character are assumed to be comments. The first
non-comment line should contain a list of amino acids in any order, using the
1 letter code, followed by a * character. This should be followed by a square
matrix of integer scores, with one row and one column for each amino acid. The
last row and column of the matrix (corresponding to the * character) contain
the minimum score over the whole matrix.
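A matrix file in this layout can be checked before use with a short parser. The sketch below is not the program's own reader: the function name is hypothetical, and accepting an optional residue label at the start of each row (as found in matrix files distributed with BLAST) is an assumption.

```python
def parse_matrix(text: str) -> dict:
    """Parse a BLAST-style weight matrix into {(row, column): integer score}."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]   # skip comments and blanks
    if not rows:
        raise ValueError("no matrix found")
    header, body = rows[0], rows[1:]
    if len(body) != len(header):
        raise ValueError("matrix is not square")
    scores = {}
    for row_label, row in zip(header, body):
        if len(row) == len(header) + 1:    # optional leading row label
            row = row[1:]
        if len(row) != len(header):
            raise ValueError(f"wrong number of scores in row {row_label}")
        for col_label, value in zip(header, row):
            scores[(row_label, col_label)] = int(value)      # scores must be integers
    return scores
```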
>> HELP 9 << Help for command line parameters
DATA (sequences)
-INFILE=file.ext :input sequences.
-PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment).
VERBS (do things)
-OPTIONS :list the command line parameters
-HELP or -CHECK :outline the command line params.
-FULLHELP :output full help content.
-ALIGN :do full multiple alignment.
-TREE :calculate NJ tree.
-PIM :output percent identity matrix (while calculating the tree)
-BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-CONVERT :output the input sequences in a different file format.
PARAMETERS (set things)
***General settings:****
-INTERACTIVE :read command line, then enter normal interactive menus
-QUICKTREE :use FAST algorithm for the alignment guide tree
-TYPE= :PROTEIN or DNA sequences
-NEGATIVE :protein alignment with negative values in matrix
-OUTFILE= :sequence alignment file name
-OUTPUT= :CLUSTAL(default), GCG, GDE, PHYLIP, PIR, NEXUS and FASTA
-OUTORDER= :INPUT or ALIGNED
-CASE :LOWER or UPPER (for GDE output only)
-SEQNOS= :OFF or ON (for Clustal output only)
-SEQNO_RANGE=:OFF or ON (NEW: for all output formats)
-RANGE=m,n :sequence range to write starting m to m+n
-MAXSEQLEN=n :maximum allowed input sequence length
-QUIET :Reduce console output to minimum
-STATS= :Log some alignment statistics to file
***Fast Pairwise Alignments:***
-KTUPLE=n :word size
-TOPDIAGS=n :number of best diags.
-WINDOW=n :window around best diags.
-PAIRGAP=n :gap penalty
-SCORE :PERCENT or ABSOLUTE
***Slow Pairwise Alignments:***
-PWMATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-PWDNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
-PWGAPOPEN=f :gap opening penalty
-PWGAPEXT=f :gap extension penalty
***Multiple Alignments:***
-NEWTREE= :file for new guide tree
-USETREE= :file for old guide tree
-MATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-DNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
-GAPOPEN=f :gap opening penalty
-GAPEXT=f :gap extension penalty
-ENDGAPS :no end gap separation pen.
-GAPDIST=n :gap separation pen. range
-NOPGAP :residue-specific gaps off
-NOHGAP :hydrophilic gaps off
-HGAPRESIDUES= :list hydrophilic res.
-MAXDIV=n :% ident. for delay
-TYPE= :PROTEIN or DNA
-TRANSWEIGHT=f :transitions weighting
-ITERATION= :NONE or TREE or ALIGNMENT
-NUMITER=n :maximum number of iterations to perform
-NOWEIGHTS :disable sequence weighting
***Profile Alignments:***
-PROFILE :Merge two alignments by profile alignment
-NEWTREE1= :file for new guide tree for profile1
-NEWTREE2= :file for new guide tree for profile2
-USETREE1= :file for old guide tree for profile1
-USETREE2= :file for old guide tree for profile2
***Sequence to Profile Alignments:***
-SEQUENCES :Sequentially add profile2 sequences to profile1 alignment
-NEWTREE= :file for new guide tree
-USETREE= :file for old guide tree
***Structure Alignments:***
-NOSECSTR1 :do not use secondary structure-gap penalty mask for profile 1
-NOSECSTR2 :do not use secondary structure-gap penalty mask for profile 2
-SECSTROUT=STRUCTURE or MASK or BOTH or NONE :output in alignment file
-HELIXGAP=n :gap penalty for helix core residues
-STRANDGAP=n :gap penalty for strand core residues
-LOOPGAP=n :gap penalty for loop regions
-TERMINALGAP=n :gap penalty for structure termini
-HELIXENDIN=n :number of residues inside helix to be treated as terminal
-HELIXENDOUT=n :number of residues outside helix to be treated as terminal
-STRANDENDIN=n :number of residues inside strand to be treated as terminal
-STRANDENDOUT=n:number of residues outside strand to be treated as terminal
***Trees:***
-OUTPUTTREE=nj OR phylip OR dist OR nexus
-SEED=n :seed number for bootstraps.
-KIMURA :use Kimura's correction.
-TOSSGAPS :ignore positions with gaps.
-BOOTLABELS=node OR branch :position of bootstrap values in tree display
-CLUSTERING= :NJ or UPGMA
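When driving the program from a script, the parameters above can be assembled into an argument list. The helper below is a sketch: `clustalw_args` is a hypothetical name, the executable is assumed to be called `clustalw`, and the file names in the usage note are placeholders.

```python
def clustalw_args(infile: str, **params) -> list:
    """Build a ClustalW argument list from keyword parameters.

    True becomes a bare verb or switch (-ALIGN); other values become
    -NAME=value; None and False are skipped.
    """
    args = ["clustalw", f"-INFILE={infile}"]   # executable name is an assumption
    for name, value in params.items():
        flag = "-" + name.upper()
        if value is True:
            args.append(flag)
        elif value is not None and value is not False:
            args.append(f"{flag}={value}")
    return args
```

For example, `clustalw_args("seqs.fasta", align=True, type="PROTEIN", clustering="UPGMA", outfile="seqs.aln")` yields a list that can be passed to `subprocess.run(args, check=True)`.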
>> HELP 0 << Help for tree output format options
Four output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances,
4) Nexus.
None of these formats displays the results graphically. Many packages can
display trees in the PHYLIP format 2) below. It can also be imported into
the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display.
NEXUS format trees can be read by PAUP and MacClade.
1) Clustal format output.
This format is verbose and lists all of the distances between the sequences and
the number of alignment positions used for each. The tree is described at the
end of the file. It lists the sequences that are joined at each alignment step
and the branch lengths. After two sequences are joined, the pair is referred to later
as a NODE. The number of a NODE is the number of the lowest sequence in that
NODE.
2) Phylip format output.
This format is the New Hampshire format, used by many phylogenetic analysis
packages. It consists of a series of nested parentheses, describing the
branching order, with the sequence names and branch lengths. It can be used by
the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the
trees graphically. This is the same format used during multiple alignment for
the guide trees.
Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other
packages that can read and display New Hampshire format are TreeView (Mac/PC),
TreeTool (UNIX), and Phylowin.
3) The distances only.
This format just outputs a matrix of all the pairwise distances in a format
that can be used by the Phylip package. It used to be useful when one could not
produce distances from protein sequences in the Phylip package but is now
redundant (Protdist of Phylip 3.5 now does this).
4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,
including PAUP and MacClade. The format is described fully in:
Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997.
NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.
5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
By default, the bootstrap values are placed on the nodes of the phylip format
output tree. This is inaccurate as the bootstrap values should be associated
with the tree branches and not the nodes. However, this format can be read and
displayed by TreeTool, TreeView and Phylowin. An option is available to
correctly place the bootstrap values on the branches with which they are
associated.
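The nested-parenthesis layout of the New Hampshire format described under 2) above can be illustrated with a tiny writer. This is a sketch with a made-up tree representation (a leaf is a name, an internal node is a list of (child, branch length) pairs); it does not handle bootstrap labels.

```python
def newick(node) -> str:
    """Render a leaf name or a list of (child, branch_length) pairs."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(f"{newick(child)}:{length}" for child, length in node) + ")"


def to_newick(tree) -> str:
    """Return the complete tree string, terminated by a semicolon."""
    return newick(tree) + ";"
```

A three-sequence tree `[("A", 0.1), ([("B", 0.2), ("C", 0.3)], 0.05)]` is written as `(A:0.1,(B:0.2,C:0.3):0.05);`.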