<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
<title></title>
</head>
<body>
<pre>
Formatrpsdb: Build databases for RPS Blast
Introduction
------------
Formatrpsdb is a utility that converts a collection of input
sequences into a database suitable for use with Reverse
Position Specific (RPS) Blast. Each input sequence, together
with its position-specific scoring matrix (PSSM), is ASN.1
encoded into a PssmWithParameters (or 'scoremat') object
and resides in a separate file. Scoremat objects can be created
using the blastpgp binary in the Standalone BLAST distribution.
Formatrpsdb is given a list of these files and produces the
corresponding database.
Formatrpsdb is designed to perform the work of formatdb,
makemat and copymat simultaneously, without generating the
large number of intermediate files these utilities would need
to create an RPS Blast database. Further, scoremat objects
are in more general use than the binary format makemat requires.
It is hoped that direct manipulation of scoremat objects
will encourage conversion of more diverse sequence
collections into RPS Blast databases.
Databases generated by formatrpsdb are binary compatible
with databases generated by formatdb/makemat/copymat,
although the database files will in general not be byte-
for-byte identical.
Relevant Documents
------------------
Information on RPS Blast, as well as instructions for creating
RPS Blast databases using formatdb/makemat/copymat:
<a href="rpsblast.html">Information on rpsblast</a>
<a href="formatdb.html">Information on formatdb</a>
<a href="blastpgp.html">Information on blastpgp</a>
The ASN.1 specification for PssmWithParameters is available
in the NCBI C toolkit sources, in tools/scoremat.asn
Preconditions for Use
---------------------
This section assumes some familiarity with the documents
listed above.
An RPS Blast database consists of two groups of files. The first
group is a standard protein database generated by formatdb
(RPS Blast cannot use nucleotide databases). The second group
of files contains precomputations used to speed up RPS Blast
searches of the standard protein database. Previously, formatdb
would build the first group of files, and makemat/copymat would
be used to build the second group (the 'RPS data files').
As mentioned above, formatrpsdb performs all of these steps in
a single pass. However, the collection of sequences passed to
formatrpsdb must already be consistent in several important ways:
- All sequences must use the same protein alphabet.
- All scores in all PSSMs must be scaled by the same
factor.
- If the scoremat does not contain a PSSM, it must
contain a set of residue frequencies that formatrpsdb
can use to create a PSSM manually. The PSSM creation
process is identical to that performed by makemat,
and requires a scaling factor, gap existence and
extension penalties, and an underlying score matrix.
These must be provided as command line options to formatrpsdb,
or each scoremat can contain one or more of these values
(which will be used in place of the values specified as
input arguments). If a sequence contains both a PSSM and
residue frequencies, the latter will be ignored (see the
command line options below).
Regarding the last requirement, a collection of sequences
passed to formatrpsdb may include a mixture of sequences for which
a PSSM is available and sequences for which only the residue
frequencies are available. The present version of formatrpsdb
requires that all parameters (scale factor, gap open/extend,
underlying score matrix), whether appearing within a scoremat or
supplied from the command line, be the same for all sequences.
Prebuilt collections of sequences that satisfy these criteria
are available from NCBI, along with tools capable of building
compliant sequence files. Further, blastpgp is capable of reading
and writing scoremat files containing residue frequencies.
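The parameter rules above can be sketched in Python. This is a
minimal illustration, not formatrpsdb's implementation: the
dictionaries stand in for values already parsed from scoremat
files (ASN.1 parsing is out of scope), and the names DEFAULTS,
resolve_params and check_consistent are hypothetical. The
default values mirror the documented defaults of the -S, -G, -E
and -U options.

```python
# Hypothetical sketch: values found in a scoremat override the
# command-line values, and the resolved parameters must be the
# same for every sequence in the collection.

# Documented defaults of the -S, -G, -E and -U options.
DEFAULTS = {
    "scale": 100.0,
    "gap_open": 11,
    "gap_extend": 1,
    "matrix": "BLOSUM62",
}


def resolve_params(scoremat_params, cmdline=DEFAULTS):
    """Fill in parameters missing from one scoremat."""
    resolved = dict(cmdline)
    resolved.update(scoremat_params)
    return resolved


def check_consistent(all_scoremat_params, cmdline=DEFAULTS):
    """Return the shared parameter set, or raise ValueError."""
    resolved = [resolve_params(p, cmdline) for p in all_scoremat_params]
    if not resolved:
        raise ValueError("no scoremats given")
    first = resolved[0]
    for index, params in enumerate(resolved[1:], start=2):
        if params != first:
            raise ValueError(
                "scoremat %d disagrees with scoremat 1: %r vs %r"
                % (index, params, first)
            )
    return first


if __name__ == "__main__":
    # One scoremat carries its own gap penalties, one carries none.
    collection = [{"gap_open": 11, "gap_extend": 1}, {}]
    print(check_consistent(collection))
```

A collection in which one scoremat specifies a gap opening
penalty of 9 while the others fall back to the command-line
value of 11 would be rejected by this check.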
Command Line Options
--------------------
A list of the command line options and the current version for
formatrpsdb may be obtained by executing formatrpsdb with the
'-' argument, as in:
formatrpsdb -
The formatrpsdb options are listed below:
-t Title for database file [String] Optional
This will be printed by utilities like fastacmd as the
title of the generated database.
-i Input file containing list of ASN.1 Scoremat filenames [File In]
Each Scoremat file contains the score matrix (or residue
frequencies) and identification data for a single sequence.
Filenames should appear one per line in this file, and the
corresponding sequences will be added to the database in
the order listed in this file. There are no restrictions
on the filenames that appear in the list.
-l Logfile name: [File Out] Optional
default = formatrpsdb.log
Status and error information will be written to this file.
-o Create index files for database [T/F] Optional
default = F
If the "-o" option is TRUE and the sequence identifiers
within each scoremat allow it, formatrpsdb will generate
index files for the generated database. These will allow
retrieval of individual sequences by utilities like fastacmd.
-v Database volume size in millions of letters [Integer] Optional
default = 0
range from 0 to &lt;NULL&gt;
This option breaks up large collections of sequences into
'volumes' (each with a maximum size of 1 billion letters).
-b Scoremat files are binary [T/F] Optional
default = F
The scoremat ASN.1 format allows sequence data in human-readable
text format or a more compact binary format. Setting this option
to 'T' signals to formatrpsdb that all of the scoremat files
listed in the file for '-i' option contain binary ASN.1 scoremat data.
If set to 'F', scoremat files will all be treated as containing
ASCII text ASN.1.
-f Threshold for extending hits for RPS database [Real] Optional
default = 11.0
Formatrpsdb builds a Blast lookup table while the database is
being generated. This table indexes each input sequence for
searches using RPS Blast. The argument to '-f' specifies the
threshold value; groups of letters in any input sequence which
score above this value are added to the lookup table.
Note that fractional threshold values (e.g. '10.5') are allowed
for this argument.
-n Base name of output database
(same as input file if not specified) [String] Optional
By default, the database generated will consist of a collection
of files whose prefix matches that of the filename specified in
the '-i' option. To give the database files a different prefix,
specify the desired string for this option.
-S For scoremats that contain only residue frequencies, the
scaling factor to apply when creating PSSMs [Real] Optional
default = 100.0
When given a scoremat file that does not contain a PSSM,
formatrpsdb looks for a set of residue frequencies in the file,
and attempts to create a PSSM using those residue frequencies.
The creation process requires a scale factor for the computed
scores, provided by this argument.
-G The gap opening penalty (if not present in the scoremat)
[Integer] Optional
default = 11
-E The gap extension penalty (if not present in the scoremat)
[Integer] Optional
default = 1
If an input file does not contain gap opening and extension
penalties, the values of these two arguments will be substituted.
These are primarily intended for scoremat files that contain
only residue frequencies.
-U Underlying score matrix (if not present in the
scoremat) [String] Optional
default = BLOSUM62
If an input file does not contain the name of the NCBI standard
score matrix from which residue frequencies were derived, the
matrix name specified by the -U option will be substituted.
This is primarily intended for scoremat files that contain only
residue frequencies.
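The effect of the '-f' threshold described above can be
illustrated with a small Python sketch. It is a toy under stated
assumptions, not the formatrpsdb lookup table: the word size of
3, the 4-letter alphabet, the tiny PSSM and the names
build_lookup and WORD_SIZE are all invented for illustration,
score scaling is ignored, and 'above' is treated as 'at least'.
For every position in a PSSM the sketch enumerates all words of
the assumed size and keeps those whose summed column scores
reach the threshold.

```python
from itertools import product

WORD_SIZE = 3  # assumed word size, not stated in this document


def build_lookup(pssm, alphabet, threshold):
    """Map each qualifying word to the PSSM offsets where it scores
    at least `threshold`.

    pssm is a list of rows, one per sequence position; each row maps
    a letter to its score at that position.
    """
    lookup = {}
    last_start = len(pssm) - WORD_SIZE
    for start in range(last_start + 1):
        rows = pssm[start:start + WORD_SIZE]
        for word in product(alphabet, repeat=WORD_SIZE):
            score = sum(row[letter] for row, letter in zip(rows, word))
            if score >= threshold:
                lookup.setdefault("".join(word), []).append(start)
    return lookup


if __name__ == "__main__":
    alphabet = "ACDE"
    # Toy PSSM for the 4-residue sequence "ACDE": 5 on the matching
    # letter, -1 elsewhere.
    pssm = [
        {letter: (5 if letter == hit else -1) for letter in alphabet}
        for hit in "ACDE"
    ]
    # A lower threshold admits more words into the table.
    print(len(build_lookup(pssm, alphabet, 11.0)))
    print(len(build_lookup(pssm, alphabet, 8.0)))
```

Lowering the threshold enlarges the table, since more words
qualify at each position; raising it shrinks the table.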
Examples of Use
---------------
Given a set of three sequence files 'scoremat1', 'scoremat2'
and 'scoremat3', along with a text file 'list' consisting
of the three lines
scoremat1
scoremat2
scoremat3
the command to create an RPS Blast database is
formatrpsdb -i list
which creates the files
list.pin list.psq list.phr list.rps list.loo list.aux
The first three files are a standard non-indexed protein database,
and the last three are RPS data files. To index the database for
retrieval of individual sequences, use
formatrpsdb -i list -o T
which will add the files
list.psd list.psi
To instead call this database 'mydb', use
formatrpsdb -i list -o T -n mydb
which will create 'mydb.*' instead of 'list.*'.
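The steps in this section can be scripted. The following Python
sketch writes the list file, assembles the formatrpsdb command
line from the -i, -o and -n options, and predicts the output
file names given in this section. The helper names (write_list,
build_command, expected_files) are hypothetical, and the sketch
only prints the command rather than running it, since it assumes
nothing about where formatrpsdb is installed.

```python
import os
import tempfile

# Extensions taken from the examples in this section.
BASE_EXTENSIONS = ["pin", "psq", "phr", "rps", "loo", "aux"]
INDEX_EXTENSIONS = ["psd", "psi"]


def write_list(path, scoremat_files):
    """Write one scoremat filename per line, in database order."""
    with open(path, "w") as handle:
        for name in scoremat_files:
            handle.write(name + "\n")


def build_command(list_file, index=False, base_name=None):
    """Return the formatrpsdb argument list for the given choices."""
    command = ["formatrpsdb", "-i", list_file]
    if index:
        command += ["-o", "T"]
    if base_name is not None:
        command += ["-n", base_name]
    return command


def expected_files(list_file, index=False, base_name=None):
    """Return the database file names the examples say to expect."""
    prefix = base_name if base_name is not None else list_file
    extensions = list(BASE_EXTENSIONS)
    if index:
        extensions += INDEX_EXTENSIONS
    return [prefix + "." + ext for ext in extensions]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    list_path = os.path.join(workdir, "list")
    write_list(list_path, ["scoremat1", "scoremat2", "scoremat3"])
    print(" ".join(build_command("list", index=True, base_name="mydb")))
    print(" ".join(expected_files("list", index=True, base_name="mydb")))
```

To run the command instead of printing it, the list returned by
build_command could be passed to subprocess.run with check=True,
assuming formatrpsdb is on the PATH.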
Additional Information and Help
-------------------------------
Please direct bug reports, inquiries for assistance, and requests
for new features to blast-help@ncbi.nlm.nih.gov
Last updated July 23 2004
</pre>
</body>
</html>