
<!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">
<html><head>


   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   <meta name="GENERATOR" content="Mozilla/4.78 [en] (X11; U; Linux 2.4.19 i686) [Netscape]">
   <title>cdb tools for fasta files</title>
</head><body bgcolor="#ffffff" link="#0000ee" text="#000000" vlink="#551a8b" alink="#ff0000">

<h1>
CDB (Constant DataBase) indexing and retrieval tools for multi-FASTA files</h1>
This is a brief introduction to a pair of platform-independent, file-based
hashing tools (<b>cdbfasta</b> and <b>cdbyank</b>) that can be used for
creating indices for quick retrieval of any particular sequence from large
multi-FASTA files. The latest version has the option to compress data records
in order to save space. The index files are now architecture independent:
the same index file can be created and used on many different Unix platforms
(32-bit or 64-bit, big-endian or little-endian) and even on
Windows.
<p><b>Contents:</b>
</p><p><b>&nbsp;&nbsp; 1. Typical usage</b>
<br><b>&nbsp;&nbsp; 2. Retrieving sequence ranges or only the defline</b>
<br><b>&nbsp;&nbsp; 3. Data compression option</b>
<br><b>&nbsp;&nbsp; 4. Development notes</b>
<br>&nbsp;
</p><h2>
1. Typical usage</h2>
Use <b>cdbfasta</b> to create the index file for a multi-FASTA file and
<b>cdbyank</b>
to pull records based on that index file. A usage message is displayed
if cdbfasta or cdbyank is run without any parameters (or with
-h). In order to create an index file, the name of the FASTA file to be
indexed must be provided:
<pre>cdbfasta &lt;fasta_file&gt;</pre>
The fasta file can be specified with the whole path (if it's not in the
current directory), e.g.
<pre>cdbfasta /usr/local/db/GUDB.human</pre>
By default cdbfasta creates an index file with the same path and name as
the database file but with the .cidx suffix added to the original
name. So in the example above, a file GUDB.human.cidx will be created in
/usr/local/db/. By default, the key for a FASTA record is
the first space-delimited token following the "&gt;" starting character
of the definition line (defline). For example, if a FASTA record had a defline
like this:
<p>&gt;AA141526
</p><p>...then we can use the string 'AA141526' with cdbyank to retrieve the
full FASTA record associated with that sequence name:
</p><pre>cdbyank -a 'AA141526' /usr/local/db/GUDB.human.cidx</pre>
Sometimes all the space-delimited tokens in the defline need to be declared
as keys in the index file, all pointing to the same FASTA record. cdbfasta
does this when given the "<b>-m</b>" switch.
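<p>The two key-selection rules described above can be illustrated with a minimal Python sketch. It is not the cdbfasta implementation: the function names are hypothetical, and splitting on any whitespace (rather than strictly on spaces) is an assumption made for brevity.</p>

```python
def strip_marker(defline):
    """Drop the leading '>' of a FASTA definition line, if present."""
    return defline[1:] if defline.startswith(">") else defline


def default_key(defline):
    """Default rule: the first token after the '>' character."""
    return strip_marker(defline).split()[0]


def all_token_keys(defline):
    """Rule described for -m: every token of the defline becomes a key."""
    return strip_marker(defline).split()


# The annotation text after the accession is made up for this example.
defline = ">AA141526 example annotation text"
print(default_key(defline))
print(all_token_keys(defline))
```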
<p>For long and complex FASTA accessions (for example, EGAD|61|GP|186739|gb|AAA63210.1||M60828)
the index file can be created in such a way that cdbyank does not need
the full string in order to retrieve such
a sequence: the first "&lt;db&gt;|&lt;accession&gt;" pair (i.e. the substring
ending at the second '|' character, EGAD|61 in the example
above) is enough. Either of two alternative cdbfasta options
enables this feature:
</p><ul>
<li>
<b>-c</b> : the index file is built only by storing the "shortcut key"
(the first "db|accession" pair found in the defline of each fasta record).
In this case, cdbyank will only be able to accept these "shortcut" accessions
for record retrieval.</li>

<li>
<b>-C</b> : the index file is built by storing both the "shortcut key"
and the full key (which is considered to end at the first space character
in the defline). In this case, two strings are stored as keys for each
FASTA record, so either of them can be used as an accession for retrieval of
the same record with cdbyank.</li>
</ul>
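<p>As a rough illustration of the "shortcut key" idea, the Python sketch below derives the keys that would be stored under each option. The function names and the <code>mode</code> parameter are hypothetical, and the handling of accessions with fewer than two '|' characters (the whole key is used) is an assumption, not documented cdbfasta behaviour.</p>

```python
def full_key(defline):
    """Full key: the text after '>' up to the first space."""
    body = defline[1:] if defline.startswith(">") else defline
    return body.split(" ", 1)[0]


def shortcut_key(defline):
    """Shortcut key: the part of the full key before the second '|'."""
    return "|".join(full_key(defline).split("|")[:2])


def keys_for(defline, mode):
    """mode 'c' keeps only the shortcut key; mode 'C' keeps both keys."""
    short, full = shortcut_key(defline), full_key(defline)
    if mode == "c" or short == full:
        return [short]
    return [short, full]


# The description after the accession is made up for this example.
defline = ">EGAD|61|GP|186739|gb|AAA63210.1||M60828 example description"
print(keys_for(defline, "c"))
print(keys_for(defline, "C"))
```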
In order to retrieve records from the database file, cdbyank should be
provided with the name of the index file created previously with cdbfasta,
e.g.:
<pre>cdbyank -a 'human|Z98492' /usr/local/db/GUDB.human.cidx</pre>
A list of accessions is expected at stdin if the -a option is not provided,
e.g.:
<pre>cat seq_list | cdbyank /usr/local/db/GUDB.human.cidx</pre>
This way the output will be a series of FASTA records at stdout; redirecting
this output to a file yields a multi-FASTA file. By default, cdbyank locates the
database file by stripping the '.cidx' suffix off the index filename.
This is not enforced, however: with the <b>-d</b> option, cdbyank
uses a user-provided database file together with the given index file.
In the example above, if the index file "GUDB.human.cidx" is moved into
another directory, a cdbyank command (in that other directory) can be issued
like this:
<pre>cdbyank -a 'human|Z98492' -d /usr/local/db/GUDB.human GUDB.human.cidx</pre>
The index file may appear at any position in cdbyank's argument list.
When -a is used, the exit status returned by cdbyank to the
shell is 1 if the given key was not found and 0 on success.
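<p>A script that calls cdbyank can rely on the '.cidx' naming rule and on the exit status. The Python sketch below is one hedged way to do that: the helper names are hypothetical, only the -a and -d options described above are used, and any non-zero exit status is treated as "no record".</p>

```python
import subprocess


def default_db_path(index_path):
    """Database path obtained by stripping a trailing '.cidx' suffix."""
    suffix = ".cidx"
    if index_path.endswith(suffix):
        return index_path[: -len(suffix)]
    return index_path


def cdbyank_command(key, index_path, db_path=None):
    """Build the argument list for a single-key (-a) lookup."""
    command = ["cdbyank", "-a", key]
    if db_path is not None:
        command += ["-d", db_path]  # database stored away from the index
    command.append(index_path)
    return command


def fetch_record(key, index_path, db_path=None):
    """Run cdbyank and return the record text, or None on a non-zero status."""
    result = subprocess.run(
        cdbyank_command(key, index_path, db_path),
        capture_output=True,
        text=True,
    )
    return result.stdout if result.returncode == 0 else None


print(default_db_path("/usr/local/db/GUDB.human.cidx"))
print(cdbyank_command("human|Z98492", "GUDB.human.cidx", "/usr/local/db/GUDB.human"))
```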
<p>The total number of FASTA records indexed and the list of the keys stored
in a specific cdb index file can be retrieved with cdbyank's <b>-n</b>
and <b>-l</b> switches, respectively. This information is obtained from
the index file directly (the database file is not needed for that). There
is also a -s option that displays a summary of the information
stored in the index at indexing time: the original name of the FASTA
file, its size, how the index was created (whether the -m (multiple keys)
option or the -c or -C (shortcut keys) option was given), the number
of keys stored in the file, and the number of FASTA records indexed
(the latter being the same as what the <b>-n</b> option returns).
</p><p>As an extra feature, cdbfasta and cdbyank can also be used in
special cases where a database has different records with the same
key (non-unique keys). Although performance degrades a little,
cdbfasta is able to index this kind of file, but by default cdbyank only
outputs the first record found. To retrieve and display all the records sharing
the same key (accession), give the <b>-x</b> option
to cdbyank.
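<br>The difference between the default lookup and the -x lookup can be pictured with a toy Python model. This is only a sketch of the observable behaviour: it uses an in-memory dictionary rather than the cdb hash file, and all names in it are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Map each key to the list of records sharing it, in file order."""
    index = defaultdict(list)
    for key, record in records:
        index[key].append(record)
    return index


def lookup(index, key, all_matches=False):
    """Return only the first record by default, or every record (like -x)."""
    matches = index.get(key, [])
    return list(matches) if all_matches else matches[:1]


# Two made-up records that share the key 'dup1'.
records = [("dup1", ">dup1\nACGT\n"), ("uniq", ">uniq\nGGCC\n"), ("dup1", ">dup1\nTTAA\n")]
index = build_index(records)
print(lookup(index, "dup1"))
print(lookup(index, "dup1", all_matches=True))
```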
<br>&nbsp;
</p><h2>
2. Retrieving sequence ranges or only the defline</h2>

<p><br>Two <b>cdbyank</b> options are provided for convenience. The <b>-F</b>
option returns only the definition line of each requested FASTA record (the
first line of each record). The <b>-R</b> option is intended
for FASTA files containing actual genetic sequences (nucleotide or protein)
and expects each retrieval query to have the following format
(space delimited):
</p><p>&lt;key&gt;&nbsp; &lt;left_coordinate&gt;&nbsp; &lt;right_coordinate&gt;
</p><p>For example, to retrieve only the sequence range 24...178
(letter numbering starts at 1) from the sequence named 'human|Z98492',
the cdbyank command would look like this:
</p><pre>cdbyank -a 'human|Z98492 24 178' -R GUDB.human.cidx</pre>
Multiple sequence ranges can be extracted this way by providing a file
in which each line follows the format above (key followed by the two coordinates).
Then, as before, such a file can be piped into cdbyank with the -R option to
pull the specified range for each of the sequences listed in the
input file.
<br>&nbsp;
<pre>cat seqlistranges | cdbyank -R GUDB.human.cidx</pre>
Note that this range option works by actually parsing and looping through
the characters of the retrieved record internally, so performance is poor
when a range near the end of a very large record is pulled.
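<p>The range query format and the character-by-character scan can be sketched in Python as follows. This is an illustration under stated assumptions, not the cdbyank code: the function names are hypothetical, coordinates are taken as 1-based and inclusive as in the example above, and the scan stops as soon as the end coordinate is passed.</p>

```python
def parse_range_query(line):
    """Split a query line into key, start and end coordinates."""
    key, start, end = line.split()
    return key, int(start), int(end)


def extract_range(record, start, end):
    """Return letters start..end (1-based, inclusive) of a FASTA record."""
    letters = []
    position = 0
    for line in record.splitlines()[1:]:  # skip the defline
        for letter in line.strip():
            position += 1
            if position > end:
                return "".join(letters)
            if position >= start:
                letters.append(letter)
    return "".join(letters)


# A made-up two-line record; letters are counted across line breaks.
record = ">seq1 demo\nACGTAC\nGTTTGA\n"
print(parse_range_query("human|Z98492 24 178"))
print(extract_range(record, 5, 8))
```

<p>Because every letter before the start coordinate is still visited, the cost grows with the position of the range inside the record, which matches the performance note above.</p>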
<h2>
3. Data compression option</h2>
The indexing program <b>cdbfasta</b> has the <b>-z &lt;compressed_db&gt;</b>
option, which creates a compressed file &lt;compressed_db&gt; from the data in
the given input file and at the same time creates an index file for this
new compressed database, named &lt;compressed_db&gt;.cidxz. The original input
file can then be discarded, as it can be recovered at any point later
from the &lt;compressed_db&gt; file by using the <b>-z</b> option of <b>cdbyank</b>.
<br>Because each record is compressed separately, compression is poor if
the records are small. Compression is only advised when:
<ul>
<li>
data records are large enough for the compression algorithm to become efficient
(at least 1KB per record, the more the better)</li>

<li>
only random access is needed to the data records (so the original file
can be discarded)</li>
</ul>
The compression can be quite slow for large files and there is also some
performance penalty for cdbyank as it has to decompress the retrieved records
on the fly. The input data for cdbfasta compression can be collected from
stdin if '-' is used instead of a file name:
<pre>cat my_data_files* | cdbfasta - -z mydata.cdbz</pre>
This option is especially useful when the total size of the input data files
is extremely large (over the file-system limits or over the 4GB internal
limit of cdbfasta) while the compressed output can be small enough to fall
under such limits.
<br>With compressed databases cdbyank can be used normally without extra
options, as it auto-detects the compression (from the index file info)
and activates on-the-fly decompression of the retrieved records. The -F
and -R options are not yet supported for compressed records.
<h2>
4. Development notes</h2>
These tools were developed in C++, based on the publicly available <b>cdb</b>
("constant database") code written by D.J. Bernstein (<a href="http://cr.yp.to/djb.html">http://cr.yp.to/djb.html</a>).
"<i>Constant databases</i>" are those that do not need records added or
removed. The original C source was (rather crudely) wrapped into C++
classes and adjusted to automatically index FASTA records and to create
an external index instead of compacting the original data file as the
original cdb library code does. Also, the "endianness" is now checked
at runtime and the bytes are swapped accordingly, so that the file offsets
and record sizes are always read and written in the same way in the index file.
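<p>The idea of writing offsets and sizes in one fixed byte order can be shown with Python's <code>struct</code> module. The layout below (a little-endian 32-bit offset followed by a 32-bit length) is a hypothetical one chosen for illustration; the actual field widths and byte order of the .cidx format are not stated here.</p>

```python
import struct

# Hypothetical entry layout: '<' forces little-endian, 'I' is an
# unsigned 32-bit integer. Assumed for illustration only.
ENTRY = struct.Struct("<II")


def pack_entry(offset, length):
    """Serialise an (offset, length) pair identically on any machine."""
    return ENTRY.pack(offset, length)


def unpack_entry(data):
    """Read back an (offset, length) pair written by pack_entry."""
    return ENTRY.unpack(data)


packed = pack_entry(1, 258)
print(packed.hex())
print(unpack_entry(packed))
```

<p>Here the explicit format prefix plays the role of the runtime check and byte swap described above: the bytes on disk do not depend on the host architecture.</p>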
<br>The compression option uses <b>zlib</b>'s "deflate" method. The program
uses deflate() with Z_FULL_FLUSH after each record, such that random record
decompression is possible after the first [dummy] record is decompressed
internally.
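<p>The per-record flushing scheme can be tried out with Python's <code>zlib</code> bindings. This is a minimal sketch under assumptions, not the cdbfasta code: the function names, the one-byte dummy record, and the (offset, length) bookkeeping are all hypothetical. It relies on the documented zlib behaviour that a full flush resets the compression state, so decompression can restart at that point.</p>

```python
import zlib


def compress_records(records):
    """Compress records into one stream, with a full flush after each.

    Returns the compressed bytes, the length of the leading dummy chunk,
    and one (offset, length) span per record.
    """
    compressor = zlib.compressobj()
    # The dummy first record carries the stream header.
    dummy = compressor.compress(b"\n") + compressor.flush(zlib.Z_FULL_FLUSH)
    blob = bytearray(dummy)
    spans = []
    for record in records:
        chunk = compressor.compress(record) + compressor.flush(zlib.Z_FULL_FLUSH)
        spans.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), len(dummy), spans


def fetch_compressed(blob, dummy_len, span):
    """Decompress a single record without touching the records before it."""
    offset, length = span
    decompressor = zlib.decompressobj()
    decompressor.decompress(blob[:dummy_len])  # prime with the dummy record
    return decompressor.decompress(blob[offset:offset + length])


# Made-up records for the demonstration.
records = [b">a\nACGTACGT\n", b">b\nGGGGCCCC\n", b">c\nTTTTAAAA\n"]
blob, dummy_len, spans = compress_records(records)
print(fetch_compressed(blob, dummy_len, spans[2]))
```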
<br>The index file contains an info chunk (actually stored at the end of
the file) which holds summary data and flags about the indexing process
(the -s option of cdbyank shows this info). Since the compression option
was added, cdbyank always tries to read this information first (before
opening the data file) in order to determine whether the data records are compressed
or not.
<p>Please let me know if you notice any problems with these tools.
</p><p>--
<br>Geo Pertea
<br>geo.pertea@gmail.com
<br>06/09/2003
<br>&nbsp;
</p></body></html>