/usr/share/gmod/chado/soi/README is in libchado-perl 1.23-5.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 | INTRODUCTION
Chado is a relational database schema for managing genomic and genetic
organism data. To maintain genomic data in chado, all sequence features
and their relationship are stored in 2 tables: feature and
feature_relationship. The intrinsic type of a feature is stored in the
feature table where the feature type is defined in the sequence
ontology (SO). Parent to child relationships are stored in the
feature_relationship table, with a relationship type defined in SO.
By examining all feature types and relationships of every feature in a
chado database one can determine how SO is instantiated (SOI) for this
particular database instance. SOI can be created using a server-side
function, but it could also be created using a database trigger. Once
the SOI has been created, an SQL query can be written in a uniform way.
Of particular interest is that the query can be made much more efficient
in a RDBMS that supports server-side programming and sub-query.
Traditionally, to retrieve relationship between features, a table
joining is used. This kind of SQL query may result in many duplicate
values coming in from database server. For example, a parent can appear
as many times as the number of its children multiplied by the number of
their children and so on. Using SOI, an SQL query can written in such a
way that each distinct feature appears only once with feature's values
plus its parent ID if any and the feature depth in a SOI sub-tree of
interest. With the depth value, the parent feature is guaranteed to come
in before its children so that placing children into their parent is a
linear search. Two SOI modules have been implemented in perl:
SOI::Adapter.pm and SOI::Feature.pm. SOI modules are lightweight, e.g.
Feature.pm has about 300 lines of code. Using these 2 modules, any data
model can be constructed from a chado database. An external SQL template
can be used to construct a feature tree as long as the SQL conforms to
the following: 1) each feature has parent ID (excluding the top node
feature), 2) each feature has the tree depth value. Since constructing
the feature relationship is no longer hard-coded into the SQL query,
data model growth will have no exponential effect on code base and
retrieval performance. Performance data from various queries show SOI
modules have excellent performance compared to architectures that rely
upon table joining for feature retrieval.
THE soi PROJECT
Currently the soi project has modules supporting only postgres chado database,
soi modules should work in other dbms as long as soi (ontology) can be
created. Some templates depend on server side functions (see templates
and fx directories)
contact: sshu@fruitfly.org
CONTENTS
readme: doc for soi project
Directories:
SOI: perl SOI modules (Adapter, Feature, Outputter)
scripts: perl scripts using SOI module to query chado db and use results
for some purpose, say dump xml
cgi: script for web server, currently we have only get_xml.pl that serves
out GAME xml for apollo
fx: plpgsql functions on which some templates depend
[deprecated, mv to cv or sequence module function dir]
templates: soi templates SOI::Adapter can take and construct feature tree
FEATURES
Pros:
fast
lightweight
seamlessly extendable
flexible
Cons:
depends on soi created in a chado database
depends on server side programming (hence plpgsql functions above)
SQL is a bit harder to write? see templates for example
(see ABSTRACT for more)
INSTALLATION
To create soi in a chado database:
1) load plpgsql functions in fx directory to the database (not just for
creating soi)
2) select * from create_soi() in psql shell (need to change unique
constraint on cvtermpath in current FlyBase chado dump, see
gmod/schema/chado/modules/sequence/bdgp/bdgp-index.pl)
General instruction to write SQL for SOI::Adapter (see templates for example):
1) use union to join top feature select with child feature select
2) top feature parent_id has to be null and soi tree depth has to be 1
3) child feature depth has to be maximum from its parent in the soi tree
so a child feature is guaranteed to come in after its parent feature
from server
4) each feature appears only once in the result set
here is whole soi tree SQL for a type:
(select c.name, c.cvterm_id, 1 as depth
FROM cvterm c, cv
WHERE c.cv_id = cv.cv_id and c.name IN ($top_feature_type) and cv.name = 'so')
UNION
(select c.name, c.cvterm_id, max(pathdistance+1) as depth
FROM cvterm c, cvtermpath path, cvterm p, cv
WHERE c.cvterm_id = subject_id and p.cvterm_id = object_id
and path.cv_id =cv.cv_id and cv.name = 'soi'
and p.name in ($top_feature_type) group by c.name, c.cvterm_id)
here is child soi tree SQL for a type
select c.name, c.cvterm_id, max(pathdistance+1) as depth
FROM cvterm c, cvtermpath path, cvterm p, cv
WHERE c.cvterm_id = subject_id and p.cvterm_id = object_id
and path.cv_id =cv.cv_id and cv.name = 'soi'
and p.name in ($top_feature_type) group by c.name, c.cvterm_id
performance issue:
postgres v7.4 is better than v7.3
removing chromosome_arm residues form db speeds up query of
golden_path_region (Dros specific?)
create cluster index on featureloc for speed (see
chado/modules/sequence/bdgp/bdgp-index.pl)
|