/usr/share/gmod/chado/soi/README

INTRODUCTION

Chado is a relational database schema for managing genomic and genetic 
organism data. To maintain genomic data in chado, all sequence features
and their relationship are stored in 2 tables: feature and
feature_relationship. The intrinsic type of a feature is stored in the
feature table where the feature type is defined in the sequence
ontology (SO). Parent to child relationships are stored in the
feature_relationship table, with a relationship type defined in SO. 

By examining all feature types and relationships of every feature in a
chado database one can determine how SO is instantiated (SOI) for this
particular database instance. SOI can be created using a server-side
function, but it could also be created using a database trigger. Once
the SOI has been created, an SQL query can be written in a uniform way.
Of particular interest is that the query can be made much more efficient
in a RDBMS that supports server-side programming and sub-query. 

Traditionally, to retrieve relationship between features, a table
joining is used. This kind of SQL query may result in many duplicate
values coming in from database server. For example, a parent can appear
as many times as the number of its children multiplied by the number of
their children and so on. Using SOI, an SQL query can written in such a
way that each distinct feature appears only once with feature's values
plus its parent ID if any and the feature depth in a SOI sub-tree of
interest. With the depth value, the parent feature is guaranteed to come
in before its children so that placing children into their parent is a
linear search. Two SOI modules have been implemented in perl:
SOI::Adapter.pm and SOI::Feature.pm. SOI modules are lightweight, e.g.
Feature.pm has about 300 lines of code. Using these 2 modules, any data
model can be constructed from a chado database. An external SQL template
can be used to construct a feature tree as long as the SQL conforms to
the following: 1) each feature has parent ID (excluding the top node
feature), 2) each feature has the tree depth value. Since constructing
the feature relationship is no longer hard-coded into the SQL query,
data model growth will have no exponential effect on code base and
retrieval performance. Performance data from various queries show SOI
modules have excellent performance compared to architectures that rely
upon table joining for feature retrieval.


THE soi PROJECT

Currently the soi project has modules supporting only postgres chado database,
soi modules should work in other dbms as long as soi (ontology) can be
created.  Some templates depend on server side functions (see templates
and fx directories) 

contact: sshu@fruitfly.org


CONTENTS

readme: doc for soi project

Directories:

SOI: perl SOI modules (Adapter, Feature, Outputter)

scripts: perl scripts using SOI module to query chado db and use results
    for some purpose, say dump xml

cgi: script for web server, currently we have only get_xml.pl that serves
    out GAME xml for apollo

fx: plpgsql functions on which some templates depend
[deprecated, mv to cv or sequence module function dir]

templates: soi templates SOI::Adapter can take and construct feature tree


FEATURES

Pros:
    fast
    lightweight
    seamlessly extendable
    flexible

Cons:
    depends on soi created in a chado database
    depends on server side programming (hence plpgsql functions above)
    SQL is a bit harder to write? see templates for example
    (see ABSTRACT for more)


INSTALLATION

To create soi in a chado database:

1) load plpgsql functions in fx directory to the database (not just for
   creating soi)
2) select * from create_soi() in psql shell (need to change unique
   constraint on cvtermpath in current FlyBase chado dump, see
   gmod/schema/chado/modules/sequence/bdgp/bdgp-index.pl)

General instruction to write SQL for SOI::Adapter (see templates for example):

1) use union to join top feature select with child feature select
2) top feature parent_id has to be null and soi tree depth has to be 1
3) child feature depth has to be maximum from its parent in the soi tree
   so a child feature is guaranteed to come in after its parent feature
   from server
4) each feature appears only once in the result set

here is whole soi tree SQL for a type:

(select c.name, c.cvterm_id, 1 as depth
FROM cvterm c, cv
WHERE c.cv_id = cv.cv_id and c.name IN ($top_feature_type) and cv.name = 'so')
UNION
(select c.name, c.cvterm_id, max(pathdistance+1) as depth
FROM cvterm c, cvtermpath path, cvterm p, cv
WHERE c.cvterm_id = subject_id and p.cvterm_id = object_id
and path.cv_id =cv.cv_id and cv.name = 'soi'
and p.name in ($top_feature_type) group by c.name, c.cvterm_id)


here is child soi tree SQL for a type

select c.name, c.cvterm_id, max(pathdistance+1) as depth
FROM cvterm c, cvtermpath path, cvterm p, cv
WHERE c.cvterm_id = subject_id and p.cvterm_id = object_id
and path.cv_id =cv.cv_id and cv.name = 'soi'
and p.name in ($top_feature_type) group by c.name, c.cvterm_id

performance issue:

postgres v7.4 is better than v7.3
removing chromosome_arm residues form db speeds up query of
    golden_path_region (Dros specific?)
create cluster index on featureloc for speed (see
    chado/modules/sequence/bdgp/bdgp-index.pl)
libchado-perl 1.31-4 / usr / share / gmod / chado / soi / README