RT-RICO - Identifying Non-Independent Patterns in Protein Motif Sequence Data

Useful Links

Something new ...

For proteins with NO homologous protein of known structure in the PDB:
:: www.pdb.org => Advanced Search => Sequence (BLAST/FASTA/PSI-BLAST); BLAST => Sequence search (E cutoff 10.0 or 30.0, or use FASTA, depending on the output; need an algorithm for this)
(A FASTA search may return different results; some proteins may need a FASTA search because BLAST returned nothing.)
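
One possible reading of the "need an algorithm" note is a simple fallback: try BLAST at E cutoff 10.0, then at 30.0, and only then switch to FASTA. The Python sketch below assumes that order. The function name search_with_fallback and the two callables blast and fasta are hypothetical placeholders for whatever actually runs the searches; they are not part of any PDB interface.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical search callable: (sequence, e_cutoff) -> list of hit identifiers.
SearchFn = Callable[[str, float], List[str]]


def search_with_fallback(
    sequence: str,
    blast: SearchFn,
    fasta: SearchFn,
    e_cutoffs: Sequence[float] = (10.0, 30.0),
) -> Tuple[str, float, List[str]]:
    """Return (method, cutoff, hits) from the first search that finds anything.

    Assumed order: BLAST at each cutoff in turn, then FASTA at the last cutoff.
    """
    if not e_cutoffs:
        raise ValueError("at least one E cutoff is required")
    for cutoff in e_cutoffs:
        hits = blast(sequence, cutoff)
        if hits:
            return "blast", cutoff, hits
    # BLAST returned nothing at every cutoff, so fall back to FASTA.
    last_cutoff = e_cutoffs[-1]
    return "fasta", last_cutoff, fasta(sequence, last_cutoff)
```

Passing the searches in as callables keeps the decision logic separate from how the searches are run (web form, local program, or saved results).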

What I am doing...

1. SCOP: Structural Classification of Proteins

2. SCOP parseable files follow the structure of the SCOP hierarchy

3. Get data from SCOP
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/nar2002.pdf
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/nar2007.pdf
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/1995-jmb-scop.pdf

4. Pfam database of protein families and HMMs - it has some good links to other databases

5. An MSD example

6. Beginning Perl for Bioinformatics book exercises & examples; PDB file format (secondary structure) explanation; PDB download ss.txt.

Proposed Steps

0. Do steps 1 to 5 manually to prove it can be done before coding.

1. Parse the SCOP file to get protein names divided into 4 structures. (dir.cla.scop.txt_1.xx)

2. Use the step 1 results to parse the PDB or NCBI protein data file and get each amino acid sequence with its corresponding secondary structure sequence.

3. Use (part of) the step 2 results to build a data file ready for RT-RICO, e.g. (5 amino acids + 1 secondary structure) x 3 secondary structures. Also decide the format of the data file (e.g. is it suitable to use 5 amino acids, or 9?).

4. Using the step 3 results, modify the RT-RICO code to generate rules, e.g. from 5 amino acids -> 1 secondary structure (slightly different for the beginning and end). Use the generated rules to predict secondary structure (probability).

5. Use (part of) the step 2 results to generate a Q3 score; use the step 3 results and the newly developed algorithm to generate a Q3 score.

6. If step 5 works, write a paper. (And it works !!! Yeh !!!)

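Step 1 above could be sketched roughly as follows in Python. This assumes each non-comment line of the classification file is whitespace-separated with the domain identifier, PDB id, chain/region, and the dotted classification string (e.g. a.1.1.1) in the first four fields, and that the "4 structures" are the classes whose classification string starts with a, b, c, or d. Both are assumptions to check against the actual file. The function name parse_scop_cla is made up for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_scop_cla(
    lines: Iterable[str],
    classes: Tuple[str, ...] = ("a", "b", "c", "d"),  # assumed class letters
) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group (domain id, PDB id, region) entries by their class letter."""
    by_class: Dict[str, List[Tuple[str, str, str]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not in the assumed layout
        domain_id, pdb_id, region, classification = fields[:4]
        class_letter = classification.split(".")[0]
        if class_letter in classes:
            by_class[class_letter].append((domain_id, pdb_id, region))
    return dict(by_class)


if __name__ == "__main__":
    # Illustrative lines in the assumed layout, not copied from a real release.
    sample = [
        "# header comment",
        "d0aaaa_\t0aaa\tA:\ta.1.1.1\t1\tcl=1",
        "d0bbbb_\t0bbb\tB:\tb.2.1.1\t2\tcl=2",
    ]
    for letter, entries in parse_scop_cla(sample).items():
        print(letter, entries)
```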

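For step 3, one way to build the "5 amino acids + 1 secondary structure" rows is a sliding window. The sketch below assumes the secondary structure label belongs to the centre residue of each window, and it handles the beginning and end of a sequence by padding with a placeholder character. Both choices are assumptions; the actual RT-RICO data file may treat the ends differently. The name make_windows is hypothetical.

```python
from typing import List, Tuple


def make_windows(
    aa_seq: str, ss_seq: str, width: int = 5, pad: str = "-"
) -> List[Tuple[str, str]]:
    """Pair each residue's secondary structure with a window centred on it."""
    if len(aa_seq) != len(ss_seq):
        raise ValueError("sequence and secondary structure lengths differ")
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    # Pad both ends so the first and last residues also get full windows.
    padded = pad * half + aa_seq + pad * half
    return [(padded[i:i + width], label) for i, label in enumerate(ss_seq)]


if __name__ == "__main__":
    for window, label in make_windows("MKVLA", "CHHHC"):
        print(window, label)
```

Changing width to 9 gives the alternative window size mentioned in step 3 without touching the rest of the code.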

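For step 5, Q3 is the percentage of residues whose three-state secondary structure (helix, strand, coil) is predicted correctly. A minimal Python sketch, assuming the observed and predicted structures are already reduced to three one-letter states and aligned position by position (the function name q3_score is chosen here for illustration):

```python
def q3_score(observed: str, predicted: str) -> float:
    """Percentage of positions where the predicted state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted lengths differ")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    # 6 of 8 positions agree.
    print(q3_score("HHHEECCC", "HHEEECCH"))  # 75.0
```
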
Other Readings (Mainly for Data)