RT-RICO - Identifying Non-Independent Patterns in Protein Motif Sequence Data

Useful Links

Wikipedia Secondary Structure
Wikipedia Secondary Structure Prediction
Pfam database of protein families and HMMs
The PSIPRED protein structure prediction server
CASP, Critical Assessment of Techniques for Protein Structure Prediction
EVA Results for Secondary Structure, other methods/servers
SCOP: Structural Classification of Proteins
NCBI
PDB ----- PDB Secondary Structure Download

Something new ...

For proteins with NO homologous protein with known structure in PDB.
:: www.pdb.org => Advanced Search => Sequence (Blast/Fasta/PSI-BLAST); Blast => Sequence search (E cutoff 10.0 or 30.0, or use Fasta, depending on the output; an algorithm is needed for this choice)
(A Fasta search may return different results; some proteins may need a Fasta search because Blast returned nothing.)

What I am doing...

1. SCOP: Structural Classification of Proteins

2. SCOP parseable files follow the structure of the SCOP hierarchy

3. Get data from SCOP
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/nar2002.pdf
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/nar2007.pdf
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/1995-jmb-scop.pdf

4. Pfam database of protein families and HMMs - it has some good links to other databases

5. A MSD example

6. Beginning Perl in Bioinformatics Book Exercises & Examples, PDB file format (secondary structure) explanation. PDB downloads ss.txt.
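
As a rough sketch of how the ss.txt download mentioned above could be read: the file is assumed here to be FASTA-like, with paired records headed `>ID:CHAIN:sequence` and `>ID:CHAIN:secstr`, and with blanks in the secstr record meaning "no assigned structure". The function name `parse_ss`, the identifier `1ABC` and the sample record are made up for illustration only.

```python
def parse_ss(lines):
    """Return {(pdb_id, chain): {"sequence": str, "secstr": str}}.

    Assumes FASTA-like records headed '>ID:CHAIN:sequence' or
    '>ID:CHAIN:secstr' (an assumption about the ss.txt layout).
    """
    records = {}
    key = kind = None
    for raw in lines:
        # Strip only the newline: blanks inside a secstr line are data.
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            pdb_id, chain, kind = line[1:].split(":")[:3]
            key = (pdb_id, chain)
            records.setdefault(key, {"sequence": "", "secstr": ""})
        elif key is not None:
            records[key][kind] = records[key].get(kind, "") + line
    # Trailing blanks may have been trimmed; pad so both strings align.
    for rec in records.values():
        rec["secstr"] = rec["secstr"].ljust(len(rec["sequence"]))
    return records


# Hypothetical sample record, not taken from the real file.
SAMPLE = [
    ">1ABC:A:sequence\n",
    "MVLSEGEW\n",
    ">1ABC:A:secstr\n",
    "  HHHHH\n",
]
parsed = parse_ss(SAMPLE)
```

Padding the structure string to the sequence length keeps residue i and structure code i aligned, which the later windowing step relies on.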

Proposed Steps

0. Do steps 1 to 5 manually to prove it can be done before coding.

1. Parse the SCOP file and get protein names divided into the 4 structural classes. (dir.cla.scop.txt_1.xx)

2. Use step 1 results to parse PDB or NCBI protein data file, get amino acid sequence and corresponding secondary structure sequence.

3. Use (part of) the step 2 results to build a data file ready for RT-RICO, e.g. (5 amino acids + 1 secondary structure) x 3 secondary structures, and decide the format of the data file (e.g. is a window of 5 amino acids suitable, or 9?)

4. Use step 3 results, modify RT-RICO code, generate rules e.g. from 5 amino acids -> 1 secondary structure (slightly different for beginning and end). Use the rules generated to predict secondary structure (probability).

5. Use (part of) step 2 results to generate Q3 score, use step 3 results & newly developed algorithm to generate Q3 score

6. If step 5 works, write a paper. (And it works!!! Yeah!!!)
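
Steps 3 and 4 above can be illustrated with a small sketch. It assumes that the "1 secondary structure" attached to each 5-residue window is that of the window's central residue, and it skips the first and last two residues (the notes say the ends are handled slightly differently, without saying how). The rule table is a plain frequency count used as a stand-in; it is not the RT-RICO algorithm. The names `windows` and `count_rules` and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def windows(seq, ss, size=5):
    """Yield (amino-acid window, structure of the central residue)."""
    if len(seq) != len(ss) or size % 2 != 1:
        raise ValueError("need equal-length strings and an odd window size")
    half = size // 2
    # Residues closer than `half` to either end are skipped in this sketch.
    for i in range(half, len(seq) - half):
        yield seq[i - half:i + half + 1], ss[i]


def count_rules(pairs):
    """Map each window to {structure: relative frequency}.

    A simple frequency table, standing in for rule generation.
    """
    counts = defaultdict(Counter)
    for window, label in pairs:
        counts[window][label] += 1
    return {
        w: {s: n / sum(c.values()) for s, n in c.items()}
        for w, c in counts.items()
    }


# Toy data, made up for illustration.
pairs = list(windows("ACDEFGH", "CHHHEEC"))
rules = count_rules(
    [("AAAAA", "H"), ("AAAAA", "H"), ("AAAAA", "E"), ("AAAAA", "C")]
)
```

Changing `size` to 9 gives the alternative window width raised in step 3, at the cost of skipping four residues at each end.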

Notes

(Wiki) Secondary structure is formally defined by the hydrogen bonds of the biopolymer, as observed in an atomic-resolution structure. In proteins, the secondary structure is defined by patterns of hydrogen bonds between backbone amide groups (sidechain-mainchain and sidechain-sidechain hydrogen bonds are irrelevant), where the DSSP definition of a hydrogen bond is used.

(Wiki) Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure - amino acid or nucleotide sequence, respectively. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein; for nucleic acids, it may be determined from the hydrogen bonding pattern.

The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
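
The Q3 score used in the proposed steps is commonly computed as the percentage of residues whose predicted three-state label (H, E or C) matches the observed one. A minimal sketch, with a hypothetical function name `q3_score` and made-up strings:

```python
def q3_score(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: 6 of 8 positions agree.
score = q3_score("HHHECCCE", "HHHHCCCC")
```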

* G = 3-turn helix (3/10 helix). Min length 3 residues.
* H = 4-turn helix (alpha helix). Min length 4 residues.
* I = 5-turn helix (pi helix). Min length 5 residues.
* T = hydrogen bonded turn (3, 4 or 5 turn)
* E = beta sheet in parallel and/or anti-parallel sheet conformation (extended strand). Min length 2 residues.
* B = residue in isolated beta-bridge (single pair beta-sheet hydrogen bond formation)
* S = bend (the only non-hydrogen-bond based assignment)

DSSP:
* H: alpha helix
* B: residue in isolated beta-bridge
* E: extended strand, participates in beta ladder
* G: 3-helix (3/10 helix)
* I: 5-helix (pi helix)
* T: hydrogen bonded turn
* S: bend
* .: no assigned structure

(G, H, I) => Helix H
(E, B) => Sheet E
(T, S) => Coil C
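
The reduction above can be written as a small lookup. One assumption goes beyond the notes: residues with no assigned structure ('.', blank or any other unlisted code) are treated as coil here. The names `REDUCE` and `to_three_state` are hypothetical.

```python
# Eight DSSP codes reduced to three states, following the mapping above.
REDUCE = {
    "G": "H", "H": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C",             # turns and bends
}


def to_three_state(dssp, default="C"):
    """Convert a DSSP string to H/E/C.

    Unlisted codes (e.g. '.' or ' ') fall back to `default`,
    which is an assumption of this sketch.
    """
    return "".join(REDUCE.get(code, default) for code in dssp)


reduced = to_three_state("GHIEBTS. ")
```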

All-alpha: consist almost entirely (at least 90%) of alpha-helices
All-beta: composed mostly of beta-sheets (at least 90%) in their secondary structure
Alpha/beta: alternating, mainly parallel segments of alpha-helices and beta-sheets
Alpha+beta: mixture of all-alpha and all-beta regions, mostly in an anti-parallel fashion
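
In SCOP these four classes carry the letters a, b, c and d, which appear as the first component of each domain's classification string (e.g. `a.1.1.1`). The sketch below groups PDB identifiers by class from dir.cla.scop.txt-style lines. The column layout (domain id, PDB id, region, classification string, ...) and the '#' comment prefix are assumptions about the file, and the sample lines and identifiers are made up.

```python
from collections import defaultdict

CLASS_NAMES = {
    "a": "All-alpha",
    "b": "All-beta",
    "c": "Alpha/beta",
    "d": "Alpha+beta",
}


def pdb_ids_by_class(lines):
    """Group PDB ids by the four classes; other classes are ignored."""
    groups = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a classification record
        pdb_id, sccs = fields[1], fields[3]
        name = CLASS_NAMES.get(sccs.split(".")[0])
        if name is not None:
            groups[name].add(pdb_id)
    return dict(groups)


# Hypothetical records in the assumed layout.
SAMPLE = [
    "# comment line\n",
    "d0xyza_\t0xyz\tA:\ta.1.1.1\t1\tcl=1,cf=2\n",
    "d0abcb_\t0abc\tB:\tb.2.1.1\t2\tcl=3,cf=4\n",
    "d0defa_\t0def\tA:\te.1.1.1\t3\tcl=5,cf=6\n",
]
by_class = pdb_ids_by_class(SAMPLE)
```

Using sets means a protein with several domains in the same class is listed once.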

Reading

Fadime Üney Yüksektepe, Özlem Yılmaz and Metin Türkay, Prediction of secondary structures of proteins using a two-stage method, Computers & Chemical Engineering, Volume 32, Issues 1-2, January 2008, Pages 78-88
DSSP Program. Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers 22 (12): 2577-637.
Üney, F., & Türkay, M. (2006). A mixed-integer programming approach to multi-class data classification problem. European Journal of Operational Research, 173(3), 910–920.
I have started reading papers in hard copy, so I won't list those here.
Türkay, M., Üney, F., & Yılmaz, O. (2005). Prediction of folding type of proteins using mixed-integer linear programming. In L. Puigjaner & A. Espuna (Eds.), Computer-aided chemical engineering: ESCAPE-15 (pp. 523–528). Amsterdam: Elsevier.
Baker, D., & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294(5540), 93–96. (.pdf file)
Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402.
Bradley, P., Chivian, D., Meiler, J., Misura, K. M., Rohl, C., Schief, W., et al. (2003). Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins: Structure, Function, and Genetics, 53, 457–468. (.pdf file)