RT-RICO - Identifying Non-Independent Patterns in Protein Motif Sequence Data
Something new ...
For proteins with NO homologous protein with known structure in PDB.
:: www.pdb.org => Advanced Search => Sequence (Blast/Fasta/PSI_BLAST); Blast => Sequence search (E Cut Off 10.0 or 30.0, or use Fasta, depending on the output; an algorithm is needed for this choice)
(Fasta search may return different results; for some proteins Fasta has to be used because Blast returns nothing)
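One possible shape for the "algorithm needed" above, as a minimal sketch: try Blast at each E cutoff, then fall back to Fasta. The function name `find_homologs` and the two search callables are hypothetical stand-ins (any function taking a sequence and an E cutoff and returning a list of hit IDs); nothing here calls the real PDB search service.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical search interface: (sequence, e_cutoff) -> list of hit IDs.
SearchFn = Callable[[str, float], List[str]]


def find_homologs(
    sequence: str,
    blast: SearchFn,
    fasta: SearchFn,
    e_cutoffs: Sequence[float] = (10.0, 30.0),
) -> Tuple[str, float, List[str]]:
    """Try Blast at each E cutoff, then Fasta; return (method, e_cutoff, hits)."""
    for method, search in (("blast", blast), ("fasta", fasta)):
        for e_cutoff in e_cutoffs:
            hits = search(sequence, e_cutoff)
            if hits:
                return method, e_cutoff, hits
    # No hits from either method: a candidate "no known homolog" protein.
    return "none", e_cutoffs[-1], []
```

A protein for which this returns `"none"` would be a candidate for the "no homologous protein with known structure" set, assuming the two callables really query PDB.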
1. SCOP: Structural Classification of Proteins
2. SCOP parseable files follow the structure of the SCOP hierarchy
3. Get data from SCOP
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/nar2002.pdf
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/nar2007.pdf
--- http://scop.mrc-lmb.cam.ac.uk/scop/ref/1995-jmb-scop.pdf
4. Pfam database of protein families and HMMs - it has some good links to other databases
5. An MSD example
6. Beginning Perl for Bioinformatics: book exercises & examples, PDB file format (secondary structure) explanation. PDB downloads: ss.txt.
0. Do steps 1 to 5 manually to prove it can be done before coding.
1. Parse the SCOP file, get protein names divided into the 4 structural classes. (dir.cla.scop.txt_1.xx)
2. Use step 1 results to parse PDB or NCBI protein data file, get amino acid sequence and corresponding secondary structure sequence.
3. Use (part of) step 2 results to build data file ready for RT-RICO, e.g. (5 amino acids + 1 secondary structure ) x 3 secondary structures + decide the format of the data file (e.g. is it suitable to use 5 amino acids? or 9 amino acids?)
4. Use step 3 results, modify RT-RICO code, generate rules e.g. from 5 amino acids -> 1 secondary structure (slightly different for beginning and end). Use the rules generated to predict secondary structure (probability).
5. Use (part of) step 2 results to generate Q3 score, use step 3 results & newly developed algorithm to generate Q3 score
6. If step 5 works, write a paper. (And it works !!! Yeah !!!)
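Steps 1, 3 and 5 can be sketched roughly as follows. This is only an illustration, not the RT-RICO code: the function names are made up, the classification file is assumed to be whitespace-separated with the PDB ID in the second column and the class code (a, b, c, d, ...) leading the fourth column, and each window's label is assumed to be the secondary structure of the centre residue. Windows at the beginning and end of the sequence, which need separate handling, are skipped here.

```python
from collections import defaultdict

# Assumption: first letter of the SCOP class code maps to the four classes.
SCOP_CLASSES = {"a": "all-alpha", "b": "all-beta",
                "c": "alpha/beta", "d": "alpha+beta"}


def parse_scop_cla(lines):
    """Step 1: group PDB IDs by structural class from classification lines."""
    by_class = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        pdb_id, sccs = fields[1], fields[3]
        cls = SCOP_CLASSES.get(sccs[0])
        if cls is not None:  # ignore classes other than the four above
            by_class[cls].add(pdb_id)
    return by_class


def make_windows(aa_seq, ss_seq, width=5):
    """Step 3: (amino-acid window, structure of centre residue) pairs."""
    if len(aa_seq) != len(ss_seq):
        raise ValueError("sequence and structure lengths differ")
    half = width // 2
    return [(aa_seq[i - half:i + half + 1], ss_seq[i])
            for i in range(half, len(aa_seq) - half)]


def q3_score(predicted, observed):
    """Step 5: percentage of residues with the correct 3-state label."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

Changing `width` to 9 gives the alternative window size mentioned in step 3, so both formats can be compared with the same Q3 function.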
- (Wiki) Secondary structure is formally defined by the hydrogen bonds of the biopolymer, as observed in an atomic-resolution structure. In proteins, the secondary structure is defined by patterns of hydrogen bonds between backbone amide groups (sidechain-mainchain and sidechain-sidechain hydrogen bonds are irrelevant), where the DSSP definition of a hydrogen bond is used.
- The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
- * G = 3-turn helix (3/10 helix). Min length 3 residues.
* H = 4-turn helix (alpha helix). Min length 4 residues.
* I = 5-turn helix (pi helix). Min length 5 residues.
* T = hydrogen bonded turn (3, 4 or 5 turn)
* E = beta sheet in parallel and/or anti-parallel sheet conformation (extended strand). Min length 2 residues.
* B = residue in isolated beta-bridge (single pair beta-sheet hydrogen bond formation)
* S = bend (the only non-hydrogen-bond based assignment)
- DSSP:
* H: alpha helix
* B: residue in isolated beta-bridge
* E: extended strand, participates in beta ladder
* G: 3-helix (3/10 helix)
* I: 5-helix (pi helix)
* T: hydrogen bonded turn
* S: bend
* .: no assigned structure
- (G, H, I) => Helix H
(E, B) => Sheet E
(T, S) => Coil C
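The reduction above is a simple lookup. A minimal sketch follows; the names `EIGHT_TO_THREE` and `reduce_to_three_states` are made up here, and mapping unassigned residues (`.` or anything else not listed) to coil is an assumption, since the notes do not say what happens to them.

```python
# DSSP code -> 3-state class, exactly as listed in the notes.
EIGHT_TO_THREE = {
    "G": "H", "H": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # sheets
    "T": "C", "S": "C",            # turns and bends -> coil
}


def reduce_to_three_states(dssp, default="C"):
    """Map a DSSP string to H/E/C; unlisted codes fall back to `default`."""
    # Assumption: unassigned residues ('.') are treated as coil.
    return "".join(EIGHT_TO_THREE.get(code, default) for code in dssp)
```

For example, `reduce_to_three_states("GHITEBS.")` gives `"HHHCEECC"`.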
- All-alpha: consist almost entirely (at least 90%) of alpha-helices
All-beta: composed mostly of beta-sheets (at least 90%) in their secondary structure
Alpha/beta: alternating, mainly parallel segments of alpha-helices and beta-sheets
Alpha+beta: mixture of all-alpha and all-beta regions, mostly in an anti-parallel fashion
- Fadime Üney Yüksektepe, Özlem Yılmaz and Metin Türkay, Prediction of secondary structures of proteins using a two-stage method, Computers & Chemical Engineering, Volume 32, Issues 1-2, January 2008, Pages 78-88
- DSSP Program. Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers 22 (12): 2577-637.
- Üney, F., & Türkay, M. (2006). A mixed-integer programming approach to multi-class data classification problem. European Journal of Operational Research, 173(3), 910–920.
- I have started reading papers in hard copy, so I won't list them here.
- Türkay, M., Üney, F., & Yılmaz, O. (2005). Prediction of folding type of proteins using mixed-integer linear programming. In L. Puigjaner & A. Espuna (Eds.), Computer-aided chemical engineering: ESCAPE-15 (pp. 523–528). Amsterdam: Elsevier.
- Baker, D., & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294(5540), 93–96. (.pdf file)
- Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402.
- Bradley, P., Chivian, D., Meiler, J., Misura, K. M., Rohl, C., Schief, W., et al. (2003). Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins: Structure, Function, and Genetics, 53, 457–468. (.pdf file)
Other Readings (Mainly for Data)