PRICO - Identify Protein Domains and Motifs

Useful Links


Initial Test Data on RICO

Our first test applied the RICO algorithm to a set of ten sequences, each consisting of the first five amino acids of an alpha-helix.  The alpha-helices were all taken from human RNA-binding proteins; the sequences and their first five amino acids were determined using the Protein Data Bank (www.pdb.org) and the European Bioinformatics Institute (www.ebi.ac.uk).  Five amino acids were used from each helix because a single turn of an alpha-helix spans approximately 4 amino acids (about 3.6 in an ideal helix), so a window of five covers just over one full turn and is likely to show some relationship between its residues.  Unlike in later tests, however, these sequences were not taken from alpha-helices of the same length.
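The reasoning behind the five-residue window can be illustrated with a short sketch. It assumes the textbook figure of 3.6 residues per turn for an ideal alpha-helix; the function names and the example sequence are hypothetical and are not taken from the test data.

```python
# Sketch only: the helper names and the example helix are made up for illustration.
RESIDUES_PER_TURN = 3.6  # ideal alpha-helix; the text rounds this to about 4


def first_window(helix: str, size: int = 5) -> str:
    """Return the first `size` residues of a helix sequence."""
    if len(helix) < size:
        raise ValueError("helix is shorter than the requested window")
    return helix[:size]


def angular_offset(i: int, j: int) -> float:
    """Angle in degrees between residues i and j around the helix axis (0 to 180)."""
    angle = (abs(j - i) * 360.0 / RESIDUES_PER_TURN) % 360.0
    return min(angle, 360.0 - angle)


if __name__ == "__main__":
    window = first_window("AEELLKKA")  # made-up sequence, not from the data set
    print(window)
    # Positions 1 and 5 are four residues apart, so they sit roughly
    # 40 degrees apart, i.e. on nearly the same face of the helix.
    print(round(angular_offset(1, 5), 1))
```

Under this assumption, the first and fifth residues of each window lie close together around the helix axis, which is one reason a relationship between them might be expected.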

Later tests use alpha-helix sequences of the same length.

 

Initial Test Data on C4.5

C4.5 takes a somewhat different approach to generating rules.  It is less strict than RICO and may produce rules that are contradicted elsewhere in the data.  C4.5 first requires a training set, from which it builds a decision tree for determining the value of a specified attribute.  For a preliminary test, we used 30 amino acid sequences of length 5, each converted to its amino acid groups.  The last amino acid in each sequence was designated as the attribute that the decision tree would predict.  This is a much smaller data set than would normally be used: building an effective decision tree for an attribute normally requires thousands of training examples.
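As a rough illustration of how such a training set might be laid out, the sketch below splits a five-position group sequence into four predictor attributes and one class label. It rests on two assumptions that the text does not state: that the group codes are the six letters appearing in the decision tree (L, B, C, S, A, R), and that the class labels +A and -A mean "the 5th position is in group A" and "it is not". The function name is hypothetical.

```python
from typing import List, Tuple

# Assumption: the six group codes are those that appear in the decision tree.
GROUP_CODES = set("LBCSAR")


def to_training_row(groups: str) -> Tuple[List[str], str]:
    """Split a 5-position group sequence into (attributes, class label)."""
    if len(groups) != 5:
        raise ValueError("expected a sequence of exactly 5 group codes")
    if not set(groups) <= GROUP_CODES:
        raise ValueError("unknown group code in sequence")
    attributes = list(groups[:4])  # positions 1-4 are the predictors
    # Assumption: '+A' means the 5th position belongs to group A.
    label = "+A" if groups[4] == "A" else "-A"
    return attributes, label


if __name__ == "__main__":
    print(to_training_row("LBCSA"))
    print(to_training_row("LBCSR"))
```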

The resulting decision tree was then used to predict the 5th amino acid of 14 test sequences.

The training data gave a decision tree that can be paraphrased as follows:

if position 3 = L then class +A
else if position 3 = B then class +A
else if position 3 = C then class +A
else if position 3 = S then class -A
else if position 3 = A then
          if position 2 = A then class -A
          else if position 2 = L then class -A
          else if position 2 = B then class -A
          else if position 2 = C then class +A
          else if position 2 = R then class -A
          else if position 2 = S then class -A
else if position 3 = R then
          if position 2 = A then class -A
          else if position 2 = L then class -A
          else if position 2 = B then class -A
          else if position 2 = C then class -A
          else if position 2 = R then class -A
          else if position 2 = S then class +A
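The paraphrased tree above can also be written as a small function. This is a direct transcription of the branches listed, not output from C4.5 itself; the function name is hypothetical, and positions are 1-based in the text but 0-based in the code.

```python
VALID_CODES = set("ALBCRS")  # the group codes that appear in the tree


def classify(groups: str) -> str:
    """Return '+A' or '-A' for a sequence of group codes, following the tree."""
    if len(groups) < 3:
        raise ValueError("need at least three positions")
    pos2, pos3 = groups[1], groups[2]
    if pos2 not in VALID_CODES or pos3 not in VALID_CODES:
        raise ValueError("unknown group code")
    # Position 3 is tested first.
    if pos3 in ("L", "B", "C"):
        return "+A"
    if pos3 == "S":
        return "-A"
    # For A and R at position 3, the tree goes on to test position 2.
    if pos3 == "A":
        return "+A" if pos2 == "C" else "-A"
    # The only remaining case is pos3 == "R".
    return "+A" if pos2 == "S" else "-A"


if __name__ == "__main__":
    for seq in ("AALA", "ALSA", "LCAL", "LSRL", "AAAA"):
        print(seq, classify(seq))
```

Written this way, it is easy to see that only two of the twelve position-2 branches predict +A: C at position 2 when position 3 is A, and S at position 2 when position 3 is R.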

The decision tree was then used to derive the set of production rules (and evaluation of the rules):

Running PRICO on the UMR nic cluster server

Instructions to submit PRICO jobs on nic cluster

Link to nic cluster website to monitor jobs submitted

 

RICO Test on "helical region" sequences from Human proteins.

C4.5 Test on "helical region" sequences from Human proteins.