Carlos-Francisco Méndez-Cruz

Classification binding thrombin data set

The test set consists of 634 data points, each of which represents
a molecule that is either active (A) or inactive (I). The test set
has the same format as the training set, with the exception that the
activity value (A or I) for each data point is missing, that is, has
been replaced by a question mark (?). Please submit one prediction,
A or I, for each data point. Your submission should be in the form
of a file that starts with your contact information, followed by a
line with 5 asterisks, followed immediately by your predictions, with
one line per data point. The predictions should be in the same order
as the test set data points. So your prediction for the first example
should appear on the first line after the asterisks, your prediction
for the second example should appear on the second line after the
asterisks, etc. Hence, after your contact information, the prediction
file will consist of 635 lines and have the form:
*****
I
I
A
I
A
I
etc.
You may submit your prediction by email to page@biostat.wisc.edu
or by anonymous ftp to ftp.biostat.wisc.edu, placing the file
into the directory dropboxes/page/. If using email, please use
the subject line "KDDcup <name> thrombin" where <name> is your
name. If using ftp, please name the file KDDcup.<name>.thrombin
where <name> is your name. For example, my submission would be
named KDDcup.DavidPage.thrombin
Only one submission per person per task is permitted. If you do not
receive email confirmation of your submission within 24 hours, please
email page@biostat.wisc.edu with subject "KDDcup no confirmation".
For group entries, the contact information should include the names
of everyone to be credited as a member of the group should your entry
achieve the highest score. But no person is to be listed on more than
one entry per task.
Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
--------------------------------------------------------------------------
Drugs are typically small organic molecules that achieve their desired
activity by binding to a target site on a receptor. The first step in
the discovery of a new drug is usually to identify and isolate the
receptor to which it should bind, followed by testing many small
molecules for their ability to bind to the target site. This leaves
researchers with the task of determining what separates the active
(binding) compounds from the inactive (non-binding) ones. Such a
determination can then be used in the design of new compounds that not
only bind, but also have all the other properties required for a drug
(solubility, oral absorption, lack of side effects, appropriate duration
of action, toxicity, etc.).
The present training data set consists of 1909 compounds tested for
their ability to bind to a target site on thrombin, a key receptor in
blood clotting. The chemical structures of these compounds are not
necessary for our analysis and are not included. Of these compounds, 42
are active (bind well) and the others are inactive. Each compound is
described by a single feature vector comprising a class value (A for
active, I for inactive) and 139,351 binary features, which describe
three-dimensional properties of the molecule. The definitions of the
individual bits are not included - we don't know what each individual
bit means, only that they are generated in an internally consistent
manner for all 1909 compounds. Biological activity in general, and
receptor binding affinity in particular, correlate with various
structural and physical properties of small organic molecules. The task
is to determine which of these properties are critical in this case and
to learn to accurately predict the class value. To simulate the
real-world drug design environment, the test set contains 636 additional
compounds that were in fact generated based on the assay results
recorded for the training set. In evaluating the accuracy, a
differential cost model will be used, so that the sum of the costs of
the actives will be equal to the sum of the costs of the inactives.
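Under this cost model, the small active class (42 of 1909 in training) carries the same total weight as the much larger inactive class, so plain accuracy is a poor guide. One plausible reading of the model is per-class (balanced) accuracy; a minimal sketch under that assumption:

```python
def balanced_accuracy(y_true, y_pred):
    # Average the per-class accuracies so that class A and class I
    # contribute equally to the score, regardless of class imbalance.
    per_class = []
    for c in ('A', 'I'):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)
```

For example, a classifier that labels everything I scores 0.5 here, not 0.98, which reflects the equal-cost intent.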
We thank DuPont Pharmaceuticals for graciously providing this data set
for the KDD Cup 2001 competition. All publications referring to
analysis of this data set should acknowledge DuPont Pharmaceuticals
Research Laboratories and KDD Cup 2001.
I
A
I
I
I
A
I
I
I
A
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
A
A
I
I
I
A
I
I
I
I
A
A
I
A
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
I
A
A
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
A
I
A
I
A
I
A
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
A
I
I
A
A
A
I
I
A
A
I
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
I
I
I
I
A
A
A
I
A
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
A
I
I
I
I
I
A
A
I
A
I
I
I
I
I
I
A
A
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
I
A
A
I
A
A
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
A
I
I
A
A
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
A
A
A
I
I
I
A
A
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
I
A
A
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
A
I
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
A
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
A
I
A
A
I
A
I
I
I
I
I
I
I
A
I
I
I
A
I
A
I
I
I
I
A
I
A
I
# -*- encoding: utf-8 -*-
import os
from time import time
import argparse
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
    classification_report
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
from scipy.sparse import csr_matrix
__author__ = 'CMendezC'
# Goal: training and testing binding thrombin data set
# Parameters:
# 1) --inputPath Path to read input files.
# 2) --inputTrainingData File to read training data.
# 3) --inputTestingData File to read testing data.
# 4) --inputTestingClasses File to read testing classes.
# 5) --outputModelPath Path to place output model.
# 6) --outputModelFile File to place output model.
# 7) --outputReportPath Path to place evaluation report.
# 8) --outputReportFile File to place evaluation report.
# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
# 10) --saveData Save matrices
# Output:
# 1) Classification model and evaluation report.
# Execution:
# python training-testing-binding-thrombin.py
# --inputPath /home/binding-thrombin-dataset
# --inputTrainingData thrombin.data
# --inputTestingData Thrombin.testset
# --inputTestingClasses Thrombin.testset.class
# --outputModelPath /home/binding-thrombin-dataset/models
# --outputModelFile SVM-model.mod
# --outputReportPath /home/binding-thrombin-dataset/reports
# --outputReportFile SVM.txt
# --classifier SVM
# --saveData
# source activate python3
# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData
###########################################################
# MAIN PROGRAM #
###########################################################
if __name__ == "__main__":
    # Parameter definition
    parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.')
    parser.add_argument("--inputPath", dest="inputPath",
                        help="Path to read input files", metavar="PATH")
    parser.add_argument("--inputTrainingData", dest="inputTrainingData",
                        help="File to read training data", metavar="FILE")
    parser.add_argument("--inputTestingData", dest="inputTestingData",
                        help="File to read testing data", metavar="FILE")
    parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
                        help="File to read testing classes", metavar="FILE")
    parser.add_argument("--outputModelPath", dest="outputModelPath",
                        help="Path to place output model", metavar="PATH")
    parser.add_argument("--outputModelFile", dest="outputModelFile",
                        help="File to place output model", metavar="FILE")
    parser.add_argument("--outputReportPath", dest="outputReportPath",
                        help="Path to place evaluation report", metavar="PATH")
    parser.add_argument("--outputReportFile", dest="outputReportFile",
                        help="File to place evaluation report", metavar="FILE")
    parser.add_argument("--classifier", dest="classifier",
                        help="Classifier", metavar="NAME",
                        choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
    parser.add_argument("--saveData", dest="saveData", action='store_true',
                        help="Save matrices")
    args = parser.parse_args()

    # Printing parameter values
    print('-------------------------------- PARAMETERS --------------------------------')
    print("Path to read input files: " + str(args.inputPath))
    print("File to read training data: " + str(args.inputTrainingData))
    print("File to read testing data: " + str(args.inputTestingData))
    print("File to read testing classes: " + str(args.inputTestingClasses))
    print("Path to place output model: " + str(args.outputModelPath))
    print("File to place output model: " + str(args.outputModelFile))
    print("Path to place evaluation report: " + str(args.outputReportPath))
    print("File to place evaluation report: " + str(args.outputReportFile))
    print("Classifier: " + str(args.classifier))
    print("Save matrices: " + str(args.saveData))

    # Start time
    t0 = time()

    print("Reading training data and true classes...")
    X_train = None
    if args.saveData:
        y_train = []
        trainingData = []
        with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                y_train.append(listLine[0])
                trainingData.append(listLine[1:])
        # X_train = np.matrix(trainingData)
        X_train = csr_matrix(trainingData, dtype='double')
        print("  Saving matrix and classes...")
        joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print("  Done!")
    else:
        print("  Loading matrix and classes...")
        X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print("  Done!")
    print("  Number of training classes: {}".format(len(y_train)))
    print("  Number of training class A: {}".format(y_train.count('A')))
    print("  Number of training class I: {}".format(y_train.count('I')))
    print("  Shape of training matrix: {}".format(X_train.shape))

    print("Reading testing data and true classes...")
    X_test = None
    if args.saveData:
        y_test = []
        testingData = []
        with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                testingData.append(listLine[1:])
        X_test = csr_matrix(testingData, dtype='double')
        with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                y_test.append(line)
        print("  Saving matrix and classes...")
        joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print("  Done!")
    else:
        print("  Loading matrix and classes...")
        X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print("  Done!")
    print("  Number of testing classes: {}".format(len(y_test)))
    print("  Number of testing class A: {}".format(y_test.count('A')))
    print("  Number of testing class I: {}".format(y_test.count('I')))
    print("  Shape of testing matrix: {}".format(X_test.shape))

    if args.classifier == "BernoulliNB":
        classifier = BernoulliNB()
    elif args.classifier == "SVM":
        classifier = SVC()
    elif args.classifier == "kNN":
        classifier = KNeighborsClassifier()
    else:
        print("Bad classifier")
        exit()

    print("Training...")
    classifier.fit(X_train, y_train)
    print("  Done!")

    print("Testing (prediction in new data)...")
    y_pred = classifier.predict(X_test)
    print("  Done!")

    print("Saving report...")
    with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
        oFile.write('********** EVALUATION REPORT **********\n')
        oFile.write('Classifier: {}\n'.format(args.classifier))
        oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
        oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
        oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
        oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
        oFile.write('Confusion matrix: \n')
        oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
        oFile.write('Classification report: \n')
        oFile.write(classification_report(y_test, y_pred) + '\n')
    print("  Done!")

    print("Training and testing done in: %fs" % (time() - t0))