Carlos-Francisco Méndez-Cruz

Classification binding thrombin data set

The test set consists of 634 data points, each of which represents
a molecule that is either active (A) or inactive (I). The test set
has the same format as the training set, with the exception that the
activity value (A or I) for each data point is missing, that is, has
been replaced by a question mark (?). Please submit one prediction,
A or I, for each data point. Your submission should be in the form
of a file that starts with your contact information, followed by a
line with 5 asterisks, followed immediately by your predictions, with
one line per data point. The predictions should be in the same order
as the test set data points. So your prediction for the first example
should appear on the first line after the asterisks, your prediction
for the second example should appear on the second line after the
asterisks, etc. Hence, after your contact information, the prediction
file will consist of 635 lines and have the form:
*****
I
I
A
I
A
I
etc.
You may submit your prediction by email to page@biostat.wisc.edu
or by anonymous ftp to ftp.biostat.wisc.edu, placing the file
into the directory dropboxes/page/. If using email, please use
the subject line "KDDcup <name> thrombin" where <name> is your
name. If using ftp, please name the file KDDcup.<name>.thrombin
where <name> is your name. For example, my submission would be
named KDDcup.DavidPage.thrombin
Only one submission per person per task is permitted. If you do not
receive email confirmation of your submission within 24 hours, please
email page@biostat.wisc.edu with subject "KDDcup no confirmation".
For group entries, the contact information should include the names
of everyone to be credited as a member of the group should your entry
achieve the highest score. But no person is to be listed on more than
one entry per task.
Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
--------------------------------------------------------------------------
Drugs are typically small organic molecules that achieve their desired
activity by binding to a target site on a receptor. The first step in
the discovery of a new drug is usually to identify and isolate the
receptor to which it should bind, followed by testing many small
molecules for their ability to bind to the target site. This leaves
researchers with the task of determining what separates the active
(binding) compounds from the inactive (non-binding) ones. Such a
determination can then be used in the design of new compounds that not
only bind, but also have all the other properties required for a drug
(solubility, oral absorption, lack of side effects, appropriate duration
of action, toxicity, etc.).
The present training data set consists of 1909 compounds tested for
their ability to bind to a target site on thrombin, a key receptor in
blood clotting. The chemical structures of these compounds are not
necessary for our analysis and are not included. Of these compounds, 42
are active (bind well) and the others are inactive. Each compound is
described by a single feature vector comprising a class value (A for
active, I for inactive) and 139,351 binary features, which describe
three-dimensional properties of the molecule. The definitions of the
individual bits are not included - we don't know what each individual
bit means, only that they are generated in an internally consistent
manner for all 1909 compounds. Biological activity in general, and
receptor binding affinity in particular, correlate with various
structural and physical properties of small organic molecules. The task
is to determine which of these properties are critical in this case and
to learn to accurately predict the class value. To simulate the
real-world drug design environment, the test set contains 636 additional
compounds that were in fact generated based on the assay results
recorded for the training set. In evaluating the accuracy, a
differential cost model will be used, so that the sum of the costs of
the actives will be equal to the sum of the costs of the inactives.
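Under this cost model, the small active class (42 of 1909 in training) carries the same total weight as the much larger inactive class, so plain accuracy is a poor guide. One plausible reading of the model is per-class (balanced) accuracy; a minimal sketch under that assumption:

```python
def balanced_accuracy(y_true, y_pred):
    # Average the per-class accuracies so that class A and class I
    # contribute equally to the score, regardless of class imbalance.
    per_class = []
    for c in ('A', 'I'):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)
```

For example, a classifier that labels everything I scores 0.5 here, not 0.98, which reflects the equal-cost intent.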
We thank DuPont Pharmaceuticals for graciously providing this data set
for the KDD Cup 2001 competition. All publications referring to
analysis of this data set should acknowledge DuPont Pharmaceuticals
Research Laboratories and KDD Cup 2001.
I
A
I
I
I
A
I
I
I
A
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
A
A
I
I
I
A
I
I
I
I
A
A
I
A
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
I
A
A
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
A
I
A
I
A
I
A
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
A
I
I
A
A
A
I
I
A
A
I
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
I
I
I
I
A
A
A
I
A
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
A
I
I
I
I
I
A
A
I
A
I
I
I
I
I
I
A
A
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
I
A
A
I
A
A
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
A
I
I
A
A
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
A
A
A
I
I
I
A
A
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
I
A
A
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
A
I
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
A
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
A
I
A
A
I
A
I
I
I
I
I
I
I
A
I
I
I
A
I
A
I
I
I
I
A
I
A
I
# -*- encoding: utf-8 -*-
import os
from time import time
import argparse
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
    classification_report
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
from scipy.sparse import csr_matrix
__author__ = 'CMendezC'
# Goal: training and testing binding thrombin data set
# Parameters:
# 1) --inputPath Path to read input files.
# 2) --inputTrainingData File to read training data.
# 3) --inputTestingData File to read testing data.
# 4) --inputTestingClasses File to read testing classes.
# 5) --outputModelPath Path to place output model.
# 6) --outputModelFile File to place output model.
# 7) --outputReportPath Path to place evaluation report.
# 8) --outputReportFile File to place evaluation report.
# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
# 10) --saveData Save matrices
# Output:
# 1) Classification model and evaluation report.
# Execution:
# python training-testing-binding-thrombin.py
# --inputPath /home/binding-thrombin-dataset
# --inputTrainingData thrombin.data
# --inputTestingData Thrombin.testset
# --inputTestingClasses Thrombin.testset.class
# --outputModelPath /home/binding-thrombin-dataset/models
# --outputModelFile SVM-model.mod
# --outputReportPath /home/binding-thrombin-dataset/reports
# --outputReportFile SVM.txt
# --classifier SVM
# --saveData
# source activate python3
# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData
###########################################################
# MAIN PROGRAM #
###########################################################
if __name__ == "__main__":
    # Parameter definition
    parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.')
    parser.add_argument("--inputPath", dest="inputPath",
                        help="Path to read input files", metavar="PATH")
    parser.add_argument("--inputTrainingData", dest="inputTrainingData",
                        help="File to read training data", metavar="FILE")
    parser.add_argument("--inputTestingData", dest="inputTestingData",
                        help="File to read testing data", metavar="FILE")
    parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
                        help="File to read testing classes", metavar="FILE")
    parser.add_argument("--outputModelPath", dest="outputModelPath",
                        help="Path to place output model", metavar="PATH")
    parser.add_argument("--outputModelFile", dest="outputModelFile",
                        help="File to place output model", metavar="FILE")
    parser.add_argument("--outputReportPath", dest="outputReportPath",
                        help="Path to place evaluation report", metavar="PATH")
    parser.add_argument("--outputReportFile", dest="outputReportFile",
                        help="File to place evaluation report", metavar="FILE")
    parser.add_argument("--classifier", dest="classifier",
                        help="Classifier", metavar="NAME",
                        choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
    parser.add_argument("--saveData", dest="saveData", action='store_true',
                        help="Save matrices")
    args = parser.parse_args()

    # Printing parameter values
    print('-------------------------------- PARAMETERS --------------------------------')
    print("Path to read input files: " + str(args.inputPath))
    print("File to read training data: " + str(args.inputTrainingData))
    print("File to read testing data: " + str(args.inputTestingData))
    print("File to read testing classes: " + str(args.inputTestingClasses))
    print("Path to place output model: " + str(args.outputModelPath))
    print("File to place output model: " + str(args.outputModelFile))
    print("Path to place evaluation report: " + str(args.outputReportPath))
    print("File to place evaluation report: " + str(args.outputReportFile))
    print("Classifier: " + str(args.classifier))
    print("Save matrices: " + str(args.saveData))

    # Start time
    t0 = time()

    print("Reading training data and true classes...")
    X_train = None
    if args.saveData:
        y_train = []
        trainingData = []
        with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                y_train.append(listLine[0])
                trainingData.append(listLine[1:])
        # X_train = np.matrix(trainingData)
        X_train = csr_matrix(trainingData, dtype='double')
        print("  Saving matrix and classes...")
        joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print("  Done!")
    else:
        print("  Loading matrix and classes...")
        X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print("  Done!")
    print("  Number of training classes: {}".format(len(y_train)))
    print("  Number of training class A: {}".format(y_train.count('A')))
    print("  Number of training class I: {}".format(y_train.count('I')))
    print("  Shape of training matrix: {}".format(X_train.shape))

    print("Reading testing data and true classes...")
    X_test = None
    if args.saveData:
        y_test = []
        testingData = []
        with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                testingData.append(listLine[1:])
        X_test = csr_matrix(testingData, dtype='double')
        with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                y_test.append(line)
        print("  Saving matrix and classes...")
        joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print("  Done!")
    else:
        print("  Loading matrix and classes...")
        X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print("  Done!")
    print("  Number of testing classes: {}".format(len(y_test)))
    print("  Number of testing class A: {}".format(y_test.count('A')))
    print("  Number of testing class I: {}".format(y_test.count('I')))
    print("  Shape of testing matrix: {}".format(X_test.shape))

    if args.classifier == "BernoulliNB":
        classifier = BernoulliNB()
    elif args.classifier == "SVM":
        classifier = SVC()
    elif args.classifier == "kNN":
        classifier = KNeighborsClassifier()
    else:
        print("Bad classifier")
        exit()

    print("Training...")
    classifier.fit(X_train, y_train)
    print("  Done!")

    print("Testing (prediction in new data)...")
    y_pred = classifier.predict(X_test)
    print("  Done!")

    print("Saving report...")
    with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
        oFile.write('********** EVALUATION REPORT **********\n')
        oFile.write('Classifier: {}\n'.format(args.classifier))
        oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
        oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
        oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
        oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
        oFile.write('Confusion matrix: \n')
        oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
        oFile.write('Classification report: \n')
        oFile.write(classification_report(y_test, y_pred) + '\n')
    print("  Done!")

    print("Training and testing done in: %fs" % (time() - t0))