Carlos-Francisco Méndez-Cruz

Remove files

1 -The test set consists of 634 data points, each of which represents
2 -a molecule that is either active (A) or inactive (I). The test set
3 -has the same format as the training set, with the exception that the
4 -activity value (A or I) for each data point is missing, that is, has
5 -been replaced by a question mark (?). Please submit one prediction,
6 -A or I, for each data point. Your submission should be in the form
7 -of a file that starts with your contact information, followed by a
8 -line with 5 asterisks, followed immediately by your predictions, with
9 -one line per data point. The predictions should be in the same order
10 -as the test set data points. So your prediction for the first example
11 -should appear on the first line after the asterisks, your prediction
12 -for the second example should appear on the second line after the
13 -asterisks, etc. Hence, after your contact information, the prediction
14 -file will consist of 635 lines and have the form:
15 -
16 -*****
17 -I
18 -I
19 -A
20 -I
21 -A
22 -I
23 -
24 -etc.
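The layout above can be sketched in a few lines of Python; the contact line, output file name, and toy predictions below are placeholders for illustration, not part of the specification:

```python
# Sketch: write a prediction file in the required layout --
# contact information, a line of 5 asterisks, then one prediction
# (A or I) per test data point, in test-set order.
predictions = ["I", "I", "A", "I", "A", "I"]  # placeholder predictions

lines = ["Jane Doe <jdoe@example.edu>"]  # contact information (placeholder)
lines.append("*****")
lines.extend(predictions)

# File name follows the ftp convention KDDcup.<name>.thrombin
with open("KDDcup.JaneDoe.thrombin", "w") as oFile:
    oFile.write("\n".join(lines) + "\n")
```

For the real test set, `predictions` would hold one entry per test data point, giving 635 lines after the contact information.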
25 -
26 -You may submit your prediction by email to page@biostat.wisc.edu
27 -or by anonymous ftp to ftp.biostat.wisc.edu, placing the file
28 -into the directory dropboxes/page/. If using email, please use
29 -the subject line "KDDcup <name> thrombin" where <name> is your
30 -name. If using ftp, please name the file KDDcup.<name>.thrombin
31 -where <name> is your name. For example, my submission would be
32 -named KDDcup.DavidPage.thrombin
33 -
34 -Only one submission per person per task is permitted. If you do not
35 -receive email confirmation of your submission within 24 hours, please
36 -email page@biostat.wisc.edu with subject "KDDcup no confirmation".
37 -
38 -For group entries, the contact information should include the names
39 -of everyone to be credited as a member of the group should your entry
40 -achieve the highest score. But no person is to be listed on more than
41 -one entry per task.
1 -Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
2 ---------------------------------------------------------------------------
3 -
4 -Drugs are typically small organic molecules that achieve their desired
5 -activity by binding to a target site on a receptor. The first step in
6 -the discovery of a new drug is usually to identify and isolate the
7 -receptor to which it should bind, followed by testing many small
8 -molecules for their ability to bind to the target site. This leaves
9 -researchers with the task of determining what separates the active
10 -(binding) compounds from the inactive (non-binding) ones. Such a
11 -determination can then be used in the design of new compounds that not
12 -only bind, but also have all the other properties required for a drug
13 -(solubility, oral absorption, lack of side effects, appropriate duration
14 -of action, low toxicity, etc.).
15 -
16 -The present training data set consists of 1909 compounds tested for
17 -their ability to bind to a target site on thrombin, a key receptor in
18 -blood clotting. The chemical structures of these compounds are not
19 -necessary for our analysis and are not included. Of these compounds, 42
20 -are active (bind well) and the others are inactive. Each compound is
21 -described by a single feature vector composed of a class value (A for
22 -active, I for inactive) and 139,351 binary features, which describe
23 -three-dimensional properties of the molecule. The definitions of the
24 -individual bits are not included - we don't know what each individual
25 -bit means, only that they are generated in an internally consistent
26 -manner for all 1909 compounds. Biological activity in general, and
27 -receptor binding affinity in particular, correlate with various
28 -structural and physical properties of small organic molecules. The task
29 -is to determine which of these properties are critical in this case and
30 -to learn to accurately predict the class value. To simulate the
31 -real-world drug design environment, the test set contains 636 additional
32 -compounds that were in fact generated based on the assay results
33 -recorded for the training set. In evaluating the accuracy, a
34 -differential cost model will be used, so that the sum of the costs of
35 -the actives will be equal to the sum of the costs of the inactives.
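One way to read this differential cost model is as balanced accuracy: if the total cost of the actives equals the total cost of the inactives, each class contributes equally to the score regardless of how many compounds it contains. A minimal sketch under that assumption (the toy labels are illustrative):

```python
# Sketch: score predictions so that each class (A, I) carries equal
# total cost, i.e. the unweighted mean of the per-class accuracies.
def balanced_accuracy(y_true, y_pred):
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

y_true = ["A", "A", "I", "I", "I", "I"]
y_pred = ["A", "I", "I", "I", "I", "I"]
# actives 1/2 correct, inactives 4/4 correct -> (0.5 + 1.0) / 2 = 0.75
```

Under such a model, predicting I for every compound scores only 0.5, even though it is correct for the large majority of the data points.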
36 -
37 -We thank DuPont Pharmaceuticals for graciously providing this data set
38 -for the KDD Cup 2001 competition. All publications referring to
39 -analysis of this data set should acknowledge DuPont Pharmaceuticals
40 -Research Laboratories and KDD Cup 2001.
1 -I
2 -A
3 -I
4 -I
5 -I
6 -A
7 -I
8 -I
9 -I
10 -A
11 -I
12 -I
13 -I
14 -A
15 -I
16 -A
17 -I
18 -I
19 -I
20 -I
21 -I
22 -I
23 -I
24 -I
25 -I
26 -I
27 -I
28 -A
29 -I
30 -A
31 -I
32 -I
33 -I
34 -I
35 -I
36 -A
37 -I
38 -I
39 -I
40 -A
41 -I
42 -I
43 -I
44 -I
45 -I
46 -I
47 -I
48 -I
49 -A
50 -A
51 -I
52 -I
53 -I
54 -I
55 -I
56 -A
57 -I
58 -A
59 -A
60 -I
61 -I
62 -I
63 -A
64 -I
65 -I
66 -I
67 -I
68 -A
69 -A
70 -I
71 -A
72 -I
73 -I
74 -A
75 -A
76 -I
77 -I
78 -I
79 -I
80 -I
81 -I
82 -I
83 -I
84 -I
85 -I
86 -I
87 -I
88 -I
89 -I
90 -I
91 -A
92 -I
93 -I
94 -A
95 -I
96 -A
97 -I
98 -I
99 -I
100 -A
101 -A
102 -I
103 -I
104 -I
105 -I
106 -I
107 -I
108 -A
109 -I
110 -I
111 -I
112 -I
113 -A
114 -A
115 -I
116 -I
117 -I
118 -I
119 -I
120 -I
121 -I
122 -I
123 -A
124 -A
125 -I
126 -A
127 -A
128 -I
129 -I
130 -I
131 -I
132 -I
133 -I
134 -I
135 -I
136 -I
137 -I
138 -A
139 -I
140 -I
141 -I
142 -I
143 -I
144 -I
145 -I
146 -I
147 -I
148 -A
149 -I
150 -I
151 -I
152 -I
153 -A
154 -I
155 -I
156 -I
157 -I
158 -I
159 -I
160 -I
161 -A
162 -I
163 -I
164 -A
165 -I
166 -A
167 -I
168 -I
169 -A
170 -I
171 -A
172 -I
173 -A
174 -I
175 -A
176 -I
177 -I
178 -I
179 -I
180 -I
181 -A
182 -I
183 -I
184 -A
185 -I
186 -I
187 -A
188 -I
189 -I
190 -I
191 -A
192 -I
193 -A
194 -I
195 -I
196 -A
197 -I
198 -I
199 -I
200 -I
201 -A
202 -I
203 -A
204 -I
205 -I
206 -I
207 -I
208 -I
209 -I
210 -I
211 -I
212 -I
213 -I
214 -I
215 -A
216 -I
217 -A
218 -I
219 -I
220 -I
221 -I
222 -I
223 -I
224 -A
225 -I
226 -I
227 -A
228 -A
229 -A
230 -I
231 -I
232 -A
233 -A
234 -I
235 -I
236 -I
237 -I
238 -A
239 -I
240 -I
241 -I
242 -I
243 -A
244 -I
245 -A
246 -I
247 -I
248 -I
249 -I
250 -I
251 -I
252 -I
253 -A
254 -A
255 -I
256 -I
257 -I
258 -I
259 -I
260 -I
261 -I
262 -I
263 -A
264 -A
265 -I
266 -I
267 -I
268 -I
269 -I
270 -I
271 -A
272 -A
273 -I
274 -I
275 -I
276 -I
277 -I
278 -I
279 -A
280 -I
281 -A
282 -I
283 -I
284 -I
285 -I
286 -I
287 -I
288 -I
289 -I
290 -I
291 -A
292 -I
293 -I
294 -A
295 -I
296 -I
297 -I
298 -I
299 -I
300 -I
301 -A
302 -A
303 -I
304 -I
305 -I
306 -I
307 -I
308 -A
309 -I
310 -I
311 -I
312 -I
313 -I
314 -A
315 -A
316 -A
317 -I
318 -A
319 -I
320 -I
321 -I
322 -I
323 -A
324 -A
325 -I
326 -A
327 -A
328 -I
329 -I
330 -I
331 -I
332 -I
333 -I
334 -I
335 -I
336 -I
337 -I
338 -I
339 -I
340 -A
341 -I
342 -I
343 -I
344 -I
345 -A
346 -A
347 -I
348 -I
349 -A
350 -I
351 -I
352 -I
353 -I
354 -I
355 -A
356 -A
357 -I
358 -A
359 -I
360 -I
361 -I
362 -I
363 -I
364 -I
365 -A
366 -A
367 -I
368 -I
369 -A
370 -I
371 -I
372 -I
373 -I
374 -I
375 -I
376 -I
377 -I
378 -I
379 -I
380 -I
381 -I
382 -A
383 -I
384 -I
385 -A
386 -I
387 -I
388 -A
389 -I
390 -I
391 -I
392 -I
393 -A
394 -A
395 -I
396 -A
397 -A
398 -I
399 -I
400 -A
401 -I
402 -I
403 -I
404 -I
405 -A
406 -I
407 -I
408 -I
409 -I
410 -I
411 -I
412 -I
413 -I
414 -I
415 -I
416 -A
417 -I
418 -I
419 -A
420 -A
421 -I
422 -I
423 -I
424 -A
425 -I
426 -I
427 -I
428 -I
429 -A
430 -I
431 -A
432 -I
433 -I
434 -I
435 -I
436 -I
437 -A
438 -I
439 -I
440 -I
441 -I
442 -I
443 -I
444 -I
445 -I
446 -I
447 -I
448 -I
449 -I
450 -I
451 -I
452 -I
453 -I
454 -I
455 -I
456 -A
457 -A
458 -A
459 -A
460 -I
461 -I
462 -I
463 -A
464 -A
465 -I
466 -I
467 -I
468 -I
469 -I
470 -A
471 -I
472 -A
473 -I
474 -I
475 -I
476 -I
477 -I
478 -A
479 -I
480 -I
481 -I
482 -I
483 -A
484 -A
485 -I
486 -I
487 -I
488 -I
489 -I
490 -I
491 -I
492 -I
493 -I
494 -I
495 -I
496 -I
497 -I
498 -I
499 -I
500 -I
501 -I
502 -A
503 -I
504 -A
505 -I
506 -I
507 -A
508 -I
509 -I
510 -I
511 -I
512 -A
513 -I
514 -I
515 -A
516 -A
517 -I
518 -I
519 -I
520 -A
521 -I
522 -A
523 -I
524 -I
525 -I
526 -I
527 -I
528 -I
529 -I
530 -A
531 -A
532 -I
533 -I
534 -I
535 -A
536 -I
537 -I
538 -I
539 -A
540 -I
541 -I
542 -I
543 -I
544 -I
545 -I
546 -A
547 -I
548 -I
549 -I
550 -I
551 -I
552 -A
553 -I
554 -I
555 -I
556 -I
557 -I
558 -A
559 -I
560 -I
561 -A
562 -I
563 -I
564 -I
565 -I
566 -I
567 -I
568 -A
569 -I
570 -I
571 -I
572 -I
573 -I
574 -I
575 -I
576 -I
577 -I
578 -I
579 -I
580 -I
581 -I
582 -I
583 -I
584 -I
585 -A
586 -I
587 -I
588 -A
589 -I
590 -I
591 -I
592 -I
593 -A
594 -I
595 -I
596 -I
597 -I
598 -I
599 -A
600 -I
601 -I
602 -I
603 -A
604 -I
605 -I
606 -I
607 -A
608 -I
609 -A
610 -A
611 -I
612 -A
613 -I
614 -I
615 -I
616 -I
617 -I
618 -I
619 -I
620 -A
621 -I
622 -I
623 -I
624 -A
625 -I
626 -A
627 -I
628 -I
629 -I
630 -I
631 -A
632 -I
633 -A
634 -I
1 -# -*- encoding: utf-8 -*-
2 -
3 -import os
4 -from time import time
5 -import argparse
6 -from sklearn.naive_bayes import BernoulliNB
7 -from sklearn.svm import SVC
8 -from sklearn.neighbors import KNeighborsClassifier
9 -from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 - classification_report
11 -import joblib  # sklearn.externals.joblib is deprecated; import joblib directly
12 -from sklearn import model_selection
13 -from sklearn.feature_selection import SelectKBest, chi2
14 -from sklearn.decomposition import TruncatedSVD
15 -from scipy.sparse import csr_matrix
16 -import scipy
17 -
18 -__author__ = 'CMendezC'
19 -
20 -# Goal: training, cross-validation, and testing on the binding thrombin data set
21 -
22 -# Parameters:
23 -# 1) --inputPath Path to read input files.
24 -# 2) --inputTrainingData File to read training data.
25 -# 3) --inputTestingData File to read testing data.
26 -# 4) --inputTestingClasses File to read testing classes.
27 -# 5) --outputModelPath Path to place output model.
28 -# 6) --outputModelFile File to place output model.
29 -# 7) --outputReportPath Path to place evaluation report.
30 -# 8) --outputReportFile File to place evaluation report.
31 -# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
32 -# 10) --saveData Save matrices
33 -# 11) --kernel Kernel
34 -# 12) --reduction Feature selection or dimensionality reduction
35 -
36 -# Output:
37 -# 1) Classification model and evaluation report.
38 -
39 -# Execution:
40 -
41 -# python training-crossvalidation-testing-binding-thrombin.py
42 -# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset
43 -# --inputTrainingData thrombin.data
44 -# --inputTestingData Thrombin.testset
45 -# --inputTestingClasses Thrombin.testset.class
46 -# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models
47 -# --outputModelFile SVM-model.mod
48 -# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports
49 -# --outputReportFile SVM.txt
50 -# --classifier SVM
51 -# --saveData
52 -# --kernel linear
53 -# --reduction SVD200
54 -
55 -# source activate python3
56 -# python training-crossvalidation-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-linear-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-linear.txt --classifier SVM --kernel linear
57 -
58 -###########################################################
59 -# MAIN PROGRAM #
60 -###########################################################
61 -
62 -if __name__ == "__main__":
63 - # Parameter definition
64 - parser = argparse.ArgumentParser(description='Training, cross-validation, and testing on the Binding Thrombin Dataset.')
65 - parser.add_argument("--inputPath", dest="inputPath",
66 - help="Path to read input files", metavar="PATH")
67 - parser.add_argument("--inputTrainingData", dest="inputTrainingData",
68 - help="File to read training data", metavar="FILE")
69 - parser.add_argument("--inputTestingData", dest="inputTestingData",
70 - help="File to read testing data", metavar="FILE")
71 - parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
72 - help="File to read testing classes", metavar="FILE")
73 - parser.add_argument("--outputModelPath", dest="outputModelPath",
74 - help="Path to place output model", metavar="PATH")
75 - parser.add_argument("--outputModelFile", dest="outputModelFile",
76 - help="File to place output model", metavar="FILE")
77 - parser.add_argument("--outputReportPath", dest="outputReportPath",
78 - help="Path to place evaluation report", metavar="PATH")
79 - parser.add_argument("--outputReportFile", dest="outputReportFile",
80 - help="File to place evaluation report", metavar="FILE")
81 - parser.add_argument("--classifier", dest="classifier",
82 - help="Classifier", metavar="NAME",
83 - choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
84 - parser.add_argument("--saveData", dest="saveData", action='store_true',
85 - help="Save matrices")
86 - parser.add_argument("--kernel", dest="kernel",
87 - help="Kernel SVM", metavar="NAME",
88 - choices=('linear', 'rbf', 'poly'), default='linear')
89 - parser.add_argument("--reduction", dest="reduction",
90 - help="Feature selection or dimensionality reduction", metavar="NAME",
91 - choices=('SVD200', 'SVD300', 'CHI250', 'CHI2100'), default=None)
92 -
93 - args = parser.parse_args()
94 -
95 - # Printing parameter values
96 - print('-------------------------------- PARAMETERS --------------------------------')
97 - print("Path to read input files: " + str(args.inputPath))
98 - print("File to read training data: " + str(args.inputTrainingData))
99 - print("File to read testing data: " + str(args.inputTestingData))
100 - print("File to read testing classes: " + str(args.inputTestingClasses))
101 - print("Path to place output model: " + str(args.outputModelPath))
102 - print("File to place output model: " + str(args.outputModelFile))
103 - print("Path to place evaluation report: " + str(args.outputReportPath))
104 - print("File to place evaluation report: " + str(args.outputReportFile))
105 - print("Classifier: " + str(args.classifier))
106 - print("Save matrices: " + str(args.saveData))
107 - print("Kernel: " + str(args.kernel))
108 - print("Reduction: " + str(args.reduction))
109 -
110 - # Start time
111 - t0 = time()
112 -
113 - print("Reading training data and true classes...")
114 - X_train = None
115 - if args.saveData:
116 - y_train = []
117 - trainingData = []
118 - with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
119 - as iFile:
120 - for line in iFile:
121 - line = line.strip('\r\n')
122 - listLine = line.split(',')
123 - y_train.append(listLine[0])
124 - trainingData.append(listLine[1:])
125 - # X_train = np.matrix(trainingData)
126 - X_train = csr_matrix(trainingData, dtype='double')
127 - print(" Saving matrix and classes...")
128 - joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
129 - joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
130 - print(" Done!")
131 - else:
132 - print(" Loading matrix and classes...")
133 - X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
134 - y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
135 - print(" Done!")
136 -
137 - print(" Number of training classes: {}".format(len(y_train)))
138 - print(" Number of training class A: {}".format(y_train.count('A')))
139 - print(" Number of training class I: {}".format(y_train.count('I')))
140 - print(" Shape of training matrix: {}".format(X_train.shape))
141 -
142 - print("Reading testing data and true classes...")
143 - X_test = None
144 - if args.saveData:
145 - y_test = []
146 - testingData = []
147 - with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
148 - as iFile:
149 - for line in iFile:
150 - line = line.strip('\r\n')
151 - listLine = line.split(',')
152 - testingData.append(listLine[1:])
153 - X_test = csr_matrix(testingData, dtype='double')
154 - with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
155 - as iFile:
156 - for line in iFile:
157 - line = line.strip('\r\n')
158 - y_test.append(line)
159 - print(" Saving matrix and classes...")
160 - joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
161 - joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
162 - print(" Done!")
163 - else:
164 - print(" Loading matrix and classes...")
165 - X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
166 - y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
167 - print(" Done!")
168 -
169 - print(" Number of testing classes: {}".format(len(y_test)))
170 - print(" Number of testing class A: {}".format(y_test.count('A')))
171 - print(" Number of testing class I: {}".format(y_test.count('I')))
172 - print(" Shape of testing matrix: {}".format(X_test.shape))
173 -
174 - # Feature selection and dimensional reduction
175 - if args.reduction is not None:
176 - print('Performing dimensionality reduction or feature selection...', args.reduction)
177 - if args.reduction == 'SVD200':
178 - reduc = TruncatedSVD(n_components=200, random_state=42)
179 - X_train = reduc.fit_transform(X_train)
180 - elif args.reduction == 'SVD300':
181 - reduc = TruncatedSVD(n_components=300, random_state=42)
182 - X_train = reduc.fit_transform(X_train)
183 - elif args.reduction == 'CHI250':
184 - reduc = SelectKBest(chi2, k=50)
185 - X_train = reduc.fit_transform(X_train, y_train)
186 - elif args.reduction == 'CHI2100':
187 - reduc = SelectKBest(chi2, k=100)
188 - X_train = reduc.fit_transform(X_train, y_train)
189 - print(" Done!")
190 - print(' New shape of training matrix: ', X_train.shape)
191 -
192 - jobs = -1
193 - paramGrid = {}
194 - nIter = 20
195 - crossV = 10
196 - print("Defining randomized grid search...")
197 - if args.classifier == 'SVM':
198 - # SVM
199 - classifier = SVC()
200 - if args.kernel == 'rbf':
201 - paramGrid = {'C': scipy.stats.expon(scale=100),
202 - 'gamma': scipy.stats.expon(scale=.1),
203 - 'kernel': ['rbf'], 'class_weight': ['balanced', None]}
204 - elif args.kernel == 'linear':
205 - paramGrid = {'C': scipy.stats.expon(scale=100),
206 - 'kernel': ['linear'],
207 - 'class_weight': ['balanced', None]}
208 - elif args.kernel == 'poly':
209 - paramGrid = {'C': scipy.stats.expon(scale=100),
210 - 'gamma': scipy.stats.expon(scale=.1), 'degree': [2, 3],
211 - 'kernel': ['poly'], 'class_weight': ['balanced', None]}
212 - myClassifier = model_selection.RandomizedSearchCV(classifier,
213 - paramGrid, n_iter=nIter,
214 - cv=crossV, n_jobs=jobs, verbose=3)
215 - elif args.classifier == 'BernoulliNB':
216 - # BernoulliNB
217 - classifier = BernoulliNB()
218 - paramGrid = {'alpha': scipy.stats.expon(scale=1.0)}
219 - myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=nIter,
220 - cv=crossV, n_jobs=jobs, verbose=3)
221 - # elif args.classifier == 'kNN':
222 - # # kNN
223 - # k_range = list(range(1, 7, 2))
224 - # classifier = KNeighborsClassifier()
225 - # paramGrid = {'n_neighbors': k_range}
226 - # myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=3,
227 - # cv=crossV, n_jobs=jobs, verbose=3)
228 - else:
229 - print("Bad classifier")
230 - exit()
231 - print(" Done!")
232 -
233 - print("Training...")
234 - myClassifier.fit(X_train, y_train)
235 - print(" Done!")
236 -
237 - print("Testing (prediction in new data)...")
238 - if args.reduction is not None:
239 - X_test = reduc.transform(X_test)
240 - y_pred = myClassifier.predict(X_test)
241 - best_parameters = myClassifier.best_estimator_.get_params()
242 - print(" Done!")
243 -
244 - print("Saving report...")
245 - with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
246 - oFile.write('********** EVALUATION REPORT **********\n')
247 - oFile.write('Reduction: {}\n'.format(args.reduction))
248 - oFile.write('Classifier: {}\n'.format(args.classifier))
249 - oFile.write('Kernel: {}\n'.format(args.kernel))
250 - oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
251 - oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
252 - oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
253 - oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
254 - oFile.write('Confusion matrix: \n')
255 - oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
256 - oFile.write('Classification report: \n')
257 - oFile.write(classification_report(y_test, y_pred) + '\n')
258 - oFile.write('Best parameters: \n')
259 - for param in sorted(best_parameters.keys()):
260 - oFile.write("\t%s: %r\n" % (param, best_parameters[param]))
261 - print(" Done!")
262 -
263 - print("Training and testing done in: %fs" % (time() - t0))
1 -# -*- encoding: utf-8 -*-
2 -
3 -import os
4 -from time import time
5 -import argparse
6 -from sklearn.naive_bayes import BernoulliNB
7 -from sklearn.svm import SVC
8 -from sklearn.neighbors import KNeighborsClassifier
9 -from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 - classification_report
11 -import joblib  # sklearn.externals.joblib is deprecated; import joblib directly
12 -from scipy.sparse import csr_matrix
13 -
14 -__author__ = 'CMendezC'
15 -
16 -# Goal: training and testing on the binding thrombin data set
17 -
18 -# Parameters:
19 -# 1) --inputPath Path to read input files.
20 -# 2) --inputTrainingData File to read training data.
21 -# 3) --inputTestingData File to read testing data.
22 -# 4) --inputTestingClasses File to read testing classes.
23 -# 5) --outputModelPath Path to place output model.
24 -# 6) --outputModelFile File to place output model.
25 -# 7) --outputReportPath Path to place evaluation report.
26 -# 8) --outputReportFile File to place evaluation report.
27 -# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
28 -# 10) --saveData Save matrices
29 -
30 -# Output:
31 -# 1) Classification model and evaluation report.
32 -
33 -# Execution:
34 -
35 -# python training-testing-binding-thrombin.py
36 -# --inputPath /home/binding-thrombin-dataset
37 -# --inputTrainingData thrombin.data
38 -# --inputTestingData Thrombin.testset
39 -# --inputTestingClasses Thrombin.testset.class
40 -# --outputModelPath /home/binding-thrombin-dataset/models
41 -# --outputModelFile SVM-model.mod
42 -# --outputReportPath /home/binding-thrombin-dataset/reports
43 -# --outputReportFile SVM.txt
44 -# --classifier SVM
45 -# --saveData
46 -
47 -# source activate python3
48 -# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData
49 -
50 -###########################################################
51 -# MAIN PROGRAM #
52 -###########################################################
53 -
54 -if __name__ == "__main__":
55 - # Parameter definition
56 - parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.')
57 - parser.add_argument("--inputPath", dest="inputPath",
58 - help="Path to read input files", metavar="PATH")
59 - parser.add_argument("--inputTrainingData", dest="inputTrainingData",
60 - help="File to read training data", metavar="FILE")
61 - parser.add_argument("--inputTestingData", dest="inputTestingData",
62 - help="File to read testing data", metavar="FILE")
63 - parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
64 - help="File to read testing classes", metavar="FILE")
65 - parser.add_argument("--outputModelPath", dest="outputModelPath",
66 - help="Path to place output model", metavar="PATH")
67 - parser.add_argument("--outputModelFile", dest="outputModelFile",
68 - help="File to place output model", metavar="FILE")
69 - parser.add_argument("--outputReportPath", dest="outputReportPath",
70 - help="Path to place evaluation report", metavar="PATH")
71 - parser.add_argument("--outputReportFile", dest="outputReportFile",
72 - help="File to place evaluation report", metavar="FILE")
73 - parser.add_argument("--classifier", dest="classifier",
74 - help="Classifier", metavar="NAME",
75 - choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
76 - parser.add_argument("--saveData", dest="saveData", action='store_true',
77 - help="Save matrices")
78 -
79 - args = parser.parse_args()
80 -
81 - # Printing parameter values
82 - print('-------------------------------- PARAMETERS --------------------------------')
83 - print("Path to read input files: " + str(args.inputPath))
84 - print("File to read training data: " + str(args.inputTrainingData))
85 - print("File to read testing data: " + str(args.inputTestingData))
86 - print("File to read testing classes: " + str(args.inputTestingClasses))
87 - print("Path to place output model: " + str(args.outputModelPath))
88 - print("File to place output model: " + str(args.outputModelFile))
89 - print("Path to place evaluation report: " + str(args.outputReportPath))
90 - print("File to place evaluation report: " + str(args.outputReportFile))
91 - print("Classifier: " + str(args.classifier))
92 - print("Save matrices: " + str(args.saveData))
93 -
94 - # Start time
95 - t0 = time()
96 -
97 - print("Reading training data and true classes...")
98 - X_train = None
99 - if args.saveData:
100 - y_train = []
101 - trainingData = []
102 - with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
103 - as iFile:
104 - for line in iFile:
105 - line = line.strip('\r\n')
106 - listLine = line.split(',')
107 - y_train.append(listLine[0])
108 - trainingData.append(listLine[1:])
109 - # X_train = np.matrix(trainingData)
110 - X_train = csr_matrix(trainingData, dtype='double')
111 - print(" Saving matrix and classes...")
112 - joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
113 - joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
114 - print(" Done!")
115 - else:
116 - print(" Loading matrix and classes...")
117 - X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
118 - y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
119 - print(" Done!")
120 -
121 - print(" Number of training classes: {}".format(len(y_train)))
122 - print(" Number of training class A: {}".format(y_train.count('A')))
123 - print(" Number of training class I: {}".format(y_train.count('I')))
124 - print(" Shape of training matrix: {}".format(X_train.shape))
125 -
126 - print("Reading testing data and true classes...")
127 - X_test = None
128 - if args.saveData:
129 - y_test = []
130 - testingData = []
131 - with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
132 - as iFile:
133 - for line in iFile:
134 - line = line.strip('\r\n')
135 - listLine = line.split(',')
136 - testingData.append(listLine[1:])
137 - X_test = csr_matrix(testingData, dtype='double')
138 - with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
139 - as iFile:
140 - for line in iFile:
141 - line = line.strip('\r\n')
142 - y_test.append(line)
143 - print(" Saving matrix and classes...")
144 - joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
145 - joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
146 - print(" Done!")
147 - else:
148 - print(" Loading matrix and classes...")
149 - X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
150 - y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
151 - print(" Done!")
152 -
153 - print(" Number of testing classes: {}".format(len(y_test)))
154 - print(" Number of testing class A: {}".format(y_test.count('A')))
155 - print(" Number of testing class I: {}".format(y_test.count('I')))
156 - print(" Shape of testing matrix: {}".format(X_test.shape))
157 -
158 - if args.classifier == "BernoulliNB":
159 - classifier = BernoulliNB()
160 - elif args.classifier == "SVM":
161 - classifier = SVC()
162 - elif args.classifier == "kNN":
163 - classifier = KNeighborsClassifier()
164 - else:
165 - print("Bad classifier")
166 - exit()
167 -
168 - print("Training...")
169 - classifier.fit(X_train, y_train)
170 - print(" Done!")
171 -
172 - print("Testing (prediction in new data)...")
173 - y_pred = classifier.predict(X_test)
174 - print(" Done!")
175 -
176 - print("Saving report...")
177 - with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
178 - oFile.write('********** EVALUATION REPORT **********\n')
179 - oFile.write('Classifier: {}\n'.format(args.classifier))
180 - oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
181 - oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
182 - oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
183 - oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
184 - oFile.write('Confusion matrix: \n')
185 - oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
186 - oFile.write('Classification report: \n')
187 - oFile.write(classification_report(y_test, y_pred) + '\n')
188 - print(" Done!")
189 -
190 - print("Training and testing done in: %fs" % (time() - t0))