Carlos-Francisco Méndez-Cruz

Classification of the binding thrombin data set

1 +The test set consists of 634 data points, each of which represents
2 +a molecule that is either active (A) or inactive (I). The test set
3 +has the same format as the training set, with the exception that the
4 +activity value (A or I) for each data point is missing, that is, has
5 +been replaced by a question mark (?). Please submit one prediction,
6 +A or I, for each data point. Your submission should be in the form
7 +of a file that starts with your contact information, followed by a
8 +line with 5 asterisks, followed immediately by your predictions, with
9 +one line per data point. The predictions should be in the same order
10 +as the test set data points. So your prediction for the first example
11 +should appear on the first line after the asterisks, your prediction
12 +for the second example should appear on the second line after the
13 +asterisks, etc. Hence, after your contact information, the prediction
14 +file will consist of 635 lines and have the form:
15 +
16 +*****
17 +I
18 +I
19 +A
20 +I
21 +A
22 +I
23 +
24 +etc.
25 +
26 +You may submit your prediction by email to page@biostat.wisc.edu
27 +or by anonymous ftp to ftp.biostat.wisc.edu, placing the file
28 +into the directory dropboxes/page/. If using email, please use
29 +the subject line "KDDcup <name> thrombin" where <name> is your
30 +name. If using ftp, please name the file KDDcup.<name>.thrombin
31 +where <name> is your name. For example, my submission would be
32 +named KDDcup.DavidPage.thrombin.
33 +
34 +Only one submission per person per task is permitted. If you do not
35 +receive email confirmation of your submission within 24 hours, please
36 +email page@biostat.wisc.edu with subject "KDDcup no confirmation".
37 +
38 +For group entries, the contact information should include the names
39 +of everyone to be credited as a member of the group should your entry
40 +achieve the highest score. No person may be listed on more than
41 +one entry per task.
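The submission format above can be sketched in a few lines of Python. This is only an illustration; the file name, contact details, and predictions below are made up:

```python
def write_submission(path, contact_lines, predictions):
    # predictions: sequence of "A"/"I" strings, in test-set order
    assert all(p in ("A", "I") for p in predictions)
    with open(path, "w") as f:
        for line in contact_lines:          # contact information first
            f.write(line + "\n")
        f.write("*****\n")                  # the required line of 5 asterisks
        for p in predictions:               # one prediction per line
            f.write(p + "\n")

# Hypothetical example entrant and predictions:
write_submission("KDDcup.JaneDoe.thrombin",
                 ["Jane Doe", "jdoe@example.edu"],
                 ["I", "I", "A", "I"])
```

A real submission would of course contain one prediction for each of the 634 test data points, in the same order as the test set.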
1 +Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
2 +--------------------------------------------------------------------------
3 +
4 +Drugs are typically small organic molecules that achieve their desired
5 +activity by binding to a target site on a receptor. The first step in
6 +the discovery of a new drug is usually to identify and isolate the
7 +receptor to which it should bind, followed by testing many small
8 +molecules for their ability to bind to the target site. This leaves
9 +researchers with the task of determining what separates the active
10 +(binding) compounds from the inactive (non-binding) ones. Such a
11 +determination can then be used in the design of new compounds that not
12 +only bind, but also have all the other properties required for a drug
13 +(solubility, oral absorption, lack of side effects, appropriate duration
14 +of action, toxicity, etc.).
15 +
16 +The present training data set consists of 1909 compounds tested for
17 +their ability to bind to a target site on thrombin, a key receptor in
18 +blood clotting. The chemical structures of these compounds are not
19 +necessary for our analysis and are not included. Of these compounds, 42
20 +are active (bind well) and the others are inactive. Each compound is
21 +described by a single feature vector comprising a class value (A for
22 +active, I for inactive) and 139,351 binary features, which describe
23 +three-dimensional properties of the molecule. The definitions of the
24 +individual bits are not included - we don't know what each individual
25 +bit means, only that they are generated in an internally consistent
26 +manner for all 1909 compounds. Biological activity in general, and
27 +receptor binding affinity in particular, correlate with various
28 +structural and physical properties of small organic molecules. The task
29 +is to determine which of these properties are critical in this case and
30 +to learn to accurately predict the class value. To simulate the
31 +real-world drug design environment, the test set contains 636 additional
32 +compounds that were in fact generated based on the assay results
33 +recorded for the training set. In evaluating the accuracy, a
34 +differential cost model will be used, so that the sum of the costs of
35 +the actives will be equal to the sum of the costs of the inactives.
36 +
37 +We thank DuPont Pharmaceuticals for graciously providing this data set
38 +for the KDD Cup 2001 competition. All publications referring to
39 +analysis of this data set should acknowledge DuPont Pharmaceuticals
40 +Research Laboratories and KDD Cup 2001.
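The differential cost model described above can be illustrated with a short sketch. The exact weighting used by the judges is not spelled out here, so this assumes the common interpretation in which each class contributes equally to the total, i.e. the score is the average of the per-class accuracies:

```python
def balanced_score(y_true, y_pred):
    # Average of per-class accuracies: errors on the rare actives cost
    # as much in total as errors on the many inactives.
    per_class = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)

y_true = ["A", "A", "I", "I", "I", "I"]
y_pred = ["A", "I", "I", "I", "I", "I"]
print(balanced_score(y_true, y_pred))  # 0.75: (1/2 + 4/4) / 2
```

Under plain accuracy this prediction would score 5/6; the balanced score penalizes the missed active much more heavily, which matters in a set where only 42 of 1909 training compounds are active.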
1 +I
2 +A
3 +I
4 +I
5 +I
6 +A
7 +I
8 +I
9 +I
10 +A
11 +I
12 +I
13 +I
14 +A
15 +I
16 +A
17 +I
18 +I
19 +I
20 +I
21 +I
22 +I
23 +I
24 +I
25 +I
26 +I
27 +I
28 +A
29 +I
30 +A
31 +I
32 +I
33 +I
34 +I
35 +I
36 +A
37 +I
38 +I
39 +I
40 +A
41 +I
42 +I
43 +I
44 +I
45 +I
46 +I
47 +I
48 +I
49 +A
50 +A
51 +I
52 +I
53 +I
54 +I
55 +I
56 +A
57 +I
58 +A
59 +A
60 +I
61 +I
62 +I
63 +A
64 +I
65 +I
66 +I
67 +I
68 +A
69 +A
70 +I
71 +A
72 +I
73 +I
74 +A
75 +A
76 +I
77 +I
78 +I
79 +I
80 +I
81 +I
82 +I
83 +I
84 +I
85 +I
86 +I
87 +I
88 +I
89 +I
90 +I
91 +A
92 +I
93 +I
94 +A
95 +I
96 +A
97 +I
98 +I
99 +I
100 +A
101 +A
102 +I
103 +I
104 +I
105 +I
106 +I
107 +I
108 +A
109 +I
110 +I
111 +I
112 +I
113 +A
114 +A
115 +I
116 +I
117 +I
118 +I
119 +I
120 +I
121 +I
122 +I
123 +A
124 +A
125 +I
126 +A
127 +A
128 +I
129 +I
130 +I
131 +I
132 +I
133 +I
134 +I
135 +I
136 +I
137 +I
138 +A
139 +I
140 +I
141 +I
142 +I
143 +I
144 +I
145 +I
146 +I
147 +I
148 +A
149 +I
150 +I
151 +I
152 +I
153 +A
154 +I
155 +I
156 +I
157 +I
158 +I
159 +I
160 +I
161 +A
162 +I
163 +I
164 +A
165 +I
166 +A
167 +I
168 +I
169 +A
170 +I
171 +A
172 +I
173 +A
174 +I
175 +A
176 +I
177 +I
178 +I
179 +I
180 +I
181 +A
182 +I
183 +I
184 +A
185 +I
186 +I
187 +A
188 +I
189 +I
190 +I
191 +A
192 +I
193 +A
194 +I
195 +I
196 +A
197 +I
198 +I
199 +I
200 +I
201 +A
202 +I
203 +A
204 +I
205 +I
206 +I
207 +I
208 +I
209 +I
210 +I
211 +I
212 +I
213 +I
214 +I
215 +A
216 +I
217 +A
218 +I
219 +I
220 +I
221 +I
222 +I
223 +I
224 +A
225 +I
226 +I
227 +A
228 +A
229 +A
230 +I
231 +I
232 +A
233 +A
234 +I
235 +I
236 +I
237 +I
238 +A
239 +I
240 +I
241 +I
242 +I
243 +A
244 +I
245 +A
246 +I
247 +I
248 +I
249 +I
250 +I
251 +I
252 +I
253 +A
254 +A
255 +I
256 +I
257 +I
258 +I
259 +I
260 +I
261 +I
262 +I
263 +A
264 +A
265 +I
266 +I
267 +I
268 +I
269 +I
270 +I
271 +A
272 +A
273 +I
274 +I
275 +I
276 +I
277 +I
278 +I
279 +A
280 +I
281 +A
282 +I
283 +I
284 +I
285 +I
286 +I
287 +I
288 +I
289 +I
290 +I
291 +A
292 +I
293 +I
294 +A
295 +I
296 +I
297 +I
298 +I
299 +I
300 +I
301 +A
302 +A
303 +I
304 +I
305 +I
306 +I
307 +I
308 +A
309 +I
310 +I
311 +I
312 +I
313 +I
314 +A
315 +A
316 +A
317 +I
318 +A
319 +I
320 +I
321 +I
322 +I
323 +A
324 +A
325 +I
326 +A
327 +A
328 +I
329 +I
330 +I
331 +I
332 +I
333 +I
334 +I
335 +I
336 +I
337 +I
338 +I
339 +I
340 +A
341 +I
342 +I
343 +I
344 +I
345 +A
346 +A
347 +I
348 +I
349 +A
350 +I
351 +I
352 +I
353 +I
354 +I
355 +A
356 +A
357 +I
358 +A
359 +I
360 +I
361 +I
362 +I
363 +I
364 +I
365 +A
366 +A
367 +I
368 +I
369 +A
370 +I
371 +I
372 +I
373 +I
374 +I
375 +I
376 +I
377 +I
378 +I
379 +I
380 +I
381 +I
382 +A
383 +I
384 +I
385 +A
386 +I
387 +I
388 +A
389 +I
390 +I
391 +I
392 +I
393 +A
394 +A
395 +I
396 +A
397 +A
398 +I
399 +I
400 +A
401 +I
402 +I
403 +I
404 +I
405 +A
406 +I
407 +I
408 +I
409 +I
410 +I
411 +I
412 +I
413 +I
414 +I
415 +I
416 +A
417 +I
418 +I
419 +A
420 +A
421 +I
422 +I
423 +I
424 +A
425 +I
426 +I
427 +I
428 +I
429 +A
430 +I
431 +A
432 +I
433 +I
434 +I
435 +I
436 +I
437 +A
438 +I
439 +I
440 +I
441 +I
442 +I
443 +I
444 +I
445 +I
446 +I
447 +I
448 +I
449 +I
450 +I
451 +I
452 +I
453 +I
454 +I
455 +I
456 +A
457 +A
458 +A
459 +A
460 +I
461 +I
462 +I
463 +A
464 +A
465 +I
466 +I
467 +I
468 +I
469 +I
470 +A
471 +I
472 +A
473 +I
474 +I
475 +I
476 +I
477 +I
478 +A
479 +I
480 +I
481 +I
482 +I
483 +A
484 +A
485 +I
486 +I
487 +I
488 +I
489 +I
490 +I
491 +I
492 +I
493 +I
494 +I
495 +I
496 +I
497 +I
498 +I
499 +I
500 +I
501 +I
502 +A
503 +I
504 +A
505 +I
506 +I
507 +A
508 +I
509 +I
510 +I
511 +I
512 +A
513 +I
514 +I
515 +A
516 +A
517 +I
518 +I
519 +I
520 +A
521 +I
522 +A
523 +I
524 +I
525 +I
526 +I
527 +I
528 +I
529 +I
530 +A
531 +A
532 +I
533 +I
534 +I
535 +A
536 +I
537 +I
538 +I
539 +A
540 +I
541 +I
542 +I
543 +I
544 +I
545 +I
546 +A
547 +I
548 +I
549 +I
550 +I
551 +I
552 +A
553 +I
554 +I
555 +I
556 +I
557 +I
558 +A
559 +I
560 +I
561 +A
562 +I
563 +I
564 +I
565 +I
566 +I
567 +I
568 +A
569 +I
570 +I
571 +I
572 +I
573 +I
574 +I
575 +I
576 +I
577 +I
578 +I
579 +I
580 +I
581 +I
582 +I
583 +I
584 +I
585 +A
586 +I
587 +I
588 +A
589 +I
590 +I
591 +I
592 +I
593 +A
594 +I
595 +I
596 +I
597 +I
598 +I
599 +A
600 +I
601 +I
602 +I
603 +A
604 +I
605 +I
606 +I
607 +A
608 +I
609 +A
610 +A
611 +I
612 +A
613 +I
614 +I
615 +I
616 +I
617 +I
618 +I
619 +I
620 +A
621 +I
622 +I
623 +I
624 +A
625 +I
626 +A
627 +I
628 +I
629 +I
630 +I
631 +A
632 +I
633 +A
634 +I
1 +# -*- encoding: utf-8 -*-
2 +
3 +import os
4 +from time import time
5 +import argparse
6 +from sklearn.naive_bayes import BernoulliNB
7 +from sklearn.svm import SVC
8 +from sklearn.neighbors import KNeighborsClassifier
9 +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 + classification_report
11 +from sklearn.externals import joblib  # on newer scikit-learn, use 'import joblib' instead
12 +from sklearn import model_selection
13 +from sklearn.feature_selection import SelectKBest, chi2
14 +from sklearn.decomposition import TruncatedSVD
15 +from scipy.sparse import csr_matrix
16 +import scipy
17 +from imblearn.combine import SMOTEENN, SMOTETomek
18 +from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
19 +from imblearn.under_sampling import EditedNearestNeighbours, TomekLinks, \
20 + OneSidedSelection, RandomUnderSampler, NeighbourhoodCleaningRule, \
21 + InstanceHardnessThreshold, ClusterCentroids
22 +from imblearn.ensemble import EasyEnsemble, BalanceCascade
23 +
24 +__author__ = 'CMendezC'
25 +
26 +# Goal: training, cross-validation, and testing on the binding thrombin data set
27 +
28 +# Parameters:
29 +# 1) --inputPath Path to read input files.
30 +# 2) --inputTrainingData File to read training data.
31 +# 3) --inputTestingData File to read testing data.
32 +# 4) --inputTestingClasses File to read testing classes.
33 +# 5) --outputModelPath Path to place output model.
34 +# 6) --outputModelFile File to place output model.
35 +# 7) --outputReportPath Path to place evaluation report.
36 +# 8) --outputReportFile File to place evaluation report.
37 +# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
38 +# 10) --saveData Save matrices
39 +# 11) --kernel Kernel
40 +# 12) --reduction Feature selection or dimensionality reduction
41 +# 13) --imbalanced Imbalanced method
42 +
43 +# Output:
44 +# 1) Classification model and evaluation report.
45 +
46 +# Execution:
47 +
48 +# python training-crossvalidation-testing-binding-thrombin.py
49 +# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset
50 +# --inputTrainingData thrombin.data
51 +# --inputTestingData Thrombin.testset
52 +# --inputTestingClasses Thrombin.testset.class
53 +# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models
54 +# --outputModelFile SVM-lineal-model.mod
55 +# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports
56 +# --outputReportFile SVM-lineal.txt
57 +# --classifier SVM
58 +# --saveData
59 +# --kernel linear
60 +# --imbalanced RandomUS
61 +
62 +# source activate python3
63 +# python training-crossvalidation-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-lineal-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-lineal.txt --classifier SVM --kernel linear --imbalanced RandomUS
64 +
65 +###########################################################
66 +# MAIN PROGRAM #
67 +###########################################################
68 +
69 +if __name__ == "__main__":
70 + # Parameter definition
71 + parser = argparse.ArgumentParser(description='Training validation Binding Thrombin Dataset.')
72 + parser.add_argument("--inputPath", dest="inputPath",
73 + help="Path to read input files", metavar="PATH")
74 + parser.add_argument("--inputTrainingData", dest="inputTrainingData",
75 + help="File to read training data", metavar="FILE")
76 + parser.add_argument("--inputTestingData", dest="inputTestingData",
77 + help="File to read testing data", metavar="FILE")
78 + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
79 + help="File to read testing classes", metavar="FILE")
80 + parser.add_argument("--outputModelPath", dest="outputModelPath",
81 + help="Path to place output model", metavar="PATH")
82 + parser.add_argument("--outputModelFile", dest="outputModelFile",
83 + help="File to place output model", metavar="FILE")
84 + parser.add_argument("--outputReportPath", dest="outputReportPath",
85 + help="Path to place evaluation report", metavar="PATH")
86 + parser.add_argument("--outputReportFile", dest="outputReportFile",
87 + help="File to place evaluation report", metavar="FILE")
88 + parser.add_argument("--classifier", dest="classifier",
89 + help="Classifier", metavar="NAME",
90 + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
91 + parser.add_argument("--saveData", dest="saveData", action='store_true',
92 + help="Save matrices")
93 + parser.add_argument("--kernel", dest="kernel",
94 + help="Kernel SVM", metavar="NAME",
95 + choices=('linear', 'rbf', 'poly'), default='linear')
96 + parser.add_argument("--reduction", dest="reduction",
97 + help="Feature selection or dimensionality reduction", metavar="NAME",
98 + choices=('SVD200', 'SVD300', 'CHI250', 'CHI2100'), default=None)
99 + parser.add_argument("--imbalanced", dest="imbalanced",
100 +                        choices=('RandomUS', 'Tomek', 'ENN', 'NCR', 'IHT', 'OSS',
101 +                                 'ClusterC', 'RandomOS', 'ADASYN', 'SMOTE_reg',
102 +                                 'SMOTE_svm', 'SMOTE_b1', 'SMOTE_b2', 'SMOTE+ENN',
103 +                                 'SMOTE+Tomek', 'Balanced', 'Easy'), default=None,
104 +                        help="Undersampling: RandomUS, Tomek, ENN, Neighbourhood Cleaning Rule (NCR), "
105 +                             "Instance Hardness Threshold (IHT), One Sided Selection (OSS), ClusterC. "
106 +                             "Oversampling: RandomOS, ADASYN, SMOTE_reg, "
107 +                             "SMOTE_svm, SMOTE_b1, SMOTE_b2. Combine: SMOTE+ENN, "
108 +                             "SMOTE+Tomek. Ensemble: Balanced, Easy", metavar="TEXT")
109 +
110 + args = parser.parse_args()
111 +
112 + # Printing parameter values
113 + print('-------------------------------- PARAMETERS --------------------------------')
114 + print("Path to read input files: " + str(args.inputPath))
115 + print("File to read training data: " + str(args.inputTrainingData))
116 + print("File to read testing data: " + str(args.inputTestingData))
117 + print("File to read testing classes: " + str(args.inputTestingClasses))
118 + print("Path to place output model: " + str(args.outputModelPath))
119 + print("File to place output model: " + str(args.outputModelFile))
120 + print("Path to place evaluation report: " + str(args.outputReportPath))
121 + print("File to place evaluation report: " + str(args.outputReportFile))
122 + print("Classifier: " + str(args.classifier))
123 + print("Save matrices: " + str(args.saveData))
124 + print("Kernel: " + str(args.kernel))
125 + print("Reduction: " + str(args.reduction))
126 + print("Imbalanced: " + str(args.imbalanced))
127 +
128 + # Start time
129 + t0 = time()
130 +
131 + print("Reading training data and true classes...")
132 + X_train = None
133 + if args.saveData:
134 + y_train = []
135 + trainingData = []
136 + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
137 + as iFile:
138 + for line in iFile:
139 + line = line.strip('\r\n')
140 + listLine = line.split(',')
141 + y_train.append(listLine[0])
142 + trainingData.append(listLine[1:])
143 + # X_train = np.matrix(trainingData)
144 + X_train = csr_matrix(trainingData, dtype='double')
145 + print(" Saving matrix and classes...")
146 + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
147 + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
148 + print(" Done!")
149 + else:
150 + print(" Loading matrix and classes...")
151 + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
152 + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
153 + print(" Done!")
154 +
155 + print(" Number of training classes: {}".format(len(y_train)))
156 + print(" Number of training class A: {}".format(y_train.count('A')))
157 + print(" Number of training class I: {}".format(y_train.count('I')))
158 + print(" Shape of training matrix: {}".format(X_train.shape))
159 +
160 + # Feature selection and dimensional reduction
161 + if args.reduction is not None:
162 + print('Performing dimensionality reduction or feature selection...', args.reduction)
163 + if args.reduction == 'SVD200':
164 + reduc = TruncatedSVD(n_components=200, random_state=42)
165 + X_train = reduc.fit_transform(X_train)
166 +        elif args.reduction == 'SVD300':
167 + reduc = TruncatedSVD(n_components=300, random_state=42)
168 + X_train = reduc.fit_transform(X_train)
169 + elif args.reduction == 'CHI250':
170 + reduc = SelectKBest(chi2, k=50)
171 + X_train = reduc.fit_transform(X_train, y_train)
172 + elif args.reduction == 'CHI2100':
173 + reduc = SelectKBest(chi2, k=100)
174 + X_train = reduc.fit_transform(X_train, y_train)
175 + print(" Done!")
176 + print(' New shape of training matrix: ', X_train.shape)
177 +
178 +    if args.imbalanced is not None:
179 + t1 = time()
180 + # Combination over and under sampling
181 + jobs = 15
182 + if args.imbalanced == "SMOTE+ENN":
183 + sm = SMOTEENN(random_state=42, n_jobs=jobs)
184 + elif args.imbalanced == "SMOTE+Tomek":
185 + sm = SMOTETomek(random_state=42, n_jobs=jobs)
186 + # Over sampling
187 + elif args.imbalanced == "SMOTE_reg":
188 + sm = SMOTE(random_state=42, n_jobs=jobs)
189 + elif args.imbalanced == "SMOTE_svm":
190 + sm = SMOTE(random_state=42, n_jobs=jobs, kind='svm')
191 + elif args.imbalanced == "SMOTE_b1":
192 + sm = SMOTE(random_state=42, n_jobs=jobs, kind='borderline1')
193 + elif args.imbalanced == "SMOTE_b2":
194 + sm = SMOTE(random_state=42, n_jobs=jobs, kind='borderline2')
195 + elif args.imbalanced == "RandomOS":
196 + sm = RandomOverSampler(random_state=42)
197 + # Under sampling
198 + elif args.imbalanced == "ENN":
199 + sm = EditedNearestNeighbours(random_state=42, n_jobs=jobs)
200 + elif args.imbalanced == "Tomek":
201 + sm = TomekLinks(random_state=42, n_jobs=jobs)
202 + elif args.imbalanced == "OSS":
203 + sm = OneSidedSelection(random_state=42, n_jobs=jobs)
204 + elif args.imbalanced == "RandomUS":
205 + sm = RandomUnderSampler(random_state=42)
206 + elif args.imbalanced == "NCR":
207 + sm = NeighbourhoodCleaningRule(random_state=42, n_jobs=jobs)
208 + elif args.imbalanced == "IHT":
209 + sm = InstanceHardnessThreshold(random_state=42, n_jobs=jobs)
210 + elif args.imbalanced == "ClusterC":
211 + sm = ClusterCentroids(random_state=42, n_jobs=jobs)
212 + elif args.imbalanced == "Balanced":
213 + sm = BalanceCascade(random_state=42)
214 + elif args.imbalanced == "Easy":
215 + sm = EasyEnsemble(random_state=42, n_subsets=3)
216 + elif args.imbalanced == "ADASYN":
217 + sm = ADASYN(random_state=42, n_jobs=jobs)
218 +
219 + # Apply transformation
220 +        X_train, y_train = sm.fit_sample(X_train, y_train)  # renamed fit_resample in newer imbalanced-learn
221 +
222 +        print("   After transformation with {}".format(args.imbalanced))
223 + print(" Number of training classes: {}".format(len(y_train)))
224 + print(" Number of training class A: {}".format(list(y_train).count('A')))
225 + print(" Number of training class I: {}".format(list(y_train).count('I')))
226 + print(" Shape of training matrix: {}".format(X_train.shape))
227 +        print("   Data transformation done in: %fs" % (time() - t1))
228 +
229 + print("Reading testing data and true classes...")
230 + X_test = None
231 + if args.saveData:
232 + y_test = []
233 + testingData = []
234 + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
235 + as iFile:
236 + for line in iFile:
237 + line = line.strip('\r\n')
238 + listLine = line.split(',')
239 + testingData.append(listLine[1:])
240 + X_test = csr_matrix(testingData, dtype='double')
241 + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
242 + as iFile:
243 + for line in iFile:
244 + line = line.strip('\r\n')
245 + y_test.append(line)
246 + print(" Saving matrix and classes...")
247 + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
248 + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
249 + print(" Done!")
250 + else:
251 + print(" Loading matrix and classes...")
252 + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
253 + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
254 + print(" Done!")
255 +
256 + print(" Number of testing classes: {}".format(len(y_test)))
257 + print(" Number of testing class A: {}".format(y_test.count('A')))
258 + print(" Number of testing class I: {}".format(y_test.count('I')))
259 + print(" Shape of testing matrix: {}".format(X_test.shape))
260 +
261 + jobs = -1
262 + paramGrid = []
263 + nIter = 20
264 + crossV = 10
265 + print("Defining randomized grid search...")
266 + if args.classifier == 'SVM':
267 + # SVM
268 + classifier = SVC()
269 + if args.kernel == 'rbf':
270 + paramGrid = {'C': scipy.stats.expon(scale=100),
271 + 'gamma': scipy.stats.expon(scale=.1),
272 + 'kernel': ['rbf'], 'class_weight': ['balanced', None]}
273 + elif args.kernel == 'linear':
274 + paramGrid = {'C': scipy.stats.expon(scale=100),
275 + 'kernel': ['linear'],
276 + 'class_weight': ['balanced', None]}
277 + elif args.kernel == 'poly':
278 + paramGrid = {'C': scipy.stats.expon(scale=100),
279 + 'gamma': scipy.stats.expon(scale=.1), 'degree': [2, 3],
280 + 'kernel': ['poly'], 'class_weight': ['balanced', None]}
281 + myClassifier = model_selection.RandomizedSearchCV(classifier,
282 + paramGrid, n_iter=nIter,
283 + cv=crossV, n_jobs=jobs, verbose=3)
284 + elif args.classifier == 'BernoulliNB':
285 + # BernoulliNB
286 + classifier = BernoulliNB()
287 + paramGrid = {'alpha': scipy.stats.expon(scale=1.0)}
288 + myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=nIter,
289 + cv=crossV, n_jobs=jobs, verbose=3)
290 + # elif args.classifier == 'kNN':
291 + # # kNN
292 + # k_range = list(range(1, 7, 2))
293 + # classifier = KNeighborsClassifier()
294 +    #     paramGrid = {'n_neighbors': k_range}
295 + # myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=3,
296 + # cv=crossV, n_jobs=jobs, verbose=3)
297 + else:
298 + print("Bad classifier")
299 + exit()
300 + print(" Done!")
301 +
302 + print("Training...")
303 + myClassifier.fit(X_train, y_train)
304 + print(" Done!")
305 +
306 + print("Testing (prediction in new data)...")
307 + if args.reduction is not None:
308 + X_test = reduc.transform(X_test)
309 + y_pred = myClassifier.predict(X_test)
310 + best_parameters = myClassifier.best_estimator_.get_params()
311 + print(" Done!")
312 +
313 + print("Saving report...")
314 + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
315 + oFile.write('********** EVALUATION REPORT **********\n')
316 + oFile.write('Reduction: {}\n'.format(args.reduction))
317 + oFile.write('Classifier: {}\n'.format(args.classifier))
318 + oFile.write('Kernel: {}\n'.format(args.kernel))
319 + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
320 + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
321 + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
322 + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
323 + oFile.write('Confusion matrix: \n')
324 + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
325 + oFile.write('Classification report: \n')
326 + oFile.write(classification_report(y_test, y_pred) + '\n')
327 + oFile.write('Best parameters: \n')
328 + for param in sorted(best_parameters.keys()):
329 + oFile.write("\t%s: %r\n" % (param, best_parameters[param]))
330 + print(" Done!")
331 +
332 + print("Training and testing done in: %fs" % (time() - t0))
1 +# -*- encoding: utf-8 -*-
2 +
3 +import os
4 +from time import time
5 +import argparse
6 +from sklearn.naive_bayes import BernoulliNB
7 +from sklearn.svm import SVC
8 +from sklearn.neighbors import KNeighborsClassifier
9 +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 + classification_report
11 +from sklearn.externals import joblib  # on newer scikit-learn, use 'import joblib' instead
12 +from sklearn import model_selection
13 +from scipy.sparse import csr_matrix
14 +import scipy
15 +from imblearn.under_sampling import RandomUnderSampler
16 +from imblearn.over_sampling import RandomOverSampler
17 +
18 +__author__ = 'CMendezC'
19 +
20 +# Goal: training, cross-validation, and testing on the binding thrombin data set
21 +
22 +# Parameters:
23 +# 1) --inputPath Path to read input files.
24 +# 2) --inputTrainingData File to read training data.
25 +# 3) --inputTestingData File to read testing data.
26 +# 4) --inputTestingClasses File to read testing classes.
27 +# 5) --outputModelPath Path to place output model.
28 +# 6) --outputModelFile File to place output model.
29 +# 7) --outputReportPath Path to place evaluation report.
30 +# 8) --outputReportFile File to place evaluation report.
31 +# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
32 +# 10) --saveData Save matrices
33 +# 11) --kernel Kernel
34 +# 12) --imbalanced Imbalanced method
35 +
36 +# Output:
37 +# 1) Classification model and evaluation report.
38 +
39 +# Execution:
40 +
41 +# python imb-training-testing-binding-thrombin.py
42 +# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset
43 +# --inputTrainingData thrombin.data
44 +# --inputTestingData Thrombin.testset
45 +# --inputTestingClasses Thrombin.testset.class
46 +# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models
47 +# --outputModelFile SVM-lineal-model.mod
48 +# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports
49 +# --outputReportFile SVM-lineal.txt
50 +# --classifier SVM
51 +# --saveData
52 +# --kernel linear
53 +# --imbalanced RandomUS
54 +
55 +# source activate python3
56 +# python imb-training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-lineal-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-lineal.txt --classifier SVM --kernel linear
57 +# python imb-training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-lineal-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-lineal-RandomUS.txt --classifier SVM --kernel linear --imbalanced RandomUS
58 +
59 +
60 +# --imbalanced RandomUS
61 +
62 +###########################################################
63 +# MAIN PROGRAM #
64 +###########################################################
65 +
66 +if __name__ == "__main__":
67 + # Parameter definition
68 + parser = argparse.ArgumentParser(description='Training validation Binding Thrombin Dataset.')
69 + parser.add_argument("--inputPath", dest="inputPath",
70 + help="Path to read input files", metavar="PATH")
71 + parser.add_argument("--inputTrainingData", dest="inputTrainingData",
72 + help="File to read training data", metavar="FILE")
73 + parser.add_argument("--inputTestingData", dest="inputTestingData",
74 + help="File to read testing data", metavar="FILE")
75 + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
76 + help="File to read testing classes", metavar="FILE")
77 + parser.add_argument("--outputModelPath", dest="outputModelPath",
78 + help="Path to place output model", metavar="PATH")
79 + parser.add_argument("--outputModelFile", dest="outputModelFile",
80 + help="File to place output model", metavar="FILE")
81 + parser.add_argument("--outputReportPath", dest="outputReportPath",
82 + help="Path to place evaluation report", metavar="PATH")
83 + parser.add_argument("--outputReportFile", dest="outputReportFile",
84 + help="File to place evaluation report", metavar="FILE")
85 + parser.add_argument("--classifier", dest="classifier",
86 + help="Classifier", metavar="NAME",
87 + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
88 + parser.add_argument("--saveData", dest="saveData", action='store_true',
89 + help="Save matrices")
90 + parser.add_argument("--kernel", dest="kernel",
91 + help="Kernel SVM", metavar="NAME",
92 + choices=('linear', 'rbf', 'poly'), default='linear')
93 + parser.add_argument("--imbalanced", dest="imbalanced",
94 + choices=('RandomUS', 'RandomOS'), default=None,
95 + help="Undersampling: RandomUS. Oversampling: RandomOS", metavar="TEXT")
96 +
97 + args = parser.parse_args()
98 +
99 + # Printing parameter values
100 + print('-------------------------------- PARAMETERS --------------------------------')
101 + print("Path to read input files: " + str(args.inputPath))
102 + print("File to read training data: " + str(args.inputTrainingData))
103 + print("File to read testing data: " + str(args.inputTestingData))
104 + print("File to read testing classes: " + str(args.inputTestingClasses))
105 + print("Path to place output model: " + str(args.outputModelPath))
106 + print("File to place output model: " + str(args.outputModelFile))
107 + print("Path to place evaluation report: " + str(args.outputReportPath))
108 + print("File to place evaluation report: " + str(args.outputReportFile))
109 + print("Classifier: " + str(args.classifier))
110 + print("Save matrices: " + str(args.saveData))
111 + print("Kernel: " + str(args.kernel))
112 + print("Imbalanced: " + str(args.imbalanced))
113 +
114 + # Start time
115 + t0 = time()
116 +
117 + print("Reading training data and true classes...")
118 + X_train = None
119 + if args.saveData:
120 + y_train = []
121 + trainingData = []
122 + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
123 + as iFile:
124 + for line in iFile:
125 + line = line.strip('\r\n')
126 + listLine = line.split(',')
127 + y_train.append(listLine[0])
128 + trainingData.append(listLine[1:])
129 + # X_train = np.matrix(trainingData)
130 + X_train = csr_matrix(trainingData, dtype='double')
131 + print(" Saving matrix and classes...")
132 + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
133 + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
134 + print(" Done!")
135 + else:
136 + print(" Loading matrix and classes...")
137 + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
138 + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
139 + print(" Done!")
140 +
141 + print(" Number of training classes: {}".format(len(y_train)))
142 + print(" Number of training class A: {}".format(y_train.count('A')))
143 + print(" Number of training class I: {}".format(y_train.count('I')))
144 + print(" Shape of training matrix: {}".format(X_train.shape))
145 +
146 + if args.imbalanced is not None:
147 + t1 = time()
148 + # Resampling to counter class imbalance
149 + # Over sampling
150 + if args.imbalanced == "RandomOS":
151 + sm = RandomOverSampler(random_state=42)
152 + # Under sampling
153 + elif args.imbalanced == "RandomUS":
154 + sm = RandomUnderSampler(random_state=42)
155 +
156 + # Apply transformation
157 + X_train, y_train = sm.fit_resample(X_train, y_train)
158 +
159 + print(" After transformation with {}".format(args.imbalanced))
160 + print(" Number of training classes: {}".format(len(y_train)))
161 + print(" Number of training class A: {}".format(list(y_train).count('A')))
162 + print(" Number of training class I: {}".format(list(y_train).count('I')))
163 + print(" Shape of training matrix: {}".format(X_train.shape))
164 + print(" Data transformation done in : %fs" % (time() - t1))
165 +
166 + print("Reading testing data and true classes...")
167 + X_test = None
168 + if args.saveData:
169 + y_test = []
170 + testingData = []
171 + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
172 + as iFile:
173 + for line in iFile:
174 + line = line.strip('\r\n')
175 + listLine = line.split(',')
176 + testingData.append(listLine[1:])
177 + X_test = csr_matrix(testingData, dtype='double')
178 + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
179 + as iFile:
180 + for line in iFile:
181 + line = line.strip('\r\n')
182 + y_test.append(line)
183 + print(" Saving matrix and classes...")
184 + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
185 + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
186 + print(" Done!")
187 + else:
188 + print(" Loading matrix and classes...")
189 + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
190 + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
191 + print(" Done!")
192 +
193 + print(" Number of testing classes: {}".format(len(y_test)))
194 + print(" Number of testing class A: {}".format(y_test.count('A')))
195 + print(" Number of testing class I: {}".format(y_test.count('I')))
196 + print(" Shape of testing matrix: {}".format(X_test.shape))
197 +
198 + if args.classifier == 'SVM':
199 + # SVM
200 + myClassifier = SVC(kernel=args.kernel)
201 + elif args.classifier == 'BernoulliNB':
202 + # BernoulliNB
203 + myClassifier = BernoulliNB()
204 + elif args.classifier == 'kNN':
205 + # kNN
206 + myClassifier = KNeighborsClassifier()
207 + else:
208 + print("Bad classifier")
209 + exit()
210 + print(" Done!")
211 +
212 + print("Training...")
213 + myClassifier.fit(X_train, y_train)
214 + # Persist the trained model
215 + joblib.dump(myClassifier, os.path.join(args.outputModelPath, args.outputModelFile))
216 + print(" Done!")
215 +
216 + print("Testing (prediction in new data)...")
217 + y_pred = myClassifier.predict(X_test)
217 + print(" Done!")
218 +
219 + print("Saving report...")
220 + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
221 + oFile.write('********** EVALUATION REPORT **********\n')
222 + oFile.write('Classifier: {}\n'.format(args.classifier))
223 + oFile.write('Kernel: {}\n'.format(args.kernel))
224 + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
225 + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
226 + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
227 + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
228 + oFile.write('Confusion matrix: \n')
229 + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
230 + oFile.write('Classification report: \n')
231 + oFile.write(classification_report(y_test, y_pred) + '\n')
232 + print(" Done!")
233 +
234 + print("Training and testing done in: %fs" % (time() - t0))
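The RandomOS/RandomUS branch above delegates the resampling to imbalanced-learn. As a rough, self-contained illustration of the idea behind random oversampling (a hypothetical helper, not the library's implementation), duplicating minority-class rows with replacement until the classes are balanced looks like this:

```python
import random

def random_oversample(X, y, minority='A', majority='I', seed=42):
    """Illustrative random oversampling: draw minority-class indices
    with replacement until both classes have equal counts."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label == majority]
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(y))) + extra  # originals first, duplicates appended
    return [X[i] for i in keep], [y[i] for i in keep]

# Tiny imbalanced toy set: 1 active (A) vs 4 inactive (I)
X = [[0], [1], [2], [3], [4]]
y = ['A', 'I', 'I', 'I', 'I']
X_bal, y_bal = random_oversample(X, y)
```

RandomUnderSampler does the converse: it discards majority-class rows at random instead of duplicating minority ones, trading information loss for a smaller training matrix.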
1 +# -*- encoding: utf-8 -*-
2 +
3 +import os
4 +from time import time
5 +import argparse
6 +from sklearn.naive_bayes import BernoulliNB
7 +from sklearn.svm import SVC
8 +from sklearn.neighbors import KNeighborsClassifier
9 +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 + classification_report
11 +from sklearn.externals import joblib
12 +from sklearn import model_selection
13 +from sklearn.feature_selection import SelectKBest, chi2
14 +from sklearn.decomposition import TruncatedSVD
15 +from scipy.sparse import csr_matrix
16 +import scipy
17 +
18 +__author__ = 'CMendezC'
19 +
20 +# Goal: training, cross-validation, and testing of the binding thrombin data set
21 +
22 +# Parameters:
23 +# 1) --inputPath Path to read input files.
24 +# 2) --inputTrainingData File to read training data.
25 +# 3) --inputTestingData File to read testing data.
26 +# 4) --inputTestingClasses File to read testing classes.
27 +# 5) --outputModelPath Path to place output model.
28 +# 6) --outputModelFile File to place output model.
29 +# 7) --outputReportPath Path to place evaluation report.
30 +# 8) --outputReportFile File to place evaluation report.
31 +# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
32 +# 10) --saveData Save matrices
33 +# 11) --kernel Kernel
34 +# 12) --reduction Feature selection or dimensionality reduction
35 +
36 +# Output:
37 +# 1) Classification model and evaluation report.
38 +
39 +# Execution:
40 +
41 +# python training-crossvalidation-testing-binding-thrombin.py
42 +# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset
43 +# --inputTrainingData thrombin.data
44 +# --inputTestingData Thrombin.testset
45 +# --inputTestingClasses Thrombin.testset.class
46 +# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models
47 +# --outputModelFile SVM-model.mod
48 +# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports
49 +# --outputReportFile SVM.txt
50 +# --classifier SVM
51 +# --saveData
52 +# --kernel linear
53 +# --reduction SVD200
54 +
55 +# source activate python3
56 +# python training-crossvalidation-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-linear-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-linear.txt --classifier SVM --kernel linear
57 +
58 +###########################################################
59 +# MAIN PROGRAM #
60 +###########################################################
61 +
62 +if __name__ == "__main__":
63 + # Parameter definition
64 + parser = argparse.ArgumentParser(description='Training validation Binding Thrombin Dataset.')
65 + parser.add_argument("--inputPath", dest="inputPath",
66 + help="Path to read input files", metavar="PATH")
67 + parser.add_argument("--inputTrainingData", dest="inputTrainingData",
68 + help="File to read training data", metavar="FILE")
69 + parser.add_argument("--inputTestingData", dest="inputTestingData",
70 + help="File to read testing data", metavar="FILE")
71 + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
72 + help="File to read testing classes", metavar="FILE")
73 + parser.add_argument("--outputModelPath", dest="outputModelPath",
74 + help="Path to place output model", metavar="PATH")
75 + parser.add_argument("--outputModelFile", dest="outputModelFile",
76 + help="File to place output model", metavar="FILE")
77 + parser.add_argument("--outputReportPath", dest="outputReportPath",
78 + help="Path to place evaluation report", metavar="PATH")
79 + parser.add_argument("--outputReportFile", dest="outputReportFile",
80 + help="File to place evaluation report", metavar="FILE")
81 + parser.add_argument("--classifier", dest="classifier",
82 + help="Classifier", metavar="NAME",
83 + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
84 + parser.add_argument("--saveData", dest="saveData", action='store_true',
85 + help="Save matrices")
86 + parser.add_argument("--kernel", dest="kernel",
87 + help="Kernel SVM", metavar="NAME",
88 + choices=('linear', 'rbf', 'poly'), default='linear')
89 + parser.add_argument("--reduction", dest="reduction",
90 + help="Feature selection or dimensionality reduction", metavar="NAME",
91 + choices=('SVD200', 'SVD300', 'CHI250', 'CHI2100'), default=None)
92 +
93 + args = parser.parse_args()
94 +
95 + # Printing parameter values
96 + print('-------------------------------- PARAMETERS --------------------------------')
97 + print("Path to read input files: " + str(args.inputPath))
98 + print("File to read training data: " + str(args.inputTrainingData))
99 + print("File to read testing data: " + str(args.inputTestingData))
100 + print("File to read testing classes: " + str(args.inputTestingClasses))
101 + print("Path to place output model: " + str(args.outputModelPath))
102 + print("File to place output model: " + str(args.outputModelFile))
103 + print("Path to place evaluation report: " + str(args.outputReportPath))
104 + print("File to place evaluation report: " + str(args.outputReportFile))
105 + print("Classifier: " + str(args.classifier))
106 + print("Save matrices: " + str(args.saveData))
107 + print("Kernel: " + str(args.kernel))
108 + print("Reduction: " + str(args.reduction))
109 +
110 + # Start time
111 + t0 = time()
112 +
113 + print("Reading training data and true classes...")
114 + X_train = None
115 + if args.saveData:
116 + y_train = []
117 + trainingData = []
118 + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
119 + as iFile:
120 + for line in iFile:
121 + line = line.strip('\r\n')
122 + listLine = line.split(',')
123 + y_train.append(listLine[0])
124 + trainingData.append(listLine[1:])
125 + # X_train = np.matrix(trainingData)
126 + X_train = csr_matrix(trainingData, dtype='double')
127 + print(" Saving matrix and classes...")
128 + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
129 + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
130 + print(" Done!")
131 + else:
132 + print(" Loading matrix and classes...")
133 + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
134 + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
135 + print(" Done!")
136 +
137 + print(" Number of training classes: {}".format(len(y_train)))
138 + print(" Number of training class A: {}".format(y_train.count('A')))
139 + print(" Number of training class I: {}".format(y_train.count('I')))
140 + print(" Shape of training matrix: {}".format(X_train.shape))
141 +
142 + print("Reading testing data and true classes...")
143 + X_test = None
144 + if args.saveData:
145 + y_test = []
146 + testingData = []
147 + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
148 + as iFile:
149 + for line in iFile:
150 + line = line.strip('\r\n')
151 + listLine = line.split(',')
152 + testingData.append(listLine[1:])
153 + X_test = csr_matrix(testingData, dtype='double')
154 + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
155 + as iFile:
156 + for line in iFile:
157 + line = line.strip('\r\n')
158 + y_test.append(line)
159 + print(" Saving matrix and classes...")
160 + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
161 + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
162 + print(" Done!")
163 + else:
164 + print(" Loading matrix and classes...")
165 + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
166 + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
167 + print(" Done!")
168 +
169 + print(" Number of testing classes: {}".format(len(y_test)))
170 + print(" Number of testing class A: {}".format(y_test.count('A')))
171 + print(" Number of testing class I: {}".format(y_test.count('I')))
172 + print(" Shape of testing matrix: {}".format(X_test.shape))
173 +
174 + # Feature selection and dimensional reduction
175 + if args.reduction is not None:
176 + print('Performing dimensionality reduction or feature selection...', args.reduction)
177 + if args.reduction == 'SVD200':
178 + reduc = TruncatedSVD(n_components=200, random_state=42)
179 + X_train = reduc.fit_transform(X_train)
180 + elif args.reduction == 'SVD300':
181 + reduc = TruncatedSVD(n_components=300, random_state=42)
182 + X_train = reduc.fit_transform(X_train)
183 + elif args.reduction == 'CHI250':
184 + reduc = SelectKBest(chi2, k=50)
185 + X_train = reduc.fit_transform(X_train, y_train)
186 + elif args.reduction == 'CHI2100':
187 + reduc = SelectKBest(chi2, k=100)
188 + X_train = reduc.fit_transform(X_train, y_train)
189 + print(" Done!")
190 + print(' New shape of training matrix: ', X_train.shape)
191 +
192 + jobs = -1
193 + paramGrid = []
194 + nIter = 20
195 + crossV = 10
196 + print("Defining randomized grid search...")
197 + if args.classifier == 'SVM':
198 + # SVM
199 + classifier = SVC()
200 + if args.kernel == 'rbf':
201 + paramGrid = {'C': scipy.stats.expon(scale=100),
202 + 'gamma': scipy.stats.expon(scale=.1),
203 + 'kernel': ['rbf'], 'class_weight': ['balanced', None]}
204 + elif args.kernel == 'linear':
205 + paramGrid = {'C': scipy.stats.expon(scale=100),
206 + 'kernel': ['linear'],
207 + 'class_weight': ['balanced', None]}
208 + elif args.kernel == 'poly':
209 + paramGrid = {'C': scipy.stats.expon(scale=100),
210 + 'gamma': scipy.stats.expon(scale=.1), 'degree': [2, 3],
211 + 'kernel': ['poly'], 'class_weight': ['balanced', None]}
212 + myClassifier = model_selection.RandomizedSearchCV(classifier,
213 + paramGrid, n_iter=nIter,
214 + cv=crossV, n_jobs=jobs, verbose=3)
215 + elif args.classifier == 'BernoulliNB':
216 + # BernoulliNB
217 + classifier = BernoulliNB()
218 + paramGrid = {'alpha': scipy.stats.expon(scale=1.0)}
219 + myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=nIter,
220 + cv=crossV, n_jobs=jobs, verbose=3)
221 + # elif args.classifier == 'kNN':
222 + # # kNN
223 + # k_range = list(range(1, 7, 2))
224 + # classifier = KNeighborsClassifier()
225 + # paramGrid = {'n_neighbors': k_range}
226 + # myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=3,
227 + # cv=crossV, n_jobs=jobs, verbose=3)
228 + else:
229 + print("Bad classifier")
230 + exit()
231 + print(" Done!")
232 +
233 + print("Training...")
234 + myClassifier.fit(X_train, y_train)
235 + # Persist the best model found by the randomized search
236 + joblib.dump(myClassifier.best_estimator_, os.path.join(args.outputModelPath, args.outputModelFile))
237 + print(" Done!")
236 +
237 + print("Testing (prediction in new data)...")
238 + if args.reduction is not None:
239 + X_test = reduc.transform(X_test)
240 + y_pred = myClassifier.predict(X_test)
241 + best_parameters = myClassifier.best_estimator_.get_params()
242 + print(" Done!")
243 +
244 + print("Saving report...")
245 + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
246 + oFile.write('********** EVALUATION REPORT **********\n')
247 + oFile.write('Reduction: {}\n'.format(args.reduction))
248 + oFile.write('Classifier: {}\n'.format(args.classifier))
249 + oFile.write('Kernel: {}\n'.format(args.kernel))
250 + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
251 + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
252 + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
253 + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
254 + oFile.write('Confusion matrix: \n')
255 + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
256 + oFile.write('Classification report: \n')
257 + oFile.write(classification_report(y_test, y_pred) + '\n')
258 + oFile.write('Best parameters: \n')
259 + for param in sorted(best_parameters.keys()):
260 + oFile.write("\t%s: %r\n" % (param, best_parameters[param]))
261 + print(" Done!")
262 +
263 + print("Training and testing done in: %fs" % (time() - t0))
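RandomizedSearchCV, as configured above, samples `n_iter` candidate settings from the given distributions (e.g. `scipy.stats.expon(scale=100)` for C), scores each by cross-validation, and keeps the best. A minimal sketch of that search loop, assuming a toy scoring function in place of cross-validated accuracy:

```python
import random

def randomized_search(score_fn, sample_param, n_iter=20, seed=42):
    """Illustrative randomized search: sample n_iter candidate
    parameter values and keep the best-scoring one."""
    rng = random.Random(seed)
    best_param, best_score = None, float('-inf')
    for _ in range(n_iter):
        p = sample_param(rng)      # draw a candidate from the distribution
        s = score_fn(p)            # stands in for cross-validated scoring
        if s > best_score:
            best_param, best_score = p, s
    return best_param, best_score

# Toy objective peaking at C = 10 (a stand-in for CV accuracy)
score = lambda c: -(c - 10.0) ** 2
# Exponential draw with mean 100, mirroring scipy.stats.expon(scale=100)
draw = lambda rng: rng.expovariate(1 / 100.0)
best_c, best_s = randomized_search(score, draw, n_iter=200)
```

Unlike an exhaustive grid, the number of fits is fixed by `n_iter` regardless of how many distributions are searched, which is why the script can afford 10-fold CV over continuous C and gamma ranges.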
1 +# -*- encoding: utf-8 -*-
2 +
3 +import os
4 +from time import time
5 +import argparse
6 +from sklearn.naive_bayes import BernoulliNB
7 +from sklearn.svm import SVC
8 +from sklearn.neighbors import KNeighborsClassifier
9 +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 + classification_report
11 +from sklearn.externals import joblib
12 +from scipy.sparse import csr_matrix
13 +
14 +__author__ = 'CMendezC'
15 +
16 +# Goal: training and testing of the binding thrombin data set
17 +
18 +# Parameters:
19 +# 1) --inputPath Path to read input files.
20 +# 2) --inputTrainingData File to read training data.
21 +# 3) --inputTestingData File to read testing data.
22 +# 4) --inputTestingClasses File to read testing classes.
23 +# 5) --outputModelPath Path to place output model.
24 +# 6) --outputModelFile File to place output model.
25 +# 7) --outputReportPath Path to place evaluation report.
26 +# 8) --outputReportFile File to place evaluation report.
27 +# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
28 +# 10) --saveData Save matrices
29 +
30 +# Output:
31 +# 1) Classification model and evaluation report.
32 +
33 +# Execution:
34 +
35 +# python training-testing-binding-thrombin.py
36 +# --inputPath /home/binding-thrombin-dataset
37 +# --inputTrainingData thrombin.data
38 +# --inputTestingData Thrombin.testset
39 +# --inputTestingClasses Thrombin.testset.class
40 +# --outputModelPath /home/binding-thrombin-dataset/models
41 +# --outputModelFile SVM-model.mod
42 +# --outputReportPath /home/binding-thrombin-dataset/reports
43 +# --outputReportFile SVM.txt
44 +# --classifier SVM
45 +# --saveData
46 +
47 +# source activate python3
48 +# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData
49 +
50 +###########################################################
51 +# MAIN PROGRAM #
52 +###########################################################
53 +
54 +if __name__ == "__main__":
55 + # Parameter definition
56 + parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.')
57 + parser.add_argument("--inputPath", dest="inputPath",
58 + help="Path to read input files", metavar="PATH")
59 + parser.add_argument("--inputTrainingData", dest="inputTrainingData",
60 + help="File to read training data", metavar="FILE")
61 + parser.add_argument("--inputTestingData", dest="inputTestingData",
62 + help="File to read testing data", metavar="FILE")
63 + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
64 + help="File to read testing classes", metavar="FILE")
65 + parser.add_argument("--outputModelPath", dest="outputModelPath",
66 + help="Path to place output model", metavar="PATH")
67 + parser.add_argument("--outputModelFile", dest="outputModelFile",
68 + help="File to place output model", metavar="FILE")
69 + parser.add_argument("--outputReportPath", dest="outputReportPath",
70 + help="Path to place evaluation report", metavar="PATH")
71 + parser.add_argument("--outputReportFile", dest="outputReportFile",
72 + help="File to place evaluation report", metavar="FILE")
73 + parser.add_argument("--classifier", dest="classifier",
74 + help="Classifier", metavar="NAME",
75 + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
76 + parser.add_argument("--saveData", dest="saveData", action='store_true',
77 + help="Save matrices")
78 +
79 + args = parser.parse_args()
80 +
81 + # Printing parameter values
82 + print('-------------------------------- PARAMETERS --------------------------------')
83 + print("Path to read input files: " + str(args.inputPath))
84 + print("File to read training data: " + str(args.inputTrainingData))
85 + print("File to read testing data: " + str(args.inputTestingData))
86 + print("File to read testing classes: " + str(args.inputTestingClasses))
87 + print("Path to place output model: " + str(args.outputModelPath))
88 + print("File to place output model: " + str(args.outputModelFile))
89 + print("Path to place evaluation report: " + str(args.outputReportPath))
90 + print("File to place evaluation report: " + str(args.outputReportFile))
91 + print("Classifier: " + str(args.classifier))
92 + print("Save matrices: " + str(args.saveData))
93 +
94 + # Start time
95 + t0 = time()
96 +
97 + print("Reading training data and true classes...")
98 + X_train = None
99 + if args.saveData:
100 + y_train = []
101 + trainingData = []
102 + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
103 + as iFile:
104 + for line in iFile:
105 + line = line.strip('\r\n')
106 + listLine = line.split(',')
107 + y_train.append(listLine[0])
108 + trainingData.append(listLine[1:])
109 + # X_train = np.matrix(trainingData)
110 + X_train = csr_matrix(trainingData, dtype='double')
111 + print(" Saving matrix and classes...")
112 + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
113 + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
114 + print(" Done!")
115 + else:
116 + print(" Loading matrix and classes...")
117 + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
118 + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
119 + print(" Done!")
120 +
121 + print(" Number of training classes: {}".format(len(y_train)))
122 + print(" Number of training class A: {}".format(y_train.count('A')))
123 + print(" Number of training class I: {}".format(y_train.count('I')))
124 + print(" Shape of training matrix: {}".format(X_train.shape))
125 +
126 + print("Reading testing data and true classes...")
127 + X_test = None
128 + if args.saveData:
129 + y_test = []
130 + testingData = []
131 + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
132 + as iFile:
133 + for line in iFile:
134 + line = line.strip('\r\n')
135 + listLine = line.split(',')
136 + testingData.append(listLine[1:])
137 + X_test = csr_matrix(testingData, dtype='double')
138 + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
139 + as iFile:
140 + for line in iFile:
141 + line = line.strip('\r\n')
142 + y_test.append(line)
143 + print(" Saving matrix and classes...")
144 + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
145 + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
146 + print(" Done!")
147 + else:
148 + print(" Loading matrix and classes...")
149 + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
150 + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
151 + print(" Done!")
152 +
153 + print(" Number of testing classes: {}".format(len(y_test)))
154 + print(" Number of testing class A: {}".format(y_test.count('A')))
155 + print(" Number of testing class I: {}".format(y_test.count('I')))
156 + print(" Shape of testing matrix: {}".format(X_test.shape))
157 +
158 + if args.classifier == "BernoulliNB":
159 + classifier = BernoulliNB()
160 + elif args.classifier == "SVM":
161 + classifier = SVC()
162 + elif args.classifier == "kNN":
163 + classifier = KNeighborsClassifier()
164 + else:
165 + print("Bad classifier")
166 + exit()
167 +
168 + print("Training...")
169 + classifier.fit(X_train, y_train)
170 + # Persist the trained model
171 + joblib.dump(classifier, os.path.join(args.outputModelPath, args.outputModelFile))
172 + print(" Done!")
171 +
172 + print("Testing (prediction in new data)...")
173 + y_pred = classifier.predict(X_test)
174 + print(" Done!")
175 +
176 + print("Saving report...")
177 + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
178 + oFile.write('********** EVALUATION REPORT **********\n')
179 + oFile.write('Classifier: {}\n'.format(args.classifier))
180 + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
181 + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
182 + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
183 + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
184 + oFile.write('Confusion matrix: \n')
185 + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
186 + oFile.write('Classification report: \n')
187 + oFile.write(classification_report(y_test, y_pred) + '\n')
188 + print(" Done!")
189 +
190 + print("Training and testing done in: %fs" % (time() - t0))
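The loading loops in all three scripts assume the same file layout: one molecule per line, the class label (A/I) in the first comma-separated field, binary features in the rest. Sketched on a hypothetical two-line sample (the real rows have tens of thousands of features, which is why the scripts build a `csr_matrix` that stores only the nonzeros):

```python
# Hypothetical two-row sample in the thrombin file layout
lines = ["A,1,0,0,1", "I,0,0,1,1"]

y, rows = [], []
for line in lines:
    fields = line.strip('\r\n').split(',')
    y.append(fields[0])                         # first field: class label
    rows.append([int(v) for v in fields[1:]])   # remaining fields: 0/1 features
```

The test-set file differs only in that the label column is absent, which is why the testing loop appends `listLine[1:]` but reads the true classes from a separate `Thrombin.testset.class` file.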