Carlos-Francisco Méndez-Cruz

Remove files

1 -The test set consists of 634 data points, each of which represents
2 -a molecule that is either active (A) or inactive (I). The test set
3 -has the same format as the training set, with the exception that the
4 -activity value (A or I) for each data point is missing, that is, has
5 -been replaced by a question mark (?). Please submit one prediction,
6 -A or I, for each data point. Your submission should be in the form
7 -of a file that starts with your contact information, followed by a
8 -line with 5 asterisks, followed immediately by your predictions, with
9 -one line per data point. The predictions should be in the same order
10 -as the test set data points. So your prediction for the first example
11 -should appear on the first line after the asterisks, your prediction
12 -for the second example should appear on the second line after the
13 -asterisks, etc. Hence, after your contact information, the prediction
14 -file will consist of 635 lines and have the form:
15 -
16 -*****
17 -I
18 -I
19 -A
20 -I
21 -A
22 -I
23 -
24 -etc.
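The layout above can be sketched in a few lines of Python; the contact line, output file name, and toy predictions below are placeholders for illustration, not part of the specification:

```python
# Sketch: write a prediction file in the required layout --
# contact information, a line of 5 asterisks, then one prediction
# (A or I) per test data point, in test-set order.
predictions = ["I", "I", "A", "I", "A", "I"]  # placeholder predictions

lines = ["Jane Doe <jdoe@example.edu>"]  # contact information (placeholder)
lines.append("*****")
lines.extend(predictions)

# File name follows the ftp convention KDDcup.<name>.thrombin
with open("KDDcup.JaneDoe.thrombin", "w") as oFile:
    oFile.write("\n".join(lines) + "\n")
```

For the real test set, `predictions` would hold one entry per test data point, giving 635 lines after the contact information.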
25 -
26 -You may submit your prediction by email to page@biostat.wisc.edu
27 -or by anonymous ftp to ftp.biostat.wisc.edu, placing the file
28 -into the directory dropboxes/page/. If using email, please use
29 -the subject line "KDDcup <name> thrombin" where <name> is your
30 -name. If using ftp, please name the file KDDcup.<name>.thrombin
31 -where <name> is your name. For example, my submission would be
32 -named KDDcup.DavidPage.thrombin
33 -
34 -Only one submission per person per task is permitted. If you do not
35 -receive email confirmation of your submission within 24 hours, please
36 -email page@biostat.wisc.edu with subject "KDDcup no confirmation".
37 -
38 -For group entries, the contact information should include the names
39 -of everyone to be credited as a member of the group should your entry
40 -achieve the highest score. But no person is to be listed on more than
41 -one entry per task.
1 -Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
2 ---------------------------------------------------------------------------
3 -
4 -Drugs are typically small organic molecules that achieve their desired
5 -activity by binding to a target site on a receptor. The first step in
6 -the discovery of a new drug is usually to identify and isolate the
7 -receptor to which it should bind, followed by testing many small
8 -molecules for their ability to bind to the target site. This leaves
9 -researchers with the task of determining what separates the active
10 -(binding) compounds from the inactive (non-binding) ones. Such a
11 -determination can then be used in the design of new compounds that not
12 -only bind, but also have all the other properties required for a drug
13 -(solubility, oral absorption, lack of side effects, appropriate duration
14 -of action, low toxicity, etc.).
15 -
16 -The present training data set consists of 1909 compounds tested for
17 -their ability to bind to a target site on thrombin, a key receptor in
18 -blood clotting. The chemical structures of these compounds are not
19 -necessary for our analysis and are not included. Of these compounds, 42
20 -are active (bind well) and the others are inactive. Each compound is
21 -described by a single feature vector composed of a class value (A for
22 -active, I for inactive) and 139,351 binary features, which describe
23 -three-dimensional properties of the molecule. The definitions of the
24 -individual bits are not included - we don't know what each individual
25 -bit means, only that they are generated in an internally consistent
26 -manner for all 1909 compounds. Biological activity in general, and
27 -receptor binding affinity in particular, correlate with various
28 -structural and physical properties of small organic molecules. The task
29 -is to determine which of these properties are critical in this case and
30 -to learn to accurately predict the class value. To simulate the
31 -real-world drug design environment, the test set contains 636 additional
32 -compounds that were in fact generated based on the assay results
33 -recorded for the training set. In evaluating the accuracy, a
34 -differential cost model will be used, so that the sum of the costs of
35 -the actives will be equal to the sum of the costs of the inactives.
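One way to read this differential cost model is as balanced accuracy: if the total cost of the actives equals the total cost of the inactives, each class contributes equally to the score regardless of how many compounds it contains. A minimal sketch under that assumption (the toy labels are illustrative):

```python
# Sketch: score predictions so that each class (A, I) carries equal
# total cost, i.e. the unweighted mean of the per-class accuracies.
def balanced_accuracy(y_true, y_pred):
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

y_true = ["A", "A", "I", "I", "I", "I"]
y_pred = ["A", "I", "I", "I", "I", "I"]
# actives 1/2 correct, inactives 4/4 correct -> (0.5 + 1.0) / 2 = 0.75
```

Under such a model, predicting I for every compound scores only 0.5, even though it is correct for the large majority of the data points.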
36 -
37 -We thank DuPont Pharmaceuticals for graciously providing this data set
38 -for the KDD Cup 2001 competition. All publications referring to
39 -analysis of this data set should acknowledge DuPont Pharmaceuticals
40 -Research Laboratories and KDD Cup 2001.
1 -I
2 -A
3 -I
4 -I
5 -I
6 -A
7 -I
8 -I
9 -I
10 -A
11 -I
12 -I
13 -I
14 -A
15 -I
16 -A
17 -I
18 -I
19 -I
20 -I
21 -I
22 -I
23 -I
24 -I
25 -I
26 -I
27 -I
28 -A
29 -I
30 -A
31 -I
32 -I
33 -I
34 -I
35 -I
36 -A
37 -I
38 -I
39 -I
40 -A
41 -I
42 -I
43 -I
44 -I
45 -I
46 -I
47 -I
48 -I
49 -A
50 -A
51 -I
52 -I
53 -I
54 -I
55 -I
56 -A
57 -I
58 -A
59 -A
60 -I
61 -I
62 -I
63 -A
64 -I
65 -I
66 -I
67 -I
68 -A
69 -A
70 -I
71 -A
72 -I
73 -I
74 -A
75 -A
76 -I
77 -I
78 -I
79 -I
80 -I
81 -I
82 -I
83 -I
84 -I
85 -I
86 -I
87 -I
88 -I
89 -I
90 -I
91 -A
92 -I
93 -I
94 -A
95 -I
96 -A
97 -I
98 -I
99 -I
100 -A
101 -A
102 -I
103 -I
104 -I
105 -I
106 -I
107 -I
108 -A
109 -I
110 -I
111 -I
112 -I
113 -A
114 -A
115 -I
116 -I
117 -I
118 -I
119 -I
120 -I
121 -I
122 -I
123 -A
124 -A
125 -I
126 -A
127 -A
128 -I
129 -I
130 -I
131 -I
132 -I
133 -I
134 -I
135 -I
136 -I
137 -I
138 -A
139 -I
140 -I
141 -I
142 -I
143 -I
144 -I
145 -I
146 -I
147 -I
148 -A
149 -I
150 -I
151 -I
152 -I
153 -A
154 -I
155 -I
156 -I
157 -I
158 -I
159 -I
160 -I
161 -A
162 -I
163 -I
164 -A
165 -I
166 -A
167 -I
168 -I
169 -A
170 -I
171 -A
172 -I
173 -A
174 -I
175 -A
176 -I
177 -I
178 -I
179 -I
180 -I
181 -A
182 -I
183 -I
184 -A
185 -I
186 -I
187 -A
188 -I
189 -I
190 -I
191 -A
192 -I
193 -A
194 -I
195 -I
196 -A
197 -I
198 -I
199 -I
200 -I
201 -A
202 -I
203 -A
204 -I
205 -I
206 -I
207 -I
208 -I
209 -I
210 -I
211 -I
212 -I
213 -I
214 -I
215 -A
216 -I
217 -A
218 -I
219 -I
220 -I
221 -I
222 -I
223 -I
224 -A
225 -I
226 -I
227 -A
228 -A
229 -A
230 -I
231 -I
232 -A
233 -A
234 -I
235 -I
236 -I
237 -I
238 -A
239 -I
240 -I
241 -I
242 -I
243 -A
244 -I
245 -A
246 -I
247 -I
248 -I
249 -I
250 -I
251 -I
252 -I
253 -A
254 -A
255 -I
256 -I
257 -I
258 -I
259 -I
260 -I
261 -I
262 -I
263 -A
264 -A
265 -I
266 -I
267 -I
268 -I
269 -I
270 -I
271 -A
272 -A
273 -I
274 -I
275 -I
276 -I
277 -I
278 -I
279 -A
280 -I
281 -A
282 -I
283 -I
284 -I
285 -I
286 -I
287 -I
288 -I
289 -I
290 -I
291 -A
292 -I
293 -I
294 -A
295 -I
296 -I
297 -I
298 -I
299 -I
300 -I
301 -A
302 -A
303 -I
304 -I
305 -I
306 -I
307 -I
308 -A
309 -I
310 -I
311 -I
312 -I
313 -I
314 -A
315 -A
316 -A
317 -I
318 -A
319 -I
320 -I
321 -I
322 -I
323 -A
324 -A
325 -I
326 -A
327 -A
328 -I
329 -I
330 -I
331 -I
332 -I
333 -I
334 -I
335 -I
336 -I
337 -I
338 -I
339 -I
340 -A
341 -I
342 -I
343 -I
344 -I
345 -A
346 -A
347 -I
348 -I
349 -A
350 -I
351 -I
352 -I
353 -I
354 -I
355 -A
356 -A
357 -I
358 -A
359 -I
360 -I
361 -I
362 -I
363 -I
364 -I
365 -A
366 -A
367 -I
368 -I
369 -A
370 -I
371 -I
372 -I
373 -I
374 -I
375 -I
376 -I
377 -I
378 -I
379 -I
380 -I
381 -I
382 -A
383 -I
384 -I
385 -A
386 -I
387 -I
388 -A
389 -I
390 -I
391 -I
392 -I
393 -A
394 -A
395 -I
396 -A
397 -A
398 -I
399 -I
400 -A
401 -I
402 -I
403 -I
404 -I
405 -A
406 -I
407 -I
408 -I
409 -I
410 -I
411 -I
412 -I
413 -I
414 -I
415 -I
416 -A
417 -I
418 -I
419 -A
420 -A
421 -I
422 -I
423 -I
424 -A
425 -I
426 -I
427 -I
428 -I
429 -A
430 -I
431 -A
432 -I
433 -I
434 -I
435 -I
436 -I
437 -A
438 -I
439 -I
440 -I
441 -I
442 -I
443 -I
444 -I
445 -I
446 -I
447 -I
448 -I
449 -I
450 -I
451 -I
452 -I
453 -I
454 -I
455 -I
456 -A
457 -A
458 -A
459 -A
460 -I
461 -I
462 -I
463 -A
464 -A
465 -I
466 -I
467 -I
468 -I
469 -I
470 -A
471 -I
472 -A
473 -I
474 -I
475 -I
476 -I
477 -I
478 -A
479 -I
480 -I
481 -I
482 -I
483 -A
484 -A
485 -I
486 -I
487 -I
488 -I
489 -I
490 -I
491 -I
492 -I
493 -I
494 -I
495 -I
496 -I
497 -I
498 -I
499 -I
500 -I
501 -I
502 -A
503 -I
504 -A
505 -I
506 -I
507 -A
508 -I
509 -I
510 -I
511 -I
512 -A
513 -I
514 -I
515 -A
516 -A
517 -I
518 -I
519 -I
520 -A
521 -I
522 -A
523 -I
524 -I
525 -I
526 -I
527 -I
528 -I
529 -I
530 -A
531 -A
532 -I
533 -I
534 -I
535 -A
536 -I
537 -I
538 -I
539 -A
540 -I
541 -I
542 -I
543 -I
544 -I
545 -I
546 -A
547 -I
548 -I
549 -I
550 -I
551 -I
552 -A
553 -I
554 -I
555 -I
556 -I
557 -I
558 -A
559 -I
560 -I
561 -A
562 -I
563 -I
564 -I
565 -I
566 -I
567 -I
568 -A
569 -I
570 -I
571 -I
572 -I
573 -I
574 -I
575 -I
576 -I
577 -I
578 -I
579 -I
580 -I
581 -I
582 -I
583 -I
584 -I
585 -A
586 -I
587 -I
588 -A
589 -I
590 -I
591 -I
592 -I
593 -A
594 -I
595 -I
596 -I
597 -I
598 -I
599 -A
600 -I
601 -I
602 -I
603 -A
604 -I
605 -I
606 -I
607 -A
608 -I
609 -A
610 -A
611 -I
612 -A
613 -I
614 -I
615 -I
616 -I
617 -I
618 -I
619 -I
620 -A
621 -I
622 -I
623 -I
624 -A
625 -I
626 -A
627 -I
628 -I
629 -I
630 -I
631 -A
632 -I
633 -A
634 -I
1 -# -*- encoding: utf-8 -*-
2 -
3 -import os
4 -from time import time
5 -import argparse
6 -from sklearn.naive_bayes import BernoulliNB
7 -from sklearn.svm import SVC
8 -from sklearn.neighbors import KNeighborsClassifier
9 -from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 - classification_report
11 -import joblib  # sklearn.externals.joblib is deprecated; import joblib directly
12 -from sklearn import model_selection
13 -from sklearn.feature_selection import SelectKBest, chi2
14 -from sklearn.decomposition import TruncatedSVD
15 -from scipy.sparse import csr_matrix
16 -import scipy
17 -
18 -__author__ = 'CMendezC'
19 -
20 -# Goal: training, cross-validation, and testing on the binding thrombin data set
21 -
22 -# Parameters:
23 -# 1) --inputPath Path to read input files.
24 -# 2) --inputTrainingData File to read training data.
25 -# 3) --inputTestingData File to read testing data.
26 -# 4) --inputTestingClasses File to read testing classes.
27 -# 5) --outputModelPath Path to place output model.
28 -# 6) --outputModelFile File to place output model.
29 -# 7) --outputReportPath Path to place evaluation report.
30 -# 8) --outputReportFile File to place evaluation report.
31 -# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
32 -# 10) --saveData Save matrices
33 -# 11) --kernel Kernel
34 -# 12) --reduction Feature selection or dimensionality reduction
35 -
36 -# Output:
37 -# 1) Classification model and evaluation report.
38 -
39 -# Execution:
40 -
41 -# python training-crossvalidation-testing-binding-thrombin.py
42 -# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset
43 -# --inputTrainingData thrombin.data
44 -# --inputTestingData Thrombin.testset
45 -# --inputTestingClasses Thrombin.testset.class
46 -# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models
47 -# --outputModelFile SVM-model.mod
48 -# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports
49 -# --outputReportFile SVM.txt
50 -# --classifier SVM
51 -# --saveData
52 -# --kernel linear
53 -# --reduction SVD200
54 -
55 -# source activate python3
56 -# python training-crossvalidation-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-linear-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-linear.txt --classifier SVM --kernel linear
57 -
58 -###########################################################
59 -# MAIN PROGRAM #
60 -###########################################################
61 -
62 -if __name__ == "__main__":
63 - # Parameter definition
64 - parser = argparse.ArgumentParser(description='Training, cross-validation, and testing on the Binding Thrombin Dataset.')
65 - parser.add_argument("--inputPath", dest="inputPath",
66 - help="Path to read input files", metavar="PATH")
67 - parser.add_argument("--inputTrainingData", dest="inputTrainingData",
68 - help="File to read training data", metavar="FILE")
69 - parser.add_argument("--inputTestingData", dest="inputTestingData",
70 - help="File to read testing data", metavar="FILE")
71 - parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
72 - help="File to read testing classes", metavar="FILE")
73 - parser.add_argument("--outputModelPath", dest="outputModelPath",
74 - help="Path to place output model", metavar="PATH")
75 - parser.add_argument("--outputModelFile", dest="outputModelFile",
76 - help="File to place output model", metavar="FILE")
77 - parser.add_argument("--outputReportPath", dest="outputReportPath",
78 - help="Path to place evaluation report", metavar="PATH")
79 - parser.add_argument("--outputReportFile", dest="outputReportFile",
80 - help="File to place evaluation report", metavar="FILE")
81 - parser.add_argument("--classifier", dest="classifier",
82 - help="Classifier", metavar="NAME",
83 - choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
84 - parser.add_argument("--saveData", dest="saveData", action='store_true',
85 - help="Save matrices")
86 - parser.add_argument("--kernel", dest="kernel",
87 - help="Kernel SVM", metavar="NAME",
88 - choices=('linear', 'rbf', 'poly'), default='linear')
89 - parser.add_argument("--reduction", dest="reduction",
90 - help="Feature selection or dimensionality reduction", metavar="NAME",
91 - choices=('SVD200', 'SVD300', 'CHI250', 'CHI2100'), default=None)
92 -
93 - args = parser.parse_args()
94 -
95 - # Printing parameter values
96 - print('-------------------------------- PARAMETERS --------------------------------')
97 - print("Path to read input files: " + str(args.inputPath))
98 - print("File to read training data: " + str(args.inputTrainingData))
99 - print("File to read testing data: " + str(args.inputTestingData))
100 - print("File to read testing classes: " + str(args.inputTestingClasses))
101 - print("Path to place output model: " + str(args.outputModelPath))
102 - print("File to place output model: " + str(args.outputModelFile))
103 - print("Path to place evaluation report: " + str(args.outputReportPath))
104 - print("File to place evaluation report: " + str(args.outputReportFile))
105 - print("Classifier: " + str(args.classifier))
106 - print("Save matrices: " + str(args.saveData))
107 - print("Kernel: " + str(args.kernel))
108 - print("Reduction: " + str(args.reduction))
109 -
110 - # Start time
111 - t0 = time()
112 -
113 - print("Reading training data and true classes...")
114 - X_train = None
115 - if args.saveData:
116 - y_train = []
117 - trainingData = []
118 - with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
119 - as iFile:
120 - for line in iFile:
121 - line = line.strip('\r\n')
122 - listLine = line.split(',')
123 - y_train.append(listLine[0])
124 - trainingData.append(listLine[1:])
125 - # X_train = np.matrix(trainingData)
126 - X_train = csr_matrix(trainingData, dtype='double')
127 - print(" Saving matrix and classes...")
128 - joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
129 - joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
130 - print(" Done!")
131 - else:
132 - print(" Loading matrix and classes...")
133 - X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
134 - y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
135 - print(" Done!")
136 -
137 - print(" Number of training classes: {}".format(len(y_train)))
138 - print(" Number of training class A: {}".format(y_train.count('A')))
139 - print(" Number of training class I: {}".format(y_train.count('I')))
140 - print(" Shape of training matrix: {}".format(X_train.shape))
141 -
142 - print("Reading testing data and true classes...")
143 - X_test = None
144 - if args.saveData:
145 - y_test = []
146 - testingData = []
147 - with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
148 - as iFile:
149 - for line in iFile:
150 - line = line.strip('\r\n')
151 - listLine = line.split(',')
152 - testingData.append(listLine[1:])
153 - X_test = csr_matrix(testingData, dtype='double')
154 - with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
155 - as iFile:
156 - for line in iFile:
157 - line = line.strip('\r\n')
158 - y_test.append(line)
159 - print(" Saving matrix and classes...")
160 - joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
161 - joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
162 - print(" Done!")
163 - else:
164 - print(" Loading matrix and classes...")
165 - X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
166 - y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
167 - print(" Done!")
168 -
169 - print(" Number of testing classes: {}".format(len(y_test)))
170 - print(" Number of testing class A: {}".format(y_test.count('A')))
171 - print(" Number of testing class I: {}".format(y_test.count('I')))
172 - print(" Shape of testing matrix: {}".format(X_test.shape))
173 -
174 - # Feature selection and dimensional reduction
175 - if args.reduction is not None:
176 - print('Performing dimensionality reduction or feature selection...', args.reduction)
177 - if args.reduction == 'SVD200':
178 - reduc = TruncatedSVD(n_components=200, random_state=42)
179 - X_train = reduc.fit_transform(X_train)
180 - elif args.reduction == 'SVD300':
181 - reduc = TruncatedSVD(n_components=300, random_state=42)
182 - X_train = reduc.fit_transform(X_train)
183 - elif args.reduction == 'CHI250':
184 - reduc = SelectKBest(chi2, k=50)
185 - X_train = reduc.fit_transform(X_train, y_train)
186 - elif args.reduction == 'CHI2100':
187 - reduc = SelectKBest(chi2, k=100)
188 - X_train = reduc.fit_transform(X_train, y_train)
189 - print(" Done!")
190 - print(' New shape of training matrix: ', X_train.shape)
191 -
192 - jobs = -1
193 - paramGrid = {}
194 - nIter = 20
195 - crossV = 10
196 - print("Defining randomized grid search...")
197 - if args.classifier == 'SVM':
198 - # SVM
199 - classifier = SVC()
200 - if args.kernel == 'rbf':
201 - paramGrid = {'C': scipy.stats.expon(scale=100),
202 - 'gamma': scipy.stats.expon(scale=.1),
203 - 'kernel': ['rbf'], 'class_weight': ['balanced', None]}
204 - elif args.kernel == 'linear':
205 - paramGrid = {'C': scipy.stats.expon(scale=100),
206 - 'kernel': ['linear'],
207 - 'class_weight': ['balanced', None]}
208 - elif args.kernel == 'poly':
209 - paramGrid = {'C': scipy.stats.expon(scale=100),
210 - 'gamma': scipy.stats.expon(scale=.1), 'degree': [2, 3],
211 - 'kernel': ['poly'], 'class_weight': ['balanced', None]}
212 - myClassifier = model_selection.RandomizedSearchCV(classifier,
213 - paramGrid, n_iter=nIter,
214 - cv=crossV, n_jobs=jobs, verbose=3)
215 - elif args.classifier == 'BernoulliNB':
216 - # BernoulliNB
217 - classifier = BernoulliNB()
218 - paramGrid = {'alpha': scipy.stats.expon(scale=1.0)}
219 - myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=nIter,
220 - cv=crossV, n_jobs=jobs, verbose=3)
221 - # elif args.classifier == 'kNN':
222 - # # kNN
223 - # k_range = list(range(1, 7, 2))
224 - # classifier = KNeighborsClassifier()
225 - # paramGrid = {'n_neighbors': k_range}
226 - # myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=3,
227 - # cv=crossV, n_jobs=jobs, verbose=3)
228 - else:
229 - print("Bad classifier")
230 - exit()
231 - print(" Done!")
232 -
233 - print("Training...")
234 - myClassifier.fit(X_train, y_train)
235 - print(" Done!")
236 -
237 - print("Testing (prediction in new data)...")
238 - if args.reduction is not None:
239 - X_test = reduc.transform(X_test)
240 - y_pred = myClassifier.predict(X_test)
241 - best_parameters = myClassifier.best_estimator_.get_params()
242 - print(" Done!")
243 -
244 - print("Saving report...")
245 - with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
246 - oFile.write('********** EVALUATION REPORT **********\n')
247 - oFile.write('Reduction: {}\n'.format(args.reduction))
248 - oFile.write('Classifier: {}\n'.format(args.classifier))
249 - oFile.write('Kernel: {}\n'.format(args.kernel))
250 - oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
251 - oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
252 - oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
253 - oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
254 - oFile.write('Confusion matrix: \n')
255 - oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
256 - oFile.write('Classification report: \n')
257 - oFile.write(classification_report(y_test, y_pred) + '\n')
258 - oFile.write('Best parameters: \n')
259 - for param in sorted(best_parameters.keys()):
260 - oFile.write("\t%s: %r\n" % (param, best_parameters[param]))
261 - print(" Done!")
262 -
263 - print("Training and testing done in: %fs" % (time() - t0))
1 -# -*- encoding: utf-8 -*-
2 -
3 -import os
4 -from time import time
5 -import argparse
6 -from sklearn.naive_bayes import BernoulliNB
7 -from sklearn.svm import SVC
8 -from sklearn.neighbors import KNeighborsClassifier
9 -from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
10 - classification_report
11 -import joblib  # sklearn.externals.joblib is deprecated; import joblib directly
12 -from scipy.sparse import csr_matrix
13 -
14 -__author__ = 'CMendezC'
15 -
16 -# Goal: training and testing on the binding thrombin data set
17 -
18 -# Parameters:
19 -# 1) --inputPath Path to read input files.
20 -# 2) --inputTrainingData File to read training data.
21 -# 3) --inputTestingData File to read testing data.
22 -# 4) --inputTestingClasses File to read testing classes.
23 -# 5) --outputModelPath Path to place output model.
24 -# 6) --outputModelFile File to place output model.
25 -# 7) --outputReportPath Path to place evaluation report.
26 -# 8) --outputReportFile File to place evaluation report.
27 -# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
28 -# 10) --saveData Save matrices
29 -
30 -# Output:
31 -# 1) Classification model and evaluation report.
32 -
33 -# Execution:
34 -
35 -# python training-testing-binding-thrombin.py
36 -# --inputPath /home/binding-thrombin-dataset
37 -# --inputTrainingData thrombin.data
38 -# --inputTestingData Thrombin.testset
39 -# --inputTestingClasses Thrombin.testset.class
40 -# --outputModelPath /home/binding-thrombin-dataset/models
41 -# --outputModelFile SVM-model.mod
42 -# --outputReportPath /home/binding-thrombin-dataset/reports
43 -# --outputReportFile SVM.txt
44 -# --classifier SVM
45 -# --saveData
46 -
47 -# source activate python3
48 -# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData
49 -
50 -###########################################################
51 -# MAIN PROGRAM #
52 -###########################################################
53 -
54 -if __name__ == "__main__":
55 - # Parameter definition
56 - parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.')
57 - parser.add_argument("--inputPath", dest="inputPath",
58 - help="Path to read input files", metavar="PATH")
59 - parser.add_argument("--inputTrainingData", dest="inputTrainingData",
60 - help="File to read training data", metavar="FILE")
61 - parser.add_argument("--inputTestingData", dest="inputTestingData",
62 - help="File to read testing data", metavar="FILE")
63 - parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
64 - help="File to read testing classes", metavar="FILE")
65 - parser.add_argument("--outputModelPath", dest="outputModelPath",
66 - help="Path to place output model", metavar="PATH")
67 - parser.add_argument("--outputModelFile", dest="outputModelFile",
68 - help="File to place output model", metavar="FILE")
69 - parser.add_argument("--outputReportPath", dest="outputReportPath",
70 - help="Path to place evaluation report", metavar="PATH")
71 - parser.add_argument("--outputReportFile", dest="outputReportFile",
72 - help="File to place evaluation report", metavar="FILE")
73 - parser.add_argument("--classifier", dest="classifier",
74 - help="Classifier", metavar="NAME",
75 - choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
76 - parser.add_argument("--saveData", dest="saveData", action='store_true',
77 - help="Save matrices")
78 -
79 - args = parser.parse_args()
80 -
81 - # Printing parameter values
82 - print('-------------------------------- PARAMETERS --------------------------------')
83 - print("Path to read input files: " + str(args.inputPath))
84 - print("File to read training data: " + str(args.inputTrainingData))
85 - print("File to read testing data: " + str(args.inputTestingData))
86 - print("File to read testing classes: " + str(args.inputTestingClasses))
87 - print("Path to place output model: " + str(args.outputModelPath))
88 - print("File to place output model: " + str(args.outputModelFile))
89 - print("Path to place evaluation report: " + str(args.outputReportPath))
90 - print("File to place evaluation report: " + str(args.outputReportFile))
91 - print("Classifier: " + str(args.classifier))
92 - print("Save matrices: " + str(args.saveData))
93 -
94 - # Start time
95 - t0 = time()
96 -
97 - print("Reading training data and true classes...")
98 - X_train = None
99 - if args.saveData:
100 - y_train = []
101 - trainingData = []
102 - with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
103 - as iFile:
104 - for line in iFile:
105 - line = line.strip('\r\n')
106 - listLine = line.split(',')
107 - y_train.append(listLine[0])
108 - trainingData.append(listLine[1:])
109 - # X_train = np.matrix(trainingData)
110 - X_train = csr_matrix(trainingData, dtype='double')
111 - print(" Saving matrix and classes...")
112 - joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
113 - joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
114 - print(" Done!")
115 - else:
116 - print(" Loading matrix and classes...")
117 - X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
118 - y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
119 - print(" Done!")
120 -
121 - print(" Number of training classes: {}".format(len(y_train)))
122 - print(" Number of training class A: {}".format(y_train.count('A')))
123 - print(" Number of training class I: {}".format(y_train.count('I')))
124 - print(" Shape of training matrix: {}".format(X_train.shape))
125 -
126 - print("Reading testing data and true classes...")
127 - X_test = None
128 - if args.saveData:
129 - y_test = []
130 - testingData = []
131 - with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
132 - as iFile:
133 - for line in iFile:
134 - line = line.strip('\r\n')
135 - listLine = line.split(',')
136 - testingData.append(listLine[1:])
137 - X_test = csr_matrix(testingData, dtype='double')
138 - with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
139 - as iFile:
140 - for line in iFile:
141 - line = line.strip('\r\n')
142 - y_test.append(line)
143 - print(" Saving matrix and classes...")
144 - joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
145 - joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
146 - print(" Done!")
147 - else:
148 - print(" Loading matrix and classes...")
149 - X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
150 - y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
151 - print(" Done!")
152 -
153 - print(" Number of testing classes: {}".format(len(y_test)))
154 - print(" Number of testing class A: {}".format(y_test.count('A')))
155 - print(" Number of testing class I: {}".format(y_test.count('I')))
156 - print(" Shape of testing matrix: {}".format(X_test.shape))
157 -
158 - if args.classifier == "BernoulliNB":
159 - classifier = BernoulliNB()
160 - elif args.classifier == "SVM":
161 - classifier = SVC()
162 - elif args.classifier == "kNN":
163 - classifier = KNeighborsClassifier()
164 - else:
165 - print("Bad classifier")
166 - exit()
167 -
168 - print("Training...")
169 - classifier.fit(X_train, y_train)
170 - print(" Done!")
171 -
172 - print("Testing (prediction in new data)...")
173 - y_pred = classifier.predict(X_test)
174 - print(" Done!")
175 -
176 - print("Saving report...")
177 - with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
178 - oFile.write('********** EVALUATION REPORT **********\n')
179 - oFile.write('Classifier: {}\n'.format(args.classifier))
180 - oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
181 - oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
182 - oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
183 - oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
184 - oFile.write('Confusion matrix: \n')
185 - oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
186 - oFile.write('Classification report: \n')
187 - oFile.write(classification_report(y_test, y_pred) + '\n')
188 - print(" Done!")
189 -
190 - print("Training and testing done in: %fs" % (time() - t0))