Showing 13 changed files with 1734 additions and 0 deletions
binding-thrombin-dataset/README.testset
0 → 100644
The test set consists of 634 data points, each of which represents
a molecule that is either active (A) or inactive (I). The test set
has the same format as the training set, except that the activity
value (A or I) for each data point is missing, that is, has been
replaced by a question mark (?). Please submit one prediction,
A or I, for each data point. Your submission should be a file that
starts with your contact information, followed by a line of 5
asterisks, followed immediately by your predictions, one line per
data point. The predictions should be in the same order as the test
set data points: your prediction for the first example should appear
on the first line after the asterisks, your prediction for the second
example on the second line after the asterisks, and so on. Hence,
after your contact information, the prediction file will consist of
635 lines and have the form:

*****
I
I
A
I
A
I

etc.

You may submit your prediction by email to page@biostat.wisc.edu
or by anonymous ftp to ftp.biostat.wisc.edu, placing the file
into the directory dropboxes/page/. If using email, please use
the subject line "KDDcup <name> thrombin", where <name> is your
name. If using ftp, please name the file KDDcup.<name>.thrombin,
where <name> is your name. For example, my submission would be
named KDDcup.DavidPage.thrombin.

Only one submission per person per task is permitted. If you do not
receive email confirmation of your submission within 24 hours, please
email page@biostat.wisc.edu with the subject "KDDcup no confirmation".

For group entries, the contact information should include the names
of everyone to be credited as a member of the group should your entry
achieve the highest score. No person may be listed on more than
one entry per task.
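The submission format described above can be generated with a short script. This is only a sketch: the contact details and predictions below are placeholders, and a real file needs 634 predictions, one per test-set molecule, in test-set order.

```python
def write_submission(path, contact_lines, predictions):
    """Write contact info, a line of 5 asterisks, then one A/I prediction per line."""
    with open(path, "w") as out:
        for line in contact_lines:   # contact information comes first
            out.write(line + "\n")
        out.write("*****\n")         # separator line of exactly 5 asterisks
        for p in predictions:        # one prediction per test-set data point
            out.write(p + "\n")

# Toy example with placeholder contact info and only 4 predictions.
write_submission("KDDcup.JaneDoe.thrombin",
                 ["Jane Doe", "jane@example.org"],
                 ["I", "I", "A", "I"])
```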
binding-thrombin-dataset/README.trainingset
0 → 100644
Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
--------------------------------------------------------------------------

Drugs are typically small organic molecules that achieve their desired
activity by binding to a target site on a receptor. The first step in
the discovery of a new drug is usually to identify and isolate the
receptor to which it should bind, followed by testing many small
molecules for their ability to bind to the target site. This leaves
researchers with the task of determining what separates the active
(binding) compounds from the inactive (non-binding) ones. Such a
determination can then be used in the design of new compounds that not
only bind, but also have all the other properties required for a drug
(solubility, oral absorption, lack of side effects, appropriate duration
of action, toxicity, etc.).

The present training data set consists of 1909 compounds tested for
their ability to bind to a target site on thrombin, a key receptor in
blood clotting. The chemical structures of these compounds are not
necessary for our analysis and are not included. Of these compounds, 42
are active (bind well) and the others are inactive. Each compound is
described by a single feature vector comprising a class value (A for
active, I for inactive) and 139,351 binary features, which describe
three-dimensional properties of the molecule. The definitions of the
individual bits are not included: we do not know what each individual
bit means, only that they are generated in an internally consistent
manner for all 1909 compounds. Biological activity in general, and
receptor binding affinity in particular, correlate with various
structural and physical properties of small organic molecules. The task
is to determine which of these properties are critical in this case and
to learn to accurately predict the class value. To simulate the
real-world drug design environment, the test set contains 636 additional
compounds that were in fact generated based on the assay results
recorded for the training set. In evaluating the accuracy, a
differential cost model will be used, so that the sum of the costs of
the actives will be equal to the sum of the costs of the inactives.

We thank DuPont Pharmaceuticals for graciously providing this data set
for the KDD Cup 2001 competition. All publications referring to
analysis of this data set should acknowledge DuPont Pharmaceuticals
Research Laboratories and KDD Cup 2001.
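The differential cost model is not fully specified above. One natural reading (an assumption on our part, not DuPont's published formula) is that each class carries equal total cost, so each active weighs 1/n_A and each inactive weighs 1/n_I, which makes the score a balanced accuracy:

```python
def weighted_score(y_true, y_pred):
    # Assumed cost model: each class's examples share equal total cost,
    # so actives weigh 1/n_A and inactives 1/n_I; a perfect prediction
    # scores 1.0 after dividing by the two classes' total weight.
    n_a = sum(1 for y in y_true if y == "A")
    n_i = len(y_true) - n_a
    score = sum((1.0 / n_a if t == "A" else 1.0 / n_i)
                for t, p in zip(y_true, y_pred) if t == p)
    return score / 2.0

# Half the actives and all the inactives correct:
print(weighted_score(list("AAIIII"), list("AIIIII")))  # prints 0.75
```

Under this model a trivial all-inactive predictor scores only 0.5, even though 42 actives out of 1909 means it would have a very high plain accuracy.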
binding-thrombin-dataset/README.txt
0 → 100644
File mode changed
binding-thrombin-dataset/Thrombin.testset
0 → 100644
This diff could not be displayed because it is too large.
I
A
I
I
I
A
I
I
I
A
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
A
A
I
I
I
A
I
I
I
I
A
A
I
A
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
I
A
A
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
A
I
A
I
A
I
A
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
A
I
I
A
A
A
I
I
A
A
I
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
I
I
I
I
A
A
A
I
A
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
A
I
I
I
I
I
A
A
I
A
I
I
I
I
I
I
A
A
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
I
A
A
I
A
A
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
A
I
I
A
A
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
A
A
A
I
I
I
A
A
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
I
A
A
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
A
I
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
A
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
A
I
A
A
I
A
I
I
I
I
I
I
I
A
I
I
I
A
I
A
I
I
I
I
A
I
A
I
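The block of labels above is a one-value-per-line class file; the training script in this changeset reads `Thrombin.testset.class` in exactly this shape. Tallying such a file takes only a `Counter` (a small sketch; the toy input stands in for the real 634 lines):

```python
from collections import Counter

def count_classes(lines):
    # Tally A/I labels, ignoring blank lines and surrounding whitespace.
    return Counter(line.strip() for line in lines if line.strip())

# Toy input; in practice: count_classes(open("Thrombin.testset.class"))
print(count_classes(["I", "A", "I", "I"]))
```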
# -*- encoding: utf-8 -*-

import os
from time import time
import argparse
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
    classification_report
from sklearn.externals import joblib
from sklearn import model_selection
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
import scipy.stats
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import EditedNearestNeighbours, TomekLinks, \
    OneSidedSelection, RandomUnderSampler, NeighbourhoodCleaningRule, \
    InstanceHardnessThreshold, ClusterCentroids
from imblearn.ensemble import EasyEnsemble, BalanceCascade

__author__ = 'CMendezC'

# Goal: training, cross-validation, and testing on the binding thrombin data set

# Parameters:
# 1) --inputPath Path to read input files.
# 2) --inputTrainingData File to read training data.
# 3) --inputTestingData File to read testing data.
# 4) --inputTestingClasses File to read testing classes.
# 5) --outputModelPath Path to place output model.
# 6) --outputModelFile File to place output model.
# 7) --outputReportPath Path to place evaluation report.
# 8) --outputReportFile File to place evaluation report.
# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
# 10) --saveData Save matrices.
# 11) --kernel SVM kernel.
# 12) --reduction Feature selection or dimensionality reduction.
# 13) --imbalanced Method for handling class imbalance.

# Output:
# 1) Classification model and evaluation report.

# Execution:

# python training-crossvalidation-testing-binding-thrombin.py
# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset
# --inputTrainingData thrombin.data
# --inputTestingData Thrombin.testset
# --inputTestingClasses Thrombin.testset.class
# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models
# --outputModelFile SVM-lineal-model.mod
# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports
# --outputReportFile SVM-lineal.txt
# --classifier SVM
# --saveData
# --kernel linear
# --imbalanced RandomUS

# source activate python3
# python training-crossvalidation-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-lineal-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-lineal.txt --classifier SVM --kernel linear --imbalanced RandomUS

###########################################################
#                     MAIN PROGRAM                        #
###########################################################

if __name__ == "__main__":
    # Parameter definition
    parser = argparse.ArgumentParser(
        description='Training, cross-validation, and testing for the binding thrombin data set.')
    parser.add_argument("--inputPath", dest="inputPath",
                        help="Path to read input files", metavar="PATH")
    parser.add_argument("--inputTrainingData", dest="inputTrainingData",
                        help="File to read training data", metavar="FILE")
    parser.add_argument("--inputTestingData", dest="inputTestingData",
                        help="File to read testing data", metavar="FILE")
    parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
                        help="File to read testing classes", metavar="FILE")
    parser.add_argument("--outputModelPath", dest="outputModelPath",
                        help="Path to place output model", metavar="PATH")
    parser.add_argument("--outputModelFile", dest="outputModelFile",
                        help="File to place output model", metavar="FILE")
    parser.add_argument("--outputReportPath", dest="outputReportPath",
                        help="Path to place evaluation report", metavar="PATH")
    parser.add_argument("--outputReportFile", dest="outputReportFile",
                        help="File to place evaluation report", metavar="FILE")
    parser.add_argument("--classifier", dest="classifier",
                        help="Classifier", metavar="NAME",
                        choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
    parser.add_argument("--saveData", dest="saveData", action='store_true',
                        help="Save matrices")
    parser.add_argument("--kernel", dest="kernel",
                        help="SVM kernel", metavar="NAME",
                        choices=('linear', 'rbf', 'poly'), default='linear')
    parser.add_argument("--reduction", dest="reduction",
                        help="Feature selection or dimensionality reduction", metavar="NAME",
                        choices=('SVD200', 'SVD300', 'CHI250', 'CHI2100'), default=None)
    parser.add_argument("--imbalanced", dest="imbalanced",
                        choices=('RandomUS', 'Tomek', 'ENN', 'NCR',
                                 'IHT', 'OSS', 'ClusterC',
                                 'RandomOS', 'ADASYN', 'SMOTE_reg',
                                 'SMOTE_svm', 'SMOTE_b1', 'SMOTE_b2',
                                 'SMOTE+ENN', 'SMOTE+Tomek',
                                 'Easy', 'Balanced'), default=None,
                        help="Undersampling: RandomUS, Tomek, Edited Nearest Neighbours (ENN), "
                             "Neighbourhood Cleaning Rule (NCR), "
                             "Instance Hardness Threshold (IHT), One Sided Selection (OSS), "
                             "Cluster Centroids (ClusterC). "
                             "Oversampling: RandomOS, ADASYN, SMOTE_reg, "
                             "SMOTE_svm, SMOTE_b1, SMOTE_b2. Combined: "
                             "SMOTE+ENN, SMOTE+Tomek. Ensemble: Easy, Balanced.",
                        metavar="TEXT")

    args = parser.parse_args()

    # Printing parameter values
    print('-------------------------------- PARAMETERS --------------------------------')
    print("Path to read input files: " + str(args.inputPath))
    print("File to read training data: " + str(args.inputTrainingData))
    print("File to read testing data: " + str(args.inputTestingData))
    print("File to read testing classes: " + str(args.inputTestingClasses))
    print("Path to place output model: " + str(args.outputModelPath))
    print("File to place output model: " + str(args.outputModelFile))
    print("Path to place evaluation report: " + str(args.outputReportPath))
    print("File to place evaluation report: " + str(args.outputReportFile))
    print("Classifier: " + str(args.classifier))
    print("Save matrices: " + str(args.saveData))
    print("Kernel: " + str(args.kernel))
    print("Reduction: " + str(args.reduction))
    print("Imbalanced: " + str(args.imbalanced))

    # Start time
    t0 = time()

    print("Reading training data and true classes...")
    X_train = None
    if args.saveData:
        y_train = []
        trainingData = []
        with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                y_train.append(listLine[0])
                trainingData.append(listLine[1:])
        # X_train = np.matrix(trainingData)
        X_train = csr_matrix(trainingData, dtype='double')
        print(" Saving matrix and classes...")
        joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print(" Done!")
    else:
        print(" Loading matrix and classes...")
        X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print(" Done!")

    print(" Number of training classes: {}".format(len(y_train)))
    print(" Number of training class A: {}".format(y_train.count('A')))
    print(" Number of training class I: {}".format(y_train.count('I')))
    print(" Shape of training matrix: {}".format(X_train.shape))

    # Feature selection and dimensionality reduction
    if args.reduction is not None:
        print('Performing dimensionality reduction or feature selection...', args.reduction)
        if args.reduction == 'SVD200':
            reduc = TruncatedSVD(n_components=200, random_state=42)
            X_train = reduc.fit_transform(X_train)
        elif args.reduction == 'SVD300':
            reduc = TruncatedSVD(n_components=300, random_state=42)
            X_train = reduc.fit_transform(X_train)
        elif args.reduction == 'CHI250':
            reduc = SelectKBest(chi2, k=50)
            X_train = reduc.fit_transform(X_train, y_train)
        elif args.reduction == 'CHI2100':
            reduc = SelectKBest(chi2, k=100)
            X_train = reduc.fit_transform(X_train, y_train)
        print(" Done!")
        print(' New shape of training matrix: ', X_train.shape)

    if args.imbalanced is not None:
        t1 = time()
        jobs = 15
        # Combined over- and undersampling
        if args.imbalanced == "SMOTE+ENN":
            sm = SMOTEENN(random_state=42, n_jobs=jobs)
        elif args.imbalanced == "SMOTE+Tomek":
            sm = SMOTETomek(random_state=42, n_jobs=jobs)
        # Oversampling
        elif args.imbalanced == "SMOTE_reg":
            sm = SMOTE(random_state=42, n_jobs=jobs)
        elif args.imbalanced == "SMOTE_svm":
            sm = SMOTE(random_state=42, n_jobs=jobs, kind='svm')
        elif args.imbalanced == "SMOTE_b1":
            sm = SMOTE(random_state=42, n_jobs=jobs, kind='borderline1')
        elif args.imbalanced == "SMOTE_b2":
            sm = SMOTE(random_state=42, n_jobs=jobs, kind='borderline2')
        elif args.imbalanced == "RandomOS":
            sm = RandomOverSampler(random_state=42)
        # Undersampling
        elif args.imbalanced == "ENN":
            sm = EditedNearestNeighbours(random_state=42, n_jobs=jobs)
        elif args.imbalanced == "Tomek":
            sm = TomekLinks(random_state=42, n_jobs=jobs)
        elif args.imbalanced == "OSS":
            sm = OneSidedSelection(random_state=42, n_jobs=jobs)
        elif args.imbalanced == "RandomUS":
            sm = RandomUnderSampler(random_state=42)
        elif args.imbalanced == "NCR":
            sm = NeighbourhoodCleaningRule(random_state=42, n_jobs=jobs)
        elif args.imbalanced == "IHT":
            sm = InstanceHardnessThreshold(random_state=42, n_jobs=jobs)
        elif args.imbalanced == "ClusterC":
            sm = ClusterCentroids(random_state=42, n_jobs=jobs)
        # Ensemble
        elif args.imbalanced == "Balanced":
            sm = BalanceCascade(random_state=42)
        elif args.imbalanced == "Easy":
            sm = EasyEnsemble(random_state=42, n_subsets=3)
        elif args.imbalanced == "ADASYN":
            sm = ADASYN(random_state=42, n_jobs=jobs)

        # Apply transformation
        X_train, y_train = sm.fit_sample(X_train, y_train)

        print(" After transformation with {}".format(args.imbalanced))
        print(" Number of training classes: {}".format(len(y_train)))
        print(" Number of training class A: {}".format(list(y_train).count('A')))
        print(" Number of training class I: {}".format(list(y_train).count('I')))
        print(" Shape of training matrix: {}".format(X_train.shape))
        print(" Data transformation done in: %fs" % (time() - t1))

    print("Reading testing data and true classes...")
    X_test = None
    if args.saveData:
        y_test = []
        testingData = []
        with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                testingData.append(listLine[1:])
        X_test = csr_matrix(testingData, dtype='double')
        with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                y_test.append(line)
        print(" Saving matrix and classes...")
        joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print(" Done!")
    else:
        print(" Loading matrix and classes...")
        X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print(" Done!")

    print(" Number of testing classes: {}".format(len(y_test)))
    print(" Number of testing class A: {}".format(y_test.count('A')))
    print(" Number of testing class I: {}".format(y_test.count('I')))
    print(" Shape of testing matrix: {}".format(X_test.shape))

    jobs = -1
    paramGrid = []
    nIter = 20
    crossV = 10
    print("Defining randomized grid search...")
    if args.classifier == 'SVM':
        # SVM
        classifier = SVC()
        if args.kernel == 'rbf':
            paramGrid = {'C': scipy.stats.expon(scale=100),
                         'gamma': scipy.stats.expon(scale=.1),
                         'kernel': ['rbf'], 'class_weight': ['balanced', None]}
        elif args.kernel == 'linear':
            paramGrid = {'C': scipy.stats.expon(scale=100),
                         'kernel': ['linear'],
                         'class_weight': ['balanced', None]}
        elif args.kernel == 'poly':
            paramGrid = {'C': scipy.stats.expon(scale=100),
                         'gamma': scipy.stats.expon(scale=.1), 'degree': [2, 3],
                         'kernel': ['poly'], 'class_weight': ['balanced', None]}
        myClassifier = model_selection.RandomizedSearchCV(classifier,
                                                          paramGrid, n_iter=nIter,
                                                          cv=crossV, n_jobs=jobs, verbose=3)
    elif args.classifier == 'BernoulliNB':
        # BernoulliNB
        classifier = BernoulliNB()
        paramGrid = {'alpha': scipy.stats.expon(scale=1.0)}
        myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=nIter,
                                                          cv=crossV, n_jobs=jobs, verbose=3)
    # elif args.classifier == 'kNN':
    #     # kNN
    #     k_range = list(range(1, 7, 2))
    #     classifier = KNeighborsClassifier()
    #     paramGrid = {'n_neighbors': k_range}
    #     myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=3,
    #                                                       cv=crossV, n_jobs=jobs, verbose=3)
    else:
        print("Bad classifier")
        exit()
    print(" Done!")

302 | + print("Training...") | ||
303 | + myClassifier.fit(X_train, y_train) | ||
304 | + print(" Done!") | ||
305 | + | ||
306 | + print("Testing (prediction in new data)...") | ||
307 | + if args.reduction is not None: | ||
308 | + X_test = reduc.transform(X_test) | ||
309 | + y_pred = myClassifier.predict(X_test) | ||
310 | + best_parameters = myClassifier.best_estimator_.get_params() | ||
311 | + print(" Done!") | ||
312 | + | ||
313 | + print("Saving report...") | ||
314 | + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
315 | + oFile.write('********** EVALUATION REPORT **********\n') | ||
316 | + oFile.write('Reduction: {}\n'.format(args.reduction)) | ||
317 | + oFile.write('Classifier: {}\n'.format(args.classifier)) | ||
318 | + oFile.write('Kernel: {}\n'.format(args.kernel)) | ||
319 | + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred))) | ||
320 | + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted'))) | ||
321 | + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted'))) | ||
322 | + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted'))) | ||
323 | + oFile.write('Confusion matrix: \n') | ||
324 | + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n') | ||
325 | + oFile.write('Classification report: \n') | ||
326 | + oFile.write(classification_report(y_test, y_pred) + '\n') | ||
327 | + oFile.write('Best parameters: \n') | ||
328 | + for param in sorted(best_parameters.keys()): | ||
329 | + oFile.write("\t%s: %r\n" % (param, best_parameters[param])) | ||
330 | + print(" Done!") | ||
331 | + | ||
332 | + print("Training and testing done in: %fs" % (time() - t0)) |
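The tail of the script above reapplies the already-fitted reducer to the test matrix (`reduc.transform(X_test)`) instead of refitting it, so train and test land in the same projected space. A minimal sketch of that fit-on-train / transform-on-test pattern with `TruncatedSVD` (matrix sizes here are illustrative, not the thrombin dimensions):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Illustrative stand-ins for the sparse thrombin feature matrices.
X_train = sparse_random(50, 30, density=0.3, random_state=42)
X_test = sparse_random(10, 30, density=0.3, random_state=7)

reduc = TruncatedSVD(n_components=5, random_state=42)
X_train_red = reduc.fit_transform(X_train)  # learn the projection on training data only
X_test_red = reduc.transform(X_test)        # project test data with the same components

print(X_train_red.shape, X_test_red.shape)  # (50, 5) (10, 5)
```

Refitting the reducer on the test set would yield components incompatible with the trained classifier, which is why the script keeps the `reduc` object around.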
1 | +# -*- encoding: utf-8 -*- | ||
2 | + | ||
3 | +import os | ||
4 | +from time import time | ||
5 | +import argparse | ||
6 | +from sklearn.naive_bayes import BernoulliNB | ||
7 | +from sklearn.svm import SVC | ||
8 | +from sklearn.neighbors import KNeighborsClassifier | ||
9 | +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \ | ||
10 | + classification_report | ||
11 | +from sklearn.externals import joblib | ||
12 | +from sklearn import model_selection | ||
13 | +from scipy.sparse import csr_matrix | ||
14 | +import scipy | ||
15 | +from imblearn.under_sampling import RandomUnderSampler | ||
16 | +from imblearn.over_sampling import RandomOverSampler | ||
17 | + | ||
18 | +__author__ = 'CMendezC' | ||
19 | + | ||
20 | +# Goal: training and testing of the binding thrombin data set with class-imbalance sampling | ||
21 | + | ||
22 | +# Parameters: | ||
23 | +# 1) --inputPath Path to read input files. | ||
24 | +# 2) --inputTrainingData File to read training data. | ||
25 | +# 3) --inputTestingData File to read testing data. | ||
26 | +# 4) --inputTestingClasses File to read testing classes. | ||
27 | +# 5) --outputModelPath Path to place output model. | ||
28 | +# 6) --outputModelFile File to place output model. | ||
29 | +# 7) --outputReportPath Path to place evaluation report. | ||
30 | +# 8) --outputReportFile File to place evaluation report. | ||
31 | +# 9) --classifier Classifier: BernoulliNB, SVM, kNN. | ||
32 | +# 10) --saveData Save matrices | ||
33 | +# 11) --kernel Kernel | ||
34 | +# 12) --imbalanced Imbalanced method | ||
35 | + | ||
36 | +# Output: | ||
37 | +# 1) Classification model and evaluation report. | ||
38 | + | ||
39 | +# Execution: | ||
40 | + | ||
41 | +# python imb-training-testing-binding-thrombin.py | ||
42 | +# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset | ||
43 | +# --inputTrainingData thrombin.data | ||
44 | +# --inputTestingData Thrombin.testset | ||
45 | +# --inputTestingClasses Thrombin.testset.class | ||
46 | +# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models | ||
47 | +# --outputModelFile SVM-lineal-model.mod | ||
48 | +# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports | ||
49 | +# --outputReportFile SVM-lineal.txt | ||
50 | +# --classifier SVM | ||
51 | +# --saveData | ||
52 | +# --kernel linear | ||
53 | +# --imbalanced RandomUS | ||
54 | + | ||
55 | +# source activate python3 | ||
56 | +# python imb-training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-lineal-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-lineal.txt --classifier SVM --kernel linear | ||
57 | +# python imb-training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-lineal-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-lineal-RandomUS.txt --classifier SVM --kernel linear --imbalanced RandomUS | ||
58 | + | ||
62 | +########################################################### | ||
63 | +# MAIN PROGRAM # | ||
64 | +########################################################### | ||
65 | + | ||
66 | +if __name__ == "__main__": | ||
67 | + # Parameter definition | ||
68 | + parser = argparse.ArgumentParser(description='Training validation Binding Thrombin Dataset.') | ||
69 | + parser.add_argument("--inputPath", dest="inputPath", | ||
70 | + help="Path to read input files", metavar="PATH") | ||
71 | + parser.add_argument("--inputTrainingData", dest="inputTrainingData", | ||
72 | + help="File to read training data", metavar="FILE") | ||
73 | + parser.add_argument("--inputTestingData", dest="inputTestingData", | ||
74 | + help="File to read testing data", metavar="FILE") | ||
75 | + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses", | ||
76 | + help="File to read testing classes", metavar="FILE") | ||
77 | + parser.add_argument("--outputModelPath", dest="outputModelPath", | ||
78 | + help="Path to place output model", metavar="PATH") | ||
79 | + parser.add_argument("--outputModelFile", dest="outputModelFile", | ||
80 | + help="File to place output model", metavar="FILE") | ||
81 | + parser.add_argument("--outputReportPath", dest="outputReportPath", | ||
82 | + help="Path to place evaluation report", metavar="PATH") | ||
83 | + parser.add_argument("--outputReportFile", dest="outputReportFile", | ||
84 | + help="File to place evaluation report", metavar="FILE") | ||
85 | + parser.add_argument("--classifier", dest="classifier", | ||
86 | + help="Classifier", metavar="NAME", | ||
87 | + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM') | ||
88 | + parser.add_argument("--saveData", dest="saveData", action='store_true', | ||
89 | + help="Save matrices") | ||
90 | + parser.add_argument("--kernel", dest="kernel", | ||
91 | + help="Kernel SVM", metavar="NAME", | ||
92 | + choices=('linear', 'rbf', 'poly'), default='linear') | ||
93 | + parser.add_argument("--imbalanced", dest="imbalanced", | ||
94 | + choices=('RandomUS', 'RandomOS'), default=None, | ||
95 | + help="Undersampling: RandomUS. Oversampling: RandomOS", metavar="TEXT") | ||
96 | + | ||
97 | + args = parser.parse_args() | ||
98 | + | ||
99 | + # Printing parameter values | ||
100 | + print('-------------------------------- PARAMETERS --------------------------------') | ||
101 | + print("Path to read input files: " + str(args.inputPath)) | ||
102 | + print("File to read training data: " + str(args.inputTrainingData)) | ||
103 | + print("File to read testing data: " + str(args.inputTestingData)) | ||
104 | + print("File to read testing classes: " + str(args.inputTestingClasses)) | ||
105 | + print("Path to place output model: " + str(args.outputModelPath)) | ||
106 | + print("File to place output model: " + str(args.outputModelFile)) | ||
107 | + print("Path to place evaluation report: " + str(args.outputReportPath)) | ||
108 | + print("File to place evaluation report: " + str(args.outputReportFile)) | ||
109 | + print("Classifier: " + str(args.classifier)) | ||
110 | + print("Save matrices: " + str(args.saveData)) | ||
111 | + print("Kernel: " + str(args.kernel)) | ||
112 | + print("Imbalanced: " + str(args.imbalanced)) | ||
113 | + | ||
114 | + # Start time | ||
115 | + t0 = time() | ||
116 | + | ||
117 | + print("Reading training data and true classes...") | ||
118 | + X_train = None | ||
119 | + if args.saveData: | ||
120 | + y_train = [] | ||
121 | + trainingData = [] | ||
122 | + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \ | ||
123 | + as iFile: | ||
124 | + for line in iFile: | ||
125 | + line = line.strip('\r\n') | ||
126 | + listLine = line.split(',') | ||
127 | + y_train.append(listLine[0]) | ||
128 | + trainingData.append(listLine[1:]) | ||
129 | + # X_train = np.matrix(trainingData) | ||
130 | + X_train = csr_matrix(trainingData, dtype='double') | ||
131 | + print(" Saving matrix and classes...") | ||
132 | + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
133 | + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
134 | + print(" Done!") | ||
135 | + else: | ||
136 | + print(" Loading matrix and classes...") | ||
137 | + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
138 | + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
139 | + print(" Done!") | ||
140 | + | ||
141 | + print(" Number of training classes: {}".format(len(y_train))) | ||
142 | + print(" Number of training class A: {}".format(y_train.count('A'))) | ||
143 | + print(" Number of training class I: {}".format(y_train.count('I'))) | ||
144 | + print(" Shape of training matrix: {}".format(X_train.shape)) | ||
145 | + | ||
146 | + if args.imbalanced is not None: | ||
147 | + t1 = time() | ||
148 | + # Random over- or under-sampling | ||
150 | + if args.imbalanced == "RandomOS": | ||
151 | + sm = RandomOverSampler(random_state=42) | ||
152 | + # Under sampling | ||
153 | + elif args.imbalanced == "RandomUS": | ||
154 | + sm = RandomUnderSampler(random_state=42) | ||
155 | + | ||
156 | + # Apply transformation | ||
157 | + X_train, y_train = sm.fit_sample(X_train, y_train) | ||
158 | + | ||
159 | + print(" After transformation with {}".format(args.imbalanced)) | ||
160 | + print(" Number of training classes: {}".format(len(y_train))) | ||
161 | + print(" Number of training class A: {}".format(list(y_train).count('A'))) | ||
162 | + print(" Number of training class I: {}".format(list(y_train).count('I'))) | ||
163 | + print(" Shape of training matrix: {}".format(X_train.shape)) | ||
164 | + print(" Data transformation done in: %fs" % (time() - t1)) | ||
165 | + | ||
166 | + print("Reading testing data and true classes...") | ||
167 | + X_test = None | ||
168 | + if args.saveData: | ||
169 | + y_test = [] | ||
170 | + testingData = [] | ||
171 | + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \ | ||
172 | + as iFile: | ||
173 | + for line in iFile: | ||
174 | + line = line.strip('\r\n') | ||
175 | + listLine = line.split(',') | ||
176 | + testingData.append(listLine[1:]) | ||
177 | + X_test = csr_matrix(testingData, dtype='double') | ||
178 | + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \ | ||
179 | + as iFile: | ||
180 | + for line in iFile: | ||
181 | + line = line.strip('\r\n') | ||
182 | + y_test.append(line) | ||
183 | + print(" Saving matrix and classes...") | ||
184 | + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
185 | + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
186 | + print(" Done!") | ||
187 | + else: | ||
188 | + print(" Loading matrix and classes...") | ||
189 | + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
190 | + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
191 | + print(" Done!") | ||
192 | + | ||
193 | + print(" Number of testing classes: {}".format(len(y_test))) | ||
194 | + print(" Number of testing class A: {}".format(y_test.count('A'))) | ||
195 | + print(" Number of testing class I: {}".format(y_test.count('I'))) | ||
196 | + print(" Shape of testing matrix: {}".format(X_test.shape)) | ||
197 | + | ||
198 | + if args.classifier == 'SVM': | ||
199 | + # SVM | ||
200 | + myClassifier = SVC(kernel=args.kernel) | ||
201 | + elif args.classifier == 'BernoulliNB': | ||
202 | + # BernoulliNB | ||
203 | + myClassifier = BernoulliNB() | ||
204 | + elif args.classifier == 'kNN': | ||
205 | + # kNN | ||
206 | + myClassifier = KNeighborsClassifier() | ||
207 | + else: | ||
208 | + print("Bad classifier") | ||
209 | + exit() | ||
210 | + print(" Done!") | ||
211 | + | ||
212 | + print("Training...") | ||
213 | + myClassifier.fit(X_train, y_train) | ||
214 | + print(" Done!") | ||
215 | + | ||
216 | + y_pred = myClassifier.predict(X_test) | ||
217 | + print(" Done!") | ||
218 | + | ||
219 | + print("Saving report...") | ||
220 | + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
221 | + oFile.write('********** EVALUATION REPORT **********\n') | ||
222 | + oFile.write('Classifier: {}\n'.format(args.classifier)) | ||
223 | + oFile.write('Kernel: {}\n'.format(args.kernel)) | ||
224 | + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred))) | ||
225 | + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted'))) | ||
226 | + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted'))) | ||
227 | + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted'))) | ||
228 | + oFile.write('Confusion matrix: \n') | ||
229 | + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n') | ||
230 | + oFile.write('Classification report: \n') | ||
231 | + oFile.write(classification_report(y_test, y_pred) + '\n') | ||
232 | + print(" Done!") | ||
233 | + | ||
234 | + print("Training and testing done in: %fs" % (time() - t0)) |
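The `--imbalanced` option in the script above delegates to imblearn's `RandomUnderSampler`/`RandomOverSampler` (later imblearn releases renamed `fit_sample` to `fit_resample`). Conceptually, random undersampling just drops majority-class rows at random until the classes match. A numpy-only sketch of that idea (the function and variable names here are illustrative, not from the script):

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Drop majority-class rows at random until all classes are balanced."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.extend(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(keep)
    return X[keep], y[keep]

# Toy imbalanced data: 2 actives ('A') vs 8 inactives ('I').
X = np.arange(20).reshape(10, 2)
y = np.array(['A'] * 2 + ['I'] * 8)
X_bal, y_bal = random_undersample(X, y)
print(X_bal.shape)  # (4, 2) -- two rows kept per class
```

Random oversampling is the mirror image: sample minority-class indices *with* replacement up to the majority count, which is why the script only needs to swap the sampler object.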
binding-thrombin-dataset/models/delete-me
0 → 100644
File mode changed
binding-thrombin-dataset/reports/delete-me
0 → 100644
File mode changed
binding-thrombin-dataset/thrombin.data
0 → 100644
This diff could not be displayed because it is too large.
binding-thrombin-dataset/thrombin.names
0 → 100644
This diff could not be displayed because it is too large.
1 | +# -*- encoding: utf-8 -*- | ||
2 | + | ||
3 | +import os | ||
4 | +from time import time | ||
5 | +import argparse | ||
6 | +from sklearn.naive_bayes import BernoulliNB | ||
7 | +from sklearn.svm import SVC | ||
8 | +from sklearn.neighbors import KNeighborsClassifier | ||
9 | +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \ | ||
10 | + classification_report | ||
11 | +from sklearn.externals import joblib | ||
12 | +from sklearn import model_selection | ||
13 | +from sklearn.feature_selection import SelectKBest, chi2 | ||
14 | +from sklearn.decomposition import TruncatedSVD | ||
15 | +from scipy.sparse import csr_matrix | ||
16 | +import scipy | ||
17 | + | ||
18 | +__author__ = 'CMendezC' | ||
19 | + | ||
20 | +# Goal: training, cross-validation, and testing of the binding thrombin data set | ||
21 | + | ||
22 | +# Parameters: | ||
23 | +# 1) --inputPath Path to read input files. | ||
24 | +# 2) --inputTrainingData File to read training data. | ||
25 | +# 3) --inputTestingData File to read testing data. | ||
26 | +# 4) --inputTestingClasses File to read testing classes. | ||
27 | +# 5) --outputModelPath Path to place output model. | ||
28 | +# 6) --outputModelFile File to place output model. | ||
29 | +# 7) --outputReportPath Path to place evaluation report. | ||
30 | +# 8) --outputReportFile File to place evaluation report. | ||
31 | +# 9) --classifier Classifier: BernoulliNB, SVM, kNN. | ||
32 | +# 10) --saveData Save matrices | ||
33 | +# 11) --kernel Kernel | ||
34 | +# 12) --reduction Feature selection or dimensionality reduction | ||
35 | + | ||
36 | +# Output: | ||
37 | +# 1) Classification model and evaluation report. | ||
38 | + | ||
39 | +# Execution: | ||
40 | + | ||
41 | +# python training-crossvalidation-testing-binding-thrombin.py | ||
42 | +# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset | ||
43 | +# --inputTrainingData thrombin.data | ||
44 | +# --inputTestingData Thrombin.testset | ||
45 | +# --inputTestingClasses Thrombin.testset.class | ||
46 | +# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models | ||
47 | +# --outputModelFile SVM-model.mod | ||
48 | +# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports | ||
49 | +# --outputReportFile SVM.txt | ||
50 | +# --classifier SVM | ||
51 | +# --saveData | ||
52 | +# --kernel linear | ||
53 | +# --reduction SVD200 | ||
54 | + | ||
55 | +# source activate python3 | ||
56 | +# python training-crossvalidation-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-linear-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-linear.txt --classifier SVM --kernel rbf | ||
57 | + | ||
58 | +########################################################### | ||
59 | +# MAIN PROGRAM # | ||
60 | +########################################################### | ||
61 | + | ||
62 | +if __name__ == "__main__": | ||
63 | + # Parameter definition | ||
64 | + parser = argparse.ArgumentParser(description='Training validation Binding Thrombin Dataset.') | ||
65 | + parser.add_argument("--inputPath", dest="inputPath", | ||
66 | + help="Path to read input files", metavar="PATH") | ||
67 | + parser.add_argument("--inputTrainingData", dest="inputTrainingData", | ||
68 | + help="File to read training data", metavar="FILE") | ||
69 | + parser.add_argument("--inputTestingData", dest="inputTestingData", | ||
70 | + help="File to read testing data", metavar="FILE") | ||
71 | + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses", | ||
72 | + help="File to read testing classes", metavar="FILE") | ||
73 | + parser.add_argument("--outputModelPath", dest="outputModelPath", | ||
74 | + help="Path to place output model", metavar="PATH") | ||
75 | + parser.add_argument("--outputModelFile", dest="outputModelFile", | ||
76 | + help="File to place output model", metavar="FILE") | ||
77 | + parser.add_argument("--outputReportPath", dest="outputReportPath", | ||
78 | + help="Path to place evaluation report", metavar="PATH") | ||
79 | + parser.add_argument("--outputReportFile", dest="outputReportFile", | ||
80 | + help="File to place evaluation report", metavar="FILE") | ||
81 | + parser.add_argument("--classifier", dest="classifier", | ||
82 | + help="Classifier", metavar="NAME", | ||
83 | + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM') | ||
84 | + parser.add_argument("--saveData", dest="saveData", action='store_true', | ||
85 | + help="Save matrices") | ||
86 | + parser.add_argument("--kernel", dest="kernel", | ||
87 | + help="Kernel SVM", metavar="NAME", | ||
88 | + choices=('linear', 'rbf', 'poly'), default='linear') | ||
89 | + parser.add_argument("--reduction", dest="reduction", | ||
90 | + help="Feature selection or dimensionality reduction", metavar="NAME", | ||
91 | + choices=('SVD200', 'SVD300', 'CHI250', 'CHI2100'), default=None) | ||
92 | + | ||
93 | + args = parser.parse_args() | ||
94 | + | ||
95 | + # Printing parameter values | ||
96 | + print('-------------------------------- PARAMETERS --------------------------------') | ||
97 | + print("Path to read input files: " + str(args.inputPath)) | ||
98 | + print("File to read training data: " + str(args.inputTrainingData)) | ||
99 | + print("File to read testing data: " + str(args.inputTestingData)) | ||
100 | + print("File to read testing classes: " + str(args.inputTestingClasses)) | ||
101 | + print("Path to place output model: " + str(args.outputModelPath)) | ||
102 | + print("File to place output model: " + str(args.outputModelFile)) | ||
103 | + print("Path to place evaluation report: " + str(args.outputReportPath)) | ||
104 | + print("File to place evaluation report: " + str(args.outputReportFile)) | ||
105 | + print("Classifier: " + str(args.classifier)) | ||
106 | + print("Save matrices: " + str(args.saveData)) | ||
107 | + print("Kernel: " + str(args.kernel)) | ||
108 | + print("Reduction: " + str(args.reduction)) | ||
109 | + | ||
110 | + # Start time | ||
111 | + t0 = time() | ||
112 | + | ||
113 | + print("Reading training data and true classes...") | ||
114 | + X_train = None | ||
115 | + if args.saveData: | ||
116 | + y_train = [] | ||
117 | + trainingData = [] | ||
118 | + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \ | ||
119 | + as iFile: | ||
120 | + for line in iFile: | ||
121 | + line = line.strip('\r\n') | ||
122 | + listLine = line.split(',') | ||
123 | + y_train.append(listLine[0]) | ||
124 | + trainingData.append(listLine[1:]) | ||
125 | + # X_train = np.matrix(trainingData) | ||
126 | + X_train = csr_matrix(trainingData, dtype='double') | ||
127 | + print(" Saving matrix and classes...") | ||
128 | + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
129 | + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
130 | + print(" Done!") | ||
131 | + else: | ||
132 | + print(" Loading matrix and classes...") | ||
133 | + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
134 | + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
135 | + print(" Done!") | ||
136 | + | ||
137 | + print(" Number of training classes: {}".format(len(y_train))) | ||
138 | + print(" Number of training class A: {}".format(y_train.count('A'))) | ||
139 | + print(" Number of training class I: {}".format(y_train.count('I'))) | ||
140 | + print(" Shape of training matrix: {}".format(X_train.shape)) | ||
141 | + | ||
142 | + print("Reading testing data and true classes...") | ||
143 | + X_test = None | ||
144 | + if args.saveData: | ||
145 | + y_test = [] | ||
146 | + testingData = [] | ||
147 | + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \ | ||
148 | + as iFile: | ||
149 | + for line in iFile: | ||
150 | + line = line.strip('\r\n') | ||
151 | + listLine = line.split(',') | ||
152 | + testingData.append(listLine[1:]) | ||
153 | + X_test = csr_matrix(testingData, dtype='double') | ||
154 | + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \ | ||
155 | + as iFile: | ||
156 | + for line in iFile: | ||
157 | + line = line.strip('\r\n') | ||
158 | + y_test.append(line) | ||
159 | + print(" Saving matrix and classes...") | ||
160 | + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
161 | + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
162 | + print(" Done!") | ||
163 | + else: | ||
164 | + print(" Loading matrix and classes...") | ||
165 | + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
166 | + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
167 | + print(" Done!") | ||
168 | + | ||
169 | + print(" Number of testing classes: {}".format(len(y_test))) | ||
170 | + print(" Number of testing class A: {}".format(y_test.count('A'))) | ||
171 | + print(" Number of testing class I: {}".format(y_test.count('I'))) | ||
172 | + print(" Shape of testing matrix: {}".format(X_test.shape)) | ||
173 | + | ||
174 | + # Feature selection and dimensional reduction | ||
175 | + if args.reduction is not None: | ||
176 | + print('Performing dimensionality reduction or feature selection...', args.reduction) | ||
177 | + if args.reduction == 'SVD200': | ||
178 | + reduc = TruncatedSVD(n_components=200, random_state=42) | ||
179 | + X_train = reduc.fit_transform(X_train) | ||
180 | + elif args.reduction == 'SVD300': | ||
181 | + reduc = TruncatedSVD(n_components=300, random_state=42) | ||
182 | + X_train = reduc.fit_transform(X_train) | ||
183 | + elif args.reduction == 'CHI250': | ||
184 | + reduc = SelectKBest(chi2, k=50) | ||
185 | + X_train = reduc.fit_transform(X_train, y_train) | ||
186 | + elif args.reduction == 'CHI2100': | ||
187 | + reduc = SelectKBest(chi2, k=100) | ||
188 | + X_train = reduc.fit_transform(X_train, y_train) | ||
189 | + print(" Done!") | ||
190 | + print(' New shape of training matrix: ', X_train.shape) | ||
191 | + | ||
192 | + jobs = -1 | ||
193 | + paramGrid = [] | ||
194 | + nIter = 20 | ||
195 | + crossV = 10 | ||
196 | + print("Defining randomized grid search...") | ||
197 | + if args.classifier == 'SVM': | ||
198 | + # SVM | ||
199 | + classifier = SVC() | ||
200 | + if args.kernel == 'rbf': | ||
201 | + paramGrid = {'C': scipy.stats.expon(scale=100), | ||
202 | + 'gamma': scipy.stats.expon(scale=.1), | ||
203 | + 'kernel': ['rbf'], 'class_weight': ['balanced', None]} | ||
204 | + elif args.kernel == 'linear': | ||
205 | + paramGrid = {'C': scipy.stats.expon(scale=100), | ||
206 | + 'kernel': ['linear'], | ||
207 | + 'class_weight': ['balanced', None]} | ||
208 | + elif args.kernel == 'poly': | ||
209 | + paramGrid = {'C': scipy.stats.expon(scale=100), | ||
210 | + 'gamma': scipy.stats.expon(scale=.1), 'degree': [2, 3], | ||
211 | + 'kernel': ['poly'], 'class_weight': ['balanced', None]} | ||
212 | + myClassifier = model_selection.RandomizedSearchCV(classifier, | ||
213 | + paramGrid, n_iter=nIter, | ||
214 | + cv=crossV, n_jobs=jobs, verbose=3) | ||
215 | + elif args.classifier == 'BernoulliNB': | ||
216 | + # BernoulliNB | ||
217 | + classifier = BernoulliNB() | ||
218 | + paramGrid = {'alpha': scipy.stats.expon(scale=1.0)} | ||
219 | + myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=nIter, | ||
220 | + cv=crossV, n_jobs=jobs, verbose=3) | ||
221 | + # elif args.classifier == 'kNN': | ||
222 | + # # kNN | ||
223 | + # k_range = list(range(1, 7, 2)) | ||
224 | + # classifier = KNeighborsClassifier() | ||
225 | + # paramGrid = {'n_neighbors': k_range} | ||
226 | + # myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=3, | ||
227 | + # cv=crossV, n_jobs=jobs, verbose=3) | ||
228 | + else: | ||
229 | + print("Bad classifier") | ||
230 | + exit() | ||
231 | + print(" Done!") | ||
232 | + | ||
233 | + print("Training...") | ||
234 | + myClassifier.fit(X_train, y_train) | ||
235 | + print(" Done!") | ||
236 | + | ||
237 | + print("Testing (prediction on new data)...") | ||
238 | + if args.reduction is not None: | ||
239 | + X_test = reduc.transform(X_test) | ||
240 | + y_pred = myClassifier.predict(X_test) | ||
241 | + best_parameters = myClassifier.best_estimator_.get_params() | ||
242 | + print(" Done!") | ||
243 | + | ||
244 | + print("Saving report...") | ||
245 | + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
246 | + oFile.write('********** EVALUATION REPORT **********\n') | ||
247 | + oFile.write('Reduction: {}\n'.format(args.reduction)) | ||
248 | + oFile.write('Classifier: {}\n'.format(args.classifier)) | ||
249 | + oFile.write('Kernel: {}\n'.format(args.kernel)) | ||
250 | + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred))) | ||
251 | + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted'))) | ||
252 | + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted'))) | ||
253 | + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted'))) | ||
254 | + oFile.write('Confusion matrix: \n') | ||
255 | + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n') | ||
256 | + oFile.write('Classification report: \n') | ||
257 | + oFile.write(classification_report(y_test, y_pred) + '\n') | ||
258 | + oFile.write('Best parameters: \n') | ||
259 | + for param in sorted(best_parameters.keys()): | ||
260 | + oFile.write("\t%s: %r\n" % (param, best_parameters[param])) | ||
261 | + print(" Done!") | ||
262 | + | ||
263 | + print("Training and testing done in: %fs" % (time() - t0)) |
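The randomized grid search above draws `C` from `scipy.stats.expon(scale=100)` instead of enumerating a fixed grid, so `n_iter` controls how many parameter combinations are sampled and cross-validated. A self-contained sketch of that pattern on toy data (the dataset and the small `n_iter`/`cv` values are illustrative, not the script's settings):

```python
import scipy.stats
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy stand-in for the thrombin matrices.
X, y = make_classification(n_samples=120, n_features=20, random_state=42)

# Continuous distribution for C; discrete choices for the rest.
paramGrid = {'C': scipy.stats.expon(scale=100),
             'kernel': ['linear'],
             'class_weight': ['balanced', None]}
search = model_selection.RandomizedSearchCV(
    SVC(), paramGrid, n_iter=5, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_['kernel'])  # linear
```

After fitting, `search.best_estimator_.get_params()` yields the dictionary the script writes into its "Best parameters" report section, and `search.predict` uses the refitted best estimator.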
1 | +# -*- encoding: utf-8 -*- | ||
2 | + | ||
3 | +import os | ||
4 | +from time import time | ||
5 | +import argparse | ||
6 | +from sklearn.naive_bayes import BernoulliNB | ||
7 | +from sklearn.svm import SVC | ||
8 | +from sklearn.neighbors import KNeighborsClassifier | ||
9 | +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \ | ||
10 | + classification_report | ||
11 | +from sklearn.externals import joblib | ||
12 | +from scipy.sparse import csr_matrix | ||
13 | + | ||
14 | +__author__ = 'CMendezC' | ||
15 | + | ||
16 | +# Goal: training and testing of the binding thrombin data set | ||
17 | + | ||
18 | +# Parameters: | ||
19 | +# 1) --inputPath Path to read input files. | ||
20 | +# 2) --inputTrainingData File to read training data. | ||
21 | +# 3) --inputTestingData File to read testing data. | ||
22 | +# 4) --inputTestingClasses File to read testing classes. | ||
23 | +# 5) --outputModelPath Path to place output model. | ||
24 | +# 6) --outputModelFile File to place output model. | ||
25 | +# 7) --outputReportPath Path to place evaluation report. | ||
26 | +# 8) --outputReportFile File to place evaluation report. | ||
27 | +# 9) --classifier Classifier: BernoulliNB, SVM, kNN. | ||
28 | +# 10) --saveData Save matrices | ||
29 | + | ||
30 | +# Output: | ||
31 | +# 1) Classification model and evaluation report. | ||
32 | + | ||
33 | +# Execution: | ||
34 | + | ||
35 | +# python training-testing-binding-thrombin.py | ||
36 | +# --inputPath /home/binding-thrombin-dataset | ||
37 | +# --inputTrainingData thrombin.data | ||
38 | +# --inputTestingData Thrombin.testset | ||
39 | +# --inputTestingClasses Thrombin.testset.class | ||
40 | +# --outputModelPath /home/binding-thrombin-dataset/models | ||
41 | +# --outputModelFile SVM-model.mod | ||
42 | +# --outputReportPath /home/binding-thrombin-dataset/reports | ||
43 | +# --outputReportFile SVM.txt | ||
44 | +# --classifier SVM | ||
45 | +# --saveData | ||
46 | + | ||
47 | +# source activate python3 | ||
48 | +# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData | ||
49 | + | ||
50 | +########################################################### | ||
51 | +# MAIN PROGRAM # | ||
52 | +########################################################### | ||
53 | + | ||
54 | +if __name__ == "__main__": | ||
55 | + # Parameter definition | ||
56 | + parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.') | ||
57 | + parser.add_argument("--inputPath", dest="inputPath", | ||
58 | + help="Path to read input files", metavar="PATH") | ||
59 | + parser.add_argument("--inputTrainingData", dest="inputTrainingData", | ||
60 | + help="File to read training data", metavar="FILE") | ||
61 | + parser.add_argument("--inputTestingData", dest="inputTestingData", | ||
62 | + help="File to read testing data", metavar="FILE") | ||
63 | + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses", | ||
64 | + help="File to read testing classes", metavar="FILE") | ||
65 | + parser.add_argument("--outputModelPath", dest="outputModelPath", | ||
66 | + help="Path to place output model", metavar="PATH") | ||
67 | + parser.add_argument("--outputModelFile", dest="outputModelFile", | ||
68 | + help="File to place output model", metavar="FILE") | ||
69 | + parser.add_argument("--outputReportPath", dest="outputReportPath", | ||
70 | + help="Path to place evaluation report", metavar="PATH") | ||
71 | + parser.add_argument("--outputReportFile", dest="outputReportFile", | ||
72 | + help="File to place evaluation report", metavar="FILE") | ||
73 | + parser.add_argument("--classifier", dest="classifier", | ||
74 | + help="Classifier", metavar="NAME", | ||
75 | + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM') | ||
76 | + parser.add_argument("--saveData", dest="saveData", action='store_true', | ||
77 | + help="Save matrices") | ||
78 | + | ||
79 | + args = parser.parse_args() | ||
80 | + | ||
81 | + # Printing parameter values | ||
82 | + print('-------------------------------- PARAMETERS --------------------------------') | ||
83 | + print("Path to read input files: " + str(args.inputPath)) | ||
84 | + print("File to read training data: " + str(args.inputTrainingData)) | ||
85 | + print("File to read testing data: " + str(args.inputTestingData)) | ||
86 | + print("File to read testing classes: " + str(args.inputTestingClasses)) | ||
87 | + print("Path to place output model: " + str(args.outputModelPath)) | ||
88 | + print("File to place output model: " + str(args.outputModelFile)) | ||
89 | + print("Path to place evaluation report: " + str(args.outputReportPath)) | ||
90 | + print("File to place evaluation report: " + str(args.outputReportFile)) | ||
91 | + print("Classifier: " + str(args.classifier)) | ||
92 | + print("Save matrices: " + str(args.saveData)) | ||
93 | + | ||
94 | + # Start time | ||
95 | + t0 = time() | ||
96 | + | ||
97 | + print("Reading training data and true classes...") | ||
98 | + X_train = None | ||
99 | + if args.saveData: | ||
100 | + y_train = [] | ||
101 | + trainingData = [] | ||
102 | + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \ | ||
103 | + as iFile: | ||
104 | + for line in iFile: | ||
105 | + line = line.strip('\r\n') | ||
106 | + listLine = line.split(',') | ||
107 | + y_train.append(listLine[0]) | ||
108 | + trainingData.append(listLine[1:]) | ||
109 | +            # Use a sparse matrix: the feature vectors are binary and mostly zero | ||
110 | +            X_train = csr_matrix(trainingData, dtype='double') | ||
111 | + print(" Saving matrix and classes...") | ||
112 | + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
113 | + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
114 | + print(" Done!") | ||
115 | + else: | ||
116 | + print(" Loading matrix and classes...") | ||
117 | + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
118 | + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
119 | + print(" Done!") | ||
120 | + | ||
121 | + print(" Number of training classes: {}".format(len(y_train))) | ||
122 | + print(" Number of training class A: {}".format(y_train.count('A'))) | ||
123 | + print(" Number of training class I: {}".format(y_train.count('I'))) | ||
124 | + print(" Shape of training matrix: {}".format(X_train.shape)) | ||
125 | + | ||
126 | + print("Reading testing data and true classes...") | ||
127 | + X_test = None | ||
128 | + if args.saveData: | ||
129 | + y_test = [] | ||
130 | + testingData = [] | ||
131 | + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \ | ||
132 | + as iFile: | ||
133 | + for line in iFile: | ||
134 | + line = line.strip('\r\n') | ||
135 | + listLine = line.split(',') | ||
136 | + testingData.append(listLine[1:]) | ||
137 | + X_test = csr_matrix(testingData, dtype='double') | ||
138 | + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \ | ||
139 | + as iFile: | ||
140 | + for line in iFile: | ||
141 | + line = line.strip('\r\n') | ||
142 | + y_test.append(line) | ||
143 | + print(" Saving matrix and classes...") | ||
144 | + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
145 | + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
146 | + print(" Done!") | ||
147 | + else: | ||
148 | + print(" Loading matrix and classes...") | ||
149 | + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
150 | + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
151 | + print(" Done!") | ||
152 | + | ||
153 | + print(" Number of testing classes: {}".format(len(y_test))) | ||
154 | + print(" Number of testing class A: {}".format(y_test.count('A'))) | ||
155 | + print(" Number of testing class I: {}".format(y_test.count('I'))) | ||
156 | + print(" Shape of testing matrix: {}".format(X_test.shape)) | ||
157 | + | ||
158 | + if args.classifier == "BernoulliNB": | ||
159 | + classifier = BernoulliNB() | ||
160 | + elif args.classifier == "SVM": | ||
161 | + classifier = SVC() | ||
162 | + elif args.classifier == "kNN": | ||
163 | + classifier = KNeighborsClassifier() | ||
164 | +    else: | ||
165 | +        # Unreachable in practice: argparse 'choices' already restricts --classifier | ||
166 | +        exit("Unknown classifier: {}".format(args.classifier)) | ||
167 | + | ||
168 | + print("Training...") | ||
169 | + classifier.fit(X_train, y_train) | ||
170 | + print(" Done!") | ||
171 | + | ||
172 | +    print("Testing (prediction on new data)...") | ||
173 | + y_pred = classifier.predict(X_test) | ||
174 | + print(" Done!") | ||
175 | + | ||
176 | + print("Saving report...") | ||
177 | + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
178 | + oFile.write('********** EVALUATION REPORT **********\n') | ||
179 | + oFile.write('Classifier: {}\n'.format(args.classifier)) | ||
180 | + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred))) | ||
181 | + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted'))) | ||
182 | + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted'))) | ||
183 | + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted'))) | ||
184 | + oFile.write('Confusion matrix: \n') | ||
185 | + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n') | ||
186 | + oFile.write('Classification report: \n') | ||
187 | + oFile.write(classification_report(y_test, y_pred) + '\n') | ||
188 | + print(" Done!") | ||
189 | + | ||
190 | + print("Training and testing done in: %fs" % (time() - t0)) |
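The predicted labels in `y_pred` could also be written out in the KDD Cup submission format described in README.testset (contact information, a line of five asterisks, then one A/I prediction per line, in test-set order). A minimal sketch; the `write_submission` helper, the contact string, and the output filename are illustrative, not part of the script above:

```python
def write_submission(path, contact, predictions):
    """Write contact info, a '*****' separator line, then one A/I label per line."""
    with open(path, mode='w', encoding='utf8') as oFile:
        oFile.write(contact + '\n')
        oFile.write('*****\n')
        for label in predictions:
            oFile.write(label + '\n')

# Example usage with a dummy prediction list; in the script above this
# would be the y_pred array returned by classifier.predict(X_test).
write_submission('KDDcup.DavidPage.thrombin',
                 'David Page, page@biostat.wisc.edu',
                 ['I', 'I', 'A'])
```

For the real test set this file would contain 634 prediction lines after the asterisks, one per test molecule.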