Showing 9 changed files with 0 additions and 1168 deletions
1 | -The test set consists of 634 data points, each of which represents | ||
2 | -a molecule that is either active (A) or inactive (I). The test set | ||
3 | -has the same format as the training set, with the exception that the | ||
4 | -activity value (A or I) for each data point is missing, that is, has | ||
5 | -been replaced by a question mark (?). Please submit one prediction, | ||
6 | -A or I, for each data point. Your submission should be in the form | ||
7 | -of a file that starts with your contact information, followed by a | ||
8 | -line with 5 asterisks, followed immediately by your predictions, with | ||
9 | -one line per data point. The predictions should be in the same order | ||
10 | -as the test set data points. So your prediction for the first example | ||
11 | -should appear on the first line after the asterisks, your prediction | ||
12 | -for the second example should appear on the second line after the | ||
13 | -asterisks, etc. Hence, after your contact information, the prediction | ||
14 | -file will consist of 635 lines and have the form: | ||
15 | - | ||
16 | -***** | ||
17 | -I | ||
18 | -I | ||
19 | -A | ||
20 | -I | ||
21 | -A | ||
22 | -I | ||
23 | - | ||
24 | -etc. | ||
25 | - | ||
26 | -You may submit your prediction by email to page@biostat.wisc.edu | ||
27 | -or by anonymous ftp to ftp.biostat.wisc.edu, placing the file | ||
28 | -into the directory dropboxes/page/. If using email, please use | ||
29 | -the subject line "KDDcup <name> thrombin" where <name> is your | ||
30 | -name. If using ftp, please name the file KDDcup.<name>.thrombin | ||
31 | -where <name> is your name. For example, my submission would be | ||
32 | -named KDDcup.DavidPage.thrombin | ||
33 | - | ||
34 | -Only one submission per person per task is permitted. If you do not | ||
35 | -receive email confirmation of your submission within 24 hours, please | ||
36 | -email page@biostat.wisc.edu with subject "KDDcup no confirmation". | ||
37 | - | ||
38 | -For group entries, the contact information should include the names | ||
39 | -of everyone to be credited as a member of the group should your entry | ||
40 | -achieve the highest score. But no person is to be listed on more than | ||
41 | -one entry per task. |
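As a minimal sketch of the submission format described in the deleted file above (contact information, a line of five asterisks, then one prediction per line in test-set order), the following Python helper writes such a file. The function name `write_submission` and the contact string are hypothetical and not part of the original repository.

```python
import io


def write_submission(contact_lines, predictions, out):
    """Write contact info, a line of five asterisks, then one label per line.

    `contact_lines` and `predictions` are lists of strings; `out` is any
    writable text stream. Only the labels 'A' and 'I' are accepted.
    """
    bad = [p for p in predictions if p not in ("A", "I")]
    if bad:
        raise ValueError("predictions must be 'A' or 'I', got: %r" % bad[:5])
    for line in contact_lines:
        out.write(line + "\n")
    out.write("*****\n")  # separator line required by the instructions
    for label in predictions:
        out.write(label + "\n")


# Tiny illustration with three made-up predictions.
buf = io.StringIO()
write_submission(["Jane Doe <jane@example.org>"], ["I", "I", "A"], buf)
submission_text = buf.getvalue()
```

For the real test set the predictions list would hold 634 labels, so the part after the contact information would be 635 lines long, as the instructions state.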
1 | -Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin | ||
2 | --------------------------------------------------------------------------- | ||
3 | - | ||
4 | -Drugs are typically small organic molecules that achieve their desired | ||
5 | -activity by binding to a target site on a receptor. The first step in | ||
6 | -the discovery of a new drug is usually to identify and isolate the | ||
7 | -receptor to which it should bind, followed by testing many small | ||
8 | -molecules for their ability to bind to the target site. This leaves | ||
9 | -researchers with the task of determining what separates the active | ||
10 | -(binding) compounds from the inactive (non-binding) ones. Such a | ||
11 | -determination can then be used in the design of new compounds that not | ||
12 | -only bind, but also have all the other properties required for a drug | ||
13 | -(solubility, oral absorption, lack of side effects, appropriate duration | ||
14 | -of action, low toxicity, etc.). | ||
15 | - | ||
16 | -The present training data set consists of 1909 compounds tested for | ||
17 | -their ability to bind to a target site on thrombin, a key receptor in | ||
18 | -blood clotting. The chemical structures of these compounds are not | ||
19 | -necessary for our analysis and are not included. Of these compounds, 42 | ||
20 | -are active (bind well) and the others are inactive. Each compound is | ||
21 | -described by a single feature vector composed of a class value (A for | ||
22 | -active, I for inactive) and 139,351 binary features, which describe | ||
23 | -three-dimensional properties of the molecule. The definitions of the | ||
24 | -individual bits are not included - we don't know what each individual | ||
25 | -bit means, only that they are generated in an internally consistent | ||
26 | -manner for all 1909 compounds. Biological activity in general, and | ||
27 | -receptor binding affinity in particular, correlate with various | ||
28 | -structural and physical properties of small organic molecules. The task | ||
29 | -is to determine which of these properties are critical in this case and | ||
30 | -to learn to accurately predict the class value. To simulate the | ||
31 | -real-world drug design environment, the test set contains 634 additional | ||
32 | -compounds that were in fact generated based on the assay results | ||
33 | -recorded for the training set. In evaluating the accuracy, a | ||
34 | -differential cost model will be used, so that the sum of the costs of | ||
35 | -the actives will be equal to the sum of the costs of the inactives. | ||
36 | - | ||
37 | -We thank DuPont Pharmaceuticals for graciously providing this data set | ||
38 | -for the KDD Cup 2001 competition. All publications referring to | ||
39 | -analysis of this data set should acknowledge DuPont Pharmaceuticals | ||
40 | -Research Laboratories and KDD Cup 2001. |
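The README above does not spell out the scoring formula. One plausible reading of the differential cost model, in which each class carries the same total weight, is the mean of the per-class accuracies (often called balanced accuracy). The sketch below assumes that reading; `class_balanced_accuracy` is a hypothetical name, and the labels are made up for illustration.

```python
def class_balanced_accuracy(y_true, y_pred, labels=("A", "I")):
    """Mean of per-class accuracies, so each class has equal total weight."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for label in labels:
        # Positions of the examples whose true class is `label`.
        idx = [i for i, y in enumerate(y_true) if y == label]
        if not idx:
            raise ValueError("no examples of class %r in y_true" % label)
        correct = sum(1 for i in idx if y_pred[i] == label)
        total += correct / len(idx)
    return total / len(labels)


# Two actives (one found) and four inactives (three found).
y_true = ["A", "A", "I", "I", "I", "I"]
y_pred = ["A", "I", "I", "I", "I", "A"]
score = class_balanced_accuracy(y_true, y_pred)
```

Under this measure a classifier that always answers I scores 0.5 regardless of how rare the actives are, which is why plain accuracy is a poor guide on data with 42 actives among 1909 compounds.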
binding-thrombin-dataset/README.txt
deleted
100644 → 0
File mode changed
This diff could not be displayed because it is too large.
1 | -I | ||
2 | -A | ||
3 | -I | ||
4 | -I | ||
5 | -I | ||
6 | -A | ||
7 | -I | ||
8 | -I | ||
9 | -I | ||
10 | -A | ||
11 | -I | ||
12 | -I | ||
13 | -I | ||
14 | -A | ||
15 | -I | ||
16 | -A | ||
17 | -I | ||
18 | -I | ||
19 | -I | ||
20 | -I | ||
21 | -I | ||
22 | -I | ||
23 | -I | ||
24 | -I | ||
25 | -I | ||
26 | -I | ||
27 | -I | ||
28 | -A | ||
29 | -I | ||
30 | -A | ||
31 | -I | ||
32 | -I | ||
33 | -I | ||
34 | -I | ||
35 | -I | ||
36 | -A | ||
37 | -I | ||
38 | -I | ||
39 | -I | ||
40 | -A | ||
41 | -I | ||
42 | -I | ||
43 | -I | ||
44 | -I | ||
45 | -I | ||
46 | -I | ||
47 | -I | ||
48 | -I | ||
49 | -A | ||
50 | -A | ||
51 | -I | ||
52 | -I | ||
53 | -I | ||
54 | -I | ||
55 | -I | ||
56 | -A | ||
57 | -I | ||
58 | -A | ||
59 | -A | ||
60 | -I | ||
61 | -I | ||
62 | -I | ||
63 | -A | ||
64 | -I | ||
65 | -I | ||
66 | -I | ||
67 | -I | ||
68 | -A | ||
69 | -A | ||
70 | -I | ||
71 | -A | ||
72 | -I | ||
73 | -I | ||
74 | -A | ||
75 | -A | ||
76 | -I | ||
77 | -I | ||
78 | -I | ||
79 | -I | ||
80 | -I | ||
81 | -I | ||
82 | -I | ||
83 | -I | ||
84 | -I | ||
85 | -I | ||
86 | -I | ||
87 | -I | ||
88 | -I | ||
89 | -I | ||
90 | -I | ||
91 | -A | ||
92 | -I | ||
93 | -I | ||
94 | -A | ||
95 | -I | ||
96 | -A | ||
97 | -I | ||
98 | -I | ||
99 | -I | ||
100 | -A | ||
101 | -A | ||
102 | -I | ||
103 | -I | ||
104 | -I | ||
105 | -I | ||
106 | -I | ||
107 | -I | ||
108 | -A | ||
109 | -I | ||
110 | -I | ||
111 | -I | ||
112 | -I | ||
113 | -A | ||
114 | -A | ||
115 | -I | ||
116 | -I | ||
117 | -I | ||
118 | -I | ||
119 | -I | ||
120 | -I | ||
121 | -I | ||
122 | -I | ||
123 | -A | ||
124 | -A | ||
125 | -I | ||
126 | -A | ||
127 | -A | ||
128 | -I | ||
129 | -I | ||
130 | -I | ||
131 | -I | ||
132 | -I | ||
133 | -I | ||
134 | -I | ||
135 | -I | ||
136 | -I | ||
137 | -I | ||
138 | -A | ||
139 | -I | ||
140 | -I | ||
141 | -I | ||
142 | -I | ||
143 | -I | ||
144 | -I | ||
145 | -I | ||
146 | -I | ||
147 | -I | ||
148 | -A | ||
149 | -I | ||
150 | -I | ||
151 | -I | ||
152 | -I | ||
153 | -A | ||
154 | -I | ||
155 | -I | ||
156 | -I | ||
157 | -I | ||
158 | -I | ||
159 | -I | ||
160 | -I | ||
161 | -A | ||
162 | -I | ||
163 | -I | ||
164 | -A | ||
165 | -I | ||
166 | -A | ||
167 | -I | ||
168 | -I | ||
169 | -A | ||
170 | -I | ||
171 | -A | ||
172 | -I | ||
173 | -A | ||
174 | -I | ||
175 | -A | ||
176 | -I | ||
177 | -I | ||
178 | -I | ||
179 | -I | ||
180 | -I | ||
181 | -A | ||
182 | -I | ||
183 | -I | ||
184 | -A | ||
185 | -I | ||
186 | -I | ||
187 | -A | ||
188 | -I | ||
189 | -I | ||
190 | -I | ||
191 | -A | ||
192 | -I | ||
193 | -A | ||
194 | -I | ||
195 | -I | ||
196 | -A | ||
197 | -I | ||
198 | -I | ||
199 | -I | ||
200 | -I | ||
201 | -A | ||
202 | -I | ||
203 | -A | ||
204 | -I | ||
205 | -I | ||
206 | -I | ||
207 | -I | ||
208 | -I | ||
209 | -I | ||
210 | -I | ||
211 | -I | ||
212 | -I | ||
213 | -I | ||
214 | -I | ||
215 | -A | ||
216 | -I | ||
217 | -A | ||
218 | -I | ||
219 | -I | ||
220 | -I | ||
221 | -I | ||
222 | -I | ||
223 | -I | ||
224 | -A | ||
225 | -I | ||
226 | -I | ||
227 | -A | ||
228 | -A | ||
229 | -A | ||
230 | -I | ||
231 | -I | ||
232 | -A | ||
233 | -A | ||
234 | -I | ||
235 | -I | ||
236 | -I | ||
237 | -I | ||
238 | -A | ||
239 | -I | ||
240 | -I | ||
241 | -I | ||
242 | -I | ||
243 | -A | ||
244 | -I | ||
245 | -A | ||
246 | -I | ||
247 | -I | ||
248 | -I | ||
249 | -I | ||
250 | -I | ||
251 | -I | ||
252 | -I | ||
253 | -A | ||
254 | -A | ||
255 | -I | ||
256 | -I | ||
257 | -I | ||
258 | -I | ||
259 | -I | ||
260 | -I | ||
261 | -I | ||
262 | -I | ||
263 | -A | ||
264 | -A | ||
265 | -I | ||
266 | -I | ||
267 | -I | ||
268 | -I | ||
269 | -I | ||
270 | -I | ||
271 | -A | ||
272 | -A | ||
273 | -I | ||
274 | -I | ||
275 | -I | ||
276 | -I | ||
277 | -I | ||
278 | -I | ||
279 | -A | ||
280 | -I | ||
281 | -A | ||
282 | -I | ||
283 | -I | ||
284 | -I | ||
285 | -I | ||
286 | -I | ||
287 | -I | ||
288 | -I | ||
289 | -I | ||
290 | -I | ||
291 | -A | ||
292 | -I | ||
293 | -I | ||
294 | -A | ||
295 | -I | ||
296 | -I | ||
297 | -I | ||
298 | -I | ||
299 | -I | ||
300 | -I | ||
301 | -A | ||
302 | -A | ||
303 | -I | ||
304 | -I | ||
305 | -I | ||
306 | -I | ||
307 | -I | ||
308 | -A | ||
309 | -I | ||
310 | -I | ||
311 | -I | ||
312 | -I | ||
313 | -I | ||
314 | -A | ||
315 | -A | ||
316 | -A | ||
317 | -I | ||
318 | -A | ||
319 | -I | ||
320 | -I | ||
321 | -I | ||
322 | -I | ||
323 | -A | ||
324 | -A | ||
325 | -I | ||
326 | -A | ||
327 | -A | ||
328 | -I | ||
329 | -I | ||
330 | -I | ||
331 | -I | ||
332 | -I | ||
333 | -I | ||
334 | -I | ||
335 | -I | ||
336 | -I | ||
337 | -I | ||
338 | -I | ||
339 | -I | ||
340 | -A | ||
341 | -I | ||
342 | -I | ||
343 | -I | ||
344 | -I | ||
345 | -A | ||
346 | -A | ||
347 | -I | ||
348 | -I | ||
349 | -A | ||
350 | -I | ||
351 | -I | ||
352 | -I | ||
353 | -I | ||
354 | -I | ||
355 | -A | ||
356 | -A | ||
357 | -I | ||
358 | -A | ||
359 | -I | ||
360 | -I | ||
361 | -I | ||
362 | -I | ||
363 | -I | ||
364 | -I | ||
365 | -A | ||
366 | -A | ||
367 | -I | ||
368 | -I | ||
369 | -A | ||
370 | -I | ||
371 | -I | ||
372 | -I | ||
373 | -I | ||
374 | -I | ||
375 | -I | ||
376 | -I | ||
377 | -I | ||
378 | -I | ||
379 | -I | ||
380 | -I | ||
381 | -I | ||
382 | -A | ||
383 | -I | ||
384 | -I | ||
385 | -A | ||
386 | -I | ||
387 | -I | ||
388 | -A | ||
389 | -I | ||
390 | -I | ||
391 | -I | ||
392 | -I | ||
393 | -A | ||
394 | -A | ||
395 | -I | ||
396 | -A | ||
397 | -A | ||
398 | -I | ||
399 | -I | ||
400 | -A | ||
401 | -I | ||
402 | -I | ||
403 | -I | ||
404 | -I | ||
405 | -A | ||
406 | -I | ||
407 | -I | ||
408 | -I | ||
409 | -I | ||
410 | -I | ||
411 | -I | ||
412 | -I | ||
413 | -I | ||
414 | -I | ||
415 | -I | ||
416 | -A | ||
417 | -I | ||
418 | -I | ||
419 | -A | ||
420 | -A | ||
421 | -I | ||
422 | -I | ||
423 | -I | ||
424 | -A | ||
425 | -I | ||
426 | -I | ||
427 | -I | ||
428 | -I | ||
429 | -A | ||
430 | -I | ||
431 | -A | ||
432 | -I | ||
433 | -I | ||
434 | -I | ||
435 | -I | ||
436 | -I | ||
437 | -A | ||
438 | -I | ||
439 | -I | ||
440 | -I | ||
441 | -I | ||
442 | -I | ||
443 | -I | ||
444 | -I | ||
445 | -I | ||
446 | -I | ||
447 | -I | ||
448 | -I | ||
449 | -I | ||
450 | -I | ||
451 | -I | ||
452 | -I | ||
453 | -I | ||
454 | -I | ||
455 | -I | ||
456 | -A | ||
457 | -A | ||
458 | -A | ||
459 | -A | ||
460 | -I | ||
461 | -I | ||
462 | -I | ||
463 | -A | ||
464 | -A | ||
465 | -I | ||
466 | -I | ||
467 | -I | ||
468 | -I | ||
469 | -I | ||
470 | -A | ||
471 | -I | ||
472 | -A | ||
473 | -I | ||
474 | -I | ||
475 | -I | ||
476 | -I | ||
477 | -I | ||
478 | -A | ||
479 | -I | ||
480 | -I | ||
481 | -I | ||
482 | -I | ||
483 | -A | ||
484 | -A | ||
485 | -I | ||
486 | -I | ||
487 | -I | ||
488 | -I | ||
489 | -I | ||
490 | -I | ||
491 | -I | ||
492 | -I | ||
493 | -I | ||
494 | -I | ||
495 | -I | ||
496 | -I | ||
497 | -I | ||
498 | -I | ||
499 | -I | ||
500 | -I | ||
501 | -I | ||
502 | -A | ||
503 | -I | ||
504 | -A | ||
505 | -I | ||
506 | -I | ||
507 | -A | ||
508 | -I | ||
509 | -I | ||
510 | -I | ||
511 | -I | ||
512 | -A | ||
513 | -I | ||
514 | -I | ||
515 | -A | ||
516 | -A | ||
517 | -I | ||
518 | -I | ||
519 | -I | ||
520 | -A | ||
521 | -I | ||
522 | -A | ||
523 | -I | ||
524 | -I | ||
525 | -I | ||
526 | -I | ||
527 | -I | ||
528 | -I | ||
529 | -I | ||
530 | -A | ||
531 | -A | ||
532 | -I | ||
533 | -I | ||
534 | -I | ||
535 | -A | ||
536 | -I | ||
537 | -I | ||
538 | -I | ||
539 | -A | ||
540 | -I | ||
541 | -I | ||
542 | -I | ||
543 | -I | ||
544 | -I | ||
545 | -I | ||
546 | -A | ||
547 | -I | ||
548 | -I | ||
549 | -I | ||
550 | -I | ||
551 | -I | ||
552 | -A | ||
553 | -I | ||
554 | -I | ||
555 | -I | ||
556 | -I | ||
557 | -I | ||
558 | -A | ||
559 | -I | ||
560 | -I | ||
561 | -A | ||
562 | -I | ||
563 | -I | ||
564 | -I | ||
565 | -I | ||
566 | -I | ||
567 | -I | ||
568 | -A | ||
569 | -I | ||
570 | -I | ||
571 | -I | ||
572 | -I | ||
573 | -I | ||
574 | -I | ||
575 | -I | ||
576 | -I | ||
577 | -I | ||
578 | -I | ||
579 | -I | ||
580 | -I | ||
581 | -I | ||
582 | -I | ||
583 | -I | ||
584 | -I | ||
585 | -A | ||
586 | -I | ||
587 | -I | ||
588 | -A | ||
589 | -I | ||
590 | -I | ||
591 | -I | ||
592 | -I | ||
593 | -A | ||
594 | -I | ||
595 | -I | ||
596 | -I | ||
597 | -I | ||
598 | -I | ||
599 | -A | ||
600 | -I | ||
601 | -I | ||
602 | -I | ||
603 | -A | ||
604 | -I | ||
605 | -I | ||
606 | -I | ||
607 | -A | ||
608 | -I | ||
609 | -A | ||
610 | -A | ||
611 | -I | ||
612 | -A | ||
613 | -I | ||
614 | -I | ||
615 | -I | ||
616 | -I | ||
617 | -I | ||
618 | -I | ||
619 | -I | ||
620 | -A | ||
621 | -I | ||
622 | -I | ||
623 | -I | ||
624 | -A | ||
625 | -I | ||
626 | -A | ||
627 | -I | ||
628 | -I | ||
629 | -I | ||
630 | -I | ||
631 | -A | ||
632 | -I | ||
633 | -A | ||
634 | -I |
This diff could not be displayed because it is too large.
This diff could not be displayed because it is too large.
1 | -# -*- encoding: utf-8 -*- | ||
2 | - | ||
3 | -import os | ||
4 | -from time import time | ||
5 | -import argparse | ||
6 | -from sklearn.naive_bayes import BernoulliNB | ||
7 | -from sklearn.svm import SVC | ||
8 | -from sklearn.neighbors import KNeighborsClassifier | ||
9 | -from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \ | ||
10 | - classification_report | ||
11 | -import joblib | ||
12 | -from sklearn import model_selection | ||
13 | -from sklearn.feature_selection import SelectKBest, chi2 | ||
14 | -from sklearn.decomposition import TruncatedSVD | ||
15 | -from scipy.sparse import csr_matrix | ||
16 | -import scipy.stats | ||
17 | - | ||
18 | -__author__ = 'CMendezC' | ||
19 | - | ||
20 | -# Goal: training, cross-validation and testing on the binding thrombin data set | ||
21 | - | ||
22 | -# Parameters: | ||
23 | -# 1) --inputPath Path to read input files. | ||
24 | -# 2) --inputTrainingData File to read training data. | ||
25 | -# 3) --inputTestingData File to read testing data. | ||
26 | -# 4) --inputTestingClasses File to read testing classes. | ||
27 | -# 5) --outputModelPath Path to place output model. | ||
28 | -# 6) --outputModelFile File to place output model. | ||
29 | -# 7) --outputReportPath Path to place evaluation report. | ||
30 | -# 8) --outputReportFile File to place evaluation report. | ||
31 | -# 9) --classifier Classifier: BernoulliNB, SVM, kNN. | ||
32 | -# 10) --saveData Save matrices | ||
33 | -# 11) --kernel Kernel | ||
34 | -# 12) --reduction Feature selection or dimensionality reduction | ||
35 | - | ||
36 | -# Output: | ||
37 | -# 1) Classification model and evaluation report. | ||
38 | - | ||
39 | -# Execution: | ||
40 | - | ||
41 | -# python training-crossvalidation-testing-binding-thrombin.py | ||
42 | -# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset | ||
43 | -# --inputTrainingData thrombin.data | ||
44 | -# --inputTestingData Thrombin.testset | ||
45 | -# --inputTestingClasses Thrombin.testset.class | ||
46 | -# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models | ||
47 | -# --outputModelFile SVM-model.mod | ||
48 | -# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports | ||
49 | -# --outputReportFile SVM.txt | ||
50 | -# --classifier SVM | ||
51 | -# --saveData | ||
52 | -# --kernel linear | ||
53 | -# --reduction SVD200 | ||
54 | - | ||
55 | -# source activate python3 | ||
56 | -# python training-crossvalidation-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-linear-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM-linear.txt --classifier SVM --kernel rbf | ||
57 | - | ||
58 | -########################################################### | ||
59 | -# MAIN PROGRAM # | ||
60 | -########################################################### | ||
61 | - | ||
62 | -if __name__ == "__main__": | ||
63 | - # Parameter definition | ||
64 | - parser = argparse.ArgumentParser(description='Training validation Binding Thrombin Dataset.') | ||
65 | - parser.add_argument("--inputPath", dest="inputPath", | ||
66 | - help="Path to read input files", metavar="PATH") | ||
67 | - parser.add_argument("--inputTrainingData", dest="inputTrainingData", | ||
68 | - help="File to read training data", metavar="FILE") | ||
69 | - parser.add_argument("--inputTestingData", dest="inputTestingData", | ||
70 | - help="File to read testing data", metavar="FILE") | ||
71 | - parser.add_argument("--inputTestingClasses", dest="inputTestingClasses", | ||
72 | - help="File to read testing classes", metavar="FILE") | ||
73 | - parser.add_argument("--outputModelPath", dest="outputModelPath", | ||
74 | - help="Path to place output model", metavar="PATH") | ||
75 | - parser.add_argument("--outputModelFile", dest="outputModelFile", | ||
76 | - help="File to place output model", metavar="FILE") | ||
77 | - parser.add_argument("--outputReportPath", dest="outputReportPath", | ||
78 | - help="Path to place evaluation report", metavar="PATH") | ||
79 | - parser.add_argument("--outputReportFile", dest="outputReportFile", | ||
80 | - help="File to place evaluation report", metavar="FILE") | ||
81 | - parser.add_argument("--classifier", dest="classifier", | ||
82 | - help="Classifier", metavar="NAME", | ||
83 | - choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM') | ||
84 | - parser.add_argument("--saveData", dest="saveData", action='store_true', | ||
85 | - help="Save matrices") | ||
86 | - parser.add_argument("--kernel", dest="kernel", | ||
87 | - help="Kernel SVM", metavar="NAME", | ||
88 | - choices=('linear', 'rbf', 'poly'), default='linear') | ||
89 | - parser.add_argument("--reduction", dest="reduction", | ||
90 | - help="Feature selection or dimensionality reduction", metavar="NAME", | ||
91 | - choices=('SVD200', 'SVD300', 'CHI250', 'CHI2100'), default=None) | ||
92 | - | ||
93 | - args = parser.parse_args() | ||
94 | - | ||
95 | - # Printing parameter values | ||
96 | - print('-------------------------------- PARAMETERS --------------------------------') | ||
97 | - print("Path to read input files: " + str(args.inputPath)) | ||
98 | - print("File to read training data: " + str(args.inputTrainingData)) | ||
99 | - print("File to read testing data: " + str(args.inputTestingData)) | ||
100 | - print("File to read testing classes: " + str(args.inputTestingClasses)) | ||
101 | - print("Path to place output model: " + str(args.outputModelPath)) | ||
102 | - print("File to place output model: " + str(args.outputModelFile)) | ||
103 | - print("Path to place evaluation report: " + str(args.outputReportPath)) | ||
104 | - print("File to place evaluation report: " + str(args.outputReportFile)) | ||
105 | - print("Classifier: " + str(args.classifier)) | ||
106 | - print("Save matrices: " + str(args.saveData)) | ||
107 | - print("Kernel: " + str(args.kernel)) | ||
108 | - print("Reduction: " + str(args.reduction)) | ||
109 | - | ||
110 | - # Start time | ||
111 | - t0 = time() | ||
112 | - | ||
113 | - print("Reading training data and true classes...") | ||
114 | - X_train = None | ||
115 | - if args.saveData: | ||
116 | - y_train = [] | ||
117 | - trainingData = [] | ||
118 | - with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \ | ||
119 | - as iFile: | ||
120 | - for line in iFile: | ||
121 | - line = line.strip('\r\n') | ||
122 | - listLine = line.split(',') | ||
123 | - y_train.append(listLine[0]) | ||
124 | - trainingData.append(listLine[1:]) | ||
125 | - # X_train = np.matrix(trainingData) | ||
126 | - X_train = csr_matrix(trainingData, dtype='double') | ||
127 | - print(" Saving matrix and classes...") | ||
128 | - joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
129 | - joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
130 | - print(" Done!") | ||
131 | - else: | ||
132 | - print(" Loading matrix and classes...") | ||
133 | - X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
134 | - y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
135 | - print(" Done!") | ||
136 | - | ||
137 | - print(" Number of training classes: {}".format(len(y_train))) | ||
138 | - print(" Number of training class A: {}".format(y_train.count('A'))) | ||
139 | - print(" Number of training class I: {}".format(y_train.count('I'))) | ||
140 | - print(" Shape of training matrix: {}".format(X_train.shape)) | ||
141 | - | ||
142 | - print("Reading testing data and true classes...") | ||
143 | - X_test = None | ||
144 | - if args.saveData: | ||
145 | - y_test = [] | ||
146 | - testingData = [] | ||
147 | - with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \ | ||
148 | - as iFile: | ||
149 | - for line in iFile: | ||
150 | - line = line.strip('\r\n') | ||
151 | - listLine = line.split(',') | ||
152 | - testingData.append(listLine[1:]) | ||
153 | - X_test = csr_matrix(testingData, dtype='double') | ||
154 | - with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \ | ||
155 | - as iFile: | ||
156 | - for line in iFile: | ||
157 | - line = line.strip('\r\n') | ||
158 | - y_test.append(line) | ||
159 | - print(" Saving matrix and classes...") | ||
160 | - joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
161 | - joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
162 | - print(" Done!") | ||
163 | - else: | ||
164 | - print(" Loading matrix and classes...") | ||
165 | - X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
166 | - y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
167 | - print(" Done!") | ||
168 | - | ||
169 | - print(" Number of testing classes: {}".format(len(y_test))) | ||
170 | - print(" Number of testing class A: {}".format(y_test.count('A'))) | ||
171 | - print(" Number of testing class I: {}".format(y_test.count('I'))) | ||
172 | - print(" Shape of testing matrix: {}".format(X_test.shape)) | ||
173 | - | ||
174 | - # Feature selection and dimensional reduction | ||
175 | - if args.reduction is not None: | ||
176 | - print('Performing dimensionality reduction or feature selection...', args.reduction) | ||
177 | - if args.reduction == 'SVD200': | ||
178 | - reduc = TruncatedSVD(n_components=200, random_state=42) | ||
179 | - X_train = reduc.fit_transform(X_train) | ||
180 | - elif args.reduction == 'SVD300': | ||
181 | - reduc = TruncatedSVD(n_components=300, random_state=42) | ||
182 | - X_train = reduc.fit_transform(X_train) | ||
183 | - elif args.reduction == 'CHI250': | ||
184 | - reduc = SelectKBest(chi2, k=50) | ||
185 | - X_train = reduc.fit_transform(X_train, y_train) | ||
186 | - elif args.reduction == 'CHI2100': | ||
187 | - reduc = SelectKBest(chi2, k=100) | ||
188 | - X_train = reduc.fit_transform(X_train, y_train) | ||
189 | - print(" Done!") | ||
190 | - print(' New shape of training matrix: ', X_train.shape) | ||
191 | - | ||
192 | - jobs = -1 | ||
193 | - paramGrid = [] | ||
194 | - nIter = 20 | ||
195 | - crossV = 10 | ||
196 | - print("Defining randomized grid search...") | ||
197 | - if args.classifier == 'SVM': | ||
198 | - # SVM | ||
199 | - classifier = SVC() | ||
200 | - if args.kernel == 'rbf': | ||
201 | - paramGrid = {'C': scipy.stats.expon(scale=100), | ||
202 | - 'gamma': scipy.stats.expon(scale=.1), | ||
203 | - 'kernel': ['rbf'], 'class_weight': ['balanced', None]} | ||
204 | - elif args.kernel == 'linear': | ||
205 | - paramGrid = {'C': scipy.stats.expon(scale=100), | ||
206 | - 'kernel': ['linear'], | ||
207 | - 'class_weight': ['balanced', None]} | ||
208 | - elif args.kernel == 'poly': | ||
209 | - paramGrid = {'C': scipy.stats.expon(scale=100), | ||
210 | - 'gamma': scipy.stats.expon(scale=.1), 'degree': [2, 3], | ||
211 | - 'kernel': ['poly'], 'class_weight': ['balanced', None]} | ||
212 | - myClassifier = model_selection.RandomizedSearchCV(classifier, | ||
213 | - paramGrid, n_iter=nIter, | ||
214 | - cv=crossV, n_jobs=jobs, verbose=3) | ||
215 | - elif args.classifier == 'BernoulliNB': | ||
216 | - # BernoulliNB | ||
217 | - classifier = BernoulliNB() | ||
218 | - paramGrid = {'alpha': scipy.stats.expon(scale=1.0)} | ||
219 | - myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=nIter, | ||
220 | - cv=crossV, n_jobs=jobs, verbose=3) | ||
221 | - # elif args.classifier == 'kNN': | ||
222 | - # # kNN | ||
223 | - # k_range = list(range(1, 7, 2)) | ||
224 | - # classifier = KNeighborsClassifier() | ||
225 | - # paramGrid = {'n_neighbors ': k_range} | ||
226 | - # myClassifier = model_selection.RandomizedSearchCV(classifier, paramGrid, n_iter=3, | ||
227 | - # cv=crossV, n_jobs=jobs, verbose=3) | ||
228 | - else: | ||
229 | - print("Bad classifier") | ||
230 | - exit() | ||
231 | - print(" Done!") | ||
232 | - | ||
233 | - print("Training...") | ||
234 | - myClassifier.fit(X_train, y_train) | ||
235 | - print(" Done!") | ||
236 | - | ||
237 | - print("Testing (prediction in new data)...") | ||
238 | - if args.reduction is not None: | ||
239 | - X_test = reduc.transform(X_test) | ||
240 | - y_pred = myClassifier.predict(X_test) | ||
241 | - best_parameters = myClassifier.best_estimator_.get_params() | ||
242 | - print(" Done!") | ||
243 | - | ||
244 | - print("Saving report...") | ||
245 | - with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
246 | - oFile.write('********** EVALUATION REPORT **********\n') | ||
247 | - oFile.write('Reduction: {}\n'.format(args.reduction)) | ||
248 | - oFile.write('Classifier: {}\n'.format(args.classifier)) | ||
249 | - oFile.write('Kernel: {}\n'.format(args.kernel)) | ||
250 | - oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred))) | ||
251 | - oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted'))) | ||
252 | - oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted'))) | ||
253 | - oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted'))) | ||
254 | - oFile.write('Confusion matrix: \n') | ||
255 | - oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n') | ||
256 | - oFile.write('Classification report: \n') | ||
257 | - oFile.write(classification_report(y_test, y_pred) + '\n') | ||
258 | - oFile.write('Best parameters: \n') | ||
259 | - for param in sorted(best_parameters.keys()): | ||
260 | - oFile.write("\t%s: %r\n" % (param, best_parameters[param])) | ||
261 | - print(" Done!") | ||
262 | - | ||
263 | - print("Training and testing done in: %fs" % (time() - t0)) |
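The script above hands a list of string lists to `csr_matrix`, which likely materializes a dense intermediate before compressing it. A lighter-weight alternative, sketched here under the assumption that every feature field is the literal `0` or `1`, keeps only the column positions of the non-zero bits in CSR layout. `parse_binary_rows` is a hypothetical helper and not part of the deleted script.

```python
def parse_binary_rows(lines):
    """Parse 'label,b1,b2,...' lines into labels plus CSR index arrays.

    Returns (labels, indptr, indices, n_features), where row i has its
    non-zero columns in indices[indptr[i]:indptr[i + 1]].
    """
    labels, indptr, indices = [], [0], []
    n_features = 0
    for line in lines:
        line = line.strip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(",")
        labels.append(fields[0])  # 'A', 'I', or '?' in the test set
        features = fields[1:]
        n_features = max(n_features, len(features))
        # Store only the positions of bits that are set.
        indices.extend(j for j, v in enumerate(features) if v.strip() != "0")
        indptr.append(len(indices))
    return labels, indptr, indices, n_features


# Three made-up rows with four binary features each.
sample = ["A,0,1,0,1", "I,0,0,0,0", "?,1,0,0,0"]
labels, indptr, indices, n_features = parse_binary_rows(sample)
```

The resulting arrays could be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(labels), n_features))`, with `data` a list of ones as long as `indices`.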
1 | -# -*- encoding: utf-8 -*- | ||
2 | - | ||
3 | -import os | ||
4 | -from time import time | ||
5 | -import argparse | ||
6 | -from sklearn.naive_bayes import BernoulliNB | ||
7 | -from sklearn.svm import SVC | ||
8 | -from sklearn.neighbors import KNeighborsClassifier | ||
9 | -from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \ | ||
10 | - classification_report | ||
11 | -import joblib | ||
12 | -from scipy.sparse import csr_matrix | ||
13 | - | ||
14 | -__author__ = 'CMendezC' | ||
15 | - | ||
16 | -# Goal: training and testing binding thrombin data set | ||
17 | - | ||
18 | -# Parameters: | ||
19 | -# 1) --inputPath Path to read input files. | ||
20 | -# 2) --inputTrainingData File to read training data. | ||
21 | -# 3) --inputTestingData File to read testing data. | ||
22 | -# 4) --inputTestingClasses File to read testing classes. | ||
23 | -# 5) --outputModelPath Path to place output model. | ||
24 | -# 6) --outputModelFile File to place output model. | ||
25 | -# 7) --outputReportPath Path to place evaluation report. | ||
26 | -# 8) --outputReportFile File to place evaluation report. | ||
27 | -# 9) --classifier Classifier: BernoulliNB, SVM, kNN. | ||
28 | -# 10) --saveData Save matrices | ||
29 | - | ||
30 | -# Output: | ||
31 | -# 1) Classification model and evaluation report. | ||
32 | - | ||
33 | -# Execution: | ||
34 | - | ||
35 | -# python training-testing-binding-thrombin.py | ||
36 | -# --inputPath /home/binding-thrombin-dataset | ||
37 | -# --inputTrainingData thrombin.data | ||
38 | -# --inputTestingData Thrombin.testset | ||
39 | -# --inputTestingClasses Thrombin.testset.class | ||
40 | -# --outputModelPath /home/binding-thrombin-dataset/models | ||
41 | -# --outputModelFile SVM-model.mod | ||
42 | -# --outputReportPath /home/binding-thrombin-dataset/reports | ||
43 | -# --outputReportFile SVM.txt | ||
44 | -# --classifier SVM | ||
45 | -# --saveData | ||
46 | - | ||
47 | -# source activate python3 | ||
48 | -# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData | ||
49 | - | ||
50 | -########################################################### | ||
51 | -# MAIN PROGRAM # | ||
52 | -########################################################### | ||
53 | - | ||
54 | -if __name__ == "__main__": | ||
55 | - # Parameter definition | ||
56 | - parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.') | ||
57 | - parser.add_argument("--inputPath", dest="inputPath", | ||
58 | - help="Path to read input files", metavar="PATH") | ||
59 | - parser.add_argument("--inputTrainingData", dest="inputTrainingData", | ||
60 | - help="File to read training data", metavar="FILE") | ||
61 | - parser.add_argument("--inputTestingData", dest="inputTestingData", | ||
62 | - help="File to read testing data", metavar="FILE") | ||
63 | - parser.add_argument("--inputTestingClasses", dest="inputTestingClasses", | ||
64 | - help="File to read testing classes", metavar="FILE") | ||
65 | - parser.add_argument("--outputModelPath", dest="outputModelPath", | ||
66 | - help="Path to place output model", metavar="PATH") | ||
67 | - parser.add_argument("--outputModelFile", dest="outputModelFile", | ||
68 | - help="File to place output model", metavar="FILE") | ||
69 | - parser.add_argument("--outputReportPath", dest="outputReportPath", | ||
70 | - help="Path to place evaluation report", metavar="PATH") | ||
71 | - parser.add_argument("--outputReportFile", dest="outputReportFile", | ||
72 | - help="File to place evaluation report", metavar="FILE") | ||
73 | - parser.add_argument("--classifier", dest="classifier", | ||
74 | - help="Classifier", metavar="NAME", | ||
75 | - choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM') | ||
76 | - parser.add_argument("--saveData", dest="saveData", action='store_true', | ||
77 | - help="Parse input files and save matrices; if omitted, load previously saved matrices") | ||
78 | - | ||
79 | - args = parser.parse_args() | ||
80 | - | ||
81 | - # Printing parameter values | ||
82 | - print('-------------------------------- PARAMETERS --------------------------------') | ||
83 | - print("Path to read input files: " + str(args.inputPath)) | ||
84 | - print("File to read training data: " + str(args.inputTrainingData)) | ||
85 | - print("File to read testing data: " + str(args.inputTestingData)) | ||
86 | - print("File to read testing classes: " + str(args.inputTestingClasses)) | ||
87 | - print("Path to place output model: " + str(args.outputModelPath)) | ||
88 | - print("File to place output model: " + str(args.outputModelFile)) | ||
89 | - print("Path to place evaluation report: " + str(args.outputReportPath)) | ||
90 | - print("File to place evaluation report: " + str(args.outputReportFile)) | ||
91 | - print("Classifier: " + str(args.classifier)) | ||
92 | - print("Save matrices: " + str(args.saveData)) | ||
93 | - | ||
94 | - # Start time | ||
95 | - t0 = time() | ||
96 | - | ||
97 | - print("Reading training data and true classes...") | ||
98 | - X_train = None | ||
99 | - if args.saveData: | ||
100 | - y_train = [] | ||
101 | - trainingData = [] | ||
102 | - with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \ | ||
103 | - as iFile: | ||
104 | - for line in iFile: | ||
105 | - line = line.strip('\r\n') | ||
106 | - listLine = line.split(',') | ||
107 | - y_train.append(listLine[0]) | ||
108 | - trainingData.append(listLine[1:]) | ||
109 | - # X_train = np.matrix(trainingData) | ||
110 | - X_train = csr_matrix(trainingData, dtype='double') | ||
111 | - print(" Saving matrix and classes...") | ||
112 | - joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
113 | - joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
114 | - print(" Done!") | ||
115 | - else: | ||
116 | - print(" Loading matrix and classes...") | ||
117 | - X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
118 | - y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
119 | - print(" Done!") | ||
120 | - | ||
121 | - print(" Number of training examples: {}".format(len(y_train))) | ||
122 | - print(" Number of training class A: {}".format(y_train.count('A'))) | ||
123 | - print(" Number of training class I: {}".format(y_train.count('I'))) | ||
124 | - print(" Shape of training matrix: {}".format(X_train.shape)) | ||
125 | - | ||
126 | - print("Reading testing data and true classes...") | ||
127 | - X_test = None | ||
128 | - if args.saveData: | ||
129 | - y_test = [] | ||
130 | - testingData = [] | ||
131 | - with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \ | ||
132 | - as iFile: | ||
133 | - for line in iFile: | ||
134 | - line = line.strip('\r\n') | ||
135 | - listLine = line.split(',') | ||
136 | - testingData.append(listLine[1:]) | ||
137 | - X_test = csr_matrix(testingData, dtype='double') | ||
138 | - with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \ | ||
139 | - as iFile: | ||
140 | - for line in iFile: | ||
141 | - line = line.strip('\r\n') | ||
142 | - y_test.append(line) | ||
143 | - print(" Saving matrix and classes...") | ||
144 | - joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
145 | - joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
146 | - print(" Done!") | ||
147 | - else: | ||
148 | - print(" Loading matrix and classes...") | ||
149 | - X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
150 | - y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
151 | - print(" Done!") | ||
152 | - | ||
153 | - print(" Number of testing examples: {}".format(len(y_test))) | ||
154 | - print(" Number of testing class A: {}".format(y_test.count('A'))) | ||
155 | - print(" Number of testing class I: {}".format(y_test.count('I'))) | ||
156 | - print(" Shape of testing matrix: {}".format(X_test.shape)) | ||
157 | - | ||
158 | - if args.classifier == "BernoulliNB": | ||
159 | - classifier = BernoulliNB() | ||
160 | - elif args.classifier == "SVM": | ||
161 | - classifier = SVC() | ||
162 | - elif args.classifier == "kNN": | ||
163 | - classifier = KNeighborsClassifier() | ||
164 | - else: | ||
165 | - print("Bad classifier: " + str(args.classifier)) | ||
166 | - raise SystemExit(1) | ||
167 | - | ||
168 | - print("Training...") | ||
169 | - classifier.fit(X_train, y_train) | ||
170 | - print(" Done!") | ||
171 | - | ||
172 | - print("Testing (prediction on new data)...") | ||
173 | - y_pred = classifier.predict(X_test) | ||
174 | - print(" Done!") | ||
175 | - | ||
176 | - print("Saving report...") | ||
177 | - with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
178 | - oFile.write('********** EVALUATION REPORT **********\n') | ||
179 | - oFile.write('Classifier: {}\n'.format(args.classifier)) | ||
180 | - oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred))) | ||
181 | - oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted'))) | ||
182 | - oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted'))) | ||
183 | - oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted'))) | ||
184 | - oFile.write('Confusion matrix: \n') | ||
185 | - oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n') | ||
186 | - oFile.write('Classification report: \n') | ||
187 | - oFile.write(classification_report(y_test, y_pred) + '\n') | ||
188 | - print(" Done!") | ||
189 | - | ||
190 | - print("Training and testing done in: %fs" % (time() - t0)) |
-
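The script parses `--outputModelPath` and `--outputModelFile` and lists a classification model among its outputs, but no line in the listing above writes the trained classifier to disk. A minimal sketch of that step might look like the following. The helper names `save_model` and `load_model` are hypothetical and do not appear in the original file, and the standard-library `pickle` module is used here as a stand-in for the `joblib` calls the script uses for its matrices.

```python
import os
import pickle


def save_model(model, model_dir, model_file):
    """Serialize a trained model to model_dir/model_file and return the path.

    Hypothetical helper: the original script never persists the classifier.
    """
    # Create the output directory if it does not exist yet.
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, model_file)
    with open(path, mode="wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path):
    """Load a model written by save_model. Only use on files you trust."""
    with open(path, mode="rb") as handle:
        return pickle.load(handle)
```

Under these assumptions, a call such as `save_model(classifier, args.outputModelPath, args.outputModelFile)` placed after `classifier.fit(X_train, y_train)` would produce the `SVM-model.mod` file named in the execution example.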