Showing 13 changed files with 905 additions and 0 deletions
binding-thrombin-dataset/README.testset
0 → 100644
1 | +The test set consists of 634 data points, each of which represents | ||
2 | +a molecule that is either active (A) or inactive (I). The test set | ||
3 | +has the same format as the training set, with the exception that the | ||
4 | +activity value (A or I) for each data point is missing, that is, has | ||
5 | +been replaced by a question mark (?). Please submit one prediction, | ||
6 | +A or I, for each data point. Your submission should be in the form | ||
7 | +of a file that starts with your contact information, followed by a | ||
8 | +line with 5 asterisks, followed immediately by your predictions, with | ||
9 | +one line per data point. The predictions should be in the same order | ||
10 | +as the test set data points. So your prediction for the first example | ||
11 | +should appear on the first line after the asterisks, your prediction | ||
12 | +for the second example should appear on the second line after the | ||
13 | +asterisks, etc. Hence, after your contact information, the prediction | ||
14 | +file will consist of 635 lines and have the form: | ||
15 | + | ||
16 | +***** | ||
17 | +I | ||
18 | +I | ||
19 | +A | ||
20 | +I | ||
21 | +A | ||
22 | +I | ||
23 | + | ||
24 | +etc. | ||
25 | + | ||
26 | +You may submit your prediction by email to page@biostat.wisc.edu | ||
27 | +or by anonymous ftp to ftp.biostat.wisc.edu, placing the file | ||
28 | +into the directory dropboxes/page/. If using email, please use | ||
29 | +the subject line "KDDcup <name> thrombin" where <name> is your | ||
30 | +name. If using ftp, please name the file KDDcup.<name>.thrombin | ||
31 | +where <name> is your name. For example, my submission would be | ||
32 | +named KDDcup.DavidPage.thrombin | ||
33 | + | ||
34 | +Only one submission per person per task is permitted. If you do not | ||
35 | +receive email confirmation of your submission within 24 hours, please | ||
36 | +email page@biostat.wisc.edu with subject "KDDcup no confirmation". | ||
37 | + | ||
38 | +For group entries, the contact information should include the names | ||
39 | +of everyone to be credited as a member of the group should your entry | ||
40 | +achieve the highest score. But no person is to be listed on more than | ||
41 | +one entry per task. |
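The submission layout described in README.testset (contact information, a line of five asterisks, then one prediction per line in test-set order) can be sketched as below. This is only an illustration: the function name `format_submission` and the sample contact lines are hypothetical and do not appear in the commit.

```python
def format_submission(contact_lines, predictions):
    """Return the text of a prediction file in the layout README.testset describes.

    contact_lines: free-form contact information, one string per line.
    predictions:   'A' or 'I' for each test compound, in test-set order.
    """
    for p in predictions:
        if p not in ("A", "I"):
            raise ValueError("predictions must be 'A' or 'I', got: {!r}".format(p))
    # Contact block, then the 5-asterisk separator, then one prediction per line.
    lines = list(contact_lines) + ["*****"] + list(predictions)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical contact details and a shortened prediction list.
    print(format_submission(["Jane Doe", "jane@example.org"], ["I", "I", "A"]), end="")
```

With the full 634 predictions, the part of the file after the contact information has 635 lines, which matches the count stated in the README.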
binding-thrombin-dataset/README.trainingset
0 → 100644
1 | +Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin | ||
2 | +-------------------------------------------------------------------------- | ||
3 | + | ||
4 | +Drugs are typically small organic molecules that achieve their desired | ||
5 | +activity by binding to a target site on a receptor. The first step in | ||
6 | +the discovery of a new drug is usually to identify and isolate the | ||
7 | +receptor to which it should bind, followed by testing many small | ||
8 | +molecules for their ability to bind to the target site. This leaves | ||
9 | +researchers with the task of determining what separates the active | ||
10 | +(binding) compounds from the inactive (non-binding) ones. Such a | ||
11 | +determination can then be used in the design of new compounds that not | ||
12 | +only bind, but also have all the other properties required for a drug | ||
13 | +(solubility, oral absorption, lack of side effects, appropriate duration | ||
14 | +of action, toxicity, etc.). | ||
15 | + | ||
16 | +The present training data set consists of 1909 compounds tested for | ||
17 | +their ability to bind to a target site on thrombin, a key receptor in | ||
18 | +blood clotting. The chemical structures of these compounds are not | ||
19 | +necessary for our analysis and are not included. Of these compounds, 42 | ||
20 | +are active (bind well) and the others are inactive. Each compound is | ||
21 | +described by a single feature vector comprised of a class value (A for | ||
22 | +active, I for inactive) and 139,351 binary features, which describe | ||
23 | +three-dimensional properties of the molecule. The definitions of the | ||
24 | +individual bits are not included - we don't know what each individual | ||
25 | +bit means, only that they are generated in an internally consistent | ||
26 | +manner for all 1909 compounds. Biological activity in general, and | ||
27 | +receptor binding affinity in particular, correlate with various | ||
28 | +structural and physical properties of small organic molecules. The task | ||
29 | +is to determine which of these properties are critical in this case and | ||
30 | +to learn to accurately predict the class value. To simulate the | ||
31 | +real-world drug design environment, the test set contains 634 additional | ||
32 | +compounds that were in fact generated based on the assay results | ||
33 | +recorded for the training set. In evaluating the accuracy, a | ||
34 | +differential cost model will be used, so that the sum of the costs of | ||
35 | +the actives will be equal to the sum of the costs of the inactives. | ||
36 | + | ||
37 | +We thank DuPont Pharmaceuticals for graciously providing this data set | ||
38 | +for the KDD Cup 2001 competition. All publications referring to | ||
39 | +analysis of this data set should acknowledge DuPont Pharmaceuticals | ||
40 | +Research Laboratories and KDD Cup 2001. |
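README.trainingset does not give a formula for the differential cost model. One plausible reading, assumed here, is that each class carries the same total weight, which amounts to averaging the per-class accuracy of the actives and the inactives. The sketch below follows that assumption; the function name `weighted_accuracy` is hypothetical.

```python
def weighted_accuracy(y_true, y_pred, classes=("A", "I")):
    """Average of per-class accuracies, so each class has equal total weight.

    Assumption: this is one interpretation of the README's differential
    cost model, not a formula taken from the data set documentation.
    """
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        if not idx:
            raise ValueError("no true examples of class {!r}".format(c))
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    y_true = ["A", "A", "I", "I", "I", "I"]
    y_pred = ["A", "I", "I", "I", "I", "I"]
    print(weighted_accuracy(y_true, y_pred))
```

Under this measure a classifier that labels every compound inactive scores 0.5 regardless of how rare the actives are, which is why plain accuracy can be misleading on a training set with 42 actives out of 1909 compounds.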
binding-thrombin-dataset/README.txt
0 → 100644
File mode changed
binding-thrombin-dataset/Thrombin.testset
0 → 100644
This diff could not be displayed because it is too large.
1 | +I | ||
2 | +A | ||
3 | +I | ||
4 | +I | ||
5 | +I | ||
6 | +A | ||
7 | +I | ||
8 | +I | ||
9 | +I | ||
10 | +A | ||
11 | +I | ||
12 | +I | ||
13 | +I | ||
14 | +A | ||
15 | +I | ||
16 | +A | ||
17 | +I | ||
18 | +I | ||
19 | +I | ||
20 | +I | ||
21 | +I | ||
22 | +I | ||
23 | +I | ||
24 | +I | ||
25 | +I | ||
26 | +I | ||
27 | +I | ||
28 | +A | ||
29 | +I | ||
30 | +A | ||
31 | +I | ||
32 | +I | ||
33 | +I | ||
34 | +I | ||
35 | +I | ||
36 | +A | ||
37 | +I | ||
38 | +I | ||
39 | +I | ||
40 | +A | ||
41 | +I | ||
42 | +I | ||
43 | +I | ||
44 | +I | ||
45 | +I | ||
46 | +I | ||
47 | +I | ||
48 | +I | ||
49 | +A | ||
50 | +A | ||
51 | +I | ||
52 | +I | ||
53 | +I | ||
54 | +I | ||
55 | +I | ||
56 | +A | ||
57 | +I | ||
58 | +A | ||
59 | +A | ||
60 | +I | ||
61 | +I | ||
62 | +I | ||
63 | +A | ||
64 | +I | ||
65 | +I | ||
66 | +I | ||
67 | +I | ||
68 | +A | ||
69 | +A | ||
70 | +I | ||
71 | +A | ||
72 | +I | ||
73 | +I | ||
74 | +A | ||
75 | +A | ||
76 | +I | ||
77 | +I | ||
78 | +I | ||
79 | +I | ||
80 | +I | ||
81 | +I | ||
82 | +I | ||
83 | +I | ||
84 | +I | ||
85 | +I | ||
86 | +I | ||
87 | +I | ||
88 | +I | ||
89 | +I | ||
90 | +I | ||
91 | +A | ||
92 | +I | ||
93 | +I | ||
94 | +A | ||
95 | +I | ||
96 | +A | ||
97 | +I | ||
98 | +I | ||
99 | +I | ||
100 | +A | ||
101 | +A | ||
102 | +I | ||
103 | +I | ||
104 | +I | ||
105 | +I | ||
106 | +I | ||
107 | +I | ||
108 | +A | ||
109 | +I | ||
110 | +I | ||
111 | +I | ||
112 | +I | ||
113 | +A | ||
114 | +A | ||
115 | +I | ||
116 | +I | ||
117 | +I | ||
118 | +I | ||
119 | +I | ||
120 | +I | ||
121 | +I | ||
122 | +I | ||
123 | +A | ||
124 | +A | ||
125 | +I | ||
126 | +A | ||
127 | +A | ||
128 | +I | ||
129 | +I | ||
130 | +I | ||
131 | +I | ||
132 | +I | ||
133 | +I | ||
134 | +I | ||
135 | +I | ||
136 | +I | ||
137 | +I | ||
138 | +A | ||
139 | +I | ||
140 | +I | ||
141 | +I | ||
142 | +I | ||
143 | +I | ||
144 | +I | ||
145 | +I | ||
146 | +I | ||
147 | +I | ||
148 | +A | ||
149 | +I | ||
150 | +I | ||
151 | +I | ||
152 | +I | ||
153 | +A | ||
154 | +I | ||
155 | +I | ||
156 | +I | ||
157 | +I | ||
158 | +I | ||
159 | +I | ||
160 | +I | ||
161 | +A | ||
162 | +I | ||
163 | +I | ||
164 | +A | ||
165 | +I | ||
166 | +A | ||
167 | +I | ||
168 | +I | ||
169 | +A | ||
170 | +I | ||
171 | +A | ||
172 | +I | ||
173 | +A | ||
174 | +I | ||
175 | +A | ||
176 | +I | ||
177 | +I | ||
178 | +I | ||
179 | +I | ||
180 | +I | ||
181 | +A | ||
182 | +I | ||
183 | +I | ||
184 | +A | ||
185 | +I | ||
186 | +I | ||
187 | +A | ||
188 | +I | ||
189 | +I | ||
190 | +I | ||
191 | +A | ||
192 | +I | ||
193 | +A | ||
194 | +I | ||
195 | +I | ||
196 | +A | ||
197 | +I | ||
198 | +I | ||
199 | +I | ||
200 | +I | ||
201 | +A | ||
202 | +I | ||
203 | +A | ||
204 | +I | ||
205 | +I | ||
206 | +I | ||
207 | +I | ||
208 | +I | ||
209 | +I | ||
210 | +I | ||
211 | +I | ||
212 | +I | ||
213 | +I | ||
214 | +I | ||
215 | +A | ||
216 | +I | ||
217 | +A | ||
218 | +I | ||
219 | +I | ||
220 | +I | ||
221 | +I | ||
222 | +I | ||
223 | +I | ||
224 | +A | ||
225 | +I | ||
226 | +I | ||
227 | +A | ||
228 | +A | ||
229 | +A | ||
230 | +I | ||
231 | +I | ||
232 | +A | ||
233 | +A | ||
234 | +I | ||
235 | +I | ||
236 | +I | ||
237 | +I | ||
238 | +A | ||
239 | +I | ||
240 | +I | ||
241 | +I | ||
242 | +I | ||
243 | +A | ||
244 | +I | ||
245 | +A | ||
246 | +I | ||
247 | +I | ||
248 | +I | ||
249 | +I | ||
250 | +I | ||
251 | +I | ||
252 | +I | ||
253 | +A | ||
254 | +A | ||
255 | +I | ||
256 | +I | ||
257 | +I | ||
258 | +I | ||
259 | +I | ||
260 | +I | ||
261 | +I | ||
262 | +I | ||
263 | +A | ||
264 | +A | ||
265 | +I | ||
266 | +I | ||
267 | +I | ||
268 | +I | ||
269 | +I | ||
270 | +I | ||
271 | +A | ||
272 | +A | ||
273 | +I | ||
274 | +I | ||
275 | +I | ||
276 | +I | ||
277 | +I | ||
278 | +I | ||
279 | +A | ||
280 | +I | ||
281 | +A | ||
282 | +I | ||
283 | +I | ||
284 | +I | ||
285 | +I | ||
286 | +I | ||
287 | +I | ||
288 | +I | ||
289 | +I | ||
290 | +I | ||
291 | +A | ||
292 | +I | ||
293 | +I | ||
294 | +A | ||
295 | +I | ||
296 | +I | ||
297 | +I | ||
298 | +I | ||
299 | +I | ||
300 | +I | ||
301 | +A | ||
302 | +A | ||
303 | +I | ||
304 | +I | ||
305 | +I | ||
306 | +I | ||
307 | +I | ||
308 | +A | ||
309 | +I | ||
310 | +I | ||
311 | +I | ||
312 | +I | ||
313 | +I | ||
314 | +A | ||
315 | +A | ||
316 | +A | ||
317 | +I | ||
318 | +A | ||
319 | +I | ||
320 | +I | ||
321 | +I | ||
322 | +I | ||
323 | +A | ||
324 | +A | ||
325 | +I | ||
326 | +A | ||
327 | +A | ||
328 | +I | ||
329 | +I | ||
330 | +I | ||
331 | +I | ||
332 | +I | ||
333 | +I | ||
334 | +I | ||
335 | +I | ||
336 | +I | ||
337 | +I | ||
338 | +I | ||
339 | +I | ||
340 | +A | ||
341 | +I | ||
342 | +I | ||
343 | +I | ||
344 | +I | ||
345 | +A | ||
346 | +A | ||
347 | +I | ||
348 | +I | ||
349 | +A | ||
350 | +I | ||
351 | +I | ||
352 | +I | ||
353 | +I | ||
354 | +I | ||
355 | +A | ||
356 | +A | ||
357 | +I | ||
358 | +A | ||
359 | +I | ||
360 | +I | ||
361 | +I | ||
362 | +I | ||
363 | +I | ||
364 | +I | ||
365 | +A | ||
366 | +A | ||
367 | +I | ||
368 | +I | ||
369 | +A | ||
370 | +I | ||
371 | +I | ||
372 | +I | ||
373 | +I | ||
374 | +I | ||
375 | +I | ||
376 | +I | ||
377 | +I | ||
378 | +I | ||
379 | +I | ||
380 | +I | ||
381 | +I | ||
382 | +A | ||
383 | +I | ||
384 | +I | ||
385 | +A | ||
386 | +I | ||
387 | +I | ||
388 | +A | ||
389 | +I | ||
390 | +I | ||
391 | +I | ||
392 | +I | ||
393 | +A | ||
394 | +A | ||
395 | +I | ||
396 | +A | ||
397 | +A | ||
398 | +I | ||
399 | +I | ||
400 | +A | ||
401 | +I | ||
402 | +I | ||
403 | +I | ||
404 | +I | ||
405 | +A | ||
406 | +I | ||
407 | +I | ||
408 | +I | ||
409 | +I | ||
410 | +I | ||
411 | +I | ||
412 | +I | ||
413 | +I | ||
414 | +I | ||
415 | +I | ||
416 | +A | ||
417 | +I | ||
418 | +I | ||
419 | +A | ||
420 | +A | ||
421 | +I | ||
422 | +I | ||
423 | +I | ||
424 | +A | ||
425 | +I | ||
426 | +I | ||
427 | +I | ||
428 | +I | ||
429 | +A | ||
430 | +I | ||
431 | +A | ||
432 | +I | ||
433 | +I | ||
434 | +I | ||
435 | +I | ||
436 | +I | ||
437 | +A | ||
438 | +I | ||
439 | +I | ||
440 | +I | ||
441 | +I | ||
442 | +I | ||
443 | +I | ||
444 | +I | ||
445 | +I | ||
446 | +I | ||
447 | +I | ||
448 | +I | ||
449 | +I | ||
450 | +I | ||
451 | +I | ||
452 | +I | ||
453 | +I | ||
454 | +I | ||
455 | +I | ||
456 | +A | ||
457 | +A | ||
458 | +A | ||
459 | +A | ||
460 | +I | ||
461 | +I | ||
462 | +I | ||
463 | +A | ||
464 | +A | ||
465 | +I | ||
466 | +I | ||
467 | +I | ||
468 | +I | ||
469 | +I | ||
470 | +A | ||
471 | +I | ||
472 | +A | ||
473 | +I | ||
474 | +I | ||
475 | +I | ||
476 | +I | ||
477 | +I | ||
478 | +A | ||
479 | +I | ||
480 | +I | ||
481 | +I | ||
482 | +I | ||
483 | +A | ||
484 | +A | ||
485 | +I | ||
486 | +I | ||
487 | +I | ||
488 | +I | ||
489 | +I | ||
490 | +I | ||
491 | +I | ||
492 | +I | ||
493 | +I | ||
494 | +I | ||
495 | +I | ||
496 | +I | ||
497 | +I | ||
498 | +I | ||
499 | +I | ||
500 | +I | ||
501 | +I | ||
502 | +A | ||
503 | +I | ||
504 | +A | ||
505 | +I | ||
506 | +I | ||
507 | +A | ||
508 | +I | ||
509 | +I | ||
510 | +I | ||
511 | +I | ||
512 | +A | ||
513 | +I | ||
514 | +I | ||
515 | +A | ||
516 | +A | ||
517 | +I | ||
518 | +I | ||
519 | +I | ||
520 | +A | ||
521 | +I | ||
522 | +A | ||
523 | +I | ||
524 | +I | ||
525 | +I | ||
526 | +I | ||
527 | +I | ||
528 | +I | ||
529 | +I | ||
530 | +A | ||
531 | +A | ||
532 | +I | ||
533 | +I | ||
534 | +I | ||
535 | +A | ||
536 | +I | ||
537 | +I | ||
538 | +I | ||
539 | +A | ||
540 | +I | ||
541 | +I | ||
542 | +I | ||
543 | +I | ||
544 | +I | ||
545 | +I | ||
546 | +A | ||
547 | +I | ||
548 | +I | ||
549 | +I | ||
550 | +I | ||
551 | +I | ||
552 | +A | ||
553 | +I | ||
554 | +I | ||
555 | +I | ||
556 | +I | ||
557 | +I | ||
558 | +A | ||
559 | +I | ||
560 | +I | ||
561 | +A | ||
562 | +I | ||
563 | +I | ||
564 | +I | ||
565 | +I | ||
566 | +I | ||
567 | +I | ||
568 | +A | ||
569 | +I | ||
570 | +I | ||
571 | +I | ||
572 | +I | ||
573 | +I | ||
574 | +I | ||
575 | +I | ||
576 | +I | ||
577 | +I | ||
578 | +I | ||
579 | +I | ||
580 | +I | ||
581 | +I | ||
582 | +I | ||
583 | +I | ||
584 | +I | ||
585 | +A | ||
586 | +I | ||
587 | +I | ||
588 | +A | ||
589 | +I | ||
590 | +I | ||
591 | +I | ||
592 | +I | ||
593 | +A | ||
594 | +I | ||
595 | +I | ||
596 | +I | ||
597 | +I | ||
598 | +I | ||
599 | +A | ||
600 | +I | ||
601 | +I | ||
602 | +I | ||
603 | +A | ||
604 | +I | ||
605 | +I | ||
606 | +I | ||
607 | +A | ||
608 | +I | ||
609 | +A | ||
610 | +A | ||
611 | +I | ||
612 | +A | ||
613 | +I | ||
614 | +I | ||
615 | +I | ||
616 | +I | ||
617 | +I | ||
618 | +I | ||
619 | +I | ||
620 | +A | ||
621 | +I | ||
622 | +I | ||
623 | +I | ||
624 | +A | ||
625 | +I | ||
626 | +A | ||
627 | +I | ||
628 | +I | ||
629 | +I | ||
630 | +I | ||
631 | +A | ||
632 | +I | ||
633 | +A | ||
634 | +I |
binding-thrombin-dataset/models/delete-me
0 → 100644
File mode changed
binding-thrombin-dataset/reports/delete-me
0 → 100644
File mode changed
binding-thrombin-dataset/thrombin.data
0 → 100644
This diff could not be displayed because it is too large.
binding-thrombin-dataset/thrombin.names
0 → 100644
This diff could not be displayed because it is too large.
1 | +# -*- encoding: utf-8 -*- | ||
2 | + | ||
3 | +import os | ||
4 | +from time import time | ||
5 | +import argparse | ||
6 | +from sklearn.naive_bayes import BernoulliNB | ||
7 | +from sklearn.svm import SVC | ||
8 | +from sklearn.neighbors import KNeighborsClassifier | ||
9 | +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \ | ||
10 | + classification_report | ||
11 | +import joblib | ||
12 | +from scipy.sparse import csr_matrix | ||
13 | + | ||
14 | +__author__ = 'CMendezC' | ||
15 | + | ||
16 | +# Goal: training and testing binding thrombin data set | ||
17 | + | ||
18 | +# Parameters: | ||
19 | +# 1) --inputPath Path to read input files. | ||
20 | +# 2) --inputTrainingData File to read training data. | ||
21 | +# 3) --inputTestingData File to read testing data. | ||
22 | +# 4) --inputTestingClasses File to read testing classes. | ||
23 | +# 5) --outputModelPath Path to place output model. | ||
24 | +# 6) --outputModelFile File to place output model. | ||
25 | +# 7) --outputReportPath Path to place evaluation report. | ||
26 | +# 8) --outputReportFile File to place evaluation report. | ||
27 | +# 9) --classifier Classifier: BernoulliNB, SVM, kNN. | ||
28 | +# 10) --saveData Save matrices | ||
29 | + | ||
30 | +# Output: | ||
31 | +# 1) Classification model and evaluation report. | ||
32 | + | ||
33 | +# Execution: | ||
34 | + | ||
35 | +# python training-testing-binding-thrombin.py | ||
36 | +# --inputPath /home/binding-thrombin-dataset | ||
37 | +# --inputTrainingData thrombin.data | ||
38 | +# --inputTestingData Thrombin.testset | ||
39 | +# --inputTestingClasses Thrombin.testset.class | ||
40 | +# --outputModelPath /home/binding-thrombin-dataset/models | ||
41 | +# --outputModelFile SVM-model.mod | ||
42 | +# --outputReportPath /home/binding-thrombin-dataset/reports | ||
43 | +# --outputReportFile SVM.txt | ||
44 | +# --classifier SVM | ||
45 | +# --saveData | ||
46 | + | ||
47 | +# source activate python3 | ||
48 | +# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData | ||
49 | + | ||
50 | +########################################################### | ||
51 | +# MAIN PROGRAM # | ||
52 | +########################################################### | ||
53 | + | ||
54 | +if __name__ == "__main__": | ||
55 | + # Parameter definition | ||
56 | + parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.') | ||
57 | + parser.add_argument("--inputPath", dest="inputPath", | ||
58 | + help="Path to read input files", metavar="PATH") | ||
59 | + parser.add_argument("--inputTrainingData", dest="inputTrainingData", | ||
60 | + help="File to read training data", metavar="FILE") | ||
61 | + parser.add_argument("--inputTestingData", dest="inputTestingData", | ||
62 | + help="File to read testing data", metavar="FILE") | ||
63 | + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses", | ||
64 | + help="File to read testing classes", metavar="FILE") | ||
65 | + parser.add_argument("--outputModelPath", dest="outputModelPath", | ||
66 | + help="Path to place output model", metavar="PATH") | ||
67 | + parser.add_argument("--outputModelFile", dest="outputModelFile", | ||
68 | + help="File to place output model", metavar="FILE") | ||
69 | + parser.add_argument("--outputReportPath", dest="outputReportPath", | ||
70 | + help="Path to place evaluation report", metavar="PATH") | ||
71 | + parser.add_argument("--outputReportFile", dest="outputReportFile", | ||
72 | + help="File to place evaluation report", metavar="FILE") | ||
73 | + parser.add_argument("--classifier", dest="classifier", | ||
74 | + help="Classifier", metavar="NAME", | ||
75 | + choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM') | ||
76 | + parser.add_argument("--saveData", dest="saveData", action='store_true', | ||
77 | + help="Save matrices") | ||
78 | + | ||
79 | + args = parser.parse_args() | ||
80 | + | ||
81 | + # Printing parameter values | ||
82 | + print('-------------------------------- PARAMETERS --------------------------------') | ||
83 | + print("Path to read input files: " + str(args.inputPath)) | ||
84 | + print("File to read training data: " + str(args.inputTrainingData)) | ||
85 | + print("File to read testing data: " + str(args.inputTestingData)) | ||
86 | + print("File to read testing classes: " + str(args.inputTestingClasses)) | ||
87 | + print("Path to place output model: " + str(args.outputModelPath)) | ||
88 | + print("File to place output model: " + str(args.outputModelFile)) | ||
89 | + print("Path to place evaluation report: " + str(args.outputReportPath)) | ||
90 | + print("File to place evaluation report: " + str(args.outputReportFile)) | ||
91 | + print("Classifier: " + str(args.classifier)) | ||
92 | + print("Save matrices: " + str(args.saveData)) | ||
93 | + | ||
94 | + # Start time | ||
95 | + t0 = time() | ||
96 | + | ||
97 | + print("Reading training data and true classes...") | ||
98 | + X_train = None | ||
99 | + if args.saveData: | ||
100 | + y_train = [] | ||
101 | + trainingData = [] | ||
102 | + with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \ | ||
103 | + as iFile: | ||
104 | + for line in iFile: | ||
105 | + line = line.strip('\r\n') | ||
106 | + listLine = line.split(',') | ||
107 | + y_train.append(listLine[0]) | ||
108 | + trainingData.append(listLine[1:]) | ||
109 | + # X_train = np.matrix(trainingData) | ||
110 | + X_train = csr_matrix(trainingData, dtype='double') | ||
111 | + print(" Saving matrix and classes...") | ||
112 | + joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
113 | + joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
114 | + print(" Done!") | ||
115 | + else: | ||
116 | + print(" Loading matrix and classes...") | ||
117 | + X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb')) | ||
118 | + y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb')) | ||
119 | + print(" Done!") | ||
120 | + | ||
121 | + print(" Number of training classes: {}".format(len(y_train))) | ||
122 | + print(" Number of training class A: {}".format(y_train.count('A'))) | ||
123 | + print(" Number of training class I: {}".format(y_train.count('I'))) | ||
124 | + print(" Shape of training matrix: {}".format(X_train.shape)) | ||
125 | + | ||
126 | + print("Reading testing data and true classes...") | ||
127 | + X_test = None | ||
128 | + if args.saveData: | ||
129 | + y_test = [] | ||
130 | + testingData = [] | ||
131 | + with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \ | ||
132 | + as iFile: | ||
133 | + for line in iFile: | ||
134 | + line = line.strip('\r\n') | ||
135 | + listLine = line.split(',') | ||
136 | + testingData.append(listLine[1:]) | ||
137 | + X_test = csr_matrix(testingData, dtype='double') | ||
138 | + with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \ | ||
139 | + as iFile: | ||
140 | + for line in iFile: | ||
141 | + line = line.strip('\r\n') | ||
142 | + y_test.append(line) | ||
143 | + print(" Saving matrix and classes...") | ||
144 | + joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
145 | + joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
146 | + print(" Done!") | ||
147 | + else: | ||
148 | + print(" Loading matrix and classes...") | ||
149 | + X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb')) | ||
150 | + y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb')) | ||
151 | + print(" Done!") | ||
152 | + | ||
153 | + print(" Number of testing classes: {}".format(len(y_test))) | ||
154 | + print(" Number of testing class A: {}".format(y_test.count('A'))) | ||
155 | + print(" Number of testing class I: {}".format(y_test.count('I'))) | ||
156 | + print(" Shape of testing matrix: {}".format(X_test.shape)) | ||
157 | + | ||
158 | + if args.classifier == "BernoulliNB": | ||
159 | + classifier = BernoulliNB() | ||
160 | + elif args.classifier == "SVM": | ||
161 | + classifier = SVC() | ||
162 | + elif args.classifier == "kNN": | ||
163 | + classifier = KNeighborsClassifier() | ||
164 | + else: | ||
165 | + print("Bad classifier") | ||
166 | + exit() | ||
167 | + | ||
168 | + print("Training...") | ||
169 | + classifier.fit(X_train, y_train) | ||
170 | + print(" Done!") | ||
171 | + | ||
172 | + print("Testing (prediction in new data)...") | ||
173 | + y_pred = classifier.predict(X_test) | ||
174 | + print(" Done!") | ||
175 | + | ||
176 | + print("Saving report...") | ||
177 | + with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
178 | + oFile.write('********** EVALUATION REPORT **********\n') | ||
179 | + oFile.write('Classifier: {}\n'.format(args.classifier)) | ||
180 | + oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred))) | ||
181 | + oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted'))) | ||
182 | + oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted'))) | ||
183 | + oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted'))) | ||
184 | + oFile.write('Confusion matrix: \n') | ||
185 | + oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n') | ||
186 | + oFile.write('Classification report: \n') | ||
187 | + oFile.write(classification_report(y_test, y_pred) + '\n') | ||
188 | + print(" Done!") | ||
189 | + | ||
190 | + print("Training and testing done in: %fs" % (time() - t0)) |
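The script collects every feature value as a string in a nested list before calling `csr_matrix`, which can be memory-hungry with 139,351 columns per row. A possible alternative, sketched below, records only the positions of the non-zero bits while reading and builds the CSR matrix from them directly. It assumes, as the script's `split(',')` suggests, that each line is a label followed by comma-separated `0`/`1` values; the helper name `read_sparse_binary` is hypothetical.

```python
from scipy.sparse import csr_matrix


def read_sparse_binary(lines):
    """Parse 'label,bit,bit,...' lines into (CSR matrix, list of labels).

    Only the column indices of non-zero bits are stored while reading.
    """
    labels, indices, indptr = [], [], [0]
    n_cols = 0
    for line in lines:
        fields = line.strip("\r\n").split(",")
        if fields == [""]:
            continue  # skip blank lines
        labels.append(fields[0])
        bits = fields[1:]
        n_cols = max(n_cols, len(bits))
        indices.extend(j for j, b in enumerate(bits) if b != "0")
        indptr.append(len(indices))  # marks the end of this row
    data = [1.0] * len(indices)
    X = csr_matrix((data, indices, indptr), shape=(len(labels), n_cols))
    return X, labels


if __name__ == "__main__":
    sample = ["A,0,1,0,1\n", "I,0,0,0,0\n", "?,1,0,0,0\n"]
    X, y = read_sparse_binary(sample)
    print(X.shape, X.nnz, y)
```

In the script this could replace the `trainingData` and `testingData` lists by passing the open file object as `lines`; for the test set the returned labels would be the `?` placeholders and can be ignored.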