Showing 8 changed files with 878 additions and 0 deletions
| 1 | +The test set consists of 634 data points, each of which represents | ||
| 2 | +a molecule that is either active (A) or inactive (I). The test set | ||
| 3 | +has the same format as the training set, with the exception that the | ||
| 4 | +activity value (A or I) for each data point is missing, that is, has | ||
| 5 | +been replaced by a question mark (?). Please submit one prediction, | ||
| 6 | +A or I, for each data point. Your submission should be in the form | ||
| 7 | +of a file that starts with your contact information, followed by a | ||
| 8 | +line with 5 asterisks, followed immediately by your predictions, with | ||
| 9 | +one line per data point. The predictions should be in the same order | ||
| 10 | +as the test set data points. So your prediction for the first example | ||
| 11 | +should appear on the first line after the asterisks, your prediction | ||
| 12 | +for the second example should appear on the second line after the | ||
| 13 | +asterisks, etc. Hence, after your contact information, the prediction | ||
| 14 | +file will consist of 635 lines and have the form: | ||
| 15 | + | ||
| 16 | +***** | ||
| 17 | +I | ||
| 18 | +I | ||
| 19 | +A | ||
| 20 | +I | ||
| 21 | +A | ||
| 22 | +I | ||
| 23 | + | ||
| 24 | +etc. | ||
| 25 | + | ||
| 26 | +You may submit your prediction by email to page@biostat.wisc.edu | ||
| 27 | +or by anonymous ftp to ftp.biostat.wisc.edu, placing the file | ||
| 28 | +into the directory dropboxes/page/. If using email, please use | ||
| 29 | +the subject line "KDDcup <name> thrombin" where <name> is your | ||
| 30 | +name. If using ftp, please name the file KDDcup.<name>.thrombin | ||
| 31 | +where <name> is your name. For example, my submission would be | ||
| 32 | +named KDDcup.DavidPage.thrombin | ||
| 33 | + | ||
| 34 | +Only one submission per person per task is permitted. If you do not | ||
| 35 | +receive email confirmation of your submission within 24 hours, please | ||
| 36 | +email page@biostat.wisc.edu with subject "KDDcup no confirmation". | ||
| 37 | + | ||
| 38 | +For group entries, the contact information should include the names | ||
| 39 | +of everyone to be credited as a member of the group should your entry | ||
| 40 | +achieve the highest score. But no person is to be listed on more than | ||
| 41 | +one entry per task. |
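The submission layout described in this file (contact information, a line of 5 asterisks, then one prediction per line) can be sketched as below. This is a minimal illustration, not part of the commit: the helper names `build_submission` and `write_submission` are hypothetical, and the default count of 634 is taken from the instructions above.

```python
from typing import Iterable, List

SEPARATOR = "*****"  # the line of 5 asterisks required by the instructions


def build_submission(contact_lines: Iterable[str], predictions: Iterable[str],
                     expected_count: int = 634) -> List[str]:
    """Return the lines of a submission: contact info, separator, predictions.

    Hypothetical helper: validates that every prediction is 'A' or 'I' and
    that there is exactly one prediction per test data point.
    """
    preds = [p.strip() for p in predictions]
    bad = sorted({p for p in preds if p not in ("A", "I")})
    if bad:
        raise ValueError("predictions must be 'A' or 'I', got: {}".format(bad))
    if len(preds) != expected_count:
        raise ValueError("expected {} predictions, got {}".format(expected_count, len(preds)))
    return list(contact_lines) + [SEPARATOR] + preds


def write_submission(path: str, lines: Iterable[str]) -> None:
    """Write one entry per line (hypothetical helper, not part of the commit)."""
    with open(path, mode="w", encoding="utf8") as out:
        out.write("\n".join(lines) + "\n")
```

With 634 predictions, the separator plus the predictions come to 635 lines after the contact information, which matches the count stated in the instructions.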
| 1 | +Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin | ||
| 2 | +-------------------------------------------------------------------------- | ||
| 3 | + | ||
| 4 | +Drugs are typically small organic molecules that achieve their desired | ||
| 5 | +activity by binding to a target site on a receptor. The first step in | ||
| 6 | +the discovery of a new drug is usually to identify and isolate the | ||
| 7 | +receptor to which it should bind, followed by testing many small | ||
| 8 | +molecules for their ability to bind to the target site. This leaves | ||
| 9 | +researchers with the task of determining what separates the active | ||
| 10 | +(binding) compounds from the inactive (non-binding) ones. Such a | ||
| 11 | +determination can then be used in the design of new compounds that not | ||
| 12 | +only bind, but also have all the other properties required for a drug | ||
| 13 | +(solubility, oral absorption, lack of side effects, appropriate duration | ||
| 14 | +of action, toxicity, etc.). | ||
| 15 | + | ||
| 16 | +The present training data set consists of 1909 compounds tested for | ||
| 17 | +their ability to bind to a target site on thrombin, a key receptor in | ||
| 18 | +blood clotting. The chemical structures of these compounds are not | ||
| 19 | +necessary for our analysis and are not included. Of these compounds, 42 | ||
| 20 | +are active (bind well) and the others are inactive. Each compound is | ||
| 21 | +described by a single feature vector comprised of a class value (A for | ||
| 22 | +active, I for inactive) and 139,351 binary features, which describe | ||
| 23 | +three-dimensional properties of the molecule. The definitions of the | ||
| 24 | +individual bits are not included - we don't know what each individual | ||
| 25 | +bit means, only that they are generated in an internally consistent | ||
| 26 | +manner for all 1909 compounds. Biological activity in general, and | ||
| 27 | +receptor binding affinity in particular, correlate with various | ||
| 28 | +structural and physical properties of small organic molecules. The task | ||
| 29 | +is to determine which of these properties are critical in this case and | ||
| 30 | +to learn to accurately predict the class value. To simulate the | ||
| 31 | +real-world drug design environment, the test set contains 634 additional | ||
| 32 | +compounds that were in fact generated based on the assay results | ||
| 33 | +recorded for the training set. In evaluating the accuracy, a | ||
| 34 | +differential cost model will be used, so that the sum of the costs of | ||
| 35 | +the actives will be equal to the sum of the costs of the inactives. | ||
| 36 | + | ||
| 37 | +We thank DuPont Pharmaceuticals for graciously providing this data set | ||
| 38 | +for the KDD Cup 2001 competition. All publications referring to | ||
| 39 | +analysis of this data set should acknowledge DuPont Pharmaceuticals | ||
| 40 | +Research Laboratories and KDD Cup 2001. |
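One way to read the differential cost model described above is that each class contributes equally to the score, so the result is the mean of the per-class accuracies. The sketch below implements that reading in plain Python. The function name `weighted_accuracy` is hypothetical, and the organizers' exact scoring procedure is not given in this file, so treat this as an assumption.

```python
from typing import Sequence


def weighted_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of the accuracy on actives ('A') and the accuracy on inactives ('I').

    Assumption: giving both classes the same total cost is equivalent to
    averaging the two per-class accuracies.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    per_class = []
    for label in ("A", "I"):
        total = sum(1 for t in y_true if t == label)
        if total == 0:
            raise ValueError("no true examples of class {!r}".format(label))
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        per_class.append(correct / total)
    return sum(per_class) / len(per_class)
```

Under this measure, predicting every compound as inactive scores 0.5, whereas plain accuracy on the training distribution (42 actives out of 1909) would be 1867/1909, about 0.978.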
File mode changed
This diff could not be displayed because it is too large.
| 1 | +I | ||
| 2 | +A | ||
| 3 | +I | ||
| 4 | +I | ||
| 5 | +I | ||
| 6 | +A | ||
| 7 | +I | ||
| 8 | +I | ||
| 9 | +I | ||
| 10 | +A | ||
| 11 | +I | ||
| 12 | +I | ||
| 13 | +I | ||
| 14 | +A | ||
| 15 | +I | ||
| 16 | +A | ||
| 17 | +I | ||
| 18 | +I | ||
| 19 | +I | ||
| 20 | +I | ||
| 21 | +I | ||
| 22 | +I | ||
| 23 | +I | ||
| 24 | +I | ||
| 25 | +I | ||
| 26 | +I | ||
| 27 | +I | ||
| 28 | +A | ||
| 29 | +I | ||
| 30 | +A | ||
| 31 | +I | ||
| 32 | +I | ||
| 33 | +I | ||
| 34 | +I | ||
| 35 | +I | ||
| 36 | +A | ||
| 37 | +I | ||
| 38 | +I | ||
| 39 | +I | ||
| 40 | +A | ||
| 41 | +I | ||
| 42 | +I | ||
| 43 | +I | ||
| 44 | +I | ||
| 45 | +I | ||
| 46 | +I | ||
| 47 | +I | ||
| 48 | +I | ||
| 49 | +A | ||
| 50 | +A | ||
| 51 | +I | ||
| 52 | +I | ||
| 53 | +I | ||
| 54 | +I | ||
| 55 | +I | ||
| 56 | +A | ||
| 57 | +I | ||
| 58 | +A | ||
| 59 | +A | ||
| 60 | +I | ||
| 61 | +I | ||
| 62 | +I | ||
| 63 | +A | ||
| 64 | +I | ||
| 65 | +I | ||
| 66 | +I | ||
| 67 | +I | ||
| 68 | +A | ||
| 69 | +A | ||
| 70 | +I | ||
| 71 | +A | ||
| 72 | +I | ||
| 73 | +I | ||
| 74 | +A | ||
| 75 | +A | ||
| 76 | +I | ||
| 77 | +I | ||
| 78 | +I | ||
| 79 | +I | ||
| 80 | +I | ||
| 81 | +I | ||
| 82 | +I | ||
| 83 | +I | ||
| 84 | +I | ||
| 85 | +I | ||
| 86 | +I | ||
| 87 | +I | ||
| 88 | +I | ||
| 89 | +I | ||
| 90 | +I | ||
| 91 | +A | ||
| 92 | +I | ||
| 93 | +I | ||
| 94 | +A | ||
| 95 | +I | ||
| 96 | +A | ||
| 97 | +I | ||
| 98 | +I | ||
| 99 | +I | ||
| 100 | +A | ||
| 101 | +A | ||
| 102 | +I | ||
| 103 | +I | ||
| 104 | +I | ||
| 105 | +I | ||
| 106 | +I | ||
| 107 | +I | ||
| 108 | +A | ||
| 109 | +I | ||
| 110 | +I | ||
| 111 | +I | ||
| 112 | +I | ||
| 113 | +A | ||
| 114 | +A | ||
| 115 | +I | ||
| 116 | +I | ||
| 117 | +I | ||
| 118 | +I | ||
| 119 | +I | ||
| 120 | +I | ||
| 121 | +I | ||
| 122 | +I | ||
| 123 | +A | ||
| 124 | +A | ||
| 125 | +I | ||
| 126 | +A | ||
| 127 | +A | ||
| 128 | +I | ||
| 129 | +I | ||
| 130 | +I | ||
| 131 | +I | ||
| 132 | +I | ||
| 133 | +I | ||
| 134 | +I | ||
| 135 | +I | ||
| 136 | +I | ||
| 137 | +I | ||
| 138 | +A | ||
| 139 | +I | ||
| 140 | +I | ||
| 141 | +I | ||
| 142 | +I | ||
| 143 | +I | ||
| 144 | +I | ||
| 145 | +I | ||
| 146 | +I | ||
| 147 | +I | ||
| 148 | +A | ||
| 149 | +I | ||
| 150 | +I | ||
| 151 | +I | ||
| 152 | +I | ||
| 153 | +A | ||
| 154 | +I | ||
| 155 | +I | ||
| 156 | +I | ||
| 157 | +I | ||
| 158 | +I | ||
| 159 | +I | ||
| 160 | +I | ||
| 161 | +A | ||
| 162 | +I | ||
| 163 | +I | ||
| 164 | +A | ||
| 165 | +I | ||
| 166 | +A | ||
| 167 | +I | ||
| 168 | +I | ||
| 169 | +A | ||
| 170 | +I | ||
| 171 | +A | ||
| 172 | +I | ||
| 173 | +A | ||
| 174 | +I | ||
| 175 | +A | ||
| 176 | +I | ||
| 177 | +I | ||
| 178 | +I | ||
| 179 | +I | ||
| 180 | +I | ||
| 181 | +A | ||
| 182 | +I | ||
| 183 | +I | ||
| 184 | +A | ||
| 185 | +I | ||
| 186 | +I | ||
| 187 | +A | ||
| 188 | +I | ||
| 189 | +I | ||
| 190 | +I | ||
| 191 | +A | ||
| 192 | +I | ||
| 193 | +A | ||
| 194 | +I | ||
| 195 | +I | ||
| 196 | +A | ||
| 197 | +I | ||
| 198 | +I | ||
| 199 | +I | ||
| 200 | +I | ||
| 201 | +A | ||
| 202 | +I | ||
| 203 | +A | ||
| 204 | +I | ||
| 205 | +I | ||
| 206 | +I | ||
| 207 | +I | ||
| 208 | +I | ||
| 209 | +I | ||
| 210 | +I | ||
| 211 | +I | ||
| 212 | +I | ||
| 213 | +I | ||
| 214 | +I | ||
| 215 | +A | ||
| 216 | +I | ||
| 217 | +A | ||
| 218 | +I | ||
| 219 | +I | ||
| 220 | +I | ||
| 221 | +I | ||
| 222 | +I | ||
| 223 | +I | ||
| 224 | +A | ||
| 225 | +I | ||
| 226 | +I | ||
| 227 | +A | ||
| 228 | +A | ||
| 229 | +A | ||
| 230 | +I | ||
| 231 | +I | ||
| 232 | +A | ||
| 233 | +A | ||
| 234 | +I | ||
| 235 | +I | ||
| 236 | +I | ||
| 237 | +I | ||
| 238 | +A | ||
| 239 | +I | ||
| 240 | +I | ||
| 241 | +I | ||
| 242 | +I | ||
| 243 | +A | ||
| 244 | +I | ||
| 245 | +A | ||
| 246 | +I | ||
| 247 | +I | ||
| 248 | +I | ||
| 249 | +I | ||
| 250 | +I | ||
| 251 | +I | ||
| 252 | +I | ||
| 253 | +A | ||
| 254 | +A | ||
| 255 | +I | ||
| 256 | +I | ||
| 257 | +I | ||
| 258 | +I | ||
| 259 | +I | ||
| 260 | +I | ||
| 261 | +I | ||
| 262 | +I | ||
| 263 | +A | ||
| 264 | +A | ||
| 265 | +I | ||
| 266 | +I | ||
| 267 | +I | ||
| 268 | +I | ||
| 269 | +I | ||
| 270 | +I | ||
| 271 | +A | ||
| 272 | +A | ||
| 273 | +I | ||
| 274 | +I | ||
| 275 | +I | ||
| 276 | +I | ||
| 277 | +I | ||
| 278 | +I | ||
| 279 | +A | ||
| 280 | +I | ||
| 281 | +A | ||
| 282 | +I | ||
| 283 | +I | ||
| 284 | +I | ||
| 285 | +I | ||
| 286 | +I | ||
| 287 | +I | ||
| 288 | +I | ||
| 289 | +I | ||
| 290 | +I | ||
| 291 | +A | ||
| 292 | +I | ||
| 293 | +I | ||
| 294 | +A | ||
| 295 | +I | ||
| 296 | +I | ||
| 297 | +I | ||
| 298 | +I | ||
| 299 | +I | ||
| 300 | +I | ||
| 301 | +A | ||
| 302 | +A | ||
| 303 | +I | ||
| 304 | +I | ||
| 305 | +I | ||
| 306 | +I | ||
| 307 | +I | ||
| 308 | +A | ||
| 309 | +I | ||
| 310 | +I | ||
| 311 | +I | ||
| 312 | +I | ||
| 313 | +I | ||
| 314 | +A | ||
| 315 | +A | ||
| 316 | +A | ||
| 317 | +I | ||
| 318 | +A | ||
| 319 | +I | ||
| 320 | +I | ||
| 321 | +I | ||
| 322 | +I | ||
| 323 | +A | ||
| 324 | +A | ||
| 325 | +I | ||
| 326 | +A | ||
| 327 | +A | ||
| 328 | +I | ||
| 329 | +I | ||
| 330 | +I | ||
| 331 | +I | ||
| 332 | +I | ||
| 333 | +I | ||
| 334 | +I | ||
| 335 | +I | ||
| 336 | +I | ||
| 337 | +I | ||
| 338 | +I | ||
| 339 | +I | ||
| 340 | +A | ||
| 341 | +I | ||
| 342 | +I | ||
| 343 | +I | ||
| 344 | +I | ||
| 345 | +A | ||
| 346 | +A | ||
| 347 | +I | ||
| 348 | +I | ||
| 349 | +A | ||
| 350 | +I | ||
| 351 | +I | ||
| 352 | +I | ||
| 353 | +I | ||
| 354 | +I | ||
| 355 | +A | ||
| 356 | +A | ||
| 357 | +I | ||
| 358 | +A | ||
| 359 | +I | ||
| 360 | +I | ||
| 361 | +I | ||
| 362 | +I | ||
| 363 | +I | ||
| 364 | +I | ||
| 365 | +A | ||
| 366 | +A | ||
| 367 | +I | ||
| 368 | +I | ||
| 369 | +A | ||
| 370 | +I | ||
| 371 | +I | ||
| 372 | +I | ||
| 373 | +I | ||
| 374 | +I | ||
| 375 | +I | ||
| 376 | +I | ||
| 377 | +I | ||
| 378 | +I | ||
| 379 | +I | ||
| 380 | +I | ||
| 381 | +I | ||
| 382 | +A | ||
| 383 | +I | ||
| 384 | +I | ||
| 385 | +A | ||
| 386 | +I | ||
| 387 | +I | ||
| 388 | +A | ||
| 389 | +I | ||
| 390 | +I | ||
| 391 | +I | ||
| 392 | +I | ||
| 393 | +A | ||
| 394 | +A | ||
| 395 | +I | ||
| 396 | +A | ||
| 397 | +A | ||
| 398 | +I | ||
| 399 | +I | ||
| 400 | +A | ||
| 401 | +I | ||
| 402 | +I | ||
| 403 | +I | ||
| 404 | +I | ||
| 405 | +A | ||
| 406 | +I | ||
| 407 | +I | ||
| 408 | +I | ||
| 409 | +I | ||
| 410 | +I | ||
| 411 | +I | ||
| 412 | +I | ||
| 413 | +I | ||
| 414 | +I | ||
| 415 | +I | ||
| 416 | +A | ||
| 417 | +I | ||
| 418 | +I | ||
| 419 | +A | ||
| 420 | +A | ||
| 421 | +I | ||
| 422 | +I | ||
| 423 | +I | ||
| 424 | +A | ||
| 425 | +I | ||
| 426 | +I | ||
| 427 | +I | ||
| 428 | +I | ||
| 429 | +A | ||
| 430 | +I | ||
| 431 | +A | ||
| 432 | +I | ||
| 433 | +I | ||
| 434 | +I | ||
| 435 | +I | ||
| 436 | +I | ||
| 437 | +A | ||
| 438 | +I | ||
| 439 | +I | ||
| 440 | +I | ||
| 441 | +I | ||
| 442 | +I | ||
| 443 | +I | ||
| 444 | +I | ||
| 445 | +I | ||
| 446 | +I | ||
| 447 | +I | ||
| 448 | +I | ||
| 449 | +I | ||
| 450 | +I | ||
| 451 | +I | ||
| 452 | +I | ||
| 453 | +I | ||
| 454 | +I | ||
| 455 | +I | ||
| 456 | +A | ||
| 457 | +A | ||
| 458 | +A | ||
| 459 | +A | ||
| 460 | +I | ||
| 461 | +I | ||
| 462 | +I | ||
| 463 | +A | ||
| 464 | +A | ||
| 465 | +I | ||
| 466 | +I | ||
| 467 | +I | ||
| 468 | +I | ||
| 469 | +I | ||
| 470 | +A | ||
| 471 | +I | ||
| 472 | +A | ||
| 473 | +I | ||
| 474 | +I | ||
| 475 | +I | ||
| 476 | +I | ||
| 477 | +I | ||
| 478 | +A | ||
| 479 | +I | ||
| 480 | +I | ||
| 481 | +I | ||
| 482 | +I | ||
| 483 | +A | ||
| 484 | +A | ||
| 485 | +I | ||
| 486 | +I | ||
| 487 | +I | ||
| 488 | +I | ||
| 489 | +I | ||
| 490 | +I | ||
| 491 | +I | ||
| 492 | +I | ||
| 493 | +I | ||
| 494 | +I | ||
| 495 | +I | ||
| 496 | +I | ||
| 497 | +I | ||
| 498 | +I | ||
| 499 | +I | ||
| 500 | +I | ||
| 501 | +I | ||
| 502 | +A | ||
| 503 | +I | ||
| 504 | +A | ||
| 505 | +I | ||
| 506 | +I | ||
| 507 | +A | ||
| 508 | +I | ||
| 509 | +I | ||
| 510 | +I | ||
| 511 | +I | ||
| 512 | +A | ||
| 513 | +I | ||
| 514 | +I | ||
| 515 | +A | ||
| 516 | +A | ||
| 517 | +I | ||
| 518 | +I | ||
| 519 | +I | ||
| 520 | +A | ||
| 521 | +I | ||
| 522 | +A | ||
| 523 | +I | ||
| 524 | +I | ||
| 525 | +I | ||
| 526 | +I | ||
| 527 | +I | ||
| 528 | +I | ||
| 529 | +I | ||
| 530 | +A | ||
| 531 | +A | ||
| 532 | +I | ||
| 533 | +I | ||
| 534 | +I | ||
| 535 | +A | ||
| 536 | +I | ||
| 537 | +I | ||
| 538 | +I | ||
| 539 | +A | ||
| 540 | +I | ||
| 541 | +I | ||
| 542 | +I | ||
| 543 | +I | ||
| 544 | +I | ||
| 545 | +I | ||
| 546 | +A | ||
| 547 | +I | ||
| 548 | +I | ||
| 549 | +I | ||
| 550 | +I | ||
| 551 | +I | ||
| 552 | +A | ||
| 553 | +I | ||
| 554 | +I | ||
| 555 | +I | ||
| 556 | +I | ||
| 557 | +I | ||
| 558 | +A | ||
| 559 | +I | ||
| 560 | +I | ||
| 561 | +A | ||
| 562 | +I | ||
| 563 | +I | ||
| 564 | +I | ||
| 565 | +I | ||
| 566 | +I | ||
| 567 | +I | ||
| 568 | +A | ||
| 569 | +I | ||
| 570 | +I | ||
| 571 | +I | ||
| 572 | +I | ||
| 573 | +I | ||
| 574 | +I | ||
| 575 | +I | ||
| 576 | +I | ||
| 577 | +I | ||
| 578 | +I | ||
| 579 | +I | ||
| 580 | +I | ||
| 581 | +I | ||
| 582 | +I | ||
| 583 | +I | ||
| 584 | +I | ||
| 585 | +A | ||
| 586 | +I | ||
| 587 | +I | ||
| 588 | +A | ||
| 589 | +I | ||
| 590 | +I | ||
| 591 | +I | ||
| 592 | +I | ||
| 593 | +A | ||
| 594 | +I | ||
| 595 | +I | ||
| 596 | +I | ||
| 597 | +I | ||
| 598 | +I | ||
| 599 | +A | ||
| 600 | +I | ||
| 601 | +I | ||
| 602 | +I | ||
| 603 | +A | ||
| 604 | +I | ||
| 605 | +I | ||
| 606 | +I | ||
| 607 | +A | ||
| 608 | +I | ||
| 609 | +A | ||
| 610 | +A | ||
| 611 | +I | ||
| 612 | +A | ||
| 613 | +I | ||
| 614 | +I | ||
| 615 | +I | ||
| 616 | +I | ||
| 617 | +I | ||
| 618 | +I | ||
| 619 | +I | ||
| 620 | +A | ||
| 621 | +I | ||
| 622 | +I | ||
| 623 | +I | ||
| 624 | +A | ||
| 625 | +I | ||
| 626 | +A | ||
| 627 | +I | ||
| 628 | +I | ||
| 629 | +I | ||
| 630 | +I | ||
| 631 | +A | ||
| 632 | +I | ||
| 633 | +A | ||
| 634 | +I |
This diff could not be displayed because it is too large.
This diff could not be displayed because it is too large.
clasificacion-automatica/binding-thrombin-dataset/training-validation-binding-thrombin.py
0 → 100644
| 1 | +# -*- encoding: utf-8 -*- | ||
| 2 | + | ||
| 3 | +import os | ||
| 4 | +from time import time | ||
| 5 | +import argparse | ||
| 6 | +from sklearn.naive_bayes import BernoulliNB | ||
| 7 | +from sklearn.svm import SVC | ||
| 8 | +from sklearn.neighbors import NearestCentroid | ||
| 9 | +from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \ | ||
| 10 | + classification_report | ||
| 11 | +import sys | ||
| 12 | +from scipy.sparse import csr_matrix | ||
| 13 | +import numpy as np | ||
| 14 | + | ||
| 15 | +__author__ = 'CMendezC' | ||
| 16 | + | ||
| 17 | +# Goal: train and evaluate a classifier on the binding thrombin data set | ||
| 18 | + | ||
| 19 | +# Parameters: | ||
| 20 | +# 1) --inputPath Path to read input files. | ||
| 21 | +# 2) --inputTrainingData File to read training data. | ||
| 22 | +# 3) --inputTestingData File to read testing data. | ||
| 23 | +# 4) --inputTestingClasses File to read testing classes. | ||
| 24 | +# 5) --outputModelPath Path to place output model. | ||
| 25 | +# 6) --outputModelFile File to place output model. | ||
| 26 | +# 7) --outputReportPath Path to place evaluation report. | ||
| 27 | +# 8) --outputReportFile File to place evaluation report. | ||
| 28 | +# 9) --classifier Classifier: BernoulliNB, SVM, NearestCentroid. | ||
| 29 | + | ||
| 30 | +# Output: | ||
| 31 | +# 1) Classification model and evaluation report. | ||
| 32 | + | ||
| 33 | +# Execution: | ||
| 34 | + | ||
| 35 | +# python training-validation-binding-thrombin.py | ||
| 36 | +# --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/binding-thrombin-dataset | ||
| 37 | +# --inputTrainingData thrombin.data | ||
| 38 | +# --inputTestingData Thrombin.testset | ||
| 39 | +# --inputTestingClasses Thrombin.testset.class | ||
| 40 | +# --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/binding-thrombin-dataset/models | ||
| 41 | +# --outputModelFile SVM-model.mod | ||
| 42 | +# --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/binding-thrombin-dataset/reports | ||
| 43 | +# --outputReportFile SVM.txt | ||
| 44 | +# --classifier SVM | ||
| 45 | + | ||
| 46 | +# source activate python3 | ||
| 47 | + | ||
| 48 | +########################################################### | ||
| 49 | +# MAIN PROGRAM # | ||
| 50 | +########################################################### | ||
| 51 | + | ||
| 52 | +if __name__ == "__main__": | ||
| 53 | + # Parameter definition | ||
| 54 | + parser = argparse.ArgumentParser(description='Training validation Binding Thrombin Dataset.') | ||
| 55 | + parser.add_argument("--inputPath", dest="inputPath", | ||
| 56 | + help="Path to read input files", metavar="PATH") | ||
| 57 | + parser.add_argument("--inputTrainingData", dest="inputTrainingData", | ||
| 58 | + help="File to read training data", metavar="FILE") | ||
| 59 | + parser.add_argument("--inputTestingData", dest="inputTestingData", | ||
| 60 | + help="File to read testing data", metavar="FILE") | ||
| 61 | + parser.add_argument("--inputTestingClasses", dest="inputTestingClasses", | ||
| 62 | + help="File to read testing classes", metavar="FILE") | ||
| 63 | + parser.add_argument("--outputModelPath", dest="outputModelPath", | ||
| 64 | + help="Path to place output model", metavar="PATH") | ||
| 65 | + parser.add_argument("--outputModelFile", dest="outputModelFile", | ||
| 66 | + help="File to place output model", metavar="FILE") | ||
| 67 | + parser.add_argument("--outputReportPath", dest="outputReportPath", | ||
| 68 | + help="Path to place evaluation report", metavar="PATH") | ||
| 69 | + parser.add_argument("--outputReportFile", dest="outputReportFile", | ||
| 70 | + help="File to place evaluation report", metavar="FILE") | ||
| 71 | + parser.add_argument("--classifier", dest="classifier", | ||
| 72 | + help="Classifier", metavar="NAME", | ||
| 73 | + choices=('BernoulliNB', 'SVM', 'NearestCentroid'), default='SVM') | ||
| 74 | + | ||
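| 75 | + # argparse's parse_args() returns a single Namespace, not the (options, args) | ||
| 76 | + # tuple of the older optparse API; it already exits with an error on | ||
| 77 | + # unrecognized arguments, so no manual check is needed. | ||
| 78 | + options = parser.parse_args() | ||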
| 79 | + | ||
| 80 | + # Printing parameter values | ||
| 81 | + print('-------------------------------- PARAMETERS --------------------------------') | ||
| 82 | + print("Path to read input files: " + str(options.inputPath)) | ||
| 83 | + print("File to read training data: " + str(options.inputTrainingData)) | ||
| 84 | + print("File to read testing data: " + str(options.inputTestingData)) | ||
| 85 | + print("File to read testing classes: " + str(options.inputTestingClasses)) | ||
| 86 | + print("Path to place output model: " + str(options.outputModelPath)) | ||
| 87 | + print("File to place output model: " + str(options.outputModelFile)) | ||
| 88 | + print("Path to place evaluation report: " + str(options.outputReportPath)) | ||
| 89 | + print("File to place evaluation report: " + str(options.outputReportFile)) | ||
| 90 | + print("Classifier: " + str(options.classifier)) | ||
| 91 | + | ||
| 92 | + # Start time | ||
| 93 | + t0 = time() | ||
| 94 | + | ||
| 95 | + print(" Reading training data and true classes...") | ||
| 96 | + trainingClasses = [] | ||
| 97 | + trainingData = [] | ||
| 98 | + with open(os.path.join(options.inputPath, options.inputTrainingData), encoding='utf8', mode='r') \ | ||
| 99 | + as iFile: | ||
| 100 | + for line in iFile: | ||
| 101 | + line = line.strip('\r\n') | ||
| 102 | + listLine = line.split(',') | ||
| 103 | + trainingClasses.append(listLine[0]) | ||
| 104 | + trainingData.append(listLine[1:]) | ||
| 105 | + # trainingMatrix = np.matrix(trainingData) | ||
| 106 | + trainingMatrix = csr_matrix(trainingData, dtype='double') | ||
| 107 | + | ||
| 108 | + print("Number of training classes: {}".format(len(trainingClasses))) | ||
| 109 | + print("Number of training class A: {}".format(trainingClasses.count('A'))) | ||
| 110 | + print("Number of training class I: {}".format(trainingClasses.count('I'))) | ||
| 111 | + print("Shape of training matrix: {}".format(trainingMatrix.shape)) | ||
| 112 | + | ||
| 113 | + print(" Reading testing data and true classes...") | ||
| 114 | + testingClasses = [] | ||
| 115 | + testingData = [] | ||
| 116 | + with open(os.path.join(options.inputPath, options.inputTestingData), encoding='utf8', mode='r') \ | ||
| 117 | + as iFile: | ||
| 118 | + for line in iFile: | ||
| 119 | + line = line.strip('\r\n') | ||
| 120 | + listLine = line.split(',') | ||
| 121 | + testingData.append(listLine) | ||
| 122 | + testingMatrix = csr_matrix(testingData, dtype='double') | ||
| 123 | + with open(os.path.join(options.inputPath, options.inputTestingClasses), encoding='utf8', mode='r') \ | ||
| 124 | + as iFile: | ||
| 125 | + for line in iFile: | ||
| 126 | + line = line.strip('\r\n') | ||
| 127 | + testingClasses.append(line) | ||
| 128 | + | ||
| 129 | + print("Number of testing classes: {}".format(len(testingClasses))) | ||
| 130 | + print("Number of testing class A: {}".format(testingClasses.count('A'))) | ||
| 131 | + print("Number of testing class I: {}".format(testingClasses.count('I'))) | ||
| 132 | + print("Shape of testing matrix: {}".format(testingMatrix.shape)) | ||
| 133 | + | ||
| 134 | + if options.classifier == "BernoulliNB": | ||
| 135 | + classifier = BernoulliNB() | ||
| 136 | + elif options.classifier == "SVM": | ||
| 137 | + classifier = SVC() | ||
| 138 | + elif options.classifier == "NearestCentroid": | ||
| 139 | + classifier = NearestCentroid() | ||
| 140 | + | ||
| 141 | + print(" Training...") | ||
| 142 | + classifier.fit(trainingMatrix, trainingClasses) | ||
| 143 | + print(" Done!") | ||
| 144 | + | ||
| 145 | + print(" Testing (prediction in new data)...") | ||
| 146 | + y_pred = classifier.predict(testingMatrix) | ||
| 147 | + print(" Done!") | ||
| 148 | + | ||
| 149 | + print(" Saving report...") | ||
| 150 | + with open(os.path.join(options.outputReportPath, options.outputReportFile), mode='w', encoding='utf8') as oFile: | ||
| 151 | + oFile.write('********** EVALUATION REPORT **********\n') | ||
| 152 | + oFile.write('Classifier: {}\n'.format(options.classifier)) | ||
| 153 | + oFile.write('Accuracy: {}\n'.format(accuracy_score(testingClasses, y_pred))) | ||
| 154 | + oFile.write('Precision: {}\n'.format(precision_score(testingClasses, y_pred, average='weighted'))) | ||
| 155 | + oFile.write('Recall: {}\n'.format(recall_score(testingClasses, y_pred, average='weighted'))) | ||
| 156 | + oFile.write('F-score: {}\n'.format(f1_score(testingClasses, y_pred, average='weighted'))) | ||
| 157 | + oFile.write('Confusion matrix: \n') | ||
| 158 | + oFile.write(str(confusion_matrix(testingClasses, y_pred)) + '\n') | ||
| 159 | + oFile.write('Classification report: \n') | ||
| 160 | + oFile.write(classification_report(testingClasses, y_pred) + '\n') | ||
| 161 | + print(" Done!") | ||
| 162 | + | ||
| 163 | + print("Training and testing done in: %fs" % (time() - t0)) |
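The script above collects every comma-separated field into nested Python lists before building the matrix. With 139,351 binary features per compound, a lighter alternative is to record only the column positions of the 1 bits while reading. The sketch below does this in plain Python; `parse_rows` is a hypothetical helper, and it assumes each line holds an optional leading label followed by `0`/`1` fields, as the README describes.

```python
from typing import Iterable, List, Tuple


def parse_rows(lines: Iterable[str], has_label: bool = True
               ) -> Tuple[List[str], List[int], List[int], int]:
    """Parse 'label,bit,bit,...' lines into labels plus CSR index arrays.

    Returns (labels, indices, indptr, n_cols). Row i has 1 bits at the
    columns listed in indices[indptr[i]:indptr[i + 1]].
    """
    labels: List[str] = []
    indices: List[int] = []   # column index of every 1 bit, row after row
    indptr: List[int] = [0]   # running offsets into `indices`, one per row
    n_cols = 0
    for line in lines:
        fields = line.strip("\r\n").split(",")
        if has_label:
            # First field is the class value ('A', 'I', or '?' in a test set).
            labels.append(fields[0])
            fields = fields[1:]
        n_cols = max(n_cols, len(fields))
        indices.extend(j for j, bit in enumerate(fields) if bit == "1")
        indptr.append(len(indices))
    return labels, indices, indptr, n_cols
```

The returned arrays match the `(data, indices, indptr)` form accepted by `scipy.sparse.csr_matrix`, with `data` being a list of ones of the same length as `indices` and `shape=(len(indptr) - 1, n_cols)`.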