Carlos-Francisco Méndez-Cruz

Classification binding thrombin data set

The test set consists of 634 data points, each of which represents
a molecule that is either active (A) or inactive (I). The test set
has the same format as the training set, with the exception that the
activity value (A or I) for each data point is missing, that is, has
been replaced by a question mark (?). Please submit one prediction,
A or I, for each data point. Your submission should be in the form
of a file that starts with your contact information, followed by a
line with 5 asterisks, followed immediately by your predictions, with
one line per data point. The predictions should be in the same order
as the test set data points. So your prediction for the first example
should appear on the first line after the asterisks, your prediction
for the second example should appear on the second line after the
asterisks, etc. Hence, after your contact information, the prediction
file will consist of 635 lines and have the form:

*****
I
I
A
I
A
I

etc.

You may submit your prediction by email to page@biostat.wisc.edu
or by anonymous ftp to ftp.biostat.wisc.edu, placing the file
into the directory dropboxes/page/. If using email, please use
the subject line "KDDcup <name> thrombin" where <name> is your
name. If using ftp, please name the file KDDcup.<name>.thrombin
where <name> is your name. For example, my submission would be
named KDDcup.DavidPage.thrombin

Only one submission per person per task is permitted. If you do not
receive email confirmation of your submission within 24 hours, please
email page@biostat.wisc.edu with subject "KDDcup no confirmation".

For group entries, the contact information should include the names
of everyone to be credited as a member of the group should your entry
achieve the highest score. But no person is to be listed on more than
one entry per task.
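
The file layout described above can be sketched in a few lines of Python. This is only an illustration, not part of the competition materials: the function names `format_submission` and `submission_filename` are hypothetical, and the contact lines in the usage note are placeholders.

```python
def format_submission(contact_lines, predictions, expected=634):
    """Return the text of a prediction file in the layout described above."""
    if len(predictions) != expected:
        raise ValueError(
            "expected {} predictions, got {}".format(expected, len(predictions)))
    bad = [p for p in predictions if p not in ("A", "I")]
    if bad:
        raise ValueError("predictions must be 'A' or 'I', got {!r}".format(bad[0]))
    # Contact information first, then a line of five asterisks,
    # then one label per line in test-set order.
    lines = list(contact_lines) + ["*****"] + list(predictions)
    return "\n".join(lines) + "\n"


def submission_filename(name):
    """File name for an ftp upload, following the KDDcup.<name>.thrombin pattern."""
    return "KDDcup.{}.thrombin".format(name)
```

For example, `format_submission(["Jane Doe", "jane@example.org"], labels)` with 634 labels yields the two contact lines followed by the 635 lines the text asks for.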
Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
--------------------------------------------------------------------------

Drugs are typically small organic molecules that achieve their desired
activity by binding to a target site on a receptor. The first step in
the discovery of a new drug is usually to identify and isolate the
receptor to which it should bind, followed by testing many small
molecules for their ability to bind to the target site. This leaves
researchers with the task of determining what separates the active
(binding) compounds from the inactive (non-binding) ones. Such a
determination can then be used in the design of new compounds that not
only bind, but also have all the other properties required for a drug
(solubility, oral absorption, lack of side effects, appropriate duration
of action, low toxicity, etc.).

The present training data set consists of 1909 compounds tested for
their ability to bind to a target site on thrombin, a key receptor in
blood clotting. The chemical structures of these compounds are not
necessary for our analysis and are not included. Of these compounds, 42
are active (bind well) and the others are inactive. Each compound is
described by a single feature vector consisting of a class value (A for
active, I for inactive) and 139,351 binary features, which describe
three-dimensional properties of the molecule. The definitions of the
individual bits are not included - we don't know what each individual
bit means, only that they are generated in an internally consistent
manner for all 1909 compounds. Biological activity in general, and
receptor binding affinity in particular, correlate with various
structural and physical properties of small organic molecules. The task
is to determine which of these properties are critical in this case and
to learn to accurately predict the class value. To simulate the
real-world drug design environment, the test set contains 634 additional
compounds that were in fact generated based on the assay results
recorded for the training set. In evaluating the accuracy, a
differential cost model will be used, so that the sum of the costs of
the actives will be equal to the sum of the costs of the inactives.

We thank DuPont Pharmaceuticals for graciously providing this data set
for the KDD Cup 2001 competition. All publications referring to
analysis of this data set should acknowledge DuPont Pharmaceuticals
Research Laboratories and KDD Cup 2001.
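
One plausible reading of the differential cost model above is a weighted accuracy in which each class carries half of the total weight, so that the few actives count as much in aggregate as the many inactives. The sketch below follows that reading; it is an assumption about the scoring, not the organizers' code, and `weighted_accuracy` is a hypothetical name.

```python
def weighted_accuracy(y_true, y_pred, active="A", inactive="I"):
    """Accuracy in which actives and inactives each carry a total weight of 0.5."""
    n_active = sum(1 for y in y_true if y == active)
    n_inactive = sum(1 for y in y_true if y == inactive)
    if n_active == 0 or n_inactive == 0:
        raise ValueError("both classes must be present in y_true")
    # Per-example weight: the class total (0.5) split evenly over its members.
    weight = {active: 0.5 / n_active, inactive: 0.5 / n_inactive}
    # Sum the weights of the correctly predicted examples.
    return sum(weight[t] for t, p in zip(y_true, y_pred) if t == p)
```

Under this reading, a classifier that labels every compound inactive scores 0.5 regardless of how rare the actives are. The quantity equals the mean of the per-class recalls, which scikit-learn exposes as `balanced_accuracy_score`.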
I
A
I
I
I
A
I
I
I
A
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
A
A
I
I
I
A
I
I
I
I
A
A
I
A
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
I
A
A
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
A
I
I
A
I
A
I
I
A
I
A
I
A
I
A
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
A
I
I
A
A
A
I
I
A
A
I
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
A
I
I
I
I
I
I
A
I
A
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
A
I
I
I
I
I
A
I
I
I
I
I
A
A
A
I
A
I
I
I
I
A
A
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
I
I
A
A
I
I
A
I
I
I
I
I
A
A
I
A
I
I
I
I
I
I
A
A
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
A
I
I
I
I
A
A
I
A
A
I
I
A
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
A
I
I
A
A
I
I
I
A
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
A
A
A
I
I
I
A
A
I
I
I
I
I
A
I
A
I
I
I
I
I
A
I
I
I
I
A
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
A
I
I
A
I
I
I
I
A
I
I
A
A
I
I
I
A
I
A
I
I
I
I
I
I
I
A
A
I
I
I
A
I
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
A
I
I
I
I
I
A
I
I
A
I
I
I
I
I
I
A
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
A
I
I
A
I
I
I
I
A
I
I
I
I
I
A
I
I
I
A
I
I
I
A
I
A
A
I
A
I
I
I
I
I
I
I
A
I
I
I
A
I
A
I
I
I
I
A
I
A
I
# -*- encoding: utf-8 -*-

import os
from time import time
import argparse
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, \
    classification_report
# sklearn.externals.joblib is no longer shipped with recent scikit-learn releases;
# the standalone joblib package provides the same dump/load functions.
import joblib
from scipy.sparse import csr_matrix

__author__ = 'CMendezC'

# Goal: training and testing binding thrombin data set

# Parameters:
# 1) --inputPath Path to read input files.
# 2) --inputTrainingData File to read training data.
# 3) --inputTestingData File to read testing data.
# 4) --inputTestingClasses File to read testing classes.
# 5) --outputModelPath Path to place output model.
# 6) --outputModelFile File to place output model.
# 7) --outputReportPath Path to place evaluation report.
# 8) --outputReportFile File to place evaluation report.
# 9) --classifier Classifier: BernoulliNB, SVM, kNN.
# 10) --saveData Save matrices

# Output:
# 1) Classification model and evaluation report.

# Execution:

# python training-testing-binding-thrombin.py
#     --inputPath /home/binding-thrombin-dataset
#     --inputTrainingData thrombin.data
#     --inputTestingData Thrombin.testset
#     --inputTestingClasses Thrombin.testset.class
#     --outputModelPath /home/binding-thrombin-dataset/models
#     --outputModelFile SVM-model.mod
#     --outputReportPath /home/binding-thrombin-dataset/reports
#     --outputReportFile SVM.txt
#     --classifier SVM
#     --saveData

# source activate python3
# python training-testing-binding-thrombin.py --inputPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset --inputTrainingData thrombin.data --inputTestingData Thrombin.testset --inputTestingClasses Thrombin.testset.class --outputModelPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/models --outputModelFile SVM-model.mod --outputReportPath /home/compu2/bionlp/lcg-bioinfoI-bionlp/clasificacion-automatica/binding-thrombin-dataset/reports --outputReportFile SVM.txt --classifier SVM --saveData

###########################################################
#                       MAIN PROGRAM                      #
###########################################################

if __name__ == "__main__":
    # Parameter definition
    parser = argparse.ArgumentParser(description='Training and testing Binding Thrombin Dataset.')
    parser.add_argument("--inputPath", dest="inputPath",
                        help="Path to read input files", metavar="PATH")
    parser.add_argument("--inputTrainingData", dest="inputTrainingData",
                        help="File to read training data", metavar="FILE")
    parser.add_argument("--inputTestingData", dest="inputTestingData",
                        help="File to read testing data", metavar="FILE")
    parser.add_argument("--inputTestingClasses", dest="inputTestingClasses",
                        help="File to read testing classes", metavar="FILE")
    parser.add_argument("--outputModelPath", dest="outputModelPath",
                        help="Path to place output model", metavar="PATH")
    parser.add_argument("--outputModelFile", dest="outputModelFile",
                        help="File to place output model", metavar="FILE")
    parser.add_argument("--outputReportPath", dest="outputReportPath",
                        help="Path to place evaluation report", metavar="PATH")
    parser.add_argument("--outputReportFile", dest="outputReportFile",
                        help="File to place evaluation report", metavar="FILE")
    parser.add_argument("--classifier", dest="classifier",
                        help="Classifier", metavar="NAME",
                        choices=('BernoulliNB', 'SVM', 'kNN'), default='SVM')
    parser.add_argument("--saveData", dest="saveData", action='store_true',
                        help="Save matrices")

    args = parser.parse_args()

    # Printing parameter values
    print('-------------------------------- PARAMETERS --------------------------------')
    print("Path to read input files: " + str(args.inputPath))
    print("File to read training data: " + str(args.inputTrainingData))
    print("File to read testing data: " + str(args.inputTestingData))
    print("File to read testing classes: " + str(args.inputTestingClasses))
    print("Path to place output model: " + str(args.outputModelPath))
    print("File to place output model: " + str(args.outputModelFile))
    print("Path to place evaluation report: " + str(args.outputReportPath))
    print("File to place evaluation report: " + str(args.outputReportFile))
    print("Classifier: " + str(args.classifier))
    print("Save matrices: " + str(args.saveData))

    # Start time
    t0 = time()

    print("Reading training data and true classes...")
    X_train = None
    if args.saveData:
        # Parse the raw text file: first field is the class, the rest are binary features.
        y_train = []
        trainingData = []
        with open(os.path.join(args.inputPath, args.inputTrainingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                y_train.append(listLine[0])
                trainingData.append(listLine[1:])
        # X_train = np.matrix(trainingData)
        X_train = csr_matrix(trainingData, dtype='double')
        print(" Saving matrix and classes...")
        joblib.dump(X_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        joblib.dump(y_train, os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print(" Done!")
    else:
        # Reuse the matrix and classes saved by an earlier run with --saveData.
        print(" Loading matrix and classes...")
        X_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.jlb'))
        y_train = joblib.load(os.path.join(args.outputModelPath, args.inputTrainingData + '.class.jlb'))
        print(" Done!")

    print(" Number of training classes: {}".format(len(y_train)))
    print(" Number of training class A: {}".format(y_train.count('A')))
    print(" Number of training class I: {}".format(y_train.count('I')))
    print(" Shape of training matrix: {}".format(X_train.shape))

    print("Reading testing data and true classes...")
    X_test = None
    if args.saveData:
        y_test = []
        testingData = []
        # The first field of each test line is the '?' placeholder, so it is skipped.
        with open(os.path.join(args.inputPath, args.inputTestingData), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                listLine = line.split(',')
                testingData.append(listLine[1:])
        X_test = csr_matrix(testingData, dtype='double')
        with open(os.path.join(args.inputPath, args.inputTestingClasses), encoding='utf8', mode='r') \
                as iFile:
            for line in iFile:
                line = line.strip('\r\n')
                # Ignore blank lines so a trailing newline does not add an empty class.
                if line:
                    y_test.append(line)
        print(" Saving matrix and classes...")
        joblib.dump(X_test, os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        joblib.dump(y_test, os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print(" Done!")
    else:
        print(" Loading matrix and classes...")
        X_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingData + '.jlb'))
        y_test = joblib.load(os.path.join(args.outputModelPath, args.inputTestingClasses + '.class.jlb'))
        print(" Done!")

    print(" Number of testing classes: {}".format(len(y_test)))
    print(" Number of testing class A: {}".format(y_test.count('A')))
    print(" Number of testing class I: {}".format(y_test.count('I')))
    print(" Shape of testing matrix: {}".format(X_test.shape))

    # Select the classifier; all three are used with their default settings.
    if args.classifier == "BernoulliNB":
        classifier = BernoulliNB()
    elif args.classifier == "SVM":
        classifier = SVC()
    elif args.classifier == "kNN":
        classifier = KNeighborsClassifier()
    else:
        # Unreachable while argparse restricts the choices; kept as a safeguard.
        raise SystemExit("Bad classifier")

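    print("Training...")
    classifier.fit(X_train, y_train)
    print(" Done!")

    # The original script accepted --outputModelFile but never wrote the model;
    # saving it here matches the documented output of the script.
    if args.outputModelFile:
        print("Saving model...")
        joblib.dump(classifier, os.path.join(args.outputModelPath, args.outputModelFile))
        print(" Done!")

    print("Testing (prediction in new data)...")
    y_pred = classifier.predict(X_test)
    print(" Done!")
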
    print("Saving report...")
    with open(os.path.join(args.outputReportPath, args.outputReportFile), mode='w', encoding='utf8') as oFile:
        oFile.write('********** EVALUATION REPORT **********\n')
        oFile.write('Classifier: {}\n'.format(args.classifier))
        oFile.write('Accuracy: {}\n'.format(accuracy_score(y_test, y_pred)))
        oFile.write('Precision: {}\n'.format(precision_score(y_test, y_pred, average='weighted')))
        oFile.write('Recall: {}\n'.format(recall_score(y_test, y_pred, average='weighted')))
        oFile.write('F-score: {}\n'.format(f1_score(y_test, y_pred, average='weighted')))
        oFile.write('Confusion matrix: \n')
        oFile.write(str(confusion_matrix(y_test, y_pred)) + '\n')
        oFile.write('Classification report: \n')
        oFile.write(classification_report(y_test, y_pred) + '\n')
    print(" Done!")

    print("Training and testing done in: %fs" % (time() - t0))
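
The script above collects every feature as a string in nested lists before building the sparse matrix, which can be heavy for 139,351 columns. A lighter alternative, sketched here as an assumption rather than as the author's method, records only the column positions of the set bits while reading. The helper name `parse_sparse_rows` is hypothetical, and the sketch uses only the standard library.

```python
def parse_sparse_rows(lines):
    """Parse 'label,bit,bit,...' lines into labels plus CSR-style index arrays."""
    labels, indices, indptr = [], [], [0]
    n_features = 0
    for line in lines:
        line = line.strip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(",")
        labels.append(fields[0])
        bits = fields[1:]
        n_features = max(n_features, len(bits))
        # Keep only the column positions of the bits that are set.
        indices.extend(col for col, bit in enumerate(bits) if bit != "0")
        # indptr[i]:indptr[i + 1] delimits the entries of row i in indices.
        indptr.append(len(indices))
    return labels, indices, indptr, n_features
```

The returned arrays match the `(data, indices, indptr)` form accepted by `scipy.sparse.csr_matrix`, with `data` an array of ones of length `len(indices)` and `shape=(len(labels), n_features)`. An open file object can be passed directly as `lines`.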