Showing 32 changed files with 85 additions and 546 deletions
1 | -# This paper talks about (and reports) experimental data | 1 | +# This paper talks about (and reports) experimental data? |
2 | 2 | ||
3 | -Automatic discrimination of useless papers via machine learning of abstracts. | 3 | +We tackle this question by automatically discriminating useless papers via machine learning on their abstracts. |
4 | 4 | ||
5 | The main method follows this pipeline (a minimal code sketch follows each mode below): | 5 | The main method follows this pipeline (a minimal code sketch follows each mode below): |
6 | 6 | ||
7 | ### Training mode | 7 | ### Training mode |
7 | - Parse abstracts from two input files (classA and classB; see the file format in the `data/` directory) | 8 | - Parse abstracts from two input files (classA and classB; see the file format in the `data/` directory) |
9 | - Transform abstracts into their TFIDF sparse representations | 9 | - Transform abstracts into their TFIDF sparse representations |
10 | +- Transform TFIDF representations into their 200-dimensional SVD approximation and save the fitted SVD model at `model_binClass/svd_model.pkl` | ||
10 | - Train Support Vector Machines with different parameters by using GridSearch | 11 | - Train Support Vector Machines with different parameters by using GridSearch |
11 | -- Select the best estimator and save it at `model/svm_model.pkl` (default) | 12 | +- Select the best estimator and save it at `model_binClass/svm_model.pkl` (default) |
12 | -- Save TFIDF transformation for keeping the training vocabulary (stored at `model/tfidf_model.pkl`) | 13 | +- Save TFIDF transformation for keeping the training vocabulary (stored at `model_binClass/tfidf_model.pkl`) |
13 | 14 | ||
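A minimal sketch of the training mode above (the texts, labels, and tiny grid are illustrative stand-ins; the real script parses abstracts from the input files and reads its grid from `model_params_binClass.conf`):

```python
# Sketch of the training pipeline: TFIDF -> truncated SVD -> SVC grid search.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.externals import joblib  # sklearn < 0.21, as used in this repo

texts = ["assay results for strain ...", "review of earlier methods ..."] * 6
y = [1, 0] * 6  # 1 = useful (reports data), 0 = useless

tfidf = TfidfVectorizer(binary=True).fit(texts)  # fixes the training vocabulary
X = tfidf.transform(texts)

# The script uses n_components=200; it must stay below the vocabulary size.
svd = TruncatedSVD(n_components=2, random_state=42, n_iter=20).fit(X)
X = svd.transform(X)

grid = {'kernel': ['rbf'], 'C': [1, 100], 'gamma': [0.01, 0.001]}  # toy grid
clf = GridSearchCV(SVC(), param_grid=grid, cv=3, n_jobs=-1, scoring='f1')
clf.fit(X, y)

joblib.dump(clf.best_estimator_, 'model_binClass/svm_model.pkl')  # dir must exist
joblib.dump(tfidf, 'model_binClass/tfidf_model.pkl')
joblib.dump(svd, 'model_binClass/svd_model.pkl')
```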
14 | ### Prediction mode | 15 | ### Prediction mode |
15 | - Parse abstracts from a single input file | 16 | - Parse abstracts from a single input file |
16 | - Transform abstracts into their TFIDF sparse representations | 17 | - Transform abstracts into their TFIDF sparse representations |
18 | +- Transform TFIDF representations into their 200-dimensional SVD approximation | ||
17 | - Predict useless/useful papers by means of their abstracts using pretrained Support Vector Machines | 19 | - Predict useless/useful papers by means of their abstracts using pretrained Support Vector Machines |
18 | 20 | ||
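A matching sketch of the prediction mode (assumes the three artifacts saved during training; `new_texts` stands in for abstracts parsed from the `--input` file):

```python
# Sketch of prediction: reuse the saved TFIDF, SVD, and SVM artifacts.
from sklearn.externals import joblib

clf = joblib.load('model_binClass/svm_model.pkl')
tfidf = joblib.load('model_binClass/tfidf_model.pkl')
svd = joblib.load('model_binClass/svd_model.pkl')

new_texts = ["an unseen abstract ..."]         # parsed from the --input file
X = svd.transform(tfidf.transform(new_texts))  # same TFIDF -> SVD pipeline
print(clf.predict(X))                          # 1 = useful, 0 = useless
```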
19 | # Usage | 21 | # Usage |
... | @@ -21,7 +23,7 @@ The main method follows the next pipeline: | ... | @@ -21,7 +23,7 @@ The main method follows the next pipeline: |
21 | To filter unknown abstracts, run | 23 | To filter unknown abstracts, run |
22 | 24 | ||
23 | ```bash | 25 | ```bash |
24 | -$ python filter_abstracts.py --input data/test_abstracts.txt | 26 | +$ python filter_abstracts_binClass.py --input data/test_abstracts.txt |
25 | ``` | 27 | ``` |
26 | The predictions will be stored by default at `filter_output/`, unless a different directory is specified with the `--out` option. The default file names containing the predictions are | 28 | The predictions will be stored by default at `filter_output/`, unless a different directory is specified with the `--out` option. The default file names containing the predictions are |
27 | 29 | ||
... | @@ -36,10 +38,10 @@ The format of each file is: | ... | @@ -36,10 +38,10 @@ The format of each file is: |
36 | <PMID> \t <text of the abstract> | 38 | <PMID> \t <text of the abstract> |
37 | ``` | 39 | ``` |
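For example, a predictions file can be loaded back with a couple of lines of Python:

```python
# Read a predictions file: one "<PMID>\t<abstract text>" record per line.
with open('filter_output/useful.out') as f:
    records = [line.rstrip('\n').split('\t', 1) for line in f]
# records[i] == [pmid, abstract_text]
```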
38 | 40 | ||
39 | -For training a new model set the list of parameters at `model_params.conf` and then run | 41 | +For training a new model, set the list of parameters in `model_params_binClass.conf` and then run |
40 | 42 | ||
41 | ```bash | 43 | ```bash |
42 | -$ python filter_abstracts.py --classA data/ecoli_abstracts/not_useful_abstracts.txt --classB data/ecoli_abstracts/useful_abstracts.txt | 44 | +$ python filter_abstracts_binClass.py --classA data/ecoli_abstracts/not_useful_abstracts.txt --classB data/ecoli_abstracts/useful_abstracts.txt |
43 | ``` | 45 | ``` |
44 | 46 | ||
45 | -where `--classA` and `--classA` are used to specify input training files. In this example `data/ecoli_abstracts/useful_abstracts.txt` is the training files containing abstracts of papers reporting experimental data (the desired or useful class for us). | 47 | +where `--classA` (the useless papers) and `--classB` (the useful papers) specify the input training files. In this example `data/ecoli_abstracts/useful_abstracts.txt` is the training file containing abstracts of papers reporting experimental data (the desired, useful class for us). Both input files have a special format that the script parses; see their contents for details. | ... | ...
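The parameter file itself is a plain CSV grid: the script reads it with `csv.DictReader`, pools each column into a deduplicated value list, and hands the result to `GridSearchCV` as `param_grid`. A hypothetical `model_params_binClass.conf` (column names taken from the SVC parameters the script expects; the values here are made up) might look like:

```
kernel,C,gamma,degree,coef0
rbf,100,0.001,3,0.5
rbf,150,0.0001,3,0.5
poly,100,0.001,2,0.5
```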
classify_abstracts.py
deleted
100644 → 0
1 | -#from pdb import set_trace as st | ||
2 | -from sklearn.cross_validation import train_test_split as splitt | ||
3 | -from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer | ||
4 | -from sklearn.decomposition import TruncatedSVD | ||
5 | -from sklearn.naive_bayes import MultinomialNB | ||
6 | -from sklearn.linear_model import SGDClassifier | ||
7 | -from sklearn.neighbors import KNeighborsClassifier | ||
8 | -from sklearn.neighbors import NearestCentroid | ||
9 | -from sklearn.ensemble import RandomForestClassifier | ||
10 | -from sklearn.svm import LinearSVC | ||
11 | -from sklearn.svm import SVC | ||
12 | -from sklearn import metrics | ||
13 | -from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier, | ||
14 | - AdaBoostClassifier, GradientBoostingClassifier) | ||
15 | -from sklearn.grid_search import GridSearchCV | ||
16 | -from sklearn.externals import joblib | ||
17 | -import pandas as pd | ||
18 | -from numpy import mean, std | ||
19 | - | ||
20 | - | ||
21 | -class EstimatorSelectionHelper: | ||
22 | - "http://www.codiply.com/blog/hyperparameter-grid-search-across-multiple-models-in-scikit-learn/" | ||
23 | - def __init__(self, models, params): | ||
24 | - if not set(models.keys()).issubset(set(params.keys())): | ||
25 | - missing_params = list(set(models.keys()) - set(params.keys())) | ||
26 | - raise ValueError("Some estimators are missing parameters: %s" % missing_params) | ||
27 | - self.models = models | ||
28 | - self.params = params | ||
29 | - self.keys = models.keys() | ||
30 | - self.grid_searches = {} | ||
31 | - self.best_estimator = {} | ||
32 | - | ||
33 | - def fit(self, X, y, cv=3, n_jobs=1, verbose=1, scoring=None, refit=False): | ||
34 | - for key in self.keys: | ||
35 | - print("Running GridSearchCV for %s." % key) | ||
36 | - model = self.models[key] | ||
37 | - params = self.params[key] | ||
38 | - gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, | ||
39 | - verbose=verbose, scoring=scoring, refit=refit) | ||
40 | - gs.fit(X,y) | ||
41 | - self.grid_searches[key] = gs | ||
42 | - | ||
43 | - def score_summary(self, sort_by='mean_score'): | ||
44 | - def row(key, scores, params, model): | ||
45 | - d = { | ||
46 | - 'estimator': key, | ||
47 | - 'min_score': min(scores), | ||
48 | - 'max_score': max(scores), | ||
49 | - 'mean_score': mean(scores), | ||
50 | - 'std_score': std(scores), | ||
51 | - 'model': model | ||
52 | - } | ||
53 | - return pd.Series(dict(list(params.items()) + list(d.items()))) | ||
54 | - | ||
55 | - rows = [row(k, gsc.cv_validation_scores, gsc.parameters, m) | ||
56 | - for k in self.keys | ||
57 | - for gsc, m in zip(self.grid_searches[k].grid_scores_, self.grid_searches[k].best_estimator_)] | ||
58 | - df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False) | ||
59 | - | ||
60 | - columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score'] | ||
61 | - columns = columns + [c for c in df.columns if (c not in columns and c != 'model')] | ||
62 | - self.best_estimator_ = df['model'][0] | ||
63 | - return df[columns] | ||
64 | - | ||
65 | - | ||
66 | -def get_abstracts(file_name, label): | ||
67 | - f = open(file_name) | ||
68 | - extract = {} | ||
69 | - docs = [] | ||
70 | - empties = [] | ||
71 | - lines = f.readlines() | ||
72 | - cpright = False | ||
73 | - | ||
74 | - for i, ln in enumerate(lines): | ||
75 | - if not ln.strip(): | ||
76 | - empties.append(i) | ||
77 | - continue | ||
78 | - elif ' doi: ' in ln: | ||
79 | - for j in range(i, i + 10): | ||
80 | - if not lines[j].strip(): | ||
81 | - title_idx = j + 1 | ||
82 | - break | ||
83 | - continue | ||
84 | - | ||
85 | - elif 'cpright ' in ln: | ||
86 | - cpright = True | ||
87 | - | ||
88 | - elif 'DOI: ' in ln: | ||
89 | - if 'PMCID: ' in lines[i + 1]: | ||
90 | - extract['pmid'] = int(lines[i + 2].strip().split()[1]) | ||
91 | - elif not 'PMCID: ' in lines[i + 1] and 'PMID: ' in lines[i + 1]: | ||
92 | - extract['pmid'] = int(lines[i + 1].strip().split()[1]) | ||
93 | - | ||
94 | - if cpright: | ||
95 | - get = slice(empties[-3], empties[-2]) | ||
96 | - cpright = False | ||
97 | - else: | ||
98 | - get = slice(empties[-2], empties[-1]) | ||
99 | - | ||
100 | - extract['body'] = " ".join(lines[get]).replace("\n", ' ').replace(" ", ' ') | ||
101 | - title = [] | ||
102 | - for j in range(title_idx, title_idx + 5): | ||
103 | - if lines[j].strip(): | ||
104 | - title.append(lines[j]) | ||
105 | - else: | ||
106 | - break | ||
107 | - extract['title'] = " ".join(title).replace("\n", ' ').replace(" ", ' ') | ||
108 | - extract['topic'] = label | ||
109 | - docs.append(extract) | ||
110 | - empties = [] | ||
111 | - extract = {} | ||
112 | - | ||
113 | - return docs | ||
114 | - | ||
115 | - | ||
116 | -filename = "data/ecoli_abstracts/not_useful_abstracts.txt" | ||
117 | -labels = ['useless', 'useful'] | ||
118 | - | ||
119 | -abstracs = get_abstracts(file_name=filename, label=labels[0]) | ||
120 | - | ||
121 | -filename = "data/ecoli_abstracts/useful_abstracts.txt" | ||
122 | - | ||
123 | -abstracs += get_abstracts(file_name=filename, label=labels[1]) | ||
124 | - | ||
125 | -X = [x['body'] for x in abstracs] | ||
126 | -y = [1 if x['topic'] == 'useful' else 0 for x in abstracs] | ||
127 | - | ||
128 | -models1 = { | ||
129 | - 'ExtraTreesClassifier': ExtraTreesClassifier(), | ||
130 | - 'RandomForestClassifier': RandomForestClassifier(), | ||
131 | - 'AdaBoostClassifier': AdaBoostClassifier(), | ||
132 | - 'GradientBoostingClassifier': GradientBoostingClassifier(), | ||
133 | - 'SVC': SVC() | ||
134 | -} | ||
135 | - | ||
136 | -params1 = { | ||
137 | - 'ExtraTreesClassifier': {'n_estimators': [16, 32]}, | ||
138 | - 'RandomForestClassifier': {'n_estimators': [16, 32]}, | ||
139 | - 'AdaBoostClassifier': {'n_estimators': [16, 32]}, | ||
140 | - 'GradientBoostingClassifier': {'n_estimators': [16, 32], | ||
141 | - 'learning_rate': [0.8, 1.0]}, | ||
142 | - 'SVC': [ | ||
143 | - {'kernel': ['rbf'], 'C': [1, 10, 100, 150, 200, 300, 350, 400], | ||
144 | - 'gamma': [0.1, 0.01, 0.001, 0.0001, 0.00001]}, | ||
145 | - {'kernel': ['poly'], 'C': [1, 10, 100, 150, 200, 300, 350, 400], | ||
146 | - 'degree': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 23, 26], | ||
147 | - 'coef0': [0.1, 0.2,0.3,0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]} | ||
148 | - ] | ||
149 | -} | ||
150 | - | ||
151 | -clf = EstimatorSelectionHelper(models1, params1) | ||
152 | - | ||
153 | -vectorizer = TfidfVectorizer(binary=True) | ||
154 | - #ngram_range=(1, 3) | ||
155 | - #) | ||
156 | -#vectorizer = HashingVectorizer(non_negative=True) | ||
157 | -print(vectorizer) | ||
158 | -#svd = TruncatedSVD(n_components=200, random_state=42, n_iter=20) | ||
159 | -X = vectorizer.fit_transform(X) | ||
160 | -#X = svd.fit_transform(X) | ||
161 | - | ||
162 | -#X_train, X_test, y_train, y_test = splitt(X, y, test_size=0.3, random_state=42) | ||
163 | - | ||
164 | -#from sklearn.feature_selection import chi2, SelectKBest | ||
165 | -#ch2 = SelectKBest(chi2, k=200) | ||
166 | -#X_train = ch2.fit_transform(X_train, y_train) | ||
167 | -#X_test = ch2.transform(X_test) | ||
168 | - | ||
169 | -#clf = MultinomialNB(alpha=.01) | ||
170 | -#clf = Classifier(n_jobs=-1, n_iter=100) | ||
171 | -#st() | ||
172 | -clf.fit(X, y, scoring='f1', n_jobs=-1) | ||
173 | - | ||
174 | -#pred = clf.predict(X_test) | ||
175 | -#print(metrics.f1_score(y_test, pred, average='macro')) | ||
176 | -print(clf.score_summary(sort_by='min_score')) | ||
177 | - | ||
178 | -joblib.dump(clf.best_estimator_, 'model/svm_model.pkl') | ||
179 | -joblib.dump(vectorizer, 'model/tifidf_model.pkl') |
filter_abstracts.py
deleted
100644 → 0
1 | -#from pdb import set_trace as st | ||
2 | -from sklearn.cross_validation import train_test_split as splitt | ||
3 | -from sklearn.feature_extraction.text import TfidfVectorizer | ||
4 | -from sklearn.model_selection import RandomizedSearchCV | ||
5 | -from sklearn.model_selection import GridSearchCV | ||
6 | -from sklearn import metrics | ||
7 | -from sklearn.svm import SVC | ||
8 | -import numpy as np | ||
9 | -import argparse | ||
10 | -import csv | ||
11 | -import os | ||
12 | -from sklearn.externals import joblib | ||
13 | -from time import time | ||
14 | -from scipy.stats import randint as sp_randint | ||
15 | -from scipy.stats import expon | ||
16 | -from sklearn.preprocessing import label_binarize | ||
17 | - | ||
18 | - | ||
19 | -def get_abstracts(file_name, label): | ||
20 | - f = open(file_name) | ||
21 | - extract = {} | ||
22 | - docs = [] | ||
23 | - empties = [] | ||
24 | - lines = f.readlines() | ||
25 | - copyright = False | ||
26 | - | ||
27 | - for i, ln in enumerate(lines): | ||
28 | - if not ln.strip(): | ||
29 | - empties.append(i) | ||
30 | - continue | ||
31 | - elif ' doi: ' in ln: | ||
32 | - for j in range(i, i + 10): | ||
33 | - if not lines[j].strip(): | ||
34 | - title_idx = j + 1 | ||
35 | - break | ||
36 | - continue | ||
37 | - | ||
38 | - elif 'Copyright ' in ln or 'Publish' in ln or u'\N{COPYRIGHT SIGN}' in ln: | ||
39 | - copyright = True | ||
40 | - | ||
41 | - elif 'DOI: ' in ln: | ||
42 | - if 'PMCID: ' in lines[i + 1]: | ||
43 | - extract['pmid'] = int(lines[i + 2].strip().split()[1]) | ||
44 | - elif not 'PMCID: ' in lines[i + 1] and 'PMID: ' in lines[i + 1]: | ||
45 | - extract['pmid'] = int(lines[i + 1].strip().split()[1]) | ||
46 | - | ||
47 | - if copyright: | ||
48 | - get = slice(empties[-3], empties[-2]) | ||
49 | - copyright = False | ||
50 | - else: | ||
51 | - get = slice(empties[-2], empties[-1]) | ||
52 | - | ||
53 | - extract['body'] = " ".join(lines[get]).replace("\n", ' ' | ||
54 | - ).replace(" ", ' ') | ||
55 | - title = [] | ||
56 | - for j in range(title_idx, title_idx + 5): | ||
57 | - if lines[j].strip(): | ||
58 | - title.append(lines[j]) | ||
59 | - else: | ||
60 | - break | ||
61 | - extract['title'] = " ".join(title).replace("\n", ' ' | ||
62 | - ).replace(" ", ' ') | ||
63 | - extract['topic'] = label | ||
64 | - docs.append(extract) | ||
65 | - empties = [] | ||
66 | - extract = {} | ||
67 | - | ||
68 | - return docs | ||
69 | - | ||
70 | - | ||
71 | -parser = argparse.ArgumentParser( | ||
72 | - description="This script separates abstracts of biomedical papers that" | ||
73 | - "report data from biomedical experiments from those that do not.") | ||
74 | -parser.add_argument("--input", help="Input file containing the abstracts to" | ||
75 | - "be predited.") | ||
76 | -parser.add_argument("--classA", help="Input file containing the abstracts of" | ||
77 | - "class A to be learned.") | ||
78 | -parser.add_argument("--classB", help="Input file containing the abstracts of" | ||
79 | - "class B to be learned.") | ||
80 | -parser.add_argument("--out", help="Path to the output directory " | ||
81 | - "(default='./filter_output')", default="filter_output") | ||
82 | -parser.add_argument("--svcmodel", help="Path to custom pretrained svc model" | ||
83 | - "(default='./model/svm_model.pkl')", default="model/svm_model.pkl") | ||
84 | - | ||
85 | -args = parser.parse_args() | ||
86 | - | ||
87 | -labels = {0: 'useless', 1: 'useful'} | ||
88 | -vectorizer = TfidfVectorizer(binary=True) | ||
89 | -print(vectorizer) | ||
90 | - | ||
91 | -if args.classA and args.classA and not args.input: | ||
92 | - f0 = open("model_params.conf") | ||
93 | - n_iter_search = 10 | ||
94 | - params = [p for p in csv.DictReader(f0)] | ||
95 | - f0.close() | ||
96 | - names = list(params[0].keys()) | ||
97 | - model_params = {n: [] for n in names} | ||
98 | - | ||
99 | - for n in names: | ||
100 | - for d in params: | ||
101 | - for k in d: | ||
102 | - if k == n: | ||
103 | - try: | ||
104 | - model_params[n].append(float(d[k])) | ||
105 | - except ValueError: | ||
106 | - model_params[n].append(d[k]) | ||
107 | - | ||
108 | - model_params = {k: list(set(model_params[k])) for k in model_params} | ||
109 | - abstracs = get_abstracts(file_name=args.classA, label=labels[0]) | ||
110 | - abstracs += get_abstracts(file_name=args.classB, label=labels[1]) | ||
111 | - | ||
112 | - tfidf_model = vectorizer.fit([x['body'] for x in abstracs]) | ||
113 | - X = vectorizer.transform([x['body'] for x in abstracs]) | ||
114 | - #y = [x['topic'] for x in abstracs] | ||
115 | - y = [0 if x['topic'] == 'useless' else 1 for x in abstracs] | ||
116 | - | ||
117 | - #X_train, X_test, y_train, y_test = splitt(X, y, test_size=0.3, random_state=42) | ||
118 | - | ||
119 | - clf = SVC()#kernel='linear', C=100.0, gamma=0.0001)# degree=11, coef0=0.9) | ||
120 | - clf = GridSearchCV(clf, cv=3, | ||
121 | - param_grid=model_params, | ||
122 | - # clf = RandomizedSearchCV(clf, param_distributions=model_params, cv=5, n_iter=n_iter_search, | ||
123 | - n_jobs=-1, scoring='f1') | ||
124 | - start = time() | ||
125 | - clf.fit(X, y) | ||
126 | - | ||
127 | - #clf.fit(X_train, y_train) | ||
128 | - print("GridSearch took %.2f seconds for %d candidates" | ||
129 | - " parameter settings." % ((time() - start), n_iter_search)) | ||
130 | - | ||
131 | - print(clf.best_estimator_) | ||
132 | - print() | ||
133 | - print(clf.best_score_) | ||
134 | - #print(metrics.f1_score(clf.predict(X_test), y_test)) | ||
135 | - | ||
136 | - #joblib.dump(clf, 'model/svm_model.pkl') | ||
137 | - joblib.dump(clf.best_estimator_, 'model/svm_model.pkl') | ||
138 | - joblib.dump(tfidf_model, 'model/tfidf_model.pkl') | ||
139 | - | ||
140 | -else: | ||
141 | - | ||
142 | - clf = joblib.load(args.svcmodel) | ||
143 | - vectorizer = joblib.load('model/tfidf_model.pkl') | ||
144 | - abstracs = get_abstracts(file_name=args.input, label='unknown') | ||
145 | - X = vectorizer.transform([x['body'] for x in abstracs]) | ||
146 | - classes = clf.predict(X) | ||
147 | - | ||
148 | - if not os.path.exists(args.out): | ||
149 | - os.makedirs(args.out) | ||
150 | - # Writing predictions to output files | ||
151 | - with open(args.out + "/" + labels[0] + ".out", 'w') as f0, \ | ||
152 | - open(args.out + "/" + labels[1] + ".out", 'w') as f1: | ||
153 | - for c, a in zip(classes, abstracs): | ||
154 | - if c == 0: | ||
155 | - f0.write("%d\t%s\n" % (a['pmid'], a['body'])) | ||
156 | - elif c == 1: | ||
157 | - f1.write("%d\t%s\n" % (a['pmid'], a['body'])) |
1 | -#from pdb import set_trace as st | ||
2 | -from sklearn.cross_validation import train_test_split as splitt | ||
3 | from sklearn.feature_extraction.text import TfidfVectorizer | 1 | from sklearn.feature_extraction.text import TfidfVectorizer |
4 | from sklearn.decomposition import TruncatedSVD | 2 | from sklearn.decomposition import TruncatedSVD |
5 | -from sklearn.model_selection import RandomizedSearchCV | ||
6 | from sklearn.model_selection import GridSearchCV | 3 | from sklearn.model_selection import GridSearchCV |
7 | from sklearn import metrics | 4 | from sklearn import metrics |
8 | from sklearn.svm import SVC | 5 | from sklearn.svm import SVC |
... | @@ -12,9 +9,6 @@ import csv | ... | @@ -12,9 +9,6 @@ import csv |
12 | import os | 9 | import os |
13 | from sklearn.externals import joblib | 10 | from sklearn.externals import joblib |
14 | from time import time | 11 | from time import time |
15 | -from scipy.stats import randint as sp_randint | ||
16 | -from scipy.stats import expon | ||
17 | -from sklearn.preprocessing import label_binarize | ||
18 | 12 | ||
19 | 13 | ||
20 | def get_abstracts(file_name, label): | 14 | def get_abstracts(file_name, label): |
... | @@ -75,22 +69,22 @@ parser = argparse.ArgumentParser( | ... | @@ -75,22 +69,22 @@ parser = argparse.ArgumentParser( |
75 | parser.add_argument("--input", help="Input file containing the abstracts to" | 69 | parser.add_argument("--input", help="Input file containing the abstracts to" |
76 | "be predited.") | 70 | "be predited.") |
77 | parser.add_argument("--classA", help="Input file containing the abstracts of" | 71 | parser.add_argument("--classA", help="Input file containing the abstracts of" |
78 | - "class A to be learned.") | 72 | + " class useless to be learned.") |
79 | parser.add_argument("--classB", help="Input file containing the abstracts of" | 73 | parser.add_argument("--classB", help="Input file containing the abstracts of" |
80 | - "class B to be learned.") | 74 | + " class USEFUL to be learned.") |
81 | parser.add_argument("--out", help="Path to the output directory " | 75 | parser.add_argument("--out", help="Path to the output directory " |
82 | "(default='./filter_output')", default="filter_output") | 76 | "(default='./filter_output')", default="filter_output") |
83 | parser.add_argument("--svcmodel", help="Path to custom pretrained svc model" | 77 | parser.add_argument("--svcmodel", help="Path to custom pretrained svc model" |
84 | - "(default='./model/svm_model.pkl')", default="model/svm_model.pkl") | 78 | + "(default='./model_binClass/svm_model.pkl')", |
79 | + default="model_binClass/svm_model.pkl") | ||
85 | 80 | ||
86 | args = parser.parse_args() | 81 | args = parser.parse_args() |
87 | 82 | ||
88 | labels = {0: 'useless', 1: 'useful'} | 83 | labels = {0: 'useless', 1: 'useful'} |
89 | vectorizer = TfidfVectorizer(binary=True) | 84 | vectorizer = TfidfVectorizer(binary=True) |
90 | -print(vectorizer) | ||
91 | 85 | ||
92 | if args.classA and args.classB and not args.input: | 86 | if args.classA and args.classB and not args.input: |
93 | - f0 = open("model_params.conf") | 87 | + f0 = open("model_params_binClass.conf") |
94 | n_iter_search = 10 | 88 | n_iter_search = 10 |
95 | params = [p for p in csv.DictReader(f0)] | 89 | params = [p for p in csv.DictReader(f0)] |
96 | f0.close() | 90 | f0.close() |
... | @@ -115,38 +109,38 @@ if args.classA and args.classB and not args.input: | ... | @@ -115,38 +109,38 @@ if args.classA and args.classB and not args.input: |
115 | svd = TruncatedSVD(n_components=200, random_state=42, n_iter=20) | 109 | svd = TruncatedSVD(n_components=200, random_state=42, n_iter=20) |
116 | svd_model = svd.fit(X) | 110 | svd_model = svd.fit(X) |
117 | X = svd_model.transform(X) | 111 | X = svd_model.transform(X) |
118 | - #y = [x['topic'] for x in abstracs] | ||
119 | y = [0 if x['topic'] == 'useless' else 1 for x in abstracs] | 112 | y = [0 if x['topic'] == 'useless' else 1 for x in abstracs] |
120 | 113 | ||
121 | - #X_train, X_test, y_train, y_test = splitt(X, y, test_size=0.3, random_state=42) | 114 | + clf = SVC() |
122 | - | 115 | + clf = GridSearchCV(clf, cv=3, param_grid=model_params, |
123 | - clf = SVC()#kernel='linear', C=100.0, gamma=0.0001)# degree=11, coef0=0.9) | ||
124 | - clf = GridSearchCV(clf, cv=3, | ||
125 | - param_grid=model_params, | ||
126 | - # clf = RandomizedSearchCV(clf, param_distributions=model_params, cv=5, n_iter=n_iter_search, | ||
127 | n_jobs=-1, scoring='f1') | 116 | n_jobs=-1, scoring='f1') |
128 | start = time() | 117 | start = time() |
129 | clf.fit(X, y) | 118 | clf.fit(X, y) |
130 | 119 | ||
131 | - #clf.fit(X_train, y_train) | ||
132 | print("GridSearch took %.2f seconds for %d candidates" | 120 | print("GridSearch took %.2f seconds for %d candidates" |
133 | " parameter settings." % ((time() - start), n_iter_search)) | 121 | " parameter settings." % ((time() - start), n_iter_search)) |
134 | 122 | ||
123 | + print() | ||
124 | + print("The best model parameters:") | ||
125 | + print(vectorizer) | ||
126 | + print(svd) | ||
135 | print(clf.best_estimator_) | 127 | print(clf.best_estimator_) |
136 | print() | 128 | print() |
129 | + print("The best F1 score:") | ||
137 | print(clf.best_score_) | 130 | print(clf.best_score_) |
138 | - #print(metrics.f1_score(clf.predict(X_test), y_test)) | ||
139 | 131 | ||
140 | - #joblib.dump(clf, 'model/svm_model.pkl') | 132 | + joblib.dump(clf.best_estimator_, 'model_binClass/svm_model.pkl') |
141 | - joblib.dump(clf.best_estimator_, 'model/svm_model.pkl') | 133 | + joblib.dump(tfidf_model, 'model_binClass/tfidf_model.pkl') |
142 | - joblib.dump(tfidf_model, 'model/tfidf_model.pkl') | 134 | + joblib.dump(svd_model, 'model_binClass/svd_model.pkl') |
143 | - joblib.dump(svd_model, 'model/svd_model.pkl') | ||
144 | 135 | ||
145 | else: | 136 | else: |
146 | 137 | ||
147 | clf = joblib.load(args.svcmodel) | 138 | clf = joblib.load(args.svcmodel) |
148 | - vectorizer = joblib.load('model/tfidf_model.pkl') | 139 | + vectorizer = joblib.load('model_binClass/tfidf_model.pkl') |
149 | - svd = joblib.load('model/svd_model.pkl') | 140 | + svd = joblib.load('model_binClass/svd_model.pkl') |
141 | + print(vectorizer) | ||
142 | + print(svd) | ||
143 | + print(clf) | ||
150 | abstracs = get_abstracts(file_name=args.input, label='unknown') | 144 | abstracs = get_abstracts(file_name=args.input, label='unknown') |
151 | X = vectorizer.transform([x['body'] for x in abstracs]) | 145 | X = vectorizer.transform([x['body'] for x in abstracs]) |
152 | X = svd.transform(X) | 146 | X = svd.transform(X) |
... | @@ -162,3 +156,5 @@ else: | ... | @@ -162,3 +156,5 @@ else: |
162 | f0.write("%d\t%s\n" % (a['pmid'], a['body'])) | 156 | f0.write("%d\t%s\n" % (a['pmid'], a['body'])) |
163 | elif c == 1: | 157 | elif c == 1: |
164 | f1.write("%d\t%s\n" % (a['pmid'], a['body'])) | 158 | f1.write("%d\t%s\n" % (a['pmid'], a['body'])) |
159 | + | ||
160 | + print ("FINISHED!!") | ... | ... |
1 | -#from pdb import set_trace as st | ||
2 | -from sklearn.cross_validation import train_test_split as splitt | ||
3 | from sklearn.feature_extraction.text import TfidfVectorizer | 1 | from sklearn.feature_extraction.text import TfidfVectorizer |
4 | from sklearn.decomposition import TruncatedSVD | 2 | from sklearn.decomposition import TruncatedSVD |
5 | -from sklearn.model_selection import RandomizedSearchCV | ||
6 | from sklearn.model_selection import GridSearchCV | 3 | from sklearn.model_selection import GridSearchCV |
7 | from sklearn import metrics | 4 | from sklearn import metrics |
8 | -from sklearn.svm import SVC | 5 | + |
6 | +from sklearn.svm import OneClassSVM | ||
9 | import numpy as np | 7 | import numpy as np |
10 | import argparse | 8 | import argparse |
11 | import csv | 9 | import csv |
12 | import os | 10 | import os |
13 | from sklearn.externals import joblib | 11 | from sklearn.externals import joblib |
14 | from time import time | 12 | from time import time |
15 | -from scipy.stats import randint as sp_randint | ||
16 | -from scipy.stats import expon | ||
17 | -from sklearn.preprocessing import label_binarize | ||
18 | 13 | ||
19 | 14 | ||
20 | def get_abstracts(file_name, label): | 15 | def get_abstracts(file_name, label): |
... | @@ -75,22 +70,23 @@ parser = argparse.ArgumentParser( | ... | @@ -75,22 +70,23 @@ parser = argparse.ArgumentParser( |
75 | parser.add_argument("--input", help="Input file containing the abstracts to" | 70 | parser.add_argument("--input", help="Input file containing the abstracts to" |
76 | "be predited.") | 71 | "be predited.") |
77 | parser.add_argument("--classA", help="Input file containing the abstracts of" | 72 | parser.add_argument("--classA", help="Input file containing the abstracts of" |
78 | - "class A to be learned.") | 73 | + " class USEFUL to be learned.") |
79 | parser.add_argument("--classB", help="Input file containing the abstracts of" | 74 | parser.add_argument("--classB", help="Input file containing the abstracts of" |
80 | - "class B to be learned.") | 75 | + " class useless to be learned.") |
81 | parser.add_argument("--out", help="Path to the output directory " | 76 | parser.add_argument("--out", help="Path to the output directory " |
82 | "(default='./filter_output')", default="filter_output") | 77 | "(default='./filter_output')", default="filter_output") |
83 | parser.add_argument("--svcmodel", help="Path to custom pretrained svc model" | 78 | parser.add_argument("--svcmodel", help="Path to custom pretrained svc model" |
84 | - "(default='./model/svm_model.pkl')", default="model/svm_model.pkl") | 79 | + "(default='./model_oneClass/svm_model.pkl')", |
80 | + default="model_oneClass/svm_model.pkl") | ||
85 | 81 | ||
86 | args = parser.parse_args() | 82 | args = parser.parse_args() |
87 | 83 | ||
88 | labels = {0: 'useless', 1: 'useful'} | 84 | labels = {0: 'useless', 1: 'useful'} |
89 | vectorizer = TfidfVectorizer(binary=True) | 85 | vectorizer = TfidfVectorizer(binary=True) |
90 | -print(vectorizer) | 86 | + |
91 | 87 | ||
92 | if args.classA and args.classB and not args.input: | 88 | if args.classA and args.classB and not args.input: |
93 | - f0 = open("model_params.conf") | 89 | + f0 = open("model_params_oneClass.conf") |
94 | n_iter_search = 10 | 90 | n_iter_search = 10 |
95 | params = [p for p in csv.DictReader(f0)] | 91 | params = [p for p in csv.DictReader(f0)] |
96 | f0.close() | 92 | f0.close() |
... | @@ -115,50 +111,52 @@ if args.classA and args.classB and not args.input: | ... | @@ -115,50 +111,52 @@ if args.classA and args.classB and not args.input: |
115 | svd = TruncatedSVD(n_components=200, random_state=42, n_iter=20) | 111 | svd = TruncatedSVD(n_components=200, random_state=42, n_iter=20) |
116 | svd_model = svd.fit(X) | 112 | svd_model = svd.fit(X) |
117 | X = svd_model.transform(X) | 113 | X = svd_model.transform(X) |
118 | - #y = [x['topic'] for x in abstracs] | 114 | + y = [-1 if x['topic'] == 'useless' else 1 for x in abstracs] |
119 | - y = [0 if x['topic'] == 'useless' else 1 for x in abstracs] | ||
120 | - | ||
121 | - #X_train, X_test, y_train, y_test = splitt(X, y, test_size=0.3, random_state=42) | ||
122 | 115 | ||
123 | - clf = SVC()#kernel='linear', C=100.0, gamma=0.0001)# degree=11, coef0=0.9) | 116 | + clf = OneClassSVM() |
124 | - clf = GridSearchCV(clf, cv=3, | 117 | + clf = GridSearchCV(clf, cv=3, param_grid=model_params, |
125 | - param_grid=model_params, | ||
126 | - # clf = RandomizedSearchCV(clf, param_distributions=model_params, cv=5, n_iter=n_iter_search, | ||
127 | n_jobs=-1, scoring='f1') | 118 | n_jobs=-1, scoring='f1') |
128 | start = time() | 119 | start = time() |
129 | clf.fit(X, y) | 120 | clf.fit(X, y) |
130 | 121 | ||
131 | - #clf.fit(X_train, y_train) | ||
132 | print("GridSearch took %.2f seconds for %d candidates" | 122 | print("GridSearch took %.2f seconds for %d candidates" |
133 | " parameter settings." % ((time() - start), n_iter_search)) | 123 | " parameter settings." % ((time() - start), n_iter_search)) |
134 | 124 | ||
125 | + print() | ||
126 | + print("The best model parameters:") | ||
127 | + print(vectorizer) | ||
128 | + print(svd) | ||
135 | print(clf.best_estimator_) | 129 | print(clf.best_estimator_) |
136 | print() | 130 | print() |
131 | + print("The best F1 score:") | ||
137 | print(clf.best_score_) | 132 | print(clf.best_score_) |
138 | - #print(metrics.f1_score(clf.predict(X_test), y_test)) | ||
139 | 133 | ||
140 | - #joblib.dump(clf, 'model/svm_model.pkl') | 134 | + joblib.dump(clf.best_estimator_, 'model_oneClass/svm_model.pkl') |
141 | - joblib.dump(clf.best_estimator_, 'model/svm_model.pkl') | 135 | + joblib.dump(tfidf_model, 'model_oneClass/tfidf_model.pkl') |
142 | - joblib.dump(tfidf_model, 'model/tfidf_model.pkl') | 136 | + joblib.dump(svd_model, 'model_oneClass/svd_model.pkl') |
143 | - joblib.dump(svd_model, 'model/svd_model.pkl') | ||
144 | 137 | ||
145 | else: | 138 | else: |
146 | 139 | ||
147 | clf = joblib.load(args.svcmodel) | 140 | clf = joblib.load(args.svcmodel) |
148 | - vectorizer = joblib.load('model/tfidf_model.pkl') | 141 | + vectorizer = joblib.load('model_oneClass/tfidf_model.pkl') |
149 | - svd = joblib.load('model/svd_model.pkl') | 142 | + svd = joblib.load('model_oneClass/svd_model.pkl') |
143 | + print(vectorizer) | ||
144 | + print(svd) | ||
145 | + print(clf) | ||
150 | abstracs = get_abstracts(file_name=args.input, label='unknown') | 146 | abstracs = get_abstracts(file_name=args.input, label='unknown') |
151 | X = vectorizer.transform([x['body'] for x in abstracs]) | 147 | X = vectorizer.transform([x['body'] for x in abstracs]) |
152 | X = svd.transform(X) | 148 | X = svd.transform(X) |
153 | classes = clf.predict(X) | 149 | classes = clf.predict(X) |
154 | - | 150 | + |
155 | if not os.path.exists(args.out): | 151 | if not os.path.exists(args.out): |
156 | os.makedirs(args.out) | 152 | os.makedirs(args.out) |
157 | # Writing predictions to output files | 153 | # Writing predictions to output files |
158 | with open(args.out + "/" + labels[0] + ".out", 'w') as f0, \ | 154 | with open(args.out + "/" + labels[0] + ".out", 'w') as f0, \ |
159 | open(args.out + "/" + labels[1] + ".out", 'w') as f1: | 155 | open(args.out + "/" + labels[1] + ".out", 'w') as f1: |
160 | for c, a in zip(classes, abstracs): | 156 | for c, a in zip(classes, abstracs): |
161 | - if c == 0: | 157 | + if c == 1: |
162 | f0.write("%d\t%s\n" % (a['pmid'], a['body'])) | 158 | f0.write("%d\t%s\n" % (a['pmid'], a['body'])) |
163 | - elif c == 1: | 159 | + elif c == -1: |
164 | f1.write("%d\t%s\n" % (a['pmid'], a['body'])) | 160 | f1.write("%d\t%s\n" % (a['pmid'], a['body'])) |
161 | + | ||
162 | + print("FINISHED!!") | ... | ... |
model/svd_model.pkl
deleted
100644 → 0
model/svm_model.paper.pkl
deleted
100644 → 0
model/svm_model.pkl
deleted
100644 → 0
model/tfidf_model.paper.pkl
deleted
100644 → 0
model/tifidf_model.pkl
deleted
100644 → 0
model_binClass/svd_model.pkl
0 → 100644
model_binClass/svm_model.pkl
0 → 100644
model_binClass/tfidf_model.pkl
0 → 100644
model_oneClass/svd_model.pkl
0 → 100644
model_oneClass/svm_model.pkl
0 → 100644
model_oneClass/tfidf_model.pkl
0 → 100644
model_params_oneClass.conf
0 → 100644
1 | +kernel,degree,coef0,nu,gamma | ||
2 | +linear,1,0.5,1.0,0.0 | ||
3 | +linear,1,0.5,0.9,0.0 | ||
4 | +linear,1,0.5,0.8,0.0 | ||
5 | +linear,1,0.5,0.7,0.0 | ||
6 | +linear,1,0.5,0.6,0.0 | ||
7 | +linear,1,0.5,0.5,0.0 | ||
8 | +linear,1,0.5,0.4,0.0 | ||
9 | +linear,1,0.5,0.3,0.0 | ||
10 | +linear,1,0.5,0.2,0.0 | ||
11 | +linear,1,0.5,0.1,0.0 | ||
12 | +rbf,1,0.5,1.0,2.0 | ||
13 | +rbf,1,0.5,0.9,0.0001 | ||
14 | +rbf,1,0.5,0.8,0.0001 | ||
15 | +rbf,1,0.5,0.7,0.0001 | ||
16 | +rbf,1,0.5,0.6,0.001 | ||
17 | +rbf,1,0.5,0.5,0.001 | ||
18 | +rbf,1,0.5,0.4,0.001 | ||
19 | +rbf,1,0.5,0.7,0.0001 | ||
20 | +rbf,1,0.5,0.4,0.0001 | ||
21 | +rbf,1,0.5,0.5,0.0001 |
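For reference, a short sketch of how the scripts turn these rows into a `GridSearchCV` grid (this mirrors the parsing loop in `filter_abstracts_oneClass.py`; note that duplicate rows, such as the repeated `rbf,1,0.5,0.7,0.0001`, are removed by the deduplication):

```python
# Pool each CSV column into a deduplicated list of values, as the scripts do.
import csv

with open('model_params_oneClass.conf') as f:
    rows = list(csv.DictReader(f))

grid = {}
for name in rows[0]:
    values = []
    for r in rows:
        try:
            values.append(float(r[name]))  # numeric parameters
        except ValueError:
            values.append(r[name])         # e.g. kernel names
    grid[name] = list(set(values))

print(grid)  # e.g. {'kernel': ['rbf', 'linear'], 'nu': [1.0, 0.9, ...], ...}
```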
oneClass_trainUseful_out/useful.out
0 → 100644
oneClass_trainUseless_out/useful.out
0 → 100644
oneClass_trainUseless_out/useless.out
0 → 100644
outRNAseq_binClass/useful.out
0 → 100644
outRNAseq_binClass/useless.out
0 → 100644
outRNAseq_oneClass/useful.out
0 → 100644
outRNAseq_oneClass/useless.out
0 → 100644
report.txt
deleted
100644 → 0
1 | -TfidfVectorizer(analyzer='word', binary=True, decode_error='strict', | ||
2 | - dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', | ||
3 | - lowercase=True, max_df=1.0, max_features=None, min_df=1, | ||
4 | - ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True, | ||
5 | - stop_words=None, strip_accents=None, sublinear_tf=False, | ||
6 | - token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, | ||
7 | - vocabulary=None) | ||
8 | -Running GridSearchCV for GradientBoostingClassifier. | ||
9 | -Fitting 3 folds for each of 4 candidates, totalling 12 fits | ||
10 | -Running GridSearchCV for AdaBoostClassifier. | ||
11 | -Fitting 3 folds for each of 2 candidates, totalling 6 fits | ||
12 | -Running GridSearchCV for ExtraTreesClassifier. | ||
13 | -Fitting 3 folds for each of 2 candidates, totalling 6 fits | ||
14 | -Running GridSearchCV for SVC. | ||
15 | -Fitting 3 folds for each of 63 candidates, totalling 189 fits | ||
16 | -Running GridSearchCV for RandomForestClassifier. | ||
17 | -Fitting 3 folds for each of 2 candidates, totalling 6 fits | ||
18 | - estimator min_score mean_score max_score std_score \ | ||
19 | -36 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
20 | -66 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
21 | -35 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
22 | -37 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
23 | -38 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
24 | -39 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
25 | -40 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
26 | -41 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
27 | -42 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
28 | -43 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
29 | -44 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
30 | -45 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
31 | -46 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
32 | -47 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
33 | -48 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
34 | -49 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
35 | -50 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
36 | -51 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
37 | -52 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
38 | -53 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
39 | -54 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
40 | -55 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
41 | -56 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
42 | -57 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
43 | -58 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
44 | -59 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
45 | -60 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
46 | -61 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
47 | -62 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
48 | -63 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
49 | -.. ... ... ... ... ... | ||
50 | -12 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
51 | -13 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
52 | -14 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
53 | -15 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
54 | -16 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
55 | -17 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
56 | -26 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
57 | -25 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
58 | -30 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
59 | -29 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
60 | -28 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
61 | -27 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
62 | -19 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
63 | -65 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
64 | -24 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
65 | -23 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
66 | -22 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
67 | -21 SVC 0.69697 0.702911 0.705882 0.00420147 | ||
68 | -18 SVC 0.686567 0.693502 0.69697 0.0049038 | ||
69 | -20 SVC 0.676923 0.691047 0.707692 0.0126874 | ||
70 | -7 ExtraTreesClassifier 0.619048 0.662524 0.688525 0.0309388 | ||
71 | -6 ExtraTreesClassifier 0.588235 0.611627 0.655738 0.0312098 | ||
72 | -1 GradientBoostingClassifier 0.577778 0.595982 0.610169 0.0135256 | ||
73 | -0 GradientBoostingClassifier 0.5 0.549894 0.596491 0.0394613 | ||
74 | -71 RandomForestClassifier 0.470588 0.557789 0.625 0.0646035 | ||
75 | -3 GradientBoostingClassifier 0.454545 0.548927 0.596491 0.0667386 | ||
76 | -2 GradientBoostingClassifier 0.439024 0.588593 0.701754 0.110305 | ||
77 | -5 AdaBoostClassifier 0.411765 0.489657 0.618182 0.0915596 | ||
78 | -4 AdaBoostClassifier 0.4 0.54013 0.655172 0.105673 | ||
79 | -72 RandomForestClassifier 0.380952 0.504177 0.631579 0.10236 | ||
80 | - | ||
81 | - C degree gamma kernel learning_rate n_estimators | ||
82 | -36 100 6 NaN poly NaN NaN | ||
83 | -66 200 NaN 0.0001 sigmoid NaN NaN | ||
84 | -35 100 5 NaN poly NaN NaN | ||
85 | -37 150 2 NaN poly NaN NaN | ||
86 | -38 150 3 NaN poly NaN NaN | ||
87 | -39 150 4 NaN poly NaN NaN | ||
88 | -40 150 5 NaN poly NaN NaN | ||
89 | -41 150 6 NaN poly NaN NaN | ||
90 | -42 200 2 NaN poly NaN NaN | ||
91 | -43 200 3 NaN poly NaN NaN | ||
92 | -44 200 4 NaN poly NaN NaN | ||
93 | -45 200 5 NaN poly NaN NaN | ||
94 | -46 200 6 NaN poly NaN NaN | ||
95 | -47 300 2 NaN poly NaN NaN | ||
96 | -48 300 3 NaN poly NaN NaN | ||
97 | -49 300 4 NaN poly NaN NaN | ||
98 | -50 300 5 NaN poly NaN NaN | ||
99 | -51 300 6 NaN poly NaN NaN | ||
100 | -52 400 2 NaN poly NaN NaN | ||
101 | -53 400 3 NaN poly NaN NaN | ||
102 | -54 400 4 NaN poly NaN NaN | ||
103 | -55 400 5 NaN poly NaN NaN | ||
104 | -56 400 6 NaN poly NaN NaN | ||
105 | -57 1 NaN 0.001 sigmoid NaN NaN | ||
106 | -58 1 NaN 0.0001 sigmoid NaN NaN | ||
107 | -59 10 NaN 0.001 sigmoid NaN NaN | ||
108 | -60 10 NaN 0.0001 sigmoid NaN NaN | ||
109 | -61 100 NaN 0.001 sigmoid NaN NaN | ||
110 | -62 100 NaN 0.0001 sigmoid NaN NaN | ||
111 | -63 150 NaN 0.001 sigmoid NaN NaN | ||
112 | -.. ... ... ... ... ... ... | ||
113 | -12 100 NaN 0.001 rbf NaN NaN | ||
114 | -13 100 NaN 0.0001 rbf NaN NaN | ||
115 | -14 150 NaN 0.001 rbf NaN NaN | ||
116 | -15 150 NaN 0.0001 rbf NaN NaN | ||
117 | -16 200 NaN 0.001 rbf NaN NaN | ||
118 | -17 200 NaN 0.0001 rbf NaN NaN | ||
119 | -26 1 6 NaN poly NaN NaN | ||
120 | -25 1 5 NaN poly NaN NaN | ||
121 | -30 10 5 NaN poly NaN NaN | ||
122 | -29 10 4 NaN poly NaN NaN | ||
123 | -28 10 3 NaN poly NaN NaN | ||
124 | -27 10 2 NaN poly NaN NaN | ||
125 | -19 300 NaN 0.0001 rbf NaN NaN | ||
126 | -65 200 NaN 0.001 sigmoid NaN NaN | ||
127 | -24 1 4 NaN poly NaN NaN | ||
128 | -23 1 3 NaN poly NaN NaN | ||
129 | -22 1 2 NaN poly NaN NaN | ||
130 | -21 400 NaN 0.0001 rbf NaN NaN | ||
131 | -18 300 NaN 0.001 rbf NaN NaN | ||
132 | -20 400 NaN 0.001 rbf NaN NaN | ||
133 | -7 NaN NaN NaN NaN NaN 32 | ||
134 | -6 NaN NaN NaN NaN NaN 16 | ||
135 | -1 NaN NaN NaN NaN 0.8 32 | ||
136 | -0 NaN NaN NaN NaN 0.8 16 | ||
137 | -71 NaN NaN NaN NaN NaN 16 | ||
138 | -3 NaN NaN NaN NaN 1 32 | ||
139 | -2 NaN NaN NaN NaN 1 16 | ||
140 | -5 NaN NaN NaN NaN NaN 32 | ||
141 | -4 NaN NaN NaN NaN NaN 16 | ||
142 | -72 NaN NaN NaN NaN NaN 32 | ||
143 | - | ||
144 | -[73 rows x 11 columns] |
tablita_results.xlsx
0 → 100644
trainUseful_out/useful.out
0 → 100644
trainUseful_out/useless.out
0 → 100644
File mode changed
trainUseless_out/useful.out
0 → 100644
1 | +28911122 ChIP-exo/nexus experiments rely on innovative modifications of the commonly used ChIP-seq protocol for high resolution mapping of transcription factor binding sites. Although many aspects of the ChIP-exo data analysis are similar to those of ChIP-seq, these high throughput experiments pose a number of unique quality control and analysis challenges. We develop a novel statistical quality control pipeline and accompanying R/Bioconductor package, ChIPexoQual, to enable exploration and analysis of ChIP-exo and related experiments. ChIPexoQual evaluates a number of key issues including strand imbalance, library complexity, and signal enrichment of data. Assessment of these features are facilitated through diagnostic plots and summary statistics computed over regions of the genome with varying levels of coverage. We evaluated our QC pipeline with both large collections of public ChIP-exo/nexus data and multiple, new ChIP-exo datasets from Escherichia coli. ChIPexoQual analysis of these datasets resulted in guidelines for using these QC metrics across a wide range of sequencing depths and provided further insights for modelling ChIP-exo data. |
trainUseless_out/useless.out
0 → 100644