This paper talks about (and reports) experimental data
Automatic discrimination of useless papers via machine learning of abstracts.
The method follows this pipeline:
Training mode
- Parse abstracts from two input files (classA and classB; see the file format in the data/ directory)
- Transform abstracts into their TFIDF sparse representations
- Train Support Vector Machines with different parameters using GridSearch
- Select the best estimator and save it at model_binClass/svm_model.pkl (default)
- Save the TFIDF transformation to keep the training vocabulary (stored at model_binClass/tfidf_model.pkl)
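The training steps above can be sketched roughly as follows. This is a minimal illustration, not the code in filter_abstracts_binClass.py: it assumes scikit-learn (which this README does not name), the function name train_filter is hypothetical, the label convention (0 for classA, 1 for classB) is an assumption, and the parameter grid is a placeholder for whatever model_params_binClass.conf actually defines.

```python
import pickle
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def train_filter(class_a_texts, class_b_texts, model_dir="model_binClass"):
    """Fit TF-IDF + SVM on two lists of abstracts and pickle both models."""
    texts = list(class_a_texts) + list(class_b_texts)
    # Assumed label convention: 0 = classA, 1 = classB.
    labels = [0] * len(class_a_texts) + [1] * len(class_b_texts)

    # Learn the vocabulary and build the sparse TF-IDF matrix.
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)

    # Placeholder grid; the real values would come from the .conf file.
    param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
    search = GridSearchCV(SVC(), param_grid, cv=2)
    search.fit(features, labels)

    # Save the best estimator and the fitted vectorizer side by side,
    # so prediction can reuse the training vocabulary.
    out = Path(model_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "svm_model.pkl", "wb") as fh:
        pickle.dump(search.best_estimator_, fh)
    with open(out / "tfidf_model.pkl", "wb") as fh:
        pickle.dump(vectorizer, fh)
    return search.best_estimator_, vectorizer
```

Saving the vectorizer together with the classifier matters because the SVM only understands feature columns in the order fixed by the training vocabulary.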
Prediction mode
- Parse abstracts from a single input file
- Transform abstracts into their TFIDF sparse representations
- Transform TFIDF representations into their 200-dimensional SVD approximation
- Predict useless/useful papers by means of their abstracts using pretrained Support Vector Machines
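As a rough sketch of the prediction steps above, assuming scikit-learn: the key point is that prediction calls transform (never fit) on models fitted at training time. The helper names fit_toy_models and predict_abstracts are hypothetical, the toy models are fitted in memory instead of being loaded from the .pkl files, and the sketch assumes that a fitted SVD is available alongside the TFIDF and SVM models. The toy data uses 2 SVD components because its vocabulary is far too small for the 200 dimensions used by the real pipeline.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def fit_toy_models(texts, labels, n_components):
    """Stand-in for the pretrained models (fitted here on toy data)."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(texts)
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    reduced = svd.fit_transform(tfidf)
    svm = SVC(kernel="linear").fit(reduced, labels)
    return vectorizer, svd, svm


def predict_abstracts(texts, vectorizer, svd, svm):
    """Apply the already-fitted models to new abstracts."""
    tfidf = vectorizer.transform(texts)  # reuse the training vocabulary
    reduced = svd.transform(tfidf)       # project onto the fitted SVD axes
    return svm.predict(reduced)


# Toy data; assumed labels: 1 = useful, 0 = useless.
train_texts = [
    "we measured gene expression in knockout strains",
    "protein levels were quantified by western blot",
    "this review summarizes previous literature",
    "we discuss open questions and perspectives",
]
train_labels = [1, 1, 0, 0]

models = fit_toy_models(train_texts, train_labels, n_components=2)
predicted = predict_abstracts(["expression was measured in mutant strains"], *models)
print(predicted)
```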
Usage
To filter unknown abstracts, run:
$ python filter_abstracts_binClass.py --input data/test_abstracts.txt
The predictions are stored by default in filter_output/, unless a different directory is specified with the --out option. The default output files containing the predictions are:
filter_output/useful.out
filter_output/useless.out
The format of each file is:
<PMID> \t <text of the abstract>
...
<PMID> \t <text of the abstract>
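A file in this format can be read with a few lines of Python. The sketch below is illustrative only: parse_abstracts is a hypothetical helper, not a function exported by this repository, and it assumes one record per line with a single tab between the PMID and the abstract (whitespace around the tab is stripped).

```python
def parse_abstracts(lines):
    """Yield (pmid, abstract) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the abstract survive.
        pmid, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"missing tab separator in line: {line!r}")
        yield pmid.strip(), text.strip()
```

For example, `with open("filter_output/useful.out") as fh: records = list(parse_abstracts(fh))` would load the useful predictions as a list of pairs.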
To train a new model, set the parameters in model_params_binClass.conf and then run:
$ python filter_abstracts_binClass.py --classA data/ecoli_abstracts/not_useful_abstracts.txt --classB data/ecoli_abstracts/useful_abstracts.txt
where --classA and --classB (the useful papers) specify the input training files. In this example, data/ecoli_abstracts/useful_abstracts.txt is the training file containing abstracts of papers that report experimental data (the desired, or useful, class for us).
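The two invocations differ only in which options are passed. One way a script could distinguish them is sketched below with argparse. The option names and the filter_output default come from this README, but the dispatch logic, the help strings, and the names build_parser and select_mode are assumptions for illustration, not the script's actual implementation.

```python
import argparse


def build_parser():
    """Declare the four options mentioned in this README."""
    parser = argparse.ArgumentParser(prog="filter_abstracts_binClass.py")
    parser.add_argument("--input", help="file with abstracts to classify")
    parser.add_argument("--out", default="filter_output",
                        help="directory for useful.out and useless.out")
    parser.add_argument("--classA", help="training file for class A")
    parser.add_argument("--classB", help="training file for class B (useful)")
    return parser


def select_mode(args):
    """Assumed rule: both class files mean training, --input means prediction."""
    if args.classA and args.classB:
        return "train"
    if args.input:
        return "predict"
    raise ValueError("pass either --input or both --classA and --classB")
```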