


This paper talks about (and reports) experimental data

Automatic discrimination of useless papers via machine learning of abstracts.

The main method follows this pipeline:


Training mode


Parse abstracts from two input files (classA and classB; see the file format in the data/ directory)
Transform abstracts into their TFIDF sparse representations
Train Support Vector Machines with different parameters by using GridSearch 
Select the best estimator and save it to model_binClass/svm_model.pkl (default)
Save the fitted TFIDF transformation to keep the training vocabulary (stored at model_binClass/tfidf_model.pkl)
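
The training steps above can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn; the function name `train`, the parameter grid, and the `cv` argument are hypothetical and are not taken from filter_abstracts_binClass.py or model_params_binClass.conf.

```python
import pickle
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def train(class_a, class_b, model_dir="model_binClass", cv=5):
    """Fit TF-IDF and an SVM on two lists of abstracts, then pickle both."""
    texts = list(class_a) + list(class_b)
    # Assumption: classA is labelled 0 and classB is labelled 1.
    labels = [0] * len(class_a) + [1] * len(class_b)

    # Learn the vocabulary and build the sparse TF-IDF matrix.
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)

    # Illustrative grid only; the real values would come from the .conf file.
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(SVC(), param_grid, cv=cv)
    search.fit(features, labels)

    # Persist the best estimator and the fitted vectorizer.
    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "svm_model.pkl", "wb") as fh:
        pickle.dump(search.best_estimator_, fh)
    with open(out_dir / "tfidf_model.pkl", "wb") as fh:
        pickle.dump(vectorizer, fh)
    return search.best_estimator_, vectorizer
```

Saving the vectorizer alongside the classifier matters because new abstracts must be mapped onto the same vocabulary the SVM was trained on.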


Prediction mode


Parse abstracts from a single input file
Transform abstracts into their TFIDF sparse representations
Transform TFIDF representations into their 200-dimensional SVD approximation
Predict useless/useful papers by means of their abstracts using pretrained Support Vector Machines
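
The prediction steps above might look like the sketch below. It assumes the fitted objects expose scikit-learn style `transform()` and `predict()` methods; the function name `split_predictions`, the optional `reducer` argument (standing in for the SVD step), and the convention that label 1 means "useful" are all assumptions made for illustration.

```python
def split_predictions(records, vectorizer, classifier, reducer=None, useful_label=1):
    """Split (pmid, abstract) pairs into useful and useless lists.

    vectorizer, reducer and classifier are assumed to be already fitted.
    """
    if not records:
        return [], []

    texts = [text for _, text in records]
    features = vectorizer.transform(texts)  # TF-IDF representation
    if reducer is not None:
        # Optional dimensionality reduction, e.g. a fitted SVD model.
        features = reducer.transform(features)
    labels = classifier.predict(features)

    useful, useless = [], []
    for record, label in zip(records, labels):
        # Assumption: useful_label marks the "useful" class.
        (useful if label == useful_label else useless).append(record)
    return useful, useless
```

The reducer is kept optional here because any reduction applied at prediction time has to match whatever the classifier saw during training.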


Usage

To filter unknown abstracts, run:

$ python filter_abstracts_binClass.py --input data/test_abstracts.txt


The predictions are stored in filter_output/ by default, unless a different directory is specified with the --out option. The default output files containing the predictions are:


filter_output/useful.out
filter_output/useless.out


The format of each file is:

<PMID> \t <text of the abstract>
...
<PMID> \t <text of the abstract>
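
For downstream processing, files in this format could be read with a few lines of Python. The helper name `read_predictions` is hypothetical, and the sketch assumes UTF-8 text with one record per line.

```python
def read_predictions(path):
    """Return a list of (pmid, abstract) tuples from a tab-separated file."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # Split on the first tab only, so tabs inside the abstract survive.
            pmid, _, text = line.partition("\t")
            records.append((pmid.strip(), text.strip()))
    return records
```

For example, `read_predictions("filter_output/useful.out")` would return the PMID and abstract of every paper predicted as useful.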


To train a new model, set the parameters in model_params_binClass.conf and then run:

$ python filter_abstracts_binClass.py --classA data/ecoli_abstracts/not_useful_abstracts.txt --classB data/ecoli_abstracts/useful_abstracts.txt


where --classA and --classB specify the input training files, with --classB holding the useful papers. In this example, data/ecoli_abstracts/useful_abstracts.txt is the training file containing abstracts of papers that report experimental data (the desired, or useful, class).
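
As a rough illustration of how the two modes could be told apart from the command line, the sketch below uses `argparse` with the four options named in this README. The functions `build_parser` and `select_mode` are hypothetical, and the actual dispatch logic in filter_abstracts_binClass.py may differ.

```python
import argparse


def build_parser():
    """Build a parser for the options mentioned in this README."""
    parser = argparse.ArgumentParser(description="Filter abstracts (sketch).")
    parser.add_argument("--input", help="file of abstracts to classify")
    parser.add_argument("--out", default="filter_output", help="output directory")
    parser.add_argument("--classA", help="training file with not-useful abstracts")
    parser.add_argument("--classB", help="training file with useful abstracts")
    return parser


def select_mode(args):
    """Return 'train' or 'predict' depending on which options were given."""
    # Assumption: supplying both training files selects training mode.
    if args.classA and args.classB:
        return "train"
    if args.input:
        return "predict"
    raise SystemExit("give either --input, or both --classA and --classB")
```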