
This paper talks about (and reports) experimental data

Automatic discrimination of useless papers via machine learning of abstracts.

The main method follows this pipeline:

Training mode

  • Parse abstracts from two input files (classA and classB; see the file format in the data/ directory)
  • Transform abstracts into their TFIDF sparse representations
  • Train Support Vector Machines with different parameters by using GridSearch
  • Select the best estimator and save it to model/svm_model.pkl (default)
  • Save the fitted TFIDF transformation to preserve the training vocabulary (stored at model/tfidf_model.pkl)
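
The training steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn and joblib, not the code in this repository: the function name `train`, the parameter grid, the `cv` value, and the label convention (0 for classA, 1 for classB) are all assumptions made for the example.

```python
import os

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def train(class_a_texts, class_b_texts, model_dir="model", param_grid=None, cv=2):
    """Fit TF-IDF + SVM with a grid search and persist both objects.

    Hypothetical helper: names, defaults and labels are illustrative only.
    """
    texts = list(class_a_texts) + list(class_b_texts)
    # Assumed label convention: classA -> 0, classB -> 1.
    labels = [0] * len(class_a_texts) + [1] * len(class_b_texts)

    # Learn the vocabulary and IDF weights from the training abstracts.
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)  # sparse matrix

    # Illustrative grid; the real parameters would come from the .conf file.
    if param_grid is None:
        param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
    search = GridSearchCV(SVC(), param_grid, cv=cv)
    search.fit(features, labels)

    # Persist the best estimator and the fitted vectorizer side by side.
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(search.best_estimator_, os.path.join(model_dir, "svm_model.pkl"))
    joblib.dump(vectorizer, os.path.join(model_dir, "tfidf_model.pkl"))
    return search.best_estimator_, vectorizer
```

Saving the vectorizer together with the classifier matters because the SVM expects feature columns in exactly the order of the training vocabulary.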

Prediction mode

  • Parse abstracts from a single input file
  • Transform abstracts into their TFIDF sparse representations
  • Predict useless/useful papers by means of their abstracts using pretrained Support Vector Machines
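
A rough sketch of the prediction steps, assuming the two pickles were written with joblib under the file names mentioned in the training section. The function name `split_by_prediction` and the assumption that label 1 means "useful" are hypothetical and may not match the repository's scripts.

```python
import os

import joblib


def split_by_prediction(records, model_dir="model", useful_label=1):
    """Split (pmid, abstract) pairs into useful and useless lists.

    Hypothetical helper; `useful_label` is an assumed label convention.
    """
    vectorizer = joblib.load(os.path.join(model_dir, "tfidf_model.pkl"))
    classifier = joblib.load(os.path.join(model_dir, "svm_model.pkl"))

    texts = [text for _, text in records]
    # transform(), not fit_transform(): reuse the training vocabulary as-is.
    features = vectorizer.transform(texts)
    predicted = classifier.predict(features)

    useful, useless = [], []
    for record, label in zip(records, predicted):
        (useful if label == useful_label else useless).append(record)
    return useful, useless
```

Words that never appeared in the training abstracts are simply ignored by `transform()`, so very short or off-topic abstracts may end up with near-empty feature vectors.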

Usage

To filter unknown abstracts, run

$ python filter_abstracts.py --input data/test_abstracts.txt

The predictions are stored by default in filter_output/, unless a different directory is specified with the --out option. The default files containing the predictions are

  • filter_output/useful.out
  • filter_output/useless.out

The format of each file is:

<PMID> \t <text of the abstract>
...
<PMID> \t <text of the abstract>
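
For downstream processing, the output files can be read back with a few lines of standard-library Python. The sketch below assumes one record per line, a single tab between the PMID and the abstract, and UTF-8 encoding; the helper names are hypothetical.

```python
def parse_line(line):
    """Split one '<PMID> \\t <abstract>' line into a (pmid, text) pair."""
    pmid, _, text = line.rstrip("\n").partition("\t")
    # Strip padding around the tab, in case the fields are space-padded.
    return pmid.strip(), text.strip()


def read_predictions(path):
    """Read all non-empty records from a prediction file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]


def write_predictions(path, records):
    """Write (pmid, text) pairs in the same tab-separated layout."""
    with open(path, "w", encoding="utf-8") as handle:
        for pmid, text in records:
            handle.write(f"{pmid}\t{text}\n")
```

Using `partition("\t")` rather than `split("\t")` keeps any further tabs inside the abstract text intact.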

To train a new model, set the list of parameters in model_params.conf and then run

$ python filter_abstracts.py --classA data/ecoli_abstracts/not_useful_abstracts.txt --classB data/ecoli_abstracts/useful_abstracts.txt

where --classA and --classB specify the input training files. In this example, data/ecoli_abstracts/useful_abstracts.txt is the training file containing abstracts of papers that report experimental data (the desired, or useful, class for us).
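
The command-line options described in this section could be wired up with `argparse` along these lines. This is only a sketch: the rule that training runs when both class files are given and prediction runs otherwise is an assumption, and the actual parser in the scripts may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the options named in this README."""
    parser = argparse.ArgumentParser(
        description="Filter abstracts with an SVM trained on TF-IDF features."
    )
    parser.add_argument("--input", help="file with abstracts to classify")
    parser.add_argument("--out", default="filter_output",
                        help="directory for useful.out and useless.out")
    parser.add_argument("--classA", help="training file for the first class")
    parser.add_argument("--classB", help="training file for the second class")
    return parser


def select_mode(args):
    """Assumed rule: train if both class files are given, else predict."""
    if args.classA and args.classB:
        return "train"
    if args.input:
        return "predict"
    raise SystemExit("provide --input, or both --classA and --classB")
```

For example, `select_mode(build_parser().parse_args(["--input", "data/test_abstracts.txt"]))` returns `"predict"` under this assumed rule.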