
This paper talks about (and reports) experimental data

Automatic discrimination of useless papers via machine learning of abstracts.

The method follows this pipeline:

Training mode

  • Parse abstracts from two input files (classA and classB; see the file format in the data/ directory)
  • Transform abstracts into their TFIDF sparse representations
  • Train Support Vector Machines with different parameters by using GridSearch
  • Select the best estimator and save it to model/svm_model.pkl (default)
  • Save the fitted TFIDF transformation to preserve the training vocabulary (stored at model/tfidf_model.pkl)
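
The training steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in filter_abstracts.py: the function name train, the parameter grid, and the cv argument are assumptions (the real grid is read from model_params.conf), and the abstracts are passed in as plain lists of strings instead of being parsed from the input files.

```python
import pickle
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def train(class_a, class_b, model_dir="model", param_grid=None, cv=3):
    """Fit TF-IDF + SVM on two lists of abstracts and pickle both models."""
    if param_grid is None:
        # Hypothetical grid; the project reads its values from model_params.conf.
        param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

    texts = list(class_a) + list(class_b)
    labels = [0] * len(class_a) + [1] * len(class_b)  # 0 = classA, 1 = classB

    # Learn the vocabulary and IDF weights on the training abstracts only.
    tfidf = TfidfVectorizer()
    features = tfidf.fit_transform(texts)  # sparse matrix

    # Cross-validated search over the SVM parameters.
    search = GridSearchCV(SVC(), param_grid, cv=cv)
    search.fit(features, labels)

    # Persist both objects: the vectorizer is needed again at prediction time.
    out = Path(model_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "svm_model.pkl", "wb") as fh:
        pickle.dump(search.best_estimator_, fh)
    with open(out / "tfidf_model.pkl", "wb") as fh:
        pickle.dump(tfidf, fh)
    return tfidf, search.best_estimator_
```

Saving the vectorizer alongside the classifier matters because the SVM only understands feature columns in the order defined by the training vocabulary.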

Prediction mode

  • Parse abstracts from a single input file
  • Transform abstracts into their TFIDF sparse representations using the transformation saved during training
  • Classify papers as useful or useless from their abstracts using the pretrained Support Vector Machine
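
As a rough sketch of the prediction steps, assuming both models were pickled at the default paths given in the training section, the saved objects can be loaded and applied as shown here. The function name predict_abstracts is hypothetical, and the meaning of the returned labels depends on how the classes were encoded at training time.

```python
import pickle


def predict_abstracts(abstracts,
                      tfidf_path="model/tfidf_model.pkl",
                      svm_path="model/svm_model.pkl"):
    """Return one predicted label per abstract using the pickled models."""
    with open(tfidf_path, "rb") as fh:
        tfidf = pickle.load(fh)
    with open(svm_path, "rb") as fh:
        svm = pickle.load(fh)

    # Only transform here: refitting would change the vocabulary and
    # break the correspondence with the features the SVM was trained on.
    features = tfidf.transform(abstracts)
    return svm.predict(features)
```

Words that never appeared in the training abstracts are simply ignored by the saved transformation, so unseen vocabulary does not cause an error.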

Usage

To filter unlabeled abstracts, run:

$ python filter_abstracts.py --input data/test_abstracts.txt

The predictions are stored in filter_output/ by default, unless a different directory is specified with the --out option. The default output files are:

  • filter_output/useful.out
  • filter_output/useless.out

The format of each file is:

<PMID> \t <text of the abstract>
...
<PMID> \t <text of the abstract>
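
The following sketch shows one way to write and read files in this format using only the standard library. The helper names write_predictions and read_predictions are hypothetical, and the code assumes that each line holds a PMID and an abstract separated by a single tab, with any surrounding spaces being insignificant.

```python
from pathlib import Path


def write_predictions(records, out_dir="filter_output"):
    """Write (pmid, abstract, is_useful) records to useful.out / useless.out."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "useful.out", "w", encoding="utf-8") as useful, \
            open(out / "useless.out", "w", encoding="utf-8") as useless:
        for pmid, text, is_useful in records:
            target = useful if is_useful else useless
            target.write(f"{pmid}\t{text}\n")


def read_predictions(path):
    """Return a list of (pmid, abstract) tuples from one output file."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only; abstracts may contain tabs.
            pmid, _, text = line.partition("\t")
            rows.append((pmid.strip(), text.strip()))
    return rows
```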

To train a new model, set the parameters in model_params.conf and then run:

$ python filter_abstracts.py --classA data/ecoli_abstracts/not_useful_abstracts.txt --classB data/ecoli_abstracts/useful_abstracts.txt

where --classA and --classB specify the input training files. In this example, data/ecoli_abstracts/useful_abstracts.txt is the training file containing abstracts of papers that report experimental data (the desired, or useful, class).
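
For illustration, the command-line options mentioned in this section could be declared with argparse as sketched below. Only the option names (--input, --out, --classA, --classB) and the default output directory come from this README. The rule for choosing between training and prediction is an assumption about how filter_abstracts.py might dispatch, and build_parser and select_mode are hypothetical names.

```python
import argparse


def build_parser():
    """Declare the options described in the Usage section."""
    parser = argparse.ArgumentParser(
        description="Filter abstracts or train a new filtering model.")
    parser.add_argument("--input", help="file with abstracts to classify")
    parser.add_argument("--out", default="filter_output",
                        help="directory for useful.out and useless.out")
    parser.add_argument("--classA", help="training file for the first class")
    parser.add_argument("--classB", help="training file for the second class")
    return parser


def select_mode(args):
    """Assumed rule: both class files mean training, --input means prediction."""
    if args.classA and args.classB:
        return "train"
    if args.input:
        return "predict"
    raise ValueError("provide --input, or both --classA and --classB")


if __name__ == "__main__":
    print(select_mode(build_parser().parse_args()))
```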