useless

Automatic discrimination of useless papers via machine learning of abstracts.

Does a given paper talk about (and report) experimental data?

We tackle this question by automatically discriminating useless papers from useful ones via machine learning on their abstracts. A useful paper here is one that reports experimental data.

The main method follows this pipeline:

Training mode

  • Parse abstracts from two input files (classA and classB; see the file formats in the data/ directory)
  • Transform abstracts into their TFIDF sparse representations
  • Transform TFIDF representations into their 200-dimensional SVD approximation and save it at model_binClass/svd_model.pkl
  • Train Support Vector Machines with different parameters by using GridSearch
  • Select the best estimator and save it at model_binClass/svm_model.pkl (default)
  • Save the TFIDF transformation to keep the training vocabulary (stored at model_binClass/tfidf_model.pkl)
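
The training steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the code in filter_abstracts_binClass.py: the function name train_models, the parameter grid, the label convention (0 for classA, 1 for classB), the 2-fold cross-validation and the use of pickle for saving are all hypothetical. The real parameters are set in model_params_binClass.conf.

```python
import pickle
from pathlib import Path

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def train_models(class_a, class_b, model_dir="model_binClass", n_components=200):
    """Fit TFIDF -> SVD -> SVM on two lists of abstracts and pickle each step."""
    texts = list(class_a) + list(class_b)
    # Assumed label convention: 0 = classA, 1 = classB.
    labels = [0] * len(class_a) + [1] * len(class_b)

    # Sparse TFIDF representation; the fitted vectorizer holds the vocabulary.
    tfidf = TfidfVectorizer()
    x_sparse = tfidf.fit_transform(texts)

    # Low-dimensional SVD approximation of the TFIDF matrix.
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    x_reduced = svd.fit_transform(x_sparse)

    # Hypothetical grid, chosen only for illustration.
    param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
    search = GridSearchCV(SVC(), param_grid, cv=2)
    search.fit(x_reduced, labels)
    best_svm = search.best_estimator_

    # Save the three fitted models under the file names listed above.
    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    models = {
        "tfidf_model.pkl": tfidf,
        "svd_model.pkl": svd,
        "svm_model.pkl": best_svm,
    }
    for name, model in models.items():
        with open(out_dir / name, "wb") as handle:
            pickle.dump(model, handle)
    return tfidf, svd, best_svm
```

Note that the SVD step cannot usefully extract more components than there are TFIDF features, so a toy corpus needs a much smaller n_components than 200.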

Prediction mode

  • Parse abstracts from a single input file
  • Transform abstracts into their TFIDF sparse representations
  • Transform TFIDF representations into their 200-dimensional SVD approximation
  • Predict whether each paper is useless or useful from its abstract, using the pretrained Support Vector Machine
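
The prediction steps above could look roughly like the sketch below. It assumes the three saved models can be loaded with pickle and that label 1 marks the useful class. Both are assumptions, as are the helper names load_models and split_abstracts, which do not come from the repository.

```python
import pickle
from pathlib import Path


def load_models(model_dir="model_binClass"):
    """Load the TFIDF, SVD and SVM models saved during training."""
    models = []
    for name in ("tfidf_model.pkl", "svd_model.pkl", "svm_model.pkl"):
        with open(Path(model_dir) / name, "rb") as handle:
            models.append(pickle.load(handle))
    return tuple(models)


def split_abstracts(records, tfidf, svd, svm, useful_label=1):
    """Classify (pmid, text) records and split them into useful and useless."""
    records = list(records)
    if not records:
        return [], []
    texts = [text for _, text in records]
    # Reuse the training vocabulary and the fitted SVD projection.
    features = svd.transform(tfidf.transform(texts))
    labels = svm.predict(features)
    useful, useless = [], []
    for record, label in zip(records, labels):
        # useful_label is an assumed convention, not taken from the repository.
        (useful if label == useful_label else useless).append(record)
    return useful, useless
```

A call would then be `useful, useless = split_abstracts(records, *load_models())`, where records is a list of (PMID, abstract) pairs.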

Dependencies

  • Python 3.6
  • Scikit-learn 0.18.1

Usage

To filter unknown abstracts, run

$ python filter_abstracts_binClass.py --input data/test_abstracts.txt

The predictions are stored by default in filter_output/, unless a different directory is specified with the --out option. The default output files containing the predictions are

  • filter_output/useful.out
  • filter_output/useless.out

The format of each file is:

<PMID> \t <text of the abstract>
...
<PMID> \t <text of the abstract>
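
As an illustration of this format, the sketch below writes and reads such files. It assumes one record per line with a single tab between the PMID and the abstract, and UTF-8 encoding. The function names write_abstracts and read_abstracts are hypothetical.

```python
from pathlib import Path


def write_abstracts(records, path):
    """Write (pmid, text) pairs as one tab-separated line each."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        for pmid, text in records:
            # Collapse whitespace so an abstract never spans several lines.
            clean = " ".join(text.split())
            handle.write(f"{pmid}\t{clean}\n")


def read_abstracts(path):
    """Read a <PMID> \\t <abstract> file back into (pmid, text) pairs."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            pmid, _, text = line.partition("\t")
            records.append((pmid.strip(), text.strip()))
    return records
```

For example, `read_abstracts("filter_output/useful.out")` would return the abstracts predicted as useful.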

To train a new model, set the parameters in model_params_binClass.conf and then run

$ python filter_abstracts_binClass.py --classA data/ecoli_abstracts/not_useful_abstracts.txt --classB data/ecoli_abstracts/useful_abstracts.txt

where --classA and --classB specify the input training files. In this example, --classA points to the not-useful abstracts and --classB to data/ecoli_abstracts/useful_abstracts.txt, the training file containing abstracts of papers that report experimental data (the desired, or useful, class for us). Both training files must follow a specific format so that they can be parsed; see their contents for details.
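
The two ways of calling the script can be summarised with a small argparse sketch. Only the option names --input, --out, --classA and --classB and the default output directory come from this README. The rule for choosing between training and prediction is an assumption about how the script might decide, and build_parser and select_mode are hypothetical names.

```python
import argparse


def build_parser():
    """Command-line options mentioned in this README (sketch only)."""
    parser = argparse.ArgumentParser(description="Filter useless abstracts.")
    parser.add_argument("--input", help="file with abstracts to classify")
    parser.add_argument("--classA", help="training file for class A")
    parser.add_argument("--classB", help="training file for class B")
    parser.add_argument("--out", default="filter_output",
                        help="directory for useful.out and useless.out")
    return parser


def select_mode(args):
    """Return 'train' or 'predict'; the rule itself is an assumption."""
    if args.classA and args.classB:
        return "train"
    if args.input:
        return "predict"
    raise ValueError("give --input, or both --classA and --classB")
```

With this sketch, `select_mode(build_parser().parse_args(["--input", "data/test_abstracts.txt"]))` selects prediction mode.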