


Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO).

This project automatically extracts the growth conditions of all enterobacteria in the GEO using Conditional Random Fields (CRFs).


Prerequisites


Programming languages


Python (versions 2.7 and 3.7)
Bash


Folder content

CRF


bin


label-split_training_test_v1.py
params.py
training_validation_v3.py


data-sets


test-data-set-30.txt
training-data-set-70.txt
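
The file names suggest a 70/30 split between training and test data. A minimal sketch of such a split is shown below, assuming one labeled sentence per line; the function name `split_lines`, the fixed seed, and the sample sentences are hypothetical, and the actual logic of `label-split_training_test_v1.py` may differ.

```python
import random


def split_lines(lines, train_fraction=0.7, seed=42):
    """Shuffle lines reproducibly and split them into training and test lists."""
    shuffled = list(lines)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical input: ten placeholder sentences instead of real GEO metadata.
sentences = ["sentence %d" % i for i in range(10)]
train, test = split_lines(sentences)
print(len(train), len(test))
```

Fixing the seed means the same training and test files can be regenerated later, which helps when comparing models trained with different parameters.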


models


training-data-set-70.fStopWords_False.fSymbols_False.mod


reports

  Folder containing files that report the performance of the CRF when identifying GCs.


report_training-data-set-70.fStopWords_False.fSymbols_False.txt
y_pred_training-data-set-70.fStopWords_False.fSymbols_False.txt
y_test_training-data-set-70.fStopWords_False.fSymbols_False.txt


CoreNLP


bin


get-raw-sentences.sh

 Script that extracts the GCs from the files in "tagged-xml-data" and adds the phrase "PGCGROWTHCONDITIONS" to every line.

single_run.sh

 Script that runs the script "corenlp.sh" with the desired parameters.


input


raw-metadata-senteneces.txt

 Output of "get-raw-sentences.sh". Contains all the GCs.


output


raw-metadata-senteneces.txt.conll

 Contains every word of every GC, tagged with its "LEMMA" and "POS".
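
A file like this can be read back into per-sentence lists of (word, lemma, POS) tuples. The sketch below assumes tab-separated columns with the word, lemma, and POS tag in columns 1, 2, and 3 (0-based) and a blank line between sentences; the actual column layout depends on how "corenlp.sh" is configured, so the column indices are parameters. The function name `read_conll` and the sample lines are hypothetical.

```python
def read_conll(lines, word_col=1, lemma_col=2, pos_col=3):
    """Group CoNLL-style lines into sentences of (word, lemma, pos) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line marks the end of the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[word_col], fields[lemma_col], fields[pos_col]))
    if current:
        # Keep the last sentence when the input has no trailing blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; real input would come from the .conll file in "output".
sample = [
    "1\tcells\tcell\tNNS\n",
    "2\tgrown\tgrow\tVBN\n",
    "\n",
    "1\tglucose\tglucose\tNN\n",
]
sentences = read_conll(sample)
print(sentences)
```

To read the real file, pass an open file object instead of `sample`, since the function only needs an iterable of lines.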


data-sets


report-manually-tagged-gcs

  Contains the extracted GCs of all the samples for each series.

tagged-xml-data

  Contains the original XML-tagged files from which the GCs are extracted.