Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO).

This project automatically extracts the growth conditions of all enterobacteria within the GEO using Conditional Random Fields (CRFs).

Prerequisites

Programming languages

  • Python (version 2.7, version 3.7)
  • Bash

Folder content

CRF

  • bin
    1. label-split_training_test_v1.py
    2. params.py
    3. training_validation_v3.py
  • data-sets
    1. test-data-set-30.txt
    2. training-data-set-70.txt
  • models
    1. training-data-set-70.fStopWords_False.fSymbols_False.mod
  • reports
    Folder containing files that report the performance of the CRF when identifying GCs.
    1. report_training-data-set-70.fStopWords_False.fSymbols_False.txt
    2. y_pred_training-data-set-70.fStopWords_False.fSymbols_False.txt
    3. y_test_training-data-set-70.fStopWords_False.fSymbols_False.txt
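
The file names "training-data-set-70.txt" and "test-data-set-30.txt" suggest a 70/30 split between training and test data. The sketch below shows one way such a split could be made. It is not the code of "label-split_training_test_v1.py": the function name, the fixed seed, and the placeholder sentences are assumptions made for illustration only.

```python
from __future__ import print_function  # keeps print() identical on Python 2.7 and 3.7

import random


def split_sentences(sentences, train_fraction=0.7, seed=42):
    """Shuffle the sentences and split them into training and test lists.

    Hypothetical helper: the 0.7 fraction mirrors the "70"/"30" file names,
    the seed only makes the shuffle reproducible.
    """
    shuffled = list(sentences)                  # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)       # deterministic shuffle
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder sentences standing in for the labelled GC sentences.
sentences = ["sentence %d" % i for i in range(10)]
train, test = split_sentences(sentences)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that depends on the order of the sentences in the input file.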

CoreNLP

  • bin
    1. get-raw-sentences.sh
      Script that extracts the GCs from "tagged-xml-data" and adds the phrase "PGCGROWTHCONDITIONS" to every line.
    2. single_run.sh
      Script that runs the script "corenlp.sh" with the desired parameters.
  • input
    1. raw-metadata-senteneces.txt
      Resulting file from "get-raw-sentences.sh". Contains all the GCs.
  • output
    1. raw-metadata-senteneces.txt.conll
      Contains every word of every GC, each tagged with its "LEMMA" and "POS".
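
A ".conll" file like the one above can be read back into Python with a few lines. The sketch below assumes a tab-separated layout with one token per line, a blank line between sentences, and the word, lemma, and POS tag in the second, third, and fourth columns. The actual columns depend on how CoreNLP was configured, so the column indices are parameters, and the sample lines are invented for illustration.

```python
from __future__ import print_function  # keeps print() identical on Python 2.7 and 3.7


def read_conll(lines, word_col=1, lemma_col=2, pos_col=3):
    """Group CoNLL-style lines into sentences of (word, lemma, pos) tuples.

    Hypothetical helper: the default column indices are an assumption and
    should be adjusted to match the real output file.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():          # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append((cols[word_col], cols[lemma_col], cols[pos_col]))
    if current:                       # last sentence when the file has no trailing blank line
        sentences.append(current)
    return sentences


# Invented sample, not taken from "raw-metadata-senteneces.txt.conll".
sample = "1\tcells\tcell\tNNS\n2\tgrown\tgrow\tVBN\n\n1\tglucose\tglucose\tNN\n"
parsed = read_conll(sample.splitlines())
print(parsed)
```

To read the real file, pass an open file object instead of `sample.splitlines()`.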

data-sets

  • report-manually-tagged-gcs
    Contains the extracted GCs of all the samples for each series.
  • tagged-xml-data
    Contains the original XML-tagged files from which the GCs are extracted.