Automatic extraction of growth conditions

2088c9b2 Extraction of GCs from literature. · by cmendezc

Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO)

A project to automatically extract the growth conditions of all enterobacteria in GEO using Conditional Random Fields (CRFs).

Research Group

Main researcher
Méndez Cruz Carlos Francisco
Members
Gaytan Nuñez Estefani
Meza Landeros Kevin Emmanuel
Tierrafría Victor (curator)

Main Purpose

The GEO database hosts thousands of high-throughput (HT) gene expression experiments. The documentation of each experiment in the database includes the growth conditions (GCs) used in it. Unfortunately, they are not recorded in a structured way; instead, they appear in text fragments associated with various fields (which we call metadata).
Since knowing the GCs of these experiments helps to better understand genetic regulation, extracting these conditions is important. However, doing it manually requires a lot of effort on large data sets.

That is why our hypothesis is that a predictive model can determine the GCs of the thousands of experiments stored in GEO. Our goal is to generate a report that curators will use to review and validate the GCs of the experiments.

Methodology

  1. GEO files download
    GEO files for all enterobacteria were downloaded to a server and organized into 4 directories (each containing many GSE00000000 folders):
    - Binding_exp
    - Binding_HT
    - Function_ex
    - Function_HT
    Each of the GSE00000000 folders contains a compressed file (GSE00000_family.soft.gz) that must be extracted.

  2. Obtaining SOFT files and transforming them to XML format
    A script goes through every GSE00000000 folder and unzips the "GSE00000_family.soft.gz" files to obtain "GSE00000_family.soft" files.
    These are all saved in another directory, keeping the structure of the 4 parent directories.
    Then another script transforms SOFT files into XML files.

  3. Tagging the GC within the XML files
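
Step 2 above could be sketched roughly as follows. This is a minimal illustration in Python 3, not the project's actual scripts: the function names `extract_soft_files` and `soft_to_xml`, the XML element names (`soft`, `entity`, `field`), and the assumed layout `<parent directory>/<GSE folder>/<file>` are all hypothetical. It relies only on the general shape of SOFT files, where entity lines start with `^` and attribute lines start with `!`.

```python
import gzip
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_soft_files(src_root, dst_root):
    """Decompress every *_family.soft.gz under src_root into dst_root.

    Assumes the layout src_root/<parent dir>/<GSE folder>/<file>. Only the
    parent directory name (e.g. Binding_HT) is kept in the destination.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    for gz_path in sorted(src_root.rglob("*_family.soft.gz")):
        parent_dir = gz_path.relative_to(src_root).parts[0]
        out_path = dst_root / parent_dir / gz_path.name[: -len(".gz")]
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(gz_path, "rb") as fin, open(out_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(out_path)
    return written


def soft_to_xml(soft_path, xml_path):
    """Write the entity (^) and attribute (!) lines of a SOFT file as XML.

    The element and attribute names used here are illustrative only.
    """
    root = ET.Element("soft")
    current = root
    with open(str(soft_path), encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if " = " not in line:
                continue  # e.g. table markers and rows without "key = value"
            key, value = line.split(" = ", 1)
            if key.startswith("^"):
                # A new entity (platform, sample, series, ...) begins here.
                current = ET.SubElement(root, "entity", type=key[1:], id=value)
            elif key.startswith("!"):
                # Attribute lines are attached to the most recent entity.
                field = ET.SubElement(current, "field", name=key[1:])
                field.text = value
    ET.ElementTree(root).write(str(xml_path), encoding="utf-8",
                               xml_declaration=True)
```

A call such as `extract_soft_files("geo", "soft")` followed by `soft_to_xml` on each returned path would reproduce the two scripts in spirit; the directory names in that call are placeholders.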

Prerequisites

Programming languages

  • Python (version 2.7, version 3.7)
  • Bash

Folder content

CRF

  • bin
    1. label-split_training_test.py
    2. training_validation.py
  • data-sets
    1. Tags.txt
    2. test-data-set-30.txt
    3. training-data-set-70.txt
  • models
    1. model_S1_False_S2_False_v1.mod
  • reports
    Folder containing files that report the performance of the CRF in identifying GCs.
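
The data set names above suggest a 70/30 split between training and test data. How "label-split_training_test.py" performs that split is not described here, so the following is only a minimal sketch under the assumption that sentences are shuffled and then cut at 70%. The function name `split_training_test`, the `train_fraction` parameter, and the fixed seed are hypothetical.

```python
import random


def split_training_test(sentences, train_fraction=0.7, seed=42):
    """Shuffle sentences reproducibly and split them into two lists.

    Returns (training, test). The seed keeps the split stable between runs.
    """
    shuffled = list(sentences)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed is a design choice that makes a reported performance figure traceable to one specific split.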

CoreNLP

  • bin
    1. get-raw-sentences.sh
      Script that extracts the GCs from "tagged-xml-data" and adds the phrase "PGCGROWTHCONDITIONS" to every line.
    2. single_run.sh
      Script that runs the "corenlp.sh" script with the desired parameters.
  • input
    1. raw-metadata-senteneces.txt
      Resulting file from "get-raw-sentences.sh". Contains all the GCs.
  • output
    1. raw-metadata-senteneces.txt.conll
      This file contains every word of all the GCs, each tagged with its "LEMMA" and "POS".
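
As an illustration of how the ".conll" output could be consumed downstream, here is a minimal Python 3 sketch that reads a tab-separated CoNLL file into sentences of (word, lemma, POS) tuples. The function name `read_conll` is hypothetical, and the default column positions (word, lemma, and POS in the second, third, and fourth columns, with sentences separated by blank lines) are assumptions that should be checked against the real file.

```python
def read_conll(path, word_col=1, lemma_col=2, pos_col=3):
    """Return a list of sentences, each a list of (word, lemma, pos) tuples.

    Column indices are zero-based and assumed, not taken from the project.
    """
    sentences, current = [], []
    with open(str(path), encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line closes the current sentence.
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split("\t")
            current.append((cols[word_col], cols[lemma_col], cols[pos_col]))
    if current:  # last sentence when the file does not end with a blank line
        sentences.append(current)
    return sentences
```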

data-sets

  • report-manually-tagged-gcs
    Folder with the extracted GCs of all the samples for each series.
  • tagged-xml-data
    Folder that contains the original XML-tagged files from which the GCs will be extracted.