# Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO)
This project automatically extracts the growth conditions of all enterobacteria in the GEO using Conditional Random Fields (CRFs).
## Research Group
**Main researcher**
Méndez Cruz Carlos Francisco
**Members**
Gaytan Nuñez Estefani
Meza Landeros Kevin Emmanuel
Tierrafría Victor _(curator)_
## Main Purpose
The GEO database is home to thousands of High-Throughput (HT) gene expression experiments. The documentation of each experiment in the database includes the Growth Conditions (GCs) used in it. Unfortunately, they are not recorded in a structured way; instead, they appear as text fragments spread across various fields (which we call metadata).
Since knowing the GCs of these experiments helps to better understand genetic regulation, extracting these conditions is important. However, doing so manually requires a lot of effort on large data sets.
Our hypothesis is therefore that a predictive model can determine the GCs of the thousands of experiments stored in the GEO. Our goal is to generate a report that curators will use to review and validate the GCs of the experiments.
## Methodology
1. __*GEO files download*__
GEO files for all enterobacteria were downloaded to a server and organized into 4 directories (each containing many _GSE00000000_ folders):
- Binding_exp
- Binding_HT
- Function_ex
- Function_HT
Each of the _GSE00000000_ folders contains a compressed file (GSE00000_family.soft.gz) that must be extracted.
2. __*Obtaining SOFT files and transforming them to XML format*__
A script goes through every _GSE00000000_ folder and decompresses the _"GSE00000_family.soft.gz"_ files to obtain _"GSE00000_family.soft"_ files.
The resulting SOFT files are all saved in another directory, keeping the structure of the 4 parent directories.
Then another script transforms the SOFT files into XML files.
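
The two scripts are not reproduced here, but the step above can be illustrated with a minimal Python 3 sketch. It is not the project's actual code: the function names (`decompress_soft_files`, `soft_to_xml`), the assumed layout (`<root>/<parent directory>/<GSE folder>/<file>`), and the XML element names (`soft`, `entity`, `field`) are all hypothetical. The only thing taken from the SOFT format itself is that entity lines start with `^` and attribute lines start with `!`, both in `key = value` form.

```python
import gzip
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def decompress_soft_files(source_root, target_root):
    """Decompress every *_family.soft.gz found under source_root.

    Assumes the layout source_root/<parent dir>/<GSE folder>/<file>, and
    writes each .soft file to target_root/<parent dir>/ (hypothetical layout).
    """
    source_root, target_root = Path(source_root), Path(target_root)
    written = []
    for gz_path in sorted(source_root.rglob("*_family.soft.gz")):
        # First path component below source_root, e.g. "Binding_HT".
        parent_dir = gz_path.relative_to(source_root).parts[0]
        out_path = target_root / parent_dir / gz_path.name[:-len(".gz")]
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written


def soft_to_xml(soft_path, xml_path):
    """Convert SOFT entity (^) and attribute (!) lines into a simple XML tree.

    The element and attribute names are illustrative, not the project's schema.
    """
    root = ET.Element("soft")
    current = root
    with open(str(soft_path), encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # Skip data-table rows and marker lines that carry no value.
            if line[:1] not in ("^", "!") or " = " not in line:
                continue
            key, value = line[1:].split(" = ", 1)
            if line[0] == "^":
                # An entity line such as "^SAMPLE = ..." opens a new block.
                current = ET.SubElement(root, "entity", type=key, id=value)
            else:
                # An attribute line is stored under the current block.
                field = ET.SubElement(current, "field", name=key)
                field.text = value
    ET.ElementTree(root).write(str(xml_path), encoding="utf-8",
                               xml_declaration=True)
```

Under these assumptions, calling `decompress_soft_files` on the download directory returns the paths of the extracted SOFT files, and each of those paths can then be passed to `soft_to_xml` together with an output path.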
3. __*Tagging the GC within the XML files*__
## Prerequisites
### Programming languages
- Python (versions 2.7 and 3.7)
...
...