Estefani Gaytan Nunez
Showing 1 changed file with 42 additions and 11 deletions
1 # Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO) 1 # Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO)
2 Project to extract in an automatic way the growth conditions of all enterobacteria within the GEO using "Conditional Random Fields " (CRFs). 2 Project to extract in an automatic way the growth conditions of all enterobacteria within the GEO using "Conditional Random Fields " (CRFs).
3 3
4 +## Research Group
5 +**Main researcher**
6 +Méndez Cruz Carlos Francisco
7 +**Members**
8 +Gaytan Nuñez Estefani
9 +Meza Landeros Kevin Emmanuel
10 +Tierrafría Victor _(curator)_
11 +
12 +## Main Purpose
13 +The GEO database hosts thousands of high-throughput (HT) gene expression experiments. The documentation for each experiment in the database includes the growth conditions (GCs) used in it. Unfortunately, they are not recorded in a structured way; instead, they appear in text fragments associated with various fields (which we call metadata).
14 +Since knowing the GCs of these experiments helps to better understand genetic regulation, it becomes important to extract these conditions. However, doing it manually requires a lot of
15 +effort on large data sets.
16 +
17 +Our hypothesis is therefore that a predictive model can determine the GCs of thousands of experiments stored in GEO. Our goal is to generate a report that curators will use to review and validate the GCs of the experiments.
18 +
19 +
20 +
21 +## Methodology
22 + 1. __*GEO files download*__
23 + GEO files for all enterobacteria were downloaded to a server and organized into 4 directories (each containing many _GSE00000000_ folders):
24 + - Binding_exp
25 + - Binding_HT
26 + - Function_ex
27 + - Function_HT
28 + Each of the _GSE00000000_ folders contains a compressed file (GSE00000_family.soft.gz) that must be extracted.
29 +
30 + 2. __*Obtaining SOFT files and transforming them to XML format*__
31 + A script goes through every _GSE00000000_ folder and unzips the _"GSE00000_family.soft.gz"_ files to obtain _"GSE00000_family.soft"_ files.
32 + These are all saved in another directory, preserving the structure of the 4 parent directories.
33 + Then another script transforms SOFT files into XML files.
34 +
35 + 3. __*Tagging the GC within the XML files*__
36 +
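Step 2 above can be sketched roughly as follows. This is a minimal illustration and not the project's actual scripts: the function names `decompress_soft_files` and `soft_to_xml`, the choice to mirror the full relative path (parent directory plus _GSE_ folder), and the XML layout (`<soft>`, `<entity>`, `<field>`) are all assumptions made for the example. It targets Python 3 and relies only on the standard library and on the SOFT convention that `^` lines open an entity and `!` lines carry its attributes.

```python
import gzip
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def decompress_soft_files(src_root, dst_root):
    """Unzip every *_family.soft.gz under src_root into dst_root,
    mirroring the relative folder layout (assumed, e.g. Binding_HT/GSE.../)."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    for gz_path in sorted(src_root.rglob("*_family.soft.gz")):
        # Same relative path as the source, minus the trailing ".gz".
        out_path = dst_root / gz_path.relative_to(src_root).with_suffix("")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(str(gz_path), "rb") as fin, open(str(out_path), "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(out_path)
    return written


def soft_to_xml(soft_path, xml_path):
    """Convert one SOFT file into a simple XML tree (hypothetical schema)."""
    root = ET.Element("soft")
    current = root
    with open(str(soft_path), encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line[0] not in "^!":
                continue  # skip blank lines, '#' descriptions and table rows
            key, _, value = line[1:].partition("=")
            key, value = key.strip(), value.strip()
            if line[0] == "^":
                # '^' lines open a new entity such as SERIES or SAMPLE.
                current = ET.SubElement(root, "entity", type=key, id=value)
            else:
                # '!' lines are attributes of the current entity.
                field = ET.SubElement(current, "field", name=key)
                field.text = value
    ET.ElementTree(root).write(str(xml_path), encoding="utf-8", xml_declaration=True)
```

A possible usage, with placeholder directory names, would be `decompress_soft_files("geo_download", "soft_files")` followed by a call to `soft_to_xml` for each returned path. Keeping the metadata fields as separate `<field>` elements is one way to make the text fragments that hold the GCs easy to locate in the later tagging step.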
4 ## Prerequisites 37 ## Prerequisites
5 ### Programming languages 38 ### Programming languages
6 - Python (version 2.7, version 3.7) 39 - Python (version 2.7, version 3.7)
...@@ -9,19 +42,17 @@ Project to extract in an automatic way the growth conditions of all enterobacter ...@@ -9,19 +42,17 @@ Project to extract in an automatic way the growth conditions of all enterobacter
9 ## Folder content 42 ## Folder content
10 **CRF** 43 **CRF**
11 - bin 44 - bin
12 - 1. label-split_training_test_v1.py 45 + 1. label-split_training_test.py
13 - 2. params.py 46 + 3. training_validation.py
14 - 3. training_validation_v3.py
15 - data-sets 47 - data-sets
16 - 1. test-data-set-30.txt 48 + 1. Tags.txt
17 - 2. training-data-set-70.txt 49 + 2. test-data-set-30.txt
50 + 3. training-data-set-70.txt
18 - models 51 - models
19 - 1. training-data-set-70.fStopWords_False.fSymbols_False.mod 52 + 1. model_S1_False_S2_False_v1.mod
20 - reports 53 - reports
21 _Folder that encloses files with **information of the performance of the CRF while identifying GCs.**_ 54 _Folder that encloses files with **information of the performance of the CRF while identifying GCs.**_
22 - 1. report_training-data-set-70.fStopWords_False.fSymbols_False.txt 55 +
23 - 2. y_pred_training-data-set-70.fStopWords_False.fSymbols_False.txt
24 - 3. y_test_training-data-set-70.fStopWords_False.fSymbols_False.txt
25 56
26 **CoreNLP** 57 **CoreNLP**
27 - bin 58 - bin
...@@ -38,6 +69,6 @@ Project to extract in an automatic way the growth conditions of all enterobacter ...@@ -38,6 +69,6 @@ Project to extract in an automatic way the growth conditions of all enterobacter
38 69
39 **data-sets** 70 **data-sets**
40 - report-manually-tagged-gcs 71 - report-manually-tagged-gcs
41 - _Contains the extracted GCs of all the samples for each series._ 72 + _Folder with the extracted GCs of all the samples for each series._
42 - tagged-xml-data 73 - tagged-xml-data
43 - _Contains the **original xml-tagged files** where the GCs will be extracted._
...\ No newline at end of file ...\ No newline at end of file
74 + _Folder that contains the **original xml-tagged files** where the GCs will be extracted._
...\ No newline at end of file ...\ No newline at end of file