Merge branch 'master' of http://pakal.ccg.unam.mx/cmendezc/automatic-extraction-growth-conditions

Estefani Gaytan Nunez
Commit 738fe1d2cc40ec50a9beb03575de28bdf0c7842f 738fe1d2 2 parents 08674030 a46c79b7
Showing 1 changed file with 42 additions and 11 deletions
README.md
--- a/README.md
+++ b/README.md
 # Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO)
 Project to extract in an automatic way the growth conditions of all enterobacteria within the GEO using "Conditional Random Fields " (CRFs).
+## Research Group
+**Main researcher**  
+Méndez Cruz Carlos Francisco  
+**Members**  
+Gaytan Nuñez Estefani  
+Meza Landeros Kevin Emmanuel  
+Tierrafría Victor _(curator)_  
+
+## Main Purpose  
+The GEO database hosts thousands of high-throughput (HT) gene expression experiments. The documentation of each experiment includes the growth conditions (GCs) used in it. Unfortunately, they are not recorded in a structured way; they appear in text fragments associated with various fields (which we call metadata).  
+Since knowing the GCs of these experiments helps to better understand genetic regulation, it is important to extract them. However, doing so manually requires a lot of effort on large data sets.  
+  
+Our hypothesis is therefore that a predictive model can determine the GCs of the thousands of experiments stored in GEO. Our goal is to generate a report that curators will use to review and validate the GCs of the experiments. 
+
+
+
+## Methodology
+   1. __*GEO files download*__  
+        GEO files from all enterobacteria were downloaded to a server and organized into 4 directories (each containing many _GSE00000000_ folders):  
+        - Binding_exp  
+        - Binding_HT  
+        - Function_ex  
+        - Function_HT  
+        Each of the _GSE00000000_ folders contains a compressed file (GSE00000_family.soft.gz) that must be extracted.  
+        
+   2. __*Obtaining SOFT files and transforming them to an XML format*__  
+        A script goes through every _GSE00000000_ folder and decompresses the _"GSE00000_family.soft.gz"_ files to obtain _"GSE00000_family.soft"_ files.  
+        These files are all saved in another directory, keeping the structure of the 4 parent directories.  
+        Then another script transforms the SOFT files into XML files.
+
+   3. __*Tagging the GC within the XML files*__  
+
 ## Prerequisites
 ### Programming languages
    - Python (version 2.7, version 3.7)
@@ -9,19 +42,17 @@ Project to extract in an automatic way the growth conditions of all enterobacter
 ## Folder content
 **CRF**
    - bin
-      1. label-split_training_test_v1.py
+      1. label-split_training_test.py
-      2. params.py
+      3. training_validation.py
-      3. training_validation_v3.py
    - data-sets
-      1. test-data-set-30.txt
+      1. Tags.txt
-      2. training-data-set-70.txt
+      2. test-data-set-30.txt
+      3. training-data-set-70.txt
    - models
-      1.  training-data-set-70.fStopWords_False.fSymbols_False.mod
+      1.  model_S1_False_S2_False_v1.mod
    - reports  
       _Folder that encloses files with **information of the performance of the CRF while identifying GCs.**_
-      1. report_training-data-set-70.fStopWords_False.fSymbols_False.txt
+
-      2.  y_pred_training-data-set-70.fStopWords_False.fSymbols_False.txt
-      3.  y_test_training-data-set-70.fStopWords_False.fSymbols_False.txt
 **CoreNLP**
    - bin
@@ -38,6 +69,6 @@ Project to extract in an automatic way the growth conditions of all enterobacter
 **data-sets**
    - report-manually-tagged-gcs  
-      _Contains the extracted GCs of all the samples for each serie._
+      _Folder with the extracted GCs of all the samples for each series._
    - tagged-xml-data  
-      _Contains the **original xml-tagged files** where the GCs will be extracted._
\ No newline at end of file
+      _Folder that contains the **original xml-tagged files** from which the GCs will be extracted._
\ No newline at end of file
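
Note on Methodology step 2 in the README added above: the decompression pass could be sketched roughly as follows. This is only an illustration, not the script used in the project. The function name `extract_soft_files`, the root paths, and the choice to write each `.soft` file directly under its parent directory are assumptions; only the four directory names and the `*_family.soft.gz` file pattern come from the README.

```python
import gzip
import shutil
from pathlib import Path

# Directory names as listed in the README; everything else is hypothetical.
PARENT_DIRS = ("Binding_exp", "Binding_HT", "Function_ex", "Function_HT")


def extract_soft_files(download_root, output_root):
    """Decompress every *_family.soft.gz under download_root into output_root.

    Assumed layout: <download_root>/<parent>/<GSE folder>/<GSE>_family.soft.gz
    Each file is written to <output_root>/<parent>/<GSE>_family.soft, so the
    four parent directories are preserved. Returns the paths written.
    """
    download_root = Path(download_root)
    output_root = Path(output_root)
    written = []
    for parent in PARENT_DIRS:
        parent_dir = download_root / parent
        if not parent_dir.is_dir():
            continue  # skip parent directories that were not downloaded
        for gz_path in sorted(parent_dir.glob("*/*_family.soft.gz")):
            target_dir = output_root / parent
            target_dir.mkdir(parents=True, exist_ok=True)
            # Drop the trailing ".gz" to get the .soft file name.
            target = target_dir / gz_path.name[: -len(".gz")]
            with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

A call such as `extract_soft_files("geo-downloads", "soft-files")` (both paths hypothetical) would leave the downloaded archives untouched and collect the decompressed SOFT files in a separate tree, ready for the SOFT-to-XML step.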