Merge branch 'master' of http://pakal.ccg.unam.mx/cmendezc/automatic-extraction-growth-conditions
Showing 1 changed file with 42 additions and 11 deletions
1 | # Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO) | 1 | # Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO) |
2 | Project to extract in an automatic way the growth conditions of all enterobacteria within the GEO using "Conditional Random Fields " (CRFs). | 2 | Project to extract in an automatic way the growth conditions of all enterobacteria within the GEO using "Conditional Random Fields " (CRFs). |
3 | 3 | ||
4 | +## Research Group | |
5 | +**Main researcher** | ||
6 | +Méndez Cruz Carlos Francisco | ||
7 | +**Members** | ||
8 | +Gaytan Nuñez Estefani | ||
9 | +Meza Landeros Kevin Emmanuel | ||
10 | +Tierrafría Victor _(curator)_ | ||
11 | + | ||
12 | +## Main Purpose | ||
13 | +The GEO database hosts thousands of high-throughput (HT) gene expression experiments. The documentation of each experiment includes the growth conditions (GCs) used in it. Unfortunately, these are not recorded in a structured way; they appear in text fragments spread across various fields (which we call metadata). | |
14 | +Since knowing the GCs of these experiments helps to better understand genetic regulation, extracting them is important. However, doing so manually requires a lot of | |
15 | +effort on large data sets. | |
16 | + | ||
17 | +Our hypothesis is therefore that a predictive model can determine the GCs of thousands of experiments stored in the GEO. Our goal is to generate a report that curators will use to review and validate the GCs of the experiments. | |
18 | + | ||
19 | + | ||
20 | + | ||
21 | +## Methodology | |
22 | + 1. __*GEO files download*__ | ||
23 | + GEO files for all enterobacteria were downloaded to a server and organized into 4 directories (each containing many _GSE00000000_ folders): | |
24 | + - Binding_exp | ||
25 | + - Binding_HT | ||
26 | + - Function_ex | ||
27 | + - Function_HT | ||
28 | + Each of the _GSE00000000_ folders contains a compressed file (GSE00000_family.soft.gz) that must be extracted. | |
29 | + | ||
30 | + 2. __*Obtaining SOFT files and transforming them to an XML format*__ | |
31 | + A script goes through every _GSE00000000_ folder and unzips the _"GSE00000_family.soft.gz"_ files to obtain _"GSE00000_family.soft"_ files. | |
32 | + These are all saved in another directory, keeping the structure of the 4 parent directories. | |
33 | + Then another script transforms SOFT files into XML files. | ||
34 | + | ||
35 | + 3. __*Tagging the GC within the XML files*__ | ||
36 | + | ||
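The decompression described in step 2 could be sketched roughly as follows. This is not the project's own script: the function name `extract_soft_files`, the `CATEGORIES` tuple, and the choice to place every extracted file directly under its category directory are assumptions made for illustration, based only on the directory names listed in step 1.

```python
import gzip
import shutil
from pathlib import Path

# The four top-level directories named in step 1.
CATEGORIES = ("Binding_exp", "Binding_HT", "Function_ex", "Function_HT")


def extract_soft_files(source_root, target_root):
    """Decompress every *_family.soft.gz found under source_root.

    Output files are written to target_root/<category>/, which mirrors the
    four parent directories (assumed layout; the GSE subfolders are dropped).
    Returns the list of extracted .soft paths.
    """
    source_root = Path(source_root)
    target_root = Path(target_root)
    extracted = []
    for category in CATEGORIES:
        category_dir = source_root / category
        if not category_dir.is_dir():
            continue  # skip categories that were not downloaded
        for archive in sorted(category_dir.glob("GSE*/*_family.soft.gz")):
            out_dir = target_root / category
            out_dir.mkdir(parents=True, exist_ok=True)
            # Strip the trailing ".gz" to get the .soft file name.
            out_path = out_dir / archive.name[: -len(".gz")]
            with gzip.open(archive, "rb") as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(out_path)
    return extracted
```

Calling `extract_soft_files("geo-download", "soft-files")` (both paths hypothetical) would leave the downloaded archives untouched and write the `.soft` files to a separate tree, ready for the SOFT-to-XML step.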
4 | ## Prerequisites | 37 | ## Prerequisites |
5 | ### Programming languages | 38 | ### Programming languages |
6 | - Python (version 2.7, version 3.7) | 39 | - Python (version 2.7, version 3.7) |
... | @@ -9,19 +42,17 @@ Project to extract in an automatic way the growth conditions of all enterobacter | ... | @@ -9,19 +42,17 @@ Project to extract in an automatic way the growth conditions of all enterobacter |
9 | ## Folder content | 42 | ## Folder content |
10 | **CRF** | 43 | **CRF** |
11 | - bin | 44 | - bin |
12 | - 1. label-split_training_test_v1.py | 45 | + 1. label-split_training_test.py |
13 | - 2. params.py | 46 | + 3. training_validation.py |
14 | - 3. training_validation_v3.py | ||
15 | - data-sets | 47 | - data-sets |
16 | - 1. test-data-set-30.txt | 48 | + 1. Tags.txt |
17 | - 2. training-data-set-70.txt | 49 | + 2. test-data-set-30.txt |
50 | + 3. training-data-set-70.txt | ||
18 | - models | 51 | - models |
19 | - 1. training-data-set-70.fStopWords_False.fSymbols_False.mod | 52 | + 1. model_S1_False_S2_False_v1.mod |
20 | - reports | 53 | - reports |
21 | _Folder that encloses files with **information of the performance of the CRF while identifying GCs.**_ | 54 | _Folder that encloses files with **information of the performance of the CRF while identifying GCs.**_ |
22 | - 1. report_training-data-set-70.fStopWords_False.fSymbols_False.txt | 55 | + |
23 | - 2. y_pred_training-data-set-70.fStopWords_False.fSymbols_False.txt | ||
24 | - 3. y_test_training-data-set-70.fStopWords_False.fSymbols_False.txt | ||
25 | 56 | ||
26 | **CoreNLP** | 57 | **CoreNLP** |
27 | - bin | 58 | - bin |
... | @@ -38,6 +69,6 @@ Project to extract in an automatic way the growth conditions of all enterobacter | ... | @@ -38,6 +69,6 @@ Project to extract in an automatic way the growth conditions of all enterobacter |
38 | 69 | ||
39 | **data-sets** | 70 | **data-sets** |
40 | - report-manually-tagged-gcs | 71 | - report-manually-tagged-gcs |
41 | - _Contains the extracted GCs of all the samples for each serie._ | 72 | + _Folder with the extracted GCs of all the samples for each series._ |
42 | - tagged-xml-data | 73 | - tagged-xml-data |
43 | - _Contains the **original xml-tagged files** where the GCs will be extracted._ | ||
... | \ No newline at end of file | ... | \ No newline at end of file |
74 | + _Folder that contains the **original xml-tagged files** from which the GCs will be extracted._ | |
... | \ No newline at end of file | ... | \ No newline at end of file | ... | ... |