Merge branch 'master' of http://pakal.ccg.unam.mx/cmendezc/automatic-extraction-growth-conditions

Estefani Gaytan Nunez
Commit 3e95a34263f0599facf58486b3a69a672494a468 (2 parents: 5cea8340, e2b4be09)
Showing 1 changed file with 19 additions and 10 deletions
README.md
--- a/README.md
+++ b/README.md
-# "Automatic Extraction of Growth Conditions (GC) from the Gene Expression Omnibus (GEO)".  
+# Automatic Extraction of Growth Conditions (GCs) from the Gene Expression Omnibus (GEO)
-Project to extract in an automatic way, the growth conditions of all enterobacteria within the GEO using "Conditional Random Fields " (CRFs).
+Project to extract in an automatic way the growth conditions of all enterobacteria within the GEO using "Conditional Random Fields " (CRFs).
 ## Prerequisites
 ### Programming languages
-   - Python (version 2.7, version 3)
+   - Python (version 2.7, version 3.7)
    - Bash
 ## Folder content
@@ -12,23 +12,32 @@ Project to extract in an automatic way, the growth conditions of all enterobacte
       1. label-split_training_test_v1.py
       2. params.py
       3. training_validation_v3.py
-   - check
-      1. sentences-405-order-rep.txt
    - data-sets
       1. test-data-set-30.txt
       2. training-data-set-70.txt
    - models
       1.  training-data-set-70.fStopWords_False.fSymbols_False.mod
-   - reports
+   - reports  
+      _Folder that encloses files with **information of the performance of the CRF while identifying GCs.**_
       1. report_training-data-set-70.fStopWords_False.fSymbols_False.txt
       2.  y_pred_training-data-set-70.fStopWords_False.fSymbols_False.txt
       3.  y_test_training-data-set-70.fStopWords_False.fSymbols_False.txt
 **CoreNLP**
    - bin
-      1.  get-raw-sentences.sh
+      1.  get-raw-sentences.sh   
-      2.  single_run.sh
+         _Script that **extracts the GCs** from the file: "tagged-xml-data" and adds the phrase: "PGCGROWTHCONDITIONS" to all lines._
+      2.  single_run.sh   
+         _Script that **runs** the script: "corenlp.sh" with the desired parameters._
    - input
-      1. raw-metadata-senteneces.txt
+      1. raw-metadata-senteneces.txt  
+         _Resulting file from "get-raw-sentences.sh". **Contains all the GCs.**_ 
    - output
-      1. raw-metadata-senteneces.txt.conll
\ No newline at end of file
+      1. raw-metadata-senteneces.txt.conll  
+         _This file contains **all the words of all the GCs** tagged with their **"LEMMA" & "POS"**_
+
+**data-sets**
+   - report-manually-tagged-gcs  
+      _Contains the extracted GCs of all the samples for each series._
+   - tagged-xml-data  
+      _Contains the **original xml-tagged files** from which the GCs will be extracted._
\ No newline at end of file
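
The README describes the CoreNLP output file as containing every word of every GC tagged with its "LEMMA" and "POS". As a minimal sketch of how such a file could be read, the snippet below assumes the tab-separated column order that CoreNLP's CoNLL output commonly uses (index, word, lemma, POS, ...) and blank lines between sentences. The function name `read_conll`, the column defaults, and the sample rows are illustrative assumptions, not part of this repository.

```python
import io


def read_conll(stream, word_col=1, lemma_col=2, pos_col=3):
    """Yield each sentence as a list of (word, lemma, pos) tuples.

    Assumes tab-separated columns and a blank line between sentences.
    The column positions are assumptions; adjust them to the real file.
    """
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append((cols[word_col], cols[lemma_col], cols[pos_col]))
    # Emit the last sentence if the file does not end with a blank line.
    if sentence:
        yield sentence


# Made-up sample rows, standing in for raw-metadata-senteneces.txt.conll.
sample = "1\tcells\tcell\tNNS\n2\tgrown\tgrow\tVBN\n\n1\tLB\tLB\tNN\n"
sentences = list(read_conll(io.StringIO(sample)))
```

To read the real output, the `io.StringIO` object would be replaced by an open file handle on the `.conll` file; if the file was produced with a different column layout, the three column arguments need to be changed accordingly.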