srr_htregulondb_mapping_report.out 3.45 KB
/usr/local/lib/python3.6/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
/home/egaytan/automatic-extraction-growth-conditions/mapping_MCO/bin/mapping2MCO_v5.py:312: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  str_matches_odf["SOURCE"] = mco_ifile


-------------------------------- PARAMETERS --------------------------------

--inputPath      Path of NLP-tagged file: /home/egaytan/automatic-extraction-growth-conditions/mapping_MCO/input/
--iAnnotatedFile Input NLP-tagged file: srr_htregulondb_model_Run3_v10_S1_False_S2_True_S3_False_S4_False_Run3_v10.tsv
--iOntoFile      Input file with the ontology entities (MCO-terms): gc_ontology_terms_v2.txt
--iLinksFile     Input file with links and id for the ontology (MCO-type-links): None
--iSynFile       Input file for the additional ontology of synonyms (MCO-syn-json): mco_terms_v0.2.json
--outputPath     Output path to place output files: /home/egaytan/automatic-extraction-growth-conditions/mapping_MCO/output/v2/
--outputFile     Output of the mapping process: srr_htregulondb.tsv
--minPerMatch    Minimum string-matching percentage: 80
--minCRFProbs    Minimum CRF probability allowed: 0.9





-------------------------------- INPUTS --------------------------------


NLP-tagged file

          SRR                        ...                                                                  REPO_FILE
0  SRR5742248                        ...                          http://pakal.ccg.unam.mx/cmendezc/automatic-ex...
5  SRR5742250                        ...                          http://pakal.ccg.unam.mx/cmendezc/automatic-ex...
7  SRR5742250                        ...                          http://pakal.ccg.unam.mx/cmendezc/automatic-ex...

[3 rows x 15 columns]

ontology entities

        TERM_ID                         TERM_NAME
0  MCO000000014  generically dependent continuant
1  MCO000000015                         radiation
2  MCO000000016         electromagnetic radiation

additional ontology of synonyms (MCO-syn-json)

                   ENTITY_NAME       TERM_ID       TERM_NAME
MCO000000019        continuant  MCO000000019                
MCO000002475    culture medium  MCO000002475                
MCO000002467_0        Organism  MCO000002467  biologicentity


-------------------------------- RESULTS --------------------------------


Tracking exact terms to MCO...

Mapping 4099 terms to MCO based on exact strings...

Mapping 3770 terms to MCO - synonyms based on exact strings...

Total of terms mapped by exact strings: 387
Saving filtered terms from raw mapping...


3712 unmapped terms based on exact strings
Dropping duplicated unmapped term names...
206 unmapped unique terms based on exact strings

Computing string similarity...

Mapping 206 terms to MCO based on string similarity...

Mapping 152 terms to MCO - synonyms based on string similarity...

Unique terms mapped by string similarity: 73
Total of terms mapped by string similarity: 1992
Saving filtered terms from str mapping...


--------------------END----------------------
Total of terms mapped: 2379

Total of terms unmapped: 1720
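
Note: the totals above can be cross-checked from the per-stage counts. The following is a minimal sketch, not part of mapping2MCO_v5.py; the dictionary keys and the derive() helper are hypothetical names introduced here for illustration. The split of the 387 exact matches between MCO terms and MCO synonyms is an inference, assuming the synonym stage only receives the terms left unmapped by the MCO stage.

```python
# Counts copied verbatim from the RESULTS section of this report.
counts = {
    "total_terms": 4099,        # "Mapping 4099 terms to MCO based on exact strings"
    "to_synonyms_exact": 3770,  # "Mapping 3770 terms to MCO - synonyms ..."
    "mapped_exact": 387,        # "Total of terms mapped by exact strings"
    "unmapped_exact": 3712,     # "3712 unmapped terms based on exact strings"
    "mapped_similarity": 1992,  # "Total of terms mapped by string similarity"
    "mapped_total": 2379,       # "Total of terms mapped"
    "unmapped_total": 1720,     # "Total of terms unmapped"
}


def derive(c):
    """Recompute the reported totals from the per-stage counts."""
    return {
        "unmapped_exact": c["total_terms"] - c["mapped_exact"],
        "mapped_total": c["mapped_exact"] + c["mapped_similarity"],
        "unmapped_total": c["total_terms"] - c["mapped_exact"] - c["mapped_similarity"],
        # Assumption: the synonym stage only sees terms the MCO stage left unmapped.
        "mapped_exact_mco": c["total_terms"] - c["to_synonyms_exact"],
        "mapped_exact_synonyms": c["to_synonyms_exact"] - c["unmapped_exact"],
    }


derived = derive(counts)

# Each recomputed total should agree with the value printed in the report.
for key in ("unmapped_exact", "mapped_total", "unmapped_total"):
    assert derived[key] == counts[key], key

print(derived)
```

Under that assumption, the two inferred exact-match counts add up to the reported 387, so the stage counts and the final totals in this report are mutually consistent.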