/usr/local/lib/python3.6/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
mapping2MCO_v2.py:266: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  str_matches_odf["SOURCE"] = mco_ifile


-------------------------------- PARAMETERS --------------------------------

--inputPath          Path of npl tagged file (crf output): /home/egaytan/automatic-extraction-growth-conditions/mapping_MCO/input/
--iAnnotatedFile     Input file of npl tagged file (crf output): sub_srr_IV.tsv
--iOntoFile          Input file with the ontology entities (MCO-terms): gc_ontology_terms_v2.txt
--iLinksFile         Input file with links and id for the ontology (MCO-type-links): None
--iSynFile           Input file for the additional ontology of synonyms (MCO-syn-json): mco_terms_v0.2.json
--outputPath         Output path to place output files: /home/egaytan/automatic-extraction-growth-conditions/mapping_MCO/output/
--outputFile         Output of the mapping process: sub_srr_IV_9prob_80per_v3.tsv
--minPerMatch        Minimal string matching percentage: 80
--minCRFProbs        Minimal crf probabilities allowed: 0.7





-------------------------------- INPUTS --------------------------------


npl tagged file

         GSE                        ...                                                                   REPOFILE
0  GSE100373                        ...                          http://pakal.ccg.unam.mx/cmendezc/automatic-ex...
1  GSE100373                        ...                          http://pakal.ccg.unam.mx/cmendezc/automatic-ex...
2  GSE100373                        ...                          http://pakal.ccg.unam.mx/cmendezc/automatic-ex...

[3 rows x 10 columns]

ontology entities

        TERM_ID                         TERM_NAME
0  MCO000000014  generically dependent continuant
1  MCO000000015                         radiation
2  MCO000000016         electromagnetic radiation

additional ontology of synonyms (MCO-syn-json)

                   ENTITY_NAME       TERM_ID       TERM_NAME
MCO000000019        continuant  MCO000000019                
MCO000002475    culture medium  MCO000002475                
MCO000002467_0        Organism  MCO000002467  biologicentity


-------------------------------- RESULTS --------------------------------


Mapping 8312 terms to MCO based on exact strings...


Mapping 7910 terms to MCO - synonyms based on exact strings...

               BANGLINE         ...                   TERM_TYPE
22  characteristics_ch1         ...          Genetic background
34  characteristics_ch1         ...          Genetic background
45  characteristics_ch1         ...          Genetic background

[3 rows x 13 columns]
Total of terms mapped by exact strings: 1001
Saving filtered terms from raw mapping...



15733 unmapped terms based on exact strings
Dropping duplicated unmapped term names...

391 unmapped unique terms based on exact strings


Mapping to MCO 391 terms based on string similarity...


Mapping to MCO - synonyms 391 terms based on string similarity...

Unique terms mapped by string similarity: 10
Total of terms mapped by string similarity: 2210
Saving filtered terms from str mapping...