


Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya and Nahuatl


Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández

This repository presents the results of two automatic morphological
analyses for Spanish, Nahuatl and Maya.
The first is the automatic segmentation of each language using
unsupervised learning of morphology (ULM), which segments each word
into morphs. The second is the clustering of the
segmented morphs using word embeddings.
A manual review of the automatic segmentation
showed that the automatic methods discovered many of the morphs
of each language despite their morphological complexity.
The general tendency was that more functional/grammatical morphs
(inflectional and derivational) were better segmented.
For the clustering, we observed that
functional/grammatical morphs tended to appear together, which allowed us to
conclude that the word embeddings represented the contextual
information necessary to differentiate them from morphs with
lexical-semantic content. Our study is one of the
few works showing a general overview of the unsupervised
analysis of linguistic morphology for Spanish and
Mexican languages.


Input

Place the input files of the article collection in the preprocessing_pipeline/original/ directory. Input files must be raw text files, and the *.txt extension is mandatory.
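
As a minimal sketch of the input requirement above, the following helper copies only *.txt files into an input directory and reports anything else. The function name `stage_inputs` and the source directory in the example call are hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the pipeline): copy raw text files into
# the input directory and report files lacking the mandatory .txt extension.
stage_inputs() {
    # $1: directory holding your articles, $2: pipeline input directory
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue            # ignore subdirectories and empty globs
        case "$f" in
            *.txt) cp "$f" "$dest"/ ;;
            *) echo "skipped (not .txt): $f" >&2 ;;
        esac
    done
}

# Example call; both paths are placeholders to adapt to your setup:
# stage_inputs ~/my-articles preprocessing_pipeline/original
```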


NLP preprocessing pipeline

The first step is to preprocess the input files with the NLP-preprocessing-pipeline/NLP-preprocessing-pipeline.sh shell script. This step needs to be performed only once for a given article collection.


Preprocessing directory

Our pipeline uses the preprocessing-files directory to save temporary files for each preprocessing task. These files can be removed after the NLP preprocessing has finished, except those in the features directory, which are used for the automatic classification task.
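
The cleanup described above could be sketched as follows. The helper `clean_preprocessing` is hypothetical. It keeps the features subdirectory and, as an extra assumption made here for safety, also keeps the original subdirectory that holds the input texts. Review the directory contents before running anything like this, since the deletion is not reversible.

```shell
# Hypothetical cleanup helper (not part of the pipeline): remove every
# subdirectory of the preprocessing directory except "features" and,
# by assumption, "original" (the input texts).
clean_preprocessing() {
    # $1: preprocessing directory, e.g. ../preprocessing-files
    dir="$1"
    for sub in "$dir"/*/; do
        [ -d "$sub" ] || continue          # nothing matched the glob
        name=$(basename "$sub")
        case "$name" in
            features|original) ;;          # keep these directories
            *) rm -rf -- "$sub" ;;         # temporary files: delete
        esac
    done
}

# Example call (path taken from the configuration shown in this README):
# clean_preprocessing ../preprocessing-files
```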


Term list directory

Several term lists are employed. These lists are stored in the termLists directory.


Configure

You must indicate the path for the input texts directory (ORIGINAL_CORPUS_PATH), the preprocessing directory (PREPROCESSING_PATH), the term list directory (TERM_PATH), the Stanford POS Tagger directory (STANFORD_POSTAGGER_PATH), the BioLemmatizer directory (BIO_LEMMATIZER_PATH), and the name of the TF for summarization (TF_NAME). 

    ORIGINAL_CORPUS_PATH=../preprocessing-files/original
    PREPROCESSING_PATH=../preprocessing-files
    TERM_PATH=../termLists
    STANFORD_POSTAGGER_PATH=/home/cmendezc/STANFORD_POSTAGGER/stanford-postagger-2015-12-09
    BIO_LEMMATIZER_PATH=/home/cmendezc/BIO_LEMMATIZER
    TF_NAME=MarA


Stanford POS Tagger and BioLemmatizer must be installed on your computer. They are not included in this repository; see the following references to obtain these programs:


Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the HLT-NAACL, pp. 252-259.
https://nlp.stanford.edu/software/tagger.shtml
Liu, H., Christiansen, T., Baumgartner, W. A., Jr., and Verspoor, K. (2012) BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. J. Biomed. Semantics, 3, 1-29.
https://sourceforge.net/projects/biolemmatizer/
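
Before running the pipeline, it may help to confirm that the configured directories exist. The sketch below is not part of the repository scripts: the functions `require_dir` and `check_config` are hypothetical, and they assume the five path variables from the configuration above have already been set in the current shell.

```shell
# Hypothetical check (not part of the pipeline): report a missing directory.
require_dir() {
    # $1: variable name used in the message, $2: directory path to check
    if [ ! -d "$2" ]; then
        echo "Missing directory for $1: $2" >&2
        return 1
    fi
    return 0
}

# Check every directory variable named in the configuration.
# Returns 0 only when all of them point to existing directories.
check_config() {
    status=0
    require_dir ORIGINAL_CORPUS_PATH "$ORIGINAL_CORPUS_PATH" || status=1
    require_dir PREPROCESSING_PATH "$PREPROCESSING_PATH" || status=1
    require_dir TERM_PATH "$TERM_PATH" || status=1
    require_dir STANFORD_POSTAGGER_PATH "$STANFORD_POSTAGGER_PATH" || status=1
    require_dir BIO_LEMMATIZER_PATH "$BIO_LEMMATIZER_PATH" || status=1
    return "$status"
}
```

Calling `check_config` after setting the variables prints one message per missing directory, so all problems are reported in a single pass.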


You can select which preprocessing steps are executed by setting the corresponding variable to TRUE or FALSE within the shell script:

    PRE=TRUE
    echo "   Preprocessing: $PRE"
    POS=TRUE
    echo "   POS Tagging: $POS"
    LEMMA=TRUE
    echo "   Lemmatization: $LEMMA"
    TERM=TRUE
    echo "   Terminological tagging: $TERM"
    TRANS=TRUE
    echo "   Transformation: $TRANS"
    FEAT=TRUE
    echo "   Feature extraction: $FEAT"
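
How such TRUE/FALSE variables typically gate a step can be sketched as follows. This is an illustration only: the helper `run_step` is hypothetical, the `echo` commands stand in for the real step commands, and the actual script may structure its conditionals differently.

```shell
# Hypothetical helper: run a command only when its flag is TRUE.
run_step() {
    # $1: flag value (TRUE/FALSE), $2: step label, remaining args: command
    flag="$1"
    label="$2"
    shift 2
    if [ "$flag" = "TRUE" ]; then
        echo "Running: $label"
        "$@"
    else
        echo "Skipping: $label"
    fi
}

# Example with placeholder commands standing in for the real steps.
PRE=TRUE
POS=FALSE
run_step "$PRE" "Preprocessing" echo "preprocessing command goes here"
run_step "$POS" "POS Tagging" echo "tagging command goes here"
```

With the values above, the first step runs and the second is skipped. Note that the comparison is a plain string match, so a value such as `true` or `True` would count as disabled in this sketch.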


Execute

Execute the NLP preprocessing pipeline within the NLP-preprocessing-pipeline directory by using the NLP-preprocessing-pipeline.sh shell script. Several output files will be generated while the shell script is running.

    cd NLP-preprocessing-pipeline
    ./NLP-preprocessing-pipeline.sh


Automatic summarization

At present, our pipeline generates the automatic summary of only one TF at a time. The TF name must be indicated within the shell scripts. The NLP preprocessing pipeline must already have been executed, so the features directory must already contain several files.
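
A quick way to confirm this precondition is to test that the features directory exists and is not empty. The helper `features_ready` below is a hypothetical sketch. It checks only for the presence of files, not that they are the ones the classifier expects.

```shell
# Hypothetical check (not part of the pipeline): succeed only when the
# features directory exists and contains at least one entry.
features_ready() {
    # $1: path to the features directory
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Example call (path taken from the configuration shown in this README):
# features_ready ../preprocessing-files/features || echo "Run the NLP preprocessing pipeline first" >&2
```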


Configure


Automatic classification

You must indicate the directory path for the feature sentences (INPUT_PATH), the classified sentences (OUTPUT_PATH), and the trained classification model (MODEL_PATH). Also, you must indicate the name of the trained model (MODEL), the name of the feature employed for classification (FEATURE), and the name of the TF (TF_NAME). Do not change the names of the model and the feature.

    INPUT_PATH=../preprocessing-files/features
    OUTPUT_PATH=./classified
    MODEL_PATH=.
    MODEL=SVM_model
    FEATURE=lemma_lemma_pos_pos
    TF_NAME=MarA


Making automatic summary

You must indicate the directory path to place the output automatic summary (OUTPUT_PATH), the directory path for the classified sentences (INPUT_PATH), and the name of the file with the classified sentences (INPUT_FILE).

    OUTPUT_PATH=../automatic-summary
    INPUT_PATH=./classified
    INPUT_FILE=$TF_NAME.txt
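
Note that INPUT_FILE is derived from TF_NAME, so TF_NAME must be assigned before this line is evaluated. The short sketch below, using the values shown in this README, illustrates how the shell expands the variable into the final file path.

```shell
# TF_NAME must be set first; the shell expands it when INPUT_FILE is assigned.
TF_NAME=MarA
INPUT_PATH=./classified
INPUT_FILE=$TF_NAME.txt

# The classified sentences are then expected at this path.
echo "$INPUT_PATH/$INPUT_FILE"   # prints ./classified/MarA.txt
```

If TF_NAME were empty at that point, INPUT_FILE would expand to just `.txt`, which is a likely cause of a "file not found" error.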


Execution

Execute the automatic summarization pipeline within the automatic-summarization-pipeline directory by using the automatic-summarization-pipeline.sh shell script.

    cd automatic-summarization-pipeline
    ./automatic-summarization-pipeline.sh
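
To run both pipelines in sequence from the repository root, a small wrapper can be used. This is a sketch under the assumption that the two pipeline directories sit side by side in the repository root. The helper `run_in_dir` is hypothetical. It runs each script in a subshell, so the caller's working directory is left unchanged.

```shell
# Hypothetical wrapper: run a script from inside its own directory.
run_in_dir() {
    # $1: directory containing the script, $2: script file name
    ( cd "$1" && ./"$2" )
}

# Example: summarization starts only if preprocessing succeeds.
# run_in_dir NLP-preprocessing-pipeline NLP-preprocessing-pipeline.sh &&
#     run_in_dir automatic-summarization-pipeline automatic-summarization-pipeline.sh
```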


Output

A raw text file with the automatic summary of the TF is placed within the automatic-summary directory.


Contact

Questions can be sent to the Computational Genomics Program (Center for Genomic Sciences, Mexico): cmendezc at ccg dot unam dot mx.