
# Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya and Nahuatl
## Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández

This repository presents the results of two automatic morphological
analyses of Spanish, Nahuatl and Maya.
The first is the automatic segmentation of each language using
unsupervised learning of morphology (ULM). This analysis segments each
word into morphs (e.g., Spanish _gatos_ 'cats' segments into the root
_gat-_ plus the gender and number morphs _-o-_ and _-s_).
The second is the clustering of the segmented morphs using word embeddings.
A manual review of the automatic segmentation
showed that the automatic methods discovered many of the morphs
of each language despite their morphological complexity.
The general tendency was that functional morphs
(inflectional and derivational) were segmented more accurately.
For the clustering, we observed that
functional morphs tended to appear together, which allowed us to
conclude that the word embeddings captured the contextual
information necessary to differentiate them from morphs with
lexical-semantic content.

# Directory description

## Corpora
Only a sample of the documents employed in our study is included here.
Complete versions must be requested by e-mail (see **Contact**).

## Segmentation
The segmented corpus for each language.
Maya and Nahuatl were segmented using _Morfessor CatMap_
(http://www.cis.hut.fi/projects/morpho/).
Spanish was segmented by the authors.
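
For reference, below is a minimal sketch of this kind of unsupervised segmentation using the Morfessor 2.0 Baseline model (the `morfessor` Python package), a close relative of the CatMap variant used for Maya and Nahuatl. The corpus file name and test words are placeholders, not files from this repository, and the settings actually used in the study may differ.

```python
# Minimal sketch: unsupervised morphological segmentation with the
# Morfessor 2.0 Baseline model (pip install morfessor). Illustration
# only; the study itself used Morfessor CatMap.
import morfessor

io = morfessor.MorfessorIO()

# read_corpus_file yields (count, compound) pairs from a raw text file.
# "corpus_nahuatl.txt" is a placeholder file name.
train_data = list(io.read_corpus_file("corpus_nahuatl.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()  # train until the model cost converges

# Segment words into morphs with the learned model.
for word in ["tlaxkalli", "nikneki"]:
    morphs, cost = model.viterbi_segment(word)
    print(word, "->", " + ".join(morphs))
```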

## Clustering
Clusters of morphs for each language:
500 groups for Maya and Nahuatl, 1000 groups for Spanish.
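
As an illustration of this step, the sketch below trains morph embeddings over a whitespace-segmented corpus and groups them with k-means. The file name, the embedding hyperparameters, and the library choices (gensim, scikit-learn) are assumptions made for the example; only the cluster counts come from the description above.

```python
# Minimal sketch: cluster morph embeddings with k-means. Input file,
# tooling and hyperparameters are illustrative assumptions; only the
# cluster count (500 for Maya/Nahuatl, 1000 for Spanish) comes from
# the repository description.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Each line of the segmented corpus is a sentence whose tokens are morphs.
with open("segmented_corpus_maya.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# Train skip-gram embeddings so each morph gets a contextual vector.
emb = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)

morphs = emb.wv.index_to_key
vectors = np.array([emb.wv[m] for m in morphs])

# 500 groups for Maya and Nahuatl (use n_clusters=1000 for Spanish).
kmeans = KMeans(n_clusters=500, n_init=10, random_state=0).fit(vectors)

# Collect the morphs belonging to each cluster.
clusters = {}
for morph, label in zip(morphs, kmeans.labels_):
    clusters.setdefault(int(label), []).append(morph)
print(clusters[0][:10])
```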

## Contact
Carlos Méndez (cmendezc at ccg dot unam dot mx)

Center for Genomic Sciences, UNAM, Mexico