
spanish-maya-nahuatl-morphological-analysis

Results of two automatic morphological analyses of Spanish, Nahuatl and Maya: (i) unsupervised learning of morphology and (ii) morph clustering using word embeddings.


Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya and Nahuatl

Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández

This repository contains the results of two automatic morphological analyses of Spanish, Nahuatl and Maya. The first analysis is the automatic segmentation of each language using unsupervised learning of morphology (ULM), which splits each word into morphs. The second analysis is the clustering of the segmented morphs using word embeddings. A manual review of the automatic segmentation showed that the automatic methods discovered many of the morphs of each language despite their morphological complexity. In general, the more functional morphs (inflectional and derivational) were segmented better. In the clustering, functional morphs tended to appear together, which led us to conclude that the word embeddings captured the contextual information needed to distinguish them from morphs with lexical-semantic content.

Clustering

/clustering

k-means clustering of morphs for each language: 500 groups for Maya and Nahuatl, 1000 groups for Spanish.
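As an illustration of the clustering step, here is a minimal sketch of k-means over morph vectors. It is not the code used to produce the files in this directory. The function name `kmeans`, the farthest-first initialisation, and the two-dimensional `toy_embeddings` values are all assumptions made for the example. Real embeddings would have many more dimensions, and `k` would be 500 or 1000 as stated above.

```python
from math import dist


def kmeans(vectors, k, iterations=100):
    """Plain Lloyd's k-means over {morph: vector}; returns {morph: cluster_id}."""
    morphs = list(vectors)
    points = [tuple(vectors[m]) for m in morphs]

    # Assumption: deterministic farthest-first initialisation, chosen here
    # only so that the example gives the same result on every run.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(iterations):
        # Assignment step: each morph goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return dict(zip(morphs, labels))


# Hypothetical 2-D vectors: three functional morphs and three lexical stems.
toy_embeddings = {
    "-s": (0.9, 0.1),
    "-mos": (1.0, 0.2),
    "-ndo": (0.8, 0.0),
    "cas": (0.1, 0.9),
    "perr": (0.0, 1.0),
    "libr": (0.2, 0.8),
}

clusters = kmeans(toy_embeddings, k=2)
for morph, cluster_id in clusters.items():
    print(cluster_id, morph)
```

With these toy values the three suffixes fall into one group and the three stems into the other, which mirrors the tendency described above for functional morphs to cluster together.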

Corpora

/corpora

Only a sample of the documents used in our study is available here. Complete versions must be requested by e-mail (see Contact).

Segmentation

/segmentation

Segmented corpus for each language. Maya and Nahuatl were segmented using Morfessor CatMap (http://www.cis.hut.fi/projects/morpho/). Spanish was segmented by the authors using the approach described in: Méndez-Cruz, C. F., A. Medina-Urrea, G. Sierra (2016), "Unsupervised morphological segmentation based on affixality measurements", Pattern Recognition Letters, 84(2016), pp. 127-133.

Contact

Carlos Méndez (cmendezc at ccg dot unam dot mx), Center for Genome Sciences, UNAM, Mexico.

Ignacio Arroyo (iaf at ciencias dot unam dot mx), Faculty of Sciences, UNAM, Mexico.