Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya and Nahuatl

Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández

This repository contains the results of two automatic morphological analyses for Spanish, Nahuatl and Maya. The first is the automatic segmentation of each language using unsupervised learning of morphology (ULM), which splits each word into morphs. The second is the clustering of the segmented morphs using word embeddings. A manual review of the automatic segmentation showed that the automatic methods discovered many of the morphs of each language despite their morphological complexity. In general, the more functional morphs (inflectional and derivational) were segmented better. In the clustering, functional morphs tended to appear together, from which we conclude that the word embeddings captured the contextual information needed to differentiate them from morphs with lexical-semantic content.

Clustering

/clustering

k-means clustering of morphs for each language: 500 groups for Maya and Nahuatl, 1000 groups for Spanish.
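
As a rough illustration of the idea, the sketch below runs a plain k-means (Lloyd's algorithm) over morph vectors using only the Python standard library. It is not the code used to produce the files in this directory: the `kmeans` helper, the morph list and the two-dimensional vectors are all made up for the example, and real word embeddings would have many more dimensions and far more morphs (with k set to 500 or 1000 as stated above).

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means over a list of equal-length float vectors.

    Hypothetical helper written for this example only.
    Returns one cluster label (0..k-1) per input vector.
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid
        # (squared Euclidean distance).
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy, invented 2-d "embeddings": three functional morphs and three stems.
morph_vectors = {
    "-s": [0.9, 0.1],
    "-mos": [0.8, 0.2],
    "-ción": [0.85, 0.15],
    "cas": [0.1, 0.9],
    "perr": [0.2, 0.8],
    "libr": [0.15, 0.85],
}

morphs = list(morph_vectors)
labels = kmeans([morph_vectors[m] for m in morphs], k=2)

# Group morphs by cluster label for inspection.
clusters = {}
for morph, label in zip(morphs, labels):
    clusters.setdefault(label, []).append(morph)

for label, members in sorted(clusters.items()):
    print(label, members)
```

With these toy vectors the functional morphs and the stems fall into separate groups, which mirrors the tendency reported above. Which cluster receives which label number depends on the random initialisation.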

Corpora

/corpora

Only a sample of the documents employed in our study is available here. Complete versions must be requested by e-mail (see Contact).

Segmentation

/segmentation

Segmented corpus for each language. Maya and Nahuatl were segmented using Morfessor CatMap (http://www.cis.hut.fi/projects/morpho/). Spanish was segmented by the authors using the approach described in: Méndez-Cruz, C. F., A. Medina-Urrea, G. Sierra (2016), "Unsupervised morphological segmentation based on affixality measurements", Pattern Recognition Letters, 84, pp. 127-133.

Contact

Carlos Méndez (cmendezc at ccg dot unam dot mx), Center for Genome Sciences, UNAM, Mexico.

Ignacio Arroyo (iaf at ciencias dot unam dot mx), Faculty of Sciences, UNAM, Mexico.