Carlos-Francisco Méndez-Cruz

README

# Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya and Nahuatl
## Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández
In the BioNLP group of the Computational Genomics Program (Center for Genomic Sciences, Mexico), we conduct research on automatic text summarization to support the biocuration process of RegulonDB (http://regulondb.ccg.unam.mx/).
RegulonDB is a database dedicated to the transcriptional regulation of Escherichia coli K-12. It contains a set of summaries describing several properties of transcription factors (TFs). These summaries, which also appear in EcoCyc (https://ecocyc.org/), are written manually from several scientific articles.
We have proposed an initial approach for the automatic generation of these summaries. This initial approach generates summaries for only two properties of TFs:
1. The biological processes in which the regulated genes are involved
2. The number, name, and size of the structural domains constituting the TF
The automatic summaries are produced by concatenating sentences from scientific articles that have been automatically classified by an SVM classifier. The evaluation of these initial automatic summaries indicated that they carried part of the relevant information included in the manual summaries.
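
The concatenation step described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `build_summary` and `toy_classifier` are hypothetical names, the keyword rule only stands in for the trained SVM so that the sketch runs without a model, and the example sentences are made up.

```python
from typing import Callable, Iterable, List


def build_summary(sentences: Iterable[str],
                  is_relevant: Callable[[str], bool]) -> str:
    """Concatenate, in their original order, the sentences the classifier accepts."""
    selected: List[str] = [s.strip() for s in sentences if is_relevant(s)]
    return " ".join(selected)


def toy_classifier(sentence: str) -> bool:
    # Hypothetical stand-in for the SVM decision: a simple keyword rule.
    return "domain" in sentence.lower()


# Made-up sentences used only to exercise the function.
sentences = [
    "The protein has two domains.",
    "Cells were grown overnight.",
    "The N-terminal domain binds DNA.",
]
summary = build_summary(sentences, toy_classifier)
print(summary)
```

Passing the classifier as a function keeps the concatenation logic independent of how sentences are actually classified.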
This repository provides a pipeline for generating these initial automatic summaries.
In this repository, we present the results of two automatic morphological
analyses for Spanish, Nahuatl and Maya.
The first is the automatic segmentation of each language using
unsupervised learning of morphology (ULM). This analysis segments each word
into morphs. The second is the clustering of the
segmented morphs using word embeddings.
A manual review of the automatic segmentation
showed that the automatic methods discovered many of the morphs
of each language despite their morphological complexity.
The general tendency was that the more functional/grammatical morphs
(inflectional and derivational) were segmented better.
For the clustering, we observed that
functional/grammatical morphs tended to appear together, which allowed us to
conclude that the word embeddings represented the contextual
information necessary to differentiate them from morphs with
lexical-semantic content. Our study is one of the
few works showing a general overview of the unsupervised
analysis of linguistic morphology for Spanish and
Mexican languages.
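
To illustrate the idea of grouping morphs by the similarity of their embeddings, here is a minimal sketch. It is not the method used in this study: the greedy threshold clustering is a simple technique swapped in for illustration, the names `cosine` and `cluster_morphs` are hypothetical, and the two-dimensional vectors are invented toy values rather than trained embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_morphs(vectors: Dict[str, Sequence[float]],
                   threshold: float = 0.9) -> List[List[str]]:
    """Greedy clustering: a morph joins the first cluster whose first
    member is at least `threshold` similar to it, else starts a new cluster."""
    clusters: List[List[str]] = []
    for morph, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(morph)
                break
        else:
            clusters.append([morph])
    return clusters


# Invented toy vectors: two suffix-like and two stem-like morphs.
toy_vectors = {
    "-s": (1.0, 0.1),
    "-mos": (0.9, 0.2),
    "cas-": (0.1, 1.0),
    "perr-": (0.2, 0.9),
}
clusters = cluster_morphs(toy_vectors)
print(clusters)
```

With these toy values the suffix-like morphs end up in one group and the stem-like morphs in another, which mirrors the tendency reported above only by construction.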
# Input
Place the input files of the article collection in the `preprocessing_pipeline/original/` directory. Input files must be raw text files, and the `.txt` extension is mandatory.
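
Before running the pipeline, it may help to check that the input directory meets these requirements. The following is a minimal sketch, not part of the repository: `list_input_files` is a hypothetical helper, and it assumes the pipeline is launched from the repository root so that the relative path resolves.

```python
from pathlib import Path
from typing import List


def list_input_files(input_dir: str = "preprocessing_pipeline/original") -> List[Path]:
    """Return the input files sorted by name, rejecting any file without a .txt extension."""
    directory = Path(input_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Input directory not found: {directory}")
    files = sorted(p for p in directory.iterdir() if p.is_file())
    bad = [p.name for p in files if p.suffix != ".txt"]
    if bad:
        raise ValueError(f"Files without the .txt extension: {', '.join(bad)}")
    return files


# Example call (assumes the directory exists relative to the working directory):
# for path in list_input_files():
#     print(path.name)
```

The check looks only at file extensions; it does not verify that the contents are actually raw text.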
......