Carlos-Francisco Méndez-Cruz

README

Showing 1 changed file with 20 additions and 12 deletions
1 -# Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya, and Nahuatl 1 +# Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya and Nahuatl
2 ## Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández 2 ## Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández
3 3
4 4
5 -In the BioNLP group of the Computational Genomics Program (Center for Genomic Sciences, Mexico), we conduct research on automatic text summarization for helping the biocuration process of RegulonDB (http://regulondb.ccg.unam.mx/). 5 +In this repository, results of two automatic morphological
6 - 6 +analyses for Spanish, Nahuatl and Maya are shown.
7 -RegulonDB is a database dedicated to the transcriptional regulation of Escherichia coli K-12. This database contains a set of summaries about several properties of TFs. These summaries are also found in EcoCyc (https://ecocyc.org/). These summaries are written manually by using several scientific articles. 7 +The first is the automatic segmentation of each language using
8 - 8 +unsupervised learning of morphology (ULM). This analysis segments each word
9 -We have proposed an initial approach for the automatic generation of these summaries. In this initial approach, we generate summaries only about two properties of TFs: 9 +into morphs. The second is the clustering of the
10 -1. The biological processes in which the regulated genes are involved 10 +segmented morphs using word embeddings.
11 -2. The number, name, and size of the structural domains constituting the TF 11 +A manual review of the automatic segmentation
12 - 12 +showed that automatic methods discovered many of the morphs
13 -The automatic summaries are made by the concatenation of the automatically classified sentences from scientific articles by an SVM classifier. The evaluation of these initial automatic summaries indicated that they carried part of the relevant information included in the manual summaries. 13 +of each language despite their morphological complexity.
14 - 14 +The general tendency was that more functional/grammatical morphs
15 -This repository provides a pipeline for generating these initial automatic summaries. 15 +(inflectional and derivational) were better segmented.
16 +For the clustering, it was observed that
17 +functional/grammatical morphs tended to appear together, which allowed us to
18 +conclude that the word embeddings represented the contextual
19 +information necessary to differentiate them from morphs with
20 +lexical-semantic content. Our study is one of the
21 +few works showing a general overview of the unsupervised
22 +analysis of linguistic morphology for Spanish and
23 +Mexican languages.
16 24
17 # Input 25 # Input
18 You must place input files of the article collection within the `preprocessing_pipeline/original/` directory. Input files must be raw text files. The *.txt extension is mandatory. 26 You must place input files of the article collection within the `preprocessing_pipeline/original/` directory. Input files must be raw text files. The *.txt extension is mandatory.
......