README

Carlos-Francisco Méndez-Cruz
Commit 25f793ef787e37a25fe61b4e7f46b3c85cee5cc4 (1 parent: e4d28f2b)
Showing 1 changed file with 20 additions and 12 deletions
README.md
--- a/README.md
+++ b/README.md
- # Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya, and Nahuatl
+ # Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya and Nahuatl
 ## Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández
 
 
- In the BioNLP group of the Computational Genomics Program (Center for Genomic Sciences, Mexico), we conduct research on automatic text summarization for helping the biocuration process of RegulonDB (http://regulondb.ccg.unam.mx/).
- 
- RegulonDB is a database dedicated to the transcriptional regulation of Escherichia coli K-12. This database contains a set of summaries about several properties of TFs. These summaries are also found in EcoCyc (https://ecocyc.org/). These summaries are written manually by using several scientific articles.
- 
- We have proposed an initial approach for the automatic generation of these summaries. In this initial approach, we generate summaries only about two properties of TFs:
- 1.	The biological processes in which the regulated genes are involved
- 2.	The number, name, and size of the structural domains constituting the TF
- 
- The automatic summaries are made by the concatenation of the automatically classified sentences from scientific articles by an SVM classifier. The evaluation of these initial automatic summaries indicated that they carried part of the relevant information included in the manual summaries.
-  
- This repository provides a pipeline for generating these initial automatic summaries.
+ This repository presents the results of two automatic morphological
+ analyses for Spanish, Nahuatl and Maya.
+ The first is the automatic segmentation of each language using 
+ unsupervised learning of morphology (ULM). This analysis segments each word 
+ into morphs. The second is the clustering of the 
+ segmented morphs using word embeddings.
+ A manual review of the automatic segmentation
+ showed that the automatic methods discovered many of the morphs
+ of each language despite their morphological complexity. 
+ The general tendency was that more functional/grammatical morphs 
+ (inflectional and derivational) were better segmented. 
+ For the clustering, it was observed that 
+ functional/grammatical morphs tended to appear together, which allowed us to
+ conclude that the word embeddings represented the contextual 
+ information necessary to differentiate them from morphs with 
+ lexical-semantic content. Our study is one of the 
+ few works showing a general overview of the unsupervised 
+ analysis of linguistic morphology for Spanish and 
+ Mexican languages.
 
 # Input
 Place the input files of the article collection in the `preprocessing_pipeline/original/` directory. Input files must be raw text files, and the `.txt` extension is mandatory.
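
The input requirement above can be checked before running anything. The following is a minimal sketch, not part of the repository: the helper name `check_input_dir` is hypothetical, and it assumes that the extension check is case-sensitive and that sub-directories can be ignored.

```python
from pathlib import Path


def check_input_dir(input_dir):
    """Return (valid, invalid) lists of file names found directly in input_dir.

    Hypothetical helper: a file counts as valid only if its suffix is
    exactly ".txt" (case-sensitive, an assumption of this sketch).
    """
    valid, invalid = [], []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue  # sub-directories are ignored in this sketch
        if path.suffix == ".txt":
            valid.append(path.name)
        else:
            invalid.append(path.name)
    return valid, invalid


if __name__ == "__main__":
    # Directory name taken from the Input section of the README.
    input_dir = Path("preprocessing_pipeline/original")
    if input_dir.is_dir():
        valid, invalid = check_input_dir(input_dir)
        print(f"{len(valid)} .txt file(s) ready")
        for name in invalid:
            print(f"unexpected extension: {name}")
    else:
        print(f"directory not found: {input_dir}")
```

Run from the repository root, the script lists any file whose extension is not `.txt`, so it can be renamed or removed first.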