# Automatic analysis of morphological units: segmentation and clustering of Spanish, Maya, and Nahuatl
## Carlos-Francisco Méndez-Cruz and Ignacio Arroyo-Fernández


In the BioNLP group of the Computational Genomics Program (Center for Genomic Sciences, Mexico), we conduct research on automatic text summarization to support the biocuration process of RegulonDB (http://regulondb.ccg.unam.mx/).

RegulonDB is a database dedicated to the transcriptional regulation of Escherichia coli K-12. It contains a set of summaries about several properties of transcription factors (TFs). These summaries, which are also found in EcoCyc (https://ecocyc.org/), are written manually from several scientific articles.

We have proposed an initial approach for generating these summaries automatically. In this first approach, we generate summaries about only two properties of TFs:
1. The biological processes in which the regulated genes are involved
2. The number, name, and size of the structural domains constituting the TF

An automatic summary is built by concatenating sentences from scientific articles that an SVM classifier labels as relevant. The evaluation of these initial automatic summaries indicated that they carried part of the relevant information included in the manual summaries.

This repository provides a pipeline for generating these initial automatic summaries.

# Input
You must place the input files of the article collection in the `preprocessing-files/original/` directory. Input files must be raw text files, and the `.txt` extension is mandatory.
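As an illustrative sketch (the article text and file name below are made up), preparing the input directory from the repository root might look like:

```shell
# Create the input directory and add one raw-text article (illustrative content).
mkdir -p preprocessing-files/original
printf 'MarA activates transcription of the micF gene.\n' \
  > preprocessing-files/original/article-0001.txt
# Every input file must end in .txt
ls preprocessing-files/original
```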

# NLP preprocessing pipeline
The first step is preprocessing the input files with the `NLP-preprocessing-pipeline/NLP-preprocessing-pipeline.sh` shell script. This step needs to be performed only once per article collection.

## Preprocessing directory
Our pipeline uses the `preprocessing-files` directory to save temporary files for each preprocessing task. These files can be removed after the NLP preprocessing has finished, except those in the `features` directory, which are used for the automatic classification task.

## Term list directory
Several term lists are employed. These lists are in the `termLists` directory.

## Configure
You must indicate the path of the input texts directory (`ORIGINAL_CORPUS_PATH`), the preprocessing directory (`PREPROCESSING_PATH`), the term list directory (`TERM_PATH`), the Stanford POS Tagger directory (`STANFORD_POSTAGGER_PATH`), the BioLemmatizer directory (`BIO_LEMMATIZER_PATH`), and the name of the TF to summarize (`TF_NAME`).
```shell
ORIGINAL_CORPUS_PATH=../preprocessing-files/original
PREPROCESSING_PATH=../preprocessing-files
TERM_PATH=../termLists
STANFORD_POSTAGGER_PATH=/home/cmendezc/STANFORD_POSTAGGER/stanford-postagger-2015-12-09
BIO_LEMMATIZER_PATH=/home/cmendezc/BIO_LEMMATIZER
TF_NAME=MarA
```

The Stanford POS Tagger and BioLemmatizer must be installed on your computer. They are not included in this repository; see the following references to obtain these programs:
- Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL, pp. 252-259.
- https://nlp.stanford.edu/software/tagger.shtml
- Liu, H., Christiansen, T., Baumgartner, W. A., Jr., and Verspoor, K. (2012) BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. J. Biomed. Semantics, 3, 1-29.
- https://sourceforge.net/projects/biolemmatizer/

You can indicate which preprocessing steps will be executed by assigning TRUE or FALSE to the corresponding variables within the shell script:
```shell
PRE=TRUE
echo " Preprocessing: $PRE"
POS=TRUE
echo " POS Tagging: $POS"
LEMMA=TRUE
echo " Lemmatization: $LEMMA"
TERM=TRUE
echo " Terminological tagging: $TERM"
TRANS=TRUE
echo " Transformation: $TRANS"
FEAT=TRUE
echo " Feature extraction: $FEAT"
```
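For example, on a collection that has already been preprocessed through lemmatization, only the last three steps need to run again; this fragment only flips the variables shown above:

```shell
# Skip the stages that were already completed for this collection.
PRE=FALSE    # plain-text preprocessing already done
POS=FALSE    # POS tagging already done
LEMMA=FALSE  # lemmatization already done
TERM=TRUE    # run terminological tagging
TRANS=TRUE   # run transformation
FEAT=TRUE    # run feature extraction
```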

## Execute
Execute the NLP preprocessing pipeline from the `NLP-preprocessing-pipeline` directory by using the `NLP-preprocessing-pipeline.sh` shell script. Several output files will be generated while the script is running.
```shell
cd NLP-preprocessing-pipeline
./NLP-preprocessing-pipeline.sh
```

# Automatic summarization
At present, our pipeline generates the automatic summary for only one TF at a time. The TF name must be indicated within the shell scripts. The NLP preprocessing pipeline must already have been executed, so that the `features` directory contains the required files.
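If you need summaries for several TFs, one option is a small wrapper that runs the pipeline once per TF. This is only a sketch: it assumes `TF_NAME` inside the pipeline script is rewritten before each run, and the TF names listed are examples.

```shell
# Hypothetical wrapper: run the summarization pipeline once per TF.
for TF in MarA OxyR SoxS; do
  echo "Summarizing $TF"
  # Rewrite TF_NAME in the script before each run, e.g.:
  # sed -i "s/^TF_NAME=.*/TF_NAME=$TF/" automatic-summarization-pipeline/automatic-summarization-pipeline.sh
  # (cd automatic-summarization-pipeline && ./automatic-summarization-pipeline.sh)
done
```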

## Configure

### Automatic classification
You must indicate the directory paths for the feature sentences (`INPUT_PATH`), the classified sentences (`OUTPUT_PATH`), and the trained classification model (`MODEL_PATH`). Also, you must indicate the name of the trained model (`MODEL`), the name of the feature employed for classification (`FEATURE`), and the name of the TF (`TF_NAME`). Do not change the names of the model and the feature.
```shell
INPUT_PATH=../preprocessing-files/features
OUTPUT_PATH=./classified
MODEL_PATH=.
MODEL=SVM_model
FEATURE=lemma_lemma_pos_pos
TF_NAME=MarA
```

### Making the automatic summary
You must indicate the directory path in which to place the output automatic summary (`OUTPUT_PATH`), the directory path of the classified sentences (`INPUT_PATH`), and the name of the file with the classified sentences (`INPUT_FILE`).
```shell
OUTPUT_PATH=../automatic-summary
INPUT_PATH=./classified
INPUT_FILE=$TF_NAME.txt
```

## Execute
Execute the automatic summarization pipeline from the `automatic-summarization-pipeline` directory by using the `automatic-summarization-pipeline.sh` shell script.
```shell
cd automatic-summarization-pipeline
./automatic-summarization-pipeline.sh
```

## Output
A raw text file with the automatic summary of the TF is placed in the `automatic-summary` directory.
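A quick sanity check after a run, assuming the summary file is named after the TF (e.g. `MarA.txt`; verify the actual name in your `automatic-summary` directory):

```shell
# Hypothetical check: the summary file should exist and be non-empty.
SUMMARY=automatic-summary/MarA.txt  # assumed naming, matching TF_NAME above
if [ -s "$SUMMARY" ]; then
  wc -c "$SUMMARY"
else
  echo "No summary found at $SUMMARY"
fi
```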

## Contact
Questions can be sent to the Computational Genomics Program (Center for Genomic Sciences, Mexico): cmendezc at ccg dot unam dot mx.
105 +