Chapter 8
Chromatin immunoprecipitation (ChIP) sequencing has become one of the most important methods for discovering, in vivo, the DNA binding sites of nucleoid-associated proteins (NAPs) and transcription factors (TFs).
In a ChIP experiment, DNA and proteins are first cross-linked to stabilize protein-DNA interactions.
The cross-linked chromatin is then sheared into fragments of 200-500 base pairs (bp).
Next, the protein of interest is immunoprecipitated using an appropriate antibody.
The cross-links are reversed and the DNA obtained is used either for sequencing (ChIP-seq) or for hybridization on a microarray-based platform (ChIP-chip).
ChIP studies have been used to understand developmental processes and disease associations in eukaryotes [1].
The roles of DNA-binding proteins in bacterial chromosome maintenance and gene regulation have also been uncovered using this method.
One of the first uses of ChIP in bacteria was the analysis of the genome-wide distribution of the cAMP receptor protein (CRP) on the E. coli chromosome, which led to the suggestion that this global transcription factor (GTF) might in fact be a NAP [2].
Since then, various groups have determined the genome-wide binding patterns of several NAPs and GTFs including, but not limited to, HNS, Fis, HU, IHF, FNR, Fur, and LRP [3-7].
Care must be taken when performing a ChIP experiment, particularly in the choice of control.
Controls fall broadly into two categories: (a) Input: the fragmented genomic sample extracted before immunoprecipitation; (b) mock-IP: the sample treated without the antibody or with a nonspecific antibody such as IgG (immunoglobulin G).
This article shares our experience of performing ChIP-seq experiments with E. coli NAPs and GTFs, focusing on the computational aspects of such studies.
Hardware: A computer running UNIX, Linux, or Mac OS X (with Xcode installed separately), with a minimum of 4 GB of RAM.
Software: All the software listed below is open source.
Whereas some of these procedures rely on sophisticated algorithms, such as the Burrows-Wheeler transform used to align millions of reads rapidly to a reference sequence, many others can be implemented efficiently with easy-to-write scripts in programming languages such as Perl or Python.
Install the following software: FastQC [8], Cutadapt [9], Burrows-Wheeler Aligner (BWA) [10], SAMtools [11], BEDTools [12], and MACS [13].
Install R [14] (check for the newest stable version) and Bioconductor packages such as genefilter [15].
UCSC archaeal genome browser [16] for visualization (web only).
MEME-ChIP [17] (web only).
Sample names: The filenames below assume paired-end sequencing.
• ChIP biological replicate 1: ChIP1.read1.fastq and ChIP1.read2.fastq.
• ChIP biological replicate 2: ChIP2.read1.fastq and ChIP2.read2.fastq.
• Input control replicate 1: Input1.read1.fastq and Input1.read2.fastq.
• Input control replicate 2: Input2.read1.fastq and Input2.read2.fastq.
3 Methods
Install the software and packages mentioned in Subheading 2.
All of these are open source and installation is straightforward.
After obtaining the reads, check their quality using FastQC.
This tool produces an HTML report showing the sequence quality of the reads, sequence duplication, %GC content, and adapter contamination.
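As a rough illustration of two of the metrics that FastQC reports, the sketch below computes the mean Phred quality and the %GC of each read in a FASTQ stream. It is not a substitute for FastQC. It assumes four-line FASTQ records and Phred+33 quality encoding, and the function names and the two example reads are ours, chosen only for illustration.

```python
from io import StringIO


def fastq_records(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of one read (assumes Phred+33 encoding and a non-empty read)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def gc_percent(seq):
    """Percentage of G and C bases in one non-empty read."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Two made-up reads standing in for the contents of ChIP1.read1.fastq.
example = StringIO("@read1\nGATTACAG\n+\nIIIIIIII\n@read2\nGGCC\n+\n!!II\n")
for name, seq, qual in fastq_records(example):
    print(name, round(mean_phred(qual), 1), round(gc_percent(seq), 1))
```

Reads with a low mean quality or an unexpected GC content are the ones worth inspecting in the FastQC report before trimming.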
Reads are aligned to the reference genome using BWA (see Note 2).
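The alignment step can be scripted along the following lines. This is a minimal sketch that only assembles the command lines and prints them. It assumes the BWA-MEM algorithm (the text does not state which BWA algorithm was used), and the reference filename ecoli_ref.fa and the thread count are placeholders.

```python
import shlex


def bwa_commands(reference, read1, read2, threads=4):
    """Build argv lists for indexing the reference and aligning one paired-end sample."""
    index_cmd = ["bwa", "index", reference]
    # bwa mem writes SAM records to standard output; redirect it to a file when running.
    align_cmd = ["bwa", "mem", "-t", str(threads), reference, read1, read2]
    return index_cmd, align_cmd


def as_shell(argv):
    """Render an argv list as a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)


reference = "ecoli_ref.fa"  # hypothetical filename for the reference genome
samples = ["ChIP1", "ChIP2", "Input1", "Input2"]

index_cmd, _ = bwa_commands(reference, "", "")
print(as_shell(index_cmd))  # the index is built once
for sample in samples:
    _, align_cmd = bwa_commands(
        reference, f"{sample}.read1.fastq", f"{sample}.read2.fastq"
    )
    print(as_shell(align_cmd), ">", f"{sample}.sam")
```

Printing the commands rather than executing them makes it easy to review them first; they can then be run with subprocess or pasted into a shell.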
To count only the mapped reads for further downstream analysis, run samtools view with the following options:
-c counts the matching reads, -F excludes reads carrying the given flag, and 0x4 is the flag for unmapped reads.
After checking the number of reads, the user can retain only the mapped reads for further work.
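The flag filter can be understood by decoding the FLAG field directly. In the SAM specification the bit 0x4 marks an unmapped read, whereas 0x40 marks the first read of a pair. The sketch below counts mapped reads in SAM text the way a flag-exclusion filter would. The function names and the four example records are invented for illustration.

```python
UNMAPPED = 0x4         # the read itself is unmapped
FIRST_IN_PAIR = 0x40   # the read is the first mate of a pair


def is_mapped(flag):
    """True when the unmapped bit (0x4) is not set in the SAM FLAG."""
    return not (flag & UNMAPPED)


def count_mapped(sam_lines):
    """Count alignment records whose FLAG does not carry the unmapped bit."""
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.split("\t")[1])
        if is_mapped(flag):
            mapped += 1
    return mapped


# Made-up records: only the first two fields matter for this sketch.
example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr\t100",    # paired, properly aligned, first in pair
    "r1\t147\tchr\t250",   # paired, properly aligned, second in pair
    "r2\t77\t*\t0",        # paired, read and mate unmapped, first in pair
    "r2\t141\t*\t0",       # paired, read and mate unmapped, second in pair
]
print(count_mapped(example))
```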
The samtools sort command sorts the output according to the user-given option: -o sets the output filename and -n sorts by read name.
Here the -d option of bedtools genomecov computes the coverage at each base, and -ibam specifies the input BAM file.
This step may take a long time to run.
The .cov output file has three columns, two of which are of interest: the second column gives the base position and the third gives the coverage computed at that position.
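A short script is enough to summarize such a per-base coverage file. The sketch below assumes whitespace-separated lines of sequence name, position, and depth, as described above. The function names and the four example lines are ours and purely illustrative.

```python
def read_coverage(lines):
    """Parse 'name position depth' lines into a list of (position, depth) tuples."""
    coverage = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        coverage.append((int(fields[1]), int(fields[2])))
    return coverage


def summarize(coverage):
    """Return (mean depth, position of maximum depth, maximum depth)."""
    depths = [depth for _, depth in coverage]
    mean_depth = sum(depths) / len(depths)
    max_pos, max_depth = max(coverage, key=lambda item: item[1])
    return mean_depth, max_pos, max_depth


# Made-up lines standing in for a .cov file.
example = ["chr\t1\t10", "chr\t2\t12", "chr\t3\t40", "chr\t4\t18"]
print(summarize(read_coverage(example)))
```

For a real file, pass the open file handle instead of the example list, for instance `read_coverage(open("ChIP1.cov"))`, where ChIP1.cov is a hypothetical filename.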
Model-based Analysis of ChIP-Seq (MACS) identifies regions bound by a NAP or GTF, or carrying a histone modification.
The model assumes that the read distribution is Poisson and performs three key steps to find enrichment: removal of redundant reads, adjustment of read positions based on the fragment size distribution, and calculation of peak enrichment using local background normalization [13, 18].
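The idea behind the Poisson test with a local background can be sketched in a few lines. This is a simplified illustration of the principle and not the MACS implementation: the local background rate is taken as the largest of several background estimates, and the p-value is the Poisson probability of observing at least the ChIP read count under that rate. The function names and the numbers in the example are made up.

```python
import math


def poisson_sf(k, lam, max_terms=1000):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    # First term of the tail, computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    for _ in range(max_terms):
        total += term
        i += 1
        term *= lam / i
        if term < total * 1e-16:
            break  # remaining terms no longer change the sum
    return min(total, 1.0)


def local_lambda(*background_rates):
    """Most conservative (largest) of the available background rate estimates."""
    return max(background_rates)


# Made-up example: 25 ChIP reads in a window where the control suggests
# 3.0 (genome-wide), 4.5 (nearby small window), or 4.0 (larger window) reads.
lam = local_lambda(3.0, 4.5, 4.0)
print(lam, poisson_sf(25, lam))
```

Taking the maximum of the background estimates guards against calling peaks in regions where the control itself is locally elevated.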
MACS can be installed on a local machine by following the authors' instructions.
We used MACS2 for our analysis.
Several parameters must be considered before running MACS on a dataset.
$ macs2 callpeak -t ChIP1.bam ChIP2.bam -c Input1.bam Input2.bam -f BAMPE -g 4.6e6 -n output
-c specifies the input/mock control data.
MACS can also run without this dataset.
-f specifies the format of the input files.
MACS accepts several read formats, including SAM, BAM, BED, and ELAND.
For paired-end reads, the BAM and ELAND formats can be used by specifying BAMPE and ELANDMULTIPET, respectively.
If this option is not specified, MACS detects the format automatically (see Note 6).
-p is the p-value cutoff.
If you do not set it, the default is 1e-5.
The output contains several files, named ChIP1_peaks.bed, ChIP1_peaks.xls, ChIP1_summits.bed, etc.
ChIP1_peaks.bed has the start and end genomic coordinates of the putative binding sites.
The fourth column corresponds to the name of the file and the fifth is the -log10(q value), which is also reported in ChIP1_peaks.xls.
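The peak file is easy to handle with a short script. The sketch below assumes the five tab-separated columns described above (sequence name, start, end, peak name, and score), and reads them into named tuples that can be filtered by score. The names Peak, parse_peaks, and filter_by_score, and the example rows, are ours and only illustrative.

```python
from collections import namedtuple

Peak = namedtuple("Peak", "chrom start end name score")


def parse_peaks(lines):
    """Parse five-column, tab-separated peak lines, skipping headers and blanks."""
    peaks = []
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        peaks.append(
            Peak(fields[0], int(fields[1]), int(fields[2]), fields[3], float(fields[4]))
        )
    return peaks


def filter_by_score(peaks, min_score):
    """Keep peaks whose fifth-column score is at least min_score."""
    return [peak for peak in peaks if peak.score >= min_score]


# Made-up rows standing in for the contents of a peak file.
example = [
    "chr\t1000\t1300\tpeak_1\t35.2",
    "chr\t5000\t5250\tpeak_2\t4.1",
    "chr\t9000\t9400\tpeak_3\t12.0",
]
for peak in filter_by_score(parse_peaks(example), 10.0):
    print(peak.name, peak.start, peak.end, peak.end - peak.start)
```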
The log2 fold-change cutoff is 1.2 or greater (see Note 7).
Peak visualization in the UCSC genome browser shows, among other things, whether peaks are clustered in specific regions of the chromosome, their evolutionary conservation across organisms, and gene annotation tracks (RefSeq) (Fig. 3).
One can also combine peak files from different NAPs into one file and view their differences and similarities in the same window.
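One way to combine several peak sets into a single file is to write them as consecutive custom tracks, each introduced by a track line. The sketch below assumes three-column BED intervals and that the sequence name matches the one used by the browser. The function name, the sequence name chr, and the coordinates are placeholders.

```python
def combine_tracks(peak_sets):
    """Build one multi-track BED text from {track name: [(chrom, start, end), ...]}."""
    lines = []
    for name, peaks in peak_sets.items():
        # Each track line starts a new track in a custom-track file.
        lines.append(
            f'track name="{name}" description="{name} ChIP-seq peaks" visibility=dense'
        )
        for chrom, start, end in sorted(peaks):
            lines.append(f"{chrom}\t{start}\t{end}")
    return "\n".join(lines) + "\n"


# Made-up coordinates for two of the NAPs mentioned in the text.
example = {
    "Fis": [("chr", 5000, 5250), ("chr", 1000, 1300)],
    "HNS": [("chr", 1100, 1900)],
}
print(combine_tracks(example), end="")
```

The resulting text can be saved to a file and uploaded as a custom track, so that overlapping and distinct peaks of the two proteins appear one above the other.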
One of the key questions in the gene regulation field is whether the binding of a NAP/GTF to the regulatory region of a gene can explain the regulation of expression of that gene.
GTFs/NAPs bind to various regions on the chromosome.
However, only those peaks that lie in regulatory regions of the chromosome are likely to influence gene expression directly.
Note that the regulation of gene expression is not straightforward, as there is increasing evidence of combinatorial regulation by several GTFs/NAPs; hence, readers must be cautious when interpreting these results.
We already know from extensive gene-centric studies of gene regulation and transcription initiation in E. coli that activators and repressors bind in the region from ~150 bp upstream up to the transcription start site [19-21].
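The upstream window described above can be intersected with peak summits roughly as follows. This is a sketch under stated assumptions: each operon is represented by one transcription start site (TSS) and a strand, the regulatory window spans 150 bp upstream of the TSS up to the TSS itself, and a peak is represented by its summit position. The operon names and coordinates are hypothetical.

```python
def regulatory_window(tss, strand, upstream=150):
    """Return (low, high) genomic bounds of the window upstream of a TSS."""
    if strand == "+":
        return tss - upstream, tss
    # On the minus strand, upstream lies at higher coordinates.
    return tss, tss + upstream


def peaks_in_regulatory_regions(summits, operons, upstream=150):
    """Return (operon name, summit) pairs where the summit falls in the window.

    summits: list of summit positions; operons: list of (name, tss, strand).
    """
    hits = []
    for name, tss, strand in operons:
        low, high = regulatory_window(tss, strand, upstream)
        for summit in summits:
            if low <= summit <= high:
                hits.append((name, summit))
    return hits


# Hypothetical operons and summit positions.
operons = [("opA", 1000, "+"), ("opB", 5000, "-"), ("opC", 9000, "+")]
summits = [900, 5100, 9500]
print(peaks_in_regulatory_regions(summits, operons))
```

The nested loop is adequate for a bacterial genome with a few thousand operons and peaks; BEDTools performs the same kind of intersection for larger datasets.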
To probe the role of a NAP/GTF, follow the instructions below.
4 Notes
Following this, the user obtains a tab-delimited file of genomic regions (a list of operons) that are bound in their regulatory regions by the NAP/GTF and are differentially expressed in the NAP/GTF mutant background.
This indicates whether the effect of NAP/GTF binding on gene expression is direct or indirect.
Based on the position of binding relative to the transcription start site, the user can also predict whether the GTF is an activator or a repressor; this will presumably require more precise binding-site identification than the resolution of ChIP permits, which can be obtained by combining ChIP peaks with motif identification or by using higher-resolution experimental techniques such as ChIP-exo.
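The comparison of binding and expression data can be sketched as a simple set operation. The example below assumes two lists of operon names, one bound in the regulatory region and one differentially expressed in the mutant, and labels each operon accordingly. Operons in both lists are candidates for direct regulation, and operons that are differentially expressed but not bound are candidates for indirect effects. The labels, function names, and operon names are ours.

```python
def classify_operons(bound, differentially_expressed):
    """Label each operon as bound_and_DE, DE_only, or bound_only."""
    bound = set(bound)
    expressed = set(differentially_expressed)
    rows = []
    for operon in sorted(bound | expressed):
        if operon in bound and operon in expressed:
            label = "bound_and_DE"   # candidate for direct regulation
        elif operon in expressed:
            label = "DE_only"        # candidate for an indirect effect
        else:
            label = "bound_only"     # bound, but no expression change detected
        rows.append((operon, label))
    return rows


def to_tsv(rows):
    """Render (operon, label) rows as tab-delimited text with a header."""
    lines = ["operon\tclass"] + [f"{operon}\t{label}" for operon, label in rows]
    return "\n".join(lines) + "\n"


# Hypothetical operon names.
bound = ["opA", "opB", "opD"]
differentially_expressed = ["opB", "opC", "opD"]
print(to_tsv(classify_operons(bound, differentially_expressed)), end="")
```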