Chapter 8
Chromatin ImmunoPrecipitation (ChIP) sequencing has become one of the most important methods for discovering the binding sites of NAPs and TFs on the DNA in vivo.
In a ChIP experiment, DNA and proteins are first cross-linked to strengthen protein-DNA interactions.
The cross-linked chromatin is then sheared to a size range of 200-500 base pairs (bp).
Next, the protein of interest is immuno-precipitated using an appropriate antibody.
The cross-links are reversed and the DNA obtained is used either for sequencing (ChIP-seq) or for hybridization on a microarray-based platform (ChIP-chip).
ChIP studies have been used to understand developmental processes and disease associations in eukaryotes [1].
The roles of DNA binding proteins in bacterial chromosome maintenance and gene regulation have also been uncovered using this method.
One of the first uses of the ChIP method for bacteria was the analysis of the genome-wide distribution of cAMP-receptor protein (CRP) on the E. coli chromosome, which resulted in the suggestion that this GTF might in fact be a NAP [2].
Since then, various groups have carried out experiments to determine the genome-wide binding patterns of various NAPs and GTFs including, but not limited to, HNS, Fis, HU, IHF, FNR, Fur, and LRP [3-7].
Care must be taken when performing a ChIP experiment, particularly in the choice of control.
Controls fall broadly into two categories: (a) Input: the fragmented genomic sample extracted before immuno-precipitation; (b) mock-IP: the sample treated without the antibody or with a nonspecific antibody such as IgG (Immunoglobulin G).
This article shares our experience of performing ChIP-seq experiments with E. coli NAPs and GTFs, exploring the computational aspects of such studies.
Hardware: Computer running UNIX, Linux, or Mac OS X (with Xcode installed separately), with a minimum of 4 GB of RAM.
Software: All software listed below is open source.
Whereas some of these procedures make use of sophisticated algorithms, including the Burrows-Wheeler transform for rapidly aligning millions of reads to a reference sequence, many others can be implemented efficiently using easy-to-write scripts in programming languages such as Perl or Python.
Install the following software: FastQC [8], Cutadapt [9], Burrows-Wheeler Aligner (BWA) [10], SAMtools [11], Bedtools [12], and MACS [13].
Install R [14] (check for the newest stable version) and Bioconductor packages such as genefilter [15].
UCSC archaeal genome browser [16] for visualization (web only).
MEME-ChIP [17] (web only).
Sample names: The filenames below assume paired-end sequencing.
- ChIP biological replicate 1: ChIP1.read1.fastq and ChIP1.read2.fastq.
- ChIP biological replicate 2: ChIP2.read1.fastq and ChIP2.read2.fastq.
- Input control replicate 1: Input1.read1.fastq and Input1.read2.fastq.
- Input control replicate 2: Input2.read1.fastq and Input2.read2.fastq.
3 Methods
Install the software and packages mentioned in Subheading 2.
All of these are open source, and installation is straightforward.
After obtaining the reads, check their quality using FastQC.
This tool produces its output in HTML format, showing the sequence quality of the reads, sequence duplication, %GC content, and adapter contamination.
Reads are aligned to the reference genome using BWA (see Note 2).
To check the number of mapped reads before further downstream analysis:
-c counts the number of matching reads, -F excludes reads carrying the given flag, and the 0x4 flag marks unmapped reads.
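The flag test behind this count can be sketched in Python. This is a minimal illustration rather than the SAMtools implementation: the function names `is_mapped` and `count_mapped` and the toy records are hypothetical, and the input is assumed to be uncompressed SAM text, in which the second tab-separated field of each record holds the bitwise flag and bit 0x4 marks an unmapped read.

```python
UNMAPPED = 0x4  # SAM flag bit that is set when the read itself is unmapped


def is_mapped(flag):
    """Return True when the unmapped bit (0x4) is not set in a SAM flag."""
    return (flag & UNMAPPED) == 0


def count_mapped(sam_lines):
    """Count mapped records in an iterable of SAM text lines.

    Header lines (starting with '@') and blank lines are skipped.
    """
    count = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # field 2 of a SAM record is FLAG
        if is_mapped(flag):
            count += 1
    return count


# Hypothetical toy records; only the first two fields matter here.
example = [
    "@SQ\tSN:chr\tLN:1000",
    "read1\t99\tchr\t10",    # 1+2+32+64: properly paired, first in pair
    "read1\t147\tchr\t150",  # 1+2+16+128: properly paired, second in pair
    "read2\t77\t*\t0",       # 1+4+8+64: read and mate unmapped
    "read2\t141\t*\t0",      # 1+4+8+128: read and mate unmapped
]
print(count_mapped(example))
```

Testing a single bit with a bitwise AND is what allows one flag value to carry several independent properties of a read at once.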
After checking the number of reads, the user can use the following command to work only with the mapped reads.
The sort command sorts the output according to the user-given option: -o specifies the output filename, and -n sorts by read name.
where the -d option computes the coverage per base and -ibam specifies the input BAM file.
This step might take a long time to run.
The .cov output file has three columns, of which two are of interest: the second column gives the base position and the third gives the coverage computed for that position.
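A short Python sketch shows how such a three-column file might be read. It assumes tab-separated columns in the order sequence name, position, coverage; the function names `read_coverage` and `mean_depth` and the toy lines are hypothetical and are not part of Bedtools.

```python
def read_coverage(lines):
    """Parse per-base coverage lines into a list of (position, depth) pairs.

    Each line is assumed to hold three tab-separated columns:
    sequence name, base position, coverage at that position.
    """
    coverage = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        coverage.append((int(fields[1]), int(fields[2])))
    return coverage


def mean_depth(coverage):
    """Average depth over all listed positions (0.0 for empty input)."""
    if not coverage:
        return 0.0
    return sum(depth for _, depth in coverage) / len(coverage)


# Hypothetical toy lines standing in for a .cov file.
example = ["chr\t1\t4\n", "chr\t2\t6\n", "chr\t3\t8\n"]
cov = read_coverage(example)
print(mean_depth(cov))
```

For a real file, passing an open file handle in place of the list streams the lines one at a time, although the returned list still holds one entry per base.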
Model-based Analysis of ChIP-Sequencing (MACS) identifies regions bound by a NAP/GTF or marked by a histone modification.
The model assumes the read distribution to be Poisson and then performs three key steps to find enrichment: removal of redundant reads, adjustment of read position based on the fragment size distribution, and calculation of peak enrichment using local background normalization [13, 18].
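The Poisson reasoning can be illustrated with a simplified Python sketch. This is not the MACS code: the function names and the example numbers are hypothetical, and taking the largest of several background rate estimates as the local rate is used here only as a conservative stand-in for local background normalization.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) obtained from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enrichment_p_value(chip_count, background_rates):
    """P-value of a ChIP read count against the largest background rate."""
    local_rate = max(background_rates)  # conservative choice (assumption)
    return poisson_upper_tail(chip_count, local_rate)


# Hypothetical example: 30 ChIP reads in a window, with background rates
# estimated at three different scales.
p = enrichment_p_value(30, [5.0, 8.0, 12.0])
print(p)
```

Using the largest background estimate makes the test harder to pass in regions where the control itself is locally enriched, which is the purpose of a local correction.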
MACS can be installed on a local machine following the author's instructions.
We used MACS2 for our analysis.
Several parameters have to be considered before running MACS on a dataset.
$ macs2 callpeak -t ChIP1.bam ChIP2.bam -c Input1.bam Input2.bam -f BAMPE -g 4.6e6 -n output
-c for the input/mock control data.
MACS can also work without this dataset.
-f for the format of the input files.
MACS accepts several read formats, including SAM, BAM, BED, and ELAND.
For paired-end reads, the BAM and ELAND formats can be used by specifying BAMPE and ELANDMULTIPET, respectively.
If this option is not specified, MACS will by default detect the format automatically (see Note 6).
-p is the p-value cutoff.
If you do not set this, the default is 1e-5.
The output contains several files named ChIP1_peaks.bed, ChIP1_peaks.xls, ChIP1_summits.bed, etc.
ChIP1_peaks.bed has the start and end genomic coordinates of the putative binding sites.
The fourth column corresponds to the name of the file and the fifth is the -log10(q value), also seen in ChIP1_peaks.xls.
The log2 fold-change cutoff is 1.2 or greater (see Note 7).
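Filtering on the fifth column can be sketched in a few lines of Python. The sketch assumes a tab-separated file whose first five columns are sequence name, start, end, name, and -log10(q value), as described above. The function names, the toy lines, and the q-value threshold of 0.05 are illustrative assumptions, and the fold-change cutoff is not handled here.

```python
import math


def parse_peak(line):
    """Split one tab-separated peak line into a dict of its first five columns."""
    chrom, start, end, name, score = line.rstrip("\n").split("\t")[:5]
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "name": name, "neg_log10_q": float(score)}


def filter_peaks(lines, max_q=0.05):
    """Keep peaks whose q-value is at most max_q.

    The comparison is done on the -log10 scale, so a smaller q-value
    corresponds to a larger fifth-column score.
    """
    threshold = -math.log10(max_q)
    peaks = (parse_peak(line) for line in lines if line.strip())
    return [p for p in peaks if p["neg_log10_q"] >= threshold]


# Hypothetical toy lines standing in for a peaks file.
example = [
    "chr\t100\t350\tpeak_1\t25.3\n",
    "chr\t900\t1100\tpeak_2\t0.8\n",
]
kept = filter_peaks(example)
print([p["name"] for p in kept])
```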
Peak visualization in the UCSC genome browser shows, for example, whether peaks are clustered in specific regions of the chromosome, their evolutionary conservation across other organisms, and gene annotation tracks (RefSeq) (Fig. 3).
One can also combine different NAP peak files into one file and view the differences and similarities in the same window.
One of the key questions in the gene regulation field is whether the binding of a NAP/GTF to a regulatory region of a gene can explain the regulation of expression of that specific gene.
GTFs/NAPs bind to various regions on the chromosome.
However, only those peaks that are present in the regulatory regions of the chromosome are likely to influence gene expression directly.
One point to note here is that the regulation of gene expression is not straightforward, as there is increasing evidence of combinatorial regulation by several GTFs/NAPs; hence, readers must be cautious when interpreting these results.
We already know from extensive gene-centric studies of gene regulation and transcription initiation in E. coli that activators and repressors bind within a region from ~150 bp upstream to the transcription start site [19-21].
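The idea of restricting peaks to this upstream window can be sketched in Python. This is an illustrative sketch under stated assumptions, not the procedure of this chapter: the function names and toy coordinates are hypothetical, coordinates are treated as 1-based and inclusive on a single linear sequence (wrap-around on a circular chromosome is ignored), and each gene is represented only by its transcription start site (TSS) and strand.

```python
def regulatory_window(tss, strand, upstream=150):
    """Return (start, end) of the window from `upstream` bp before the TSS to the TSS.

    On the minus strand, "upstream" lies at higher coordinates.
    """
    if strand == "+":
        return (max(1, tss - upstream), tss)
    return (tss, tss + upstream)


def overlaps(a_start, a_end, b_start, b_end):
    """True when two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def genes_with_bound_promoter(peaks, genes, upstream=150):
    """List the names of genes whose regulatory window overlaps any peak.

    peaks: iterable of (start, end); genes: iterable of (name, tss, strand).
    """
    peaks = list(peaks)
    bound = []
    for name, tss, strand in genes:
        w_start, w_end = regulatory_window(tss, strand, upstream)
        if any(overlaps(w_start, w_end, p_start, p_end)
               for p_start, p_end in peaks):
            bound.append(name)
    return bound


# Hypothetical toy data.
peaks = [(900, 1000), (5000, 5200)]
genes = [("geneA", 1050, "+"), ("geneB", 4900, "-"), ("geneC", 3000, "+")]
print(genes_with_bound_promoter(peaks, genes))
```

Making the window strand-aware matters because, for a gene on the minus strand, the region upstream of the TSS has larger genomic coordinates than the TSS itself.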
To probe the role of a NAP/GTF, follow the instructions below.
4 Notes
Following this, the user obtains a tab-delimited file with genomic regions (list of operons) that are bound in their respective regulatory regions by the NAP/GTF and are differentially expressed in the mutant NAP/GTF background.
This indicates whether the effect of NAP/TF binding on gene expression is direct or indirect.
Based on the position of binding relative to the transcription start site, the user can also predict whether the GTF is an activator or a repressor; this will presumably require more precise binding site identification than the resolution of ChIP permits, which can be obtained by combining ChIP peaks with motif identification or by using higher-resolution experimental techniques such as ChIP-exo.