A high resolution map of a cyanobacterial transcriptome
1 Graduate Program in Systems Biology, Harvard University, 52 Oxford Street, Northwest 445.40, Cambridge, MA 02138, USA
2 Howard Hughes Medical Institute, Harvard Faculty of Arts and Sciences Center for Systems Biology, Departments of Molecular and Cellular Biology and Chemistry and Chemical Biology, Harvard University, 52 Oxford Street, Northwest 445.40, Cambridge, MA 02138, USA
Genome Biology 2011, 12:R47 doi:10.1186/gb-2011-12-5-r47
Published: 25 May 2011
Additional file 1:
Tables S1 to S5. All genome positions and strands are relative to GenBank CP000100.
Table S1 - all annotated transcripts: column A, transcript ID number; column B, strand (1 is plus strand, 0 is minus strand); column C, first ORF on transcript; column D, last ORF on transcript; column E, predicted 5' transcription start site; column F, predicted 3' end; column G, length of transcript; column H, whether the transcript is an mRNA (transcripts that include any rRNA or tRNA were not considered mRNA; see Materials and methods); column I, number of ORFs per transcript; column J, length of 5' UTR; column K, length of 3' UTR; column L, mean of the raw RNA sequencing reads over the full transcript; column M, number of transcripts per cell, assuming a total of 1,500 mRNAs per cell.
Table S2 - all non-coding transcripts: column A, non-coding transcript ID number; column B, predicted 5' transcription start site; column C, predicted 3' end; column D, strand (1 is plus strand, 0 is minus strand); column E, mean of the raw RNA sequencing reads over the full non-coding transcript; column F, length of non-coding transcript; column G, percent overlap of the non-coding transcript with an ORF designated as not transcribed (an ORF is so designated when its mean RNA sequencing coverage is less than two reads per nucleotide); column H, percentage of the non-coding transcript that is antisense to an annotated transcript; column I, whether the non-coding transcript passes the high-confidence criteria (Materials and methods); column J, whether the non-coding transcript passes the circadian criteria (Materials and methods); column K, difference in expression of the non-coding transcript between the dusk and dawn circadian timepoints, calculated by tiling microarray (all probes internal to the non-coding transcript were used for this calculation); column L, RFAM homology.
Table S3 - all RNA polymerase peaks: column A, peak ID number; column B, start of peak; column C, end of peak; column D, position of peak maximum; column E, total ChIP reads at peak maximum (sum of the circadian timepoints, dawn and dusk, after normalization for total number of reads); column F, P-value for enrichment of reads in the ChIP sample versus the mock immunoprecipitation.
Table S4 - comparison of literature 5' versus RNA sequencing 5': column A, JGI ID for ORF; column B, common name for ORF; column C, strand (1 is plus strand, 0 is minus strand); column D, translation start position of ORF; column E, literature-based 5' transcription start site; column F, alternative 5' transcription start site from literature; column G, 5' transcription start site estimate from our RNA sequencing; column H, difference between our 5' transcription start site estimate and the closest literature estimate; column I, method of 5' transcription start site determination used in the literature reference; column J, literature reference.
Table S5 - expression of all JGI predicted ORFs: column A, JGI ID for ORF; column B, Synpcc7942 ORF ID; column C, start of ORF (when the ORF is on the plus strand, this is where the start codon is located); column D, end of ORF (when the ORF is on the minus strand, this is where the start codon is located); column E, strand (1 is plus strand, 0 is minus strand); column F, mean of the raw RNA sequencing reads over the full ORF.
Format: XLSX Size: 787KB
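Column M of Table S1 reports transcripts per cell under the assumption of 1,500 mRNAs per cell. The following is a minimal sketch of how such a scaling could be computed from the mean coverage in column L. It assumes, without confirmation from the text above, that copy number is proportional to mean coverage and that only transcripts flagged as mRNA (column H) share the 1,500 total; the authors' actual normalization is described in their Materials and methods and may differ. The function name and the example values are hypothetical.

```python
def transcripts_per_cell(mean_coverage, is_mrna, total_mrnas=1500.0):
    """Scale mean read coverage to per-cell copy numbers.

    Assumption (not confirmed by the source): copy number is proportional
    to mean coverage, and the copy numbers of all mRNAs sum to total_mrnas.
    """
    # Total coverage contributed by transcripts flagged as mRNA.
    mrna_total = sum(c for c, m in zip(mean_coverage, is_mrna) if m)
    if mrna_total <= 0:
        raise ValueError("no mRNA coverage to normalize against")
    scale = total_mrnas / mrna_total
    # Non-mRNA transcripts are scaled by the same factor for comparison.
    return [c * scale for c in mean_coverage]


# Made-up coverage values for four transcripts; the third is not an mRNA.
coverage = [10.0, 30.0, 500.0, 60.0]
mrna_flags = [True, True, False, True]
copies = transcripts_per_cell(coverage, mrna_flags)
```

With these made-up inputs the three mRNAs receive copy numbers that sum to 1,500, and the non-mRNA transcript is expressed on the same scale.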
Additional file 2:
Supplementary Figures S1 to S8.
Figure S1: examples of 5' determination from RNA sequencing. (a) 5' determination of the ntcA transcript. A sharp drop in RNA sequencing reads is observed at the 5' end of the mRNA. The 5' ends determined by RNA sequencing and by traditional methods differ by only a single nucleotide. (b) 5' determination of the purF transcript. The RNA sequencing estimate differs by over 80 nucleotides from that derived by traditional methods. Subsequent experiments have shown that the minimal promoter for the purF transcript contains the RNA sequencing 5' end but not the literature 5' end. A more complete comparison of RNA sequencing and traditional transcription start determination is provided in Table S4 in Additional file 1.
Figure S2: representative RNA pol ChIP over a 40-kb region. (a) RNA sequencing data. Positive strand transcription is shown in blue (positive y-axis), and negative strand transcription in red (negative y-axis). ORFs on the positive and negative strands are indicated by horizontal black lines. RNA pol peaks significantly enriched over the mock immunoprecipitation (P < 0.1) are indicated with vertical green lines and those that are not (P ≥ 0.1) are indicated with vertical pink lines. Large RNA pol peaks tend to be located near the 5' end of transcripts, although there are many peaks in the middle of transcripts, potentially caused by RNA pol pausing. (b) RNA pol ChIP and mock. RNA pol ChIP (black) and mock immunoprecipitation (green) are normalized such that the genome average is 200 reads per nucleotide. Almost all RNA pol peaks are enriched over the mock immunoprecipitation. A complete listing of RNA pol peaks and their enrichment is provided in Table S3 in Additional file 1. (c) RNA pol ChIP normalized by input. Normalization of RNA pol ChIP by input does not qualitatively change the data (compare Figure S2b and Figure S2c in Additional file 2).
Figure S3: comparison of changes in gene expression and RNA pol ChIP at two points in the circadian cycle. (a) Changes in RNA pol occupancy at two separate times during the circadian cycle (dusk and dawn). Changes in RNA pol occupancy reflect changes in transcript level measured by microarray (Pearson correlation, r = 0.6860). The probability of obtaining a correlation this large by random chance (P-value) is 2.2286e-197.
Figure S4: characteristics of transcription start. (a) Melting temperature at transcription start. The melting temperature of 10-nucleotide fragments from -200 to +200 of all mRNAs was averaged (Materials and methods). A drop in the melting temperature is observed at the promoter. (b) Nucleotide content at transcription start sites. Nucleotide content of all mRNAs aligned by transcription start. (c) Zoomed-in nucleotide content at transcription start. Nucleotide content of all mRNAs aligned by transcription start. A preference for adenine at the +1 position and a -10 element can be observed.
Figure S5: comparison of minimum free energy changes with that of dinucleotide-shuffled sequences. (a) Minimum free energy change at RNA pol peaks. The minimum free energy of 60-nucleotide RNA fragments with 10-nucleotide spacing was calculated and averaged for all mRNAs (Materials and methods). A drop in minimum free energy slightly prior to the position of the RNA pol peak is observed. To prevent sequence features of the transcription terminus or promoters from interfering with this analysis, a subset of 183 RNA pol peaks satisfying the following criteria was used: (1) the RNA pol peak must be closer to a 5' end than a 3' end; and (2) the RNA pol peak must be +100 to +300 relative to the 5' end. Since RNA pol ChIP does not specify the strand being transcribed, the strand of transcription was inferred from RNA sequencing data.
Dinucleotide-shuffled sequences show a qualitatively similar trend to native sequences, suggesting that there is no specific secondary structure at this transition (Materials and methods). (b) Sequence changes near RNA pol peaks. A change from low to high GC content can be observed near the position of the RNA pol peaks. The same subset of RNA pol peaks is used here as in Figure S5a in Additional file 2. A smoothing window of five nucleotides has been applied to the nucleotide contents. These sequence changes may be responsible for the free energy changes we observe. It is also possible that these changes in sequence content contribute to RNA pol pausing by an unknown mechanism. (c) Minimum free energy change at transcription terminus. Minimum free energy was calculated as above after aligning all transcripts by transcription terminus. Dinucleotide-shuffled sequences do not resemble native sequences, suggesting that a discrete hairpin-like structure exists at the terminus of transcripts (Materials and methods). (d) Minimum free energy change at transcription start. Minimum free energy was calculated as above after aligning all transcripts by 5' transcription start. A drop in minimum free energy occurs globally within transcripts and may be related to our observation of global RNA pol pausing. Dinucleotide-shuffled sequences show a qualitatively similar trend to native sequences (Materials and methods).
Figure S6: enrichment in RNA sequencing at 5'. (a) Increased RNA sequencing signal at 5' ends. An increase in RNA sequencing signal can be observed at the 5' end of mRNAs. Several biological phenomena may account for this enrichment, but one intriguing possibility is the existence of many partial or nascent transcripts caused by pausing of RNA pol near the 5' end of the transcript. (b) RNA pol pausing at 5' ends may contribute to RNA sequencing enrichment at 5' ends.
A slight but significant correlation exists between the retention ratio of RNA pol and the enrichment of RNA sequencing prior to the RNA pol peak. The same subset of RNA pol peaks was used as in Figure S5a in Additional file 2. Pearson correlation is r = 0.4591, and the probability of obtaining a correlation this large by random chance (P-value) is 6.2879e-11.
Figure S7: the phycocyanin operon - a functional case of partial transcription termination. (a) Partial transcription termination controls the stoichiometry of cpcβ and cpcα to rod linker mRNA at approximately 6:1. This stoichiometry reflects the organization of the phycobilisome - a hexameric α-β double disc with an associated linker. RNA sequencing data cannot be mapped to the cpcβ and cpcα coding region because it is not unique in the genome (another copy of cpcβ and cpcα, corresponding to the core proximal phycobilisomes, exists in the genome). The position of predicted terminators (from TransTermHP) is indicated in green, and the position of JGI predicted ORFs is indicated in black.
Figure S8: circadian gene expression of putative non-coding RNAs. (a) Gene expression by tiling microarray of high-confidence circadian non-coding RNAs. Gene expression of non-coding RNAs with potential for circadian gene expression is plotted by non-coding transcript ID (Table S2 in Additional file 1). Gene expression ratios for non-coding RNAs are computed by averaging the gene expression ratios of all tiling probes internal to the non-coding transcript.
Format: PDF Size: 5.8MB
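The legend of Figure S8 and column K of Table S2 both describe averaging the expression ratios of all tiling probes internal to a non-coding transcript. The following is a minimal sketch of that averaging step. It assumes that "internal" means a probe lies wholly within the transcript's start and end coordinates, ignores strand, and takes the ratios on whatever scale they are supplied; none of these details is stated above. The function name and the probe coordinates are hypothetical.

```python
def mean_internal_probe_ratio(probes, start, end):
    """Average the expression ratios of probes lying wholly inside a transcript.

    probes: iterable of (probe_start, probe_end, ratio) tuples.
    start, end: transcript coordinates, in either order.
    Returns None when no probe is internal to the transcript.
    Assumption: "internal" means the whole probe falls within [start, end].
    """
    lo, hi = min(start, end), max(start, end)
    internal = [
        ratio
        for probe_start, probe_end, ratio in probes
        if lo <= min(probe_start, probe_end) and max(probe_start, probe_end) <= hi
    ]
    if not internal:
        return None
    return sum(internal) / len(internal)


# Made-up probes: the third extends past the transcript end and is excluded.
example_probes = [(100, 159, 0.5), (180, 239, 1.5), (230, 289, 4.0)]
ratio = mean_internal_probe_ratio(example_probes, 90, 250)
```

Returning None for transcripts without any internal probe keeps such cases distinguishable from a genuine ratio of zero.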