Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome
1 Program in Computational Biology and Bioinformatics, Duke University, Science Drive, Durham, NC 27708, USA
2 Institute for Genome Sciences and Policy, Duke University, Science Drive, Durham, NC 27708, USA
3 Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse, Dresden 01307, Germany
4 Department of Biostatistics and Bioinformatics, Duke University, Duke University School of Medicine, Erwin Road, Durham NC 27710, USA
5 Department of Computer Science, Duke University, Durham, NC 27708, USA
Genome Biology 2009, 10:R73 doi:10.1186/gb-2009-10-7-r73 Published: 9 July 2009
Additional data file 1:
Table S1: false positive estimates of TSS assignments by condition. To assess the validity of the TSS condition assignments, we performed 100 random permutations of condition labels from the (sub-)clusters and evaluated their associations using the same methodology as for the identified TSSs. The numbers of false positives (column 3) were empirically estimated as the mean number of sites having a specific association with each condition (column 1) across all 100 random permutations. The false positive rate (column 4) was calculated by dividing the number of false positives by the number of identified TSSs that were observed to have the condition association (column 2).
Table S2: GO enrichments for genes with different condition associations for alternative TSSs. The table lists all significant GO categories for genes with alternative TSSs associated with specific conditions, at a false discovery rate cutoff of 0.1, and present in more than five genes.
Table S3: embryo associations confirm utilization patterns of known genes. We compared the embryonic utilization patterns previously observed for known genes to those identified using EST and Affymetrix tiling array data. For genes with at least one TSS having an EST embryo association (column 3) and promoter utilization in at least one tiling array time period (column 4), the identified patterns agree with previously reported expression patterns from in situ images (column 5) and published reports (column 6).
Table S4: false positive approximations of embryonic temporal promoter assignments. We evaluated the expected number of false positive temporal expression assignments for the set of promoters of 4,664 identified TSSs across 12 developmental periods (column 1) corresponding to 2-hour increments during embryogenesis (column 2). We chose 4,664 random intergenic sites and found the difference in median fluorescence intensities of neighboring tiles for each of the 12 time points. The differences in fluorescence intensities were compared to the difference thresholds (column 3) used to classify the set of 4,664 promoters. Random intergenic sites with fluorescence intensity differences above the threshold were counted as false positives. For each time point, the total number of false positives (column 4) was divided by the total number of random intergenic sites to approximate the rate of false positives (column 5).
Figure S1: alternative TSSs and alternative promoters are widely distributed across the genome. For each chromosome, the number of genes with one TSS location (blue) and more than one (that is, alternative) TSS location (red) were counted. Genes having alternative TSSs were divided into two groups according to the number and type of promoters: those having one broad promoter (yellow) and those having alternative promoters of the peaked or broad type, or any combination thereof (green). With the exception of chromosome 4, the overall fraction of genes with alternative TSSs ranged from 28 to 32%, and the fraction of genes with alternative promoters was 12 to 14%. Chromosome 4 is much smaller than the other Drosophila chromosomes, and had an elevated percentage of genes with alternative TSSs (19 out of 38; 50%) and alternative promoters (34%), possibly due to the small sample size.
Figure S2: evaluation of TSS quality. The quality of the TSS calls was evaluated by comparing the locations of initiation sites across databases and the frequencies of elements in the core promoter sequences surrounding them. (a) EPD location differences. Each of the 1,840 EPD TSSs was compared to the set of identified TSSs that were on the same chromosome. For each EPD TSS, the difference in location to the closest identified TSS was calculated, with the identified TSS as the reference position (0). Differences ranged from 0 to greater than 1,000 bp. The plot spans a region of ±20 nucleotides, which covers 76% (1,404) of EPD start sites. (b) FlyBase location differences. All TSSs in FlyBase that were upstream of the most downstream start codon, and did not map to a start codon location, were selected for comparison. Each of the TSSs identified by the hierarchical clustering strategy was compared to all of the FlyBase TSSs listed for the same gene. The smallest difference in location between the FlyBase TSS and the selected TSS was calculated at 1-bp resolution using the selected TSS as a reference point (0). The orientation of transcription of each gene was used to determine the sign of the differences. A negative difference corresponded to a FlyBase TSS being located upstream of the selected TSS, and a positive value signified that the FlyBase TSS was downstream of the selected TSS. The plot spans a region of ±300 nucleotides, which covers 79% (4,406) of TSSs matching to FlyBase start sites. Compared to EPD, differences in start site locations are thus one order of magnitude larger at roughly the same coverage. (c) Presence of core promoter elements. For 2,725 genes with exactly one TSS in our set and an annotated initiation site in FlyBase, motif matches were identified in the preferred windows in their core promoter sequences using separate zero-order Markov models as background. There is a consistently higher number of motif matches in the promoters of the TSSs identified here, compared to those of the TSSs from the FlyBase 5' end annotations.
Figure S3: sequence elements in preferred windows of peaked promoters preserve trends of motif associations. (a) Associations of element occurrences. Motif matches were constrained to their preferred windows in peaked core promoters and normalized to the number of occurrences per 100 kb (see Materials and methods). The mean number of occurrences across the three random intergenic sets is shown. (b) Correspondence of elements to embryonic utilization. The set of peaked core promoters was divided into three groups according to their pattern of embryonic utilization (maternal, zygotic, or both). The numbers of elements in the preferred windows of each group are shown.
Figure S4: Shannon entropy values segregate into three groups. The distributions of ESTs in the (sub-)clusters used to call TSSs were evaluated using Shannon entropy. As an example, the figure shows the entropy histogram for the embryonic condition with bins of size 0.5. The Q_Embryo,tss values naturally separate into three groups: those less than 1, those between 1 and 10, and those greater than 10. The high frequency of Q_Embryo,tss values between 13 and 13.5 is an artifact resulting from using 0.0001 to smooth P(i | tss) for (sub-)clusters containing ESTs mainly from one non-embryo library.
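The smoothing artifact described for Figure S4 can be illustrated with a short sketch. The exact definition of Q is not given in this file; the code below assumes the commonly used entropy-based specificity score, Q(condition | tss) = H(tss) - log2 P(condition | tss), where H is the Shannon entropy of the EST distribution in a (sub-)cluster, and it assumes that smoothing replaces probabilities below 0.0001 with 0.0001. The function name, the input layout (a dictionary of EST counts per condition) and the flooring scheme are illustrative assumptions, not the authors' implementation.

```python
import math


def condition_specificity(est_counts, condition, floor=1e-4):
    """Return (H, Q) for one (sub-)cluster.

    est_counts: dict mapping a condition/library label to its EST count
                (hypothetical input layout).
    condition:  label for which Q is computed.
    floor:      smoothing value applied to P(condition | tss); the value
                0.0001 comes from the text, the flooring scheme is assumed.
    """
    total = sum(est_counts.values())
    if total <= 0:
        raise ValueError("cluster contains no ESTs")

    probs = {label: n / total for label, n in est_counts.items()}

    # Shannon entropy in bits; zero-probability terms contribute nothing.
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)

    # Assumed score: low Q means the cluster is specific to `condition`.
    p_condition = max(probs.get(condition, 0.0), floor)
    q = entropy - math.log2(p_condition)
    return entropy, q


# A cluster drawn entirely from one non-embryo library:
# H = 0 and Q = -log2(0.0001), which is about 13.29.
h_head, q_head = condition_specificity({"head": 25}, "embryo")

# A cluster drawn entirely from embryo libraries: H = 0 and Q = 0.
h_emb, q_emb = condition_specificity({"embryo": 25}, "embryo")
```

Under these assumptions, a (sub-)cluster with no embryo ESTs and all ESTs from a single other library gets Q = -log2(0.0001), about 13.29, which falls in the 13 to 13.5 bin noted above.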
Format: DOC Size: 782KB
Additional data file 2:
Genomic locations and the frequencies of ESTs from each library are given for the initial groupings of ESTs, the (sub-)clusters created after clustering, and the TSSs chosen from each (sub-)cluster.
Format: TXT Size: 14.8MB
Additional data file 3:
All motif matches in the peaked and broad promoters are included, regardless of preferred windows. Promoters without at least one motif match are excluded from the file.
Format: TXT Size: 158KB
Additional data file 4:
Gene, chromosome, orientation, and condition association as determined by Shannon entropy for each individual TSS.
Format: TXT Size: 408KB
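The false positive estimates in Table S1 for these condition associations follow a permutation scheme: shuffle condition labels, re-evaluate the associations, and divide the mean permuted count by the observed count. A minimal sketch of that arithmetic is given below. How labels were permuted across (sub-)clusters is not specified here, so the code assumes that EST labels are pooled and shuffled while cluster sizes are preserved. The `majority_rule` classifier is a hypothetical stand-in for the entropy-based method, and all function names are illustrative.

```python
import random


def majority_rule(labels, condition, min_fraction=0.8):
    """Hypothetical stand-in classifier: a cluster is associated with
    `condition` if at least `min_fraction` of its ESTs carry that label."""
    if not labels:
        return False
    return labels.count(condition) / len(labels) >= min_fraction


def permuted_false_positive_rate(clusters, is_associated, condition,
                                 n_perm=100, seed=0):
    """Return (observed, mean_false_positives, false_positive_rate).

    clusters:      list of lists of EST condition labels, one per cluster.
    is_associated: callable(labels, condition) -> bool.
    """
    rng = random.Random(seed)

    # Observed number of clusters associated with the condition.
    observed = sum(is_associated(c, condition) for c in clusters)

    # Assumed permutation scheme: pool all labels, keep cluster sizes.
    pooled = [label for c in clusters for label in c]
    sizes = [len(c) for c in clusters]

    permuted_counts = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        start, count = 0, 0
        for size in sizes:
            count += is_associated(pooled[start:start + size], condition)
            start += size
        permuted_counts.append(count)

    # Mean count across permutations, then the rate relative to observed.
    mean_fp = sum(permuted_counts) / n_perm
    rate = mean_fp / observed if observed else float("nan")
    return observed, mean_fp, rate


example_clusters = [["embryo"] * 10, ["head"] * 10,
                    ["embryo"] * 10, ["testis"] * 10]
observed, mean_fp, rate = permuted_false_positive_rate(
    example_clusters, majority_rule, "embryo")
```

Fixing the seed makes the estimate reproducible. Any other association test with the same call signature can be passed in place of `majority_rule`.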
Additional data file 5:
Gene, chromosome, orientation, and temporal pattern of utilization determined by the tiling arrays for peaked and broad promoters.
Format: TXT Size: 488KB
Additional data file 6:
Patterns of utilization across the 12 developmental periods that occur at least five times in the set of peaked and broad promoters.
Format: TXT Size: 2KB
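As an illustration of how such a list of recurrent patterns could be derived, the sketch below tallies utilization patterns and keeps those seen at least five times. The encoding of a pattern as a 12-character string of 0s and 1s (one character per 2-hour period), the promoter identifiers, and the function name are assumptions made for this example. The actual layout of the data files is not described here.

```python
from collections import Counter


def frequent_patterns(promoter_patterns, min_count=5, n_periods=12):
    """Return {pattern: count} for patterns occurring at least `min_count`
    times, ordered from most to least frequent.

    promoter_patterns: dict mapping a promoter identifier to a string of
                       '0'/'1' characters, one per developmental period
                       (hypothetical encoding).
    """
    for promoter, pattern in promoter_patterns.items():
        if len(pattern) != n_periods or set(pattern) - {"0", "1"}:
            raise ValueError(f"bad pattern for {promoter}: {pattern!r}")

    counts = Counter(promoter_patterns.values())
    return {p: n for p, n in counts.most_common() if n >= min_count}


# Toy input: six promoters used only in the first period, one used throughout.
toy = {f"promoter_{i}": "100000000000" for i in range(6)}
toy["promoter_6"] = "111111111111"
recurrent = frequent_patterns(toy)
```

With this toy input, only the pattern shared by six promoters passes the threshold. The pattern seen once is dropped.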