This article is part of the supplement: Beyond the Genome: The true gene count, human evolution and disease genomics

Invited speaker presentation

Mining data from 1000 genomes to identify the causal variant in regions under positive selection

Shari Grossman1,2,3*, Ilya Shlyakhter1,2, Elinor K Karlsson1,2, Shervin Tabrizi1,2, Kristian Andersen1,2, John Rinn2, Eric Lander2, Steve Schaffner2, Pardis C Sabeti1,2 and The 1000 Genomes Project

  • * Corresponding author: Shari Grossman

  • † Equal contributors

Author Affiliations

1 Center for Systems Biology and Department of Organismic and Evolutionary Biology, Cambridge, MA 02138, USA

2 Broad Institute of Harvard and MIT, Cambridge, MA 02139, USA

3 Harvard Medical School, Boston, MA 0211, USA

Genome Biology 2010, 11(Suppl 1):I22  doi:10.1186/gb-2010-11-s1-i22

The electronic version of this article is the complete one and can be found online at:

Published: 11 October 2010

© 2010 Grossman et al; licensee BioMed Central Ltd.

Invited speaker presentation

The human genome contains hundreds of regions in which the patterns of genetic variation indicate recent positive natural selection, yet for most of these the underlying gene and the advantageous mutation remain unknown. We recently reported the development of a method, Composite of Multiple Signals (CMS), that combines tests for multiple signals of natural selection and increases resolution by up to 100-fold.
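
As a rough illustration of what combining several selection tests into one score can look like, the Python sketch below multiplies per-test posterior probabilities for each variant and ranks the variants in a region. It is a minimal sketch under stated assumptions, not the published CMS implementation: the function names, the input layout (pairs of likelihoods under a "selected" and a "neutral" model), the default uniform prior of one over the number of variants, and the treatment of the tests as independent are all assumptions made here for illustration.

```python
def composite_score(likelihoods, prior):
    """Combine per-test evidence for one variant into a single score.

    likelihoods: list of (p_selected, p_neutral) pairs, one per test, giving
        the probability of the observed test score under each model
        (hypothetical inputs; at least one value in each pair must be > 0).
    prior: assumed prior probability that this variant is the selected one.
    """
    score = 1.0
    for p_sel, p_neu in likelihoods:
        # Posterior probability of "selected" given this one test's score.
        posterior = p_sel * prior / (p_sel * prior + p_neu * (1.0 - prior))
        # Treat the tests as independent and multiply their posteriors.
        score *= posterior
    return score


def rank_variants(region, prior=None):
    """Return variant names ordered from highest to lowest composite score.

    region: dict mapping a variant name to its list of likelihood pairs.
    prior: if omitted, assume a uniform prior of 1 / (number of variants).
    """
    if prior is None:
        prior = 1.0 / len(region)
    scores = {name: composite_score(tests, prior) for name, tests in region.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy region with two variants and two tests each (made-up numbers).
toy_region = {
    "variant_A": [(0.9, 0.1), (0.8, 0.2)],
    "variant_B": [(0.1, 0.9), (0.2, 0.8)],
}
ranking = rank_variants(toy_region)
```

In this toy region, `variant_A` scores well on both tests and so ranks first. The point of the sketch is that a variant must look unusual on every test to keep a high composite score, which is one way a combined statistic can narrow a broad signal to a few candidates.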

Applying CMS to candidate selected regions from the International Haplotype Map, we localized several hundred signals to ~50-100 kb, identifying individual genes and polymorphisms as targets of selection. These regions included genes involved in processes known to be targets of selection, such as infectious disease, skin pigmentation, metabolism, and hair and sweat. We further identified many candidates that resemble regulatory elements. In several regions, we identified variants that are significantly associated with the expression of nearby genes in the selected population. Moreover, nearly half of the ~200 regions we examined localized to regions with no genes. Thirty of the regions contain long non-coding RNAs, which have been shown to often regulate nearby genes, suggesting that variation within the RNAs might have functional consequences.

With preliminary data now available from the 1000 Genomes Project, we are beginning to explore full sequence data, which should contain most if not all of the causal selected polymorphisms. We extended the CMS method to the preliminary data set, validating our previously identified candidates and identifying many intriguing new coding and regulatory variants.