The Importance of Global Studies of the Genetics of Type 2 Diabetes
Abstract
Genome wide association analyses have revealed large numbers of common variants influencing predisposition to type 2 diabetes and related phenotypes. These studies have predominantly featured European populations, but are now being extended to samples from a wider range of ethnic groups. The transethnic analysis of association data is already providing insights into the genetic, molecular and biological causes of diabetes, and the relevance of such studies will increase as human discovery genetics increasingly moves towards sequencing-based approaches and a focus on low frequency and rare variants.
INTRODUCTION
The past few years have seen an explosion in our capacity to identify DNA sequence variants that influence individual predisposition to type 2 diabetes and related traits such as fasting glucose, body mass index, and fat distribution. These discoveries have largely been powered by the ability of researchers to undertake genome wide surveys for genetic associations in very large numbers of well-characterised samples, making use of high-density genotyping arrays capable of capturing the majority of common variation segregating in human populations [1].
There are now over 40 loci confidently associated with individual risk of type 2 diabetes, and over 30 associated with body mass index and risk of obesity [2-4]. Each of these loci has the potential to reveal novel biological insights into disease pathogenesis, though a great deal of detailed functional work remains to be done to link the association signals discovered to the specific local transcripts through which they mediate their effect on disease risk.
To date, most of these discoveries have been made in samples of European origin, whether collected in Europe or North America [2-4]. However, there are now growing numbers of genome wide association and resequencing studies for diabetes and related traits being conducted in samples from other parts of the world, most particularly those from East and South Asia, and from minority populations in the United States (Hispanics and African Americans) [5-10]. For example, recent studies of samples from East Asia were the first to describe type 2 diabetes risk variants near to the KCNQ1, UBE2E2, C2CD4A/B, SRR and PTPRD genes [5-8].
This review will discuss the value of multiethnic studies of diabetes genetics, and describe how these are likely to add to our understanding of type 2 diabetes genetics and biology.
EXPLORING THE OVERLAP IN ASSOCIATION SIGNALS BETWEEN MAJOR ETHNIC GROUPS
There have been a growing number of studies which have taken the diabetes association signals first discovered by genome wide association analyses in one population (mostly Europeans) and evaluated the evidence for their association with diabetes in others [11-16]. Moreover, as more and more genome wide association studies are completed in non-European populations [5-10], it becomes increasingly possible to compare the genome wide patterns of association across a wide diversity of ethnic groups.
The consensus from these comparisons is that the majority of the signals identified so far show clear evidence of directionally-consistent association across major population groups [11-16]. This consistency is particularly obvious for samples from populations that are not of recent African origin: additional data from African-descent samples are awaited with interest. Initial reports of failed replication (at FTO for example) were largely, it seems, the result of inadequate sample size, combined with differences in allele frequency that made some signals far harder to detect in some non-European populations [17-21]. Given the small effect sizes of many of the common variant signals found so far, and the massive sample sizes required for their initial discovery, it is not surprising that most of these transethnic replication studies have been underpowered to detect confirmatory signals at all known loci. However, if one builds up data over multiple studies, and/or uses measures that are better-powered for modest sample size (such as genetic risk scores, or the proportion of loci showing directionally consistent odds ratios) the degree of overlap is striking [16]. Nor is this simply a case of loci that are discovered in Europeans being identified in other populations, as the reverse example of KCNQ1 demonstrates [5,6].
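One of the better-powered measures mentioned above, the proportion of loci showing directionally consistent odds ratios, can be assessed with a one-sided binomial (sign) test. The following is a minimal sketch in Python, not the procedure used in any of the cited studies. It assumes that the loci are independent and that, under the null hypothesis of no shared signal, each locus has a 50% chance of showing the same direction of effect as in the discovery population. The function name and the counts in the usage line are hypothetical.

```python
from math import comb


def sign_test_p(n_consistent: int, n_loci: int) -> float:
    """One-sided binomial tail P(X >= n_consistent) for X ~ Binomial(n_loci, 0.5)."""
    if not 0 <= n_consistent <= n_loci:
        raise ValueError("n_consistent must lie between 0 and n_loci")
    # Count the outcomes at least as extreme as the one observed.
    tail = sum(comb(n_loci, k) for k in range(n_consistent, n_loci + 1))
    return tail / 2 ** n_loci


# Hypothetical illustration: 34 of 40 loci share the discovery direction.
p_value = sign_test_p(34, 40)
```

Because the test uses only the sign of each effect, it can reveal overlap even when no single locus reaches significance in a modestly sized replication sample.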
What can we learn from this? The genome wide association studies conducted to date have necessarily focussed on common variants, and it should come as no surprise to find that most such variants are present in populations across the globe (Fig. 1). In the absence of strong selective pressures, it takes many thousands of generations for a mutation to drift to high frequency, and we can expect that most of the common variants seen in non-African populations predate the most recent expansion out of Africa, around 70,000 years ago, and will be shared amongst populations from Stockholm to Seoul, and from Mexico City to Mumbai [22].
One obvious corollary of this overlap is that differences in the prevalence and in the presentation of diabetes across the world [23,24] are rather unlikely to be attributable to common sequence variants. Does this overlap mean that we should abandon efforts to map common variants for diabetes in additional populations, given the large investments already made in the analysis of samples from Europe and East Asia? Absolutely not, for the simple reason that the between-population differences in effect sizes and allele frequency that occur at some loci translate into very marked differences in the potential for their initial discovery (especially to the levels of statistical stringency required for genome wide studies). The loci emerging from genome wide association studies in East Asian samples demonstrate this extremely well: the signals at KCNQ1 and C2CD4A/B for example are definitely also present in European subjects but were missed by previous genome wide association efforts in Europeans for reasons of power and chance [5,6,8,25].
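The dependence of discovery power on allele frequency can be illustrated with a normal approximation to a two-sided allelic case-control test. This is a rough sketch under stated assumptions, not a substitute for a dedicated power calculator: it treats the control allele frequency as the population frequency, assumes a multiplicative allelic odds ratio, and uses a genome wide significance threshold of 5e-8. The function name and all numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def allelic_test_power(control_freq, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allelic case-control test."""
    # Risk-allele frequency in cases implied by the allelic odds ratio.
    case_freq = odds_ratio * control_freq / (1 - control_freq + odds_ratio * control_freq)
    # Standard error of the frequency difference (two alleles per person).
    se = sqrt(case_freq * (1 - case_freq) / (2 * n_cases)
              + control_freq * (1 - control_freq) / (2 * n_controls))
    shift = abs(case_freq - control_freq) / se
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls in either rejection tail.
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)


# Hypothetical illustration: same odds ratio and sample size, different allele frequency.
power_common = allelic_test_power(0.40, 1.3, 5000, 5000)
power_rare = allelic_test_power(0.05, 1.3, 5000, 5000)
```

With these illustrative inputs, the variant at 40% frequency is detected almost every time, whereas the same effect at 5% frequency is usually missed.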
TRANSETHNIC FINE MAPPING FOR COMMON VARIANT SIGNALS
Genome wide association studies are reliant on linkage disequilibrium for the initial identification of signals since it is unlikely that the causal variant (or variants) at any locus will actually be represented on any given genotyping array. However, once a signal has been found and shown, by replication, to be genuine, linkage disequilibrium becomes an obstacle, frustrating efforts to home in on the causal variant at the locus. For example, at the FTO locus, attempts at further refinement of the association signal (through resequencing, dense genotyping or imputation from HapMap or 1,000 Genomes reference panels) have been unsuccessful: as far as we can tell, from studies of European samples at least, the causal allele could be any one of dozens of highly-correlated alleles carried on a 50 kb haplotype.
However, since local patterns of linkage disequilibrium often differ between major population groups [22], one would hope that fine-mapping studies conducted at the transethnic level might enable some refinement of location, and in some circumstances, provide strong statistical evidence in favour of a single causal variant. Naturally there are some assumptions behind such analyses, the first being that the same single causal variant is shared between the populations concerned. The overlap in common variant signals reported above is clearly reassuring in this respect as it suggests that allelic heterogeneity is limited, at least amongst non-African populations.
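The logic of transethnic refinement can be sketched very simply: if a single causal variant is shared, it should remain strongly correlated with the lead variant in every population, so candidates that lose that correlation in any one population can be set aside. The Python sketch below is a deliberately simplified illustration of this idea, not a formal fine-mapping method. The r-squared threshold of 0.8, the variant labels, the population labels, and all values are hypothetical.

```python
def shared_candidates(ld_by_population, r2_threshold=0.8):
    """Return variants whose r^2 with the lead variant meets the threshold in every population."""
    candidate_sets = [
        {variant for variant, r2 in ld.items() if r2 >= r2_threshold}
        for ld in ld_by_population.values()
    ]
    if not candidate_sets:
        return set()
    # Only variants that stay in strong LD everywhere remain plausible.
    return set.intersection(*candidate_sets)


# Hypothetical r^2 values between a lead variant and its neighbours.
ld = {
    "population_A": {"var1": 0.95, "var2": 0.91, "var3": 0.88, "var4": 0.30},
    "population_B": {"var1": 0.93, "var2": 0.42, "var3": 0.85, "var4": 0.10},
    "population_C": {"var1": 0.90, "var2": 0.35, "var3": 0.20, "var4": 0.05},
}
remaining = shared_candidates(ld)
```

In this toy example, population A alone leaves three candidates, whereas adding two populations with different correlation patterns leaves one. Where patterns are similar between populations, little refinement is gained.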
The major limitation of this approach is likely to be the fact that patterns of linkage disequilibrium and haplotype structure are quite similar between non-African populations [22], and this has fostered growing interest in the interrogation of samples of recent African origin (for example African Americans) [26]. The high genetic diversity of African populations, and the long period of divergence, means that the linkage disequilibrium patterns in African populations are often markedly different to those seen in Europeans and Asians. This has the potential therefore to offer considerable benefits in terms of fine-mapping, but only provided locus and allelic heterogeneity are not extreme. Put simply, there is a danger that at some loci, there will be no susceptibility alleles segregating in accessible African populations, meaning that there is "nothing to fine-map." The limited data for type 2 diabetes susceptibility in African Americans is reassuring in this respect [16,26], and it will be interesting to see the results of the genome wide association studies that are currently being completed using samples from this population.
In the meantime, it seems sensible to pursue a broad strategy that attempts fine mapping in both non-African and African populations. Interestingly, several of the strongest diabetes susceptibility signals (TCF7L2, CDKAL1, and KCNQ1) do demonstrate rather unusually divergent haplotype structures between major ethnic groups [5,6,22], providing some encouragement that, as the data sets available become larger, effective fine-mapping will be possible. Fortunately such studies can be based around existing genome wide association data (complemented with imputation from ethnically-diverse reference panels, such as those forthcoming from the 1,000 Genomes Project [27]), so the costs are largely those of analysis.
INFORMATION ON GENETIC ARCHITECTURE AND SELECTION
Transethnic studies are also capable of providing valuable insights into the genetic architecture of type 2 diabetes. An excellent example of this relates to the important clues that transethnic studies have provided with respect to the so-called "synthetic" association hypothesis [28]. This hypothesis, which was derived predominantly from simulation studies rather than empirical data, proposed that many (perhaps most) of the common variant signals identified by genome wide association studies are not the result of causal variants that are themselves also common. Rather, the common variant signals detected are merely a consequence of the ways in which multiple rare causal alleles at each locus are scattered across the common haplotypes in the region. If true, this "synthetic" association model has profound implications for the genetic architecture of common disease, and for the strategies that should be adopted for identification of the causal variants.
Although widely promoted at the time of its publication, and seized upon by those antagonistic to the genome-wide association approach, there is relatively little empirical evidence to support this model. The CARD15 association with inflammatory bowel disease [29] shows that "synthetic" associations can occur, but are they really responsible for the majority of common variant signals detected by genome wide association studies?
One clear prediction of the "synthetic" model is that common variant signals detected in one major ethnic group should not be expected to replicate in others. This is because rare alleles have usually arisen quite recently (in the absence of selection, it takes many generations for a new mutation to drift to higher frequency), such that many of these rare alleles will have appeared during the course of the modern human diaspora and will not be widely represented across multiple major ethnic groups (Fig. 1). Under those circumstances, it would be highly unlikely that the different sets of rare causal alleles that might have arisen in Europeans and East Asians (for example) would have stacked up, by chance, on the same set of haplotypes, and thereby generated the same common variant signals. However, this is precisely what we observe for type 2 diabetes. In other words, the directional consistency and high reproducibility across major ethnic groups of almost all common variant signals for type 2 diabetes provide strong evidence that these signals are driven by causal alleles that are themselves common [16]. Presumably, these common, causal alleles predate the recent human expansion out of Africa, and having been carried to the four corners of the world, show broadly similar effects on diabetes risk.
But can we go further? The thrifty genotype hypothesis, first promulgated by Neel, proposed that the high prevalence of diabetes (and obesity) in modern populations might be the result of many generations of selection for alleles that, in prehistoric times at least, conferred some kind of selective advantage [30]. The most obvious mechanism for this would involve individual differences in the capacity for the efficient storage of energy as fat during times of plenty. Individuals with "thrifty genotypes" would, according to this hypothesis, be in an advantageous position during periods of erratic food supply. However, in today's societies with access to constant (and excessive) food availability, individuals carrying these same alleles are now predisposed to develop obesity and diabetes.
Given the growing number of diabetes-susceptibility variants now established, do transethnic comparisons provide evidence for selection that might support the thrifty genotype hypothesis? For the time being, the answer to this question remains far from conclusive. Using a variety of approaches, including comparing the frequencies of diabetes risk-alleles across populations, as well as looking for other genetic hallmarks of recent selection, studies to date have concluded that the evidence for selection is modest when viewed across all risk loci [31,32]. However, it is notable that the loci with the strongest evidence for ethnic differences in allele frequency and haplotype structure (both of these possible markers of selection) are also those (TCF7L2, CDKAL1, and KCNQ1) with some of the largest effects on diabetes risk [31,32]. It may be that the evidence for selection is most obvious when the phenotypic effects are also greatest and that these data are pointing towards subtle selection effects. Nonetheless, it is fair to say that the transethnic data to date fail to provide compelling support for the thrifty genotype hypothesis.
EXPLAINING DIFFERENCES IN PREVALENCE AND PRESENTATION OF DISEASE
Although most cases of diabetes across the globe are considered to fit within the "umbrella" of type 2 diabetes, there is no doubt that prevalence and presentation of type 2 diabetes differs between major ethnic groups [23,24,33,34]. Of course, these differences may turn out to be largely attributable to differences in environmental factors [35], but migration studies (for example the high prevalence of diabetes in migrant South Asian populations worldwide) may point to an important genetic contribution.
For reasons hinted at above (particularly the high degree of overlap between the signals observed in different ethnic groups), it seems rather unlikely that between-population variation in the pattern of common variant signals will explain major differences in prevalence or presentation [16,33]. Having said that, there is emerging evidence that the effect sizes for most type 2 diabetes common variant loci are systematically larger in Japanese case-control comparisons than in equivalent analyses from other populations [16,36]. It remains to be seen whether this observation, if confirmed, represents an intrinsic and ethnic-specific difference in genetic risk. Such differences in effect size could reflect the ways in which the cases and controls were selected (for example, selection for lean cases can boost some signals), the extent of environmental (dietary, economic) homogeneity (possibly greater in the Japanese population than in others), and the prevalence of obesity (if low, this may mean that cultural and lifestyle factors are having less of an impact on diabetes risk, thereby inflating the role of genetic variants).
If there are genetic explanations for interethnic differences in prevalence and presentation of diabetes, these are likely to come from variants that are (at the global level at least) of lower frequency. Not only are such variants likely to be of more recent origin, and therefore more population specific (Fig. 1), but a subset of them may well have larger effects than the common variant signals discovered to date [37]. There are a growing number of examples of ethnic-specific variants that underlie substantial differences in disease prevalence - influencing rates of heart failure in South Asians, renal disease in Africans, and hepatosteatosis in Hispanics for example [38-40]. In some instances, these variants have been subject to marked selection and have risen to relatively high frequency in one or other major ethnic group. Whilst the detection of such highly-selected variants therefore continues to justify the application of common variant genome-wide association scan methodologies to diverse ethnic groups, it seems probable that resequencing approaches, directed towards low frequency and rare variant discovery, will prove the most powerful strategies for uncovering the genetic basis of interethnic differences in disease prevalence and presentation.
ADVANCING BIOLOGICAL UNDERSTANDING OF DISEASE PREDISPOSITION
One of the major challenges thrown up by the success of the genome wide association approach lies in connecting the signals found to their downstream biology. Many of the genome wide association signals map to regulatory regions some distance from the nearest coding genes, and for only a minority of the forty or so known type 2 diabetes susceptibility loci has the transcript responsible for the causal effect been characterised [2]. This represents a serious impediment to the translation of the genetic discoveries into the improved understanding of disease predisposition that can support clinical advances.
One of the most obvious strategies for linking signals to function lies in searching for "smoking gun" mutations in the genes mapping near to a genome wide association signal. The idea here is to expose the transcript responsible for the predisposition by identifying which (if any) of the genes in the vicinity contains variants predicted to have high functional impact (ideally rare, coding mutations of large effect that are clearly expected to abrogate gene function, such as frameshifts or premature stop mutations) and which can be shown to be responsible for type 2 diabetes or a closely related phenotype (such as a monogenic or syndromic form of diabetes). The best example of this approach to date comes from type 1 diabetes. Exon resequencing of the transcripts mapping to a genome wide association signal for type 1 diabetes on chromosome 2 revealed a number of low frequency variants with high putative functional impact within the IFIH1 gene, each of which showed evidence of an association with type 1 diabetes [41]. Though these variants did not explain the original common variant signal, they did provide a very strong pointer to IFIH1 as the gene most likely to be responsible for mediating the association effect at this locus.
In conducting such studies, there are obvious merits in examining more than one ethnic group. Given that clear-cut "smoking gun" mutations (from both a statistical and functional perspective) will not be seen at every locus (they are likely to represent random accidents of nature, often of recent origin and likely to disappear within a few generations), extending the survey to a wide range of different ethnic groups provides the chance to "buy multiple tickets to the lottery." The hope is that an interesting "smoking gun" signal clearly visible in one ethnic group can be rapidly followed up in others (where the signal exists but is not so obvious), and that it will be the accumulation of a wide variety of different "smoking gun" mutations, with clearly independent mutational histories, which provides the necessary pointers to identification of the transcript mediating the common variant association signal. The T2D-GENES consortium, for example, is testing this approach by resequencing over 500 genes from type 2 diabetes genome wide association signals in over 10,000 case-control samples ascertained from European, East Asian, South Asian, Hispanic, and African-American populations.
THE FUTURE
Human genetics is shifting from an era dominated by common variant discovery powered by genome-wide association studies, to one of low frequency and rare variant identification through sequencing. As the field moves in this direction, it will become ever more important, for a variety of reasons, to examine the genetic basis of disease in multiple ethnic groups. First, we can expect to see greater divergence of genetic predisposition between populations (both locus and allelic heterogeneity) as far as low frequency variants are concerned, simply because they are more likely to be of recent and ethnic-specific origin. Second, such divergence means that studies conducted in multiple ethnic groups (provided, of course, that they take proper account of population structure) will offer greater opportunities for discovery, and more chances to find high-impact alleles well-suited to subsequent functional and physiological characterisation. Third, as we have seen, the genetic basis of interethnic differences in prevalence and presentation of disease (including response to therapeutic and preventative interventions) is more likely to be explained by lower frequency variants. Fourth, as it becomes harder to obtain convincing statistical evidence within a single ethnic group that a given low frequency variant (or set of low frequency/rare variants) is associated with disease (simply because power for any given effect size is reduced for lower frequency variants), the demonstration that the same gene harbours an excess of rare susceptibility (or protective) alleles in several distinct ethnic groups will provide an ever more important signal for establishing a causal link between that gene and disease.
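The fourth point, that modest evidence for the same gene in several distinct ethnic groups can add up to a convincing signal, can be illustrated by combining per-group p-values. The sketch below uses Fisher's method, chosen here purely for illustration and not drawn from the studies discussed. It assumes that the groups are independent samples and that each gene-level p-value is valid within its own group. The function name and the p-values are hypothetical.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method."""
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    # Half of Fisher's statistic, which is -2 * sum(log(p)).
    half_stat = -sum(log(p) for p in p_values)
    # For 2k degrees of freedom the chi-square upper tail has a closed form:
    # exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 0.0
    for i in range(len(p_values)):
        total += term
        term *= half_stat / (i + 1)
    return exp(-half_stat) * total


# Hypothetical gene-level p-values from four ethnic groups.
combined = fisher_combined_p([0.01, 0.03, 0.02, 0.04])
```

None of the four illustrative p-values would be persuasive alone, yet the combined value is well below 0.001. Fisher's method ignores the direction of effect, so in practice a direction-aware meta-analysis may be preferable.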
For all these reasons, it is crucial, if we are to understand the basis of a global disease such as diabetes, that we pursue well-powered genetic and genomic enquiry in as many diverse populations as possible. It is equally important that these efforts are linked through strong scientific collaborations and mechanisms for data exchange, since it is increasingly true that, only by working together, will we be able to overcome the very considerable challenges that remain.
ACKNOWLEDGMENT
I would like to acknowledge the many colleagues, senior and junior, national and international, with whom it has been such a pleasure to work on these challenging problems. The commitment to productive collaboration of researchers in our field has served as a powerful example to others. In particular, I wish to highlight the collective endeavour represented by the Global Diabetes Consortium and the T2D-GENES Consortium, funded by the National Institute of Diabetes and Digestive and Kidney Diseases in the US (U01DK085545).