Thomson Reuters
 

 ScienceWatch

AUTHOR COMMENTARIES - 2008

Mass General's Mark J. Daly on HapMap and Disease Genes
Featured Science Watch® Newsletter Interview

Computational biologist Mark J. Daly of Massachusetts General Hospital and the Broad Institute, Boston, discusses his work, particularly on a catalogue of genetic variation known as HapMap, a resource that has helped elucidate the genetic underpinnings of type 2 diabetes and other diseases. The concept of haplotypes and HapMap has made Daly one of the most highly cited researchers in biology. He currently ranks among the top dozen most-cited authors in molecular biology & genetics in Thomson Reuters's Essential Science Indicators℠ database, based on papers published in the last decade.


In 2002, when geneticists began pushing for the funding to create a new type of genomic map—known as the haplotype map, or HapMap—the project was controversial, to say the least. Its opponents described it as a $110 million boondoggle in the making. The project, however, went ahead. Just three years later, in October 2005, HapMap was published in Nature, while the data were made freely available online. Now, less than three years after that, the payoff has been remarkable. HapMap may have set a new record for the time necessary to cover the ground from scientific controversy to unambiguous success.

The original proponents of HapMap included some of the biggest names in genetics and molecular biology, among them Francis Collins of the Human Genome Project and Eric Lander of the Whitehead Institute Center for Genome Research (now part of the Broad Institute). But the initial spark of inspiration emerged from the experimental insight of one of Lander’s students, Mark J. Daly. Since then, the concept of haplotypes and HapMap has made Daly one of the most highly cited researchers in biology. He currently ranks among the top dozen most-cited authors in molecular biology & genetics in Thomson Reuters's Essential Science Indicators℠ database, based on papers published in the last decade.

Furthermore, in this publication's recent survey of high-impact biology between 2002 and 2006, Daly ranked high among the featured authors, thanks to 10 top-cited papers (Science Watch, 19[1]: 1-2, January/February 2008). And the latest update to the Hot Papers Database includes eight reports from Daly and colleagues published over the last two years. Daly's past Hot Papers include the original 2005 HapMap report from Nature (see the table below, paper #5), which debuted at #1 in the Biology Top Ten in the September/October 2006 issue of Science Watch and stayed there until March/April 2008, when it reached the Hot Papers mandatory two-year retirement age.

Daly, 40, received his bachelor of science degree from MIT in 1989 and his Ph.D. in genetics from Leiden University in 2004. Between 2001 and 2005 he was also a Pfizer fellow in computational biology at the Whitehead Institute, where he originally trained with Lander. He is currently an assistant professor of medicine at Harvard Medical School, an assistant geneticist at Massachusetts General Hospital, and a senior associate member of the Broad Institute, where he leads computational biology in medical and population genetics.

Daly spoke to Science Watch from his office at Mass General in Boston.

SW:  Okay, first question before we proceed: I do have to ask what exactly a haplotype is.

A haplotype is simply the collection of alleles at variable positions near one another on a chromosome. Imagine you have variation at one position on a chromosome where the base could be either A or C, and then, 100 bases later, another variable base at which chromosomes can bear either G or T. That combination of two alleles carried on a single chromosome (in this case either A-G, A-T, C-G, or C-T) is what we call a haplotype: essentially, the genotype of an individual chromosome as expressed at multiple sites along the chromosome.
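To make the definition concrete, here is a minimal Python sketch (an editorial illustration, not something from the interview) that enumerates the possible haplotypes for the two variable sites in Daly's example. The names `site_alleles` and `possible_haplotypes` are hypothetical.

```python
from itertools import product

# Alleles at the two nearby variable sites from the example: A/C, then G/T.
site_alleles = [("A", "C"), ("G", "T")]

def possible_haplotypes(sites):
    """Return every combination of one allele per site, in site order."""
    return ["-".join(combo) for combo in product(*sites)]

print(possible_haplotypes(site_alleles))  # ['A-G', 'A-T', 'C-G', 'C-T']
```

With k two-allele sites there are 2^k possible combinations, which is why it is notable when only a handful of them actually appear in a population.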

SW:  How did you come to realize that such a thing existed? What was your insight, in other words?

I started out as a student in Eric Lander's laboratory in the late 1980s, before there was a genome project and at a time when we were still holding out hope that methods of family-based linkage analysis (the tool that works on rare Mendelian, single-gene disorders) would work for more complex diseases as well. That’s where my initial efforts in computational biology were focused in those early years.

The idea then emerged experimentally. We were studying a particular region of chromosome 5 that had been implicated in Crohn’s disease. We came to focus in that region on a set of genes that we had high confidence were involved in the disease. When we looked at the sequence, though, we didn’t find any obvious Mendelian-like, smoking-gun mutations. We began to look more broadly at the sequences from a number of patients, and we discovered this unexpected correlation between alleles in that region. In other words, if we found a single nucleotide polymorphism (SNP) anywhere in that region, we would be able to predict, with different degrees of certainty, what other SNPs would be found throughout that region. In some cases this was true for SNPs hundreds of kilobases away. What this reflected was that, despite many polymorphic sites in the region, there were actually a surprisingly small number of haplotypes shared by everyone in the population.

This was very much unexpected, and it suggested to us that we could possibly accomplish much of what we needed to find disease genes by choosing only a small number of polymorphisms in a particular region and using those as surrogates or proxies for the many other remaining polymorphisms in that region. By studying that particular region on chromosome 5 in more detail, and then other regions across the genome, we and other groups were able to come upon this model demonstrating considerable structure in the relationship between polymorphisms discovered in the genome. That, in turn, led us very quickly to the formation of the HapMap project, cataloguing genetic variations in a fashion that could then be used by medical geneticists to study any region of the genome efficiently and thoroughly.
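The idea of using a few polymorphisms as proxies for the rest can be sketched in Python. This is an editorial illustration under stated assumptions, not the HapMap project's actual method: the toy sample of 12 chromosomes is invented, correlation is measured with the standard r² statistic for two-allele sites, and the 0.8 cutoff is an assumed threshold chosen for the example.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between the alleles at sites i and j."""
    n = len(haplotypes)
    # Code one reference allele per site as 1 and the other allele as 0.
    ref_i, ref_j = haplotypes[0][i], haplotypes[0][j]
    x = [1 if h[i] == ref_i else 0 for h in haplotypes]
    y = [1 if h[j] == ref_j else 0 for h in haplotypes]
    p_i = sum(x) / n
    p_j = sum(y) / n
    p_ij = sum(a * b for a, b in zip(x, y)) / n
    d = p_ij - p_i * p_j  # departure from independent assortment
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return d * d / denom if denom > 0 else 0.0

def proxies_for(haplotypes, tag, threshold=0.8):
    """Sites whose alleles the tag site predicts at r^2 >= threshold."""
    n_sites = len(haplotypes[0])
    return [s for s in range(n_sites)
            if s != tag and r_squared(haplotypes, tag, s) >= threshold]

# Hypothetical toy sample: 12 chromosomes, 4 SNPs, only 3 distinct haplotypes.
sample = ["AGCT"] * 6 + ["CTAG"] * 4 + ["AGCG"] * 2

print(proxies_for(sample, tag=0))  # [1, 2]
```

In this toy sample, genotyping site 0 alone recovers sites 1 and 2 exactly, while site 3 (r² of 0.5 with site 0) would still need to be typed or tagged by another marker.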

SW:  The project was very controversial at the outset. Why were some geneticists so adamant that it wouldn’t work?

There wasn't an enormous amount of empirical data at the beginning. When we first started studying this region on chromosome 5, the results were quite unexpected. There really wasn’t a great deal of data from human populations with respect to what a high density of genetic variation looks like across tens or hundreds of kilobases of the genome. To propose then that there were these correlations or structures to the data was to suggest that aspects of some of the more traditional work that had been done to model human genetic variation might not have been completely accurate. So one controversy was simply over the fact that we lacked conclusive data on many of these questions. What we began to observe and describe in a few anecdotal regions originally was not completely consistent with the genetic model of the day. All of that eventually harmonized. There’s little controversy on any of these grounds now because we have much greater appreciation of the origins of human genetic-variation patterns.

In particular, a primary point of debate early on was whether recombination hotspots were really present throughout the genome and whether they were required to explain the data we were observing. This was all resolved by much more detailed studies and analysis of the HapMap data and finding that, indeed, much of human recombination occurs in discrete hotspots.

SW:  What do you mean by a recombination hotspot?

You can imagine that recombination between two chromosomes could happen anywhere along the chromosome, creating new combinations of alleles, or new haplotypes, that span the crossover point. Since humans have been around for tens of thousands of generations, if recombination did happen anywhere in the chromosome, it would be constantly creating new assortments of alleles, and we wouldn’t see the significant correlations of SNPs that we observed. So what turns out to be the case is that from generation to generation recombination tends to happen preferentially at particular points on the chromosome: hotspots. That leaves long segments of the genome—tens to hundreds of kilobases, in some cases—where recombination essentially never happens. In those segments, where recombination never occurs, there’s a great deal of redundancy in genetic information provided by each individual polymorphism.
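The argument above can be illustrated with a textbook relation: in a large, randomly mating population, the association D between two sites decays as D_t = D_0 (1 - c)^t, where c is the per-generation recombination fraction between the sites and t is the number of generations. The sketch below is an editorial illustration that ignores drift, mutation, and selection; the recombination fractions and the generation count are assumed values, not figures from the interview.

```python
def ld_remaining(recomb_fraction, generations):
    """Fraction of the initial association D_0 left after the given generations."""
    return (1.0 - recomb_fraction) ** generations

GENERATIONS = 10_000  # assumed, for illustration only

# Assumed recombination fractions: a pair of sites separated by a hotspot
# versus a pair lying inside a segment where crossovers are very rare.
for label, c in [("pair spanning a hotspot", 1e-3),
                 ("pair inside a cold segment", 1e-6)]:
    remaining = ld_remaining(c, GENERATIONS)
    print(f"{label}: {remaining:.5f} of the initial association remains")
```

Under these assumptions the association across the hotspot is essentially erased, while the pair inside the cold segment keeps nearly all of it, which matches the picture of long, highly correlated segments separated by discrete hotspots.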

SW:  Is there a reliable theory for what determines a hotspot?

 
Highly Cited Papers by Mark J. Daly and Colleagues, Published Since 1998
(Ranked by total citations)

Rank | Paper | Cites
1 | Mouse Genome Sequencing Consortium (R.H. Waterston, et al.), "Initial sequencing and comparative analysis of the mouse genome," Nature, 420(6915): 520-62, 2002. | 1,973
2 | S.B. Gabriel, et al., "The structure of haplotype blocks in the human genome," Science, 296(5576): 2225-9, 2002. | 1,340
3 | Int.'l SNP Map Working Group (R. Sachidanandam, et al.), "A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms," Nature, 409(6822): 928-33, 2001. | 1,087
4 | J.C. Barrett, B. Fry, J. Maller, "Haploview: Analysis and visualization of LD and haplotype maps," Bioinformatics, 21(2): 263-5, 2005. | 1,088
5 | D. Altshuler, et al., "A haplotype map of the human genome," Nature, 437(7063): 1299-1320, 2005. | 1,040

SOURCE: Thomson Reuters Web of Science®

There are a number of theories, but nothing’s been demonstrated to explain the whole picture yet. Certainly nothing as simple as a universal signature sequence pattern. There are correlations with specific repetitive elements of DNA, but nothing that explains very conclusively why a certain region is a hotspot. It may have to do more with the structure of how DNA is packaged in the chromosome and where it’s open enough to receive recombination machinery. There are likely to be epigenetic influences.

SW:  HapMap was published in 2005. How would you characterize its success so far?

It’s been an invaluable tool in the development of the genome-wide association genotyping arrays that are now in use in large-scale studies, and it continues to play a critical role in the analysis of those data. In essence, it’s one of the components that have spearheaded what can only be described as a revolution in our ability to discover genes for complex human diseases. Just two or three years ago, we had only a handful of conclusive associations to complex diseases. In the studies just published in the last two years, there are clearly in excess of a hundred new, validated associations—and evidence that we've only scratched the surface! That number will continue to go up by leaps and bounds as the technology improves and as groups work more collaboratively to bring larger and larger sample sizes to bear on the challenges in different disease areas.

SW:  Prior to HapMap, the percentage of purported associations that were replicated in later studies was extremely low. Has that changed significantly with HapMap?

When I cite a number in excess of a hundred, these are all genes that meet the most rigorous level of statistical significance and have been conclusively replicated. I expect that number to go up considerably, because I’m not even counting the number of promising studies that are just coming to publication for the first time. That’s the biggest shift in the field: findings are published and there’s a dramatic improvement in the rigor of the analyses and replication in these studies. It’s become quite clear now why the literature prior to the last couple of years had such a spotty performance. The studies being done now involve very, very large sample sizes, and the effects being discovered that are reliably replicated are weak ones. Up until recently, even if the researchers had guessed the genes correctly, the studies were typically not done in sample sizes large enough to document the modest-effect associations. Consequently, most of what was published early on was statistical fluctuation, which is what inevitably happens when you’re not testing enough SNPs and when the studies are not adequately powered to find the true effects you’re looking for.
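The point about sample size and weak effects can be made concrete with a rough power calculation. The Python sketch below is an editorial illustration, not the analysis used in any study discussed here: it assumes a simple two-sided z-test comparing allele frequencies between cases and controls, a Bonferroni correction for the number of SNPs tested, and invented allele frequencies (0.30 in controls, 0.33 in cases) and SNP count.

```python
from math import sqrt
from statistics import NormalDist

def bonferroni_alpha(n_tests, family_alpha=0.05):
    """Per-SNP significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests

def approx_power(p_controls, p_cases, n_per_group, alpha):
    """Approximate power of a two-sided allele-frequency z-test."""
    alleles = 2 * n_per_group  # each person contributes two chromosomes
    se = sqrt(p_cases * (1 - p_cases) / alleles
              + p_controls * (1 - p_controls) / alleles)
    z_effect = abs(p_cases - p_controls) / se
    z_crit = -NormalDist().inv_cdf(alpha / 2)
    return NormalDist().cdf(z_effect - z_crit)

alpha = bonferroni_alpha(n_tests=500_000)  # assumed number of SNPs tested
for n in (500, 5_000, 20_000):             # assumed cases (= controls) per study
    print(f"n = {n:>6} per group: power ~ {approx_power(0.30, 0.33, n, alpha):.4f}")
```

Under these assumptions a study with a few hundred cases has almost no chance of detecting the effect at a genome-wide threshold, whereas tens of thousands of samples make detection very likely, which is consistent with the move toward pooled, collaborative scans described in the interview.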

SW:  Are there specific diseases for which HapMap has been particularly useful?

The diseases for which this method has been most successful are those in which there have been multiple scans, usually three or more studies undertaken simultaneously, and then the results or the data pooled to perform more comprehensive assessments—type 2 diabetes, for instance. There are some new publications on genes related to lipid levels—HDL and LDL cholesterol and triglycerides—and some on adult stature that will be coming out soon. Crohn’s disease has seen some particularly noteworthy results. But these conditions have had a lot of effort placed in them. Coronary artery disease and breast cancer also have some very noteworthy results. It remains to be seen how extensible that is to other diseases. We’re very keen to see how tractable or intractable psychological diseases are—bipolar disorder and schizophrenia, for example. As yet, the experiments are not quite so advanced there as they are in autoimmune disease or in cardiovascular disease. For many diseases, they’re just now getting underway, or maybe one study has been completed. In most cases we’ve found the greatest success after two or three groups have agreed to work together and have merged scan data and pooled resources for definitive follow-up studies. We have high hopes as the technology continues to improve; genotyping technology is dramatically improved from what we had to work with when the first studies began, say a year and a half ago.

SW:  Could you play devil’s advocate for a second and tell us where these analyses are most likely to go wrong, to lead to erroneous results, if they do so?

These studies can go awry in a number of ways. Most of them involve inadequate attention to study-design quality and data quality. It’s not so much the sophistication of the analysis, but that the really successful studies have been those that paid very close attention to the quality and accuracy of the lab work, of the randomization procedures for cases and controls in the lab, etc. They’ve paid keen attention to quality control of the data coming out of the lab. One problem is that many studies are done with very new genotyping technology, because everyone is keen to do their studies with the latest chips. So, almost by definition, many studies will be done with genotyping products and algorithms that have only existed for a couple of months. This can be a problem if careful attention is not paid to every element, from DNA through the lab process through computational analysis. You have to pay real attention to DNA quality and laboratory procedures, and scrutinize the data quality in every possible way. That’s the hallmark of the successful studies.

SW:  Should we expect these studies to someday clarify the environmental influences of diseases like type 2 diabetes and cardiovascular disease?

What people have to realize is that these gene-discovery projects simply give pointers to genes and regions that may be involved. We then have to embark on much more detailed studies of those genes and regions to identify the precise causal variants and what they’re doing. When we get to that point, we can begin to ask questions: does this potentially synchronize with environmental covariates? Does it open up targets for therapeutics and so forth? But that’s a long way off. There’s a lot of follow-up work to do—although that’s a good problem to have. We’re very enthusiastic about the results to date and the potential for increasing our understanding of these diseases. But there’s a lot of work to do. We don’t want to get too far ahead of ourselves.

Keywords: Mark J. Daly, haplotypes, HapMap, Eric Lander, recombination, disease genes.

 



2008 : June 2008 - Author Commentaries : Mark J. Daly