How to Travel Through Time With DNA

Two posts ago, I talked about how the timely intervention of fungi during the Carboniferous period brought an end to the most prolific coal-producing period in the planet's history, in part by taking some incredibly smug plants down a notch. But the story led to one very important question: how do we know what happened millions of years ago? In my previous post, I started to answer that question by talking about phylogenetics, which uses the principle that, given a group of species, we can use what characters they have in common to infer how they are related to one another. Today, in my final post in this series, I'm going to talk about how DNA sequencing turbocharged phylogenetics (then called cladistics), turning the field from the compilation of genealogies to something much more powerful.

But – you might be inclined to interject, having read my previous two posts – the events you talked about happened millions of years ago. Surely there's no DNA left that's that old. How can looking at DNA from living organisms tell us about dead ones? This is an excellent question. To answer it, we need a volunteer. How about you?


You happen to have an uncanny resemblance to the figure who likes to nonchalantly stand next to dinosaurs, but that's not important right now. What is important is that, like the rest of your species, you have DNA. Let's extract a very small amount and sequence it. The raw data look a bit messy, but the computer helpfully cleans it up into something that looks a little like this:

[Image: DNA strand 1.png]

Each nucleotide base in the sequence is represented by a block. The blocks are color-coded so we can easily see which of the four nucleotide bases falls where. Green corresponds to cytosine, red to adenine, blue to thymine, and yellow to guanine. Because we're scientists and we like large samples, let's take some more sequences from other people, focusing on the same location in the genome:

These sequences all look pretty similar to one another, but not to yours. It looks like you've got a mutation at the fifth base: your thymine should actually be a guanine. This is nothing to worry about, since we pulled our stretch of DNA from a non-coding portion of your genome. In the non-coding regions, variability tends to be pretty high, because this DNA is selectively neutral – since it doesn't have an effect on what genes are expressed, it can't have an effect on how many surviving offspring you produce, the benchmark for a fit person. There are tons of exceptions to this general rule, but that's a story for another time. For now, let's keep it simple.
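If you'd rather let a computer do the squinting, here is a minimal Python sketch of that comparison: tally the most common base at each position across the sample, then list where one sequence disagrees. The sample sequences are invented for illustration (only CCAGGCCA and the fifth-base thymine come from this post), and the function names are my own.

```python
from collections import Counter

def most_common_bases(sequences):
    """Return the most frequent base at each position of equal-length sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )

def differences(sequence, reference):
    """List (1-based position, reference base, observed base) for each mismatch."""
    return [
        (position, ref, obs)
        for position, (ref, obs) in enumerate(zip(reference, sequence), start=1)
        if ref != obs
    ]

# Invented sample: six other people plus you.
others = ["CCAGGCCA"] * 5 + ["CCAGGCCT"]
yours = "CCAGTCCA"

typical = most_common_bases(others + [yours])
print(typical)                      # CCAGGCCA
print(differences(yours, typical))  # [(5, 'G', 'T')]
```

This only works because every sequence here is the same length and already lined up; real reads have to be aligned first.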

Because it looks like the most common sequence in this area is CCAGGCCA, we'll call that the consensus sequence: that is, the “official” human sequence for this part of the genome. However, say that you spread your genes by producing offspring. If you have at least two kids, statistically, one of them should have your mutation. If you have at least four kids, the proportion of your sequence in the human population doubles, from 1:7,000,000,000 to 2:7,000,000,000. But why stop there? If your mutant children likewise produce two mutant grandchildren apiece, the proportion of your sequence has doubled again. Over time, with a lot of luck and even more reproduction, your sequence might come to dominate the population and become the consensus sequence. This is a phenomenon called “fixation.” A lot of factors determine whether a mutation becomes fixed and how long it takes to do so: population size, generation time, whether or not mating is random, and again, luck. But if we can make reasonable estimates for those factors, we can determine how many fixed mutations we should see in a stretch of DNA after a given period of time.
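To get a feel for how much of fixation is luck, here is a toy simulation in Python. It assumes a deliberately simplified model that is my own choice, not something from the study: a constant population, one copy of the sequence per individual, random mating, and no selection. Each generation, every offspring copies a randomly chosen parent, and we watch whether a brand-new variant dies out or takes over. The function names and population size are illustrative.

```python
import random

def drifts_to_fixation(pop_size, rng):
    """Follow one new neutral variant until it is lost or fixed.

    Each generation, every one of the pop_size offspring copies a parent
    chosen at random, so the variant's count is resampled from its
    current frequency. Returns True if the variant takes over.
    """
    count = 1  # the variant starts out in a single individual
    while 0 < count < pop_size:
        freq = count / pop_size
        count = sum(rng.random() < freq for _ in range(pop_size))
    return count == pop_size

def fixation_fraction(pop_size, trials, seed=0):
    """Fraction of independent runs in which the variant reaches fixation."""
    rng = random.Random(seed)
    fixed = sum(drifts_to_fixation(pop_size, rng) for _ in range(trials))
    return fixed / trials

print(fixation_fraction(pop_size=20, trials=2000))
```

In this model, a new neutral variant is expected to take over about one time in pop_size, so with 20 individuals roughly 5% of runs should end in fixation and the rest in extinction. Scale that up to billions of people and you can see why I keep saying "luck."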

This is an incredibly valuable concept when we go back to comparing species. Instead of, as we were doing before, making a table of differences (many of which were subjective and/or lazy), we can get an objective measure of how far apart two species are solely from their genes. All we have to do is find homologs, or sequences of DNA that share the same common ancestor. For example, the genes that produce hemoglobin in chimpanzees and those that produce hemoglobin in humans are homologous to one another, because it's highly probable that hemoglobin was around well before chimpanzees or humans, seeing as it's a common trait possessed by all vertebrates.

We can also refer to physical traits as homologs – for example, chimp feet and human feet are homologous because they presumably arose from the same structure, which was likely also a foot. However, the problem with this is that there's nothing useful to quantify – we can talk about how the shape of the foot is different for chimps and humans, how our foot bones are longer, but we can't say how many changes had to occur for our feet to change shape.

This is where DNA analysis steps in. All we need to do is pick a stretch of DNA that (a) is common to all the species we're studying, (b) can mutate without killing the organism, and (c) is long enough to be useful but not so long as to be unwieldy, both to extract and to analyze computationally. Once we find that region, we can compare the sequences of different species and build ourselves a table as we did before, except this time, instead of physical differences, we use differences in sequences. This process allows us to build a huge table of differences, sometimes numbering in the hundreds or thousands, that is much more objective and accurate than our previous method.
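As a rough sketch of what that table looks like, the Python below counts the positions at which each pair of sequences differs. The sequences are invented and far too short to mean anything, and I'm assuming they are already aligned to the same length; real analyses align the sequences first and use statistical models of mutation rather than raw counts.

```python
from itertools import combinations

def count_differences(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def difference_table(aligned):
    """Map each pair of species names to its number of differing positions."""
    return {
        (name_a, name_b): count_differences(aligned[name_a], aligned[name_b])
        for name_a, name_b in combinations(aligned, 2)
    }

# Invented, pre-aligned sequences for illustration only.
aligned = {
    "human":   "CCAGGCCATT",
    "chimp":   "CCAGGCCATA",
    "gorilla": "CCAGTCCAGA",
    "mouse":   "CTAGTGCAGC",
}

table = difference_table(aligned)
for pair, n_differences in sorted(table.items(), key=lambda item: item[1]):
    print(pair, n_differences)
```

The pairs with the fewest differences print first, which is the raw material for deciding who is most closely related to whom.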

To take this idea one step further, we can use genes that code for functional products to determine when a particular strategy arose. Let's go back to the white rot fungi – when did the first of these fungi arise out of the mists of history and begin chowing on wood? What the authors of the article I summarized did was look at the class of enzymes responsible for white rot fungi's ability to break down wood. They analyzed a ton of fungi presumably related to the white rot evolutionary lineage and found the one that was the least different – in terms of number of mutations – from the white rot fungi, but lacked the white rot enzymes. From there, they were able to use fossil evidence to determine when other major events in that lineage happened – they knew, for example, about when the Ascomycetes and the Basidiomycetes parted ways – to create a scale showing how many mutations they should see over a given time period. After that, all it took was applying that scale to estimate when that very first peroxidase evolved.
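The scale-building step boils down to arithmetic that can be sketched in a few lines of Python. This is the simplest possible version, assuming mutations pile up at one steady rate everywhere in the tree; it is not the method the study itself used, and every number below is made up purely to keep the math easy.

```python
def years_per_difference(calibration_differences, calibration_age_years):
    """Build the scale: years of separation represented by one difference.

    calibration_differences is the number of differences between two living
    lineages whose split has been dated (for example, from fossils).
    """
    return calibration_age_years / calibration_differences

def estimate_split_age(differences, scale):
    """Apply the scale to a pair of lineages whose split has no known date."""
    return differences * scale

# Hypothetical calibration: two lineages that differ at 100 positions
# and are taken to have split 500 million years ago.
scale = years_per_difference(100, 500_000_000)

# Hypothetical target: two lineages that differ at 40 positions.
print(estimate_split_age(40, scale))  # 200000000.0
```

Because both the calibration pair and the target pair are measured the same way (differences between two living lineages), there is no need to work out how many of the changes happened on each branch.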

Piece of cake, right? Well, not really. When it comes to phylogenetics, the Data Monster gets positively ravenous. This approach requires at least several sequences from each species being analyzed (and the Floudas group, who authored the study, repeated this process for 31 species of fungi). Furthermore, if you don't know ahead of time where your target enzyme is, you'll need the whole genome of the organism so you can look for it. Modern advances in technology have made it so this only takes a couple of months, instead of years. Then you need to make comparisons, which for large studies still requires the use of a supercomputer. So, while the researchers can thank some clever reasoning for making this project possible, it still took a lot of concerted effort to travel back to those ancient swampy forests where an obscure but crucial piece of history was made.