New blogs

Leherensuge was replaced in October 2010 by two new blogs: For what they were... we are and For what we are... they will be. Check them out.

Wednesday, May 6, 2009

First impressions on Alexe 2009. Scrapping mtDNA R off?

As I posted before, a new paper by G. Alexe et al. challenges the existence of the mtDNA macro-haplogroup R as we know it and suggests a heavily modified tree for Eurasian mtDNA (while ratifying the basic global tree at the L, M and N levels).

These are my first impressions as I read it.

From a comparison of our haplogroups to two chimp and one bonobo sequences, and assuming a chimp-human coalescent time of 5 million years before present, we find a human mtDNA TMRCA of 206,000 ± 14,000 years before present.

As the actual Pan-Homo split probably happened more like 8 million years ago (not less than 7 in any case), the correct TMRCA estimate would rather be about 330,000 years BP (206,000 × 8/5), which is much older than normally considered.
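The rescaling can be checked with a few lines of Python. This is only a sketch: it assumes the TMRCA scales linearly with the calibration date (a simple molecular-clock assumption of mine, not something stated in the paper), and the function name `rescale_tmrca` is just an illustrative choice.

```python
def rescale_tmrca(tmrca_years, assumed_split_years, new_split_years):
    """Linearly rescale a TMRCA when the calibration date changes.

    Assumption: the estimate is proportional to the Pan-Homo split date.
    """
    return tmrca_years * new_split_years / assumed_split_years


PAPER_TMRCA = 206_000      # years BP, from the quoted passage
PAPER_ERROR = 14_000       # years, from the quoted passage
PAPER_SPLIT = 5_000_000    # chimp-human coalescent time assumed in the paper

for split in (7_000_000, 8_000_000):
    tmrca = rescale_tmrca(PAPER_TMRCA, PAPER_SPLIT, split)
    error = rescale_tmrca(PAPER_ERROR, PAPER_SPLIT, split)
    print(f"split at {split:,} years -> TMRCA {tmrca:,.0f} ± {error:,.0f} years BP")
```

Under that assumption, a 7-million-year split gives about 288,000 years BP and an 8-million-year split about 330,000 years BP.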

When we create k + 1 clusters, we do not use the previous clustering into k clusters. Nevertheless, we find that the k + 1 clusters consist of k – 1 clusters from the previous clustering plus two more obtained from the splitting of one of the clusters at level k. These recursive bifurcations as k -> k + 1 are strong, inherent features of the data which we interpret as branch splits on the tree. The polymorphisms on branches are identified by a frequency analysis at each split.

This seems to be the core of their k-means analysis method. However, from what I have seen of k-means clustering in autosomal DNA studies (where it is widely used), the new k+1 clusters do not actually show a split of the previous clusters but rather represent further data that was ignored in them. This data is generally less important numerically for the whole sample but may be most important locally for specific populations. Of course, contrary to what the authors claim, sample size and pre-selection usually do matter, and the K=2 graph will always show the two most important clusters overall, even if they are minimal in some of the populations studied. That is why it is so important to push the k-means analysis as deep as possible, in order to find all possible clusters and their local relevance.
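Whether a k+1 clustering really is the k clustering with one cluster split in two can be tested directly from the two label lists. The following is a minimal sketch of such a check, not the authors' procedure: the function name `bifurcated_cluster` and the toy label lists are made up for illustration.

```python
from collections import defaultdict


def bifurcated_cluster(coarse, fine):
    """Return the coarse label that splits in two, or None.

    `coarse` and `fine` are cluster labels for the same samples at k and k+1.
    A clean bifurcation means every fine cluster sits inside one coarse
    cluster, exactly one coarse cluster has two children, and all the other
    coarse clusters are carried over intact.
    """
    children = defaultdict(set)  # coarse label -> fine labels found inside it
    parents = defaultdict(set)   # fine label -> coarse labels it draws from
    for c, f in zip(coarse, fine):
        children[c].add(f)
        parents[f].add(c)
    if any(len(p) > 1 for p in parents.values()):
        return None  # some fine cluster mixes samples from several coarse ones
    split = [c for c, kids in children.items() if len(kids) == 2]
    intact = [c for c, kids in children.items() if len(kids) == 1]
    if len(split) == 1 and len(intact) == len(children) - 1:
        return split[0]
    return None


# Hypothetical labels for six samples at k=2 and two possible outcomes at k=3.
k2 = [0, 0, 0, 1, 1, 1]
k3_split = [0, 0, 2, 1, 1, 1]      # cluster 0 bifurcates, cluster 1 is intact
k3_reshuffle = [0, 0, 2, 2, 1, 1]  # new cluster takes samples from both

print(bifurcated_cluster(k2, k3_split))      # 0
print(bifurcated_cluster(k2, k3_reshuffle))  # None
```

The first outcome is what the paper reports; the second is the kind of reshuffling I describe above for autosomal data.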

For example, taking the K-means analysis used by Bauchet 2007 as reference, a sample of Europeans in general is likely to show two different clusters at K=2 that appear to form a cline along a S-N axis (this may vary depending on the sampled populations). But when you reach K=5 or K=6, these two components have become diluted to very low levels in all but two population clusters: Eastern Mediterraneans and Finnish. They dominate at K=2 because they have a somewhat wider distribution, but they are minor in most populations and almost non-existent in some cases (Basques for example).

I also took notice of the following criticisms of the basic k-means method of analysis, from Wikipedia k-means++:

However, the k-means algorithm has at least two major theoretic shortcomings:
  • First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.
  • Secondly, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.

So I am not really convinced of how good this method, which ignores the standard SNP-based tree building, is at describing the real mtDNA structure.
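The second shortcoming quoted above is easy to reproduce on a toy case. The sketch below runs plain Lloyd iterations (the basic k-means loop) on the four corners of a 10 × 1 rectangle from two different starting points. The data, the function names and the starting centers are all illustrative choices of mine and have nothing to do with the paper's dataset.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Run plain Lloyd iterations from the given centers; return (centers, sse)."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


corners = [(0, 0), (0, 1), (10, 0), (10, 1)]  # a wide, flat rectangle

_, good_sse = lloyd(corners, [(0, 0), (10, 0)])  # one start on each short side
_, bad_sse = lloyd(corners, [(0, 0), (0, 1)])    # both starts on the same side

print(good_sse, bad_sse)  # 1.0 100.0
```

Both runs converge, but the second one gets stuck pairing top with top and bottom with bottom, and its error is 100 times worse. Making the rectangle wider makes the gap as large as you like, which is what "arbitrarily bad" means here.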

Whatever the case, this is the tree they propose for Eurasian mtDNA (fig. 4 - I ignored fig 4-c because it proposes no changes to known L1, L2 and L3 trees / a = N, b = M):

[Figure: fig. 4 of Alexe et al. 2009, panels a (N) and b (M)]

Notice how many of the proposed bifurcations have no defining mutation/s. Even if they are right, this really challenges the way we usually think of haploid lineages, as defined by uniquely occurring SNPs, which are the ones defining the tree structure.

I am instinctively horrified by this suggestion of "haplogroups" with no shared mutations. Maybe I am completely wrong, but you get the following pseudo-clades that have no defining SNPs at all:
  • A western neo-R, gathering R0, U and JT.
  • N(xR)
  • N(xR,F)
  • N(xR,F,W)
  • N(xR,F,W,A)
  • B'N9b'R5'X - and others downstream of this "node"
  • M(xG)
  • M(xG,M7b)
  • M7a'8'9'10'Q
  • M7a'9'10'Q
  • M7a'9'10
  • M9'10
  • Several "nodes" in haplogroup D
I really do not know what to do with this. I'm simply baffled.
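To make concrete what "no defining SNPs" means in the usual framework: a clade is normally supported by the mutations that all its member lineages share. A minimal sketch follows, with entirely hypothetical lineage names and placeholder mutation labels (none of them are real mtDNA positions). It also simplifies by taking a plain intersection, ignoring whether a mutation is derived relative to the parent node or recurs elsewhere in the tree.

```python
def defining_mutations(haplotypes):
    """Return the mutations carried by every lineage in the group.

    `haplotypes` maps a lineage name to the set of mutations it carries.
    An empty result means the group has no shared marker to define it.
    """
    sets = [set(muts) for muts in haplotypes.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical toy data: "m1".."m6" are placeholders, not real positions.
toy = {
    "lineage_1": {"m1", "m2", "m3"},
    "lineage_2": {"m1", "m4"},
    "lineage_3": {"m5", "m6"},
}

supported = defining_mutations({name: toy[name] for name in ("lineage_1", "lineage_2")})
unsupported = defining_mutations(toy)

print(supported)    # {'m1'}
print(unsupported)  # set()
```

The pseudo-clades listed above are in the second situation: a grouping for which this kind of check returns nothing.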

However, there are suggestions of novel SNP-defined macrogroups too, though they only appear to share single SNP markers in each case:
  • B'N9b'I'R5'X would share the 52131G mutation
  • B*'B4'R5'X would share the 103928A mutation
  • M7a'8'9'10'D'Q (including C and Z in M8) would be defined by 5460G

Regardless of whether this means anything at all, it is significant that, while it is generally accepted that the genus Pan (chimpanzee and bonobo) diverged from us at a single evolutionary node, fig. 5 shows that the different Pan populations have consistently different levels of affinity with Homo mtDNA, with bonobos being slightly more like us in the mtDNA too. Of course, this is very relative, as all Pan groups diverge from us by c. 1350 mutations, while we diverge from the L0'1 consensus root by just a few dozen mutations.

This is all for the moment.
