Interesting new paper in PLoS Genetics by Gil McVean of the University of Oxford, UK: A Genealogical Interpretation of Principal Components Analysis.
Genetic variation in natural populations typically demonstrates structure arising from diverse processes including geographical isolation, founder events, migration, and admixture. One technique commonly used to uncover such structure is principal components analysis, which identifies the primary axes of variation in data and projects the samples onto these axes in a graphically appealing and intuitive manner. However, as the method is non-parametric, it can be hard to relate PCA to underlying process. Here, I show that the underlying genealogical history of the samples can be related directly to the PC projection. The result is useful because it is straightforward to predict the effects of different demographic processes on the sample genealogy. However, the result also reveals the limitations of PCA, in that multiple processes can give the same projections, it is strongly influenced by uneven sampling, and it discards important information in the spatial structure of genetic variation along chromosomes.
The distortions introduced by unequal sampling are quite obvious in figure 3, where hypothetical populations arranged on an equidistant lattice (A) are distorted simply by increasing the size of some of the samples: when only one sample is enlarged, that population appears more distinct (B); when several samples are enlarged by different amounts, they artificially dominate the PCA plot, and the smaller samples lose their distinctiveness.
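To get a feel for this effect, here is a minimal sketch in Python. It is not the paper's lattice or its simulation method: it is a toy model I am assuming for illustration, with three mutually equidistant population centroids plus Gaussian noise, and the function name `pca_eigenvalues` and all parameter values are hypothetical. It compares the variance captured by the first two principal components under equal and uneven sampling.

```python
import numpy as np

def pca_eigenvalues(sizes, n_features=50, noise_sd=0.05, seed=0):
    """Return the two largest PCA eigenvalues for three toy populations.

    Assumed toy setup (not from the paper): population k has centroid e_k,
    a unit vector, so the three centroids are mutually equidistant.
    `sizes` gives the number of individuals sampled from each population.
    """
    rng = np.random.default_rng(seed)
    centroids = np.eye(3, n_features)          # rows are e_0, e_1, e_2
    labels = np.repeat(np.arange(3), sizes)    # population label per individual
    noise = rng.normal(0.0, noise_sd, (labels.size, n_features))
    data = centroids[labels] + noise
    centered = data - data.mean(axis=0)        # centre on the overall sample mean
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values[:2] ** 2 / (labels.size - 1)

equal = pca_eigenvalues([20, 20, 20])
uneven = pca_eigenvalues([200, 20, 20])
print("equal sampling  PC1/PC2 variance ratio:", equal[0] / equal[1])
print("uneven sampling PC1/PC2 variance ratio:", uneven[0] / uneven[1])
```

With equal sample sizes the two leading components carry nearly the same variance, as the symmetry of the setup suggests. Once one population is oversampled, the overall mean is pulled towards it and PC1 becomes the axis separating that population from the other two, even though the underlying distances between populations have not changed.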
I had already noticed this influence of sampling choice, so I am glad to see it clearly explained and a due warning issued.
... the results also demonstrate the way in which sampling schemes can influence PC projections and how similar projections can arise from very different demographic scenarios. Consequently, using these results to motivate inference from PCA about underlying demographic process may prove difficult.
There is more to the paper than this. The author makes a considerable effort to show how PCA can be used to infer admixture proportions when admixture is known to have occurred, whether or not the source populations are themselves known.
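As a rough illustration of the simplest case, where both source populations have been sampled, the sketch below estimates an admixture proportion from positions on PC1. Everything in it is an assumption of mine rather than the paper's procedure: the allele frequencies are drawn at random, the admixed group's frequencies are a simple mixture of the two sources, and the function name `estimate_admixture` and its parameters are hypothetical. The idea is that projection onto a principal component is linear, so the admixed group's mean position should fall between the two source means in proportion to the mixture.

```python
import numpy as np

def estimate_admixture(alpha=0.3, n_per_pop=50, n_snps=2000, seed=1):
    """Estimate the fraction of ancestry from source 1 using PC1 positions.

    Assumed toy model: the admixed population has allele frequencies
    alpha * p1 + (1 - alpha) * p2, and SNPs are treated as independent.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical allele frequencies for the two source populations.
    p1 = rng.uniform(0.1, 0.9, n_snps)
    p2 = rng.uniform(0.1, 0.9, n_snps)
    p_adm = alpha * p1 + (1.0 - alpha) * p2
    # Diploid genotypes (0, 1 or 2 copies per SNP); rows are ordered
    # source 1, source 2, admixed.
    genotypes = np.vstack(
        [rng.binomial(2, p, (n_per_pop, n_snps)) for p in (p1, p2, p_adm)]
    )
    centered = genotypes - genotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = centered @ vt[0]                     # coordinates on the first PC
    src1, src2, adm = pc1.reshape(3, n_per_pop).mean(axis=1)
    # Relative position of the admixed mean between the two source means.
    return (adm - src2) / (src1 - src2)

print("true proportion 0.3, estimated:", estimate_admixture(alpha=0.3))
```

The arbitrary sign of the principal component cancels in the ratio. The case where the source populations are unknown needs more than this and is not covered by the sketch.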
Worth a read.