Skip to content

petrelharp/local_pca

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Local PCA/population structure (lostruct)

If you use this method, please cite Li & Ralph 2019:

Local PCA Shows How the Effect of Population Structure Differs Along the Genome, Han Li and Peter Ralph, Genetics January 1, 2019 vol. 211 no. 1 289-304.


@article {li2019local,
	author = {Li, Han and Ralph, Peter},
	title = {Local PCA Shows How the Effect of Population Structure Differs Along the Genome},
	volume = {211}, number = {1}, pages = {289--304}, year = {2019},
	doi = {10.1534/genetics.118.301747}, publisher = {Genetics}, journal = {Genetics}
}

Note: a prototype python implementation by Joseph Guhlin is available at github:jghulin/lostruct-py.

Installation

To install the package, make sure you have devtools (by doing install.packages("devtools")), and then running

install.packages("data.table")
devtools::install_github("petrelharp/local_pca/lostruct")
library(lostruct)

Note: the library is called lostruct.

Using the R package

The example scripts in the directories above mostly work without the R package. To start using the code on your own data, have a look at these files:

  • A quick example : in four lines of code, reads in chromosome 22 from a TPED, and does local PCA.

  • Setting up data : after documenting where the data are from, does local PCA on a small subset of the whole dataset, to establish how the functions work.

  • Drop in your own data and make a report with the scripts provided in the templated/ directory, modified from the Medicago analysis.

Prerequisites

  • To use the functions to read in windows from a VCF or BCF file, you will need bcftools.
  • To compile the example report, you probably want templater.

Quick start

The three steps are:

  1. Compute the local PCA coordinates - done with eigen_windows().
  2. Compute the distance matrix between windows - done with pc_dist() on the output of eigen_windows().
  3. Visualize - whatever you want; MDS is implemented in R with cmdscale().

Data formats:

The function eigen_windows() basically wants your data in a numeric matrix, with one row per variant and one column per sample (so that x[i,j] is the number of alleles that sample j has at site i). If your data are already in this form, then you can use it directly.

We also have two methods to get data in from standard formats, tped and vcf. Neither are extensively tested: double-check what you are getting out of them.

  • TPED: the read_tped() function will read in a tped file and output a numeric matrix like the above. For instance:

    snps <- read_tped("mydata.tped")
    
  • VCF, in memory: the read_vcf() function does the same. For instance:

    snps <- read_vcf("mydata.vcf")
    
  • BCF, not in memory: eigen_windows() instead of a matrix can take a function that when called returns the submatrix corresponding to the appropriate window. (see documentation) Since we only need one window in memory at a time, this reduces the memory footprint. We use bcftools to extract the windows, so you need bcftools, and your vcf file must be converted to bcf (or bgzipped) and indexed. To do this, for instance, run:

    bcftools convert -O b mydata.vcf > mydata.bcf
    bcftools index mydata.bcf
    

    Once you have this, the function vcf_windower() will create the window extractor function. For instance,

    snps <- vcf_windower("my_data.bcf",size=1e3,type='snp')
    snps(10)
    

    will return the 10th window of 1,000 SNPs in the file my_data.vcf.

In any case, the next step is:

pcs <- eigen_windows(snps,k=2)
pcdist <- pc_dist(pcs,npc=2)

which gives you pcs, a matrix whose rows give the first two eigenvalues and eigenvectors for each window, and pcdist, the pairwise distance matrix between those windows.

Standalone code

Also included in this repository is code we used to analyze the datasets in the paper (before the R package was written). The general order to see the code in each directory is

  1. recode : turn bases into numbers
  2. PCA : find local PCs
  3. distance : compute distance matrix between windows from local PCs
  4. MDS : visualize the result

There are standalone examples for each of the three datasets studied:

POPRES (Homo sapiens, SNP chip data from a few worldwide populations)

Chromosome 1 is the example given. See also popres_example.R for an example of some steps using the package.

DPGP (Drosophila melanogaster population genome project)

Chromosome 3L is the example given .

Medicago (Medicago truncatula hapmap)

For Medicago, it calculates the pairwise distance for all 8 chromosome together and then apply MDS and use subset of the whole MDS result for each chromosome.

A note on implementation:

This method works through the genome doing something (PCA on the covariance matrix) one window at a time. Because of this, it can be frustratingly slow to first load the entire dataset into memory. There are several methods implemented here to avoid this; for instance, vcf_windower() which is used to compute PCs for the medicago data. The interface is via a function that takes an integer, n, and returns a data frame of the genomic data in the nth window.