Structural Variation in the Human Genome

Michael Snyder
March 2, 2010
Genetic Variation
Among People
Single nucleotide polymorphisms
(SNPs)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
0.1% difference among
people
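The two example sequences above can be compared position by position. The sketch below is illustrative only: the function name `snp_positions` is hypothetical, and the genome size of about 3 billion bases is an assumption used to turn the 0.1% figure into a rough count.

```python
def snp_positions(seq_a, seq_b):
    """Return 0-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two sequences shown on the slide.
person_1 = "GATTTAGATCGCGATAGAG"
person_2 = "GATTTAGATCTCGATAGAG"

diffs = snp_positions(person_1, person_2)
print(diffs)  # the single G/T difference

# Assumption: a haploid genome of roughly 3 billion bases.
# A 0.1% difference then corresponds to about 3 million positions.
expected_differences = 3_000_000_000 // 1000
print(expected_differences)
```

Under that assumption, 0.1% works out to roughly 3 million differing positions per person.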
Mapping Structural Variation in Humans
CNVs: >1 kb segments
- Thought to be common: 12% of the genome (Redon et al. 2006)
- Likely involved in phenotype variation and disease
- Until recently most methods for detection were low resolution (>50 kb)
Size Distribution of CNV in a Human Genome
Why Study Structural
Variation?
• Common in “normal” human genomes; a major cause of phenotypic variation
• Common in certain diseases, particularly cancer
• Now showing up in rare diseases: autism, schizophrenia
Most Genome Sequencing Projects Ignore SVs
Project
Technology
Paired
End
SNPs;
Short
Indel
3M;
0.3M
3M;
0.2M
3M
SVs
New
Seq.
Genotype
Reference
0.2M (>
1000bp)
Limited
1M
Limited
No
No
Limited
No
No
3M;
0.1M
4M; 10K
2.7K
(>100bp)
0.1K
No
No
No
No
Levy et al.,
2007
Wheeler et
al., 2008
Pushkarev et
al., 2009
Wang et al.,
2008
Bentley et
al., 2008
5.5K
(unknown
definition)
Limited
No
No
McKernan et
al., 2009
No
No
Ahn et al.,
2009
Kim et al.,
2009
Drmanac et
al., 2009
European-Venter Sanger
Yes
European-Watson
European-Quake
Asian
454
No
Helicos
No
Illumina
Partially
HapMap
Sample;
Yoruban 18507
HapMap
Sample;
Yoruban 18507
Korean
Illumina
Yes
SOLiD
Partially
4M;
0.2M
Illumina
Yes
3M
Korean-AK1
Illumina
Yes
~300 CNVs
No
No
Three human
genomes
Complete
Genomics
Yes
Limited
Illumina
No
Limited (50-90K block
substitutions)
Limited
No
AML genome &
normal
counterpart
AML genome
3.45M;
0.17M
3.2-4.5M;
0.3-0.5M
3.8M;
0.7K
No
No
Ley et al.,
2008
Illumina
Yes
64
Limited
No
No
Melanoma
genome
Lung cancer
genome
Illumina
Yes
32K;1K
51
No
No
SOLiD
Yes
23K; 65
392
No
No
Mardis et al.,
2009
Pleasance et
al., 2009a
Pleasance et
al. 2009b
Why Not Studied More?
• Often involves repeated regions
• Rearrangements are complex
• Can involve highly repetitive elements
Genome Tiling Arrays
800 bp
25-36mer
High-Resolution CGH with Oligonucleotide Tiling
Microarrays
HR-CGH
Maskless Array Synthesis
385,000 oligomers/chip
Isothermal oligomers, 45-85
bp
Tiling at ~1/100bp nonrepetitive genomic sequence
Detects CNVs at <1 kb
resolution
Urban et al., 2006
With R. Selzer and R. Green
High Resolution Comparative Genomic Hybridization
Nimblegen/MAS Technology: Isothermal Arrays Covering Chromosome 22
Mapped Copy Number Variation in VCFS Patients
Resolution ~50-200 bp
[Figure: chromosome 22 HR-CGH profiles for patients 98-135, 99-199 and 97-237, with LCRs A, B, C and D marked]
Urban et al. (2006) PNAS
Mapping Breakpoints of Partial Trisomies of Chromosome 21
verified
verified
With Korenberg Lab, UCLA
Copy Number Variations in the Human Genome
[Figure: signal plotted against chromosome position for Person 1 and Person 2, with regions of extra DNA and missing DNA marked]
Genome Tiling Arrays
800 bp
36mer
PCR Products
Oligonucleotide Array
Massively Parallel Sequencing
AGTTCACCTAAGA…
CTTGAATGCCGAT…
GTCATTCCGCAAT…
High Throughput DNA Sequencing based Methods to detect CNVs/SVs
1. Paired ends
2. Read depth
3. Split read
[Figure: each method illustrated for a deletion, showing the sequenced genome, the reference, and how reads or paired ends map; the read depth panel plots read count against a zero level]
High Resolution-Paired-End Mapping (HR-PEM)
1. Genomic DNA: shear to 3 kb
2. Adaptor ligation (Bio)
3. Circularize
4. Random cleavage into fragments (200-300bp)
5. 454 Sequencing (250bp reads, 400K reads/run)
6. Map paired ends to reference genome
Korbel et al., 2007 Science
Summary of PEM Results
                                   NA15510                NA18505
                                   (European?, female)    (Yoruba, female)
# of sequence reads                > 10 M.                > 21 M.
Paired ends uniquely mapped        > 4.2 M.               > 8.6 M.
Fold coverage                      ~ 2.1x                 ~ 4.3x
Predicted Structural Variants*     473                    825
  Indels                           422                    753
  Inversion breakpoints            51                     72
Estimated total variants*          759                    902
genome-wide
*at this resolution
~1000 SVs >2.5kb per Person
Size distribution of Structural Variants
Cumulative sequence coverage in Mb
(NA18505, shown as function of SV-size)
10kb
[Compare with overall 11M refSNP entries]
[Arrow indicates lower size cutoff for deletions]
High Throughput Sequencing of Breakpoints
1. PCR SVs
2. Cut gel bands and pool
3. Shotgun-sequence PCR mixture using 454 (Genome Sequencer FLX)
4. Assemble contigs and determine breakpoints
>200 SVs Sequenced Across Breakpoints
Analysis of Breakpoints
Homologous Recombination: 14%
Nonhomologous Recombination: 56%
Retrotransposons: 30%
17% of SVs Affect Genes
Non-allelic homologous recombination (NAHR; breakpoints in OR51A2 and OR51A4)
Olfactory Receptor Gene Fusion
Heterogeneity in Olfactory Receptor Genes
(Examined 851 OR Loci)
CNVs affect:
93 Genes
151 genes
Paired-end
• Variations of the method are available for many platforms: Roche, Illumina, Life Technologies
• Long reads are preferable for optimal detection
• Can get different sizes
- Roche 20 kb, 8 kb, 3 kb
- Illumina, SOLiD 1.5 kb
Paired-end:
Advantages/Disadvantages
• Can detect highly repetitive CNVs (LINE, SINE,
etc.)
• Detect inversions as well as insertions and
deletions
• Defines location of CNV
• Relies on confident independent mapping of
each end, problems in regions flanked by
repeats
• Small span between ends limits resolution of
complex regions
• Large span between ends limits resolution of
break points
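The paired-end logic can be sketched as a classification of each mapped pair by its span and orientation on the reference. This is a minimal illustration, not the published PEM pipeline: the `MappedPair` type, the thresholds, and the assumption that a concordant pair maps to opposite strands about 3 kb apart are all hypothetical choices for the example. A real caller would also require several independent pairs supporting the same event.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    chrom: str
    left_pos: int      # reference coordinate of the leftmost end
    right_pos: int     # reference coordinate of the rightmost end
    left_strand: str   # "+" or "-"
    right_strand: str  # "+" or "-"


def classify_pair(pair, expected_span=3000, tolerance=1000):
    """Label one pair by comparing its mapped span to the library span."""
    span = pair.right_pos - pair.left_pos
    # Assumption: concordant ends map to opposite strands.
    if pair.left_strand == pair.right_strand:
        return "inversion"
    # Ends farther apart on the reference than in the sample:
    # sequence is missing from the sample.
    if span > expected_span + tolerance:
        return "deletion"
    # Ends closer together on the reference: extra sequence in the sample.
    if span < expected_span - tolerance:
        return "insertion"
    return "concordant"


pairs = [
    MappedPair("chr1", 1000, 4100, "+", "-"),
    MappedPair("chr1", 20000, 29000, "+", "-"),
    MappedPair("chr1", 50000, 50800, "+", "-"),
    MappedPair("chr1", 70000, 73000, "+", "+"),
]
labels = [classify_pair(p) for p in pairs]
print(labels)
```

The sketch also shows why the span matters: an insertion larger than the library span cannot pull the ends closer than zero, so a short span caps the size of insertion that can be detected this way.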
Sequence Read Depth Analysis
Individual sequence
Reads
Mapping
Reference genome
Counting mapped reads
Read depth signal
Zero level
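The read depth signal amounts to counting mapped reads in fixed-size bins along the reference. The sketch below assumes reads are summarized by their start coordinate and uses a crude "less than half the median" rule to flag candidate deletions; the function name, bin size, and threshold are hypothetical.

```python
from collections import Counter
from statistics import median


def read_depth(read_starts, bin_size, n_bins):
    """Count reads per bin, keyed by the bin containing each read start."""
    counts = Counter(start // bin_size for start in read_starts)
    return [counts.get(b, 0) for b in range(n_bins)]


# Toy data: no reads start between positions 200 and 299.
read_starts = [5, 12, 40, 95, 110, 130, 150, 310, 320, 390]
depth = read_depth(read_starts, bin_size=100, n_bins=4)
print(depth)

# Flag bins whose depth falls below half of the typical (median) depth.
typical = median(depth)
low_bins = [b for b, d in enumerate(depth) if d < 0.5 * typical]
print(low_bins)
```

Raw bin counts are noisy, so in practice they are smoothed or segmented before calling, and bin boundaries limit how precisely a breakpoint can be placed.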
Novel method, CNVnator, mean-shift approach
• For each bin, the attraction (mean-shift) vector points in the direction of bins with the most similar RD signal
• No prior assumptions about number, sizes, haplotype, frequency and density of CNV regions
• Achieves discontinuity-preserving smoothing
• Derived from image-processing applications
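To give a feel for discontinuity-preserving smoothing, here is a generic mean-shift filter of the kind used in image processing, applied to a 1D read depth signal. It is not CNVnator's implementation: the Gaussian kernels, the bandwidths `h_pos` and `h_val`, and the fixed iteration count are assumptions chosen for a small example.

```python
import math


def mean_shift_smooth(signal, h_pos=2.0, h_val=10.0, n_iter=20):
    """Smooth a 1D signal with mean shift in the joint (bin, value) space.

    Each bin starts at its own (position, value) and is repeatedly moved to
    the kernel-weighted mean of all bins. Bins with a very different value
    get almost no weight, so sharp steps in the signal are preserved.
    """
    smoothed = []
    for i, v0 in enumerate(signal):
        x, v = float(i), float(v0)
        for _ in range(n_iter):
            w_sum = x_sum = v_sum = 0.0
            for j, vj in enumerate(signal):
                w = (math.exp(-0.5 * ((x - j) / h_pos) ** 2)
                     * math.exp(-0.5 * ((v - vj) / h_val) ** 2))
                w_sum += w
                x_sum += w * j
                v_sum += w * vj
            x, v = x_sum / w_sum, v_sum / w_sum
        smoothed.append(v)
    return smoothed


# Toy RD signal: a drop to about half depth in bins 4-7.
rd = [100, 102, 98, 101, 50, 52, 49, 51, 100, 99, 101]
smoothed = mean_shift_smooth(rd)
print([round(s) for s in smoothed])
```

Noise within each plateau is averaged out while the step between plateaus stays sharp, which is the property that makes the smoothed signal easy to segment into CNV regions.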
Alexej Abyzov
CNVnator on RD data
NA12878, Solexa 36 bp paired reads, ~28x coverage
Trio predictions
RD vs paired-end
Read Depth
• Difficulty in finding highly repetitive CNVs (LINE, SINE, etc.)
• Uncertain in CNV location
• Uses mutual information of both ends, better mapping and ascertainment in homologous regions
• Ascertains complex regions
• Can find large insertions
• Can be used with paired-end, single-end and mixed data
Paired-end
• Can detect highly repetitive CNVs (LINE, SINE, etc.)
• Defines precise location of CNV
• Relies on confident independent mapping of each end, problems in regions flanked by repeats
• Small span between ends limits resolution of complex regions
• Large span between ends limits resolution of break points
Caucasian trio daughter
RD vs read pair (example)
Not found by short read-pair analysis because the reads could not be mapped confidently
Split-read Analysis
Deletion Event
Reference
Deletion
Read
Breakpoint
Reference
Read
Insertion
Insertion Event
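For a deletion, a split read is one whose two halves match the reference at separate places, with the deleted sequence lying between them. The sketch below uses exact string matching on a toy reference. The function name, the `min_anchor` parameter, and the sequences are hypothetical; real tools use gapped alignment and handle mismatches, repeats, and ambiguous breakpoints.

```python
def split_read_deletion(reference, read, min_anchor=5):
    """Try to explain a read as two reference segments joined by a deletion.

    Returns (deletion_start, deletion_end) as a half-open reference interval,
    or None if no split explains the read.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = reference.find(left)
        if left_start == -1:
            continue
        left_end = left_start + len(left)
        # The right part must map strictly downstream of where the left ends.
        right_start = reference.find(right, left_end + 1)
        if right_start != -1:
            return left_end, right_start
    return None


reference = "ACGTTGCAAGGCTTACCGATGGATCCTAGGCATTCGAAGT"
# Simulate a sample that lacks reference[15:25], then take a read
# that crosses the resulting junction.
sample = reference[:15] + reference[25:]
read = sample[7:23]

breakpoints = split_read_deletion(reference, read)
print(breakpoints)
```

A read that maps contiguously returns `None`. An insertion would show the opposite pattern: the two ends of the read map next to each other on the reference, with unmatched sequence in between.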
Methods to Find SVs
1. Paired ends
2. Split read
3. Read depth (or aCGH)
4. Local Reassembly
[Figure: each method illustrated for a deletion, showing the sequenced genome, the reference, and how reads or paired ends map; the read depth panel plots read count against a zero level]
[Snyder et al. Genes & Dev. ('10), in press]
Simple Local Assembly:
iterative contig extension
-- a mostly greedy approach
Du et al. (2009), PLoS Comp Biol.
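A greedy contig extension can be sketched as follows: start from a seed, repeatedly pick the unused read whose prefix overlaps the contig end the most, and append its non-overlapping tail. This is a generic illustration of the idea rather than the algorithm from Du et al.; the function name, `min_overlap`, and the toy reads are hypothetical.

```python
def extend_contig(seed, reads, min_overlap=4):
    """Greedily extend a seed to the right using exact suffix/prefix overlaps."""
    contig = seed
    unused = list(reads)
    while True:
        best = None  # (overlap_length, read)
        for read in unused:
            # Try the longest possible overlap first; the read must add
            # at least one new base, hence len(read) - 1.
            for k in range(min(len(read) - 1, len(contig)), min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    if best is None or k > best[0]:
                        best = (k, read)
                    break
        if best is None:
            return contig
        k, read = best
        contig += read[k:]
        unused.remove(read)


seed = "ACGTTGCA"
reads = ["GGCTTACCGA", "ACCGATGG", "TGCAAGGCTT"]
contig = extend_contig(seed, reads)
print(contig)
```

Being greedy, the procedure can take a wrong turn in repeats, where several reads overlap the contig end equally well. That is one reason the slide calls the approach only "mostly" greedy.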
SVs with sequenced
breakpoints
BreakSeq enables detecting SVs in Next-Gen
Sequencing data based on breakpoint junctions
Leveraging read data to identify previously known SVs (“Break-Seq”)
Map reads onto library of SV breakpoint junctions
Detection of insertions
Detection of deletions
[Lam et al. Nat. Biotech. ('10)]
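The junction-library idea can be illustrated for a known deletion: join the two flanks into a junction sequence, then ask whether a read matches it while covering both sides of the breakpoint. This is a toy sketch with exact matching, not the BreakSeq software itself; the function names, `flank`, and `min_overhang` are hypothetical.

```python
def deletion_junction(reference, del_start, del_end, flank):
    """Junction sequence for a known deletion of reference[del_start:del_end]."""
    return reference[del_start - flank:del_start] + reference[del_end:del_end + flank]


def spans_junction(read, junction, flank, min_overhang=4):
    """True if the read matches the junction and overhangs both sides of it."""
    pos = junction.find(read)
    if pos == -1:
        return False
    # The breakpoint sits at index `flank` within the junction sequence.
    return pos + min_overhang <= flank and pos + len(read) >= flank + min_overhang


reference = "ACGTTGCAAGGCTTACCGATGGATCCTAGGCATTCGAAGT"
junction = deletion_junction(reference, del_start=15, del_end=25, flank=10)

supporting = spans_junction("GGCTTACTAGGC", junction, flank=10)
flank_only = spans_junction("GCAAGGCT", junction, flank=10)
print(junction, supporting, flank_only)
```

A read that lies entirely within one flank matches the reference equally well, so only reads that overhang both sides count as support for the deletion allele.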
Applying BreakSeq to short-read based personal genomes
Personal genome (ID)                        Ancestry      High support hits       Total hits
                                                          (>4 supporting hits)    (incl. low support)
NA18507*                                    Yoruba        105                     179
YH*                                         East Asian    81                      158
NA12891 [1000 Genomes Project, CEU trio]    European      113                     219
*According to the operational definition we used in our analysis (>1kb events) fewer than 5 SVs were previously reported in these genomes …
[Lam et al. Nat. Biotech. ('10)]
Conclusions
1) SVs are abundant in the human genome
2) Different methods are used to detect them: read pairs, read depth, split reads, new assembly
3) Many SV breakpoints are being sequenced; nonhomologous end joining is common. The breakpoint library can be used to identify SVs.
Acknowledgments
• Jan Korbel
• Alexej Abyzov
• Alex Urban
• Zhengdong Zhang
• Hugo Lam
• Mark Gerstein
454 for Paired End: Tim Harkins, Michael Egholm
SV-CapSeq v1.0 results for deletions
Data set                 Total SVs   Confirmed   Confirmation rate   Confirmation rate
                                                                     (coverage corrected)*
1KG selected events      1839        307         17%                 20%
Pre-confirmed            184         134         73%                 88%
PCR confirmed            294         101         34%                 41%
Pre- & PCR confirmed     56          41          73%                 88%
PCR non-validated        940         105         11%                 13%
454 PEMer deletions      575         283         49%                 59%
Combining 3 captures/elutions (1 per member of CEU trio) and 1+(2x0.5) 454 Titanium runs
*For 2x allelic coverage and breakpoints at least 20 bp away from read ends
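The uncorrected confirmation rates are simply confirmed/total, rounded to a whole percent. A short check of that arithmetic, using the counts from the table, is below. The helper name is hypothetical, and the coverage-corrected column is left out because it cannot be derived from these counts alone.

```python
def confirmation_rate(total, confirmed):
    """Confirmed events as a whole-number percentage of all events."""
    return round(100 * confirmed / total)


# (total SVs, confirmed) per data set, as listed in the results table.
results = {
    "1KG selected events": (1839, 307),
    "Pre-confirmed": (184, 134),
    "PCR confirmed": (294, 101),
    "Pre- & PCR confirmed": (56, 41),
    "PCR non-validated": (940, 105),
    "454 PEMer deletions": (575, 283),
}

rates = {name: confirmation_rate(total, confirmed)
         for name, (total, confirmed) in results.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```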
SV Junction and Identification
[Lam et al. Nat. Biotech. ('10)]
Contents of the SV-CapSeq array v1.0
2.1 million oligomers tiling the target regions of the genome:
1839 deletion CNVs from (mostly) short read Solexa data (1000 Genome Project)
From long read 454 paired-end data:
575 deletion CNVs
296 insertions CNVs
191 inversions SVs
(plus Split-Read indel predictions, Zhengdong Zhang)
Validations by prediction set
Validation rate by prediction set
Confirmation rate
[Figure: read depth analysis of a ~6500 bp deletion CNV on chromosome 7 (position axis in Mbp). Tracks: sequence read depth (RD) signal; array capture long-read seq; PCR primers; multi-method prediction (short-read and array); original prediction from the set of 1839. Coordinates shown: 12,988,627 and 12,995,076; 12,988,735 and 12,996,115; 12,988,825 and 12,994,750]
SV-CapSeq Analysis of Structural Variation in the human genome
Ongoing work:
-Develop analysis pipelines for insertion and inversion SV-CapSeq data
-Analyze nature of off-target CapSeq reads: cross-hybridization and cross-mapping
-Design improved SV-CapSeq array
Goal
Sequence across n x 10,000 SV breakpoints with a single capture and less than
one 454 run or ideally using Solexa-Illumina
Important for precision CNV/SV screens and high-quality human genome sequencing
Analysis of Genomic Structural Variation
-exact sizes and breakpoint sequences of CNV/SV are difficult to define but important
for functional understanding
-in the absence of long-read deep whole-genome sequencing combining arrays and
sequencing allows high-throughput validation and breakpoint analysis
SV-CapSeq Design v2.0:
For Pilot2/DeepCov:
Total SVs -- 3946 (set of CNV used by Jan Korbel for PCR primer design/round 2; only CEU trio)
Deletions -- 2550
Insertions -- 1396 (includes mobile elements)
Total bases to be covered -- 4,784,597
Expected coverage -- 7x (for diploid genome with 500,000 of 400 bp reads by 454)
SV-CapSeq Design v2.0:
For Pilot1/LowCov
NA12003 -- CEPH male
NA18870 -- Yoruba female
NA18953 -- Japanese male
SV selection:
1) All events selected by Jan for PCR validation
2) 250 RD calls from each of the following groups: Yale, CSHL, Einstein
Tiling strategy:
200 bp into outer direction for insertion break point(s)
500 bp into both directions from deletion break points
Total SVs -- 1546
Deletions -- 1438
Mobile elements -- 108
No other insertions
Total bases to be covered -- 2,501,719
Expected coverage -- 8.8x (for diploid genome with 1,000,000 of 400 bp reads by 454)
Computations
• Megablast mapping
– Mismatch score = -1
– Hits with > 90% identity
– At least 40 matching bases
• Best hit placement
– At least one hit has score > 150
– No overlapping hits with score difference < 10
• Selecting candidate reads by intersecting
placements with predicted regions extended by 1kb
• Needleman-Wunsch alignment of candidate reads
with predicted regions (0 gap extend penalty)
Criteria for validation
• Can find two good alignment blocks
(see next slide)
• 50% mutual overlap between predicted
region and gap between the blocks
• Sum of break-point uncertainty < 5 kb
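The overlap and uncertainty criteria can be sketched as follows, reading "50% mutual overlap" as reciprocal overlap: the shared length must be at least half of each interval. That reading, the function names, and the toy coordinates are assumptions. The first criterion (finding two good alignment blocks) is taken as already satisfied, with the gap between the blocks passed in as an interval.

```python
def mutual_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions between half-open intervals."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def is_validated(predicted, gap, left_uncertainty, right_uncertainty,
                 min_overlap=0.5, max_uncertainty=5000):
    """Apply the overlap and break-point uncertainty criteria to one event."""
    overlap_ok = mutual_overlap(*predicted, *gap) >= min_overlap
    uncertainty_ok = left_uncertainty + right_uncertainty < max_uncertainty
    return overlap_ok and uncertainty_ok


# Hypothetical event: predicted deletion vs. gap between alignment blocks.
predicted = (1000, 7500)
gap = (1200, 7400)
print(mutual_overlap(*predicted, *gap))
print(is_validated(predicted, gap, left_uncertainty=200, right_uncertainty=100))
```

Taking the smaller of the two fractions keeps a very large prediction from being "confirmed" by a small gap that happens to fall inside it.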
Acknowledgements
Yale University
454 / Roche
UCLA/Cedars-Sinai
Alexej Abyzov
Jason Affourtit
Tal Tirosh-Wagner
Jan O Korbel
Brian Godwin
Fabian Grubert
Jan Simons
Julie Korenberg
(now University of Utah)
Dejan Palejev
Lei Du
Maya Kasowski
Bruce Taillon
Chandra Erdman
Zhoutao Chen
Philip Kim
Tim Harkins
Nicholas Carriero
Michael Egholm
UPenn
April Hacker
Beverly Emanuel
Cornell
Francesca Demichelis
Sunita Setlur
Eugenia Saunders
Andrea Tanzer
Sanger Centre
Mark Gerstein
Jianxiang Chi
Sherman Weissman
Fengtang Yang
Michael Snyder
Yujun Zhang
(now Stanford University)
Matthew Hurles
Nigel Carter
Mark Rubin
NimbleGen-Roche
Rebecca Selzer
Todd Richmond
Matthew Rodesch
Roland Green
Thomas Alberts
Alignment blocks
Genome
Read
Block 1
Criteria: gaps < 5 bp, number of aligned nucs > 10
Block 2
Read-Depth Analysis: Platform comparison (on aCGH calls)
[Figure: Venn diagrams for deletions and for duplications, comparing read-depth calls from Illumina (~5x), SOLiD (~4x) and Helicos (~1x) data; calls matched by >50% of reciprocal overlap]
Size Spectrum of Human Genomic Variation
Scherer et al. 2007
Types of Structural Variation
Hurles et al. 2008
The resolution gap in SV analysis
[Figure: detection methods placed along a size axis from 10^0 to 10^9 bp: Microscope; BAC-, oligo/SNP array, (FISH); HR-CGH-arrays; 454-PEM; (short-read) 2nd-gen sequencing; Sanger sequencing; breakpoint prediction to within PCR range]
[adapted from Lupski et al. Nat Genet 2007]
454-PEM
Paired End Mapping
Korbel et al. Science 19 October 2007: Vol. 318. no. 5849, pp. 420 - 426
Mechanism Distribution
Published SVs
1KG SVs
1. Targeted Sequencing
• hybridize genomic DNA to capture array
• wash away unbound fraction
• Elute off target DNA
• Sequence with 454 Titanium (~400 bp reads)
2. SV-CapSeq analytical pipeline
• Map reads using Megablast; Best hit placement
• Intersect placements with target regions
• Precisely align reads with Needleman-Wunsch to identify
split reads: SV validated, breakpoint sequence found
Array Capture Sequencing
Roche-NimbleGen
SV-CapSeq: Array Design
[Figure (not to scale): regions represented on the capture tiling array. Labels: Deletion, 2000bp 2000bp; Insertion, 500bp; Inversion, 5000bp 5000bp]
Confirmation rate by overlap
CNV discovery: RD vs CGH
RD
CGH
[Daughter in Caucasian trio, NA12878]
[CGH prediction are from Conrad et al., Nature, 2009]
Optimal integration of sequencing technologies:
Local Reassembly of large novel insertions
Given a fixed budget, what are the sequencing coverage A, B and C that can achieve the maximum
reconstruction rate (on average/worst-case)? Maybe a few long reads can bootstrap reconstruction process.
Du et al. (2009), PLoS Comp Biol, in press
Optimal integration of sequencing technologies:
Need Efficient Simulation
Different combinations of technologies (i.e. read lengths) are very expensive to actually test.
Also computationally expensive to simulate.
(Each round of whole-genome assembly takes >100 CPU hrs; thus, simulation exploring 1K possibilities takes
100K CPU hr)
Du et al. (2009), PLoS Comp Biol, in press
Optimal integration of sequencing technologies:
Efficient Simulation Toolbox using Mappability Maps
~100,000 X
speedup
Du et al. (2009), PLoS Comp Biol, in press
Experimental Validation
A) CGH
B) Fiber-FISH (for inversions): without inversion vs. with inversion
C) PCR (often 4 people)
[Figure: panel a compares CGH and PEM; panel b shows PCR gels with lanes A B C D for each SV and marker M at 3000 bp, 1500 bp and 500 bp; panel c shows Fiber-FISH images]
>500 SVs validated
~50% of SVs are in more than one ethnic group