Structural variation (SV) is now widely recognized as a driver of many diseases, and much research is under way on using SV analysis for both diagnostics and identifying therapeutic targets. But it wasn’t until relatively recently that the importance of SV in disease was understood. Historically, technology has been the limiting factor in understanding SV. Early low-resolution cytogenetic techniques like karyotyping left the impression that SV was a rare occurrence. Later molecular cytogenetic methods such as fluorescence in situ hybridization (FISH), qPCR and comparative genomic hybridization arrays (aCGH) increased resolution and contributed greatly to our understanding of SV, particularly larger rearrangements, but they remain relatively low-resolution, which limits their specificity compared to sequencing.
With the landslide of cheap, high-resolution next-generation sequencing (NGS) data, it would seem that the next generation of SV analysis is upon us. However, the short reads that NGS generates make many types of structural variants difficult or impossible to detect. Because of this, decades-old cytogenetic methods remain the gold standard in many SV analysis applications, leaving a gap in the understanding of critical germline or somatic events. Whole genome mapping, when combined with advances in algorithms and cloud computing, is closing the gap between classic cytogenetics and genomic approaches to SV analysis.
Below is a discussion on SV and its role in disease, cytogenetic and genomic methods of SV analysis, whole genome mapping for SV analysis and how to address informatics challenges in SV analysis. Within each section there are links to relevant papers and sources along with links to more in-depth discussions. Alternatively, all the content is available as an ebook for offline reading.
Structural variation (SV) is most commonly defined as any genetic variation larger than 50 bp and includes deletions, duplications, copy-number variants, insertions, inversions and translocations. It is well known that genetic variation is what makes us different from one another, both in our physical traits and in our susceptibility to disease. Traditional cytogenetic techniques were best at detecting large SV, and the incidence of SV was thought to be rare. Much of the early sequence-based research on genetic variation focused on single nucleotide polymorphisms (SNPs), or single base pair differences in the sequences of our genomes. However, later research found that, of the roughly 0.1% genetic variation between individuals, SV is the primary contributor, with up to 5 times more variability in the human genome being due to SV than to SNPs, and that most SNPs are actually part of larger SVs. One recent article in Nature by Chaisson et al. found more than 27,600 unique structural variants in the human genome.
As the picture of SV became clearer, its implication in heritability and disease also came into focus. A highly cited 2010 paper in the Annual Review of Medicine discussed the promise of SV research, particularly copy-number variation (CNV), for better understanding certain diseases. But a deeper understanding of SV, particularly as it relates to diseases like cancer, has been slower to materialize than hoped. This is due to the inherent limitations of sequencing approaches to detecting structural variants. That is changing with the introduction of whole genome mapping and the availability of rapid cloud-based assembly and SV calling algorithms like Hitachi’s Human Chromosome Explorer™ (HCE). It may now be possible to close the gap of missing heritability and to increase somatic diagnostic yield.
Most importantly, the pace of discovery is accelerating. A PubMed search for structural variation shows a nearly 500% increase in papers referencing SV from 2000 to 2018. As our understanding of the importance of SV and our detection methods improve, the role of SV in disease will become clearer, and opportunities for better diagnostics and new therapeutic targets will become a reality.
From early cytogenetic techniques like karyotyping through later molecular cytogenetic techniques like Fluorescence In situ Hybridization (FISH) and comparative genomic hybridization arrays (aCGH) on to sequence-based approaches, the resolution at which we can study SV has dramatically increased.
Karyotyping has remained in standard practice because it provides pathologic information that is observable for well-characterized genetic disorders. However, its limitations in specificity and the capture of de novo or rare events, and slow turn-around time due to culturing are well documented. It is also expensive at about $11k per analysis.
FISH improves upon karyotyping in that it is molecular and cost effective at about $1,500 per analysis. However, like karyotyping, it requires cell culture and is limited to cases of clinically observed phenotypic suspicion, limiting its use to germline analysis only.
aCGH provides molecular information and is particularly good at detecting copy number variation (CNV) across thousands of genes. It is relatively cost effective at about $1,300 to $2,700 per analysis. However, the chromosomal locus is lost, which could be important for understanding pathogenesis, and aCGH can only detect events for which probes have been designed, eliminating its ability to detect previously unknown SV.
Next-generation sequencing (NGS) and other sequence-based methods of SV analysis offer the promise of base-pair resolution of the genome, yet traditional cytogenetic and molecular cytogenetic techniques are still widely used, particularly in clinical applications. Reliance on these older techniques persists because of the inherent limitations of NGS or its cost. NGS data consists of short reads of around 300-700 bp, which are reassembled into contiguous fragments. This poses a challenge in detecting large structural variants that may span these fragments. Short reads also make it difficult to resolve areas with multiple repeats, as found in CNV. Additionally, NGS has difficulty resolving GC-rich regions of the genome, and some applications, such as detecting SV in heterogeneous samples (with allele frequencies <10%), are nearly impossible using sequencing methods.
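The repeat problem can be illustrated with a toy example. The sketch below is not taken from any SV caller; the sequences, the read lengths and the helper names (`distinct_reads`, `genome_with_copies`) are all hypothetical, and the reads are idealized (error-free, every position covered). It builds two small "genomes" that differ only in the copy number of a tandem repeat and compares the set of distinct reads each one can produce.

```python
from typing import Set

# Hypothetical toy sequences (assumptions for illustration only).
LEFT_FLANK = "TTACATCCATACTACCTTAACATCATTCCA"
RIGHT_FLANK = "CATTACCTAATCCATTCAACTTACCATACT"
REPEAT_UNIT = "ACGTTGCATG"  # 10 bp tandem repeat unit


def genome_with_copies(copies: int) -> str:
    """Build a toy genome: left flank + tandem repeat array + right flank."""
    return LEFT_FLANK + REPEAT_UNIT * copies + RIGHT_FLANK


def distinct_reads(genome: str, read_length: int) -> Set[str]:
    """Return every distinct error-free read of the given length."""
    last_start = len(genome) - read_length
    return {genome[i:i + read_length] for i in range(last_start + 1)}


three_copies = genome_with_copies(3)
five_copies = genome_with_copies(5)

# Reads shorter than the repeat array: both genomes yield the same read set.
short_reads_identical = (
    distinct_reads(three_copies, 20) == distinct_reads(five_copies, 20)
)

# Reads long enough to span the whole array and both flanks: the sets differ.
long_reads_identical = (
    distinct_reads(three_copies, 60) == distinct_reads(five_copies, 60)
)

print("20 bp reads identical:", short_reads_identical)
print("60 bp reads identical:", long_reads_identical)
```

With reads shorter than the repeat array, no single read connects both flanks, so the read content alone cannot distinguish three copies from five. Real pipelines also use signals this sketch ignores, such as read depth and paired-end distances, so it shows the nature of the difficulty rather than a hard limit.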
Despite technological advances, a critical gap of missing heritability and decreased somatic diagnostic yield still exists between cytogenetics and genomics. Whole Genome Mapping (WGM) is poised to fill that gap. WGM is well positioned to replace traditional cytogenetic techniques in clinical settings, and new studies are showing that combining NGS with other SV analysis methods like whole genome mapping can significantly improve results in research.
WGM is a single-molecule analysis method that does not use amplification or in vitro synthesis of the DNA molecules. Long, high molecular weight DNA fragments are isolated, labeled with sequence-specific labels and linearized, with each fragment being directly representative of the cell population. Depending on the platform, the labeled molecules are either optically imaged or electronically detected to create long reads with distance information between the labeled sequences. These reads are assembled into consensus maps. The consensus maps are then compared with an in silico reference to identify areas of difference that comprise SV.
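The comparison of a map against an in silico reference can be sketched in a few lines. This is a deliberately simplified illustration, not the method used by any mapping platform: the motif, the toy sequences, the `min_size` threshold and the function names are assumptions, and the sketch assumes that every label is detected exactly once and in order, whereas real map aligners must tolerate missing and spurious labels and sizing error.

```python
from typing import List, Tuple

MOTIF = "CTTAAG"  # hypothetical label recognition motif, for illustration


def in_silico_labels(sequence: str, motif: str) -> List[int]:
    """Return the 0-based start position of every occurrence of the motif."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def label_intervals(labels: List[int]) -> List[int]:
    """Distances between consecutive labels."""
    return [b - a for a, b in zip(labels, labels[1:])]


def call_size_changes(
    ref_labels: List[int], sample_labels: List[int], min_size: int = 50
) -> List[Tuple[int, str, int]]:
    """Compare intervals one-to-one and report (interval index, type, size).

    Assumes both maps carry the same labels in the same order.
    """
    if len(ref_labels) != len(sample_labels):
        raise ValueError("this sketch requires a one-to-one label match")
    calls = []
    pairs = zip(label_intervals(ref_labels), label_intervals(sample_labels))
    for index, (ref_gap, sample_gap) in enumerate(pairs):
        delta = sample_gap - ref_gap
        if abs(delta) >= min_size:
            kind = "insertion" if delta > 0 else "deletion"
            calls.append((index, kind, abs(delta)))
    return calls


# Toy reference and a sample with 120 extra bases between labels 1 and 2.
reference = MOTIF + "A" * 200 + MOTIF + "A" * 300 + MOTIF + "A" * 150 + MOTIF
sample = MOTIF + "A" * 200 + MOTIF + "A" * 420 + MOTIF + "A" * 150 + MOTIF

reference_map = in_silico_labels(reference, MOTIF)
sample_map = in_silico_labels(sample, MOTIF)
calls = call_size_changes(reference_map, sample_map)
print(calls)
```

Only the distances between labels are compared, which is why a map can reveal a change in size or arrangement without providing the underlying base sequence. The toy distances here are far smaller than the intervals seen on real maps.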
WGM is not subject to amplification bias or error, and SV is directly observed instead of inferred, as it is with most sequencing-based approaches. However, WGM does not provide single base pair resolution. Instead, it acts like a molecular karyotype: structural information is preserved, but with much greater specificity for inter- and intragenic SV.
Whole Genome Mapping (WGM) can be used to detect both heterozygous and homozygous structural variants, which are often missed by both cytogenetic techniques and sequencing approaches. Also known as optical mapping or electrical mapping, WGM has been around for a while, but companies like Bionano Genomics and Nabsys have recently refined and commercialized the technology. With recent improvements in price performance, the cost of WGM is comparable to or less than that of traditional cytogenetic approaches to structural variation (SV) analysis. Several new studies have shown that combining SV analysis platforms like WGM and NGS improves results. This makes WGM a viable method for SV analysis both as a standalone replacement for cytogenetic techniques and as an orthogonal platform to sequence-based methods.
In the next section we will look at the informatics challenges facing SV researchers and how cloud computing and advanced bioinformatics software like Hitachi’s Human Chromosome Explorer™ are further improving accuracy while decreasing analysis times, in some cases from days to hours.
Whether using sequencing-based approaches or WGM, the computational challenge of SV analysis is significant, involving large data sets and computationally intensive algorithms for assembling consensus sequences or maps and calling structural variants. Storage needs and new algorithms require either a significant investment in dedicated servers and related hardware or deployment in the cloud. Either way, IT staff are required to manage it all. Once the IT infrastructure is in place, researchers need to tap bioinformatics resources to choose from the hundreds of available algorithms and to configure and deploy complex software, only to wait days in some cases for the CPUs to crunch the data. Many projects do not have the budget or throughput to justify the investment to build the required IT infrastructure and/or employ bioinformatics staff. Higher-throughput groups are looking to reduce existing IT costs, increase SV analysis accuracy and decrease time to results.
Hitachi developed Human Chromosome Explorer™ (HCE) for rapid, cloud-based structural variation analysis of genome mapping data. HCE exploits the multiple threads, clusters and compute nodes available in the cloud to distribute the generation of long-range uniform contigs into a haplotype-aware assembly of a whole human genome map, which is meta-analyzed for SV. Structural variants are further analyzed and segregated by type before being visualized in a chromosomal view against a reference in HCE. By utilizing these latest algorithmic advances and parallel processing in assembly and variant calling, HCE provides analysis of SV from human whole genome maps in a couple of hours.
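The general pattern of distributing work and then segregating calls by type can be sketched as follows. This is not HCE's implementation or API, which the text does not describe; the `SVCall` record, the `TOY_CALLS` data and both functions are hypothetical. A thread pool stands in for distributed compute nodes, and a lookup table stands in for the expensive per-chromosome assembly and variant calling step.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, NamedTuple


class SVCall(NamedTuple):
    chromosome: str
    start: int
    end: int
    sv_type: str


# Hypothetical precomputed calls standing in for real per-chromosome results.
TOY_CALLS: Dict[str, List[SVCall]] = {
    "chr1": [
        SVCall("chr1", 10_000, 15_000, "deletion"),
        SVCall("chr1", 80_000, 95_000, "inversion"),
    ],
    "chr2": [SVCall("chr2", 5_000, 5_900, "insertion")],
    "chr3": [SVCall("chr3", 40_000, 47_500, "deletion")],
}


def call_chromosome(chromosome: str) -> List[SVCall]:
    """Placeholder for an expensive assembly and SV calling step."""
    return TOY_CALLS[chromosome]


def call_and_group(
    chromosomes: List[str], max_workers: int = 4
) -> Dict[str, List[SVCall]]:
    """Run the per-chromosome step concurrently, then group calls by type."""
    grouped: Dict[str, List[SVCall]] = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        for chromosome_calls in pool.map(call_chromosome, chromosomes):
            for call in chromosome_calls:
                grouped[call.sv_type].append(call)
    return dict(grouped)


grouped_calls = call_and_group(["chr1", "chr2", "chr3"])
for sv_type, type_calls in sorted(grouped_calls.items()):
    print(sv_type, len(type_calls))
```

Splitting by chromosome is only one possible unit of work; a production system would more likely use separate processes or machines for CPU-bound steps, since Python threads do not run such code in parallel.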
Just as genome mapping bridges the gap between classic cytogenetic techniques and sequencing, HCE bridges the gap in analysis by providing a platform on which large SV can be analyzed at scale, allowing for the concerted study of missing heritability and large somatic events with confidence.