Structural variation (SV) accounts for a large share of human genetic variation and may contribute significantly to disease susceptibility and disease state. Up to 13% of the human genome has been reported to be subject to large SV (1). Although numerous techniques are used to identify SV, none is yet practical for routine applied use. As a result, the picture of human genomic variation remains far from complete, because significant genomic regions containing SVs remain unread. Here we, at Hitachi High Technologies, describe the Human Chromosome Explorer(SM) (HCE), an advanced cloud-based analytical system that, combined with genome mapping technologies, is used to discover SV throughout the genome. HCE enables integrated analysis of exome and intergenic regions, providing a more complete view of genome-wide SV than is currently available.
Structural Variation Defined
The analysis of the human genome has yielded many insights. Perhaps one of the most interesting is the amount of genetic variation, given that the DNA sequences of any two individuals differ by only about 0.1%. Genetic variation is a broad term that covers everything from differences as small as a single base pair to alterations in the number or structure of entire chromosomes. Structural variation in the genome refers to cytogenetically visible and submicroscopic variants, namely those larger than 1 kb (2). SV is a modification, of varying size, in the organization of genetic material (3, 4). Common examples include balanced forms, such as inversions and translocations, and unbalanced forms, such as small insertions and deletions (indels), duplications, and large-scale copy number variants (CNVs).
Importance of SV in Genome Research
SV has been recognized as an important source of susceptibility to human genetic disease. This recognition stems, in part, from the finding that SV contributes more to overall genomic variation than single base pair changes do (1, 4, 5, 6). The associated conditions range from whole or partial chromosome abnormalities (trisomies, monosomies, translocations), microdeletion syndromes, and Mendelian diseases to more complex traits such as Crohn’s disease, cancer, autoimmune disease, and neurodevelopmental and neurological disorders including autism spectrum disorders (ASD), schizophrenia, epilepsy, and Parkinson’s disease (7, 8, 9). Most of these diseases are not fully understood because a large number of SVs remain undiscovered or uncharacterized. Studies of SVs therefore hold significant potential for new approaches to disease diagnosis and medical treatment.
Detecting SVs is also important for improving the quality of de novo sequencing. Because commonly used next-generation DNA sequencers take a “short read” approach (with read lengths of less than 700 bases (10)), SV-prone regions such as repetitive sequences cannot be assembled uniquely and are left unread (11, 12). These sequences are often found in clinically relevant genomic regions such as the Human Leukocyte Antigen (HLA) region (13). HLA is associated with type 1 diabetes, rheumatoid arthritis, Crohn’s disease, and other autoimmune diseases and immune responses. However, HLA is notoriously hypervariable, which limits its characterization and subsequent practical clinical application. SV detection and analysis may therefore be highly important for finalizing the reference genome and for gaining additional insights from unread genome regions.
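The ambiguity that short reads face in repetitive sequence can be illustrated with a toy example. The sketch below is only an illustration under simplifying assumptions: it uses exact string matching on an invented reference, and the names placements, repeat, reference, short_read, and long_read are hypothetical and do not come from any sequencing tool.

```python
def placements(reference, read):
    """Return every start index at which the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Continue searching one base further so overlapping matches are found.
        start = reference.find(read, start + 1)
    return hits


# Invented toy sequence: three tandem copies of a repeat unit with unique flanks.
repeat = "ACGTTGCA"
reference = "GGA" + repeat * 3 + "TTC"

short_read = "TTGCAACG"       # lies entirely inside the repeat array
long_read = reference[1:-1]   # spans the whole array plus flanking sequence

print(placements(reference, short_read))
print(placements(reference, long_read))
```

In this toy case the short read has two equally good placements, while the read that spans the repeat array and its flanks has only one. This is the sense in which repetitive regions cannot be placed uniquely from short reads alone.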
Current Methods for Identification and Their Limitations
Currently there are multiple approaches for the identification of SV (mainly microarrays and next generation sequencers (NGSs) (2)), but all have distinct limitations. Therefore, no single strategy provides a systematic and comprehensive means to accurately identify all forms of variation including copy number, content, and structure.
A common method for detecting SVs is paired-end sequencing using NGS (14). This method reads the sequences at the two ends of DNA fragments that are several kilobases in length. To detect SV, the two end sequences are mapped to the reference genome. SVs are then identified by comparing the distance between the two ends on the reference with the physical fragment size. This method can, in principle, detect all types of variation (15). However, insertions are particularly difficult to detect because the inserted sequence misaligns. Moreover, the error in mapping sequences from repetitive regions to the reference genome is large. These issues become more significant as DNA fragments become longer (14). Therefore, SV detection by NGS is still limited and the results are somewhat unreliable.
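The comparison between mapped distance and physical fragment size can be sketched in a few lines. This is a simplified illustration, not the procedure used by any particular NGS pipeline: the ReadPair type, the classify_pair function, and the expected_size and tolerance parameters are all hypothetical names introduced here, and the sketch assumes each pair has already been mapped and ignores other signatures such as outward-facing pairs.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    left_pos: int        # reference coordinate of the leftmost mapped end
    right_pos: int       # reference coordinate of the rightmost mapped end
    left_forward: bool   # True if the left end maps to the forward strand
    right_forward: bool  # True if the right end maps to the forward strand


def classify_pair(pair, expected_size, tolerance):
    """Return a coarse SV signature for one mapped read pair (hypothetical helper)."""
    # Both ends on the same strand is treated here as an inversion candidate.
    if pair.left_forward == pair.right_forward:
        return "inversion"
    mapped_span = pair.right_pos - pair.left_pos
    if mapped_span > expected_size + tolerance:
        # Ends lie farther apart on the reference than in the physical fragment.
        return "deletion"
    if mapped_span < expected_size - tolerance:
        # Ends lie closer together on the reference than in the physical fragment.
        return "insertion"
    return "concordant"


# Illustrative values only: 3 kb fragments with a 500 bp tolerance.
pair = ReadPair(left_pos=1000, right_pos=9000, left_forward=True, right_forward=False)
print(classify_pair(pair, expected_size=3000, tolerance=500))
```

A single discordant pair is weak evidence. In practice, calls of this kind are normally supported by many pairs that agree, which is one reason mapping errors in repetitive regions are so damaging.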
The Human Chromosome Explorer
Because the study of SVs has been limited to the techniques above, the resulting picture of SV is underrepresented and skewed. One approach to overcoming these issues is genome mapping (15, 16). Genome mapping technology detects and analyzes fluorescently labeled or fluorescently stained DNA fragments several hundred kilobases in length. Each DNA fragment has a pattern unique to its sequence. Large numbers of detected fragments are assembled into a scaffold and mapped to a reference genome. During this process, SVs including insertions, deletions, inversions, and translocations are identified. Moreover, diploid information can also be obtained through advanced de novo assembly technology. Although these technologies are commercially available, SV detection is not yet practical because analysis technology is lacking. The analysis requires expensive servers, deep knowledge of bioinformatics, and a laborious process of assessing the SV identification results.
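One way to picture how mapping a label pattern to a reference exposes insertions and deletions is to compare the distances between consecutive labels. The sketch below is a minimal illustration and not the HCE algorithm or any vendor's method: the function interval_differences and its parameters are hypothetical, label positions are assumed to be in base pairs, the two maps are assumed to be already matched label for label, and the 1 kb default threshold is an assumption borrowed from the SV size definition given earlier.

```python
def interval_differences(sample_labels, reference_labels, min_sv_size=1000):
    """Compare inter-label distances of two label maps matched one-to-one.

    Returns a list of (interval_index, call, size_in_bp) tuples.
    """
    if len(sample_labels) != len(reference_labels):
        raise ValueError("label maps must be matched one-to-one")
    calls = []
    for i in range(len(sample_labels) - 1):
        sample_gap = sample_labels[i + 1] - sample_labels[i]
        reference_gap = reference_labels[i + 1] - reference_labels[i]
        delta = sample_gap - reference_gap
        if delta >= min_sv_size:
            # Sample interval is longer than the reference: extra sequence present.
            calls.append((i, "insertion", delta))
        elif delta <= -min_sv_size:
            # Sample interval is shorter than the reference: sequence missing.
            calls.append((i, "deletion", -delta))
    return calls


# Invented label positions (bp) for illustration only.
reference_map = [0, 10_000, 20_000, 35_000, 50_000]
sample_map = [0, 10_000, 25_000, 40_000, 52_000]
print(interval_differences(sample_map, reference_map))
```

With these invented positions, the second interval is 5 kb longer in the sample and the fourth is 3 kb shorter, so the sketch reports one insertion and one deletion. Inversions and translocations change label order or chromosome assignment and would need a real alignment step that this sketch deliberately leaves out.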
To address the analysis problem, Hitachi High Technologies has developed the Human Chromosome Explorer (HCE). HCE is a cloud-based informatics platform that automatically assembles the primary results from detected DNA fragments and maps them to provide a holistic view of genomic SV (figures 1, 2, 3, 4). This platform does not require an expensive server, informatics technicians, or time-consuming assessment of the results.
Although it is clear that SV contributes significantly to human disease, the extent of that contribution across the genome is still largely unknown. Understanding this impact, its role in complex traits, and its evolution requires determining the distribution of such variation within the human population. The next step is to differentiate pathogenic from benign variation, in order to develop a deeper understanding of intergenic DNA and of the factors that influence variation.
The challenge now is to develop research that identifies and classifies structural variation, leading to a deeper understanding that may drive the next wave of clinical breakthroughs and applications.
1. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010;464:704–712.
3. Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12(5):363–376.
4. Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, Lee C. Copy number variation: new insights in genome diversity. Genome Res. 2006;16:949–961.
5. Tuzun E, et al. Fine-scale structural variation of the human genome. Nat Genet. 2005;37:727–732.
6. Kidd JM, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453:56–64.
7. Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437–455.
8. Girirajan S, Campbell CD, Eichler EE. Human copy number variation and complex genetic disease. Annu Rev Genet. 2011;45:203–226.
9. Weischenfeldt J, Symmons O, Spitz F, Korbel JO. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet. 2013;14:125–138.
10. Loman NJ, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30(5):434–439.
11. Baker M. De novo genome assembly: what every biologist should know. Nat Methods. 2012;9:333–337.
12. Alkan C, Sajjadian S, Eichler EE. Limitations of next-generation genome sequence assembly. Nat Methods. 2011;8(1):61–65.
13. Horton R, et al. Gene map of the extended human MHC. Nat Rev Genet. 2004;5(12):889–899.
14. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318:420–426.