Open access

Whole exome and genome sequencing for Mendelian immune disorders: from molecular diagnostics to new disease variant and gene discovery

Author: Daniele Merico [email protected]Authors Info & Affiliations

Publication: LymphoSign Journal

22 November 2016

https://doi.org/10.14785/lymphosign-2016-0011

Abstract

Whole exome and whole genome sequencing are next generation sequencing (NGS) applications that enable investigation of all coding variants (around 20 000) or all variants (around 4 million) in the human genome. They provide an extremely powerful tool for detecting variants with an established implication in Mendelian disorders as well as for discovering new disease variants and genes. The large number of variants generated requires elaborate databases, prediction models, and integrated workflows to identify which variants are more likely to contribute to disease. We discuss the whole exome and whole genome options, review the sequencing platforms and variant calling pipelines available for different variant types, and devote most of the review to how genetic variants can be annotated and prioritized to identify the ones likely contributing to disorder. The application focus will be Mendelian disorders; disorders caused by rare or common variants with a more complex genetic architecture will only be discussed briefly. For variant annotation and interpretation, we will concentrate on smaller variants (substitutions, insertions, and deletions), only briefly reviewing structural and copy number variation.

Sequencing and variant detection

In this section, we will review how the sequencing process works, what sequencing platforms are available and what analytical methods are typically used to detect genetic variants.

Overview of next generation sequencing of whole exome and genome

Next generation sequencing (NGS) of exomes and genomes typically consists of the following steps (Pabinger et al. 2014). DNA fragments (length 300–500 base pairs (bp)) are processed into a sequencing library, adding adaptor and barcodes to their ends, and then both ends of the fragments are sequenced to produce read pairs, with each read spanning 100–150 bp and the paired reads separated by a gap (Goodwin et al. 2016). Reads are then aligned to the human genome reference sequence by a “read aligner” (e.g., using BWA, the Burrows–Wheeler Aligner (Li and Durbin 2010)). Afterwards, genetic variants (small substitutions, insertions, and deletions) are detected using multi-module tools called “variant callers” (e.g., GATK, the Genome Analysis Toolkit (McKenna et al. 2010)) that receive in input aligned reads and evaluate if the read bases at each reference genome position are supportive of variation, modelling base sequencing quality, alignment specificity, and other quality parameters. Recent-generation variant callers are also able to perform local de-novo sequence assembly, achieving greater sensitivity in detecting longer and more complex variants (e.g., the GATK Haplotype Caller). Specialized variant callers are required for copy number variant (CNV) and other structural variant (such as inversions and translocation) detection (Pabinger et al. 2014). Small substitutions, especially single nucleotide variants (SNVs), are more straightforward to detect; in contrast, insertions, deletions, structural variants, and CNVs are more challenging (Pang et al. 2014). Community efforts have led to the establishment of a variant detection benchmark for small substitutions, insertions, and deletions (Zook et al. 2014), whereas no unified benchmark is available for copy number variants, although several have been proposed (Pang et al. 2014; Mu et al. 2015; Zarrei et al. 2015).

Specialized filtering methods (Jiang et al. 2013; Yuen et al. 2015) or variant callers (e.g., DeNovoGear (Ramu et al. 2013)) are additionally required for detecting if variants arise de-novo in a proband compared to the parental genomes (relevant for trios and other family studies). Finally, variants are annotated for their effect on gene products and allele frequency in established reference datasets; clinical effect classification is fetched from reference databases; additional gene annotations and global probabilistic scores can also be added to facilitate the prioritization of disease causing variants.

Whole exome versus whole genome sequencing

Whole exome differs from whole genome because specific regions, corresponding to coding exons and adjacent genomic intervals, are captured by hybridization probes (or amplified by polymerase chain reaction, depending on the sequencing technology) (Bamshad et al. 2011). Although termed “whole” exome, it is important to note that not all exonic regions are properly captured or amplified because of design and other technical limitations (Bamshad et al. 2011; Jiang et al. 2013). In contrast, whole genome sequencing is able to sequence >96% of the human genome reference sequence (Jiang et al. 2013; Yuen et al. 2015; Stavropoulos et al. 2016). Finally, sequencing depth is more evenly distributed (Jiang et al. 2013). Whole exomes are typically sequenced at average coverage 50–100×, whereas whole genomes are typically sequenced at 30–50× for disease variant identification applications (polymorphism detection when sequencing large sets of individuals can be performed successfully at a much lower depth (1000 Genomes Project Consortium et al. 2015)).

Sequencing platforms

Illumina sequencing platforms currently represent the largest share of the exome and sequencing applications (Goodwin et al. 2016). Whole exome sequencing is typically performed on the HiSeq2500, HiSeq3000, HiSeq4000 instruments; whole genome sequencing can be performed on the same instruments, but is more economic on the newer HiSeqX platform, which offers greater throughput. All instruments of the Illumina HiSeq series perform sequencing by synthesis (SBS) with cyclic reversible termination and 4 colour labels; this approach has an overall accuracy rate of >99.5%, with an overall tendency towards substitution errors and some under-representation of regions with extremely low or extremely high GC content (Goodwin et al. 2016). Exome and whole genome sequencing are typically performed using paired end reads, with each read spanning 100–150 bp and fragments spanning 300–500 bp (Goodwin et al. 2016).

High-quality whole genome sequencing can also be performed using the Complete Genomics platform. Complete Genomics sequencing is based on sequencing by ligation; this approach results in a very high accuracy (∼99.99%), as each base is probed multiple times (Goodwin et al. 2016). Complete Genomics fragments are 500 bp, with 35 bp at both ends covered by shorter reads separated by small gaps and a larger gap in the central portion of the fragment. The Complete Genomics sequencing service includes delivery of called variants, including small substitutions, insertions, deletions, as well as structural and copy number variants. However, the Revolocity sequencing system was intended to compete with the Illumina HiSeq but its launch was suspended in 2016, creating some uncertainty on the future of this sequencing platform (Goodwin et al. 2016).

LifeTech’s AmpliSeq IonProton offers an alternative option for exome and targeted sequencing. This platform relies on SBS with single-nucleotide addition and consequently presents a higher insertion and deletion error rate (Goodwin et al. 2016); if this is not properly addressed by removing lower quality variants, it can lead to many incorrect candidate disease variants with damaging effects. The future of this platform is also less certain, considering the constant improvement and cost reduction of Illumina platforms.

All sequencing platforms described so far are based on short reads; while they currently offer the best value-for-money in terms of exome and whole genome sequencing, they are insufficient for resolving more complex genomic loci with higher repeat content and (or) very high variation between individuals. Single-molecule long-read sequencing, as offered by the Pacific BioSciences RS II system, addresses this problem by offering read lengths of about 20 kb; however the sequencing cost per Gb is much higher compared to the Illumina instruments (20–100× for the PacBio RS II), limiting its applications. This may change in the future as newer system with greater throughput and lower costs are deployed. A cheaper alternative is also offered by synthetic long-read technologies, although long-read reconstruction can be affected by errors (Goodwin et al. 2016).

Other sequencing systems, like SOLiD (colour-space) and 454 have been relegated to a small niche within the industry or recently discontinued and thus they are not described in detail; the BGISEQ-500 system has less information available and thus is not described in detail either (Goodwin et al. 2016).

Variant annotation and prioritization

The large number of variants generated requires elaborate databases, prediction models, and integrated workflows to identify a small number of candidate variants that are more likely to contribute to disease.

Identification of clinically relevant variants is characterized by stricter criteria: it requires very clear and highly impactful gene product effects (e.g., truncating loss-of-function) or functional validation experiments for variants whose effect is more questionable, frequency lower than expected based on disease prevalence, family segregation (ideally from an extended pedigree and (or) multiple independent families) or markedly higher incidence in disease carriers compared to controls, clustering with already established pathogenic mutations, and obviously it is restricted to genes with a well-established implication in Mendelian disease. The ACMG guidelines represent the most broadly adopted human clinical genetics standard for classifying variants as “benign”, “likely benign”, “uncertain significance”, “likely pathogenic”, or “pathogenic” (Richards et al. 2015).

Discovery of novel disease variants within established Mendelian genes follows similar principles, especially as far as frequency filtering, but it is also more open and exploratory in terms of variant effect and impact. Certain effects may be under-reported in clinical databases (e.g., splicing sequence changes besides the highly conserved dinucleotide, transcriptional regulatory changes), and obviously functional validation is typically not already available and may have to be performed for the most promising candidate variants; similar considerations apply to segregation in multiple families. Discovery of variants in novel genes additionally requires prioritization of genes that are likely to contribute to human Mendelian disease based on phenotypic abnormality in model organisms, known functional role, and genetic constraint.

We argue that these different use cases can all be served by bioinformatics workflows prioritizing variants along different “dimensions” (Figure 1): (i) sequencing quality, (ii) allele frequency, (iii) variant effect on gene product, and (iv) gene information. In the presence of trio or other family-based designs, the variant inheritance pattern or de-novo status (v) can also be used for prioritization (Bamshad et al. 2011; Saunders et al. 2012; Jiang et al. 2013; Ngan et al. 2014; Dewey et al. 2015; Merico et al. 2015a, 2015b; Miller et al. 2015; Smedley and Robinson 2015; Yuen et al. 2015; Stavropoulos et al. 2016).

Figure 1:

Sequencing quality

For Illumina sequencing analyzed using GATK, the VQSR (Variant Quality Score Recalibration) filter value “PASS” is typically used as the primary quality filter; the VQSR procedure takes advantage of known polymorphisms in the human population to learn what parameter values discriminate true variants from false variants (McKenna et al. 2010). Additional hard filters can be applied to mask regions of the genome that are repeat rich and (or) prone to mapping artifacts (e.g., segmental duplications). Different sets of filters are optimal for Complete Genomics whole genome sequencing (Merico et al. 2015a; Yuen et al. 2015; Stavropoulos et al. 2016). In general, optimal filtering criteria can be established by sequencing the reference sample NA12878 and comparing called variants to the multi-platform benchmark established by the Genome in a Bottle Consortium in collaboration with the National Institute of Standards and Technology (Zook et al. 2014). Additional benchmarks are being established to avoid overfitting on the reference sample (e.g., the PrecisionFDA truth challenge https://precision.fda.gov/challenges/truth).

Allele frequencies

Using a variant filter based on allele frequency in reference databases avoids selecting variants incorrectly predicted to be disruptive for the gene product (Richards et al. 2015). One to five percent is typically used for autosomal recessive genes and <0.1% (or never observed) for dominant genes (Bamshad et al. 2011). The 1000 Genomes project (totalling 2504 individuals in the latest release) is the main resource covering the whole human genome and different continental population (Europeans, Africans, Eastern Asians, Southern Asians, admixed Latin Americans) (1000 Genomes Project Consortium et al. 2015). It is a common practice to filter based on the maximum frequency observed across the 5 continental populations (Bamshad et al. 2011; Lek et al. 2016); while the earlier release of 1000 Genomes presented excessively small continental population sizes for accurate allele frequency estimate, the current size is satisfactory (Brown et al. 2016).

Exome Aggregation Consortium (ExAC) is an additional resource that is restricted to coding regions, but includes many more subjects (60 706 unrelated individuals); caution needs to be exerted when using ExAC frequencies, because ExAC includes disease carrier cohorts, although individuals affected by Mendelian early-onset disorders have been excluded (Lek et al. 2016).

In contrast, it is not advisable to filter variants based on presence or absence in NCBI dbSNP, because dbSNP is a comprehensive variant archive that includes pathogenic variantsand is not restricted to polymorphic variation present in the human population (Bamshad et al. 2011; Johnston and Biesecker 2013).

Finally, certain variants may have gone undetected in reference allele frequencies databases because they are platform-specific artifacts, because they belong to an error-prone or highly variable region that has been masked, or because they are complex and thus their detection and representation are highly dependent on the alignment and variant calling procedure. It is always advisable to include internal platform-matched and analysis pipeline-matched controls to neutralize this type of issue (Bamshad et al. 2011).

Clinical variant classification databases

ClinVar (Landrum et al. 2016) and HMGD (Stenson et al. 2014) are the most popular clinical variant classification databases.

ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) is hosted by the National Center for Biotechnology Information (NCBI) and is freely available. ClinVar currently holds >125 000 classified variants, with almost 4800 genes having variants that affect a single gene (i.e., after removing multi-genic structural variants). The main classification categories of clinical relevance are benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. Clinical variant classification and supporting information are submitted by clinical testing laboratories, research laboratories, and authoritative databases such as OMIM (Online Mendelian Inheritance in Man). Although many submitters follow rigorous guidelines for variant identification (ACMG Guidelines (Richards et al. 2015)), or follow similar guidelines established within their institution (Eggington et al. 2014), conflict in variant classification can arise in the presence of multiple classifications by different submitters.

HGMD is a commercial database (Stenson et al. 2014) with variant entries added by professional curators; it captures variants that are causative or associated to human disease, and tends to favour inclusiveness over stringency in pathogenicity assessment.

In general, clinical classification databases are not perfect: there is a significant degree of disagreement between databases because of different classification standards (Vail et al. 2015). In addition, for a given individual genome, several variants reported as pathogenic can occur in absence of the expected disease, suggesting that they are incorrectly classified or have fairly low penetrance (Xue et al. 2012; Berg et al. 2013; MacArthur et al. 2014; Lek et al. 2016). Obviously, even if a recessive variant is correctly labelled as pathogenic, the zygosity needs to be taken into account (a single heterozygous damaging variant is insufficient to produce disease in an individual in absence of compound heterozygosity). For all these reasons, a variant classified as pathogenic in a database cannot be automatically interpreted as pathogenic when observed in an individual: all lines of evidence (allele frequency, gene product effect, etc.) need to be evaluated (Berg et al. 2013; Richards et al. 2015).

Gene product effect and impact

To understand the effect of genetic variants on downstream molecular and cellular processes, it is important to first determine the variant effect on gene products. While non-genic sequences are important for gene expression regulation, gene products are the main link between the genome and downstream functional processes; effects on regulatory sequences (and how to link them back to the gene product they are presumed to regulate) will be discussed in a dedicated subsection. For clarity, we will use “effect” to refer to the type of gene product alteration, and “impact” to refer to the quantitative extent of the gene product alteration. We will refer to a gene product alteration as “damaging” when its (predicted) impact corresponds to a significant gene product alteration that is typically sufficient to cause disease given the appropriate combination of zygosity and gene mode of inheritance (“pathogenicity”) (MacArthur et al. 2014).

Coding gene effects: overview

Protein-coding genes represent the best functionally characterized subset, and genes implicated in Mendelian disorders are almost all protein coding: protein-coding gene variants represent 99.9% of ClinVar pathogenic or likely pathogenic variants.

Effects on protein-coding genes can be articulated in the following categories: (i) acquisition of a premature stop codon (“stop-gain”), (ii) change in the reading frame (“frameshift”) typically caused by insertions or deletions, (iii) alteration of the highly conserved intronic dinucleotide of a splice site (“core splice site alteration”), (iv) loss of the start codon (“start-loss”), (v) loss of the stop codon (“stop-loss”), (vi) changes of a single amino acid (“missense”), (vii) insertion or deletion or multiple substitutions of amino acids (“in-frame amino acid sequence change”), (viii) changes in the splicing consensus sequence or other sequences regulating splicing (“splicing sequence change”), and (ix) UTR (untranslated sequence) changes with different molecular regulatory mechanisms at 3′ and 5′ UTRs (“UTR change”). It is also common to annotate variants that do not change the amino acid encoding of a codon as “synonymous”; however, these variants may have an effect on splicing, and thus this characterization can be misleading if automatically interpreted as silent or neutral (Richards et al. 2015).

The effect of a variant depends on the gene models and software used (McCarthy et al. 2014; Frankish et al. 2015). NCBI RefSeq, GENCODE Basic, and GENCODE Comprehensive (used by the Ensembl project) are the most commonly used gene models. Gene models from GENCODE Basic and especially from the curated subset of RefSeq (i.e., excluding transcript models, marked as XM_ or XR_) are more conservative, whereas the full RefSeq and especially the GENCODE Comprehensive gene models are more complete, but they also include more transcripts that, even if expressed at sufficiently high levels, can be more functionally questionable (Frankish et al. 2015). It is also important to note that GENCODE Basic does not include non-coding transcripts, whereas the curated subset of RefSeq does. Additionally, GENCODE Comprehensive includes incomplete transcripts, which typically lack a proper start or end and consequently they do not have a proper coding sequence (Frankish et al. 2015).

The terminology used to express the gene product effect also depends on the bioinformatics annotation tool used; the most commonly used are Annovar (Wang et al. 2010), VEP (Variant Effect Predictor) (McLaren et al. 2016), and SnpEff (Cingolani et al. 2012). Sequence Ontology (SO) is a standardized vocabulary for variant effects on gene product (Reese et al. 2010); more systematic adoption of SO terms should lead to uniform effect annotation given the same gene models.

Coding gene effects: truncating loss-of-function variants

Stop-gains and frameshifts are the effects with the strongest impact on the gene product: unless they occur in the terminal part of the coding sequence, they are very likely to trigger nonsense-mediated decay and lead to complete loss of the transcript carrying the variant, resulting in completely abrogated gene product expression for bi-allelic variants. For this reason, these variants referred to as “truncating” or “loss-of-function” (we will refer to them as “truncating loss-of-function” to avoid any ambiguity). Alterations of core splice sites can produce the same outcome by causing exon skipping, skipping of part of the exon, or retention of intronic sequence; alterations of other splicing control sequences can have similar effects, but typically they do not completely abrogate the wild-type transcript and thus need specific predictive models for impact prediction (these will be described in a specific subsection). Start-losses are also often included in the truncating loss-of-function category, although an alternative downstream start site could be utilized.

Impact prediction is typically not necessary for truncating loss-of-function variants (95.6% of ClinVar variants in this category are (likely) pathogenic); however, it is beneficial to consider the percentage of coding sequence truncated and the percentage of transcript isoforms affected.

Coding gene effects: missense and in-frame amino acid changes

Changes of single or multiple amino acids can have very different effects on gene product function depending on the sequence affected and the properties of wild-type and mutant amino acids. At a given site, only a dramatic amino acid change may have a damaging impact (for instance, the ClinVar entry NM_000032.4(ALAS2):c.1676G>C (p.Arg559Pro) is deemed likely pathogenic for hereditary X-linked sideroblastic anemia while NM_000032.4(ALAS2):c.1676G>A (p.Arg559His) is deemed likely benign); in contrast, other sites may be sensitive even to apparently moderate amino changes (e.g., STAMBP K303R (Naseer et al. 2016)). Bioinformatics impact predictors attempt to address this problem and are reviewed in detail in this section. In any case, it is always highly recommendable to manually evaluate additionally available information, such as overlap with annotated protein domains (available from the UCSC Genome Browser via the PFAM and SwissProt tracks), as well as presence of hotspots of pathogenic variants.

Impact predictors are typically available for missense effects, although the same predictive principles could be extended to other amino acid changes. Different predictors can be categorized based on (i) what predictive features are used to predict impact, (ii) whether they require a training set of known damaging or pathogenic variants, and (iii) what such dataset they use. All the most commonly used predictors (such as SIFT (Ng and Henikoff 2001), PolyPhen2 (Adzhubei et al. 2010) and MutationAssessor (Reva et al. 2011)) rely on protein sequence position-specific conservation and amino acid substitution rates based on multiple protein sequence alignments. This is more powerful than genomic DNA conservation or substitution rates due to the degenerate nature of the genetic code and different effects of specific third-base codon nucleotide substitutions. SIFT was the first predictive model using such features, and its simplicity as well as its independence from variants already classified as pathogenic or damaging still make it appealing. MutationAssessor improves the SIFT algorithm by better modelling positions conserved only in protein subfamilies, and similar to SIFT it does not require a training set of already classified variants. In contrast to SIFT and MutationAssessor, PolyPhen2 includes a richer set of predictive features, including not only alignment-derived conservation and substitution rates but also amino acid physicochemical properties and predicted secondary structure; however, that comes at the cost of relying on training data to fit the predictive model, which is (at least partially) undesirable because training data are not completely unbiased. PolyPhen2 predictions are available for two training sets, “HumDiv” and “HumVar”.

Missense impact predictors are far from being perfect (MacArthur et al. 2014). Different types of circularity have been pointed out (Grimm et al. 2015), and the evaluation sets used so far either overlap with the training set of some methods or they are limited to a few genes and types of protein functionality (e.g., enzymatic activity) thus lacking generality. In addition, current generation predictors are incapable of returning more mechanistic details, because systematic data on different types of mechanistic protein structure perturbation are not systematically available yet; therefore, training sets typically consist of pathogenic versus benign or evolutionarily constrained versus evolutionarily diverged variants. In particular, current predictors fail to distinguish between loss-of-function and gain-of-function variants, a particularly vexing issue for autosomal dominant genes that cause Mendelian disorder only upon loss or gain-of-function (e.g., CARD11 gain-of-function (Chan et al. 2013)). Consequently, the use of missense predictors is quite different in a clinical diagnostic or discovery setting; ACMG Guidelines attribute the lowest level of evidence to them and the guidelines require a perfect consensus among different methods (Richards et al. 2015), favouring functional evidence or segregation patterns. In contrast, in discovery settings, it is common to rank candidate missense variants based on the combination of different predictors’ scores.

Beyond the typical approach to missense impact prediction, post-translational modifications (PTMs) such as phosphorylation, ubiquitination, and acetylation are important for protein signalling pathways; identifying variants disrupting them (Reimand et al. 2013; Narayan et al. 2016) represent an area of growing interest, supported by experimental databases (Hornbeck et al. 2012) and bioinformatics predictive models (Wagih et al. 2015).

Coding gene effects: splicing sequence changes

Pre-mRNA splicing is a key cellular process required for gene expression, and alternative splicing regulation enables the expression of a diversified protein repertoire across cell types and environmental conditions (Scotti and Swanson 2016; Sibley et al. 2016). Diverse estimates for the importance of splicing alterations in Mendelian disorders have been reported. Excluding missense variants for simplicity, 8.6%–9.5% of ClinVar pathogenic and likely pathogenic variants are expected to act via splicing alterations of core splice sites or splicing sequence. Of those, the majority (corresponding to 6.8% of ClinVar (likely) pathogenic variants) consist of core splice alterations; in this review, they were grouped with truncating loss-of-function variants because they are typically damaging and do not require impact prediction models (93.7% of ClinVar core splice alterations are pathogenic or likely pathogenic). However, splicing alterations may be under-represented in ClinVar (because of biased focus on coding sequence (Sibley et al. 2016)) and have been estimated to contribute to 16% (Mort et al. 2014) or even more (Soukarieh et al. 2016) of (likely) pathogenic variants in Mendelian disorders, with the difference mostly accounted by unrecognized splice sequence changes. It is also possible that splicing sequence changes may typically have more subtle effects, and thus be more important for disorders with more complex genetic architectures (Merico et al. 2015c; Xiong et al. 2015; Yuen et al. 2016). Finally, while the nervous system is expected to exhibit the largest splicing diversity (Sibley et al. 2016), splicing may be important also for the immune system considering its rich and highly diversified cell type as well as recognition molecule array.

Splicing predictors for splicing sequence alterations can be categorized along 3 axes: (i) whether they are derived from reference data on physiological splicing or directly trained on pre-classified mutation data (pathogenic, benign); (ii) what type of splicing outcome they predict; and (iii) what type of splicing sequence change they are able to model. Models trained on already classified variants typically predict a generic damaging impact without providing mechanistic details on the splicing alteration; other models predict the difference in consensus sequence strength, exon inclusion percentage change, or alternative splice site usage, but they do not predict downstream events such as frameshift, intron retention, or premature stop-gain acquisition, which can result from splicing alteration. Models trained on physiological splicing data are preferable (at least in principle) because they are not prone to overfitting on biases affecting existing labelled mutation data (Xiong et al. 2015). Splicing sequence changes can be categorized as follows: (i) alteration of consensus sequence, important for the constitutive splicing machinery and proximal to the core splice sites; (ii) loss or gain of exonic splicing enhancers and suppressors, often modelled using exhamer-based models; (iii) loss or gain of intronic splicing enhancers and suppressors; (iv) loss or gain of other motifs for splicing factors and RNA-binding proteins; and (v) creation of de-novo core splice sites or strengthening of existing cryptic splice sites.

MaxEntScan (Yeo and Burge 2004) is one of the most popular and effective models for consensus sequence alteration. While it is trained on known and decoy splice site sequence, it performs very well on pathogenic variant classification (Jian et al. 2014). However, it is not able to model the branch site sequence, which is outside of its consensus sequence boundaries.

dbscSNV is another model for consensus sequence alteration. It is trained on mutation data labelled as pathogenic or benign. It uses MaxEntScan and other sequence-based models plus genomic conservation, but it is not able to model the branch site sequence either; it has a performance similar to MaxEntScan alone (Jian et al. 2014).

Several models capture exonic splicing enhancers and suppressors using hexamers (i.e., sequences of 6 nucleotides) (Erkelenz et al. 2014; Rosenberg et al. 2015) and are thus complementary to models targeting consensus sequence. HEXplorer predicts percentage exon inclusion based on hexamer weights learnt by comparing exonic to intronic sequence (Erkelenz et al. 2014). HAL predicts exon inclusion and alternative splice site utilization; it is trained on synthetic sequences and transcriptional patterns measured in a minigene system. As it is trained on synthetic mutation data, it is immune to typical biases that affect other methods trained on pathogenic versus benign variants (Rosenberg et al. 2015).

MutPredSplice predicts the splicing damaging impact of exonic variants and is trained on pre-classified variants; its features consist of scores from consensus sequence models, a mutation-based hexamer model, genomic conservation, and other features based on gene annotation (e.g., distance of the variant from an annotated splice site) (Mort et al. 2014).

SPANR predicts percentage exon inclusion for exonic as well as intronic variants; it is trained on splicing patterns in non-constitutive exons. Its features consist of scores from consensus sequence models, a large compendium of splicing factor and RNA-binding protein motifs, RNA accessibility based on secondary structure, and nucleosome positioning (Xiong et al. 2015).

There are also more complex types of splicing that need specialized predictors and are not reviewed in detail here (Sibley et al. 2016).

Coding gene effects: UTR changes

UTR changes can result in altered gene expression. The 5′ UTR is important for translation regulation (Scheper et al. 2007): it contains the sequence guiding the ribosome assembly on the mature transcript (Kozak sequence for cap complex-dependent transcript recognition, or IRES (internal ribosome entry site(s)) for internal cap-independent recognition) and other regulatory elements; its length, GC content, and presence of secondary structure elements modulate ribosome start site recognition and translation rate; it can also contain decoy start codons that interfere with proper translation (uAUG, upstream AUG, and uORF, upstream open reading frames; the latter comprise a short coding sequence terminating in a stop codon) (Chatterjee and Pal 2009; Barbosa et al. 2013). The 3′ UTR contains the polyadenylation signal important for transcript maturation and can also contain sequences modulating transcript stability recognized by RNA binding proteins or miRNAs (Chatterjee and Pal 2009).

Currently, few Mendelian disorders (such as hereditary thrombocytaemia) have been reported to be caused by UTR single nucleotide substitutions, indels, or exon skipping (ClinVar: 0.2% pathogenic or likely pathogenic variants consisting of UTR changes). There is also a lack of quantitative variant impact prediction methods, perhaps also because of the coding exon bias in exome sequencing capture target design (Scheper et al. 2007; Chatterjee and Pal 2009). Consequently, UTR mutations are often ignored. This may change in the near future, also in relation to the availability of functional genomics dataset profiling UTR regulatory sequences and effects on gene expression (Goodarzi et al. 2012) as well as well-established miRNA target prediction methods (such as TargetScan (Agarwal et al. 2015)). In the interim, in absence of more principled predictors, genomic conservation can be used for UTR change impact (Yuen et al. 2016).

It is possible that a small number of UTR-altering variants have a sufficiently large impact that contributes to Mendelian disorder, whereas the majority may have a smaller effect and may be relevant mainly for more complex disorders (Yuen et al. 2016).

Non-coding gene effects and impact prediction

Non-coding RNA (ncRNA) genes do not code for protein sequence but instead produce a RNA molecule that directly exerts a biochemical activity (structural, catalytic, or binding other nucleic acid molecules and thus modulating processes like transcription and chromatin state regulation). ncRNA genes can range from shorter, more conserved, and better functionally characterized species that have a well-established structural or catalytic role (such as small nuclear RNA, snRNA) or a regulatory role (such as miRNA), to more heterogeneously conserved and characterized lincRNA (long intergenic non-coding RNA, lincRNA) (ENCODE Project Consortium 2012; Palazzo and Lee 2015). Variants impacting the former ncRNA category can be efficiently scored using genomic conservation, although specialized models may be beneficial for miRNA; in contrast, for the latter category impact prediction is more challenging and conservation-based criteria may only be partially useful (Chodroff et al. 2010; Derrien et al. 2012).

In general, ncRNA variants causing Mendelian disorders are not many (Quek et al. 2015), and variant effect evaluation may become more efficient only as more sophisticated impact prediction models are developed and more information is accumulated at the gene level. It is also possible that the majority of miRNA and lincRNA variants cause molecular process alterations insufficient for Mendelian disorder causation, but sufficient for pathogenicity contribution in cancer and other complex disorders, whereas only a few highly conserved ncRNA that have a key role in biological processes can effectively contribute to Mendelian disorders (e.g., the small nuclear RNA gene RNU4ATAC (Edery et al. 2011; He et al. 2011)).

Gene transcriptional regulation effect and impact prediction

Transcriptional regulation relies on several intertwined molecular layers: chromatin state and DNA methylation control DNA accessibility to transcription factors and the transcriptional machinery, with different functional elements (e.g., active promoters, enhancers, insulators) characterized by different histone mark combinations; in turn, specific transcription factors can initiate changes in the local chromatin and DNA methylation state. The completion of large-scale projects such as ENCODE and Roadmap Epigenomics (ENCODE Project Consortium 2012; Kundaje et al. 2015) has resulted in a wealth of data for different cell types, profiling binding-accessible chromatin sites (using DNase-seq or ATAC-seq), histone marks, and transcription factor binding (using ChIP-seq). Chromatin states are typically inferred using histone marks and other available data (such as methylation and binding of specific factors) (Ernst and Kellis 2012; Hoffman et al. 2012).

Certain predictive models focus on transcription factor binding affinity changes induced by sequence changes; position-specific weight matrices have been broadly used to model transcription factor sequence specificity (Wasserman and Sandelin 2004), but recently more powerful models based on deep learning have been introduced (e.g., DeepBind, (Alipanahi et al. 2015)). Additional information, such as occurrence of the predicted binding event in other species or genomic sequence conservation can be used to refine predictions (Wenger et al. 2013). Cell-specific inferred chromatin states are also useful to put these predictions in context.

Other predictive methods are trained to directly learn the sequence specificity of selected functional regions, such as DNaseI hypersenstivity sites (expected to correspond to open chromatin protein binding sites) or enhancers (Lee et al. 2015; Zhou and Troyanskaya 2015; Kelley et al. 2016).

Mapping variants to target genes requires more effort compared to other impact predictors. Since chromatin looping can establish physical proximity to target genes for distal regulatory elements, experimentally determined (Denker and de Laat 2016) and predicted chromatin contact maps (Whalen et al. 2016; Zhu et al. 2016) are important.

While a limited number of variants have been implicated in Mendelian disorders so far (Deplancke et al. 2016), this type of variants may have a significant and under-recognized role in Mendelian disorders, perhaps especially in compound heterozygosity with a strong-effect variant (Zhang and Lupski 2015). Despite the increasing availability of predictive models, there are persisting challenges to overcome: (i) experimental benchmarks critical for predictor evaluation are still being defined (e.g., (Maurano et al. 2015)), (ii) the presence of complex interactive effects between different transcription factors prevents a simple reductionist approach to the prediction problem, and (iii) the redundancy of transcriptional regulation prevents the loss of a specific transcription factor binding site may be insufficient to produce abnormal phenotype (Deplancke et al. 2016). Structural variants deleting regulatory elements have also been reported as causing Mendelian disorders (Weedon et al. 2014), and their impact may be easier to interpret, especially when deleting larger portions of the genome, as long as transcriptional regulatory sequence alteration can be correctly mapped to its target gene(s) (Zhang and Lupski 2015).

Genomic conservation measures

The availability of reference sequences for many other mammalian and vertebrate species enables using their multiple sequence alignments for deriving genomic conservation scores. While these scores are overall highly correlated with the presence of coding exonic sequence, they can help discriminate more diverged genes and identify elements outside of coding exons expected to have conserved function (e.g., UTRs, intronic or intergenic regulatory elements, ncRNA genes). The most popular resources are PhyloP (LRT test) (Pollard et al. 2010), PhastCons (Siepel et al. 2005), and GERP++ (Davydov et al. 2010). PhyloP (LRT test) provides a position-specific conservation score, PhastCons provide a regional conservation score used to identify conserved elements, and GERP++ provides both a positional and regional conservation score. In particular, GERP++ conserved elements have been demonstrated to be less fragmented than PhastCons (Davydov et al. 2010). In addition, currently available GERP++ is based only on mammalian genomes, whereas PhyloP (LRT test) and PhastCons are available for primate, mammalian, and vertebrate genomes. Conservation based on mammalian genomes is overall less statistically powerful but is able to identify mammalian-specific elements.

Meta-predictors

Meta-predictors aim to optimally combine different predictors or to train a more comprehensive model, including features used by other predictors as well as impact scores returned by other predictors (e.g., CADD (Kircher et al. 2014), Eigen (Ionita-Laza et al. 2016)). While valuable (especially for less sophisticated users or for statistical methods requiring a single impact score), meta-predictors also present the following issues: (i) challenges in the definition of impact benchmarks prevent from definitively establishing if the meta-predictor is better than each individual specialized predictors (Grimm et al. 2015) and (ii) effects that have stronger impact than others may dominate the predictive performance and make the predictive model less sensitive to more subtle changes.

Gene information: disease phenotype, mode of inheritance, genetic constraint, and prioritization of new disease genes

So far we have reviewed in high detail variant effects on gene products and models predicting their impact. While a variant may be very rare and have a highly damaging impact on the gene product, the corresponding gene may be dispensable, thus resulting in no (or minor) phenotypic abnormality, or the variant may be heterozygous and the gene autosomal recessive, resulting in no phenotype in absence of other damaging or at least hypofunctional variants. Therefore, it is important to additionally rank variants based on the likelihood that a gene alteration will produce the disease phenotype of interest, and to also consider if the variant zygosity and the gene’s known mode of inheritance can cause disease based on simple Mendelian genetic rules. Quantitative models of genetic constraint based on genetic variation in the human population can be used to independently validate the known mode of inheritance and also prioritize new disease genes. Prioritization of new disease genes additionally benefits from known abnormal phenotypes in model organism (caused by knock outs or other engineered genetic constructs), annotated gene function, and experimentally determined gene product interaction networks.

Gene information: disease phenotype

In a clinical diagnostic setting, only variants on established Mendelian disease genes are considered. In a discovery setting, it is reasonable to first consider variants that are likely to disrupt established Mendelian disease genes. In both cases, it is desirable to rank genes by how well their known phenotypic spectrum matches with the patient’s disease.

The Human Phenotype Ontology (HPO) (Kohler et al. 2014) is the main resource for modelling genes’ known phenotypic spectrum in Mendelian disease (Smedley and Robinson 2015). It consists of a structured controlled vocabulary with terms connected by formal relations; the ontology is composed of 3 independent sub-ontologies that capture the mode of inheritance, the onset and clinical course, and phenotypic abnormalities. Gene-HPO term annotations are derived from OMIM, Orphanet, and DECIPHER using automated text mining procedures integrated by human expert curation. Since HPO terms capturing phenotypic abnormalities decompose disease into phenotypic elements at different granularity levels (e.g., “Immunodeficiency” is a subclass of the broader class “Abnormality of the immune system”), they can be used by algorithms computing the degree of phenotypic match between the gene and the patient’s presentation (Kohler et al. 2009). Other available algorithms for phenotypic matching take into account additional evidence such as model organism abnormal phenotypes for orthologous genes, allele frequency, variant effect and impact, gene function and interaction networks, family segregation and mode of inheritance (Phevor (Singleton et al. 2014), Exomiser (Smedley et al. 2015), PhenIX (Zemojtel et al. 2014), and comparative performance reviewed by Smedley and Robinson (2015)). These algorithms are more appropriate for discovery rather than clinical diagnostic settings; in addition, while they offer a final aggregate rank, they may give less control to the user in fine tuning different lines of evidence or modeling additional effects and impact predictors. More details are presented in the “Gene annotation: prioritizing new disease genes” subsection.

Gene mode of inheritance and variant zygosity required for disease causation

The concept of dominance is central to Mendelian genetics. Disease causation for genes with a reported autosomal recessive mode of inheritance requires a damaging homozygous variant or compound heterozygosity with (at least) 2 damaging variants. X-linked genes are typically not dominant, and hemizygous damaging variation in males will cause disease, while heterozygous females will be normal or present significantly attenuated disease phenotype (with the exception of skewed X inactivation). While disease for autosomal recessive genes and X-linked genes is typically caused by loss-of-function variation, the landscape is more varied for dominant genes: heterozygous loss-of-function (i.e., haploinsufficient), gain-of-function, and more rarely dominant-negative or toxic-gain-of-function can all be causative of disease, with specific genes or sometimes specific domains within genes following the same mechanism. Mode of inheritance is available from CGD (Clinical Genomic Database) (Solomon et al. 2013) and HPO. However, the detailed indication of mechanism (loss-of-function, gain-of-function, and more rarely dominant-negative or toxic-gain-of-function) is not available and needs to be mined from the literature or textual summaries of the literature available in authoritative databases such as OMIM.

Gene information: modelling genetic constraint

Measures of genetic constraint can be used to select genes that are under negative selection for variation and thus more likely to produce a phenotype if their gene product is altered. Their use is particularly helpful for discovering new disease genes, although they can also be used to flag Mendelian genes with limited constrained (thus more likely to be prone to predicted damaging yet benign variation) or further refine dominant genes based on sensitivity to truncating (i.e., haploinsufficient) as opposed to missense variants. For new disease gene discovery, constraint metrics are complementary to phenotypic information in model organisms and functional information from Gene Ontology (Gene Ontology Consortium et al. 2013), pathways, and networks. Haploinsufficiency prediction is a particularly interesting problem because model organism phenotypes are more often available only for homozygous loss-of-function constructs, and established dominant Mendelian genes are not characterized as haploinsufficient or gain-of-function by available resources (CGD, HPO).

The haploinsufficiency prediction model proposed by Huang et al. (2010) uses genomic characteristics and gene network interactions as predictive features; known haploinsufficient genes mined from the literature constitute the positive training set, whereas genes with common copy number losses constitute the negative training set. Most of the predictive performance is driven by network proximity to known haploinsufficient genes, thus questioning the validity of the model, which may be highly biased towards interacting clusters of known haploinsufficient genes (Steinberg et al. 2015). A better alternative is offered by the ExAC pLI (Lek et al. 2016), which models the probability of a gene being intolerant to truncating loss-of-function variation (i.e., haploinsufficient); the pLI estimate is derived from the difference between the expected and observed number of truncating loss-of-function variants in ExAC, where the expected number is based on a background mutational model including the variants’ trinucleotide context and corrected for covariates like exome sequencing depth (Samocha et al. 2014; Lek et al. 2016).

The genic intolerance (GI) score (Petrovski et al. 2013) models the constraint of a gene to coding sequence alterations, without distinguishing between recessive and dominant genes or between different effects (i.e., truncating loss-of-function versus more localized amino acid changes). The GI is derived from regression residuals obtained by comparing the gene-wise number of common coding sequence altering variants to the gene-wise total number of coding variants from the NHLBI-ESP6500 exome dataset (Tennessen et al. 2012); this assumes that synonymous variants are never damaging. The ExAC missense constraint z-score probably represents a better alternative, considering ExAC has about 10× more exomes than NHLBI-ESP6500, and the missense constraint score follows a dominant model. Similar to pLI, this score is derived from the difference between the expected and observed number of missense variants in ExAC; however, compared to pLI it is more correlated to gene length. While no constraint score is available for gain-of-function dominant variation, genes with high ExAC missense constraint and low truncating loss-of-function constraint (pLI) can be presumed to mainly cause disease via gain-of-function missense variation.

Other measures of constraint have also been proposed but their use has been more limited or specific to certain variant types (Uddin et al. 2014; Telenti et al. 2016).

Gene information: prioritizing new disease genes

Genes that lack a disease phenotype in humans may have orthologs investigated in model organisms using genetic constructs like homozygous knock-out. One of the most valuable resources is phenotypic abnormality annotations provided by MGI (Mouse Genome Informatics) (Eppig et al. 2015), based on the Mammalian Phenotype Ontology (Smith and Eppig 2012). Since mouse genetic constructs are most often available as homozygous, this resource is particularly valuable for the discovery of novel recessive and X-linked genes.

In absence of direct phenotypic information, additional functional genomics resources can be leveraged to prioritize genes: (i) functional gene-sets, (ii) pathways, (iii) gene interaction networks, and (iv) genetic constraint scores. Gene Ontology (Gene Ontology Consortium et al. 2013) provides a controlled vocabulary for functional annotations, and corresponding gene annotations are available from human gene databases such NCBI Entrez Gene; Gene Ontology annotations can be mined for immune-related functions such as “pre-B cell differentiation” (GO:0002329). Pathway databases provide more structured information than Gene Ontology: pathways consist of metabolic, signalling, or other regulatory molecular processes represented using directed causal interactions among gene products and small molecules (Cary et al. 2005). For instance, the pathway database KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa et al. 2016) has 16 immune pathways, such as the B-cell and T-cell receptor signalling pathways (hsa04662, hsa04660). Other commonly used pathways databases that include immune pathways are Reactome (Fabregat et al. 2016), WikiPathways (Kutmon et al. 2016) and NetPath (Kandasamy et al. 2010). Pathways are particularly useful for highly focused searches in presence of a strong hypothesis about what metabolic, signalling, or regulatory process is presumed to be altered in the patient. In addition, gene interaction networks (protein–protein physical interaction networks, co-expression, etc.) can be leveraged to expand a set of input genes into a larger set of functionally related genes; when this approach is applied to predict new disease genes from known disease genes, it follows the principle of “guilt by association” (Tranchevent et al. 2011; Wang et al. 2011; Smedley et al. 2014). While gene interaction networks do not present the high level of curation and detail as pathways, they cover many more genes. Manual searches can be carried out using visualization tools like GeneMANIA (Zuberi et al. 2013). In addition, some of the previously reviewed prioritization models utilizing phenotypic match plus additional evidence take advantage of Gene Ontology functional annotations and (or) network interactions (Phevor (Singleton et al. 2014), Exomiser (Smedley and Robinson 2015)).

Study design and diagnostic yield

Study design considerations

For disorders prevalently caused by dominant de-novo variants (mainly observed for neurodevelopmental disorders like intellectual disability (Veltman and Brunner 2012)) the trio design (i.e., parents and affected proband) is the most appropriate (Bamshad et al. 2011; MacArthur et al. 2014). For larger cohorts, statistical modelling capturing gene mutability (Samocha et al. 2014) can be leveraged to identify genes with a significant de-novo mutation load (MacArthur et al. 2014; Deciphering Developmental Disorders Study 2015; Ware et al. 2015).

For dominant inherited disorders, large multi-generation families are the most suitable for variant identification. Genotyping some family members on linkage array (Botstein and Risch 2003) instead of using whole exome or whole genome for all members is cost effective (MacArthur et al. 2014); similarly, it is most beneficial to sample most-distally related individuals within the family (Bamshad et al. 2011).

Autosomal recessive disorders are usually easier to solve (Bamshad et al. 2011); it is important to distinguish between presence and absence of consanguinity. In the presence of consanguinity, the variant is likely to be homozygous and to be embedded within a homozygosity stretch; minimally, one affected individual should be sequenced. Adding the parents, an unaffected sibling, and another affected sibling helps reduce the number of candidate variants and the amount of follow-up work. Finally, genotyping some family members on linkage array instead of using whole exome or whole genome is cost effective. In the absence of consanguinity, the variant can be homozygous or compound heterozygous, and it is recommended to sequence the parents and one affected or unaffected sibling in addition to the affected individual. In both circumstances, sequencing multiple unrelated families strengthens the confidence in the finding, especially for novel Mendelian genes.

For X-linked disorders, sequencing the affected male individual is typically sufficient.

Diagnostic yield considerations

The average diagnostic yield for whole exome sequencing in cohorts of patients referred for clinical sequencing is estimated at 25% (Yang et al. 2014); this captures truncating loss-of-function and amino acid changes, whereas contribution of splicing sequence and UTR changes may be under-estimated. Additionally evaluating copy number variants increases diagnostic yield from 25% to 35% for cohorts with congenital abnormalities and (or) developmental delay, and whole genome sequencing can lead to reliable detection of pathogenic copy number changes (Stavropoulos et al. 2016).

The contribution of UTR, splicing sequence, and transcriptional regulatory sequence changes has been estimated for complex disorders in preliminary studies (Yuen et al. 2016), but there is no robust estimate for Mendelian disorders. The contribution of structural variants to diagnostic yield is still being assessed (Noll et al. 2016).

Conclusions and future directions

Challenges

Coding variant effects are biologically well understood, there is a wealth of models for impact predictions and guidelines for clinical variant classification. Nonetheless, amino acid change impact prediction is not a solved problem (MacArthur et al. 2014; Grimm et al. 2015); distinction between loss-of-function and gain-of-function is a particularly important unsolved problem. Databases with clinically classified variants contain errors (MacArthur et al. 2014). Finally, while there is a consensus that an integrated probabilistic model should be used for clinical and discovery variant evaluation, no such consensus model has been developed (MacArthur et al. 2014).

Whole genome sequencing poses additional challenges. Impact predictors are available for splicing sequence changes and transcriptional regulation changes, but the lack of large and well-established evaluation benchmarks and their underrepresentation in clinical classification databases hamper performance assessment (Smedley et al. 2016); in addition, more advanced predictors than currently available may be required to achieve optimal performance. Comprehensive impact predictors are not available for UTR changes or long non-coding RNA yet. For these 4 effect categories, it is difficult to anticipate how many variants may have a strong enough impact to contribute to Mendelian disorder causation, as opposed to disorders with more complex architecture.

Opportunities: using RNA-seq and epigenetic profiling to facilitate diagnostics and discovery

RNA-seq can be leveraged to aid the detection of splicing-altering variants: in the presence of RNA-seq from control individuals and patients for a disease-relevant tissue, splicing junction detection can reveal junctions present or absent in 1 or more patients but not in the controls (Cummings et al. 2016). This approach can also help detect variants missed by exome capture or suggest the presence of structural variation (Cummings et al. 2016).

In addition, variants damaging core splicing machinery can lead to widespread splicing alteration that can be readily detected by RNA-seq (Argente et al. 2014; Merico et al. 2015a). Following a similar strategy, genome-wide profiling of epigenetic marks by ChIP-seq or methylation arrays may help revealing widespread alterations caused by genetic variants disrupting master regulators of DNA methylation and chromatin modification (Yuen et al. 2016). These approaches can be particularly powerful when the impact of genetic variants is more difficult to predict.

Variants at the interface between Mendelian and more complex genetic architectures

Low-penetrance or variable-expressivity dominant Mendelian variants may benefit from analytical strategies developed for complex disorders. Burden tests can reveal additive effects at the pathway level (Bansal et al. 2010). Polygenic risk score imparted by common variants, which can be calculated in presence of GWAS results on large cohorts, (International Schizophrenia Consortium et al. 2009), can lower the risk threshold for variants with lower penetrance (Merico et al. 2015c).

Acknowledgements

Daniele Merico is a full-time employee of Deep Genomics Inc. and has a scientific affiliation to The Centre for Applied Genomics (TCAG) at The Hospital for Sick Children. Special thanks to Prof. Chaim M. Roifman, Prof. Stephen W. Scherer, Dr. Christian R. Marshall, Dr. Ryan K.C. Yuen, and other TCAG team members for highly valuable scientific discussions on Mendelian and complex disorder genetics research. Special thanks also to Prof. Brendan J. Frey, Dr. Babak Alipanahi, Dr. Andrew Delong, Dr. Jinkuk Kim, Dr. Mark Sun, Omar Wagih, Michael Wainberg, Dr. Hui Yuan Xiong, and other Deep Genomics team members for highly valuable scientific discussions on quantitative modelling of splicing and non-coding variant effects.

REFERENCES

Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P., Kondrashov A.S., and Sunyaev S.R.2010. A method and server for predicting damaging missense mutations. Nat. Methods.7(4):248–249.

Create a new account

Request Username

Change Password

Your password must have 8 characters or more and contain 3 of the following:

Password Changed Successfully

LOGIN TO YOUR ACCOUNT

Verify Phone

Congrats!

Abstract

Sequencing and variant detection

Overview of next generation sequencing of whole exome and genome

Whole exome versus whole genome sequencing

Sequencing platforms

Variant annotation and prioritization

Sequencing quality

Allele frequencies

Clinical variant classification databases

Gene product effect and impact

Coding gene effects: overview

Coding gene effects: truncating loss-of-function variants

Coding gene effects: missense and in-frame amino acid changes

Coding gene effects: splicing sequence changes

Coding gene effects: UTR changes

Non-coding gene effects and impact prediction

Gene transcriptional regulation effect and impact prediction

Genomic conservation measures

Meta-predictors

Gene information: disease phenotype, mode of inheritance, genetic constraint, and prioritization of new disease genes

Gene information: disease phenotype

Gene mode of inheritance and variant zygosity required for disease causation

Gene information: modelling genetic constraint

Gene information: prioritizing new disease genes

Study design and diagnostic yield

Study design considerations

Diagnostic yield considerations

Conclusions and future directions

Challenges

Opportunities: using RNA-seq and epigenetic profiling to facilitate diagnostics and discovery

Variants at the interface between Mendelian and more complex genetic architectures

Acknowledgements

REFERENCES

Information

Published In

History

Copyright

Authors

Affiliations

Metrics

Other Metrics

Citations

Cite As

Export Citations

Cited by

View options

PDF

Login options

Subscribe

Purchase options

Restore your content access

Share

Share the article link

Share on social media