Even though their existence is acknowledged, the impact of primary sequence errors on evolutionary inference is poorly characterized. In a first step to fill this gap, we have developed a program called HmmCleaner, which detects and eliminates these errors from MSAs. Using real data from vertebrates, we observed that segment-filtering methods improve the quality of evolutionary inference more than the currently used block-filtering methods.
The former were especially effective at improving branch length inferences and at reducing the false positive rate during detection of positive selection. Segment-filtering methods such as HmmCleaner accurately detect simulated primary sequence errors. Our results suggest that these errors are more detrimental than alignment errors. However, they also show that stochastic sampling error is predominant in single-gene evolutionary inferences. Therefore, we argue that MSA filtering should focus on segment removal instead of block removal, and that more studies are required to find the optimal balance between the accuracy improvement and the stochastic error increase brought by data removal.
Evolutionary studies require the identification of homologous characters. Except for highly divergent proteins, the recognition of homologous protein-coding genes is generally straightforward because the availability of many positions provides enough statistical power, at least some protein regions being well conserved. In contrast, the identification of homology at the residue level, through multiple sequence alignment (MSA), is more difficult. Limited statistical power and low similarity may generate ambiguously aligned regions (AARs).
Due to the high combinatorics of sequence alignment, some parts of AARs are expected to be aligned wrongly more often than correctly. Despite efforts in improving alignment methods [ 1 ], errors still affect MSAs and may negatively impact subsequent analyses. During phylogenetic inference, they generate a non-phylogenetic signal, conflicting with the genuine historical phylogenetic signal in the data [ 2 , 3 ]. Their presence also inflates estimates of positive selection [ 4 , 5 ].
A common approach to reduce the impact of alignment errors is a posteriori filtering of MSAs. Several software packages [ 6 — 12 ] were designed to identify AARs based on various criteria, such as the stability of the MSA to the guide tree [ 12 ] or the validation of a set of rules dependent on the conservation pattern [ 6 , 8 , 9 ]. AARs are expected to contain non-homologous residues in most sequences, but also genuine homologous residues in the remaining sequences.
Removal of AARs is therefore expected to decrease both the non-phylogenetic and the phylogenetic signal, but the former more than the latter. Some studies suggest that block-filtering software improves evolutionary inference [ 13-16 ], whereas other authors find support for the opposite [ 17 , 18 ]. Another source of noise in MSAs is primary sequence errors. Fundamentally different from alignment errors, primary sequence errors (especially those affecting only one or a few sequences) are unlikely to be removed by block-filtering programs, unless they are included within AARs.
To properly handle such errors, filtering software should be designed to remove amino-acid segments sequence by sequence, instead of block by block. Besides, primary sequence errors provide a strong non-historical signal that is more likely to bias evolutionary estimates e. Accordingly, a few studies have shown that they can be a source of erroneous signal [ 3 ] or even drive alignment errors [ 5 ].
Yet, this aspect is generally not taken into account when analyzing MSAs. In fact, nothing is known about the relative importance of primary sequence errors versus alignment errors in evolutionary analysis of real MSAs. Here, we present HmmCleaner, a program dedicated to the detection and removal of primary sequence errors in multiple alignments of protein sequences. It looks for low similarity segments specific to one sequence, using a profile hidden Markov model (pHMM) built from the whole alignment with HMMER [ 19 ]. In the following sections, we first introduce the HmmCleaner principle.
Then we explain the optimization of its parameters, characterize its performance by simulating primary sequence errors, and compare it to PREQUAL [ 20 ], a recently released software package with a similar approach based on pairHMM. Next, we address the effect of filtering software on evolutionary analysis. First, we determine whether the use of HmmCleaner avoids the erroneous detection of positive selection when frameshift errors have been deliberately introduced. Second, using empirical datasets, we compare the effect of segment- and block-filtering methods on evolutionary inferences (single-gene phylogenetic reconstruction and branch length estimation), as a first insight into the relative impact of alignment errors and primary sequence errors.
Overview of the four steps of HmmCleaner. In the diagram representing the pHMM, squares correspond to main states of the model, whereas diamonds are insertion states and circles deletion states. Alignment of one envelope of a given sequence of the MSA.
To optimize the four parameters of the scoring matrix, we developed a simulator that introduces primary sequence errors into existing MSAs. The principle is to take a genuine protein-coding alignment of nucleotide (nt) sequences and to randomly introduce a unique error in a specified number of sequences.
The resulting sequences are then translated into amino acids (aa) before realignment. Here, we chose to generate frameshift errors, each one followed by the opposite compensatory mutation after a predefined number of out-of-frame codons. This approach allowed us to use multiple alignments of true protein sequences resulting from real evolutionary processes, with only the primary sequence errors being simulated, contrary to Whelan et al.
Mean sensitivity and specificity of HmmCleaner towards detection of primary sequence errors introduced in unambiguously aligned regions (UARs).
Each dot corresponds to the two means of the values obtained across 80, simulations and 3 operational definitions of UARs for one of the combinations of the 4 parameters of the scoring matrix. The four new scoring matrices provided with HmmCleaner v2 and the scoring matrix equivalent to HmmCleaner v1. Global sensitivity and specificity were computed across all conditions of simulation, whereas the last two columns only used the most species-rich MSAs.
Given these results and the computational burden implied by the leave-one-out strategy, we decided to stick to the complete strategy in the remainder of this article.
Impact of the length and number of primary sequence errors, and of the prokaryotic lineage, on sensitivity (a, c, e) and specificity (b, d, f) of HmmCleaner used with the default scoring matrix. Effect of primary sequence error length. Effect of the number of primary sequence errors. Effect of the prokaryotic lineage. Box-plots were computed across all considered MSAs and values are means averaged over the different conditions of simulation.
Our hypothesis is that more errors increase the probability of having overlapping identical errors.
As PREQUAL considers the best posterior probability per residue across a series of closely related sequences, only one identical residue is enough to consider it as correct. High-resolution analysis of the impact of the length of primary sequence errors on the sensitivity of HmmCleaner used with the default scoring matrix. The plain line represents the improvement in mean sensitivity with increasing error length, while error bars show the variability across , simulated primary sequence errors in MSAs.
In particular, short errors were more easily detected in gap-rich regions (Additional file 1: Figure S6A) than in fast-evolving regions (Additional file 1: Figure S6B). In gappy regions, a possible explanation for the good sensitivity could be that there are only a few sequences to locally define the pHMM. Consequently, HMMER expects the presence of a highly specific segment of amino acids and is thus more severe when the observed segment does not correspond. Regarding the worse sensitivity in fast-evolving regions, our interpretation is that the pHMM is less specific (flat profile) and can more easily accommodate any divergent segment, including primary sequence errors.
HmmCleaner thus accurately detects all simulated errors except the shorter ones, in all types of regions. Sensitivity and specificity of filtering software over different error types. Segment-filtering methods were developed based on the hypothesis that block-filtering methods are not suited to detecting primary sequence errors. The performance of BMGE was not surprising, since the insertion of a random segment typically constitutes a divergent block.
In contrast, the specificity of block- and outlier-filtering methods (Table 2) was generally higher than the specificity of segment-filtering methods. Indeed, as expected from their rationale, methods that filter outlier blocks or outlier sequences appear by design far less sensitive to primary sequence errors than segment-filtering methods.
As shown in Fig. When genuinely homologous segments are highly divergent, i. Accordingly, specificity was higher in UARs Fig. This confirms that its low specificity is due to evolutionary divergence. Such a negative correlation between sequence divergence and HmmCleaner specificity can be due to: (i) the presence of overlooked primary sequence errors in our datasets, (ii) the presence of alignment errors that would result in detection errors, (iii) the detection of segments corresponding to insertion events, or (iv) the detection of homologous but divergent segments that look like primary sequence errors (see above).
Yet, this was not the case Fig. Similarly, for hypothesis two to be true, we would expect an important impact of the aligner software on the false positive rate. This was not the case either Additional file 1 Figure S3.
Therefore, the last two hypotheses should explain most of the observed false positives. Detection of positive selection in the presence of primary sequence errors. A detailed manual analysis of the MSAs in which the signal for positive selection had disappeared generally found the presence of structural annotation errors that were correctly detected and removed by HmmCleaner. The remaining 3. Importantly, the use of HmmCleaner on the MSAs in which we had introduced primary sequence errors drastically reduced the detection of positive selection 7. This value was slightly higher than the control 3.
Nonetheless, our simulations did not allow us to verify that HmmCleaner behaves correctly in real cases of positive selection. To this end, we selected MSAs with well-established presence of sites showing positive selection [ 27 , 28 ]. In contrast, they did on phytochrome genes, but subsequent analyses were as significant on the filtered MSAs as on the raw MSAs.
In conclusion, neither software package appeared to negatively impact the detection of true positive selection, at least at a small evolutionary scale. We also examined filtering methods that reduce the stochastic sampling error (removal of partial sequences and selection of the longest genes), as this type of error might be critical for single-gene inferences. Two aspects of phylogenetic inference were considered: tree topology and branch lengths. See the legend of Fig. Generally speaking, the effect of various filtering methods, including HmmCleaner, on phylogenetic accuracy was limited (Table 4).
HmmCleaner thus appears to discard almost exclusively segments that are poorly informative for inferring phylogeny, which is expected because it removes low similarity segments. Accordingly, a random removal of the same amount of data as HmmCleaner decreased accuracy more severely 1.
Moreover, studying the effect of HmmCleaner on each clade of the vertebrate phylogeny reveals that it slightly improved accuracy within clades mainly represented by species for which genomic data had been used (mammals and birds).
In contrast, the MAMMALIA dataset demonstrated the positive effect of using HmmCleaner on genomic-based datasets, which are more likely to contain annotation errors: accuracy improved from The same pattern was observed for aa sequences (Table 3). Finally, since segment- and block-filtering methods have different targets (primary sequence and alignment errors, respectively), it could be of interest to combine them, as already done in practice for recent large phylogenomic matrices [ 30 , 32 , 33 ].
These contrasting results illustrate the difficulty of data filtering: data loss increases stochastic error while decreasing reconstruction errors. When primary sequence errors are not negligible, the increase in stochastic error due to data filtering is outweighed by the reduction of non-phylogenetic signal.
More generally, the better performance of segment-filtering methods (HmmCleaner and PREQUAL) versus block-filtering methods (BMGE and TrimAl) suggests that primary sequence errors (especially annotation errors) are more detrimental to phylogenetic inference than alignment errors. See Additional file 1 : Table S2 for mean values.
Finally, we examined the effect of filtering software on the branch lengths of the concatenated trees. All pairwise comparisons e. We interpret these differences as the result of the removal of structural annotation errors. The negligible impact of filtering software on correlation coefficients in the case of concatenation is likely due to the law of large numbers.
In contrast, the tree length or total branch length of the concatenated trees was severely modified by all filtering software. For aa supermatrices, the tree length without filtering was 4. It decreased to 3. In agreement with their objectives, this suggests that filtering methods are efficient at removing the more divergent residues that increase tree length. In this article, we presented a new version of HmmCleaner, a software package that automatically identifies and removes low similarity segments in MSAs with the purpose of limiting the negative effect of primary sequence errors on evolutionary inferences.
The performance of our method was investigated through analyses of both simulated and empirical data. Its specificity to simulated errors is also high, with its false positives mostly corresponding to insertions or low similarity segments that would be difficult to handle in subsequent steps of analysis. We showed that segment-filtering software (HmmCleaner and PREQUAL) has more positive effects on evolutionary inferences (detection of branch-specific positive selection, topological accuracy and branch-length estimation) than the commonly used block-filtering software (BMGE and TrimAl).
This suggests that primary sequence errors are more detrimental to evolutionary analyses than alignment errors. Therefore, we argue that the efforts of the research community should address both alignment and primary sequence errors, in other words that more energy should be devoted to structural annotations. Given the pervasiveness of primary sequence errors, we recommend the use of segment-filtering methods in high-throughput analyses of eukaryotic genomic data. For now, HmmCleaner targets low similarity segments that are by nature difficult to align and therefore may decrease the frequency of alignment errors, possibly to the extent of making them negligible.
In this respect, the advantage of specifically removing erroneous segments instead of entire blocks is to reduce the amount of data lost for the subsequent analyses, hence limiting the rise in stochastic error, which we have shown to be the major limiting factor for the accuracy of single-gene phylogenies. HmmCleaner detects low similarity segments in four steps Fig. The pHMM is a model of the ancestral sequence that can generate all the observed sequences.
In our method, the pHMM can be built from either (i) all sequences of the MSA (complete strategy) or (ii) all sequences except the one being analyzed (leave-one-out strategy). Second, we estimate the probability that the pHMM generates each amino acid of a given sequence of the MSA, with the hypothesis that a primary sequence error will have a very low probability. To do so, each sequence of the MSA is evaluated with the pHMM using hmmsearch with default options, which yields profile-sequence alignments Fig. HMMER performs this step following a homology search heuristic, at the end of which it defines a set of subsequences (envelopes) estimated to fit a part of the pHMM.
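The difference between the two strategies can be illustrated with a minimal sketch. It only shows which sequences would be used to build the profile for each evaluated sequence; the function name `profile_training_sets` and the dictionary representation of the MSA are assumptions made for illustration, not part of HmmCleaner. In practice each training set would be passed to HMMER to build the pHMM, and the target sequence would then be evaluated with hmmsearch.

```python
def profile_training_sets(msa, strategy="complete"):
    """Yield (target_id, training_set) pairs for each sequence of an MSA.

    msa      -- dict mapping sequence id to aligned sequence (hypothetical layout)
    strategy -- "complete": the profile is built from all sequences;
                "leave-one-out": the evaluated sequence is excluded.
    """
    if strategy not in ("complete", "leave-one-out"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    for target in msa:
        if strategy == "complete":
            training = dict(msa)
        else:
            training = {k: v for k, v in msa.items() if k != target}
        yield target, training


toy_msa = {"sp1": "MKT-LV", "sp2": "MKTALV", "sp3": "MRT-LI"}
for target, training in profile_training_sets(toy_msa, "leave-one-out"):
    print(target, sorted(training))
```

The complete strategy needs a single profile for the whole MSA, whereas leave-one-out needs one profile per sequence, which is the computational burden mentioned in the text.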
Those alignments allow us to identify which segments of each sequence of the MSA are expected to have been generated by the pHMM, and within each segment, they provide the posterior probability that a specific amino acid has been generated by the pHMM as well as the level of match of each amino acid to the consensus of the pHMM. This is probably because posterior probability depends both on the quality of the match and on the quality of the alignment around the site while our method focuses solely on the match quality with the assumption that the alignment is correct.
Segments of this string corresponding to subsequences that do not fit the pHMM and thus are missing from HMMER output are filled with blank characters, so as to have a full-length representation of each sequence. The cumulative similarity score increases when the residue is expected by the pHMM and decreases otherwise Fig.
It is computed from left to right, starting at a maximal value of 1, representing a perfect fit to the pHMM, and it is bounded between 0 and 1 inclusive. Fourth, a low similarity segment is defined wherever the cumulative score reaches zero. Its start is set after the last position where the score was 1, while its end is defined by the last position of the segment where the score was null or by the end of the sequence Fig.
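The cumulative score and the segment definition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not HmmCleaner's code: the four match categories are encoded with hypothetical symbols, the increments in `TOY_SCORES` are invented values (the real scoring matrices are the ones optimized in the article), and the score is stored in hundredths (0 to 100) so that reaching exactly 0 or 1 does not depend on floating-point rounding.

```python
# Hypothetical per-residue increments, in hundredths of the 0..1 score.
# Symbols and values are illustrative assumptions only.
TOY_SCORES = {"*": +25, "+": +10, "-": -25, " ": -50}


def low_similarity_segments(match_string, scores=TOY_SCORES):
    """Return (start, end) index pairs (0-based, inclusive) of low similarity segments."""
    score = 100            # starts at the maximum: perfect fit to the pHMM
    last_full = -1         # last position where the score was at its maximum
    in_segment = False
    seg_start = last_zero = None
    segments = []
    for i, symbol in enumerate(match_string):
        # Update the cumulative score and clamp it to the allowed range.
        score = max(0, min(100, score + scores[symbol]))
        if score == 0:
            if not in_segment:
                in_segment = True
                seg_start = last_full + 1   # start just after the last maximum
            last_zero = i
        elif score == 100:
            if in_segment:
                segments.append((seg_start, last_zero))
                in_segment = False
            last_full = i
    if in_segment:
        # The score never recovered: the segment runs to the end of the sequence.
        segments.append((seg_start, len(match_string) - 1))
    return segments


print(low_similarity_segments("****    ****"))
```

With these toy values, an isolated mismatch lowers the score without creating a segment, whereas a run of mismatches drives it to zero and opens one.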
To optimize the parameters and study the performance of HmmCleaner, we created four datasets by assembling MSAs of protein-coding genes sampled from four different prokaryotic lineages (Alphaproteobacteria, Cyanobacteria, Euryarchaeota, and Crenarchaeota). We chose prokaryotes to minimize the presence of annotation errors, as these lineages are mostly devoid of introns, which simplifies the structural annotation of their genes.
Yet, a few structural annotation errors will likely subsist, in particular due to sequencing errors, incorrect start codon predictions, programmed ribosomal frameshifts and programmed transcriptional realignments [ 35 ]. Outliers were defined as sequences having a length shorter than the mean length minus 1.
In addition, we removed these outlier sequences from the retained orthogroups. To assemble the four final datasets of MSAs each, we selected at random orthogroups for each of the four lineages, and aligned their sequences with MAFFT 7. Since our simulations introduce frameshifts in nt sequences see below , we transferred the alignment gaps from protein sequences to the corresponding nt sequences.
To study the impact of HmmCleaner on evolutionary inferences, we used two additional datasets assembled from animal sequences. As both nt and amino-acid alignments were available for download, we used both types of sequences. The latter corresponded to the orthologous genes from Irisarri et al. Because these authors had used filtering software during their dataset construction, we had to re-apply their last step (selection of a single sequence per organism and construction of chimeric sequences when necessary, using SCaFoS [ 37 ]) on a pre-filtering version of the corresponding MSAs.
To study the properties of HmmCleaner, we developed a simulator designed to create primary sequence errors in protein MSAs. In a first step, it takes an existing protein-coding alignment of nt sequences and randomly introduces a primary sequence error in a specified number of sequences. Primary sequence errors can be of three types: (i) a frameshift followed by the opposite compensatory mutation after a predefined number of out-of-frame codons, (ii) a scrambled segment resulting from the shuffling of individual nucleotides over a predefined number of codons, or (iii) the arbitrary insertion of a segment shuffled as in (ii).
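As an illustration of error type (i), the sketch below introduces a frameshift into an ungapped, in-frame nt sequence and compensates it after a chosen number of out-of-frame codons. It assumes that the frameshift is a single-nucleotide deletion and that the compensatory mutation is the insertion of a random nucleotide; the function name and its parameters are hypothetical and do not come from the authors' simulator.

```python
import random


def introduce_frameshift(nt_seq, start_codon, n_codons, rng):
    """Delete one nt at the start of codon `start_codon`, then insert a random
    nt `n_codons` codons later so that the downstream reading frame is restored.

    Assumes `nt_seq` is ungapped and in frame; the output has the same length.
    """
    pos = 3 * start_codon
    if pos + 3 * n_codons + 1 > len(nt_seq):
        raise ValueError("frameshifted region does not fit in the sequence")
    shifted = nt_seq[:pos] + nt_seq[pos + 1:]       # frameshift (deletion)
    restore = pos + 3 * n_codons                    # end of the out-of-frame codons
    return shifted[:restore] + rng.choice("ACGT") + shifted[restore:]


rng = random.Random(42)
original = "ATGGCTAAACGTCTGGAAGTTCCGTAA"
mutated = introduce_frameshift(original, start_codon=2, n_codons=3, rng=rng)
print(original)
print(mutated)
```

Codons upstream of `start_codon` are untouched, the next `n_codons` codons are read out of frame, the codon that receives the inserted nucleotide may change, and everything downstream is back in frame.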
In a second step, HmmCleaner is run on the resulting MSA and the detected low similarity segments are compared to the locations of the simulated errors to quantify the number of true positives, false positives, false negatives and true negatives. To allow a fine-grained analysis of the behavior of HmmCleaner, our simulator further characterizes the context of each position of the original MSA by its gap frequency, substitution rate and conservation level, as determined by block-filtering software.
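The comparison between detected segments and simulated errors can be sketched at the residue level as below. This is a minimal illustration that assumes both are available as sets of residue positions within one sequence; the function names are hypothetical, and the article does not state that its counts are computed exactly this way.

```python
def confusion_counts(n_positions, simulated, detected):
    """Count true/false positives and negatives over residue positions.

    n_positions -- number of residues in the sequence
    simulated   -- iterable of positions carrying a simulated error
    detected    -- iterable of positions flagged as low similarity
    """
    sim, det = set(simulated), set(detected)
    tp = len(sim & det)
    fp = len(det - sim)
    fn = len(sim - det)
    tn = n_positions - tp - fp - fn
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


counts = confusion_counts(100, simulated=range(20, 40), detected=range(25, 45))
print(counts, sensitivity_specificity(*counts))
```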
More precisely, we used BMGE [ 9 ] at three different stringency settings strict, entropy cutoff of 0. To optimize the four parameters of the scoring matrix of HmmCleaner, we simulated frameshift errors on the two large datasets (Cyanobacteria and Euryarchaeota). For each nt MSA, 4 subsets of sequences of different sizes (5, 10, 25 and 50 sequences) were drawn times at random. On each of these samples, 1 to 5 sequences were randomly affected by a primary sequence error of length 10 to aa. HmmCleaner was then run on each resulting amino-acid MSA (complete strategy) under different combinations of its four parameters.
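The sampling design of this optimization can be sketched as a generator of simulation conditions. The subset sizes and the number of affected sequences come from the text; the number of random draws is left as a parameter (`n_draws`) because its value is not recoverable here, and the function name is a hypothetical choice for this illustration.

```python
import itertools
import random

SUBSET_SIZES = (5, 10, 25, 50)    # subset sizes given in the text
N_AFFECTED = (1, 2, 3, 4, 5)      # number of sequences receiving an error


def simulation_conditions(seq_ids, n_draws, rng):
    """Yield (size, n_errors, subset, affected) for every sampled condition."""
    for size, n_err in itertools.product(SUBSET_SIZES, N_AFFECTED):
        if size > len(seq_ids):
            continue                      # MSA too small for this subset size
        for _ in range(n_draws):
            subset = rng.sample(seq_ids, size)
            affected = rng.sample(subset, n_err)
            yield size, n_err, subset, affected


ids = [f"seq{i}" for i in range(12)]
conditions = list(simulation_conditions(ids, n_draws=2, rng=random.Random(1)))
print(len(conditions))
```

Each yielded condition would then receive its simulated errors and be evaluated under every combination of the four scoring parameters.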
These ranges were defined based on preliminary simulations aimed at thoroughly exploring the zone of high specificity and high sensitivity. To ensure that our parameter optimization was robust, we studied the impact of introducing variations in our simulation protocol. Second, we compared the results obtained with different operational definitions of UARs.
Third, we considered the potential impact of the number of sequences in the MSAs 5, 10, 25 or In this case, we expected a larger effect owing to the dependence of pHMM statistical power on the amount of observations available to build the models. In contrast, sensitivity was more affected, and the correlation coefficient dropped to 0.
It is, in fact, the same transformation as zROC, below, except that the complement of the hit rate, the miss rate or false negative rate, is used. This alternative spends more graph area on the region of interest. Most of the ROC area is of little interest; one primarily cares about the region tight against the y-axis and the top left corner — which, because of using miss rate instead of its complement, the hit rate, is the lower left corner in a DET plot.
Furthermore, DET graphs have the useful property of linearity and a linear threshold behavior for normal distributions. The analysis of the ROC performance in graphs with this warping of the axes was used by psychologists in perception studies halfway through the 20th century, where this was dubbed "double probability paper". If a standard score is applied to the ROC curve, the curve will be transformed into a straight line.
In memory strength theory, one must assume that the zROC is not only linear, but has a slope of 1. The normal distributions of targets (studied objects that the subjects need to recall) and lures (non-studied objects that the subjects attempt to recall) are what causes the zROC to be linear. The linearity of the zROC curve depends on the standard deviations of the target and lure strength distributions. If the standard deviations are equal, the slope will be 1. If the standard deviation of the target strength distribution is larger than the standard deviation of the lure strength distribution, then the slope will be smaller than 1.
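The relation between the two standard deviations and the zROC slope can be checked numerically with a small sketch. It assumes normally distributed target and lure strengths, with the lure distribution centered on 0, and a response criterion `c` swept over a few values; under these assumptions the slope equals the lure standard deviation divided by the target standard deviation. The function names and the parameter values are illustrative.

```python
from statistics import NormalDist


def zroc_points(mu_target, sd_target, sd_lure, criteria):
    """Return (z(false-alarm rate), z(hit rate)) for each response criterion."""
    std = NormalDist()                        # standard normal, for z-scores
    target = NormalDist(mu_target, sd_target)
    lure = NormalDist(0.0, sd_lure)
    points = []
    for c in criteria:
        hit = 1.0 - target.cdf(c)             # P(strength > c | target)
        false_alarm = 1.0 - lure.cdf(c)       # P(strength > c | lure)
        points.append((std.inv_cdf(false_alarm), std.inv_cdf(hit)))
    return points


def zroc_slope(points):
    """Slope of the line through the first and last zROC points."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (y1 - y0) / (x1 - x0)


criteria = [-0.5, 0.0, 0.5, 1.0, 1.5]
print(zroc_slope(zroc_points(1.0, 1.0, 1.0, criteria)))    # equal SDs
print(zroc_slope(zroc_points(1.0, 1.25, 1.0, criteria)))   # wider target SD
```

With equal standard deviations the slope is 1 up to rounding, and a target distribution wider than the lure distribution gives a slope below 1, as stated above.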
In most studies, it has been found that the zROC curve slopes constantly fall below 1, usually between 0. A slope of 0. The z-score of an ROC curve is always linear, as assumed, except in special situations. The Yonelinas familiarity-recollection model is a two-dimensional account of recognition memory. Instead of the subject simply answering yes or no to a specific input, the subject gives the input a feeling of familiarity, which operates like the original ROC curve.
What changes, though, is a parameter for Recollection (R). Recollection is assumed to be all-or-none, and it trumps familiarity. If there were no recollection component, zROC would have a predicted slope of 1. However, when adding the recollection component, the zROC curve will be concave up, with a decreased slope. This difference in shape and slope results from an added element of variability due to some items being recollected.
Patients with anterograde amnesia are unable to recollect, so their Yonelinas zROC curve would have a slope close to 1. For these purposes they measured the ability of a radar receiver operator to make these important distinctions, which was called the Receiver Operating Characteristic.
In the s, ROC curves were employed in psychophysics to assess human and occasionally non-human animal detection of weak signals. In radiology , ROC analysis is a common technique to evaluate new radiology techniques. ROC curves are widely used in laboratory medicine to assess the diagnostic accuracy of a test, to choose the optimal cut-off of a test and to compare diagnostic accuracy of several tests. ROC curves also proved useful for the evaluation of machine learning techniques. The first application of ROC in machine learning was by Spackman who demonstrated the value of ROC curves in comparing and evaluating different classification algorithms.
ROC curves are also used in verification of forecasts in meteorology.
Given the success of ROC curves for the assessment of classification models, the extension of ROC curves for other supervised tasks has also been investigated. Also, the area under RROC curves is proportional to the error variance of the regression model.