• Byonic FAQs

    Is Byonic a search engine like Mascot, SEQUEST, etc.?
    Yes, Byonic can do anything Mascot, SEQUEST, etc. can do, and more besides.

     

    Why is Byonic better?
    Byonic extends the state of the art in two ways: its scorer is more accurate, and it allows a much wider range of search possibilities. The “scorer” is the algorithm that matches peptides to mass spectra by comparing the predicted fragmentation of the peptide with the observed peaks in the spectra, taking into account the precursor mass accuracy and precision. Byonic incorporates a substantial amount of chemical knowledge into its fragmentation prediction, such as reduced CID fragmentation on the C-terminal side of proline and strong neutral losses from certain modifications. This expert-system knowledge means that Byonic is more sensitive (more true positives) and more specific (fewer false positives) than other search engines for exactly the same search, as objectively measured using a decoy database strategy. Moreover, Byonic enables a wider variety of searches than other search engines. With Modification Fine Control™, Byonic can search for tens or even hundreds of modification types simultaneously without a prohibitively large combinatorial explosion. Byonic’s Wildcard Search™ allows the user to search for unanticipated or even unknown modifications alongside known modifications. Finally, Byonic’s glycosylation search allows the user to identify glycopeptides without prior knowledge of glycan masses or glycosylation sites.

     

    How much better is Byonic?
    The performance advantage depends upon the data, the search, and the skill of the user. On exactly the same search, Byonic identifies about 20% to 300% more spectra at the same False Discovery Rate (FDR), with the advantage increasing with the difficulty of the search. Easy searches have small search spaces (for example, a small protein database or a small number of modification types); difficult searches have large search spaces (for example, nonspecific digestion, a large number of modification types, and low precursor mass accuracy).

     

    What’s the difference between Byonic and Preview?
    Preview offers an initial look at your data to help you set the parameters for a much more sensitive Byonic search. Preview advises the user on mass accuracy, digestion specificity, and the prevalence of about 60 common modifications. Preview optionally recalibrates the m/z measurements to improve sensitivity and specificity for subsequent searches (using any search engine).

     

    How should I prepare my data for Byonic?
    No preparation is required. Byonic reads raw data in the vendor formats from Thermo, Waters, Sciex, Bruker, and Agilent. Optionally, you may want to run the data through Preview first; if the scatter plots of mass error vs. m/z for precursors and fragments reveal systematic m/z measurement errors, ask Preview to recalibrate the data and return a new m/z-recalibrated MGF file.

     

    Should I de-isotope my data?
    No! Byonic handles isotope peaks internally. De-isotoping the spectra beforehand using, for example, Mascot Distiller, destroys valuable information. De-isotoping is an especially bad idea for ETD spectra, which often have peaks (c–1 and z+1 peaks) that lead de-isotoping algorithms astray.

     

    How should I choose search parameters?
    Set mass tolerances appropriate for the type of instrument, for example, 10 ppm precursor tolerance for a high-resolution instrument and 0.3 Dalton fragment tolerance for ion trap fragmentation. Preview’s mass error plots can help you choose these tolerances. Preview’s m/z recalibration can remove systematic errors so that data can be run with tighter tolerances, for example, 5 ppm instead of 10 ppm tolerance for a high-resolution instrument. Tight tolerances offer a significant advantage for difficult searches, for example, resolving nearly isobaric modifications such as sulfation and phosphorylation, or identifying glycopeptides with poor fragmentation. Tolerances can be set in either Da or ppm, as appropriate for the instrument. Set digestion specificity based on the prevalence of nonspecific digestion and the complexity of the search. If the modification complexity of the search is high, as in wildcard, glycosylation, or oxidative footprinting searches, it is best to avoid the extra complexity of searching for nonspecific digestion, unless the nonspecific digestion rate is high (say, over 20% of all peptides). Set modifications based upon the prevalence reported by Preview and the goal of the study. If the goal of the study is phosphorylation site identification, enable up to 3 or even 4 phosphorylation sites per peptide, and avoid other modifications unless they are prevalent. If the goal of the study is simply protein identification, it is best to enable only the most common modifications (for example, oxidized methionine and deamidation). Be especially alert to over-alkylation; in some samples, over-alkylation is so common that the majority of peptides carry iodoacetamide artifacts. Some modifications are more costly (for example, sodiation on any residue as opposed to just E and D), but others (such as pyro-glu on N-terminal glutamine) barely increase the size of the search space.
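
    Because tolerances can be given in either ppm or Da, it can help to see how the two relate at a given m/z. The following is a minimal sketch, not part of Byonic; the function names are hypothetical and chosen only for illustration.

```python
def ppm_to_da(mz: float, ppm: float) -> float:
    """Absolute width (in m/z units, i.e. Da for charge 1) of a ppm tolerance at this m/z."""
    return mz * ppm * 1e-6


def da_to_ppm(mz: float, da: float) -> float:
    """Relative size, in ppm, of an absolute tolerance at this m/z."""
    return da / mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol: float, unit: str = "ppm") -> bool:
    """Check whether an observed m/z matches a theoretical m/z within the tolerance."""
    error = abs(observed_mz - theoretical_mz)
    if unit == "ppm":
        return error <= ppm_to_da(theoretical_mz, tol)
    if unit == "Da":
        return error <= tol
    raise ValueError("unit must be 'ppm' or 'Da'")


if __name__ == "__main__":
    # A 10 ppm window at m/z 1000 is 0.01 wide on each side.
    print(ppm_to_da(1000.0, 10.0))
    # A 0.3 Da fragment tolerance at m/z 500 corresponds to 600 ppm.
    print(da_to_ppm(500.0, 0.3))
```

    A peak that is 0.008 off at m/z 1000 passes a 10 ppm tolerance but fails a 5 ppm tolerance, which is why recalibration matters before tightening the window.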

     

    What is a focused database?
    Byonic enables the user to output a small protein database containing only the higher-ranking proteins (whether target or decoy) from a first search, along with appropriate decoys. A database specifically focused on the proteins in the sample under study improves speed and accuracy for subsequent searches. Focused databases are especially useful for wide modification searches, such as glycosylation and wildcard searches. We do not, however, advocate the use of a focused database for every study; they are unnecessary for most searches. Similarly, there is no need to make the first search very narrow; enable common modifications and nonspecific digestion as appropriate.

     

    What is a wildcard modification?
    A wildcard modification is nonspecific in both mass and (optionally) residue type; this is Byonic’s version of blind modification search. Adding a wildcard with a mass range of 100 Daltons increases the search time approximately 100-fold, so it is best to add a wildcard only on small searches, for example, fully tryptic searches with few other modifications. A wildcard can, however, be restricted to specific residues.

     

    When should I use a wildcard modification?
    We use a wildcard modification most often in a final clean-up search to be sure we haven’t missed anything interesting. We take out most other modifications and search against a focused database so that the wildcard search does not take too long. Wildcard search can also be used to find sequence variants. On the other hand, wildcard matches tend to be approximate, rather than exact: the wildcard modification is often misplaced, off by one Dalton, or the sum of two closely spaced modifications. Be alert for approximate answers such as EV[–18]PQLEVTK, where –18 almost surely belongs on E not V, and L[–113]EDEFVEVTK, where the right answer is surely EDEFVEVTK. Finally, we use wildcard search to solve mystery spectra; on a well-sequenced organism almost all spectra are at most one wildcard away from a database sequence.

     

    Should I search my data more than once?
    Yes! It can be a good idea to bracket your search with several settings of the crucial parameters when trying to get the most from the data. Even on data with overall 5 ppm precursor mass accuracy, there will be a few valid identifications with much larger errors, due to interfering MS1 peaks, mixture spectra, and so forth.

     

    Should I combine multiple search engines?
    In our experiments, other search engines find very few valid spectrum identifications missed by Byonic, typically less than 1%.

     

    How does Byonic compute p-values?
    Byonic computes both peptide-spectrum match (PSM) and protein p-values, assuming simple models of random matches. For PSMs, Byonic assumes that random scores are independent and identically distributed (i.i.d.) draws from a probability distribution with an exponential right-hand tail. This distribution depends only upon the size of the search (number of modifications, digestion specificity, size of the protein database, and so forth), and not upon the spectrum itself. Byonic reports the log base 10 of the p-value, so that a LogProb (log p-value) of –2.0 should occur by chance on only about one out of 100 spectra. For proteins, Byonic computes the expected total LogProb of PSMs hitting each protein, assuming that random PSMs are distributed uniformly over the protein database. The protein LogProb is the excess of total LogProb over the expected amount (that is, how much more negative). Proteins are ranked from most confident on down according to LogProb.
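
    To make the LogProb scale concrete, here is a minimal sketch of the arithmetic. It is not Byonic’s code: the exponential-tail function only illustrates the general form of such a model, and its `tail_start` and `rate` parameters are hypothetical, not values Byonic uses.

```python
import math


def p_value_from_logprob(logprob: float) -> float:
    """Convert a base-10 log p-value (LogProb) back to a p-value."""
    return 10.0 ** logprob


def expected_chance_matches(logprob: float, n_spectra: int) -> float:
    """Expected number of spectra that reach this LogProb (or better) by chance alone."""
    return n_spectra * p_value_from_logprob(logprob)


def logprob_from_exponential_tail(score: float, tail_start: float, rate: float) -> float:
    """LogProb under an ASSUMED tail model P(S >= s) = exp(-rate * (s - tail_start)).

    tail_start and rate are hypothetical parameters; in a real search they
    would depend on the size of the search space.
    """
    if score <= tail_start:
        return 0.0  # below the modeled tail, cap the p-value at 1
    return -rate * (score - tail_start) / math.log(10.0)


if __name__ == "__main__":
    # LogProb of -2.0 is a p-value of 0.01: about 1 chance match per 100 spectra.
    print(p_value_from_logprob(-2.0))
    print(expected_chance_matches(-2.0, 10_000))
```

    Under the exponential-tail assumption, each fixed increment in score lowers the LogProb by a fixed amount, which is why LogProb is roughly linear in score for high-scoring matches.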

     

    How does Byonic estimate the False Discovery Rate (FDR)?
    The False Discovery Rate (FDR) in a list of identifications (either proteins or PSMs) is the number of incorrect identifications divided by the total number of identifications. Byonic estimates PSM FDR using the target/decoy approach, which is the de facto standard for significance testing in proteomics. We have devised a method called two-dimensional FDR (2D FDR; see http://www.ncbi.nlm.nih.gov/pubmed/22010998) that can take into account protein-level information when computing PSM FDR, without biasing the FDR estimate. Two-dimensional FDR gives greater sensitivity/specificity than other methods because it can retain lower scoring PSMs to high-ranking proteins (which are likely to be correct) yet discard higher scoring PSMs to low-ranking proteins (which are likely to be incorrect).
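
    The basic one-dimensional target/decoy estimate can be sketched as follows. This is the generic textbook calculation, not Byonic’s 2D FDR; it assumes target and decoy databases of equal size, so that the number of decoy hits approximates the number of incorrect target hits. Function names and the score convention (higher is better) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy); higher score means a better match


def estimate_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimated FDR among target PSMs scoring at or above the threshold.

    Each decoy hit is taken as a stand-in for one incorrect target hit.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def score_cutoff_for_fdr(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Lowest observed score whose accepted list has estimated FDR <= max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold  # keep lowering the cutoff while the FDR bound holds
    return best


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = [(90.0, False), (85.0, False), (80.0, False), (75.0, False),
               (70.0, True), (65.0, False), (60.0, False), (55.0, True)]
    cutoff = score_cutoff_for_fdr(example, 0.2)
    print(cutoff, estimate_fdr(example, cutoff))
```

    A single score cutoff like this treats every PSM the same regardless of which protein it hits; the 2D approach described above differs precisely in using protein rank as a second dimension.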

  • FAQ about Preview

    Is Preview a search engine like Byonic, Mascot, or SEQUEST?

    Not exactly. Preview samples the data, so it generally makes fewer identifications than a full search engine. On the other hand, it tests many more modifications and search options than any full search engine.

    How should I use Preview?

    Run Preview before you run any other searches, so that you will know what type of full search will be most effective.

    Should I use Preview to recalibrate my m/z measurements?

    Yes! If Preview makes sufficiently many identifications, say at least 20 precursors and at least 50 fragments, then you will generally get better results out of a full search with Preview’s recalibrated spectrum file than with the original spectrum file unless the original calibration is extremely good. If you have enough identifications to avoid over-fitting (say 100 or more precursors), you can even run Preview’s recalibrated spectrum file through Preview again for even more precise recalibration.

    How does Preview recalibrate m/z measurements?

    It maps measured m/z values to recalibrated m/z values using quadratic curves, shown in red in the plots. We have found that calibration does not drift much over the course of an LC-ESI run, so that the same quadratic curve works for all spectra. Calibration can change from plate to plate with MALDI, however, so it’s quite possible to see a lot of scatter in the m/z errors from a data set comprising many MALDI plates.
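
    The idea of a quadratic recalibration curve can be sketched in a few lines. This is an illustration under stated assumptions, not Preview’s actual procedure: it assumes the curve models the ppm error as a quadratic function of measured m/z, fits it by ordinary least squares, and uses hypothetical function names.

```python
from typing import Callable, Sequence, Tuple


def fit_quadratic(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 via the 3x3 normal equations."""
    if len(set(xs)) < 3:
        raise ValueError("need at least three distinct x values")
    s = [sum(x ** k for x in xs) for k in range(5)]               # power sums of x
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]  # augmented matrix
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return (m[0][3] / m[0][0], m[1][3] / m[1][1], m[2][3] / m[2][2])


def build_recalibrator(measured_mz: Sequence[float],
                       errors_ppm: Sequence[float]) -> Callable[[float], float]:
    """Return a function mapping a measured m/z to a recalibrated m/z.

    errors_ppm[i] is (measured - theoretical) / theoretical * 1e6 for an
    identified peak at measured_mz[i].
    """
    lo, hi = min(measured_mz), max(measured_mz)
    center, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    # Rescale m/z to [-1, 1] so the normal equations stay well conditioned.
    xs = [(mz - center) / half for mz in measured_mz]
    c0, c1, c2 = fit_quadratic(xs, errors_ppm)

    def recalibrate(mz: float) -> float:
        x = (mz - center) / half
        predicted_error_ppm = c0 + c1 * x + c2 * x * x
        return mz / (1.0 + predicted_error_ppm * 1e-6)

    return recalibrate
```

    Because one curve is fitted for the whole run, this sketch matches the LC-ESI case described above; data pooled from many MALDI plates would need a separate fit per plate.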

    If Preview reports median precursor error of 2 ppm, should I set the precursor tolerance in the full search to 2 ppm?

    No! The median error is the typical error for an abundant ion and is at least 3 to 5 times smaller than the maximum error. Also check the number of “off-by-one” errors reported by Preview: even on high-accuracy instruments, many precursor masses may reflect the mass of the first isotope peak rather than the monoisotopic mass.
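
    As a rough sketch of both points, the code below derives a tolerance from the median error using the 3-to-5-times rule of thumb, and tests whether a precursor mass is consistent with an off-by-one isotope assignment. The function names and the default factor of 4 are assumptions for illustration; the only physical constant is the 13C minus 12C mass difference.

```python
from statistics import median
from typing import Sequence

C13_C12_DELTA = 1.0033548  # Da, mass difference between 13C and 12C


def ppm_error(observed_mass: float, theoretical_mass: float) -> float:
    """Signed mass error in ppm relative to the theoretical mass."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def suggested_tolerance_ppm(abs_errors_ppm: Sequence[float], factor: float = 4.0) -> float:
    """Scale the median absolute error by a safety factor (3 to 5 per the rule of thumb)."""
    return factor * median(abs_errors_ppm)


def is_off_by_one(observed_mass: float, theoretical_mass: float, tol_ppm: float) -> bool:
    """True if the observed neutral mass matches the first isotope peak
    (monoisotopic mass plus one 13C spacing) within tol_ppm."""
    return abs(ppm_error(observed_mass, theoretical_mass + C13_C12_DELTA)) <= tol_ppm


if __name__ == "__main__":
    # A median error of 2 ppm suggests a tolerance of roughly 6 to 10 ppm.
    print(suggested_tolerance_ppm([1.0, 2.0, 3.0], factor=3.0))
    print(suggested_tolerance_ppm([1.0, 2.0, 3.0], factor=5.0))
```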

    In the full search, should I enable all the modifications that Preview reports as “Common variable modifications”?

    Not necessarily. Some full search engines do not support all the modifications supported by Preview. Some modifications are biologically uninteresting (for example, sodiation) and should only be enabled if they would contribute a significant number of additional identifications.

    How does Preview compute False Discovery Rate (FDR)?

    Preview uses the target / decoy approach to FDR estimation, and estimates the number of true identifications by the number of target identifications minus the number of decoy identifications. There is no need to add decoy proteins to the protein database, because Preview does this automatically. Preview does not report FDR, but it uses FDR internally to decide which identifications to accept.

    How reliable are Preview’s statistics?

    Preview’s statistics are especially good for “normal” shotgun proteomics, meaning digested multi-protein samples. Preview loses some reliability on very highly modified samples, in which many peptides carry more than one variable modification.

    How can I use Preview to improve my sample processing?

    Preview reports on the amount of nonspecific digestion, m/z measurement errors, and sample preparation artifacts such as over- and under-alkylation, carbamylation, oxidation, sodiation, and deamidation. This type of information can provide valuable feedback.

    How should I read Preview’s peptide and protein identifications?

    This list (accessible from the Detail page) gives the highest-scoring identification for each spectrum, so long as the score is high enough to be statistically significant. We don’t usually do much with this list of identifications: remember that Preview samples the data, and does NOT perform a full search.

    How should I read Preview’s wildcard search results?

    Preview’s wildcard is just what it sounds like: any mass shift on any one residue. Wildcard identifications are often approximate, with misplaced modifications, two modifications combined into one wildcard, two known modifications in a combination not considered by Preview’s other searches, and so forth. On the other hand, these identifications, especially if they have scores over 60, are rarely completely wrong. A wildcard search will find polymorphisms, unanticipated modifications, and mystery mass shifts in almost any sample.

    Why do the Summary and Details statistics sometimes disagree?

    The Summary page reports the overall gain to be achieved by enabling the modification, for example, 8.5% more identifications by allowing oxidized methionine for the BTK sample data. In contrast, the Details page reports the rate of modification, for example, 32.9% of peptides containing methionine contain at least one oxidized methionine. In other words, the Summary reports the “bottom line”, how many more identifications can be obtained by enabling the modification, while the Details page reports direct comparisons on specific, limited searches.

    For example, to assess the rate of oxidized methionine, Preview searches the spectra only against methionine-containing peptides, and reports the results of the search on the Details page. Then after all searches have been done, Preview compiles the summary statistics by counting up all the identifications for all spectra.

    Denominators in the percentages may also vary from search to search due to “second-order” effects such as multiply modified peptides and corrections for hits to decoys.

    Preview’s statistics can lose accuracy on extreme data sets, those in which a large percentage (say 30% or more) of the peptides carry more than one type of modification: for example, a data set that is both highly over-alkylated and highly oxidized.

    Where do I learn more about Preview?

    Check out Preview’s Tutorial and Help webpages. For background on why Preview was created and what it does, see the About page.