Sequence similarity distribution of the BLAST hits

Outlier Detection in BLAST Hits

The most common approach for assigning taxonomic labels to reads involves comparing them to a database of sequences from known organisms. These similarity-based methods typically run rapidly and work well when organisms in the sample are well represented in the database. However, a majority of microorganisms cannot be easily cultured in laboratories, and even if they are culturable, a smaller number have been sequenced. Thus, not all environmental organisms may be represented in the sequence database. This prevents the similarity-based methods from accurately characterizing organisms within a sample that are only distantly related to the sequences in the reference database. Phylogenetic-tree based methods can characterize novel organisms within a sample by statistically modeling the evolutionary processes that generated these sequences [15, 13]. However, such methods incur a high computational cost, limiting their applicability in the context of the large datasets generated in current studies. Ideally, we would want to use similarity-based methods to assign labels to sequences from known organisms, and to use phylogenetic methods to assign labels to sequences from unknown organisms.

Outlier detection in BLAST hits

We propose a two-step method for taxonomy assignment where we use a rapid assignment method that can accurately assign labels to sequences that are well represented in the database, and then use more complex phylogenetic methods to classify only those sequences unclassified in the first step. In this work, we study whether and when a method can assign accurate taxonomic labels using a similarity search of a reference database. We employ BLAST because it is one of the most widely used similarity search methods [5]. However, it has been shown that the best BLAST hit may not always provide the correct taxonomic label [6]. Most taxonomic-assignment methods utilizing BLAST employ ad-hoc techniques such as recording the consensus label among the top five hits, or using a threshold based on E-value, percent identity, or bit-score [7–10]. Here we propose an alternative approach for detecting whether and when the top BLAST hits yield correct taxonomic labels. We model the problem of separating phylogenetically correct matches from matches to sequences from similar but phylogenetically more distant organisms as a problem of outlier detection among BLAST hits. Our preliminary results involving simulated and real metagenomic datasets demonstrate the potential of employing our method as a filtering step before using phylogenetic methods.
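
For orientation, the ad-hoc baseline that the excerpt contrasts itself with (an E-value threshold followed by a consensus among the top five hits) might look roughly like the sketch below. It is not the outlier-detection method the authors propose. It assumes BLAST tabular output (`-outfmt 6`, twelve tab-separated columns); the `taxon_of` mapping, the threshold values, and the function names are hypothetical choices made for illustration.

```python
from collections import Counter


def parse_outfmt6(lines):
    """Parse BLAST tabular (-outfmt 6) lines, keeping only the fields used below."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append({
            "query": fields[0],
            "subject": fields[1],
            "pident": float(fields[2]),
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        })
    return hits


def consensus_label(hits, taxon_of, top_n=5, max_evalue=1e-5, min_fraction=0.6):
    """Return the majority label among the top hits, or None if there is no consensus.

    taxon_of maps a subject sequence id to a taxonomic label (hypothetical input).
    The thresholds are illustrative assumptions, not values from the excerpt.
    """
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    kept.sort(key=lambda h: h["bitscore"], reverse=True)
    top = kept[:top_n]
    if not top:
        return None
    counts = Counter(taxon_of[h["subject"]] for h in top)
    label, votes = counts.most_common(1)[0]
    return label if votes / len(top) >= min_fraction else None
```

Reads for which this returns `None` would be the natural candidates to pass on to a slower phylogenetic method in a two-step scheme.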

Div BLAST: diversification of sequence search results

Conclusions Diverse browsing of sequences and structures is essential for exploratory research in bioinformatics. The current approach of curating non-redundant databases and eliminating identical sequences or fragments is costly and prone to error. In addition, as we illustrated in the experimental section, most queries still contain results with too much redundancy. Alignment and search tools need to perform diversifications tailored for each query. To the best of our knowledge, ours is the first work that investigates diversity in sequence search and alignment. We propose quality measures and methods to diversify the results of sequence similarity search tools. As the result set already includes top-matching sequences, we focus on selecting a diverse subset of this result. To obtain non-redundant results, one could either specify a similarity threshold and omit the sequences that have more similarity than the threshold, or use clustering algorithms. However, these approaches would fail to return enough results and may not supply the desired diversity regarding the query. To overcome this problem, we first presented a pairwise bit comparison approach, BitDiversity, by treating the sequence matches as bit sequences. BitDiversity stresses the diversity in matching locations without considering amino acid differences in those locations. Diversity rate is calculated with the XOR operation for the bit sequences of two sequences.
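
A minimal sketch of the XOR idea described above, under stated assumptions: each hit is reduced to a bit sequence in which bit i is set when query position i is covered by the match, and the diversity of two hits is the number of differing bits. Dividing by the query length is an assumption made here to obtain a rate; the excerpt does not give the exact normalization. The function names are hypothetical.

```python
def match_bits(aligned_positions):
    """Encode a hit as an integer whose bit i is 1 when query position i is matched."""
    bits = 0
    for i in aligned_positions:
        bits |= 1 << i
    return bits


def xor_diversity(bits_a, bits_b, query_length):
    """Fraction of query positions covered by exactly one of the two hits.

    The XOR keeps the positions where the two bit sequences differ; the
    normalization by query_length is an illustrative assumption.
    """
    differing = bin(bits_a ^ bits_b).count("1")
    return differing / query_length
```

Two hits that match the same region of the query score 0.0, while hits that match disjoint regions score higher, regardless of which amino acids appear at those positions.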

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Heuristic Sequence Alignment: Principle
• These methods are heuristic, i.e., they rely on an empirical approach to computer programming in which rules of thumb are used to find solutions.
• They almost always work to find related sequences in a database search but do not have the underlying guarantee of an optimal solution that the dynamic programming algorithm provides.
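
To make the contrast concrete, here is a small sketch of the kind of dynamic programming that carries the optimality guarantee: a Smith-Waterman style local alignment score. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration, and only the score is computed, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b under a simple linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the len(a) by len(b) table is filled, which is what makes the result optimal and also what makes the method slow on large databases; heuristic tools skip most of this table.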

Having a BLAST: Analyzing Gene Sequence Data with BlastQuest

biological point of view, sequences with no homologous sequence match often lead to new genes and are analysed in a different manner (outside of BlastQuest). In addition, the homology search criteria for each BLAST search, such as the BLAST program name, database name, matrix, and date, are stored in Query_Hit table. These parameters are important to users because for the same query sequence, BLAST generates different results based on different criteria. For example, BLASTN results and BLASTX results may indicate different functions for the same query sequence. In addition, the same BLAST search on different days may generate different hits since BlastQuest’s BLAST server is regularly updated with the latest version of the NCBI data files. The MySQL database also stores information about how related gene segments are assembled into single consensus DNA sequences by PHRAP, which is external to BlastQuest and invoked before the DNA sequence results are submitted to BLAST. PHRAP outputs its results in an ACE file, which is mapped into the relation Assembly. If the user considers the results of the BLAST search interesting, s/he may want to extract the physical clones from which the specific query sequences are generated or assembled. This is possible by joining the Assembly and Query tables via the “qid” foreign key to retrieve all segments and corresponding clone names that are clustered into a specific query sequence.
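
The join described above could be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` with an in-memory database in place of the MySQL server mentioned in the excerpt, and every column name other than `qid` (`name`, `segment`, `clone_name`) is a hypothetical stand-in, since the excerpt does not list the actual schema.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with hypothetical Query and Assembly tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Query (qid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Assembly (
            qid INTEGER REFERENCES Query(qid),
            segment TEXT,
            clone_name TEXT
        );
        INSERT INTO Query VALUES (1, 'contig_1'), (2, 'contig_2');
        INSERT INTO Assembly VALUES
            (1, 'seg1', 'cloneA'),
            (1, 'seg2', 'cloneB'),
            (2, 'seg3', 'cloneC');
    """)
    return conn


def clones_for_query(conn, qid):
    """Return (segment, clone_name) rows assembled into the given query sequence."""
    return conn.execute(
        """
        SELECT a.segment, a.clone_name
        FROM Query AS q
        JOIN Assembly AS a ON a.qid = q.qid
        WHERE q.qid = ?
        ORDER BY a.segment
        """,
        (qid,),
    ).fetchall()
```

Calling `clones_for_query(make_demo_db(), 1)` lists the segments and clone names clustered into the first demo query sequence.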

Improvements on Seeding Based Protein Sequence Similarity Search

Since the Smith-Waterman algorithm runs in time quadratic in the length of the sequences, it becomes impractical for large-scale sequence comparisons. So, a goal of heuristics is to maintain reasonably high sensitivity while making as few calls as possible to the Smith-Waterman algorithm. Seeding (or filtration) based tools, which trade sensitivity for speed, are a popular choice among other approaches. The seeding-based approach runs faster than the Smith-Waterman algorithm, but misses some true homologies. Many clever ideas for the seeding algorithm have been developed, which have helped to make its applications more efficient. The FASTA program was first released in 1985; the BLAST program was first released in 1990; and the PatternHunter program was first released in 2002. These exemplify the evolution of seeding-based approaches over the past three decades.
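
The seeding step can be illustrated with a deliberately simplified sketch: index every k-mer of the target, then report each position where a query k-mer matches exactly. Real tools refine this considerably (neighborhood words for proteins in BLAST, spaced seeds in PatternHunter), so the exact-match rule, the default `k=3`, and the function names here are assumptions for illustration.

```python
from collections import defaultdict


def build_kmer_index(target, k):
    """Map each k-mer of the target to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def find_seeds(query, target, k=3):
    """Return (query_pos, target_pos) pairs where a k-mer matches exactly."""
    index = build_kmer_index(target, k)
    seeds = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, tpos))
    return seeds
```

Only the regions around the returned seeds would then be handed to a slower alignment routine. A homology with no exact k-mer in common produces no seed and is missed, which is the sensitivity cost the excerpt mentions.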

Detecting remotely related proteins by their interactions and sequence similarity

The combined method is applicable only to protein sequences and their homologs for which protein-interaction data are available, in contrast to sequence comparison alone, which is applicable to all protein sequences. This limitation is quantified by the following two examples. First, ≈20–50% of the proteins in the benchmark have a partner in the G2 set (Fig. 2c). Second, for specificity of 75%, sequence comparison by PSI-BLAST makes 30,302 pairs with correct fold assignments, whereas our combined method finds 2,885 true positives of which 188 were not reported by PSI-BLAST. Two of these assignments are shown in Fig. 3. We suggest that even the comparatively small coverage of the combined method is already useful in practice, given the 2 million known protein sequences that need to be related to each other; very few methods for characterization of proteins, experimental or computational, are applicable to most protein sequences, and many proven methods are applicable only to a small fraction of all proteins. Moreover, the usefulness of our combined method is clearly increasing with the growth of the databases of known protein sequences and their interactions. We also expect that the idea of combining protein sequence comparison and protein interactions could enable additional improvements in the matching of remotely related protein sequences.

The similarity/dissimilarity analysis of protein sequence based on nucleotide triplet codon

ABSTRACT Based on nucleotide triplet codon, a graphical representation of protein sequences is outlined. A numerical characterization including the location, number and distribution information of all the 20 kinds of amino acids is proposed. The similarity/dissimilarity analysis of ND5 protein sequences of nine species is done, and our approach is compared to other approaches recently proposed based on the coefficient of correlation of the results of these approaches with the results calculated by ClustalW. It shows that our approach has better correlations with ClustalW for all nine species than other approaches, which gives an intuition of better performance.

UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

The UniRef databases have been produced for 10 years and are used worldwide for a broad range of applications. Since first released in 2004, UniRef has been cited over 400 times based on Google Scholar and unique citations from PubMed Central. UniRef’s ability to reduce redundancy while preserving information on source and quality annotation has proven useful in many studies based on the citation analysis. The most common uses of UniRef databases continue to be in functional annotation, family classification, systems biology, structural genomics, phylogenetic analysis and mass spectrometry. Recent studies have also used UniRef for improving protein sequence alignments through homology extension (Chang et al., 2012), increasing sequence search sensitivity with transitive alignments (Malde and Furmanek, 2013), developing representative proteomes and proteome clusters (Chen et al., 2011), predicting the functional effects of disease variants (Capriotti and Altman, 2011a, b; Sim et al., 2012), performing functional screening of metagenomics data (Foerstner et al., 2008; Wommack et al., 2012), developing large-scale hierarchical clustering algorithms (Loewenstein et al., 2008), studying gene duplication (Rivera et al., 2010) and conducting genomic studies of peptide and oligonucleotide frequencies (Capone et al., 2010). Based on the UniProt usage statistics, UniRef web pages receive approximately 200 000 hits per month. The UniRef file download has been increasing steadily since its inception with an annual growth rate of 20% in recent years, now reaching more than 3000 annual unique IP downloads.

SeqStruct: A New Amino Acid Similarity Matrix Based on Sequence Correlations and Structural Contacts Yields Sequence-Structure Congruence

Potential Impacts. In summary, the approach applied here takes two extremely different sets of data, the protein sequences and structures, and combines them within a physical context by using close contacts in the structures to select the strongest correlations. This yields significant gains in sequence matching. The results produce dramatic improvements in sequence matching that will aid the present users of BLAST or any other present sequence matching software, because the approach relies on a simple 20x20 amino acid similarity matrix, just as most other matching procedures do. So its use is straightforward to implement. Further possible extensions, such as incorporating higher-order multi-body correlations manifested in structures, may provide additional gains in the representations of complex, dense proteins.

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

This high efficiency indicates that the dynamic load balancing has low overhead. First, the additional MPI message-passing time required by the dynamic load balancing (as compared to static load balancing) is negligible compared to the time spent in BLAST itself. Second, the time spent repeatedly initiating BLAST on a processor for each chunk of 20K amino acids is small. Repeated reading of the database sequences is very fast due to the large amount of disk caching that Linux systems perform. Note this is true both for an individual processor and for a group of processors on one node of the cluster, which will all share the disk cache on that node. However, the initial read of the database must be performed over the network since we use NFS to mount a common filesystem onto all nodes of the cluster. One portion of the overhead is due to the time to read the database across the network into the nodes of the cluster. Another portion of the overhead is due to the presence of a master process and a writer process, in addition to the 36 worker processes. These two processes must be executed on processors that are also executing worker processes.
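
The dynamic load balancing idea, where each worker fetches the next chunk as soon as it finishes its current one, can be sketched as below. This is not the authors' MPI implementation: Python threads and a shared `queue.Queue` stand in for the worker and master processes, there is no separate writer, and `work_fn` is a hypothetical placeholder for running BLAST on one chunk.

```python
import queue
import threading


def run_dynamic(chunks, work_fn, n_workers=4):
    """Process chunks with workers that pull the next task when they become idle."""
    tasks = queue.Queue()
    for idx, chunk in enumerate(chunks):
        tasks.put((idx, chunk))
    results = [None] * len(chunks)

    def worker():
        while True:
            try:
                idx, chunk = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            # Each index is written by exactly one worker, so no lock is needed.
            results[idx] = work_fn(chunk)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With static balancing each worker would receive a fixed share up front, so one slow chunk can leave the other workers idle; pulling from a shared queue avoids that at the cost of one small fetch per chunk.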

BEAP: The BLAST Extension and Alignment Program - a tool for contig construction and analysis of preliminary genome sequence

Conclusion Researchers need to have some flexibility to adapt to each new genome depending on genomic complexity and gene content. A variety of options in the BEAP process were designed to help researchers tackle a wide range of challenges. BEAP is not limited to bovine or animal applications. Any sequence database can be queried or used as a template sequence. The user can specify local (megaBLAST) or remote (BLASTn) database querying, stringency of BLAST hits (E-value), and word size within the megaBLAST option. Users have great flexibility in how to use BEAP output. The BEAP package could be very useful to researchers working with draft quality genome sequence.

Sequence to Sequence Similarity for Image Denoising

Keywords: Image denoising, block similarity, sequence to sequence similarity, edge information
I. INTRODUCTION
An image is always subject to noise during its acquisition, coding, transmission, and processing steps. Noise is very difficult to remove from digital images without prior knowledge of the noise model. Hence noise removal forms a pre-processing step in photography, research, satellite technology and medical science, where a degraded image has to be restored before further processing. Image denoising is the manipulation of the image data to produce a visually high-quality image. Existing denoising approaches generally fall into filtering approaches, multifractal approaches and domain transform approaches such as wavelet-based methods. But such image denoising causes blurring and introduces artifacts in the original image. Different types of images inherit different types of noise, and different noise models are used to represent different noise types. A denoising method tends to be problem specific and depends upon the type of image and noise model. Many denoising techniques have been developed based on noise types and noise models, but these techniques are not always useful for every application. Also, almost all of them cause loss of content during the denoising process, and loss of content causes blurring of the image. This is one of the main issues in developing a denoising algorithm. From time to time, however, the theoretical and practical understanding of the noise present in digital images needs to be reinforced. The objective of this thesis is to find a solution to the problems of conventional denoising techniques.

An Improved HITS Algorithm Based on Pagequery Similarity and Page Popularity

Hongfei Lin (1) and Cong Zhang (2). (2) School of Software, Dalian University of Technology, Dalian, China. Abstract: The HITS algorithm is a very popular and effective algorithm to rank web documents based on the link information among a set of web pages. However, it assigns every link the same weight. This assumption results in topic drift. In this paper, we first define the generalized similarity between a query and a page, and the popularity of a web page. Then we propose a weighted HITS algorithm which differentiates the importance of links with the query-page similarities and the popularity of web pages.
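
A generic weighted variant of the HITS iteration can be sketched as follows. It shows only how per-link weights enter the hub and authority updates; how the paper derives those weights from query-page similarity and page popularity is not given in the excerpt, so the `edges` weights here are arbitrary inputs and the function names are hypothetical.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {n: v / norm for n, v in scores.items()}


def weighted_hits(edges, iterations=50):
    """Run HITS where edges maps (source, target) to a positive link weight.

    Plain HITS is the special case in which every weight equals 1.
    Returns (hub_scores, authority_scores) as dictionaries keyed by node.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for (src, dst), w in edges.items():
            new_auth[dst] += w * hub[src]
        # Hub: weighted sum of the authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for (src, dst), w in edges.items():
            new_hub[src] += w * new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth
```

A link with a small weight contributes little to either score, which is how down-weighting off-topic links could limit the topic drift described above.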

Expected Sequence Similarity Maximization

ity maximization problem can be solved efficiently. This opens up the option of seeking the most appropriate rational kernel or transducer T for the specific task considered. In particular, the kernel K used in our machine translation applications might not be optimal. One may well imagine for example that some n-grams should be further emphasized and others de-emphasized in the definition of the similarity. This can be easily accommodated in the framework of rational kernels by modifying the transition weights of T. But, ideally, one would wish to select those weights in an optimal fashion. As mentioned earlier, we leave this question to future work. However, we can offer a brief look at how one could tackle this question. One method for determining an optimal kernel for the expected similarity maximization problem consists of solving a problem similar to that of learning kernels in classification or regression. Let $X_1, \ldots, X_m$ be $m$ lattices with $\mathrm{Ref}(X_1), \ldots, \mathrm{Ref}(X_m)$ the associated references and let $\hat{x}(K, X_i)$ be the solution of the expected similarity maximization for lattice $X_i$ when using kernel K. Then, the kernel learning optimization problem can be formulated as follows:

Plagiarism or not? investigation of Turnitin®-detected similarity hits in biology laboratory reports.

It has been suggested that the process of writing a lab report enables students to make sense of their lab experience within the context of scientific inquiry [1]. A traditional undergraduate biology lab report includes several standard sections that mirror the structure of a scientific paper: a brief Abstract summarizes the most important findings, an Introduction explains the background of the experiment, a Materials and Methods section outlines and details experimental procedures, a Results section presents data, and a Discussion section elaborates on data analysis. This traditional format was implemented in the upper-division molecular biology lab classes that were the focus of this study. For the past seven years of teaching these labs, similarity-matching software, Turnitin®, has been used to detect potential instances of plagiarism. Over the years, instructors have been aware of the large number of similarities detected by Turnitin® in many of the lab reports. The majority of these matches linked to lab reports by other students, which raised the possibility of our students having access to, and plagiarizing from, these lab reports.

Massively Parallel Sequence Alignment using pcj-blast

BLAST - Basic Local Alignment Search Tool (1991)
• The heuristic algorithm it uses is much faster than other approaches
• The search time can be long (days or weeks) for large datasets
• NCBI-BLAST is the most widely used implementation

Atom-Atom-Path similarity and Sphere Exclusion clustering: tools for prioritizing fragment hits

absorb for hit-to-lead development. To increase the chance of success, careful prioritization of the initial hits by a group of experienced specialists from different areas of drug discovery is important for advancing the most promising fragment hits. Fragment hit triage in advance of a structure determination typically weighs the LE parameter. However, many other properties including affinity, selectivity, and most importantly the chemical structure of the compound need critical consideration. In addition, fragment libraries often contain related molecules providing initial SAR and confidence in scaffold types. Clustering hit sets helps bring related molecules and features together for consideration, but the cluster order is usually determined by the algorithm and is independent of other factors such as LE. This results in a functional randomization of the order of the experimental data making trends harder to identify. To direct the attention of the specialists to the most promising hits we have employed a directed clustering method using a new similarity algorithm, to group hits with respect to both structure and data.

Structural Similarity Based Object Tracking in Video Sequence

In this paper we propose the use of structural similarity measure for object tracking in video sequences by means of a particle filter. The motivation of applying particle filtering is that it has been proven to be a scalable and powerful approach, able to cope with nonlinearities, and work under uncertainties, which makes it a suitable approach for object tracking in video sequences (see for example [8] and [3]). The similarity measure proposed in [11] captures spatial characteristics of an image and has shown to be robust to illumination and contrast changes. It has been used for the purposes of quality assessment of distorted and fused images [6, 9], but not for tracking. In the present paper, we show how this measure can be applied for tracking purposes. It allows one to substitute histograms and to calculate in a straightforward way the measurement likelihood function within particle filtering. We show that it is a good and fast alternative to histogram based tracking.

DNA Sequence Similarity Requirements for Interspecific Recombination in Bacillus

Accepted for publication August 9, 1999. ABSTRACT: Gene transfer in bacteria is notoriously promiscuous. Genetic material is known to be transferred between groups as distantly related as the Gram positives and Gram negatives. However, the frequency of homologous recombination decreases sharply with the level of relatedness between the donor and recipient. Several studies show that this sexual isolation is an exponential function of DNA sequence divergence between recombining substrates. The two major factors implicated in producing the recombinational barrier are the mismatch repair system and the requirement for a short region of sequence identity to initiate strand exchange. Here we demonstrate that sexual isolation in Bacillus transformation results almost exclusively from the need for regions of identity at both the 5′ and 3′ ends of the donor DNA strand. We show that, by providing the essential identity, we can effectively eliminate sexual isolation between highly divergent sequences. We also present evidence that the potential of a donor sequence to act as a recombinogenic, invasive end is determined by the stability (melting point) of the donor-recipient complex. These results explain the exponential relationship between sexual isolation and sequence divergence observed in bacteria. They also suggest a model for rapid spread of novel adaptations, such as antibiotic resistance genes, among related species.
