On Optimizing Distance-Based Similarity Search for Biological Databases
 Stanford University
, 2005
Cited by 10 (4 self)
Similarity search leveraging distance-based index structures is increasingly being used for both multimedia and biological database applications. We consider distance-based indexing for three important biological data types: protein k-mers with the metric PAM model, DNA k-mers with Hamming distance, and peptide fragmentation spectra with a pseudometric derived from cosine distance. To date, the primary driver of this research has been multimedia applications, where similarity functions are often Euclidean norms on high-dimensional feature vectors. We develop results showing that the character of these biological workloads is different from multimedia workloads. In particular, they are not intrinsically very high dimensional and deserve different optimization heuristics. Based on MVP-trees, we develop a pivot selection heuristic seeking centers and show it outperforms the most widely used corner-seeking heuristic. Similarly, we develop a data partitioning approach sensitive to the actual data distribution in lieu of median splits.
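To make the abstract's terms concrete, here is a minimal sketch of distance-based indexing over DNA k-mers with Hamming distance. It is a single-vantage-point simplification with a median split, not the MVP-tree or the heuristics developed in the paper. The names `VPNode` and `range_query` are hypothetical, and the pivot is simply the first item, which is an assumption made for brevity.

```python
from statistics import median


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length k-mers."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


class VPNode:
    """One node of a simplified vantage-point tree (items must be non-empty)."""

    def __init__(self, items, dist):
        # Assumption: take the first item as pivot; real systems use a
        # pivot selection heuristic instead.
        self.pivot = items[0]
        rest = items[1:]
        self.radius = 0
        self.inner = self.outer = None
        if rest:
            dists = [dist(self.pivot, x) for x in rest]
            self.radius = median(dists)  # median split on pivot distance
            inner = [x for x, d in zip(rest, dists) if d <= self.radius]
            outer = [x for x, d in zip(rest, dists) if d > self.radius]
            self.inner = VPNode(inner, dist) if inner else None
            self.outer = VPNode(outer, dist) if outer else None


def range_query(node, query, r, dist, out=None):
    """Collect every indexed item within distance r of the query."""
    out = [] if out is None else out
    if node is None:
        return out
    d = dist(query, node.pivot)
    if d <= r:
        out.append(node.pivot)
    # Triangle inequality: a hit x satisfies d - r <= dist(pivot, x) <= d + r,
    # so a subtree is visited only if that interval overlaps its side.
    if d - r <= node.radius:
        range_query(node.inner, query, r, dist, out)
    if d + r > node.radius:
        range_query(node.outer, query, r, dist, out)
    return out
```

For example, `range_query(VPNode(kmers, hamming), "ACGT", 1, hamming)` returns the same k-mers as a linear scan with radius 1, while skipping subtrees that the triangle inequality rules out.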
Mushegian A: The choice of optimal distance measure in genome-wide datasets
 Bioinformatics 2005
Cited by 6 (4 self)
Motivation: Many types of genomic data are naturally represented as binary vectors. Numerous tasks in computational biology can be cast as analysis of relationships between these vectors, and the first step is frequently to compute their pairwise distance matrix. Many distance measures have been proposed in the literature, but there is no theory justifying the choice of distance measure. Results: We examine the approaches to measuring distances between binary vectors and study the characteristic properties of various distance measures and their performance in several tasks of genome analysis. Most distance measures between binary vectors turn out to belong to a single parametric family, namely generalized average-based distance with different exponents. We show that descriptive statistics of the distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent. In contrast, the more familiar distance properties, such as metricity and additivity, appear to have much less effect on the performance of distances. Availability: R code GADIST is available from the corresponding author upon request.
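The workflow described here (pairwise distances between binary vectors, then descriptive statistics of the distance distribution) can be illustrated with a small sketch. It uses two well-known measures, normalized Hamming and Jaccard distance, and does not implement the paper's generalized average-based family or the GADIST code; all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def hamming(u, v):
    """Fraction of positions at which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def jaccard(u, v):
    """1 - |intersection| / |union| of the positions set to 1."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1 - both / either


def pairwise(vectors, dist):
    """Upper triangle of the pairwise distance matrix, as a flat list."""
    return [dist(u, v) for u, v in combinations(vectors, 2)]


def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    m = mean(xs)
    m2 = mean([(x - m) ** 2 for x in xs])
    m3 = mean([(x - m) ** 3 for x in xs])
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5
```

Calling `skewness(pairwise(vectors, jaccard))` and `skewness(pairwise(vectors, hamming))` on the same data gives one simple way to compare how differently two measures distribute the pairwise distances.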
Index-based approach to similarity search in protein and nucleotide databases
 DATESO
, 2007
Biosequence Use Cases in MoBIoS SQL
 IEEE Computer Society Bulletin of the Technical Committee on Data Engineering
, 2004
Cited by 2 (0 self)
The sequencing and annotation of entire genomes has enriched the content of biological sequence databases such that new methods of sequence analysis, comparison and retrieval are being invented and rerun on an increasingly regular basis, generating new and more complete biological information. Examples include full genome comparisons and phylogenetic footprinting. Simple identification of homologous sequences based on BLAST searches is now just one option for querying the contents of a sequence database. These developments underscore the need for more general methods of sequence data management and concomitant programming models that simplify biological discovery. MoBIoS, the Molecular Biological Information System, with mSQL, its set of SQL extensions, is such a system. MoBIoS supports two views of sequence data. Sequences are identified and stored based on long functional units (e.g. genes, proteins and chromosomes). Matching and analysis of sequences exploits distance-based methods comparing short overlapping substrings. We show that a number of sequence analysis problems can thus be succinctly expressed as mSQL queries.
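The second view, matching via short overlapping substrings, can be sketched as follows. This is an illustrative brute-force version in Python, not mSQL syntax or the MoBIoS implementation; `qgrams` and `match_qgrams` are hypothetical names, and Hamming distance stands in for whichever distance a real query would use.

```python
def qgrams(sequence: str, q: int):
    """Yield (offset, substring) for every overlapping window of length q."""
    if q <= 0:
        raise ValueError("q must be positive")
    for i in range(len(sequence) - q + 1):
        yield i, sequence[i:i + q]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def match_qgrams(database: dict, query: str, q: int, radius: int):
    """Return (sequence id, offset, q-gram) for every database q-gram that
    lies within `radius` of some q-gram of the query (linear scan)."""
    query_grams = {g for _, g in qgrams(query, q)}
    hits = []
    for seq_id, sequence in database.items():
        for offset, gram in qgrams(sequence, q):
            if any(hamming(gram, g) <= radius for g in query_grams):
                hits.append((seq_id, offset, gram))
    return hits
```

The long functional unit (here the `database` key) stays the unit of identification, while the q-grams are the unit of comparison. A distance-based index would replace the inner linear scan.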
Anytime K-Nearest Neighbor Search for Database Applications
 FIRST INTERNATIONAL WORKSHOP ON SIMILARITY SEARCH AND APPLICATIONS
, 2008
Cited by 2 (1 self)
Many contemporary database applications require similarity-based retrieval of complex objects where the only usable knowledge of their domain is determined by a metric distance function. In support of these applications, we explored a search strategy for k-nearest neighbor searches with MVP-trees that greedily identifies k answers and then improves the answer set monotonically. The algorithm returns an approximate solution when terminated early, as determined by a limiting radius or an internal measure of progress. Given unbounded time the algorithm terminates with an exact solution. Approximate solutions to k-nearest neighbor search provide much needed speed improvement to hard nearest-neighbor problems. Our anytime approximate formulation is well suited for interactive search applications as well as applications where the distance function itself is an approximation. We evaluate the algorithm over a suite of workloads, including image retrieval, biological data and high-dimensional vector data. Experimental results demonstrate the practical applicability of our approach.
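The anytime behaviour (an answer set that only improves, approximate if stopped early, exact if run to completion) can be sketched without any index. The generator below scans a plain list rather than traversing an MVP-tree, and stops on a budget of distance evaluations rather than the paper's limiting radius or progress measure. The name `anytime_knn` and the `budget` parameter are assumptions for illustration.

```python
import heapq


def anytime_knn(items, query, k, dist, budget=None):
    """Yield successively better k-NN answer sets as lists of (distance, item).

    Stops after `budget` distance evaluations if given (approximate answer);
    with budget=None every item is examined and the last answer is exact.
    """
    heap = []  # max-heap of the current best k, stored as (-distance, n, item)
    for n, item in enumerate(items):
        if budget is not None and n >= budget:
            break
        d = dist(query, item)
        if len(heap) < k:
            heapq.heappush(heap, (-d, n, item))
        elif d < -heap[0][0]:
            # Strictly closer than the current k-th neighbour: swap it in.
            heapq.heapreplace(heap, (-d, n, item))
        else:
            continue  # no improvement, nothing new to report
        yield sorted(((-nd, it) for nd, _, it in heap), key=lambda p: p[0])
```

A caller can consume answers until its time runs out and keep the most recent one. Once k answers are held, the distance to the k-th neighbour never increases from one yielded set to the next.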
Dialysis and its Application
Cited by 2 (0 self)
Abstract: Dialysis is a renal replacement therapy that provides an artificial replacement for kidney dysfunction; it is a life-support treatment but does not treat kidney diseases. Dialysis is based on the principle of the diffusion of solutes along a concentration gradient across a semipermeable membrane. There are three main types of dialysis: hemodialysis, peritoneal dialysis and hemofiltration. [New York Science
On Metric-Space Indexing and Real Workloads
Cited by 1 (1 self)
Contemporary technology is fostering new demands to manage large collections of complex data, including the contents of multimedia and biological databases. In many cases the similarity of the data is defined using a metric distance function. There are many competing algorithmic approaches which, offline, create data structures materializing a hierarchical clustering of the data and leverage the triangle inequality to speed the search for similar data. In order to determine a solution of general applicability it is important to assess the variety of methods on various types of real-world data. We evaluate the performance of an algorithm from each of the three major classes of metric-space indexing methods: generalized hyperplane, vantage point, and radius-based methods. The workloads comprise an image database, a yeast protein sequence database and a database of mass spectrometer protein signatures. For range queries of practical interest the multi-vantage point algorithm (MVP-trees) is shown to be superior. We further consider the optimization of MVP-trees. We consider a common heuristic, choosing corners as vantage points, and show that on real workloads choosing centers performs better.
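How the triangle inequality speeds up search can be shown with a flat pivot table, a simpler structure than any of the three hierarchical methods evaluated here. Distances from every object to a few pivots are computed offline; at query time, the bound |d(q, p) - d(x, p)| <= d(q, x) lets many objects be discarded without computing d(q, x). The class name `PivotTable` and the choice of pivots are assumptions for illustration.

```python
import math


class PivotTable:
    """Offline table of object-to-pivot distances for a metric `dist`."""

    def __init__(self, data, pivots, dist):
        self.data = list(data)
        self.pivots = list(pivots)
        self.dist = dist
        # Built once, offline: one row of pivot distances per object.
        self.table = [[dist(x, p) for p in self.pivots] for x in self.data]

    def range_query(self, q, r):
        """Return (hits, number of objects whose distance was evaluated)."""
        qd = [self.dist(q, p) for p in self.pivots]
        hits, evaluated = [], 0
        for x, row in zip(self.data, self.table):
            # Lower bound on d(q, x) from the triangle inequality.
            lower = max((abs(a - b) for a, b in zip(qd, row)), default=0.0)
            if lower > r:
                continue  # x cannot be within r of q; skip the real distance
            evaluated += 1
            if self.dist(q, x) <= r:
                hits.append(x)
        return hits, evaluated
```

With `math.dist` as the metric, points far from the query are pruned using only table lookups. Which objects serve as pivots strongly affects how tight the bound is, which is the question the corner versus center comparison addresses.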
BIOINFORMATICS
, 2005
"... The choice of optimal distance measure in genome-wide datasets ..."
Managing Biosequences
mSQL is an extended SQL query language targeting the expanding area of biological sequence databases and sequence analysis methods. The core aspects include first-class data types for biological sequences, operators based on an extended-relational algebra, an ability to define logical views of sequences as overlapping q-grams and the materialization of those views as metric-space indices. We first describe the current trends in biological analysis that necessitate a more intuitive, flexible, and optimizable approach than current methodologies. We present our solution, mSQL, and describe its formal definition with respect to both physical and logical operators, detailing the cost model of each operator. We describe the necessity of indexing sequences offline to adequately manage this type of data given space and time concerns. We assess a number of metric-space indexing methods and conclude that MVP-trees can be expected to perform the best for sequence data. We ultimately implement two queries in mSQL to show not only that biologically valid analyses can be expressed in concise mSQL queries, but also that such queries can be optimized in the same ways as those relying on a standard relational algebra.
JAVA DBMS.
This package is available as open source. It is based on multi-vantage point trees (MVPT) [3]. This algorithm was chosen as the result of a study where the performance of an algorithm from each of the three major classes of distance-based index algorithms was compared using a suite of biological databases. The original MVPT is a main-memory data structure. We implemented the disk-based MVP index by fitting each index node into a disk page. Lastly, we designed new bulk-load heuristics to improve query performance [7]. A special property of our implementation is that once an index for a dataset is constructed, a user may employ any of four different retrieval methods. In addition to range and k-NN queries, the package supports range-limited k-nearest neighbor, and approximate range-limited k-nearest neighbor
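The semantics of a range-limited k-nearest neighbor query can be stated in a few lines: return the k closest objects, but only among those within a given radius, so fewer than k results are possible. The sketch below is a brute-force Python illustration of that definition. It is not the package's Java API, which this description does not show, and `range_limited_knn` is a hypothetical name.

```python
import heapq


def range_limited_knn(items, query, k, radius, dist):
    """Return up to k (distance, item) pairs, nearest first, restricted to
    items whose distance to the query is at most `radius`."""
    # The index i breaks distance ties, so items themselves are never compared.
    candidates = ((dist(query, x), i, x) for i, x in enumerate(items))
    within = (c for c in candidates if c[0] <= radius)
    return [(d, x) for d, _, x in heapq.nsmallest(k, within)]
```

A plain range query corresponds to dropping the limit k, and a plain k-NN query to an unbounded radius (for instance `math.inf`).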