Sequential Pattern Mining with Constraints on Large Protein Databases
Sequential pattern mining in protein databases has the potential of efficiently discovering recurring structures that exist in protein sequences. This in turn may provide an understanding of the functional role of proteins which support such structures. In this paper we generalize a well-known sequential pattern mining algorithm, SPAM [1], by incorporating gap and regular expression constraints along the lines proposed in SPIRIT [2]. However, the advantages of using a depth-first algorithm like SPAM are that (a) it allows us to push the constraints deeper inside the mining process by exploiting the prefix antimonotone property of some constraints, (b) it uses a simple vertical bitmap data structure for counting, and (c) it is known to be efficient for mining long patterns. Our work on extending SPAM is motivated by its role in two concrete applications: (1) as a “feature factory” for a secondary structure classification problem in integral membrane proteins and (2) as an underlying engine for answering secondary structure queries on protein databases. A detailed set of experiments confirms that the incorporation of gap and regular expression constraints allows us to extract more specific and biologically relevant information.
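To make the idea of a gap constraint concrete, the following is a minimal Python sketch that counts how many sequences in a small database contain a pattern when the number of residues skipped between consecutive pattern items must lie within a given range. It is an illustration only and not the SPAM algorithm or its bitmap representation; the function names, the gap definition (number of skipped positions), and the toy sequences over the letters H, E, and L are assumptions made for this example.

```python
from typing import Iterable, Set


def match_end_positions(sequence: str, pattern: str,
                        min_gap: int, max_gap: int) -> Set[int]:
    """Return every index at which an occurrence of `pattern` can end.

    Assumption for this sketch: the gap between two consecutive pattern
    items is the number of skipped positions, and it must satisfy
    min_gap <= gap <= max_gap.
    """
    if not pattern:
        return set()
    # All positions where the first pattern item occurs.
    ends = {i for i, ch in enumerate(sequence) if ch == pattern[0]}
    for symbol in pattern[1:]:
        next_ends = set()
        for end in ends:
            lo = end + 1 + min_gap
            hi = min(end + 1 + max_gap, len(sequence) - 1)
            for j in range(lo, hi + 1):
                if sequence[j] == symbol:
                    next_ends.add(j)
        ends = next_ends
        if not ends:
            break  # no partial match can be extended any further
    return ends


def support(database: Iterable[str], pattern: str,
            min_gap: int, max_gap: int) -> int:
    """Count the sequences that contain `pattern` under the gap constraint."""
    return sum(1 for seq in database
               if match_end_positions(seq, pattern, min_gap, max_gap))


# Toy database of hypothetical secondary-structure strings.
toy_db = ["HHEELL", "HLLE", "HLE", "EH"]
print(support(toy_db, "HE", 0, 1))  # 2: "HHEELL" and "HLE"
print(support(toy_db, "HE", 0, 2))  # 3: "HLLE" now also matches
```

Tracking the set of all possible end positions, rather than greedily taking the first match, keeps the check correct when an early match would violate the gap bound but a later one would not.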
Vectorization Techniques for Protein Data Analysis: A Position Paper
With the increasing amount of data about the 3D structure of proteins, there is a growing need for computational techniques to analyse these structure data. In our work, we pursue a technique commonly used in On-line Analytical Processing (OLAP) domains -- namely the multi-dimensional data model and vectorization of data elements onto multi-dimensional vector spaces. This paper presents preliminary work carried out in this direction and provides a vision for future work.
OMICS A Journal of Integrative Biology
, 2003
The life of a cell is governed by the physicochemical properties of a complex network of interacting macromolecules (primarily genes and proteins). Hence, a full scientific understanding of and rational engineering approach to cell physiology require accurate mathematical models of the spatial and temporal dynamics of these macromolecular assemblies, especially the networks involved in integrating signals and regulating cellular responses. The Virginia Tech Consortium is involved in three specific goals of DARPA's computational biology program (Bio-COMP): to create effective software tools for modeling gene-protein-metabolite networks, to employ these tools in creating a new generation of realistic models, and to test and refine these models by well-conceived experimental studies. The special emphasis of this group is to understand the mechanisms of cell cycle control in eukaryotes (yeast cells and frog eggs). The software tools developed at Virginia Tech are designed to meet general requirements of modeling regulatory networks and are collected in a problem-solving environment called JigCell.
December 2005
The SBC-Tree: An Index for Run-Length Compressed Sequences
, 2005
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological sequence databases, multimedia, and facsimile transmission. One of the main challenges is how to operate, e.g., indexing, searching, and retrieval, on the compressed data without decompressing it. In this paper, we present the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m takes O(m log_B(N + m)) I/O operations. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + (|p| + T)/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. We also present two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure, and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worst-case theoretical bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in the context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.
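As a small illustration of run-length encoding and of searching compressed data without decompressing it, the Python sketch below encodes a sequence into (symbol, run length) pairs and tests substring containment directly on the runs using a linear scan. It is not the SBC-tree and involves no index; the function names and the toy sequence are assumptions made for this example.

```python
from itertools import groupby
from typing import List, Tuple

Run = Tuple[str, int]


def rle_encode(sequence: str) -> List[Run]:
    """Collapse each maximal run of equal symbols into (symbol, length)."""
    return [(symbol, sum(1 for _ in run)) for symbol, run in groupby(sequence)]


def rle_decode(runs: List[Run]) -> str:
    """Expand (symbol, length) pairs back into the original string."""
    return "".join(symbol * length for symbol, length in runs)


def rle_contains(text_runs: List[Run], pattern_runs: List[Run]) -> bool:
    """Substring test on RLE data, assuming both inputs come from rle_encode.

    The first pattern run may match the tail of a text run, the last
    pattern run may match the head of a text run, and every run in
    between must match a text run exactly.
    """
    if not pattern_runs:
        return True
    k = len(pattern_runs)
    if k == 1:
        symbol, length = pattern_runs[0]
        return any(s == symbol and n >= length for s, n in text_runs)
    first, last = pattern_runs[0], pattern_runs[-1]
    middle = pattern_runs[1:-1]
    for i in range(len(text_runs) - k + 1):
        window = text_runs[i:i + k]
        if (window[0][0] == first[0] and window[0][1] >= first[1]
                and window[1:-1] == middle
                and window[-1][0] == last[0] and window[-1][1] >= last[1]):
            return True
    return False


text = rle_encode("HHHEELLLL")
print(text)                                    # [('H', 3), ('E', 2), ('L', 4)]
print(rle_contains(text, rle_encode("HEEL")))  # True
print(rle_contains(text, rle_encode("HEL")))   # False
```

The scan visits every run of the text, so it only demonstrates the matching rule on compressed data; an index structure is what avoids scanning all sequences.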
FOR PROTEIN SECONDARY STRUCTURES
Searching proteins by their secondary structures provides a rough and fast method of identifying molecules having a similar fold. Since existing database management systems do not offer integrated exploration methods for querying protein structures, structural similarity searching is usually performed by external tools. This often lengthens the processing time and requires additional processing steps, such as adapting input and output data formats. In this paper, we present an extended SQL language that allows searching a database for proteins whose secondary structures are similar to a structural pattern specified by the user. The presented query language is integrated with the relational database management system and simplifies the manipulation of biological data.
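The general idea of running a structural pattern search inside the database, rather than in an external tool, can be sketched with Python's built-in sqlite3 module by registering a user-defined SQL function. This is not the query language described in the paper, whose syntax is not given here. The table layout, the function name ss_matches, the H/E/C structure alphabet, and the use of a regular expression as the pattern are all assumptions made for this example.

```python
import re
import sqlite3


def ss_matches(pattern: str, structure: str) -> int:
    """Hypothetical helper: 1 if the regex occurs in the structure string."""
    return 1 if re.search(pattern, structure) else 0


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (assumed) name ss_matches.
conn.create_function("ss_matches", 2, ss_matches)

conn.execute("CREATE TABLE proteins (name TEXT, secondary_structure TEXT)")
conn.executemany(
    "INSERT INTO proteins VALUES (?, ?)",
    [
        ("p1", "CCHHHHHHCCEEEECC"),
        ("p2", "CCEEEECCEEEECC"),
        ("p3", "HHHHCCCCCCCCEEEE"),
    ],
)

# Pattern: a helix of at least 4 residues, a coil of 1 to 3 residues,
# then a strand of at least 3 residues.
pattern = r"H{4,}C{1,3}E{3,}"
query = (
    "SELECT name FROM proteins "
    "WHERE ss_matches(?, secondary_structure) ORDER BY name"
)
hits = [row[0] for row in conn.execute(query, (pattern,))]
print(hits)  # ['p1']
```

Because the predicate is evaluated inside the SQL statement, it can be combined with ordinary relational filters and joins without exporting data to another tool.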

