Results 1–10 of 47
Top 10 algorithms in data mining
, 2007
Abstract

Cited by 113 (2 self)
This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, k-NN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. For each algorithm, we provide a description, discuss its impact, and review current and further research on it. These 10 algorithms cover classification,
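As a flavor of how compact some of the listed methods are, here is a minimal sketch of the k-NN classifier in Python with NumPy; the data shapes and parameter defaults are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest
    training points under Euclidean distance (the k-NN rule)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest_labels = y_train[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest_labels, return_counts=True)
        preds.append(values[counts.argmax()])  # ties break toward the smaller label
    return np.array(preds)
```

Production implementations add distance weighting and efficient neighbor search (k-d trees, ball trees); the brute-force loop above is only meant to show the rule itself.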
Community Structure in Graphs
, 2007
Abstract

Cited by 44 (0 self)
Graph vertices are often organized into groups that seem to live fairly independently of the rest of the graph, with which they share but a few edges, whereas the relationships between group members are stronger, as shown by the large number of mutual connections. Such groups of vertices, or communities, can be considered as independent compartments of a graph. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. The task is very hard, though, both conceptually, due to the ambiguity in the definition of community and in the discrimination of different partitions, and practically, because algorithms must find “good” partitions among an exponentially large number of them. Other complications arise from the possible occurrence of hierarchies, i.e. communities nested inside larger communities, and from the existence of overlaps between communities, due to the presence of nodes belonging to more than one group. All these aspects are dealt with in some detail, and many methods are described, from traditional approaches used in computer science and sociology to recent techniques developed mostly within statistical physics.
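One concrete way to score a candidate partition, standard in this literature though not tied to this particular survey, is Newman–Girvan modularity Q = Σ_c (e_c/m − (d_c/2m)²), where e_c counts intra-community edges, d_c is the total degree of community c, and m the total edge count. A minimal Python sketch:

```python
import numpy as np

def modularity(adj, communities):
    """Newman-Girvan modularity Q = sum_c (e_c/m - (d_c/(2m))^2) for an
    undirected graph given as an adjacency matrix and a list of node sets."""
    adj = np.asarray(adj, dtype=float)
    m = adj.sum() / 2.0                  # total number of edges
    deg = adj.sum(axis=1)                # node degrees
    q = 0.0
    for nodes in communities:
        nodes = list(nodes)
        e_c = adj[np.ix_(nodes, nodes)].sum() / 2.0  # intra-community edges
        d_c = deg[nodes].sum()                       # total degree in the community
        q += e_c / m - (d_c / (2.0 * m)) ** 2
    return q
```

For two triangles joined by a single bridge edge, splitting at the bridge gives Q = 5/14, while lumping everything into one community gives Q = 0.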
Fitting hyperelastic models to experimental data
 Hyperelastic Constitutive Laws
, 2004
Abstract

Cited by 26 (2 self)
This paper is concerned with determining material parameters in incompressible isotropic elastic strain-energy functions on the basis of a nonlinear least squares optimization method, by fitting data from the classical experiments of Treloar and of Jones and Treloar on natural rubber. We consider three separate forms of strain-energy function, based respectively on use of the principal stretches, the usual principal invariants of the Cauchy–Green deformation tensor, and a certain set of ‘orthogonal’ invariants of the logarithmic strain tensor. We highlight, in particular, (a) the relative errors generated in the fitting process and (b) the occurrence of multiple sets of optimal material parameters for the same data sets. This multiplicity can lead to very different numerical solutions for a given boundary-value problem, and this is illustrated for a simple example.
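The kind of fit described can be sketched generically. Below is a small damped Gauss–Newton fit of a one-term Ogden model for uniaxial nominal stress, P(λ) = μ(λ^(α−1) − λ^(−α/2−1)); the model choice, synthetic data, and starting values are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def ogden_stress(lam, mu, alpha):
    """Uniaxial nominal stress for a one-term Ogden strain-energy function."""
    return mu * (lam ** (alpha - 1.0) - lam ** (-alpha / 2.0 - 1.0))

def fit_ogden(lam, P, mu0=1.0, alpha0=2.0, n_iter=100):
    """Damped Gauss-Newton least-squares fit of (mu, alpha) to stress data."""
    mu, alpha = mu0, alpha0
    for _ in range(n_iter):
        r = ogden_stress(lam, mu, alpha) - P
        # Partial derivatives of the model w.r.t. mu and alpha.
        d_mu = lam ** (alpha - 1.0) - lam ** (-alpha / 2.0 - 1.0)
        d_alpha = mu * np.log(lam) * (lam ** (alpha - 1.0)
                                      + 0.5 * lam ** (-alpha / 2.0 - 1.0))
        J = np.column_stack([d_mu, d_alpha])
        step = np.linalg.lstsq(J, -r, rcond=None)[0]
        t = 1.0
        # Backtracking line search: halve the step until the residual shrinks.
        while (np.linalg.norm(ogden_stress(lam, mu + t * step[0],
                                           alpha + t * step[1]) - P)
               > np.linalg.norm(r)) and t > 1e-12:
            t *= 0.5
        mu, alpha = mu + t * step[0], alpha + t * step[1]
        if np.linalg.norm(t * step) < 1e-12:
            break
    return mu, alpha
```

As the paper warns, such objectives can have multiple optimal parameter sets, so in practice the fit should be restarted from several initial guesses and the results compared.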
Geometric denoising of protein–protein interaction networks
 PLoS Comput. Biol
, 2009
Abstract

Cited by 14 (3 self)
Understanding complex networks of protein–protein interactions (PPIs) is one of the foremost challenges of the post-genomic era. Due to recent advances in experimental biotechnology, including yeast two-hybrid (Y2H), tandem affinity purification (TAP) and other high-throughput methods for protein–protein interaction (PPI) detection, huge amounts of PPI network data are becoming available. Of major concern, however, are the levels of noise and incompleteness. For example, for Y2H screens, it is thought that the false positive rate could be as high as 64%, and the false negative rate may range from 43% to 71%. TAP experiments are believed to have comparable levels of noise. We present a novel technique to assess the confidence levels of interactions in PPI networks obtained from experimental studies. We use it for predicting new interactions and thus for guiding future biological experiments. This technique is the first to utilize the currently best-fitting network model for PPI networks, geometric graphs. Our approach achieves specificity of 85% and sensitivity of 90%. We use it to assign confidence scores to physical protein–protein interactions in the human PPI network downloaded from BioGRID. Using our approach, we predict 251 interactions in the human PPI network, a statistically significant fraction of which correspond to protein pairs sharing common GO terms. Moreover, we validate a statistically significant portion of our predicted interactions in the HPRD database and the newer release of BioGRID. The data and Matlab code implementing the
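The geometric intuition can be sketched in a generic way (this is an illustrative stand-in, not the authors' pipeline): embed the network in a low-dimensional space, here via classical MDS on shortest-path distances, and score node pairs by embedded distance, with close but non-adjacent pairs becoming candidate interactions:

```python
import numpy as np

def shortest_paths(adj):
    """All-pairs shortest path lengths via Floyd-Warshall (small graphs only)."""
    n = len(adj)
    d = np.where(np.asarray(adj) > 0, 1.0, np.inf)
    np.fill_diagonal(d, 0.0)
    for k in range(n):
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d

def classical_mds(d, dim=2):
    """Embed points so Euclidean distances approximate the matrix d."""
    n = len(d)
    j = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    b = -0.5 * j @ (d ** 2) @ j            # double-centered Gram matrix
    w, v = np.linalg.eigh(b)
    idx = np.argsort(w)[::-1][:dim]        # top eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

A candidate non-edge (i, j) would then be scored by −‖x_i − x_j‖ in the embedding; the actual paper fits geometric graph models rather than this bare MDS sketch.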
The more you learn, the less you store: memory-controlled incremental support vector machines
, 2006
Gene Circuit Analysis of the Terminal Gap Gene huckebein
Abstract

Cited by 6 (1 self)
The early embryo of Drosophila melanogaster provides a powerful model system to study the role of genes in pattern formation. The gap gene network constitutes the first zygotic regulatory tier in the hierarchy of the segmentation genes involved in specifying the position of body segments. Here, we use an integrative, systems-level approach to investigate the regulatory effect of the terminal gap gene huckebein (hkb) on gap gene expression. We present quantitative expression data for the Hkb protein, which enable us to include hkb in gap gene circuit models. Gap gene circuits are mathematical models of gene networks used as computational tools to extract regulatory information from spatial expression data. This is achieved by fitting the model to gap gene expression patterns, in order to obtain estimates for regulatory parameters which predict a specific network topology. We show how considering variability in the data, combined with analysis of parameter determinability, significantly improves the biological relevance and consistency of the approach. Our models are in agreement with earlier results, which they extend in two important respects: First, we show that Hkb is involved in the regulation of the posterior hunchback (hb) domain, but does not have any other essential function. Specifically, Hkb is required for the anterior shift in the posterior border of this domain, which is now reproduced correctly in our models. Second, the gap gene circuits presented here are able to reproduce mutants of terminal gap genes, while previously published models were unable to reproduce any null mutants correctly. As a consequence, our models now capture the expression
Data distribution schemes of sparse arrays on distributed memory multicomputers
 in International Conference on Parallel Processing Workshops
, 2002
Abstract

Cited by 4 (2 self)
A data distribution scheme for sparse arrays on a distributed memory multicomputer is, in general, composed of three phases: data partition, data distribution, and data compression. To implement the data distribution scheme, methods proposed in the literature first perform the data partition phase, then the data distribution phase, followed by the data compression phase. We call this the Send Followed Compress (SFC) scheme. In this paper, we propose two other data distribution schemes, Compress Followed Send (CFS) and Encoding-Decoding (ED), for sparse array distribution. In the CFS scheme, the data compression phase is performed before the data distribution phase. In the ED scheme, the data compression phase is divided into two steps, encoding and decoding; the encoding step and the decoding step are performed before and after the data distribution phase, respectively. To evaluate the CFS and ED schemes, we compare them with the SFC scheme. Both theoretical analysis and experimental tests were conducted. In the theoretical analysis, we analyze the SFC, CFS, and ED schemes in terms of the data distribution time and the data compression time. In the experimental tests, we implemented these schemes on an IBM SP2 parallel machine. The experimental results show that the CFS and ED schemes outperform the SFC scheme for most test cases, and that the ED scheme outperforms the CFS scheme for all test cases.
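The CFS idea can be illustrated with a toy simulation (the function names and the row-block partition are illustrative assumptions): each dense partition is compressed into COO triplets locally before any communication, so only the nonzeros are "sent"; SFC would instead ship the dense blocks and compress after receipt:

```python
import numpy as np

def compress_coo(block):
    """Compress a dense 2-D block into COO (row, col, value) triplets."""
    rows, cols = np.nonzero(block)
    return list(zip(rows.tolist(), cols.tolist(), block[rows, cols].tolist()))

def cfs_messages(array, n_procs):
    """Compress Followed by Send: partition into row blocks, compress each
    block locally, and 'send' only the triplets (smaller messages)."""
    blocks = np.array_split(array, n_procs, axis=0)
    return [compress_coo(b) for b in blocks]
```

For a sparse array, the total message payload under CFS is proportional to the number of nonzeros rather than the full partition size, which is the source of its advantage over SFC in the paper's comparison.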
Glotaran: A Java-Based Graphical User Interface for the R Package TIMP
 J. Stat. Software
Abstract

Cited by 4 (0 self)
In this work the software application Glotaran is introduced as a Java-based graphical user interface to the R package TIMP, a problem-solving environment for fitting superposition models to multidimensional data. TIMP uses a command-line user interface for the interaction with data, the specification of models and the viewing of analysis results. Instead, Glotaran provides a graphical user interface which features interactive and dynamic data inspection, easier model specification (assisted by the user interface) and interactive viewing of results. The interactivity is especially helpful when working with large, multidimensional datasets such as those that often result from time-resolved spectroscopy measurements, allowing the user to easily preselect and manipulate data before analysis and to quickly zoom in on regions of interest in the analysis results. Glotaran has been developed on top of the NetBeans rich client platform and communicates with R through the Java-to-R interface Rserve. The background and the functionality of the application are described here. In addition, the design, development and implementation process of Glotaran is documented in a generic way.
Coordinate dependence of variability analysis
 PLoS Computational Biology
, 2010
Abstract

Cited by 3 (0 self)
Analysis of motor performance variability in tasks with redundancy affords insight about synergies underlying central nervous system (CNS) control. Preferential distribution of variability in ways that minimally affect task performance suggests sophisticated neural control. Unfortunately, in the analysis of variability, the choice of coordinates used to represent multidimensional data may profoundly affect the analysis, introducing an arbitrariness which compromises its conclusions. This paper assesses the influence of coordinates. Methods based on analyzing a covariance matrix are fundamentally dependent on an investigator’s choices. Two reasons are identified: using anisotropy of a covariance matrix as evidence of preferential distribution of variability; and using orthogonality to quantify the relevance of variability to task performance. Both are exquisitely sensitive to coordinates. Unless coordinates are known a priori, these methods do not support unambiguous inferences about CNS control. An alternative method uses a two-level approach in which variability in task execution (expressed in one coordinate frame) is mapped by a function to its result (expressed in another coordinate frame). An analysis of variability in execution, using this function to quantify performance at the level of results, offers substantially less sensitivity to coordinates than analysis of a covariance matrix of execution variables. This is an initial step towards
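The coordinate-sensitivity point can be made concrete with a toy redundant task (entirely hypothetical, with result f(x) = x1 + x2): rescaling one execution coordinate changes the covariance anisotropy dramatically, while the variance of the task result, computed through f, is unchanged:

```python
import numpy as np

def anisotropy(samples):
    """Ratio of largest to smallest eigenvalue of the sample covariance."""
    w = np.linalg.eigvalsh(np.cov(samples.T))
    return w[-1] / w[0]

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2))   # execution variables, roughly isotropic
result = x[:, 0] + x[:, 1]           # hypothetical task result f(x) = x1 + x2

t = np.diag([1.0, 10.0])             # an arbitrary change of units in coordinate 2
y = x @ t                            # the same executions in new coordinates
result_y = y[:, 0] + y[:, 1] / 10.0  # f rewritten in the new coordinates
```

A covariance-based analysis reads the rescaled data as strongly "preferentially distributed", while the result-level analysis through f is unaffected, which is the paper's argument for the two-level approach.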
A higher-order generalized singular value decomposition for comparative analysis of large-scale datasets. Under revision
, 2009
Abstract

Cited by 2 (0 self)
The number of high-dimensional datasets recording multiple aspects of a single phenomenon is increasing in many areas of science, accompanied by a need for mathematical frameworks that can compare multiple large-scale matrices with different row dimensions. The only such framework to date, the generalized singular value decomposition (GSVD), is limited to two matrices. We mathematically define a higher-order GSVD (HO GSVD) for N ≥ 2 matrices D_i ∈ R^(m_i × n), each with full column rank. Each matrix is exactly factored as D_i = U_i Σ_i V^T, where V, identical in all factorizations, is obtained from the eigensystem SV = VΛ of the arithmetic mean S of all pairwise quotients A_i A_j^(-1) of the matrices A_i = D_i^T D_i, i ≠ j. We prove that this decomposition extends to higher orders almost all of the mathematical properties of the GSVD. The matrix S is nondefective, with V and Λ real. Its eigenvalues satisfy λ_k ≥ 1. Equality holds if and only if the corresponding eigenvector v_k is a right basis vector of equal significance in all matrices D_i and D_j, that is, σ_{i,k}/σ_{j,k} = 1 for all i and j, and the corresponding left basis vector u_{i,k} is orthogonal to all other vectors in U_i for all i. The eigenvalues λ_k = 1, therefore, define the “common HO GSVD subspace.” We illustrate the HO GSVD with a comparison of genome-scale cell-cycle mRNA expression from S. pombe, S. cerevisiae and human. Unlike existing algorithms, a mapping among the genes of these disparate organisms is not required. We find that the approximately common HO GSVD subspace represents the cell-cycle mRNA expression oscillations, which
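The construction of the shared factor can be sketched directly from these definitions (a naive numerical sketch, not the authors' implementation): form A_i = D_i^T D_i, average the pairwise quotients A_i A_j^(-1) over i ≠ j, and take the eigensystem of the result:

```python
import numpy as np

def hogsvd_s(mats):
    """Arithmetic mean S of all pairwise quotients A_i A_j^{-1}, where
    A_i = D_i^T D_i and i != j; the eigensystem S V = V Lambda of this
    matrix supplies the factor V shared by all N factorizations."""
    a = [d.T @ d for d in mats]
    n = len(a)
    s = np.zeros_like(a[0], dtype=float)
    for i in range(n):
        for j in range(n):
            if i != j:
                s += a[i] @ np.linalg.inv(a[j])
    return s / (n * (n - 1))
```

Consistent with the stated theorem, for full-column-rank inputs the eigenvalues of S come out real and ≥ 1, with λ_k = 1 marking the common HO GSVD subspace; a robust implementation would avoid the explicit inverses.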