Elastic maps and nets for approximating principal manifolds and their application to microarray data visualization
- In this book
Cited by 4 (3 self)
Summary. Principal manifolds are defined as lines or surfaces passing through “the middle” of the data distribution. Linear principal manifolds (Principal Components Analysis) are routinely used for dimension reduction, noise filtering and data visualization. Recently, methods for constructing non-linear principal manifolds were proposed, including our elastic maps approach, which is based on a physical analogy with elastic membranes. We have developed a general geometric framework for constructing “principal objects” of various dimensions and topologies with the simplest quadratic form of the smoothness penalty, which allows very effective parallel implementations. Our approach is implemented in three programming languages (C++, Java and Delphi) with two graphical user interfaces (VidaExpert and ViMiDa applications). In this paper we overview the method of elastic maps and present in detail one of its major applications: the visualization of microarray data in bioinformatics. We show that the method of elastic maps outperforms linear PCA in terms of data approximation, representation of between-point distance structure, preservation of local point neighborhood and representation of point classes in low-dimensional spaces.
Key words: elastic maps, principal manifolds, elastic functional, data analysis, data visualization, surface modeling
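The abstract does not state the elastic functional itself. As a rough illustration of what a quadratic smoothness penalty on an "elastic" object can look like, the sketch below scores a one-dimensional chain of nodes with three terms: mean squared distance from the data to the nearest node, squared edge lengths (stretching), and squared second differences (bending). The three-term form, the function names and the default weights `lam` and `mu` are assumptions made for illustration, not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def elastic_energy(data, nodes, lam=0.1, mu=0.1):
    """Energy of a chain of nodes (assumed simplified three-term form)."""
    # Approximation term: mean squared distance from each point to its nearest node.
    approx = sum(min(sq_dist(x, y) for y in nodes) for x in data) / len(data)
    # Stretching term: squared lengths of edges between consecutive nodes.
    stretch = sum(sq_dist(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1))
    # Bending term: squared second differences over consecutive node triples.
    bend = sum(
        sum((a + c - 2 * b) ** 2
            for a, b, c in zip(nodes[i - 1], nodes[i], nodes[i + 1]))
        for i in range(1, len(nodes) - 1)
    )
    return approx + lam * stretch + mu * bend
```

Every term is quadratic in the node positions, so for a fixed assignment of points to nodes the minimum over node positions is the solution of a linear system. A straight, evenly spaced chain has zero bending term; moving its middle node off the line increases the energy.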
Principal graphs and manifolds
- in “Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques
Cited by 2 (2 self)
In many physical, statistical, biological and other investigations it is desirable to approximate a system of points by objects of lower dimension and/or complexity. For this purpose, Karl Pearson invented principal component analysis in 1901 and found ‘lines and planes of closest fit to system of points’. The famous k-means algorithm solves the approximation problem too, but by finite sets instead of lines and planes. This chapter gives a brief practical introduction to the methods of construction of general principal objects, i.e. objects embedded in the ‘middle’ of the multidimensional data set. As a basis, the unifying framework of mean squared distance approximation of finite datasets is selected. Principal graphs and manifolds are constructed as generalisations of principal components and k-means principal points. For this purpose, the family of expectation/maximisation algorithms with nearest generalisations is presented. Construction of principal graphs with controlled complexity is based on the graph grammar approach.
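The simplest case named in the abstract, k-means principal points, can be sketched as an alternation of two steps that never increase the mean squared distance: assign each point to its nearest centre, then move each centre to the mean of its assigned points. The code below is a minimal sketch of that standard scheme, not the chapter's own code; the function names and the fixed iteration count are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_squared_distance(points, centers):
    """Mean squared distance from each point to its nearest centre."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def kmeans(points, centers, iterations=10):
    """Alternate nearest-centre assignment and centre re-estimation."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: group points by nearest centre.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            groups[j].append(p)
        # Update step: each centre becomes the mean of its group
        # (an empty group leaves its centre unchanged).
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers
```

For example, `kmeans([(0, 0), (0, 2), (10, 0), (10, 2)], [(1, 1), (9, 1)])` moves the two centres to the means of the left and right pairs of points.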
VISUALIZATION OF CONTENT AND SEMANTICAL RELATIONS OF GEONOTES
Cited by 1 (1 self)
The total population of GPS-enabled location-based services (LBS) subscribers is constantly increasing. These GPS-enabled devices produce a wide range of media content (e.g., text/audio notes, pictures, or videos) enhanced by geo-tagged information. This fact poses a challenge regarding how to store and retrieve it and opens new research opportunities for visualizing this type of data. The overall aim of our research efforts is to develop novel approaches and methods for visualizing the content of these documents that will be placed in maps using GPS coordinates as well as to visualize the semantical, temporal, and spatial relations between the documents themselves. In this work, we concentrate on text documents and other data formats that could be transformed into text. We combined different visualization and interaction techniques, such as glyph-based techniques and visual clustering, to achieve our research aims. Our visualization tool, called GNV System (GeoNotes Visualization System), demonstrates the interplay of different interaction techniques and components as well as their functionality. It will be presented in this paper.
KNOWLEDGE DISCOVERY IN COMPUTER NETWORK DATA: A SECURITY PERSPECTIVE by
, 2006
From a security perspective, computer network data is analyzed largely for two purposes: to detect known structures, and to identify previously unknown structures. As an example of the former, it is considered standard procedure to filter network traffic for previously identified viruses in order to prevent infection and to reduce virus spread. As an example of the latter, security researchers may want to search datasets in order to identify and discover previously unknown relationships or structures in the data, such as intrusions into a network by an external hacker. However, among other limitations, traditional methods of network data analysis are insufficient when processing large volumes of network traffic, do not allow for the discovery of local structures, do not visualize high-dimensional data in meaningful ways, and do not allow user input during search iterations. We present the development, analysis, and testing of a new framework for the analysis of network traffic data. In particular, and among others, the framework addresses the following questions: How is network traffic represented in high-
Mining Allocating Patterns in One-sum Weighted Items
An Association Rule (AR) is a common knowledge model in data mining that describes an implicative co-occurring relationship between two disjoint sets of binary-valued transaction database attributes (items), expressed in the form of an “antecedent ⇒ consequent” rule. A variant of the AR is the Weighted Association Rule (WAR). With regard to a marketing context, this paper introduces a new knowledge model in data mining: the ALlocating Pattern (ALP). An ALP is a special form of WAR, where each rule item is associated with a weighting score between 0 and 1, and the sum of all rule item scores is 1. It can not only indicate the implicative co-occurring relationship between two (disjoint) sets of items in a weighted setting, but also inform the “allocating” relationship among rule items. ALPs can be demonstrated to be applicable in marketing and possibly a surprising variety of other areas. We further propose an Apriori-based algorithm to extract hidden and interesting ALPs from a “one-sum” weighted transaction database. The experimental results show the effectiveness of the proposed algorithm.
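The "one-sum" condition in the abstract (each score between 0 and 1, all scores summing to 1) can be made concrete with a small sketch. The abstract does not say how the weights are obtained, so normalising raw per-item amounts within a transaction is an assumption here, and the function names and item labels are hypothetical.

```python
def to_one_sum(transaction):
    """Normalise non-negative item amounts so the weights sum to 1 (assumed construction)."""
    if any(value < 0 for value in transaction.values()):
        raise ValueError("item amounts must be non-negative")
    total = sum(transaction.values())
    if total <= 0:
        raise ValueError("transaction must have a positive total")
    return {item: value / total for item, value in transaction.items()}


def is_one_sum(weights, tol=1e-9):
    """Check the one-sum condition: every weight in [0, 1] and the weights sum to 1."""
    in_range = all(0.0 <= w <= 1.0 for w in weights.values())
    return in_range and abs(sum(weights.values()) - 1.0) <= tol
```

For example, a basket with 2 units spent on bread and 6 on milk becomes the weighted transaction `{"bread": 0.25, "milk": 0.75}`, which reads as an allocation of the whole across its items.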
Organization on the ACM Computing Classification System
Abstract. We propose a method, Cluster-Lift, for parsimoniously mapping clusters of ontology classes of lower levels onto a subset of high-level classes in such a way that the latter can be considered as a generalized description of the former. Specifically, we consider the problem of visualization of activities of a Computer Science Research organization on the ACM Computing Subjects Classification (ACMC), which is a three-level taxonomy. It is possible to specify the set of ACMC subjects that are investigated by the organization’s teams and individual members and map them to the ACMC hierarchy. This visualization, however, usually appears overly detailed, confusing, and difficult to interpret. This is why we propose a two-stage Cluster-Lift procedure. In the first stage, the subjects are clustered according to their similarity, defined in such a way that the greater the number of researchers working on a pair of subjects, the greater the similarity between the pair. In the second stage, each subject cluster is mapped onto ACMC and lifted within the taxonomy. The lifting involves a formalization of the concept of “head subject”, as well as its “gaps” and “offshoots”, and is to be done in a parsimonious way by minimizing a weighted sum of the numbers of head subjects, gaps and offshoots. The Cluster-Lift results are easy to see and interpret. A real-world example of the working of our approach is provided.
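The first-stage similarity can be illustrated in its simplest possible form: count, for every pair of subjects, how many researchers work on both. The abstract only says that similarity grows with that number, so using the raw count is an assumption, and the function name, researcher labels and subject codes below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def subject_similarity(researcher_subjects):
    """Map each unordered subject pair to the number of researchers working on both."""
    counts = Counter()
    for subjects in researcher_subjects.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(subjects)), 2):
            counts[(a, b)] += 1
    return counts
```

The resulting pair counts could then be passed to any clustering method that accepts a similarity matrix; pairs that never co-occur are simply absent and read as 0 from the `Counter`.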
Measuring Distance to Dataset Based on K-Means Clustering
A QSAR is a model of some property (e.g. toxicity or solubility) of a chemical compound in terms of its chemical structure. Such in silico methods are increasingly used over in vitro synthesis and experimentation to screen for suitability as drug candidates.
The Applicability Problem
A QSAR model is trained on a ‘training set’ of compounds whose toxicity is known. Predictions are sought for new compounds’ toxicity … but the model is only applicable near the training set. How can we measure this distance to training set?
Distance to Dataset: Existing Measures
Bounding Box
• Simple to compute but …
• Real datasets aren’t box-shaped
• Extremely crude model of dataset’s shape
Leverage or ‘Mahalanobis’ distance
• Sound mathematical footing but …
• Assumes there is a global model
• Measures uncertainty due to estimation of the model, rather than its applicability
k Nearest Neighbours
• Average distance to nearby points
• Closely fits the dataset’s shape but …
• Not parsimonious: ‘overfitted’
K-Means Clustering
• Traditionally used for classification
• Used here to model dataset’s shape
• Handles irregular, non-convex, and even disconnected shapes
• We could simply take distance to nearest centroid: a discontinuous, ‘winner takes all’ approach
• Instead, take average distance to all centroids, weighted by fuzzy membership:
Distance to Dataset:
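The poster's own formula is cut off in this listing, so it is not reproduced here. As a hedged illustration of the idea in the last bullet, the sketch below averages the distances from a query point to all centroids, weighting each by a fuzzy membership. The membership formula is the one used in fuzzy c-means, which is an assumption; the function name and the fuzzifier default `m=2.0` are hypothetical.

```python
import math


def fuzzy_distance_to_dataset(point, centroids, m=2.0):
    """Membership-weighted average distance to all centroids (assumed fuzzy c-means weights)."""
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on a centroid has full membership there and distance 0.
    if any(d == 0.0 for d in dists):
        return 0.0
    # Membership of centroid j is proportional to d_j ** (-2 / (m - 1)), with m > 1.
    power = 2.0 / (m - 1.0)
    raw = [d ** -power for d in dists]
    total = sum(raw)
    memberships = [r / total for r in raw]
    return sum(u * d for u, d in zip(memberships, dists))
```

Unlike the distance to the nearest centroid alone, this value changes smoothly as the query point crosses the boundary between two clusters, because both centroids contribute on either side of it.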
Annotated Suffix Trees for Text Modelling and Classification
, 2008
Suffix trees are compact and versatile data structures in which paths from the root to nodes represent substrings of the encoded text. By annotating such a tree with the frequencies of substrings, it is possible to construct a compact model of text that captures its sequential nature. This thesis investigates the use of such a model in the representation and classification of text. The basic approach in this thesis is to use an Annotated Suffix Tree (AST) to represent a pre-specified collection of texts (“class”). A document, represented as a string or another (“auxiliary”) suffix tree, is matched to the AST to allow, firstly, the scoring of matches between the document and the AST and, secondly, the identification of a number of substrings (“features”) that maximally contribute to the matching score. Based on this, methods are proposed for the interrelated problems of: (i) classification of text against several, possibly overlapping, classes, (ii) highlighting the features in a text which are most relevant to a particular class (this problem, to our knowledge, has never before been computationally addressed). The developed methods are applied to well-established text analysis problems such as e-mail spam filtering and document classification, with three aims in mind: (i) to adjust parameters of the scoring function and assess the effect on performance, (ii) to test the method on benchmark and newly developed test sets, and (iii) to generate human-readable evaluations of classification features within query documents. Experiments show that the AST method is competitive with other current approaches and in some cases, such as spam filtering, achieves higher classification accuracy; the method also allows the tackling of problems not typically addressed by current alternative methods. The AST method is therefore a useful addition to the arsenal of available classification methods.
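The frequency annotation described above can be sketched with a deliberately simple structure: an uncompressed suffix trie built from nested dictionaries, where each node stores how many times the substring spelled by its path occurs in the text. This is not the thesis's data structure or scoring function; a real suffix tree compresses single-child paths, and the names `build_ast` and `frequency` are hypothetical.

```python
def build_ast(text):
    """Build an uncompressed suffix trie whose nodes count substring occurrences."""
    root = {"count": 0, "children": {}}
    for start in range(len(text)):
        node = root
        # Insert the suffix text[start:], incrementing every node on its path.
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def frequency(tree, substring):
    """Return how many times a non-empty substring occurs in the indexed text."""
    node = tree
    for ch in substring:
        node = node["children"].get(ch)
        if node is None:
            return 0
    return node["count"]
```

Each node's count equals the number of suffixes that start with its path, which is exactly the number of occurrences of that substring. The trie uses space quadratic in the text length, so it only suits short illustrative inputs.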
Constructing and Mapping Fuzzy Thematic Clusters to Higher Ranks in a Taxonomy
Abstract. We present a method for mapping a structure to a related taxonomy in a thematically consistent way. The components of the structure are supplied with fuzzy profiles over the taxonomy. These are then generalized in two steps: first, by fuzzy clustering, and then by mapping the clusters to higher ranks of the taxonomy. To be specific, we concentrate on the Computer Sciences area represented by the ACM Computing Classification System (ACM-CCS), but the approach is also applicable to other taxonomies. We build fuzzy clusters of the taxonomy subjects according to the similarity between individual profiles. Clusters are extracted using an original additive spectral clustering method involving a number of model-based stopping conditions. The clusters are parsimoniously lifted to higher ranks of the taxonomy using an original recursive algorithm for minimizing a penalty function that involves “head subjects” on the higher ranks of the taxonomy along with their “gaps” and “offshoots”. An example is given illustrating the method applied to real-world data.
Integrating Data Mining and Agent Based Modeling and Simulation
Abstract. In this paper, we introduce an integration study which combines Data Mining (DM) and Agent Based Modeling and Simulation (ABMS). This study, as a new paradigm for DM/ABMS, is concerned with two approaches: (i) applying DM techniques in ABMS investigation, and conversely (ii) utilizing ABMS results in DM research. A detailed description of each approach is presented in this paper. A conclusion and the future work of this (integration) study are given at the end.

