Feature Selection with Linked Data in Social Media
Cited by 19 (15 self)
Feature selection is widely used in preparing high-dimensional data for effective data mining. Increasingly popular social media data presents new challenges to feature selection. Social media data consists of (1) traditional high-dimensional, attribute-value data such as posts, tweets, comments, and images, and (2) linked data that describes the relationships between social media users as well as who wrote each post, etc. The nature of social media also means that its data is massive, noisy, and incomplete, which exacerbates the already challenging problem of feature selection. In this paper, we illustrate the differences between attribute-value data and social media data, investigate whether linked data can be exploited in a new feature selection framework by taking advantage of social science theories, extensively evaluate the effects of user-user and user-post relationships manifested in linked data on feature selection, and discuss some research issues for future work.
Visual Analysis of Graphs with Multiple Connected Components
Cited by 15 (4 self)
In this paper, we present a system for the interactive visualization and exploration of graphs with many weakly connected components. The visualization of large graphs has recently received much research attention. However, specific systems for visual analysis of graph data sets consisting of many such components are rare. In our approach, we rely on graph clustering using an extensive set of topology descriptors. Specifically, we use the Self-Organizing-Map algorithm in conjunction with a user-adaptable combination of graph features for clustering of graphs. It offers insight into the overall structure of the data set. The clustering output is presented in a grid containing clusters of the connected components of the input graph. Interactive feature selection and task-tailored data views allow the exploration of the whole graph space. The system also provides tools for assessment and display of cluster quality. We demonstrate the usefulness of our system by application to a shareholder structure analysis problem based on a large real-world data set. While our approach has so far been applied to weighted directed graphs only, it can be used for various graph types.
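The first stage described in this abstract, splitting a weighted directed graph into weakly connected components and describing each one by topology features, might be sketched as follows. This is only an illustration: the function names, the edge-list format, and the four descriptors are assumptions, not the paper's feature set, and the Self-Organizing-Map clustering step is not shown.

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Group nodes into weakly connected components (edge direction ignored).

    `edges` is a list of (source, target, weight) tuples; this format is an
    assumption made for the sketch.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, _w in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def descriptors(component, edges):
    """Compute a few illustrative topology descriptors for one component."""
    # Both endpoints of an edge always lie in the same weak component,
    # so testing the source node is sufficient.
    inside = [(u, v, w) for u, v, w in edges if u in component]
    n, m = len(component), len(inside)
    density = m / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": n,
        "edges": m,
        "total_weight": sum(w for _, _, w in inside),
        "density": density,
    }


# Hypothetical shareholder-style graph: (owner, company, share).
edges = [("A", "B", 0.6), ("B", "C", 0.3), ("D", "E", 1.0)]
components = sorted(weakly_connected_components(edges), key=len, reverse=True)
features = [descriptors(c, edges) for c in components]
```

Each entry of `features` is one feature vector per component; vectors of this kind would be the input to a clustering method such as a Self-Organizing Map.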
Topic taxonomy adaptation for group profiling
- ACM Trans. Knowl. Discov. Data, 2008
Cited by 13 (9 self)
A topic taxonomy is an effective representation that describes salient features of virtual groups or online communities. A topic taxonomy consists of topic nodes. Each internal node is defined by its vertical path (i.e., ancestor and child nodes) and its horizontal list of attributes (or terms). In a text-dominant environment, a topic taxonomy can be used to flexibly describe a group’s interests with varying granularity. However, the stagnant nature of a taxonomy may prevent it from capturing changes in a group’s interest in a timely manner. This paper addresses the problem of how to adapt a topic taxonomy to the accumulated data that reflect the change in a group’s interest, in order to achieve dynamic group profiling. We first discuss the issues related to topic taxonomy. We next formulate taxonomy adaptation as an optimization problem to find the taxonomy that best fits the data. We then present a viable algorithm that can efficiently accomplish taxonomy adaptation. We conduct extensive experiments to evaluate our approach’s efficacy for group profiling, compare the approach with some alternatives, and study its performance for dynamic group profiling. While pointing out various applications of taxonomy adaptation, we suggest some future work that can take advantage of burgeoning Web 2.0 services for online targeted marketing, counterterrorism in connecting dots, and community tracking.
Opcode sequences as representation of executables for data-mining-based unknown malware detection
- Information Sciences 227, 2013
Cited by 12 (0 self)
Malware can be defined as any type of malicious code that has the potential to harm a computer or network. The volume of malware is growing faster every year and poses a serious global security threat. Consequently, malware detection has become a critical topic in computer security. Currently, signature-based detection is the most widespread method used in commercial antivirus. In spite of the broad use of this method, it can detect malware only after the malicious executable has already caused damage and provided the malware is adequately documented. Therefore, the signature-based method consistently fails to detect new malware. In this paper, we propose a new method to detect unknown malware families. This model is based on the frequency of the appearance of opcode sequences. Furthermore, we describe a technique to mine the relevance of each opcode and assess the frequency of each opcode sequence. In addition, we provide empirical validation that this new method is capable of detecting unknown malware.
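The representation this abstract builds on, the frequency of opcode sequences in an executable, might be sketched as below. The function name, the sequence length `n=2`, and the sample opcode trace are assumptions chosen for illustration. The abstract's opcode relevance weighting is omitted, because its details are not given here.

```python
from collections import Counter


def opcode_ngram_frequencies(opcodes, n=2):
    """Return the relative frequency of each length-n opcode sequence.

    `opcodes` is a list of mnemonic strings, assumed to come from a
    disassembler. `n` is a hypothetical choice of sequence length.
    """
    ngrams = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    total = len(ngrams)
    if total == 0:
        return {}
    counts = Counter(ngrams)
    return {gram: count / total for gram, count in counts.items()}


# Made-up opcode trace, for illustration only.
trace = ["push", "mov", "push", "mov", "call"]
freqs = opcode_ngram_frequencies(trace, n=2)
```

A frequency dictionary like `freqs` can be turned into a fixed-length feature vector, one dimension per observed sequence, and passed to a standard classifier.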
Unsupervised feature selection for linked social media data
- in KDD, 2012
Cited by 12 (9 self)
The prevalent use of social media produces mountains of unlabeled, high-dimensional data. Feature selection has been shown to be effective in dealing with high-dimensional data for efficient data mining. Feature selection for unlabeled data remains a challenging task due to the absence of label information by which feature relevance can be assessed. The unique characteristics of social media data further complicate the already challenging problem of unsupervised feature selection (e.g., part of social media data is linked, which invalidates the independent and identically distributed assumption), bringing about new challenges to traditional unsupervised feature selection algorithms. In this paper, we study the differences between social media data and traditional attribute-value data, investigate whether the relations revealed in linked data can be used to help select relevant features, and propose a novel unsupervised feature selection framework, LUFS, for linked social media data. We perform experiments with real-world social media datasets to evaluate the effectiveness of the proposed framework and probe the working of its key components.
Advancing Feature Selection Research − ASU Feature Selection Repository
Cited by 12 (1 self)
The rapid advance of computer-based high-throughput techniques has provided unparalleled opportunities for humans to expand capabilities in production, services, communications, and research. Meanwhile, immense quantities of high-dimensional data are accumulated, challenging state-of-the-art data mining techniques. Feature selection is an essential step in successful data mining applications, which can effectively reduce data dimensionality by removing the irrelevant (and the redundant) features. In the past few decades, researchers have developed a large number of feature selection algorithms. These algorithms are designed to serve different purposes, are of different models, and all have their own advantages and disadvantages. Although there have been intensive efforts on surveying existing feature selection algorithms, to the best of our knowledge, there is still no dedicated repository that collects the representative feature selection algorithms to facilitate their comparison and joint study. To fill this gap, in this work we present a feature selection repository, which is designed to collect the most popular algorithms that have been developed in feature selection research and to serve as a platform for facilitating their application, comparison, and joint study. The repository also helps researchers achieve more reliable evaluation in the process of developing new feature selection algorithms.
Identifying Biologically Relevant Genes via Multiple Heterogeneous Data Sources
Cited by 10 (2 self)
Selection of genes that are differentially expressed and critical to a particular biological process has been a major challenge in post-array analysis. Recent development in bioinformatics has made various data sources available such as mRNA and miRNA expression profiles, biological pathway and gene annotation, etc. Efficient and effective integration of multiple data sources helps enrich our knowledge about the involved samples and genes for selecting genes bearing significant biological relevance. In this work, we studied a novel problem of multi-source gene selection: given multiple heterogeneous data sources (or data sets), select genes from expression profiles by integrating information from various data sources. We investigated how to effectively employ information contained in multiple data sources to extract an intrinsic global geometric pattern and use it in covariance analysis for gene selection. We designed and conducted experiments to systematically compare the proposed approach with representative methods in terms of statistical and biological significance, and showed the efficacy and potential of the proposed approach with promising findings.