Results 1 - 6 of 6
On Nonmetric Similarity Search Problems in Complex Domains
, 2010
Cited by 10 (3 self)
The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query. A popular type of such a mechanism is similarity querying. For a long time, the database-oriented applications of similarity search employed a definition of similarity restricted to metric distances. Due to its topological properties, metric similarity can be effectively used to index a database, which can then be queried efficiently by so-called metric access methods. However, with the increasing complexity of data entities across various domains, many similarities have appeared in recent years that are not metrics; we call them nonmetric similarity functions. In this paper we survey domains employing nonmetric functions for effective similarity search, and methods for efficient nonmetric similarity search. First, we show that the ongoing research in many of these domains requires complex representations of data entities. Simultaneously, such complex representations also allow us to model complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex a similarity function one develops, the more likely it is to be nonmetric. Second, we review the state-of-the-art techniques for efficient (fast) nonmetric similarity search, concerning both exact and approximate search. Finally, we discuss some open problems and possible future research trends.
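As an illustration of the metric/nonmetric distinction drawn in this abstract, the following minimal sketch checks the triangle inequality, one of the properties a metric must satisfy, for two distance functions. The choice of squared Euclidean distance as the nonmetric example, the helper name triangle_violations, and the three sample points are assumptions made for this note and are not taken from the paper.

```python
from itertools import permutations


def squared_euclidean(a, b):
    """Squared L2 distance: symmetric and non-negative, but not a metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Ordinary L2 distance, which is a metric."""
    return squared_euclidean(a, b) ** 0.5


def triangle_violations(points, dist, eps=1e-12):
    """Return ordered triples (a, b, c) with dist(a, c) > dist(a, b) + dist(b, c)."""
    return [
        (a, b, c)
        for a, b, c in permutations(points, 3)
        if dist(a, c) > dist(a, b) + dist(b, c) + eps
    ]


# Hypothetical sample data: three collinear points on the real line.
points = [(0.0,), (1.0,), (2.0,)]
metric_violations = triangle_violations(points, euclidean)
nonmetric_violations = triangle_violations(points, squared_euclidean)
```

With these points the Euclidean distance yields no violating triples, while the squared distance does (4 > 1 + 1 for the path through the middle point). Index structures that prune candidates using the triangle inequality therefore cannot be applied to such a function without modification.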
Hardware support for language aware information mining
Abstract. Information retrieval from text, or ‘text mining’, is the process of extracting interesting and non-trivial knowledge from unstructured text. With the ever-increasing amounts of information stored on the web or archived within a computing system, high-performance data processing architectures are required to process this data in real time. The aim of the work presented in this paper is the development of a hardware text mining IP-Core for use in FPGA-based systems. In this paper we describe the pre-processing engine we have developed for the PRESENCE II PCI card to accelerate the identification of significant words within a document, logging their frequency and position. The performance of this system is then compared to an equivalent software implementation using the Lucene software package.
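The pre-processing step described in this abstract (identifying significant words and logging their frequency and position) can be sketched in software roughly as follows. This is only an illustrative sketch: it is neither the hardware design nor the Lucene baseline from the paper. The stop-word list, the tokenising pattern, and the function name index_significant_words are assumptions, and "significant" is taken here to mean simply "not a stop word".

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "or", "to", "in", "is", "on"})


def index_significant_words(text, stop_words=STOP_WORDS):
    """Map each non-stop word to its frequency and its token positions."""
    positions = defaultdict(list)
    # Token positions count every word, including stop words, from zero.
    for pos, match in enumerate(re.finditer(r"[A-Za-z0-9']+", text)):
        word = match.group().lower()
        if word not in stop_words:
            positions[word].append(pos)
    return {
        word: {"frequency": len(where), "positions": where}
        for word, where in positions.items()
    }


index = index_significant_words("The card scans the text and the card logs words")
```

Here index["card"] records a frequency of 2 at token positions 1 and 7, while stop words such as "the" are not indexed at all.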
University of York,
, 2005
The aims of the AICP (Ambient Intelligent Co-Processor) project are to develop and implement high-performance hardware pattern-matching algorithms for use in embedded ubiquitous systems. As part of this project we aim to implement the pattern-matching algorithms on the PRESENCE-2 hardware platform. PRESENCE-2 is a PCI-based accelerator card for high-performance applications, designed and built here in the Computer Science Department at York University. This paper introduces the capabilities of the PRESENCE-2 card, and explains the hardware/software design methods and the libraries and tools needed to develop scalable solutions on the card. The paper closes by detailing several pattern-matching libraries that have been ported to the PRESENCE-2 card.
Received Day Month Year Revised Day Month Year Communicated by Managing Editor
, 906
We review the current state of data mining and machine learning in Astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. On the one hand, it is a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, which promises almost limitless scientific advances. On the other, it can be the application of black-box computing algorithms that at best give little physical insight, and at worst provide questionable results. Here, we give an overview of the entire data mining process, from data collection through the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines; applications from a broad range of Astronomy, with an emphasis on those where data mining resulted in improved physical insights, and important current and future directions, including the construction of full probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one is careful about the selection of an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be the powerful tool, and not the meaningless black box.