Fast Kernels for String and Tree Matching
, 2004
Cited by 110 (7 self)
Introduction. Many problems in machine learning require a data classification algorithm to work with a set of discrete objects. Common examples include biological sequence analysis, where data is represented as strings (Durbin et al., 1998), and Natural Language Processing (NLP), where the data is given in the form of a string combined with a parse tree (Collins and Duffy, 2001) or an annotated sequence (Altun et al., 2003). In order to apply kernel methods one defines a measure of similarity between discrete structures via a feature map $\phi : X \to F$. Here $X$ is the set of discrete structures (e.g. the set of all parse trees of a language) and $F$ is a Hilbert space. Since $\phi(x) \in F$ we can define a kernel by evaluating the scalar products $k(x, x') = \langle \phi(x), \phi(x') \rangle$ (1.1), where $x, x' \in X$. The success of a kernel method employing $k$ depends both on the faithful representation of discrete data and an efficient means of computing $k$. Recent research effort has focussed on defining meaningful ker
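As a rough illustration of a kernel defined as a scalar product of feature vectors, the sketch below uses a k-spectrum feature map (counts of all length-k substrings). The choice of feature map and the names `spectrum_features` and `spectrum_kernel` are assumptions made for this example; this is a naive computation, not the fast method the paper proposes.

```python
from collections import Counter


def spectrum_features(s, k):
    """Assumed feature map: count every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(x, y, k=2):
    """Kernel value as the scalar product of the two count vectors."""
    fx = spectrum_features(x, k)
    fy = spectrum_features(y, k)
    # Only substrings present in both strings contribute to the sum.
    return sum(count * fy[sub] for sub, count in fx.items())


if __name__ == "__main__":
    print(spectrum_kernel("abab", "bab", k=2))
```

Because the value is an explicit inner product of count vectors, the function is symmetric in its two arguments.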
Finding Maximal Repetitions in a Word in Linear Time
 In Symposium on Foundations of Computer Science
, 1999
Cited by 92 (4 self)
A repetition in a word $w$ is a subword with the period of at most half of the subword length. We study maximal repetitions occurring in $w$, that is, those for which any extended subword of $w$ has a bigger period. The set of such repetitions represents in a compact way all repetitions in $w$. We first prove a combinatorial result asserting that the sum of exponents of all maximal repetitions of a word of length $n$ is bounded by a linear function in $n$. This implies, in particular, that there is only a linear number of maximal repetitions in a word. This allows us to construct a linear-time algorithm for finding all maximal repetitions. Some consequences and applications of these results are discussed, as well as related works.
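To make the definition concrete, here is a brute-force sketch that lists the maximal repetitions of a short word. It takes quadratic time or worse and is only an illustration of the definition, not the linear-time algorithm of the paper; the function names and the `(start, end, period)` output format are assumptions of this example.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] for all valid i."""
    for p in range(1, len(s) + 1):
        if all(s[i] == s[i + p] for i in range(len(s) - p)):
            return p
    return len(s)


def maximal_repetitions(w):
    """Return (start, end, period) triples, end exclusive, by brute force."""
    n = len(w)
    runs = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            # Extend the stretch on which w[j] == w[j + p] as far as possible.
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            # The subword w[i:j+p] has period p; keep it only if it is at
            # least 2p long and p is its smallest period.
            if j - i >= p and smallest_period(w[i:j + p]) == p:
                runs.append((i, j + p, p))
            i = j
    return runs


if __name__ == "__main__":
    for start, end, period in maximal_repetitions("mississippi"):
        print(start, end, period, "mississippi"[start:end])
```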
Compressed suffix trees with full functionality
 Theory of Computing Systems
Cited by 86 (6 self)
We introduce new data structures for compressed suffix trees whose size is linear in the text size. The size is measured in bits; thus they occupy only $O(n \log |A|)$ bits for a text of length $n$ on an alphabet $A$. This is a remarkable improvement on current suffix trees, which require $O(n \log n)$ bits. Though some components of suffix trees have been compressed, there is no linear-size data structure for suffix trees with full functionality such as computing suffix links, string-depths and lowest common ancestors. The data structure proposed in this paper is the first one that has linear size and supports all operations efficiently. Any algorithm running on a suffix tree can also be executed on our compressed suffix trees with a slight slowdown of a factor of $\mathrm{polylog}(n)$.
A survey on web clustering engines
, 2009
Cited by 82 (7 self)
Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. In this survey, we discuss the issues that must be addressed in the development of a Web clustering engine, including acquisition and preprocessing of search results, their clustering and visualization. Search results clustering, the core of the system, has specific requirements that cannot be addressed by classical clustering algorithms. We emphasize the role played by the quality of the cluster labels as opposed to optimizing only the clustering structure. We highlight the main characteristics of a number of existing Web clustering engines and also discuss how to evaluate their retrieval performance. Some directions for future research are finally presented.
Dictionary matching and indexing with errors and don't cares
 In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing. ACM
, 2004
Do Code Clones Matter?
Cited by 79 (14 self)
Code cloning is not only assumed to inflate maintenance costs but also considered defect-prone, as inconsistent changes to code duplicates can lead to unexpected behavior. Consequently, the identification of duplicated code, clone detection, has been a very active area of research in recent years. Up to now, however, no substantial investigation of the consequences of code cloning on program correctness has been carried out. To remedy this shortcoming, this paper presents the results of a large-scale case study that was undertaken to find out if inconsistent changes to cloned code can indicate faults. For the analyzed commercial and open source systems we not only found that inconsistent changes to clones are very frequent but also identified a significant number of faults induced by such changes. The clone detection
Clone Detection Using Abstract Syntax Suffix Trees
, 2006
Cited by 67 (6 self)
Reusing software through copying and pasting is a continuous plague in software development despite the fact that it creates serious maintenance problems. Various techniques have been proposed to find duplicated redundant code (also known as software clones). A recent study has compared these techniques and shown that token-based clone detection based on suffix trees is extremely fast but yields clone candidates that are often not syntactic units [26]. Current techniques based on abstract syntax trees, on the other hand, find syntactic clones but are considerably less efficient. This paper describes how we can make use of suffix trees to find clones in abstract syntax trees. This new approach is able to find syntactic clones in linear time and space. The paper reports the results of several large case studies in which we empirically compare the new technique to other techniques using the Bellon benchmark for clone detectors.
Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance
 SIAM Journal on Computing
, 1997
Cited by 65 (5 self)
As an aid in software maintenance, it would be useful to be able to track down duplication in large software systems efficiently. Duplication in code is often in the form of sections of code that are the same except for a systematic change of parameters such as identifiers and constants. To model such parameterized duplication in code, this paper introduces the notions of parameterized strings and parameterized matches of parameterized strings. A data structure called a parameterized suffix tree is defined to aid in searching for parameterized matches. For fixed alphabets, algorithms are given to construct a parameterized suffix tree in linear time and to find all maximal parameterized matches over a threshold length in a parameterized p-string in time linear in the size of the input plus the number of matches reported. The algorithms have been implemented
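One way to read "the same except for a systematic change of parameters" is that two strings match when a one-to-one renaming of parameter symbols turns one into the other. The sketch below checks that condition directly for a pair of strings. It is a minimal illustration under that assumed reading and does not use the parameterized suffix tree; the function name `p_match` and the `params` argument are hypothetical.

```python
def p_match(s, t, params):
    """True if s becomes t under a one-to-one renaming of parameter symbols.

    `params` is the set of symbols treated as parameters (for example
    identifiers and constants); all other symbols must match exactly.
    """
    if len(s) != len(t):
        return False
    forward, backward = {}, {}
    for a, b in zip(s, t):
        if a in params and b in params:
            # The renaming must be consistent in both directions.
            if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
                return False
        elif a != b:
            return False
    return True


if __name__ == "__main__":
    params = set("xyab")
    print(p_match("x=y+x;", "a=b+a;", params))  # consistent renaming
    print(p_match("x=y+x;", "a=b+b;", params))  # inconsistent renaming
```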
Succinct suffix arrays based on run-length encoding
 Nordic Journal of Computing
, 2005
Cited by 60 (32 self)
A succinct full-text self-index is a data structure built on a text $T = t_1 t_2 \ldots t_n$, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern $P = p_1 p_2 \ldots p_m$ in $T$, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. Many of those take space proportional to $nH_0$ or $nH_k$ bits, where $H_k$ is the $k$th order empirical entropy of $T$. The time to count how many times $P$ occurs in $T$ ranges from $O(m)$ to $O(m \log n)$. In this paper we present a new self-index, called RLFM index for "run-length FM-index", that counts the occurrences of $P$ in $T$ in $O(m)$ time when the alphabet size is $\sigma = O(\mathrm{polylog}(n))$. The RLFM index requires $nH_k \log \sigma + O(n)$ bits of space, for any $k \le \alpha \log_\sigma n$ and constant $0 < \alpha < 1$. Previous indexes that achieve $O(m)$ counting time either require more than $nH_0$ bits of space or require that $\sigma = O(1)$. We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of $\sigma$. In addition, we prove a close relationship between the $k$th order entropy of the text and some regularities that show up in their suffix arrays and in the Burrows-Wheeler transform of $T$. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different space-time tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those.
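The regularities mentioned above can be seen on a toy input: the Burrows-Wheeler transform of a repetitive text tends to contain runs of equal symbols, which run-length encoding shortens. The sketch below builds the transform naively by sorting all rotations and then groups it into runs. It is an illustration only and not the RLFM index; the `$` end marker and the function names are assumptions of this example.

```python
from itertools import groupby


def bwt(text, end_marker="$"):
    """Naive Burrows-Wheeler transform via sorted rotations.

    Assumes end_marker does not occur in text and sorts before every
    other symbol in it.
    """
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse each run of equal symbols into a (symbol, length) pair."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(run_length_encode(transformed))
```

Sorting full rotations takes far more time and space than the indexes discussed in the abstract, so this is suitable only for small examples.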
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices
Cited by 59 (4 self)
Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has been open for a long time whether these indices can be constructed in both $o(n \log n)$ time and $o(n \log n)$-bit working space, where $n$ denotes the length of the text. In the literature, the fastest algorithm runs in $O(n)$ time, while it requires $O(n \log n)$-bit working space. On the other hand, the most space-efficient algorithm requires $O(n)$-bit working space while it runs in $O(n \log n)$ time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes $O(n)$ time and $O(n)$-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires $O(n \log^\epsilon n)$ time and $O(n)$-bit working space for any $0 < \epsilon < 1$. Apart from that, our algorithm can also be adopted to build other existing full-text indices, such as