Tight Bounds for Distributed Functional Monitoring
Abstract

Cited by 20 (10 self)
We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008), and receiving recent attention. In this model there are k sites, each tracking its input stream and communicating with a central coordinator. The coordinator’s task is to continuously maintain an approximate output to a function computed over the union of the k streams. The goal is to minimize the number of bits communicated. Let the p-th frequency moment be defined as F_p = Σ_i f_i^p, where f_i is the frequency of item i in the union of the streams.
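To make the target quantity concrete, here is a minimal sketch of the exact, centralized computation of F_p over the union of the k sites' streams. This is illustrative only: the monitoring protocols above *approximate* F_p with low communication, whereas this function simply defines what the coordinator is tracking; the function name and the toy streams are mine.

```python
from collections import Counter

def frequency_moment(streams, p):
    """Exact p-th frequency moment F_p = sum_i f_i^p over the union of streams.

    Centralized baseline only: the distributed protocols approximate this
    value while minimizing communication between the k sites."""
    freq = Counter()
    for stream in streams:          # union of the k sites' streams
        freq.update(stream)
    return sum(f ** p for f in freq.values())

# k = 2 sites; item 'a' appears 3 times in total, 'b' twice, 'c' once
streams = [['a', 'b', 'a'], ['b', 'a', 'c']]
print(frequency_moment(streams, 2))  # F_2 = 3^2 + 2^2 + 1^2 = 14
```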
INFORMATION COST TRADEOFFS FOR AUGMENTED INDEX AND STREAMING LANGUAGE RECOGNITION
Abstract

Cited by 18 (3 self)
This paper makes three main contributions to the theory of communication complexity and stream computation. First, we present new bounds on the information complexity of AUGMENTED-INDEX. In contrast to analogous results for INDEX by Jain, Radhakrishnan and Sen [J. ACM, 2009], we have to overcome the significant technical challenge that protocols for AUGMENTED-INDEX may violate the “rectangle property” due to the inherent input sharing. Second, we use these bounds to resolve an open problem of Magniez, Mathieu and Nayak [STOC, 2010] on the multi-pass complexity of recognizing Dyck languages. This results in a natural separation between the standard multi-pass model and the multi-pass model that permits reverse passes. Third, we present the first passive memory checkers that verify the interaction transcripts of priority queues, stacks, and double-ended queues. We obtain tight upper and lower bounds for these problems, thereby addressing an important subclass of the memory checking framework of Blum et al. [Algorithmica, 1994].
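To illustrate what a passive checker for a stack transcript has to decide, here is a minimal linear-memory replay check. This is only a baseline showing the problem statement, not the sublinear-space checkers the paper constructs; the function name and transcript format are my assumptions.

```python
def check_stack_transcript(transcript):
    """Replay a transcript of ('push', v) / ('pop', v) operations and verify
    that every pop returns the value a correct stack would return.

    Linear-memory baseline; the paper's passive checkers achieve sublinear
    space, which this sketch does not attempt."""
    stack = []
    for op, value in transcript:
        if op == 'push':
            stack.append(value)
        elif op == 'pop':
            if not stack or stack.pop() != value:
                return False        # transcript is inconsistent with a stack
        else:
            raise ValueError(f"unknown operation {op!r}")
    return True

good = [('push', 1), ('push', 2), ('pop', 2), ('pop', 1)]
bad  = [('push', 1), ('push', 2), ('pop', 1)]   # 1 cannot come out before 2
print(check_stack_transcript(good), check_stack_transcript(bad))  # True False
```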
Validating XML documents in the streaming model with external memory
 In ICDT
, 2012
Abstract

Cited by 8 (2 self)
We study the problem of validating XML documents of size N against general DTDs in the context of streaming algorithms. The starting point of this work is a well-known space lower bound: there are XML documents and DTDs for which p-pass streaming algorithms require Ω(N/p) space. We show that when allowing access to external memory, there is a deterministic streaming algorithm that solves this problem with memory space O(log² N), a constant number of auxiliary read/write streams, and O(log N) total passes over the XML document and auxiliary streams. An important intermediate step of this algorithm is the computation of the First-Child Next-Sibling (FCNS) encoding of the initial XML document in a streaming fashion. We study this problem independently, and we also provide memory-efficient streaming algorithms for decoding an XML document given in its FCNS encoding. Furthermore, validating XML documents encoding binary trees in the usual streaming model without external memory can be done with sublinear memory: there is a one-pass algorithm using O(√(N log N)) space, and a bidirectional two-pass algorithm using O(log² N) space performing this task.
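The FCNS encoding mentioned above maps an ordered tree to a binary tree in which the left child is the node's first child and the right child is its next sibling. A minimal in-memory sketch of this mapping (the paper computes it in a streaming fashion over the XML tag sequence; the `Node` class and tuple representation are my assumptions):

```python
class Node:
    """An ordered tree node: a label plus an ordered list of children."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def fcns(node):
    """First-Child Next-Sibling encoding: left = first child, right = next
    sibling. Returns nested tuples (label, left, right)."""
    def encode(nodes):
        if not nodes:
            return None
        first, rest = nodes[0], nodes[1:]
        return (first.label, encode(first.children), encode(rest))
    return encode([node])

# The document <a><b/><c><d/></c></a> as an ordered tree:
tree = Node('a', [Node('b'), Node('c', [Node('d')])])
print(fcns(tree))  # ('a', ('b', None, ('c', ('d', None, None), None)), None)
```

In the encoded binary tree, b's right pointer leads to its sibling c, so the original sibling order is recoverable by walking right-spines.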
Grammar-Based Compression in a Streaming Model
 LATA 2010. LNCS
, 2010
Abstract

Cited by 4 (2 self)
We show that, given a string s of length n, with constant memory and logarithmic passes over a constant number of streams we can build a context-free grammar that generates s and only s and whose size is within an O(min{g log g, √(n / log n)})-factor of the minimum g. This stands in contrast to our previous result that, with polylogarithmic memory and polylogarithmic passes over a single stream, we cannot build such a grammar whose size is within any polynomial of g.
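A grammar that "generates s and only s" is a straight-line program. The sketch below builds one with a simple in-memory Re-Pair-style loop (repeatedly replacing the most frequent adjacent pair with a fresh non-terminal); this is not the paper's streaming construction and carries no approximation guarantee, it only shows the kind of object being built. Names are mine.

```python
from collections import Counter

def repair_grammar(s):
    """Build a grammar generating exactly s by repeatedly replacing the most
    frequent adjacent pair with a fresh non-terminal (Re-Pair style)."""
    rules = {}                       # non-terminal -> (left symbol, right symbol)
    seq = list(s)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                # no pair repeats: grammar is finished
            break
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):          # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules                # start sequence plus binary rules

def expand(seq, rules):
    """Expand the grammar back into the single string it generates."""
    out = []
    for sym in seq:
        if sym in rules:
            out.extend(expand(rules[sym], rules))
        else:
            out.append(sym)
    return ''.join(out)

start, rules = repair_grammar('abababab')
print(expand(start, rules) == 'abababab')  # True: the grammar generates exactly s
```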
On Repairing Structural Problems In Semi-structured Data
Abstract

Cited by 4 (3 self)
Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting where open and close tags are unmatched. Since some semi-structured data (e.g., LaTeX) have a flexible grammar, and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via the edit distance. To solve this problem, we propose a dynamic programming algorithm which takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program can be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, that trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.
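The cubic-time dynamic program can be sketched on the simplest special case: repairing a string of one parenthesis type into a well-balanced string with insertions, deletions and substitutions. This is the classic O(n³) interval DP, shown here only to convey the recurrence shape; the paper's algorithm handles full tag alphabets and the MinCost/MaxBenefit pruning, which this sketch omits.

```python
from functools import lru_cache

def repair_cost(s):
    """Minimum insertions/deletions/substitutions to make a parenthesis
    string well-balanced: an O(n^3) interval dynamic program over a single
    bracket type."""
    @lru_cache(maxsize=None)
    def d(i, j):                      # repair cost for the substring s[i..j]
        if i > j:
            return 0                  # empty substring is already balanced
        if i == j:
            return 1                  # lone symbol: insert a partner (or delete it)
        # pair s[i] with s[j]: 0, 1 or 2 substitutions to turn them into "()"
        pair = (s[i] != '(') + (s[j] != ')')
        best = d(i + 1, j - 1) + pair
        best = min(best, d(i + 1, j) + 1, d(i, j - 1) + 1)   # delete an endpoint
        for k in range(i, j):         # split into two independently balanced parts
            best = min(best, d(i, k) + d(k + 1, j))
        return best
    return d(0, len(s) - 1)

print(repair_cost('()'), repair_cost('(()'), repair_cost(')('))  # 0 1 2
```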
Streaming Complexity of Checking Priority Queues
Abstract

Cited by 2 (1 self)
This work is in the line of designing efficient checkers for testing the reliability of massive data structures. Given sequential access to the insert/extract operations on such a structure, one would like to decide, a posteriori only, whether it corresponds to the evolution of a reliable structure. In a context of massive data, one would like to minimize both the amount of reliable memory of the checker and the number of passes over the sequence of operations. Chu, Kannan and McGregor [9] initiated the study of checking priority queues in this setting. They showed that the use of timestamps allows a priority queue to be checked with a single pass and memory space Õ(√N). Later, Chakrabarti, Cormode, Kondapally and McGregor [7] removed the use of timestamps, and proved that more passes do not help. We show that, even in the presence of timestamps, more passes do not help, solving an open problem of [9, 7]. On the other hand, we show that a second pass, but in reverse direction, shrinks the memory space to Õ(log² N), extending a phenomenon first observed by Magniez, Mathieu and Nayak [15] for checking well-parenthesized expressions.
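For concreteness, here is what a priority-queue transcript and its validity mean, as a linear-memory replay against a correct min-heap. This baseline stores the whole structure, exactly what the streaming checkers above avoid; the function name and transcript format are my assumptions.

```python
import heapq

def check_pq_transcript(transcript):
    """Verify a transcript of ('ins', v) / ('ext', v) operations against a
    correct min-priority queue by direct replay.

    Linear-memory baseline for contrast: the results above concern doing
    this check in a single pass with sublinear reliable memory."""
    heap = []
    for op, value in transcript:
        if op == 'ins':
            heapq.heappush(heap, value)
        elif op == 'ext':
            if not heap or heapq.heappop(heap) != value:
                return False    # extraction disagrees with the true minimum
        else:
            raise ValueError(f"unknown operation {op!r}")
    return True

good = [('ins', 3), ('ins', 1), ('ext', 1), ('ins', 2), ('ext', 2), ('ext', 3)]
bad  = [('ins', 3), ('ins', 1), ('ext', 3)]   # 3 extracted while 1 is present
print(check_pq_transcript(good), check_pq_transcript(bad))  # True False
```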
The Dyck language edit distance problem in near-linear time. FOCS
, 2014
Abstract

Cited by 1 (1 self)
Given a string σ over alphabet Σ and a grammar G defined over the same alphabet, what is the minimum number of repairs (insertions, deletions and substitutions) required to map σ into a valid member of G? The seminal work of Aho and Peterson in 1972 initiated the study of this language edit distance problem, providing a dynamic programming algorithm for context-free languages that runs in O(|G|² n³) time, where n is the string length and |G| is the grammar size. While later improvements reduced the running time to O(|G| n³), the cubic running time in the input length remained a major bottleneck for applying these algorithms to their multitude of applications. In this paper, we study the language edit distance problem for a fundamental context-free language, DYCK(s), representing the language of well-balanced parentheses of s different types, which has been pivotal in the development of formal language theory. We provide the very first near-linear time algorithm to tightly approximate the DYCK(s) language edit distance problem for any arbitrary s. DYCK(s) language edit distance significantly generalizes the well-studied string edit distance problem, and appears in most applications of language edit distance, ranging from data quality in databases and generating automated error-correcting parsers in compiler optimization to structure prediction problems in biological sequences. Its non-deterministic counterpart is known as the hardest context-free language. Our main result is an algorithm for edit distance computation to DYCK(s) for any positive integer s that runs in O(n^{1+ε} polylog(n)) time and achieves an approximation factor of O((1/ε) β(n) log OPT), for any ε > 0. Here OPT is the optimal edit distance to DYCK(s) and β(n) is the best approximation factor known for the simpler problem of string edit distance running in analogous time. If we allow O(n^{1+ε} + OPT² n^ε) time, then the approximation factor can be reduced to O((1/ε) log OPT).
Since the best known near-linear time algorithm for the string edit distance problem has β(n) = polylog(n), under the near-linear time computation model both the DYCK(s) language and string edit distance problems have polylog(n) approximation factors. This comes as a surprise, since the former is a significant generalization of the latter and their exact computations via dynamic programming show a stark difference in time complexity. Rather less surprisingly, we show that the framework for efficiently approximating edit distance to DYCK(s) can be utilized for many other languages. We illustrate this by considering various memory checking languages (studied extensively under distributed verification) such as STACK, QUEUE, PQ and DEQUE, which consist of valid transcripts of stacks, queues, priority queues and double-ended queues respectively. Therefore, any language that can be recognized by these data structures can also be repaired efficiently by our algorithm.
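For contrast with the edit-distance problem above: exact *membership* in DYCK(s) is easy with a single stack pass, and the hard question is how far a string is from the language. A minimal recognizer, here instantiated with s = 3 bracket types (the bracket alphabet is my choice):

```python
def in_dyck(s, pairs={'(': ')', '[': ']', '{': '}'}):
    """Exact membership test for DYCK(s) via a stack: push the expected
    closer for each opener, and check each closer against the stack top."""
    stack = []
    for ch in s:
        if ch in pairs:
            stack.append(pairs[ch])          # remember the matching closer
        elif not stack or stack.pop() != ch:
            return False                     # mismatched or unmatched closer
    return not stack                         # no unmatched openers remain

print(in_dyck('([]{})'), in_dyck('([)]'))  # True False
```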
Computations on Massive Data Sets: Streaming Algorithms and Two-Party Communication
, 2013
Abstract
The treatment of massive data sets is a major challenge in computer science today. In this PhD thesis, we consider two computational models that address problems arising when processing massive data sets. The first model is the Data Streaming Model. When processing massive data sets, random access to the input data is very costly. Therefore, streaming algorithms have only restricted access to the input data: they sequentially scan the input data once or only a few times. In addition, streaming algorithms use a random access memory of sublinear size in the length of the input. Sequential input access and sublinear memory are drastic limitations when designing algorithms. The major goal of this PhD thesis is to explore the limitations and the strengths of the streaming model. The second model is the Communication Model. When data is processed by multiple computational units at different locations, connected through a slow interconnection network such as the Internet, the message exchange among the participating parties for synchronizing their calculations is often a bottleneck.
Research Statement
, 2008
Abstract
My research lies in the field of theoretical computer science, with a primary focus on computational complexity theory and a secondary focus on (approximation) algorithms. These two areas may seem unconnected on the surface, but are in fact two sides of the same coin: one of the chief goals in my complexity research is to establish limits on our ability to solve certain problems with computers, whereas in my work on approximation algorithms I attempt to work around the proven (or seeming) intractability of computational problems that need to be solved for various applications. Both areas place great emphasis on precise mathematical modelling of computational problems and rigorous proofs (rather than experimental evidence) to ensure that the research results remain valid in spite of future advances in computer hardware and software. Finally, both areas draw upon, and contribute to, a common toolkit of ideas and basic techniques, leading to plenty of opportunities for cross-fertilisation. Below, I provide some basic background for both these focus areas. I then identify key themes in my research so far in Section 2, and move on to outlining my most important specific results (rather than exhaustively listing all of them), loosely grouped by topic, in Sections 3 and 4. Finally, in Section 5, I discuss some research directions and specific challenges that I would like to tackle in my future work. Copies of my papers can be found at