by Miklos Ajtai, Randal Burns, Ronald Fagin, Darrell D. E. Long, Larry Stockmeyer
Journal of the ACM
http://www.almaden.ibm.com/cs/people/stock/diff6.ps
Abstract:
The subject of this paper is differential compression, the algorithmic task of finding common strings between versions of data and using the commonality between versions to encode one version compactly by describing it as a set of changes from its companion. A main goal of the work is to present new differencing algorithms that operate at a fine granularity (the atomic unit of change), make no assumptions about the format or alignment of input data, and use linear time and give good compression on version data that typically arises in practice. First we review existing differencing algorithms that compress optimally but use time proportional to n^2 in the worst case and space proportional to n in the best and worst case. Then we present new algorithms, which do not always compress optimally but use considerably less time and space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains ⌈log n⌉ bits, and the space to hold the input data is not included). We introduce two new techniques for differential compression, and we apply these to give additional algorithms that improve the compression performance of the linear-time algorithm and the time performance of the quadratic-time algorithm. Having presented these algorithms, we experimentally explore their properties, such as time and compression performance, by running them on actual versioned data. In these experiments, the new algorithms run in linear time and constant space, and their compression performance closely approximates that of previous algorithms that use considerably more time and space. Finally, we present theoretical results that limit the compression power of differencing algorithms that are restricted to making only a single pass over the data.
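To make the idea of encoding one version "as a set of changes from its companion" concrete, the following is a minimal sketch of a delta made of copy and add commands. It is not one of the paper's algorithms: it uses a plain dictionary of fixed-length substrings of the reference, so it needs space proportional to n rather than the constant space described in the abstract, and it makes no optimality claim. The names `encode`, `decode`, and `SEED` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple, Union

# A delta is a list of commands (an illustrative format, not the paper's):
#   ("copy", offset, length)  copy bytes from the reference version
#   ("add", data)             insert literal bytes not found in the reference
Command = Union[Tuple[str, int, int], Tuple[str, bytes]]

SEED = 4  # hypothetical minimum match length, picked arbitrarily


def encode(reference: bytes, version: bytes, seed: int = SEED) -> List[Command]:
    """Greedily describe `version` as copies from `reference` plus literals."""
    # Index the first occurrence of every seed-length substring of the reference.
    index: Dict[bytes, int] = {}
    for i in range(len(reference) - seed + 1):
        index.setdefault(reference[i:i + seed], i)

    delta: List[Command] = []
    literal = bytearray()
    pos = 0
    while pos < len(version):
        start = index.get(version[pos:pos + seed])
        if start is None:
            # No match begins here: this byte becomes part of an add command.
            literal.append(version[pos])
            pos += 1
            continue
        # Extend the match forward for as long as both strings agree.
        length = seed
        while (start + length < len(reference)
               and pos + length < len(version)
               and reference[start + length] == version[pos + length]):
            length += 1
        if literal:
            delta.append(("add", bytes(literal)))
            literal.clear()
        delta.append(("copy", start, length))
        pos += length
    if literal:
        delta.append(("add", bytes(literal)))
    return delta


def decode(reference: bytes, delta: List[Command]) -> bytes:
    """Rebuild the version from the reference and the delta."""
    out = bytearray()
    for cmd in delta:
        if cmd[0] == "copy":
            _, offset, length = cmd
            out += reference[offset:offset + length]
        else:
            out += cmd[1]
    return bytes(out)
```

A round trip such as `decode(ref, encode(ref, new)) == new` holds for any pair of byte strings. The delta is small when the two versions share long substrings, because each shared run collapses into a single copy command.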
Citations
784 | Information Theory and Reliable Communication – Gallager - 1968
773 | A universal algorithm for sequential data compression – Ziv, Lempel - 1977
582 | Algorithms on Strings, Trees, and Sequences – Gusfield - 1997
500 | The Art of Computer Programming, Volume 3, Sorting and Searching – Knuth - 1975
494 | Compression of individual sequences via variable rate coding. IEEE Transactions on Information Theory, IT-24(5):530-536 – Ziv, Lempel - 1978
415 | The string-to-string correction problem – Wagner, Fischer - 1974
340 | RCS - A System for Version Control – Tichy - 1985
301 | Linear pattern matching algorithms – Weiner
271 | Probabilistic computations: towards a unified measure of complexity – Yao - 1977
239 | The source code control system – Rochkind - 1975
200 | Potential benefits of delta encoding and data compression for HTTP – Mogul, Douglis, et al. - 1997
193 | Efficient randomized pattern-matching algorithms – Karp, Rabin - 1981
100 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching – Grossi, Vitter
84 | Optimistic deltas for WWW latency reduction – Banga, Douglis, et al. - 1997
79 | Meaningful Change Detection in Structured Data – Chawathe, Garcia-Molina - 1997
72 | Reducing the space requirement of suffix trees – Kurtz - 1999
65 | The string-to-string correction problem with block moves – Tichy - 1984
50 | File system support for delta compression – MacDonald - 2000
46 | A file comparison program – Miller, Myers - 1985
39 | Delta algorithms: An empirical analysis – Hunt, Vo, et al. - 1998
30 | Determinism versus non-determinism for linear time RAMs – Ajtai - 1999
29 | Cache-based compaction: A new technique for optimizing web transfer – Chan, Woo - 1999
28 | Inequalities – Hardy, Littlewood, et al. - 1934
24 | An Editor for Revision Control – Fraser, Myers - 1987
18 | Delta storage for arbitrary non-text files – Reichenberger - 1991
7 | Tutorial on MPEG-2 Video Compression – Tudor - 1995
6 | Efficient distributed backup and restore with delta compression – Burns, Long - 1997
5 | In-place reconstruction of delta compressed files – Burns, Long - 1998
5 | Combining of changes to a source file – Jong - 1972
3 | The VCDIFF generic differencing and compression format – Korn, Vo - 1999
3 | PGP Source Code and Internals – Zimmermann - 1995
1 | Efficient randomized pattern-matching algorithms – Karp, Rabin - 1987