Abstract. We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata, which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis that the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first, we apply the algorithm to construct a model of the English language and use this model to correct corrupted text. In the second, we construct a simple stochastic model for E. coli DNA.
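To make the two central notions of the abstract concrete, the following is a minimal sketch, not the learning algorithm of the paper. It assumes a variable memory length model can be stored as a table from contexts (suffixes of the history) to next-symbol distributions, and it computes the KL-divergence between two such distributions. The table `TOY_MODEL` and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy model: each context (a suffix of the history) maps to a
# next-symbol distribution. The empty string is the fallback context.
TOY_MODEL = {
    "": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.1, "b": 0.9},
    "ab": {"a": 0.8, "b": 0.2},
}


def longest_suffix_context(model, history):
    """Return the longest suffix of `history` that is a context in `model`."""
    for start in range(len(history) + 1):
        suffix = history[start:]
        if suffix in model:
            return suffix
    raise KeyError("model must contain the empty context")


def next_symbol_distribution(model, history):
    """Predict the next symbol using only as much memory as the model stores."""
    return model[longest_suffix_context(model, history)]


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same symbols."""
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)


if __name__ == "__main__":
    # The amount of history used varies with the context that matches.
    print(longest_suffix_context(TOY_MODEL, "bab"))  # context "ab"
    print(longest_suffix_context(TOY_MODEL, "bba"))  # context "a"
    print(kl_divergence(TOY_MODEL["ab"], TOY_MODEL["a"]))
```

The lookup illustrates why the memory length is "variable": some histories are resolved by a long suffix, others by a short one or by the empty context. The KL-divergence here compares single next-symbol distributions only; the guarantee stated in the abstract concerns the distributions over strings generated by the target and the hypothesis.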