@MISC{Pfleger_partialn-grams, author = {Karl Pfleger}, title = {Partial N-grams}, year = {} }

Abstract

uirement) partial n-grams can provide better performance. We performed experiments using book1 from the Calgary compression corpus [1], a Thomas Hardy novel, transformed to a 26-letter alphabet (using letters as the basic symbols, not words). We examined standard predict-the-next-symbol inference, though in general we are interested in arbitrary prediction patterns, such as predicting a middle symbol from context on both sides or simultaneously predicting multiple symbols [5]. (This generality necessitates representing estimates for the full joint distribution rather than the conditional distribution.) Accuracy and entropy were measured on a held-out test set consisting of the last 10,000 characters. Accuracy is the standard predictive accuracy common in machine learning: the proportion of times the correct symbol was predicted by the model. Entropy here is the standard measure of the entropy of the test data given the model, i.e., the cross-entropy. We expected partial
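The evaluation described above (next-symbol accuracy and per-symbol cross-entropy on a held-out set) can be sketched as follows. This is a minimal illustration using an ordinary smoothed n-gram model over a 26-letter alphabet, not the partial n-gram method of the paper; the Laplace smoothing constant and function names are assumptions for the sake of a runnable example.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def train_ngram(text, n=2):
    # counts[context][symbol]: how often `symbol` follows `context`
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - n + 1):
        ctx, sym = text[i:i + n - 1], text[i + n - 1]
        counts[ctx][sym] += 1
    return counts

def prob(counts, ctx, sym, alpha=1.0):
    # Laplace-smoothed conditional probability P(sym | ctx)
    c = counts.get(ctx, {})
    total = sum(c.values())
    return (c.get(sym, 0) + alpha) / (total + alpha * len(ALPHABET))

def evaluate(counts, test, n=2):
    correct, bits, m = 0, 0.0, 0
    for i in range(n - 1, len(test)):
        ctx, sym = test[i - n + 1:i], test[i]
        # accuracy: does the model's most probable symbol match the truth?
        pred = max(ALPHABET, key=lambda s: prob(counts, ctx, s))
        correct += (pred == sym)
        # cross-entropy of the test data under the model, in bits/symbol
        bits -= math.log2(prob(counts, ctx, sym))
        m += 1
    return correct / m, bits / m
```

Trained on a corpus and evaluated on held-out text, `evaluate` returns the two figures of merit used in the experiments: predictive accuracy and cross-entropy in bits per symbol.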