Results 1-10 of 1,579
We Must Choose The Simplest Physical Theory: Levin-Li-Vitányi Theorem And Its Potential Physical Applications
, 1998
"... If several physical theories are consistent with the same experimental data, which theory should we choose? Physicists often choose the simplest theory; this principle (explicitly formulated by Occam) is one of the basic principles of physical reasoning. However, until recently, this principle was ..."
Cited by 6 (4 self)
this principle, or, to be more precise, deduce it from the fundamentals of mathematical statistics as the choice corresponding to the least informative prior measure. Potential physical applications of this formalization (due to Li and Vitányi) are presented. In particular, we show that, on the qualitative
The Similarity Metric
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2003
Cited by 276 (34 self)
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.
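As a rough illustration of how a compressor can stand in for the noncomputable Kolmogorov complexity mentioned in this abstract, the sketch below computes a compression-based distance with Python's `zlib`. The formula used here, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), is the commonly cited normalized compression distance; it is an assumption that this matches the approximation used in the paper, and the helper names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for K(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based distance between two byte strings.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Smaller values suggest more
    shared structure; values near 1 suggest little in common.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

With a real compressor the result is only an approximation: it need not be exactly symmetric, and it can slightly exceed 1 for incompressible inputs.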
An efficient, probabilistically sound algorithm for segmentation and word discovery
 MACHINE LEARNING
, 1999
Cited by 197 (2 self)
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
Model Selection and the Principle of Minimum Description Length
 Journal of the American Statistical Association
, 1998
Cited by 195 (8 self)
This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical as well as the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate th...
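To make the idea of choosing among models by description length concrete, here is a minimal sketch under stated assumptions. It compares two hypothetical models of a binary sequence: a fair coin with no free parameters, and a coin whose bias is estimated from the data. The biased model is charged the data cost under the fitted bias plus a (1/2) log2 n parameter cost, which is a common crude two-part approximation and not necessarily the criterion developed in the paper. All function names are illustrative.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def description_lengths(bits: list) -> dict:
    """Code length in bits of the data under each candidate model."""
    n = len(bits)
    p_hat = sum(bits) / n
    # Fair coin: one bit per symbol, nothing to encode for the model.
    fair = float(n)
    # Estimated bias: data cost under p_hat plus an assumed
    # (1/2) * log2(n) cost for stating the single parameter.
    biased = n * binary_entropy(p_hat) + 0.5 * math.log2(n)
    return {"fair": fair, "biased": biased}


def select_model(bits: list) -> str:
    """Return the name of the model with the shortest total description."""
    lengths = description_lengths(bits)
    return min(lengths, key=lengths.get)
```

For a heavily skewed sequence the extra parameter pays for itself, while for a balanced sequence the parameter cost makes the simpler fair-coin model the shorter description.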
Atomic Snapshots of Shared Memory
, 1993
Cited by 184 (51 self)
This paper introduces a general formulation of atomic snapshot memory, a shared memory partitioned into words written (updated) by individual processes, or instantaneously read (scanned) in its entirety. This paper presents three wait-free implementations of atomic snapshot memory. The first implementation in this paper uses unbounded (integer) fields in these registers, and is particularly easy to understand. The second implementation uses bounded registers. Its correctness proof follows the ideas of the unbounded implementation. Both constructions implement a single-writer snapshot memory, in which each word may be updated by only one process, from single-writer, n-reader registers. The third algorithm implements a multi-writer snapshot memory from atomic n-writer, n-reader registers, again echoing key ideas from the earlier constructions. All operations require $\Theta(n^2)$ reads and writes to the component shared registers in the worst case. Categories and Subject Descriptors:...
Software documents: comparison and measurement
 in SEKE ’07: Proceedings of the 18th Int. Conf. on Software Engineering and Knowledge Engineering
, 2007
"... For some time now, researchers have been seeking to place software measurement on a more firmly grounded footing by establishing a theoretical basis for software comparison. Although there has been some work on trying to employ information theoretic concepts for the quantification of code ..."
Cited by 2 (0 self)
of code documents, particularly on employing entropy and entropy-like measurements, we propose that employing the Similarity Metric of Li, Vitányi, and coworkers for the comparison of software documents will lead to the establishment of a theoretically justifiable means of comparing and evaluating
Almost Everywhere High Nonuniform Complexity
, 1992
Cited by 176 (40 self)
We investigate the distribution of nonuniform complexities in uniform complexity classes. We prove that almost every problem decidable in exponential space has essentially maximum circuit-size and space-bounded Kolmogorov complexity almost everywhere. (The circuit-size lower bound actually exceeds, and thereby strengthens, the Shannon $2^n/n$ lower bound for almost every problem, with no computability constraint.) In exponential time complexity classes, we prove that the strongest relativizable lower bounds hold almost everywhere for almost all problems. Finally, we show that infinite pseudorandom sequences have high nonuniform complexity almost everywhere. The results are unified by a new, more powerful formulation of the underlying measure theory, based on uniform systems of density functions, and by the introduction of a new nonuniform complexity measure, the selective Kolmogorov complexity. This research was supported in part by NSF Grants CCR8809238 and CCR9157382 and in ...
A Survey of Modern Authorship Attribution Methods
 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
Cited by 155 (9 self)
Authorship attribution supported by statistical or computational methods has a long history starting from the 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has been developed substantially, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., email messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology, provided it is able to handle short and noisy text from multiple candidate authors. In this paper, a survey of recent advances in automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.
A basis theorem for $\Pi^0_1$ classes of positive measure and jump inversion for random reals
 Proceedings of the American Mathematical Society
, 2006
"... We extend the Shoenfield jump inversion theorem to the members of any $\Pi^0_1$ class $P \subseteq 2^\omega$ with nonzero measure; i.e., for every $\Sigma^0_2$ set $S \geq_T \emptyset'$, there is a $\Delta^0_2$ real $A \in P$ such that $A' \equiv_T S$. In particular, we get jump inversion for $\Delta^0_2$ 1-random reals. This paper is part of an ongoing program to stud ..."
Cited by 6 (1 self)
introduced by Martin-Löf [13] and represent the most widely studied randomness class. For the purposes of this introduction, we will assume that the reader is somewhat familiar with basic algorithmic randomness, as per Li-Vitányi [12], and with computability theory [18]. A review of notation and terminology
Toward a method of selecting among computational models of cognition
 Psychological Review
, 2002
Cited by 152 (16 self)
The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. This article introduces a method of selecting among mathematical models of cognition known as minimum description length, which provides an intuitive and theoretically well-grounded understanding of why one model should be chosen. A central but elusive concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in 3 areas of cognitive modeling: psychophysics, information integration, and categorization. How should one choose among competing theoretical explanations of data? This question is at the heart of the scientific enterprise, regardless of whether verbal models are being tested in an experimental setting or computational models are being evaluated in simulations. A number of criteria have been proposed to assist in this endeavor, summarized nicely by Jacobs and Grainger