A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam's razor. Information Processing Letters, 24:377--380, 1987.


This paper is cited in the following contexts:


LSAT - An Algorithm for the Synthesis of Two Level.. - Oliveira.. (1991)   (3 citations)  (Correct)

....of function f by just looking at the classified samples it was presented with. Performance is measured by checking the agreement between the output of functions g and f for a set of minterms drawn according to the same distribution. Theoretical results predict that, under reasonable assumptions [3], simpler descriptions for g should be expected to lead to better results when predicting the class of future samples. 2 Definitions Let f be an incompletely specified function of boolean variables, X1, ..., Xn, i.e. a mapping from 0, 1 0, 1, where the. is used to identify the ....

A. Blumer, A. Ehrenfeucht, D. Haussler & M. Warmuth "Occam's Razor", Information Processing Letters, vol 24, pp. 377-380, North-Holland, 1987.


Evaluation and Selection of Biases in Machine Learning - Gordon, Jardins (1995)   (22 citations)  (Correct)

....A procedural bias (also called algorithmic bias [26]) determines the order of traversal of the states in the space defined by a representational bias. Examples of procedural biases include the beam width in a beam search and a preference for simple or specific hypotheses. Occam's Razor [6] and the Minimum Description Length Principle [9, 33] provide formal motivations for why a preference for simple hypotheses works well theoretically. However, they leave the question of a practical implementation open, and appropriate representational biases and search heuristics that find ....

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1987). Occam's razor. Information Processing Letters, 24:377--380.


Pareto-Optimal Patterns in Logical Analysis of Data - Hammer, Kogan, Simeone, al. (2001)   (1 citation)  (Correct)

.... this is a popular point of view, it is not universally accepted (for a discussion, see [15, 16]). Another argument used by some authors in favor of simplicity states that simplicity leads to higher accuracy (see e.g. the computational learning theory model of Occam's razor, which is proposed in [3]). This point of view is again not universally accepted. Moreover, various theoretical and empirical arguments were made, stating that simplicity in itself may even lead to lower accuracy (see e.g. [15, 16, 24]). In particular, it was shown in [22] that a decrease in the simplicity of patterns ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. "Occam's razor", Information Processing Letters, 24 (1987), 377--380.


On Integrating Inductive Learning with Prior Knowledge and.. - Giraud-Carrier (1994)   (Correct)

....training examples, the system's ability to generalize is dependent upon the representativeness of its training set. Moreover, the number of sets or conjunctions of candidate critical features that must be searched is exponential in the number of inputs, and IL becomes computationally expensive [BLU87]. Finally, IL does not take advantage of an important source of knowledge, namely prior knowledge. Human learning is often not the sole result of exposure to random examples. Rather, built-in mechanisms (e.g. pain) and teachers (present in most social structures, such as the family or the ....

....set. Also, there is no guarantee that, given an arbitrary training set, the system will find enough good critical features to get a reasonable approximation of A. Moreover, the number of features to be searched is exponential in the number of inputs, and TSL becomes computationally expensive [BLU87]. Finally, the scarcity of interesting positive theoretical results suggests the difficulty of learning without sufficient a priori knowledge. This paper presents precept driven learning (PDL). PDL is intended to overcome some of TSL's weaknesses. In PDL, the training set is augmented by a small ....

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. (1987). Occam's Razor. Information Processing Letters, 24, 377-380.


Learning Coherent Concepts - Garg, Roth (2001)   (Correct)

....is assumed that during the training phase one is provided with m training examples drawn independently according to P and labeled according to some target concept $c_1$. The learning algorithm chooses a hypothesis $h \in H$ that is consistent with the target function on the training data. In this setting (Blumer et al. 1987) we know that for the true error of h (that is, $\Pr_P(h(x) \neq c_1(x))$) to be bounded by $\epsilon$ with probability at least $1-\delta$, the number of training examples required needs to satisfy $m > \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ (1). This analysis can be extended to the non admissible case (when the target ....

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24, 377--380.
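
The sample-size bound quoted in the Garg and Roth context can be evaluated numerically. The following is a minimal sketch, assuming the standard finite-hypothesis-class form m >= (1/epsilon)(ln|H| + ln(1/delta)) for a learner that outputs a consistent hypothesis. The function name `occam_sample_bound` and the example values are illustrative assumptions and do not come from any of the cited papers.

```python
import math


def occam_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Return the smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon.

    hypothesis_count is |H|, epsilon the error tolerance, delta the allowed
    failure probability. All three names are chosen here for illustration.
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


if __name__ == "__main__":
    # Hypothetical setting: 2**10 hypotheses, 10% error, 5% failure probability.
    m = occam_sample_bound(2**10, epsilon=0.1, delta=0.05)
    print(m)
    # Sanity check of the underlying inequality |H| * (1 - epsilon)**m <= delta.
    print(2**10 * (1 - 0.1) ** m <= 0.05)
```

Because the bound grows only with ln|H|, doubling the number of hypotheses adds a constant number of examples, which is the quantitative sense in which smaller hypothesis classes are favored.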


Learning Decision Lists - Rivest (1987)   (213 citations)  (Correct)

....or false to the attributes can be classified as either a positive or a negative instance of the concept to be learned. To make precise the notion of "learnable from examples", we adopt the definition of polynomial learnability pioneered by Valiant (1984) and studied by a number of authors (e.g. Blumer, Ehrenfeucht, Haussler, Warmuth 1986, 1987). In this model a concept to be learned is first selected arbitrarily from a prespecified class F of concepts. Then, a learning algorithm may request a number of positive and negative examples of the unknown concept. These examples are drawn at random according to fixed but unknown probability ....

....have developed some elegant methods for doing so. They show how one can prove polynomial learnability by proving polynomial time identifiability, if the class F is polynomial sized (i.e. if n = O(n t ) for some t). For the reader's convenience we repeat a key theorem and its proof from Blumer et al. (1987). Theorem 4. Given a function $f \in F_n$ and a sample S of f of size m drawn according to $P_n$, the probability is at most $|F_n|(1-\epsilon)^m$ that there exists a hypothesis $g \in F_n$ such that the error of g is greater than $\epsilon$, and g is consistent with S. Proof: If g is a single hypothesis in F ....

[Article contains additional citation context not shown here]

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1987) Occam's Razor. Information Processing Letters 24, 377-380.
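
The counting argument behind Theorem 4, as quoted in the Rivest context, can be written out in two steps. This is a sketch of the standard union-bound reasoning, with $\epsilon$ denoting the error threshold and $\delta$ a confidence parameter introduced here for illustration. It is not a transcription of the proof in either paper.

```latex
% Step 1: a single hypothesis g with error greater than \epsilon agrees with
% one random example with probability at most 1 - \epsilon, so for m
% independent examples
\Pr[\, g \text{ consistent with } S \,] \le (1-\epsilon)^m .

% Step 2: at most |F_n| hypotheses can have error greater than \epsilon,
% so the union bound gives
\Pr[\, \exists\, g \in F_n :\ \mathrm{error}(g) > \epsilon,\ g \text{ consistent with } S \,]
  \le |F_n| (1-\epsilon)^m \le |F_n|\, e^{-\epsilon m} .

% Consequence: the right-hand side is at most \delta whenever
m \ge \frac{1}{\epsilon}\left( \ln |F_n| + \ln \frac{1}{\delta} \right) .
```

The last line is why a class whose size is at most exponential in a polynomial of n needs only polynomially many examples.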


Learning Qualitative Models of Dynamic Systems - Hau, Coiera (1997)   (7 citations)  (Correct)

....least $1-\delta$, when given a set of examples of size $m = s(n, 1/\epsilon, 1/\delta)$ drawn according to $P_n$, output a $c' \in C_n$ such that $\mathrm{error}(c') \le \epsilon$. Further, A's running time is polynomially bounded in n and m. 3.2.1. Proving PAC Learnability. One approach to PAC learning due to Blumer et al. (1987) is as follows: draw a large enough set of examples according to $P_n$, and find an algorithm which, given the examples, outputs any concept $c \in C_n$ consistent with all the examples in polynomial time. If there exists such an algorithm for the concept class C, C is said to be polynomial-time ....

Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M.K. 1987. Occam's razor. Information Processing Letters, 24(6):377--380.


On the Learnability of the Uncomputable - Lathrop (1996)   (Correct)

....(Angluin and Smith 1983; Kearns 1990; Valiant 1984; Pitt 1990) for formal machine learning, among others. The formal approach has been widely explored by the machine learning community (Pitt 1990) and a great deal is known about necessary conditions, limitations, and bounds (Blumer et al. 1989; Blumer et al. 1987; Ehrenfeucht et al. 1989; Kearns and Valiant 1994; Linial et al. 1991; Pitt and Valiant 1988; Shvaytser 1990). Amsterdam (Amsterdam 1988) discusses limitations of the framework. The problem of language identification in the limit differs from this work in finding an exact rather than ....

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. (1987). Occam's razor. Information Processing Letters 24(6):377--380.


Report for Publication of the Activity of the Working Group.. - Shawe-Taylor (1997)   (Correct)

....distribution. Although there exist negative results due to cryptographic assumptions, first experimental results dealing with random target DFAs and some restricted classes of probability distributions are encouraging. The reason for this may be based on the paradigm of Occam's Razor (see [6]) and the minimisation abilities of the BOOST approach. For further information please contact Andreas Birkendorf (birkendo Ls2.informatik.uni dortmund.de). 1.5 Industrial Applications of NeuroCOLT Results. The synergy created by a Europe-wide collaboration is not only seen in the theoretical ....

A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth (1987) Occam's Razor. Information Processing Letters, 24, pp. 377--380.


An implicit formulation for exact BDD minimization - Oliveira, Carloni, Villa.. (1996)   (Correct)

....is of paramount importance. This requires an exact algorithm to find those solutions or at least to validate the quality of heuristic algorithms. For instance, in inductive learning applications, the accuracy of the inferred hypotheses is strongly dependent on the complexity of the result [1]. One possible and very effective representation scheme for inferred hypotheses are BDDs. However, it was observed [13] that when BDDs are used as the representation scheme, existing heuristic algorithms for BDD minimization find solutions that are so far from the minimum that makes them of little ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam's razor. Inform. Proc. Lett., 24:377--380, April 1987.


Expected Error Analysis for Model Selection - Scheffer, Joachims (1999)   (8 citations)  (Correct)

....the chance of some hypothesis in H i possessing a large difference between true and empirical error grows steeply. Therefore, given two hypotheses with equal empirical error which come from distinct models, PAC theory gives better guarantees for the one which comes from the smaller model (e.g. Blumer et al. 1987). But just because there is a hypothesis with a large difference between true and empirical error does not mean that the expected error of the returned hypothesis is high. In fact, from the expected error analysis we can derive that when the prior distribution of error values in the model remains ....

....in the hypothesis language grows necessarily when the hypothesis language grows. In order to account for the possibility of worst case choices of hL from the empirical error minimizing hypotheses, PAC theory gives weaker guarantees for hypotheses which have been learned from larger models (Blumer et al. 1987). When we study the average error rate over all possible target functions (rather than the worst possible choice of hL for a given target function) it turns out that the generalization error of an empirical error minimizing hypothesis is completely independent of the size of the hypothesis space ....

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1987). Occam's razor. Information Processing Letters, 24, 377--380.


Computational Sample Complexity and Attribute-Efficient Learning - Servedio (2000)   (Correct)

.... of the well structured 1-decision list, and each useful example provides the output bit for one of the last q pairs (note that it is possible to identify useful examples as long as S contains at least one positive example). Since there are $2^n$ well structured 1-decision lists, Occam's Razor [6] immediately implies that $O(n/\epsilon)$ examples suffice for this polynomial time learning algorithm. Now we show the lower bound on $\mathrm{CSC}(C; n, \epsilon)$. The idea of the proof is as follows: we will exhibit a particular distribution on $\{0,1\}^n$ and show that any polynomial time learning algorithm for ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, Occam's razor, Inform. Process. Lett. 24 (1987), 377--380.


Online Prediction Algorithms for Databases and Operating Systems - Krishnan (1995)   (7 citations)  (Correct)

....evaluate our prefetching algorithm relative to the best online algorithm that has complete knowledge of the structure and transition probabilities of the Markov source. Prefetching is a learning problem that involves predicting the page requests of the user. Work in computational learning theory [BEHa, BEHb, BoP] has shown that prediction is synonymous with generalization and data compression. If a data compressor expects a certain character to be next with a very high probability, it will assign that character a relatively small code. In the end, if the net ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Occam's Razor," Information Processing Letters 24 (1987).


Optimal Prefetching via Data Compression - Vitter, Krishnan (1991)   (46 citations)  (Correct)

....and transition probabilities of the Markov source. Prefetching is a learning problem that involves predicting the page requests of the user. Our novel approach is to use optimal data compression methods to do optimal prefetching. Our motivation is recent work in computational learning theory [BEHa, BEHb, BoP], which has shown that prediction is synonymous with generalization and data compression. Our intuition in this paper is that in order to compress data well, one has to be able to predict future data well, and hence a good data compressor should also be a good predictor: If a data compressor ....

A. Blumer, A. Ehrenfeucht, D. Haussler & M. K. Warmuth, "Occam's Razor," Information Processing Letters 24 (1987).


Genetic Programming with Guaranteed Quality - Droste (1998)   (2 citations)  (Correct)

.... many experiments in GP show that trying to find small functions leads to better generalizing functions than ignoring function size (see e.g. Hooper and Flann (1996), Kinnear (1993), Rosca (1996), and Zhang and Muhlenbein (1995)). In this paper the well known Occam's razor theorem (see Blumer et al. (1987)) from the field of learning theory is used to give an explanation of the influence the size of a function can have on its generalization properties. Roughly speaking, it states that one should restrict the search space as much as possible while trying to find a good approximation of the unknown ....

....on many inputs. Analogously, if the system would output hypotheses of a prespecified class $H \subset F$ for many training examples, one could assume that the average approximation quality of the found functions in H is relatively high, if H contains relatively few functions. Occam's razor theorem (see Blumer et al. (1987)) contains an exact formulation of this common-sense argumentation (for sake of completeness, the short proof of the theorem is added). Theorem 1 (Occam's razor theorem). Let H be a subset of F and f an arbitrary element of F. The probability of independently choosing m examples $X_1, \ldots, X_m$ ....

[Article contains additional citation context not shown here]

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. 1987. Occam's Razor. Information Processing Letters. Number 24. Pages 377-380.


A Monotonic Measure for Optimal Feature Selection - Liu, Motoda, Dash (1998)   (1 citation)  (Correct)

....the classes. When a classification problem is defined by features, the number of features (N ) can be quite large. A classifier may encounter problems to learn something meaningful because the required amounts of data (N , or the number of patterns) increase exponentially in proportion with N [4]. The task of feature selection is to determine which features to select in order to achieve maximum performance with the minimum measurement effort [3] Reducing the number of features directly alleviates the measurement effort. Performance for a classifier can be its predictive ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Occam's razor. In J.W. Shavlik and T.G. Dietterich, editors, Readings in Machine Learning, pages 201--204. Morgan Kaufmann, 1990.


Discovering Neural Nets With Low Kolmogorov Complexity And High .. - Schmidhuber (1997)   (10 citations)  (Correct)

....is the following law. The nth number is $n^4 - 10n^3 + 35n^2 - 48n + 24$. But an IQ test requires you to answer 10 instead of 34. Why not 34? The reasons are: 1) Simple solutions are preferred over complex ones. This idea is often referred to as Occam's razor (e.g. Blumer et al. 1987). 2) It is assumed that the simpler the rules, the better the generalization on test data. 3) The makers of the IQ test assume that everybody agrees on what "simple" means. Similarly, many neural net and machine learning researchers agree that ....

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24:377--380.
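
The arithmetic in the Schmidhuber context can be checked directly. The sketch below assumes that the garbled expression is the quartic n^4 - 10n^3 + 35n^2 - 48n + 24 (with "Gamma" read as a minus sign and the missing operators as plus signs) and that the sequence in question begins 2, 4, 6, 8, which the excerpt does not show. The function names are illustrative.

```python
def quartic_law(n: int) -> int:
    """The complex law, as reconstructed from the garbled excerpt."""
    return n**4 - 10 * n**3 + 35 * n**2 - 48 * n + 24


def linear_law(n: int) -> int:
    """The simple law an IQ test presumably expects (an assumption here)."""
    return 2 * n


if __name__ == "__main__":
    # Both laws agree on the first four terms and differ on the fifth.
    for n in range(1, 6):
        print(n, linear_law(n), quartic_law(n))
```

Both rules fit the observed data equally well, so only a preference for the simpler description separates the answers 10 and 34.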


A Rigorous Investigation Of "Evidence" And "Occam Factors" In.. - Wolpert   (Correct)

....is incapable of deriving such results. A discussion of some of the other issues beyond the scope of conventional Bayesian analysis can be found in [Wolpert 1992a]. There exist many different scenarios of interest in supervised machine learning. For example, both the PAC framework [Valiant 1984, Blumer et al. 1987, Blumer et al. 1989] and the statistical mechanics framework concentrate on the analysis of the distribution P(E | f, m) (for some fixed but usually unknown target function f) when i) the sampling assumption is i.i.d. and ii) the error function is the i.i.d. error function. The difference ....

Blumer, A., et al. (1987). Occam's razor. Information Processing Letters, 24, 377-380.


Feature Selection in Unsupervised Learning via Evolutionary .. - Yongseog Kim Management (2000)   (11 citations)  (Correct)

....are therefore not directly applicable because they are biased by the dimensionality of the space, which is variable in feature selection problems. In our study we use four heuristic fitness criteria, described below. Two of the criteria are inspired by statistical metrics and two by Occam's razor [2]. Each objective is normalized into the unit interval and maximized by the EA. ####### # This objective is meant to favor dense clusters by measuring cluster cohesiveness. It is inspired by the total within cluster sum of squares (TWSS) measure. Formally, let # # ## =1# ### ##, be data points and # ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam's razor. Information Processing Letters, 24:377-380, 1987.


Hybrid Search of Feature Subsets - Dash, Liu (1998)   (2 citations)  (Correct)

....the underlying structure. Because redundant and irrelevant information is cached inside the totality of the features, a classifier that uses all features will perform worse than a classifier that uses relevant features that maximize interclass differences and minimize intraclass differences [2]. Feature selection is a task of searching for optimal subset of features from all available features [4] Its motivation is three fold: Simplifying the classifier; Improving the accuracy of the classifier; and Reducing data dimensionality for the classifier. The last point is particularly ....

A.L. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Occam's razor. In J.W. Shavlik and T.G. Dietterich, editors, Readings in Machine Learning, pages 201--204. Morgan Kaufmann, 1990.


Geometric Patterns: Algorithms and Applications - Scott (2000)   (Correct)

....It outputs a hypothesis that, with probability at least $1-\delta$, has error at most $\epsilon$ on examples randomly drawn according to the same distribution D. We now review a PAC algorithm (Goldberg & Goldman, 1994; Goldberg et al. 1996) for learning $C_{k,n}$. This algorithm is called an Occam algorithm (Blumer et al. 1987; 1989) because, in the spirit of Occam's Razor, its hypothesis is a shorter representation of the training sample. It draws a sample of size m = O # 1 # log 1 # k 5 2 log(kn) # log 5 2 # k log(kn) # ## and finds a hypothesis consistent with all examples. It builds the ....

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Inform. Proc. Lett., 24, 377--380.


Computational Learning Theory - Goldman   (Correct)

....of at most $\ln|S| + 1$ halfspaces. However, since the VC dimension of the hypothesis grows with the size of the sample, the basic technique described above cannot be applied. In general, when using a set covering approach, the size of the hypothesis often depends on the size of the sample. [Blumer, Ehrenfeucht, Haussler, and Warmuth, 1987 and 1989] extended this basic technique by showing that finding a hypothesis h consistent with a sample S for which the size of h is sublinear in $|S|$ is sufficient to guarantee PAC learnability. In other words, by obtaining sufficient data compression one obtains good generalization. More ....

....of learning the intersection of halfspaces the greedy covering technique provided substantial data compression. Namely, the size of our hypothesis only had a logarithmic dependence on the size of the sample. In general, only a sublinear dependence is required as given by the following result of [Blumer, Ehrenfeucht, Haussler, and Warmuth, 1987]. Let A be an Occam algorithm for concept class C that has hypothesis space $H^A_{k,n,m}$. If $\mathrm{vcd}(\mathrm{Hyp}^A_{k,n,m}) \le p(k,n)\,m^{\beta}$ (so $|h| \le p(k,n)\,m^{\beta}$) for some polynomial $p(k,n) \ge 2$ and $\beta < 1$, then A is a PAC learning algorithm for C using sample size m = max{ $\frac{2}{\epsilon}\ln\frac{1}{\delta}$, 2 ln ....

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1987. Occam's razor. Inform. Proc. Lett. 24, 377--380.


Efficient algorithms for the inference of minimum size DFAs - Oliveira, Silva   (Correct)

....utility as a specification of the desired controllers, because they will not be able to generate the control signals in situations that do not match exactly the waveforms present in the traces. Although we will not address in detail the arguments over the merit of the Occam's razor approach (Blumer et al. 1987), we argue that the selection of a minimum size DFA compatible with this specification is the method most likely to yield the desired result. In fact, a variety of results show that there is a strong correlation between the complexity of the generated hypothesis (Pearl, 1978) Blumer et al. ....

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1987). Occam's razor. Inform. Proc. Lett., 24:377--380.


Learning to Take Actions - Khardon (1998)   (10 citations)  (Correct)

....shows that if a learning algorithm finds a strategy that can be described concisely, and such that it suggests the same actions that have been observed in the example traces, then it is guaranteed to be successful. Thus the well known convergence results for Occam algorithms in the PAC model (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987), that are known to hold in deterministic worlds (Tadepalli & Natarajan, 1996), hold also for stochastic partially observable worlds. A large part of the paper is devoted to the study of rule based action strategies and their learnability. The rules in the representation are of the form $C \to A$, ....

....$1/\delta$) where n is the number of predicates measured in each example, and with probability at least $1-\delta$, A outputs a strategy s such that $Q(t, D) - Q(s, D) \le \epsilon$. 3 Learning Action Strategies. In this section we present a general learning result. Similar to results in the PAC model (Blumer et al. 1987) we show that an Occam algorithm that finds a concise action strategy which is consistent with all the examples seen, is a learning algorithm. This result is later used to prove the learnability of rule based systems. The main idea is that an action strategy that is very different from the ....

[Article contains additional citation context not shown here]

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24, 377--380.


R-MINI: An Iterative Approach for Generating Minimal Rules from.. - Hong (1997)   (2 citations)  (Correct)

....enforced bias. While generating a true minimum rule set is well known to be NP-hard, we show that the objective of near minimality can be achieved without a computational explosion. It is not the purpose of this paper to argue the merits of minimality in detail. The Occam's razor argument [7-9] strongly favors minimality of the representation for accuracy. Fayyad and Irani [10] provide an analysis for why the minimal number of leaves in a tree (equivalent to the number of rules) is perhaps the most important bias. Simpler rules are more understandable and more efficient to apply. The ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth, "Occam's Razor", Information Processing Letters, Vol. 24, North Holland, 1987, 377-380.


A Corpus-Based Approach to Language Learning - Brill (1993)   (86 citations)  (Correct)

....more on the issue of language and less on the issue of statistics. The only potential parameter is the threshold value above which a transformation must score for it to be learned. The performance of the system with respect to this one parameter can easily be observed by learning a set of transformations with the threshold set to zero. Then text can be annotated using this transformation list. For any threshold value, the effect of setting that threshold can be easily observed by measuring performance up to the point where the first .... (Footnote 2: This is described a bit more formally in [10].)

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. In Information Processing Letters, volume 24, 1987.


The Role of Occam's Razor in Knowledge Discovery - Domingos (1999)   (14 citations)  (Correct)

....over trees with many. By this result, a decision tree with one million nodes extracted from a set of ten such trees is preferable to one with ten nodes extracted from a set of a million, given the same training set error. Put another way, the results in Blumer et al. (1987) only say that if we select a sufficiently small set of models prior to looking at the data, and by good fortune one of those models closely agrees with the data, we can be confident that it will also do well on future data. The theoretical results give no guidance as to how to select that set of ....

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24, 377--380.


Expected Error Analysis for Model Selection - Scheffer, Joachims (1999)   (8 citations)  (Correct)

....the chance of some hypothesis in H i possessing a large difference between true and empirical error grows steeply. Therefore, given two hypotheses with equal empirical error which come from distinct models, PAC theory gives better guarantees for the one which comes from the smaller model (e.g. Blumer et al. 1987). But just because there is a hypothesis with a large difference between true and empirical error does not mean that the expected error of the returned hypothesis is high. In fact, from the expected error analysis we can derive that when the prior distribution of error values in the model remains ....

....in the hypothesis language grows necessarily when the hypothesis language grows. In order to account for the possibility of worst case choices of hL from the empirical error minimizing hypotheses, PAC theory gives weaker guarantees for hypotheses which have been learned from larger models (Blumer et al. 1987). These guarantees hold even if the learner is somehow able to always return the worst possible hypothesis from the set of empirical error minimizing hypotheses. Theorem 3, however, proves that an increase of the number of hypotheses in the model does not cause the expected learning curve to grow ....

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1987). Occam's razor. Information Processing Letters, 24, 377--380.


Optimal Asymptotic Identification Under Bounded Disturbances - Tse, Dahleh, Tsitsiklis   (16 citations)  (Correct)

....data seen so far. This avoids overfitting of data, a problem which crops up all the time in statistics and pattern recognition. It is interesting to note that this same principle of Occam's Razor has also been applied to guarantee convergence in distribution-free probabilistic learning problems [1, 25]. In contrast to the $\sigma$-compactness condition that guarantees convergence, a stronger compactness condition guarantees uniform convergence. Proposition 3.9. Suppose convergence in the $\rho$ topology on M implies component-wise convergence of the impulse response. If the model set M is compact in the ....

A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth, "Occam's Razor", Information Processing Letters 24, pp. 377-380, 1987.


PAC Learning with Simple Examples - Denis, D'Halluin, Gilleron (1996)   (9 citations)  (Correct)

....R if there is a PAC learning algorithm with simple examples A for F in R and A runs in time polynomial in $1/\epsilon$, $1/\delta$, and the length of r. Throughout the paper we will write PACS learning algorithm and PACS learnable. 3.2 An Occam's Razor Theorem. The Occam's Razor Theorem of Blumer et al. ([5]) for PAC learning is one of the most important results in PAC learning theory. In this Section we give a version of this theorem for PACS learning. We prove that if each concept f in a class F has a set of simple and representative examples denoted by $S_r$, where r is a name for f, then F is ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth, Occam's razor, Inform. Proc. Lett. 24 (1987) 377-380.


What do Constructive Learners Really Learn? - Thornton (1998)   (Correct)

....recent years, researchers have made rapid progress in the theoretical analysis of learning. Early work by Gold [13] and Valiant [14,15] established a tradition which grew to encompass theoretical constructs such as VC dimension [16] and led to the theoretical advances of Haussler and others, e.g. [17, 18, 19, 20, 21, 21, 22]. Much of this work is directed towards the goal of analyzing the complexity of learning but, at the time of writing, measuring the hardness of arbitrary learning problems (e.g. specific training sets) remains problematic [23]. However, it turns out that a useful, qualitative measure of problem ....

Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. (1987). Occam's razor. Information Processing Letters, 24 (pp. 377-380).


The Relationship between PAC, the Statistical Physics framework.. - Wolpert (1994)   (3 citations)  (Correct)

....i) Background. In the last decade several theoretical frameworks for addressing supervised learning have been discussed in the neural net community. Some of them are represented in the other papers in this proceedings. This paper is primarily concerned with four of those frameworks: PAC ([Blumer et al. 1987, Blumer et al. 1989, Haussler 1994ab, Valiant 1984, Dietterich 1990, COLT, Rivest 1989, Natarajan 1991, Anthony and Biggs 1992] the statistical physics of supervised learning (SP [Hertz et al. 1991, Opper and Haussler 1991ab, Schwartz et al. 1990, Seung et al. 1991, Tishby et al. 1989, ....

....tools that have been used are ill suited for investigating off-training-set behavior. In fact, often the four frameworks use language which implies that their goal is understanding off-training-set behavior, even when they use a test set that can overlap with the training set. For example, in [Blumer 1987], in the context of noise-free supervised learning, we read that "the real value of a scientific explanation lies not in its ability to explain [what one has already seen] but in predicting events that have yet to occur", despite the fact that the subsequent analysis allows test sets to overlap ....

[Article contains additional citation context not shown here]

Blumer, A., et al. (1987). Occam's razor. Information Processing Letters, 24, 377-380.


Bayesian Integration of Rule Models - Domingos   (Correct)

....and these biases are known before the learner is applied to any given training set, these biases can be considered to imply a corresponding prior distribution. Specifically, most decision tree and rule learners (including C4.5RULES) incorporate a simplicity bias, also known as Occam's razor (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987): they give preference to simpler models, on the assumption that these models will have lower error on a test set than more complex ones, even if they have higher error on the training set. The trade-off between error and simplicity (or between the likelihood and the prior) can be made in ....

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor.


Unsupervised Constructive Learning - Thornton   (Correct)

....recent years, researchers have made rapid progress in the theoretical analysis of learning. Early work by Gold [13] and Valiant [14,15] established a tradition which grew to encompass theoretical constructs such as VC dimension [16] and led to the theoretical advances of Haussler and others, e.g. [17, 18, 19, 20, 21, 21, 22]. Much of this work is directed towards the goal of analyzing the complexity of learning but, at the time of writing, measuring the hardness of arbitrary learning problems (e.g. specific training sets) remains problematic [23]. However, it turns out that a useful, qualitative measure of problem ....

Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. (1987). Occam's razor. Information Processing Letters, 24 (pp. 377-380).


Feedforward and Recurrent Neural Networks and Genetic Programs.. - McCluskey (1993)   (4 citations)  (Correct)

....rules which are consistent with the data, many of which work only by accident. Lemma: Given any function f in a hypothesis class of r hypotheses, the probability that any hypothesis with error larger than ε is consistent with a sample of f of size m is less than r(1 − ε)^m [1]. While the exponential m in this expression may superficially appear to ensure that a sample size in the hundreds or thousands will eliminate most accidentally correct hypotheses, the hypothesis space is also exponential in the complexity of the hypothesis (number of weights) and when mapping ....
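
To make the lemma's trade-off concrete, the following is a minimal sketch that evaluates the bound r(1 − ε)^m in log space and finds the smallest sample size at which it drops below a chosen confidence level δ. The function names, the confidence parameter delta, and the example network (100 weights stored at 8 bits each, so r = 2^800) are illustrative assumptions, not values taken from the cited paper.

```python
import math


def log_failure_bound(log_r: float, epsilon: float, m: int) -> float:
    """Natural log of the lemma's bound r * (1 - epsilon)**m, given ln(r)."""
    return log_r + m * math.log1p(-epsilon)


def min_sample_size(log_r: float, epsilon: float, delta: float) -> int:
    """Smallest m for which r * (1 - epsilon)**m <= delta."""
    needed = (log_r + math.log(1.0 / delta)) / -math.log1p(-epsilon)
    return max(0, math.ceil(needed))


# Hypothetical hypothesis space: 100 weights at 8 bits each, so r = 2**800.
log_r = 800 * math.log(2)
m = min_sample_size(log_r, epsilon=0.05, delta=0.01)

# With only 1000 samples the log of the bound is still positive,
# i.e. the bound exceeds 1 and guarantees nothing.
print(m, log_failure_bound(log_r, 0.05, 1000) > 0)
```

Working in log space avoids overflow when r is exponential in the number of weights, which is exactly the regime the excerpt warns about.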

.... weekly d1[i] 7=4=365 annually to weekly d1[i] 7=365 fill empty values fills empty dates with the most recent non empty value normalize d1[i] maximum absolute value for d1 yearly average replaces all values for each calendar year with the average value for that year limit1 clips values to range [ 1,1] log10 log 10 (d1) is month if(month = d1) then 1 else 0 is year mod 4 if( year modulo 4) d1) then 1 else 0 make zeroes empty if(abs(d1) 10 Gamma6 ) then empty else d1 A.3 List operations: Add sum Average average of all non empty values Appendix B Raw Financial Data Available (all data ....

Blumer, Anselm, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth, "Occam's Razor", in Information Processing Letters 24 (1987) pp 377-380, also reprinted in Readings in Machine Learning by Jude W. Shavlik and Thomas G. Dietterich.


Further Experimental Evidence against the Utility of Occam's Razor - Webb (1996)   (25 citations)  (Correct)

....advantages are to be expected from its application in classification learning. However, the literature does contain two statements that seem to capture at least one widely adopted approach to the principle. Blumer, Ehrenfeucht, Haussler, and Warmuth (1987) suggest that to wield Occam's razor is to adopt the goal of discovering the simplest hypothesis that is consistent with the sample data with the expectation that the simplest hypothesis will perform well on further observations taken from the same source. Quinlan (1986) states Given a choice ....

....To merely state that a less complex explanation is preferable does not specify by what criterion it is preferable. The implicit assumption underlying much machine learning research appears to be that, all other things being equal, less complex classifiers will be, in general, more accurate (Blumer et al. 1987; Quinlan, 1986). It is this Occam thesis that this paper seeks to discredit. On a straightforward interpretation, for a syntactic measure to be used to predict expected accuracy appears absurd. If two classifiers have identical meaning (such as IF 20≤AGE≤40 THEN POS and IF 20≤AGE≤30 OR 30≤AGE≤40 THEN ....

[Article contains additional citation context not shown here]

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's Razor.


A Preliminary PAC Analysis of Theory Revision - Mooney (1993)   (8 citations)  (Correct)

....one way of measuring the distance between two theories is to determine the minimum number of primitive syntactic modifications needed to transform one theory into the other. The notion that the initial theory is close to the correct one can then be captured by assuming that the syntactic distance between the two theories is less than some value, d. The hypothesis space can then be limited to all theories within a distance d of the initial ....

[Footnote text: ....DNF, k-DNF, etc. Footnote 2: This is a direct consequence of a result in (Blumer et al. 1987).]

[Figure 2: Restricted Hypothesis Space for Theory Revision]

....with at most s c literals. An analysis of the VC dimension of this hypothesis space would provide such a lower bound (Ehrenfeucht et al. 1989). The above results are closely related to previous results on the sample complexity of learning concepts representable with a limited number of bits (Blumer et al. 1987). With respect to pure induction, the term s c ln(s c n) in Equation 7 is proportional to the number of bits needed to represent a theory with s c literals. Since there are at most s c n possible literals to pick from (n observables plus at most s c non observables) O(ln(s c n) bits are ....

[Article contains additional citation context not shown here]

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1987). Occam's razor. Information Processing Letters, 24:377--380.


Learning Hierarchical Rule Sets - Kivinen, Mannila, Ukkonen (1993)   (2 citations)  (Correct)

....of the smallest k level rule set that classifies the examples correctly. If the basic concept class satisfies some mild assumptions, the algorithm runs in time that is polynomial in the input size, and produces a rule set of size O( log m) k n k ) Hence, the algorithm is an Occam algorithm [3] and therefore also a learning algorithm in the PAC model. The algorithm is based on the greedy approximation technique for weighted set cover problems. We choose the set L k in such a way that every example not in the default class is in some c 2 L k . For each basic concept c there is a cost for ....

....As the examples in Section 2 show, in many interesting cases we can take fi = 0. We know that H(m) Theta(log m) O(m ff ) for all ff 0. Therefore, the size of the representation output by RS k is O(m ff l fi jL c j k ) and RS k is an Occam algorithm. The results of Blumer et al. [3] give the following corollary. Corollary 6 Let ff 0 be an arbitrary constant. Let 0 1, 0 ffi 1, n 2 IN and l 2 IN be arbitrary parameters. There is a bound m 0 = O 0 1 log 1 ffi l fi n k 1 ff 1 A such that the following holds for all m m 0 , all probability ....

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, Occam's razor, Inform. Process. Lett., 24 (1987), pp. 377--380.


PAC Learning of One-Dimensional Patterns - Goldberg, Goldman, Scott (1996)   (1 citation)  (Correct)

....problem of finding such a hypothesis from the class is NP-complete. In fact, the size of the hypothesis output by our algorithm depends on the size of the sample. In particular the representation complexity of a hypothesis is sublinear in the sample size and polynomial in the parameters n and k. Blumer et al. (1987, 89) show that this achievement of data compression is sufficient to guarantee polynomial learnability. Let H^A_{s,m} be the hypothesis space used by algorithm A for a target complexity of s and sample size m. More formally, we say that algorithm A is an Occam Algorithm for concept class C if ....

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1987). Occam's Razor. Information Processing Letters 24 pp. 377-380.


A General Lower Bound on the Number - Of Examples Needed   Self-citation (Ehrenfeucht Haussler)   (Correct)

No context found.

Blumer, A., A. Ehrenfeucht, D. Haussler, M. Warmuth, "Occam's Razor", Information Processing Letters, 24, 1987, pp. 377-380.


Relating Data Compression and Learnability - Littlestone, Warmuth (1986)   (20 citations)  Self-citation (Warmuth)   (Correct)

....compressed to the kernel but must be able to reconstruct the values of the sample. Note that we don't require any bounds on the length of the encoding of the kernel. The points of X might for instance be reals of arbitrarily high precision. Compression to a bounded number of bits is discussed in [BEHW87] and is much simpler. Theorem 2.1: For any compression scheme with kernel size k the error is larger than ε with probability (w.r.t. P^m) less than C(m, k) (1 − ε)^(m − k) when given a sample of size m ≥ k. Proof: Suppose we are learning some concept. Given an ε and an m, we want to find a ....

....the next four points the second rectangle and so forth. Given the additional information, knows the locations of the rectangles and can predict accordingly. We now generalize the theorems of the previous section. The case k = 0 in which the sample is compressed to ln(|Q|) bits was studied in [BEHW87]. Our bounds always contain the bounds of [BEHW87] as a subcase. Theorem 3.1: For a compression scheme with kernel size k and additional information Q the error is larger than ε with probability less than |Q| C(m, k) (1 − ε)^(m − k) when given a sample of size m ≥ k. Proof: This proof is an ....

[Article contains additional citation context not shown here]

Blumer, A., A. Ehrenfeucht, D. Haussler and M. Warmuth, "Occam's Razor," Information Processing Letters 24, 1987, pp. 377-380.


Sample compression, learnability, and the Vapnik-Chervonenkis .. - Floyd, Warmuth (1993)   (6 citations)  Self-citation (Warmuth)   (Correct)

....of the learning algorithm for C is the smallest required sample size, as a function of ε and δ. For a finite concept class C ⊆ 2^X, Theorem 2.1 gives an upper bound on the sample complexity required for learning the class C. This upper bound is linear in ln|C|. Theorem 2.1 ([V82], [BEHW87], [BEHW89]): Let C ⊆ 2^X be any finite concept class. Then for sample size greater than (1/ε) ln(|C|/δ), any algorithm that chooses a hypothesis from C consistent with the examples is a learning algorithm for C. Definitions (the Vapnik-Chervonenkis dimension) For infinite classes such ....
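
The sample size in Theorem 2.1 can be recovered from a standard union-bound argument. The derivation below is a sketch of that textbook reasoning rather than a transcription of the cited paper's proof; m denotes the sample size, and the inequality 1 − ε ≤ e^{−ε} is the only analytic fact it uses.

```latex
% Assumed notation: C is the finite concept class, m the sample size,
% \epsilon the error tolerance, \delta the confidence parameter.
\begin{align*}
\Pr\bigl[\text{some } h \in C \text{ with error} > \epsilon
        \text{ is consistent with all } m \text{ examples}\bigr]
  &\le |C|\,(1-\epsilon)^{m} \\
  &\le |C|\,e^{-\epsilon m}
     && \text{since } 1-\epsilon \le e^{-\epsilon} \\
  &\le \delta
     && \text{whenever } m \ge \frac{1}{\epsilon}\ln\frac{|C|}{\delta}.
\end{align*}
```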

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M., "Occam's Razor", Inf. Proc. Let., 24, 1987, pp. 377-380.


Dominance Detection in Meetings Using Easily Obtainable Features - Rienks, Heylen (2005)   (Correct)

No context found.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam's razor. Information Processing Letters, 24:377--380, 1987.


Exact Minimization of Binary Decision Diagrams.. - Oliveira.. (1998)   (3 citations)  (Correct)

No context found.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth, "Occam's Razor," Information Processing Letters, vol. 24, pp. 377-380, Apr. 1987.


A Formal Definition of Intelligence Based on an.. - Hernandez-Orallo, al. (1998)   (Correct)

No context found.

Blumer, A.; Ehrenfeucht, A.; Haussler, D.; Warmuth, M. K. "Occam's razor" Inf.Proc.Lett. 24, 377-380, 1987.


Learning DNF Formulas: A Survey - Hellerstein   (Correct)

No context found.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam's razor. Inform. Proc. Lett., 24:377-380, April 1987.


A study of two probabilistic methods for searching large spaces .. - Srinivasan (1999)   (3 citations)  (Correct)

No context found.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1987). Occam's Razor. Information Processing Letters, 24:377-380.


Faster Algorithms for Finding Minimal Consistent DFAs - Lang (1999)   (Correct)

No context found.

A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth (1987) Occam's razor. Information Processing Letters, 24(6):377-380.


Constructive Reinforcement Learning - Hernandez-Orallo (1999)   (Correct)

No context found.

A. Blumer, A. Ehrenfeucht, D. Haussler and M.K. Warmuth "Occam's razor" Inf. Proc. Letters, 24, 377-380 (1987).


CiteSeer.IST - Copyright Penn State and NEC