| G. E. Hinton. Learning distributed representations of concepts. In Proc. Ann. Conf. of the Cognitive Science Society, volume 1, 1986. |
....are represented by the hidden units in a network which combine the inputs of multiple features, thus allowing the model to take advantage of dependencies among the features. Understanding the hidden units themselves is often difficult because these units often learn distributed representations [11]. Hidden units can be thought of as representing higher level, derived features. In a distributed representation, however, these derived features may not correspond to well understood features in the problem domain. Instead features which are meaningful in the context of the problem domain ....
Hinton G. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Amherst, MA, 1986. Erlbaum.
.... have been made to interpret connectionist networks, focusing on feedforward networks in particular [Andrews and Diederich, 1996, Abe et al. 1993, Shavlik, 1994]. For instance, visualizations of internal activations or weight strengths can be used to get an impression of the internal knowledge [Hinton, 1986, Gorman and Sejnowski, 1988]. Some effort has also been made to reduce the network size in order to simplify the knowledge expressed therein by eliminating very small weights. Furthermore, groups of similar weights can be replaced with their average strength [Shavlik, 1994]. In addition, ....
Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the 8th Meeting of the Cognitive Science Society.
....represented by part of a neural network, and it yields parameters for expressing the distribution of $Z_i$. Experiments on four UCI data sets show this approach to work comparatively very well [3, 2]. The idea of a distributed representation for symbols dates from the early days of connectionism [5]. More recently, Hinton's approach was improved and successfully demonstrated on learning several symbolic relations [9]. The idea of using neural networks for language modeling is not new either, e.g. [8]. In contrast, here we push this idea to a large scale, and concentrate on learning a ....
G.E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Amherst
....possible value of each feature represents a concept. For instance, the feature COLOR generates the concepts Red, Blue, etc. Hopefully such an algorithm can be used directly to discover useful new concepts in the Cyc KB, and can be used indirectly in analogical reasoning in Cyc. Solving Hinton's [Hin86] family relations problem in a more pleasing manner has been the first important milestone. The Family Relations problem is described in the next section. Then the Minimum Description Length (MDL) principle [Ris89] in which theories are judged to be good in direct proportion to how small they ....
....after all is just extending regularities beyond their customary context. It is appealing that in this sense, doing completion with MDL OC is doing analogical reasoning without explicitly going through the several stages customary in the literature. Back propagation approaches to feature discovery [Hin86, MD87] have suffered from the asymmetric treatment of input and output units, and the use of indirect methods like bottlenecks for encouraging good representations, rather than incorporating an explicit declarative characterization of conciseness. This is not an inherent limitation: a back propagation ....
Geoffrey E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Cognitive Science Conference, pages 1--12, Amherst, Massachusetts, 1986. Cognitive Science Society.
....data is introduced, the network grows in accordance with the functionality requirement [16]. Another method for finding an appropriate network size is through the use of network reduction algorithms that typically start with a large network and reduce its size by a variety of means. Pruning [25] is one method whereby unimportant connections are severed or, more drastically, entire nodes can be removed if they are found not to be useful. 2.6 Controlling a Network's Degrees of Freedom Directly. It is possible, however, to obtain the effect of ....
....learn noise, a learning algorithm could be enhanced by adding a new term to the original cost function. The new term could be designed such that its reduction would lead to a network with limited degrees of freedom. Examples of this type of work are ridge regression [21] and related weight decay [25], which control the degrees of freedom of function approximators. The idea of a multi objective cost function is quite useful since it might be easier to specify an additional objective for the cost function rather than a new algorithm that would perform the equivalent function. Some additional ....
G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Erlbaum, Hillsdale, NJ, 1989.
....data is introduced, the network grows in accordance with the functionality requirement [14]. Another method for finding an appropriate network size is through the use of network reduction algorithms that typically start with a large network and reduce its size by a variety of means. Pruning [24] is one method whereby unimportant connections are severed or, more drastically, entire nodes can be removed if they are found not to be useful. 2.6 Controlling a Network's Degrees of Freedom Directly. It is possible, however, to obtain the effect of pruning by encouraging weights to approach ....
....reduce its ability to learn noise, a learning algorithm could be enhanced by adding a new term to the original cost function. The new term could be designed such that its reduction would lead to a network with limited degrees of freedom. Examples are ridge regression [20] and related weight decay [24], which control the degrees of freedom of function approximators. The idea of a multi objective cost function is quite useful since it might be easier to specify an additional objective for a cost function rather than a new algorithm that would perform the equivalent function. One possible ....
G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Erlbaum, Hillsdale, NJ, 1989.
....elegant and automatic solution, which incorporates feature selection into the learning algorithm as a means of parameter elimination. Thus we view feature selection as a means of finding the optimal number of parameters, in the spirit of neural network algorithms such as pruning [82], weight decay [38], and optimal brain damage [48]. Essential to the feature selection algorithm is the fact that the features have all been normalized to lie in approximately the same range. All ordered features whether real valued (as with the breast cancer data) or integer valued (as the SEER data) are ....
G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Hillsdale, 1986. Erlbaum.
....and in connectionist cognitive modeling have proposed approaches which learn representations with useful similarity structure. Usually, this structure is an emergent property, a side effect of performing some other task. Examples include Miikkulainen's FGREP model [90], Hinton's family trees model [67], and Elman's recurrent network model for sentence processing [38]. Each has empirically demonstrated that neural networks can learn representations with a task dependent similarity structure. However, the similarity structure is not a target in the training procedure. Schutze [130, 131] has ....
....one that has the best trade off between ranking performance and cost. Adding cost terms is a common approach in the neural network literature. For example, cost terms have been added to the optimization to favor neural networks which have the lowest complexity, in order to promote generalization [67]. This is an interesting direction for future work, though many difficult issues remain to be addressed. A possible limitation of the method is its reliance on sample queries and identified relevant documents for training. In some environments it may be quite difficult to acquire these samples. ....
Geoffrey E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Cognitive Science Society Conference. Lawrence Erlbaum Associates, Hillsdale, NJ, 1986.
....and statistical independence achieved in this toy example are minimal at best. 2.3. Nonlinear Principal Manifolds. One of the simplest methods for computing nonlinear principal manifolds is the nonlinear PCA (NLPCA) autoassociative multilayer neural network [16, 8] shown in Figure 2. Hinton [11] was first to point out that nonlinear networks form useful representations in their hidden layers and Ackley et al. [1] were the first to implement an autoencoder trained to reproduce its inputs. The so-called bottleneck layer forms a lower dimensional manifold representation by means of a ....
G. E. Hinton. Learning distributed representations of concepts. In Proc. Ann. Conf. of the Cognitive Science Society, volume 1, 1986.
....represented by part of a neural network, and it yields parameters for expressing the distribution of $Z_i$. Experiments on four UCI data sets show this approach to work comparatively very well [3, 2]. The idea of a distributed representation for symbols dates from the early days of connectionism [5]. More recently, Hinton's approach was improved and successfully demonstrated on learning several symbolic relations [9]. The idea of using neural networks for language modeling is not new either, e.g. [8]. In contrast, here we push this idea to a large scale, and concentrate on learning a ....
G.E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Amherst, 1986. Lawrence Erlbaum, Hillsdale.
.... be shown that no algorithm that uses weight vectors of the form $w_{t+1} = \sum_{i=1}^{t} a_i x_i$ can have smaller loss in this situation [LLW95]. This class of algorithms also includes a basic variant of weight decay, where an additional $\|w_t\|_2^2$ error term is used as a penalty for large weights [Hin86]. According to a commonly accepted heuristic, the number of examples needed to learn linear functions is roughly proportional to the number of dimensions in the instances. The results presented here do not contradict this in any way. The number of examples required for the EG$^{\pm}$ algorithm ....
G. E. Hinton. Learning distributed representations of concepts. In Proc. 8th Annual Conf. of the Cognitive Science Society, pages 1--12, Hillsdale, 1986. Erlbaum.
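The squared-norm penalty mentioned in the context above can be illustrated with a small sketch. It assumes plain gradient descent on a squared prediction error; the names `lam`, `lr`, `penalized_loss`, and `gd_step` are illustrative choices and do not come from the cited papers.

```python
# Sketch: one gradient-descent step on squared loss plus an L2 (weight decay) penalty.
# All names and constants here are assumptions made for illustration.

def penalized_loss(w, x, y, lam):
    """Squared prediction error plus lam * ||w||_2^2."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2 + lam * sum(wi * wi for wi in w)

def gd_step(w, x, y, lam, lr):
    """One gradient step; the penalty adds 2 * lam * w_i to each gradient component."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - y
    return [wi - lr * (2 * err * xi + 2 * lam * wi) for wi, xi in zip(w, x)]

w0 = [1.0, -2.0]
x, y = [0.0, 0.0], 0.0          # zero input, so only the decay term acts
w1 = gd_step(w0, x, y, lam=0.5, lr=0.1)
```

With a zero input the data term vanishes, so the step simply shrinks every weight by the factor `1 - 2 * lr * lam`, which is the "decay" the name refers to.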
....computable by, any individual unit; it is only computable at the network level, not the processor level. For example, several imprecise units broadly tuned to respond to overlapping ranges of a stimulus can together pinpoint the value of the stimulus more precisely than any individual unit can [Hin86]. This method of coarse coding to achieve high resolution with sloppy hardware (or the wetware of biology such as real neurons) is employed, for example, by the retina to recognize color [Lan77], by bats to detect targets [SH86], and by the electric fish to sense electric field changes [BH88] ....
G.E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Amherst, 1986. Lawrence Erlbaum, Hillsdale.
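The coarse coding idea in the context above (broadly tuned, overlapping units jointly localizing a stimulus) can be sketched in one dimension. The unit centers, the receptive-field width, and the function names are hypothetical; the decoding rule (intersecting the fields of the active units) is one simple choice among several.

```python
# Sketch of coarse coding in one dimension. Centers and widths are made-up values.

def active_units(stimulus, centers, half_width):
    """Indices of units whose field [c - half_width, c + half_width] contains the stimulus."""
    return [i for i, c in enumerate(centers) if abs(stimulus - c) <= half_width]

def decode_interval(active, centers, half_width):
    """Intersect the receptive fields of the active units to bound the stimulus."""
    lo = max(centers[i] - half_width for i in active)
    hi = min(centers[i] + half_width for i in active)
    return lo, hi

centers = [0.0, 1.0, 2.0, 3.0, 4.0]   # each unit responds over a range of width 4
half_width = 2.0
active = active_units(2.4, centers, half_width)
lo, hi = decode_interval(active, centers, half_width)
```

Each unit alone only places the stimulus within a range of width 4, while the set of active units together narrows it to the interval from `lo` to `hi`, which is of width 1 here.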
....term consisting of the squared 2-norm of $x$ is added to the error function so that the modified objective function has bounded level sets: $\min_{x \in \mathbb{R}^n} f(x) = \sum_{j=1}^{K} f_j(x) + c\|x\|^2$, where $c > 0$ is a (small) penalty parameter. This, in fact, corresponds to the weight decay training [21, 69]. Weight decay is a useful approach since it tends to generate simpler networks by minimizing nonzero arc connections. Simpler networks often possess better generalization properties. All of our results apply with merely redefining the objective function of the problem. We could also consider ....
G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Hillsdale, 1986. Erlbaum.
.... can be shown that no algorithm that uses weight vectors of the form $w_{t+1} = \sum_{t} a_t x_t$ can have smaller loss in this situation [LLW91]. This class of algorithms also includes a basic variant of weight decay, where an additional $\|w_t\|_2^2$ error term is used as a penalty for large weights [Hin86]. [Figure 9.3: Cumulative losses of GD (solid line) and EG$^{\pm}$ (dotted line) with their upper bounds, plotted against trials, for instances $x_t \in \{-1, 1\}^{100}$ and target $u = (-1, 1, -1, 0, \ldots, 0)$] ....
G. E. Hinton. Learning distributed representations of concepts. In Proc. 8th Annual Conference of the Cognitive Science Society, Amherst, MA, August 1986.
.... be shown that no algorithm that uses weight vectors of the form $w_{t+1} = \sum_{i=1}^{t} a_i x_i$ can have smaller loss in this situation [LLW95]. This class of algorithms also includes a basic variant of weight decay, where an additional $\|w_t\|_2^2$ error term is used as a penalty for large weights [Hin86]. According to a commonly accepted heuristic, the number of examples needed to learn linear functions is roughly proportional to the number of dimensions in the instances. The results presented here do not contradict this in any way. The number of examples required for the EG$^{\pm}$ algorithm ....
G. E. Hinton. Learning distributed representations of concepts. In Proc. 8th Annual Conf. of the Cognitive Science Society, pages 1--12, Hillsdale, 1986. Erlbaum.
.... algorithm that uses weight vectors of the form $w_{t+1} = \sum_{i=1}^{t} a_i x_i$ can have smaller loss in this situation (Littlestone et al. 1995). This class of algorithms also includes a basic variant of weight decay, where an additional $\|w_t\|_2^2$ error term is used as a penalty for large weights (Hinton, 1986). According to a commonly accepted heuristic, the number of examples needed to learn linear functions is roughly proportional to the number of dimensions in the instances. The results presented here do not contradict this in any way. The number of examples required for the EG$^{\pm}$ algorithm ....
Hinton, G. E. (1986), Learning distributed representations of concepts, in "Proceedings, 8th Annual Conference of the Cognitive Science Society," pp. 1--12, Erlbaum, Hillsdale.
....different initial configurations and optimization algorithms caused the system to arrive at different solutions, but these solutions were almost always very similar in terms of generalization performance. 2 LRE results. Here we present the results obtained applying LRE to the Family Tree Problem [1]. In this problem, the data consists of people and relations among people belonging to two families, one Italian and one English, shown in fig. 1 (left). All the information in these trees can be represented in simple propositions of the form (person_1, relation, person_2). Using the ....
Geoffrey E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12. Erlbaum, NJ, 1986.
....models can discriminate the ungrammatical sentences where short range structure is corrupted, but the single HMM cannot discriminate the cases where the longer range structure is corrupted. 3.3 Family Trees. The final example application of PoHMMs is one of symbolic inference in two family trees [3]. In the family trees problem we consider two families, one English and the other Italian. There are twelve people in each family. In addition there are twelve familial relationships such as father, daughter, uncle etc. The data set is composed of a set of triplets of the form person relation ....
G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Hillsdale, NJ, August 1986. Lawrence Erlbaum Associates.
.... [Figure 2. Two isomorphic family trees. The symbol = means "married to". Names shown, English tree: Victoria = James, Margaret = Arthur, Jennifer = Charles, Colin, Christopher = Penelope, Andrew = Christine, Charlotte. Italian tree: Bortolo = Emma, Giannina = Pietro, Aurelio = Maria, Grazia = Pierino, Doralice = Marcello, Alberto, Mariemma.] Hinton (1986) showed that a multilayer neural network trained using backpropagation (Rumelhart et al. 1986) could make explicit the semantic features of concepts and relations present in the data. Unfortunately, the system had problems in generalizing when many triplets were missing from the training set. ....
....was obtained using gradient ascent to optimize the modified goodness function while the temperature was annealed. .... of Colin, Margaret and Jennifer in the tree, and then use this information to make the correct inference. The generalization achieved by LRE is much better than the neural networks of Hinton (1986) and O'Reilly (1996) which typically made one or two errors even when only 4 cases were held out during training. 3.3 Results on the Family Tree Problem with Real Data. We have used LRE to solve a much bigger family tree task. The tree is a branch of the real family tree of one of the authors ....
Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society, 1-12. NJ: Erlbaum.
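The triplet representation of the family tree data referred to in the contexts above can be sketched as follows. The facts use only two marriages named in the Figure 2 caption; the relation label "spouse", the helper names, and the choice of a one-hot (local) input code are assumptions made for illustration, not a reproduction of any cited model.

```python
# Sketch: family-tree facts as (person1, relation, person2) triplets, with a
# one-hot encoding of the (person1, relation) query. Labels are illustrative.

triplets = [
    ("Victoria", "spouse", "James"),
    ("James", "spouse", "Victoria"),
    ("Christopher", "spouse", "Penelope"),
    ("Penelope", "spouse", "Christopher"),
]

# Fixed vocabularies, sorted so that the encoding is reproducible.
people = sorted({p for a, _, b in triplets for p in (a, b)})
relations = sorted({r for _, r, _ in triplets})

def one_hot(item, vocab):
    """Vector with a single 1 at the position of `item` in `vocab`."""
    return [1 if v == item else 0 for v in vocab]

def encode_query(person, relation):
    """Concatenate the one-hot codes for the first person and the relation."""
    return one_hot(person, people) + one_hot(relation, relations)

def answer(person, relation):
    """Look up person2 for a stored fact; returns None if the triplet is absent."""
    for a, r, b in triplets:
        if a == person and r == relation:
            return b
    return None

query = encode_query("James", "spouse")
target = answer("James", "spouse")
```

A learned model would be trained to map `query` to a code for `target`; the table lookup in `answer` only stands in for the stored facts and does not generalize to held-out triplets.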
....the task of generalizing to unobserved triplets is non trivial. In the next section we briefly review related work on learning distributed representations. LRE is then presented in detail in section 3. Section 4 presents the results obtained using LRE on the number problem and the family tree task (Hinton, 1986), as well as the results obtained on a much larger version of the family tree task that uses data from a real family tree. We also compare these results to the results obtained using Principal Components Analysis. In section 5 we examine how a solution obtained from an impoverished data set can be ....
....or a suitably transformed representation of this count. Each word can then be represented by its projection onto each of the learned features and words with similar meanings will have similar projections. Again, LSA is unable to make use of the specific relational information in a triplet. Hinton (1986) showed that a multilayer neural network trained using backpropagation (Rumelhart et al. 1986) could make explicit the semantic features of concepts and relations present in the data. Unfortunately, the system had problems in generalizing when many triplets were missing from the training set. ....
[Article contains additional citation context not shown here]
Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12. Erlbaum, NJ.
....models can discriminate the ungrammatical sentences where short range structure is corrupted, but the single HMM cannot discriminate the cases where the longer range structure is corrupted. 3.3 Family Trees. The final example application of PoHMMs is one of symbolic inference in two family trees [Hinton, 1986]. In the family trees problem we consider two families, one English and the .... [Figure panels: "PoHMM Subject Verb Agreement Discrimination" and "PoHMM No Not Agreement Discrimination"] ....
Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Hillsdale, NJ. Lawrence Erlbaum Associates.
....the task of generalizing to unobserved triplets is non trivial. In the next section we briefly review related work on learning distributed representations. LRE is then presented in detail in section 3. Section 4 presents the results obtained using LRE on the number problem and the family tree task (Hinton, 1986), as well as the results obtained on a much larger version of the family tree problem that uses data from a real family tree. We also compare these results to the results obtained using Principal Components Analysis. In section 5 we examine how a solution obtained from an impoverished data set can ....
....or a suitably transformed representation of this count. Each word can then be represented by its projection onto each of the learned features and words with similar meanings will have similar projections. Again, LSA is unable to make use of the specific relational information in a triplet. Hinton (1986) showed that a multilayer neural network trained using backpropagation (Rumelhart et al. 1986) could make explicit the semantic features of concepts and relations present in the data. Unfortunately, the system had problems in generalizing when many triplets were missing from the training set. ....
[Article contains additional citation context not shown here]
Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12. Erlbaum, NJ.
No context found.
G. E. Hinton. Learning distributed representations of concepts. In Proc. Ann. Conf. of the Cognitive Science Society, volume 1, 1986.
No context found.
G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1--12, Amherst, Mass, August 1986.
No context found.
Hinton, G.E. (1986). Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12. Lawrence Erlbaum, Hillsdale, NJ.
CiteSeer.IST - Copyright Penn State and NEC