Results 1  10
of
311
Efficient estimation of word representations in vector space
, 2013
"... We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types ..."
Abstract

Cited by 371 (6 self)
 Add to MetaCart
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide stateoftheart performance on our test set for measuring syntactic and semantic word similarities.
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
"... Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of compo ..."
Abstract

Cited by 191 (7 self)
 Add to MetaCart
(Show Context)
Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80 % up to 85.4%. The accuracy of predicting finegrained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7 % over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases. 1
Dual averaging methods for regularized stochastic learning and online optimization
 In Advances in Neural Information Processing Systems 23
, 2009
"... We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1norm for promoting sparsity. We develop extensions of Nes ..."
Abstract

Cited by 133 (7 self)
 Add to MetaCart
We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.
GloVe: Global Vectors for Word Representation
"... Recent methods for learning vector space representations of words have succeeded in capturing finegrained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regular ..."
Abstract

Cited by 123 (9 self)
 Add to MetaCart
(Show Context)
Recent methods for learning vector space representations of words have succeeded in capturing finegrained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a wordword cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75 % on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition. 1
Large scale distributed deep networks,
 Proceedings of NIPS,
, 2012
"... Abstract Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have d ..."
Abstract

Cited by 107 (12 self)
 Add to MetaCart
(Show Context)
Abstract Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for largescale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of LBFGS. Downpour SGD and Sandblaster LBFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves stateoftheart performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestlysized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradientbased machine learning algorithm.
Parsing with Compositional Vector Grammars
"... Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only par ..."
Abstract

Cited by 107 (5 self)
 Add to MetaCart
(Show Context)
Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntacticosemantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8 % to obtain an F1 score of 90.4%. It is fast to train and implemented approximately as an efficient reranker it is about 20 % faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments. 1
Parallelized stochastic gradient descent
 Advances in Neural Information Processing Systems 23
, 2010
"... Abstract With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization a ..."
Abstract

Cited by 97 (4 self)
 Add to MetaCart
Abstract With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms
A reliable effective terascale linear learning system
, 2011
"... We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features,1 billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are n ..."
Abstract

Cited by 72 (6 self)
 Add to MetaCart
We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features,1 billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, up to our knowledge, the most scalable and efficient linear learning system reported in the literature.2 We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.