Results 1–8 of 8
On the computational efficiency of training neural networks
Abstract

Cited by 6 (1 self)
It is well-known that neural networks are computationally hard to train. On the other hand, in practice, modern-day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g. ReLU), overspecification (i.e., training networks that are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results, some of which yield new provably efficient and practical algorithms for training certain types of neural networks.
Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR abs/1506.08473
, 2015
Abstract

Cited by 2 (1 self)
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for training a two-layer neural network. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multilinear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with general nonlinear activations.
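To give the flavor of the tensor-decomposition step such guarantees rest on (this is a minimal illustrative sketch, not the paper's actual algorithm), tensor power iteration recovers one component of a small orthogonally decomposable symmetric 3-tensor; the dimensions, weights, and component vectors below are made up:

```python
import math
import random

def rank1_3tensor(a, lam):
    """lam * (a outer a outer a) as a nested list."""
    n = len(a)
    return [[[lam * a[i] * a[j] * a[k] for k in range(n)]
             for j in range(n)] for i in range(n)]

def add_tensors(S, T):
    n = len(S)
    return [[[S[i][j][k] + T[i][j][k] for k in range(n)]
             for j in range(n)] for i in range(n)]

def tensor_apply(T, u):
    """Contract T with u twice: v_i = sum_{j,k} T[i][j][k] u_j u_k."""
    n = len(u)
    return [sum(T[i][j][k] * u[j] * u[k]
                for j in range(n) for k in range(n)) for i in range(n)]

def normalize(v):
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

# Toy tensor with two orthonormal components (weights 3.0 and 1.5).
a1, a2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
T = add_tensors(rank1_3tensor(a1, 3.0), rank1_3tensor(a2, 1.5))

# Tensor power iteration from a random unit vector.
random.seed(0)
u = normalize([random.gauss(0, 1) for _ in range(3)])
for _ in range(50):
    u = normalize(tensor_apply(T, u))
# u is now (up to sign) one of the planted components a1 or a2.
```

Under the orthogonality assumption, the iteration converges to one of the planted components, which is the basic building block that guaranteed tensor-based training methods exploit.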
ℓ1-regularized Neural Networks are Improperly Learnable in Polynomial Time
Abstract

Cited by 1 (0 self)
We study the improper learning of multilayer neural networks. Suppose that the neural network to be learned has k hidden layers and that the ℓ1 norm of the incoming weights of any neuron is bounded by L. We present a kernel-based method such that, with probability at least 1 − δ, it learns a predictor whose generalization error is at most ε worse than that of the neural network. The sample complexity and the time complexity of the presented method are polynomial in the input dimension and in 1/ε, with a dependence on the activation function, and independent of the number of neurons. The algorithm applies to both sigmoid-like and ReLU-like activation functions. It implies that any sufficiently sparse neural network is learnable in polynomial time.
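As a toy illustration of improper, kernel-based learning of a nonlinear target (a drastically simplified stand-in for the paper's method, with the data and kernel chosen purely for illustration), a kernel perceptron with a degree-2 polynomial kernel learns XOR, a function no linear predictor over the raw inputs can represent:

```python
# Degree-2 polynomial kernel: an implicit 6-dimensional feature map.
def kern(x, y):
    return (1 + sum(a * b for a, b in zip(x, y))) ** 2

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [-1, 1, 1, -1]          # XOR labels

# Kernel perceptron: one dual coefficient per training point.
alpha = [0, 0, 0, 0]
for _ in range(20):         # a few passes suffice on this tiny set
    for i, (x, y) in enumerate(zip(X, Y)):
        margin = y * sum(alpha[j] * Y[j] * kern(X[j], x) for j in range(4))
        if margin <= 0:
            alpha[i] += 1   # mistake-driven update

def predict(x):
    s = sum(alpha[j] * Y[j] * kern(X[j], x) for j in range(4))
    return 1 if s > 0 else -1
```

The learned predictor is a kernel expansion rather than a neural network, which is the sense in which such learning is "improper".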
Convergent Learning: Do Different Neural Networks Learn the Same Representations?
, 2016
Abstract
Recent successes in training large, deep neural networks have prompted active investigation into the representations learned in their intermediate layers. Such research is difficult because it requires making sense of nonlinear computations performed by millions of learned parameters, but valuable because it increases our ability to understand current models and training algorithms and thus create improved versions of them. In this paper we investigate the extent to which neural networks exhibit what we call convergent learning, which is when the representations learned by multiple nets converge to a set of features which are either individually similar between networks or where subsets of features span similar low-dimensional spaces. We propose a specific method of probing representations: training multiple networks and then comparing and contrasting their individual, learned representations at the level of neurons or groups of neurons. We begin research into this question by introducing three techniques to approximately align different neural networks on a feature or subspace level: a bipartite matching approach…
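A rough sketch of the neuron-alignment idea: match each neuron of one net to the neuron of another net whose activation pattern over a probe set is closest. For tiny toy sizes a brute-force search over permutations stands in for a proper bipartite-matching solver; all activation values below are fabricated for illustration:

```python
import itertools

# Fabricated activation matrices: rows = probe inputs, columns = neurons.
acts_a = [[0.1, 0.9, 0.3],
          [0.8, 0.2, 0.5],
          [0.4, 0.7, 0.6]]

# Net B computes the same features, with its neurons permuted.
true_perm = (2, 0, 1)   # column j of B equals column true_perm[j] of A
acts_b = [[row[p] for p in true_perm] for row in acts_a]

def column(M, j):
    return [row[j] for row in M]

def similarity(u, v):
    # Negative squared distance as a simple stand-in for correlation.
    return -sum((x - y) ** 2 for x, y in zip(u, v))

n = 3
best = max(itertools.permutations(range(n)),
           key=lambda perm: sum(
               similarity(column(acts_a, perm[j]), column(acts_b, j))
               for j in range(n)))
# best[j] names the neuron of net A aligned with neuron j of net B.
```

At realistic layer widths one would use a polynomial-time assignment algorithm (e.g. the Hungarian method) instead of enumerating permutations.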
Model Selection in Compositional Spaces
, 2014
Abstract
We often build complex probabilistic models by composing simpler models, using one model to generate parameters or latent variables for another model. This allows us to express complex distributions over the observed data and to share statistical structure between different parts of a model. In this thesis, we present a space of matrix decomposition models defined by the composition of a small number of motifs of probabilistic modeling, including clustering, low-rank factorizations, and binary latent factor models. This compositional structure can be represented by a context-free grammar whose production rules correspond to these motifs. By exploiting the structure of this grammar, we can generically and efficiently infer latent components and estimate predictive likelihood for nearly 2500 model structures using a small toolbox of reusable algorithms. Using a greedy search over this grammar, we automatically choose the decomposition structure from raw data by evaluating only a small fraction of all models. The proposed method typically finds the correct structure for synthetic data and backs off gracefully to simpler models under heavy noise.
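The grammar-driven enumeration can be sketched roughly as follows; the production rules here are loose, hypothetical stand-ins for the thesis's actual motifs, and the code only enumerates candidate structures rather than fitting them to data:

```python
# Hypothetical production rules, loosely inspired by the thesis's motifs:
# a Gaussian matrix G may be refined into a low-rank, clustered, or
# binary-factor composition. Symbols and rules here are illustrative only.
PRODUCTIONS = {"G": ["MG + G", "MG", "CG", "BG"]}

def expand(structure, depth):
    """Enumerate structures reachable by rewriting the first 'G' up to `depth` times."""
    if depth == 0 or "G" not in structure:
        return {structure}
    out = {structure}
    i = structure.index("G")
    for rhs in PRODUCTIONS["G"]:
        out |= expand(structure[:i] + rhs + structure[i + 1:], depth - 1)
    return out

# All model structures reachable within two rewrites of the root symbol.
models = expand("G", 2)
```

A greedy search, as in the thesis, would score each candidate structure on held-out data and expand only the best one at each level, rather than enumerating the whole space.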
On the Quality of the Initial Basin in Overspecified Neural Networks
Ohad Shamir
Abstract
Deep learning, in the form of artificial neural networks, has achieved remarkable practical success in recent years for a variety of difficult machine-learning applications. However, a theoretical explanation for this remains a major open problem, since training neural networks involves optimizing a highly non-convex objective function and is known to be computationally hard in the worst case. In this work, we study the geometric structure of the associated non-convex objective function, in the context of ReLU networks and starting from a random initialization of the network parameters. We identify some conditions under which it becomes more favorable to optimization, in the sense of (i) a high probability of initializing at a point from which there is a monotonically decreasing path to a global minimum; and (ii) a high probability of initializing at a basin (suitably defined) with a small minimal objective value. A common theme in our results is that such properties are more likely to hold for larger ("overspecified") networks, which accords with some recent empirical and theoretical observations.