
## Large-scale deep unsupervised learning using graphics processors (2009)



Venue: International Conference on Machine Learning

Citations: 51 (8 self)

### Citations

4201 | Regression shrinkage and selection via the lasso
- Tibshirani
- 1994
Citation Context: ...≤ 1, ∀j ∈ {1, ..., n} where the first term in the objective function encourages good reconstruction (x^(i) ≈ Σ_j b_j a_j^(i)), and the second term encourages sparsity by penalizing nonzero activations (Tibshirani, 1996). The optimization problem is not jointly convex in both b and a variables, but it is convex in either one of those variables, if the other is kept fixed. This suggests an alternating minimization al...
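The alternating scheme quoted in this context can be illustrated in a few lines of NumPy. This is a minimal sketch under our own conventions, not the cited solvers: ISTA-style soft-thresholding steps stand in for the custom L1 solvers mentioned in the snippet, and all function names, step sizes, and iteration counts here are illustrative assumptions.

```python
import numpy as np

def soft_threshold(Z, t):
    """Elementwise proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def fit_codes(X, B, A, beta, n_steps):
    """Step 1: keep B fixed, run ISTA on the L1-regularized least squares."""
    L = np.linalg.norm(B, 2) ** 2 + 1e-8  # Lipschitz constant of the smooth part
    for _ in range(n_steps):
        A = soft_threshold(A - B.T @ (B @ A - X) / L, beta / L)
    return A

def sparse_coding(X, n_bases, beta=0.1, n_outer=10, n_inner=30, seed=0):
    """Alternating minimization of 0.5*||X - B A||_F^2 + beta*||A||_1."""
    rng = np.random.default_rng(seed)
    k, m = X.shape                      # k-dimensional inputs, m examples
    B = rng.standard_normal((k, n_bases))
    B /= np.linalg.norm(B, axis=0)      # start from unit-norm basis vectors
    A = np.zeros((n_bases, m))
    for _ in range(n_outer):
        A = fit_codes(X, B, A, beta, n_inner)
        # Step 2: keep A fixed, solve for B in closed form, then rescale
        # any basis vector with norm > 1 back onto the unit ball.
        B = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(n_bases))
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    A = fit_codes(X, B, A, beta, n_inner)  # final codes for the returned bases
    return B, A
```

Each subproblem is convex on its own, matching the observation in the snippet, even though the joint problem is not.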

3436 | MapReduce: simplified data processing on large clusters
- Dean, Ghemawat
- 2008
Citation Context: ...thers can be easily implemented in parallel on multicore architectures, by having each core perform the required computations for a subset of input examples, and then combining the results centrally (Dean & Ghemawat, 2004; Chu et al., 2006). However, standard algorithms for DBNs and sparse coding are difficult to parallelize with such “data-parallel” schemes, because they involve iterative, stochastic parameter update...

1325 | Least angle regression
- Efron, Hastie, et al.
- 2004
Citation Context: ...alternating minimization algorithm with two steps: first, keeping b fixed, we optimize over a, which leads to an L1-regularized least squares problem, that can be solved using custom-designed solvers (Efron et al., 2004; Lee et al., 2006; Andrew & Gao, 2007). Then, we keep a fixed, and optimize over b using convex optimization techniques (Lee et al., 2006). For problems with high-dimensional inputs and large numbers ...

1302 | Emergence of simple-cell receptive field properties by learning a sparse code for natural images
- Olshausen, Field
- 1996
Citation Context: ...author(s)/owner(s). 1. Introduction We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that can learn hierarchical representations of their input (Olshausen & Field, 1996; Hinton & Salakhutdinov, 2006). With the invention of increasingly efficient learning algorithms over the past decade, these models have been applied to a number of machine learning applications, inc...

970 | A fast learning algorithm for deep belief nets, Neural Computation
- Hinton, Osindero, et al.
- 2006
Citation Context: ...the largest-scale models possible, so this is not an exact comparison; but the order of magnitude difference between our desired model and recent work is striking.

Published source | Application | Params
Hinton et al., 2006 | Digit images | 1.6mn
Hinton & Salakhutdinov | Face images | 3.8mn
Salakhutdinov & Hinton | Sem. hashing | 2.6mn
Ranzato & Szummer | Text | 3mn
Our model | | 100mn

...the model is large, but not otherwise (Lee et al., 200...

850 | Training products of experts by minimizing contrastive divergence
- Hinton
Citation Context: ...uted: P(h_j|x) = sigmoid(b_j + Σ_i w_ij x_i) (1); P(x_i|h) = sigmoid(c_i + Σ_j w_ij h_j) (2). Maximum likelihood parameter learning for an RBM can be efficiently approximated by contrastive divergence updates (Hinton, 2002), where we start with the unlabeled examples as the visible units, alternately sample the hidden units h and visible units x using Gibbs sampling (Equations 1-2), and update the parameters as: w_ij :=...
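The one-step contrastive divergence procedure quoted here can be sketched in NumPy for a binary-binary RBM. This is a minimal illustration under our own assumptions (function names, learning rate, and sizes are all ours), not the cited paper's GPU implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, b, c, lr=0.1, rng=None):
    """One CD-1 step on a mini-batch X (one example per row)."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: P(h_j = 1 | x) = sigmoid(b_j + sum_i w_ij x_i)
    ph0 = sigmoid(X @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visibles and back up.
    # P(x_i = 1 | h) = sigmoid(c_i + sum_j w_ij h_j)
    pv1 = sigmoid(h0 @ W.T + c)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Update: w_ij += lr * (<x_i h_j>_data - <x_i h_j>_reconstruction)
    n = X.shape[0]
    W += lr * (X.T @ ph0 - v1.T @ ph1) / n
    b += lr * (ph0 - ph1).mean(axis=0)
    c += lr * (X - v1).mean(axis=0)
    return W, b, c
```

The two conditional distributions correspond to Equations 1-2 in the snippet; the update approximates the maximum-likelihood gradient without computing the partition function.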

796 | Reducing the dimensionality of data with neural networks
- Hinton, Salakhutdinov
- 2006
Citation Context: ... free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer train...

442 | Efficient sparse coding algorithms
- Lee, Battle, et al.
- 2007
Citation Context: ...on et al., 2006 | Digit images | 1.6mn
Hinton & Salakhutdinov | Face images | 3.8mn
Salakhutdinov & Hinton | Sem. hashing | 2.6mn
Ranzato & Szummer | Text | 3mn
Our model | | 100mn
...the model is large, but not otherwise (Lee et al., 2006). There has been a lot of recent work on scaling up DBN and sparse coding algorithms, sometimes with entire research papers devoted to ingenious methods devised specifically for each of these models ...

394 | Greedy layer-wise training of deep networks
- Bengio, Lamblin, et al.
- 2007
Citation Context: ...t of recent work on scaling up DBN and sparse coding algorithms, sometimes with entire research papers devoted to ingenious methods devised specifically for each of these models (Hinton et al., 2006; Bengio et al., 2006; Murray & Kreutz-Delgado, 2006; Lee et al., 2006; Kavukcuoglu et al., 2008). Meanwhile, the raw clock speed of single CPUs has begun to hit a hardware power limit, and most of the growth in processin...

369 | Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations
- Lee, Grosse, et al.
- 2009
Citation Context: ...eters in all patches, such that, for example, wA = wB in Figure 2. If overlapping patches are tiled one pixel apart, this model is identical to the convolutional RBM model (Desjardins & Bengio, 2008; Lee et al., 2009). Contrastive divergence learning in this model can be implemented by using convolutions to perform the Gibbs sampling operation h|x. For small to medium filter (patch) sizes, spatial convolution can...
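The convolutional Gibbs step h|x mentioned in this context can be sketched as follows, assuming a single filter whose weights are shared across all patches (as in wA = wB above). The valid cross-correlation, names, and sizes are our own illustrative choices, not the cited implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_valid(x, w):
    """'Valid' 2-D cross-correlation of image x with filter w."""
    H, W = x.shape
    fh, fw = w.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + fh, j:j + fw] * w)
    return out

def sample_hidden(x, w, b, rng):
    """Compute P(h = 1 | x) via a shared-weight convolution, then sample."""
    p = sigmoid(conv2d_valid(x, w) + b)
    return (rng.random(p.shape) < p).astype(float), p
```

Because the filter weights are shared across patch positions, a single convolution computes all hidden-unit activations at once, which is what makes the convolution-based Gibbs step efficient.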

356 | Independent component filters of natural images compared with simple cells in primary visual cortex - van Hateren, van der Schaaf - 1998

324 | Pathwise coordinate optimization
- Friedman, Hastie, et al.
Citation Context: ... the observation that in the optimization problem in Equation (6), if we vary only one of the activations aj, while keeping the other activations fixed, the optimal value a*_j can be easily computed (Friedman et al., 2007). Letting B be a matrix with b_j as its j-th column, and r_j = b_j^T b_j:

a*_j = 0                         if |g_j − r_j a_j| ≤ β
a*_j = (−g_j + r_j a_j + β)/r_j   if g_j − r_j a_j > β
a*_j = (−g_j + r_j a_j − β)/r_j   if g_j − r_j a_j < −β

where g = ∇_a ‖...
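The closed-form coordinate update quoted in this context can be written directly in NumPy. We assume the smooth part of the objective is (1/2)‖x − Ba‖² with penalty β‖a‖₁, so that g is its gradient and r_j = b_j^T b_j; the function and variable names are ours, not the cited paper's:

```python
import numpy as np

def coordinate_update(B, x, a, j, beta):
    """Exactly minimize 0.5*||x - B a||^2 + beta*||a||_1 over a_j alone."""
    g = B.T @ (B @ a - x)        # gradient of the smooth part at the current a
    r_j = B[:, j] @ B[:, j]
    z = g[j] - r_j * a[j]        # gradient with a_j's own contribution removed
    if abs(z) <= beta:
        a[j] = 0.0               # soft-thresholded to exactly zero
    elif z > beta:
        a[j] = (-g[j] + r_j * a[j] + beta) / r_j
    else:                        # z < -beta
        a[j] = (-g[j] + r_j * a[j] - beta) / r_j
    return a
```

Since each update solves the one-dimensional subproblem exactly, the overall objective can never increase, which is what makes pathwise coordinate descent converge.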

297 | Self-taught learning: transfer learning from unlabeled data
- Raina, Battle, et al.
- 2007
Citation Context: ...two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples. In thi...

232 | Mapreduce for machine learning on multicore
- Chu, Kim, et al.
- 2006
Citation Context: ...lemented in parallel on multicore architectures, by having each core perform the required computations for a subset of input examples, and then combining the results centrally (Dean & Ghemawat, 2004; Chu et al., 2006). However, standard algorithms for DBNs and sparse coding are difficult to parallelize with such “data-parallel” schemes, because they involve iterative, stochastic parameter updates, where any updat...
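The "data-parallel" pattern described in this context can be sketched in a few lines: each worker computes a partial statistic over its subset of examples, and the results are combined centrally. Here the statistic is the logistic-regression log-loss gradient; the model, names, and worker count are illustrative assumptions, not the cited system:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(theta, X, y):
    """Unnormalized log-loss gradient over one worker's chunk of examples."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y)

def parallel_gradient(theta, X, y, n_workers=4):
    """Split the data, map partial gradients to workers, reduce centrally."""
    chunks = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda c: partial_gradient(theta, *c), chunks)
    return sum(parts) / len(X)   # central combination step
```

This works because the gradient decomposes into a sum over examples; the iterative, stochastic updates of DBN and sparse coding training do not decompose this way, which is the difficulty the snippet points out.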

178 | Scalable training of L1-regularized log-linear models
- Andrew, Gao
Citation Context: ...h two steps: first, keeping b fixed, we optimize over a, which leads to an L1-regularized least squares problem, that can be solved using custom-designed solvers (Efron et al., 2004; Lee et al., 2006; Andrew & Gao, 2007). Then, we keep a fixed, and optimize over b using convex optimization techniques (Lee et al., 2006). For problems with high-dimensional inputs and large numbers of basis vectors, the first step is pa...

163 | Sparse deep belief net model for visual area V2
- Lee, Ekanadham, et al.
- 2008
Citation Context: ...re image patches of the required size. Following previous work, we used Gaussian visible units and binary hidden units, and trained a sparse RBM by adding an additional penalty term to the objective (Lee et al., 2007); however, these modifications do not affect the running time results significantly. For learning, we performed one-step contrastive divergence updates using a mini-batch of 192 examples. Table 2 show...

150 | Scaling to very very large corpora for natural language disambiguation.
- Banko, Brill
- 2001
Citation Context: ...correction: it has been shown that simple, classical models can outperform newer, more complex models, just because the simple models can be tractably learnt using orders of magnitude more input data (Banko & Brill, 2001; Brants et al., 2007). Analogously, in our view, scaling up existing DBN and sparse coding models to use more parameters, or more training data, might produce very significant performance benefits. F...

77 | Fast support vector machine training and classification on graphics processors
- Catanzaro, Sundaram, et al.
- 2008
Citation Context: ...GPU threads can further subdivide the work in each block, often working with just a single element of an input example. GPUs have been applied to certain problems in machine learning, including SVMs (Catanzaro et al., 2008), and supervised learning in convolutional networks (Chellapilla et al., 2006). To continue this line of work, and to encourage further applications of deep belief networks and sparse coding, we will...

67 | Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition
- Kavukcuoglu, Ranzato, et al.
- 2008
Citation Context: ...imes with entire research papers devoted to ingenious methods devised specifically for each of these models (Hinton et al., 2006; Bengio et al., 2006; Murray & Kreutz-Delgado, 2006; Lee et al., 2006; Kavukcuoglu et al., 2008). Meanwhile, the raw clock speed of single CPUs has begun to hit a hardware power limit, and most of the growth in processing power is increasingly obtained by throwing together multiple CPU cores, i...

43 | Differentiable Sparse Coding
- Bagnell, Bradley
- 2009
Citation Context: ... of 900 arbitrary intensity values). Such a higher-level representation can then be applied to classification tasks, where it leads to good results even with limited labeled data (Raina et al., 2007; Bradley & Bagnell, 2008). Specifically, given inputs x ∈ R^k, sparse coding attempts to find basis vectors b = {b1, b2, ..., bn}, b_j ∈ R^k, such that each input x can be represented as a linear combination of a few basis v...

43 | Microprocessors for the New Millennium: Challenges, Opportunities, and New Frontiers
- Gelsinger
- 2001
Citation Context: ...single CPUs has begun to hit a hardware power limit, and most of the growth in processing power is increasingly obtained by throwing together multiple CPU cores, instead of speeding up a single core (Gelsinger, 2001; Frank, 2002). Recent work has shown that several popular learning algorithms such as logistic regression, linear SVMs and others can be easily implemented in parallel on multicore architectures, by ...

35 | Semi-supervised learning of compact document representations with deep networks - Ranzato, Szummer - 2008

24 | Learning sparse overcomplete codes for images.
- Murray, Kreutz-Delgado
- 2008
Citation Context: ...caling up DBN and sparse coding algorithms, sometimes with entire research papers devoted to ingenious methods devised specifically for each of these models (Hinton et al., 2006; Bengio et al., 2006; Murray & Kreutz-Delgado, 2006; Lee et al., 2006; Kavukcuoglu et al., 2008). Meanwhile, the raw clock speed of single CPUs has begun to hit a hardware power limit, and most of the growth in processing power is increasingly obtaine...

22 | High Performance Convolutional Neural Networks for Document Processing
- Chellapilla, Puri, et al.
- 2006
Citation Context: ... just a single element of an input example. GPUs have been applied to certain problems in machine learning, including SVMs (Catanzaro et al., 2008), and supervised learning in convolutional networks (Chellapilla et al., 2006). To continue this line of work, and to encourage further applications of deep belief networks and sparse coding, we will make our source code publicly available. Acknowledgments: We give warm thanks...

21 | Power-constrained CMOS scaling limits
- Frank
- 2002
Citation Context: ...egun to hit a hardware power limit, and most of the growth in processing power is increasingly obtained by throwing together multiple CPU cores, instead of speeding up a single core (Gelsinger, 2001; Frank, 2002). Recent work has shown that several popular learning algorithms such as logistic regression, linear SVMs and others can be easily implemented in parallel on multicore architectures, by having each c...

18 | High-performance implementation of the level-3 BLAS - Goto, van de Geijn

10 | Empirical evaluation of convolutional RBMs for vision
- Desjardins, Bengio
- 2008
Citation Context: ... be modified to share parameters in all patches, such that, for example, wA = wB in Figure 2. If overlapping patches are tiled one pixel apart, this model is identical to the convolutional RBM model (Desjardins & Bengio, 2008; Lee et al., 2009). Contrastive divergence learning in this model can be implemented by using convolutions to perform the Gibbs sampling operation h|x. For small to medium filter (patch) sizes, spati...

9 | Feature selection, L1 vs. L2 regularization, and rotational invariance
- Ng
- 2004
Citation Context: ...j| (6) The objective function is not differentiable because of the second term. This problem has recently received wide attention because of its robust feature selection properties (Tibshirani, 1996; Ng, 2004), and custom algorithms have been designed to solve it (Efron et al., 2004; Lee et al., 2006; Andrew & Gao, 2007). Some of these algorithms use sparse linear algebra operations to achieve efficiency...

6 | Large Language Models
- Brants, Popat, et al.
- 2007
Citation Context: ...n shown that simple, classical models can outperform newer, more complex models, just because the simple models can be tractably learnt using orders of magnitude more input data (Banko & Brill, 2001; Brants et al., 2007). Analogously, in our view, scaling up existing DBN and sparse coding models to use more parameters, or more training data, might produce very significant performance benefits. For example, it has be...

4 | Speeding up stochastic gradient descent
- Bengio
- 2007
Citation Context: ...ghly optimized multithreaded linear algebra packages: ATLAS BLAS (Whaley et al., 2001) and Goto BLAS (Goto & Van De Geijn, 2008). Consistent with previous results, we found that Goto BLAS was faster (Bengio, 2007), so we report CPU results using it. As input, we used a large dataset of natural images (van Hateren & van der Schaaf, 1997) and obtained input examples by randomly extracting square image patches ...

2 | Many-core GPU computing with NVIDIA
- Harris
- 2008
Citation Context: ...aper, we now review the basic ideas behind successful computation with GPUs. 2. Computing with graphics processors We illustrate the principles of GPU computing using Nvidia’s CUDA programming model (Harris, 2008). Figure 1 shows a simplified schematic of a typical Nvidia GPU. The GPU hardware provides two levels of parallelism: there are several multiprocessors (MPs), and each multiprocessor contains several...
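The two-level decomposition described in this context (a grid of blocks mapped to multiprocessors, with threads subdividing each block's work, often down to one element per thread) can be emulated with plain index arithmetic. This Python loop only mimics the indexing a CUDA kernel would use; it runs serially, and the function name and example operation are our own:

```python
def saxpy_emulated(alpha, x, y, threads_per_block=4):
    """Emulate the CUDA-style grid/block/thread indexing for out = alpha*x + y."""
    n = len(x)
    out = [0.0] * n
    # Grid size: enough blocks to cover every element (ceiling division).
    n_blocks = (n + threads_per_block - 1) // threads_per_block
    for block in range(n_blocks):                # each block -> one multiprocessor
        for thread in range(threads_per_block):  # threads subdivide the block
            i = block * threads_per_block + thread  # global element index
            if i < n:                            # guard the ragged last block
                out[i] = alpha * x[i] + y[i]
    return out
```

On a real GPU, the two loops disappear: every (block, thread) pair runs concurrently, and only the index computation and the guarded element-wise work remain in the kernel.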