Results 1  10
of
57
Object Detection with Discriminatively Trained Part Based Models
"... We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves stateoftheart results in the PASCAL object detection challenges. While deformable part models have become quite popular, their ..."
Abstract

Cited by 1422 (49 self)
 Add to MetaCart
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves stateoftheart results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL datasets. Our system relies on new methods for discriminative training with partially labeled data. We combine a marginsensitive approach for datamining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MISVM in terms of latent variables. A latent SVM is semiconvex and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
A discriminatively trained, multiscale, deformable part model
 In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2008
, 2008
"... This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a twofold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge ..."
Abstract

Cited by 555 (11 self)
 Add to MetaCart
(Show Context)
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a twofold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a marginsensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a nonconvex training problem. However, a latent SVM is semiconvex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose. 1.
Learning Deep Architectures for AI
"... Theoretical results suggest that in order to learn the kind of complicated functions that can represent highlevel abstractions (e.g. in vision, language, and other AIlevel tasks), one may need deep architectures. Deep architectures are composed of multiple levels of nonlinear operations, such as i ..."
Abstract

Cited by 183 (30 self)
 Add to MetaCart
Theoretical results suggest that in order to learn the kind of complicated functions that can represent highlevel abstractions (e.g. in vision, language, and other AIlevel tasks), one may need deep architectures. Deep architectures are composed of multiple levels of nonlinear operations, such as in neural nets with many hidden layers or in complicated propositional formulae reusing many subformulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the stateoftheart in certain areas. This paper discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of singlelayer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
Y.: Sparse feature learning for deep belief networks
 In: Advances in Neural Information Processing Systems (NIPS 2007
, 2007
"... Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the ..."
Abstract

Cited by 130 (14 self)
 Add to MetaCart
(Show Context)
Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the tradeoff between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, highorder dependencies between the input observed variables can be captured. 1
Synergistic face detection and pose estimation with energybased model
 In NIPS
, 2005
"... We describe a novel method for realtime, simultaneous multiview face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and nonface images to points far from that manifold. This network is trained by ..."
Abstract

Cited by 102 (14 self)
 Add to MetaCart
(Show Context)
We describe a novel method for realtime, simultaneous multiview face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and nonface images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/nonface label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multiview face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together. 1
TaskDriven Dictionary Learning
"... Abstract—Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that ..."
Abstract

Cited by 86 (3 self)
 Add to MetaCart
Abstract—Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a largescale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in largescale settings, and is well suited to supervised and semisupervised classification, as well as regression tasks for data that admit sparse representations. Index Terms—Basis pursuit, Lasso, dictionary learning, matrix factorization, semisupervised learning, compressed sensing. Ç 1
Maxmargin hidden conditional random fields for human action recognition
, 2008
"... We present a new method for classification with structured latent variables. Our model is formulated using the maxmargin formalism in the discriminative learning literature. We propose an efficient learning algorithm based on the cutting plane method and decomposed dual optimization. We apply our m ..."
Abstract

Cited by 52 (9 self)
 Add to MetaCart
(Show Context)
We present a new method for classification with structured latent variables. Our model is formulated using the maxmargin formalism in the discriminative learning literature. We propose an efficient learning algorithm based on the cutting plane method and decomposed dual optimization. We apply our model to the problem of recognizing human actions from video sequences, where we model a human action as a global root template and a constellation of several “parts”. We show that our model outperforms another similar method that uses hidden conditional random fields, and is comparable to other stateoftheart approaches. More importantly, our proposed work is quite general and can potentially be applied in a wide variety of vision problems that involve various complex, interdependent latent structures. 1.
Solving Multiclass Support Vector Machines with LaRank
 In 24th International Conference on Machine Learning
, 2007
"... Optimization algorithms for large margin multiclass recognizers are often too costly to handle ambitious problems with structured outputs and exponential numbers of classes. Optimization algorithms that rely on the full gradient are not effective because, unlike the solution, the gradient is not spa ..."
Abstract

Cited by 45 (3 self)
 Add to MetaCart
(Show Context)
Optimization algorithms for large margin multiclass recognizers are often too costly to handle ambitious problems with structured outputs and exponential numbers of classes. Optimization algorithms that rely on the full gradient are not effective because, unlike the solution, the gradient is not sparse and is very large. The LaRank algorithm sidesteps this difficulty by relying on a randomized exploration inspired by the perceptron algorithm. We show that this approach is competitive with gradient based optimizers on simple multiclass problems. Furthermore, a single LaRank pass over the training examples delivers test error rates that are nearly as good as those of the final solution. 1.
Convolutional Networks and Applications in Vision
"... Abstract — Intelligent tasks, such as visual perception, auditory perception, and language understanding require the construction of good internal representations of the world (or ”features”), which must be invariant to irrelevant variations of the input while, preserving relevant information. A maj ..."
Abstract

Cited by 41 (2 self)
 Add to MetaCart
(Show Context)
Abstract — Intelligent tasks, such as visual perception, auditory perception, and language understanding require the construction of good internal representations of the world (or ”features”), which must be invariant to irrelevant variations of the input while, preserving relevant information. A major question for Machine Learning is how to learn such good features automatically. Convolutional Networks (ConvNets) are a biologicallyinspired trainable architecture that can learn invariant features. Each stage in a ConvNets is composed of a filter bank, some nonlinearities, and feature pooling layers. With multiple stages, a ConvNet can learn multilevel hierarchies of features. While ConvNets have been successfully deployed in many commercial applications from OCR to video surveillance, they require large amounts of labeled training samples. We describe new unsupervised learning algorithms, and new nonlinear stages that allow ConvNets to be trained with very few labeled samples. Applications to visual object recognition and vision navigation for offroad mobile robots are described.
A unified framework for multitarget tracking and collective activity recognition
 In ECCV
, 2012
"... Abstract. We present a coherent, discriminative framework for simultaneously tracking multiple people and estimating their collective activities. Instead of treating the two problems separately, our model is grounded in the intuition that a strong correlation exists between a person’s motion, their ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
(Show Context)
Abstract. We present a coherent, discriminative framework for simultaneously tracking multiple people and estimating their collective activities. Instead of treating the two problems separately, our model is grounded in the intuition that a strong correlation exists between a person’s motion, their activity, and the motion and activities of other nearby people. Instead of directly linking the solutions to these two problems, we introduce a hierarchy of activity types that creates a natural progression that leads from a specific person’s motion to the activity of the group as a whole. Our model is capable of jointly tracking multiple people, recognizing individual activities (atomic activities), the interactions between pairs of people (interaction activities), and finally the behavior of groups of people (collective activities). We also propose an algorithm for solving this otherwise intractable joint inference problem by combining belief propagation with a version of the branch and bound algorithm equipped with integer programming. Experimental results on challenging video datasets demonstrate our theoretical claims and indicate that our model achieves the best collective activity classification results to date.