Results 1 - 10 of 103
Spatial pyramid pooling in deep convolutional networks for visual recognition
- In ECCV
"... Abstract. Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled poo ..."
Abstract
-
Cited by 52 (5 self)
- Add to MetaCart
(Show Context)
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
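The fixed-length property is easy to see in code. Below is a minimal sketch of spatial pyramid pooling, not the authors' implementation; the pyramid levels (1×1, 2×2, 4×4) and the use of PyTorch's adaptive max pooling are illustrative assumptions.

# Minimal spatial pyramid pooling sketch (assumptions noted above).
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Pool a (N, C, H, W) feature map into a fixed-length vector per image.

    Each pyramid level divides the map into level x level bins and
    max-pools inside every bin, so the output length is
    C * sum(level**2 for level in levels) regardless of H and W.
    """
    n, c = feature_maps.shape[:2]
    pooled = []
    for level in levels:
        # adaptive_max_pool2d picks bin boundaries for any input size
        bins = F.adaptive_max_pool2d(feature_maps, output_size=level)
        pooled.append(bins.view(n, -1))
    return torch.cat(pooled, dim=1)

# Two different input sizes yield the same representation length:
for h, w in [(13, 13), (24, 17)]:
    x = torch.randn(1, 256, h, w)
    print(spatial_pyramid_pool(x).shape)  # torch.Size([1, 5376]) both times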
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
, 2015
"... Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on the learnable activation and advanced initialization, we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [33]). To our knowledge, our result is the first to surpass the reported human-level performance (5.1%, [26]) on this dataset.
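Both ideas fit in a few lines. The sketch below spells out PReLU (f(x) = max(0, x) + a·min(0, x) with a learnable per-channel slope a) and applies variance-preserving fan-in initialization to a convolution; the layer sizes are arbitrary assumptions, and PyTorch also ships these as nn.PReLU and nn.init.kaiming_normal_.

# PReLU and rectifier-aware initialization, written out explicitly.
import torch
import torch.nn as nn

class PReLU(nn.Module):
    def __init__(self, num_channels, init=0.25):
        super().__init__()
        # one learnable slope per channel for the negative part
        self.a = nn.Parameter(torch.full((num_channels,), init))

    def forward(self, x):
        a = self.a.view(1, -1, 1, 1)
        return torch.clamp(x, min=0) + a * torch.clamp(x, max=0)

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
# Variance 2 / fan_in keeps activations from vanishing or exploding
# through deep stacks of rectified layers.
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
x = torch.randn(2, 64, 32, 32)
y = PReLU(128)(conv(x))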
Deep learning face representation by joint identification-verification
- In Advances in Neural Information Processing Systems
, 2014
"... The key challenge of face recognition is to develop effective feature repre-sentations for reducing intra-personal variations while enlarging inter-personal differences. In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals a ..."
Abstract
-
Cited by 22 (3 self)
- Add to MetaCart
The key challenge of face recognition is to develop effective feature representations for reducing intra-personal variations while enlarging inter-personal differences. In this paper, we show that it can be well solved with deep learning and using both face identification and verification signals as supervision. The Deep IDentification-verification features (DeepID2) are learned with carefully designed deep convolutional networks. The face identification task increases the inter-personal variations by drawing DeepID2 extracted from different identities apart, while the face verification task reduces the intra-personal variations by pulling DeepID2 extracted from the same identity together, both of which are essential to face recognition. The learned DeepID2 features can be well generalized to new identities unseen in the training data. On the challenging LFW dataset [11], 99.15% face verification accuracy is achieved. Compared with the best deep learning result [21] on LFW, the error rate has been significantly reduced by 67%.
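A hedged sketch of the joint objective: a softmax identification term plus a contrastive verification term on feature pairs. The margin and the weighting between the two terms are illustrative guesses, not the paper's values.

# Joint identification-verification loss sketch (weights are guesses).
import torch
import torch.nn.functional as F

def joint_loss(feat1, feat2, logits1, logits2, id1, id2,
               margin=1.0, verif_weight=0.05):
    # identification: classify each face into its identity, which
    # pushes features of different identities apart
    ident = F.cross_entropy(logits1, id1) + F.cross_entropy(logits2, id2)
    # verification: pull same-identity pairs together, push
    # different-identity pairs at least `margin` apart
    dist = F.pairwise_distance(feat1, feat2)
    same = (id1 == id2).float()
    verif = same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)
    return ident + verif_weight * verif.mean()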
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
"... Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify ob-jects in images with near-human-level performance, ques-tions naturally arise as t ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion as a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects. Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.
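The gradient-ascent variant mentioned in the abstract is compact enough to sketch: start from noise and repeatedly step the image in the direction that raises one class's softmax confidence. The model, step size, and step count below are placeholders.

# Gradient-ascent fooling-image sketch (hyperparameters are placeholders).
import torch
import torch.nn.functional as F

def fooling_image(model, target_class, steps=200, lr=0.5,
                  shape=(1, 3, 224, 224)):
    model.eval()
    img = torch.randn(shape, requires_grad=True)  # start from noise
    for _ in range(steps):
        conf = F.softmax(model(img), dim=1)[0, target_class]
        model.zero_grad()
        conf.backward()
        with torch.no_grad():
            img += lr * img.grad   # ascend the confidence, not descend
            img.grad.zero_()
    # often unrecognizable to humans yet classified with ~1.0 confidence
    return img.detach()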
Surpassing human-level face verification performance on LFW with GaussianFace
, 2014
"... Face verification remains a challenging problem in very complex conditions with large variations such as pose, illumination, expression, and occlusions. This problem is exacerbated when we rely unrealistically on a single training data source, which is often insufficient to cover the intrinsically c ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
(Show Context)
Face verification remains a challenging problem in very complex conditions with large variations such as pose, illumination, expression, and occlusions. This problem is exacerbated when we rely unrealistically on a single training data source, which is often insufficient to cover the intrinsically complex face variations. This paper proposes a principled multi-task learning approach based on the Discriminative Gaussian Process Latent Variable Model, named GaussianFace, to enrich the diversity of training data. In comparison to existing methods, our model exploits additional data from multiple source domains to improve the generalization performance of face verification in an unknown target domain. Importantly, our model can adapt automatically to complex data distributions, and therefore can well capture complex face variations inherent in multiple sources. Extensive experiments demonstrate the effectiveness of the proposed model in learning from diverse data sources and generalizing to unseen domains. Specifically, our algorithm achieves an impressive accuracy of 98.52% on the well-known and challenging Labeled Faces in the Wild (LFW) benchmark [23]. For the first time, the human-level performance in face verification (97.53%) [28] on LFW is surpassed.
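The full multi-task Discriminative GPLVM is too involved to sketch faithfully here; as a toy stand-in for the Gaussian-process side of the method only, the snippet below frames verification as binary GP classification over the element-wise difference of a face pair's descriptors. The descriptors and labels are synthetic.

# Toy GP verification stand-in (NOT the GaussianFace model).
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# fake pair features: |f(a) - f(b)| for 200 pairs of 32-D descriptors
X = np.abs(rng.normal(size=(200, 32)) - rng.normal(size=(200, 32)))
y = rng.integers(0, 2, size=200)  # 1 = same identity, 0 = different

gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp.fit(X, y)
print(gp.predict_proba(X[:3]))    # posterior same/different probability per pair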
Eigen-PEP for Video Face Recognition
"... Abstract. To effectively solve the problem of large scale video face recognition, we argue for a comprehensive, compact, and yet flexible rep-resentation of a face subject. It shall comprehensively integrate the visual information from all relevant video frames of the subject in a compact form. It s ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
(Show Context)
To effectively solve the problem of large scale video face recognition, we argue for a comprehensive, compact, and yet flexible representation of a face subject. It shall comprehensively integrate the visual information from all relevant video frames of the subject in a compact form. It shall also be flexible to be incrementally updated, incorporating new or retiring obsolete observations. In search for such a representation, we present the Eigen-PEP that is built upon the recent success of the probabilistic elastic part (PEP) model. It first integrates the information from relevant video sources by a part-based average pooling through the PEP model, which produces an intermediate high dimensional, part-based, and pose-invariant representation. We then compress the intermediate representation through principal component analysis, and only a number of principal eigen dimensions are kept (as small as 100). We evaluate the Eigen-PEP representation both for video-based face verification and identification on the YouTube Faces Dataset and a new Celebrity-1000 video face dataset, respectively. On YouTube Faces, we further improve the state-of-the-art recognition accuracy. On Celebrity-1000, we lead the competing baselines by a significant margin while offering a scalable solution that is linear with respect to the number of subjects.
[Fig. 1: Sample images from three unconstrained face recognition datasets: (a) LFW, (b) YouTube Faces, (c) Celebrity-1000.]
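The pool-then-compress pipeline is straightforward to sketch on synthetic data; the PEP descriptors themselves are assumed given, and the dimensions are illustrative.

# Eigen-PEP's two stages on synthetic descriptors.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_subjects, n_frames, dim = 50, 20, 4096
frames = rng.normal(size=(n_subjects, n_frames, dim))

# stage 1: part-based average pooling over all frames of each subject
pooled = frames.mean(axis=1)            # (50, 4096)

# stage 2: keep only leading eigen dimensions (paper: as small as ~100;
# here 40, since PCA needs n_components <= number of samples)
pca = PCA(n_components=40)
eigen_pep = pca.fit_transform(pooled)   # (50, 40) compact representation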
PCANet: A simple deep learning baseline for image classification?
- arXiv preprint arXiv:1404.3606
, 2014
"... Abstract — In this paper, we propose a very simple deep learning network for image classification that is based on very basic data processing components: 1) cascaded principal com-ponent analysis (PCA); 2) binary hashing; and 3) blockwise histograms. In the proposed architecture, the PCA is employed ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
(Show Context)
In this paper, we propose a very simple deep learning network for image classification that is based on very basic data processing components: 1) cascaded principal component analysis (PCA); 2) binary hashing; and 3) blockwise histograms. In the proposed architecture, the PCA is employed to learn multistage filter banks. This is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus called the PCA network (PCANet) and can be extremely easily and efficiently designed and learned. For comparison and to provide a better understanding, we also introduce and study two simple variations of PCANet: 1) RandNet and 2) LDANet. They share the same topology as PCANet, but their cascaded filters are either randomly selected or learned from linear discriminant analysis. We have extensively tested these basic networks on many benchmark visual data sets …
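The first PCANet stage, learning a filter bank as the leading principal directions of mean-removed image patches, can be sketched as follows; the patch and filter counts are illustrative, and the hashing and histogram stages are omitted.

# First-stage PCA filter learning, PCANet-style.
import numpy as np

def pca_filters(images, k=7, n_filters=8):
    """Learn n_filters k x k filters as leading principal patch directions."""
    patches = []
    for img in images:                        # each img: (H, W), grayscale
        h, w = img.shape
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                p = img[i:i + k, j:j + k].ravel()
                patches.append(p - p.mean())  # remove the patch mean
    X = np.stack(patches)                     # (num_patches, k*k)
    # rows of vt are the principal directions of the patch set
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:n_filters].reshape(n_filters, k, k)

imgs = [np.random.rand(32, 32) for _ in range(10)]
filters = pca_filters(imgs)                   # (8, 7, 7) convolution kernels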
Privacy Behaviors of Lifeloggers using Wearable Cameras
- UbiComp
"... A number of wearable ‘lifelogging ’ camera devices have been released recently, allowing consumers to capture images and other sensor data continuously from a first-person perspec-tive. Unlike traditional cameras that are used deliberately and sporadically, lifelogging devices are always ‘on ’ and a ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
(Show Context)
A number of wearable ‘lifelogging’ camera devices have been released recently, allowing consumers to capture images and other sensor data continuously from a first-person perspective. Unlike traditional cameras that are used deliberately and sporadically, lifelogging devices are always ‘on’ and automatically capturing images. Such features may challenge users’ (and bystanders’) expectations about privacy and control of image gathering and dissemination. While lifelogging cameras are growing in popularity, little is known about privacy perceptions of these devices or what kinds of privacy challenges they are likely to create. To explore how people manage privacy in the context of lifelogging cameras, as well as which kinds of first-person images people consider ‘sensitive’, we conducted an in situ user study (N = 36) in which participants wore a lifelogging device for a week, answered questionnaires about the collected images, and participated in an exit interview. Our findings indicate that: 1) some people may prefer to manage privacy through in situ physical control of image collection in order to avoid later burdensome review of all collected images; 2) a combination of factors including time, location, and the objects and people appearing in the photo determines its ‘sensitivity’; and 3) people are concerned about the privacy of bystanders, despite reporting almost no opposition or concerns expressed by bystanders over the course of the study.
Keywords: lifelogging; wearable cameras; privacy
From generic to specific deep representations for visual recognition
- CoRR
"... Evidence is mounting that CNNs are currently the most efficient and successful way to learn visual representations. This paper address the questions on why CNN representations are so effective and how to improve them if one wants to maximize performance for a single task or a range of tasks. We asse ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Evidence is mounting that CNNs are currently the most efficient and successful way to learn visual representations. This paper addresses the questions of why CNN representations are so effective and how to improve them if one wants to maximize performance for a single task or a range of tasks. We assess experimentally the importance of different aspects of learning and choosing a CNN representation for its performance on a diverse set of visual recognition tasks. In particular, we investigate how altering the parameters in a network’s architecture and its training impacts the representation’s ability to specialize and generalize. We also study the effect of fine-tuning a generic network towards a particular task. Extensive experiments indicate two trends: (a) increasing specialization increases performance on the target task but can hurt the ability to generalize to other tasks, and (b) the less specialized the original network, the more likely it is to benefit from fine-tuning. As by-products we have learnt several deep CNN image representations which, when combined with a simple linear SVM classifier or similarity measure, produce the best performance on 12 standard datasets measuring the ability to solve visual recognition tasks ranging from image classification to image retrieval.
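The "generic CNN features + linear SVM" recipe the paper evaluates is a two-step pipeline; the sketch below uses an off-the-shelf torchvision network as the feature extractor, which is an assumption (the paper trains and compares its own networks).

# Generic deep features fed to a linear classifier.
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

net = models.resnet18(weights='DEFAULT')  # stand-in generic network
net.fc = torch.nn.Identity()              # expose penultimate-layer features
net.eval()

def extract(batch):
    # batch: (N, 3, 224, 224), already normalized for the network
    with torch.no_grad():
        return net(batch).numpy()

# train_x, train_y assumed to be preprocessed images and labels:
# clf = LinearSVC().fit(extract(train_x), train_y)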
Understanding neural networks through deep visualization
- In ICML Workshop on Deep Learning
"... Recent years have produced great advances in training large, deep neural networks (DNNs), in-cluding notable successes in training convolu-tional neural networks (convnets) to recognize natural images. However, our understanding of how these models work, especially what compu-tations they perform at ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
(Show Context)
Recent years have produced great advances in training large, deep neural networks (DNNs), including notable successes in training convolutional neural networks (convnets) to recognize natural images. However, our understanding of how these models work, especially what computations they perform at intermediate layers, has lagged behind. Progress in the field will be further accelerated by the development of better tools for visualizing and interpreting neural nets. We introduce two such tools here. The first is a tool that visualizes the activations produced on each layer of a trained convnet as it processes an image or video (e.g. a live webcam stream). We have found that looking at live activations that change in response to user input helps build valuable intuitions about how convnets work. The second tool enables visualizing features at each layer of a DNN via regularized optimization in image space. Because previous versions of this idea produced less recognizable images, here we introduce several new regularization methods that combine to produce qualitatively clearer, more interpretable visualizations. Both tools are open source and work on a pretrained convnet with minimal setup.
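In the spirit of the second tool, here is a hedged sketch of regularized activation maximization: gradient ascent on a chosen unit's activation with L2 decay and periodic blurring, two regularizers of the kind the paper combines. All hyperparameters are guesses, and layer_out is a hypothetical helper the caller supplies to return the target unit's scalar activation.

# Regularized activation maximization sketch (hyperparameters are guesses).
import torch
import torchvision.transforms.functional as TF

def visualize_unit(model, layer_out, steps=150, lr=1.0,
                   decay=0.01, blur_every=4):
    img = torch.randn(1, 3, 224, 224, requires_grad=True)
    for step in range(steps):
        act = layer_out(model, img)  # hypothetical: scalar activation of unit
        act.backward()
        with torch.no_grad():
            img += lr * img.grad
            img *= (1 - decay)       # L2 decay pulls pixels toward gray
            img.grad.zero_()
            if step % blur_every == 0:
                # blurring suppresses high-frequency noise in the result
                img.copy_(TF.gaussian_blur(img, kernel_size=3))
    return img.detach()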