## Object based Scene Representations using Fisher Scores of Local Subspace Projections

### Citations

5864 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context ... (5) In summary, the Fisher score ∇_θL(θ)|_{θ=θb} of background model θb is the gradient of the Q-function of EM evaluated at reference model θb. The computation of the score thus simplifies into the two steps of EM. First, the E step computes the Q function Q(p(z|x; θb); θ) at the reference θb. Second, the M-step evaluates the gradient of the Q function with respect to θ at θ = θb. This interpretation of the Fisher score is particularly helpful when efficient implementations of the EM algorithm are available, e.g. the recursive Baum-Welch computations commonly used to learn hidden Markov models [15]. For more tractable distributions, such as the GMM, it enables the simple reuse of the EM equations, which are always required to learn the reference model θb, to compute the Fisher score. 2.2 Bag of features Fisher scores are usually combined with the bag-of-features representation, where an image is described as an orderless collection of localized descriptors D = {x1, x2, . . . xn}. These were traditionally SIFT descriptors, but have more recently been replaced with responses of object recognition CNNs [6, 1, 2]. In this work we use the semantic features proposed in [2], which are obtain... |
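The two-step computation described in this snippet (E-step responsibilities at the reference model, then the gradient of the Q-function) can be sketched for a diagonal-covariance GMM as follows. All data and parameter values here are made up for illustration; only the mean-gradient part of the score is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # bag of descriptors D = {x_1..x_n}
w = np.array([0.5, 0.5])                 # mixture weights of reference model
mu = np.array([[-1.0, 0.0], [1.0, 0.0]]) # component means (K x d)
var = np.ones((2, 2))                    # diagonal variances (K x d)

# E-step: responsibilities gamma_ik = p(z = k | x_i) at the reference model
log_p = (np.log(w)
         - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
         - 0.5 * (((X[:, None, :] - mu) ** 2) / var).sum(axis=2))
gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
gamma /= gamma.sum(axis=1, keepdims=True)

# M-step gradient: derivative of the Q-function w.r.t. the means,
# evaluated at the reference parameters (one gradient per component)
score_mu = np.einsum('ik,ikd->kd', gamma, (X[:, None, :] - mu) / var)
```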

1520 | Gradient-based learning applied to document recognition
- Lecun, Bottou, et al.
- 1998
Citation Context ...e complementary, in the sense that their combination outperforms each of the representations by itself. When combined, they produce a state-of-the-art scene classifier. 1 Introduction In recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved remarkable performance on traditional vision problems such as image classification [8, 18, 26], object detection and localization [5, 16] and others. The success of CNNs can be attributed to their ability to learn highly discriminative, non-linear, visual transformations with the help of supervised backpropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets, a remarkable property of these classifiers is the solution of the dataset bias problem [20] that has plagued computer vision for decades. It has now been shown many times that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily fine-tuned to solve a related problem on another dataset (e.g. object detection on the Pascal VOC or MS-COCO). Less clear, however, is the robustness of current CNNs to the problem of task bias, i.e. their ability to generalize acc... |

1007 | Imagenet classification with deep convolutional neural networks
- Krizhevsky, Sutskever, et al.
Citation Context ...experiments show that the MFA-FS has state of the art performance for object-to-scene transfer and this transfer actually outperforms the training of a scene CNN from a large scene dataset. The two representations are also shown to be complementary, in the sense that their combination outperforms each of the representations by itself. When combined, they produce a state-of-the-art scene classifier. 1 Introduction In recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved remarkable performance on traditional vision problems such as image classification [8, 18, 26], object detection and localization [5, 16] and others. The success of CNNs can be attributed to their ability to learn highly discriminative, non-linear, visual transformations with the help of supervised backpropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets, a remarkable property of these classifiers is the solution of the dataset bias problem [20] that has plagued computer vision for decades. It has now been shown many times that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily fine-t... |

545 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1999
Citation Context ...the performance of the best “directly learned” CNNs [26], can be substantially improved by fusion with object recognition CNNs [6, 11, 2]. So far, the transfer from object CNNs to holistic scene description has been most extensively studied in the area of scene classification, where state of the art results have been obtained with the bag of semantics representation of [2]. This consists of feeding image patches through an object recognition CNN, collecting a bag of vectors of object recognition scores, and embedding this bag into a fixed dimensional vector space with recourse to a Fisher vector [7]. While there are variations of detail, all other competitive methods are based on a similar architecture [6, 11]. This observation is, in principle, applicable to other tasks. For example, the state of the art in image captioning is to use a CNN as an image encoder that extracts a feature vector from the image. This feature vector is then fed to a natural language decoder (typically an LSTM) that produces sentences. While there has not yet been an extensive investigation of the best image encoder, it is likely that the best representations for scene classification should also be effective enco... |

355 | Improving the Fisher Kernel for large-scale image classification
- Perronnin, Sánchez, Mensink
Citation Context ... has not yet been an extensive investigation of the best image encoder, it is likely that the best representations for scene classification should also be effective encodings for language generation. For these reasons, we restrict our attention to the scene classification problem in the remainder of this work, focusing on the question of how to address possible limitations of the Fisher vector embedding. We note, in particular, that while Fisher vectors have been classically defined using gradients of image log-likelihood with respect to the means and variances of a Gaussian mixture model (GMM) [13], this definition has not been applied universally in the CNN transfer context, where variance statistics are often disregarded [6, 2]. In this work we make several contributions to the use of Fisher vector type of representations for object to scene transfer. The first is to show that, for object recognition scores produced by a CNN [2], variance statistics are much less informative of scene class distributions than the mean gradients, and can even degrade scene classification performance. We then argue that this is due to the inability of the standard GMM of diagonal covariances to provide a... |

303 | Sun database: Large-scale scene recognition from abbey to zoo
- Xiao, Hays, et al.
Citation Context ...res [21, 3]. Although these methods describe global covariance structure, they lack the ability of the MFA-FS to capture that information along locally linear approximations of the highly non-linear CNN feature manifold. This is shown to be important, as the MFA-FS is shown to outperform all these representations by non-trivial margins. Finally, we show that the MFA-FS enables effective task transfer, by showing that MFA-FS vectors extracted from deep CNNs trained for ImageNet object recognition [8, 18], achieve state-of-the-art results on challenging scene recognition benchmarks, such as SUN [25] and MIT Indoor Scenes [14]. 2 Fisher scores In computer vision, an image I is frequently interpreted as a set of descriptors D = {x1, . . . , xn} sampled from some generative model p(x; θ). Since most classifiers require fixed-length inputs, it is common to map the set D into a fixed-length vector. A popular mapping consists of computing the gradient (with respect to θ) of the log-likelihood ∇_θL(θ) = ∂/∂θ log p(D; θ) for a model θb. This is known as the Fisher score of θ. This gradient vector is often normalized by the square root of the Fisher information matrix F, according to F^{-1/2} ∇_θL(θ)... |

277 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context ...eral contributions to the use of Fisher vector type of representations for object to scene transfer. The first is to show that, for object recognition scores produced by a CNN [2], variance statistics are much less informative of scene class distributions than the mean gradients, and can even degrade scene classification performance. We then argue that this is due to the inability of the standard GMM of diagonal covariances to provide a good approximation to the non-linear manifold of CNN responses. This leads to the adoption of a richer generative model, the mixture of factor analyzers (MFA) [4, 22], which locally approximates the scene class manifold by low-dimensional linear spaces. Our second contribution is to show that, by locally projecting the feature data into these spaces, the MFA can efficiently model its local covariance structure. For this, we derive the Fisher score of the MFA model, denoted the MFA Fisher score (MFA-FS), a representation similar to the GMM Fisher vector of [13, 17]. We show that, for high dimensional CNN features, the MFA-FS captures highly discriminative covariance statistics, which were previously unavailable in [6, 2], producing significantly improved sc... |
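The mixture of factor analyzers (MFA) mentioned in this snippet models data near low-dimensional linear subspaces, one per component: x = μ_k + Λ_k z + ε with z ~ N(0, I_R) and isotropic noise ε. A rough sampling sketch, with all dimensions and parameter values made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, R, K = 64, 5, 3                       # feature dim, factors, components (illustrative)
w = np.full(K, 1.0 / K)                  # uniform mixture weights
mu = rng.normal(size=(K, d))             # component means
Lam = 0.5 * rng.normal(size=(K, d, R))   # factor loading matrices Lambda_k
sigma2 = 0.01                            # isotropic observation noise

def sample_mfa(n):
    k = rng.choice(K, size=n, p=w)       # pick a component per sample
    z = rng.normal(size=(n, R))          # low-dimensional latent factors
    eps = np.sqrt(sigma2) * rng.normal(size=(n, d))
    # x = mu_k + Lambda_k z + eps: a point near the k-th local subspace
    return mu[k] + np.einsum('ndr,nr->nd', Lam[k], z) + eps

X = sample_mfa(200)
```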

250 | Rich feature hierarchies for accurate object detection and semantic segmentation /
- Girshick
- 2014
Citation Context ...he art performance for object-to-scene transfer and this transfer actually outperforms the training of a scene CNN from a large scene dataset. The two representations are also shown to be complementary, in the sense that their combination outperforms each of the representations by itself. When combined, they produce a state-of-the-art scene classifier. 1 Introduction In recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved remarkable performance on traditional vision problems such as image classification [8, 18, 26], object detection and localization [5, 16] and others. The success of CNNs can be attributed to their ability to learn highly discriminative, non-linear, visual transformations with the help of supervised backpropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets, a remarkable property of these classifiers is the solution of the dataset bias problem [20] that has plagued computer vision for decades. It has now been shown many times that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily fine-tuned to solve a related problem on another ... |

167 | Recognizing indoor scenes
- Quattoni, Torralba
- 2009
Citation Context ... methods describe global covariance structure, they lack the ability of the MFA-FS to capture that information along locally linear approximations of the highly non-linear CNN feature manifold. This is shown to be important, as the MFA-FS is shown to outperform all these representations by non-trivial margins. Finally, we show that the MFA-FS enables effective task transfer, by showing that MFA-FS vectors extracted from deep CNNs trained for ImageNet object recognition [8, 18], achieve state-of-the-art results on challenging scene recognition benchmarks, such as SUN [25] and MIT Indoor Scenes [14]. 2 Fisher scores In computer vision, an image I is frequently interpreted as a set of descriptors D = {x1, . . . , xn} sampled from some generative model p(x; θ). Since most classifiers require fixed-length inputs, it is common to map the set D into a fixed-length vector. A popular mapping consists of computing the gradient (with respect to θ) of the log-likelihood ∇_θL(θ) = ∂/∂θ log p(D; θ) for a model θb. This is known as the Fisher score of θ. This gradient vector is often normalized by the square root of the Fisher information matrix F, according to F^{-1/2} ∇_θL(θ). This is referred to as t... |
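The definition above can be made concrete with a minimal sketch: for a single univariate Gaussian with known variance, the Fisher score with respect to the mean and its F^{-1/2} normalization reduce to a few lines. The toy data and parameter values are assumptions of this sketch, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=0.3, scale=1.0, size=50)  # toy descriptor set D = {x_1..x_n}
mu, sigma2 = 0.0, 1.0                        # reference model theta_b

# Fisher score: gradient of the log-likelihood w.r.t. mu at the reference model
score = np.sum((D - mu) / sigma2)

# Fisher information for mu under this model: F = n / sigma^2
F = len(D) / sigma2
fisher_vector = score / np.sqrt(F)           # the F^(-1/2) normalization
```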

154 | Very Deep Convolutional Networks for Large-Scale Image Recognition.
- Simonyan, Zisserman
- 2015
Citation Context ...experiments show that the MFA-FS has state of the art performance for object-to-scene transfer and this transfer actually outperforms the training of a scene CNN from a large scene dataset. The two representations are also shown to be complementary, in the sense that their combination outperforms each of the representations by itself. When combined, they produce a state-of-the-art scene classifier. 1 Introduction In recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved remarkable performance on traditional vision problems such as image classification [8, 18, 26], object detection and localization [5, 16] and others. The success of CNNs can be attributed to their ability to learn highly discriminative, non-linear, visual transformations with the help of supervised backpropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets, a remarkable property of these classifiers is the solution of the dataset bias problem [20] that has plagued computer vision for decades. It has now been shown many times that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily fine-t... |

153 | Unbiased look at dataset bias
- Torralba, Efros
- 2011
Citation Context ...Introduction In recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved remarkable performance on traditional vision problems such as image classification [8, 18, 26], object detection and localization [5, 16] and others. The success of CNNs can be attributed to their ability to learn highly discriminative, non-linear, visual transformations with the help of supervised backpropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets, a remarkable property of these classifiers is the solution of the dataset bias problem [20] that has plagued computer vision for decades. It has now been shown many times that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily fine-tuned to solve a related problem on another dataset (e.g. object detection on the Pascal VOC or MS-COCO). Less clear, however, is the robustness of current CNNs to the problem of task bias, i.e. their ability to generalize across tasks. Given the large number of possible vision tasks, it is impossible to train a CNN from scratch for each. In fact, it is likely not even feasible to collect the l... |

91 | Image classification with the fisher vector: theory and practice
- Sánchez, Perronnin, et al.
- 2013
Citation Context ... of diagonal covariances to provide a good approximation to the non-linear manifold of CNN responses. This leads to the adoption of a richer generative model, the mixture of factor analyzers (MFA) [4, 22], which locally approximates the scene class manifold by low-dimensional linear spaces. Our second contribution is to show that, by locally projecting the feature data into these spaces, the MFA can efficiently model its local covariance structure. For this, we derive the Fisher score of the MFA model, denoted the MFA Fisher score (MFA-FS), a representation similar to the GMM Fisher vector of [13, 17]. We show that, for high dimensional CNN features, the MFA-FS captures highly discriminative covariance statistics, which were previously unavailable in [6, 2], producing significantly improved scene classification over the conventional GMM Fisher vector. The third contribution is a detailed experimental investigation of the MFA-FS. Since this can be seen as a second order pooling mechanism, we compare it to a number of recent methods for second order pooling of CNN features [21, 3]. Although these methods describe global covariance structure, they lack the ability of the MFA-FS to capture tha... |

46 | Learning deep features for scene recognition using places database.
- Zhou, Lapedriza, et al.
- 2014
Citation Context ...experiments show that the MFA-FS has state of the art performance for object-to-scene transfer and this transfer actually outperforms the training of a scene CNN from a large scene dataset. The two representations are also shown to be complementary, in the sense that their combination outperforms each of the representations by itself. When combined, they produce a state-of-the-art scene classifier. 1 Introduction In recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved remarkable performance on traditional vision problems such as image classification [8, 18, 26], object detection and localization [5, 16] and others. The success of CNNs can be attributed to their ability to learn highly discriminative, non-linear, visual transformations with the help of supervised backpropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets, a remarkable property of these classifiers is the solution of the dataset bias problem [20] that has plagued computer vision for decades. It has now been shown many times that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily fine-t... |

32 | Multi-scale orderless pooling of deep convolutional activation features.
- Gong, Wang, et al.
- 2014
Citation Context ... of task transfer. In this work, we consider a very common class of such problems, where a classifier trained on a class of instances is to be transferred to a second class of instances, which are loose combinations of the original ones (30th Conference on Neural Information Processing Systems, NIPS 2016, Barcelona, Spain). In particular, we consider the problem where the original instances are objects and the target instances are scene-level concepts that somehow depend on those objects. Examples of this problem include the transfer of object classifiers to tasks such as scene classification [6, 11, 2] or image captioning [23]. In all these cases, the goal is to predict holistic scene tags from the scores (or features) from an object CNN classifier. The dependence of the holistic descriptions on these objects could range from very explicit to very subtle. For example, on the explicit end of the spectrum, an image captioning system could produce a sentence such as “a person is sitting on a stool and feeding a zebra.” On the other hand, on the subtle end of the spectrum, a scene classification system would leverage the recognition of certain rocks, tree stumps, bushes and a particular lizard sp... |

14 | Learning nonlinear image manifolds by global alignment of local linear models
- Verbeek
- 2006
Citation Context ...eral contributions to the use of Fisher vector type of representations for object to scene transfer. The first is to show that, for object recognition scores produced by a CNN [2], variance statistics are much less informative of scene class distributions than the mean gradients, and can even degrade scene classification performance. We then argue that this is due to the inability of the standard GMM of diagonal covariances to provide a good approximation to the non-linear manifold of CNN responses. This leads to the adoption of a richer generative model, the mixture of factor analyzers (MFA) [4, 22], which locally approximates the scene class manifold by low-dimensional linear spaces. Our second contribution is to show that, by locally projecting the feature data into these spaces, the MFA can efficiently model its local covariance structure. For this, we derive the Fisher score of the MFA model, denoted the MFA Fisher score (MFA-FS), a representation similar to the GMM Fisher vector of [13, 17]. We show that, for high dimensional CNN features, the MFA-FS captures highly discriminative covariance statistics, which were previously unavailable in [6, 2], producing significantly improved sc... |

6 | Faster R-CNN: Towards real-time object detection with region proposal networks.
- Ren, He, et al.
- 2015
Citation Context ...he art performance for object-to-scene transfer and this transfer actually outperforms the training of a scene CNN from a large scene dataset. The two representations are also shown to be complementary, in the sense that their combination outperforms each of the representations by itself. When combined, they produce a state-of-the-art scene classifier. 1 Introduction In recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved remarkable performance on traditional vision problems such as image classification [8, 18, 26], object detection and localization [5, 16] and others. The success of CNNs can be attributed to their ability to learn highly discriminative, non-linear, visual transformations with the help of supervised backpropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets, a remarkable property of these classifiers is the solution of the dataset bias problem [20] that has plagued computer vision for decades. It has now been shown many times that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily fine-tuned to solve a related problem on another ... |

5 | Encoding High Dimensional Local Features by Sparse Coding Based Fisher Vectors
- Liu, Shen, et al.
- 2014
Citation Context ... of task transfer. In this work, we consider a very common class of such problems, where a classifier trained on a class of instances is to be transferred to a second class of instances, which are loose combinations of the original ones. In particular, we consider the problem where the original instances are objects and the target instances are scene-level concepts that somehow depend on those objects. Examples of this problem include the transfer of object classifiers to tasks such as scene classification [6, 11, 2] or image captioning [23]. In all these cases, the goal is to predict holistic scene tags from the scores (or features) from an object CNN classifier. The dependence of the holistic descriptions on these objects could range from very explicit to very subtle. For example, on the explicit end of the spectrum, an image captioning system could produce a sentence such as “a person is sitting on a stool and feeding a zebra.” On the other hand, on the subtle end of the spectrum, a scene classification system would leverage the recognition of certain rocks, tree stumps, bushes and a particular lizard sp... |

2 | Deep filter banks for texture recognition, description, and segmentation.
- Cimpoi, Maji, et al.
- 2015
Citation Context ...the recursive Baum-Welch computations commonly used to learn hidden Markov models [15]. For more tractable distributions, such as the GMM, it enables the simple reuse of the EM equations, which are always required to learn the reference model θb, to compute the Fisher score. 3 2.2 Bag of features Fisher scores are usually combined with the bag-of-features representation, where an image is described as an orderless collection of localized descriptors D = {x1, x2, . . . xn}. These were traditionally SIFT descriptors, but have more recently been replaced with responses of object recognition CNNs [6, 1, 2]. In this work we use the semantic features proposed in [2], which are obtained by transforming softmax probability vectors pi, obtained for image patches, into their natural parameter form. These features were shown to perform better than activations of other CNN layers [2]. 2.3 Gaussian Mixture Fisher Vectors A GMM is a model with a discrete hidden variable that determines the mixture component which explains the observed data. The generative process is as follows. A mixture component zi is first sampled from a multinomial distribution p(z = k) = wk. An observation xi is then sampled from th... |
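The GMM generative process described at the end of this snippet (draw a component z from the multinomial p(z = k) = wk, then an observation x from that component's Gaussian) can be sketched as follows; all weights, means, and variances are made-up toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.array([0.2, 0.8])                   # mixture weights p(z = k) = wk
means = np.array([[0.0, 0.0], [3.0, 3.0]])
stds = np.array([[1.0, 1.0], [0.5, 0.5]])  # per-component diagonal std devs

def sample_gmm(n):
    z = rng.choice(len(w), size=n, p=w)    # hidden component per observation
    x = means[z] + stds[z] * rng.normal(size=(n, means.shape[1]))
    return x, z

X, z = sample_gmm(1000)
```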

2 | Scene classification with semantic fisher vectors.
- Dixit, Chen, et al.
- 2015
Citation Context ... of task transfer. In this work, we consider a very common class of such problems, where a classifier trained on a class of instances is to be transferred to a second class of instances, which are loose combinations of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain. original ones. In particular, we consider the problem where the original instances are objects and the target instances are scene-level concepts that somehow depend on those objects. Examples of this problem include the transfer of object classifiers to tasks such as scene classification [6, 11, 2] or image captioning [23]. In all these cases, the goal is to predict holistic scene tags from the scores (or features) from an object CNN classifier. The dependence of the holistic descriptions on these objects could range from very explicit to very subtle. For example, on the explicit end of the spectrum, an image captioning system could produce sentence such as “a person is sitting on a stool and feeding a zebra.” On the other hand, on the subtle end of the spectrum, a scene classification system would leverage the recognition of certain rocks, tree stumps, bushes and a particular lizard sp... |

2 | Show and Tell: A Neural Image Caption Generator
- Vinyals, Toshev, Bengio, Erhan
- 2015
Citation Context ..., we consider a very common class of such problems, where a classifier trained on a class of instances is to be transferred to a second class of instances, which are loose combinations of the original ones. In particular, we consider the problem where the original instances are objects and the target instances are scene-level concepts that somehow depend on those objects. Examples of this problem include the transfer of object classifiers to tasks such as scene classification [6, 11, 2] or image captioning [23]. In all these cases, the goal is to predict holistic scene tags from the scores (or features) from an object CNN classifier. The dependence of the holistic descriptions on these objects could range from very explicit to very subtle. For example, on the explicit end of the spectrum, an image captioning system could produce a sentence such as “a person is sitting on a stool and feeding a zebra.” On the other hand, on the subtle end of the spectrum, a scene classification system would leverage the recognition of certain rocks, tree stumps, bushes and a particular lizard species to label an image w... |

1 | Compact bilinear pooling.
- Gao, Beijbom, et al.
- 2015
Citation Context ...sfer with the Places CNN, on both MIT Indoors and SUN and for both the AlexNet and VGG architectures. This supports the hypothesis that the variability of configurations of most scenes makes scene classification much harder than object recognition, to the point where CNN architectures that have close-to or above human performance for object recognition are much less effective for scenes. It is, instead, preferable to pool object detections across the scene image, using a pooling mechanism such as the MFA-FS. This ... |

Table 4: Comparison to task transfer methods (ImageNet CNNs) on MIT Indoor.

| Method | 1 scale | mscale |
| --- | --- | --- |
| AlexNet MFA-FS | 71.11 | 73.58 |
| AlexNet GMM FV [2] | 68.5 | 72.86 |
| AlexNet FV+FC [1] | - | 71.6 |
| AlexNet Sparse Coding [11] | 68.2 | - |
| VGG MFA-FS | 79.9 | 81.43 |
| VGG Sparse Coding [12] | - | 77.6 |
| VGG H-Sparse [12] | - | 79.5 |
| VGG BN [3] | 77.55 | - |
| VGG FV+FC [1] | - | 81.0 |
| VGG + dim. reduction: MFA-FS + PCA (5k) | 79.3 | - |
| VGG + dim. reduction: BN (8k) [3] | 76.17 | - |

Table 5: Comparison with the Places trained Scene CNNs.

| Method | SUN | Indoor |
| --- | --- | --- |
| AlexNet MFA-FS | 55.95 | 73.58 |
| AlexNet Places | 54.3 | 68.24 |
| AlexNet Combined | 63.16 | 79.86 |
| VGG MFA-FS | 63.31 | 81.43 |
| VGG Places | 61.32 | 79.47 |
| VGG Combined | 71.06 | 87.23 |
| Places (VGG + Alex) | 65.91 | 81.29 |
| MFA-FS(Alex) + Places(VGG) | 68.8 | 85.6 |
| MFA-FS(VGG) + Places(Alex) | 67.34 | 82.82 |

1 | Mid-level deep pattern mining.
- Li, Liu, et al.
- 2015
Citation Context ...Table 3: Performance of scene classification methods. * - combination of patch scales (128, 96, 160).

| Method | MIT Indoor | SUN |
| --- | --- | --- |
| MFA-FS + Places (VGG) | 87.23 | 71.06 |
| MFA-FS + Places (AlexNet) | 79.86 | 63.16 |
| MFA-FS (VGG) | 81.43 | 63.31 |
| MFA-FS (AlexNet) | 73.58 | 55.95 |
| Full BN (VGG) [3] | 77.55 | - |
| Compact BN (VGG) [3] | 76.17 | - |
| H-Sparse (VGG) [12] | 79.5 | - |
| Sparse Coding (VGG) [12] | 77.6 | - |
| Sparse Coding (AlexNet) [11] | 68.2 | - |
| MetaClass (AlexNet) + Places [24] | 78.9 | 58.11 |
| FV (AlexNet) (4 scales) + Places [2] | 79.0 | 61.72 |
| FV (AlexNet) (3 scales) + Places [2] | 78.5* | - |
| FV (AlexNet) (4 scales) [2] | 72.86 | 54.4 |
| FV (AlexNet) (3 scales) [2] | 71.24 | 53.0 |
| VLAD (AlexNet) [6] | 68.88 | 51.98 |
| FV+FC (VGG) [1] | 81.0 | - |
| Mid Level [10] | 70.46 | - |

4.3 Comparison with ImageNet based Classifiers We next compared the MFA-FS to state of the art scene classifiers also based on transfer from ImageNet CNN features [11, 1–3]. Since all these methods only report results for MIT Indoor, we limited the comparison to this dataset, with the results of Table 4. The GMM-FV of [2] operates on AlexNet CNN semantics extracted from image patches of multiple sizes (96, 128, 160, 80). The FV in [1] is computed using convolutional features from AlexNet or VGG-16 extracted in a large multi-scale setting. Liu et al. proposed a gradient representation ... |

1 | Compositional model based fisher vector coding for image classification.
- Liu, Wang, et al.
- 2016
Citation Context ...function of patch scale.

| Patch scale | MIT Indoor | SUN |
| --- | --- | --- |
| AlexNet 160x160 | 69.83 | 52.36 |
| AlexNet 128x128 | 71.11 | 53.38 |
| AlexNet 96x96 | 70.51 | 53.54 |
| AlexNet 3 scales | 73.58 | 55.95 |
| VGG-16 160x160 | 77.26 | 59.77 |
| VGG-16 128x128 | 77.28 | 60.99 |
| VGG-16 96x96 | 79.57 | 61.71 |
| VGG-16 3 scales | 80.1 | 63.31 |
| VGG-19 160x160 | 77.21 | - |
| VGG-19 128x128 | 79.39 | - |
| VGG-19 96x96 | 79.9 | - |
| VGG-19 3 scales | 81.43 | - |

Table 3: Performance of scene classification methods. * - combination of patch scales (128, 96, 160).

| Method | MIT Indoor | SUN |
| --- | --- | --- |
| MFA-FS + Places (VGG) | 87.23 | 71.06 |
| MFA-FS + Places (AlexNet) | 79.86 | 63.16 |
| MFA-FS (VGG) | 81.43 | 63.31 |
| MFA-FS (AlexNet) | 73.58 | 55.95 |
| Full BN (VGG) [3] | 77.55 | - |
| Compact BN (VGG) [3] | 76.17 | - |
| H-Sparse (VGG) [12] | 79.5 | - |
| Sparse Coding (VGG) [12] | 77.6 | - |
| Sparse Coding (AlexNet) [11] | 68.2 | - |
| MetaClass (AlexNet) + Places [24] | 78.9 | 58.11 |
| FV (AlexNet) (4 scales) + Places [2] | 79.0 | 61.72 |
| FV (AlexNet) (3 scales) + Places [2] | 78.5* | - |
| FV (AlexNet) (4 scales) [2] | 72.86 | 54.4 |
| FV (AlexNet) (3 scales) [2] | 71.24 | 53.0 |
| VLAD (AlexNet) [6] | 68.88 | 51.98 |
| FV+FC (VGG) [1] | 81.0 | - |
| Mid Level [10] | 70.46 | - |

4.3 Comparison with ImageNet based Classifiers We next compared the MFA-FS to state of the art scene classifiers also based on transfer from ImageNet CNN features [11, 1–3]. Since all these methods only report results for MIT Indoor, we... |

1 | Fisher vector based on full-covariance gaussian mixture model.
- Tanaka, Torii, Okutomi
- 2013
Citation Context ...50, . . . , 500}) and MFA hidden sub-spaces dimensions (R ∈ {1, . . . , 10}). For comparable vector dimensions, the covariance based scores always significantly outperforms the variance statistics on both datasets. A final observation is that, due to covariance modeling in MFAs, the MFA-FS(µ) performs better than the GMM-FV(µ). The first order residuals pooled to obtain the MFA-FS(µ) (14) are scaled by covariance matrices instead of variances. This local de-correlation provides a non-trivial improvement for the MFA-FS(µ) over the GMM-FV(µ) (∼ 1.5% points). Covariance modeling was previously used in [19] to obtain FVs w.r.t. Gaussian means and local subspace variances (eigen-values of covariance). Their subspace variance FV, derived with our MFAs, performs much better than the variance GMM-FV (σ), due to a better underlying model (60.7% vs 53.86% on Indoor). It is, however, still inferior to the MFA-FS(Λ) which captures full covariance within local subspaces. While a combination of the MFA-FS(µ) and MFA-FS(Λ) produces a small improvement (∼ 1%), we restrict to using the latter in the remainder of this work. 4.2 Multi-scale learning and Deep CNNs Recent works have demonstrated value in combinin... |

1 | Bilinear CNN models for fine-grained visual recognition.
- Lin, RoyChowdhury, Maji
- 2015
Citation Context ...e of the MFA model, denoted the MFA Fisher score (MFA-FS), a representation similar to the GMM Fisher vector of [13, 17]. We show that, for high dimensional CNN features, the MFA-FS captures highly discriminative covariance statistics, which were previously unavailable in [6, 2], producing significantly improved scene classification over the conventional GMM Fisher vector. The third contribution is a detailed experimental investigation of the MFA-FS. Since this can be seen as a second order pooling mechanism, we compare it to a number of recent methods for second order pooling of CNN features [21, 3]. Although these methods describe global covariance structure, they lack the ability of the MFA-FS to capture that information along locally linear approximations of the highly non-linear CNN feature manifold. This is shown to be important, as the MFA-FS is shown to outperform all these representations by non-trivial margins. Finally, we show that the MFA-FS enables effective task transfer, by showing that MFA-FS vectors extracted from deep CNNs trained for ImageNet object recognition [8, 18], achieve state-of-the-art results on challenging scene recognition benchmarks, such as SUN [25] and MI... |

1 | Harvesting discriminative meta objects with deep cnn features for scene classification.
- Wu, Wang, et al.
- 2015
Citation Context ... the scores with respect to the factor loading matrices Λk account for covariance statistics of the observations xi, not just variances. We refer to the representations (14) and (15) as MFA Fisher scores (MFA-FS). Note that these are not FVs due to the absence of normalization by the Fisher information, which is more complex to compute than for the variance-GMM. 3 Related work The most popular approach to transfer object scores (usually from an ImageNet CNN) into a feature vector for scene classification is to rely on FV-style pooling. Although most classifiers default to the GMM-FV embedding [6, 1, 2, 24], some recent works have explored different encodings [11] and pooling schemes [21, 3] with promising results. Liu et al. [11] derived an FV like representation from sparse coding. Their model can be described as a factor analyzer with Gaussian observations p(x|z) ∼ N (Λz, σ2I) conditioned on Laplace factors p(z) ∝ ∏_r exp(−|z_r|). While the sparse FA marginal p(x) is intractable, it can be approximated by an evidence lower bound log p(x) ≥ ∫ q(z) log [p(x, z)/q(z)] dz derived from a suitable variational posterior q(z). In [11], q is a point posterior δ(z − z∗) and the MAP inference simplifies into sparse co... |
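The MAP inference mentioned at the end of this snippet (Gaussian observations with a Laplace prior on the factors, which reduces to a lasso-style sparse coding problem) can be illustrated with a few ISTA iterations. The dictionary, step size, and penalty below are arbitrary choices of this sketch, not values from [11]:

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 20, 50
Lam = rng.normal(size=(d, r)) / np.sqrt(d)    # illustrative loading matrix Lambda
s = np.zeros(r)
s[rng.choice(r, size=5, replace=False)] = rng.normal(size=5)
x = Lam @ s                                   # observation generated from a sparse code

lam = 0.05                                    # l1 penalty (from the Laplace prior)
step = 1.0 / np.linalg.norm(Lam, 2) ** 2      # step size <= 1/L for convergence
z = np.zeros(r)
for _ in range(200):
    # gradient step on the Gaussian log-likelihood term ||x - Lam z||^2 / 2
    z = z - step * (Lam.T @ (Lam @ z - x))
    # soft-thresholding: the proximal operator of the l1 (Laplace) term
    z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
```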