Sparse flexible models of local image descriptors were introduced in computer vision as a way to substantially enrich the representation of visual objects without reducing the efficiency of visual classification. To understand the method, recall that a local image descriptor has two parts: (1) geometry and (2) appearance. The geometry consists of the position, orientation, and scale at which the feature was detected, while the appearance is represented by the image features extracted from the region used to detect the descriptor.
The tractability of the problem depends on the assumptions made about the dependence between the appearance and geometry of local image descriptors. At one extreme lies full dependence, in both appearance and geometry, among all local descriptors representing a visual object. This seems intuitively correct, but in practice the assumption is intractable because of the complexity of inference and the lack of training data. At the other extreme lies complete independence of appearance and geometry between local descriptors, which leads to the well-known bag-of-features model. This assumption is appealing in terms of learning and inference complexity, but it makes little intuitive sense: there is little evidence that visual classification can be done robustly using only a set of local descriptors that bear no relation to one another. For instance, when detecting a human face, one expects two regions with the appearance of an eye and another region with the appearance of a nose between the eyes, which suggests both geometric and appearance dependence.
Sparse flexible models target a compromise between tractability and classification accuracy through a variable (the sparsity factor) that controls the level of dependence between local descriptors.
In this work we intend to provide a theoretical justification of the functionality (training and inference) of this model, and to implement efficient algorithms for training and visual classification.
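To make the two-part structure of a local descriptor and the complete-independence extreme concrete, the following is a minimal sketch rather than the formulation of any cited paper. The names LocalDescriptor, nearest_codeword, and bag_of_features, the use of a fixed codebook, and the Euclidean distance between appearance vectors are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LocalDescriptor:
    """Hypothetical container for one local image descriptor."""
    # Geometry: where and how the feature was detected.
    position: Tuple[float, float]
    orientation: float  # radians
    scale: float
    # Appearance: feature vector extracted from the detected region.
    appearance: Tuple[float, ...]


def nearest_codeword(appearance: Sequence[float],
                     codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to the appearance vector
    (Euclidean distance is an assumption of this sketch)."""
    return min(range(len(codebook)),
               key=lambda i: dist(appearance, codebook[i]))


def bag_of_features(descriptors: Sequence[LocalDescriptor],
                    codebook: Sequence[Sequence[float]]) -> List[int]:
    """Histogram of codeword counts. Geometry is ignored entirely,
    which is the complete-independence assumption."""
    counts = Counter(nearest_codeword(d.appearance, codebook)
                     for d in descriptors)
    return [counts[i] for i in range(len(codebook))]
```

Because the histogram depends only on appearance, moving the descriptors around the image leaves it unchanged; that is exactly the information a model with geometric dependence would retain.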
Objectives:
The objective of this work is to produce a theoretical justification of the functionality of sparse flexible models of local image descriptors, and an implementation of efficient training and visual classification algorithms.
Detailed description:
The use of local image descriptors for visual object detection has grown tremendously [Lowe04,Carneiro07]. Models of this type are usually good at representing visual objects that have distinctive, localized textured patches and/or a consistent spatial arrangement of these patches. Another strength is that they can handle partial occlusion and large deformations of the visual object. Their main weakness is the limited discriminating power of the local descriptors, a consequence of their small spatial support; the problem is accentuated in larger databases, where each local descriptor tends to become even less discriminating. The crucial point is therefore the independence assumptions made. Complete independence assumptions lead to the widely known bag-of-features models [Csurka04], which produce surprisingly good visual classification results. Full dependence assumptions produced the constellation model, which has been abandoned because its very high-dimensional parameter space makes training quite difficult. The last four years have witnessed a tremendous growth in models that try to find a good balance between tractability (training and inference) and complexity (in the form of independence assumptions). In this proposal, we are interested in further investigating the sparse flexible models proposed by Carneiro and Lowe [Carneiro06]. In this model, the independence assumption is expressed as an input parameter K that determines the connectivity of each local descriptor. The paper [Carneiro06] proposes an on-line training approach and an inference algorithm, but it provides no convergence proofs. It also lacks efficient training and inference algorithms.
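As an illustration of how a parameter K can control the connectivity of each local descriptor, the sketch below links every descriptor to its K nearest neighbours in image position. This is one plausible reading, not the training or inference procedure of [Carneiro06]; the function name knn_connectivity and the choice of Euclidean distance between positions are assumptions.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple


def knn_connectivity(positions: Sequence[Tuple[float, float]],
                     k: int) -> Dict[int, List[int]]:
    """Map each descriptor index to the indices of its k nearest
    descriptors (hypothetical helper; Euclidean distance assumed)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    graph: Dict[int, List[int]] = {}
    for i, p in enumerate(positions):
        others = [j for j in range(len(positions)) if j != i]
        # Sort by distance, breaking ties by index for determinism.
        others.sort(key=lambda j: (dist(p, positions[j]), j))
        graph[i] = others[:k]
    return graph
```

With K = 0 no descriptor depends on any other, which corresponds to the bag-of-features extreme; with K = N - 1 for N descriptors every descriptor is connected to all the others, which corresponds to full dependence. Intermediate values give the sparse compromise.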
Our objective is to address precisely the points missing from [Carneiro06]: convergence proofs and efficient training and inference algorithms. The following steps are planned:
1. Literature review in the area of recurrent visual classification models using local descriptors with different dependence assumptions.
2. Development of the convergence proof for the training and inference algorithms.
3. Design and implementation of efficient training and inference algorithms for the sparse flexible models of local descriptors.
4. Experiments and comparisons with current state-of-the-art models using the PASCAL database of visual classification [PASCAL].
References:
[Lowe04] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[Carneiro07] Gustavo Carneiro and Allan Jepson. Flexible Spatial Configuration of Local Image Features. IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. 29(12), pp. 2089-2104, 2007.
[Csurka04] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[Carneiro06] Gustavo Carneiro and David Lowe. Sparse Flexible Models of Local Features. Proceedings of the European Conference on Computer Vision (ECCV). Graz, Austria. 2006.
[PASCAL] http://www.pascal-network.org/challenges/VOC
Requirements (grades, required courses, etc):
Expected results:
At the end of the work, the students will have enriched their experience in computer vision and machine learning. In particular, the following results are expected:
- Convergence proof for the training and inference algorithms of the sparse flexible model of local descriptors [Carneiro06].
- Design and implementation of efficient training and inference algorithms for the sparse flexible models of local descriptors.
- Experiments and comparisons with current state-of-the-art models using the PASCAL database of visual classification [PASCAL].