I am a postdoctoral researcher at UTS, working with Prof. Yi Yang. Previously, I was a postdoc at the University of Texas at San Antonio, working with Prof. Qi Tian. I received my PhD in Jul. 2015 from the Department of Electronic Engineering, Tsinghua University, supervised by Prof. Shengjin Wang, and my B.E. from the School of Life Sciences, Tsinghua University, in 2010. I am interested in person re-identification, image retrieval, and deep learning.
Can supervise: YES
Person re-identification, image retrieval, deep learning
Hu, Y., Zheng, L., Yang, Y. & Huang, Y. 2018, 'Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval', IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 927-938.
This paper contributes a new large-scale dataset for weakly supervised cross-media retrieval, named Twitter100k. Current datasets, such as Wikipedia, NUS Wide, and Flickr30k, have two major limitations. First, these datasets are lacking in content diversity, i.e., only some predefined classes are covered. Second, texts in these datasets are written in well-organized language, leading to inconsistency with realistic applications. To overcome these drawbacks, the proposed Twitter100k dataset is characterized by two aspects: it has 100,000 image-text pairs randomly crawled from Twitter, and thus has no constraint in the image categories; and text in Twitter100k is written in informal language by the users. Since strongly supervised methods leverage class labels that may be missing in practice, this paper focuses on weakly supervised learning for cross-media retrieval, in which only text-image pairs are exploited during training. We extensively benchmark the performance of four subspace learning methods and three variants of the correspondence AutoEncoder, along with various text features, on Wikipedia, Flickr30k, and Twitter100k. As a minor contribution, we also design a deep neural network to learn cross-modal embeddings for Twitter100k. Inspired by the characteristic of Twitter100k, we propose a method to integrate optical character recognition into cross-media retrieval. The experimental results show that the proposed method improves the baseline performance.
Zheng, L., Yang, Y. & Tian, Q. 2018, 'SIFT Meets CNN: A Decade Survey of Instance Retrieval', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. 1224-1244.
In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors (de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two perform a single-pass of an image to the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods. After analyzing and comparing retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.
Liu, Z., Wang, S., Zheng, L. & Tian, Q. 2017, 'Robust imagegraph: Rank-level feature fusion for image search', IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3128-3141.
Recently, feature fusion has demonstrated its effectiveness in image search. However, bad features and inappropriate parameters usually bring about false positive images, i.e., outliers, leading to inferior performance. Therefore, a major challenge for a fusion scheme is how to be robust to outliers. Towards this goal, this paper proposes a rank-level framework for robust feature fusion. First, we define Rank Distance to measure the relevance of images at the rank level. Based on it, Bayes similarity is introduced to evaluate the retrieval quality of individual features, through which true matches tend to obtain higher weights than outliers. Then, we construct a directed ImageGraph to encode the relationship of images. Each image is connected to its K nearest neighbors with an edge, and the edge is weighted by Bayes similarity. Multiple rank lists resulting from different methods are merged via the ImageGraph. Furthermore, on the fused ImageGraph, local ranking is performed to re-order the initial rank lists. It aims at local optimization, and thus is more robust to global outliers. Extensive experiments on four benchmark datasets validate the effectiveness of our method. Besides, the proposed method outperforms two popular fusion schemes, and the results are competitive with the state of the art.
Zhang, Z., Liu, S., Mei, X., Xiao, B. & Zheng, L. 2017, 'Learning completed discriminative local features for texture classification', Pattern Recognition, vol. 67, pp. 263-275.
Local binary patterns (LBP) and its variants have shown great potential in texture classification tasks. LBP-like texture classification methods usually follow a two-step feature extraction process: in the first pattern encoding step, the local structure information around each pixel is encoded into a binary string; in the second histogram accumulation step, the binary strings are accumulated into a histogram as the feature vector of a texture image. The performance of these classification methods is closely related to the distinctiveness of the feature vectors. In this paper, we propose a novel feature representation method, namely Completed Discriminative Local Features (CDLF), for texture classification. The proposed CDLF improves the distinctiveness of LBP-like feature vectors in two aspects: in the pattern encoding stage, we learn a transformation matrix using labeled data, which significantly increases the discriminative power of the encoded binary strings; in the histogram accumulation step, we use an adaptive weight strategy to account for the contributions of pixels in different regions. Experimental results on three challenging texture databases demonstrate that the proposed CDLF achieves significantly better results than previous LBP-like feature representation methods for texture classification tasks.
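To make the two-step pipeline concrete, here is a minimal NumPy sketch of plain 8-neighbour LBP encoding and histogram accumulation; it illustrates the generic baseline the paper builds on, not the learned transformation or adaptive weighting of CDLF itself.

```python
import numpy as np

def lbp_histogram(image):
    """Plain 8-neighbour LBP: encode each interior pixel against its
    neighbours, then accumulate the codes into a 256-bin histogram.
    A minimal sketch of the generic LBP pipeline, not the CDLF method."""
    img = image.astype(np.int32)
    center = img[1:-1, 1:-1]
    # Offsets of the 8 neighbours, clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        rows = slice(1 + dy, img.shape[0] - 1 + dy)
        cols = slice(1 + dx, img.shape[1] - 1 + dx)
        codes |= (img[rows, cols] >= center).astype(np.int32) << bit
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()  # normalised feature vector

texture = np.random.randint(0, 256, (64, 64))
print(lbp_histogram(texture).shape)  # (256,)
```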
Zhu, F., Kong, X., Zheng, L., Fu, H. & Tian, Q. 2017, 'Part-Based Deep Hashing for Large-Scale Person Re-Identification', IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4806-4817.
Large scale is a trend in person re-identification (re-id), and it is important that real-time search be performed in a large gallery. While previous methods mostly focus on discriminative learning, this paper attempts to integrate deep learning and hashing into one framework to evaluate the efficiency and accuracy of large-scale person re-id. We integrate spatial information for discriminative visual representation by partitioning the pedestrian image into horizontal parts. Specifically, Part-based Deep Hashing (PDH) is proposed, in which batches of triplet samples are employed as the input of the deep hashing architecture. Each triplet sample contains two pedestrian images (or parts) with the same identity and one pedestrian image (or part) of a different identity. A triplet loss function is employed with the constraint that the Hamming distance between pedestrian images (or parts) with the same identity is smaller than that between images with different identities. In the experiments, we show that the proposed PDH method yields very competitive re-id accuracy on the large-scale Market-1501 and Market-1501+500K datasets.
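The triplet constraint described above can be sketched with a standard margin-based loss on relaxed, real-valued hash codes (PyTorch); the architecture, part partitioning, and exact loss of PDH are not reproduced here, and the tanh relaxation is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def triplet_hash_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on relaxed hash codes: pull the
    same-identity pair together, push the different-identity pair apart.
    tanh keeps codes near binary, so squared Euclidean distance acts as a
    smooth surrogate for the Hamming distance used at test time."""
    a, p, n = torch.tanh(anchor), torch.tanh(positive), torch.tanh(negative)
    d_pos = (a - p).pow(2).sum(dim=1)
    d_neg = (a - n).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy batch: 8 triplets of 48-dim codes from a hypothetical backbone.
anchor = torch.randn(8, 48, requires_grad=True)
positive, negative = torch.randn(8, 48), torch.randn(8, 48)
loss = triplet_hash_loss(anchor, positive, negative)
loss.backward()
```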
Zheng, L., Wang, S., Wang, J. & Tian, Q. 2016, 'Accurate Image Search with Multi-Scale Contextual Evidences', International Journal of Computer Vision, vol. 120, no. 1, pp. 1-13.
This paper considers the task of image search using the Bag-of-Words (BoW) model. In this model, the precision of visual matching plays a critical role. Conventionally, local cues of a keypoint, e.g., SIFT, are employed. However, such a strategy does not consider the contextual evidences of a keypoint, a problem which would lead to the prevalence of false matches. To address this problem and enable accurate visual matching, this paper proposes to integrate discriminative cues from multiple contextual levels, i.e., local, regional, and global, via probabilistic analysis. A true match is defined as a pair of keypoints corresponding to the same scene location on all three levels. Specifically, the Convolutional Neural Network (CNN) is employed to extract features from regional and global patches. We show that the CNN feature is complementary to SIFT due to its semantic awareness and compares favorably to several other descriptors such as GIST and HSV. To reduce memory usage, we propose to index CNN features outside the inverted file, communicated by memory-efficient pointers. Experiments on three benchmark datasets demonstrate that our method greatly promotes the search accuracy when the CNN feature is integrated. We show that our method is efficient in terms of time cost compared with the BoW baseline, and yields competitive accuracy with the state of the art.
Zheng, L., Wang, S., Guo, P., Liang, H. & Tian, Q. 2015, 'Tensor index for large scale image retrieval', Multimedia Systems, vol. 21, no. 6, pp. 569-579.
Recently, the bag-of-words representation has been widely applied in image retrieval applications. In this model, the visual word is a core component. However, compared with text retrieval, one major problem associated with image retrieval is visual word ambiguity, i.e., a trade-off between precision and recall of visual matching. To address this problem, this paper proposes a tensor index structure to improve precision and recall simultaneously. Essentially, the tensor index is a multi-dimensional index structure. It combines the strengths of two state-of-the-art indexing strategies, i.e., the inverted multi-index [Babenko and Lempitsky (Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference, 3069-3076, 2012)] and the joint inverted index [Xia et al. (ICCV, 2013)], which were initially designed for approximate nearest neighbor search. This paper, instead, exploits their usage in the scenario of image retrieval and provides insights into how to combine them effectively. We show that, on the one hand, the multi-index enhances the discriminative power of visual words, thus improving precision; on the other hand, the introduction of multiple codebooks corrects quantization artifacts, thus improving recall. Extensive experiments on two benchmark datasets demonstrate that the tensor index significantly improves the baseline approach. Moreover, when incorporating methods such as Hamming embedding, we achieve competitive performance with the state of the art.
Zheng, L., Wang, S., Liu, Z. & Tian, Q. 2015, 'Fast image retrieval: Query pruning and early termination', IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 648-659.
Efficiency is of great importance for image retrieval systems. For this pragmatic issue, this paper proposes a fast image retrieval framework to speed up the online retrieval process. To this end, an impact score for local features is first proposed, which considers multiple properties of a local feature, including TF-IDF, scale, saliency, and ambiguity. Then, to decrease memory consumption, the impact score is quantized to an integer, which leads to a novel inverted index organization, called Q-Index. Importantly, based on the impact score, two closely complementary strategies are introduced: query pruning and early termination. On one hand, query pruning discards less important features in the query. On the other hand, early termination visits only indexed features with high impact scores, resulting in a partial traversal of the inverted index. Our approach is tested on two benchmark datasets populated with an additional 1 million images that serve as negative examples. Compared with full traversal of the inverted index, we show that our system is capable of visiting less than 10% of the 'should-visit' postings, thus achieving a significant speed-up in query time while providing competitive retrieval accuracy.
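A rough sketch of the two strategies, assuming postings are pre-sorted by descending impact score; the thresholds and scoring function below are illustrative, not the paper's learned settings.

```python
from collections import defaultdict

def search(query_features, index, query_keep=0.5, posting_budget=1000):
    """Query pruning + early termination over an inverted index whose
    posting lists are pre-sorted by descending impact score.
    `index` maps visual word -> list of (image_id, impact) pairs;
    `query_features` is a list of (visual_word, impact) pairs."""
    # Query pruning: keep only the most important query features.
    pruned = sorted(query_features, key=lambda f: -f[1])
    pruned = pruned[:max(1, int(len(pruned) * query_keep))]

    scores = defaultdict(float)
    visited = 0
    for word, q_impact in pruned:
        for image_id, impact in index.get(word, []):
            if visited >= posting_budget:      # early termination
                return sorted(scores.items(), key=lambda s: -s[1])
            scores[image_id] += q_impact * impact
            visited += 1
    return sorted(scores.items(), key=lambda s: -s[1])

index = {3: [(0, 0.9), (2, 0.4)], 7: [(1, 0.8)]}
print(search([(3, 0.7), (7, 0.2), (9, 0.1)], index))
```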
Zheng, L., Wang, S. & Tian, Q. 2014, 'Coupled binary embedding for large-scale image retrieval', IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3368-3380.
Visual matching is a crucial step in image retrieval based on the bag-of-words (BoW) model. In the baseline method, two keypoints are considered as a matching pair if their SIFT descriptors are quantized to the same visual word. However, the SIFT visual word has two limitations. First, it loses most of its discriminative power during quantization. Second, SIFT only describes the local texture feature. Both drawbacks impair the discriminative power of the BoW model and lead to false positive matches. To tackle this problem, this paper proposes to embed multiple binary features at the indexing level. To model correlation between features, a multi-IDF scheme is introduced, through which different binary features are coupled into the inverted file. We show that matching verification methods based on binary features, such as Hamming embedding, can be effectively incorporated in our framework. As an extension, we explore the fusion of a binary color feature into image retrieval. The joint integration of the SIFT visual word and binary features greatly enhances the precision of visual matching, reducing the impact of false positive matches. Our method is evaluated through extensive experiments on four benchmark datasets (Ukbench, Holidays, DupImage, and MIR Flickr 1M). We show that our method significantly improves the baseline approach. In addition, large-scale experiments indicate that the proposed method requires acceptable memory usage and query time compared with other approaches. Further, when a global color feature is integrated, our method yields competitive performance with the state of the art.
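At match time, binary-feature verification of the kind this framework incorporates reduces to an XOR-and-popcount test; a minimal sketch with an illustrative threshold and signature length:

```python
def hamming_match(sig_query, sig_db, threshold=24):
    """Hamming-embedding style verification: two features quantised to the
    same visual word are accepted as a true match only if their binary
    signatures (Python ints holding the bit strings) are close in Hamming
    space. The threshold is illustrative, not a tuned value."""
    distance = bin(sig_query ^ sig_db).count("1")
    return distance <= threshold

print(hamming_match(0b1011_0010, 0b1011_0110))  # distance 1 -> True
```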
Zheng, L., Wang, S. & Tian, Q. 2014, 'Lp-norm IDF for scalable image retrieval', IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3604-3617.
The inverse document frequency (IDF) is prevalently utilized in bag-of-words-based image retrieval. The basic idea is to assign less weight to terms with high frequency, and vice versa. However, in the conventional IDF routine, the estimation of visual word frequency is coarse and heuristic. Therefore, its effectiveness is largely compromised and far from optimal. To address this problem, this paper introduces a novel IDF family by the use of the Lp-norm pooling technique. Carefully designed, the proposed IDF considers the term frequency, document frequency, the complexity of images, as well as the codebook information. We further propose a parameter tuning strategy, which helps to produce the optimal balance between TF and pIDF weights, yielding the so-called Lp-norm IDF (pIDF). We show that the conventional IDF is a special case of our generalized version, and two novel IDFs, i.e., the average IDF and the max IDF, can be defined from the concept of pIDF. Further, by accounting for the term frequency in each image, the proposed pIDF helps to alleviate the visual word burstiness phenomenon. Our method is evaluated through extensive experiments on four benchmark datasets (Oxford 5K, Paris 6K, Holidays, and Ukbench). We show that the pIDF works well on large-scale databases and when the codebook is trained on irrelevant data. We report a mean average precision improvement of as large as +13.0% over the baseline TF-IDF approach on a 1M dataset. In addition, the pIDF has a wide application scope, varying from buildings to general objects and scenes. When combined with postprocessing steps, we achieve competitive results compared with the state-of-the-art methods. Moreover, since the pIDF is computed offline, no extra computation or memory cost is introduced to the system.
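A schematic NumPy rendering of the pooling idea: replace the per-word document count with an Lp-pooled term-frequency mass. The paper's full formulation also involves image complexity and codebook terms, so this is a simplified illustration only.

```python
import numpy as np

def lp_norm_idf(tf, p=2.0):
    """Schematic Lp-norm IDF. `tf` is a (num_words, num_images) matrix of
    term frequencies. Instead of counting each image containing a word
    once (document frequency), the per-image frequencies are pooled with
    an Lp norm, so bursty words are down-weighted more aggressively."""
    num_images = tf.shape[1]
    pooled = np.power(np.power(tf, p).sum(axis=1), 1.0 / p)
    return np.log(num_images / (pooled + 1e-12))

tf = np.random.poisson(0.05, size=(1000, 5000)).astype(float)
idf = lp_norm_idf(tf)                        # generalized version
# Conventional IDF counts each occupied image once, for comparison:
idf_conventional = np.log(tf.shape[1] / ((tf > 0).sum(axis=1) + 1e-12))
```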
Zheng, L. & Wang, S. 2013, 'Visual phraselet: Refining spatial constraints for large scale image search', IEEE Signal Processing Letters, vol. 20, no. 4, pp. 391-394.
The Bag-of-Words (BoW) model is prone to the deficiency of spatial constraints among visual words. State-of-the-art methods encode spatial information via visual phrases. However, these methods in turn discard the spatial context among the visual phrases themselves. To address this problem, this letter introduces a novel visual concept, the Visual Phraselet, as a kind of similarity measurement between images. The visual phraselet refers to a spatially consistent group of visual phrases. In a simple yet effective manner, the visual phraselet filters out false visual phrase matches, and is much more discriminative than both the visual word and the visual phrase. To boost the discovery of visual phraselets, we apply a soft quantization scheme. Our method is evaluated through extensive experiments on three benchmark datasets (Oxford 5K, Paris 6K, and Flickr 1M). We report significant improvements, as large as 54.6%, over the baseline approach, thus validating the concept of the visual phraselet.
Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y. & Tian, Q. 2017, 'Person Re-identification in the Wild', Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA.
This paper presents a novel large-scale dataset and comprehensive baselines for end-to-end pedestrian detection and person recognition in raw video frames. Our baselines address three issues: the performance of various combinations of detectors and recognizers, mechanisms for pedestrian detection to help improve overall re-identification (re-ID) accuracy, and assessing the effectiveness of different detectors for re-ID. We make three distinct contributions. First, a new dataset, PRW, is introduced to evaluate Person Re-identification in the Wild, using videos acquired through six synchronized cameras. It contains 932 identities and 11,816 frames in which pedestrians are annotated with their bounding box positions and identities. Extensive benchmarking results are presented on this dataset. Second, we show that pedestrian detection aids re-ID through two simple yet effective improvements: a cascaded fine-tuning strategy that trains a detection model first and then the classification model, and a Confidence Weighted Similarity (CWS) metric that incorporates detection scores into similarity measurement. Third, we derive insights into evaluating detector performance for the particular scenario of accurate person re-ID.
Zheng, Z., Zheng, L. & Yang, Y. 2017, 'Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in vitro', 2017 IEEE International Conference on Computer Vision, IEEE, Venice, Italy.
The main contribution of this paper is a simple semi-supervised pipeline that only uses the original training set without collecting extra data. It is challenging in 1) how to obtain more training data only from the training set and 2) how to use the newly generated data. In this work, the generative adversarial network (GAN) is used to generate unlabeled samples. We propose the label smoothing regularization for outliers (LSRO). This method assigns a uniform label distribution to the unlabeled images, which regularizes the supervised model and improves the baseline. We verify the proposed method on a practical problem: person re-identification (re-ID). This task aims to retrieve a query person from other cameras. We adopt the deep convolutional generative adversarial network (DCGAN) for sample generation, and a baseline convolutional neural network (CNN) for representation learning. Experiments show that adding the GAN-generated data effectively improves the discriminative ability of learned CNN embeddings. On three large-scale datasets, Market-1501, CUHK03 and DukeMTMC-reID, we obtain +4.37%, +1.6% and +2.46% improvement in rank-1 precision over the baseline CNN, respectively. We additionally apply the proposed method to fine-grained bird recognition and achieve a +0.6% improvement over a strong baseline.
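The LSRO objective itself is compact; a minimal PyTorch sketch follows, with the Market-1501 class count used purely for illustration.

```python
import torch
import torch.nn.functional as F

def lsro_loss(logits, labels, is_generated):
    """Label smoothing regularization for outliers (LSRO), sketched:
    real images use the usual one-hot cross-entropy, while GAN-generated
    images are assigned a uniform distribution over the K classes.
    `labels` may hold any value at generated positions (it is ignored)."""
    log_probs = F.log_softmax(logits, dim=1)
    real_loss = F.nll_loss(log_probs, labels, reduction="none")
    uniform_loss = -log_probs.mean(dim=1)       # -1/K * sum_k log p_k
    return torch.where(is_generated, uniform_loss, real_loss).mean()

logits = torch.randn(6, 751)                    # 751 Market-1501 classes
labels = torch.randint(0, 751, (6,))
is_generated = torch.tensor([False, False, True, False, True, False])
print(lsro_loss(logits, labels, is_generated))
```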
Zhong, Z., Zheng, L., Cao, D. & Li, S. 2017, 'Re-ranking Person Re-identification with k-reciprocal Encoding', Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA.
When considering person re-identification (re-ID) as a retrieval process, re-ranking is a critical step to improve its accuracy. Yet in the re-ID community, limited effort has been devoted to re-ranking, especially to fully automatic, unsupervised solutions. In this paper, we propose a k-reciprocal encoding method to re-rank the re-ID results. Our hypothesis is that if a gallery image is similar to the probe in the k-reciprocal nearest neighbors, it is more likely to be a true match. Specifically, given an image, a k-reciprocal feature is calculated by encoding its k-reciprocal nearest neighbors into a single vector, which is used for re-ranking under the Jaccard distance. The final distance is computed as the combination of the original distance and the Jaccard distance. Our re-ranking method does not require any human interaction or any labeled data, so it is applicable to large-scale datasets. Experiments on the large-scale Market-1501, CUHK03, MARS, and PRW datasets confirm the effectiveness of our method.
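A simplified NumPy sketch of the core idea, omitting the paper's query expansion and soft (weighted) neighbour sets; `k` and the blending weight are illustrative.

```python
import numpy as np

def k_reciprocal_rerank(dist, k=20, lam=0.3):
    """Simplified k-reciprocal re-ranking: build each sample's k-reciprocal
    neighbour set from a full pairwise distance matrix, measure set overlap
    with the Jaccard distance, and blend it with the original distance."""
    n = dist.shape[0]
    knn = np.argsort(dist, axis=1)[:, :k]            # k nearest neighbours
    neighbours = [set(row) for row in knn]
    reciprocal = [{j for j in neighbours[i] if i in neighbours[j]}
                  for i in range(n)]
    jaccard = np.ones_like(dist)
    for i in range(n):
        for j in range(n):
            union = len(reciprocal[i] | reciprocal[j])
            if union:
                jaccard[i, j] = 1.0 - len(reciprocal[i] & reciprocal[j]) / union
    return lam * dist + (1.0 - lam) * jaccard

dist = np.random.rand(50, 50); dist = (dist + dist.T) / 2
np.fill_diagonal(dist, 0)
reranked = k_reciprocal_rerank(dist)
```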
Li, Y., Kong, X., Zheng, L. & Tian, Q. 2016, 'Exploiting Hierarchical Activations of Neural Network for Image Retrieval', MM '16 ACM Multimedia Conference, ACM International Conference on Multimedia, ACM, Amsterdam, Netherlands.
Xie, L., Zheng, L., Wang, J., Yuille, A. & Tian, Q. 2016, 'Interactive: Inter-layer activeness propagation', Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, pp. 270-279.
An increasing number of computer vision tasks can be tackled with deep features, which are the intermediate outputs of a pre-trained Convolutional Neural Network. Despite the astonishing performance, deep features extracted from low-level neurons are still unsatisfactory, arguably because they cannot access the spatial context contained in the higher layers. In this paper, we present InterActive, a novel algorithm which computes the activeness of neurons and network connections. Activeness is propagated through a neural network in a top-down manner, carrying high-level context and improving the descriptive power of low-level and mid-level neurons. Visualization indicates that neuron activeness can be interpreted as spatially weighted neuron responses. We achieve state-of-the-art classification performance on a wide range of image datasets.
Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S. & Tian, Q. 2016, 'MARS: A Video Benchmark for Large-Scale Person Re-identification', Computer Vision – ECCV 2016 (LNCS), European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 868-884.
This paper considers person re-identification (re-id) in videos. We introduce a new video re-id dataset, named Motion Analysis and Re-identification Set (MARS), a video extension of the Market-1501 dataset. To our knowledge, MARS is the largest video re-id dataset to date. Containing 1,261 IDs and around 20,000 tracklets, it provides rich visual information compared to image-based datasets. Meanwhile, MARS comes a step closer to practice: the tracklets are automatically generated by the Deformable Part Model (DPM) as the pedestrian detector and the GMMCP tracker. A number of false detection/tracking results are also included as distractors, which would exist predominantly in practical video databases. Extensive evaluation of state-of-the-art methods, including space-time descriptors and CNNs, is presented. We show that a CNN in classification mode can be trained from scratch using the consecutive bounding boxes of each identity. The learned CNN embedding outperforms other competing methods considerably and has good generalization ability on other video re-id datasets upon fine-tuning.
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J. & Tian, Q. 2015, 'Scalable person re-identification: A benchmark', Proceedings 2015 IEEE International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1116-1124.
This paper contributes a new high-quality dataset for person re-identification, named "Market-1501". Generally, current datasets 1) are limited in scale, 2) consist of hand-drawn bounding boxes, which are unavailable under realistic settings, and 3) have only one ground truth and one query image for each identity (closed environment). To tackle these problems, the proposed Market-1501 dataset features three aspects. First, it contains over 32,000 annotated bounding boxes, plus a distractor set of over 500K images, making it the largest person re-id dataset to date. Second, images in the Market-1501 dataset are produced using the Deformable Part Model (DPM) as the pedestrian detector. Third, our dataset is collected in an open system, where each identity has multiple images under each camera. As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor. We view person re-identification as a special task of image search. In experiments, we show that the proposed descriptor yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500K dataset.
Zheng, L., Wang, S. & Tian, Q. 2015, 'Coloring image search with coupled multi-index', 2015 IEEE China Summit and International Conference on Signal and Information Processing, ChinaSIP 2015 - Proceedings, IEEE China Summit and International Conference on Signal and Information Processing, IEEE, Chengdu, China, pp. 137-141.
The precision of visual matching and the trade-off between accuracy and time efficiency have long been bottlenecks of image search systems. This work addresses the two problems simultaneously by introducing the coupled Multi-Index (c-MI) structure. First, by combining SIFT and color features at the indexing level, the discriminative power of visual words is greatly enhanced. Second, by reducing the number of inverted entries to be traversed, c-MI brings about significant improvement in time efficiency. Experiments are performed on two widely used benchmark datasets. We demonstrate both state-of-the-art image search accuracy and query time cut by half.
Zheng, L., Wang, S., Tian, L., He, F., Liu, Z. & Tian, Q. 2015, 'Query-adaptive late fusion for image search and person re-identification', Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, pp. 1741-1750.
Feature fusion has been proven effective [35, 36] in image search. Typically, it is assumed that the to-be-fused heterogeneous features work well by themselves for the query. However, in a more realistic situation, one does not know in advance whether a feature is effective or not for a given query. As a result, it is of great importance to identify feature effectiveness in a query-adaptive manner.
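As a loose sketch of such query-adaptive weighting (not the paper's exact effectiveness measure), one can compare each feature's sorted score curve for the query against a reference curve collected offline on irrelevant data, and weight the feature by how much more sharply its curve drops:

```python
import numpy as np

def query_adaptive_fusion(score_lists, reference_curves):
    """Simplified query-adaptive late fusion: a feature whose normalised
    score curve falls below the offline reference curve (i.e., drops
    sharply) is deemed effective and earns a larger weight. A schematic
    rendering of the idea, not the published weighting function."""
    fused = np.zeros_like(score_lists[0])
    for scores, ref in zip(score_lists, reference_curves):
        curve = np.sort(scores)[::-1]
        curve = curve / (curve.max() + 1e-12)          # normalise to [0, 1]
        # Effectiveness: area by which the curve falls below the reference.
        weight = max(np.trapz(ref) - np.trapz(curve), 0.0)
        fused += weight * scores / (np.linalg.norm(scores) + 1e-12)
    return fused

scores_a, scores_b = np.random.rand(100), np.random.rand(100)
ref = np.linspace(1.0, 0.8, 100)       # hypothetical reference curve
fused = query_adaptive_fusion([scores_a, scores_b], [ref, ref])
```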
Liu, Z., Wang, S., Zheng, L. & Tian, Q. 2014, 'Visual reranking with improved image graph', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Florence, Italy, pp. 6889-6893.
This paper introduces an improved reranking method for Bag-of-Words (BoW) based image search. Building on prior work, a directed image graph robust to outlier distraction is proposed. In our approach, the relevance among images is encoded in the image graph, based on which the initial rank list is refined. Moreover, we show that rank-level feature fusion can be adopted in this reranking method as well. Taking advantage of the complementary nature of various features, the reranking performance is further enhanced. In particular, we exploit the reranking method by combining BoW and color information. Experiments on two benchmark datasets demonstrate that our method yields significant improvements and that the reranking results are competitive with state-of-the-art methods.
Zheng, L., Wang, S., Liu, Z. & Tian, Q. 2014, 'Packing and padding: Coupled multi-index for accurate image retrieval', 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, pp. 4321-4328.
In Bag-of-Words (BoW) based image retrieval, the SIFT visual word has a low discriminative power, so false positive matches occur prevalently. Apart from the information loss during quantization, another cause is that the SIFT feature only describes the local gradient distribution. To address this problem, this paper proposes a coupled Multi-Index (c-MI) framework to perform feature fusion at the indexing level. Basically, complementary features are coupled into a multi-dimensional inverted index. Each dimension of c-MI corresponds to one kind of feature, and the retrieval process votes for images similar in both SIFT and other feature spaces. Specifically, we exploit the fusion of a local color feature into c-MI. While the precision of visual matching is greatly enhanced, we adopt Multiple Assignment to improve recall. The joint cooperation of SIFT and color features significantly reduces the impact of false positive matches. Extensive experiments on several benchmark datasets demonstrate that c-MI improves retrieval accuracy significantly, while consuming only half of the query time compared to the baseline. Importantly, we show that c-MI is well complementary to many prior techniques. Assembling these methods, we obtain an mAP of 85.8% and an N-S score of 3.85 on the Holidays and Ukbench datasets, respectively, which compare favorably with the state of the art.
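A toy sketch of the coupled index: each feature is keyed by a (SIFT word, color word) pair, so only candidates matching in both spaces are retrieved, with Multiple Assignment probing neighbouring color words to recover recall. Names and structure here are illustrative, not the paper's implementation.

```python
from collections import defaultdict

def build_cmi(features):
    """Toy coupled Multi-Index: each local feature is indexed by the pair
    (sift_word, color_word). `features` is an iterable of
    (image_id, sift_word, color_word) triples."""
    index = defaultdict(list)
    for image_id, sift_word, color_word in features:
        index[(sift_word, color_word)].append(image_id)
    return index

def query_cmi(index, sift_word, color_word, color_neighbours=()):
    # Multiple Assignment: also probe neighbouring color words for recall.
    candidates = []
    for cw in (color_word, *color_neighbours):
        candidates.extend(index.get((sift_word, cw), []))
    return candidates

index = build_cmi([(0, 12, 3), (1, 12, 4), (2, 7, 3)])
print(query_cmi(index, 12, 3, color_neighbours=(4,)))  # [0, 1]
```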
Zheng, L., Wang, S., Zhou, W. & Tian, Q. 2014, 'Bayes merging of multiple vocabularies for scalable image retrieval', Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference, IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, pp. 1963-1970.
In the Bag-of-Words (BoW) model, the vocabulary is of key importance. Typically, multiple vocabularies are generated to correct quantization artifacts and improve recall. However, this routine is corrupted by vocabulary correlation, i.e., overlapping among different vocabularies. Vocabulary correlation leads to an over-counting of the indexed features in the overlapped area, or the intersection set, thus compromising the retrieval accuracy. In order to address the correlation problem while preserving the benefit of high recall, this paper proposes a Bayes merging approach to down-weight the indexed features in the intersection set. Through explicitly modeling the correlation problem in a probabilistic view, a joint similarity on both the image and feature levels is estimated for the indexed features in the intersection set. We evaluate our method on three benchmark datasets. Albeit simple, Bayes merging can be well applied in various merging tasks, and consistently improves the baselines on multi-vocabulary merging. Moreover, Bayes merging is efficient in terms of both time and memory cost, and yields competitive performance with the state-of-the-art methods.
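Very loosely, the down-weighting idea can be sketched as follows; the published method estimates the weight probabilistically per indexed feature, while this toy version applies one illustrative constant to anything returned by more than one vocabulary.

```python
from collections import defaultdict

def merge_vocabularies(rank_postings, intersection_weight=0.5):
    """Toy multi-vocabulary merging with down-weighting: candidates
    returned by several vocabularies (the intersection set) would be
    over-counted under naive merging, so their votes are scaled down.
    `rank_postings` is a list (one per vocabulary) of {image_id: votes}."""
    counts = defaultdict(int)
    for postings in rank_postings:
        for image_id in postings:
            counts[image_id] += 1
    scores = defaultdict(float)
    for postings in rank_postings:
        for image_id, votes in postings.items():
            w = intersection_weight if counts[image_id] > 1 else 1.0
            scores[image_id] += w * votes
    return sorted(scores.items(), key=lambda s: -s[1])

print(merge_vocabularies([{0: 3.0, 1: 1.0}, {0: 2.0, 2: 2.0}]))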
Zheng, L., Wang, S., Liu, Z. & Tian, Q. 2013, 'Lp-Norm IDF for Large Scale Image Search', IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, OR, USA, pp. 3604-3617.
The Inverse Document Frequency (IDF) is prevalently utilized in Bag-of-Words based image search. The basic idea is to assign less weight to terms with high frequency, and vice versa. However, the estimation of visual word frequency is coarse and heuristic. Therefore, the effectiveness of the conventional IDF routine is marginal, and far from optimal. To tackle this problem, this paper introduces a novel IDF expression by the use of the Lp-norm pooling technique. Carefully designed, the proposed IDF takes into account the term frequency, document frequency, the complexity of images, as well as the codebook information. Optimizing the IDF function towards an optimal balance between TF and pIDF weights yields the so-called Lp-norm IDF (pIDF). We show that the conventional IDF is a special case of our generalized version, and two novel IDFs, i.e., the average IDF and the max IDF, can also be derived from our formula. Further, by accounting for the term frequency in each image, the proposed Lp-norm IDF helps to alleviate the visual word burstiness phenomenon. Our method is evaluated through extensive experiments on three benchmark datasets (Oxford 5K, Paris 6K and Flickr 1M). We report a performance improvement of as large as 27.1% over the baseline approach. Moreover, since the Lp-norm IDF is computed offline, no extra computation or memory cost is introduced to the system.
Zheng, L., Wang, T., Wang, S. & Ding, X. 2011, 'DSW feature based Hidden Markov Model: An application on object identification', Proceedings of the 2011 International Conference of Soft Computing and Pattern Recognition, SoCPaR 2011, International Conference of Soft Computing and Pattern Recognition, IEEE, Dalian, China, pp. 502-506.
This paper proposes to perform palmprint identification with Hidden Markov Models (HMMs). Palmprint identification, as an emerging biometric technology, has been extensively investigated in the last decade. Due to its low-price capture device, fast implementation speed, and high accuracy, palmprint identification is very competitive in the biometric research area. Currently, the majority of the literature focuses on palm line extraction algorithms and coding schemes, with little attention to classifier design. In this paper, the Down-sliding Window (DSW) technique is employed to create a highly correlated feature sequence, while the palmprint is represented by simple down-sampled images. A one-to-50 experiment demonstrates that an HMM with a single component and six states gives the best overall performance of 99.80%, which indicates the feasibility of HMMs for palmprint identification tasks.
Sun, Y., Zheng, L., Deng, W. & Wang, S. 2017, 'SVDNet for Pedestrian Retrieval', Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, Venice, Italy.
This paper proposes the SVDNet for retrieval problems, with a focus on the application of person re-identification (re-ID). We view each weight vector within a fully connected (FC) layer in a convolutional neural network (CNN) as a projection basis. It is observed that the weight vectors are usually highly correlated. This problem leads to correlations among entries of the FC descriptor, and compromises the retrieval performance based on the Euclidean distance. To address the problem, this paper proposes to optimize the deep representation learning process with Singular Vector Decomposition (SVD). Specifically, with the restraint and relaxation iteration (RRI) training scheme, we are able to iteratively integrate the orthogonality constraint in CNN training, yielding the so-called SVDNet. We conduct experiments on the Market-1501, CUHK03, and DukeMTMC-reID datasets, and show that RRI effectively reduces the correlation among the projection vectors, produces more discriminative FC descriptors, and significantly improves the re-ID accuracy. On the Market-1501 dataset, for instance, rank-1 accuracy is improved from 55.3% to 80.5% for CaffeNet, and from 73.8% to 82.3% for ResNet-50.
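The restraint step at the heart of RRI is essentially a one-line SVD substitution; a PyTorch sketch under the assumption that the layer's input dimension is at least its output dimension (as in a 2048-to-1024 Eigenlayer). The alternating restraint/relaxation schedule is omitted.

```python
import torch

@torch.no_grad()
def restraint_step(fc_layer):
    """Restraint step of an RRI-style loop, sketched: take the FC weight
    matrix W (columns = projection basis vectors), decompose W = U S V^T,
    and replace W with U S, whose columns are mutually orthogonal.
    Assumes in_features >= out_features."""
    W = fc_layer.weight.data.t()                     # (in, out)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    fc_layer.weight.data = (U * S).t()               # W <- U S

fc = torch.nn.Linear(2048, 1024, bias=False)
restraint_step(fc)
W = fc.weight.data.t()
# Columns are now orthogonal: W^T W is (numerically) diagonal.
print(torch.allclose(W.t() @ W, torch.diag((W * W).sum(0)), atol=1e-3))
```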
Label estimation is an important component in an unsupervised person re-identification (re-ID) system. This paper focuses on cross-camera label estimation, which can be subsequently used in feature learning to learn robust re-ID models. Specifically, we propose to construct a graph for samples in each camera, and then a graph matching scheme is introduced for cross-camera labeling association. While labels directly output from existing graph matching methods may be noisy and inaccurate due to significant cross-camera variations, this paper proposes a dynamic graph matching (DGM) method. DGM iteratively updates the image graph and the label estimation process by learning a better feature space with intermediate estimated labels. DGM is advantageous in two aspects: 1) the accuracy of estimated labels is improved significantly with the iterations; 2) DGM is robust to noisy initial training data. Extensive experiments conducted on three benchmarks, including the large-scale MARS dataset, show that DGM yields competitive performance to fully supervised baselines, and outperforms competing unsupervised learning methods.
Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems will be reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of the hand-crafted systems and the large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) finally briefs some important yet under-developed issues.
The objective of this paper is the effective transfer of the Convolutional Neural Network (CNN) feature in image search and classification. Systematically, we study three facts in CNN transfer. 1) We demonstrate the advantage of using images with a properly large size as input to the CNN instead of the conventionally resized one. 2) We benchmark the performance of different CNN layers improved by average/max pooling on the feature maps. Our observation suggests that the Conv5 feature yields very competitive accuracy with such a pooling step. 3) We find that the simple combination of pooled features extracted across various CNN layers is effective in collecting evidence from both low- and high-level descriptors. Following these good practices, we are capable of improving the state of the art on a number of benchmarks by a large margin.
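A sketch of these practices using a recent torchvision (the layer tap indices and input size are illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def pooled_multilayer_features(images):
    """Run images through a pre-trained CNN, average-pool the feature maps
    of several convolutional blocks, and concatenate the pooled vectors so
    low- and high-level evidence are combined. VGG-16 tap indices below
    mark the ends of conv blocks 1-5 (an illustrative choice)."""
    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
    taps = {4, 9, 16, 23, 30}
    descriptors = []
    x = images
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in taps:
                pooled = x.mean(dim=(2, 3))        # average pooling
                descriptors.append(F.normalize(pooled, dim=1))
    return torch.cat(descriptors, dim=1)

# A "properly large" input size, per practice 1) above (448 illustrative).
feats = pooled_multilayer_features(torch.randn(2, 3, 448, 448))
print(feats.shape)
```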