Litao Yu is currently a research fellow in the School of Electrical and Data Engineering, University of Technology Sydney. He obtained his PhD from the School of Information Technology and Electrical Engineering, The University of Queensland, in 2016. He then worked as a post-doctoral research fellow with Prof. Michael Milford at the Australian Centre for Robotic Vision, Queensland University of Technology, in 2017, and as a research fellow with Prof. Yongsheng Gao at the Institute for Integrated and Intelligent Systems (IIIS), Griffith University, from 2018 to 2019. He is an active researcher in machine learning and multimedia content analysis. Beyond research in frontier technologies, he has also conducted linkage projects that bridge the gap between academic research and industry applications, such as the ARC Research Hub for Driving Farming Productivity and Disease Prevention, and Deep Learning based Automatic Fish Processing (sponsored by Food Agility CRC).
Litao Yu has published in high-quality venues, including IEEE Transactions on Image Processing, the ACM International Conference on Multimedia, and IEEE Transactions on Neural Networks and Learning Systems. He has also been invited to review for renowned journals such as IEEE Transactions on Multimedia.
Litao Yu's research interests include machine learning, image processing, multimedia information retrieval and semantic understanding.
Yu, L, Jacobson, A & Milford, M 2018, 'Rhythmic representations: Learning periodic patterns for scalable place recognition at a sublinear storage cost', IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 811-818.
© 2018 IEEE. Robotic and animal mapping systems share many challenges and characteristics: they must function in a wide variety of environmental conditions, enable the robot or animal to navigate effectively to find food or shelter, and be computationally tractable from both a speed and storage perspective. With regards to map storage, the mammalian brain appears to take a diametrically opposed approach to all current robotic mapping systems. Where robotic mapping systems attempt to solve the data association problem to minimize representational aliasing, neurons in the brain intentionally break data association by encoding large (potentially unlimited) numbers of places with a single neuron. In this letter, we propose a novel method based on supervised learning techniques that seeks out regularly repeating visual patterns in the environment with mutually complementary co-prime frequencies, and an encoding scheme that enables storage requirements to grow sublinearly with the size of the environment being mapped. To improve robustness in challenging real-world environments while maintaining storage growth sublinearity, we incorporate both multi-exemplar learning and data augmentation techniques. Using large benchmark robotic mapping datasets, we demonstrate the combined system achieving high-performance place recognition with sublinear storage requirements and characterize the performance-storage growth tradeoff curve. The work serves as the first robotic mapping system with sublinear storage scaling properties, as well as the first large-scale demonstration in real-world environments of one of the proposed memory benefits of these neurons.
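The sublinear-storage idea in the abstract can be illustrated with a toy arithmetic sketch: if each place is represented only by its phase within a set of co-prime periodic patterns, the number of distinguishable places grows as the product of the periods while the stored templates grow only as their sum. This is a hedged illustration of the counting argument only, not the paper's learned visual patterns; all names below are hypothetical.

```python
from math import prod

# Co-prime periods: storage grows with sum(periods), capacity with prod(periods).
periods = [3, 5, 7, 11]
capacity = prod(periods)   # 1155 distinguishable places
storage = sum(periods)     # only 26 periodic templates to store

def encode(place):
    """A place is represented by its phase within each periodic pattern."""
    return [place % p for p in periods]

def decode(residues):
    """Brute-force Chinese-remainder decoding, kept simple for clarity."""
    for x in range(capacity):
        if all(x % p == r for p, r in zip(periods, residues)):
            return x

place = 842
assert decode(encode(place)) == place
```

Because the periods are pairwise co-prime, the Chinese remainder theorem guarantees the residue tuple is unique over the full product range, which is where the sublinear storage-to-capacity ratio comes from.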
Yu, L, Huang, Z, Shen, F, Song, J, Shen, HT & Zhou, X 2017, 'Bilinear Optimized Product Quantization for Scalable Visual Content Analysis', IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 5057-5069.
© 1992-2012 IEEE. Product quantization (PQ) has been recognized as a useful technique to encode visual feature vectors into compact codes to reduce both the storage and computation cost. Recent advances in retrieval and vision tasks indicate that high-dimensional descriptors are critical to ensuring high accuracy on large-scale data sets. However, optimizing PQ codes with high-dimensional data is extremely time- and memory-consuming. To solve this problem, in this paper, we present a novel PQ method based on bilinear projection, which can well exploit the natural data structure and reduce the computational complexity. Specifically, we learn a global bilinear projection for PQ, where we provide both non-parametric and parametric solutions. The non-parametric solution does not need any data distribution assumption. The parametric solution can avoid the problem of local optima caused by random initialization, and enjoys a theoretical error bound. Besides, we further extend this approach by learning locally bilinear projections to fit underlying data distributions. We show by extensive experiments that our proposed method, dubbed bilinear optimization product quantization, achieves competitive retrieval and classification accuracies while having significantly lower time and space complexities.
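As a rough illustration of plain product quantization (the baseline that the bilinear-optimized method above builds on), the sketch below splits each vector into subvectors and quantizes each subspace with k-means, so a high-dimensional float vector shrinks to a few byte-sized codes. Function names like `fit_pq` and `encode_pq` are hypothetical, and the paper's bilinear projection itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_pq(X, n_sub=4, n_centroids=16, n_iter=20):
    """Learn one codebook per subspace with plain k-means (Lloyd's algorithm)."""
    d = X.shape[1] // n_sub
    codebooks = []
    for m in range(n_sub):
        sub = X[:, m * d:(m + 1) * d]
        # initialise centroids from random training points
        C = sub[rng.choice(len(sub), n_centroids, replace=False)].copy()
        for _ in range(n_iter):
            assign = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for k in range(n_centroids):
                pts = sub[assign == k]
                if len(pts):
                    C[k] = pts.mean(0)
        codebooks.append(C)
    return codebooks

def encode_pq(X, codebooks):
    """Encode each vector as one centroid index per subspace."""
    n_sub = len(codebooks)
    d = X.shape[1] // n_sub
    codes = np.empty((len(X), n_sub), dtype=np.uint8)
    for m, C in enumerate(codebooks):
        sub = X[:, m * d:(m + 1) * d]
        codes[:, m] = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
    return codes

# Each 32-D float32 vector (128 bytes) becomes 4 one-byte codes.
X = rng.normal(size=(200, 32)).astype(np.float32)
codebooks = fit_pq(X)
codes = encode_pq(X, codebooks)
```

Approximate distances between a query and the database can then be computed from precomputed query-to-centroid tables rather than from the original vectors, which is where PQ's speed advantage comes from.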
Yu, L, Huang, Z, Cao, J & Shen, HT 2016, 'Scalable video event retrieval by visual state binary embedding', IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1590-1603.
© 1999-2012 IEEE. With the exponential increase of media data on the web, fast media retrieval is becoming a significant research topic in multimedia content analysis. Among the variety of techniques, learning binary embedding (hashing) functions is one of the most popular approaches that can achieve scalable information retrieval in large databases, and it is mainly used in near-duplicate multimedia search. However, to date most hashing methods have been specifically designed for near-duplicate retrieval at the visual level rather than the semantic level. In this paper, we propose a visual state binary embedding (VSBE) model to encode the video frames, which can preserve the essential semantic information in binary matrices, to facilitate fast video event retrieval in unconstrained cases. Compared with other video binary embedding models, one advantage of our proposed VSBE model is that it only needs a limited number of key frames from the training videos for hash function training, so the computational complexity is much lower in the training phase. At the same time, we apply the pairwise constraints generated from the visual states to sketch the local properties of the events at the semantic level, so accuracy is also ensured. We conducted extensive experiments on the challenging TRECVID MED dataset, which prove the superiority of our proposed VSBE model.
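Independent of VSBE's specifics, binary-embedding retrieval reduces to mapping features to short binary codes and ranking database items by Hamming distance. A minimal sketch, assuming a random-hyperplane hash as a stand-in for the learned hash functions (all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def hash_bits(X, W):
    """Sign of random projections -> binary codes (an LSH stand-in for a learned hash)."""
    return (X @ W > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable"), dists

d, n_bits = 64, 32
W = rng.normal(size=(d, n_bits))          # one random hyperplane per bit
db = rng.normal(size=(500, d))
db_codes = hash_bits(db, W)               # 500 items, 32 bits each

# A query that is a slightly perturbed copy of item 0 should land near it
# in Hamming space, since small feature changes rarely flip projection signs.
q = db[0] + 0.01 * rng.normal(size=d)
order, dists = hamming_rank(hash_bits(q[None], W)[0], db_codes)
```

Because Hamming distances are XOR-and-popcount operations on compact codes, this scan is far cheaper than exact nearest-neighbour search in the original feature space, which is the scalability argument hashing methods rely on.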
© 2016 Elsevier B.V. The task of multimedia event detection (MED) aims at training a set of models that can automatically detect the most event-relevant videos from large datasets. In this paper, we attempt to build a robust spatial-temporal deep neural network for large-scale video event detection. In our setting, each video follows a multiple instance assumption, where its visual segments contain both spatial and temporal properties of events. Regarding these properties, we try to implement the MED system by a two-step training phase: unsupervised recurrent video reconstruction and supervised fine-tuning. We conduct extensive experiments on the challenging TRECVID MED14 dataset, which indicate that with the consideration of both spatial and temporal information, the detection performance can be further boosted compared with the state-of-the-art MED models.
Yu, L, Yang, Y, Huang, Z, Wang, P, Song, J & Shen, HT 2016, 'Web video event recognition by semantic analysis from ubiquitous documents', IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5689-5701.
© 1992-2012 IEEE. In recent years, the task of event recognition from videos has attracted increasing interest in the multimedia area. While most of the existing research has mainly focused on exploring visual cues to handle relatively small-granular events, it is difficult to directly analyze video content without any prior knowledge. Therefore, synthesizing both visual and semantic analysis is a natural way for video event understanding. In this paper, we study the problem of Web video event recognition, where Web videos often describe large-granular events and carry limited textual information. Key challenges include how to accurately represent event semantics from incomplete textual information and how to effectively explore the correlation between visual and textual cues for video event understanding. We propose a novel framework to perform complex event recognition from Web videos. To compensate for the insufficient expressive power of visual cues, we construct an event knowledge base by deeply mining semantic information from ubiquitous Web documents. This event knowledge base is capable of describing each event with comprehensive semantics. By utilizing this base, the textual cues for a video can be significantly enriched. Furthermore, we introduce a two-view adaptive regression model, which explores the intrinsic correlation between the visual and textual cues of the videos to learn reliable classifiers. Extensive experiments on two real-world video data sets show the effectiveness of our proposed framework and prove that the event knowledge base indeed helps improve the performance of Web video event recognition.
Yu, L, Shao, J, Xu, XS & Shen, HT 2014, 'Max-margin adaptive model for complex video pattern recognition', Multimedia Tools and Applications, vol. 74, no. 2, pp. 505-521.
© 2014, Springer Science+Business Media New York. Pattern recognition models are usually used in a variety of applications ranging from video concept annotation to event detection. In this paper we propose a new framework called the max-margin adaptive (MMA) model for complex video pattern recognition, which can utilize a large number of unlabeled videos to assist the model training. The MMA model considers the data distribution consistence between labeled training videos and unlabeled auxiliary ones from the statistical perspective by learning an optimal mapping function which also broadens the margin between positive labeled videos and negative labeled videos to improve the robustness of the model. The experiments are conducted on two public datasets including CCV for video object/event detection and HMDB for action recognition. Our results demonstrate that the proposed MMA model is very effective on complex video pattern recognition tasks, and outperforms the state-of-the-art algorithms.
© 2018 Association for Computing Machinery. Product Quantisation (PQ) has been recognised as an effective encoding technique for scalable multimedia content analysis. In this paper, we propose a novel learning framework that enables an end-to-end encoding strategy from raw images to compact PQ codes. The system aims to learn both PQ encoding functions and codewords for content-based image retrieval. In detail, we first design a trainable encoding layer that is pluggable into neural networks, so the codewords can be trained via backpropagation. Then we integrate it into a Deep Convolutional Generative Adversarial Network (DC-GAN). In our proposed encoding framework, the raw images are directly encoded by passing through the convolutional and encoding layers, and the generator aims to use the codewords as constrained inputs to generate full image representations that are visually similar to the original images. By taking advantage of the generative adversarial model, our proposed system can produce high-quality PQ codewords and encoding functions for scalable multimedia retrieval tasks. Experiments show that the proposed architecture GA-PQ outperforms the state-of-the-art encoding techniques on three public image datasets.
Zhang, Q, Yu, L & Long, G 2015, 'SocialTrail: Recommending Social Trajectories from Location-Based Social Networks', Databases Theory and Applications (LNCS), Australasian Database Conference, Springer International Publishing, Melbourne, VIC, Australia, pp. 314-317.
Trajectory recommendation plays an important role in travel planning. Most existing systems are mainly designed for spot recommendation without an understanding of the overall trip, and tend to utilize homogeneous data only (e.g., geo-tagged images). Furthermore, they focus on the popularity of locations and fail to consider other important factors such as travelling time and sequence. In this paper, we propose a novel system that can not only integrate geo-tagged images and check-in data to discover meaningful social trajectories to enrich the travel information, but also take both temporal and spatial factors into consideration to make trajectory recommendations more accurate.