Dr Min Xu received her PhD in Information Technology from the University of Newcastle, Australia, her Master of Science (Computing) from the National University of Singapore, and her Bachelor of Engineering from the University of Science and Technology of China.
Dr Xu's expertise is in multimedia data (video, audio and text) analytics and computer vision. She has proposed several innovative methods for 1) multimedia affective/semantic content analysis, 2) multi-modality information analysis and fusion, and 3) personalised multimedia services. Recently, she has been focusing on applying machine learning algorithms (e.g. deep neural networks) to multimedia applications, including affective computing, image captioning and action recognition.
Dr Xu introduced audio keywords to assist video content analysis in 2003. Her proposed method outperformed most traditional visual-based methods and attracted considerable follow-up research on joint audio-visual content analysis. She further proposed a multi-modality mid-level representation framework to bridge the gap between low-level audio and video features and high-level video content. In 2006, she developed a video adaptation system based on the MPEG-21 Digital Item Adaptation framework. The proposed system, one of the earliest of its kind, considered users' preferences for video content as well as the usual bandwidth constraints, and provided personalised video access. Another of her recent achievements is affective content analysis using multiple modality features.
Dr Xu has published over 100 research papers in high-quality international journals and conferences. Her papers have received over 1,500 citations, reflecting her reputation in her research field.
Dr Xu has served on the program committees of many top international conferences, including ACM Multimedia, and as a reviewer for various highly rated international journals, such as IEEE Transactions on Multimedia and IEEE Transactions on Circuits and Systems for Video Technology. She is an Associate Editor of Neurocomputing.
Can supervise: YES
- Multimedia data analytics
- Affective multimedia computing
- Spatial-temporal information fusion
- Audio/Video signal processing
- Pattern Recognition
- Computer Vision
- Web Service Development
- Cloud Computing Infrastructure
- Internetworking Project
- Network Essentials
- LANs and Routing
- Network Design
- Image Processing and Pattern Recognition
Interactive Media is a new research field and a landmark in multimedia development. The Era of Interactive Media is an edited volume contributed by world experts working in academia, research institutions and industry. The Era of Interactive Media focuses mainly on Interactive Media and its various applications. This book also covers multimedia analysis and retrieval; multimedia security rights and management; multimedia compression and optimization; multimedia communication and networking; and multimedia systems and applications. The Era of Interactive Media is designed for a professional audience composed of practitioners and researchers working in the field of multimedia. Advanced-level students in computer science and electrical engineering will also find this book useful as a secondary text or reference.
Liang, Q, Wu, W, Yang, Y, Zhang, R, Peng, Y & Xu, M 2020, 'Multi-Player Tracking for Multi-View Sports Videos with Improved K-Shortest Path Algorithm', APPLIED SCIENCES-BASEL, vol. 10, no. 3.
Shi, Z, Pan, Q & Xu, M 2020, 'LSTM-Cubic A*-based auxiliary decision support system in air traffic management', NEUROCOMPUTING, vol. 391, pp. 167-176.
Usman, M, Jan, MA, Jolfaei, A, Xu, M, He, X & Chen, J 2020, 'A Distributed and Anonymous Data Collection Framework Based on Multilevel Edge Computing Architecture', IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, vol. 16, no. 9, pp. 6114-6123.
Wang, Z, Xu, M, Ye, N, Xiao, F, Ruchuan, WN & Huang, H 2020, 'Computer Vision-assisted 3D Object Localization via COTS RFID Devices and a Monocular Camera', IEEE Transactions on Mobile Computing.
In most RFID localization systems, acquiring a reader antenna's position at each sampling time is challenging, especially for antenna-carrying robot or drone systems with unpredictable trajectories. In this paper, we present RF-MVO, which fuses RFID and computer vision for stationary RFID localization in 3D space by attaching a lightweight 2D monocular camera to two reader antennas in parallel. Firstly, existing monocular visual odometry only recovers a camera/antenna trajectory in the camera view from 2D images. By combining it with RF phase, we design a model to estimate a scale factor for real-world trajectory transformation, along with the spatial directions of an RFID tag relative to a virtual antenna array formed by the mobility of each antenna. We then propose a novel RFID localization algorithm that does not require exhaustively searching all possible positions within a pre-specified region. Secondly, to speed up the searching process and improve localization accuracy, we propose a coarse-to-fine optimization algorithm. Thirdly, we introduce the concept of horizontal dilution of precision (HDOP) to measure the confidence level of localization results. Our experiments demonstrate the effectiveness of the proposed algorithms and show that RF-MVO can achieve a 6.23 cm localization error.
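For a sense of how an HDOP-style confidence measure can be computed, here is a minimal sketch using the standard GNSS-style geometry-matrix formulation; the antenna waypoints, tag estimate and exact HDOP definition in the paper may differ, so treat all names and numbers as illustrative.

```python
import numpy as np

def hdop(tag_pos, antenna_positions):
    """Horizontal dilution of precision for a tag estimate, given the
    antenna sampling positions constraining it (GNSS-style geometry)."""
    diffs = antenna_positions - tag_pos           # line-of-sight vectors
    ranges = np.linalg.norm(diffs, axis=1, keepdims=True)
    G = diffs / ranges                            # unit-direction geometry matrix
    Q = np.linalg.inv(G.T @ G)                    # DOP (cofactor) matrix
    return float(np.sqrt(Q[0, 0] + Q[1, 1]))     # horizontal components only

# A spread-out antenna trajectory constrains the tag well (low HDOP);
# clustered sampling positions would give a high HDOP (low confidence).
tag = np.array([0.0, 0.0, 1.0])
waypoints = np.array([[2, 0, 0], [0, 2, 0], [-2, 0, 0], [0, -2, 0]], float)
print(hdop(tag, waypoints))                       # ~1.12
```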
Wu, L, Xu, M, Qian, S & Cui, J 2020, 'Image to Modern Chinese Poetry Creation via a Constrained Topic-aware Model', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 16, no. 2.
Artificial creativity has attracted increasing research attention in the field of multimedia and artificial intelligence. Despite the promising work on poetry/painting/music generation, creating modern Chinese poetry from images, which can significantly enrich the functionality of photo-sharing platforms, has rarely been explored. Moreover, existing generation models cannot tackle three challenges in this task: (1) maintaining semantic consistency between images and poems; (2) preventing topic drift in the generation; (3) avoiding certain words appearing too frequently. These three points are common challenges in other sequence generation tasks as well. In this article, we propose a Constrained Topic-aware Model (CTAM) to create modern Chinese poems from images with regard to the challenges above. Without an image-poetry paired dataset, we construct a visual semantic vector to embed visual contents via image captions. For the topic-drift problem, we propose a topic-aware poetry generation model. Additionally, we design an Anti-frequency Decoding (AFD) scheme to constrain high-frequency characters in the generation. Experimental results show that our model achieves promising performance and is effective in terms of the poetry's readability and semantic consistency.
Wu, L, Xu, M, Wang, J & Perry, S 2020, 'Recall What You See Continually Using GridLSTM in Image Captioning', IEEE Transactions on Multimedia, vol. 22, no. 3, pp. 808-818.
The goal of image captioning is to automatically describe an image with a sentence, and the task has attracted research attention from both the computer vision and natural-language processing research communities. The existing encoder–decoder model and its variants, which are the most popular models for image captioning, use the image features in three ways: first, they inject the encoded image features into the decoder only once at the initial step, which does not enable the rich image content to be explored sufficiently while gradually generating a text caption; second, they concatenate the encoded image features with text as extra inputs at every step, which introduces unnecessary noise; and, third, they use an attention mechanism, which increases the computational complexity due to the introduction of extra neural nets to identify the attention regions. Different from the existing methods, in this paper, we propose a novel network, the Recall Network, for generating captions that are consistent with the images. The Recall Network selectively involves the visual features by using a GridLSTM and, thus, is able to recall image contents while generating each word. By importing the visual information as latent memory along the depth dimension of the LSTM, the decoder is able to admit the visual features dynamically through the inherent LSTM structure without adding any extra neural nets or parameters. The Recall Network efficiently prevents the decoder from deviating from the original image content. To verify the efficiency of our model, we conducted exhaustive experiments on full and dense image captioning. The experimental results clearly demonstrate that our Recall Network outperforms the conventional encoder–decoder model by a large margin and that it performs comparably to the state-of-the-art methods.
Zhang, H & Xu, M 2020, 'Improving the generalization performance of deep networks by dual pattern learning with adversarial adaptation', KNOWLEDGE-BASED SYSTEMS, vol. 200.
Zhang, R, Wu, L, Yang, Y, Wu, W, Chen, Y & Xu, M 2020, 'Multi-camera multi-player tracking with deep player identification in sports video', PATTERN RECOGNITION, vol. 102.
Chen, X, Kong, X, Xu, M, Sandrasegaran, K & Zheng, J 2019, 'Road Vehicle Detection and Classification Using Magnetic Field Measurement', IEEE ACCESS, vol. 7, pp. 52622-52633.
Hu, S, Xu, M, Zhang, H, Xiao, C & Gui, C 2019, 'Affective Content-aware Adaptation Scheme on QoE Optimization of Adaptive Streaming over HTTP', ACM Transactions on Multimedia Computing Communications and Applications, vol. 15, no. 3s, pp. 100-118.
This article presents a novel affective content-aware adaptation scheme (ACAA) to optimize Quality of Experience (QoE) for dynamic adaptive video streaming over HTTP (DASH). Most existing DASH adaptation schemes conduct video bit-rate adaptation based on an estimation of available network resources, ignoring user preference for the affective content (AC) embedded in the video data streamed over the network. Since personal demand for AC varies considerably among viewers, satisfying individual affective demand is critical to improving QoE in commercial video services. However, the results of video affective analysis cannot be applied to a current adaptive streaming scheme directly. By correlating the AC distribution in a user's viewing history with each streamed segment, an affective relevancy can be inferred as an affective metric for that segment. We have further proposed the ACAA scheme to optimize QoE for user-desired affective content while taking into account both network status and affective relevancy. We have implemented ACAA over a realistic trace-based evaluation and compared its network performance and QoE with those of Probe and Adaptation (PANDA), buffer-based adaptation (BBA), and Model Predictive Control (MPC). Experimental results show that ACAA can preserve available buffer time for upcoming affective content matching a viewer's individual preference, achieving better QoE on affective content than on normal content while keeping the overall QoE satisfactory.
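To make the adaptation idea concrete, below is a toy bit-rate selection sketch that trades spare buffer for affectively relevant segments. The function name, thresholds and weighting are hypothetical illustrations, not the ACAA algorithm itself.

```python
def select_bitrate(levels_kbps, bandwidth_kbps, buffer_s, affective_relevancy,
                   buffer_target_s=20.0, boost=0.5):
    """Pick a bit-rate for the next segment: start from what the network
    supports, then spend spare buffer on segments the viewer cares about.
    levels_kbps is assumed sorted ascending; affective_relevancy in [0, 1]
    would come from matching the segment's affective content against the
    user's viewing history (illustrative stand-in)."""
    # Base choice: highest level sustainable at the estimated bandwidth.
    sustainable = [r for r in levels_kbps if r <= bandwidth_kbps]
    base = max(sustainable) if sustainable else min(levels_kbps)
    idx = levels_kbps.index(base)
    # Healthy buffer + relevant segment: step up; draining buffer: step down.
    if buffer_s > buffer_target_s and affective_relevancy > boost:
        idx = min(idx + 1, len(levels_kbps) - 1)
    elif buffer_s < 0.5 * buffer_target_s:
        idx = max(idx - 1, 0)
    return levels_kbps[idx]

print(select_bitrate([300, 750, 1200, 2400], bandwidth_kbps=1500,
                     buffer_s=25.0, affective_relevancy=0.8))  # -> 2400
```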
Nie, X, Wang, L, Ding, H & Xu, M 2019, 'Strawberry Verticillium Wilt Detection Network Based on Multi-Task Learning and Attention', IEEE Access, vol. 7, pp. 170003-170011.
Plant disease detection has an inestimable effect on plant cultivation. Accurate detection of plant disease can control the spread of disease early and prevent unnecessary loss. Strawberry verticillium wilt is a soil-borne, multi-symptomatic disease. To detect strawberry verticillium wilt accurately, we first propose a disease detection network based on Faster R-CNN and multi-task learning. Then, the strawberry verticillium wilt detection network (SVWDN), which uses attention mechanisms in the feature extraction of the disease detection network, is proposed. SVWDN detects verticillium wilt according to the symptoms of detected plant components (i.e., young leaves and petioles). Compared with other existing methods that detect disease from the whole plant's appearance, SVWDN automatically classifies the petioles and young leaves while determining whether the strawberry has verticillium wilt. To provide a dataset for evaluating and testing our method, we construct a large dataset that contains 3,531 images with 4 categories (Healthy-leaf, Healthy-petiole, Verticillium-leaf and Verticillium-petiole). Each image also has a label indicating whether the strawberry is suffering from verticillium wilt. With the proposed network, we achieved a mAP of 77.54% on object detection of the 4 categories and 99.95% accuracy for strawberry verticillium wilt detection.
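As a starting point for reproducing the detection backbone, a torchvision Faster R-CNN can be configured for the four component categories; this sketch omits the paper's attention mechanism and multi-task wilt head, and the class-index mapping is an assumption.

```python
import torch
import torchvision

# 4 component classes (healthy/wilted x leaf/petiole) + background; the
# index mapping below is hypothetical. Weights are untrained in this sketch;
# SVWDN additionally adds attention and a plant-level wilt classifier.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=5)
model.eval()

image = torch.rand(3, 600, 800)                  # stand-in strawberry photo
with torch.no_grad():
    pred = model([image])[0]                     # dict: boxes, labels, scores
wilt_parts = pred['boxes'][pred['labels'] >= 3]  # assume classes 3-4 = wilted
print(wilt_parts.shape)
```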
Rao, T, Li, X, Zhang, H & Xu, M 2019, 'Multi-level region-based Convolutional Neural Network for image emotion classification', Neurocomputing, vol. 333, pp. 429-439.
Analyzing the emotional information of visual content has attracted growing attention owing to the tendency of internet users to share their feelings via images and videos online. In this paper, we investigate the problem of affective image analysis, which is very challenging due to its complexity and subjectivity. Previous research reveals that image emotion is related to low-level through high-level visual features from both global and local views, while most current approaches only focus on improving emotion recognition performance based on single-level visual features from a global view. Aiming to utilize different levels of visual features from both global and local views, we propose a multi-level region-based Convolutional Neural Network (CNN) framework to discover the sentimental response of local regions. We first employ a Feature Pyramid Network (FPN) to extract multi-level deep representations. Then, an emotional region proposal method is used to generate proper local regions and remove excessive non-emotional regions for image emotion classification. Finally, to deal with the subjectivity of emotional labels, we propose a multi-task loss function that takes into consideration the probabilities of images belonging to different emotion classes. Extensive experiments show that our method outperforms state-of-the-art approaches on various commonly used benchmark datasets.
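One common way to train against per-image emotion probability distributions, in the spirit of the loss described above, is a KL-divergence loss over annotator vote distributions. This is a generic sketch under that assumption, not the paper's exact multi-task loss.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, vote_counts):
    """KL divergence between the predicted emotion distribution and the
    distribution of annotator votes (a stand-in for a loss over
    emotion-class probabilities)."""
    target = vote_counts / vote_counts.sum(dim=1, keepdim=True)
    log_pred = F.log_softmax(logits, dim=1)
    return F.kl_div(log_pred, target, reduction='batchmean')

logits = torch.randn(2, 8)                        # 8 emotion classes
votes = torch.tensor([[5., 2., 0., 1., 0., 0., 0., 0.],
                      [0., 0., 7., 0., 1., 0., 0., 0.]])
print(soft_label_loss(logits, votes))
```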
Semantic embedding has demonstrated its value in latent representation learning of data and can be effectively adopted for many applications. However, it is difficult to propose a joint learning framework for semantic embedding in Community Question Answering (CQA), because CQA data have multi-view and sparse properties. In this paper, we propose a generic Multi-modal Multi-view Semantic Embedding (MMSE) framework via a Bayesian model for question answering. Compared with existing semantic learning methods, the proposed model has two main advantages: (1) to deal with the multi-view property, we utilize the Gaussian topic model to learn semantic embedding from both a local view and a global view; (2) to deal with the sparsity of question-answer pairs in CQA, social structure information is incorporated to enhance the quality of general text content semantic embedding from other answers, using a shared topic distribution to model the relationship between the two modalities (user relationships and text content). We evaluate our model on question answering and expert finding tasks, and the experimental results on two real-world datasets show the effectiveness of our MMSE model for semantic embedding learning.
Usman, M, He, X, Lam, KKM, Xu, M, Chen, J, Bokhari, SMM & Jan, MA 2019, 'Error Concealment for Cloud-based and Scalable Video Coding of HD Videos', IEEE Transactions on Cloud Computing, vol. 7, no. 4, pp. 975-987.
The encoding of HD videos faces two challenges: the requirement for strong processing power and for large storage space. One time-efficient solution addressing these challenges is to use a cloud platform together with a scalable video coding technique to generate multiple video streams with varying bit-rates. Packet loss is very common during the transmission of these video streams over the Internet and becomes another challenge. One solution is to retransmit lost video packets, but this creates end-to-end delay; it is therefore desirable to deal with packet loss at the user's side. In this paper, we present a novel system that encodes and stores videos using the Amazon cloud computing platform and recovers lost video frames on the user side using a new Error Concealment (EC) technique. To efficiently utilize the computational power of a user's mobile device, the EC is performed in a multi-threaded, parallel fashion. The simulation results clearly show that, on average, our proposed EC technique outperforms the traditional Block Matching Algorithm (BMA) and Frame Copy (FC) techniques.
Wang, Z, Xu, M, Ye, N, Wang, R & Huang, H 2019, 'RF-Focus: Computer Vision-assisted Region-of-interest RFID Tag Recognition and Localization in Multipath-prevalent Environments', Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 1, pp. 1-30.
Ye, J, Yang, X, Xu, M, Chan, PK-S & Ma, C 2019, 'Novel N-Substituted oseltamivir derivatives as potent influenza neuraminidase inhibitors: Design, synthesis, biological evaluation, ADME prediction and molecular docking studies', EUROPEAN JOURNAL OF MEDICINAL CHEMISTRY, vol. 182.
Zhang, R, Mu, C, Xu, M, Xu, L & Xu, X 2019, 'Facial Component-Landmark Detection with Weakly-supervised LR-CNN', IEEE Access, vol. 7, pp. 10263-10277.
Zhang, R, Mu, C, Xu, M, Xu, L, Shi, Q & Wang, J 2019, 'Synthetic IR image refinement using adversarial learning with bidirectional mappings', IEEE Access, vol. 7, pp. 153734-153750.
Collecting a large dataset of real infrared (IR) images is expensive, time-consuming, and even impossible in some specific scenarios. With recent progress in machine learning, it has become more feasible to replace real IR images with qualified synthetic IR images in learning-based IR systems. However, this alternative may fail to achieve the desired performance due to the gap between real and synthetic IR images. Inspired by adversarial learning for image-to-image translation, we propose the Synthetic IR Refinement Generative Adversarial Network (SIR-GAN) to narrow this gap. By learning the bidirectional mappings between two unpaired domains, where the source domain contains a large number of simulated IR images and the target domain contains a limited quantity of real IR images, the realism of the simulated IR images generated by the IR simulator is significantly improved. Specifically, driven by the idea of transferring infrared characteristics while protecting target semantic information, we propose a SIR refinement loss that adds an infrared loss and a structure loss to the adversarial loss and the consistency loss. To further reduce the gap, stabilize training, and avoid artefacts, we refine the proposed algorithm by developing a training strategy, adding a U-net in the generators, using dilated convolutions in the discriminators, and adopting N-Adam as the optimizer. Qualitative, quantitative, and ablation experiments demonstrate the superiority of the proposed approach over state-of-the-art techniques in terms of realism and fidelity. In addition, our refined IR images are evaluated in a feasibility study, where the accuracy of the trained classifier is significantly improved by adding our refined data to a small real-data training set.
Affective image analysis, which estimates humans' emotional reaction to images, has attracted increasing attention. Most existing methods focus on developing efficient visual features according to theoretical and empirical concepts, and extract these features from an image as a whole. However, analyzing emotion from an entire image can only extract the dominant emotion conveyed by the whole image, ignoring the affective differences among different regions within the image. This may reduce the performance of emotion recognition and limit the range of possible applications. In this paper, we are the first to propose the concept of the affective map, by which image emotion can be represented at region level. In an affective map, the value of each pixel represents the probability of the pixel belonging to a certain emotion category. Two popular application exemplars, i.e. affective image classification and visual saliency computing, are explored to prove the effectiveness of the proposed affective map. By analyzing detailed image emotion at region level, the accuracy of affective image classification is improved by 5.1% on average, and the Area Under the Curve (AUC) of visual saliency detection is improved by 15% on average.
Facial expression recognition plays a crucial role in a wide range of applications in psychotherapy, security systems, marketing, commerce and more. Detecting a macro-expression, which is a direct representation of an 'emotion,' is a relatively straightforward task. While macro-expressions play a pivotal role, micro-expressions are more accurate indicators of a train of thought, and even of subtle, passive or involuntary thoughts. Compared to macro-expressions, identifying micro-expressions is a much more challenging research question because their time spans are narrowed down to a fraction of a second and they can only be defined using a broader classification scale. This paper is an all-inclusive survey-cum-analysis of the various micro-expression recognition techniques. We analyze the general framework for a micro-expression recognition system by decomposing the pipeline into fundamental components, namely face detection, pre-processing, facial feature detection and extraction, datasets, and classification. We discuss the role of these elements and highlight the models and new trends followed in their design. Moreover, we provide an extensive analysis of micro-expression recognition systems by comparing their performance. We also discuss the new deep learning features that may, in the near future, replace hand-crafted features for facial micro-expression recognition. This survey focuses on the methodologies applied, the databases used and the recognition accuracy achieved, and compares these to distil gaps in efficiency, future scope and research potential. Through this survey, we intend to examine this problem and work towards a comprehensive and efficient recognition scheme. This study allows us to identify open issues and to determine future directions for designing real-world micro-expression recognition systems.
Usman, M, Yang, N, Jan, MA, He, X, Xu, M & Lam, KM 2018, 'A Joint Framework for QoS and QoE for Video Transmission over Wireless Multimedia Sensor Networks', IEEE Transactions on Mobile Computing, vol. 17, no. 4.
With the emergence of Wireless Multimedia Sensor Networks (WMSNs), the distribution of multimedia content has now become a reality. Without proper management, the transmission of multimedia data over WMSNs affects network performance due to excessive packet dropping. Existing studies on Quality of Service (QoS) mostly deal with simple WSNs and as such do not account for an increasing number of sensor nodes and an increasing volume of data. In this paper, we propose a novel framework to support QoS in WMSNs along with a lightweight Error Concealment (EC) scheme. EC schemes play a vital role in enhancing Quality of Experience (QoE) by maintaining acceptable quality at the receiving end. The main objectives of the proposed framework are to maximise network throughput and to cover up the effects of dropped video packets. To control the data rate, Scalable Video Coding (SVC) is applied at multimedia sensor nodes with variable Quantization Parameters (QPs). Multipath routing is exploited to support real-time video transmission. Experimental results show that the proposed framework can efficiently adjust large volumes of video data under network distortion and can effectively conceal dropped/corrupted video frames, producing better objective measurements.
Person re-identification and visual tracking are two important tasks in video surveillance. Much work has been done on appearance modeling for these two tasks. However, existing feature descriptors are mainly constructed on three-channel color spaces, such as RGB, HSV and XYZ. These color spaces enable meaningful representation of color, yet may lack distinctiveness for real-world tasks. In this paper, we propose a multi-channel Encoding Color Space (ECS) and consider color distinction in the design of the image feature descriptor. In order to overcome illumination variation and shape deformation, we design features on the basis of the Encoding Color Space and the Histogram of Oriented Gradients (HOG), which enables rich color-gradient characteristics. Additionally, we extract a Second Order Histogram (SOH) on the constructed descriptor to capture abstract information with layout constraints. Exhaustive experiments are performed on the VIPeR, CAVIAR, CUHK01 and Visual Tracking Benchmark datasets. Experimental results on these datasets show that our feature descriptors achieve promising performance.
The single-image rain removal problem has attracted tremendous interest within the deep learning domain. Although deep learning based de-raining methods outperform many conventional methods, there are still unresolved issues in improving performance. In this paper, we propose a simplified residual dense network (SRDN) to improve de-raining performance and cut down computation time. Inspired by the image processing domain knowledge that a rainy image can be decomposed into a base (low-pass) layer and a detail (high-pass) layer, we train our network by directly learning the residual between the detail layer of rainy images and the detail layer of clean images. This both significantly reduces the mapping range from input to output and makes it easy to employ image enhancement operations to handle heavy rain with hazy looks. Instead of designing a deeper network structure to increase the learning ability of the network, we propose a simplified dense block to explore more effective information between layers and, hence, reduce the computation time of the network. Experiments on both synthetic and real-world images demonstrate that our SRDN network achieves competitive results in comparison with benchmarked and conventional approaches for single-image rain removal.
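The base/detail decomposition the method builds on is easy to illustrate. This sketch uses a Gaussian low-pass filter; the paper's exact filter may differ (guided filtering is also common in de-raining work), and all names here are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_base_detail(img, sigma=3.0):
    """Decompose an image into a low-pass base layer and a high-pass
    detail layer; rain streaks live mostly in the detail layer, so a
    network only needs to learn a residual on that layer."""
    base = gaussian_filter(img, sigma=(sigma, sigma, 0))  # smooth per channel
    detail = img - base
    return base, detail

rainy = np.random.rand(240, 320, 3).astype(np.float32)   # stand-in rainy image
base, detail = split_base_detail(rainy)
# Training target (illustrative): residual = detail(clean) - detail(rainy);
# the de-rained output is then base(rainy) + detail(rainy) + predicted residual.
print(base.shape, detail.shape)
```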
Zhang, H & Xu, M 2018, 'Recognition of Emotions in User-Generated Videos with Kernelized Features', IEEE Transactions on Multimedia, vol. 20, no. 10, pp. 2824-2835.
Recognition of emotions in user-generated videos has attracted increasing research attention. Most existing approaches are based on spatial features extracted from video frames. However, due to the broad affective gap between the spatial features of images and high-level emotions, the performance of existing approaches is restricted. To bridge the affective gap, we propose recognizing emotions in user-generated videos with kernelized features. We reformulate the equation of the discrete Fourier transform as a linear kernel function and construct a polynomial kernel function based on the linear kernel. The polynomial kernel is applied to spatial features of video frames to generate kernelized features. Compared with spatial features, kernelized features show superior discriminative capability. Moreover, we are the first to apply the sparse representation method to reduce the impact of noise contained in videos, which contributes to the performance improvement. Extensive experiments are conducted on two challenging benchmark datasets, VideoEmotion-8 and Ekman-6. The experimental results demonstrate that the proposed method achieves state-of-the-art performance.
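The kernel construction can be read as follows: each DFT bin is an inner product with a Fourier basis vector (a linear kernel), and a polynomial kernel is then built on top of it. A small sketch, with the degree and offset chosen arbitrarily rather than taken from the paper:

```python
import numpy as np

def fourier_linear_kernel(x, k):
    """DFT bin k written as a linear kernel: X[k] = <b_k, x>, where b_k is
    the k-th Fourier basis vector."""
    n = np.arange(len(x))
    basis = np.exp(-2j * np.pi * k * n / len(x))
    return basis @ x                              # equals np.fft.fft(x)[k]

def poly_kernelized_features(x, degree=2, c=1.0):
    """Polynomial kernel built on the linear kernel: (<b_k, x> + c)^degree,
    one kernelized feature per frequency bin (magnitudes kept)."""
    coeffs = np.fft.fft(x)                        # all linear-kernel responses
    return np.abs((coeffs + c) ** degree)

frame_feat = np.random.rand(128)                  # stand-in spatial feature
print(np.allclose(fourier_linear_kernel(frame_feat, 5), np.fft.fft(frame_feat)[5]))
print(poly_kernelized_features(frame_feat).shape)  # (128,)
```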
Guo, D, Xu, J, Zhang, J, Xu, M, Cui, Y & He, X 2017, 'User relationship strength modeling for friend recommendation on Instagram', Neurocomputing, vol. 239, pp. 9-18.
Social strength modeling in the social media community has attracted increasing research interest. Different from Flickr, which has been explored by many researchers, Instagram is more popular among mobile users and is conducive to likes and comments, but it has seldom been investigated. On Instagram, a user can post photos/videos, follow other users, and comment on and like other users' posts. These actions generate diverse forms of data that result in multiple user relationship views. In this paper, we propose a new framework to discover the underlying social relationship strength. User relationship learning under multiple views and relationship strength modeling are coupled into one framework. In addition, given the learned relationship strength, a coarse-to-fine method is proposed for friend recommendation. Experiments on friend recommendation for Instagram are presented to show the effectiveness and efficiency of the proposed framework; as our experimental results exhibit, it obtains better performance than other related methods. Although our method has been proposed for Instagram, it can easily be extended to other social media communities.
Lu, D, Sang, J, Chen, Z, Xu, M & Mei, T 2017, 'Who Are Your "Real" Friends: Analyzing and Distinguishing Between Offline and Online Friendships From Social Multimedia Data', IEEE TRANSACTIONS ON MULTIMEDIA, vol. 19, no. 6, pp. 1299-1313.
Wang, Y, Liu, J, Li, Y, Fu, J, Xu, M & Lu, H 2017, 'Hierarchically Supervised Deconvolutional Network for Semantic Video Segmentation', Pattern Recognition, vol. 64, pp. 437-445.
Semantic video segmentation is a challenging task of fine-grained semantic understanding of video data. In this paper, we present a jointly trained deep learning framework to make the best use of spatial and temporal information for semantic video segmentation. Along the spatial dimension, a hierarchically supervised deconvolutional neural network (HDCNN) is proposed to conduct pixel-wise semantic interpretation for single video frames. HDCNN is constructed with the convolutional layers of VGG-net and their mirrored deconvolutional structure, with all fully connected layers removed, and hierarchical classification layers are added to multi-scale deconvolutional features to introduce more contextual information for pixel-wise semantic interpretation. Besides, a coarse-to-fine training strategy is adopted to enhance the performance of foreground object segmentation in videos. Along the temporal dimension, we introduce Transition Layers on top of the HDCNN structure to make the pixel-wise label prediction consistent with adjacent pixels across the space and time domains. The learning process of the Transition Layers can be implemented as a set of extra convolutional calculations connected with HDCNN. These two parts are jointly trained as a unified deep network in our approach. Thorough evaluations are performed on two challenging video datasets, i.e., CamVid and GATECH. Our approach achieves state-of-the-art performance on both datasets.
Millions of surveillance cameras have been installed in public areas, producing vast amounts of video data every day. There is an urgent need for intelligent techniques to automatically detect and segment moving objects, which has wide applications. Various approaches have been developed for moving object detection based on background modeling in the literature. Most of them focus on temporal information but partly or totally ignore spatial information, which brings sensitivity to noise and background motion. In this paper, we propose a unified model sharing framework for moving object detection. To exploit the spatial-temporal correlation across different pixels, we establish a many-to-one correspondence between pixels and shared models, and a pixel is labeled as foreground or background by searching for an optimal matched model in its neighborhood, as sketched below. A random sampling strategy is then introduced for online update of the shared models. In this way, we can reduce the total number of models dramatically and match a proper model for each pixel accurately. Furthermore, existing approaches can be naturally embedded into the proposed sharing framework. Two popular approaches, the statistical model and the sample consensus model, are used to verify the effectiveness. Experiments and comparisons on the ChangeDetection 2014 benchmark demonstrate the superiority of the model sharing solution.
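A toy grayscale rendition of the neighborhood model search, using a sample-consensus model and a random-sampling update, might look like the following. The paper's shared-model bookkeeping, which actually reduces the number of stored models, is not reproduced here, and all thresholds are assumptions.

```python
import numpy as np

R, MIN_MATCHES, N_SAMPLES = 20, 2, 10   # illustrative thresholds

def label_pixel(y, x, frame, models):
    """A pixel is background if some model in its 3x3 neighbourhood has
    >= MIN_MATCHES samples within radius R of the pixel value (the
    many-to-one neighborhood search, simplified to grayscale)."""
    h, w = frame.shape
    for dy in (0, -1, 1):
        for dx in (0, -1, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                matches = np.sum(np.abs(models[ny, nx] - frame[y, x]) < R)
                if matches >= MIN_MATCHES:
                    # Random-sampling online update of the matched model.
                    if np.random.rand() < 1 / 16:
                        models[ny, nx, np.random.randint(N_SAMPLES)] = frame[y, x]
                    return 0            # background
    return 1                            # foreground

frame = np.random.randint(0, 256, (120, 160)).astype(np.int16)
models = np.repeat(frame[:, :, None], N_SAMPLES, axis=2)  # bootstrap models
print(label_pixel(60, 80, frame, models))  # 0: bootstrapped as background
```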
Liu, H, Xu, M, Wang, J, Rao, T & Burnett, I 2016, 'Improving Visual Saliency Computing With Emotion Intensity', IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 27, no. 6, pp. 1201-1213.
Usman, M, He, X, Lam, K, Xu, M, Bokhari, SM & Chen, J 2016, 'Frame Interpolation for Cloud-Based Mobile Video Streaming', IEEE Transactions on Multimedia, vol. 18, no. 5, pp. 831-839.
Cloud-based High Definition (HD) video streaming is becoming more popular day by day. On the one hand, it is important for both end users and large storage servers to store their huge amounts of data at different locations and servers. On the other hand, it is becoming a big challenge for network service providers to provide reliable connectivity to network users. There have been many studies of cloud-based video streaming for Quality of Experience (QoE) for services like YouTube. Packet losses and bit errors are very common in transmission networks and affect user feedback on cloud-based media services. To cover up packet losses and bit errors, Error Concealment (EC) techniques are usually applied at the decoder/receiver side to estimate the lost information. This paper proposes a time-efficient and quality-oriented EC method. The proposed method considers H.265/HEVC based intra-encoded videos for the estimation of whole intra-frame loss. The main emphasis of the proposed approach is the recovery of the Motion Vectors (MVs) of a lost frame in real time. To speed up the search for the lost MVs, a bigger block size and parallel searching are both used. The simulation results clearly show that our proposed method outperforms the traditional Block Matching Algorithm (BMA) by approximately 2.5 dB and Frame Copy (FC) by up to 12 dB at packet loss rates of 1%, 3%, and 5% with different Quantization Parameters (QPs). In terms of computation time, the proposed approach is faster than BMA by approximately 1788 seconds.
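The Block Matching Algorithm used as the baseline comparison is a classic exhaustive SAD search. A minimal sketch (block size, search range and frames are synthetic stand-ins):

```python
import numpy as np

def block_match(ref, target, by, bx, bs=16, search=8):
    """Find the motion vector for the bs x bs block at (by, bx) in
    `target` by exhaustive SAD search in `ref` (classic BMA)."""
    block = target[by:by + bs, bx:bx + bs].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= ref.shape[0] - bs and 0 <= x <= ref.shape[1] - bs:
                sad = np.abs(ref[y:y + bs, x:x + bs].astype(np.int32) - block).sum()
                if best is None or sad < best:
                    best, best_mv = sad, (dy, dx)
    return best_mv

# Error-concealment use: estimate MVs between the frames around a lost one,
# then motion-compensate to synthesize the missing frame.
prev_f = np.random.randint(0, 256, (144, 176), dtype=np.uint8)
next_f = np.roll(prev_f, (2, 3), axis=(0, 1))   # synthetic global motion
print(block_match(prev_f, next_f, 64, 64))      # -> (-2, -3): source offset in ref
```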
Wang, J, Qu, Z, Chen, Y, Mei, T, Xu, M, Zhang, L & Lu, H 2016, 'Adaptive Content Condensation Based on Grid Optimization for Thumbnail Image Generation', IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 11, pp. 2079-2092.
An ideal thumbnail generator should effectively condense unimportant regions and keep the important content undeformed, completed, and at a proper scale, i.e., accuracy, completeness, and sufficiency. Each retargeting method has its own advantage for resizing arbitrary images. However, they often ignore the completeness and sufficiency of information presentation in thumbnails. In this paper, we formulate thumbnail generation as an image content condensation problem and propose a unified grid optimization framework to fuse multiple operators. From the view of accuracy, completeness, and sufficiency of information presentation, we exploit complementary relationships among three condensation operators and fuse them into a unified grid-based convex programming problem, which can be solved simultaneously and efficiently through numerical optimization. Besides warping energy to preserve the geometric structure of important objects, we put forward two grid-based energy terms to keep the completeness of important objects and retain them at a proper size. Finally, an adaptive procedure is proposed to dynamically adjust the contribution of the loss functions for achieving optimal content condensation. Both qualitative and quantitative comparison results demonstrate that the proposed method achieves an excellent tradeoff among accuracy, completeness, and sufficiency of information preservation. The experimental results show that our approach is clearly superior to the state-of-the-art techniques.
With the wide use of consumer electronics and the rapid development of online shopping, more and more ad videos are developed for IDTV and mobile users. However, the huge amount of time spent on Internet advertising often gives users an uncomfortable viewing experience rather than effectively driving consumption of the advertised products. It is therefore urgent to find a viewer-friendly and advertiser-beneficial solution for launching ads. This paper is the first attempt to improve the effectiveness of advertising by combining online shopping information with an ad video and directing viewers to suitable online shopping places. The proposed ActiveAd framework includes four main components. Firstly, an ad video analysis component detects both syntactic and semantic elements in ad videos, e.g. FMPIs (Frames Marked with Production Information), visual concepts, and textual keywords; this analysis provides a comprehensive solution to extract meaningful elements from ad videos. Secondly, a visual linking by search component is proposed to collect websites containing images similar to the FMPIs. Features used for the visual search are weighted and fused in order to ensure the uniformity of search results. Thirdly, the different kinds of tags and product categories extracted from the collected websites are aggregated in order to identify representative text for the product. Finally, query keywords are selected by calculating cosine similarity between two kinds of keywords, i.e. keywords identified from tag aggregation and keywords obtained through ad video analysis (visual concept detection and textual keyword detection). A query vector is generated from the selected keywords and used to retrieve the product online. In this paper, a powerful cross-media contextual search including visual search, tag aggregation and textual search is achieved with the help of ad video analysis. Experiments demonstrate that our proposed Active...
Hasan, MA, Xu, M, He, X & Wang, Y 2015, 'A camera motion histogram descriptor for video shot classification', Multimedia Tools and Applications, vol. 74, no. 24, pp. 11073-11098.
In this paper, a novel camera motion descriptor is proposed for video shot classification. In the proposed method, the raw motion information of consecutive video frames is extracted by computing the motion vector of each macroblock to form motion vector fields (MVFs). Next, a motion consistency analysis is applied to the MVFs to eliminate inconsistent motion vectors. Then, the MVFs are divided into nine (3 × 3) local regions, and the singular value decomposition (SVD) technique is applied to the motion vectors extracted from each local region in the temporal direction. The consistent motion vectors of a number of MVFs are compactly represented at a time to characterize temporal camera motion. Accordingly, each local region of the whole video shot is represented by a sequence of compactly represented vectors. Finally, the sequence of vectors is converted into a histogram to describe the camera motion of each local region. The combination of all the local histograms is taken as the camera motion descriptor of the video shot. The shot descriptors are used in a classifier to classify video shots; in this work, we use a support vector machine (SVM). The experimental results show that the proposed camera motion descriptor has strong discriminative capability to effectively classify different camera motion patterns in professionally captured video shots. We also show that our proposed approach outperforms two state-of-the-art video shot classification methods.
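One way to read the descriptor construction: stack a region's motion vectors over a temporal window, keep the dominant SVD component, and histogram its direction. The windowing, bin count and sign handling below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def region_motion_histogram(mvs_over_time, n_bins=8, window=5):
    """mvs_over_time: (frames, vectors_per_region, 2) motion vectors from one
    of the 3x3 local regions. Each temporal window is compacted by SVD to a
    dominant direction, which is quantized into an angle histogram."""
    hist = np.zeros(n_bins)
    for chunk in np.array_split(mvs_over_time, max(1, len(mvs_over_time) // window)):
        M = chunk.reshape(-1, 2)                  # stack the window's vectors
        _, s, vt = np.linalg.svd(M, full_matrices=False)
        dominant = s[0] * vt[0]                   # strongest motion component
        if dominant @ M.mean(axis=0) < 0:         # resolve SVD sign ambiguity
            dominant = -dominant
        angle = np.arctan2(dominant[1], dominant[0]) % (2 * np.pi)
        hist[int(angle // (2 * np.pi / n_bins)) % n_bins] += 1
    return hist / hist.sum()

mvs = np.random.randn(30, 16, 2) + np.array([3.0, 0.0])  # mostly rightward pan
print(region_motion_histogram(mvs))                       # mass near bin 0
```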
Hasan, MA, Xu, M, He, X & Xu, C 2014, 'CAMHID: Camera Motion Histogram Descriptor and Its Application to Cinematographic Shot Classification', IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 10, pp. 1682-1695.
Landmark search is crucial to improving the quality of the travel experience. Smart phones make it possible to search landmarks anytime and anywhere. Most existing work computes image features locally on smart phones after a landmark image is taken. Compared with sending the original image to the remote server, sending computed features saves network bandwidth and consequently makes the sending process fast. However, this scheme is restricted by the limitations of phone battery power and computational ability. In this paper, we propose sending compressed (low-resolution) images to the remote server, instead of computing image features locally, for landmark recognition and search. To this end, a robust 3D-model-based method is proposed to match query images with the corresponding landmarks. Using the proposed method, images with low resolution can be recognized accurately, even when the images contain only a small part of the landmark or are taken under various conditions of lighting, zoom, occlusion and viewpoint. In order to provide an attractive landmark search result, a 3D texture model is generated in response to a landmark query. The proposed search approach, which opens up a new direction, starts from a 2D compressed image query and ends with a 3D model search result.
Wang, J, Xu, M, He, X, Lu, H & Hoang, DB 2014, 'A hybrid domain enhanced framework for video retargeting with spatial-temporal importance and 3D grid optimization', Signal Processing, vol. 94, pp. 33-47.
Recently, ubiquitous video access has been in high demand for online video applications. One big challenge is that video services need to adapt to different device capabilities. Pervasive multimedia devices require accurate and comfortable video retargeting: letting users see their preferred content accurately directly affects their comfort. User preferences on video content differ across video domains. In this paper, we present a hybrid framework for video retargeting with domain-enhanced spatial-temporal grid optimisation. First, we parse videos from low-level features to high-level visual concepts, combined with visual attention, for an accurate importance description. Second, a semantic importance map is built up representing spatial importance and temporal continuity, which is incorporated with a 3D rectilinear grid scaleplate to map frames to a target display, thereby keeping the aspect ratio of semantically salient objects as well as perceptual coherency. Extensive evaluations are made on five typical video genres, i.e. sports, advertisements, lecture, news and surveillance. Comparisons with state-of-the-art approaches on both images and videos demonstrate the advantages of the proposed approach.
Wu, L, Cao, L, Xu, M & Wang, J 2014, 'A Hybrid Image Retargeting Approach via Combining Seam Carving and Grid Warping', Journal of Multimedia, vol. 9, no. 4, pp. 482-492.
Image retargeting is a critical technique for browsing images on diversified terminals. In this paper, we propose a hybrid image resizing approach that jointly uses seam carving and warping. Firstly, based on an importance partition with the saliency map, we apply a weighted seam carving approach to make the seams distribute dispersedly within the important regions. Then we propose the Content Aware Image Distance (CAID) to assess the deformation caused by removing seams; the weighted seam carving stops at a fixed threshold to ensure little degradation of visual image quality. Finally, grid-based warping is used to reach the final size via a global optimization model, since warping tends to avoid discontinuity artifacts in important regions and typically makes the distortion of unimportant regions more coherent. Experiments and comparisons on the public RetargetMe dataset against Dong et al.'s method, energy-based deformation, multi-operator retargeting, seam carving, simple scaling, shift-maps, scale-and-stretch, streaming video and non-homogeneous warping show the superiority of the proposed approach.
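The seam-carving half of the hybrid is the classic dynamic-programming seam removal, sketched below directly on a saliency map; the paper's weighted variant and the CAID stopping test are not shown, and the inputs are synthetic.

```python
import numpy as np

def remove_one_seam(img, saliency):
    """Remove a single vertical seam of minimal saliency via dynamic
    programming (generic seam carving; one seam per call)."""
    h, w = saliency.shape
    cost = saliency.astype(np.float64).copy()
    for y in range(1, h):                          # cumulative minimal energy
        left = np.r_[np.inf, cost[y - 1, :-1]]
        right = np.r_[cost[y - 1, 1:], np.inf]
        cost[y] += np.minimum(np.minimum(left, cost[y - 1]), right)
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(h - 2, -1, -1):                 # backtrack the seam
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))
    keep = np.ones((h, w), bool)
    keep[np.arange(h), seam] = False
    return img[keep].reshape(h, w - 1, -1), saliency[keep].reshape(h, w - 1)

img = np.random.rand(100, 150, 3)
sal = np.random.rand(100, 150)                     # stand-in saliency map
img, sal = remove_one_seam(img, sal)
print(img.shape)                                   # (100, 149, 3)
```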
Xu, M, Wang, J, He, X, Jin, JS, Luo, S & Lu, H 2014, 'A three-level framework for affective content analysis and its case studies', Multimedia Tools and Applications, vol. 70, no. 2, pp. 757-779.
Emotional factors directly reflect audiences' attention, evaluation and memory. Recently, video affective content analysis has attracted more and more research effort. Most existing methods map low-level affective features directly to emotions by applying machine learning. Compared to the human perception process, there is a gap between low-level features and high-level human perception of emotion. In order to bridge this gap, we propose a three-level affective content analysis framework that introduces mid-level representations to indicate dialog, audio emotional events (e.g., horror sounds and laughter) and textual concepts (e.g., informative keywords). Mid-level representations are obtained by machine learning on low-level features and used to infer high-level affective content. We further apply the proposed framework in a number of case studies: audio emotional events, dialog and subtitles are studied to assist affective content detection in different video domains/genres. Multiple modalities are considered for affective analysis, since each modality has its own merit in evoking emotion. Experimental results show that the proposed framework is effective and efficient for affective content analysis, and that audio emotional events, dialog and subtitles are promising mid-level representations.
Video retargeting is a crowded but challenging research area. In order to maximise viewers' watching comfort, the most challenging issue is how to retain the spatial shape of important objects while ensuring temporal smoothness and coherence.
Xu, M, Xu, C, He, S, Jin, J, Luo, S & Rui, Y 2013, 'Hierarchical affective content analysis in arousal and valence dimensions', Signal Processing, vol. 93, pp. 2140-2150.
Different from existing work focusing on emotion type detection, the approach proposed in this paper gives users the flexibility to select their favourite affective content by choosing either emotion intensity levels or emotion types. Specifically, we propose a hierarchical structure for movie emotions and analyze emotion intensity and emotion type by using arousal- and valence-related features hierarchically. Firstly, three emotion intensity levels are detected by fuzzy c-means clustering on arousal features; fuzzy clustering provides a mathematical model to represent vagueness, which is close to human perception. Then, valence-related features are used to detect five emotion types. Considering that video is continuous time-series data and the occurrence of a certain emotion is affected by recent emotional history, conditional random fields (CRFs) are used to capture context information. Compared with the Hidden Markov Model (HMM), a CRF relaxes the independence assumption on states required by the HMM and avoids its bias problem. Experimental results show that the CRF-based hierarchical method outperforms the one-step method on emotion type detection. A user study shows that the majority of viewers prefer having the option of accessing movie content by emotion intensity level, and that the majority of users are satisfied with the proposed emotion detection.
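Fuzzy c-means, which underlies the intensity-level step, is compact enough to sketch directly. The feature dimensions and data here are synthetic stand-ins for arousal features, and the hyperparameters are assumptions.

```python
import numpy as np

def fuzzy_cmeans(X, c=3, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: soft memberships let a movie segment belong
    partly to several emotion-intensity levels, mirroring the vagueness
    of human perception the paper appeals to."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))       # memberships, rows sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        p = 2 / (m - 1)
        U = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=1, keepdims=True))
    return centers, U

arousal = np.random.rand(200, 4)        # stand-in arousal features per segment
centers, U = fuzzy_cmeans(arousal, c=3) # three emotion-intensity levels
print(U[0])                             # soft membership of segment 0
```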
Peng, Y, Xu, M, Ni, Z, Jin, J & Luo, S 2012, 'Combining Front Vehicle Detection With 3D Pose Estimation For A Better Driver Assistance', International Journal of Advanced Robotic Systems, vol. 9, pp. 0-0.
Driver assistance systems enhance traffic safety and efficiency. The accurate 3D pose of a front vehicle can help a driver make the right decision on the road. We propose a novel real-time system to estimate the 3D pose of the front vehicle. This syste
Landmark image classification is a challenging task due to the various circumstances, e.g., illumination, viewpoint, zoom in/out and occlusion, under which landmark images are taken. Most existing approaches utilize features extracted from the whole imag
Xu, M, He, S, Peng, Y, Jin, J, Luo, S, Chia, LT & Hu, Y 2012, 'Content on demand video adaptation based on MPEG-21 digital item adaptation', EURASIP Journal on Wireless Communications and Networking, vol. 2012, no. 104, pp. 1-16.
One of the major objectives in multimedia research is to provide pervasive access and personalized use of multimedia information. Pervasive access to video data implies access to both the cognitive and affective aspects of video content. Personalized use requires that services satisfy individual users' needs regarding video content. This article provides a content-on-demand (CoD) video adaptation solution that considers users' preferences for cognitive content and affective content, for video media in general and sports video and movies in particular. The CoD video adaptation system supports users' decisions in selecting their content of interest and adaptively delivers the video source by selecting relevant content and dropping frames while considering network conditions. First, video contents are annotated with the description schemes (DSs) provided by the MPEG-7 multimedia description schemes (MDSs). Then, to achieve a generic adaptation solution, the adaptation is developed following the MPEG-21 Digital Item Adaptation (DIA) framework. We study the MPEG-21 reference software for XML generation and develop our own system for CoD video adaptation in three steps: (1) the content information is parsed from the MPEG-7 annotation XML file together with the bitstream to generate a generic Bitstream Syntax Description (gBSD); (2) users' preferences, network characteristics and adaptation QoS (AQoS) are considered in making the adaptation decision; (3) the adaptation engine automatically parses adaptation decisions and the gBSD to perform the adaptation. Unlike most existing adaptation work, the system adapts the content of interest in the video stream according to users' preferences. We implement the above-mentioned MPEG-7 and MPEG-21 standards and provide a generic video adaptation solution; adaptation based on gBSD avoids complex video computation. Thirty students from various departments were invited to assess the system, and their responses have been positive.
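To illustrate the flavour of step (1)'s output being consumed in step (3), here is a toy walk over a gBSD-like XML description with ElementTree. The element and attribute names are hypothetical simplifications for illustration, not the normative gBSD schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified gBSD-like description of a video bitstream:
# each unit records its byte range and an annotated content label.
GBSD = """<bitstream>
  <unit start="0"     length="18432" content="goal"/>
  <unit start="18432" length="9216"  content="crowd"/>
  <unit start="27648" length="20480" content="goal"/>
</bitstream>"""

def select_units(xml_text, preferred):
    """Adaptation-decision step: keep only the byte ranges whose content
    label matches the user's preference (e.g. 'goal' events)."""
    root = ET.fromstring(xml_text)
    return [(int(u.get("start")), int(u.get("length")))
            for u in root.iter("unit") if u.get("content") == preferred]

print(select_units(GBSD, "goal"))   # byte ranges the adaptation engine keeps
```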
Gao, L, Xu, M, Yan, SF, Liu, MG, Hou, CH & Wang, DH 2011, 'Content-aware broadcast soccer video retargeting using fuzzy logic', Electronics Letters, vol. 47, no. 12, pp. 694-695.
A content-aware video retargeting method is proposed for playing broadcast soccer video on small displays. Four visual perception clues are predefined based on soccer game-specific knowledge and first modelled by visual attention features. Then, a fuzzy logic inference system is proposed to estimate the visual attention values (AVs) of the ball and players by fusing the attention features. The AVs are then used to determine the region of interest (ROI) of each frame. Finally, the retargeted video is generated from the ROI of each frame, with polynomial curve fitting for temporal smoothing. Both subjective and objective evaluation results are promising.
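The temporal smoothing step is straightforward polynomial curve fitting over the per-frame ROI centres; a minimal sketch (the polynomial degree is an assumption):

```python
import numpy as np

def smooth_roi_centers(centers, degree=3):
    """Temporal smoothing of per-frame ROI centres with polynomial curve
    fitting, so the virtual crop window does not jitter frame to frame."""
    t = np.arange(len(centers))
    xs = np.polyval(np.polyfit(t, centers[:, 0], degree), t)
    ys = np.polyval(np.polyfit(t, centers[:, 1], degree), t)
    return np.stack([xs, ys], axis=1)

raw = np.cumsum(np.random.randn(120, 2), axis=0)   # jittery ball/player track
print(smooth_roi_centers(raw)[:3])
```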
Xu, M, Xu, C, Duan, L, Jin, JS & Luo, S 2008, 'Audio keywords generation for sports video analysis', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 4, no. 2.
Sports video has attracted a global viewership. Research effort in this area has been focused on semantic event detection in sports video to facilitate accessing and browsing. Most of the event detection methods in sports video are based on visual features. However, being a significant component of sports video, audio may also play an important role in semantic event detection. In this paper, we have borrowed the concept of the keyword from the text mining domain to define a set of specific audio sounds. These specific audio sounds refer to a set of game-specific sounds with strong relationships to the actions of players, referees, commentators, and audience, which are the reference points for interesting sports events. Unlike low-level features, audio keywords can be considered as a mid-level representation, able to facilitate high-level analysis from the semantic concept point of view. Audio keywords are created from low-level audio features with learning by support vector machines. With the help of video shots, the created audio keywords can be used to detect semantic events in sports video by Hidden Markov Model (HMM) learning. Experiments on creating audio keywords and, subsequently, event detection based on audio keywords have been very encouraging. Based on the experimental results, we believe that the audio keyword is an effective representation that is able to achieve satisfying results for event detection in sports video. Application in three sports types demonstrates the practicality of the proposed method.
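A minimal sketch of the audio-keyword stage: an SVM trained on low-level audio features emits a mid-level keyword stream. Features and labels here are synthetic stand-ins; the subsequent HMM event decoding is only indicated in a comment.

```python
import numpy as np
from sklearn.svm import SVC

# Mid-level "audio keywords" (e.g. whistle, applause, excited commentary)
# learned from low-level audio features; data below is synthetic.
X = np.random.rand(300, 13)                 # e.g. 13-dim MFCC-like features
y = np.random.randint(0, 4, 300)            # 4 game-specific keyword classes

clf = SVC(kernel='rbf', probability=True).fit(X, y)
keyword_seq = clf.predict(np.random.rand(50, 13))   # per-clip keyword stream
# In the paper, a sequence like this (aligned with video shots) is then fed
# to an HMM to decode semantic events such as fouls or goals.
print(keyword_seq[:10])
```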
Wang, S, Dash, M, Chia, LT & Xu, M 2007, 'Efficient sampling of training set in large and noisy multimedia data', ACM Transactions on Multimedia Computing, Communications and Applications, vol. 3, no. 3.
As the amount of multimedia data increases day by day, thanks to less expensive storage devices and increasing numbers of information sources, machine learning algorithms are faced with large and noisy datasets. Fortunately, the use of a good sampling set for training influences the final results significantly. But using a simple random sample (SRS) may not obtain satisfactory results, because such a sample may not adequately represent the large and noisy dataset due to its blind approach to selecting samples. The difficulty is particularly apparent for huge datasets where, due to memory constraints, only very small sample sizes are used. This is typically the case for multimedia applications, where the data size is usually very large. In this article we propose a new and efficient method to sample large and noisy multimedia data. The proposed method is based on a simple distance measure that compares the histograms of the sample set and the whole set in order to estimate the representativeness of the sample. The proposed method deals with noise in an elegant manner that SRS and other methods cannot. We experiment on image and audio datasets. Comparison with SRS and other methods shows that the proposed method is vastly superior in terms of sample representativeness, particularly for small sample sizes, while time-wise it is comparable to SRS, the least expensive method in terms of time.
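The core histogram-distance criterion is easy to sketch for 1-D features. The paper's method is iterative and noise-aware, so this only shows the representativeness test, with all parameters assumed.

```python
import numpy as np

def representative_sample(data, size, bins=32, candidates=50, seed=0):
    """Draw several random samples and keep the one whose histogram is
    closest (L1 distance) to the whole set's histogram: that sample then
    represents the data better than a single blind random draw."""
    rng = np.random.default_rng(seed)
    lo, hi = data.min(), data.max()
    full_h, _ = np.histogram(data, bins=bins, range=(lo, hi), density=True)
    best, best_d = None, np.inf
    for _ in range(candidates):
        idx = rng.choice(len(data), size=size, replace=False)
        h, _ = np.histogram(data[idx], bins=bins, range=(lo, hi), density=True)
        d = np.abs(h - full_h).sum()
        if d < best_d:
            best, best_d = idx, d
    return best

data = np.random.randn(100_000)             # stand-in 1-D feature values
print(representative_sample(data, size=500)[:5])
```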
Liu, S, Xu, M, Yi, H, Chia, LT & Rajan, D 2006, 'Multimodal semantic analysis and annotation for basketball video', Eurasip Journal on Applied Signal Processing, vol. 2006, pp. 1-13.
This paper presents a new multi-modality method for extracting semantic information from basketball video. Visual, motion, and audio information are extracted from the video to first generate low-level video segmentation and classification. Domain knowledge is further exploited for detecting interesting events in the basketball video. For video, both visual and motion prediction information are utilized in the shot and scene boundary detection algorithm, followed by scene classification. For audio, audio keysounds are sets of specific audio sounds related to semantic events, and a classification method based on the hidden Markov model (HMM) is used for audio keysound identification. Subsequently, by analyzing the multimodal information, the positions of potential semantic events, such as "foul" and "shot at the basket," are located with additional domain knowledge. Finally, video annotation is generated according to MPEG-7 multimedia description schemes (MDSs). Experimental results demonstrate the effectiveness of the proposed method.
As the amount of multimedia data increases day by day thanks to cheaper storage devices and an increasing number of information sources, machine learning algorithms are faced with large-sized datasets. When the original data is huge, small sample sizes are preferred for various applications. This is typically the case for multimedia applications. But using a simple random sample may not obtain satisfactory results, because such a sample may not adequately represent the entire data set due to random fluctuations in the sampling process. The difficulty is particularly apparent when small sample sizes are needed. Fortunately, the use of a good sampling set for training can improve the final results significantly. In KDD'03 we proposed EASE, which outputs a sample based on its 'closeness' to the original sample. Reported results show that EASE outperforms simple random sampling (SRS). In this paper we propose EASIER, which extends EASE in two ways. (1) EASE is a halving algorithm, i.e., to achieve the required sample ratio it starts from a suitable initial large sample and iteratively halves. EASIER, on the other hand, does away with the repeated halving by directly obtaining the required sample ratio in one iteration. (2) EASE was shown to work on the IBM QUEST dataset, which is a categorical count data set. EASIER, in addition, is shown to work on continuous image and audio features. We have successfully applied EASIER to image classification and audio event identification applications. Experimental results show that EASIER outperforms SRS significantly. © Springer Science + Business Media, LLC 2006.
Duan, LY, Xu, M, Tian, Q, Xu, CS & Jin, JS 2005, 'A unified framework for semantic shot classification in sports video', IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1066-1083.View/Download from: Publisher's site
The extensive amount of multimedia information available necessitates content-based video indexing and retrieval methods. Since humans tend to use high-level semantic concepts when querying and browsing multimedia databases, there is an increasing need for semantic video indexing and analysis. For this purpose, we present a unified framework for semantic shot classification in sports video, which has been widely studied due to its tremendous commercial potential. Unlike most existing approaches, which focus on clustering by aggregating shots or key-frames with similar low-level features, the proposed scheme employs supervised learning to perform a top-down video shot classification. Moreover, the supervised learning procedure is constructed on the basis of effective mid-level representations instead of exhaustive low-level features. This framework consists of three main steps: 1) identify video shot classes for each sport; 2) develop a common set of motion-, color- and shot-length-related mid-level representations; and 3) perform supervised learning on the given sports video shots. It is observed that for each sport we can predefine a small number of semantic shot classes, about 5-10, which cover 90%-95% of broadcast sports video. We employ nonparametric feature space analysis to map low-level features to mid-level semantic video shot attributes such as dominant object (a player) motion, camera motion patterns, and court shape. Based on the fusion of those mid-level shot attributes, we classify video shots into the predefined shot classes, each of which has clear semantic meaning. With this framework we have achieved good classification accuracy of 85%-95% on the game videos of five typical ball-type sports (i.e., tennis, basketball, volleyball, soccer, and table tennis) with over 5500 shots of about 8 h. With correctly classified sports video shots, further structural and temporal analysis, such as event detection, highlight extraction, video skimming, and table of conte...
Xu, M, Duan, LY, Cai, J, Chia, LT, Xu, C & Tian, Q 2004, 'HMM-based audio keyword generation', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3333, pp. 566-574.View/Download from: Publisher's site
With the exponential growth in the production of multimedia data, there is an increasing need for video semantic analysis. Audio, as a significant part of video, provides important cues to human perception when humans are browsing and understanding video contents. To detect semantic content from useful audio information, we introduce audio keywords, which are sets of specific audio sounds related to semantic events. In our previous work, we designed a hierarchical Support Vector Machine (SVM) classifier for audio keyword identification. However, a weakness of that work is that audio signals are artificially segmented into 20 ms frames for frame-based SVM identification without any contextual information. In this paper, we propose a classification method based on Hidden Markov Models (HMMs) for audio keyword identification, replacing the hierarchical SVM classifier. Choosing HMMs is motivated by the success of HMMs in speech recognition. Unlike frame-based SVM classification followed by majority voting, our proposed HMM-based classifiers treat a specific sound as continuous time-series data and employ hidden state transitions to capture context information. In particular, we study how to find an effective HMM, i.e., determining the topology, observation vectors and statistical parameters of the HMM. We also compare HMM structures with different numbers of hidden states and adjust time-series data of variable length. The experimental data includes 40 minutes of basketball audio from real sports games. Experimental results show that, for audio keyword generation, the proposed HMM-based method outperforms the previous hierarchical SVM. © Springer-Verlag 2004.
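A minimal sketch of HMM-based keyword identification, assuming the hmmlearn library and Gaussian emissions over frame-level features; the toy Gaussian sequences stand in for real MFCC-style frames, and the state count is an assumption:

    import numpy as np
    from hmmlearn import hmm  # pip install hmmlearn

    def train_keyword_hmm(sequences, n_states=4):
        # Fit one Gaussian HMM per audio keyword on variable-length
        # frame-feature sequences (rows = frames, columns = features).
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=25)
        m.fit(X, lengths)
        return m

    rng = np.random.default_rng(0)
    models = {
        "whistle": train_keyword_hmm([rng.normal(0, 1, (30, 13)) for _ in range(10)]),
        "applause": train_keyword_hmm([rng.normal(2, 1, (30, 13)) for _ in range(10)]),
    }
    # Identify a keyword by maximum log-likelihood over the class HMMs.
    test = rng.normal(2, 1, (30, 13))
    print(max(models, key=lambda k: models[k].score(test)))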
Zeng, C, He, X, Jia, W & Xu, M 2013, 'Recent advances on graph-based image segmentation techniques' in Image Processing: Concepts, Methodologies, Tools, and Applications, IGI Global, pp. 1323-1337.View/Download from: Publisher's site
© 2013, IGI Global. Image segmentation techniques using graph theory have become a thriving research area in the computer vision community in recent years. This chapter mainly focuses on the most up-to-date research achievements in graph-based image segmentation published in top journals and conferences in the computer vision community. The representative graph-based image segmentation methods included in this chapter are classified into several categories: the minimum-cut/maximum-flow model (called graph-cut in some of the literature), the random walk model, the minimum spanning tree model, the normalized cut model and isoperimetric graph partitioning. The basic rationales of these models are presented, and the image segmentation methods based on these graph-based models are discussed as the main concern of this chapter. Several performance evaluation methods for image segmentation are given. Some public databases for testing image segmentation algorithms are introduced, and future work on graph-based image segmentation is discussed at the end of the chapter.
Zeng, C, Jia, W, He, S & Xu, M 2013, 'Recent Advances on Graph-Based Image Segmentation Techniques' in Bai, X, Cheng, J & Hancock, E (eds), Graph-Based Methods in Computer Vision: Developments and Applications, IGI Global, Hershey, Pennsylvania (USA), pp. 140-154.View/Download from: Publisher's site
Image segmentation techniques using graph theory have become a thriving research area in the computer vision community in recent years. This chapter mainly focuses on the most up-to-date research achievements in graph-based image segmentation published in top journals and conferences in the computer vision community. The representative graph-based image segmentation methods included in this chapter are classified into several categories: the minimum-cut/maximum-flow model (called graph-cut in some of the literature), the random walk model, the minimum spanning tree model, the normalized cut model and isoperimetric graph partitioning. The basic rationales of these models are presented, and the image segmentation methods based on these graph-based models are discussed as the main concern of this chapter. Several performance evaluation methods for image segmentation are given. Some public databases for testing image segmentation algorithms are introduced, and future work on graph-based image segmentation is discussed at the end of the chapter.
Guo, W, Xu, C, Ma, S & Xu, M 2010, 'Visual Attention Based Motion Object Detection and Trajectory' in Qiu, G, Lam, KM, Kiya, H, Xue, XY, Kuo, CCJ & Lew, MS (eds), Lecture Notes in Computer Science 6298 - Advances in Multimedia Information Processing - PCM 2010, Springer, Germany, pp. 462-470.View/Download from: Publisher's site
A motion trajectory tracking method using a novel visual attention model and kernel density estimation is proposed in this paper. As a crucial step, moving object detection is based on visual attention. The visual attention model is built by combining static and motion feature attention maps with a Karhunen-Loeve transform (KLT) distribution map. Since the visual attention analysis is conducted at the object level instead of the pixel level, the proposed method can detect any kind of salient moving object, regardless of object appearance and surrounding circumstances. After locating the region of a moving object, the kernel density is estimated for trajectory tracking. The experimental results show that the proposed method is promising for moving object detection and trajectory tracking.
Xu, M, He, S, Jin, J, Peng, Y, Xu, C & Guo, W 2010, 'Using Scripts for Affective Content Retrieval' in Qiu, G, Lam, KM, Kiya, H, Xue, XY, Kuo, CCJ & Lew, MS (eds), Lecture Notes in Computer Science 6298 - Advances in Multimedia Information Processing - PCM 2010, Springer, Germany, pp. 43-51.View/Download from: Publisher's site
Movie affective content analysis attracts increasing research effort, since affective content not only affects users' attention but also locates movie highlights. However, affective content retrieval is still a challenging task due to the limitation of affective features in movies. Scripts provide direct access to the movie content and represent affective aspects of the movie. In this paper, we utilize scripts as an important clue to retrieve video affective content. The proposed approach includes two main steps. Firstly, affective script partitions are extracted by detecting emotional words. Secondly, affective partitions are validated by using visual and auditory features. The results, compared against the manually labelled ground truth, are encouraging.
Liao, Q, Wang, D, Holewa, H & Xu, M 2019, 'Squeezed Bilinear Pooling for Fine-Grained Visual Categorization', 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), IEEE, South Korea.View/Download from: Publisher's site
Xu, Y, Xu, D, Hong, X, Ouyang, W, Ji, R, Xu, M & Zhao, G 2019, 'Structured Modeling of Joint Deep Feature and Prediction Refinement for Salient Object Detection', 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea.View/Download from: Publisher's site
Recent saliency models extensively explore incorporating multi-scale contextual information from Convolutional Neural Networks (CNNs). Besides direct fusion strategies, many approaches introduce message-passing to enhance CNN features or predictions. However, the messages are mainly transmitted in two ways: by feature-to-feature passing and by prediction-to-prediction passing. In this paper, we add message-passing between features and predictions and propose a deep unified CRF saliency model. We design a novel cascade CRF architecture with a CNN to jointly refine deep features and predictions at each scale and progressively compute a final refined saliency map. We formulate the CRF graphical model to involve message-passing of feature-feature, feature-prediction, and prediction-prediction, from the coarse scale to the finer scale, to update the features and the corresponding predictions. We also formulate the mean-field updates for joint end-to-end model training with the CNN through back-propagation. The proposed deep unified CRF saliency model is evaluated over six datasets and shows highly competitive performance among state-of-the-art methods.
Liao, Q, Holewa, H, Xu, M & Wang, D 2018, 'Fine-Grained Categorization by Deep Part-Collaboration Convolution Net', 2018 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2018, Digital Image Computing: Techniques and Applications, IEEE, Australia.View/Download from: Publisher's site
© 2018 IEEE. In the part-based categorization context, the ability to learn representative features from tiny object parts is as important as exactly localizing the parts. We propose a new deep net structure for fine-grained categorization that follows the taxonomy workflow, which makes it interpretable and understandable for humans. By training customized sub-nets on each manually annotated part, we increase the state-of-the-art part-based classification accuracy on the fine-grained CUB-200-2011 dataset by 2.1%. Our study shows the proposed method produces stronger activations for discriminating detailed part differences while maintaining high computational performance by applying a set of strategies to optimize the deep net structure.
Pham, T, Takalkar, M, Xu, M, Hoang, DT, Truong, HA, Dutkiewicz, E & Perry, S 2019, 'Airborne Object Detection Using Hyperspectral Imaging: Deep Learning Review', Computational Science and Its Applications – ICCSA 2019, International Conference on Computational Science and Its Applications, Springer, Saint Petersburg, Russia, pp. 306-321.View/Download from: Publisher's site
Hyperspectral images have become increasingly important in object detection applications, especially in remote sensing scenarios. Machine learning algorithms have become emerging tools for hyperspectral image analysis. The high dimensionality of hyperspectral images and the availability of simulated spectral sample libraries make deep learning an appealing approach. This paper reviews recent data processing and object detection methods in the area, including hand-crafted and automated feature extraction based on deep learning neural networks. Accuracy was compared according to existing reports as well as our own experiments (i.e., re-implementing and testing on new datasets). CNN models provided reliable performance of over 97% detection accuracy across a large set of HSI collections. A wide range of data was used, from a rural area (Indian Pines) and an urban area (Pavia University) to a wetland region (Botswana), an industrial site (Kennedy Space Center) and a farm site (Salinas). Note that the Botswana set was not covered in recent reviews, so selected high-accuracy methods were newly compared on it in this work. A plain CNN model was also found to perform comparably to its more complex variants in target detection applications.
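A minimal PyTorch sketch of a plain per-pixel spectral CNN of the kind reviewed; the layer sizes are assumptions, and 200 bands matches the commonly used corrected Indian Pines cube:

    import torch
    import torch.nn as nn

    class SpectralCNN(nn.Module):
        # 1D convolutions over the spectral axis for per-pixel classification.
        def __init__(self, n_bands=200, n_classes=16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=7), nn.ReLU(), nn.MaxPool1d(2),
                nn.Conv1d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
                nn.Flatten(),
            )
            with torch.no_grad():
                flat = self.net(torch.zeros(1, 1, n_bands)).shape[1]
            self.fc = nn.Linear(flat, n_classes)

        def forward(self, x):          # x: (batch, n_bands) pixel spectra
            return self.fc(self.net(x.unsqueeze(1)))

    model = SpectralCNN(n_bands=200, n_classes=16)
    print(model(torch.randn(4, 200)).shape)   # torch.Size([4, 16])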
Sang, L, Xu, M, Qian, S & Wu, X 2019, 'AAANE: Attention-based adversarial autoencoder for multi-scale network embedding', Advances in Knowledge Discovery and Data Mining 23rd Pacific-Asia Conference, PAKDD 2019, Macau, China, April 14-17, 2019, Proceedings, Part III, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, China, pp. 3-14.View/Download from: Publisher's site
© Springer Nature Switzerland AG 2019. Network embedding represents nodes in a continuous vector space and preserves structure information from a network. Existing methods usually adopt a one-size-fits-all approach when concerning multi-scale structure information, such as first- and second-order proximity of nodes, ignoring the fact that different scales play different roles in embedding learning. In this paper, we propose an Attention-based Adversarial Autoencoder Network Embedding (AAANE) framework, which promotes the collaboration of different scales and lets them vote for robust representations. The proposed AAANE consists of two components: (1) an attention-based autoencoder that effectively captures the highly non-linear network structure and can de-emphasize irrelevant scales during training, and (2) an adversarial regularization component that guides the autoencoder in learning robust representations by matching the posterior distribution of the latent embeddings to a given prior distribution. Experimental results on real-world networks show that the proposed approach outperforms strong baselines.
Takalkar, MA, Zhang, H & Xu, M 2019, 'Improving Micro-expression Recognition Accuracy Using Twofold Feature Extraction', MultiMedia Modeling (LNCS), International Conference on Multimedia Modeling, Springer, Thessaloniki, Greece, pp. 652-664.View/Download from: Publisher's site
© 2019, Springer Nature Switzerland AG. Micro-expressions are generated involuntarily on a person's face and are usually a manifestation of repressed feelings. Micro-expressions are characterised by short duration, involuntariness and low intensity. Because of these characteristics, micro-expressions are difficult to perceive and interpret correctly, and they are profoundly challenging to identify and categorise automatically. Previous work on micro-expression recognition has used hand-crafted features such as LBP-TOP, Gabor filters, HOG and optical flow. Recent work has also demonstrated the possible use of deep learning for micro-expression recognition. This paper is the first work to explore the combined use of hand-crafted and deep feature descriptors for the micro-expression recognition task. The aim is to extract features with both kinds of descriptor and integrate them into a large feature vector describing a video. Through experiments on the CASME, CASME II and CASME+2 databases, we demonstrate that our proposed method can achieve promising micro-expression recognition accuracy given larger training samples.
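A minimal sketch of the twofold-feature idea, with random arrays standing in for the two descriptors; the dimensions (177 for an LBP-TOP-style descriptor, 512 for a deep feature) and the linear SVM are assumptions, not the paper's settings:

    import numpy as np
    from sklearn.svm import LinearSVC

    def fuse(handcrafted, deep):
        # Concatenate the two per-video descriptors into one feature vector.
        return np.concatenate([handcrafted, deep], axis=1)

    rng = np.random.default_rng(0)
    lbp_top = rng.random((100, 177))    # stand-in hand-crafted features
    cnn_feat = rng.random((100, 512))   # stand-in deep features
    y = rng.integers(0, 3, 100)         # micro-expression class labels
    clf = LinearSVC().fit(fuse(lbp_top, cnn_feat), y)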
Yang, Y, Xu, M, Wu, W, Zhang, R & Peng, Y 2018, '3D Multiview Basketball Players Detection and Localization Based on Probabilistic Occupancy', 2018 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2018, Digital Image Computing: Techniques and Applications, IEEE, Australia, pp. 267-274.View/Download from: Publisher's site
© 2018 IEEE. This paper addresses the issue of 3D multiview basketball player detection and localization. Existing methods for this problem typically take background subtraction as input, which limits the accuracy of localization and the performance of further object tracking. Moreover, the performance of background-subtraction-based methods is heavily impacted by occlusions in crowded scenes. In this paper, we propose an innovative method which jointly implements deep-learning-based player detection and occupancy-probability-based player localization. Moreover, a new Bayesian model of the localization algorithm is developed, which uses foreground information from fisheye cameras to set up meaningful initialization values in the first iteration step, in order not only to eliminate ambiguous detections but also to accelerate computation. Experimental results on real basketball game data demonstrate that our methods significantly improve performance compared with current methods, by eliminating missed and false detections as well as increasing the probability of positive results.
Chen, X, Kong, X & Xu, M 2018, 'Road Vehicle Recognition Using Magnetic Sensing Feature Extraction and Classification', International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, International Conference on Intelligent Sensors, Sensor Networks and Information Processing, International Research Publication House, Venice, Italy, pp. 270-275.View/Download from: Publisher's site
This paper presents a road vehicle detection approach for intelligent transportation systems. The approach uses a low-cost magnetic sensor and an associated data collection system to collect magnetic signals. The system measures changes in the magnetic field, and it can also detect and count vehicles. We extend Mel Frequency Cepstral Coefficients to analyze vehicle magnetic signals. Vehicle type features are extracted using representations of the cepstrum, frame energy, and gap cepstrum of magnetic signals. We design a 2-dimensional map algorithm using Vector Quantization to classify vehicle magnetic features into four typical types of vehicles in Australian suburbs: sedan, van, truck, and bus. Experimental results show that our approach achieves a high level of accuracy for vehicle detection and classification.
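A minimal sketch of vector-quantisation classification, assuming one KMeans codebook per vehicle class and toy feature frames in place of the cepstral features; a signal is labelled by the codebook with the smallest quantisation error:

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_codebook(frames, k=8):
        # One codebook of k codewords per vehicle class.
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(frames)

    def quant_error(codebook, frames):
        # Mean distance from each frame to its nearest codeword.
        return codebook.transform(frames).min(axis=1).mean()

    rng = np.random.default_rng(0)
    codebooks = {c: fit_codebook(rng.normal(i, 1, (200, 12)))
                 for i, c in enumerate(["sedan", "van", "truck", "bus"])}
    test = rng.normal(2, 1, (50, 12))   # frames from an unknown vehicle
    print(min(codebooks, key=lambda c: quant_error(codebooks[c], test)))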
Huang, D-Y, Zhao, S, Schuller, BW, Yao, H, Tao, J, Xu, M, Xie, L, Huang, Q & Yang, J 2018, 'ASMMC-MMAC 2018: The Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and the First Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop', Proceedings of the 2018 ACM Multimedia Conference (MM'18), 26th ACM Multimedia Conference (MM), ACM, Seoul, South Korea, pp. 2120-2121.View/Download from: Publisher's site
Li, Z, Yao, L, Nie, P, Zhang, D & Xu, M 2018, 'Multi-rate Gated Recurrent Convolutional Networks for Video-Based Pedestrian Re-Identification', The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), AAAI Conference on Artificial Intelligence, AAAI, New Orleans, Louisiana, USA, pp. 7081-7088.
Matching pedestrians across multiple camera views has attracted lots of recent research attention due to its apparent importance in surveillance and security applications. While most existing works address this problem in a still-image setting, we consider the more informative and challenging video-based person re-identification problem, where a video of a pedestrian as seen in one camera needs to be matched to a gallery of videos captured by other non-overlapping cameras. We employ a convolutional network to extract appearance and motion features from raw video sequences, and then feed them into a multi-rate recurrent network to exploit the temporal correlations and, more importantly, to take into account the fact that pedestrians, sometimes even the same pedestrian, move at different speeds across different camera views. The combined network is trained in an end-to-end fashion, and we further propose an initialization strategy via context reconstruction to largely improve the performance. We conduct extensive experiments on the iLIDS-VID and PRID-2011 datasets, and our experimental results confirm the effectiveness and the generalization ability of our model.
Shi, Z, Xu, M, Pan, Q, Yan, B & Zhang, H 2018, 'LSTM-based Flight Trajectory Prediction', International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil.View/Download from: Publisher's site
Safety ranks first in Air Traffic Management (ATM). Accurate trajectory prediction can help ATM forecast potential dangers and effectively provide instructions for safe traveling. Most trajectory prediction algorithms work for land traffic; they rely on points of interest (POIs) and are only suitable for stationary road conditions. Compared with land traffic prediction, flight trajectory prediction is very difficult because way-points are sparse and flight envelopes are heavily affected by external factors. In this paper, we propose a flight trajectory prediction model based on a Long Short-Term Memory (LSTM) network. The four interacting layers of a repeating module in an LSTM enable it to connect long-term dependencies to the present prediction task. Applying sliding windows in the LSTM maintains continuity and avoids compromising the dynamic dependencies of adjacent states in long-term sequences, which helps to improve the accuracy of trajectory prediction. Taking the time dimension into consideration, both 3-D (time stamp, latitude and longitude) and 4-D (time stamp, latitude, longitude and altitude) trajectories are predicted to prove the efficiency of our approach. The dataset we use was collected by ADS-B ground stations. We evaluate our model with widely used measurements, such as the mean absolute error (MAE), the mean relative error (MRE), the root mean square error (RMSE) and dynamic time warping (DTW). As Markov models are the most popular in time-series processing, comparisons among the Markov Model (MM), the weighted Markov Model (wMM) and our model are presented. Our model outperforms the existing models (MM and wMM) and provides a strong basis for abnormality detection and decision-making.
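A minimal sliding-window LSTM sketch in PyTorch on a toy 4-D trajectory (standing in for time stamp, latitude, longitude and altitude); the window size, hidden size and training schedule are assumptions, not the paper's settings:

    import torch
    import torch.nn as nn

    class TrajLSTM(nn.Module):
        def __init__(self, d=4, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(d, hidden, batch_first=True)
            self.head = nn.Linear(hidden, d)

        def forward(self, x):              # x: (batch, window, d)
            out, _ = self.lstm(x)
            return self.head(out[:, -1])   # predict the next state

    W, D = 10, 4
    traj = torch.cumsum(torch.randn(500, D) * 0.01, dim=0)   # toy flight track
    X = torch.stack([traj[i:i + W] for i in range(len(traj) - W)])
    y = traj[W:]
    model = TrajLSTM(D)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(50):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()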
Wang, Z, Xu, M, Ning, Y, Wang, R & Huang, H 2018, 'RF-MVO: Simultaneous 3D Object Localization and Camera Trajectory Recovery Using RFID Devices and a 2D Monocular Camera', International Conference on Distributed Computing Systems, IEEE, Vienna, Austria.View/Download from: Publisher's site
Most existing RFID-based localization systems cannot accurately locate RFID-tagged objects in a 3D space. The few robot-based RFID solutions require reader antennas to be carried by a robot moving along an already-known trajectory at a constant speed. As a first attempt, this paper presents RF-MVO, which fuses battery-free RFID and monocular visual odometry to locate stationary RFID tags in a 3D space and recover an unknown trajectory of reader antennas bound to a 2D monocular camera. The proposed hybrid system exhibits three unique features. Firstly, since the trajectory of a 2D monocular camera can only be recovered up to an unknown scale factor, RF-MVO combines the relative-scale camera trajectory with depth-enabled RF phase to estimate an absolute scale factor and the spatially incident angles of an RFID tag. Secondly, we propose a joint optimization algorithm consisting of coarse-to-fine angular refinement, 3D tag localization and nonlinear parameter optimization to improve real-time performance. Thirdly, RF-MVO can determine the effect of the relative tag-antenna geometry on estimation precision, providing optimal tag positions and absolute scale factors. Our experiments show that RF-MVO can achieve 6.23 cm tag localization accuracy in a 3D space and 0.0158 absolute scale factor estimation accuracy for camera trajectory recovery.
Takalkar, M & Xu, M 2017, 'Image based Facial Micro-Expression Recognition using Deep Learning on Small Datasets', Proceedings of Digital Image Computing: Techniques and Applications (DICTA), 2017 International Conference on, The International Conference on Digital Image Computing: Techniques and Applications, IEEE, Sydney, Australia, pp. 1-7.View/Download from: Publisher's site
Facial micro-expressions refer to split-second muscle changes in the face, indicating that a person is either consciously or unconsciously suppressing their true emotions, and can even reflect mental health. Therefore, micro-expression recognition attracts increasing research effort in both psychology and computer vision. Existing research on micro-expression recognition has mainly used hand-crafted features, for example, Local Binary Patterns on Three Orthogonal Planes (LBP-TOP), Gabor filters and optical flow. Recently, deep convolutional neural networks have demonstrated a high degree of effectiveness for difficult face recognition tasks. This paper explores the possible use of deep learning for micro-expression recognition. To develop a reliable deep neural network, extensive training sets with a huge number of labeled image samples are required. However, micro-expression recognition is a challenging task due to the repressed facial appearance and short duration, which results in a lack of training data. In this paper, we propose to generate extensive training datasets of synthetic images using data augmentation on the CASME and CASME II databases. Then, these datasets are combined to tune a satisfactory CNN-based micro-expression recognizer. Experimental results demonstrate the effectiveness of the proposed CNN approach to image-based micro-expression recognition and present results comparable with the best related works.
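A minimal sketch of the augmentation step, assuming flips, brightness scaling and small shifts as the operations (the paper's exact augmentation set is not claimed here):

    import numpy as np

    def augment(img, rng):
        # Expand one face image into several plausible variants.
        out = [img, img[:, ::-1]]                                  # horizontal flip
        out.append(np.clip(img * rng.uniform(0.8, 1.2), 0, 255))   # brightness
        out.append(np.roll(img, rng.integers(-3, 4), axis=0))      # small shift
        return out

    rng = np.random.default_rng(0)
    faces = [rng.integers(0, 256, (64, 64)).astype(float) for _ in range(10)]
    augmented = [a for f in faces for a in augment(f, rng)]
    print(len(faces), "->", len(augmented))   # 10 -> 40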
Zhu, X, Li, L, Zhang, W, Rao, T, Xu, M, Huang, Q & Xu, D 2017, 'Dependency Exploitation: A Unified CNN-RNN Approach for Visual Emotion Recognition', Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Melbourne, Australia, pp. 3595-3601.View/Download from: Publisher's site
Visual emotion recognition aims to associate images with appropriate emotions. There are different visual stimuli that can affect human emotion, from low-level to high-level, such as color, texture, part, object, etc. However, most existing methods treat different levels of features as independent entities without an effective method for feature fusion. In this paper, we propose a unified CNN-RNN model to predict the emotion based on fused features from different levels by exploiting the dependency among them. Our proposed architecture leverages a convolutional neural network (CNN) with multiple layers to extract different levels of features within a multi-task learning framework, in which two related loss functions are introduced to learn the feature representation. Considering the dependencies within the low-level and high-level features, a bidirectional recurrent neural network (RNN) is proposed to integrate the learned features from different layers in the CNN model. Extensive experiments on both Internet image and art photo datasets demonstrate that our method outperforms the state-of-the-art methods with at least 7% performance improvement.
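A minimal PyTorch sketch of the fusion idea: features taken from several CNN depths are projected to a common size and treated as a low-to-high-level sequence that a bidirectional recurrent network integrates. All sizes are assumptions, and a GRU stands in for the RNN:

    import torch
    import torch.nn as nn

    class CnnRnnEmotion(nn.Module):
        def __init__(self, level_dims=(64, 128, 256), d=128, n_classes=8):
            super().__init__()
            # One projection per CNN level, so all levels share a size.
            self.proj = nn.ModuleList([nn.Linear(k, d) for k in level_dims])
            self.rnn = nn.GRU(d, d, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * d, n_classes)

        def forward(self, feats):          # feats: list of (batch, dim_k)
            seq = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
            out, _ = self.rnn(seq)         # integrate low-to-high levels
            return self.head(out.mean(dim=1))

    model = CnnRnnEmotion()
    feats = [torch.randn(4, 64), torch.randn(4, 128), torch.randn(4, 256)]
    print(model(feats).shape)              # torch.Size([4, 8])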
Rao, T, Xu, M, Liu, H, Wang, J & Burnett, I 2016, 'Multi-scale Blocks Based Image Emotion Classification Using Multiple Instance Learning', Proceedings - International Conference on Image Processing, ICIP, IEEE International Conference on Image Processing, IEEE, Phoenix, Arizona, USA.View/Download from: Publisher's site
Emotional factors usually affect users' preferences for and evaluations of images. Although affective image analysis attracts increasing attention, three major challenges remain: 1) it is difficult to classify an image into a single emotion type, since different regions within an image can represent different emotions; 2) there is a gap between low-level features and high-level emotions; and 3) it is difficult to collect a training set of reliable emotional image content. To address these three issues, we propose an emotion classification method based on multi-scale blocks using Multiple Instance Learning (MIL). We first extract blocks of an image at multiple scales using different image segmentation methods, namely pyramid segmentation and simple linear iterative clustering (SLIC), and represent each block using the bag-of-visual-words (BoVW) method. Then, to bridge the "affective gap", probabilistic latent semantic analysis (pLSA) is employed to estimate the latent topic distribution as a mid-level representation of each block. Finally, MIL, which reduces the need for exact labelling, is employed to classify the dominant emotion type of the image. Experiments carried out on three widely used datasets demonstrate that our proposed method with SLIC improves the state-of-the-art results of image emotion classification by 5.1% on average.
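A minimal Multiple Instance Learning sketch using a simple max-instance rule (an assumption, not the paper's MIL solver): an image is a bag of block-level features, and the bag takes the emotion score of its most confident block, so only the dominant region needs to carry the label.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    blocks = rng.random((300, 20))            # block-level feature vectors
    block_labels = rng.integers(0, 2, 300)    # weak per-block labels
    inst_clf = LogisticRegression(max_iter=200).fit(blocks, block_labels)

    def bag_score(bag):
        # bag: (n_blocks, 20); the image inherits its best block's score.
        return inst_clf.predict_proba(bag)[:, 1].max()

    print(bag_score(rng.random((9, 20))))     # one image with 9 blocks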
Wu, L, Wang, JQ, Zhu, G, Xu, M & Lu, H 2016, 'Person re-identification via rich color-gradient feature', 2016 IEEE International Conference on Multimedia and Expo (ICME), IEEE International Conference on Multimedia and Expo, IEEE, Seattle, USA, pp. 1-6.View/Download from: Publisher's site
Person re-identification refers to matching the same pedestrian across disjoint views in non-overlapping camera networks. Many local and global features have been put forward in the literature to solve the matching problem, where color features are robust to viewpoint variance and gradient features provide a rich representation robust to illumination change. However, how to effectively combine the color and gradient features is an open problem. In this paper, to effectively leverage the color-gradient property in multiple color spaces, we propose a novel Second Order Histogram (SOH) feature for person re-identification in large surveillance datasets. Firstly, we utilize discrete encoding to transform commonly used color spaces into an Encoding Color Space (ECS), and calculate statistical gradient features on each color channel. Then, a second-order statistical distribution is calculated on each cell map with a spatial partition. In this way, the proposed SOH feature effectively leverages the statistical properties of gradient and color and reduces redundant information. Finally, a metric learned by KISSME with the Mahalanobis distance is used for person matching. Experimental results on three public datasets, VIPeR, CAVIAR and CUHK01, show the promise of the proposed approach.
Zhang, H & Xu, M 2016, 'Modeling Temporal Information Using Discrete Fourier Transform for Recognizing Emotions in User-generated Videos', Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), IEEE International Conference on Image Processing, IEEE, Phoenix, Arizona, USA, pp. 629-633.View/Download from: Publisher's site
With the proliferation of user-generated Internet videos, emotion recognition in such videos attracts increasing research effort. However, most existing works are based on frame-level visual features and/or audio features, which might fail to model temporal information, e.g., characteristics accumulated over time. In order to capture video temporal information, in this paper, we propose to analyse features in the frequency domain transformed by the discrete Fourier transform (DFT features). Frame-level features are first extracted by a pre-trained deep convolutional neural network (CNN). Then, the time-domain features are interpolated and transformed into DFT features. CNN and DFT features are further encoded and fused for emotion classification. In this way, static image features extracted from a pre-trained deep CNN and temporal information represented by DFT features are jointly considered for video emotion recognition. Experimental results demonstrate that combining DFT features can effectively capture temporal information and therefore improve emotion recognition performance. Our approach has achieved state-of-the-art performance on the largest video emotion dataset (VideoEmotion-8), improving accuracy from 51.1% to 55.6%.
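A minimal sketch of the DFT step: each dimension of the frame-level feature is treated as a time series, and only the magnitudes of its lowest frequencies are kept, giving a fixed-length temporal descriptor regardless of video length (the paper additionally interpolates; the number of kept coefficients here is an assumption):

    import numpy as np

    def dft_features(frame_feats, n_coef=8):
        # frame_feats: (n_frames, feat_dim), with n_frames varying per video.
        spec = np.abs(np.fft.rfft(frame_feats, axis=0))   # (n_freq, feat_dim)
        return spec[:n_coef].ravel()                      # fixed-length vector

    video_a = np.random.rand(120, 32)   # 120 frames of 32-D CNN features
    video_b = np.random.rand(87, 32)    # different length, same descriptor size
    print(dft_features(video_a).shape, dft_features(video_b).shape)   # (256,) (256,)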
Guo, H, Wang, J, Xu, M, Zha, Z & Lu, H 2015, 'Learning Multi-view Deep Features for Small Object Retrieval in Surveillance Scenarios', Proceedings of the 23rd ACM International Conference on Multimedia, ACM International Conference on Multimedia, ACM, Brisbane, Australia, pp. 859-859.View/Download from: Publisher's site
Maleki, B, Ebrahimnezhad, H, Xu, M & He, X 2015, 'Hand Gesture Recognition for a Virtual Mouse Application Using Geometric Feature of Finger's Trajectories', Proceedings of the 7th International Conference on Internet Multimedia Computing and Service, International Conference on Internet Multimedia Computing and Service, ACM, China.View/Download from: Publisher's site
We aim to enable a computer to comprehend and perform mouse functions by analyzing a video of hand motions. For this purpose, dynamic gestures are captured by a web cam and recognized as pre-defined gestures which are mapped to mouse functions. The proposed algorithm initially detects the hand. Then, it tracks fingertip trajectories within a frame sequence. Finally, hand gestures are recognized by computing a set of proposed geometric features of the fingers' trajectories and comparing them with our collected gesture dataset. In this paper, four types of descriptors are defined for a dynamic gesture. Each descriptor includes a different number of features, which together compose a feature vector with 135 dimensions. Different classification algorithms (e.g., KNN, LDA, Naïve Bayes and SVM) are applied to compare the detection results. The minimal misclassification error rate (MCR) reaches about 4% (i.e., a correct recognition rate of 96%). Furthermore, we applied Principal Component Analysis (PCA) to reduce the number of features. With 30-dimensional features (principal components), the LDA classifier can achieve about 0.09% misclassification error rate.
Usman, M, He, X, Xu, M & Lam, KM 2015, 'Survey of Error Concealment Techniques: Research Directions and Open Issues', 2015 Picture Coding Symposium, PCS 2015 - with 2015 Packet Video Workshop, PV 2015 - Proceedings, IEEE Picture Coding Symposium, IEEE, Cairns, pp. 233-238.View/Download from: Publisher's site
Wang, X, Xu, M & Pusatli, T 2015, 'A Survey of Applying Machine Learning Techniques for Credit Rating: Existing Models and Open Issues', Neural Information Processing - ICONIP 2015, Part II (LNCS 9490), International Conference on Neural Information Processing, Springer, Istanbul, Turkey.View/Download from: Publisher's site
In recent years, machine learning techniques have been widely applied for credit rating. To make a rational comparison of the performance of different learning-based credit rating models, we focus on those models that are constructed and validated on the two most commonly used Australian and German credit approval data sets. Based on a systematic review of the literature, we further compare and discuss the performance of existing models. In addition, we identify and illustrate the limitations of existing works and discuss some open issues that could benefit future research in this area.
Zhang, F, Li, J, Li, F, Xu, M, Xu, Y & He, X 2015, 'Community detection based on links and node features in social networks', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), International Conference on Multimedia Modelling, Springer, Sydney, Australia, pp. 418-429.View/Download from: Publisher's site
© Springer International Publishing Switzerland 2015. Community detection is a significant but challenging task in the field of social network analysis. Many effective methods have been proposed to solve this problem. However, most of them are mainly based on topological structure or node attributes alone. In this paper, building on SPAEM, we propose a joint probabilistic model for community detection that combines node attributes and topological structure. In our model, we create a novel feature-based weighted network, within which each edge weight is represented by the node feature similarity between the two nodes at the ends of the edge. Then we fuse the original network and the created network with a parameter and employ the expectation-maximization (EM) algorithm to identify communities. Experiments on a diverse set of data, collected from Facebook and Twitter, demonstrate that our algorithm has achieved promising results compared with other algorithms.
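A minimal sketch of the fusion step, assuming cosine similarity for the feature-based edge weights (the paper's model is probabilistic): the topological adjacency matrix and the feature-similarity matrix are blended by a parameter before community detection.

    import numpy as np

    def fused_network(A, F, alpha=0.5):
        # A: adjacency matrix; F: node attribute vectors; alpha: blend weight.
        norms = np.linalg.norm(F, axis=1, keepdims=True) + 1e-12
        S = (F / norms) @ (F / norms).T     # pairwise cosine similarity
        np.fill_diagonal(S, 0.0)
        return alpha * A + (1 - alpha) * S  # fused weighted network

    A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
    F = np.random.rand(4, 16)
    W = fused_network(A, F, alpha=0.6)      # hand W to any community detector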
Guo, D, Zhang, J, Xu, M, He, X, Li, M & Zhao, C 2014, 'A Multiple Features Distance Preserving (MFDP) Model for Saliency Detection', Digital Image Computing: Techniques and Applications, IEEE, Wollongong.View/Download from: Publisher's site
Playing a vital role, saliency has been widely applied in various image analysis tasks, such as content-aware image retargeting, image retrieval and object detection. It is generally accepted that saliency detection can benefit from the integration of multiple visual features. However, most of the existing literature fuses multiple features at the saliency-map level without considering cross-feature information, i.e., generating a saliency map from several maps each computed from an individual feature. In this paper, we propose a Multiple Features Distance Preserving (MFDP) model to seamlessly integrate multiple visual features through an alternating optimization process. Our method outperforms state-of-the-art methods on saliency detection. Saliency detected by our method is further combined with a seam carving algorithm and significantly improves image retargeting performance.
Liu, H, Xu, M, He, X & Wang, J 2014, 'Estimate Gaze Density by Incorporating Emotion', ACM MM2014, ACM International Conference on Multimedia, ACM, Orlando, pp. 1113-1116.View/Download from: Publisher's site
Gaze density estimation has attracted many research efforts in the past years. The factors considered in existing methods include low-level feature saliency, spatial position, and objects. Emotion, as an important factor driving attention, has not been taken into account. In this paper, we are the first to estimate gaze density through incorporating emotion. To estimate the emotion intensity of each position in an image, we consider three aspects: generic emotional content, facial expression intensity, and emotional objects. Generic emotional content is estimated by using Multiple Instance Learning, which is employed to train an emotion detector from weakly labeled images. Facial expression intensity is estimated by using a ranking method. Emotional objects are detected, taking blood/injury and worm/snake as examples. Finally, emotion intensity, low-level feature saliency, and spatial position are fused, through a linear support vector machine, to estimate gaze density. The performance is tested on a public eye-tracking dataset. Experimental results indicate that incorporating emotion does improve the performance of gaze density estimation.
Teng, K, Wang, J, Xu, M & Lu, H 2014, 'Mask assisted object coding with deep learning for object retrieval in surveillance videos', MM '14 Proceedings of the ACM International Conference on Multimedia, ACM International Conference on Multimedia, ACM, Orlando, Florida, USA.View/Download from: Publisher's site
Almarwani, A, Alqarni, L, Hakami, H, Chaczko, ZC & Xu, M 2013, 'Door Wave Home Automation System', IET/IEEE Second International Conference on Smart and Sustainable City, IET International Conference on Smart and Sustainable City, IEEE, Shanghai, China, pp. 98-103.View/Download from: Publisher's site
Technological developments are increasingly focused on the automation of control systems. Technology is used in homes to create a digital environment, controlling, for example, room temperature, sundry devices, security and lighting. The design of home automation systems is geared towards the automation of processes like remote control of home appliances. The use of Wireless Sensor and Actuator Networks (WSANs) in home automation is a growing trend. WSANs are based on network architectures and protocols that enable a network of integrated devices to monitor and control household appliances.
Fu, J, Wang, J, Li, Z, Xu, M & Lu, H 2013, 'Efficient clothing retrieval with semantic-preserving visual phrases', Lecture Notes in Computer Science, Asian Conference on Computer Vision, Springer, Singapore, Singapore, pp. 420-431.View/Download from: Publisher's site
In this paper, we address the problem of large-scale cross-scenario clothing retrieval with semantic-preserving visual phrases (SPVP). Since human parts are important cues for clothing detection and segmentation, we firstly detect human parts as the
Hasan, MA, Xu, M & He, S 2011, 'A Comprehensive Approach to Automatic Image Browsing for Small Display Devices', The Era of Interactive Media - Pacific-Rim Conference on Multimedia, Pacific-Rim Conference on Multimedia, Springer New York, Sydney, Australia, pp. 267-276.View/Download from: Publisher's site
Recently, small display devices have become widely used to browse digital images. On a small display, the content of an image appears very small, and users have to zoom and pan manually to see its details. Hence, an automatic image browsing solution is desired for user convenience. In this chapter, a novel, comprehensive and efficient system is proposed to browse high-resolution images on small display devices by automatically panning and zooming on Regions of Interest (ROIs). The challenge is to provide a better user experience on heterogeneous small display sizes. First of all, an input image is classified into one of three classes: close-up, landscape and other. Then the ROIs of the image are extracted. Finally, the ROIs are browsed based on different intuitive and study-based strategies. Our proposed system is evaluated by subjective tests. Experimental results indicate that the proposed system is an effective technique for displaying large images on small display devices.
Liu, H, Xu, D, Huang, Q, Li, W, Xu, M & Lin, S 2013, 'Semantically-based Human Scanpath Estimation with HMMs', International Conference on Computer Vision, IEEE International Conference on Computer Vision, IEEE, Sydney, Australia, pp. 3232-3239.View/Download from: Publisher's site
We present a method for estimating human scanpaths, which are sequences of gaze shifts that follow visual attention over an image. In this work, scanpaths are modeled based on three principal factors that influence human attention, namely low-level feature saliency, spatial position, and semantic content. Low-level feature saliency is formulated as transition probabilities between different image regions based on feature differences. The effect of spatial position on gaze shifts is modeled as a Levy flight with the shifts following a 2D Cauchy distribution. To account for semantic content, we propose to use a Hidden Markov Model (HMM) with a Bag-of-Visual-Words descriptor of image regions. An HMM is well-suited for this purpose in that 1) the hidden states, obtained by unsupervised learning, can represent latent semantic concepts, 2) the prior distribution of the hidden states describes visual attraction to the semantic concepts, and 3) the transition probabilities represent human gaze shift patterns. The proposed method is applied to task-driven viewing processes. Experiments and analysis performed on human eye gaze data verify the effectiveness of this method.
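A minimal sketch of the spatial term alone: gaze-shift steps drawn from a heavy-tailed 2D Cauchy distribution produce the Levy-flight pattern of many short fixational moves with occasional long jumps (the step scale and unit-square image coordinates are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    pos = np.array([0.5, 0.5])                 # start at the image centre
    path = [pos.copy()]
    for _ in range(20):
        step = rng.standard_cauchy(2) * 0.02   # isotropic 2D Cauchy step
        pos = np.clip(pos + step, 0.0, 1.0)    # stay inside the unit image
        path.append(pos.copy())
    scanpath = np.array(path)                  # (21, 2) sequence of fixations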
Peng, Y, Jin, JS, Luo, S, Xu, M, Au, S, Zhang, Z & Cui, Y 2011, 'Vehicle type classification using data mining techniques', The Era of Interactive Media, Pacific-Rim Conference on Multimedia, Springer, Sydney, Australia, pp. 325-335.View/Download from: Publisher's site
© 2013 Springer Science+Business Media, LLC. All rights reserved. In this paper, we propose a novel and accurate vision-based vehicle type classification system. The system builds up a classifier by applying a Support Vector Machine to various features of vehicle images. We make three contributions here: first, we are the first to incorporate the color of the license plate into the classification system. Moreover, the vehicle front is measured accurately based on license plate localization and a background-subtraction technique. Finally, type probabilities for every vehicle image are derived from eigenvectors rather than deciding the vehicle type directly. Instead of calculating eigenvectors from whole-body vehicle images as in existing methods, our eigenvectors are calculated from vehicle front images. These improvements make our system more applicable and accurate. The experiments demonstrate that our system performs well, with a very promising classification rate under different weather and lighting conditions.
Yang, X, Zhang, T, Xu, C & Xu, M 2013, 'Graph-guided fusion penalty based sparse coding for image classification', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Pacific-Rim Conference on Multimedia, Springer, Nanjing, China, pp. 475-484.View/Download from: Publisher's site
In image classification, conventional sparse coding encodes local features independently. As a result, similar local features may be encoded into code vectors with large discrepancies. This sensitivity has become the bottleneck of traditional sparse-coding-based image classification methods. In this paper, we propose a novel graph-guided fusion penalty based sparse coding method. To alleviate the sensitivity of traditional sparse coding, our approach constrains similar local features to be encoded into similar code vectors. To achieve this goal, we add the popular graph-guided fusion penalty term to the traditional l1-regularized sparse coding formulation. Finally, we adopt the multi-task form of the smoothing proximal gradient method to solve our optimization problem efficiently. Experimental results on three benchmark datasets demonstrate the effectiveness of our improved sparse coding method in image classification. © Springer International Publishing Switzerland 2013.
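A plausible form of the resulting objective, written as an assumption consistent with the description rather than the paper's exact equation, where D is the dictionary, z_i is the code of local feature x_i, and E is a graph linking similar local features:

    \min_{Z}\ \sum_i \lVert x_i - D z_i \rVert_2^2
    \; + \; \lambda \sum_i \lVert z_i \rVert_1
    \; + \; \gamma \sum_{(i,j) \in E} w_{ij}\, \lVert z_i - z_j \rVert_1

The third term is the graph-guided fusion penalty: it charges for differences between the codes of features that the graph deems similar, which is what pushes similar local features toward similar code vectors.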
Hasan, M, Xu, M, He, S & Chen, L 2012, 'Shot Classification Using Domain Specific Features for Movie Management', Lecture Notes in Computer Science, International Conference on DASFAA, Springer, Busan, South Korea, pp. 314-318.View/Download from: Publisher's site
Among the many video types, movie content indexing and retrieval is a particularly challenging task because of the wide variety of shooting techniques and the broad range of genres. A movie consists of a series of video shots. Managing a movie at the shot level provides a feasible way towards movie understanding and summarization. Consequently, effective shot classification is greatly desired for advanced movie management. In this demo, we explore novel domain-specific features for effective shot classification. Experimental results show that the proposed method classifies movie shots from a wide range of movie genres with improved accuracy compared to existing work.
Peng, Y, Jin, J, Luo, S, Xu, M & Cui, Y 2012, 'Vehicle Type Classification Using PCA with Self-Clustering', 2012 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), IEEE International Conference on Multimedia and Expo, IEEE, Melbourne, Australia, pp. 384-389.View/Download from: Publisher's site
Different conditions, such as occlusions, changes of lighting, shadows and rotations, make vehicle type classification a challenging task, especially for real-time applications. Most existing methods rely on presumptions about certain conditions, such as lighting conditions and special camera settings. However, these presumptions usually do not hold for real-world applications. In this paper, we propose a robust vehicle type classification method based on adaptive multi-class Principal Components Analysis (PCA). We treat car images captured at daytime and night-time separately. The vehicle front is extracted by examining the vehicle front width and the location of the license plate. Then, after generating eigenvectors to represent the extracted vehicle fronts, we propose a PCA method with self-clustering to classify vehicle type. Comparison experiments with state-of-the-art methods and real-time evaluations demonstrate the promising performance of our proposed method. Moreover, as we could not find any public database with sufficient suitable images, we have built our own database of 4924 high-resolution vehicle-front images and made it available online for further research on this topic.
Peng, Y, Jin, J, Luo, S, Xu, M & Cui, Y 2012, '3D pose estimation of front vehicle towards a better driver assistance system', 2012 IEEE International Conference on Multimedia and Expo Workshops, IEEE International Conference on Multimedia and Expo, IEEE Computer Society, Melbourne, Australia, pp. 522-527.View/Download from: Publisher's site
Driver assistance systems enhance traffic safety and efficiency. An accurate 3D pose of the front vehicle can help the driver make right decisions on the road. We propose a novel real-time system to estimate the 3D pose of the front vehicle. This system consists of two parallel threads: vehicle rear tracking and mapping. The vehicle rear is first identified in the video captured by an on-board camera, after license plate localization and foreground extraction. A 3D pose estimation technique is then employed with respect to the extracted vehicle rear. Most 3D pose estimation techniques need prior models or a stereo initialization with user cooperation. It is extremely difficult to obtain prior models due to the varied appearance of vehicle rears. Moreover, it is unsafe to ask for the driver's cooperation while the vehicle is moving. In our system, two initial key frames for the stereo algorithm are automatically extracted by vehicle rear detection and tracking. Map points are defined as a collection of point features extracted from the vehicle rear together with their 3D information. These map points relate 2D features detected in subsequent vehicle rears to the 3D world. The relative 3D pose between the current vehicle rear and the on-board camera is then estimated through mapping that matches map points with current point features. We demonstrate the abilities of our system through augmented reality, which needs accurate and real-time 3D pose estimation.
Peng, Y, Luo, S, Xu, M, Ni, Z, Jin, J, Wang, J & Zhao, G 2012, 'Bag of Features using sparse coding for gender classification', Proceedings of the 4th International Conference on Internet Multimedia Computing and Service, ICIMS 2012, ACM, Wuhan, China, pp. 80-83.View/Download from: Publisher's site
Gender classification is challenging: methods for gender classification need to discriminate subtle differences between male and female faces. The Bag-of-Features (BoF) method with sparse coding has proven very powerful in image classification. In this paper, we apply the BoF method to gender classification. We use two sets of images: training images and testing images. All images are represented by a set of Scale-Invariant Feature Transform (SIFT) descriptors. In the training stage, using sparse coding, a Visual Words Dictionary (VWD) is constructed from SIFT descriptors extracted from the training images. In testing, SIFT descriptors of testing images are approximated by visual words in the VWD, and the choice of approximating visual words determines the classification decision. We apply our method and two other popular methods to a public dataset for gender classification and achieve promising results.
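A minimal sketch of BoF with sparse coding, with random vectors standing in for SIFT descriptors and sklearn's dictionary learning in place of the paper's training procedure; the dictionary size, sparsity level and max pooling are assumptions:

    import numpy as np
    from sklearn.decomposition import DictionaryLearning, sparse_encode

    rng = np.random.default_rng(0)
    descriptors = rng.random((500, 128))      # stand-in "SIFT" descriptors
    dico = DictionaryLearning(n_components=64, alpha=1.0,
                              max_iter=20, random_state=0).fit(descriptors)

    def image_vector(img_desc):
        # Sparse-code one image's descriptors against the visual-word
        # dictionary, then max-pool the codes into a single image vector.
        codes = sparse_encode(img_desc, dico.components_, alpha=1.0)
        return np.abs(codes).max(axis=0)

    print(image_vector(rng.random((40, 128))).shape)   # (64,)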
Peng, Y, Xu, M, Ni, Z, Jin, J & Luo, S 2012, 'Accurate pedestrian counting system based on local features', Lecture Notes in Computer Science, Pacific-Rim Conference on Multimedia, Springer, Singapore, pp. 850-860.View/Download from: Publisher's site
Accurate pedestrian counting is challenging in the real world due to occlusions, overlapping pedestrians and camera-view sensitivity. In this paper, we propose an accurate and robust pedestrian detection and counting system to address these problems. Our proposed method is group-based: the count of people in a dense moving group is estimated as a whole. Moving groups containing one or several pedestrians are discriminated from other moving objects. Our method utilizes 9 features of each moving group within a video frame to estimate the pedestrian number in each group. Pedestrian counts are optimized by a novel tracking method based on an analysis of moving-group matches, splits and merges. Comparison experiments with two other current methods on three benchmark surveillance videos show the effectiveness of our proposed method.
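A hedged sketch of group-based count regression; the paper's exact nine group features are not reproduced here, so generic blob measurements and the train_masks/train_counts/test_masks placeholders stand in for them.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def group_features(mask):
    """Generic measurements of one moving group's (non-empty) binary mask."""
    ys, xs = np.nonzero(mask)
    area = xs.size
    h, w = ys.ptp() + 1, xs.ptp() + 1
    edges = np.count_nonzero(np.diff(mask.astype(np.int8), axis=1))
    return [area, h, w, h / w, area / float(h * w), edges]

# train_masks / train_counts / test_masks are assumed placeholders for
# annotated group masks, their ground-truth counts and unseen group masks.
reg = RandomForestRegressor(n_estimators=100)
reg.fit([group_features(m) for m in train_masks], train_counts)
frame_count = sum(int(round(c)) for c in
                  reg.predict([group_features(m) for m in test_masks]))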
Qu, Z, Wang, J, Xu, M & Lu, H 2012, 'A Grid Based Resizing Framework via Effectively Combining Cropping with Warping', 2012 IEEE International Conference on Image Processing ICIP 2012, IEEE International Conference on Image Processing, IEEE Computer Society, Lake Buena Vista, Florida, USA, pp. 2997-3000.
Image retargeting is the problem of adapting images to arbitrary aspect ratios in order to maximize the user's browsing experience. As the two major solutions for image retargeting, warping and cropping each have their own advantages and limitations. In this paper, a grid-based resizing framework is proposed for effectively combining warping with cropping. Firstly, warping preserves more important content within the cropping window by retaining the aspect ratios of salient grids and distorting the non-salient ones. Secondly, cropping provides extra space for warping to absorb the spatial deformation while ensuring that the important content is retained in the retargeted image. Finally, the objective function is formulated as two energy terms, one for warping and one for cropping, and a nonlinear optimization is applied to obtain the retargeting results. Our approach makes warping and cropping complement each other and effectively improves the quality of the retargeted image. Experiments and comparisons on the RetargetMe dataset demonstrate the superiority of our approach.
Qu, Z, Wang, J, Xu, M & Lu, H 2012, 'Fusing warping, cropping, and scaling for optimal image thumbnail generation', Lecture Notes in Computer Science, Asian Conference on Computer Vision, Springer, Singapore, Singapore, pp. 445-456.View/Download from: Publisher's site
Image retargeting, as a content-aware technique, is regarded as a logical tool for generating image thumbnails. However, the enormous difference between the source and target sizes usually hinders any single retargeting method from obtaining satisfactory results…
Wang, W, Wu, Q, He, S & Xu, M 2012, 'On Splitting Dataset: Boosting Locally Adaptive Regression Kernels for Car Localization', 2012 12th International Conference on Control, Automation, Robotics & Vision, International Conference on Control, Automation, Robotics and Vision, IEEE Press, Guangzhou, China (People's Republic of), pp. 1154-1159.View/Download from: Publisher's site
In this paper, we study the impact of learning an AdaBoost classifier with a small sample set (i.e., with few training examples). In particular, we use car localization as the underlying application, because car localization is widely applicable to real-world problems. In order to evaluate the performance of AdaBoost learning with few examples, we apply AdaBoost learning to a recently proposed feature descriptor, the Locally Adaptive Regression Kernel (LARK). As a state-of-the-art feature descriptor, LARK is robust against illumination changes and noise. More importantly, we use LARK because its spatial property is also favorable for our purpose: each patch in the LARK descriptor corresponds to one unique pixel in the original image. In addition to learning a detector from the entire training dataset, we also split the original training dataset into several sub-groups and train one detector for each sub-group. We compare the features selected by each sub-group's detector with those of the detector learnt from the entire training dataset and propose improvements based on the comparison results. Our experimental results indicate that AdaBoost learning is only successful on a small dataset when the learnt features simultaneously satisfy two conditions: 1) the features are learnt from the Region of Interest (ROI), and 2) the features are sufficiently far away from each other.
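A minimal sketch of the split-then-train comparison with scikit-learn's AdaBoost, assuming LARK descriptors are already flattened into a feature matrix X with labels y (both placeholders); the number of sub-groups is illustrative.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_detector(X, y):
    # Decision stumps as weak learners (scikit-learn >= 1.2 uses 'estimator').
    weak = DecisionTreeClassifier(max_depth=1)
    return AdaBoostClassifier(estimator=weak, n_estimators=50).fit(X, y)

# One detector from the whole set, plus one per sub-group for comparison
# (X and y are assumed placeholders for LARK feature vectors and labels).
full_detector = train_detector(X, y)
rng = np.random.default_rng(0)
parts = np.array_split(rng.permutation(len(X)), 4)
sub_detectors = [train_detector(X[p], y[p]) for p in parts]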
Peng, Y, Xu, M, Jin, JS, Luo, S & Zhao, G 2011, 'Cascade Based License Plate Localization with Line Segment Features and Haar-Like Features', 6th International Conference on Image and Graphics (ICIG) 2011, International Conference on Image and Graphics, IEEE Computer Society, Hefei, Anhui, China, pp. 1023-1028.View/Download from: Publisher's site
AdaBoost classifiers with Haar-like features are widely used for license plate (LP) localization. However, they normally require high-dimensional Haar-like features, which incur extremely high computational cost. In this paper, a rejection cascade is built for LP localization with reduced Haar-like features. We first introduce line segment features as a pre-input to the Haar-like features for AdaBoost, eliminating more than 70% of the background in an image. The line segment features, including density, directionality and regularity, are extracted from line segments detected by applying the Hough Transform to an edge image. AdaBoost classifiers with Haar-like features are then applied to identify the exact location of license plates. Our method dramatically reduces the required dimensionality of the Haar-like features and therefore saves much time in the AdaBoost training stage. Comparing our method with methods using only Haar-like features or only line segment features, experimental results demonstrate that our proposed method achieves the best detection rate with significantly reduced Haar-like feature dimensions.
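A rough sketch of the line-segment pre-filter stage using OpenCV's probabilistic Hough transform; the Canny/Hough thresholds and the density criterion are illustrative assumptions, not the paper's values.

import cv2
import numpy as np

def line_segment_density(gray, window):
    """Line segments per pixel inside one candidate window."""
    x, y, w, h = window
    edges = cv2.Canny(gray[y:y + h, x:x + w], 100, 200)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=30,
                           minLineLength=15, maxLineGap=3)
    return 0.0 if segs is None else len(segs) / float(w * h)

def keep_candidates(gray, windows, min_density=1e-3):
    """Cheaply reject background windows before the Haar/AdaBoost stage."""
    return [w for w in windows if line_segment_density(gray, w) >= min_density]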
Xiao, X, Xu, C, Wang, J & Xu, M 2011, 'Landmark Recognition and Retrieval: From 2D to 3D', Proceedings of the 2011 Joint ACM workshop on Human gesture and behavior understanding, Joint ACM workshop on Human gesture and behavior understanding, ACM NY USA, Scottsdale, USA, pp. 77-78.View/Download from: Publisher's site
Existing landmark retrieval methods cannot provide a comprehensive solution by which users can view a landmark from different angles. In this paper, we propose a novel approach to reconstruct and retrieve 3D landmark models by direct 2D-to-3D matching. In an offline module, firstly, an attention-based 3D reconstruction method is proposed to reconstruct sparse 3D landmark models. Secondly, we construct a textured 3D landmark model for each sparse 3D landmark model. Finally, a 3D landmark recognizer is built for each landmark based on its 3D landmark model. In the online module, query images are recognized by the 3D landmark recognizers using a 2D-to-3D matching approach. For each recognized query image, a 3D landmark model and a 3D landmark texture model are presented as the query result. Experimental results demonstrate the effectiveness of our proposed approach.
Xu, M, He, S, Xu, C, Wang, J, Hasan, MA, Lu, H & Jin, JS 2011, 'Using Context Saliency for Movie Shot Classification', 18th IEEE International Conference on Image Processing, IEEE International Conference on Image Processing, IEEE Computer Society, Brussels, Belgium, pp. 3653-3656.View/Download from: Publisher's site
Movie shot classification is a vital but challenging task due to the variety of movie genres, the range of shooting techniques, and the much larger number of shot types than in other video domains. A variety of shot types is used in movies in order to attract audiences' attention and enhance their viewing experience. In this paper, we introduce context saliency to measure the visual attention distributed in keyframes for movie shot classification. Different from traditional saliency maps, the context saliency map is generated by removing redundancy from contrast saliency and incorporating geometric constraints. Context saliency is then combined with color and texture features to generate feature vectors, and a Support Vector Machine (SVM) is used to classify keyframes into pre-defined shot classes. Different from existing works that either operate on a single movie genre or classify movie shots into a limited set of directing-semantic classes, the proposed method has three unique features: 1) context saliency significantly improves movie shot classification; 2) our method works for all movie genres; 3) our method deals with the most common types of shots in movies. The experimental results indicate that the proposed method is effective and efficient for movie shot classification.
Cui, Y, Jin, J, Park, M, Luo, S, Xu, M, Peng, Y, Felix, WS & Santos, L 2010, 'Computer Aided Abnormality Detection for Microscopy Images of Cervical Tissue', 2010 IEEE/ICME International Conference on Complex Medical Engineering, IEEE/ICME International Conference on Complex Medical Engineering, IEEE Computer Society, Gold Coast Australia, pp. 63-68.View/Download from: Publisher's site
Cervical cancer is the second most common malignancy among women worldwide; if it is detected at an early stage, the cure rate is relatively high. Computer-aided abnormality detection for cervical smears is developed to assist medical experts in handling microscopy images, examining cell abnormalities and diagnosing dyskaryosis. The microscopy images of cells in the cervix uteri are stained with the tumor marker Ki-67, so that abnormal nuclei appear brown while normal ones are bluish. Segmentation is the most important and difficult task in calculating the ratio of abnormal nuclei to all nuclei. In order to achieve accurate segmentation of nuclei, we propose a multi-level segmentation approach for abnormality identification in microscopy images. First-level segmentation aims to partition abnormal (stained) nuclei regions and all nuclei regions. Because of under-segmentation after the first level, second-level segmentation is applied to further partition the clustered nuclei. In order to classify touching regions of clustered nuclei and separate regions of single nuclei, relevant meaningful features are extracted from the regions of interest. Consequently, all the nuclei regions are separated and, in conjunction with the abnormal nuclei regions from the first-level segmentation, the abnormality, i.e. the ratio of abnormal nuclei to all nuclei, is obtained. Experimental results indicate that our method achieved an accuracy of 93.55% and 95.8% in terms of abnormal nuclei and all nuclei respectively for the identification of abnormalities. Our proposed method produces a satisfactory segmentation.
Guo, W, Xu, C, Ma, S & Xu, M 2010, 'Visual Attention Based Small Object Segmentation in Natural Images', 2010 IEEE International Conference on Image Processing ICIP 2010 - Proceedings, IEEE International Conference on Image Processing, IEEE Computer Society, Hong Kong, pp. 1565-1568.View/Download from: Publisher's site
Small object segmentation is a challenging task in image processing and computer vision. In this paper we propose a visual attention based segmentation approach to segment interesting objects of small size in natural images. Different from traditional methods that use single feature vectors, visual attention analysis is applied to local and global features to extract the region of interesting objects. Within the region selected by visual attention analysis, a Gaussian Mixture Model (GMM) is applied to further locate the object region. By incorporating visual attention analysis into object segmentation, the proposed approach narrows the search region for object segmentation, increasing segmentation accuracy and reducing computational complexity. Experimental results demonstrate that the proposed approach is efficient for object segmentation in natural images, especially for small objects, and significantly outperforms traditional GMM-based segmentation.
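A minimal sketch of the GMM step restricted to the attention region, assuming scikit-learn; treating the smaller mixture component as the small object is an illustrative assumption, not the paper's rule.

import numpy as np
from sklearn.mixture import GaussianMixture

def segment_in_roi(image, roi_mask, n_components=2):
    """Fit a GMM to pixel colours inside the attention region only."""
    pixels = image[roi_mask].astype(np.float64)       # (N, 3) colour samples
    labels = GaussianMixture(n_components=n_components).fit_predict(pixels)
    # Illustrative assumption: the smaller component is the small object.
    obj = np.argmin(np.bincount(labels))
    seg = np.zeros(image.shape[:2], dtype=bool)
    seg[roi_mask] = labels == obj
    return seg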
Liu, H, Xu, M, Huang, Q, Jin, J, Jiang, S & Xu, C 2010, 'A Close-up Detection Method for Movies', 2010 IEEE International Conference on Image Processing ICIP 2010 - Proceedings, IEEE International Conference on Image Processing, IEEE Computer Society, Hong Kong, pp. 1505-1508.View/Download from: Publisher's site
A close-up (CU) is a photographic technique that tightly frames a person or an object. In movies, it is applied to guide audience attention and to evoke audience emotion. In this paper, we detect face CUs, object CUs and leans in movies, which are widely used to render emotions. A lean consists of a sequence of shots with a close-up shot as its focus. A set of features is extracted for CU detection by considering movie-making techniques and human attention: average saliency, color entropy, color variance, face height, skin area, and texture scales. Statistical hypothesis tests show these features to be significantly discriminative for CUs. A Support Vector Machine (SVM) is then applied to these features to detect CUs; based on the CU detection result, leans are further detected by investigating the change of the face/object size. Lean detection is challenging due to the technique of montage; we solve this problem through color similarity estimation and SIFT point matching. Experimental results on four full-length movies verify the effectiveness of the proposed method.
Mu, Y, Xu, M & Yan, S 2010, 'Learning From Very-Few Labeled Examples with Soft Labels', 2010 IEEE International Conference on Image Processing ICIP 2010 - Proceedings, IEEE International Conference on Image Processing, IEEE Computer Society, Hong Kong, pp. 3869-3872.
In this paper we propose Softboost, a novel Boosting algorithm which combines the merits of transductive and inductive learning to attack the problem of learning from very few labeled training examples. In the transductive stage, soft labels of both the labeled and unlabeled samples are estimated based on a Markovian propagation procedure. In the subsequent inductive stage, to efficiently handle out-of-sample data, we learn a weighted combination of simple rules in Boosting style, each of which maximizes the confidence-weighted inter-class Kullback-Leibler (KL) divergence under the current data distribution. Finally, experiments on a toy dataset and USPS handwritten digits demonstrate its effectiveness.
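A hedged sketch of the two-stage idea with off-the-shelf components: scikit-learn's LabelSpreading stands in for the paper's Markovian propagation, and plain AdaBoost with confidence weights stands in for the KL-divergence-based booster (both are substitutions, not the original algorithm; X and y are assumed placeholders).

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.semi_supervised import LabelSpreading

def soft_labels(X, y):
    """y uses -1 for unlabeled samples, as scikit-learn expects."""
    prop = LabelSpreading(kernel='rbf', gamma=20).fit(X, y)
    dist = prop.label_distributions_       # soft labels, one row per sample
    return dist.argmax(axis=1), dist.max(axis=1)

# Inductive stage: weight every sample by the confidence of its soft label.
hard, conf = soft_labels(X, y)
clf = AdaBoostClassifier(n_estimators=50).fit(X, hard, sample_weight=conf)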
Park, M, Jin, JS, Peng, Y, Summons, D, Yu, D, Cui, Y, Luo, S, Felix, WS, Santos, L & Xu, M 2010, 'Automatic cell segmentation in microscopic color images using ellipse fitting and watershed', 2010 IEEE/ICME International Conference on Complex Medical Engineering, IEEE/ICME International Conference on Complex Medical Engineering, IEEE Computer Society, Gold Coast, pp. 69-74.View/Download from: Publisher's site
This paper presents an efficient and innovative method for the automated counting of cells in microscopic images. The performance of watershed-based algorithms for the segmentation of clustered cells has been well demonstrated. The strength of our algorithm lies in the fact that it incorporates knowledge of color in the image. Our method uses the watershed transform with iterative shape alignment and is shown to be more accurate in retaining cell shape. We report a sensitivity of 97% and a specificity of 96% when all color bands are used. Our methods could be of value to computer-based systems designed to objectively interpret microscopic images, since they provide a means for accurate cell segmentation.
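A minimal sketch of marker-controlled watershed on a binary nuclei mask, assuming scikit-image and SciPy; the paper's color knowledge and iterative shape alignment are omitted here.

import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_clustered_cells(mask):
    """Separate touching cells in a binary mask via distance-transform peaks."""
    distance = ndi.distance_transform_edt(mask)
    coords = peak_local_max(distance, min_distance=7, labels=mask.astype(int))
    markers = np.zeros(mask.shape, dtype=int)
    markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)
    return watershed(-distance, markers, mask=mask)  # one label per cell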
Peng, Y, Jin, J, Luo, S & Xu, M 2010, 'Learning priors for super-resolution in video sequence', Proceedings of the 2nd International Conference on Internet Multimedia Computing and Service, ICIMCS'10, International Conference on Internet Multimedia Computing and Service, ACM, Harbin, China, pp. 163-166.View/Download from: Publisher's site
Video has become a crucial information resource in recent decades because of the rapid development of cameras as well as the explosive growth of the internet. High-quality video sequences are desired in many fields. Since the bottleneck of data storage and interf…
Peng, Y, Park, M, Xu, M, Luo, S, Jin, J, Cui, Y, Felix, WS & Santos, L 2010, 'Clustering Nuclei Using Machine Learning Techniques', 2010 IEEE/ICME International Conference on Complex Medical Engineering, IEEE/ICME International Conference on Complex Medical Engineering, IEEE Computer Society, Gold Coast Australia, pp. 52-57.View/Download from: Publisher's site
Cervical cancer is the second most common cancer among women. Meanwhile, cervical cancer could be largely preventable and curable with regular Pap tests, which can find nuclei changes in the cervix. Accurate nuclei detection is extremely critical as it precedes the analysis of nuclei changes and the subsequent diagnosis. Recently, computer-aided nuclei segmentation has increased dramatically. Although such algorithms can be utilised for sparse nuclei, since these are intuitively detected, the segmentation of complicated nuclei clusters is still a challenging task. This paper presents a new methodology for the detection of cervical nuclei clusters. We first detect all the nuclei in the cervical microscopic image by an ellipse fitting algorithm. Second, we choose highly relevant features from all the features obtained in the previous step via the F-score, which measures the extent to which a feature contributes to the results. All the ellipses are then classified into single ones and cluster ones by a C4.5 decision tree with the selected features. We evaluated the performance of this method by classification accuracy, sensitivity, and cluster predictive value. With the 9 features selected from the original 13, we obtained a promising classification accuracy (97.8%).
Xu, M, Chen, L, He, S, Xu, C & Jin, J 2010, 'Adaptive Local Hyperplanes for MTV affective analysis', Proceedings of the 2nd International Conference on Internet Multimedia Computing and Service, ICIMCS'10, International Conference on Internet Multimedia Computing and Service, ACM, Harbin, China, pp. 167-170.View/Download from: Publisher's site
Affective analysis attracts increasing attention in the multimedia domain since affective factors directly reflect audiences' attention, evaluation and memory. Existing studies focus on mapping low-level affective features to high-level emotions by applying…
Xu, M, Guo, W, Xu, C & Ma, S 2010, 'Visual Attention Based Motion Object Detection and Trajectory Tracking', 11th Pacific Rim Conference on Multimedia, Springer, Shanghai China, pp. 462-470.View/Download from: Publisher's site
A motion trajectory tracking method using a novel visual attention model and kernel density estimation is proposed in this paper. As a crucial step, moving object detection is based on visual attention. The visual attention model is built by combining static and motion feature attention maps with a Karhunen-Loeve transform (KLT) distribution map. Since the visual attention analysis is conducted at the object level instead of the pixel level, the proposed method can detect any kind of moving object that exhibits saliency, regardless of object appearance and surrounding circumstances. After locating the region of the moving object, the kernel density is estimated for trajectory tracking. The experimental results show that the proposed method is promising for moving object detection and trajectory tracking.
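For illustration only, the classical OpenCV histogram back-projection plus mean shift tracker below is a stand-in for the paper's attention-plus-kernel-density pipeline, not a reproduction of it; frames and init_window are assumed inputs.

import cv2
import numpy as np

def track(frames, init_window):
    """frames: list of BGR images; init_window: (x, y, w, h) of the object."""
    x, y, w, h = init_window
    roi = cv2.cvtColor(frames[0][y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([roi], [0], None, [16], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    trajectory, window = [], init_window
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, window = cv2.meanShift(prob, window, term)
        trajectory.append(window)
    return trajectory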
Yu, P, Park, M, Xu, M, Luo, S, Jin, JS, Cui, Y & Felix Wong, WS 2010, 'Detection of nuclei clusters from cervical cancer microscopic imagery using C4.5', ICCET 2010 - 2010 International Conference on Computer Engineering and Technology, Proceedings.View/Download from: Publisher's site
Cervical cancer is the second most common cancer among women. At the same time, cervical cancer could be largely preventable and curable with regular Pap tests, which can find nuclei changes in the cervix. Accurate nuclei detection is extremely critical as it precedes the analysis of nuclei changes and the subsequent diagnosis. In recent years, automatic nuclei segmentation has increased dramatically. Although such algorithms can be utilised for sparse nuclei, since these are intuitively detected, the segmentation of complicated nuclei clusters is still a challenging task. This paper presents a new methodology for the detection of cervical nuclei clusters. We first detect all the nuclei in the cervical microscopic image by an ellipse fitting algorithm. All the ellipses are then classified into single ones and cluster ones by a C4.5 decision tree with selected features. We evaluated the performance of this method by classification accuracy, sensitivity, and cluster predictive value. The results show that a promising classification accuracy (97.8%) is obtained using C4.5 with 9 relevant features. © 2010 IEEE.
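A rough scikit-learn sketch of the pipeline, with f_classif approximating the F-score ranking and CART (entropy criterion) standing in for C4.5 (both substitutions); X_ellipse_features and y_single_or_cluster are assumed placeholders for the per-ellipse measurements and the single-vs-cluster labels.

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

pipe = make_pipeline(
    SelectKBest(f_classif, k=9),                   # keep 9 relevant features
    DecisionTreeClassifier(criterion='entropy'))   # information-gain splits
scores = cross_val_score(pipe, X_ellipse_features, y_single_or_cluster, cv=5)
print('mean accuracy: %.3f' % scores.mean())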
Park, M, Jin, JS, Xu, M, Wong, WSF, Luo, S & Cui, Y 2009, 'Microscopic image segmentation based on color pixels classification', 1st International Conference on Internet Multimedia Computing and Service, ICIMCS 2009, pp. 53-59.View/Download from: Publisher's site
Computer-assisted microscopy systems can increase the accuracy of analysis. To guarantee correct results in computer-assisted microscopy, accurate nuclei segmentation is crucially important, since image segmentation is the first step towards image understanding and image analysis. In this paper, we present clustering techniques that segment homogeneous clusters in RGB color space and then label each cluster as a different region. According to the evaluation process, 97% of nuclei pixels were correctly delineated with our algorithm and on average 90% of nuclei were correctly detected. Our methods could be of value to computer-based systems designed to objectively interpret microscopic images through accurate nuclei segmentation. Copyright 2009 ACM.
Xu, M, Luo, S, Jin, JS & Park, M 2009, 'Affective content analysis by mid-level representation in multiple modalities', 1st International Conference on Internet Multimedia Computing and Service, ICIMCS 2009, pp. 201-207.View/Download from: Publisher's site
Movie affective content detection attracts ever-increasing research effort. However, affective content analysis is still a challenging task due to the gap between low-level perceptual features and high-level human perception of the media. Moreover, clues from multiple modalities should be considered for affective analysis, since movies use them to represent emotions and to render the emotional atmosphere. In this paper, mid-level representations are generated from low-level features. These mid-level representations come from multiple modalities and are used for affective content inference. Besides video shots, which are commonly used for video content analysis, audio sounds, dialogue and subtitles are explored for affective content detection. Since affective analysis depends on movie genre, experiments are implemented per genre. The results show that audio sounds, dialogues and subtitles are effective and efficient for affective content detection. Copyright 2009 ACM.
Park, M, Jin, SJ, Hofstetter, R, Xu, M & Kang, BH 2008, 'Automatic colonic polyp detection by the mapping using regional unit sphere', Proceedings - 2008 International Conference on Multimedia and Ubiquitous Engineering, MUE 2008, pp. 144-149.View/Download from: Publisher's site
Colonic polyps appear as elliptical protrusions on the inner wall of the colon. Many proposed algorithms assume the shape of a polyp to be a spherical cap, so these algorithms are not flexible when polyps have irregular shapes. In this paper, we propose a mapping using regional unit sphere (MuRUS) method to overcome the problems caused by unexpected polyp shapes. MuRUS has shape-invariant and size-invariant properties. Our method was applied to colon CT images from 37 patients, each having a prone and a supine scan, with 45 colonoscopically confirmed polyps. The results obtained by our algorithm were compared with these gold standards: 100% of polyps >= 10mm in diameter were detected, 90% of polyps >= 6mm in diameter were detected and 70% of polyps < 6mm in diameter were detected at 7.0 FPs per patient. © 2008 IEEE.
Xu, M, Jin, JS & Luo, S 2008, 'Personalized video adaptation based on video content analysis', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 26-35.View/Download from: Publisher's site
Personalized video adaptation is expected to satisfy individual users' needs regarding video content, and multimedia data mining plays a significant role in annotating video to meet users' content preferences. In this paper, a comprehensive solution for personalized video adaptation is proposed based on video content mining. Video content mining targets both cognitive content and affective content. Cognitive content refers to semantic events, which are very specific to the video domain. Sometimes users might prefer an "emotional decision" when selecting video content of interest; we therefore introduce affective content, which causes audiences' strong reactions. For cognitive content mining, features are extracted from multiple modalities, and a machine learning module is applied to obtain middle-level features, such as specific audio sounds and semantic video shots. These middle-level features are used to detect cognitive content using Hidden Markov Models. For affective content mining, affective content is detected at three affective levels: "low", "medium" and "high". Considering that affective levels might have no sharp boundaries, fuzzy c-means clustering is used on low-level features to simulate users' perceptions. The adaptation is then implemented based on the MPEG-21 Digital Item Adaptation framework. One of the challenges is how to quantify users' preference for video content: Information Entropy (IE) and membership functions are calculated to decide priorities for resource allocation for cognitive content and affective content respectively. Copyright 2008 ACM.
Xu, M, Jin, JS, Luo, S & Duan, L 2008, 'Hierarchical movie affective content analysis based on arousal and valence features', MM'08 - Proceedings of the 2008 ACM International Conference on Multimedia, with co-located Symposium and Workshops, pp. 677-680.View/Download from: Publisher's site
Emotional factors directly reflect audiences' attention, evaluation and memory. Affective content analysis not only creates an index for users to access their movie segments of interest, but also provides a feasible entry point for video highlights. Most existing work focuses on emotion type detection; besides emotion type, emotion intensity is also a significant clue for users to find content of interest. For some film genres (horror, action, etc.), the segments with high emotion intensity are most likely to be video highlights. In this paper, we propose a hierarchical structure for emotion categories and analyze emotion intensity and emotion type by using arousal- and valence-related features hierarchically. Firstly, High, Medium and Low emotion intensity levels are detected by fuzzy c-means clustering on arousal features; fuzzy clustering provides a mathematical model to represent vagueness, which is close to human perception. After that, valence-related features are used to detect emotion types (Anger, Sad, Fear, Happy and Neutral). Considering that video is continuous time-series data and the occurrence of a certain emotion is affected by recent emotional history, Hidden Markov Models (HMMs) are used to capture the context information. Experimental results show that the movie segments with high emotion intensity cover over 80% of the movie highlights in horror and action movies, and that the hierarchical method outperforms the one-step method on emotion type detection. Meanwhile, it is flexible for users to pick their favorite affective content by choosing both emotion intensity levels and emotion types. Copyright 2008 ACM.
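A plain-NumPy sketch of fuzzy c-means over arousal features for the three intensity levels; arousal_feats is an assumed (n_samples, n_dims) placeholder and m is the usual fuzzifier.

import numpy as np

def fuzzy_cmeans(X, c=3, m=2.0, iters=100, seed=0):
    """Soft-cluster X into c groups; returns centres and memberships U."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U  # U[i, k] = degree to which sample i belongs to level k

centers, U = fuzzy_cmeans(arousal_feats)   # three clusters: Low/Medium/High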
Xu, M, Luo, S & Jin, JS 2008, 'Affective content detection by using timing features and fuzzy clustering', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 685-692.View/Download from: Publisher's site
Emotional factors directly reflect audiences' attention, evaluation and memory. Movie affective content detection attracts more and more research effort. Most of the existing work focuses on developing efficient affective features or implementing feasible pattern recognition algorithms. However, two important issues are ignored: 1) most of the features used in affective content detection are traditional visual/audio features, whereas affective content detection needs features directly related to emotions; 2) affective content is a subjective concept that heavily depends on human perception, and it is hard to find a clear boundary between emotion categories, yet most existing methods utilize hard pattern recognition algorithms that generate clear boundaries. In this paper, we address these two issues in two ways. 1) We employ timing features, which are an important element of films and an important part of a film's power to affect viewers' feelings and emotions; meanwhile, audio features are used together with timing features to detect affective content from multiple modalities. 2) Fuzzy clustering is used to map affective features to affective content; fuzzy logic provides a mathematical model to represent vagueness, which is close to human perception. Experimental results show the proposed method is effective and efficient. © 2008 Springer.
Xu, M, Park, M, Luo, S & Jin, JS 2008, 'Comparison analysis on supervised learning based solutions for sports video categorization', Proceedings of the 2008 IEEE 10th Workshop on Multimedia Signal Processing, MMSP 2008, pp. 526-529.View/Download from: Publisher's site
Due to its wide viewership and high commercial potential, sports video analysis has recently attracted extensive research effort. One of the main tasks in sports video analysis is to identify sports genres, i.e. sports video categorization. Most existing work focuses on mapping content-based features to sports genres using supervised learning methods. Moreover, video datasets call for efficient data reduction methods due to their large size and noisy data, yet comparative analysis of the implementation and performance of such methods is lacking. In this paper, the research is carried out using four dominant machine learning algorithms, namely Decision Tree, Support Vector Machine, K Nearest Neighbor and Naive Bayes, comparing their performance on a high-dimensional feature set reduced by feature selection tools such as Correlation-based Feature Selection (CFS), Principal Components Analysis (PCA) and Relief. Experimental results show that Support Vector Machine (SVM) and k-NN are not sensitive to reductions of the training set, and that the three feature reduction methods perform very differently across the four learning tools. © 2008 IEEE.
Xu, M, Luo, S & Jin, J 2007, 'Video adaptation based on affective content with MPEG-21 DIA framework', Proceedings of the 2007 IEEE Symposium on Computational Intelligence in Image and Signal Processing, CIISP 2007, pp. 386-390.View/Download from: Publisher's site
We present a video adaptation system which takes account of users' preference for video Affective Content (AC) and limited network resources. AC directly engages a user's attention, evaluation and memory, and provides a feasible entry point for video highlights. According to the user's preference, the proposed adaptation ensures that video parcels with AC are allocated as much network resource as possible. The system is implemented with the MPEG-21 Digital Item Adaptation (DIA) framework, which provides a generic video adaptation solution for all video formats and various usage environments by manipulating XML files. XML-based adaptation avoids complex video computation. 30 students from various departments were invited to test the system, and their responses were positive. © 2007 IEEE.
Xu, M, Chia, LT, Yi, H & Rajan, D 2006, 'Affective content detection in sitcom using subtitle and audio', MMM2006: 12th International Multi-Media Modelling Conference - Proceedings, pp. 129-134.
From a personalized media point of view, many users favor a flexible tool to quickly browse the affective content of a video. Such affective content may cause audiences' strong reactions or special emotional experiences, such as anger, sadness, fear, joy and love. This paper attempts to extract affective content from digital videos by analyzing the subtitle files of DVD/DivX videos and utilizing audio events to assist affective content detection. Firstly, videos are segmented by dialogue script partition. Compared to traditional video shots, video segmented by scripts is not affected by camera changes and shooting angles and tends to yield segments with compact content. Secondly, emotion-related vocabulary in the video script is detected to locate affective video content; using scripts to directly access video content avoids complex video analysis. Thirdly, audio event detection is utilized to assist affective content detection. Compared with traditional video semantic analysis, affective content analysis puts much more emphasis on the audience's reactions and emotions. Initial experiments are carried out on sitcom videos because their simple video structure provides useful domain knowledge. The experimental results demonstrate that subtitle file analysis and audio event detection provide effective and efficient clues for determining the emotional content of videos. © 2006 IEEE.
Xu, M, Li, J, Chia, LT, Hu, Y, Lee, BS, Rajan, D & Jin, JS 2006, 'Event on demand with MPEG-21 video adaptation system', Proceedings of the 14th Annual ACM International Conference on Multimedia, MM 2006, pp. 921-930.View/Download from: Publisher's site
In this paper, we present an event-on-demand (EoD) video adaptation system. The proposed system supports users in deciding their events of interest and considers network conditions to adapt the video source by event selection and frame dropping. Firstly, events are detected by audio/video analysis and annotated with the description schemes (DSs) provided by MPEG-7 Multimedia Description Schemes (MDSs). Then, to achieve a generic adaptation solution, the adaptation is developed following the MPEG-21 Digital Item Adaptation (DIA) framework. We build on an early release of the MPEG-21 Reference Software for XML generation and develop our own system for EoD video adaptation in three steps: 1) the event information is parsed from the MPEG-7 annotation XML file together with the bitstream to generate a generic Bitstream Syntax Description (gBSD); 2) users' preference, network characteristics and Adaptation QoS (AQoS) are considered in making the adaptation decision; 3) the adaptation engine automatically parses adaptation decisions and the gBSD to achieve adaptation. Unlike most existing adaptation work, the system adapts video of events of interest according to users' preference. Implementation following the MPEG-7 and MPEG-21 standards provides a generic video adaptation solution, and gBSD-based adaptation avoids complex video computation. 30 students from various departments were invited to test the system, and their responses have been positive. Copyright 2006 ACM.
Xu, M, Li, J, Hu, Y, Chia, LT, Lee, BS, Rajan, D & Cai, J 2006, 'An event-driven sports video adaptation for the MPEG-21 DIA framework', 2006 IEEE International Conference on Multimedia and Expo, ICME 2006 - Proceedings, pp. 1245-1248.View/Download from: Publisher's site
We present an event-driven video adaptation system in this paper. Events are detected by audio/video analysis and annotated with the description schemes (DSs) provided by MPEG-7 Multimedia Description Schemes (MDSs). The adaptation then takes account of users' preference for events and network characteristics to adapt the video by event selection and frame dropping, in the following three steps: 1) the event information is parsed from the MPEG-7 annotation XML file together with the bitstream to generate a generic Bitstream Syntax Description (gBSD); 2) users' preference, network characteristics and Adaptation QoS (AQoS) are considered in making the adaptation decision; 3) the adaptation engine automatically parses adaptation decisions and the gBSD to achieve adaptation. Different from most existing adaptation work, the system adapts video by events of interest according to users' preference. To achieve a generic adaptation solution, the system is developed following the MPEG-7 and MPEG-21 standards, and gBSD-based adaptation avoids complex video computation. 30 students from various departments tested the system with satisfaction. Although the system has been tested on basketball video adaptation so far, it is easy to extend to other video domains. © 2006 British Crown Copyright.
Wang, S, Xu, M, Chia, LT & Dash, M 2005, 'EASIER sampling for audio event identification', IEEE International Conference on Multimedia and Expo, ICME 2005, pp. 1214-1217.View/Download from: Publisher's site
An audio event refers to a specific audio sound which plays an important role in video content analysis. In our previous work, we established audio event identification as an audio classification task. Due to the large size of audio databases, representative samples are necessary for training the classifier. However, the commonly used random selection of training samples is often not adequate for selecting representative samples. In this paper we present the EASIER sampling algorithm to select data that more efficiently represent the characteristics of the audio data for audio event identifier training. EASIER deterministically produces a subsample whose "distance" from the complete database is minimal. Experiments in the context of audio event identification show that EASIER outperforms simple random sampling significantly. © 2005 IEEE.
Xu, M, Chia, LT & Jin, J 2005, 'Affective content analysis in comedy and horror videos by audio emotional event detection', IEEE International Conference on Multimedia and Expo, ICME 2005, pp. 622-625.View/Download from: Publisher's site
We study the problem of affective content analysis. In this paper, we think of affective content as those video/audio segments which may cause an audience's strong reactions or special emotional experiences, such as laughing or fear. These emotional factors are related to the users' attention, evaluation, and memories of the content. The modeling of affective effects depends on the video genre. In this work, we focus on comedy and horror films and extract the affective content by detecting a set of so-called audio emotional events (AEE), such as laughing and horror sounds. These AEE can be modeled by various audio processing techniques, and they directly reflect an audience's emotion. We use the AEE as a clue to locate corresponding video segments; domain knowledge is more or less employed at this stage. Our experimental dataset consists of a 40-minute comedy video and a 40-minute horror film. An average recall and precision above 90% are achieved. This shows that, in addition to rich visual information, appropriate use of distinctive audio is an effective way to assist affective content analysis. © 2005 IEEE.
Duan, LY, Xu, M, Tian, Q & Xu, CS 2004, 'Mean shift based nonparametric motion characterization', Proceedings - International Conference on Image Processing, ICIP, pp. 1597-1600.View/Download from: Publisher's site
Motion content is a very powerful cue for organizing video data. Efficient and robust identification of the nature of the camera motion and of the dominant object motion is important for generating useful motion annotations. Most existing methods focus on estimating a parametric motion model from dense optical flow fields or block-based MPEG motion vector fields (MVF). However, it is hard to achieve reliable model estimation across large amounts of video data, due to violations of the parametric assumption in the presence of large object motion and bad optical flow estimation in low-textured regions. In this paper, we employ the mean shift procedure and the histogram to propose a novel nonparametric motion representation. With this motion representation, we transform motion analysis into the classification of camera motion patterns in the presence of dominant and non-dominant object motion. The unique features include three main aspects: 1) instead of computationally expensive and vulnerable parametric regression, we base the motion characterization on the classification of motion patterns; 2) we employ machine learning to capture the knowledge of recognizing camera motion patterns from bad motion fields; and 3) with mean shift filtering, the proposed motion representation elegantly considers the spatial-range cues so as to remove noise and implement discontinuity-preserving smoothing of motion fields. Promising results are achieved on 1096 motion vector fields extracted from compressed broadcast soccer video. © 2004 IEEE.
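As a loose illustration, scikit-learn's MeanShift can stand in for the mean shift procedure to reduce a block motion-vector field to its dominant modes (a substitution for sketching purposes, not the paper's spatial-range filtering).

import numpy as np
from sklearn.cluster import MeanShift

def dominant_motion_modes(mvf, bandwidth=2.0, top=3):
    """mvf: (H, W, 2) block motion vectors -> up to `top` dominant modes."""
    vecs = mvf.reshape(-1, 2).astype(np.float64)
    ms = MeanShift(bandwidth=bandwidth).fit(vecs)
    counts = np.bincount(ms.labels_)
    order = np.argsort(counts)[::-1][:top]
    # Each row: (mode_u, mode_v, fraction of blocks attracted to that mode).
    return np.c_[ms.cluster_centers_[order], counts[order] / counts.sum()]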
Duan, LY, Xu, M, Tian, Q & Xu, CS 2004, 'Mean shift based video segment representation and applications to replay detection', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings.
Effective and efficient representation of the low-level features of groups of frames or shots is an important yet challenging task for video analysis and retrieval. Key-frame-based representation is limited by the difficulty of shot boundary detection for gradual transitions and by the variety of ways of extracting key frames. In this paper, we employ the mean shift mode-seeking function to develop a new approach for compact representation of video segments. The proposed video representation is motivated by the recognition that, at the global level, humans perceive images only as a combination of a few most prominent colors. We exploit spatiotemporal mode seeking in feature space to simulate the "subjectivity" of human decisions in video segment retrieval and identification. The effectiveness of the video representation and matching scheme is shown by initial experiments on replay detection in broadcast sports video.
Duan, LY, Xu, M, Tian, Q & Xu, CS 2004, 'Nonparametric motion model', ACM Multimedia 2004 - proceedings of the 12th ACM International Conference on Multimedia, pp. 754-755.View/Download from: Publisher's site
Motion information is a powerful cue for visual perception. In the context of video indexing and retrieval, motion content serves as a useful source for compact video representation. There is a large literature on parametric motion models; however, it is hard to secure a proper parametric assumption across a wide range of video scenarios. Diverse camera shots and frequent occurrences of improper optical flow estimation or block matching motivate us to develop nonparametric motion models. In this demonstration, we present a novel nonparametric motion model. Its unique features mainly include: 1) instead of computationally expensive and vulnerable parametric regression, our proposed model bases the motion characterization on the classification of motion patterns; 2) we employ machine learning to capture the knowledge of recognizing camera motion patterns from bad motion vector fields (MVF); and 3) with mean shift filtering, our proposed motion representation elegantly incorporates the spatial-range information for noise removal and discontinuity-preserving smoothing of MVFs. Promising results have been achieved on two tasks: 1) camera motion pattern recognition on 23191 MVFs and 2) recognition of the intensity of motion activity on 622 video segments culled from the MPEG-7 dataset.
Duan, LY, Xu, M, Tian, Q & Xu, CS 2004, 'Nonparametric motion model with applications to camera motion pattern classification', ACM Multimedia 2004 - proceedings of the 12th ACM International Conference on Multimedia, pp. 328-331.
Motion information is a powerful cue for visual perception. In the context of video indexing and retrieval, motion content serves as a useful source for compact video representation. There is a large literature on parametric motion models; however, it is hard to secure a proper parametric assumption across a wide range of video scenarios. Diverse camera shots and frequent occurrences of bad optical flow estimation motivate us to develop nonparametric motion models. In this paper, we employ the mean shift procedure to propose a novel nonparametric motion representation. With this compact representation, various motion characterization tasks can be achieved by machine learning. Such a learning mechanism can not only capture domain-independent parametric constraints, but also acquire domain-dependent knowledge to tolerate the influence of bad dense optical flow vectors or block-based MPEG motion vector fields (MVF). The proposed nonparametric motion model has been applied to camera motion pattern classification on 23191 MVFs extracted from the MPEG-7 dataset.
Xu, M, Duan, LY, Chia, LT & Xu, CS 2004, 'Audio keyword generation for sports video analysis', ACM Multimedia 2004 - proceedings of the 12th ACM International Conference on Multimedia, pp. 758-759.View/Download from: Publisher's site
Semantic sports video analysis has attracted much research interest, and audio cues have been shown to play an important role in semantic inference. To facilitate event detection using audio information, we have introduced the concept of audio keywords (e.g. excited/plain commentator speech, excited/plain audience sound, etc.) to describe the game-specific sounds associated with an event. In our previous work, we designed a hierarchical Support Vector Machine (SVM) classifier for audio keyword identification. However, it has two inherent weaknesses: 1) a frame-based SVM classifier does not incorporate any contextual information; 2) a robust recognizer relies on large amounts of training data for different sports game videos. In this demo, we present a flexible Hidden Markov Model (HMM)-based audio keyword generation system, motivated by the success of HMMs in speech recognition. Unlike frame-based SVM classification followed by majority voting, our HMM-based system treats an audio keyword as continuous time-series data and employs hidden state transitions to capture context. Moreover, our system introduces an adaptation mechanism that tunes the initial HMM models (obtained from the available training data) to improve performance using a small amount of data from a new sports game video. Promising results have been demonstrated on tennis, soccer and basketball videos with a total length of 2 hours.
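A minimal sketch of per-keyword HMM training and maximum-likelihood decision with hmmlearn, assuming feature sequences (e.g. MFCC frames) are already extracted; the number of hidden states is illustrative.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_keyword_models(sequences_by_keyword, n_states=4):
    """One HMM per audio keyword; each sequence is a (frames, dims) array."""
    models = {}
    for keyword, seqs in sequences_by_keyword.items():
        X, lengths = np.vstack(seqs), [len(s) for s in seqs]
        models[keyword] = GaussianHMM(n_components=n_states).fit(X, lengths)
    return models

def classify_segment(feats, models):
    """Label a feature sequence with the highest-likelihood keyword HMM."""
    return max(models, key=lambda kw: models[kw].score(feats))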
Duan, LY, Xu, M & Tian, Q 2003, 'Semantic shot classification in sports video', Proceedings of SPIE - The International Society for Optical Engineering, pp. 300-313.View/Download from: Publisher's site
In this paper, we present a unified framework for semantic shot classification in sports videos. Unlike previous approaches, which focus on clustering by aggregating shots with similar low-level features, the proposed scheme makes use of domain knowledge of a specific sport to perform top-down video shot classification, including identification of the video shot classes for each sport, and supervised learning and classification of the given sports video with low-level and middle-level features extracted from the sports video. It is observed that for each sport we can predefine a small number of semantic shot classes, about 5-10, which cover 90-95% of sports broadcast video. With the supervised learning method, we can map the low-level features to middle-level semantic video shot attributes such as dominant object motion (a player), camera motion patterns, and court shape. On the basis of an appropriate fusion of those middle-level shot attributes, we classify video shots into the predefined video shot classes, each of which has a clear semantic meaning. The proposed method has been tested over 4 types of sports videos: tennis, basketball, volleyball and soccer, and good classification accuracy of 85-95% has been achieved. With correctly classified sports video shots, further structural and temporal analysis, such as event detection, video skimming and table of contents, will be greatly facilitated.
Duan, LY, Xu, M, Chua, TS, Tian, Q & Xu, CS 2003, 'A mid-level representation framework for semantic sports video analysis', Proceedings of the ACM International Multimedia Conference and Exhibition, pp. 33-44.View/Download from: Publisher's site
Sports video has been widely studied due to its tremendous commercial potential. Despite encouraging results for various specific sports games, it is almost impossible to extend a system to a new sports game, because existing systems usually employ different sets of low-level features appropriate to the specific game, closely coupled with game-specific rules for detecting events or highlights; there is a lack of an internal representation and structure generic enough to apply to many different sports. In this paper, we present a generic mid-level representation framework for semantic sports video analysis. The mid-level representation layer is introduced between the low-level audio-visual processing and the high-level semantic analysis. It allows us to separate sport-specific knowledge and rules from the low-level and mid-level feature extraction, making sports video analysis more efficient, effective, and less ad hoc for various types of sports. To achieve robustness of the low-level feature analysis, a non-parametric clustering procedure, mean shift, has been successfully applied to both color and motion analysis. The proposed framework has been tested on five field-ball sports covering a duration of about 8 hours. Experiments have shown its robust performance in semantic analysis and event detection. We believe that the proposed mid-level representation framework can be used for event detection, highlight extraction, summarization and personalization of many types of sports video.
Duan, LY, Xu, M, Tian, Q & Xu, CS 2003, 'Nonparametric color characterization using mean shift', Proceedings of the ACM International Multimedia Conference and Exhibition, pp. 243-246.View/Download from: Publisher's site
Color is very useful for locating and recognizing objects in artificial environments. The color histogram has shown its efficiency and advantages as a general tool for various applications, such as content-based image retrieval and video browsing, object indexing and location, and video segmentation. However, due to the lack of spatial and context information, the histogram is not robust and effective for color characterization (e.g. dominant color) in large video databases. In this paper, we propose a nonparametric color characterization model using the mean shift procedure, with an emphasis on spatio-temporal consistency. Experimental results suggest that the color characterization model is much more effective for video indexing and browsing, particularly in the domain of structured video (e.g. sports video).
Xu, M, Duan, LY, Xu, C, Kankanhalli, M & Tian, Q 2003, 'Event detection in basketball video using multiple modalities', ICICS-PCM 2003 - Proceedings of the 2003 Joint Conference of the 4th International Conference on Information, Communications and Signal Processing and 4th Pacific-Rim Conference on Multimedia, pp. 1526-1530.View/Download from: Publisher's site
Semantic sports video analysis has attracted more and more attention recently. In this paper, we present a basketball event detection method using multiple modalities. Instead of using low-level features, the proposed method is built upon visual and auditory mid-level features, i.e. semantic shot classes and audio keywords. By heuristically mapping semantic shot classes and by aligning audio keywords with semantic shot classes, we are able to detect nine basketball events. Experimental results show that our proposed method is effective for basketball event detection, with promising detection results. © 2003 IEEE.
Xu, M, Duan, LY, Xu, CS & Tian, Q 2003, 'A fusion scheme of visual and auditory modalities for event detection in sports video', Proceedings - IEEE International Conference on Multimedia and Expo, pp. 333-336.View/Download from: Publisher's site
In this paper, we propose an effective fusion scheme of visual and auditory modalities to detect events in sports video. The proposed scheme is built upon semantic shot classification, where we classify video shots into several major or interesting classes, each of which has clear semantic meaning. Within the major shot classes we classify the different auditory signal segments (i.e. silence, hitting ball, applause, commentator speech) with the goal of detecting events with strong semantic meaning. For instance, for tennis video, we have identified five interesting events: serve, reserve, ace, return, and score. Since we have developed a unified framework for semantic shot classification in sports videos and a set of audio mid-level representations with supervised learning methods, the proposed fusion scheme can be easily adapted to a new sports game. We are extending this fusion scheme to three additional typical sports videos: basketball, volleyball and soccer. Correctly detected sports video events will greatly facilitate further structural and temporal analysis, such as sports video skimming and table of contents. © 2003 IEEE.
Xu, M, Duan, LY, Xu, CS & Tian, Q 2003, 'A fusion scheme of visual and auditory modalities for event detection in sports video', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 189-192.
In this paper, we propose an effective fusion scheme of visual and auditory modalities to detect events in sports video. The proposed scheme is built upon semantic shot classification, where we classify video shots into several major or interesting classes, each of which has clear semantic meaning. Within the major shot classes we classify the different auditory signal segments (i.e. silence, hitting ball, applause, commentator speech) with the goal of detecting events with strong semantic meaning. For instance, for tennis video, we have identified five interesting events: serve, reserve, ace, return, and score. Since we have developed a unified framework for semantic shot classification in sports videos and a set of audio mid-level representations with supervised learning methods, the proposed fusion scheme can be easily adapted to a new sports game. We are extending this fusion scheme to three additional typical sports videos: basketball, volleyball and soccer. Correctly detected sports video events will greatly facilitate further structural and temporal analysis, such as sports video skimming and table of contents.
Xu, M, Maddage, NC, Xu, C, Kankanhalli, M & Tian, Q 2003, 'Creating audio keywords for event detection in soccer video', Proceedings - IEEE International Conference on Multimedia and Expo, pp. 281-283.View/Download from: Publisher's site
This paper presents a novel framework called audio keywords to assist event detection in soccer video. An audio keyword is a middle-level representation that can bridge the gap between low-level features and high-level semantics. Audio keywords are created from low-level audio features using support vector machine learning, and can be used to detect semantic events in soccer video by applying a heuristic mapping. Experiments on audio keyword creation and on event detection based on audio keywords have shown promising results. According to the experimental results, we believe that the audio keyword is an effective representation that achieves more intuitive results for event detection in sports video than event detection based directly on low-level features. © 2003 IEEE.
Duan, LY, Xu, M, Yu, XD & Tian, Q 2002, 'A unified framework for semantic shot classification in sports videos', Proceedings of the ACM International Multimedia Conference and Exhibition, pp. 419-420.
In this demonstration, we present a unified framework for semantic shot classification in sports videos. Unlike previous approaches, which focus on clustering by aggregating shots with similar low-level features, the proposed scheme makes use of domain knowledge of a specific sport to perform top-down video shot classification, including identification of video shot classes for each sport, and supervised learning and classification of a given sports video with low-level and mid-level features extracted from it. We observe that for each sport we can predefine a small number of semantic shot classes (5-10) that cover 90-95% of broadcast sports video. With supervised learning methods, we map the low-level features to mid-level semantic video shot attributes such as dominant object motion (a player), camera motion patterns, and court shape. By appropriately fusing these mid-level shot attributes, we classify video shots into the predefined video shot classes, each of which has a clear semantic meaning. The proposed method has been tested on three types of sports video: tennis, basketball, and soccer. Good classification results, ranging from 80% to 95%, have been achieved. The proposed framework provides a generic solution for semantic shot classification in sports video that can easily be adapted to a new sport type, and correctly classified shots greatly facilitate further structural and temporal analysis.
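The two-stage structure (low-level features to mid-level attributes to shot class) can be sketched as follows; the attribute classifiers, features and fusion rule are stand-ins trained on random placeholder data, shown purely to convey the shape of the pipeline rather than the paper's method.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))                  # placeholder low-level features

# Stage 1: supervised mapping to mid-level attributes (labels are placeholders).
motion_clf = DecisionTreeClassifier().fit(X, rng.integers(0, 3, 300))  # still/pan/zoom
court_clf = DecisionTreeClassifier().fit(X, rng.integers(0, 2, 300))   # court visible?

def classify_shot(features):
    """Stage 2: toy fusion of mid-level attributes into a semantic shot class."""
    motion = motion_clf.predict(features)[0]
    court = court_clf.predict(features)[0]
    if court == 1:
        return "court-view" if motion == 0 else "play"
    return "close-up"

print(classify_shot(rng.normal(size=(1, 12))))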
Duan, LY, Yu, XD, Xu, M & Tian, Q 2002, 'Foreground segmentation using motion vectors in sports video', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 751-758.
In this paper, we present an effective algorithm for foreground object segmentation in sports video. The algorithm consists of three steps: low-level feature extraction, camera motion estimation, and foreground object extraction. We apply a robust M-estimator to the motion vector fields to estimate global camera motion parameters under a four-parameter camera motion model, followed by outlier analysis using the robust weights, rather than the residuals, to extract foreground objects. Based on the fact that foreground objects' motion patterns are independent of the global motion induced by camera operations such as pan, tilt, and zoom, we treat as foreground those macroblocks that appear as outliers during the robust regression procedure. Experiments showed that the proposed algorithm can robustly extract foreground objects such as tennis players and estimate camera motion parameters. These results greatly facilitate high-level semantic video indexing such as event detection and sports video structure analysis. Furthermore, operating on compressed-domain features yields great savings in computation. © Springer-Verlag Berlin Heidelberg 2002.
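A minimal numerical sketch of the robust-regression step, assuming a four-parameter similarity model u = a*x - b*y + tx, v = b*x + a*y + ty and Huber weights inside iteratively reweighted least squares (the paper's exact estimator and weight function may differ). Blocks whose final weights are small are the outliers, i.e. foreground candidates.

import numpy as np

def estimate_camera_motion(xy, uv, iters=10, c=1.345):
    """xy: (N, 2) macroblock centres; uv: (N, 2) motion vectors.
    Returns (a, b, tx, ty) and a per-block robust weight in [0, 1]."""
    x, y = xy[:, 0], xy[:, 1]
    one, zero = np.ones_like(x), np.zeros_like(x)
    # Stack the u- and v-equations into one linear system A @ theta = d.
    A = np.concatenate([np.stack([x, -y, one, zero], axis=1),
                        np.stack([y, x, zero, one], axis=1)])
    d = np.concatenate([uv[:, 0], uv[:, 1]])
    w = np.ones(len(d))
    for _ in range(iters):
        theta, *_ = np.linalg.lstsq(A * w[:, None], d * w, rcond=None)
        r = d - A @ theta                                 # residuals
        s = np.median(np.abs(r)) / 0.6745 + 1e-9          # robust scale (MAD)
        w = np.minimum(1.0, c / (np.abs(r) / s + 1e-12))  # Huber weights
    block_w = np.minimum(w[:len(xy)], w[len(xy):])        # weight per macroblock
    return theta, block_w

# Macroblocks whose block_w falls below a threshold (e.g. 0.5) are
# foreground candidates; the rest are consistent with the camera motion.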
In this paper, we propose a new deep network that learns multi-level deep representations for image emotion classification (MldrNet). Image emotion can be recognized through image semantics, image aesthetics, and low-level visual features, from both global and local views. Existing image emotion classification works using hand-crafted or deep features mainly focus on either low-level visual features or semantic-level image representations without taking all factors into consideration. Our proposed MldrNet unifies deep representations of three levels, i.e. image semantics, image aesthetics, and low-level visual features, through multiple instance learning (MIL) in order to cope effectively with noisily labeled data, such as images collected from the Internet. Extensive experiments on both Internet images and abstract paintings demonstrate that the proposed method outperforms state-of-the-art methods using deep or hand-crafted features, with at least a 6% improvement in overall classification accuracy.
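For orientation only, here is a toy PyTorch sketch in the spirit of a multi-level network: three convolutional branches of different depths stand in for the low-level, aesthetic and semantic representations, and their logits are averaged in place of the paper's MIL fusion. The branch designs and the eight-class output are assumptions, not MldrNet's actual architecture.

import torch
import torch.nn as nn

def branch(depth, num_classes=8):
    """One convolutional branch; deeper branches stand in for higher-level features."""
    layers, ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(ch, 32, 3, stride=2, padding=1), nn.ReLU()]
        ch = 32
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)]
    return nn.Sequential(*layers)

class MultiLevelNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shallow -> low-level, medium -> aesthetic, deep -> semantic (assumed depths).
        self.low, self.aes, self.sem = branch(2), branch(4), branch(6)

    def forward(self, x):
        logits = torch.stack([self.low(x), self.aes(x), self.sem(x)])
        return logits.mean(dim=0)   # simple averaging as a stand-in for MIL fusion

print(MultiLevelNet()(torch.randn(1, 3, 224, 224)).shape)  # -> torch.Size([1, 8])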