Dr. Ling Chen is an Associate Professor with the UTS Priority Research Centre for Artificial Intelligence (CAI) and the Faculty of Engineering and Information Technology (FEIT) in the University of Technology Sydney (UTS). She has been the Director of the Data Science & Knowledge Discovery Laboratory (The DSKD Lab) within CAI since 2014. She is currently the Deputy Head of School (Research) of School of Computer Science in FEIT. Before joining UTS, she was a Research Fellow with L3S Research Center, Leibniz University Hannover, Germany. Dr. Chen received her Ph.D. in Computer Engineering from Nanyang Technological University Singapore.
Dr. Chen's main research interests focus on data mining and machine learning. She has worked on fundamental data mining tasks such as (regular and irregular) pattern mining from structured data and uncertain data, hashing and embedding learning for structured data. Her recent work also includes knowledge discovery from social networks and social media. She has developed novel and effective algorithms for event detection from social media and recommendation in social networks.
Dr. Chen has published more than 80 papers in major international data mining conferences including SIGKDD, IJCAI, ICDM, AAAI, SIGMOD, VLDB, ICDE, SDM and CIKM, and journals like IEEE TNNLS, IEEE TKDE, IEEE Trans Cybern, and ACM TOIS. Dr. Chen has served as a PC member for many conferences such as SIGKDD, AAAI, and IJCAI. She was the Conference Chair of AusDM 2015.
Can supervise: YES
- Data Analytics
- Machine Learning
- Outlier/Anomaly Detection
- Graph and network representation, hashing and embedding
- Social media (text) mining
- Recommendation and information retrieval
© 2018, Springer-Verlag London Ltd., part of Springer Nature. Spectral clustering is one of the most popular clustering methods and is particularly useful for pattern recognition and image analysis. When using spectral clustering for analysis, users are either required to implement their own platforms, which requires strong data analytics and machine learning skills, or allow a third party to access and analyze their data, which may compromise their data privacy or security. Traditionally, this problem is solved by privacy-preserving data mining using randomization perturbation or secure multi-party computation. However, the existing methods suffer from the problems of inaccurate results or high computational requirements on the data owner's side. To address these problems, in this paper, we propose a new secure outsourcing data mining (SODM) paradigm, which allows data owners to encrypt their data to ensure maximum data security. After the encryption, data owners can outsource their encrypted data to data analytics service providers (i.e., data analytics agent) for knowledge discovery, with a guarantee that neither the data analytics agent nor the other parties can compromise data privacy. To allow data mining to be efficiently carried out on encrypted data, we design a secure KD-tree to index all the encrypted data. Based on the SODM framework, a secure spectral clustering algorithm is proposed. The experiments on real-world datasets demonstrate the effectiveness and the efficiency of the system for the secure outsourcing of data mining.
© 1992-2012 IEEE. Spectral clustering plays a significant role in applications that rely on multi-view data due to its well-defined mathematical framework and excellent performance on arbitrarily-shaped clusters. Unfortunately, directly optimizing the spectral clustering inevitably results in an NP-hard problem due to the discrete constraints on the clustering labels. Hence, conventional approaches intuitively include a relax-and-discretize strategy to approximate the original solution. However, there are no principles in this strategy that prevent the possibility of information loss between each stage of the process. This uncertainty is aggravated when a procedure of heterogeneous features fusion has to be included in multi-view spectral clustering. In this paper, we avoid an NP-hard optimization problem and develop a general framework for multi-view discrete graph clustering by directly learning a consensus partition across multiple views, instead of using the relax-and-discretize strategy. An effective re-weighting optimization algorithm is exploited to solve the proposed challenging problem. Further, we provide a theoretical analysis of the model's convergence properties and computational complexity for the proposed algorithm. Extensive experiments on several benchmark datasets verify the effectiveness and superiority of the proposed algorithm on clustering and image segmentation tasks.
IEEE Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the community. Due to its constant-time complexity independent of the values of the weights, Improved CWS (ICWS) is considered as the state-of-the-art CWS algorithm. In this paper, we revisit ICWS and analyze its underlying mechanism to show that there actually exists dependence between the two components of the hash-code produced by ICWS, which violates the condition of independence. To remedy the problem, we propose an Improved ICWS (I2CWS) algorithm which not only shares the same theoretical computational complexity as ICWS but also abides by the required conditions of the CWS scheme. The experimental results on a number of synthetic data sets and real-world text data sets demonstrate that our I2CWS algorithm can estimate the Jaccard similarity more accurately, and also competes with or outperforms the compared methods, including ICWS, in classification and top-K retrieval, after relieving the underlying dependence.
Wen, D, Qin, L, Zhang, Y, Chang, L & Chen, L 2019, 'Enumerating k-vertex connected components in large graphs', Proceedings - International Conference on Data Engineering, vol. 2019-April, pp. 52-63.View/Download from: Publisher's site
© 2019 IEEE. In social network analysis, structural cohesion (or vertex connectivity) is a fundamental metric in measuring the cohesion of social groups. Given an undirected graph, a k-vertex connected component (k-VCC) is a maximal connected subgraph whose structural cohesion is at least k. A k-VCC has many outstanding structural properties, such as high cohesiveness, high robustness, and subgraph overlapping. In this paper, given a graph G and an integer k, we study the problem of computing all k-VCCs in G. The general idea for this problem is to recursively partition the graph into overlapped subgraphs. We prove the upper bound of the number of partitions, which implies the polynomial running time algorithm for the k-VCC enumeration. However, the basic solution is costly in computing the vertex cut. To improve the algorithmic efficiency, we observe that the key is reducing the number of local connectivity testings. We propose two effective optimization strategies, namely neighbor sweep and group sweep, to significantly reduce the number of local connectivity testings. We conduct extensive performance studies using ten large real datasets to demonstrate the efficiency of our proposed algorithms. The experimental results demonstrate that our approach can achieve a speedup of up to two orders of magnitude compared to the state-of-the-art algorithm.
Han, B, Tsang, IW, Chen, L, Zhou, JT & Yu, CP 2019, 'Beyond Majority Voting: A Coarse-to-Fine Label Filtration for Heavily Noisy Labels.', IEEE transactions on neural networks and learning systems.View/Download from: Publisher's site
Crowdsourcing has become the most appealing way to provide a plethora of labels at a low cost. Nevertheless, labels from amateur workers are often noisy, which inevitably degenerates the robustness of subsequent learning models. To improve the label quality for subsequent use, majority voting (MV) is widely leveraged to aggregate crowdsourced labels due to its simplicity and scalability. However, when crowdsourced labels are ``heavily'' noisy (e.g., 40% of noisy labels), MV may not work well because of the fact ``garbage (heavily noisy labels) in, garbage (full aggregated labels) out.'' This issue inspires us to think: if the ultimate target is to learn a robust model using noisy labels, why not provide partial aggregated labels and ensure that these labels are reliable enough for learning models? To solve this challenge by improving MV, we propose a coarse-to-fine label filtration model called double filter machine (DFM), which consists of a (majority) voting filter and a sparse filter serially. Specifically, the DFM refines crowdsourced labels from coarse filtering to fine filtering. In the stage of coarse filtering, the DFM aggregates crowdsourced labels by voting filter, which yields (quality-acceptable) full aggregated labels. In the stage of fine filtering, DFM further digs out a set of high-quality labels from full aggregated labels by sparse filter, since this filter can identify high-quality labels by the methodology of support selection. Based on the insight of compressed sensing, DFM recovers a ground-truth signal from heavily noisy data under a restricted isometry property. To sum up, the primary benefits of DFM are to keep the scalability by voting filter, while improve the robustness by sparse filter. We also derive theoretical guarantees for the convergence and recovery of DFM and reveal its complexity. We conduct comprehensive experiments on both the UCI simulated and the AMT crowdsourced datasets. Empirical results show that partial aggregated la...
Du, X, Yin, H, Chen, L, Wang, Y, Yang, Y & Zhou, X 2019, 'Personalized Video Recommendation Using Rich Contents from Videos', IEEE Transactions on Knowledge and Data Engineering.View/Download from: UTS OPUS or Publisher's site
IEEE Video recommendation has become an essential way of helping people explore the massive videos and discover the ones that may be of interest to them. In the existing video recommender systems, the models make the recommendations based on the user-video interactions and single specific content features. When the specific content features are unavailable, the performance of the existing models will seriously deteriorate. Inspired by the fact that rich contents (e.g., text, audio, motion, and so on) exist in videos, in this paper, we explore how to use these rich contents to overcome the limitations caused by the unavailability of the specific ones. Specifically, we propose a novel general framework that incorporates arbitrary single content feature with user-video interactions, named as collaborative embedding regression (CER) model, to make effective video recommendation in both in-matrix and out-of-matrix scenarios. Our extensive experiments on two real-world large-scale datasets show that CER beats the existing recommender models with any single content feature and is more time efficient. In addition, we propose a priority-based late fusion (PRI) method to gain the benefit brought by the integrating the multiple content features. The corresponding experiment shows that PRI brings real performance improvement to the baseline and outperforms the existing fusion methods.
Zhan, K, Chang, X, Guan, J, Chen, L, Ma, Z & Yang, Y 2019, 'Adaptive Structure Discovery for Multimedia Analysis Using Multiple Features.', IEEE transactions on cybernetics.View/Download from: UTS OPUS or Publisher's site
Multifeature learning has been a fundamental research problem in multimedia analysis. Most existing multifeature learning methods exploit graph, which must be computed beforehand, as input to uncover data distribution. These methods have two major problems confronted. First, graph construction requires calculating similarity based on nearby data pairs by a fixed function, e.g., the RBF kernel, but the intrinsic correlation among different data pairs varies constantly. Therefore, feature learning based on such predefined graphs may degrade, especially when there is dramatic correlation variation between nearby data pairs. Second, in most existing algorithms, each single-feature graph is computed independently and then combine them for learning, which ignores the correlation between multiple features. In this paper, a new unsupervised multifeature learning method is proposed to make the best utilization of the correlation among different features by jointly optimizing data correlation from multiple features in an adaptive way. As opposed to computing the affinity weight of data pairs by a fixed function, the weight of affinity graph is learned by a well-designed optimization problem. Additionally, the affinity graph of data pairs from different features is optimized in a global level to better leverage the correlation among different channels. In this way, the adaptive approach correlates the features of all features for a better learning process. Experimental results on real-world datasets demonstrate that our approach outperforms the state-of-the-art algorithms on leveraging multiple features for multimedia analysis.
Wang, Y, Feng, C, Chen, L, Yin, H, Guo, C & Chu, Y 2018, 'User identity linkage across social networks via linked heterogeneous network embedding', World Wide Web, vol. 2018, pp. 1-22.View/Download from: UTS OPUS or Publisher's site
© 2018 Springer Science+Business Media, LLC, part of Springer Nature User identity linkage has important implications in many cross-network applications, such as user profile modeling, recommendation and link prediction across social networks. To discover accurate cross-network user correspondences, it is a critical prerequisite to find effective user representations. While structural and content information describe users from different perspectives, there is a correlation between the two aspects of information. For example, a user who follows a celebrity tends to post content about the celebrity as well. Therefore, the projections of structural and content information of a user should be as close to each other as possible, which inspires us to fuse the two aspects of information in a unified space. However, owing to the information heterogeneity, most existing methods extract features from content and structural information respectively, instead of describing them in a unified way. In this paper, we propose a Linked Heterogeneous Network Embedding model (LHNE) to learn the comprehensive representations of users by collectively leveraging structural and content information in a unified framework. We first model the topics of user interests from content information to filter out noise. Next, cross-network structural and content information are embedded into a unified space by jointly capturing the friend-based and interest-based user co-occurrence in intra-network and inter-network, respectively. Meanwhile, LHNE learns user transfer and topic transfer for enhancing information exchange across networks. Empirical results show LHNE outperforms the state-of-the-art methods on both real social network and synthetic datasets and can work well even with little or no structural information.
Wu, W, Li, B, Chen, L, Zhu, X & Zhang, C 2018, 'K-Ary Tree Hashing for Fast Graph Classification', IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 5, pp. 936-949.View/Download from: UTS OPUS or Publisher's site
IEEE Existing graph classification usually relies on an exhaustive enumeration of substructure patterns, where the number of substructures expands exponentially w.r.t. with the size of the graph set. Recently, the Weisfeiler-Lehman (WL) graph kernel has achieved the best performance in terms of both accuracy and efficiency among state-of-the-art methods. However, it is still time-consuming, especially for large-scale graph classification tasks. In this paper, we present a < formula > < tex > $k$ < /tex > < /formula > -Ary Tree based Hashing (KATH) algorithm, which is able to obtain competitive accuracy with a very fast runtime. The main idea of KATH is to construct a traversal table to quickly approximate the subtree patterns in WL using < formula > < tex > $k$ < /tex > < /formula > -ary trees. Based on the traversal table, KATH employs a recursive indexing process that performs only < formula > < tex > $r$ < /tex > < /formula > times of matrix indexing to generate all < formula > < tex > $(r-1)$ < /tex > < /formula > -depth < formula > < tex > $k$ < /tex > < /formula > -ary trees, where the leaf node labels of a tree can uniquely specify the pattern. After that, the MinHash scheme is used to fingerprint the acquired subtree patterns for a graph. Our experimental results on both real world and synthetic datasets show that KATH runs significantly faster than state-of-the-art methods while achieving competitive or better accuracy.
Bo Han, Tsang, IW, Ling Chen, Yu, CP & Sai-Fu Fung 2018, 'Progressive Stochastic Learning for Noisy Labels.', IEEE transactions on neural networks and learning systems, vol. 29, no. 10, pp. 5136-5148.View/Download from: UTS OPUS or Publisher's site
Large-scale learning problems require a plethora of labels that can be efficiently collected from crowdsourcing services at low cost. However, labels annotated by crowdsourced workers are often noisy, which inevitably degrades the performance of large-scale optimizations including the prevalent stochastic gradient descent (SGD). Specifically, these noisy labels adversely affect updates of the primal variable in conventional SGD. To solve this challenge, we propose a robust SGD mechanism called progressive stochastic learning (POSTAL), which naturally integrates the learning regime of curriculum learning (CL) with the update process of vanilla SGD. Our inspiration comes from the progressive learning process of CL, namely learning from "easy" tasks to "complex" tasks. Through the robust learning process of CL, POSTAL aims to yield robust updates of the primal variable on an ordered label sequence, namely, from "reliable" labels to "noisy" labels. To realize POSTAL mechanism, we design a cluster of "screening losses," which sorts all labels from the reliable region to the noisy region. To sum up, POSTAL using screening losses ensures robust updates of the primal variable on reliable labels first, then on noisy labels incrementally until convergence. In theory, we derive the convergence rate of POSTAL realized by screening losses. Meanwhile, we provide the robustness analysis of representative screening losses. Experimental results on UCI1 simulated and Amazon Mechanical Turk crowdsourcing data sets show that the POSTAL using screening losses is more effective and robust than several existing baselines.1UCI is the abbreviation of University of California Irvine.
Chi, L, Li, B, Zhu, X, Pan, S & Chen, L 2018, 'Hashing for Adaptive Real-Time Graph Stream Classification With Concept Drifts.', IEEE Transactions on Cybernetics, vol. 48, no. 5, pp. 1591-1604.View/Download from: UTS OPUS or Publisher's site
Many applications involve processing networked streaming data in a timely manner. Graph stream classification aims to learn a classification model from a stream of graphs with only one-pass of data, requiring real-time processing in training and prediction. This is a nontrivial task, as many existing methods require multipass of the graph stream to extract subgraph structures as features for graph classification which does not simultaneously satisfy "one-pass" and "real-time" requirements. In this paper, we propose an adaptive real-time graph stream classification method to address this challenge. We partition the unbounded graph stream data into consecutive graph chunks, each consisting of a fixed number of graphs and delivering a corresponding chunk-level classifier. We employ a random hashing function to compress the original node set of graphs in each chunk for fast feature detection when training chunk-level classifiers. Furthermore, a differential hashing strategy is applied to map unlimited increasing features (i.e., cliques) into a fixed-size feature space which is then used as a feature vector for stochastic learning. Finally, the chunk-level classifiers are weighted in an ensemble learning model for graph classification. The proposed method substantially speeds up the graph feature extraction and avoids unbounded graph feature growth. Moreover, it effectively offsets concept drifts in graph stream classification. Experiments on real-world and synthetic graph streams demonstrate that our method significantly outperforms existing methods in both classification accuracy and learning efficiency.
Wang, W, Yin, H, Chen, L, Sun, Y, Sadiq, S & Zhou, X 2017, 'ST-SAGE: A spatial-Temporal sparse additive generative model for spatial item recommendation', ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 3.View/Download from: UTS OPUS or Publisher's site
©c 2017 ACM 2157-6904/2017/04-ART48 15.00. With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important mobile application, especially when users travel away from home. However, this type of recommendation is very challenging compared to traditional recommender systems. A user may visit only a limited number of spatial items, leading to a very sparse user-item matrix. This matrix becomes even sparser when the user travels to a distant place, as most of the items visited by a user are usually located within a short distance from the user's home. Moreover, user interests and behavior patterns may vary dramatically across different time and geographical regions. In light of this, we propose ST-SAGE, a spatial-Temporal sparse additive generative model for spatial item recommendation in this article. ST-SAGE considers both personal interests of the users and the preferences of the crowd in the target region at the given time by exploiting both the co-occurrence patterns and content of spatial items. To further alleviate the data-sparsity issue, ST-SAGE exploits the geographical correlation by smoothing the crowd's preferences over a well-designed spatial index structure called the spatial pyramid. To speed up the tra ining process of ST-SAGE, we implement a parallel version of themodel inference algorithm on the GraphLab framework.We conduct extensive experiments; the experimental results clearly demonstrate that ST-SAGE outperforms the state-of-The-Art recommender systems in terms of recommendation effectiveness, model training efficiency, and online recommendation efficiency.
Yin, H, Wang, W, Wang, H, Chen, L & Zhou, X 2017, 'Spatial-Aware Hierarchical Collaborative Deep Learning for POI Recommendation', IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 11, pp. 2537-2551.View/Download from: UTS OPUS or Publisher's site
© 2017 IEEE. Point-of-interest (POI) recommendation has become an important way to help people discover attractive and interesting places, especially when they travel out of town. However, the extreme sparsity of user-POI matrix and cold-start issues severely hinder the performance of collaborative filtering-based methods. Moreover, user preferences may vary dramatically with respect to the geographical regions due to different urban compositions and cultures. To address these challenges, we stand on recent advances in deep learning and propose a Spatial-Aware Hierarchical Collaborative Deep Learning model (SH-CDL). The model jointly performs deep representation learning for POIs from heterogeneous features and hierarchically additive representation learning for spatial-aware personal preferences. To combat data sparsity in spatial-aware user preference modeling, both the collective preferences of the public in a given target region and the personal preferences of the user in adjacent regions are exploited in the form of social regularization and spatial smoothing. To deal with the multimodal heterogeneous features of the POIs, we introduce a late feature fusion strategy into our SH-CDL model. The extensive experimental analysis shows that our proposed model outperforms the state-of-the-art recommendation models, especially in out-of-town and cold-start recommendation scenarios.
Hou, S, Chen, L, Tao, D, Zhou, S, Liu, W & Zheng, Y 2017, 'Multi-layer multi-view topic model for classifying advertising video', Pattern Recognition, vol. 68, pp. 66-81.View/Download from: UTS OPUS or Publisher's site
© 2017 Elsevier Ltd The recent proliferation of advertising (ad) videos has driven the research in multiple applications, ranging from video analysis to video indexing and retrieval. Among them, classifying ad video is a key task because it allows automatic organization of videos according to categories or genres, and this further enables ad video indexing and retrieval. However, classifying ad video is challenging compared to other types of video classification because of its unconstrained content. While many studies focus on embedding ads relevant to videos, to our knowledge, few focus on ad video classification. In order to classify ad video, this paper proposes a novel ad video representation that aims to sufficiently capture the latent semantics of video content from multiple views in an unsupervised manner. In particular, we represent ad videos from four views, including bag-of-feature (BOF), vector of locally aggregated descriptors (VLAD), fisher vector (FV) and object bank (OB). We then devise a multi-layer multi-view topic model, mlmv_LDA, which models the topics of videos from different views. A topical representation for video, supporting category-related task, is finally achieved by the proposed method. Our empirical classification results on 10,111 real-world ad videos demonstrate that the proposed approach effectively differentiate ad videos.
Wang, H, Zhang, P, Zhu, X, Tsang, IWH, Chen, L, Zhang, C & Wu, X 2017, 'Incremental Subgraph Feature Selection for Graph Classification', IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 128-142.View/Download from: Publisher's site
Wang, H, Wu, J, Pan, S, Zhang, P & Chen, L 2017, 'Towards large-scale social networks with online diffusion provenance detection', Computer Networks, vol. 114, pp. 154-166.View/Download from: UTS OPUS or Publisher's site
© 2016 Elsevier B.V.In this paper we study a new problem of online discovering diffusion provenances in large networks. Existing work on network diffusion provenance identification focuses on offline learning where data collected from network detectors are static and a snapshot of the network is available before learning. However, an offline learning model does not meet the need for early warning, real-time awareness, or a real-time response to malicious information spreading in networks. To this end, we propose an online regression model for real-time diffusion provenance identification. Specifically, we first use offline collected network cascades to infer the edge transmission weights, and then use an online l 1 non-convex regression model as the identification model. The proposed methods are empirically evaluated on both synthetic and real-world networks. Experimental results demonstrate the effectiveness of the proposed model.
Hou, S, Zhou, S, Chen, L, Feng, Y & Awudu, K 2016, 'Multi-label learning with label relevance in advertising video', Neurocomputing, vol. 171, pp. 932-948.View/Download from: UTS OPUS or Publisher's site
The recent proliferation of videos has brought out the need for applications such as automatic annotation and organization. These applications could greatly benefit from the respective thematic content depending on the type of video. Unlike the other kinds of video, an advertising video usually conveys a specific theme in a certain time period (e.g. drawing the audience׳s attention to a product or emphasizing the brand). Traditional multi-label algorithms may not work effectively with advertising videos due mainly to their heterogeneous nature. In this paper, we propose a new learning paradigm to resolve the problems arising out of traditional multi-label learning in advertising videos through label relevance. Aiming to address the issue of label relevance, we firstly assign each label with label degree (LD) to classify all the labels into three groups such as first label (FL), important label (IL) and common label (CL), and then propose a Directed Probability Label Graph (DPLG) model to mine the most related labels from the multi-label data with label relevance, in which the interdependency between labels is considered. In the implementation of DPLG, the labels that appear occasionally and possess inconspicuous co-occurrences are consequently eliminated effectively, employing λ-filtering and τ-pruning processes, respectively. And then the graph theory is utilized in DPLG to acquire Correlative Label-Sets (CLSs). Lastly, the searched Correlative Label-Sets (CLSs) are utilized to enhance multi-label annotation. Experimental results on advertising videos and several publicly available datasets demonstrate the effectiveness of the proposed method for multi-label annotation with label relevance
© 2016. Extreme Learning Machine (ELM) is a promising model for training single-hidden layer feedforward networks (SLFNs) and has been widely used for classification. However, ELM faces the challenge of arbitrarily selected parameters, e.g., the network weights and hidden biases. Therefore, many efforts have been made to enhance the performance of ELM, such as using evolutionary algorithms to explore promising areas of the solution space. Although evolutionary algorithms can explore promising areas of the solution space, they are not able to locate global optimum efficiently. In this paper, we present a new Memetic Algorithm (MA)-based Extreme Learning Machine (M-ELM for short). M-ELM embeds the local search strategy into the global optimization framework to obtain optimal network parameters. Experiments and comparisons on 46 UCI data sets validate the performance of M-ELM. The corresponding results demonstrate that M-ELM significantly outperforms state-of-the-art ELM algorithms.
Yin, H, Cui, B, Chen, L, Hu, Z & Zhang, C 2015, 'Modeling Location-Based User Rating Profiles for Personalized Recommendation', ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, vol. 9, no. 3.View/Download from: Publisher's site
Yin, H, Cui, B, Chen, L, Hu, Z & Zhou, X 2015, 'Dynamic User Modeling in Social Media Systems', ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 33, no. 3.View/Download from: UTS OPUS or Publisher's site
Newly emerging location-based and event-based social network services provide us with a new platform to understand users' preferences based on their activity history. A user can only visit a limited number of venues/events and most of them are within a limited distance range, so the user-item matrix is very sparse, which creates a big challenge to the traditional collaborative filtering-based recommender systems. The problem becomes even more challenging when people travel to a new city where they have no activity information.
In this article, we propose LCARS, a location-content-aware recommender system that offers a particular user a set of venues (e.g., restaurants and shopping malls) or events (e.g., concerts and exhibitions) by giving consideration to both personal interest and local preference. This recommender system can facilitate people's travel not only near the area in which they live, but also in a city that is new to them. Specifically, LCARS consists of two components: offline modeling and online recommendation. The offline modeling part, called LCA-LDA, is designed to learn the interest of each individual user and the local preference of each individual city by capturing item cooccurrence patterns and exploiting item contents. The online recommendation part takes a querying user along with a querying city as input, and automatically combines the learned interest of the querying user and the local preference of the querying city to produce the top-k recommendations. To speed up the online process, a scalable query processing technique is developed by extending both the Threshold Algorithm (TA) and TA-approximation algorithm. We evaluate the performance of our recommender system on two real datasets, that is, DoubanEvent and Foursquare, and one large-scale synthetic dataset. The results show the superiority of LCARS in recommending spatial items for users, especially when traveling to new cities, in terms of both effectiveness and efficiency. Beside...
Li, B, Chen, L, Zhu, X & Zhang, C 2013, 'Noisy but Non-malicious User Detection in Social Recommender Systems', World Wide Web, vol. 16, no. 5-6, pp. 677-699.View/Download from: UTS OPUS or Publisher's site
Social recommender systems largely rely on user-contributed data to infer users' preference. While this feature has enabled many interesting applications in social networking services, it also introduces unreliability to recommenders as users are allowed to insert data freely. Although detecting malicious attacks from social spammers has been studied for years, little work was done for detecting Noisy but Non-Malicious Users (NNMUs), which refers to those genuine users who may provide some untruthful data due to their imperfect behaviors. Unlike colluded malicious attacks that can be detected by finding similarly-behaved user profiles, NNMUs are more difficult to identify since their profiles are neither similar nor correlated from one another. In this article, we study how to detect NNMUs in social recommender systems. Based on the assumption that the ratings provided by a same user on closely correlated items should have similar scores, we propose an effective method for NNMU detection by capturing and accumulating userâs âself-contradictionsâ, i.e., the cases that a user provides very different rating scores on closely correlated items. We show that self-contradiction capturing can be formulated as a constrained quadratic optimization problem w.r.t. a set of slack variables, which can be further used to quantify the underlying noise in each test user profile. We adopt three real-world data sets to empirically test the proposed method. The experimental results show that our method (i) is effective in real-world NNMU detection scenarios, (ii) can significantly outperform other noisy-user detection methods, and (iii) can improve recommendation performance for other users after removing detected NNMUs from the recommender system.
Song, K, Feng, S, Gao, W, Wang, D, Chen, L & Zhang, C 2018, 'Build emotion lexicon from microblogs by combining effects of seed words and emoticons in a heterogeneous graph' in Social Media Content Analysis: Natural Language Processing and Beyond, World Scientific, Singapore, pp. 171-195.View/Download from: UTS OPUS or Publisher's site
© 2018 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. As an indispensable resource for emotion analysis, emotion lexicons have attracted increasing attention in recent years. Most existing methods focus on capturing the single emotional effect of words rather than the emotion distributions which are helpful to model multiple complex emotions in a subjective text. Meanwhile, automatic lexicon building methods are overly dependent on seed words but neglect the effect of emoticons which are natural graphical labels of fine-grained emotion. In this chapter, we describe a novel emotion lexicon building framework that leverages both seed words and emoticons simultaneously to capture emotion distributions of candidate words more accurately. Our method overcomes the weakness of existing methods by combining the effects of both seed words and emoticons in a unified three-layer heterogeneous graph, in which a multi-label random walk (MLRW) algorithm is performed to strengthen the emotion distribution estimation. Experimental results on real-world data reveal that our constructed emotion lexicon achieves promising results for emotion classification compared to the state-of-the-art lexicons.
Liu, W, Chang, X, Chen, L & Yang, Y 2018, 'Semi-supervised Joint Learning of Representation and Relation for Person Re-identification', AAAI Conference on Artificial Intelligence, Louisiana, USA.
Wu, W, Li, B, Chen, L & Zhang, C 2018, 'Efficient attributed network embedding via recursive randomized hashing', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 2861-2867.View/Download from: UTS OPUS or Publisher's site
© 2018 International Joint Conferences on Artificial Intelligence. All right reserved. Attributed network embedding aims to learn a low-dimensional representation for each node of a network, considering both attributes and structure information of the node. However, the learning based methods usually involve substantial cost in time, which makes them impractical without the help of a powerful workhorse. In this paper, we propose a simple yet effective algorithm, named NetHash, to solve this problem only with moderate computing capacity. NetHash employs the randomized hashing technique to encode shallow trees, each of which is rooted at a node of the network. The main idea is to efficiently encode both attributes and structure information of each node by recursively sketching the corresponding rooted tree from bottom (i.e., the predefined highest-order neighboring nodes) to top (i.e., the root node), and particularly, to preserve as much information closer to the root node as possible. Our extensive experimental results show that the proposed algorithm, which does not need learning, runs significantly faster than the state-of-the-art learning-based network embedding methods while achieving competitive or even better performance in accuracy.
Pang, G, Cao, L, Chen, L, Lian, D & Liu, H 2018, 'Sparse Modeling-based Sequential Ensemble Learning for Effective Outlier Detection in High-dimensional Numeric Data.', The Thirty-Second AAAI Conference on Artificial Intelligence, AAAI Conference on Artificial Intelligence, AAAI, New Orleans, USA, pp. 3892-3899.View/Download from: UTS OPUS
The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform the two dependent tasks: relevant feature subset search and outlier scoring independently, consequently retaining features/subspaces irrelevant to the scoring method and downgrading the detection performance. This paper introduces a novel sequential ensemble-based framework SEMSE and its instance CINFO to address this issue. SEMSE learns the sequential ensembles to mutually refine feature selection and outlier scoring by iterative sparse modeling with outlier scores as the pseudo target feature. CINFO instantiates SEMSE by using three successive recurrent components to build such sequential ensembles. Given outlier scores output by an existing outlier scoring method on a feature subset, CINFO first defines a Cantelli's inequality-based outlier thresholding function to select outlier candidates with a false positive upper bound. It then performs lasso-based sparse regression by treating the outlier scores as the target feature and the original features as predictors on the outlier candidate set to obtain a feature subset that is tailored for the outlier scoring method. Our experiments show that two different outlier scoring methods enabled by CINFO (i) perform significantly better on 11 real-life high-dimensional data sets, and (ii) have much better resilience to noisy features, compared to their bare versions and three state-of-the-art competitors. The source code of CINFO is available at https://sites.google.com/site/gspangsite/sourcecode.
Pang, G, Chen, L, Cao, L & Liu, H 2018, 'Learning representations of ultrahigh-dimensional data for random distance-based outlier detection', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, London, United Kingdom, pp. 2041-2050.View/Download from: UTS OPUS or Publisher's site
© 2018 Association for Computing Machinery. Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learning the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance of detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach - the random distance-based approach. This customized learning yields more optimal and stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage little labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO to an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.
Wang, J, Chen, L, Qin, L & Wu, X 2018, 'ASTM: An Attentional Segmentation Based Topic Model for Short Texts', Proceedings - IEEE International Conference on Data Mining, ICDM, IEEE International Conference on Data Mining, IEEE, Singapore, Singapore, pp. 577-586.View/Download from: UTS OPUS or Publisher's site
© 2018 IEEE. To address the data sparsity problem in short text understanding, various alternative topic models leveraging word embeddings as background knowledge have been developed recently. However, existing models combine auxiliary information and topic modeling in a straightforward way without considering human reading habits. In contrast, extensive studies have proven that it is full of potential in textual analysis by taking into account human attention. Therefore, we propose a novel model, Attentional Segmentation based Topic Model (ASTM), to integrate both word embeddings as supplementary information and an attention mechanism that segments short text documents into fragments of adjacent words receiving similar attention. Each segment is assigned to a topic and each document can have multiple topics. We evaluate the performance of our model on three real-world short text datasets. The experimental results demonstrate that our model outperforms the state-of-the-art in terms of both topic coherence and text classification.
Liu, C, Chen, L, Tsang, I & Yin, H 2018, 'Towards the Learning of Weighted Multi-label Associative Classifiers', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Rio de Janeiro, Brazil, pp. 1-7.View/Download from: UTS OPUS or Publisher's site
© 2018 IEEE. Because of the ability to capture the correlation between features and labels, association rules have been applied to multi-label classification. However, existing multi-label associative classification algorithms usually exploit association rules using heuristic strategies. Moreover, only the covering association rules whose feature set is a subset of the testing instance are considered. Discarding any mined rules may diminish the performance of the classifier, especially when some rules only differ from the testing instance by a few insignificant features. In this paper we propose Weighted Multi-label Associative Classifiers (WMAC) that leverage an extended set of association rules with overlapping features with the testing instance to learn a universal weight vector for features. For this purpose, we embed the set of rules into a linear model and weigh the association rules by its confidence. Empirical results on diversified datasets clearly demonstrate that WMAC outperforms other well-established multi-label classification algorithms.
Yang, H, Pan, S, Zhang, P, Chen, L, Lian, D & Zhang, C 2018, 'Binarized attributed network embedding', Proceedings - IEEE International Conference on Data Mining, ICDM, IEEE International Conference on Data Mining, IEEE, Singapore, Singapore, pp. 1476-1481.View/Download from: UTS OPUS or Publisher's site
© 2018 IEEE. Attributed network embedding enables joint representation learning of node links and attributes. Existing attributed network embedding models are designed in continuous Euclidean spaces which often introduce data redundancy and impose challenges to storage and computation costs. To this end, we present a Binarized Attributed Network Embedding model (BANE for short) to learn binary node representation. Specifically, we define a new Weisfeiler-Lehman proximity matrix to capture data dependence between node links and attributes by aggregating the information of node attributes and links from neighboring nodes to a given target node in a layer-wise manner. Based on the Weisfeiler-Lehman proximity matrix, we formulate a new Weisfiler-Lehman matrix factorization learning function under the binary node representation constraint. The learning problem is a mixed integer optimization and an efficient cyclic coordinate descent (CCD) algorithm is used as the solution. Node classification and link prediction experiments on real-world datasets show that the proposed BANE model outperforms the state-of-the-art network embedding methods.
Liu, W, Chang, X, Chen, L & Yang, Y 2018, 'Semi-supervised Bayesian attribute learning for person re-identification', Thirty-Second AAAI Conference on Artificial Intelligence, AAAI, New Orleans, Louisiana, USA.View/Download from: UTS OPUS
Person re-identification (re-ID) tasks aim to identify the same person in multiple images captured from non-overlapping camera views. Most previous re-ID studies have attempted to solve this problem through either representation learning or metric learning, or by combining both techniques. Representation learning relies on the latent factors or attributes of the data. In most of these works, the dimensionality of the factors/attributes has to be manually determined for each new dataset. Thus, this approach is not robust. Metric learning optimizes a metric across the dataset to measure similarity according to distance. However, choosing the optimal method for computing these distances is data dependent, and learning the appropriate metric relies on a sufficient number of pair-wise labels. To overcome these limitations, we propose a novel algorithm for person re-ID, called semi-supervised Bayesian attribute learning. We introduce an Indian Buffet Process to identify the priors of the latent attributes. The dimensionality of attributes factors is then automatically determined by nonparametric Bayesian learning. Meanwhile, unlike traditional distance metric learning, we propose a re-identification probability distribution to describe how likely it is that a pair of images contains the same person. This technique relies solely on the latent attributes of both images. Moreover, pair-wise labels that are not known can be estimated from pair-wise labels that are known, making this a robust approach for semi-supervised learning. Extensive experiments demonstrate the superior performance of our algorithm over several state-of-the-art algorithms on small-scale datasets and comparable performance on large-scale re-ID datasets.
Wu, W, Li, B, Chen, L & Zhang, C 2017, 'Consistent Weighted Sampling Made More Practical', Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conference, ACM DL, Perth, Australia.View/Download from: UTS OPUS or Publisher's site
Min-Hash, which is widely used for efficiently estimating similarities of bag-of-words represented data, plays an increasingly important role in the era of big data. It has been extended to deal with real-value weighted sets -- Improved Consistent Weighted Sampling (ICWS) is considered as the state-of-the-art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first transform the original form of ICWS into an equivalent expression, based on which we find some interesting properties that inspire us to make the ICWS algorithm simpler and more efficient in both space and time complexities. PCWS is not only mathematically equivalent to ICWS and preserves the same theoretical properties, but also saves 20% memory footprint and substantial computational cost compared to ICWS. The experimental results on a number of real-world text data sets demonstrate that PCWS obtains the same (even better) classification and retrieval performance as ICWS with 1/5~1/3 reduced empirical runtime.
Pang, G, Cao, L, Chen, L & Liu, H 2017, 'Learning homophily couplings from non-IID data for joint feature selection and noise-resilient outlier detection', IJCAI International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence Organization, Melbourne, Australia, pp. 2585-2591.View/Download from: UTS OPUS
This paper introduces a novel wrapper-based outlier detection framework (WrapperOD) and its instance (HOUR) for identifying outliers in noisy data (i.e., data with noisy features) with strong couplings between outlying behaviors. Existing subspace or feature selection-based methods are significantly challenged by such data, as their search of feature subset(s) is independent of outlier scoring and thus can be misled by noisy features. In contrast, HOUR takes a wrapper approach to iteratively optimize the feature subset selection and outlier scoring using a top-κ outlier ranking evaluation measure as its objective function. HOUR learns homophily couplings between outlying behaviors (i.e., abnormal behaviors are not independent - they bond together) in constructing a noise-resilient outlier scoring function to produce a reliable outlier ranking in each iteration. We show that HOUR (i) retains a 2-approximation outlier ranking to the optimal one; and (ii) significantly outperforms five stateof-the-art competitors on 15 real-world data sets with different noise levels in terms of AUC and/or P@n. The source code of HOUR is available at https://sites.google.com/site/gspangsite/sourcecode.
Liu, B, Chen, L, Zhu, X, Zhang, Y & Zhang, C 2017, 'Protecting Location Privacy in Spatial Crowdsourcing using Encrypted Data', 20th International Conference on Extending Database Technology (EDBT), International Conference on Extending Database Technology, Open Proceedings, Venice, Italy.View/Download from: UTS OPUS or Publisher's site
In spatial crowdsourcing, spatial tasks are outsourced to a
set of workers in proximity of the task locations for efficient
assignment. It usually requires workers to disclose their locations,
which inevitably raises security concerns about the
privacy of the workers' locations. In this paper, we propose
a secure SC framework based on encryption, which ensures
that workers' location information is never released to any
party, yet the system can still assign tasks to workers situated
in proximity of each task's location. We solve the
challenge of assigning tasks based on encrypted data using
homomorphic encryption. Moreover, to overcome the efficiency
issue, we propose a novel secure indexing technique
with a newly devised SKD-tree to index encrypted worker
locations. Experiments on real-world data evaluate various
aspects of the performance of the proposed SC platform.
Liu, W, Chang, X, Chen, L & Yang, Y 2017, 'Early Active Learning with Pairwise Constraint for Person Re-identification', ECML PKDD 2017: Machine Learning and Knowledge Discovery in Databases, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Skopje, Macedonia, pp. 103-118.View/Download from: UTS OPUS or Publisher's site
Research on person re-identification (re-id) has attached much attention in the machine learning field in recent years. With sufficient labeled training data, supervised re-id algorithm can obtain promising performance. However, producing labeled data for training supervised re-id models is an extremely challenging and time-consuming task because it requires every pair of images across no-overlapping camera views to be labeled. Moreover, in the early stage of experiments, when labor resources are limited, only a small number of data can be labeled. Thus, it is essential to design an effective algorithm to select the most representative samples. This is referred as early active learning or early stage experimental design problem. The pairwise relationship plays a vital role in the re-id problem, but most of the existing early active learning algorithms fail to consider this relationship. To overcome this limitation, we propose a novel and efficient early active learning algorithm with a pairwise constraint for person re-identification in this paper. By introducing the pairwise constraint, the closeness of similar representations of instances is enforced in active learning. This benefits the performance of active learning for re-id. Extensive experimental results on four benchmark datasets confirm the superiority of the proposed algorithm.
Pan, P, Feng, J, Chen, L & Yang, Y 2017, 'Online compressed robust PCA', Proceedings of the International Joint Conference on Neural Networks, International Joint Conference on Neural Networks, IEEE, Anchorage, AK, USA, pp. 1041-1048.View/Download from: UTS OPUS or Publisher's site
© 2017 IEEE. In this work, we consider the problem of robust principal component analysis (RPCA) for streaming noisy data that has been highly compressed. This problem is prominent when one deals with high-dimensional and large-scale data and data compression is necessary. To solve this problem, we propose an online compressed RPCA algorithm to efficiently recover the low-rank components of raw data. Though data compression incurs severe information loss, we provide deep analysis on the proposed algorithm and prove that the low-rank component can be asymptotically recovered under mild conditions. Compared with other recent works on compressed RPCA, our algorithm reduces the memory cost significantly by processing data in an online fashion and reduces the communication cost by accepting sequential compressed data as input.
Liu, B, Chen, L, Liu, C, Zhang, C & Qiu, W 2016, 'Mining co-locations from continuously distributed uncertain spatial data', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics):Web Technologies and Applications 18th Asia-Pacific Web Conference, APWeb 2016, Web Technologies and Applications : Asia-Pacific Web Conference (APWeb), Springer, Suzhou, China, pp. 66-78.View/Download from: UTS OPUS or Publisher's site
© Springer International Publishing Switzerland 2016.A co-location pattern is a group of spatial features whose instances tend to locate together in geographic space. While traditional co-location mining focuses on discovering co-location patterns from deterministic spatial data sets, in this paper, we study the problem in the context of continuously distributed uncertain data. In particular, we aim to discover co-location patterns from uncertain spatial data where locations of spatial instances are represented as multivariate Gaussian distributions. We first formulate the problem of probabilistic co-location mining based on newly defined prevalence measures. When the locations of instances are represented as continuous variables, the major challenges of probabilistic co-location mining lie in the efficient computation of prevalence measures and the verification of the probabilistic neighborhood relationship between instances. We develop an effective probabilistic co-location mining framework integrated with optimization strategies to address the challenges. Our experiments on multiple datasets demonstrate the effectiveness of the proposed algorithm.
Liu, C & Chen, L 2016, 'Summarizing Uncertain Transaction Databases by Probabilistic Tiles', Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), IEEE International Joint Conference on Neural Networks, IEEE, Vancouver, pp. 4375-4382.View/Download from: UTS OPUS or Publisher's site
Transaction data mining is ubiquitous in various
domains and has been researched extensively. In recent years,
observing that uncertainty is inherent in many real world applications,
uncertain data mining has attracted much research
attention. Among the research problems, summarization is important
because it produces concise and informative results, which
facilitates further analysis. However, there are few works exploring
how to effectively summarize uncertain transaction data. In
this paper, we formulate the problem of summarizing uncertain
transaction data as Minimal Probabilistic Tile Cover Mining, which
aims to find a high-quality probabilistic tile set covering an
uncertain database with minimal cost. We define the concept of
Probabilistic Price and Probabilistic Price Order to evaluate and
compare the quality of tiles, and propose a framework to discover
the minimal probabilistic tile cover. The bottleneck is to check
whether a tile is better than another according to the Probabilistic
Price Order, which involves the computation of a joint probability.
We prove that it can be decomposed into independent terms and
calculated efficiently. Several optimization techniques are devised
to further improve the performance. Experimental results on real
world datasets demonstrate the conciseness of the produced tiles
and the effectiveness and efficiency of our approach
Song, K, Gao, W, Chen, L, Feng, S, Wang, D & Zhang, C 2016, 'Build Emotion Lexicon from the Mood of Crowd via Topic-Assisted Joint Non-negative Matrix Factorization', ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Pisa, Italy.
Wang, W, Yin, H, Sadiq, S, Chen, L, Xie, M & Zhou, X 2016, 'SPORE: A Sequential Personalized Spatial Item Recommender System', Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE), IEEE International Conference on Data Engineering, IEEE, Helsinki, pp. 954-965.View/Download from: UTS OPUS or Publisher's site
With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important way of helping users discover interesting locations to increase their engagement with location-based services. Although human movement exhibits sequential patterns in LBSNs, most current studies on spatial item recommendations do not consider the sequential influence of locations. Leveraging sequential patterns in spatial item recommendation is, however, very challenging, considering 1) users' check-in data in LBSNs has a low sampling rate in both space and time, which renders existing prediction techniques on GPS trajectories ineffective; 2) the prediction space is extremely large, with millions of distinct locations as the next prediction target, which impedes the application of classical Markov chain models; and 3) there is no existing framework that unifies users' personal interests and the sequential influence in a principled manner. In light of the above challenges, we propose a sequential personalized spatial item recommendation framework (SPORE) which introduces a novel latent variable topic-region to model and fuse sequential influence with personal interests in the latent and exponential space. The advantages of modeling the sequential effect at the topic-region level include a significantly reduced prediction space, an effective alleviation of data sparsity and a direct expression of the semantic meaning of users' spatial activities. Furthermore, we design an asymmetric Locality Sensitive Hashing (ALSH) technique to speed up the online top-k recommendation process by extending the traditional LSH. We evaluate the performance of SPORE on two real datasets and one large-scale synthetic dataset. The results demonstrate a significant improvement in SPORE's ability to recommend spatial items, in terms of both effectiveness and efficiency, compared with the state-of-the-art methods.
Wu, W, Li, B, Chen, L & Zhang, C 2016, 'Canonical Consistent Weighted Sampling for Real-Value Weighted Min-Hash', Proceedings of the 2016 IEEE 16th International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Barcelona, Spain, pp. 1287-1292.View/Download from: UTS OPUS or Publisher's site
Min-Hash, as a member of the Locality Sensitive Hashing (LSH) family for sketching sets, plays an important role in the big data era. It is widely used for efficiently estimating similarities of bag-of-words represented data and has been extended to dealing with multi-sets and real-value weighted sets. Improved Consistent Weighted Sampling (ICWS) has been recognized as the state-of-the-art for real-value weighted Min-Hash. However, the algorithmic implementation of ICWS is flawed because it violates the uniformity of the Min-Hash scheme. In this paper, we propose a Canonical Consistent Weighted Sampling (CCWS) algorithm, which not only retains the same theoretical complexity as ICWS but also strictly complies with the definition of Min-Hash. The experimental results demonstrate that the proposed CCWS algorithm runs faster than the state-of-the-arts while achieving similar classification performance on a number of real-world text data sets.
Wu, W, Li, B, Chen, L & Zhang, C 2016, 'Cross-view feature hashing for image retrieval', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Auckland, New Zealand, pp. 203-214.View/Download from: UTS OPUS or Publisher's site
© Springer International Publishing Switzerland 2016.Traditional cross-view information retrieval mainly rests on correlating two sets of features in different views. However, features in different views usually have different physical interpretations. It may be inappropriate to map multiple views of data onto a shared feature space and directly compare them. In this paper, we propose a simple yet effective Cross-View Feature Hashing (CVFH) algorithm via a 'partition and match' approach. The feature space for each view is bi-partitioned multiple times using B hash functions and the resulting binary codes for all the views can thus be represented in a compatible B-bit Hamming space. To ensure that hashed feature space is effective for supporting generic machine learning and information retrieval functionalities, the hash functions are learned to satisfy two criteria: (1) the neighbors in the original feature spaces should be also close in the Hamming space; and (2) the binary codes for multiple views of the same sample should be similar in the shared Hamming space. We apply CVFH to cross view image retrieval. The experimental results show that CVFH can outperform the Canonical Component Analysis (CCA) based cross-view method.
Pang, G, Cao, L & Chen, L 2016, 'Outlier detection in complex categorical data by modelling the feature value couplings', Proceedings of the 25th International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence (IJCAI), AAAI Press, New York.View/Download from: UTS OPUS
Pang, G, Cao, L, Chen, L & Liu, H 2016, 'Unsupervised Feature Selection for Outlier Detection by Modelling Hierarchical Value-Feature Couplings', Proceedings - IEEE International Conference on Data Mining, ICDM, IEEE International Conference on Data Mining, IEEE, Barcelona.View/Download from: UTS OPUS or Publisher's site
Han, B, Tsang, IW & Chen, L 2016, 'On the Convergence of A Family of Robust Losses for Stochastic Gradient Descent', Machine Learning and Knowledge Discovery in Databases - LNCS, The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML-PKDD), Springer, Riva del Garda, Italy.View/Download from: UTS OPUS or Publisher's site
The convergence of Stochastic Gradient Descent (SGD) using convex loss functions has been widely studied. However, vanilla SGD methods using convex losses cannot perform well with noisy labels, which adversely affect the update of the primal variable in SGD methods. Unfortunately, noisy labels are ubiquitous in real world applications such as crowdsourcing. To handle noisy labels, in this paper, we present a family of robust losses for SGD methods. By employing our robust losses, SGD methods successfully reduce negative effects caused by noisy labels on each update of the primal variable. We not only reveal the convergence rate of SGD methods using robust losses, but also provide the robustness analysis on two representative robust losses. Comprehensive experimental results on six real-world datasets show that SGD methods using robust losses are obviously more robust than other baseline methods in most situations with fast convergence.
Liu, B, Chen, L, Liu, C, Zhang, C & Qiu, W 2015, 'RCP Mining: Towards the Summarization of Spatial Co-location Patterns', Advances in Spatial and Temporal Databases (LNCS), International Symposium on Advances in Spatial and Temporal Databases, Springer, Hong Kong, China, pp. 451-469.View/Download from: UTS OPUS or Publisher's site
Co-location pattern mining is an important task in spatial data mining. However, the traditional framework of co-location pattern mining produces an exponential number of patterns because of the downward closure property, which makes it hard for users to understand, or apply. To address this issue, in this paper, we study the problem of mining representative co-location patterns (RCP). We first define a covering relationship between two co-location patterns by finding a new measure to appropriately quantify the distance between patterns in terms of their prevalence, based on which the problem of RCP mining is formally formulated. To solve the problem of RCP mining, we first propose an algorithm called RCPFast, adopting the post-mining framework that is commonly used by existing distance-based pattern summarization techniques. To address the peculiar challenge in spatial data mining, we further propose another algorithm, RCPMS, which employs the mine-and-summarize framework that pushes pattern summarization into the co-location mining process. Optimization strategies are also designed to further improve the performance of RCPMS. Our experimental results on both synthetic and real-world data sets demonstrate that RCP mining effectively summarizes spatial co-location patterns, and RCPMS is more efficient than RCPFast, especially on dense data sets.
Song, K, Feng, S, Gao, W, Wang, D, Chen, L & Zhang, C 2015, 'Building emotion lexicon from microblogs by combining effects of seed words and emoticons in a heterogeneous graph', Proceedings of the 26th ACM Conference on Hypertext & Social Media, ACM Conference on HyperText and Social Media, ACM, Guzelyurt, Northern Cyprus, pp. 283-292.View/Download from: UTS OPUS or Publisher's site
As an indispensable resource for emotion analysis, emotion
lexicons have attracted increasing attention in recent years.
Most existing methods focus on capturing the single emotional
effect of words rather than the emotion distributions
which are helpful to model multiple complex emotions in
a subjective text. Meanwhile, automatic lexicon building
methods are overly dependent on seed words but neglect
the effect of emoticons which are natural graphical labels of
fine-grained emotion. In this paper, we propose a novel emotion
lexicon building framework that leverages both seed
words and emoticons simultaneously to capture emotion distributions
of candidate words more accurately. Our method
overcomes the weakness of existing methods by combining
the effects of both seed words and emoticons in a unified
three-layer heterogeneous graph, in which a multi-label random
walk (MLRW) algorithm is performed to strengthen
the emotion distribution estimation. Experimental results
on real-world data reveal that our constructed emotion
lexicon achieves promising results for emotion classification
compared to the state-of-the-art lexicons
Wang, H, Zhang, P, Chen, L & Zhang, C 2015, 'Socialanalysis: A Real-Time query and mining system from social media data streams', Databases Theory and Applications (LNCS), Australasian Database Conference, Springer, Melbourne, Australia, pp. 318-322.View/Download from: Publisher's site
© Springer International Publishing Switzerland 2015. In this paper, we present our recent progress of designing a real-time system, SocialAnalysis, to discover and summarize emergent social events from social media data streams. In social networks era, people always frequently post messages or comments about their activities and opinions. Hence, there exist temporal correlations between the physical world and virtual social networks, which can help us to monitor and track social events, detecting and positioning anomalous events before their outbreakings, so as to provide early warning. The key technologies in the system include: (1) Data denoising methods based on multi-features, which screens out the query-related event data from massive background data. (2) Abnormal events detection methods based on statistical learning, which can detect anomalies by analyzing and mining a series of observations and statistics on the time axis. (3) Geographical position recognition, which is used to recognize regions where abnormal events may happen.
Wang, H, Zhang, P, Chen, L, Liu, H & Zhang, C 2015, 'Online Diffusion Source Detection in Social Networks', Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), IEEE International Joint Conference on Neural Networks, IEEE, Killarney, Ireland, pp. 1-8.View/Download from: UTS OPUS or Publisher's site
In this paper we study a new problem of online diffusion source detection in social networks. Existing work on diffusion source detection focuses on offline learning, which assumes data collected from network detectors are static and a snapshot of network is available before learning. However, an offline learning model does not meet the needs of early warning, real-time awareness, and real-time response of malicious information spreading in social networks. In this paper, we combine online learning and regression-based detection methods for real-time diffusion source detection. Specifically, we propose a new ℓ1 non-convex regression model as the learning function, and an Online Stochastic Sub-gradient algorithm (OSS for short). The proposed model is empirically evaluated on both synthetic and real-world networks. Experimental results demonstrate the effectiveness of the proposed model.
Wang, W, Yin, H, Chen, L, Sun, Y, Sadiq, S & Zhou, X 2015, 'GEO-SAGE: A geographical sparse additive generative model for spatial item recommendation', KDD '15 Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM International Conference on Knowledge Discovery and Data Mining, ACM, Sydney, NSW, Australia, pp. 1255-1264.View/Download from: UTS OPUS or Publisher's site
Wang, H, Zhang, P, Tsang, I, Chen, L & Zhang, C 2015, 'Defragging Subgraph Features for Graph Classification', Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, ACM, Melbourne, VIC, Australia, pp. 1687-1690.View/Download from: UTS OPUS or Publisher's site
Graph classification is an important tool for analysing structured and semi-structured data, where subgraphs are commonly used as the feature representation. However, the number and size of subgraph features crucially depend on the threshold parameters of frequent subgraph mining algorithms. Any improper setting of the parameters will generate many trivial short-pattern subgraph fragments which dominate the feature space, distort graph classifiers and bury interesting long-pattern subgraphs. In this paper, we propose a new Subgraph Join Feature Selection (SJFS) algorithm. The SJFS algorithm, by forcing graph classifiers to join short-pattern subgraph fragments, can defrag trivial subgraph features and deliver long-pattern interesting subgraphs. Experimental results on both synthetic and real-world social network graph data demonstrate the performance of the proposed method.
Yin, H, Cui, B, Chen, L, Hu, Z & Huang, Z 2014, 'A temporal context-aware model for user behavior modeling in social media systems', International Conference on Management of Data, SIGMOD 2014, ACM Special Interest Group on Management of Data Conference, ACM, Snowbird, UT, USA, pp. 1543-1554.View/Download from: UTS OPUS or Publisher's site
Liu, C, Chen, L & Zhang, C 2013, 'Mining Probabilistic Representative Frequent Patterns From Uncertain Data', The 13th SIAM International Conference on Data Mining (SDM 2013), SIAM International Conference on Data Mining, SIAM / Omnipress, Austin, Texas, USA, pp. 1-9.View/Download from: UTS OPUS or Publisher's site
Probabilistic frequent pattern mining over uncertain data has received a great deal of attention recently due to the wide applications of uncertain data. Similar to its counterpart in deterministic databases, however, probabilistic frequent pattern mining suffers from the same problem of generating an exponential number of result patterns. The large number of discovered patterns hinders further evaluation and analysis, and calls for the need to find a small number of representative patterns to approximate all other patterns. This paper formally defines the problem of probabilistic representative frequent pattern (P-RFP) mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The problem's bottleneck turns out to be checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of supports of two patterns. To address the problem, we propose a novel and efficient dynamic programming-based approach. Moreover, we have devised a set of effective optimization strategies to further improve the computation efficiency. Our experimental results demonstrate that the proposed P-RFP mining effectively reduces the size of probabilistic frequent patterns. Our proposed approach not only discovers the set of P-RFPs efficiently, but also restores the frequency probability information of patterns with an error guarantee.
Liu, C, Chen, L & Zhang, C 2013, 'Summarizing Probabilistic Frequent Patterns: A Fast Approach', Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD13), ACM International Conference on Knowledge Discovery and Data Mining, ACM, Chicago, Illinois USA, pp. 527-535.View/Download from: UTS OPUS or Publisher's site
Mining probabilistic frequent patterns from uncertain data has received a great deal of attention in recent years due to the wide applications. However, probabilistic frequent pattern mining suffers from the problem that an exponential number of result patterns are generated, which seriously hinders further evaluation and analysis. In this paper, we focus on the problem of mining probabilistic representative frequent patterns (P-RFP), which is the minimal set of patterns with adequately high probability to represent all frequent patterns. Observing the bottleneck in checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of the supports of two patterns, we introduce a novel approximation of the joint probability with both theoretical and empirical proofs. Based on the approximation, we propose an Approximate P-RFP Mining (APM) algorithm, which effectively and efficiently compresses the set of probabilistic frequent patterns. To our knowledge, this is the first attempt to analyze the relationship between two probabilistic frequent patterns through an approximate approach. Our experiments on both synthetic and real-world datasets demonstrate that the APM algorithm accelerates P-RFP mining dramatically, orders of magnitudes faster than an exact solution. Moreover, the error rate of APM is guaranteed to be very small when the database contains hundreds transactions, which further affirms APM is a practical solution for summarizing probabilistic frequent patterns.
Wan, L, Chen, L & Zhang, C 2013, 'Mining Dependent Frequent Serial Episodes from Uncertain Sequence Data', Proceedings of the13th IEEE International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE Computer Society Press, Dallas, TX, USA, pp. 1211-1216.View/Download from: UTS OPUS or Publisher's site
In this paper, we focus on the problem of mining Probabilistic Dependent Frequent Serial Episodes (P-DFSEs) from uncertain sequence data. By observing that the frequentness probability of an episode in an uncertain sequence is a Markov Chain imbeddable variable, we first propose an Embedded Markov Chain-based algorithm that efficiently computes the frequentness probability of an episode by projecting the probability space into a set of limited partitions. To further improve the computation efficiency, we devise an optimized approach that prunes candidate episodes early by estimating the upper bound of their frequentness probabilities.
Wan, L, Chen, L & Zhang, C 2013, 'Mining Frequent Serial Episodes over Uncertain Sequence Data', The 16th International Conference on Extending Database Technology (EDBT 2013), Extending Database Technology, ACM EDBT/ICDT 2013, Genoa, Italy, pp. 215-226.View/Download from: UTS OPUS or Publisher's site
Data uncertainty has posed many unique challenges to nearly all types of data mining tasks, creating a need for uncertain data mining. In this paper, we focus on the particular task of mining probabilistic frequent serial episodes (P-FSEs) from uncertain sequence data, which applies to many real applications including sensor readings as well as customer purchase sequences. We first define the notion of P-FSEs, based on the frequentness probabilities of serial episodes under possible world semantics. To discover P-FSEs over an uncertain sequence, we propose: 1) an exact approach that computes the accurate frequentness probabilities of episodes; 2) an approximate approach that approximates the frequency of episodes using probability models; 3) an optimized approach that efficiently prunes a candidate episode by estimating an upper bound of its frequentness probability using approximation techniques. We conduct extensive experiments to evaluate the performance of the developed data mining algorithms. Our experimental results show that: 1) while existing research demonstrates that approximate approaches are orders of magnitudes faster than exact approaches, for P-FSE mining, the efficiency improvement of the approximate approach over the exact approach is marginal; 2) although it has been recognized that the normal distribution based approximation approach is fairly accurate when the data set is large enough, for P-FSE mining, the binomial distribution based approximation achieves higher accuracy when the the number of episode occurrences is limited; 3) the optimized approach clearly outperforms the other two approaches in terms of the runtime, and achieves very high accuracy.
Yin, H, Sun, Y, Cui, B, Hu, Z & Chen, L 2013, 'LCARS: A Location-Content-Aware Recommender System', ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM International Conference on Knowledge Discovery and Data Mining, ACM, Chicago, Illinois USA, pp. 221-229.View/Download from: UTS OPUS or Publisher's site
Newly emerging location-based and event-based social network services provide us with a new platform to understand users preferences based on their activity history. A user can only visit a limited number of venues/events and most of them are within a limited distance range, so the user-item matrix is very sparse, which creates a big challenge for traditional collaborative filtering-based recommender systems. The problem becomes more challenging when people travel to a new city where they have no activity history. In this paper, we propose LCARS, a location-content-aware recommender system that offers a particular user a set of venues (e.g., restaurants) or events (e.g., concerts and exhibitions) by giving consideration to both personal interest and local preference. This recommender system can facilitate peoples travel not only near the area in which they live, but also in a city that is new to them. Specifically, LCARS consists of two components: offline modeling and online recommendation. The offline modeling part, called LCA- LDA, is designed to learn the interest of each individual user and the local preference of each individual city by capturing item co- occurrence patterns and exploiting item contents. The online recommendation part automatically combines the learnt interest of the querying user and the local preference of the querying city to produce the top-k recommendations. To speed up this online process, a scalable query processing technique is developed by extending the classic Threshold Algorithm (TA). We evaluate the performance of our recommender system on two large-scale real data sets, Douban- Event and Foursquare. The results show the superiority of LCARS in recommending spatial items for users, especially when traveling to new cities, in terms of both effectiveness and efficiency.
Hasan, M, Xu, M, He, S & Chen, L 2012, 'Shot Classification Using Domain Specific Features for Movie Management', Lecture Notes in Computer Science, International Conference on DASFAA, Springer, Busan, South Korea, pp. 314-318.View/Download from: UTS OPUS or Publisher's site
Among many video types, movie content indexing and retrieval is a significantly challenging task because of the wide variety of shooting techniques and the broad range of genres. A movie consists of a series of video shots. Managing a movie at shot level provides a feasible way for movie understanding and summarization. Consequently, an effective shot classification is greatly desired for advanced movie management. In this demo, we explore novel domain specific features for effective shot classification. Experimental results show that the proposed method classifies movie shots from wide range of movie genres with improved accuracy compared to existing work
Long, G, Chen, L, Zhu, X & Zhang, C 2012, 'TCSST: transfer classification of short & sparse text using external data', Proc. Of The 21st ACM Conference on Information and Knowledge Management (CIKM-12), ACM International Conference on Information and Knowledge Management, ACM, Maui, Hawaii, USA, pp. 764-772.View/Download from: UTS OPUS or Publisher's site
Short & sparse text is becoming more prevalent on the web, such as search snippets, micro-blogs and product reviews. Accurately classifying short & sparse text has emerged as an important while challenging task. Existing work has considered utilizing external data (e.g. Wikipedia) to alleviate data sparseness, by appending topics detected from external data as new features. However, training a classifier on features concatenated from different spaces is not easy considering the features have different physical meanings and different significance to the classification task. Moreover, it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, to exploit the external data to tackle the data sparsity issue. The transfer classifier will be learned in the original feature space. Considering that the labels of the external data may not be readily available or sufficiently enough, TCSST further exploits the unlabeled external data to aid the transfer classification. We develop novel strategies to allow TCSST to iteratively select high quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark as well as real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods
Chen, L & Zhang, C 2011, 'Semi-supervised Variable Weighting for Clustering', Proceedings of the Eleventh SIAM International Conference on Data Mining, SIAM International Conference on Data Mining, SIAM / Omnipress, Mesa, Arizona, USA, pp. 863-871.View/Download from: UTS OPUS or Publisher's site
Semi-supervised learning, which uses a small amount of labeled data in conjunction with a large amount of unlabeled data for training, has recently attracted huge research attention due to the considerable improvement in learning accuracy. In this work, we focus on semi- supervised variable weighting for clustering, which is a critical step in clustering as it is known that interesting clustering structure usually occurs in a subspace defined by a subset of variables. Besides exploiting both labeled and unlabeled data to effectively identify the real importance of variables, our method embeds variable weighting in the process of semi-supervised clustering, rather than calculating variable weights separately, to ensure the computation efficiency. Our experiments carried out on both synthetic and real data demonstrate that semi-supervised variable weighting signicantly improves the clustering accuracy of existing semi-supervised k-means without variable weighting, or with unsupervised variable weighting.
Papapetrou, O & Chen, L 2011, 'XStreamCluster: an Efficient Algorithm for Streaming XML Data Clustering', Lecture Notes in Computer Science: Database Systems for Advanced Applications 16th International Conference, DASFAA 2011, Database Systems for Advanced Applications, Springer, Hongkong, China, pp. 496-510.View/Download from: UTS OPUS or Publisher's site
XML clustering finds many applications, ranging from storage to query processing. However, existing clustering algorithms focus on static XML collections, whereas modern information systems frequently deal with streaming XML data that needs to be processed online. Streaming XML clustering is a challenging task because of the high computational and space efficiency requirements implicated for online approaches. In this paper we propose XStreamCluster, which addresses the two challenges using a two-layered optimization. The bottom layer employs Bloom filters to encode the XML documents, providing a space-efficient solution to memory usage. The top layer is based on Locality Sensitive Hashing and contributes to the computational efficiency. The theoretical analysis shows that the approximate solution of XStreamCluster generates similarly good clusters as the exact solution, with high probability. The experimental results demonstrate that XStreamCluster improves both memory efficiency and computational time by at least an order of magnitude without affecting clustering quality, compared to its variants and a baseline approach.
Yu, PS, Fan, W, Nejdl, W, Chen, L, Sun, A, Simovici, D, Baralis, E, Nguifo, EM, Xu, G, Yin, J, Ceci, M, Cortez, P, Christen, P, Berka, P, Alves, R, Xu, S, Elomaa, T, Kosters, W, Graco, W, Wang, W, Balke, WT & Zhao, Y 2011, 'Preface', Proceedings - IEEE International Conference on Data Mining, ICDM.View/Download from: Publisher's site
Su, H, Chen, L, Ye, Y, Sun, Z & Wu, Q 2010, 'A refinement approach to handling model misfit in semi-supervised learning', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer, Chongqing, China, pp. 75-86.View/Download from: UTS OPUS or Publisher's site
Semi-supervised learning has been the focus of machine learning and data mining research in the past few years. Various algorithms and techniques have been proposed, from generative models to graph-based algorithms. In this work, we focus on the Cluster-
Xu, M, Chen, L, He, S, Xu, C & Jin, J 2010, 'Adaptive Local Hyperplanes for MTV affective analysis', Proceedings of the 2nd International Conference on Internet Multimedia Computing and Service, ICIMCS'10, International Conference on Internet Multimedia Computing and Service, ACM, Harbin, China, pp. 167-170.View/Download from: UTS OPUS or Publisher's site
Affective analysis attracts increasing attention in multimedia domain since affective factors directly reflect audiences' attention, evaluation and memory. Existing study focuses on mapping low-level affective features to high-level emotions by applying
Chen, L, Bhowmick, SS & Li, J 2006, 'COWES: Clustering web users based on historical web sessions', DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 11th International Conference on Database Systems for Advanced Applications, SPRINGER-VERLAG BERLIN, Singapore, SINGAPORE, pp. 541-556.