# Dr Ying Zhang

### Biography

I received the Bachelor and Master degree in Computer Science from Peking University, P.R. China, in July 1998 and July 2001. I got the PhD in April 2008 from the School of Computer Science and Engineering, the University of New South Wales (UNSW), Australia. I had worked as a research fellow from 2008 to 2013 in database group of UNSW. Now I am a senior lecturer and ARC DECRA research fellow (2014-2016) at the University of Technology, Sydney (UTS).

### Professional

IEEE member

## Links

**Senior Lecturer,**School of Software

**Core Member,**QCIS - Quantum Computation and Intelligent Systems

Bch Computer Science, Mas Compter Science, PhD Computer Science

**Phone**

+61 2 9514 1098

**ORCID**

### Research Interests

My current research focuses on efficient query processing on massive spatial data, stream data, uncertain data and graphs.

**Can supervise:**Yes

Supervise areas in Database and Data mining.

## Conferences

Chang, L., Lin, X., Zhang, W., Yu, J.X., Zhang, Y. & Qin, L. 2015, 'Optimal Enumeration: Efficient Top-k Tree Matching',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, 41st International Conference on Very Large Data Bases,, VLDB Endowment, Kahola Coast, Hawaii, pp. 533-544.View/Download from: UTS OPUS or Publisher's site

Driven by many real applications, graph pattern matching has attracted
a great deal of attention recently. Consider that a twigpattern
matching may result in an extremely large number of matches
in a graph; this may not only confuse users by providing too many
results but also lead to high computational costs. In this paper,
we study the problem of top-k tree pattern matching; that is, given
a rooted tree T, compute its top-k matches in a directed graph G
based on the twig-pattern matching semantics. We firstly present
a novel and optimal enumeration paradigm based on the principle
of Lawler's procedure. We show that our enumeration algorithm
runs in O(nT + log k) time in each round where nT is the number
of nodes in T. Considering that the time complexity to output a
match of T is O(nT ) and nT log k in practice, our enumeration
technique is optimal. Moreover, the cost of generating top-1 match
of T in our algorithm is O(mR) where mR is the number of edges
in the transitive closure of a data graph G involving all relevant
nodes to T. O(mR) is also optimal in the worst case without preknowledge
of G. Consequently, our algorithm is optimal with the
running time O(mR +k(nT +log k)) in contrast to the time complexity
O(mR log k+knT (log k+dT )) of the existing technique where dT
is the maximal node degree in T. Secondly, a novel priority based
access technique is proposed, which greatly reduces the number of
edges accessed and results in a significant performance improvement.
Finally, we apply our techniques to the general form of top-k
graph pattern matching problem (i.e., query is a graph) to improve
the existing techniques. Comprehensive empirical studies demonstrate
that our techniques may improve the existing techniques by
orders of magnitude.

Wang, X., Zhang, Y., Zhang, W., Lin, X. & Wang, W. 2015, 'AP-Tree: Efficiently support continuous spatial-keyword queries over stream',

View/Download from: UTS OPUS or Publisher's site

*Proceedings - International Conference on Data Engineering*, 2015 IEEE 31st International Conference on Data Engineering (ICDE), IEEE, Seoul, Korea, pp. 1107-1118.View/Download from: UTS OPUS or Publisher's site

© 2015 IEEE. We investigate the problem of processing a large amount of continuous spatial-keyword queries over streaming data, which is essential in many applications such as location-based recommendation and advertising, thanks to the proliferation of geo-equipped devices and the ensuing location-based social media applications. For example, a location-based e-coupon system may allow potentially millions of users to register their continuous spatial-keyword queries (e.g., interests in nearby sales) by specifying a set of keywords and a spatial region; the system then delivers each incoming spatial-textual object (e.g., a geo-tagged e-coupon) to all the matched queries (i.e., users) whose spatial and textual requirements are satisfied. While there are several prior approaches aiming at providing efficient query processing techniques for the problem, their approaches belong to spatial-first indexing method which cannot well exploit the keyword distribution. In addition, their textual filtering techniques are built upon simple variants of traditional inverted indexes, which do not perform well for the textual constraint imposed by the problem. In this paper, we address the above limitations and provide a highly efficient solution based on a novel adaptive index, named AP-Tree. The AP-Tree adaptively groups registered queries using keyword and spatial partitions, guided by a cost model. The AP-Tree also naturally indexes ordered keyword combinations. We present index construction algorithm that seamlessly and effectively integrates keyword and spatial partitions. Consequently, our method adapts well to the underlying spatial and keyword distributions of the data. Our extensive experiments demonstrate that AP-Tree achieves up to an order of magnitude improvement on efficiency compared with prior state-of-the-art methods.

Wang, X., Zhang, Y., Zhang, W., Lin, X. & Cheema, M.A. 2015, 'Optimal spatial dominance: An effective search of nearest neighbor candidates',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the ACM SIGMOD International Conference on Management of Data*, pp. 923-938.View/Download from: UTS OPUS or Publisher's site

Copyright © 2015 ACM. In many domains such as computational geometry and database management, an object may be described by multiple instances (points). Then the distance (or similarity) between two objects is captured by the pair-wise distances among their instances. In the past, numerous nearest neighbor (NN) functions have been proposed to define the distance between objects with multiple instances and to identify the NN object. Nevertheless, considering that a user may not have a specific NN function in mind, it is desirable to provide her with a set of NN candidates. Ideally, the set of NN candidates must include every object that is NN for at least one of the NN functions and must exclude every nonpromising object. However, no one has studied the problem of NN candidates computation from this perspective. Although some of the existing works aim at returning a set of candidate objects, they do not focus on the NN functions while computing the candidate objects. As a result, they either fail to include an NN object w.r.t. some NN functions or include a large number of unnecessary objects that have no potential to be the NN regardless of the NN functions. Motivated by this, we classify the existing NN functions for objects with multiple instances into three families by characterizing their key features. Then, we advocate three spatial dominance operators to compute NN candidates where each operator is optimal w.r.t. different coverage of NN functions. Efficient algorithms are proposed for the dominance check and corresponding NN candidates computation. Extensive empirical study on real and synthetic datasets shows that our proposed operators can significantly reduce the number of NN candidates. The comprehensive performance evaluation demonstrates the efficiency of our computation techniques.

Han, Y., Wang, L., Zhang, Y., Zhang, W. & Lin, X. 2015, 'Spatial keyword range search on trajectories',

View/Download from: UTS OPUS or Publisher's site

*Database Systems for Advanced Applications (LNCS)*, International Conference on Database Systems for Advanced Applications, Springer, Hanoi, Vietnam, pp. 223-240.View/Download from: UTS OPUS or Publisher's site

© 2015, Springer International Publishing Switzerland, All rights Reserved. With advances in geo-positioning technologies and ensuing location based service, there are a rapid growing amount of trajectories associated with textual information collected in many emerging applications. For instance, nowadays many people are used to sharing interesting experience through Foursquare or Twitter along their travel routes. In this paper, we investigate the problem of spatial keyword range search on trajectories, which is essential to make sense of large amount of trajectory data. To the best of our knowledge, this is the first work to systematically investigate range search over trajectories where three important aspects, i.e., spatio, temporal and textual, are all taken into consideration. Given a query region, a timespan and a set of keywords, we aim to retrieve trajectories that go through this region during query timespan, and contain all the query keywords. To facilitate the range search, a novel index structure called IOC-Tree is proposed based on the inverted indexing and octree techniques to effectively explore the spatio, temporal and textual pruning techniques. Furthermore, this structure can also support the query with order-sensitive keywords. Comprehensive experiments on several real-life datasets are conducted to demonstrate the efficiency.

Zhan, L., Zhang, Y., Zhang, W., Wang, X. & Lin, X. 2015, 'Range search on uncertain trajectories',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 24th ACM International on Conference on Information and Knowledge Management*, 24th ACM International on Conference on Information and Knowledge Management, ACM, USA, pp. 921-930.View/Download from: UTS OPUS or Publisher's site

The range search on trajectories is fundamental in a wide spectrum of applications such as environment monitoring and location based services. In practice, a large portion of spatio-temporal data in the above applications is generated with low sampling rate and the uncertainty arises between two subsequent observations of a moving object. To make sense of the uncertain trajectory data, it is critical to properly model the uncertainty of the trajectories and develop efficient range search algorithms on the new model. Assuming uncertain trajectories are modeled by the popular Markov Chains, in this paper we investigate the problem of range search on uncertain trajectories. In particular, we propose a general framework for range search on uncertain trajectories following the filtering-and-refinement paradigm where summaries of uncertain trajectories are constructed to facilitate the filtering process. Moreover, statistics based and partition based filtering techniques are developed to enhance the filtering capabilities. Comprehensive experiments demonstrate the effectiveness and efficiency of our new techniques.

Wang, S., Cheema, M.A., Zhang, Y. & Lin, X. 2015, 'Selecting representative objects considering coverage and diversity',

View/Download from: UTS OPUS or Publisher's site

*GeoRich 2015 - 2nd International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, in conjunction with SIGMOD 2015*, pp. 31-36.View/Download from: UTS OPUS or Publisher's site

© 2015 ACM. We say that an object o attracts a user u if o is one of the top-k objects according to the preference function defined by u. Given a set of objects (e.g., restaurants) and a set of users, in this paper, we study the problem of computing a set of representative objects considering two criteria: coverage and diversity. Coverage of a set S of objects is the distinct number of users that are attracted by the objects in S. Although a set of objects with high coverage attracts a large number of users, it is possible that all of these users have quite similar preferences. Consequently, the set of objects may be attractive only for a specific class of users with similar preference functions which may disappoint other users having widely different preferences. The diversity criterion addresses this issue by selecting a set S of objects such that the set of attracted users for each object in S is as different as possible from the sets of users attracted by the other objects in S. The existing work on representative objects considers only one of the coverage and diversity criteria. We are the first to consider both of the criteria where the importance of each criterion can be controlled using a parameter. Our algorithm has two phases. In the first phase, we prune the objects that cannot be among the representative objects and compute the set of attracted users (also called reverse top-k) for each of the remaining objects. In the second phase, the reverse top-k of these objects are used to compute the representative objects maximizing coverage and diversity. Since this problem is NP-hard, the second phase employs a greedy algorithm. For the sake of time and space efficiency, we adopt MinHash and KMV Synopses to assist the set operations. We prove that the proposed greedy algorithm is Q-approximate. Our extensive experimental study on real and synthetic data sets demonstrates the effectiveness of our proposed techniques.

Sun, Y., Wang, W., Qin, J., Zhang, Y. & Lin, X. 2014, 'SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index',

View/Download from: UTS OPUS

*Proceedings of the VLDB Endowment International Conference on Very Large Data Bases*, International Conference on Very Large Data Bases, Kohala Coast, Hawaii, pp. 1-12.View/Download from: UTS OPUS

Zhan, L., Zhang, Y., Zhang, W. & Lin, X. 2014, 'Identifying Top k Dominating Objects over Uncertain Data',

View/Download from: UTS OPUS or Publisher's site

*Database Systems for Advanced Applications - 19th International Conference, DASFAA 2014, Bali, Indonesia, April 21-24, 2014. Proceedings, Part I*, pp. 388-405.View/Download from: UTS OPUS or Publisher's site

Zhang, C., Zhang, Y., Zhang, W., Lin, X., Cheema, M.A. & Wang, X. 2014, 'Diversified Spatial Keyword Search On Road Networks',

View/Download from: UTS OPUS or Publisher's site

*Proc. 17th International Conference on Extending Database Technology (EDBT), Athens, Greece, March 24-28, 2014.*, pp. 367-378.View/Download from: UTS OPUS or Publisher's site

Yang, S., Cheema, M.A., Lin, X. & Zhang, Y. 2014, 'SLICE: Reviving regions-based pruning for reverse k nearest neighbors queries',

View/Download from: UTS OPUS or Publisher's site

*IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014*, pp. 760-771.View/Download from: UTS OPUS or Publisher's site

Wang, X., Zhang, Y., Zhang, W. & Lin, X. 2014, 'Efficiently identify local frequent keyword co-occurrence patterns in geo-tagged Twitter stream',

View/Download from: UTS OPUS or Publisher's site

*The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '14, Gold Coast, QLD, Australia - July 06 - 11, 2014*, pp. 1215-1218.View/Download from: UTS OPUS or Publisher's site

Zhang, Y., Zhang, W., Lin, X., Cheema, M.A. & Zhang, C. 2014, 'Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects',

View/Download from: UTS OPUS or Publisher's site

*Conference on Scientific and Statistical Database Management, SSDBM '14, Aalborg, Denmark, June 30 - July 02, 2014*, pp. 18-18.View/Download from: UTS OPUS or Publisher's site

Zhang, C., Zhang, Y., Zhang, W. & Lin, X. 2013, 'Inverted linear quadtree: Efficient top k spatial keyword search',

View/Download from: UTS OPUS or Publisher's site

*IEEE 29th International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Brisbane, Australia, pp. 901-912.View/Download from: UTS OPUS or Publisher's site

With advances in geo-positioning technologies and geo-location services, there are a rapidly growing amount of spatio-textual objects collected in many applications such as location based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search which explores both location and textual description of the objects has attracted great attention from the commercial organizations and research communities. In the paper, we study the problem of top k spatial keyword search (TOPK-SK), which is fundamental in the spatial keyword queries. Given a set of spatio-textual objects, a query location and a set of query keywords, the top k spatial keyword search retrieves the closest k objects each of which contains all keywords in the query. Based on the inverted index and the linear quadtree, we propose a novel index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top k spatial keyword search. In addition, we show that the IL-Quadtree technique can also be applied to improve the performance of other spatial keyword queries such as the direction-aware top k spatial keyword search and the spatio-textual ranking query. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.

Lin, Q., Zhang, Y., Zhang, W. & Lin, X. 2013, 'AVR-Tree: Speeding Up the NN and ANN Queries on Location Data',

View/Download from: UTS OPUS or Publisher's site

*LNCS - Database Systems for Advanced Applications - Proceedings, Part I of the 18th International Conference on Database Systems for Advanced Applications*, Database Systems for Advanced Applications, Springer, China, pp. 116-130.View/Download from: UTS OPUS or Publisher's site

In the paper, we study the problems of nearest neighbor queries (NN) and all nearest neighbor queries (ANN) on location data, which have a wide range of applications such as Geographic Information System (GIS) and Location based Service (LBS). We propose a new structure, termed AVR-Tree, based on the R-tree and Voronoi diagram techniques. Compared with the existing indexing techniques used for NN and ANN queries on location data, AVR-Tree can achieve a better trade-off between the pruning effectiveness and the index size for NN and ANN queries. We also conduct a comprehensive performance evaluation for the proposed techniques based on both real and synthetic data, which shows that AVR-Tree based NN and ANN algorithms achieve better performance compared with their best competitors in terms of both CPU and I/O costs.

Zhang, Q., Ye, P., Lin, X. & Zhang, Y. 2013, 'Skyline probability over uncertain preferences',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 16th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Genoa, Italy, pp. 395-405.View/Download from: UTS OPUS or Publisher's site

Skyline analysis is a key in a wide spectrum of real applications involving multi-criteria optimal decision making. In recent years, a considerable amount of research has been contributed on efficient computation of skyline probabilities over uncertain environment. Most studies if not all, assume uncertainty lies only in attribute values. To the extent of our knowledge, only one study addresses the skyline probability computation problem in scenarios where uncertainty resides in attribute preferences, instead of values. However this study takes a problematic approach by assuming independent object dominance, which we find is not always true in uncertain preference scenarios. In fact this assumption has already been shown to be not necessarily true in uncertain value scenarios. Motivated by this, we revisit the skyline probability computation over uncertain preferences in this paper. We first show that the problem of skyline probability computation over uncertain preferences is #P-complete. Then we propose efficient exact and approximate algorithms to tackle this problem. While the exact algorithm remains exponential in the worst case, our experiments demonstrate its efficiency in practice. The approximate algorithm achieves e-approximation by the confidence (1 - d) with time complexity O(dn1/e2 ln 1/d), where n is the number of objects and d is the dimensionality. The efficiency and effectiveness of our methods are verified by extensive experimental results on real and synthetic data sets.

Cheema, M.A., Lin, X., Zhang, W. & Zhang, Y. 2013, 'A safe zone based approach for monitoring moving skyline queries',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 16th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Genoa, Italy, pp. 275-286.View/Download from: UTS OPUS or Publisher's site

Given a set of criterions, an object o dominates another object ó if o is more preferable than ó according to every criterion. A skyline query returns every object that is not dominated by any other object. In this paper, we study the problem of continuously monitoring a moving skyline query where one of the criterions is the distance between the objects and the moving query. We propose a safe zone based approach to address the challenge of efficiently updating the results as the query moves. A safe zone is the area such that the results of a query remain unchanged as long as the query lies inside this area. Hence, the results are required to be updated only when the query leaves its safe zone. Although the main focus of this paper is to present the techniques for Euclidean distance metric, the proposed techniques are applicable to any metric distance (e.g., Manhattan distance, road network distance). We present several non-trivial optimizations and propose an efficient algorithm for safe zone construction. Our experiments demonstrate that the cost of our safe zone based approach is reasonably close to a lower bound cost and is three orders of magnitude lower than the cost of a nave algorithm.

Zhang, W., Li, A., Cheema, M.A., Zhang, Y. & Chang, L. 2013, 'Probabilistic n-of-N skyline computation over uncertain data streams',

View/Download from: Publisher's site

*Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*, pp. 439-457.View/Download from: Publisher's site

Skyline operator is a useful tool in multi-criteria decision making in various applications. Uncertainty is inherent in real applications due to various reasons. In this paper, we consider the problem of efficiently computing probabilistic skylines against the most recent N uncertain elements in a data stream seen so far. Specifically, we study the problem in the n-of-N model; that is, computing the probabilistic skyline for the most recent n (n N) elements, where an element is a probabilistic skyline element if its skyline probability is not below a given probability threshold q. Firstly, an effective pruning technique to minimize the number of uncertain elements to be kept is developed. It can be shown that on average storing only O(logd N) uncertain elements from the most recent N elements is sufficient to support the precise computation of all probabilistic n-of-N skyline queries in a d-dimension space if the data distribution on each dimension is independent. A novel encoding scheme is then proposed together with efficient update techniques so that computing a probabilistic n-of-N skyline query in a d-dimension space is reduced to O (d loglogN + s) if the data distribution is independent, where s is the number of skyline points. Extensive experiments demonstrate that the new techniques on uncertain data streams can support on-line probabilistic skyline query computation over rapid data streams. © 2013 Springer-Verlag.

Feng, X., Zhang, W., Zhao, X., Zhang, Y. & Gao, Y. 2013, 'Probabilistic k-skyband operator over sliding windows',

View/Download from: Publisher's site

*Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*, pp. 190-202.View/Download from: Publisher's site

Given a set of data elements in a d-dimensional space, a k-skyband query reports the set of elements which are dominated by at most k - 1 other elements in. k-skyband query is a fundamental query type in data analyzing as it keeps a minimum candidate set for all top-k ranking queries where the ranking functions are monotonic. In this paper, we study the problem of k-skyband over uncertain data streams following the possible world semantics where each data element is associated with an occurrence probability. Firstly, a dynamic programming based algorithm is proposed to identify k-skyband results for a given set of uncertain elements regarding a pre-specified probability threshold. Secondly, we characterize the minimum set of elements to be kept in the sliding window to guarantee correct computing of k-skyband. Thirdly, efficient update techniques based on R-tree structures are developed to handle frequent updates of the elements over the sliding window. Extensive empirical studies demonstrate the efficiency and effectiveness of our techniques. © 2013 Springer-Verlag Berlin Heidelberg.

Feng, X., Zhao, X., Gao, Y. & Zhang, Y. 2013, 'Probabilistic top-k dominating query over sliding windows',

View/Download from: Publisher's site

*Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*, pp. 782-793.View/Download from: Publisher's site

Probabilistic queries on uncertain data have been intensively investigated lately. Top-k dominating query is important in many applications, e.g., decision making, as it offers choices which are better than the most of others. In this paper, we study the problem of probabilistic top-k dominating query over sliding windows. An efficient algorithm is developed to compute the exact solution. Extensive experiments are conducted to demonstrate the efficiency and effectiveness of the algorithm. © 2013 Springer-Verlag.

Zhang, Y., Zhang, W., Lin, Q. & Lin, X. 2012, 'Effectively Indexing the Multi-Dimensional Uncertain Objects for Range Searching',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 15th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Berlin, Germany, pp. 504-515.View/Download from: UTS OPUS or Publisher's site

The range searching problem is fundamental in a wide spectrum of applications such as radio frequency identification (RFID) networks, location based services (LBS), and global position system (GPS). As the uncertainty is inherent in those applications, it is highly demanded to address the uncertainty in the range search since the traditional techniques cannot be applied due to the inherence difference between the uncertain data and traditional data. In the paper, we propose a novel indexing structure, named U-Quadtree, to organize the uncertain objects in a multi-dimensional space such that the range searching can be answered efficiently by applying filtering techniques. Particularly, based on some insights of the range search on uncertain data, we propose a cost model which carefully considers various factors that may impact the performance of the range searching. Then an effective and efficient index construction algorithm is proposed to build the optimal U-Quadtree regarding the cost model. Comprehensive experiments demonstrate that our technique outperforms the existing works for range searching on multi-dimensional uncertain objects.

Zhan, L., Zhang, Y., Zhang, W. & Lin, X. 2012, 'Finding Top k Most Influential Spatial Facilities Over Uncertain Objects',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 21st ACM international conference on Information and knowledge management*, ACM international conference on Information and knowledge managemen, ACM, Maui, USA, pp. 922-931.View/Download from: UTS OPUS or Publisher's site

Uncertainty is inherent in many important applications, such as location-based services (LBS), sensor monitoring and radio-frequency identification (RFID). Recently, considerable research efforts have been put into the field of uncertainty-aware spatial query processing. In this paper, we study the problem of finding top k most influential facilities over a set of uncertain objects, which is an important spatial query in the above applications. Based on the maximal utility principle, we propose a new ranking model to identify the top k most influential facilities, which carefully captures influence of facilities on the uncertain objects. By utilizing two uncertain object indexing techniques, R-tree and U-Quadtree, effective and efficient algorithms are proposed following the filtering and verification paradigm, which significantly improves the performance of the algorithms in terms of CPU and I/O costs. Comprehensive experiments on real datasets demonstrate the effectiveness and efficiency of our techniques.

Yu, W., Lin, X., Zhang, W., Zhang, Y. & Le, J. 2012, 'SimFusion+: extending simfusion towards efficient estimation on large and dynamic networks',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieva*, International ACM SIGIR conference on Research and development in information retrieva, ACM, Portland, USA, pp. 365-374.View/Download from: UTS OPUS or Publisher's site

SimFusion has become a captivating measure of similarity between objects in a web graph. It is iteratively distilled from the notion that "the similarity between two objects is reinforced by the similarity of their related objects". The existing SimFusion model usually exploits the Unified Relationship Matrix (URM) to represent latent relationships among heterogeneous data, and adopts an iterative paradigm for SimFusion computation. However, due to the row normalization of URM, the traditional SimFusion model may produce the trivial solution; worse still, the iterative computation of SimFusion may not ensure the global convergence of the solution. This paper studies the revision of this model, providing a full treatment from complexity to algorithms. (1) We propose SimFusion+ based on a notion of the Unified Adjacency Matrix (UAM), a modification of the URM, to prevent the trivial solution and the divergence issue of SimFusion. (2) We show that for any vertex-pair, SimFusion+ can be performed in O(1) time and O(n) space with an O(km)-time precomputation done only once, as opposed to the O(kn3) time and O(n2) space of its traditional counterpart, where n, m, and k denote the number of vertices, edges, and iterations respectively. (3) We also devise an incremental algorithm for further improving the computation of SimFusion+ when networks are dynamically updated, with performance guarantees for similarity estimation. We experimentally verify that these algorithms scale well, and the revised notion of SimFusion is able to converge to a non-trivial solution, and allows us to identify more sensible structure information in large real-world networks.

Lin, Q., Zhang, Y., Zhang, W. & Li, A. 2012, 'General Spatial Skyline Operator',

View/Download from: UTS OPUS or Publisher's site

*LNCS - Database Systems for Advanced Applications - Proceedings of the 17th Conference on Database Systems for Advanced Applications*, Springer, Busan, South Korea, pp. 494-508.View/Download from: UTS OPUS or Publisher's site

With the emergence of location-aware mobile device technologies, communication technologies and GPS systems, various location-aware queries have attracted great attentions in the database literature. In many user recommendation systems, the spatial preference query is used to suggest the objects based on their spatial proximity to the facilities. In this paper, we study the problem of general spatial skyline which can provide a minimal set of candidates that contain optimal solutions for any monotonic distance based spatial preference query. An efficient algorithm is proposed to significantly reduce the number of non-promising objects in the computation. The paper also covers a comprehensive performance study of the proposed techniques based on both real and synthetic dat

Zhang, W., Xu, J., Liang, X., Zhang, Y. & Lin, X. 2012, 'Top-k Similarity Join over Multi-valued Objects',

View/Download from: UTS OPUS or Publisher's site

*LNCS - Database Systems for Advanced Applications - Proceedings of the 17th International Conference on Database Systems for Advanced Applications*, International Conference on Database Systems for Advanced Applications, Springer, Busan, Korea, pp. 509-525.View/Download from: UTS OPUS or Publisher's site

The top-k similarity joins have been extensively studied and used in a wide spectrum of applications such as information retrieval, decision making, spatial data analysis and data mining. Given two sets of objects U and V, a top-k similarity join returns k pairs of most similar objects from UV. In the conventional model of top-k similarity join processing, an object is usually regarded as a point in a multi-dimensional space and the similarity between two objects is usually measured by distance metrics such as Euclidean distance. However, in many applications an object may be described by multiple values (instances) and the conventional model is not applicable since it does not address the distributions of object instances. In this paper, we study top-k similarity join queries over multi-valued objects. We apply quantile based distance to explore the relative instance distribution among the multiple instances of objects. Efficient and effective techniques to process top-k similarity joins over multi-valued objects are developed following a filtering-refinement framework. Novel distance, statistic and weight based pruning techniques are proposed. Comprehensive experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of our techniques.

Lin, X., Zhang, Y., Zhang, W. & Cheema, M.A. 2011, 'Stochastic Skyline Operator',

View/Download from: UTS OPUS or Publisher's site

*2011 IEEE 27th International Conference on Data Engineering*, IEEE International Conference on Data Engineering, IEEE, Hannover, Germany, pp. 721-732.View/Download from: UTS OPUS or Publisher's site

In many applications involving the multiple criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multi-dimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring functions; that is, the skyline removes all objects not preferred by any user no mater how their preferences vary. Driven by many applications with uncertain data, the probabilistic skyline model is proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this paper we propose a novel skyline operator, namely stochastic skyline. In the light of the expected utility principle, stochastic skyline guarantees to provide the minimum set of candidates for the optimal solutions over all possible monotonic multiplicative utility functions. In contrast to the conventional skyline or the probabilistic skyline computation, we show that the problem of stochastic skyline is NP-complete with respect to the dimensionality.

Cheema, M.A., Lin, X., Zhang, W. & Zhang, Y. 2011, 'Influence zone: Efficiently processing reverse k nearest neighbors queries',

View/Download from: UTS OPUS or Publisher's site

*2011 IEEE 27th International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Hannover, Germany, pp. 577-588.View/Download from: UTS OPUS or Publisher's site

Given a set of objects and a query q, a point p is called the reverse k nearest neighbor (RkNN) of q if q is one of the k closest objects of p. In this paper, we introduce the concept of influence zone which is the area such that every point inside this area is the RkNN of q and every point outside this area is not the RkNN. The influence zone has several applications in location based services, marketing and decision support systems. It can also be used to efficiently process RkNN queries. First, we present efficient algorithm to compute the influence zone. Then, based on the influence zone, we present efficient algorithms to process RkNN queries that significantly outperform existing best known techniques for both the snapshot and continuous RkNN queries. We also present a detailed theoretical analysis to analyse the area of the influence zone and IO costs of our RkNN processing algorithms. Our experiments demonstrate the accuracy of our theoretical analysis.

Zhu, K., Zhang, W., Zhu, G., Zhang, Y. & Lin, X. 2011, 'BMC: An Efficient Method to Evaluate Probabilistic Reachability Queries',

View/Download from: UTS OPUS or Publisher's site

*LNCS - Database Systems for Advanced Applications - Part 1, Proceedings of the 16th International Conference on Database Systems for Advanced Applications*, International Conference on Database Systems for Advanced Applications, Springer, Hong Kong, pp. 434-449.View/Download from: UTS OPUS or Publisher's site

Reachability query is a fundamental problem in graph databases. It answers whether or not there exists a path between a source vertex and a destination vertex and is widely used in various applications including road networks, social networks, world wide web and bioinformatics. In some emerging important applications, uncertainties may be inherent in the graphs. For instance, each edge in a graph could be associated with a probability to appear. In this paper, we study the reachability problem over such uncertain graphs in a threshold fashion, namely, to determine if a source vertex could reach a destination vertex with probabilty larger than a user specified probability value t. Finding reachability on uncertain graphs has been proved to be NP-Hard. We first propose novel and effective bounding techniques to obtain the upper bound of reachability probability between the source and destination. If the upper bound fails to prune the query, efficient dynamic Monte Carlo simulation technqiues will be applied to answer the probabilitistic reachability query with an accuracy guarantee. Extensive experiments over real and synthetic datasets are conducted to demonstrate the efficiency and effectiveness of our techniques.

Shang, H., Lin, X., Zhang, Y., Yu, J.X. & Wang, W. 2010, 'Connected substructure similarity search',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 2010 ACM SIGMOD International Conference on Management of data*, ACM SIGMOD International Conference on Management of data, ACM, Indianapolis, USA, pp. 903-914.View/Download from: UTS OPUS or Publisher's site

Substructure similarity search is to retrieve graphs that approximately contain a given query graph. It has many applications, e.g., detecting similar functions among chemical compounds. The problem is challenging as even testing subgraph containment between two graphs is NP-complete. Hence, existing techniques adopt the filtering-and-verification framework with the focus on developing effective and efficient techniques to remove non-promising graphs. Nevertheless, existing filtering techniques may be still unable to effectively remove many "low" quality candidates. To resolve this, in this paper we propose a novel indexing technique, GrafD-Index, to index graphs according to their "distances" to features. We characterize a tight condition under which the distance-based triangular inequality holds. We then develop lower and upper bounding techniques that exploit the GrafD-Index to (1) prune non-promising graphs and (2) include graphs whose similarities are guaranteed to exceed the given similarity threshold. Considering that the verification phase is not well studied and plays the dominant role in the whole process, we devise efficient algorithms to verify candidates. A comprehensive experiment using real datasets demonstrates that our proposed methods significantly outperform existing methods.

Zhu, K., Zhang, Y., Lin, X., Zhu, G. & Wang, W. 2010, 'NOVA: A Novel and Efficient Framework for Finding Subgraph Isomorphism Mappings in Large Graphs',

View/Download from: UTS OPUS or Publisher's site

*LNCS - Database Systems for Advanced Applications - Part 1 Proceedings of the International Conference on Database Systems for Advanced Applications*, International Conference on Database Systems for Advanced Applications, Springer, Tsukuba, Japan, pp. 140-154.View/Download from: UTS OPUS or Publisher's site

Considerable efforts have been spent in studying subgraph problem. Traditional subgraph containment query is to retrieve all database graphs which contain the query graph g. A variation to that is to find all occurrences of a particular pattern(the query) in a large database graph. We call it subgraph matching problem. The state of art solution to this problem is GADDI. In this paper, we will propose a more efficient index and algorithm to answer subgraph matching problem. The index is based on the label distribution of neighbourhood vertices and it is structured as a multi-dimensional vector signature. A novel algorithm is also proposed to further speed up the isomorphic enumeration process. This algorithm attempts to maximize the computational sharing. It also attempts to predict some enumeration state is impossible to lead to a final answer by eagerly pruning strategy. We have performed extensive experiments to demonstrate the efficiency and the effectiveness of our technique.

Zhang, W., Lin, X., Cheema, M.A., Zhang, Y. & Wang, W. 2010, 'Quantile-Based KNN Over Multi-Valued Objects',

View/Download from: UTS OPUS or Publisher's site

*2010 IEEE 26th International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Long Beach, USA, pp. 16-27.View/Download from: UTS OPUS or Publisher's site

K Nearest Neighbor search has many applications including data mining, multi-media, image processing, and monitoring moving objects. In this paper, we study the problem of KNN over multi-valued objects. We aim to provide effective and efficient techniques to identify KNN sensitive to relative distributions of objects.We propose to use quantiles to summarize relative-distribution-sensitive K nearest neighbors. Given a query Q and a quantile Â Â (0, 1), we firstly study the problem of efficiently computing K nearest objects based on a Â-quantile distance e.g. median distance from each object to Q. The second problem is to retrieve the K nearest objects to Q based on overall distances in the Âbest populationÂ with a given size specified by Â-quantile for each object. While the first problem can be solved in polynomial time, we show that the 2nd problem is NP-hard. A set of efficient, novel algorithms have been proposed to give an exact solution for the first problem and an approximate solution for the second problem with the approximation ratio. Extensive experiment demonstrates that our techniques are very efficient and effective.

Zhang, Y., Lin, X., Zhu, G., Zhang, W. & Lin, Q. 2010, 'Efficient Rank Based KNN Processing over Uncertain Data',

View/Download from: UTS OPUS or Publisher's site

*2010 IEEE 26th International Conference on Engineering*, International Conference on Engineering, IEEE, Long Beach, USA, pp. 28-39.View/Download from: UTS OPUS or Publisher's site

Uncertain data are inherent in many applications such as environmental surveillance and quantitative economics research. As an important problem in many applications, KNN query has been extensively investigated in the literature. In this paper, we study the problem of processing rank based KNN query against uncertain data. Besides applying the expected rank semantic to compute KNN, we also introduce the median rank which is less sensitive to the outliers. We show both ranking methods satisfy nice top-k properties such as exact-k, containment, unique ranking, value invariance, stability and fairfulness. For given query q, IO and CPU efficient algorithms are proposed in the paper to compute KNN based on expected (median) ranks of the uncertain objects. To tackle the correlations of the uncertain objects and high IO cost caused by large number of instances of the uncertain objects, randomized algorithms are proposed to approximately compute KNN with theoretical guarantees. Comprehensive experiments are conducted on both real and synthetic data to demonstrate the efficiency of our techniques.

Shang, H., Zhu, K., Lin, X., Zhang, Y. & Ichise, R. 2010, 'Similarity Search on Supergraph Containment',

View/Download from: UTS OPUS or Publisher's site

*2010 IEEE 26th International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Long Beach, USA, pp. 637-648.View/Download from: UTS OPUS or Publisher's site

A supergraph containment search is to retrieve the data graphs contained by a query graph. In this paper, we study the problem of efficiently retrieving all data graphs approximately contained by a query graph, namely similarity search on supergraph containment. We propose a novel and efficient index to boost the efficiency of query processing. We have studied the query processing cost and propose two index construction strategies aimed at optimizing the performance of different types of data graphs: top-down strategy and bottom-up strategy. Moreover, a novel indexing technique is proposed by effectively merging the indexes of individual data graphs; this not only reduces the index size but also further reduces the query processing time. We conduct extensive experiments on real data sets to demonstrate the efficiency and the effectiveness of our techniques.

Zhang, W., Zhang, Y., Cheema, M.A. & Lin, X. 2010, 'Counting distinct objects over sliding windows',

View/Download from: UTS OPUS

*Proceedings of the Twenty-First Australasian Conference on Database Technologies*, Australasian Conference on Database Technologies, ACM, Brisbane, Australia, pp. 75-84.View/Download from: UTS OPUS

Aggregation against distinct objects has been involved in many real applications with the presence of duplicates, including real-time monitoring moving objects. In this paper, we investigate the problem of counting distinct objects over sliding windows with arbitrary lengths. We present novel, time and space efficient, one scan algorithms to continuously maintain a sketch so that the counting can be approximately conducted with a relative error guarantee e in the presence of object duplicates. Efficient query algorithms have also been developed by effectively utilizing the skyband property. Moreover, the proposed techniques may be immediately applied to the range counting aggregation and heavy hitter problem against distinct elements. A comprehensive performance study demonstrates that our algorithms can support real-time computation against high speed data streams.

Cheema, M.A., Lin, X., Zhang, Y., Wang, W. & Zhang, W. 2009, 'Lazy Updates: An Efficient Technique to Continuously Monitoring Reverse kNN',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, VLDB Endowment, ACM, Lyon, France, pp. 1138-1149.View/Download from: UTS OPUS or Publisher's site

In this paper, we study the problem of continuous monitoring of reverse k nearest neighbor queries. Existing continuous reverse nearest neighbor monitoring techniques are sensitive towards objects and queries movement. For example, the results of a query are to be recomputed whenever the query changes its location. We present a framework for continuous reverse k nearest neighbor queries by assigning each object and query with a rectangular safe region such that the expensive recomputation is not required as long as the query and objects remain in their respective safe regions. This significantly improves the computation cost. As a by-product, our framework also reduces the communication cost in client-server architectures because an object does not report its location to the server unless it leaves its safe region or the server sends a location update request. We also conduct a rigid cost analysis to guide an effective selection of such rectangular safe regions. The extensive experiments demonstrate that our techniques outperform the existing techniques by an order of magnitude in terms of computation cost and communication cost.

Hasan, M., Cheema, M.A., Lin, X. & Zhang, Y. 2009, 'Efficient Construction of Safe Regions for Moving kNN Queries over Dynamic Datasets',

View/Download from: UTS OPUS or Publisher's site

*LNCS - Advances in Spatial and Temporal Databases - Proceedings of SSTD*, SSTD, Springer, Aalborg, Denmark, pp. 373-379.View/Download from: UTS OPUS or Publisher's site

The concept of safe region has been used to reduce the computation and communication cost for the continuous monitoring of k nearest neighbor (kNN) queries. A safe region is an area such that as long as a query remains in it, the set of its kNNs does not change. In this paper, we present an efficient technique to construct the safe region by using cheap RangeNN queries. We also extend our approach for dynamic datasets (the objects may appear or disappear from the dataset). Our proposed algorithm outperforms existing algorithms and scales better with the increase in k

Zhang, W., Lin, X., Zhang, Y., Wang, W. & Yu, J.X. 2009, 'Probabilistic Skyline Operator over Sliding Windows',

View/Download from: UTS OPUS or Publisher's site

*IEEE 25th International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Shanghai, China, pp. 1060-1071.View/Download from: UTS OPUS or Publisher's site

Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of efficient processing of continuous skyline queries over sliding windows on uncertain data elements regarding given probability thresholds. We first characterize what kind of elements we need to keep in our query computation. Then we show the size of dynamically maintained candidate set and the size of skyline. We develop novel, efficient techniques to process a continuous, probabilistic skyline query. Finally, we extend our techniques to the applications where multiple probability thresholds are given or we want to retrieve "top-k" skyline data objects. Our extensive experiments demonstrate that the proposed techniques are very efficient and handle a high-speed data stream in real time.

Zhang, Y., Lin, X., Tao, Y. & Zhang, W. 2009, 'Uncertain Location based Range Aggregates in a multi-dimensional space',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 2009 IEEE International Conference on Data Engineering*, IEEE International Conference on Data Engineering, IEEE, Shanghai, China, pp. 1247-1250.View/Download from: UTS OPUS or Publisher's site

Uncertain data are inherent in many applications such as environmental surveillance and quantitative economics research. Recently, considerable research efforts have been put into the field of analysing uncertain data. In this paper, we study the problem of processing the uncertain location based range aggregate in a multi-dimensional space. We first formally introduce the problem, then propose a general filtering-and-verification framework to solve the problem. Two filtering techniques, named STF and PCR respectively, are proposed to significantly reduce the verification cost

Yang, S., Zhang, W., Zhang, Y. & Lin, X. 2009, 'Probabilistic threshold range aggregate query processing over uncertain data', *Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*, pp. 51-62.

View/Download from: Publisher's site

View/Download from: Publisher's site

Large amount of uncertain data is inherent in many novel and important applications such as sensor data analysis and mobile data management. A probabilistic threshold range aggregate (PTRA) query retrieves summarized information about the uncertain objects satisfying a range query, with respect to a given probability threshold. This paper is the first one to address this important type of query.We develop a new index structure aU-tree and propose an exact querying algorithm based on aU-tree. For the pursue of efficiency, two techniques SingleSample and DoubleSample are developed. Both techniques provide approximate answers to a PTRA query with accuracy guarantee. Experimental study demonstrates the efficiency and effectiveness of our proposed methods. © Springer-Verlag Berlin Heidelberg 2009.

Shang, H., Zhang, Y., Lin, X. & Yu, J.X. 2008, 'Taming Verification Hardness: an efficient algorithm for testing sub graph isomorphism',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, VLDB Endowment, ACM, Auckland, New Zealand, pp. 364-375.View/Download from: UTS OPUS or Publisher's site

Graphs are widely used to model complicated data semantics in many applications. In this paper, we aim to develop efficient techniques to retrieve graphs, containing a given query graph, from a large set of graphs. Considering the problem of testing subgraph isomorphism is generally NP-hard, most of the existing techniques are based on the framework of filtering-and-verification to reduce the precise computation costs; consequently various novel feature-based indexes have been developed. While the existing techniques work well for small query graphs, the verification phase becomes a bottleneck when the query graph size increases. Motivated by this, in the paper we firstly propose a novel and efficient algorithm for testing subgraph isomorphism, QuickSI. Secondly, we develop a new feature-based index technique to accommodate QuickSI in the filtering phase. Our extensive experiments on real and synthetic data demonstrate the efficiency and scalability of the proposed techniques, which significantly improve the existing techniques

Lin, X. & Zhang, Y. 2008, 'Aggregate computation over data streams', *Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*, pp. 10-25.

View/Download from: Publisher's site

View/Download from: Publisher's site

Nowadays, we have witnessed the widely recognized phenomenon of high speed data streams. Various statistics computation over data streams is often required by many applications, including processing of relational type queries, data mining and high speed network management. In this paper, we provide survey for three important kinds of aggregate computations over data streams: frequency moment, frequency count and order statistic. © 2008 Springer-Verlag Berlin Heidelberg.

Zhang, W., Lin, X., Pei, J. & Zhang, Y. 2008, 'Managing uncertain data: Probabilistic approaches',

View/Download from: Publisher's site

*Proceedings - The 9th International Conference on Web-Age Information Management, WAIM 2008*, pp. 405-412.View/Download from: Publisher's site

Uncertain data are inherent in many important applications. Recently, considerable research efforts have been put into the field of managing uncertain data. In this paper, we summarize existing techniques to query and model uncertain data and systems that effectively manage uncertain data, mainly from a probabilistic point of view. © 2008 IEEE.

Lin, X., Yuan, Y., Zhang, Q. & Zhang, Y. 2007, 'Selecting stars: The k most representative skyline operator',

View/Download from: Publisher's site

*Proceedings - International Conference on Data Engineering*, pp. 86-95.View/Download from: Publisher's site

Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of selecting k skyline points so that the number of points, which are dominated by at least one of these k skyline points, is maximized. We first present an efficient dynamic programming based exact algorithm in a 2d-space. Then, we show that the problem is NP-hard when the dimensionality is 3 or more and it can be approximately solved by a polynomial time algorithm with the guaranteed approximation ratio 1 - 1/e. To speed-up the computation, an efficient, scalable, index-based randomized algorithm is developed by applying the FM probabilistic counting technique. A comprehensive performance evaluation demonstrates that our randomized technique is very efficient, highly accurate, and scalable. © 2007 IEEE.

Zhang, Y., Lin, X., Yuan, Y., Kitsuregawa, M., Zhou, X. & Yu, J.X. 2007, 'Summarizing order statistics over data streams with duplicates',

View/Download from: Publisher's site

*Proceedings - International Conference on Data Engineering*, pp. 1329-1333.View/Download from: Publisher's site

Zhang, Y., Lin, X., Xu, J., Korn, F. & Wang, W. 2006, 'Space-efficient relative error order sketch over data streams',

View/Download from: Publisher's site

*Proceedings - International Conference on Data Engineering*, p. 51.View/Download from: Publisher's site

We consider the problem of continuously maintaining order sketches over data streams with a relative rank error guarantee . Novel space-efficient and one-scan randomised techniques are developed. Our first randomised algorithm can guarantee such a relative error precision with confidence 1 - using O(1/ 2 log 1/ loge 2 N) space, where N is the number of data elements seen so far in a data stream. Then, a new one-scan space compression technique is developed. Combined with the first randomised algorithm, the one-scan space compression technique yields another one-scan randomised algorithm that guarantees the space requirement is O(1/ log(1/ log 1/ (1/ log 1/) log 2+ N/1-1/2 ) (for > 0) on average while the worst case space remains O(1/ 2 log 1/ log 2 N). These results are immediately applicable to approximately computing quantiles over data streams with a relative error guarantee and significantly improve the previous best space bound O( 1/ 3 log 1/ log N). Our extensive experiment results demonstrate that both techniques can support an on-line computation against high speed data streams. © 2006 IEEE.

## Journal articles

Yang, J., Zhang, W., Zhang, Y., Wang, X. & Lin, X. 2016, 'Categorical top-k spatial influence query',

View/Download from: UTS OPUS or Publisher's site

*World Wide Web*, pp. 1-29.View/Download from: UTS OPUS or Publisher's site

© 2016 Springer Science+Business Media New York The influence of a spatial facility object depicts the importance of the object in the whole data space. In this paper, we present a novel definition of object influence in applications where objects are of different categories. We study the problem of Spatial Influence Query which considers the contribution of an object in forming functional units consisting of a given set of objects with different categories designated by users. We first show that the problem of spatial influence query is NP-hard with respect to the number of object categories in the functional unit. To tackle the computational hardness, we develop an efficient framework following two main steps, possible participants finding and optimal functional unit computation. Based on this framework, for the first step, novel and efficient pruning techniques are developed based on the nearest neighbor set (NNS) approach. To find the optimal functional unit efficiently, we propose two algorithms, an exact algorithm and an efficient approximate algorithm with performance guarantee. Comprehensive experiments on both real and synthetic datasets demonstrate the effectiveness and efficiency of our techniques.

Zhang, W., Lin, X., Zhang, Y., Zhu, K. & Zhu, G. 2016, 'Efficient Probabilistic Supergraph Search',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions on Knowledge and Data Engineering*, vol. 28, no. 4, pp. 965-978.View/Download from: UTS OPUS or Publisher's site

© 2015 IEEE. Given a query graph q, retrieving the data graphs g from a set D of data graphs such that q contains g, namely supergraph containment search, is fundamental in graph data analysis with a wide range of real applications. It is very challenging due to the NP-Completeness of subgraph isomorphism testing. Driven by many real applications, in this paper, we study the problem of probabilistic supergraph search; that is, given a set D of uncertain data graphs, a certain query graph q and a probability threshold , we retrieve the data graphs gu from D such that the probability of q containing gu is not smaller than . We show that besides the NP-Completeness of subgraph isomorphism testing, the problem of calculating probabilities is #P-Complete; thus, it is even more challenging than the supergraph containment search. To tackle the computational hardness, we first propose two novel pruning rules, based on probabilistic connectivity and features, respectively, to efficiently prune non-promising data graphs. Then, efficient verification algorithms are developed with the aim of sharing computation and terminating non-promising computation as early as possible. Extensive performance studies on both real and synthetic data demonstrate the efficiency and effectiveness of our techniques in practice.

Wang, X., Zhang, Y., Zhang, W., Lin, X. & Wang, W. 2015, 'AP-Tree: efficiently support location-aware Publish/Subscribe',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 24, no. 6, pp. 823-848.View/Download from: UTS OPUS or Publisher's site

© 2015, Springer-Verlag Berlin Heidelberg. We investigate the problem of efficiently supporting location-aware Publish/Subscribe (Pub/Sub for short), which is essential in many applications such as location-based recommendation and advertising, thanks to the proliferation of geo-equipped devices and the ensuing location-based social media applications. In a location-aware Pub/Sub system (e.g., an e-coupon system), subscribers can register their interest as spatial-keyword subscriptions (e.g., interest in nearby iphone discount); each incoming geo-textual message (e.g., geo-tagged e-coupon) will be delivered to all the relevant subscribers immediately. While there are several prior approaches aiming at providing efficient processing techniques for this problem, their approaches belong to spatial-prioritized indexing method which cannot well exploit the keyword distribution. In addition, their textual filtering techniques are built upon simple variants of traditional inverted indexes, which do not perform well for the textual constraint imposed by the problem. In this paper, we address the above limitations and provide a highly efficient solution based on a novel adaptive index, named AP-Tree. AP-Tree adaptively groups registered subscriptions using keyword and spatial partitions, guided by a cost model. AP-Tree also naturally indexes ordered keyword combinations. Furthermore, we show that our techniques can be extended to process moving spatial-keyword subscriptions, where subscribers can continuously update their locations. We present efficient algorithms to process both stationary and moving subscriptions, which can seamlessly and effectively integrate keyword and spatial partitions. Our extensive experiments demonstrate that AP-Tree and its variant AP+-Tree can achieve up to an order of magnitude improvement on efficiency compared with prior state-of-the-art methods.

Zhang, W., Li, A., Cheema, M.A., Zhang, Y. & Chang, L. 2015, 'Probabilistic n-of-N skyline computation over uncertain data streams',

View/Download from: UTS OPUS or Publisher's site

*World Wide Web*, vol. 18, no. 5, pp. 1331-1350.View/Download from: UTS OPUS or Publisher's site

© 2014, Springer Science+Business Media New York. Skyline operator is a useful tool in multi-criteria decision making in various applications. Uncertainty is inherent in real applications due to various reasons. In this paper, we consider the problem of efficiently computing probabilistic skylines against the most recent N uncertain elements in a data stream seen so far. Specifically, we study the problem in the n-of-N model; that is, computing the probabilistic skyline for the most recent n ( n N) elements, where an element is a probabilistic skyline element if its skyline probability is not below a given probability threshold q. Firstly, an effective pruning technique to minimize the number of uncertain elements to be kept is developed. It can be shown that on average storing only O(logdN) uncertain elements from the most recent N elements is sufficient to support the precise computation of all probabilistic n-of-N skyline queries in a d-dimension space if the data distribution on each dimension is independent. A novel encoding scheme is then proposed together with efficient update techniques so that computing a probabilistic n-of-N skyline query in a d-dimension space is reduced to O(dloglogN + s) if the data distribution is independent, where s is the number of skyline points. A trigger based technique is provided to process continuous n-of-N skyline queries. Extensive experiments demonstrate that the new techniques on uncertain data streams can support on-line probabilistic skyline query computation over rapid data streams.

Zhan, L., Zhang, Y., Zhang, W. & Lin, X. 2015, 'Finding Top k Most Influential Spatial Facilities over Uncertain Objects',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions on Knowledge and Data Engineering*, vol. 27, no. 12, pp. 3289-3303.View/Download from: UTS OPUS or Publisher's site

—Due to a variety of reasons including data randomness and incompleteness, noise, privacy, etc., uncertainty is inherent in
many important applications, such as location-based services (LBS), sensor network monitoring, and radio-frequency identification
(RFID). Recently, considerable research efforts have been devoted into the field of uncertainty-aware spatial query processing such that
the uncertainty of the data can be effectively and efficiently tackled. In this paper, we study the problem of finding top k most influential
facilities over a set of uncertain objects, which is an important and fundamental spatial query in the above applications. Based on the
maximal utility principle, we propose a new ranking model to identify the top k most influential facilities, which carefully captures influence
of facilities on the uncertain objects. By utilizing two uncertain object indexing techniques, R-tree and U-Quadtree, effective and efficient
algorithms are proposed following the filtering and verification paradigm, which significantly improves the performance of the algorithms
in terms of CPU and I/O costs. To effectively support uncertain objects with a large number of instances, we also develop randomized
algorithms with accuracy guarantee. Then, a hybrid algorithm is devised which effectively combines the randomized and exact
algorithms. Comprehensive experiments on real datasets demonstrate the effectiveness and efficiency of our techniques.

Zhang, Y., Zhang, W., Pei, J., Lin, X., Lin, Q. & Li, A. 2014, 'Consensus-Based Ranking of Multivalued Objects: A Generalized Borda Count Approach',

View/Download from: UTS OPUS or Publisher's site

*IEEE Trans. Knowl. Data Eng.*, vol. 26, pp. 83-96.View/Download from: UTS OPUS or Publisher's site

Zhang, Y., Zhang, W., Lin, Q., Lin, X. & Shen, H.T. 2014, 'Effectively Indexing the Multidimensional Uncertain Objects',

View/Download from: UTS OPUS or Publisher's site

*IEEE Trans. Knowl. Data Eng.*, vol. 26, pp. 608-622.View/Download from: UTS OPUS or Publisher's site

Zhang, W., Zhan, L., Zhang, Y., Cheema, M.A. & Lin, X. 2014, 'Efficient top-k similarity join processing over multi-valued objects',

View/Download from: UTS OPUS or Publisher's site

*World Wide Web*, vol. 17, pp. 285-309.View/Download from: UTS OPUS or Publisher's site

Zhang, W., Lin, X., Zhang, Y., Wang, W., Zhu, G. & Yu, J.X. 2013, 'Probabilistic Skyline Operator over Sliding Windows',

View/Download from: UTS OPUS or Publisher's site

*Information Systems*, vol. 38, no. 8, pp. 1212-1233.View/Download from: UTS OPUS or Publisher's site

Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of efficiently computing the skyline over sliding windows on uncertain data elements against probability thresholds. Firstly, we characterize the properties of elements to be kept in our computation. Then, we show the size of dynamically maintained candidate set and the size of skyline. Novel, efficient techniques are developed to process continuous probabilistic skyline queries over sliding windows. Finally, we extend our techniques to cover the applications where multiple probability thresholds are given, top-k skyline data objects are retrieved, or elements have individual life-spans. Our extensive experiments demonstrate that the proposed techniques are very efficient and can handle a high-speed data stream in real time.

Lin, Q., Zhang, Y., Zhang, W. & Lin, X. 2013, 'Efficient general spatial skyline computation',

View/Download from: UTS OPUS or Publisher's site

*World Wide Web*, vol. 16, no. 3, pp. 247-270.View/Download from: UTS OPUS or Publisher's site

With the emergence of location-aware mobile device technologies, communication technologies and GPS systems, the location based queries have attracted great attentions in the database literature. In many user recommendation web services, the spatial preference query is used to suggest the objects based on their spatial proximity with the facilities. In this paper, we study the problem of general spatial skyline (GSSKY) which can provide the minimal candidate set of the optimal solutions for any monotonic distance based spatial preference query. Efficient progressive algorithm called P-GSSKY is proposed to significantly reduce the number of non-promising objects in the computation. Moreover, we also propose spatial join based algorithm, called J-GSSKY, which can compute GSSKY efficiently in terms of I/O cost. The paper conducts a comprehensive performance study of the proposed techniques based on both real and synthetic data

Zhang, W., Lin, X., Zhang, Y., Cheema, M.A. & Zhang, Q. 2012, 'Stochastic Skylines',

View/Download from: UTS OPUS or Publisher's site

*ACM Transactions on Database Systems*, vol. 37, no. 2, pp. 1-34.View/Download from: UTS OPUS or Publisher's site

In many applications involving multiple criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions for selecting one object that fits best their personal needs. As a key feature, the skyline in a multidimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring functions; that is, the skyline removes all objects not preferred by any user no matter how their preferences vary. Driven by many recent applications with uncertain data, the probabilistic skyline model is proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this article we propose a novel skyline operator, namely stochastic skylines. In the light of the expected utility principle, stochastic skylines guarantee to provide the minimum set of candidates to optimal solutions over a family of utility functions. We first propose the lskyline operator based on the lower orthant orders. lskyline guarantees to provide the minimum set of candidates to the optimal solutions for the family of monotonic multiplicative utility functions. While lskyline works very effectively for the family of multiplicative functions, it may miss optimal solutions for other utility /scoring functions (e.g., linear functions). To resolve this, we also propose a general stochastic skyline operator, gskyline, based on the usual orders. gskyline provides the minimum candidate set to the optimal solutions for all monotonic functions.

Cheema, M.A., Zhang, W., Lin, X. & Zhang, Y. 2012, 'Efficiently processing snapshot and continuous reverse k nearest neighbors queries',

View/Download from: UTS OPUS

*VLDB Journal*, vol. 21, no. 5, pp. 702-728.View/Download from: UTS OPUS

Given a set of objects and a query q, a point p is called the reverse k nearest neighbor (RkNN) of q if q is one of the k closest objects of p. In this paper, we introduce the concept of influence zone that is the area such that every point inside this area is the RkNN of q and every point outside this area is not the RkNN. The influence zone has several applications in location-based services, marketing and decision support systems. It can also be used to efficiently process RkNN queries. First, we present efficient algorithm to compute the influence zone. Then, based on the influence zone, we present efficient algorithms to process RkNN queries that significantly outperform existing best-known techniques for both the snapshot and continuous RkNN queries. We also present a detailed theoretical analysis to analyze the area of the influence zone and IO costs of our RkNN processing algorithms. Our experiments demonstrate the accuracy of our theoretical analysis. This paper is an extended version of our previous work (Cheema et al. in Proceedings of ICDE, pp. 577---588, 2011). We make the following new contributions in this extended version: (1) we conduct a rigorous complexity analysis and show that the complexity of one of our proposed algorithms in Cheema et al. (Proceedings of ICDE, pp. 577---588, 2011) can be reduced from O(m 2) to O( km) where m > k is the number of objects used to compute the influence zone, (2) we show that our techniques can be applied to dimensionality higher than two, and (3) we present efficient techniques to handle data updates.

Cheema, M.A., Zhang, W., Lin, X., Zhang, Y. & Li, X. 2012, 'Continuous reverse k nearest neighbors queries in Euclidean space and in spatial networks',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 21, no. 1, pp. 69-95.View/Download from: UTS OPUS or Publisher's site

In this paper, we study the problem of continuous monitoring of reverse k nearest neighbors queries in Euclidean space as well as in spatial networks. Existing techniques are sensitive toward objects and queries movement. For example, the results of a query are to be recomputed whenever the query changes its location. We present a framework for continuous reverse k nearest neighbor (RkNN) queries by assigning each object and query with a safe region such that the expensive recomputation is not required as long as the query and objects remain in their respective safe regions. This significantly improves the computation cost. As a byproduct, our framework also reduces the communication cost in client---server architectures because an object does not report its location to the server unless it leaves its safe region or the server sends a location update request. We also conduct a rigid cost analysis for our Euclidean space RkNN algorithm. We show that our techniques can also be applied to answer bichromatic RkNN queries in Euclidean space as well as in spatial networks. Furthermore, we show that our techniques can be extended for the spatial networks that are represented by directed graphs. The extensive experiments demonstrate that our techniques outperform the existing techniques by an order of magnitude in terms of computation cost and communication cost.

Zhang, Y., Lin, X., Tao, Y., Zhang, W. & Wang, H. 2012, 'Efficient Computation of Range Aggregates against Uncertain Location-Based Queries',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions On Knowledge And Data Engineering*, vol. 24, no. 7, pp. 1244-1258.View/Download from: UTS OPUS or Publisher's site

In many applications, including location-based services, queries may not be precise. In this paper, we study the problem of efficiently computing range aggregates in a multidimensional space when the query location is uncertain. Specifically, for a query point Q whose location is uncertain and a set S of points in a multidimensional space, we want to calculate the aggregate (e.g., count, average and sum) over the subset S' of S such that for each p ? S', Q has at least probability ? within the distance ? to p. We propose novel, efficient techniques to solve the problem following the filtering-and-verification paradigm. In particular, two novel filtering techniques are proposed to effectively and efficiently remove data points from verification. Our comprehensive experiments based on both real and synthetic data demonstrate the efficiency and scalability of our techniques.

Cheema, M.A., Zhang, W., Lin, X. & Zhang, Y. 2012, 'Efficiently processing snapshot and continuous reverse k nearest neighbors queries',

View/Download from: Publisher's site

*VLDB Journal*, vol. 21, no. 5, pp. 703-728.View/Download from: Publisher's site

Given a set of objects and a query q, a point p is called the reverse k nearest neighbor (RkNN) of q if q is one of the k closest objects of p. In this paper, we introduce the concept of influence zone that is the area such that every point inside this area is the RkNN of q and every point outside this area is not the RkNN. The influence zone has several applications in location-based services, marketing and decision support systems. It can also be used to efficiently process RkNN queries. First, we present efficient algorithm to compute the influence zone. Then, based on the influence zone, we present efficient algorithms to process RkNN queries that significantly outperform existing best-known techniques for both the snapshot and continuous RkNN queries. We also present a detailed theoretical analysis to analyze the area of the influence zone and IO costs of our RkNN processing algorithms. Our experiments demonstrate the accuracy of our theoretical analysis. This paper is an extended version of our previous work (Cheema et al. in Proceedings of ICDE, pp. 577-588, 2011). We make the following new contributions in this extended version: (1) we conduct a rigorous complexity analysis and show that the complexity of one of our proposed algorithms in Cheema et al. (Proceedings of ICDE, pp. 577-588, 2011) can be reduced from O(m 2) to O(km) where m > k is the number of objects used to compute the influence zone, (2) we show that our techniques can be applied to dimensionality higher than two, and (3) we present efficient techniques to handle data updates. © 2012 Springer-Verlag.

Zhang, Y., Zhang, W., Lin, X., Jiang, B. & Pei, J. 2011, 'Ranking uncertain sky: The probabilistic top-k skyline operator',

View/Download from: UTS OPUS or Publisher's site

*Information Systems*, vol. 36, no. 5, pp. 898-915.View/Download from: UTS OPUS or Publisher's site

Many recent applications involve processing and analyzing uncertain data. In this paper, we combine the feature of top-k objects with that of skyline to model the problem of top-k skyline objects against uncertain data. The problem of efficiently computing top-k skyline objects on large uncertain datasets is challenging in both discrete and continuous cases. In this paper, firstly an efficient exact algorithm for computing the top-k skyline objects is developed for discrete cases. To address applications where each object may have a massive set of instances or a continuous probability density function, we also develop an efficient randomized algorithm with an @e@?approximation guarantee. Moreover, our algorithms can be immediately extended to efficiently compute p-skyline; that is, retrieving the uncertain objects with skyline probabilities above a given threshold. Our extensive experiments on synthetic and real data demonstrate the efficiency of both algorithms and the randomized algorithm is highly accurate. They also show that our techniques significantly outperform the existing techniques for computing p-skyline.

Zhang, W., Lin, X., Zhang, Y., Pei, J. & Wang, W. 2010, 'Threshold-based Probabilistic Top-k Dominating Queries',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 19, no. 2, pp. 283-305.View/Download from: UTS OPUS or Publisher's site

Recently, due to intrinsic characteristics in many underlying data sets, a number of probabilistic queries on uncertain data have been investigated. Top-k dominating queries are very important in many applications including decision making in a multidimensional space. In this paper, we study the problem of efficiently computing top-k dominating queries on uncertain data. We first formally define the problem. Then, we develop an efficient, threshold-based algorithm to compute the exact solution. To overcome some inherent computational deficiency in an exact computation, we develop an efficient randomized algorithm with an accuracy guarantee. Our extensive experiments demonstrate that both algorithms are quite efficient, while the randomized algorithm is quite scalable against data set sizes, object areas, k values, etc. The randomized algorithm is also highly accurate in practice.

Zhang, Y., Lin, X., Zhang, W., Wang, J. & Lin, Q. 2010, 'Effectively Indexing the Uncertain Space',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions On Knowledge And Data Engineering*, vol. 22, no. 9, pp. 1247-1261.View/Download from: UTS OPUS or Publisher's site

With the rapid development of various optical, infrared, and radar sensors and GPS techniques, there are a huge amount of multidimensional uncertain data collected and accumulated everyday. Recently, considerable research efforts have been made in the field of indexing, analyzing, and mining uncertain data. As shown in a recent book on uncertain data, in order to efficiently manage and mine uncertain data, effective indexing techniques are highly desirable. Based on the observation that the existing index structures for multidimensional data are sensitive to the size or shape of uncertain regions of uncertain objects and the queries, in this paper, we introduce a novel R-Tree-based inverted index structure, named UI-Tree, to efficiently support various queries including range queries, similarity joins, and their size estimation, as well as top-k range query, over multidimensional uncertain objects against continuous or discrete cases. Comprehensive experiments are conducted on both real data and synthetic data to demonstrate the efficiency of our techniques.

Zhang, Y., Lin, X., Yuan, Y., Kitsuregawa, M., Zhou, X. & Yu, J.X. 2010, 'Duplicate-Insensitive Order Statistics Computation over Data Streams',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions On Knowledge And Data Engineering*, vol. 22, no. 4, pp. 493-507.View/Download from: UTS OPUS or Publisher's site

Duplicates in data streams may often be observed by the projection on a subspace and/or multiple recordings of objects. Without the uniqueness assumption on observed data elements, many conventional aggregates computation problems need to be further investigated due to their duplication-sensitive nature. In this paper, we present novel, space-efficient, one-scan algorithms to continuously maintain duplicate-insensitive order sketches so that rank-based queries can be approximately processed with a relative rank error guarantee epsilon in the presence of data duplicates. Besides the space efficiency, the proposed algorithms are time-efficient and highly accurate. Moreover, our techniques may be immediately applied to the heavy hitter problem against distinct elements and to the existing fault-tolerant distributed communication techniques. A comprehensive performance study demonstrates that our algorithms can support real-time computation against high-speed data streams.