# Dr Lu Qin

### Biography

In July of 2006, I received my bachelor degree from department of Computer Science and Technology in Renmin University of China (RUC). In August of 2010, I received the Ph.D degree in the Chinese University of Hong Kong (CUHK). My thesis adviser is Prof. Jeffrey Xu Yu. I have been a postdoc in the Chinese University of Hong Kong from August 2010 to August 2013. I am now a senior lecturer and a core member in the Centre of Artificial Intelligence (CAI) in University of Technology, Sydney (UTS). My research interests include Big Graph Analytics and Graph Query Processing.

**Senior Lecturer,**A/DRsch Centre for Artificial Intelligence

**Core Member,**Centre for Artificial Intelligence

Bch, PhD

**Phone**

+61 2 9514 9742

**ORCID**

**Can supervise:**Yes

Category 1: Supervise All

**Current PhD Students:**

Wentao Li (From 07/2016)

Dian Ouyang (From 07/2015)

Dong Wen (From 01/2015)

Fan Zhang (Co-supervision From 07/2014)

## Books

Yu, J.X., Qin, L. & Chang, L. 2010,

View/Download from: UTS OPUS or Publisher's site

*Keyword Search in Databases*, 1, Morgan & Claypool Publishers, NA.View/Download from: UTS OPUS or Publisher's site

It has become highly desirable to provide users with ?exible ways to query/search information over databases as simple as keyword search like Google search. This book surveys the recent developments on keyword search over databases, and focuses on ?nding structural information among objects in a database using a set of keywords. Such structural information to be returned can be either trees or subgraphs representing how the objects, that contain the required keywords, are interconnected in a relational database or in an XML database. The structural keyword search is completely different from ?nding documents that contain all the user-given keywords. The former focuses on the interconnected object structures, whereas the latter focuses on the object content. The book is organized as follows. In Chapter 1, we highlight the main research issues on the structural keyword search in different contexts. In Chapter 2, we focus on supporting structural keyword search in a relational database management system using the SQL query language. We concentrate on how to generate a set of SQL queries that can ?nd all the structural information among records in a relational database completely, and how to evaluate the generated set of SQL queries ef?ciently.In Chapter 3,we discuss graph algorithms for structural keyword search by treating an entire relational database as a large data graph. In Chapter 4, we discuss structural keyword search in a large tree-structured XML database.In Chapter 5,we highlight several interesting research issues regarding keyword search on databases. The book can be used as either an extended survey for people who are interested in the structural keyword search or a reference book for a postgraduate course on the related topics.

## Conferences

Liu, S., Liu, A., Zhao, L., Liu, G., Li, Z., Zhao, P., Zheng, K. & Qin, L. 2016, 'Efficient query processing with mutual privacy protection for location-based services',

View/Download from: Publisher's site

*DASFAA 2016: Database Systems for Advanced Applications*, International Conference on Database Systems for Advanced Applications, Springer, Dallas, Texas, Unites States, pp. 299-313.View/Download from: Publisher's site

© Springer International Publishing Switzerland 2016.Data privacy in location-based services involves two aspects. The location of a user is a kind of private data as many sensitive information can be inferred from it given some background knowledge. On the other hand, the POI database is a great asset to the LBS provider as its construction requires many resources and efforts. In this paper, we propose a method of protecting mutual privacy (i.e., the location of the user issuing a query and the POI database of the LBS provider) for location-based query processing. Our approach consists of two steps: data preparation and query processing. Data preparation is conducted by LBS itself and is totally an offline computation, while query processing involves some online computation and multiple rounds of communication between LBS and the user. We implement the query processing by two rounds of oblivious transfer extension (OT-Extension) on two small key sets, resulting an immediate response even on some big POI databases. We also theoretically prove the security and analyze the complexity of our approach. Compared with two state-of-the-art methods, our approach has several orders of magnitude improvement in response time, at the expense of little and acceptable communication cost.

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2016, 'I/O efficient ECC graph decomposition via graph reduction',

*Proceedings of the 42nd International Conference on Very Large Data Bases*, International Conference on Very Large Data Bases, VLDB Endowment, New Delhi, India, pp. 516-527. © 2016 VLDB Endowment 21508097/16/03.The problem of computing k-edge connected components (k-ECCs) of a graph G for a specific k is a fundamental graph problemand has been investigated recently. In this paper, we study the problemof ECC decomposition, which computes the k-ECCs of a graphG for all k values. ECC decomposition can be widely applied ina variety of applications such as graph-topology analysis, communitydetection, Steiner component search, and graph visualization.A straightforward solution for ECC decomposition is to apply theexisting k-ECC computation algorithm to compute the k-ECCs forall k values. However, this solution is not applicable to large graphsfor two challenging reasons. First, all existing k-ECC computationalgorithms are highly memory intensive due to the complex datastructures used in the algorithms. Second, the number of possiblek values can be very large, resulting in a high computational costwhen each k value is independently considered. In this paper, weaddress the above challenges, and study I/O efficient ECC decompositionvia graph reduction. We introduce two elegant graph reductionoperators which aim to reduce the size of the graph loadedin memory while preserving the connectivity information of a certainset of edges to be computed for a specific k. We also proposethree novel I/O efficient algorithms, Bottom-Up, Top-Down, andHybrid, that explore the k values in different orders to reduce the redundantcomputations between different k values. We analyze theI/O and memory costs for all proposed algorithms. In our experiments, we evaluate our algorithms using seven real large datasetswith various graph properties, one of which contains 1:95 billionedges. The experimental results show that our proposed algorithmsare scalable and efficient.

Feng, X., Chang, L., Lin, X., Qin, L. & Zhang, W. 2016, 'Computing Connected Components with linear communication cost in pregel-like systems',

View/Download from: UTS OPUS or Publisher's site

*2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, International Conference on Data Engineering (ICDE), IEEE, Helsinki, Finland, pp. 85-96.View/Download from: UTS OPUS or Publisher's site

© 2016 IEEE.The paper studies two fundamental problems in graph analytics: computing Connected Components (CCs) and computing BiConnected Components (BCCs) of a graph. With the recent advent of Big Data, developing effcient distributed algorithms for computing CCs and BCCs of a big graph has received increasing interests. As with the existing research efforts, in this paper we focus on the Pregel programming model, while the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur O(m #supersteps) total costs for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since the network communication speed is usually much slower than the computation speed, communication costs are the dominant costs of the total running time in the existing techniques. In this paper, we propose a new paradigm based on graph decomposition to reduce the total communication costs from O(m#supersteps) to O(m), for both computing CCs and computing BCCs. Moreover, the total computation costs of our techniques are smaller than that of the existing techniques in practice, though theoretically they are almost the same. Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by one order of magnitude regarding the total running time.

Chang, L., Li, W., Lin, X., Qin, L. & Zhang, W. 2016, 'PSCAN: Fast and exact structural graph clustering',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, IEEE International Conference on Data Engineering (ICDE), IEEE, Piscataway, USA, pp. 253-264.View/Download from: UTS OPUS or Publisher's site

© 2016 IEEE.In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering is to assign vertices in V to clusters and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected to each other. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs due to exhaustively computing structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering, which present opportunities for further optimization. Based on these observations, in this paper we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach aiming to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Noticeably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.

Li, Z., Qin, L., Cheng, H., Zhang, X. & Zhou, X. 2016, 'TRIP: An interactive retrieving-inferring data imputation approach',

View/Download from: Publisher's site

*2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, pp. 1462-1463.View/Download from: Publisher's site

© 2016 IEEE. Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitive string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1] . Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1] . This webbased retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1].

Lyu, B., Qin, L., Lin, X., Chang, L. & Yu, J.X. 2016, 'Scalable supergraph search in large graph databases',

View/Download from: Publisher's site

*2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, International Conference on Data Engineering (ICDE), Helsinki, Finland, pp. 157-168.View/Download from: Publisher's site

© 2016 IEEE.Supergraph search is a fundamental problem in graph databases that is widely applied in many application scenarios. Given a graph database and a query-graph, supergraph search retrieves all data-graphs contained in the query-graph from the graph database. Most existing solutions for supergraph search follow the pruning-and-verification framework, which prunes false answers based on features in the pruning phase and performs subgraph isomorphism testings on the remaining graphs in the verification phase. However, they are not scalable to handle large-sized data-graphs and query-graphs due to three drawbacks. First, they rely on a frequent subgraph mining algorithm to select features which is expensive and cannot generate large features. Second, they require a costly verification phase. Third, they process features in a fixed order without considering their relationship to the query-graph. In this paper, we address the three drawbacks and propose new indexing and query processing algorithms. In indexing, we select features directly from the data-graphs without expensive frequent subgraph mining. The features form a feature-tree that contains all-sized features and both the cost sharing and pruning power of the features are considered. In query processing, we propose a verification-free algorithm, where the order to process features is query-dependent by considering both the cost sharing and the pruning power. We explore two optimization strategies to further improve the algorithm efficiency. The first strategy applies a lightweight graph compression technique and the second strategy optimizes the inclusion of answers. Finally, we conduct extensive performance studies on two real large datasets to demonstrate the high scalability of our algorithms.

Bi, F., Chang, L., Lin, X., Qin, L. & Zhang, W. 2016, 'Efficient subgraph matching by postponing Cartesian products',

View/Download from: Publisher's site

*Proceedings of the ACM SIGMOD International Conference on Management of Data*, Conference Association for Computing Machinery SIGMOD / PODS (ACM - SIGMOD / PODS), Association of Computing Machinery ACM, San Francisco, California, United States, pp. 1199-1214.View/Download from: Publisher's site

© 2016 ACM.In this paper, we study the problem of subgraph matching that extracts all subgraph isomorphic embeddings of a query graph q in a large data graph G. The existing algorithms for subgraph matching follow Ullmann's backtracking approach; that is, iteratively map query vertices to data vertices by following a matching order of query vertices. It has been shown that the matching order of query vertices is a very important aspect to the efficiency of a subgraph matching algorithm. Recently, many advanced techniques, such as enforcing connectivity and merging similar vertices in query or data graphs, have been proposed to provide an effective matching order with the aim to reduce unpromising intermediate results especially the ones caused by redundant Cartesian products. In this paper, for the first time we address the issue of unpromising results by Cartesian products from "dissimilar" vertices. We propose a new framework by postponing the Cartesian products based on the structure of a query to minimize the redundant Cartesian products. Our second contribution is proposing a new path-based auxiliary data structure, with the size O(|E(G)| |V(q)|), to generate a matching order and conduct subgraph matching, which significantly reduces the exponential size O(|V(G)||V(q)|-1) of the existing path-based auxiliary data structure, where V(G) and E(G) are the vertex and edge sets of a data graph G, respectively, and V(q) is the vertex set of a query q. Extensive empirical studies on real and synthetic graphs demonstrate that our techniques outperform the state-of-the-art algorithms by up to 3 orders of magnitude.

Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2016, 'Efficient and progressive Group Steiner Tree search',

View/Download from: Publisher's site

*Proceedings of the ACM SIGMOD International Conference on Management of Data*, Association for Computing Machinery SIGMOD / PODS Conference (ACM SIGMOD / PODS), Association of Computing Machinery ACM, San Francisco, California, United States, pp. 91-106.View/Download from: Publisher's site

© 2016 ACM.The Group Steiner Tree (GST) problem is a fundamental problem in database area that has been successfully applied to keyword search in relational databases and team search in social networks. The state-of-the-art algorithm for the GST problem is a parameterized dynamic programming (DP) algorithm, which finds the optimal tree in O(3n + 2 (n log n + m)) time, where is the number of given groups, m and n are the number of the edges and nodes of the graph respectively. The major limitations of the parameterized DP algorithm are twofold: (i) it is intractable even for very small values of (e.g., = 8) in large graphs due to its exponential complexity, and (ii) it cannot generate a solution until the algorithm has completed its entire execution. To overcome these limitations, we propose an efficient and progressive GST algorithm in this paper, called PrunedDP. It is based on newly-developed optimal-tree decomposition and conditional tree merging techniques. The proposed algorithm not only drastically reduces the search space of the parameterized DP algorithm, but it also produces progressively-refined feasible solutions during algorithm execution. To further speed up the PrunedDP algorithm, we propose a progressive A-search algorithm, based on several carefully-designed lower-bounding techniques. We conduct extensive experiments to evaluate our algorithms on several large scale real-world graphs. The results show that our best algorithm is not only able to generate progressively-refined feasible solutions, but it also finds the optimal solution with at least two orders of magnitude acceleration over the state-of-the-art algorithm, using much less memory.

Zhang, H., Zhu, Y., Qin, L., Cheng, H. & Yu, J.X. 2016, 'Efficient triangle listing for billion-scale graphs',

View/Download from: Publisher's site

*Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016*, pp. 813-822.View/Download from: Publisher's site

© 2016 IEEE. This paper addresses the classical triangle listing problem, which aims at enumerating all the tuples of three vertices connected with each other by edges. This problem has been intensively studied in internal and external memory, but it is still an urgent challenge in distributed environment where multiple machines across the network can be utilized to achieve good performance and scalability. As one of the de facto computing methodologies in distributed environment, MapReduce has been used in some of existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders the scalability on large scale graphs. In this paper, we propose a new triangle listing algorithm in MapReduce, FTL, which utilizes a light weight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and also is equipped with multiple-round techniques to ease the burden on memory and network bandwidth when dealing with graphs at billion scale. We prove that the size of the intermediate data can be well bounded near to the number of triangles in the graph. To further reduce the shuffle size in each round, we also devise a compact data structure to store the intermediate data, which can save space up to 2/3. The extensive experimental results show that our algorithms outperform existing competitors by several times on large real world graphs.

Chang, L., Lin, X., Qin, L., Yu, J.X. & Pei, J. 2015, 'Efficiently Computing Top-K Shortest Path Join',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015.*, pp. 133-144.View/Download from: UTS OPUS or Publisher's site

Chang, L., Lin, X., Zhang, W., Yu, J.X., Zhang, Y. & Qin, L. 2015, 'Optimal Enumeration: Efficient Top-k Tree Matching',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, 41st International Conference on Very Large Data Bases,, VLDB Endowment, Kahola Coast, Hawaii, pp. 533-544.View/Download from: UTS OPUS or Publisher's site

Driven by many real applications, graph pattern matching has attracted
a great deal of attention recently. Consider that a twigpattern
matching may result in an extremely large number of matches
in a graph; this may not only confuse users by providing too many
results but also lead to high computational costs. In this paper,
we study the problem of top-k tree pattern matching; that is, given
a rooted tree T, compute its top-k matches in a directed graph G
based on the twig-pattern matching semantics. We firstly present
a novel and optimal enumeration paradigm based on the principle
of Lawler's procedure. We show that our enumeration algorithm
runs in O(nT + log k) time in each round where nT is the number
of nodes in T. Considering that the time complexity to output a
match of T is O(nT ) and nT log k in practice, our enumeration
technique is optimal. Moreover, the cost of generating top-1 match
of T in our algorithm is O(mR) where mR is the number of edges
in the transitive closure of a data graph G involving all relevant
nodes to T. O(mR) is also optimal in the worst case without preknowledge
of G. Consequently, our algorithm is optimal with the
running time O(mR +k(nT +log k)) in contrast to the time complexity
O(mR log k+knT (log k+dT )) of the existing technique where dT
is the maximal node degree in T. Secondly, a novel priority based
access technique is proposed, which greatly reduces the number of
edges accessed and results in a significant performance improvement.
Finally, we apply our techniques to the general form of top-k
graph pattern matching problem (i.e., query is a graph) to improve
the existing techniques. Comprehensive empirical studies demonstrate
that our techniques may improve the existing techniques by
orders of magnitude.

Li, R.-.H., Qin, L., Yu, J.X. & Mao, R. 2015, 'Influential Community Search in Large Networks',

View/Download from: UTS OPUS

*Proceedings of the VLDB Endowment*, 41st International Conference on Very Large Data Bases, Proceedings of the Vldb Endowment International Conference on Very Large Data Bases, Kohala Coast, Hawaii, pp. 509-520.View/Download from: UTS OPUS

Community search is a problem of finding densely connected subgraphs
that satisfy the query conditions in a network, which has
attracted much attention in recent years. However, all the previous
studies on community search do not consider the influence of
a community. In this paper, we introduce a novel community model
called k-influential community based on the concept of k-core,
which can capture the influence of a community. Based on the
new community model, we propose a linear-time online search algorithm
to find the top-r k-influential communities in a network.
To further speed up the influential community search algorithm,
we devise a linear-space index structure which supports efficient
search of the top-r k-influential communities in optimal time. We
also propose an efficient algorithm to maintain the index when the
network is frequently updated. We conduct extensive experiments
on 7 real-world large networks, and the results demonstrate the ef-
ficiency and effectiveness of the proposed methods

Chang, L., Lin, X., Qin, L., Yu, J.X. & Zhang, W. 2015, 'Index-based optimal algorithms for computing steiner components with maximum connectivity',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the ACM SIGMOD International Conference on Management of Data*, pp. 459-474.View/Download from: UTS OPUS or Publisher's site

With the proliferation of graph applications, the problem of efficiently computing all k-edge connected components of a graph G for a user-given k has been recently investigated. In this paper, we study the problem of efficiently computing the steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC. To accommodate online query processing, we present an efficient algorithm based on a novel index such that the algorithm runs in linear time regarding the result size; thus, the algorithm is optimal since it needs at least linear time to output the result. Moreover, in this paper we also investigate variations of the above problem. We show that such a problem with the constraint that the size of the SMCC is not smaller than a given size can also be solved in linear time regarding the result size (thus, optimal). We also show that the problem of computing the connectivity (rather than the graph details) of SMCC can be solved in linear time regarding the query size (thus, optimal). To build the index, we extend the techniques in [7] to accommodate batch processing and computation sharing. To efficiently support the applications with graph updates, we also present novel increment techniques. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our index-based algorithms significantly outperform baseline algorithms by several orders of magnitude and our indexing algorithms are efficient.

Qin, L., Li, R.H., Chang, L. & Zhang, C. 2015, 'Locally densest subgraph discovery',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Sydney, Australia, pp. 965-974.View/Download from: UTS OPUS or Publisher's site

© 2015 ACM. Mining dense subgraphs from a large graph is a fundamental graph mining task and can be widely applied in a variety of application domains such as network science, biology, graph database, web mining, graph compression, and micro-blogging systems. Here a dense subgraph is defined as a subgraph with high density (#.edge/#.node). Existing studies of this problem either focus on finding the densest subgraph or identifying an optimal clique-like dense subgraph, and they adopt a simple greedy approach to find the top-k dense subgraphs. However, their identified subgraphs cannot be used to represent the dense regions of the graph. Intuitively, to represent a dense region, the subgraph identified should be the subgraph with highest density in its local region in the graph. However, it is non-trivial to formally model a locally densest subgraph. In this paper, we aim to discover top-k such representative locally densest subgraphs of a graph. We provide an elegant parameter-free definition of a locally densest subgraph. The definition not only fits well with the intuition, but is also associated with several nice structural properties. We show that the set of locally densest subgraphs in a graph can be computed in polynomial time. We further propose three novel pruning strategies to largely reduce the search space of the algorithm. In our experiments, we use several real datasets with various graph properties to evaluate the effectiveness of our model using four quality measures and a case study. We also test our algorithms on several real web-scale graphs, one of which contains 118.14 million nodes and 1.02 billion edges, to demonstrate the high efficiency of the proposed algorithms.

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2015, 'Diversified top-k clique search',

View/Download from: UTS OPUS or Publisher's site

*Proceedings - International Conference on Data Engineering*, pp. 387-398.View/Download from: UTS OPUS or Publisher's site

© 2015 IEEE. Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns enormous cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem which is to find top-k maximal cliques that can cover most number of nodes in the graph. Diversified top-k clique search can be widely used in a lot of applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight PNP-Index, based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. We conduct extensive performance studies on six real graphs one of which contains 0.3 billion edges, and the results demonstrate the high efficiency and effectiveness of our approach.

Li, R.H., Yu, J.X., Qin, L., Mao, R. & Jin, T. 2015, 'On random walk based graph sampling',

View/Download from: UTS OPUS or Publisher's site

*Proceedings - International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Seoul, South Korea, pp. 927-938.View/Download from: UTS OPUS or Publisher's site

© 2015 IEEE. Random walk based graph sampling has been recognized as a fundamental technique to collect uniform node samples from a large graph. In this paper, we first present a comprehensive analysis of the drawbacks of three widely-used random walk based graph sampling algorithms, called re-weighted random walk (RW) algorithm, Metropolis-Hastings random walk (MH) algorithm and maximum-degree random walk (MD) algorithm. Then, to address the limitations of these algorithms, we propose two general random walk based algorithms, named rejection-controlled Metropolis-Hastings (RCMH) algorithm and generalized maximum-degree random walk (GMD) algorithm. We show that RCMH balances the tradeoff between the limitations of RW and MH, and GMD balances the tradeoff between the drawbacks of RW and MD. To further improve the performance of our algorithms, we integrate the so-called delayed acceptance technique and the non-backtracking random walk technique into RCMH and GMD respectively. We conduct extensive experiments over four real-world datasets, and the results demonstrate the effectiveness of the proposed algorithms.

Zhang, Z., Yu, J.X., Qin, L. & Shang, Z. 2015, 'Divide & conquer: I/O efficient depth-first search',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the ACM SIGMOD International Conference on Management of Data*, pp. 445-458.View/Download from: UTS OPUS or Publisher's site

Copyright © 2015 ACM. Depth-First Search (DFS), which traverses a graph in the depthfirst order, is one of the fundamental graph operations, and the result of DFS over all nodes in G is a spanning tree known as a DFS-Tree. There are many graph algorithms that need DFS such as connected component computation, topological sort, community detection, eulerian path computation, graph bipartiteness testing, planar graph testing, etc, because the in-memory DFS algorithm shows it can be done in linear time w.r.t. the size of G. However, given the fact that real-world graphs grow rapidly in the big data era, the in-memory DFS algorithm cannot be used to handle a large graph that cannot be entirely held in main memory. In this paper, we focus on I/O efficiency and study semi-external algorithms to DFS a graph G which is on disk. Here, like the existing semiexternal algorithms, we assume that a spanning tree of G can be held in main memory and the remaining edges of G are kept on disk, and compute the DFS-Tree in main memory with which DFS can be identified. We propose novel divide & conquer algorithms to DFS over a graph G on disk. In brief, we divide a graph into several subgraphs, compute the DFS-Tree for each subgraph independently, and then merge them together to compute the DFS-Tree for the whole graph. With the global DFS-Tree computed we identify DFS. We discuss the valid division, that can lead to the correct DFS, and the challenges to do so. We propose two division algorithms, named Divide-Star and Divide-TD, and a merge algorithm. We conduct extensive experimental studies using four real massive datasets and several synthetic datasets to confirm the I/O efficiency of our approach.

Lai, L., Qin, L., Lin, X. & Chang, L. 2015, 'Scalable Subgraph Enumeration in MapReduce.',

View/Download from: UTS OPUS

*Proceedings of the VLDB Endowment*, 41st International Conference!on! Very Large Data Bases, VLDB Endowment, Kohala Coast, Hawaii, pp. 974-985.View/Download from: UTS OPUS

Subgraph enumeration, which aims to find all the subgraphs of a
large data graph that are isomorphic to a given pattern graph, is
a fundamental graph problem with a wide range of applications.
However, existing sequential algorithms for subgraph enumeration
fall short in handling large graphs due to the involvement of computationally
intensive subgraph isomorphism operations. Thus, some
recent researches focus on solving the problem using MapReduce.
Nevertheless, exiting MapReduce approaches are not scalable to
handle very large graphs since they either produce a huge number
of partial results or consume a large amount of memory. Motivated
by this, in this paper, we propose a new algorithm TwinTwigJoin
based on a left-deep-join framework in MapReduce, in which the
basic join unit is a TwinTwig (an edge or two incident edges of
a node). We show that in the Erdos-R ¨ enyi random-graph model, ´
TwinTwigJoin is instance optimal in the left-deep-join framework
under reasonable assumptions, and we devise an algorithm to compute
the optimal join plan. Three optimization strategies are explored
to improve our algorithm. Furthermore, we discuss how our
approach can be adapted in the power-law random-graph model.We
conduct extensive performance studies in several real graphs, one
of which contains billions of edges. Our approach significantly outperforms
existing solutions in all tests.

Zhang, S., Qin, L., Zheng, Y. & Cheng, H. 2015, 'Effective and efficient: Large-scale dynamic city express',

View/Download from: UTS OPUS or Publisher's site

*GIS: Proceedings of the ACM International Symposium on Advances in Geographic Information Systems*, ACM International Symposium on Advances in Geographic Information Systems, ACM, Seattle, Washington.View/Download from: UTS OPUS or Publisher's site

City express services are in great demand in recent years. However, the current city express system is found to be unsatisfactory for both the service providers and customers. In this paper, we are the first to systematically study the large-scale dynamic city express problem. We aim to increase both the effectiveness and the efficiency of the scheduling algorithm. To improve the effectiveness, we adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period rather than dealing with each request individually. To improve the efficiency, we design a two-level priority queue structure to reduce redundant shortest distance calculation and repeated candidate generation. We develop a simulation system and conduct extensive performance studies in the real road network of Beijing city. The experimental results demonstrate the high effectiveness and efficiency of our algorithm.

Huang, X., Cheng, H., Qin, L., Tian, W. & Yu, J.X. 2014, 'Querying k-truss community in large and dynamic graphs',

View/Download from: UTS OPUS or Publisher's site

*International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014*, pp. 1311-1322.View/Download from: UTS OPUS or Publisher's site

Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C. & Lin, X. 2014, 'Scalable big graph processing in MapReduce',

View/Download from: UTS OPUS or Publisher's site

*International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014*, pp. 827-838.View/Download from: UTS OPUS or Publisher's site

Zhang, Z., Qin, L. & Yu, J.X. 2014, 'Contract & Expand: I/O Efficient SCCs Computing',

View/Download from: UTS OPUS or Publisher's site

*IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014*, 2014 IEEE 30th International Conference on Data Engineering (ICDE), IEEE, Piscataway, USA, pp. 208-219.View/Download from: UTS OPUS or Publisher's site

As an important branch of big data processing, big graph processing is becoming increasingly popular in recent years. Strongly connected component (SCC) computation is a fundamental graph operation on directed graphs, where an SCC is a maximal subgraph S of a directed graph G in which every pair of nodes is reachable from each other in S. By contracting each SCC into a node, a large general directed graph can be represented by a small directed acyclic graph (DAG). In the literature, there are I/O efficient semi-external algorithms to compute all SCCs of a graph G, by assuming that all nodes of a graph G can fit in the main memory. However, many real graphs are large and even the nodes cannot reside entirely in the main memory. In this paper, we study new I/O efficient external algorithms to find all SCCs for a directed graph G whose nodes cannot fit entirely in the main memory. To overcome the deficiency of the existing external graph contraction based approach that usually cannot stop in finite iterations, and the external DFS based approach that will generate a large number of random I/Os, we explore a new contraction-expansion based approach. In the graph contraction phase, instead of contracting the whole graph as the contraction based approach, we only contract the nodes of a graph, which are much more selective. The contraction phase stops when all nodes of the graph can fit in the main memory, such that the semi-external algorithm can be used in SCC computation. In the graph expansion phase, as the graph is expanded in the reverse order as it is contracted, the SCCs of all nodes in the graph are computed. Both graph contraction phase and graph expansion phase use only I/O efficient sequential scans and external sorts of nodes/edges in the graph. Our algorithm leverages the efficiency of the semi-external SCC computation algorithm and usually stops in a small number of iterations. We further optimize our approach by reducing the size of nodes and edges of t...

Chang, L., Lin, X., Qin, L., Liu, C. & Liang, W. 2013, 'Efficiently Computing k-Edge Connected Components via Graph Decomposition',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of ACM Conference on Management of Data*, ACM Conference on Management of Data, ACM, New York, NY, USA, pp. 205-216.View/Download from: UTS OPUS or Publisher's site

k-edge connected components; graph decomposition; minimum cut

Zhang, Z., Yu, J.X., Qin, L., Chang, L. & Lin, X. 2013, 'I/O Efficient: Computing SCCs in Massive Graphs',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of ACM Conference on Management of Data*, ACM Conference on Management of Data, ACM, New York, NY, USA, pp. 181-192.View/Download from: UTS OPUS or Publisher's site

A strongly connected component (SCC) is a maximal subgraph of a directed graph G in which every pair of nodes are reachable from each other in the SCC. With such a property, a general directed graph can be represented by a directed acyclic graph DAG by contracting an SCC of G to a node in DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all SCCs in a directed graph G is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all SCCs in linear time w.r.t. the size of a graph. However, when a graph cannot resident entirely in the main memory, the existing external or semi-external algorithms to find all SCCs have limitation to achieve high I/O efficiency. In this paper, we study new I/O efficient semi-external algorithms to find all SCCs for a massive directed graph G that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS based semi-external algorithm that heavily relies on a total order, we explore a weak order based on which we investigate new algorithms. We propose a new two phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of G can be constructed in bounded sequential scans of G. In the tree search phase, it needs to sequentially scan the graph once to find all SCCs.

Qiao, M., Qin, L., Cheng, H., Yu, J.X. & Tian, W. 2013, 'Top-K nearest keyword search on large graphs',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, VLDB Endowment, ACM, Trento, Italy, pp. 901-912.View/Download from: UTS OPUS or Publisher's site

It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned with a weight measuring its length. Given a query node q in G and a keyword ?, a k-NK query seeks k nodes which contain ? and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is query time optimized for a scenario when a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.

Huang, X., Cheng, H., Li, R., Qin, L. & Yu, J.X. 2013, 'Top-K Structural Diversity Search in Large Network',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, VLDB Endowment, ACM, Trento, Italy, pp. 1618-1629.View/Download from: UTS OPUS or Publisher's site

Social contagion depicts a process of information (e.g., fads, opinions, news) diffusion in the online social networks. A recent study reports that in a social contagion process the probability of contagion is tightly controlled by the number of connected components in an individual's neighborhood. Such a number is termed structural diversity of an individual and it is shown to be a key predictor in the social contagion process. Based on this, a fundamental issue in a social network is to find top-k users with the highest structural diversities. In this paper, we, for the first time, study the top-k structural diversity search problem in a large network. Specifically, we develop an effective upper bound of structural diversity for pruning the search space. The upper bound can be incrementally refined in the search process. Based on such upper bound, we propose an efficient framework for top-k structural diversity search. To further speed up the structural diversity evaluation in the search process, several carefully devised heuristic search strategies are proposed. Extensive experimental studies are conducted in 13 real-world large networks, and the results demonstrate the efficiency and effectiveness of the proposed methods.

Zhang, Z., Yu, J.X., Qin, L., Zhu, Q. & Zhou, X. 2012, 'I/O Cost Minimization: Reachability Queries Processing over Massive Graphs',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 15th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Berlin, Germany, pp. 468-479.View/Download from: UTS OPUS or Publisher's site

Given a directed graph G, a reachability query (u, v) asks whether there exists a path from a node u to a node v in G. The existing studies support reachability queries using indexing techniques, where both the graph and the index are required to reside in main memory. However, they cannot handle reachability queries on massive graphs, when the graph and the index cannot be entirely held in memory because of the high I/O cost. In this paper, we focus on how to minimize the I/O cost when answering reachability queries on massive graphs that cannot reside entirely in memory. First, we propose a new Yes-Label scheme, as a complement of the No-Label used in GRAIL [23], to reduce the number of intermediate results generated. Second, we show how to minimize the number of I/Os using a heap-on-disk data structure when traversing a graph. We also propose new methods to partition the heap-on-disk, in order to ensure that only sequential I/Os are performed. Third, we analyze our approaches and show how to extend our approaches to answer multiple reachability queries effectively. Finally, we conducted extensive performance studies on both large synthetic and large real graphs, and confirm the efficiency of our approaches.

Zhu, Y., Qin, L., Yu, J.X. & Cheng, H. 2012, 'Finding Top-K Similar Graphs in Graph Databases',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 15th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Berlin, Germany, pp. 456-467.View/Download from: UTS OPUS or Publisher's site

Querying similar graphs in graph databases has been widely studied in graph query processing in recent years. Existing works mainly focus on subgraph similarity search and supergraph similarity search. In this paper, we study the problem of finding top-k graphs in a graph database that are most similar to a query graph. This problem has many applications, such as image retrieval and chemical compound structure search. Regarding the similarity measure, feature based and kernel based similarity measures have been used in the literature. But such measures are rough and may lose the connectivity information among substructures. In this paper, we introduce a new similarity measure based on the maximum common subgraph (MCS) of two graphs. We show that this measure can better capture the common and different structures of two graphs. Since computing the MCS of two graphs is NP-hard, we propose an algorithm to answer the top-k graph similarity query using two distance lower bounds with different computational costs, in order to reduce the number of MCS computations. We further introduce an indexing technique, which can better make use of the triangle property of similarities among graphs in the database to get tighter lower bounds. Three different indexing methods are proposed with different tradeoffs between pruning power and construction cost. We conducted extensive performance studies on large real datasets to evaluate the performance of our approaches.

Zhu, Y., Yu, J.X., Cheng, H. & Qin, L. 2012, 'Graph Classification: A Diversified Discriminative Feature Selection Approach',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of 2012 ACM International Conference on Information and Knowledge Management*, ACM International Conference on Information and Knowledge Management, ACM, Maui, HI, USA, pp. 205-214.View/Download from: UTS OPUS or Publisher's site

A graph models complex structural relationships among objects, and has been prevalently used in a wide range of applications. Building an automated graph classification model becomes very important for predicting unknown graphs or understanding complex structures between different classes. The graph classification framework being widely used consists of two steps, namely, feature selection and classification. The key issue is how to select important subgraph features from a graph database with a large number of graphs including positive graphs and negative graphs. Given the features selected, a generic classification approach can be used to build a classification model. In this paper, we focus on feature selection. We identify two main issues with the most widely used feature selection approach which is based on a discriminative score to select frequent subgraph features, and introduce a new diversified discriminative score to select features that have a higher diversity. We analyze the properties of the newly proposed diversified discriminative score, and conducted extensive performance studies to demonstrate that such a diversified discriminative score makes positive/negative graphs separable and leads to a higher classification accuracy.

Qin, L., Yu, J.X. & Chang, L. 2012, 'Diversifying Top-K Results',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, VLDB Endowment, ACM, Istanbul, Turkey, pp. 1124-1135.View/Download from: UTS OPUS or Publisher's site

Top-k query processing finds a list of k results that have largest scores w.r.t the user given query, with the assumption that all the k results are independent to each other. In practice, some of the top-k results returned can be very similar to each other. As a result some of the top-k results returned are redundant. In the literature, diversified top-k search has been studied to return k results that take both score and diversity into consideration. Most existing solutions on diversified top-k search assume that scores of all the search results are given, and some works solve the diversity problem on a specific problem and can hardly be extended to general cases. In this paper, we study the diversified top-k search problem. We define a general diversified top-k search problem that only considers the similarity of the search results themselves. We propose a framework, such that most existing solutions for top-k query processing can be extended easily to handle diversified top-k search, by simply applying three new functions, a sufficient stop condition sufficient(), a necessary stop condition necessary(), and an algorithm for diversified top-k search on the current set of generated results, div-search-current(). We propose three new algorithms, namely, div-astar, div-dp, and div-cut to solve the div-search-current() problem. div-astar is an A* based algorithm, div-dp is an algorithm that decomposes the results into components which are searched using div-astar independently and combined using dynamic programming. div-cut further decomposes the current set of generated results using cut points and combines the results using sophisticated operations. We conducted extensive performance studies using two real datasets, enwiki and reuters. Our div-cut algorithm finds the optimal solution for diversified top-k search problem in seconds even for k as large as 2, 000.

Qin, L., Yu, J.X. & Chang, L. 2011, 'Computing Structural Statistics by Keywords in Databases',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 27th International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Hannover, Germany, pp. 363-374.View/Download from: UTS OPUS or Publisher's site

Keyword search in RDBs has been extensively studied in recent years. The existing studies focused on finding all or top-k interconnected tuple-structures that contain keywords. In reality, the number of such interconnected tuple-structures for a keyword query can be large. It becomes very difficult for users to obtain any valuable information more than individual interconnected tuple-structures. Also, it becomes challenging to provide a similar mechanism like group-&-aggregate for those interconnected tuple-structures. In this paper, we study computing structural statistics keyword queries by extending the group-&-aggregate framework. We consider an RDB as a large directed graph where nodes represent tuples, and edges represent the links among tuples. Instead of using tuples as a member in a group to be grouped, we consider rooted subgraphs. Such a rooted subgraph represents an interconnected tuple-structure among tuples and some of the tuples contain keywords. The dimensions of the rooted subgraphs are determined by dimensional-keywords in a data driven fashion. Two rooted subgraphs are grouped into the same group if they are isomorphic based on the dimensions or in other words the dimensional-keywords. The scores of the rooted subgraphs are computed by a user-given score function if the rooted subgraphs contain some of general keywords. Here, the general keywords are used to compute scores rather than determining dimensions. The aggregates are computed using an SQL aggregate function for every group based on the scores computed. We give our motivation using a real dataset. We propose new approaches to compute structural statistics keyword queries, perform extensive performance studies using two large real datasets and a large synthetic dataset, and confirm the effectiveness and efficiency of our approach.

Chang, L., Yu, J.X., Qin, L., Zhu, Y. & Wang, H. 2011, 'Finding Information Nebula over Large Networks',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of 2011 ACM International Conference on Information and Knowledge Management*, ACM, Glasgow, United Kingdom, pp. 1465-1474.View/Download from: UTS OPUS or Publisher's site

Social and information networks have been extensively studied over years. In this paper, we concentrate ourselves on a large information network that is composed of entities and relationships, where entities are associated with sets of keyword terms (kterms) to specify what they are, and relationships describe the link structure among entities which can be very complex. Our work is motivated but is different from the existing works that find a best subgraph to describe how user-specified entities are connected. We compute information nebula (cloud) which is a set of top-K kterms P that are most correlated to a set of user-specified kterms Q, over a large information network. Our goal is to find how kterms are correlated given the complex information network among entities. The information nebula computing requests us to take all possible kterms into consideration for the top-K kterms selection, and needs to measure the similarity between kterms by considering all possible subgraphs that connect them instead of the best single one. In this work, we compute information nebula using a global structural-context similarity, and our similarity measure is independent of connection subgraphs. To the best of our knowledge, among the link-based similarity methods, none of the existing work considers similarity between two sets of nodes or two kterms. We propose new algorithms to find top-K kterms P for a given set of kterms Q based on the global structural-context similarity, without computing all the similarity scores of kterms in the large information network. We performed extensive performance studies using large real datasets, and confirmed the effectiveness and efficiency of our approach.

Zhu, Y., Qin, L., Yu, J.X., Ke, Y. & Lin, X. 2011, 'High Efficiency and Quality: Large Graphs Matching',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of 2011 ACM International Conference on Information and Knowledge Management*, ACM International Conference on Information and Knowledge Management, ACM, Glasgow, United Kingdom, pp. 1755-1764.View/Download from: UTS OPUS or Publisher's site

Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. To find exact matching, it cannot handle a graph with more than 30 nodes. To find an approximate matching, the quality can be very poor. We propose a novel two-step approach which can efficiently match two large graphs over thousands of nodes with high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We give the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We conducted extensive testing using real and synthetic datasets, and will report our findings.

Chang, L., Yu, J.X., Qin, L. & Lin, X. 2010, 'Probabilistic Ranking over Relations',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 13th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Lausanne, Switzerland, pp. 477-488.View/Download from: UTS OPUS or Publisher's site

Probabilistic top-k ranking queries have been extensively studied due to the fact that data obtained can be uncertain in many real applications. A probabilistic top-k ranking query ranks objects by the interplay of score and probability, with an implicit assumption that both scores based on which objects are ranked and probabilities of the existence of the objects are stored in the same relation. We observe that in general scores and probabilities are highly possible to be stored in different relations, for example, in column-oriented DBMSs and in data warehouses. In this paper we study probabilistic top-k ranking queries when scores and probabilities are stored in different relations. We focus on reducing the join cost in probabilistic top-k ranking. We investigate two probabilistic score functions, discuss the upper/lower bounds in random access and sequential access, and provide insights on the advantages and disadvantages of random/sequential access in terms of upper/lower bounds. We also propose random, sequential, and hybrid algorithms to conduct probabilistic top-k ranking. We conducted extensive performance studies using real and synthetic datasets, and report our findings in this paper.

Qin, L., Yu, J.X. & Chang, L. 2010, 'Ten thousand SQLs: parallel keyword queries computing',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the VLDB Endowment*, Proceedings of the VLDB Endowment, ACM, Singapore, pp. 58-69.View/Download from: UTS OPUS or Publisher's site

Keyword search in relational databases has been extensively studied. Given a relational database, a keyword query finds a set of interconnected tuple structures connected by foreign key references. On rdbms, a keyword query is processed in two steps, namely, candidate networks (CNs) generation and CNs evaluation, where a CN is an sql. In common, a keyword query needs to be processed using over 10,000 sqls. There are several approaches to process a keyword query on rdbms, but there is a limit to achieve high performance on a uniprocessor architecture. In this paper, we study parallel computing keyword queries on a multicore architecture. We give three observations on keyword query computing, namely, a large number of sqls that needs to be processed, high sharing possibility among sqls, and large intermediate results with small number of final results. All make it challenging for parallel keyword queries computing. We investigate three approaches. We first study the query level parallelism, where each sql is processed by one core. We distribute the sqls into different cores based on three objectives, regarding minimizing workload skew, minimizing intercore sharing and maximizing intra-core sharing respectively. Such an approach has the potential risk of load unbalancing through accumulating errors of cost estimation. We then study the operation level parallelism, where each operation of an sql is processed by one core. All operations are processed in stages, where in each stage the costs of operations are re-estimated to reduce the accumulated error. Such operation level parallelism still has drawbacks of workload skew when large operations are involved and a large number of cores are used. Finally, we propose a new algorithm that partitions relations adaptively in order to minimize the extra cost of partitioning and at the same time reduce workload skew.

Qin, L., Yu, J.X., Chang, L. & Tao, Y. 2009, 'Querying Communities in Relational Databases',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 25th International Conference on Data Engineering*, IEEE, Shanghai, China, pp. 724-735.View/Download from: UTS OPUS or Publisher's site

Keyword search on relational databases provides users with insights that they can not easily observe using the traditional RDBMS techniques. Here, an l-keyword query is specified by a set of l keywords, {k1, k2, middot middot middot , kl}. It finds how the tuples that contain the keywords are connected in a relational database via the possible foreign key references. Conceptually, it is to find some structural information in a database graph, where nodes are tuples and edges are foreign key references. The existing work studied how to find connected trees for an l-keyword query. However, a tree may only show partial information about how those tuples that contain the keywords are connected. In this paper, we focus on finding communities for an l-keyword query. A community is an induced subgraph that contains all the l-keywords within a given distance. We propose new efficient algorithms to find all/top-k communities which consume small memory, for an l-keyword query. For top k l-keyword queries, our algorithm allows users to interactively enlarge k at run time. We conducted extensive performance studies using two large real datasets to confirm the efficiency of our algorithms

Chang, L., Yu, J.X. & Qin, L. 2009, 'Query Ranking in Probabilistic XML Data',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 12th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Saint-Petersburg, Russia, pp. 156-167.View/Download from: UTS OPUS or Publisher's site

Twig queries have been extensively studied as a major fragment of XPATH queries to query XML data. In this paper, we study PXML-RANK query, (Q, k), which is to rank top-k probabilities of the answers of a twig query Q in probabilistic XML (PXML) data. A new research issue is how to compute top-k probabilities of answers of a twig query Q in PXML in the presence of containment (ancestor/descendant) relationships. In the presence of the ancestor/descendant relationships, the existing dynamic programming approaches to rank top-k probabilities over a set of tuples cannot be directly applied, because any node/edge in PXML may have impacts on the top-k probabilities of answers. We propose new algorithms to compute PXML-RANK queries efficiently and give conditions under which a PXML-RANK query can be processed efficiently without enumeration of all the possible worlds. We conduct extensive performance studies using both real and large benchmark datasets, and confirm the efficiency of our algorithms

Qin, L., Yu, J.X. & Chang, L. 2009, 'Keyword Search in Databases: The Power of RDBMS',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of ACM Conference on Management of Data*, Proceedings of ACM Conference on Management of Data, ACM, Providence, Rhode Island, USA, pp. 681-694.View/Download from: UTS OPUS or Publisher's site

Keyword search in relational databases (RDBs) has been extensively studied recently. A keyword search (or a keyword query) in RDBs is specified by a set of keywords to explore the interconnected tuple structures in an RDB that cannot be easily identified using SQL on RDBMS. In brief, it finds how the tuples containing the given keywords are connected via sequences of connections (foreign key references) among tuples in an RDB. Such interconnected tuple structures can be found as connected trees up to a certain size, sets of tuples that are reachable from a root tuple within a radius, or even multi-center subgraphs within a radius. In the literature, there are two main approaches. One is to generate a set of relational algebra expressions and evaluate every such expression using SQL on an RDBMS directly or in a middleware on top of an RDBMS indirectly. Due to a large number of relational algebra expressions needed to process, most of the existing works take a middleware approach without fully utilizing RDBMSs. The other is to materialize an RDB as a graph and find the interconnected tuple structures using graph-based algorithms in memory

Chang, L., Yu, J.X. & Qin, L. 2009, 'Context-sensitive document ranking',

View/Download from: Publisher's site

*International Conference on Information and Knowledge Management, Proceedings*, pp. 1533-1536.View/Download from: Publisher's site

Ranking is a main research issue in IR-styled keyword search over a set of documents. In this paper, we study a new keyword search problem, called context-sensitive document ranking, which is to rank documents with an additional context that provides additional information about the application domain where the documents are to be searched and ranked. The work is motivated by the fact that additional information associated with the documents can possibly assist users to find more relevant documents when they are unable to find the needed documents from the documents alone. In this paper, a context is a multi-attribute graph, which can represent any information maintained in a relational database. The context-sensitive ranking is related to several research issues, how to score documents, how to evaluate the additional information obtained in the context that may contribute the document ranking, how to rank the documents by combining the scores/costs from the documents and the context. More importantly, the relationships between documents and the information stored in a relational database may be uncertain, because they are from different data sources and the relationships are determined systematically using similarity match which causes uncertainty. In this paper, we concentrate ourselves on these research issues, and provide our solution on how to rank the documents in a context where there exist uncertainty between the documents and the context. We confirm the effectiveness of our approaches by conducting extensive experimental studies using real datasets. Copyright 2009 ACM.

Qin, L., Yu, J.X., Chang, L. & Tao, Y. 2009, 'Scalable keyword search on large data streams',

View/Download from: Publisher's site

*Proceedings - International Conference on Data Engineering*, pp. 1199-1202.View/Download from: Publisher's site

It is widely realized that the integration of information retrieval (IR) and database (DB) techniques provides users with a broad range of high quality services. A new challenging issue along the same direction is IR-styled m-keyword query processing in a RDBMS framework over an open-ended relational data stream. The capability of supporting m-keyword queries over a relational data stream makes it possible for users to monitor events, that are implicitly interrelated, over a relational data stream in a timely manner. In brief, the problem is to find all connected trees whose size is less than or equal to a user-given threshold in terms of number of nodes for a m-keyword query, {k1, k2, , km}, over a relational data stream on a database schema GS. The difficulty of the problem is related to the number of costly joins to be processed over time, which is affected by the parameters such as the number of keywords (m), the maximum size of connected trees (Tmax), as well as the complexity of the database schema when it is viewed as a schema graph (GS). In this paper, we propose a new demand-driven approach to process such a query over a high speed data stream. We show that we can significantly reduce the number of intermediate results when processing joins over a data stream, and therefore can achieve high efficiency. © 2009 IEEE.

Ding, B., Yu, J.X. & Qin, L. 2008, 'Finding Time-Dependent Shortest Paths over Large Graphs',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 11th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Nantes, France, pp. 205-216.View/Download from: UTS OPUS or Publisher's site

The spatial and temporal databases have been studied widely and intensively over years. In this paper, we study how to answer queries of finding the best departure time that minimizes the total travel time from a place to another, over a road network, where the traffic conditions dynamically change from time to time. We study a generalized form of this problem, called the time-dependent shortest-path problem. A time-dependent graph GT is a graph that has an edge-delay function, wi, j(t), associated with each edge (vi, vj), to be stored in a database. The edge-delay function wi, j(t) specifies how much time it takes to travel from node vi to node vj, if it departs from vi at time t. A user-specified query is to ask the minimum-travel-time path, from a source node, vs, to a destination node, ve, over the time-dependent graph, GT, with the best departure time to be selected from a time interval T. We denote this user query as LTT(vs, ve, T) over GT. The challenge of this problem is the added complexity due to the time dependency in the time-dependent graph. That is, edge delays are not constants, and can vary from time to time. In this paper, we propose a novel algorithm to find the minimum-travel-time path with the best departure time for a LTT(vs, ve, T) query over a large graph GT. Our approach outperforms existing algorithms in terms of both time complexity in theory and efficiency in practice. We will discuss the design of our algorithm, together with its correctness and complexity. We conducted extensive experimental studies over large graphs and will report our findings.

Ding, B., Yu, J.X., Wang, S., Qin, L., Zhang, X. & Lin, X. 2007, 'Finding Top-k Min-Cost Connected Trees in Databases',

View/Download from: UTS OPUS or Publisher's site

*Proceedings of the 23th International Conference on Data Engineering*, IEEE, Istanbul, Turkey, pp. 836-845.View/Download from: UTS OPUS or Publisher's site

It is widely realized that the integration of database and information retrieval techniques will provide users with a wide range of high quality services. In this paper, we study processing an l-keyword query, p1, p1, ..., pl, against a relational database which can be modeled as a weighted graph, G(V, E). Here V is a set of nodes (tuples) and E is a set of edges representing foreign key references between tuples. Let Vi ? V be a set of nodes that contain the keyword pi. We study finding top-k minimum cost connected trees that contain at least one node in every subset Vi, and denote our problem as GST-k When k = 1, it is known as a minimum cost group Steiner tree problem which is NP-complete. We observe that the number of keywords, l, is small, and propose a novel parameterized solution, with l as a parameter, to find the optimal GST-1, in time complexity O(3ln + 2l ((l + logn)n + m)), where n and m are the numbers of nodes and edges in graph G. Our solution can handle graphs with a large number of nodes. Our GST-1 solution can be easily extended to support GST-k, which outperforms the existing GST-k solutions over both weighted undirected/directed graphs. We conducted extensive experimental studies, and report our finding

Wang, S., Peng, Z., Zhang, J., Qin, L., Wang, S., Yu, J.X. & Ding, B. 2006, 'NUITS: A Novel User Interface for Efficient Keyword Search over Databases.',

*VLDB*, ACM, pp. 1143-1146.## Journal articles

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2017, 'I/O efficient ECC graph decomposition via graph reduction',

View/Download from: Publisher's site

*VLDB Journal*, vol. 26, no. 2, pp. 275-300.View/Download from: Publisher's site

© 2016 Springer-Verlag Berlin HeidelbergThe problem of computing k-edge connected components (k-(Formula presented.)s) of a graph G for a specific k is a fundamental graph problem and has been investigated recently. In this paper, we study the problem of (Formula presented.) decomposition, which computes the k-(Formula presented.)s of a graph G for all possible k values. (Formula presented.) decomposition can be widely applied in a variety of applications such as graph-topology analysis, community detection, Steiner Component Search, and graph visualization. A straightforward solution for (Formula presented.) decomposition is to apply the existing k-(Formula presented.) computation algorithm to compute the k-(Formula presented.)s for all k values. However, this solution is not applicable to large graphs for two challenging reasons. First, all existing k-(Formula presented.) computation algorithms are highly memory intensive due to the complex data structures used in the algorithms. Second, the number of possible k values can be very large, resulting in a high computational cost when each k value is independently considered. In this paper, we address the above challenges, and study I/O efficient (Formula presented.) decomposition via graph reduction. We introduce two elegant graph reduction operators which aim to reduce the size of the graph loaded in memory while preserving the connectivity information of a certain set of edges to be computed for a specific k. We also propose three novel I/O efficient algorithms, (Formula presented.)-(Formula presented.), (Formula presented.)-(Formula presented.), and (Formula presented.), that explore the k values in different orders to reduce the redundant computations between different k values. We analyze the I/O and memory costs for all proposed algorithms. In addition, we extend our algorithm to build an efficient index for Steiner Component Search. We show that our index can be used to perform Steiner Component Search in o...

Chang, L., Li, W., Qin, L., Zhang, W. & Yang, S. 2017, 'PSCAN : Fast and Exact Structural Graph Clustering',

View/Download from: Publisher's site

*IEEE Transactions on Knowledge and Data Engineering*, vol. 29, no. 2, pp. 387-401.View/Download from: Publisher's site

© 2016 IEEE. We study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given an undirected unweighted graph, structural graph clustering is to assign vertices to clusters, and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected. In this paper, we develop a new two-step paradigm for scalable structural graph clustering based on our three observations. Then, we present a pSCAN approach, within the paradigm, aiming to reduce the number of structural similarity computations, and propose optimization techniques to speed up checking whether two vertices are structure-similar. pSCAN outputs exactly the same clusters as the existing approaches SCAN and SCAN++, and we prove that pSCAN is worst-case optimal. Moreover, we propose efficient techniques for updating the clusters when the input graph dynamically changes, and we also extend our techniques to other similarity measures, e.g., Jaccard similarity. Performance studies on large real and synthetic graphs demonstrate the efficiency of our new approach and our dynamic cluster maintenance techniques. Noticeably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.

Lai, L., Qin, L., Lin, X. & Chang, L. 2017, 'Scalable subgraph enumeration in MapReduce: a cost-oriented approach',

View/Download from: Publisher's site

*VLDB Journal*, vol. 26, no. 3, pp. 421-446.View/Download from: Publisher's site

© 2017, Springer-Verlag Berlin Heidelberg. Subgraph enumeration, which aims to find all the subgraphs of a large data graph that are isomorphic to a given pattern graph, is a fundamental graph problem with a wide range of applications. However, existing sequential algorithms for subgraph enumeration fall short in handling large graphs due to the involvement of computationally intensive subgraph isomorphism operations. Thus, some recent researches focus on solving the problem using MapReduce. Nevertheless, exiting MapReduce approaches are not scalable to handle very large graphs since they either produce a huge number of partial results or consume a large amount of memory. Motivated by this, in this paper, we propose a new algorithm TwinTwigJoin based on a left-deep-join framework in MapReduce, in which the basic join unit is a TwinTwig (an edge or two incident edges of a node). We show that in the Erdös–Rényi random graph model, TwinTwigJoin is instance optimal in the left-deep-join framework under reasonable assumptions, and we devise an algorithm to compute the optimal join plan. We further discuss how our approach can be adapted to handle the power-law random graph model. Three optimization strategies are explored to improve our algorithm. Ultimately, by aggregating equivalent nodes into a compressed node, we construct the compressed graph, upon which the subgraph enumeration is further improved. We conduct extensive performance studies in several real graphs, one of which contains billions of edges. Our approach significantly outperforms existing solutions in all tests.

Zhu, Y., Zhang, H., Qin, L. & Cheng, H. 2017, 'Efficient MapReduce algorithms for triangle listing in billion-scale graphs',

View/Download from: Publisher's site

*Distributed and Parallel Databases*, vol. 35, no. 2, pp. 149-176.View/Download from: Publisher's site

© 2017, Springer Science+Business Media New York. This paper addresses the classical triangle listing problem, which aims at enumerating all the tuples of three vertices connected with each other by edges. This problem has been intensively studied in internal and external memory, but it is still an urgent challenge in distributed environment where multiple machines across the network can be utilized to achieve good performance and scalability. As one of the de facto computing methodologies in distributed environment, MapReduce has been used in some of existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders their scalability on large scale graphs. In this paper, we propose a new triangle listing algorithm in MapReduce, FTL, which utilizes a light weight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and also is equipped with multiple-round techniques to ease the burden on memory and network bandwidth when dealing with graphs at billion scale. We prove that the size of the intermediate data can be well bounded near to the number of triangles in the graph. To further reduce the shuffle size and memory cost, we also propose improved algorithms based on a compact data structure, and present several optimization techniques to accelerate the computation and reduce the memory consumption. The extensive experimental results show that our algorithms outperform existing competitors by several times on both synthetic graphs and real world graphs.

Zhang, F., Zhang, W., Zhang, Y., Qin, L. & Lin, X. 2017, 'OLAK: An Efficient Algorithm to Prevent Unraveling in Social Networks',

View/Download from: UTS OPUS

*Proceedings of the VLDB Endowment*, vol. 10, no. 6, pp. 649-660.View/Download from: UTS OPUS

In this paper, we study the problem of the anchored k-core.
Given a graph G, an integer k and a budget b, we aim to
identify b vertices in G so that we can determine the largest
induced subgraph J in which every vertex, except the b vertices,
has at least k neighbors in J. This problem was introduced
by Bhawalkar and Kleinberg et al. in the context of
user engagement in social networks, where a user may leave
a community if he/she has less than k friends engaged. The
problem has been shown to be NP-hard and inapproximable.
A polynomial-time algorithm for graphs with bounded tree width
has been proposed. However, this assumption usually
does not hold in real-life graphs, and their techniques cannot
be extended to handle general graphs.
Motivated by this, we propose an efficient algorithm,
namely onion-layer based anchored k-core (OLAK), for the
anchored k-core problem on large scale graphs. To facilitate
computation of the anchored k-core, we design an onion layer
structure, which is generated by a simple onion-peeling like
algorithm against a small set of vertices in the graph.
We show that computation of the best anchor can simply
be conducted upon the vertices on the onion layers, which
significantly reduces the search space. Based on the well organized
layer structure, we develop efficient candidates
exploration, early termination and pruning techniques to
further speed up computation. Comprehensive experiments
on 10 real-life graphs demonstrate the effectiveness and efficiency of our proposed methods.

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2016, 'Diversified top-k clique search',

View/Download from: UTS OPUS or Publisher's site

*The VLDB Journal*, vol. 25, no. 2, pp. 171-196.View/Download from: UTS OPUS or Publisher's site

© 2015 Springer-Verlag Berlin Heidelberg Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns enormous cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem which is to find top-k cliques that can cover most number of nodes in the graph. Diversified top-k clique search can be widely used in a lot of applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight (Formula presented.)-(Formula presented.), based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. Besides, for the massive input graph, we develop an I/O efficient algorithm to tackle the problem when the input graph cannot fit in main memory. We conduct extensive performance studies on real graphs and synthetic graphs. One of the real graphs contains 1.02 billion edges. The results demonstrate the high efficiency and effectiveness of our approach.

Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2016, 'Optimal Multi-Meeting-Point Route Search',

View/Download from: Publisher's site

*IEEE Transactions on Knowledge and Data Engineering*, vol. 28, no. 3, pp. 770-784.View/Download from: Publisher's site

© 1989-2012 IEEE. Real-time ride-sharing applications (e.g., Uber and Lyft) are very popular in recent years. Motivated by the ride-sharing application, we propose a new type of query in road networks, called the optimal multi-meeting-point route (OMMPR) query. Given a road network G , a source node s , a target node t , and a set of query nodes U, the OMMPR query aims at finding the best route starting from s and ending at t such that the weighted average cost between the cost of the route and the total cost of the shortest paths from every query node to the route is minimized. We show that the problem of computing the OMMPR query is NP-hard. To answer the OMMPR query efficiently, we propose two novel parameterized solutions based on dynamic programming (DP), with the number of query nodes l (i.e., l=|U|) as a parameter, which is typically very small in practice. The two proposed parameterized algorithms run in O(3 m+ 2 n (l+\log (n))) and O(2 (m + n (l+\log (n)))) time, respectively, where n and m denote the number of nodes and edges in graph G, thus they are tractable in practice. To reduce the search space of the DP-based algorithms, we propose two novel optimized algorithms based on bidirectional DP and a carefully-designed lower bounding technique. We conduct extensive experimental studies on four large real-world road networks, and the results demonstrate the efficiency of the proposed algorithms.

Zhang, S., Qin, L., Zheng, Y. & Cheng, H. 2016, 'Effective and efficient: Large-scale dynamic city express',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions on Knowledge and Data Engineering*, vol. PP, no. 99.View/Download from: UTS OPUS or Publisher's site

© 1989-2012 IEEE. Due to the large number of requirements for city express services in recent years, the current city express system is found to be unsatisfactory for both the service providers and customers. In this paper, we are the first to systematically study the large-scale dynamic city express problem. We aim to increase both the effectiveness and the efficiency of the scheduling algorithm. The challenges of the problem stem from the highly dynamic environment, the NP-completeness with respect to the number of requests, and real-time demands for the scheduling result. We introduce a basic algorithm to assign a request to a courier on a first-come, first-served basis. To improve the effectiveness of the basic algorithm, we adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period rather than dealing with each request individually. To improve the efficiency of the algorithm, we further design a two-level priority queue structure to reduce redundant shortest distance calculation and repeated candidate generation. We develop a simulation system and conduct extensive performance studies on the real road network of Beijing city. The experimental results demonstrate the high effectiveness and efficiency of our algorithms. Remarkably, our system can achieve much better service quality and largely reduce the operation cost of a city express company simultaneously.

Zhang, Z., Yu, J.X., Qin, L., Chang, L. & Lin, X. 2015, 'I/O efficient: computing SCCs in massive graphs',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 24, no. 2, pp. 245-270.View/Download from: UTS OPUS or Publisher's site

A strongly connected component ((Formula presented.)) is a maximal subgraph of a directed graph (Formula presented.) in which every pair of nodes is reachable from each other in the (Formula presented.). With such a property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting every (Formula presented.) of (Formula presented.) to a node in DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all (Formula presented.)s in a directed graph (Formula presented.) is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all (Formula presented.)s in linear time with respect to the size of a graph. However, when a graph cannot reside entirely in the main memory, the existing external or semi-external algorithms to find all (Formula presented.)s have limitation to achieve high I/O efficiency. In this paper, we study new I/O-efficient semi-external algorithms to find all (Formula presented.)s for a massive directed graph (Formula presented.) that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS-based semi-external algorithm that heavily relies on a total order, we explore a weak order based on which we investigate new algorithms. We propose a new two-phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of (Formula presented.) can be constructed in bounded number of sequential scans of (Formula presented.). In the tree search phase, it needs to sequentially scan the graph once to find all (Formula presented.)s. In addition, we propose a new single-phase algorithm, which combines the tree construction and tree search phases into a single phase, with three new optimization techniques. They are early acceptance, early rejection, ...

Huang, X., Cheng, H., Li, R.-.H., Qin, L. & Yu, J.X. 2015, 'Top-K structural diversity search in large networks.',

View/Download from: UTS OPUS or Publisher's site

*The VLDB Journal*, vol. 24, pp. 319-343.View/Download from: UTS OPUS or Publisher's site

Social contagion depicts a process of information (e.g., fads, opinions, news) diffusion in the online social networks. A recent study reports that in a social contagion process, the probability of contagion is tightly controlled by the number of connected components in an individual's neighborhood. Such a number is termed structural diversity of an individual, and it is shown to be a key predictor in the social contagion process. Based on this, a fundamental issue in a social network is to find top-k users with the highest structural diversities. In this paper, we, for the first time, study the top-k structural diversity search problem in a large network. Specifically, we study two types of structural diversity measures, namely, component-based structural diversity measure and core-based structural diversity measure. For component-based structural diversity, we develop an effective upper bound of structural diversity for pruning the search space. The upper bound can be incrementally refined in the search process. Based on such upper bound, we propose an efficient framework for top-k structural diversity search. To further speed up the structural diversity evaluation in the search process, several carefully devised search strategies are proposed. We also design efficient techniques to handle frequent updates in dynamic networks and maintain the top-k results. We further show how the techniques proposed in component-based structural diversity measure can be extended to handle the core-based structural diversity measure. Extensive experimental studies are conducted in real-world large networks and synthetic graphs, and the results demonstrate the efficiency and effectiveness of the proposed methods.

Li, Z., Qin, L., Cheng, H., Zhang, X. & Zhou, X. 2015, 'TRIP: An Interactive Retrieving-Inferring Data Imputation Approach.',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions on Knowledge and Data Engineering*, vol. 27, pp. 2550-2563.View/Download from: UTS OPUS or Publisher's site

Data imputation aims at filling in missing attribute values in databases. Most existing imputation methods to string attribute
values are inferring-based approaches, which usually fail to reach a high imputation recall by just inferring missing values from the
complete part of the data set. Recently, some retrieving-based methods are proposed to retrieve missing values from external
resources such as the World Wide Web, which tend to reach a much higher imputation recall, but inevitably bring a large overhead by
issuing a large number of search queries. In this paper, we investigate the interaction between the inferring-based methods and the
retrieving-based methods. We show that retrieving a small number of selected missing values can greatly improve the imputation recall
of the inferring-based methods. With this intuition, we propose an inTeractive Retrieving-Inferring data imPutation approach (TRIP),
which performs retrieving and inferring alternately in filling in missing attribute values in a data set. To ensure the high recall at the
minimum cost, TRIP faces a challenge of selecting the least number of missing values for retrieving to maximize the number of
inferable values. Our proposed solution is able to identify an optimal retrieving-inferring scheduling scheme in deterministic data
imputation, and the optimality of the generated scheme is theoretically analyzed with proofs. We also analyze with an example that the
optimal scheme is not feasible to be achieved in t-constrained stochastic data imputation (t-SDI), but still, our proposed solution
identifies an expected-optimal scheme in t-SDI. Extensive experiments on four data collections show that TRIP retrieves on average
20 percent missing values and achieves the same high recall that was reached by the retrieving-based approach.

Wen, D., Qin, L., Zhang, Y., Lin, X. & Yu, J.X. 2015, 'I/O Efficient Core Graph Decomposition at Web Scale.',

*CoRR*, vol. abs/1511.00367. Zhu, Y., Yu, J.X. & Qin, L. 2014, 'Leveraging graph dimensions in online graph search',

View/Download from: UTS OPUS

*Proceedings of the VLDB Endowment*, vol. 8, no. 1, pp. 85-96.View/Download from: UTS OPUS

Graphs have been widely used due to its expressive power to model complicated relationships. However, given a graph database D G = {g 1 ; g 2 ; , g n }, it is challenging to process graph queries since a basic graph query usually involves costly graph operations such as maximum common subgraph and graph edit distance computation, which are NP-hard. In this paper, we study a novel DS-preserved mapping which maps graphs in a graph database D G onto a multidimensional space M G under a structural dimension Musing a mapping function (). The DS-preserved mapping preserves two things: distance and structure. By the distance-preserving, it means that any two graphs g i and g j in D G must map to two data objects (g i ) and (g j ) in M G , such that the distance, d((g i ); (g j ), between (g i ) and (g j ) in M G approximates the graph dissimilarity (g i ; g j ) in D G . By the structure-preserving, it further means that for a given unseen query graph q, the distance between q and any graph g i in D G needs to be preserved such that (q; g i ) d((q); (g i )). We discuss the rationality of using graph dimension M for online graph processing, and show how to identify a small set of subgraphs to form M efficiently. We propose an iterative algorithm DSPM to compute the graph dimension, and discuss its optimization techniques. We also give an approximate algorithm DSPMap in order to handle a large graph database. We conduct extensive performance studies on both real and synthetic datasets to evaluate the top-k similarity query which is to find top-k similar graphs from D G for a query graph, and show the effectiveness and efficiency of our approaches. © 2014 VLDB.

Chang, L., Yu, J.X. & Qin, L. 2013, 'Fast Maximal Cliques Enumeration in Sparse Graphs',

View/Download from: UTS OPUS or Publisher's site

*Algorithmica*, vol. 66, no. 1, pp. 173-186.View/Download from: UTS OPUS or Publisher's site

In this paper, we consider the problem of generating all maximal cliques in a sparse graph in polynomial delay. Given a graph G=(V,E) with n vertices and m edges, the latest and fastest polynomial delay algorithm for sparse graphs enumerates all maximal cliques in O(? 4) time delay, where ? is the maximum degree of vertices. However, it requires an O(nm) preprocessing time. We improve it in two aspects. First, our algorithm does not need preprocessing. Therefore, our algorithm is a truly polynomial delay algorithm. Second, our algorithm enumerates all maximal cliques in O(?H 3) time delay, where H is the so called H-value of a graph or equivalently it is the smallest integer satisfying |{v?V|d(v)=H}|=H given d(v) as the degree of a vertex. In real-world network data, H usually is a small value and much smaller than delta

Zhu, Y., Qin, L., Yu, J.X., Ke, Y. & Lin, X. 2013, 'High Efficiency and Quality: Large Graphs Matching',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 22, no. 3, pp. 345-368.View/Download from: UTS OPUS or Publisher's site

Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. To find exact matching, it cannot a graph with more than 30 nodes. To find an approximate matching, the quality can be very poor. We propose a novel two-step approach that can efficiently match two large graphs over thousands of nodes with high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We give the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We further show how to extend our solution to handle labeled graphs. We conducted extensive testing using real and synthetic datasets and report our findings in this paper.

Qiao, M., Cheng, H., Qin, L., Yu, J.X., Yu, P. & Chang, L. 2013, 'Computing weight constraint reachability in large networks',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 22, no. 3, pp. 275-294.View/Download from: UTS OPUS or Publisher's site

Abstract Reachability is a fundamental problem on large-scale networks emerging nowadays in various application domains, such as social networks, communication networks, biological networks, road networks, etc. It has been studied extensively. However, little existing work has studied reachability with realistic constraints imposed on graphs with real-valued edge or node weights. In fact, such weights are very common in many real-world networks, for example, the bandwidth of a link in communication networks, the reliability of an interaction between two proteins in PPI networks, and the handling capacity of a warehouse/storage point in a distribution network. In this paper, we formalize a new yet important reachability query in weighted undirected graphs, called weight constraint reachability (WCR) query that asks: is there a path between nodes a and b, on which each real-valued edge (or node) weight satisfies a range constraint. We discover an interesting property of WCR, based on which, we design a novel edge-based index structure to answer the WCR query in O(1) time. Furthermore, we consider the case when the index cannot entirely fit in the memory, which can be very common for emerging massive networks. An I/O-efficient index is proposed, which provides constant I/O (precisely four I/Os) query time with O(|V|log|V|) disk-based index size. Extensive experimental studies on both real and synthetic datasets demonstrate the efficiency and scalability of our solutions in answering the WCR query.

Chang, L., Yu, J.X., Qin, L., Cheng, H. & Qiao, M. 2012, 'The Exact Distance to Destination in Undirected World',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 21, no. 6, pp. 869-888.View/Download from: UTS OPUS or Publisher's site

Shortest distance queries are essential not only in graph analysis and graph mining tasks but also in database applications, when a large graph needs to be dealt with. Such shortest distance queries are frequently issued by end-users or requested as a subroutine in real applications. For intensive queries on large graphs, it is impractical to compute shortest distances on-line from scratch, and impractical to materialize all-pairs shortest distances. In the literature, 2-hop distance labeling is proposed to index the all-pairs shortest distances. It assigns distance labels to vertices in a large graph in a pre-computing step off-line and then answers shortest distance queries on-line by making use of such distance labels, which avoids exhaustively traversing the large graph when answering queries. However, the existing algorithms to generate 2-hop distance labels are not scalable to large graphs. Finding an optimal 2-hop distance labeling is NP-hard, and heuristic algorithms may generate large size distance labels while still needing to pre-compute all-pairs shortest paths. In this paper, we propose a multi-hop distance labeling approach, which generates a subset of the 2-hop distance labels as index off-line. We can compute the multi-hop distance labels efficiently by avoiding pre-computing all-pairs shortest paths. In addition, our multi-hop distance labeling is small in size to be stored. To answer a shortest distance query between two vertices, we first generate the query-specific small set of 2-hop distance labels for the two vertices based on our multi-hop distance labels stored and compute the shortest distance between the two vertices based on the 2-hop distance labels generated on-line. We conducted extensive performance studies on large real graphs and confirmed the efficiency of our multi-hop distance labeling scheme.

Qin, L., Yu, J.X. & Chang, L. 2012, 'Computing Structural Statistics by Keywords in Databases',

View/Download from: UTS OPUS or Publisher's site

*IEEE Transactions On Knowledge And Data Engineering*, vol. 24, no. 10, pp. 1731-1746.View/Download from: UTS OPUS or Publisher's site

Keyword search in RDBs has been extensively studied in recent years. The existing studies focused on finding all or top-k interconnected tuple-structures that contain keywords. In reality, the number of such interconnected tuple-structures for a keyword query can be large. It becomes very difficult for users to obtain any valuable information more than individual interconnected tuple-structures. Also, it becomes challenging to provide a similar mechanism like group-&-aggregate for those interconnected tuple-structures. In this paper, we study computing structural statistics keyword queries by extending the group-&-aggregate framework. We consider an RDB as a large directed graph where nodes represent tuples, and edges represent the links among tuples. Instead of using tuples as a member in a group, we consider rooted subgraphs. Such a rooted subgraph represents an interconnected tuple-structure among tuples and some of the tuples contain keywords. The dimensions of the rooted subgraphs are determined by dimensional keywords in a data driven fashion. Two rooted subgraphs are grouped into the same group if they are isomorphic based on the dimensions or in other words the dimensional keywords. The scores of the rooted subgraphs are computed by a user-given score function if the rooted subgraphs contain some of general keywords. Here, the general keywords are used to compute scores rather than determining dimensions. The aggregates are computed using an sql aggregate function for every group based on the scores computed. We give our motivation using a real data set. We propose new approaches to compute structural statistics keyword queries, perform extensive performance studies using two large real data sets and a large synthetic data set, and confirm the effectiveness and efficiency of our approach.

Qin, L., Yu, J.X. & Chang, L. 2012, 'Diversifying Top-K Results.',

*PVLDB*, vol. 5, pp. 1124-1135. Qin, L., Yu, J.X. & Chang, L. 2011, 'Scalable Keyword Search on Large Data Streams',

View/Download from: UTS OPUS or Publisher's site

*VLDB Journal*, vol. 20, no. 1, pp. 35-57.View/Download from: UTS OPUS or Publisher's site

It is widely recognized that the integration of information retrieval (IR) and database (DB) techniques provides users with a broad range of high quality services. Along this direction, IR-styled m-keyword query processing over a relational database in an rdbms framework has been well studied. It finds all hidden interconnected tuple structures, for example connected trees that contain keywords and are interconnected by sequences of primary/foreign key relationships among tuples. A new challenging issue is how to monitor events that are implicitly interrelated over an open-ended relational data stream for a user-given m-keyword query. Such a relational data stream is a sequence of tuple insertion/deletion operations. The difficulty of the problem is related to the number of costly joins to be processed over time when tuples are inserted and/or deleted. Such cost is mainly affected by three parameters, namely, the number of keywords, the maximum size of interconnected tuple structures, and the complexity of the database schema when it is viewed as a schema graph. In this paper, we propose new approaches. First, we propose a novel algorithm to efficiently determine all the joins that need to be processed for answering an m-keyword query. Second, we propose a new demand-driven approach to process such a query over a high speed relational data stream. We show that we can achieve high efficiency by significantly reducing the number of intermediate results when processing joins over a relational data stream. The proposed new techniques allow us to achieve high scalability in terms of both query plan generation and query plan execution. We conducted extensive experimental studies using synthetic data and real data to simulate a relational data stream. Our approach significantly outperforms existing algorithms.

Chang, L., Yu, J.X. & Qin, L. 2010, 'Context-Sensitive Document Ranking',

View/Download from: UTS OPUS or Publisher's site

*Journal Of Computer Science And Technology*, vol. 25, no. 3, pp. 444-457.View/Download from: UTS OPUS or Publisher's site

Ranking is a main research issue in IR-styled keyword search over a set of documents. In this paper, we study a new keyword search problem, called context-sensitive document ranking, which is to rank documents with an additional context that provides additional information about the application domain where the documents are to be searched and ranked. The work is motivated by the fact that additional information associated with the documents can possibly assist users to find more relevant documents when they are unable to find the needed documents from the documents alone. In this paper, a context is a multi-attribute graph, which can represent any information maintained in a relational database, where multi-attribute nodes represent tuples, and edges represent primary key and foreign key references among nodes. The context-sensitive ranking is related to several research issues, how to score documents, how to evaluate the additional information obtained in the context that may contribute to the document ranking, how to rank the documents by combining the scores/costs from the documents and the context. More importantly, the relationships between documents and the information stored in a relational database may be uncertain, because they are from different data sources and the relationships are determined systematically using similarity match which causes uncertainty. In this paper, we concentrate ourselves on these research issues, and provide our solution on how to rank the documents in a context where there exist uncertainty between the documents and the context. We confirm the effectiveness of our approaches by conducting extensive experimental studies using real datasets. We present our findings in this paper.

Yu, J.X., Qin, L. & Chang, L. 2010, 'Keyword Search in Relational Databases: A Survey.',

*IEEE Data Eng. Bull.*, vol. 33, pp. 67-78. Fan, W., Yu, J.X., Li, J., Ding, B. & Qin, L. 2009, 'Query Translation from XPath to SQL in the Presence of Recursive DTDS',

View/Download from: UTS OPUS or Publisher's site

*The VLDB Journal*, vol. 18, no. 4, pp. 857-883.View/Download from: UTS OPUS or Publisher's site

We study the problem of evaluating xpath queries over xml data that is stored in an rdbms via schema-based shredding. The interaction between recursion (descendants-axis) in xpath queries and recursion in dtds makes it challenging to answer xpath queries using rdbms. We present a new approach to translating xpath queries into sql queries based on a notion of extended XP ath expressions and a simple least fixpoint (lfp) operator. Extended xpath expressions are a mild extension of xpath, and the lfp operator takes a single input relation and is already supported by most commercial rdbms. We show that extended xpath expressions are capable of capturing both dtd recursion and xpath queries in a uniform framework. Furthermore, they can be translated into an equivalent sequence of sql queries with the lfp operator. We present algorithms for rewriting xpath queries over a (possibly recursive) dtd into extended xpath expressions and for translating extended xpath expressions to sql queries, as well as optimization techniques. The novelty of our approach consists in its capability to answer a large class of xpath queries by means of only low-end rdbms features already available in most rdbms, as well as its flexibility to accommodate existing relational query optimization techniques. In addition, these translation algorithms provide a solution to query answering for certain (possibly recursive) xml views of xml data. Our experimental results verify the effectiveness of our techniques.

Chang, L., Yu, J.X. & Qin, L. 2009, 'Fast Probabilistic Ranking under x-Relation Model',

*CoRR*, vol. abs/0906.4927. Qin, L., Yu, J.X., Ding, B. & Ishikawa, Y. 2008, 'Monitoring Aggregate k-NN Objects in Road Networks',

View/Download from: UTS OPUS or Publisher's site

*Lecture Notes in Computer Science*, vol. 5069, no. 1, pp. 168-186.View/Download from: UTS OPUS or Publisher's site

In recent years, there is an increasing need to monitor k nearest neighbor (k-NN) in a road network. There are existing solutions on either monitoring k-NN objects from a single query point over a road network, or computing the snapshot k-NN objects over a road network to minimize an aggregate distance function with respect to multiple query points. In this paper, we study a new problem that is to monitor k-NN objects over a road network from multiple query points to minimize an aggregate distance function with respect to the multiple query points. We call it a continuous aggregate k-NN (CANN) query. We propose a new approach that can significantly reduce the cost of computing network distances when monitoring aggregate k-NN objects on road networks. We conducted extensive experimental studies and confirmed the efficiency of our algorithms.

Qin, L., Yu, J.X., Ding, B. & Wang, S. 2007, 'TwigList - Make Twig Pattern Matching Fast',

View/Download from: UTS OPUS or Publisher's site

*Lecture Notes in Computer Science*, vol. 4443, no. 1, pp. 850-862.View/Download from: UTS OPUS or Publisher's site

Twig pattern matching problem has been widely studied in recent years. Give an XML tree t. A twig-pattern matching query, Q, represented as a query tree, is to find all the occurrences of such twig pattern in t. Previous works like HolisticTwig and TJFast decomposed the twig pattern into single paths from root to leaves, and merged all the occurrences of such path-patterns to find the occurrences of the twig-pattern matching query, Q. Their techniques can effectively prune impossible path-patterns to avoid producing a large amount of intermediate results. But they still need to merge path-patterns which occurs high computational cost. Recently, Twig2Stack was proposed to overcome this problem using hierarchical-stacks to further reduce the merging cost. But, due to the complex hierarchical-stacks Twig2Stack used, Twig2Stack may end up many random accesses in memory, and need to load the whole XML tree into memory in the worst case. In this paper, we propose a new algorithm, called TwigList, which uses simple lists. Both time and space complexity of our algorithm are linear with respect to the total number of pattern occurrences and the size of XML tree. In addition, our algorithm can be easily modified as an external algorithm. We conducted extensive experimental studies using large benchmark and real datasets. Our algorithm significantly outperforms the up-to-date algorithm.

Peng, Z., Zhang, J., Wang, S. & Qin, L. 2006, 'TreeCluster: Clustering Results of Keyword Search over Databases',

View/Download from: UTS OPUS or Publisher's site

*Lecture Notes in Computer Science*, vol. 4016, pp. 385-396.View/Download from: UTS OPUS or Publisher's site

A critical challenge in keyword search over relational data- bases (KSORD) is to improve its result presentation to facilitate users quick browsing through search results. An effective method is to organize the results into clusters. However, traditional clustering method is not applicable to KSORD search results. In this paper, we propose a novel clustering method named TreeCluster. In the first step, we use labels to represent schema information of each result tree and reformulate the clustering problem as a problem of judging whether labeled trees are isomorphic. In the second step, we rank user keywords according to their frequencies in databases, and further partition the large clusters based on keyword nodes. Furthermore, we give each cluster a readable description, and present the description and each result graphically to help users understand the results more easily. Experimental results verify our methods effectiveness and efficiency.

**Selected Peer-Assessed Projects**