## Biography

In July of 2006, I received my bachelor degree from department of Computer Science and Technology in Renmin University of China (RUC). In August of 2010, I received the Ph.D degree in the Chinese University of Hong Kong (CUHK). My thesis adviser is Prof. Jeffrey Xu Yu. I have been a postdoc in the Chinese University of Hong Kong from August 2010 to August 2013. I am now a senior lecturer and a core member in the Centre of Artificial Intelligence (CAI) in University of Technology, Sydney (UTS). My research interests include Big Graph Analytics and Graph Query Processing.

#### Publications

Yu, J.X., Qin, L. & Chang, L. 2010, *Keyword Search in Databases*, 1, Morgan & Claypool Publishers, NA.View/Download from: UTS OPUS or Publisher's site

#### View description

It has become highly desirable to provide users with ?exible ways to query/search information over databases as simple as keyword search like Google search. This book surveys the recent developments on keyword search over databases, and focuses on ?nding structural information among objects in a database using a set of keywords. Such structural information to be returned can be either trees or subgraphs representing how the objects, that contain the required keywords, are interconnected in a relational database or in an XML database. The structural keyword search is completely different from ?nding documents that contain all the user-given keywords. The former focuses on the interconnected object structures, whereas the latter focuses on the object content. The book is organized as follows. In Chapter 1, we highlight the main research issues on the structural keyword search in different contexts. In Chapter 2, we focus on supporting structural keyword search in a relational database management system using the SQL query language. We concentrate on how to generate a set of SQL queries that can ?nd all the structural information among records in a relational database completely, and how to evaluate the generated set of SQL queries ef?ciently.In Chapter 3,we discuss graph algorithms for structural keyword search by treating an entire relational database as a large data graph. In Chapter 4, we discuss structural keyword search in a large tree-structured XML database.In Chapter 5,we highlight several interesting research issues regarding keyword search on databases. The book can be used as either an extended survey for people who are interested in the structural keyword search or a reference book for a postgraduate course on the related topics.

Chang, L., Li, W., Qin, L., Zhang, W. & Yang, S. 2017, 'PSCAN : Fast and Exact Structural Graph Clustering', *IEEE Transactions on Knowledge and Data Engineering*, vol. 29, no. 2, pp. 387-401.View/Download from: Publisher's site

#### View description

© 2016 IEEE. We study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given an undirected unweighted graph, structural graph clustering is to assign vertices to clusters, and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected. In this paper, we develop a new two-step paradigm for scalable structural graph clustering based on our three observations. Then, we present a pSCAN approach, within the paradigm, aiming to reduce the number of structural similarity computations, and propose optimization techniques to speed up checking whether two vertices are structure-similar. pSCAN outputs exactly the same clusters as the existing approaches SCAN and SCAN++, and we prove that pSCAN is worst-case optimal. Moreover, we propose efficient techniques for updating the clusters when the input graph dynamically changes, and we also extend our techniques to other similarity measures, e.g., Jaccard similarity. Performance studies on large real and synthetic graphs demonstrate the efficiency of our new approach and our dynamic cluster maintenance techniques. Noticeably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.

Lai, L., Qin, L., Lin, X. & Chang, L. 2017, 'Scalable subgraph enumeration in MapReduce: a cost-oriented approach', *VLDB Journal*, vol. 26, no. 3, pp. 421-446.View/Download from: Publisher's site

#### View description

© 2017, Springer-Verlag Berlin Heidelberg. Subgraph enumeration, which aims to find all the subgraphs of a large data graph that are isomorphic to a given pattern graph, is a fundamental graph problem with a wide range of applications. However, existing sequential algorithms for subgraph enumeration fall short in handling large graphs due to the involvement of computationally intensive subgraph isomorphism operations. Thus, some recent researches focus on solving the problem using MapReduce. Nevertheless, exiting MapReduce approaches are not scalable to handle very large graphs since they either produce a huge number of partial results or consume a large amount of memory. Motivated by this, in this paper, we propose a new algorithm TwinTwigJoin based on a left-deep-join framework in MapReduce, in which the basic join unit is a TwinTwig (an edge or two incident edges of a node). We show that in the Erdös–Rényi random graph model, TwinTwigJoin is instance optimal in the left-deep-join framework under reasonable assumptions, and we devise an algorithm to compute the optimal join plan. We further discuss how our approach can be adapted to handle the power-law random graph model. Three optimization strategies are explored to improve our algorithm. Ultimately, by aggregating equivalent nodes into a compressed node, we construct the compressed graph, upon which the subgraph enumeration is further improved. We conduct extensive performance studies in several real graphs, one of which contains billions of edges. Our approach significantly outperforms existing solutions in all tests.

Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2017, 'Finding influential communities in massive networks', *VLDB Journal*, vol. 26, no. 6, pp. 751-776.View/Download from: Publisher's site

#### View description

© 2017, Springer-Verlag Berlin Heidelberg. Community search is a problem of finding densely connected subgraphs that satisfy the query conditions in a network, which has attracted much attention in recent years. However, all the previous studies on community search do not consider the influence of a community. In this paper, we introduce a novel community model called k-influential community based on the concept of k-core to capture the influence of a community. Based on this community model, we propose a linear time online search algorithm to find the top-rk-influential communities in a network. To further speed up the influential community search algorithm, we devise a linear space data structure which supports efficient search of the top-rk-influential communities in optimal time. We also propose an efficient algorithm to maintain the data structure when the network is frequently updated. Additionally, we propose a novel I/O-efficient algorithm to find the top-rk-influential communities in a disk-resident graph under the assumption of U= O(n) , where U and n denote the size of the main memory and the number of nodes, respectively. Finally, we conduct extensive experiments on six real-world massive networks, and the results demonstrate the efficiency and effectiveness of the proposed methods.

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2017, 'I/O efficient ECC graph decomposition via graph reduction', *VLDB Journal*, vol. 26, no. 2, pp. 275-300.View/Download from: Publisher's site

#### View description

© 2016 Springer-Verlag Berlin HeidelbergThe problem of computing k-edge connected components (k-(Formula presented.)s) of a graph G for a specific k is a fundamental graph problem and has been investigated recently. In this paper, we study the problem of (Formula presented.) decomposition, which computes the k-(Formula presented.)s of a graph G for all possible k values. (Formula presented.) decomposition can be widely applied in a variety of applications such as graph-topology analysis, community detection, Steiner Component Search, and graph visualization. A straightforward solution for (Formula presented.) decomposition is to apply the existing k-(Formula presented.) computation algorithm to compute the k-(Formula presented.)s for all k values. However, this solution is not applicable to large graphs for two challenging reasons. First, all existing k-(Formula presented.) computation algorithms are highly memory intensive due to the complex data structures used in the algorithms. Second, the number of possible k values can be very large, resulting in a high computational cost when each k value is independently considered. In this paper, we address the above challenges, and study I/O efficient (Formula presented.) decomposition via graph reduction. We introduce two elegant graph reduction operators which aim to reduce the size of the graph loaded in memory while preserving the connectivity information of a certain set of edges to be computed for a specific k. We also propose three novel I/O efficient algorithms, (Formula presented.)-(Formula presented.), (Formula presented.)-(Formula presented.), and (Formula presented.), that explore the k values in different orders to reduce the redundant computations between different k values. We analyze the I/O and memory costs for all proposed algorithms. In addition, we extend our algorithm to build an efficient index for Steiner Component Search. We show that our index can be used to perform Steiner Component Search in o...

Yuan, L., Qin, L., Zhang, W., Chang, L. & Yang, J. 2017, 'Index-based Densest Clique Percolation Community Search In Networks', *IEEE Transactions on Knowledge and Data Engineering*.View/Download from: Publisher's site

#### View description

IEEE Community search is important in graph analysis and can be used in many applications. In the literature, various community models have been proposed. However, most of them cannot well identify the overlaps between communities which is an essential feature of real graphs. To address this issue, < formula > < tex > $k$ < /tex > < /formula > -clique percolation community model was proposed and has been proven effective in many applications. Motivated by this, in this paper, we adopt the < formula > < tex > $k$ < /tex > < /formula > -clique percolation community model and study the densest clique percolation community search problem which aims to find the < formula > < tex > $k$ < /tex > < /formula > -clique percolation community with the maximum k value that contains a given set of query nodes. We adopt an index-based approach to solve this problem. Based on the observation that a < formula > < tex > $k$ < /tex > < /formula > -clique percolation community is a union of maximal cliques, we devise a compact index, DCPC-Index, to preserve the maximal cliques and their connectivity information of the input graph. With DCPC-Index, we can answer the densest clique percolation community query efficiently. Besides, we also propose an index construction algorithm based on the definition of DCPC-Index and further improve the algorithm in terms of efficiency and memory consumption. The experimental results demonstrate the efficiency of our index-based query processing algorithm and index construction algorithm.

Zhu, Y., Zhang, H., Qin, L. & Cheng, H. 2017, 'Efficient MapReduce algorithms for triangle listing in billion-scale graphs', *Distributed and Parallel Databases*, vol. 35, no. 2, pp. 149-176.View/Download from: Publisher's site

#### View description

© 2017, Springer Science+Business Media New York. This paper addresses the classical triangle listing problem, which aims at enumerating all the tuples of three vertices connected with each other by edges. This problem has been intensively studied in internal and external memory, but it is still an urgent challenge in distributed environment where multiple machines across the network can be utilized to achieve good performance and scalability. As one of the de facto computing methodologies in distributed environment, MapReduce has been used in some of existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders their scalability on large scale graphs. In this paper, we propose a new triangle listing algorithm in MapReduce, FTL, which utilizes a light weight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and also is equipped with multiple-round techniques to ease the burden on memory and network bandwidth when dealing with graphs at billion scale. We prove that the size of the intermediate data can be well bounded near to the number of triangles in the graph. To further reduce the shuffle size and memory cost, we also propose improved algorithms based on a compact data structure, and present several optimization techniques to accelerate the computation and reduce the memory consumption. The extensive experimental results show that our algorithms outperform existing competitors by several times on both synthetic graphs and real world graphs.

Zhang, F., Zhang, W., Zhang, Y., Qin, L. & Lin, X. 2017, 'OLAK: An Efficient Algorithm to Prevent Unraveling in Social Networks', *Proceedings of the VLDB Endowment*, vol. 10, no. 6, pp. 649-660.View/Download from: UTS OPUS

#### View description

In this paper, we study the problem of the anchored k-core.

Given a graph G, an integer k and a budget b, we aim to

identify b vertices in G so that we can determine the largest

induced subgraph J in which every vertex, except the b vertices,

has at least k neighbors in J. This problem was introduced

by Bhawalkar and Kleinberg et al. in the context of

user engagement in social networks, where a user may leave

a community if he/she has less than k friends engaged. The

problem has been shown to be NP-hard and inapproximable.

A polynomial-time algorithm for graphs with bounded tree width

has been proposed. However, this assumption usually

does not hold in real-life graphs, and their techniques cannot

be extended to handle general graphs.

Motivated by this, we propose an efficient algorithm,

namely onion-layer based anchored k-core (OLAK), for the

anchored k-core problem on large scale graphs. To facilitate

computation of the anchored k-core, we design an onion layer

structure, which is generated by a simple onion-peeling like

algorithm against a small set of vertices in the graph.

We show that computation of the best anchor can simply

be conducted upon the vertices on the onion layers, which

significantly reduces the search space. Based on the well organized

layer structure, we develop efficient candidates

exploration, early termination and pruning techniques to

further speed up computation. Comprehensive experiments

on 10 real-life graphs demonstrate the effectiveness and efficiency of our proposed methods.

Zhang, F., Zhang, Y., Qin, L., Zhang, W. & Lin, X. 2017, 'When Engagement Meets Similarity: Efficient (k, r)-Core Computation on Social Networks', *Proceedings of the VLDB Endowment*, vol. 10, no. 10, pp. 998-1009.View/Download from: UTS OPUS

#### View description

In this paper, we investigate the problem of (k,r)-core which intends to find cohesive subgraphs on social networks considering both user engagement and similarity perspectives. In particular, we adopt the popular concept of k-core to guarantee the engagement of the users (vertices) in a group (subgraph) where each vertex in a (k,r)-core connects to at least k other vertices. Meanwhile, we consider the pairwise similarity among users based on their attributes. Efficient algorithms are proposed to enumerate all maximal (k,r)-cores and find the maximum (k,r)-core, where both problems are shown to be NP-hard. Effective pruning techniques substantially reduce the search space of two algorithms. A novel (k,k')-core based (k,r)-core size upper bound enhances performance of the maximum (k,r)-core computation. We also devise effective search orders for two mining algorithms where search priorities for vertices are different. Comprehensive experiments on real-life data demonstrate that the maximal/maximum (k,r)-cores enable us to find interesting cohesive subgraphs, and performance of two mining algorithms is effectively improved by proposed techniques.

Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2016, 'Optimal Multi-Meeting-Point Route Search', *IEEE Transactions on Knowledge and Data Engineering*, vol. 28, no. 3, pp. 770-784.View/Download from: UTS OPUS or Publisher's site

#### View description

© 1989-2012 IEEE. Real-time ride-sharing applications (e.g., Uber and Lyft) are very popular in recent years. Motivated by the ride-sharing application, we propose a new type of query in road networks, called the optimal multi-meeting-point route (OMMPR) query. Given a road network G , a source node s , a target node t , and a set of query nodes U, the OMMPR query aims at finding the best route starting from s and ending at t such that the weighted average cost between the cost of the route and the total cost of the shortest paths from every query node to the route is minimized. We show that the problem of computing the OMMPR query is NP-hard. To answer the OMMPR query efficiently, we propose two novel parameterized solutions based on dynamic programming (DP), with the number of query nodes l (i.e., l=|U|) as a parameter, which is typically very small in practice. The two proposed parameterized algorithms run in O(3 m+ 2 n (l+\log (n))) and O(2 (m + n (l+\log (n)))) time, respectively, where n and m denote the number of nodes and edges in graph G, thus they are tractable in practice. To reduce the search space of the DP-based algorithms, we propose two novel optimized algorithms based on bidirectional DP and a carefully-designed lower bounding technique. We conduct extensive experimental studies on four large real-world road networks, and the results demonstrate the efficiency of the proposed algorithms.

Zhang, S., Qin, L., Zheng, Y. & Cheng, H. 2016, 'Effective and Efficient: Large-Scale Dynamic City Express', *IEEE Transactions on Knowledge and Data Engineering*, vol. 28, no. 12, pp. 3203-3217.View/Download from: UTS OPUS or Publisher's site

#### View description

© 1989-2012 IEEE. Due to the large number of requirements for city express services in recent years, the current city express system is found to be unsatisfactory for both the service providers and customers. In this paper, we are the first to systematically study the large-scale dynamic city express problem. We aim to increase both the effectiveness and the efficiency of the scheduling algorithm. The challenges of the problem stem from the highly dynamic environment, the NP-completeness with respect to the number of requests, and real-Time demands for the scheduling result. We introduce a basic algorithm to assign a request to a courier on a first-come, first-served basis. To improve the effectiveness of the basic algorithm, we adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period rather than dealing with each request individually. To improve the efficiency of the algorithm, we further design a two-level priority queue structure to reduce redundant shortest distance calculation and repeated candidate generation. We develop a simulation system and conduct extensive performance studies on the real road network of Beijing city. The experimental results demonstrate the high effectiveness and efficiency of our algorithms. Remarkably, our system can achieve much better service quality and largely reduce the operation cost of a city express company simultaneously.

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2016, 'Diversified top-k clique search', *The VLDB Journal*, vol. 25, no. 2, pp. 171-196.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2015 Springer-Verlag Berlin Heidelberg Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns enormous cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem which is to find top-k cliques that can cover most number of nodes in the graph. Diversified top-k clique search can be widely used in a lot of applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight (Formula presented.)-(Formula presented.), based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. Besides, for the massive input graph, we develop an I/O efficient algorithm to tackle the problem when the input graph cannot fit in main memory. We conduct extensive performance studies on real graphs and synthetic graphs. One of the real graphs contains 1.02 billion edges. The results demonstrate the high efficiency and effectiveness of our approach.

Huang, X., Cheng, H., Li, R.-.H., Qin, L. & Yu, J.X. 2015, 'Top-K structural diversity search in large networks.', *The VLDB Journal*, vol. 24, pp. 319-343.View/Download from: UTS OPUS or Publisher's site

#### View description

Social contagion depicts a process of information (e.g., fads, opinions, news) diffusion in the online social networks. A recent study reports that in a social contagion process, the probability of contagion is tightly controlled by the number of connected components in an individual's neighborhood. Such a number is termed structural diversity of an individual, and it is shown to be a key predictor in the social contagion process. Based on this, a fundamental issue in a social network is to find top-k users with the highest structural diversities. In this paper, we, for the first time, study the top-k structural diversity search problem in a large network. Specifically, we study two types of structural diversity measures, namely, component-based structural diversity measure and core-based structural diversity measure. For component-based structural diversity, we develop an effective upper bound of structural diversity for pruning the search space. The upper bound can be incrementally refined in the search process. Based on such upper bound, we propose an efficient framework for top-k structural diversity search. To further speed up the structural diversity evaluation in the search process, several carefully devised search strategies are proposed. We also design efficient techniques to handle frequent updates in dynamic networks and maintain the top-k results. We further show how the techniques proposed in component-based structural diversity measure can be extended to handle the core-based structural diversity measure. Extensive experimental studies are conducted in real-world large networks and synthetic graphs, and the results demonstrate the efficiency and effectiveness of the proposed methods.

Li, Z., Qin, L., Cheng, H., Zhang, X. & Zhou, X. 2015, 'TRIP: An Interactive Retrieving-Inferring Data Imputation Approach.', *IEEE Transactions on Knowledge and Data Engineering*, vol. 27, pp. 2550-2563.View/Download from: UTS OPUS or Publisher's site

#### View description

Data imputation aims at filling in missing attribute values in databases. Most existing imputation methods to string attribute

values are inferring-based approaches, which usually fail to reach a high imputation recall by just inferring missing values from the

complete part of the data set. Recently, some retrieving-based methods are proposed to retrieve missing values from external

resources such as the World Wide Web, which tend to reach a much higher imputation recall, but inevitably bring a large overhead by

issuing a large number of search queries. In this paper, we investigate the interaction between the inferring-based methods and the

retrieving-based methods. We show that retrieving a small number of selected missing values can greatly improve the imputation recall

of the inferring-based methods. With this intuition, we propose an inTeractive Retrieving-Inferring data imPutation approach (TRIP),

which performs retrieving and inferring alternately in filling in missing attribute values in a data set. To ensure the high recall at the

minimum cost, TRIP faces a challenge of selecting the least number of missing values for retrieving to maximize the number of

inferable values. Our proposed solution is able to identify an optimal retrieving-inferring scheduling scheme in deterministic data

imputation, and the optimality of the generated scheme is theoretically analyzed with proofs. We also analyze with an example that the

optimal scheme is not feasible to be achieved in t-constrained stochastic data imputation (t-SDI), but still, our proposed solution

identifies an expected-optimal scheme in t-SDI. Extensive experiments on four data collections show that TRIP retrieves on average

20 percent missing values and achieves the same high recall that was reached by the retrieving-based approach.

Wen, D., Qin, L., Zhang, Y., Lin, X. & Yu, J.X. 2015, 'I/O Efficient Core Graph Decomposition at Web Scale.', *CoRR*, vol. abs/1511.00367.

Zhang, Z., Yu, J.X., Qin, L., Chang, L. & Lin, X. 2015, 'I/O efficient: computing SCCs in massive graphs', *VLDB Journal*, vol. 24, no. 2, pp. 245-270.View/Download from: UTS OPUS or Publisher's site

#### View description

A strongly connected component ((Formula presented.)) is a maximal subgraph of a directed graph (Formula presented.) in which every pair of nodes is reachable from each other in the (Formula presented.). With such a property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting every (Formula presented.) of (Formula presented.) to a node in DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all (Formula presented.)s in a directed graph (Formula presented.) is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all (Formula presented.)s in linear time with respect to the size of a graph. However, when a graph cannot reside entirely in the main memory, the existing external or semi-external algorithms to find all (Formula presented.)s have limitation to achieve high I/O efficiency. In this paper, we study new I/O-efficient semi-external algorithms to find all (Formula presented.)s for a massive directed graph (Formula presented.) that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS-based semi-external algorithm that heavily relies on a total order, we explore a weak order based on which we investigate new algorithms. We propose a new two-phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of (Formula presented.) can be constructed in bounded number of sequential scans of (Formula presented.). In the tree search phase, it needs to sequentially scan the graph once to find all (Formula presented.)s. In addition, we propose a new single-phase algorithm, which combines the tree construction and tree search phases into a single phase, with three new optimization techniques. They are early acceptance, early rejection, ...

Zhu, Y., Yu, J.X. & Qin, L. 2014, 'Leveraging graph dimensions in online graph search', *Proceedings of the VLDB Endowment*, vol. 8, no. 1, pp. 85-96.View/Download from: UTS OPUS or Publisher's site

#### View description

Graphs have been widely used due to its expressive power to model complicated relationships. However, given a graph database D G = {g 1 ; g 2 ; , g n }, it is challenging to process graph queries since a basic graph query usually involves costly graph operations such as maximum common subgraph and graph edit distance computation, which are NP-hard. In this paper, we study a novel DS-preserved mapping which maps graphs in a graph database D G onto a multidimensional space M G under a structural dimension Musing a mapping function (). The DS-preserved mapping preserves two things: distance and structure. By the distance-preserving, it means that any two graphs g i and g j in D G must map to two data objects (g i ) and (g j ) in M G , such that the distance, d((g i ); (g j ), between (g i ) and (g j ) in M G approximates the graph dissimilarity (g i ; g j ) in D G . By the structure-preserving, it further means that for a given unseen query graph q, the distance between q and any graph g i in D G needs to be preserved such that (q; g i ) d((q); (g i )). We discuss the rationality of using graph dimension M for online graph processing, and show how to identify a small set of subgraphs to form M efficiently. We propose an iterative algorithm DSPM to compute the graph dimension, and discuss its optimization techniques. We also give an approximate algorithm DSPMap in order to handle a large graph database. We conduct extensive performance studies on both real and synthetic datasets to evaluate the top-k similarity query which is to find top-k similar graphs from D G for a query graph, and show the effectiveness and efficiency of our approaches. © 2014 VLDB.

Chang, L., Yu, J.X. & Qin, L. 2013, 'Fast Maximal Cliques Enumeration in Sparse Graphs', *Algorithmica*, vol. 66, no. 1, pp. 173-186.View/Download from: UTS OPUS or Publisher's site

#### View description

In this paper, we consider the problem of generating all maximal cliques in a sparse graph in polynomial delay. Given a graph G=(V,E) with n vertices and m edges, the latest and fastest polynomial delay algorithm for sparse graphs enumerates all maximal cliques in O(? 4) time delay, where ? is the maximum degree of vertices. However, it requires an O(nm) preprocessing time. We improve it in two aspects. First, our algorithm does not need preprocessing. Therefore, our algorithm is a truly polynomial delay algorithm. Second, our algorithm enumerates all maximal cliques in O(?H 3) time delay, where H is the so called H-value of a graph or equivalently it is the smallest integer satisfying |{v?V|d(v)=H}|=H given d(v) as the degree of a vertex. In real-world network data, H usually is a small value and much smaller than delta

Qiao, M., Cheng, H., Qin, L., Yu, J.X., Yu, P. & Chang, L. 2013, 'Computing weight constraint reachability in large networks', *VLDB Journal*, vol. 22, no. 3, pp. 275-294.View/Download from: UTS OPUS or Publisher's site

#### View description

Abstract Reachability is a fundamental problem on large-scale networks emerging nowadays in various application domains, such as social networks, communication networks, biological networks, road networks, etc. It has been studied extensively. However, little existing work has studied reachability with realistic constraints imposed on graphs with real-valued edge or node weights. In fact, such weights are very common in many real-world networks, for example, the bandwidth of a link in communication networks, the reliability of an interaction between two proteins in PPI networks, and the handling capacity of a warehouse/storage point in a distribution network. In this paper, we formalize a new yet important reachability query in weighted undirected graphs, called weight constraint reachability (WCR) query that asks: is there a path between nodes a and b, on which each real-valued edge (or node) weight satisfies a range constraint. We discover an interesting property of WCR, based on which, we design a novel edge-based index structure to answer the WCR query in O(1) time. Furthermore, we consider the case when the index cannot entirely fit in the memory, which can be very common for emerging massive networks. An I/O-efficient index is proposed, which provides constant I/O (precisely four I/Os) query time with O(|V|log|V|) disk-based index size. Extensive experimental studies on both real and synthetic datasets demonstrate the efficiency and scalability of our solutions in answering the WCR query.

Zhu, Y., Qin, L., Yu, J.X., Ke, Y. & Lin, X. 2013, 'High Efficiency and Quality: Large Graphs Matching', *VLDB Journal*, vol. 22, no. 3, pp. 345-368.View/Download from: UTS OPUS or Publisher's site

#### View description

Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. To find exact matching, it cannot a graph with more than 30 nodes. To find an approximate matching, the quality can be very poor. We propose a novel two-step approach that can efficiently match two large graphs over thousands of nodes with high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We give the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We further show how to extend our solution to handle labeled graphs. We conducted extensive testing using real and synthetic datasets and report our findings in this paper.

Chang, L., Yu, J.X., Qin, L., Cheng, H. & Qiao, M. 2012, 'The Exact Distance to Destination in Undirected World', *VLDB Journal*, vol. 21, no. 6, pp. 869-888.View/Download from: UTS OPUS or Publisher's site

#### View description

Shortest distance queries are essential not only in graph analysis and graph mining tasks but also in database applications, when a large graph needs to be dealt with. Such shortest distance queries are frequently issued by end-users or requested as a subroutine in real applications. For intensive queries on large graphs, it is impractical to compute shortest distances on-line from scratch, and impractical to materialize all-pairs shortest distances. In the literature, 2-hop distance labeling is proposed to index the all-pairs shortest distances. It assigns distance labels to vertices in a large graph in a pre-computing step off-line and then answers shortest distance queries on-line by making use of such distance labels, which avoids exhaustively traversing the large graph when answering queries. However, the existing algorithms to generate 2-hop distance labels are not scalable to large graphs. Finding an optimal 2-hop distance labeling is NP-hard, and heuristic algorithms may generate large size distance labels while still needing to pre-compute all-pairs shortest paths. In this paper, we propose a multi-hop distance labeling approach, which generates a subset of the 2-hop distance labels as index off-line. We can compute the multi-hop distance labels efficiently by avoiding pre-computing all-pairs shortest paths. In addition, our multi-hop distance labeling is small in size to be stored. To answer a shortest distance query between two vertices, we first generate the query-specific small set of 2-hop distance labels for the two vertices based on our multi-hop distance labels stored and compute the shortest distance between the two vertices based on the 2-hop distance labels generated on-line. We conducted extensive performance studies on large real graphs and confirmed the efficiency of our multi-hop distance labeling scheme.

Qin, L., Yu, J.X. & Chang, L. 2012, 'Computing Structural Statistics by Keywords in Databases', *IEEE Transactions On Knowledge And Data Engineering*, vol. 24, no. 10, pp. 1731-1746.View/Download from: UTS OPUS or Publisher's site

#### View description

Keyword search in RDBs has been extensively studied in recent years. The existing studies focused on finding all or top-k interconnected tuple-structures that contain keywords. In reality, the number of such interconnected tuple-structures for a keyword query can be large. It becomes very difficult for users to obtain any valuable information more than individual interconnected tuple-structures. Also, it becomes challenging to provide a similar mechanism like group-&-aggregate for those interconnected tuple-structures. In this paper, we study computing structural statistics keyword queries by extending the group-&-aggregate framework. We consider an RDB as a large directed graph where nodes represent tuples, and edges represent the links among tuples. Instead of using tuples as a member in a group, we consider rooted subgraphs. Such a rooted subgraph represents an interconnected tuple-structure among tuples and some of the tuples contain keywords. The dimensions of the rooted subgraphs are determined by dimensional keywords in a data driven fashion. Two rooted subgraphs are grouped into the same group if they are isomorphic based on the dimensions or in other words the dimensional keywords. The scores of the rooted subgraphs are computed by a user-given score function if the rooted subgraphs contain some of general keywords. Here, the general keywords are used to compute scores rather than determining dimensions. The aggregates are computed using an sql aggregate function for every group based on the scores computed. We give our motivation using a real data set. We propose new approaches to compute structural statistics keyword queries, perform extensive performance studies using two large real data sets and a large synthetic data set, and confirm the effectiveness and efficiency of our approach.

Qin, L., Yu, J.X. & Chang, L. 2012, 'Diversifying Top-K Results.', *PVLDB*, vol. 5, pp. 1124-1135.

Qin, L., Yu, J.X. & Chang, L. 2011, 'Scalable Keyword Search on Large Data Streams', *VLDB Journal*, vol. 20, no. 1, pp. 35-57.View/Download from: UTS OPUS or Publisher's site

#### View description

It is widely recognized that the integration of information retrieval (IR) and database (DB) techniques provides users with a broad range of high quality services. Along this direction, IR-styled m-keyword query processing over a relational database in an rdbms framework has been well studied. It finds all hidden interconnected tuple structures, for example connected trees that contain keywords and are interconnected by sequences of primary/foreign key relationships among tuples. A new challenging issue is how to monitor events that are implicitly interrelated over an open-ended relational data stream for a user-given m-keyword query. Such a relational data stream is a sequence of tuple insertion/deletion operations. The difficulty of the problem is related to the number of costly joins to be processed over time when tuples are inserted and/or deleted. Such cost is mainly affected by three parameters, namely, the number of keywords, the maximum size of interconnected tuple structures, and the complexity of the database schema when it is viewed as a schema graph. In this paper, we propose new approaches. First, we propose a novel algorithm to efficiently determine all the joins that need to be processed for answering an m-keyword query. Second, we propose a new demand-driven approach to process such a query over a high speed relational data stream. We show that we can achieve high efficiency by significantly reducing the number of intermediate results when processing joins over a relational data stream. The proposed new techniques allow us to achieve high scalability in terms of both query plan generation and query plan execution. We conducted extensive experimental studies using synthetic data and real data to simulate a relational data stream. Our approach significantly outperforms existing algorithms.

Chang, L., Yu, J.X. & Qin, L. 2010, 'Context-Sensitive Document Ranking', *Journal Of Computer Science And Technology*, vol. 25, no. 3, pp. 444-457.View/Download from: UTS OPUS or Publisher's site

#### View description

Ranking is a main research issue in IR-styled keyword search over a set of documents. In this paper, we study a new keyword search problem, called context-sensitive document ranking, which is to rank documents with an additional context that provides additional information about the application domain where the documents are to be searched and ranked. The work is motivated by the fact that additional information associated with the documents can possibly assist users to find more relevant documents when they are unable to find the needed documents from the documents alone. In this paper, a context is a multi-attribute graph, which can represent any information maintained in a relational database, where multi-attribute nodes represent tuples, and edges represent primary key and foreign key references among nodes. The context-sensitive ranking is related to several research issues, how to score documents, how to evaluate the additional information obtained in the context that may contribute to the document ranking, how to rank the documents by combining the scores/costs from the documents and the context. More importantly, the relationships between documents and the information stored in a relational database may be uncertain, because they are from different data sources and the relationships are determined systematically using similarity match which causes uncertainty. In this paper, we concentrate ourselves on these research issues, and provide our solution on how to rank the documents in a context where there exist uncertainty between the documents and the context. We confirm the effectiveness of our approaches by conducting extensive experimental studies using real datasets. We present our findings in this paper.

Yu, J.X., Qin, L. & Chang, L. 2010, 'Keyword Search in Relational Databases: A Survey.', *IEEE Data Eng. Bull.*, vol. 33, pp. 67-78.

Chang, L., Yu, J.X. & Qin, L. 2009, 'Fast Probabilistic Ranking under x-Relation Model', *CoRR*, vol. abs/0906.4927.

Fan, W., Yu, J.X., Li, J., Ding, B. & Qin, L. 2009, 'Query Translation from XPath to SQL in the Presence of Recursive DTDS', *The VLDB Journal*, vol. 18, no. 4, pp. 857-883.View/Download from: UTS OPUS or Publisher's site

#### View description

We study the problem of evaluating xpath queries over xml data that is stored in an rdbms via schema-based shredding. The interaction between recursion (descendants-axis) in xpath queries and recursion in dtds makes it challenging to answer xpath queries using rdbms. We present a new approach to translating xpath queries into sql queries based on a notion of extended XP ath expressions and a simple least fixpoint (lfp) operator. Extended xpath expressions are a mild extension of xpath, and the lfp operator takes a single input relation and is already supported by most commercial rdbms. We show that extended xpath expressions are capable of capturing both dtd recursion and xpath queries in a uniform framework. Furthermore, they can be translated into an equivalent sequence of sql queries with the lfp operator. We present algorithms for rewriting xpath queries over a (possibly recursive) dtd into extended xpath expressions and for translating extended xpath expressions to sql queries, as well as optimization techniques. The novelty of our approach consists in its capability to answer a large class of xpath queries by means of only low-end rdbms features already available in most rdbms, as well as its flexibility to accommodate existing relational query optimization techniques. In addition, these translation algorithms provide a solution to query answering for certain (possibly recursive) xml views of xml data. Our experimental results verify the effectiveness of our techniques.

Qin, L., Yu, J.X., Ding, B. & Ishikawa, Y. 2008, 'Monitoring Aggregate k-NN Objects in Road Networks', *Lecture Notes in Computer Science*, vol. 5069, no. 1, pp. 168-186.View/Download from: UTS OPUS or Publisher's site

#### View description

In recent years, there is an increasing need to monitor k nearest neighbor (k-NN) in a road network. There are existing solutions on either monitoring k-NN objects from a single query point over a road network, or computing the snapshot k-NN objects over a road network to minimize an aggregate distance function with respect to multiple query points. In this paper, we study a new problem that is to monitor k-NN objects over a road network from multiple query points to minimize an aggregate distance function with respect to the multiple query points. We call it a continuous aggregate k-NN (CANN) query. We propose a new approach that can significantly reduce the cost of computing network distances when monitoring aggregate k-NN objects on road networks. We conducted extensive experimental studies and confirmed the efficiency of our algorithms.

Qin, L., Yu, J.X., Ding, B. & Wang, S. 2007, 'TwigList - Make Twig Pattern Matching Fast', *Lecture Notes in Computer Science*, vol. 4443, no. 1, pp. 850-862.View/Download from: UTS OPUS or Publisher's site

#### View description

Twig pattern matching problem has been widely studied in recent years. Give an XML tree t. A twig-pattern matching query, Q, represented as a query tree, is to find all the occurrences of such twig pattern in t. Previous works like HolisticTwig and TJFast decomposed the twig pattern into single paths from root to leaves, and merged all the occurrences of such path-patterns to find the occurrences of the twig-pattern matching query, Q. Their techniques can effectively prune impossible path-patterns to avoid producing a large amount of intermediate results. But they still need to merge path-patterns which occurs high computational cost. Recently, Twig2Stack was proposed to overcome this problem using hierarchical-stacks to further reduce the merging cost. But, due to the complex hierarchical-stacks Twig2Stack used, Twig2Stack may end up many random accesses in memory, and need to load the whole XML tree into memory in the worst case. In this paper, we propose a new algorithm, called TwigList, which uses simple lists. Both time and space complexity of our algorithm are linear with respect to the total number of pattern occurrences and the size of XML tree. In addition, our algorithm can be easily modified as an external algorithm. We conducted extensive experimental studies using large benchmark and real datasets. Our algorithm significantly outperforms the up-to-date algorithm.

Peng, Z., Zhang, J., Wang, S. & Qin, L. 2006, 'TreeCluster: Clustering Results of Keyword Search over Databases', *Lecture Notes in Computer Science*, vol. 4016, pp. 385-396.View/Download from: UTS OPUS or Publisher's site

#### View description

A critical challenge in keyword search over relational data- bases (KSORD) is to improve its result presentation to facilitate users quick browsing through search results. An effective method is to organize the results into clusters. However, traditional clustering method is not applicable to KSORD search results. In this paper, we propose a novel clustering method named TreeCluster. In the first step, we use labels to represent schema information of each result tree and reformulate the clustering problem as a problem of judging whether labeled trees are isomorphic. In the second step, we rank user keywords according to their frequencies in databases, and further partition the large clusters based on keyword nodes. Furthermore, we give each cluster a readable description, and present the description and each result graphically to help users understand the results more easily. Experimental results verify our methods effectiveness and efficiency.

Chang, L., Zhang, C., Lin, X. & Qin, L. 2017, 'Scalable Top-K structural diversity search', *Proceedings - International Conference on Data Engineering*, pp. 95-98.View/Download from: Publisher's site

#### View description

© 2017 IEEE. This paper studies the problem of top-k structural diversity search, which is to compute k users with the highest structural diversities that is measured by the number of connected components in the neighborhood of a user. As the existing algorithms are not scalable for processing large graphs due to their limits, in this paper we propose a scalable algorithm Div-TriE to improve the efficiency. Div-TriE has two optimal features compared with the existing algorithms. Firstly, we show that as a key building block, we only need to enumerate each triangle at most once in Div-TriE, in contrast to the up-To three times in the existing techniques. Secondly, we develop efficient techniques so that the computation against each enumerated triangle is (amortized) constant, in contrast to the non-constant costs in the corresponding costs of the existing techniques. Extensive experimental results on real graphs show that Div-TriE outperforms the existing techniques by one order of magnitude.

Zhang, H., Zhu, Y., Qin, L., Cheng, H. & Yu, J.X. 2017, 'Efficient local clustering coefficient estimation in massive graphs', *Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)*, pp. 371-386.View/Download from: Publisher's site

#### View description

© Springer International Publishing AG 2017. Graph is a powerful tool to model interactions in disparate applications, and how to assess the structure of a graph is an essential task across all the domains. As a classic measure to characterize the connectivity of graphs, clustering coefficient and its variants are of particular interest in graph structural analysis. However, the largest of today's graphs may have nodes and edges in billion scale, which makes the simple task of computing clustering coefficients quite complicated and expensive. Thus, approximate solutions have attracted much attention from researchers recently. However, they only target global and binned degree wise clustering coefficient estimation, and their techniques are not suitable for local clustering coefficient estimation that is of great importance for individual nodes. In this paper, we propose a new sampling scheme to estimate the local clustering coefficient with error bounded, where global and binned degree-wise clustering coefficients can be considered as special cases. Meanwhile, based on our sampling scheme, we propose a new framework to estimate all the three clustering coefficients in a unified way. To make it scalable on massive graphs, we further design an efficient MapReduce algorithm under this framework. Extensive experiments validate the efficiency and effectiveness of our algorithms, which significantly outperform state-of-the-art exact and approximate algorithms on many real graph datasets.

Wang, X., Qin, L., Lin, X., Zhang, Y. & Chang, L. 2017, 'Leveraging set relations in exact set similarity join', *Proceedings of the VLDB Endowment*, pp. 1-12.View/Download from: UTS OPUS

#### View description

© 2017 VLDB Endowment. Exact set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. The existing solutions for set similarity join follow a filtering-verification framework, which generates a list of candidate pairs through scanning indexes in the filtering phase, and reports those similar pairs in the verification phase. Though much research has been conducted on this problem, set relations, which we find out is quite effective on improving the algorithm efficiency through computational cost sharing, have never been studied. Therefore, in this paper, instead of considering each set individually, we explore the set relations in different levels to reduce the overall computational costs. First, it has been shown that most of the computational time is spent on the filtering phase, which can be quadratic to the number of sets in the worst case for the existing solutions. Thus we explore index-level set relations to reduce the filtering cost to be linear to the size of the input while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm which incrementally generates the answer of one set from an already computed answer of another similar set rather than compute the answer from scratch to reduce the computational cost. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets and can achieve more than an order of magnitude speedup against the stateof- the-art algorithms.

Zhang, F., Zhang, Y., Qin, L., Zhang, W. & Lin, X. 2017, 'Finding Critical Users for Social Network Engagement: The Collapsed k-Core Problem', *Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA.*, pp. 245-251.

Zhang, Y., Yu, J.X., Zhang, Y. & Qin, L. 2017, 'A fast order-based approach for core maintenance', *Proceedings - International Conference on Data Engineering*, pp. 337-348.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2017 IEEE. Graphs have been widely used in many applications such as social networks, collaboration networks, and biological networks. One important graph analytics is to explore cohesive subgraphs in a large graph. Among several cohesive subgraphs studied, k-core is one that can be computed in linear time for a static graph. Since graphs are evolving in real applications, in this paper, we study core maintenance which is to reduce the computational cost to compute k-cores for a graph when graphs are updated from time to time dynamically. We identify drawbacks of the existing efficient algorithm, which needs a large search space to find the vertices that need to be updated, and has high overhead to maintain the index built, when a graph is updated. We propose a new order-based approach to maintain an order, called k-order, among vertices, while a graph is updated. Our new algorithm can significantly outperform the state-of-Theart algorithm up to 3 orders of magnitude for the 11 large real graphs tested. We report our findings in this paper.

Bi, F., Chang, L., Lin, X., Qin, L. & Zhang, W. 2016, 'Efficient subgraph matching by postponing Cartesian products', *Proceedings of the ACM SIGMOD International Conference on Management of Data*, International Conference on Management of Data, Association of Computing Machinery ACM, San Francisco, California, United States, pp. 1199-1214.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2016 ACM.In this paper, we study the problem of subgraph matching that extracts all subgraph isomorphic embeddings of a query graph q in a large data graph G. The existing algorithms for subgraph matching follow Ullmann's backtracking approach; that is, iteratively map query vertices to data vertices by following a matching order of query vertices. It has been shown that the matching order of query vertices is a very important aspect to the efficiency of a subgraph matching algorithm. Recently, many advanced techniques, such as enforcing connectivity and merging similar vertices in query or data graphs, have been proposed to provide an effective matching order with the aim to reduce unpromising intermediate results especially the ones caused by redundant Cartesian products. In this paper, for the first time we address the issue of unpromising results by Cartesian products from "dissimilar" vertices. We propose a new framework by postponing the Cartesian products based on the structure of a query to minimize the redundant Cartesian products. Our second contribution is proposing a new path-based auxiliary data structure, with the size O(|E(G)| |V(q)|), to generate a matching order and conduct subgraph matching, which significantly reduces the exponential size O(|V(G)||V(q)|-1) of the existing path-based auxiliary data structure, where V(G) and E(G) are the vertex and edge sets of a data graph G, respectively, and V(q) is the vertex set of a query q. Extensive empirical studies on real and synthetic graphs demonstrate that our techniques outperform the state-of-the-art algorithms by up to 3 orders of magnitude.

Chang, L., Li, W., Lin, X., Qin, L. & Zhang, W. 2016, 'PSCAN: Fast and exact structural graph clustering', *Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, IEEE International Conference on Data Engineering, IEEE, Helsinki, Finland, pp. 253-264.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2016 IEEE.In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering is to assign vertices in V to clusters and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected to each other. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs due to exhaustively computing structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering, which present opportunities for further optimization. Based on these observations, in this paper we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach aiming to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Noticeably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.

Feng, X., Chang, L., Lin, X., Qin, L. & Zhang, W. 2016, 'Computing Connected Components with linear communication cost in pregel-like systems', *2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, IEEE International Conference on Data Engineering, IEEE, Helsinki, Finland, pp. 85-96.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2016 IEEE.The paper studies two fundamental problems in graph analytics: computing Connected Components (CCs) and computing BiConnected Components (BCCs) of a graph. With the recent advent of Big Data, developing effcient distributed algorithms for computing CCs and BCCs of a big graph has received increasing interests. As with the existing research efforts, in this paper we focus on the Pregel programming model, while the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur O(m #supersteps) total costs for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since the network communication speed is usually much slower than the computation speed, communication costs are the dominant costs of the total running time in the existing techniques. In this paper, we propose a new paradigm based on graph decomposition to reduce the total communication costs from O(m#supersteps) to O(m), for both computing CCs and computing BCCs. Moreover, the total computation costs of our techniques are smaller than that of the existing techniques in practice, though theoretically they are almost the same. Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by one order of magnitude regarding the total running time.

Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2016, 'Efficient and progressive Group Steiner Tree search', *Proceedings of the ACM SIGMOD International Conference on Management of Data*, ACM SIGMOD Conference, Association of Computing Machinery ACM, San Francisco, California, United States, pp. 91-106.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2016 ACM.The Group Steiner Tree (GST) problem is a fundamental problem in database area that has been successfully applied to keyword search in relational databases and team search in social networks. The state-of-the-art algorithm for the GST problem is a parameterized dynamic programming (DP) algorithm, which finds the optimal tree in O(3n + 2 (n log n + m)) time, where is the number of given groups, m and n are the number of the edges and nodes of the graph respectively. The major limitations of the parameterized DP algorithm are twofold: (i) it is intractable even for very small values of (e.g., = 8) in large graphs due to its exponential complexity, and (ii) it cannot generate a solution until the algorithm has completed its entire execution. To overcome these limitations, we propose an efficient and progressive GST algorithm in this paper, called PrunedDP. It is based on newly-developed optimal-tree decomposition and conditional tree merging techniques. The proposed algorithm not only drastically reduces the search space of the parameterized DP algorithm, but it also produces progressively-refined feasible solutions during algorithm execution. To further speed up the PrunedDP algorithm, we propose a progressive A-search algorithm, based on several carefully-designed lower-bounding techniques. We conduct extensive experiments to evaluate our algorithms on several large scale real-world graphs. The results show that our best algorithm is not only able to generate progressively-refined feasible solutions, but it also finds the optimal solution with at least two orders of magnitude acceleration over the state-of-the-art algorithm, using much less memory.

Li, Z., Qin, L., Cheng, H., Zhang, X. & Zhou, X. 2016, 'TRIP: An interactive retrieving-inferring data imputation approach', *2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, pp. 1462-1463.View/Download from: Publisher's site

#### View description

© 2016 IEEE. Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to nonquantitive string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1] . Specifically, the inferring-based approaches find substitutes or estimations for the missing ones from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1] . This webbased retrieving approach reaches a high imputation precision and recall, but on the other hand, issues a large number of web search queries, which brings a large overhead [1].

Liu, S., Liu, A., Zhao, L., Liu, G., Li, Z., Zhao, P., Zheng, K. & Qin, L. 2016, 'Efficient query processing with mutual privacy protection for location-based services', *DASFAA 2016: Database Systems for Advanced Applications*, Database Systems for Advanced Applications, Springer, Dallas, Texas, Unites States, pp. 299-313.View/Download from: Publisher's site

#### View description

© Springer International Publishing Switzerland 2016.Data privacy in location-based services involves two aspects. The location of a user is a kind of private data as many sensitive information can be inferred from it given some background knowledge. On the other hand, the POI database is a great asset to the LBS provider as its construction requires many resources and efforts. In this paper, we propose a method of protecting mutual privacy (i.e., the location of the user issuing a query and the POI database of the LBS provider) for location-based query processing. Our approach consists of two steps: data preparation and query processing. Data preparation is conducted by LBS itself and is totally an offline computation, while query processing involves some online computation and multiple rounds of communication between LBS and the user. We implement the query processing by two rounds of oblivious transfer extension (OT-Extension) on two small key sets, resulting an immediate response even on some big POI databases. We also theoretically prove the security and analyze the complexity of our approach. Compared with two state-of-the-art methods, our approach has several orders of magnitude improvement in response time, at the expense of little and acceptable communication cost.

Lyu, B., Qin, L., Lin, X., Chang, L. & Yu, J.X. 2016, 'Scalable supergraph search in large graph databases', *2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016*, IEEE International Conference on Data Engineering, IEEE, Helsinki, Finland, pp. 157-168.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2016 IEEE.Supergraph search is a fundamental problem in graph databases that is widely applied in many application scenarios. Given a graph database and a query-graph, supergraph search retrieves all data-graphs contained in the query-graph from the graph database. Most existing solutions for supergraph search follow the pruning-and-verification framework, which prunes false answers based on features in the pruning phase and performs subgraph isomorphism testings on the remaining graphs in the verification phase. However, they are not scalable to handle large-sized data-graphs and query-graphs due to three drawbacks. First, they rely on a frequent subgraph mining algorithm to select features which is expensive and cannot generate large features. Second, they require a costly verification phase. Third, they process features in a fixed order without considering their relationship to the query-graph. In this paper, we address the three drawbacks and propose new indexing and query processing algorithms. In indexing, we select features directly from the data-graphs without expensive frequent subgraph mining. The features form a feature-tree that contains all-sized features and both the cost sharing and pruning power of the features are considered. In query processing, we propose a verification-free algorithm, where the order to process features is query-dependent by considering both the cost sharing and the pruning power. We explore two optimization strategies to further improve the algorithm efficiency. The first strategy applies a lightweight graph compression technique and the second strategy optimizes the inclusion of answers. Finally, we conduct extensive performance studies on two real large datasets to demonstrate the high scalability of our algorithms.

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2016, 'I/O efficient ECC graph decomposition via graph reduction', *Proceedings of the 42nd International Conference on Very Large Data Bases*, International Conference on Very Large Databases, VLDB Endowment, New Delhi, India, pp. 516-527.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2016 VLDB Endowment 21508097/16/03.The problem of computing k-edge connected components (k-ECCs) of a graph G for a specific k is a fundamental graph problemand has been investigated recently. In this paper, we study the problemof ECC decomposition, which computes the k-ECCs of a graphG for all k values. ECC decomposition can be widely applied ina variety of applications such as graph-topology analysis, communitydetection, Steiner component search, and graph visualization.A straightforward solution for ECC decomposition is to apply theexisting k-ECC computation algorithm to compute the k-ECCs forall k values. However, this solution is not applicable to large graphsfor two challenging reasons. First, all existing k-ECC computationalgorithms are highly memory intensive due to the complex datastructures used in the algorithms. Second, the number of possiblek values can be very large, resulting in a high computational costwhen each k value is independently considered. In this paper, weaddress the above challenges, and study I/O efficient ECC decompositionvia graph reduction. We introduce two elegant graph reductionoperators which aim to reduce the size of the graph loadedin memory while preserving the connectivity information of a certainset of edges to be computed for a specific k. We also proposethree novel I/O efficient algorithms, Bottom-Up, Top-Down, andHybrid, that explore the k values in different orders to reduce the redundantcomputations between different k values. We analyze theI/O and memory costs for all proposed algorithms. In our experiments, we evaluate our algorithms using seven real large datasetswith various graph properties, one of which contains 1:95 billionedges. The experimental results show that our proposed algorithmsare scalable and efficient.

Zhang, H., Zhu, Y., Qin, L., Cheng, H. & Yu, J.X. 2016, 'Efficient triangle listing for billion-scale graphs', *Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016*, pp. 813-822.View/Download from: Publisher's site

#### View description

© 2016 IEEE. This paper addresses the classical triangle listing problem, which aims at enumerating all the tuples of three vertices connected with each other by edges. This problem has been intensively studied in internal and external memory, but it is still an urgent challenge in distributed environment where multiple machines across the network can be utilized to achieve good performance and scalability. As one of the de facto computing methodologies in distributed environment, MapReduce has been used in some of existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders the scalability on large scale graphs. In this paper, we propose a new triangle listing algorithm in MapReduce, FTL, which utilizes a light weight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and also is equipped with multiple-round techniques to ease the burden on memory and network bandwidth when dealing with graphs at billion scale. We prove that the size of the intermediate data can be well bounded near to the number of triangles in the graph. To further reduce the shuffle size in each round, we also devise a compact data structure to store the intermediate data, which can save space up to 2/3. The extensive experimental results show that our algorithms outperform existing competitors by several times on large real world graphs.

Lai, L.B., Qin, L., Lin, X.M., Zhang, Y., Chang, L.J. & Yang, S.Y. 2016, 'Scalable Distributed Subgraph Enumeration', *Proceedings of the VLDB Endowment*, International Conference on Very Large Databases, VLDB Endowment, Munich, Germany, pp. 217-228.View/Download from: UTS OPUS or Publisher's site

Wen, D., Qin, L., Zhang, Y., Lin, X. & Yu, J.X. 2016, 'I/O Efficient Core Graph Decomposition at Web Scale.', *CoRR*.View/Download from: UTS OPUS

Chang, L., Lin, X., Qin, L., Yu, J.X. & Pei, J. 2015, 'Efficiently Computing Top-K Shortest Path Join', *Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015.*, Extending Database Technology, OpenProceedings.org, Belgium, pp. 133-144.View/Download from: UTS OPUS or Publisher's site

Chang, L., Lin, X., Qin, L., Yu, J.X. & Zhang, W. 2015, 'Index-based optimal algorithms for computing steiner components with maximum connectivity', *Proceedings of the ACM SIGMOD International Conference on Management of Data*, ACM Special Interest Group on Management of Data Conference, ACM, Melbourne, Victoria, Australia, pp. 459-474.View/Download from: UTS OPUS or Publisher's site

#### View description

With the proliferation of graph applications, the problem of efficiently computing all k-edge connected components of a graph G for a user-given k has been recently investigated. In this paper, we study the problem of efficiently computing the steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC. To accommodate online query processing, we present an efficient algorithm based on a novel index such that the algorithm runs in linear time regarding the result size; thus, the algorithm is optimal since it needs at least linear time to output the result. Moreover, in this paper we also investigate variations of the above problem. We show that such a problem with the constraint that the size of the SMCC is not smaller than a given size can also be solved in linear time regarding the result size (thus, optimal). We also show that the problem of computing the connectivity (rather than the graph details) of SMCC can be solved in linear time regarding the query size (thus, optimal). To build the index, we extend the techniques in [7] to accommodate batch processing and computation sharing. To efficiently support the applications with graph updates, we also present novel increment techniques. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our index-based algorithms significantly outperform baseline algorithms by several orders of magnitude and our indexing algorithms are efficient.

Lai, L., Qin, L., Lin, X. & Chang, L. 2015, 'Scalable Subgraph Enumeration in MapReduce.', *Proceedings of the VLDB Endowment*, International Conference on Very Large Databases, VLDB Endowment, Kohala Coast, Hawaii, pp. 974-985.View/Download from: UTS OPUS or Publisher's site

#### View description

Subgraph enumeration, which aims to find all the subgraphs of a

large data graph that are isomorphic to a given pattern graph, is

a fundamental graph problem with a wide range of applications.

However, existing sequential algorithms for subgraph enumeration

fall short in handling large graphs due to the involvement of computationally

intensive subgraph isomorphism operations. Thus, some

recent researches focus on solving the problem using MapReduce.

Nevertheless, exiting MapReduce approaches are not scalable to

handle very large graphs since they either produce a huge number

of partial results or consume a large amount of memory. Motivated

by this, in this paper, we propose a new algorithm TwinTwigJoin

based on a left-deep-join framework in MapReduce, in which the

basic join unit is a TwinTwig (an edge or two incident edges of

a node). We show that in the Erdos-R ¨ enyi random-graph model, ´

TwinTwigJoin is instance optimal in the left-deep-join framework

under reasonable assumptions, and we devise an algorithm to compute

the optimal join plan. Three optimization strategies are explored

to improve our algorithm. Furthermore, we discuss how our

approach can be adapted in the power-law random-graph model.We

conduct extensive performance studies in several real graphs, one

of which contains billions of edges. Our approach significantly outperforms

existing solutions in all tests.

Li, R.-.H., Qin, L., Yu, J.X. & Mao, R. 2015, 'Influential Community Search in Large Networks', *Proceedings of the VLDB Endowment*, International Conference on Very Large Databases, Proceedings of the Vldb Endowment International Conference on Very Large Data Bases, Kohala Coast, Hawaii, pp. 509-520.View/Download from: UTS OPUS or Publisher's site

#### View description

Community search is a problem of finding densely connected subgraphs

that satisfy the query conditions in a network, which has

attracted much attention in recent years. However, all the previous

studies on community search do not consider the influence of

a community. In this paper, we introduce a novel community model

called k-influential community based on the concept of k-core,

which can capture the influence of a community. Based on the

new community model, we propose a linear-time online search algorithm

to find the top-r k-influential communities in a network.

To further speed up the influential community search algorithm,

we devise a linear-space index structure which supports efficient

search of the top-r k-influential communities in optimal time. We

also propose an efficient algorithm to maintain the index when the

network is frequently updated. We conduct extensive experiments

on 7 real-world large networks, and the results demonstrate the ef-

ficiency and effectiveness of the proposed methods

Li, R.H., Yu, J.X., Qin, L., Mao, R. & Jin, T. 2015, 'On random walk based graph sampling', *Proceedings - International Conference on Data Engineering*, IEEE International Conference on Data Engineering, IEEE, Seoul, South Korea, pp. 927-938.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2015 IEEE. Random walk based graph sampling has been recognized as a fundamental technique to collect uniform node samples from a large graph. In this paper, we first present a comprehensive analysis of the drawbacks of three widely-used random walk based graph sampling algorithms, called re-weighted random walk (RW) algorithm, Metropolis-Hastings random walk (MH) algorithm and maximum-degree random walk (MD) algorithm. Then, to address the limitations of these algorithms, we propose two general random walk based algorithms, named rejection-controlled Metropolis-Hastings (RCMH) algorithm and generalized maximum-degree random walk (GMD) algorithm. We show that RCMH balances the tradeoff between the limitations of RW and MH, and GMD balances the tradeoff between the drawbacks of RW and MD. To further improve the performance of our algorithms, we integrate the so-called delayed acceptance technique and the non-backtracking random walk technique into RCMH and GMD respectively. We conduct extensive experiments over four real-world datasets, and the results demonstrate the effectiveness of the proposed algorithms.

Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2015, 'Diversified top-k clique search', *Proceedings - International Conference on Data Engineering*, IEEE International Conference on Data Engineering, IEEE, South Korea, pp. 387-398.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2015 IEEE. Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns enormous cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem which is to find top-k maximal cliques that can cover most number of nodes in the graph. Diversified top-k clique search can be widely used in a lot of applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight PNP-Index, based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. We conduct extensive performance studies on six real graphs one of which contains 0.3 billion edges, and the results demonstrate the high efficiency and effectiveness of our approach.

Zhang, S., Qin, L., Zheng, Y. & Cheng, H. 2015, 'Effective and efficient: Large-scale dynamic city express', *Proceedings of the ACM International Symposium on Advances in Geographic Information Systems*, ACM International Conference on Advances in Geographic Information Systems, ACM, Seattle, Washington.View/Download from: UTS OPUS or Publisher's site

#### View description

City express services are in great demand in recent years. However, the current city express system is found to be unsatisfactory for both the service providers and customers. In this paper, we are the first to systematically study the large-scale dynamic city express problem. We aim to increase both the effectiveness and the efficiency of the scheduling algorithm. To improve the effectiveness, we adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period rather than dealing with each request individually. To improve the efficiency, we design a two-level priority queue structure to reduce redundant shortest distance calculation and repeated candidate generation. We develop a simulation system and conduct extensive performance studies in the real road network of Beijing city. The experimental results demonstrate the high effectiveness and efficiency of our algorithm.

Zhang, Z., Yu, J.X., Qin, L. & Shang, Z. 2015, 'Divide & conquer: I/O efficient depth-first search', *Proceedings of the ACM SIGMOD International Conference on Management of Data*, ACM Special Interest Group on Management of Data Conference, ACM, Melbourne, Victoria, Australia, pp. 445-458.View/Download from: UTS OPUS or Publisher's site

#### View description

Copyright © 2015 ACM. Depth-First Search (DFS), which traverses a graph in the depthfirst order, is one of the fundamental graph operations, and the result of DFS over all nodes in G is a spanning tree known as a DFS-Tree. There are many graph algorithms that need DFS such as connected component computation, topological sort, community detection, eulerian path computation, graph bipartiteness testing, planar graph testing, etc, because the in-memory DFS algorithm shows it can be done in linear time w.r.t. the size of G. However, given the fact that real-world graphs grow rapidly in the big data era, the in-memory DFS algorithm cannot be used to handle a large graph that cannot be entirely held in main memory. In this paper, we focus on I/O efficiency and study semi-external algorithms to DFS a graph G which is on disk. Here, like the existing semiexternal algorithms, we assume that a spanning tree of G can be held in main memory and the remaining edges of G are kept on disk, and compute the DFS-Tree in main memory with which DFS can be identified. We propose novel divide & conquer algorithms to DFS over a graph G on disk. In brief, we divide a graph into several subgraphs, compute the DFS-Tree for each subgraph independently, and then merge them together to compute the DFS-Tree for the whole graph. With the global DFS-Tree computed we identify DFS. We discuss the valid division, that can lead to the correct DFS, and the challenges to do so. We propose two division algorithms, named Divide-Star and Divide-TD, and a merge algorithm. We conduct extensive experimental studies using four real massive datasets and several synthetic datasets to confirm the I/O efficiency of our approach.

Qin, L., Li, R.H., Chang, L. & Zhang, C. 2015, 'Locally densest subgraph discovery', *Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, ACM International Conference on Knowledge Discovery and Data Mining, ACM, Sydney, Australia, pp. 965-974.View/Download from: UTS OPUS or Publisher's site

#### View description

© 2015 ACM. Mining dense subgraphs from a large graph is a fundamental graph mining task and can be widely applied in a variety of application domains such as network science, biology, graph database, web mining, graph compression, and micro-blogging systems. Here a dense subgraph is defined as a subgraph with high density (#.edge/#.node). Existing studies of this problem either focus on finding the densest subgraph or identifying an optimal clique-like dense subgraph, and they adopt a simple greedy approach to find the top-k dense subgraphs. However, their identified subgraphs cannot be used to represent the dense regions of the graph. Intuitively, to represent a dense region, the subgraph identified should be the subgraph with highest density in its local region in the graph. However, it is non-trivial to formally model a locally densest subgraph. In this paper, we aim to discover top-k such representative locally densest subgraphs of a graph. We provide an elegant parameter-free definition of a locally densest subgraph. The definition not only fits well with the intuition, but is also associated with several nice structural properties. We show that the set of locally densest subgraphs in a graph can be computed in polynomial time. We further propose three novel pruning strategies to largely reduce the search space of the algorithm. In our experiments, we use several real datasets with various graph properties to evaluate the effectiveness of our model using four quality measures and a case study. We also test our algorithms on several real web-scale graphs, one of which contains 118.14 million nodes and 1.02 billion edges, to demonstrate the high efficiency of the proposed algorithms.

Chang, L., Lin, X., Zhang, W., Yu, J.X., Zhang, Y. & Qin, L. 2015, 'Optimal Enumeration: Efficient Top-k Tree Matching', *Proceedings of the VLDB Endowment*, International Conference on Very Large Databases, VLDB Endowment, Kahola Coast, Hawaii, pp. 533-544.View/Download from: UTS OPUS or Publisher's site

#### View description

Driven by many real applications, graph pattern matching has attracted

a great deal of attention recently. Consider that a twigpattern

matching may result in an extremely large number of matches

in a graph; this may not only confuse users by providing too many

results but also lead to high computational costs. In this paper,

we study the problem of top-k tree pattern matching; that is, given

a rooted tree T, compute its top-k matches in a directed graph G

based on the twig-pattern matching semantics. We firstly present

a novel and optimal enumeration paradigm based on the principle

of Lawler's procedure. We show that our enumeration algorithm

runs in O(nT + log k) time in each round where nT is the number

of nodes in T. Considering that the time complexity to output a

match of T is O(nT ) and nT log k in practice, our enumeration

technique is optimal. Moreover, the cost of generating top-1 match

of T in our algorithm is O(mR) where mR is the number of edges

in the transitive closure of a data graph G involving all relevant

nodes to T. O(mR) is also optimal in the worst case without preknowledge

of G. Consequently, our algorithm is optimal with the

running time O(mR +k(nT +log k)) in contrast to the time complexity

O(mR log k+knT (log k+dT )) of the existing technique where dT

is the maximal node degree in T. Secondly, a novel priority based

access technique is proposed, which greatly reduces the number of

edges accessed and results in a significant performance improvement.

Finally, we apply our techniques to the general form of top-k

graph pattern matching problem (i.e., query is a graph) to improve

the existing techniques. Comprehensive empirical studies demonstrate

that our techniques may improve the existing techniques by

orders of magnitude.

Huang, X., Cheng, H., Qin, L., Tian, W. & Yu, J.X. 2014, 'Querying k-truss community in large and dynamic graphs', *International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014*, ACM Special Interest Group on Management of Data Conference, ACM, Utah, USA, pp. 1311-1322.View/Download from: UTS OPUS or Publisher's site

Zhang, Z., Qin, L. & Yu, J.X. 2014, 'Contract & Expand: I/O Efficient SCCs Computing', *IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014*, IEEE International Conference on Data Engineering, IEEE, Chicago, IL, pp. 208-219.View/Download from: UTS OPUS or Publisher's site

#### View description

As an important branch of big data processing, big graph processing is becoming increasingly popular in recent years. Strongly connected component (SCC) computation is a fundamental graph operation on directed graphs, where an SCC is a maximal subgraph S of a directed graph G in which every pair of nodes is reachable from each other in S. By contracting each SCC into a node, a large general directed graph can be represented by a small directed acyclic graph (DAG). In the literature, there are I/O efficient semi-external algorithms to compute all SCCs of a graph G, by assuming that all nodes of a graph G can fit in the main memory. However, many real graphs are large and even the nodes cannot reside entirely in the main memory. In this paper, we study new I/O efficient external algorithms to find all SCCs for a directed graph G whose nodes cannot fit entirely in the main memory. To overcome the deficiency of the existing external graph contraction based approach that usually cannot stop in finite iterations, and the external DFS based approach that will generate a large number of random I/Os, we explore a new contraction-expansion based approach. In the graph contraction phase, instead of contracting the whole graph as the contraction based approach, we only contract the nodes of a graph, which are much more selective. The contraction phase stops when all nodes of the graph can fit in the main memory, such that the semi-external algorithm can be used in SCC computation. In the graph expansion phase, as the graph is expanded in the reverse order as it is contracted, the SCCs of all nodes in the graph are computed. Both graph contraction phase and graph expansion phase use only I/O efficient sequential scans and external sorts of nodes/edges in the graph. Our algorithm leverages the efficiency of the semi-external SCC computation algorithm and usually stops in a small number of iterations. We further optimize our approach by reducing the size of nodes and edges of t...

Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C. & Lin, X. 2014, 'Scalable big graph processing in MapReduce', *International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014*, ACM Special Interest Group on Management of Data Conference, ACM, Utah, USA, pp. 827-838.View/Download from: UTS OPUS or Publisher's site

Chang, L., Lin, X., Qin, L., Liu, C. & Liang, W. 2013, 'Efficiently Computing k-Edge Connected Components via Graph Decomposition', *Proceedings of ACM Conference on Management of Data*, ACM Conference on Management of Data, ACM, New York, NY, USA, pp. 205-216.View/Download from: UTS OPUS or Publisher's site

#### View description

k-edge connected components; graph decomposition; minimum cut

Huang, X., Cheng, H., Li, R., Qin, L. & Yu, J.X. 2013, 'Top-K Structural Diversity Search in Large Network', *Proceedings of the VLDB Endowment*, International Conference on Very Large Databases, ACM, Trento, Italy, pp. 1618-1629.View/Download from: UTS OPUS or Publisher's site

#### View description

Social contagion depicts a process of information (e.g., fads, opinions, news) diffusion in the online social networks. A recent study reports that in a social contagion process the probability of contagion is tightly controlled by the number of connected components in an individual's neighborhood. Such a number is termed structural diversity of an individual and it is shown to be a key predictor in the social contagion process. Based on this, a fundamental issue in a social network is to find top-k users with the highest structural diversities. In this paper, we, for the first time, study the top-k structural diversity search problem in a large network. Specifically, we develop an effective upper bound of structural diversity for pruning the search space. The upper bound can be incrementally refined in the search process. Based on such upper bound, we propose an efficient framework for top-k structural diversity search. To further speed up the structural diversity evaluation in the search process, several carefully devised heuristic search strategies are proposed. Extensive experimental studies are conducted in 13 real-world large networks, and the results demonstrate the efficiency and effectiveness of the proposed methods.

Qiao, M., Qin, L., Cheng, H., Yu, J.X. & Tian, W. 2013, 'Top-K nearest keyword search on large graphs', *Proceedings of the VLDB Endowment*, International Conference on Very Large Databases, ACM, Trento, Italy, pp. 901-912.View/Download from: UTS OPUS or Publisher's site

#### View description

It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned with a weight measuring its length. Given a query node q in G and a keyword ?, a k-NK query seeks k nodes which contain ? and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is query time optimized for a scenario when a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.

Zhang, Z., Yu, J.X., Qin, L., Chang, L. & Lin, X. 2013, 'I/O Efficient: Computing SCCs in Massive Graphs', *Proceedings of ACM Conference on Management of Data*, ACM Conference on Management of Data, ACM, New York, NY, USA, pp. 181-192.View/Download from: UTS OPUS or Publisher's site

#### View description

A strongly connected component (SCC) is a maximal subgraph of a directed graph G in which every pair of nodes are reachable from each other in the SCC. With such a property, a general directed graph can be represented by a directed acyclic graph DAG by contracting an SCC of G to a node in DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all SCCs in a directed graph G is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all SCCs in linear time w.r.t. the size of a graph. However, when a graph cannot resident entirely in the main memory, the existing external or semi-external algorithms to find all SCCs have limitation to achieve high I/O efficiency. In this paper, we study new I/O efficient semi-external algorithms to find all SCCs for a massive directed graph G that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS based semi-external algorithm that heavily relies on a total order, we explore a weak order based on which we investigate new algorithms. We propose a new two phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of G can be constructed in bounded sequential scans of G. In the tree search phase, it needs to sequentially scan the graph once to find all SCCs.

Qin, L., Yu, J.X. & Chang, L. 2012, 'Diversifying Top-K Results', *Proceedings of the VLDB Endowment*, International Conference on Very Large Databases, ACM, Istanbul, Turkey, pp. 1124-1135.View/Download from: UTS OPUS or Publisher's site

#### View description

Top-k query processing finds a list of k results that have largest scores w.r.t the user given query, with the assumption that all the k results are independent to each other. In practice, some of the top-k results returned can be very similar to each other. As a result some of the top-k results returned are redundant. In the literature, diversified top-k search has been studied to return k results that take both score and diversity into consideration. Most existing solutions on diversified top-k search assume that scores of all the search results are given, and some works solve the diversity problem on a specific problem and can hardly be extended to general cases. In this paper, we study the diversified top-k search problem. We define a general diversified top-k search problem that only considers the similarity of the search results themselves. We propose a framework, such that most existing solutions for top-k query processing can be extended easily to handle diversified top-k search, by simply applying three new functions, a sufficient stop condition sufficient(), a necessary stop condition necessary(), and an algorithm for diversified top-k search on the current set of generated results, div-search-current(). We propose three new algorithms, namely, div-astar, div-dp, and div-cut to solve the div-search-current() problem. div-astar is an A* based algorithm, div-dp is an algorithm that decomposes the results into components which are searched using div-astar independently and combined using dynamic programming. div-cut further decomposes the current set of generated results using cut points and combines the results using sophisticated operations. We conducted extensive performance studies using two real datasets, enwiki and reuters. Our div-cut algorithm finds the optimal solution for diversified top-k search problem in seconds even for k as large as 2, 000.

Zhang, Z., Yu, J.X., Qin, L., Zhu, Q. & Zhou, X. 2012, 'I/O Cost Minimization: Reachability Queries Processing over Massive Graphs', *Proceedings of the 15th International Conference on Extending Database Technology*, Extending Database Technology, ACM, Berlin, Germany, pp. 468-479.View/Download from: UTS OPUS or Publisher's site

#### View description

Given a directed graph G, a reachability query (u, v) asks whether there exists a path from a node u to a node v in G. The existing studies support reachability queries using indexing techniques, where both the graph and the index are required to reside in main memory. However, they cannot handle reachability queries on massive graphs, when the graph and the index cannot be entirely held in memory because of the high I/O cost. In this paper, we focus on how to minimize the I/O cost when answering reachability queries on massive graphs that cannot reside entirely in memory. First, we propose a new Yes-Label scheme, as a complement of the No-Label used in GRAIL [23], to reduce the number of intermediate results generated. Second, we show how to minimize the number of I/Os using a heap-on-disk data structure when traversing a graph. We also propose new methods to partition the heap-on-disk, in order to ensure that only sequential I/Os are performed. Third, we analyze our approaches and show how to extend our approaches to answer multiple reachability queries effectively. Finally, we conducted extensive performance studies on both large synthetic and large real graphs, and confirm the efficiency of our approaches.

Zhu, Y., Qin, L., Yu, J.X. & Cheng, H. 2012, 'Finding Top-K Similar Graphs in Graph Databases', *Proceedings of the 15th International Conference on Extending Database Technology*, Extending Database Technology, ACM, Berlin, Germany, pp. 456-467.View/Download from: UTS OPUS or Publisher's site

#### View description

Querying similar graphs in graph databases has been widely studied in graph query processing in recent years. Existing works mainly focus on subgraph similarity search and supergraph similarity search. In this paper, we study the problem of finding top-k graphs in a graph database that are most similar to a query graph. This problem has many applications, such as image retrieval and chemical compound structure search. Regarding the similarity measure, feature based and kernel based similarity measures have been used in the literature. But such measures are rough and may lose the connectivity information among substructures. In this paper, we introduce a new similarity measure based on the maximum common subgraph (MCS) of two graphs. We show that this measure can better capture the common and different structures of two graphs. Since computing the MCS of two graphs is NP-hard, we propose an algorithm to answer the top-k graph similarity query using two distance lower bounds with different computational costs, in order to reduce the number of MCS computations. We further introduce an indexing technique, which can better make use of the triangle property of similarities among graphs in the database to get tighter lower bounds. Three different indexing methods are proposed with different tradeoffs between pruning power and construction cost. We conducted extensive performance studies on large real datasets to evaluate the performance of our approaches.

Zhu, Y., Yu, J.X., Cheng, H. & Qin, L. 2012, 'Graph Classification: A Diversified Discriminative Feature Selection Approach', *Proceedings of 2012 ACM International Conference on Information and Knowledge Management*, ACM International Conference on Information and Knowledge Management, ACM, Maui, HI, USA, pp. 205-214.View/Download from: UTS OPUS or Publisher's site

#### View description

A graph models complex structural relationships among objects, and has been prevalently used in a wide range of applications. Building an automated graph classification model becomes very important for predicting unknown graphs or understanding complex structures between different classes. The graph classification framework being widely used consists of two steps, namely, feature selection and classification. The key issue is how to select important subgraph features from a graph database with a large number of graphs including positive graphs and negative graphs. Given the features selected, a generic classification approach can be used to build a classification model. In this paper, we focus on feature selection. We identify two main issues with the most widely used feature selection approach which is based on a discriminative score to select frequent subgraph features, and introduce a new diversified discriminative score to select features that have a higher diversity. We analyze the properties of the newly proposed diversified discriminative score, and conducted extensive performance studies to demonstrate that such a diversified discriminative score makes positive/negative graphs separable and leads to a higher classification accuracy.

Chang, L., Yu, J.X., Qin, L., Zhu, Y. & Wang, H. 2011, 'Finding Information Nebula over Large Networks', *Proceedings of 2011 ACM International Conference on Information and Knowledge Management*, ACM International Conference on Information and Knowledge Management, ACM, Glasgow, United Kingdom, pp. 1465-1474.View/Download from: UTS OPUS or Publisher's site

#### View description

Social and information networks have been extensively studied over years. In this paper, we concentrate ourselves on a large information network that is composed of entities and relationships, where entities are associated with sets of keyword terms (kterms) to specify what they are, and relationships describe the link structure among entities which can be very complex. Our work is motivated but is different from the existing works that find a best subgraph to describe how user-specified entities are connected. We compute information nebula (cloud) which is a set of top-K kterms P that are most correlated to a set of user-specified kterms Q, over a large information network. Our goal is to find how kterms are correlated given the complex information network among entities. The information nebula computing requests us to take all possible kterms into consideration for the top-K kterms selection, and needs to measure the similarity between kterms by considering all possible subgraphs that connect them instead of the best single one. In this work, we compute information nebula using a global structural-context similarity, and our similarity measure is independent of connection subgraphs. To the best of our knowledge, among the link-based similarity methods, none of the existing work considers similarity between two sets of nodes or two kterms. We propose new algorithms to find top-K kterms P for a given set of kterms Q based on the global structural-context similarity, without computing all the similarity scores of kterms in the large information network. We performed extensive performance studies using large real datasets, and confirmed the effectiveness and efficiency of our approach.

Qin, L., Yu, J.X. & Chang, L. 2011, 'Computing Structural Statistics by Keywords in Databases', *Proceedings of the 27th International Conference on Data Engineering*, International Conference on Data Engineering, IEEE, Hannover, Germany, pp. 363-374.View/Download from: UTS OPUS or Publisher's site

#### View description

Keyword search in RDBs has been extensively studied in recent years. The existing studies focused on finding all or top-k interconnected tuple-structures that contain keywords. In reality, the number of such interconnected tuple-structures for a keyword query can be large. It becomes very difficult for users to obtain any valuable information more than individual interconnected tuple-structures. Also, it becomes challenging to provide a similar mechanism like group-&-aggregate for those interconnected tuple-structures. In this paper, we study computing structural statistics keyword queries by extending the group-&-aggregate framework. We consider an RDB as a large directed graph where nodes represent tuples, and edges represent the links among tuples. Instead of using tuples as a member in a group to be grouped, we consider rooted subgraphs. Such a rooted subgraph represents an interconnected tuple-structure among tuples and some of the tuples contain keywords. The dimensions of the rooted subgraphs are determined by dimensional-keywords in a data driven fashion. Two rooted subgraphs are grouped into the same group if they are isomorphic based on the dimensions or in other words the dimensional-keywords. The scores of the rooted subgraphs are computed by a user-given score function if the rooted subgraphs contain some of general keywords. Here, the general keywords are used to compute scores rather than determining dimensions. The aggregates are computed using an SQL aggregate function for every group based on the scores computed. We give our motivation using a real dataset. We propose new approaches to compute structural statistics keyword queries, perform extensive performance studies using two large real datasets and a large synthetic dataset, and confirm the effectiveness and efficiency of our approach.

Zhu, Y., Qin, L., Yu, J.X., Ke, Y. & Lin, X. 2011, 'High Efficiency and Quality: Large Graphs Matching', *Proceedings of 2011 ACM International Conference on Information and Knowledge Management*, ACM International Conference on Information and Knowledge Management, ACM, Glasgow, United Kingdom, pp. 1755-1764.View/Download from: UTS OPUS or Publisher's site

#### View description

Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. To find exact matching, it cannot handle a graph with more than 30 nodes. To find an approximate matching, the quality can be very poor. We propose a novel two-step approach which can efficiently match two large graphs over thousands of nodes with high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We give the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We conducted extensive testing using real and synthetic datasets, and will report our findings.

Chang, L., Yu, J.X., Qin, L. & Lin, X. 2010, 'Probabilistic Ranking over Relations', *Proceedings of the 13th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Lausanne, Switzerland, pp. 477-488.View/Download from: UTS OPUS or Publisher's site

#### View description

Probabilistic top-k ranking queries have been extensively studied due to the fact that data obtained can be uncertain in many real applications. A probabilistic top-k ranking query ranks objects by the interplay of score and probability, with an implicit assumption that both scores based on which objects are ranked and probabilities of the existence of the objects are stored in the same relation. We observe that in general scores and probabilities are highly possible to be stored in different relations, for example, in column-oriented DBMSs and in data warehouses. In this paper we study probabilistic top-k ranking queries when scores and probabilities are stored in different relations. We focus on reducing the join cost in probabilistic top-k ranking. We investigate two probabilistic score functions, discuss the upper/lower bounds in random access and sequential access, and provide insights on the advantages and disadvantages of random/sequential access in terms of upper/lower bounds. We also propose random, sequential, and hybrid algorithms to conduct probabilistic top-k ranking. We conducted extensive performance studies using real and synthetic datasets, and report our findings in this paper.

Qin, L., Yu, J.X. & Chang, L. 2010, 'Ten thousand SQLs: parallel keyword queries computing', *Proceedings of the VLDB Endowment*, Proceedings of the VLDB Endowment, ACM, Singapore, pp. 58-69.View/Download from: UTS OPUS or Publisher's site

#### View description

Keyword search in relational databases has been extensively studied. Given a relational database, a keyword query finds a set of interconnected tuple structures connected by foreign key references. On rdbms, a keyword query is processed in two steps, namely, candidate networks (CNs) generation and CNs evaluation, where a CN is an sql. In common, a keyword query needs to be processed using over 10,000 sqls. There are several approaches to process a keyword query on rdbms, but there is a limit to achieve high performance on a uniprocessor architecture. In this paper, we study parallel computing keyword queries on a multicore architecture. We give three observations on keyword query computing, namely, a large number of sqls that needs to be processed, high sharing possibility among sqls, and large intermediate results with small number of final results. All make it challenging for parallel keyword queries computing. We investigate three approaches. We first study the query level parallelism, where each sql is processed by one core. We distribute the sqls into different cores based on three objectives, regarding minimizing workload skew, minimizing intercore sharing and maximizing intra-core sharing respectively. Such an approach has the potential risk of load unbalancing through accumulating errors of cost estimation. We then study the operation level parallelism, where each operation of an sql is processed by one core. All operations are processed in stages, where in each stage the costs of operations are re-estimated to reduce the accumulated error. Such operation level parallelism still has drawbacks of workload skew when large operations are involved and a large number of cores are used. Finally, we propose a new algorithm that partitions relations adaptively in order to minimize the extra cost of partitioning and at the same time reduce workload skew.

Chang, L., Yu, J.X. & Qin, L. 2009, 'Context-sensitive document ranking', *International Conference on Information and Knowledge Management, Proceedings*, pp. 1533-1536.View/Download from: Publisher's site

#### View description

Ranking is a main research issue in IR-styled keyword search over a set of documents. In this paper, we study a new keyword search problem, called context-sensitive document ranking, which is to rank documents with an additional context that provides additional information about the application domain where the documents are to be searched and ranked. The work is motivated by the fact that additional information associated with the documents can possibly assist users to find more relevant documents when they are unable to find the needed documents from the documents alone. In this paper, a context is a multi-attribute graph, which can represent any information maintained in a relational database. The context-sensitive ranking is related to several research issues, how to score documents, how to evaluate the additional information obtained in the context that may contribute the document ranking, how to rank the documents by combining the scores/costs from the documents and the context. More importantly, the relationships between documents and the information stored in a relational database may be uncertain, because they are from different data sources and the relationships are determined systematically using similarity match which causes uncertainty. In this paper, we concentrate ourselves on these research issues, and provide our solution on how to rank the documents in a context where there exist uncertainty between the documents and the context. We confirm the effectiveness of our approaches by conducting extensive experimental studies using real datasets. Copyright 2009 ACM.

Chang, L., Yu, J.X. & Qin, L. 2009, 'Query Ranking in Probabilistic XML Data', *Proceedings of the 12th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Saint-Petersburg, Russia, pp. 156-167.View/Download from: UTS OPUS or Publisher's site

#### View description

Twig queries have been extensively studied as a major fragment of XPATH queries to query XML data. In this paper, we study PXML-RANK query, (Q, k), which is to rank top-k probabilities of the answers of a twig query Q in probabilistic XML (PXML) data. A new research issue is how to compute top-k probabilities of answers of a twig query Q in PXML in the presence of containment (ancestor/descendant) relationships. In the presence of the ancestor/descendant relationships, the existing dynamic programming approaches to rank top-k probabilities over a set of tuples cannot be directly applied, because any node/edge in PXML may have impacts on the top-k probabilities of answers. We propose new algorithms to compute PXML-RANK queries efficiently and give conditions under which a PXML-RANK query can be processed efficiently without enumeration of all the possible worlds. We conduct extensive performance studies using both real and large benchmark datasets, and confirm the efficiency of our algorithms

Qin, L., Yu, J.X. & Chang, L. 2009, 'Keyword Search in Databases: The Power of RDBMS', *Proceedings of ACM Conference on Management of Data*, Proceedings of ACM Conference on Management of Data, ACM, Providence, Rhode Island, USA, pp. 681-694.View/Download from: UTS OPUS or Publisher's site

#### View description

Keyword search in relational databases (RDBs) has been extensively studied recently. A keyword search (or a keyword query) in RDBs is specified by a set of keywords to explore the interconnected tuple structures in an RDB that cannot be easily identified using SQL on RDBMS. In brief, it finds how the tuples containing the given keywords are connected via sequences of connections (foreign key references) among tuples in an RDB. Such interconnected tuple structures can be found as connected trees up to a certain size, sets of tuples that are reachable from a root tuple within a radius, or even multi-center subgraphs within a radius. In the literature, there are two main approaches. One is to generate a set of relational algebra expressions and evaluate every such expression using SQL on an RDBMS directly or in a middleware on top of an RDBMS indirectly. Due to a large number of relational algebra expressions needed to process, most of the existing works take a middleware approach without fully utilizing RDBMSs. The other is to materialize an RDB as a graph and find the interconnected tuple structures using graph-based algorithms in memory

Qin, L., Yu, J.X., Chang, L. & Tao, Y. 2009, 'Querying Communities in Relational Databases', *Proceedings of the 25th International Conference on Data Engineering*, IEEE, Shanghai, China, pp. 724-735.View/Download from: UTS OPUS or Publisher's site

#### View description

Keyword search on relational databases provides users with insights that they can not easily observe using the traditional RDBMS techniques. Here, an l-keyword query is specified by a set of l keywords, {k1, k2, middot middot middot , kl}. It finds how the tuples that contain the keywords are connected in a relational database via the possible foreign key references. Conceptually, it is to find some structural information in a database graph, where nodes are tuples and edges are foreign key references. The existing work studied how to find connected trees for an l-keyword query. However, a tree may only show partial information about how those tuples that contain the keywords are connected. In this paper, we focus on finding communities for an l-keyword query. A community is an induced subgraph that contains all the l-keywords within a given distance. We propose new efficient algorithms to find all/top-k communities which consume small memory, for an l-keyword query. For top k l-keyword queries, our algorithm allows users to interactively enlarge k at run time. We conducted extensive performance studies using two large real datasets to confirm the efficiency of our algorithms

Qin, L., Yu, J.X., Chang, L. & Tao, Y. 2009, 'Scalable keyword search on large data streams', *Proceedings - International Conference on Data Engineering*, pp. 1199-1202.View/Download from: Publisher's site

#### View description

It is widely realized that the integration of information retrieval (IR) and database (DB) techniques provides users with a broad range of high quality services. A new challenging issue along the same direction is IR-styled m-keyword query processing in a RDBMS framework over an open-ended relational data stream. The capability of supporting m-keyword queries over a relational data stream makes it possible for users to monitor events, that are implicitly interrelated, over a relational data stream in a timely manner. In brief, the problem is to find all connected trees whose size is less than or equal to a user-given threshold in terms of number of nodes for a m-keyword query, {k1, k2, , km}, over a relational data stream on a database schema GS. The difficulty of the problem is related to the number of costly joins to be processed over time, which is affected by the parameters such as the number of keywords (m), the maximum size of connected trees (Tmax), as well as the complexity of the database schema when it is viewed as a schema graph (GS). In this paper, we propose a new demand-driven approach to process such a query over a high speed data stream. We show that we can significantly reduce the number of intermediate results when processing joins over a data stream, and therefore can achieve high efficiency. © 2009 IEEE.

Ding, B., Yu, J.X. & Qin, L. 2008, 'Finding Time-Dependent Shortest Paths over Large Graphs', *Proceedings of the 11th International Conference on Extending Database Technology*, International Conference on Extending Database Technology, ACM, Nantes, France, pp. 205-216.View/Download from: UTS OPUS or Publisher's site

#### View description

The spatial and temporal databases have been studied widely and intensively over years. In this paper, we study how to answer queries of finding the best departure time that minimizes the total travel time from a place to another, over a road network, where the traffic conditions dynamically change from time to time. We study a generalized form of this problem, called the time-dependent shortest-path problem. A time-dependent graph GT is a graph that has an edge-delay function, wi, j(t), associated with each edge (vi, vj), to be stored in a database. The edge-delay function wi, j(t) specifies how much time it takes to travel from node vi to node vj, if it departs from vi at time t. A user-specified query is to ask the minimum-travel-time path, from a source node, vs, to a destination node, ve, over the time-dependent graph, GT, with the best departure time to be selected from a time interval T. We denote this user query as LTT(vs, ve, T) over GT. The challenge of this problem is the added complexity due to the time dependency in the time-dependent graph. That is, edge delays are not constants, and can vary from time to time. In this paper, we propose a novel algorithm to find the minimum-travel-time path with the best departure time for a LTT(vs, ve, T) query over a large graph GT. Our approach outperforms existing algorithms in terms of both time complexity in theory and efficiency in practice. We will discuss the design of our algorithm, together with its correctness and complexity. We conducted extensive experimental studies over large graphs and will report our findings.

Ding, B., Yu, J.X., Wang, S., Qin, L., Zhang, X. & Lin, X. 2007, 'Finding Top-k Min-Cost Connected Trees in Databases', *Proceedings of the 23th International Conference on Data Engineering*, IEEE, Istanbul, Turkey, pp. 836-845.View/Download from: UTS OPUS or Publisher's site

#### View description

It is widely realized that the integration of database and information retrieval techniques will provide users with a wide range of high quality services. In this paper, we study processing an l-keyword query, p1, p1, ..., pl, against a relational database which can be modeled as a weighted graph, G(V, E). Here V is a set of nodes (tuples) and E is a set of edges representing foreign key references between tuples. Let Vi ? V be a set of nodes that contain the keyword pi. We study finding top-k minimum cost connected trees that contain at least one node in every subset Vi, and denote our problem as GST-k When k = 1, it is known as a minimum cost group Steiner tree problem which is NP-complete. We observe that the number of keywords, l, is small, and propose a novel parameterized solution, with l as a parameter, to find the optimal GST-1, in time complexity O(3ln + 2l ((l + logn)n + m)), where n and m are the numbers of nodes and edges in graph G. Our solution can handle graphs with a large number of nodes. Our GST-1 solution can be easily extended to support GST-k, which outperforms the existing GST-k solutions over both weighted undirected/directed graphs. We conducted extensive experimental studies, and report our finding

Wang, S., Peng, Z., Zhang, J., Qin, L., Wang, S., Yu, J.X. & Ding, B. 2006, 'NUITS: A Novel User Interface for Efficient Keyword Search over Databases.', *VLDB*, ACM, pp. 1143-1146.