
Dr Lu Qin

Biography

In July 2006, I received my bachelor's degree from the Department of Computer Science and Technology at Renmin University of China (RUC). In August 2010, I received my PhD from the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong (CUHK), where my thesis adviser was Prof. Jeffrey Xu Yu. From August 2010 to August 2013, I was a postdoc in the same department at CUHK. I am now a Senior Lecturer and a core member of the Centre for Quantum Computation and Intelligent Systems (QCIS) at the University of Technology, Sydney (UTS).

Senior Lecturer, Research Centre for Quantum Computation & Intelligent Systems
Core Member, QCIS - Quantum Computation and Intelligent Systems
Bachelor, PhD
 
Phone
+61 2 9514 9742
Can supervise: Yes

Current PhD Students:

Dong Wen (From 01/2015)

Dian Ouyang (From 07/2015)

Fan Zhang (Co-supervision From 07/2014)

Books

Yu, J.X., Qin, L. & Chang, L. 2010, Keyword Search in Databases, 1st edn, Morgan & Claypool Publishers.
View/Download from: UTS OPUS or Publisher's site
It has become highly desirable to provide users with flexible ways to query/search information over databases, as simple as a Google-style keyword search. This book surveys the recent developments on keyword search over databases, and focuses on finding structural information among objects in a database using a set of keywords. Such structural information to be returned can be either trees or subgraphs representing how the objects that contain the required keywords are interconnected in a relational database or in an XML database. Structural keyword search is completely different from finding documents that contain all the user-given keywords. The former focuses on the interconnected object structures, whereas the latter focuses on the object content. The book is organized as follows. In Chapter 1, we highlight the main research issues on structural keyword search in different contexts. In Chapter 2, we focus on supporting structural keyword search in a relational database management system using the SQL query language. We concentrate on how to generate a set of SQL queries that can find all the structural information among records in a relational database completely, and how to evaluate the generated set of SQL queries efficiently. In Chapter 3, we discuss graph algorithms for structural keyword search by treating an entire relational database as a large data graph. In Chapter 4, we discuss structural keyword search in a large tree-structured XML database. In Chapter 5, we highlight several interesting research issues regarding keyword search on databases. The book can be used either as an extended survey for people who are interested in structural keyword search or as a reference book for a postgraduate course on the related topics.

Conferences

Liu, S., Liu, A., Zhao, L., Liu, G., Li, Z., Zhao, P., Zheng, K. & Qin, L. 2016, 'Efficient query processing with mutual privacy protection for location-based services', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 299-313.
View/Download from: Publisher's site
© Springer International Publishing Switzerland 2016. Data privacy in location-based services involves two aspects. The location of a user is a kind of private data, as much sensitive information can be inferred from it given some background knowledge. On the other hand, the POI database is a great asset to the LBS provider, as its construction requires many resources and much effort. In this paper, we propose a method of protecting mutual privacy (i.e., the location of the user issuing a query and the POI database of the LBS provider) for location-based query processing. Our approach consists of two steps: data preparation and query processing. Data preparation is conducted by the LBS itself and is totally an offline computation, while query processing involves some online computation and multiple rounds of communication between the LBS and the user. We implement the query processing by two rounds of oblivious transfer extension (OT-Extension) on two small key sets, resulting in an immediate response even on some big POI databases. We also theoretically prove the security and analyze the complexity of our approach. Compared with two state-of-the-art methods, our approach achieves several orders of magnitude improvement in response time, at the expense of a small and acceptable communication cost.
Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2016, 'I/O efficient ECC graph decomposition via graph reduction', Proceedings of the VLDB Endowment, pp. 516-527.
© 2016 VLDB Endowment. The problem of computing k-edge connected components (k-ECCs) of a graph G for a specific k is a fundamental graph problem and has been investigated recently. In this paper, we study the problem of ECC decomposition, which computes the k-ECCs of a graph G for all k values. ECC decomposition can be widely applied in a variety of applications such as graph-topology analysis, community detection, Steiner component search, and graph visualization. A straightforward solution for ECC decomposition is to apply the existing k-ECC computation algorithm to compute the k-ECCs for all k values. However, this solution is not applicable to large graphs for two challenging reasons. First, all existing k-ECC computation algorithms are highly memory intensive due to the complex data structures used in the algorithms. Second, the number of possible k values can be very large, resulting in a high computational cost when each k value is independently considered. In this paper, we address the above challenges, and study I/O efficient ECC decomposition via graph reduction. We introduce two elegant graph reduction operators which aim to reduce the size of the graph loaded in memory while preserving the connectivity information of a certain set of edges to be computed for a specific k. We also propose three novel I/O efficient algorithms, Bottom-Up, Top-Down, and Hybrid, that explore the k values in different orders to reduce the redundant computations between different k values. We analyze the I/O and memory costs for all proposed algorithms. In our experiments, we evaluate our algorithms using seven real large datasets with various graph properties, one of which contains 1.95 billion edges. The experimental results show that our proposed algorithms are scalable and efficient.
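The "straightforward solution" dismissed above is still useful as a mental baseline. Below is a minimal sketch of it, using networkx's k_edge_components as the per-k black box; the function name and loop structure are illustrative, not the paper's algorithms.

```python
import networkx as nx
from networkx.algorithms.connectivity import k_edge_components

def naive_ecc_decomposition(G: nx.Graph) -> dict:
    """Baseline ECC decomposition: compute the k-ECCs independently
    for every k, sharing nothing between successive k values.
    This is exactly the approach the paper argues does not scale."""
    decomposition = {}
    k = 1
    while True:
        comps = [c for c in k_edge_components(G, k) if len(c) > 1]
        if not comps:          # no non-trivial k-ECC survives at this k
            break
        decomposition[k] = comps
        k += 1
    return decomposition
```

The paper's Bottom-Up, Top-Down, and Hybrid algorithms replace this loop with graph reductions so that later k values operate on a much smaller graph.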
Feng, X., Chang, L., Lin, X., Qin, L. & Zhang, W. 2016, 'Computing Connected Components with linear communication cost in pregel-like systems', 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016, pp. 85-96.
View/Download from: Publisher's site
© 2016 IEEE. The paper studies two fundamental problems in graph analytics: computing Connected Components (CCs) and computing BiConnected Components (BCCs) of a graph. With the recent advent of Big Data, developing efficient distributed algorithms for computing CCs and BCCs of a big graph has received increasing interest. As with the existing research efforts, in this paper we focus on the Pregel programming model, while the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur O(m · #supersteps) total costs for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since the network communication speed is usually much slower than the computation speed, communication costs are the dominant costs of the total running time in the existing techniques. In this paper, we propose a new paradigm based on graph decomposition to reduce the total communication costs from O(m · #supersteps) to O(m), for both computing CCs and computing BCCs. Moreover, the total computation costs of our techniques are smaller than those of the existing techniques in practice, though theoretically they are almost the same. Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by one order of magnitude regarding the total running time.
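For intuition, the O(m · #supersteps) behaviour comes from hash-min label propagation, the standard Pregel recipe for CCs: every vertex repeatedly adopts the smallest id seen in its neighborhood, one superstep per round. A minimal single-machine simulation of that baseline (names are illustrative; this is not the paper's decomposition technique):

```python
def pregel_style_cc(adj):
    """Hash-min connected components: each round mimics one Pregel
    superstep in which every vertex adopts the smallest label in its
    neighborhood. Per-superstep communication is O(m), so the total
    cost is O(m * #supersteps) -- the bottleneck the paper removes."""
    label = {v: v for v in adj}            # initial label = own id
    changed, supersteps = True, 0
    while changed:
        changed = False
        supersteps += 1
        for v in adj:
            m = min((label[u] for u in adj[v]), default=label[v])
            if m < label[v]:
                label[v] = m
                changed = True
    return label, supersteps
```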
Chang, L., Li, W., Lin, X., Qin, L. & Zhang, W. 2016, 'pSCAN: Fast and exact structural graph clustering', 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016, pp. 253-264.
View/Download from: Publisher's site
© 2016 IEEE. In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering is to assign vertices in V to clusters, and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected to each other. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs due to exhaustively computing structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering, which present opportunities for further optimization. Based on these observations, we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach aiming to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Notably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.
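The quantity SCAN computes exhaustively for every pair of adjacent vertices, and whose recomputation the new approach avoids, is the structural similarity between closed neighborhoods. A minimal sketch, assuming adjacency stored as Python sets:

```python
import math

def structural_similarity(adj, u, v):
    """Structural (cosine) similarity used by SCAN-family clustering:
    sim(u, v) = |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|),
    where N[x] is the closed neighborhood of x (x plus its neighbors)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))
```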
Li, Z., Qin, L., Cheng, H., Zhang, X. & Zhou, X. 2016, 'TRIP: An interactive retrieving-inferring data imputation approach', 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016, pp. 1462-1463.
View/Download from: Publisher's site
© 2016 IEEE. Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches for non-quantitative string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing values from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help, formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This web-based retrieving approach reaches a high imputation precision and recall, but on the other hand issues a large number of web search queries, which brings a large overhead [1].
Lyu, B., Qin, L., Lin, X., Chang, L. & Yu, J.X. 2016, 'Scalable supergraph search in large graph databases', 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016, pp. 157-168.
View/Download from: Publisher's site
© 2016 IEEE. Supergraph search is a fundamental problem in graph databases that is widely applied in many application scenarios. Given a graph database and a query-graph, supergraph search retrieves all data-graphs contained in the query-graph from the graph database. Most existing solutions for supergraph search follow the pruning-and-verification framework, which prunes false answers based on features in the pruning phase and performs subgraph isomorphism testing on the remaining graphs in the verification phase. However, they are not scalable to handle large-sized data-graphs and query-graphs due to three drawbacks. First, they rely on a frequent subgraph mining algorithm to select features, which is expensive and cannot generate large features. Second, they require a costly verification phase. Third, they process features in a fixed order without considering their relationship to the query-graph. In this paper, we address the three drawbacks and propose new indexing and query processing algorithms. In indexing, we select features directly from the data-graphs without expensive frequent subgraph mining. The features form a feature-tree that contains features of all sizes, and both the cost sharing and the pruning power of the features are considered. In query processing, we propose a verification-free algorithm, where the order in which features are processed is query-dependent, considering both the cost sharing and the pruning power. We explore two optimization strategies to further improve the algorithm efficiency. The first strategy applies a lightweight graph compression technique and the second strategy optimizes the inclusion of answers. Finally, we conduct extensive performance studies on two real large datasets to demonstrate the high scalability of our algorithms.
Bi, F., Chang, L., Lin, X., Qin, L. & Zhang, W. 2016, 'Efficient subgraph matching by postponing Cartesian products', Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1199-1214.
View/Download from: Publisher's site
© 2016 ACM. In this paper, we study the problem of subgraph matching that extracts all subgraph isomorphic embeddings of a query graph q in a large data graph G. The existing algorithms for subgraph matching follow Ullmann's backtracking approach; that is, they iteratively map query vertices to data vertices by following a matching order of query vertices. It has been shown that the matching order of query vertices is a very important factor in the efficiency of a subgraph matching algorithm. Recently, many advanced techniques, such as enforcing connectivity and merging similar vertices in query or data graphs, have been proposed to provide an effective matching order with the aim of reducing unpromising intermediate results, especially the ones caused by redundant Cartesian products. In this paper, for the first time we address the issue of unpromising results by Cartesian products from "dissimilar" vertices. We propose a new framework that postpones the Cartesian products based on the structure of a query to minimize the redundant Cartesian products. Our second contribution is a new path-based auxiliary data structure, of size O(|E(G)| · |V(q)|), to generate a matching order and conduct subgraph matching, which significantly reduces the exponential size O(|V(G)|^(|V(q)|-1)) of the existing path-based auxiliary data structure, where V(G) and E(G) are the vertex and edge sets of a data graph G, respectively, and V(q) is the vertex set of a query q. Extensive empirical studies on real and synthetic graphs demonstrate that our techniques outperform the state-of-the-art algorithms by up to 3 orders of magnitude.
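A bare-bones version of Ullmann-style backtracking shows why the matching order and candidate sets dominate the cost: the naive candidate loop below scans every data vertex at every level, which is exactly what good orders and auxiliary structures cut down. All names are mine, and this only illustrates the framework, not the paper's algorithm:

```python
def subgraph_match(q_adj, g_adj, order):
    """Ullmann-style backtracking: map query vertices to data vertices
    one by one, following a fixed matching order of query vertices.
    q_adj / g_adj are adjacency sets; `order` lists all query vertices."""
    results = []

    def backtrack(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))      # full embedding found
            return
        qv = order[len(mapping)]
        for gv in g_adj:                        # naive candidate set
            if gv in mapping.values():
                continue                        # keep the mapping injective
            # every already-mapped query neighbor must map to a data neighbor
            if all(mapping[qu] in g_adj[gv]
                   for qu in q_adj[qv] if qu in mapping):
                mapping[qv] = gv
                backtrack(mapping)
                del mapping[qv]

    backtrack({})
    return results
```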
Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2016, 'Efficient and progressive Group Steiner Tree search', Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 91-106.
View/Download from: Publisher's site
© 2016 ACM. The Group Steiner Tree (GST) problem is a fundamental problem in the database area that has been successfully applied to keyword search in relational databases and team search in social networks. The state-of-the-art algorithm for the GST problem is a parameterized dynamic programming (DP) algorithm, which finds the optimal tree in O(3^k·n + 2^k·(n log n + m)) time, where k is the number of given groups, and m and n are the numbers of edges and nodes of the graph respectively. The major limitations of the parameterized DP algorithm are twofold: (i) it is intractable even for very small values of k (e.g., k = 8) in large graphs due to its exponential complexity, and (ii) it cannot generate a solution until the algorithm has completed its entire execution. To overcome these limitations, we propose an efficient and progressive GST algorithm in this paper, called PrunedDP. It is based on newly-developed optimal-tree decomposition and conditional tree merging techniques. The proposed algorithm not only drastically reduces the search space of the parameterized DP algorithm, but it also produces progressively-refined feasible solutions during algorithm execution. To further speed up the PrunedDP algorithm, we propose a progressive A*-search algorithm, based on several carefully-designed lower-bounding techniques. We conduct extensive experiments to evaluate our algorithms on several large scale real-world graphs. The results show that our best algorithm is not only able to generate progressively-refined feasible solutions, but it also finds the optimal solution with at least two orders of magnitude acceleration over the state-of-the-art algorithm, using much less memory.
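The parameterized DP behind the O(3^k·n + 2^k·(n log n + m)) bound maintains, for every node v and every subset S of the k groups, the cheapest tree rooted at v that covers S. In the usual formulation (notation mine), the recurrence is:

```latex
T(v, S) = \min\left\{\;
  \min_{(v,u)\in E}\bigl[\,w(v,u) + T(u, S)\,\bigr],\;\;
  \min_{\emptyset \ne S' \subsetneq S}\bigl[\,T(v, S') + T(v, S\setminus S')\,\bigr]
\;\right\}
```

Enumerating the subset splits S' over all S contributes the 3^k factor, and growing trees along edges with a priority queue contributes the 2^k·(n log n + m) term; this state space is what PrunedDP prunes.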
Chang, L., Lin, X., Qin, L., Yu, J.X. & Pei, J. 2015, 'Efficiently Computing Top-K Shortest Path Join', Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015., pp. 133-144.
View/Download from: UTS OPUS or Publisher's site
Chang, L., Lin, X., Zhang, W., Yu, J.X., Zhang, Y. & Qin, L. 2015, 'Optimal Enumeration: Efficient Top-k Tree Matching', Proceedings of the VLDB Endowment, 41st International Conference on Very Large Data Bases, VLDB Endowment, Kohala Coast, Hawaii, pp. 533-544.
View/Download from: UTS OPUS or Publisher's site
Driven by many real applications, graph pattern matching has attracted a great deal of attention recently. Consider that a twig-pattern matching may result in an extremely large number of matches in a graph; this may not only confuse users by providing too many results but also lead to high computational costs. In this paper, we study the problem of top-k tree pattern matching; that is, given a rooted tree T, compute its top-k matches in a directed graph G based on the twig-pattern matching semantics. We first present a novel and optimal enumeration paradigm based on the principle of Lawler's procedure. We show that our enumeration algorithm runs in O(n_T + log k) time in each round, where n_T is the number of nodes in T. Considering that the time complexity to output a match of T is O(n_T) and n_T ≥ log k in practice, our enumeration technique is optimal. Moreover, the cost of generating the top-1 match of T in our algorithm is O(m_R), where m_R is the number of edges in the transitive closure of a data graph G involving all nodes relevant to T. O(m_R) is also optimal in the worst case without prior knowledge of G. Consequently, our algorithm is optimal with running time O(m_R + k(n_T + log k)), in contrast to the time complexity O(m_R log k + k·n_T(log k + d_T)) of the existing technique, where d_T is the maximal node degree in T. Secondly, a novel priority-based access technique is proposed, which greatly reduces the number of edges accessed and results in a significant performance improvement. Finally, we apply our techniques to the general form of the top-k graph pattern matching problem (i.e., the query is a graph) to improve the existing techniques. Comprehensive empirical studies demonstrate that our techniques may improve the existing techniques by orders of magnitude.
Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2015, 'Influential Community Search in Large Networks', Proceedings of the VLDB Endowment, 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii, pp. 509-520.
View/Download from: UTS OPUS
Community search is a problem of finding densely connected subgraphs that satisfy the query conditions in a network, which has attracted much attention in recent years. However, all the previous studies on community search do not consider the influence of a community. In this paper, we introduce a novel community model called k-influential community based on the concept of k-core, which can capture the influence of a community. Based on the new community model, we propose a linear-time online search algorithm to find the top-r k-influential communities in a network. To further speed up the influential community search algorithm, we devise a linear-space index structure which supports efficient search of the top-r k-influential communities in optimal time. We also propose an efficient algorithm to maintain the index when the network is frequently updated. We conduct extensive experiments on 7 real-world large networks, and the results demonstrate the efficiency and effectiveness of the proposed methods.
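Since a k-influential community is built on the k-core, the textbook peeling routine below shows the linear-time primitive involved (a sketch of the standard k-core computation, not the paper's index-based search):

```python
from collections import deque

def k_core(adj, k):
    """Peel vertices of degree < k until none remain; the survivors
    induce the k-core. adj maps each vertex to a set of neighbors."""
    deg = {v: len(adj[v]) for v in adj}
    queue = deque(v for v in adj if deg[v] < k)
    removed = set()
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)   # u no longer qualifies
    return set(adj) - removed
```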
Chang, L., Lin, X., Qin, L., Yu, J.X. & Zhang, W. 2015, 'Index-based optimal algorithms for computing steiner components with maximum connectivity', Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 459-474.
View/Download from: UTS OPUS or Publisher's site
With the proliferation of graph applications, the problem of efficiently computing all k-edge connected components of a graph G for a user-given k has been recently investigated. In this paper, we study the problem of efficiently computing the Steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC. To accommodate online query processing, we present an efficient algorithm based on a novel index such that the algorithm runs in linear time regarding the result size; thus, the algorithm is optimal since it needs at least linear time to output the result. Moreover, in this paper we also investigate variations of the above problem. We show that such a problem with the constraint that the size of the SMCC is not smaller than a given size can also be solved in linear time regarding the result size (thus, optimal). We also show that the problem of computing the connectivity (rather than the graph details) of the SMCC can be solved in linear time regarding the query size (thus, optimal). To build the index, we extend the techniques in [7] to accommodate batch processing and computation sharing. To efficiently support applications with graph updates, we also present novel incremental techniques. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our index-based algorithms significantly outperform baseline algorithms by several orders of magnitude and that our indexing algorithms are efficient.
Qin, L., Li, R.H., Chang, L. & Zhang, C. 2015, 'Locally densest subgraph discovery', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Sydney, Australia, pp. 965-974.
View/Download from: UTS OPUS or Publisher's site
© 2015 ACM. Mining dense subgraphs from a large graph is a fundamental graph mining task and can be widely applied in a variety of application domains such as network science, biology, graph databases, web mining, graph compression, and micro-blogging systems. Here a dense subgraph is defined as a subgraph with high density (#edges / #nodes). Existing studies of this problem either focus on finding the densest subgraph or on identifying an optimal clique-like dense subgraph, and they adopt a simple greedy approach to find the top-k dense subgraphs. However, their identified subgraphs cannot be used to represent the dense regions of the graph. Intuitively, to represent a dense region, the identified subgraph should be the subgraph with the highest density in its local region of the graph. However, it is non-trivial to formally model a locally densest subgraph. In this paper, we aim to discover the top-k such representative locally densest subgraphs of a graph. We provide an elegant parameter-free definition of a locally densest subgraph. The definition not only fits well with the intuition, but is also associated with several nice structural properties. We show that the set of locally densest subgraphs in a graph can be computed in polynomial time. We further propose three novel pruning strategies to largely reduce the search space of the algorithm. In our experiments, we use several real datasets with various graph properties to evaluate the effectiveness of our model using four quality measures and a case study. We also test our algorithms on several real web-scale graphs, one of which contains 118.14 million nodes and 1.02 billion edges, to demonstrate the high efficiency of the proposed algorithms.
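To make the density objective concrete, here is the classic greedy peeling 1/2-approximation for the (global) densest subgraph, the kind of simple greedy the abstract contrasts with. Names are mine, and this is not the paper's locally-densest algorithm:

```python
import heapq

def greedy_densest(adj):
    """Charikar-style greedy: repeatedly remove a minimum-degree vertex
    and remember the densest prefix, where density = #edges / #nodes.
    adj maps each vertex to a set of neighbors (simple undirected graph)."""
    deg = {v: len(adj[v]) for v in adj}
    edges = sum(deg.values()) // 2
    alive = set(adj)
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    best = (edges / len(alive), set(alive))
    while len(alive) > 1:
        d, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue                     # stale heap entry
        alive.remove(v)
        edges -= deg[v]                  # v's remaining incident edges vanish
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
        density = edges / len(alive)
        if density > best[0]:
            best = (density, set(alive))
    return best                           # (density, vertex set)
```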
Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2015, 'Diversified top-k clique search', Proceedings - International Conference on Data Engineering, pp. 387-398.
View/Download from: UTS OPUS or Publisher's site
© 2015 IEEE. Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns an enormous number of cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem, which is to find top-k maximal cliques that cover the largest number of nodes in the graph. Diversified top-k clique search can be widely used in many applications, including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover the most nodes in the graph using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has a limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight PNP-Index, based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. We conduct extensive performance studies on six real graphs, one of which contains 0.3 billion edges, and the results demonstrate the high efficiency and effectiveness of our approach.
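A minimal version of that naive baseline, assuming networkx for maximal clique enumeration; it materializes every maximal clique, which is precisely the memory blow-up the paper's candidate-maintenance algorithm avoids:

```python
import networkx as nx

def diversified_topk_cliques(G: nx.Graph, k: int):
    """Naive baseline: enumerate all maximal cliques, then greedily pick
    k of them maximizing marginal node coverage (standard max k-cover
    greedy, with its (1 - 1/e) guarantee)."""
    cliques = [set(c) for c in nx.find_cliques(G)]   # all maximal cliques
    covered, chosen = set(), []
    for _ in range(min(k, len(cliques))):
        best = max(cliques, key=lambda c: len(c - covered))
        if not best - covered:
            break                                    # nothing new to cover
        chosen.append(best)
        covered |= best
    return chosen
```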
Li, R.H., Yu, J.X., Qin, L., Mao, R. & Jin, T. 2015, 'On random walk based graph sampling', Proceedings - International Conference on Data Engineering, International Conference on Data Engineering, IEEE, Seoul, South Korea, pp. 927-938.
View/Download from: UTS OPUS or Publisher's site
© 2015 IEEE. Random walk based graph sampling has been recognized as a fundamental technique to collect uniform node samples from a large graph. In this paper, we first present a comprehensive analysis of the drawbacks of three widely-used random walk based graph sampling algorithms, called re-weighted random walk (RW) algorithm, Metropolis-Hastings random walk (MH) algorithm and maximum-degree random walk (MD) algorithm. Then, to address the limitations of these algorithms, we propose two general random walk based algorithms, named rejection-controlled Metropolis-Hastings (RCMH) algorithm and generalized maximum-degree random walk (GMD) algorithm. We show that RCMH balances the tradeoff between the limitations of RW and MH, and GMD balances the tradeoff between the drawbacks of RW and MD. To further improve the performance of our algorithms, we integrate the so-called delayed acceptance technique and the non-backtracking random walk technique into RCMH and GMD respectively. We conduct extensive experiments over four real-world datasets, and the results demonstrate the effectiveness of the proposed algorithms.
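As a concrete example of one of the walks analyzed above, the Metropolis-Hastings random walk targets the uniform node distribution by occasionally rejecting moves toward high-degree neighbors. A minimal sketch assuming adjacency lists; it illustrates plain MH, not the paper's RCMH refinement:

```python
import random

def mh_uniform_walk(adj, start, steps):
    """Metropolis-Hastings random walk for uniform node sampling:
    propose a uniform neighbor v of the current node u and accept with
    probability min(1, deg(u)/deg(v)); otherwise stay put. Rejecting
    moves toward high-degree nodes corrects the degree bias of the
    simple random walk."""
    u, samples = start, []
    for _ in range(steps):
        v = random.choice(adj[u])
        if random.random() < min(1.0, len(adj[u]) / len(adj[v])):
            u = v                      # accept the proposal
        samples.append(u)              # a rejection repeats u (self-loop)
    return samples
```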
Zhang, Z., Yu, J.X., Qin, L. & Shang, Z. 2015, 'Divide & conquer: I/O efficient depth-first search', Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 445-458.
View/Download from: UTS OPUS or Publisher's site
Copyright © 2015 ACM. Depth-First Search (DFS), which traverses a graph in depth-first order, is one of the fundamental graph operations, and the result of DFS over all nodes in G is a spanning tree known as a DFS-Tree. Many graph algorithms need DFS, such as connected component computation, topological sort, community detection, Eulerian path computation, graph bipartiteness testing, and planar graph testing, because the in-memory DFS algorithm runs in linear time w.r.t. the size of G. However, given the fact that real-world graphs grow rapidly in the big data era, the in-memory DFS algorithm cannot be used to handle a large graph that cannot be entirely held in main memory. In this paper, we focus on I/O efficiency and study semi-external algorithms to DFS a graph G which is on disk. Here, like the existing semi-external algorithms, we assume that a spanning tree of G can be held in main memory and the remaining edges of G are kept on disk, and we compute the DFS-Tree in main memory with which a DFS can be identified. We propose novel divide & conquer algorithms to DFS over a graph G on disk. In brief, we divide a graph into several subgraphs, compute the DFS-Tree for each subgraph independently, and then merge them together to compute the DFS-Tree for the whole graph. With the global DFS-Tree computed, we identify the DFS. We discuss the valid division that can lead to the correct DFS, and the challenges in doing so. We propose two division algorithms, named Divide-Star and Divide-TD, and a merge algorithm. We conduct extensive experimental studies using four real massive datasets and several synthetic datasets to confirm the I/O efficiency of our approach.
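The in-memory primitive everything above builds on is a linear-time DFS that materializes the DFS-Tree as a parent map; a standard iterative sketch (not the paper's divide & conquer method):

```python
def dfs_tree(adj, root):
    """Standard in-memory DFS returning the parent map of a DFS-Tree.
    Linear in the graph size -- but only when the whole edge set fits
    in memory, the assumption the semi-external algorithms relax."""
    parent = {root: None}
    stack = [(root, iter(adj[root]))]
    while stack:
        v, it = stack[-1]
        for u in it:
            if u not in parent:
                parent[u] = v
                stack.append((u, iter(adj[u])))
                break                  # descend into u first (depth-first)
        else:
            stack.pop()                # v's neighborhood is exhausted
    return parent
```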
Lai, L., Qin, L., Lin, X. & Chang, L. 2015, 'Scalable Subgraph Enumeration in MapReduce', Proceedings of the VLDB Endowment, 41st International Conference on Very Large Data Bases, VLDB Endowment, Kohala Coast, Hawaii, pp. 974-985.
View/Download from: UTS OPUS
Subgraph enumeration, which aims to find all the subgraphs of a large data graph that are isomorphic to a given pattern graph, is a fundamental graph problem with a wide range of applications. However, existing sequential algorithms for subgraph enumeration fall short in handling large graphs due to the involvement of computationally intensive subgraph isomorphism operations. Thus, some recent research focuses on solving the problem using MapReduce. Nevertheless, existing MapReduce approaches are not scalable to handle very large graphs since they either produce a huge number of partial results or consume a large amount of memory. Motivated by this, in this paper, we propose a new algorithm, TwinTwigJoin, based on a left-deep-join framework in MapReduce, in which the basic join unit is a TwinTwig (an edge or two incident edges of a node). We show that in the Erdős–Rényi random-graph model, TwinTwigJoin is instance optimal in the left-deep-join framework under reasonable assumptions, and we devise an algorithm to compute the optimal join plan. Three optimization strategies are explored to improve our algorithm. Furthermore, we discuss how our approach can be adapted to the power-law random-graph model. We conduct extensive performance studies on several real graphs, one of which contains billions of edges. Our approach significantly outperforms existing solutions in all tests.
Zhang, S., Qin, L., Zheng, Y. & Cheng, H. 2015, 'Effective and efficient: Large-scale dynamic city express', GIS: Proceedings of the ACM International Symposium on Advances in Geographic Information Systems, ACM International Symposium on Advances in Geographic Information Systems, ACM, Seattle, Washington.
View/Download from: UTS OPUS or Publisher's site
City express services have been in great demand in recent years. However, the current city express system is unsatisfactory for both service providers and customers. In this paper, we are the first to systematically study the large-scale dynamic city express problem. We aim to increase both the effectiveness and the efficiency of the scheduling algorithm. To improve effectiveness, we adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period rather than dealing with each request individually. To improve efficiency, we design a two-level priority queue structure to reduce redundant shortest distance calculation and repeated candidate generation. We develop a simulation system and conduct extensive performance studies on the real road network of Beijing. The experimental results demonstrate the high effectiveness and efficiency of our algorithm.
Huang, X., Cheng, H., Qin, L., Tian, W. & Yu, J.X. 2014, 'Querying k-truss community in large and dynamic graphs', International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pp. 1311-1322.
View/Download from: UTS OPUS or Publisher's site
Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C. & Lin, X. 2014, 'Scalable big graph processing in MapReduce', International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pp. 827-838.
View/Download from: UTS OPUS or Publisher's site
Zhang, Z., Qin, L. & Yu, J.X. 2014, 'Contract & Expand: I/O Efficient SCCs Computing', IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, 2014 IEEE 30th International Conference on Data Engineering (ICDE), IEEE, Piscataway, USA, pp. 208-219.
View/Download from: UTS OPUS or Publisher's site
As an important branch of big data processing, big graph processing has become increasingly popular in recent years. Strongly connected component (SCC) computation is a fundamental graph operation on directed graphs, where an SCC is a maximal subgraph S of a directed graph G in which every pair of nodes is reachable from each other in S. By contracting each SCC into a node, a large general directed graph can be represented by a small directed acyclic graph (DAG). In the literature, there are I/O efficient semi-external algorithms to compute all SCCs of a graph G, which assume that all nodes of G can fit in main memory. However, many real graphs are so large that even the nodes cannot reside entirely in main memory. In this paper, we study new I/O efficient external algorithms to find all SCCs for a directed graph G whose nodes cannot fit entirely in main memory. To overcome the deficiency of the existing external graph-contraction based approach, which usually cannot stop in a finite number of iterations, and of the external DFS based approach, which generates a large number of random I/Os, we explore a new contraction-expansion based approach. In the graph contraction phase, instead of contracting the whole graph as the contraction based approach does, we only contract the nodes of a graph, which are much more selective. The contraction phase stops when all nodes of the graph can fit in main memory, such that the semi-external algorithm can be used in SCC computation. In the graph expansion phase, as the graph is expanded in the reverse order to that in which it was contracted, the SCCs of all nodes in the graph are computed. Both the graph contraction phase and the graph expansion phase use only I/O efficient sequential scans and external sorts of nodes/edges in the graph. Our algorithm leverages the efficiency of the semi-external SCC computation algorithm and usually stops in a small number of iterations. We further optimize our approach by reducing the size of nodes and edges of t...
Chang, L., Lin, X., Qin, L., Liu, C. & Liang, W. 2013, 'Efficiently Computing k-Edge Connected Components via Graph Decomposition', Proceedings of ACM Conference on Management of Data, ACM Conference on Management of Data, ACM, New York, NY, USA, pp. 205-216.
View/Download from: UTS OPUS or Publisher's site
Keywords: k-edge connected components; graph decomposition; minimum cut
Zhang, Z., Yu, J.X., Qin, L., Chang, L. & Lin, X. 2013, 'I/O Efficient: Computing SCCs in Massive Graphs', Proceedings of ACM Conference on Management of Data, ACM Conference on Management of Data, ACM, New York, NY, USA, pp. 181-192.
View/Download from: UTS OPUS or Publisher's site
A strongly connected component (SCC) is a maximal subgraph of a directed graph G in which every pair of nodes is reachable from each other within the SCC. With such a property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting each SCC of G to a node in the DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all SCCs in a directed graph G is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all SCCs in linear time w.r.t. the size of a graph. However, when a graph cannot reside entirely in main memory, the existing external or semi-external algorithms to find all SCCs are limited in achieving high I/O efficiency. In this paper, we study new I/O efficient semi-external algorithms to find all SCCs for a massive directed graph G that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS based semi-external algorithm that heavily relies on a total order, we explore a weak order, based on which we investigate new algorithms. We propose a new two-phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of G can be constructed in a bounded number of sequential scans of G. In the tree search phase, the graph needs to be sequentially scanned only once to find all SCCs.
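For contrast with the semi-external setting, the classic in-memory route is a two-pass DFS over G and its reverse (Kosaraju's algorithm); the sketch below is the linear-time baseline whose random-access pattern breaks down once the edges live on disk. Names are mine:

```python
def kosaraju_scc(adj):
    """Classic two-pass in-memory SCC computation: order vertices by
    DFS finish time, then collect components by DFS on the reversed
    graph in decreasing finish order."""
    # Pass 1: post-order (finish times) via iterative DFS.
    seen, order = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            v, it = stack[-1]
            for u in it:
                if u not in seen:
                    seen.add(u)
                    stack.append((u, iter(adj[u])))
                    break
            else:
                order.append(v)        # v is finished
                stack.pop()
    # Pass 2: traverse the reversed graph in reverse finish order.
    radj = {v: [] for v in adj}
    for v in adj:
        for u in adj[v]:
            radj[u].append(v)
    sccs, assigned = [], set()
    for v in reversed(order):
        if v in assigned:
            continue
        comp, stack = [], [v]
        assigned.add(v)
        while stack:
            x = stack.pop()
            comp.append(x)
            for u in radj[x]:
                if u not in assigned:
                    assigned.add(u)
                    stack.append(u)
        sccs.append(comp)
    return sccs
```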
Qiao, M., Qin, L., Cheng, H., Yu, J.X. & Tian, W. 2013, 'Top-K nearest keyword search on large graphs', Proceedings of the VLDB Endowment, VLDB Endowment, ACM, Trento, Italy, pp. 901-912.
View/Download from: UTS OPUS or Publisher's site
It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned a weight measuring its length. Given a query node q in G and a keyword λ, a k-NK query seeks k nodes which contain λ and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such a representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is optimized for query time in the scenario where a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.
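The exact-but-slow baseline for a k-NK query is a plain Dijkstra search from the query node that stops after k keyword hits; a minimal sketch (the data layout below, weighted adjacency lists plus per-node keyword sets, is an assumption of this sketch), useful for seeing what the tree-based distance estimation buys:

```python
import heapq

def knk_baseline(adj, node_keywords, q, w, k):
    """Answer a k-NK query (q, w, k) exactly with Dijkstra: settle
    nodes in increasing distance from q and report the first k that
    carry keyword w. adj[u] is a list of (v, edge_length) pairs;
    node_keywords[v] is a set of keywords attached to v."""
    dist, heap, answers = {q: 0.0}, [(0.0, q)], []
    while heap and len(answers) < k:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                    # stale heap entry
        if w in node_keywords.get(u, ()):
            answers.append((u, d))      # settled => d is exact
        for v, length in adj[u]:
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return answers
```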
Huang, X., Cheng, H., Li, R., Qin, L. & Yu, J.X. 2013, 'Top-K Structural Diversity Search in Large Networks', Proceedings of the VLDB Endowment, VLDB Endowment, ACM, Trento, Italy, pp. 1618-1629.
View/Download from: UTS OPUS or Publisher's site
Social contagion depicts a process of information (e.g., fads, opinions, news) diffusion in the online social networks. A recent study reports that in a social contagion process the probability of contagion is tightly controlled by the number of connected components in an individual's neighborhood. Such a number is termed structural diversity of an individual and it is shown to be a key predictor in the social contagion process. Based on this, a fundamental issue in a social network is to find top-k users with the highest structural diversities. In this paper, we, for the first time, study the top-k structural diversity search problem in a large network. Specifically, we develop an effective upper bound of structural diversity for pruning the search space. The upper bound can be incrementally refined in the search process. Based on such upper bound, we propose an efficient framework for top-k structural diversity search. To further speed up the structural diversity evaluation in the search process, several carefully devised heuristic search strategies are proposed. Extensive experimental studies are conducted in 13 real-world large networks, and the results demonstrate the efficiency and effectiveness of the proposed methods.
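The core quantity is easy to state: the structural diversity of a vertex is the number of connected components of the subgraph induced by its neighborhood. A one-function sketch with networkx; the paper's contribution is finding the top-k such vertices without evaluating this exhaustively:

```python
import networkx as nx

def structural_diversity(G: nx.Graph, v) -> int:
    """Number of connected components in the subgraph induced by
    v's neighbors (v itself excluded), per the definition above."""
    return nx.number_connected_components(G.subgraph(G.neighbors(v)))
```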
Zhang, Z., Yu, J.X., Qin, L., Zhu, Q. & Zhou, X. 2012, 'I/O Cost Minimization: Reachability Queries Processing over Massive Graphs', Proceedings of the 15th International Conference on Extending Database Technology, International Conference on Extending Database Technology, ACM, Berlin, Germany, pp. 468-479.
View/Download from: UTS OPUS or Publisher's site
Given a directed graph G, a reachability query (u, v) asks whether there exists a path from a node u to a node v in G. The existing studies support reachability queries using indexing techniques, where both the graph and the index are required to reside in main memory. However, they cannot handle reachability queries on massive graphs, when the graph and the index cannot be entirely held in memory because of the high I/O cost. In this paper, we focus on how to minimize the I/O cost when answering reachability queries on massive graphs that cannot reside entirely in memory. First, we propose a new Yes-Label scheme, as a complement of the No-Label used in GRAIL [23], to reduce the number of intermediate results generated. Second, we show how to minimize the number of I/Os using a heap-on-disk data structure when traversing a graph. We also propose new methods to partition the heap-on-disk, in order to ensure that only sequential I/Os are performed. Third, we analyze our approaches and show how to extend our approaches to answer multiple reachability queries effectively. Finally, we conducted extensive performance studies on both large synthetic and large real graphs, and confirm the efficiency of our approaches.
Zhu, Y., Qin, L., Yu, J.X. & Cheng, H. 2012, 'Finding Top-K Similar Graphs in Graph Databases', Proceedings of the 15th International Conference on Extending Database Technology, International Conference on Extending Database Technology, ACM, Berlin, Germany, pp. 456-467.
View/Download from: UTS OPUS or Publisher's site
Querying similar graphs in graph databases has been widely studied in graph query processing in recent years. Existing works mainly focus on subgraph similarity search and supergraph similarity search. In this paper, we study the problem of finding top-k graphs in a graph database that are most similar to a query graph. This problem has many applications, such as image retrieval and chemical compound structure search. Regarding the similarity measure, feature based and kernel based similarity measures have been used in the literature. But such measures are rough and may lose the connectivity information among substructures. In this paper, we introduce a new similarity measure based on the maximum common subgraph (MCS) of two graphs. We show that this measure can better capture the common and different structures of two graphs. Since computing the MCS of two graphs is NP-hard, we propose an algorithm to answer the top-k graph similarity query using two distance lower bounds with different computational costs, in order to reduce the number of MCS computations. We further introduce an indexing technique, which can better make use of the triangle property of similarities among graphs in the database to get tighter lower bounds. Three different indexing methods are proposed with different tradeoffs between pruning power and construction cost. We conducted extensive performance studies on large real datasets to evaluate the performance of our approaches.
Zhu, Y., Yu, J.X., Cheng, H. & Qin, L. 2012, 'Graph Classification: A Diversified Discriminative Feature Selection Approach', Proceedings of 2012 ACM International Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, ACM, Maui, HI, USA, pp. 205-214.
View/Download from: UTS OPUS or Publisher's site
A graph models complex structural relationships among objects, and has been prevalently used in a wide range of applications. Building an automated graph classification model becomes very important for predicting unknown graphs or understanding complex structures between different classes. The graph classification framework being widely used consists of two steps, namely, feature selection and classification. The key issue is how to select important subgraph features from a graph database with a large number of graphs including positive graphs and negative graphs. Given the features selected, a generic classification approach can be used to build a classification model. In this paper, we focus on feature selection. We identify two main issues with the most widely used feature selection approach which is based on a discriminative score to select frequent subgraph features, and introduce a new diversified discriminative score to select features that have a higher diversity. We analyze the properties of the newly proposed diversified discriminative score, and conducted extensive performance studies to demonstrate that such a diversified discriminative score makes positive/negative graphs separable and leads to a higher classification accuracy.
Qin, L., Yu, J.X. & Chang, L. 2012, 'Diversifying Top-K Results', Proceedings of the VLDB Endowment, VLDB Endowment, ACM, Istanbul, Turkey, pp. 1124-1135.
View/Download from: UTS OPUS or Publisher's site
Top-k query processing finds a list of k results that have the largest scores w.r.t. the user-given query, with the assumption that all the k results are independent of each other. In practice, some of the top-k results returned can be very similar to each other, and as a result some of them are redundant. In the literature, diversified top-k search has been studied to return k results that take both score and diversity into consideration. Most existing solutions for diversified top-k search assume that the scores of all search results are given, and some works solve the diversity problem for a specific problem and can hardly be extended to general cases. In this paper, we study the diversified top-k search problem. We define a general diversified top-k search problem that only considers the similarity of the search results themselves. We propose a framework such that most existing solutions for top-k query processing can be extended easily to handle diversified top-k search, by simply applying three new functions: a sufficient stop condition sufficient(), a necessary stop condition necessary(), and an algorithm for diversified top-k search on the current set of generated results, div-search-current(). We propose three new algorithms, namely, div-astar, div-dp, and div-cut, to solve the div-search-current() problem. div-astar is an A*-based algorithm; div-dp is an algorithm that decomposes the results into components which are searched using div-astar independently and combined using dynamic programming; div-cut further decomposes the current set of generated results using cut points and combines the results using sophisticated operations. We conducted extensive performance studies using two real datasets, enwiki and reuters. Our div-cut algorithm finds the optimal solution for the diversified top-k search problem in seconds even for k as large as 2,000.
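A simple greedy pass conveys the objective: take results in score order and keep one only if it is dissimilar to everything already kept. The paper's div-astar/div-dp/div-cut search for an optimal set instead of settling for this greedy one; the candidate list and similarity predicate below are assumptions of the sketch:

```python
def greedy_diversified_topk(candidates, k, similar):
    """Greedy illustration of diversified top-k: scan (score, item)
    pairs by descending score and keep an item only if it is not
    similar to anything kept so far. `similar(a, b)` is a
    user-supplied predicate."""
    kept = []
    for score, item in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(not similar(item, other) for _, other in kept):
            kept.append((score, item))
            if len(kept) == k:
                break
    return kept
```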
Qin, L., Yu, J.X. & Chang, L. 2011, 'Computing Structural Statistics by Keywords in Databases', Proceedings of the 27th International Conference on Data Engineering, International Conference on Data Engineering, IEEE, Hannover, Germany, pp. 363-374.
View/Download from: UTS OPUS or Publisher's site
Keyword search in RDBs has been extensively studied in recent years. The existing studies focused on finding all or top-k interconnected tuple-structures that contain keywords. In reality, the number of such interconnected tuple-structures for a keyword query can be large. It becomes very difficult for users to obtain any valuable information more than individual interconnected tuple-structures. Also, it becomes challenging to provide a similar mechanism like group-&-aggregate for those interconnected tuple-structures. In this paper, we study computing structural statistics keyword queries by extending the group-&-aggregate framework. We consider an RDB as a large directed graph where nodes represent tuples, and edges represent the links among tuples. Instead of using tuples as a member in a group to be grouped, we consider rooted subgraphs. Such a rooted subgraph represents an interconnected tuple-structure among tuples and some of the tuples contain keywords. The dimensions of the rooted subgraphs are determined by dimensional-keywords in a data driven fashion. Two rooted subgraphs are grouped into the same group if they are isomorphic based on the dimensions or in other words the dimensional-keywords. The scores of the rooted subgraphs are computed by a user-given score function if the rooted subgraphs contain some of general keywords. Here, the general keywords are used to compute scores rather than determining dimensions. The aggregates are computed using an SQL aggregate function for every group based on the scores computed. We give our motivation using a real dataset. We propose new approaches to compute structural statistics keyword queries, perform extensive performance studies using two large real datasets and a large synthetic dataset, and confirm the effectiveness and efficiency of our approach.
Chang, L., Yu, J.X., Qin, L., Zhu, Y. & Wang, H. 2011, 'Finding Information Nebula over Large Networks', Proceedings of 2011 ACM International Conference on Information and Knowledge Management, ACM, Glasgow, United Kingdom, pp. 1465-1474.
View/Download from: UTS OPUS or Publisher's site
Social and information networks have been extensively studied over the years. In this paper, we concentrate on a large information network that is composed of entities and relationships, where entities are associated with sets of keyword terms (kterms) to specify what they are, and relationships describe the link structure among entities, which can be very complex. Our work is motivated by, but different from, the existing works that find a best subgraph to describe how user-specified entities are connected. We compute an information nebula (cloud), which is a set of top-K kterms P that are most correlated to a set of user-specified kterms Q, over a large information network. Our goal is to find how kterms are correlated given the complex information network among entities. Computing an information nebula requires us to take all possible kterms into consideration for the top-K kterm selection, and to measure the similarity between kterms by considering all possible subgraphs that connect them instead of the best single one. In this work, we compute information nebulae using a global structural-context similarity, and our similarity measure is independent of connection subgraphs. To the best of our knowledge, among the link-based similarity methods, none of the existing works considers the similarity between two sets of nodes or two kterms. We propose new algorithms to find the top-K kterms P for a given set of kterms Q based on the global structural-context similarity, without computing all the similarity scores of kterms in the large information network. We performed extensive performance studies using large real datasets, and confirmed the effectiveness and efficiency of our approach.
Zhu, Y., Qin, L., Yu, J.X., Ke, Y. & Lin, X. 2011, 'High Efficiency and Quality: Large Graphs Matching', Proceedings of 2011 ACM International Conference on Information and Knowledge Management, ACM International Conference on Information and Knowledge Management, ACM, Glasgow, United Kingdom, pp. 1755-1764.
View/Download from: UTS OPUS or Publisher's site
Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. Exact matching algorithms cannot handle graphs with more than about 30 nodes, and existing approximate matching can be of very poor quality. We propose a novel two-step approach which can efficiently match two large graphs of thousands of nodes with high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We give the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We conducted extensive testing using real and synthetic datasets, and report our findings.
Chang, L., Yu, J.X., Qin, L. & Lin, X. 2010, 'Probabilistic Ranking over Relations', Proceedings of the 13th International Conference on Extending Database Technology, International Conference on Extending Database Technology, ACM, Lausanne, Switzerland, pp. 477-488.
View/Download from: UTS OPUS or Publisher's site
Probabilistic top-k ranking queries have been extensively studied due to the fact that data obtained can be uncertain in many real applications. A probabilistic top-k ranking query ranks objects by the interplay of score and probability, with an implicit assumption that both scores based on which objects are ranked and probabilities of the existence of the objects are stored in the same relation. We observe that in general scores and probabilities are highly possible to be stored in different relations, for example, in column-oriented DBMSs and in data warehouses. In this paper we study probabilistic top-k ranking queries when scores and probabilities are stored in different relations. We focus on reducing the join cost in probabilistic top-k ranking. We investigate two probabilistic score functions, discuss the upper/lower bounds in random access and sequential access, and provide insights on the advantages and disadvantages of random/sequential access in terms of upper/lower bounds. We also propose random, sequential, and hybrid algorithms to conduct probabilistic top-k ranking. We conducted extensive performance studies using real and synthetic datasets, and report our findings in this paper.
Qin, L., Yu, J.X. & Chang, L. 2010, 'Ten thousand SQLs: parallel keyword queries computing', Proceedings of the VLDB Endowment, Proceedings of the VLDB Endowment, ACM, Singapore, pp. 58-69.
View/Download from: UTS OPUS or Publisher's site
Keyword search in relational databases has been extensively studied. Given a relational database, a keyword query finds a set of interconnected tuple structures connected by foreign key references. On an RDBMS, a keyword query is processed in two steps, namely, candidate network (CN) generation and CN evaluation, where a CN is an SQL query. Commonly, a keyword query needs to be processed using over 10,000 SQL queries. There are several approaches to process a keyword query on an RDBMS, but there is a limit to the performance achievable on a uniprocessor architecture. In this paper, we study parallel computation of keyword queries on a multicore architecture. We make three observations on keyword query computation: a large number of SQL queries need to be processed, there is high sharing possibility among the SQL queries, and large intermediate results come with a small number of final results. All of these make parallel keyword query computation challenging. We investigate three approaches. We first study query-level parallelism, where each SQL query is processed by one core. We distribute the SQL queries to different cores based on three objectives: minimizing workload skew, minimizing inter-core sharing, and maximizing intra-core sharing. Such an approach has the potential risk of load unbalancing through accumulating errors of cost estimation. We then study operation-level parallelism, where each operation of an SQL query is processed by one core. All operations are processed in stages, where in each stage the costs of operations are re-estimated to reduce the accumulated error. Such operation-level parallelism still has drawbacks of workload skew when large operations are involved and a large number of cores are used. Finally, we propose a new algorithm that partitions relations adaptively in order to minimize the extra cost of partitioning and at the same time reduce workload skew.
Qin, L., Yu, J.X., Chang, L. & Tao, Y. 2009, 'Querying Communities in Relational Databases', Proceedings of the 25th International Conference on Data Engineering, IEEE, Shanghai, China, pp. 724-735.
View/Download from: UTS OPUS or Publisher's site
Keyword search on relational databases provides users with insights that they cannot easily obtain using traditional RDBMS techniques. Here, an l-keyword query is specified by a set of l keywords, {k1, k2, ..., kl}. It finds how the tuples that contain the keywords are connected in a relational database via possible foreign key references. Conceptually, it finds structural information in a database graph, where nodes are tuples and edges are foreign key references. Existing work studied how to find connected trees for an l-keyword query. However, a tree may only show partial information about how the tuples that contain the keywords are connected. In this paper, we focus on finding communities for an l-keyword query. A community is an induced subgraph that contains all l keywords within a given distance. We propose new efficient algorithms, which consume little memory, to find all/top-k communities for an l-keyword query. For top-k l-keyword queries, our algorithm allows users to interactively enlarge k at run time. We conducted extensive performance studies using two large real datasets to confirm the efficiency of our algorithms.
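To make the distance condition concrete, the sketch below finds candidate center nodes that lie within a given radius of at least one tuple-node per keyword, using one multi-source BFS per keyword. It is a minimal unweighted sketch of the distance test only; the paper's algorithms additionally enumerate the induced community subgraphs with small memory.

```python
from collections import deque

def multi_source_bfs(adj, sources):
    """Hop distance from the nearest source to every reachable node."""
    dist = {s: 0 for s in sources}
    dq = deque(sources)
    while dq:
        u = dq.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                dq.append(v)
    return dist

def community_centers(adj, keyword_nodes, radius):
    """Nodes within `radius` of at least one node of every keyword.
    adj: dict node -> list of neighbors; keyword_nodes: one node list
    per keyword (hypothetical tuple graph)."""
    dists = [multi_source_bfs(adj, nodes) for nodes in keyword_nodes]
    return [v for v in adj
            if all(v in d and d[v] <= radius for d in dists)]
```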
Chang, L., Yu, J.X. & Qin, L. 2009, 'Query Ranking in Probabilistic XML Data', Proceedings of the 12th International Conference on Extending Database Technology, International Conference on Extending Database Technology, ACM, Saint-Petersburg, Russia, pp. 156-167.
View/Download from: UTS OPUS or Publisher's site
Twig queries have been extensively studied as a major fragment of XPath queries over XML data. In this paper, we study the PXML-RANK query, (Q, k), which ranks the top-k probabilities of the answers of a twig query Q in probabilistic XML (PXML) data. A new research issue is how to compute the top-k probabilities of the answers of a twig query Q in PXML in the presence of containment (ancestor/descendant) relationships. In their presence, the existing dynamic programming approaches to ranking top-k probabilities over a set of tuples cannot be directly applied, because any node/edge in PXML may affect the top-k probabilities of answers. We propose new algorithms to compute PXML-RANK queries efficiently and give conditions under which a PXML-RANK query can be processed without enumerating all possible worlds. We conduct extensive performance studies using both real and large benchmark datasets, and confirm the efficiency of our algorithms.
Qin, L., Yu, J.X. & Chang, L. 2009, 'Keyword Search in Databases: The Power of RDBMS', Proceedings of ACM Conference on Management of Data, ACM, Providence, Rhode Island, USA, pp. 681-694.
View/Download from: UTS OPUS or Publisher's site
Keyword search in relational databases (RDBs) has been extensively studied recently. A keyword search (or a keyword query) in RDBs is specified by a set of keywords to explore the interconnected tuple structures in an RDB that cannot be easily identified using SQL on an RDBMS. In brief, it finds how the tuples containing the given keywords are connected via sequences of connections (foreign key references) among tuples in an RDB. Such interconnected tuple structures can be found as connected trees up to a certain size, sets of tuples that are reachable from a root tuple within a radius, or even multi-center subgraphs within a radius. In the literature, there are two main approaches. One is to generate a set of relational algebra expressions and evaluate each such expression using SQL on an RDBMS, either directly or indirectly in a middleware on top of an RDBMS. Because a large number of relational algebra expressions need to be processed, most of the existing works take a middleware approach without fully utilizing the RDBMS. The other is to materialize an RDB as a graph and find the interconnected tuple structures using graph-based algorithms in memory.
Chang, L., Yu, J.X. & Qin, L. 2009, 'Context-sensitive document ranking', International Conference on Information and Knowledge Management, Proceedings, pp. 1533-1536.
View/Download from: Publisher's site
Ranking is a main research issue in IR-styled keyword search over a set of documents. In this paper, we study a new keyword search problem, called context-sensitive document ranking, which is to rank documents with an additional context that provides additional information about the application domain where the documents are to be searched and ranked. The work is motivated by the fact that additional information associated with the documents can assist users to find more relevant documents when they are unable to find the needed documents from the documents alone. In this paper, a context is a multi-attribute graph, which can represent any information maintained in a relational database. Context-sensitive ranking involves several research issues: how to score documents, how to evaluate the additional information obtained in the context that may contribute to the document ranking, and how to rank the documents by combining the scores/costs from the documents and the context. More importantly, the relationships between documents and the information stored in a relational database may be uncertain, because they come from different data sources and the relationships are determined systematically using similarity matching, which causes uncertainty. In this paper, we concentrate on these research issues and provide our solution on how to rank documents in a context where uncertainty exists between the documents and the context. We confirm the effectiveness of our approaches by conducting extensive experimental studies using real datasets. Copyright 2009 ACM.
Qin, L., Yu, J.X., Chang, L. & Tao, Y. 2009, 'Scalable keyword search on large data streams', Proceedings - International Conference on Data Engineering, pp. 1199-1202.
View/Download from: Publisher's site
It is widely realized that the integration of information retrieval (IR) and database (DB) techniques provides users with a broad range of high quality services. A new challenging issue along the same direction is IR-styled m-keyword query processing in an RDBMS framework over an open-ended relational data stream. The capability of supporting m-keyword queries over a relational data stream makes it possible for users to monitor events that are implicitly interrelated over a relational data stream in a timely manner. In brief, the problem is to find, for an m-keyword query {k1, k2, ..., km} over a relational data stream on a database schema GS, all connected trees whose size is less than or equal to a user-given threshold in terms of the number of nodes. The difficulty of the problem is related to the number of costly joins to be processed over time, which is affected by parameters such as the number of keywords (m), the maximum size of connected trees (Tmax), and the complexity of the database schema when it is viewed as a schema graph (GS). In this paper, we propose a new demand-driven approach to process such a query over a high speed data stream. We show that we can significantly reduce the number of intermediate results when processing joins over a data stream, and therefore can achieve high efficiency. © 2009 IEEE.
Ding, B., Yu, J.X. & Qin, L. 2008, 'Finding Time-Dependent Shortest Paths over Large Graphs', Proceedings of the 11th International Conference on Extending Database Technology, International Conference on Extending Database Technology, ACM, Nantes, France, pp. 205-216.
View/Download from: UTS OPUS or Publisher's site
Spatial and temporal databases have been studied widely and intensively over the years. In this paper, we study how to answer queries that find the best departure time minimizing the total travel time from one place to another over a road network, where traffic conditions change dynamically over time. We study a generalized form of this problem, called the time-dependent shortest-path problem. A time-dependent graph GT is a graph with an edge-delay function, wi,j(t), associated with each edge (vi, vj), to be stored in a database. The edge-delay function wi,j(t) specifies how much time it takes to travel from node vi to node vj when departing from vi at time t. A user-specified query asks for the minimum-travel-time path from a source node vs to a destination node ve over the time-dependent graph GT, with the best departure time selected from a time interval T. We denote this user query as LTT(vs, ve, T) over GT. The challenge of this problem is the added complexity due to the time dependency: edge delays are not constants and can vary from time to time. In this paper, we propose a novel algorithm to find the minimum-travel-time path with the best departure time for an LTT(vs, ve, T) query over a large graph GT. Our approach outperforms existing algorithms in terms of both time complexity in theory and efficiency in practice. We discuss the design of our algorithm, together with its correctness and complexity, and report findings from extensive experimental studies over large graphs.
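For a fixed departure time, the FIFO variant of this problem can be answered by a Dijkstra-style search in which each edge's delay is evaluated at the arrival time at its tail; scanning candidate departure times in T then gives a naive baseline for LTT. The sketch below assumes FIFO edge-delay functions and a discrete set of candidate departure times; the paper's algorithm is precisely about avoiding this per-departure-time recomputation.

```python
import heapq

def td_dijkstra(adj, src, dst, t0):
    """Earliest arrival at dst when departing src at time t0, assuming
    FIFO delays (waiting never helps). adj[u] = [(v, delay_fn)], where
    delay_fn(t) is the travel time when departing u at time t."""
    best = {src: t0}
    pq = [(t0, src)]
    while pq:
        t, u = heapq.heappop(pq)
        if u == dst:
            return t
        if t > best.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in adj[u]:
            arr = t + w(t)
            if arr < best.get(v, float('inf')):
                best[v] = arr
                heapq.heappush(pq, (arr, v))
    return float('inf')

def best_departure(adj, src, dst, times):
    """Naive LTT baseline: try each candidate departure time in T and
    return (best departure time, minimum travel time)."""
    return min(((t, td_dijkstra(adj, src, dst, t) - t) for t in times),
               key=lambda p: p[1])
```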
Ding, B., Yu, J.X., Wang, S., Qin, L., Zhang, X. & Lin, X. 2007, 'Finding Top-k Min-Cost Connected Trees in Databases', Proceedings of the 23rd International Conference on Data Engineering, IEEE, Istanbul, Turkey, pp. 836-845.
View/Download from: UTS OPUS or Publisher's site
It is widely realized that the integration of database and information retrieval techniques will provide users with a wide range of high quality services. In this paper, we study processing an l-keyword query, p1, p2, ..., pl, against a relational database which can be modeled as a weighted graph, G(V, E). Here V is a set of nodes (tuples) and E is a set of edges representing foreign key references between tuples. Let Vi ⊆ V be the set of nodes that contain the keyword pi. We study finding top-k minimum cost connected trees that contain at least one node in every subset Vi, and denote our problem as GST-k. When k = 1, it is known as the minimum cost group Steiner tree problem, which is NP-complete. We observe that the number of keywords, l, is small, and propose a novel parameterized solution, with l as a parameter, that finds the optimal GST-1 in O(3^l n + 2^l ((l + log n) n + m)) time, where n and m are the numbers of nodes and edges in graph G. Our solution can handle graphs with a large number of nodes. Our GST-1 solution can be easily extended to support GST-k, and it outperforms the existing GST-k solutions over both weighted undirected and directed graphs. We conducted extensive experimental studies, and report our findings.
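A minimal sketch of the parameterized dynamic program over keyword subsets: a state (v, S) records the cheapest tree rooted at v covering keyword subset S, grown either along an edge or by merging two trees at the same root, and explored best-first. The graph encoding and variable names are illustrative, not the paper's implementation.

```python
import heapq

def gst1(adj, keyword_sets):
    """Best-first DP for a min-cost tree connecting at least one node of
    each keyword set. adj: dict node -> [(neighbor, weight)] (undirected,
    nonnegative weights); keyword_sets: list of node collections."""
    l = len(keyword_sets)
    full = (1 << l) - 1
    cost, pq = {}, []
    for i, nodes in enumerate(keyword_sets):
        for v in nodes:                      # a keyword node alone covers {i}
            cost[(v, 1 << i)] = 0
            heapq.heappush(pq, (0, v, 1 << i))
    while pq:
        c, v, S = heapq.heappop(pq)
        if c > cost.get((v, S), float('inf')):
            continue                         # stale entry
        if S == full:
            return c                         # optimal GST-1 cost
        for u, w in adj[v]:                  # grow the tree by one edge
            if c + w < cost.get((u, S), float('inf')):
                cost[(u, S)] = c + w
                heapq.heappush(pq, (c + w, u, S))
        comp = full ^ S                      # merge with a tree rooted at v
        sub = comp                           # enumerate nonempty subsets
        while sub:
            other = cost.get((v, sub))
            if other is not None and c + other < cost.get((v, S | sub), float('inf')):
                cost[(v, S | sub)] = c + other
                heapq.heappush(pq, (c + other, v, S | sub))
            sub = (sub - 1) & comp
    return float('inf')
```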
Wang, S., Peng, Z., Zhang, J., Qin, L., Wang, S., Yu, J.X. & Ding, B. 2006, 'NUITS: A Novel User Interface for Efficient Keyword Search over Databases.', VLDB, ACM, pp. 1143-1146.

Journal articles

Wen, D., Qin, L., Zhang, Y., Lin, X. & Yu, J.X. 2016, 'I/O Efficient Core Graph Decomposition at Web Scale.', CoRR, vol. abs/1511.00367.
View/Download from: UTS OPUS
Yuan, L., Qin, L., Lin, X., Chang, L. & Zhang, W. 2016, 'Diversified top-k clique search', The VLDB Journal, vol. 25, no. 2, pp. 171-196.
View/Download from: UTS OPUS or Publisher's site
© 2015 Springer-Verlag Berlin Heidelberg. Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns an enormous number of cliques with large overlaps. Motivated by this, in this paper we study the diversified top-k clique search problem, which is to find top-k cliques that cover the most nodes in the graph. Diversified top-k clique search can be widely used in many applications, including community search, motif discovery, and anomaly detection in large graphs. A naive solution is to keep all maximal cliques in memory and then find k of them that cover the most nodes by the approximate greedy max k-cover algorithm; however, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates during maximal clique enumeration. Our algorithm has a limited memory footprint and achieves a guaranteed approximation ratio. We also introduce a novel light-weight data structure, based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. Moreover, for massive input graphs, we develop an I/O efficient algorithm to tackle the problem when the input graph cannot fit in main memory. We conduct extensive performance studies on real and synthetic graphs, one of which contains 1.02 billion edges. The results demonstrate the high efficiency and effectiveness of our approach.
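The naive baseline mentioned above is easy to state: given all maximal cliques materialized, greedy max k-cover repeatedly picks the clique covering the most still-uncovered nodes, with the usual (1 - 1/e) approximation guarantee. A sketch over a hypothetical in-memory clique list follows; the paper's contribution is avoiding exactly this materialization.

```python
def greedy_k_cover(cliques, k):
    """Greedy max k-cover over an explicit clique list: pick, k times,
    the clique adding the most not-yet-covered nodes.
    cliques: list of node collections (hypothetical, fully materialized)."""
    covered, chosen = set(), []
    for _ in range(min(k, len(cliques))):
        best = max(cliques, key=lambda c: len(set(c) - covered))
        gain = set(best) - covered
        if not gain:
            break  # nothing left to cover
        chosen.append(best)
        covered |= gain
    return chosen, covered
```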
Li, R.H., Qin, L., Yu, J.X. & Mao, R. 2016, 'Optimal Multi-Meeting-Point Route Search', IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 770-784.
View/Download from: Publisher's site
© 1989-2012 IEEE. Real-time ride-sharing applications (e.g., Uber and Lyft) have become very popular in recent years. Motivated by ride-sharing applications, we propose a new type of query in road networks, called the optimal multi-meeting-point route (OMMPR) query. Given a road network G, a source node s, a target node t, and a set of query nodes U, the OMMPR query aims at finding the best route starting from s and ending at t such that the weighted average of the cost of the route and the total cost of the shortest paths from every query node to the route is minimized. We show that the problem of computing the OMMPR query is NP-hard. To answer the OMMPR query efficiently, we propose two novel parameterized solutions based on dynamic programming (DP), with the number of query nodes l (i.e., l = |U|), which is typically very small in practice, as the parameter. The two proposed parameterized algorithms run in O(3^l m + 2^l n (l + log n)) and O(2^l (m + n (l + log n))) time, respectively, where n and m denote the numbers of nodes and edges in graph G; thus they are tractable in practice. To reduce the search space of the DP-based algorithms, we propose two novel optimized algorithms based on bidirectional DP and a carefully-designed lower bounding technique. We conduct extensive experimental studies on four large real-world road networks, and the results demonstrate the efficiency of the proposed algorithms.
Zhang, S., Qin, L., Zheng, Y. & Cheng, H. 2016, 'Effective and efficient: Large-scale dynamic city express', IEEE Transactions on Knowledge and Data Engineering, vol. PP, no. 99.
View/Download from: Publisher's site
© 1989-2012 IEEE. Due to the large demand for city express services in recent years, the current city express system is unsatisfactory for both service providers and customers. In this paper, we are the first to systematically study the large-scale dynamic city express problem. We aim to increase both the effectiveness and the efficiency of the scheduling algorithm. The challenges of the problem stem from the highly dynamic environment, the NP-completeness with respect to the number of requests, and the real-time demands on the scheduling result. We introduce a basic algorithm that assigns a request to a courier on a first-come, first-served basis. To improve the effectiveness of the basic algorithm, we adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period rather than dealing with each request individually. To improve the efficiency of the algorithm, we further design a two-level priority queue structure to reduce redundant shortest-distance calculation and repeated candidate generation. We develop a simulation system and conduct extensive performance studies on the real road network of Beijing. The experimental results demonstrate the high effectiveness and efficiency of our algorithms. Remarkably, our system can achieve much better service quality while largely reducing the operation cost of a city express company.
Zhang, Z., Yu, J.X., Qin, L., Chang, L. & Lin, X. 2015, 'I/O efficient: computing SCCs in massive graphs', VLDB Journal, vol. 24, no. 2, pp. 245-270.
View/Download from: UTS OPUS or Publisher's site
A strongly connected component (SCC) is a maximal subgraph of a directed graph G in which every pair of nodes is reachable from each other. With such a property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting every SCC of G to a node in the DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all SCCs in a directed graph G is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all SCCs in linear time with respect to the size of the graph. However, when a graph cannot reside entirely in main memory, the existing external or semi-external algorithms to find all SCCs are limited in the I/O efficiency they can achieve. In this paper, we study new I/O-efficient semi-external algorithms to find all SCCs for a massive directed graph G that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS-based semi-external algorithm that heavily relies on a total order, we explore a weak order, based on which we investigate new algorithms. We propose a new two-phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of G can be constructed in a bounded number of sequential scans of G. In the tree search phase, it needs to sequentially scan the graph once to find all SCCs. In addition, we propose a new single-phase algorithm, which combines the tree construction and tree search phases into a single phase, with three new optimization techniques. They are early acceptance, early rejection, ...
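For reference, the in-memory linear-time DFS baseline that the semi-external algorithms compete with can be sketched as Kosaraju's two-pass algorithm. It assumes the whole graph (here a dict of adjacency lists covering all nodes) fits in memory, which is exactly the assumption the paper drops.

```python
from collections import defaultdict

def kosaraju_sccs(adj):
    """Classic two-pass DFS baseline for SCCs. adj: dict mapping every
    node to its list of out-neighbors (graph assumed to fit in memory)."""
    order, seen = [], set()
    for start in adj:                       # pass 1: record finish order
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(adj[start]))]
        while stack:
            node, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj[v])))
                    break
            else:
                order.append(node)          # node finished
                stack.pop()
    radj = defaultdict(list)                # reverse all edges
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)
    comp, sccs = {}, []
    for u in reversed(order):               # pass 2: sweep reverse graph
        if u in comp:
            continue
        scc, stack = [], [u]
        comp[u] = len(sccs)
        while stack:
            x = stack.pop()
            scc.append(x)
            for y in radj[x]:
                if y not in comp:
                    comp[y] = len(sccs)
                    stack.append(y)
        sccs.append(scc)
    return sccs
```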
Huang, X., Cheng, H., Li, R.-.H., Qin, L. & Yu, J.X. 2015, 'Top-K structural diversity search in large networks.', The VLDB Journal, vol. 24, pp. 319-343.
View/Download from: UTS OPUS or Publisher's site
Social contagion depicts a process of information (e.g., fads, opinions, news) diffusion in online social networks. A recent study reports that in a social contagion process, the probability of contagion is tightly controlled by the number of connected components in an individual's neighborhood. Such a number is termed the structural diversity of an individual, and it is shown to be a key predictor in the social contagion process. Based on this, a fundamental issue in a social network is to find the top-k users with the highest structural diversities. In this paper, we, for the first time, study the top-k structural diversity search problem in a large network. Specifically, we study two types of structural diversity measures, namely, the component-based structural diversity measure and the core-based structural diversity measure. For component-based structural diversity, we develop an effective upper bound on structural diversity for pruning the search space. The upper bound can be incrementally refined in the search process. Based on this upper bound, we propose an efficient framework for top-k structural diversity search. To further speed up structural diversity evaluation in the search process, several carefully devised search strategies are proposed. We also design efficient techniques to handle frequent updates in dynamic networks and maintain the top-k results. We further show how the techniques proposed for the component-based structural diversity measure can be extended to handle the core-based structural diversity measure. Extensive experimental studies are conducted on real-world large networks and synthetic graphs, and the results demonstrate the efficiency and effectiveness of the proposed methods.
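The component-based measure has a direct formulation: the structural diversity of v is the number of connected components of the subgraph induced by v's neighborhood. A small union-find sketch follows, with the graph assumed in memory as a dict of adjacency lists.

```python
def structural_diversity(adj, v):
    """Component-based structural diversity of v: the number of connected
    components in the subgraph induced by v's neighbors (union-find)."""
    nbrs = set(adj[v])
    parent = {u: u for u in nbrs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u in nbrs:                          # union endpoints of edges
        for w in adj[u]:                    # that stay inside the neighborhood
            if w in nbrs:
                ru, rw = find(u), find(w)
                if ru != rw:
                    parent[ru] = rw
    return len({find(u) for u in nbrs})
```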
Li, Z., Qin, L., Cheng, H., Zhang, X. & Zhou, X. 2015, 'TRIP: An Interactive Retrieving-Inferring Data Imputation Approach.', IEEE Transactions on Knowledge and Data Engineering, vol. 27, pp. 2550-2563.
View/Download from: UTS OPUS or Publisher's site
Data imputation aims at filling in missing attribute values in databases. Most existing imputation methods for string attribute values are inferring-based approaches, which usually fail to reach a high imputation recall by just inferring missing values from the complete part of the data set. Recently, some retrieving-based methods have been proposed to retrieve missing values from external resources such as the World Wide Web; these tend to reach a much higher imputation recall, but inevitably bring a large overhead by issuing a large number of search queries. In this paper, we investigate the interaction between the inferring-based methods and the retrieving-based methods. We show that retrieving a small number of selected missing values can greatly improve the imputation recall of the inferring-based methods. With this intuition, we propose an inTeractive Retrieving-Inferring data imPutation approach (TRIP), which performs retrieving and inferring alternately when filling in missing attribute values in a data set. To ensure high recall at minimum cost, TRIP faces the challenge of selecting the least number of missing values for retrieving so as to maximize the number of inferable values. Our proposed solution is able to identify an optimal retrieving-inferring scheduling scheme in deterministic data imputation, and the optimality of the generated scheme is theoretically analyzed with proofs. We also show by example that the optimal scheme is not feasible in t-constrained stochastic data imputation (t-SDI), but our proposed solution still identifies an expected-optimal scheme in t-SDI. Extensive experiments on four data collections show that TRIP retrieves on average 20 percent of the missing values and achieves the same high recall reached by the retrieving-based approach.
Zhu, Y., Yu, J.X. & Qin, L. 2014, 'Leveraging graph dimensions in online graph search', Proceedings of the VLDB Endowment, vol. 8, no. 1, pp. 85-96.
View/Download from: UTS OPUS
Graphs have been widely used due to their expressive power to model complicated relationships. However, given a graph database DG = {g1, g2, ..., gn}, it is challenging to process graph queries, since a basic graph query usually involves costly graph operations such as maximum common subgraph and graph edit distance computation, which are NP-hard. In this paper, we study a novel DS-preserved mapping, which maps graphs in a graph database DG onto a multidimensional space MG under a structural dimension M using a mapping function φ(). The DS-preserved mapping preserves two things: distance and structure. By distance-preserving, we mean that any two graphs gi and gj in DG map to two data objects φ(gi) and φ(gj) in MG such that the distance d(φ(gi), φ(gj)) between φ(gi) and φ(gj) in MG approximates the graph dissimilarity δ(gi, gj) in DG. By structure-preserving, we further mean that for a given unseen query graph q, the distance between q and any graph gi in DG is preserved such that δ(q, gi) ≤ d(φ(q), φ(gi)). We discuss the rationale of using the graph dimension M for online graph processing, and show how to identify a small set of subgraphs to form M efficiently. We propose an iterative algorithm DSPM to compute the graph dimension, and discuss its optimization techniques. We also give an approximate algorithm DSPMap to handle a large graph database. We conduct extensive performance studies on both real and synthetic datasets to evaluate the top-k similarity query, which finds the top-k similar graphs in DG for a query graph, and show the effectiveness and efficiency of our approaches. © 2014 VLDB.
Chang, L., Yu, J.X. & Qin, L. 2013, 'Fast Maximal Cliques Enumeration in Sparse Graphs', Algorithmica, vol. 66, no. 1, pp. 173-186.
View/Download from: UTS OPUS or Publisher's site
In this paper, we consider the problem of generating all maximal cliques in a sparse graph with polynomial delay. Given a graph G=(V,E) with n vertices and m edges, the latest and fastest polynomial delay algorithm for sparse graphs enumerates all maximal cliques in O(Δ^4) time delay, where Δ is the maximum vertex degree. However, it requires O(nm) preprocessing time. We improve this in two aspects. First, our algorithm does not need preprocessing; therefore, it is a truly polynomial delay algorithm. Second, our algorithm enumerates all maximal cliques in O(Δ H^3) time delay, where H is the so-called H-value of a graph, i.e., the smallest integer satisfying |{v ∈ V | d(v) ≥ H}| ≤ H, given d(v) as the degree of a vertex. In real-world network data, H usually is a small value and much smaller than Δ.
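The H-value is an h-index-style quantity of the degree sequence and is cheap to compute. A sketch under the definition above, with the adjacency given as a dict of neighbor lists:

```python
def h_value(adj):
    """Smallest H with |{v : d(v) >= H}| <= H. With degrees sorted in
    decreasing order, degs[H] >= H iff more than H vertices have
    degree >= H, so we advance H until that fails."""
    degs = sorted((len(nbrs) for nbrs in adj.values()), reverse=True)
    H = 0
    while H < len(degs) and degs[H] >= H:
        H += 1
    return H

# e.g. a star K_{1,5} has degree sequence [5, 1, 1, 1, 1, 1] -> H = 2
```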
Zhu, Y., Qin, L., Yu, J.X., Ke, Y. & Lin, X. 2013, 'High Efficiency and Quality: Large Graphs Matching', VLDB Journal, vol. 22, no. 3, pp. 345-368.
View/Download from: UTS OPUS or Publisher's site
Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. Exact matching is infeasible for graphs with more than about 30 nodes, while approximate matching can be of very poor quality. We propose a novel two-step approach that can efficiently match two large graphs of thousands of nodes with high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We prove the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We further show how to extend our solution to handle labeled graphs. We conducted extensive testing using real and synthetic datasets and report our findings in this paper.
Qiao, M., Cheng, H., Qin, L., Yu, J.X., Yu, P. & Chang, L. 2013, 'Computing weight constraint reachability in large networks', VLDB Journal, vol. 22, no. 3, pp. 275-294.
View/Download from: UTS OPUS or Publisher's site
Reachability is a fundamental problem on the large-scale networks emerging nowadays in various application domains, such as social networks, communication networks, biological networks, and road networks, and it has been studied extensively. However, little existing work has studied reachability with realistic constraints imposed on graphs with real-valued edge or node weights. In fact, such weights are very common in many real-world networks, for example, the bandwidth of a link in communication networks, the reliability of an interaction between two proteins in PPI networks, and the handling capacity of a warehouse/storage point in a distribution network. In this paper, we formalize a new yet important reachability query in weighted undirected graphs, called the weight constraint reachability (WCR) query, which asks: is there a path between nodes a and b on which each real-valued edge (or node) weight satisfies a range constraint? We discover an interesting property of WCR, based on which we design a novel edge-based index structure to answer the WCR query in O(1) time. Furthermore, we consider the case when the index cannot entirely fit in memory, which can be very common for emerging massive networks. An I/O-efficient index is proposed, which provides constant (precisely, four) I/Os query time with an O(|V|log|V|) disk-based index size. Extensive experimental studies on both real and synthetic datasets demonstrate the efficiency and scalability of our solutions in answering the WCR query.
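The query semantics admits a one-line baseline: a BFS restricted to edges whose weights fall in the query range. The sketch below is that linear-time baseline for the edge-weight variant, with adjacency as lists of (neighbor, weight); the point of the paper's index is to answer the same question in constant time instead.

```python
from collections import deque

def wcr_naive(adj, a, b, lo, hi):
    """Naive weight-constraint reachability: BFS using only edges whose
    weight lies in [lo, hi]. adj: dict node -> [(neighbor, weight)]."""
    if a == b:
        return True
    seen, dq = {a}, deque([a])
    while dq:
        u = dq.popleft()
        for v, w in adj[u]:
            if lo <= w <= hi and v not in seen:
                if v == b:
                    return True
                seen.add(v)
                dq.append(v)
    return False
```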
Chang, L., Yu, J.X., Qin, L., Cheng, H. & Qiao, M. 2012, 'The Exact Distance to Destination in Undirected World', VLDB Journal, vol. 21, no. 6, pp. 869-888.
View/Download from: UTS OPUS or Publisher's site
Shortest distance queries are essential not only in graph analysis and graph mining tasks but also in database applications, when a large graph needs to be dealt with. Such shortest distance queries are frequently issued by end-users or requested as a subroutine in real applications. For intensive queries on large graphs, it is impractical to compute shortest distances on-line from scratch, and impractical to materialize all-pairs shortest distances. In the literature, 2-hop distance labeling is proposed to index the all-pairs shortest distances. It assigns distance labels to vertices in a large graph in a pre-computing step off-line and then answers shortest distance queries on-line by making use of such distance labels, which avoids exhaustively traversing the large graph when answering queries. However, the existing algorithms to generate 2-hop distance labels are not scalable to large graphs. Finding an optimal 2-hop distance labeling is NP-hard, and heuristic algorithms may generate large size distance labels while still needing to pre-compute all-pairs shortest paths. In this paper, we propose a multi-hop distance labeling approach, which generates a subset of the 2-hop distance labels as index off-line. We can compute the multi-hop distance labels efficiently by avoiding pre-computing all-pairs shortest paths. In addition, our multi-hop distance labeling is small in size to be stored. To answer a shortest distance query between two vertices, we first generate the query-specific small set of 2-hop distance labels for the two vertices based on our multi-hop distance labels stored and compute the shortest distance between the two vertices based on the 2-hop distance labels generated on-line. We conducted extensive performance studies on large real graphs and confirmed the efficiency of our multi-hop distance labeling scheme.
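Once 2-hop distance labels are available, query answering is a merge over common hubs, which is what makes the labeling attractive despite its construction cost. A minimal sketch, assuming each label is a dict from hub vertex to distance:

```python
def two_hop_distance(label_u, label_v):
    """2-hop labeling query: dist(u, v) = min over hubs h appearing in
    both labels of d(u, h) + d(h, v); inf if the labels share no hub."""
    common = label_u.keys() & label_v.keys()
    return min((label_u[h] + label_v[h] for h in common),
               default=float('inf'))

# e.g. two_hop_distance({'a': 2, 'b': 5}, {'b': 1, 'c': 4}) -> 6, via hub 'b'
```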
Qin, L., Yu, J.X. & Chang, L. 2012, 'Computing Structural Statistics by Keywords in Databases', IEEE Transactions On Knowledge And Data Engineering, vol. 24, no. 10, pp. 1731-1746.
View/Download from: UTS OPUS or Publisher's site
Keyword search in RDBs has been extensively studied in recent years. The existing studies focused on finding all or top-k interconnected tuple-structures that contain keywords. In reality, the number of such interconnected tuple-structures for a keyword query can be large, and it becomes very difficult for users to obtain any valuable information beyond individual interconnected tuple-structures. It also becomes challenging to provide a mechanism similar to group-&-aggregate for those interconnected tuple-structures. In this paper, we study computing structural statistics keyword queries by extending the group-&-aggregate framework. We consider an RDB as a large directed graph where nodes represent tuples and edges represent the links among tuples. Instead of using tuples as group members, we consider rooted subgraphs. Such a rooted subgraph represents an interconnected tuple-structure among tuples, some of which contain keywords. The dimensions of the rooted subgraphs are determined by dimensional keywords in a data-driven fashion. Two rooted subgraphs are grouped into the same group if they are isomorphic based on the dimensions, or in other words the dimensional keywords. The scores of the rooted subgraphs are computed by a user-given score function if the rooted subgraphs contain some of the general keywords; here, the general keywords are used to compute scores rather than to determine dimensions. The aggregates are computed using an SQL aggregate function for every group based on the computed scores. We give our motivation using a real dataset. We propose new approaches to compute structural statistics keyword queries, perform extensive performance studies using two large real datasets and a large synthetic dataset, and confirm the effectiveness and efficiency of our approach.
Qin, L., Yu, J.X. & Chang, L. 2012, 'Diversifying Top-K Results.', PVLDB, vol. 5, pp. 1124-1135.
Qin, L., Yu, J.X. & Chang, L. 2011, 'Scalable Keyword Search on Large Data Streams', VLDB Journal, vol. 20, no. 1, pp. 35-57.
View/Download from: UTS OPUS or Publisher's site
It is widely recognized that the integration of information retrieval (IR) and database (DB) techniques provides users with a broad range of high quality services. Along this direction, IR-styled m-keyword query processing over a relational database in an RDBMS framework has been well studied. It finds all hidden interconnected tuple structures, for example, connected trees that contain keywords and are interconnected by sequences of primary/foreign key relationships among tuples. A new challenging issue is how to monitor events that are implicitly interrelated over an open-ended relational data stream for a user-given m-keyword query. Such a relational data stream is a sequence of tuple insertion/deletion operations. The difficulty of the problem is related to the number of costly joins to be processed over time when tuples are inserted and/or deleted. Such cost is mainly affected by three parameters, namely, the number of keywords, the maximum size of interconnected tuple structures, and the complexity of the database schema when it is viewed as a schema graph. In this paper, we propose new approaches. First, we propose a novel algorithm to efficiently determine all the joins that need to be processed for answering an m-keyword query. Second, we propose a new demand-driven approach to process such a query over a high speed relational data stream. We show that we can achieve high efficiency by significantly reducing the number of intermediate results when processing joins over a relational data stream. The proposed new techniques allow us to achieve high scalability in terms of both query plan generation and query plan execution. We conducted extensive experimental studies using synthetic data and real data to simulate a relational data stream. Our approach significantly outperforms existing algorithms.
Chang, L., Yu, J.X. & Qin, L. 2010, 'Context-Sensitive Document Ranking', Journal Of Computer Science And Technology, vol. 25, no. 3, pp. 444-457.
View/Download from: UTS OPUS or Publisher's site
Ranking is a main research issue in IR-styled keyword search over a set of documents. In this paper, we study a new keyword search problem, called context-sensitive document ranking, which is to rank documents with an additional context that provides additional information about the application domain where the documents are to be searched and ranked. The work is motivated by the fact that additional information associated with the documents can assist users to find more relevant documents when they are unable to find the needed documents from the documents alone. In this paper, a context is a multi-attribute graph, which can represent any information maintained in a relational database, where multi-attribute nodes represent tuples and edges represent primary key and foreign key references among nodes. Context-sensitive ranking involves several research issues: how to score documents, how to evaluate the additional information obtained in the context that may contribute to the document ranking, and how to rank the documents by combining the scores/costs from the documents and the context. More importantly, the relationships between documents and the information stored in a relational database may be uncertain, because they come from different data sources and the relationships are determined systematically using similarity matching, which causes uncertainty. In this paper, we concentrate on these research issues and provide our solution on how to rank documents in a context where uncertainty exists between the documents and the context. We confirm the effectiveness of our approaches by conducting extensive experimental studies using real datasets. We present our findings in this paper.
Yu, J.X., Qin, L. & Chang, L. 2010, 'Keyword Search in Relational Databases: A Survey.', IEEE Data Eng. Bull., vol. 33, pp. 67-78.
Fan, W., Yu, J.X., Li, J., Ding, B. & Qin, L. 2009, 'Query Translation from XPath to SQL in the Presence of Recursive DTDs', The VLDB Journal, vol. 18, no. 4, pp. 857-883.
View/Download from: UTS OPUS or Publisher's site
We study the problem of evaluating XPath queries over XML data that is stored in an RDBMS via schema-based shredding. The interaction between recursion (descendants-axis) in XPath queries and recursion in DTDs makes it challenging to answer XPath queries using an RDBMS. We present a new approach to translating XPath queries into SQL queries based on a notion of extended XPath expressions and a simple least fixpoint (LFP) operator. Extended XPath expressions are a mild extension of XPath, and the LFP operator takes a single input relation and is already supported by most commercial RDBMSs. We show that extended XPath expressions are capable of capturing both DTD recursion and XPath queries in a uniform framework. Furthermore, they can be translated into an equivalent sequence of SQL queries with the LFP operator. We present algorithms for rewriting XPath queries over a (possibly recursive) DTD into extended XPath expressions and for translating extended XPath expressions to SQL queries, as well as optimization techniques. The novelty of our approach consists in its capability to answer a large class of XPath queries by means of only low-end RDBMS features already available in most RDBMSs, as well as its flexibility to accommodate existing relational query optimization techniques. In addition, these translation algorithms provide a solution to query answering for certain (possibly recursive) XML views of XML data. Our experimental results verify the effectiveness of our techniques.
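Today the least-fixpoint operator that the translation relies on is widely available as a recursive common table expression. As a rough modern analogue (not the paper's translation), the sketch below evaluates a descendants-axis step over a shredded, recursively nested table using SQLite's WITH RECURSIVE; the node table and its contents are hypothetical.

```python
import sqlite3

# Hypothetical shredded XML: each element row stores its tag and parent id.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE node(id INTEGER PRIMARY KEY, tag TEXT, parent INTEGER);
    INSERT INTO node VALUES
        (1, 'dept',   NULL),
        (2, 'course', 1),
        (3, 'course', 2),   -- recursively nested course
        (4, 'title',  3);
""")

# Descendants of node 1, computed as a least fixpoint (recursive CTE).
rows = con.execute("""
    WITH RECURSIVE descendants(id) AS (
        SELECT id FROM node WHERE parent = 1
        UNION
        SELECT n.id FROM node n JOIN descendants d ON n.parent = d.id
    )
    SELECT id FROM descendants ORDER BY id;
""").fetchall()
print(rows)  # -> [(2,), (3,), (4,)]
```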
Chang, L., Yu, J.X. & Qin, L. 2009, 'Fast Probabilistic Ranking under x-Relation Model', CoRR, vol. abs/0906.4927.
Qin, L., Yu, J.X., Ding, B. & Ishikawa, Y. 2008, 'Monitoring Aggregate k-NN Objects in Road Networks', Lecture Notes in Computer Science, vol. 5069, no. 1, pp. 168-186.
View/Download from: UTS OPUS or Publisher's site
In recent years, there has been an increasing need to monitor k nearest neighbor (k-NN) objects in a road network. Existing solutions either monitor k-NN objects from a single query point over a road network, or compute the snapshot k-NN objects over a road network that minimize an aggregate distance function with respect to multiple query points. In this paper, we study a new problem: monitoring k-NN objects over a road network from multiple query points so as to minimize an aggregate distance function with respect to those query points. We call it a continuous aggregate k-NN (CANN) query. We propose a new approach that can significantly reduce the cost of computing network distances when monitoring aggregate k-NN objects on road networks. We conducted extensive experimental studies and confirmed the efficiency of our algorithms.
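The snapshot version of the problem has a simple baseline: run one shortest-path search per query point and rank candidate objects by the aggregate (e.g., sum or max) of their network distances. A sketch under an in-memory weighted adjacency list follows; the paper's contribution is the continuous monitoring that this baseline would recompute from scratch.

```python
import heapq

def dijkstra(adj, src):
    """Network distances from src. adj: dict node -> [(neighbor, weight)]."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue  # stale entry
        for v, w in adj[u]:
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def snapshot_aggregate_knn(adj, queries, objects, k, agg=sum):
    """Snapshot aggregate k-NN: rank objects by agg of their network
    distances from all query points (agg = sum or max)."""
    dists = [dijkstra(adj, q) for q in queries]
    scored = [(agg(d.get(o, float('inf')) for d in dists), o)
              for o in objects]
    return sorted(scored)[:k]
```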
Qin, L., Yu, J.X., Ding, B. & Wang, S. 2007, 'TwigList - Make Twig Pattern Matching Fast', Lecture Notes in Computer Science, vol. 4443, no. 1, pp. 850-862.
View/Download from: UTS OPUS or Publisher's site
The twig pattern matching problem has been widely studied in recent years. Given an XML tree t, a twig-pattern matching query, Q, represented as a query tree, is to find all occurrences of the twig pattern in t. Previous works such as HolisticTwig and TJFast decomposed the twig pattern into single paths from root to leaves, and merged all occurrences of such path-patterns to find the occurrences of the twig-pattern matching query Q. Their techniques can effectively prune impossible path-patterns to avoid producing a large amount of intermediate results, but they still need to merge path-patterns, which incurs high computational cost. Recently, Twig2Stack was proposed to overcome this problem using hierarchical stacks to further reduce the merging cost. However, due to the complex hierarchical stacks used, Twig2Stack may end up performing many random accesses in memory, and may need to load the whole XML tree into memory in the worst case. In this paper, we propose a new algorithm, called TwigList, which uses simple lists. Both the time and space complexity of our algorithm are linear with respect to the total number of pattern occurrences and the size of the XML tree. In addition, our algorithm can be easily modified into an external algorithm. We conducted extensive experimental studies using large benchmark and real datasets. Our algorithm significantly outperforms the state-of-the-art algorithm.
Peng, Z., Zhang, J., Wang, S. & Qin, L. 2006, 'TreeCluster: Clustering Results of Keyword Search over Databases', Lecture Notes in Computer Science, vol. 4016, pp. 385-396.
View/Download from: UTS OPUS or Publisher's site
A critical challenge in keyword search over relational databases (KSORD) is to improve its result presentation to facilitate users' quick browsing through search results. An effective method is to organize the results into clusters. However, traditional clustering methods are not applicable to KSORD search results. In this paper, we propose a novel clustering method named TreeCluster. In the first step, we use labels to represent schema information of each result tree and reformulate the clustering problem as a problem of judging whether labeled trees are isomorphic. In the second step, we rank user keywords according to their frequencies in the database, and further partition the large clusters based on keyword nodes. Furthermore, we give each cluster a readable description, and present the description and each result graphically to help users understand the results more easily. Experimental results verify our method's effectiveness and efficiency.