Bioinformatics (sequencing data compression)
The latest sequencing technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore machines, can generate long reads thousands of bases in length, much longer than the reads of hundreds of bases generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have recently been proposed that use low-error-rate Illumina reads to fix sequencing errors in the noisy long reads, with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search paths between pairs of solid k-mers iteratively with an increasing k-mer length. At the second level, we combine the results produced under different parameters at the first level. In particular, a multiple sequence alignment algorithm is used to align similar long reads, followed by a voting algorithm that determines the final base at each position of the reads. We compare Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio (by >1.3%) and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive with the current methods in terms of the number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
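The solid k-mer notion underlying the first level can be illustrated with a minimal sketch: count all k-mers in a set of short reads and keep those whose multiplicity reaches a threshold. The toy reads and the threshold value here are assumptions for illustration; Bicolor's actual pipeline additionally iterates over increasing k and searches graph paths between solid k-mer pairs.

```python
from collections import Counter

def solid_kmers(reads, k, min_count=2):
    """Count every k-mer in the short reads and keep those whose
    multiplicity reaches min_count -- the 'solid' k-mers that a
    de Bruijn graph-based corrector would trust."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return {km for km, c in counts.items() if c >= min_count}

# Toy short reads (hypothetical); real input would be Illumina reads.
reads = ["ACGTACGT", "CGTACGTA", "ACGTACGA"]
solid = solid_kmers(reads, k=4, min_count=2)
```

Only k-mers supported by at least two read occurrences survive, so rare sequencing errors (like the trailing `ACGA` above) are excluded from the graph.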
Liu, Y, Yu, Y, Dinger, ME & Li, J 2019, 'Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression', Bioinformatics.
Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these data sets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read-compression approaches. As there is usually no good prior knowledge of the reference sequence, the current focus is on the novel construction of de novo assembled contigs. Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroups those that share the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated on the longer contigs until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 data sets (13 RNA-seq data sets and 5 whole-genome sequencing data sets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the read-order-preserving mode. These excellent performances are mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs. Availability and Implementation: https://github.com/yuansliu/minicom. Supplementary Information: Supplementary data are available at Bioinformatics online.
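The minimizer-based grouping step can be sketched as follows. This simplified version takes the lexicographically smallest k-mer of the whole read as its minimizer and buckets reads by it; minicom's actual scheme uses windowed (w, k)-minimizers and large k, and the reads and k below are toy assumptions.

```python
def minimizer(seq, k):
    """Lexicographically smallest k-mer of seq (a whole-read
    simplification of the windowed (w, k)-minimizer)."""
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))

def group_by_minimizer(reads, k):
    """Bucket reads that share the same minimizer; each bucket is a
    candidate subgroup from which a contig would be built."""
    groups = {}
    for r in reads:
        groups.setdefault(minimizer(r, k), []).append(r)
    return groups

# Toy reads: the two reads containing 'ACG' fall into the same bucket.
groups = group_by_minimizer(["GGGTA", "ACGTT", "TACGT"], k=3)
```

Reads sharing a minimizer overlap with high probability, which is why bucketing by minimizer is a cheap proxy for overlap detection before contig construction.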
Liu, Y, Yu Zhang, L & Li, J 2019, 'Fast detection of maximal exact matches via fixed sampling of query k-mers and Bloom filtering of index k-mers', Bioinformatics (Oxford, England).
MOTIVATION: Detection of maximal exact matches (MEMs) between two long sequences is a fundamental problem in pairwise reference-query genome comparison. To efficiently compare ever-larger genomes, reducing the number of indexed k-mers as well as the number of query k-mers has been adopted as a mainstream approach, which saves computational resources by avoiding a significant number of unnecessary matches. RESULTS: Under this framework, we propose a new method to detect all MEMs from a pair of genomes. The method first performs a fixed sampling of k-mers on the query sequence and adds the selected k-mers to a Bloom filter. Then all the k-mers of the reference sequence are tested against the Bloom filter. If a k-mer passes the test, it is inserted into a hash table for indexing. Compared with existing methods, far fewer query k-mers are generated and far fewer k-mers are inserted into the index, avoiding unnecessary matches and leading to an efficient matching process and savings in memory usage. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. This performance is mainly attributed to the key novelty of our method: the fixed k-mer sampling is conducted on the query sequence, and the index k-mers are filtered from the reference sequence via a Bloom filter. AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/bfMEM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
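The sample-then-filter seeding idea can be sketched as below. The Bloom filter parameters, sampling step and toy sequences are assumptions for illustration; bfMEM's real implementation uses tuned filter sizes and then extends seeds into full MEMs.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: n_hashes bit positions per item,
    derived from seeded SHA-256 digests (illustrative, not fast)."""
    def __init__(self, size=1 << 16, n_hashes=3):
        self.size, self.n_hashes = size, n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def seed_matches(query, reference, k, step):
    """Fixed-step sampling of query k-mers into a Bloom filter,
    then index only the reference k-mers that pass the filter."""
    bf = BloomFilter()
    for i in range(0, len(query) - k + 1, step):
        bf.add(query[i:i + k])
    index = {}
    for i in range(len(reference) - k + 1):
        km = reference[i:i + k]
        if km in bf:  # rare false positives only add harmless entries
            index.setdefault(km, []).append(i)
    return index

index = seed_matches("ACGTACGT", "TTACGTAA", k=4, step=4)
```

Only reference k-mers that (probably) also occur among the sampled query k-mers enter the hash table, which is the source of the memory savings described above.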
Zhang, LY, Liu, Y, Wang, C, Zhou, J, Zhang, Y & Chen, G 2018, 'Improved known-plaintext attack to permutation-only multimedia ciphers', Information Sciences, vol. 430-431, pp. 228-239.
Permutation is a commonly used operation in many secure multimedia systems. However, it is fragile against cryptanalysis when used alone. For instance, it is well known that permutation-only multimedia encryption is insecure against the known-plaintext attack (KPA). There exist algorithms that can (partially) retrieve the secret permutation sequences in polynomial time using an amount of known plaintexts that is logarithmic in the number of elements to be permuted. But existing works fail to answer how many known plaintexts are needed to fully recover an underlying secret permutation sequence, and how to balance the storage cost and computational complexity of implementing the KPA. This paper addresses these two problems. With the new concept of composite representation, the underlying theoretical rules governing the KPA on a permutation-only cipher are revealed, and attractive algorithms outperforming the state-of-the-art methods in terms of computational complexity are developed. As a case study, experiments are performed on permutation-only image encryption to verify the theoretical analysis. The performance gap of the proposed KPA between artificial noise-like images, which perfectly fit the theoretical model, and the corresponding natural images is identified and analyzed. Finally, experimental results demonstrate the efficiency improvement of the new schemes over the existing ones.
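The core of a KPA on a permutation-only cipher can be sketched in a few lines: for each ciphertext position, intersect over all known pairs the set of plaintext positions holding the same value. The true source position always survives the intersection, and each extra pair shrinks the candidate sets. The toy permutation and values below are assumptions; the paper's composite-representation algorithms are far more storage- and time-efficient than this naive version.

```python
def candidate_maps(pairs):
    """Given known (plaintext, ciphertext) pairs encrypted under the
    same secret permutation, return for each ciphertext position j
    the surviving candidate source positions i."""
    n = len(pairs[0][0])
    cands = [set(range(n)) for _ in range(n)]
    for p, c in pairs:
        for j in range(n):
            # keep only positions whose plaintext value matches c[j]
            cands[j] &= {i for i in range(n) if p[i] == c[j]}
    return cands

# Toy secret permutation: ciphertext[j] = plaintext[perm[j]].
perm = [2, 0, 3, 1]
p1, p2 = [0, 1, 0, 1], [0, 0, 1, 1]
c1 = [p1[j] for j in perm]
c2 = [p2[j] for j in perm]
cands = candidate_maps([(p1, c1), (p2, c2)])
```

With these two known pairs the permutation is already pinned down uniquely; in general the number of pairs needed grows only logarithmically in the number of permuted elements, as the abstract notes.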
Zhang, LY, Liu, Y, Pareschi, F, Zhang, Y, Wong, KW, Rovatti, R & Setti, G 2018, 'On the security of a class of diffusion mechanisms for image encryption', IEEE Transactions on Cybernetics, vol. 48, no. 4, pp. 1163-1175.
The need for fast and strong image cryptosystems motivates researchers to develop new techniques to apply traditional cryptographic primitives in ways that exploit the intrinsic features of digital images. One of the most popular and mature techniques is the use of complex dynamic phenomena, including chaotic orbits and quantum walks, to generate the required key stream. In this paper, under the assumption of plaintext attacks, we investigate the security of a classic diffusion mechanism (and of its variants) used as the core cryptographic primitive in some image cryptosystems based on the aforementioned complex dynamic phenomena. We show theoretically that, regardless of the key schedule process, the data complexity for recovering each element of the equivalent secret key from these diffusion mechanisms is only O(1). The analysis is validated by means of numerical examples. Some additional cryptographic applications of this work are also discussed.
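The O(1) data complexity can be made concrete with a simplified XOR-only diffusion model (the actual mechanisms analyzed in the paper mix modular arithmetic with XOR, so this is an illustrative assumption, not the paper's exact construction): if c[i] = p[i] XOR k[i] XOR c[i-1], then a single known plaintext/ciphertext pair reveals the entire equivalent key stream by inverting each step.

```python
def xor_diffuse(plain, key, iv=0):
    """Simplified XOR diffusion: c[i] = p[i] ^ k[i] ^ c[i-1]."""
    out, prev = [], iv
    for p, k in zip(plain, key):
        prev = p ^ k ^ prev
        out.append(prev)
    return out

def recover_key(plain, cipher, iv=0):
    """One known pair suffices: k[i] = p[i] ^ c[i] ^ c[i-1],
    i.e., O(1) plaintexts per recovered key element."""
    key, prev = [], iv
    for p, c in zip(plain, cipher):
        key.append(p ^ c ^ prev)
        prev = c
    return key

key = [3, 7, 1, 9]              # hypothetical chaotic key stream
plain = [5, 2, 8, 4]
cipher = xor_diffuse(plain, key)
recovered = recover_key(plain, cipher)
```

Because each equivalent key element depends only on locally observable values, the key schedule generating the stream is irrelevant to the attack, matching the claim above.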
Liu, Y, Zeng, X, He, Z & Zou, Q 2017, 'Inferring MicroRNA-Disease Associations by Random Walk on a Heterogeneous Network with Multiple Data Sources', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 4, pp. 905-915.
Since the discovery of the regulatory function of microRNA (miRNA), increased attention has focused on identifying relationships between miRNAs and diseases. It has been suggested that computational methods are an efficient way to identify potential disease-related miRNAs for further confirmation by biological experiments. In this paper, we first highlight three limitations commonly associated with previous computational methods. To resolve these limitations, we established a disease similarity subnetwork and a miRNA similarity subnetwork by integrating multiple data sources, where the disease similarity is composed of disease semantic similarity and disease functional similarity, and the miRNA similarity is calculated using miRNA-target gene and miRNA-lncRNA (long non-coding RNA) associations. Then, a heterogeneous network was constructed by connecting the disease similarity subnetwork and the miRNA similarity subnetwork via the known miRNA-disease associations. We extended random walk with restart to predict miRNA-disease associations in the heterogeneous network. Leave-one-out cross-validation achieved an average area under the curve (AUC) of 0.8049 across 341 diseases and 476 miRNAs. For five-fold cross-validation, our method achieved AUCs from 0.7970 to 0.9249 for 15 human diseases. Case studies further demonstrated the feasibility of our method to discover potential miRNA-disease associations. An online prediction service is freely available at http://ifmda.aliapp.com.
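The random walk with restart (RWR) iteration at the heart of the method is p ← (1−r)·W·p + r·p0, where W is a column-normalised transition matrix, p0 is the indicator vector of the seed node, and r is the restart probability. A minimal dependency-free sketch on a toy 3-node graph (the graph and restart value are illustrative; the paper runs this on the heterogeneous miRNA-disease network):

```python
def rwr(W, seed, restart=0.3, tol=1e-10):
    """Random walk with restart. W[j][i] is the probability of
    stepping to node j from node i (each column of W sums to 1);
    returns the stationary proximity vector to the seed node."""
    n = len(W)
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    while True:
        nxt = [(1 - restart) * sum(W[j][i] * p[i] for i in range(n))
               + restart * p0[j] for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt

# Toy star graph: node 0 connected to nodes 1 and 2, column-normalised.
W = [[0.0, 1.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
scores = rwr(W, seed=0)
```

The converged vector ranks all nodes by proximity to the seed; in the paper, seeding with a disease node ranks candidate miRNAs.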
Zhang, LY, Zhang, Y, Liu, Y, Yang, A & Chen, G 2017, 'Security Analysis of Some Diffusion Mechanisms Used in Chaotic Ciphers', International Journal of Bifurcation and Chaos, vol. 27, no. 10, pp. 1-13.
As a variant of the substitution-permutation network, the permutation-diffusion structure has received extensive attention in the field of chaotic cryptography over the last three decades. Because of its high implementation speed and nonlinearity over GF(2), the Galois field of two elements, mixing modulo addition/multiplication with Exclusive OR has become very popular in various designs for achieving the desired diffusion effect. This paper reports that some diffusion mechanisms based on modulo addition/multiplication and Exclusive OR are not resistant to plaintext attacks, contrary to what is claimed. By cracking several recently proposed chaotic ciphers as examples, it is demonstrated that a good understanding of the strengths and weaknesses of these crypto-primitives is crucial for designing more practical chaotic encryption algorithms in the future.
The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand for high compression ratios due to the intrinsically challenging features of DNA sequences, such as a small alphabet size, frequent repeats and palindromes. Reference-based lossless compression, by which only the differences between two similar genomes are stored, is a promising approach with a high compression ratio. We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark dataset of eight human genomes. HiRGC takes <30 min to compress about 21 gigabytes of each of the seven target genomes into 96-260 megabytes, achieving compression ratios of 82 to 217 times. This performance is at least 1.9 times better than the best competing algorithm in its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust in dealing with different reference genomes; in contrast, the competing methods' performance varies widely across reference genomes. Further experiments on 100 human genomes from the 1000 Genomes Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent. The C++ and Java source codes of our algorithm are freely available for academic and non-commercial use and can be downloaded from https://github.com/yuansliu/HiRGC. Contact: firstname.lastname@example.org. Supplementary data are available at Bioinformatics online.
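The 2-bit encoding scheme mentioned above exploits the four-letter DNA alphabet: each base maps to two bits, so four bases fit in one byte. A minimal round-trip sketch (the mapping A/C/G/T → 0/1/2/3 is a common convention assumed here; HiRGC additionally handles lowercase bases, N runs and the greedy matching against the reference):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    """Pack a DNA string into bytes at 2 bits per base;
    returns (packed bytes, original length)."""
    bits = 0
    for b in seq:
        bits = (bits << 2) | CODE[b]
    return bits.to_bytes((2 * len(seq) + 7) // 8, "big"), len(seq)

def unpack(data, n):
    """Recover the original n-base string from the packed bytes."""
    bits = int.from_bytes(data, "big")
    return "".join(BASE[(bits >> (2 * (n - 1 - i))) & 3]
                   for i in range(n))

packed, n = pack("ACGTTGCA")   # 8 bases -> 2 bytes
```

This alone gives a fixed 4:1 reduction over one-byte-per-base storage; the large ratios reported above come from the referential matching layered on top of it.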
Peng, H, Lan, C, Liu, Y, Liu, T, Blumenstein, M & Li, J 2017, 'Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes', Oncotarget, vol. 8, no. 45, pp. 78901-78916.
Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector representation of diseases and applies the vectorized data in a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. The novel vector representation consists of two sub-vectors. The first is composed of 45 elements characterizing the information entropies of the disease-gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein-coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease-gene-enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.
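The entropy elements of the first sub-vector can be computed with the standard Shannon formula over a disease's gene counts per chromosome substructure. A minimal sketch (the count vectors are hypothetical; the paper computes one entropy value per disease over 45 substructures):

```python
import math

def distribution_entropy(counts):
    """Shannon entropy (in bits) of a disease's gene counts over
    chromosome substructures; empty bins contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c)

# Hypothetical gene counts over four substructures:
spread_out = distribution_entropy([1, 1, 1, 1])   # maximally uniform
concentrated = distribution_entropy([8, 0, 0, 0])  # all in one arm
```

A disease whose genes are concentrated in a few preferred substructures (like the chromosome 6 p-arm preference noted above) yields low entropy, while a uniformly spread disease yields high entropy, so the 45 entropy values discriminate between diseases.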
Zhang, X, Zhang, X, Verma, S, Liu, Y, Blumenstein, M & Li, J 2019, 'Detection of Anomalous Traffic Patterns and Insight Analysis from Bus Trajectory Data', PRICAI 2019: Trends in Artificial Intelligence, The 16th Pacific Rim International Conference on Artificial Intelligence, Cuvu, Fiji.
Zhang, X, Liu, Y, Zheng, Y, Zhao, Z, Li, J & Liu, Y 2018, 'Distinction between Ships and Icebergs in SAR Images Using Ensemble Loss Trained Convolutional Neural Networks', AI 2018: Advances in Artificial Intelligence (LNAI), Australasian Joint Conference on Artificial Intelligence, Springer, Wellington, New Zealand, pp. 216-223.
With global warming, more new shipping routes will be opened and utilized by more and more ships in the polar regions, particularly in the Arctic. Synthetic aperture radar (SAR) has been widely used in ship and iceberg monitoring for maritime surveillance and safety in Arctic waters. At present, compared with the detection of ships or icebergs, the task of distinguishing ships from icebergs in SAR images remains challenging. In this work, we propose a novel loss function called ensemble loss to train convolutional neural networks (CNNs); it is a convex function that incorporates the traits of cross entropy and hinge loss. The ensemble-loss-trained CNN model for the distinction between ships and icebergs is evaluated on a real-world SAR data set, achieving a classification accuracy of 90.15%. Experiments on another real image data set also confirm the effectiveness of the proposed ensemble loss.
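The idea of blending the traits of cross entropy and hinge loss can be sketched per sample as a convex combination of the two terms. The exact form of the paper's ensemble loss differs; the blend weight `alpha` and the sigmoid scoring below are illustrative assumptions.

```python
import math

def ensemble_loss(score, label, alpha=0.5):
    """Toy per-sample blend of cross entropy and hinge loss for a
    raw classifier score and a label in {-1, +1}. Both terms are
    convex in the score, so their convex combination is too."""
    prob = 1.0 / (1.0 + math.exp(-score))        # sigmoid probability
    y01 = (label + 1) // 2                       # map {-1,+1} -> {0,1}
    ce = -(y01 * math.log(prob) + (1 - y01) * math.log(1 - prob))
    hinge = max(0.0, 1.0 - label * score)        # margin penalty
    return alpha * ce + (1 - alpha) * hinge

confident_right = ensemble_loss(10.0, 1)   # large correct margin
confident_wrong = ensemble_loss(-10.0, 1)  # large incorrect margin
```

The cross-entropy term keeps gradients informative for probabilistic calibration, while the hinge term enforces a margin, which is the combination of traits the abstract describes.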