Bioinformatics (sequencing data compression)
The latest sequencing technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore machines, can generate long reads thousands of bases in length, much longer than the reads of hundreds of bases generated by Illumina machines. However, these long reads suffer from much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have recently been proposed that use Illumina reads of low error rates to fix sequencing errors in the noisy long reads, with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search paths between pairs of solid $k$-mers iteratively with an increasing $k$-mer length. At the second level, we combine the processed results under different parameters from the first level. In particular, a multiple sequence alignment algorithm is used to align similar long reads, followed by a voting algorithm which determines the final base at each position of the reads. We compare the performance of Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio ($>1.3\%$) and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive with the current methods in terms of number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
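To make the first-level idea concrete, here is a minimal Python sketch of searching a path between a pair of solid $k$-mers in an implicit de Bruijn graph built from accurate short reads. The function names, the solidity threshold and the search bound are illustrative assumptions, not Bicolor's actual implementation, which additionally iterates with increasing $k$.

```python
from collections import Counter, deque

def solid_kmers(short_reads, k, min_count=3):
    """Count k-mers in accurate short reads; keep those seen often enough."""
    counts = Counter()
    for read in short_reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c >= min_count}

def search_path(source, target, solid, max_visited):
    """BFS in the implicit de Bruijn graph: nodes are solid k-mers,
    edges are (k-1)-overlaps. Returns one corrected sequence or None."""
    queue = deque([source])
    parent = {source: None}
    while queue:
        kmer = queue.popleft()
        if kmer == target:          # reconstruct the spelled-out path
            path = []
            while kmer is not None:
                path.append(kmer)
                kmer = parent[kmer]
            path.reverse()
            return path[0] + ''.join(p[-1] for p in path[1:])
        for base in 'ACGT':
            nxt = kmer[1:] + base
            if nxt in solid and nxt not in parent:
                parent[nxt] = kmer
                queue.append(nxt)
        if len(parent) > max_visited:  # give up on overly long searches
            return None
    return None

short_reads = ["ACGTACGT", "CGTACGTT", "GTACGTTG"]
solid = solid_kmers(short_reads, k=4, min_count=1)
print(search_path("ACGT", "GTTG", solid, max_visited=50))  # -> ACGTTG
```

In practice, the spelled-out path would replace the erroneous region of the long read between the two solid anchor $k$-mers.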
Liu, Y, Yu, Y, Dinger, ME & Li, J 2019, 'Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression', Bioinformatics, vol. 35, no. 12, pp. 2066-2074.
Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of reads data, and algorithms have been specially designed to contract the size of these data sets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective reads compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroups those that share the same minimizer. Within each subgroup, a contig is constructed. Then pairs of contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed, until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 data sets (13 RNA-seq data sets and 5 whole genome sequencing data sets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order-preserving mode. These excellent performances are mainly attributed to the exploitation of the redundancy of the repetitive substrings in the long contigs. Availability and Implementation: https://github.com/yuansliu/minicom. Supplementary Information: Supplementary data are available at Bioinformatics online.
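For illustration, a minimal sketch of the (w, k)-minimizer computation that underlies the indexing step; lexicographic order here stands in for whatever k-mer ordering minicom actually uses, and the function name is hypothetical.

```python
def minimizers(seq, w, k):
    """Return the set of (w, k)-minimizers of seq: for every window of
    w consecutive k-mers, keep the smallest k-mer in the window."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    mins = set()
    for i in range(len(kmers) - w + 1):
        mins.add(min(kmers[i:i + w]))
    return mins

# Reads (and later contigs) sharing a suffix-prefix overlap tend to
# share minimizers, so they can be grouped by a common minimizer.
print(minimizers("ACGTACGTGG", w=3, k=4))  # -> {'ACGT', 'CGTA'}
```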
Liu, Y, Zhang, LY, Li, J & Hancock, J 2019, 'Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers', Bioinformatics, vol. 35, no. 22, pp. 4560-4567.
Detection of maximal exact matches (MEMs) between two long sequences is a fundamental problem in pairwise reference-query genome comparisons. To efficiently compare ever larger genomes, reducing the number of indexed k-mers as well as the number of query k-mers has been adopted as a mainstream approach which saves computational resources by avoiding a significant number of unnecessary matches. Results: Under this framework, we propose a new method to detect all MEMs from a pair of genomes. The method first performs a fixed sampling of k-mers on the query sequence and adds these selected k-mers to a Bloom filter. Then all the k-mers of the reference sequence are tested by the Bloom filter. If a k-mer passes the test, it is inserted into a hash table for indexing. Compared with the existing methods, far fewer query k-mers are generated and far fewer k-mers are inserted into the index, avoiding unnecessary matches and leading to an efficient matching process and memory savings. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. This performance is mainly attributed to the key novelty of our method: the fixed k-mer sampling must be conducted on the query sequence, and the index k-mers are filtered from the reference sequence via a Bloom filter.
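The pipeline described above can be sketched as follows; the BloomFilter class and the parameter choices are illustrative assumptions rather than the paper's implementation. For detecting all MEMs of length at least L, the fixed sampling step on the query is typically chosen so every such MEM still contains a sampled k-mer (e.g., step ≤ L - k + 1).

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; double hashing derived from blake2b digests."""
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item):
        h = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(h[:8], 'little')
        h2 = int.from_bytes(h[8:], 'little') | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

def build_index(query, reference, k, step):
    bf = BloomFilter()
    for i in range(0, len(query) - k + 1, step):  # fixed sampling of query k-mers
        bf.add(query[i:i + k])
    index = {}
    for j in range(len(reference) - k + 1):       # filter the index k-mers
        kmer = reference[j:j + k]
        if kmer in bf:                            # keep only k-mers that may match
            index.setdefault(kmer, []).append(j)
    return index
```

Only reference k-mers that pass the filter enter the hash table, which is where the memory savings described in the abstract come from.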
Tang, T, Liu, Y, Zhang, B, Su, B & Li, J 2019, 'Sketch distance-based clustering of chromosomes for large genome database compression', BMC Genomics, vol. 20, no. Suppl 10, p. 978.
BACKGROUND: The rapid development of next-generation sequencing technologies enables sequencing genomes at low cost. The dramatically increasing amount of sequencing data has raised a crucial need for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance in compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers from a series of issues such as difficult reference selection and remarkable performance variation. RESULTS: We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes. This method clusters the genomes into subsets of highly similar genomes using MinHash sketch distance, and uses the centroid sequence of each cluster as the reference genome for an outstanding reference-based compression of the remaining genomes in each cluster. A final reference is then selected from these reference genomes for the compression of the remaining reference genomes. Our method significantly improved the performance of state-of-the-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences. The compression ratio gain can reach 20-30% in most cases for the data sets from NCBI, the 1000 Genomes Project and the 3000 Rice Genomes Project. The best improvement boosts the performance from 351.74-fold to 443.51-fold compression. CONCLUSIONS: The compression ratio of reference-based compression on large-scale genome data sets can be improved via reference selection by applying appropriate data preprocessing and clustering methods. Our algorithm provides an efficient way to compress large genome databases.
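A hedged sketch of the MinHash sketch distance used for clustering, following the common bottom-s construction (as in Mash); the paper's exact estimator and parameters may differ.

```python
import hashlib

def sketch(seq, k=21, s=1000):
    """MinHash sketch: the s smallest 64-bit hash values over all k-mers."""
    hashes = {int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(),
                                             digest_size=8).digest(), 'little')
              for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]

def sketch_distance(a, b):
    """1 minus the Jaccard similarity estimated from the merged bottom-s sketch."""
    s = min(len(a), len(b))
    merged = set(sorted(set(a) | set(b))[:s])   # bottom-s of the union
    shared = len(merged & set(a) & set(b))      # hashes seen in both sketches
    return 1.0 - shared / s
```

Pairwise sketch distances between chromosomes are cheap to compute because each genome is reduced to s integers, which is what makes clustering thousands of genomes feasible.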
Zhang, LY, Liu, Y, Pareschi, F, Zhang, Y, Wong, KW, Rovatti, R & Setti, G 2018, 'On the security of a class of diffusion mechanisms for image encryption', IEEE Transactions on Cybernetics, vol. 48, no. 4, pp. 1163-1175.
The need for fast and strong image cryptosystems motivates researchers to develop new techniques that apply traditional cryptographic primitives in ways that exploit the intrinsic features of digital images. One of the most popular and mature techniques is the use of complex dynamic phenomena, including chaotic orbits and quantum walks, to generate the required key stream. In this paper, under the assumption of plaintext attacks, we investigate the security of a classic diffusion mechanism (and of its variants) used as the core cryptographic primitive in some image cryptosystems based on the aforementioned complex dynamic phenomena. We theoretically establish that, regardless of the key schedule process, the data complexity for recovering each element of the equivalent secret key from these diffusion mechanisms is only O(1). The proposed analysis is validated by means of numerical examples. Some additional cryptographic applications of this work are also discussed.
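As a worked illustration of why the data complexity is O(1) per key element, consider one classic diffusion variant of the kind analyzed, c[i] = ((p[i] + k[i]) mod 256) XOR c[i-1]: a single known plaintext/ciphertext pair lets each equivalent key byte be solved independently. The exact mechanism and constants below are illustrative assumptions, not the specific ciphers cracked in the paper.

```python
def diffuse(plain, key, c0=0x55):
    """A classic diffusion pass: c[i] = ((p[i] + k[i]) % 256) ^ c[i-1]."""
    cipher, prev = [], c0
    for p, k in zip(plain, key):
        c = ((p + k) % 256) ^ prev
        cipher.append(c)
        prev = c
    return cipher

def recover_key(plain, cipher, c0=0x55):
    """One known plaintext/ciphertext pair recovers each equivalent key
    byte independently: k[i] = ((c[i] ^ c[i-1]) - p[i]) % 256."""
    key, prev = [], c0
    for p, c in zip(plain, cipher):
        key.append(((c ^ prev) - p) % 256)
        prev = c
    return key

plain = [10, 20, 30, 40]
key = [200, 100, 50, 25]
assert recover_key(plain, diffuse(plain, key)) == key
```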
Zhang, LY, Liu, Y, Wang, C, Zhou, J, Zhang, Y & Chen, G 2018, 'Improved known-plaintext attack to permutation-only multimedia ciphers', Information Sciences, vol. 430-431, pp. 228-239.
Permutation is a commonly used operation in many secure multimedia systems. However, it is fragile against cryptanalysis when used alone. For instance, it is well known that permutation-only multimedia encryption is insecure against known-plaintext attack (KPA). There exist algorithms that are able to (partially) retrieve the secret permutation sequences in polynomial time with an amount of plaintexts logarithmic in the number of elements to be permuted. But existing works fail to answer how many known plaintexts are needed to fully recover an underlying secret permutation sequence, and how to balance the storage cost and computational complexity in implementing the KPA. This paper addresses these two problems. With a new concept of composite representation, the underlying theoretical rules governing the KPA on a permutation-only cipher are revealed, and some attractive algorithms outperforming the state-of-the-art methods in terms of computational complexity are developed. As a case study, experiments are performed on permutation-only image encryption to verify the theoretical analysis. The performance gap of the proposed KPA between artificial noise-like images, which perfectly fit the theoretical model, and the corresponding natural images is identified and analyzed. Finally, experimental results demonstrate the efficiency improvement of the new schemes over the existing ones.
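The basic element-matching idea behind such a KPA can be sketched as follows: each plaintext position keeps a shrinking set of candidate ciphertext positions, intersected across known pairs. The paper's composite representation refines this to balance storage and computation; the code below is only the naive baseline.

```python
def kpa_permutation(pairs, n):
    """Intersect candidate target positions across known (plain, cipher)
    pairs: position i can map to j only if plain[i] == cipher[j] in
    every pair. With enough pairs the candidate sets shrink to singletons."""
    candidates = [set(range(n)) for _ in range(n)]
    for plain, cipher in pairs:
        for i in range(n):
            candidates[i] &= {j for j in range(n) if cipher[j] == plain[i]}
    return candidates

# Toy example: permute plain1 by perm (position i -> position perm[i]).
plain1, perm = [3, 1, 4, 1, 5], [2, 0, 4, 1, 3]
cipher1 = [0] * 5
for i, j in enumerate(perm):
    cipher1[j] = plain1[i]
# Equal-valued elements stay ambiguous with a single known pair:
print(kpa_permutation([(plain1, cipher1)], 5))  # [{2}, {0, 1}, {4}, {0, 1}, {3}]
```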
The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand for high compression ratios due to the intrinsically challenging features of DNA sequences such as a small alphabet size and frequent repeats and palindromes. Reference-based lossless compression, by which only the differences between two similar genomes are stored, is a promising approach with high compression ratio. We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark data set of eight human genomes. HiRGC takes less than 30 minutes to compress about 21 gigabytes of each set of the seven target genomes into 96-260 megabytes, achieving compression ratios of 82 to 217 times. This performance is at least 1.9 times better than the best competing algorithm on its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust in dealing with different reference genomes. In contrast, the competing methods' performance varies widely on different reference genomes. More experiments on 100 human genomes from the 1000 Genomes Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent. The C++ and Java source codes of our algorithm are freely available for academic and non-commercial use and can be downloaded from https://github.com/yuansliu/HiRGC. Supplementary data are available at Bioinformatics online.
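A minimal sketch of the 2-bit encoding idea mentioned above; the packing layout is an assumption, and HiRGC's actual on-disk format handles non-ACGT symbols and other metadata separately.

```python
CODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def pack_2bit(seq):
    """Pack an A/C/G/T sequence into bytes, four bases per byte."""
    out, cur, n = bytearray(), 0, 0
    for base in seq:
        cur = (cur << 2) | CODE[base]
        n += 1
        if n == 4:
            out.append(cur)
            cur, n = 0, 0
    if n:
        out.append(cur << (2 * (4 - n)))  # left-align the final partial byte
    return bytes(out)

print(pack_2bit("ACGTACG").hex())  # 'ACGT' -> 0x1b, 'ACG' padded -> 0x18
```

This alone gives a 4:1 reduction over one byte per base; the greedy hash-table matching then removes the redundancy between the target and reference genomes on top of it.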
Liu, Y, Zeng, X, He, Z & Zou, Q 2017, 'Inferring MicroRNA-Disease Associations by Random Walk on a Heterogeneous Network with Multiple Data Sources', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 4, pp. 905-915.
Since the discovery of the regulatory function of microRNA (miRNA), increased attention has focused on identifying the relationship between miRNA and disease. It has been suggested that computational methods are an efficient way to identify potential disease-related miRNAs for further confirmation by biological experiments. In this paper, we first highlight three limitations commonly associated with previous computational methods. To resolve these limitations, we established a disease similarity subnetwork and a miRNA similarity subnetwork by integrating multiple data sources, where the disease similarity is composed of disease semantic similarity and disease functional similarity, and the miRNA similarity is calculated using the miRNA-target gene and miRNA-lncRNA (long non-coding RNA) associations. Then, a heterogeneous network was constructed by connecting the disease similarity subnetwork and the miRNA similarity subnetwork using the known miRNA-disease associations. We extended random walk with restart to predict miRNA-disease associations in the heterogeneous network. Leave-one-out cross-validation achieved an average area under the curve (AUC) of 0.8049 across 341 diseases and 476 miRNAs. For five-fold cross-validation, our method achieved AUCs from 0.7970 to 0.9249 for 15 human diseases. Case studies further demonstrate the feasibility of our method to discover potential miRNA-disease associations. An online prediction service is freely available at http://ifmda.aliapp.com.
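A compact sketch of random walk with restart, the core propagation step described above; the toy network, restart probability and column normalization are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def rwr(W, seeds, restart=0.7, tol=1e-10):
    """Random walk with restart on an adjacency matrix:
    p <- (1 - r) * W @ p + r * p0, iterated to convergence."""
    W = W / W.sum(axis=0, keepdims=True)   # column-normalize to a transition matrix
    p0 = np.zeros(W.shape[0])
    p0[seeds] = 1.0 / len(seeds)           # restart mass on the seed nodes
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy heterogeneous network: nodes 0-2 are miRNAs, 3-5 are diseases.
A = np.array([[0, 1, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1], [0, 0, 1, 0, 1, 0]], float)
print(rwr(A, seeds=[3]).round(3))  # steady-state relevance to disease node 3
```

In the paper's setting the walk runs on the full heterogeneous network, so miRNA nodes receiving high steady-state probability for a disease seed become candidate associations.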
Peng, H, Lan, C, Liu, Y, Liu, T, Blumenstein, M & Li, J 2017, 'Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes', Oncotarget, vol. 8, no. 45, pp. 78901-78916.
Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector representation of diseases, and applies the newly vectorized data in a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation consists of two sub-vectors. The first is composed of 45 elements characterizing the information entropies of the disease gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein-coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease-gene-enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease-related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.
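A hedged sketch of the kind of entropy computation that could underlie the 45-element sub-vector: the Shannon entropy of a disease's gene counts over chromosome substructures. The exact construction in the paper may differ, and the counts below are hypothetical.

```python
import math

def distribution_entropy(counts):
    """Shannon entropy (bits) of a disease's gene counts over
    chromosome substructures: low entropy means the disease genes
    concentrate on few substructures (strong preference)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Genes of a hypothetical disease spread over four substructures:
print(distribution_entropy([8, 1, 1, 0]))  # low entropy: strong preference
print(distribution_entropy([3, 3, 2, 2]))  # high entropy: no clear preference
```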
Zeng, X, Liao, Y, Liu, Y & Zou, Q 2017, 'Prediction and validation of disease genes using HeteSim scores', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 3, pp. 687-695.
Deciphering gene-disease associations is an important goal in biomedical research. In this paper, we use a novel relevance measure, called HeteSim, to prioritize candidate disease genes. Two methods based on heterogeneous networks constructed using protein-protein interactions, gene-phenotype associations and phenotype-phenotype similarity are presented. In HeteSim-MultiPath (HSMP), HeteSim scores of different paths are combined with a constant that dampens the contributions of longer paths. In HeteSim-SVM (HSSVM), HeteSim scores are combined with a machine learning method. Three-fold cross-validation experiments show that our non-machine-learning method HSMP performs better than the existing non-machine-learning methods, while our machine learning method HSSVM obtains accuracy similar to the best existing machine learning method, CATAPULT. From the analysis of the top 10 predicted genes for different diseases, we found that HSSVM avoids a disadvantage of the existing machine-learning-based methods, which tend to predict similar genes for different diseases. The data sets and Matlab code for the two methods are freely available for download at http://lab.malab.cn/data/HeteSim/index.jsp.
Zhang, LY, Zhang, Y, Liu, Y, Yang, A & Chen, G 2017, 'Security Analysis of Some Diffusion Mechanisms Used in Chaotic Ciphers', International Journal of Bifurcation and Chaos, vol. 27, no. 10, pp. 1-13.
As a variant of the substitution-permutation network, the permutation-diffusion structure has received extensive attention in the field of chaotic cryptography over the last three decades. Because of its high implementation speed and nonlinearity over GF(2), the Galois field of two elements, mixing modulo addition/multiplication with exclusive OR has become very popular in various designs to achieve the desired diffusion effect. This paper reports that some diffusion mechanisms based on modulo addition/multiplication and exclusive OR are not resistant to plaintext attacks as claimed. By cracking several recently proposed chaotic ciphers as examples, it is demonstrated that a good understanding of the strengths and weaknesses of these crypto-primitives is crucial for designing more practical chaotic encryption algorithms in the future.
Liu, Y, Zhang, LY, Wang, J, Zhang, Y & Wong, KW 2016, 'Chosen-plaintext attack of an image encryption scheme based on modified permutation–diffusion structure', Nonlinear Dynamics, vol. 84, no. 4, pp. 2241-2250.
Since its first appearance in Fridrich's design, the permutation–diffusion structure for digital image cryptosystems has received increasing research attention in the field of chaos-based cryptography. Recently, a novel chaotic image cipher using a single-round modified permutation–diffusion pattern (ICMPD) was proposed. Unlike the traditional permutation–diffusion structure, the permutation of ICMPD operates at the bit level instead of the pixel level, and its diffusion stage operates on masked pixels, which are obtained by applying the classical affine cipher, instead of plain pixels. Following a divide-and-conquer strategy, this paper reports that ICMPD can be compromised efficiently by a chosen-plaintext attack whose data complexity is linear in the size of the plain-image. Moreover, the relationship between the cryptographic kernel at the diffusion stage of ICMPD and the classical modulo-addition-then-XOR operation is explored thoroughly.
Zeng, L, Liu, R, Zhang, LY, Liu, Y & Wong, KW 2016, 'Cryptanalyzing an image encryption algorithm based on scrambling and Veginère cipher', Multimedia Tools and Applications, vol. 75, no. 10, pp. 5439-5453.
Recently, an image encryption algorithm based on scrambling and the Veginère cipher was proposed. However, it was soon cryptanalyzed by Zhang et al. using a method combining a chosen-plaintext attack and differential attacks. This paper briefly reviews the two attack approaches proposed by Zhang et al. and outlines their mathematical interpretations. Based on these approaches, we present an improved chosen-plaintext attack that further reduces the number of chosen plaintexts required, which is proved to be optimal. Moreover, it is found that an elaborately designed known-plaintext attack can efficiently compromise the image cipher under study. This finding is confirmed by both mathematical analysis and numerical simulations. The cryptanalysis techniques developed in this paper provide some insights for designing secure and efficient multimedia ciphers.
Liu, Y, Fan, H, Xie, EY, Cheng, G & Li, C 2015, 'Deciphering an image cipher based on mixed transformed logistic maps', International Journal of Bifurcation and Chaos in Applied Sciences and Engineering, vol. 25, no. 13.
Since John von Neumann suggested utilizing the Logistic map as a random number generator in 1947, a great number of encryption schemes based on the Logistic map and/or its variants have been proposed. This paper re-evaluates the security of an image cipher based on transformed Logistic maps and proves that the image cipher can be deciphered efficiently under two different conditions: (1) two pairs of known plain-images and the corresponding cipher-images, with computational complexity of $O(2^{18} + L)$; (2) two pairs of chosen plain-images and the corresponding cipher-images, with computational complexity of $O(L)$, where $L$ is the number of pixels in the plain-image. In contrast, the condition required by the previous deciphering method is 87 pairs of chosen plain-images and the corresponding cipher-images, with computational complexity of $O(2^{7} + L)$. In addition, three other security flaws existing in most Logistic-map-based ciphers are also reported.
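For background, a minimal sketch of how a Logistic-map keystream is typically produced in such ciphers, iterating x <- mu*x*(1-x) and quantizing states to bytes; the burn-in length, quantization rule and parameters are illustrative, not those of the analyzed cipher.

```python
def logistic_keystream(x0, mu, n, burn_in=100):
    """Iterate the Logistic map and quantize each state to a keystream byte."""
    x = x0
    for _ in range(burn_in):      # discard transient iterations
        x = mu * x * (1 - x)
    stream = []
    for _ in range(n):
        x = mu * x * (1 - x)
        stream.append(int(x * 256) % 256)
    return stream

print(logistic_keystream(x0=0.3579, mu=3.99, n=8))
```

The attacks summarized above work regardless of how such a keystream is generated: they recover the equivalent keystream or key directly from plaintext/ciphertext pairs.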
Li, C, Liu, Y, Zhang, LY & Wong, KW 2014, 'Cryptanalyzing a class of image encryption schemes based on Chinese remainder theorem', Signal Processing: Image Communication, vol. 29, no. 8, pp. 914-920.
As a fundamental theorem in number theory, the Chinese Remainder Theorem (CRT) is widely used to construct cryptographic primitives. This paper investigates the security of a class of image encryption schemes based on CRT, referred to as CECRT. Making use of some properties of CRT, the equivalent secret key of CECRT can be recovered efficiently. The required number of pairs of chosen plaintexts and the corresponding ciphertexts is only $1+\lceil (\log_2 L)/l \rceil$, and the attack complexity is only $O(L)$, where $L$ is the plaintext length and $l$ is the number of bits representing a plaintext symbol. In addition, other defects of CECRT, such as an invalid compression function and low sensitivity to plaintext, are reported. The work in this paper helps clarify the positive role of CRT in cryptology.
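For reference, a minimal worked example of CRT reconstruction, the number-theoretic core that CECRT builds on; this is the standard textbook construction, not CECRT's encryption itself.

```python
from math import prod

def crt(residues, moduli):
    """Chinese Remainder Theorem: recover x mod prod(moduli) from
    the residues x mod m_i, for pairwise-coprime moduli m_i."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m (Python 3.8+)
    return x % M

moduli = [3, 5, 7]
secret = 58
residues = [secret % m for m in moduli]   # [1, 3, 2]
assert crt(residues, moduli) == secret
```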
Liu, Y, Tang, J & Xie, T 2014, 'Cryptanalyzing a RGB image encryption algorithm based on DNA encoding and chaos map', Optics and Laser Technology, vol. 60, pp. 111-115.
Recently, an RGB image encryption algorithm based on DNA encoding and a chaos map was proposed. It was reported that the encryption algorithm could be broken with four pairs of chosen plain-images and the corresponding cipher-images. This paper re-evaluates the security of the encryption algorithm and finds that it can be broken efficiently with only one known plain-image. The effectiveness of the proposed known-plaintext attack is supported by both rigorous theoretical analysis and experimental results. In addition, two other security defects are also reported.
Xie, T, Liu, Y & Tang, J 2014, 'Breaking a novel image fusion encryption algorithm based on DNA sequence operation and hyper-chaotic system', Optik, vol. 125, no. 24, pp. 7166-7169.
Recently, a novel image fusion encryption algorithm based on DNA sequence operations and a hyper-chaotic system was proposed. It was reported that the scheme could be broken with $4mn/3+1$ chosen plain-images and the corresponding cipher-images, where $mn$ is the size of the plain-image. This paper re-evaluates the security of the encryption scheme and finds that it can be broken with fewer than $\lceil \log_2(4mn)/2 \rceil + 1$ chosen plain-images, and even with only three in many cases. The effectiveness of the proposed chosen-plaintext attack is supported by theoretical analysis and verified by experimental results.
Zhang, LY, Hu, X, Liu, Y, Wong, KW & Gan, J 2014, 'A chaotic image encryption scheme owning temp-value feedback', Communications in Nonlinear Science and Numerical Simulation, vol. 19, no. 10, pp. 3653-3659.
Many round-based chaotic image encryption algorithms employ the permutation-diffusion structure. This structure has been found insecure when the iteration round is equal to one, and the secret permutation of some existing schemes can be recovered even when a higher round number is adopted. In this paper, we present a single-round permutation-diffusion chaotic cipher for gray images, in which some temp-value feedback mechanisms are introduced to resist the known attacks. Specifically, we first embed a plaintext feedback technique in the permutation process to develop different permutation sequences for different plain-images, and then employ plaintext/ciphertext feedback in the diffusion stage to generate the equivalent secret key dynamically. Experimental results show that the new scheme has a large key space and can resist differential attack. It is also efficient.
Li, C, Liu, Y, Xie, T & Chen, MZQ 2013, 'Breaking a novel image encryption scheme based on improved hyperchaotic sequences', Nonlinear Dynamics, vol. 73, no. 3, pp. 2083-2089.
Recently, a novel image encryption scheme based on improved hyperchaotic sequences was proposed. A pseudo-random number sequence, generated by a hyper-chaotic system, is used to determine the two involved encryption functions: bitwise exclusive OR (XOR) and modulo addition. It was reported that the scheme can be broken with some pairs of chosen plain-images and the corresponding cipher-images. This paper re-evaluates the security of the encryption scheme and finds that it can be broken with only one known plain-image. The performance of the known-plaintext attack, in terms of success probability and computational load, becomes even better when two known plain-images are available. In addition, security defects concerning the insensitivity of the encryption result to changes of the secret key and the plain-image are also reported.
Li, C, Liu, Y, Zhang, LY & Chen, MZQ 2013, 'Breaking a chaotic image encryption algorithm based on modulo addition and xor operation', International Journal of Bifurcation and Chaos, vol. 23, no. 4.
This paper re-evaluates the security of a chaotic image encryption algorithm called MCKBA/HCKBA and finds that it can be broken efficiently with two known plain-images and the corresponding cipher-images. In addition, it is reported that a previously proposed attack on MCKBA/HCKBA can be further improved by reducing the number of chosen plain-images from four to two. The two attacks are both based on the properties of solving a composite function involving the carry bit, which is composed of the modulo addition and the bitwise OR operations. Both rigorous theoretical analysis and detailed experimental results are provided.
A maximal match between two genomes is a contiguous, non-extendable subsequence common to the two genomes. DNA bases often differ between the genomes of two individuals. When a mutation occurs within a maximal match, it breaks the match into shorter segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match which is allowed to contain mutations.
We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches for genome encoding. memRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover mismatches (mutations) and merges them with their neighbouring maximal matches to form long, mutation-containing matches. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. memRGC is also better than the best state-of-the-art methods on all of the benchmark data sets, sometimes by 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission.
Availability and Implementation
Supplementary data are available at Bioinformatics online.
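A rough sketch of the coprime double-window k-mer sampling used to seed matches: sample one sequence every w1 positions and the other every w2 positions with gcd(w1, w2) = 1, so that sufficiently long exact matches are still seeded from both sides. The function names, window sizes and the exact guarantee are illustrative assumptions; memRGC's extension of seeds into mutation-containing matches is not shown.

```python
def sampled_kmers(seq, k, w):
    """k-mers starting at every w-th position, mapped to their positions."""
    return {seq[i:i + k]: i for i in range(0, len(seq) - k + 1, w)}

def seed_matches(ref, tgt, k, w1=3, w2=4):
    """Coprime double-window sampling: index the reference at stride w1,
    scan the target at stride w2 (gcd(w1, w2) = 1), and report shared
    sampled k-mers as candidate match seeds."""
    index = sampled_kmers(ref, k, w1)
    return [(index[kmer], j) for kmer, j in sampled_kmers(tgt, k, w2).items()
            if kmer in index]

ref = "AACGTACGTTGACGTACGTT"
tgt = "TTACGTACGTTG"
print(seed_matches(ref, tgt, k=4))  # [(3, 4)] -> one shared seed to extend
```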
Zhang, X, Zhang, X, Verma, S, Liu, Y, Blumenstein, M & Li, J 2019, 'Detection of Anomalous Traffic Patterns and Insight Analysis from Bus Trajectory Data', PRICAI 2019: Trends in Artificial Intelligence, The 16th Pacific Rim International Conference on Artificial Intelligence, Cuvu, Fiji.
Zhang, X, Liu, Y, Zheng, Y, Zhao, Z, Li, J & Liu, Y 2018, 'Distinction between Ships and Icebergs in SAR Images Using Ensemble Loss Trained Convolutional Neural Networks', AI 2018: Advances in Artificial Intelligence (LNAI), Australasian Joint Conference on Artificial Intelligence, Springer, Wellington, New Zealand, pp. 216-223.
With global warming, more new shipping routes will be opened and used by more and more ships in the polar regions, particularly the Arctic. Synthetic aperture radar (SAR) has been widely used in ship and iceberg monitoring for maritime surveillance and safety in Arctic waters. At present, compared with the detection of ships or icebergs, the task of distinguishing ships from icebergs in SAR images remains challenging. In this work, we propose a novel loss function called ensemble loss to train convolutional neural networks (CNNs); it is a convex function that incorporates the traits of cross entropy and hinge loss. A CNN model trained with the ensemble loss for the distinction between ships and icebergs is evaluated on a real-world SAR data set, achieving a classification accuracy of 90.15%. Experiments on another real image data set also confirm the effectiveness of the proposed ensemble loss.
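A hedged sketch of one plausible form of such an ensemble loss: a convex combination of cross-entropy (logistic) loss and squared hinge loss for binary classification. The mixing weight alpha and the exact formulation are assumptions; the abstract specifies only that the loss is convex and combines the traits of both losses.

```python
import math

def ensemble_loss(score, label, alpha=0.5):
    """Convex combination of logistic (cross-entropy) loss and squared
    hinge loss for a raw logit `score` and a label in {+1, -1}.
    Both terms are convex in `score`, so the combination is convex too."""
    ce = math.log(1.0 + math.exp(-label * score))   # cross-entropy / logistic loss
    hinge = max(0.0, 1.0 - label * score) ** 2      # squared hinge loss
    return alpha * ce + (1.0 - alpha) * hinge

print(ensemble_loss(score=2.0, label=1))    # confident and correct: small loss
print(ensemble_loss(score=-0.5, label=1))   # wrong side of the margin: large loss
```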