Ian started his career with GEC-Marconi in the UK, completing sponsored BSc/MEng degrees at the University of Bath before working on satellite modem technology. He returned to the University of Bath with Vodafone sponsorship for his PhD in hybrid speech coding techniques, which he completed in 1992. Following a further period in industry working on signal processing products, he moved to Australia, initially joining the University of Wollongong. During this period he worked extensively with Motorola on collaborative grants and projects in the areas of speech coding and audio signal processing. From 2003 to 2007, Ian was Australian Head of Delegation at the ISO/IEC standardisation group MPEG, where he also chaired the Multimedia Description Schemes subgroup. He continues to be actively involved in ISO/IEC SC29, the host committee for the MPEG and JPEG families of standards. One outcome of Ian's MPEG activities was the formation of a start-up company which attracted significant venture funding in 2006/7. Ian is an active researcher in audio and multimedia signal processing, collaborating with European colleagues through the COST network Qualinet, and is currently a member of the editorial board of IEEE Multimedia. He has more than 170 publications, as well as US patents and international standards contributions. Prior to joining UTS, Professor Burnett was Head of the School of Electrical and Computer Engineering at RMIT University. During his tenure, the School experienced significant international growth and expansion, with programs in Vietnam and Hong Kong, and achieved strong research performance.
Zhao, S., Cheng, E., Qiu, X., Burnett, I. & Liu, J.C.-C. 2018, 'Spatial decorrelation of wind noise with porous microphone windscreens', Journal of the Acoustical Society of America, vol. 143, no. 1, pp. 330-339.
© 2018 Acoustical Society of America. This paper explores the wind noise reduction mechanism of porous microphone windscreens by investigating the spatial correlation of wind noise. First, the spatial structure of the wind noise signal is studied by simulating the magnitude squared coherence of the pressure measured with two microphones at various separation distances; the coherence of the two signals decreases with separation distance, and the wind noise is spatially correlated only within a distance smaller than the turbulence wavelength. The wind noise reduction of the porous microphone windscreen is then investigated, and the windscreen is found to be most effective in attenuating wind noise in the frequency range where the windscreen diameter is approximately 2 to 4 times the turbulence wavelength (2 < D/λ0 < 4), regardless of the wind speed and windscreen diameter. The spatial coherence between the wind noise outside and inside a porous microphone windscreen is compared with that without the windscreen, and the coherence decreases significantly when the windscreen diameter is approximately 2 to 4 times the turbulence wavelength, corresponding to the windscreen's most effective wind noise reduction frequency range. Experimental results with a fan are presented to support the simulations. It is concluded that the wind noise reduction mechanism of porous microphone windscreens is related to the spatial decorrelation of the wind noise signals provided by the porous material and structure.
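The magnitude squared coherence used in this analysis can be sketched in a few lines. This is a generic Welch-style estimate, not the authors' simulation code; the segment length, DFT bin, and test signals below are arbitrary illustrative choices.

```python
import cmath
import math
import random

def dft_bin(x, k):
    """Single DFT bin k of a real-valued segment x."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))

def msc(x, y, seg_len, k):
    """Magnitude squared coherence at bin k, averaging over segments:
    |<X Y*>|^2 / (<|X|^2> <|Y|^2>). Equals 1 for identical signals and
    falls toward 0 as the two signals decorrelate."""
    nseg = len(x) // seg_len
    sxy = 0 + 0j
    pxx = pyy = 0.0
    for s in range(nseg):
        xs = x[s * seg_len:(s + 1) * seg_len]
        ys = y[s * seg_len:(s + 1) * seg_len]
        X, Y = dft_bin(xs, k), dft_bin(ys, k)
        sxy += X * Y.conjugate()
        pxx += abs(X) ** 2
        pyy += abs(Y) ** 2
    return abs(sxy) ** 2 / (pxx * pyy)

random.seed(0)
tone = [math.cos(2 * math.pi * n / 16) for n in range(128)]
noise = [random.gauss(0, 1) for _ in range(128)]
print(msc(tone, tone, 16, 1))   # identical signals: coherence 1
print(msc(tone, noise, 16, 1))  # unrelated signals: much lower coherence
```

Plotting this coherence against microphone separation (or, inside versus outside a windscreen) is exactly the kind of comparison the paper reports.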
Hieu, M.B., Lech, M., Cheng, E., Neville, K. & Burnett, I.S. 2017, 'Object Recognition Using Deep Convolutional Features Transformed by a Recursive Network Structure', IEEE Access, vol. 4, pp. 10059-10066.
Deep neural networks (DNNs) trained on large data sets have been shown to capture high-quality features describing image data. Numerous studies have proposed ways to transfer DNN structures trained on large data sets to classification tasks represented by relatively small data sets. Due to the limitations of these proposals, it is not well understood how to effectively adapt a pre-trained model to a new task. Typically, the transfer process uses a combination of fine-tuning and training of adaptation layers; however, both tasks are susceptible to data shortage and high computational complexity. This paper proposes an improvement to the well-known AlexNet feature extraction technique. The proposed approach applies a recursive neural network structure to features extracted by a deep convolutional neural network pre-trained on a large data set. Object recognition experiments conducted on the Washington RGB-D image data set show that the proposed method combines structural simplicity with higher recognition accuracy at a low computational cost compared with other relevant methods. The new approach requires no training at the feature extraction phase and can be performed very efficiently, as the output features are compact and highly discriminative and can be used with a simple classifier in object recognition settings.
Lee, D., Van Dorp Schuitman, J., Cabrera, D., Qiu, X. & Burnett, I. 2017, 'Comparison of psychoacoustic-based reverberance parameters', Journal of the Acoustical Society of America, vol. 142, no. 4, pp. 1832-1840.
© 2017 Acoustical Society of America. This study compared psychoacoustic reverberance parameters to each other, as well as to reverberation time (RT) and early decay time (EDT), under various acoustic conditions. The psychoacoustic parameters were loudness-based RT (T_N), loudness-based EDT [EDT_N; Lee, Cabrera, and Martens, J. Acoust. Soc. Am. 131, 1194-1205 (2012a)], and the parameter for reverberance [P_REV; van Dorp Schuitman, de Vries, and Lindau, J. Acoust. Soc. Am. 133, 1572-1585 (2013)]. For the comparisons, a wide range of sound pressure levels (SPLs) from 20 dB to 100 dB and RTs from 0.5 s to 5.0 s were evaluated, and two sets of subjective data from previous studies were used for cross-validation and comparison. Results show that the psychoacoustic reverberance parameters provided better matches to reverberance than RT and EDT; however, their performance varied with the SPL range, the type of audio sample, and the reverberation conditions. This study reveals that P_REV is the most relevant for estimating a relative change in reverberance between samples when the SPL range is small, while EDT_N is useful for estimating absolute reverberance. The study also suggests the use of P_REV and EDT_N for speech and music samples, respectively.
Wang, X., Cheng, E., Burnett, I.S., Huang, Y. & Wlodkowic, D. 2017, 'Automatic multiple zebrafish larvae tracking in unconstrained microscopic video conditions', Scientific Reports, vol. 7, no. 1, p. 17596.
The accurate tracking of zebrafish larvae movement is fundamental to research in many biomedical, pharmaceutical, and behavioral science applications. However, the locomotive characteristics of zebrafish larvae are significantly different from those of adult zebrafish, so existing adult zebrafish tracking systems cannot reliably track larvae. Further, the much smaller size of larvae relative to the container makes the detection of water impurities inevitable, which further degrades larvae tracking or demands very strict video imaging conditions, typically resulting in unreliable tracking under realistic experimental conditions. This paper investigates the adaptation of advanced computer vision segmentation techniques and multiple object tracking algorithms to develop an accurate, efficient and reliable multiple zebrafish larvae tracking system. The proposed system has been tested on a set of single and multiple adult and larval zebrafish videos in a wide variety of (complex) video conditions, including shadowing, labels, water bubbles and background artifacts. Compared with existing state-of-the-art and commercial multiple organism tracking systems, the proposed system improves tracking accuracy by up to 31.57% in unconstrained video imaging conditions. To facilitate evaluation in zebrafish segmentation and tracking research, a dataset with annotated ground truth is also presented. The software is also publicly accessible.
Zhao, S., Cheng, E., Qiu, X., Burnett, I. & Liu, J.C.-C. 2017, 'Wind noise spectra in small Reynolds number turbulent flows', Journal of the Acoustical Society of America, vol. 142, no. 5, p. 3227.
Wind noise spectra caused by wind from fans in indoor environments have been found to differ from those measured in outdoor atmospheric conditions. Although many models have been developed to predict outdoor wind noise spectra under the assumption of a large Reynolds number [Zhao, Cheng, Qiu, Burnett, and Liu (2016). J. Acoust. Soc. Am. 140, 4178-4182, and the references therein], they cannot be applied directly to indoor situations because the Reynolds number of wind from fans indoors is usually much smaller than that experienced in atmospheric turbulence. This paper proposes a pressure structure function model that combines the energy-containing and dissipation ranges so that the pressure spectrum for small Reynolds number turbulent flows can be calculated. The proposed pressure structure function model is validated against experimental results in the literature, and the obtained pressure spectrum is then verified with numerical simulation and experimental results. It is demonstrated that the pressure spectrum obtained from the proposed model can be used to estimate wind noise spectra caused by turbulent flows with small Reynolds numbers.
Zhao, S., Dabin, M., Cheng, E., Qiu, X., Burnett, I. & Liu, J.C.-C. 2017, 'On the wind noise reduction mechanism of porous microphone windscreens', Journal of the Acoustical Society of America, vol. 142, no. 4, p. 2454.
This paper investigates the wind noise reduction mechanism of porous microphone windscreens. The pressure fluctuations inside porous windscreens with various viscous and inertial coefficients are studied with numerical simulations. The viscous and inertial coefficients represent, respectively, the viscous forces arising from fluid-solid interaction along the surfaces of the pores and the inertial forces imposed on the fluid flow by the solid structure of the porous medium. Simulation results indicate that the wind noise reduction increases with both the viscous and inertial coefficients, reaches a maximum, and then decreases. Experiments conducted on five porous microphone windscreens with porosity from 20 to 60 pores per inch (PPI) show that the 40 PPI windscreen has the best wind noise reduction performance, which supports the simulation results. The existence of optimal values for the viscous and inertial coefficients is explained qualitatively, and it is shown that the design of porous microphone windscreens should account for both the turbulence suppression inside the windscreen and the wake generation behind it to achieve optimal performance.
Liu, H., Xu, M., Wang, J., Rao, T. & Burnett, I. 2016, 'Improving Visual Saliency Computing With Emotion Intensity', IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1201-1213.
Saliency maps that integrate individual feature maps into a global measure of visual attention are widely used to estimate human gaze density. Most existing methods consider low-level visual features and the locations of objects, and/or emphasize spatial position with a center prior. Recent psychology research suggests that emotions strongly influence human visual attention. In this paper, we explore the influence of emotional content on visual attention. On top of traditional bottom-up saliency map generation, our saliency map is generated in cooperation with three emotion factors, i.e., general emotional content, facial expression intensity, and emotional object locations. Experiments carried out on the National University of Singapore Eye Fixation data set (a public eye-tracking data set) demonstrate that incorporating emotion does improve the quality of visual saliency maps computed by bottom-up approaches for gaze density estimation. Our method increases the area under the receiver operating characteristic curve by about 0.1 on average compared with four baseline bottom-up approaches (Itti's model, attention based on information maximization, saliency using natural statistics, and graph-based visual saliency).
Radmanesh, N., Burnett, I.S. & Rao, B.D. 2016, 'A Lasso-LS Optimization with a Frequency Variable Dictionary in a Multizone Sound System', IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 3, pp. 583-593.
© 2016 IEEE. This paper presents an approach for multizone wideband sound field generation using an efficient harmonic nested (EHN) dictionary for sparse loudspeaker placement and weighting. Effectively, the nested arrays provide a priori knowledge of prospective loudspeaker locations based on the frequency bands of interest. The nested arrays are then further optimized in the Lasso stage to form an efficient loudspeaker location dictionary. The final loudspeaker locations and weightings are estimated by a two-stage Lasso-LS pressure matching optimization. In the first-stage Lasso algorithm, the center frequencies of octave bands from 1 kHz to 8 kHz are used to select active loudspeakers. A second stage then optimizes reproduction using all selected loudspeakers on the basis of a regularized LS algorithm. The results demonstrate that the proposed approach provides a solution for the multizone sound system with mean squared error (MSE) under -30 dB across the targeted frequency range (500 Hz to 16 kHz) using a linear array of, e.g., 13 loudspeakers. In contrast, the single-stage LS approach generates MSE peaks of -10 dB and -9 dB at 9 kHz within the active and silent zones, respectively, using an identical number of loudspeakers and array length.
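The two-stage idea (Lasso to select active loudspeakers, then an ordinary LS refit on the survivors) can be illustrated with a deliberately tiny toy problem. With an orthonormal design the Lasso solution reduces to soft-thresholding of the LS coefficients, which keeps the sketch short; all numbers are invented, and this is not the paper's EHN dictionary or pressure-matching formulation.

```python
import math

def soft(z, lam):
    """Soft-thresholding operator: the per-coordinate Lasso solution
    when the design matrix has orthonormal columns."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

# Toy problem: orthonormal (identity) design, target y, regularisation lam.
y = [3.0, 0.2, -2.5]
lam = 0.5

# Stage 1 (Lasso): shrink and select; small coefficients are driven to zero.
beta_lasso = [soft(v, lam) for v in y]
support = [i for i, b in enumerate(beta_lasso) if b != 0.0]

# Stage 2 (LS refit on the selected support): removes the Lasso shrinkage bias.
beta_ls = [y[i] for i in support]

print(support, beta_ls)  # selected columns and their unshrunk LS weights
```

In the paper's setting the "columns" are candidate loudspeaker positions and the LS stage is a regularized pressure-matching fit over the full frequency range, but the select-then-refit structure is the same.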
© 2016 Elsevier B.V. With the wide use of consumer electronics and the rapid development of online shopping, more and more ad videos are produced for IDTV and mobile users. However, the large amount of time spent on Internet advertising often gives users an uncomfortable viewing experience rather than effectively driving consumption of the advertised products. It is therefore urgent to find a viewer-friendly and advertiser-beneficial way to launch ads. This paper is the first attempt to improve the effectiveness of advertising by combining online shopping information with an ad video and directing viewers to suitable online shopping places. The proposed ActiveAd framework includes four main components. Firstly, an ad video analysis component detects both syntactic and semantic elements in ad videos, e.g. FMPIs (Frames Marked with Production Information), visual concepts, and textual keywords. This analysis provides a comprehensive solution for extracting meaningful elements from ad videos. Secondly, a visual linking by search component collects websites that contain images similar to the FMPIs. Features used for the visual search are weighted and fused to ensure the uniformity of the search results. Thirdly, the tags and product categories extracted from the collected websites are aggregated to identify representative text for the product. Finally, query keywords are selected by calculating the cosine similarity between two kinds of keywords, i.e. keywords identified from tag aggregation and keywords obtained through ad video analysis (visual concept detection and textual keyword detection). A query vector generated from the selected keywords is used to retrieve the product online. In this paper, a powerful cross-media contextual search including visual search, tag aggregation and textual search is achieved with the help of ad video analysis. Experiments demonstrate that our proposed Active...
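The cosine-similarity step for matching the two keyword sets can be sketched as below. The tiny term-weight vectors are invented for illustration and are not taken from the ActiveAd system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical keyword weights: one vector from tag aggregation,
# one from ad video analysis.
tags = {"sneaker": 2.0, "running": 1.0, "sale": 0.5}
video = {"sneaker": 1.0, "running": 1.0}
print(round(cosine(tags, video), 3))
```

Keywords scoring high under this measure in both sources would be the ones kept for the final query vector.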
Wu, L., Qiu, X., Burnett, I. & Guo, Y. 2016, 'Uncertainties of reverberation time estimation via adaptively identified room impulse responses', Journal of the Acoustical Society of America, vol. 139, no. 3, pp. 1093-1100.
Zhao, S., Cheng, E., Qiu, X., Burnett, I. & Liu, J.C.-C. 2016, 'Pressure spectra in turbulent flows in the inertial and the dissipation ranges', Journal of the Acoustical Society of America, vol. 140, no. 6, p. 4178.
Based on existing studies that derive the pressure spectra in turbulent flows from the asymptotic pressure structure function in the inertial range, this paper extends the pressure spectrum to the dissipation range by proposing a pressure structure function model that incorporates both the inertial and dissipation ranges. Existing experimental results were first used to validate the proposed pressure structure function model, and the obtained pressure spectrum was then compared with simulation and measurement data in the literature and with wind-induced noise measured outdoors. All comparisons demonstrate that the pressure spectrum obtained from the proposed model can be used to estimate pressure spectra in both the inertial and dissipation ranges in turbulent flows with a sufficiently large Reynolds number.
Wu, L., Qiu, X., Burnett, I. & Guo, Y. 2015, 'Reverberation time estimation from speech signals based on blind room impulse response identification (L)', Journal of the Acoustical Society of America, vol. 138, no. 2, pp. 731-734.
Wu, L., Qiu, X., Burnett, I.S. & Guo, Y. 2015, 'A recursive least square algorithm for active control of mixed noise', Journal of Sound and Vibration, vol. 339, pp. 1-10.
Wu, L., Qiu, X., Burnett, I.S. & Guo, Y. 2015, 'Decoupling feedforward and feedback structures in hybrid active noise control systems for uncorrelated narrowband disturbances', Journal of Sound and Vibration, vol. 350, pp. 1-10.
Zhao, S., Qiu, X. & Burnett, I. 2015, 'Acoustic contrast control in an arc-shaped area using a linear loudspeaker array (L)', Journal of the Acoustical Society of America, vol. 137, no. 2, pp. 1036-1039.
Zhao, S., Qiu, X., Cheng, E., Burnett, I., Williams, N., Burry, J. & Burry, M. 2015, 'Sound quality inside small meeting rooms with different room shape and fine structures', Applied Acoustics, vol. 93, pp. 65-74.
Zhao, S., Hu, Y., Lu, J., Qiu, X., Cheng, J. & Burnett, I. 2014, 'Delivering sound energy along an arbitrary convex trajectory', Scientific Reports, vol. 4.
Accelerating beams have attracted considerable research interest due to their peculiar properties and various applications. Although there has been extensive research on the generation and application of accelerating light beams, few results have been published on the generation of accelerating acoustic beams. Here we report on the experimental observation of accelerating acoustic beams along arbitrary convex trajectories. The desired trajectory is projected to the spatial phase profile on the boundary, which is discretized and sampled spatially. The sound field distribution is formulated with the Green function and the integral equation method. Both the paraxial and the non-paraxial regimes are examined and observed in the experiments. The effect of obstacle scattering in the sound field is also investigated, and the results demonstrate that the approach is robust against obstacle scattering. The realization of accelerating acoustic beams will have an impact on various applications where acoustic information and energy are required to be delivered along an arbitrary convex trajectory.
Cheng, B., Ritz, C., Burnett, I. & Zheng, X. 2013, 'A General Compression Approach to Multi-Channel Three-Dimensional Audio', IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 8, pp. 1676-1688.
Radmanesh, N. & Burnett, I.S. 2013, 'Generation of Isolated Wideband Sound Fields Using a Combined Two-stage Lasso-LS Algorithm', IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 378-387.
Stolar, M.N., Lech, M., Sheeber, L.B., Burnett, I.S. & Allen, N.B. 2013, 'Introducing emotions to the modeling of intra- and inter-personal influences in parent-adolescent conversations', IEEE Transactions on Affective Computing, vol. 4, no. 4, pp. 372-385.
An understanding of the dynamics underlying emotional interactions between speakers is essential to the design of effective conversational strategies for interviews, mental health therapies, teaching and counseling, as well as the design of naturalistic human-machine communication systems. The present study introduces a new approach to the modeling of emotional influences during parent-adolescent conversations. The proposed dynamic influence model (DIM) estimates the joint conditional probabilities of speakers' states as a linear combination of simpler inter- and intra-speaker conditional probabilities. Contrary to previously existing influence models (IMs), the DIM's coefficients are given not as static, constant values but as dynamically changing functions of the time delay between the current and the previous state. The speakers' states were annotated using four labels (speech with positive emotion, speech with negative emotion, emotionally neutral speech and silence with undefined emotion). Experimental results based on audio recordings of 63 different naturalistic (not acted) parent-adolescent conversations showed that the proposed method leads to psychologically plausible observations. It was also demonstrated that the proposed DIM can achieve up to 20 percent higher accuracy in discriminating between the emotional influence patterns of parents and adolescents when compared to the previously used static IM. © 2010-2012 IEEE.
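The core DIM idea, a delay-dependent mixture of inter- and intra-speaker conditional probabilities, can be caricatured in a few lines. The exponential decay and the time constant below are invented placeholders; the paper estimates its coefficient functions from conversation data rather than fixing a functional form like this.

```python
import math

def dim_probability(p_inter, p_intra, delay, tau=2.0):
    """Dynamic influence model sketch: the mixing weight between the
    inter-speaker and intra-speaker conditional probabilities is a
    function of the delay since the previous state, not a constant."""
    alpha = math.exp(-delay / tau)  # hypothetical decay of cross-speaker influence
    return alpha * p_inter + (1.0 - alpha) * p_intra

# Immediately after the other speaker's turn, their state dominates;
# after a long gap, the speaker's own history dominates.
print(dim_probability(0.8, 0.2, delay=0.0))
print(dim_probability(0.8, 0.2, delay=10.0))
```

A static IM corresponds to freezing alpha at a single constant value for all delays, which is exactly the restriction the DIM removes.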
Ebrahimi, T., Karam, L., Pereira, F., El-Maleh, K. & Burnett, I. 2011, 'The quality of multimedia: Challenges and trends [From the Guest Editors]', IEEE Signal Processing Magazine, vol. 28, no. 6.
One of the most visible consequences of progress in multimedia has been an increase in the quality of the products and services offered. Nowadays, it is no longer only a question of which features are to be included in multimedia applications, but also of how well they contribute to the quality of user experience. © 2011 IEEE.
Adistambha, K., Davis, S.J., Ritz, C.H. & Burnett, I.S. 2010, 'Efficient multimedia query-by-content from mobile devices', Computers & Electrical Engineering, vol. 36, no. 4, pp. 626-642.
Davis, S.J., Ritz, C.H. & Burnett, I.S. 2009, 'Using Social Networking and Collections to Enable Video Semantics Acquisition', IEEE Multimedia, vol. 16, no. 4, pp. 52-60.
The derivation of spatial cues representing source localisation information is a typical component of multichannel spatial audio coders. Efficient compression of spatial cues based on psychoacoustic localisation features is investigated. Results show that the proposed quantisation approach for spatial cue compression achieves bit-rates of less than 6kbit/s while preserving critical source localisation information. © The Institution of Engineering and Technology 2008.
Doeller, M., Tous, R., Gruhne, M., Yoon, K., Sano, M. & Burnett, I.S. 2008, 'The MPEG Query Format: Unifying Access to Multimedia Retrieval Systems', IEEE Multimedia, vol. 15, no. 4, pp. 82-95.
Ritz, C.H., Schiemer, G., Burnett, I.S., Cheng, E., Lock, D., Narushima, T., Ingham, S. & Wood Conroy, D. 2008, 'An Anechoic Configurable Hemispheric Environment for Spatialised Sound', Proceedings of the Australasian Computer Music Conference 2008, pp. 65-68.
Thomas-Kerr, J., Burnett, I. & Ritz, C. 2008, 'Format-independent rich media delivery using the bitstream binding language', IEEE Transactions on Multimedia, vol. 10, no. 3, pp. 514-522.
Several recent standards address virtual containers for rich multimedia content: collections of media with metadata describing the relationships between them and providing an immersive user experience. While these standards-which include MPEG-21 and TVAnytime-provide numerous tools for interacting with rich media objects, they do not provide a framework for streaming or delivery of such content. This paper presents the Bitstream Binding Language (BBL), a format-independent tool that describes how multimedia content and metadata may be bound into delivery formats. Using a BBL description, a generic processor can map rich content (an MPEG-21 digital item, for example) into a streaming or static delivery format. BBL provides a universal syntax for fragmentation and packetization of both XML and binary data, and allows new content and metadata formats to be delivered without requiring the addition of new software to the delivery infrastructure. Following its development by the authors, BBL was adopted by MPEG as Part 18 of the MPEG-21 Multimedia Framework. © 2006 IEEE.
Adistambha, K., Davis, S.J., Ritz, C.H. & Burnett, I.S. 2007, 'Query Streaming for Multimedia Query by Content from Mobile Devices', The 9th International Symposium on DSP and Communication Systems, DSPCS'2007, pp. 1-6.
Thomas-Kerr, J.A.I., Burnett, I.S., Ritz, C.H., Devillers, S., De Schrijver, D. & Van de Walle, R. 2007, 'Is that a fish in your ear? A universal metalanguage for multimedia', IEEE Multimedia, vol. 14, no. 2, pp. 72-77.
Universal Multimedia Access promises to adaptively deliver multimedia content to users according to their needs, whether that is their device, context, or preferences. Central to UMA is the development of metadata standards for describing multimedia resources to allow their adaptation. In this article, the authors report on the development of the Bitstream Syntax Description Language (BSDL) and describe applications for scalable content adaptation, format-independent streaming and delivery, and configurable media coding. © 2007 IEEE.
Amielh, M.E., Wan, E.Y., Devillers, S., Singer, D.W. & Burnett, I.S. 2006, 'MPEG-21 Part 17: URI-Based Fragment Identification of MPEG Resources', Proceedings of the IET International Conference on Visual Information Engineering (VIE 2006), pp. 120-125.
Raad, W., Burnett, I.S. & Raad, I.S. 2006, 'A Variable Length Linear Array for Smart Antenna Systems', International Conference on Information and Communication Technologies: From Theory to Applications, vol. 2, pp. 2213-2217.
Rong, L. & Burnett, I.S. 2006, 'Adaptive resource replication in a ubiquitous peer to peer based multimedia distribution environment', Consumer Communications and Networking Conference, CCNC 2006, vol. 1, pp. 65-68.
Smith, D., Lukasiak, J. & Burnett, I.S. 2006, 'An analysis of the limitations of blind signal separation application with speech', Signal Processing, vol. 86, no. 2, pp. 353-359.
Blind Signal Separation (BSS) techniques are commonly employed in the separation of speech signals, using Independent Component Analysis (ICA) as the criterion for separation. This paper investigates the viability of employing ICA for real-time speech separation (where short frame sizes are the norm). The relationship between the statistics of speech and the assumption of statistical independence (at the core of ICA) is examined over a range of frame sizes. The investigation confirms that statistical independence is not a valid assumption for speech when divided into the short frames appropriate to real-time separation. This is primarily due to the quasi-stationary nature of speech over the temporal short term. We conclude that employing ICA for real-time speech separation will always result in limited performance due to a fundamental failure to meet the strict assumptions of ICA. © 2005 Elsevier B.V. All rights reserved.
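The paper's core observation, that statistics estimated over short frames are unreliable, can be demonstrated without any speech data at all. The toy below uses Gaussian noise rather than speech and sample correlation rather than a full independence measure, so it only illustrates the flavour of the problem: even for two truly independent sources, short-frame statistics fluctuate widely.

```python
import random

def corr(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(0)
# Two truly independent "sources": over a long record their sample
# correlation is near zero, but over short frames it fluctuates widely,
# so independence-based separation criteria become unreliable.
x = [random.gauss(0, 1) for _ in range(20000)]
y = [random.gauss(0, 1) for _ in range(20000)]

frame = 64
short = [abs(corr(x[i:i + frame], y[i:i + frame]))
         for i in range(0, 6400, frame)]
print(abs(corr(x, y)), max(short))
```

For speech the situation is worse than for white noise, because quasi-stationarity adds structure within each short frame on top of this purely statistical estimation error.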
Burnett, I.S., Davis, S.J. & Drury, G.M. 2005, 'MPEG-21 digital item declaration and identification - Principles and compression', IEEE Transactions on Multimedia, vol. 7, no. 3, pp. 400-407.
Telecommunications and Information Technology Research Institute. At the core of the MPEG-21 Multimedia Framework is the concept of the Digital Item, a virtual container for a hierarchical structure of metadata and resources. This paper considers the Digital Item Declaration Language (DIDL), gives examples of its usage, and discusses how it is used to integrate other parts of MPEG-21. The paper then discusses how Digital Item Identification integrates with the DIDL to allow MPEG-21 to utilize standard identifiers from many application spaces. Finally, an alternative, compressed form of the XML Digital Item Declaration is described. This uses schema-based compression to significantly reduce the size of these XML documents. © 2005 IEEE.
The vision of MPEG-21 - the multimedia framework - is to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities. As the initial standardization effort for MPEG-21 reaches its conclusion, this article explores one case scenario and examines whether MPEG-21 can achieve its vision. © 2005 IEEE.
Smith, D., Lukasiak, J. & Burnett, I. 2005, 'Blind speech separation using a joint model of speech production', IEEE Signal Processing Letters, vol. 12, no. 11, pp. 784-787.
We propose a new blind signal separation (BSS) technique, developed specifically for speech, that exploits a priori knowledge of speech production mechanisms. In our approach, the autoregressive (AR) structure and fundamental frequency (F0) production mechanisms of speech are jointly modeled. We compare the separation performance of our joint AR-F0 algorithm to existing BSS algorithms that model either speech's AR structure or F0 individually. Experimental results indicate that the joint algorithm demonstrates superior separation performance to both the individual AR algorithm (up to 77% improvement) and the individual F0 algorithm (up to 50% improvement). This suggests that speech separation performance is improved by employing a BSS model with a more realistic description of the speech production process. © 2005 IEEE.
Adistambha, K., Ritz, C.H., Lukasiak, J. & Burnett, I.S. 2004, 'An Investigation into Embedded Audio Coding Using An AAC Perceptually Lossless Base Layer', Proceedings of Tenth Australian International Conference on Speech Science and Technology (SST2004), pp. 227-230.
Lukasiak, J., Burnett, I.S., Drury, G., Agostinho, G., Bennett, S., Lockyear, L. & Harper, B. 2004, 'A Framework for the Flexible Content Packaging of Learning Objects and Learning Designs', Journal of Educational Multimedia and Hypermedia, vol. 13, no. 4, pp. 465-481.
O'Dwyer, M.F., Potard, G. & Burnett, I.S. 2004, 'A 16-Speaker 3D Audio-Visual Display Interface and Control System', Proceedings of ICAD 04. Tenth Meeting of the International Conference on Auditory Display.
Potard, G. & Burnett, I.S. 2004, 'A 3-D Audio Scene Description Scheme Based on XML', AES 25th Annual Conference: Metadata for Audio, pp. 102-112.
Potard, G. & Burnett, I.S. 2004, 'Control and Measurement of Apparent Sound Source Width and its Applications to Sonification and Virtual Auditory Displays', Proceedings of ICAD 2004.
Potard, G. & Burnett, I.S. 2004, 'Decorrelation Techniques for the Rendering of Apparent Sound Source Width in 3D Audio Displays', Proceedings of the 7th International Conference on Digital Audio Effects (DAFx'04), pp. 280-284.
Rong, L. & Burnett, I.S. 2004, 'Application Level Session Hand-Off Management in a Ubiquitous Multimedia Environment', Proceedings of ICETE (3) 2004, pp. 223-229.
Smith, D., Lukasiak, J. & Burnett, I. 2004, 'Two channel, block adaptive audio separation using the cross correlation of time frequency information', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3195, pp. 889-897.
TIFCORR is a blind signal separation technique that is well suited to separating audio signals, requiring each signal to be sparse only in a local time-frequency region of its representation. TIFCORR can suffer from inconsistencies in mixing system estimation, so we present a modified algorithm incorporating k-means clustering to improve estimation robustness. To improve the data efficiency of TIFCORR, we also include an adaptive weighting function for mixing column estimates. These modifications transform our algorithm into a block adaptive algorithm with the ability to track time-varying mixtures. © Springer-Verlag 2004.
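The clustering step described above can be illustrated on a toy two-channel mixture: local time-frequency ratio estimates of the two observations are collected where one source dominates, then k-means groups them into the mixing-column ratios. This is a minimal sketch, not the authors' implementation; the mixing matrix, STFT settings and dominance threshold are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft
from scipy.cluster.vq import kmeans2

fs = 8000
t = np.arange(fs) / fs
# Two synthetic sources, sparse in disjoint time-frequency regions
s1 = np.sin(2 * np.pi * 440 * t) * (t < 0.5)
s2 = np.sin(2 * np.pi * 1200 * t) * (t >= 0.5)
A = np.array([[1.0, 1.0],
              [0.4, 2.0]])            # unknown mixing matrix (column ratios 0.4 and 2.0)
x1 = A[0, 0] * s1 + A[0, 1] * s2
x2 = A[1, 0] * s1 + A[1, 1] * s2

_, _, X1 = stft(x1, fs, nperseg=256)
_, _, X2 = stft(x2, fs, nperseg=256)
eps = 1e-12
# Local estimate of the channel-2 / channel-1 mixing ratio at each TF point
ratio = (X2 * np.conj(X1)).real / (np.abs(X1) ** 2 + eps)

# Keep only TF points where one source clearly dominates channel 1
mask = np.abs(X1) > 0.1 * np.abs(X1).max()
samples = ratio[mask].reshape(-1, 1)

# k-means over the ratio estimates; centroids approximate the column ratios
centroids, _ = kmeans2(samples, 2, seed=0, minit='++')
print(sorted(centroids.ravel()))      # two centroids near the true ratios 0.4 and 2.0
```

The adaptive weighting of column estimates mentioned in the abstract would then smooth these centroids across blocks to track a time-varying mixture.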
Barrett, T., Burnett, I.S. & Lukasiak, J. 2003, 'Lie Analysis of the Webster Horn Equation with Application to Audio Object Recognition', 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 217-220.View/Download from: Publisher's site
The goals and achievements of MPEG-21, an open standards-based framework for multimedia delivery and consumption, are discussed. MPEG-21 is the newest of a series of standards produced by the Moving Picture Experts Group. MPEG-21 enables the use of multimedia resources across a wide range of networks and devices. It achieves interoperability by focusing on the way in which the elements of a multimedia application infrastructure relate, integrate and interact.
Cheng, E., Lukasiak, J., Ritz, C.H. & Burnett, I.S. 2003, 'Linked Auditory/Visual Modelling with Tensegrity Structures', Proceedings of 7th International Symposium on DSP and Communication Systems (DSPCS'03) combined with the 2nd Workshop on the Internet.
Lukasiak, J., Burnett, I.S., Drury, G. & Goodes, J. 2003, 'Flexible content packaging of Learning Objects and learning Designs', Proceedings of ED Media 2003 Symposium, pp. 44-50.
The role of the internet and mobile communication systems in the development of multimedia access is discussed. The importance of the user, and not the terminal, as the final point in the multimedia consumption chain is emphasized by the attention given to universal multimedia experience (UME) as opposed to universal multimedia access (UMA). In the context of UME, the emphasis is on signal processing related developments.
Potard, G. & Burnett, I.S. 2003, 'A study on sound source apparent shape and wideness', 2003 International Conference on Auditory Display, pp. 25-28.
Raad, M., Mertins, A. & Burnett, I.S. 2003, 'Scalable to lossless audio compression based on perceptual set partitioning in hierarchical trees (PSPIHT)', IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), vol. 5, pp. 624-627.View/Download from: Publisher's site
Valf, S., Wysocki, T. & Burnett, I.S. 2003, 'Convolutional Interleaver for Unequal Protection of Turbo Codes', 7th International Symposium on DSP and Communication Systems (DSPCS'03) combined with the 2nd Workshop on the Internet, Telecommunications and Signal Processing (WITSP'03).
Barrett, T., Burnett, I.S. & Lukasiak, J. 2002, 'Analysis of Parameter Bounds for Mono Audio Signal Stream Separation using Combined PCA/ICA Techniques', Proceedings of the 9th Australian International Conference on Speech Science and Technology, pp. 562-567.
Braun, J., Burnett, I.S. & Gosbell, V.J. 2002, 'Description Schemes for Power Quality Data', Proceedings AUPEC02.
Braun, J.P., Burnett, I.S. & Gosbell, V.J. 2002, 'Software sound synthesiser as a source of power quality waveforms', Journal of Electrical and Electronics Engineering, Australia, vol. 22, no. 3, pp. 203-209.
The testing of power quality equipment in the laboratory requires waveform sources that emulate the various disturbances present in electrical networks. Arbitrary waveform generators are often used to recreate these disturbances, but their main limitation is the periodicity of their output signals. To circumvent this limitation, this paper proposes the use of the Csound sound synthesiser software package as a source of power quality waveforms. This permits the creation of real-time waveforms that are accurate and of long duration. It is also shown that this approach improves the testing of the conformance of power quality analysers to existing standards.
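The advantage of software synthesis over a looping generator can be sketched in a few lines. The sketch below uses NumPy rather than Csound, and the harmonic levels, sag depth and timing are illustrative values only, not figures from the paper.

```python
import numpy as np

fs, dur, f0 = 10_000, 2.0, 50.0       # sample rate (Hz), duration (s), mains frequency
t = np.arange(int(fs * dur)) / fs

# Fundamental plus low-level odd harmonics (illustrative distortion levels)
v = np.sin(2 * np.pi * f0 * t)
v += 0.05 * np.sin(2 * np.pi * 3 * f0 * t) + 0.03 * np.sin(2 * np.pi * 5 * f0 * t)

# Non-periodic disturbance: a 40% voltage sag between t = 0.8 s and t = 1.1 s,
# the kind of one-off event a periodically looping generator cannot reproduce
v *= np.where((t >= 0.8) & (t < 1.1), 0.6, 1.0)

# RMS inside the sag vs. the undisturbed region confirms the 40% depth
rms_sag = np.sqrt(np.mean(v[(t >= 0.85) & (t < 1.05)] ** 2))
rms_ref = np.sqrt(np.mean(v[t < 0.8] ** 2))
print(rms_sag / rms_ref)              # ~0.6
```

A real test signal would stream such samples to a power amplifier in real time, which is precisely the role Csound plays in the paper.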
Lukasiak, J. & Burnett, I.S. 2002, 'Low Delay Scalable Decomposition of Speech Waveforms', Proceedings of 6th Intl. Symposium on DSP for Communication Systems, Sydney, pp. 12-15.
Raad, M., Mertins, A. & Burnett, I.S. 2002, 'Audio Compression Using the MLT and SPIHT', Proceedings of 6th Intl. Symposium on DSP for Communication Systems, pp. 128-132.
Ritz, C. & Burnett, I.S. 2002, 'Wideband Speech Coding at 4kbps Using Waveform Interpolation', Proceedings of 6th Intl. Symposium on DSP for Communication Systems, pp. 144-148.
An implementation of waveform interpolation (WI) applied to wideband speech coding at 4 kbit/s is presented. Listening tests are presented comparing narrowband speech coded using standard coders with wideband speech coded using WI. Results show a clear preference for wideband speech over narrowband speech coded at an equivalent bit rate.
Burnett, I.S. & Ritz, C.H. 2001, 'Temporal Decomposition: A Promising approach to Low-rate Wideband Speech Compression', EUROSPEECH 2001, pp. 2315-2318.
Reducing the bit rate of waveform interpolation speech coders while maintaining the perceptual quality has recently been the focus of a great deal of research. This letter proposes a new method of slowly evolving waveform (SEW) quantization specifically targeted at low rate coding. The proposed method uses a pulse model whose parameters are implicitly contained in the quantized rapidly evolving waveform (REW) parameters, thus requiring no bits for transmission. Results indicate no degradation in perceptual speech quality when compared to that of the existing SEW quantization method. This retention of perceptual quality is in spite of a 12% reduction in the overall coder bit rate.
Lukasiak, J. & Burnett, I.S. 2001, 'Source Enhanced Linear Prediction of Speech Incorporating Simultaneously masked Spectral Weighting', Journal of Telecommunications and Information Technology, vol. 3, pp. 15-23.
Lukasiak, J., Burnett, I.S. & Ritz, C.H. 2001, 'Low Rate Speech Coding Incorporating Simultaneously Masked Spectrally Weighted Linear Prediction', EUROSPEECH 2001, pp. 1989-1992.
Raad, M., Burnett, I.S. & Mertins, A. 2001, 'Scalable audio coding employing sorted sinusoidal parameters', Proceedings of Sixth International Symposium on Signal Processing and its Applications (ISSPA 2001), vol. 1, pp. 174-177.View/Download from: Publisher's site
An investigation into low bit rate wideband speech coding for applications such as unicast streaming is presented. Wideband spectral parameters are quantised below 1 kbit/s using temporal decomposition (TD) applied to the line spectral frequencies. Quantisation using TD performs significantly better than split vector quantisation at an equivalent bit rate.
Chong, N.R., Burnett, I.S. & Chicharo, J.F. 2000, 'A new waveform interpolation coding scheme based on pitch synchronous wavelet transform decomposition', IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 345-348.View/Download from: Publisher's site
This correspondence uses a pitch synchronous wavelet transform (PSWT) as an alternative characteristic waveform decomposition method for the waveform interpolation (WI) paradigm. The proposed method has the benefit of providing additional scalability in quantization than the existing WI decomposition to meet desired quality requirements. The PSWT is implemented as a quadrature mirror filter bank and decomposes the characteristic waveform surface into a series of reduced time resolution surfaces. Efficient quantization of these surfaces is achieved by exploiting their perceptual importance and inherent transmission rate requirements. The multiresolution representation has the additional benefit of more flexible parameter quantization, allowing a more accurate description of perceptually important scales, especially at higher coding rates. The proposed PSWT-WI coder is very well suited to high quality speech storage applications. © 2000 IEEE.
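The multiresolution decomposition described above can be illustrated with a one-level two-channel QMF pair (here a Haar pair) applied along the waveform-evolution axis of a toy characteristic waveform surface. This is a hedged sketch of the idea only; the coder's actual filter bank and surface construction are more elaborate.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cycles, cycle_len = 64, 40
phase = np.linspace(0, 2 * np.pi, cycle_len, endpoint=False)

# Toy "characteristic waveform surface": one pitch-cycle waveform per row,
# made of a slowly evolving part plus a rapidly varying part
slow_env = 1 + 0.5 * np.sin(2 * np.pi * np.arange(n_cycles) / n_cycles)
surface = slow_env[:, None] * np.sin(phase)[None, :]
surface += 0.2 * rng.standard_normal((n_cycles, cycle_len))

# One-level Haar QMF pair applied ALONG the waveform-evolution axis:
low = (surface[0::2] + surface[1::2]) / np.sqrt(2)    # reduced-resolution, SEW-like surface
high = (surface[0::2] - surface[1::2]) / np.sqrt(2)   # detail, REW-like surface

# Perfect reconstruction: the synthesis bank inverts the analysis bank,
# so quantisation effort can be allocated per scale without structural loss
rec = np.empty_like(surface)
rec[0::2] = (low + high) / np.sqrt(2)
rec[1::2] = (low - high) / np.sqrt(2)
print(np.allclose(rec, surface))      # True
```

Iterating the split on the low band yields the series of reduced time-resolution surfaces that the coder quantises according to perceptual importance.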
Chong-White, N.R. & Burnett, I.S. 2000, 'Accurate, critically sampled characteristic waveform surface construction for waveform interpolation decomposition', Electronics Letters, vol. 36, no. 14, pp. 1245-1247.View/Download from: Publisher's site
Chong-White, N.R. & Burnett, I.S. 2000, 'Improved Signal Analysis and Time-Synchronous Reconstruction in Waveform Interpolation Coding', Proceedings of IEEE Speech Coding Workshop, pp. 56-58.View/Download from: Publisher's site
Lukasiak, J., Burnett, I.S., Chicharo, J.F. & Thomson, M.M. 2000, 'Linear Prediction Incorporating Simultaneous Masking', Proceedings of IEEE Int. Conf. Acoustics, Speech, Sig. Processing, vol. 3, pp. 1471-1474.View/Download from: Publisher's site
Parry, J.J., Burnett, I.S. & Chicharo, J.F. 2000, 'Language-specific phonetic structure and the quantization of the spectral envelope of speech', Speech Communication, vol. 32, no. 4, pp. 229-250.View/Download from: Publisher's site
In the design of low-bit-rate (LBR) speech coding algorithms, language variability is often considered to be of secondary importance in comparison with other operational factors such as speaker variability and noise. Given that languages differ extensively in the composition of the spectral envelope and that the quantized spectral envelope of speech represents an important part of the bit allocation in speech coding, it is surprising to find that no comprehensive studies have ever been carried out on the role of language in spectral quantization. This paper addresses this through a series of performance studies of spectral quantization carried out across a set of language families typical of global mobile telecommunications. The study considers factors of quantizer design such as the size and structure of codebooks, and the quantity of monolingual data used in codebook training. This study found that quantization distortion is not uniform across languages. It is shown that a significant difference exists in the behaviour of spectral quantization across languages, in particular the behaviour of high distortion outliers. Detailed analysis of the spectral distortion data on a phonetic level revealed that the nature of the distribution of spectral energy in phonemes influenced the behaviour of monolingual codebooks. Some explanations for codebook performance are presented as well as a set of recommendations for codebook design for multi-lingual environments.
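The per-language distortion figures in studies of this kind are typically RMS log spectral distortion between the original and quantised all-pole spectral envelopes. A minimal sketch of that measure follows; the order-1 test polynomials are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.signal import freqz

def spectral_distortion(a_ref, a_quant, n_fft=512):
    """RMS log spectral distortion (dB) between two all-pole envelopes 1/A(z)."""
    _, h1 = freqz(1.0, a_ref, worN=n_fft)
    _, h2 = freqz(1.0, a_quant, worN=n_fft)
    diff_db = 10 * np.log10(np.abs(h1) ** 2) - 10 * np.log10(np.abs(h2) ** 2)
    return float(np.sqrt(np.mean(diff_db ** 2)))

a = np.array([1.0, -0.9])              # reference LPC polynomial (illustrative)
a_q = np.array([1.0, -0.88])           # its "quantised" version
print(spectral_distortion(a, a))       # 0.0 dB for identical envelopes
print(spectral_distortion(a, a_q))     # small positive distortion in dB
```

Averaging this measure, and counting high-distortion outlier frames, per language is what reveals the non-uniform codebook behaviour reported above.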
Ritz, C.H., Burnett, I.S. & Lukasiak, J. 2000, 'Very Low Rate Speech Coding Using Temporal Decomposition and Waveform Interpolation', IEEE Speech Coding Workshop, pp. 29-31.View/Download from: Publisher's site
Chong, N.R., Burnett, I.S. & Chicharo, J.F. 1999, 'Adapting Waveform Interpolation (with Pitch-Spaced Subbands) for Quantisation', IEEE Speech Coding Workshop, pp. 96-98.View/Download from: Publisher's site
Chong, N.R., Burnett, I.S. & Chicharo, J.F. 1999, 'Low Delay Multi-Level Decomposition And Quantisation Techniques For WI Coding', Proceedings of IEEE Int. Conf. Acoustics, Speech, Sig. Processing, vol. 1, pp. 241-244.View/Download from: Publisher's site
Parry, J.J., Burnett, I.S. & Chicharo, J.F. 1999, 'Linguistic Mapping In LSF Space For Low Bit Rate Coding', IEEE Int Conf. Acoustics, Speech, Sig. Processing, vol. 2, pp. 653-656.View/Download from: Publisher's site
Chong, N.R., Burnett, I.S. & Chicharo, J.F. 1998, 'An Improved Decomposition Method For WI Using IIR Filter Banks', 5th International Conference on Spoken Language Processing, vol. 5, pp. 1799-1802.
Chong, N.R., Burnett, I.S., Chicharo, J.F. & Thomson, M.M. 1998, 'Use of the Pitch Synchronous Wavelet Transform as a New Decomposition Method for WI', IEEE Int Conf. Acoustics, Speech, Sig. Processing, vol. 1, pp. 513-516.View/Download from: Publisher's site
Parry, J.J., Burnett, I.S. & Chicharo, J.F. 1998, 'Using Linguistic Knowledge to Improve the Design of Low-Rate LSF quantisation', Proceedings of 5th Int. Conf. Spoken Language Processing, vol. 6, pp. 2599-2602.
Burnett, I.S. & Pham, D.H. 1997, 'Multi-Prototype Waveform Coding Using Frame-by-Frame Analysis-by-Synthesis', IEEE Int. Conf. on Acoust., Speech and Signal Processing '97, vol. 2, pp. 1567-1570.View/Download from: Publisher's site
Parry, J.J., Burnett, I.S. & Chicharo, J.F. 1997, 'The consequences of linguistic perception on low-rate speech coding', IEEE Int. Conf. on Acoust., Speech and Signal Processing '97, vol. 2, pp. 1383-1386.View/Download from: Publisher's site
Burnett, I.S. & Parry, J.J. 1996, 'On The Effects of Language And Accents On Low-Rate Speech Coders', Proceedings of Int. Conf. on Spoken Language Processing, vol. 1, pp. 291-294.View/Download from: Publisher's site
Gambino, P. & Burnett, I.S. 1996, 'Pitch Detection Based on Prototype Waveforms', IEEE International Symposium on Signal Processing and its Applications, vol. 1, pp. 73-76.
Gambino, P. & Burnett, I.S. 1996, 'Low-delay Pitch Detection Using Dynamic-Programming/Viterbi Techniques', IEEE International Symposium on Signal Processing and its Applications, pp. 77-80.
Pham, D.H. & Burnett, I.S. 1996, 'Quantisation Techniques for Prototype Waveforms', Proceedings of IEEE International Symposium on Signal Processing and its Applications, vol. 1, pp. 53-56.
Burnett, I.S. & Bradley, G.J. 1995, 'Low Complexity Decomposition and Coding of Prototype Waveforms', Proceedings of IEEE Workshop on Speech Coding for Telecommunications, pp. 23-24.View/Download from: Publisher's site
Burnett, I.S. & Bradley, G.J. 1995, 'New Techniques for Multi-Prototype Waveform Coding at 2.84kb/s', Proceedings of IEEE Int. Conf. on Acoust., Speech and Signal Processing, vol. 1, pp. 261-264.View/Download from: Publisher's site
Burnett, I.S. & Holbeche, R.J. 1993, 'A mixed prototype waveform/CELP coder for sub 3kb/s', IEEE Int. Conf. on Acoust., Speech and Signal Processing, vol. 2, pp. 175-178.View/Download from: Publisher's site
Sapiano, P.C., Holbeche, R.J., Burnett, I.S. & Pulley, D.R. 1992, 'Modulation recognition by neural network techniques', Proceedings of IEEE Int. Symp. on Communications.
Burnett, I.S. & Holbeche, R.J. 1991, 'The application of the DFT to CELP architectures', IEEE Workshop on Speech Coding for Telecommunications, pp. 83-84.
Ritz, C.H., Shujau, M., Zheng, X., Cheng, B., Cheng, E. & Burnett, I.S. 2011, 'Backward Compatible Spatialised Teleconferencing based on Squeezed Recordings' in Strumillo, P. (ed), Advances in Sound Localization, In-Tech, UK, pp. 363-384.View/Download from: UTS OPUS or Publisher's site
Commercial teleconferencing systems currently available, although offering sophisticated video stimulus of the remote participants, commonly employ only mono or stereo audio playback for the user. However, in teleconferencing applications where there are multiple participants at multiple sites, spatializing the audio reproduced at each site (using headphones or loudspeakers) to assist listeners to distinguish between participating speakers can significantly improve the meeting experience (Baldis, 2001; Evans et al., 2000; Ward & Elko, 1999; Kilgore et al., 2003; Wrigley et al., 2009; James & Hawksford, 2008). An example is Vocal Village (Kilgore et al., 2003), which uses online avatars to co-locate remote participants over the Internet in virtual space with audio spatialized over headphones. This system adds speaker location cues to monaural speech to create a user-manipulable soundfield that matches the avatar's position in the virtual space. Giving participants the freedom to manipulate the acoustic location of other participants in the rendered sound scene that they experience has been shown to provide improved multitasking performance (Wrigley et al., 2009).
A system for multiparty teleconferencing firstly requires a stage for recording speech from multiple participants at each site. These signals then need to be compressed to allow for efficient transmission of the spatial speech. One approach is to utilise close-talking microphones to record each participant (e.g. lapel microphones), and then encode each speech signal separately prior to transmission (James & Hawksford, 2008). Alternatively, for increased flexibility, a microphone array located at a central point on, say, a meeting table can be used to generate a multichannel recording of the meeting speech. A microphone array approach is adopted in this work and allows for processing of the recordings to identify relative spatial locations of the sources as well as multichannel sp...
Burnett, I.S. & Potard, G. 2002, 'Using XML schemas to create and encode interactive 3-D audio scenes' in Plaice, J., Kropf, P., Schulthess, P. & Slonim, J. (eds), Proceedings of DCW2002, Sydney Australia, Lecture Notes in Computer Science LNCS 2468, Springer-Verlag, Berlin Heidelberg, pp. 193-203.View/Download from: Publisher's site
Lindsay, A., Burnett, I.S., Quackenbush, S.R. & Jackson, M.A. 2002, 'Fundamentals of Audio Descriptions' in Manjunath, B.S., Salembier, P. & Sikora, T. (eds), Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley and Sons, Ltd., Chichester, Englan, pp. 283-298.
The MPEG standards are an evolving set of standards for video and audio compression. MPEG-7 technology covers the most recent developments in multimedia search and retrieval, designed to standardise the description of multimedia content, supporting a wide range of applications including DVD, CD and HDTV.
Multimedia content description, search and retrieval is a rapidly expanding research area due to the increasing amount of audiovisual (AV) data available. The wealth of practical applications available and currently under development (for example, large-scale multimedia search engines and AV broadcast servers) has led to the development of processing tools to create the description of AV material or to support the identification or retrieval of AV documents. Written by experts in the field, this book has been designed as a unique tutorial in the new MPEG-7 standard covering content creation, content distribution and content consumption. At present there are no books documenting the available technologies in such a comprehensive way.
* Presents a comprehensive overview of the principles and concepts involved in the complete range of Audio Visual material indexing, metadata description, information retrieval and browsing
* Details the major processing tools used for indexing and retrieval of images and video sequences
* Individual chapters, written by experts who have contributed to the development of MPEG 7, provide clear explanations of the underlying tools and technologies contributing to the standard
* Demonstration software offering step-by-step guidance to the multimedia system components and eXperimentation Model (XM) MPEG reference software
* Coincides with the release of the ISO standard in late 2001.
A valuable reference resource for practising electronic and communications engineers designing and implementing MPEG 7 compliant systems, as well as for researchers and students working with multimedia database technology.
Bui, H.M., Lech, M., Cheng, E., Neville, K. & Burnett, I.S. 2016, 'Using Grayscale Images for Object Recognition with Convolutional-Recursive Neural Network', 2016 IEEE Sixth International Conference on Communications and Electronics (ICCE), International Conference on Communications and Electronics (HUT-ICCE), IEEE, Vietnam, pp. 321-325.View/Download from: UTS OPUS or Publisher's site
Rajapaksha, T., Qiu, X., Cheng, E. & Burnett, I. 2016, 'Geometrical room geometry estimation from room impulse responses', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Shanghai, China, pp. 331-335.View/Download from: UTS OPUS or Publisher's site
© 2016 IEEE. Room geometry estimation from corresponding Room Impulse Responses (RIRs) has attracted much attention in recent years, and a key challenge is to find the first order image source locations from the RIRs under different environments. Unlike existing approaches, which require a priori knowledge of the room or require some ideal conditions, this paper proposes an intuitive geometrical method based on the acoustical image source model. The proposed approach does not need any a priori knowledge of the room, requiring only the RIRs from one arbitrary source location to five arbitrary receiving locations. The first order image sources of the walls in a room are identified first, then the room geometry is estimated from the wall locations using a geometrical approach. Simulations with 2D and 3D convex polyhedral rooms demonstrate the feasibility of the proposed approach, and its precision is discussed.
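The geometrical step from a localised first-order image source to a wall estimate is simple: the reflecting wall lies on the perpendicular bisector plane of the true source and its image. A minimal sketch follows; localising the image source itself (from RIR peak arrival times) is assumed to have been done already and is not shown.

```python
import numpy as np

def wall_from_image_source(source, image):
    """Infer a wall (point on plane + unit normal) from a first-order image source.
    The wall is the perpendicular bisector plane of the source and its image."""
    source = np.asarray(source, dtype=float)
    image = np.asarray(image, dtype=float)
    midpoint = (source + image) / 2.0        # a point lying on the wall
    normal = image - source                  # wall normal points from source to image
    return midpoint, normal / np.linalg.norm(normal)

# A wall at x = 5 mirrors a source at (2, 1, 1) to an image source at (8, 1, 1)
point, n = wall_from_image_source([2, 1, 1], [8, 1, 1])
print(point, n)                              # plane through (5, 1, 1), normal (1, 0, 0)
```

Repeating this for each identified first-order image source yields the set of wall planes whose intersection gives the convex room geometry.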
Rao, T., Xu, M., Liu, H., Wang, J. & Burnett, I. 2016, 'Multi-scale blocks based image emotion classification using multiple instance learning', Proceedings - International Conference on Image Processing, ICIP, IEEE International Conference on Image Processing, IEEE, Phoenix, Arizona, USA.View/Download from: UTS OPUS or Publisher's site
Emotional factors usually affect users' preferences for and evaluations of images. Although affective image analysis attracts increasing attention, three major challenges remain: 1) it is difficult to classify an image into a single emotion type, since different regions within an image can represent different emotions; 2) there is a gap between low-level features and high-level emotions; and 3) it is difficult to collect a training set of reliable emotional image content. To address these three issues, we propose an emotion classification method based on multi-scale blocks using Multiple Instance Learning (MIL). We first extract blocks of an image at multiple scales using two image segmentation methods, pyramid segmentation and simple linear iterative clustering (SLIC), and represent each block using the bag-of-visual-words (BoVW) method. Then, to bridge the 'affective gap', probabilistic latent semantic analysis (pLSA) is employed to estimate the latent topic distribution as a mid-level representation of each block. Finally, MIL, which reduces the need for exact labelling, is employed to classify the dominant emotion type of the image. Experiments carried out on three widely used datasets demonstrate that our proposed method with SLIC improves the state-of-the-art results of image emotion classification by 5.1% on average.
Zhao, S., Cheng, E., Qiu, X., Burnett, I. & Liu, J.C.C. 2016, 'Estimation of the frequency boundaries of the inertial range for wind noise spectra in anechoic wind tunnels', Proceedings - 2nd Australasian Acoustical Societies Conference, ACOUSTICS 2016, Conference of the Australian Acoustical Society, AAS, Brisbane, Australia, pp. 1187-1196.View/Download from: UTS OPUS
Wind noise generated by the intrinsic turbulence in the flow can affect outdoor noise measurements. Various attempts have been made to investigate the wind noise generation mechanism. Wind noise spectra in anechoic wind tunnels can be divided into three frequency regions. In the low frequency region, known as the energy-containing range, the wind noise spectrum does not change significantly with frequency. In contrast, in the middle frequency region (or inertial range) the decay rate of the wind noise spectrum curve follows the -7/3 power law, while in the high frequency region (or dissipation range) the decay rate of the wind noise spectrum curve is faster than the -7/3 power law. The boundaries of the -7/3 power law frequency range depend on the Reynolds number; however, no exact value is known according to current literature. This paper proposes a method for predicting the boundary values based on the energy cascade theory. Large eddy simulations of a free jet were performed to validate the proposed method, and the results were found to be in reasonable agreement with existing experimental measurements obtained in an anechoic wind tunnel. Additional simulations were also conducted with different inflow entrance sizes to further verify the predictions from the proposed method.
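The idea of locating the inertial range can be illustrated numerically: on a log-log plot, find the band where the local spectral slope matches -7/3. The synthetic spectrum shape, boundary values and slope tolerance below are all illustrative assumptions, not the paper's energy-cascade model.

```python
import numpy as np

f = np.logspace(0, 4, 400)                    # frequency axis, 1 Hz to 10 kHz
f1, f2 = 50.0, 2000.0                         # "true" range boundaries (assumed)

# Synthetic wind-noise pressure spectrum: flat energy-containing range,
# f^(-7/3) inertial range, steeper roll-off in the dissipation range
S = np.where(f < f1, 1.0,
     np.where(f < f2, (f / f1) ** (-7 / 3),
              (f2 / f1) ** (-7 / 3) * (f / f2) ** (-6.0)))

# Local log-log slope via finite differences on the log axes
slope = np.gradient(np.log(S), np.log(f))

# Bins whose slope matches the -7/3 law within a tolerance
inertial = np.abs(slope - (-7 / 3)) < 0.3
print(f[inertial][0], f[inertial][-1])        # recovered boundary estimates (Hz)
```

On measured spectra the same slope-matching idea needs smoothing (e.g. Welch averaging) before differentiation, since raw periodogram bins are far too noisy for a finite-difference slope.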
Choy, S.-M., Chiu, K.-H., Cheng, E. & Burnett, I. 2015, '3D Fatigue from Stereoscopic 3D Video Displays: Comparing Objective and Subjective Tests using Electroencephalography', Proceedings of TENCON 2015 - 2015 IEEE Region 10 Conference, IEEE Tencon (IEEE Region 10 Conference), IEEE, Macao, pp. 1-4.View/Download from: Publisher's site
The use of stereoscopic display has increased in recent times, with a growing range of applications using 3D videos for visual entertainment, data visualization, and medical applications. However, stereoscopic 3D video can lead to adverse reactions amongst some viewers, including visual fatigue, headache and nausea; such reactions can further lead to Visually Induced Motion Sickness (VIMS). Whilst motion sickness symptoms can occur from other types of visual displays, this paper investigates the rapid adjustment triggered by human pupils as a potential cause of 3D fatigue due to VIMS from stereoscopic 3D displays. Using Electroencephalogram (EEG) biosignals and eye blink tools to measure the 3D fatigue, a series of objective and subjective experiments were conducted to investigate the effect of stereoscopic 3D across a series of video sequences.
Qiu, X., Cheng, E., Burnett, I., Williams, N., Burry, J. & Burry, M. 2015, 'Preliminary study on the speech privacy performance of the Fabpod', Acoustics 2015 Hunter Valley, Conference of the Australian Acoustical Society, Australian Acoustical Society, Sydney, Australia.
This paper reports preliminary measurement results characterising the speech privacy performance of an open ceiling meeting room called the Fabpod at RMIT University, where the Speech Privacy Class standardized in ASTM E2638 was adopted to rate the speech privacy performance. The background sound pressure level inside and outside the Fabpod, and the sound pressure level differences at different locations inside and outside the Fabpod with different sound source locations, were measured in one third octave bands from 50 Hz to 10000 Hz. Based on the measurement results, the Speech Privacy Class of the Fabpod was calculated. The conclusion is that the Fabpod cannot meet the normal speech privacy criteria, and meetings inside the Fabpod can easily be overheard from outside. Speech privacy is affected by many factors, including the speech attenuation from the sound source to the receiver and the level of the background noise. The speech attenuation from the sound source to the receiver depends on the height of the wall or barrier, the sound absorption coefficient of the ceiling and the distance between the sound source and receiver. To achieve acceptable speech privacy for the Fabpod, all design parameters have to be tuned to near optimum values. The measures that can be used to increase the speech privacy of the Fabpod are discussed.
Sharma, S., Cheng, E. & Burnett, I.S. 2015, 'A Simple Objective Method for Automatic Error Detection in Stereoscopic 3D Video', Proceedings for the Big Data Visual Analytics (BDVA), 2015, Big Data Visual Analytics, IEEE, Hobart, TAS, pp. 119-121.View/Download from: Publisher's site
With the increased popularity of 3D videos online and through consumer and cinema media, there exist few techniques for the automatic detection of stereoscopic error in 3D videos. Further, techniques based on disparity estimation are imprecise and computationally complex. This paper proposes a simple objective method to detect common errors inherent to stereoscopic 3D content due to discrepant objects between the left and the right view of the image pairs, stereoscopic window violation and undesirably high binocular disparity that causes viewing discomfort. The technique proposed in this paper identifies stereoscopic errors by computing only the edge disparity, which is computationally less expensive and uses simplified methods that may be optimised for real-time computation. Evaluations of the proposed technique are conducted on a series of stereoscopic 3D videos containing common errors, where regions that contain a range of different errors are successfully and clearly identified.
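The edge-only disparity idea can be sketched as follows: compute cheap horizontal-gradient edge maps for both views, collapse them into per-column edge profiles, and find the shift that best aligns them. The synthetic image pair, the plain-gradient edge detector and the comfort threshold are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def edge_disparity(left, right, max_shift=40):
    """Estimate the dominant horizontal disparity from edge maps alone,
    a cheap proxy for dense disparity estimation."""
    # Horizontal-gradient edge maps, collapsed into per-column edge profiles
    ep_l = np.abs(np.diff(left.astype(float), axis=1)).sum(axis=0)
    ep_r = np.abs(np.diff(right.astype(float), axis=1)).sum(axis=0)
    # Score each candidate shift by correlating the two edge profiles
    shifts = np.arange(-max_shift, max_shift + 1)
    scores = [np.dot(ep_l, np.roll(ep_r, -s)) for s in shifts]
    return int(shifts[int(np.argmax(scores))])

# Synthetic stereo pair: a bright square shifted 12 px between the views
left = np.zeros((100, 200))
left[40:60, 80:120] = 1.0
right = np.roll(left, 12, axis=1)

d = edge_disparity(left, right)
print(d)                    # 12
print(abs(d) > 30)          # False: within an assumed viewing-comfort limit
```

Running the same estimator per image row (rather than globally) would flag the window violations and discrepant objects the abstract describes, still without a dense disparity map.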
Smith, D., Lukasiak, J. & Burnett, I.S. 2004, 'A two channel, block-adaptive audio separation technique based upon time-frequency information', 12th European Signal Processing Conference, European Signal Processing Conference, IEEE, Vienna, Austria, pp. 393-396.View/Download from: UTS OPUS
© 2004 EUSIPCO. TIFROM [1, 2] is a two channel separation technique well suited to separating audio signals, and in particular dependent signals that fall outside the scope of conventional BSS applications. One problem with TIFROM, however, is degraded performance due to inconsistent estimation of the mixing system. To reduce these inconsistencies, we present a modified algorithm that incorporates k-means clustering and normalised variance, improving significantly upon TIFROM estimation results. To improve TIFROM data efficiency, we also include a weighting (running average) function for mixing column estimates. This transforms our modified algorithm into a block based adaptive algorithm with the ability to track a slowly time-varying mixture.
Stolar, M.N., Lech, M. & Burnett, I.S. 2014, 'Optimized multi-channel deep neural network with 2D graphical representation of acoustic speech features for emotion recognition', 2014 8th International Conference on Signal Processing and Communication Systems, ICSPCS 2014 - Proceedings, International Conference on Signal Processing and Communication Systems, Institute of Electrical and Electronics Engineers Inc., Gold Coast, Australia, pp. 1-6.View/Download from: Publisher's site
This study investigates the effectiveness of speech emotion recognition using a new approach called the Optimized Multi-Channel Deep Neural Network (OMC-DNN). The proposed method has been tested with input features given as simple 2D black and white images representing graphs of the MFCC coefficients or the TEO parameters calculated either from speech (MFCC-S, TEO-S) or glottal waveforms (MFCC-G, TEO-G). A comparison with 6 different single-channel benchmark classifiers has shown that the OMC-DNN provided the best performance in both pair-wise (emotion vs. neutral) and simultaneous multiclass recognition of 7 emotions (anger, boredom, disgust, happiness, fear, sadness and neutral). In the pair-wise case, the OMC-DNN outperformed the single-channel DNN by 5%-10% depending on the feature set. In the multiclass case, the OMC-DNN outperformed or matched the single-channel equivalents for all features. The speech spectrum and the glottal energy characteristics were identified as two important factors in discriminating between different types of categorical emotions in speech.
Stolar, M.N., Lech, M. & Burnett, I.S. 2015, 'Prediction of Emotional States in Parent-Adolescent Conversations using Non-Linear Autoregressive Neural Networks', Proceeidngs of the 2015 9th International Conference on Signal Processing and Communication Systems (ICSPCS), International Conference on Signal Processing and Communication Systems, IEEE, Cairns, Australia, pp. 1-6.View/Download from: UTS OPUS or Publisher's site
This study investigates an application of nonlinear autoregressive (NAR) models to the prediction of the most likely time series of emotional state transitions of speakers engaged in dyadic conversations. While previous methods analyzed each speaker in isolation, the new approach proposes to couple both speakers into a nonlinear recursive predictive neural network system (NARX-NN). The NARX-NN system was tested and compared with its uncoupled version (NAR-NN). The tests were conducted using speech recordings from 63 parent-child dyads including 29 depressed and 34 non-depressed adolescent children, 14-18 years of age. The conversations were conducted on three different topics. The NARX-NN outperformed the NAR-NN method in all experimental scenarios and across all topics of conversation. Predictions of emotional states for depressed children led to higher accuracy than the predictions for non-depressed children. Modeling with class and/or speaker dependency improved the results compared to the class and/or speaker independent models.
Wang, X., Cheng, E. & Burnett, I.S. 2015, 'Improved (STEM) cell segmentation with histogram matching image contrast enhancement', 2015 IEEE China Summit and International Conference on Signal and Information Processing, ChinaSIP 2015 - Proceedings, IEEE China Summit and International Conference on Signal and Information Processing, IEEE, Chengdu, China, pp. 816-820.View/Download from: Publisher's site
© 2015 IEEE. The tracking of moving biological cells in time-lapse video sequences is fundamental to further understanding biological processes. Automatic cell tracking techniques require accurate cell image segmentation; however, current segmentation techniques are susceptible to errors due to non-ideal but realistic cell image conditions, including the low contrast typical of cell microscopic images. This paper proposes a novel image pre-processing technique to enhance the low grayscale image contrast for improved cell image segmentation accuracy. A shifted bi-Gaussian model is matched to the original cell image intensity histogram for greater differentiation between the cell foreground and image background, whilst maintaining the original intensity histogram shape. Experiments conducted on a stem cell time-lapse image database show up to 33% improved segmentation accuracy, in some frames (partially or completely) detecting cells that manual ground-truth and/or existing segmentation approaches fail to identify.
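The histogram-matching idea in this abstract can be sketched as follows; the bi-Gaussian means, widths, sample counts and the size of the mode shift are all invented for illustration and are not the paper's fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated low-contrast cell image intensities: a background mode and a
# dimmer, overlapping foreground (cell) mode -- values are illustrative.
img = np.concatenate([rng.normal(100, 8, 5000),     # background pixels
                      rng.normal(120, 8, 1000)])    # foreground pixels

def match_to_target(values, target_samples):
    """Classic CDF-based histogram matching against target samples."""
    src_sorted = np.sort(values)
    tgt_sorted = np.sort(target_samples)
    ranks = np.searchsorted(src_sorted, values, side='left') / len(values)
    return np.interp(ranks, np.linspace(0, 1, len(tgt_sorted)), tgt_sorted)

# Shifted bi-Gaussian target: same two-mode shape, but with the modes
# pushed apart so foreground and background separate after matching.
target = np.concatenate([rng.normal(70, 8, 5000), rng.normal(180, 8, 1000)])
enhanced = match_to_target(img, target)
```

Because the target keeps the two-mode shape of the source histogram, the mapping stretches the contrast between the modes without destroying the histogram's overall structure.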
Woolford, S. & Burnett, I.S. 2015, 'Multiview 3D Profilometry Using Resonance Based Decomposition And 3-Phase Shift Profilometry', International Conference on Experimental Mechanics 2014.View/Download from: Publisher's site
Qiu, X., Gao, M. & Burnett, I. 2014, 'A comparison between adaptive ANC algorithms with and without cancellation path modelling', 21st International Congress on Sound and Vibration 2014, ICSV 2014, 21st International Congress on Sound and Vibration, pp. 122-129.View/Download from: UTS OPUS
The adaptive filters in active noise control (ANC) systems differ from other common adaptive filters in the existence of the cancellation path, which is the transfer function between the outputs of the adaptive control filters and the error sensors. Cancellation paths play a critical role in active noise control systems, and the corresponding adaptive algorithms usually require the information of the cancellation paths for updating the control filters. The most commonly used filtered-x LMS algorithm takes into account the cancellation paths by filtering the reference signal with an estimate of the cancellation path transfer functions. For many ANC applications, the cancellation path modelling must be carried out online to maintain the stability of the system, and one modelling method obtains the cancellation path information by injecting uncorrelated signal into the cancellation path. This paper will introduce the filtered-x LMS algorithm embedded with this online cancellation path modelling and the direction search LMS algorithm, which is one of the ANC algorithms that do not need an explicit model of the cancellation path. In the direction search LMS algorithm, the standard LMS algorithm is adopted to update the adaptive filter coefficients directly with the reference signal by automatically choosing a proper update direction based on the monitoring of the excess noise power. The performance of the two typical adaptive ANC algorithms, one with and one without cancellation path modelling, will be compared in terms of noise reduction level, tracking speed, computation load and robustness.
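A minimal single-channel filtered-x LMS sketch of the kind this abstract describes, assuming the cancellation path is already known exactly (i.e. no online modelling); the path coefficients, filter length and step size below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

P = np.array([0.0, 0.9, 0.4])    # primary path (assumed)
S = np.array([0.8, 0.3])         # cancellation path, assumed known exactly
L, mu = 8, 0.01                  # control filter length and step size

x = rng.standard_normal(4000)               # reference signal
d = np.convolve(x, P)[:len(x)]              # disturbance at the error sensor
xf = np.convolve(x, S)[:len(x)]             # reference filtered through S

w = np.zeros(L)                             # adaptive control filter
xbuf = np.zeros(L); fbuf = np.zeros(L); sbuf = np.zeros(len(S))
err = np.zeros(len(x))
for n in range(len(x)):
    xbuf = np.roll(xbuf, 1); xbuf[0] = x[n]
    y = w @ xbuf                            # control filter output
    sbuf = np.roll(sbuf, 1); sbuf[0] = y
    e = d[n] - S @ sbuf                     # residual after the cancel path
    fbuf = np.roll(fbuf, 1); fbuf[0] = xf[n]
    w += mu * e * fbuf                      # filtered-x LMS update
    err[n] = e
```

The key FxLMS detail is visible in the update line: the weight adaptation uses the reference filtered through the cancellation-path estimate (`fbuf`), not the raw reference, which is exactly the dependence on cancellation-path information discussed above.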
Stolar, M.N., Lech, M. & Burnett, I.S. 2014, 'Using the influence model coefficients and the random walk to predict emotional interactions in parent-child conversations', 2014, 8th International Conference on Signal Processing and Communication Systems, ICSPCS 2014 - Proceedings, International Conference on Signal Processing and Communication Systems, IEEE, Gold Coast, Australia.View/Download from: Publisher's site
© 2014 IEEE. This study introduces an interactive random walk as a new method for predicting sequences of four different construct states (positive emotion, negative emotion, neutral emotion and silence) of speakers in parent-child conversations. The proposed approach used the emotional transition probability arrays and the Influence Model (IM) coefficients to support the interacting random walk predictions. The interactive random walk was applied to generate sequences of speakers' states using higher order emotional transition probabilities. The new approach was tested on 63 different parent-child conversations conducted in naturalistic (not-acted) way. The prediction outcomes were visualized using the 2D random walk on a graph approach. The prediction quality was measured using the relative error between the actual and the predicted transition probabilities as well as, the error between the actual and the predicted end-point position on the 2D graph of emotional states. A comparison between the proposed random walk supported by the IM coefficients and the classical approach without the IM coefficients showed that proposed method generally offers improved results in terms of the prediction error and the endpoint position but at the cost of slower convergence rates.
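The transition-driven random walk over the four construct states can be sketched as below; the transition probabilities are invented for illustration (the paper's arrays are estimated from the conversation data and further supported by Influence Model coefficients, which this toy omits):

```python
import numpy as np

rng = np.random.default_rng(3)

# Four construct states and a hypothetical first-order transition array.
states = ['positive', 'negative', 'neutral', 'silence']
T = np.array([[0.6, 0.1, 0.2, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.2, 0.2, 0.5, 0.1],
              [0.2, 0.2, 0.3, 0.3]])

def walk(T, start, steps, rng):
    """Generate a state sequence by sampling each transition row."""
    seq = [start]
    for _ in range(steps):
        seq.append(rng.choice(len(T), p=T[seq[-1]]))
    return seq

seq = walk(T, start=2, steps=4000, rng=rng)

# Empirical transition probabilities approach T for a long walk, which is
# the sense in which the predicted sequence reproduces the modelled dynamics.
counts = np.zeros_like(T)
for a, b in zip(seq[:-1], seq[1:]):
    counts[a, b] += 1
T_hat = counts / counts.sum(axis=1, keepdims=True)
```

The paper's relative-error quality measure compares exactly such empirical transition probabilities against the actual ones.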
Woolford, S. & Burnett, I.S. 2014, 'A multi-view profilometry system using RGB channel separated fringe patterns and unscented Kalman filter', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), International Symposium on Visual Computing 2014, pp. 683-694.
© Springer International Publishing Switzerland 2014. In this paper a one-shot method to determine the shape of an object from overlapping cosine fringes projected from multiple projectors is presented. This overcomes the limitation of single projector systems that do not allow imaging the entire object with a single shot. The proposed method projects orthogonal fringe patterns of different colours from different projectors and uses colour channel isolation and Fourier domain filtering to isolate the fringes. An Unscented Kalman Filter and smoother are used to demodulate the fringe pattern, and do not rely on a strictly sinusoidal fringe pattern for good results. Sources of error are discussed and their effects on the resulting parameter estimation are shown, as well as methods to reduce their impact. The proposed method is tested on simulations and real world objects and is shown to be effective in isolating interfering fringes and determining the shape of an object with non-sinusoidal fringe input, as opposed to Fourier Transform Profilometry.
Woolford, S. & Burnett, I.S. 2014, 'Toward a one shot multi-projector profilometry system for full field of view object measurement', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Florence, Italy, pp. 569-573.View/Download from: UTS OPUS or Publisher's site
In this paper a one-shot method to determine the shape of an object from overlapping cosine fringes projected from multiple projectors is presented. This overcomes the limitation of single projector systems that do not allow imaging the entire object with a single shot. The proposed method projects orthogonal fringe patterns from different projectors and uses Fourier domain filtering to isolate the fringes, which are demodulated using an unscented particle filter. Sources of error are discussed and their effects on the resulting parameter estimation are shown, as well as methods to reduce their impact. The proposed method is tested on simulations and real world objects and is shown to be effective in isolating interfering fringes and determining the shape of an object. © 2014 IEEE.
Wu, L., Qiu, X., Burnett, I.S., Cheng, E. & Guo, Y. 2014, 'A decoupled hybrid structure for active noise control with uncorrelated narrowband disturbances', INTERNOISE 2014 - 43rd International Congress on Noise Control Engineering: Improving the World Through Noise Control, INTERNOISE 2014 - 43rd International Congress on Noise Control Engineering: Improving the World Through Noise Control.View/Download from: UTS OPUS
In real active noise control (ANC) applications, two situations frequently occur: disturbances that are present only at the error sensor and have low correlation with the reference signal, and installations where there is not enough space, or no ideal position, for locating the reference sensor to satisfy the causality condition. In these situations the residual noise after feedforward control can be treated as uncorrelated narrowband disturbances, and a hybrid adaptive feedforward and feedback structure is often utilized to cope with this problem. Much effort has been devoted to improving the performance of hybrid ANC systems; nevertheless, little attention has been paid to the method of combining the feedforward and feedback structures. After investigating the conventional combination method for hybrid feedforward and feedback systems, this paper introduces an alternate combination method for hybrid ANC systems which avoids coupling between the feedforward and feedback structures; both structures are concatenated to attenuate the ambient noise. Simulations are carried out to validate the effectiveness of the introduced method for ANC with uncorrelated narrowband disturbances.
Ling, L., Cheng, E. & Burnett, I.S. 2013, 'An Iterated Extended Kalman Filter for 3D mapping via Kinect camera', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Vancouver, BC, Canada, pp. 1773-1777.View/Download from: UTS OPUS or Publisher's site
This paper proposes the use of the Iterated Extended Kalman Filter (IEKF) in a real-time 3D mapping framework applied to Microsoft Kinect RGB-D data. Standard EKF techniques typically used for 3D mapping are susceptible to errors introduced during the state prediction linearization and measurement prediction. When models are highly nonlinear due to measurement errors e.g., outliers, occlusions and feature initialization errors, the errors propagate and directly result in divergence and estimation inconsistencies. To prevent linearized error propagation, this paper proposes repetitive linearization of the nonlinear measurement model to provide a running estimate of camera motion. The effects of iterated-EKF are experimentally simulated with synthetic map and landmark data on a range and bearing camera model. It was shown that the IEKF measurement update outperforms the EKF update when the state causes nonlinearities in the measurement function. In the real indoor environment 3D mapping experiment, more robust convergence behavior for the IEKF was demonstrated, whilst the EKF updates failed to converge. © 2013 IEEE.
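The iterated measurement update this abstract contrasts with the plain EKF can be sketched on a simplified 2-D range-bearing model (a stand-in for the paper's camera model; the landmark, prior and noise values are invented):

```python
import numpy as np

def h(x, lm):
    """Range-bearing measurement of landmark lm from position x."""
    d = lm - x
    return np.array([np.hypot(d[0], d[1]), np.arctan2(d[1], d[0])])

def H(x, lm):
    """Jacobian of h with respect to x."""
    dx, dy = lm - x
    r = np.hypot(dx, dy)
    return np.array([[-dx / r, -dy / r],
                     [dy / r**2, -dx / r**2]])

def iekf_update(x0, P, z, R, lm, iters):
    """Iterated EKF measurement update; iters=1 reduces to the plain EKF.
    Each pass re-linearises h() about the running estimate."""
    x = x0.copy()
    for _ in range(iters):
        Hx = H(x, lm)
        K = P @ Hx.T @ np.linalg.inv(Hx @ P @ Hx.T + R)
        x = x0 + K @ (z - h(x, lm) - Hx @ (x0 - x))
    return x

lm = np.array([10.0, 0.0])                  # landmark position (assumed)
truth = np.array([0.0, 0.0])                # true position
x0 = np.array([3.0, 4.0])                   # poor prior mean
P = 25.0 * np.eye(2)                        # diffuse prior covariance
R = np.diag([0.01, 0.001])                  # accurate range/bearing noise
z = h(truth, lm)                            # noise-free measurement for clarity

ekf = iekf_update(x0, P, z, R, lm, iters=1)
iekf = iekf_update(x0, P, z, R, lm, iters=10)
```

With a poor prior mean the single EKF linearisation leaves a large residual error, while re-linearising about the running estimate pulls the state close to the true position, mirroring the convergence behaviour reported above.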
Radmanesh, N. & Burnett, I.S. 2013, 'Effectiveness of horizontal personal sound systems for listeners of variable heights', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Vancouver, BC, Canada, pp. 316-320.View/Download from: UTS OPUS or Publisher's site
Standard surround systems for generation of isolated wideband soundfields employ a uniformly-spaced array of speakers in the horizontal plane. For these systems, the evaluation of sound reproduction with height is important due to listeners' variable heights. Previous work demonstrated that controlling both the speakers' locations and their complex weights using two-stage Lasso-LS pressure matching optimization allows isolated sound reproduction with a limited number of speakers within the speakers' plane. This work demonstrates that deployment of this technique can also give up to 24 dB in suppression of sound at heights between zero and one meter from the speakers' plane over single-stage LS using, e.g., 90 speakers in a semicircular array. © 2013 IEEE.
Shujau, M., Ritz, C.H. & Burnett, I.S. 2013, 'Speech dereverberation based on Linear Prediction: An Acoustic Vector Sensor approach', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Vancouver, BC, Canada, pp. 639-643.View/Download from: UTS OPUS or Publisher's site
This paper introduces a dereverberation algorithm based on Linear Prediction (LP) applied to the outputs of an Acoustic Vector Sensor (AVS). The approach applies adaptive beamforming to take advantage of the directional outputs of the AVS array to obtain a more accurate LP spectrum than can be obtained with a single channel or Uniform Linear Array (ULA) with a comparable number of channels. This is then used within a modified version of the Spatiotemporal Averaging Method for Enhancement of Reverberant Speech (SMERSH) algorithm derived for the AVS to enhance the LP residual signal. In a highly reverberant environment, the approach demonstrates a significant improvement compared to a ULA as measured by both the Signal to Reverberant Ratio (SRR) and Speech to Reverberation Modulation Energy Ratio (SRMR) for sources ranging from 1 m to 5 m from the array. © 2013 IEEE.
Woolford, S. & Burnett, I.S. 2013, 'A novel one shot object profilometry system using Direct Sequence Spread Spectrum profilometry', 2013 IEEE 11th IVMSP Workshop: 3D Image/Video Technologies and Applications, IVMSP 2013 - Proceedings, IEEE IVMSP Workshop - D Image/Video Technologies and Applications, IEEE, Seoul, South Korea.View/Download from: UTS OPUS or Publisher's site
In this paper a new method of determining 3D object shape using patterns derived from Direct Sequence Spread Spectrum (DSSS) and an Unscented Kalman Filter (UKF) is presented. First a binary message is encoded via Binary Phase Shift Keying (BPSK), and spread using pseudo-random spreading to create a pattern. An Iterative Unscented Kalman Filter (IUKF) is then used to determine the deformation in the pattern due to an object, and a Kalman smoother is used to reduce noise in the deformation estimation. Results show that the iterative UKF is able to determine the deformation in the pattern with a lower absolute error residual between the ground truth and estimated deformation than the non-iterated UKF. Results of the accompanying Cramer-Rao lower bounds show that the lower bound on the DSSS Pattern is lower than that of the fringe pattern. © 2013 IEEE.
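The pattern-construction stage this abstract outlines (BPSK mapping followed by pseudo-random spreading) can be sketched as below; the message length, code length and noise level are invented, and despreading by correlation stands in for the paper's UKF-based deformation estimation:

```python
import numpy as np

rng = np.random.default_rng(4)

# BPSK-map a binary message, spread each symbol with a pseudo-random chip
# sequence to form the projected pattern, then recover the message by
# correlating a noisy observation with the same chips (despreading).
bits = rng.integers(0, 2, 16)               # binary message
symbols = 2 * bits - 1                      # BPSK: {0,1} -> {-1,+1}
chips = 2 * rng.integers(0, 2, 31) - 1      # pseudo-random spreading code
pattern = (symbols[:, None] * chips[None, :]).ravel()

noisy = pattern + 0.5 * rng.standard_normal(pattern.size)
despread = noisy.reshape(-1, chips.size) @ chips / chips.size
decoded = (despread > 0).astype(int)
```

Averaging over the 31 chips of each symbol suppresses the noise by roughly the square root of the spreading factor, which is what makes the spread pattern robust to deformation-estimation noise.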
Adistambha, K., Davis, S., Ritz, C.H., Stirling, D. & Burnett, I.S. 2012, 'Toward human motion search using fingerprinting', 2012 International Symposium on Communications and Information Technologies, ISCIT 2012, International Symposium on Communications and Information Technologies, IEEE Xplore, Gold Coast, QLD, Australia, pp. 1033-1038.View/Download from: UTS OPUS or Publisher's site
This paper investigates a 'fingerprinting' technique for describing human motion sequences. This work shows that human motion fingerprints can facilitate the search of human motion within large databases, similar to the fingerprinting approach used for the search of audio and image databases. This paper investigates the extraction of a reliable set of features from human motion capture data sequences that can be combined to generate a unique fingerprint. Results show that the fingerprints could be used to reliably differentiate between unique motions. © 2012 IEEE.
Cheng, E., Burton, P., Burton, J., Joseski, A. & Burnett, I. 2012, 'RMIT3DV: Pre-announcement of a creative commons uncompressed HD 3D video database', 2012 4th International Workshop on Quality of Multimedia Experience, QoMEX 2012, International Workshop on Quality of Multimedia Experience (QoMEX), IEEE, Yarra Valley, VIC, Australia, pp. 212-217.View/Download from: UTS OPUS or Publisher's site
There has been much recent interest, both from industry and research communities, in 3D video technologies and processing techniques. However, with the standardisation of 3D video coding well underway and researchers studying 3D multimedia delivery and users' quality of multimedia experience in 3D video environments, there exist few publicly available databases of 3D video content. Further, there are even fewer sources of uncompressed 3D video content for flexible use in a number of research studies and applications. This paper thus presents a preliminary version of RMIT3DV: an uncompressed HD 3D video database currently composed of 31 video sequences that encompass a range of environments, lighting conditions, textures, motion, etc. The database was natively filmed on a professional HD 3D camera, and this paper describes the 3D film production workflow in addition to the database distribution and potential future applications of the content. The database is freely available online via the creative commons license, and researchers are encouraged to contribute 3D content to grow the resource for the (HD) 3D video research community. © 2012 IEEE.
Davis, S., Cheng, E., Ritz, C. & Burnett, I. 2012, 'Ensuring Quality of Experience for markerless image recognition applied to print media content', 2012 4th International Workshop on Quality of Multimedia Experience, QoMEX 2012, International Workshop on Quality of Multimedia Experience (QoMEX), IEEE, Yarra Valley, VIC, Australia, pp. 158-163.View/Download from: UTS OPUS or Publisher's site
This paper investigates how minimal user interaction paradigms and markerless image recognition technologies can be applied to matching print media content to online digital proofs. By linking print material to online content, users can enhance their experience with traditional forms of print media with updated online content, videos, interactive online features etc. The proposed approach is based on extracting features from images/text from mobile device camera images to form fingerprints that are used to find matching images/text within a limited test set. An important criterion for these applications is to ensure that the user Quality of Experience (QoE), particularly in terms of matching accuracy and time, is robust to a variety of conditions typically encountered in practical scenarios. In this paper, the performance of a number of computer vision techniques that extract the image features and form the fingerprints are analysed and compared. Both computer simulation tests and mobile device experiments in realistic user conditions are conducted to study the effectiveness of the techniques when considering scale, rotation, blur and lighting variations typically encountered by a user. © 2012 IEEE.
Ling, L., Burnett, I.S. & Cheng, E. 2012, 'A dense 3D reconstruction approach from uncalibrated video sequences', Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2012, IEEE International Conference on Multimedia and Expo, IEEE, Melbourne, VIC, Australia, pp. 587-592.View/Download from: UTS OPUS or Publisher's site
Current approaches for 3D reconstruction from feature points of images are classed as sparse and dense techniques. However, the sparse approaches are insufficient for surface reconstruction since only sparsely distributed feature points are present. Further, existing dense reconstruction approaches require pre-calibrated camera orientation, which limits their applicability and flexibility. This paper proposes a one-stop 3D reconstruction solution that reconstructs a highly dense surface from an uncalibrated video sequence; the camera orientations and surface reconstruction are simultaneously computed from new dense point features using an approach motivated by Structure from Motion (SfM) techniques. Further, this paper presents a flexible automatic method with the simple interface of 'videos to 3D model'. These improvements are essential to practical applications in 3D modeling and visualization. The reliability of the proposed algorithm has been tested on various data sets and the accuracy and performance are compared with both sparse and dense reconstruction benchmark algorithms. © 2012 IEEE.
Radmanesh, N. & Burnett, I.S. 2012, 'Wideband sound reproduction in a 2D multi-zone system using a combined two-stage Lasso-LS algorithm', Sensor Array and Multichannel Signal Processing Workshop (SAM), 2012 IEEE 7th, IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), IEEE, Hoboken, NJ, USA, pp. 453-456.View/Download from: UTS OPUS or Publisher's site
This paper addresses the provision of personal soundfields (zones) to multiple listeners in a space for which there are several fixed, virtual sources. For such multizone systems, optimization of speaker positions and weightings is important to reduce the number of active speakers. In this paper a two stage pressure matching optimization is proposed for wideband sound sources (such as speech signals). In the first stage, the least-absolute shrinkage and selection operator (Lasso) is used to select the speakers' positions for all sources and frequencies. A second stage then optimizes reproduction using all selected speakers on the basis of regularized least-squares (LS) or Lasso algorithm. The performance of these new, two-stage approaches is investigated for different speaker numbers, frequency range and reproduction angles. Results show that a limited arc of speakers using a two-stage optimization can give up to 38dB improvement in zone normalized squared error over a single-stage LS optimization.
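The two-stage idea above can be sketched on a toy pressure-matching problem; the transfer matrix, the true sparse speaker weights and the regularisation weight are invented, and a short ISTA loop stands in for a full Lasso solver:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy setup: 40 control-point pressures driven by 90 candidate speakers,
# of which only three are truly active (all values illustrative).
G = rng.standard_normal((40, 90))
w_true = np.zeros(90)
w_true[[3, 40, 77]] = [1.0, -0.8, 0.5]
p_d = G @ w_true                             # desired zone pressures

def ista(G, p, lam, iters=2000):
    """Plain ISTA for the Lasso: min 0.5*||G w - p||^2 + lam*||w||_1."""
    step = 1.0 / np.linalg.norm(G, 2) ** 2
    w = np.zeros(G.shape[1])
    for _ in range(iters):
        g = w + step * (G.T @ (p - G @ w))
        w = np.sign(g) * np.maximum(np.abs(g) - lam * step, 0.0)
    return w

# Stage 1: Lasso selects the active speaker positions.
active = np.flatnonzero(np.abs(ista(G, p_d, lam=1.0)) > 1e-3)
# Stage 2: unregularised LS refit over only the selected speakers.
w_ls = np.zeros(90)
w_ls[active] = np.linalg.lstsq(G[:, active], p_d, rcond=None)[0]
residual = np.linalg.norm(G @ w_ls - p_d)
```

The LS refit removes the Lasso's shrinkage bias on the selected speakers, which is the reason the two-stage scheme outperforms a single regularised solve.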
Rainer, B., Waltl, M., Cheng, E., Shujau, M., Timmerer, C., Davis, S., Burnett, I., Ritz, C. & Hellwagner, H. 2012, 'Investigating the impact of sensory effects on the Quality of Experience and emotional response in web videos', 2012 4th International Workshop on Quality of Multimedia Experience, QoMEX 2012, International Workshop on Quality of Multimedia Experience (QoMEX), IEEE, Yarra Valley, VIC, Australia, pp. 278-283.View/Download from: UTS OPUS or Publisher's site
Multimedia is ubiquitously available online with large amounts of video increasingly consumed through Web sites such as YouTube or Google Video. However, online multimedia typically limits users to visual/auditory stimulus, with onscreen visual media accompanied by audio. The recent introduction of MPEG-V proposed multi-sensory user experiences in multimedia environments, such as enriching video content with so-called sensory effects like wind, vibration, light, etc. In MPEG-V, these sensory effects are represented as Sensory Effect Metadata (SEM), which is additionally associated to the multimedia content. This paper presents three user studies that utilize the sensory effects framework of MPEG-V, investigating the emotional response of users and enhancement of Quality of Experience (QoE) of Web video sequences from a range of genres with and without sensory effects. In particular, the user studies were conducted in Austria and Australia to investigate whether geography and cultural differences affect users' elicited emotional responses and QoE. © 2012 IEEE.
Cheng, E. & Burnett, I.S. 2011, 'On the effect of AMR and AMR-WB GSM compression on overlapped speech for forensic analysis', 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Prague, Czech Republic, pp. 1872-1875.View/Download from: UTS OPUS or Publisher's site
The recent ubiquity of mobile telephony has posed the challenge of forensic speech analysis on compressed speech content. Whilst existing research studies have investigated the effect of mobile speech compression on speaker and speech parameters, this paper addresses the effect of speech compression on parameters when an interfering background speaker is present in clean and noisy conditions. Preliminary evaluations presented in this paper study the effect of the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) speech coders on the Linear Prediction (LP) speech spectrum, Line Spectral Frequencies (LSFs), and Mel Frequency Cepstral Coefficients (MFCCs). Results indicate that due caution should be employed for the forensic analysis of mobile telephony speech: speech coder parameters are significantly degraded when an interfering speaker or noise is present, compared to parameters obtained from the main speaker alone. Moreover, at high SNR the speech parameters exhibit values that gradually transition from those ideally and independently obtained from the main speaker to those of the background speaker as the amplitude of the background interfering speaker increases. © 2011 IEEE.
Cheng, E. & Burnett, I.S. 2011, 'Reproduction of independent narrowband soundfields in a multizone surround system and its extension to speech signal sources', 2011 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Prague, pp. 461-464.View/Download from: UTS OPUS or Publisher's site
Cheng, E., Davis, S., Burnett, I. & Ritz, C. 2011, 'An ambient multimedia user experience feedback framework based on user tagging and EEG biosignals', Proceedings of the 4th International Workshop on Semantic Ambient Media Experience, SAME 2011, in Conjunction with the 5th International Convergence on Communities and Technologies, pp. 61-66.
Multimedia is increasingly accessed online and within social networks; however, users are typically limited to visual/auditory stimulus through media presented onscreen with accompanying audio over speakers. Whilst recent research studying additional ambient sensory multimedia effects recorded numerical scores of perceptual quality, the users' time-varying emotional response to the ambient sensory feedback is not considered. This paper thus introduces a framework to evaluate user ambient quality of multimedia experience and discover users' time-varying emotional responses through explicit user tagging and implicit EEG biosignal analysis. In the proposed framework, users interact with the media via discrete tagging activities whilst their EEG biosignal emotional feedback is continuously monitored in-between user tagging events with emotional states correlated with media content and tags. Copyright © (2011) by International Ambient Media Association (iAMEA).
Davis, S., Cheng, E., Burnett, I. & Ritz, C. 2011, 'Multimedia user feedback based on augmenting user tags with EEG emotional states', 2011 3rd International Workshop on Quality of Multimedia Experience, QoMEX 2011, International Workshop on Quality of Multimedia Experience (QoMEX), IEEE, Mechelen, Belgium, pp. 143-148.View/Download from: UTS OPUS or Publisher's site
Efficient content-based access to large multimedia collections requires annotations that are human-meaningful, and user tagging of media is one means to obtain such semantic metadata. Tags can also act as user feedback essential for quality of multimedia experience assessment; however, tags can lack user context and become ambiguous between different users. Further, user tagging is a deliberate and discrete event where a user's response to the media can significantly vary in-between tagging events. This paper extends upon the authors' social multimedia adaptation framework to explore the use of EEG biosignals obtained from consumer EEG headsets to form context around explicit tagging activities and as user emotional feedback in-between user tagging events. Preliminary user studies investigating grouped participant responses indicate the most indicative emotional states to be short-term excitement, engagement and frustration in addition to gyroscope information. © 2011 IEEE.
Ling, L., Burnett, I.S. & Cheng, E. 2011, 'A flexible markerless registration method for video augmented reality', MMSP 2011 - IEEE International Workshop on Multimedia Signal Processing, IEEE International Workshop on Multimedia Signal Processing, IEEE, Hangzhou, China.View/Download from: UTS OPUS or Publisher's site
This paper proposes a flexible, markerless registration method that addresses the problem of realistic virtual object placement at any position in a video sequence. The registration consists of two steps: four points are specified by the user to build the world coordinate system, where the virtual object is rendered. A self-calibration camera tracking algorithm is then proposed to recover the camera viewpoint frame-by-frame, such that the virtual object can be dynamically and correctly rendered according to camera movement. The proposed registration method needs no reference fiducials, knowledge of camera parameters or the user environment, where the virtual object can be placed in any environment even without any distinct features. Experimental evaluations demonstrate low errors for several camera motion rotations around the X and Y axes for the self-calibration algorithm. Finally, virtual object rendering applications in different user environments are evaluated. © 2011 IEEE.
Ling, L., Cheng, E. & Burnett, I.S. 2011, 'Eight solutions of the essential matrix for continuous camera motion tracking in video augmented reality', Proceedings - 2011 IEEE International Conference on Multimedia and Expo (ICME), IEEE International Conference on Multimedia and Expo, IEEE, LaSalle, Ramon Llull University Barcelona, Spain.View/Download from: UTS OPUS or Publisher's site
This paper considers a self-calibration approach to the estimation of motion parameters for an unknown camera used for video-based augmented reality. Whilst existing systems derive four SVD solutions of the essential matrix, which encodes the epipolar geometry between two camera views, this paper presents eight possible solutions derived from mathematical computation and geometrical analysis. The eight solutions not only reflect the position and orientation of the camera in static displacement but also the dynamic, relative orientation between the camera and an object in continuous motion. This paper details a novel algorithm that introduces three geometric constraints to determine the rotation and translation matrix from the eight possible essential matrix solutions. An OpenGL camera motion simulator is used to demonstrate and evaluate the reliability of the proposed algorithms; this directly visualizes the abstract computer vision parameters into real 3D. © 2011 IEEE.
Radmanesh, N. & Burnett, I.S. 2011, 'Reproduction of independent narrowband soundfields in a multizone surround system and its extension to speech signal sources', 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 461-464.
Shujau, M., Ritz, C.H. & Burnett, I.S. 2011, 'Linear predictive perceptual filtering for acoustic vector sensors: Exploiting directional recordings for high quality speech enhancement', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Prague, Czech Republic, pp. 5068-5071.
This paper investigates the performance of a new technique for speech enhancement which applies Linear Predictive (LP) spectrum-based perceptual filtering to the recordings obtained from an Acoustic Vector Sensor (AVS). The technique takes advantage of the directional polar responses of the AVS to obtain a significantly more accurate representation of the LP spectrum of a target speech signal in the presence of noise when compared to single channel, omni-directional recordings. Comparisons between the speech quality obtained from the proposed technique and existing beamforming-based speech enhancement techniques for the AVS are made through Perceptual Evaluation of Speech Quality (PESQ) tests and Mean Opinion Score (MOS) listening tests. Results show significant improvements in PESQ and MOS scores of 0.2 and 1.6, respectively, for the proposed enhancement technique. Being based on a miniature microphone array, the approach is particularly suitable for hands-free communication applications in mobile telephony. © 2011 IEEE.
Shujau, M., Ritz, C.H. & Burnett, I.S. 2011, 'Separation of speech sources using an acoustic vector sensor', MMSP 2011 - IEEE International Workshop on Multimedia Signal Processing, IEEE International Workshop on Multimedia Signal Processing, IEEE, Hangzhou, China.
This paper investigates how the directional characteristics of an Acoustic Vector Sensor (AVS) can be used to separate speech sources. The technique described in this work takes advantage of the frequency domain direction of arrival estimates to identify the location, relative to the AVS array, of each individual speaker in a group of speakers and separate them accordingly into individual speech signals. Results presented in this work show that the technique can be used for real-time separation of speech sources using a single 20 ms frame of speech. Furthermore, the results show an average improvement in Signal to Interference Ratio (SIR) of 15.1 dB over the unprocessed recordings and an average improvement of 5.4 dB in Signal to Distortion Ratio (SDR). In addition to the SIR and SDR results, Perceptual Evaluation of Speech Quality (PESQ) and listening tests both show an improvement in perceptual quality of 1 Mean Opinion Score (MOS) over unprocessed recordings. © 2011 IEEE.
Cheng, E., Davis, S., Burnett, I. & Ritz, C. 2010, 'The Role of Experts in Social Media - Are the Tertiary Educated Engaged?', Proceedings of the 2010 IEEE International Symposium on Technology and Society: Social Implications of Emerging Technologies, pp. 205-212.
Davis, S.J., Cheng, E.C., Burnett, I.S. & Ritz, C.H. 2010, 'Multimedia adaptation based on semantics from social network users interacting with media', 2010 2nd International Workshop on Quality of Multimedia Experience, QoMEX 2010 - Proceedings, pp. 170-175.
A key goal of adaptive multimedia delivery is to provide users with content that maximizes their quality of experience. To achieve this goal, adaptive multimedia systems require descriptions of the content and user preference information, moving beyond traditional criteria such as quality of service requirements or perceptual quality based on traditional metrics. Media is increasingly consumed within online social networks, and multimedia sharing websites can also add a wealth of metadata. In this paper, mechanisms for gathering semantics that relate to user preferences when interacting with media content in social networks are proposed. Subjective results indicate the proposed mechanisms can successfully provide information about user and social group media preferences that can be used for adapting multimedia for improved user quality of experience. ©2010 IEEE.
Neville, K., Burton, P. & Burnett, I. 2010, 'A Second Life virtual studio as an online teaching environment', ASEE Annual Conference and Exposition, Conference Proceedings.
In this paper the development of a virtual learning environment in Second Life is detailed. The learning environment described is in the form of a virtual television studio for use in multimedia engineering courses, with an example implementation described for RMIT University's offshore campus. This paper details the problems associated with offshore learning and lists the requirements needed for creating an effective virtual learning environment for these offshore students. This paper also discusses the steps taken to create this virtual environment in the virtual world Second Life and the problems that have been faced due to hardware and software limitations in this particular virtual world. Finally, the steps to be taken to evaluate the effectiveness of this type of learning environment will be outlined. © American Society for Engineering Education, 2010.
Shujau, M., Ritz, C.H. & Burnett, I.S. 2010, 'Speech enhancement via separation of sources from co-located microphone recordings', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 137-140.
This paper investigates multichannel speech enhancement for co-located microphone recordings based on Independent Component Analysis (ICA). Comparisons are made between co-located microphone arrays containing microphones with mixed polar responses and traditional uniform linear arrays formed from omni-directional microphones. It is shown that the polar responses of the microphones are a key factor in the performance of ICA applied to co-located microphones. Results from PESQ testing show a significant improvement in speech quality of ICA separated sources as a result of using an AVS rather than other types of microphone arrays. ©2010 IEEE.
Shujau, M., Ritz, C.H. & Burnett, I.S. 2010, 'Using in-air acoustic vector sensors for tracking moving speakers', 4th International Conference on Signal Processing and Communication Systems, ICSPCS'2010 - Proceedings.
This paper investigates the use of an Acoustic Vector Sensor (AVS) for tracking a moving speaker in real time through estimation of the Direction of Arrival (DOA). This estimation is obtained using the MUltiple SIgnal Classification (MUSIC) algorithm applied on a time-frame basis. The performance of the AVS is compared with a SoundField Microphone, which has similar polar responses to the AVS, using time-frames ranging from 20 ms to 1 s. Results show that for 20 ms frames, the AVS is capable of estimating the DOA for both mono-tone and speech signals, both stationary and moving, with an accuracy of approximately 1.6° and less than 5° in azimuth for stationary and moving speech sources, respectively. The results also show that the DOA estimates using the SoundField microphone are significantly less accurate than those obtained from the AVS. Furthermore, the results suggest that for estimating the DOA for speech sources, a Voice Activity Detector (VAD) is critical to ensure accurate azimuth estimation. ©2010 IEEE.
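As an illustrative aside, the MUSIC algorithm named above can be sketched for a simulated narrowband uniform linear array; the array geometry, steering model and noise levels here are simplifying assumptions and not the paper's AVS setup:

```python
import numpy as np

def steering(theta, n_mics, spacing=0.5):
    """Plane-wave steering vector for a uniform linear array.
    theta is azimuth in radians; spacing is in wavelengths."""
    phase = 2 * np.pi * spacing * np.sin(theta) * np.arange(n_mics)
    return np.exp(1j * phase)

def music_spectrum(snapshots, n_sources, angles):
    """MUSIC pseudospectrum from array snapshots (mics x frames)."""
    n_mics = snapshots.shape[0]
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]  # sample covariance
    _, vecs = np.linalg.eigh(R)               # eigenvalues in ascending order
    En = vecs[:, : n_mics - n_sources]        # noise-subspace eigenvectors
    proj = En @ En.conj().T                   # projector onto noise subspace
    return np.array([1.0 / np.real(steering(t, n_mics).conj() @ proj @ steering(t, n_mics))
                     for t in angles])

# Simulated single source at 20 degrees observed by a 4-mic array
rng = np.random.default_rng(0)
a = steering(np.deg2rad(20.0), 4)
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
x = np.outer(a, s) + 0.01 * (rng.standard_normal((4, 200)) + 1j * rng.standard_normal((4, 200)))
grid = np.deg2rad(np.arange(-90.0, 91.0))
doa = np.rad2deg(grid[np.argmax(music_spectrum(x, 1, grid))])
```

The pseudospectrum peaks where the steering vector is nearly orthogonal to the noise subspace, so `doa` lands at (or very near) the true 20° bearing.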
Smith, D., Cheng, E. & Burnett, I.S. 2010, 'Musical onset detection using MPEG-7 audio descriptors', 20th International Congress on Acoustics 2010, ICA 2010 - Incorporating Proceedings of the 2010 Annual Conference of the Australian Acoustical Society, pp. 4036-4042.
An onset detection system that exploits MPEG-7 audio descriptors is proposed in this paper, with investigations into the feasibility of MPEG-7 based onset detection performed across a diverse database of music. Detection functions were developed from both individual MPEG-7 descriptors and combinations of descriptors (joint detection functions). The results indicated that individual descriptors could achieve respectable detection performance (maximum F-measure of 0.753) with basic waveform features. Average detection performance could, however, be improved by up to 11.2% when joint detection functions combined diverse sets of MPEG-7 descriptors. This may be attributed to the increased capability of detection functions composed of different spectral and temporal features in capturing the variation in onset characteristics across different musical styles. It is thus concluded that the proposed onset detection system could plausibly be integrated into an existing MPEG-7 audio analysis system with minimal computational overhead.
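The general onset-detection recipe used in work like this — compute a frame-level detection function, threshold it, and peak-pick — can be sketched as follows. Spectral flux stands in here for the MPEG-7 descriptors; the frame size, hop and threshold rule are assumptions, not the paper's configuration:

```python
import numpy as np

def spectral_flux_onsets(x, frame=512, hop=256, k=1.5):
    """Onset sample positions via half-wave-rectified spectral flux,
    a global mean + k*std threshold, and simple peak picking."""
    win = np.hanning(frame)
    frames = np.array([x[i:i + frame] * win
                       for i in range(0, len(x) - frame, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    # positive spectral change between consecutive frames
    flux = np.maximum(mags[1:] - mags[:-1], 0.0).sum(axis=1)
    thresh = flux.mean() + k * flux.std()
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > thresh and flux[i] >= flux[i - 1] and flux[i] >= flux[i + 1]]
    return [(p + 1) * hop for p in peaks]   # approximate sample positions

# Two sine bursts in silence: detected onsets should fall near 4000 and 12000
t = np.arange(2000)
burst = np.sin(2 * np.pi * 440 * t / 16000)
x = np.zeros(16000)
x[4000:6000] = burst
x[12000:14000] = burst
onsets = spectral_flux_onsets(x)
```

Because the flux is half-wave rectified, note offsets (energy decreases) do not trigger detections, only onsets do.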
Cheng, B., Ritz, C. & Burnett, I. 2009, 'Spatial audio coding by squeezing: Analysis and application to compressing multiple soundfields', European Signal Processing Conference, pp. 909-913.
Spatially Squeezed Surround Audio Coding (S³AC), proposed by the authors, provides efficient compression of multi-channel surround audio. Compression is achieved by exploiting human sound localisation blur to store the surround soundfield information in a squeezed stereo soundfield. In this paper, the localisation loss during S³AC analysis/synthesis is evaluated and the minimum size of the S³AC squeezed soundfield is derived in a frequency-dependent form. Results from perceptual listening tests show that, compared with standard squeezing from a 360° surround soundfield to a 60° stereo soundfield, a more intensive squeezing method, such as 360°-to-5°, does not introduce audible localisation distortion. This leads to a further application of S³AC for compressing more than one surround soundfield into a single stereo downmix for applications such as spatialised teleconferencing. This application is also described and perceptually evaluated. © EURASIP, 2009.
Cheng, E., Burnett, I.S. & Ritz, C. 2009, 'The effect of microphone directivity patterns on spatial cues for reverberant multichannel meeting speech analysis', European Signal Processing Conference, pp. 2181-2185.
Multiparty meetings common to many business environments often have participants who are generally stationary. Hence, active speakers can be disambiguated by location, and meeting analysis research groups have proposed the use of speaker location information (spatial cues) for meeting segmentation and higher level analysis. As the cues are estimated from multi-microphone recordings, this paper studies the effect of varying microphone directivity patterns on the spatial cue accuracy and reliability. Results from theoretical simulations and recordings from a real reverberant environment suggest that different spatial cues (based on inter-microphone signal time delays or amplitude level differences) optimally respond to different microphone directivity patterns, where time delay accuracy was found to be independent of the relative microphone configuration. © EURASIP, 2009.
Shujau, M., Ritz, C.H. & Burnett, I.S. 2009, 'Designing acoustic vector sensors for localisation of sound sources in air', European Signal Processing Conference, pp. 849-853.
This paper investigates the design and application of an Acoustic Vector Sensor (AVS), traditionally used for underwater applications, for localisation of sound sources in air. The paper investigates the relationship between the design factors and the accuracy of Direction of Arrival (DOA) estimates for sources of varying frequency and compares the performance of an AVS with a Uniform Linear Array (ULA) of comparable size. Results show that the design of the AVS is critical in achieving polar responses that result in DOA estimates of high accuracy. For our proposed design, DOA estimates were found to be more accurate for a range of source frequencies when compared to an existing AVS design and were significantly more accurate than those obtained from a ULA of comparable size. © EURASIP, 2009.
Thomas-Kerr, J., Ritz, C. & Burnett, I.S. 2009, 'Semantic-aware delivery of multimedia', 2009 9th International Symposium on Communications and Information Technology, ISCIT 2009, pp. 1498-1503.
This paper describes a system that is able to take arbitrary semantic metadata, and utilize it in the multimedia delivery decision-making process. Format independence is achieved using schema languages to describe the details of any given content or metadata, so that declarative mapping rules can be specified for translating from format-specific data points to format-independent concepts that are directly used by the framework. The system utilizes the criterion of "semantic-distortion", as an extension of Rate-Distortion Optimization based multimedia delivery. Several short video clips were encoded using H.264/SVC scalable video coding, and Scalable-To-Lossless (SLS) audio coding and adapted to four target bit rates. Subjective tests found a 72% preference for those clips which had been adapted so as to devote more bandwidth to the semantically important parts of the content when compared with standard objective-based bit-rate adaptation. ©2009 IEEE.
Adistambha, K., Ritz, C.H. & Burnett, I.S. 2008, 'Motion classification using dynamic time warping', Proceedings of the 2008 IEEE 10th Workshop on Multimedia Signal Processing, MMSP 2008, pp. 622-627.
Automatic generation of metadata is an important component of multimedia search-by-content systems as it both avoids the need for manual annotation as well as minimising subjective descriptions and human errors. This paper explores the automatic attachment of basic descriptions (or 'Tags') to human motion held in a motion-capture database on the basis of a Dynamic Time Warping (DTW) approach. The captured motion is held in the Acclaim ASF/AMC format commonly used in game and movie motion capture work and the approach allows for the comparison and classification of motion from different subjects. The work analyses the bone rotations important to a small set of movements and results indicate that only a small set of examples is required to perform reliable motion classification. © 2008 IEEE.
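A minimal sketch of the DTW alignment at the core of this kind of classifier, with plain 1-D sequences standing in for the paper's bone-rotation channels; the nearest-neighbour labelling helper is a hypothetical illustration, not the paper's exact procedure:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences,
    allowing non-linear time alignment between them."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # best of stretching a, stretching b, or a one-to-one match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify(query, labelled_examples):
    """Label of the DTW-nearest template (hypothetical helper)."""
    return min(labelled_examples, key=lambda ex: dtw_distance(query, ex[1]))[0]

# A time-stretched "walk" still matches the walk template exactly under DTW,
# which is why few examples suffice for classifying motions at varying speeds
walk = [0, 1, 2, 3, 2, 1, 0]
jump = [0, 4, 0, 4, 0]
slow_walk = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
label = classify(slow_walk, [("walk", walk), ("jump", jump)])
```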
Cheng, B., Ritz, C. & Burnett, I. 2008, 'A spatial squeezing approach to ambisonic audio compression', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 369-372.
Spatially Squeezed Surround Audio Coding (S³AC) has been previously shown to provide efficient coding with perceptually accurate soundfield reconstruction when applied to ITU 5.1 multichannel audio. This paper investigates the application of S³AC to the coding of Ambisonic audio recordings. Traditional Ambisonics achieves compression and backward compatibility through the use of the UHJ matrixing approach to obtain a stereo signal. In this paper the relationship to Ambisonic B-format signals is described, and alternative approaches that derive a stereo or mono downmix signal based on S³AC are presented and evaluated. The mono-downmix approach utilizes side information consisting of spatial cues that are quantized based on novel source localization listening experiments. Objective and subjective tests demonstrate significant improvements in the localization of sound sources resulting from decoding the compressed B-format signals to 5.1 speaker playback. ©2008 IEEE.
Cheng, B., Ritz, C.H. & Burnett, I.S. 2008, 'Binaural reproduction of Spatially Squeezed Surround Audio', International Conference on Signal Processing Proceedings, ICSP, pp. 506-509.
Spatially Squeezed Surround Audio Coding (S³AC) has been previously proposed as an efficient approach to multi-channel spatial audio coding with stereo/mono backward compatibility. This paper presents a binaural reproduction scheme that exploits source localisation information in the S³AC squeezed soundfield, or S³AC cues, to simulate a surround audio scene over headphones. The approach utilises interpolated HRTFs to bring the S³AC advantages of accurate, localised sound sources to stereo headphone systems. The integration of the HRTF approach to reproduction also exploits human localisation ability to reduce interpolation complexity. Subjective experiments demonstrate that accurate localisation is achieved from binaural interpolated HRTF playback when compared to multi-channel playback. © 2008 IEEE.
Cheng, E., Burnett, I.S. & Ritz, C.H. 2008, 'Multivariate autoregressive modelling of multichannel reverberant speech', Proceedings of the 2008 IEEE 10th Workshop on Multimedia Signal Processing, MMSP 2008, pp. 945-949.
Recent research in speech localization and dereverberation introduced processing of the multichannel linear prediction (LP) residual of speech recorded with multiple microphones. This paper investigates the novel use of intra- and inter-channel speech prediction by proposing the use of a multichannel LP model derived from multivariate autoregression (MVAR), where current LP approaches are based on univariate autoregression (AR). Experiments were conducted on simulated anechoic and reverberant synthetic speech vowels and real speech sentences; results show that, especially at low reverberation times, the MVAR model exhibits greater prediction gains from the residual signal, compared to residuals obtained from univariate AR models for individually or jointly modelled speech channels. In addition, the MVAR model more accurately models the speech signal when compared to univariate LP of a similar prediction order and when fewer microphones are deployed. © 2008 IEEE.
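The core idea — a channel driven by another channel's past is predicted better by the joint (MVAR) model than by a univariate AR model of that channel alone — can be sketched with a single least-squares fit that covers both cases. The coupled two-channel toy signal below is an assumption for illustration, not the paper's speech data:

```python
import numpy as np

def fit_ar(X, order):
    """Least-squares (M)VAR fit. X is (channels, samples); with one
    channel this reduces to ordinary univariate AR. Returns the
    stacked coefficient matrix and the mean residual power."""
    C, N = X.shape
    Phi = np.array([np.concatenate([X[:, t - k] for k in range(1, order + 1)])
                    for t in range(order, N)])      # lagged regressors
    Y = X[:, order:].T                              # one-step-ahead targets
    A, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    resid = Y - Phi @ A
    return A, float(np.mean(resid ** 2))

# Channel 2 is partly driven by channel 1's past: the joint model can
# exploit this cross-channel dependence, a univariate model cannot.
rng = np.random.default_rng(1)
n = 4000
x = np.zeros((2, n))
e = 0.1 * rng.standard_normal((2, n))
for t in range(1, n):
    x[0, t] = 0.8 * x[0, t - 1] + e[0, t]
    x[1, t] = 0.5 * x[1, t - 1] + 0.9 * x[0, t - 1] + e[1, t]
_, mvar_power = fit_ar(x, 2)          # joint two-channel fit
_, uni_power = fit_ar(x[1:2], 2)      # channel 2 modelled alone
```

Here `mvar_power` approaches the true innovation power, while `uni_power` retains the unexplained cross-channel contribution, mirroring the prediction-gain comparison reported in the abstract.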
Cheng, E., Cheng, B., Ritz, C. & Burnett, I.S. 2008, 'Spatialized teleconferencing: Recording and 'Squeezed' rendering of multiple distributed sites', Proceedings of the 2008 Australasian Telecommunication Networks and Applications Conference, ATNAC 2008, pp. 411-416.
Teleconferencing systems are becoming increasingly realistic and pleasant for users interacting with geographically distant meeting participants. Video screens display a complete view of the remote participants, using technology such as wraparound or multiple video screens. However, the corresponding audio does not offer the same sophistication: often only a mono or stereo track is presented. This paper proposes a teleconferencing audio recording and playback paradigm that captures the spatial location of the geographically distributed participants for rendering of the remote soundfields at the users' end. Utilizing standard 5.1 surround sound playback, this paper proposes a surround rendering approach that 'squeezes' the multiple recorded soundfields from remote teleconferencing sites to assist the user to disambiguate multiple speakers from different participating sites. © 2008 IEEE.
Hellerud, E., Burnett, I., Solvang, A. & Svensson, U.P. 2008, 'Encoding higher order ambisonics with AAC', Audio Engineering Society - 124th Audio Engineering Society Convention 2008, pp. 501-508.
In this work we explore a simple method for reducing the bit rate needed for transmitting and storing Higher Order Ambisonics (HOA). The HOA B-format signals are simply encoded using Advanced Audio Coding (AAC) as if they were individual mono signals. Wave field simulations show that, by allocating more bits to the lower-order signals than to the higher orders, the resulting error is very low in the sweet spot but increases as a function of distance from the centre. Encoding the higher-order signals at a low bit rate does not reduce audio quality; the spatial information is improved when higher-order channels are included, even if these are encoded at a low bit rate.
Adistambha, K., Doeller, M., Tous, R., Gruhne, M., Sano, M., Tsinaraki, C., Christodoulakis, S., Yoon, K., Ritz, C.H. & Burnett, I.S. 2007, 'The MPEG-7 query format: A new standard in progress for multimedia query by content', ISCIT 2007 - 2007 International Symposium on Communications and Information Technologies Proceedings, pp. 479-484.
In recent years, the amount of Internet accessible digital audiovisual media files has vastly increased. Therefore the need to describe the media (by way of metadata) has also increased significantly. MPEG-7 (finalized in 2001) provides a comprehensive and rich metadata standard for the description of multimedia content. Unfortunately, a standardized query format does not exist for MPEG-7, or other multimedia metadata. Such a standard would provide for communications between querying clients and databases, supporting cross-modal and cross-media retrieval. The ISO/IEC SC29WG11 committee decided therefore to contribute to this application space by adding such functionality as a new part of the MPEG-7 series of standards. In response to a Call for Proposals, six proposals were submitted. This paper describes the strengths of each proposal as well as the resulting draft standard for the MPEG-7 query format. © 2007 IEEE.
Adistambha, K., Ritz, C.H. & Burnett, I.S. 2007, 'MQF: An XML based multimedia query format', Proceedings of the 2007 IEEE International Conference on Multimedia and Expo, ICME 2007, pp. 264-267.
MQF is a new XML based format designed to facilitate communication between disparate systems for applications involving multimedia query by content. Currently no standardized protocol exists which is able to provide flexibility in the formulation of a query, such as combining any multimedia format (image, video, sound) as query terms with complex query conditions that can utilize a hierarchy of meta-search engines. In this work, we propose MQF as a flexible solution to serve as a communication format between a client and server for use in content based multimedia searching. © 2007 IEEE.
Cheng, B., Ritz, C. & Burnett, I. 2007, 'Encoding independent sources in spatially squeezed surround audio coding', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 804-813.
Spatially Squeezed Surround Audio Coding (S³AC) was introduced as an approach to multi-channel audio compression which specifically aims to preserve original source localization information. In this paper, extensions to S³AC that allow for the accurate coding of independent spatial sources overlapped in both frequency and time are described; these use compact side information. An evaluation of the coder applied to tone and band-pass spatial sources shows that S³AC offers improved source localization performance while maintaining bit-rates, when compared with other state-of-the-art spatial audio coders. © Springer-Verlag Berlin Heidelberg 2007.
Cheng, B., Ritz, C. & Burnett, I. 2007, 'Principles and analysis of the squeezing approach to low bit rate spatial audio coding', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings.
This paper presents a novel solution to multichannel spatial audio coding: Spatially Squeezed Surround Audio Coding (S³AC). The S³AC scheme analyses a multichannel audio signal and downmixes it into a stereo signal pair containing both the monophonic properties of audio sources and their localization information; this avoids the need for side information. The approach uses time-frequency analysis of a spatial audio scene and exploits virtual sources and amplitude panning techniques to 'squeeze' 360° of a horizontal soundfield into a 60° stereo signal pair. In comparison with other spatial audio coding techniques, S³AC significantly advances in-band encoding of the localization information in the original sound scene and achieves accurate recoverability of dynamic localized sources. © 2007 IEEE.
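To give a flavour of the squeezing idea in this and the related S³AC entries — not the published mapping, which operates per time-frequency bin — one can linearly compress a full-circle azimuth into the ±30° stereo arc and re-render it with stereo amplitude panning. Both the linear mapping and the tangent panning law below are simplifying assumptions:

```python
import math

def squeeze_azimuth(theta_deg):
    """Linearly map an azimuth in (-180, 180] degrees onto the
    +/-30 degree stereo arc (a simplifying assumption)."""
    return theta_deg / 6.0

def pan_gains(phi_deg, speaker_deg=30.0):
    """Power-normalised stereo gains from the tangent panning law:
    (gl - gr) / (gl + gr) = tan(phi) / tan(speaker angle)."""
    r = math.tan(math.radians(phi_deg)) / math.tan(math.radians(speaker_deg))
    gl, gr = 1.0 + r, 1.0 - r
    norm = math.hypot(gl, gr)
    return gl / norm, gr / norm

center = pan_gains(squeeze_azimuth(0.0))    # frontal source stays centred
edge = pan_gains(squeeze_azimuth(180.0))    # rear source maps to the arc edge
```

A frontal source receives equal left/right gains, while a source at the edge of the original field collapses onto one loudspeaker; the decoder can invert the gain ratio to recover the squeezed (and hence the original) azimuth.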
Cheng, E., Burnett, I. & Ritz, C. 2007, 'Using spatial audio cues from speech excitation for meeting speech segmentation', International Conference on Signal Processing Proceedings, ICSP.
Multiparty meetings generally involve stationary participants. Participant location information can thus be used to segment the recorded meeting speech into each speaker's 'turn' for meeting 'browsing'. To represent speaker location information from speech, previous research showed that the most reliable time delay estimates are extracted from the Hilbert envelope of the Linear Prediction residual signal. The authors' past work has proposed the use of spatial audio cues to represent speaker location information. This paper proposes extracting spatial audio cues from the Hilbert envelope of the speech residual for indicating changing speaker location for meeting speech segmentation. Experiments conducted on recordings of a real acoustic environment show that spatial cues from the Hilbert envelope are more consistent across frequency subbands and can clearly distinguish between spatially distributed speakers, compared to spatial cues estimated from the recorded speech or residual signal. © 2006 IEEE.
Cheng, E., Burnett, I.S. & Ritz, C. 2007, 'Time delay estimation of reverberant meeting speech: On the use of multichannel linear prediction', Proceedings - International Conference on Signal Image Technologies and Internet Based Systems, SITIS 2007, pp. 531-537.
Effective and efficient access to multiparty meeting recordings requires techniques for meeting analysis and indexing. Since meeting participants are generally stationary, speaker location information may be used to identify meeting events e.g., detect speaker changes. Time-delay estimation (TDE) utilizing cross-correlation of multichannel speech recordings is a common approach for deriving speech source location information. Recent research improved TDE by calculating TDE from linear prediction (LP) residual signals obtained from LP analysis on each individual speech channel. This paper investigates the use of LP residuals for speech TDE, where the residuals are obtained from jointly modeling the multiple speech channels. Experiments conducted with a simulated reverberant room and real room recordings show that jointly modeled LP better predicts the LP coefficients, compared to LP applied to individual channels. Both the individually and jointly modeled LP exhibit similar TDE performance, and outperform TDE on the speech alone, especially with the real recordings. © 2008 IEEE.
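The cross-correlation TDE underlying this line of work can be sketched with an FFT-based correlator. It is applied here to a delayed white-noise signal rather than to LP residuals of reverberant speech, which is where the paper's contribution lies:

```python
import numpy as np

def estimate_delay(x, y):
    """Estimate the delay (in samples) of y relative to x via
    FFT-based cross-correlation; positive means y lags x."""
    n = len(x) + len(y) - 1
    cc = np.fft.irfft(np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n)), n)
    shift = int(np.argmax(cc))
    if shift > n // 2:        # unwrap circular lag to a signed lag
        shift -= n
    return -shift

# A copy delayed by 37 samples is recovered exactly
rng = np.random.default_rng(2)
x = rng.standard_normal(1024)
y = np.concatenate([np.zeros(37), x])[:1024]
d = estimate_delay(x, y)
```

On clean signals plain cross-correlation suffices; the motivation for correlating LP residuals instead, as in the paper, is that whitening the speech sharpens the correlation peak under reverberation.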
XML is a popular approach to interoperable exchange of data between a wide range of devices. This paper explores the use of the Remote XML Exchange Protocol as a mechanism to provide efficient interaction with complex XML documents for users with low-complexity devices and/or limited-bandwidth connections. The interactive mechanisms provided by the protocol allow users to navigate, edit and download XML even when delivery of the full XML document is impossible. The paper examines the use of the protocol to enable multiple users to collaboratively edit remote XML documents. Further, the paper explores the combination of the protocol, collaborative editing and recently released word processor/office suite XML schema formats.
Rong, L. & Burnett, I. 2007, 'Improved dynamic multimedia resource adaptation-based peer-to-peer system through locality-based clustering and service', IEEE Region 10 Annual International Conference, Proceedings/TENCON.
A dynamic P2P architecture based on MPEG-21 was proposed in our previous work to support resource adaptation/personalization according to the surrounding usage environment and user preferences. In this paper, we improve the proposed system through two separate but related modifications. Firstly, peers are clustered according to registered geographic location information. Secondly, based on that registered location information, a locality-based service is introduced which allows users to search for services according to their geographic locations. The service complements the proposed P2P architecture by encouraging service providers to increase the uptime of their devices and hence provide spare computing power for active adaptation of resources for low-end peers. Simulation results show that the proposed approach reduces download time and network delays while increasing resource availability and download speed in the network.
Smith, D. & Burnett, I. 2007, 'Blind separation of speech with a switched sparsity and temporal criteria', 2006 IEEE 8th Workshop on Multimedia Signal Processing, MMSP 2006, pp. 136-140.
A Blind Signal Separation algorithm (SCAtemp) that exploits both the sparse time-frequency representation and temporal structure of speech is proposed. SCAtemp compares each speech signal's adherence to the sparsity and temporal criteria before switching to the most appropriate criterion to estimate each signal. This algorithm is shown to improve the real-time separation performance of conventional BSS algorithms that exclusively exploit either the temporal structure, sparsity or statistical independence of signals. The improvement of SCAtemp over conventional BSS algorithms can be attributed to its use of additional a priori knowledge of the short-term temporal structure of speech. © 2006 IEEE.
Thomas-Kerr, J., Burnett, I. & Ritz, C. 2007, 'Intelligent multimedia delivery? It's a question of semantics', ISCIT 2007 - 2007 International Symposium on Communications and Information Technologies Proceedings, pp. 473-478.
Intelligent multimedia delivery uses semantic information about content to enhance the delivery process. This paper proposes a model for intelligent multimedia delivery that advances the state of the art by incorporating a concept of semantic distortion into the delivery optimization process. Furthermore, the model combines format-independence with rate-distortion optimization to provide a flexible framework for intelligent delivery of multimedia in existing and future formats. The paper provides a review of the existing work covering components of the model: scalable coding formats, rate-distortion optimization, format-independent adaptation, and semantic adaptation. It then details the model, and identifies open problems and opportunities for further research. © 2007 IEEE.
Thomas-Kerr, J., Janneck, J., Mattavelli, M., Burnett, I. & Ritz, C. 2007, 'Reconfigurable media coding: Self-describing multimedia bitstreams', IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation, pp. 319-324.
The development of MP3 and JPEG sparked an explosion in digital content on the internet. These early encoding formats have since been joined by many others, including Quicktime, Ogg, MPEG-2 and MPEG-4, posing an escalating challenge to vendors wishing to develop devices that interoperate with as much content as possible. This paper presents aspects of Reconfigurable Media Coding (RMC), a project currently underway at MPEG to define a self-describing bitstream format. In other words, an RMC bitstream contains metadata to assemble a decoder from fundamental building blocks, as well as a schema that describes the syntax of the content data and how it may be parsed. RMC makes it easy to extend (reconfigure) existing codecs, for example adding error resilience or new chroma-subsampling profiles, or to build entirely new codecs. This paper addresses the bitstream syntax component of RMC, validating the approach by applying it to the recent MPEG-4 Video simple profile coder. © 2007 IEEE.
Bin, C., Ritz, C. & Burnett, I. 2006, 'Squeezing the auditory space: A new approach to multi-channel audio coding', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 572-581.
This paper presents a novel solution for efficient representation of multi-channel spatial audio signals. Unlike other spatial audio coding techniques, the solution inherently requires no additional side information to recover the surround sound panorama from a two-channel downmix. For a typical five-channel case, only a stereo downmix signal is required for the decoder to reconstruct the full five-channel audio signal. In addition to the bandwidth saved by transmitting no side information, the technique has significant advantages in terms of computational complexity. © Springer-Verlag Berlin Heidelberg 2006.
As multiparty meetings involve participants that are generally stationary when actively speaking, participant location information can be used to segment the recorded meeting audio into speaker 'turns.' In this paper, speaker location information derived from 'spatial cues' generated by spatial audio coding techniques is investigated. The validity of using spatial cues for meeting audio segmentation is explored through investigating multiple microphone meeting audio recording techniques and different spatial audio coders. Experimental results show that the statistical relationship between speaker location and interchannel level and phase-based spatial cues strongly depends on the microphone pattern. Results also indicate that interchannel correlation-based spatial cues represent location information that is ambiguous for meeting audio segmentation.
Cheng, E., Burnett, I. & Ritz, C. 2006, 'Varying microphone patterns for meeting speech segmentation using spatial audio cues', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 221-228.
Meetings, common to many business environments, generally involve stationary participants. Thus, participant location information can be used to segment meeting speech recordings into each speaker's 'turn'. The authors' previous work proposed the use of spatial audio cues to represent the speaker locations. This paper studies the validity of using spatial audio cues for meeting speech segmentation by investigating the effect of varying microphone pattern on the spatial cues. Experiments conducted on recordings of a real acoustic environment indicate that the relationship between speaker location and spatial audio cues strongly depends on the microphone pattern. © Springer-Verlag Berlin Heidelberg 2006.
Cheng, E., Davis, S., Burnett, I. & Lukasiak, J. 2006, 'Efficient delivery of hierarchically structured meeting audio metadata with a bi-directional XML protocol', 2006 International Conference on Computing and Informatics, ICOCI '06.
This paper explores user-centered metadata delivery through the example of hierarchically organized meeting audio metadata. Audio annotations that describe meeting scenarios can vary from low-level signal-based descriptors to high-level semantics. Users of meeting metadata also have widely varying requirements and hence want metadata at varying levels of detail. Thus, for efficient metadata access, it is vital to provide customization or choice of the metadata to be delivered, e.g. through regions of interest and annotation detail specification. As well as proposing a user-centered metadata organization strategy, this paper introduces the use of a bi-directional XML protocol for metadata delivery. The combination provides advantages in terms of bandwidth efficiency when an example meeting metadata browser application is examined with practical user interfaces. © 2006 IEEE.
Davis, S.J. & Burnett, I.S. 2006, 'On-demand partial schema delivery for multimedia metadata', 2006 IEEE International Conference on Multimedia and Expo, ICME 2006 - Proceedings, pp. 1513-1516.
XML is a popular approach to interoperable exchange of multimedia metadata between a wide range of devices. This paper explores extending the use of the Remote XML Exchange Protocol (previously proposed by the authors) as a mechanism to provide efficient interaction with complex multimedia XML documents and their associated schemas. This is particularly applicable to users with devices of limited application complexity and/or limited bandwidth connections. Many XML documents do not fully utilize all the information present in a given schema; thus, users download substantial redundant information for the current application. This paper introduces the use of RXEP for the transmission of small, relevant schema sections. The paper investigates the advantages of schema retrieval using RXEP in terms of the bandwidth saved. © 2006 IEEE.
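The observation behind this work, that a document rarely references its whole schema, can be sketched in a few lines. The element names and the dictionary-of-definitions schema representation below are assumptions made for illustration; RXEP itself is a binary fragment-exchange protocol, not this pruning loop:

```python
# Illustrative sketch: deliver only the schema definitions that a given
# XML document actually references. Element names and the dict-based
# schema representation are hypothetical simplifications.
import xml.etree.ElementTree as ET

# Hypothetical schema: element name -> its (abbreviated) definition.
schema = {
    "Item": "<xs:element name='Item'/>",
    "Descriptor": "<xs:element name='Descriptor'/>",
    "Component": "<xs:element name='Component'/>",
    "Condition": "<xs:element name='Condition'/>",
}

doc = ET.fromstring("<Item><Descriptor/><Component/></Item>")
used = {elem.tag for elem in doc.iter()}   # element names present in the document

# Only definitions for elements the document uses are delivered.
fragment = {name: text for name, text in schema.items() if name in used}
print(sorted(fragment))
```

Here the unused 'Condition' definition never leaves the server, which is the kind of bandwidth saving the paper quantifies for full multimedia schemas.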
Raad, W. & Burnett, I.S. 2006, 'A variable length linear array for smart antenna systems using partial optimization', 2006 IEEE Singapore International Conference on Communication Systems, ICCS 2006.
Adaptive arrays have been used extensively in wireless communications applications to reduce interference between desired users and interfering signals. Significant research has been directed at the Uniform Linear Array (ULA); however, a fixed array length is generally used. In a mobile environment this has the disadvantage of producing a fixed beamwidth. The Variable Linear Array (VLA) algorithm allows the beamwidth to be varied by activating and deactivating elements depending on the interferers present. This approach uses the LMS algorithm for adaptation, which could prove costly in an embedded system. This paper proposes an alternative approach based on partial optimization techniques that removes the LMS from the optimization process. Furthermore, studies into the cost of such a system are presented and compared to the operational cost of the LMS algorithm. © 2006 IEEE.
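The beamwidth effect the VLA algorithm exploits is easy to reproduce numerically: the main lobe of a uniform linear array narrows as elements are activated. This is a generic sketch under assumed parameters (half-wavelength spacing, unit weights), not the paper's partial-optimization method:

```python
import numpy as np

def array_factor(n_elements, angles):
    """Normalised array factor of an n-element ULA at half-wavelength spacing."""
    psi = 2 * np.pi * 0.5 * np.cos(angles)      # inter-element phase shift
    n = np.arange(n_elements)[:, None]
    return np.abs(np.exp(1j * n * psi).sum(axis=0)) / n_elements

angles = np.linspace(1e-3, np.pi - 1e-3, 4001)
widths = []
for m in (4, 8, 16):                            # progressively "activate" elements
    af = array_factor(m, angles)
    widths.append(np.ptp(angles[af > 0.5]))     # rough half-power main-lobe width
print(widths)                                   # width shrinks as elements are added
```

Doubling the active aperture roughly halves the main-lobe width, which is the degree of freedom the VLA uses to match the beam to the interference scenario.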
Ritz, C., Adistambha, K., Lukasiak, J. & Burnett, I. 2006, 'A codebook-based cascade coder for embedded lossless audio coding', Audio Engineering Society - 120th Convention Spring Preprints 2006, pp. 912-917.
Embedded lossless audio coding embeds a perceptual audio coding bitstream within a lossless audio coding bitstream. Such an approach provides access to both a lossy and lossless version of the audio signal within the one coding scheme. Previously, a lossless embedded audio coder based on the Advanced Audio Coding (AAC) approach and utilizing both backward Linear Predictive Coding (LPC) and cascade coding was proposed. This paper further investigates the adaptation of cascade coding to lossless audio compression using a novel codebook based approach. The codebook is trained using LPC residual signals obtained from the decorrelation stage of the embedded coder. Results show that the overall lossless compression performance of cascade coding closely follows Rice coding.
Rong, L. & Burnett, I. 2006, 'Facilitating Universal Multimedia Adaptation (UMA) in a heterogeneous peer-to-peer network', Proceedings - Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, AXMEDIS 2006, pp. 105-109.
This paper proposes a P2P architecture which uses MPEG-21 as a standard based technique to dynamically adapt resources according to various usage environment attributes such as terminal capabilities and user preferences. In the architecture, a super peer based approach is used to cluster peers, store peer information, perform searches and instruct peers to adapt/send resources. Pull and push-based adaptation methods are introduced to adapt search results and resources in an intelligent manner based on the usage environment attributes. Simulation results show that the proposed architecture reduces download time while increasing resource availabilities and download speed in the network when compared to traditional P2P systems. © 2006 IEEE.
Rong, L. & Burnett, I.S. 2006, 'Adaptive resource replication in a ubiquitous peer-to-peer based multimedia distribution environment', 2006 3rd IEEE Consumer Communications and Networking Conference, CCNC 2006, pp. 65-68.
A dynamic P2P architecture was proposed in our previous work to support resource adaptation/personalization according to the surrounding usage environment and user preferences. In this paper, we propose an adaptive resource replication strategy based on the proposed P2P architecture. It uses resource request rate as the metric to trigger the resource replication process, and proportionally replicates multimedia resources into various configuration states according to the properties of peers (i.e., terminal capabilities and user preferences) and the size of peer clusters. Also, the strategy uses peer-related information stored on super peers to determine which peers should be selected to perform adaptive replications and where the resulting replicas should be stored. Simulation results show that the proposed strategy reduces network delays while increasing resource hit rate in comparison to FreeNet and random replication strategies. © 2006 IEEE.
A sequential approach to Sparse Component Analysis (SeqTIF) is proposed in this paper. Although SeqTIF employs the estimation process of the simultaneous TIFROM algorithm, a source cancellation and deflation technique is also incorporated to sequentially estimate speech signals in the mixture. Results indicate that SeqTIF's separation performance shows a clear improvement upon the simultaneous TIFROM approach, due to the less restrictive assumptions it places upon the signals in the mixture. In particular, the analysis indicates that SeqTIF's data efficiency is high, enabling the sequential approach to track a time-varying mixture with much greater accuracy than the simultaneous algorithm. Furthermore, SeqTIF is a more flexible approach, free from the constraints that a simultaneous approach places upon the mixing system.
Thomas-Kerr, J., Burnett, I. & Ritz, C. 2006, 'An efficient approach to generic multimedia adaptation', Proceedings of the 14th Annual ACM International Conference on Multimedia, MM 2006, pp. 49-52.
This paper addresses efficiency issues identified in the Bitstream Syntax Description Language used by the MPEG-21 generic multimedia adaptation framework. In particular, when used to adapt modern content formats such as H.264/AVC, the time required for processing increases exponentially relative to the duration of the bitstream. In response, the paper proposes several additional features for the Bitstream Syntax Description Language which reduce the complexity of adaptation using BSDL to a linear function of bitstream duration. These features are implemented and validated using bitstreams of real-world length.
Thomas-Kerr, J., Burnett, I. & Ritz, C. 2006, 'Enhancing interoperability via generic multimedia syntax translation', Proceedings - Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, AXMEDIS 2006, pp. 85-90.
The Bitstream Binding Language (BBL) is a new technology developed by the authors and being standardized by MPEG, which describes how multimedia content and metadata can be mapped onto streaming formats. This paper describes how BBL can be used to enhance the interoperability of multimedia content by providing a generic mechanism for the translation of content between formats. As new content formats are developed, BBL can be used to describe how to translate the content into a form that existing devices are able to render. This consequently simplifies the adoption of new multimedia content forms because existing devices are able to consume the content even though they do not understand its native format. © 2006 IEEE.
Thomas-Kerr, J., Burnett, I. & Ritz, C. 2006, 'Format-independent multimedia streaming', 2006 IEEE International Conference on Multimedia and Expo, ICME 2006 - Proceedings, pp. 1509-1512.
The Bitstream Binding Language (BBL) is a new technology developed by the authors and being standardized by MPEG, which describes how multimedia content and metadata can be mapped onto streaming formats. This paper describes a particular application of BBL - format-independent multimedia streaming. This means that streaming servers no longer require additional software modules in order to support new content formats as they are introduced. Instead, the server requires only a BBL description of the mapping between the content format and the stream, and any content in the new format may then be delivered by the streaming server. This approach is validated using the H.264/AVC format as an example, and performance data are provided. © 2006 IEEE.
Thomas-Kerr, J., Burnett, I. & Ritz, C. 2006, 'Generic, scalable multimedia streaming and delivery with example application for H.264/AVC', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 349-356.
The ever increasing diversity of multimedia technology presents a growing challenge to interoperability as new content formats are developed. The Bitstream Binding Language (BBL) addresses this problem by providing a format-independent language to describe how multimedia content is to be delivered. This paper proposes extensions to BBL that enable a generic, scalable streaming server architecture. In this architecture, new content formats are supported by providing a simple file with instructions as to how the content may be streamed. This approach removes any need to modify existing software to provide such support. © Springer-Verlag Berlin Heidelberg 2006.
Cheng, E., Lukasiak, J., Burnett, I.S. & Stirling, D. 2005, 'Using spatial cues for meeting speech segmentation', IEEE International Conference on Multimedia and Expo, ICME 2005, pp. 350-353.
This work investigates the validity and accuracy of using spatial cues with Time-Delay Estimation (TDE) as a method of segmenting multichannel recorded speech by speaker location. In environments such as meetings where speakers do not significantly alter position, segmentation by speaker location essentially leads to segmentation by speaker 'turn'. The proposed system calculates location information using TDEs and spatial cues extracted from multichannel meeting audio recordings. This location information is then input into a simple segmentation algorithm. Experiments have been performed on both theoretical and real meeting recordings with non-overlapping speakers, and theoretical recordings with overlapping speakers. Segmentation results reveal the most robust cue to be a combination of spatial information and TDEs. This cue combination leads to greater segmentation accuracy for classifying individual speakers and detecting overlapping sections than using spatial cues or time-delay information alone. © 2005 IEEE.
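Time-delay estimation of the kind used above is commonly implemented with generalized cross-correlation; the GCC-PHAT sketch below recovers a known inter-channel delay from a toy two-microphone signal. It is a generic illustration under assumed parameters (16 kHz rate, white-noise source), not the paper's segmentation system:

```python
import numpy as np

def gcc_phat_delay(x, y, fs):
    """Estimate the delay of y relative to x (in seconds) via GCC-PHAT."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)                        # cross-power spectrum
    R /= np.abs(R) + 1e-12                    # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy two-channel recording: channel y lags channel x by 5 samples.
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
x = s
y = np.concatenate((np.zeros(5), s[:-5]))
fs = 16000
print(gcc_phat_delay(x, y, fs) * fs)          # recovered delay in samples
```

With speakers who stay put, delays like this map to fixed locations, which is what makes segmentation by speaker 'turn' possible.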
Davis, S.J. & Burnett, I.S. 2005, 'Efficient delivery within the MPEG-21 framework', Proceedings - First International Conference on Automated Production of Cross Media Content for Multi-channel Distribution, AXMEDIS 2005, pp. 205-208.
MPEG-21, the new Multimedia Framework standard, is nearing completion. It is centred on the concept of a Digital Item, which is exchanged in the form of an XML declaration. This paper initially considers the challenge of compressing these Digital Item Declarations and presents conclusive results showing that schema-based compression is the best solution. A new technique is then presented whereby the one-way delivery framework is expanded to allow client selection and control of Digital Item fragment transmission. This uses a new Remote XML Exchange Protocol (RXEP) which significantly reduces storage and time delays in mobile environments. Finally, the new approach is combined with schema-based compression to offer maximum efficiency. © 2005 IEEE.
Davis, S.J. & Burnett, I.S. 2005, 'Exchanging XML multimedia containers using a binary XML protocol', IEEE International Conference on Multimedia and Expo, ICME 2005, pp. 358-361.
XML is becoming increasingly popular as the ubiquitous standard for metadata; consequently, it is being incorporated into many multimedia applications, such as those based on MPEG-7 and MPEG-21. However, XML is often verbose and transmitting the large files can be wasteful in bandwidth- and power-limited mobile applications. This paper introduces an XML access mechanism, RXEP, which combines XML compression with a fragment access protocol. RXEP ensures that essential information is exchanged efficiently while minimising superfluous XML content transmission. This makes XML containers an attractive technique for multimedia content delivery. © 2005 IEEE.
Lauf, S. & Burnett, I. 2005, 'A protected digital item declaration language for MPEG-21', Proceedings - First International Conference on Automated Production of Cross Media Content for Multi-channel Distribution, AXMEDIS 2005, pp. 275-278.
Within the MPEG-21 Multimedia Framework, Digital Items are introduced as a structured digital representation for multimedia. To facilitate the representation of Digital Items which include secure or controlled content, the authors have implemented an IPMP Digital Item Declaration Language (IPMP DIDL). This provides for a protected representation of Digital Item structure, allowing the use of existing DRM systems and rights expression languages. This paper examines the design and implementation of the IPMP DIDL and its incorporation into MPEG-21 Part 4: IPMP Components. © 2005 IEEE.
The MPEG-21 Multimedia Framework aims to realize interoperable access to content across heterogeneous networks and devices. Within the Framework, the concept of Digital Items is introduced as a structured digital representation for multimedia. To demonstrate the applicability of MPEG-21 to seamless multimedia interactions on limited platforms, the authors have produced an implementation of MPEG-21 for a mobile device, in Java 2 Micro Edition (J2ME). This paper examines the design and implementation of the Mobile MPEG-21 Peer, including a specialized architecture and processing mechanisms specific to the J2ME platform. Copyright © 2005 ACM.
Rong, L. & Burnett, I. 2005, 'BitTorrent in a dynamic resource adapting peer-to-peer network', Proceedings - First International Conference on Automated Production of Cross Media Content for Multi-channel Distribution, AXMEDIS 2005, pp. 224-227.
Our previous work proposed an MPEG-21 based P2P network architecture supporting resource adaptation on the basis of usage environment and user preferences. In this paper, we investigate the deployment of a BitTorrent (BT)-like approach in the previous super-peer P2P resource adaptation architecture. In addition, the architecture's peer selection strategy is adopted and evaluated as a way to enhance the peer selection process in BT. The strategy uses super peers as trackers to intelligently select peers according to their capabilities and shared resource segments. Simulation results show that the proposed selection strategy reduces average access time and increases download speed when compared with the existing BT peer selection process with randomly selected peers. The results also show that the deployment of BT in the P2P adaptation architecture greatly reduces the congested download problem previously reported. © 2005 IEEE.
Rong, L. & Burnett, I. 2005, 'Dynamic resource adaptation in a heterogeneous Peer-to-Peer environment', 2005 2nd IEEE Consumer Communications and Networking Conference, CCNC2005, pp. 416-420.
This paper focuses on metadata-based multimedia resource representation and retrieval in a P2P environment as a means of Universal Multimedia Access (UMA). The primary focus of the work is a P2P architecture which uses MPEG-21 as a standards-based technique to dynamically adapt resources to various usage environments. In the architecture, a super peer based approach is used to cluster peers, store peer information, perform searches and instruct peers to adapt/send resources. Also, a two-stage adaptation method is introduced to adapt search results and resources in an intelligent manner based on the usage environment attributes. The concept is demonstrated using a test-bed built on the JXTA peer-to-peer framework. In addition, simulation results show that the proposed architecture reduces download time while increasing resource availability and download speed in the network when compared to traditional P2P systems. © 2004 IEEE.
Smith, D., Lukasiak, J. & Burnett, I. 2005, 'An investigation of temporal modeling in blind signal separation', Proceedings - 8th International Symposium on Signal Processing and its Applications, ISSPA 2005, pp. 503-506.
This paper investigates the performance of blind signal separation (BSS) algorithms that exploit the temporal predictability of speech. Specifically, the investigation considers how the separation performance of two BSS algorithms is affected when the length of the AR process (used in the algorithms to model speech) is varied. The investigation concludes that the length of the AR process (prediction order) has a significant impact on separation performance. In particular, the separation performance of both algorithms is degraded if the AR model's prediction order over-fits or under-fits the temporal structure of the speech. It is revealed that a prediction order of 30-50 provides maximum separation performance for natural speech; however, a prediction order of 10 is more applicable if computational cost is a consideration. ©2005 IEEE.
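The over/under-fit behaviour can be seen with ordinary linear prediction: the residual error falls quickly until the prediction order reaches the signal's true AR order, then flattens. A minimal sketch on an assumed synthetic AR(4) signal (the paper's experiments use real speech and orders up to 50):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; return (AR coefficients, residual error)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                        # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(4) signal: two resonances (pole radius 0.8) driven by white noise.
true_a = [1.0, -0.32, 0.896, -0.2048, 0.4096]
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(len(e)):
    x[n] = e[n] - sum(true_a[k] * x[n - k] for k in range(1, 5) if n >= k)

N = len(x)
r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(9)])  # biased autocorrelation
errors = {p: levinson_durbin(r, p)[1] for p in (1, 2, 4, 8)}
print({p: round(v, 3) for p, v in errors.items()})  # error flattens beyond order 4
```

Orders below 4 under-fit (large residual); orders above 4 buy almost nothing here, mirroring the cost/performance trade-off the abstract describes.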
Thomas-Kerr, J., Burnett, I. & Ciufo, P. 2005, 'Bitstream binding language - Mapping XML multimedia containers into streams', IEEE International Conference on Multimedia and Expo, ICME 2005, pp. 626-629.
Bitstream Binding Language (BBL) provides an abstraction layer between XML multimedia containers and the way their resources and metadata are published in a bitstream. It allows multiple bindings from a single source document to facilitate interoperability and applicability of the multimedia content to a wide range of terminals and users. BBL introduces a number of features not found in other XML fragmentation techniques - such as support for random access - and provides a highly general framework for combining metadata with multimedia content in a transport format. This paper presents an overview of BBL and provides details of a Java implementation using the MPEG-21 Multimedia Framework as a sample application. © 2005 IEEE.
Premaratne, P. & Burnett, I. 2004, 'Role of wavelet transforms in image restoration', IEEE Region 10 Annual International Conference, Proceedings/TENCON.
Image processing techniques, including image restoration, rely heavily on the Discrete Fourier Transform (DFT) for frequency-domain representation and analysis of frequency content. Even though wavelet analysis has been around for more than a decade, much of its potential as a tool for analyzing the time-frequency localization of signals has not been properly tapped. In blind iterative deconvolution, where a degraded image is restored with minimal a priori information about the original image or the point spread function (PSF), it is almost impossible to evaluate whether restoration has been achieved without human observation. We propose a new approach using wavelet decomposition to automatically assess whether an image has been restored. © 2004 IEEE.
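The underlying criterion can be sketched with a single-level Haar transform: blurring suppresses fine detail, so the energy of the high-pass (detail) band drops, giving an automatic, observer-free indicator. The 1-D toy signal and the 5-tap average blur below are assumptions for illustration; the paper works with 2-D wavelet decompositions of images:

```python
import numpy as np

def haar_detail_energy(x):
    """Energy in the detail (high-pass) band of a one-level Haar transform."""
    x = np.asarray(x, dtype=float)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return float(np.dot(d, d))

rng = np.random.default_rng(2)
sharp = rng.standard_normal(1024)                     # detail-rich "restored" signal
blurred = np.convolve(sharp, np.ones(5) / 5.0, mode="same")

# Blur removes high-frequency content, so detail energy falls sharply.
print(haar_detail_energy(sharp) > haar_detail_energy(blurred))  # True
```

In an iterative restoration loop, tracking this detail energy gives a stopping criterion that needs no human observer.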
Premaratne, P., Burnett, I. & Liyanage, C.D. 2004, 'Blur retrieval via separation of zeros sheets from noisy blurred images', 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, ISIMP 2004, pp. 559-562.
A novel method of separating the point spread function from blurred images using zeros of the z-transform is presented for the case where more than one blurred image is available. The proposed method is demonstrated to be effective under significant noise contamination, at signal-to-noise ratios above 30 dB. This method holds much promise as a blind deconvolution technique (i.e. for the problem of recovering two functions from their convolution), as it does not impose any constraints on the point spread function such as positivity. Experimental results over different signal-to-noise ratios are presented, depicting the method's effectiveness as a practical image restoration technique.
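The algebra behind zero-based blur retrieval can be demonstrated in one dimension: convolution multiplies z-transform polynomials, so a blurred signal's zeros are the union of the image's and the PSF's zeros, and the blur can in principle be factored back out. The toy sequences below are assumptions; the paper works with 2-D zero sheets:

```python
import numpy as np

image = np.array([1.0, -2.5, 1.5])   # z-polynomial with zeros at 1.0 and 1.5
psf = np.array([1.0, 0.5])           # one-zero "blur" with a zero at -0.5
blurred = np.convolve(image, psf)    # convolution = polynomial product

zeros_blurred = np.roots(blurred)
zeros_psf = np.roots(psf)

# Every PSF zero reappears among the blurred signal's zeros, so grouping
# and separating the zeros recovers the blur without positivity constraints.
for zp in zeros_psf:
    print(float(np.min(np.abs(zeros_blurred - zp))))   # ~0: PSF zero is present
```

With several independently blurred copies of the same image, the zeros common to all copies identify the image, leaving the remainder as each blur, which is the separation principle the paper extends to noisy 2-D data.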
Rong, L. & Burnett, I. 2004, 'Dynamic multimedia adaptation and updating of media streams with MPEG-21', IEEE Consumer Communications and Networking Conference, CCNC, pp. 436-441.
This paper discusses media streaming using dynamic resource adaptation and update as a means of facilitating Universal Multimedia Access (UMA): the concept of accessing multimedia content through a variety of possible schemes. As background to our work, the paper summarizes the most common content negotiation approaches and addresses their facets and problems. MPEG-21, the multimedia framework, and its relationship to UMA are then explained. The primary focus of the work is an end-to-end approach to content adaptation which takes advantage of MPEG-21 to facilitate the UMA concept in a media streaming environment. The concept is validated using a media streaming test-bed which provides for wide adaptation according to broad usage descriptions.
Lukasiak, J. & Burnett, I.S. 2003, 'Scalable speech coding spanning the 4 Kbps divide', Proceedings - 7th International Symposium on Signal Processing and Its Applications, ISSPA 2003, pp. 397-400.
This paper examines a scalable method for coding the LP residual. The scalable method is capable of increasing the accuracy of the reconstructed speech from a parametric representation at low rates to a more accurate waveform matched representation at higher rates. The method entails pitch length segmentation, decomposition into pulsed and noise components and modeling of the pulsed components using a fixed shape pulse model in a closed-loop, Analysis by Synthesis system. Subjective testing is presented that indicates that in addition to the AbyS modeling, the pulse parameter evolution must be constrained in synthesis. Results indicate that this proposed method is capable of producing perceptually scalable speech quality as the bit rate is increased through 4 kbps. © 2003 IEEE.
Raad, M., Burnett, I. & Mertins, A. 2003, 'Multi-rate and multi-resolution scalable to lossless audio compression using PSPIHT', Proceedings - 7th International Symposium on Signal Processing and Its Applications, ISSPA 2003, pp. 121-124.
This paper presents a scalable to lossless compression scheme that allows scalability in terms of sampling rate as well as quantization resolution. The scheme presented is perceptually scalable and it also allows lossless compression. The scheme produces smooth objective scalability, in terms of SNR, until lossless compression is achieved. The scheme is built around the perceptual SPIHT algorithm, which is a modification of the SPIHT algorithm. Objective and subjective results are given that show perceptual as well as objective scalability. The subjective results given also show that the proposed scheme performs comparably with the MPEG-4 AAC coder at 16, 32 and 64 kbps. © 2003 IEEE.
Raad, M., Burnett, I. & Mertins, A. 2003, 'Multi-rate extension of the scalable to lossless PSPIHT audio coder', EUROSPEECH 2003 - 8th European Conference on Speech Communication and Technology, pp. 1117-1120.
This paper extends a scalable to lossless compression scheme to allow scalability in terms of sampling rate as well as quantization resolution. The scheme presented is an extension of a perceptually scalable scheme that scales to lossless compression, producing smooth objective scalability, in terms of SNR, until lossless compression is achieved. The scheme is built around the Perceptual SPIHT algorithm, which is a modification of the SPIHT algorithm. An analysis of the expected limitations of scaling across sampling rates is given as well as lossless compression results showing the competitive performance of the presented technique.
Raad, M., Mertins, A. & Burnett, I. 2003, 'Scalable to lossless audio compression based on perceptual set partitioning in hierarchical trees (PSPIHT)', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 624-627.
This paper proposes a technique for scalable to lossless audio compression. The scheme presented is perceptually scalable and also provides for lossless compression. It produces smooth objective scalability, in terms of SegSNR, from lossy to lossless compression. The proposal is built around the Perceptual SPIHT algorithm, which is a modification of the SPIHT algorithm and is introduced in this paper. Both objective and subjective results are reported and demonstrate both perceptual and objective measure scalability. The subjective results indicate that the proposed method performs comparably with the MPEG-4 AAC coder at 16, 32 and 64 kbps, yet also achieves a scalable-to-lossless architecture.
Ritz, C.H., Burnett, I.S. & Lukasiak, J. 2003, 'Low bit rate wideband WI speech coding', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 804-807.
This paper investigates Waveform Interpolation (WI) applied to low bit rate wideband speech coding. An analysis of the evolutionary behaviour of wideband Characteristic Waveforms (CWs) shows that direct application of the classical WI algorithm may not be appropriate for wideband speech. We propose a modification whereby CW quantisation is performed using classical WI decomposition for the low frequency region and noise modelling for the high frequency region. Wideband WI coders incorporating this modification and operating at 4 kbps and 6 kbps are described. Subjective testing of these coders shows that WI is a promising approach to low bit rate wideband speech compression.
Ritz, C.H., Burnett, I.S. & Lukasiak, J. 2003, 'Low bit rate wideband WI speech coding', Proceedings - IEEE International Conference on Multimedia and Expo, pp. I377-I380.
© 2003 IEEE. This paper investigates waveform interpolation (WI) applied to low bit rate wideband speech coding. An analysis of the evolutionary behaviour of wideband characteristic waveforms (CWs) shows that direct application of the classical WI algorithm may not be appropriate for wideband speech. We propose a modification whereby CW quantisation is performed using classical WI decomposition for the low frequency region and noise modelling for the high frequency region. Wideband WI coders incorporating this modification and operating at 4 kbps and 6 kbps are described. Subjective testing of these coders shows that WI is a promising approach to low bit rate wideband speech compression.
Potard, G. & Burnett, I. 2002, 'Using XML schemas to create and encode interactive 3-D audio scenes for multimedia and virtual reality applications', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 193-203.
© Springer-Verlag Berlin Heidelberg 2002. An object-oriented 3-D sound scene description scheme is proposed. The scheme establishes a way to compose and encode time-varying spatial sound scenes using audio and acoustical objects. This scheme can be used in applications where efficient coding of interactive 3-D sound scenes is needed (e.g. interactive virtual displays and videoconferencing). It can also be used in non-interactive applications such as cinema and 3-D music. The scheme offers clear advantages over multi-channel 3-D sound formats regarding scalability and interactivity with the sound scene, because each object has its own set of parameters and can be modified by the end-user at the decoding stage. The object-oriented approach also allows the creation of macro-object descriptors that enable fast and efficient coding of 3-D sound scenes using references to macro-object libraries. The scheme has been implemented as an XML schema and can be used to define 3-D sound scenes in XML format in a standard way.
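As a rough illustration of the object-oriented scene description idea, the sketch below assembles a tiny 3-D sound scene as XML using Python's standard library. All element and attribute names here (`AudioScene`, `SoundObject`, `ReverbTime`, etc.) are invented for this example and do not reflect the vocabulary of the schema actually proposed in the paper.

```python
import xml.etree.ElementTree as ET

# Illustrative only: build a minimal object-oriented 3-D sound scene.
# Each sound object carries its own parameters, so an end-user (or decoder)
# can modify a single object without touching the rest of the scene.

def build_scene():
    scene = ET.Element("AudioScene")
    src = ET.SubElement(scene, "SoundObject", id="src1")
    ET.SubElement(src, "Position", x="1.0", y="0.0", z="-2.0")
    ET.SubElement(src, "Source", href="bird.wav")
    room = ET.SubElement(scene, "AcousticObject", id="room1")
    ET.SubElement(room, "ReverbTime", seconds="0.8")
    return scene

xml_text = ET.tostring(build_scene(), encoding="unicode")
print(xml_text)
```

Because every object is a self-contained element, interactivity amounts to rewriting one subtree at the decoding stage rather than remixing a fixed multi-channel format.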
Jackson, M.A. & Burnett, I.S. 2001, 'Fuzzy clustering evaluation of time-frequency distribution (TFD) schemes for audio stream segregation', IEEE International Conference on Fuzzy Systems, pp. 553-556.
Audio stream segregation is a task performed constantly by the human auditory system, yet is difficult to reproduce with a computer. The research detailed in this paper looks at performing just one method of stream segregation, the temporal coherence boundary, using a fuzzy clustering system. The main focus of the paper is to examine the effectiveness of several time-frequency distributions as the feature vectors for the system. Three time-frequency distributions are examined and their effectiveness evaluated in terms of correct separation and computational complexity. The main evaluation compares the popular gamma-tone filterbank with the MPEG-7 Audio Spectrum Envelope. The results are promising, indicating that the less computationally expensive MPEG-7 descriptor performs well; this implies that stream segregation may be possible using the MPEG-7 Audio low-level description scheme.
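The abstract does not specify the clustering algorithm in detail; as a stand-in, here is a minimal one-dimensional fuzzy c-means sketch of the kind of soft clustering such a system could use. The paper's feature vectors come from time-frequency distributions, whereas this toy clusters plain scalars.

```python
# Minimal fuzzy c-means sketch (1-D, two clusters). Illustrative only:
# real stream-segregation features would be multidimensional TFD vectors.

def fuzzy_c_means(data, centers, m=2.0, iters=50):
    for _ in range(iters):
        # Soft membership of each point in each cluster (closer => larger).
        u = []
        for x in data:
            d = [abs(x - c) + 1e-12 for c in centers]  # avoid div-by-zero
            u.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                                for j in range(len(centers)))
                      for i in range(len(centers))])
        # Update centres as membership-weighted means.
        centers = [sum(u[k][i] ** m * data[k] for k in range(len(data))) /
                   sum(u[k][i] ** m for k in range(len(data)))
                   for i in range(len(centers))]
    return centers, u

# Two well-separated "streams" of scalar features.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centers, u = fuzzy_c_means(data, [0.5, 4.0])
print(sorted(round(c, 1) for c in centers))
```

The soft memberships, rather than hard assignments, are what make a fuzzy formulation attractive for ambiguous auditory grouping decisions.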
Lukasiak, J. & Burnett, I.S. 2001, 'SEW representation for low rate WI coding', ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 697-700.
This paper considers low-rate Waveform Interpolation (WI) coding. It compares the existing, common Slowly Evolving Waveform (SEW) quantisation scheme with two new schemes for representing and quantising the SEW. The first scheme uses a minimum phase estimate to reconstruct the SEW, whilst the second scheme uses a pulse model whose parameters are implicitly transmitted in the quantised rapidly evolving waveform (REW). These new schemes maintain or reduce the bit rate required for transmission of the SEW. Results indicate that, for low rate WI coding, necessarily coarse SEW magnitude spectrum quantisation limits the contribution of the SEW to perceptual quality. Perceptual tests indicate that avoiding coarse spectral shape quantisation and using a fixed shape model that lends itself to smooth interpolation maintains the perceptual quality of the synthesized speech. The proposed fixed shape model requires no bits for transmission, allowing a 12 percent reduction in the overall coder bit rate.
Raad, M. & Burnett, I.S. 2001, 'Audio coding using sorted sinusoidal parameters', ISCAS 2001 - 2001 IEEE International Symposium on Circuits and Systems, Conference Proceedings, pp. 401-404.
This paper describes a new audio coding scheme based on sinusoidal coding of signals. Sinusoidal coding permits the representation of a given signal through the summation of sinusoids. The parameters of the sinusoids (the amplitudes, phases and frequencies) are transmitted to allow the signal reconstruction. In the proposed scheme, the sinusoidal parameters are sorted according to energy content and perceptual significance. The most significant parameters are transmitted first, allowing the use of only a small set of the parameters for signal reconstruction. The proposed scheme incurs a low delay and uses a 20 ms frame length. Results show that the coder, operating at a mean rate of 39 kb/s, performs favorably in comparison with the MPEG-4 coder at 42 kb/s. © 2001 IEEE.
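The sort-then-truncate idea can be sketched as follows. The parameter values, frame size, and plain squared-error measure below are illustrative stand-ins; the paper orders parameters by energy content and perceptual significance, not raw amplitude alone.

```python
import math

# Illustrative sketch: represent a frame as a sum of sinusoids, sort the
# (amplitude, frequency, phase) triples by energy, and reconstruct from only
# the most significant ones. Quantisation and bitstream details are omitted.

def synthesize(params, n, fs):
    """Sum of sinusoids: params is a list of (amplitude, freq_hz, phase)."""
    return [sum(a * math.cos(2 * math.pi * f * t / fs + p)
                for a, f, p in params)
            for t in range(n)]

def top_k_by_energy(params, k):
    """Keep the k sinusoids with the largest amplitude (highest energy)."""
    return sorted(params, key=lambda pfp: -pfp[0])[:k]

params = [(1.0, 440.0, 0.0), (0.5, 880.0, 0.3), (0.05, 1320.0, 1.0)]
fs, n = 8000, 160  # a 20 ms frame at 8 kHz
target = synthesize(params, n, fs)

def frame_error(k):
    approx = synthesize(top_k_by_energy(params, k), n, fs)
    return sum((x - y) ** 2 for x, y in zip(target, approx))

# Error shrinks as more of the sorted parameters are used.
errors = [frame_error(k) for k in (1, 2, 3)]
print(errors)
```

Transmitting the sorted parameters most-significant-first is what makes the bitstream naturally truncatable: a receiver can stop after any prefix and still reconstruct a usable signal.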
Burnett, I. 1997, 'Waveform interpolation paradigm - foundation of a class of speech coders', IEEE Region 10 Annual International Conference, Proceedings/TENCON, p. 1.
The waveform interpolation (WI) paradigm offers effective, high-perceptual-quality speech coding at low rates by representing the speech/residual as an evolving set of pitch cycle waveforms. The formation of the speech/residual into an evolving surface of phase-aligned characteristic waveforms, and the subsequent decomposition of that surface into near-independent evolving surfaces for quantization, may be the most distinctive feature of WI coding. The technique attains high quality at low rates by utilizing smooth interpolation of almost all its parameters, which requires careful consideration of speech events. The technique is a truly hybrid speech coding algorithm, performing analysis in both the time and discrete frequency domains.
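The decomposition into near-independent evolving surfaces can be sketched as below: the slowly evolving waveform (SEW) is obtained by low-pass filtering the characteristic-waveform surface along the evolution axis, and the rapidly evolving waveform (REW) is the remainder. The moving-average filter, window length, and toy data are illustrative choices; phase alignment and quantisation stages are omitted.

```python
import random

# Illustrative WI-style decomposition of a surface of characteristic
# waveforms (one pitch-cycle waveform per update instant) into SEW + REW.

def decompose(cws, win=3):
    """cws: list of equal-length characteristic waveforms (lists of floats).
    Returns (sew, rew), each the same shape as cws."""
    n = len(cws)
    sew = []
    for i in range(n):
        lo, hi = max(0, i - win // 2), min(n, i + win // 2 + 1)
        # Low-pass along the evolution axis: average each sample position
        # across neighbouring pitch cycles.
        sew.append([sum(cws[j][k] for j in range(lo, hi)) / (hi - lo)
                    for k in range(len(cws[i]))])
    # REW is whatever the slow surface did not capture.
    rew = [[c - s for c, s in zip(cw, sw)] for cw, sw in zip(cws, sew)]
    return sew, rew

# Toy surface: a slowly drifting pulse plus per-cycle noise-like wiggle.
random.seed(0)
cws = [[(1.0 + 0.01 * i) * (1.0 if k == 4 else 0.0) +
        0.05 * random.uniform(-1, 1) for k in range(10)] for i in range(8)]
sew, rew = decompose(cws)
# SEW + REW reconstructs the original surface exactly.
print(max(abs(cws[i][k] - sew[i][k] - rew[i][k])
          for i in range(8) for k in range(10)))
```

The point of the split is that the two surfaces can then be quantised almost independently, with bits spent according to their very different perceptual roles.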
Burnett, I.S. & Ni, J. 1997, 'Waveform interpolation and analysis-by-synthesis - a good match?', IEEE Workshop on Speech Coding for Telecommunications Proceedings, pp. 29-30.
This paper considers the incorporation of Analysis-by-Synthesis (A-by-S) techniques in Waveform Interpolation (WI) coding architectures. It proposes an altered A-by-S mechanism which overcomes the non-synchronous input/output speech that is characteristic of WI coders. The proposed architectures operate on a prototype-by-prototype basis, optimizing parameter choices within each frame. Three possible architectures are considered, with varying degrees of A-by-S integration. These low-rate, 2.4 kb/s coders are tailored to transmit no prototype phase information. It is also shown that the incorporation of A-by-S in WI allows exploitation of CELP-style perceptual weighting techniques. The conclusion is that the use of A-by-S can improve the perceptual performance of WI coders.
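The closed-loop selection at the heart of A-by-S can be sketched generically as below: each candidate excitation is passed through the synthesis filter, and the candidate whose synthesized output best matches the target wins. The one-pole filter and pulse candidates are invented for illustration; the paper's WI-specific prototype handling and perceptual weighting are omitted.

```python
# Generic analysis-by-synthesis selection sketch (not the paper's coder).

def synth(excitation, a=0.6):
    """Toy all-pole 'synthesis filter': y[n] = x[n] + a*y[n-1]."""
    y, prev = [], 0.0
    for x in excitation:
        prev = x + a * prev
        y.append(prev)
    return y

def best_candidate(candidates, target):
    """Closed loop: synthesize every candidate, keep the closest to target."""
    def err(c):
        return sum((s - t) ** 2 for s, t in zip(synth(c), target))
    return min(range(len(candidates)), key=lambda i: err(candidates[i]))

# Single-pulse excitation candidates at different positions.
cands = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
target = synth(cands[1])          # target produced by candidate 1
print(best_candidate(cands, target))  # prints 1
```

Choosing parameters by comparing synthesized output against the target, rather than matching parameters open-loop, is exactly what lets CELP-style perceptual weighting be folded into the error measure.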
Parry, J.J., Burnett, I.S. & Chicharo, J.F. 1997, 'Cross-language performance study of vector quantization', IEEE Workshop on Speech Coding for Telecommunications Proceedings, pp. 79-80.
This paper investigates the performance of split Vector Quantization (VQ) of Line Spectral Frequencies (LSFs) across a set of 10 modern languages. Spectral quantization accounts for a significant portion of the bit allocation in low-rate speech coding. Split VQ of LSFs can achieve transparent quantization of the Linear Prediction Coefficients at 24 bits/frame. In this work, codebooks were trained on individual languages and cross-language VQ performance was measured using Spectral Distortion (SD). The results show that the spectral structure of the codebook training language influences the performance of the VQ. The number of bits/frame required for transparent speech varied by as much as 2 bits across languages.
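Split VQ itself can be sketched as below: the LSF vector is split into sub-vectors, each quantised independently against its own codebook by nearest-neighbour search. The codebook entries here are made up; real codebooks are trained on speech, per language, as the paper investigates.

```python
# Illustrative split VQ of a 10-dimensional LSF vector (5+5 split).
# Codebook values are invented; a trained codebook would have many entries.

def nearest(codebook, v):
    """Nearest-neighbour search under squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, v)))

def split_vq(lsf, codebooks, split=5):
    """Quantise each sub-vector independently and concatenate the results."""
    parts = [lsf[:split], lsf[split:]]
    return [x for part, cb in zip(parts, codebooks) for x in nearest(cb, part)]

cb_low = [[0.1, 0.2, 0.3, 0.4, 0.5], [0.15, 0.25, 0.35, 0.45, 0.55]]
cb_high = [[0.6, 0.7, 0.8, 0.85, 0.9], [0.55, 0.65, 0.75, 0.8, 0.95]]
lsf = [0.12, 0.22, 0.33, 0.41, 0.52, 0.61, 0.69, 0.79, 0.84, 0.91]
q = split_vq(lsf, [cb_low, cb_high])
print(q)
```

Splitting the vector keeps codebook search and storage tractable at the cost of ignoring correlation between the sub-vectors, which is why the split sizes and bit allocation per sub-vector matter.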
Burnett, I.S. & Parry, J.J. 1996, 'On the effects of accent and language on low rate speech coders', International Conference on Spoken Language Processing, ICSLP, Proceedings, pp. 291-294.
Telecommunications networks are exposed to a plethora of accents and languages. Fundamental to current and future systems are low rate speech coders. This paper examines the problems associated with speech coding of different languages and accents. Our investigations show that most low-rate (8 kb/s and below) speech coders show bias towards non-accented English. When the coders are used for heavily accented English or other languages, significant performance degradation is noted. This paper examines the reasons for such variations and some approaches for improving coder performance.