Feature Extraction and Enriched Access Modules for Musical Audio Data


Internal Note

Feature Extraction and Enriched Access Modules for Musical Audio Data

Version 1.0 Draft
Date: 15 February 2007
Editor: QMUL
Contributors: Christian Landone, Dan Barry, Ivan Damnjanovic

Introduction

This document enumerates the modules for the extraction of musical features from recorded audio files to be integrated within the EASAIER framework. Additional modules dedicated to the implementation of Enriched Access features are also described here.

1. Architectural Notes

The EASAIER framework requires applications with DSP capability both on the content provider side (Archiver) and on the end user side (Browser/Navigator). Initially the project envisaged a complete separation between feature extraction and enriched access tools, the former being exclusively assigned to the server-side application and the latter to the client-side application. In practice, following the initial systems architecture meeting ( ), it was found that both sides of the system might benefit from a certain degree of interoperability between the two sets of tools. The following sections describe the generic feature extraction and audio processing functionality of the server- and client-side applications. (Note: these two sections are mostly generated by brainstorming and guessing, so do modify/add/complain as you see fit.)

1.1. Server Side Archiving Software

The server-side archiving application is a tool that allows content providers to manually enter and/or automatically extract meta-data from musical audio/video assets and archive them within the EASAIER system.

Audio analysis

The musical audio asset is submitted to the application and, whenever necessary, undergoes restoration. A simplified system diagram is proposed in figure 1. A compressed version of the audio asset is also generated and submitted, along with the original, to the audio files repository: this lower-quality copy can then be used by the EASAIER server to stream audio to the end user without using excessive amounts of bandwidth.
The process of sound source separation may also be performed at this stage, although there is limited confidence that it will significantly improve the performance of the musical feature extractors. However, including this algorithm within the Archiver would allow an expert operator to choose an optimal set of separation parameters uniquely associated with the audio file, which can be transmitted to the enriched access tools on the client-side application as default settings. Following restoration and source separation, the audio data goes through a number of modules for the extraction of mid- and high-level musical features that will be included in the meta-data associated with the audio file under analysis, for classification and search purposes. The modules have been divided into two categories: mid-level extractors and high-level extractors. Broadly speaking, mid-level extractors return time-synchronous (frame-based) information, such as harmonic and timbre profiles, chord sequences or the position of beats, and are particularly suitable for spawning transcriptions and performing similarity-based searches within the EASAIER archives. High-level extractors, on the other hand, aim to describe global, and mostly single-valued, information regarding a piece of music, such as the tempo, meter, global key, mode or the presence of a particular instrument within the audio file. These descriptors can be employed to perform a parameter-based search such as: find an audio file exhibiting a tempo of 120 bpm in 4/4 time and containing the instrument conga.

The mid-level features are extracted by the relevant algorithm (see section 2 for a description) and stored in a suitable format (TBD) in a repository within the EASAIER system. As well as being utilised by the server for search purposes, these features can also be used by the client-side navigation and playback tool to provide specialised visualisations of the music under analysis (e.g. an intensity envelope) and markers on points of interest within the waveform (e.g. position of beats, verse/chorus boundary, etc.). Mid-level descriptors are also used within the archiving application by a second level of software modules for the generation of high-level features. Unfortunately, high-level feature extractors are not robust enough at this stage of development to guarantee absolute consistency; hence we envisage the use of a reliability metric that can prompt the operator to double-check the results and, if necessary, to manually populate the relevant high-level tags.

[Figure 1: server side musical audio archiving. The input PCM audio file is compressed and both versions are sent to the audio assets repository; de-noising/restoration and source separation (with optimal parameters recorded) feed the mid-level feature extractors and transcription, which in turn feed the high-level feature extractors; manual tags, the reliability metric and the extracted features are sent to the metadata repository.]

Video analysis

The video asset is submitted to the EASAIER server and undergoes the necessary transcoding process. A compressed version of the video asset is generated and submitted, along with the original, to the video files repository; this lower-quality copy can then be used by the EASAIER server to stream video to the end user without using excessive amounts of bandwidth. In this process, the audio stream is extracted from the video for the purpose of the audio analysis given in figure 1. The video stream then undergoes automatic analysis as shown in figure 2. All these processes on the video/audio assets will be accomplished using open source software, such as ffmpeg [FFMPEG]. ffmpeg is known as one of the fastest and most reliable open source transcoding tools, integrating the majority of popular audio/video codecs.

[Figure 2: server side video archiving. The input video file is compressed into a streaming version (e.g. MPEG-4) and stored in the multimedia assets repository; the extracted PCM audio stream is analysed as in figure 1; video segmentation and keyframe extraction produce keyframes and temporal data; keyframe analysis and manual annotation produce the extracted features and video segment metadata sent to the metadata repository.]

QMUL will also provide video segmentation and keyframe extraction modules. The modules take as input video in MPEG-2 format and output temporal information about the start and duration of video segments, as well as keyframe images and their positions within the video file. The modules are already available as Linux binaries, and cross-platform versions are in development. In the current implementation, only one feature is extracted for each video frame, the ColorLayout. ColorLayout is a simple representation of the layout of colour within a frame, using a DCT to represent the feature. One DCT is created for each colour component (one luminance and two chrominance components in the case of a video frame). A difference metric for each component involves taking the weighted Euclidean distance between each DCT value in each colour component. This leads to fast matching, and scalability can be improved by using fewer DCT values and sacrificing accuracy. The resulting feature vector can be used for a variety of applications.
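As a rough numpy sketch of this kind of descriptor and its difference metric (not the MPEG-7 reference implementation; the coefficient count, weights and cut threshold below are illustrative assumptions):

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II built from an explicit DCT matrix (numpy only)."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m @ block @ m.T

def color_layout(channel, ncoeff=6):
    """Average-downsample one colour component to an 8x8 grid, DCT it and
    keep the first `ncoeff` low-frequency coefficients (raster order is
    used here for brevity instead of the usual zig-zag scan)."""
    h, w = channel.shape
    grid = channel[: h - h % 8, : w - w % 8]
    grid = grid.reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3))
    return dct2(grid).ravel()[:ncoeff]

def frame_distance(f1, f2, weights):
    """Weighted Euclidean distance between two ColorLayout-style vectors."""
    return float(np.sqrt(np.sum(weights * (f1 - f2) ** 2)))

def detect_cuts(distances, threshold=2.0):
    """Flag abrupt shot changes as peaks in the frame-to-frame distance:
    a frame is a cut candidate when its distance to the previous frame is
    a local maximum and exceeds `threshold` times the mean distance."""
    cuts = []
    for i in range(1, len(distances) - 1):
        if (distances[i] > distances[i - 1]
                and distances[i] >= distances[i + 1]
                and distances[i] > threshold * np.mean(distances)):
            cuts.append(i)
    return cuts
```

In a real matcher the per-coefficient weights would emphasise the low-frequency DCT values, and truncating `ncoeff` trades accuracy for speed exactly as described above.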
Simple shot cuts can be detected by looking for peaks in the rate of change of the feature between consecutive frames, which produces a robust method for detecting abrupt shot changes that is reasonably accurate even in sequences with high visual activity. At present, we are working on expanding the feature set used for cut detection and keyframe extraction, and on more sophisticated difference metrics, such as N-Cut (Normalized Cut). The extracted keyframes are further processed in order to extract a set of MPEG-7 low-level descriptors [MPEG7], which will be used in the EASAIER cross-retrieval engine, in addition to audio similarity searches, to expand searches to non-audio assets. For this purpose, the MPEG-7 eXperimentation Model (XM) software [MP7XM] will be used. This is the standard reference software of the MPEG standardisation body; it is open source, and both Linux and Windows versions exist and have been tested and used at QMUL. The starting set of features that will be extracted for the purposes of EASAIER is defined in the EASAIER metadata document [EMD2006] and Deliverable 3.1 [ED312006], but is still to be refined during the implementation and testing phases of the EASAIER project.

1.2. Client Side Search and Browsing Software

The end user will be able to access the content of the EASAIER archive by means of an application (figure 3) that can retrieve an audio asset and its associated meta-data using a variety of non-mutually-exclusive query methodologies, such as:

- Queries based on general tags: i.e. find material by author/title, genre and year.
- Musical parameter-based queries: i.e. find songs by key, orchestration, tempo range.
- Similarity-based queries: i.e. once a musical audio asset has been retrieved, find other assets that exhibit some degree of similarity in terms of macroscopic structure, timbre and harmonic profile.

The audio is delivered by the server (either by streaming or by download of the entire compressed file) to the client application, then buffered and converted to a suitable format for further processing and visualisation of its time-domain waveform. Following the decoding stage, a suite of real-time audio processing modules allows restoration, source separation and enhancement of the incoming audio stream. The associated meta-data retrieved from the server contains a set of default parameters for both the source separation and restoration algorithms; alternatively, the user can override these parameters manually through an advanced menu/interface on the client application (enriched access UI). The default source separation parameters can be associated with the tags generated by the instrument recognition algorithms to provide a click-and-play list of the various orchestral components of the musical audio asset.
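To make the parameter-based query concrete, here is a minimal sketch of filtering assets on high-level descriptors. The record fields and function name are illustrative only and do not reflect the actual EASAIER metadata schema or query engine:

```python
# Hypothetical metadata records; field names are illustrative only.
assets = [
    {"id": "qmul-0001", "tempo": 118.7, "time_signature": "4/4",
     "global_key": "C major", "instruments": {"conga", "piano", "bass"}},
    {"id": "qmul-0002", "tempo": 96.0, "time_signature": "3/4",
     "global_key": "A minor", "instruments": {"violin", "cello"}},
]

def parametric_search(assets, tempo=None, tolerance=5.0,
                      time_signature=None, instrument=None):
    """Filter assets on high-level descriptors: a tempo within +/- tolerance
    bpm, an exact time signature, and the presence of an instrument tag."""
    hits = []
    for a in assets:
        if tempo is not None and abs(a["tempo"] - tempo) > tolerance:
            continue
        if time_signature is not None and a["time_signature"] != time_signature:
            continue
        if instrument is not None and instrument not in a["instruments"]:
            continue
        hits.append(a["id"])
    return hits

# The example query from section 1.1: ~120 bpm, 4/4, containing a conga.
print(parametric_search(assets, tempo=120, time_signature="4/4",
                        instrument="conga"))  # → ['qmul-0001']
```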
A time-scale modification algorithm that can be operated in real time by the user is included in the enriched access set of tools, allowing the user to slow down or speed up the audio playback without affecting the pitch content. As well as providing default operational parameters to the enriched access tool set, the meta-data also contains:

1) General and music-specific tags providing comprehensive information regarding the audio asset under analysis (displayed in the Browsing and Searching UI).

2) Mid-level features that can be used to deliver technical visualisations of the audio asset, as well as markers for advanced playback and looping functionalities (displayed in the Looping and Visualisation UI).

Although high- and mid-level musical descriptors are generated by the archiving application on the server side, an enhancement to the functionality offered by the EASAIER system can be identified in the ability to provide similarity-based searches using audio files residing on the client's hard drive.

As shown at the bottom of figure 3, this functionality will require the deployment of a scaled-down version of the archiving application, allowing the generation of data that can be used to search the contents of the EASAIER server.

[Figure 3: client side musical audio browser/navigator. The EASAIER server streams audio to the client, where it is buffered and decoded before passing through source separation, de-noising/restoration, equalisation and time/pitch-scale modification on the way to the audio output; the meta-data supplies default enriched access parameters, high-level features, textual/general tags and mid-level features to the Browsing & Searching, Looping & Visualisation and Enriched Access UIs; the query engine passes search methods and parameters to the server; a Mini Archiver applies the mid- and high-level feature extractors and transcription to local audio files for query purposes.]

2. Software Modules

The software modules described in this section are included in the following EASAIER work packages:

1) WP4 Sound Object Representation: This work package deals with the identification of features within the archived audio assets. As far as musical audio is concerned, the tools will enable the extraction of high- and mid-level descriptors for classification and search purposes, as well as modules capable of providing information regarding the musical structure of the audio asset for visualisation and looping purposes.

2) WP5 Enriched Access Tools: Tools developed within this work package will allow the user to apply useful modifications to the audio content at access time and in real time, enabling an enriched exploration of the musical audio asset.

Enriched access

Time-scale Modification / Pitch-scale Modification

Provided by DIT: The TSM algorithm will allow the user to vary the playback rate of the audio in real time without affecting the local pitch content. The module will use both time-domain and frequency-domain algorithms. The appropriate algorithm will be chosen automatically depending on metadata provided with the audio content. The user should also be able to choose the algorithm manually. Pitch-scale modification independent of the time base is achievable in a similar manner.

Provided by QMUL: An alternative TSM algorithm based on a phase vocoder implementation. The algorithm allows for excellent transient preservation and robust stereo performance, but requires a priori knowledge of transients within the audio file; this can be provided by the extracted mid-level features.

Sound Source Separation

Provided by DIT: A real-time separation algorithm which is capable of separating multiple sources from 2-channel mixtures. At present this tool requires the user to set some parameters based on visual and audio feedback from the GUI in order to achieve meaningful separations. This version of the algorithm will be deployed as an enriched access tool for WP5. An automated version of this algorithm may also be provided as a pre-processor for transcription and instrument recognition in WP4.
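The document does not specify DIT's separation method; purely as a toy illustration of the general idea behind separating sources from a 2-channel mixture by their stereo position, the following sketch masks STFT bins by their inter-channel intensity ratio (all names and parameter values are illustrative, not the actual tool's API):

```python
import numpy as np

def pan_mask_separate(left, right, target_ratio, width=0.1, nfft=1024, hop=512):
    """Toy pan-based separation: in each STFT frame, keep only the bins
    whose inter-channel magnitude ratio |L| / (|L| + |R|) is close to the
    target ratio (0 = hard right, 0.5 = centre, 1 = hard left).
    An illustration of azimuth-style masking, not DIT's algorithm."""
    window = np.hanning(nfft)
    out = np.zeros(len(left))
    for start in range(0, len(left) - nfft + 1, hop):
        l = np.fft.rfft(window * left[start:start + nfft])
        r = np.fft.rfft(window * right[start:start + nfft])
        mag_l, mag_r = np.abs(l), np.abs(r)
        ratio = mag_l / (mag_l + mag_r + 1e-12)
        mask = (np.abs(ratio - target_ratio) < width).astype(float)
        # Resynthesise only the masked bins of the mono sum.
        mono = 0.5 * (l + r) * mask
        out[start:start + nfft] += window * np.fft.irfft(mono, nfft)
    return out
```

The user-adjustable parameters mentioned above correspond, in this toy version, to the target azimuth and mask width; real algorithms refine this with gain-scaling cancellation and overlap handling between sources.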
Some other work on single-channel separation is ongoing within the group at DIT.

Equalisation and Noise Reduction

Provided by DIT: DIT may also be able to provide some rudimentary real-time noise reduction and equalisation tools for the purposes of audio enhancement. QMUL will provide support in the generation of libraries for these tools.

Sound Object Representations

Segmentation

Provided by DIT: Some segmentation routines, such as a Novel Event Detector, which may be incorporated if desired.

Provided by QMUL: A module for the segmentation and thumbnailing of recorded musical audio using a hierarchical timbre model (SoundBite) is available.

Mid-Level Features and Music Transcription

Provided by DIT: The transcription algorithm will perform a non-real-time analysis which will result in a musical transcription of the audio content. Harmony features may also be extracted during this analysis. It is also intended that some time-aligned visual indication of harmony be provided. Alternative representations of transcribed audio will also be provided, such as melodic contours for the purposes of melodic similarity queries. This tool will be deployed at the server side and will provide metadata for the purposes of indexing. The tool may also be deployed at the client side for the case where the user wishes to query by example, where the example audio comes from outside the database.

Provided by QMUL: The Centre for Digital Music can provide the following Feature Extraction Modules:

Detection Function: A module for the generation of a function describing the local structure of an audio signal.

Peak Picking: A module for the estimation of onsets from the detection function. Also contains a class for detection function processing.

Onset Detection: A module for estimating onsets from audio files, incorporating the detection function and peak picking classes.

Multi-Band Onset Detection: (Released after ) A module for estimating tonal and percussive onsets from audio files.

Chroma Class: A module for logarithmic frequency analysis.

Beat Tracker: A module for beat tracking of musical audio.

Harmonic Change Detection Function (HCDF): A module for the detection of harmonic change in musical audio files.

Chord Estimation: (Ongoing Research) A module for the estimation of musical chords from audio files.

Harmonic Content Estimation: The module is intended to provide a mid-level representation of the harmonic and rhythmic information from audio files.
The algorithm returns a robust description of musical attributes that is intended to be used for similarity matching rather than for transcription and information retrieval.

Key Estimation: (Ongoing Research) A module for the estimation of the key in a musical file (frame-based).

High-Level Features

Tempo Estimator: (Ongoing Research) The module estimates tempo from a musical audio file using information returned by the beat tracking algorithm.

Meter Estimator: (Ongoing Research) The module estimates the time signature from a musical audio file using information returned by the beat tracking algorithm.

Global Key Estimator: (Ongoing Research) A module for the estimation of the predominant key in a musical file using information returned by the frame-based key estimation algorithm.

Musical Instrument Recognition

Provided by DIT: DIT has very recently begun work in this field. We expect to be able to integrate this work into EASAIER at a later stage. QMUL will provide legacy code (Instrument Identification Libraries) and knowledge gained from previous research carried out at the Centre for Digital Music.
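As a compact sketch of how the Detection Function / Peak Picking / Onset Detection chain listed above fits together, using spectral flux as the detection function (one of the techniques surveyed in [JPB2005]); the thresholds and function names below are illustrative assumptions, not the modules' actual API:

```python
import numpy as np

def detection_function(audio, nfft=1024, hop=512):
    """Spectral-flux detection function: the sum of positive magnitude
    increases between consecutive STFT frames."""
    window = np.hanning(nfft)
    frames = [np.abs(np.fft.rfft(window * audio[i:i + nfft]))
              for i in range(0, len(audio) - nfft + 1, hop)]
    df = np.zeros(len(frames))
    for n in range(1, len(frames)):
        diff = frames[n] - frames[n - 1]
        df[n] = np.sum(diff[diff > 0])
    return df

def pick_peaks(df, median_span=7, delta=0.1):
    """Peak picking with a moving-median adaptive threshold: a frame is an
    onset when it is a local maximum and exceeds the local median by a
    fraction of the global maximum (the module's quadratic-fit refinement
    is omitted here)."""
    onsets = []
    for n in range(1, len(df) - 1):
        lo, hi = max(0, n - median_span), min(len(df), n + median_span + 1)
        threshold = np.median(df[lo:hi]) + delta * np.max(df)
        if df[n] > df[n - 1] and df[n] >= df[n + 1] and df[n] > threshold:
            onsets.append(n)
    return onsets

def detect_onsets(audio, sr, nfft=1024, hop=512):
    """Complete onset estimator: onset times in seconds, with the time
    base relative to the original audio file, as in the Onset Detection
    module above."""
    df = detection_function(audio, nfft, hop)
    return [n * hop / sr for n in pick_peaks(df)]
```

A multi-band variant would run the same chain on each output of a constant-Q filterbank, discriminating tonal from percussive onsets by which sub-bands fire.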

3. Current Status of Software Modules

Each entry below lists the module's scope (type of feature), underlying technology with references, input/output, implementation language where known, and current development status.

Detection Function (Low Level)
- Technology: a number of techniques are covered [JF2000][JPB2005].
- Input/Output: input is a dense frequency-domain frame; output is a single value per input frame.
- Status: completed and deployed; a VAMP plugin is available.

Peak Picking (Low Level)
- Technology: the detection function undergoes DC removal, smoothing and median filtering [IK2002]; peak selection is based on a quadratic fit [reference needed].
- Input/Output: input is a detection function; output is a vector indicating the location of estimated onsets. The time base is relative to the detection function.
- Status: completed and deployed.

Onset Detection (Low Level)
- Technology: the module links the detection function and peak-picking classes to provide a complete onset estimator.
- Input/Output: input is a pointer to a location containing samples of the audio file under analysis; output is a vector indicating the location of estimated onsets. The time base is relative to the original audio file.
- Status: completed and deployed; a VAMP plugin is available.

Multi-Band Onset Detection (Low Level)
- Technology: the module splits the signal into four sub-bands using a constant-Q filterbank prior to onset detection [CD2004]; tonal and percussive components are discriminated on the basis of the presence of onsets in the different sub-bands [ER2005].
- Input/Output: outputs are vectors indicating the location of estimated tonal and percussive onsets. The time base is relative to the original audio file.
- Status: a version is available.

Chroma (Low Level)
- Technology: based on an FFT, utilises a sparse kernel approach for the calculation of a constant-Q transform; the chromagram (HPCP) is then calculated from the constant-Q data [JB1991][JB1992][CH2005].
- Input/Output: output is a dense matrix containing the chromagram bins of the file under analysis. The time base depends on the resolution of the constant-Q transform.
- Status: complete; a version is deployed but needs revision. A VAMP plugin is available.

Beat Tracker (Mid Level)
- Technology: beat times are recovered by passing the output of an onset detection function through comb filterbank matrices to identify the beat period and alignment; the module uses a two-state model for tracking tempo changes and for maintaining continuity within a single tempo hypothesis [MD2004][MD2005].
- Input/Output: output is either a sparse vector with the non-zero elements denoting an estimated beat, or a vector containing the temporal location of the identified beats. The time base is relative to the original audio file.
- Status: a version is deployed; a VAMP plugin is available.

Harmonic Change Detection Function (HCDF) (Low/Mid Level)
- Technology: a 12-bin chromagram is mapped to a 6-D space using a tonal centroid transform and smoothed using a Gaussian window; the HCDF is defined as the rate of change of the smoothed tonal centroid signal. Transition times between harmonically stable regions can be obtained by peak-picking the HCDF.
- Input/Output: output is a dense vector representing the peak change between tonal centroid frames.
- Status: complete; a version is deployed. A VAMP plugin is available.

Chord Estimation (Mid Level)
- Technology: the algorithm relies on a 36-bin tuned chromagram obtained from a constant-Q transform; the identification is performed using chord templates [CH2005][MC2004][BP2002].
- Input/Output: standard I/O; output is a sequence of estimated chord symbols.
- Status: a complete C implementation is not currently available.

Key Estimation (Mid/High Level)
- Technology: the key space is modelled by a 24-state HMM; each state represents one of the 24 major or minor chords, and each observation represents a chord transition [KN2006].
- Input/Output: standard I/O; input is a sequence of estimated chord symbols; output is the estimated key, either on a frame or a per-track basis.
- Status: a complete C implementation is not currently available.

Harmonic Content Estimation (Mid Level, Similarity Retrieval)
- Technology: a 36-bin tuned chromagram is averaged between detected beats; the resulting averaged chromagram is further reduced to 12 bins by summing all three bins for each pitch class. The state transition matrix, mean vector and covariance matrix of an HMM are initialised using musical knowledge and selectively trained using the 12-bin chromagram; the chord sequence is then inferred from the HMM using Viterbi decoding [JPB2005A].
- Input/Output: standard I/O; the output represents a sequence of major and minor triads. The time base consists of detected beats (tactus).
- Status: a complete C implementation is not currently available.

Instrument Identification Libraries (High Level)
- Technology: the instrument identifier relies on a mono-feature timbre modelling approach, using Line Spectrum Frequencies (LSF) as the unique identifier; various classifiers are implemented, in particular k-means, Gaussian mixture models and Support Vector Machines [NC2005][NC2006][FI1975][PK1986].
- Input/Output: undetermined.
- Language: C.
- Status: unknown; code is allegedly in the DSPMac repository.

SoundBite (Similarity Retrieval, Enriched Access)
- Technology: for a given track, the space of possible timbres is divided into N timbre types, each of which generates timbre features according to a Gaussian distribution; the sequence of timbre features through the track is modelled by an N-state Hidden Markov Model whose hidden states correspond to the N timbre types. The most likely sequence of timbre types to have generated the features is Viterbi-decoded from the HMM, and the most likely segmentation is found by clustering histograms of the timbre types. The feature vector consists of the first 20 PCA components extracted from the normalised constant-Q spectrum of the audio under analysis, along with the normalised envelope; the analysis hop size is chosen as the estimated beat length of the audio under analysis [ML2006A][ML2006B].
- Input/Output: output is a sequence of labelled segments.
- Language: C.
- Status: a C demonstrator is available for Mac OS X; a VAMP plugin is available.

Tempo & Meter (High Level)
- Technology: the algorithm is based on the beat tracker described above; meter estimation is currently limited to 4/4 and 3/4. The tempo value is estimated by analysing the beat histogram generated using tempo tracking across the audio file, and a measure of reliability can be inferred from the distribution of bins in the histogram.
- Input/Output: outputs are a histogram of detected tempos, the estimated main tempo and the estimated time signature.
- Status: code completed and deployed; some further experimental work is needed.

Time-Scaling (Enriched Access)
- Technology: time scaling is performed using an FFT-based phase vocoder; percussive onsets are identified using a multi-band onset detection algorithm, and only steady-state portions of the signal are time-scaled, thus preserving the integrity of transients. Coherence in stereo signals is maintained by using a single reference channel for the identification of transient and steady-state frames [ER2005].
- Input/Output: output is the time-scaled audio data.
- Status: a non-optimal implementation is available.
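For reference, the textbook phase-vocoder core behind the Time-Scaling entry can be sketched in a few lines of numpy. This is a minimal sketch only: the transient-preservation and stereo-coherence logic of the actual module is omitted, and all names and parameters are illustrative:

```python
import numpy as np

def phase_vocoder_stretch(audio, rate, nfft=1024, hop=256):
    """Plain FFT-based phase-vocoder time stretch (rate > 1 slows down).
    Magnitudes are resampled along the frame axis while a running
    synthesis phase integrates the measured per-bin phase increment."""
    window = np.hanning(nfft)
    # Analysis STFT.
    frames = np.array([np.fft.rfft(window * audio[i:i + nfft])
                       for i in range(0, len(audio) - nfft, hop)])
    # Expected phase advance per hop for each bin.
    omega = 2 * np.pi * np.arange(nfft // 2 + 1) * hop / nfft
    positions = np.arange(0, len(frames) - 1, 1.0 / rate)
    phase = np.angle(frames[0])
    out = np.zeros(int(len(positions) * hop + nfft))
    for k, pos in enumerate(positions):
        i = int(pos)
        frac = pos - i
        # Linearly interpolated magnitude between neighbouring frames.
        mag = (1 - frac) * np.abs(frames[i]) + frac * np.abs(frames[i + 1])
        spec = mag * np.exp(1j * phase)
        out[k * hop:k * hop + nfft] += window * np.fft.irfft(spec, nfft)
        # Phase increment: expected advance plus the wrapped deviation.
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += omega + dphi
    return out
```

The module described above improves on this core by leaving transient frames untouched (using the multi-band onset detector) and by driving both stereo channels from one reference channel's transient/steady-state decisions.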

4. References

[JF2000] J. Foote, "Automatic audio segmentation using a measure of audio novelty", in Proc. IEEE Int. Conf. Multimedia and Expo (ICME 2000), vol. I, New York, Jul. 2000.
[JPB2005] J. P. Bello et al., "A tutorial on onset detection in music signals", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, September 2005.
[IK2002] I. Kauppinen, "Methods for detecting impulsive noise in speech and audio signals", in Proc. 14th Int. Conf. Digital Signal Processing (DSP2002), vol. 2, Santorini, Greece, Jul. 2002.
[CD2004] C. Duxbury, "Signal Models for Polyphonic Music", PhD Thesis, 2004.
[ER2005] E. Ravelli et al., "Fast implementation for non-linear time-scaling of stereo signals", in Proc. 8th Int. Conference on Digital Audio Effects (DAFx'05), Madrid, Spain, September 20-22, 2005.
[JB1991] Judith C. Brown, "Calculation of a constant Q spectral transform", Journal of the Acoustical Society of America, vol. 89, no. 1, 1991.
[JB1992] Judith C. Brown and Miller S. Puckette, "An efficient algorithm for the calculation of a constant Q transform", Journal of the Acoustical Society of America, vol. 92, no. 5, 1992.
[CH2005] C. Harte and M. B. Sandler, "Automatic chord identification using a quantised chromagram", in Proc. 118th AES Convention, Barcelona, Spain, May 2005.
[MD2004] M. E. P. Davies and M. D. Plumbley, "Causal tempo tracking of audio", in Proc. 5th International Symposium on Music Information Retrieval, October 2004.
[MD2005] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model", in Proc. ICASSP, Philadelphia, USA, March 18-23, 2005.
[CH2006] C. Harte, M. Gasser and M. B. Sandler, "Detecting harmonic change in musical audio", in Proc. AMCMM'06, Santa Barbara, USA, October 27, 2006.
[MC2004] Markus Cremer and Claus Derboven, "A system for harmonic analysis of polyphonic music", in Proc. AES 25th International Conference, London, UK, 2004.
[BP2002] Bryan Pardo and William P. Birmingham, "Algorithms for chordal analysis", Computer Music Journal, vol. 26, no. 2, 2002.
[KN2006] Katy Noland and Mark Sandler, "Key estimation using a hidden Markov model", in Proc. ISMIR, Victoria, Canada, 2006.
[JPB2005A] J. P. Bello and J. Pickens, "A robust mid-level representation for harmonic content in music signals", in Proc. 6th International Symposium on Music Information Retrieval, London, 2005.
[PK1986] P. Kabal and R. P. Ramachandran, "The computation of line spectral frequencies using Chebyshev polynomials", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-34, no. 6, 1986.
[FI1975] F. Itakura, "Line spectrum representation of linear predictive coefficients of speech signals", J. Acoust. Soc. Amer., vol. 57, S35, 1975.
[NC2005] N. Chetry et al., "Musical instrument identification using LSF and K-means", in Proc. AES 118th Convention, Barcelona, Spain, May 2005.
[NC2006] N. Chetry, "Computer Models for Musical Instrument Identification", PhD Thesis, 2006.
[ML2006A] M. Levy et al., "New methods in structural segmentation of musical audio", in Proc. EUSIPCO 2006.
[ML2006B] M. Levy et al., "Extraction of high-level musical structure from audio data and its application to thumbnail generation", in Proc. ICASSP 2006.
[FFMPEG]

[MPEG7] ISO/IEC JTC1/SC29/WG11, "Information Technology - Multimedia Content Description Interface - Part 3: Multimedia Description Schemes", ISO/IEC FDIS.
[MP7XM] MPEG-7 eXperimentation Model (XM).
[EMD2006] Dan Barry et al., "EASAIER Metadata & Features", Internal Note, ver. 1.0 Draft.
[ED312006] EASAIER Deliverable 3.1: "Retrieval System Functionality and Specifications", ver. 1.12, November 1, 2006.


More information

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering

More information

Key Estimation Using a Hidden Markov Model

Key Estimation Using a Hidden Markov Model Estimation Using a Hidden Markov Model Katy Noland, Mark Sandler Centre for Digital Music, Queen Mary, University of London, Mile End Road, London, E1 4NS. katy.noland,mark.sandler@elec.qmul.ac.uk Abstract

More information

Music Mood Classification

Music Mood Classification Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may

More information

Music technology. Draft GCE A level and AS subject content

Music technology. Draft GCE A level and AS subject content Music technology Draft GCE A level and AS subject content July 2015 Contents The content for music technology AS and A level 3 Introduction 3 Aims and objectives 3 Subject content 4 Recording and production

More information

GCE APPLIED ICT A2 COURSEWORK TIPS

GCE APPLIED ICT A2 COURSEWORK TIPS GCE APPLIED ICT A2 COURSEWORK TIPS COURSEWORK TIPS A2 GCE APPLIED ICT If you are studying for the six-unit GCE Single Award or the twelve-unit Double Award, then you may study some of the following coursework

More information

Video Affective Content Recognition Based on Genetic Algorithm Combined HMM

Video Affective Content Recognition Based on Genetic Algorithm Combined HMM Video Affective Content Recognition Based on Genetic Algorithm Combined HMM Kai Sun and Junqing Yu Computer College of Science & Technology, Huazhong University of Science & Technology, Wuhan 430074, China

More information

Establishing the Uniqueness of the Human Voice for Security Applications

Establishing the Uniqueness of the Human Voice for Security Applications Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.

More information

UNIVERSITY OF CENTRAL FLORIDA AT TRECVID 2003. Yun Zhai, Zeeshan Rasheed, Mubarak Shah

UNIVERSITY OF CENTRAL FLORIDA AT TRECVID 2003. Yun Zhai, Zeeshan Rasheed, Mubarak Shah UNIVERSITY OF CENTRAL FLORIDA AT TRECVID 2003 Yun Zhai, Zeeshan Rasheed, Mubarak Shah Computer Vision Laboratory School of Computer Science University of Central Florida, Orlando, Florida ABSTRACT In this

More information

In: Proceedings of RECPAD 2002-12th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal

In: Proceedings of RECPAD 2002-12th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal Paper Title: Generic Framework for Video Analysis Authors: Luís Filipe Tavares INESC Porto lft@inescporto.pt Luís Teixeira INESC Porto, Universidade Católica Portuguesa lmt@inescporto.pt Luís Corte-Real

More information

3D Content-Based Visualization of Databases

3D Content-Based Visualization of Databases 3D Content-Based Visualization of Databases Anestis KOUTSOUDIS Dept. of Electrical and Computer Engineering, Democritus University of Thrace 12 Vas. Sofias Str., Xanthi, 67100, Greece Fotis ARNAOUTOGLOU

More information

Structural Health Monitoring Tools (SHMTools)

Structural Health Monitoring Tools (SHMTools) Structural Health Monitoring Tools (SHMTools) Getting Started LANL/UCSD Engineering Institute LA-CC-14-046 c Copyright 2014, Los Alamos National Security, LLC All rights reserved. May 30, 2014 Contents

More information

AMUSICAL key and a chord are important attributes of

AMUSICAL key and a chord are important attributes of IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 2, FEBRUARY 2008 291 Acoustic Chord Transcription and Key Extraction From Audio Using Key-Dependent HMMs Trained on Synthesized

More information

EDM SOFTWARE ENGINEERING DATA MANAGEMENT SOFTWARE

EDM SOFTWARE ENGINEERING DATA MANAGEMENT SOFTWARE EDM SOFTWARE ENGINEERING DATA MANAGEMENT SOFTWARE MODERN, UPATED INTERFACE WITH INTUITIVE LAYOUT DRAG & DROP SCREENS, GENERATE REPORTS WITH ONE CLICK, AND UPDATE SOFTWARE ONLINE ipad APP VERSION AVAILABLE

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

Florida International University - University of Miami TRECVID 2014

Florida International University - University of Miami TRECVID 2014 Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,

More information

Ericsson T18s Voice Dialing Simulator

Ericsson T18s Voice Dialing Simulator Ericsson T18s Voice Dialing Simulator Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso Dept. of Signals Sensors and Systems Royal Institute of

More information

City of Dublin Education & Training Board. Programme Module for. Music Technology. Leading to. Level 5 FETAC. Music Technology 5N1640

City of Dublin Education & Training Board. Programme Module for. Music Technology. Leading to. Level 5 FETAC. Music Technology 5N1640 City of Dublin Education & Training Board Programme Module for Music Technology Leading to Level 5 FETAC Music Technology 5N1640 Music Technology 5N1640 1 Introduction This programme module may be delivered

More information

Grant: LIFE08 NAT/GR/000539 Total Budget: 1,664,282.00 Life+ Contribution: 830,641.00 Year of Finance: 2008 Duration: 01 FEB 2010 to 30 JUN 2013

Grant: LIFE08 NAT/GR/000539 Total Budget: 1,664,282.00 Life+ Contribution: 830,641.00 Year of Finance: 2008 Duration: 01 FEB 2010 to 30 JUN 2013 Coordinating Beneficiary: UOP Associated Beneficiaries: TEIC Project Coordinator: Nikos Fakotakis, Professor Wire Communications Laboratory University of Patras, Rion-Patras 26500, Greece Email: fakotaki@upatras.gr

More information

BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION

BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium Peter.Vanroose@esat.kuleuven.ac.be

More information

AUDIO CODING: BASICS AND STATE OF THE ART

AUDIO CODING: BASICS AND STATE OF THE ART AUDIO CODING: BASICS AND STATE OF THE ART PACS REFERENCE: 43.75.CD Brandenburg, Karlheinz Fraunhofer Institut Integrierte Schaltungen, Arbeitsgruppe Elektronische Medientechnolgie Am Helmholtzring 1 98603

More information

FREE TV AUSTRALIA OPERATIONAL PRACTICE OP- 59 Measurement and Management of Loudness in Soundtracks for Television Broadcasting

FREE TV AUSTRALIA OPERATIONAL PRACTICE OP- 59 Measurement and Management of Loudness in Soundtracks for Television Broadcasting Page 1 of 9 1. SCOPE This Operational Practice is recommended by Free TV Australia and refers to the measurement of audio loudness as distinct from audio level. It sets out guidelines for measuring and

More information

AN ENVIRONMENT FOR EFFICIENT HANDLING OF DIGITAL ASSETS

AN ENVIRONMENT FOR EFFICIENT HANDLING OF DIGITAL ASSETS AN ENVIRONMENT FOR EFFICIENT HANDLING OF DIGITAL ASSETS PAULO VILLEGAS, STEPHAN HERRMANN, EBROUL IZQUIERDO, JONATHAN TEH AND LI-QUN XU IST BUSMAN Project, www.ist-basman.org We present a system designed

More information

Mike Perkins, Ph.D. perk@cardinalpeak.com

Mike Perkins, Ph.D. perk@cardinalpeak.com Mike Perkins, Ph.D. perk@cardinalpeak.com Summary More than 28 years of experience in research, algorithm development, system design, engineering management, executive management, and Board of Directors

More information

A TOOL FOR TEACHING LINEAR PREDICTIVE CODING

A TOOL FOR TEACHING LINEAR PREDICTIVE CODING A TOOL FOR TEACHING LINEAR PREDICTIVE CODING Branislav Gerazov 1, Venceslav Kafedziski 2, Goce Shutinoski 1 1) Department of Electronics, 2) Department of Telecommunications Faculty of Electrical Engineering

More information

DeNoiser Plug-In. for USER S MANUAL

DeNoiser Plug-In. for USER S MANUAL DeNoiser Plug-In for USER S MANUAL 2001 Algorithmix All rights reserved Algorithmix DeNoiser User s Manual MT Version 1.1 7/2001 De-NOISER MANUAL CONTENTS INTRODUCTION TO NOISE REMOVAL...2 Encode/Decode

More information

Digital Asset Management. Content Control for Valuable Media Assets

Digital Asset Management. Content Control for Valuable Media Assets Digital Asset Management Content Control for Valuable Media Assets Overview Digital asset management is a core infrastructure requirement for media organizations and marketing departments that need to

More information

SPRACH - WP 6 & 8: Software engineering work at ICSI

SPRACH - WP 6 & 8: Software engineering work at ICSI SPRACH - WP 6 & 8: Software engineering work at ICSI March 1998 Dan Ellis International Computer Science Institute, Berkeley CA 1 2 3 Hardware: MultiSPERT Software: speech & visualization

More information

Audio Content Analysis for Online Audiovisual Data Segmentation and Classification

Audio Content Analysis for Online Audiovisual Data Segmentation and Classification IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 4, MAY 2001 441 Audio Content Analysis for Online Audiovisual Data Segmentation and Classification Tong Zhang, Member, IEEE, and C.-C. Jay

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Available from Deakin Research Online:

Available from Deakin Research Online: This is the authors final peered reviewed (post print) version of the item published as: Adibi,S 2014, A low overhead scaled equalized harmonic-based voice authentication system, Telematics and informatics,

More information

Greenwich Public Schools Electronic Music Curriculum 9-12

Greenwich Public Schools Electronic Music Curriculum 9-12 Greenwich Public Schools Electronic Music Curriculum 9-12 Overview Electronic Music courses at the high school are elective music courses. The Electronic Music Units of Instruction include four strands

More information

Event Detection in Basketball Video Using Multiple Modalities

Event Detection in Basketball Video Using Multiple Modalities Event Detection in Basketball Video Using Multiple Modalities Min Xu, Ling-Yu Duan, Changsheng Xu, *Mohan Kankanhalli, Qi Tian Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore, 119613

More information

Bandwidth Adaptation for MPEG-4 Video Streaming over the Internet

Bandwidth Adaptation for MPEG-4 Video Streaming over the Internet DICTA2002: Digital Image Computing Techniques and Applications, 21--22 January 2002, Melbourne, Australia Bandwidth Adaptation for MPEG-4 Video Streaming over the Internet K. Ramkishor James. P. Mammen

More information

AUTOMATIC VIDEO STRUCTURING BASED ON HMMS AND AUDIO VISUAL INTEGRATION

AUTOMATIC VIDEO STRUCTURING BASED ON HMMS AND AUDIO VISUAL INTEGRATION AUTOMATIC VIDEO STRUCTURING BASED ON HMMS AND AUDIO VISUAL INTEGRATION P. Gros (1), E. Kijak (2) and G. Gravier (1) (1) IRISA CNRS (2) IRISA Université de Rennes 1 Campus Universitaire de Beaulieu 35042

More information

engin erzin the use of speech processing applications is expected to surge in multimedia-rich scenarios

engin erzin the use of speech processing applications is expected to surge in multimedia-rich scenarios engin erzin Associate Professor Department of Computer Engineering Ph.D. Bilkent University http://home.ku.edu.tr/ eerzin eerzin@ku.edu.tr Engin Erzin s research interests include speech processing, multimodal

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

AUTOMATIC TRANSCRIPTION OF PIANO MUSIC BASED ON HMM TRACKING OF JOINTLY-ESTIMATED PITCHES. Valentin Emiya, Roland Badeau, Bertrand David

AUTOMATIC TRANSCRIPTION OF PIANO MUSIC BASED ON HMM TRACKING OF JOINTLY-ESTIMATED PITCHES. Valentin Emiya, Roland Badeau, Bertrand David AUTOMATIC TRANSCRIPTION OF PIANO MUSIC BASED ON HMM TRACKING OF JOINTLY-ESTIMATED PITCHES Valentin Emiya, Roland Badeau, Bertrand David TELECOM ParisTech (ENST, CNRS LTCI 46, rue Barrault, 75634 Paris

More information

WATERMARKING FOR IMAGE AUTHENTICATION

WATERMARKING FOR IMAGE AUTHENTICATION WATERMARKING FOR IMAGE AUTHENTICATION Min Wu Bede Liu Department of Electrical Engineering Princeton University, Princeton, NJ 08544, USA Fax: +1-609-258-3745 {minwu, liu}@ee.princeton.edu ABSTRACT A data

More information

School Class Monitoring System Based on Audio Signal Processing

School Class Monitoring System Based on Audio Signal Processing C. R. Rashmi 1,,C.P.Shantala 2 andt.r.yashavanth 3 1 Department of CSE, PG Student, CIT, Gubbi, Tumkur, Karnataka, India. 2 Department of CSE, Vice Principal & HOD, CIT, Gubbi, Tumkur, Karnataka, India.

More information

Hardware Implementation of Probabilistic State Machine for Word Recognition

Hardware Implementation of Probabilistic State Machine for Word Recognition IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2

More information

A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song

A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song , pp.347-354 http://dx.doi.org/10.14257/ijmue.2014.9.8.32 A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song Myeongsu Kang and Jong-Myon Kim School of Electrical Engineering,

More information

Training Ircam s Score Follower

Training Ircam s Score Follower Training Ircam s Follower Arshia Cont, Diemo Schwarz, Norbert Schnell To cite this version: Arshia Cont, Diemo Schwarz, Norbert Schnell. Training Ircam s Follower. IEEE International Conference on Acoustics,

More information

Industrial IT Ó Melody Composer

Industrial IT Ó Melody Composer Overview Industrial IT Ó Melody Composer Features and Benefits Support of concurrent engineering for Control Systems Operation on Windows NT and Windows 2000 Multiple client/server architecture Off-Line

More information

Control of affective content in music production

Control of affective content in music production International Symposium on Performance Science ISBN 978-90-9022484-8 The Author 2007, Published by the AEC All rights reserved Control of affective content in music production António Pedro Oliveira and

More information

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition IWNEST PUBLISHER Journal of Industrial Engineering Research (ISSN: 2077-4559) Journal home page: http://www.iwnest.com/aace/ Adaptive sequence of Key Pose Detection for Human Action Recognition 1 T. Sindhu

More information

PCM Encoding and Decoding:

PCM Encoding and Decoding: PCM Encoding and Decoding: Aim: Introduction to PCM encoding and decoding. Introduction: PCM Encoding: The input to the PCM ENCODER module is an analog message. This must be constrained to a defined bandwidth

More information

Beethoven, Bach und Billionen Bytes

Beethoven, Bach und Billionen Bytes Meinard Müller Beethoven, Bach und Billionen Bytes Automatisierte Analyse von Musik und Klängen Meinard Müller Tutzing-Symposium Oktober 2014 2001 PhD, Bonn University 2002/2003 Postdoc, Keio University,

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

Flattening Enterprise Knowledge

Flattening Enterprise Knowledge Flattening Enterprise Knowledge Do you Control Your Content or Does Your Content Control You? 1 Executive Summary: Enterprise Content Management (ECM) is a common buzz term and every IT manager knows it

More information

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29. Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

More information

Audio Coding Algorithm for One-Segment Broadcasting

Audio Coding Algorithm for One-Segment Broadcasting Audio Coding Algorithm for One-Segment Broadcasting V Masanao Suzuki V Yasuji Ota V Takashi Itoh (Manuscript received November 29, 2007) With the recent progress in coding technologies, a more efficient

More information

Audacity 1.2.4 Sound Editing Software

Audacity 1.2.4 Sound Editing Software Audacity 1.2.4 Sound Editing Software Developed by Paul Waite Davis School District This is not an official training handout of the Educational Technology Center, Davis School District Possibilities...

More information

MP3 Player CSEE 4840 SPRING 2010 PROJECT DESIGN. zl2211@columbia.edu. ml3088@columbia.edu

MP3 Player CSEE 4840 SPRING 2010 PROJECT DESIGN. zl2211@columbia.edu. ml3088@columbia.edu MP3 Player CSEE 4840 SPRING 2010 PROJECT DESIGN Zheng Lai Zhao Liu Meng Li Quan Yuan zl2215@columbia.edu zl2211@columbia.edu ml3088@columbia.edu qy2123@columbia.edu I. Overview Architecture The purpose

More information

Internet Video Streaming and Cloud-based Multimedia Applications. Outline

Internet Video Streaming and Cloud-based Multimedia Applications. Outline Internet Video Streaming and Cloud-based Multimedia Applications Yifeng He, yhe@ee.ryerson.ca Ling Guan, lguan@ee.ryerson.ca 1 Outline Internet video streaming Overview Video coding Approaches for video

More information

Parametric Comparison of H.264 with Existing Video Standards

Parametric Comparison of H.264 with Existing Video Standards Parametric Comparison of H.264 with Existing Video Standards Sumit Bhardwaj Department of Electronics and Communication Engineering Amity School of Engineering, Noida, Uttar Pradesh,INDIA Jyoti Bhardwaj

More information

Transana 2.60 Distinguishing features and functions

Transana 2.60 Distinguishing features and functions Transana 2.60 Distinguishing features and functions This document is intended to be read in conjunction with the Choosing a CAQDAS Package Working Paper which provides a more general commentary of common

More information

Practical Tour of Visual tracking. David Fleet and Allan Jepson January, 2006

Practical Tour of Visual tracking. David Fleet and Allan Jepson January, 2006 Practical Tour of Visual tracking David Fleet and Allan Jepson January, 2006 Designing a Visual Tracker: What is the state? pose and motion (position, velocity, acceleration, ) shape (size, deformation,

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

LOW COST HARDWARE IMPLEMENTATION FOR DIGITAL HEARING AID USING

LOW COST HARDWARE IMPLEMENTATION FOR DIGITAL HEARING AID USING LOW COST HARDWARE IMPLEMENTATION FOR DIGITAL HEARING AID USING RasPi Kaveri Ratanpara 1, Priyan Shah 2 1 Student, M.E Biomedical Engineering, Government Engineering college, Sector-28, Gandhinagar (Gujarat)-382028,

More information

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

HOW MUSICAL ARE IMAGES? FROM SOUND REPRESENTATION TO IMAGE SONIFICATION: AN ECO SYSTEMIC APPROACH

HOW MUSICAL ARE IMAGES? FROM SOUND REPRESENTATION TO IMAGE SONIFICATION: AN ECO SYSTEMIC APPROACH HOW MUSICAL ARE IMAGES? FROM SOUND REPRESENTATION TO IMAGE SONIFICATION: AN ECO SYSTEMIC APPROACH Jean-Baptiste Thiebaut Juan Pablo Bello Diemo Schwarz Dept. of Computer Science Queen Mary, Univ. of London

More information

Habilitation. Bonn University. Information Retrieval. Dec. 2007. PhD students. General Goals. Music Synchronization: Audio-Audio

Habilitation. Bonn University. Information Retrieval. Dec. 2007. PhD students. General Goals. Music Synchronization: Audio-Audio Perspektivenvorlesung Information Retrieval Music and Motion Bonn University Prof. Dr. Michael Clausen PD Dr. Frank Kurth Dipl.-Inform. Christian Fremerey Dipl.-Inform. David Damm Dipl.-Inform. Sebastian

More information

Sub-class Error-Correcting Output Codes

Sub-class Error-Correcting Output Codes Sub-class Error-Correcting Output Codes Sergio Escalera, Oriol Pujol and Petia Radeva Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain. Dept. Matemàtica Aplicada i Anàlisi, Universitat

More information

STUDY OF MUTUAL INFORMATION IN PERCEPTUAL CODING WITH APPLICATION FOR LOW BIT-RATE COMPRESSION

STUDY OF MUTUAL INFORMATION IN PERCEPTUAL CODING WITH APPLICATION FOR LOW BIT-RATE COMPRESSION STUDY OF MUTUAL INFORMATION IN PERCEPTUAL CODING WITH APPLICATION FOR LOW BIT-RATE COMPRESSION Adiel Ben-Shalom, Michael Werman School of Computer Science Hebrew University Jerusalem, Israel. {chopin,werman}@cs.huji.ac.il

More information

Module 8 VIDEO CODING STANDARDS. Version 2 ECE IIT, Kharagpur

Module 8 VIDEO CODING STANDARDS. Version 2 ECE IIT, Kharagpur Module 8 VIDEO CODING STANDARDS Version ECE IIT, Kharagpur Lesson H. andh.3 Standards Version ECE IIT, Kharagpur Lesson Objectives At the end of this lesson the students should be able to :. State the

More information

HIDDEN MARKOV MODELS FOR SPECTRAL SIMILARITY OF SONGS. Arthur Flexer, Elias Pampalk, Gerhard Widmer

HIDDEN MARKOV MODELS FOR SPECTRAL SIMILARITY OF SONGS. Arthur Flexer, Elias Pampalk, Gerhard Widmer Proc. of the th Int. Conference on Digital Audio Effects (DAFx 5), Madrid, Spain, September 2-22, 25 HIDDEN MARKOV MODELS FOR SPECTRAL SIMILARITY OF SONGS Arthur Flexer, Elias Pampalk, Gerhard Widmer The

More information

Video Authentication for H.264/AVC using Digital Signature Standard and Secure Hash Algorithm

Video Authentication for H.264/AVC using Digital Signature Standard and Secure Hash Algorithm Video Authentication for H.264/AVC using Digital Signature Standard and Secure Hash Algorithm Nandakishore Ramaswamy Qualcomm Inc 5775 Morehouse Dr, Sam Diego, CA 92122. USA nandakishore@qualcomm.com K.

More information

SIPAC. Signals and Data Identification, Processing, Analysis, and Classification

SIPAC. Signals and Data Identification, Processing, Analysis, and Classification SIPAC Signals and Data Identification, Processing, Analysis, and Classification Framework for Mass Data Processing with Modules for Data Storage, Production and Configuration SIPAC key features SIPAC is

More information

Tracking and Recognition in Sports Videos

Tracking and Recognition in Sports Videos Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey mustafa.teke@gmail.com b Department of Computer

More information

Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

More information

VEHICLE LOCALISATION AND CLASSIFICATION IN URBAN CCTV STREAMS

VEHICLE LOCALISATION AND CLASSIFICATION IN URBAN CCTV STREAMS VEHICLE LOCALISATION AND CLASSIFICATION IN URBAN CCTV STREAMS Norbert Buch 1, Mark Cracknell 2, James Orwell 1 and Sergio A. Velastin 1 1. Kingston University, Penrhyn Road, Kingston upon Thames, KT1 2EE,

More information

CHANWOO KIM (BIRTH: APR. 9, 1976) Language Technologies Institute School of Computer Science Aug. 8, 2005 present

CHANWOO KIM (BIRTH: APR. 9, 1976) Language Technologies Institute School of Computer Science Aug. 8, 2005 present CHANWOO KIM (BIRTH: APR. 9, 1976) 2602E NSH Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 Phone: +1-412-726-3996 Email: chanwook@cs.cmu.edu RESEARCH INTERESTS Speech recognition system,

More information

Waves Trans-X. Software Audio Processor. User s Guide

Waves Trans-X. Software Audio Processor. User s Guide Waves Trans-X Software Audio Processor User s Guide Waves Trans-X software guide page 1 of 8 Chapter 1 Introduction and Overview The Waves Trans-X transient processor is a special breed of dynamics processor

More information

How To Use A High Definition Oscilloscope

How To Use A High Definition Oscilloscope PRELIMINARY High Definition Oscilloscopes HDO4000 and HDO6000 Key Features 12-bit ADC resolution, up to 15-bit with enhanced resolution 200 MHz, 350 MHz, 500 MHz, 1 GHz bandwidths Long Memory up to 250

More information

A HIGH PERFORMANCE SOFTWARE IMPLEMENTATION OF MPEG AUDIO ENCODER. Figure 1. Basic structure of an encoder.

A HIGH PERFORMANCE SOFTWARE IMPLEMENTATION OF MPEG AUDIO ENCODER. Figure 1. Basic structure of an encoder. A HIGH PERFORMANCE SOFTWARE IMPLEMENTATION OF MPEG AUDIO ENCODER Manoj Kumar 1 Mohammad Zubair 1 1 IBM T.J. Watson Research Center, Yorktown Hgts, NY, USA ABSTRACT The MPEG/Audio is a standard for both

More information

Security and protection of digital images by using watermarking methods

Security and protection of digital images by using watermarking methods Security and protection of digital images by using watermarking methods Andreja Samčović Faculty of Transport and Traffic Engineering University of Belgrade, Serbia Gjovik, june 2014. Digital watermarking

More information

Figure 1: Relation between codec, data containers and compression algorithms.

Figure 1: Relation between codec, data containers and compression algorithms. Video Compression Djordje Mitrovic University of Edinburgh This document deals with the issues of video compression. The algorithm, which is used by the MPEG standards, will be elucidated upon in order

More information

How To Create A Beat Tracking Effect On A Computer Or A Drumkit

How To Create A Beat Tracking Effect On A Computer Or A Drumkit Audio Engineering Society Convention Paper Presented at the 122nd Convention 2007 May 5 8 Vienna, Austria The papers at this Convention have been selected on the basis of a submitted abstract and extended

More information

Salisbury Township School District Planned Course of Study - Music Production Salisbury Inspire, Think, Learn, Grow Together!

Salisbury Township School District Planned Course of Study - Music Production Salisbury Inspire, Think, Learn, Grow Together! Topic/Unit: Music Production: Recording and Microphones Suggested Timeline: 1-2 weeks Big Ideas/Enduring Understandings: Students will listen to, analyze, and describe music that has been created using

More information

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS

More information