Scale And Shift Invariant Time/Frequency Representation Using Auditory Statistics: Application To Rhythm Description
Ugo Marchand, IRCAM
Geoffroy Peeters, IRCAM

In this paper, we propose two novel scale- and shift-invariant time-frequency representations of audio content. Scale invariance is a desirable property for describing the rhythm of an audio signal, since it yields the same representation for the same rhythm played at different tempi. This property can be achieved by expressing the time axis in log-scale, for example using the Scale Transform (ST). Since the frequency locations of the audio content are also important, we previously extended the ST to the Modulation Scale Spectrum (MSS). However, the MSS cannot represent the inter-relationships between the audio content in different frequency bands. To solve this issue, we propose two novel representations. The first is based on the 2D Scale Transform; the second is based on statistics (inspired by the auditory experiments of McDermott) that represent the inter-relationships between the frequency bands. We apply both representations to a rhythm class recognition task and demonstrate their benefits. We show that introducing auditory statistics yields a large increase in recognition results.
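The scale-invariance property mentioned above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard definition of the Scale Transform, D(c) = (2π)^(-1/2) ∫ x(t) t^(-jc-1/2) dt, and evaluates it by the substitution t = exp(u), which turns it into a Fourier integral over log-time. The names `scale_transform`, `pattern` and `scaled`, the toy signal, the scale factor `a`, and the grid parameters are all hypothetical choices made for illustration.

```python
import numpy as np

def scale_transform(x, c, u_min=-12.0, u_max=12.0, n=4096):
    """Approximate D(c) = (2*pi)^(-1/2) * integral of x(t) t^(-jc - 1/2) dt.

    With t = exp(u) the integral becomes the Fourier integral of
    x(exp(u)) * exp(u / 2) over u, evaluated here as a Riemann sum.
    """
    u = np.linspace(u_min, u_max, n)        # uniform grid in log-time
    du = u[1] - u[0]
    f = x(np.exp(u)) * np.exp(u / 2)        # x(e^u) e^(u/2)
    kernel = np.exp(-1j * np.outer(c, u))   # e^(-jcu), one row per scale c
    return kernel @ f * du / np.sqrt(2 * np.pi)

def pattern(t):
    # Toy signal (assumption): two smooth bumps placed in log-time.
    return np.exp(-8 * np.log(t) ** 2) + 0.5 * np.exp(-8 * (np.log(t) - 1) ** 2)

a = 1.5  # time-scaling factor, standing in for a tempo change (assumption)

def scaled(t):
    # Energy-preserving time-scaled copy: sqrt(a) * x(a * t).
    return np.sqrt(a) * pattern(a * t)

c = np.linspace(0.0, 10.0, 50)              # scale values to evaluate
D = scale_transform(pattern, c)
Da = scale_transform(scaled, c)

# Scaling the signal only changes the phase of D(c), not its magnitude.
max_diff = np.max(np.abs(np.abs(D) - np.abs(Da)))
print(max_diff)
```

Under these assumptions, the time-scaled signal has the transform a^(jc) D(c), so the two magnitude spectra agree up to numerical error while the phases differ. This is the sense in which a log-time representation can describe a rhythm independently of its tempo.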