
MASTER'S THESIS

An Audio-to-MIDI Application in Java

Gustaf Forsberg

Luleå University of Technology
MSc Programmes in Engineering
Computer Science and Engineering

Department of Computer Science and Electrical Engineering Division of Information and Communication Technology

2009:073 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--09/073--SE


Audio and MIDI data are fundamentally different, yet intertwined in the world of computer-based music composition and production. While a musical performance may be represented in both forms, MIDI data can always be edited and modified without compromising sound quality, and musical notation can be produced from it rather straightforwardly. Thus, having a performance stored as MIDI data can sometimes be preferable to having it stored as audio data. However, in the absence of a MIDI-enabled instrument, the MIDI data would need to be generated from the audio data, which puts some rather severe restrictions on what is possible.

This thesis presents the foundation of an audio-to-MIDI application developed in Java, following an introductory discussion on pitch detection, MIDI, and the general problem of audio-to-MIDI translation. The audio-to-MIDI performance of the application is generally good for music with fairly simple sounds, but more work is needed for it to properly handle the more complex sounds expected in the typical usage scenario.


For almost as long as I can remember, music has been a central part of my life. I grew up with the music of Johann Sebastian Bach, and although I did not realize it at the time, its sublime beauty is often mirrored in the patterns and behavior of nature.

During the years I studied composition, I became increasingly aware of the mathematics of music; during the years I have been studying computer science, I have become increasingly aware of ‘the music of mathematics’.

The subject of this thesis arose from a wish to apply software engineering skills in a musical context, and also – importantly – to learn something new. I had never previously done any sound programming, which gave the practical aspect a certain appeal. Since I have not specialized in signal analysis, I needed to read up quite a bit on the theory as well. This proved to be a tremendously interesting experience, most often leading to contemplation way beyond the purely mathematical details.

In closing, I would like to thank my supervisor, Dr. Kåre Synnes, for advice and assistance throughout the work on the thesis.

Gustaf Forsberg April 2009


Table of Contents

1 INTRODUCTION

1.1 Background... 1

1.2 Thesis overview... 1

1.2.1 Purpose ... 2

1.2.2 Delimitations... 2

1.2.3 General structure... 2

2 TECHNICAL BACKGROUND

2.1 Pitch detection ... 3

2.1.1 General issues... 3

2.1.2 Time-domain methods ...5

2.1.3 Frequency-domain methods ... 5

2.1.4 Some notes on the DFT and the FFT ... 8

2.2 MIDI...10

2.2.1 Messages ...10

2.2.2 Standard MIDI files and General MIDI...11

2.3 Audio-to-MIDI ...12

2.3.1 General considerations ...12

2.3.2 Hardware solutions ...14

2.3.3 Software solutions ...14

3 DESIGN AND IMPLEMENTATION

3.1 Overview...15

3.1.1 General design ...15

3.2 Graphical user interface...17

3.2.1 Design notes...18

3.2.2 Implementation notes...18

3.3 MIDI functionality ...19

3.3.1 Design notes...19

3.3.2 Implementation notes...20

3.4 Audio functionality...21

3.4.1 Design notes...22

3.4.2 Implementation notes...22


3.5 Pitch detection functionality ...24

3.5.1 Design notes...24

3.5.2 Implementation notes...25

3.6 Audio-to-MIDI functionality...26

3.6.1 Design notes...26

3.6.2 Implementation notes...26

4 TESTS AND ANALYSIS

4.1 Test approach...27

4.2 Structured tests...27

4.2.1 Monophony...28

4.2.2 Two-voice polyphony...30

4.2.3 Three-voice polyphony...32

4.3 Further tests...34

4.3.1 Electric guitar: a jazz lick...34

4.3.2 Acoustic guitar: a chord progression...36

4.4 Test result summary ...37

5 DISCUSSION AND FUTURE WORK

5.1 Results...39

5.1.1 General application quality ...39

5.1.2 Feature set...39

5.1.3 Audio-to-MIDI functionality ...39

5.2 Feature improvements and additions ...40

5.2.1 General pitch detection improvements ...40

5.2.2 Audio and MIDI editing ...41

5.2.3 Audio and file formats...41

5.2.4 A ‘project’ format ...41

5.2.5 Volume controls ...42

5.2.6 Fast forward/rewind/pause...42

5.2.7 GUI...42

5.2.8 Instrument tuner...42

5.3 Concluding remarks ...43

APPENDICES

Appendix A: References ...45

Appendix B: List of figures ...47


1 Introduction

1.1 Background

In computer-based music creation, one is often working with two fundamentally different formats: audio and MIDI. While audio data represents the actual sound (i.e. the waveform), MIDI simply provides a protocol used to communicate performance-related information. Sound production is left to a MIDI instrument, which may be either hardware or software.

When working with a piece of music, MIDI has a number of advantages over working directly with audio data. Tasks such as adjusting the tempo or phrasing, tweaking velocities, or removing unwanted notes are trivial in MIDI, whereas in the audio case re-recording would likely be preferred.

Furthermore, due to the nature of MIDI, the step to musical notation is fairly short; in some cases the conversion is a one-step affair, although some manual editing is usually required to produce a good-looking score. As a concluding example, it could also be mentioned that MIDI provides a very space-efficient way of storing a performance.

There are, then, several situations where MIDI data may be preferable to audio data. Generally, this does not present much of a problem to a keyboard player – most keyboards today have MIDI functionality, and indeed, MIDI was designed with keyboard instruments in mind. However, the situation is quite different if an instrument lacks MIDI functionality, or if only an audio recording of a performance is available. In such cases, it would be practical to be able to translate audio data into MIDI data.

1.2 Thesis overview

Audio-to-MIDI translation is the main subject of this project, both from a theoretical and from a practical perspective. An outline of the thesis is presented below, in terms of purpose, delimitations, and information organization.


1.2.1 Purpose

The goal of this thesis is the design and implementation of a general-purpose audio-to-MIDI application. The application is not intended to let users make MIDI files from their CD or mp3 collection; rather, it should be thought of as a musician's tool, to be used for example as a quick means of transcribing improvisations.

The application aims to provide a ‘working environment’ and not just limit itself to pure audio-to-MIDI functionality. Hence, it will support audio and MIDI file handling and playback, audio recording, and other related features. The application should be quick and easy to use, so a clear and intuitive GUI is desired.

1.2.2 Delimitations

Although signal analysis is central to the subject of audio-to-MIDI translation, the area of the thesis project is in fact software engineering. The main implication of this is that the formal focus of the thesis lies on the application, rather than on the often intricate mathematical details. Nevertheless, a significant amount of time had to be dedicated to theory studies, since the author did not have any previous experience of signal analysis.

Regarding pitch detection and audio-to-MIDI, even seemingly simple musical passages can present significant difficulties and may require quite sophisticated solutions.

However, since pitch detection and audio-to-MIDI are only part of what the application does, the degree of sophistication of such functionality had to be balanced with respect to the other desired features. Thus, work had to be delimited by excluding certain features from the application and limiting the functionality of others; a more detailed discussion on conceived features and functionality is presented in chapter 5. The application should currently be considered a platform or prototype which will be further refined, since the envisioned final version lies well beyond the scope of this thesis.

1.2.3 General structure

After this brief introductory chapter, we will turn our attention to the prerequisites of an audio-to-MIDI application, discussing topics such as pitch detection and MIDI.

Following that, chapters three and four concern themselves with the design and implementation of the application, along with a series of tests to determine its general performance. The thesis concludes with a discussion on both the current state of the application and future work.


2 Technical background

2.1 Pitch detection

The problem of algorithmically identifying which notes are sounding at a given moment can range from fairly trivial to hard, or even impossible. A gentle melody played by solo violin, for example, does not need to be particularly difficult. A furious violin solo accompanied by an equally furious orchestra, on the other hand, would be quite another matter. While pitch detection may be considered solved in the monophonic case, non-monophonic methods are still an interesting research area, perhaps particularly so in conjunction with timbre identification and separation. This, however, is far beyond the scope (and purpose) of this thesis; here, we restrict our concerns to identifying one or more sounding notes without attempting to identify the instruments.

2.1.1 General issues

As we recall, the simplest pitched sound is the sine tone, with its pitch being determined solely by the frequency of the single sinusoid. When dealing with sine tones, it is trivial to determine even several simultaneous pitches, since each peak in the frequency spectrum corresponds to a separate note.

Typically, however, a pitched sound will have several periodic components (referred to as partials), differing in frequency, amplitude, and phase. In the typical pitched instrument, the frequencies of the partials align in a harmonic series. This means that the frequencies are whole-number multiples of some common fundamental frequency, and partials with this property are called harmonics. The term overtone is often used to refer to any partial – harmonic or inharmonic – other than the fundamental. To varying degrees, the presence of overtones makes pitch detection more complicated.

We may assume the pitch of a note to be determined by its fundamental frequency, although it is important to point out that pitch really is a psychoacoustic concept – it is something we perceive. There are several interesting examples of this; the Shepard scale, for instance, seems to remain within a fixed pitch interval (e.g. an octave) no matter how far we continue to ascend or descend in pitch. It is an auditory illusion, created by means of Shepard tones (basically tones which are constructed from sine tones with differing amplitudes in octaves) [1]. Another example of the psychological (and neurological) aspect of pitch is that in the case of harmonic partials, we tend to ‘hear’ the fundamental even if it is not present; this is known as periodicity pitch, missing fundamental, or subjective fundamental. It may seem like a somewhat artificial example, but the effect is used in practice, for instance in the production of deep organ tones [2].

Musically, the second and fourth harmonics lie one and two octaves above the fundamental, respectively, and the third harmonic lies a fifth above the second harmonic. Together, these intervals produce a very clean sound; indeed, much of the ‘color’ of the sound lies in the configuration of the higher harmonics. Overtones are typically not perceived as separate notes, but in some sounds they are. Even so, we hardly think of them as notes being played, but rather consider them components of the sound. In other words, they do not necessarily alter the perceived pitch.

Figure 1. Magnitude spectra of the note g with fundamental frequency approximately 196 Hz, played on an electric guitar with a clean tone (a) and a tenor crumhorn (b). Compared to the guitar, the crumhorn is notably rich in overtones; the frequency range of the plots has been limited for readability, but partials of the crumhorn continue way up to about 15 kHz. (Both panels plot magnitude (%) against frequency (Hz).)

Unless we have a case with a missing fundamental, we might be able to do monophonic pitch detection by finding the lowest frequency present, although this requires a clean signal. For non-monophonic pitch detection, we need to single out the fundamentals from the overtones. In Figure 1 (a) above, the magnitude at the fundamental frequency is the largest, and indeed, sometimes it is possible to perform pitch detection by simply finding peaks above a certain threshold in the magnitude spectrum. However, when several notes are sounding simultaneously, some partials may have common frequencies, and the results of wave superposition could easily thwart our attempts to identify fundamentals by magnitude. Also, as we can see in Figure 1 (b), it is by no means a given that the magnitude of the fundamental is the largest. In fact, not even the guitar tone of Figure 1 (a) can be assumed to always have largest magnitude at the fundamental frequency; as the string vibrates, the relative amplitudes of the partials vary.


Apart from issues that arise from a musical context, such as tone complexity and polyphony, there are several other factors which can complicate proper pitch detection. The noise level of the signal is one such factor. Some pitch detection methods are more sensitive to noise than others, and often a compromise must be reached between noise sensitivity, accuracy, and computational cost. Naturally, if there is a real-time requirement, keeping the computational cost down becomes more important. Sounds or sound phenomena originating from the recording environment (such as echoes or reverb) also complicate analysis.

We may distinguish two basic approaches to the pitch detection problem: the time-domain approach and the frequency-domain approach. In the following sections, a few examples of each approach are discussed.

2.1.2 Time-domain methods

A straightforward approach to the pitch detection problem is the zero-crossing method: if we have, for example, a sine tone, we can obtain its frequency by simply determining the zero-crossing rate (or peak rate) of the signal. This method is computationally inexpensive, but generally sensitive to noise and not very well suited for more complex signals. It may, however, be somewhat improved by means of adaptive filtering [3].
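To make the idea concrete, a minimal Java sketch of zero-crossing frequency estimation could look as follows (this is an illustration only, not code from the application); it assumes a clean, roughly sinusoidal input and simply counts upward zero crossings per second.

```java
// A minimal sketch of zero-crossing pitch estimation; assumes a clean,
// roughly sinusoidal signal in 'samples'.
public final class ZeroCrossing {

    /** Estimates the frequency of a signal by counting sign changes. */
    public static double estimateFrequency(double[] samples, float sampleRate) {
        int crossings = 0;
        for (int i = 1; i < samples.length; i++) {
            // Count transitions from negative to non-negative (one per period).
            if (samples[i - 1] < 0 && samples[i] >= 0) {
                crossings++;
            }
        }
        double seconds = samples.length / (double) sampleRate;
        return crossings / seconds; // crossings per second ~ fundamental frequency
    }
}
```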

Auto-correlation is another, quite popular, way to tackle the pitch detection problem in the time domain. The main idea is to compare a segment of the signal with a shifted version of itself; the correlation should be greatest when the shift corresponds to the fundamental period of the signal. A problem with this approach is that the accuracy tends to decrease at higher frequencies, due to periods becoming shorter and approximation errors becoming greater. This method also suffers somewhat from false detections – typically it has problems with periodic signals where the period is that of a missing fundamental [4] – and it may not, in its basic form, be well suited for polyphonic music [5].
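A correspondingly minimal sketch of auto-correlation pitch estimation is shown below; it is not taken from any particular implementation, and the search bounds minFreq and maxFreq are assumed parameters used to limit the lag range.

```java
// A hedged sketch of basic auto-correlation pitch estimation; class and
// method names are illustrative only.
public final class AutoCorrelation {

    /**
     * Estimates the fundamental frequency by finding the lag (within a
     * plausible range) at which the signal best correlates with a shifted
     * copy of itself.
     */
    public static double estimateFrequency(double[] x, float sampleRate,
                                           double minFreq, double maxFreq) {
        int minLag = Math.max(1, (int) (sampleRate / maxFreq));
        int maxLag = (int) (sampleRate / minFreq);
        int bestLag = minLag;
        double bestCorrelation = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag && lag < x.length; lag++) {
            double sum = 0.0;
            for (int i = 0; i + lag < x.length; i++) {
                sum += x[i] * x[i + lag];   // correlation with the shifted copy
            }
            if (sum > bestCorrelation) {
                bestCorrelation = sum;
                bestLag = lag;
            }
        }
        return sampleRate / (double) bestLag;
    }
}
```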

Related to auto-correlation is the average magnitude difference function (AMDF). While auto-correlation computes the product of the original signal segment and the shifted version, AMDF computes the difference. It is possible to combine the auto-correlation function and AMDF to form the weighted auto-correlation function, which is better at handling noisy signals [6].

2.1.3 Frequency-domain methods

There are several methods available to transform from the time domain to the frequency domain. The best known and most frequently used method is probably the Fourier transform, allowing for decomposition of a function into oscillatory functions. In particular, the Fourier series lets us express any periodic function as a sum of sinusoids with varying amplitudes and frequencies. These frequencies are related in that they are integer multiples of the fundamental frequency of the periodic function; in other words, they are harmonics. In the digital realm, we must work with discrete transforms. The discrete Fourier transform (DFT) is, unsurprisingly, the discrete analogue of the continuous Fourier transform. The DFT is practically always implemented as a fast Fourier transform (FFT), which produces the same result but in significantly less time.

Popularity aside, there are some issues with the Fourier approach. For example, better localization in frequency means worse localization in time, and vice versa. Moreover, since frequencies of musical notes are distributed logarithmically and the frequency bins of the FFT are distributed linearly, resolutions at low frequencies tend to be too low, and resolutions at high frequencies tend to be unnecessarily high.

Many of these issues are absent in wavelet transforms. Simply put, a wavelet transform divides the initial function into different frequency components and allows each component to be examined at a scale appropriate to it; this remedies the resolution issues of the FFT. Also, contrary to the FFT, wavelets are localized in both time and frequency, and generally do not have a problem handling discontinuities. While not (yet) widely adopted in audio processing, wavelets are often used in image processing.

There are several other transforms which may be used in audio processing, for example the constant Q transform and the discrete Hartley transform. Nevertheless, the FFT is ubiquitous, and hereafter we take the word ‘transform’ to imply the FFT.

After transformation into the frequency domain, there are several ways to estimate pitch. In very simple cases, for example monophonic music with non-complex sounds, simple frequency peak detection directly following the transform might be sufficient, as mentioned in section 2.1.1. Most often, however, a more sophisticated method is called for.

To obtain the harmonic product spectrum (HPS) [7], we begin by downsampling the spectrum a number of times, each time producing a more ‘compressed’ version. Specifically, the nth downsampled spectrum is 1/(n + 1) the size of the original spectrum. The point is to exploit the fact that the partials belong to a harmonic series, so that the first harmonic (i.e. the fundamental) in the original spectrum aligns with the second harmonic in the first downsampled spectrum, which in turn aligns with the third harmonic in the second downsampled spectrum, and so on. Thus, the number of spectra considered equals the number of harmonics considered, and the HPS is finally produced by multiplying the spectra together, with the idea of amplifying the fundamental frequencies.
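A compact sketch of this downsample-and-multiply idea might look like the following; it is not the application's PitchDetector, and it assumes a magnitude spectrum has already been computed.

```java
// A minimal sketch of the harmonic product spectrum.
public final class Hps {

    /**
     * Computes the HPS of a magnitude spectrum using the given number of
     * harmonics. Index i of the result corresponds to the same frequency
     * bin as index i of the input.
     */
    public static double[] harmonicProductSpectrum(double[] magnitudes, int harmonics) {
        int length = magnitudes.length / harmonics;
        double[] hps = new double[length];
        for (int bin = 0; bin < length; bin++) {
            double product = 1.0;
            // 'Downsampling by h' amounts to reading every hth bin of the original.
            for (int h = 1; h <= harmonics; h++) {
                product *= magnitudes[bin * h];
            }
            hps[bin] = product;
        }
        return hps;
    }
}
```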


Figure 2. Downsampling to make harmonics align. The first harmonic in the original spectrum (top) coincides with the second, third, and fourth harmonics of the downsampled spectra, respectively.

While the HPS method is quite insensitive to noise and generally works well, there may be problems with notes being detected in the wrong octave (usually one octave too high). Some extra peak analysis may help [6], but this may be difficult in polyphonic cases.

Another method of pitch-tracking is cepstral analysis. A cepstrum, first described in a 1963 paper [8], is basically the spectrum of a spectrum, obtained by taking the transform of the logarithm of the magnitude spectrum. Hence, cepstral analysis is not really carried out in the frequency domain, but actually in the quefrency domain (although the method is still generally considered to belong to the frequency-domain approach). Quefrency can be said to be a measure of time in a different sense, and peaks (rahmonics) at certain quefrencies occur due to periodicity of partials. Cepstral analysis is quite popular in speech analysis since the logarithm operation increases robustness for formants (acoustic resonances, for example of the human vocal tract), but it also leads to a raised noise level [9].

As a last example of frequency-domain methods, a combination of the HPS and cepstral methods may be a promising alternative [10]. In the aptly named cepstrum-biased harmonic product spectrum (CBHPS), we see both the noise robustness of the HPS and the robustness to pitch errors of the cepstrum [6]. Since the HPS exists in the frequency domain and the cepstrum in the quefrency domain, combining them first requires the cepstrum to be converted to frequency-domain indexing. Multiplying together the HPS and the frequency-indexed cepstrum produces the CBHPS.


2.1.4 Some notes on the DFT and the FFT

As we shall see in chapter 3, the implementation relies on the FFT for transformation into the frequency domain. While the DFT and the FFT are standard textbook material and thus shall not be covered in depth here, a brief review of some relevant aspects may be appropriate.

For a discrete sequence x of length n, the DFT is defined by

$X_k = \sum_{m=0}^{n-1} x_m \, \omega_n^{km}, \qquad k = 0, 1, \ldots, n-1,$

where

$\omega_n = e^{2 \pi i / n}$

is a primitive nth root of unity. The DFT is, by definition, a complex transform; it takes complex-valued input and produces complex-valued output. For a real-valued input, the second half of the transform will be a complex conjugate mirror of the first; that is,

$X_k = \overline{X_{n-k}}.$

Hence, in the case of real-valued input, we need only consider the first half of the transform.

Each $X_k$ corresponds to a particular frequency (or ‘frequency bin’). The distance between two frequency bins (i.e. the spectral resolution) is obtained by dividing the sample rate by the input (window) size. If the input signal contains frequencies which are not integer multiples of the spectral resolution, spectral leakage of some degree will occur, as seen in Figure 3 (a). This is basically a side effect of the discontinuities that may arise when considering a finite-length segment of (what is assumed by the transform to be) an infinite signal. Spectral leakage from one sinusoid may very well obscure another sinusoid in the signal; in particular, several notes close to each other could produce an almost unusable spectrum from a pitch detection perspective.


Figure 3. A signal with a frequency that is not an integer multiple of the spectral resolution of the transform will produce spectral leakage as seen in (a); energy has ’spilled over’ into the other frequency bins. In (b), a Hamming window was applied before transforming, with a noticeable reduction in spectral leakage effects as a result. (Both panels plot magnitude against frequency.)

With no pre-processing, the segment is a view of the signal through a rectangular window. To counter spectral leakage, we may use a non-rectangular window in order to make the segment begin and end less abruptly. In Figure 3 (b), we see the results of applying the Hamming window function

$w(m) = 0.54 - 0.46 \cos\!\left(\frac{2 \pi m}{n - 1}\right)$

to the signal segment before taking the transform. Like many window functions, the Hamming window is bell-shaped, but other shapes (such as triangular windows) are occasionally used.

From the definition of the DFT, we can see that it has an asymptotic complexity of $O(n^2)$. The FFT is a divide-and-conquer method which, by utilizing properties of the complex roots of unity, improves the complexity to $O(n \lg n)$. While the divide-and-conquer approach may hint at recursion and dependence upon the factorization of n, there are several FFT algorithms of varying kinds. Most commonly seen are iterative implementations of the Cooley-Tukey radix-2 algorithm, which requires n to be a power of two.

Although all FFT algorithms run in $O(n \lg n)$ time, naturally we wish to minimize the time taken in practice. Our audio signal will be strictly real, and if we have n real samples it might seem like we have to ‘add’ an imaginary signal consisting solely of zeros before transforming, since it is a complex transform. This is fortunately not the case. By taking every even-indexed sample to be real-valued and every odd-indexed sample to be imaginary-valued, we produce a complex input of size n/2, on which the transform is applied. From the result, the transform of the initial real sequence is obtained through a final unwrapping step (which, like the construction of the complex input from the real input, runs in linear time). Hence, a real signal of length n requires only an n/2-size transform, which is a significant improvement. For the mathematical details, see for example [11].

2.2 MIDI

MIDI (Musical Instrument Digital Interface) was created in the early 1980’s in an effort to standardize the way digital musical instruments communicated. Previously, various manufacturers had developed their own digital interfaces, and some were beginning to worry that the use (and hence sales) of synthesizers would be inhibited by the lack of compatibility. The first MIDI instrument, the Sequential Circuits Prophet-600, appeared in January 1983, soon to be followed by Roland’s JX-3P. At this time, the MIDI specification was very simple, defining only the most basic instructions. Since then, however, it has grown significantly.

The MIDI specification can be said to consist of three main parts: the message specification, the transport specification, and the file specification. Of these three, probably the most important part, and the part of our primary concern, is the message specification, or protocol.

2.2.1 Messages

A MIDI message consists of a status (or command) byte, followed by a number of data bytes. Status bytes are identified by the MSB being set. For commands in the 0x80-0xEF range (also known as channel commands), the three bits following the MSB specify the command in question and the remaining four bits specify which MIDI channel the command affects. Thus, there are 7 channel commands and 16 channels.

Among the channel commands, we find instructions for playing and manipulating notes and similar. The commands in the 0xF0-0xFF range are system commands, which are not aimed at a particular channel; rather, they are concerned with for example starting or stopping playback.

Naturally, the content of any present data bytes depends on the command with which they are associated. A program change command, for instance, is followed by one data byte, containing the number of the instrument sound (or patch) to be used. When two data bytes are used, they usually contain one separate piece of information each; for example, a note on command uses two data bytes, specifying note number and velocity, respectively. Since the MSB is used to signify whether it is a status byte or a data byte, this gives 128 possible note numbers (in comparison, a standard piano has 88 keys) and 128 different velocities (including the zero velocity). Here, 128 different values are quite sufficient, but in some cases a greater range is desired. An example is the pitch bend command, where one data byte holds the least significant bits and the other the most significant; the $2^{14} = 16384$ different values allow for very smooth pitch transitions. Channel messages always have one or two data bytes, while system messages may have zero data bytes. Thus, a MIDI message is at most three bytes in size.
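As an illustration of the message layout, a note on message can be constructed with the standard javax.sound.midi API as sketched below; the channel, note number, and velocity values are arbitrary examples.

```java
import javax.sound.midi.InvalidMidiDataException;
import javax.sound.midi.ShortMessage;

// A small illustration of a channel message: status nibble 0x9 (note on),
// channel 0, followed by two data bytes (note number and velocity).
public final class NoteOnExample {
    public static ShortMessage middleC() throws InvalidMidiDataException {
        ShortMessage noteOn = new ShortMessage();
        // Note number 60 is middle C; 96 is an arbitrary velocity.
        noteOn.setMessage(ShortMessage.NOTE_ON, 0, 60, 96);
        return noteOn;
    }
}
```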

The small message size is important for the timing. The MIDI protocol is a serial communications protocol, with a specified bandwidth of a mere 31.25 kBaud (approximately 3.8 kByte/s). There is no true simultaneity; a chord, for example, is in practice a really fast arpeggio. With a maximum message size of three bytes, well over a thousand messages can be sent per second even in the worst case, disregarding practical limitations.

2.2.2 Standard MIDI files and General MIDI

A standard MIDI file (SMF) is little more than a list of performance-related information. There are two frequently used types: type 0, which uses a single track, and type 1, where individual parts have individual tracks. There is also a type 2, which can contain multiple songs, but this is not commonly used.

Storing a performance in a file necessitates an extra piece of information: timestamps. The ‘classic’ method is tempo-based timestamping, where the timing resolution is given in pulses per quarter note (PPQ). Timestamping may also be time-based, according to SMPTE (Society of Motion Picture and Television Engineers) specifications. Here, there is a certain number of frames per second (ranging from 24 to 30) and a certain number of ticks per frame. Obviously, in order for the translation to stay true to the original performance, the resolution needs to be high enough to avoid inconsistencies.

Since no actual sound data is stored, the space required is very small, especially compared to an audio file. By accessing the file, any device or application capable of MIDI playback can replay the performance. How it actually sounds depends on the MIDI instrument utilized for playback, and this brings up the issue of uniform playback. While some MIDI instruments have very high-quality sounds, and while the program change command lets us tell the MIDI instrument to use a certain patch, it is not specified what sound we will actually get – it may differ from instrument to instrument. This means, for example, that a piece which plays back with an organ sound on one MIDI instrument might play back with drum sounds on another.

To counter this problem, General MIDI (GM) was created. While not a part of MIDI per se, GM defines specific features for MIDI instruments. For instance, with a GM instrument, we know that MIDI channel 10 is reserved for percussion sounds, and we also know that a particular note number played on this channel will always produce a particular percussion instrument sound. For other channels, we know which program number corresponds to which instrument (for example, the acoustic grand piano is always found at program number 1, the violin is always number 41, and so on). In addition to organizing instrument layout, GM also makes specifications regarding polyphony, velocity, and multitimbrality. Thus, adhering to the GM standard increases the chances of correct playback on foreign systems.

First published in 1991, GM has since been superseded by GM2 (in 1999). GM2 is fully compatible with the original GM while extending it considerably. There also exists a slimmed-down version (General MIDI ‘Lite’, GML) aimed at mobile applications, and some instrument manufacturers have introduced their own extensions and variants, for example the XG standard from Yamaha.

2.3 Audio-to-MIDI

Now, having acquainted ourselves somewhat with both pitch detection and MIDI, we can reflect a bit upon the requirements and possibilities of audio-to-MIDI functionality. As we shall see, we encounter limitations of both pitch detection and MIDI.

2.3.1 General considerations

Perhaps the most obvious issue is that it is not enough to have a well-functioning pitch detection algorithm; in order to produce a correct translation, we must also be able to tell when a note was played, and when it was released. There are a number of ways to tackle this problem. For example, we may consider spectral peaks to indicate note onsets whenever a certain threshold is exceeded. Determining that threshold may be a bit tricky in practice; for instance, since the dynamics of an instrument may vary so that notes in a certain register are naturally louder than notes in another register, the threshold value may need to vary throughout the frequency range. It may also be desired to consider changes in spectral magnitude (i.e. spectral flux) in addition to the values themselves. Other approaches to note onset detection may work with the amplitude of the signal, or changes in phase [12].

Of course, some cases are easier to handle than others. Sounds such as sine tones may be easy to deal with from a pitch detection perspective, but the unchanging nature of the notes can make proper detection of repeated notes difficult. With other sounds, it may be very easy to determine when a note begins or ends, but there may be inharmonic transients that complicate pitch detection.

Erratic note detections can arise from very slight fluctuations in frequency or amplitude. These fluctuations may be temporary and very short, and in such cases the resulting notes have very short durations (i.e. the note onset is almost immediately followed by the note offset). Thus, it may at times be possible to use note duration as a ‘mistake criterion’ in order to clean up the audio-to-MIDI translation.

Apart from note onset and offset detection, obviously we must also handle the pitch information itself. In 12-tone equal temperament, the fundamental frequency $f_k$ of the kth semitone above a note with fundamental frequency $f_0$ is given by

$f_k = f_0 \cdot 2^{k/12}, \qquad k = 1, 2, \ldots$

This makes each note have a fundamental frequency which is approximately 5.9% higher than that of the preceding semitone. The most basic approach for audio-to-MIDI translation is probably to disregard devices such as vibrato and glissando and simply match frequencies to their closest note in the equal temperament. This is of course ideal for music that is itself limited in that respect (such as piano music), but less well suited if we wish the translation to mimic the original performance in detail.
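This ‘snap to the nearest equal-tempered note’ mapping can be sketched in a few lines of Java, assuming the standard MIDI convention of note number 69 corresponding to A4 at 440 Hz (the code is illustrative, not the application's):

```java
// Mapping between detected frequencies and equal-tempered MIDI note numbers,
// assuming note 69 = A4 = 440 Hz.
public final class NoteMapping {

    /** Maps a detected frequency (in Hz) to the nearest MIDI note number. */
    public static int nearestMidiNote(double frequency) {
        double note = 69.0 + 12.0 * (Math.log(frequency / 440.0) / Math.log(2.0));
        return (int) Math.round(note);
    }

    /** The equal-tempered fundamental frequency of a given MIDI note number. */
    public static double frequencyOfNote(int midiNote) {
        return 440.0 * Math.pow(2.0, (midiNote - 69) / 12.0);
    }
}
```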

To handle for example glissando correctly, the audio-to-MIDI method must ‘know’ when a new note should be played and when to simply apply the expression to an already sounding note. This implies a certain degree of sophistication in note onset and note offset detection. We must also keep in mind that the relevant MIDI commands are channel commands; for example, pitch bend will affect all currently sounding notes on the specified channel. Thus, if we want to be able to handle polyphonic cases such as when one voice is doing a glissando while another is not, each voice needs a separate channel. Obviously, this requires that the number of voices does not exceed the number of channels.

The quick and dirty way to assign different voices to different channels would be to simply correlate channel number with pitch order. For example, we might always let the highest note be handled by channel 1, the next highest by channel 2, and so on.

There are, however, numerous problems with this approach. For one thing, if two voices cross each other, their channel numbers will no longer correspond to their pitch order. This is likely to necessitate manual corrections if for instance musical notation is to be produced. Moreover, if the crossing of voices is the result of a glissando, this approach will simply not work. Ideally, we would like to be able to track each voice and thus make sure that each note gets assigned to the correct channel, but this would most often require elaborate pattern matching and timbre identification.

As a final example of audio-to-MIDI considerations, we may take the dynamics of music; in a way, this is related to note onset detection, discussed earlier. In the simplest case, there are no dynamics to speak of; all notes are equal in volume. On the other hand, if some notes are soft and others loud, we must make sure that soft notes are not disregarded. If we, in addition, wish the dynamics to be reflected in the MIDI translation, each note must be assigned an appropriate MIDI velocity.


Several products aiming to bridge the gap between audio and MIDI exist, of varying character. For example, they may be aimed at hobbyists or professionals, they may be general-purpose or specialized for a particular instrument, and so on. We conclude this chapter with some general remarks about hardware- and software-based solutions.

2.3.2 Hardware solutions

There are two main approaches to hardware solutions to the audio-to-MIDI problem: integrated and non-integrated. Examples of the former are MIDI guitars with on-board DSPs, allowing the MIDI cable to be connected directly to the instrument. However, musicians tend to be very picky about their instruments, and most would be unhappy having to use an instrument they did not like in order to have MIDI functionality. In such cases, the non-integrated approach may be more appealing; since the processing is done externally, the instrument generally needs no modification apart from possibly mounting a special pickup. For example, stringed instruments may be fitted with a pickup that sends a separate signal for each string, greatly simplifying multi-pitch detection. As a note, hybrid solutions exist as well, where the pickup is integrated but the DSP is not.

Hardware-based solutions are particularly appealing in cases like that of stringed instruments, where the ability to process the signal from each string separately could strongly affect the quality of the result. Also, hardware solutions may be preferred if platform independence is desired, or in live settings.

2.3.3 Software solutions

Most audio-to-MIDI software comes in the form of stand-alone applications, but there are also plug-ins, intended for use within a host application. Plug-ins often have direct hardware counterparts, and are typically fairly light-weight and dedicated to a particular real-time task.

In software solutions, the GUI possibilities pave the way for numerous additional features, such as extensive editing functionality and production of musical notation.

Unless audio-to-MIDI needs to be performed in a real-time situation – for example having audio triggering MIDI events during a live performance – the direct result of the translation is often an intermediary step requiring editing. Of course, the editing itself does not really depend on whether the translation was performed by hardware or software, but it can be convenient to be able to perform all tasks using a single tool or platform.


3 Design and implementation

3.1 Overview

The application was developed in JDK6 on Windows XP, using an Intel quad-core machine with 2 GB of RAM. No reasonably modern system should have any problems running it; the only ‘real’ requirements are enough RAM to hold the audio data (a non-issue these days) and a decent sound card.

In general terms, the main features of the application are the following:

• Opening/saving audio and MIDI files

• Playback of audio/MIDI files

• Audio recording with or without metronome

• Configurable audio-to-MIDI translation

There are various configuration options available, for example allowing the user to control the sample rate and bit depth used during recording, and which window function to use for pre-processing during pitch detection. Also, the user has a number of options for controlling audio-to-MIDI behavior (such as MIDI resolution and pitch detection thresholds).

In discussion, it is sometimes difficult to clearly separate design from implementation.

In this chapter, the design discussions are typically concerned with overall structure and component interaction, while discussions on implementation may comment on for example language specifics. The design sections will contain UML diagrams to give an overview of either a component as a whole, or the specific capabilities of the component.

3.1.1 General design

The application design is based on the Model-View-Controller (MVC) pattern, where the controller is notified of relevant user interaction with the view and reacts accordingly through direct access to both the view and the model. While there are several variants of this pattern, the main point is to separate user interface from business logic.


Figure 4. MVC as employed in the application. Direct and indirect access is indicated by solid and dashed arrows, respectively.

Often in MVC, the model has no direct access to the view; instead, the view observes the model, fetching data of interest when notified of a relevant change. Another variant has the controller managing all the information flow, with no connection whatsoever between the view and the model. Except for a particular real-time case, this is the variant used in the application.

In implementation terms, simplifying slightly, the view corresponds to the GUI class, the controller to the Controller class, and the model is split into the MidiCentral and AudioCentral classes. Instantiation (and initial configuration) of these classes is the duty of the AudioToMidiApp class.

Figure 5. An overview of the application, the AudioToMidiApp launcher class excluded. The AudioCentral has access to the GUI, but otherwise all interaction between the components goes through the Controller. (The diagram shows the GUI with its ControlPanel, AudioToMidiPanel, PlottingPanel, and MenuSystem components, the Controller, the MidiCentral, and the AudioCentral with its PitchDetector, IterativeFFT, and WindowFunction implementations.)



The Controller class implements several listener interfaces, allowing it to be aware of and handle user interaction with the GUI as well as special MIDI and audio events.

Also, usage-oriented checking (such as querying the user whether a file should be saved before exiting) is handled here.

3.2 Graphical user interface

Ideally, using the application should require as little interaction as possible. Following the ‘make the common case fast’ guideline, all the basic tools needed for recording, playback, and audio-to-MIDI translation are accessible directly from the control panels on the main screen. Additional functionality is provided through menus.

Figure 6. The application during audio playback. The buttons controlling audio recording and audio-to-MIDI translation are greyed out.

While the record button is audio specific, the stop and play buttons control both audio and MIDI. Also, buttons are disabled at times when their functionality is not available. For example, as seen in Figure 6, the record button is disabled during playback, as is the audio-to-MIDI button. Both are re-enabled when playback ends.

However, as mentioned in chapter 5, such ‘user proofing’ is not yet consistently implemented.


3.2.1 Design notes

Several parts make up the graphical user interface. Apart from the main window, the important elements are the two plots, the two lower panels from which, for instance, playback and audio-to-MIDI translation are controlled, and the menu bar.

Figure 7. The GUI class provides a number of methods used for interaction with GUI components. (The diagram shows the GUI class together with its AudioToMidiPanel, PlottingPanel, ControlPanel, and MenuSystem components, and methods such as plotAudio(), plotSpectrum(), setPlaying(), setRecording(), and setStopped().)

The GUI class and its components are fully unaware of the rest of the application save for the Controller, which is registered as a listener to various GUI components.

Changes to GUI appearance and functionality are handled through direct method calls; the GUI class acts as an interface to other GUI elements, most importantly the two plots. Although there are efficiency reasons to let the model feed plot data directly to the view in this manner, an observer pattern may be a cleaner approach regarding smaller updates, and is subject to future evaluation.

3.2.2 Implementation notes

Basic Swing/AWT components are used throughout. The main application window is provided by the GUI class, which extends javax.swing.JFrame. The GUI class also handles instantiation of the other GUI components, in particular the PlottingPanel objects, the ControlPanel, the AudioToMidiPanel, and the MenuSystem. The latter is a subclass of javax.swing.JMenuBar, while the panels are subclasses of javax.swing.JPanel.

Through the plotAudio() and plotSpectrum() methods in the GUI class, the plots are continuously fed with data during playback. These methods are called by an inner class of AudioCentral (see section 3.4), and adjust the supplied data for the plots. This generally means scaling with regard to plot height and plot width, and in the case of the audio signal data passed to plotAudio(), this pre-processing also includes root mean square (RMS) calculations.

Since it is not assumed that the data supplied to the plotting methods describe the complete signal (or the spectrum taken over the complete signal), but rather a small segment, the scaling procedure assumes that the given data has already been normalized to values within [-1.0, 1.0].
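For reference, the kind of RMS calculation mentioned above can be sketched as follows (an illustration only; the application's actual scaling code is not reproduced here):

```java
// Root mean square of a block of normalized samples in [-1.0, 1.0].
public final class Rms {
    public static double rootMeanSquare(double[] samples) {
        double sumOfSquares = 0.0;
        for (double s : samples) {
            sumOfSquares += s * s;
        }
        return Math.sqrt(sumOfSquares / samples.length);
    }
}
```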

The setPlaying(), setRecording(), and setStopped() methods are called by the Controller when playback or recording starts or stops. Through these methods, visual (and sometimes functional) changes such as icon changes or disabling of buttons are controlled. These methods, along with the abovementioned plotting methods, highlight the fact that the view is unaware of the model.

The GUI also owns the file chooser dialogs used when opening and saving files. However, instead of offering methods to interact with these, the GUI class provides methods to obtain them, so as to facilitate direct interaction. This results in somewhat less cluttered code.

Currently, the GUI is all hand-written (i.e. not constructed using a GUI builder tool).

Although a clear and intuitive GUI is important, it was somewhat down-prioritized at this stage in favor of pitch detection and other key features. The aim was mostly to provide a sufficiently good GUI within the scope of the thesis. Thus, there is room for much polishing, both with regards to design details and implementation details (see chapter 5).

3.3 MIDI functionality

The application supports opening of MIDI files and saving the MIDI data produced by an audio-to-MIDI translation as a type 0 MIDI file. When MIDI data is present, it may be played back, and the playback may be muted/unmuted at any time. The playback tempo can be controlled through a spinner in the GUI.

3.3.1 Design notes

Since the MIDI functionality is so basic, dividing it over several classes would rather lead to fragmentation than to improved structure.


Figure 8. Public elements of the MidiCentral (methods for opening, saving, and playing back MIDI sequences, controlling tempo and muting, and starting/stopping the metronome).

The MIDI functionality is provided by the single MidiCentral class. There is, however, an inner (private) class for the metronome functionality, as described in the following section.

3.3.2 Implementation notes

Java’s MIDI capabilities are accessed through the javax.sound.midi package.

Working with MIDI sequences and playback is fairly straightforward, although a couple of things could be mentioned.

A MIDI sequence is represented by a Sequence object, which has a number of Track objects containing the MIDI events. After adding a sequence to a Sequencer, it may be played back by calling the latter’s start() method; local sound production is handled by a Synthesizer (which may obtain sound data from a Soundbank). In the normal case, the sequencer’s Transmitter sends MIDI messages to the synthesizer’s Receiver. Transmitters and receivers may be obtained and connected explicitly; unless this is done, defaults are used.

The sequencer does not close itself when the end of the sequence is reached, thus keeping hold of acquired system resources. However, at the end of playback, a particular MetaEvent is dispatched, which we may use to trigger the closing of the Sequencer. In our case, this MetaEvent is caught by the Controller, which is registered as a MetaEventListener in the MidiCentral.
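A stripped-down sketch of this playback pattern, using only standard javax.sound.midi calls, is given below; the constant for the end-of-track meta message type (47) and the error handling are simplifications, and the code is not the MidiCentral implementation itself.

```java
import java.io.File;
import javax.sound.midi.MetaEventListener;
import javax.sound.midi.MetaMessage;
import javax.sound.midi.MidiSystem;
import javax.sound.midi.Sequence;
import javax.sound.midi.Sequencer;

// MIDI playback with the default sequencer, closing it when the
// end-of-track meta event (type 47) is dispatched.
public final class PlaybackSketch {

    private static final int END_OF_TRACK = 0x2F; // meta message type 47

    public static void play(File midiFile) throws Exception {
        Sequence sequence = MidiSystem.getSequence(midiFile);
        final Sequencer sequencer = MidiSystem.getSequencer(); // default sequencer
        sequencer.open();
        sequencer.setSequence(sequence);
        // Release system resources when playback reaches the end of the sequence.
        sequencer.addMetaEventListener(new MetaEventListener() {
            public void meta(MetaMessage message) {
                if (message.getType() == END_OF_TRACK) {
                    sequencer.close();
                }
            }
        });
        sequencer.start();
    }
}
```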

The MidiCentral is also responsible for providing metronome functionality. If ‘Use metronome’ in the GUI is checked, there will be quarter-note MIDI clicks during audio recording. What happens at each metronome click is detailed in the inner class MetronomeTask, which implements the Runnable interface. This task is run by means of a ScheduledExecutorService. Note that, depending on system audio settings (e.g. “What U Hear” source selection), the metronome click may come to be recorded. There is currently no ‘stand-alone’ metronome; it is only available during recording, and hence started through the ‘record’ button in the GUI.
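The scheduling idea can be sketched as follows; the percussion note, channel, and velocity used for the click are arbitrary assumptions, and the sketch is not the actual MetronomeTask.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.sound.midi.MidiSystem;
import javax.sound.midi.Receiver;
import javax.sound.midi.ShortMessage;

// A quarter-note metronome click scheduled with a ScheduledExecutorService.
public final class MetronomeSketch {

    public static ScheduledExecutorService start(final int tempoBpm) throws Exception {
        final Receiver receiver = MidiSystem.getReceiver(); // default receiver
        Runnable click = new Runnable() {
            public void run() {
                try {
                    ShortMessage noteOn = new ShortMessage();
                    // Channel 10 (index 9) is the GM percussion channel; 37 is a side stick.
                    noteOn.setMessage(ShortMessage.NOTE_ON, 9, 37, 100);
                    receiver.send(noteOn, -1); // -1 means 'as soon as possible'
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        long quarterNoteMillis = 60000L / tempoBpm;
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(click, 0, quarterNoteMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```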

The Sequencer class has a setTrackMute() method, which is the way muting of MIDI playback is currently implemented. While it has a certain appeal in its simplicity, there are some issues with this approach. For one thing, it could be considered to be a kind of ‘fake’ mute, since we in effect mute notes instead of sounds. Moreover, according to the API documentation, it is actually not guaranteed that a Sequencer supports this functionality. Muting MIDI playback by muting the synthesizer is perhaps the proper way, and this leads us to some issues of MIDI volume control in Java.

The easiest way to set up a MIDI playback system is to use Java’s own default synthesizer. However, some versions of the JRE (e.g. the Windows version) do not ship with a soundbank, thus requiring the user to download it separately. It can therefore not be assumed that a soundbank is present. Java Sound has a fallback mechanism so that if it can not obtain a soundbank for the synthesizer, it tries to utilize a hardware MIDI port instead. However, this is generally not desired since it results in various inconsistencies.

If no soundbank was found, attempting to change the MIDI volume through the default synthesizer will obviously not work; we must obtain the Receiver from the MidiSystem instead of from the Synthesizer if we want control. This was tried during implementation, but the results were considered unsatisfactory. Not wishing to require the user to download a soundbank or configure the sound system manually, the volume control functionality was skipped in this version of the application.

3.4 Audio functionality

The application supports recording and playback of audio in 8- or 16-bit PCM format at various sample rates. Stereo is currently not supported. Any audio file within specification may be opened and played back, and present audio data may be saved.

Playback may be muted/unmuted at any time. Also, plot data is continuously fed to the GUI during playback.


3.4.1 Design notes

The audio functionality is a bit more complex than the MIDI functionality, being directly involved in pitch detection and audio-to-MIDI translation in addition to handling playback, recording, and opening/saving files.

Figure 9. The AudioCentral class (its public methods cover file handling, playback, recording, configuration, and the createMidiFromAudio() translation entry point). Not shown are two inner classes used for playback and recording, respectively.

The AudioCentral provides the Controller with a frontend to audio-related functionality. While the specifics of pitch detection are left to the PitchDetector (see section 3.5), the audio-to-MIDI translation procedure is implemented in the AudioCentral. Audio-to-MIDI specifics are discussed in section 3.6; the following section is more concerned with more general implementation details.

3.4.2 Implementation notes

The javax.sound.sampled package provides core audio functionality. Audio devices are represented as Mixer objects, and a device may have one or several ports (such as for instance microphone input or line-level output; these are represented by Port objects). While detailed device and port selection is possible, the application currently uses the defaults.


An object implementing the Line interface may be viewed as an audio transport path to or from the system. Mixers and ports are both lines, although when speaking of lines we usually refer to lines going into or out from the mixer. To capture audio, we acquire a TargetDataLine from which the signal is read. For playback, we may use either a SourceDataLine or a Clip. While the former is continuously fed with audio data during playback by writing to its buffer, the latter lets all the data be loaded from the beginning. This results in lower playback latency, and also makes it possible to jump between different positions in the audio (which may be desired for fast forward/rewind functions). Also, looping of the audio data is directly supported by the Clip class. Hence, unless the audio data requires too much memory to be loaded at once, or is not known in its entirety at the start of playback, a Clip is generally to be preferred over a SourceDataLine. Clip is the playback line of choice in the implementation.

Supposedly, there have been issues with out-of-memory errors when attempting to load clips greater than 5 MB. However, no such issues have been encountered during development. As an example, 10 MB of audio was recorded, played back, and re-loaded from file with no problems whatsoever.

An audio clip is played back by calling Clip’s start() method. When playback is complete, a LineEvent is dispatched. In the implementation, this is noticed by the Controller instance, which is registered as a LineListener with the Clip.
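A minimal sketch of Clip-based playback with such a LineListener might look like this (illustrative only, not the application's PlaybackTask):

```java
import java.io.File;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Clip;
import javax.sound.sampled.LineEvent;
import javax.sound.sampled.LineListener;

// Clip-based playback; a LineListener notices when playback stops.
public final class ClipPlaybackSketch {

    public static Clip play(File audioFile) throws Exception {
        AudioInputStream stream = AudioSystem.getAudioInputStream(audioFile);
        final Clip clip = AudioSystem.getClip();
        clip.addLineListener(new LineListener() {
            public void update(LineEvent event) {
                // A STOP event is dispatched when playback ends (or is stopped).
                if (event.getType() == LineEvent.Type.STOP) {
                    clip.close();
                }
            }
        });
        clip.open(stream);  // load all audio data up front
        clip.start();
        return clip;
    }
}
```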

Playback and recording are handled by two inner classes of AudioCentral: PlaybackTask and RecordingTask. Both implement the Runnable interface and are executed by means of an ExecutorService. In the case of playback, a ScheduledExecutorService is used, which sees to it that the GUI plots are fed with data regularly.
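A simplified sketch of the capture loop such a recording task might run is shown below; the audio format parameters, buffer size, and stop condition are illustrative assumptions rather than the application's actual values.

```java
import java.io.ByteArrayOutputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

// Capturing audio from the default input via a TargetDataLine.
public final class RecordingSketch {

    public static byte[] record(float sampleRate, int bitDepth, long maxBytes) throws Exception {
        // Mono, signed, little-endian PCM (illustrative choice).
        AudioFormat format = new AudioFormat(sampleRate, bitDepth, 1, true, false);
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
        line.open(format);
        line.start();

        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (captured.size() < maxBytes) {
            int read = line.read(buffer, 0, buffer.length); // blocks until data is available
            captured.write(buffer, 0, read);
        }
        line.stop();
        line.close();
        return captured.toByteArray();
    }
}
```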

Again, as was mentioned in section 3.3.2, if the metronome is used during audio recording, it may come to be recorded, depending on system settings.

The sample rate and bit depth used for audio recording may be specified via the ‘Settings’ menu in the GUI. Currently, four pre-determined sample rates are listed, all assumed (by means of a rather unsophisticated test performed at application launch) to be supported by the system. Note, however, that playback and audio-to-MIDI are supported for any available sample rate.

In section 3.3.2, the omission of MIDI volume control in the current implementation was discussed. Controlling audio volume does not present similar difficulties, but for reasons of consistency an audio volume control was omitted as well.


3.5 Pitch detection functionality

The application supports pitch detection of (definite-pitched) sounds from about the note F# (at approximately 92.5 Hz) and upward, depending on sample rate. From the audio-to-MIDI panel in the GUI, the thresholds used to filter sounding pitches can be adjusted.

3.5.1 Design notes

As previously seen, the PitchDetector is a component of the AudioCentral.

Figure 10. The PitchDetector class and its closest friends: the IterativeFFT and the WindowFunction implementations (RectangularWindow, HammingWindow, BlackmanWindow, and GaussWindow).

Generally, analysis logic resides within the PitchDetector class, while processing is performed by the IterativeFFT and WindowFunction instances.

The design has basically everything even remotely related to signal processing go via the PitchDetector. This includes producing the data used to plot the frequency spectrum in the GUI, although no actual pitch detection is performed in that case.


3.5.2 Implementation notes

The pitch detection utilizes an iterative radix-2 FFT algorithm, meaning only sample windows whose size is a power of two are supported. With this restriction in mind, the window size is set depending on the sample rate in an attempt to strike a balance between frequency resolution and time resolution. For example, for audio sampled at 44100 Hz, a window size of 8192 samples will be used. This corresponds to roughly 0.186 seconds and gives a frequency resolution of approximately 5.4 Hz, which is sufficient to correctly identify the note F# at about 92.5 Hz. On the other hand, with a sample rate of 8000 Hz, a 1024-sample window will be used, which corresponds to 0.128 seconds and a frequency resolution of 7.8125 Hz. Here, we can only reliably detect notes down to d at about 146.8 Hz.
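The numbers above can be reproduced with a small sketch like the following, which picks the largest power-of-two window that stays below an assumed maximum duration and reports the resulting resolutions (the 0.2-second limit is an illustrative assumption, not the application's exact rule):

```java
// Choosing a power-of-two window size and computing its time and frequency resolution.
public final class WindowSizing {

    public static int chooseWindowSize(float sampleRate, double maxSeconds) {
        int size = 1;
        while (size * 2 / sampleRate <= maxSeconds) {
            size *= 2;
        }
        return size;
    }

    public static void main(String[] args) {
        float sampleRate = 44100f;
        int windowSize = chooseWindowSize(sampleRate, 0.2);    // 8192 samples
        double timeResolution = windowSize / sampleRate;        // ~0.186 s
        double frequencyResolution = sampleRate / windowSize;   // ~5.4 Hz
        System.out.printf("window %d: %.3f s, %.2f Hz per bin%n",
                windowSize, timeResolution, frequencyResolution);
    }
}
```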

The transform is implemented in the class IterativeFFT. The ‘n for the price of n/2’ procedure mentioned in section 2.1.4 is employed, along with miscellaneous smaller tweaks such as using bit-shift operators for multiplications and divisions by a power of two. Also, n is required to have been specified before using the transform. This allows pre-computation of constants specific to a transform of a given size; such values are often referred to as twiddle factors.

It may be noted that IterativeFFT does not currently have a method that returns the actual transform, i.e. a sequence of complex numbers; there is only the getMagnitudes() method, which returns (the first half of) the magnitude spectrum.

Also mentioned in section 2.1.4 is the phenomenon of spectral leakage. Via the ‘Settings’ menu, the user may choose from several window functions. These are implemented as ‘function objects’; they implement the WindowFunction interface and provide a single apply() method.
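A sketch of this ‘function object’ approach is given below; the interface name and apply() signature follow the description above, while this particular Hamming implementation is written out here purely for illustration.

```java
// The interface with a single apply() method, plus a Hamming implementation
// of w(m) = 0.54 - 0.46 * cos(2*pi*m / (n - 1)), applied in place.
interface WindowFunction {
    void apply(double[] data);
}

class HammingWindow implements WindowFunction {
    public void apply(double[] data) {
        int n = data.length;
        for (int m = 0; m < n; m++) {
            data[m] *= 0.54 - 0.46 * Math.cos(2.0 * Math.PI * m / (n - 1));
        }
    }
}
```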

The pitch detection algorithm utilizes the harmonic product spectrum, as described in section 2.1.3. When the magnitude of some frequency in the HPS exceeds a certain high threshold, that frequency is added to a list of currently sounding frequencies. Similarly, when the magnitude falls below a certain low threshold, the frequency is removed from the list.

Currently, these thresholds are calculated from the number of harmonics considered when constructing the HPS, the size of the sample window, and the values specified via the threshold spinners in the GUI. In other words, the thresholds do not take the power of the signal into account. This causes the pitch detection to behave somewhat differently depending on sample rate and overall volume.
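The on/off logic can be sketched as a simple hysteresis over HPS bins, as below; the bookkeeping is simplified and the threshold values are assumed to be supplied by the caller.

```java
import java.util.HashSet;
import java.util.Set;

// Hysteresis over HPS bins: above the high threshold a bin starts sounding,
// below the low threshold it stops; in between, the previous state is kept.
public final class SoundingBins {

    private final Set<Integer> sounding = new HashSet<Integer>();

    /** Updates the set of sounding HPS bins given the latest HPS values. */
    public void update(double[] hps, double highThreshold, double lowThreshold) {
        for (int bin = 0; bin < hps.length; bin++) {
            if (hps[bin] > highThreshold) {
                sounding.add(bin);        // note onset candidate
            } else if (hps[bin] < lowThreshold) {
                sounding.remove(bin);     // note offset candidate
            }
        }
    }

    public Set<Integer> currentlySounding() {
        return sounding;
    }
}
```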


3.6 Audio-to-MIDI functionality

The application supports, or attempts to support, audio-to-MIDI translation of audio of varying complexity. The timing resolution can be controlled via the ‘Tempo’ and ‘PPQ’ spinners in the GUI.

3.6.1 Design notes

Because the PitchDetector keeps track of currently sounding pitches, it is closely intertwined with the audio-to-MIDI functionality; indeed, it was designed with audio-to-MIDI in mind. There is no separate audio-to-MIDI object; all methods of the translation mechanism are defined within the AudioCentral, utilizing the functionality of the PitchDetector.

3.6.2 Implementation notes

Audio-to-MIDI translation works by processing a number of overlapping sample windows. The degree of the overlap is determined by the length of a MIDI tick, which in turn is determined by the tempo and PPQ settings. For each window, pitch detection is followed by translating the detected pitches to MIDI note numbers.

Appropriate note on and note off messages are then generated with proper timestamping, and added to a Track (belonging to a Sequence; see section 3.3.2). When the translation is complete, the Controller sees to it that the MidiCentral obtains the produced MIDI sequence.

While the PitchDetector keeps track of currently sounding frequencies, the getMidiFromAudio() method of the AudioCentral keeps track of currently sounding MIDI notes during the translation process. The currently sounding notes will typically change less often than the currently sounding frequencies, since several frequencies may map to the same note. This is particularly the case with higher pitches.
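A condensed sketch of such a translation loop is given below. The NoteSource callback stands in for the PitchDetector-based analysis and is an assumption made for illustration; only the Sequence and Track bookkeeping mirrors the description above.

```java
import java.util.HashSet;
import java.util.Set;
import javax.sound.midi.MidiEvent;
import javax.sound.midi.Sequence;
import javax.sound.midi.ShortMessage;
import javax.sound.midi.Track;

// Overlapping analysis windows, one per MIDI tick; note on/off events are
// added to a Track whenever the set of sounding notes changes.
public final class TranslationSketch {

    public interface NoteSource {
        /** MIDI note numbers sounding in the window starting at 'offset' (assumed callback). */
        Set<Integer> detectNotes(double[] audio, int offset, int windowSize);
    }

    public static Sequence translate(double[] audio, float sampleRate, int tempoBpm,
                                     int ppq, int windowSize, NoteSource detector) throws Exception {
        Sequence sequence = new Sequence(Sequence.PPQ, ppq);
        Track track = sequence.createTrack();
        // One MIDI tick corresponds to this many audio samples.
        double samplesPerTick = sampleRate * 60.0 / (tempoBpm * ppq);

        Set<Integer> previous = new HashSet<Integer>();
        for (long tick = 0; (long) (tick * samplesPerTick) + windowSize <= audio.length; tick++) {
            int offset = (int) (tick * samplesPerTick);  // overlapping windows, one per tick
            Set<Integer> current = detector.detectNotes(audio, offset, windowSize);
            for (int note : current) {
                if (!previous.contains(note)) {          // newly sounding note: note on
                    ShortMessage on = new ShortMessage();
                    on.setMessage(ShortMessage.NOTE_ON, 0, note, 90);
                    track.add(new MidiEvent(on, tick));
                }
            }
            for (int note : previous) {
                if (!current.contains(note)) {           // note no longer sounding: note off
                    ShortMessage off = new ShortMessage();
                    off.setMessage(ShortMessage.NOTE_OFF, 0, note, 0);
                    track.add(new MidiEvent(off, tick));
                }
            }
            previous = current;
        }
        return sequence;
    }
}
```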
