
Master of Science in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2016

Real-Time Adaptive Audio Mixing System Using Inter-Spectral Dependencies


Robert Koria
LiTH-ISY-EX–16/4977–SE

Supervisor: Manon Kok
isy, Linköpings universitet

Examiner: Fredrik Gustafsson
isy, Linköpings universitet

Division of Automatic Control
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2016 Robert Koria


Abstract

The process of mixing tracks for a live stage performance or studio session is both time-consuming and expensive when assisted by professionals. It is also difficult for individuals to remain competitive against established companies, since multiple tracks must be properly mixed in order to achieve well-enhanced elements – generally, a poor mix makes it difficult for the listener to distinguish the different elements of the mix. The method developed during this thesis work aims at facilitating the mixing work for live performances and studio sessions.

The implemented system analyzes the energy spectrum of the tracks included in the mix. By unmasking spectral components, the spectral overlap of the tracks is minimized. The system filters non-characteristic frequencies, leaving significant frequencies undisturbed. Five tracks from the final mix of a successful radio song have been analyzed and used to illustrate and validate the developed method. The system was successfully implemented in MATLAB with promising results. The processed mix unmasks frequency content and is perceived by a number of test individuals to sound clearer than the unprocessed mix.

The method resembles a multi-band compressor that analyzes the spectral information between tracks. Thus, by use of inter-spectral dependencies, the thesis investigates the possibility of controlling the amplitudes in time by filtering in the frequency domain. The compression rate in the time domain reflects a trade-off between conservation of characteristic frequencies and reduction of spectral overlaps.


Acknowledgments

Big thanks to professor Fredrik Gustafsson for valuable and professional input, but also for believing in my idea for this master thesis. I would also like to express my great thanks to my supervisor Manon Kok for amazing help, guidance and for always being available as a friend and supervisor.

Linköping, June 2016
Robert Koria


Contents

Notation

1 Introduction
1.1 Background
1.1.1 Music Production
1.1.2 Intuition Based on Existing Techniques
1.2 Problem Formulation
1.3 Method
1.3.1 Validation
1.4 Related Work
1.4.1 Adaptive Digital Audio Effects
1.4.2 Loudness War
1.5 Thesis Outline

2 Essential Theory
2.1 Basis of Signal Processing
2.1.1 Sampling Theorem
2.1.2 Leakage
2.1.3 Windowing
2.1.4 Circular Convolution
2.1.5 Root Mean Square
2.1.6 Parseval's Formula

3 Implementation
3.1 Modelling Approach
3.1.1 Causal Framework
3.1.2 Spectral Accuracy of Filtering Operations
3.1.3 Synchronization of Inter-Spectral Dependencies
3.1.4 Time-delay due to Causal Framework
3.1.5 Smooth Filtering Operations over Iterations
3.1.6 Sample Extraction and Properties from the Filtration
3.1.7 Optimizing the Design
3.2 Modelling of Dynamic Equalization
3.2.1 Pre-Process
3.2.2 DFT
3.2.3 Filter Design
3.2.4 Two Sets of Inter-Spectral Dependencies
3.2.5 Filtering
3.2.6 IDFT
3.3 Optimization of Design Parameters
3.3.1 Tuning the Filtering Resolution
3.3.2 Approximating the Underlying dtft
3.3.3 Tuning the Frequency Filtration
3.3.4 Tuning the Preservation of Characteristic Frequencies

4 Results
4.1 Audibility
4.2 Energy Reduction
4.2.1 Comparing the Mixes in Time Domain
4.2.2 Comparing the Tracks in Frequency Domain
4.3 Survey

5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
5.2.1 Audiology
5.2.2 Equalization and Time Signal Compression
5.2.3 Filter Bank
5.2.4 Larger Sample Extractions


Notation

Sets

N — set of natural numbers
R — set of real numbers
C — set of complex numbers
Z — set of integers
(·) — continuous time values
[·] — discrete time values

Abbreviations

matlab — Matrix Laboratory
daw — Digital Audio Workstation
dafx — Digital Audio Effects
a-dafx — Adaptive Digital Audio Effects
spl — Sound Pressure Level
lkfs — Loudness, K-weighted, relative to Full Scale
rms — Root Mean Square
fir — Finite Impulse Response
mimo — Multi-Input Multi-Output
irf — Impulse Response Function
ma — Moving Average
ft — Fourier transform
ift — Inverse Fourier transform
dtft — Discrete-Time Fourier transform
idtft — Inverse Discrete-Time Fourier transform
dft — Discrete Fourier transform
idft — Inverse Discrete Fourier transform


1 Introduction

1.1 Background

Reference [6] gives a very good insight into the field of music production. This section is intended to give a feel for the work.

1.1.1 Music Production

Music has been written for millennia, and various instruments have been created by man to make musical sounds. With the passage of time, electronic musical instruments were invented, and impressively, during the last decades it has become possible to generate complex sounds digitally, thanks to the development of computers and the overall digitization of signal processing, e.g., modulation of signals, changing dynamics of signals and equalization. Bedroom producers are today ready to compete for market shares, since a basic music studio consists of only a computer and one monitor. The process of producing music can be divided into three data processing stations – recording and arrangement of tracks, mixing of tracks and mastering of the mix. In order to understand which part of the production chain the research relates to, a short briefing is presented.

Recording and Arrangement of Tracks

In general, music is composed within a Digital Audio Workstation (daw). Musicians use the interface of the daw to record, edit and arrange audio signals.

Mixing of Tracks

The user interface of a daw is supplied with Digital Audio Effects (dafx). Mixing refers to the use of these effects, which manipulate the sound of a track.


Figure 1.1 illustrates how musicians use dafx. By controlling the parameters of a dafx, the acoustical and visual representation changes. Thus, the sound of each track can be tuned toward a desired result.

Figure 1.1: Illustration of the approach of mixing for a producer [18]. [The diagram shows a dafx block with an input signal, an output signal, control parameters, and acoustical and visual representations.]

Besides manipulating each sound individually, the tracks must be manipulated with respect to each other. Every single track has its unique sound which has to be fused, or mixed, with the other tracks into a detailed, exciting and comfortable listening experience. The goal is to obtain a mix with well-defined, exciting elements. This process tends to be difficult since the tracks overlap in both the frequency and time domain.

When a mix consists of multiple tracks, there is a risk that the sound image becomes indistinct. In addition, unpleasant frequencies can be introduced when multiple tracks overlap in the frequency domain. The more tracks that are used in a mix, the more mixing time needs to be spent in order to fuse all signals. Furthermore, the audience listens to the music on speakers of arbitrary quality. A flat frequency response of the speakers is optimal in the sense that it reproduces a sound identical to the source. This means that it is up to the music producers to deliver a good and clear sound regardless of the audio system. When all the tracks interact in a harmonious sound, the final mix is obtained.

This thesis aims at reducing the mutual information of the frequency coefficients of the tracks, in order to clean up the final mix over very small time intervals – making the elements in the final mix more distinct and enhanced.

"The perfect mix may need no mastering at all!" – Bob Katz [6]

Mastering of Mix

Compared to mixing, mastering is about manipulating the whole mixture of elements and not the individual tracks. The goal is to obtain a mastered mix with a high average loudness, and to apply a final touch to the sound design without damaging the underlying final mix. The mastering process tends to be difficult if the starting point, i.e., the final mix, has indistinct elements – techniques for increasing average loudness tend to make the elements even more indistinct. The relevance of high average loudness is discussed in subsection 1.4.2.

Figure 1.2 illustrates a typical radio song. As can be seen, by using existing techniques it is possible to achieve a mixed signal sequence with a high average signal energy at the climax, where the dynamics of the envelope are low. Despite the high average loudness, no enhancement of unpleasant frequencies is audible in the radio song.

Figure 1.2: A mastered mix by Avicii – Wake Me Up. [Envelope of the radio song.]

It requires good knowledge of dafx to avoid damaging the sound image when mastering. The key is to increase average loudness without introducing undesired artifacts such as distortion, unpleasant frequencies and indistinct elements. Standard dafx are presented in subsection 1.1.2.

Due to the reduction of overlapped frequency coefficients, this thesis facilitates the transition between mixing and mastering. The mastering process is the final step of the music production chain before distribution. Thus, the method developed during this thesis contributes to this step by being applied on the final mix. This helps to highlight the elements one last time before mastering – making the elements more distinct and damping unpleasant frequencies.

"Mastering is the art of compromise since changing anything affects everything", "The last 10% of the job takes 90% of the time." – Bob Katz [6]


1.1.2 Intuition Based on Existing Techniques

There are many techniques for unmasking tracks when mixing. One technique among others is to shift the different elements in time, in the order of milliseconds. Another one is panning of tracks into different channels – music is usually produced in stereo format. However, overlap may still remain. The overlap causes difficulties in distinguishing different elements in the mix. In this subsection, standard dafx used when mixing are interpreted.

Equalizer

An equalizer is a dafx for modifying a certain frequency response. It can be used in many ways since it is manually controlled. A number of filtering forms are available, e.g. low-pass, band-pass, high-pass and notch filter, and all can be used in arbitrary combinations. In order to highlight the essence with respect to this thesis work, only one concept is presented.

The equalizer is used to damp frequency regions where other tracks already play. The selected equalization setting is static over time. Thus, in order to unmask the tracks in relation to each other, the equalizer is non-intuitive since the frequency response of a track changes over time. The developed system of this thesis can be interpreted as a dynamic equalizer that handles this case.

Compressor and Expander

Compressors and expanders are both powerful dafx to control the dynamics of the envelope. There are many different ways to compress or expand an audio signal. However, only two are relevant to highlight for an intuition of the implementation – downward compression and upward expansion, see Figure 1.3.

Figure 1.3: Illustration of compression and expansion of the dynamic range.

• Downward compression decreases the signal amplitude over a desired power threshold while keeping softer signal values unaffected.

• Upward expansion increases the signal amplitude over a desired power threshold while keeping softer signal values unaffected.

Modern compressors and expanders are designed with an input property called side-chain. This input can be used to control the compression or expansion rate of an assigned primary track. Consider a secondary track that operates as an input to a downward compressor assigned to a primary track. According to a mapping function of amplitudes, the powerful amplitudes of the primary track are compressed based on the signal power of the secondary track. Hence, it is possible to decrease the amplitude of the primary track when the signal power of the secondary track is increasing. This is a technique that unmasks the used tracks on a larger scale, since it clears room for the secondary track by damping all frequencies with the same factor.

Generally, compressors do not respond to small fluctuations in the time domain, i.e., frequencies. Thus, the compressor distorts the nature of the sound. The problem of not handling frequencies also holds for expanders. However, there are compressors and expanders that operate in specific frequency bands, called multi-band compressors and multi-band expanders. The developed system is an example of an adaptive multi-band compressor where side-chaining prevails between tracks. Thereto, by use of inter-spectral dependencies and analysis in the frequency domain, the amplitudes of the spectral components can be regulated according to functions that map the amplitudes – based on the principles of downward compression and upward expansion.
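The two amplitude mappings above can be sketched in code. The thesis implementation is in matlab; the following is a minimal Python sketch, where the threshold and ratio parameters are illustrative assumptions rather than values from the thesis.

```python
import numpy as np

def downward_compress(x, threshold=0.5, ratio=4.0):
    """Attenuate magnitudes above `threshold` by `ratio`; softer values pass unchanged.
    `threshold` and `ratio` are illustrative choices, not taken from the thesis."""
    y = np.asarray(x, dtype=float).copy()
    loud = np.abs(y) > threshold
    excess = np.abs(y[loud]) - threshold
    y[loud] = np.sign(y[loud]) * (threshold + excess / ratio)
    return y

def upward_expand(x, threshold=0.5, ratio=4.0):
    """Boost magnitudes above `threshold` by `ratio`; softer values pass unchanged."""
    y = np.asarray(x, dtype=float).copy()
    loud = np.abs(y) > threshold
    excess = np.abs(y[loud]) - threshold
    y[loud] = np.sign(y[loud]) * (threshold + excess * ratio)
    return y

x = np.array([0.2, 0.6, -0.9])
print(downward_compress(x))  # loud samples are pulled toward the threshold
print(upward_expand(x))      # loud samples are pushed away from the threshold
```

A multi-band variant applies such a mapping per frequency band, with a side-chain deciding the rate from another track's band power – the inter-spectral dependency the thesis exploits.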

1.2 Problem Formulation

dafx are diligently applied on tracks to obtain a well-defined and enhanced sound. In cases where multiple audio tracks are to be mixed, an equalizer is commonly applied on each audio track. This process requires manual tuning of each equalizer. The problem with manually tuned equalizers is the time-invariant filtering window: once set, the filter does not adjust to frequency changes of other tracks. At the same time, the time-variant spectra of the tracks make it impossible to set a static filter which unmasks the right frequencies for all time windows. An automated time-variant equalizer makes the essential and repetitive part of the final mixing process more effortless and effective. Note that the developed dynamic equalizer operates based on the principles of downward compression and upward expansion, but in the frequency domain.

The purpose of the implemented system is not to replace the static equalizer. It is rather a complement to the mixing work – an efficient way to clean up the mix as a final touch. The solution can be used for both live and playback purposes since the system is causal. This thesis is meant as a proof of principle, where elements in a mix are enhanced by unmasking spectral components of the constituent tracks while leaving significant frequency components undisturbed.


1.3 Method

The signal processing will be done batch-wise in a causal manner, allowing for extension to real-time use in future work. Hence, the calculations are restricted to be as fast and simple as possible. The implementation will be done in Matrix Laboratory (matlab), which prevents proper testing of real-time use; it will, however, be prepared for that purpose. The methodology is divided into the following parts.

• Go through existing research in this field to find inspiration.

• Set up a time-variant framework of inter-spectral dependencies between tracks.

• Design an optimization application that is regulated with respect to preservation of characteristic frequencies and reduction of frequency overlaps. The goal is to develop an application that always satisfies its criteria.

1.3.1 Validation

There are three main ways that together validate the features of the system.

• Amplitude spectrum
Comparing the amplitude spectrum of each track, before and after – verifying that powerful frequencies are preserved.

• Frequency overlaps
Calculating a design criterion, before and after – verifying reduction of frequency overlaps.

• Audibility
It is recognized that a good mix cannot be measured other than by being heard. A survey is based on asking independent individuals, without any premises, the following question: Which mix is easiest for distinguishing the different elements?

A set of tracks by the famous group Swedish House Mafia is used as a benchmark example throughout this thesis. The instruments are taken from the final mix of the song Leave The World Behind, which fits this thesis since the last step of music production, mastering, is to enhance the final mix. The set consists of five instruments, all of them thought to play in parallel, see Figures 1.4–1.13. This set is used to illustrate, test and validate the implemented system. As further seen in Figures 1.4–1.13, only a few seconds, completing two bars, are used. This is in order to save computing time. However, to validate the results, the processed mix is looped.

Note that music is produced in stereo format and that Figures 1.4–1.13 contain information from only one of the two channels. However, the data from the two channels are almost equivalent.


Figure 1.4: Envelope of the kick drum represented over 3.75 seconds.

[Figure 1.5: Spectrogram, spectrum and histogram of the kick drum.]

Figure 1.6: Envelope of the bassline represented over 3.75 seconds.

[Figure 1.7: Spectrogram, spectrum and histogram of the bassline.]

Figure 1.8: Envelope of the piano represented over 3.75 seconds.

[Figure 1.9: Spectrogram, spectrum and histogram of the piano.]

Figure 1.10: Envelope of the pluck represented over 3.75 seconds.

[Figure 1.11: Spectrogram, spectrum and histogram of the pluck.]

Figure 1.12: Envelope of the vocal represented over 3.75 seconds.

[Figure 1.13: Spectrogram, spectrum and histogram of the vocal.]

1.4 Related Work

1.4.1 Adaptive Digital Audio Effects

In parallel with the development of computers, more and more individuals are involved in the development of software solutions for the music industry. There are few studies with the same ambition as this thesis work. However, some studies have given conceptual inspiration.

In [16], the different forms of Adaptive Digital Audio Effects (a-dafx) are defined. a-dafx basically combines the theory of sound transformation and adaptive control. In [18], several audio effects are presented, e.g., the fundamental development of automatic equalization. The framework of a-dafx is effective and has been used in many studies related to music mixing, e.g., [8, 10]. In [3], the concept of panning tracks into stereo channels is utilized in order to minimize unwanted effects of masking. In [9], an equal chance of masking is achieved by optimizing the likelihood of each track to be heard, by use of the equal-loudness-level contour [15]. The research in [7] presents a real-time equalizer of harmonic and percussive components which operates through sliding block analysis without any prior knowledge of the tracks. Methods which take advantage of mutual information between tracks in order to manipulate the mix are presented in [12, 4]. Generally, research in the field of music production aims to facilitate processes. New ways of finding previous mixing parameters that have been used for the same tracks, by adaptation to a target, are introduced in [1, 13, 14]. As an example, these studies aim to create a more straightforward workflow when remastering an old mastered mix.

1.4.2 Loudness War

As a partial result, the system has potential to increase the average loudness while reducing frequency overlaps. Thus, this subsection describes the relevance of an increased average loudness.

During the last decades, the loudness war has become more intense due to peak normalization in old-fashioned broadcast channels. The peak normalization technique normalizes a signal based on its peak level, which forces producers to significantly compress their records in order to stand out from the crowd and to boost the experienced energy of the mastered mix. Producers aim for a high level of average loudness in both the climax and anti-climax of their final mix. Thus, it becomes difficult to highlight the climax. Consequently, dafx for compressing and expanding the amplitudes of each track are commonly used in order to control the dynamics of the mix. Music enthusiasts have criticised the industry since the loudness war has gone too far – the resulting low dynamic range of the overall mix, obtained without handling frequencies, destroys the nature of the sound.

In what sense compression results in a louder subjective loudness is explained in [2]. In addition, the article presents a future technique of normalization, the ITU-R BS.1770-2 standard [11], intended to put an end to the loudness war.


The new normalization technique concentrates on media channels, e.g., YouTube, TV and radio stations, where there are inconsistencies in subjective loudness between different audio signals. The unit Loudness, K-weighted, relative to Full Scale (lkfs) is used as a subjective loudness meter to match differences. However, the loudness war still prevails. daws are still in the framework of peak normalization, and circulation of individual audio files is still popular.

1.5 Thesis Outline

Chapter 2 goes through the important theory that is the basis for the implementation.

Chapter 3 reviews the implementation and tuning of the system.

Chapter 4 presents the results of the implementation with tuned values of the system’s design parameters based on the benchmark example.

Chapter 5 presents the conclusions that are drawn and opportunities for future work.


2 Essential Theory

The implemented framework is constructed based on transformation theory, and the main operations are in the frequency domain. In this chapter, the theory is discussed. A more extensive derivation can be found in [5].

2.1 Basis of Signal Processing

Music signals can be seen as non-periodic time signals. These can be transformed by the Fourier Transform (ft), defined as

$$\mathcal{F}\{x(t)\} = \int_{-\infty}^{\infty} x(t)\,e^{-j\omega t}\,dt = X(j\omega), \qquad (2.1)$$

to relate the time domain to the frequency domain. Here ω is defined as the angular frequency,

$$\omega = 2\pi f. \qquad (2.2)$$

The ft (2.1) is defined for continuous signals x(t), where $t \in \mathbb{R}$. Shortly described, the ft retrieves the sinusoidal frequency components, $\mathbb{R} \to \mathbb{C}$, which an arbitrary time signal, x(t), is made up of. This is possible by a switched basis, based on Euler's formula. It is the frequency components, X(jω), that build the energy spectrum, since each component can be described as a frequency with a corresponding amplitude. The signals can be transformed back, $\mathbb{C} \to \mathbb{R}$, using the Inverse Fourier Transform (ift), defined as

$$\mathcal{F}^{-1}\{X(j\omega)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty} X(j\omega)\,e^{j\omega t}\,d\omega = x(t). \qquad (2.3)$$


Formula (2.3) is able to reconstruct every non-ideal signal, x(t), which progresses through time, by allowing all possibilities of frequencies and amplitudes, X(jω). The transforms (2.1)–(2.3) are, in simple terms, based on the unit circle. Hence, there is no feasibility to construct orthogonal edges in the time domain by the ift. However, digital music signals have no such edges, and therefore any two-dimensional music signal can be reconstructed. Moreover, frequency filters are to be designed with long transition bands in order to avoid aliasing in the time domain after filtering in the frequency domain.

The used audio signals are sampled, so the ft and ift need to be redefined for discrete operations. Hence, time is sampled,

$$x[k] = x(kT), \qquad k \in \mathbb{Z}, \qquad (2.4)$$

where T is the sampling time. The Discrete-Time Fourier Transform (dtft) is given by

$$X_T(e^{j\omega T}) = T \sum_{k=-\infty}^{\infty} x[k]\,e^{-j\omega kT} \qquad (2.5)$$

and the Inverse Discrete-Time Fourier Transform (idtft) is expressed as

$$x[k] = \frac{1}{2\pi}\int_{-\pi/T}^{\pi/T} X_T(e^{j\omega T})\,e^{j\omega kT}\,d\omega. \qquad (2.6)$$

The subscript T in (2.5) indicates a sampled version of the ft (2.1). A Riemann approximation argument tells us that $X_T(e^{j\omega T}) \to X(j\omega)$ when $T \to 0$. In order to cover all possible transformation cases, the relationship between the Fourier coefficients in (2.1) and (2.5) is further described by Poisson's summation formula,

$$X_T(e^{j\omega T}) = \sum_{r=-\infty}^{\infty} X\big(j(\omega + r\omega_s)\big), \qquad \omega_s = \frac{2\pi}{T}. \qquad (2.7)$$

$X_T(e^{j\omega T})$ is a periodic function since $e^{j\omega T}$ has a period of $\frac{2\pi}{T}$. Thereto, the derivation of (2.7) in [5] introduces the theory of the alias effect, called folding, since ω is restricted to finitely many available values. The principle of folding distorts the spectrum. Subsection 2.1.1 discusses how folding is avoided.

Since the implemented system takes advantage of batch-wise processing, the dtft must be rewritten for batches

$$x_N[k], \qquad k = 1, 2, \ldots, N. \qquad (2.8)$$

By multiplication with a rectangular window (2.10),

$$R_N[k] = \begin{cases} 1, & k = 1, 2, \ldots, N \\ 0, & \text{otherwise} \end{cases} \qquad (2.9)$$

$$x_N[k] = x[k]\,R_N[k], \qquad (2.10)$$


the expression (2.5) can be rewritten as

$$X_T^N(e^{j\omega T}) = T \sum_{k=1}^{N} x_N[k]\,e^{-j\omega (k-1)T}. \qquad (2.11)$$

As can be understood from (2.11), the frequency resolution of $X_T^N(e^{j\omega T})$ is directly related to the batch length N. The truncation in time leads to an issue of leakage in the frequency domain. This means that the energy of an arbitrary frequency component leaks to the nearby frequencies. The leakage pattern is a convolution with a sinc function, due to the multiplication by $R_N$ in the time domain (2.10). The issue of leakage is important for the implementation and is discussed in subsection 2.1.2.

All continuous variables must be expressed in discrete form in order to be used on a computer. The time index t has so far been reformulated, but the angular frequency ω has not. This is done by introducing a frequency grid resolution $\omega_0$,

$$\omega = (n-1)\,\omega_0 = (n-1)\,\frac{2\pi}{NT}, \qquad n = 1, 2, \ldots, N. \qquad (2.12)$$

As seen in (2.12), the values of n are discrete since the grid points of ω are built as multiples of $\omega_0$. Information is lost, as in the case when time is sampled.

In summary, discretizing the time and frequency variables results in

$$t = T,\, 2T,\, \ldots,\, NT \qquad (2.13)$$

and

$$\omega = 0,\, \omega_0,\, \ldots,\, (N-1)\,\omega_0. \qquad (2.14)$$

Hence, on the sampled frequency grid (2.12), the Discrete Fourier transform (dft) is given by

$$X[n] = \sum_{k=1}^{N} x_N[k]\,e^{-\frac{2\pi j}{N}(k-1)(n-1)} \qquad (2.15)$$

and the Inverse Discrete Fourier transform (idft) is given by

$$x_N[k] = \frac{1}{N}\sum_{n=1}^{N} X[n]\,e^{\frac{2\pi j}{N}(k-1)(n-1)}. \qquad (2.16)$$

Note that it is (2.15)–(2.16) that are used in the implementation.
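As a check of (2.15)–(2.16), a direct NumPy transcription (the thesis uses matlab; the function names here are mine) shows that the pair is an exact round trip and agrees with the standard FFT:

```python
import numpy as np

def dft(x):
    # Equation (2.15), with the 1-based (k-1)(n-1) written in 0-based form.
    N = len(x)
    idx = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(idx, idx) / N)
    return W @ x

def idft(X):
    # Equation (2.16): the inverse with a 1/N normalization.
    N = len(X)
    idx = np.arange(N)
    W = np.exp(2j * np.pi * np.outer(idx, idx) / N)
    return (W @ X) / N

x = np.array([0.5, -0.2, 0.1, 0.4])
assert np.allclose(dft(x), np.fft.fft(x))   # matches the standard FFT
assert np.allclose(idft(dft(x)).real, x)    # exact reconstruction
```

In practice one uses the FFT, which computes the same sums in O(N log N).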

A batch of data, $x_N[k]$, is considered stochastic since the system inputs any audio signal. Furthermore, the implemented system takes advantage of frequency derivatives. Thus, the frequency spectrum must be estimated properly. In comparison between nearby batches, differences of frequency components obtained by the dft for a stochastic signal are considered noisy. Derivations in [5] enable an efficient method to obtain the expected power spectrum, φ, called the periodogram, defined as

$$\phi[n] = \frac{T}{N}\,|X[n]|^2. \qquad (2.17)$$


The periodogram (2.17) performs the estimation well and smoothly over time if the batch length, N, is not critically small.
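For intuition, a periodogram per (2.17) can be computed with NumPy as follows. This is a sketch; the 440 Hz test tone and the batch length are arbitrary illustrative choices, not thesis data.

```python
import numpy as np

fs = 44100.0          # sampling frequency of the thesis data
T = 1 / fs
N = 4096              # batch length (arbitrary for this illustration)
k = np.arange(N)
x = np.sin(2 * np.pi * 440.0 * k * T)   # a test tone

X = np.fft.fft(x)
phi = (T / N) * np.abs(X) ** 2          # periodogram, equation (2.17)
f = np.fft.fftfreq(N, T)
peak = f[np.argmax(phi[: N // 2])]      # lies within one grid step of 440 Hz
```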

2.1.1 Sampling Theorem

The derivation in [5] of Poisson's summation formula (2.7) tells us that if $X(j\omega) = 0$ outside the interval

$$-\frac{\omega_s}{2} \leq \omega \leq \frac{\omega_s}{2}, \qquad (2.18)$$

then $X_T(e^{j\omega T})$ and $X(j\omega)$ coincide. Thus, no information is lost regarding frequency components. Poisson's summation formula leads to the conditional sampling theorem,

$$\omega_N \leq \frac{\omega_s}{2}. \qquad (2.19)$$

The Nyquist frequency, $\omega_N$, indicates the highest frequency that $X_T(e^{j\omega T})$ consists of. If the condition does not hold, aliasing is introduced. The human ear registers frequencies in the range of about 20–20000 Hz, and music is therefore produced in the same frequency range. The used data, Figures 1.4–1.13, are sampled at $f_s = 44.1$ kHz. Hence, aliasing is avoided. The sampling frequency is defined as

$$f_s = \frac{1}{T}, \qquad (2.20)$$

where T is the sampling time.
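The folding effect is easy to demonstrate numerically: sampled below the Nyquist rate, a high tone becomes indistinguishable from its alias. A sketch with arbitrary illustrative frequencies, not thesis data:

```python
import numpy as np

fs = 8000.0            # illustrative sampling rate; the thesis data use 44.1 kHz
T = 1 / fs
k = np.arange(64)
f_high = 5000.0        # above the Nyquist frequency fs/2 = 4000 Hz
f_alias = fs - f_high  # folds down to 3000 Hz
x_high = np.cos(2 * np.pi * f_high * k * T)
x_alias = np.cos(2 * np.pi * f_alias * k * T)
assert np.allclose(x_high, x_alias)  # the two tones yield identical samples
```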

2.1.2 Leakage

Leakage obstructs the frequency resolution from bringing out the essential peaks of the energy spectrum. Thus, frequency regions with high energy get an extended interval, causing filtering operations at irrelevant frequencies. By zero-padding the batch $x_N[k]$, the frequency grid resolution $\omega_0$ becomes small since N is increased. Thus ω becomes more finely divided, the peaks become more distinct, and the leakage is spread over additional components.
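The effect of zero-padding on the frequency grid can be illustrated as follows; the tone frequency and lengths are arbitrary illustrative choices:

```python
import numpy as np

fs, N = 1000.0, 100
x = np.cos(2 * np.pi * 123.4 * np.arange(N) / fs)  # tone between grid points

# The N-point grid spacing is fs/N = 10 Hz; zero-padding to 8N refines it to 1.25 Hz.
f_coarse = np.fft.rfftfreq(N, 1 / fs)
f_fine = np.fft.rfftfreq(8 * N, 1 / fs)
peak_coarse = f_coarse[np.argmax(np.abs(np.fft.rfft(x)))]
peak_fine = f_fine[np.argmax(np.abs(np.fft.rfft(x, 8 * N)))]
# The zero-padded estimate lands closer to the true 123.4 Hz.
assert abs(peak_fine - 123.4) < abs(peak_coarse - 123.4)
```

Note that zero-padding adds no information; it only evaluates the underlying dtft on a denser grid.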

2.1.3 Windowing

The rectangular window (2.9) is used to create the batches of data. As mentioned, this causes a convolution in the frequency domain that results in leakage. By multiplying the batches with a finite standard window, it is possible to truncate and warp the energy spectrum at the same time, in a way that enhances the dominant frequencies. Some standard windows are presented in [5]. The Bartlett window (2.21) is useful since it minimizes the side lobes of the leakage,

$$W_N[k] = \begin{cases} \frac{2k}{N}, & 0 \leq k \leq \frac{N}{2} \\ 2 - \frac{2k}{N}, & \frac{N}{2} \leq k \leq N \end{cases} \qquad (2.21)$$

$$x_N[k] = \frac{x[k]\,W_N[k]}{\sum_k W_N[k]}. \qquad (2.22)$$


The denominator of (2.22) is a normalization factor to maintain the level of energy after windowing. Note that the standard window must be applied to the batches before zero-padding.
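A sketch of (2.21)–(2.22) in Python (NumPy also ships `np.bartlett`, which uses a slightly different endpoint convention; the batch here is random stand-in data, not thesis audio):

```python
import numpy as np

N = 256
k = np.arange(N)
# Bartlett window, equation (2.21)
W = np.where(k <= N / 2, 2 * k / N, 2 - 2 * k / N)

rng = np.random.default_rng(0)
x = rng.standard_normal(N)      # stand-in for one batch of audio
x_w = x * W / np.sum(W)         # windowed and normalized, equation (2.22)
X = np.fft.fft(x_w, 4 * N)      # window first, then zero-pad
```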

2.1.4 Circular Convolution

When introducing filtering operations in the frequency domain, there is a risk of circular convolution that causes overlap in the time domain. The idea behind the system does not tolerate any overlap in the time domain for a processed batch. The underlying problem is that multiplication in the frequency domain corresponds to convolution in the time domain. The number of dft coefficients, i.e. the number of samples in the time domain before performing the dft, must exceed the duration of the convolution produced in the time domain when multiplying these dft coefficients. The solution lies in preparing the batch length before transforming, easily done by zero-padding x_N[k] with a length of at least N. The zero-padded value, N, originates from the main filtering operation of the system, in which a Finite Impulse Response (fir) filter of length N filters the batch.
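The wrap-around risk and its zero-padding remedy can be demonstrated numerically. This is an illustrative Python/NumPy sketch (not the thesis code); the signal and filter are toy examples.

```python
import numpy as np

N = 8
x = np.arange(1.0, N + 1)            # a batch of N samples
h = np.ones(N) / N                   # an N-tap FIR filter (moving average)

# Multiplying N-point DFTs gives CIRCULAR convolution: the tail wraps around
circ = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# Zero-padding both to length 2N (>= N + N - 1) keeps the convolution linear
xp = np.concatenate([x, np.zeros(N)])
hp = np.concatenate([h, np.zeros(N)])
lin = np.fft.ifft(np.fft.fft(xp) * np.fft.fft(hp)).real

direct = np.convolve(x, h)           # reference linear convolution, length 2N - 1
```

The padded product matches direct linear convolution, while the unpadded one does not: exactly the situation the batch preparation avoids.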

2.1.5 Root Mean Square

The average loudness is mathematically interpreted as the Root Mean Square (rms). Consider a mixed signal x[k]. The average signal energy for a particular time window is expressed as
\[
\Psi = \sqrt{\frac{1}{N}\left(x[1]^2 + x[2]^2 + \dots + x[N]^2\right)}.
\tag{2.23}
\]
Further, as a result of the used file format, Waveform Audio File Format (.wav), the amplitude is limited to −1 ≤ x[k] ≤ 1 in matlab for an arbitrary k. This results in a maximal average loudness, rms, of Ψ = 1.

2.1.6 Parseval's Formula

According to [5], Parseval's formula for discrete time values becomes
\[
\lVert x \rVert^2 = \sum_{k=1}^{N} |x_N[k]|^2 = \frac{1}{N}\sum_{n=1}^{N} |X[n]|^2.
\tag{2.24}
\]

The energy content, ‖x‖², is preserved between the time and frequency domains. The formula (2.24) is used to calculate the energy reduction of the implemented system.
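Parseval's relation (2.24) is easy to check numerically. An illustrative Python/NumPy sketch (not the thesis code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)

N = len(x)
X = np.fft.fft(x)

# Both sides of Parseval's formula (2.24)
energy_time = np.sum(np.abs(x) ** 2)
energy_freq = np.sum(np.abs(X) ** 2) / N
```

This is also how an energy-reduction figure can be computed: the same quantity before and after filtering, in whichever domain is more convenient.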


3 Implementation

This chapter describes the main implementation of the system. The data from the tracks are denoted s_m. They have been processed and merged as a final mix, χ. Both the left and the right channel are processed. For the presentation of results, only the left channel is considered, since both channels of the tracks are almost equivalent. The subscript, m = 1, 2, ..., M, indicates a particular track of the benchmark data. The tracks are sorted as in Chapter 1, where m = 1 is the Kick drum and M = 5 is the Vocal.

3.1 Modelling Approach

The implemented system, which performs dynamic equalization, is non-linear, since the contained operations, such as the dft and idft, are non-linear, see Figure 3.1. This section presents the framework of the system. An exact description of the system is presented in Section 3.2.

Figure 3.1: H_tot is the non-linear system which performs dynamic equalization in order to enhance the mix with regard to an optimization problem.

The system, H_tot, inputs audio signals s_m ∈ ℝ and outputs y_m ∈ ℝ. The tracks, s_m, are given in the time domain but transformed to the frequency domain, S_m. In the frequency domain, two kinds of filters are designed – compression and expansion filters. The compression filters are weighted according to a set of inter-spectral dependencies, in order to unmask tracks. The task of the expansion filters is to expand characteristic frequencies of the weighted compression filters. The filters operate on the frequency bins, n.

The causal framework generates batches from s_m by use of a sliding time window. The framework of the system is characterised by a Multi-Input Multi-Output (mimo) structure; Figure 3.2 illustrates the signal flow of each input to each output.

Figure 3.2: An illustration of the dynamic equalizer (per track: pre-process, dft, filter design with filter dependencies, filtering, idft). The primary operations are in the frequency domain. Thus, the ambition lies in estimating the energy spectrum of each track. These energy spectrum estimates are used to create compression and expansion filters.

3.1.1 Causal Framework

The implemented system processes batches,
\[
s_m[k-N+1 : k]^T, \quad k = N, \dots, N_{tot},
\tag{3.1}
\]
where k indicates an actual time window for a certain iteration. Hence, an initial time-delay is introduced which is equal to a batch length, N. The batch length, N, originates from the total sample length,
\[
N_{tot} = t_{tot} f_s,
\tag{3.2}
\]
of an arbitrary track. N_tot is equal for all presented tracks. The analysis is limited to t_tot = 3.75 seconds of a track. Further, {t_N ∈ ℝ | t_N ≥ 0} is introduced as a design parameter to adjust the batch length,
\[
N = N_{tot}\,\frac{t_N}{t_{tot}}.
\tag{3.3}
\]
This is done by computing the time ratio, t_N/t_tot, and multiplying it with the total sample length, N_tot. More specifically, t_N is a design parameter to adjust the balance between the frequency resolution of the estimated energy spectrum and the filtering efficiency. Figure 3.3 illustrates the sliding time window used for the process of dynamic equalization.

Figure 3.3: The batch of data that is being processed is filtered in the frequency domain and subsequently transformed to the time domain, where the last sample is extracted and appended to the previous iteration's extractions. By extracting the last sample of each batch, a filtered time signal, y_m[k], is obtained.
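The sliding-window framework with one-sample extraction can be sketched as below. This is an illustrative Python sketch under stated assumptions (the thesis implementation is in MATLAB); `process_batch` stands in for the whole frequency-domain chain and `dynamic_eq_sketch` is a hypothetical name.

```python
import numpy as np

def dynamic_eq_sketch(s, N, process_batch):
    """Causal sliding-window framework: for k = N..Ntot, process the batch
    s[k-N+1 : k] and keep only its last processed sample (the one-sample,
    worst-case extraction of Section 3.1.5)."""
    Ntot = len(s)
    y = np.zeros(Ntot - N + 1)
    for i, k in enumerate(range(N, Ntot + 1)):
        batch = s[k - N:k]            # current time window
        filtered = process_batch(batch)
        y[i] = filtered[-1]           # extract the last sample only
    return y

# With an identity "filter", the output reproduces the input
# minus the initial delay of N - 1 samples
s = np.arange(20.0)
y = dynamic_eq_sketch(s, N=5, process_batch=lambda b: b)
```

Stepping the window one sample per iteration is what gives the smoothly changing frequency response discussed in Section 3.1.5, at the cost of one full batch computation per output sample.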

3.1.2 Spectral Accuracy of Filtering Operations

Audio signals with various audio effects applied have complex waveforms, and since the system aims to handle any audio signal, every input signal is considered non-periodic. The complexity causes difficulties in modelling arbitrary audio inputs for spectral prediction. The certainty of the spectral estimations is fundamental for the system to hold. By processing batches, the filtering operations constructed from the spectral estimations become more accurate.

3.1.3 Synchronization of Inter-Spectral Dependencies

Since the filters are created based on energy spectrum estimations, a problem arises regarding synchronization. The filtering processes for each track must be synchronized. In other words, the input and output data of each process must coincide in time. The energy spectrum of a processed track, for a certain time window, is adapted based on the energy spectrum of the remaining tracks for that time. Hence, the time instances must match. Otherwise, an ongoing adaption is based on a false energy spectrum.

3.1.4 Time-delay due to Causal Framework

The system introduces a time delay between input and output. For a further developed real-time application, the computation time for processing a batch is considerable. It is the last samples of y_m that have been considered for extraction during the implementation, in order to shorten the delay. By a greater extraction, computation time is accelerated.

3.1.5 Smooth Filtering Operations over Iterations

A static equalization setting does not result in inconsistency in frequency response over time. However, an approach of dynamic equalization does, especially in the case of this thesis, where the filtering operations are to be as frequent and effective as possible. Inconsistency over time of the frequency response after filtration results in a distorted sound, since the natural energy spectrum sequence of the sound gets interrupted by incoherence. In order to create a smooth frequency response, the sliding time window is designed to step one sample for each iteration. Hence, the last sample is extracted from y_m.

3.1.6 Sample Extraction and Properties from the Filtration

The preservation of properties from the equalization to the extracted samples is essential. The extent of extraction affects the audible results, since more samples will be stored for a filtered batch. More than one sample has been considered for the extraction. However, a smoothly changing spectrum has also been prioritized. If the processed mix sounds more distinct for an extraction of one sample, it probably does for other cases. In this sense, the chosen extent of extraction represents the worst case.

3.1.7 Optimizing the Design

In order to understand what the system wants to achieve, a measure of what is good or desirable is defined. The system is designed to simultaneously satisfy the criteria,
\[
\zeta_s = \sum_{k=2N}^{N_{tot}} \sum_{n=1}^{N} \prod_{m=1}^{M} |S_{knm}| \;\ge\; \zeta_y = \sum_{k=N}^{N_{tot}} \sum_{n=1}^{N} \prod_{m=1}^{M} |Y_{knm}|
\tag{3.4a}
\]
\[
L \ge L_0,
\tag{3.4b}
\]
for all cases of tuned parameters. The index, n ∈ ℤ, denotes the frequency bins. Note that the dft components |Y_knm| ∈ ℝ are obtained from the output of the system, y_m[k], where the extracted sample for each filtered time window, k, has been appended to a complete signal sequence. The calculation of the criterion (3.4a) is a separate implementation of sliding time window analysis, focusing on the dft components for validation. Hence, the criterion (3.4a) proves that the properties of the system's frequency operations are preserved. The intervals of k differ since S_knm originate from the original data of the tracks, and Y_knm from the processed data. As mentioned, the use of a causal framework introduces an initial delay of order N.

The value ζ_s is obtained by first multiplying all the dft components. This results in a conservation of frequency overlaps between the tracks. Consider two sets of spectral components which are entirely distinct; ζ_s will then be equal to zero. Further, the amplitudes are summed across the frequency bins, n, to obtain an absolute value of the masking. This process is repeated across the time windows, k.

ζ_y is calculated in the exact same way, but for the filtered and extracted samples. The samples have been manipulated by a time-variant frequency filter H_knm(A, B), where {A ∈ ℝ | A ≥ 0} and {B_m ∈ ℝ | B_m ≥ 0}.

Certainly, all frequencies of the tracks can be eliminated, H_knm(A, B) = 0, in order to reduce ζ_y. However, this is not intuitive, since the tracks would be muted. Frequency overlaps are not to be reduced at any price: the listening experience, L, of the processed mix must be better than or equal to that of the unprocessed mix, L ≥ L_0.

The system is designed to preserve characteristic frequencies. This is where a trade-off between A and B comes in. A is a design parameter to control the compression rate of frequencies in order to reduce ζ_y, and B is a design parameter to control the preservation of characteristic frequencies if they, in certain iterations, are to be filtered due to the unmasking of tracks. By analysis of spectral derivative estimations, expansion filters can be obtained in order to preserve characteristic frequencies. There is no autonomously calculated optimal value; the best tuned values are chosen based on musical intuition. The system is designed to meet the criteria for any value adjustments of the design parameters.

The criteria (3.4a)–(3.4b) describe the goal of the implemented system. However, in order to understand the big picture, an interpretation is given in the time-frequency-track plane. Consider the five spectrograms in Figures 1.4–1.13, each corresponding to a track. By imagining them in parallel, a cubic space of the mix can be formed for one channel. The main key is to achieve manipulated spectrograms where frequency overlaps have been reduced for all time instances without eliminating characteristic frequencies. In this sense, the elements in the mix will sound more distinct through time.
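The masking measure inside (3.4a) – a product over tracks, summed over bins and windows – can be sketched as follows. This is an illustrative Python/NumPy example; the array layout `(windows, bins, tracks)` and the function name are assumptions for the demonstration, not the thesis code.

```python
import numpy as np

def masking_measure(S):
    """ζ as in (3.4a): S holds spectral magnitudes with shape
    (windows, bins, tracks). For every time window, multiply the magnitudes
    across tracks (overlap survives the product; disjoint spectra give zero),
    then sum over bins and windows."""
    return np.sum(np.prod(np.abs(S), axis=2))

# Two tracks, one window, four bins
overlap = np.array([[[1, 1], [1, 1], [0, 0], [0, 0]]], dtype=float)
disjoint = np.array([[[1, 0], [1, 0], [0, 1], [0, 1]]], dtype=float)
```

Entirely distinct spectra give ζ = 0, matching the text: the measure only "sees" bins where several tracks carry energy at the same time.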

3.2 Modelling of Dynamic Equalization

In this section, the implementation from Figure 3.2 is described. The exact flow of H_tot is further illustrated in Figure 3.4. Note that all created filters are considered as Finite Impulse Response (fir) filters and that an ongoing iteration, k, is further represented as a subscript.

3.2.1 Pre-Process

Determination of Filtering Levels

By calculating the rms of each batch,
\[
\Psi_{km} = \sqrt{\frac{1}{N}\left(s_{km}[1]^2 + s_{km}[2]^2 + \dots + s_{km}[N]^2\right)},
\tag{3.5}
\]
the energy level is quantified in a smooth sense over iterations, and the dynamics of the amplitudes for each track are inherited. This is used to determine the gain of the created frequency filter (3.6). Since the amplitude is limited to −1 ≤ s_km ≤ 1, the rms varies according to 0 ≤ Ψ_km ≤ 1. Further, Ψ_km is mapped,
\[
\alpha_{km} = \frac{2}{1 + e^{-A\Psi_{km}}} - 1,
\tag{3.6}
\]
for determining α_km. The design parameter, A, from the optimization criterion (3.4a), is used to adjust the inertia of the filtering levels. The mapping function in (3.6) can be fairly linear, which makes the filtering levels proportional to the rms of the batches. If desired, the mapping function can be adjusted so that maximal filtration prevails for all rms values of the batches.
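The rms computation (3.5) and the mapping (3.6) can be sketched together. This is an illustrative Python/NumPy example (the thesis uses MATLAB); the function names are hypothetical, and the value of A anticipates the tuning in Section 3.3.3.

```python
import numpy as np

def batch_rms(batch):
    """Per-batch rms Ψ_km as in (3.5)."""
    return np.sqrt(np.mean(batch ** 2))

def alpha(psi, A):
    """Mapping (3.6): batch rms Ψ in [0, 1] -> filtering level α in [0, 1)."""
    return 2.0 / (1.0 + np.exp(-A * psi)) - 1.0

A = 12.76                        # value derived in Section 3.3.3
silent = alpha(0.0, A)           # silent batch -> no filtering
loud = alpha(0.287, A)           # loudest benchmark rms -> near-maximal level
```

A silent batch maps to α = 0 (no compression), while the loudest benchmark batch lands at roughly α ≈ 0.95, the target of the tuning procedure.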

Figure 3.4: An illustration of the exact signal flow of the dynamic equalizer for a batch, k, and a track, m: windowing and zero-padding, dft, spectrum estimation and calculation of the compression level, smoothing by a moving average, design of compression and expansion filters, weighting by the inter-spectral dependencies, weighting compression against expansion, equalization, and idft.

3.2.2 DFT

Windowing for a Better Spectral Estimation

The system calculates the dft twice due to the usage of a standard window, W_N[k_N]. In order to obtain warped frequency coefficients from the dft, and to improve the spectral estimation for the creation of filters, a standard window,
\[
s_{km}^{W}[k_N] = \frac{s_{km}[k_N]\,W_N[k_N]}{\sum_{k_N=1}^{N} W_N[k_N]},
\tag{3.7}
\]
is applied to the batches. The superscript, W, indicates that a data set is warped by a standard window.

Zero-Padding for Avoiding Circular Convolution

Before transformation, s_km and s^W_km are zero-padded. The operation of zero-padding appends multiples of N zeros. Hence, linear convolution is maintained and the underlying dtft is further approximated, with a design parameter {P ∈ ℤ | P ≥ 1}. This results in an extended sample length, N → N(P + 1), and the batches become expressed as
\[
\beta^{W} = [s_{km}^{W}\;\; 0_{NP}]^T
\]
\[
\beta = [s_{km}\;\; 0_{NP}]^T,
\tag{3.8}
\]
where 0_{NP} is interpreted as an array with NP zeros.

Separation of Transformations

The Fourier coefficients used for the creation of filters,
\[
S_{km}[n_P] = \sum_{k_P=1}^{N(P+1)} \beta^{W}[k_P]\, e^{-\frac{2\pi j}{N(P+1)}(k_P-1)(n_P-1)}, \quad n_P, k_P = 1, 2, \dots, N(P+1),
\tag{3.9}
\]
must be separated from the ones,
\[
S_{km}^{o}[n_P] = \sum_{k_P=1}^{N(P+1)} \beta[k_P]\, e^{-\frac{2\pi j}{N(P+1)}(k_P-1)(n_P-1)}, \quad n_P, k_P = 1, 2, \dots, N(P+1),
\tag{3.10}
\]
that are only being transformed, filtered and inverse transformed. The superscript o indicates the original transforms, the ones that produce the output. Note that S^o_km[n_P] is not affected by a standard window.

3.2.3 Filter Design

Expansion Filters of Periodograms

The frequency coefficients, S_km, are used to estimate the spectra according to
\[
\phi_{km} = \frac{T}{N(P+1)}\,|S_{km}|^2,
\tag{3.11}
\]
and the frequency coefficients of S^o_km are to be compressed by frequency filters created from the inter-spectral dependencies. Further, the expression (3.11) is smoothed,
\[
\phi_{km}^{\gamma}[n_P] = \frac{1}{2}\left(\phi_{km}[n_P] + \phi_{km}[n_P+1]\right),
\tag{3.12}
\]
by a Moving Average (ma) between nearby samples to reduce noise. The ma is designed as a zero-phase filter, and the operation is indicated by the superscript γ.

By calculating time derivatives, φ'_km, counterweight functions can be created in order to expand essential frequencies. Numerical derivatives of the spectral estimates, periodograms, are calculated by
\[
\phi'_{km} = \frac{\phi^{\gamma}_{km} - \phi^{\gamma}_{(k-1)m}}{T}, \qquad \phi'_{km} \ge 0.
\tag{3.13}
\]

Figure 3.5: An illustration of how frequency filters, for the purpose of expanding high derivatives of frequency components, are created based on the periodograms. The plot is of track 3 and iteration k = 24.

The condition φ'_km ≥ 0 originates from the concept of spectral expansion – frequencies which have positive derivatives are followed by less compression, see Figure 3.5; the negative derivatives are neglected. Further, the sliding time window replaces one frequency component for each iteration, resulting in smoothly changing spectrum estimations. The smoothness avoids too differentiated periodograms for nearby iterations. Thus, the expansion filters,
\[
\Lambda_{km} = 1 - \frac{2}{\pi}\arctan\!\left(B_m \phi'_{km}\right),
\tag{3.14}
\]
can be formed, created by mapping the values of B_m φ'_km between 0 and 1 and inverting the obtained values between 0 and 1. Thereto, B_m is a design parameter, related to the optimization criterion (3.4a), that adjusts the inertia of the expansions. The mapping function in (3.14) is designed to pass lower frequency derivatives properly, in order not to lose important frequencies.

Figure 3.6: An illustration of how compression filters, for the purpose of unmasking tracks, are created based on the periodograms. The plot is of track 3 and iteration k = 24.

Frequency Filters of Periodograms

Further, frequency filters,
\[
H_{km} = \frac{\max_{\phi^{\gamma}_{km}}(\phi^{\gamma}_{km}) - \alpha_{km}\,\phi^{\gamma}_{km}}{\max_{\phi^{\gamma}_{km}}(\phi^{\gamma}_{km})}, \qquad \{\alpha_{km} \in \mathbb{R} \mid 0 \le \alpha_{km} \le 1\},
\tag{3.15}
\]
are obtained for the purpose of compressing the overlapped frequencies. Figure 3.6 illustrates the intuition behind the formula (3.15); it basically describes a flip of the periodograms. Through these operations, frequency filters are formed, consisting of essential information from their spectral estimates. In addition, α_km is included to regulate the filtering levels based on the rms of the batches, Ψ_km.
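The "flipped periodogram" of (3.15) can be sketched directly. This is an illustrative Python/NumPy example with hypothetical names and toy values:

```python
import numpy as np

def compression_filter(phi, alpha):
    """H as in (3.15): a flip of the periodogram. Bins where THIS track is
    dominant map to small H values, so the filter compresses other tracks
    most where this track carries its energy; alpha scales the depth."""
    m = phi.max()
    return (m - alpha * phi) / m

phi = np.array([1.0, 4.0, 2.0])          # a toy smoothed periodogram
H = compression_filter(phi, alpha=0.5)   # -> [0.875, 0.5, 0.75]
```

With α = 0 the filter is all-pass (H = 1 everywhere); with α = 1 the dominant bin is pushed to H = 0, which is exactly the behaviour the rms-driven α_km modulates.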


Anti-Aliasing Filter

The main filtering operation of the system damps the frequency coefficients S^o_km[n_P]. This gives rise to convolution in the time domain through the Impulse Response Function (irf). Because of that, it is important that the amplitudes of the irf are defined between −1 and 1, since no radical gain that may cause distortion is desired. The irfs are considered random, since they are created based on the periodograms of each iteration. This makes it hard to prove that the irf stays within its limits of distortion. However, the irf has for all iterations been measured between −1 and 1, see Figure 3.7. It has been seen that aliasing prevails in the filters, H^γ_k5, where the transition band has become too short for some iterations. This is the main reasoning behind the use of the ma. By a large value of t_N, the transition bands of spectral components become smoother and aliasing is suppressed.

Figure 3.7: The impulse response of H^γ_k5 for a certain iteration. This resulting filter is the most complex one, which is why it is chosen to be presented.

By introducing interpretations of transformation theory, the aliasing defect of the system can be interpreted. Parseval's formula states that energy is conserved between the two domains. This means that the amplitudes of the resulting impulse response are limited to an amount of energy which originates from the energy of a batch. However, when damping the frequency coefficients S^o_km[n_P], the original set of frequencies is destroyed. This can possibly result in a radical distribution of amplitudes in the time domain, which in the worst case can lead to distortion. According to basic theory of Fourier series, the foundation of a time signal lies in the most powerful frequencies of its set. Thus, by suppressing an arbitrary non-powerful frequency, a processed sample may be restored with a stronger amplitude than before, since a particular fluctuation is removed. However, the benchmark examples have quite a big headroom before distortion is introduced. Thus, during this thesis, this has not been a problem. As will further be seen in the results, the numerical differences between samples of the processed and unprocessed mix are mostly negative, Figures 4.8–4.9, meaning that the amplitudes in most cases are compressed and that no radical gain prevails.

3.2.4 Two Sets of Inter-Spectral Dependencies

Two different sets of inter-spectral dependencies have been used to test how the audible and visual results are affected. Set I is not explicitly an optimal set with respect to the criterion (3.4a). It is however of interest to check the audible difference compared to Set II, which is more efficient.

Set I – Creating a New Dimension of Sorting the Mix

It is possible to set up the network of dependencies as desired and based on intuition. The set,
\[
\begin{aligned}
H^{\iota}_{k1} &= 1 \\
H^{\iota}_{k2} &= H_{k1} \\
H^{\iota}_{k3} &= H_{k1}H_{k2} \\
H^{\iota}_{k4} &= H_{k1}H_{k2}H_{k3} \\
H^{\iota}_{k5} &= H_{k1}H_{k2}H_{k3}H_{k4},
\end{aligned}
\tag{3.16}
\]
is a hierarchy for unmasking, resulting in filters, H^ι_km, that weight the information of the frequencies given by each track. The weighting operation is introduced as a superscript by the notation ι. By creating a priority list of tracks, see Figure 3.8, each track is enhanced relative to its rank. One can interpret that a new dimension (frequency) of the two-dimensional music stage (amplitude and stereo panorama) is sorted within the mix.

Figure 3.8: Illustration of how the equalization of the tracks is dependent on each other's periodograms, where Track 1 is the Kick drum and Track 5 is the Vocal.

(44)

The human ear is less accurate in distinguishing amplitudes of low frequencies from different sources compared to higher ones [15]. Because of this, it is intuitive to sort the priority list based on the distribution of the energy content of the tracks, where the track with most energy in low frequencies, the Kick drum, has a rank of order 1.

Set II – Optimal Set

To satisfy the criterion (3.4a) with a larger margin, the set,
\[
H^{\iota}_{km} = \prod_{u=1,\, u \ne m}^{M} H_{ku},
\tag{3.17}
\]
is the best choice. The set (3.17) indicates a series of filter multiplications, resulting in merged filters, H^ι_km, that contain the information of the frequencies the other tracks are weighting. Hence, a batch is equalized based on the periodograms of the other batches. Further, by leaving characteristic frequencies non-equalized, no significant audible information is lost. In addition, by carrying out this equalization frequently, the audible result is considered to be the best.
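Set II's product over "all other tracks" can be sketched as below. This is an illustrative Python/NumPy example; the array layout `(tracks, bins)` and the function name are assumptions for the demonstration.

```python
import numpy as np

def set2_filters(H):
    """Set II (3.17): the filter applied to track m is the product of the
    compression filters of all OTHER tracks. H has shape (tracks, bins)."""
    M = H.shape[0]
    return np.array([np.prod(np.delete(H, m, axis=0), axis=0)
                     for m in range(M)])

# Three tracks, two bins: track 0 dominates bin 1, track 1 dominates bin 0
H = np.array([[1.0, 0.5],
              [0.8, 1.0],
              [1.0, 1.0]])
Hi = set2_filters(H)
```

Each track is compressed only where the other tracks claim energy, and never by its own filter, which is what distinguishes (3.17) from the hierarchical Set I.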

3.2.5 Filtering

Compression Versus Expansion

The final filters,
\[
H^{\iota\iota}_{km} = 1 - (1 - H^{\iota}_{km})\Lambda_{km},
\tag{3.18}
\]
are obtained by weighting the filters, H^ι_km, with their corresponding filters of derivatives, Λ_km. This creates a dynamic in which frequencies with high derivatives are expanded, maximally, to the original amplitude – a feasibility to only filter frequencies which have low derivatives, where masking prevails. By studying the maximal time derivatives, max_{φ'_km}(φ'_km), for each iteration, it can be seen that these derivatives are highly dependent on the amplitudes in the time domain, meaning that they reflect the strength of a signal, see Figure 3.9. Hence, the system maintains characteristic frequencies.
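The weighting (3.18) is a one-liner; the sketch below (illustrative Python, hypothetical function name) shows its two limiting cases.

```python
def final_filter(H_iota, Lam):
    """Weighting (3.18): H'' = 1 - (1 - H')Λ. Where Λ -> 0 (high spectral
    derivative) the bin is restored toward its original amplitude
    (H'' -> 1); where Λ = 1 the full compression H' applies."""
    return 1.0 - (1.0 - H_iota) * Lam

restored = final_filter(0.2, 0.0)    # protected bin: no compression
compressed = final_filter(0.2, 1.0)  # unprotected bin: full compression
```

The expansion filter thus acts as a per-bin blend between "apply the inter-spectral compression" and "leave the bin untouched".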

To ensure a smoothly changing equalization over iterations, despite the clipped saturation introduced by the operation φ'_km ≥ 0, a ma,
\[
H^{\gamma}_{km}[k_p] = \frac{1}{2}\left(H^{\iota\iota}_{km}[k_p] + H^{\iota\iota}_{km}[k_p+1]\right), \qquad H^{\iota\iota}_{km}[N(P+1)+1] = 0,
\tag{3.19}
\]
with a zero-phase design, is applied.


Figure 3.9: max_{φ'_km}(φ'_km), plotted over time for each of the five tracks, reflects significant frequencies of the tracks that should have space, since the derivatives have a maximal rate. Consider the case when someone is playing on a drum kit: the harder the player hits, the faster the frequency response is built. Further, max_{φ'_km}(φ'_km) strongly resembles the signals in the time domain, meaning that it somehow reflects the characteristics.

The Main Filtering Operation

The main and last filtering operation of the system is stated as
\[
Y_{km} = H^{\gamma}_{km} S^{o}_{km}.
\tag{3.20}
\]
Y_km are the frequency coefficients of the output, a result of filtering the original transforms S^o_km.

3.2.6 IDFT

Extraction of Last Sample

The newly obtained frequency coefficients are inverse transformed,
\[
y_{km}[k_P] = \frac{1}{N(P+1)} \sum_{n_P=1}^{N(P+1)} Y_{km}\, e^{\frac{2\pi j}{N(P+1)}(k_P-1)(n_P-1)},
\tag{3.21}
\]
and the last sample of each batch, excluding the zero-padded contribution of samples, is extracted, y_km[N]. Note that the last sample y_km[N] corresponds to s_km[N], since the components of zero-padding are appended at the end of the original batch s_km. Further, due to the risk of distortion, a condition,
\[
y_{km}[N] =
\begin{cases}
y_{km}[N], & y_{km}[N] \le 1 \\
1, & y_{km}[N] \ge 1,
\end{cases}
\tag{3.22}
\]
is introduced to overcome it. Hence, the processed signals y_m[k] are obtained by appending each sample y_km[N].

Merging Tracks into Mixes

By merging the tracks,
\[
\chi_s[k] = \sum_{m=1}^{5} s_m[k]
\tag{3.23}
\]
and
\[
\chi_y[k] = \sum_{m=1}^{5} y_m[k],
\tag{3.24}
\]
the mixes before, χ_s, and after, χ_y, processing are obtained.

Normalization of the Mixes

In order to compare the audibility of the mixes, they are peak normalized, starting by calculating the absolute values, |χ_s| and |χ_y|, since the amplitude varies between −1 and 1. Further, both mixes are divided by their maximal amplitude,
\[
\chi_S = \frac{\chi_s}{\max_{|\chi_s|}(|\chi_s|)}, \qquad
\chi_Y = \frac{\chi_y}{\max_{|\chi_y|}(|\chi_y|)}.
\tag{3.25}
\]
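Peak normalization (3.25) can be sketched directly. An illustrative Python/NumPy example with a hypothetical function name:

```python
import numpy as np

def peak_normalize(mix):
    """Peak normalization (3.25): scale so the largest magnitude is 1,
    allowing a fair loudness comparison between the two mixes."""
    return mix / np.max(np.abs(mix))

chi = np.array([0.1, -0.5, 0.25])
chi_n = peak_normalize(chi)          # -> [0.2, -1.0, 0.5]
```

Both mixes are scaled by their own peak, so level differences introduced by the equalization do not bias the listening comparison.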


3.3 Optimization of Design Parameters

Table 3.1 summarises the design parameters that are to be tuned.

Design parameter | Description
t_N | Determines the filtering resolution of the inter-spectral dependencies.
P | Approximation of the underlying dtft.
A | Determines the extent of the filtration to unmask the tracks.
B_m | Determines the extent of preserving the characteristic frequencies.

Table 3.1: Summary of the system's design parameters.

During processing, some real-time plots have been used to analyze the outcome of variables; two are illustrated in Figures 3.10–3.11.

Figure 3.10: A series of plots, focused on giving insight into the filter weights, H_km, behind a compression filter, H^ι_km, of a particular track, in this case the Vocal: the compression filters based on Tracks 1–4, the compression filter to be applied on Track 5, the expansion filter based on Track 5, and the weighted compression filter to be applied on Track 5. Further, the characteristic frequencies are seen in Λ_k5, which expands the compression filter, H^ι_k5, giving H^γ_k5.


Figure 3.11: A series of plots, useful for getting a deeper insight into essential variables concerning a particular track, in this case the Kick drum: the mapped energy α_k1, the compression filter H_k1 based on Track 1, the expansion filter Λ_k1, the compression filter H^ι_k1 to be applied on Track 1, and the weighted compression filter H^γ_k1. As can be seen, α_k1 indicates the batch energy and the filtering level of H_k1, which filters other tracks. Further, Λ_k1 indicates the expansion of the compression filter H^ι_k1 due to high derivatives of φ'_k1. The expansion is seen in H^γ_k1.

3.3.1 Tuning the Filtering Resolution

The performance of the system depends significantly on how t_N is set. It affects the choice of A, B_m and P. The starting point has been to select a resolution of the spectrum that is as good as possible. If the resolution is poor, necessary frequency components will be lacking, i.e., information of nearby frequencies will be merged into one component, but the computation time becomes shorter. If the resolution is too good, the extracted samples, y_km[N], become too insignificant and randomly computed. Hence, the filtration properties will not be inherited to the time domain, and the computation time becomes longer. By introducing a condition of one informational sample for each 12.5 Hertz, essential information across the spectrum, 0 − 22.05 kHz, is not lost. A finer division of the spectrum is not quite audible. Thus,
\[
t_N = 0.04 \;\rightarrow\; N = 1764
\tag{3.26}
\]
gives an interval of
\[
\frac{\omega_s}{N} = 12.5\ \mathrm{Hz}
\tag{3.27}
\]
between each spectral coefficient. In addition, by a large value of t_N, aliasing is reduced.


3.3.2 Approximating the Underlying DTFT

To ensure linear convolution and to save computation time, P = 1 is chosen. With 2N samples across the spectrum for the chosen t_N, the frequency peaks are quite narrow when the spectrum is visualized.

3.3.3 Tuning the Frequency Filtration

Figure 3.12: The dynamics of Ψ_km for 3.75 s, one panel per track.

The dynamics of α_km are adjusted by tuning A with regard to max_{αkm}(α_km) ≈ 0.95 for max_{Ψkm}(Ψ_km). The value 0.95 is chosen to create a large filtration that does not cancel frequencies out. By storing Ψ_km, five vectors are obtained, see Figure 3.12. Further, by calculating,
\[
\max_{\Psi_{km}}(\Psi_{km}) = [0.2452\;\; 0.2870\;\; 0.1032\;\; 0.1396\;\; 0.1330]^T,
\tag{3.28}
\]
the global maximums of each track are derived. As seen, the second value in the vector (3.28) is max_{Ψk2}(Ψ_k2).


It is max_{Ψk2}(Ψ_k2) = 0.2870, the Bassline, that sets A. This creates a dynamic in which the filtering becomes extensive when batches with high energy prevail. The intention is to relate the extent of each filtration to the rms (3.5) of each batch, to get a dynamic equalizer. Hence, A is obtained from
\[
\max_{\alpha_{km}}(\alpha_{km}) = \frac{2}{1 + e^{-A \max_{\Psi_{k2}}(\Psi_{k2})}} - 1 = 0.95,
\tag{3.29}
\]
which implies that A ≈ 12.76, see Figures 3.14–3.15. In matlab, A is obtained by analyzing the given audio signals, as presented. However, in a real-time case, A is tuned by studying α_km visually through real-time plots, as in Figures 3.10–3.11, while strumming the tracks according to the composed mix.

Figure 3.13: A closer look into min_{Λkm}(Λ_km), to demonstrate a smooth
