Degree Project for BSc in Computer Science and Engineering, 15 ECTS credits
AI Drummer - Using Learning to Enhance
Artificial Drummer Creativity
Oscar Thörn
Degree in Computer Science and Engineering, 180 ECTS credits
Örebro, Spring Semester 2020
Supervisor: Alessandro Saffiotti Examiner: Lars Karlsson
Örebro universitet, Institutionen för naturvetenskap och teknik, 701 82 Örebro
Örebro University, School of Science and Technology
Abstract
This project explores the usability of Transformers for learning a model that can play the drums and accompany a human pianist. Building upon previous work using fuzzy logic systems, three experiments are devised to test the usability of Transformers. The report also includes a brief survey of algorithmic music generation.
The result of the project is that, in their current form, Transformers cannot easily learn collaborative music generation. The key insight is that a new way to encode sequences is needed for collaboration between human and robot in the music domain. This encoding should be able to handle the varied demands and lengths of different musical instruments.
Keywords: Music Generation, Transformers, Deep Learning, Human-Robot Collaboration
Preface
First and foremost, a huge thank you to my supervisor Alessandro Saffiotti, who has guided me through this project, and who a little over a year ago welcomed me to the exciting world of AI, Music and Human-Robot collaboration. Thanks to Peter Knudsen for everything music related. A big thanks to Neurolearn for lending me some equipment for the duration of the project. Thanks also to Örebro University, AASS and the CREA network.
Table of Contents
1 Introduction
  1.1 Motivation
  1.2 Aims and objectives
  1.3 Methodology
  1.4 Outline
2 Literature in Music Generation
  2.1 History of Music Generation
  2.2 Approaches to Music Generation
    2.2.1 Grammars
    2.2.2 Symbolic, Knowledge-Based Systems
    2.2.3 Markov Chains
    2.2.4 Artificial Neural Networks
    2.2.5 Evolutionary and Other Population Based Methods
    2.2.6 Self-Similarity and Cellular Automata
  2.3 Deep Neural Networks and Transformers
    2.3.1 Attention
    2.3.2 Transformer Architectures
3 Experiment 1: Drum-machine commands using 1D encoding
  3.1 Description
  3.2 Architecture
  3.3 Encoding
  3.4 Dataset
  3.5 Result
4 Experiment 2: Direct drumming using 1D encoding
  4.1 Description
  4.2 Architecture
  4.3 Encoding
  4.4 Dataset
  4.5 Result
5 Experiment 3: Drum-machine commands using 2D encoding and beat quantization
  5.1 Description
  5.2 Architecture
  5.3 Encoding
  5.4 Dataset
  5.5 Result
6 Discussion
  6.1 Causes for failure
    6.1.1 Data
    6.1.2 Architecture
    6.1.3 Encoding
    6.1.4 Hyper-parameters
  6.2 Next steps
  6.3 Social, economic, and ethical aspects
7 Reflection
8 References
1 Introduction

This report begins with an introductory chapter describing the motivation for the project, followed by a high-level description of the project, a project specification, and an outline of this report.
1.1 Motivation
This project is a part of ongoing research at the intersection of "Musikalisk Gestaltning" (musical performance) and Artificial Intelligence being performed at AASS in collaboration with CREA, both at Örebro University. Professor Alessandro Saffiotti has been supervising the project.
In the spring of 2019 a collaboration between researchers in the areas of Music, Computer Science, and Philosophy led to the conception of a project to develop an artificial drummer. The drummer can, in real time, accompany the playing of a pianist performing jazz improvisation. The system that was developed used a pair of fuzzy logic rule-based systems, both to translate the digital signals of the piano into features and to control a commercial drum machine based on those features. The system was well received, both from an artistic point of view at several concerts [57] and at the 31st Swedish AI Society Workshop, where a short paper was presented [65].
This led to the further development, in early fall 2019, of a graphical interface for the system, including the capability of conveniently changing the rule base of the second fuzzy logic system [63]. Furthermore, the team behind the system attended Music Tech Fest at Örebro University to showcase the system and participate in a hackathon to develop similar ideas in Music and AI. There the idea of integrating the drummer system with some sort of dancing was born. This led to the development, in late 2019 and early 2020, of an extension to the system that controls the robot Pepper and causes it to "dance" based on the playing of the pianist. This extended system was showcased as part of the "Festkonsert" that started the yearly celebration of Örebro University [57].
During the previous work on the Artificial Drummer it was determined that one possible avenue of further research was to explore learning-based approaches to improve the system's behaviour. The project presented here was designed to pursue that direction of research.
1.2 Aims and objectives
The initial aim of this project was to compare and possibly improve the behaviour of the artificial drummer by using learning-based approaches instead of knowledge-based approaches. After some consideration it was decided that the learning model employed would be the Transformer architecture, which is further detailed in Section 2.3. This was deemed to present interesting problems both from an engineering perspective and from a scientific perspective.
Engineering challenges included the use of streaming data (the continual playing of the piano) instead of fixed-length samples, and achieving real-time or close to real-time performance.
Scientifically this project was of interest both because it applies the Transformer model to real-time streaming data and because it does not seek to generate performances of the same instrument based on some prompt (as in Music Transformer [30]) but rather continually generates performances for another instrument. The comparison between a knowledge-based model and a learning-based model in the area of human-computer co-creation was also of interest.
The initial goal for the system was that it should learn to play the drums to accompany a human pianist. A successful system should be able to meet the following demands. The system should...
• ... be able to handle streaming data, meaning the system should be able to run continuously with data being gathered as the system runs, without having access to future data.
• ... run in real-time. In this situation real-time is defined qualitatively, meaning that the system should be able to work with a live pianist. Depending on the input to and output of the system, this might mean per-step iteration times between a couple of milliseconds (individual notes) and half a second (individual beats at 120 BPM).
• ... produce meaningful musical output in relation to the piano music.

Furthermore, a secondary goal was that the finished system should be open-sourced on the AI4EU platform as a tool that could inspire new creative expressions.
1.3 Methodology
The initial idea of the project was to have the Transformer model learn to imitate the behaviour of the existing fuzzy system. In essence, the Transformer attempts to copy the commands that the existing fuzzy system sends to the drum-machine. This has some advantages. First, it is easy to generate the dataset, and this dataset should have less variance than a natural dataset, which gives quicker initial results. Secondly, it is easy to compare the initial results, since there already exists a working system to compare to. This idea is described as Experiment 1 in this report.
The first step towards Experiment 1 was to generate the dataset, as described in Section 3.4. This was followed by implementing the model architecture to be used, described in Section 3.2. After training the model on the data it was concluded that the system did not work as well as desired. To explore other avenues, two additional experiments were devised.
Experiment 2 follows the same general pattern as the first experiment but instead of trying to copy the commands to the drum-machine this experiment tried to create a model that copied the actual playing of the drum-machine, the individual strikes on the drums.
Experiment 3 was devised to use an alternative representation to Experiments 1 and 2. The representation used is significantly denser, with the motivation being that this allows the model to work with a larger context without necessarily losing much meaningful information.
The three experiments performed are presented in Sections 3, 4, and 5 and discussed together in Section 6.
1.4 Outline
The report first, in Section 2, briefly surveys the relevant literature on algorithmic music generation, with a special focus on the use of Transformers both in music and in other tasks. This is followed by three chapters describing the experiments performed during the project, Sections 3, 4, and 5. Finally, the results and impact of the project are discussed in Section 6, as well as insights gained and lessons learned from the project. This is followed by a reflection about the course, Section 7.
2 Literature in Music Generation
This project is primarily interested in the learning capabilities of music generation systems. Thus, this section will give a brief overview of the theoretical basis of this project, which is expanded upon in the rest of the report. The focus is on music generation research, with special attention to learning-based approaches.
2.1 History of Music Generation
Music generation has been an active area of research since at least 1958. Early examples include the Illiac Suite [27], developed by Hiller and Isaacson, a system that used rules and Markov chains to compose music; the follow-up system by Baker called MUSICOMP [2]; the work done by Xenakis on computer-assisted composition [2]; and work done by Gill at the request of the BBC [19].
Since then many systems have been developed, both systems that autonomously generate music and systems that are designed to augment human composers in their work [6]. Examples of contemporary autonomous systems include MuseNet by OpenAI [8], while the Magenta framework by Google is an example of augmentation [55].
2.2 Approaches to Music Generation
Fernández and Vico [17] suggest that algorithmic composition can be divided into at least five areas based on methodology. They suggest the following grouping:
• Grammars
• Symbolic, Knowledge-Based Systems
• Markov Chains
• Artificial Neural Networks
• Evolutionary and Other Population Based Methods
• Self-Similarity and Cellular Automata

Figure 1: A taxonomy of algorithmic composition methods according to Fernández [17]
Music generation systems can further be described by which parts of the music generation they implement. Oore et al. suggest four such categories [47]. The first is end-to-end direct audio generation, where the waveforms are directly generated by the system, as done by DeepMind's WaveNet [46]. The second is a system that acts as a synthesizer, where the score and performance are provided independently of the system, as done by NSynth [15]. The third approach is to only generate scores and not the performance or audio. The final suggested approach, which is used in this project, is to directly generate performances without generating scores; a synthesizer is then needed to generate audio.
There are also other avenues of research related to music composition. One interesting example is how to classify and recognise the emotion of a piece of music. This is very relevant to music generation because one of the primary questions faced by the human composer and performer is what kind of emotions the music should convey, but it is also one of the harder things for a computer to learn [71, 48]. This is not a direct area of focus for this project but is nevertheless something that might be meaningful for further work.
The following sections will briefly give an overview of the different methods listed by Fernández.
2.2.1 Grammars
In a simple form a grammar consists of two elements: an alphabet of symbols, Σ, and a set of rules that work on these symbols, often to expand abstract symbols into detailed sequences of symbols [17]. A simple example of a grammar, specifically an L-system, is given by McCormack [41] as Σ = {C, E, G} and the set of rules p = {C → E, E → CGC, G → NULL}. Using this grammar and the symbol C as a starting point, it is possible to generate a new sequence after some iterations, as seen in Table 1. In this case the iteration could be continued forever, but it is also possible to add terminating symbols and stop once the sequence has been transformed into only terminating symbols. Another common extension is to add some stochasticity to how the sequence is generated by having rules that are applied with some probability.
Due to their hierarchical nature, grammars have frequently been used in the past to generate music; for example the Illiac Suite, one of the first examples of music generation, used grammars to generate parts of the music [27].
Iteration   String
0           C
1           E
2           C G C
3           E E
4           C G C C G C
5           E E E E
Result      C E C G C E E C G C C G C E E E E

Table 1: Five iterations of a simple grammar, based on [41]
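To make the expansion process concrete, the following Python sketch implements the L-system above; the function and variable names are my own, and the output reproduces the iterations in Table 1.

# Minimal sketch of the L-system from Table 1: alphabet {C, E, G},
# rules C -> E, E -> CGC, G -> NULL (empty string).
RULES = {"C": "E", "E": "CGC", "G": ""}

def expand(symbols, rules):
    """Apply every rule once, in parallel, across the whole string."""
    return "".join(rules.get(s, s) for s in symbols)

state = "C"
for iteration in range(6):
    print(iteration, state or "(empty)")
    state = expand(state, RULES)
# Prints: 0 C, 1 E, 2 CGC, 3 EE, 4 CGCCGC, 5 EEEE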
2.2.2 Symbolic, Knowledge-Based Systems
Knowledge-based systems in this context are a variety of rule-based methods that represent knowledge, in this case about music, in the form of rules. Using rule-based systems to generate music comes naturally, since music has a rich theoretical history of more or less formal rules.
Again, we must mention the Illiac Suite as one of the early works that used a rule-based system [17, 27], specifically to model musical counterpoint. Later examples include Gill [19], who used rules as a search heuristic to find compositions, and Rothgeb, who used a set of rules to find harmonizations for bass [56]. Similar harmonization tasks have been attempted by Thomas [62] and Steels [61, 60]. In 1998 Riecken used a system based on Society of Mind by Minsky to compose melodies based on specified emotions [53].
This is also the approach taken in our first system, which was based on fuzzy logic rules, to let a drum-machine accompany a human pianist [65].
2.2.3 Markov Chains
A Markov process consists of a set of states and the probabilities of transitioning from a given state A to a subsequent state B. In the common form the next state is only conditioned upon the current state, and not on any previous states; this is called the Markov property. There are also extensions of Markov chains, called n-gram models, that take history into account.
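As a small illustration, the Python sketch below samples a short melody from a first-order Markov chain over three pitch classes. The transition probabilities are invented for the example only; in practice they would be estimated from training data or derived from music theory.

import random

# Illustrative transition table: probability of the next note given the current one.
TRANSITIONS = {
    "C": {"E": 0.5, "G": 0.3, "C": 0.2},
    "E": {"G": 0.6, "C": 0.4},
    "G": {"C": 0.7, "E": 0.3},
}

def generate(start, length, seed=0):
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        options = TRANSITIONS[melody[-1]]
        # Markov property: the next note depends only on the current note.
        melody.append(rng.choices(list(options), weights=list(options.values()))[0])
    return melody

print(generate("C", 8))   # e.g. ['C', 'E', 'G', 'C', ...]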
Markov chains have been applied to music generation in the past, both based on transitions generated from training data and based on musical theory [3]. However, they have proved of limited use for fully automatic composition. Simple Markov chains have proven unable to model cohesive structure over longer timespans, only capturing local structure, while larger n-gram models simply learned copies of the training data [17].
The main interest in Markov Chains for music generation has been as a tool for composers to generate source material to aid their composition. Examples of this include Tipei [66], Jones [32], Langston [33], North [45] and Ames and Domino [4]. Markov chains have also found limited use in improvisation systems, for example Manaris [39] and Grachten [21].
2.2.4 Artificial Neural Networks
Artificial Neural Networks, more commonly called simply Neural Networks, have been used in a variety of domains including music generation. Some of these, like DeepMind's WaveNet [46], have been mentioned previously, but they have been used for music generation since Todd used them to generate monophonic melodies in 1989 [67]. Another interesting early example is the work done by Lewis [35], who used a system similar in concept to a modern discriminator architecture. He first trained a network to learn a function from compositions to a score of how good the compositions were and then reversed the system to generate compositions. Also worth mentioning, since it touches on the same area of music as this report, is the work of Franklin [18], who used a neural network in conjunction with reinforcement learning to create jazz improvisations.
During the past decade, a significant amount of research has been performed on using deep neural networks in the area of music generation. Some examples have already been mentioned, such as MuseNet and WaveNet [8, 46]. Other notable examples include PerformanceRNN by Simon and Oore [59], where a recurrent neural network is used to generate polyphonic music; DeepBach by Hadjeres and Pachet [23], which combines supervised and unsupervised learning to generate music in the style of Bach; BachBot by Liang [36], which also generates music in the style of Bach but uses a novel way of modelling polyphony for training; and MusicVAE by Magenta [54], which uses a Variational Autoencoder to generate music.
Recently, techniques from natural language processing (NLP) have also begun to be employed in this area with good results, first by the Magenta team with their Music Transformer [30]. The motivation for the use of Transformers in music generation has been that they can generate coherent structure in music over longer periods of time. Previously, the state of the art in this area used Long Short-Term Memory (LSTM) units as the core building block of the learning architecture [47]. Transformers are the main architecture of this project and are discussed further below.
2.2.5 Evolutionary and Other Population Based Methods
Evolutionary algorithms generally work by keeping a set of candidate solutions and continually improving these candidate solutions by evaluating them and combining selections of the candidates to form new candidate solutions. It follows the pattern of evolution, where better candidates survive and produce new candidates. In most cases what is better is determined by a fitness function. Evolutionary algorithms have been used in music generation on many occasions, since fixed-length musical sequences can naturally be modelled as note progressions that can then be improved iteratively [17].
One major problem with using evolutionary algorithms for music generation has been that defining a good fitness function is very hard, which follows naturally from the fact that objectively evaluating music is nearly impossible. This has led to evolutionary algorithms mostly being used in limited and constrained settings [17]. Nevertheless, evolutionary algorithms have been used. In 1991 Horner and Goldberg [28] performed thematic bridging with EA. Marques used EA to generate short polyphonic melodies [40]. Birchfield used hierarchical EA to generate long compositions using multiple instruments [34]. Lozano used EA to generate harmonizations of predefined sequences [37].
2.2.6 Self-Similarity and Cellular Automata
Self-similarity is the property that a given sequence is statistically similar across the sequence [17]. This is commonly found in classical music, where compositions have multiple levels of regular structure, and it has been found that self-similar music generally is more pleasing to listeners [29, 69]. Based on this, many different systems to generate music based on self-similarity have been created [20, 51, 25, 33]. One such technique is cellular automata.
Cellular automata (CAs) are a technique where cells are ordered in a grid, or other n-dimensional structure; each cell can be in one of a set of states and is initialized to a specific state at time 0. At each time-step the state of every cell updates based on its neighbours and some transition rule. This can generate very interesting patterns in the data, which can then be used as raw material for compositions or even as fully automated compositions, though these tend to be somewhat lacking musically [17].
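The sketch below illustrates the update rule with an elementary one-dimensional cellular automaton (rule 110). The mapping from live cells to pitches at the end is an assumption added purely for demonstration, not the scheme used by any of the systems cited below.

RULE = 110  # the 8-bit rule table of an elementary cellular automaton

def step(cells):
    """Update every cell from its left neighbour, itself, and its right neighbour."""
    n = len(cells)
    new = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right
        new.append((RULE >> index) & 1)
    return new

cells = [0] * 16
cells[8] = 1                                  # a single live cell at time 0
for t in range(8):
    # one possible musical reading: each live cell triggers a midi note at that position
    print(t, [60 + i for i, c in enumerate(cells) if c])
    cells = step(cells)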
Some notable examples include CAMUS by Miranda, a system that used CAs to generate musical sequences, though even the author admits that the sequences lack in musicality [44, 43]. Dorin used multiple CAs to generate polyrhythmic patterns [14]. There have even been commercial products that help composers, notably WolframTones [42]. Furthermore, CAs have been used in conjunction with neural networks, which learn rules for the CA, to produce new compositions [50].
2.3 Deep Neural Networks and Transformers
Transformers were first introduced by Vaswani et al. in 2017 [68], building upon previous work on models based on self-attention, explained below [7]. The first Transformer had an encoder-decoder architecture, common in neural machine translation, but later both encoder-only, BERT [13], and decoder-only, GPT-2 [52], architectures have been used.
The main feature of Transformers is the self-attention mechanism that allows the models to learn how to pay attention to different parts of the input sequence. For example, when translating the word "bank" in the sentence "the robber came to the bank", the model can learn to pay special attention to the word robber, since that word provides the information that the translation should be the word for the financial institution and not the word for the bank of a river.
This section will provide an overview of Transformer models by first discussing the attention mechanism and then some notable Transformer architectures.
Figure 2: Forming the queries, keys, and values vectors for scaled dot-product attention [1]
2.3.1 Attention
Self-attention is a mechanism whereby the system makes predictions on one part of a sample based on other parts of the sample (in sequential systems the other parts would be previous parts of the sample, i.e. the previous words of a sentence). There are many forms of attention that generally can be divided into three different categories. Global/Soft attention pays attention to the entire input space [70]. Local/Hard attention pays attention only to a subset of the input space [70, 38]. Self-attention, discussed here, gives attention to related parts of sequential input [7]. The self-attention mechanism used by Vaswani is scaled dot-product attention [68], but there are also other mechanisms, including content-based attention [22], additive attention [5], location-based attention, and dot-product attention [38].
Scaled dot-product attention works by assigning weights to each position of the input sequence. Specifically, given an input sequence consisting of vector word embeddings, three smaller vectors are formed for every word: a query, a key, and a value vector. These vectors are formed by multiplying the embeddings with a learned weight matrix, visualized in Figure 2.
Given the set of matrices Q, K, V corresponding to queries, keys and values, the scaled dot-product attention is computed as in Equation 1, where d_k is the dimensionality of the query and key matrices. The softmax function computes the normalized probabilities based on the input vector¹. Thus, intuitively, the attention function computes a weighted average of the input values based on which values should be given the most attention [68].

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad \text{[68]} \qquad (1)

¹ \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}
To improve performance, this operation is stacked and performed on different sets of the input in parallel, using individual weights for each head. This is called multi-head self-attention. As can be seen in Figure 3, multi-head attention is the core building block of the Transformer architecture. Multi-head self-attention is useful since it gives the model additional expressiveness and the possibility to focus on several parts of the input using different representations. The output from the different heads is concatenated together and transformed to the appropriate size using a linear layer [68].
Figure 3: Left: Multi-head attention [68], Right: Original Transformer Architecture [68]
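As a concrete reference, the following PyTorch sketch computes Equation 1 with the heads batched into one tensor. The dimensions are illustrative only; the learned projection matrices that produce Q, K and V, and the final linear layer of multi-head attention, are only indicated in the comments.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Equation 1: softmax(QK^T / sqrt(d_k)) V, applied per head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, seq, seq)
    weights = F.softmax(scores, dim=-1)             # attention weights over positions
    return weights @ v                              # weighted average of the values

# Toy dimensions for illustration; q, k, v would normally come from learned
# linear projections of the word embeddings (Figure 2).
batch, heads, seq_len, d_k = 1, 4, 10, 16
q = torch.randn(batch, heads, seq_len, d_k)
k = torch.randn(batch, heads, seq_len, d_k)
v = torch.randn(batch, heads, seq_len, d_k)
out = scaled_dot_product_attention(q, k, v)         # (1, 4, 10, 16)
# In multi-head attention the head outputs are concatenated along the last
# dimension and passed through a learned linear layer to the model dimension.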
A final detail of the self-attention mechanism with sequential input is that the scaled dot-product attention is position invariant: it does not account for the positions of tokens. Thus, it is necessary to add this information to the encoded data. This is done by adding a positional encoding vector to the embedding of each word in the sequence. Vaswani used a sinusoidal positional encoding, Equation 2, where i is the position of the word in the sequence and δ is the position in the embedding vector, going from 0 to d [68].
PE(i, \delta) = \begin{cases} \sin\!\left(\frac{i}{10000^{2\delta'/d}}\right), & \text{if } \delta = 2\delta' \\ \cos\!\left(\frac{i}{10000^{2\delta'/d}}\right), & \text{if } \delta = 2\delta' + 1 \end{cases} \qquad \text{[68]} \qquad (2)

2.3.2 Transformer Architectures
The original Transformer architecture consists of an encoder, centre in Figure 3, and a decoder, right in Figure 3. The encoder creates an attention-based representation of the input sequence which is then fed to the second stage of the decoder. The decoder is meant to retrieve the information generated by the encoder in combination with the previous outputs of the model to determine the output probabilities. So, at each time-step the output depends both on the current input and the previous outputs [68].
Many improvements have been proposed to this basic architecture, especially to improve memory and computational efficiency, since the memory requirement of the original Transformer grows quadratically in the sequence length [30]. Some such improvements include sparse attention matrix factorization [9] and Transformer XL [12], which improves inference times by eliminating unnecessary computations of the same data. Other improvements include automatically considering the distance between inputs [58], which the original Transformer does not do, and its subsequent improvements [30].
The previous work on Transformers serves as a foundation for the work done in this project, which is based on the original Transformer model with relevant improvements.
3 Experiment 1: Drum-machine commands using 1D encoding
3.1 Description
The first experiment performed was for the system to learn the same model as the existing fuzzy logic system. As described before, the fuzzy logic system works on the midi input representation of the piano playing and outputs commands to a commercial drum-machine. The system does not directly control individual parts of the drum-kit but rather controls parameters of the drum-machine. The system in this experiment is given an appropriate encoding of the same input as the fuzzy logic system and is expected to learn to output the correct commands to the drum-machine; Figure 4 visualises this relationship.
3.2 Architecture
In this first experiment a Transformer XL [12] architecture was used. The Transformer XL model is an improvement on the original Transformer model that can maintain a longer context than the original model.
There are two main differences between the Transformer XL architecture and the original Transformer architecture. The first is that Transformer XL uses a decoder-only architecture, which has proven to work well for language modelling tasks but also for translation. The second is that Transformer XL makes the attention layers of the architecture recurrent. The attention layers thus receive both the new input and their own output (called the hidden state) from the previous step.

Figure 4: Visualisation of the high-level architecture and flow of data during training and inference for Experiment 1
So, the full architecture is an embedding layer followed by a variable number of decoder layers, followed by a linear and a softmax layer. The decoder layers consist of a recurrent version of multi-head attention, explained in Section 2.3.1, and a linear layer.
It should also be noted that the Transformer XL architecture uses a different positional encoding compared to the original Transformer. Since Transformer XL maintains a context across input segments, absolute positional encoding does not work, since the first element in every sequence would always have the same absolute positional encoding. Instead, Transformer XL uses a relative positional encoding that relates elements based on their difference from each other; this can be maintained across sequences [12].
The architecture is implemented in Python using the PyTorch deep learning framework and is based on the official implementation provided by the authors of the Transformer XL paper [72].
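As a rough illustration of the recurrence described above, the sketch below shows a heavily simplified decoder-only model where each attention layer also receives its own hidden state from the previous segment. It is not the implementation used in the project (which follows the official Transformer XL code [72]); it omits relative positional encoding, causal masking and dropout, and all names and sizes are illustrative.

import torch
import torch.nn as nn

class RecurrentDecoderLayer(nn.Module):
    """Attention over [previous hidden state ; current segment]."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, memory):
        context = torch.cat([memory, x], dim=1)      # keys/values include the cached state
        attn_out, _ = self.attn(x, context, context)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x, x.detach()                         # the new hidden state becomes memory

class TinyRecurrentTransformer(nn.Module):
    """Embedding -> recurrent decoder layers -> linear (softmax applied in the loss)."""
    def __init__(self, vocab, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList([RecurrentDecoderLayer(d_model, n_heads)
                                     for _ in range(n_layers)])
        self.out = nn.Linear(d_model, vocab)

    def forward(self, tokens, memories):
        x = self.embed(tokens)
        new_memories = []
        for layer, memory in zip(self.layers, memories):
            x, new_memory = layer(x, memory)
            new_memories.append(new_memory)
        return self.out(x), new_memories

model = TinyRecurrentTransformer(vocab=400)
memories = [torch.zeros(1, 0, 128) for _ in range(2)]  # empty memory before the first segment
tokens = torch.randint(0, 400, (1, 16))                # one 16-token segment
logits, memories = model(tokens, memories)             # memories carry this segment forward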
3.3 Encoding
The types of Transformer models used in this project, which handle sequential data, only accept one-dimensional data. Thus an encoding from midi, where each event contains many dimensions of data, to a one-dimensional sequence is needed. This encoding is explained here.
First, a refresher on the midi format. Each midi message contains several fields of data. There are two fields which are always present: type and time. The type field specifies the type of message, including note on, note off, and control and meta messages, and the time field specifies the number of ticks since the previous message. Note events also have velocity and note fields, both containing integers between 0 and 127. Control events instead have control and value fields, also containing integers from 0 to 127. For the encoding of the piano we are interested in note events as well as the control event signifying the use of the pedal. For the drum encoding we are interested in a subset of the note on events and control events, signifying commands sent to the strike drum machine.
The general way to encode this information in one dimension is modelled on the encoding presented by Oore et al. [47]. They suggest that each note on and note off event be encoded as a number corresponding to its pitch. For example, note on with note value 0 would be encoded as 0 and the corresponding note off event as 129. The velocity information is then encoded as separate tokens. At the start of a sequence the velocity is assumed to be 0; if a note event then appears with a velocity different from the last encoded velocity (in this case 0), a token is inserted before the token for the note, signifying that the velocity is now at a new level. The time information is handled in a similar way. The time starts at 0 and then, with each message, time-shift tokens are inserted until the time is equal to the time of the message; time-shifts can advance the time by more than one step.

Figure 5: Visualization of the encoding process
The encoding of the drum-commands is done in a similar way. However, one added complexity arises considering that the piano and drum sequences need to be of the same length as measured in number of tokens. This is difficult to achieve since the two sequences might cover the same time but vary in the number of messages. The solution is to pad each sequence at various places to keep the two sequences in sync, both in time and in number of messages.
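A simplified Python sketch of this event encoding is shown below. The token offsets, the number of velocity bins and the one-tick time-shift token are illustrative assumptions, not the exact values used in the project; as noted above, real time-shift tokens can advance the clock by more than one step.

# Hypothetical token layout for illustration only.
NOTE_ON_BASE = 0       # tokens 0-127: note-on for each midi pitch
NOTE_OFF_BASE = 128    # tokens 128-255: note-off for each midi pitch
VELOCITY_BASE = 256    # tokens 256-287: 32 velocity bins
TIME_SHIFT = 288       # token 288: advance the clock by one tick

def encode(messages):
    """messages: list of (type, note, velocity, delta_ticks) tuples."""
    tokens, current_bin = [], 0
    for msg_type, note, velocity, delta in messages:
        tokens.extend([TIME_SHIFT] * delta)          # advance time up to this message
        if msg_type == "note_on":
            velocity_bin = velocity * 32 // 128
            if velocity_bin != current_bin:          # only emit a token when it changes
                tokens.append(VELOCITY_BASE + velocity_bin)
                current_bin = velocity_bin
            tokens.append(NOTE_ON_BASE + note)
        elif msg_type == "note_off":
            tokens.append(NOTE_OFF_BASE + note)
    return tokens

example = [("note_on", 60, 80, 0), ("note_on", 64, 80, 2), ("note_off", 60, 0, 3)]
print(encode(example))  # velocity token, note-on 60, time-shifts, note-on 64, ...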
3.4 Dataset
For this experiment the Maestro piano dataset is used. It was originally created by the Magenta team for use when, among other things, training music generation systems. The dataset contains over 200 hours of professional piano performances, in both audio and midi form, recorded over many years during the International Piano-e-Competition [24].
The dataset only contains the piano performances, but we also need the commands sent to the drum-machine by the fuzzy system based on these performances. Thus, each of the midi performances has been run, using a custom pipeline, through the fuzzy logic system, with the output commands being recorded and mapped to the correct performance.

Figure 6: Visualisation of the input performance. Improvisation performed by Peter Knudsen based on Impressions by John Coltrane.
The generated raw piano performance and drum command data is then encoded using the scheme described in Section 3.3; this encoded data is stored and composes the x,y pairs used to train the Transformer model.
3.5 Result
After training the model on the described dataset, the model was tested on unseen samples. Running the model with a performance by Peter Knudsen of Impressions by John Coltrane, Figure 6, yields the output commands that are visually represented in Figure 7, the bottom three graphs. The sample is also available for download [64].
Listening to, and watching, the behaviour of the drum-machine, it quickly becomes clear that the model has not learnt the desired behaviour. This is further confirmed by comparing the visualizations of the original fuzzy system to the visualization of the model output. Where the fuzzy system follows the music, with the intensity and complexity changing more or less gradually, the output of the model fluctuates randomly. There does not seem to be any structure to the fluctuations, apart from complexity on average being higher, and no discernible relationship to the music can be heard. It also only changes the pattern once, towards the end of the music, but it does so in a way that would be impossible in the underlying system, going from pattern 1 to 6 directly. It manages to get the muting of two pieces of the drum-kit, between time 20 and 60, roughly correct, but that might be an accident since it does not behave correctly at any other time.

Figure 7: Top: Visualisation of commands sent to the drum-machine by the fuzzy system. Bottom: Visualisation of commands sent to the drum-machine by the Transformer.
The only thing the system seems to have learned very well is to handle the time. Since the forward steps in time are explicitly generated by the model, there is no guarantee that the length of the output (in clock time) will match the length of the input. In this case the generated output is within a few seconds of the length of the input.
4 Experiment 2: Direct drumming using 1D encoding
4.1 Description
The second experiment performed was for the system to learn to play the drums directly, meaning controlling the strikes on the individual parts of the drum kit. The motivation for this experiment was to explore the impact of the output format and type on the system's ability to learn. The first experiment did not conclusively rule out Transformers as a viable option for automatic music accompaniment. One problem with the output of the system in the first experiment is that in the underlying fuzzy system the commands are calculated in parallel, but Transformers traditionally enforce a sequential process where only one command can be sent each timestep. It is also possible that the relationship between the input and the output is too hard to learn; while the Transformer and the fuzzy system receive the same data, the fuzzy system calculates features of the data over time that the attention mechanism of the Transformer maybe cannot match. Training the system directly on the individual strikes of the drums could possibly alleviate these problems. The strikes rarely occur exactly simultaneously, and even when they do, the used encoding has proved effective in previous work with similar, single-instrument, music generation. It is also possible that the strikes have a closer relationship to the piano. The system was given the piano performances as input and was expected to learn to play the drums, based on a limited set of drum patterns generated by the drum-machine.
4.2 Architecture
The second experiment was performed using both the original Transformer architecture and, separately, the Transformer XL architecture.
The Transformer XL architecture used is the same as in Experiment 1 and is given in Section 3.2. The original Transformer architecture is explained in detail in Section 2.3. In brief, the architecture consists of an encoder and a decoder. The encoder extracts information from the input sequence and the decoder uses that, combined with information from the expected output (during training) or previously generated output (during inference), to generate the next token.
As with the Transformer XL architecture the Transformer architecture was implemented in Python using the PyTorch framework and is based on a PyTorch implementation of the original paper [31].
4.3 Encoding
The encoding follows the same principle as explained in section 3.3 with the only difference being that the output encoding is mapped to individual drumbeats instead of commands to the drum-machine.
4.4 Dataset
The dataset in this case only consists of a handful of samples of piano performances performed by Peter Knudsen at Örebro University. These samples were run through the fuzzy logic system to generate drum performances on the drum-machine. The drum performances were then saved from the drum machine and combined with the appropriate piano performance and encoded. The encoded piano and drum performances served as the x,y pairs for this experiment.
The reason for using such a tiny dataset in this case is threefold. First, generating the dataset cannot be done without manual intervention in this case, making more than a few samples infeasible. There is also no public dataset readily available that could be substituted. Second, the purpose of this experiment is primarily to examine the feasibility of the model learning to play the drums. While a larger dataset is necessary for the model to learn a generalised behaviour, a small dataset is sufficient to see if the model can learn to replicate the training material. Third, training multiple times on a large dataset would use too much of the limited time of this project.
4.5 Result
After training the models on the described dataset, the models were tested, but this time using a sample present in the dataset. The sample used as an example here is the same as the one discussed in Experiment 1, a performance by Peter Knudsen of Impressions by John Coltrane, Figure 6.

Figure 8: Visualisation of the high-level architecture and flow of data during training and inference for Experiment 2

Figure 9: Drums generated at random using the same number of possible tokens as in Experiment 2
The output generated by the Transformer XL model is visually represented in Figure 10, the middle graph. The output generated by the original Transformer model is presented at the bottom of the same Figure. The samples are also available for download [64].
It is clear that the original Transformer model has not learnt anything meaningful. Visually the output is more similar to a random sampling of the encoded tokens, Figure 9. Just as in the previous case, the model seems to learn to generate the time fairly well, producing output of a similar length to the input. However, it does not learn anything meaningful concerning the drumming.
Visually it is clear that the Transformer XL model has learnt something. There are clear visual similarities between the representation of the drum performance generated by the fuzzy system, Figure 10, and the performance generated by the Transformer XL model. However, upon hearing the generated drum performance it is clear that the model has not learned the fine-grained details that are necessary to generate a meaningful drum performance. The performance lacks rhythm and there are no discernible patterns to the drumming; rather, the strikes seem random, even though it is clear that there is some large-scale structure in the performance.
Figure 10: Top: Visualisation of drum performance generated by the drum-machine using the fuzzy system. Middle: Visualisation of drum performance generated by the Transformer XL model. Bottom: Visualisation of drum performance generated by the original Transformer model.
5 Experiment 3: Drum-machine commands using 2D encoding and beat quantization
5.1 Description
The third experiment performed was again for the model to learn the same system as the existing fuzzy logic system. As described before, the fuzzy logic system works on the midi input representation of the piano playing and outputs commands to a commercial drum-machine. The system does not directly control individual parts of the drum-kit but rather controls parameters of the drum-machine. The difference between this and the first experiment is that this experiment is given sequential input encoded in a 2D space, where the input and output have been quantized so that each step in the sequence represents a single beat in the performance. To accommodate this, the experiment uses a modified version of the original Transformer architecture.
The motivation for trying this was that there are problems with the encoding used for the first two experiments that perhaps had not been adequately considered. First, the addition of padding in the sequences to continually match them in length can be viewed as adding noise to the data. The padding carries very little information apart from keeping the tokens synchronised, but it is not given that the system does not learn some pattern from it anyway; even if that is not the case, learning to ignore those tokens certainly takes work on the system's part. There is also a second problem with the encoding. The number of input tokens constrains the number of output tokens since they have to be the same. This should not affect training (or evaluation), since the sequences are synchronised during preprocessing, but if the system were applied live it would be a significant problem, since there is no reason to think that the number of piano keys pressed has any relationship with the number of strikes on the drums. The encoding used in this experiment does not have these problems, as the beats are part of the performance of both instruments and always the same between instruments (in most styles of music), thus eliminating the need for padding or different-length sequences. This encoding is also more compact, allowing the system to handle a larger context.
5.2 Architecture
The third experiment was performed using a modified Transformer architecture based on the original Transformer architecture. In this version the multi-head attention layers were modified to handle two-dimensional data. The modifications were based on the Image Transformer [49] and Self-Attention Generative Adversarial Networks [73] papers and are a version of soft attention [70].
5.3 Encoding
The encoding for the third experiment is substantially different from the encoding used for the first two experiments. Instead of encoding each keystroke of the piano separately and explicitly encoding the time, this encoding only encodes full beats and keeps the time implicit along the x-axis. For each beat a single column is created containing one position for each of the 128 notes available in midi, one position representing the use of the pedal, and one position representing the total velocity of the keystrokes played from the time of the previous beat to the time of this beat. For each of the note positions the value in the position is equal to the number of note-on events for that note. If a note, for example note 1, is pressed 4 times during the beat, a four would be encoded in row 1 of the column. Concatenating all of these columns together gives a 2D grid similar to a greyscale image.
The output is similarly a grid, using a multi-label many-hot encoding. This grid also has time along the x-axis, and each column is a representation of the commands sent to the drum-machine during a single beat. Since multiple commands can be sent to the drum-machine during a single beat, specifically one from each category (intensity, complexity, pattern, fill, and mute), the output contains more than one 1 and the rest zeroes.
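The sketch below shows how such input and output grids could be assembled from already beat-quantized data. Apart from the 32 intensity steps mentioned in the results, the sizes of the command-category blocks, the extra rows for pedal and velocity, and all names are assumptions for illustration.

import numpy as np

N_NOTES = 128
PEDAL_ROW, VELOCITY_ROW = 128, 129   # two assumed extra rows

def encode_piano(beats):
    """beats: one dict per beat with 'notes' (midi pitches struck during the beat),
    'pedal' (0/1) and 'velocity' (total velocity over the beat)."""
    grid = np.zeros((N_NOTES + 2, len(beats)), dtype=np.float32)
    for col, beat in enumerate(beats):
        for note in beat["notes"]:
            grid[note, col] += 1          # count of note-on events for this pitch
        grid[PEDAL_ROW, col] = beat["pedal"]
        grid[VELOCITY_ROW, col] = beat["velocity"]
    return grid                           # 2D grid, one column per beat

# One block of rows per command category; only the 32 intensity steps are
# grounded in the text, the other block sizes are assumed.
CATEGORY_SIZES = {"intensity": 32, "complexity": 32, "pattern": 8, "fill": 2, "mute": 4}

def encode_drum_commands(beats):
    """beats: one dict per beat mapping category name -> index within its block."""
    grid = np.zeros((sum(CATEGORY_SIZES.values()), len(beats)), dtype=np.float32)
    for col, commands in enumerate(beats):
        offset = 0
        for name, size in CATEGORY_SIZES.items():
            if name in commands:
                grid[offset + commands[name], col] = 1.0   # many-hot: one 1 per category
            offset += size
    return grid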
5.4 Dataset
The dataset used for the third experiment is generated using the same method as in the first experiment, by running performances through the fuzzy logic system, but using just a few sample performances by Peter Knudsen at Örebro University. The generated data and the piano performances are then encoded using the encoding described for this experiment to form the x,y pairs used.
The reasons for using only a tiny dataset are the same as in Experiment 2. It allows us to explore the feasibility of the system while not spending an inordinate amount of time on training the system.
5.5 Result
Training the model described above using the described dataset and encoding does not lead to any meaningful results. Generally, in multi-label problems every class is independent of every other class; when classifying movie genres, for example, a movie might belong to both the action and romance genres. In this case the classes are only partially independent. For example, the first 32 positions of the many-hot encoding vector correspond to the intensity the drum-machine should use; the range 0-127 has been divided into 32 steps. This means that there should only be one 1 in these 32 positions; however, this cannot easily be enforced using the sigmoid activation function used here or any other standard activation function. In many instances the system outputs several 1s in positions where there should be only one, and in other cases it fails to output exactly one 1 where exactly one is expected. In the case of the 32 intensity positions this might mean, for example, that the intensity should be 5, 60 and 100 at the same time. This makes the output of the system impossible to interpret and thus meaningless.
However, this result should be taken simply as an indication that improvements are needed and not as proof that the idea itself is in error. It was not possible, due to time constraints, to solve the problems presented here, but my intuition is that the problems come more from issues with how this experiment and the architecture were implemented than from any fundamental flaw in the reasoning.
6 Discussion
This section will begin with a discussion of possible reasons for the failure of the experiments conducted during this project. It will then move on to a discussion of the next steps planned to improve and continue the exploration of the project problem. Finally, the project as a whole will be examined based on social, economic, and ethical aspects.
6.1 Causes for failure
There might be many reasons that the experiments did not produce satisfactory results, some of which will be discussed here.
6.1.1 Data
Machine learning models in general and deep learning models in particular are highly sensitive to the quality and amount of data that is available. One possible explanation for the failures is that more data was needed for the model to be able to learn. However, this is not very likely. The amount of data available for training generally mostly affects the ability of the model to learn general behaviour, to improve the behaviour of the model towards unseen data. The amount of data does not generally improve the behaviour of the model towards data it has already seen and might even be detrimental if the model is not expressive enough. As seen, especially in experiment 2, the model cannot even accurately learn to overfit to already seen examples.
A more likely explanation is that the data used is too varied. Given an input sequence ABC and output sequence DEF, learning the function mapping between them is trivial. However, if there are multiple examples of similar or identical input sequences, for example ABC→DEF and ABC→FED, learning the function mapping between them is significantly harder or, in this toy example, even impossible. Now imagine that these two sequences were part of a larger context that indeed could differentiate them, allowing for learning; it would then be necessary for the learning model to be able to capture the necessary context. One possible reason why the models in this project fail might be that they do not adequately capture the necessary context, thus leading to similar input data with dissimilar output data and consequently erratic learning. This might be a reason why the Transformer XL model can learn some structure in Experiment 2 while the original Transformer model performs no better than a random baseline. The Transformer XL architecture is recurrent and incorporates context even beyond the current sequence, while the Transformer model only looks at the current sequence.
6.1.2 Architecture
The architecture plays a major role in any deep learning project, with some architectures being more or less suitable for certain tasks. One possible contributing factor as to why the models in this project have not produced good results could be that the architectures are not suitable for the task. Transformer architectures have been used separately for both translation and sequence generation, but this project requires both at once: the model must both find a mapping between the drums and the piano and generate the drums. From a translation perspective the task is also difficult because the relationship between piano and drums is less clear than the relationship between words with the same meaning in two languages.
Transformer architectures also usually require input in the form of one-dimensional sequences, which is suitable for language, but this might not be suitable for music, where many notes (or words) might be played at once and where the timing of the notes is of paramount importance. As discussed below, encoding music might require a representation in higher dimensions.
6.1.3 Encoding
Just as there are many different ways to represent music to a human with different strengths and weaknesses there are many ways to represent music to a deep learning model, also with various strengths and weaknesses. A good encoding can have significant impact on what and how the model learns, while a bad encoding can make learning impossible by not properly providing the information needed by the learning system.
For Experiments 1 and 2 of this project a common encoding scheme for midi is used. While this encoding has proved functional in other settings, it is possible that it is suboptimal in this case. One difference between this project and other music generation projects is that other projects try to learn the next step in the input sequence. The end result of this learning is that the system can be given a starting point, random or predetermined, and from that point create a coherent musical piece. The system might interact with itself in various interesting ways, but apart from the initial input it has no connection to the real world.
This project however tries to learn to collaborate with another musician, regardless of whether that musician is artificial or human. It does this by trying to learn to play along with another sequence of music. To make the learning possible it is necessary that the sequences are in sync: the output token at position i should correspond to the input token at the same position. However, as seen in Section 3.3, this leads to the need for padding in both the input and output sequences to get them to line up, which in turn causes the length of the sequences to grow. Longer sequences mean that less context can be processed by the model at once, which leads to worse behaviour. Furthermore, at inference time this constrains the output of the system, as the length of the output sequence has to be the same as that of the input sequence, something that is unlikely to occur naturally in music.
This problem of encoding is one of the main lessons learned during this project. Since there has been little work done on human-machine collaborative music using deep learning, there is no go-to way to encode the music when dealing with multiple musical performers. Discovering such an encoding would be useful not just for this project but for many others like it. Some properties needed have become clear from the experiments performed. The encoding needs to be able to handle multiple events per timestep. Having an encoding where one event is placed after the next leads to sequences being of vastly different lengths even though they cover the same amount of time. This leads to the need for padding sequences, which is hard to do at inference time and also degrades the quality of the data. The second property that the encoding should have is that it should handle time without the need to directly encode the passage of time. Since different instruments and performers will produce sound at different time intervals, the encoding needs to be able to handle this while still keeping the musical sequences of different performers in sync.
Improving the encoding thus might lead to better experiment results. The encoding used in experiment 3 was an attempt to solve this problem but requires further work.
6.1.4 Hyper-parameters
The last possibility for why the models did not work is that they were not properly tuned. Deep learning models in general and specifically Transformers can be quite sensitive to changes in hyper-parameters and model size. It is possible that the parameters needed to learn this problem are very specific and that they were not tried during this project.
6.2 Next steps
This project is part of ongoing research in the area of music generation and human-robot collaboration. Even though the experiments did not produce satisfactory results, this project has generated valuable insights. The initial intuition behind using Transformers for this task was twofold. First, Transformers have been used very successfully for both text and music generation problems; it is clear that Transformers can learn interesting and seemingly creative behaviour. Secondly, Transformers have performed well on simultaneous translation tasks, which were thought to bear certain similarities to learning relationships between two instruments. I do not believe that the experiments performed here conclusively show that this intuition was wrong. Especially Experiment 2 indicates that the system is learning, even if the output is not precise enough for music, and we do not have enough data to say that the system cannot learn better with further work. My intuition is that especially the encoding has a significant impact on the performance of the system, in addition to the need for a system that can handle different-length sequences. This section discusses some possible next steps and insights from this project.
First, it is clear that a better encoding is needed when working with multiple instruments in different sequences, and especially when doing collaborative music generation with both human and robot actors. This is one of the main takeaways from this project. Since there has not been much work performed on collaborative human-robot music generation, there is no clear standard for how data should be encoded to facilitate the generation of music based on the performance of another system. The new encoding, or architecture, needs to be able to handle data of different lengths so as to not constrain the length of the output based on the length of the input. It also needs to be able to handle the fact that different instruments are not going to line up in time and are likely to produce sound at different intervals. It would also be beneficial if a more compact and clearer encoding could be produced, as it is likely to improve the performance of the model simply because more context can be processed at once. Additional work in this direction would be meaningful beyond the scope of this project, as it might be useful in many areas of music generation, especially with humans in the loop.
One possible way to work around the encoding problem without changing the encoding itself could be to encode both the piano and drums in one sequence. A model could be trained on such sequences, as has been done with MuseNet [8], to generate both the piano and drums simultaneously. At inference time the piano output could be discarded or even zeroed out from the sampling probability in favor of the actual human piano input. There are clear challenges with this approach, such as correctly merging live and generated music, but it is a possible avenue for improvement of the system.
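A speculative sketch of that sampling step is shown below, assuming the interleaved vocabulary is split into a hypothetical piano-token range and drum-token range; the piano tokens are simply zeroed out of the sampling distribution before a drum token is drawn.

import torch

PIANO_TOKENS = range(0, 300)    # hypothetical range of piano-event tokens
DRUM_TOKENS = range(300, 400)   # hypothetical range of drum-event tokens

def sample_drum_token(logits):
    """logits: (vocab,) output of a model trained on interleaved piano+drum sequences."""
    probs = torch.softmax(logits, dim=-1)
    probs[list(PIANO_TOKENS)] = 0.0      # discard piano tokens from the sampling probability
    probs = probs / probs.sum()          # renormalize over the drum tokens
    return int(torch.multinomial(probs, 1))

logits = torch.randn(400)
print(sample_drum_token(logits))         # always an index in the drum-token range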
A possible experiment is to explore whether the models could learn better with more similar data. One possible problem with the current data is that the relationship between the piano and the drums is too weak to be easily learnable. This could be explored further by creating a dataset where both the input and the output consist of piano playing with some relationship between them. For example, the output piano could harmonize the input piano or play a counterpoint. This would still allow exploration of human-robot interaction in an interesting setting while possibly removing one source of error.
There are also architectural improvements that can be explored. OpenAI has developed a Transformer that uses a form of attention called sparse attention which has significantly lower memory requirements at the expense of increased computation time [10]. It has had some success in raw audio generation and similar techniques could be used here to extend the context of the sequences being generated.
6.3
Social, economic, and ethical aspects
Music generation, being part of research on computational creativity, is an interesting area of research because it touches on fundamental aspects of human existence. The ability to be creative, to create from nothing, is one of the distinguishing characteristics that separate humans from animals. So, if we are ever to achieve anything remotely resembling artificial general intelligence, computational creativity must be an integral part of it.
The broader context that this project is a part of poses several challenging questions. If we eventually achieve systems that generate music on demand that is so good that people want to listen to it for the sake of the music, what does that do to our view of human creativity? If computer systems can encroach into something as human as music, will we come to view our own creativity as something less special? Even today it is possible to generate every possible combination of 8-note 12-beat melodies, covering most of modern pop music [11]. This challenges us to find new ways of thinking about human creativity and how AI should interact with humans. This interaction can generally be divided into three models: augmentation, where the human still performs the task but the AI enhances the performance of the human; replacement, where the AI replaces the human in doing the task; and collaboration, where both the AI and the human are active in jointly performing the task. This project, and the ongoing research, is based on the belief that the most viable model of human-AI interaction is the collaborative model. This has also been the position of the European Union [16, 26]. This is in many respects a more difficult task than replacement because of the human element, but a more fruitful approach nonetheless.
Adopting a collaborative approach in the area of computational creativity, and especially music generation, has the potential to improve our understanding of our own human condition and to open new avenues and expressions of creativity. We have seen this already, on an incredibly small scale, with musicians at the university who are forced to articulate and examine their conceptions of music and creativity when interacting with intelligent systems, leading to a better understanding of what they do.
Failing to account for the consequences of even something as seemingly harmless as a music generation system could prove very problematic. We have seen this with social media, where the systems that are supposed to show us things we enjoy end up isolating us from opinions that could cause us to grow and instead segment us within our own worldview. Imagine similar behaviour in music. If there is an endless stream of perfectly personalised music available, a personal soundtrack if you will, that continually responds to shifts in mood and activity, what does this mean for our view of music? Music is already extremely commercialized, removing much of the artistic expression and turning it into something to consume, and such a service could lead to even more of a consumption mindset. However, handled correctly it could also lead to increased creativity and open the musical arena to people who have previously not been able to interact with music in a creative way, whether due to lack of skill or of opportunity.
There is also an economic dilemma attached to algorithmic music generation. Already in the current market many smaller musicians have trouble surviving on their work as music consumption moves to streaming and on-demand platforms. If algorithmic music generation develops and becomes mainstream, this problem could be further exacerbated. There are two problems attached to this. The first is the question of who should even be paid. It is more or less impossible to conclusively trace the output of a music generation system back to any specific training data. So even though such systems do not create ex nihilo, but rather from their training data, it might be impossible to attribute and compensate the original musicians, and corporations working in bad faith might use this difficulty to simply not pay anybody, effectively making music free for them to produce. Even if corporations pay the original musicians this could also be problematic: since it is difficult to determine who to pay, they could opt to pay very little to everyone included in the training data. This would be similar to the problem with streaming services, only much worse, and with the added problem that the pay is not necessarily tied to the skill or following of the musician. And if the systems started to generate great music that then got included in the training data, it would eventually be impossible to know anything about the original sources. In the best-case scenario, music generation would be used to generate more music based on a specific musician or a small set of musicians, and those musicians would be paid accordingly, but at the present level of development the need for large datasets prohibits this.
These and other considerations should be an integral part of the development of algorithmic music generation and artificial intelligence in general, so that what we develop leads to the betterment of humanity.
7
Reflection
This project has been a great learning experience, and even though the results have not been as good as one might have hoped, I have learned a lot both about project management and about the technical and theoretical aspects of algorithmic music generation and deep learning. In this section I will reflect on the course based on the course goals.
Course goals The course goals state that the outcome of the course should be that the student has deepened theoretical knowledge in some area of study, as well as knowledge of relevant techniques and methods. I think that this project has been a great way to learn more about algorithmic music generation, and I now feel that I can really orient myself in the research being done, both past and present. I have also learnt a lot about deep learning techniques, their usefulness and their challenges, especially the use of Transformers. And if I continue to work on similar problems in the future, I have a good foundation to stand on.
The project has also given me many new skills, both technical and in project management. I have been fortunate to have good help from my supervisor Alessandro Saffiotti, but I have also been encouraged to design and plan the project with a great deal of autonomy. In future projects I think I will have good use of the ability to find relevant knowledge quickly, but I have also learnt the importance of getting to initial results quickly; if I could do this project again I would probably start with a smaller dataset and iterate more quickly. It is not up to me to judge whether this report and my presentation of the project communicate clearly and rest on a good scientific foundation, but I have certainly learnt a lot about technical and scientific writing.
General reflection This has been a very interesting project. One of the main takeaways for me is the impact of data representation when doing machine learning. It is not enough to have the data; you also need to be able to format it in a way that is meaningful and useful for the system you are using.
If I were to continue this project, I would want to spend more time working on the data representation and the details of the architecture. Since this course is only 10 weeks long it was hard to find time to make detailed changes to the architecture, so I would like to continue learning how different machine learning techniques work so that I can better incorporate different ideas into the architectures I use.
8
References
[1] Jay Alammar. The Illustrated Transformer. url: http://jalammar.github.io/illustrated-transformer/ (visited on 05/20/2020).
[2] Charles Ames. “Automated Composition in Retrospect: 1956-1986”. In: Leonardo 20.2 (1987), pp. 169–185. issn: 0024094X, 15309282. url: http://www.jstor.org/stable/1578334.
[3] Charles Ames. “The Markov Process as a Compositional Model: A Survey and Tutorial”. In: Leonardo 22.2 (1989), pp. 175–187. issn: 0024094X, 15309282. url: http://www.jstor.org/stable/1575226.
[4] Charles Ames and Michael Domino. “Cybernetic Composer: An Overview”. In: Understanding Music with AI: Perspectives on Music Cognition. Cambridge, MA, USA: MIT Press, 1992, pp. 186–205. isbn: 0262521709.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by Jointly Learning to Align and Translate”. In: 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. url: http://arxiv.org/abs/1409.0473.
[6] Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. “Deep Learning Techniques for Music Generation - A Survey”. In: ArXiv abs/1709.01620 (2017).
[7] Jianpeng Cheng, Li Dong, and Mirella Lapata. “Long short-term memory-networks for machine reading”. In: arXiv preprint arXiv:1601.06733 (2016).
[8] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. “Generating Long Sequences with Sparse Transformers”. In: ArXiv abs/1904.10509 (2019).
[9] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. “Generating long sequences with sparse transformers”. In: arXiv preprint arXiv:1904.10509 (2019).
[10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. “Generating long sequences with sparse transformers”. In: arXiv preprint arXiv:1904.10509 (2019).
[11] Samantha Cole. Musicians Algorithmically Generate Every Possible Melody, Release Them to Public Domain. url: https://www.vice.com/en_us/article/wxepzw/musicians-algorithmically-generate-every-possible-melody-release-them-to-public-domain (visited on 05/27/2020).
[12] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. “Transformer-xl: Attentive language models beyond a fixed-length context”. In: arXiv preprint arXiv:1901.02860 (2019).
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: arXiv preprint arXiv:1810.04805 (2018).
[14] Alan Dorin. “Liquiprism: generating polyrhythms with cellular automata”. English. In: Proceedings of the 8th International Conference on Auditory Display. Ed. by R Nakatsu and H Kawahara. International Conference on Auditory Display; Conference date: 01-01-2002. Advanced Telecommunications Research Institute, 2002, pp. 447–451. isbn: 4990119002.
[15] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders”. In: Proc. of the Int. Conf. on Machine Learning. 2017, pp. 1068–1077.
[16] European Commission. White Paper on Artificial Intelligence — A European approach to excellence and trust. https://ec.europa.eu/commission/presscorner/detail/en/ip_20_273. 2020.
[17] Jose Fernandez and Francisco Vico. “AI Methods in Algorithmic Composition: A Comprehensive Survey”. In: Journal of Artificial Intelligence Research 48 (2014), pp. 513–582. doi: 10.1613/jair.3908.
[18] Judy A. Franklin. “Multi-Phase Learning for Jazz Improvisation and Interaction”. In: 2001.
[19] S. Gill. “A Technique for the Composition of Music in a Computer”. In: The Computer Journal 6.2 (1963), pp. 129–133. issn: 0010-4620. doi: 10.1093/comjnl/6.2.129. eprint: https://academic.oup.com/comjnl/article-pdf/6/2/129/1041403/6-2-129.pdf. url: https://doi.org/10.1093/comjnl/6.2.129.
[20] Michael Gogins. “Iterated Functions Systems Music”. In: Computer Music Journal 15.1 (1991), pp. 40–48. issn: 01489267, 15315169. url: http://www.jstor.org/stable/3680385.
[21] Maarten Grachten. “JIG: Jazz Improvisation Generator”. In: Proceedings of the MOSART Workshop on Current Research Directions in Computer Music. 2001.
[22] Alex Graves, Greg Wayne, and Ivo Danihelka. “Neural Turing Machines”. In: CoRR abs/1410.5401 (2014). arXiv: 1410.5401. url: http://arxiv.org/abs/1410.5401.
[23] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. “DeepBach: a Steerable Model for Bach Chorales Generation”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR, 2017, pp. 1362–1371. url: http://proceedings.mlr.press/v70/hadjeres17a.html.
[24] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. “Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset”. In: International Conference on Learning Representations. 2019. url: https://openreview.net/forum?id=r1lYRjC9F7.
[25] Martin Herman. “Deterministic chaos, iterative models, dynamical systems and their application in algorithmic composition”. In: Proc. of the Int. Computer Music Conference. 1993.
[26] High Level Expert Group on Artificial Intelligence. Ethics Guidelines for Trustworthy AI. https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai. 2019.
[27] L. A. Hiller Jr. and L. M. Isaacson. “Musical Composition with a High-Speed Digital Computer”. In: J. Audio Eng. Soc 6.3 (1958), pp. 154–160. url: http://www.aes.org/e-lib/browse.cfm?elib=231.
[28] Andrew Horner and David Goldberg. “Genetic Algorithms and Computer-Assisted Music Composition”. In: Proceedings of the International Conference on Genetic Algorithms. 1991, pp. 337–441.
[29] Kenneth J. Hsu and Andrew Hsu. “Self-Similarity of the ”1/f Noise” Called Music”. In: Proceedings of the National Academy of Sciences of the United States of America 88.8 (1991), pp. 3507–3509. issn: 00278424. url: http://www.jstor.org/stable/2356768.
[30] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck. “Music transformer: Generating music with long-term structure”. In: Proc. of the Int. Conf. on Learning Representations. 2018.
[31] Yu-Hsiang Huang. Attention is all you need: A Pytorch Implementation. url: https://github.com/jadore801120/attention-is-all-you-need-pytorch (visited on 05/24/2020).
[32] Kevin Jones. “Compositional Applications of Stochastic Processes”. In: Computer Music Journal 5.2 (1981), pp. 45–61. issn: 01489267, 15315169. url: http://www.jstor.org/stable/3679879.
[33] Peter S. Langston. “Six Techniques for Algorithmic Music Composition”. In: Proc. of the Int. Computer Music Conference. Michigan Publishing, 1989. url: http://hdl.handle.net/2027/spo.bbp2372.1989.040.
[34] Fred Lerdahl and David Birchfield. “Evolving intelligent musical materials”. PhD thesis. Columbia University, 2003.
[35] J. P. Lewis. “Creation by refinement and the problem of algorithmic music composition”. In: Music and Connectionism. Cambridge, MA, USA: MIT Press, 1991.
[36] Feynman T. Liang, Mark Gotham, Matthew Johnson, and Jamie Shotton. “Automatic Stylistic Composition of Bach Chorales with Deep LSTM”. In: Proc. of 18th International Society for Music Information Retrieval Conf. 2017.
[37] Leonardo Lozano, Andr´es Medaglia, and N. Velasco. “Generation of Pop-Rock Chord Sequences Using Genetic Algorithms and Variable Neigh-borhood Search”. In: Proc. of the Conf. on Applications of Evolutionary Computation. 2009, pp. 573–578. doi: 10.1007/978-3-642-01129-0_64.