News Release

Exploring variational auto-encoder architectures, configurations, and datasets for generative music explainable AI

Peer-Reviewed Publication

Beijing Zhongke Journal Publishing Co. Ltd.

Variational auto-encoder architecture

A VAE architecture consists of 1) an encoder, which encodes training data into 2) a multi-dimensional latent space, from which 3) a decoder generates new data in the style of the training data.

Credit: Beijing Zhongke Journal Publishing Co. Ltd.
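
For readers who prefer code to diagrams, the encoder-latent-decoder pipeline in the figure can be sketched in a few lines. The following is a minimal PyTorch sketch, with layer sizes and module names chosen for brevity rather than taken from the paper's models:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE sketch: 1) encoder -> 2) latent space -> 3) decoder.
    Sizes and layers are illustrative, not those of the paper's models."""

    def __init__(self, input_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(128, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar
```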

Music generation is a key use of AI for the arts, and is arguably one of the earliest forms of AI art. However, contemporary generative music models rely increasingly on complex machine learning approaches such as neural networks and deep learning techniques, which are difficult for people to understand and control. This makes such models hard to use in real-world music-making contexts, as they are generally inaccessible to musicians or anyone other than their creators.

 

Making AI models more understandable to users is the focus of the rapidly expanding research field of explainable AI (XAI). One approach to making machine learning models more understandable is to expose elements of the models to people in semantically meaningful ways, for example, using latent space regularization to increase the meaningfulness of dimensions in otherwise opaque latent spaces. To date, there has been very little research on the applicability and use of XAI for the arts. Indeed, there is a lack of research on what configurations of generative AI models and datasets are more or less amenable to explanation.
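
Concretely, latent space regularization of this kind adds a loss term that ties a chosen latent dimension to a semantically meaningful attribute. The sketch below assumes one published style of attribute-based regularization, which penalizes pairs of examples whose ordering along the latent dimension disagrees with their ordering by attribute value; the function name and the delta parameter are illustrative:

```python
import torch
import torch.nn.functional as F

def attribute_regularization_loss(z_dim, attribute, delta=10.0):
    """Encourage the latent values `z_dim` to increase monotonically with
    `attribute` across a batch (both 1-D tensors of length B). An
    illustrative attribute-regularization style loss, not necessarily
    the paper's exact formulation."""
    dz = z_dim.unsqueeze(0) - z_dim.unsqueeze(1)          # pairwise latent differences (B, B)
    da = attribute.unsqueeze(0) - attribute.unsqueeze(1)  # pairwise attribute differences (B, B)
    # Mismatched orderings push tanh(delta * dz) away from sign(da).
    return F.l1_loss(torch.tanh(delta * dz), torch.sign(da))
```

Added to the usual VAE reconstruction and KL terms, a loss of this shape nudges the chosen dimension towards encoding the attribute, which is what makes the dimension meaningful to a user.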

 

The paper published in Machine Intelligence Research by the team of Prof. Nick Bryan-Kinns takes a first step towards understanding the link between the explanation and performance of AI models for the arts by examining what effect different AI model architectures, configurations, and training datasets have on the performance of generative AI models that have some explainable features.

 

Section 2 reviews related work. It first introduces explainable AI, including its definition, approaches, and key elements. Researchers are specifically concerned with how to make AI models more interpretable for people so that they can better control the generative aspects of the AI model. This part then presents the limited research to date on XAI for the arts. Taking music as a key form of artistic endeavour, this paper explores explainable AI for music and focuses on a core use of AI for music: generative music. Given the current state of related research, the researchers aim to compare the effect of meaningful labels on AI models in different configurations and with different datasets. This part also introduces latent spaces for music generation.

 

As outlined in previous sections, there are many approaches to music generation using deep learning models, and each year new models are added to the repertoire of music generation systems. However, to date there has been no systematic analysis of how different training datasets and AI model architectures might impact the performance of XAI models for music. The core research question of this paper is: What effects do different AI model architectures, configurations, and training data have on the performance of generative AI models for music with explainable features? This paper begins to address the core research question by systematically asking the following questions about the performance of VAE generative models with explainable features:

RQ1. What is the effect of VAE model architectures on performance?

RQ2. What effect do the musical features imposed on the latent space have on performance?

RQ3. What effect does the size of the latent space have on performance?

RQ4. What effect do training datasets have on performance?

 

Section 4 includes three subparts: candidate AI models, datasets, and musical features. The first subpart introduces measureVAE and adversarialVAE. As a first step in understanding what effect explainable features have on the performance of generative AI model architectures, researchers compare two representative VAE generative music models, measureVAE and adversarialVAE. Both build on a VAE architecture to generate music but differ in how musically semantic information is applied to the generation: measureVAE imposes regularized dimensions on the latent space, whereas adversarialVAE adds control attributes to the decoder. In the second subpart, researchers take the frequently used Irish folk dataset and compare it with datasets of Turkish folk music, pop music, and classical music. This subpart also includes a table presenting key features of the datasets used, including their musical features. The third subpart introduces the musical features that can be imposed on music generation: note density, note range, rhythmic complexity, and average interval jump.
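
To make the four musical features concrete, the sketch below shows one plausible way to compute them for a monophonic measure represented as (MIDI pitch, onset beat) pairs. These definitions, particularly the rhythmic complexity proxy, are illustrative assumptions rather than the paper's exact formulas:

```python
def musical_features(notes, n_beats=4):
    """Compute (note density, note range, rhythmic complexity,
    average interval jump) for a list of (midi_pitch, onset_beat)
    tuples. Illustrative definitions only."""
    pitches = [p for p, _ in notes]
    onsets = [t for _, t in notes]
    note_density = len(notes) / n_beats                         # notes per beat
    note_range = max(pitches) - min(pitches) if pitches else 0  # semitones
    jumps = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    avg_interval_jump = sum(jumps) / len(jumps) if jumps else 0.0
    # Crude proxy: fraction of onsets that fall off the beat.
    rhythmic_complexity = sum(1 for t in onsets if t % 1 != 0) / max(len(onsets), 1)
    return note_density, note_range, rhythmic_complexity, avg_interval_jump

# Example: a one-bar phrase with a syncopated third note.
print(musical_features([(60, 0.0), (64, 1.0), (67, 2.5), (72, 3.0)]))
```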

 

Section 5 and Section 6 present the experimental settings for comparing VAE model architectures and for examining latent space configurations and training datasets. Researchers found that measureVAE has higher reconstruction accuracy and reconstruction efficiency than adversarialVAE, but lower musical attribute independence. The results also show that measureVAE can generate music across folk, pop, rock, jazz and blues, R&B, and classical styles, and that it performs best with lower-complexity musical styles such as pop and rock. Furthermore, results show that measureVAE was able to generate music across these genres with interpretable musical dimensions of control.

 

The measureVAE-generated output was found to have different musical interpretability scores for different datasets, but there was no correlation between the musical features of the datasets and the interpretability scores of the generated music. With four regularized dimensions, the Irish folk dataset has the highest average interpretability scores for note density and note range, whereas Muse Bach has the highest scores for rhythmic complexity and average interval jump.
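
Interpretability scores of this kind are commonly computed by measuring how well a single regularized dimension linearly predicts its target attribute across encoded or generated examples. The sketch below assumes such a linear-fit metric; the function is hypothetical and the paper's exact metric may differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def interpretability_score(z_dim, attribute):
    """Fit attribute ~ z_dim and return the coefficient of
    determination R^2: closer to 1 means the regularized dimension
    explains the attribute almost linearly. An assumed, illustrative
    metric."""
    x = np.asarray(z_dim).reshape(-1, 1)
    y = np.asarray(attribute)
    return LinearRegression().fit(x, y).score(x, y)
```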

 

Interpretability metrics were in general higher when only two dimensions of the latent space were regularized. Similarly, loss and reconstruction accuracy scores were better with two regularized dimensions than with four. These findings are to be expected, as it is easier to achieve successful and linearly independent regularization in fewer dimensions. For loss and reconstruction accuracy, measureVAE performed better with the pair of note density (ND) and rhythmic complexity (RC) regularized dimensions than when trained with note range and average interval jump regularized dimensions. This may be because measureVAE is better at generating the tonal and rhythmic aspects of the music, which are captured by ND and RC.

 

In terms of recommendations for use, results suggest that a 32- or 64-dimensional latent space would be optimal when using measureVAE to generate music across a range of genres, as this minimizes latent space size whilst maximizing reconstruction performance and providing interpretability scores similar to those offered by higher-dimensional spaces. However, careful selection of latent space size is required for generating specific genres of music. For example, Irish folk music may be optimally generated with a 16- or even 8-dimensional latent space.

 

These results show that when explainable features are added to the measureVAE system, it performs well across genres of music generation. For XAI and the arts more broadly, the researchers' approach demonstrates how complex AI models can be compared and explored to identify optimal configurations for the required styles of music generation. The work also demonstrates the complex relationships between datasets, explainable attributes, and AI model music generation performance. This complex relationship has wider implications for generative AI models. For example, it highlights the bias built into models which makes them more amenable to certain datasets than others, a key concern of human-centred AI. In this study, the structure of measureVAE biased it towards lower-complexity musical styles such as pop and rock at the expense of more complex forms of music such as Turkish makam, which, it is worth noting, are often more marginalized forms of music.

 

This is the first time that two VAE models with semantic features for control of music generation have been systematically compared in terms of performance, latent space features, musical attributes, and training datasets. The team of Prof. Nick Bryan-Kinns proposes that future research should explore the effect that other genres and datasets, dataset sizes, musical attributes, and training regimes have on the performance of explainable AI models.

 

See the article:

Exploring Variational Auto-encoder Architectures, Configurations, and Datasets for Generative Music Explainable AI

http://doi.org/10.1007/s11633-023-1457-1

