Music Genre Classification

Deekshith Raya, 150102069, ECE


Vinod Patidar, 150102070, ECE


Yash Khatri, 150102071, ECE


Sai Bhaskar Devatha, 150108012, EEE

Harmonic and melodic characteristics are what make a sound feel like music, which makes them a key distinguishing factor for an automatic music type classifier like ours. Pitch is a perceptual property of sound that lets humans distinguish between pieces of music based on these characteristics. Chroma, a key component of pitch, is widely used to measure and analyse these characteristics because it correlates closely with them while being robust to changes in the instrumentation and timbre of the music. Therefore, in this project we extracted and analysed chromagrams to study their effectiveness in classifying music into four broad categories: Classical, Jazz, Metal and Pop.
1. Introduction
With internet penetration increasing day by day, a huge amount of useful data is within easy reach of people. Although access to data appears easy, this exponentially growing volume brings a new problem: most of it is unclassified.

Through this project, we aim to address this problem for something very close to people: music. We explore various methodologies used to develop an automatic music genre classifier and thereby help in comparing the efficiency of these methods.

Apart from the generic use of classification, it can be further used to better understand audio properties and human perception of music. Moreover, its applications can be extended to develop various systems like music genre-based disco lights and emotion-mapped music.
1.1 Introduction to Problem
While building such a classifier, the major challenge lies in deciding the frame rate and in feature computation and extraction.

The choice of frame rate is critical to the performance of the classifier because a music file varies greatly over time, i.e., the audio might lean more toward one genre at the start and toward another at the end, but it is the average that matters most.

Similarly, extracting features that detect subtle differences between the genres is very important for improving the accuracy of the classifier.
1.2 Figure
[Figure unavailable]
1.3 Literature Review
The field of automatic music genre classification has been of great interest to audio signal processing researchers for quite some time. There are various publications targeting different sets of features and using different statistical methodologies for classification.

George Tzanetakis, Georg Essl and Perry Cook pioneered the field and published their work on a 9-dimensional feature vector: mean-Centroid, mean-Rolloff, mean-Flux, mean-Zero Crossings, std-Centroid, std-Rolloff, std-Flux, std-Zero Crossings, Low Energy. They obtained accuracies of 16% and 62% with a random and a Gaussian classifier, respectively.[1] Hariharan Subramanian used MFCC, rhythmic features and MPEG-7 features in addition to the above features.[3] In Music Genre Classification, Michael Haggblade, Yang Hong and Kenny Kao used MFCC and obtained an accuracy of around 87% with a DAG SVM.[4] In Music type classification by Spectral Contrast feature, an accuracy of 82.3% was observed.[2]

Juan Pablo Bello describes in his paper that most music is based on the tonality system, i.e., sounds arranged according to pitch relationships into interdependent spatial and temporal structures. He also shows how chroma can be used to analyze these relations and thereby understand human perception of music. This inspired us to experiment with the chromagram in our automatic music genre classification project.[8]
1.4 Proposed Approach
The automatic music genre classifier presented here aims to categorize a piece of music into 4 broad categories: Classical, Jazz, Metal and Pop. The categorization is done by applying a classifier to a vector of features computed from the audio.

Humans are remarkably good at genre classification: they can identify the genre of a piece of music from 250 milliseconds of audio. This suggests that a genre classification methodology should be as close as possible to human perception of music rather than to any higher-level theoretical description. Therefore, we have used chroma-based features, as they are closely correlated with the harmonic and melodic aspects of music while being robust to changes in timbre and instrumentation. The chromagram is also widely used to analyze human perception of music and map it to signal processing techniques. Further, we compared the performance of these chroma-based audio features, or pitch class profiles, with the performance of other features such as MFCC, zero-crossing and rhythm-based features to establish the classification efficiency associated with each.

1.5 Report Organization
The report is organized as follows:
1. Title and Authors (top of the report)
2. Abstract (before Section 1)
3. Introduction to Problem (Section 1)
4. Proposed Approach (Section 2)
5. Experimental Results, Project Code and Discussions (Section 3)
6. Conclusions and Summary (Section 4)
7. References (Section 5)
This order is followed to ensure that every finding of ours is properly documented.
2. Proposed Approach
The approach used here to build this automatic music genre classifier involves three major steps -
   2.1 Data Preprocessing
   2.2 Feature Extraction and Selection
   2.3 Classification
2.1 Data Preprocessing
Before diving into feature analysis and classification, the data must be in a format suitable for analysis. The GTZAN dataset provides audio in the ".au" format. This was converted into the ".wav" format using the AnyMP4 Audio Converter software. A library was then used to convert the ".wav" data into a time series and store it as a NumPy array. This same NumPy array was then used for feature analysis and classification.
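The report does not name the library used for this step, so the following is only a minimal sketch of reading a ".wav" file into a NumPy array. It assumes 16-bit mono PCM audio (as in the dataset description) and uses Python's standard `wave` module; the function name `load_wav_mono16` is hypothetical.

```python
import wave

import numpy as np


def load_wav_mono16(path):
    """Read a 16-bit mono PCM .wav file and return (samples, sample_rate).

    Samples are returned as floats scaled to the range [-1, 1).
    """
    with wave.open(path, "rb") as wav_file:
        if wav_file.getnchannels() != 1 or wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        sample_rate = wav_file.getframerate()
        raw_bytes = wav_file.readframes(wav_file.getnframes())
    # WAV stores 16-bit samples as little-endian signed integers.
    samples = np.frombuffer(raw_bytes, dtype="<i2").astype(np.float64) / 32768.0
    return samples, sample_rate
```

The returned array is the time series on which framing and feature extraction can then operate.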
2.2 Feature Extraction and Selection
The audio signal, stored in the NumPy array, was then divided into equally sized frames for further analysis. The following procedure was used to extract chroma-based features:
->  Short Time Fourier Transform (STFT) was initially calculated for each frame.
->  This was then passed through a chroma filter to obtain chromagram.
This chromagram was used to calculate features like centroid chroma, chroma spread, Min-Max chroma and chroma flux. In an additional approach, 10 frames were used together to calculate mean energy and standard deviation of each chroma.
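The framing, STFT and chroma mapping steps can be sketched roughly as follows. This is a simplified illustration and not the project's actual code: each FFT bin is assigned to exactly one pitch class (a real chroma filter bank typically uses smoother weights), and the frame size, hop size, Hann window and 440 Hz tuning reference are all assumptions.

```python
import numpy as np


def chromagram(samples, sample_rate, frame_size=2048, hop_size=512, a4_hz=440.0):
    """Return an array of shape (n_frames, 12) holding chroma energies.

    Bin 0 corresponds to pitch class C and bin 9 to pitch class A.
    """
    samples = np.asarray(samples, dtype=float)
    if len(samples) < frame_size:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_size)
    bin_freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    # Skip the DC bin: a frequency of 0 Hz has no defined pitch.
    valid = bin_freqs > 0
    # Map each FFT bin to a MIDI note number, then fold it into one octave.
    midi_notes = np.round(69 + 12 * np.log2(bin_freqs[valid] / a4_hz)).astype(int)
    pitch_classes = midi_notes % 12
    n_frames = 1 + (len(samples) - frame_size) // hop_size
    chroma = np.zeros((n_frames, 12))
    for r in range(n_frames):
        start = r * hop_size
        frame = samples[start:start + frame_size] * window
        power = np.abs(np.fft.rfft(frame))[valid] ** 2
        # Sum the power of all FFT bins that share a pitch class.
        chroma[r] = np.bincount(pitch_classes, weights=power, minlength=12)
    return chroma
```

For a pure 440 Hz tone, the energy concentrates in bin 9 (pitch class A) in every frame.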

Chromagram Basics: Most music is based on the tonality system. Tonality arranges sounds according to pitch relationships into interdependent spatial and temporal structures. Characterizing chords, keys, melody, motifs and even form largely depends on these structures. The pitch helix is a representation of pitch relationships that places tones on the surface of a cylinder. It models the special relationship that exists between octave intervals. Chroma represents the inherent circularity of pitch organization: it describes the angle of pitch rotation as it traverses the helix. Two octave-related pitches share the same angle in the chroma circle, a relation that is not captured by a linear pitch scale (or even Mel). For the analysis of tonal music we quantize this angle into 12 positions or pitch classes.
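As a small illustration of this quantization, the sketch below maps a frequency to one of the 12 pitch classes. It assumes equal temperament with A4 tuned to 440 Hz and the usual MIDI note numbering (A4 = 69); the function name `pitch_class` is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz, a4_hz=440.0):
    """Return the pitch class index (0 = C, ..., 11 = B) of a frequency."""
    # Distance from A4 in semitones, rounded to the nearest note.
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    # Folding by 12 discards the octave and keeps only the chroma.
    return midi_note % 12


for freq in (220.0, 440.0, 880.0):
    print(freq, NOTE_NAMES[pitch_class(freq)])
```

The three frequencies are one octave apart, so they all fall into the same pitch class, A.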
[Figure: Chroma can be thought of as a pitch of frequencies]
[Figure: Chroma filter bank]

Chroma Centroid: It is the weighted mean of the chroma bins, where the weights are the energies associated with the chroma bins. The centroid is a measure of the spectral shape, and higher centroid values correspond to brighter textures with more high frequencies.

$$ \text{Centroid} = \frac{\sum_{k=0}^{11} k|C(k)|}{\sum_{k=0}^{11} |C(k)|} $$
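A minimal NumPy sketch of this formula for a single frame is shown below. The function name `chroma_centroid` is hypothetical, and the sketch assumes the frame has non-zero total energy (a silent frame would divide by zero).

```python
import numpy as np


def chroma_centroid(chroma_frame):
    """Energy-weighted mean bin index of one chroma frame."""
    magnitudes = np.abs(np.asarray(chroma_frame, dtype=float))
    bins = np.arange(len(magnitudes))
    return float(np.sum(bins * magnitudes) / np.sum(magnitudes))
```

A frame with all its energy in bin 3 has a centroid of 3, while a frame with equal energy in all 12 bins has a centroid of 5.5.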

Chroma Spread: It gives an idea of the shape of the chroma distribution in each frame. It is calculated as follows.

$$ \text{Spread} = \frac{\sum_{k=0}^{11} (k-\text{Centroid})^2 |C(k)|}{\sum_{k=0}^{11} |C(k)|} $$
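The spread can be sketched in the same way as an energy-weighted variance of the bin index around the centroid. The function name `chroma_spread` is hypothetical, and a frame with non-zero total energy is assumed.

```python
import numpy as np


def chroma_spread(chroma_frame):
    """Energy-weighted variance of the bin index around the chroma centroid."""
    energy = np.abs(np.asarray(chroma_frame, dtype=float))
    bins = np.arange(len(energy))
    total = np.sum(energy)
    centroid = np.sum(bins * energy) / total
    return float(np.sum((bins - centroid) ** 2 * energy) / total)
```

A frame with all energy in one bin has zero spread; equal energy in bins 2 and 4 gives a centroid of 3 and a spread of 1.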

Max Chroma: The Chroma bin associated with maximum energy.

Min Chroma: It is the least energy chroma bin.
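Both features reduce to an index lookup per frame, as in this short sketch (the function name `max_min_chroma` is hypothetical; when several bins tie, the lowest index is returned).

```python
import numpy as np


def max_min_chroma(chroma_frame):
    """Return (index of the highest-energy bin, index of the lowest-energy bin)."""
    energy = np.abs(np.asarray(chroma_frame, dtype=float))
    return int(np.argmax(energy)), int(np.argmin(energy))
```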

Chroma Flux: It shows how the chroma distribution varies across frames. It is the chroma counterpart of spectral flux, which is defined as the squared difference between the normalized magnitudes of successive spectral distributions that correspond to successive signal frames. Here it is calculated as the sum of absolute differences between successive chroma vectors, as follows.

$$ Flux_r = \sum_{k=0}^{11} |(C_r(k)-C_{r-1}(k))|$$
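A sketch of this formula over a whole chromagram is given below, assuming the chromagram is stored as an array of shape (number of frames, 12). The function name `chroma_flux` is hypothetical.

```python
import numpy as np


def chroma_flux(chromagram):
    """Sum of absolute bin-wise differences between consecutive chroma frames.

    The result has one value per frame transition, so it is one element
    shorter than the number of frames.
    """
    chroma = np.asarray(chromagram, dtype=float)
    return np.sum(np.abs(np.diff(chroma, axis=0)), axis=1)
```

If the energy moves entirely from one bin to another between two frames, the flux for that transition is 2 (for unit energy); if nothing changes, it is 0.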

2.3 Classification
Two different classifiers, namely SVM and Multi-Layer Perceptron, were applied to the feature vectors extracted in the previous step. For the SVM classifier, gamma was varied to obtain different levels of fit. For the Multi-Layer Perceptron, the following 3 hidden-layer configurations were used: 8-8, 12-8-6 and 3-3.
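The report does not state which library implemented these classifiers. As one possible sketch, the setup could look like this with scikit-learn; the choice of scikit-learn, the RBF kernel, `max_iter`, `random_state` and the helper names `build_classifiers` and `evaluate` are all assumptions made for illustration.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hidden-layer configurations listed in the text above.
HIDDEN_LAYER_CONFIGS = [(8, 8), (12, 8, 6), (3, 3)]


def build_classifiers(gamma="scale", random_state=0):
    """Return a dict of untrained classifiers keyed by a descriptive name."""
    # gamma controls how tightly the RBF kernel fits the training points.
    classifiers = {"svm": SVC(kernel="rbf", gamma=gamma)}
    for layers in HIDDEN_LAYER_CONFIGS:
        name = "mlp_" + "_".join(str(n) for n in layers)
        classifiers[name] = MLPClassifier(
            hidden_layer_sizes=layers, max_iter=2000, random_state=random_state
        )
    return classifiers


def evaluate(classifiers, train_x, train_y, test_x, test_y):
    """Fit each classifier and return its accuracy on the held-out set."""
    return {
        name: clf.fit(train_x, train_y).score(test_x, test_y)
        for name, clf in classifiers.items()
    }
```

Calling `build_classifiers` with several values of `gamma` and comparing training and held-out accuracy is one way to observe the under- and over-fitting behaviour mentioned above.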
3. Experiments & Results
3.1 Dataset Description
The GTZAN dataset is considered a standard dataset for music genre classification. It was created and used by G. Tzanetakis and P. Cook, who were pioneers in the music genre classification problem.[1] The dataset consists of 400 audio tracks, each 30 seconds long. It contains 4 genres, each represented by 100 tracks. The tracks are all 22050 Hz mono 16-bit audio files in .au format.
3.2 Discussion
Inspired by features like spectral spread and centroid, which are generally used to analyse speech, we tried extracting the parallel features in the chromagram domain.

In the initial attempt, we constructed the feature vectors on which the classifier was trained from the following features:
  1. Chroma Centroid
  2. Chroma Spread
  3. Min chroma
  4. Max chroma
  5. Chroma flux
These 5 features were augmented with the 12 chroma coefficients obtained directly from the chromagram, making a 17-dimensional feature vector.

Not satisfied with the classification accuracy, we widened our window of analysis by considering 10 frames at a time with a hop length of 5 frames. We computed the mean of each chroma coefficient and its variance over the 10 frames, thereby observing the chroma distribution over time. This gave us a 24-dimensional feature vector, which improved the results.
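The windowed aggregation described above can be sketched as follows, assuming the chromagram is an array of shape (number of frames, 12). The function name `windowed_chroma_stats` is hypothetical; the window of 10 frames and hop of 5 frames follow the text.

```python
import numpy as np


def windowed_chroma_stats(chromagram, window_frames=10, hop_frames=5):
    """Return per-window features: 12 chroma means followed by 12 variances."""
    chroma = np.asarray(chromagram, dtype=float)
    n_windows = 1 + (len(chroma) - window_frames) // hop_frames
    if n_windows < 1:
        raise ValueError("chromagram is shorter than one window")
    features = np.empty((n_windows, 2 * chroma.shape[1]))
    for i in range(n_windows):
        start = i * hop_frames
        block = chroma[start:start + window_frames]
        # Mean and variance are taken over time, separately for each chroma bin.
        features[i] = np.concatenate([block.mean(axis=0), block.var(axis=0)])
    return features
```

A chromagram of 25 frames yields 4 overlapping windows, each described by a 24-dimensional vector.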
3.3 Project Code
All experiment code for our project can be found in this GitHub repository.
3.4 Results
| Classes | Features | Frame Size (No. of Frames) | Classifier | Accuracy |
|---|---|---|---|---|
| Rock, Classical, Jazz, Metal | Cen, Var, Max, Min, Flux | 0.046s (2048) | SVM | 38.8% |
| Rock, Classical, Jazz, Metal | 12 Chroma Coefficients | 0.046s (2048) | SVM | 42.5% |
| Rock, Classical, Jazz, Metal | Cen, Var, Max, Min, Flux, 12 Chroma Coefficients | 0.046s (2048) | SVM | 43.5% |
| Rock, Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | SVM | 46.5% |
| Rock, Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | SVM | 54.5% |
| Rock, Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | SVM Overfit (Gamma = 5) | 55.4% |
| Rock, Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | MLP(12,8,6) | 57% |
| Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | MLP(8,8) | 68.5% |
| Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | SVM | 68.6% |
| Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | MLP(6,3,3) | 68.8% |
| Classical, Jazz, Metal | Mean and Variance of 12 Chroma Coefficients | 0.232s (4096) | MLP(12,8,6) | 70.1% |
4. Conclusions
4.1 Summary
This project aimed to tackle the problem of automatic music genre classification based on various features. We first pre-processed the data, then extracted and selected features, and finally performed classification. We focussed on chroma-based features alone, as these act as a good proxy for human perception of music. Through feature analysis and classification, a maximum accuracy of 70% was obtained.
4.2 Future Extensions
In future, we plan to include more features and improve our classification algorithm to raise overall performance. We also plan to broaden the range of genres used.
4.3 Applications
Apart from the most generic use of classifying huge chunks of data, this classifier can also be used for the following applications:
  1. Developing an automatic genre-based disco lights system.
  2. Automatic Equaliser.
  3. Emotion-mapped music player.
5. References

[1] George Tzanetakis, Georg Essl and Perry Cook, Automatic Musical Genre Classification of Audio Signals.

[2] Dan-Ning Jiang, Lie Lu, Hong-Jiang Zhang, Jian-Hua Tao and Lian-Hong Cai, Music type classification by Spectral Contrast feature.

[3] Hariharan Subramanian, Audio Signal Classification.

[4] Michael Haggblade, Yang Hong and Kenny Kao, Music Genre Classification.

[5] Meinard Müller and Sebastian Ewert, Chroma Toolbox: MATLAB Implementations for extracting variants of Chroma-based audio features.

[6] Martin Vetterli, A Theory of Multirate Filter Banks.

[7] Carol L. Krumhansl and Lola L. Cuddy, A Theory of Tonal Hierarchies in Music.

[8] Juan Pablo Bello, Chroma and tonality, Music Information Retrieval.

[9] Fredric Patin, Beat Detection Algorithms.

[10] Justin Jonathan Salamon, Chroma-based Predominant Melody and Bass Line Extraction from Music Audio Signals.