Mode Prediction
For the purpose of this paper, we looked at only the two most common modes in Western music, the major and minor modes. These are also the only modes analyzed by The Million Song Dataset and Spotify's Web API. The major and minor
modes are a part of the “Diatonic Collection,” which refers to
“any scale [or mode] where the octave is divided evenly into seven
steps”
[Laitz 2003]. A step can be either a whole or half step (whole tone or
semitone) and the way that these are arranged in order to divide the octave will determine
if the mode is major or minor. A major scale consists of the pattern
W-W-H-W-W-W-H and the minor scale consists of
W-H-W-W-H-W-W
[Laitz 2003].
Figure 1 shows a major scale
starting on the pitch “C” and
Figure 2 shows two types
of minor scales starting on “C”. The seventh “step” in the
harmonic minor scale example is raised in order to create a “leading
tone.” The leading tone occurs when the seventh scale degree is a half step away
from the first scale degree, also called the “tonic.” This
leading-tone-to-tonic relationship will become an important music theory principle that we use to train our AI models more accurately than previously published attempts.
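To make the step patterns concrete, the following minimal Python sketch (illustrative only; the pitch-class numbering from 0 to 11 starting at “C” matches the chroma-vector layout discussed later in this section) builds the major and natural minor scales from their whole/half step patterns.

```python
# Illustrative sketch: build the pitch classes of the C major and C natural minor
# scales from their whole/half step patterns. Pitch classes are numbered 0-11
# starting from C, in the same order as a chroma vector.

MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]          # W-W-H-W-W-W-H, in semitones
NATURAL_MINOR_STEPS = [2, 1, 2, 2, 1, 2, 2]  # W-H-W-W-H-W-W

def scale_pitch_classes(tonic, steps):
    """Return the seven pitch classes of a scale built from a step pattern."""
    pcs = [tonic % 12]
    for step in steps[:-1]:        # the final step simply returns to the octave
        pcs.append((pcs[-1] + step) % 12)
    return pcs

print(scale_pitch_classes(0, MAJOR_STEPS))          # C major: [0, 2, 4, 5, 7, 9, 11]
print(scale_pitch_classes(0, NATURAL_MINOR_STEPS))  # C natural minor: [0, 2, 3, 5, 7, 8, 10]
```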
Many previous papers which use supervised learning to determine mode or key test only
against songs from specific genres or styles, and few make attempts at predicting mode
regardless of genre. Even the often-cited yearly competition on musical key detection
hosted by Music Information Retrieval Evaluation eXchange (MIREX) has participants'
algorithms compete at classifying 1252 classical music pieces [İzmirli 2006] [Pauws 2004]. However, if we look again at
Figure 1 and Figure 2, we can see that mode is not exclusive to genre or style; it is simply a specific arrangement of whole and half steps.
So for a supervised learner programmed to “think” like a musician and
thus determine mode based on its understanding of these music theory principles, genre or
style should not affect the outcome. While this might work in a perfect world, artists
have always looked for ways to “break away” from the norm and this can
indeed manifest itself more in certain genres than others. Taking this into consideration,
in this research we selected songs from various genres for our separate ground truth set, choosing only those that obey exact specifications for what constitutes major or minor.
This ground truth set will be a separate list of 100 songs labeled by us to further check
the accuracy of our AI algorithms during testing. We wish to discuss shortcomings in the
accuracy of past research that uses AI algorithms for predicting major or minor mode
rather than to suggest a universal method for identifying all modes and scales.
This is one aspect where our research differs from previous papers. An AI system which
incorporates a solid understanding of the rules of music theory pertaining to mode should
be able to outperform others that do not incorporate such understanding or those that
focus on specific genres. While certain genres or styles may increase the difficulty of
algorithmically determining mode, the same is true for a human musical analyst. When
successful, an AI algorithm for determining mode will process data much faster than a
musician, who would either have to look through the score or figure it out by ear in order
to make their decision. When parsing music-related big data, speed and accuracy are both imperative. Thus we suggest the following framework (Figure 3) by which a supervised learner can be trained to make predictions exclusively
from pitch data in order to determine the mode of a song. The process is akin to methods
used by human musical analysts. Below we also outline other areas where we apply a more
musician-like approach to our methods to achieve greater accuracy.
As can be seen in Figure 3, we categorize any scale or mode that does not meet the exact specifications of major or minor as other, ambiguous or nontonal (OANs). The primary reason that past
research has trained supervised learners on only one specific genre or style is to avoid
OANs. When OANs are not segregated from major or minor modes, they are fit improperly,
leading to misclassifications.
Other pertains to any scale separate from major or minor that still contains
a tonal center. Some examples of this are: modes (except Ionian or Aeolian), whole tone
scale, pentatonic scale, and non-Western scales. Nontonal refers to any song
which does not center around a given pitch. A common occurrence of this can be found in
various examples of “12-tone music,” where each pitch is given
equal importance in the song and thus no tonic can be derived from it.
Where our paper differs from previous work is the handling of songs related to the
outcome
ambiguous. This occurs when either major or minor can be seen as an
equally correct determination from the given pitches in a song. This most often occurs
when chords containing the leading tone are avoided (Figure 4) and thus the remaining chords are consistent with both the major key and its relative minor (Figure 4 & Figure 5). The leading tone is a “tendency tone” or a tone
that pulls towards a given target, in this case the tonic. This specific pull is one that
can often signify a given mode and is therefore avoided in songs in which the composer wishes to float between the two modes. This can also be accomplished by simply using the relative
minor's natural minor scale. Since the natural minor scale does not raise any pitches, it
actually has the exact same notes (and resultant triads) as its relative major scale.
Figure 6 gives an example from the well-known pop song Despacito [Fonsi and Yankee 2017]; given these rules, what mode can we say the song is in?
This tough choice is a common occurrence, especially in modern pop music, which can
explain why papers that focused heavily on dance music might have had accuracy issues.
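As a brief illustration of why such cases are ambiguous, the sketch below (illustrative only) checks that C major and its relative, A natural minor, contain exactly the same pitch classes, so pitch content alone cannot separate them when the raised leading tone never appears.

```python
# Illustrative check: a relative major / natural minor pair shares every pitch class.
# Pitch classes are numbered 0-11 from C (C=0, C#=1, ..., B=11).

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}          # C D E F G A B
A_NATURAL_MINOR = {9, 11, 0, 2, 4, 5, 7}  # A B C D E F G

print(C_MAJOR == A_NATURAL_MINOR)  # True: identical pitch-class content
```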
Other authors have noted that their AI algorithms can mistake the major scale for its
relative natural minor scale during prediction, and it is likely that their algorithms did
not account for the raised leading tone to truly distinguish major from minor. Since we
focused on achieving higher accuracy than existing major vs minor mode prediction AI
algorithms by incorporating music theory principles, we removed any instances of songs
with an ambiguous mode from our ground truth set in order to get a clearer picture of how
our system compares with the existing models. Adding other mode outcomes in order to
detect OANs algorithmically is a part of our ongoing and future research.
The most popular method of turning pitch data into something that can be used to train
machine learners comes in the form of “chroma features.” Chroma feature
data is available for every song found in Spotify's giant database of songs through the
use of their
Web API. Chroma features are vectors containing 12 values (coded
as real numbers between 0 and 1) reflecting the relative dominance of all 12 pitches over
the course of a small segment, usually lasting no longer than a second [
Jehan 2014]. Each vector begins on the pitch “C” and continues in
chromatic order (C#, D, D#, etc.) until all 12 pitches are reached. In order to create an
AI model that could make predictions on a wide variety of musical styles, we collected the
chroma features for approximately 100,000 songs over the last 40 years. Spotify's
Web API offers its data at different temporal resolutions, from the
aforementioned short segments through sections of the work to the track as a whole.
Beyond chroma features, the API offers Spotify's own algorithmic analysis of musical
features such as mode within these temporal units, and provides a corresponding level of
confidence for each (coded as a real number between 0 and 1). We used Spotify's mode
confidence levels to find every section within our 100,000 song list which had a mode
confidence level of 0.6 or higher. The API's documentation states that “
confidence indicates the reliability of its corresponding
attribute... elements carrying a small confidence value should be considered
speculative... there may not be sufficient data in the audio to compute the element with
high certainty” [Jehan 2014], thus giving good reason to remove sections with lower
confidence levels from the dataset. Previous work such as Serrà et al. [Serrà et al. 2012b], Finley and Razi [
Finley and Razi 2019] and Mahieu [
Mahieu 2017] also used confidence thresholds, but at the temporal resolution
of a whole track rather than the sections that we used. By analyzing at the level of
sections, we were able to more than triple our training samples, from 100,000 songs to approximately
350,000 sections. Not only did this method increase the number of potential training
samples, but it allowed us to focus on specific areas of each song that were more likely
to provide an accurate representation of each mode as they appeared. For example, a
classical piece of music in “sonata form” will undergo a
“development” section whereby it passes through contrasting keys and
modes to build tension before its final resolution to the home key, mode and initial
material. Pop music employs a similar tactic with “the bridge,” a
section found after the midway point of a song to add contrast to the musical material heard up until that point. Both of these contrasting sections might add confusion during
the training process if the song is analyzed as a whole, but removing them or analyzing
them separately gives the program more accurate training samples. Deriving additional samples from the original list of songs also provides more data with which to train a supervised learner.
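A minimal sketch of this collection step is shown below, assuming the third-party spotipy client with placeholder credentials and track IDs; the field names (sections, segments, mode_confidence, pitches) are those returned by the Web API's audio-analysis endpoint, and the 0.6 cutoff is the one described above.

```python
# Minimal sketch (assumes the spotipy client and placeholder credentials) of the
# section-level collection described above: fetch Spotify's audio analysis for a
# track, keep only sections with mode_confidence >= 0.6, and gather the chroma
# ("pitches") vectors of the segments falling inside each retained section.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

def confident_sections(track_id, threshold=0.6):
    """Return (section, list-of-segment-chroma-vectors) pairs for one track."""
    analysis = sp.audio_analysis(track_id)
    kept = []
    for section in analysis["sections"]:
        if section["mode_confidence"] < threshold:
            continue
        start = section["start"]
        end = start + section["duration"]
        chroma = [seg["pitches"] for seg in analysis["segments"]
                  if start <= seg["start"] < end]
        kept.append((section, chroma))
    return kept
```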
In previous work, a central tendency vector was created by taking the mean of each of the
12 components of the chroma vectors for a whole track, and this was then labelled as
either major or minor for training. In an effort to mitigate the effects of noise on our
averaged vector in any given recording, we found that using the medians rather than means
gave us a better representation of the actual pitch content unaffected by potential
outliers in the data. One common source of such outliers is percussive instruments, such as a drum kit's cymbals, which happen to emphasize pitches that are “undesirable”
for determining the song's particular key or mode. If that cymbal hit only occurs every
few bars of music, but the “desirable” pitches occur much more often,
we can lessen the effect that cymbal hit will have on our outcome by using a robust
estimator. A musician working with a score or by ear would also filter out any unwanted
sounds that did not help in mode determination. We therefore took the element-wise median of the segment chroma feature vectors within each of our 350,000 sections.
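The sketch below shows this robust-estimation step, assuming each section's segment chroma vectors are available as an n x 12 array (variable names are illustrative).

```python
# Minimal sketch: element-wise median of a section's segment chroma vectors.
# Occasional outliers (e.g., cymbal hits emphasizing "undesirable" pitches) pull a
# mean noticeably but have little effect on a median.
import numpy as np

def section_chroma(segment_chroma_vectors):
    """Median of each of the 12 chroma components over a section's segments."""
    chroma = np.asarray(segment_chroma_vectors)   # shape: (n_segments, 12)
    return np.median(chroma, axis=0)              # one 12-dimensional vector
```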
The last step in the preparation process is to transpose the chroma vectors so that they are all aligned to the same key. As our neural network (NN) will only output
predictions of major or minor, we want to have the exact same tonal center for each chroma
vector to easily compare between their whole and half step patterns (
Figure 3). We based our transposition method on the one
described and used by Serrà et al. in their 2012 work [Serrà et al. 2012b].
This method determines an “optimal transposition index” (OTI) by
creating a new vector of the dot products between a chroma vector reflecting the key they
wish to transpose to and the twelve possible rotations (i.e., 12 possible keys) of a
second chroma vector. Using a right circular shift operation, the chroma vector is rotated by one half step at a time until all 12 rotations have been produced. Argmax, a function which
returns the position of the highest value in a vector, provides the OTI from the list of
dot product correlation values, thus returning the number of steps needed to transpose the
key of one chroma vector to match another (see Appendix 1.1 for a more detailed formula).
Our method differs slightly from that of Serrà et al.: since our vectors are all normalized, we
used cosine similarity instead of the related dot product.
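A minimal sketch of this transposition step, assuming 12-dimensional NumPy chroma vectors (function names are illustrative), is given below; np.roll performs the right circular shift and cosine similarity stands in for the plain dot product.

```python
# Minimal sketch of the OTI computation described above (after Serrà et al.),
# using cosine similarity over the 12 possible rotations of a chroma vector.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def optimal_transposition_index(chroma, reference):
    """Number of right circular shifts (half steps) that best aligns `chroma` with `reference`."""
    similarities = [cosine(reference, np.roll(chroma, shift)) for shift in range(12)]
    return int(np.argmax(similarities))

def transpose(chroma, oti):
    """Apply the right circular shift given by the OTI."""
    return np.roll(chroma, oti)
```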
In order to train a neural network for mode prediction, some previous studies used the
mode labels from the Spotify
Web API for whole tracks or for sections of
tracks. When we checked these measures against our own separate ground truth set (analyzed
by Lupker), we discovered that the automated mode labeling was relatively inaccurate
(Table 1). Instead we adapted the less complex method of Finley & Razi [
Finley and Razi 2019], which reduced the need for training NNs. They compared chroma
vectors to “KK-profiles” to distinguish mode and other musical elements. Krumhansl and Kessler profiles (Figure 7) come from a study where
human subjects were asked to rate how well each of the twelve chromatic notes fit within a
key after hearing musical elements such as scales, chords or cadences [
Krumhansl and Kessler 1982]. The resulting vector can be normalized to the range 0-1 for direct comparison to chroma vectors using similarity measures. By incorporating
both modified chroma transpositions and KK-profile similarity tests, we were able to label
our training data in a novel way.
To combine these two approaches, we first rewrite Serrà et al.'s formula (Appendix 1.1) to
incorporate Finley and Razi's method by making both KK-profile vectors (for major and
minor modes) the new 'desired vectors' by which we will transpose our chroma vector set.
This will eventually transpose our entire set of vectors to C major and C minor since the
tonic of the KK-profiles is the first value in the vector, or pitch “C”. Correlations
between KK-profiles and each of the 12 possible rotations of any given chroma vector are
determined using cosine similarity. Instead of using the function which would return the
position of the vector rotation that has the highest correlation (argmax), we use a
different function which tells us what that correlation value is (amax). Two new lists are
created. One is a list of the highest possible correlations for each transposed chroma
vector and the major KK-profile, while the other is a list of correlations between each
transposed chroma vector and the minor KK-profile. Finally, to determine the mode of each chroma vector, we simply use a function to find the position of the higher correlation value between these two lists: position 0 for major and 1 for minor (see
Appendixes 1.2.1 & 1.2.2).
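The labelling procedure just described can be sketched as follows, assuming NumPy and KK-profile vectors already normalized to 0-1 (names are illustrative); taking the maximum similarity plays the role of amax, and the final argmax returns 0 for major and 1 for minor, as in the text.

```python
# Minimal sketch of the labelling step: the highest cosine similarity over all 12
# rotations is computed against each KK-profile, and the profile with the larger
# value supplies the label (0 = major, 1 = minor).
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def best_similarity(chroma, profile):
    """Maximum cosine similarity over the 12 possible rotations of the chroma vector."""
    return max(cosine(profile, np.roll(chroma, shift)) for shift in range(12))

def label_mode(chroma, kk_major, kk_minor):
    """Return (label, (major_corr, minor_corr)), with label 0 = major and 1 = minor."""
    corrs = [best_similarity(chroma, kk_major), best_similarity(chroma, kk_minor)]
    return int(np.argmax(corrs)), tuple(corrs)
```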
As noted by Finley & Razi, the most common issue affecting accuracy levels for
supervised or unsupervised machine learners attempting to detect the mode or key is “being off by a perfect musical interval of a fifth from the true key,
relative mode errors or parallel mode errors”
[Finley and Razi 2019]. Unlike papers which followed the MIREX competition rules, our
algorithm does not give partial credit to miscalculations no matter how closely related
they may be to the true mode or key. Instead we offer methods to reduce these errors. To
attempt to correct these issues for mode detection, it is necessary to address the
potential differences between a result from music psychology, like the KK-profiles, and
the music theoretic concepts that they quantify. As we mentioned earlier, the leading tone
in a scale is one of the most important signifiers of mode. In the
Despacito
example, where the leading tone is avoided, it is hard to determine major or minor mode.
In the (empirically determined) KK-profiles, the leading tone seems to be ranked low relative to the importance it holds theoretically. If the pitches are ordered from greatest to lowest perceived importance, the leading tone does not even register in the top five in either KK-profile. This might be a consequence of the study
design, which asked subjects to judge how well each note seemed to fit after hearing other
musical elements played.
The distance from the tonic to the leading tone is a major seventh interval (11
semitones). Different intervals fall into groups known as consonant or dissonant. Laitz
defines consonant intervals as “stable intervals… such as the unison,
the third, the fifth (perfect only)” and dissonant intervals as “unstable intervals… [including] the second, the seventh, and all
diminished and augmented intervals”
[Laitz 2003]. More dissonant intervals are perceived as having more tension.
Rather than separating intervals into categories of consonant and dissonant, Hindemith
ranks them on a spectrum, which represents their properties more effectively. He ranks the
octave as the “most perfect,” the major seventh as the “least
perfect” and all intervals in between as “decreasing in
euphony in proportion to their distance from the octave and their proximity to the major
seventh”
[Hindemith 1984]. While determining the best method of interval ranking is beyond the scope of this paper, both theorists identify the major seventh as one of the most
dissonant intervals. Thus, if the leading tone were to be played by itself (that is,
without the context of a chord after a musical sequence) it might sound off, unstable or
tense due to the dissonance of a major seventh interval in relation to the tonic. In a
song's chord progression or melody, this note will often be given context by its chordal
accompaniment or the note might be resolved by subsequent notes. These methods and others
will 'handle the dissonance' and make the leading tone sound less out of place. We
concluded that the leading tone value found within the empirical KK-profiles should be
boosted to reflect its importance in a chord progression in the major or minor mode. Our
tests showed that boosting the original major KK-profile's 12th value from 2.88 to 3.7 and
the original minor KK-profile's 12th value from 3.17 to 4.1 increased the accuracy of the
model at determining the correct mode by removing all instances where misclassifications
were made between relative major and minor keys.
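The sketch below illustrates this adjustment. The unboosted values are the standard Krumhansl and Kessler ratings; only the 12th (leading-tone) value is replaced with the boosted values reported above, and the min-max scaling to 0-1 is our assumption about how the profiles are normalized for comparison with chroma vectors.

```python
# Minimal sketch of the leading-tone boost. Profile values are the standard
# Krumhansl-Kessler ratings; only the 12th value (the leading tone) is changed.
# Min-max scaling to 0-1 is an assumption; the text states only that the
# profiles are normalized to 0-1 before comparison with chroma vectors.
import numpy as np

KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def boosted(profile, leading_tone_value):
    """Copy a KK-profile, boost its leading-tone (12th) value, and scale to 0-1."""
    p = profile.astype(float).copy()
    p[11] = leading_tone_value
    return (p - p.min()) / (p.max() - p.min())

kk_major_boosted = boosted(KK_MAJOR, 3.7)
kk_minor_boosted = boosted(KK_MINOR, 4.1)
```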
Our training samples include a list of mode determinations labelling our 350,000 chroma
vectors. However, the algorithm assumes that every vector is in a major or minor mode with
no consideration for OANs. Trying to categorize every vector as either major or minor
leads to highly inaccurate results during testing, and seems to be a main cause of
miscalculations made by the mode prediction algorithms of Spotify's Web API
and the Million Song Dataset. To account for other or nontonal scales, we can set a threshold separating acceptable correlation values (major and minor modes) from unacceptable values (other or nontonal scales). Our testing showed that a threshold of greater than or
equal to 0.9 gave the best accuracy on our ground truth set for determining major or minor
modes. These unacceptable vectors contain other or nontonal scales; future research will determine ways of addressing and classifying them further.
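A minimal sketch of this filter (function name illustrative): a vector proceeds to the major/minor labelling step only when at least one of its profile correlations reaches the 0.9 threshold.

```python
# Minimal sketch: flag a chroma vector as other/nontonal (OAN) when neither
# KK-profile correlation reaches the 0.9 acceptance threshold found in testing.
def is_oan(major_corr, minor_corr, threshold=0.9):
    return max(major_corr, minor_corr) < threshold
```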
To address ambiguous mode determinations between relative major and minor
modes, we can set another threshold for removing potentially misleading data for training
samples. While observing the correlation values used to determine major or minor labels, we set a further constraint for when these values are too close to choose between confidently. If the absolute difference between the two values is less than or equal to
0.02, we determine these correlation values to be indistinguishable and thus likely to
reflect an ambiguous mode. As mentioned earlier, this is likely due to the song's chord progression avoiding certain mode-determining factors such as the leading tone, and therefore the song can fit almost equally well into either the major or minor classification.
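This constraint can be sketched in the same way (function name illustrative): vectors whose major and minor correlations lie within 0.02 of one another are treated as ambiguous and withheld from the training labels.

```python
# Minimal sketch: treat a vector as ambiguous when its major and minor
# correlations are within 0.02 of each other, and exclude it from training.
def is_ambiguous(major_corr, minor_corr, tolerance=0.02):
    return abs(major_corr - minor_corr) <= tolerance
```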