Contents
1. Introduction to the Topic
1.1. Previous Research
1.2. Hypotheses and Methods
2. Theory
2.1. Music Information Retrieval
2.2. Natural Language Processing
2.2.1. Lemmatization
2.2.2. Part-of-Speech Tagging
2.3. Text Data Mining
2.3.1. N-grams
2.3.2. Term Frequency-Inverse Document Frequency
2.4. Topic Modeling Using Latent Dirichlet Allocation
2.4.1. Model Tuning
2.4.2. Model Evaluation
2.5. Similarity Measures
2.5.1. Jensen-Shannon Divergence
2.5.2. Hellinger Distance
2.5.3. Log Ratio
3. Data
3.1. Data Selection and Web Scraping
3.2. Data Pre-Processing
3.3. The Final Data Set
4. Analyses
4.1. Text Statistics
4.1.1. Comparison of Text Statistics
4.1.2. Comparison of Word Use
4.2. Text Features
4.2.1. Term Frequency-Inverse Document Frequency in Application
4.2.2. Part-of-Speech Tagging in Application
4.2.3. N-grams in Application
4.2.4. Conclusions about Text Statistics and Text Features
4.3. LDA Modeling
4.3.1. Parameter Tuning
4.3.2. Model Evaluation
4.3.3. Topic Similarity Within Models
4.3.4. Topic Similarity Between Models
4.3.5. Conclusions about LDA Modeling and Similarity Measures
5. Findings and Prospects
5.1. Findings Compared to Previous Research
5.2. Need for Improvement and Future Applications
Appendix
References
List of Figures
1. Distribution of lexical density per song by genre
2. Word use between pop and rock lyrics according to term frequencies
3. Word use between hip-hop and pop lyrics according to term frequencies
4. Word use between pop and rock lyrics according to log odds ratio
5. Word use between rock and hip-hop lyrics according to log odds ratio
6. Top 15 most important words by genre
7. Distribution of repetitions per song by genre
8. Distribution of word count per song by genre
9. Distribution of word length by genre
10. Distribution of lexical diversity per song by genre
11. Distribution of word count per song by genre and decade
12. Distribution of lexical diversity per song by genre and decade
13. Distribution of lexical density per song by genre and decade
14. Top 15 most used words by genre
15. Distribution of repetitions per song by genre and decade
16. “FindTopicsNumber” plot with default settings
17. “FindTopicsNumber” plot with α = 7.5, η = default
18. Top 15 most common words per topic
List of Tables
1. Average total word count per song by genre and decade
2. Average lexical density per song by genre and decade in %
3. Average lexical diversity per song by genre and decade
4. Average repetitiveness per song by genre and decade in %
5. Preferred number of topics K by “FindTopicsNumber” output
6. Preferred number of topics K by maximum log likelihood
7. Preferred number of topics K by perplexity
8. Various settings for LDA
9. Similarity measures within unbalanced genres
10. Topic labels within unbalanced genres
11. Similarity measures within unbalanced decades
12. Topic labels within unbalanced decades
13. Similarity measures between genres
14. Similarity measures between decades
1. Introduction to the Topic
While this thesis aims at the analysis of English music lyrics, an anecdote about the German music business seems fitting to introduce the subject, since it partly targets the similarity of song lyrics, which will be a main focus here. The four most used topics in popular German music, according to satirist Jan Böhmermann, are “Menschen, Leben, Tanzen, Welt” (“people, life, dancing, world”). This is also the title of a song “composed” by chimpanzees, using lines from German pop songs, tweets by popular influencers, advertising slogans, and proverbs (cf. Böhmermann 2017). Böhmermann performed it to criticize the music business in Germany, claiming that the lyrics of popular songs are all very similar, superficial, and sound like advertising (cf. ibid., Rohleder 2018). Upon its commercial release, the song entered the top 10 of the German single charts, which, according to Stern magazine, was symptomatic of the current state of popular German music (cf. Stern 2017).

Apart from this parody, there are actual analyses and scientific studies which seek to prove the deterioration of music lyrics. Bagot and Scott, for example, analyze about 6,000 songs by top-selling UK artists, trying to determine the sophistication of their lyrics. Applying readability scores, which are used to identify the level of difficulty of school literature, they quantify how demanding the songs are. For instance, they find that 10.3 years of school education are necessary to understand the average song by Depeche Mode, which makes them the artists with the most sophisticated lyrics among those examined. In contrast, Led Zeppelin's lyrics only require 4.4 years of education to be understood (cf. Bagot/Scott 2016). Another study is conducted by Powell-Morse, who uses 225 songs of four genres to find out whether lyrical intelligence is decreasing. He compares songs that were at the top of the Billboard charts for at least three weeks over a time span of ten years until 2014. His primary findings are that, on average, a hit song could be comprehended by a 3rd grader, and that this level declined over the past decade. Country lyrics, according to his research, were the smartest, while hip-hop used the least intelligent language (cf. Powell-Morse 2015). Further examples of the analysis of song lyrics will be named in the upcoming subsection.
Partly motivated by the above, in this thesis the similarity, the complexity, and the evolution of English song lyrics over the past five decades will be examined with the help of statistical methods. Hence, the central research question of this thesis is: Can information gained by Natural Language Processing and statistical topic modeling be used to determine whether and to what extent song lyrics of various genres changed over the course of the past 50 years?
Based on this, the goals of this thesis are:
- determining how similar songs of five diverse genres (alternative, country, pop, rock, and hip-hop) are, as measured by text statistics and text features that are composed by Natural Language Processing (NLP) and text mining methods.
- using these methods as well in an attempt to find out whether song lyrics are becoming less complex and therefore less sophisticated.
- and, as the main target this thesis sets for itself, computing statistical topic models by applying Latent Dirichlet Allocation (LDA) to analyze how similar the topics of songs are and whether they changed over time. This will be done by calculating similarity measures on the per-topic-per-word probability distributions that are part of the output of the LDA models.
The biggest motivation behind these research goals is that they have not been explored in quite the way they will be in this thesis. This thesis aims to give an insight into some of the themes that can be examined on a collection (corpus) of song lyrics with NLP methods and statistical topic modeling. By no means does it claim to be all-encompassing; rather, it is to be seen as a contribution to the already existing work on these subjects, which belong to a vast field of research (cf. sections 2.1.-2.4.), and it will hopefully point out some new, additional perspectives. The following subsection outlines related research and explores previous studies that this thesis follows and tries to build upon. Subsection 1.2. will outline the hypotheses that were derived from the research goals, as well as the approaches to test them.
1.1. Previous Research
A lot of the existing research focuses heavily on analyzing audio music data and thereby disregards the extra value that an analysis of lyrics can yield (cf. section 2.1.). One shortcoming of some studies conducted on lyrics data seems to be that not enough songs are used; there are studies that only analyze 125, 420, or 1,900 songs (cf. Mahedero et al. 2005; Oudenne/Chasins 2010; Berger/Packard 2018). Many of the methods applied here are designed for large amounts of text data, because their objective is to automatically help structure this data (cf. sections 2.1.-2.4.); applying them to just a few hundred songs seems to be a considerable waste of potential. Many previous studies only take a single genre into account (cf. Hu/Yu 2011; Sasaki et al. 2014; Johnson-Roberson/Johnson-Roberson 2013), or do not consider evolution over time (cf. Sterckx 2013; Tsukuda et al. 2017; Sasaki et al. 2014), whereas here various genres are compared, as well as analyzed between decades. By doing so, a more thorough understanding of the lyrics is to be expected, along with a better insight into what the methods applied here are capable of rendering. Studies that use huge text data sets and analyze different genres and decades obviously exist, but they mainly focus on other analyses and methods than those utilized here. For example, there is a vast amount of work on sentiment analysis or the classification of genres and artists (cf. Knees et al. 2005; Mahedero et al. 2005; Hu et al. 2009; McKay et al. 2010; Sharma/Murty 2011; Keerthana/Kalpana 2017; Canicatti 2016). The text statistics and lyrical features that are examined here in their own right, in many studies only serve as an addition for improving classification accuracy or are used for clustering (cf. Mayer et al. 2008; McKay et al. 2010; Mayer/Rauber 2010). Mayer/Rauber, for example, use text statistics and text features to explore song genres. Ultimately, they want to compare genre classification techniques applied to lyrics and audio data, but one part of their work is dedicated to analyzing lyrics features like rhymes, or text statistics. Among the text statistics that they compare for ten different genres are measures such as the average word length, text length, the use of punctuation, or the use of various parts of speech. One of their findings is that folk music and hip-hop exhibit more creative language than the other genres, which becomes clear through the increased use of rhymes and the higher number of unique words per song (cf. Mayer/Rauber 2010, 351ff.). Among other linguistic features, Motschenbacher relies on text statistics as well, comparing a corpus of pop song lyrics to a corpus containing lyrics of songs performed at the Eurovision Song Contest. He mainly uses word frequencies for his analysis (cf. Motschenbacher 2016, 6f.). Ellis et al. try to build a score for measuring the lexical novelty of songs based on statistical properties of the lyrics data. Apart from computing statistics of unique words, they use the inverse document frequency measure (idf) to identify terms that can be related to genres, moods, or topics. Their aim is to provide a scoring system that could allow users to choose the amount of lexical novelty, i.e. complexity, of the songs they want to listen to (cf. Ellis et al. 2015, 1f.). In order to use lyrics for genre classification, Fang et al.
first determine lyrics features that are further used in the classification process, like parts of speech, n-grams, the use of pronouns, as well as text statistics, e.g. word lengths, song lengths, and the term frequency-inverse document frequency (tf-idf), for songs of nine genres distributed over the period 1954-2014 (cf. Fang et al. 2017, 466f.). McKay et al. extract various lyrics features and text statistics from the song data in order to utilize them for genre classification. They compute e.g. word frequencies, n-grams, parts of speech, letter frequencies, or the number of lines per song (cf. McKay et al. 2010, 3). In this thesis, various text statistics like word frequencies, word lengths, or song lengths will be computed, as well as lyrics features like n-grams, parts of speech, or term frequency-inverse document frequency. Sections 2.2. and 2.3. explain the theory behind these applications, and the results can be found in sections 4.1. and 4.2.
Apart from the studies mentioned in section 1., there are some others that try to explore the complexity of song lyrics, or in other words, how sophisticated they are. As already stated, Mayer/Rauber identify folk and hip-hop as the genres with the most creative language, since they use the highest number of unique words and many rhymes (cf. Mayer/Rauber 2010, 351). The very first work on the complexity of song lyrics, however, was published as early as 1984 by Knuth, a computer scientist, as an in-joke under the title “The Complexity of Songs”. In the article he points out how the invention of the chorus contributed to reducing long and complex ballads to repetitive songs with decreased context (cf. Knuth 1984, 344f.). While this thesis will try to examine the repetitiveness of lyrics with the help of n-grams, there are far more serious approaches than Knuth's to rely on for exploring the complexity of lyrics, e.g. the one by Jockers. In his opinion, the complexity of songs can be depicted by measures of vocabulary richness, which “[...] can be represented as a mean word frequency or as a relationship between the number of unique words used [...] and a count of the number of word tokens in the document” (cf. Jockers 2014, 59). He calculates a type-token ratio by dividing the number of unique words by the total number of words. The lower this type-token ratio, the less lexically varied, or the less complex, a song is, according to Jockers (ibid.). This measure will be used in this thesis, but it will be referred to as lexical density, following the example of Liske. In her online tutorial on Natural Language Processing of song lyrics by Prince, she describes the lexical complexity, or sophistication, of lyrics by calculating word frequencies, word lengths, lexical density, and lexical diversity, i.e. the number of unique words in a song (cf. Liske 2018). Another way to determine how difficult, or complex, a song is, is proposed by Ellis et al. As mentioned above, they compute a lexical novelty score for song lyrics, which is based on inverse document frequency. The idea behind this is that “the novelty or unfamiliarity of a stimulus has a direct bearing on basic cognitive processing. For example, words that are statistically infrequent (i.e., have a high idf) are more difficult to perceive, recognize, and recall than more commonly encountered words” (cf. Ellis et al. 2015, 1). Based on these studies, the lexical complexity of the lyrics used in this thesis will be explored by computing the above-mentioned measures and statistics, which will then be compared between the five genres and over time.
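To make these measures concrete, the following minimal R sketch (R being the software used throughout this thesis) computes lexical diversity and lexical density per song with the dplyr package. The input data frame and its contents are invented for illustration, not taken from the thesis's data set.

```r
library(dplyr)

# Invented toy input: one row per word token, with the song it belongs to.
lyrics_words <- data.frame(
  song = c("A", "A", "A", "A", "B", "B"),
  word = c("love", "love", "you", "baby", "road", "home"),
  stringsAsFactors = FALSE
)

lyrics_words %>%
  group_by(song) %>%
  summarise(
    total_words       = n(),                            # song length in tokens
    lexical_diversity = n_distinct(word),               # number of unique words
    lexical_density   = lexical_diversity / total_words # type-token ratio
  )
```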
Other researchers using topic modeling on song lyrics, for example, compare different topic modeling algorithms to each other. Lukic applies LDA and the Pachinko algorithm to lyrics in order to extract topics and then compares them with the help of the Kullback-Leibler distance measure (cf. Lukic 2014). Sterckx et al. compute LDA models and labeled LDA models, i.e. supervised LDA models, and use cosine similarity to measure differences between them (cf. Sterckx et al. 2014). In a previous work, Sterckx employs labeled LDA models as well, comparing the results to social tags collected online and using them for various classification methods (cf. Sterckx 2013). Johnson-Roberson/Johnson-Roberson apply LDA modeling and Dirichlet multinomial regression to a set of hip-hop lyrics to examine regional and temporal variation over a time span of 25 years (cf. Johnson-Roberson/Johnson-Roberson 2013). Further applications of topic modeling on song lyrics include e.g. the study by Tsukuda et al., who try to build a model that can capture artists' preferences for topics. They assume that each artist has a topic distribution which assigns a topic to every song (cf. Tsukuda et al. 2017). In the study conducted by Sasaki et al., LDA is applied to Japanese songs to build a lyrics retrieval system which allows the user to browse and visualize song lyrics (cf. Sasaki et al. 2014). Further ways of topic modeling song lyrics are explored e.g. by Miao et al., who utilize Neural Variational Inference to find the topics of songs (cf. Miao et al. 2010). Kleedorfer et al. rely on non-negative matrix factorization in order to extract topics from song lyrics, which are then used for clustering (cf. Kleedorfer et al. 2008). Another method of finding topics is centering resonance analysis, used by Henard/Rossetti to determine the success of pop songs depending on the topics they address (cf. Henard/Rossetti 2014). In this thesis, LDA will be applied for the extraction of topics from the lyrics data set; the results can be reviewed in section 4.3. Several models will be built to learn the differences of the topics within and between them by computing various similarity measures.
Most of the research on song similarity is based on audio data (cf. Serra et al. 2008; Percino 2014; Gebelhoff 2016) and there are various studies about measuring the similarity between songs, for instance with the aim of finding similar songs for use in music recommender systems (cf. Stenzel/Kamps 2005; Barrington et al. 2009; Eck et al. 2008). Studies that explore the computation of song similarities from lyrics are conducted, for example, by Mahedero et al., Logan et al., Schedl, or Berger/Packard. Mahedero et al. use inverse document frequency to compute the cosine similarity (cf. Mahedero et al. 2005, 477). Logan et al. conduct a semantic analysis of song lyrics for calculating similarities and compare these to acoustic similarities (cf. Logan et al. 2004). Among other things, Schedl examines artist similarity in his study by computing graphical similarity networks (cf. Schedl 2008). The research by Berger/Packard is the only study that could be found in which the authors apply LDA models to song lyrics with the goal of utilizing the resulting distributions for measuring similarity, as is intended in the thesis at hand. (Lukic does so as well, but he utilizes the similarities to compare two topic modeling approaches (cf. Lukic 2014).) They use 1,900 songs from a three-year time span and compute the similarities - based on the language style matching equation by Ireland/Pennebaker (2010) - between each song and its genre with the help of the LDA output (cf. Berger/Packard 2018, 2). However, most studies calculating similarity measures from the output of LDA models do not use lyrics, as will be done here, but instead scientific literature, newspaper articles, Wikipedia articles, web links, or Amazon ratings (cf. He et al. 2011; Mimno et al. 2011; Blei/Chaney 2012; Chuang et al. 2013; Aletras/Stevenson 2014; Towne et al. 2016). To the best of my knowledge, there is not yet a scientific study that explores the topic similarity of song lyrics of various genres within topic models and over a longer time period in the way that it will be conducted here (cf. sections 2.5., 4.3.3. and 4.3.4.).
1.2. Hypotheses and Methods
The hypotheses that were derived from the research goals and shall be explored in this thesis are the following:
1. Some genres display higher similarity to each other, while other genres are more distinct; this can be observed over time as well.
2. Lexical complexity decreases over time.
3. Which topic number K and which hyperparameters are preferable for the LDA models?
4. The extracted topics within a model can be interpreted and labeled by human judgment. Further, they are differentiable, i.e. not too similar.
5. The extracted topics change over time. This means that the topics, which are examined between various models across decades, vary.
Hypotheses 1 and 2 will be explored by computing text statistics (word frequencies, word lengths, unique words, lexical density) and text features (tf-idf, n-grams/repetitions, parts of speech), and by comparing these between genres and decades. The construct of lexical complexity used here is built upon the definitions by Liske, Jockers, and Ellis et al. mentioned in section 1.1. Lexical complexity, as defined for this thesis, consists of several features: lexical density, i.e. the number of unique words divided by the total number of words, word lengths, and the amount of repetition. As already mentioned, the genres compared here are alternative, country, pop, rock, and hip-hop. They will be examined over a time span of roughly 50 years, from 1970-2018. Section 3 explains the reasons for choosing these genres and this time period. A somewhat problematic characteristic of music genres, according to Schedl, is that they are inconsistent: there are many ways of differentiating genres and subgenres, and there is no consensus on how to label them, or on what songs to include in which genre by what definition (cf. Schedl 2008, 42f.). Bearing this in mind, mutually exclusive genres were created for this thesis by sorting each artist into exactly one genre. The genre affiliation for each artist was extracted from Discogs.com and Wikipedia.com (cf. section 3). While some might classify certain artists in this data set into a different genre than the one they were assigned, it is important to note that the mutual exclusiveness of the artists and songs in each genre is what matters most for the analyses conducted here. For further information on this process, see section 3.
In order to explore hypothesis 3, model tuning will be applied by using the “ldatuning” package in R, computing perplexity scores, and the maximum log likelihoods. Further information on this can be found in section 2.4.1. and the results are depicted in section 4.3.1. Hypotheses 4 and 5 will be tested by calculating diverse similarity measures - Jensen-Shannon divergence, Hellinger distance, and log ratio - on the per-topic-per-word probability distributions that are one of the outputs of the LDA models computed beforehand. Furthermore, manually assigned topic labels will be analyzed for reference. In this way, similarities within one topic model, as well as similarities between two topic models, can be observed. While the former is utilized to determine the similarity of the topics found by one topic model, the latter serves as a means of figuring out whether topics changed over time. It is important to note that for LDA there is no single correct, perfect model. LDA is a probabilistic algorithm based on Machine Learning principles and depends on its input parameters; if they are changed, the model output will differ as well. The construction of a perfect model is not the aim of this thesis; rather, the models used here will ultimately be chosen according to a somewhat heuristic approach to parameter tuning (cf. section 4.3.1.), and the resulting models will then be further evaluated. To read more about topic modeling using LDA and the similarity measures applied here, see sections 2.4. and 2.5. For the analyses and results, see section 4.3.1.
Section 2 of this thesis depicts the theories behind the methods used; in section 3 the data mining and pre-processing procedure, as well as the final data set, is outlined; section 4 illustrates the data analyses and results; and section 5 summarizes the findings, points out needs for improvement, and explains future prospects and applications that could be explored in further research. All analyses are conducted on a self-designed data set using the statistics software R, version 3.5.1.
2. Theory
This chapter outlines the theoretical foundations of the methods applied in this thesis. Basic concepts like Music Information Retrieval (MIR), Natural Language Processing (NLP), and Text Data Mining will be explained, as well as some specific applied methods, e.g. the use of n-grams, part-of-speech (POS) tagging, or term weighting with term frequency-inverse document frequency (tf-idf). Furthermore, topic modeling with Latent Dirichlet Allocation (LDA) and the similarity measures used to determine the similarity of topics within and between the LDA models will be described. The analyses will be depicted in section 4.
2.1. Music Information Retrieval
Music Information Retrieval is a relatively young, interdisciplinary research field, which experienced a period of growth in the 1990s due to music data becoming available in digital form and the advance of faster computing power (cf. Burgoyne et al. 2016, 215). Theoretical knowledge and applications originate from various areas of research, e.g. information science, musicology, audio engineering, or computer science (cf. Downie 2003, 196). MIR is about extracting, analyzing, and using information from music data (cf. Schedl 2008, 17). According to Schedl et al., who refer to Downie, MIR “[...] is foremost concerned with the extraction and inference of meaningful features from music (from the audio signal, symbolic representation or external sources such as web pages), indexing of music using these features, and the development of different search and retrieval schemes [...]” (Schedl et al. 2014, 128). They also explain that musical information can consist of “pitch, rhythm, harmony, timbre, lyrics, performance, [...]” (Schedl et al. 2014, 210). Downie defines seven facets that music information can be drawn from, specifically: “pitch, temporal, harmonic, timbral, editorial, textual, and bibliographic facets” (Downie 2003, 297). Song lyrics, which are the focus of analysis in this thesis, belong to the textual facet, whereas some album metadata that will be used for additional information, e.g. the decade a song was published in, is part of the bibliographic facet.
Some main research topics in MIR are the classification of musical genres, moods, and artists, as well as music analysis, knowledge representation, or similarity retrieval (cf. Mayer/Rauber 2010, 335). Common real-world applications of these tasks are, e.g., music recommender systems that rely on various similarity criteria, like the audio fingerprint, and are supposed to make it easier for users to find similar songs. Further applications are cover song identification, or automatic playlist generation, which fall back on similarity retrieval as well (cf. Schedl et al. 2014, 132; Burgoyne et al. 2016, 213). Generally, audio data is used for these tasks (cf. Mayer/Rauber 2010, 334; Schedl 2008, 45; Downie 2003, 302). As an explanation of why MIR has always been mainly focused on retrieving information from audio data, Downie states that “[...] the vast majority of listeners understand music solely as an auditory art form” (Downie 2003, 302), and in the opinion of Mayer/Rauber “[...] music perception itself is based on sonic characteristics to a large extent” (Mayer/Rauber 2010, 334). A lot of research engages with problems like genre classification (cf. Li et al. 2003; Scaringella et al. 2006; Haggblade et al. 2011; Jeong/Lee 2016), mood classification (cf. Ren et al. 2015; Lidy/Schindler 2016; Ridoean et al. 2017), or building recommender systems (cf. Stenzel/Kamps 2005; Barrington et al. 2009; Eck et al. 2008) using audio data only. In doing so, it disregards all the information that can be extracted from the textual facet of music. Although song lyrics contain a lot of additional information, research dedicated to MIR using lyrics is by far not as extensive as the efforts to retrieve information from audio data (cf. Schedl 2008, 45). Logan et al. and Knees et al. show some of the first approaches to extracting meaningful information from song lyrics (cf. Logan et al. 2004; Knees et al. 2005). Section 1.1. outlines some further research that has been conducted since. What audio data cannot grasp, for example, is semantics, as well as the structure lyrics provide to a song (cf. Mayer/Rauber 2010, 334). Besides, text is independent of melody: for instance, given two songs with the same melody but different lyrics, audio analysis can only identify information for one song, since both sound the same. By analyzing the lyrics, however, it could be recognized that the songs are not identical (cf. Downie 2003, 301). Additional advantages of using lyrics data rather than audio data are that lyrics do not require as many computational resources and are available for free on the internet, whereas the use of musical audio data is mostly linked with copyright issues (cf. Downie 2003, 302).
2.2. Natural Language Processing
According to Goldberg, “Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages. This includes both algorithms that take human-produced text as input, and algorithms that produce natural looking text as output” (Goldberg 2017, xvii). Bird et al. write that NLP can “cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves “understanding” complete human utterances, at least to the extent of being able to give useful responses to them” (Bird 2009, ix). In short, it is a research field concerned with the analysis and representation of natural language text or speech using automatic computational processing. Research in NLP is interdisciplinary, unifying scientists from areas like computer science, mathematics, linguistics, artificial intelligence, robotics, or psychology. Common applications using NLP are, for example, machine translation, information retrieval, speech recognition, or user interfaces (cf. Chowdhury 2003, 1). Typical tasks in NLP include part-of-speech tagging, lemmatization, stemming, named entity recognition, understanding and generation of natural language, sentiment analysis, topic segmentation, speech recognition, or text-to-speech transformation (cf. Kakde et al. 2013; Zhang et al. 2015). Some of these tasks are based on statistical methods, e.g. part-of-speech tagging, machine translation, word-sense disambiguation, or grammar learning (cf. Chowdhury 2003, 3; Liddy 2001, 10). Since they are applied in this thesis, lemmatization and part-of-speech tagging will be further explained in the following subsections. Topic segmentation in the form of topic modeling using Latent Dirichlet Allocation will be presented in section 2.4.
2.2.1. Lemmatization
Used as a form of data pre-processing in various NLP applications, lemmatization is the process of reducing words to their dictionary form, the lemma. In contrast to stemming, it respects the part of speech of the input words and their meaning in the sentence. By grouping different grammatically inflected forms of a word together, it enables analyzing them as a single item, their root. Lemmas therefore allow reducing the complexity of unlemmatized text data by decreasing the number of distinct terms. Lemmas are dictionary terms, while the word stems resulting from stemming cannot always be identified as valid words anymore (cf. Kurdi 2016, 101; Liu 2012, 1; Müller et al. 2015, 2268; Dave/Balani 2015, 366). Whether to use lemmatization or stemming needs to be decided depending on what kind of analyses will be conducted on the data; e.g. if analyzing words by different parts of speech is completely irrelevant to the researcher, then stemming might suffice. There are various approaches to applying lemmatization to text data. Some scientists employ log-linear classifiers (cf. Chrupala et al. 2008; Müller et al. 2015) or neural networks (cf. Chakrabarty et al. 2017) to find the lemmas of input terms for diverse languages. There are rule-based approaches as well that take grammatical characteristics such as suffixes into account (cf. Plisson 2004; Paul 2013). One of the most trivial procedures, applicable to morphologically simple languages like English (as compared to more complex languages like German or Turkish), is using a list containing all possible word forms and their respective lemmas to assign a lemma to each input term (cf. Somers 2008, 4). This method is used in the “textstem” and “lexicon” packages in R, resorting to Mechura's English lemmatization list, consisting of over 40,000 unique words plus their lemmas (cf. Silge et al. 2018, 13). It is implemented in this thesis for lemmatizing the lyrics in order to decrease the complexity of the data set and to be able to better summarize results, e.g. the frequency count of the most used terms. Considering that this is the simplest form of finding lemmas, the results are considerably good; at least it can be noted that no discrepancies were discovered while further analyzing the lemmatized lyrics.
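As a brief illustration, this lookup-based lemmatization can be performed in R with the “textstem” package mentioned above; the example lines below are invented, not taken from the lyrics data.

```r
library(textstem)  # dictionary-based lemmatization via lexicon::hash_lemmas

lines <- c("she was running through the cities",
           "we sang all the songs he wrote")

# Each inflected form is looked up and replaced by its lemma,
# e.g. "was" -> "be", "running" -> "run", "cities" -> "city".
lemmatize_strings(lines)
```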
2.2.2. Part-of-Speech Tagging
“Part-of-speech (POS) tagging consists of automatically attaching a part-of-speech tag to all of the words in a given corpus” (Kurdi 2016, 117). It could also be described as “a lexical categorization or grammatical tagging of words according to their definition and the textual context they appear in” (Mayer et al. 2008, 339). Part-of-speech tags can help e.g. with word-level understanding or improve search functions (cf. Liddy 2001, 6; Kurdi 2016, 117). Tagsets, i.e. lists of tags, for various languages enable the classification of words into their respective categories, like the Universal part-of-speech tagset, containing NOUN, VERB, ADJ (for adjective), PREP (for preposition), etc. The main difficulty POS tagging faces is ambiguity, since in some languages, like English, there are numerous words that could be assigned to different parts of speech (cf. Kurdi 2016, 117f.). Examples are to play as a verb vs. play as a noun, or to love as a verb vs. love as a noun. There are diverse methods of finding the appropriate tags for each word, of which the most common are transformation-based tagging (cf. Brill 1995), rule-based tagging, and stochastic tagging. Rule-based tagging relies on grammatical rules, like examining suffixes, because for certain words these allow finding their corresponding tags; e.g. terms that end in “-able” supposedly are adjectives, while words ending in “-tion” are likely to be nouns (cf. Kurdi 2016, 120). The approach applied here is stochastic tagging, which is “based on [the] probability of certain tag[s] occurring, given various possibilities” (Nau 2010, 7). For this purpose, a pre-trained corpus containing words that were already tagged is necessary (ibid.). The tagging process consists of two phases: (1) the identification stage, where all potential tags for each input word are identified; (2) the disambiguation stage, where the most likely tag from those identified in (1) is assigned to each word (cf. Kurdi 2016, 120; Liddy 2001, 6). Serving as a training corpus in the analyses conducted for this thesis are the Universal Dependencies models, consisting of over 576,000 unique words trained on various text sources like newspapers, literature, law texts, or Wikipedia articles (cf. Nivre 2017). POS tags can be utilized e.g. for sentiment analysis (cf. Pang/Lee 2008) or text style analysis (cf. Argamon et al. 2003). Mayer et al. use POS tags to define and compare rhyme and style features in music lyrics. One of their assumptions is that “different genres will also differ in the category of words they are using [...]” (Mayer et al. 2008, 339). A similar analysis will be conducted here and briefly compared to these previous findings in section 4.2.2. In this thesis, POS tagging is mainly applied in addition to some of the other methods, in order to investigate whether results differ conditional on the part of speech used. For example, tf-idf and the most used words per genre are examined separately by POS.
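For illustration, the pre-trained Universal Dependencies models can be applied in R through the “udpipe” package; whether this particular wrapper is the one used in the thesis is an assumption of this sketch, and the example sentence is invented.

```r
library(udpipe)

# Download and load a pre-trained Universal Dependencies model for English.
model_file <- udpipe_download_model(language = "english")
ud_english <- udpipe_load_model(model_file$file_model)

# "love" occurs both as a verb and as a noun in this invented line;
# the disambiguation stage assigns the likelier tag in each context.
tagged <- udpipe_annotate(ud_english, x = "I love this love song")
as.data.frame(tagged)[, c("token", "upos")]
```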
2.3. Text Data Mining
Data Mining describes the collection, cleaning, processing, and analysis of data, and the gain of useful information from it. There are various data mining applications, problems, and data representations that can be found in everyday life. For example, the ever-growing amount of data online necessitates methods that can give structure to it, find patterns, and extract practical insights from it (cf. Aggarwal 2015, 1; Aggarwal/Zhai 2012, 2). There are many different forms of data, e.g. categorical, quantitative, time-series, graphical, or text data (cf. Aggarwal 2015, vii). The focus of this thesis is on extracting information from text data that is at hand in the form of lyrics. An important goal of text mining is “going beyond information access to further help users analyze and digest information and facilitate decision making” (Aggarwal/Zhai 2012, 2; cf. Kwartler 2017, 1f.). The main tasks tackled by text mining methods are, according to Aggarwal: “association pattern mining, clustering, classification, and outlier detection” (Aggarwal 2015, xxiii). While most of the work of the data mining process is allotted to data preparation, the whole processing pipeline from collecting data up to describing the final results “[...] is conceptually similar to that of an actual mining process from a mineral ore to the refined end product. The term “mining” derives its roots from this analogy” (Aggarwal 2015, 2). The data cleaning and data processing phases of the data mining pipeline require Natural Language Processing methods such as tokenization, stemming, lemmatization, POS tagging, etc. to cleanse and give structure to the extracted text data. The analytical text mining methods applied here, n-grams and term frequency-inverse document frequency, will be described in the next subsections.
2.3.1. N-grams
N-grams are n overlapping items, or slices, from a longer string of letters or words (cf. Cavnar/Trenkle 1994, 2). Instead of examining single terms (unigrams), n-grams enable the analysis of consecutive sequences of words (cf. Silge/Robinson 2018). They can be examined and further processed exactly like single words: their frequencies can be counted, and even topic models can be computed on n-grams. Primarily, they are used to explore relationships between two or more words or to provide context, e.g. in sentiment analysis. They can also contribute to exploring word co-occurrences by computing correlations between n-grams. By counting their frequencies, it can be observed how commonly certain n-grams are contained in any given text corpus. For instance, Silge/Robinson observe street names in their corpus by analyzing bigrams, i.e. pairs of two consecutive words. N-grams help give context to sentiment analysis, e.g. by discovering whether a word has a positive meaning (“happy”), or whether it is actually connoted in a negative way (“not happy”) (cf. Silge/Robinson 2018). Another application of n-grams is building probabilistic language models, Markov models, to predict the next item in a series of items or words, which is commonly implemented e.g. in computational biology for sequence analysis (cf. Ganapathiraju et al. 2004; Maetschke et al. 2010). In this thesis, n-grams are mainly helpful for computing the number of repetitions per song, by using four-grams and counting their frequencies.
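A minimal sketch of this repetition count with the “tidytext” package; the song line is invented, and any four-gram occurring more than once within a song is treated as a repeated slice, in line with the approach described above.

```r
library(dplyr)
library(tidytext)

songs <- data.frame(
  song   = "toy example",
  lyrics = "na na hey hey goodbye na na hey hey goodbye",
  stringsAsFactors = FALSE
)

# Split the lyrics into overlapping four-grams and count duplicates:
# every four-gram with n > 1 marks a repetition within the song.
songs %>%
  unnest_tokens(fourgram, lyrics, token = "ngrams", n = 4) %>%
  count(song, fourgram, sort = TRUE) %>%
  filter(n > 1)
```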
2.3.2. Term Frequency-Inverse Document Frequency
Term frequency-inverse document frequency (tf-idf) is a term weighting approach that assigns weights to words in a corpus according to their importance. The fundamental assumption behind tf-idf term weighting is that terms, i.e. words, are important when they occur more frequently in one document, but at the same time less frequently in the other documents of a text corpus (cf. Mayer et al. 2008, 338; Kleedorfer et al. 2008, 289). The intention behind this is to deal with words that appear very frequently throughout a collection of texts but do not hold much information, like the English terms “the” or “by”, etc. (cf. Torgo 2016, 74). According to Schedl, the aim of tf-idf is to emphasize “terms that are discriminative for a document, while reducing the weight of terms that are less informative” (Schedl 2008, 57). Tf-idf is computed as follows:
$$\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\left(\frac{N}{\mathrm{df}(t)}\right) \qquad (1)$$
Here, tf(t, d) denotes the term frequency, i.e. the number of times a term t occurs in a document d. Term frequency assigns a high weight to very frequent terms and by itself is not a good indicator of the importance of words in a document, since extremely common words like “the” do not offer a lot of information. It could, however, be utilized to define a customized stop word list of these most frequent terms (cf. Silge/Robinson 2018; Schedl 2008, 57). The second factor, ln(N/df(t)), the inverse document frequency (idf), is the total number of documents N divided by the number of documents containing the term t, i.e. the document frequency df(t), scaled by the natural logarithm (cf. Mayer et al. 2008, 338). Inverse document frequency “decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents” (cf. Silge/Robinson 2018). Extremely common words take a tf-idf value of zero, because their idf will be zero, since it is the natural logarithm of 1. Vice versa, it holds that the higher the tf-idf value, the less commonly a word occurs in the collection of documents (ibid.). Real-world applications relying on tf-idf are e.g. text-based recommender systems like search engines (cf. Croft et al. 2009).
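A small sketch of this weighting with tidytext's bind_tf_idf(), which uses the same natural-log idf as equation (1); the word counts are invented, treating each genre as one “document”.

```r
library(dplyr)
library(tidytext)

# Invented counts: n occurrences of each word per genre "document".
word_counts <- data.frame(
  genre = c("pop", "pop", "hiphop", "hiphop"),
  word  = c("the", "baby", "the", "street"),
  n     = c(500, 120, 480, 90),
  stringsAsFactors = FALSE
)

# "the" occurs in every document, so its idf - and hence tf-idf - is zero;
# "baby" and "street" are discriminative and receive positive weights.
word_counts %>%
  bind_tf_idf(word, genre, n) %>%
  arrange(desc(tf_idf))
```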
In this thesis, tf-idf is used to compute term weights for the most important words by genre and decade - in order to portray the words that are descriptive of each genre and decade - and to examine them. They will be compared to the most frequently used words, which is supposed to give a clear picture of the usefulness of applying tf-idf weights in contrast to just counting simple term occurrences. Section 4.2.1. discusses this application further. Tf-idf weighted words could also be used for topic modeling, in order to only include words in the models that are of a certain importance for the text collection, but this will not be applied here.
2.4. Topic Modeling Using Latent Dirichlet Allocation
This section deals with the theory behind the models that will mainly be examined in this thesis. First, a general overview of statistical topic modeling and Latent Dirichlet Allocation will be given; then model tuning and model evaluation will be explained in the following subsections.
Topic models are probabilistic models that help “uncovering the underlying semantic structure” of a collection of text documents (Blei/Lafferty 2009, 1). They can be applied to various text sources such as scientific articles (cf. Griffiths/Steyvers 2004; Blei et al. 2003) or newspapers (cf. Wei/Croft 2006; Mimno/Blei 2011); implementations on song lyrics can be found as well (cf. Sasaki et al. 2014; Sterckx et al. 2014; Sharma/Murty 2011; cf. section 1.1. for more information). Topic modeling algorithms enable the unsupervised classification of text documents by finding their underlying hidden structure and dividing them into natural groups, even if it is unclear what exactly is being looked for (cf. Silge/Robinson 2018). According to Blei/Lafferty, topic models can give structure to unstructured text collections “by discovering patterns of word use and connecting documents that exhibit similar patterns [...]” (Blei/Lafferty 2009, 1). No prior annotations or labeled documents are required in order to perform topic modeling, since the topics arise from analyzing the original text. Applying topic modeling algorithms makes it possible to organize and summarize vast amounts of text data at a rate that humans could never achieve by hand (cf. Blei 2012, 78). Topic models can be helpful for reducing the high dimensionality of certain data and for making huge collections of text data interpretable (cf. Shao/Qin 2014, 199).
The most widely used topic model is Latent Dirichlet Allocation (LDA), first developed by Blei et al. (cf. Blei et al. 2003). It is an advancement of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis, addressing the shortcomings of these topic modeling approaches, namely overfitting and the impossibility of computing probabilities for documents that are not in the training data (cf. Blei et al. 2003, 994; Blei 2012, 80; Aggarwal/Zhai 2012, 142). By using Dirichlet priors, LDA enables generalizing to new documents (cf. Aggarwal 2015, 446). In LDA, a topic is denoted as a distribution over a fixed vocabulary (cf. Blei 2012, 78). Although LDA is frequently used on text data, it is not limited to it and finds applications in image retrieval or bioinformatics as well (cf. Blei et al. 2003, 995).
The main assumption of LDA is that every document is a mixture of topics and each topic is a mixture of words. Every document being composed of a mixture of topics means that each document contains words from several topics in different proportions. Every topic consisting of a mixture of words can be explained as follows: given a model with two topics, for example, some words can be attributed to topic 1 and others to topic 2, but there are also words that can be ascribed to both topics. LDA makes it possible to estimate the mixture of words ascribed to each topic and the mixture of topics describing each document at the same time (cf. Silge/Robinson 2018; Blei/Lafferty 2009, 2f.). Some further assumptions of LDA are that (1) the order of words in a document does not matter (bag-of-words assumption), (2) the order of documents in a collection does not matter, and (3) the number of topics that the LDA model generates is assumed to be known beforehand and is fixed. While the number of topics K needs to be specified before the model is computed, the topics themselves are not known in advance, since the aim is to learn them from the data by applying the LDA model (cf. Blei 2012, 82f.; Blei/Lafferty 2009, 3). The main goal of topic modeling is the automatic discovery of topics from a collection of text documents. While these documents are observed random variables, the topic structure - composed of topics, per-document topic distributions, and per-word topic assignments - is hidden, i.e. latent. This leads to the main computational obstacle LDA faces, namely that this hidden topic structure needs to be inferred from the observed variables, the words in the documents (cf. Blei 2012, 79). According to Blei, the “utility of topic models stems from the property that the inferred hidden structure resembles the thematic structure of the collection. This interpretable hidden structure annotates each document in the collection [...]” (ibid.). By defining a joint probability distribution over both the observed and hidden random variables, the computation of the posterior of the hidden structure is possible. The “posterior distribution of the hidden variables given the observed documents determines a hidden topical decomposition [...]” of the input text (Blei/Lafferty 2009, 3). Since LDA is a generative probabilistic model, at its basis is a generative process, including the hidden variables, from which the data arises. According to Blei/Lafferty (cf. Blei/Lafferty 2009, 3f.), the generative LDA process works like this:
1. For each topic k = 1, ..., K, draw a distribution over the vocabulary: β_k ~ Dir(η).
2. For each document d:
   (a) Draw a vector of topic proportions: θ_d ~ Dir(α).
   (b) For each word n in document d:
       i. Draw a topic assignment: Z_d,n ~ Mult(θ_d).
       ii. Draw a word: W_d,n ~ Mult(β_{Z_d,n}).
K is the pre-specified number of topics, N the vocabulary size, α a positive K-vector, and η a scalar. Dir(α) denotes a K-dimensional Dirichlet distribution with vector parameter α, while Dir(η) is an N-dimensional symmetric Dirichlet distribution with scalar parameter η. The hidden random variables determining the topic structure are the topics β_1:K, the per-document topic proportions θ_1:D, and the per-word topic assignments Z_1:D,1:N. Each β_k is a distribution over the vocabulary; θ_d are the topic proportions for the d-th document, with θ_d,k being the topic proportion for topic k in document d; Z_d are the topic assignments for the d-th document, with Z_d,n being the topic assignment for the n-th word in document d; and W_d are the observed words for document d, with W_d,n being the n-th word in document d (cf. Blei/Lafferty 2009, 3f.; Blei 2012, 80). Following Blei (cf. Blei 2012, 80), this process defines the joint probability distribution, which can be denoted as follows:
$$p(\beta_{1:K}, \theta_{1:D}, Z_{1:D}, W_{1:D}) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} \left( p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(Z_{d,n} \mid \theta_d)\, p(W_{d,n} \mid \beta_{1:K}, Z_{d,n}) \right) \qquad (2)$$
Latent Dirichlet Allocation obtains its name from the Dirichlet prior on θ, which is a conjugate prior for the multinomial distribution and therefore convenient, since it makes statistical inference easier (cf. Steyvers/Griffiths 2007, 4). Following Blei/Lafferty (cf. Blei/Lafferty 2009, 4), the density of the Dirichlet distribution is denoted as:
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \qquad (3)$$
According to Blei/Lafferty, α is a positive K-vector and Γ denotes the Gamma function, which can be thought of as a real-valued extension of the factorial function. A symmetric Dirichlet distribution is a Dirichlet distribution where each component of the parameter is equal to the same value. It is used as a distribution over discrete distributions, and each component in the random vector corresponds to the probability of drawing the item associated with that component (cf. Blei/Lafferty 2009, 4). The hyperparameter α_i can be interpreted as a count of prior observations of the number of times topic i is sampled in a document, before any words from that document have been observed. The α parameter thus determines how smoothed the topic distributions are. Often a value of 50/K is chosen for α, but some define α = 1/K. Similarly to (3), another prior, a symmetric Dirichlet distribution Dir(η), is placed on β. The hyperparameter η is the count of prior observations of the number of times words are sampled from a topic before observing any words, and it smooths the word distribution in every topic. Choosing η = 0.01 is common (cf. Steyvers/Griffiths 2007, 4f.; Chuang et al. 2013, 616f.), but in the “topicmodels” package, which is used here for fitting the LDA models, the default value for η is 0.1 (cf. Grün/Hornik 2018, 10). For higher values of α, each document tends to be composed of a mixture of most topics; when the α value is set lower, it is more likely that each document is composed of just a few, or only one, topic. Higher η values mean that each topic contains a mixture of most words, whereas a lower value of η makes it more likely that the topics contain only a few words (cf. George/Doss 2018, 3). In LDA, two Dirichlet random variables are involved: θ, the topic proportions, are distributions over the topic indices {1, ..., K}, while β, the topics, are distributions over the vocabulary of size N (cf. Blei/Lafferty 2009, 4). Given the joint distribution (2) and the Dirichlet priors, the posterior distribution can be inferred, which - according to Blei/Lafferty (cf. Blei/Lafferty 2009, 5) - is denoted as:
$$p(\beta_{1:K}, \theta_{1:D}, Z_{1:D} \mid W_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, Z_{1:D}, W_{1:D})}{p(W_{1:D})} \qquad (4)$$
However, deriving the posterior in closed form is impossible, since the marginal distribution of the words, the denominator of (4), is intractable to compute. For this reason, the posterior needs to be approximated (cf. Blei 2012, 81). There are various techniques that can be used to infer the posterior by approximation. They are either sampling-based or variational methods, among them Gibbs sampling (cf. Steyvers/Griffiths 2007), mean field variational inference (cf. Blei et al. 2003), expectation propagation (cf. Minka/Lafferty 2002), or collapsed variational inference (cf. Teh et al. 2006). Only approximation by Gibbs sampling will be outlined at this point, since it is the method used in this thesis: according to Kwartler, it is faster than the other approaches and therefore most appropriate for a large text collection (cf. Kwartler 2017, 156), like the lyrics data set analyzed here.
Using Gibbs sampling, a Markov chain, i.e. a sequence of random variables where each depends on the previous one, is constructed; its limiting distribution is the posterior. Blei describes the process as follows: “The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm is to run the chain for a long time, collect samples from the limiting distribution, and then approximate the distribution with the collected samples” (Blei 2012, 81). The sampling happens sequentially and only stops when the sampled values approximate the target distribution. Gibbs sampling does not deliver direct estimates for θ and β; however, they can be approximated by using posterior estimates of the topic assignments Z. A document collection can be depicted by a set of word indices w_i and document indices d_i for each word token i. During Gibbs sampling, each word token in the text collection is considered sequentially, and the probability of assigning the current word token to each topic is estimated, conditioned on the topic assignments to all other word tokens. A topic is sampled from this conditional distribution and is then stored as the new topic assignment for this word token (cf. Steyvers/Griffiths 2007, 7f.). According to Steyvers/Griffiths, “[...] words are assigned to topics depending on how likely the word is for a topic, as well as how dominant a topic is in a document” (Steyvers/Griffiths 2007, 8). At the start of the Gibbs sampling algorithm, each word token is assigned to a random topic in {1, ..., K}. Each Gibbs sample consists of the set of topic assignments to all N words in the text collection, achieved by a single pass through all documents. Samples that are drawn during the burn-in phase of the algorithm need to be discarded, because they are not fit to estimate the posterior. After the burn-in period, however, the successive samples begin to approximate the target distribution, i.e. the posterior. In order to receive a representative set of samples from the posterior, samples are saved at regular intervals to avoid correlation between them. Gibbs sampling returns direct estimates of the topic assignments Z for each word, but the estimates for θ and β are often wanted as well (cf. Steyvers/Griffiths 2007, 7f.). Steyvers/Griffiths write that these “values correspond to the predictive distributions of sampling a new token of word i from topic [K], and sampling a new token (as of yet unobserved) in document d from topic [K], and are also the posterior means of these quantities conditioned on a particular sample Z” (cf. Steyvers/Griffiths 2007, 8). How LDA is applied in R for this thesis will be explained in section 4.3.
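As a preview, a hedged sketch of fitting such a model with the “topicmodels” package used in this thesis. The corpus here is a stand-in data set bundled with the package, and all control values (K, burn-in, thinning, iterations, priors) are illustrative, not the thesis's actual corpus or final settings.

```r
library(topicmodels)

# Stand-in corpus: a document-term matrix bundled with the package.
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:100, ]

lda_model <- LDA(
  dtm,
  k      = 10,              # pre-specified number of topics K
  method = "Gibbs",
  control = list(
    alpha  = 50 / 10,       # symmetric Dirichlet prior on theta
    delta  = 0.1,           # prior on beta (the eta above; package default)
    burnin = 1000,          # discard samples from the burn-in phase
    thin   = 100,           # keep every 100th sample to avoid correlation
    iter   = 2000,
    seed   = 1
  )
)

posterior(lda_model)$topics[1:3, ]  # per-document topic proportions (theta)
posterior(lda_model)$terms[, 1:5]   # per-topic-per-word probabilities (beta)
```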
2.4.1. Model Tuning
In order to find the number of topics K that needs to be specified before computing the LDA models, various approaches can be applied, e.g. model tuning with the “ldatuning” package in R, calculating the perplexity measure or the maximum log likelihood, as well as n-fold cross-validation. Since n-fold cross-validation takes an unreasonable amount of computational time for this task, it will not be applied here. The use of the “ldatuning” package in R will be further explained in section 4.3.1., where the application of this model tuning method is depicted. A related approach for finding the optimal number of topics K - which is partly implemented in the “ldatuning” package - is proposed by Griffiths/Steyvers: computing the maximum log likelihood of LDA models with various values of K (cf. Griffiths/Steyvers 2004, 5231). They choose a Bayesian approach for determining K, since it is essentially a “problem of model selection”, and as a solution the posterior probability of the different models should be observed. Therefore, they compute the log likelihood of the data, given various models with different topic numbers, to find the optimal K, because it is the “[...] key constituent of this posterior probability” (Griffiths/Steyvers 2004, 5231). The results of this procedure, as implemented in this thesis, can be found in section 4.3.1. The perplexity measure, another method for determining the topic number K, is defined as the “algebraic [...] equivalent to the inverse of the geometric mean per-word likelihood” (Blei et al. 2003, 1008). Perplexity measures how well a probability model fits new data. It takes a previously fitted model as well as a held-out test, or validation, set to compute perplexity scores that allow the comparison of models with different K. The lower the perplexity score, the better the generalization performance, according to Blei et al., which means the model with the lowest perplexity score should be preferred (cf. Blei et al. 2003, 1008). Results of the perplexity computation can be found in section 4.3.1. By using the “ldatuning” package, the perplexity measure, and the highest log likelihood, not only the optimal number of topics K can be determined, but also the other model parameters, such as α, η, or the number of iterations needed for Gibbs sampling, since the whole LDA model is examined, not just the input value for K (cf. Griffiths/Steyvers 2004, 5231).
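A sketch of both procedures, again on the stand-in corpus rather than the lyrics data; the candidate range of K and all settings are illustrative assumptions.

```r
library(topicmodels)
library(ldatuning)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:100, ]

# Compare candidate K values with the metrics bundled in "ldatuning"
# (Griffiths2004 implements the maximum log likelihood approach above).
result <- FindTopicsNumber(
  dtm,
  topics  = seq(5, 50, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1)
)
FindTopicsNumber_plot(result)

# Perplexity on held-out documents: lower scores indicate that the
# model generalizes better to unseen data.
model <- LDA(AssociatedPress[1:80, ], k = 10, method = "Gibbs",
             control = list(seed = 1))
perplexity(model, newdata = AssociatedPress[81:100, ])
```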
2.4.2. Model Evaluation
Since the topics inferred by LDA cannot always be interpreted easily, especially if models with a high number of topics K are computed, statistical measures are increasingly being used for this task (cf. Sievert/Shirley 2014, 64). Some approaches rely on the above-mentioned perplexity and log likelihood in order to determine the goodness of fit and to find out which model performs best (cf. Jones forthcoming, 4; Wallach et al. 2009, 2). Other studies examine, for example, the interpretability of topics according to some ground truth (cf. Chuang et al. 2013), the computation of coherence measures based on topics manually interpreted by humans (cf. Mimno et al. 2011), or n-gram based approaches (cf. Blei/Lafferty 2009). A further solution for examining the goodness of fit of topic models is proposed by Jones (forthcoming): calculating an R² measure, since it can be easily interpreted (cf. Jones forthcoming, 6ff.). However, the most commonly applied method for interpreting the output of topic models remains the description of the topics by their top-n most common words (cf. Ramage et al. 2009, 2). Although this approach is not statistically based and relies solely on human judgment, it will be applied - in addition to examining the log likelihoods - in this thesis and can be reviewed in section 4.3.2. This decision was made because one focus of this thesis is on determining topic similarities, whereas the evaluation of model fit statistics for topic model outputs is a wide field of research in itself; it has not been fully explored yet and still leaves much room for improvement, which shall not be pursued here.
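A minimal sketch of this top-n-words description in R, assuming “lda_model” is a fitted model object from the “topicmodels” package:

library(topicmodels)

terms(lda_model, 15)   # the 15 most probable words for each topic
logLik(lda_model)      # the log likelihood examined alongside them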
2.5. Similarity Measures
In order to compare two topics that were extracted by a topic modeling algorithm, one can compare their per-topic-per-word probability distributions β (cf. Aletras/Stevenson 2014, 22; Mimno et al. 2011, 269; Blei/Chaney 2012, 3). In this way, it is not only possible to compare the topics within one topic model; model-to-model comparisons can be conducted as well by taking one model’s output as the reference for the other model (cf. Chuang et al. 2013, 2). Alternatively, instead of using measures based on the per-topic-per-word probability distributions β, distributional semantic or knowledge-based methods could be applied (cf. Aletras/Stevenson 2014, 22f.). Nevertheless, in this thesis the similarity between the topics within a model and between two topic models will be computed based on their per-topic-per-word probability distributions, which can be derived from the LDA model outputs. There are various methods for measuring the similarity or distance between topics based on these distributions - the two expressions will be used largely interchangeably, since maximum similarity is essentially equivalent to minimum distance - among them Jensen-Shannon divergence, Kullback-Leibler divergence, cosine measures, log ratio, Hellinger distance, and Manhattan distance. In the following subsections, the measures that are utilized in this thesis to compute the similarities within and between topic models will be briefly outlined. Jensen-Shannon divergence, Hellinger distance, and log ratio are applied (in sections 4.3.3 and 4.3.4). They were chosen because they represent three slightly different ways of computing topic similarities; log ratio, for example, allows a graphical representation of the words within topics that are most distinct according to their per-topic-per-word distributions β (cf. Silge/Robinson 2018). Some of the other measures mentioned above, e.g. cosine similarity, were excluded because they require other forms of input data.
2.5.1. Jensen-Shannon Divergence
Jensen-Shannon divergence (JSD) can be used to compute the similarities “between all pairs of topic-word distributions in a given model” (Mimno et al. 2011, 269) to determine how similar or distinct the topics are. It is a modification of Kullback-Leibler divergence (KLD): KLD is not symmetrical, but JSD is (cf. Dagan 2000, 485f.). For two probability functions p(x) and q(x), following Shao/Qin, KLD can be computed as (cf. Shao/Qin 2014, 201):
$$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}$$

JSD is then obtained by symmetrizing KLD: both distributions are compared against their mean distribution m = (p + q)/2, and the two divergences are averaged:

$$D_{JS}(p, q) = \frac{1}{2} D_{KL}(p \,\|\, m) + \frac{1}{2} D_{KL}(q \,\|\, m)$$
The range of JSD is [0, 1], with a value of 0 signifying maximum similarity between the distributions and a value of 1 indicating that there is no similarity (cf. Aletras/Stevenson 2014, 24). JSD is not a true metric, since it does not satisfy the triangle inequality, but it can be converted to a metric by simply taking its square root, √D_JS(p, q) (cf. Pachter 2014; Endres/Schindelin 2003, 1859f.). Further explanation of JSD and the results of the computation can be found in sections 4.3.3 and 4.3.4.
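A small base-R sketch of JSD between two topic-word distributions p and q (e.g. two rows of the β matrix); using base-2 logarithms keeps the result within the [0, 1] range described above.

# Kullback-Leibler divergence; 0 * log(0) is treated as 0 by convention.
kld <- function(a, b) sum(ifelse(a == 0, 0, a * log2(a / b)))

# Jensen-Shannon divergence as the symmetrized KLD to the mean distribution.
jsd <- function(p, q) {
  m <- 0.5 * (p + q)
  0.5 * kld(p, m) + 0.5 * kld(q, m)
}

# The square root of JSD yields the metric version mentioned above.
jsd_metric <- function(p, q) sqrt(jsd(p, q))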
2.5.2. Hellinger Distance
Like JSD, Hellinger distance (HD) allows one to forgo the shortcomings of KLD. For two discrete probability distributions p and q, it is computed as:
$$HD(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{x} \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2}$$
HD has a range of [0, 1], with 1 indicating maximum distance, which occurs when p assigns zero probability to every set to which q assigns a positive probability, and vice versa. By simply calculating 1 − HD(p, q), the distance measure can be transformed into a similarity measure, where a value of 1 equals maximum similarity (cf. Rus et al. 2013, 464). HD is a true metric because it satisfies the four properties required for a distance measure to be a metric (cf. Harsha 2017, 12; Rathore NA). The results can be found in sections 4.3.3 and 4.3.4.
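A minimal base-R sketch of HD under the definition above; for whole matrices of topic-word distributions, the “topicmodels” package also ships a distHellinger() helper.

# Hellinger distance between two discrete distributions p and q.
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

# Transformation into a similarity measure, as described above.
hellinger_sim <- function(p, q) 1 - hellinger(p, q)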
2.5.3. Log Ratio
According to Silge/Robinson, the log ratio is a good measure to determine the greatest differences in the per-topic-per-word distributions β of two topics (β is equivalent to the distribution over words defined in section 2.4, i.e. the topics that the LDA model extracts from the text collection). Computing the log ratio log₂(β₂/β₁) makes the differences symmetrical: if β₂ is twice as large as β₁, the log ratio takes a value of 1; if β₁ is twice as large, it takes a value of −1. Before computing the log ratios between the per-topic-per-word probabilities of two topics, it can be useful to restrict β to words that are reasonably common in at least one topic, since otherwise very rare words dominate the ratios. Another advantage of using log ratios for similarity comparisons is the possibility to represent the results graphically (cf. Silge/Robinson 2018). Further details can be reviewed in section 4.3.3.
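A sketch of this computation following Silge/Robinson, assuming “lda_model” is a fitted “topicmodels” object and topics 1 and 2 are being compared; the β threshold of 1/1000 is an illustrative choice.

library(tidytext)
library(dplyr)
library(tidyr)

beta_log_ratio <- tidy(lda_model, matrix = "beta") %>%  # per-topic-per-word probabilities
  filter(topic %in% c(1, 2)) %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic1 > 1/1000 | topic2 > 1/1000) %>%  # keep reasonably common words
  mutate(log_ratio = log2(topic2 / topic1))      # symmetric difference measure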
3. Data
In the following, the choice of song genres as well as the methods for selecting artists and downloading album meta data and lyrics will be explained. Furthermore, the data pre-processing will be outlined and summary statistics for the final data set will be presented. Web sites - Discogs.com and Wikipedia.com - were scraped for meta information using the “rvest” package; afterwards, the song lyrics were downloaded from Genius.com with the help of the “geniusR” package.
3.1. Data Selection and Web Scraping
The decision to build a customized data set, instead of using an already available one, was made because the data sets that can be found online do not seem suitable for the analyses conducted here. There are many data sets containing exclusively audio data (cf. Lerch NA; Defferrard et al. 2017), and those found containing lyrics, e.g. the Million Song Dataset, either store them in a form (cf. Ellis et al. 2015, 2) that does not allow all the analyses applied here to be performed, or do not contain all the variables needed.
Since an additional download of these variables would have been necessary anyway, the choice was made to extract all the required data independently and build a customized data set. It was decided to analyze song lyrics of five genres over a time period from 1970 to 2018. The genres used for analysis are alternative, country, pop, rock, and hip-hop. The intention behind choosing these genres is not only to find out how well the analytical methods are able to deal with genres that are obviously easily distinguishable from each other, e.g. country and hip-hop, but also to evaluate how genres that might have more similarities, e.g. rock and alternative, will be handled. Analyzing lyrics from the past five decades appears to be a reasonable approach for detecting alterations in text statistics and topics, since, according to Fell/Sporleder, it takes a time span of at least twenty years for changes to become observable (cf. Fell/Sporleder 2014, 628). Researchers who previously studied changes in song lyrics investigated periods of twenty-five to sixty years (cf. Johnson-Roberson/Johnson-Roberson 2013; Kakde et al. 2013; Henard/Rossetti 2014; Fang et al. 2017), and for this reason the choice of exploring lyrics data over the past fifty years appeared to be appropriate.
The easiest way to start building a data set seemed to be searching for artists from the genres and decades mentioned before. Discogs.com proved to be the most straightforward option for this undertaking, since it enables setting filters for musical genres and decades in the quest for most-searched musicians and albums. It is a crowdsourced web site with the goal of becoming the largest online database of music meta data (cf. Discogs.com 2018). Since the query returns the most-searched albums by genre and decade, rather than the most-searched musicians (meaning that some artists appear in the search results more than once), the top 100 entries were scraped respectively, using “rvest” functions in R. Then, only distinct artist names were extracted. This data needed to be cleaned, because some of the musicians were listed under slightly different names (e.g. Bowie and David Bowie, Pink and P!nk, Rolling Stones and The Rolling Stones). After cleaning the data, it was checked for intersections between the aforesaid genres, and with the help of genre information that was likewise scraped from Discogs.com, the musicians were sorted into one genre each. In case of overlapping genres (e.g. rock and pop, or rock and country), additional genre information found on the respective artists’ Wikipedia sites was reviewed, making sure to create mutually exclusive genre groups. However, some artists had to be excluded from further use, since their genre information explicitly stated pop-rock or country-rock, for example. An advantage of primarily using the genre information from Discogs.com is that the genres found there are not appointed by a single person, but rather by the collective of users who tag the artists with the genres they belong to. In this way, the assignment to a genre is not a single person’s decision, but rather something like a majority vote. This process eventually resulted in lists of approximately 20-40 artists per genre and decade. Non-English-speaking artists, as well as one-time collaborations between various musicians, were deleted. In a next step, the Wikipedia sites of the remaining musicians were scraped.
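The scraping step can be sketched roughly as follows; the URL parameters and the CSS selector are hypothetical stand-ins, since the actual structure of the Discogs search pages is not reproduced in this text.

library(rvest)

# Hypothetical search URL with genre and decade filters set:
page <- read_html("https://www.discogs.com/search/?genre_exact=Rock&decade=1970")

# ".search_result_title" is a hypothetical selector for the result entries.
artists <- page %>%
  html_nodes(".search_result_title") %>%
  html_text(trim = TRUE)

unique(artists)  # keep only distinct artist names, as described above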
Only artists who published at least seven albums throughout their career were considered for the final data set. For the decades 2000s and 2010s, artists who published at least four albums were considered, since, according to Jewalikar/Fragapane, it is necessary to analyze at least four albums per musician to determine an artist’s lyrical style (cf. Jewalikar/Fragapane 2015). Although no analyses are made at the artist level here, this seems to be an adequate reference point. The discography information, i.e. album titles, release dates, and highest positions in the US and UK charts, was extracted next. Due to differing URLs, varying numbers of table columns, or Wikipedia data not being presented as an HTML table, this stage of the data-extraction process turned out to be tedious. This procedure resulted in artist lists for each genre and decade, containing the names of the musicians, their respective genre, the titles of their albums, the year of release, and the peak chart position in the US/UK. If the US chart position was not available, the UK position is stated instead. Using this data and the “geniusR” package, song lyrics were downloaded from Genius.com, the biggest collection of song lyrics online (cf. Genius.com 2018), and joined into the final data set.
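The lyrics download can be sketched as follows with “geniusR” (the package was later renamed “genius”); the artist/album pair and the discography table “albums_df” with columns “artist” and “album” are illustrative assumptions.

library(geniusR)
library(purrr)

# A single album's lyrics, returned as a tibble with one row per lyric line:
lyrics <- genius_album(artist = "Johnny Cash", album = "At Folsom Prison")

# Looping over a whole discography table and binding the results:
all_lyrics <- map2_dfr(albums_df$artist, albums_df$album,
                       ~ genius_album(artist = .x, album = .y))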
[...]
- Text citation
- Laura Zapf (Author), 2019, How Did English Songs Evolve? Retrieving Information from Song Lyrics Via Natural Language Processing and Statistical Topic Modeling, Munich, GRIN Verlag, https://www.grin.com/document/997210