September 10, 2015

Introduction

Data Scientist for Trinity Health

  • Healthcare
    • complex and somewhat deep data
    • LOTS of reporting ("how many patients with diabetes?")
    • predicting bad outcomes based on previous events
  • Background: PhD in Bioinformatics
    • Regulation of gene expression
    • Next-generation sequencing data
    • over 1B reads between 50 and 300 bases
    • Deep but not complex data

Outline

  • Creating corpus with tm

  • Visualization techniques for text

  • Clustering with cluster

  • Topic modeling with topicmodels

  • Word representations with word2vec

Motivation

Text is everywhere!

  • Twitter -> 200M tweets/day * 25 words/tweet = 5B words/day!

  • Facebook -> 55M status updates/day

  • Amazon -> customer reviews

  • Healthcare
    • Clinical notes ("family history of high cholesterol")
    • Event descriptions ("patient slipped and fell")

Text source

Song lyrics!

  • Band of choice: TOOL
  • 5 albums
  • famously mysterious, ambiguous lyrics

First, it's a good idea to have this in your .Rprofile, so character data isn't silently converted to factors:

options(stringsAsFactors = FALSE)

Then, create functions to access the musiXmatch API:

  • requires API key
  • artist search to get ID
  • get all album IDs
  • using artist ID and album IDs, get track IDs
  • get lyrics from track IDs
library(XML)

# build the API request for a single track's lyrics
call <- paste0("http://api.musixmatch.com/ws/1.1/track.lyrics.get?",
               "track_id=", track,
               "&apikey=", apikey,
               "&format=xml")

xml <- xmlParse(call)

# extract the lyrics body, returning "NA" if the node is missing or malformed
lyrics <- tryCatch(xmlToDataFrame(nodes = getNodeSet(xml, "//lyrics_body")),
                   error = function(e) "NA")
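
The helper functions used below (get_albums, get_tracks, get_lyrics) all follow this same request-then-parse pattern. As a sketch, get_albums might look like the following; artist.albums.get is a musiXmatch API method, but the exact //album_id node path in the response is an assumption to verify against the API docs:

# sketch of one helper; the //album_id node path is an assumption
get_albums <- function(artist_id, apikey){
  call <- paste0("http://api.musixmatch.com/ws/1.1/artist.albums.get?",
                 "artist_id=", artist_id,
                 "&apikey=", apikey,
                 "&format=xml")
  xml <- xmlParse(call)

  # return a character vector of album IDs
  sapply(getNodeSet(xml, "//album_id"), xmlValue)
}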

Creating a collection of lyrics

library(magrittr)  # provides %>% and %<>%

get_corpus <- function(artist, apikey){
  
  # get all album IDs for a particular artist
  albums <- get_albums(artist, apikey)
  
  # loop through albums to collect track IDs
  tracks <- lapply(albums, get_tracks, apikey) %>% unlist()
  
  # loop through tracks to fetch lyrics
  lyrics <- lapply(tracks, get_lyrics, apikey) %>% unlist()
  
  obj <- list(lyrics = data.frame(lyrics),
              n_albums = length(albums),
              n_tracks = length(tracks))
  
  return(obj)
}

collection <- get_corpus(artist, apikey)

Creating a corpus with tm

Clean up text and build corpus

library(tm)
library(stringr)

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# strip everything except alphanumerics and spaces
collection$lyrics %<>% sapply(., function(x) str_replace_all(x, "[^[:alnum:] ]", ""))

# `other` is a character vector of additional, domain-specific stopwords
custom_stopwords <- c(stopwords("english"), other) %>%
  sapply(., function(x) gsub("[[:punct:]]", "", x), simplify = "array", USE.NAMES = F)

# custom functions must be wrapped in content_transformer();
# stemDocument requires the SnowballC package
corpus <- Corpus(VectorSource(collection$lyrics)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(removeURL)) %>%
  tm_map(removeWords, custom_stopwords) %>%
  tm_map(stemDocument)

# keep only terms at least three characters long
tdm <- corpus %>%
  TermDocumentMatrix(control = list(wordLengths = c(3, Inf)))

dtm <- as.DocumentTermMatrix(tdm)
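
A quick sanity check on the resulting matrix; findFreqTerms() lists every term that appears at least lowfreq times:

# terms appearing 20 or more times across the corpus
findFreqTerms(tdm, lowfreq = 20)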

Term frequency

library(ggplot2)

term.freq <- slam::row_sums(tdm, na.rm = T)
high.freq <- sort(term.freq, decreasing = T)[1:20]

freq.terms <- names(high.freq)
df <- data.frame(term = names(high.freq), freq = high.freq)

# order the factor levels by frequency so bars sort from most to least frequent
df$term <- factor(df$term, levels = df$term[order(df$freq)])

ggplot(df, aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") +
  ylab("Count") +
  coord_flip()

Visualization techniques for text

(obligatory) Word cloud

library(wordcloud)
library(RColorBrewer)

word.freq <- sort(term.freq, decreasing = T)[1:100]
pal2 <- brewer.pal(8, "Dark2")

wordcloud(words = names(word.freq),
          freq = word.freq,
          scale = c(4, 0.1),
          colors = pal2,
          rot.per = 0.15,
          random.color = F,
          random.order = F)

Network of correlated words

# tm's plot method for term-document matrices requires the
# Rgraphviz package (available from Bioconductor)
plot(tdm,
     term = freq.terms,
     corThreshold = 0.2,
     weighting = T)
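
Without Rgraphviz, tm's findAssocs() reports the same term correlations in text form; "love" here is just a placeholder term:

# terms correlated with "love" at r >= 0.2
findAssocs(tdm, terms = "love", corlimit = 0.2)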

Clustering with cluster

k-means clustering

  • unsupervised learning
  • group n documents into k clusters

Weighting terms for clustering

  • term frequency-inverse document frequency (tf-idf)
  • a term's frequency in a document, offset by how common the term is across the corpus

Example: N = # documents, d = # documents with term

  • "information content" of a term: log(N/d)
    • rare term = high idf: log(100/4) = 4.64
    • common term = low idf: log(100/60) = 0.74
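
tm's weightTfIdf() uses this base-2 idf; the numbers above can be checked directly:

N <- 100
log2(N / 4)   # rare term, in 4 of 100 documents: ~4.64
log2(N / 60)  # common term, in 60 of 100 documents: ~0.74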
tdm_tfxidf <- weightTfIdf(tdm)

# pairwise Euclidean distances between documents in tf-idf space
d <- stats::dist(t(tdm_tfxidf), method = "euclidean")

k <- 15

set.seed(12345)
kmean <- kmeans(d, k)

# proportion of total variance explained by the clustering
kmean$betweenss/kmean$totss
## [1] 0.8028445

Automatic labeling of clusters

for(i in 1:k){
  # mean tf-idf weight of each term within cluster i
  inGroup <- slam::row_means(tdm_tfxidf[, kmean$cluster == i])

  # the five highest-weighted terms label the cluster
  words <- names(sort(inGroup, decreasing = T)[1:5])
  
  cat("Cluster ", i, ": ", words, "\n")
}
## Cluster  1 :  let take way back another 
## Cluster  2 :  satan kinda anyway engine funny 
## Cluster  3 :  others unto done young bit 
## Cluster  4 :  mine suck comfort calling man 
## Cluster  5 :  beyond infancy lines reaching white 
## Cluster  6 :  say serious even want pushin 
## Cluster  7 :  part better mention weather scars 
## Cluster  8 :  punishment cure wrong someone told 
## Cluster  9 :  bog easy thick lost belligerent 
## Cluster  10 :  fret arizona bay circus fix 
## Cluster  11 :  want away eleven frightened broken 
## Cluster  12 :  need bear belong borderline boredoms 
## Cluster  13 :  abominatio ache acid across acts 
## Cluster  14 :  every comfortable twice past whistle 
## Cluster  15 :  eyed form hold wide one

Topic modeling with topicmodels