September 10, 2015

Introduction

Data Scientist for Trinity Health

  • Healthcare
    • complex and somewhat deep data
    • LOTS of reporting ("how many patients with diabetes?")
    • predicting bad outcomes based on previous events
  • Background: PhD in Bioinformatics
    • Regulation of gene expression
    • Next-generation sequencing data
    • over 1B reads between 50 and 300 bases
    • Deep but not complex data

Outline

  • Creating corpus with tm

  • Visualization techniques for text

  • Clustering with cluster

  • Topic modeling with topicmodels

  • Word representations with word2vec

Motivation

Text is everywhere!

  • Twitter -> 200M tweets/day * 25 words/tweet = 5B words/day!

  • Facebook -> 55M status updates/day

  • Amazon -> customer reviews

  • Healthcare
    • Clinical notes ("family history of high cholesterol")
    • Event descriptions ("patient slipped and fell")

Text source

Song lyrics!

  • Band of choice: TOOL
  • 5 albums
  • famously mysterious, ambiguous lyrics

First, it's a good idea to have this in your .Rprofile, so character data isn't silently converted to factors:

options(stringsAsFactors = FALSE)

Then, create functions to access the musiXmatch API:

  • requires API key
  • artist search to get ID
  • get all album IDs
  • using artist ID and album IDs, get track IDs
  • get lyrics from track IDs
library(XML)

# build the API request for a single track's lyrics
call <- paste0("http://api.musixmatch.com/ws/1.1/track.lyrics.get?",
               "track_id=", track,
               "&apikey=", apikey,
               "&format=xml")

xml <- xmlParse(call)

# extract the lyrics body, returning "NA" if the node is missing or malformed
lyrics <- tryCatch(xmlToDataFrame(nodes = getNodeSet(xml, "//lyrics_body")),
                   error = function(e) "NA")
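
The helper functions used below (get_albums, get_tracks, get_lyrics) all follow this same request-then-parse pattern. As a sketch, get_albums might look like the following; artist.albums.get is a musiXmatch API method, but the exact //album_id node path in the response is an assumption to verify against the API docs:

# sketch of one helper; the //album_id node path is an assumption
get_albums <- function(artist_id, apikey){
  call <- paste0("http://api.musixmatch.com/ws/1.1/artist.albums.get?",
                 "artist_id=", artist_id,
                 "&apikey=", apikey,
                 "&format=xml")
  xml <- xmlParse(call)

  # return a character vector of album IDs
  sapply(getNodeSet(xml, "//album_id"), xmlValue)
}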

Creating a collection of lyrics

library(magrittr)  # provides %>% and %<>%

get_corpus <- function(artist, apikey){
  
  # get all album IDs for a particular artist
  albums <- get_albums(artist, apikey)
  
  # loop through albums to collect track IDs
  tracks <- lapply(albums, get_tracks, apikey) %>% unlist()
  
  # loop through tracks to fetch lyrics
  lyrics <- lapply(tracks, get_lyrics, apikey) %>% unlist()
  
  obj <- list(lyrics = data.frame(lyrics),
              n_albums = length(albums),
              n_tracks = length(tracks))
  
  return(obj)
}

collection <- get_corpus(artist, apikey)

Creating a corpus with tm

Clean up text and build corpus

library(tm)
library(stringr)

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# strip everything except alphanumerics and spaces
collection$lyrics %<>% sapply(., function(x) str_replace_all(x, "[^[:alnum:] ]", ""))

# `other` is a character vector of additional, domain-specific stopwords
custom_stopwords <- c(stopwords("english"), other) %>%
  sapply(., function(x) gsub("[[:punct:]]", "", x), simplify = "array", USE.NAMES = F)

# custom functions must be wrapped in content_transformer();
# stemDocument requires the SnowballC package
corpus <- Corpus(VectorSource(collection$lyrics)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(removeURL)) %>%
  tm_map(removeWords, custom_stopwords) %>%
  tm_map(stemDocument)

# keep only terms at least three characters long
tdm <- corpus %>%
  TermDocumentMatrix(control = list(wordLengths = c(3, Inf)))

dtm <- as.DocumentTermMatrix(tdm)
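
A quick sanity check on the resulting matrix; findFreqTerms() lists every term that appears at least lowfreq times:

# terms appearing 20 or more times across the corpus
findFreqTerms(tdm, lowfreq = 20)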

Term frequency

library(ggplot2)

term.freq <- slam::row_sums(tdm, na.rm = T)
high.freq <- sort(term.freq, decreasing = T)[1:20]

freq.terms <- names(high.freq)
df <- data.frame(term = names(high.freq), freq = high.freq)

# order the factor levels by frequency so bars sort from most to least frequent
df$term <- factor(df$term, levels = df$term[order(df$freq)])

ggplot(df, aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") +
  ylab("Count") +
  coord_flip()

Visualization techniques for text

(obligatory) Word cloud

library(wordcloud)
library(RColorBrewer)

word.freq <- sort(term.freq, decreasing = T)[1:100]
pal2 <- brewer.pal(8, "Dark2")

wordcloud(words = names(word.freq),
          freq = word.freq,
          scale = c(4, 0.1),
          colors = pal2,
          rot.per = 0.15,
          random.color = F,
          random.order = F)

Network of correlated words

# tm's plot method for term-document matrices requires the
# Rgraphviz package (available from Bioconductor)
plot(tdm,
     term = freq.terms,
     corThreshold = 0.2,
     weighting = T)
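
Without Rgraphviz, tm's findAssocs() reports the same term correlations in text form; "love" here is just a placeholder term:

# terms correlated with "love" at r >= 0.2
findAssocs(tdm, terms = "love", corlimit = 0.2)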

Clustering with cluster

k-means clustering

  • unsupervised learning
  • group n documents into k clusters

Weighting terms for clustering

  • term frequency-inverse document frequency (tf-idf)
  • a term's frequency in a document, offset by how common the term is across the corpus

Example: N = # documents, d = # documents with term

  • "information content" of a term: log(N/d)
    • rare term = high idf: log(100/4) = 4.64
    • common term = low idf: log(100/60) = 0.74
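
tm's weightTfIdf() uses this base-2 idf; the numbers above can be checked directly:

N <- 100
log2(N / 4)   # rare term, in 4 of 100 documents: ~4.64
log2(N / 60)  # common term, in 60 of 100 documents: ~0.74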
tdm_tfxidf <- weightTfIdf(tdm)

# pairwise Euclidean distances between documents in tf-idf space
d <- stats::dist(t(tdm_tfxidf), method = "euclidean")

k <- 15

set.seed(12345)
kmean <- kmeans(d, k)

# proportion of total variance explained by the clustering
kmean$betweenss/kmean$totss
## [1] 0.8028445

Automatic labeling of clusters

for(i in 1:k){
  # mean tf-idf weight of each term within cluster i
  inGroup <- slam::row_means(tdm_tfxidf[, kmean$cluster == i])

  # the five highest-weighted terms label the cluster
  words <- names(sort(inGroup, decreasing = T)[1:5])
  
  cat("Cluster ", i, ": ", words, "\n")
}
## Cluster  1 :  let take way back another 
## Cluster  2 :  satan kinda anyway engine funny 
## Cluster  3 :  others unto done young bit 
## Cluster  4 :  mine suck comfort calling man 
## Cluster  5 :  beyond infancy lines reaching white 
## Cluster  6 :  say serious even want pushin 
## Cluster  7 :  part better mention weather scars 
## Cluster  8 :  punishment cure wrong someone told 
## Cluster  9 :  bog easy thick lost belligerent 
## Cluster  10 :  fret arizona bay circus fix 
## Cluster  11 :  want away eleven frightened broken 
## Cluster  12 :  need bear belong borderline boredoms 
## Cluster  13 :  abominatio ache acid across acts 
## Cluster  14 :  every comfortable twice past whistle 
## Cluster  15 :  eyed form hold wide one

Topic modeling with topicmodels