library(dplyr)
library(ggplot2)
library(magrittr)
library(stringr)
library(DT)

options(stringsAsFactors = FALSE)

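# Write a tab-delimited text file: no quoting, no row names, and NA written as an empty string.
write.delim <- function(x, file, sep = '\t', quote = FALSE, row.names = FALSE, na = '', ...) {
  write.table(x = x, file = file, sep = sep, quote = quote, row.names = row.names, na = na, ...)
}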

Here, we parse the LabeledIn resource (Khare, Li, and Lu 2014), which can be downloaded here.

fieldnames <- c('study_drug_label_ID', 'DailyMed_SPL_ID', 'UMLS_CUIs', 'IN_RXCUI', 'SCDF_RXCUI', 'SCD_RXCUI', 'Other_SCDF_RXCUI', 'Other_SCD_RXCUI')
results.df <- file.path('download', 'LabeledIn_Structured_Results.txt') %>%
  read.table(sep = '|', quote = '', comment.char = '', colClasses = 'character', col.names = fieldnames, na.strings = '')

results.df %>% head %>% DT::datatable()
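
As a minimal sketch of what the read.table call above does, the following parses two made-up pipe-delimited records with the same options. The record values, and the semicolon between CUIs, are assumptions for illustration only and are not taken from the LabeledIn file.

```r
# Hypothetical sample records (8 pipe-delimited fields each); not real LabeledIn data.
sample_lines <- c(
  'L1|spl-0001|C0020538;C0018801|1000|2000|3000||',
  'L2|spl-0002||1001|2001|3001||'
)
fieldnames <- c('study_drug_label_ID', 'DailyMed_SPL_ID', 'UMLS_CUIs', 'IN_RXCUI',
                'SCDF_RXCUI', 'SCD_RXCUI', 'Other_SCDF_RXCUI', 'Other_SCD_RXCUI')

# quote = '' and comment.char = '' stop quote and '#' characters from being
# treated specially; na.strings = '' turns empty fields into NA.
sample.df <- read.table(text = sample_lines, sep = '|', quote = '', comment.char = '',
                        colClasses = 'character', col.names = fieldnames, na.strings = '')

# Labels without any recognized CUI show up as NA and are filtered out later.
is.na(sample.df$UMLS_CUIs)
```

Reading every column as character keeps identifiers such as RXCUIs from being coerced to numbers.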

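# Expand one label row into one row per (ingredient, disease CUI) pair.
# Expects a single row whose UMLS_CUIs field contains at least one CUI.
expand_indications <- function(df) {
  data.frame(
    rxnorm_id = df$IN_RXCUI,
    disease_cui = stringr::str_extract_all(df$UMLS_CUIs, "C\\d+")[[1]])
}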

indication.df <- results.df %>% 
  dplyr::select(UMLS_CUIs, IN_RXCUI) %>%
  dplyr::filter(! is.na(UMLS_CUIs)) %>% 
  dplyr::rowwise() %>%
  dplyr::do(expand_indications(.)) %>%
  dplyr::ungroup() %>% 
  dplyr::group_by(rxnorm_id, disease_cui) %>%
  dplyr::summarize(n_labels = n()) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(rxnorm_id, desc(n_labels), disease_cui)

indication.df %>% DT::datatable()
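
The expand-then-count logic above can be illustrated on toy data. This is a base R sketch rather than the dplyr pipeline used in this document, and the three label rows are hypothetical.

```r
# Hypothetical labels: two for ingredient 1000, one for ingredient 1001.
labels <- data.frame(
  IN_RXCUI = c('1000', '1000', '1001'),
  UMLS_CUIs = c('C0020538;C0018801', 'C0020538', 'C0011849'),
  stringsAsFactors = FALSE
)

# Extract every CUI-like token (a 'C' followed by digits) from each label.
cuis <- regmatches(labels$UMLS_CUIs, gregexpr('C[0-9]+', labels$UMLS_CUIs))

# One row per (ingredient, disease CUI) mention.
pairs <- data.frame(
  rxnorm_id = rep(labels$IN_RXCUI, lengths(cuis)),
  disease_cui = unlist(cuis),
  n_labels = 1L,
  stringsAsFactors = FALSE
)

# Count the labels supporting each pair, then sort as in the pipeline above.
counts <- aggregate(n_labels ~ rxnorm_id + disease_cui, data = pairs, FUN = sum)
counts <- counts[order(counts$rxnorm_id, -counts$n_labels, counts$disease_cui), ]
counts
```

Here the pair (1000, C0020538) appears on two labels, so it sorts ahead of the other indication for that ingredient.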

RxNORM Ingredients

Download the CSV-formatted RxNORM from BioPortal for the 2014AA UMLS release.

# Create a data.frame of ingredients (drugs)
ingredient.df <- indication.df %>%
  dplyr::group_by(rxnorm_id) %>%
  dplyr::summarize(n_indications = n()) %>%
  dplyr::left_join(
    y = results.df %>%
      dplyr::transmute(label_id = study_drug_label_ID, rxnorm_id = IN_RXCUI) %>%
      dplyr::group_by(rxnorm_id) %>%
      dplyr::summarize(n_labels = n()),
    by = 'rxnorm_id'
    )

ingredient.df <- file.path('download', 'RXNORM.csv.gz') %>%
  read.csv(colClasses = 'character', check.names = FALSE) %>% 
  dplyr::transmute(
    rxnorm_id = stringr::str_replace(`Class ID`, fixed('http://purl.bioontology.org/ontology/RXNORM/'), ''),
    rxnorm_name = `Preferred Label`,
    ingredient_cui = CUI) %>%
  dplyr::right_join(ingredient.df, by = 'rxnorm_id')

ingredient.df %>% DT::datatable()
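
The two steps above, stripping the BioPortal URI prefix and right-joining onto the ingredient table, can be sketched in base R as follows. The drug names, identifiers, and counts are hypothetical; merge(all.y = TRUE) stands in for dplyr::right_join.

```r
prefix <- 'http://purl.bioontology.org/ontology/RXNORM/'

# Hypothetical excerpt of the BioPortal export after reading it as character.
rxnorm <- data.frame(
  class_id = paste0(prefix, c('1000', '1001')),
  rxnorm_name = c('drug A', 'drug B'),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats the prefix as a literal string, not a regular expression.
rxnorm$rxnorm_id <- sub(prefix, '', rxnorm$class_id, fixed = TRUE)

# Hypothetical ingredient table; '9999' is absent from the RxNORM excerpt.
ingredients <- data.frame(
  rxnorm_id = c('1000', '9999'),
  n_indications = c(2L, 1L),
  stringsAsFactors = FALSE
)

# Keep every ingredient; names are NA where RxNORM has no matching record.
joined <- merge(rxnorm[, c('rxnorm_id', 'rxnorm_name')], ingredients,
                by = 'rxnorm_id', all.y = TRUE)
joined
```

Ingredients that come back with a missing name are worth inspecting, since they may indicate identifiers that are not in the downloaded release.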

MESH Diseases

Download the CSV-formatted MESH from BioPortal for the 2014AA UMLS release.

# Run only once: writes the MESH-to-UMLS CUI map to data/mesh-umls-map.txt.gz

gz <- file.path('data', 'mesh-umls-map.txt.gz') %>% gzfile('w')

file.path('download', 'MESH.csv.gz') %>%
  read.csv(colClasses = 'character', check.names = FALSE) %>%
  dplyr::transmute(
    mesh_id = stringr::str_replace(`Class ID`, fixed('http://purl.bioontology.org/ontology/MESH/'), ''),
    mesh_name = `Preferred Label`,
    cuis = CUI) %>% 
  dplyr::rowwise() %>% 
  dplyr::do({data.frame(.,
    cui = stringr::str_split(.$cuis, fixed('|'))[[1]])
    }) %>% 
  dplyr::select(-cuis) %>%
  write.delim(gz); close(gz)
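
The reshaping above turns one row per MESH term, with pipe-separated CUIs, into one row per (term, CUI) pair. A base R sketch on hypothetical terms, including a round trip through a gzip-compressed temporary file, might look like this:

```r
# Hypothetical MESH excerpt; identifiers and names are made up.
mesh <- data.frame(
  mesh_id = c('D000001', 'D000002'),
  mesh_name = c('Term A', 'Term B'),
  cuis = c('C0000001|C0000002', 'C0000003'),
  stringsAsFactors = FALSE
)

# fixed = TRUE is required because '|' means alternation in a regular expression.
cui_list <- strsplit(mesh$cuis, '|', fixed = TRUE)

# Repeat each term once per CUI it maps to.
mesh_long <- data.frame(
  mesh_id = rep(mesh$mesh_id, lengths(cui_list)),
  mesh_name = rep(mesh$mesh_name, lengths(cui_list)),
  cui = unlist(cui_list),
  stringsAsFactors = FALSE
)

# Write through a gzfile connection, then read it back; read.delim
# decompresses gzip files transparently when given a path.
path <- tempfile(fileext = '.txt.gz')
gz <- gzfile(path, 'w')
write.table(mesh_long, gz, sep = '\t', quote = FALSE, row.names = FALSE, na = '')
close(gz)
roundtrip <- read.delim(path, quote = '', comment.char = '', colClasses = 'character')
```

Closing the connection before reading matters, because the compressed stream is only flushed to disk on close.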

Not all UMLS disease concepts are mapped through MESH, so we will need to find a better solution.

The LabeledIn disease vocabulary is generated as follows, quoting Khare, Li, and Lu (2014):

2.4. Automatic disease recognition

The goal of this module was to identify all disease mentions as indication candidates from the textual descriptions of a given drug label. For this study, we prepared a disease lexicon using two seed ontologies, MeSH and SNOMED-CT, respectively useful for annotating scientific articles [30], [32] and [37] and clinical documents [31], [38] and [39]. The lexicon consists of 77464 concepts taken from: (i) the disease branch in MeSH, and (ii) the 11 disorder semantic types (UMLS disorder semantic types excluding ‘Finding’) in SNOMED-CT as recommended in a recent shared task [30].

As for the automatic tool, we applied MetaMap [27], a highly configurable program used for mapping biomedical texts to the UMLS identifying the mentions, offsets, and associated CUIs. We used the 2012 MetaMap Java API release that uses the 2012AB version of the UMLS Metathesaurus. We experimented with multiple settings of MetaMap, and the optimal setting method for this study is illustrated in Fig. 4.

# Create a data.frame of diseases
disease.df <- indication.df %>%
  dplyr::group_by(disease_cui) %>%
  dplyr::summarize(n_indications = n())

disease.df <- file.path('data', 'mesh-umls-map.txt.gz') %>%
  read.delim(quote = '', comment.char = '') %>%
  dplyr::select(disease_cui = cui, disease_mesh_id = mesh_id, disease_mesh_name = mesh_name) %>%
  dplyr::right_join(disease.df, by = 'disease_cui')

disease.df %>% DT::datatable()
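
One way to gauge how incomplete the MESH mapping is would be to count the disease CUIs left without a MESH identifier after the right join. The sketch below uses a hypothetical three-row table with the same column names and assumes one row per CUI; the numbers are not results from LabeledIn.

```r
# Hypothetical joined disease table; NA marks CUIs with no MESH match.
disease <- data.frame(
  disease_cui = c('C0000001', 'C0000002', 'C0000003'),
  disease_mesh_id = c('D000001', NA, NA),
  n_indications = c(5L, 2L, 1L),
  stringsAsFactors = FALSE
)

unmapped <- is.na(disease$disease_mesh_id)

# Number and share of disease CUIs without a MESH term.
n_unmapped <- sum(unmapped)
frac_unmapped <- mean(unmapped)

# Share of indications that would be lost if unmapped diseases were dropped.
frac_indications_lost <- sum(disease$n_indications[unmapped]) / sum(disease$n_indications)
```

Weighting by n_indications shows whether the unmapped concepts are rare edge cases or account for a large share of the indications.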

indication.df %>%
  dplyr::left_join(disease.df %>% dplyr::select(disease_cui, disease_mesh_name), by = 'disease_cui') %>% 
  dplyr::left_join(ingredient.df %>% dplyr::select(rxnorm_id, rxnorm_name), by = 'rxnorm_id') %>% 
  DT::datatable()

References

Khare, Ritu, Jiao Li, and Zhiyong Lu. 2014. “LabeledIn: Cataloging Labeled Indications for Human Drugs.” Journal of Biomedical Informatics 52 (December). Elsevier BV: 448–56. doi:10.1016/j.jbi.2014.08.004. http://dx.doi.org/10.1016/j.jbi.2014.08.004.

To the extent possible under law, Daniel Himmelstein has waived all copyright and related or neighboring rights to Indications. This work is published from: United States.

Parsing LabeledIn

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

February 10, 2015
