library(dplyr)
library(ggplot2)
library(magrittr)
library(stringr)
library(DT)
options(stringsAsFactors = FALSE)
# Write a tab-delimited text file without quoting or row names
write.delim <- function(x, file, sep = '\t', quote = FALSE, row.names = FALSE, na = '', ...) {
  write.table(x = x, file = file, sep = sep, quote = quote, row.names = row.names, na = na, ...)
}
Here, we parse the LabeledIn resource (Khare, Li, and Lu 2014), which can be downloaded here.
fieldnames <- c('study_drug_label_ID', 'DailyMed_SPL_ID', 'UMLS_CUIs', 'IN_RXCUI', 'SCDF_RXCUI', 'SCD_RXCUI', 'Other_SCDF_RXCUI', 'Other_SCD_RXCUI')
# Read the pipe-delimited LabeledIn results, keeping every field as character
results.df <- file.path('download', 'LabeledIn_Structured_Results.txt') %>%
  read.table(sep = '|', quote = '', comment.char = '', colClasses = 'character', col.names = fieldnames, na.strings = '')
results.df %>% head %>% DT::datatable()
# Expand one label row into one row per disease CUI found in UMLS_CUIs
expand_indications <- function(df) {
  data.frame(
    rxnorm_id = df$IN_RXCUI,
    disease_cui = stringr::str_extract_all(df$UMLS_CUIs, "C\\d+")[[1]])
}
# Count the labels supporting each ingredient-disease pair
indication.df <- results.df %>%
  dplyr::select(UMLS_CUIs, IN_RXCUI) %>%
  dplyr::filter(! is.na(UMLS_CUIs)) %>%
  dplyr::rowwise() %>%
  dplyr::do(expand_indications(.)) %>%
  dplyr::ungroup() %>%
  dplyr::group_by(rxnorm_id, disease_cui) %>%
  dplyr::summarize(n_labels = n()) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(rxnorm_id, desc(n_labels), disease_cui)
indication.df %>% DT::datatable()
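The expansion and counting step above can be illustrated on a small example. The following is a minimal base-R sketch, not the notebook's actual pipeline: `toy.df` and its identifiers are made up for illustration, and `regmatches`/`aggregate` stand in for the `stringr` and `dplyr` calls.

```r
# Hypothetical toy input mimicking two columns of results.df
toy.df <- data.frame(
  IN_RXCUI = c('1001', '1001', '1002'),
  UMLS_CUIs = c('C0000001|C0000002', 'C0000001', NA),
  stringsAsFactors = FALSE
)

# Drop labels without any annotated CUI
toy.df <- toy.df[!is.na(toy.df$UMLS_CUIs), ]

# Extract every CUI-like token (C followed by digits) from each row
cui.list <- regmatches(toy.df$UMLS_CUIs, gregexpr('C[0-9]+', toy.df$UMLS_CUIs))

# Repeat each ingredient once per extracted CUI
pair.df <- data.frame(
  rxnorm_id = rep(toy.df$IN_RXCUI, times = lengths(cui.list)),
  disease_cui = unlist(cui.list),
  stringsAsFactors = FALSE
)

# Count the number of labels supporting each ingredient-disease pair
count.df <- aggregate(n_labels ~ rxnorm_id + disease_cui,
  data = transform(pair.df, n_labels = 1L), FUN = sum)
```

In this toy input, ingredient `1001` appears on two labels that both mention `C0000001`, so that pair receives the higher label count, while the label with a missing `UMLS_CUIs` value contributes nothing.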
Download the CSV-formatted RxNorm from BioPortal for the 2014AA UMLS release.
# Create a data.frame of ingredients (drugs)
ingredient.df <- indication.df %>%
  dplyr::group_by(rxnorm_id) %>%
  dplyr::summarize(n_indications = n()) %>%
  dplyr::left_join(
    y = results.df %>%
      dplyr::transmute(label_id = study_drug_label_ID, rxnorm_id = IN_RXCUI) %>%
      dplyr::group_by(rxnorm_id) %>%
      dplyr::summarize(n_labels = n()),
    by = 'rxnorm_id'
  )
# Add RxNorm names and CUIs, keeping every ingredient even if unmatched
ingredient.df <- file.path('download', 'RXNORM.csv.gz') %>%
  read.csv(colClasses = 'character', check.names = FALSE) %>%
  dplyr::transmute(
    rxnorm_id = stringr::str_replace(`Class ID`, fixed('http://purl.bioontology.org/ontology/RXNORM/'), ''),
    rxnorm_name = `Preferred Label`,
    ingredient_cui = CUI) %>%
  dplyr::right_join(ingredient.df, by = 'rxnorm_id')
ingredient.df %>% DT::datatable()
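The identifier cleanup above relies on treating the BioPortal prefix as a literal string rather than a regular expression. A minimal base-R sketch of the same idea, where `class.ids` is a made-up vector and `sub(..., fixed = TRUE)` stands in for `stringr::str_replace` with `fixed()`:

```r
# Prefix used by the BioPortal RxNorm export for its `Class ID` column
prefix <- 'http://purl.bioontology.org/ontology/RXNORM/'

# Hypothetical class identifiers for illustration
class.ids <- c(paste0(prefix, '1001'), paste0(prefix, '1002'))

# fixed = TRUE matches the prefix literally, so '.' and '/' are not regex metacharacters
rxnorm.ids <- sub(prefix, '', class.ids, fixed = TRUE)
```

Literal matching matters here because the prefix contains `.` characters, which would otherwise match any character.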
Download the CSV-formatted MeSH from BioPortal for the 2014AA UMLS release.
# Run only once
gz <- file.path('data', 'mesh-umls-map.txt.gz') %>% gzfile('w')
file.path('download', 'MESH.csv.gz') %>%
  read.csv(colClasses = 'character', check.names = FALSE) %>%
  dplyr::transmute(
    mesh_id = stringr::str_replace(`Class ID`, fixed('http://purl.bioontology.org/ontology/MESH/'), ''),
    mesh_name = `Preferred Label`,
    cuis = CUI) %>%
  dplyr::rowwise() %>%
  # Split pipe-delimited CUIs so each MeSH term gets one row per CUI
  dplyr::do({data.frame(.,
    cui = stringr::str_split(.$cuis, fixed('|'))[[1]])
  }) %>%
  dplyr::select(-cuis) %>%
  write.delim(gz)
close(gz)
Not all UMLS diseases map to a MeSH term, so we will need to find a better solution.
Khare, Li, and Lu (2014) describe how the LabeledIn disease vocabulary was generated:
2.4. Automatic disease recognition
The goal of this module was to identify all disease mentions as indication candidates from the textual descriptions of a given drug label. For this study, we prepared a disease lexicon using two seed ontologies, MeSH and SNOMED-CT, respectively useful for annotating scientific articles [30], [32] and [37] and clinical documents [31], [38] and [39]. The lexicon consists of 77464 concepts taken from: (i) the disease branch in MeSH, and (ii) the 11 disorder semantic types (UMLS disorder semantic types excluding ‘Finding’) in SNOMED-CT as recommended in a recent shared task [30].
As for the automatic tool, we applied MetaMap [27], a highly configurable program used for mapping biomedical texts to the UMLS identifying the mentions, offsets, and associated CUIs. We used the 2012 MetaMap Java API release that uses the 2012AB version of the UMLS Metathesaurus. We experimented with multiple settings of MetaMap, and the optimal setting method for this study is illustrated in Fig. 4.
# Create a data.frame of diseases
disease.df <- indication.df %>%
  dplyr::group_by(disease_cui) %>%
  dplyr::summarize(n_indications = n())
# Add MeSH identifiers and names, keeping diseases without a MeSH mapping
disease.df <- file.path('data', 'mesh-umls-map.txt.gz') %>%
  read.delim(quote = '', comment.char = '') %>%
  dplyr::select(disease_cui = cui, disease_mesh_id = mesh_id, disease_mesh_name = mesh_name) %>%
  dplyr::right_join(disease.df, by = 'disease_cui')
disease.df %>% DT::datatable()
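To gauge how incomplete the MeSH mapping is, one option is to measure the share of disease CUIs that come out of the right join without a MeSH identifier. The following is a minimal base-R sketch on made-up tables (`map.df` and `toy.disease.df` are hypothetical), with `merge(..., all.y = TRUE)` standing in for `dplyr::right_join`:

```r
# Hypothetical toy mapping and disease tables
map.df <- data.frame(
  disease_cui = c('C0000001', 'C0000002'),
  disease_mesh_id = c('D000001', 'D000002'),
  stringsAsFactors = FALSE
)
toy.disease.df <- data.frame(
  disease_cui = c('C0000001', 'C0000002', 'C0000003', 'C0000004'),
  n_indications = c(5L, 2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Right join: keep every disease, filling NA where no MeSH term maps
joined.df <- merge(map.df, toy.disease.df, by = 'disease_cui', all.y = TRUE)

# Fraction of disease CUIs without a MeSH mapping
unmapped <- is.na(joined.df$disease_mesh_id)
frac.unmapped <- mean(unmapped)

# Fraction of indications affected, weighting each disease by n_indications
frac.indications <- sum(joined.df$n_indications[unmapped]) / sum(joined.df$n_indications)
```

Weighting by `n_indications` distinguishes a mapping gap confined to rarely indicated diseases from one that affects a large share of the ingredient-disease pairs.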
# Show each indication with its disease and ingredient names
indication.df %>%
  dplyr::left_join(disease.df %>% dplyr::select(disease_cui, disease_mesh_name), by = 'disease_cui') %>%
  dplyr::left_join(ingredient.df %>% dplyr::select(rxnorm_id, rxnorm_name), by = 'rxnorm_id') %>%
  DT::datatable()
Khare, Ritu, Jiao Li, and Zhiyong Lu. 2014. “LabeledIn: Cataloging Labeled Indications for Human Drugs.” Journal of Biomedical Informatics 52 (December). Elsevier BV: 448–56. doi:10.1016/j.jbi.2014.08.004. http://dx.doi.org/10.1016/j.jbi.2014.08.004.
To the extent possible under law, Daniel Himmelstein has waived all copyright and related or neighboring rights to Indications. This work is published from: United States.