BD2K ThinkTank on EHR Data
Semantic harmonization,
definition, content, ontologies:
OHDSI, CDRNs, NLP
George Hripcsak, MD, MS
Biomedical Informatics
Observational Health Data
Sciences and Informatics
(OHDSI)
• Grew out of OMOP
• Mission
– large-scale analysis of observational health
databases for population-level estimation and
patient-level predictions
• Vision
– international network of 1,000,000,000 patients to
generate evidence
Explore all drugs for a given outcome
Naproxen has one of the most significant associations with GI bleed, along with other NSAIDs
OHDSI
• Experiments to generate evidence
• Applications (tools)
• Data
• Methods
How OHDSI works
[Diagram: Source data warehouse, with identifiable patient-level data → ETL → Standardized, de-identified patient-level database (OMOP CDM v5) → Summary statistics results repository → OHDSI.org]
[Diagram labels, evidence considerations: Consistency; Temporality; Strength; Plausibility; Experiment; Coherence; Biological gradient; Specificity; Analogy]
[Diagram labels, uses: Comparative effectiveness; Predictive modeling]
[Diagram labels, network: OHDSI Data Partners; OHDSI Coordinating Center; Standardized large-scale analytics; Analysis results; Analytics development and testing; Research and education; Data network support]
OHDSI
• Each site retains its own data
• Use a deep information model
– Concepts, terminologies, conceptual relations
– “OMOP Common Data Model (v4, v5)”
– Strictly defines terminology, mappings
– Supports world-wide queries
• Advanced observational research methods
• Summarize or aggregate results centrally
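The "each site retains its own data, results are aggregated centrally" pattern can be sketched roughly as follows. This is an illustrative sketch only, not OHDSI code: the names `SiteSummary`, `summarize_site`, and `pool` are hypothetical, and the toy patient lists are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Aggregate counts a site shares; no patient-level rows leave the site."""
    site: str
    exposed_with_outcome: int
    exposed_total: int


def summarize_site(site, patients):
    """Run locally at each site. `patients` is a list of (exposed, outcome) booleans."""
    outcomes_among_exposed = [outcome for exposed, outcome in patients if exposed]
    return SiteSummary(site, sum(outcomes_among_exposed), len(outcomes_among_exposed))


def pool(summaries):
    """Run centrally: combine site-level counts into one pooled proportion."""
    events = sum(s.exposed_with_outcome for s in summaries)
    total = sum(s.exposed_total for s in summaries)
    return events / total if total else float("nan")


# Toy data (invented): each tuple is (exposed to drug, had outcome).
site_a = summarize_site("A", [(True, True), (True, False), (False, False)])
site_b = summarize_site("B", [(True, False), (True, False), (False, True)])
pooled_risk = pool([site_a, site_b])
```

Only the `SiteSummary` objects cross the site boundary here; a real analysis would share richer summaries (for example, model coefficients or stratified counts) rather than a single proportion.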
CDRN Alignment Tasks
• Construct CDRN Data Model (DM) and CDRN Vocabulary
– Based on OMOP DM/Vocabulary
– Address PCOR requirements
– Address CDRN local needs
– Align with OMOP V5 development
– Align with other CDRN centers
– Address versioning
• Develop Map-Sets
– Develop vocabulary map-sets: Sources-to-OMOP, i2b2-OMOP, PCOR-OMOP
– Address versioning
– Facilitate development of ETL processes: i2b2-OMOP, PCOR-OMOP
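A versioned map-set used inside an ETL step might look roughly like the sketch below. Everything in it is assumed for illustration: the vocabulary names, source codes, and concept IDs are placeholders (not real OMOP concept IDs), and `map_code` and `translate` are hypothetical helpers.

```python
# Hypothetical map-set: (source vocabulary, source code) -> standard concept ID.
# The IDs below are placeholders for illustration, not real OMOP concept IDs.
MAP_SET_VERSION = "example-v1"
MAP_SET = {
    ("LOCAL_LAB", "GLU"): 1001,
    ("LOCAL_LAB", "HBA1C"): 1002,
    ("I2B2", "DEM|SEX:F"): 2001,
}
UNMAPPED = 0  # sentinel chosen here for codes with no mapping, kept for QA review


def map_code(vocabulary, code):
    """Look up one source code; return UNMAPPED when the map-set has no entry."""
    return MAP_SET.get((vocabulary, code), UNMAPPED)


def translate(records):
    """Small ETL step: attach a concept ID and the map-set version to each record."""
    return [
        {**r,
         "concept_id": map_code(r["vocabulary"], r["code"]),
         "map_version": MAP_SET_VERSION}
        for r in records
    ]


rows = translate([
    {"vocabulary": "LOCAL_LAB", "code": "GLU"},
    {"vocabulary": "LOCAL_LAB", "code": "XYZ"},  # not in the map-set
])
unmapped = [r for r in rows if r["concept_id"] == UNMAPPED]
```

Stamping each output row with the map-set version is one simple way to address the versioning item: rows translated under an older map-set can be found and re-run when the mapping changes.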
Deliverables
• Design Person table
• Design terminology back-end
• Select/create demographics controlled terminology
• Create mappings of site terminology to controlled terminology for submitting sites
• Provide QA recommendations
• Document decisions and artifacts
Achilles
• Select right research database
• “Browse” database
• Actually navigate preloaded summaries
• Strength
• Consistency
• Temporality
• Plausibility
• Experiment
• Coherence
• Biological gradient
• Specificity
• Analogy
HOMER
Austin Bradford Hill, “The Environment and Disease: Association or Causation?,” Proceedings of the Royal Society of Medicine, 58 (1965), 295-300.
“What aspects of that association should we
especially consider before deciding that the most
likely interpretation of it is causation?”
PLATO
Patient-Level Assessment of Treatment Outcomes
Dataset     AUC    Sensitivity at p05  Specificity at p05  Prediction at p05  False negative rate at p80  Prediction at p80
Training    0.802  0.34                0.05                0.013              0.02                        0.001
Test        0.741  0.23                0.05                0.014              0.05                        0.001
Validation  0.720  0.21                0.05                0.013              0.04                        0.001
Wang et al., ADA, 2014
Regularized logistic regression, n=185k, p=10k
1-HERCULES
[Screenshot: Cohort Explorer. Tabs: Quality | Costs | Clinical Practice. Controls: Prevalence; Exposures per pt; Box Size; Box Color; Drug Exposures; Search. Cohort: Schizophrenia (Add Cohort)]
Top Drug Exposures
1. Amoxicillin (8.9%)
2. Ibuprofen (7.5%)
3. Fluoxetine (6.3%)
4. Risperidone (3.6%)
5. Guaifenesin (3.6%)
6. Varenicline (2.4%)
7. Albuterol HFA (2.3%)
8. Trazodone (2.1%)
9. Hydrocodone / APAP (2.2%)
10. Promethazine (2.1%)
11. Omeprazole (2.1%)
12. Haloperidol (2.0%)
HERACLES
Sparse Coding Relational Random Forests
(>-30, appendectomy, Y/N):
in the last 30 days, did the patient have an appendectomy?
(<0, max(SBP), 140):
at any time in the past did the patient’s SBP exceed 140 mmHg?
(<-90, rofecoxib, Y/N):
in the time period up to 90 days ago, did the patient have a prescription for rofecoxib?
(>-7, fever, Y/N):
in the last week, did the patient have a fever?
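The (time window, variable, test) triples above can be evaluated against a patient's event history along these lines. This is a sketch under stated assumptions, not the method behind the slide: events are assumed to be (day offset, name, value) tuples with negative offsets meaning days before the index date, the window boundaries are treated as strict, and `in_window` and `feature` are hypothetical names.

```python
def in_window(day, op, bound):
    """Window test on a day offset (negative = before the index date).

    Boundary handling (strict inequalities) is an assumption of this sketch.
    """
    if op == ">":
        return bound < day <= 0   # (">", -30): within the last 30 days
    if op == "<":
        return day < bound        # ("<", -90): more than 90 days ago
    raise ValueError(f"unknown window operator: {op}")


def feature(events, op, bound, name, threshold=None):
    """Evaluate one (window, variable, test) triple.

    With threshold=None the test is Y/N (did the event occur in the window?);
    otherwise it asks whether the maximum value in the window exceeded it.
    """
    values = [v for day, n, v in events if n == name and in_window(day, op, bound)]
    if threshold is None:
        return bool(values)
    return bool(values) and max(values) > threshold


# Invented history for one patient: (day offset, event name, value).
events = [
    (-10, "appendectomy", 1),
    (-400, "SBP", 150),
    (-5, "SBP", 120),
    (-60, "rofecoxib", 1),
]

recent_appendectomy = feature(events, ">", -30, "appendectomy")  # (>-30, appendectomy, Y/N)
ever_high_sbp = feature(events, "<", 0, "SBP", threshold=140)    # (<0, max(SBP), 140)
old_rofecoxib = feature(events, "<", -90, "rofecoxib")           # (<-90, rofecoxib, Y/N)
recent_fever = feature(events, ">", -7, "fever")                 # (>-7, fever, Y/N)
```

Each triple reduces a variable-length history to one binary feature, which is what makes such histories usable as inputs to tree-based learners.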
Bayesian Hierarchical Association Rule Mining
Shahn et al.
McCormick et al.
• Goal: Predict next event in current sequence given sequence database
• So far, successful application to RCT data
Tools for large-scale regression
BBR/BMR (2005, bayesianregression.org)
– logistic, multinomial
– L1, L2 regularization
– sparse, millions of predictors
– hierarchical, priors, autosearch
– stable
BXR (bayesianregression.org)
– cleaner
BOXER
– online logistic regression
CCD (bsccs.googlecode.com)
– logistic, conditional logistic, multinomial, Poisson, Cox, ParamSurv, least squares
– L1, L2 regularization
– sparse, millions of predictors
– imputation
– CPU, GPU
Answer questions
Explore all drugs for a given outcome
Naproxen has one of the most significant associations with GI bleed, along with other NSAIDs
Columbia CDRN approach
[Diagram labels: EHR; OHDSI OMOP; NYC CDRN; PCORnet; i2b2; Columbia; New York City; SCILHS (Boston), ACT; SHRINE]
NYC-CDRN
New York City Clinical Data Research Network
Partner Organization
Health System
• Clinical Directors Network
• Columbia University College of Physicians and Surgeons
• Montefiore Medical Center and Albert Einstein College of Med
• Mount Sinai Health System and Icahn School of Medicine
• NewYork-Presbyterian Hospital
• NYU Langone Medical Center and NYU School of Medicine
• Weill Cornell Medical College
Research Infrastructure
• Biomedical Research Alliance of New York
• Cornell NYC Tech Campus
• New York Genome Center
• Rockefeller University
Health Information Exchange
• Bronx RHIO (Bronx Regional Informatics Center)
• Healthix
Patient Organizations
• American Diabetes Association
• Center for Medical Consumers
• Consumer Reports
• Cystic Fibrosis Foundation
• New York Academy of Medicine
• NYS Department of Health
Creating the NYC-CDRN De-identified Database
[Diagram: data flows among Health Systems, Healthix, and NYGC, numbered 1 to 5. Labels: Proxy ID – MRN links; Date shift values for each patient; De-identified dataset reflecting date shifts*; Cross-institutional proxy ID matching database; Aggregated, de-duplicated dataset; NYGC Research Accessible Dataset]
*De-identified data set may include proxy IDs; date-shifted dates of birth, service, death; first 3 digits of zip codes
1. Health systems create random unique proxy IDs for each MRN and send a table of those proxy IDs and their associated MRNs to Healthix.
2. Healthix prepares a random date shift value for each patient and sends a table with each proxy ID and associated date shift value back to the health system.
3. Health system prepares a de-identified dataset using proxy IDs instead of MRNs and shifting all dates using values provided by Healthix.
4. Healthix meanwhile sends to NYGC a database that matches the proxy IDs of patients across health systems.
5. NYGC aggregates the data from all health systems, de-duplicates it using the matching database provided by Healthix, and assigns to each patient a new unique random ID before making it accessible for research projects.
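Steps 1 to 3 above can be sketched roughly as follows. This is an illustration under assumptions, not the NYC-CDRN implementation: the function names, the proxy ID format, the ±180-day shift range, and the sample records are all invented, and Python's `random` module stands in for whatever secure random source a production system would use.

```python
import random
from datetime import date, timedelta


def make_proxy_ids(mrns, rng):
    """Step 1 (health system): one random, unique proxy ID per MRN."""
    numbers = rng.sample(range(10**6, 10**7), len(mrns))  # sampling without replacement
    return {mrn: f"P{n}" for mrn, n in zip(mrns, numbers)}


def make_date_shifts(proxy_ids, rng, max_days=180):
    """Step 2 (exchange): one random shift per patient. The range is an assumption."""
    return {pid: rng.randint(-max_days, max_days) for pid in proxy_ids}


def deidentify(records, proxy_by_mrn, shift_by_proxy):
    """Step 3 (health system): swap MRN for proxy ID and shift every date."""
    out = []
    for r in records:
        pid = proxy_by_mrn[r["mrn"]]
        shift = timedelta(days=shift_by_proxy[pid])
        out.append({"proxy_id": pid, "event": r["event"], "date": r["date"] + shift})
    return out


rng = random.Random(0)  # seeded only to make the example repeatable
proxy_by_mrn = make_proxy_ids(["MRN-1", "MRN-2"], rng)
shift_by_proxy = make_date_shifts(proxy_by_mrn.values(), rng)

records = [  # invented sample records
    {"mrn": "MRN-1", "event": "admit", "date": date(2014, 1, 1)},
    {"mrn": "MRN-1", "event": "discharge", "date": date(2014, 1, 11)},
]
shifted = deidentify(records, proxy_by_mrn, shift_by_proxy)
```

Because each patient gets a single shift value, intervals between that patient's events (here, a 10-day stay) are preserved even though the absolute dates change.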
Scalable Collaborative Infrastructure
for a Learning Health Care System
(SCILHS)
• Boston Children’s Hospital
• Boston Health Net (Boston Med Center, etc.)
• Partners HealthCare System (Mass General, Brigham & Women’s)
• Wake Forest Baptist Medical Center
• Beth Israel Deaconess Medical Center
• Cincinnati Children’s Hospital
• University of Texas Health Science Center/Houston
• Columbia U Medical Center and New York Presbyterian
• Morehouse/Grady/RCMI
Natural Language Processing
• Narrative text holds much of the useful info
– Slight increase of pulmonary vascular
congestion with new left pleural effusion,
question mild congestive changes
– s/p LURT 1998 c/b 1A rejection 7/07 back on
HD
Natural language processing
“Slight increase of pulmonary vascular congestion with new left pleural effusion, question mild congestive changes”
pulmonary vascular congestion
  change: increase
  degree: low
pleural effusion
  region: left
  status: new
congestive changes
  certainty: moderate
  degree: low
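One way to see why the structured form is useful is to hold it in a small data structure and query it. The sketch below hand-transcribes the three findings from the slide; it is not the output of any NLP system, and the `Finding` class and `select` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One coded finding with its modifiers (names taken from the slide)."""
    term: str
    modifiers: dict = field(default_factory=dict)


# Hand-transcribed from the example above, not produced by a parser.
findings = [
    Finding("pulmonary vascular congestion", {"change": "increase", "degree": "low"}),
    Finding("pleural effusion", {"region": "left", "status": "new"}),
    Finding("congestive changes", {"certainty": "moderate", "degree": "low"}),
]


def select(findings, term, **required):
    """Return findings for `term` whose modifiers match every required value."""
    return [
        f for f in findings
        if f.term == term
        and all(f.modifiers.get(k) == v for k, v in required.items())
    ]


new_effusions = select(findings, "pleural effusion", status="new")
right_effusions = select(findings, "pleural effusion", region="right")
```

A keyword search for "effusion" cannot distinguish new from old or left from right; once the modifiers are explicit, those distinctions become simple filters.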
NLP Performance
[Plot: Sensitivity (y-axis) versus Specificity (x-axis). Series: internist; radiologist; lay person; natural language processor; keyword searches; no to all conditions. The ideal point is marked.]
Ann Intern Med 1995;122:681-8
NLP Transferability
[Plot: Sensitivity (y-axis) versus Specificity (x-axis). Series: internist; radiologist; lay person; unmodified system; processor-trained system; query-trained system; keyword search. The ideal point is marked.]
Methods Inf Med 1998;37:1–7