01 - BD2K ThinkTank on EHR data semantic harmonization, definition, content, ontologies: OHDSI, CDRNs, NLP

(1)

BD2K ThinkTank on EHR Data

Semantic harmonization,

definition, content, ontologies:

OHDSI, CDRNs, NLP

George Hripcsak, MD, MS

Biomedical Informatics

(2)

Observational Health Data Sciences and Informatics (OHDSI)

• Grew out of OMOP

• Mission

– large-scale analysis of observational health databases for population-level estimation and patient-level predictions

• Vision

– international network of 1,000,000,000 patients to generate evidence

Explore all drugs for a given outcome

Naproxen has one of the most significant associations with GI bleed, along with other NSAIDs

(3)

OHDSI

• Experiments to generate evidence

• Applications (tools)

• Data

• Methods

(4)

(5)

How OHDSI works

[Diagram: data flow between the OHDSI Data Partners and the OHDSI Coordinating Center]

OHDSI Data Partners

• Source data warehouse, with identifiable patient-level data

• ETL

• Standardized, de-identified patient-level database (OMOP CDM v5)

• Standardized large-scale analytics

• Analysis results

OHDSI Coordinating Center

• Summary statistics results repository (OHDSI.org)

• Analytics development and testing

• Research and education

• Data network support

Evidence types shown: Consistency, Temporality, Strength, Plausibility, Experiment, Coherence, Biological gradient, Specificity, Analogy, Comparative effectiveness, Predictive modeling

(6)

OHDSI

• Each site retains its own data

• Use a deep information model

– Concepts, terminologies, conceptual relations

– “OMOP Common Data Model (v4, v5)”

– Strictly defines terminology, mappings

– Supports world-wide queries

• Advanced observational research methods

• Summarize or aggregate results centrally
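
The distributed pattern in the bullets above (each site keeps its patient-level data and only summaries are combined centrally) can be sketched as follows. This is a minimal illustration and not OHDSI's actual tooling: the record layout, the function names `local_summary` and `pool`, and the toy site data are all hypothetical.

```python
from collections import Counter

def local_summary(records, exposure, outcome):
    """Run at a site: count patients by (exposed, has_outcome) status.

    Only these aggregate counts leave the site; patient rows stay local.
    """
    counts = Counter()
    for rec in records:
        exposed = exposure in rec["drugs"]
        has_outcome = outcome in rec["conditions"]
        counts[(exposed, has_outcome)] += 1
    return counts

def pool(summaries):
    """Run centrally: add up the per-site counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Hypothetical toy data for two sites (not real patients).
site_a = [
    {"drugs": {"naproxen"}, "conditions": {"gi_bleed"}},
    {"drugs": {"naproxen"}, "conditions": set()},
    {"drugs": set(), "conditions": set()},
]
site_b = [
    {"drugs": {"naproxen"}, "conditions": {"gi_bleed"}},
    {"drugs": set(), "conditions": {"gi_bleed"}},
    {"drugs": set(), "conditions": set()},
]

pooled = pool([
    local_summary(site_a, "naproxen", "gi_bleed"),
    local_summary(site_b, "naproxen", "gi_bleed"),
])
```

The central step sees only four counts per site, which is the point of the design: the same query can run everywhere because every site uses the same data model, while no patient-level row is shared.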

(7)

(8)

CDRN Alignment Tasks

• Construct CDRN Data Model (DM) and CDRN Vocabulary

o Based on OMOP DM/Vocabulary

o Address PCOR requirements

o Address CDRN local needs

o Align with OMOP V5 development

o Align with other CDRN centers

o Address versioning

• Develop Map-Sets

o Develop vocabulary map-sets:

    • Sources-to-OMOP

    • i2b2-OMOP

    • PCOR-OMOP

o Address versioning

o Facilitate development of ETL processes

    • i2b2-OMOP

    • PCOR-OMOP

Deliverables

• Design Person table

• Design terminology back-end

• Select/create demographics controlled terminology

• Create mappings of site terminology to controlled terminology for submitting sites

• Provide QA recommendations

• Document decisions and artifacts
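
As a rough illustration of the mapping deliverable above, the sketch below maps site-specific demographic codes to a small controlled terminology and collects unmapped codes for QA review. The controlled values, the site names, the local codes, and the function `map_codes` are all hypothetical; they are not taken from the OMOP vocabulary or from any CDRN site.

```python
# Hypothetical controlled terminology for one demographic field.
CONTROLLED_SEX = {"FEMALE", "MALE", "UNKNOWN"}

# Hypothetical map-set, keyed by (site, local code).
MAP_SET = {
    ("site_a", "F"): "FEMALE",
    ("site_a", "M"): "MALE",
    ("site_b", "1"): "MALE",
    ("site_b", "2"): "FEMALE",
}

def map_codes(site, local_codes):
    """Translate a site's local codes; return (mapped, unmapped).

    Unmapped codes are kept rather than dropped so that QA can review them.
    """
    mapped, unmapped = [], []
    for code in local_codes:
        target = MAP_SET.get((site, code))
        if target in CONTROLLED_SEX:
            mapped.append(target)
        else:
            unmapped.append(code)
    return mapped, unmapped

mapped, unmapped = map_codes("site_b", ["1", "2", "9"])
```

Here `unmapped` ends up holding the local code "9", which has no entry in the map-set and would be flagged for a mapping decision.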

(9)

Achilles

• Select right research database

• “Browse” database

• Actually navigate preloaded summaries

(10)

• Strength

• Consistency

• Temporality

• Plausibility

• Experiment

• Coherence

• Biological gradient

• Specificity

• Analogy

HOMER

Austin Bradford Hill, “The Environment and Disease: Association or Causation?,” Proceedings of the Royal Society of Medicine, 58 (1965), 295-300.

“What aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?”

(11)

PLATO

Patient-Level Assessment of Treatment Outcomes

Dataset     AUC    Sensitivity  Specificity  Prediction  False negative  Prediction
                   at p05       at p05       at p05      rate at p80     at p80
Training    0.802  0.34         0.05         0.013       0.02            0.001
Test        0.741  0.23         0.05         0.014       0.05            0.001
Validation  0.720  0.21         0.05         0.013       0.04            0.001

Wang et al., ADA, 2014

Regularized logistic regression, n=185k, p=10k

(12)

HERACLES

[Screenshot: Cohort Explorer, cohort “Schizophrenia”. Tabs: Quality | Costs | Clinical Practice. Controls: Prevalence, Exposures per pt, Box Size, Box Color, Drug Exposures, Search, Add Cohort]

Top Drug Exposures

1. Amoxicillin (8.9%)

2. Ibuprofen (7.5%)

3. Fluoxetine (6.3%)

4. Risperidone (3.6%)

5. Guaifenesin (3.6%)

6. Varenicline (2.4%)

7. Albuterol HFA (2.3%)

8. Trazodone (2.1%)

9. Hydrocodone / APAP (2.2%)

10. Promethazine (2.1%)

11. Omeprazole (2.1%)

12. Haloperidol (2.0%)

(13)

Sparse Coding Relational Random Forests

(>-30, appendectomy, Y/N):

in the last 30 days, did the patient have an appendectomy?

(<0, max(SBP), 140):

at any time in the past did the patient’s SBP exceed 140 mmHg?

(<-90, rofecoxib, Y/N):

in the time period up to 90 days ago, did the patient have a prescription for rofecoxib?

(>-7, fever, Y/N):

in the last week, did the patient have a fever?
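
The feature triples above read as (time window, variable, test). A minimal sketch of how such a triple could be evaluated against one patient's history is given below. The history layout (day relative to the index date, event name, value) and the function names are assumptions made for illustration; this is not the published relational random forest implementation.

```python
import operator

def in_window(day, window):
    """Check a relative day (0 = index date, negative = past) against
    a window string such as '>-30' or '<-90'."""
    compare = operator.gt if window[0] == ">" else operator.lt
    return compare(day, int(window[1:]))

def presence_feature(events, window, name):
    """(window, name, Y/N): did the named event occur in the window?"""
    return any(e == name and in_window(d, window) for d, e, _ in events)

def max_exceeds_feature(events, window, name, threshold):
    """(window, max(name), threshold): did the largest recorded value of
    the named measurement in the window exceed the threshold?"""
    values = [v for d, e, v in events if e == name and in_window(d, window)]
    return bool(values) and max(values) > threshold

# Hypothetical history: (day relative to index date, event name, value).
history = [
    (-400, "SBP", 150),
    (-120, "rofecoxib", None),
    (-20, "appendectomy", None),
    (-10, "SBP", 128),
    (-3, "fever", None),
]

recent_appendectomy = presence_feature(history, ">-30", "appendectomy")
ever_high_sbp = max_exceeds_feature(history, "<0", "SBP", 140)
old_rofecoxib = presence_feature(history, "<-90", "rofecoxib")
recent_fever = presence_feature(history, ">-7", "fever")
```

Each triple yields one yes/no answer per patient, so a longitudinal record of arbitrary length can be turned into a fixed set of binary features.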

Bayesian Hierarchical Association Rule Mining

Shahn et al.

McCormick et al.

• Goal: Predict next event in current sequence given sequence database

• So far, successful application to RCT data

Tools for large-scale regression

• BBR/BMR (2005, bayesianregression.org): logistic, multinomial; L1, L2 regularization; sparse → millions of predictors; hierarchical, priors, autosearch; stable

• BXR (bayesianregression.org): cleaner

• BOXER: online logistic regression

• CCD (bsccs.googlecode.com): logistic, conditional logistic, multinomial, Poisson, Cox, ParamSurv, least squares; L1, L2 regularization; sparse → millions of predictors; imputation; CPU, GPU

(14)

Answer questions

Explore all drugs for a given outcome

Naproxen has one of the most significant associations with GI bleed, along with other NSAIDs

(15)

(16)

Columbia CDRN approach

[Diagram labels: EHR; OHDSI (OMOP); NYC CDRN (PCORnet); i2b2; SHRINE. Scope labels: Columbia; New York City; SCILHS (Boston), ACT]

(17)

NYC-CDRN

New York City Clinical Data Research Network

Partner Organizations

Health Systems

• Clinical Directors Network

• Columbia University College of Physicians and Surgeons

• Montefiore Medical Center and Albert Einstein College of Medicine

• Mount Sinai Health System and Icahn School of Medicine

• NewYork-Presbyterian Hospital

• NYU Langone Medical Center and NYU School of Medicine

• Weill Cornell Medical College

Research Infrastructure

• Biomedical Research Alliance of New York

• Cornell NYC Tech Campus

• New York Genome Center

• Rockefeller University

Health Information Exchange

• Bronx RHIO (Bronx Regional Informatics Center)

• Healthix

Patient Organizations

• American Diabetes Association

• Center for Medical Consumers

• Consumer Reports

• Cystic Fibrosis Foundation

• New York Academy of Medicine

• NYS Department of Health

(18)

Creating the NYC-CDRN De-identified Database

[Diagram: data flows among the Health Systems, Healthix, and NYGC, numbered 1 to 5 to match the steps below]

Health Systems: Proxy ID – MRN links; de-identified dataset reflecting date shifts*

Healthix: date shift values for each patient; cross-institutional proxy ID matching database

NYGC: aggregated, de-duplicated dataset; NYGC Research Accessible Dataset

*De-identified data set may include proxy IDs; date-shifted dates of birth, service, death; first 3 digits of zip codes

1. Health systems create random unique proxy IDs for each MRN and send a table of those proxy IDs and their associated MRNs to Healthix.

2. Healthix prepares a random date shift value for each patient and sends a table with each proxy ID and associated date shift value back to the health system.

3. Health system prepares a de-identified dataset using proxy IDs instead of MRNs and shifting all dates using values provided by Healthix.

4. Healthix meanwhile sends to NYGC a database that matches the proxy IDs of patients across health systems.

5. NYGC aggregates the data from all health systems, de-duplicates it using the matching database provided by Healthix, and assigns to each patient a new unique random ID before making it accessible for research projects.
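
Steps 1 to 3 above can be sketched in a few lines. This is an illustration under stated assumptions, not the NYC-CDRN implementation: the record fields, the function names, the 16-character hex proxy IDs, and the shift range of plus or minus 30 days are all hypothetical choices.

```python
import secrets
from datetime import date, timedelta

def make_proxy_ids(mrns):
    """Step 1 (health system): a random proxy ID for each MRN."""
    return {mrn: secrets.token_hex(8) for mrn in mrns}

def make_date_shifts(proxy_ids, max_days=30):
    """Step 2 (Healthix): one random shift, in days, per patient.
    The +/- max_days range is an assumption, not from the source."""
    return {pid: secrets.randbelow(2 * max_days + 1) - max_days
            for pid in proxy_ids}

def deidentify(record, proxy_by_mrn, shift_by_proxy):
    """Step 3 (health system): replace the MRN with the proxy ID, shift
    every date by the patient's value, keep the first 3 ZIP digits."""
    pid = proxy_by_mrn[record["mrn"]]
    shift = timedelta(days=shift_by_proxy[pid])
    return {
        "proxy_id": pid,
        "birth_date": record["birth_date"] + shift,
        "service_date": record["service_date"] + shift,
        "zip3": record["zip"][:3],
    }

# Hypothetical record (not a real patient).
record = {"mrn": "12345", "birth_date": date(1960, 5, 1),
          "service_date": date(2014, 3, 15), "zip": "10032"}

proxies = make_proxy_ids([record["mrn"]])
shifts = make_date_shifts(proxies.values())
clean = deidentify(record, proxies, shifts)
```

Because one shift value is applied to all of a patient's dates, intervals between events (for example age at service) are preserved even though the absolute dates change.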

(19)

Scalable Collaborative Infrastructure for a Learning Health Care System (SCILHS)

• Boston Children’s Hospital

• Boston Health Net (Boston Med Center, etc.)

• Partners HealthCare System (Mass General, Brigham & Women’s)

• Wake Forest Baptist Medical Center

• Beth Israel Deaconess Medical Center

• Cincinnati Children’s Hospital

• University of Texas Health Science Center/Houston

• Columbia U Medical Center and New York Presbyterian

• Morehouse/Grady/RCMI

(20)

(21)

Natural Language Processing

• Narrative text holds much of the useful info

– Slight increase of pulmonary vascular congestion with new left pleural effusion, question mild congestive changes

– s/p LURT 1998 c/b 1A rejection 7/07 back on HD

(22)

Natural language processing

“Slight increase of pulmonary vascular congestion with new left pleural effusion, question mild congestive changes”

pulmonary vascular congestion

    change: increase

    degree: low

pleural effusion

    region: left

    status: new

congestive changes

    certainty: moderate

    degree: low
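
One way to hold the structured output above in code is a list of findings, each with a dictionary of modifiers, which can then be queried. The sketch below transcribes the three findings from the slide by hand; it does not run any parser, and the `Finding` class and `find` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """One coded finding with its modifiers (slot names follow the slide)."""
    concept: str
    modifiers: dict = field(default_factory=dict)

# Transcribed by hand from the example above; not parser output.
report = [
    Finding("pulmonary vascular congestion",
            {"change": "increase", "degree": "low"}),
    Finding("pleural effusion", {"region": "left", "status": "new"}),
    Finding("congestive changes",
            {"certainty": "moderate", "degree": "low"}),
]

def find(findings, concept, **required):
    """Return findings for a concept whose modifiers match all required values."""
    return [
        f for f in findings
        if f.concept == concept
        and all(f.modifiers.get(k) == v for k, v in required.items())
    ]

new_effusions = find(report, "pleural effusion", status="new")
```

Once the narrative is in this form, a query such as "new pleural effusion" becomes a lookup rather than a text search.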

(23)

NLP Performance

[Figure: sensitivity versus specificity for internist, radiologist, lay person, natural language processor, keyword searches, and “no to all conditions”, with the ideal point marked. Source: Ann Intern Med 1995;122:681-8]
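
For readers unfamiliar with the two axes of the performance plot, the sketch below computes sensitivity and specificity for one rater against a reference standard. The labels are made-up toy values, not data from the cited study, and the function assumes the reference contains at least one positive and one negative case.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) for boolean labels.

    Assumes at least one positive and one negative in the reference.
    """
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # true positives
    fn = sum(1 for p, r in pairs if not p and r)      # missed conditions
    tn = sum(1 for p, r in pairs if not p and not r)  # true negatives
    fp = sum(1 for p, r in pairs if p and not r)      # false alarms
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels for ten reports (hypothetical).
reference = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(predicted, reference)

# A rater who answers "no" to every condition never raises a false alarm
# but finds nothing.
no_sens, no_spec = sensitivity_specificity([False] * 10, reference)
```

The "no to all conditions" rater lands at sensitivity 0 and specificity 1, which is why both measures are needed to compare raters.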

(24)

NLP Transferability

[Figure: sensitivity versus specificity for internist, radiologist, lay person, unmodified system, processor-trained system, query-trained system, and keyword search]