14 - Session 4: Longitudinal and other temporal issues for long-term studies

(1)

Session 4: Longitudinal and Other Temporal Issues for Long-Term Studies

J. Michael Gaziano, MD, MPH

Scientific Director, MAVERIC, VA Boston Healthcare System

Chief, Division of Aging, Brigham and Women’s Hospital

Professor of Medicine, Harvard Medical School

December 11, 2014

(2)

Using Big Data in the VA

• VA Healthcare System

• Large-scale research programs nested in the clinical system

– Genetic Mega Cohort: Million Veteran Program

– Pragmatic Randomized Trial: HCTZ v. Chlorthalidone

• Using the longitudinal big data

• Summary and Lessons Learned

(3)

Nesting Population Research in the VA Healthcare System

• VA is an ideal setting for nested large-scale population research

– Stable and willing veteran population of 8 to 10 million

– Research infrastructure with diverse expertise

– Outstanding electronic medical record; fully integrated; data reaching back as far as 20 years

(4)

Million Veteran Program (MVP)

• Enroll up to one million users of the VHA into an observational mega-cohort

o Blood collection for storage in biorepository for future research

o Collect health and lifestyle information

o Access to electronic medical record

o Ability to recontact participants

(5)

Distribution of MVP Sites

(6)

Cohort

Identification

Centrally

Phone

Enrollment

& Consent

Randomize

Intervention

Delivered

by mail

Data Capture

By EHR & CMS

Study DB

Analysis

Clinical

Decision

Support

Care providers using EMR

Study team using traditional scientific tools

Pragmatic Trial of HCTZ v.

(7)

(8)

System Architecture


Access Authorization by Governance System

Vendor

Molecular Lab

Query Mart

Query Portal

Analysis Environment

Consent Manager

Study Mart Study Mart Study Mart

Data Warehouse

VA

Non VA

Clinical Data

NDI, CMS

Survey Data

Molecular data

Researcher

(9)

Current State: Logistics for Data &

Environment

External to VINCI

GenISIS Scientific env

MVP Projs MAV-VINCI xfer zone

MVP participant roster CDW filtered by MVP roster VINCI Researcher MAVERIC “high-level” VINCI user MAVERIC VINCI CDW copy VINCI CDW copy Support by VINCI services

completed clinical data-set

VINCI

e.g., MVP LOIs

(10)

Million Veteran Program (MVP)

Data Universe


VA - Clinical

VINCI, VIReC,

Self-reported

MVP surveys

Biospecimen

Non-VA

NDI, CMS, etc.

Molecular

Data

MVP

Participant

(11)

VA Data Sources

• Corporate Data Warehouse Databases

• National Patient Care Databases

• Vital Status

• Decision Support System

• National Data Extract

• Beneficiary Identification Records Locator (BIRLS) death file

• New England VISN-1 Pharmacy files

• Outpatient Clinic File (OPC)

• Patient Treatment File (PTF)

• Inpatient and Outpatient Hospitalizations

• Clinic Inpatient and Outpatient Visits

• Diagnosis (ICD-9) codes

• Procedure (CPT) codes

• Pharmacy data and laboratory data

• Pharmacy Benefit Management (PBM) system database

• OEF/OIF and OND Roster

• VA Clinical Assessment Reporting and Tracking (CART)

• Veterans Affairs Surgical Quality Improvement Program (VASQIP)

• Veterans Affairs Central Cancer Registry (VACCR)

Special Data Access w/ Data Steward

National Data Systems (NDS)

(12)

Other Data Sources

MVP Data

• Self-Reported Survey Data:

– Lifestyle Survey Data (Personal Information, Well-Being, Activity, Health, Military Experience, Dietary Intake, Medication, Habits)

– Baseline Survey (Health, Military Experiences, family medical history)

• Genetic Information

• Vital Status: Social Security Death Master Files, National Death Index, State Vital Statistic Registry

Non-VA Data

• National Death Index (NDI)

• Centers for Medicare and Medicaid

Services (CMS)

• State Mortality Data

(13)

3-Tier 7-Step Phenotyping Process

Tier I Algorithm (T1A)

Initial cohort

(Likely cases, possible cases, likely non-cases)

Structured Data

Literature Search Expert Consultation Unstructured Data

Structured Data

Phenomic

Database

Data Processing Pipelines (NLP, data curation, extraction, augmentation, etc)

Refined Algorithm (T2A)

(Synthesize T1A and phenomic database to derive T2A)

• Development of a probabilistic model

• Assignment of quantitative “caseness”

• Evaluate T2A

• Formulate T3A

Prior Knowledge

Tier II Algorithm (T2A)

Tier III Algorithm (T3A)

T1A T3A

Step 1: Define initial working algorithm (T1A)

Step 2: Create study cohort and apply T1A

Step 3: Create Annotation Data Set

Step 4: Create Phenomic Database through Data Processing Pipelines

Step 5: Derive T2A

Step 6: Evaluate T2A to formulate T3A

Step 7: Develop probabilistic model and assign caseness

Deposit resulting algorithms to a central Phenotype Library
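
The first and last steps of this process can be illustrated with a minimal Python sketch. It is not the MAVERIC algorithm: the evidence fields, the rule thresholds, and the model weights below are all hypothetical placeholders. It only shows the general shape of a rule-based Tier I triage into likely cases, possible cases, and likely non-cases, followed by a probabilistic "caseness" score.

```python
import math
from dataclasses import dataclass


@dataclass
class PatientEvidence:
    # All three features are hypothetical stand-ins for structured/unstructured evidence.
    dx_code_count: int   # number of qualifying diagnosis codes
    on_medication: bool  # any qualifying prescription on record
    nlp_mentions: int    # mentions found in clinical notes by an NLP pipeline


def tier1_label(p: PatientEvidence) -> str:
    """Rule-based initial working algorithm (T1A) on structured data only."""
    if p.dx_code_count >= 2 and p.on_medication:
        return "likely case"
    if p.dx_code_count >= 1 or p.on_medication:
        return "possible case"
    return "likely non-case"


def caseness(p: PatientEvidence, weights=(-3.0, 1.2, 1.5, 0.8)) -> float:
    """Quantitative caseness from a logistic model; the weights are made up.

    In practice the weights would be fitted against an annotated data set.
    """
    b0, b_dx, b_med, b_nlp = weights
    z = (b0 + b_dx * p.dx_code_count
         + b_med * int(p.on_medication)
         + b_nlp * p.nlp_mentions)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    patient = PatientEvidence(dx_code_count=3, on_medication=True, nlp_mentions=2)
    print(tier1_label(patient), round(caseness(patient), 3))
```

Keeping the categorical label and the continuous score separate mirrors the slide: the label drives cohort creation and chart-review sampling, while the score can be thresholded differently depending on the study.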

(14)

MAVERIC Phenotyping Activities

Phenotypes

•Disease

– Myocardial infarction (MI) – Stroke

– Unstable angina with revascularization – Acute congestive heart failure

– Death from cardiovascular disease – Vascular procedure

– Posttraumatic stress disorder (PTSD) – Schizophrenia

– Bipolar disorder – Traumatic brain injury – Depression

– Vascular dementia – Cognitive impairment – Type 2 diabetes mellitus

Algorithm Development

•CPT codes •ICD-9 codes •Laboratory values •Medications

•Natural Language Processing (NLP)

• Laboratory values: High-density lipoproteins (HDLs), Low-density lipoproteins (LDLs), Total cholesterol, Albumin, Serum creatinine, Triglycerides

• Physical traits: Blood pressure

• Demographics: Smoking, Alcohol consumption, Race, Combat exposure

Validation Methods

• Chart review by content experts

(15)

List of Validated Phenotypes

A. Disease Exposures and Outcomes: Algorithm Generation

B. Characterization of Longitudinal Laboratory and Clinical Values

C. Non-Disease Exposures and Outcomes: Algorithm Generation

(16)

A. Disease Exposures and Outcomes: Algorithm Generation

• Active Cancer
• Acute Kidney Injury
• Alcohol Abuse or Dependence
• Alzheimer’s and non-Alzheimer's dementias
• Anemia
• Anxiety disorders
• Bipolar disease/mania episodes
• BMI categorization
• Bradycardia
• Cerebrovascular Disease
• Chronic Kidney Disease
• Chronic kidney disease/End stage renal disorder
• Chronic liver disease, including Fatty Liver Disease and Cirrhosis
• Chronic Obstructive Pulmonary Disease
• Clostridium difficile
• Cognitive disorder due to late effects of cerebrovascular disease
• Community Acquired Pneumonias
• Congestive Heart Failure
• Coronary Artery Disease
• Coronary Heart Disease
• Crohn's Disease
• Depression
• Diabetes
• Drug Induced Liver Injury
• Erectile Dysfunction
• Falls in the Elderly
• Fractures
• Head and Neck cancer diagnoses and tumor staging (stage III and IV)
• Hepatitis C infection
• Hypertension

(17)

A. Disease Exposures and Outcomes: Algorithm Generation

• Hy's Law and Elevated Liver Function Tests
• Incontinence and Catheter Use
• Intentional and Unintentional Injuries/Poisoning
• Lower extremity peripheral vascular occlusive disease
• Major Bleeding Events (intracranial, GI, etc.)
• Metabolic syndrome
• MRSA infection
• Multiple myeloma and MGUS
• Myocardial Infarction
• Osteoporotic Fractures
• Peripheral Vascular Disease
• Personality Disorder
• Pneumonia
• Post-traumatic stress disorder
• Prostate Cancer
• Rheumatoid arthritis and severity index
• Revascularization
• Schizophrenia
• Substance Abuse Disorders
• Suicidality
• Systemic Lupus Erythematosus
• Thrombocytopenia
• Transient Ischemic Attack
• Traumatic brain injury

(18)

B) Characterization of longitudinal laboratory and clinical values, including but not limited to:

• Blood Pressure

• Pulse/Heart Rate

• Lipids (TC, HDL, LDL, non-HDL, triglycerides)

• HbA1C, glucose

• Albumin

• Hb, PLT, HCT

• PCR for MRSA

• PSA

• Imaging Studies
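
As a rough illustration of what "characterizing" a longitudinal value can mean, the sketch below collapses repeated measurements per patient into a few summary features (count, first, last, mean, and a least-squares slope per year). The record layout and the choice of summary features are assumptions made for this example, not something specified on the slide.

```python
from collections import defaultdict
from datetime import date


def summarize_longitudinal(records):
    """Summarize repeated measurements per patient.

    records: iterable of (patient_id, measurement_date, value) tuples
    (a hypothetical layout). Returns {patient_id: summary dict}.
    """
    by_patient = defaultdict(list)
    for pid, when, value in records:
        by_patient[pid].append((when, value))

    summary = {}
    for pid, rows in by_patient.items():
        rows.sort()  # chronological order
        t0 = rows[0][0]
        years = [(when - t0).days / 365.25 for when, _ in rows]
        values = [v for _, v in rows]
        n = len(values)
        mean_t = sum(years) / n
        mean_v = sum(values) / n
        # Ordinary least-squares slope; undefined with a single time point.
        denom = sum((t - mean_t) ** 2 for t in years)
        slope = None
        if denom > 0:
            num = sum((t - mean_t) * (v - mean_v) for t, v in zip(years, values))
            slope = num / denom
        summary[pid] = {
            "n": n,
            "first": values[0],
            "last": values[-1],
            "mean": mean_v,
            "slope_per_year": slope,
        }
    return summary


if __name__ == "__main__":
    systolic = [
        ("A", date(2012, 1, 1), 140.0),
        ("A", date(2010, 1, 1), 120.0),
        ("A", date(2011, 1, 1), 130.0),
        ("B", date(2011, 6, 1), 118.0),
    ]
    for pid, stats in summarize_longitudinal(systolic).items():
        print(pid, stats)
```

Patients with a single measurement get a slope of `None` rather than zero, so that "no trend information" is not confused with "no change".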

(19)

C) Non-Disease Exposures and Outcomes: Algorithm Generation

• Antidepressants
• Antiepileptic Drugs
• Antipsychotics
• Bare Metal Stent placement/Drug Eluting Stent Placement (# stents, stent revisions)
• CABG
• Chemotherapy dosing algorithms
• Erythropoiesis Stimulating Agent
• Medication Dosing For Erythropoietin Stimulating Agents
• MRIs (with and without contrast)
• NSAIDs
• Opioids
• PPI use and discontinuation
• Proton Pump Inhibitors
• Selective COX-2 Inhibitors

(20)

MAVERIC Phenotyping Activities

Phenotypes

•Disease

– Unstable angina with revascularization – Acute congestive heart failure

– Death from cardiovascular disease – Vascular procedure

– Posttraumatic stress disorder (PTSD) – Schizophrenia

– Bipolar disorder – Traumatic brain injury – Depression

– Vascular dementia – Cognitive impairment – Type 2 diabetes mellitus

Algorithm Development

•CPT codes •ICD-9 codes •Laboratory values •Medications

• Laboratory values: High-density lipoproteins (HDLs), Low-density lipoproteins (LDLs), Total cholesterol, Albumin, Serum creatinine, Triglycerides

• Physical traits: Blood pressure

• Demographics: Smoking, Alcohol consumption, Race, Combat exposure

Validation Methods

• Chart review by content experts

(21)

eMERGE Phenotyping Activities

Phenotypes

• Disease: Atrial fibrillation, Cataracts, Crohn’s disease, Dementia, Diabetic retinopathy, Drug induced liver injury, Hypothyroidism, Multiple sclerosis, Peripheral arterial disease (PAD), Rheumatoid arthritis, Severe early childhood obesity, Type 2 diabetes mellitus

• Laboratory values: Red blood cell indices, White blood cell indices, Lipids, High-density lipoproteins (HDLs), Hemoglobin A1c

• Medication response: Poor metabolizers of Clopidogrel, Warfarin dose response

• Physical traits: Normal ECG, Height

Algorithm Development

•CPT codes •ICD-9 codes •Laboratory values •Medications

Validation Methods

•Multisite validation •Chart review by content experts

(22)

CHARGE Consortium

Phenotypes

• Disease: Transient ischemic attack (TIA), Heart failure, Peripheral vascular disease (PVD), Dementia, Diabetes, Hypertension, Atrial fibrillation, Depression

• Laboratory values: Fasting lipids, Fasting glucose, Glucose tolerance test

• Physical traits: Blood pressure, Height, weight

• Demographics: Smoking

Algorithm Development

•None – phenotypes collected during prospective cohort studies

Validation

•Phenotype standardization across the 5 cohorts

(23)

(24)

(25)

MVP Recruitment and

Enrollment

• Invitational Mailing/Appointment Mailing

– Invitation letter, Baseline Survey, MVP Brochure

– Appointment letter, Informed consent language

• Walk-in recruitment

• Study visit

– Informed consent/HIPAA, Blood collection

• Thank you Mailing

(26)

Blood Draw

• 4 ice packs must be in the freezer the day before blood is drawn

• After obtaining consent, scan barcode on EDTA blood tube to enter blood ID into blood collection form

• Draw blood, filling the tube

• Rescan tube

(27)

(28)

Processing

(29)

VA Central Biorepository

(30)

[Figure: MVP sample transit time (hrs) from collection to storage, percent of total. 0 - 24: 1.01; 24 - 48: 85.84; 48 - 72: 12.01; 72 - 96: 0.54; > 96: 0.42]

MVP Biosample quality measurements

[Figure: Biosample quality, percent of total. Good: 93.37; Lipemic: 2.83; Underfilled: 2.46; Hemolyzed: 0.58; Clotted: 0.57; Lysed: 0.16; Other: 0.03]
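
For readers who want to reproduce a transit-time breakdown like the one on this slide from raw timestamps, a minimal sketch follows. The bin labels come from the chart; how boundary values are assigned (left-inclusive here, with exactly 96 hours falling into "> 96") is an assumption, since the slide does not say.

```python
from collections import Counter

# Upper edges (hours) of the first four bins; anything at or above the last edge is "> 96".
BIN_EDGES = [24, 48, 72, 96]
BIN_LABELS = ["0 - 24", "24 - 48", "48 - 72", "72 - 96", "> 96"]


def transit_bin(hours: float) -> str:
    """Map a collection-to-storage transit time in hours to a chart bin."""
    if hours < 0:
        raise ValueError("transit time cannot be negative")
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if hours < edge:
            return label
    return BIN_LABELS[-1]


def percent_by_bin(hours_list):
    """Return {bin label: percent of samples}, with every bin present."""
    total = len(hours_list)
    if total == 0:
        return {label: 0.0 for label in BIN_LABELS}
    counts = Counter(transit_bin(h) for h in hours_list)
    return {label: 100.0 * counts[label] / total for label in BIN_LABELS}


if __name__ == "__main__":
    print(percent_by_bin([10, 30, 30, 100]))
```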

(31)


Current Lab Activities

• Receiving and Processing: 400-600 per day

• Shipping Samples for Sequencing and Genotyping:

Assay Type | Shipments to-date | Targeted Shipments

Whole Genome sequencing | 1886 | 1370 + 516

Whole Exome sequencing | 24260 | 24126

SNP Genotyping | 206,603 | ~200,000

(32)

MVP Recruitment to Date

Invitation mailings sent: 2.6 Million

Expressed interest by mail: 19.4% (11.2%/8.2%)

Opt-out: 13%

Completed Baseline Surveys: 456,000

Consented Veterans: 325,000

Specimens in Lab: 323,000

Unscheduled (proportion): 40%

Upcoming appointments: 11,000

(33)

(34)

Race

[Figure: bar chart of race distribution for four groups (group labels not recoverable). Caucasian / African American / Asian / Native American: 78% / 20% / 1% / 1%; 80% / 18% / 1% / 1%; 78% / 21% / 1% / 1%; 80% / 18% / 2% / 1%]

Data Source: VA

(35)

Planned Genomic Data Pipeline for Genotype data

Data Generation → Data Transmission → Data Ingestion → Data Indexing → Data Storage → Data Analysis

QC 1 (Disk)

QC 2 (Data)

• Data Generation: At Vendor: 1. Sample QC 2. Data preparation as per VA requirements

• Data Transmission: Tracking 1. Sample send outs 2. Data transfers

• Data Ingestion: 1. Disk QC 2. Data uptake into GenISIS Storage Systems

• Data Indexing: 1. Data QC 2. Indexing & Meta data extraction

• Data Storage: 1. Data Storage 2. Data integration with honest broker 3. Data harmonization

• Data Analysis: 1. Study marts 2. Data analysis

(36)

Planned Genomic Data Pipeline for

Sequence Data

(37)

Using Big Data in the VA

• VA Healthcare System

• Large-scale research programs nested in the

clinical system

– Genetic Mega Cohort: Million Veteran Program

– Pragmatic Randomized Trial: HCTZ v. Chlorthalidone

• Using the big data

– Biochemical pipelines

– Phenomic data

• Summary and Lessons for clinical care

(38)

(39)

(40)


CC: F/U Depression

The patient indicates that his symptoms have improved significantly, but not as much as he expected. He is still sleeping a lot (about 12 hours per day) and finds it hard to concentrate on looking for work. He denies suicidal ideation. His PHQ-9 score is 16 today.

The patient has a history of binge eating episodes. He is an emotional eater and often feels out of control, but continues to eat after job search disappointments. He often binges at night and has done this 3-4 times per week for the past several years.

Professional Diagnosis:

Axis I: Major Depression, partial response to meds; Binge eating disorder

Axis II: Deferred

Axis III:

(41)

VA Data Sources

• Corporate Data Warehouse Databases

• National Patient Care Databases

• Vital Status

• Decision Support System

• National Data Extract

• Beneficiary Identification Records Locator (BIRLS) death file

• New England VISN-1 Pharmacy files

• Pharmacy Benefit Management (PBM) system database

• Outpatient Clinic File (OPC)

• Patient Treatment File (PTF)

• Clinic Inpatient and Outpatient Visits

• Inpatient and Outpatient Hospitalizations

• Diagnosis (ICD-9) codes

• Procedure (CPT) codes

• Pharmacy data and laboratory data

• OEF/OIF and OND Roster

• VA Clinical Assessment Reporting and Tracking (CART)

• Veterans Affairs Surgical Quality Improvement Program (VASQIP)

• Veterans Affairs Central Cancer Registry (VACCR)

(42)

Other Data Sources

MVP Data

• Self-Reported Survey Data:

– Lifestyle Survey Data (Personal Information, Well-Being, Activity, Health, Military Experience, Dietary Intake, Medication, Habits)

– Baseline Survey (Health, Military Experiences, family medical history)

Non-VA Data

• National Death Index (NDI)

• Centers for Medicare and

Medicaid Services (CMS)

• Social Security Death

Master Files

• State Mortality Data

• Cancer Registries

(43)

Examples of Data Issues

• Types of data

– ICD codes

– Procedure codes

– Lab data

– Medication data

– Imaging data

(44)

Various Levels of Data Processing


Basic Cleaning: Data quality and logic checks of raw data elements and values (e.g., logic checks on value ranges and types)

Curation: Data standardization and harmonization (e.g., laboratory data element naming convention)

Simple Phenotyping: Defining algorithms from prior knowledge using structured data elements; requires subject matter experts working together with EMR data experts

Complex Phenotyping: Deriving complex algorithms combining both structured and unstructured databases

i. Further development and validation of complex phenotyping algorithms will be completed based on each funded project

ii. This is deeper phenotyping that requires expert data programmers and analysts to process the unstructured database, using and possibly building a specific data mining pipeline.
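
The two lowest levels, basic cleaning and curation, are the easiest to show in code. The sketch below is only an illustration: the name map and the plausible-value ranges are hypothetical placeholders (they are not clinical reference ranges and not the MVP data dictionary). It harmonizes a raw lab name to a standard name, then applies a type check and a range check to the value.

```python
# Curation: map local lab names to one standard name (hypothetical examples).
LAB_NAME_MAP = {
    "hdl": "HDL",
    "hdl-c": "HDL",
    "hdl cholesterol": "HDL",
    "ldl": "LDL",
    "ldl-c": "LDL",
    "ldl cholesterol": "LDL",
}

# Basic cleaning: plausible value ranges in mg/dL (placeholders, not clinical limits).
PLAUSIBLE_RANGE = {
    "HDL": (5.0, 200.0),
    "LDL": (5.0, 500.0),
}


def clean_lab_record(raw_name, raw_value):
    """Return ((standard_name, value), None) on success or (None, reason) on rejection."""
    name = LAB_NAME_MAP.get(str(raw_name).strip().lower())
    if name is None:
        return None, "unknown lab name"
    try:
        value = float(raw_value)  # type check: must parse as a number
    except (TypeError, ValueError):
        return None, "non-numeric value"
    low, high = PLAUSIBLE_RANGE[name]
    if not (low <= value <= high):  # also rejects NaN
        return None, "value out of range"
    return (name, value), None


if __name__ == "__main__":
    for raw in [(" HDL-C ", "52"), ("ldl", "abc"), ("HDL", "-4"), ("xyz", "1")]:
        print(raw, "->", clean_lab_record(*raw))
```

Returning a rejection reason instead of silently dropping rows keeps the cleaning step auditable, which matters when the research question defines the level of data quality required.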

(45)

Using Big Data in the VA

• VA Healthcare System

• Large-scale research programs nested in the

clinical system

– Genetic Mega Cohort: Million Veteran Program

– Pragmatic Randomized Trial: HCTZ v. Chlorthalidone

• Using the big data

– Biochemical pipelines

– Phenomic data

• Summary and Lessons for clinical care

(46)

Summary and Lessons for Research

• Using data for Research

– Don’t boil the ocean

– Develop structured data model

– Computing environment

– Validate!

– Missing data is OK

– Research question defines level of quality

– Research lab results not necessarily from a certified lab

(47)

Summary and Lessons for Clinical care

• Using Big data for Clinical care

– Clinical question defines data quality

– Real time need

– Missing data

– Centralize processes (pros and cons)

(48)


Data Sources

Curation Zone

Landing Zone

Cleaning Zone

Query mart

Study mart

Phenotyping Zone

- VA & Non-VA Sources
- Access to Data Source
- Ideal Sources Identification
- Throughput
- Identity checks
- Assign MPI (DIVA ID)
- Honest Broker Integration
- Data Integrity checks
- Algorithms for basic data cleaning
- MVP backend Data Dictionary
- Data Validation
- Data Harmonization
- Deriving phenotype terms based on standards
- Phenotyping algorithm development
- Data and metadata associations
- Ontology development
- Data Dictionary/Meta data Manager
- Terminology based aggregate query
- Metadata driven study data request
- Access controlled Study specific data marts
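
The slide lists identity checks, ID assignment, and honest broker integration without describing how they are implemented. As a generic illustration only, an honest broker can be sketched as a component that holds the crosswalk between real identifiers and study identifiers and hands out records with the direct identifiers removed. The class name, field names, and ID format below are all hypothetical.

```python
import secrets


class HonestBroker:
    """Holds the real-ID to study-ID crosswalk; researchers never see it."""

    def __init__(self):
        self._crosswalk = {}  # real identifier -> study identifier

    def study_id(self, real_id: str) -> str:
        """Return a stable random study ID for a participant, creating one if needed."""
        if real_id not in self._crosswalk:
            self._crosswalk[real_id] = "S" + secrets.token_hex(8)
        return self._crosswalk[real_id]

    def deidentify(self, record: dict, id_field: str = "patient_id",
                   drop_fields=("name",)) -> dict:
        """Copy a record, replacing the real ID and dropping direct identifiers."""
        out = {k: v for k, v in record.items()
               if k != id_field and k not in drop_fields}
        out["study_id"] = self.study_id(record[id_field])
        return out


if __name__ == "__main__":
    broker = HonestBroker()
    row = {"patient_id": "12345", "name": "Example Person", "ldl": 131.0}
    print(broker.deidentify(row))
```

Because the same participant always maps to the same study ID, records from different sources can still be linked inside a study mart without exposing the real identifier.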

(49)

CDW/Vista

VINCI

CDW

CMS

NDI

Access, Query, Operation

Study Datamart

GenISIS

Bio-repository Genomic MVP Enclave

Clinical

Recruitment

Operations

(50)

VA Data Sources

• Corporate Data Warehouse

Databases

• National Patient Care

Databases

• Vital Status

• Decision Support System

• National Data Extract

• Beneficiary Identification

Records Locator (BIRLS)

death file

• New England VISN-1

Pharmacy files

• Outpatient Clinic File (OPC)

• Patient Treatment File (PTF)

• Inpatient and Outpatient

Hospitalizations

• Clinic Inpatient and Outpatient

Visits

• Diagnosis (ICD-9) codes

• Procedure (CPT) codes

• Pharmacy data and laboratory

data

• Pharmacy Benefit Management

(PBM) system database

• OEF/OIF and OND Roster

• VA Clinical Assessment

Reporting and Tracking (CART)

• Veterans Affairs Surgical Quality

Improvement Program (VASQIP)

• Veterans Affairs Central Cancer

Registry (VACCR)

50

Special

Data

Access w/

Data

Steward

National

Data

Systems

(NDS)

(51)