Session 4: Longitudinal and
Other Temporal Issues for
Long-Term Studies
J. Michael Gaziano, MD, MPH
Scientific Director, MAVERIC, VA Boston Healthcare System
Chief, Division of Aging, Brigham and Women’s Hospital
Professor of Medicine, Harvard Medical School
December 11, 2014
Using Big Data in the VA
• VA Healthcare System
• Large-scale research programs nested in the
clinical system
– Genetic Mega Cohort: Million Veteran Program
– Pragmatic Randomized Trial: HCTZ v.
Chlorthalidone
• Using the longitudinal big data
• Summary and Lessons Learned
Nesting Population Research
in the VA Healthcare System
• VA is an ideal setting for nested large-scale population research
– Stable and willing veteran population of 8 to 10 million
– Research infrastructure with diverse expertise
– Outstanding electronic medical record; fully integrated; data reaching back as far as 20 years
Million Veteran Program (MVP)
• Enroll up to one million users of the VHA into an observational mega-cohort
o Blood collection for storage in biorepository for future research
o Collect health and lifestyle information
o Access to electronic medical record
o Ability to recontact participants
Distribution of MVP Sites
Pragmatic Trial of HCTZ v.
[Figure: trial workflow]
• Cohort identification, centrally
• Phone enrollment & consent
• Randomize
• Intervention delivered by mail
• Data capture by EHR & CMS
• Study DB
• Analysis
• Clinical decision support
Care providers using EMR
Study team using traditional scientific tools
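The "Randomize" step in the trial workflow above can be illustrated with a short sketch. The slides do not say which randomization scheme the trial uses; permuted-block randomization, the block size, the seed, and the function name `block_randomize` are all assumptions made here for illustration only.

```python
import random

# Arm labels taken from the agenda slide; everything else is hypothetical.
ARMS = ("HCTZ", "chlorthalidone")


def block_randomize(participant_ids, block_size=4, seed=0):
    """Assign participants to two arms using permuted blocks (illustrative)."""
    if block_size % len(ARMS) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    assignments = {}
    block = []
    for pid in participant_ids:
        if not block:
            # Start a new block with equal numbers of each arm, then shuffle it.
            block = list(ARMS) * (block_size // len(ARMS))
            rng.shuffle(block)
        assignments[pid] = block.pop()
    return assignments


allocation = block_randomize([f"P{i:03d}" for i in range(8)])
```

With complete blocks the two arms stay balanced, which is one reason a centrally run trial might prefer this scheme over simple coin-flip allocation.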
System Architecture
[Figure: system architecture. Access authorization by governance system. Components: vendor molecular lab; query mart; query portal; analysis environment; consent manager; study marts; data warehouse. Data inputs: VA clinical data; non-VA data (NDI, CMS); survey data; molecular data. User: researcher.]
Current State: Logistics for Data & Environment
[Figure: data logistics. External to VINCI: GenISIS scientific environment; MVP projects (e.g. MVP LOIs); MVP participant roster; MAV-VINCI transfer zone. VINCI: CDW copy; CDW filtered by MVP roster; VINCI researcher; MAVERIC "high-level" VINCI user; support by VINCI services; completed clinical data set.]
Million Veteran Program (MVP)
Data Universe
[Figure: data sources linked to the MVP participant]
• VA - Clinical: VINCI, VIReC
• Self-reported: MVP surveys
• Biospecimen
• Non-VA: NDI, CMS, etc.
• Molecular Data
VA Data Sources
• Corporate Data Warehouse Databases
• National Patient Care Databases
• Vital Status
• Decision Support System
• National Data Extract
• Beneficiary Identification Records Locator (BIRLS) death file
• New England VISN-1 Pharmacy files
• Outpatient Clinic File (OPC)
• Patient Treatment File (PTF)
• Inpatient and Outpatient Hospitalizations
• Clinic Inpatient and Outpatient Visits
• Diagnosis (ICD-9) codes
• Procedure (CPT) codes
• Pharmacy data and laboratory data
• Pharmacy Benefit Management (PBM) system database
• OEF/OIF and OND Roster
• VA Clinical Assessment Reporting and Tracking (CART)
• Veterans Affairs Surgical Quality Improvement Program (VASQIP)
• Veterans Affairs Central Cancer Registry (VACCR)
Special Data Access w/ Data Steward
National Data Systems (NDS)
Other Data Sources
MVP Data
• Self-Reported Survey Data:
Lifestyle Survey Data (Personal Information, Well-Being, Activity, Health, Military Experience, Dietary Intake, Medication, Habits)
Baseline Survey (Health, Military Experiences, family medical history)
• Genetic Information
• Vital Status: Social Security Death Master Files, National Death Index, State Vital Statistic Registry
Non-VA Data
• National Death Index (NDI)
• Centers for Medicare and Medicaid Services (CMS)
• State Mortality Data
3-Tier 7-Step Phenotyping Process
[Figure: phenotyping workflow from T1A to T3A]
Inputs: Prior Knowledge (Literature Search, Expert Consultation); Structured Data; Unstructured Data
Tier I Algorithm (T1A)
• Initial cohort (likely cases, possible cases, likely non-cases)
Phenomic Database
• Data Processing Pipelines (NLP, data curation, extraction, augmentation, etc.)
Tier II Algorithm (T2A)
• Refined Algorithm (synthesize T1A and phenomic database to derive T2A)
Tier III Algorithm (T3A)
• Development of a probabilistic model • Assignment of quantitative "caseness" • Evaluate T2A • Formulate T3A
Step 1: Define initial working algorithm (T1A)
Step 2: Create study cohort and apply T1A
Step 3: Create Annotation Data Set
Step 4: Create Phenomic Database through Data Processing Pipelines
Step 5: Derive T2A
Step 6: Evaluate T2A to formulate T3A
Step 7: Develop probabilistic model and assign caseness
Deposit resulting algorithms to a central Phenotype Library
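To make the tiers above concrete, here is a minimal sketch of a rule-based first pass (in the spirit of T1A) followed by a probabilistic "caseness" score (in the spirit of Step 7). The rule thresholds, feature names, weights, and the choice of a logistic model are hypothetical; the slides do not specify any of them.

```python
import math


def tier1_label(n_dx_codes, on_medication):
    """Rule-based first pass over structured data (hypothetical rule)."""
    if n_dx_codes >= 2 and on_medication:
        return "likely case"
    if n_dx_codes >= 1 or on_medication:
        return "possible case"
    return "likely non-case"


def caseness(features, weights, intercept=0.0):
    """Map features to a probability in (0, 1) with a logistic function."""
    z = intercept + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights; in practice these would be fitted to an annotated data set.
example_weights = {"n_dx_codes": 0.8, "on_medication": 1.2, "nlp_mentions": 0.5}
patient = {"n_dx_codes": 3, "on_medication": 1, "nlp_mentions": 2}

label = tier1_label(patient["n_dx_codes"], bool(patient["on_medication"]))
score = caseness(patient, example_weights, intercept=-3.0)
```

A quantitative score lets each study choose its own cut-off, trading sensitivity against specificity instead of inheriting one fixed case definition.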
MAVERIC Phenotyping Activities
Phenotypes
• Disease
– Myocardial infarction (MI) – Stroke
– Unstable angina with revascularization – Acute congestive heart failure
– Death from cardiovascular disease – Vascular procedure
– Posttraumatic stress disorder (PTSD) – Schizophrenia
– Bipolar disorder – Traumatic brain injury – Depression
– Vascular dementia – Cognitive impairment – Type 2 diabetes mellitus
• Laboratory values
– High-density lipoproteins (HDLs) – Low-density lipoproteins (LDLs) – Total cholesterol
– Albumin – Serum creatinine – Triglycerides
• Physical traits
– Blood pressure
• Demographics
– Smoking – Alcohol consumption – Race – Combat exposure
Algorithm Development
• CPT codes • ICD-9 codes • Laboratory values • Medications
• Natural Language Processing (NLP)
Validation Methods
• Chart review by content experts
List of Validated Phenotypes
A. Disease Exposures and Outcomes- Algorithm
Generation
B. Characterization of Longitudinal Laboratory
and Clinical Values
C. Non-Disease Exposures and Outcomes
Algorithm Generation
A. Disease Exposures and Outcomes - Algorithm Generation
• Active Cancer
• Acute Kidney Injury
• Alcohol Abuse or Dependence
• Alzheimer's and non-Alzheimer's dementias
• Anemia
• Anxiety disorders
• Bipolar disease/mania episodes
• BMI categorization
• Bradycardia
• Cerebrovascular Disease
• Chronic Kidney Disease
• Chronic kidney disease/End stage renal disorder
• Chronic liver disease, including Fatty Liver Disease and Cirrhosis
• Chronic Obstructive Pulmonary Disease
• Clostridium difficile
• Cognitive disorder due to late effects of cerebrovascular disease
• Community Acquired Pneumonias
• Congestive Heart Failure
• Coronary Artery Disease
• Coronary Heart Disease
• Crohn's Disease
• Depression
• Diabetes
• Drug Induced Liver Injury
• Erectile Dysfunction
• Falls in the Elderly
• Fractures
• Head and Neck cancer diagnoses and tumor staging (stage III and IV)
• Hepatitis C infection
• Hypertension
• Hy's Law and Elevated Liver Function Tests
• Incontinence and Catheter Use
• Intentional and Unintentional Injuries/Poisoning
• Lower extremity peripheral vascular occlusive disease
• Major Bleeding Events (intracranial, GI, etc.)
• Metabolic syndrome
• MRSA infection
• Multiple myeloma and MGUS
• Myocardial Infarction
• Osteoporotic Fractures
• Peripheral Vascular Disease
• Personality Disorder
• Pneumonia
• Post-traumatic stress disorder
• Prostate Cancer
• Rheumatoid arthritis and severity index
• Revascularization
• Schizophrenia
• Substance Abuse Disorders
• Suicidality
• Systemic Lupus Erythematosus
• Thrombocytopenia
• Transient Ischemic Attack
• Traumatic brain injury
B) Characterization of longitudinal laboratory and clinical values, including but not limited to:
• Blood Pressure
• Pulse/Heart Rate
• Lipids (TC, HDL, LDL, non-HDL, Trigs)
• HbA1C, glucose
• Albumin
• Hb, PLT, HCT
• PCR for MRSA
• PSA
• Imaging Studies
C) Non-Disease Exposures and Outcomes
Algorithm Generation
• Antidepressants
• Antiepileptic Drugs
• Antipsychotics
• Bare Metal Stent placement/Drug Eluting Stent Placement (# stents, stent revisions)
• CABG
• Chemotherapy dosing algorithms
• Erythropoiesis Stimulating Agent
• Medication Dosing For Erythropoietin Stimulating Agents
• MRIs (with and without contrast)
• NSAIDs
• Opioids
• PPI use and discontinuation
• Proton Pump Inhibitors
• Selective COX-2 Inhibitors
eMERGE Phenotyping Activities
Phenotypes
• Disease
– Atrial fibrillation – Cataracts – Crohn’s disease – Dementia
– Diabetic retinopathy – Drug induced liver injury – Hypothyroidism – Multiple sclerosis
– Peripheral arterial disease (PAD) – Rheumatoid arthritis
– Severe early childhood obesity – Type 2 diabetes mellitus
• Laboratory values
– Red blood cell indices – White blood cell indices – Lipids
– High-density lipoproteins (HDLs) – Hemoglobin A1c
• Medication response
– Poor metabolizers of Clopidogrel – Warfarin dose response
• Physical traits
– Normal ECG – Height
Algorithm Development
• CPT codes • ICD-9 codes • Laboratory values • Medications
• Natural Language Processing (NLP)
Validation Methods
• Multisite validation • Chart review by content experts
CHARGE Consortium
Phenotypes
• Disease
– Myocardial infarction (MI) – Stroke
– Transient ischemic attack (TIA) – Heart failure
– Peripheral vascular disease (PVD) – Dementia – Diabetes
– Hypertension – Atrial fibrillation – Depression
• Laboratory values
– Fasting lipids – Fasting glucose – Glucose tolerance test
• Physical traits
– Blood pressure – Height, weight
• Demographics
– Smoking
Algorithm Development
• None – phenotypes collected during prospective cohort studies
Validation
• Phenotype standardization across the 5 cohorts
MVP Recruitment and
Enrollment
• Invitational Mailing/Appointment Mailing
– Invitation letter, Baseline Survey, MVP Brochure
– Appointment letter, Informed consent language
• Walk-in recruitment
• Study visit
– Informed consent/HIPAA, Blood collection
• Thank you Mailing
Blood Draw
• 4 ice packs must be in the freezer the day before blood is drawn
• After obtaining consent, scan the barcode on the EDTA blood tube to enter the blood ID into the blood collection form
• Draw blood, filling the tube
• Rescan tube
Processing
VA Central Biorepository
Sample Transit Time (hrs) from collection to storage
Transit time (hrs)    % of total
0 - 24                1.01
24 - 48               85.84
48 - 72               12.01
72 - 96               0.54
> 96                  0.42
MVP Biosample quality measurements
Biosample Quality    % of total
Good                 93.37
Lipemic              2.83
Underfilled          2.46
Hemolyzed            0.58
Clotted              0.57
Lysed                0.16
Other                0.03
Current Lab Activities
• Receiving and Processing: 400-600 per day
• Shipping Samples for Sequencing and Genotyping:
Assay Type                 Shipments to-date    Targeted Shipments
Whole Genome sequencing    1886                 1370 + 516
Whole Exome sequencing     24260                24126
SNP Genotyping             206,603              ~200,000
MVP Recruitment to Date
Invitation mailings sent          2.6 Million
Expressed interest by mail        19.4% (11.2%/8.2%)
Opt-out                           13%
Completed Baseline Surveys        456,000
Consented Veterans                325,000
Specimens in Lab                  323,000
Unscheduled (proportion)          40%
Upcoming appointments             11,000
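The counts above can be turned into rough conversion rates. This is a small worked calculation, assuming (the slide does not say so) that every count refers to the same 2.6 million invitation mailings; the dictionary keys and the `pct` helper are names introduced here.

```python
# Counts copied from the recruitment slide.
recruitment = {
    "invitations_sent": 2_600_000,
    "baseline_surveys": 456_000,
    "consented": 325_000,
    "specimens_in_lab": 323_000,
}


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


survey_rate = pct(recruitment["baseline_surveys"], recruitment["invitations_sent"])
consent_rate = pct(recruitment["consented"], recruitment["invitations_sent"])
specimen_rate = pct(recruitment["specimens_in_lab"], recruitment["consented"])

print(f"Baseline survey per invitation: {survey_rate}%")
print(f"Consent per invitation: {consent_rate}%")
print(f"Specimen in lab per consented veteran: {specimen_rate}%")
```

Under that assumption, about one in eight invitations leads to a consented participant, and nearly every consented veteran has a specimen in the lab.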
Race
[Figure: bar chart of race distribution (Caucasian, African American, Asian, Native American) for four groups; data source: VA. Plotted values by group: 78% / 20% / 1% / 1%; 80% / 18% / 1% / 1%; 78% / 21% / 1% / 1%; 80% / 18% / 2% / 1%.]
Planned Genomic Data Pipeline for Genotype data
• Data Generation (at vendor): 1. Sample QC 2. Data preparation as per VA requirements
• Data Transmission (tracking): 1. Sample send outs 2. Data transfers
• Data Ingestion, QC 1 (Disk): 1. Disk QC 2. Data uptake into GenISIS Storage Systems
• Data Indexing, QC 2 (Data): 1. Data QC 2. Indexing & metadata extraction
• Data Storage: 1. Data Storage 2. Data integration with honest broker 3. Data harmonization
• Data Analysis: 1. Study marts 2. Data analysis
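The staged pipeline with its two QC gates can be sketched as a sequence of checks that a batch must pass before moving on. This is only an illustration: the `Batch` fields, the call-rate threshold, and the function names are hypothetical and are not taken from GenISIS.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    batch_id: str
    disk_readable: bool   # stand-in for the disk-level check (QC 1)
    call_rate: float      # stand-in for a data-level metric (QC 2)
    log: list = field(default_factory=list)


def qc1_disk(batch):
    return batch.disk_readable


def qc2_data(batch, min_call_rate=0.97):
    # Hypothetical threshold; a real pipeline would define its own criteria.
    return batch.call_rate >= min_call_rate


def run_pipeline(batch):
    """Advance a batch through the stages; stop at the first failed gate."""
    batch.log.append("transmission")
    if not qc1_disk(batch):
        batch.log.append("rejected at QC 1 (disk)")
        return False
    batch.log.append("ingestion")
    if not qc2_data(batch):
        batch.log.append("rejected at QC 2 (data)")
        return False
    batch.log.extend(["indexing", "storage", "analysis"])
    return True


good = Batch("B001", disk_readable=True, call_rate=0.99)
bad = Batch("B002", disk_readable=True, call_rate=0.90)
results = (run_pipeline(good), run_pipeline(bad))
```

Keeping a per-batch log makes it easy to see where a shipment stopped, which matters when samples and data transfers are tracked separately.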
Planned Genomic Data Pipeline for
Sequence Data
Using Big Data in the VA
• VA Healthcare System
• Large-scale research programs nested in the
clinical system
– Genetic Mega Cohort: Million Veteran Program
– Pragmatic Randomized Trial: HCTZ v. Chlorthalidone
• Using the big data
– Biochemical pipelines
– Phenomic data
• Summary and Lessons for clinical care
CC: F/U Depression
The patient indicates that his symptoms have improved
significantly, but not as much as he expected. He is still sleeping a lot (about 12 hours per day) and finds it hard to concentrate on looking for work. He denies suicidal ideation. His PHQ-9 score is 16 today.
The patient has a history of binge eating episodes. He is an emotional eater and often feels out of control, but continues to eat after job search disappointments. He often binges at night and has done this 3-4 times per week for the past several years.
Professional Diagnosis:
Axis I: Major Depression, partial response to meds; Binge eating disorder
Axis II: Deferred
Axis III:
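A note like the one above shows why unstructured text needs its own processing: the PHQ-9 score appears only in free text. The following is a minimal sketch of pulling that score out with a regular expression. The pattern and function name are assumptions for illustration; a production NLP pipeline would also need to handle negation, multiple scores, and other phrasings.

```python
import re

# Matches phrasings such as "PHQ-9 score is 16" or "PHQ9 score: 16".
PHQ9_PATTERN = re.compile(
    r"PHQ-?9\s+score\s*(?:is|of|=|:)?\s*(\d{1,2})",
    re.IGNORECASE,
)


def extract_phq9(note):
    """Return the first PHQ-9 score found in the note, or None."""
    match = PHQ9_PATTERN.search(note)
    if match is None:
        return None
    score = int(match.group(1))
    # PHQ-9 totals range from 0 to 27; discard anything outside that range.
    return score if 0 <= score <= 27 else None


note = (
    "He is still sleeping a lot (about 12 hours per day) and finds it hard "
    "to concentrate on looking for work. He denies suicidal ideation. "
    "His PHQ-9 score is 16 today."
)
phq9 = extract_phq9(note)
```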
VA Data Sources
• Corporate Data Warehouse
Databases
• National Patient Care Databases
• Vital Status
• Decision Support System
• National Data Extract
• Beneficiary Identification Records
Locator (BIRLS) death file
• New England VISN-1 Pharmacy
files
• Pharmacy Benefit Management
(PBM) system database
• Outpatient Clinic File (OPC)
• Patient Treatment File (PTF)
• Clinic Inpatient and Outpatient
Visits
• Inpatient and Outpatient
Hospitalizations
• Diagnosis (ICD-9) codes
• Procedure (CPT) codes
• Pharmacy data and laboratory
data
• OEF/OIF and OND Roster
• VA Clinical Assessment Reporting
and Tracking (CART)
• Veterans Affairs Surgical Quality
Improvement Program (VASQIP)
• Veterans Affairs Central Cancer
Registry (VACCR)
Other Data Sources
MVP Data
• Self-Reported Survey Data:
Lifestyle Survey Data
(Personal Information,
Well-Being, Activity,
Health, Military
Experience, Dietary
Intake, Medication,
Habits)
Baseline Survey (Health,
Military Experiences,
family medical history)
Non-VA Data
• National Death Index (NDI)
• Centers for Medicare and
Medicaid Services (CMS)
• Social Security Death
Master Files
• State Mortality Data
• Cancer Registries
Examples of Data Issues
• Types of data
– ICD codes
– Procedure codes
– Lab data
– Medication data
– Imaging data
Various Levels of Data Processing
• Basic Cleaning: Data quality and logic checks of raw data elements and values (e.g. checking logic on value ranges and types)
• Curation: Data standardization and harmonization (e.g. laboratory data element naming convention)
• Simple Phenotyping: Defining algorithms based on prior knowledge and structured data elements; requires subject matter experts working together with EMR data experts
• Complex Phenotyping: Deriving complex algorithms combining both structured and unstructured databases
i. Further development and validation of complex phenotyping algorithms will be completed based on each funded project
ii. This is a deeper phenotyping that requires processing the unstructured database by expert data programmers and analysts using, and possibly building, a specific data mining pipeline.
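The first two levels (basic cleaning and curation) can be sketched for a single laboratory record: harmonize the test name, then apply a type check and a range check. The name mapping, the plausibility limits, and the function `curate_lab` are hypothetical examples, not values used by MVP.

```python
# Hypothetical mapping from site-specific lab names to one standard name.
LAB_NAME_MAP = {
    "hdl": "HDL",
    "hdl-c": "HDL",
    "hdl cholesterol": "HDL",
    "creat": "SERUM_CREATININE",
    "serum creatinine": "SERUM_CREATININE",
}

# Hypothetical plausibility limits (mg/dL) used only for the logic check.
PLAUSIBLE_RANGE = {
    "HDL": (5.0, 200.0),
    "SERUM_CREATININE": (0.1, 30.0),
}


def curate_lab(raw_name, raw_value):
    """Return (standard_name, value), or None if the record fails a check."""
    name = LAB_NAME_MAP.get(raw_name.strip().lower())  # curation: naming
    if name is None:
        return None
    try:
        value = float(raw_value)  # basic cleaning: type check
    except (TypeError, ValueError):
        return None
    low, high = PLAUSIBLE_RANGE[name]
    if not (low <= value <= high):  # basic cleaning: range check
        return None
    return name, value


cleaned = [curate_lab(n, v) for n, v in [(" HDL-C ", "52"), ("creat", "abc"), ("hdl", 900)]]
```

Returning None instead of raising keeps the sketch short; a real pipeline would more likely record why each record was rejected so that the rule can be reviewed.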
Summary and Lessons for Research
• Using data for Research
– Don’t boil the ocean
– Develop structured data
model
– Computing environment
– Validate!
– Missing data is OK
– Research question defines
level of quality
– Research lab results not
necessarily from a certified
lab
Summary and Lessons for Clinical care
• Using Big data for Clinical
care
– Clinical question defines data
quality
– Real time need
– Missing data
– Centralize processes (pros
and cons)
[Figure: data flow through processing zones]
Zones: Data Sources; Landing Zone; Cleaning Zone; Curation Zone; Phenotyping Zone; Query mart; Study mart
Activities listed across the zones:
- VA & Non-VA Sources
- Access to Data Source
- Ideal Sources Identification
- Throughput
- Identity checks
- Assign MPI (DIVA ID)
- Honest Broker Integration
- Data Integrity checks
- Algorithms for basic data cleaning
- MVP backend Data Dictionary
- Data Validation
- Data Harmonization
- Deriving phenotype terms based on standards
- Phenotyping algorithm development
- Data and metadata associations
- Ontology development
- Data Dictionary/Metadata Manager
- Terminology based aggregate query
- Metadata driven study data request
- Access controlled study specific data marts
Systems labelled in the figure: CDW/Vista; VINCI CDW; CMS; NDI; GenISIS; Bio-repository; Genomic; Clinical; MVP Enclave; Recruitment Operations; Access Query Operation; Study Datamart
Sources of Data Most Commonly
Used
• Nationwide VA electronic medical record system (VISTA)
• Inpatient & Outpatient clinic visit and hospitalization data, 1997-present
• Pharmacy & Lab data, 2002-present
• Mortality data (VA Benefits, Social Security Death Index, National Death Index)
• Medicare data
• Perioperative data (NSQIP)
• Economic data (HERC)
• Clinical note extraction
E & P Subcommittee Tasks
• Develop Strategy for Data Organization
• Define Data Access Issues
• Derive Process for Phenotyping
• Develop Phenotype Knowledge Base
• Catalogue Existing Phenotypes
MVP Nested Cohorts
[Figure: nested cohorts, listed from broadest to narrowest]
• All veteran users
• MVP Respondents
• MVP Enrollees
• Specific Data mart
Handling different data sources