
Assessment and Evaluation of a Crosswalk between the Functional Independence Measure and the Continuity Assessment Record and Evaluation

by DAVID MELLICK

BS, University of Northern Colorado

MA, University of Northern Arizona

A dissertation submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Clinical Sciences Program

2020


This dissertation for the Doctor of Philosophy degree by David Mellick

has been approved for the Clinical Sciences Program

by

Heather Haugen, PhD, Chair

Cynthia Harrison-Felix, PhD, Research Mentor

Gale Whiteneck, PhD

Jessica M. Ketchum, PhD

David Weitzenkamp, PhD


Assessment and Evaluation of a Crosswalk between the Functional Independence Measure and the Continuity Assessment Record and Evaluation

Dissertation directed by Associate Professor Heather Haugen.

Abstract

Background: Assessment of functional outcome is a major initiative for facilities engaging in post-acute care. Over time, however, the various types of post-acute care facilities developed measures to assess function independently of each other, making comparisons of patients across facilities difficult. To rectify this, in 2005 the Centers for Medicare and Medicaid Services developed a comprehensive measure of function for all facilities to use, called the Continuity Assessment Record and Evaluation (CARE). However, because of the change in instruments, comparing scores on the previous measures with scores on the CARE requires a crosswalk. This study focused on the creation and evaluation of a crosswalk between the motor subscale of the Functional Independence Measure (FIM), which had been used to assess function in inpatient rehabilitation facilities, and the newer CARE tool.

Methods: An existing dataset of 982 persons who had sustained a moderate to severe traumatic brain injury and were assessed using both FIM and CARE at inpatient rehabilitation admission and discharge was utilized to create three crosswalks using different methodologies (expert opinion, equipercentile, and Rasch). The dataset was split into a training and a validation dataset. Each crosswalk was evaluated using several criteria, including reduction in uncertainty, the percent of crosswalked scores falling within ½ standard deviation (SD) of the reference measure, population invariance, comparison of statistical moments, and effect size.

Results: Using the training dataset, the expert opinion crosswalk met all of the criteria except for the direction of population invariance within the race category. The equipercentile methodology satisfied all of the evaluation criteria, and the Rasch model met all of the criteria except for a difference in directionality in the skewness of the distributions and for not meeting the targeted 80% of scores falling within ½ SD of the reference assessment. These results differed from those in the validation sample on the population invariance criterion: there, the age categories were in opposite directions, and the observed differences in standardized mean difference for age exceeded the threshold of 0.08 for both the equipercentile and Rasch crosswalks.

Conclusions and Significance: All three crosswalk methods performed acceptably on the evaluation criteria. Therefore, motor/physical functional outcome can be compared between cohorts that have been assessed using different measures. The results indicate that researchers wanting to compare cohorts that have been assessed using different instruments could utilize any of the crosswalks.

Table of Contents

CHAPTER I. INTRODUCTION

Overview

Pathways of Care

How is Function Measured in Post-Acute Care Settings?

Continuity Assessment Record and Evaluation

Linking Scores

Specific Aims

Significance

Summary

CHAPTER II. REVIEW OF THE LITERATURE

Introduction

Literature Search Methods

Review of the Literature

FIM

CARE

Measurement Linking

Criteria for Linking

Linking Functional Outcomes

CHAPTER III. METHODS

Research Design and Data Collection

The Traumatic Brain Injury Model Systems National Database

Study Participants

Variables Collected

Analysis Plan

Missing Data

Analysis Plan for Research Aim 1

Analysis Plan for Research Aim 2

Validation Sample

CHAPTER IV. RESULTS OF ANALYSIS

Sample Demographics

Sample Functional Outcome Measure Scores

Missing Data

Expert Opinion

Equipercentile

Rasch

CHAPTER V. CONCLUSION

Study Limitations

Summary

List of Tables

Table I-1: Differences in Post-Acute Care Assessments

Table I-2: Differences in Functional Measure Items

Table II-1: Justification for CARE tool core items

Table III-1: Inclusion criteria for TBIMS NDB

Table III-2: Study Dataset Variables

Table III-3: Scoring Codes for the FIM-M and CARE

Table IV-1: Categorization of Severity of TBI

Table IV-2: Demographics of the samples

Table IV-3: Functional Outcome Measure Scores

Table IV-4: Expert Opinion FIM-M and CARE Item Conversion Scores

Table IV-5: Correlations and RiU between the CAREeo and FIMeo

Table IV-6: Percent of the scores within 1/2 SD of other assessment using expert opinion method

Table IV-7: Demographic Population Invariance for FIMeo and CAREeo

Table IV-8: Difference in SMD for Demographic variables for FIMeo and CAREeo

Table IV-9: Statistical moments 1 and 2 for CAREeo and FIMeo

Table IV-10: Statistical moments 3 and 4 for CAREeo and FIMeo

Table IV-11: Effect sizes of CAREeo and FIMeo crosswalks by expert opinion method

Table IV-12: Concordance table CARE to FIM-M and FIM-M to CARE

Table IV-13: Correlations and RiU between the assessments using equipercentile method

Table IV-14: Percent of the scores within 1/2 SD of original score using equipercentile method

Table IV-15: Demographic population invariance for CARE and CAREfromFIM using equipercentile method

Table IV-16: Difference in SMD for demographic variables for CARE using equipercentile method

Table IV-17: Demographic population invariance for FIM-M and FIMfromCARE using equipercentile method

Table IV-18: Difference in SMD for demographic variables for FIM-M using equipercentile method

Table IV-19: Statistical moments 1 and 2 for CARE and FIM-M using the equipercentile method

Table IV-20: Statistical moments 3 and 4 for CARE and FIM-M using the equipercentile method

Table IV-21: Effect sizes of CARE and FIM-M crosswalks by equipercentile method

Table IV-22: FIM concordance table using Rasch

Table IV-23: CARE concordance table using Rasch

Table IV-24: Correlations and RiU between the assessments using Rasch

Table IV-25: Percent of the scores within ½ SD of original score using Rasch method

Table IV-26: Demographic population invariance for CARE and CAREfromFIM using Rasch method

Table IV-27: Difference in SMD for demographic variables for CARE using Rasch method

Table IV-28: Demographic population invariance for FIM and FIMfromCARE using Rasch method

Table IV-29: Difference in SMD for demographic variables for FIM-M using Rasch method

Table IV-30: Statistical moments 1 and 2 for CARE and FIM-M using the Rasch method

Table IV-31: Statistical moments 3 and 4 for CARE and FIM using the Rasch method

Table IV-32: Effect sizes of CARE and FIM-M crosswalks by Rasch method

Table V-1: Results of the evaluation criteria for the training dataset

List of Figures

Figure IV-1: Scatterplot of FIM-M and CARE total scores by time of administration

Figure IV-2: Scatterplot of CAREeo and FIMeo

Figure IV-3: Graphical representation of CARE to FIM-M concordance table

Figure IV-4: Graphical representation of FIM-M to CARE concordance table

Figure IV-5: Scatterplot of CARE using the equipercentile method

Figure IV-6: Scatterplot of FIM-M using the equipercentile method

Figure IV-7: FIM-M person/item map

Figure IV-8: FIM-M item map

Figure IV-9: CARE person/item map

Figure IV-10: CARE item map

Figure IV-11: Scatterplot of CARE using Rasch

List of Abbreviations

ACT American college test

ADL Activity of daily living

AM-PAC Activity measure for post-acute care

CI Confidence interval

CoA Coefficient of alienation

CMS Centers for Medicare and Medicaid services

CRAN Comprehensive R archive network

CTT Classical test theory

DRA Deficit reduction act

FIM-M FIM motor subscale

GCS Glasgow coma scale

HAQ-DI Health assessment questionnaire disability index

IADL Instrumental activities of daily living

ICC Intraclass correlation coefficient

IRF Inpatient rehabilitation facility

IRF-PAI Inpatient rehabilitation facility – Patient assessment instrument

IRT Item response theory

K-MBI Korean modified Barthel index

LEAS Lower extremity activity scale

LSU HIS Louisiana state university health status instruments

LTACH Long term acute care hospital

MDS Minimum data set

NDB National database

Neuro-QOL Quality of life outcomes in neurological disorders

NIDILRR National institute on disability, independent living, and rehabilitation research

OASIS Outcome and assessment information set

PAC Post acute care

PAC-PRD Post-acute care payment reform demonstration

PROMIS Patient reported outcomes measurement information systems

PPS Prospective payment system

PTA Post traumatic amnesia

RiU Reduction in uncertainty

SAT Scholastic aptitude test

SCIM Spinal cord independence measure

SD Standard deviation

SE Standard error

SEED Standard error for equating differences

SF-36 Medical outcomes study short form 36 item version

SMD Standardized mean difference

SNF Skilled nursing facility

TBI Traumatic brain injury

TBIMS Traumatic brain injury model systems

TFC Time to follow commands

UDSMR Uniform Data System for Medical Rehabilitation

UCLA University of California Los Angeles

US United States

CHAPTER I.

INTRODUCTION

Overview

Pathways of Care

Receiving health care in the United States (US) after a catastrophic injury is anything but easy to comprehend. Which providers perform what care and for whom is a constant dance that represents decisions based on the type and severity of injury, the type of insurance an individual has, and the availability of treatment. A newly injured person has various care pathways they may take, or more likely have decided for them, based on financial and/or insurance constraints. Take, for example, a fictional person who was involved in a car crash and sustained a traumatic brain injury (TBI). Immediately after the crash she was transported to a local trauma center’s emergency department and was quickly assessed and admitted to the hospital. During the acute care stay she was stabilized and many of the critical issues revolving around her medical condition were minimized. Upon discharge from the acute care facility, there are several possible destinations to which she could be discharged to receive post-acute care (PAC) in the continuum of care and recovery after catastrophic injury. As you will come to understand, many of these PAC facilities offer various levels of support. One problem with admission to a PAC facility is that there are no guidelines to determine in which category of PAC a person may thrive; rather, placement is often based on care need, ability to pay, and the availability of family/home support.

Each PAC facility type has developed distinctive outcome measures that assess a patient’s performance in conducting basic activities such as eating, bathing, grooming, and walking to help determine the best post-acute placement (e.g., return to home or some alternative, if necessary), treatment plan, and ultimately government insurance reimbursement rates. Today there are several levels of PAC facilities:

• Inpatient Rehabilitation Facilities (IRFs) are licensed facilities that are required to provide to each patient a minimum of three hours of combined physical, occupational, or speech therapy per day. IRFs are also characterized by an increased physician presence and care is often provided by registered nurses.

• Long Term Acute Care Hospitals (LTACHs) are facilities that provide treatment for patients with more serious medical conditions that require medical care on an ongoing basis but no longer require intensive acute medical care or extensive diagnostic procedures.

• Skilled Nursing Facilities (SNFs) are nursing homes or hospital-based care units that obtain Centers for Medicare & Medicaid Services (CMS) certification to provide skilled nursing care and rehabilitation services on an inpatient basis.

• Home Health Agencies (HHAs) are organizations which provide care to persons who are typically unable to leave their residence without considerable effort and require at least some part-time skilled nursing and/or therapy services.

Broadly, all of these types of PAC facilities are designed to minimize the disabling impact of injuries and health conditions, and help individuals maximize physical and


community. In 2013, nearly 22 percent of discharges from US hospitals, or nearly 8 million, used some type of post-acute service.1 Half of these discharges were to an HHA, 41% were to a SNF, 7% were to an IRF, and 2% were to an LTACH. Most importantly, each type of PAC has a unique method of measuring a person’s function upon admission and discharge to verify change (i.e., improvement/gain) in function and to justify reimbursement that is specific to the type of PAC facility. Given the importance of function, we must first know how it can be measured.

How is Function Measured in Post-Acute Care Settings?

The term ‘Functional Assessment’ was coined by Lawton in 1971, who defined it as any “systematic attempt to measure objectively the level at which a person is functioning in a variety of areas”.2 In practice, “functional assessment” aims to measure the performance of a person on activities of daily living (ADL). Dittmar and Gresham have defined five applications for functional assessment: 1) evaluating individual outcomes; 2) planning for treatment interventions; 3) determining effectiveness of treatment; 4) maintaining continuity of care; 5) improving resources or staffing needs.3 Further uses have been identified, such as the basis for payment systems and classification into disability related groups (DRGs) or payment/reimbursement categorizations.4 In 2002, the World Health Organization further clarified the term “functional status” as an umbrella term that includes all body structures, activities, and participation in daily life.5 However, early measures such as the Katz and Barthel indexes6 used in rehabilitation focused specifically on “activities”, such as activities of daily living and included eating, bathing, grooming,


of a wheelchair. Furthermore, unlike other measures such as height and weight, functional activity does not have a standard of measurement or scale; instead it can only be measured using a rating or response scale which is based on a person’s observed performance for a given activity.

Measurement of a patient’s outcomes in post-acute rehabilitation remains an important tool for disability management. Standard measurement procedures and instruments are often used to inform stakeholders of progress and outcomes achieved by patients. Despite more than three decades of PAC measurement, there has been little agreement on the timing of assessments, nor have there been consistent approaches and assessments to compare a patient’s performance across PAC settings. Over time, with the objective of improving the measurement process, researchers have developed separate instruments to evaluate the functional status of persons in a PAC setting.

There are three PAC outcome assessments that have solidified measurement over the past 30 years: the Inpatient Rehabilitation Facility – Patient Assessment Instrument (IRF-PAI) for IRFs,7 the Minimum Data Set (MDS) for SNFs,8 and the Outcome and Assessment Information Set (OASIS) for HHAs.9 LTACHs neither created nor were mandated to use any specific outcome assessments.

To assess functional status within a SNF, the MDS was developed under a congressional mandate calling for a uniform assessment for residents of nursing homes to develop care plans.10 However, in time it became an instrument to measure quality of care

HHAs assess individuals using the OASIS. The OASIS was developed using funds from the Robert Wood Johnson Foundation as well as the Health Care Financing Administration with the goal to improve home health services care.11,13 In 2000, OASIS data provided the basis for the establishment of Medicare reimbursement for home health.

To measure functional status for patients in IRFs, the FIM™, formerly the Functional Independence Measure, was developed in 1987 by a task force of the American Congress of Rehabilitation Medicine and the American Academy of Physical Medicine and Rehabilitation as a measure of burden of care.14 Starting in 1990, the FIM became the basis for the CMS-mandated IRF-PAI. Licensed IRFs were required to measure functional status using the FIM on all Medicare and Medicaid patients and submit these data to CMS. In 2002, data reported using the IRF-PAI, including facility, diagnostic, and demographic data along with the FIM, were used to determine the Prospective Payment System (PPS) Part A reimbursement for Medicare fee-for-service patients.15

Each of these assessments (FIM, OASIS, and MDS) is site-specific within the PAC system and lacks common terminology, definitions, and domains, and each utilized a different methodology for development.16 Furthermore, each provider utilized a different PPS for Medicare reimbursement. Complicating the matter, many of the PAC facilities could treat patients with the same condition yet measure different functional statuses due to the differences in assessments, leading to biases toward which person got treated at which PAC setting, depending on the PAC options in the region and other factors not measurable using the Medicare claims data.17 Further, each assessment collects data on a different schedule


actually observes the calendar day of admission as the time frame, whereas IRF-PAI has a three-day window from admission and MDS varies across eight days. Furthermore, the MDS has interim collection at day 14 as well as every 30 days after admission until day 100, and OASIS requires assessments every 60 days. Both IRF-PAI and OASIS have a discharge assessment but the MDS does not require one.

Table I-1: Differences in Post-Acute Care Assessments

| Dimension | Inpatient Rehabilitation Facilities | Skilled Nursing Facilities | Home Health Agencies |
| Tool | IRF-PAI | MDS | OASIS |
| Frequency of Measurement | Admission and Discharge | Initial (day 1-8); day 14; day 30; every 30 days until day 100 | Calendar day of admission and every 60 days thereafter; and at discharge |
| Time Period of Measurement | Lowest level within first/last 3 days | 7 day look-back | Status on day of assessment |
| Method of Measurement | Direct observation or with reported performance | Information gathered from multiple caregivers’ descriptions and documentation. Direct observation not required. | Direct observation preferred, but also often used interviews with patient in-home caregiver |

While all three instruments attempt to classify functional status, each does so with different items (Table I-2). Two major differences are that the MDS includes the ability to transfer in both its bathing and toileting items and does not differentiate upper body dressing from lower body dressing. The IRF-PAI accounts for not only the level of assistance needed for bowel and bladder control, but also the use of equipment and frequency of accidents, whereas the MDS and OASIS focus primarily on incontinence. The IRF-PAI is the only instrument that specifically captures walking up stairs as an indicator for locomotion. Finally, the OASIS combines all transfer abilities regardless of activity (tub/shower/toilet/chair), whereas the IRF-PAI separates out those activities and, as mentioned above, the MDS combines transfers with its toileting and bathing items.

Table I-2: Differences in Functional Measure Items

| Domain | IRF-PAI | MDS | OASIS |
| Self-Care | Eating includes the ability to use suitable utensils to bring food to the mouth, as well as the ability to chew and swallow the food once the meal is presented in the customary manner on a table or tray. | Eating – how resident eats and drinks. Includes intake of nourishment by other means. | Feeding or eating – ability to feed self meals and snacks. |
| Self-Care | Grooming includes oral care, hair grooming (combing or brushing hair), washing the hands, washing the face, and either shaving the face or applying make-up. If the subject neither shaves nor applies make-up, grooming includes only the first four tasks. | Personal hygiene – how resident maintains personal hygiene, including combing hair, brushing teeth, shaving, applying makeup, washing/drying face, hands and perineum. | Grooming – ability to tend to personal hygiene needs (i.e., washing face and hands, hair care, shaving or make up, teeth or denture care, fingernail care). |
| Self-Care | Bathing includes washing, rinsing, and drying the body from the neck down (excluding the back) in either a tub, shower, or sponge/bed bath. | Bathing – how resident takes full-body bath/shower, sponge bath, and transfers in/out of tub/shower. | Bathing – ability to wash entire body; excludes grooming (washing face and hands only). |
| Self-Care | Dressing – Upper Body includes dressing and undressing above the waist, as well as applying and removing a prosthesis or orthosis when applicable. The patient performs this activity safely. | Dressing – how resident puts on, fastens, and takes off all items of clothing, including donning/removing prosthesis. | Ability to dress upper body |
| Self-Care | Dressing – Lower Body includes dressing and undressing from the waist down, as well as applying and removing a prosthesis or orthosis when applicable. The patient performs this activity safely. | | Ability to dress lower body – including undergarments, slacks, socks or nylons, shoes. |
| Self-Care | Toileting includes maintaining perineal hygiene and adjusting clothing before and after using a toilet, commode, bedpan, or urinal. The patient performs this activity safely. | Toilet use – how resident uses the toilet room; transfer on/off toilet, cleanses, changes pad, manages ostomy or catheter, adjusts clothes. | Toileting – ability to get to and from the toilet or bedside commode. |
| Sphincter control | Bladder level of assistance, include use of equipment and frequency of accidents. | Bladder continence – control of urinary bladder function, with appliances, or continence programs, if employed. | Urinary incontinence or urinary catheter presence. |
| Sphincter control | Bowel level of assistance, include use of equipment and frequency of accidents. | Bowel continence – control of bowel movement, with appliance or bowel continence programs, if employed. | Bowel incontinence frequency. |
| Transfers | Transfers: Bed, Chair, Wheelchair includes all aspects of transferring from a bed to a chair and back, or from a bed to a wheelchair and back, or coming to a standing position if walking is the typical mode of locomotion. | Modes of transfer – bedfast all or most of time, or bed rails used for bed mobility or transfer. | Transferring: ability to move from bed to chair, on and off toilet or commode, into and out of tub or shower, and ability to turn and position self in bed if patient is bedfast. |
| Transfers | Transfers: Toilet includes safely getting on and off a standard toilet. | Toilet use – how resident uses the toilet room; transfer on/off toilet, cleanses, changes pad, manages ostomy or catheter, adjusts clothes. | |
| Transfers | Transfers: Tub/Shower includes getting into and out of a tub/shower. The patient performs the activity safely. | Bathing – how resident takes full-body bath/shower, sponge bath, and transfers in/out of tub/shower. | |
| Locomotion | Walking includes walking on a level surface once in a standing position. Wheelchair includes using a wheelchair on a level surface once in a seated position. | Locomotion on unit – how resident moves between locations in his/her room and adjacent corridor on same floor. If in wheelchair, self-sufficiency once in chair. | Ambulation/Locomotion – ability to safely walk, once in a standing position, or use a wheelchair, once in a seated position, on a variety of surfaces. |
| Locomotion | Stairs includes going up and down 12 to 14 stairs (one flight) indoors in a safe manner. | Locomotion off unit – how resident moves to and returns from off unit locations. If facility has only one floor, how resident moves to and from distant areas on the floor. If in wheelchair, self-sufficiency once in chair. | |

Because of the differences in measurement content and timing, several studies have tried to equate functional status across these three PAC settings. In 1997, Williams et al. convened an expert panel to see if a score on the MDS could be transformed to match a score on the FIM, the functional measurement tool of the IRF-PAI.18 The results were mixed: while some items showed a high level of agreement, others showed floor and ceiling effects, in which most patients scored too high or too low on an item to provide enough variance to translate it. Later, in 2000, CMS funded research to develop a new multipurpose functional assessment instrument, the Minimum Data Set for Post-Acute Care (MDS-PAC),19 which aimed to respond to the Balanced Budget Act of 1997 by providing a uniform assessment tool for SNFs. It was hoped that the resulting instrument could be used to help create a scoring link between the FIM and MDS-PAC and to evaluate whether the estimated FIM score would match payment equity with the new measure. Unfortunately, adding a fourth measure of functional status in an attempt to equate the FIM and the MDS was not successful. Nevertheless, both of these research projects further solidified the need and desire for a true standardized instrument that could be used in all PAC settings.

Continuity Assessment Record and Evaluation

In 2005, in light of the disappointing results of linking the FIM to the MDS and as a result of the Deficit Reduction Act (DRA), CMS developed a Medicare Payment Reform Demonstration that assessed the consistency of payment incentives for Medicare populations treated at various PAC settings, including LTACHs, IRFs, SNFs, and HHAs. This demonstration project allowed CMS to understand differences in patient treatment, outcomes, and cost in the various settings.20 As a result of this project, researchers created a standardized patient assessment tool to be used in all PAC settings at admission and discharge. The tool was named the Continuity Assessment Record and Evaluation (CARE) Item Set. Like the other outcome assessments used in post-acute facilities and as mandated by Congress, the CARE measures the medical, cognitive, and functional status of Medicare beneficiaries as well as change in functional status and other outcomes such as presence of urinary tract infections, Clostridium difficile infection, and pressure ulcers. In fact, the CARE was designed to replace the existing federal assessment tools for each of the PAC settings, including the MDS used in SNFs, the OASIS in HHAs, and the IRF-PAI, which includes the FIM, for IRFs,20,21 and to be used in LTACHs, which previously did not have a standard outcome measurement.

While the overall goal of the new CARE is to consolidate functional assessment across all PAC settings, the transition introduced a number of challenges at the practice level. First, given the relatively recent adoption of the CARE, the tool has not been fully validated independently of the CMS research that created it. Second, the CARE may showcase more theoretical value than practical effectiveness. Third, the CARE may be more complicated, inconsistent, and burdensome to administer. Fourth, switching to the CARE implies that some people will be assessed with the FIM, MDS, or OASIS on admission and then with the CARE at discharge. So, for these cases, how does a clinician assess change or success?

This last issue is important not only for clinicians, but also for researchers who have relied heavily on outcome assessments such as the FIM, MDS and OASIS. Moreover, studies that are longitudinal in nature need to address the inconsistency of functional outcome measurement across time. How does a researcher compare outcomes, either within a single patient, or cohorts of patients, when the measurement tool changes? One way to address this “seam” in the data would be to provide a statistical link between the scores on one measure and the scores on another measure.

Linking Scores

At its root, test linking is a process by which a score on one test is transformed to a score on another test. Holland and Dorans22 provided a historical framework that classified score linking into three categories: predicting, scale aligning, and equating. Each type of linking method determines how the linked scores are used and interpreted,22-25 and not all provide methodology that results in valid equating.

Predicting

For prediction linking, the goal is to predict scores for one test using scores from another test. The model may be multivariable in nature, utilizing information from other instruments or subject-specific characteristics (e.g., age, sex, injury severity) to improve prediction. By the very nature of modeling, this method will not act as a valid equating methodology, as it violates assumptions of equating outlined below;23 specifically, regression modeling will not produce scores that can be used bi-directionally.

Scale alignment

The goal of scale alignment is to create a common scale that each measure would transform onto.24 Scale alignment has its roots in educational testing development (e.g., SAT and ACT). Scale aligning has many subcategories, including activities such as battery scaling,25 anchor scaling,24 vertical scaling,25,26,27,28 calibration,24 and concordance.29 Each of these subcategories is defined by whether or not the linking function is applied to instruments with similar constructs, similar difficulty, and similar reliability, and used for the same population of examinees. Battery scaling is used if the linking occurs for different constructs but a similar population. For example, this occurs when researchers are linking verbal and math sections of an educational test. If researchers are linking the same construct, such as “verbal”, with tests that differ in difficulty, it is considered vertical scaling. Calibration refers to linking tests that have different reliability. This occurs when researchers are creating a “short form” of a test that uses a portion of the variables used in the complete test. Finally, concordance refers to when the linking represents the same construct, difficulty, and reliability and is assessed on the same population. The vast majority of linkages fall under this concordance category, as true equating, described below, is rarely accomplished.23

Equating

Equating is the most rigorous form of linking in which a score on one assessment directly maps to a score on the other.24,25 True scale equating is difficult to obtain, as Holland30 noted several fundamental assumptions for a linkage to be considered equating:

1) Equal construct – each assessment is measuring the same construct;
2) Equal reliability – the assessments should have the same level of reliability;
3) Symmetry – the transformation should be bi-directional;
4) Equity – the examiner/assessor should be indifferent to which assessment is administered; and
5) Population invariance – the linking function that is used to produce the equating should be the same regardless of subpopulation from which it is derived.
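The symmetry assumption can be made concrete by contrasting a regression-based prediction with a linear linking function that matches means and standard deviations. The following is a minimal sketch under stated assumptions: the paired scores are made up for illustration, and all function and variable names are hypothetical. Regressing y on x and then x on y does not return the starting score unless the correlation is perfect, whereas the linear linking function is exactly invertible.

```python
from statistics import mean, pstdev

# Hypothetical paired scores on two instruments (illustrative values only).
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [12.0, 25.0, 27.0, 45.0, 41.0]

mx, my = mean(x), mean(y)
sx, sy = pstdev(x), pstdev(y)
# Pearson correlation computed from population moments.
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)


def predict_y(score_x):
    """Regression of y on x (prediction); the slope is shrunk by r."""
    return my + r * (sy / sx) * (score_x - mx)


def predict_x(score_y):
    """Regression of x on y; a different line unless |r| equals 1."""
    return mx + r * (sx / sy) * (score_y - my)


def link_x_to_y(score_x):
    """Linear linking: place the score at the same z-score on y."""
    return my + (sy / sx) * (score_x - mx)


def link_y_to_x(score_y):
    """Exact inverse of link_x_to_y."""
    return mx + (sx / sy) * (score_y - my)


start = 45.0
round_trip_regression = predict_x(predict_y(start))  # drifts toward the mean
round_trip_linking = link_y_to_x(link_x_to_y(start))  # returns to the start
print(round_trip_regression, round_trip_linking)
```

In this sketch the regression round trip pulls the score toward the mean of x by a factor of r squared, which is why prediction cannot serve as a bi-directional crosswalk.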

Dorans23 also points out that many examples of linking scores are more likely to be concordances. While the goal of both concordance and equating is to establish a relationship between scores of two assessments, the terminology used to describe them carries meaning and implies different interpretations.

Specific Aims

Because of the vast amount of data already collected using different outcome assessments in PAC settings and the move to consolidate these differences in a unified measure, various research projects will need to be initiated to fully understand how to overcome the disparity in data sources. Linking the scores is one way of addressing the seam as it remains important for researchers and clinicians alike to be able to equate


function regardless of the measure used. Furthermore, this “linking” will ultimately need to be constructed for various populations, as the function may not be the same depending on the population studied. For instance, the linking function could differ substantially between people with a significant motor impairment, such as spinal cord injury, and people who have sustained a TBI. This study will attempt to create and evaluate crosswalks, or score linkages, between the FIM and CARE, utilizing a longitudinal database of people with moderate to severe TBI, to determine whether the influence of this discrepancy in the data can be effectively reduced.

Research Aim 1: Create bi-directional crosswalks between the FIM and the CARE that map the score from one scale to that of the other.

Research Aim 2: Evaluate each crosswalk using methods to assess equating methodology.

Significance

The ability to crosswalk different assessments is an important step in establishing the validity, efficacy, and effectiveness of instruments. This dissertation will develop and assess a variable crosswalk between the FIM and the CARE. While the FIM consists of 18 items spanning the domains of self-care, mobility, bowel/bladder function, and cognition, the proposed crosswalk will exclude the cognitive section because the CARE tool does not capture cognitive items at both admission and discharge as it does for the mobility and self-care components. Specifically, this dissertation will focus on linking the CARE and the FIM Motor (FIM-M) subscale. While the FIM is largely regarded as the standard of functional independence


different outcome measurements and comparison between them has been difficult.33,34

Because of this disconnect, in 2005 CMS developed and introduced the CARE Item Set, with the objective to unify the existing federal assessment tools for each of the PAC settings.21

Among other assessments, the CARE includes a set of items to measure motor function. Given the movement toward a comprehensive post-acute measurement system, it is necessary to assess a linkage between the currently used FIM and the newly adopted CARE. Developing and testing crosswalks between the FIM-M and CARE will provide evidence of the ability to compare patients’ functional status over time and between cohorts that have been administered either outcome measure. While the crosswalk developed for this dissertation will only be validated for people with moderate to severe TBI, it will contribute to the field by detailing methods that others may follow to design crosswalks for other populations.

Summary

Functional outcome measurement is an important tool in describing a patient’s recovery following treatment in PAC settings. Furthermore, it plays an integral role in determining financial reimbursements from CMS to those PAC facilities. Recently, CMS adopted a single measure for the assessment of function in PAC settings called the CARE Item Set. For PAC settings like IRFs, this tool now replaces a longstanding tool, the FIM, which has been in use for functional measurement since 1987. Given the wide adoption of the FIM, many existing longitudinal research studies that have used the FIM as a measure of outcome will likely need to switch to the CARE because of the mandate of CMS


to collect this measure instead of FIM. To date, no studies outside the internal studies that CMS conducted during the creation of the CARE have attempted to link FIM and CARE Item Set scores. This study presents the development and evaluation of three crosswalks between the FIM-M and CARE Item Set in a large sample of patients who sustained a TBI and who completed both instruments as part of an existing research protocol. First, an expert review panel will evaluate the content of the items and scales of both instruments and create a harmonized scoring algorithm by which each test will be scored. Second, using classical test theory, an equipercentile method will be employed to evaluate and link the score distributions of each test. Finally, an Item Response Theory (IRT) approach will be used to convert the test scores to a single logit scale from which concordance scores can be calculated. The accuracy of each crosswalk will be evaluated by measuring the reduction in uncertainty (RiU), population invariance, statistical moment comparisons, and effect sizes. Each crosswalk will be cross-validated in a portion of the study sample not used for crosswalk creation.


CHAPTER II

REVIEW OF THE LITERATURE

Introduction

This literature review provides a comprehensive overview not only of the FIM and CARE and their prior use and applicability to measurement in a brain injury population, but also of the history of various linking frameworks that have been used across a variety of medical conditions. Accordingly, the review was undertaken to further understand:

• The history and utility of the FIM
• The history and utility of the CARE
• Common equating frameworks and the methods to evaluate them

Literature Search Methods

The literature review was completed with the purpose of identifying relevant peer-reviewed publications, published white papers, and government publications. Searches were primarily through PubMed and Google Scholar using the following terms:

• ‘post-acute rehabilitation’ and ‘outcome’
• ‘rehabilitation’ and ‘traumatic brain injury’
• ‘functional independence’ and ‘traumatic brain injury’
• ‘functional independence measure’
• ‘test equating’ and ‘equating’ or ‘linking’ or ‘concordance’
• ‘equipercentile’
• ‘crosswalk’ and ‘test linking’
• ‘test equating’ and ‘rasch’ or ‘equating’
• ‘evaluation’ and ‘test equating’
• ‘crosswalk’ and ‘function’

Article titles and abstracts were reviewed to select pertinent studies or papers to contribute to the understanding of the project. The references from those selected papers were also reviewed to complete an exhaustive and comprehensive review.

Review of the Literature

FIM

The FIM is used to objectively measure functional independence and burden of care through an individual’s level of motor and cognitive ability and assesses the extent of assistance required to complete activities of daily living (ADLs). The scale accounts for a patient’s level of independence, amount of assistance needed, use of adaptive or assistive devices, and the percentage of a given task completed successfully. This instrument is comprised of 18 items with a seven-level response scale of independent performance in self-care, sphincter control, mobility, locomotion, communication and social cognition.35

Thus, a total FIM score ranges from 18 to 126. Hamilton36 noted that, “Because each item is


item appropriately weighted) will correlate with the burden of care for the disabled person” (p. 862). Over time, the FIM has been described as having different subscales, with Granger identifying three constructs: ADL, mobility, and continence.35 Stineman et al.37 showed in a study of 93,829 rehabilitation inpatients that a factor analysis of the FIM instrument supported the identification of ADL/motor and cognitive/communication dimensions across 20 impairment categories. More recently, examination of the multidimensional structure of the FIM by means of a Rasch analysis followed by factor analysis of standardized residuals demonstrated the divergence of the five cognitively oriented items from the 13 motor-oriented items.32,38

In US inpatient rehabilitation settings, the Uniform Data System for Medical Rehabilitation (UDSMR) is the most widely used clinical database for assessing rehabilitation outcomes.39,40 Administered in most inpatient rehabilitation facilities within three days of admission and discharge,41 the FIM is the core functional status measure of the UDSMR and was developed to establish a uniform standard for the assessment of functional status during medical rehabilitation.40,42 The FIM incorporates concepts and items from previous functional assessment instruments, such as the Katz Index of ADL, the PULSES profile, the Kenny Self-Care Evaluation, and the Barthel Index.43 The FIM system was developed by a national task force co-sponsored by the American Congress of Rehabilitation Medicine and the American Academy of Physical Medicine and Rehabilitation to establish a national UDSMR for rating patient functional independence and the outcomes of medical rehabilitation.36 The original work of this task force was expanded by the Department of


been the mission of the UDSMR to measure medical rehabilitation outcomes across the continuum of care in PAC settings.44 The UDSMR maintains a national data repository for

reporting purposes of three million case records from 1,400 facilities around the world.44,45

Furthermore, eRehabData, offered to providers by the American Medical Rehabilitation Providers Association, is another large inpatient rehabilitation outcome system; it helps IRFs maintain compliance with CMS by processing and submitting IRF-PAI data, including the FIM, to CMS.

Psychometric studies of the FIM instrument support its use for research purposes. One of the strengths of the FIM is that it has undergone several methodological evaluations, in which it has demonstrated good psychometric properties.46,47 Dodds et al.47 noted high internal consistency (Cronbach’s coefficient of .93 at admission and .95 at discharge), demonstrating the FIM to be a reliable instrument. Extensive investigations of the FIM’s reliability and validity have provided evidence of its interrater and test-retest reliability,48 internal consistency,37,49 concurrent validity,50,51 and predictive validity.51,52 Ottenbacher et al.48 performed a meta-analysis of 11 studies that estimated the median interrater reliability for the total FIM to be .95 and the test-retest and equivalence reliability to be .95 and .92, respectively. Additionally, the stability of the FIM motor score, that is, evidence that the assessment measures the same construct across time, was demonstrated in several studies.32,38
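For readers less familiar with internal consistency, the coefficient reported above can be computed from item-level scores. The sketch below is a generic illustration of Cronbach's coefficient alpha, not the procedure used in the cited studies; the function name `cronbach_alpha` and the ratings are hypothetical and are not FIM data.

```python
from statistics import pvariance


def cronbach_alpha(items):
    """Cronbach's coefficient alpha.

    `items` is a list of k lists, one per item, each holding the scores of
    the same n persons in the same order.
    """
    k = len(items)
    item_variances = [pvariance(scores) for scores in items]
    totals = [sum(person) for person in zip(*items)]  # total score per person
    total_variance = pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings for 3 items and 5 persons on a 1-7 scale.
ratings = [
    [1, 3, 4, 6, 7],
    [2, 3, 5, 6, 7],
    [1, 2, 4, 5, 7],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Alpha approaches 1 as the items rise and fall together across persons, which is the sense in which a high coefficient indicates that the items measure a common construct.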

CARE

As part of the national Post-Acute Care Payment Reform Demonstration (PAC-PRD) and enabled by Congress under the DRA of 2005, the CARE was designed to standardize


assessments of patients’ medical, functional, cognitive, and social support status across all PAC settings. Largely this effort stemmed from modernizing CMS’s existing assessments including the IRF-PAI, the MDS, and the OASIS.

Kramer53 first detailed the types of approaches to assess PAC and made recommendations to CMS on ways to create a uniform assessment. In his report, Kramer recommended 31 domains for which information about discharge placement, care transitions, and outcome monitoring should be measured. Only three domains satisfied all three purposes: physical functioning/mobility, ADL/self-care, and instrumental ADL (IADL)/advanced cognition.53 While Kramer noted the importance of these three domains, he cautioned that the scale used to measure them needs to be sensitive to small differences among those who are dependent, because such differences could have a strong impact on where these beneficiaries can reside.53 Kramer also outlined a long-term vision for achieving a uniform patient assessment that included the following steps:53

1. Agree on a core dataset at every hospital discharge
2. Develop algorithms that would recommend patient placement
3. Assure that uniform assessment could be delivered electronically
4. Create a health and outcome monitoring system
5. Payment would be based on metrics regardless of discharge setting

With these goals in mind Kramer set the stage for the creation and adoption of what would ultimately be known as the CARE.

Johnston54 also added to the choir of voices in support for a uniform post-acute


any uniform assessment would need to include a measurement of the extent of functional independence, as well as the ability to participate in self-care.

The background of the development of the CARE is detailed in Gage’s55 three-part report prepared by RTI International and delivered to the CMS Office of Clinical Standards and Quality. In the first report, Gage utilized expert panels and reported on methods for inclusion of items into the CARE. Supported by previous research,53 the work identified four domains: medical severity, functional impairment, cognitive impairment, and social support. Given the breadth of the CARE tool in its entirety, only the functional domain is described in more detail here.

Initially, the CARE consisted of core and supplemental items. The core items included six self-care items as well as five functional mobility items, and were used to evaluate all patients regardless of functional level.55 They represented items covering a

range of difficulty, were easily scored, and played a critical role in discharge planning. The core self-care items included basic self-care such as eating, tube feeding, oral hygiene, toilet hygiene, and dressing upper and lower body. The core functional mobility items consisted of lying to sitting on side of bed, sit to stand, transfers, walking, and wheelchair distance. Each core item from either self-care or functional mobility is scored on a six-level rating scale measuring the need for assistance: dependent, substantial assistance, partial assistance, supervision or touching assistance, set-up or cleanup assistance, or independent. The purpose of each item is to quantify the amount of assistance and therefore inform resource utilization needs as well as discharge placement. The justifications for inclusion of each core item are listed in Table II-1.


Table II-1: Justification for CARE tool core items

| Self-Care / Mobility Items | Reasons for inclusion into the CARE |
| --- | --- |
| Eating | Eating measures the ability to use suitable utensils to bring food to the mouth and swallow food once the meal is presented on a table or tray and also includes modified food consistency. Patients requiring higher levels of assistance may have higher resource utilization and this may also affect PAC discharge placement. |
| Tube Feeding | Tube feeding includes the ability to manage all equipment and supplies for tube feeding. Patients requiring higher levels of assistance may have higher resource utilization and this may also affect PAC discharge placement. The supervision required for patients with substantial assistance may not be available in all settings. The tube feeding item is distinct from both the swallowing item and the eating item because patients who are able to manage the feeding tube on their own will be rated as independent and may require additional resources. |
| Oral Hygiene | The oral hygiene item is included because it is an activity that all patients need to perform. Patients requiring higher levels of assistance may have higher resource utilization. |
| Toilet Hygiene | Toilet hygiene includes the ability to maintain perineal hygiene, adjust clothes before and after using the toilet, commode, bedpan, or urinal. Patients requiring higher levels of assistance may have higher resource utilization. |
| Upper body dressing | Upper body dressing includes the ability to put on and remove shirt or pajama top, including buttoning three buttons. This item measures upper body mobility and fine motor skills. Patients requiring higher levels of assistance may have higher resource utilization. |
| Lower body dressing | Lower body dressing includes the ability to dress and undress below the waist, including fasteners. This item measures lower body mobility, balance, and dexterity. Similar to the upper body dressing item, patients requiring higher levels of assistance may have higher resource utilization. |
| Lying to sitting on side of bed | This is a lower level function item. Need for assistance with this item is indicative of resource utilization and may also affect PAC discharge placement. |
| Sit to stand | This item measures balance and transition and is a more difficult functional item that may be used to assess fall risk. Need for assistance with this item is indicative of resource utilization. |
| Toilet transfer / Chair/Bed to chair transfer | Both toilet transfer and chair-to-chair transfer are included in the CARE tool. Chair-to-chair, or bed-to-chair, transfer is a more basic surface-to-surface transfer, but toilet transfer is more difficult because it occurs in a constrained space. Toilet transfer is predictive of a patient’s ability to return home. For both items, patients requiring higher levels of assistance may have higher resource utilization and this may also affect PAC discharge placement. |
| Longest distance the patient can walk | The walking item codes the longest distance the patient can walk. This is a performance-based item and the response categories include Walk 150 ft, Walk 100 ft, Walk 50 ft, or Walk in room once standing. This locomotion item is predictive of post-acute discharge placement and resource utilization. Patients with limited mobility requiring higher levels of assistance may have higher resource utilization. |
| Longest distance the patient can wheel | For patients whose primary mode of mobility is wheelchair, there is a locomotion item that corresponds to the walking item. The wheelchair item codes the longest distance the patient can wheel. This is a performance-based item and the response categories include Wheel 150 ft, Wheel 100 ft, Wheel 50 ft, or Wheel in room once sitting. This locomotion item is predictive of post-acute discharge placement and resource utilization. Patients with limited mobility requiring higher levels of assistance may have higher resource utilization. |

While the core set of items was being discussed, the expert panels also considered supplemental items that would help clarify a patient’s functional status. The supplemental items address a range of activities that differed in difficulty and are listed below:

• Wash upper body
• Shower/bathe self
• Roll left or right
• Sit to lying
• Picking up object
• Putting on/taking off footwear
• Wheelchair use for mobility
• 1 step (curb)
• Walk 50 feet with two turns
• 12 steps interior
• 4 steps exterior
• Walk 10 feet on uneven surface
• Car transfer
• Wheelchair users only—short ramp
• Wheelchair users only—long ramp
• Telephone—answering
• Telephone—placing call

Gage summarizes the report by acknowledging that this first version of the CARE, while solid, still has further development to achieve. Factors specific to less common


diseases like stroke and spinal cord injury are yet to be accounted for.55 In the demonstration project, the CARE was effective in uniformly assessing patients and in assisting in the creation of the PPS. All PAC settings (IRFs, HHAs, SNFs and LTACHs) were able to utilize the CARE to collect consistent and reliable data, and transmit these data to CMS.17

Later that year RTI further tested the CARE through additional reliability studies, including inter-rater reliability testing. They found that the CARE items were reliable when used across settings and by different disciplines.56 The levels of agreement varied, but most showed correlation coefficients above 0.70; a few appeared weaker across the board, such as certain aspects of swallowing measurement, walking 150 feet, light shopping, and laundry. The researchers also performed factor analysis and Rasch analysis to assess the dimensionality and item fit of the CARE measures and found reasonable unidimensionality and good item fit for the core variables in the CARE.

To underscore this initiative for a uniform measure to assess function, Chang and colleagues57 developed a Chinese version of the CARE (CARE-C) to better appreciate post-acute measurement in Taiwan. While their study focused on stroke only, they concluded that the CARE-C was useful in evaluating functional quality metrics and facilitated assessment in PAC settings in Taiwan.

Given the historic utility of the FIM, its use in only one post-acute setting, and the new drive by CMS to implement the CARE tool in all PAC settings, there remains a lack of research that addresses how these two assessments could be linked such that the “seam” generated by switching assessments can be minimized. The primary


methodology for doing this was developed by research done primarily in educational testing and is defined as measurement equating.

Measurement Linking

Linking is a statistical process that is used to adjust scores on instruments so that scores on each of the instruments can be used interchangeably. Interchangeably means that regardless of the instrument or version of the instrument taken, the same scores represent the same level of achievement, and differences in scores are not due to differences in difficulty between alternate versions of the instrument. In this context, “difficulty” is defined as the estimate of the skill level needed to pass an item or instrument. While this concept is easy to follow for an aptitude test in which there are right and wrong answers, it is more abstract for instruments that measure functional ability, where difficulty levels are not based on “passing an item”; nevertheless, the concepts and terminology remain the same. Depending on how well a linking function works (described in further detail below), the end result is considered to be a “crosswalk” between the two instruments. Beginning in 1951, Flanagan58 differentiated the various types of frameworks necessary for linking and the construction of score scales. He used the term comparability when describing scores on instruments that were scaled such that the distributions were similar for a certain population. For instance, an instrument that measures reading comprehension could be comparable to an instrument that measures mathematical aptitude for children in a certain age group. He argued that any linking had to be population specific, meaning, in our example, that the linking transformation may not be appropriate for college-aged students. In that regard he stated, “Comparability which would hold for all


types of groups—that is, general comparability between different tests, or even between various forms of a particular test—is strictly and logically impossible” (p. 748). Flanagan described three major topics of linking. First, he differentiated linking scores that measure the same construct from linking scores that measure different constructs. For example, linking two instruments that measure reading ability is different from linking an instrument that measures reading comprehension to one that measures mathematical skill. Second, he recognized linking scores on tests that measure the same construct but differ in difficulty. Third, he noted that some linking transformations could be symmetric, meaning that the transformation of scores is bi-directional.58 Many of these concepts are still considered and used today; however, the nomenclature has evolved from simply using the term comparability.

Following Flanagan’s 1951 effort, Angoff59 proceeded to use the term equating to represent linking scores on multiple forms of a test that were created to be similar in nature. Borrowing from the educational sphere, there exist many versions of the Scholastic Aptitude Test (SAT), all of which are linked such that each version is equated with the others. Differing from Flanagan, Angoff postulated that equating relationships should be population independent. Angoff also changed the way linking procedures were described. First, he used the term calibration to refer to linking scores on instruments that measure the same construct but differ in reliability and/or difficulty. For example, imagine an instrument with 100 items that measures reading comprehension, linked to an instrument that contains only 10 of those 100 items, essentially creating a “short form” version of the instrument. Angoff would term this type of linking calibration because the


reliability of the two versions of the instrument would be different. Second, he limited the use of the term comparability to linking scores on tests that measure different constructs.

Mislevy59 and Lim60 separately expanded the frameworks for linking by adding statistical moderation and projection. Statistical moderation is a process of linking scores through a third (moderator) variable. Projection is a method of non-symmetrical linking using regression. Neither Mislevy nor Lim required their frameworks to distinguish linking of tests that measure similar or the same constructs from linking of tests that measure different constructs.

Finally, Dorans22,60 made a distinction between the linking of scores from instruments that measure the same construct and the linking of scores from instruments that measure different constructs. He used the term concordance to represent the linking of scores with similar constructs, for which the use of statistical methods would lead to similar distributions for the measures.

In general, linking will be called something different depending on the attributes of the instruments being linked. While equating may be the goal of linking, few transformations will meet this rigorous definition by abiding by the five principles (equal construct, equal reliability, symmetry, equity, and population invariance). Most transformations will fall under the definition of scale alignment, which would consist of a transformation being: Battery scaled, in which the instruments measure different constructs (reading and math) across a common population (5th graders); Vertical scaled, in which the


subtraction vs fractions and percentiles) and populations of examinees (2nd graders vs 5th graders); Calibration, in which the instruments measure the same construct (math) in the same population (5th graders) but differ in instrument reliability (short form); and finally Concordance, in which the instruments measure the same constructs, have similar reliability and difficulty, and are measured on the same population.

A variety of techniques are used to equate tests,59,61 which can be reduced to two different approaches to equating: classical test theory (CTT) and IRT.

Classical Test Theory

CTT is regarded as the traditional approach to measurement. It is based on the assumption that a test-taker has an observed score and a true score; where the observed score is estimated as the true score plus/minus some unobservable measurement error.62
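The observed-score decomposition can be illustrated with a small simulation. This is a sketch under stated assumptions: the means, standard deviations, and sample size are hypothetical, true scores and errors are drawn independently from normal distributions, and reliability is taken in its usual CTT sense as the share of observed-score variance attributable to true scores.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20000
true_sd, error_sd = 2.0, 1.0  # hypothetical values chosen for illustration

true_scores = [random.gauss(50.0, true_sd) for _ in range(n)]
errors = [random.gauss(0.0, error_sd) for _ in range(n)]
# Observed score = true score + measurement error.
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability: proportion of observed-score variance due to true scores.
reliability = pvariance(true_scores) / pvariance(observed)
theoretical = true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)
print(round(reliability, 2), theoretical)
```

In practice only the observed scores are available, so reliability has to be estimated indirectly (for example, from internal consistency or test-retest data) rather than computed from true scores as in this simulation.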

CTT is used in item analysis, a process in which an individual’s response to an item on an instrument is analyzed to assess item difficulty, or to relate responses to a particular item to the overall score by means of a point-biserial correlation. It is also used to develop instruments and to evaluate their quality, reliability, and validity. One process that uses CTT relies on having the score distributions of the two tests coincide in a well-defined population, which is called equipercentile equating. Conversely, a broader approach uses regression procedures, in which one score is regressed on another and the fitted model is then used to calculate a prediction of the other score.61 Pommerich et al.29 demonstrated through linking of the American College Test (ACT) and SAT that the equipercentile method (concordance) and regression (prediction) serve different purposes, and the results should be used accordingly. While the former is more appropriate when determining comparable


scores at which the same percentages of examinees score above and below relevant score points, the latter is more appropriate to predict an individual’s score.

One benefit of the equipercentile methodology is that equated scores will always be within the range of possible scores under the traditional conceptualization of percentiles and percentile ranks. However, equating relationships cannot be determined outside the range of observed scores. If the distributions of the instruments to be linked do not represent all possible scores, then the linking transformation will not be able to create a link to the unobserved scores. For example, suppose two tests to be linked can each have scores ranging from 0 to 100, yet in the study population one test has all of its scores above 50; the equipercentile linking transformation will then never be able to link to a score lower than 50 on that test. Furthermore, the score distribution can be irregular due to random error in estimating the equivalents; therefore, smoothing methods can be used to mitigate the effects of random error and provide more regular shapes. However, smoothing can introduce systematic error, since the raw data are altered to produce a more regular distribution. The intent in using a smoothing method is for the increase in systematic error to be more than offset by the decrease in random error.
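The mechanics of equipercentile linking, and the range limitation described above, can be sketched in a few lines. This is a simplified, unsmoothed illustration and not the procedure used in this dissertation: the function names and sample scores are hypothetical, percentile ranks use the mid-percentile convention, and intermediate values are obtained by linear interpolation between observed scores, which is one of several conventions in use.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(scores_sorted, value):
    """Mid-percentile rank: percent below `value` plus half the percent at it."""
    below = bisect_left(scores_sorted, value)
    at_or_below = bisect_right(scores_sorted, value)
    return 100.0 * (below + 0.5 * (at_or_below - below)) / len(scores_sorted)


def equipercentile_link(x_scores, y_scores, value):
    """Map `value` on X to the Y score that has the same percentile rank.

    Interpolates linearly between adjacent observed Y scores and clamps the
    result to the observed Y range (no extrapolation, no smoothing).
    """
    xs, ys = sorted(x_scores), sorted(y_scores)
    target = percentile_rank(xs, value)
    y_values = sorted(set(ys))
    y_ranks = [percentile_rank(ys, v) for v in y_values]
    if target <= y_ranks[0]:
        return float(y_values[0])
    if target >= y_ranks[-1]:
        return float(y_values[-1])
    hi = bisect_left(y_ranks, target)
    lo = hi - 1
    frac = (target - y_ranks[lo]) / (y_ranks[hi] - y_ranks[lo])
    return y_values[lo] + frac * (y_values[hi] - y_values[lo])


# Hypothetical samples: X spans 0-100, but no Y score below 50 was observed.
x_sample = list(range(0, 101, 10))
y_sample = list(range(50, 101, 5))

for score in (0, 50, 100):
    print(score, equipercentile_link(x_sample, y_sample, score))
```

Because no Y score below 50 appears in the hypothetical sample, even the lowest X score links to 50, which mirrors the limitation described in the text.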

IRT (Rasch)

In 1953 Rasch63 created a system in which, while trying to equate two assessments, he independently estimated scale item difficulty free from the effect of the ability of the person responding to the items, as well as person measures free from the effect of the difficulty of the items. While he understood that he could not determine how a person would respond to an item, he could estimate the probability of success on that item. He postulated that this probability depends only on the person’s ability and the difficulty of the item, as expressed in this formula:

\[
\log\left(\frac{\text{Probability of Success}}{\text{Probability of Failure}}\right) = \text{Ability} - \text{Difficulty}
\]
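Solving the log-odds relation above for the probability of success gives a logistic function of the difference between ability and difficulty. The sketch below assumes the natural logarithm, the dichotomous form of the model, and ability and difficulty expressed on the same logit scale; the function names are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Natural log of the odds of success, log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# A person whose ability is one logit above the item's difficulty.
p = rasch_probability(ability=1.0, difficulty=0.0)
print(round(p, 3))            # 0.731
print(round(log_odds(p), 6))  # 1.0
```

When ability equals difficulty the probability of success is 0.5, and each additional logit of ability relative to difficulty multiplies the odds of success by e.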

IRT is a body of related psychometric theory that provides an integrated framework for developing and scoring tests. The main feature of IRT models is that they relate item responses to characteristics of individual persons and test items. The models are functions relating item and person parameters to the probability of a discrete outcome, such as a correct response to an item. IRT models have been developed for tests in which items are scored dichotomously (0/1) as well as for tests in which items are scored polytomously. IRT provides a basis for estimating parameters, ascertaining how well data fit a model, and investigating the psychometric properties of tests. Applications of IRT include test development, item banking, differential item functioning, adaptive testing, test equating, and test scaling. IRT (latent trait) methods have recently been advocated as a potential improvement over traditional CTT methods.64-66 Lord64 argued from theoretical considerations that traditional equating methods are not appropriate for equating tests of differing difficulty, whereas IRT methods have the capacity to provide an appropriate equating in this case. It is possible that a large calibrated item pool could be used to construct forms that share the same content and statistical specifications to such a degree that scores can be truly equated; often, however, the relationship between scores on such forms is better described as calibrated.66
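One way an IRT model can relate scores on tests of differing difficulty is true-score equating: find the ability at which the expected score on one test equals the observed score, then compute the expected score on the other test at that ability. The sketch below assumes a dichotomous Rasch model and item difficulties that are already on a common logit scale; the function names and difficulty values are hypothetical and are not taken from this study.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success on one item under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability, difficulties):
    """Expected raw score: sum of the item success probabilities."""
    return sum(rasch_probability(ability, d) for d in difficulties)


def ability_for_score(score, difficulties, lo=-10.0, hi=10.0, tol=1e-9):
    """Bisection search; valid for scores strictly between 0 and the item count."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def true_score_equate(score, from_difficulties, to_difficulties):
    """Expected score on the target test at the ability implied by `score`."""
    ability = ability_for_score(score, from_difficulties)
    return expected_score(ability, to_difficulties)


# Hypothetical item difficulties in logits: form Y is one logit harder than form X.
form_x = [-1.5, -1.0, -0.5, 0.0, 0.5]
form_y = [-0.5, 0.0, 0.5, 1.0, 1.5]

equated = true_score_equate(3.0, form_x, form_y)
print("Score 3 on form X corresponds to", round(equated, 2), "on form Y")
```

Because form Y is harder in this example, a score of 3 on form X maps to a lower expected score on form Y, and mapping that value back from Y to X recovers 3.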


Criteria for Linking

Three questions arise when contemplating the development of a crosswalk. The first two, ‘Should linking be done?’ and ‘How should linking be accomplished?’, were addressed earlier in the literature review; the last question, ‘Has the equating procedure been done well enough?’, is the focus of this section. Harris and Crouse67 stated that evaluating equating is problematic because a diversity of methods exists. They thoroughly reviewed and discussed the available criteria, among them the equating-in-a-circle paradigm, generated or simulated data, large-sample criteria, standard error (SE), indices, and consistency, among many others. Kolen and Brennan25 also identified that the properties of equating, such as equity, symmetry, and population invariance, can be used to develop evaluative criteria.
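The symmetry property can be checked directly: converting a score from X to Y and back again should return the original score. The sketch below contrasts linear equating, which satisfies this property, with least-squares regression, which does not. It is an illustration only; the paired scores and function names are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, from_scores, to_scores):
    """Linear equating: match the means and standard deviations of the two tests."""
    slope = pstdev(to_scores) / pstdev(from_scores)
    return mean(to_scores) + slope * (x - mean(from_scores))


def regress(x, from_scores, to_scores):
    """Least-squares prediction of the target score from paired scores."""
    m_from, m_to = mean(from_scores), mean(to_scores)
    s_xy = sum((a - m_from) * (b - m_to) for a, b in zip(from_scores, to_scores))
    s_xx = sum((a - m_from) ** 2 for a in from_scores)
    return m_to + (s_xy / s_xx) * (x - m_from)


# Hypothetical paired scores for five examinees on tests X and Y.
scores_x = [10, 20, 30, 40, 50]
scores_y = [22, 25, 41, 38, 60]

x = 50
equated_round_trip = linear_equate(linear_equate(x, scores_x, scores_y), scores_y, scores_x)
regressed_round_trip = regress(regress(x, scores_x, scores_y), scores_y, scores_x)

print("linear equating round trip:", equated_round_trip)
print("regression round trip:", regressed_round_trip)
```

The equating round trip returns the starting score up to floating-point error, whereas the regression round trip is pulled toward the mean of X, which is why regression is not treated as an equating method under the symmetry criterion.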

Several studies have compared linking methods derived from linear, equipercentile, and IRT methodology.64,68,69 Results from these studies were equivocal. Lord64 posited that there were not enough experimental results to generalize about which methods are best suited to equating different tests under specific research designs. Kolen69 performed a cross-validation study that compared two traditional equating schemes with seven IRT models. He found that the equipercentile method produced more stable results than an IRT method. In a paper that analyzed the National Reference Scale, a dataset that equated 14 separate reading tests for students in the fourth, fifth, and sixth grades, Rentz and Bashaw70 concluded that the Rasch and equipercentile equating results were reasonably similar. Slinde71 noted, however, that the equipercentile method was an


item characteristic curve theory are likely more appropriate for vertical equating. Kolen69 also showed that, overall, the equipercentile and IRT methods were similar, although he demonstrated that the equipercentile procedure produced more stable results when the instruments being linked were easier.

Marco, Petersen, and Stewart72 compared a variety of equipercentile, linear, and IRT equating methods for equating the verbal portion of the SAT. When a test was equated to an anchor test of similar difficulty, all but one of the methods appeared to be satisfactory. When tests of different difficulty were equated, the IRT methods were superior and the linear methods clearly inferior. Empirically, Jaeger73 identified five indices: (1) Similarity of Cumulative Score Distributions, (2) Shape of the Raw-Score to Scaled Score Transformations, (3) Consistency of Linear and Equipercentile Equating Results, (4) Similarity of Item Difficulty Distributions, and (5) Similarity of Item Discrimination Distributions. These indices should logically discriminate between situations in which the linear equating method adequately adjusts for differences between score distributions and situations in which a more complex solution is needed. He found that linear methodology discriminated in four of the five indices and acknowledged that item difficulty plays a role in reducing the adequacy of the linear model.

In another study, Skaggs and Lissitz74 explored the differences between linear, equipercentile, and IRT test equating methods. Using a Monte Carlo approach, their results revealed that the Rasch model lacked robustness when the equal discrimination assumption was violated, and that the recommended procedure for tests similar to those in the study was the equipercentile method.
