
An Investigation on Reliability and Reference Values among Healthy Controls in the UDDGait Study

By Maja Blomberg & Agnes Widenfalk

Department of Statistics, Uppsala University

Supervisors: Lars Berglund & Anna Cristina Åberg

2020


Abstract

Dementia disorders are difficult to detect in the early stages of the disease. It is therefore important to develop new methods for early diagnostics. The Uppsala-Dalarna Dementia and Gait Study has developed a new method for diagnosing early stages of cognitive impairment. The new method, the Timed-Up-and-Go (TUG) dual-task, combines a gait test with a verbal task. This thesis analyses nine TUG variables for healthy controls and discusses differences between two age groups, subjects younger than 72 years and subjects 72 years or older. Reliability of the new method is the primary focus and is assessed using Intraclass Correlation Coefficients (ICC) and Bland-Altman plots. Normative reference values for the continuous variables are estimated with the help of bootstrap confidence intervals for the 2.5 and 97.5 percentiles. The results show that the three variables measuring the time to complete the test have good to excellent reliability, while the variables that measure combinations of gait and verbal tasks show poor to moderate reliability. The lower reliability of the latter variables could be explained by their being ratios or differences of other variables. Differences in reliability can be seen between the age groups, where younger subjects have lower reliability, partly due to the homogeneity of this group. The results also show that the reference values of healthy controls differ between the two age groups.

Keywords: UDDGait Study, TUG Dual-Task, Reliability, Reference Values, ICC, Bland-Altman, Bootstrap Confidence Intervals.


Acknowledgements

Lars Berglund, Associate Professor. Department of Public Health and Caring Sciences, Uppsala University.

We would like to dedicate a special thanks to Lars Berglund, who reached out and gave us the opportunity to work on this project. Lars has been an important support throughout the process of writing this thesis.

Anna Cristina Åberg, Associate Professor. Department of Public Health and Caring Sciences, Uppsala University.

We would also like to thank Anna Cristina Åberg, principal investigator of the UDDGait Study, for her confidence in letting us be a part of the study. Anna Cristina contributed with the data, which enabled the work of this thesis.


Table of content

1. INTRODUCTION AND RESEARCH AIM
1.1 INTRODUCTION
1.2 AIM AND RESEARCH QUESTION
2. BACKGROUND
2.1 THE UDDGAIT STUDY AND THE TIMED-UP-AND-GO PROCEDURE
2.2 IMPORTANT CONCEPTS
2.2.1 Reliability
2.2.2 Reference Values
3. DATA
3.1 DATASETS
3.2 VARIABLE DESCRIPTIONS
4. PREVIOUS RESEARCH AND JUSTIFICATION OF METHODOLOGY
4.1 METHODOLOGIES FOR RELIABILITY ASSESSMENT
4.1.1 Types of Reliability
4.1.2 Justification of Reliability Measure
4.2 RESAMPLING METHODS
5. METHODOLOGY
5.1 INTRACLASS CORRELATION COEFFICIENTS
5.1.1 ICC Calculations
5.1.2 ICC Inference
5.1.3 Interpretation of ICC
5.2 BLAND-ALTMAN LIMITS OF AGREEMENT
5.3 BOOTSTRAP
5.3.1 Bootstrap Confidence Intervals
5.4 SOFTWARE USED
6. RESULTS
6.1 DESCRIPTIVE STATISTICS
6.2 RELIABILITY ASSESSMENT
6.2.1 Data Transformation
6.2.2 ICC Estimates
6.2.3 ICC Comparison
6.2.4 Bland-Altman Plots
6.3 REFERENCE VALUES
6.3.1 Reference Value Estimations
7. DISCUSSION
8. CONCLUSION
9. REFERENCES
10. APPENDIX
10.1 DESCRIPTIVE STATISTICS
10.2 ANOVA SUMMARIES


1. Introduction and Research Aim

1.1 Introduction

Finding methods to aid the screening and identification of early stages of dementia is extremely important. Early diagnosis is vital for the afflicted, since it provides access to symptom-relieving measures as soon as possible, which can slow the development of the disorder. The Uppsala-Dalarna Dementia and Gait (UDDGait) Study has developed a novel method to identify early cognitive impairment and dementia disorders. Dementia disorder is an irreversible chronic illness leading to cognitive decline with undermining effects on daily activities. It impairs memory, orientation, learning skills, judgement and verbal functioning, among other cognitive capacities. Dementia is caused by physical changes to the brain. The most common dementia disorder is Alzheimer's disease.

However, there are less severe cognitive impairment conditions, such as mild cognitive impairment (MCI) and subjective cognitive impairment (SCI) (Cedervall et al., 2020). The World Alzheimer Report released in 2011 estimates that approximately 75% of people suffering from dementia disorders are not yet diagnosed, and that the majority of those who are diagnosed receive their diagnosis in later stages of the disease (Prince et al., 2011). Therefore, it is important to increase the efficiency of screening and early identification in order to increase the quality of life for the afflicted.

The most established screening test for cognitive functions in clinical settings is, at this moment, the Mini Mental State Examination (MMSE). The MMSE gives a rough estimate of the subject's cognitive ability. The subject is asked to answer some questions, draw figures and repeat different words given to them by the test leader (Arevalo-Rodriguez et al., 2015). According to the scoring system, a subject with a score of 26 or higher has no to mild incipient dementia, a score between 20 and 25 indicates mild dementia, a score between 10 and 19 indicates moderate dementia, and a subject with 0 to 9 points is considered to have severe dementia. However, the MMSE lacks reliable measurement of cognitive capability. Meanwhile, studies linking cognitive functions and control of gait, i.e. human movement, suggest that incorporating gait testing into memory assessment may improve early identification. It has been observed that reduced cognitive function is associated with poor performance in the Timed-Up-and-Go (TUG) test. The TUG test is a movement sequence test and is further explained in section 2.1 (Podsiadlo & Richardson, 1991). The test has shown good reliability for people with Alzheimer's disease, and gait speed may be able to predict a decline in future cognitive abilities.
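The MMSE scoring bands described above can be summarized as a simple mapping. The following is an illustrative sketch only (the function name and band labels are our own, not part of the MMSE or the UDDGait materials):

```python
def mmse_category(score: int) -> str:
    """Map an MMSE score (0-30) to the severity bands described above."""
    if not 0 <= score <= 30:
        raise ValueError("MMSE scores range from 0 to 30")
    if score >= 26:
        return "no to mild incipient dementia"
    if score >= 20:
        return "mild dementia"
    if score >= 10:
        return "moderate dementia"
    return "severe dementia"
```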

It is evident from research on the topic that individuals with dementia disorders change their gait when completing tasks that require dual attention (Cedervall et al., 2020). Therefore, the UDDGait Study tested combining the gait test with a verbal task to serve as a novel method for identifying early dementia. The study subjects are divided into four groups depending on the stage of their dementia disorder. The first group consists of subjects with dementia, the second of subjects with MCI, the third of subjects with SCI, and the last group is the healthy controls. The pilot study was published in early 2020 and contained a TUG test with only the verbal task of naming animals (Cedervall et al., 2020). However, the test has since been expanded with a second verbal task: reciting the months in reverse order. The tasks and their designations are presented in sections 2.1 and 3.2 of this thesis.

With the addition of the new verbal task, the UDDGait Study has performed several tests on healthy controls, with some of the controls repeating the tests a second time.

1.2 Aim and Research Question

On behalf of the Department of Public Health and Caring Sciences, Uppsala University, the primary research aim of this thesis is to assess reliability for the healthy controls in the UDDGait Study.

The secondary aim is to estimate the reference values of the TUG variables for the healthy controls.

When estimating reference values, the 2.5 and 97.5 percentiles will be presented, since they define a 95% range of the healthy controls. Previous parts of the UDDGait Study have found differences in the predictive ability of the method for different ages. To further investigate these differences in relation to age, an analysis of how the two age groups perform will be conducted. To reach the aim of this study, the following questions will be answered:

What is the reliability for the TUG variables in the UDDGait Study?

What are the reference values for the continuous TUG variables in the UDDGait Study?


2. Background

This section will present information about the UDDGait Study and important concepts that are useful for understanding the research aim of this thesis. First, the Timed-Up-and-Go procedures of the UDDGait Study will be presented in detail. The key concepts reliability and reference values will then be reviewed.

2.1 The UDDGait Study and the Timed-Up-and-Go Procedure

The UDDGait Study has the purpose of providing a screening test that can aid early detection of dementia disorders (Cedervall et al., 2020). The study uses four different groups of subjects: dementia patients, MCI patients, SCI patients and, lastly, healthy controls. The three patient groups and the healthy controls were measured at baseline. The groups with MCI and SCI at baseline were followed up at 2.5 years and will be followed up 4 and 8 years after baseline to record dementia diagnoses. To achieve early detection of dementia, a procedure combining a Timed-Up-and-Go (TUG) test with a verbal task has been developed. The combination of TUG and a verbal task is called the TUG dual-task (TUGdt) procedure. The study used the same TUG performance test together with two different verbal tasks to be able to evaluate any differences between the groups at baseline, and to find the optimal procedure for predicting dementia development in MCI and SCI patients. The research subject first completes the TUG test alone, i.e. without the verbal task, also called the TUG single-task (TUGst). In the TUGst, the research subject starts in a sitting position in an armchair, and three meters from the chair there is a marked line on the floor. Instructions are then given to the subject: “walk at a self-selected, comfortable speed, pass the marked line, turn around and walk back to the chair and sit down” (Cedervall et al., 2020). The test is timed: the time starts when the subject’s back leaves the backrest of the chair and stops when the subject’s posterior touches the chair’s seat again after completing the task.

When the TUG task has been successfully completed, the subject gets the instructions for the TUGdt.

The TUG procedure is repeated but now with the addition of naming different animals while performing the TUG task. If the subject cannot think of any animals, they are told to complete the mobility task and thus prioritize the TUG task. The time starts and stops at the same positions as in the first TUG task. All procedures are recorded by video to enable data extraction. As mentioned in the introduction, the UDDGait Study extended their verbal test and added a second verbal task, naming the months of the year in reverse order. The subjects first complete the TUGdt task naming different animals and after that, they name the months in reversed order while performing the TUG test. The two tests are assessed in the same way.

The video recordings were analysed by the same person, and if uncertainties occurred, they were discussed with a second person. The recordings made it possible to go through the procedure several times to validate the correct numbers of animals and months recited in reverse order. Moreover, the videos can also be used to validate the time measured and to observe qualitative deviations from the instructions, e.g. questions to the test leader.

2.2 Important Concepts

2.2.1 Reliability

Reliability is the consistency of a measurement. A measure with high reliability will show greater consistency when the measurement is repeated (Hair et al., 2014). Consistency, replicability, precision of measurement and agreement are all concepts that can be used as synonyms for reliability (Weir, 2005).

When assessing reliability, it is important to decide which kind of reliability is most relevant in the given situation. Further information on this is presented in section 4.1.1. Assessing the reliability of a new medical method is important, because clinical use requires that the method is based on solid evidence (Gwet, 2008). That is, before implementing TUGdt in clinical use it is important to make sure that the procedure yields replicable results. Repeatability of the results within clinical research ensures a higher degree of representation of the particular characteristic of interest (Gwet, 2008). The reliability of measurements is also important when concluding whether a change in a subject's results over time is a real or nominal variation. Furthermore, a lack of reliability can result in impaired predictive ability when the resulting variables are used in a predictive regression model (Berglund, 2012). For these reasons it is important to assess reliability before using the method of interest for subsequent analyses.

The performance of a subject will not be exactly the same at the two measurement points, due to biological variability of the subject, differences in the setting, etc. The long-term average for the same subject is denoted the usual value (Berglund, 2012). The total measurement error is defined as the deviation from the usual value. The deviation from the usual value at different measurement points can have different causes. When the repeated measurements of the subject are made with a time difference of 1 week to 1 month, the deviation can be considered the sum of technical and biological measurement error (Berglund, 2012). The biological error represents differences in daily features of the subject, while the technical error represents differences due to variability in the measurement device. In the UDDGait Study, the second measurement is made two weeks after the first one, and the reliability assessed in this case therefore reflects the total measurement error, i.e. the sum of the two components.

2.2.2 Reference Values

This thesis defines Reference Values as the percentile values that construct a 95% range of the healthy controls. To find the values that define the range of 95% of the healthy controls, the 2.5 and 97.5 percentiles will be estimated. The range within these reference values is denoted the reference range. In medical research, reference ranges refer to the variability of humans in sickness and in health (Harris & Boyd, 1995). When examining healthy controls, the reason for estimating the reference ranges is to find the extent of variability that still represents a healthy control. The variability among healthy controls represents an ideal standard. This ideal standard, the reference range of healthy controls, can then be used to detect sickness (Harris & Boyd, 1995). The variability of humans can be caused by genetic differences, physiological processes, environmental factors and more (Solberg, 1987). This thesis will emphasize the reference values, since they are important limits for detecting deviating performances. This thesis will use the term Reference Values, but other notations such as Reference Limits are also associated with the same concept (Solberg, 1987). In the UDDGait Study, the healthy controls have completed the different TUG tasks (for the definition of healthy controls, see section 3.1). Using the performances of these healthy controls, it is possible to estimate the reference values for the healthy controls and define which results are classified as “healthy”. These estimated reference values can then be used as comparisons to detect deviating performance from non-healthy subjects.
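The basic computation behind a reference range is the estimation of the 2.5 and 97.5 percentiles from the sample of healthy controls. A minimal sketch, using simulated stand-in data (not the UDDGait data) and an assumed sample size of 166 to mirror the study:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stand-in for a continuous TUG time variable (seconds);
# NOT the UDDGait data.
tug_times = rng.normal(loc=10.0, scale=2.0, size=166)

# Reference values: the 2.5th and 97.5th percentiles bound the
# central 95% range of the healthy controls.
lower, upper = np.percentile(tug_times, [2.5, 97.5])
print(f"95% reference range: [{lower:.2f}, {upper:.2f}] seconds")
```

A healthy-control performance falling outside this range would be flagged as deviating, which is how the reference range is used for comparison against non-healthy subjects.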


3. Data

In this chapter the datasets used in this thesis will be presented and all the variables used will be defined.

Data descriptions such as the sample size for each dataset, the measurement notation and a clarification of the concept of healthy controls will be included.

3.1 Datasets

The data used in this study is from the UDDGait Study and consists of two datasets. The first dataset contains 166 healthy controls, with two discrete and seven continuous variables measured at one point in time. In addition to the TUG-variables the datasets also include the variables age, gender and educational level of the subject. The second dataset is a subset of the first dataset where 43 out of the 166 healthy subjects conducted the same test a second time approximately two weeks after the first one.

Hence, this dataset contains two measurements of the TUG variables: the first measurement, denoted M1, and the second measurement, denoted M2.

The definition of healthy controls is “individuals without cognitive impairments”. The criteria for a healthy control are a subjective perception of normal cognitive function and a Mini Mental State Examination (MMSE) score of more than 26. The subjects were tested on these criteria before they were included in the study. Other criteria for inclusion were the ability to walk three meters back and forth and to rise from a sitting position, no use of indoor walking aids, no current or recent hospitalization (within the last two weeks), and the ability to communicate in Swedish.

The subjects were recruited through advertisements and flyers during the period May 2017 to March 2019 in Uppsala, Sweden.

3.2 Variable Descriptions

The variables used in this study are extracted from the three TUG procedures. These are the TUG single-task (TUGst) and the two TUG dual-tasks (TUGdt), where the subjects go through the TUG procedure naming animals and reciting, in reversed order, the months of the year. The test scores for each variable were assessed and revised through analyses of the video recordings made during the test procedure.

These tests result in two discrete and seven continuous variables. Characteristics such as the total time of the procedures, the number of correct months and animals named, the relative cost (in time) between the single and dual task, and the number of correct animals and months per 10 seconds have been quantified in these nine variables. All the variables in the datasets are described in detail in Table 1 below.


As stated before, for the subset with 43 subjects, two measurements of the same variables have been measured but at two different points in time. Therefore, the subset dataset contains 18 variables in addition to the three background variables with information about gender, age and educational level.

For simplicity, the variables in this dataset use the same notation as in the full dataset, seen in Table 1, with the exception that M1 or M2 is added to indicate which measurement point is presented.

In line with the research aim, a further examination is made of the differences between the two age groups. The first age group includes the younger subjects, defined as subjects younger than 72 years. The second age group is the older subjects, defined as subjects aged 72 years or older. The groups are split at 72 years of age because this was the median age across all four groups in the UDDGait Study, i.e. dementia, MCI, SCI and healthy controls, at baseline.

TABLE 1. DESCRIPTIONS OF VARIABLES

Age: Age of the subject, measured in years.

Gender: Gender, 1 if Male and 2 if Female.

Education: Education, 1 if University/College level or 2 if lower education.

TUGst: TUG single task. Measures the total time of the TUG procedure in seconds.

TUGdt NA: TUG dual task naming animals. Measures the total time of the procedure in seconds.

TUGdt MB: TUG dual task reciting the months in reversed order. Measures the total time of the procedure in seconds.

TUGdt NA, number of animals: TUG dual task naming animals. Measures the correct number of animals named during the procedure.

TUGdt MB, number of months: TUG dual task reciting the months in reversed order. Measures the correct number of reversed months named during the procedure.

TUGdt NA, cost%: TUG dual task naming animals. Measures the TUGdt cost, i.e. the relative time difference, calculated as 100*(TUGdt NA − TUGst)/TUGst.

TUGdt MB, cost%: TUG dual task reciting months in reversed order. Measures the TUGdt cost, i.e. the relative time difference, calculated as 100*(TUGdt MB − TUGst)/TUGst.

TUGdt NA, animals/10s: TUG dual task naming animals. Measures the number of animals per 10 seconds, i.e. the average number of correct animals per second times 10. Calculated as 10*(TUGdt NA, number of animals / TUGdt NA).

TUGdt MB, months/10s: TUG dual task reciting months in reversed order. Measures the correct number of months per 10 seconds, i.e. the average number of correct months per second times 10. Calculated as 10*(TUGdt MB, number of months / TUGdt MB).

NA denotes naming animals, MB is months backwards, dt is dual task and st denotes single task.
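The derived variables in Table 1 (the dual-task cost and the per-10-seconds rates) follow directly from the formulas given there. A minimal sketch in Python, using hypothetical example values rather than the UDDGait data:

```python
def dual_task_cost(tugdt_time: float, tugst_time: float) -> float:
    """Relative time cost (%) of the dual task: 100*(TUGdt - TUGst)/TUGst."""
    return 100.0 * (tugdt_time - tugst_time) / tugst_time

def rate_per_10s(n_correct: int, tugdt_time: float) -> float:
    """Correct items (animals or months) per 10 seconds: 10*n/time."""
    return 10.0 * n_correct / tugdt_time

# Hypothetical subject: TUGst 10 s, TUGdt NA 14 s with 7 animals named.
print(dual_task_cost(14.0, 10.0))  # 40.0 -> TUGdt NA, cost%
print(rate_per_10s(7, 14.0))       # 5.0  -> TUGdt NA, animals/10s
```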


4. Previous Research and Justification of Methodology

This section will present previous research on various methods for answering the research questions of this thesis. It includes a discussion of which reliability assessment method is most suitable for the UDDGait Study. Methods used for estimating reference values, such as resampling methods, will also be covered.

4.1 Methodologies for Reliability Assessment

4.1.1 Types of Reliability

Previous research on reliability has established three types of reliability that are relevant when assessing a measurement method (Rousson et al. 2002; Koo & Li 2016). Which type of reliability is most relevant depends on the context of the method and the measurement procedure. The three types are interrater, test-retest and intrarater reliability, defined as follows (Koo & Li, 2016):

Interrater: Reflects the variation between two or more raters who measure the same group of subjects.

Test-retest: Reflects the variation in measurements taken by an instrument on the same subject under the same conditions. It is generally indicative of reliability in situations where raters are not involved or the rater effect is negligible, such as a self-report survey instrument.

Intrarater: Reflects the variation of data measured by one rater across two or more measurements.

A rater is the person or technical equipment that assesses scores or evaluates the subject during the test.

In the UDDGait Study, a single rater collects the data from the TUGdt using video recordings, and the task is repeated once, approximately two weeks after the first measurement. Thus, the reliability study contains data measured at two points in time. Given the aim of the study, and that the objective is reliability assessment of the method rather than of the rater (Weir, 2005), the reliability assessment in this thesis focuses on test-retest reliability.

Two concepts with implications for the interpretation of reliability are agreement and correlation. A reliability measure should reflect both the absolute consistency (agreement) and the relative consistency (correlation) between measurement points (Koo & Li 2016; Weir 2005). The absolute consistency is the agreement between the scores of the same subject, while the relative consistency reflects the consistency of a subject's rank relative to the other subjects (Weir, 2005). It is possible to have perfect relative consistency and still have poor absolute consistency; this is illustrated in Figure 1 below (Lee et al., 1989).


Figure 1. Examples of two measurement methods (P and M) with perfect relative consistency but different degrees of absolute consistency (Lee et al., 1989).

Figure 1, from Lee et al. (1989), shows three examples, all with perfect relative consistency (correlation, r = 1) but different degrees of agreement. The perfect relative consistency can be seen in all examples, since the subjects are ranked in the same way by both measurement methods P and M. That is, subject one has the lowest value, each subsequent subject has a higher value, and subject five has the highest value. In example one there is a difference of 20 between the measurement methods, which means that there is not perfect agreement, even though the relative consistency is perfect. In example three there are no differences between the methods, and perfect agreement can be observed. This illustrates how the two concepts, relative consistency and agreement, differ from each other.
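The distinction can also be reproduced numerically. A minimal sketch in the spirit of the Lee et al. (1989) example, with made-up values for two hypothetical methods P and M where M reads a constant 20 units higher than P:

```python
import numpy as np

# Five subjects measured with two hypothetical methods, P and M,
# where M reads a constant 20 units higher than P.
p = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
m = p + 20.0

r = np.corrcoef(p, m)[0, 1]   # relative consistency (correlation)
mean_diff = np.mean(m - p)    # absolute (dis)agreement

print(round(r, 6))   # 1.0  -> perfect relative consistency
print(mean_diff)     # 20.0 -> but a constant disagreement of 20
```

Pearson correlation is invariant to a constant shift, so it reports perfect consistency here despite the two methods never agreeing on any subject.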

4.1.2 Justification of Reliability Measure

According to Lexell and Downham (2005), there has been an increasing interest in assessing the reliability of clinical methods since the beginning of the 21st century. Previous research on reliability has established that the total measurement error can be considered a combination of systematic and random error (e.g. Rousson et al. 2002; Lexell & Downham 2005). The systematic error, i.e. bias, occurs when the test subjects on average perform better or worse in the second test (Lexell & Downham, 2005).

For example, this can be a result of the subject knowing the procedure and therefore performing better in the second test, or of the subject being tired and therefore performing worse. The random error represents errors that do not move in any systematic direction between the two measurement points (Rousson et al., 2002). Depending on the type of reliability of interest, the systematic error can have different effects on the interpretation of reliability.

In their 2005 article How to Assess the Reliability of Measurements in Rehabilitation, Lexell and Downham present some statistical methods for assessing test-retest reliability. The Pearson Correlation Coefficient and the Intraclass Correlation Coefficient (ICC) are two types of correlation coefficients that can be used for assessing test-retest reliability. Lexell and Downham (2005) argue that the ICC is preferred over the Pearson Correlation Coefficient. The main difference between these two reliability measures, and how they should be used, goes back to the concepts of absolute consistency (agreement) and relative consistency (correlation), which were presented in section 4.1.1 (Lee et al., 1989). The purpose of the Pearson Correlation Coefficient is to determine the relative consistency between two variables. Therefore, it does not detect differences in agreement between the variables (Yen & Lo, 2002).

A lack of agreement between the two measurement points can indicate a systematic error between these points, and hence the Pearson Correlation Coefficient will not detect this systematic error (Yen & Lo, 2002). Some forms of ICC do capture systematic errors and are therefore more suitable when such errors may be present (Lee et al., 1989). The ICC is sensitive to the relative ranking of subjects and captures both systematic and random errors, which makes it well suited for reliability assessment (Liljequist et al., 2019).

Rousson et al. (2002) argues that systematic errors in a test-retest setting are often due to learning effects that improve the results in the second test, or to fatigue that impairs the results of the second test. Since these effects are not related to the method being evaluated but rather to features of the subjects, they should not be considered measurement errors and should not be part of the reliability assessment. Rousson et al. (2002) therefore argues that the Pearson Correlation Coefficient should be used in a test-retest setting, since learning effects and fatigue are negligible systematic errors. Weir (2005) also argues that when physical performance is tested, the systematic error is often due to learning effects or fatigue. Weir (2005) describes an ongoing discussion regarding how these kinds of systematic errors should be treated in the test-retest context. Different forms of ICC handle systematic errors in different ways, and it is possible to use an ICC that does not take systematic error into account (Weir 2005; McGraw & Wong 1995). Even though there is a discussion regarding the handling of systematic error, the ICC is very commonly used as a reliability measure in the test-retest setting.

Apart from the emphasis on relationship rather than agreement and systematic error, there are other limitations of the Pearson Correlation Coefficient which make it less useful for reliability assessment. The Pearson Correlation Coefficient is constructed with the purpose of determining the relationship between two variables, not the relationship between two measurements of the same variable (Yen & Lo, 2002). Using the Pearson Correlation in this context would therefore be theoretically unsuitable, and the ICC, which is constructed to measure relationships within the same variable, is a better choice (Yen & Lo, 2002). Moreover, the Pearson Correlation Coefficient measures linear relationships between two variables, which is why test-retest situations with more than one retest cannot be summarized with a single correlation coefficient. The ICC, on the other hand, has calculation formulas that can handle this issue and can therefore estimate a single coefficient for all measurement points, tests and retests (Yen & Lo, 2002). The use of ICC also enables a more general interpretation of the correlation compared to the Pearson Correlation Coefficient (Lin et al., 2012). This can be exemplified using pairs of brothers. The Pearson Correlation could, for example, estimate the correlation between the heights of the older and the younger brother. The ICC, on the other hand, estimates a correlation which measures similarity between brothers in general, and not only in relation to their age (Lin et al., 2012). Therefore, the ICC is more useful when seeking answers about a more general correlation.

As discussed, one advantage of the ICC is its ability to detect systematic errors. However, the ICC does not provide any information on the size of this systematic error (Grafton et al., 2005). That is, the estimated ICC and its confidence interval can indicate poor reliability for both large and small actual systematic errors. Similarly, the ICC does not provide information on any pattern of discord (Lee et al., 1989). The pattern of errors is of interest since a clear pattern indicates non-random errors and thus a less reliable method. For example, the size of the error may grow with higher values of the variable. To remedy this shortcoming of the ICC, other methods should be used as a complement, for example the Bland-Altman analysis (Grafton et al. 2005; Lee et al. 1989). Bland-Altman figures make it possible to visually determine the size of the systematic error and to detect any patterns in the errors. The Bland-Altman analysis will be used in this thesis as a complement to the ICC, to quantify the systematic error and investigate any patterns of discord.
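The quantities behind a Bland-Altman analysis can be sketched briefly: the bias is the mean of the paired differences, and the conventional 95% limits of agreement are the bias plus or minus 1.96 standard deviations of the differences. This is an illustrative sketch with hypothetical data, not the thesis's actual analysis code:

```python
import numpy as np

def bland_altman(m1: np.ndarray, m2: np.ndarray):
    """Return the mean difference (bias) and 95% limits of agreement
    for two repeated measurements of the same subjects."""
    diff = m1 - m2
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample SD of the differences
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)
    return bias, loa

# Hypothetical test-retest times (seconds) for five subjects.
m1 = np.array([9.8, 11.2, 10.5, 12.0, 9.1])
m2 = np.array([10.1, 11.0, 10.9, 11.6, 9.5])
bias, (lower, upper) = bland_altman(m1, m2)
```

In the plot itself, the differences are plotted against the pairwise means, with horizontal lines at the bias and the two limits of agreement; a trend in that scatter reveals the error patterns the ICC cannot show.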

This study will use the ICC as its measure of reliability, since the design of the UDDGait Study does not exclude the possibility of a systematic error. As previously discussed, different forms of ICC handle systematic error differently. In the UDDGait Study it is considered useful to detect any systematic error, even though this error might be the result of a learning effect or fatigue. If a systematic error is present, it should be detected so that it can be discussed whether or not the effect is negligible. It can thus be concluded that an ICC form which detects systematic error is desired for the UDDGait Study.

The ICC(A,1) is such a coefficient, which will be explained and illustrated by the inclusion of column effect seen in Equation 4 in section 5.1.1. This ICC form is appropriate when wanting to include the systematic error in the reliability assessment (Liljequist et al., 2019).

The choice of the methods used in this thesis can be further justified by looking at previous test-retest reliability studies on new methods within medicine. For example, Grafton et al. (2005) examine the test-retest reliability of the short-form McGill Pain Questionnaire (MPQ) for patients with osteoarthritis. To get an assessment as accurate as possible, Grafton et al. (2005) use a combination of methods when estimating reliability, namely the ICC and Bland-Altman scatterplots. Another study assessing test-retest reliability in a clinical setting is Holmefur et al. (2009). That study evaluates the test-retest reliability of the Assisting Hand Assessment (AHA) for children with unilateral disabilities, investigating how they use their affected hand when playing two different board games. To assess the test-retest reliability of the method, Holmefur et al. (2009) use several methods, including the ICC and a Bland-Altman approach with plots of differences, means and limits of agreement.


4.2 Resampling Methods

Resampling is a class of methods that base inference on a sampling distribution constructed from repeated resampling from the sample itself, rather than a theoretical sampling distribution (Yu, 2002).

The use of theoretical sampling distributions requires assumptions about the sample distribution that cannot always be fulfilled. In these cases, resampling methods using empirical sampling distributions are good alternatives (Yu, 2002). Bootstrap is one of these resampling techniques. Its roots go as far back as the 1940s, and it was made practical with the use of Monte Carlo approximation. In the late 1970s, Bradley Efron (1979) published a paper in the Annals of Statistics where he defined a new resampling procedure and coined the term bootstrap. Efron's bootstrap was constructed as a simple approximation to the jackknife procedure, an earlier resampling method developed by John Tukey. Remarkably, the bootstrap performed as well as, or even better than, the jackknife, especially for larger sample sizes, but also in multiple other situations. Ordinary bootstrapping samples with replacement n times from a sample of size n, i.e. there are $n^n$ possible bootstrap samples. Since some of the bootstrap observations are equivalent, they are variations of each other according to the exchangeability assumption (Chernick, 2011). Efron (1979) was not the first to suggest the use of Monte Carlo methods for resampling. However, drawing several Monte Carlo samples of size n with replacement from the original observations was an innovation. Efron (1979) was also the first to connect bootstrapping to the jackknife, cross-validation, permutation tests and the delta method, alongside comparing it with these methods in estimating the standard error of an estimator.

The resampling method jackknife was first introduced by Quenouille (1949) and later popularized by Tukey (1958). The jackknife method has an important use in estimating the standard error of an estimate, and it is most suitable for small sample sizes. While the bootstrap uses repeated samples to estimate variability, the jackknife uses pseudovalues to estimate it (Chernick, 2011). Pseudovalues are values that resemble, but are not, the true values, and they do not truly belong to the data being studied. The jackknife treats the pseudovalues as if they were independent and identically distributed with the same mean as the sample mean. Another resampling method is cross-validation, which is mostly used in statistical modelling (Chernick, 2011). Cross-validation can, for example, be used to find the order of an autoregressive time series model or to decide which variables to use in multiple linear or logistic regression. It is even useful when deciding the number of components in a mixture model. Through simulation comparisons, Efron (1983) showed that the use of bootstrap bias correction provided an estimate of the classification error rate that was better than the leave-one-out cross-validation approach proposed by Lachenbruch and Mickey (1968). The sample size was small, the classifications were restricted to two or three classes only, and the predicting features were multivariate normally distributed.
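To make the pseudovalue idea concrete, the following is a minimal Python sketch (illustrative only, not part of the thesis analysis, which was done in R). For the mean, the pseudovalues reduce to the observations themselves, so the jackknife reproduces the usual standard error $s/\sqrt{n}$:

```python
import math

def jackknife_se(data, statistic):
    """Jackknife standard error of `statistic` via leave-one-out pseudovalues."""
    n = len(data)
    theta_hat = statistic(data)
    # Leave-one-out estimates theta_(-i)
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    # Pseudovalues: n * theta_hat - (n - 1) * theta_(-i)
    pseudo = [n * theta_hat - (n - 1) * t for t in loo]
    mean_pseudo = sum(pseudo) / n
    # Treat pseudovalues as i.i.d. and take the SE of their mean
    var_pseudo = sum((p - mean_pseudo) ** 2 for p in pseudo) / (n - 1)
    return math.sqrt(var_pseudo / n)

# Toy sample (hypothetical values, not UDDGait data)
sample = [10.2, 11.5, 9.8, 12.1, 10.9, 11.3]
se = jackknife_se(sample, lambda xs: sum(xs) / len(xs))
```

For more complex statistics (e.g. ratios), the pseudovalues no longer coincide with the observations, which is where the jackknife becomes genuinely useful.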


This thesis will use resampling for different purposes. The main purpose is to estimate confidence intervals for the healthy controls when the distributions are not normal and the estimators are complex.

For the estimation of reference values, the bootstrap is used to estimate confidence intervals for the population values of the percentiles. The International Federation of Clinical Chemistry recommends the use of the bootstrap for estimation of reference values (Solberg, 2004). When studying the ICC confidence intervals, it is uncertain whether the assumption of normality is fulfilled. Therefore, bootstrap confidence intervals are estimated as a comparison to the parametric confidence intervals. When comparing ICC estimators with each other, resampling methods are useful since there is no theoretical distribution for differences of ICCs from the same sample. The choice of Efron's bootstrap over the jackknife or cross-validation follows from the discussion in the sections above: the jackknife is mainly used for estimating standard errors and cross-validation for model selection, whereas the bootstrap is an improvement of these resampling methods and was considered the method best fitted for the purposes of this thesis.
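As an illustration of how bootstrap confidence intervals for reference-value percentiles can be obtained, the sketch below (Python, standard library only; the data are simulated and purely hypothetical, not UDDGait measurements) resamples the observations with replacement and collects the 97.5th sample percentile from each bootstrap sample:

```python
import random

def bootstrap_percentile_ci(data, pct, n_boot=2000, alpha=0.05, seed=1):
    """Percentile-method bootstrap CI for the pct-th sample percentile."""
    rng = random.Random(seed)
    n = len(data)

    def nearest_rank(xs, q):
        # Simple nearest-rank percentile estimator
        xs = sorted(xs)
        return xs[min(n - 1, max(0, round(q / 100 * (n - 1))))]

    # Bootstrap distribution of the percentile estimate
    boot = sorted(
        nearest_rank([data[rng.randrange(n)] for _ in range(n)], pct)
        for _ in range(n_boot)
    )
    return boot[int(alpha / 2 * n_boot)], boot[int((1 - alpha / 2) * n_boot) - 1]

# Simulated TUG-like times in seconds (hypothetical, n = 166)
sim = random.Random(0)
times = [sim.gauss(10.5, 2.3) for _ in range(166)]
low, high = bootstrap_percentile_ci(times, 97.5)
```

The same loop with `pct=2.5` gives the interval for the lower reference limit.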


5. Methodology

This section will provide more information on the statistical methods that have been selected, as justified in section 4. Three main methods have been used in this thesis and will therefore be presented: the Intraclass Correlation Coefficient, the Bland-Altman limits of agreement and the bootstrap resampling method.

5.1 Intraclass Correlation Coefficients

One of the methods used for reliability assessment in this thesis is the Intraclass Correlation Coefficient (ICC). The general mathematical formula for the ICC as a measure of reliability is the ratio between the variance between subjects and the variance between subjects plus the error variance (Koo & Li 2016, Weir 2005). From Equation 1 below it follows that the ICC measures reliability on a scale between 0 and 1. If the error variance is 0, the reliability is 1. If, on the other hand, the error variance is increasingly large, the reliability will take on a value close to 0.

$$\mathrm{ICC} = \frac{\sigma_r^2}{\sigma_r^2 + \sigma_e^2} \qquad (1)$$

When further evaluating the reliability, the error variance can be separated to better explain the deviation from the usual value. The error variance, $\sigma_e^2$, can be split up into systematic error and random error, as discussed in section 4.1.2 (Weir 2005). Using the notation of McGraw and Wong (1996), the systematic error is represented by the column effect, $\sigma_c^2$, which in this study corresponds to the two measurement points. The random error is denoted $\sigma_{re}^2$. The mathematical formula describing reliability when the error variance is split into systematic and random errors is presented in Equation 2.

$$\mathrm{ICC} = \frac{\sigma_r^2}{\sigma_r^2 + \sigma_c^2 + \sigma_{re}^2} \qquad (2)$$

5.1.1 ICC Calculations

From the general formula of reliability measured as ICC, different calculation formulas have been derived, since the actual population variances are unknown. There are several types of ICC, and to get an accurate result it is important that the right type of ICC is chosen in relation to the setting of the UDDGait Study (Koo & Li 2016). As mentioned in section 4.1.1, the reliability assessment of this study is a test-retest reliability setting, and this will be the starting point for the choice of ICC. As an aid in finding the right ICC form, McGraw and Wong (1996) present a flowchart for easy guidance, which is shown in Figure 2 below.


Figure 2. Flowchart of the ICC selection process (McGraw & Wong, 1996).

Figure 2 shows the ten different types of ICC that are presented by McGraw and Wong (1996). The ten ICC variations are combinations of four features and are an extension of the ICC theory presented by Shrout and Fleiss (1979). First, a decision is made on the use of a one-way or two-way model. In the context of a test-retest situation, a two-way model should be used, since two time points are crossed with subjects and the two-way model allows for a distinction between systematic and random error (Weir 2005). This implies that the two-way model should be used in the reliability assessment of the UDDGait Study. Secondly, considerations must be made regarding whether the model should be a random or a fixed (mixed) effects model. Repeated measures of the same subjects are not considered random, and therefore the column effects are fixed (Koo & Li 2016). Since the subjects are randomly selected from a larger population, subject is a random effect, which implies that a mixed effects model should be used.

Thirdly, the correct ICC form needs to consider the intended clinical use of the method. If the clinical use will be based on the assessment of scores on a single measurement, a single measures ICC should be used. If the assessment of scores will instead be based on an average of several measurements, an average measures ICC should be used. The selection between these ICC types is hence made on the intended clinical use, and the number of measurement points in the reliability study is therefore irrelevant. The clinical use of TUGdt will be based on a single measurement point, and therefore a single measures model should be used. Lastly, the choice between an agreement and a consistency form should be made. When working with test-retest reliability, an agreement model should always be used, because the agreement of values between the two measurement points is often the main interest (Koo & Li 2016). Putting together these requirements, a two-way mixed effects single measurement agreement ICC should be used to evaluate the test-retest reliability of the TUG tests. Using the notation of McGraw and Wong (1996), this form is called the ICC(A,1).

Regardless of the ICC method chosen, the calculation of the ICC always starts with a repeated measures ANOVA. The ANOVA output yields the mean squares from the model specified in Equation 3, split up into subjects (rows), measurements (columns) and random error. This output is required for the subsequent calculation of the ICC, whichever form of ICC is chosen for the reliability assessment. The linear ANOVA model specified for the two-way mixed effects single measurement agreement form is presented in Equation 3 below (McGraw & Wong, 1996):

$$x_{ij} = \mu + r_i + c_j + e_{ij}, \qquad i = 1,2,\dots,n,\; j = 1,2,\dots,k \qquad (3)$$

Here $x_{ij}$ represents the value of the variable of interest for individual $i$ at measurement point $j$; $\mu$ is the population mean of all observations of variable $x$; $r_i$ is the individual effect, i.e. the row effect for individual $i$; $c_j$ is the effect of measurement $j$, i.e. the column effect of column $j$; and $e_{ij}$ denotes the random error, the variability that cannot be explained by either row or column effect, for individual $i$ at measurement point $j$. In the setting of this study there are 43 individuals, $n = 43$, and two measurement points, i.e. $k = 2$.

The ANOVA model assumes both the row and error effects to be random, independent and normally distributed with mean zero and constant variances (McGraw & Wong, 1996). To investigate whether the data meet the requirement of normality, a Shapiro-Wilk test of normality will be performed. To decide whether a distribution is close enough to normal to perform the ANOVA, the W statistic will be used. Variables with a W statistic larger than 0.95 will be considered normally distributed and therefore not in need of any transformation (Helmersson-Karlqvist et al. 2013, Simmons et al. 2017). Variables with a W statistic less than 0.95 will be transformed to meet this requirement. The investigation of normality will be made on the large dataset, i.e. the healthy controls dataset with $n = 166$. Since the smaller dataset is a subset of the larger dataset, normality in the larger dataset is assumed to be sufficient to perform an accurate ANOVA analysis for the small dataset as well. The other assumption of the ANOVA analysis is equal variances across groups. This assumption will be checked using the Fligner-Killeen test.
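The screening rule described above can be sketched as follows (Python with SciPy's `shapiro`; the thesis itself worked in R, and the two variables here are simulated stand-ins, not UDDGait data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative variables: one roughly normal, one right-skewed,
# screened with the W > 0.95 rule of thumb used in the thesis.
normal_like = rng.normal(10.5, 2.3, size=166)
skewed = rng.lognormal(mean=2.4, sigma=0.7, size=166)

w_normal, _ = stats.shapiro(normal_like)   # expected to pass the cut-off
w_skewed, _ = stats.shapiro(skewed)        # expected to fail the cut-off

# A variable failing the cut-off would be transformed (e.g. log) and re-tested
w_logged, _ = stats.shapiro(np.log(skewed))

needs_transform = w_skewed <= 0.95
```

A log transform is a natural first choice for the positive, right-skewed time and cost-type variables, since the log of a lognormal variable is exactly normal.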

With the ANOVA output it is possible to estimate the relevant variances presented in Equation 1 and thus calculate the chosen ICC form. The mathematical formula used for calculating the ICC(A,1) is given by Equation 4 below (McGraw & Wong, 1996):


$$\mathrm{ICC}(A,1) = \frac{\sigma_r^2}{\sigma_r^2 + \theta_c^2 + \sigma_e^2} = \frac{MS_R - MS_E}{MS_R + (k-1)MS_E + \dfrac{k}{n}(MS_C - MS_E)} \qquad (4)$$

In the ICC(A,1) equation, $\sigma_r^2$ is the total variance of the row (subject) effect, $\theta_c^2$ is the variance within subjects due to the fixed column effect, and $\sigma_e^2$ is the variance within subjects due to the random error. This ratio equals the rightmost expression in Equation 4, where $MS_R$ is the mean square of the row effect and $MS_E$ is the mean square of the random error from the ANOVA output. In the denominator, the notations $MS_C$, $k$ and $n$ also appear: $MS_C$ is the mean square of the column effect, $k$ is the number of measurements per subject and $n$ is the number of subjects.
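A minimal numerical sketch of this calculation (Python/NumPy; the data matrix is simulated and purely illustrative, not UDDGait data) derives the mean squares of a two-way ANOVA without replication and plugs them into Equation 4:

```python
import numpy as np

def icc_a1(x):
    """ICC(A,1) per Equation 4, from an (n subjects x k occasions) matrix."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    # Mean squares of a two-way ANOVA without replication
    ms_r = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_c = n * ((col_means - grand) ** 2).sum() / (k - 1)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k / n * (ms_c - ms_e))

rng = np.random.default_rng(0)
true_scores = rng.normal(11, 2.5, size=43)
# Two measurement occasions; occasion 2 carries a small systematic shift
data = np.column_stack([
    true_scores + rng.normal(0, 0.5, 43),
    true_scores + 0.3 + rng.normal(0, 0.5, 43),
])
icc = icc_a1(data)
```

With two identical columns the residual and column mean squares vanish and the function returns exactly 1, as the definition requires.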

Several different kinds of ICC measures can share the same calculation formula; the interpretation is, however, dependent on the chosen form of ICC (McGraw & Wong, 1996). Liljequist et al. (2019) conclude that the two-way random and two-way mixed effects models share the same calculation formula for agreement measures and that a distinction between the two models is unnecessary since they give the same result. The distinction can nevertheless be useful, since it gives more information about the context of the method investigated.

5.1.2 ICC Inference

The calculation of the ICC only provides an estimate of the reliability in the sample. To be able to draw conclusions about the reliability in the population, statistical inference about the estimated ICC sample value needs to be performed (Koo & Li, 2016). McGraw and Wong (1996), who presented the ten forms of ICC seen in Figure 2, suggest that a 95% confidence interval could be used for such inference.

The calculation of a confidence interval for ICC(A,1) should be made using the formulas presented below (McGraw & Wong, 1996).

Lower limit

$$\frac{n(MS_R - F_{n-1,v}\,MS_E)}{F_{n-1,v}\left[k\,MS_C + (kn - k - n)\,MS_E\right] + n\,MS_R} \qquad (5)$$

and upper limit

$$\frac{n(F_{v,n-1}\,MS_R - MS_E)}{k\,MS_C + (kn - k - n)\,MS_E + n\,F_{v,n-1}\,MS_R} \qquad (6)$$

Formulas 5 and 6 show the calculations of the lower and upper limits of a $(1-\alpha)100\%$ confidence interval for the ICC(A,1). $F_{n-1,v}$ denotes the $(1-\frac{\alpha}{2}) \cdot 100$th percentile of the F distribution with $n-1$ numerator degrees of freedom and $v$ denominator degrees of freedom, and $F_{v,n-1}$ denotes the $(1-\frac{\alpha}{2}) \cdot 100$th percentile of the F distribution with $v$ numerator and $n-1$ denominator degrees of freedom. For a 95% confidence interval, $\alpha = 0.05$. The component $v$ is calculated using Equations 7 and 8 below:

$$v = \frac{(a\,MS_C + b\,MS_E)^2}{\dfrac{(a\,MS_C)^2}{k-1} + \dfrac{(b\,MS_E)^2}{(n-1)(k-1)}} \qquad (7)$$

where

$$a = \frac{k\hat{\rho}}{n(1-\hat{\rho})}, \qquad b = 1 + \frac{k\hat{\rho}(n-1)}{n(1-\hat{\rho})}, \qquad (8)$$

and $\hat{\rho}$ is the sample estimate of the ICC.
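Equations 5-8 can be chained into a short routine; the sketch below (Python with SciPy's F quantile function; the mean squares are invented round numbers, not UDDGait output) computes the point estimate from Equation 4 and its 95% interval:

```python
import numpy as np
from scipy.stats import f as f_dist

def icc_a1_ci(ms_r, ms_c, ms_e, n, k, alpha=0.05):
    """ICC(A,1) point estimate and CI following McGraw & Wong (1996)."""
    icc = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k / n * (ms_c - ms_e))
    # Equation 8: helper constants based on the point estimate
    a = k * icc / (n * (1 - icc))
    b = 1 + k * icc * (n - 1) / (n * (1 - icc))
    # Equation 7: Satterthwaite-type degrees of freedom
    v = (a * ms_c + b * ms_e) ** 2 / (
        (a * ms_c) ** 2 / (k - 1) + (b * ms_e) ** 2 / ((n - 1) * (k - 1))
    )
    f_l = f_dist.ppf(1 - alpha / 2, n - 1, v)   # F_{n-1,v}
    f_u = f_dist.ppf(1 - alpha / 2, v, n - 1)   # F_{v,n-1}
    # Equations 5 and 6: lower and upper limits
    lower = n * (ms_r - f_l * ms_e) / (
        f_l * (k * ms_c + (k * n - k - n) * ms_e) + n * ms_r
    )
    upper = n * (f_u * ms_r - ms_e) / (
        k * ms_c + (k * n - k - n) * ms_e + n * f_u * ms_r
    )
    return icc, lower, upper

# Hypothetical mean squares for n = 43 subjects, k = 2 occasions
icc, lo, hi = icc_a1_ci(ms_r=12.0, ms_c=0.8, ms_e=0.6, n=43, k=2)
```

Note that the interval is asymmetric around the point estimate, which is typical for reliability coefficients bounded above by 1.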

5.1.3 Interpretation of ICC

The general interpretation of the ICC(A,1) is the "degree of absolute agreement for measurements made under the fixed levels of the column factor" (McGraw & Wong, 1996). By the mathematical definition of the ICC presented in Equations 1 and 2, an ICC value close to 1 indicates good reliability and a value close to 0 indicates poor reliability. This intuitive interpretation does not, however, provide a comprehensive framework. The interpretation of the ICC depends on the intended use of the method whose reliability is assessed, but there are some suggested guideline values (Liljequist et al., 2019). Koo and Li (2016) suggest interpretations of certain ICC ranges, given that the correct ICC form is chosen. As discussed in section 5.1.2, conclusions for the population should be drawn from the inference made with the ICC and not only from the point estimate; hence the estimated confidence intervals should be the basis of interpretation. The guidelines of Koo and Li (2016) are presented in Table 2.

TABLE 2. A GUIDELINE OF INTERPRETATION FOR ICC (Koo & Li, 2016).

ICC value Interpretation

<0.5 Poor reliability

0.5-0.75 Moderate reliability

0.75-0.90 Good reliability

>0.9 Excellent reliability

The guideline in Table 2 is one suggested way of interpreting the ICC and will be used in this study. It should, however, be noted that this is a recommendation rather than a rule, and other researchers have suggested other cut-off points (Liljequist et al., 2019).


The interpretation of the ICC should also take into account other aspects that can affect the ICC value (Lexell & Downham, 2005). One such consideration is that a very homogeneous sample yields a small total variance, which can lead to a small ICC even though the measurement accuracy is good (Lexell & Downham, 2005). The intuition behind this can be understood by looking at the general expression for the ICC in Equation 1: if the row variance is very small, i.e. the sample is homogeneous, even a small error variance will have a large impact on the ICC value. Another feature that may affect the ICC concerns variables that are calculated from other values, such as sums and differences (Rousson et al., 2002). A variable representing a sum can give a higher ICC value than the ICCs of the components of the sum, while variables representing differences can give a lower ICC estimate than the ICCs of the terms. For variables constructed as a ratio of two other variables, the reliability of the ratio can be poorer than the reliability of the numerator and the denominator separately (Nordhamn et al., 2000). The reliability of the ratio is often lower when the values of the numerator and denominator are positively correlated and their measurement errors are uncorrelated. To get an accurate interpretation of the ICC value, the features described above should be considered together with the estimated value.

As seen in Equation 4, the calculation formula for ICC(A,1) includes a term which accounts for the column effect (i.e. the systematic error), denoted $\theta_c^2$. The systematic error can be a result of differences in the measurement setting, but in the test-retest setting it is often due to a learning effect (Weir, 2005). In that case the reliability will appear poor even though the procedure itself is highly reliable. If learning effects are present, i.e. systematic errors are observed between the measurement points, measurements can be added until a plateau of learning is reached (Weir 2005). This makes the learning effect negligible, so that the reliability assessment only measures the random error. The possibility of significant measurement effects should be considered when designing the reliability study (Weir 2005). It is hence important to study the setting of the method and analyse whether a systematic error can be a consequence of learning effects or whether the method itself lacks reliability.

5.2 Bland-Altman Limits of Agreement

In medical research it is important to compare measurements from different methods, to see whether a new method can replace an old one or whether they are interchangeable. Bland-Altman plots can also be used to compare the same method of measurement at different measurement occasions. The Bland-Altman method can be used to visualise the mean difference between two measurement points. Bland and Altman (1983) promoted the use of plotting the difference between the measurement scores (M1 − M2) against the average of the scores (M1 + M2)/2. An example of this type of plot is shown below in Figure 3. According to Bland and Altman, the analysis ought to be based on the difference between two measurements on the same subject. In the case of the UDDGait Study, the Bland-Altman plots will be used for the same subjects but with two different measurement occasions, to see whether the results are interchangeable between the measurement points or whether they are significantly different. In the original Bland-Altman method, the mean difference is the estimated bias, i.e. the systematic error between methods, while the limits of agreement measure random fluctuations around the mean difference (Bland & Altman, 1995).

It is also recommended to use 95% limits of agreement, i.e. the mean difference ± 2 standard deviations of the differences, which show how far apart the results from two measurements are likely to be for 95% of the individuals. This implies that the assumption of normally distributed differences needs to be fulfilled before continuing. The upper and lower 95% limits of agreement are calculated with Formula 9 below (Bland & Altman, 1986).

Mean difference ± 1.96 × Standard deviation of the differences    (9)

The mean difference is the mean of the first measurement minus the mean of the second measurement, i.e. (mean of M1 − mean of M2). The value 1.96 is taken from the standard normal distribution and gives the range covering 95% of the subjects, i.e. approximately ± 2 standard deviations. When the differences are not normally distributed, the 95% limits of agreement will instead be calculated using bootstrap estimation. For the limits to be meaningful estimates, the mean and standard deviation of the differences must be assumed to be reasonably constant throughout the range of measurements. This assumption is easiest to check graphically through a plot. Plotting the difference against one of the measurements is ineffective since the points tend to cluster, so Bland and Altman propose that the difference should be plotted against the average of the measurements. The average score of the two measurement points is also the best estimate of the usual value, i.e. the long-term average, of the individual. From this type of plot it becomes easier to see the magnitude of bias and errors, to spot outliers and to see any trend (Bland & Altman, 1983). Plotting the points enables the observer to see patterns between the measurements and is a powerful way of displaying the results of the comparison. The Bland-Altman plot quantifies the range of agreement within which 95% of the subjects are found. However, clinical goals need to be used to evaluate whether this agreement range is sufficient or not (Giavarina, 2015).

Figure 3. Bland-Altman plot (Bland & Altman, 1986).
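The quantities drawn as horizontal lines in such a plot can be computed in a few lines; the sketch below (Python/NumPy, with simulated data carrying a small built-in learning effect, not UDDGait measurements) returns the bias and the 95% limits of agreement from Formula 9:

```python
import numpy as np

def bland_altman(m1, m2):
    """Bias and 95% limits of agreement for paired measurements."""
    m1, m2 = np.asarray(m1), np.asarray(m2)
    diffs = m1 - m2
    means = (m1 + m2) / 2           # x-axis of the Bland-Altman plot
    bias = diffs.mean()             # estimated systematic error
    sd = diffs.std(ddof=1)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # Formula 9
    return bias, loa, means, diffs

rng = np.random.default_rng(7)
occasion1 = rng.normal(11, 2.5, size=43)
# Occasion 2 is slightly faster on average, mimicking a learning effect
occasion2 = occasion1 - 0.4 + rng.normal(0, 0.8, size=43)
bias, (lo, hi), means, diffs = bland_altman(occasion1, occasion2)
```

Plotting `diffs` against `means` and adding horizontal lines at `bias`, `lo` and `hi` reproduces the layout of Figure 3.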


To investigate whether the bias is significant or not, a confidence interval for the mean difference can be added to the plot (Giavarina, 2015). The confidence interval is based on the sampling distribution of the mean difference. When the differences are normally distributed, the confidence interval is based on the theoretical sampling distribution of the mean (Giavarina, 2015). If the differences are not normally distributed, the confidence interval cannot be estimated using the theoretical sampling distribution; these confidence intervals will instead be estimated using the bootstrap method.

5.3 Bootstrap

Bootstrap is a resampling technique used to estimate parameters such as the mean, median or standard deviation from the existing data. For example, the population mean is denoted $\theta$, the sample estimate is denoted $\hat{\theta}$, and $\hat{\theta}^*$ is the bootstrap estimate of $\theta$ based on the bootstrap samples. Bootstrapping is helpful when constructing confidence intervals. The premise of the bootstrap is to use only the data available and not to introduce extraneous assumptions.

For independent observations from the same population distribution, the empirical distribution, which gives equal weight to each data point, is the basic building block of bootstrapping. In this case the mean and variance of the population distribution function $F$ can be written as integrals over all possible values of $x$: $\mu = \int x \, dF(x)$ and $\sigma^2 = \int (x - \mu)^2 \, dF(x)$. For a sample of size $n$ from the population distribution $F$, the empirical distribution $F_n$ takes the role of $F$ in the resampling process. Since the observations are independent and identically distributed, this process is simply random sampling with replacement from the original data (Chernick, 2011). The bootstrap process is repeated multiple times to obtain a histogram of values of the parameter estimate, e.g. the mean; this is called the Monte Carlo approximation to the bootstrap distribution. The average of the bootstrap values should be close to the sample estimate of the parameter. As mentioned earlier, there can be $n^n$ distinct bootstrap samples, which becomes very large even for small $n$; for example, $n = 10$ gives 10 billion possible bootstrap samples. In practice a Monte Carlo approximation is used, and the larger the number of Monte Carlo simulations $M$, the closer the histogram comes to the actual bootstrap distribution. In conclusion, the Monte Carlo approximation first generates samples with replacement from the empirical distribution of the data, then computes the bootstrap estimates from these bootstrap samples in place of the sample estimate, and repeats this process $M$ times.

5.3.1 Bootstrap Confidence Intervals

Early on, Efron acknowledged the application of bootstrapping to confidence intervals and hypothesis testing, along with more complex problems (Chernick, 2011). Efron's percentile method is one of the most recognisable ways of constructing confidence intervals for bootstrap-estimated parameters. The 90% percentile method that Efron coined removes the lowest 5% and highest 5% of the bootstrap estimates to create a reasonable confidence set; other lower and upper percentages can be specified as well. This method is intuitive and easy to understand. However, for heavy-tailed or asymmetric distributions in small samples, the percentile method can be less favourable, and modifications might be necessary. One improvement on the percentile method is the bootstrap percentile-t method, which is also simple and easily computed (Chernick, 2011). In this thesis, both the percentile and the percentile-t methods will be used. The percentile-t method will be used when estimating the reference values for healthy controls, as well as for the limits of agreement and the confidence intervals for the biases in the Bland-Altman plots. The percentile method is used when the statistic of interest is complicated and an easily interpretable confidence interval is desired; for example, it will be used when estimating ICC confidence intervals.

The most apparent way to construct a confidence interval based on bootstrap estimates is the percentile method. Let $\hat{\theta}^*_i$ be the $i$th bootstrap estimate, based on the $i$th bootstrap sample, where each bootstrap sample is of size $n$. By equivalence to random subsampling, if the bootstrap estimates are ordered from smallest to largest, the middle interval containing 95% of the $\hat{\theta}^*_i$ is expected to be a 95% confidence interval for $\theta$ (Chernick, 2011). In the percentile-t method, the parameter $\theta$ can be specified as, e.g., the population mean, and the estimate $\hat{\theta}$ is then the sample mean. When $\theta$ is a more complex parameter than the mean, the bootstrap distribution of $\hat{\theta}$ will not be known, and the Monte Carlo approximation is needed when generating confidence intervals. As before, we have a parameter $\theta$, an estimate $\hat{\theta}$, and a bootstrap estimate $\hat{\theta}^*$ based on a bootstrap sample. Let $S^*$ be the estimated standard deviation of the bootstrap sample estimate $\hat{\theta}^*$, and define $T^* = (\hat{\theta}^* - \hat{\theta})/S^*$ (Chernick, 1999). Because $\theta$ might be more complicated than the mean, Monte Carlo simulation is used to draw $B$ bootstrap samples, and for each bootstrap sample, estimates of $\hat{\theta}^*$ and $T^*$ are calculated. The percentile method is then applied to $T^*$ rather than $\hat{\theta}^*$. In other words, an approximate two-sided $100(1-2\alpha)\%$ confidence interval for $\theta$ is obtained as $[\hat{\theta} - T^*_{(1-\alpha)}S,\; \hat{\theta} - T^*_{(\alpha)}S]$, where $T^*_{(1-\alpha)}$ is the $100(1-\alpha)$ percentile and $T^*_{(\alpha)}$ the $100\alpha$ percentile of $T^*$, and $S$ is the estimated standard deviation of $\hat{\theta}$. This is the two-sided $100(1-2\alpha)\%$ percentile-t confidence interval (Chernick, 2011).
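The two constructions can be compared side by side; the sketch below (Python, standard library only, with a toy sample and the mean as the statistic, so both intervals have a known target) computes a percentile and a percentile-t interval from the same set of bootstrap samples:

```python
import random
import statistics

def bootstrap_cis(data, b=2000, alpha=0.025, seed=5):
    """Two-sided bootstrap CIs for the mean: percentile and percentile-t."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistics.mean(data)
    s_hat = statistics.stdev(data) / n ** 0.5     # estimated SE of the mean
    thetas, ts = [], []
    for _ in range(b):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        theta_star = statistics.mean(resample)
        s_star = statistics.stdev(resample) / n ** 0.5
        thetas.append(theta_star)
        # Studentized statistic T* = (theta* - theta_hat) / S*
        ts.append((theta_star - theta_hat) / s_star)
    thetas.sort(); ts.sort()
    lo_i, hi_i = int(alpha * b), int((1 - alpha) * b) - 1
    pct_ci = (thetas[lo_i], thetas[hi_i])
    # Percentile-t: [theta_hat - T*(1-a) * S, theta_hat - T*(a) * S]
    t_ci = (theta_hat - ts[hi_i] * s_hat, theta_hat - ts[lo_i] * s_hat)
    return pct_ci, t_ci

data = [10.2, 11.5, 9.8, 12.1, 10.9, 11.3, 10.0, 12.4, 9.5, 11.8]
pct_ci, t_ci = bootstrap_cis(data)
```

For symmetric data the two intervals are similar; for skewed data the studentized interval typically corrects for the asymmetry that the plain percentile interval misses.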

5.4 Software used

To apply the methods above to the data from the UDDGait Study, the statistical software RStudio has been used. The packages used for the ICC calculations are "psych" and "psy", the former for ICC estimates and confidence intervals and the latter for bootstrap estimates. The package "BlandAltmanLeh" is used to calculate the Bland-Altman statistics and to construct the Bland-Altman plots. The packages "boot" and "bootstrap" are used for constructing the bootstrap resamples and confidence intervals.


6. Results

This section will present the results of the data analysis. Firstly, descriptive statistics for the variables in the datasets will be presented and highlighted, including any differences between the age groups. Following this, results of the reliability assessments with ICC estimates and Bland-Altman plots will be presented. Lastly, results of the reference value estimations for the healthy controls and the two age groups of healthy controls will be presented.

6.1 Descriptive Statistics

Descriptive statistics are important because they give a better understanding of the nature of the data and of how the subjects perform. The tables presented in this section show the descriptive statistics for the two datasets, also broken down by age group.

TABLE 3. DESCRIPTIVE STATISTICS FOR HEALTHY CONTROLS, n=166

                               Min      Median   Max      Mean     Sd
Age                            50.00    70.00    91.00    69.51    10.70
TUGst                           6.10    10.05    24.10    10.45     2.26
TUGdt NA                        5.84    10.97    26.67    12.05     3.50
TUGdt MB                        6.09    11.15    25.80    12.18     3.63
TUGdt NA, number of animals     3.00     8.00    15.00     8.01     1.97
TUGdt MB, number of months      3.00     9.00    13.00     9.10     2.16
TUGdt NA, cost%               -10.24     9.87   100.49    14.59    17.22
TUGdt MB, cost%               -17.85    11.47   113.45    15.84    19.42
TUGdt NA, animals/10s           2.14     6.73    12.39     7.01     2.06
TUGdt MB, months/10s            1.91     7.84    13.29     7.87     2.23

NA denotes naming animals, MB naming months backwards, dt dual task and st single task.

Table 3 shows descriptive statistics for the nine TUG variables and the age variable. The variables TUGst, TUGdt NA and TUGdt MB measure the time it takes to complete the TUG procedure, which means that a low value of these variables indicates a better result. According to Table 3, all three time variables range over roughly the same interval, from a minimum of about 6 to a maximum of approximately 25. Even though the ranges are similar, the mean for TUGst is 10.45, while the means for TUGdt NA and TUGdt MB are 12.05 and 12.18, respectively. This indicates that the average time to perform the TUG dual-tasks is longer than the average time to perform the TUG single-task in this sample. For the variables measuring the number of correctly named animals and months, TUGdt NA, number of animals and TUGdt MB, number of months, a higher number means more correct words named. These two variables also range over approximately the same values, both with a minimum of 3 and with maxima of 15 for TUGdt NA, number of animals and 13 for TUGdt MB, number of months. The mean of TUGdt MB, number of months is 9.10 while it is 8.01 for TUGdt NA, number of animals, indicating that these subjects are on average able to recite slightly more correct months than animals.

References
