Some properties of measures of disagreement and disorder in paired ordinal data

To my beloved wife Monica

Örebro Studies in Statistics 4

HANS HÖGBERG

© Hans Högberg, 2010

Title: Some properties of measures of disagreement and disorder in paired ordinal data.

Publisher: Örebro University 2010 www.publications.oru.se

trycksaker@oru.se

Print: Intellecta Infolog, Kållered 11/2010

ISSN 1651-8608

ISBN 978-91-7668-769-7

Abstract

Hans Högberg (2010): Some properties of measures of disagreement and disorder in paired ordinal data. Örebro Studies in Statistics 4, 38 pp.

The measures studied in this thesis were a measure of disorder, D, and a measure of the individual part of the disagreement, the measure of relative rank variance, RV, proposed by Svensson in 1993. The measure of disorder is a useful measure of order consistency in paired assessments of scales with a different number of possible values. The measure of relative rank variance is a useful measure in evaluating reliability and for evaluating change in qualitative outcome variables.

In Paper I, an overview was given of methods used in the analysis of dependent ordinal data, together with a comparison of the methods regarding their assumptions, specifications, applicability, and implications for use. In Paper II, some standard models, tests, and measures were applied to two different research problems and their results compared. The sampling distribution of the measure of disorder was studied both analytically and by a simulation experiment in Paper III. The asymptotic normal distribution was shown by the theory of U-statistics, and the simulation experiments for finite sample sizes and various amounts of disorder showed that the sampling distribution was approximately normal for sample sizes of about 40 to 60 for moderate sizes of D and for smaller sample sizes for substantial sizes of D. The sampling distribution of the relative rank variance was studied in a simulation experiment in Paper IV. The simulation experiment showed that the sampling distribution was approximately normal for sample sizes of 60-100 for moderate sizes of RV, and for smaller sample sizes for substantial sizes of RV. In Paper V, a procedure for inference regarding relative rank variances from two or more samples was proposed. Pair-wise comparison using the jackknife technique for variance estimation, and the use of the normal distribution as an approximation in inference for parameters in independent samples, based on the results in Paper IV, were demonstrated. Moreover, the Kruskal-Wallis test for independent samples and Friedman’s test for dependent samples were applied.

Keywords: agreement, augmented ranks, categorical data, disorder, jackknife, paired ordinal data, rating scales, sample size, sampling distribution, simulation, U-statistics, variance

Hans Högberg, Handelshögskolan

Örebro University, SE-701 82 Örebro, Sweden


List of papers

This thesis consists of an introductory part and the following five papers:

I. Högberg, H. and Svensson, E. An overview of methods in the analysis of dependent ordered categorical data: assumptions and implications. Working Papers, Swedish Business School, Örebro University, No. 2008:7

II. Högberg, H. and Svensson, E. Comparison of methods in the analysis of dependent ordered categorical data. Working Papers, Swedish Business School, Örebro University, No. 2008:6

III. Högberg, H. Statistical properties of a nonparametric measure of discordance in paired ordinal data. Manuscript.

IV. Högberg, H. Rank-based methods for analysis of individual variations in paired ordinal data. Manuscript.

V. Högberg, H. Statistical aspects on multiple comparisons of relative rank variance in paired ordinal data. Manuscript.

Contents

1 INTRODUCTION

2 AIMS OF THE THESIS

3 A THEORETICAL BACKGROUND

4 SUMMARY OF THE PAPERS

4.1 Paper I: An overview of methods in the analysis of dependent ordered categorical data: Assumptions and implications

4.2 Paper II: Comparison of methods in the analysis of dependent ordered categorical data

4.3 Paper III: Statistical properties of a nonparametric measure of discordance in paired ordinal data

4.4 Paper IV: Rank-based methods for analysis of individual variations in paired ordinal data

4.5 Paper V: Statistical aspects on multiple comparisons of relative rank variance in paired ordinal data

5 DISCUSSION AND CONCLUSION

ACKNOWLEDGMENTS

REFERENCES


1 Introduction

Rating scales are commonly used in clinical research for assessing qualitative outcomes [1-6]. A characteristic of rating scales is that they produce ordinal data [7-10]. According to Stevens [11], ordinal data have an ordered structure but lack information about size and distance. The ordered categories may be assigned numbers, letters or other labels. The choice of ordered labels assigned to the categories of the variable should not alter the results of statistical analysis [12-14]. This rank-invariant property should be reflected in the statistical methods used in the analysis. Rank-based statistical methods are thus appropriate to use [10, 13, 15-20].

When studying the quality of data from rating scales, agreement between repeated observations is desirable. Thus, agreement is an important concept to evaluate, and paired ordinal data typically arise. When analyzing paired ordinal data, the rank-invariant property has an important consequence: it is not appropriate to take differences of the paired observations [10, 12-14].

Standard statistical methods often address a specific type of disagreement. There are several tests of marginal homogeneity, and marginal models try to model the relations between the repeatedly observed variables, given assumptions about the variations within the observed pairs. On the other hand, specific tests and measures are constructed to measure the extent of agreement in the observed pairs. Some of the methods are parametric and dichotomize the scale categories, while other measures of agreement are conditional on marginal homogeneity.

Disagreement may arise when the categories of a rating scale are unclearly defined, when the rating task is not clearly stated, or when the cut-off points between adjacent categories are not clearly agreed upon [13, 20]. These are examples of reasons for systematic disagreement, and they refer to a property of the rating scale or the rating situation. Systematic disagreement may be corrected once the reasons have been identified. Disagreement may also arise from occasional, haphazard events on an individual basis in the rating situation of the subjects. Such disagreement is harder to correct. Thus, it is important to quantify and separate the systematic and individual disagreement. The reasons for systematic disagreement may require one kind of action while the remaining individual disagreement may require another kind of action [14, 21-23].

Svensson [13, 20] has developed an approach to simultaneously assess both systematic and individual disagreement based on an augmented ranking method. The approach and its measures are free from any distributional restriction on the data.


This thesis deals with two measures of individual disagreement: the measure of disorder and the measure of relative rank variance. Both measures have been used in many different studies regarding development of questionnaires and evaluations of validity, reliability and change in various application disciplines.

What are the main differences between these measures and other commonly used measures and methods in empirical studies? What are the characteristic properties and assumptions of the measures, tests and models? What are their implications for applicability? To what questions do they apply? What are the results from the different methods and do they agree? These questions comprise the starting point for this thesis and determined the aims of the first two papers.

Inferences about the corresponding population parameters have been based on estimates of variances, obtained either by an empirical counterpart to the theoretical variance, by jackknife estimates, or by the bootstrap technique [13, 20]. In order to further study valid inferences, the distributional properties were investigated in the third and fourth papers in this thesis. It was also considered important to develop methods for testing the difference of intra-rater agreement in different items in a multi-item questionnaire. This was done in the fifth paper.


2 Aims of the thesis

The overall aim of the present thesis was to further investigate the properties of the two measures of individual variability, developed by Svensson, such as the possibility of asymptotic normality, and to suggest approaches for interval estimation and tests. Moreover, an aim was to show the importance of considering the properties of dependent ordered categorical data in choosing methods for statistical analysis. This overall aim was formulated as the following five specific aims and presented in separate papers:

I. To overview and discuss the relative merits of standard methods and measures for analysis of dependent ordered categorical data and the measures in Svensson’s approach. The focus was on the assumptions of the models and data, the usefulness of the methods for descriptions and inferences, and their implications for use.

II. To apply various measures, models, and tests, including Svensson’s non-parametric approach, to two different data sets of paired ordinal data, and to compare and interpret the results and show their conditions for use.

III. To derive the asymptotic distributional properties of the empirical measures of disorder and monotonic agreement. Another aim was to investigate the distributional properties of these measures for sample sizes encountered in practice and to study how well the variance estimators compare to the sampling variance. A final aim was to apply the measures and two classical measures of concordance to an empirical data set.

IV. To study the distributional properties of the empirical measure of the relative rank variance RV for sample sizes met in practice. Another aim was to illustrate the methods for inference regarding the variance of the relative rank difference in an empirical data set.

V. To discuss and develop statistical methods for inference when comparing the individual disagreement measured by the measure of the relative rank variance RV between different items in a multi-item questionnaire. Another aim was to illustrate the methods in an empirical data set.


3 A theoretical background

The Svensson approach to analysis of paired ordinal data makes it possible to evaluate the systematic part of an observed disagreement in paired assessments on a rating scale separately from the individual variability of assessments [13, 20]. The observed agreement/disagreement pattern is described by the distribution of the pairs of data in a contingency table when the assessments are made on a rating scale having a discrete number of ordered categories, or by a scatter plot when assessments are made on a visual analogue scale with 101 possible values.

Figure 1 shows a contingency table with the notation used by Svensson and in the present thesis [13].

Figure 1. Schematic illustration of the basic notations in a contingency table with m categories used in formulas [13].

When the two sets of frequency distributions of assessments, also called marginal distributions, differ, systematic disagreement is present. Svensson [13, 20] defined two measures of systematic disagreement. The measure of Relative Position is a measure of a systematic shift in the use of the scale categories between the two assessments, that is, a case in which one frequency distribution is stochastically larger than the other


$$P(X < Y) - P(Y < X) \tag{1}$$

Another reason for a systematic disagreement could be a systematic disagreement in how the assessments are concentrated on the scale categories between the two assessments. The measure is called the Relative Concentration,

$$P(X_l < Y_k < X_m) - P(Y_l < X_k < Y_m) \tag{2}$$

These measures are thoroughly described in [13, 20, 24] and are not treated further in this thesis.

Svensson has shown that it is always possible to construct one unique distribution of pairs of data to each set of marginal distributions, which is a rank-transformable pattern of agreement [13, 20]. This pattern illustrates the distribution of paired data that is expected when all disagreement is explained by systematic disagreement only, given the observed marginal distributions.

In the rank-transformable pattern of agreement, each pair of assessments will have the same rank ordering when ranking the assessments X, and the assessments Y, respectively. Svensson [13, 14, 20] proposed an augmented ranking approach by which the pairs of rank values given to the observations are tied to the pairs and not to each marginal distribution. Then the mutual relationship between the paired assessments on the individual by the two raters is utilized.
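One way to picture the rank-transformable pattern of agreement is to sort the X assessments and the Y assessments separately and then pair them off in order, which keeps both marginal distributions and gives every pair the same rank ordering in X and in Y. The following Python sketch illustrates this idea under stated assumptions: the function name is hypothetical, the table is given as a list of lists with rows for the categories of X and columns for the categories of Y (both in increasing order), and this layout is an assumption rather than the notation of Figure 1.

```python
def rank_transformable_pattern(table):
    """Return the table with the same marginals in which sorted X is paired with sorted Y.

    Assumption: table[i][j] is the number of pairs with X in category i
    and Y in category j, categories indexed in increasing order.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]  # marginal distribution of X
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]  # marginal of Y

    # Expand each marginal into a sorted list of category indices, one entry per subject.
    x_sorted = [i for i, count in enumerate(row_totals) for _ in range(count)]
    y_sorted = [j for j, count in enumerate(col_totals) for _ in range(count)]

    # Pair the k:th smallest X with the k:th smallest Y.
    pattern = [[0] * n_cols for _ in range(n_rows)]
    for i, j in zip(x_sorted, y_sorted):
        pattern[i][j] += 1
    return pattern


# Hypothetical 2 x 2 table with one disordered cell combination.
observed = [[1, 1],
            [2, 0]]
print(rank_transformable_pattern(observed))  # [[2, 0], [1, 1]]
```

The returned table has the same row and column totals as the observed one, so any remaining difference between the two tables reflects individual variability rather than differences between the marginal distributions.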

Empirical data sets commonly have individual variations in repeated assessments on scales. Then the observed distribution of pairs of data differs from the rank-transformable pattern of agreement, and so do the two sets of ranks allocated to the pairs of data. Besides the measures of systematic disagreement, Svensson has proposed two measures of such individual variability in an observed disagreement pattern: the measure of disorder and the measure of relative rank variance [13, 25, 26].

The augmented mean ranks for assessments X are calculated by:

$$R_{ij}^{(X)} = \sum_{k=1}^{i-1} \sum_{l=1}^{m} x_{kl} + \sum_{l=1}^{j-1} x_{il} + \frac{1}{2}\left(1 + x_{ij}\right) \tag{3}$$

where $x_{ij}$ is the ij:th cell frequency. The augmented mean ranks for Y are defined correspondingly as:

$$R_{ij}^{(Y)} = \sum_{l=1}^{j-1} \sum_{k=1}^{m} x_{kl} + \sum_{k=1}^{i-1} x_{kj} + \frac{1}{2}\left(1 + x_{ij}\right) \tag{4}$$
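As an illustration of equations (3) and (4), the following Python sketch computes the two sets of augmented mean ranks cell by cell. It is a minimal sketch under assumptions: the function name is hypothetical, and the table is assumed to be a list of lists with rows indexed by the categories of X and columns by the categories of Y, both in increasing order.

```python
def augmented_mean_ranks(table):
    """Return (rank_x, rank_y), the augmented mean ranks of every cell.

    Assumption: table[i][j] is the frequency of the pair (X = i, Y = j).
    Values for empty cells are computed as well but carry no observations.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]

    rank_x = [[0.0] * n_cols for _ in range(n_rows)]
    rank_y = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            mid = 0.5 * (1 + table[i][j])  # mean rank position within the cell
            # X: all pairs in lower X categories, then lower Y within the same X category.
            before_x = sum(row_totals[:i]) + sum(table[i][:j])
            # Y: all pairs in lower Y categories, then lower X within the same Y category.
            before_y = sum(col_totals[:j]) + sum(table[k][j] for k in range(i))
            rank_x[i][j] = before_x + mid
            rank_y[i][j] = before_y + mid
    return rank_x, rank_y


# Hypothetical 2 x 2 table of four paired assessments.
rank_x, rank_y = augmented_mean_ranks([[1, 1], [2, 0]])
print(rank_x[1][0], rank_y[1][0])  # 3.5 2.5
```

Within each set, the ranks weighted by the cell frequencies sum to n(n + 1)/2, as for ordinary mid-ranks; the difference is that ties are broken by the other variable of the pair.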


Differences in augmented mean ranks indicate dispersed observations from the rank-transformable pattern of agreement and define the empirical measure of individual variability in disagreement, the relative rank variance (RV) [13, 20]. This is a normed estimate of the parameter of the variance of the relative rank differences. The relative rank variance, RV, is defined as

$$RV = \frac{6}{n^3} \sum_{i=1}^{m} \sum_{j=1}^{m} x_{ij} \left(R_{ij}^{(X)} - R_{ij}^{(Y)}\right)^2 = \frac{6}{n^3} \sum_{\upsilon=1}^{n} \left(R_{\upsilon}^{(X)} - R_{\upsilon}^{(Y)}\right)^2 \tag{5}$$

where $\upsilon$ is the $\upsilon$:th of $n$ subjects and which estimates the parameter

$$6 \sum_{i=1}^{m} \sum_{j=1}^{m} p_{ij} \left(q_{ij}^{ul} - q_{ij}^{lr}\right)^2 \tag{6}$$

where $p_{ij}$ is the ij:th cell probability, $q_{ij}^{ul}$ is the upper-left region probability, and $q_{ij}^{lr}$ is the lower-right region probability. The relative rank variance is the recommended measure of individual variability in test-retest assessments on the same scale, which is the common design for evaluation of reliability of assessments and for evaluation of change in qualitative outcome variables [20, 27, 28].
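The per-subject form of RV can be sketched directly from a list of paired assessments. The Python example below is illustrative only: the function name is hypothetical, the categories are assumed to be coded as increasing integers, and the augmented ranks are obtained by ordering the observed cells first by X and then by Y (and the reverse for Y), which is one way to realise the augmented ranking described above.

```python
from collections import Counter


def relative_rank_variance(pairs):
    """Return RV = 6 / n^3 * sum over subjects of (rank in X - rank in Y)^2.

    Assumption: pairs is a list of (x, y) tuples with integer-coded ordered categories.
    """
    n = len(pairs)
    counts = Counter(pairs)  # cell frequencies of the observed pairs

    def mean_ranks(order_key):
        # Walk through the cells in the given order and assign each cell its mean rank.
        ranks, seen = {}, 0
        for cell in sorted(counts, key=order_key):
            ranks[cell] = seen + 0.5 * (1 + counts[cell])
            seen += counts[cell]
        return ranks

    rank_x = mean_ranks(lambda cell: (cell[0], cell[1]))  # ties in X broken by Y
    rank_y = mean_ranks(lambda cell: (cell[1], cell[0]))  # ties in Y broken by X

    total = sum(counts[cell] * (rank_x[cell] - rank_y[cell]) ** 2 for cell in counts)
    return 6.0 * total / n ** 3


# Hypothetical data: four subjects, two categories per scale.
print(relative_rank_variance([(0, 0), (0, 1), (1, 0), (1, 0)]))  # 0.5625
print(relative_rank_variance([(0, 0), (1, 1), (2, 2)]))          # 0.0
```

When every pair keeps the same rank in both orderings, all rank differences vanish and the sketch returns zero, matching the interpretation of RV as dispersion from the rank-transformable pattern of agreement.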

In validity studies, the consistency in assessments on different scales of the same variable is evaluated. The two scales being compared can have different numbers of possible values, and the measure of disorder is then a useful measure of order consistency in the paired assessments.

The measure of disorder, proposed by Svensson [25, 26], defines the level of discordance relative to total agreement in ordering, irrespective of the scale levels and marginal distributions. This is in contrast to traditional measures like Kendall’s tau-b [29, 30], Stuart’s tau-c [31] and Goodman-Kruskal’s gamma [32]. Svensson [25] and others, e.g. [32, 33], have demonstrated how the different approaches adjust for ties in different ways and how the dependence of these measures on scaling and marginal distributions limits the possibility of attaining the limit of unity.
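The notion of disorder can be illustrated by counting, among all pairs of subjects, those whose two assessments are ordered in opposite directions on the two scales. The Python sketch below is a simplified illustration and not Svensson's estimator: the function name is hypothetical, the categories are assumed to be coded as increasing integers, and the denominator is the total number of pairs of subjects without any correction for ties.

```python
from itertools import combinations


def disordered_proportion(pairs):
    """Return the share of subject pairs ordered in opposite directions in X and Y.

    Assumption: pairs is a list of (x, y) tuples with integer-coded ordered categories.
    Pairs of subjects tied in X or in Y are not counted as disordered.
    """
    n = len(pairs)
    disordered = sum(
        1
        for (x1, y1), (x2, y2) in combinations(pairs, 2)
        if (x1 - x2) * (y1 - y2) < 0  # opposite order on the two scales
    )
    return disordered / (n * (n - 1) / 2)


# Hypothetical data: a fully reversed ordering gives the maximum value.
print(disordered_proportion([(0, 2), (1, 1), (2, 0)]))  # 1.0
```

Because the scales enter only through the ordering of the categories, the two scales may have different numbers of categories without changing how the count is made.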

The parameter is defined

$$D = \sum_{i=1}^{m} \sum_{j=1}^{m} p_{ij} \left(q_{ij}^{ul} + q_{ij}^{lr}\right) \tag{7}$$

Svensson [13] showed the expression for the variance. The parameter D is the parameter of reversed order classification and equals the parameter of disordered observation of which the measure D is an estimator, except for the correction

(15)

14 I HANS HÖGBERG Some properties of measures of disagreement…

) ( )

(X Y P Y X

P   

  (1)

Another reason for a systematic disagreement could be a systematic dis- agreement in how the assessments are concentrated on the scale categories be- tween the two assessments. The measure is called the Relative Concentration,

) (

)

(Xl Yk Xm PYl Xk Ym

P     

  (2)

These measures are thoroughly described in [13, 20, 24] and are not treated fur- ther in this thesis.

Svensson has shown that it is always possible to construct one unique distri- bution of pairs of data to each set of marginal distributions, which is a rank- transformable pattern of agreement [13, 20]. This pattern illustrates the distribu- tion of paired data that is expected when all disagreement is explained by sys- tematic disagreement only, given the observed marginal distributions.

In the rank-transformable pattern of agreement, each pair of assessments will have the same rank ordering when ranking the assessments X, and the assess- ments Y, respectively. Svensson [13, 14, 20] proposed an augmented ranking approach by which the pairs of rank values given to the observations are tied to the pairs and not to each marginal distribution. Then the mutual relationship between the paired assessments on the individual by the two raters is utilized.

Empirical data sets commonly have individual variations in repeated assess- ments on scales. Then the observed distribution of pairs of data differs from the rank-transformable pattern of agreement, and so do the two set of ranks allocated to the pairs of data. Besides the measures of systematic disagreement Svensson has proposed two measures of such individual variability in an observed dis- agreement pattern; the measure of disorder and the measure of relative rank variance [13, 25, 26].

The augmented mean ranks for assessments X are calculated by:



1

1 1

1 1 )

( (1 )

2 1

i k

m l

j

l il ij

X kl

ij x x x

R (3)

where xij is the ij:th cell frequency. The augmented mean ranks for Y are defined correspondingly as:

 

m

k j l

i

k kj ij

Y kl

ij x x x

R

1 1 1

1 1 )

( (1 )

2

1 (4)

HANS HÖGBERG Some properties of measures of disagreement… I 15

Differences in augmented mean ranks indicate observations dispersed from the rank-transformable pattern of agreement and define the empirical measure of individual variability in disagreement, the relative rank variance (RV) [13, 20].

This is a normed estimate of the parameter of the variance of the relative rank differences. The relative rank variance, RV, is defined as

$$RV = \frac{6}{n^{3}}\sum_{i=1}^{m}\sum_{j=1}^{m} x_{ij}\left(\bar{R}_{ij}^{(X)} - \bar{R}_{ij}^{(Y)}\right)^{2} = \frac{6}{n^{3}}\sum_{\upsilon=1}^{n}\left(R_{\upsilon}^{(X)} - R_{\upsilon}^{(Y)}\right)^{2} \tag{5}$$

where υ denotes the υ:th of n subjects. RV estimates the parameter

$$6\sum_{i=1}^{m}\sum_{j=1}^{m} p_{ij}\left(q_{ij}^{ul} - q_{ij}^{lr}\right)^{2} \tag{6}$$

where $p_{ij}$ is the ij:th cell probability, $q_{ij}^{ul}$ is the upper-left region probability, and $q_{ij}^{lr}$ is the lower-right region probability. The relative rank variance is the recommended measure of individual variability in test-retest assessments on the same scale, which is the common design for evaluation of reliability of assessments and for evaluation of change in qualitative outcome variables [20, 27, 28].
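The relative rank variance can be sketched in a few lines, assuming it is the weighted sum of squared differences between the augmented mean ranks for X and Y, scaled by 6/n³ as in Eq. (5). The function name `relative_rank_variance`, the table layout (rows are X categories, columns are Y categories) and the example frequencies are hypothetical.

```python
def relative_rank_variance(table):
    """Relative rank variance RV for a square table, a sketch of Eq. (5)."""
    m = len(table)
    n = sum(sum(r) for r in table)
    row_tot = [sum(r) for r in table]
    col_tot = [sum(table[i][j] for i in range(m)) for j in range(m)]
    total = 0.0
    for i in range(m):
        for j in range(m):
            x = table[i][j]
            if x == 0:
                continue  # empty cells contribute nothing to the sum
            half = 0.5 * (1 + x)
            # Augmented mean ranks of cell (i, j) for X and for Y.
            rx = sum(row_tot[:i]) + sum(table[i][:j]) + half
            ry = sum(col_tot[:j]) + sum(table[k][j] for k in range(i)) + half
            total += x * (rx - ry) ** 2
    return 6.0 * total / n ** 3


# Hypothetical 3 x 3 test-retest table with n = 11.
example = [[2, 1, 0], [1, 3, 1], [0, 1, 2]]
print(relative_rank_variance(example))
```

A table with all observations on the main diagonal gives RV = 0 under this sketch, since every pair then has identical augmented ranks in the two orderings.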

In validity studies, the consistency in assessments on different scales of the same variable is evaluated. The two scales being compared can have different numbers of possible values, and then the measure of disorder is a useful measure of order consistency in the paired assessments.

The measure of disorder, proposed by Svensson [25, 26], defines the level of discordance relative to total agreement in ordering irrespective of the scale levels and marginal distributions. This is in contrast to traditional measures like Kendall’s tau-b [29, 30], Stuart’s tau-c [31] and Goodman-Kruskal’s gamma [32].

Svensson [25] and others, e.g. [32, 33], have demonstrated how the different approaches adjust for ties in different ways and how the dependence of the measures on scaling and marginal distributions limits the possibility of attaining the limit of unity.

The parameter, written here as $\pi_{D}$, is defined as

$$\pi_{D} = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2} p_{ij}\left(q_{ij}^{ul} + q_{ij}^{lr}\right) \tag{7}$$

where $m_1$ and $m_2$ are the numbers of categories of the two scales. Svensson [13] showed the expression for the variance. The parameter $\pi_{D}$ is the parameter of reversed order classification and equals the parameter of disordered observations of which the measure D is an estimator, except for the correction factor for tied observations in the denominator. The empirical measure of the parameter $\pi_{D}$ can be written as

$$T = \frac{\sum_{i=1}^{m_1}\sum_{j=1}^{m_2} x_{ij}\left(x_{ij}^{ul} + x_{ij}^{lr}\right)}{n(n-1)} \tag{8}$$

and the variance

$$V(T) = \frac{2}{n(n-1)}\,\pi_{D}\left(1 - \pi_{D}\right) + \frac{4(n-2)}{n(n-1)}\left(\pi_{DD} - \pi_{D}^{2}\right) \tag{9}$$

where $\pi_{DD}$ is the probability that, out of three pairs, the second and the third pairs are both disordered relative to the first pair. This variance is then estimated by substituting the empirical relative frequencies for the unknown probabilities. In particular

$$\hat{\pi}_{DD} = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2} \hat{p}_{ij}\left(\hat{q}_{ij}^{ul} + \hat{q}_{ij}^{lr}\right)^{2} \tag{10}$$

provided that two such pairs, both disordered relative to the pair in the ij:th cell, exist.
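The plug-in estimate described above can be sketched as follows. The sketch assumes that the "upper-left" and "lower-right" regions of cell (i, j) are the cells with a lower X category and a higher Y category, or the reverse, and that the variance takes the standard form for a U-statistic of degree two with an indicator kernel. The function names and the example table are hypothetical.

```python
def disorder_counts(table):
    """Yield (x_ij, number of observations disordered relative to cell ij)."""
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            if table[i][j]:
                disordered = sum(
                    table[k][l]
                    for k in range(rows)
                    for l in range(cols)
                    if (k - i) * (l - j) < 0  # opposite order in X and in Y
                )
                yield table[i][j], disordered


def t_and_plugin_variance(table):
    """Return (T, estimated variance of T) from relative frequencies."""
    n = sum(sum(r) for r in table)
    cells = list(disorder_counts(table))
    # Proportion of disordered pairs among all n(n-1) ordered pairs.
    t_stat = sum(x * d for x, d in cells) / (n * (n - 1))
    # Plug-in estimates of the one-pair and two-pair disorder probabilities.
    p_d = sum((x / n) * (d / n) for x, d in cells)
    p_dd = sum((x / n) * (d / n) ** 2 for x, d in cells)
    var = (2 * p_d * (1 - p_d) + 4 * (n - 2) * (p_dd - p_d ** 2)) / (n * (n - 1))
    return t_stat, var


example = [[2, 1, 0], [1, 3, 1], [0, 1, 2]]  # hypothetical frequencies, n = 11
print(t_and_plugin_variance(example))
```

Because the table may have different numbers of rows and columns, the same sketch applies when the two scales have different numbers of categories.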

The variance may also be estimated by jackknife or bootstrap techniques. The explicit asymptotic variance for the empirical measure of disorder, D, has not been shown yet and the sampling distributions of the empirical measures for sample sizes met in practice have not been demonstrated.

The variance of RV is a complicated expression and even an asymptotic approximation is cumbersome to utilize for an empirical variance estimator [13]. Jackknife or bootstrap techniques may be used for variance estimation and inferences. The sampling distribution for RV is not known, so the estimates of the variance of RV cannot be used directly for inference.

Bootstrap tests and confidence intervals are thus a possible strategy, but to use the measures of disorder and relative rank variance thoroughly it is important to know more about the distributional properties.
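A delete-one jackknife for a statistic computed from a contingency table can be sketched as below. This is only an illustration of the general technique, not necessarily the procedure used in the papers. It exploits the fact that removing any one of the subjects in a given cell yields the same reduced table, so each cell is visited once and weighted by its frequency. The helper `rv` and the example table are hypothetical.

```python
def rv(table):
    """Relative rank variance of a square table (illustrative helper)."""
    m = len(table)
    n = sum(sum(r) for r in table)
    row_tot = [sum(r) for r in table]
    col_tot = [sum(table[i][j] for i in range(m)) for j in range(m)]
    total = 0.0
    for i in range(m):
        for j in range(m):
            x = table[i][j]
            if x:
                half = 0.5 * (1 + x)
                rx = sum(row_tot[:i]) + sum(table[i][:j]) + half
                ry = sum(col_tot[:j]) + sum(table[k][j] for k in range(i)) + half
                total += x * (rx - ry) ** 2
    return 6.0 * total / n ** 3


def jackknife_variance(table, statistic):
    """Delete-one jackknife variance of statistic(table) over the n subjects."""
    n = sum(sum(r) for r in table)
    loo = []  # (leave-one-out value, number of subjects giving that value)
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            if x == 0:
                continue
            reduced = [list(r) for r in table]
            reduced[i][j] -= 1  # drop one subject from cell (i, j)
            loo.append((statistic(reduced), x))
    mean = sum(v * w for v, w in loo) / n
    return (n - 1) / n * sum(w * (v - mean) ** 2 for v, w in loo)


example = [[2, 1, 0], [1, 3, 1], [0, 1, 2]]  # hypothetical frequencies
print(jackknife_variance(example, rv))
```

Passing a different function as `statistic` gives the corresponding jackknife variance for that measure, provided the function accepts a table in the same layout.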


4 Summary of the papers

4.1 Paper I: An overview of methods in the analysis of dependent ordered categorical data: Assumptions and implications

The aim of the first paper was to give an overview of the methods used for the analysis of dependent ordered categorical data and to compare standard methods with Svensson’s measures. The exposition focused on the assumptions, specifications, applicability and implications for the appropriate areas of application. The approach was to call attention to the different problems in analysing dependent ordered categorical data and to put together and sum up the results. The overview gives a picture of standard methods as well as state-of-the-art methods and serves as an inventory of problems and a rationale for the development of Svensson’s measures.

First, some fundamental asymmetric models for categorical data are described, followed by a description of how these fundamental models were extended to dependent ordinal data. These fundamental models are basic multinomial generalized linear models (GLM), such as the cumulative logit model, the adjacent categories logit model, and the continuation ratio model, which are used in marginal models or conditional models. In marginal models the focus is on population-average effects such as marginal homogeneity, and in conditional models the focus is on cluster-specific effects as well [34, 35]. The cluster is here a general concept that may describe subjects that are rated repeatedly over time or subjects rated by several raters. Special cases are paired ratings at two time points or paired ratings by two raters.

A second major group of models that is described is log linear models. In contrast to asymmetric models, which distinguish between response and explanatory variables, log linear models are symmetric and are useful for modelling association [36]. In certain cases there are correspondences between logit models for dependent data and log linear models. The log linear models are described, giving special attention to those models developed for dependent ordinal data with applications to agreement studies. Many models in this class of log linear models are hierarchically structured, and through an analysis of goodness-of-fit statistics from these hierarchical models conclusions about agreement patterns may be drawn [21, 37].

Further, summary measures for describing order consistency and agreement are described, such as Kendall’s tau-b, Goodman-Kruskal’s gamma, and Cohen’s kappa. Although specific, easy to understand, and easily accessible, they are often used inadequately. For example, assessment of association is not generally the same as assessment of agreement. It was important to point out the original objective of the measures and their shortcomings.

The overview concludes with a description of the augmented ranking approach and Svensson’s measures. In parallel to the development of these models and measures, Svensson brought up the lack of methods that were rank-invariant, were able to quantify different important aspects of agreement and disagreement and, at the same time, could be used for analysis of change in ordinal outcome variables [13, 20, 27].

In the discussion it was concluded that using models is often considered to be superior to tests and summary measures because models give more elaborate and multifaceted information, but models may also become more and more complicated to parameterize and interpret. The risks of misspecification and of violating the fundamental assumptions behind the models are obvious when such models are used in applied research. It is, on the other hand, difficult to capture the many aspects of change, association or agreement by one single measure. Many models use some scoring system and the models are then not invariant to transformations of the scores or to merging or splitting categories. Some link functions in the models are not even palindromic invariant. Models and measures for ordered categorical data should be rank-invariant.

Many of the existing measures may be regarded as attractive as they are easy to understand and to apply. But some of these are not adequately used, e.g. correlation for measuring agreement, and some have serious drawbacks, e.g. the kappa coefficient. Svensson’s approach is rank-invariant, non-parametric and uses the paired ordered information in the ranking procedure. By the complementary use of a few measures it is possible to evaluate both systematic and individual disagreement. The measures are as well suited to designs for evaluation of change in response to some treatment as they are to assessment of reliability or validity. In contrast to other measures of concordance or agreement, the limits of the range of Svensson’s measures are attainable, irrespective of the number of possible response categories, the type of scaling and the category distributions.

4.2 Paper II: Comparison of methods in the analysis of dependent ordered categorical data

As a second step in the treatment of non-parametric methods for analysis of dependent ordered categorical data, a comparison of some standard measures, models and tests with Svensson’s measures using two empirical data sets was made. The novel approach was to bring together results from applying a variety of different measures, models, and tests to two very typical examples of research problems, and to compare those with the results from the measures of Svensson.

The empirical data sets represent two types of studies commonly encountered in clinical research. The first empirical data set was from a study concerning agreement in judging biopsy slides for carcinoma of the uterine cervix [38]. The purpose of that study was to investigate the variability in classification and the degree of agreement in ratings among pathologists. Since being published, the original data have frequently served as an illustration in methodological papers [21, 39, 40]. The second empirical data set was from a study of individual and group changes in the patients’ social outcome after aneurysmal subarachnoid haemorrhage between two occasions [28]. The intention of the study was to publish the results, but it was also used to illustrate some aspects of Svensson’s measures.

The measures, models and tests were determined by the different aims of the original studies. These were various agreement and association measures, and models of agreement based on log linear models with parameters describing symmetry, quasi-symmetry and marginal homogeneity. Furthermore, tests of marginal homogeneity and symmetry were applied, and in the case of testing change between two time points the sign test was applied. Svensson’s measures of systematic and individual disagreement could be applied to both study purposes [20]. The various standard measures, models and tests were outlined but Svensson’s measures were presented more thoroughly.

To study reliability, as in the first application example, it was not sufficient to use one of the standard agreement or association measures, such as Cohen’s kappa, Goodman-Kruskal’s gamma, or Kendall’s tau-b. It was necessary to supplement the measures by one or more log linear models and models of marginal homogeneity. Most important, though, was to use models and measures constructed for paired ordinal data and relevant to the question of reliability.

To study change in ordinal response variables certain tests are in common use. One such example was the sign test in the second application. Change on group level may also be tested by marginal models, as was demonstrated. Log linear models could be an option as they model cell frequencies. The independence model is often used. This model may be expanded depending on what patterns in the cell frequencies are relevant to study. Even though log linear models are not a common choice for analyses of change patterns, they were used in Paper II to contrast them with Svensson’s measures. Svensson’s measures are rank-invariant and utilize the fact that the data consist of pairs. Furthermore, the measures gave more comprehensive information about systematic and individual variations.
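For orientation, the sign test mentioned above can be sketched for a square table of paired ordinal assessments at two occasions. The sketch assumes that rows are the first occasion and columns the second, so that cells above the main diagonal are increases and cells below are decreases; unchanged pairs are discarded, as is usual for the sign test. The function name and the frequencies are hypothetical and are not taken from the studies summarized here.

```python
from math import comb


def sign_test_from_table(table):
    """Exact two-sided sign test for change in a square paired table.

    Returns (number of increases, number of decreases, p-value).
    """
    m = len(table)
    up = sum(table[i][j] for i in range(m) for j in range(m) if j > i)
    down = sum(table[i][j] for i in range(m) for j in range(m) if j < i)
    changed = up + down
    if changed == 0:
        return up, down, 1.0  # no changes, so no evidence of a shift
    k = min(up, down)
    # Binomial(changed, 1/2) lower tail, doubled for a two-sided test.
    tail = sum(comb(changed, r) for r in range(k + 1)) / 2 ** changed
    return up, down, min(1.0, 2 * tail)


# Hypothetical 3 x 3 table: 8 increases, 2 decreases, 15 unchanged.
example = [[5, 4, 1], [1, 6, 3], [0, 1, 4]]
print(sign_test_from_table(example))
```

The test uses only the direction of each individual change, which is one reason it says nothing about individual variability beyond a shift at group level.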

The study showed that the standard measures, models and tests gave diverging conclusions, which implies difficulties in the interpretation of such findings. Further, the study showed that the measure RC of Svensson revealed systematic differences in concentration in the biopsy slides assessment study, which could not be detected by the traditional methods. This indicates that the assumption of stochastic ordering, crucial to many models, was not fulfilled. Moreover, the study showed that the measure RV of Svensson indicated individual occasional causes of change in the social outcome study, which could not be detected explicitly by the traditional methods. One conclusion of the study was that it was not as easy to detect the systematic and individual reasons for variations by the various standard measures, models or tests as it was by the measures of Svensson. As an example, the model of agreement plus uniform association gave information about the kind of association and agreement, but no information about the systematic disagreement in concentration in the use of the scale. A remaining question was whether the assumptions for the model were fulfilled. An implication of this was that the researcher has to be careful to use the models for their intended purpose and to check their adequacy.

4.3 Paper III: Statistical properties of a nonparametric measure of discordance in paired ordinal data

One purpose of the study was to derive the large sample properties of the measure of disorder, D. Another purpose was to investigate the distributional properties of the measure and to compare variance estimators in sample sizes encountered in practice. The measure and two classical measures of concordance were applied to an empirical data set and compared.

Measures based on indicators of disordered and ordered observations are traditionally built up as the excess of concordant pairs over discordant pairs adjusted for the number of tied observations. The measures differ in the way they consider tied observations. The measure of disorder, D, was defined by Svensson [25, 26] as:

$$D = \frac{\sum_{i=1}^{m_1}\sum_{j=1}^{m_2} x_{ij}\left(x_{ij}^{ul} + x_{ij}^{lr}\right)}{n(n-1) - t} \tag{11}$$

where $x_{ij}$ denotes the number of individuals classified to the i:th and j:th category, respectively, $m_1$ and $m_2$ are the numbers of categories of the two scales, $x_{ij}^{ul}$ and $x_{ij}^{lr}$ are the numbers of observations in the upper-left region and lower-right region relative to the ij:th cell, respectively, and t is the correction factor for tied observations

$$t = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2} x_{ij}\left(x_{ij} - 1\right) \tag{12}$$

This way of defining tied observations implies that pairs of observations may be either discordant, concordant or else tied if the pairs of observations are identical, which means that the number of concordant pairs is defined differently than in classical measures. When there is total agreement in ordering, no pairs of observations are in the upper-left or lower-right regions relative to the cells, which means that $x_{ij}^{ul} = x_{ij}^{lr} = 0$ and D = 0. The maximum value of D = 1 indicates total inconsistency in ordering. A measure of monotonic agreement (MA) was also defined as [25, 26]

$$MA = 1 - 2D \tag{13}$$
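The measure of disorder with its tie correction, and the derived measure of monotonic agreement, can be sketched as follows. The sketch assumes that an observation in cell (k, l) is disordered relative to cell (i, j) when it has a lower category on one scale and a higher category on the other, and that only pairs falling in the same cell count as tied. The function name and the example tables are hypothetical; the table need not be square.

```python
def disorder_and_monotonic_agreement(table):
    """Return (D, MA) for a contingency table of paired ordinal data."""
    rows, cols = len(table), len(table[0])
    n = sum(sum(r) for r in table)
    disordered = 0  # ordered pairs of observations in reversed order
    ties = 0        # ordered pairs of observations sharing a cell
    for i in range(rows):
        for j in range(cols):
            x = table[i][j]
            ties += x * (x - 1)
            if x:
                disordered += x * sum(
                    table[k][l]
                    for k in range(rows)
                    for l in range(cols)
                    if (k - i) * (l - j) < 0
                )
    # Undefined (division by zero) if all observations fall in one cell.
    d = disordered / (n * (n - 1) - ties)
    return d, 1 - 2 * d


print(disorder_and_monotonic_agreement([[2, 1, 0], [1, 3, 1], [0, 1, 2]]))
print(disorder_and_monotonic_agreement([[0, 0, 2], [3, 0, 0]]))
```

The second, completely reversed 2 x 3 table reaches D = 1 and MA = -1 under this sketch, which illustrates that the limits are attainable even when the two scales have different numbers of categories.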

The asymptotic distribution of D was not known. The large sample properties were shown by the theory of U-statistics [41] and the finite sample properties were investigated by a simulation experiment. The theory of U-statistics is well suited for non-parametric theoretical study of large sample distributional properties, provided that the second moment of the kernel function of the U-statistic exists. Application of a theorem in Hoeffding [41] regarding a function of U-statistics leads to the general formula for the variance in the limiting normal distribution for a ratio of two U-statistics

$$\operatorname{AsVar}(W) = \sum_{\gamma=1}^{2}\sum_{\delta=1}^{2} m(\gamma)\,m(\delta)\,\zeta_{1}^{(\gamma,\delta)}\,\frac{\partial g(y)}{\partial y_{\gamma}}\bigg|_{y=\theta}\,\frac{\partial g(y)}{\partial y_{\delta}}\bigg|_{y=\theta} \tag{14}$$

where $W = g(U)$ is the function of the U-statistics, $\theta$ is the vector of their expected values, $m(\gamma)$ is the degree of the γ:th U-statistic and

$$\zeta_{1}^{(\gamma,\delta)} = E\left\{\Psi_{1}^{(\gamma)}(X_{1})\,\Psi_{1}^{(\delta)}(X_{1})\right\} \tag{15}$$

is the first order variance and covariance term in the decomposition of the kernel functions into conditional expectations of the U-statistics.
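As a worked illustration of how such a general formula specializes, consider the ratio $g(y_1, y_2) = y_1 / y_2$ of two U-statistics with expectations $\theta_1$ and $\theta_2 \neq 0$. The sketch below is an ordinary delta-method calculation and assumes that both U-statistics have degree 2, which seems natural when both kernels are functions of two pairs of observations; the symbols $\theta_1$, $\theta_2$ and $\zeta_1^{(\gamma,\delta)}$ are notation chosen here and may differ from that of Paper III.

$$\frac{\partial g}{\partial y_1}\bigg|_{y=\theta} = \frac{1}{\theta_2}, \qquad \frac{\partial g}{\partial y_2}\bigg|_{y=\theta} = -\frac{\theta_1}{\theta_2^{2}}$$

$$\sigma^{2} = 4\left[\frac{\zeta_1^{(1,1)}}{\theta_2^{2}} - \frac{2\,\theta_1\,\zeta_1^{(1,2)}}{\theta_2^{3}} + \frac{\theta_1^{2}\,\zeta_1^{(2,2)}}{\theta_2^{4}}\right]$$

Here $\sigma^{2}$ is the variance of the limiting normal distribution of $\sqrt{n}\,\left(g(U) - g(\theta)\right)$, $\zeta_1^{(1,1)}$ and $\zeta_1^{(2,2)}$ are the first order variance terms of the numerator and denominator kernels, and $\zeta_1^{(1,2)}$ is their first order covariance term.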

The simulation experiment was designed to study how quickly the theoretical result of asymptotic normality takes effect in practice and what the empirical sampling distributions looked like in sample sizes encountered in practice. Furthermore, the asymptotic variance and the approximate variance estimator derived in Svensson [13] were to be compared with the empirical variance in the simulation experiment. The sample sizes varied from 20 to 1000. Contingency tables of various appearance were chosen to serve as populations, based on various amount
