Differences in age at breeding between two genetically different populations of brown trout (Salmo trutta).

Academic year: 2021

Master Thesis, 15 hp

Differences in age at breeding between two genetically different populations of brown trout (Salmo trutta).

By Lars Sjöström

Department of Statistics Uppsala University

Supervisor: Inger Persson

2019

ABSTRACT

Survival analysis is an effective tool for conservation studies, since it measures the risk of an event that is important for the survival of populations and the preservation of biodiversity. In this thesis three different survival analysis models are used to compare the age at breeding between two genetically different populations of brown trout. These populations are an evolutionary enigma, since they apparently coexist in direct competition with each other, which according to ecological theory should not happen. It is therefore of interest whether differences between them can be identified. The data consist of brown trout and have been collected over 20 years. The models are the Cox Proportional Hazards model, the Complementary Log-Log Link model and the Log Logistic Accelerated Failure-Time model. The Cox model was estimated in three different ways because of nonproportional hazards in the estimates of time to breeding, which gave different interpretations of the same model. All of the models agree that population B breeds at younger ages than population A, which suggests that the two populations have different reproductive strategies.

Key words: Survival analysis, Biodiversity, Population genetics, Ecology, Evolutionary biology, Cox Proportional Hazards model, Complementary Log-Log Link model, Accelerated Failure-Time model

CONTENTS

1 INTRODUCTION
2 MATERIAL AND METHODS
2.1 MATERIAL
2.2 METHODS
2.2.1 COX PROPORTIONAL HAZARDS REGRESSION MODEL
2.2.2 COMPLEMENTARY LOG-LOG LINK MODEL
2.2.3 THE LOG LOGISTIC ACCELERATED FAILURE-TIME MODEL
2.3 MODEL DIAGNOSTICS
3 RESULTS
3.1 CENSORING
3.2 INVESTIGATION OF ASSUMPTIONS
3.2.1 SUPREMUM TEST FOR THE PROPORTIONAL HAZARDS ASSUMPTIONS
3.2.2 STANDARDIZED SCORE RESIDUALS PLOTS
3.2.3 ARJAS PLOTS
3.2.4 PROBABILITY DISTRIBUTION OF BREEDING
3.2.5 HOMOSCEDASTICITY
3.2.6 MULTICOLLINEARITY
3.3 SURVIVAL ANALYSIS
3.3.1 COX MODEL WITHOUT REGARD FOR NONPROPORTIONAL HAZARDS
3.3.2 COX MODEL WITH TIME DEPENDENT COVARIATES AND STRATIFICATIONS
3.3.3 COX MODEL WITH STRATIFICATIONS OF EVERY COVARIATE
3.3.4 COMPLEMENTARY LOG-LOG LINK MODEL
3.3.5 THE LOG LOGISTIC ACCELERATED FAILURE-TIME MODEL
3.4 COX-SNELL RESIDUALS PLOTS
3.5 MODELS THAT WERE DISCARDED
3.5.1 PIECEWISE EXPONENTIAL REGRESSION MODEL
3.5.2 MODELS WITH PRIOR AND POSTERIOR DISTRIBUTIONS
4 DISCUSSION
4.1 COMPARISON OF MODELS
4.2 CONCLUSION
4.3 SUGGESTIONS FOR FUTURE RESEARCH
5 REFERENCES

1 INTRODUCTION

In the interconnected lakes Västra Trollsvattnet and Östra Trollsvattnet, at an altitude of 700 metres above sea level in the uppermost reaches of the river Indalsälven in the Scandinavian mountains, live two populations of brown trout that from now on will be called cluster A and cluster B. These clusters look alike but are distinguished by statistically significantly different allele frequencies (Jorde & Ryman 1996, Andersson et al. 2017).

An allele is a piece of DNA on a chromosome, and the allele frequency is how common that allele is in a population. The place of an allele on the DNA is called a locus (plural loci). No one has complete knowledge of the DNA in an entire population, since that would require knowledge of every individual, and individuals continually die and reproduce. The best that we have are estimates of the allele frequencies, which tell us about the genetic differences between individuals and populations (Gillespie 2004).

Cluster A and B have different allele frequencies at 14 loci. The alleles of each fish were used to calculate the probability of the fish belonging to cluster A or cluster B, and each fish was assigned to the cluster to which it had more than 50 % probability of belonging. The distribution of the assignment probabilities is U-shaped, which strongly indicates that the fish belong to different clusters. 80 % of the brown trout have at least 70 % probability of belonging to one cluster or the other, and there are obvious genetic differences between fish assigned to the different clusters. Thus the two clusters are two different populations of brown trout (Palmé, Laikre, Ryman 2013).

There is no evidence that these two clusters are different due to any kind of natural selection, which indicates that the differences are the result of genetic drift (Jorde & Ryman 1996).

Genetic drift is the process in which individuals with certain alleles have a higher survival and/or reproductive rate without any connection to those alleles, as if in a human village the brown-eyed people happened by coincidence to have more children. This process systematically reduces the genetic variation, that is the total number of alleles in the population, for random alleles.

The genetic variation is restored by mutations. This process is a force in evolution and a part of the development of new species (Gillespie 2004). There is however another possibility: that the two clusters represent an older biodiversity, with two lineages that colonized the same area after the last glaciation c. 7000 years ago (Andersson et al. 2017).

Which alleles an individual has affects its probability of survival, since the alleles give different fitness in any given environment. Natural selection thus makes the alleles that improve fitness dominant in a population, while the others die off. Thus one of cluster A and cluster B should have a better probability of survival in Västra Trollsvattnet and Östra Trollsvattnet, and the other should die out (Gillespie 2004).

However, they have instead coexisted for at least four generations, and there is no indication that they are not in direct competition with each other (Andersson et al. 2017). According to theory, there should be some difference in how they survive, which would mean that they are not in direct competition with each other. This study aims to shed more light on the topic by investigating whether there is any difference in age at breeding between the two clusters. A difference in age at breeding would indicate different reproductive strategies, which could be part of the explanation of how they can coexist.

Breeding is in this study considered an event, and a brown trout that has not bred this year is considered a censored observation. Whether it has bred in earlier years is unknown. Survival analysis is the study of event times and censoring times, which together are used to estimate at which point in time the event is likely to occur (Klein & Moeschberger 2003, chapter 3).

This makes it excellent for studies on conservation of wildlife. The knowledge about when something happens to a population and which covariates influence when the event occurs teaches us about how they survive and what is needed to preserve them.

The investigation will be done with three different survival analysis models to compare the results of the different models. The models are the Cox Proportional Hazards model, the Complementary Log-Log Link model and the Log Logistic Accelerated Failure-Time model.

The Cox model is the most used survival analysis model, and its results will be compared with those of the Complementary Log-Log Link model and the Log Logistic Accelerated Failure-Time model. The two latter are parametric models that give more precise estimates if they fit the data (Klein & Moeschberger 2003, chapter 8, chapter 12). The Complementary Log-Log Link model estimates the probability of survival under different assumptions (Agresti 2015, chapter 5). The Log Logistic Accelerated Failure-Time model describes how a change in covariate value changes the median time to event (Klein & Moeschberger 2003, chapter 12).

These models will be thoroughly described in the material and methods section. The methods section also describes the assumptions. The results section contains the investigation of the assumptions and a survival analysis section that describes the model estimates. The discussion section includes comparison of the models, conclusion and suggestions for future research.

2 MATERIAL AND METHODS

2.1 MATERIAL

The material for this thesis consists of 6149 brown trout (Salmo trutta) that have been sampled in the lakes Östra and Västra Trollsvattnet in the province of Jämtland. The lakes are a part of the Lakes Bävervatten Project, which studies genetic spatio-temporal variability in areas with no direct human manipulation of the fish populations. The lakes lie 20 km from the nearest road and take a full day of hiking to reach. The brown trout have been caught in nets, which means that small fish did not get into the sample. The trout were taken apart, studied and put in cold storage, which is also called destructive sampling or sampling without replacement (Jorde & Ryman 1996, Andersson et al. 2017).

The samples were taken between the years 1987 and 2014. For each fish it was recorded in which of the two lakes it was sampled, the year and date, which cluster it belonged to, and its weight, length, age, sex and whether it had bred this year.

This thesis will investigate whether age at breeding is affected by cluster, lake, sex, weight and length. In total the dataset has 6159 observations, and 2793 of these have bred this season. 10 of the observations have the age -9. Since I cannot know whether these fish were in fact 9 years old, and there is no logical reason why exactly ten 9-year-old fish and none of another age should have received a minus sign, I have excluded these observations, leaving 6149 observations.
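The exclusion described above can be sketched in a few lines of Python. The record layout and the field names `age` and `bred` are assumptions made for illustration; the thesis does not show its data-handling code.

```python
# Hypothetical records; only the 'age' field matters for this sketch.
records = [
    {"age": 4, "bred": True},
    {"age": -9, "bred": False},  # implausible age, the sign error cannot be verified
    {"age": 6, "bred": False},
]


def drop_invalid_ages(rows):
    """Keep only rows whose recorded age is non-negative."""
    return [row for row in rows if row["age"] >= 0]


cleaned = drop_invalid_ages(records)
print(len(records), "->", len(cleaned))
```

Dropping the rows, rather than flipping the sign, mirrors the reasoning in the text: the true age of these fish cannot be recovered from the data.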

2.2 METHODS

Here I describe the three chosen methods of my study: the Cox Proportional Hazards Regression model, the Complementary Log-Log Link model and the Log Logistic Accelerated Failure-Time model. In different ways these models estimate time to event, with time being the age of the brown trout and breeding being the event. The higher the probability of the event, the shorter the estimated time to event. The Cox Proportional Hazards Regression model and the Complementary Log-Log Link model estimate how the probability of event changes, while the Log Logistic Accelerated Failure-Time model estimates changes in median time to event.

A significance level of 5 % is used in every hypothesis test in this thesis. If the p-value is below 0.05, the null hypothesis that the covariate being tested has no effect on the outcome is rejected. The 5 % level is used to avoid missing any covariate with a significant effect, but also to avoid making the test too sensitive and including effects that are not there.

2.2.1 COX PROPORTIONAL HAZARDS REGRESSION MODEL

The Proportional Hazards Regression model, or Cox model, is a semi-parametric model which estimates the time to event by calculating the relationship between time to event and a set of explanatory variables. The time to event estimate addresses the first purpose of the study, when the trout breed, and the covariate effects address the second purpose, whether there is a difference between the clusters and whether other variables have an effect. It is the most used survival analysis model (Klein & Moeschberger 2003, chapter 8).

A model that does not make any assumptions about the probability distributions of the parameters is called a nonparametric model. When the distribution of the parameters is known, the model is parametric (Klein & Moeschberger 2003, chapter 8).

The Cox Proportional Hazards model is used since it is very robust and allows for time-dependent covariates and stratified analysis (Allison 1995, chapter 5); the benefits will be seen later in this thesis. The Cox model includes a baseline risk of event, multiplied by a term that depends on the parameter estimates of one or several covariates that affect that risk of event. When two individuals have different values on covariate Z and all the other covariates are the same, then if the difference in risk of event is determined solely by the difference in Z, the hazards are proportional. The ratio between the two risks is called the hazard ratio, which is the philosopher's stone that survival analysis uses to measure the effect the covariate has on the risk of event.

The hazard rate at time t is given by the formula below, where $h_0(t)$ is the baseline risk of event, Z is a vector of covariates (risk factors), $\beta^t$ is the transposed parameter vector and c is a known function. The exponential function is a common choice for c, since the hazard rate must be positive (Klein & Moeschberger 2003, chapter 8).

$$h(t \mid Z) = h_0(t)\, c(\beta^t Z) = h_0(t) \exp(\beta^t Z)$$

The hazard ratio is how the risk of event changes for a subject with a higher value on the covariate. It is given below for a subject with covariate vector Z and another subject with covariate vector Z*, where k indexes the specific covariates (Klein & Moeschberger 2003, chapter 8).

$$\frac{h(t \mid Z)}{h(t \mid Z^*)} = \exp\left[\sum_k \beta_k (Z_k - Z_k^*)\right]$$
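As a minimal numerical illustration of the hazard ratio, the sketch below evaluates exp of the weighted covariate differences for two subjects. The coefficient values and covariate vectors are hypothetical and are not estimates from this study.

```python
import math


def hazard_ratio(beta, z, z_star):
    """Return exp(sum_k beta_k * (z_k - z*_k)) for two covariate vectors."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, z, z_star)))


# Hypothetical coefficients for two covariates (illustration only).
beta = [0.5, -0.2]

# The two subjects differ by one unit in the first covariate only.
hr = hazard_ratio(beta, z=[1, 3.0], z_star=[0, 3.0])
print(round(hr, 3))
```

The baseline hazard does not appear in the function, since it cancels in the ratio. This is why the hazard ratio is constant over time when the hazards are proportional.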

2.2.1.1 ASSUMPTIONS OF THE COX REGRESSION MODEL

The Cox Regression Model requires certain assumptions to hold in order to give correct estimates. In this section these assumptions are described and discussed, together with ways of evaluating some of them.

2.2.1.1.1 RANDOM SAMPLE

The observations in a survival analysis must be sampled at random (Kaplan & Meier 1958).

The sample is taken by nets from the two lakes Östra and Västra Trollsvattnet, and the nets only capture fish of a certain size. Since catches in nets happen at random, this is evidently a random sample, with the exception of very small brown trout, which were not included in the sample (meeting with researchers Linda Laikre and Nils Ryman).

2.2.1.1.2 INDEPENDENT OBSERVATIONS

The observations must be independent of each other, so that events and censoring times are not interacting with each other (Kaplan & Meier 1958). There is no reason why fish sampled in nets should not be independent of each other.

2.2.1.1.3 RIGHT CENSORED OR LEFT TRUNCATED DATA

Censoring means that we cannot observe the event since it is outside the time interval of the study. In this case the event of interest can only take place before the sampling: either the fish has bred when it is caught, or there is no information on when it would have bred. Thus this is a case of right censoring, with the catch being the right line where the censoring takes place and age at time of catch being the time on study (Cox 1972; Klein & Moeschberger 2003, chapter 3).

2.2.1.1.4 NONINFORMATIVE CENSORING

Noninformative censoring means that there is no correlation between time of censoring and time of event (Kaplan & Meier 1958). There is no way to adjust the nets to capture bred or non-bred fish, so no informative censoring has taken place.

2.2.1.1.5 LARGE SAMPLES

For a sample to be large enough there must be a certain number of events, in this case bred fish, for each covariate. Too few events mean increased bias and variability, unreliable confidence intervals and problems with model convergence. The common rule of thumb is 10 events per covariate, but as long as there are more than 30 events and more than 2 covariates, as few as 5 events per covariate often give equally reliable estimates (Vittinghoff & McCulloch 2007). This dataset has 6160 observations and 2793 events, which gives room for 279 covariates without violating this assumption.
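The arithmetic behind the rule of thumb can be checked with a short sketch. The function name is hypothetical; the number of events and the two thresholds are those given in the text.

```python
def max_covariates(events, events_per_covariate=10):
    """Largest number of covariates allowed by an events-per-covariate rule."""
    return events // events_per_covariate


print(max_covariates(2793))     # rule of thumb: 10 events per covariate
print(max_covariates(2793, 5))  # relaxed rule: 5 events per covariate
```

With only five covariates of interest (cluster, lake, sex, weight and length), the sample is far above either threshold.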

2.2.1.1.6 PROPORTIONAL HAZARDS

When the baseline hazards for two values of a covariate are the same and do not change over time, and the differences are due solely to the covariates, the proportional hazards assumption holds. This is tested with various hypothesis tests and graphical methods (Klein & Moeschberger 2003, chapters 8, 9, 11).

The hypothesis test that I use is the Supremum Test for Proportional Hazards, which tests the standardized score residuals against time. The resulting p-values are valid asymptotically regardless of the covariance structure. Simulation studies have shown that the Supremum Test for Proportional Hazards has high power, which makes it a preferable test (Lin, Wei, Ying 1993). The standardized score residuals are a measure of the difference between the estimated covariate values and the real values. The Supremum Test for Proportional Hazards also creates plots that provide a graphical check of the proportional hazards assumption. An alternative test is to create time-dependent covariates for each covariate and test them one at a time with a Cox model (Klein & Moeschberger 2003, chapter 9). The advantage of the alternative is that it requires less computational capacity, while the cost is that it requires more coding. I chose the Supremum test since it has been shown to have high power, to limit coding and to get a test that also produces a graphical description of the outcome.

The graphical description of standardized score residuals plotted against time on study includes dotted lines at 1.3581 and -1.3581 respectively. The probability that the graph exceeds any of these boundaries is 0.05 at most, so if the test rejects the null hypothesis, then the plot exceeds these boundaries (Klein & Moeschberger 2003, chapter 11).

The null hypothesis is that the hazards are proportional. If it is rejected at the significance level, the assumption of proportional hazards is violated. If it is not rejected at the significance level, there are no statistically significant nonproportional hazards. The test uses a 5 % significance level, since I do not want to miss any nonproportional hazards with too low a significance level, but also do not want to see nonproportional hazards that are not there with too high a significance level.

The Andersen plot is another graphical method. It gives the ratio between the cumulative hazards for two values of a binary covariate. If it is a linear plot through the origin, the proportional hazards assumption holds (Klein & Moeschberger 2003, chapter 11).

Another optional graphical method is to plot the log cumulative hazard rates, or the difference in the log cumulative hazard rates, against time. If the hazards are proportional, the curves are parallel, that is the difference between them is constant over time (Klein & Moeschberger 2003, chapter 11).

Yet another graphical method is the Arjas plot, which plots the estimated cumulative hazard rates against the number of events. This method has the advantage of plotting different values of the same covariate, while the other methods give one plot for the entire model. The Arjas plot also has a twofold purpose: to check for proportional hazards and to check whether the covariate has any effect or could be dropped from the model. If the curves are close to a 45 degree line through the origin, then the covariate has little effect on the risk of event (Klein & Moeschberger 2003, chapter 11). If the curves are linear, the hazards are proportional. The Arjas plot has been shown to have the highest rejection rate of nonproportional hazards of six graphical checks for proportional hazards, including the Andersen plot and the difference in the log cumulative hazard rates against time (Persson & Khamis 2007). For this reason I use the Arjas plot to check for nonproportional hazards in the study. Since the plot of standardized score residuals was not a part of the investigation by Persson & Khamis, there is reason to include it here to see if there are any differences in the detection of nonproportional hazards.

If the hazards for one covariate are nonproportional, one way to handle this is to stratify on that covariate. To stratify means that a model is estimated where the baseline hazards are allowed to vary between different values of a covariate.

Another option is to use a time-dependent version of the covariate to account for the time effect on the baseline hazard. Stratification often gives better model fit, but has the drawback of not providing a hazard ratio for that covariate. Stratification also requires categorical covariates, so a continuous covariate must be divided into categories and a part of the continuous effect is lost. The lack of a hazard ratio can be compensated for by plotting the difference in time to event for the different categories of the covariate stratified upon. With time-dependent covariates, the hazard ratio can only be estimated for each single value of the time variable, not as a general hazard ratio for the entire time period. In this case it is possible to estimate the hazard ratio for each year with time-dependent covariates (Klein & Moeschberger 2003, chapter 11, Allison 1995, chapter 5). A third option is to avoid these drawbacks by ignoring the nonproportional hazards. In this thesis I will estimate different models to investigate the consequences that different ways of handling nonproportional hazards have for model fit and interpretation.

2.2.2 COMPLEMENTARY LOG-LOG LINK MODEL

Since the time in this study is measured in discrete values, the probability of breeding at a specified time point can be estimated with a discrete time model. The Complementary Log-Log Link model is a parametric model that assumes a Bernoulli distribution for the response variable and a binomial distribution of the observations (Agresti 2015, chapter 5).

The parametric models have the advantage over the Cox model that if they fit the data, their estimates tend to be more precise than those from a Cox model. However, if the parametric model does not fit the data, it will estimate the wrong quantity (Klein & Moeschberger 2003, chapter 12).

The model gives parameter estimates that are identical to those of an underlying Proportional Hazards model. Thus the parameter estimates can be interpreted as having a relative effect on the risk of event without assuming proportional hazards. The Complementary Log-Log Link model assumes an asymmetric response curve with a probability estimate that approaches 0 slowly but 1 rather sharply. The word complementary refers to the fact that the link is applied to the complement of the probability of event. The model is described in the formula below, where π is the probability of event, β is a parameter vector and Z is a covariate vector, t is time, i indexes the observations and j indexes the predictors, that is the covariate values (Austin 2017, Agresti 2015, chapter 5, Wooldridge 2010, chapter 15).

$$\log\left[-\log\left(1 - \pi_i(t)\right)\right] = \sum_j \beta_j Z_{ij}$$
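The asymmetry of the response curve can be seen by evaluating the link and its inverse numerically. This is a minimal sketch; the function names are chosen here for illustration and the values of the linear predictor are arbitrary.

```python
import math


def cloglog(p):
    """Complementary log-log link: log(-log(1 - p)) for 0 < p < 1."""
    return math.log(-math.log(1.0 - p))


def inv_cloglog(eta):
    """Inverse link: probability of event for a linear predictor eta."""
    return 1.0 - math.exp(-math.exp(eta))


# The curve leaves 0 slowly but reaches 1 sharply.
for eta in (-2.0, 0.0, 2.0):
    print(eta, round(inv_cloglog(eta), 4))
```

At a linear predictor of -2 the probability is still well above 0, while at +2 it is already very close to 1, which is the asymmetric behaviour described above.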

The significance level of 5 % is used when I test the null hypothesis that the parameters of the covariates are equal to zero, that is that the covariates have no effect on the outcome. The 5 % level is used to avoid missing any significant effect, but also to avoid making the test too sensitive and including effects that are not there.

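The concept of power is used for interpreting the effect that the covariates of a Complementary Log-Log Link model have on the probability of event. It is calculated as in the first equation below, where $\beta_j$ is the parameter estimate of covariate j. When there is an interaction term between covariates j and k, with parameter estimate $\beta_{jk}$, the calculation is done as in the second equation (Agresti 2015, chapter 5).

$$\text{power} = \exp(\beta_j)$$

$$\text{power} = \exp(\beta_j + \beta_{jk} Z_k)$$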

The hazard ratio of a Cox model is also equal to $\exp(\beta)$, which means that power and hazard ratio are the same. A change in covariate value changes the probability of survival by the value of the power, just as with a hazard ratio (Agresti 2015, chapter 5, Klein & Moeschberger 2003, chapter 8).

The divergence between the value of the power and 1 tells how much a one unit increase in a covariate changes the fitted probability of survival, that is of the event not happening. A power above 1 decreases the probability of survival and a power below 1 increases it. Another way to say this is that a power above 1 increases the probability of event and a power below 1 decreases it (Agresti 2015, chapter 5). Thus a positive sign on the parameter estimate means that the probability of event increases, and a negative sign means that it decreases, since exp(positive) = value above 1 and exp(negative) = value below 1.
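The sign rule above can be illustrated with a small sketch. The parameter values are hypothetical, and the interaction form, in which the effect of one covariate depends on the value of the other, is the standard one rather than a formula taken from this thesis.

```python
import math


def power(beta):
    """Multiplicative effect of a one unit increase in a covariate."""
    return math.exp(beta)


def power_with_interaction(beta_main, beta_interaction, other_value):
    """Effect of a one unit increase when the covariate interacts with another."""
    return math.exp(beta_main + beta_interaction * other_value)


# Hypothetical parameter estimates (illustration only).
for beta in (0.4, 0.0, -0.4):
    direction = "raises" if power(beta) > 1 else "does not raise"
    print(beta, round(power(beta), 3), direction, "the probability of event")

print(round(power_with_interaction(0.4, -0.1, other_value=2.0), 3))
```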

Thus the Complementary Log-Log Link model provides an estimate that is directly comparable to a hazard ratio, but without the assumption of proportional hazards, and it is not limited to right censored data.
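Why the estimate is comparable to a hazard ratio can be seen from a short derivation. This is a sketch of the standard grouped-time argument rather than a formula taken from the thesis; the interval end points $t_{k-1}$ and $t_k$ and the interval-specific intercept $\alpha_k$ are notation introduced here for illustration.

```latex
% Proportional hazards imply S(t | Z) = S_0(t)^{\exp(\beta^t Z)}.
% Probability of the event in the interval (t_{k-1}, t_k], given no event before:
\pi_k(Z) = 1 - \frac{S(t_k \mid Z)}{S(t_{k-1} \mid Z)}
         = 1 - \left[\frac{S_0(t_k)}{S_0(t_{k-1})}\right]^{\exp(\beta^t Z)}

% Applying log(-log(.)) to the complement of the probability of event:
\log\left[-\log\left(1 - \pi_k(Z)\right)\right] = \alpha_k + \beta^t Z,
\qquad
\alpha_k = \log\left[-\log\frac{S_0(t_k)}{S_0(t_{k-1})}\right]
```

The covariates enter through the same $\beta$ as in the continuous-time model, so $\exp(\beta_j)$ has the same interpretation in both.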

2.2.2.1 ASSUMPTIONS OF THE COMPLEMENTARY LOG-LOG LINK MODEL

The Complementary Log-Log Link model requires certain assumptions to hold; if they do not hold, the estimates will be incorrect. In this section these assumptions are described and discussed, together with ways of evaluating some of them.

2.2.2.1.1 INDEPENDENT IDENTICALLY DISTRIBUTED OBSERVATIONS

The observations for a binary response model must be independent of each other and identically distributed in the groups in the sample (Wooldridge 2010, chapter 15). Since the fish caught in nets are caught at random, they are independent of each other and the distribution of the observations can be assumed to be identical in each group, since no observation is favoured by the nets.

2.2.2.1.2 GUMBEL DISTRIBUTION OF THE RESPONSE VARIABLE

The response variable of the Complementary Log-Log Link model follows the Type I extreme-value distribution, also known as the Gumbel distribution. It has an asymmetric shape that approaches 0 slowly and 1 rather sharply (Agresti 2015, chapter 5). This assumption is investigated in section 3.2.4 by plotting the probability distribution of the response value.

2.2.2.1.3 HOMOSCEDASTICITY

The Complementary Log-Log Link model uses a special case of an ordinary least squares estimator. The ordinary least squares estimator requires homoscedasticity, that is the variance of the observations must be constant. The residuals in turn are assumed to have mean zero and be uncorrelated with each regressor (Wooldridge 2010, chapter 4, Agresti 2015, chapter 1). Thus this assumption is investigated with graphs of the spread of the residuals. If the residuals follow a linear pattern centred around zero with few outliers, then the assumption of homoscedasticity holds. Useful residuals are standardized residuals, Pearson residuals and deviance residuals (Agresti 2015, chapter 2, chapter 5).

2.2.2.1.4 NO MULTICOLLINEARITY

The ordinary least squares estimator requires the population, in this case the brown trout in Östra and Västra Trollsvattnet, to have a positive definite variance matrix. This means that none of the covariates may be a linear function of the other covariates. This will be investigated with variance inflation factors (Wooldridge 2010, chapter 4).
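A variance inflation factor can be sketched for the simplest case of two covariates, where the R-squared from regressing one covariate on the other equals their squared correlation. The formula VIF = 1 / (1 - R²) is the usual definition; the length and weight values below are invented for illustration and are not data from this study.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_covariates(x, y):
    """VIF = 1 / (1 - R^2); with two covariates R^2 is the squared correlation."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical lengths (mm) and weights (g), illustration only.
length = [210, 250, 300, 340, 400]
weight = [95, 160, 270, 400, 640]
print(round(vif_two_covariates(length, weight), 2))
```

A value of 1 means the covariates are uncorrelated, and the factor grows without bound as one covariate approaches a linear function of the other.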

2.2.3 THE LOG LOGISTIC ACCELERATED FAILURE-TIME MODEL

An alternative to estimating the probability of event is to estimate the covariate effect on median time to event. The Accelerated Failure-Time model is a parametric model that treats the linear relationship between the logarithm of time and the covariate values as equivalent to the relationship between survival and the covariate values. This is described in the formula below, where T is the time to event, W is the error term, $\gamma^t$ is the transposed vector of regression coefficients, Z is the covariate vector, σ is the scale parameter and μ is the intercept (Klein & Moeschberger 2003, chapter 12).

$$\ln T = \mu + \gamma^t Z + \sigma W$$

Unlike in the earlier models in this study, the distribution of the survival function has to be chosen (Klein & Moeschberger 2003, chapter 12). In this case I have tried several distributions that can be used in an Accelerated Failure-Time model: the Weibull distribution, gamma distribution, logistic distribution, log logistic distribution and log normal distribution.

Of these models, log logistic and log normal had the best and similar AIC values and Cox-Snell plots (methods described in section 2.3 model diagnostics). Log logistic had slightly better AIC values and log normal a slightly better Cox-Snell plot. I decided to go by AIC values, since numeric values are exact but the difference between graphs will look bigger or smaller depending on the scale.

The log logistic distribution has a symmetric shape that starts at zero, rises to a peak and declines toward zero when the scale parameter σ of the distribution is less than 1. When σ is equal to or above 1, the shape declines toward zero (Allison 1995, chapter 4). The distribution of the probabilities of breeding in figure 3, section 3.2.4, follows this pattern.

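The survival function and hazard function of the log logistic distribution are given below, where α, λ > 0 and t ≥ 0. The value α = 1/σ and the value λ = exp(-μ/σ) (Klein & Moeschberger 2003, chapter 2, chapter 12).

$$S(t) = \frac{1}{1 + \lambda t^{\alpha}}$$

$$h(t) = \frac{\alpha \lambda t^{\alpha - 1}}{1 + \lambda t^{\alpha}}$$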
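The two functions, and the mapping from μ and σ to α and λ, can be sketched as follows. The values of μ and σ are hypothetical and chosen only so that σ is below 1, which gives the hazard that rises to a peak and then declines.

```python
import math


def loglogistic_survival(t, alpha, lam):
    """S(t) = 1 / (1 + lam * t**alpha)."""
    return 1.0 / (1.0 + lam * t ** alpha)


def loglogistic_hazard(t, alpha, lam):
    """h(t) = alpha * lam * t**(alpha - 1) / (1 + lam * t**alpha)."""
    return alpha * lam * t ** (alpha - 1) / (1.0 + lam * t ** alpha)


def from_aft(mu, sigma):
    """Map the linear-model parameters to (alpha, lam)."""
    return 1.0 / sigma, math.exp(-mu / sigma)


# Hypothetical parameter values (illustration only).
mu, sigma = 1.6, 0.4
alpha, lam = from_aft(mu, sigma)

# The median of the baseline distribution is exp(mu), where S equals one half.
print(round(loglogistic_survival(math.exp(mu), alpha, lam), 6))

# The hazard is low early, peaks, and declines again.
for t in (1.0, 5.8, 50.0):
    print(t, round(loglogistic_hazard(t, alpha, lam), 4))
```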

These distributions of survival and hazard are then used in the Accelerated Failure-Time model, described below, and thus make it a Log Logistic Accelerated Failure-Time model (Klein & Moeschberger 2003, chapter 12).

The Accelerated Failure-Time model divides the baseline median time to event by the acceleration factor. The acceleration factor is an exponential function of the covariates and the regression coefficients. It relates the baseline survival function to the survival function at time t; as a consequence, it also relates the baseline hazard function and the time to event to the risk of event. The relationships are described below, where $S_0$ is the baseline survival function, $h_0$ is the baseline hazard function, t is the time to event, $\theta^t$ is the transposed vector of regression coefficients and Z is a vector of fixed explanatory covariates. The acceleration factor is $\exp(\theta^t Z)$ (Klein & Moeschberger 2003, chapter 12).

$$S(t \mid Z) = S_0\left[\exp(\theta^t Z)\, t\right]$$

$$h(t \mid Z) = \exp(\theta^t Z)\, h_0\left[\exp(\theta^t Z)\, t\right]$$

The relationship between these models and the linear model first presented is that $S_0$ is the survival function of the random variable exp(μ + σW), and θ = -γ. Thus the acceleration factor for a one unit change in a covariate is calculated as exp(-γ) = ψ. The estimated median time to event for the reference category is ψ times that of a subject with a one unit change in the covariate (Klein & Moeschberger 2003, chapter 12).
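The relation between ψ and the median times can be verified numerically. This sketch assumes a single covariate and uses the fact that the median of the log logistic model is exp(μ + γZ), since the error term has median zero; the values of μ and γ are hypothetical.

```python
import math


def acceleration_factor(gamma):
    """psi = exp(-gamma) for a one unit change in the covariate."""
    return math.exp(-gamma)


def median_time(mu, gamma, z):
    """Median time to event for covariate value z: exp(mu + gamma * z)."""
    return math.exp(mu + gamma * z)


# Hypothetical parameter values (illustration only).
mu, gamma = 1.6, -0.25
psi = acceleration_factor(gamma)
baseline = median_time(mu, gamma, 0)  # reference category
shifted = median_time(mu, gamma, 1)   # one unit change in the covariate

print(round(psi, 4), round(baseline, 3), round(shifted, 3))
```

With a negative γ the acceleration factor is above 1, so the subject with the one unit change reaches the event earlier than the reference category.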

Thus the Accelerated Failure-Time model provides an estimate ψ of the change in median time to event that does not require the proportional hazards assumption and that is not limited to right censored data (Allison 1995, chapter 4).

2.2.3.1 ASSUMPTIONS OF THE LOG LOGISTIC ACCELERATED FAILURE-TIME MODEL

The assumptions of an Accelerated Failure-Time model are identical to those of an ordinary linear model, since the difference between the models is that accelerated failure-time allows for censored observations (Allison 1995, chapter 4).

2.2.3.1.1 INDEPENDENT IDENTICALLY DISTRIBUTED OBSERVATIONS

The observations for an ordinary linear model must be independent of each other and identically distributed in the groups in the sample (Wooldridge 2010, chapter 4). Since the fish caught in nets are caught at random, they are independent of each other and the distribution of the observations can be assumed to be identical in each group, since no observation is favoured by the nets.

2.2.3.1.2 LOG LOGISTIC DISTRIBUTION

By definition, the distribution must be log logistic for this model. This requires the plot of the distribution of events to have a peak and then decrease towards zero (Allison 1995, chapter 4). This assumption is investigated in section 3.2.4 by plotting the probability distribution of the response value.

2.2.3.1.3 HOMOSCEDASTICITY

The Accelerated Failure-Time model requires homoscedasticity, that is the variance of the observations must be constant. The residuals in turn are assumed to have mean zero and be uncorrelated with each regressor (Wooldridge 2010, chapter 4, Allison 1995, chapter 4, Agresti 2015, chapter 1). Thus this assumption is investigated with graphs of the spread of the residuals. If the residuals follow a linear pattern centred around zero with few outliers, then the assumption of homoscedasticity holds. Useful residuals are standardized residuals, Pearson residuals and deviance residuals (Agresti 2015, chapter 2, chapter 5).

2.2.3.1.4 NO MULTICOLLINEARITY

The Accelerated Failure-Time model requires that the population, in this case the brown trout in Östra and Västra Trollsvattnet, has a positive definite variance matrix. This means that no covariate may be an exact linear function of the other covariates. This is investigated with variance inflation factors (Wooldridge 2010, chapter 4, Allison 1995, chapter 4).


2.3 MODEL DIAGNOSTICS

The model must fit the data to give estimates of the correct quantities (hazard ratio, power and acceleration factor). To investigate how well the estimated models fit the data, model diagnostics are needed. Four different model diagnostics are used in this thesis, which are described below.

2.3.1.1 AKAIKE INFORMATION CRITERION

The Akaike Information Criterion (AIC) is a criterion for comparing models that balances the need for few covariates against a good fit to the data, which is why I use it in this study.

The AIC value decreases whenever a covariate is added that improves the model, and increases when a new covariate does not improve the model. The model with the lowest AIC thus contains only the covariates that are necessary to get a better estimate of the outcome.

The AIC is given by the formula below, where L is the usual likelihood function, p is the number of regression parameters and q is some predetermined constant (Klein & Moeschberger 2003, chapter 8).

AIC = -2LogL + qp

2.3.1.2 -2 LOG LIKELIHOOD

The AIC reduces the bias in the -2 Log Likelihood (-2LogL) estimator (Agresti 2015, chapter 4). However, when I estimate a model where I stratify upon every covariate, SAS only gives the -2LogL estimator without covariates. The log likelihood can also be used to choose the model that best fits the data (Wooldridge 2010, chapter 13). I therefore include this estimate to be able to compare all the models, with the drawback that the estimate is biased.

2.3.1.3 GENERALIZED R-SQUARED

The generalized R-squared is a bias-reduced version of the R-squared, which is a measure of how effectively the covariates predict the outcome of a regression. Comparing R-squared values is analogous to comparing log likelihoods in a regression context. It measures the reduction in prediction errors, which is illustrated in the formulas below, where Y is the response variable, Z the explanatory variables, p the number of explanatory variables and n the number of observations (Agresti 2015 chapter 2, Wooldridge 2010, chapter 13, Allison 1995, chapter 8).


The generalized or adjusted version adjusts for the sample size and the number of covariates. The usefulness of the generalized R-squared for comparing models is disputed, but it is always valuable to see how much of the outcome the covariates explain (Agresti 2015 chapter 2, Wooldridge 2010, chapter 13, Allison 1995, chapter 8).

The SAS procedures for Cox regression and Accelerated Failure-Time models do not themselves produce an R-squared estimate. The generalized R-squared can, however, be calculated from the chi-square statistic C of the likelihood ratio test of the null hypothesis that all parameter estimates are 0, together with the sample size n, as illustrated below (Allison 1995, chapter 8). A likelihood ratio test exists for the Accelerated Failure-Time model and for every Cox model except the one in which every covariate has been stratified upon; more about that model later.
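As a hedged illustration of the calculation described above, the sketch below assumes the usual form of the generalized R-squared, 1 - exp(-C/n), and recovers C from the reported fit statistics under the further assumptions that q = 2 in the AIC formula and that p equals the number of estimated parameters (5 for the model of Table 4, 6 for the model of Table 5). The function names are hypothetical. Under these assumptions the sketch reproduces the generalized R-squared values reported later for those two Cox models.

```python
import math


def lr_chi_square(neg2logl_null: float, aic_full: float, p: int, q: float = 2.0) -> float:
    """Likelihood ratio chi-square C.

    Uses AIC = -2LogL + q*p to recover -2LogL with covariates.
    q = 2 is an assumption; the thesis only calls q a predetermined constant.
    """
    neg2logl_full = aic_full - q * p
    return neg2logl_null - neg2logl_full


def generalized_r_squared(c: float, n: int) -> float:
    """Generalized R-squared, assumed here to be 1 - exp(-C / n)."""
    return 1.0 - math.exp(-c / n)


n_obs = 6149  # number of observations, Table 1

# Fit statistics as reported in Tables 4 and 5; p is assumed (see lead-in).
r2_table4 = generalized_r_squared(lr_chi_square(12084.880, 11197.542, p=5), n_obs)
r2_table5 = generalized_r_squared(lr_chi_square(11760.561, 10574.818, p=6), n_obs)

print(round(r2_table4, 4), round(r2_table5, 4))
```

That both reported values come out of the same formula suggests the assumptions are reasonable for the Cox models, but it does not confirm how SAS or the author actually set q and p.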

2.3.1.4 COX-SNELL RESIDUALS

The Cox-Snell residuals are defined according to the formula below, which combines an estimator of the baseline cumulative hazard rate with the covariate vector Z and the estimated parameters b, where p is the number of covariates and k indexes a single covariate (Klein & Moeschberger 2003, chapter 11).

If the estimates of the Cox Proportional Hazards model are close to the true parameter values, then these two estimates should be approximately equal and the resulting plot should follow a 45 degree line. If it does not, the model needs to be interpreted with caution. I use the Cox-Snell residuals since they are useful for examining the overall fit of both the Cox proportional hazards model and parametric models (Klein & Moeschberger 2003, chapter 11, chapter 12).

The Cox-Snell residuals require an output to work, which means that no such plot can be created for the Cox model that handles nonproportional hazards with time-dependent covariates.
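For orientation, the Cox-Snell residual is commonly written as below. This is a sketch of the standard textbook form, not a reproduction of the thesis's own formula: the symbols \hat{H}_0 (estimated baseline cumulative hazard), T_j (observed time of individual j) and the index j are assumptions introduced here, while Z, b, p and k follow the definitions in the text.

```latex
r_j = \hat{H}_0(T_j)\,\exp\!\left(\sum_{k=1}^{p} Z_{jk}\, b_k\right),
\qquad j = 1, \dots, n
```

In the usual diagnostic plot, the residuals r_j are plotted against an estimate of their own cumulative hazard; if the model fits, the points lie close to the 45 degree line.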


3 RESULTS

3.1 CENSORING

Survival analysis is the study of events and censored observations. Table 1 below gives the number and percentage of each in this sample.

Table 1: The censoring of the observations in the sample

               Number   Percent
Observations   6149     100 %
Events         2391     38.88 %
Censored       3758     61.12 %

3.2 INVESTIGATION OF ASSUMPTIONS

Several of the assumptions described in the earlier sections require a thorough investigation. That real data do not necessarily meet the assumptions is a fact of life and must be accepted. Nevertheless, an investigation should be made to see whether the assumptions hold or whether the model must be taken with some caution.

3.2.1 SUPREMUM TEST FOR THE PROPORTIONAL HAZARDS ASSUMPTIONS

This section reports a hypothesis test of the proportional hazards assumption. The null hypothesis, tested at a 5 % significance level, is that the hazards are proportional.

Table 2: Supremum hypothesis test for nonproportional hazards.

Explanatory variable   Maximum absolute value   p-value of the hypothesis test
Cluster                1.8376                   <.0001
Lake                   0.3119                   0.8740
Sex                    6.2699                   <.0001
Body length            5.3938                   <.0001
Body weight            4.7815                   <.0001

The hypothesis test for nonproportional hazards is shown in Table 2. The null hypothesis is rejected with strong significance for the covariates Cluster, Sex, Body length and Body weight, so for these covariates the hazards are statistically significantly nonproportional. For the covariate Lake the null hypothesis is not rejected. On the contrary, the p-value is very large (0.8740), indicating that there are no statistically significant nonproportional hazards for Lake.

3.2.2 STANDARDIZED SCORE RESIDUALS PLOTS

These plots visualize the result from the hypothesis test in the former section.

Figure 1: The plotting of the standardized score residuals (panels i to v).

Figure 1 confirms the hypothesis test. The important line is the thick blue line, which shows whether the hazards are proportional: if it stays within the critical interval of 1.3581, the hazards are proportional. It is well beyond the critical interval in every graph except that of Lake in part ii), for which it is well within the interval (Klein & Moeschberger 2003, chapter 11).
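The critical value 1.3581 coincides with the 5 % point of the supremum of the absolute value of a standard Brownian bridge. The sketch below checks this with the classical Kolmogorov series, under the assumption (not stated explicitly in the text) that this is the reference distribution behind the critical interval. The function name is hypothetical, and the sketch does not attempt to reproduce the p-values in Table 2.

```python
import math


def sup_brownian_bridge_tail(x: float, terms: int = 100) -> float:
    """P(sup |B0(t)| > x) for a standard Brownian bridge B0 on [0, 1].

    Kolmogorov series: 2 * sum_{k>=1} (-1)^(k+1) * exp(-2 * k^2 * x^2).
    """
    return 2.0 * sum(
        (-1) ** (k + 1) * math.exp(-2.0 * k * k * x * x)
        for k in range(1, terms + 1)
    )


# Tail probability at the critical value quoted in the text.
p_at_critical = sup_brownian_bridge_tail(1.3581)
print(p_at_critical)
```

A tail probability very close to 0.05 at 1.3581 is consistent with reading the critical interval as a 5 % test.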

3.2.3 ARJAS PLOTS

The Arjas plots investigate if the hazards are proportional, as described in section 2.2.1.1.6.

Figure 2: The Arjas plots. i) Cluster, ii) Lake, iii) Gender, iv) Body weight, v) Body length.

From Figure 2 i) for Cluster we see that the curves deviate from the 45 degree line, thus the covariate shall be included in the model. Cluster A follows a curve and Cluster B varies from high above the diagonal to considerably below it, which indicates that the hazards are nonproportional. So the covariate shall be included and the nonproportional hazards adjusted for. Panel ii) for Lake shows curves that go considerably above the 45 degree line, so the covariate shall be included in the model. Both lakes follow the same curve, which first goes high above the diagonal and then returns to it, so the hazards are likely nonproportional in this case. From iii) for Gender we see that the female curve deviates from the 45 degree line. The male curve is closer to linear but varies more and is close to the 45 degree line. The hazards are thus likely nonproportional, and gender has an effect on the probability of breeding at least for females. Panel iv) for weight shows curves that deviate well from the 45 degree line, except the curve for weights beneath 90 grams, thus this covariate should be included in the model. The lines go up and down and vary considerably. Thus the hazards are nonproportional for weights beneath 169 grams and measures need to be taken. Weight is stratified upon by its quartiles. Panel v) shows curves that deviate well from the 45 degree line, and thus the covariate should be included in the model. Except for lengths beneath 215 mm, all the curves are non-linear, and thus the hazards are nonproportional. Length is stratified upon by its quartiles.

The Arjas plots in Figure 2 thus show that all the covariates have nonproportional hazards and an effect on the probability of breeding, so all covariates should be included in the model.

3.2.4 PROBABILITY DISTRIBUTION OF BREEDING

The distribution of the age at breeding is investigated with descriptive plots of the observations of breeding, to see whether breeding among the brown trout in the sample matches any distribution. The censored observations are therefore excluded.

Figure 3: The probability density function for age at breeding.

No predefined distribution fitted this sample, which is why the plot only shows the kernel curve, which is uneven. The curve follows an asymmetrical pattern that rises from zero. If the ups and downs in the curve are ignored and only the histogram is followed, the rise is sharp. From the peak the curve proceeds more slowly downwards to zero.

While the breeding of the brown trout does not follow any predefined statistical distribution, it approaches one sharply and zero slowly like the Gumbel distribution, and like the Log Logistic distribution it rises from zero, peaks and returns to zero. Thus it is possible to use both the Complementary Log-Log Link model and the Log Logistic Accelerated Failure-Time model.

Since the distribution shall be identical in different groups, the distributions for the two clusters, lakes and genders are here compared to each other. I have excluded groups of weight and length since they are continuous covariates.

Figure 4: The probability distributions divided by cluster. i) is Cluster A and ii) is Cluster B.

Figure 4 shows the distribution of the probabilities of breeding in Clusters A and B. Most of both curves lies between 2.5 and 10 years of age, with some observations up to 12.5 years of age. The graph of Cluster A is smooth while the graph of Cluster B goes up and down. Nevertheless, the assumption of identically distributed observations holds, since the observations follow a similar pattern and all ages are observed.

Figure 5: The probability distributions divided by lake. i) is Västra Trollsvattnet and ii) is Östra Trollsvattnet.

Figure 5 shows that the probability distributions of the two lakes cover the same age span and follow a pattern similar to that in Figure 4. There are only slight differences, for instance a deeper dip at the top of Figure 5 i) than in Figure 4 i).

Figure 6: The probability distributions divided by gender. i) is male and ii) is female.

Figure 6 shows a slight difference in the distributions of age at breeding between males and females. The probabilities of breeding for males are centred around 5 years of age and for females around 6 or 7 years of age. The assumption of identical distribution might therefore not hold, and the results have to be interpreted with caution.

3.2.5 HOMOSCEDASTICITY

Homoscedasticity is investigated by plotting the model residuals. When the residuals are constant and centred around zero, the assumption holds; when they are not, the assumption does not hold (Agresti 2015, chapter 4).

Figure 7: The residuals when modelling Lek (breeding) for the complementary log-log link model.

Figure 7 shows the Pearson and deviance residuals for the complementary log-log link model. Both have quite linear shapes and few outliers, so the variance for this model is apparently constant and the homoscedasticity assumption holds.


Figure 8: The standardized residuals of the Log Logistic Accelerated Failure-Time Model.

Figure 8 illustrates the standardized residuals for the Log Logistic Accelerated Failure-Time Model. The homoscedasticity assumption fails: the residuals follow a pattern that starts beneath zero and moves upwards, and they are not centred around zero. The results of the model therefore have to be taken with some caution, since the standard errors are not valid, any inference might not hold and the results are not generalizable.

3.2.6 MULTICOLLINEARITY

The assumption of no multicollinearity is investigated with the variance inflation factor. The variance inflation factor is the factor by which the variance increases in linear regression because the other covariates are correlated with the covariate in question, which makes it a measurement of multicollinearity (Agresti 2015, chapter 4).

Table 3: The variance inflation factors for the covariates

Variable    Reference category     Variance inflation factor
Intercept                          0
Cluster     Cluster A              1.02639
Age                                2.19592
Gender      Male                   1.01524
Lake        Västra Trollsvattnet   1.00417
Weight                             4.77328
Length                             5.88839

Table 3 shows that Cluster, Lake and Gender have variance inflation factors beneath 1.1, which is low, and thus there is no important multicollinearity for them. Age has a higher variance inflation factor, and Weight and Length have high variance inflation factors. This is quite logical, since older fish are larger, and larger fish are both longer and heavier. This can be handled by estimating interaction effects for the correlated covariates (Agresti 2015, chapter 4).
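To make the pattern in Table 3 concrete, the sketch below computes variance inflation factors as the diagonal of the inverse correlation matrix of the covariates, which is equivalent to 1 / (1 - R²) from regressing each covariate on the others. The data are synthetic and purely illustrative (they are not the trout data), and all function and variable names are hypothetical.

```python
import random


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    size = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


random.seed(0)
# Synthetic covariates: weight is built to depend on length, lake is unrelated.
length = [random.gauss(240.0, 40.0) for _ in range(500)]
weight = [0.9 * x + random.gauss(0.0, 15.0) for x in length]
lake = [float(random.randint(0, 1)) for _ in range(500)]

vif = variance_inflation_factors([length, weight, lake])
print(vif)
```

In this toy setting the two correlated covariates get clearly inflated factors while the unrelated one stays close to 1, which mirrors the contrast between Weight and Length on the one hand and Cluster, Lake and Gender on the other.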

Trial and error showed that the interaction of length and gender gave the best p-values and fit statistics, so that interaction effect is included in the complementary log-log link model and the log logistic accelerated failure-time model. The interaction of weight and length would otherwise have been the logical choice, but I wanted the interaction effect that gave the best result for the entire model.

Another option would have been to exclude one of the covariates weight or length, but since I did not want to exclude something that the researchers thought was relevant, both are part of this study. Trial and error also gave a better AIC value with both these covariates included in the model.


When an interaction effect is included in the Cox models in SAS, it gives parameter estimates that can be used for calculating a hazard ratio, but no confidence limits. The hazard ratio statement in the code gives the hazard ratio and confidence interval for the interacting covariates given a certain value of the covariate they are interacting with. To get confidence intervals that estimate the precision of the hazard ratios, and hazard ratio estimates that apply for every value of the other covariates, I do not include interaction effects in the Cox models.


3.3 SURVIVAL ANALYSIS

In this section time to breeding is investigated with the three different models. Since the hazards for the covariates are nonproportional, the Cox model is estimated in three different versions that handle the nonproportional hazards differently. Then the Complementary Log-Log Link model and the Log Logistic Accelerated Failure-Time model are estimated.

3.3.1 COX MODEL WITHOUT REGARD FOR NONPROPORTIONAL HAZARDS

In the model in this section the nonproportional hazards have not been handled at all. This gives hazard ratios for every explanatory variable, since handling the nonproportional hazards means that Cluster and Lake do not get any hazard ratios, as explained in later sections. It also means that hazard ratios with and without handling can be compared.

Table 4: The hazards of breeding without regard for the nonproportional hazards.

AIC with covariates: 11197.542
-2 Log Likelihood without covariates: 12084.880
Generalized R-squared: 0.1358

Explanatory variable   Reference category     p-value   Hazard ratio   95 % profile likelihood confidence limits (lower, upper)
Cluster                Cluster A              <.0001    0.705          0.649   0.767
Lake                   Västra Trollsvattnet   0.0456    1.088          1.002   1.182
Gender                 Male                   0.0259    1.097          1.011   1.191
Body Weight                                   <.0001    0.998          0.997   0.999
Body Length                                   <.0001    0.987          0.986   0.988

Table 4 shows the hazards of breeding for brown trout in the lakes Västra and Östra Trollsvattnet, without regard for the fact that all the hazards are nonproportional. This was done to provide a model with hazard ratios for every covariate, and a hazard ratio in this model can be seen as an average hazard ratio over time.

The hazard ratio of Cluster (0.705) means that the probability that a brown trout belonging to Cluster A is going to breed is 29.5 % lower than for a brown trout belonging to Cluster B, given that age, lake, gender, weight and length are fixed. That is, a brown trout of Cluster A is less likely to breed. The p-value (<.0001) means that the hazard ratio is statistically significant. The 95 % hazard ratio confidence limits of 0.649 and 0.767 mean that, with a probability of 95 %, this interval covers the hazard ratio in the population of brown trout. The quite narrow interval means that the hazard ratio is estimated with high precision.

The hazard ratio of Lake (1.088) means that the probability that a brown trout living in Västra Trollsvattnet is going to breed is 8.8 % higher than for a brown trout living in Östra Trollsvattnet, given that cluster, age, gender, weight and length are fixed. The p-value (0.0456), just beneath 5 %, means that the hazard ratio is statistically significant. Thus brown trout living in Västra Trollsvattnet breed at younger ages. The confidence limits 1.002 and 1.182 mean that, with a probability of 95 %, this narrow interval covers the hazard ratio in the population of brown trout, which gives a high precision of the estimate.

The hazard ratio of Gender (1.097) means that the probability that a male brown trout is going to breed is 9.7 % higher than the probability that a female will breed, given that cluster, age, lake, weight and length are fixed. Thus males breed at younger ages. The p-value (0.0259) means that the hazard ratio is statistically significant. The confidence limits 1.011 and 1.191 mean that, with a probability of 95 %, this narrow interval covers the hazard ratio in the population of brown trout, which gives a high precision of the estimate.

The hazard ratio of Body Weight (0.998) means that the probability that a brown trout is going to breed decreases by 0.2 % for each gram of increase in body weight, given that cluster, age, lake, gender and length are fixed. Thus heavier fish breed at older ages. The p-value (<.0001) means that the hazard ratio is statistically significant. The confidence limits 0.997 and 0.999 mean that, with a probability of 95 %, this narrow interval covers the hazard ratio in the population of brown trout, which gives a high precision of the estimate.

The hazard ratio of Body Length (0.987) means that the probability that a brown trout is going to breed decreases by 1.3 % for each millimetre of increase in body length, given that cluster, age, lake, gender and weight are fixed. Thus longer fish breed at older ages. The p-value (<.0001) means that the hazard ratio is statistically significant. The confidence limits 0.986 and 0.988 mean that, with a probability of 95 %, this narrow interval covers the hazard ratio in the population of brown trout, which gives a high precision of the estimate.
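The percentage readings above follow from a simple conversion of the hazard ratio, (HR - 1) × 100. The sketch below shows that conversion and, as an additional illustration that is not part of the original analysis, how a per-unit hazard ratio compounds over several units under the model's multiplicative structure. The function names and the 10 mm example are hypothetical.

```python
def percent_change(hazard_ratio: float) -> float:
    """Percentage change in the hazard implied by a hazard ratio."""
    return (hazard_ratio - 1.0) * 100.0


def hazard_ratio_for_increase(hazard_ratio_per_unit: float, units: float) -> float:
    """Hazard ratio for an increase of `units`, given the per-unit hazard ratio."""
    return hazard_ratio_per_unit ** units


# Hazard ratios from Table 4.
cluster_change = percent_change(0.705)  # Cluster A relative to Cluster B
lake_change = percent_change(1.088)     # Västra relative to Östra Trollsvattnet

# Hypothetical example: a fish that is 10 mm longer (per-millimetre ratio 0.987).
ratio_10_mm = hazard_ratio_for_increase(0.987, 10)

print(cluster_change, lake_change, ratio_10_mm)
```

The compounding step shows why per-millimetre and per-gram hazard ratios that look close to one can still correspond to sizeable differences between fish of clearly different size.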

The generalized R-squared of 0.1358 means that there is some association between the covariates and age at breeding for the brown trout in the sample. The AIC (11197.542) and -2 Log Likelihood (12084.880) describe the fit of the model to the data and are to be compared with the other models.

3.3.2 COX MODEL WITH TIME DEPENDENT COVARIATES AND STRATIFICATIONS

In this section the nonproportional hazards have been handled with a combination of time dependent covariates and stratifications, to get hazard ratios that account for time-dependency.

Table 5: The hazards of breeding with time dependent covariates.

AIC with covariates: 10574.818
-2 Log Likelihood without covariates: 11760.561
Generalized R-squared: 0.1770
Stratified by: Cluster, Lake

Hazard ratios (HR) are given at the lower quartile, median and upper quartile age.

Explanatory variable         Reference category   Parameter estimate   p-value   HR lower quartile age   HR median age   HR upper quartile age
Gender                       Male                 5.31644              <.0001    3.076                   1.566           0.902
Time dependent Gender                             -3.02453             <.0001
Body Weight                                       -0.02078             <.0001    0.993                   0.996           0.998
Time dependent Body weight                        0.01028              <.0001
Body Length                                       -0.05390             <.0001    0.979                   0.984           0.988
Time dependent Body length                        0.02326              <.0001

Table 5 illustrates a model that handles the nonproportional hazards detected in section 3.2. For the covariates gender, body weight and body length they have been handled by including time dependent covariates of those variables. Each such covariate is created by multiplying the covariate with log(age). Time dependent covariates of Cluster and Lake have p-values of 0.3424 and 0.6821 respectively (results not shown in the table). Since those p-values are statistically insignificant, the nonproportional hazards are handled by stratification upon Cluster and Lake instead. Covariates that have been stratified upon do not produce any parameter estimates or hazard ratios. Still, it is the option that remains when time-dependent covariates are statistically insignificant.

With time dependent covariates, the hazard ratios are not constant over time and must be interpreted differently as illustrated below (Allison 1995, chapter 5).


In this calculation t is the time at which the hazard ratio is calculated. I use the lower quartile age 4, the median age 5 and the upper quartile age 6 as examples for illustration.
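The hazard ratios at ages 4, 5 and 6 in Table 5 can be reproduced under the assumption that the hazard ratio at age t is exp(β + γ · ln t), where β is the parameter estimate of the covariate, γ the estimate of the corresponding time dependent covariate, and the logarithm is the natural one. The sketch below is a minimal check of that reading; the function and variable names are hypothetical.

```python
import math


def time_dependent_hazard_ratio(beta: float, gamma: float, age: float) -> float:
    """Hazard ratio at a given age, assuming HR(t) = exp(beta + gamma * ln(t))."""
    return math.exp(beta + gamma * math.log(age))


# (beta, gamma) pairs taken from the parameter estimates in Table 5.
estimates = {
    "Gender (male)": (5.31644, -3.02453),
    "Body weight": (-0.02078, 0.01028),
    "Body length": (-0.05390, 0.02326),
}

ages = (4, 5, 6)  # lower quartile, median and upper quartile age

hazard_ratios = {
    name: [round(time_dependent_hazard_ratio(beta, gamma, age), 3) for age in ages]
    for name, (beta, gamma) in estimates.items()
}

for name, values in hazard_ratios.items():
    print(name, values)
```

The rounded values agree with the hazard ratio columns of Table 5, which supports the assumed functional form and the use of the natural logarithm of age.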

At 4 years old a male brown trout has 3.076 times the probability of breeding of a 4 year old female. At 5 years old a male has a 56.6 % higher probability of breeding, and at 6 years old a male has a 9.8 % lower probability of breeding. Thus males breed at a younger age than females, and the younger the age the larger the difference. The p-values (<.0001) show that these temporal effects are statistically significant. This is expected given the results from Figure 6 in section 3.2.4.

For every additional gram of body weight, the probability of breeding decreases by 0.7 % at age 4, by 0.4 % at the median age 5 and by 0.2 % at age 6. So the probability of breeding decreases with increased body weight, but the difference decreases with the years. The p-values (<.0001) show that these temporal effects are statistically significant.

For every additional millimetre of length, the probability of breeding decreases by 2.1 % at 4 years of age, by 1.6 % at 5 years of age and by 1.2 % at 6 years of age. Thus longer body length decreases the probability of breeding, and the difference decreases with age. The p-values (<.0001) show that these temporal effects are statistically significant.

Compared to the unhandled hazard ratios (Table 4), the hazard ratios for weight and length are quite similar. For gender, however, the differences are considerable, especially for 4 year old brown trout: the handled hazard ratio is 3.076 compared to the unhandled 1.097, almost three times the size. Thus the hazard ratio for gender varies a lot more over time than the hazard ratios for weight and length.

The generalized R-squared of 0.1770 means that there is some association between the covariates and age at breeding for brown trout in the sample, and more so than for the earlier model. The AIC (10574.818) and -2 Log Likelihood (11760.561) indicate that the model fits the data better than the model without handling of nonproportional hazards (Table 4).


3.3.3 COX MODEL WITH STRATIFICATIONS OF EVERY COVARIATE

Covariates that have been stratified upon produce estimates of the probability that the event does not take place. Without any parameters or hazard ratios, the covariate effects have to be visualized by plotting the estimates for each covariate. Scatterplots have been used to visualise the distribution of the estimated probabilities of not breeding across time, since they are the best illustration of estimates that are spread over the same time interval. Without parameter estimates there are no tests of statistical significance. That the stratified covariates have an effect on the outcome is shown by the Arjas plots in Figure 2. Creating an output from an equation with time-dependent covariates is not possible in SAS, which is why this is only done for this model, where every covariate has been stratified upon.

Table 6: Information about the stratified model.

-2 Log Likelihood without covariates: 9084.122

Stratified upon: Cluster, Lake, Gender, Weight, Length (Weight and Length have been divided into four groups based on their quartiles)

Figure 9: The distribution of times to breeding for each cluster.


Figure 10: The distribution of times to breeding for each Lake.

Figure 11: The distribution of times to breeding for each gender.


Figure 12: The distribution of times to breeding for the quartiles of weight.

Figure 13: The distribution of times to breeding for the quartiles of length.


The scatterplot of Figure 9 shows the distribution of probabilities of not breeding (survival, in survival analysis terms) of the two clusters. Given the result from Table 4, where Cluster A has a lower probability of breeding, we would expect the green diamonds to dominate the high probabilities in Figure 9. Cluster B is more frequent beneath the probability of 0.4 and Cluster A is most common above 0.6, as expected. So this scatterplot confirms the conclusions from Table 4.

The scatterplot of Figure 10 shows the distribution of probabilities of not breeding for Lake. From Table 4 we get that Västra Trollsvattnet has a slightly higher probability of breeding. Thus we expect the probabilities of not breeding to be slightly lower for Västra Trollsvattnet than for Östra Trollsvattnet. Once again the scatterplot confirms this: Västra Trollsvattnet is in the majority beneath 0.2 and Östra Trollsvattnet above 0.8, but otherwise the two lakes are quite equally distributed across the scatterplot.

For gender, Table 5 tells us that males have higher probabilities of breeding at the median and lower ages and lower probabilities of breeding at higher ages. Figure 11 shows that males have lower probabilities of not breeding, that is higher probabilities of breeding, at age 4, and are at both the top and the bottom of the scatterplot for the median age 5 and for ages 6, 7 and 8.

For body weight, the hazard ratios in Table 4 give a slightly lower probability of breeding for each gram of body weight. Thus we expect the heavier fish to have higher probabilities of not breeding. The scatterplot in Figure 12 shows that there are more heavy than light fish groups beneath a probability of not breeding of 0.4, while the lowest weight group, beneath 90 grams, usually has higher probabilities of not breeding, with the other two groups in between. But within the same column, heavier fish have a higher probability of not breeding. In the median age column of 5 years, the heavier fish have higher probabilities of not breeding and the lighter fish have lower probabilities of not breeding. For ages 6 and 7, the probabilities of not breeding are more equally spread. Just as Table 5 indicates, body weight becomes less important with age. Heavier fish have higher probabilities of not breeding when both heavy and light fish are present, as indicated by Table 4, but in general the lighter fish have higher probabilities of not breeding, contrary to expectations.

Table 4 tells us that body length decreases the probability of breeding, thus we expect the longer fish to have higher probabilities of not breeding and the shorter fish to have lower probabilities of not breeding. Figure 13 completely contradicts this: the longest fish, with lengths above 265 mm, usually have low probabilities of not breeding and the shortest, with lengths beneath 215 mm, usually have high probabilities, with the other two groups in between. But with different lengths in the same column, the longer fish have higher probabilities of not breeding. In the median age column of 5 years, the shorter fish categories are further down and the longer further up. The hazard ratios of Table 5 become less different from one with higher age, indicating that length means less when the fish become older; Figure 13 gives no direct support for this.

The -2 Log Likelihood (9084.122) indicates that the model fits the data better than the earlier models.

The advantage of the scatterplot illustrations is that they show the diversity within the different covariates, for instance that brown trout in both clusters are spread over the entire spectrum of probabilities of not breeding. They also provide interpretations of the covariates when hazard ratios with handled nonproportional hazards are not appropriate, as in this case. The disadvantage is that the results are a lot more difficult to interpret than a hazard ratio.

3.3.4 COMPLEMENTARY LOG-LOG LINK MODEL

In this section the first of the parametric models is used to estimate the power of the covariates, which is equivalent to the hazard ratio.

Table 7: Estimates of the Complementary Log-Log Link effects on survival.

AIC with covariates: 7387.129
-2 Log Likelihood without covariates: 8217.869
Generalized R-squared: 0.1745

Explanatory variable   Reference category     Parameter estimate   95 % confidence limits (lower, upper)   p-value   Power
Intercept                                     -4.1747              -4.8288    -3.5207                     <.0001
Cluster                Cluster A              -0.5787              -0.6633    -0.4941                     <.0001    0.561
Age                                           0.1998               0.1532     0.2465                      <.0001    1.221
Lake                   Västra Trollsvattnet   0.2591               0.1755     0.3427                      <.0001    1.296
Gender                 Male                   3.9748               3.3990     4.5506                      <.0001    53.239
Body Weight                                   0.00224              0.000609   0.00387                     0.0071    1.002
Body Length                                   0.00869              0.00514    0.0122                      <.0001    1.009
Gender * Body Length                          -0.0158              -0.0181    -0.0136                     <.0001    0.984

Table 7 gives the parameter estimates that influence the probability of survival, that is of not breeding. The interaction effect of gender and length was necessary to give both of the covariates significant p-values without affecting the rest of the model. It was chosen because it had lower AIC and p-values than the other possible interaction effects. Power is calculated from the parameter estimates as power = exp(β). Since power and hazard ratio are equivalent, the covariates are interpreted the same way as in a Cox model.
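The relation power = exp(β) can be checked directly against Table 7. The sketch below exponentiates the reported parameter estimates; the dictionary keys and variable names are hypothetical labels, and small differences in the last digit are possible because the estimates in the table are themselves rounded.

```python
import math

# Parameter estimates as reported in Table 7 (intercept omitted).
parameter_estimates = {
    "Cluster (Cluster A)": -0.5787,
    "Age": 0.1998,
    "Lake (Västra Trollsvattnet)": 0.2591,
    "Gender (Male)": 3.9748,
    "Body Weight": 0.00224,
    "Body Length": 0.00869,
    "Gender * Body Length": -0.0158,
}

# Power is the exponentiated parameter estimate.
powers = {name: math.exp(beta) for name, beta in parameter_estimates.items()}

for name, power in powers.items():
    print(f"{name}: {power:.3f}")
```

The printed values can be compared line by line with the Power column of Table 7.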

The power of Cluster (0.561) means that belonging to Cluster A decreases the probability of breeding for a brown trout by 43.9 % compared to belonging to Cluster B, given that all other covariates are constant. The p-value (<.0001) is statistically significant. The confidence interval of the parameter is quite narrow, and thus this is a quite precise estimate.


The power of Age (1.221) gives that every year of age increase the probability of breeding with 22.1 %. By consequence older fish are more likely to breed. The p-value (<.0001) is statistically significant. The confidence interval of the parameter is quite narrow, and thus this is a quite precise estimate.

The power of Lake (1.296) gives that a brown trout living in the Lake Västra Trollsvattnet has a 29.6 % increased probability of breeding compared to a brown trout living in Östra Trollsvattnet. The p-value (<.0001) is statistically significant. The confidence interval of the parameter is quite narrow, thus this is a quite precise estimate.

The power of Gender (53.239) gives that a male brown trout have 53.239 times higher probability of breeding given that all other covariates are constant. This is expected given the results from figure 6 in section 3.2.4 probability distributions, were males are given an earlier breeding age. The p-value (<.0001) is statistically significant. The confidence interval of the parameter is not very narrow, and thus this is not a very precise estimate. The interaction of gender and length however has very narrow confidence limits, thus that estimate is very precise.

The power of body weight (1.002) means that every additional gram of weight increases the probability of breeding by 0.2 %. The p-value (0.0071) is statistically significant. The confidence interval of the parameter is very narrow, so this is a very precise estimate.

The power of body length (1.009) means that every additional millimetre of length increases the probability of breeding by 0.9 %. The p-value (<.0001) is statistically significant. The confidence interval of the parameter is very narrow, so this is a very precise estimate.

The power of the interaction effect of gender times length (0.984) means that if both gender and length increase by one (that is, male and one millimetre longer), then the probability of breeding decreases by 1.6 %. The p-value (<.0001) is statistically significant. The confidence interval of the parameter is very narrow, so this is a very precise estimate.
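How the main effect of gender and the interaction combine can be illustrated with a short calculation. This is only a sketch under stated assumptions: gender is taken to be coded 1 for males and 0 for females, length is taken to enter the model uncentred in millimetres, and the rounded powers reported above are used as inputs. The function names are hypothetical. Under these assumptions the male-to-female ratio at a given length is the power of Gender multiplied by the interaction power raised to the length.

```python
import math

GENDER_POWER = 53.239      # reported power of Gender (male versus female)
LENGTH_POWER = 1.009       # reported power of body length, per millimetre
INTERACTION_POWER = 0.984  # reported power of gender times length


def male_to_female_ratio(length_mm: float) -> float:
    """Ratio for a male versus a female of the same length (assumed coding)."""
    return GENDER_POWER * INTERACTION_POWER ** length_mm


def length_power_for_males() -> float:
    """Per-millimetre power of length for males: main effect times interaction."""
    return LENGTH_POWER * INTERACTION_POWER


# Length at which the male-to-female ratio equals one under these assumptions.
break_even_mm = -math.log(GENDER_POWER) / math.log(INTERACTION_POWER)

for length_mm in (100, 200, 300):
    print(f"{length_mm} mm: ratio = {male_to_female_ratio(length_mm):.3f}")
print(f"break-even length = {break_even_mm:.1f} mm")
print(f"length power for males = {length_power_for_males():.6f}")
```

With the rounded estimates the ratio reaches one at roughly 246 mm, so the large main effect of gender applies to short fish and shrinks as length increases. Because the inputs are rounded and the coding is assumed, these figures are indicative only.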

Thus belonging to Cluster A and the interaction effect of being male and longer increase the age at breeding, while living in Västra Trollsvattnet and being male, older, heavier and longer decrease the age at breeding.

The generalized R-squared of 0.1745 means that there is some association between the covariates and age at breeding for brown trout in the sample. This is more than for the Cox model in 3.3.1 but slightly less than for the Cox model in 3.3.2. The AIC (7387.129) and the -2 Log Likelihood (8217.869) indicate that the model fits the data better than the earlier models.
