

4.4 Threats to validity

Threats to the validity of the mapping study are analyzed with respect to construct validity, reliability, internal validity, and external validity [172]. In particular, we report deviations from the SLR guidelines [103].

Construct validity concerns the relation between the measures used in the study and the theories in which the research questions are grounded. In this study, this concerns the identification of papers, which is inherently qualitative and dependent on the coherence of the terminology in the field. To mitigate this threat, we took the following actions. The search string we used was validated using a golden set of publications, and we executed it in six different publication databases. Furthermore, our subsequent exploratory search further improved our publication coverage. A single researcher applied the inclusion/exclusion criteria, although, as a validation proposed by Kitchenham and Charters [103], another researcher justified 10% of the search results from the primary databases. There is a risk that the specific terms of the search string related to ‘activity’ (e.g., “requirements tracing”) and ‘objects’ cause a bias toward both requirements research and publications with a technical focus. However, the golden set of publications was established by a broad scanning of related work, using both searching and browsing, and was not restricted to specific search terms. Finally, as this work was conducted both as an SM and an SLR, the quality assessment was expressed as an RQ in its own right (RQ3). As such, quality was assessed beyond the inclusion/exclusion criteria. Furthermore, applicable to RQ4, quality differences were accounted for by considering single publications reporting multiple studies as multiple units of empirical evidence.

An important threat to reliability concerns whether other researchers would come to the same conclusions based on the publications we selected. The major threat is the extraction of data, as mainly qualitative synthesis was applied, a method that involves interpretation. A single researcher extracted data from the primary publications, and the other two researchers reviewed the process, as suggested by Brereton et al. [27]. As a validation, both reviewers individually repeated the data extraction on a 15% sample of the core primary publications.
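When a second reviewer independently repeats extraction on a sample like this, agreement between the two can be quantified. As a minimal sketch, one common option is Cohen's kappa; the label sequences below are invented for illustration and are not data from this study:

```python
# Illustrative sketch: quantifying inter-rater agreement when a second
# reviewer repeats data extraction on a sample of publications.
# Cohen's kappa corrects observed agreement for chance agreement.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected by chance, from each rater's label distribution.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical extraction of the 'IR model' field from 10 sampled papers.
rater1 = ["VSM", "LSI", "VSM", "LM", "VSM", "LSI", "LSI", "VSM", "LM", "VSM"]
rater2 = ["VSM", "LSI", "LSI", "LM", "VSM", "LSI", "LSI", "VSM", "LM", "VSM"]

print(f"kappa = {cohens_kappa(rater1, rater2):.2f}")  # kappa = 0.84
```

A kappa near 1 indicates that the extraction protocol is interpreted consistently; values well below 1 would suggest the extraction form needs clearer definitions.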

Another reliability threat is that we present qualitative results with quantitative figures. Thus, the conclusions we draw might depend on the data we decided to visualize; however, the primary studies are publicly available, allowing others to validate our conclusions. Furthermore, as our study contains no formal meta-analysis, no sensitivity analysis was conducted, nor was publication bias explored explicitly.

Internal validity concerns confounding factors that can affect the causal relationship between the treatment and the outcome, which is especially relevant to RQ4. There is a threat that the reporting style in the primary publications has a bigger impact on our conclusions than the actual output from the tools. Consequently, there is a risk that we failed to include results due to differences in both experimental setups and the level of detail in reports. Also regarding RQ4, we aggregate evidence from previous comparisons made in different ways; some report statistical analyses, while others discuss results in more general terms. We do not weight the contributions of the individual studies based on this. Moreover, while we attempted to distinguish between query-based and matrix-based evaluations in the synthesis, different contexts (e.g., domain, work task, artifact types, language of the artifacts) were all included in the synthesis of P-R values.
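For readers unfamiliar with the P-R (precision-recall) values referred to here, the following minimal sketch shows how they are computed for a set of candidate trace links against a gold standard; the link sets are hypothetical examples, not data from any primary study:

```python
# Precision and recall for IR-based trace recovery (illustrative sketch).
# The candidate and correct link sets below are invented for illustration.

def precision_recall(candidate_links, correct_links):
    """Compute precision and recall of candidate trace links against a
    gold-standard set of correct links."""
    candidate = set(candidate_links)
    correct = set(correct_links)
    true_positives = candidate & correct
    precision = len(true_positives) / len(candidate) if candidate else 0.0
    recall = len(true_positives) / len(correct) if correct else 0.0
    return precision, recall

# Hypothetical links: (requirement ID, source file) pairs.
candidate = [("R1", "a.c"), ("R1", "b.c"), ("R2", "b.c"), ("R3", "c.c")]
correct = [("R1", "a.c"), ("R2", "b.c"), ("R3", "d.c")]

p, r = precision_recall(candidate, correct)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

In a query-based evaluation these measures are typically computed per query at given cut-off points, whereas a matrix-based evaluation computes them over the full candidate traceability matrix, which is one reason aggregating P-R values across evaluation styles is delicate.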

External validity refers to generalization from this study. As we do not claim that our results apply to other applications of IR in software engineering, this is a minor threat. On the other hand, due to the comprehensive nature of our study, we extrapolate our conclusions on RQ4 to all studies on IR-based trace recovery published until December 2011, including studies that did not report a sufficient amount of detail.

5 Results

Following the method defined in Section 4.2, we identified 79 primary publications. Most of the publications were published in conferences or workshops (67 of 79, 85%), while twelve (15%) were published in scientific journals. Table 5 presents the top publication channels for IR-based trace recovery, showing that it spans several research topics. Figure 5 depicts the number of primary publications

62 Recovering from a Decade: A Systematic Review of Information. . .

Publication forum                                                #Publications

International Requirements Engineering Conference                      9
International Conference on Automated Software Engineering             7
International Conference on Program Comprehension                      6
International Workshop on Traceability in Emerging Forms
  of Software Engineering                                              6
Working Conference on Reverse Engineering                              5
Empirical Software Engineering                                         4
International Conference on Software Engineering                       4
International Conference on Software Maintenance                       4
Other publication fora (two or fewer publications)                    34

Table 5: Top publication channels for IR-based trace recovery.

per year, starting from Antoniol et al.'s pioneering work in 1999. Almost 150 authors have contributed to the 79 primary publications, on average writing 2.2 of the articles. The top five authors have on average authored 14 of the primary publications each, and are in total included as authors in 53% of the articles. Thus, a wide variety of researchers have been involved in IR-based trace recovery, but there is a small group of well-published authors. More details and statistics about the primary publications are available in Appendix 6.

Several publications report empirical results from multiple evaluations. Consequently, our mapping includes 132 unique empirical contributions, i.e., the mapping comprises results from 132 unique combinations of an applied IR model and its corresponding evaluation on a dataset. As described in Section 4.1, we denote such a unit of empirical evidence a ‘study’, to distinguish it from ‘publications’.

Figure 5: IR-based trace recovery publication trend. The curve shows the number of publications, while the bars display empirical studies in these publications.

Figure 6: Taxonomy of IR models in trace recovery. The numbers show in how many of the primary publications a specific model has been applied; the numbers in parentheses show IR models applied since 2008.