Information and Software Technology 112 (2019) 48–50
A critical appraisal tool for systematic literature reviews in software engineering☆
Nauman bin Ali∗, Muhammad Usman
Blekinge Institute of Technology, Karlskrona, Sweden
Keywords:
Systematic literature reviews; Quality assessment; Software engineering; Critical appraisal tools; AMSTAR
Abstract
Context: Methodological research on systematic literature reviews (SLRs) in Software Engineering (SE) has so far focused on developing and evaluating guidelines for conducting systematic reviews. However, the support for quality assessment of completed SLRs has not received the same level of attention.
Objective: To raise awareness of the need for a critical appraisal tool (CAT) for assessing the quality of SLRs in SE. To initiate a community-based effort towards the development of such a tool.
Method: We reviewed the literature on the quality assessment of SLRs to identify the frequently used CATs in SE and other fields.
Results: We identified that the CATs currently used in SE were borrowed from medicine, but have not kept pace with substantial advancements in the field of medicine.
Conclusion: In this paper, we have argued the need for a CAT for quality appraisal of SLRs in SE. We have also identified a tool that has the potential for application in SE. Furthermore, we have presented our approach for adapting this state-of-the-art CAT for assessing SLRs in SE.
1. Introduction
Inspired by medicine, evidence-based software engineering (EBSE) promotes the use of systematic literature reviews (SLRs) to systematically identify, evaluate and synthesize research on a topic of interest [1]. Since the introduction of SLRs in Software Engineering (SE), the rate of papers reporting SLRs in SE¹ has been continually increasing (see Fig. 1).
However, several recent in-depth evaluations of published SLRs have identified serious flaws regarding their quality. For example, issues related to: (a) the reporting quality of procedures and outcomes [2], (b) the reliability of search [3], and (c) lack of synthesis or the use of inappropriate synthesis methods [4] in SLRs. Such issues raise questions about the credibility of SLRs.
Most researchers in SE have used the four questions adopted by Kitchenham et al. [5] (items a to d in Table 1) for quality assessment of SLRs. These questions are insufficient to reveal important limitations in an SLR, as demonstrated by the above-listed studies [2–4].
Fig. 2 helps to understand the role of guidelines and distinguish the purpose of critical appraisal tools (CATs). The guidelines for planning and conducting an SLR enable a research team to plan and execute a review
☆ This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, including for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.
∗ Corresponding author.
E-mail addresses: nauman.ali@bth.se (N.b. Ali), muu@bth.se (M. Usman).
¹ Search string used in Scopus to identify SLRs published in computing: TITLE-ABS-KEY(“systematic review” OR “systematic literature review”) AND PUBYEAR < 2019 AND (LIMIT-TO(SUBJAREA, “COMP”)).
that follows a rigorous process [1,5]. Similarly, the reporting guidelines help researchers to communicate the design and execution of an SLR to the readers [1]. More recently, new reporting guidelines have been proposed that are intended to improve the usefulness of the results of an SLR for education and practice [6].
On the other hand, the role of critical appraisal tools is to enable a reader to analytically assess the credibility of a completed SLR. Such an assessment considers both the reporting quality, e.g., “Are the review’s inclusion and exclusion criteria described?”, and the assessment of the risk of bias in the design and execution of the SLR, e.g., “Are the review’s inclusion and exclusion criteria appropriate?”.
As the number of SLRs is increasing, the need for tools to assess the quality of an SLR without having to replicate the study is becoming more evident. Such a CAT will help to sustain and improve the credibility of SLRs as an effective means for decision-support in SE. It will enable the readers of SLRs to differentiate good-quality SLRs from the ones that did not follow a rigorous and comprehensive approach.
In this paper, we seek to raise awareness of the need for a critical appraisal instrument and introduce a candidate solution for this task. The work presented in this paper has the potential to have a
https://doi.org/10.1016/j.infsof.2019.04.006
Received 5 November 2018; Received in revised form 5 April 2019; Accepted 12 April 2019; Available online 15 April 2019
Table 1
AMSTAR-2 and DARE quality criteria used to appraise SLRs.
DARE - Note: Fulfilling items a, b and e, and either c or d is mandatory for an SLR to be included in the DARE database of SLRs.
a. Were inclusion/exclusion criteria reported?
b. Was the search adequate?
c. Was the quality of the included studies assessed?
d. Are sufficient details about the individual included studies presented?
e. Were the included studies synthesised?
AMSTAR-2 - Note: Items marked with an asterisk (∗) are not applicable for the appraisal of SMSs.
1. “Did the research questions and inclusion criteria for the review include the components of PICO?”
2. “Did the report of the review contain an explicit statement that the review methods were established prior to the conduct of the review and did the report justify any significant deviations from the protocol?”
3. “Did the review authors explain their selection of the study designs for inclusion in the review?”
4. “Did the review authors use a comprehensive literature search strategy?”
5. “Did the review authors perform study selection in duplicate?”
6. “Did the review authors perform data extraction in duplicate?”
7. “Did the review authors provide a list of excluded studies and justify the exclusions?”
8. “Did the review authors describe the included studies in adequate detail?”
9. ∗ “Did the review authors use a satisfactory technique for assessing the risk of bias (RoB) in individual studies that were included in the review?”
10. “Did the review authors report on the sources of funding for the studies included in the review?”
11. ∗ “If meta-analysis was performed did the review authors use appropriate methods for statistical combination of results?”
12. ∗ “If meta-analysis was performed, did the review authors assess the potential impact of RoB in individual studies on the results of the meta-analysis or other evidence synthesis?”
13. ∗ “Did the review authors account for RoB in individual studies when interpreting/discussing the results of the review?”
14. ∗ “Did the review authors provide a satisfactory explanation for, and discussion of, any heterogeneity observed in the results of the review?”
15. ∗ “If they performed quantitative synthesis did the review authors carry out an adequate investigation of publication bias (small study bias) and discuss its likely impact on the results of the review?”
16. “Did the review authors report any potential sources of conflict of interest, including any funding they received for conducting the review?”
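The DARE inclusion rule stated in the note above (items a, b and e are mandatory, plus at least one of c and d) can be expressed as a simple predicate. The function name and boolean encoding below are our own illustration, not part of DARE:

```python
def dare_includable(a, b, c, d, e):
    """Illustrative encoding of the DARE note in Table 1: an SLR is
    included in the DARE database only if items a, b and e are fulfilled,
    together with at least one of items c and d. Each argument is True
    when the corresponding DARE item is satisfied."""
    return a and b and e and (c or d)
```

For example, a review with an adequate search and synthesis but no quality assessment of the included studies (item c) can still qualify via item d.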
Fig. 1. The increasing number of SLRs in computing since 2004.
Fig. 2. A view of quality on the various stages of an SLR.
profound impact on SE research, since it is useful for two very common scenarios: (1) to assess the quality of an SLR as a referee/reader, and (2) to synthesize the results of several SLRs on the same topic and to understand the reasons for any differences between their results.
The remainder of the paper is structured as follows: Section 2 explains the need for a CAT for SLRs in SE. Section 3 presents a state-of-the-art CAT. In Sections 4 and 5, we briefly propose an approach to customize and validate the tool for SE. Section 6 concludes the paper.
2. Need for a CAT for the quality assessment of SLRs in SE
Since 2004, when the first guidelines for SLRs in SE were introduced, several improvements have been made to the guidelines for conducting and reporting SLRs in SE [1]. However, the appraisal tools for SLRs in SE have not received much attention. Researchers in the SE field have continued to rely on a subset of questions identified by Kitchenham et al. [5] from the field of evidence-based medicine in the year 2004. The commonly-used interpretation of the DARE² criteria in SE does not
² The CRD’s Database of Abstracts of Reviews of Effects (DARE), https://www.crd.york.ac.uk/CRDWeb/AboutPage.asp.
even consider if there is a synthesis performed in a review. This explains to some extent why some of the limitations in the quality of SLRs, e.g., poor reporting quality [2], lack of an adequate search strategy [3] and the lack of synthesis [4], cannot be sufficiently revealed with the CATs currently used in SE.
In the meantime, realizing the importance of CATs to assess the quality of completed systematic reviews, researchers in other disciplines have further developed these tools. A review of evidence-based medicine literature reveals that one tool that stands out for the degree of validation and application is AMSTAR (A MeaSurement Tool to Assess systematic Reviews) [7]. AMSTAR was developed based on a scoping review of the then-available rating instruments. The review identified several overlapping appraisal items, which were combined into 11 AMSTAR appraisal items using factor analysis [7]. After pilot testing, the original AMSTAR was validated externally as well [8]. AMSTAR³ has since then been used and validated extensively [8,9].
3. Candidate CAT for quality assessment of SLRs in SE
Recently, the designers of AMSTAR have proposed a revision of the tool (AMSTAR-2 [8]). The revision is based on community feedback collected through different channels such as published reports of its application, the AMSTAR website,⁴ surveys of AMSTAR users, and the experience of participants in AMSTAR workshops. The team that has revised the tool includes designers of the original instrument and two designers of another instrument, ROBIS (Risk Of Bias In Systematic reviews). ROBIS⁵ is a relatively new instrument and is designed to support reviewers in assessing the risk of bias in completed SLRs.
AMSTAR-2 can be used to appraise SLRs that may include randomized or non-randomized studies, or both. AMSTAR-2 has a more detailed assessment of the risk of bias in SLRs due to the primary studies included, and how the review authors have dealt with such bias when interpreting review results. AMSTAR-2 consists of 16 items (see Table 1), and each item has detailed response options to guide users to make the appropriate judgement (see the complete AMSTAR-2⁴ for details). The initial
³ The AMSTAR paper [7] had 2958 citations on February 13, 2018.
⁴ AMSTAR: https://amstar.ca/.
⁵ ROBIS: https://www.bristol.ac.uk/population-health-sciences/projects/robis/robis-tool/.
Fig. 3. Approach for adapting AMSTAR-2 for SE: (1) adapt AMSTAR-2 for SE; (2) obtain community feedback on the first version of CATSER; (3) revise CATSER by synthesizing community feedback; (4) if there is not yet sufficient consensus, elicit more feedback and repeat; otherwise, (5) prepare and distribute CATSER for validation; and (6) revise CATSER if required.
evaluation of AMSTAR-2, by having multiple raters use the tool, has shown moderate to good agreement for most items in the tool [8].
AMSTAR-2 has several advantages over the DARE criteria commonly used in SE. DARE is not a CAT per se; it is intended to provide the criteria that SLRs should meet to be included in the CRD’s database of SLRs. In SE, only four of the five items of DARE (items a to d in Table 1) have often been used [5]. Apart from item b, the formulation of DARE items only captures the reporting quality in SLRs (cf. [5]), e.g., see item a about reporting of the selection criteria. Furthermore, many of the items in AMSTAR-2 (e.g. items 1, 5, 6, 7, 10, 14, 15, and 16) which capture the quality of an SLR are not covered by the DARE criteria.
In this study, we have identified AMSTAR-2 as a candidate CAT that can be adapted for SE. The approach we will use in developing and validating CATSER (a Critical Appraisal Tool for SE systematic Reviews based on AMSTAR-2) is described in the following sections and depicted in Fig. 3.
4. A proposed approach for adapting AMSTAR-2 for SE
We propose to first adapt AMSTAR-2 for SE by reviewing its items and response options for their relevance to SE using the recommendations in the EBSE literature (e.g., [1,10]). In the next phase, we will involve the SE research community for the further evolution of CATSER. We will organize workshops at the prominent SE venues (e.g., the international symposium on empirical software engineering and measurement (ESEM)⁶). Furthermore, a web-based forum will be set up to collect feedback from the wider community.
We have reviewed the relevance of AMSTAR-2 items for SE systematic secondary studies (systematic mapping studies (SMSs) [1] and SLRs). Out of the 16 items in AMSTAR-2, we consider 10 items (see Table 1) relevant for the critical appraisal of both SLRs and SMSs. These 10 items cover the fundamental aspects (e.g., protocol development, systematic search, study selection, and data extraction processes) necessary for the reliability of both SLRs and SMSs.
SMSs do not include a thorough synthesis and detailed quality assessment of the included primary studies [1]. Therefore, we consider the remaining six items regarding synthesis and meta-analysis as only relevant for SLRs.
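The applicability split described above can be summarized in a small sketch. The item numbers follow Table 1; the set and function names are our own, purely illustrative:

```python
# Illustrative sketch of the AMSTAR-2 applicability split discussed above.
# The six asterisked items in Table 1 (9, 11-15) concern quality assessment
# of primary studies, meta-analysis and synthesis, and are treated here as
# relevant only for SLRs, not for systematic mapping studies (SMSs).
SLR_ONLY_ITEMS = {9, 11, 12, 13, 14, 15}
ALL_ITEMS = set(range(1, 17))  # AMSTAR-2 items 1..16

def applicable_items(study_type):
    """Return the AMSTAR-2 item numbers applicable to a secondary study."""
    if study_type == "SLR":
        return set(ALL_ITEMS)
    if study_type == "SMS":
        return ALL_ITEMS - SLR_ONLY_ITEMS
    raise ValueError("unknown study type: %s" % study_type)
```

For an SLR all 16 items apply; for an SMS the remaining 10 items cover the fundamental aspects listed above.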
The response options in AMSTAR-2 are formulated for the medical discipline, and these will require adaptation for SE. For this purpose, we will use the latest guidelines for designing, reporting, conducting and validating systematic secondary studies in SE [1–3,10].
⁶ http://www.esem-conferences.org/.
5. A proposed approach for validating CATSER
We plan to validate CATSER by using it to appraise a set of SLRs using reviewers beyond those who will be involved in the adaptation of AMSTAR-2 for SE. We will allocate a small sample of randomly selected SLRs to the reviewers. Using the results of individually appraised SLRs with CATSER, we plan to compute the inter-rater reliability of CATSER.
Another aspect of the evaluation of CATSER will focus on its usefulness to identify significant flaws in an SLR. In the future, we will compare the assessment of SLRs using CATSER and the commonly-used interpretation of DARE in SE.
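The inter-rater reliability computation mentioned above is not specified further in the paper; as one possible instantiation (our assumption, not the authors' stated method), agreement between two raters on per-item categorical judgements could be summarized with Cohen's kappa:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters who
    assigned a categorical judgement (e.g. "yes"/"no") to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must judge the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

With more than two reviewers per SLR, a multi-rater statistic such as Fleiss' kappa would be needed instead.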
The long-term validation of such instruments depends on how widely they are accepted and used by the community. We hope to initiate a community effort in SE to adapt, validate and mature CATSER (which will leverage the strengths of AMSTAR-2).
6. Conclusion
By comparing the state-of-the-art tools in medicine with the frequently used CATs in SE, and based on the recent evaluations of the quality of SLRs, we identified and emphasized the need for further research on CATs for SLRs in SE. We have also identified a candidate CAT and proposed an approach to adapt it for the needs of SE with the involvement of the SE research community.
This approach will not only improve the quality of the tool, but also ensure community buy-in and thus increase the likelihood of adoption of the tool. Given the continued interest in SLRs in SE, we contend that this work has a potentially significant impact on research. It will help to improve and sustain the credibility of SLRs in SE.
Acknowledgment
The authors would like to thank Prof. Claes Wohlin for providing feedback on the paper. This work has been supported by a research grant for the VITS project (reference number 20180127) by the Knowledge Foundation in Sweden and by ELLIIT, a Strategic Area within IT and Mobile Communications, funded by the Swedish Government.
Conflict of Interest
The authors declare no conflict of interest.
References
[1] B.A. Kitchenham, D. Budgen, P. Brereton, Evidence-Based Software Engineering and Systematic Reviews, Chapman & Hall/CRC, 2015.
[2] D. Budgen, P. Brereton, S. Drummond, N. Williams, Reporting systematic reviews: some lessons from a tertiary study, Inf. Softw. Technol. 95 (2018) 62–74.
[3] N.B. Ali, M. Usman, Reliability of search in systematic reviews: towards a quality assessment framework for the automated-search strategy, Inf. Softw. Technol. 99 (2018) 133–147.
[4] D.S. Cruzes, T. Dybå, Research synthesis in software engineering: a tertiary study, Inf. Softw. Technol. 53 (5) (2011) 440–455.
[5] B. Kitchenham, R. Pretorius, D. Budgen, O. Pearl Brereton, M. Turner, M. Niazi, S. Linkman, Systematic literature reviews in software engineering - a tertiary study, Inf. Softw. Technol. 52 (8) (2010) 792–805.
[6] B. Cartaxo, G. Pinto, S. Soares, Towards a model to transfer knowledge from software engineering research to practice, Inf. Softw. Technol. 97 (2018) 80–82.
[7] B.J. Shea, J.M. Grimshaw, G.A. Wells, M. Boers, N. Andersson, C. Hamel, A.C. Porter, P. Tugwell, D. Moher, L.M. Bouter, Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews, BMC Med. Res. Methodol. 7 (1) (2007) 10.
[8] B.J. Shea, B.C. Reeves, G. Wells, M. Thuku, C. Hamel, J. Moran, D. Moher, P. Tugwell, V. Welch, E. Kristjansson, et al., AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both, BMJ 358 (2017) j4008.
[9] B.J. Shea, L.M. Bouter, J. Peterson, M. Boers, N. Andersson, Z. Ortiz, T. Ramsay, A. Bai, V.K. Shukla, J.M. Grimshaw, External validation of a measurement tool to assess systematic reviews (AMSTAR), PLoS One 2 (12) (2007) e1350.
[10] A. Ampatzoglou, S. Bibi, P. Avgeriou, M. Verbeek, A. Chatzigeorgiou, Identifying, categorizing and mitigating threats to validity in software engineering secondary studies, Inf. Softw. Technol. 106 (2019) 201–230.