Information and Software Technology 112 (2019) 48–50
A critical appraisal tool for systematic literature reviews in software engineering☆
Nauman bin Ali∗, Muhammad Usman
Blekinge Institute of Technology, Karlskrona, Sweden
Keywords:
Systematic literature reviews; Quality assessment; Software engineering; Critical appraisal tools; AMSTAR
Abstract
Context: Methodological research on systematic literature reviews (SLRs) in Software Engineering (SE) has so far focused on developing and evaluating guidelines for conducting systematic reviews. However, the support for quality assessment of completed SLRs has not received the same level of attention.
Objective: To raise awareness of the need for a critical appraisal tool (CAT) for assessing the quality of SLRs in SE. To initiate a community-based effort towards the development of such a tool.
Method: We reviewed the literature on the quality assessment of SLRs to identify the frequently used CATs in SE and other fields.
Results: We identified that the CATs currently used in SE were borrowed from medicine, but have not kept pace with substantial advancements in the field of medicine.
Conclusion: In this paper, we have argued the need for a CAT for quality appraisal of SLRs in SE. We have also identified a tool that has the potential for application in SE. Furthermore, we have presented our approach for adapting this state-of-the-art CAT for assessing SLRs in SE.
1. Introduction
Inspired by medicine, evidence-based software engineering (EBSE) promotes the use of systematic literature reviews (SLRs) to systematically identify, evaluate and synthesize research on a topic of interest [1]. Since the introduction of SLRs in Software Engineering (SE), the rate of papers reporting SLRs in SE¹ has been continually increasing (see Fig. 1).
However, several recent in-depth evaluations of published SLRs have identified serious flaws regarding their quality. For example, issues related to: (a) the reporting quality of procedures and outcomes [2], (b) the reliability of search [3], and (c) lack of synthesis or the use of inappropriate synthesis methods [4] in SLRs. Such issues raise questions about the credibility of SLRs.
Most researchers in SE have used the four questions adopted by Kitchenham et al. [5] (items a to d in Table 1) for quality assessment of SLRs. These questions are insufficient to reveal important limitations in an SLR, as demonstrated by the above-listed studies [2–4].
Fig. 2 helps to understand the role of guidelines and distinguish the purpose of critical appraisal tools (CATs). The guidelines for planning and conducting an SLR enable a research team to plan and execute a review
☆ This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, including for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.
∗ Corresponding author.
E-mail addresses: nauman.ali@bth.se (N.b. Ali), muu@bth.se (M. Usman).
¹ Search string used in Scopus to identify SLRs published in computing: TITLE-ABS-KEY(“systematic review” OR “systematic literature review”) AND PUBYEAR < 2019 AND (LIMIT-TO(SUBJAREA, “COMP”)).
that follows a rigorous process [1,5]. Similarly, the reporting guidelines help researchers to communicate the design and execution of an SLR to the readers [1]. More recently, new reporting guidelines have been proposed that are intended to improve the usefulness of the results of an SLR for education and practice [6].
On the other hand, the role of critical appraisal tools is to enable a reader to analytically assess the credibility of a completed SLR. Such an assessment considers both the reporting quality, e.g., “Are the review’s inclusion and exclusion criteria described?”, and the assessment of the risk of bias in the design and execution of the SLR, e.g., “Are the review’s inclusion and exclusion criteria appropriate?”.
As the number of SLRs is increasing, the need for tools to assess the quality of an SLR without having to replicate the study is becoming more evident. Such a CAT will help to sustain and improve the credibility of SLRs as an effective means for decision-support in SE. It will enable the readers of SLRs to differentiate good-quality SLRs from the ones that did not follow a rigorous and comprehensive approach.
In this paper, we seek to raise awareness of the need for a critical appraisal instrument and introduce a candidate solution for this task. The work presented in this paper has the potential to have a
https://doi.org/10.1016/j.infsof.2019.04.006
Received 5 November 2018; Received in revised form 5 April 2019; Accepted 12 April 2019; Available online 15 April 2019
Table 1
AMSTAR-2 and DARE quality criteria used to appraise SLRs.
DARE - Note: Fulfilling items a, b and e, and either c or d is mandatory for an SLR to be included in the DARE database of SLRs.
a. Were inclusion/exclusion criteria reported?
b. Was the search adequate?
c. Was the quality of the included studies assessed?
d. Are sufficient details about the individual included studies presented?
e. Were the included studies synthesised?
AMSTAR-2 - Note: Items marked with an asterisk (∗) are not applicable for the appraisal of SMSs.
1. “Did the research questions and inclusion criteria for the review include the components of PICO?”
2. “Did the report of the review contain an explicit statement that the review methods were established prior to the conduct of the review and did the report justify any significant deviations from the protocol?”
3. “Did the review authors explain their selection of the study designs for inclusion in the review?”
4. “Did the review authors use a comprehensive literature search strategy?”
5. “Did the review authors perform study selection in duplicate?”
6. “Did the review authors perform data extraction in duplicate?”
7. “Did the review authors provide a list of excluded studies and justify the exclusions?”
8. “Did the review authors describe the included studies in adequate detail?”
9. ∗ “Did the review authors use a satisfactory technique for assessing the risk of bias (RoB) in individual studies that were included in the review?”
10. “Did the review authors report on the sources of funding for the studies included in the review?”
11. ∗ “If meta-analysis was performed did the review authors use appropriate methods for statistical combination of results?”
12. ∗ “If meta-analysis was performed, did the review authors assess the potential impact of RoB in individual studies on the results of the meta-analysis or other evidence synthesis?”
13. ∗ “Did the review authors account for RoB in individual studies when interpreting/discussing the results of the review?”
14. ∗ “Did the review authors provide a satisfactory explanation for, and discussion of, any heterogeneity observed in the results of the review?”
15. ∗ “If they performed quantitative synthesis did the review authors carry out an adequate investigation of publication bias (small study bias) and discuss its likely impact on the results of the review?”
16. “Did the review authors report any potential sources of conflict of interest, including any funding they received for conducting the review?”
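The DARE inclusion rule stated in the note above (items a, b and e are mandatory, plus at least one of c and d) can be expressed as a simple predicate. The function name and boolean encoding below are our own illustration, not part of DARE:

```python
def dare_includable(a, b, c, d, e):
    """Illustrative encoding of the DARE note in Table 1: an SLR is
    included in the DARE database only if items a, b and e are fulfilled,
    together with at least one of items c and d. Each argument is True
    when the corresponding DARE item is satisfied."""
    return a and b and e and (c or d)
```

For example, a review with an adequate search and synthesis but no quality assessment of the included studies (item c) can still qualify via item d.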
Fig. 1. The increasing number of SLRs in computing since 2004.
Fig. 2. A view of quality on the various stages of an SLR.
profound impact on SE research, since it is useful for two very common scenarios: (1) to assess the quality of an SLR as a referee/reader, and (2) to synthesize the results of several SLRs on the same topic and to understand the reasons for any differences between their results.
The remainder of the paper is structured as follows: Section 2 explains the need for a CAT for SLRs in SE. Section 3 presents a state-of-the-art CAT. In Sections 4 and 5, we briefly propose an approach to customize and validate the tool for SE. Section 6 concludes the paper.
2. Need for a CAT for the quality assessment of SLRs in SE
Since 2004, when the first guidelines for SLRs in SE were introduced, several improvements have been made to the guidelines for conducting and reporting SLRs in SE [1]. However, the appraisal tools for SLRs in SE have not received much attention. Researchers in the SE field have continued to rely on a subset of questions identified by Kitchenham et al. [5] from the field of evidence-based medicine in the year 2004. The commonly-used interpretation of the DARE² criteria in SE does not
² The CRD’s Database of Abstracts of Reviews of Effects (DARE), https://www.crd.york.ac.uk/CRDWeb/AboutPage.asp.
even consider if there is a synthesis performed in a review. This explains to some extent why some of the limitations in the quality of SLRs, e.g., poor reporting quality [2], lack of an adequate search strategy [3] and the lack of synthesis [4], cannot be sufficiently revealed with the CATs currently used in SE.
In the meantime, realizing the importance of CATs to assess the quality of completed systematic reviews, researchers in other disciplines have further developed these tools. A review of evidence-based medicine literature reveals that one tool that stands out for the degree of validation and application is AMSTAR (A MeaSurement Tool to Assess systematic Reviews) [7]. AMSTAR was developed based on a scoping review of the then-available rating instruments. The review identified several overlapping appraisal items, which were combined into 11 AMSTAR appraisal items using factor analysis [7]. After pilot testing, the original AMSTAR was validated externally as well [8]. AMSTAR³ has since then been used and validated extensively [8,9].
3. Candidate CAT for quality assessment of SLRs in SE
Recently, the designers of AMSTAR have proposed a revision of the tool (AMSTAR-2 [8]). The revision is based on community feedback collected through different channels such as published reports of its application, the AMSTAR website,⁴ surveys of AMSTAR users, and the experience of participants in AMSTAR workshops. The team that has revised the tool includes designers of the original instrument and two designers of another instrument, ROBIS (Risk Of Bias In Systematic reviews). ROBIS⁵ is a relatively new instrument and is designed to support reviewers in assessing the risk of bias in completed SLRs.
AMSTAR-2 can be used to appraise SLRs that may include randomized or non-randomized studies, or both. AMSTAR-2 has a more detailed assessment of the risk of bias in SLRs due to the primary studies included, and how the review authors have dealt with such bias when interpreting review results. AMSTAR-2 consists of 16 items (see Table 1), and each item has detailed response options to guide users to make the appropriate judgement (see the complete AMSTAR-2⁴ for details). The initial
³ The AMSTAR paper [7] had 2958 citations on February 13, 2018.
⁴ AMSTAR: https://amstar.ca/.
⁵ ROBIS: https://www.bristol.ac.uk/population-health-sciences/projects/robis/robis-tool/.
Fig. 3. Approach for adapting AMSTAR-2 for SE: (1) adapt AMSTAR-2 for SE; (2) obtain community feedback on the first version of CATSER; (3) revise CATSER by synthesizing community feedback; (4) if there is not yet sufficient consensus, elicit more feedback and repeat; otherwise, (5) prepare and distribute CATSER for validation; and (6) revise CATSER if required.
evaluation of AMSTAR-2, by having multiple raters use the tool, has shown moderate to good agreement for most items in the tool [8].
AMSTAR-2 has several advantages over the DARE criteria commonly used in SE. DARE is not a CAT per se; it is intended to provide the criteria that SLRs should meet to be included in the CRD’s database of SLRs. In SE, only four of the five items of DARE (items a to d in Table 1) have often been used [5]. Apart from item b, the formulation of DARE items only captures the reporting quality in SLRs (cf. [5]), e.g., see item a about reporting of the selection criteria. Furthermore, many of the items in AMSTAR-2 (e.g. items 1, 5, 6, 7, 10, 14, 15, and 16) which capture the quality of an SLR are not covered by the DARE criteria.
In this study, we have identified AMSTAR-2 as a candidate CAT that can be adapted for SE. The approach we will use in developing and validating CATSER (a Critical Appraisal Tool for SE systematic Reviews based on AMSTAR-2) is described in the following sections and depicted in Fig. 3.
4. A proposed approach for adapting AMSTAR-2 for SE
We propose to first adapt AMSTAR-2 for SE by reviewing its items and response options for their relevance to SE using the recommendations in the EBSE literature (e.g., [1,10]). In the next phase, we will involve the SE research community for the further evolution of CATSER. We will organize workshops at the prominent SE venues (e.g., the international symposium on empirical software engineering and measurement (ESEM)⁶). Furthermore, a web-based forum will be set up to collect feedback from the wider community.
We have reviewed the relevance of AMSTAR-2 items for SE systematic secondary studies (systematic mapping studies (SMSs) [1] and SLRs). Out of the 16 items in AMSTAR-2, we consider 10 items (see Table 1) relevant for the critical appraisal of both SLRs and SMSs. These 10 items cover the fundamental aspects (e.g., protocol development, systematic search, study selection, and data extraction processes) necessary for the reliability of both SLRs and SMSs.
SMSs do not include a thorough synthesis and detailed quality assessment of the included primary studies [1]. Therefore, we consider the remaining six items regarding synthesis and meta-analysis as only relevant for SLRs.
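The applicability split described above can be summarized in a small sketch. The item numbers follow Table 1; the set and function names are our own, purely illustrative:

```python
# Illustrative sketch of the AMSTAR-2 applicability split discussed above.
# The six asterisked items in Table 1 (9, 11-15) concern quality assessment
# of primary studies, meta-analysis and synthesis, and are treated here as
# relevant only for SLRs, not for systematic mapping studies (SMSs).
SLR_ONLY_ITEMS = {9, 11, 12, 13, 14, 15}
ALL_ITEMS = set(range(1, 17))  # AMSTAR-2 items 1..16

def applicable_items(study_type):
    """Return the AMSTAR-2 item numbers applicable to a secondary study."""
    if study_type == "SLR":
        return set(ALL_ITEMS)
    if study_type == "SMS":
        return ALL_ITEMS - SLR_ONLY_ITEMS
    raise ValueError("unknown study type: %s" % study_type)
```

For an SLR all 16 items apply; for an SMS the remaining 10 items cover the fundamental aspects listed above.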
The response options in AMSTAR-2 are formulated for the medical discipline, and these will require adaptation for SE. For this purpose, we will use the latest guidelines for designing, reporting, conducting and validating systematic secondary studies in SE [1–3,10].
⁶ http://www.esem-conferences.org/.
5. A proposed approach for validating CATSER
We plan to validate CATSER by using it to appraise a set of SLRs using reviewers beyond those who will be involved in the adaptation of AMSTAR-2 for SE. We will allocate a small sample of randomly selected SLRs to the reviewers. Using the results of individually appraised SLRs with CATSER, we plan to compute the inter-rater reliability of CATSER.
Another aspect of the evaluation of CATSER will focus on its usefulness to identify significant flaws in an SLR. In the future, we will compare the assessment of SLRs using CATSER and the commonly-used interpretation of DARE in SE.
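The inter-rater reliability computation mentioned above is not specified further in the paper; as one possible instantiation (our assumption, not the authors' stated method), agreement between two raters on per-item categorical judgements could be summarized with Cohen's kappa:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters who
    assigned a categorical judgement (e.g. "yes"/"no") to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must judge the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

With more than two reviewers per SLR, a multi-rater statistic such as Fleiss' kappa would be needed instead.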
The long-term validation of such instruments depends on how widely they are accepted and used by the community. We hope to initiate a community effort in SE to adapt, validate and mature CATSER (which will leverage the strengths of AMSTAR-2).
6. Conclusion
By comparing the state-of-the-art tools in medicine with the frequently used CATs in SE, and based on the recent evaluations of the quality of SLRs, we identified and emphasized the need for further research on CATs for SLRs in SE. We have also identified a candidate CAT and proposed an approach to adapt it for the needs of SE with the involvement of the SE research community.
This approach will not only improve the quality of the tool, but also ensure community buy-in and thus increase the likelihood of adoption of the tool. Given the continued interest in SLRs in SE, we contend that this work has a potentially significant impact on research. It will help to improve and sustain the credibility of SLRs in SE.
Acknowledgment
The authors would like to thank Prof. Claes Wohlin for providing feedback on the paper. This work has been supported by a research grant for the VITS project (reference number 20180127) by the Knowledge Foundation in Sweden and by ELLIIT, a Strategic Area within IT and Mobile Communications, funded by the Swedish Government.
Conflict of Interest
The authors declare no conflict of interest.
References
[1] B.A. Kitchenham, D. Budgen, P. Brereton, Evidence-Based Software Engineering and Systematic Reviews, Chapman & Hall/CRC, 2015.
[2] D. Budgen, P. Brereton, S. Drummond, N. Williams, Reporting systematic reviews: some lessons from a tertiary study, Inf. Softw. Technol. 95 (2018) 62–74.
[3] N.B. Ali, M. Usman, Reliability of search in systematic reviews: towards a quality assessment framework for the automated-search strategy, Inf. Softw. Technol. 99 (2018) 133–147.
[4] D.S. Cruzes, T. Dybå, Research synthesis in software engineering: a tertiary study, Inf. Softw. Technol. 53 (5) (2011) 440–455.
[5] B. Kitchenham, R. Pretorius, D. Budgen, O. Pearl Brereton, M. Turner, M. Niazi, S. Linkman, Systematic literature reviews in software engineering - a tertiary study, Inf. Softw. Technol. 52 (8) (2010) 792–805.
[6] B. Cartaxo, G. Pinto, S. Soares, Towards a model to transfer knowledge from software engineering research to practice, Inf. Softw. Technol. 97 (2018) 80–82.
[7] B.J. Shea, J.M. Grimshaw, G.A. Wells, M. Boers, N. Andersson, C. Hamel, A.C. Porter, P. Tugwell, D. Moher, L.M. Bouter, Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews, BMC Med. Res. Methodol. 7 (1) (2007) 10.
[8] B.J. Shea, B.C. Reeves, G. Wells, M. Thuku, C. Hamel, J. Moran, D. Moher, P. Tugwell, V. Welch, E. Kristjansson, et al., AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both, BMJ 358 (2017) j4008.
[9] B.J. Shea, L.M. Bouter, J. Peterson, M. Boers, N. Andersson, Z. Ortiz, T. Ramsay, A. Bai, V.K. Shukla, J.M. Grimshaw, External validation of a measurement tool to assess systematic reviews (AMSTAR), PLoS One 2 (12) (2007) e1350.
[10] A. Ampatzoglou, S. Bibi, P. Avgeriou, M. Verbeek, A. Chatzigeorgiou, Identifying, categorizing and mitigating threats to validity in software engineering secondary studies, Inf. Softw. Technol. 106 (2019) 201–230.