
Combatting the data volume issue in digital forensics: A structured literature review

Bachelor Degree Project in Information Technology IT610G, G2E 22.5HP

Spring term 2020

Date of examination: 2020-06-29
Mattias Sjöstrand

a17matsj@student.his.se


Abstract

The increase in data volume and in the number of data sources submitted as evidence, such as from Internet of Things (IoT) devices or cloud computing systems, has caused the digital forensics process to take longer than before. The increase in time consumption applies to all stages of the digital forensics process, which includes collecting, processing and analysing material. Researchers have proposed many different solutions to this problem and the aim of this study is to summarize these solutions by conducting a systematic literature review. The literature review uses a handful of search terms applied to three different databases to gather the material needed for the study, which is then filtered by selection criteria to guarantee the quality and relevance to the topic. 29 articles were accepted for the analysis process, which categorized the results by using thematic coding. The analysis showed that there are several ways to deal with data growth and that different methods can be applied to different areas of digital forensics such as network forensics or social network forensics. Artificial Intelligence (AI) solutions in particular show a lot of potential for responding to the current and future challenges in the field by reducing manual effort and greatly increasing the speed of the processes.


Table of Contents

1. Introduction
2. Background
2.1 Previous research
3. Problem definition
3.1 Aim
3.2 Limitations
4. Methodology
4.1 Conducting a systematic review
4.1.1 Define resources
4.1.2 Define search terms
4.1.3 Define selection criteria
4.1.3.1 Inclusion criteria
4.1.3.2 Exclusion criteria
4.1.4 Article evaluation
4.1.5 Analysis method
4.1.6 Threats to validity
4.1.7 Article selection
5. Analysis
5.1 Artificial intelligence
5.2 Data mining
5.3 Data reduction
5.4 Triage
5.5 Visualization
5.6 Pattern recognition
6. Discussion
6.1 The article evaluation process
6.2 The analysis process
6.3 Ethical considerations
6.4 Societal impact
6.5 Future research
7. Conclusion

Appendix A – List of included articles


1. Introduction

Back when digital forensics was relatively new, McKemmish (1999) stated that one of the biggest challenges in digital forensics is the rapid increase in the size of storage media. Garfinkel (2010) says that “The growing size of storage devices means that there is frequently insufficient time to create a forensic image of a subject device, or to process all of the data once it is found.” when discussing the future of digital forensics. The capacity to process data is not increasing at the same rate as the storage density of hard disks, which leads to a growing gap between the data volumes and what processing power alone can handle; intelligent ways to process the data are therefore needed to catch up.

Another considerable problem is that it is not always easy to know which source of evidence will be of use in a forensic investigation, and consequently many different images might need to be analysed to determine their value. Secure technologies such as encryption can also hinder the investigation process, making it more and more time consuming. There is also a rise in anti-forensics techniques, which further complicates the process.

While the field of digital forensics is evolving rapidly, its continuous evolution is heavily challenged by the increasing popularity of digital devices and the heterogeneity of the hardware and software platforms being used (Caviglione et al., 2017). The popularity of things such as IoT devices and cloud computing makes it difficult for the technologies in digital forensics to keep up.

This study aims to analyse different methods that deal with big data in digital forensics, with emphasis on volume, to provide insight into how this problem can be approached from various angles.


2. Background

The purpose of this chapter is to describe the background for the problem and further describe why this study is going to focus on this particular issue. It aims to give a context to the problem and explain why it is an important issue both currently and in the future.

The problem of the ever-growing volume of data is something that has been addressed many times before. It is however an ongoing problem, since the exponential growth of technology brings some serious challenges within the digital forensics field (Raghavan, 2013). Not only is storage capacity steadily increasing, but the price is simultaneously going down, making it more affordable to store larger amounts of data.

According to Moore's Law, the number of transistors on an integrated circuit doubles every 18-24 months, predicting the development of technology. Storage density, however, has not been increasing at the same rate. In Walter (2005), Kryder recognized that between 1990 and 2005 the storage density of hard drives had increased from 100 million bits to 100 billion bits, a 1000-fold increase, which corresponds to the capacity doubling every 12 months compared to the 18-24 months of Moore's law. Kryder went on to project in 2009 that a standard 2.5” hard drive would be able to store around 40 terabytes of data and cost around 40 USD or 400 SEK (Kryder & Chang Soo Kim, 2009). Kryder's law has since been proven to be outdated. The cost of media storage is decreasing at a slower pace than before and has been stabilizing since 2010 (Rosenthal et al., 2012). Disk storage costs in 2014 were more than 7 times higher than they would have been if Kryder's law had continued at its usual pace (Rosenthal, 2016). Nonetheless the problems related to data growth still remain and additional problems associated with new technologies have emerged.

The term digital forensics has been mentioned a few times so far but has never actually been explained. Digital forensics is defined by Palmer (2001) as:

The use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorized actions shown to be disruptive to planned operations. (p. 16)

Digital forensics, which is a branch of forensic science, contains many branches of its own that are narrower in scope. These branches include social media forensics, network forensics and mobile device forensics, to name a few. They all have different types of data to deal with and require different solutions to handle them efficiently. Digital forensics experts have a lot of tools at their disposal today, such as Forensic Toolkit (FTK) and Encase, which aid them greatly in their work. When dealing with big data, combinations of methods and tools need to be used, such as reducing the data down to a manageable size before analysing it with the appropriate software. Many digital forensic tools are relatively new and under continuous development, and knowing which areas to focus development on can be very beneficial.


When discussing growing data volumes it is difficult not to mention the term big data. Big data is commonly defined as data that is too large to handle with traditional means. According to De Mauro et al. (2015) big data can be expressed by:

• ‘Volume’, ‘Velocity’ and ‘Variety’, to describe the characteristics of Information involved;

• Specific ‘Technology’ and ‘Analytical Methods’, to clarify the unique requirements strictly needed to make use of such Information

• Transformation into insights and consequent creation of economic ‘Value’, as the principal way Big Data is impacting companies and society

While this describes big data in general, big data is very much a digital forensics problem due to the low cost of digital storage, the increasing ubiquity of computing, and the growth of the type and number of the IoT (Zawoad & Hasan, 2015).

2.1 Previous research

Many ideas to combat the data volume issue have been suggested by various researchers in the past and some of these ideas will be mentioned in this chapter. This will serve as a quick overview of past methods and not go into greater detail on how they were implemented.

One of the ideas is utilizing data mining techniques, such as clustering and entity extraction, to retrieve information (Beebe & Clark, 2005; Palmer, 2001; Shannon, 2004).

Another approach is data reduction as discussed by Beebe (2009), Garfinkel (2006) and Keneally & Brown (2005). By reducing either the amount of data that is being collected or the amount of data that is being processed it can become much more manageable. This can for example be achieved by focusing on key areas such as email, registry files, pictures or whatever may be suitable for the given case. Garfinkel (2006) also suggests Cross-Drive Analysis as a technique to deal with the problem, which is designed to “allow an investigator to simultaneously consider information from across a corpus of many data sources, such as disk drives or solid-state storage devices.” This approach has several advantages over single-disk analysis and is further discussed by Patterson & Hargreaves (2012) as well as Raghavan et al. (2009).

Triage is another concept, commonly used in a medical environment, where it is a process that prioritizes the treatment of patients based on their likelihood of recovery with and without treatment. It is relevant when time and resources are insufficient to treat every individual, such as during a pandemic. In the context of digital forensics, Parsonage (2009) defines triage as ”A process for sorting enquiries into groups based on the need for or likely benefit from examination.” The use of triage in a digital forensics environment is also mentioned by Garfinkel (2010) and Reyes et al. (2007).

Modern GPUs have a large number of processors that are normally used for graphical calculations, but they can also be used to perform general purpose calculations in a process called General-Purpose Computing on Graphics Processing Units (GPGPU). The usage of GPGPU and many threads to increase processing speeds has been discussed by Marziale et al. (2007) and Lee et al. (2008).

User-profiling is used to look for systematic usage patterns and can considerably narrow the search for a perpetrator and support reasoning about the perpetrator's behaviour (Abraham, 2006). The profiles can be built based on what tasks a user engages in, which applications they run, what time of day they do it and for what purpose they run them (Garfinkel, 2010). By finding these patterns, the time and processing power required to complete the investigation can be reduced.

An efficient solution is the application of various intelligent analysis methods such as hash analyses and pattern matching, which include complex algorithms to retrieve information in a more effective way, as examined by Beebe (2009), as well as Artificial Intelligence, which is explained further by Hoelz et al. (2009) and Sheldon (2005).

As evidenced by these sources, this is not the first time this issue has been acknowledged. These sources are however quite outdated by Information Technology (IT) standards, which is one of the reasons this study is conducted. The amount of data has increased along with the knowledge of how to handle it. This study is carried out to get a more recent overview of how the methods have evolved.


3. Problem definition

This chapter will describe the problem the study is trying to solve which includes details on the aim of the study and what the limitations are.

Although the term big data has only been around for about 15 years, being able to interpret all available data is something there has always been a need for. While this study is focused mainly on the volume aspect of big data, the other elements are also of high relevance to digital forensics.

Velocity is a big concern in IoT forensics and variety is a problem since the data comes in many different formats, both unstructured and structured. Structured data is quantitative data that has a defined known structure which makes it simple to search through. Unstructured data has no known structure making it much more complex. Unstructured data can for example be media files, emails or webpages, all of which are often encountered in digital forensics.

The existing tools and infrastructures in digital forensics cannot meet the expected response time when investigating on a big dataset. The amount of digital information is growing beyond the capacity of current techniques and procedures (Zawoad & Hasan, 2015).

The expected outcome of this study is that many of the methods brought up by previous researchers will still be around today, but adapted to today's environment. Considering that the concept of AI has been around since the 1950s (Buchanan, 2005) and has been the subject of a lot of research in modern times, it should have its uses in the field of digital forensics as well.

3.1 Aim

As shown in chapter 2, dealing with big data is a major issue that has had many proposed solutions in the past. These solutions might however not be as applicable to the datasets of today because of the increase in volume. What was considered big data 10 years ago is no longer considered big data today. The foundation of these solutions can however be built upon and improved. The goal of this study is to outline current methods for dealing with big data volumes within a digital forensics context. The following research question has been selected based on what the study is trying to accomplish:

What current methods exist that can combat big volumes of data in digital forensics?

It is important to be clear about what every aspect of the research question means, to leave nothing to chance. Current methods refer to recent methods as opposed to methods from the past, which in this case translates to the past 3 years at the time of writing this report. The reason for this demarcation is further explained in chapter 4. Big volumes of data means data of sizes too big to handle with conventional means such as a simple data search.

The end goal is an arrangement of methods that have proven they can deal with big volumes of data, along with areas for further improvement to keep up with evolving problems.

3.2 Limitations

One aspect that will not be covered in this review is the technical factors behind the methods discovered, such as the algorithms used. There will be surface-level descriptions of their usage, but due to the high level of knowledge in the field required to go in-depth, more thorough explanations of the technical details will be omitted to avoid misinformation. This study focuses more on which methods are effective based on the sources, and not so much on the technical perspective.

As mentioned earlier, the field of IT evolves constantly and rapidly. By conducting a systematic literature review this study is based on already existing research, which can have the effect that some information is left out because it is too recent. Any research published after the collection of material is not included, which could potentially have an impact on the results.

The time and resources available have been used to try and give as fair an analysis as possible. There is however always the argument that, given more time and resources, a better result could be achieved. It is the author's opinion that the effort has been sufficient to produce a compelling result, but it is inarguable that more comprehensive research could be done given more resources.

Digital forensic experts face many challenges today other than data volume, such as legal issues, scattered data and anti-forensics. Since this study focuses solely on one issue, methods that may be efficient with regard to other challenges but are not relevant to this particular problem will be excluded.


4. Methodology

This chapter describes the method used to achieve the goal of the review and why this method was chosen. Details will be provided on how a systematic literature review is conducted and how the choices of databases and search terms were decided.

The previous chapter explained why this review is being conducted, and to achieve the goal, past research in the field is going to be inspected to extract the information needed to come to a conclusion. According to Kitchenham (2004), a common reason for performing a systematic review is ”To identify any gaps in current research in order to suggest areas for further investigation.” A systematic review provides fair and objective research, and because it follows predetermined search strategies the results can easily be reproduced by other researchers. The downside of systematic reviews, however, is that they require a lot of effort compared to a regular review, according to Kitchenham (2004).

4.1 Conducting a systematic review

Kuziemsky & Lau (2017) describe systematic reviews in the following way:

Systematic reviews attempt to aggregate, appraise, and synthesize in a single source all empirical evidence that meet a set of previously specified eligibility criteria in order to answer a clearly formulated and often narrow research question on a particular topic of interest to support evidence-based practice (Liberati et al., 2009). They adhere closely to explicit scientific principles (Liberati et al., 2009) and rigorous methodological guidelines (Higgins & Green, 2008) aimed at reducing random and systematic errors that can lead to deviations from truth in results or inferences (p.164)

According to Kuziemsky & Lau (2017), a systematic review involves the following steps:

1. Formulating a review question and developing a search strategy based on explicit inclusion criteria for the identification of eligible studies (usually described in the context of a detailed review protocol).

2. Searching for eligible studies using multiple databases and information sources, including grey literature sources, without any language restrictions. (grey literature being material produced outside of the traditional distribution channels)

3. Selecting studies, extracting data, and assessing risk of bias in a duplicate manner using two independent reviewers to avoid random or systematic errors in the process.

4. Analysing data using quantitative or qualitative methods.

5. Presenting results in summary of findings tables.

6. Interpreting results and drawing conclusions.


Perhaps a more easily approachable process is proposed by Jesson et al. (2011) with the following steps:

1. Definition of a proper research question
2. Designing of an execution plan
3. Searching for literature
4. Application of inclusion and exclusion criteria
5. Assessing the quality of chosen articles
6. Synthesizing of results

It is very important to have a clear research question to define what the study is trying to accomplish. As stated by Liberati et al. (2009), the research question should focus on an important topic that is relevant, and is often narrow, as questions that are too broad can be very difficult to answer.

There has to be a need for the research to be done so it is important that the question has not already been answered by another source since it would otherwise become a pointless study and the study may need to be altered with a different goal in mind. As stated earlier, the study needs to fill a gap in the research that already exists and spawn new ideas for future studies.

It is important that the method used in the study aligns with the final goal. Certain methods are better suited for certain research questions than others, and it is dependent on what answer is expected. Some questions may have a simple yes or no as the response, such as “Are there fewer women studying IT-related programs at University of Skövde than men?”, while other questions will have a more elaborate answer and need a different method to properly answer the question in a meaningful way.

Brereton et al. (2007) summarize the review process with three main phases: Planning the Review, Conducting the Review and Reporting the Review. It has been further visualized in a flowchart diagram (Figure 1) which shows how the entire process will transpire in this study.

This diagram shows the sub-steps taken within each phase of the systematic literature review process. Step 1 which includes the research question, has already been completed up to this point and the remaining steps of the planning phase will be explained in the upcoming chapters. It is important to plan the review carefully since the early steps will impact the quality of the final product.


4.1.1 Define resources

The second step when doing a systematic literature review is finding the resources that are of interest to the review. There are a lot of scientific databases to choose from, some are more appropriate in certain fields than others, and several sources need to be searched since a single source alone would not find all of the material needed for a proper study. The Google Scholar database can be useful when searching for articles of interest and contains material from many different sources, but it does not allow users to limit results to peer-reviewed or full text material and has not been considered for this reason.

Something to consider is that the search engines behave in different ways. The ACM Portal and IEEExplore, for example, differ in that IEEExplore supports complex logical combinations while ACM does not. That being said, Brereton et al. (2007) recommend the databases IEEExplore, ACM Digital Library, ScienceDirect and CiteSeerX for topics within software engineering. CiteSeerX appears to have big inconsistencies in how it handles searches compared to the other databases and has been excluded for this reason. Based on this suggestion, the databases that will be used in this review are:

• ACM Digital Library

• IEEExplore

• ScienceDirect

After the resources to be used in the study have been defined, the next step in accordance with the flow diagram in figure 1 is to define the search terms.

Figure 1: Flow diagram of the systematic literature review process (author's own)


4.1.2 Define search terms

When doing a systematic literature review it is crucial to define the correct search terms since a too narrow search might leave out a lot of relevant material and a too wide search will give an abundance of results which could be too time-consuming to process. The search terms that are chosen will determine all of the material that will be analysed at a later stage so it is important that they are chosen carefully.

As mentioned in the previous chapter the search engines behave in different ways and it is therefore necessary to use search terms that behave the same within all the databases that are going to be used for the review.

Kitchenham (2004) suggests trying different combinations of search terms derived from the research question and drawing up a list of synonyms, abbreviations, and alternate spellings. Kitchenham (2004) also recommends that sophisticated search strings be constructed using Boolean ANDs and ORs for better results.

With these suggestions in mind, the following search terms were decided upon:

• “digital forensics” AND “big data”
• “computer forensics” AND “big data”
• “digital forensics” AND intelligence
• “computer forensics” AND intelligence
• “digital forensics” AND value
• “computer forensics” AND value
• “digital forensics” AND “data mining”
• “computer forensics” AND “data mining”

Since the terms digital forensics and computer forensics are sometimes used interchangeably when discussing an investigation of any computer-related or digital device, both terms have been included to avoid missing out on any relevant material. The term big data refers to large volumes of data, is often defined by volume, velocity and variety, and is very relevant since the study is trying to solve big data issues. Intelligence is used to include all articles that mention the usage of intelligence in a digital forensics context. Value is included for material that discusses value in some form. Finally, the concept of data mining contains an assortment of different methods for extracting data and is needed as a search term to find texts that have experimented with different types of data mining techniques to efficiently extract relevant information from digital forensic images.
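As a small illustration of how these Boolean combinations come about, the sketch below generates the eight search strings from the two synonym lists; the helper function and its name are made up for the example and were not part of the actual search procedure.

```python
# Minimal sketch (hypothetical helper): build the Boolean search strings of
# chapter 4.1.2 by pairing each forensics synonym with each topic term.
FORENSICS_TERMS = ['"digital forensics"', '"computer forensics"']
TOPIC_TERMS = ['"big data"', "intelligence", "value", '"data mining"']

def build_queries(forensics_terms, topic_terms):
    """Combine every forensics synonym with every topic term using a Boolean AND."""
    return [f"{f} AND {t}" for t in topic_terms for f in forensics_terms]

if __name__ == "__main__":
    for query in build_queries(FORENSICS_TERMS, TOPIC_TERMS):
        print(query)   # eight strings, matching the bullet list above
```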

When matching the keywords against the entire texts, a lot of irrelevant articles will be returned as a result. Because of this, the keywords have been applied only to the title, abstract, index terms and keywords where possible, to make sure that as many results as possible are relevant to the study.


Database              Search target
IEEExplore            Publication title, Abstract & Index Terms
ACM Digital Library   Title & Abstract
ScienceDirect         Title, Abstract & Keywords

Table 1: Databases and search term targets (author's own)

4.1.3 Define selection criteria

The fourth step of planning the review is to define the selection criteria. The selection criteria are what decide which material will be considered for analysis after the search terms have been used in the databases. It is only after all the full texts have been retrieved that the inclusion/exclusion criteria are applied to filter out all the irrelevant texts.

4.1.3.1 Inclusion criteria

For this study the following inclusion criteria have been selected:

• Has to be peer-reviewed

• Publication in journals or conferences

• Earliest publication year 2017

• Written in English

• Relevant to the goal of the study

When an article is peer-reviewed it ensures that it has a certain level of quality, as it would otherwise not have passed the peer-reviewing process. A peer-reviewed journal will not typically publish articles that fail to meet certain standards within the field.

The year requirement is included because of the rapid advancement of technology and knowledge within the IT field; articles that are too dated will not be as relevant today as they were when first published. This is important to make sure that the final analysis reflects how things are in the world at the time of writing this report. Although this inevitably causes some material to be lost, the limit also avoids retrieving an abundance of material while simultaneously keeping the result of the study more current.

When searching the scientific databases there are going to be some articles that are not relevant to the study but still fall within the scope of the search because they happen to contain the keywords used in the search. Because of this, articles that are deemed not relevant to the goal of the study will not continue to the analysis process.


4.1.3.2 Exclusion criteria

The following exclusion criteria have been selected:

• Fails to meet inclusion criteria

• Duplicate article found within another search

• Requires payment

In addition to not meeting the inclusion criteria any duplicate articles are to be removed to prevent analysis of the same material more than once. Any material behind a paywall will also not be used in the study.
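To make the filtering step concrete, the sketch below applies the inclusion and exclusion criteria to a small list of article records. The record fields and the example entries are hypothetical; relevance to the goal of the study is judged manually and is therefore not modelled here.

```python
# Hypothetical article records; in the actual review the metadata comes from
# the database exports described in chapter 4.1.1.
articles = [
    {"title": "Deep learning for forensic triage", "year": 2019,
     "peer_reviewed": True, "venue": "conference", "language": "English",
     "paywalled": False},
    {"title": "Big data analytics survey", "year": 2015,
     "peer_reviewed": True, "venue": "journal", "language": "English",
     "paywalled": False},
]

def meets_criteria(article):
    """Return True when an article passes the criteria from chapters
    4.1.3.1 and 4.1.3.2 (relevance is still judged manually afterwards)."""
    return (article["peer_reviewed"]
            and article["venue"] in ("journal", "conference")
            and article["year"] >= 2017
            and article["language"] == "English"
            and not article["paywalled"])

def deduplicate(articles):
    """Drop duplicate titles found through different searches or databases."""
    seen, unique = set(), []
    for a in articles:
        key = a["title"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique

selected = [a for a in deduplicate(articles) if meets_criteria(a)]
print([a["title"] for a in selected])  # only the 2019 article remains
```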

4.1.4 Article evaluation

The purpose of reading through each article at this stage is mainly to determine whether the article is relevant to the goal of the study, which is one of the inclusion criteria. The flowchart in figure 2 below shows the intended steps taken for each article and will be followed for as long as it takes to establish whether or not the article meets the inclusion criteria and will be included in the review analysis. This means that the process can be aborted at any time as long as a definitive decision can be made.

Note that the process described here is not used when doing the actual analysis on the material that made it through the inclusion/exclusion criteria. A thematic analysis is used for that process instead, which requires the articles to be read in their entirety more than once.

Figure 2: Flow diagram of the article evaluation process (author's own)


4.1.5 Analysis method

For analysing the material at step 8 of the systematic review process (Figure 1) there are several different methods that can be used. For this study a thematic analysis is used, which is a method for identifying, analysing, and reporting patterns, also called themes, within data (Braun & Clarke, 2006). The reason for choosing thematic coding over other methods of analysis is that it is, as Thomas & Harden (2008) describe, a tried and tested method that preserves an explicit and transparent link between conclusions and the text of primary studies; as such it preserves principles that have traditionally been important to systematic reviewing.

The themes that are found need to capture something important to the research question. These themes do not need to be predetermined and can be specified during the analysis. Forcing a theme on the text could result in a lowered quality of the research. Table 2 below shows the steps taken during the thematic analysis.

Phase                            Process description
1. Familiarising with the data   Reading through the data to get an idea of the content.
2. Searching for themes          Finding themes and categorizing data relevant to each potential theme.
3. Reviewing themes              Checking if the themes work in relation to the aim of the study. If not, examine them again.
4. Defining and naming themes    Creating clear definitions for each theme and naming them appropriately.
5. Reporting the results         Reporting the results by analysing the extracts and relating back to the research question.

Table 2: Thematic Analysis phases (author's own)

No external software was used in the identification and colour coding of the themes; instead the built-in feature of Adobe Acrobat Reader was used during the analysis.

4.1.6 Threats to validity

To make sure that the results are legitimate, any threats to validity have to be addressed and counteracted. Zhou et al. (2016) list common threats to validity with their definitions:

• Construct Validity. Identify correct operational measures for the concepts being studied.

• Internal Validity. Seek to establish a causal relationship, whereby certain conditions are believed to lead to other conditions, as distinguished from spurious relationships.

• External Validity. Define the domain to which a study's findings can be generalized.


• Conclusion Validity. Demonstrate that the operations of a study such as the data collection procedure can be repeated with the same results.

There is some overlap between the categories, meaning that a threat can belong to more than one category at a time. Zhou et al. (2016) also present a list of many possible threats in the three different stages of the process from Figure 1. The threats that were deemed the most critical for this study have been addressed in the table below:

Process stage | Validity | Threat | Solution
Planning | Construct, Internal | Inappropriate research question, search method, inclusion/exclusion criteria | Research question is carefully planned based on the aspects outlined in chapter 3. Search method and criteria are described in detail in chapter 4.
Planning | Construct | Incomprehensive databases | Multiple databases are used, chosen based on credible sources to be appropriate for the study.
Conducting | Internal, Conclusion | Study selection bias | Objective selection of resources based on inclusion and exclusion criteria.
Conducting | Internal, Conclusion | Primary study duplication | All duplicates from different searches/databases are identified and removed from the pool.
Reporting | Internal | Subjective interpretation of material | Single reviewer examining the data with clearly described selection criteria.
Reporting | External | Low study generalizability | Study details are selected in a way that provides satisfactory generalization for the


4.1.7 Article selection

This section describes how the article selection process was carried out, to allow for reproducibility and prevent threats to external validity. The table below displays how many articles were fetched from each database selected in chapter 4.1.1 with the selected search terms.

Database              Articles
ACM Digital Library   116
ScienceDirect         39
IEEExplore            341
Total                 496

Table 4: Article distribution (author's own)

A total of 496 articles were fetched from the databases after the following selection criteria were applied during the retrieval.

• Has to be peer-reviewed

• Publication in journals or conferences

• Earliest publication year 2017

• Requires payment

After all the material was retrieved, 112 duplicates were removed in accordance with the selection criteria from chapter 4.1.3.2. The remaining articles were then evaluated against the remaining selection criteria and 29 passed through the process and were in turn accepted for the analysis stage.

Figure 3: Article selection process (author's own)


5. Analysis

This section aims to put the gathered material together and explain the outcome. It contains the final steps which include analysing the articles using thematic analysis and reporting the results.

To give a better overview for the reader, the different methods have been categorized into broader areas. The solutions that are part of the different categories will be explained in more detail later on. Using the steps described earlier, a total of 8 themes were originally identified, which were then grouped into 6 categories based on similarities. While additional strategies were occasionally identified, only solutions that focus on the data volume issue have been included, which means that articles that focus on a more specific issue, such as algorithms for facial recognition or image tampering detection, have not been included. The solutions can be applicable at any stage of the process, for instance during the collection phase or the analysis phase, and they can also be applied to different kinds of big data, such as storage or network data.

There have been several identified passages with different types of clustering, which have been categorized under the data mining category. There are also a few instances of k-means clustering, which have been added under both the data mining category and the AI category since k-means clustering utilizes machine learning techniques. Articles that have passages in two categories that are grouped together are counted as one entry; texts that mention both clustering and data mining fall under the data mining category only once, for example. Below is a table with every passage identified, followed by the altered themes with merged groupings.

Theme                    Nr. of passages
Artificial Intelligence  13
Clustering               9
Data reduction           8
Data mining              6
Triage                   5
Visualization            4
Pattern recognition      4
Carving                  1
Total                    50

Table 5: Original themes with nr. of related passages (author's own)


Theme                    Nr. of passages
Artificial Intelligence  13
Data mining              11
Data reduction           8
Triage                   5
Visualization            4
Pattern recognition      4
Total                    45

Table 6: Altered themes with nr. of related passages (author's own)
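The counting and merging rule described above can be illustrated with a short sketch; the article-to-theme mappings below are invented examples, as the actual coding was done by hand.

```python
from collections import Counter

# Hypothetical article-to-theme passages; the real coding was done manually
# with colour coding in Adobe Acrobat Reader as described in chapter 4.1.5.
passages = {
    "A04": ["Clustering", "Data mining", "Artificial Intelligence"],
    "A13": ["Data reduction"],
    "A29": ["Clustering", "Artificial Intelligence"],
}

MERGE = {"Clustering": "Data mining"}   # clustering is treated as data mining
DROPPED = {"Carving"}                   # only one, vaguely related, passage

def merged_counts(passages):
    """Count at most one entry per article and merged theme, as in Table 6."""
    counts = Counter()
    for article, themes in passages.items():
        merged = {MERGE.get(theme, theme) for theme in themes if theme not in DROPPED}
        counts.update(merged)           # a set, so one entry per article and theme
    return counts

print(merged_counts(passages))
# -> Data mining: 2, Artificial Intelligence: 2, Data reduction: 1
```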

The articles are listed and labelled in appendix A for easier referencing during the upcoming chapters which explain the different themes and certain techniques within them.

5.1 Artificial intelligence

AI is quite a broad field and there exist methods to solve a wide array of problems in many different areas. AI attempts to simulate human intelligence and contains the subfield machine learning, which is the basis for developing intelligence. Machine learning is programming computers to optimize a performance criterion using example data or past experience (Alpaydin, 2010). With the usage of algorithms, machine learning can process data without the need for human interaction and gain more intelligence with each iteration.

Deep learning is an area of research within machine learning that solves problems by processing large datasets and makes use of neural networks. Jackson (2019) describes a neural network as an extremely simplified model of a biological neural network, represented as a collection of interconnected 'artificial neurons'. These networks make decisions from the data inputted and, based on the feedback, adjust the values to get closer to the desired result.

Artificial intelligence is useful in many different contexts due to its ability to adapt without needing manual input. Many solutions have suggested the use of machine learning to tackle the vast amounts of data that needs to be processed in digital forensics.

As mentioned earlier, k-means clustering is one solution that utilizes machine learning techniques. Kapil et al. (2016) describe k-means clustering as an ”unsupervised algorithm used to clique different object into clusters”. A04 proposes a k-means clustering algorithm for big data analytics based on MapReduce techniques which can handle enormous amounts of data effectively. A very simple explanation of MapReduce is that it consists of a mapping phase, which groups similar objects together, and a reduce phase, which then summarizes the objects. A04 combines this model with the k-means clustering algorithm to increase the efficiency when dealing with large amounts of data. In a comparable fashion, A29 also suggests MapReduce in combination with k-means clustering to improve performance over large-scale datasets. These algorithms increase performance but do not consider privacy protection. Another advantage, however, is that this solution can be outsourced to cloud servers, reducing the cost of in-house IT infrastructure.
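To illustrate the clustering step itself, the sketch below runs a plain k-means pass over simple per-file feature vectors. The features and the data are made up, and the MapReduce-based variants in A04 and A29 distribute the assignment and averaging steps over many nodes rather than running them on a single machine as done here.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster. A04/A29 distribute the
    assignment ("map") and averaging ("reduce") steps across many nodes."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)                        # "map" step
        centroids = np.array([points[labels == j].mean(axis=0)   # "reduce" step
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

# Hypothetical per-file features: [file size in MB, entropy estimate]
files = np.array([[0.1, 3.2], [0.2, 3.0], [500.0, 7.9], [480.0, 7.8], [0.3, 3.1]])
labels, centroids = kmeans(files, k=2)
print(labels)  # small low-entropy documents vs. large high-entropy files
```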

Neural networks are a subset of machine learning and work by creating a network of neurons and synapses similar to how the human brain works; they are the basis for the concept of deep learning as explained earlier. A12 experiments with neural network methods as a way to deal with anomalies in networks in big data environments. A25 discusses neural network methods to handle big data and evidence collection. A10 adopts deep learning into the realm of cyber forensics in an attempt to counter the increase of cyber-attacks happening against both individuals and organisations. Deep learning can also be used as an aid for discovering Not Safe For Work (NSFW) images, which is an approach discussed by A15. This approach uses deep learning algorithms to rearrange big datasets of images based on their likelihood of containing pornographic imagery, which can help investigators detect relevant images early when the percentage of relevant images is low. This is useful since child pornography investigations typically require manual inspection, and inspecting thousands of photos manually would take a considerable amount of time.
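The practical effect of A15's approach is to sort a large image set by a model's predicted probability so that the most likely relevant items are inspected first. The sketch below shows only that ranking step, with a stand-in scoring function in place of the trained deep learning model, which is where the actual difficulty lies.

```python
# Stand-in for a trained deep learning classifier; in A15 this would be a
# neural network returning the probability that an image is NSFW.
def predicted_nsfw_probability(image_path):
    return (sum(ord(c) for c in image_path) % 100) / 100.0  # placeholder, not a real model

def rank_for_review(image_paths):
    """Order images so the ones most likely to be relevant are reviewed first."""
    return sorted(image_paths, key=predicted_nsfw_probability, reverse=True)

print(rank_for_review(["img_001.jpg", "img_002.jpg", "img_003.jpg"]))
```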

5.2 Data mining

Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data (Tan et al., 2014). Just like AI, data mining has uses in many different fields. Data mining can for example be used for data analytics in businesses to extract customer buying patterns, or to deal with large amounts of data in the field of science. Tan et al. (2014) also describe data mining as the process of automatically discovering useful information in large data repositories, where the techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown.

Many researchers highlight the usefulness of clustering when dealing with a lot of data. k-means clustering is mentioned, as before, by A04 and A29. A14 demonstrates how clustering can be used to detect suspicious images retrieved from a large-scale database with a high precision rate. Considering how many pictures can be stored on a modern phone alone, the ability to process a large number of photos at a fast rate is very desirable. The imperfections in the manufacturing of camera sensors create a sensor pattern noise which can be used to identify and link source cameras. It is a complicated process but can help investigators link together crime scenes and facilitate their investigation.
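A rough sketch of the idea behind source-camera linking follows: extract a noise residual from each image (here simply the image minus a blurred copy) and correlate the residuals, grouping images whose correlation is high. The denoising step and the synthetic images are simplified stand-ins for the far more careful procedure used in A14.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(image):
    """Crude noise residual: original minus a denoised (blurred) copy."""
    return image - gaussian_filter(image, sigma=2)

def correlation(a, b):
    """Normalized correlation between two flattened residuals."""
    a, b = a.ravel() - a.mean(), b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical grayscale images; img_a and img_b share the same sensor pattern.
rng = np.random.default_rng(1)
pattern = rng.normal(0, 5, size=(64, 64))
img_a = rng.normal(128, 8, size=(64, 64)) + pattern
img_b = rng.normal(128, 8, size=(64, 64)) + pattern
img_c = rng.normal(128, 8, size=(64, 64))          # different camera

print(correlation(noise_residual(img_a), noise_residual(img_b)))  # noticeably higher
print(correlation(noise_residual(img_a), noise_residual(img_c)))  # near zero
```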

With the constantly growing number of mobile devices there is also a growing number of devices that need to be analysed for digital forensics purposes. Data mining is one of the solutions proposed by A19 to deal with this problem. Social networking is very popular both for individuals and organisations and can yield a lot of relevant information for an investigation. Contact information, IM history, and video and picture data from social networking services are all examples of such information.

5.3 Data reduction

Data reduction methods can, for example, optimize the storage or in-network movement of data or reduce data redundancy and duplication. There are additionally methods that achieve data reduction by using compression techniques.

Data reduction means filtering the data so that only what is relevant to the forensic investigator remains, correspondingly reducing the amount of data that needs to be analysed. A13 achieves this by using hash-based solutions, evaluating three different artefact lookup strategies for storing hash-based fragments with efficiency in mind when dealing with large data input.
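A common hash-based reduction step, shown as a minimal sketch below, is to discard files whose hashes appear in a reference set of known, uninteresting files so that only the remainder needs analysis; the reference hashes and paths are hypothetical, and A13's fragment-level lookup strategies are considerably more elaborate.

```python
import hashlib
from pathlib import Path

# Hypothetical reference set of known, uninteresting files (e.g. OS binaries);
# in practice this would be loaded from a large reference hash database.
KNOWN_HASHES = {
    "5d41402abc4b2a76b9719d911017c592",
}

def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large evidence files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def reduce_by_hash(evidence_dir):
    """Keep only files whose hashes are not in the known-file reference set."""
    return [p for p in Path(evidence_dir).rglob("*")
            if p.is_file() and md5_of(p) not in KNOWN_HASHES]

# Hypothetical usage: remaining = reduce_by_hash("/mnt/evidence")
```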

A21 achieved a big reduction of the source volume while retaining key evidence and intelligence data by making use of a combination of selective imaging and open source intelligence. Selective imaging alone can reduce data volume to 0.2% of the source volume, which is a huge reduction when dealing with large volumes. The selective imaging process involves reducing the dimensions of pictures and thumbnail video data and putting them in a logical container. This, combined with quick analysis, entity extraction, mobile device analysis, and cross-case and cross-device processing also outlined in A21, can reduce the data even further for faster processing and analysis times.

The usage of selective imaging for data reduction is further discussed by A20 with focus on IoT. A19 also uses data reduction in this fashion in social network forensics, which deals with a huge amount of media files. Reducing the size of media files on mobile devices saves a substantial amount of storage space, considering how much an average device is capable of storing and how common media files are due to the nature of social network applications.

5.4 Triage

Although the specific interpretation of the term triage varies, it generally refers to a fast, initial screen of potential investigative targets in order to estimate their evidentiary value (Roussev et al., 2013). Triaging can help combat big volumes of data by sorting out the data that is most likely to yield useful results, which makes it favourable when limited resources are available by granting the ability to process the data with the highest priority first.

Triaging and its applications for data reduction is mentioned by A08 while discussing data search techniques. One type of data search method is to match every bit of data storage with target data which is a very time-consuming process. Triage can be applied in this case to reduce the size of the target object and therefore reduce the time needed for analysis.

A11 proposes triaging as a solution to the big data problem in the collection phase of the digital forensics process. These problems involve legal, technical and economic aspects as well as the three V's of big data (Volume, Variety, Velocity). Triaging can be used to prioritize which evidence should be collected.
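A minimal illustration of triage as prioritization: score each potential source on a few quick-to-collect indicators and process the highest-scoring sources first. The indicators and weights below are invented for the example and do not come from A08 or A11.

```python
# Hypothetical devices with quick, cheap-to-collect indicators.
devices = [
    {"name": "laptop",     "owner_is_suspect": True,  "recently_used": True,  "size_gb": 512},
    {"name": "old backup", "owner_is_suspect": False, "recently_used": False, "size_gb": 2000},
    {"name": "phone",      "owner_is_suspect": True,  "recently_used": True,  "size_gb": 128},
]

def triage_score(device):
    """Higher score = examine earlier. Weights are illustrative only."""
    score = 0
    score += 5 if device["owner_is_suspect"] else 0
    score += 3 if device["recently_used"] else 0
    score += 1 if device["size_gb"] <= 256 else 0   # small sources are quick wins
    return score

for device in sorted(devices, key=triage_score, reverse=True):
    print(device["name"], triage_score(device))
# phone 9, laptop 8, old backup 0 -> process the phone and laptop first
```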

5.5 Visualization

Visualization is the graphical representation of data (Chavhan & Nirkhi, 2012). Visualization makes it possible to display large amounts of data at once for more accessible analysis.

While visualization does not deal with the issue of processing big volumes of data directly, it does help in the later stages of the digital forensic investigation process when the data needs to be presented in a way that facilitates readability. An example of visualization is demonstrated in A02 by mapping out email associations graphically for a greater overview. This visualization technique can for example display which emails were sent to which contacts or the number of emails between addresses. The visualized relationships between mail contacts can help the investigators with their analysis.
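To show the kind of structure A02 visualizes, the sketch below counts how many messages were exchanged between each pair of addresses; plotting the resulting weighted graph is what turns these counts into an overview for the analyst. The message list is hypothetical.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs extracted from an email corpus.
messages = [
    ("alice@example.com", "bob@example.com"),
    ("alice@example.com", "bob@example.com"),
    ("bob@example.com", "carol@example.com"),
    ("alice@example.com", "carol@example.com"),
]

# Edge weights of the association graph: number of emails per address pair.
edge_weights = Counter(messages)

for (sender, recipient), count in edge_weights.most_common():
    print(f"{sender} -> {recipient}: {count} emails")
```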

A07 utilizes visualization in network forensics that deals with big data. Depending on the size of the network there can be millions of network log entries which, with the help of visualization, can be summarized and analysed more effectively. Visualization is often used in combination with data analytics to display the results.

5.6 Pattern recognition

Pattern recognition observes patterns in data and is often used in conjunction with other methods. This theme contains solutions that advocate the use of pattern recognition as a stand-alone approach.

Pattern recognition can be effective in detecting suspicious data, both in storage but also in networks as illustrated by A09. By observing patterns in the communication over a network the locations of suspects in a forensic investigation can be disclosed. There are a massive number of packets travelling over networks today and recognizing patterns linked to suspicious activity can give investigators a source IP-address or other potential evidence.

Since most of the communication between individuals today is in the form of electronic messages between various devices, searching for patterns in text is something that has become very important for forensic analysts. Text mining is discussed by A28, where algorithms are used to extract relevant information from electronic messages such as instant messages and emails. Finding patterns becomes difficult due to the fact that messages in this form of communication often contain a lot of noise, but the method proposed by A28 gets around this obstacle.
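As a small illustration of pattern matching on noisy message text, the regular expressions below pull email addresses and IPv4-like strings out of informal chat fragments. A28's text mining methods go well beyond this, but the principle of extracting structured indicators from unstructured messages is the same; the sample messages are invented.

```python
import re

messages = [
    "yo check 192.168.1.15 tnight... dont tell n e one",
    "send it 2 dropbox acct forensics.test@example.com asap!!",
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

for msg in messages:
    # Extract structured indicators from otherwise noisy, informal text.
    print(EMAIL.findall(msg), IPV4.findall(msg))
# [] ['192.168.1.15']
# ['forensics.test@example.com'] []
```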


6. Discussion

This chapter will discuss the processes of the study as well as the ethical considerations and how the study can be used for further research.

6.1 The article evaluation process

The initial filtering of articles followed the steps illustrated in figure 2 until an unambiguous decision could be made. Certain articles were determined to be irrelevant to the study very quickly upon reading, whereas others required a more thorough examination. This process treated all material in an unbiased way and included or excluded articles objectively based on the inclusion/exclusion criteria.

The initial searches netted a total of 496 articles, of which 112 were duplicates. The article evaluation process then filtered out an additional 355 articles that were estimated to be irrelevant to the goal of the study. Some of these were completely irrelevant and others were relevant to the subject of digital forensics but did not deal directly with the issue of data volume. The best effort possible has been put in to try and make sure no relevant articles were eliminated during this process. Even texts that may seem irrelevant at first glance can have some interesting input on the matter, which is why careful consideration has been applied to not discard material too quickly and miss out on important information.

6.2 The analysis process

As mentioned in chapter 4, thematic coding was used to analyse the material. All included articles were analysed in their entirety twice. The first pass was used to identify themes, and the second pass then served to spot more instances of the established themes that may have been overlooked on the first iteration. No external tool was used for this process; instead the built-in tool in Adobe Acrobat Reader was used to colour code the relevant passages according to the corresponding themes.

The themes were established as the analysis progressed and were not predetermined in any way. As explained in chapter 5, the original themes were changed slightly because of overlap and irrelevance. The one instance of the theme “carving” was removed since only one passage was identified and it was only vaguely related to big data, making it not useful to the overall goal.

Overlap between themes can cause some uncertainty in how the results are interpreted. Clustering and data mining were combined because clustering is a form of data mining, but there are other instances of overlap as well. Machine learning is mentioned quite often in the material but is only marked as a theme when it is the direct solution to data quantity. Data mining for example utilizes machine learning techniques but is not always identified as a machine learning theme. To put it simply, data mining uses machine learning techniques, but not all machine learning is used for data mining.

While a big effort has been put into separating the themes as distinctly as possible, overlap and not specific enough definitions of themes could threaten the validity of the results. One could also argue that the search terms containing data mining will inflate the number of articles containing data mining as a solution.


6.3 Ethical considerations

There are a few points to acknowledge when it comes to the ethical aspects of the study. When using citations, careful consideration has been taken to make sure that they are not taken out of context. Citations need to keep their original meaning and not be cut short to change the subject matter by excluding content before or after the citation.

There is an argument that the information from this study could be used for anti-forensic research, which hinders forensic investigations. Anti-forensics is however always going to be an obstacle to digital forensics. Developers of anti-forensic techniques are also already aware of many of the methods used by investigators, and it is the author's belief that the benefit from the research outweighs this possible negative impact.

6.4 Societal impact

The difference between finding a key piece of evidence and it being omitted due to insufficient resources can have a huge impact on the individual. While the study discusses ways to improve efficiency when dealing with big data, translated to the real world this can have great ramifications. Even small improvements to the current technology in digital forensics can have huge effects and make the difference between a dangerous individual going free or being kept away from society.

This study can potentially aid researchers in the development of digital forensic tools which in turn can assist digital forensic experts by letting them spend less time in the early stages of the digital forensic process and put more focus on the analysis stage.

6.5 Future research

There is potential for future research on this subject considering that IT is constantly advancing. Since machine learning is one of the most effective ways to deal with the issue, there are research opportunities regarding the algorithms used and further studies on their performance when applied to different branches of digital forensics.

Research could also be done to compare the value of the information extracted with these different methods that go through large amounts of data, to see how much is lost in the process. By using intelligent approaches there is inevitably going to be some loss of information compared to analysing datasets in their entirety. Analysis of a full forensic image and of a subset using an intelligence-based technique, followed by an evaluation of the amount of relevant information discovered, would be an area for further research. When dealing with large volumes of data, time is always going to be a consideration, so being aware of the trade-off between time consumed and relevant information gained could be highly useful to investigators.

Ur Rehman et al. (2016) mentioned in their survey that big data reduction methods are an emerging research area that needs attention from researchers. The results from this study show that data reduction is very significant when dealing with big data within digital forensics.


7. Conclusion

As data volumes continue to increase, the ways to deal with them must also adapt and improve. Many concepts from the past are still useful today, but their implementation and utilization differ greatly. By analysing existing research, a summary of current methods could be compiled to answer the research question of the study:

What current methods exist that can combat big volumes of data in digital forensics?

From the analysis the following methods were identified:

• Artificial Intelligence

• Data mining

• Data reduction

• Triage

• Visualization

• Pattern recognition

When comparing to the past, although the basis for much of AI technology remains the same, the improvement in processing speeds gives the technology the means to show its full potential.

Data mining continues to be a useful way to deal with large datasets. The difference from the data mining options specified by Beebe & Clark (2005) and Palmer (2001) is that those were aimed at dealing with datasets in the terabyte range, whereas today the techniques have to be able to handle much larger volumes, which is also true for the other techniques.

Data reduction is a key method to deal with big data complexity and the variants gathered from this study have shown great reduction in big data volume within different branches of digital forensics. Reducing the amount of data needed greatly speeds up the analysis process.

Triage, visualization and pattern recognition have their applications in digital forensics but are not as common as the other three. The analysis shows that AI, data mining and data reduction are the most effective and can be seen as the main methods of dealing with big data. Triage, visualization and pattern recognition can in certain situations be a direct solution to big data but work best in conjunction with other methods, for example utilizing visualization in combination with data mining to display the mined data as suggested by Chavhan & Nirkhi (2012). What this study also shows is that no current method can single-handedly deal with all the different elements of big data.

The study gives a further understanding of how big data techniques have evolved in digital forensics and which areas are the most resource efficient in today's environments. While there is no single solution that deals with all problems, this survey should give insight into which methods are the most effective for a given area. This knowledge can aid in the development of new techniques and tools, which in turn can assist digital forensics experts in their work.


References

Alpaydin, E. (2010). Introduction to machine learning (2nd ed). MIT Press.

Abraham, T. (2006). Event sequence mining to develop profiles for computer forensic investigation purposes. ACSW Frontiers ‘06: Proceedings of the 2006 Australasian workshops on Grid computing and e-research: 145–153.

Beebe, N., & Clark, J. (2005). Dealing with terabyte data sets in digital investigations. In M. Pollitt & S. Shenoi (Eds.), Advances in Digital Forensics: 3–16.

Beebe, N. (2009). Digital Forensic Research: The Good, the Bad and the Unaddressed. In G. Peterson & S. Shenoi (Eds.), Advances in Digital Forensics V (Vol. 306, pp. 17–36). Springer Berlin Heidelberg.

Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101.

Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M., & Khalil, M. (2007). Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, 80(4), 571–583.

Buchanan, B. G. (2005). A (very) brief history of artificial intelligence. AI Magazine, 26(4), 53–53.

Caviglione, L., Wendzel, S., & Mazurczyk, W. (2017). The Future of Digital Forensics: Challenges and the Road Ahead. IEEE Security & Privacy, 15(6), 12–17. https://doi.org/10.1109/MSP.2017.4251117

Chavhan, S., & Nirkhi, S. M. (2012). Visualization Techniques for Digital forensics: A Survey. International Journal of Advanced Computer Research, 2(4), 74.

Chen, Z., Liu, W., Yang, Y., Wang, J., Guo, M., Chen, L., Yang, G., & Zhou, J. (2018). Electronic Evidence Service Research in Cloud Computing Environment. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), 332–338.

https://doi.org/10.1109/TrustCom/BigDataSE.2018.00058

Chen, Z., Yang, Y., Chen, L., Wen, L., Wang, J., Yang, G., & Guo, M. (2017). Email Visualization Correlation Analysis Forensics Research. 2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud), 339–343. https://doi.org/10.1109/CSCloud.2017.28

Choo, K.-K. R. (2017). Research Challenges and Opportunities in Big Forensic Data. Proceedings of the


De Mauro, A., Greco, M., & Grimaldi, M. (2015, February). What is big data? A consensual definition and a review of key research topics. In AIP conference proceedings (Vol. 1644, No. 1, pp. 97-104). American Institute of Physics.

Dhanasekaran, S., Sundarrajan, R., Murugan, B. S., Kalaivani, S., & Vasudevan, V. (2019). Enhanced Map Reduce Techniques for Big Data Analytics based on K-Means Clustering. 2019 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), 1–5. https://doi.org/10.1109/INCOS45849.2019.8951368

Feng, J., Yang, L. T., Dai, G., Wang, W., & Zou, D. (2019). A Secure High-Order Lanczos-Based Orthogonal Tensor SVD for Big Data Reduction in Cloud Environment. IEEE Transactions on Big Data, 5(3), 355–367. https://doi.org/10.1109/TBDATA.2018.2803841

Garfinkel, S. L. (2006). Forensic feature extraction and cross-drive analysis. Digital Investigation, 3, 71–81.

Garfinkel, S. L. (2010). Digital forensics research: The next 10 years. Digital Investigation, 7, S64–S73.

Ghazinour, K., Vakharia, D. M., Kannaji, K. C., & Satyakumar, R. (2017). A study on digital forensic tools. 2017 IEEE International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI), 3136–3142. https://doi.org/10.1109/ICPCSI.2017.8392304

Hansen, R. A., Seigfried-Spellar, K., Lee, S., Chowdhury, S., Abraham, N., Springer, J., Yang, B., & Rogers, M. (2018). File Toolkit for Selective Analysis & Reconstruction (FileTSAR) for Large-Scale Networks. 2018 IEEE International Conference on Big Data (Big Data), 3059–3065. https://doi.org/10.1109/BigData.2018.8621914

Higgins, J. P. T., & Green, S. (Eds.). (2008). Cochrane handbook for systematic reviews of interventions: Cochrane book series. Hoboken, NJ: Wiley-Blackwell.

Hoelz, B. W. P., Ralha, C. G., & Geeverghese, R. (2009). Artificial intelligence applied to computer forensics. Proceedings of the 2009 ACM Symposium on Applied Computing - SAC ’09, 883.

Jackson, P. C. (2019). Introduction to artificial intelligence (Third edition). Dover Publications, Inc.

Jeong, D., & Lee, S. (2019). High-Speed Searching Target Data Traces Based on Statistical Sampling for Digital Forensics. IEEE Access, 7, 172264–172276. https://doi.org/10.1109/ACCESS.2019.2956681

Jesson, J., Matheson, L., & Lacey, F. M. (2011). Doing your literature review: Traditional and systematic techniques. Los Angeles, CA: Sage Publications.

Kao, D.-Y., Lu, F.-Y., & Tsai, F.-C. (2020). Tool Mark Identification of Skype Traffic. 2020 22nd International Conference on Advanced Communication Technology (ICACT), 361–366. https://doi.org/10.23919/ICACT48636.2020.9061405


Kapil, S., Chawla, M., & Ansari, M. D. (2016). On K-means data clustering algorithm with genetic algorithm. 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), 202–206.

Karie, N. M., Kebande, V. R., & Venter, H. S. (2019). Diverging deep learning cognitive computing techniques into cyber forensics. Forensic Science International: Synergy, 1, 61–67. https://doi.org/10.1016/j.fsisyn.2019.03.006

Kenneally, E. E., & Brown, C. L. T. (2005). Risk sensitive digital evidence collection. Digital Investigation, 2(2), 101–119.

Kishore, N., Saxena, S., & Raina, P. (2017). Big data as a challenge and opportunity in digital forensic investigation. 2017 2nd International Conference on Telecommunication and Networks (TEL-NET), 1–5. https://doi.org/10.1109/TEL-NET.2017.8343573

Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University, 33(2004), 1-26.

Kryder, M. H., & Chang Soo Kim. (2009). After Hard Drives—What Comes Next? IEEE Transactions on Magnetics, 45(10), 3406–3413.

Kumar, N., Keserwani, P. K., & Samaddar, S. G. (2017). A Comparative Study of Machine Learning Methods for Generation of Digital Forensic Validated Data. 2017 Ninth International Conference on Advanced Computing (ICoAC), 15–20. https://doi.org/10.1109/ICoAC.2017.8441495

Kuziemsky, C., & Lau, F. Y. Y. (2017). Handbook of EHealth Evaluation: An Evidence-based Approach. University of Victoria.

Lee, J., Un, S., & Hong, D. (2008). High-speed search using Tarari content processor in digital forensics. Digital Investigation, 5, S91–S95.

Liberati, A., Altman, D. G., Tetzlaff, J., Mulrow, C., Gøtzsche, P. C., Ioannidis, J. P. A., … Moher, D. (2009). The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration. Annals of Internal Medicine, 151(4), W-65.

Liebler, L., Schmitt, P., Baier, H., & Breitinger, F. (2019). On efficiency of artifact lookup strategies in digital forensics. Digital Investigation, 28, S116–S125. https://doi.org/10.1016/j.diin.2019.01.020

Lin, X., & Li, C.-T. (2016). Large-Scale Image Clustering Based on Camera Fingerprints. IEEE Transactions on Information Forensics and Security, 1–1. https://doi.org/10.1109/TIFS.2016.2636086

Marziale, L., Richard, G. G., & Roussev, V. (2007). Massive threading: Using GPUs to increase the


Mayer, F., & Steinebach, M. (2017). Forensic Image Inspection Assisted by Deep Learning. Proceedings of the 12th International Conference on Availability, Reliability and Security, 1–9. https://doi.org/10.1145/3098954.3104051

McKemmish, R. (1999). What is forensic computing? Australian Institute of Criminology.

Mokhtar, S. H., Muruti, G., Ibrahim, Z.-A., Rahim, F. A., & Kasim, H. (2018). A Review of Evidence Extraction Techniques in Big Data Environment. 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), 1–7. https://doi.org/10.1109/ICSCEE.2018.8538437

Mouhssine, E., & Khalid, C. (2018). Social Big Data Mining Framework for Extremist Content Detection in Social Networks. 2018 International Symposium on Advanced Electrical and Communication Technologies (ISAECT), 1–5. https://doi.org/10.1109/ISAECT.2018.8618726

Okolica, J. S., Peterson, G. L., Mills, R. F., & Grimaila, M. R. (2020). Sequence Pattern Mining with Variables. IEEE Transactions on Knowledge and Data Engineering, 32(1), 177–187. https://doi.org/10.1109/TKDE.2018.2881675

Palmer, G. (2001). A road map for digital forensic research. Report From the First Digital Forensic Research Workshop (DFRWS)

Parsonage, H. (2009). Computer forensics case assessment and triage—Some ideas for discussion. http://computerforensics.parsonage.co.uk/triage/triage.htm

Patterson, J., & Hargreaves, C. (2012). The Potential for Cross-Drive Analysis Using Automated Digital Forensics Timelines. 17.

Quick, D., & Choo, K.-K. R. (2017). Pervasive social networking forensics: Intelligence and evidence from mobile device extracts. Journal of Network and Computer Applications, 86, 24–33. https://doi.org/10.1016/j.jnca.2016.11.018

Quick, D., & Choo, K.-K. R. (2018a). IoT Device Forensics and Data Reduction. IEEE Access, 6, 47566–47574. https://doi.org/10.1109/ACCESS.2018.2867466

Quick, D., & Choo, K.-K. R. (2018b). Digital forensic intelligence: Data subsets and Open Source Intelligence (DFINT + OSINT): A timely and cohesive mix. Future Generation Computer Systems, 78, 558–567. https://doi.org/10.1016/j.future.2016.12.032

Raghavan, S., Clark, A., & Mohay, G. (2009). FIA: An Open Forensic Integration Architecture for Composing Digital Evidence. In M. Sorell (Ed.), Forensics in Telecommunications, Information and Multimedia (Vol. 8, pp. 83–94). Springer Berlin Heidelberg.

Raghavan, S. (2013). Digital forensic research: Current state of the art. CSI Transactions on ICT, 1(1), 91–114.

Reyes, A. (2007). Cyber crime investigations: Bridging the gaps between security professionals, law enforcement, and


Roussev, V., Quates, C., & Martell, R. (2013). Real-time digital forensics and triage. Digital Investigation, 10(2), 158–167. https://doi.org/10.1016/j.diin.2013.02.001

Rosenthal, D. (2016). The Medium-Term Prospects for Long-Term Storage Systems [Blog]. Retrieved from https://blog.dshr.org/2016/12/the-medium-term-prospects-for-long-term.html

Rosenthal, D., Rosenthal, D., Miller, E., Adams, I., Storer, M., & Zadok, E. (2012). The Economics of Long-Term Digital Storage. The Memory Of The World In The Digital Age: Digitization And Preservation.

Sanchez, L., Grajeda, C., Baggili, I., & Hall, C. (2019). A Practitioner Survey Exploring the Value of Forensic Tools, AI, Filtering, & Safer Presentation for Investigating Child Sexual Abuse Material (CSAM). Digital Investigation, 29, S124–S142. https://doi.org/10.1016/j.diin.2019.04.005

Shalaginov, A., Johnsen, J. W., & Franke, K. (2017). Cyber crime investigations in the era of big data. 2017 IEEE International Conference on Big Data (Big Data), 3672–3676. https://doi.org/10.1109/BigData.2017.8258362

Shalaginov, A., Kotsiuba, I., & Iqbal, A. (2019). Cybercrime Investigations in the Era of Smart Applications: Way Forward Through Big Data. 2019 IEEE International Conference on Big Data (Big Data), 4309–4314. https://doi.org/10.1109/BigData47090.2019.9006596

Shannon, M. M. (2004). Forensic Relative Strength Scoring: ASCII and Entropy Scoring. 2(4), 19.

Sheldon, A. (2005). The future of forensic computing. Digital Investigation, 2(1), 31–35.

Soltani, S., & Seno, S. A. H. (2017). A survey on digital evidence collection and analysis. 2017 7th International Conference on Computer and Knowledge Engineering (ICCKE), 247–253. https://doi.org/10.1109/ICCKE.2017.8167885

Tan, P.-N., Steinbach, M., & Kumar, V. (2014). Introduction to data mining (First edition, Pearson new international edition). Pearson.

Toraskar, T., Bhangale, U., Patil, S., & More, N. (2019). Efficient Computer Forensic Analysis Using Machine Learning Approaches. 2019 IEEE Bombay Section Signature Conference (IBSSC), 1–5. https://doi.org/10.1109/IBSSC47189.2019.8973099

Ur Rehman, M. H., Liew, C. S., Abbas, A., Jayaraman, P. P., Wah, T. Y., & Khan, S. U. (2016). Big Data Reduction Methods: A Survey. Data Science and Engineering, 1(4), 265–284. https://doi.org/10.1007/s41019-016-0022-0

Walter, C. (2005). Kryder's Law. Scientific American. Retrieved from http://www.scientificamerican.com/article/kryders-law/


Xiong, A., Huang, Y., Wu, Y., Zhang, J., & Long, L. (2018). An Adaptive Sliding Window Algorithm for Mining Frequent Itemsets in Computer Forensics. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), 1660–1663. https://doi.org/10.1109/TrustCom/BigDataSE.2018.00246

Xylogiannopoulos, K., Karampelas, P., & Alhajj, R. (2017). Text Mining in Unclean, Noisy or Scrambled Datasets for Digital Forensics Analytics. 2017 European Intelligence and Security Informatics Conference (EISIC), 76–83. https://doi.org/10.1109/EISIC.2017.19

Yuan, J., & Tian, Y. (2019). Practical Privacy-Preserving MapReduce Based K-Means Clustering Over Large-Scale Dataset. IEEE Transactions on Cloud Computing, 7(2), 568–579. https://doi.org/10.1109/TCC.2017.2656895

Zawoad, S., & Hasan, R. (2015). Digital Forensics in the Age of Big Data: Challenges, Approaches, and Opportunities. 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, 1320–1325. https://doi.org/10.1109/HPCC-CSS-ICESS.2015.305

Zhou, X., Jin, Y., Zhang, H., Li, S., & Huang, X. (2016). A Map of Threats to Validity of Systematic Literature Reviews in Software Engineering. 2016 23rd Asia-Pacific Software Engineering Conference (APSEC), 153–160. https://doi.org/10.1109/APSEC.2016.031
