A Study on Fingerprinting of Locally Assigned MAC-Addresses


Bachelor Thesis

Computer Science and Engineering, 300 credits


Bachelor Thesis in Computer Science and Engineering, 15 credits


Acknowledgements

We would like to begin this Bachelor Thesis by expressing our sincerest gratitude to our supervisor Naveed Muhammad for his help with this project: for all his ideas on how to solve present and upcoming problems, for explaining several technologies needed to make this Thesis possible, and for answering all our obvious questions with a smile on his face.

We would also like to give a warm thank you to everybody at EffectSoft AB in Halmstad for making us feel so welcome from the very start. Our stay at the company was extremely fulfilling, with everything from receiving help with our Thesis and listening to their Thursday project lectures to eating pizza and watching cartoons.

Best Regards,

Karl-Johan Djervbrant & Andreas Häggström Halmstad 2019


Abstract

The number of WiFi devices increases every day, and most people who travel carry a device with an enabled WiFi network card. EffectSoft AB in Halmstad utilizes this in its service Flow to track and count devices. The device count was, however, not accurate enough for commercial use, and this Bachelor Thesis continues the research on how to improve that accuracy.

The Thesis introduces the fundamental problem that one cannot count present devices by directly counting transmitted MAC-Addresses, since manufacturers implement countermeasures such as MAC-Address randomization. It covers how manufacturers are inconsistent in their implementation of the IEEE 802.11 standard and how this can be utilized, with three different approaches, to estimate how many devices are present in the network. It also concludes that Control Frame Attacks are no longer a viable approach to counting devices and that the best method is a combination of Passive Probe Request Analysis techniques.

Summary (Sammanfattning)

The number of devices communicating over WiFi increases daily, and today most people carry a device with an enabled WiFi network card. EffectSoft AB, a company in Halmstad, uses this in its technology Flow to count mobile devices.

The accuracy of the count is, however, not good enough for the product to be viable on the market, and this Bachelor Thesis therefore concerns the counting of mobile devices. This report presents the problems encountered when counting mobile devices, such as the randomization of MAC-Addresses. It also covers how manufacturers are inconsistent in their implementation of the IEEE 802.11 standard and how this can be exploited through three methods for counting mobile devices. It is established that the Control Frame Attack is no longer a viable method for this purpose and that the best method for counting mobile devices is a combination of different passive Probe Request analyses.

Keywords: MAC-Address Randomization, Probe Request, Passive Probe Request Analysis, Control Frame Attacks, Computer Science, Network, IEEE 802.11, WiFi, Tracking, Counting.


1 Introduction

This Bachelor Thesis is written in cooperation with EffectSoft[1] in Halmstad, which has developed a service named Flow[2] to count and track mobile devices in a wireless network. EffectSoft initiated Flow after being contacted by a municipality that was interested in a technique for monitoring how its citizens move through town and for gathering data on which hours have the most traffic. Similar techniques exist based on camera detection, where a camera placed in the ceiling or on high ground counts every pedestrian passing by.

Due to privacy concerns, however, camera detection was not an option, and the fact that Flow cannot identify humans was viewed as a benefit. Another positive property of Flow is its ability to deliver WiFi to the inhabitants, which is a great selling point. EffectSoft has, however, expressed that Flow never became accurate enough in counting devices for commercial use. The research is therefore oriented towards making the product more accurate.

Figure 1: This is an example sketch of the system.

1.1 How Flow Works

In its current deployment, Flow collects MAC-addresses and RSSI-values[3] via Access Points from all devices with a WiFi network card enabled. This data is collected even if a device is not connected to the network. Flow works with one or more Access Points, depending on the area of coverage.

Localization via triangulation[4] does not work if only one Access Point is used, however; in that case only device counting is available. To triangulate devices with Flow one must first create a map over the signal strength (RSSI). This is done by walking around with a device and recording GPS and RSSI data from all Access Points. When the system later is in use the GPS and RSSI


1.2 Goals

The goal of this Thesis is to ascertain whether counting mobile devices through fingerprinting is possible even if these devices have randomized MAC-Addresses. The requirements of the project are specified below.

1. Counting of devices has to depend solely on network packets.

2. The result has to have an average accuracy of at least 80%. This level was chosen because it was reasonable to accomplish within the time frame of this Thesis.

3. The percentage of miscalculation should be consistent, i.e. the result should be precise.

4. Software developed should be able to run or be easily altered to run with Live Capturing of packets.

1.3 Restrictions

This Thesis is subject to the following restrictions.

1. Test opportunities are limited, since a lifelike test requires a significant number of devices while the correct number of devices must also be known.

2. Since this project is performed as a Bachelor Thesis, time is limited, which affects both the quality and the quantity of the software development.

3. Access to the actual Flow system is currently not possible. The effects of implementing the results in Flow will therefore be solely speculative.

1.4 Contributions

This Thesis has made the following contributions to the field of network analysis.

• An Algorithm for calculating the number of devices using data passively extracted from Probe Requests.

• Information on how Probe Request data can be used to calculate devices through SSID and IFAT fingerprinting.

• Information on what future work could be done within the field and what route would be the best to choose.

• Extraction of manufacturer data useful for market analysis.

• A simple GUI for a better presentation of the device calculation.

• A minor field study on how vulnerable mobile devices are to a Control Frame Attack.


1.5 Structure of this Thesis

This Thesis begins with an explanation of the background and theory needed for the reader to understand the methods, presented in Section 2. It explains how MAC-Addresses are structured, how they are randomized and how the randomization varies between manufacturers. Different network packets such as Request To Send, Clear To Send and Probe Request are explained together with some of their sub fields. Information on how packet comparison can be done is presented, for instance how one can use the network names a device transmits or the time delta between transmitted Probe Request packets to count devices. How this time delta can be used within clustering algorithms is explained thereafter.

The methods used in this Thesis are presented and described thoroughly in Section 3. The results of the various methods are then presented in Section 4, followed by a Discussion section where the results are interpreted and discussed. After the discussion there is a short conclusion of the Thesis where the results are compared to the goals and restrictions set for the Thesis.


2 Background and Theory

This section explains the knowledge needed to understand the report. This includes how MAC-Addresses are designed, how one can see whether a MAC-Address is randomized and how MAC-Address Randomization works. For the Control Frame Attack it is explained how the Request To Send and Clear To Send packets are constructed. The reader will gain knowledge of the Probe Request frame and its parts of interest for this Thesis. How packets can be compared is also explained to further enhance the reader’s understanding of the field. For the time analysis of packets, Inter-Frame Arrival Time and how Unsupervised Machine Learning can help with this analysis are also presented.

2.1 MAC-Addresses

A MAC-Address (Media Access Control address) is an OSI[6] layer-2 identifier consisting of 48 bits. It is used in the Medium Access Control sublayer[7]. Every network interface following the IEEE 802.11[8] protocol has a MAC-Address that is designed to be persistent and globally unique. The address is assigned directly to the network adapter or network card, so it is not possible to alter a device’s global MAC-Address through device settings. To guarantee that addresses are unique, organizations obtain assigned blocks of addresses from the IEEE. These blocks are named MAC-Address Block Large (MA-L) or, more commonly, Organizationally Unique Identifiers (OUI)[9]. By acquiring a block from the IEEE, a manufacturer controls all MAC-Addresses with its assigned three-byte prefix. Even if the three low-order bytes (Network Interface Controller, NIC) are the same for two network interfaces from different manufacturers, their three high-order bytes differ. It is the organization’s responsibility to make sure that no two of its network interfaces are assigned identical low-order bytes under the same prefix.

Following the usage of OUIs it’s trivial to determine the manufacturer of a device from its MAC-Address by using tools such as Wireshark[10].

Figure 2: Representation of 48-bit MAC-Address.


As seen in Figure 2, the second least significant bit in the most significant byte shows whether the network interface is using a local MAC-Address. A set bit does not always mean a self-assigned address, though, since organizations can acquire more secure OUIs that do not show the organization’s name in the register. These OUIs always have the local bit set. Local MAC-Addresses are used in several ways, one being multi Service Set IDentifier (SSID) access points[11]. The Unicast/Multicast bit shows whether the network interface is communicating through a one-to-one or a one-to-many relation, that is, whether the interface is sending information to one or many devices.
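The two flag bits and the OUI described above can be read directly from the first bytes of an address. The following is a minimal Python sketch; the function names and the sample addresses are illustrative assumptions and not part of Flow.

```python
def parse_mac(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or '-' separated) into 6 raw bytes."""
    octets = bytes.fromhex(mac.replace("-", ":").replace(":", ""))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return octets


def is_locally_administered(mac: str) -> bool:
    # Second least significant bit of the most significant byte.
    return bool(parse_mac(mac)[0] & 0b00000010)


def is_multicast(mac: str) -> bool:
    # Least significant bit of the most significant byte.
    return bool(parse_mac(mac)[0] & 0b00000001)


def oui(mac: str) -> str:
    # Three high-order bytes, usable for a lookup in the IEEE register.
    return parse_mac(mac)[:3].hex(":").upper()


# Hypothetical sample addresses.
print(is_locally_administered("da:a1:19:00:00:01"))  # local bit set
print(is_locally_administered("00:1a:2b:3c:4d:5e"))  # local bit clear
print(oui("00:1a:2b:3c:4d:5e"))
```

An address for which `is_locally_administered` returns true is a candidate for a randomized address and cannot be counted as a unique device on its own.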

2.2 MAC-Address Randomization

When a mobile device is within the area of an Access Point (AP) it sends Probe Request frames with information needed for a potential connection. Since the AP has to send a Probe Response, the request must carry a source MAC-Address so the device can receive the response[11]. Without MAC-Address randomization, each device would therefore reveal its true global MAC-Address to every network within range. This would make it trivial for a network technician to map individual mobile devices and, since the global MAC-Address is unique, to track a person without their consent. MAC-Address randomization enables the device to keep its identity secret before actually connecting to a network. The implementation of MAC-Address randomization has not yet been standardized and therefore varies between organizations and manufacturers. The algorithm used depends on the device model and the firmware the device is currently running[11].
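To make the idea concrete, a randomized address can be thought of as 48 random bits with the local bit set and the multicast bit cleared. The sketch below is a generic illustration under that assumption; it does not reproduce the algorithm of any particular manufacturer, which as noted above varies and is often undocumented.

```python
import secrets


def random_local_mac() -> str:
    """Generate a random unicast, locally administered MAC-Address."""
    octets = bytearray(secrets.token_bytes(6))
    octets[0] |= 0b00000010  # set the local bit
    octets[0] &= 0b11111110  # clear the multicast bit (unicast)
    return octets.hex(":")


print(random_local_mac())
```

Two consecutive calls give unrelated addresses, which is why counting distinct source addresses overestimates the number of devices.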

2.2.1 MAC-Address Randomization for iOS

MAC-Address randomization for iOS devices was first introduced with the release of iOS 8[12]. The randomized MAC-Address is used when the device transmits Probe Request frames while not associated with a WiFi network[13]. The device also uses MAC randomization when it performs enhanced Preferred Network Offload scans, which is done when the device is not associated with a network or when its processor is in sleep mode. Apple also states that neither they nor any other manufacturer can predict these randomized addresses[13]. iPhone 6S and later models are also able to know whether a known WiFi network is hidden. This allows the device to send Probe Requests without the Service Set Identifier (SSID) of the WiFi network within the packet. Any direct information on how the MAC-Address randomization is performed is unfortunately inaccessible, since Apple’s implementation is closed source.


2.2.2 MAC-Address Randomization for Android

Android OS is open source software, and devices using Android OS have different implementations of the operating system. The randomization of MAC-Addresses is one of the things each manufacturer has to implement itself. This is done by following the steps stated below[14].

• Work with a WiFi-chip vendor to implement the IWifiStaIface.setMacAddress() method.

• Set config_wifi_support_connected_mac_randomization to true in the Settings config.xml

After implementation the developer follows Android’s guide for validating the randomization functionality. The MAC-address of an Android device is then also used for Android’s WiFi-Aware and WiFi-Round Trip Time operations.

• WiFi-Aware

Also known as Neighbor Awareness Networking (NAN), WiFi-Aware allows Android devices running Android 8.0 or later to discover and connect to each other without any other type of connectivity between them[15]. It works by connecting to or forming clusters of devices. The clustering behaviour is managed by the WiFi-Aware system service; applications have no control over it but use the WiFi-Aware APIs to communicate with the system service, which manages the device’s WiFi-Aware hardware.

• WiFi-Round Trip Time

The WiFi-Round Trip Time (RTT) API allows applications to measure the distance to nearby RTT-capable access points and Wi-Fi Aware devices[16]. Measuring the distance to three or more access points allows the position of the device to be estimated with a multilateration algorithm. The result of the algorithm is generally accurate within 1-2 meters.

Since the implementation of MAC-Address randomization for Android devices depends on the manufacturer, how it is implemented in the different devices will not be presented. The interesting part is the fact that it varies between manufacturers.

2.3 Request To Send & Clear To Send

As a part of the IEEE 802.11 protocol’s Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) mechanism, a node may send out a Request To Send (RTS) to nearby nodes when it wants to transmit data. If there is no conflicting usage of the network, the other nodes will wait for the transmission to be completed. This is done to avoid packet collision and miscommunication. After sending an RTS the node waits for a Clear To Send (CTS) before transmitting any further data[17].


Figure 3: Representation of the RTS frame.

As shown in Figure 3, the RTS frame format consists of five field values[18].

• Frame Control provides control information and the packet’s frame type. The control information in turn indicates, for instance, whether the frame is to or from a Distribution System.

• Duration is the time required to transmit the pending Data or Management frame, CTS frame and Ack frame plus three Short Interframe Spaces (SIFS) of 10 µs each.

• RA (Receiving Address) is the MAC-Address value of the station intended to receive the pending individually addressed Data, Management or Control Frame.

• TA (Transmitting Address) is the MAC-Address value of the station transmitting the RTS frame.

• FCS is the Frame Check Sequence calculated from the data in the frame. Its sole purpose is error detection.

The CTS frame is similar to the RTS frame, as shown in Figure 4, with the main difference being that it does not include a TA field. The RA value is set to the address in the TA field of the corresponding RTS frame, with the Unicast/Multicast bit forced to 0. The RA field is set to the MAC address of the transmitter if the CTS is the first frame in a frame exchange[18].

Figure 4: Representation of the CTS frame.
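The byte layout of the two control frames can be illustrated by assembling them by hand. This is a sketch of the field order only, assuming the usual 802.11 encoding (little-endian Duration, CRC-32 as FCS, frame control bytes 0xB4 for RTS and 0xC4 for CTS); the addresses and the duration value are hypothetical, and the code does not transmit anything.

```python
import struct
import zlib

RTS_FRAME_CONTROL = b"\xb4\x00"  # version 0, type Control, subtype RTS, no flags
CTS_FRAME_CONTROL = b"\xc4\x00"  # version 0, type Control, subtype CTS, no flags


def mac_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def with_fcs(frame: bytes) -> bytes:
    # The FCS is a CRC-32 over all preceding fields, appended little-endian.
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def build_rts(ra: str, ta: str, duration_us: int) -> bytes:
    """Frame Control | Duration | RA | TA | FCS (20 bytes)."""
    header = RTS_FRAME_CONTROL + struct.pack("<H", duration_us)
    return with_fcs(header + mac_bytes(ra) + mac_bytes(ta))


def build_cts(ra: str, duration_us: int) -> bytes:
    """Frame Control | Duration | RA | FCS (14 bytes, no TA field)."""
    header = CTS_FRAME_CONTROL + struct.pack("<H", duration_us)
    return with_fcs(header + mac_bytes(ra))


rts = build_rts("aa:bb:cc:dd:ee:ff", "02:11:22:33:44:55", 248)
cts = build_cts("02:11:22:33:44:55", 204)
print(len(rts), len(cts))
```

Note how the RA of the CTS is the TA of the RTS it answers, which is what makes the responding station visible to the sender.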


2.4 Probe Request

When a mobile device wants to determine what networks are present it transmits Probe Requests to retrieve information about nearby access points. The format of the Probe Request is shown below in Figure 5. All fields of the Probe Request are mandatory, while some vary in length[19].

Figure 5: Representation of the Probe Request Frame.

The Frame Body contains two fields, the Service Set Identifier (SSID) and the transmission rates supported by the mobile device. When an access point receives a Probe Request it uses the information of the Frame Body to determine whether the mobile device is capable of joining the network. For a connection to be established, the mobile device has to support all the rates required by the network. The SSID field also has to be set to show that the mobile device wants to connect to the network. This can be done by having either the specific SSID of the network or a Broadcast SSID in the field. If the mobile device sends a Probe Request with the SSID set to Broadcast, it asks any network present to respond.

2.4.1 Service Set Identifier

The Service Set Identifier (SSID), also referred to as the network name, allows network managers to assign a plain text identifier to a network[20]. This identifier is what a user would see when choosing what network to connect to. The length of the SSID field varies between 0 and 32 octets. The special case where the SSID length is 0 is referred to as the Broadcast SSID and is used when a device wants to discover any nearby network.
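In the frame body, each field is carried as a tagged element: one octet of element ID, one octet of length and then the payload, with the SSID using element ID 0. The sketch below walks such a body and extracts the probed SSID. The sample bytes and the network name "HomeNet" are hypothetical.

```python
SSID_ELEMENT_ID = 0


def parse_elements(body: bytes):
    """Yield (element_id, payload) for each tagged element in a frame body."""
    offset = 0
    while offset + 2 <= len(body):
        element_id, length = body[offset], body[offset + 1]
        payload = body[offset + 2 : offset + 2 + length]
        if len(payload) < length:
            break  # truncated element, stop parsing
        yield element_id, payload
        offset += 2 + length


def probed_ssid(body: bytes):
    """Return the SSID as text, '' for the Broadcast SSID, None if absent."""
    for element_id, payload in parse_elements(body):
        if element_id == SSID_ELEMENT_ID:
            return payload.decode("utf-8", errors="replace")
    return None


directed = bytes([0, 7]) + b"HomeNet" + bytes([1, 2, 0x82, 0x84])
broadcast = bytes([0, 0]) + bytes([1, 2, 0x82, 0x84])
print(repr(probed_ssid(directed)), repr(probed_ssid(broadcast)))
```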

2.4.2 Supported Rates

The Supported Rates element allows an 802.11 network to specify what data rates it supports. When a mobile device tries to connect to a network it checks the network’s supported rates, some of which are mandatory and some optional to support[21]. The different standardized supported rates of wireless LANs can be seen in Appendix A, Table 4.

The field consists of a string of bytes where each byte uses the seven low-order bits for the data rate. The most significant bit is used to indicate whether the rate is mandatory, 1 for mandatory rates and 0 for optional.
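The encoding can be illustrated with a short decoder. The sketch assumes the standard 802.11 convention that the seven low-order bits express the rate in units of 500 kbit/s; the sample bytes are a hypothetical rate set.

```python
def decode_supported_rates(payload: bytes):
    """Return a list of (rate in Mbit/s, mandatory flag) tuples."""
    rates = []
    for octet in payload:
        mandatory = bool(octet & 0x80)  # most significant bit
        mbps = (octet & 0x7F) * 0.5     # seven low-order bits, 500 kbit/s units
        rates.append((mbps, mandatory))
    return rates


for rate, mandatory in decode_supported_rates(bytes([0x82, 0x84, 0x8B, 0x96, 0x0C])):
    print(rate, "mandatory" if mandatory else "optional")
```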


2.4.3 High Throughput Capabilities

Among the various data sent in a Probe Request is the High Throughput (HT) Capabilities field. It specifies the device’s capability to communicate using HT radio channels, which are used within the Multiple Input Multiple Output (MIMO) technology. The HT Capabilities field is 28 octets long and contains several different informational fields. The HT Capabilities Info field is 2 octets long and contains information about the various HT technologies supported by the device[22]. The format of the HT Capabilities Info field is shown in Figure 6.

Figure 6: Representation of the HT Capabilities Info Field.

As shown in Figure 6, the HT Capabilities Info Field consists of 14 field values[23].

• LDPC Coding Capability

When transmitting data in an environment with noise and disturbance, the method Low Density Parity Check is used. It gives excellent error correction and performance. If a device is capable of transmitting LDPC coded packets, the LDPC bit is set to 1.

• Supported Channel Width Set

Indicates which channel widths are supported by the device. If the Supported Channel Width Set bit is set to 0, the device only supports 20 MHz channel communication. If the device also supports communication in 40 MHz channels the bit is set to 1.

• SM Power Save

Mobile devices supporting HT include multiple radios that support MIMO features. MIMO allows mobile devices to e.g. transmit multiple streams of data simultaneously. The downside of this feature is that the more radios are active on the device, the higher the power consumption. The SM Power Save field indicates whether the mobile device is capable of the 802.11n SM Power Save method. SM Power Save can be enabled as either Dynamic SM Power Save or Static SM Power Save. The encoding of the selected mode, and of whether it is enabled, can be seen in Appendix A, Table 5. Static SM Power Save Mode is enabled when the device only maintains one active receive radio chain. Dynamic SM Power Save indicates that the device is capable of multiple receive radio chains, which is achieved by alternating between the different radio chains. A radio chain is inactive until the receipt of a frame addressed to it.

• HT-GreenField

The HT-GreenField was introduced in 802.11n and indicates if the device is capable of receiving HT GreenField PLCP Protocol Data Unit (PPDU).

• Short GI for 20 MHz and Short GI for 40 MHz

Digital signals are transmitted as bits or collections of bits, called symbols. 802.11n has an 800 nanosecond Guard Interval (GI), which is the period of time between symbols. The Guard Interval is used to accommodate the late arrival of symbols traveling a long path within a multipath environment. Symbols can travel different routes to optimize performance, but this also leads to some symbols arriving later than others. Without the Guard Interval, or with a short one, symbols of a new packet could arrive before the last symbols of a previous packet, resulting in data corruption through symbol collision.

If the Short GI bit is set for a channel width, the Guard Interval will be set to 400 nanoseconds, increasing the throughput but also the risk of symbol collision. If symbol collision occurs, re-transmissions are required and the data throughput is then decreased.

• TX STBC and RX STBC

Space Time Block Coding is a MIMO technique used for improving [...] being received with better quality than others. The receiving device will therefore be more likely to correctly decode the signal. The encoding for the TX STBC and RX STBC fields can be seen in Appendix B, Table 6.

• HT Delayed Block Ack

Acknowledgements (ACK) are sent for every data or management frame from the receiving device to confirm the packet was transferred correctly. Block Acks allow a device to efficiently send ACKs for several frames at the same time. Implementing HT-Delayed Block Ack simplifies the usage of delayed Block Acks by not requiring an ACK in response to a Block Ack Request (BlockAckReq). Support for HT Delayed Block Ack is indicated by setting the field to 1.

• Maximum A-MSDU Length

A MAC Service Data Unit (MSDU) consists of the OSI layer 3-7 fields of a data frame together with a Logical Link Control (LLC) header. Multiple MSDUs can be aggregated within the same frame transmission, forming an Aggregate MSDU (A-MSDU). The maximum total length of the A-MSDU the device is capable of receiving is indicated by this field. If the Maximum A-MSDU Length field is set to 0 the maximum length is 3839 bytes, and if it is set to 1 the maximum length is 7935 bytes.

• DSSS/CCK Mode in 40 MHz

The DSSS/CCK Mode in 40 MHz field indicates if a device is capable of supporting Direct Sequence [...] the field to indicate if it allows for communication with 22 MHz devices. The encoding of the field can be seen in Appendix B, Table 7.

• 40 MHz Intolerant

In the 2.4 GHz band there is only one non-overlapping 40 MHz channel. Using 40 MHz channels in 2.4 GHz is therefore not suitable for Multi-Channel Architecture (MCA). Within the 2.4 GHz band, 40 MHz channels are only usable for Single Channel Architecture (SCA). Using 40 MHz in an SCA however only works well when transmitting in an environment clear from other 2.4 GHz networks. Having the 40 MHz Intolerant bit set to 1 prohibits the device from transmitting or receiving frames on a 40 MHz channel. A device with the 40 MHz Intolerant bit set is forbidden to communicate with a 20/40 MHz Access Point. If a device is operating on the 5 GHz band, setting the 40 MHz Intolerant bit is forbidden and it is always set to 0, indicating that the device is allowed to communicate over 40 MHz channels.

• L-SIG TXOP Protection Support

The Legacy Signal (L-SIG) TXOP Protection Support field is set to 1 to indicate that the device supports the L-SIG protection mechanism. This is used by a device to prevent nearby devices from transmitting at the same time. This could be done by sending RTS/CTS packets or CTS-to-self, but the device still needs to protect its transmissions from 802.11a/b/g/n devices by reserving the medium using L-SIG TXOP Protection. If the bit is set to 0 it indicates that the device does not support L-SIG Protection.
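The fields above can be unpacked from the two octets of the HT Capabilities Info field with simple shifts and masks. The sketch below assumes the 802.11n bit order (bit 0 is LDPC Coding Capability, bit 15 is L-SIG TXOP Protection Support, bit 13 is reserved) and a little-endian field; the dictionary keys and the sample value are illustrative choices.

```python
import struct

# (name, first bit, width in bits), assuming the 802.11n bit order.
HT_CAP_INFO_FIELDS = [
    ("ldpc_coding_capability", 0, 1),
    ("supported_channel_width_set", 1, 1),
    ("sm_power_save", 2, 2),
    ("ht_greenfield", 4, 1),
    ("short_gi_20mhz", 5, 1),
    ("short_gi_40mhz", 6, 1),
    ("tx_stbc", 7, 1),
    ("rx_stbc", 8, 2),
    ("ht_delayed_block_ack", 10, 1),
    ("max_amsdu_length", 11, 1),
    ("dsss_cck_mode_40mhz", 12, 1),
    ("reserved", 13, 1),
    ("forty_mhz_intolerant", 14, 1),
    ("lsig_txop_protection", 15, 1),
]


def decode_ht_cap_info(raw: bytes) -> dict:
    """Split the 2-octet HT Capabilities Info field into its sub fields."""
    (value,) = struct.unpack("<H", raw[:2])
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, shift, width in HT_CAP_INFO_FIELDS
    }


info = decode_ht_cap_info(b"\x2d\x08")  # hypothetical sample value 0x082D
print(info["short_gi_20mhz"], info["sm_power_save"], info["max_amsdu_length"])
```

Since the resulting dictionary stays the same across the randomized addresses of one device, it can serve as one component of a fingerprint.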

2.4.4 Extended Capabilities

The Extended Capabilities field consists of information about the capabilities of a device such as ex1, ex2 and ex3[24]. It complements the Capability Information field of the packet. The structure of the field is shown in Figure 7.

Figure 7: Representation of the Extended Capabilities Field.

Extended Capabilities is a bit field containing the capabilities advertised by the transmitting device. The length of the field varies and is a variable n. The structure of the Extended Capabilities field is similar to the HT Capabilities Info Field presented in Section 2.4.3. Since the length of the Extended Capabilities field varies, the different interpretations of the bits will not be presented in this report but can be found at[24].
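Because the field length varies, one convenient way to compare two Extended Capabilities fields is to reduce each to the set of bit positions that are set. The sketch below assumes bit 0 is the least significant bit of the first octet; the sample payloads are hypothetical.

```python
def set_bits(payload: bytes) -> frozenset:
    """Return the indices of all bits set to 1 in a variable-length bit field."""
    return frozenset(
        8 * index + bit
        for index, octet in enumerate(payload)
        for bit in range(8)
        if octet & (1 << bit)
    )


short_field = set_bits(b"\x05")
long_field = set_bits(b"\x05\x80")
print(sorted(short_field), sorted(long_field))
```

Fields of different length then become directly comparable, for instance with a set-based measure such as the Jaccard Similarity of Section 2.5.2.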


2.5 Packet Comparison

Several methods exist for comparing the similarity of packets to determine whether two randomized MAC-Addresses were generated by the same device. The fields to be compared are the SSIDs a device transmits Probe Requests to, the OUI, the HT Capabilities and the Extended Capabilities. The chosen comparison method should also be able to compare other fields if needed without much modification to the algorithm. The most relevant methods are listed below with an explanation of how the similarity value is calculated.

2.5.1 Euclidean Distance

The Euclidean Distance is mostly used for geometrical problems and is common in clustering problems, including text clustering. The distance between two vectors d₁ and d₂ is calculated with the formula below[25], where w is the weight assigned to each term of the term set.

$$\mathrm{Distance}(\vec{d_1}, \vec{d_2}) = \left( \sum_{d=1}^{n} \left| w_{d,1} - w_{d,2} \right|^2 \right)^{1/2} \tag{1}$$

It should be noted that the result of this equation is the distance between the two vectors, not their similarity. When using the Euclidean Distance method one should therefore value small result values higher than larger values.
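Equation (1) translates directly into code. In the sketch below the two vectors are hypothetical binary feature vectors, for instance one position per capability bit.

```python
import math


def euclidean_distance(a, b) -> float:
    """Euclidean Distance between two equally long weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


packet_1 = [1, 0, 1, 1]
packet_2 = [1, 1, 0, 1]
print(euclidean_distance(packet_1, packet_2))
print(euclidean_distance(packet_1, packet_1))
```

Identical vectors give a distance of 0, so a threshold on the distance (rather than on a similarity) decides whether two packets are attributed to the same device.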

2.5.2 Jaccard Similarity

Jaccard Similarity is computed by dividing the size of the intersection of two sets, A and B, by the size of their union. This gives a number between 0 and 1 representing the proportion of items the two sets share. The Jaccard Similarity between two sets or vectors is determined as shown below[26].

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{2}$$

(a) Jaccard equation in the form of sets

$$J(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|^2 + |\vec{b}|^2 - \vec{a} \cdot \vec{b}} \tag{3}$$

(b) Jaccard equation in the form of vectors

Figure 8: Mathematical definition of Jaccard Similarity.
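The set form of the Jaccard Similarity fits the comparison of probed SSIDs well, since each randomized MAC-Address can be associated with the set of network names it asked for. The sketch below uses hypothetical network names; treating two empty sets as identical is an assumption of this sketch.

```python
def jaccard_similarity(a, b) -> float:
    """Jaccard Similarity between two collections, compared as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


ssids_mac_1 = {"HomeNet", "CoffeeShop", "Airport"}
ssids_mac_2 = {"HomeNet", "CoffeeShop", "Office"}
print(jaccard_similarity(ssids_mac_1, ssids_mac_2))  # 2 shared of 4 distinct
```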


2.5.3 Cosine Similarity

The Cosine Similarity is determined by comparison between two vectors of data. It uses the mathematical cosine formula to determine the cosine of the angle between the vectors, which for non-negative feature values is a number between 0 and 1 representing the similarity.

$$\cos\theta = \frac{\vec{v_1} \cdot \vec{v_2}}{\|\vec{v_1}\| \times \|\vec{v_2}\|} = \mathrm{Similarity}(\vec{v_1}, \vec{v_2}) \tag{4}$$

Cosine Similarity is one of the most popular algorithms to determine the similarity between documents, the reason being the algorithm’s independence of document length. As explained in[27], given two documents d1 and d2 one can combine them into d12. The calculated similarities will then be related as per below.

$$\mathrm{Similarity}(\vec{d_1}, \vec{d_{12}}) = \mathrm{Similarity}(\vec{d_2}, \vec{d_{12}}) = \mathrm{Similarity}(\vec{d_1}, \vec{d_2}) \tag{5}$$
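Equation (4) can be sketched as follows. The example vectors are hypothetical; the second one is the first scaled by two, which illustrates the length independence mentioned above. Returning 0 for an all-zero vector is an assumption of this sketch, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / norm


print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
```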

2.6 Inter-Frame Arrival Time

When devices send Probe Request packets to scan for present networks they send these packets in bursts. Every Probe Request in a burst has the same MAC-Address. A burst is transmitted within a certain time frame, which varies from device to device. In 2006 Franklin et al.[28] published a paper on how Inter-Frame Arrival Time (IFAT), which is the time between two Probe Request packets (see Figure 10), could be exploited to fingerprint wireless devices.

Figure 10: Representation of a burst of packets and inter-frame arrival time.

They found that device driver manufacturers implemented active scanning (Probe Requests) slightly differently and that these differences could be used to characterize a device. Later, in 2016, Matte et al.[29] showed that this still works and that it could defeat MAC-Address randomization. They did this by improving the algorithm Franklin and his team used to also include the time difference between bursts, and tested different algorithms for calculating the similarity between frames. Franklin’s method of distinguishing devices begins with calculating the IFAT data, which is categorized into different bins to filter out noise and obtain smooth probability estimates. This data is then used to create signatures, where the distance between two signatures specifies how similar they are. Matte and his team use the same binning technique but tested three different algorithms for evaluating the closeness between signatures: one based upon Franklin’s method, one where the time between burst sets is taken into consideration, and one which is a hybrid of the two. The method used in this Thesis utilizes the same binning techniques as Franklin and Matte, but focuses instead on clustering algorithms to evaluate whether signatures originate from the same device.
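The binning step can be sketched as follows. This is a simplified illustration and not the exact signature of Franklin or Matte: the bin width of 5 ms and the 100 ms cut-off used to discard the gaps between bursts are hypothetical parameters, and the timestamps are invented sample values in milliseconds.

```python
from collections import Counter


def inter_frame_arrival_times(timestamps):
    """Time deltas between consecutive Probe Requests from one MAC-Address."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def ifat_signature(timestamps, bin_width=5, max_ifat=100):
    """Map each bin index to the fraction of in-burst IFAT values in that bin."""
    deltas = [d for d in inter_frame_arrival_times(timestamps) if d <= max_ifat]
    if not deltas:
        return {}
    counts = Counter(int(d // bin_width) for d in deltas)
    return {bin_index: count / len(deltas) for bin_index, count in counts.items()}


# Two bursts roughly two seconds apart (arrival times in milliseconds).
arrival_ms = [0, 20, 41, 68, 2000, 2021]
print(ifat_signature(arrival_ms))
```

Two signatures produced this way can then be compared with a distance measure or, as in this Thesis, handed to a clustering algorithm.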


2.7 Clustering Algorithms

Various clustering algorithms can be used to analyze data for similarities which might not be apparent to a human. Data with similar characteristics are assigned the same cluster membership. When determining whether a data point is a member of a cluster, the algorithm analyzes the distribution and special identities of the data to categorize it, and later processes it with a distance algorithm[30].

Clustering methods can be categorized by how they determine cluster membership. In Monothetic clustering algorithms, data can only be a member of a cluster if it has some common property, e.g. age 20-25 or student at university X, Y or Z. In Polythetic clustering algorithms, cluster membership is determined by a distance function[31].

In the case of determining boundaries between clusters there are two categories, Flat Clustering and Hierarchical Clustering, both of which have their own pros and cons depending on the application.

2.7.1 Flat Clustering

Flat clustering algorithms find K clusters in the data, where K is determined by the user. There are two types of Flat Clustering algorithms, Hard Clustering and Soft Clustering. In hard clustering the algorithm creates clusters with no overlap between them. In other words, the data points in one cluster are related to each other and have nothing in common with other clusters, which creates a distinct or Hard boundary to other clusters. In Soft Clustering, data can belong to many different clusters but with different probabilities. Data may have a probability of 60% of belonging to one cluster and a probability of 40% of belonging to another[32]. One of the most common flat clustering algorithms is the K-Means algorithm, because of its simplicity and effectiveness. It is also a hard clustering algorithm. One issue with flat clustering algorithms such as K-Means is selecting the right cardinality K, since one might not know how many clusters there are in the data. There are however methods to get a good estimate of K, one of which is the Knee method. This method utilizes the residual sum of squares (RSS), which is used in statistics to measure variance in data. As K increases, the RSS of the K-Means result decreases and eventually levels off. The value of K where the slope of the RSS curve suddenly flattens, i.e. where there is a knee as seen in Figure 11, is a good estimate of how many clusters there are in the data[33].


(a) Dataset with random data in five clusters. (b) A knee is clearly shown at K=2 and K=5.

Figure 11: Example of K-Means in practice. From Figure 11b one can tell that there are either two or five clusters in Figure 11a.
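The Knee method can be sketched in a few lines of Python. The sketch assumes that the RSS has already been computed for each candidate K and picks the K where the curve bends most sharply, measured by the second difference of the RSS values. The function name `find_knee` and the sample values are hypothetical and only serve as illustration; the second difference is one simple heuristic among several.

```python
def find_knee(rss_by_k):
    """Return the K at which the RSS curve bends most sharply.

    rss_by_k maps each candidate K to the RSS of the K-Means result.
    The bend is measured by the discrete second difference.
    """
    ks = sorted(rss_by_k)
    best_k, best_bend = None, float("-inf")
    # The first and last K have no neighbour on one side and are skipped.
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = rss_by_k[prev_k] - 2 * rss_by_k[k] + rss_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical RSS values: a large drop up to K=2, then a flat tail.
example_rss = {1: 100.0, 2: 40.0, 3: 35.0, 4: 32.0, 5: 30.0}
knee = find_knee(example_rss)
```

A curve with two knees, as in Figure 11b, would require ranking the candidates instead of returning only the sharpest bend.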

2.7.2 Hierarchical Clustering

Hierarchical Clustering algorithms seek to create a hierarchy of clusters. These algorithms do not need the number of clusters as input, but instead calculate clusters recursively with either a bottom-up or a top-down approach, where the top level holds the largest clusters and the bottom level the smallest[34], as seen in Figure 12.

Figure 12: Example of Hierarchical Clustering.

A common algorithm for this is the Mean Shift algorithm, which is a centroid-based algorithm that strives to find clusters by repeatedly updating each centroid’s position to the mean of the surrounding points[35]. As seen in Figure 13, the algorithm can identify the two clusters present in the data. It works on fairly large data sets, with the downside of being computationally heavy, especially compared to K-Means.


(a) A 2D-Dataset with random data. (b) Mean Shift on the same dataset.

Figure 13: Mean Shift example on a random generated 2D-Dataset.


3 Method

In this section the Task Specification, Hardware and Software used for the project are presented. It is explained how the Control Frame Attack was performed, and the structure of the transmitted packet is shown. The Passive Probe Request Analysis, the Inter-Frame Arrival Time Analysis and how they were built to cooperate are also explained.

3.1 Task Specification

By the nature of how Flow counts devices, the MAC-Address is a central part, and together with the RSSI[3] the system can count devices present in the area covered by the access points. To increase accuracy, one has to either improve how the position of a device is calculated or implement a method that avoids double counting units based on their local MAC-Address. To narrow down the research, the most prominent cause of inaccuracy was chosen: double counting of devices due to randomized MAC-Addresses. The task of this Thesis is therefore to find possible methods of fingerprinting mobile devices for the purpose of improving Flow.

The research is carried out at the company’s main office. Daily stand-ups are held to inform the employees about the progress of research and development.

3.2 Hardware and Software Used

To inject WiFi packets, an Asus UX310UQK laptop with Windows 10 (Build 17134) running Kali-Linux[36] (Kali-Linux-2019.1) in a Virtual Box[37] environment (Version 6.0.4r128413 (Qt5.6.2)) was used. Since the laptop’s built-in network card has no publicly available drivers with monitor mode enabled, a WiFi-dongle (Netgear WNA3100M) compatible with monitor mode was used instead. This setup was chosen because of its convenience and mobility; being able to bring the setup to a remote location where no other WiFi communication was occurring was a key criterion. Kali-Linux was chosen as operating system since it has many pre-installed tools suitable for WiFi monitoring. One downside of Kali-Linux from a security perspective is that all commands are executed with root privileges. In the case of an unauthorized attack, the attacker could execute malicious software on the computer and possibly damage the system. To mitigate potential damage, Kali-Linux was run in a Virtual Box environment, where the system could easily be reinstalled after an attack. One could dual-boot Kali-Linux alongside Windows, but it would take much longer to re-configure the system after an attack than in a Virtual Box environment. To enable monitor mode on the Netgear WNA3100M WiFi-dongle, Aircrack-ng[38] (Version 1.5.2) was used, which is a tool used to assess WiFi network security and comes pre-installed with Kali-Linux. To read and analyze network packets, Wireshark[10] (Version 2.6.6) and Python (Version 3.7.2) with the PyShark[39] (Version 0.4.2.2) and Scapy[40] (Version 2.4.2) libraries were


(Kernel Version 4.14.93). Which comes pre-installed with Nexmon[42] drivers to enable monitor mode on the Broadcom bcm43430a1 WiFi chip used on the Raspberry Pi. To analyze the packets received on the Raspberry Pi, Wireshark (Version 2.6.6) was used.

Since the bcm43430a1 and Netgear WNA3100M network cards only support IEEE 802.11n at 2.4GHz, the results exclude the IEEE 802.11ac 5GHz standard. When developing the method discussed in Section 3.3.3, the Scikit-learn[43] library for data mining, data analysis and machine learning in Python was used. It is simple to use and efficient, which was important in developing the software.

3.3 Method Description

3.3.1 Control Frame Attack

The customized RTS frame needed to perform a Control Frame Attack was created using the Scapy library for Python[44]. This approach was chosen since it seemed to be a promising way to retrieve information about a device’s presence. It was also the only method presented in[9] with a 100% success rate. An RTS packet such as the one shown in Figure 14 was constructed and transmitted to a mobile device.

Figure 14: Representation of the customized RTS frame.

The value in the field addr1 is the receiving address and is set to either the global or local MAC-Address of a device. The field addr2 is the transmitting address and is set to a value easily traced within Wireshark. After a Control Frame Attack has been performed by transmitting the RTS frame, potential responses are monitored using Wireshark with monitor mode enabled.
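To illustrate the frame layout, the following sketch assembles the MAC header of an RTS frame byte by byte using only the Python standard library. It is not the Scapy script used in the project; the function names `mac_to_bytes` and `build_rts`, the duration value and the example addresses are assumptions made for illustration. The Frame Check Sequence is left out since it is normally appended by the network card, and so is the radiotap header that precedes the frame on injection.

```python
import struct

RTS_FRAME_CONTROL = 0x00B4  # version 0, type 1 (control), subtype 11 (RTS)


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("a MAC address must be 6 bytes long")
    return raw


def build_rts(addr1, addr2, duration_us=0):
    """Build the 16 byte RTS header: frame control, duration, RA, TA."""
    # Both 16 bit fields are transmitted in little-endian byte order.
    header = struct.pack("<HH", RTS_FRAME_CONTROL, duration_us)
    return header + mac_to_bytes(addr1) + mac_to_bytes(addr2)


# Hypothetical receiver (addr1) and an easily recognised transmitter (addr2).
frame = build_rts("da:a1:19:00:00:01", "aa:bb:cc:dd:ee:ff")
```

A device that answers would return a CTS frame whose receiver address equals the addr2 value chosen here, which is why an easily recognised value is convenient when filtering in Wireshark.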

For the first experiment the Python script was configured to monitor the network traffic using PyShark and send an RTS to any device transmitting a probe request with the Local bit set in its MAC-Address (as seen in Figure 2). The purpose of this was to check whether the potential responses would contain information usable for fingerprinting randomized MAC-Addresses.
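Checking the Local bit amounts to testing the second least significant bit of the first octet of the address. A minimal sketch, where the function name `is_locally_administered` and the sample addresses are hypothetical:

```python
def is_locally_administered(mac):
    """Return True if the Local bit of a MAC-Address is set.

    The Local bit is the second least significant bit (0x02) of the
    first octet; it is set for randomized addresses.
    """
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


is_random = is_locally_administered("da:a1:19:00:00:01")   # 0xda has the bit set
is_global = not is_locally_administered("00:1a:2b:3c:4d:5e")
```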

In the second experiment the Python script was modified to transmit the RTS with the addr1 field set to the global MAC-Address of a mobile device. This is the method used in[9], and the purpose of this experiment was to see which devices confirm their presence by responding with a CTS frame.


3.3.2 Passive Probe Request Analysis for SSID Fingerprinting

Passive Probe Request Analysis implies that no packets are sent from the configured Access Point. This method and the method described in Section 3.3.3 depend strictly on passively reading and analyzing network traffic. Network packets gathered and stored in a .pcapng file using Wireshark were deconstructed with PyShark to extract the SSIDs and various other data. By using the PyShark filter wlan.fc.type_subtype eq 4, only Probe Request frames were processed. The following fields seemed useful and were extracted from the packets.

• MAC-Address: Used as the Python Dictionary key.

• OUI: Compared to a JSON file containing known OUIs to specify manufacturer.

• SSID: Used for fingerprinting.

• HT Capabilities and Extended Capabilities: Seemingly network card specific values as described in 2.4.3 and 2.4.4.

• Timestamp: Used for time analysis.

The data was extracted using Algorithm 1 and appended to a dictionary using Algorithm 2. After all packets were read they were compared using Algorithm 3. This algorithm in turn uses Algorithm 4, which returns the probability that two packets were transmitted by the same device. Cosine Similarity was chosen for comparison of the SSID lists because, unlike Euclidean Distance, it is independent of list length. This is useful if, for instance, a device sends probe requests to several saved SSIDs using one randomized MAC-Address and later transmits to fewer SSIDs under a different randomized MAC-Address. After manually analyzing several Wireshark files this seemed to be a somewhat frequent situation. Therefore Cosine was chosen over Euclidean. The Cosine Similarity was also noted to have a slightly shorter execution time than Jaccard.
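The difference between the two set-based measures can be illustrated with a small sketch. It assumes that each SSID list is treated as a binary vector (an SSID is either present or not), which is one possible encoding and not necessarily the one used in the project; the function names and SSIDs are hypothetical.

```python
import math


def cosine_similarity(ssids_a, ssids_b):
    """Cosine similarity of two SSID lists seen as binary vectors."""
    a, b = set(ssids_a), set(ssids_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def jaccard_similarity(ssids_a, ssids_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(ssids_a), set(ssids_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# The same device probing for four SSIDs first and only two of them later.
full_list = ["Home", "Office", "Cafe", "Airport"]
short_list = ["Home", "Office"]
cos = cosine_similarity(full_list, short_list)    # 2 / sqrt(4 * 2)
jac = jaccard_similarity(full_list, short_list)   # 2 / 4
```

With these inputs the Cosine Similarity penalizes the shorter list less than Jaccard does, which matches the situation described above.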


Algorithm 1 Extraction of Probe Request Information, readMACAddresses().

Data: Probe Request packets
Result: Dictionary of MAC_Addresses and Packet Fingerprints

for packet in packet list do
    extract packet SSID
    mask out device OUI
    extract device HT Capabilities
    extract device Extended Capabilities
    append data along with MAC of device and timestamp using appendToDict()
end

Algorithm 2 Appending packet to Dictionary, appendToDict().

Data: MAC_address, SSID, OUI, ht_Cap, ext_Cap, timestamp, signalstrength
Result: Creates new Dictionary item or appends to existing

if MAC_Address local bit set then
    if MAC_Address previously read and SSID not previously read then
        fetch current fingerprint of MAC-Address
        add signalstrength to fingerprint
        add SSID to current fingerprint
    else
        initiate new fingerprint
        add input data to fingerprint
        add fingerprint and MAC to Dictionary
    end
else
    if MAC_address not previously read then
        initiate new fingerprint including MAC-Address
        add input data to fingerprint
        add fingerprint and MAC to Dictionary
    end
end


Algorithm 3 Processes the dictionary to count unique devices, processFingerprints().

Data: Dictionary MAC_Fingerprints
Result: List UniqueDevices

initiate new list readItems
for item in MAC_Fingerprints items do
    if item is not in readItems then
        add item to readItems
        add item to UniqueDevices
    end
end
for packet_x in UniqueDevices do
    initiate new list matches
    for packet_y in UniqueDevices do
        if neither packet_x nor packet_y uses a global MAC address then
            if packet_x MAC address not equal to packet_y MAC address then
                calculate Jaccard or Cosine similarity of packets
                if similar enough to be assumed the same device then
                    append packet_y to matches
                end
            end
        end
    end
    for match in matches do
        merge packets
        remove matching packet from UniqueDevices
    end
end
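The merging step can be sketched in Python as follows. This is a simplified, single-pass greedy variant and not a line-by-line translation of Algorithm 3: each fingerprint is merged into the first existing device it resembles. The dictionary keys `local` and `ssids`, the function name `count_unique_devices`, and the `threshold` parameter are hypothetical, since the cut-off for "similar enough" is not specified here; the similarity function is passed in by the caller.

```python
def count_unique_devices(fingerprints, similarity, threshold):
    """Merge fingerprints that appear to belong to the same device.

    fingerprints: dict mapping MAC -> {'local': bool, 'ssids': set}
    similarity:   callable returning a value between 0 and 1
    threshold:    hypothetical cut-off for "similar enough"
    """
    devices = []  # each entry: {'macs': set, 'local': bool, 'ssids': set}
    for mac, fingerprint in fingerprints.items():
        target = None
        if fingerprint["local"]:
            # Only randomized addresses are candidates for merging.
            for device in devices:
                if device["local"] and similarity(device, fingerprint) >= threshold:
                    target = device
                    break
        if target is None:
            devices.append({
                "macs": {mac},
                "local": fingerprint["local"],
                "ssids": set(fingerprint["ssids"]),
            })
        else:
            target["macs"].add(mac)
            target["ssids"] |= set(fingerprint["ssids"])
    return devices
```

The length of the returned list is the estimated number of devices, and each entry records which MAC-Addresses were attributed to it.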

Algorithm 4 Compares two packets and returns their similarity from 0 to 1, comparePackets().

Data: Probe Request packet_x, packet_y
Result: Similarity between packets represented as value between 0 and 1

calculate Cosine Similarity and multiply with 0.5
set equalOUI to 1 if packets have equal OUI
set equalHTcap to 1 if packets have equal HT Capabilities
set equalEXTcap to 1 if packets have equal Extended Capabilities
set equalFields to equalOUI · equalHTcap · equalEXTcap · 0.5
return Cosine similarity + equalFields


The result of the SSID list comparison was scaled to a value between 0 and 0.5, since the two packets also had to have equal Extended Capabilities, HT Capabilities and OUI. The result of these three fields all being equal was likewise given the weight 0.5. This ensured that the returned value would never be greater than 0.5 if the two packets did not share any SSID, or if they differed in Extended Capabilities, HT Capabilities or OUI. After processing the packets and merging those that seem to originate from the same transmitting device, the estimated number of devices is presented.
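The weighting can be sketched as follows. The dictionary keys `ssids`, `oui`, `ht_cap` and `ext_cap`, the function name `compare_packets` and the sample values are hypothetical, and the SSID lists are assumed to be compared as binary vectors.

```python
import math


def compare_packets(packet_x, packet_y):
    """Return a similarity between 0 and 1 using the 0.5 / 0.5 weighting."""
    ssids_x, ssids_y = set(packet_x["ssids"]), set(packet_y["ssids"])
    if ssids_x and ssids_y:
        cosine = len(ssids_x & ssids_y) / math.sqrt(len(ssids_x) * len(ssids_y))
    else:
        cosine = 0.0
    # The capability fields only contribute if all three of them are equal.
    equal_fields = all(
        packet_x[key] == packet_y[key] for key in ("oui", "ht_cap", "ext_cap")
    )
    return 0.5 * cosine + 0.5 * (1.0 if equal_fields else 0.0)


# Hypothetical fingerprints of two randomized MAC-Addresses.
first = {"ssids": {"Home", "Office"}, "oui": "da:a1:19",
         "ht_cap": "0x012c", "ext_cap": "0x04"}
second = {"ssids": {"Home", "Office"}, "oui": "da:a1:19",
          "ht_cap": "0x012c", "ext_cap": "0x04"}
score = compare_packets(first, second)
```

Two packets with identical fields and identical SSID lists score 1.0, while packets that agree on only one of the two halves score at most 0.5.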

3.3.3 Passive Probe Request Analysis for Inter-Frame Arrival Time Analysis

Before building the IFAT algorithm, sample data was gathered to be used when testing the algorithm. When developing and testing algorithms, one needs consistent and reliable data to make sure that development does not rest on a bad ground truth.

This recorded data consisted of multiple Probe Request bursts from one single device, recorded over a time period of roughly 400 seconds with the screen both locked and unlocked. Each set of recorded data from each phone tested was later analyzed to make sure it was free from disturbances, such as other devices and traffic. In the case of interference, the necessary Wireshark filters were used before further processing.

Later on in the process, larger sets of data were collected with more than one device, in order to get a more realistic data set on which to try the algorithms.

The next step was to test whether IFAT could still be used to characterize a device. This was done by calculating the arrival time between all recorded Probe Request frames and plotting it. If there is a unique pattern for each device in these plots, one can assume that IFAT-Analysis is still a viable approach. As shown in Figure 15, there is an obvious pattern in the timing between frames and bursts sent from a given device.

(a) IFAT of an iPhone 7 Plus iOS 11.1 with screen unlocked, plotted over time.

(b) IFAT of an iPhone 8 Plus iOS 12.1.4 with screen unlocked, plotted over time.

Figure 15: IFAT plotted to evaluate if IFAT still could be used to fingerprint a device. As seen, the pattern is different between these two devices.
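Calculating the IFAT amounts to taking the difference between consecutive frame timestamps. A minimal sketch, where the function name and the sample timestamps are hypothetical:

```python
def inter_frame_arrival_times(timestamps):
    """Return the time deltas in seconds between consecutive frames."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical capture times (seconds) of four Probe Requests from one device.
timestamps = [0.0, 0.5, 1.25, 3.25]
ifat = inter_frame_arrival_times(timestamps)
```

Plotting these values against the capture time gives the kind of curve shown in Figure 15.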

After this test was conducted and the results were analyzed, it was concluded that IFAT-Analysis can still be a viable approach. The next step was to split all Probe Requests into bursts, and rules were set for what constitutes a burst set, as follows.

• All frames in a burst have the same MAC-Address.

• The time delta from the first packet in a burst to the n’th packet should not be greater than 0.7 seconds.
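The two rules can be sketched as a small grouping function. The input format, a list of (timestamp, MAC-Address) tuples, and the function name `split_into_bursts` are assumptions made for illustration.

```python
def split_into_bursts(frames, max_burst_length=0.7):
    """Group (timestamp, mac) tuples into bursts.

    A frame joins the most recent burst of its MAC-Address if it arrives
    at most max_burst_length seconds after the first frame of that burst;
    otherwise it starts a new burst.
    """
    bursts = []
    latest_burst = {}  # mac -> index in bursts of its most recent burst
    for timestamp, mac in sorted(frames):
        index = latest_burst.get(mac)
        if index is not None and timestamp - bursts[index][0][0] <= max_burst_length:
            bursts[index].append((timestamp, mac))
        else:
            latest_burst[mac] = len(bursts)
            bursts.append([(timestamp, mac)])
    return bursts


# Hypothetical capture with two MAC-Addresses, here simply named A and B.
frames = [(0.0, "A"), (0.25, "A"), (0.5, "B"), (0.5, "A"), (1.0, "A"), (1.25, "A")]
bursts = split_into_bursts(frames)
```

In this example the frames of A that arrive more than 0.7 seconds after its first frame form a second burst, while B forms a burst of its own.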

The next step was to calculate a burst set signature. The idea is that a device has one unique signature, and if many burst sets have the same signature, one could argue that they originate from the same device.

Table 1: Signature example

Bin          0.15     0.3      0.45
Mean value   0.1132   0.2315   0.6582
Percentage   0.80     0.15     0.05

A signature consists of the mean IFAT for every burst set, as shown in Table 1.

To calculate a signature, the IFAT values are binned into three bins using equal-width binning. This method was used, based on the previous work of Franklin et al.[28] and Matte et al.[29], to get a more accurate result. Binning is a common method used when unwanted noise could interfere with the result.
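A signature of the kind shown in Table 1 could be computed as sketched below. The bin width of 0.15 seconds is read from the table; treating the last bin as open-ended, so that larger values fall into it, is an assumption, as are the function name and the sample values.

```python
def burst_signature(ifat_values, bin_width=0.15, n_bins=3):
    """Return one (mean IFAT, share of values) pair per bin.

    Assumption: values above the last upper edge are counted in the
    last bin. Empty bins are reported with a mean of 0.0.
    """
    bins = [[] for _ in range(n_bins)]
    for value in ifat_values:
        index = min(int(value // bin_width), n_bins - 1)
        bins[index].append(value)
    total = len(ifat_values)
    signature = []
    for members in bins:
        mean = sum(members) / len(members) if members else 0.0
        share = len(members) / total if total else 0.0
        signature.append((mean, share))
    return signature


# Hypothetical IFAT values (seconds) of one burst set.
signature = burst_signature([0.1, 0.12, 0.2, 0.6])
```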

When the signatures have been calculated, the Mean Shift algorithm processes every signature and outputs how many clusters it finds in the data; the theory is that one cluster should correspond to one device. Mean Shift was chosen because the user does not need to tell the algorithm how many clusters are present, as would be the case if K-Means were chosen. There are methods that let K-Means estimate the cluster count, as discussed in Section 2.7.1, but they still need supervision in the form of studying the variance to estimate how many clusters are present.
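The clustering step can be illustrated with a self-contained flat-kernel Mean Shift. This is a sketch of the technique and not the Scikit-learn implementation used in the project; the `bandwidth` value and the sample signatures are hypothetical.

```python
import math


def _distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Return the cluster centres found by flat-kernel Mean Shift."""
    centres = []
    for start in points:
        centre = tuple(start)
        for _ in range(max_iter):
            # All points within the bandwidth pull the centre to their mean.
            neighbours = [p for p in points if _distance(p, centre) <= bandwidth]
            new_centre = tuple(sum(axis) / len(neighbours) for axis in zip(*neighbours))
            moved = _distance(new_centre, centre)
            centre = new_centre
            if moved < tol:
                break
        # Starting points that converge to the same mode form one cluster.
        if all(_distance(centre, known) >= bandwidth for known in centres):
            centres.append(centre)
    return centres


# Hypothetical signatures (mean IFAT per bin) from two different devices.
signatures = [
    (0.10, 0.20, 0.40), (0.11, 0.21, 0.41), (0.09, 0.19, 0.39),
    (0.05, 0.28, 0.60), (0.06, 0.29, 0.61), (0.04, 0.27, 0.59),
]
estimated_devices = len(mean_shift(signatures, bandwidth=0.1))
```

The number of clusters follows from the data and the bandwidth, so no cluster count has to be supplied, although the bandwidth itself still has to be chosen or estimated.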

3.3.4 Cooperation between SSID Fingerprinting and IFAT Analysis

Since both technologies depended on Probe Request packets it was possible to have them cooperate. This was done by modifying the function processFingerprints() to determine whether a certain dictionary item should be computed using SSID fingerprinting or using IFAT. The modified algorithm can be seen in Algorithm 5.

The idea of having these two methods cooperate is to use SSID fingerprinting when possible, since it requires less processing power than IFAT clustering. When a device only transmits using the Broadcast SSID, however, IFAT clustering should be used to fingerprint the device.


Algorithm 5 Processes the dictionary to count unique devices by cooperating IFAT and SSID fingerprinting, processFingerprints().

Data: Dictionary MAC_Fingerprints
Result: List UniqueDevices

initiate new list readItems
initiate new list devices_not_be_time_analysed
for item in MAC_Fingerprints items do
    if Device only transmitted Probe Requests to Broadcast or Device transmits with global MAC address then
        add device MAC to devices_not_be_time_analysed
    end
    if item is not in readItems and item is in devices_not_be_time_analysed then
        add item to readItems
        add item to UniqueDevices
    end
end
calculate number of devices which only transmitted to Broadcast by sending packet file and devices_not_be_time_analysed to IFAT analyser
for packet_x in UniqueDevices do
    initiate new list matches
    for packet_y in UniqueDevices do
        if neither packet_x nor packet_y uses a global MAC address then
            if packet_x MAC address not equal to packet_y MAC address then
                calculate Jaccard or Cosine similarity of packets
                if similar enough to be assumed the same device then
                    append packet_y to matches
                end
            end
        end
    end
    for match in matches do
        merge packets
        remove matching packet from UniqueDevices
    end
end

3.3.5 Graphical User Interface Development

The Graphical User Interface (GUI) was developed using a Python library named wxPython[45]. The library was chosen since it allowed for quick development and since it lets the GUI adapt to the OS on which it is run.


3.4 Result Analysis

3.4.1 Control Frame Attack

After completing the experiments, the result analysis will be done by comparing which devices are vulnerable to a Control Frame Attack through sending an RTS to both the local and global MAC-Address of a device. As the final goal is to be able to fingerprint mobile devices, the potential functionality of the Control Frame Attack will be discussed.

3.4.2 Passive Probe Request Analysis

As mentioned in Section 1.3 it is not feasible to arrange tests with a large number of devices while still knowing the exact quantity, since the project is restricted in both time and resources. Smaller tests with 15 to 30 devices will be performed to estimate the accuracy. With these tests an accuracy percentage can be estimated to see whether the technology satisfies the goals set in Section 1.2. This result will then be used to evaluate whether SSID, IFAT or a combination of both is optimal. In Section 4.2 two different ways to separate data readings are mentioned, Readings and Tests, with Readings being different data gatherings within the same Test. The mean accuracy percentage presented in Section 4.2 is calculated as shown in Equation 6, with error being the difference between the observed and the true value and n being the number of readings for the current test.

$$\text{mean accuracy} = \frac{\sum_{i=0}^{n}\left(1 - \frac{\text{error}}{\text{true}}\right)\cdot 100}{n} \tag{6}$$
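As a worked example of Equation 6, the sketch below computes the mean accuracy for one Test. It assumes that the error is taken as the magnitude of the difference between the observed and the true device count and that the sum runs over the n readings; the function name and the sample counts are hypothetical.

```python
def mean_accuracy(readings):
    """Mean accuracy in percent for a list of (observed, true) counts."""
    if not readings:
        raise ValueError("at least one reading is required")
    total = 0.0
    for observed, true in readings:
        error = abs(observed - true)  # assumption: magnitude of the difference
        total += (1 - error / true) * 100
    return total / len(readings)


# Two hypothetical readings with 15 devices present: 12 and 18 counted.
accuracy = mean_accuracy([(12, 15), (18, 15)])
```

Both readings are off by 3 out of 15 devices, so each contributes 80% and the mean accuracy is 80%.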


4 Achieved Results

4.1 Control Frame Attack

Results from the Control Frame Attack can be seen in Table 2. Responses were received from the HTC 10 (Released 12/04/2016), Samsung S8 (Released 21/04/2017) and iPhone 6 Plus (Released 19/09/2014) using both the devices’ global and local MAC-Addresses. These three were released either before or shortly after the study which brought attention to the subject was published[9]. A response from the Samsung S9 Plus was received when transmitting to the global MAC-Address but not to the local one. No response was received from either the iPhone 8 Plus or the iPhone 7 Plus, even though the iPhone 7 Plus was also released in 2016. No information useful for fingerprinting of devices has yet been noted in the CTS responses received. It was also noted that the HTC 10 and Samsung S9 Plus transmitted probe requests even with WiFi disabled on the device. The Samsung S9 Plus responded to a Control Frame Attack in this state, whereas we did not manage to get a response from the HTC 10 with WiFi disabled. The Samsung S9 Plus also transmitted probe requests when the device was leaving Flight Mode, despite WiFi being disabled.

Table 2: Control Frame Attack to Global MAC Address Results.

Device Model       OS              Response (Local MAC)   Response (Global MAC)
HTC 10             Android 8.0.0   Yes                    Yes
Samsung S8         Android 9.0     Yes                    Yes
Samsung S9 Plus    Android 9.0     No                     Yes
Samsung S10 Plus   Android 9.0     No                     No
iPhone 6 Plus      iOS 10.2        Yes                    Yes
iPhone 7 Plus      iOS 11.1        No                     No
iPhone 8 Plus      iOS 12.1.4      No                     No
