Detecting visual plagiarism withperception hashing

(1)

DEGREE PROJECT, IN COMPUTER SCIENCE , FIRST LEVEL STOCKHOLM, SWEDEN 2015

Detecting visual plagiarism with

perception hashing

JOHANNES WÅLARÖ

(2)

Detecting visual plagiarism with perception

hashing

JOHANNES WÅLARÖ

(3)

(4)

Abstract

(5)

Referat

(6)

Introduction

Classifying images can be done in a myriad of different ways, everything from ob-ject recognition and bag of word classifications to disregarding the images content entirely and using strict cryptographic hashes.

Perceptual hashing is something of a middle road, instead of the classical paradigm of cryptographic hashes where the tiniest changes avalanches into an entirely differ-ent hash, perceptual hashing instead uses the images contdiffer-ent to try to fingerprint the image, so that even if hashes are not identical they can be used to determine how “close” the images are to one and other (In terms of the underlying criteria).

Most of the perceptual hashes described in this report does this by a simple hamming distance, which is quick to calculate. Moreover the algorithms also have an amount of ductility when it comes to changes, so changing a single pixel for example will not change the generated hash in most cases.

1.1 Objectives of this report

The objective of this report is to look at a few different perceptual hashing algo-rithms and see how and if they might be used to detect visual plagiarism though hash matching.

1.2 Problem statement

The problem of detecting visual plagiarism is to find a method that can compare an image to a relatively large dataset in a reasonable amount of time, but also provide good accuracy in the matching.

(9)

(10)

Chapter 2

Background

2.1 Hamming distance

The hamming distance[1] can be used on most of the resulting hashes to get the perceived difference between two images, in that a perceptually similar image would have a short hamming distance (Ideally for the same image, a 0 distance).

The definition for the hamming distance is quite straightforward: The hamming distance d(x,y) is the number of ways in which x and y differ.

In other words, since the algorithms return a binary number, the hamming distance is simply the number of positions in which they are different.

2.2 Normalized Hamming distance

In 8.1 we see that a Hamming Distance is calculated for a one dimensional sequence, a single sequence of binary digits in this case. A Normalized Hamming Distance in this case works in much the same way, except that it is defined for a sequence of binary sequences instead, in this case an array of uint8’s[2].

The Normalized Hamming distances is simply the sum of partial Hamming dis-tances from each of the corresponding elements.

(11)

CHAPTER 2. BACKGROUND

2.3 aHash

aHash uses the average pixel strength of the image to calculate a hash[3]

2.3.1 Image Resizing

The image is first reduced to an n by n image, where n is 8 in used implementation.

2.3.2 Gray scaling

The image is then reduced to gray scale to reduce the color information density of the image. This will effectively make the image an 8x8 field with values in the range of the gray scales color depth.

2.3.3 Calculate average color

The images average color is calculated, by taking the sum of all the pixels colors and dividing by the number of pixels.

2.3.4 Generating the hash

The hash is then generated simply by making the values in a corresponding vector (8x8 long) 1 if the pixel in question is above the average, which we calculated in the previous step.

You can then interpret this 64-bit binary vector as you will, commonly as an hexadecimal string.

(12)

2.4. DHASH

2.4 dHash

This algorithm uses a difference field to generate a hash value[4]

The image is first reduced to an n+1 by n image, where n is 8 in used implementa-tion.

The image is reduced to gray scale to reduce the color information density of the image. This will effectively make the image an 9x8 field with values in the range of the gray scales color depth.

2.4.3 Generating a difference field

The difference field is then calculated from the 9x8 image img as: dfi,j=imgi.j-imgi,j+1 This is why we use an odd number of columns; we will need an “extra” end column to calculate the 8:th difference field column.

2.4.4 Generating the hash

If the n:th cell has a positive value, the hashbit n is set to 1 otherwise it is set to 0. A 64 bit hash value is returned.

You can then interpret this 64-bit binary vector as you will, commonly as an hexadecimal string.

(13)

2.5 pHash with DCT

This version of pHash uses DCT[5]

The image is converted to a grayscale image using its luminance.

2.5.2 Image smoothing

A smoothing filter is applied to the image.

The image is resized to an n by n picture, where n=32 is used in the implementation.

2.5.4 Discrete cosine transformation

A discrete cosine transformation[6] is performed on the image by generating a DCT matrix and performing a matrix multiplication with the image and then multiplying the resulting matrix with the DCT matrix transpose.

The 64 DCT coefficients with the lowest frequency is then used to make the hash (excluding the absolutely lowest coefficients). ie. dctImage.crop(1,1,8,8).unroll(’x’)

2.5.5 Calculate median value

The median value of the selected DCT coefficients is calculated.

2.5.6 Use average value to generate hash

The hash is generated by going through the selected DCT coefficients and setting the hashbit if the DCT value is higher than the calculated median value. A 64 bit value is returned.

(14)

2.6. PHASH WITH RADIAL VARIANCE

2.6 pHash with Radial Variance

This version of pHash uses Radial Variance[7]

The image is converted to a gray scale.

2.6.2 Image blurring

The image is blurred with a strength of a parameter sigma, commonly 1.

2.6.3 Gamma correction

The image is gamma corrected with a strength of a parameter gamma, commonly 1.

2.6.4 Radon Projections

The image is used as the basis to generate a radon map, also sometimes called a sinogram.

2.6.5 Feature extraction

The features from the sinogram is extracted.

2.6.6 Discrete Cosine transformation

A DCT[6] calculation is performed (much in the same way as in 8.6.4) on the features from the sinogram is computed and returns a sequence of 40 DCT coefficients.

2.6.7 Peak of Cross Correlation[8]

Instead of using hamming distance to compare hash “closeness” the radial type uses Cross Correlation. Simply speaking the Cross correlation peaks where the images “match” and the peaks are summarized, if the sum is above a threshold, the images are declared to be a match.

(15)

2.7 pHash with Marr–Hildreth Operator

This version of pHash uses the Marr-Hildreth Operator[9]

The image is converted to a 512 by 512 pixel image.

The image is converted to a gray scale.

2.7.3 Image Blurring

The image is blurred with the strength of one.

2.7.4 Correlation of the Marr-Hildreth kernel

The image is correlated with the Marr-Hildreth laplacian of gaussian kernel[10].

2.7.5 Image normalization

The image is normalized to between 0 and 1.

2.7.6 Hash generation

The image hash is generated to a 72 element array of bytes. A normalized hamming distance is used to calculate the distance to another image.

(16)

2.8. SEARCHING HASHES

2.8 Searching hashes

The feasibility of matching an image is not only dependent on the hash algorithm used but also very much on the size of the dataset. The size of that dataset will depend greatly on how it is constructed, if for example it was constructed taking only images from scientific reports, it would be significantly smaller than if it was constructed by for example crawling the web for all possible images. Even so, there are two primary ways of searching the dataset, listed below.

2.8.1 Logarithmic search

Logarithmic search speed is very much desired for searching large dataset, since the search time only increases logarithmically with the size of the dataset, making it possible to search a great number of images to find matches. The obvious drawback is that only zero-distance hashes will be matched, thus missing out on matches that might have a very small “closeness” distance.

2.8.2 Linear search

Linear search speed is a consequence of how distance is measured; a hamming distance must be calculated for the input image hash to all the other image hashes in the dataset. The generation of the hamming distance is fairly quick, but one would still have to go through the entirety of the dataset, making it unfeasible for large datasets. The upside is that even modified images can be matched to a distance threshold, making plagiarism harder.

2.9 ImageHash 0.3 for python

There is a perceptual hash library for python[11], including algorithms for aver-age hashing, differential hashing and pHashing with discrete cosine transformation, based on the posts[3, 4]. While their implementation of aHash and dHash seems to be decent in their implementation, the DCT pHash is not correct, and as such it will not be used. Instead we will use the pHash library written in c++ by the original algorithm authors for all the pHash calculations.

2.10 pHash for C++

The original implementation of the pHash algorithms[12] (Discrete cosine trans-formation, Radial Variance and Marr-Hildreth operator) is available in c++ with handlers for java, php. Since this implementation is the authority on pHash, it will be used for all pHash results.

(17)

(18)

Chapter 3

Method

Testing of perceptual hashing to detect visual plagiarism will be carried out by in the following steps

3.1 Image selection

A subset of 567 images was selected from wikimedia commons “Quality images” category[13]. Due to bandwidth and space constraints, a smaller image was used (600 pixel width).

3.2 Hash selection

Three of the hash algorithms were chosen: aHash serves as the simplest exam-ple of hashing, dHash is a slightly more comexam-plex algorithm and pHash is a well established implementation using more advanced methods such as discrete cosine transformations.

Additionally all three have the same hash size and hash comparison method (Hamming distance) unlike both Marr-Hildreth operation and radial variance hashes.

3.3 Rudimentary image manipulation

If you know the algorithms being used, you can construct a attack image, with the expressed purpose of fooling the algorithms in question, just as you may in other mediums. Since there is not really any getting around this, we instead focus on the image manipulation that would be realistic to perform with common photo editor software, Photoshop CC 14 in this case.

3.3.1 No change

The image is not changed, its simply saved, but it will still be recompressed, as such it provides a good baseline for what the jpg compression does in and of itself.

(19)

CHAPTER 3. METHOD

3.3.2 200% Scaling

The image is resized to twice its original size.

3.3.3 Color correction

The color was changed slightly, as to not make it obvious that manipulation had taken place, a -15 color correction on all three color channels was used.

3.3.4 Contrast adjustment

Contrast was increased by 40 percent.

3.3.5 Black and white filter

The image was grayscaled using a black and white filter.

3.4 One to One hash calculation

The before and after hashes for each image is compared and its distance calculated in order to see the effect of each manipulation on the hash values and distances, zero distances are counted and the average distance is recorded. “False” negatives, IE when an image is the same but the algorithm reports them over the threshold (8 for all tests) are also counted. (The number of true positives can obviously be derived from this number, since all images that are not false negatives are true positives in this case. Eg 567-False Negatives)

3.5 One to Many hash calculation

In order to asses whether or not an algorithm is suitable we also need to know the number of false positives, where images that are different are reported to be the same. This is done by taking an original image and comparing it to every other image in the particular dataset, resulting in about 160000 (n*n/2) comparisons in each test.

(20)

Chapter 4

Results

Below are the results for the different hash algorithms.

The average distance is of interest because if we do a full dataset search (Ham-ming distance to all data points) we want the number to be as low as possible, ie on average have a good margin to the threshold.

The number of zero distances is of interest because given a good percentage of exact matches we can do a search in log(n) time.

aHash results No change Color Cor-rection Contrast Adjustment Grayscale filtering 200% scal-ing Average distance 0.0406 0.0406 0.666 1.444 0.136

Number of zero distances 549 549 312 229 503

False negatives 0 0 0 7 0

False positives 867 867 800 916 857

dHash results No change Color Cor-rection Contrast Adjustment Grayscale filtering 200% scal-ing Average distance 0.0617 0.0617 0.899 1.483 0.261

pHash results No change Color Cor-rection Contrast Adjustment Grayscale filtering 200% scal-ing Average distance 0.071 0.071 1.756 1.975 3.642

(21)

(22)

Chapter 5

Discussion & Conclusion

Overall the results are positive. For “no change” and “color correction”, the number of zero distance matches are decent, while all algorithms in all tests have a reasonable match distance for a threshold search (Ie mostly below 8-10 which is a common threshold)

There are things about the algorithms not shown in the results, namely false-matching, which is when two different images are matched as the same. This is especially common with the simpler algorithms which does not use discrete cosine transformations.

The results are not very encouraging for log(n) searches in most cases, given that only roughly half of images gets a zero-distance match.

dHash fares well, as might be expected, seeing how it is a differential function, and the change in pixel to pixel should be fairly constant with the image manipu-lations chosen here. Ie the color correction will be nullified by the fact that dHash uses the average values to set bits off their difference.

One of the more surprising results is the poor performance for both ahash and dhash in the grayscale test, given that both algorithms converts to grayscale before computation. This might be because photoshop does not weigh all colors equally when gray scaling (ie does color analysis before conversion) or because the Image-Hash library does not convert explicitly to grayscale, it simply weights the different color channels together.

It is also quite surprising that pHash struggles at the 200% scaling test, doing worse than both aHash and dHash. I suspect this might be because of the bilinear interpolation done by the image processor in scaling it up, changing the DCT co-efficients of the image when combined with the DCT compression that JPEG does when resaving the image.

Given the results, a hybrid approach might be appropriate, that is constructing a hash with a higher tolerance for changes to provide a zero-distance match, and then applying a linear search on that subset of data with more complex and precise algorithms such as pHash.

(23)

(24)

Bibliography

[1] "Hamming distance in coding theory." 2010. Retrieved 20 Apr. 2015 http://www.maths.manchester.ac.uk/ pas/code/notes/part2.pdf

[2] Normalized hamming distance

pHash 0.96 source code implementation Row 955-971

[3] “Kind of Like That” Dr. Neal Krawetz 2013. Retrieved 20 Apr. 2015 http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

[4] “Looks Like It” Dr. Neal Krawetz 2011. Retrieved 20 Apr. 2015

http://www.hackerfactor.com/blog/?/archives/529-Kind-of-Like-That.html

[5] pHash DCT Imagehash

[6] Discrete Cosine Transformation

Mark Nixon, Alberto S Aguado. “Feature Extraction & Image Processing for Computer Vision, 3nd Edition.” Page 68-70, 2012

[7] pHash 0.96 source code implementation Row 258-296

[8] Bracewell, R. "Pentagram Notation for Cross Correlation." Page 46 and 243, 1965.

[9] pHash Marr–Hildreth Imagehash

[10] Feature Extraction & Image Processing for Computer Vision, 3nd Edition. Page 165-170 Mark Nixon, Alberto S Aguado

(25)

BIBLIOGRAPHY

[11] ImageHash 0.3 python Johannes Buchner https://github.com/JohannesBuchner/imagehash

[12] pHash 0.96 Evan Klinger & David Starkweather http://www.phash.org/releases/pHash-0.9.6.tar.gz

[13] Wikimedia commons - Quility images, Multiple authors

http://commons.wikimedia.org/wiki/Category:Quality_images Retrieved 20 Apr. 2015

(26)

Detecting visual plagiarism withperception hashing

Detecting visual plagiarism with

perception hashing

JOHANNES WÅLARÖ

Detecting visual plagiarism with perception

hashing

Abstract

Referat

Contents

Chapter 1

Introduction

1.1

Objectives of this report

1.2

Problem statement

Chapter 2

Background

2.1

Hamming distance

2.2

Normalized Hamming distance

2.3

aHash

2.4

dHash

2.5

pHash with DCT

2.6

pHash with Radial Variance

2.7

pHash with Marr–Hildreth Operator

2.8

Searching hashes

2.9

ImageHash 0.3 for python

2.10

pHash for C++

Chapter 3

Method

3.1

Image selection

3.2

Hash selection

3.3

Rudimentary image manipulation

3.4

One to One hash calculation

3.5

One to Many hash calculation

Chapter 4

Results

Chapter 5

Discussion & Conclusion

Bibliography