
The effect of colour use on the quality of websites


Academic year: 2022

Share "The effect of colour use on the quality of websites"

Copied!
48
0
0

Loading.... (view fulltext now)


VT 2016

Master's thesis in Cognitive Science, 15 ECTS

The effect of colour use on the quality of websites

Dorieke Grijseels


Contents

1 Abstract 4

2 Introduction 5

2.1 Neural correlates of harmony . . . 6

3 Purpose 8

4 Background 9

4.1 Machine learning methods . . . 9

4.1.1 Linear Regression . . . 9

4.1.2 Logistic Regression . . . 9

4.1.3 Lasso model and LOO . . . 9

4.1.4 Cross-Validation . . . 10

4.1.5 Hierarchical clustering . . . 10

4.1.6 Cophenetic Correlation Coefficient . . . 11

4.2 Saliency . . . 11

4.3 Colour measures . . . 12

4.3.1 Colour spaces . . . 12

4.3.2 Chroma, hue and lightness . . . 12

4.3.3 CIEDE2000 . . . 13

5 Harmony score 15

5.1 Methods . . . 15

5.1.1 Design of the survey . . . 15

5.1.2 Material . . . 16

5.1.3 Participants . . . 16

5.1.4 Procedure . . . 17

5.1.5 Harmony scores . . . 17

5.1.6 Ethical Considerations . . . 18

5.2 Results . . . 18

5.2.1 Inter-Observer Agreement . . . 18

5.2.2 Working with colours . . . 18

5.2.3 Scores . . . 19

5.3 Discussion . . . 19

6 Colour Extraction 22

6.1 Methods . . . 22

6.1.1 Data . . . 22

6.1.2 EyeQuant Model . . . 22

6.1.3 Extraction Procedure . . . 22

6.1.4 Experiment Procedure . . . 23

6.2 Results . . . 24

6.2.1 Scaling Number of Clusters . . . 24

6.2.2 Clustering method . . . 24

6.2.3 Colour Scaling . . . 28

6.2.4 Saliency Scaling . . . 28

6.3 Discussion . . . 28

7 New Colour Model 33

7.1 Methods . . . 33

7.1.1 Data . . . 33

7.1.2 Palette features . . . 33

7.1.3 Features from original screenshots . . . 35

7.1.4 Colour dataset features . . . 35

7.1.5 Feature Selection . . . 36


7.2 Results . . . 37

7.2.1 Extraction . . . 37

7.2.2 Feature selection . . . 37

7.2.3 Model performance . . . 40

7.3 Discussion . . . 40

8 Conclusion 43

9 References 44

Appendices 48

A Participant Summary 48


1 Abstract

The design of a website is important for the success of a company, and colours play an important part in websites. The goal of this thesis is to find out how the use of colour in websites relates to the quality of those websites. Several aspects are studied. First, it was found that the harmony of a colour palette only weakly correlates with the quality of a website; this correlation increases when only darker colour palettes are considered. Next, a method was proposed to extract the colour palette from a website. This novel method takes the saliency of the pixels in a website into account. Lastly, the palettes extracted using this method were used to propose a model that explains the relation between colour use and website quality. Sixty-one different features were tested using three different methods of feature selection. The accuracy achieved by the best model was low. Future work to improve on this should focus on identifying more relevant features and training the model on a better database.


2 Introduction

The design of a leaflet, poster or website can determine the success of a company. When designing a website, several factors should be considered. The user should be engaged, but not overwhelmed, by the website. The user's attention should be drawn to the right places, so that the user is guided to certain information.

EyeQuant is a company that assesses websites. They use three different models that compute different scores for a given website. Two of these models are based on eye-tracking data for different users and one determines the visual clarity [14]. The models based on eye-tracking data have been extensively tested for accuracy [67]. Besides statistical measures, the company also provides users with a visual representation of the results.

Currently the scores are based on where the user looks on the website. This is determined using a list of characteristics based on previous eye-tracking data. The relative importance of these characteristics is computed using machine learning. The characteristics include colour contrasts, number of edges and clutter.

Colour contrast is already measured by the EyeQuant model, but it may not be the only colour-related factor that influences the way a user evaluates a website. For example, the harmony of the colours in a picture can influence how pleasant that picture is to look at [56]. It would be interesting to study what other colour-related factors, like harmony, might influence a user's opinion of a site.

Colours have previously been shown to be an important part of websites [26, 52, 31]. The right colours can alter how time spent on a website is perceived. This was shown by Kiritani and Shirai [26], who asked participants to perform a search task on the internet. Each participant was presented with only one background colour throughout the experiment, which lasted 8 minutes. Afterwards the subjects were asked to estimate how much time they had spent on the task. When the background was either yellow or red, the participants perceived less elapsed time than when the background was green, blue or white. Different aspects of colour might directly influence the behaviour of visitors to a website. For example, Pelet et al. [52] showed that the brightness of a site influenced the user's intention to buy something. However, they found that this effect may be mediated by a negative mood.

The preference for a site is correlated with how well a user can complete a given task on that site [31]. However, Lee et al. [31] identified a factor that was more strongly correlated with user preference. They let their participants grade a website on its predicted usability before they used the website. The participants had a clear preference for websites that looked more usable in advance. This shows that the first impression of a site, even before use, is important for a user. In the same study, Lee et al. [31] showed that the colours used on a website correlate with the preference for that site. However, these two variables were not studied separately, so causation could run either way: either the participants preferred the colours because they preferred the website, or the other way around.

Preference for colours and colour combinations is a very subjective measure, but attempts have been made to quantify the pleasantness of colour combinations. This is often done through colour harmony, where the assumption is that a more harmonious colour combination is also more pleasant. However, this link is not as clear-cut as some assume, as many contradictory results have been found [56]. Schloss et al. [56] explain this by distinguishing between three factors.

They show that the preference for a combination is mostly explained by the lightness contrast and the preference for the individual colours. However, a link is shown between the preference for a colour combination and harmony: both increase when the hues of the two colours in the combination are more similar. It seems that harmony does play a role in explaining the preference for a combination of colours.

Colour harmony is hard to define. Burchett [7] performed an extensive content analysis on 12 books on colour to determine what colour harmony is. Of the 12 books, 10 talked about colour harmony in some way. Burchett determined a set of attributes that determine the harmonization of colours. These are, in order of importance as determined from the books: order, tone, configuration, area, interaction, association, similarity and attitude.

Recently, attempts have been made to use the link between harmony and aesthetic preference to improve the colour use in images. Cohen-Or et al. [9] described a method of automatically harmonizing a picture. They use a set of colour templates, developed by Matsuda [36], which denote harmonic templates. Of the eight templates seven are actually used, since one (the N type template) was developed specifically for grey-scale images. Before the harmonization process is started, they first determine the appropriate template for the image. Next they determine which colours fall outside of this template, and then adjust those colours so that they lie within it.

Baveye et al. [4] tried to improve upon this method in three ways. Most relevantly, they included a measure of the saliency of the pictures. This solved part of the semantic problems that previously needed to be fixed by hand. One such problem was losing the colours of important parts of a picture. An example is a picture of a red ladybug on a green plant: using the Cohen-Or [9] method, the colour of the ladybug would be altered to green in the harmonization process. This is semantically wrong, since green ladybugs do not exist. The ladybug is also the most salient part of the picture, and thus determines the look of the picture to a large extent. The Baveye [4] method prevented the recolouring of the ladybug and kept its colour intact.

A further improvement was made by Chamaret et al. [8]. They tried to mimic human perception in the model by including contrast masking properties. In addition to the harmony scoring map, they produced two additional maps. The contrast masking map provides information about the visibility of a colour as a result of edges and gradients. The entropy masking map shows how certain the masking effect is based on the texture complexity. When combining the disharmony map with these latter two, a score can be calculated to show how disharmonious the picture is.

A visual representation can also be given that shows what parts of an image are more or less disharmonious.

An entirely different approach was taken by Solli et al. [59], who based their approach on a study by Ou et al. [47]. They calculated harmony scores based on two-colour comparisons within a picture. They identified all colours in a picture that make up at least a certain percentage of that picture. For each combination of these colours the harmony score was calculated. The lowest of these scores determined the harmony score of the picture. They tested this method with a psychophysical experiment and found that their scores correlated quite well with the scores given by the participants.
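The picture-level scoring of Solli et al. can be sketched in a few lines. Note that the pairwise scoring function below is a hypothetical stand-in based on hue similarity, not the actual Ou et al. two-colour harmony model:

```python
from itertools import combinations

def picture_harmony(colour_fractions, pair_score, min_fraction=0.05):
    """Score a picture by its least harmonious pair of dominant colours.

    colour_fractions: mapping colour -> fraction of the picture it covers.
    pair_score: two-colour harmony function (a stand-in, not Ou et al.).
    """
    # Keep only colours covering at least min_fraction of the picture.
    dominant = [c for c, frac in colour_fractions.items() if frac >= min_fraction]
    # The lowest pairwise score determines the picture's harmony score.
    return min(pair_score(a, b) for a, b in combinations(dominant, 2))

# Hypothetical pairwise score: colours as hue angles; similar hues score higher.
def hue_similarity(h1, h2):
    d = abs(h1 - h2) % 360
    return 1.0 - min(d, 360 - d) / 180.0

# The 1% colour falls below the threshold and is ignored.
score = picture_harmony({0: 0.50, 20: 0.30, 180: 0.15, 350: 0.01}, hue_similarity)
```

With the complementary pair (0, 180) present among the dominant colours, the minimum, and hence the picture score, is driven by that single worst pair.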

Ou et al. [46] followed up their own study by exploring how their two-colour model would extend to three-colour combinations. They hypothesized that for three-colour combinations the harmony can be determined by simply averaging the harmony values of the constituent two-colour pairs. They propose a different model for three-colour combinations where the colours are not all adjacent. They show that their new models indeed agree well with empirical studies.

2.1 Neural correlates of harmony

In previously mentioned studies, the view on harmony was tested in human subjects by asking them for their opinion about colour combinations. The assumption in many studies (e.g. [47, 61]) is that this preference or dislike for certain colour combinations is determined by universal factors.

However, other studies show that it is not solely determined by universal factors. In a review, O'Connor [44] identified other factors that influence the harmony score an individual might award a combination of colours. These include individual differences regarding age, gender, personality and affective state. They furthermore included cultural experience, the context in which the colours are seen and intervening perceptual effects. Lastly, they mention the effect of time, for example the change of style trends over the years.

Although many factors influence the perception of colour harmony, there is a universal neural correlate in the brain that seems to be related to the perception of harmony.

Ikeda et al. [20] performed a functional magnetic resonance imaging (fMRI) experiment related to harmony. They presented different two-colour combinations to their participants in a checkerboard format and asked them to rate the harmony of the stimuli on a 9-point scale. They divided the scores on colour combinations into three groups based on the scores obtained during the experiment:

disharmonious (D), neutral (N) and harmonious (H). They showed that the presence of harmony was correlated with activation of the bilateral rostral anterior cingulate cortex (rACC) and the medial orbitofrontal cortex (mOFC). During the viewing of disharmonious combinations, the right posterior insula and the left amygdala were activated.

The viewing of an aesthetically pleasing face has also been associated with activity in the mOFC [1, 49]. O'Doherty [49] studied the brain activity during the viewing of attractive and unattractive faces using fMRI. The attractiveness of the faces had been determined in a pilot study, but the new participants also gave an attractiveness score to determine whether the faces were indeed perceived as attractive or unattractive. Besides activity in the mOFC, the researchers further showed that the viewing of aesthetically pleasing faces was correlated with activity in the medial prefrontal cortex (mPFC) and the posterior cingulate cortex (PCC). They also concluded that the mOFC is activated even when the participant is not actively trying to judge the attractiveness of a face.

These findings correspond with other studies that reviewed the neural correlates of aesthetically pleasing experiences. These are not just limited to visual experiences [28]; auditory stimuli have shown similar activations. A study by Blood et al. [5] focused on the brain activity correlated with the perception of pleasant and unpleasant music. They exposed participants to a consonant (harmonious) and a dissonant (disharmonious) version of a music stimulus while studying brain activity with a positron emission tomography (PET) scan. They showed that the activation of several parts of the orbitofrontal cortex (OFC) was negatively related to the dissonance level, meaning more activation was correlated with more harmonious music. This agrees with the findings for visual stimuli [20].

Pleasant somatosensory experiences have also been shown to correlate with activity in certain regions of the orbitofrontal cortex (OFC). Rolls et al. [53] performed an fMRI study into the neural correlates of pleasant and unpleasant somatosensory experiences. They applied either a pleasant, neutral or unpleasant stimulus to the hands of participants that were lying in an fMRI scanner.

During the pleasant condition they recorded increased activation in the OFC and the anterior cingulate cortex (ACC). They furthermore saw increased activation in the posterior part of the insula, as well as in some somatosensory-specific regions, in the unpleasant condition. The unpleasant stimulus also correlated with significant activation in the OFC. This is not in line with the studies on the other senses, where only pleasant experiences caused an increase in activation in the OFC.

Small et al. [58] performed a meta-analysis of previously published and at the time unpublished results regarding the brain activity during gustatory tasks. Although they do not specifically distinguish between pleasant and unpleasant stimuli, both were present in the data (e.g. salt, citric acid and chocolate). The combined results for these stimuli showed activation both in the OFC and in the insula, as well as some additional areas of the brain.

Zatorre et al. [69] performed a PET scan while their participants inhaled during the presentation of a cotton wand. The wand was either odourless (control condition) or soaked in one of eight odourants. The conditions included smells varying in pleasantness (e.g. cherry and kerosene).

They found unilateral activation of the right OFC. The activation was again not tested against pleasantness, so it is unclear if the activity is specific to pleasant experiences.

Given that pleasantness leads to the same activation across the senses, one could wonder what the use of this activation is. One possible answer to this question is that it allows us to quickly recognize pleasant food (which evolutionarily means nutritious food). This hypothesis is supported by the findings of Kringelbach et al. [29] that subjective pleasantness of liquid food is correlated with the activation in the OFC. The subjective pleasantness is not just determined by the taste, but by a combination of senses, including touch and smell, which together determine whether something is pleasant.

Further insight into the function and purpose of the orbitofrontal cortex is given by Elliott et al. [13]. The OFC is activated in situations where decisions have to be made using insufficient information, where reward of an action is not known and has to be predicted. This seems in line with the earlier findings regarding food. Based on several factors the OFC determines the likely reward of eating the food. The more pleasant the food looks, feels, tastes or smells, the more likely it will give a high reward, so the higher the activation in the OFC.


3 Purpose

The aim of this thesis is to explore the relationship between colour use in websites and their quality.

Different aspects are studied, guided by a set of questions:

1. How is the harmony of a colour palette related to the perceived quality of a website to which that palette is applied?

2. How does one extract a colour palette from a website that is representative of the colours that influence a viewer's perception of the website?

3. What aspects of the colour use in websites are related to how the quality of that website is perceived, and how are they related?

In each of the questions, a link is made to the background of cognitive science. First, the link between the harmony of a palette and the quality of a website is explored. This is thought to be mediated by the experience of pleasantness, as linked to the OFC. Finding such a link would thus suggest a part of the brain to be investigated if further cognitive studies are performed on the topic. The second question tries to find an extraction method that more closely resembles how a viewer actually perceives the website. If this extraction method is verified, this might give some insight into how a brain perceives colours and colour combinations in a complex scene. Lastly, a model is sought that explains which aspects of colours are important for the quality of the website.

The factors that are included in the model might also be the factors that are important to the brain when determining the quality. In a further study, they could be linked to specific regions of the brain and their functions.

Each of these questions will be tackled in one of the following chapters. First, a background is given in Chapter 4 containing some of the concepts and methods used in the thesis. Chapter 5 studies the relation between the harmony of a colour palette and the quality of a website using an online survey. In Chapter 6 a method of extracting the colour palette from a website is proposed.

In Chapter 7 a novel model is proposed to predict the quality score of a website from the colours in a screenshot. Finally, an overall conclusion is drawn in Chapter 8.


4 Background

This chapter will present the general concepts used in this thesis. The concepts are explained and some of the more important equations are presented. The chapter is divided into three parts: machine learning methods, saliency and colour measures.

4.1 Machine learning methods

Machine learning is a tool used to find patterns in data. Many different techniques are available for this purpose. This section will highlight those used in this thesis, which include linear and logistic regression, the lasso model, cross-validation and hierarchical clustering.

4.1.1 Linear Regression

Linear regression is a method used to model the relation between one dependent variable and one or more independent variables (also called features). This relationship is expressed in the form of weights for each independent variable, resulting in a model of the form seen in Equation 1. Models created through linear regression can be used not only to study the relationship between variables, but also to predict the dependent variable from the independent variables.

Y = β0 + β1·x1 + β2·x2 + ... + βn·xn    (1)

4.1.2 Logistic Regression

Logistic regression is similar to linear regression, but in logistic regression the dependent variable is not a continuous variable, but rather categorical. It therefore models a classification problem.

Logistic regression for binary variables was proposed as early as 1958 by David Cox [11]. Logistic regression still results in a model that looks like Equation 1, but its output is passed through a logistic (or sigmoid) function so that Y is always in the range (0, 1) and can be interpreted as a probability. To predict a class, this probability is then thresholded, for example at 0.5.
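As a minimal sketch of both methods (with invented data; scikit-learn is used here, as elsewhere in this thesis):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Linear regression: a continuous target built from known weights (cf. Equation 1).
y_cont = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1]
lin = LinearRegression().fit(X, y_cont)
# With noiseless data, lin.intercept_ and lin.coef_ recover 1.5, [2.0, -0.5].

# Logistic regression: a binary target; the model outputs probabilities in (0, 1).
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)
proba = log.predict_proba(X)[:, 1]  # always strictly between 0 and 1
```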

4.1.3 Lasso model and LOO

Lasso is an acronym for least absolute shrinkage and selection operator. It is a method that was first introduced by Robert Tibshirani [64]. The goal of lasso is to improve the interpretability of the model it creates, compared to other methods like simple linear regression. Such a model consists of weights for each of the features in a dataset, which say something about the relationship between that feature and the dependent variable. Lasso achieves this goal by setting the least impactful weights to 0, thus eliminating the features associated with them from the model. This is done by forcing the sum of the absolute values of the coefficients to be less than a certain value.
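A small sketch of this shrinkage behaviour, with invented data in which only two of five features carry signal (in scikit-learn's Lasso, a larger alpha corresponds to a tighter constraint on the coefficients):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Only the first two features actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Tightening the constraint (larger alpha) zeroes out more of the weights.
n_nonzero = {alpha: int(np.sum(Lasso(alpha=alpha).fit(X, y).coef_ != 0))
             for alpha in (0.01, 0.5, 1000.0)}
```

At the extreme alpha every coefficient is driven to exactly zero, while a small alpha keeps most features in the model; the informative features survive intermediate values.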

An alternative method of feature selection is testing each possible feature separately on its correlation with the dependent variable. Scikit-learn [51] contains the function f_regression within the feature_selection package, which does exactly that. For each feature in the original dataset it first orthogonalizes that feature and the dependent variable. Next it calculates the cross correlation between the two vectors, and lastly it converts this correlation to an F score and subsequently a p-value. These scores indicate the quality of the feature.
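A minimal sketch of this univariate scoring, again on invented data:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # only feature 0 matters

# One F score and one p-value per feature, each tested in isolation.
F, p = f_regression(X, y)
best = int(np.argmax(F))  # index of the highest-scoring feature
```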

The Leave-One-Out method (LOO) used in this thesis utilizes the above function to perform feature selection. It first computes the F scores for each of the features in the dataset. Then it sorts the features according to their F score. Next it tries increasingly large combinations of features, starting with just the best two and ending with all features. For each combination it performs regression using leave-one-out cross-validation. The accuracy of each combination is recorded and the best combination is selected.
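The procedure described above can be sketched roughly as follows; the data and model choices are illustrative, not the thesis's actual pipeline:

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.3, size=40)

# 1. Rank all features by their F score.
F, _ = f_regression(X, y)
order = np.argsort(F)[::-1]

# 2. Try the top-k features for k = 2 .. n, scoring each subset with LOO CV.
scores = {}
for k in range(2, X.shape[1] + 1):
    cv_scores = cross_val_score(LinearRegression(), X[:, order[:k]], y,
                                cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    scores[k] = cv_scores.mean()

# 3. Keep the best-scoring subset size.
best_k = max(scores, key=scores.get)
```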

Another method used for feature selection is Recursive Feature Elimination (RFE). RFE is based on feature ranking, which is the practice of determining the importance of each feature in the model. This is similar to what f_regression does. Ranking is often done by testing the accuracy of the model when a feature is not included [17]. When this is done for each feature, a ranking can be made. However, one cannot just take this ranking and select the top features, since the ranking was made by excluding one feature at a time: excluding multiple features at a time may have unexpected consequences. That is why Guyon et al. [17] introduced RFE. This is a recursive method, meaning it performs two steps in a loop-wise fashion. Each iteration starts by ranking all features according to their importance, usually utilizing the method described above. Then the lowest-ranked feature is excluded from the dataset and the new dataset is used for the next iteration. This is done until all features have been ranked. This ensures that the best combination of features is selected, not just the best individual features.
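A short sketch using scikit-learn's RFE implementation, on invented data where only features 0 and 3 carry signal:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.2, size=100)

# Drop the lowest-ranked feature one at a time until two remain.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
kept = np.where(selector.support_)[0]  # indices of the surviving features
```

Here the importance of each feature is taken from the linear model's coefficient magnitudes; the two informative features survive the recursive elimination.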

4.1.4 Cross-Validation

Cross-validation (CV) is a technique used to train or validate models, usually when the data available to train and test the model is limited. Within CV the data is split up into smaller subsets, also called folds. On each iteration one fold is used as a test set while the others form the training set. A model can then be trained on the training set and its accuracy calculated on the test set. This is repeated until each subset has been a test set once. The advantage of using CV is that all data is eventually used for both training and testing, which gives a more reliable estimate of how well the model generalizes and helps to detect overfitting.

Many different types of CV exist; Arlot et al. [3] identify at least 10 different types. They split them up into two categories, exhaustive data splitting and partial data splitting. The latter includes the general k-fold CV (they call this V-fold CV), which means that the data is divided into k folds and each fold is used as a test set once. A popular type is ten-fold stratified CV, which has been shown to be the best method for model selection [27]. Ten-fold stratified CV means that the data is divided into ten subsets, where in stratified CV each subset contains about the same average value for the dependent variable. Another type of CV is leave-one-out CV, where the number of subsets is equal to the number of data points in the dataset and each subset has a size of one. This type is usually computationally much heavier than ten-fold stratified CV, but it does not always perform better [27]. Yet another approach to CV is taking a random subset on each iteration, meaning that some samples can appear in multiple subsets while others appear in none. This allows for the use of more subsets than k-fold CV. However, it means that the results of the CV can vary and may depend on which data points are included in the subsets.
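A manual k-fold loop makes the procedure concrete (invented data; 5 folds for brevity rather than the ten-fold stratified variant discussed above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# 5-fold CV: every sample appears in the test set exactly once.
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 per held-out fold
```

The mean of fold_scores estimates how the model would perform on unseen data.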

4.1.5 Hierarchical clustering

Clustering is a technique used to group data together based on their features. Different types of clustering exist, each with their own advantages and disadvantages. K-means is a popular clustering method that takes an initial set of centroids (cluster centers) and iteratively improves the fit of these centroids by moving them around. The method is described in detail by Hartigan [18]. Although this is a quite straightforward method, it has several disadvantages. For one, the number and position of the initial clusters have to be specified, and both these factors have a big influence on the accuracy of the clustering. The method can also converge to local minima.

An alternative method of clustering is hierarchical clustering. Within this method two data points or groups of data points are linked on each iteration, until all points are part of one group [38]. This results in a dendrogram of the data points. The number of clusters then depends on where the dendrogram is pruned; any number of clusters can be selected after the initial clustering.

Many variations of hierarchical clustering exist, where the main difference is the way the distance between the groups of data points is determined. Murtagh et al. [39] show how to calculate the distances for seven different methods. Single linkage uses the distance between the closest points of two clusters. Complete linkage uses the distance between the furthest points of two clusters. In the group average method, the average distance between all points in the two clusters determines the distance. The weighted method uses the average distance from the two parts that formed the cluster to the other points or clusters. All of these linkage methods can be used with any type of distance measurement.

All of these linkage methods are implemented in the scikit-learn package [10] for Python, as well as several others. However, these other methods can only be used with Euclidean distance.

They include the UPGMC (Unweighted Pair Group Method using Centroids) algorithm, the WPGMC (Weighted Pair Group Method using Centroids) algorithm and Ward's method. The UPGMC algorithm (also called centroid linkage) defines the distance between two clusters as the distance between their centroids. For each new link, a new centroid is calculated. The WPGMC algorithm is a weighted version of this, where again the average distance from the two clusters that are about to join to the remaining points or clusters is calculated. Ward's method uses minimization of the variance to determine the distance of the clustering.

The formulas to calculate the distances for all of the methods mentioned above can be found in Equation 2. These are the implementations used in the scikit-learn package [10] and will be used when calculating the distances.

d_min(u, v) = min(dist(u[i], v[j])), for every i in u and j in v

d_max(u, v) = max(dist(u[i], v[j])), for every i in u and j in v

d_ave(u, v) = Σ_{i,j} dist(u[i], v[j]) / (|u| · |v|)

d_weighted(u, v) = (dist(s, v) + dist(t, v)) / 2,
where s and t are the two previous clusters that made up u

d_UPGMC(u, v) = ||c_u − c_v||,
where c_u and c_v are the centroids of clusters u and v

d_WPGMC(u, v) = (d_UPGMC(s, v) + d_UPGMC(t, v)) / 2

d_Ward(u, v) = √( (|v| + |s|)/T · d(v, s)² + (|v| + |t|)/T · d(v, t)² − |v|/T · d(s, t)² ),
where T = |v| + |s| + |t|

(2)
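These linkage rules can be tried directly. The sketch below uses SciPy's scipy.cluster.hierarchy, which implements the same linkage formulas, on two invented, well-separated groups of points:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(6)
# Two tight, well-separated blobs of 2-D points.
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])

# Build the full dendrogram once with group-average linkage...
Z = linkage(X, method="average")
# ...then prune it afterwards at any desired number of clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Because the dendrogram is built once, changing the number of clusters only requires re-pruning, not re-clustering, which is the property exploited later in this thesis.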

4.1.6 Cophenetic Correlation Coefficient

The performance of a clustering can be measured in either of two ways. When information is available about the true clustering of the data, for example the class of each data point in a dataset, external validation can be used. Generally four different measures are used: precision, recall, accuracy and F measure. These all make use of the false positives and false negatives found in the results of the clustering [34]. This type of validation is not possible when the true clustering is not available. In that case internal validation can be used, which uses the distances between points and between clusters to determine the quality of a clustering [55].

One type of internal validation is the cophenetic correlation coefficient. This coefficient represents how well the data fit the dendrogram that the hierarchical clustering produced.

It is often used in phenetic studies, the study of classifying organisms, but can be applied in any type of research that involves hierarchical clustering [55]. Another internal validation method is the within-cluster sum of squares (WCSS). This method calculates the error of each data point within its cluster, expressed as the square of the distance from that data point to the cluster centroid. The sum of these errors is a measure of the quality of the clustering.
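Both internal validation measures can be computed in a few lines; the sketch below uses SciPy's cophenet on invented two-cluster data:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.3, size=(15, 2)),
               rng.normal(5.0, 0.3, size=(15, 2))])

Z = linkage(X, method="average")
# Cophenetic correlation: how faithfully the dendrogram preserves the
# original pairwise distances (closer to 1 is better).
c, _ = cophenet(Z, pdist(X))

# WCSS for a two-cluster pruning: sum of squared distances to each centroid.
labels = fcluster(Z, t=2, criterion="maxclust")
wcss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
           for k in np.unique(labels))
```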

4.2 Saliency

Every day we receive a vast amount of input through our senses. We do not process or remember all of these stimuli, because that would be a very inefficient use of our limited resources. In order to select which stimuli should be attended to, a saliency map is formed within the brain [66]. The saliency of a stimulus is determined by a combination of bottom-up characteristics and top-down influences [66]. This bottom-up processing already starts at the retina. Here the incoming stimulus is compressed in such a way that contrasts are stressed. This is done by the receptive fields of the retinal ganglion cells [66]. During the later processing of a stimulus more and different receptive fields are involved, often coding for a certain type of contrast, for example movement or orientation [66].

However, very early on top-down processing gets involved through attentional modulation.

Evidence has been found for the attentional modulation as early as the lateral geniculate nucleus (LGN) [43]. The LGN is the structure in the brain that receives information from the retina [6], and is the gateway from the retina to the visual cortex. Which stimuli are selected can be dependent on their location within the visual field. In some cases certain objects or features are selected specifically [37].


Bottom-up processing results in a saliency map [66]. This does not just reflect the intensity of a stimulus, but more importantly its contrast. The saliency map is combined with the top-down influences, which can enhance or lessen the saliency. This happens in a non-linear fashion: intermediate contrasts are enhanced more than very strong or very weak contrasts [35]. Such saliency maps are not just important for selecting which features should be attended and possibly remembered. Saliency also guides eye movements, especially immediately after the stimulus is presented [50]. This allows the salient stimuli to be studied better, as they then fall into the foveal area.

Many attempts have been made to model saliency (see for example [23, 21]). Saliency can also be measured with eye-tracking, but this is an expensive and time-consuming process. Some saliency models are biologically inspired, like the model proposed by Itti et al. [21]. They mimic the receptive fields of neurons to get an intensity map for each of several features of an image. The maps are combined to get a saliency map for that image. This saliency map can be used to predict the eye movements of a viewer. Another approach is to base a model on eye-tracking data. Judd et al. [23] took this approach by modeling eye-tracking data using machine learning.

The model used by EyeQuant resembles the Itti et al. [21] model. It uses a number of features to predict the overall saliency of an image. It furthermore incorporates a bias term: the average location participants tend to look at on a screen, regardless of the stimulus. These features and this bias are combined using a Lasso model [64]. The model is trained with real eye-tracking data. This allows the model to be trained for a specific set of images, which in the case of the EyeQuant model is e-commerce websites.
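As a rough illustration of this combination step (not EyeQuant's actual code; the feature maps and fixation targets below are synthetic stand-ins), a Lasso model can learn sparse weights for a set of per-pixel feature maps:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data: each row holds the per-pixel values of several
# feature maps (contrast, colour, orientation, ...) plus a constant
# centre-bias map; the target is an empirical fixation density.
rng = np.random.default_rng(0)
feature_maps = rng.random((1000, 5))          # 1000 pixels, 5 feature maps
fixation_density = feature_maps @ np.array([0.5, 0.3, 0.0, 0.1, 0.2])

model = Lasso(alpha=0.01)                     # L1 penalty drives weak features to 0
model.fit(feature_maps, fixation_density)
saliency = model.predict(feature_maps)        # predicted saliency per pixel
```

The L1 penalty is the point of using Lasso here: feature maps that do not help predict fixations receive a coefficient of exactly zero, keeping the combined model sparse.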

4.3 Colour measures

This section will explain some of the background of colour theory. Different colour spaces will be described, as well as ways to calculate the distance between colours in the L*a*b* colour space. Furthermore, the relation between the L*a*b* colour space and the CIE L*C*h° format is shown.

4.3.1 Colour spaces

Colours can be represented in different spaces. A well-known space is the RGB (Red-Green-Blue) space, where a colour is expressed in three colour components: red, green and blue. The RGB value of a colour is usually expressed with a value ranging from 0 to 255 for each of these components, but may also be expressed on a range from 0 to 1. The value indicates the relative presence of that channel (red, green or blue). This colour space is based on the trichromacy of the human eye [65].

Colour is detected by a certain type of photoreceptor in the human retina, the cones. Cones come in three different types in the human eye, each specialized in a certain range of wavelengths of light: short (S), medium (M) and long (L) [54]. These correspond roughly to blue, green and red respectively. The RGB space was thus developed to reflect the arrangement of the cones.

A different class of colour spaces is based on how colours are experienced, the phenomenal colour spaces [65]. Amongst these is the HSV (Hue Saturation Value) colour space, which divides a colour into its hue, saturation and value. The hue is the basic colour, for example green, red or blue. The saturation is the amount of whiteness (where a lower value means more whiteness), and the value is the lightness of the colour (where a low value is darker). This is said to be the most natural way of ordering colours, as it is the same way colours are represented in our brains [65].
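The two representations can be converted into each other; Python's standard colorsys module implements this conversion (shown here purely as an illustration, with RGB on a 0-1 range):

```python
import colorsys

# Pure red: hue 0, full saturation, full value
h, s, v = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
print(h, s, v)   # -> 0.0 1.0 1.0

# A desaturated red has a larger white component, i.e. lower saturation
h2, s2, v2 = colorsys.rgb_to_hsv(1.0, 0.5, 0.5)
```

Note that colorsys expresses hue as a fraction of a full turn (0-1) rather than in degrees.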

A problem with both of these colour spaces is their perceptual non-uniformity [65]. Two colours that have a short Euclidean distance in either of these spaces might be perceived as very different colours, while two colours with a relatively large Euclidean distance can appear to be the same. This problem was addressed when the International Commission on Illumination (CIE) published the L*a*b* formula [33]. The L*, a* and b* components represent the lightness, redness-greenness and yellowness-blueness of a colour respectively [15]. This colour space should represent the perceptual differences between colours better.

4.3.2 Chroma, hue and lightness

The CIE L*a*b* colours can also be represented in the CIE L*C*h° format, where L* is lightness, C* is chroma and h° represents hue in degrees. The chroma and hue have to be calculated from a* and b* in the CIE L*a*b* space. This is done using the formulas shown in Equation 3.

$$C^*_{ab} = \sqrt{a^{*2} + b^{*2}}, \qquad h_{ab} = \arctan\left(\frac{b^*}{a^*}\right) \tag{3}$$

For a pair of colours, the differences in chroma, hue or lightness can also be calculated. For chroma and lightness the difference is obtained by simply subtracting the two, but it is not that straightforward for hue, since hue is expressed as an angle. The difference formula for the hue can be derived from the colour difference formulas (see [15]). The formulas to calculate the differences are shown in Equation 4.

$$\Delta C^*_{ab} = C^*_1 - C^*_2, \qquad \Delta L^* = L^*_1 - L^*_2, \qquad \Delta H^*_{ab} = \left(\Delta a^{*2} + \Delta b^{*2} - \Delta C^{*2}_{ab}\right)^{\frac{1}{2}} \tag{4}$$
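Equations 3 and 4 translate directly into code. The sketch below is an illustration (not part of the thesis pipeline); atan2 is used so the hue angle lands in the correct quadrant:

```python
import math

def chroma_hue(a, b):
    """C*_ab and h_ab (degrees, 0-360) from the a* and b* components (Eq. 3)."""
    chroma = math.hypot(a, b)
    hue = math.degrees(math.atan2(b, a)) % 360
    return chroma, hue

def lab_differences(lab1, lab2):
    """Delta L*, Delta C*_ab and Delta H*_ab for two L*a*b* colours (Eq. 4)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    C1, _ = chroma_hue(a1, b1)
    C2, _ = chroma_hue(a2, b2)
    dL = L1 - L2
    dC = C1 - C2
    # Delta H from Equation 4; clamp to avoid tiny negative rounding errors
    dH = math.sqrt(max((a1 - a2) ** 2 + (b1 - b2) ** 2 - dC ** 2, 0.0))
    return dL, dC, dH
```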

4.3.3 CIEDE2000

Establishing the difference between colours is a very complex problem. Several formulas have been proposed to calculate the distance between colours in the L*a*b* colour space. The first method, $\Delta E^*_{76}$ (see Equation 5), was proposed in 1976 and simply used the Euclidean distance between the colours [15].

$$\Delta E^*_{ab} = \left((L_2 - L_1)^2 + (a_2 - a_1)^2 + (b_2 - b_1)^2\right)^{\frac{1}{2}} \tag{5}$$

However, this method did not work as well as initially thought, as the colours were not distributed uniformly in the L*a*b* colour space. To address this problem, a new colour difference method was defined, the CIE94 [15] (see Equation 6). This formula uses the lightness, chroma and hue calculated from the L*a*b* colours.

$$\Delta E^*_{94} = \left[\left(\frac{\Delta L^*}{K_L S_L}\right)^2 + \left(\frac{\Delta C^*_{ab}}{K_C S_C}\right)^2 + \left(\frac{\Delta H^*_{ab}}{K_H S_H}\right)^2\right]^{\frac{1}{2}} \tag{6}$$

Several variations and improvements were made on this formula, most using the form in Equation 7. This equation was also used as a starting point for defining the new CIEDE2000.
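Both formulas can be implemented directly. The sketch below is an illustration; the CIE94 weighting functions used here (K_L = K_C = K_H = 1, S_L = 1, S_C = 1 + 0.045 C*_1, S_H = 1 + 0.015 C*_1, the common graphic-arts parameters) are an assumption, as the text above leaves them unspecified:

```python
import math

def delta_e_76(lab1, lab2):
    """Plain Euclidean distance in L*a*b* (Equation 5)."""
    return math.dist(lab1, lab2)

def delta_e_94(lab1, lab2, kL=1.0, kC=1.0, kH=1.0):
    """CIE94 (Equation 6), assuming graphic-arts weighting functions."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    C1 = math.hypot(a1, b1)
    C2 = math.hypot(a2, b2)
    dL, dC = L1 - L2, C1 - C2
    # Delta H^2 via Equation 4; clamp against rounding errors
    dH2 = max((a1 - a2) ** 2 + (b1 - b2) ** 2 - dC ** 2, 0.0)
    sL, sC, sH = 1.0, 1.0 + 0.045 * C1, 1.0 + 0.015 * C1
    return math.sqrt((dL / (kL * sL)) ** 2
                     + (dC / (kC * sC)) ** 2
                     + dH2 / (kH * sH) ** 2)
```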

$$\Delta E = \left[\left(\frac{\Delta L}{K_L S_L}\right)^2 + \left(\frac{\Delta C^*_{ab}}{K_C S_C}\right)^2 + \left(\frac{\Delta H^*_{ab}}{K_H S_H}\right)^2 + \Delta R\right]^{\frac{1}{2}}, \quad \text{where } \Delta R = R_T\, f(\Delta C \Delta H) \tag{7}$$

This new colour distance formula was specifically aimed at improving the performance for the blue region, which was done by tweaking $R_T$. Several different datasets were used to derive the optimal function for $R_T$. Further improvements were made by slightly altering the chroma, hue and lightness formulas. This resulted in the formula in Equation 8 (for an in-depth explanation and implementation notes, see [33, 57]).

$$\Delta E_{00} = \left[\left(\frac{\Delta L'}{K_L S_L}\right)^2 + \left(\frac{\Delta C'}{K_C S_C}\right)^2 + \left(\frac{\Delta H'}{K_H S_H}\right)^2 + R_T \left(\frac{\Delta C'}{K_C S_C}\right)\left(\frac{\Delta H'}{K_H S_H}\right)\right]^{\frac{1}{2}} \tag{8}$$

where

$$R_T = -\sin(2\Delta\theta)\, R_C, \qquad \Delta\theta = 30 \exp\left(-\left(\frac{\bar{h}' - 275°}{25}\right)^2\right), \qquad R_C = 2\left(\frac{\bar{C}'^7}{\bar{C}'^7 + 25^7}\right)^{\frac{1}{2}}$$
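The rotation term that handles the blue region can be computed on its own. The sketch below (an illustration, with Δθ in degrees as in Equation 8) shows how R_T vanishes for hues far from the problematic region around a mean hue of 275°:

```python
import math

def rotation_term(mean_chroma, mean_hue_deg):
    """R_T of CIEDE2000 (Equation 8): strongest correction near hue 275 deg."""
    d_theta = 30.0 * math.exp(-(((mean_hue_deg - 275.0) / 25.0) ** 2))
    r_c = 2.0 * math.sqrt(mean_chroma ** 7 / (mean_chroma ** 7 + 25.0 ** 7))
    return -math.sin(math.radians(2.0 * d_theta)) * r_c

# Far from blue the correction vanishes; near 275 deg it is substantial.
print(rotation_term(50.0, 90.0))    # close to 0
print(rotation_term(50.0, 275.0))   # clearly negative
```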


This last formula has been tested using four different databases and it has been approved by the International Commission on Illumination (CIE) [33].


5 Harmony score

The pleasantness of colours is related to the harmony between those colours [56]. Several methods have been proposed to measure the harmony of colour combinations. However, many of these methods have not been verified, or were only developed and tested in laboratory settings. Limited research has been done on real-world applications of colour harmony. This chapter presents the results of the first experiment of this thesis, which concerns the relationship between the harmony of the colours used in a website and the quality of that website. Furthermore, the relation between the harmony scores of palettes given by participants and those calculated using a formula from the literature is studied.

5.1 Methods

The experiment consisted of a survey distributed among participants and the subsequent data analysis. In this section the methods of recruiting participants, designing the survey and analyzing the data are described. At least two sets of harmony scores were compared: those generated by the survey and those generated by the algorithm described in Section 5.1.5. The correlation between these two was determined. Furthermore, the correlation between the harmony scores and the pleasantness scores obtained from the survey was determined.

5.1.1 Design of the survey

Previous studies on harmony have included psychophysical experiments where participants were asked to give colour harmony or pleasantness scores to combinations of colours or to images as a whole [47, 46, 59, 16, 61, 56, 48]. These studies used a variety of scales. The most common seems to be the 10-point scale as used in the studies by Ou et al. [47, 46, 48] and Szabó et al. [61]. Their participants performed the experiment on a computer screen where they were given the option to give either a harmony or a disharmony judgement, each of which had a five-step scale (1-5). Fedorovskaya et al. [16] instead asked their participants to rate visual harmony on a scale from 0-100. In all of these studies the conditions were controlled strictly: the brightness of the screen, the brightness in the room, the distance from the screen and the viewing angle were all controlled and specified.

Since the survey in the current study was performed online, the conditions could not be controlled in this way. Variation that arose due to the use of different screens and different brightnesses in the room possibly influenced the judgement of the participants. As such, a 10-point scale would have been too fine-grained. We did not expect our participants to be able to distinguish between moderately harmonious and slightly harmonious, as Ou et al. [46] do.

Solli et al. [59] used a different approach. Since their intended application is online, where they would not be able to control the conditions, they also chose to conduct their experiment online [59, p. 1887]. As the intended application of the current study is also online, the same reasoning is used to support the decision to do the survey online. Solli et al. [59] performed two surveys, a pilot study and a main study. The pilot study used a 5-point scale ranging from harmonious to disharmonious. Even with this coarser scale, it might still have been hard for a participant to choose between a little harmonious and very harmonious.

Nemcsics [40, 41, 42] wrote a series of papers on the laws of colour harmony. In these papers he used a dataset that was collected over 50 years of experiments. In several of these experiments, he asked participants to perform pair-wise comparisons [42]. These decisions are easier for the participant and thus allow them to judge more palettes in a shorter amount of time. Therefore, the same method seemed most appropriate for the current survey.

The participants were shown two pictures of palettes and were asked to choose the more harmonious palette. Each participant judged about 100 palettes, which took them approximately 5 minutes (based on a previous visual clarity experiment with the same design). The participants were also asked to judge screenshots of generic websites. In this way, the practical use of colour harmony scores for web design can be tested. The screenshots included a variety of generic websites, each with a different palette applied to it. The participants were asked specifically to judge which website is better, not which one is more harmonious. This allows for a comparison between the harmony scores from the palettes and the perceived quality of the websites. The scoring was again done in a pair-wise comparison.


The participants were not given a definition of what 'best' is. The 'best' website for a participant could be the one with the highest visual appeal. This depends on several different factors that have nothing to do with colour, like how boring the site is, the overall design, the layout and how imaginative the site is [32]. However, the most visually appealing site might not be the best, because other factors outside of the design come into play. Because defining a 'good quality website' is so complex and often subjective, it seemed best to let the participants decide what they defined as good.

The survey was distributed among 300 participants. Each completed a total of 200 comparisons (100 palette comparisons and 100 website comparisons). Each participant only saw each palette once, each stimulus was seen by an equal number of participants, and each image was seen by all participants.

5.1.2 Material

The 100 palettes used in the survey were randomly selected from the palettes available on COLOURlovers [30], using the Python API. The order in which the palettes were presented was randomized. All palettes consisted of a unique combination of five non-identical colours. The palettes were sorted on luminance beforehand, where the least luminant colour was presented on the left and the most luminant colour on the right. This was done in order to make the stimuli more homogeneous and to minimize the influence of the ordering of colours on the score.

The COLOURlovers database [30] is an online public database that is used by colour enthusiasts. Users can add their own colours and palettes of up to five colours. The colours and palettes can then be rated by other users by means of giving hearts. Further indications of the quality of a palette are given by the number of views and the number of comments. However, these indications are prone to biases. For one, the top palettes get displayed at the top, thus yielding more views and hearts. Users might also be inclined to rate the palettes they like, while not rating average or bad palettes (instead of rating them with a low score) [45].

The websites that were presented in the survey were made using a simple Python program, as this was believed to be the quickest way to generate 100 websites. First, a generic website containing some basic elements was made by hand using HyperText Markup Language (HTML) and Cascading Style Sheets (CSS). The website included elements such as a header with subheader, a menu bar, text with headers and a sidebar. All the text was made up of the filler text 'lorem ipsum', so as to not let the content influence the participants. For each screenshot the CSS file was altered so that the colours of the palette in question were used. A generic website was used in this case to prevent the participants from basing their assessment on other aspects of the websites. Factors that might otherwise influence the quality score include the images on the site, the placement of different elements, the font and size of the text, and familiarity with the website.

5.1.3 Participants

287 participants were recruited through CrowdFlower, an online data enrichment service. These participants were each paid $0.40 for the completion of the task. Since this is an online service in which participants can easily cheat, the data was checked for scammers. A total of 15 scammers were identified. They were identified by looking at the balance between left and right presses, the intra-subject agreement and the reaction times. Since every participant saw each pairing twice (once left-right, once right-left), one would expect genuine subjects to have about the same number of left presses as right presses. Furthermore, one would expect the participants to agree with their own previous choice a majority of the time. Lastly, one would expect a reaction time within the human range; reacting too fast would mean the participant did not look at the images properly. Scammers were those with a left-right balance of less than 25% or more than 75%, with an intra-subject agreement of less than 40%, or with more than 20% of their reaction times outside of a range of 500 to 60000 ms.
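The three exclusion criteria can be expressed as a simple filter. The sketch below is an illustration with hypothetical argument names, using the thresholds stated above:

```python
def is_scammer(left_fraction, self_agreement, reaction_times_ms):
    """Flag a participant using the three criteria described above."""
    # 1. Left-right balance outside 25-75%
    if left_fraction < 0.25 or left_fraction > 0.75:
        return True
    # 2. Intra-subject agreement below 40%
    if self_agreement < 0.40:
        return True
    # 3. More than 20% of reaction times outside 500-60000 ms
    out_of_range = sum(1 for rt in reaction_times_ms
                       if not 500 <= rt <= 60000)
    return out_of_range / len(reaction_times_ms) > 0.20

print(is_scammer(0.5, 0.9, [800, 1200, 900]))   # False: genuine participant
print(is_scammer(0.9, 0.9, [800, 1200, 900]))   # True: presses too one-sided
```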

The cleaned sample consisted of 272 real participants: 106 men and 166 women, aged 18 to 76 (m=38.8, sd=12.5). The task was only released to participants in English-speaking countries, as the interpretation of specific terms (i.e. 'harmony' and 'best website') was important. They had a range of different educational backgrounds. Of the participants, 29 worked with colour in their daily life (e.g. as a painter or interior decorator). For a full summary see Appendix A.

Figure 1: Screenshot of the screen seen by the participants while doing the experiment.

5.1.4 Procedure

The participants were recruited through CrowdFlower, where they read a short introduction text about the experiment. Using the link presented, they were directed to the experiment page. First they were given some additional information about the experiment and their rights as a participant.

Then they were asked to fill out a short questionnaire. At the end of the questionnaire they were asked to turn off any programs running on their computer that might influence their colour perception (e.g. f.lux).

After the questionnaire they started the experiment. The experiment was divided into two parts: one part with palette stimuli and one part with website stimuli. Each part started with a short instruction and a training round of 10 stimuli. After the training round they got a new message saying that the real experiment would begin.

During the experiment they were presented with two stimuli side by side (see Figure 1). Above the stimuli the question corresponding to that part of the experiment was shown (either "Which palette is more harmonious?" or "Which page looks better?"). The subject could choose one of the stimuli by pressing the corresponding arrow key on the keyboard: left for the left stimulus, right for the right stimulus. Below the stimuli a picture of the left and right arrow keys was shown, indicating which button should be pressed to choose the corresponding stimulus. At the bottom of the screen information was given about the progress. The stimuli were shown for two seconds, after which they disappeared and were replaced by a grey box with the text "Please Choose Now".

After the subject had chosen one of the stimuli, they were given feedback about their choice via a black box around the chosen stimulus and a blue box around the pressed key.

5.1.5 Harmony scores

Besides obtaining harmony scores from the participants, harmony scores were also computed using the equation Ou et al. described [47] (see Equation 9). This equation was designed to give a score to the harmony between two colours. However, the palettes used consisted of five colours. In order to get a score for the five-colour combinations, first the harmony scores for each of the two-colour combinations were calculated with Equation 9. This resulted in a total of 10 scores. To obtain one final score from these results, two different methods can be used: either the average of all harmony scores, as suggested by Ou et al. [46], or the minimum, as suggested by Solli et al. [59]. Both methods were tested and compared.
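The aggregation over a five-colour palette can be sketched as follows (an illustration; the `pair_harmony` argument is a stand-in for the two-colour formula of Equation 9):

```python
from itertools import combinations

def palette_harmony(palette, pair_harmony, method="min"):
    """Aggregate pairwise harmony scores over all colour pairs in a palette."""
    scores = [pair_harmony(c1, c2) for c1, c2 in combinations(palette, 2)]
    return min(scores) if method == "min" else sum(scores) / len(scores)

# Five colours yield C(5, 2) = 10 pairwise scores; a toy pairwise function
# is used here just to show the two aggregation methods.
toy_min = palette_harmony(range(5), lambda a, b: a + b, method="min")
toy_mean = palette_harmony(range(5), lambda a, b: a + b, method="mean")
```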

$$
\begin{aligned}
CH &= H_C + H_L + H_H, \text{ where}\\
H_C &= 0.04 + 0.53\tanh(0.8 - 0.045\,\Delta C)\\
\Delta C &= \left((\Delta H^*_{ab})^2 + \left(\frac{\Delta C^*_{ab}}{1.46}\right)^2\right)^{\frac{1}{2}}\\
H_L &= H_{Lsum} + H_{\Delta L}\\
H_{Lsum} &= 0.28 + 0.54\tanh(-3.88 + 0.029\,L_{sum}), \text{ where } L_{sum} = L^*_1 + L^*_2\\
H_{\Delta L} &= 0.14 + 0.15\tanh(-2 + 0.2\,\Delta L), \text{ where } \Delta L = |L^*_1 - L^*_2|\\
H_H &= H_{SY1} + H_{SY2}\\
H_{SY} &= E_C(H_S + E_Y)\\
E_C &= 0.5 + 0.5\tanh(-2 + 0.5\,C^*_{ab})\\
H_S &= -0.08 - 0.14\sin(h_{ab} + 50°) - 0.07\sin(2h_{ab} - 90°)\\
E_Y &= \frac{0.22L^* - 12.8}{10}\exp\left(\frac{90° - h_{ab}}{10} - \exp\left(\frac{90° - h_{ab}}{10}\right)\right)
\end{aligned}
\tag{9}
$$
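A direct transcription of Equation 9 is sketched below (hue angles in degrees; an illustration, not the authors' reference implementation):

```python
import math

def pair_harmony(lab1, lab2):
    """Two-colour harmony CH of Ou et al. (Equation 9)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    C1, C2 = math.hypot(a1, b1), math.hypot(a2, b2)

    # Chromatic effect H_C
    dC = C1 - C2
    dH2 = max((a1 - a2) ** 2 + (b1 - b2) ** 2 - dC ** 2, 0.0)
    delta_c = math.sqrt(dH2 + (dC / 1.46) ** 2)
    h_c = 0.04 + 0.53 * math.tanh(0.8 - 0.045 * delta_c)

    # Lightness effect H_L
    h_lsum = 0.28 + 0.54 * math.tanh(-3.88 + 0.029 * (L1 + L2))
    h_dl = 0.14 + 0.15 * math.tanh(-2 + 0.2 * abs(L1 - L2))
    h_l = h_lsum + h_dl

    # Hue effect H_H = H_SY1 + H_SY2
    def h_sy(L, a, b):
        C = math.hypot(a, b)
        h = math.degrees(math.atan2(b, a)) % 360
        e_c = 0.5 + 0.5 * math.tanh(-2 + 0.5 * C)
        h_s = (-0.08 - 0.14 * math.sin(math.radians(h + 50))
               - 0.07 * math.sin(math.radians(2 * h - 90)))
        e_y = ((0.22 * L - 12.8) / 10
               * math.exp((90 - h) / 10 - math.exp((90 - h) / 10)))
        return e_c * (h_s + e_y)

    return h_c + h_l + h_sy(L1, a1, b1) + h_sy(L2, a2, b2)
```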

5.1.6 Ethical Considerations

All participants were volunteers and they could quit the experiment at any time without providing an explanation. All data will remain anonymous.

5.2 Results

The results of the survey are presented in this section. These results will be compared to the calculated harmony scores.

5.2.1 Inter-Observer Agreement

The harmony of a palette or quality of a website is very subjective and observers may disagree. To measure the extent to which participants agreed about a certain score, the inter-observer agreement score was used. The inter-observer agreement can be measured using the Root Mean Square, $RMS = \sqrt{\sum_i (x_i - \hat{x}_i)^2 / T}$, where $x_i$ is the score given by one participant to a stimulus, $\hat{x}_i$ is the mean score for that stimulus and $T$ is the number of samples [47]. A lower RMS means a higher agreement. Since the experiment involved side-by-side comparisons and not a scale, no individual scores for each participant are available. In order to still test the inter-observer agreement, the data was randomly divided into 20 subsets, each subset including about 13 subjects. For each subset the scores for the palettes were calculated. These scores were then used to calculate the RMS of the results. In order to compare the RMS with the literature, it was normalized: $NRMS = RMS / (x_{max} - x_{min})$. The inter-observer agreement for the palettes was 0.09 and for the websites 0.15. These values are close to, or lower than, values reported earlier [59, 47]. Additionally the RMS for the participant group that worked with colours was calculated. The inter-observer agreement score was lower for this group, which indicates a higher agreement: 0.06 for the palettes and 0.07 for the websites.
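The agreement computation can be sketched as follows (an illustration, assuming a matrix of subset scores with one row per subset and one column per stimulus):

```python
import math

def rms_agreement(subset_scores):
    """RMS of subset scores around the per-stimulus mean; lower = more agreement."""
    n_subsets = len(subset_scores)
    n_stimuli = len(subset_scores[0])
    # per-stimulus mean across subsets
    means = [sum(row[j] for row in subset_scores) / n_subsets
             for j in range(n_stimuli)]
    total = sum((row[j] - means[j]) ** 2
                for row in subset_scores for j in range(n_stimuli))
    return math.sqrt(total / (n_subsets * n_stimuli))

def normalised_rms(rms, scores_flat):
    """NRMS: divide by the range of the scores to compare with the literature."""
    return rms / (max(scores_flat) - min(scores_flat))
```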

5.2.2 Working with colours

Of the 272 participants, 29 indicated in the survey that they worked with colour in their daily life. In order to check whether this had an influence on the scores they had given, a paired t-test was performed. No significant difference was found for either the scores given on palettes (p=1) or the scores given on websites (p=1) between the subjects that worked with colour and the subjects that did not. Therefore, further analysis was performed on the entire dataset and no distinction was made between the two groups.


Table 1: Correlation coefficients for the three different comparisons

Comparison          Method    r       r (20 lightest removed)
Palette-Website               0.05     0.33
Algorithm-Palette   Average   0.26    -0.10
                    Minimum   0.48    -0.14
Algorithm-Website   Average  -0.03     0.22
                    Minimum  -0.06     0.18

5.2.3 Scores

For each trial, the choice of the participant and the stimuli that were presented were recorded. From these results the scores for each of the palettes and for each of the websites were calculated. This was done by applying logistic regression to the two-alternative forced choice data. For each trial of each participant, the stimulus presented on the left was coded as -1 and the stimulus presented on the right as 1; all other stimuli were represented with a 0 in the dataset. The target for the regression was the choice made by the participant in each trial. Using the scikit-learn toolkit in Python [51], logistic regression was applied, resulting in a list of coefficients for each of the palettes or websites, and bias coefficients for each of the participants.

Besides the palette and websites scores, harmony scores were also calculated using the algorithm by Ou et al. [47]. Two different methods were used to obtain a score: the harmony scores for each colour combination within the palette were either averaged or the minimum was used.

The correlations between the palette scores and the websites scores, the algorithm scores and the palette scores, and the algorithm scores and the websites scores were calculated (see Table 1).

The scores produced by taking the minimum of all harmony scores of a palette have a much greater correlation with the palette scores given by the participants than the scores produced by taking the average. The correlation coefficient of 0.48 is very close to the coefficient found by Solli et al. [59]. Both algorithms are very poor predictors of the website quality. In Figure 2a some clear outliers can be seen, indicating palettes with good scores producing websites with bad scores. All of these palettes had in common that they were among the 15 lightest palettes of the dataset. After removing the 20 lightest palettes from the dataset to prevent this influence, the correlation between the algorithm and the website score, and between the palette score and the website score, increased for both algorithm methods (see Table 1 and Figure 2b).

5.3 Discussion

Three different scores were compared during this experiment. Participants were asked to judge palettes on the harmony of the colours and websites on quality. Furthermore, harmony scores were obtained by using the algorithm proposed by Ou et al. [47]. A correlation of 0.48 was found between the harmony scores given by the participants and those calculated using the algorithm. This value agrees quite well with the literature, as it is not far from the correlation coefficient of 0.49 found by Solli et al. [59].

When comparing the palette scores and the website scores, a number of outliers were found. These were all part of the 15 lightest palettes in the dataset (see for example Figure 3). They were removed, which resulted in a higher correlation between the algorithm and the website scores. Interestingly, without the lightest palettes the algorithms were much less accurate in predicting the palette scores. This indicates that the accuracy of the algorithm relies quite heavily on the lightness effect.

Light websites are thus judged as worse than darker websites, even when the palettes have a similar harmony. This phenomenon might be explained by the fact that lighter websites are harder to read and cause a strain on the eyes. The palette was furthermore applied to a sample website with a white background, which could have influenced the harmony of the website as a whole and thus the perceived quality.

(a) Correlation between palette scores and website scores for the entire dataset.

(b) Correlation between palette scores and website scores for the 80 darkest palettes in the dataset.

Figure 2: Scatter plot with the palette scores and the website scores before (Figure 2a) and after (Figure 2b) the 20 lightest palettes were removed from the dataset.

(a) One of the light palette stimuli used in the experiment

(b) One of the light website stimuli used in the experiment

Figure 3: A decent palette resulting in a poor quality website.

Ou et al. [47] proposed the equation that is used to calculate the harmony score. However, this equation was specifically proposed for two-colour combinations. Two methods of applying the equation to combinations of more colours were compared: either using the average of the scores for all colour combinations in a palette, or using the minimum. It was found that using the minimum resulted in the highest correlation with the user data. This would imply that the assumption that one bad combination ruins a combination of colours, an assumption Solli et al. [59] make in their study, is correct. This would also mean that web designers do not need an elaborate model for a combination of multiple colours; studying the separate two-colour combinations would be enough. However, more evidence for this assumption is still needed.

Overall it seems that only a weak link exists between the harmony of a palette and the quality of a website. This could be due to the many other factors that determine the quality of a website. These factors could also be unrelated to colour. Readability is one example: the text on a site should have a colour that contrasts with the background. The distribution of the colours in the palette might also influence the quality, as might the individual colours used. Examples of factors that are not related to colours are the use of images, the familiarity of a website and the estimated ease of use.

Although the raw colour palettes might be important, many other factors determine the quality of a website. The palette should be combined with other features if one aims to accurately predict the quality of a website. Further research into these other factors is needed before a comprehensive model can be built. A first step in that direction is taken in the next chapters.


6 Colour Extraction

Websites contain a certain combination of colours, the colour palette. Extracting this palette from a website is not trivial. Websites are often a combination of static and dynamic parts. The dynamic parts are those parts that may change in content, for example temporary pictures or articles. The static parts stay the same over a longer period of time. Wu et al. [68] used this distinction in the palette extraction method they proposed: they located the fixed parts and determined the colour theme of the website based only on those parts. The same approach is not used in this thesis, for several reasons. The pictures in a website may have a great influence on the overall colour theme of the website, so it does not seem logical to exclude the dynamic parts. Furthermore, no reliable method has so far been found to segment a website in this way.

This chapter will describe a novel method of extracting a palette from a website. The novel method aims to approach the way the brain would extract colours. To accomplish this, saliency is taken into account while extracting the colours. Several different factors in the extraction are tested and the results are presented here.

6.1 Methods

In this section the procedure of extracting the palette from a screenshot is explained. Furthermore the data used to test the proposed novel extraction method is described.

6.1.1 Data

The data used to test and verify the extraction was a set of 200 screenshots of websites, previously randomly selected from the top 1500 pages on Alexa in August 2015 [2]. The websites were filtered to remove those with the extensions .ru, .jp, .cn, and .co. They were further checked to remove any broken, pornographic or non-western websites.

6.1.2 EyeQuant Model

The proposed approach uses the saliency (see Subsection 4.2) of a website. This was in part inspired by the improvement made by Baveye et al. [4] on the Cohen-Or et al. [9] method of automatic harmonization. The goal was to predict how participants judge the quality of a website at first glance. Assuming they would mostly focus on the most salient parts of the website, these parts would have a greater influence on their judgement. The saliency score was calculated using the EyeQuant model, which was described in Section 4.2. A higher score indicates that subjects are more likely to look at this part of the website during the first few seconds of presentation.

As the EyeQuant model is based on eye movements, it made sense to use it in the context of colours. Colour can only be perceived with the cones, which are most abundant around the fovea and almost absent in the periphery. The colours that get noticed by users of a site are thus most likely located within the parts they focus their gaze on, which is exactly what the EyeQuant model predicts.

6.1.3 Extraction Procedure

The extraction was performed on screenshots of websites. The screenshots were just of the website 'above the fold', meaning what the user sees on their screen without scrolling. This would be the view of the website a user sees in the first few seconds of landing on a page, which is the timescale the EyeQuant model was built on as well. This was also the timescale of interest in the current study, as the quality judgement of a website is less influenced by external factors when it is made quickly.

For each pixel in the image the RGB value was determined and these values were saved. Next the saliency was computed using the EyeQuant model, which has been tested extensively [67]. This model gave a saliency score to each pixel on a scale of 0-1. The score was scaled to a wider range using a saliency scaling factor. Each pixel of the original image was added to a new dataset a number of times equal to its scaled saliency score, which is why the wider range was necessary. Using this type of scaling allowed more salient pixels to have more influence on the clustering than less salient pixels.

In the next step, the most occurring colours were taken from this dataset. Even with a small screenshot the dataset could grow to over one million entries after weighting, which would cause the hierarchical clustering algorithm to take a long time. Using only the most occurring colours reduced this problem. For each of the N top colours the number of times it occurs in the dataset (the counter) was taken, and this value was scaled so that the total sum of values equals N. This was done for each colour using Equation 10. The optimal value of N was determined in one of the experiments.

$$C_1 = C_0 \cdot \frac{N}{\sum C_0} \tag{10}$$

where $C_1$ is the scaled counter, $C_0$ the unscaled counter, and $N$ the number of top colours included.
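Equation 10 can be written in code using a Counter over the saliency-weighted pixel dataset (a sketch; the names are illustrative):

```python
from collections import Counter

def top_colour_weights(pixels, n_top):
    """Keep the n_top most frequent colours, rescaling counts to sum to n_top."""
    counts = Counter(pixels).most_common(n_top)
    total = sum(count for _, count in counts)
    return {colour: count * n_top / total for colour, count in counts}

# Two colours in a 3:1 ratio; the rescaled weights sum to n_top = 2
weights = top_colour_weights(
    [(255, 0, 0)] * 3 + [(0, 0, 255)] * 1, n_top=2)
```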

Next the data was transformed from the RGB to the L*a*b* colour space. The L*a*b* colour space allowed for calculating distances between colours based on human perception. These dis- tances were needed to create clusters. On these L*a*b* colours hierarchical clustering was per- formed, using the clustering function from the SciPy library [22]. The clustering was performed using the ’average’ method, because this was shown to work best for this data (see Section 6.2).

The distance between the colours was calculated using the CIEDE2000 colour-difference formula [33, 57], as available within the python-colormath package [62].
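SciPy’s `linkage` accepts a precomputed condensed distance matrix, which is how a non-Euclidean metric such as CIEDE2000 can be plugged in. A sketch with a placeholder distance function (in the thesis this would be `delta_e_cie2000` from python-colormath):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def cluster_colours(lab_colours, dist):
    """Hierarchically cluster colours with an arbitrary pairwise distance.
    `dist` stands in for e.g. colormath's delta_e_cie2000."""
    n = len(lab_colours)
    condensed = np.array([dist(lab_colours[i], lab_colours[j])
                          for i in range(n) for j in range(i + 1, n)])
    return linkage(condensed, method='average')

# Placeholder metric for demonstration only (NOT CIEDE2000).
euclid = lambda a, b: float(np.linalg.norm(np.subtract(a, b)))
Z = cluster_colours([(50, 0, 0), (52, 1, 0), (80, -30, 40)], euclid)
```

The first merge joins the two nearly identical dark colours; the third, clearly different colour only joins at a much larger distance.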

To determine the number of clusters, the acceleration of the merge distance was computed, by taking the second derivative of the distance over the last 10 merges of the clustering. The merge with the highest acceleration was then identified, which represents the optimal number of clusters in the image. This method is a variant of the elbow method, since the goal is to identify the ’elbow’ in the distance plot [25].
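The acceleration criterion can be sketched as follows (a common recipe for SciPy linkage matrices; not the thesis’s exact code):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def optimal_cluster_count(Z, last_merges=10):
    """Elbow variant: pick the cluster count whose merge distance shows the
    highest acceleration (second derivative) over the last merges."""
    dist = Z[-last_merges:, 2][::-1]   # merge distances, largest first
    accel = np.diff(dist, 2)           # second derivative
    return int(accel.argmax()) + 2     # index 0 corresponds to 2 clusters

# Three tight, well-separated point groups should yield 3 clusters.
points = [(x, y) for cx, cy in [(0, 0), (10, 10), (20, 0)]
          for x, y in [(cx, cy), (cx + 1, cy), (cx, cy + 1),
                       (cx + 1, cy + 1), (cx + 0.5, cy + 0.5)]]
Z = linkage(points, method='average')
```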

This method has a preference for a smaller number of clusters (see Figure 4a). However, looking through the dataset of websites led to the conclusion that a two-colour combination is rarely an accurate representation of the colours in a website. Therefore a preference for a larger number of clusters was desired, and to force this, the acceleration for two clusters was set to 0. Since the websites were randomly selected, a more or less normal distribution of the number of colours used in the dataset would be expected. However, even after setting the acceleration for two clusters to 0, the algorithm still preferred a small number of clusters. Therefore the acceleration for three clusters was scaled down as well.

Lastly, the actual clusters were determined by cutting off the dendrogram at the point that resulted in the number of clusters previously determined to be optimal. Each colour in the dataset was assigned to a cluster. The palette of the website was then determined by taking the average colour of each cluster.
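Cutting the dendrogram and averaging each cluster could look like this (a sketch using Euclidean distances for brevity; the thesis averages the colours of each cluster in the same way):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def palette(colours, k):
    """Cut an average-linkage dendrogram into k clusters and return the mean
    colour of each cluster as the palette."""
    colours = np.asarray(colours, dtype=float)
    Z = linkage(colours, method='average')
    labels = fcluster(Z, t=k, criterion='maxclust')
    return np.array([colours[labels == c].mean(axis=0) for c in range(1, k + 1)])

# Two reddish and two bluish colours collapse to a two-colour palette.
pal = palette([(250, 0, 0), (254, 4, 0), (0, 0, 250), (0, 4, 254)], k=2)
```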

6.1.4 Experiment Procedure

Several parameters in the above procedure were tested to find the optimal settings. The first was the scaling factor for three clusters: the acceleration obtained for three clusters was scaled down in order to force a larger number of clusters. For each scaling factor the relative frequency of three-cluster palettes was obtained, and the change in frequency between factors was recorded. Using the same logic as for the elbow method, a sudden change in frequency would indicate a distinction between screenshots that truly have three-colour palettes and screenshots that do not; that point would thus be the optimal scaling factor.
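A hypothetical harness for this experiment (all names are illustrative; `extract_cluster_count` stands in for the extraction pipeline run with a given three-cluster scaling factor):

```python
def three_cluster_frequencies(screenshots, factors, extract_cluster_count):
    """For each candidate scaling factor, record the relative frequency of
    three-cluster palettes and the change from the previous factor."""
    freqs, changes, prev = {}, {}, None
    for f in factors:
        counts = [extract_cluster_count(s, scale=f) for s in screenshots]
        freqs[f] = sum(1 for k in counts if k == 3) / len(counts)
        changes[f] = None if prev is None else freqs[f] - prev
        prev = freqs[f]
    return freqs, changes
```

The factor at which `changes` jumps suddenly would, following the elbow-style reasoning above, be taken as the optimal scaling factor.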

The next parameter tested was the method used for the clustering. For a dataset of the size used in the current clustering problem, the ’centroid’ measure (also known as the UPGMC algorithm) is recommended [55]. However, this method cannot be applied to the current dataset, because the clustering was performed using CIEDE2000, which is a non-Euclidean distance. This also ruled out the WPGMC algorithm and Ward’s method. The
